DECISION BOUNDARY FOR CLASSIFIERS: AN INTRODUCTION

There are many debates about how to choose the best classifier. Computing performance metrics or the area under the ROC curve are a few of the approaches, but there is also a lot of useful information to be gleaned from visualizing a decision boundary, information that gives us an intuitive grasp of how learning models behave.

So, in this article, we will cover the following:

  • What is a Decision Boundary
  • Importance of Decision Boundaries
  • Types of Decision Boundaries
  • Decision Boundaries for different classifiers
  • A use case with Python code
  • Decision Boundaries for higher-dimensional data
  • Conclusion

So, let's start.

While training a classifier on a dataset with a specific classification algorithm, the algorithm defines a set of hyperplanes, called the decision boundary, that separates the data points into specific classes: it is where the algorithm switches from one class to another. On one side of a decision boundary, a data point is more likely to be labelled class A; on the other side of the boundary, it is more likely to be labelled class B.

Let’s take the example of logistic regression.

The goal of logistic regression is to figure out some way to split the data points so that we can accurately predict a given observation’s class using the information present in the features.

Let’s suppose we define a line that describes the decision boundary. Then all of the points on one side of the boundary belong to class A, and all of the points on the other side of the boundary belong to class B.

S(z) = 1 / (1 + e^(-z))

  • S(z) = Output between 0 and 1 (probability estimate)
  • z = Input to the function (z= mx + b)
  • e = Base of natural log

Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (A/B), we select a threshold value or tipping point above which we will classify values into class A and below which we classify values into class B.

p >= 0.5, class = A

p < 0.5, class = B

If our threshold is 0.5 and our prediction function returns 0.7, we classify the observation as class A. If the prediction is 0.2, we classify the observation as class B.

So, the line where the predicted probability equals 0.5 is called the decision boundary.

In order to map predicted values to probabilities, we use the sigmoid function shown above.
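As a tiny worked sketch of how the sigmoid output is turned into a class label (the values of m, x and b here are purely hypothetical):

PYTHON

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical linear score z = m*x + b for one observation
m, x, b = 0.8, 2.0, -0.4
z = m * x + b            # 1.2
p = sigmoid(z)           # ~0.77, probability estimate for class A
label = 'A' if p >= 0.5 else 'B'
print(p, label)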


A decision boundary is a surface that separates data points belonging to different class labels. Decision boundaries are not confined to the data points we have provided; they span the entire feature space we trained on. The model can predict a value for any possible combination of inputs in that feature space. If the data we train on is not ‘diverse’, the overall topology of the model will generalize poorly to new instances. So it is important to analyse which models are best suited to the dataset before putting a model into production.

Examining decision boundaries is a great way to learn how the training data we select affects performance and the ability of our model to generalize. Visualizing decision boundaries can illustrate how sensitive models are to each dataset, which is a great way to understand how specific algorithms work and what their limitations are for specific datasets.

Objective: To build the decision boundary for various classification algorithms and decide which is the best algorithm for the dataset.

Dataset is available here.

Dataset Description: The dataset contains users’ information, from which we will build the best model to predict whether a user will buy a car or not.

The Independent variables:

  • Age: Age of the user
  • Estimated Salary: Salary of the user.

The dependent variable: ‘Purchased’ which is 1 if user purchases the car and 0 otherwise.

Step 1: Import all the required libraries

PYTHON

# Package imports
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.linear_model
import matplotlib
import pandas as pd

Step 2: Import the dataset

PYTHON

from google.colab import files
uploaded = files.upload()

import io
df2 = pd.read_csv(io.BytesIO(uploaded['Social_Network_Ads.csv']))

Step 3: Apply StandardScaler to the dataset. The variables ‘Estimated Salary’ and ‘Age’ are not on the same scale, so they should be standardized; otherwise the model will not predict well. Standard scaling also helps to speed up the calculations in many algorithms.

PYTHON

X = df2.iloc[:, :-1].values
y = df2.iloc[:, -1].values

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

Step 4: Import sklearn libraries for classifiers

PYTHON

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

Step 5: Get the dimension of the dataset.
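The article does not show code for this step; a minimal sketch would be:

PYTHON

# Dimensions of the raw dataframe and of the scaled feature matrix / target vector
print(df2.shape)          # (number of rows, number of columns)
print(X.shape, y.shape)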

Step 6: Build a Logistic Regression model and display its decision boundary. The decision boundary can be visualized by dense sampling via meshgrid; however, if the grid resolution is not fine enough, the boundary will appear inaccurate. The purpose of meshgrid is to create a rectangular grid out of an array of x values and an array of y values. We can get the complete explanation of how to plot a meshgrid from here.

With meshgrid, we build an image in which each pixel represents a grid cell in the 2D feature space, so the image defines a grid over that space. The pixels of the image are then classified using the classifier, which assigns a class label to each grid cell. The classified image is then used as a background for a scatter plot that shows the data points of each class.

Advantage: It classifies the grid points in the 2D feature space.

Disadvantage: The computational cost of making very fine decision boundary maps, as we would have to make the grid finer and finer.
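Before the full helper function below, here is a small illustration of what np.meshgrid produces (the grid values here are arbitrary):

PYTHON

import numpy as np

xs = np.array([0.0, 0.5, 1.0])                # grid positions along the x axis
ys = np.array([0.0, 1.0])                     # grid positions along the y axis
xx, yy = np.meshgrid(xs, ys)                  # both have shape (2, 3)
grid_points = np.c_[xx.ravel(), yy.ravel()]   # 6 (x, y) pairs, one per grid cell
print(grid_points)
# A classifier's predict() on grid_points gives one label per cell,
# which is reshaped back to (2, 3) and drawn with plt.contourf.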

PYTHON

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, y)

# Helper function to plot a decision boundary
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")

In logistic regression, the decision boundary is a straight line separating class A and class B. Some of the points from class A fall in the region of class B, because with a linear model it is difficult to find an exact boundary line that perfectly separates the two classes.
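Since the boundary is linear, it can also be written down explicitly from the fitted coefficients. A small sketch, reusing clf and X from the code above and the standard logistic-regression form (the boundary is where w·x + b = 0, i.e. where the predicted probability is 0.5):

PYTHON

# Extract the fitted weights and intercept of the logistic regression model
w0, w1 = clf.coef_[0]
b = clf.intercept_[0]

# Solve w0*x0 + w1*x1 + b = 0 for x1 and overlay the line on the scatter plot
x0_vals = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x1_vals = -(w0 * x0_vals + b) / w1
plt.plot(x0_vals, x1_vals, 'k--', label='w·x + b = 0')
plt.legend()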


Step 7: Build a Random Forest model and plot the decision boundary. Being a tree-based ensemble, it combines many trees, and the plot tries to capture all the relevant class regions. It is a nonlinear classifier.

PYTHON

# Display plots inline and change default figure size
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the RandomForestClassifier
clf1 = RandomForestClassifier(random_state=1, n_estimators=100)
clf1.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf1.predict(x))
plt.title("Random Forest")

The decision surfaces for the Decision Tree and Random Forest are very complex. The Decision Tree is by far the most sensitive, showing only extreme classification probabilities that are heavily influenced by single points. The Random Forest shows lower sensitivity, with isolated points having much less extreme classification probabilities. The SVM is the least sensitive, since it has a very smooth decision boundary.
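To see these differences in sensitivity more directly, the same helper can be fed class probabilities instead of hard labels. A sketch, reusing plot_decision_boundary and the fitted clf1 from above (predict_proba is part of the scikit-learn classifier API):

PYTHON

# Colour the grid by the predicted probability of class 1 instead of the hard label;
# smooth colour gradients indicate a less sensitive model, abrupt jumps a more sensitive one.
plot_decision_boundary(lambda x: clf1.predict_proba(x)[:, 1])
plt.title("Random Forest - class 1 probability")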


Step 8: Build a Support Vector Machine model and plot the decision boundary.

PYTHON

# Display plots inline and change default figure size
%matplotlib inline
from sklearn.svm import SVC
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the Support Vector Machine classifier
clf3 = SVC(gamma='auto')
clf3.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf3.predict(x))
plt.title("Support Vector Machine")

A Support Vector Machine finds a hyperplane that separates the feature space into two classes with the maximum margin. If the problem is not linearly separable in the original space, the kernel trick is used to turn it into a linearly separable one by increasing the number of dimensions. Thus a general hypersurface in a low-dimensional space is turned into a hyperplane in a space with many more dimensions.
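The choice of kernel controls what kind of surface the SVM can draw. A quick sketch comparing a linear kernel with the RBF kernel used above, reusing plot_decision_boundary and the scaled X and y:

PYTHON

# Linear kernel: the boundary is restricted to a straight line in the 2D feature space
svm_linear = SVC(kernel='linear')
svm_linear.fit(X, y)
plot_decision_boundary(lambda x: svm_linear.predict(x))
plt.title("SVM - linear kernel")
plt.show()

# RBF kernel: the kernel trick allows a smooth, curved boundary
svm_rbf = SVC(kernel='rbf', gamma='auto')
svm_rbf.fit(X, y)
plot_decision_boundary(lambda x: svm_rbf.predict(x))
plt.title("SVM - RBF kernel")
plt.show()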


Step 9: Build a Decision Tree model and plot the decision boundary.

PYTHON

# Display plots inline and change default figure size
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the Decision Tree classifier
clf4 = DecisionTreeClassifier()
clf4.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf4.predict(x))
plt.title("Decision Tree")

Step 10: Build a Gaussian Naive Bayes model and plot the decision boundary.

PYTHON

# Display plots inline and change default figure size
%matplotlib inline
from sklearn.naive_bayes import GaussianNB
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)

# Train the Gaussian Naive Bayes classifier
clf5 = GaussianNB()
clf5.fit(X, y)

# Plot the decision boundary
plot_decision_boundary(lambda x: clf5.predict(x))
plt.title("GaussianNB Classifier")

Gaussian Naive Bayes has also performed well, producing a smooth curved boundary.

Decision boundaries can easily be visualized for 2D and 3D datasets. Generalizing beyond 3D poses a visualization challenge: we have to transform a boundary that lives in many dimensions down to a lower-dimensional space that can be displayed and understood, and this is difficult.

However, a decision boundary can still be plotted using t-SNE, where the dimensionality of the data is reduced in several steps. For example, if the data has 150 dimensions, it could first be reduced to 50 dimensions and then to 2.

The classes TSNE from sklearn.manifold and TruncatedSVD from sklearn.decomposition are used for this, as in the sketch below.
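A minimal sketch of the two-step reduction described above, assuming a hypothetical high-dimensional feature matrix X_high with 150 columns:

PYTHON

from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Step 1: reduce the 150 original dimensions to 50 with TruncatedSVD
X_50 = TruncatedSVD(n_components=50, random_state=1).fit_transform(X_high)

# Step 2: reduce the 50 dimensions to 2 with t-SNE for plotting
X_2d = TSNE(n_components=2, random_state=1).fit_transform(X_50)

# A classifier can then be fitted and visualized in this 2D embedding,
# e.g. with the plot_decision_boundary helper defined earlier.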

A very nice research paper, published here, describes how to plot decision boundaries for higher-dimensional data.

In this article, we learnt the role of the decision boundary in evaluating a classifier model, built several classifier models and plotted their respective decision boundaries to help select the best model, and saw that plotting a decision boundary for higher-dimensional data is a more complex task that can be tackled with t-SNE by reducing the dimensionality of the data in several steps.

Now, what is next for you…? Try to come up with a few points about the proposed approach to plotting decision boundaries for higher-dimensional data, as described in the research paper published here.

See you then in our next article… Till then Stay Tuned and Happy Learning!
