A Practical Guide to Interpreting and Visualising Support Vector Machines

SVMs are often considered ‘black boxes’. In this article we cover techniques to visualise learned SVM models and their performance on real-world data.

About the author

Hugo Dolan is an undergraduate Financial Mathematics student at University College Dublin. This article is mostly based on and motivated by recent data analytics and machine learning experiences in the NFL Punt Analytics Kaggle Competition and being part of the team which won the Citadel Dublin Data Open, along with material from Stanford’s CS229 online course.

This article contains the following sections:

  1. Introduction to Linear Models, SVMs and Kernels
  2. Interpreting high-dimensional engineered feature spaces which use SVM kernels
  3. Techniques for evaluating the performance of high dimensional classification boundaries
  4. Practical options for working with large class imbalances
  5. How much data you need to train an SVM

Some assumptions I’ll make:

This article will assume familiarity with basic ML models such as logistic and linear regression. It will also assume you know how to draw some of the graphs I discuss (I have a guide for this if you’re stuck!). We will also assume that you know what a decision function and an objective / cost function are, and that you have some basic linear algebra knowledge. If not, it’s still worth a read; you can always bookmark this and come back later for a deeper pass on some of the more mathematical parts of this article.

Introduction to Linear Models, SVMs and Kernels

In machine learning, a linear classifier is any model with a single hypothesis function that maps model inputs to predicted outputs through a weighted linear combination of the input features.

The primary advantage of linear models over neural networks (a non-linear model class) is that the feature weights directly correspond to the importance of each feature within the model. Thus it is easy to understand what the model has ‘learned’.

At the core of any linear model is a dot product between the input example and the parameter / weight vector. In the case of linear regression this is the entire hypothesis function, whereas logistic regression feeds the dot product through a sigmoid function so that the output is between 0 and 1, making it suitable for binary classification problems.
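
As a minimal sketch of this idea (using NumPy, with made-up weights and a single example purely for illustration), linear regression outputs the raw dot product while logistic regression squashes it through a sigmoid:

    import numpy as np

    theta = np.array([0.5, -1.2, 2.0])   # example weight / parameter vector
    x = np.array([1.0, 3.0, 0.5])        # a single input example

    # Linear regression: the hypothesis is just the dot product
    linear_prediction = np.dot(theta, x)

    # Logistic regression: feed the dot product through a sigmoid
    # so the output lands between 0 and 1
    logistic_prediction = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

    print(linear_prediction, logistic_prediction)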

When considering classification problems, the downfall of linear models is that ultimately the decision boundary is a straight line, plane or hyperplane, with coefficients equal to the model’s weights / parameters, and thus they can only classify data which is linearly separable. This can be a big limitation when working on more complex analytics problems.

The Support Vector Machine (SVM) is a linear model which can nevertheless classify data that is not linearly separable in its original feature space.

You might be asking how the SVM, a linear model, can fit a linear classifier to non-linear data. Intuitively, with a simple linear regression model we may manually engineer x, x², x³, … features to attempt to achieve a fit to a non-linear set of data points.

Transferring this intuition to our SVM: when we engineer the x² feature we are essentially multiplying feature x by itself. So suppose we engineer features from our dataset by multiplying combinations of features x1, x2, x3, … together; then theoretically we *could* end up with a space in which our engineered features are linearly separable. Taking the previous simple example, the data is transformed to an almost linear trend in the x³ feature space.

Unfortunately, achieving this with a complex dataset requires creating more than just a 3-dimensional space (features x, x², x³); in fact it requires extremely high-dimensional feature spaces which would be computationally very expensive to calculate for every example in our dataset. As an example, a function ø(x) could take our original features x and combine them to create many 2nd-order polynomial features.
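
As a rough sketch of what such a ø(x) might look like (this particular construction is just for illustration), here is a function that expands a feature vector into all of its pairwise 2nd-order products; note how quickly the dimensionality grows:

    import numpy as np

    def phi(x):
        """Map x = (x1, ..., xn) to all 2nd-order products xi * xj (i <= j)."""
        n = len(x)
        return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

    x = np.array([2.0, 3.0, 5.0])
    print(phi(x))   # 6 engineered features from 3 original ones
    # For n original features this produces n * (n + 1) / 2 features,
    # which becomes very expensive to compute explicitly as n grows.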

Before we proceed: I will use the notation x to denote data points / training examples, with superscripts to denote a particular data point and subscripts to denote a particular feature.

Luckily for us there is a way out of this computational complexity conundrum! When we derive the optimisation problem for an SVM (the complex-looking formula which tells us how to update our weights during coordinate ascent), it turns out that the feature vectors of our training examples appear in only a single place within the entire optimisation formula: inside a dot product between pairs of training examples.
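
For reference, this is the standard soft-margin SVM dual (as derived in, for example, CS229); notice that the training examples only ever appear inside the final dot product:

    \max_{\alpha}\; W(\alpha) = \sum_{i=1}^{m} \alpha_i
        - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \,\langle x^{(i)}, x^{(j)} \rangle

    \text{subject to}\quad 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0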

This dot product was for our original feature space, so now let’s replace it with our engineered feature space using our function ø, i.e. ⟨ø(x^(i)), ø(x^(j))⟩.

So how does this help reduce the computational complexity? Well, by definition of the dot product, we take the k-th entry of ø(x^(i)) and multiply it by the k-th entry of ø(x^(j)), and then sum all of these up to obtain a single scalar. Applying this to our 2nd-order polynomial map ø, the double sum factorises into ⟨x^(i), x^(j)⟩², the square of the ordinary dot product in the original feature space.

This rearrangement is the kernel trick: we’ve totally removed ø(x) and only have to perform computations using our original input features, yet we still get the effect of computing in a high-dimensional space.
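
As a quick numerical sanity check (a minimal sketch, using a 2-D input and the scaled 2nd-order map so that the identity holds exactly), the degree-2 polynomial kernel gives the same value as the explicit dot product in the engineered space:

    import numpy as np

    def phi(v):
        """2nd-order polynomial feature map for a 2-D input (with sqrt(2) scaling on the cross term)."""
        v1, v2 = v
        return np.array([v1 * v1, v2 * v2, np.sqrt(2) * v1 * v2])

    def poly_kernel(a, b):
        """Degree-2 polynomial kernel: computed entirely in the original feature space."""
        return np.dot(a, b) ** 2

    x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])

    print(np.dot(phi(x), phi(z)))   # explicit high-dimensional dot product
    print(poly_kernel(x, z))        # same number, computed without ever forming phi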

All we have to do now is substitute the dot product involving ø(x) in the optimisation problem with its kernel equivalent, K(x^(i), x^(j)).

Similarly, when we want to use our model to make predictions, we never explicitly calculate the weights for our high-dimensional space; instead we use the kernel trick again, with the decision function becoming f(x) = Σᵢ αᵢ y^(i) K(x^(i), x) + b, a sum over the training examples (in practice only the support vectors have non-zero αᵢ).

In summary, we can use the kernel trick to transform a non-linear dataset into a dataset which is linearly separable, just in a higher-dimensional space. Sklearn comes prepackaged with a number of kernels in the SVC implementation, including the Radial Basis Function (RBF) kernel and polynomial kernels; each has its own hyper-parameters which can be tuned experimentally using cross-validation to achieve the best results.
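
As a hedged sketch of how this looks in practice with sklearn (the toy dataset and the parameter grid below are purely illustrative, not recommendations), the kernel and its hyper-parameters can be selected by cross-validation:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.datasets import make_classification

    # Toy dataset purely for illustration
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Try RBF and polynomial kernels over a small grid of hyper-parameters
    param_grid = {
        "kernel": ["rbf", "poly"],
        "C": [0.1, 1, 10],
        "gamma": ["scale", 0.01, 0.1],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)

    print(search.best_params_, search.best_score_)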

A slight hitch: interpreting a high-dimensional engineered feature space…

So remember how we said the great benefit of a linear model is that the weights / parameters of the model can be interpreted as the importance of the features? Well, that’s gone once we engineer a high (or infinite) dimensional feature set; the weights of the model implicitly correspond to the high-dimensional space, which isn’t useful in aiding our understanding.

Instead, what we can do is fit a logistic regression model which estimates the probability of the label y being 1 as a function of the SVM decision function f(x): P(y = 1 | f(x)) = 1 / (1 + exp(A·f(x) + B)).

We use maximum likelihood estimation to fit the parameters A and B of this logistic regression model; the technique is called Platt scaling. The original paper [3] is definitely worth reading if you’re curious about the inner workings.

So how does this help us understand how the SVM works? Well, we simply fit the model and select a point in our dataset to evaluate it at, then perturb one feature at a time through a range of values whilst keeping the other features fixed. We can use this to draw a graph of the sensitivity of the model to each feature.

SKlearn comes with this feature built into the SVC model; you just have to make sure that probability=True when initialising, and then use the clf.predict_proba(X) function to obtain the probabilities.

In practice I found that, rather than just evaluating around a single point, it is often better to sample a collection of relevant points (e.g. 40 negative examples) and average the probability distribution by feature, to get something more representative.
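
A minimal sketch of this sensitivity analysis (the feature index, the value grid and the choice of 40 baseline points are all just illustrative, and X and y are assumed to already exist as NumPy arrays):

    import numpy as np
    from sklearn.svm import SVC

    clf = SVC(kernel="rbf", probability=True)   # probability=True enables Platt scaling
    clf.fit(X, y)

    baseline = X[y == 0][:40]          # e.g. a sample of negative examples
    feature_idx = 3                    # the feature we want to perturb (illustrative)
    grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 50)

    sensitivity = []
    for value in grid:
        perturbed = baseline.copy()
        perturbed[:, feature_idx] = value            # sweep one feature, hold the rest fixed
        probs = clf.predict_proba(perturbed)[:, 1]   # P(y = 1 | x) via Platt scaling
        sensitivity.append(probs.mean())             # average over the baseline points

    # 'grid' vs 'sensitivity' can now be plotted to see how the model reacts to this feature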

Here’s an example from when I was working on the NFL Punt Analytics Kaggle Competition, examining the effects of various factors on concussions.

I took all the negative examples and averaged the probabilities across them, which makes the ranges of each feature where players were most likely to receive a concussion stand out. A trick, if you have a bunch of one-hot encoded variables like player role, is to aggregate them into a bar chart and just look at the net change in probability between when the feature is present and when it is not, as sketched below.
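
If you want to reproduce that one-hot trick, a rough sketch (the column names and indices here are hypothetical, and clf is assumed to be an SVC fitted with probability=True) looks something like this:

    import numpy as np

    # Columns of X listed here are assumed to be 0/1 indicators (e.g. one column per player role)
    one_hot_cols = {"role_gunner": 7, "role_returner": 8}   # hypothetical name -> column index

    net_changes = {}
    for name, col in one_hot_cols.items():
        present, absent = X.copy(), X.copy()
        present[:, col] = 1            # force the indicator on
        absent[:, col] = 0             # force the indicator off
        delta = clf.predict_proba(present)[:, 1] - clf.predict_proba(absent)[:, 1]
        net_changes[name] = delta.mean()   # net change in predicted probability

    print(net_changes)   # these values can be plotted as a bar chart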

There’s also a great example of this applied to marketing data [1]. I’d also like to thank Georgi, who took the time to answer some of my questions about the paper.

Techniques for evaluating performance

When you’re dealing with a high-dimensional model involving an SVM, it would be nice to be able to visualise how the model is classifying data points without relying purely on metrics like the F1 score or ROC AUC.

Whilst some may use techniques such as principal component analysis to visualise the classification, in doing so we lose dimensions of our feature space, which distorts the picture we are trying to achieve.

A nice technique I found is called the ‘histogram of projections’ [2]; it involves graphing the distribution of the output of the SVM decision function for your training and test sets.

The decision function is easy to obtain in SKlearn’s SVC implementation: simply call decision_function(X). You will want to keep track of your dataset’s labels so that you can colour-code your histogram of projections, as in the sketch below.
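
A minimal sketch of how such a histogram could be drawn with matplotlib (assuming a fitted SVC called clf and existing NumPy train/test splits X_train, y_train, X_test, y_test):

    import matplotlib.pyplot as plt

    # Signed distance of each example from the decision boundary
    train_scores = clf.decision_function(X_train)
    test_scores = clf.decision_function(X_test)

    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
    for ax, scores, labels, title in [
        (axes[0], train_scores, y_train, "Training set"),
        (axes[1], test_scores, y_test, "Test set"),
    ]:
        ax.hist(scores[labels == 0], bins=30, alpha=0.5, label="class 0")
        ax.hist(scores[labels == 1], bins=30, alpha=0.5, label="class 1")
        ax.axvline(0, linestyle="--", color="k")    # the decision boundary
        ax.axvline(-1, linestyle=":", color="k")    # the margins
        ax.axvline(1, linestyle=":", color="k")
        ax.set_title(title)
        ax.legend()
    plt.show()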

A histogram of projections is fairly simple to interpret. The x-axis identifies the distance of a specific example from the decision boundary of the SVM (the boundary itself sits at zero, typically drawn as a central dashed line).

An SVM has a margin of separation equal to 1 on either side of the decision boundary; this is an enforced constraint of the dual optimisation problem (the ‘support vectors’ are the data points which lie along these margins). In such a plot you will often notice some leakage into the margin areas, and indeed crossover from one class into the region on the opposite side of the decision boundary. This is because we’ve used a soft margin via the regularisation hyper-parameter C, which allows a trade-off between some misclassification and minimising the SVM objective function.

Despite working with high-dimensional feature spaces, this chart successfully visualises the decision boundary region and every classification without loss of dimensionality. All the quantities in a confusion matrix (i.e. the numbers of true positives, false positives, true negatives and false negatives) can also be read off the histogram. It also enables us to observe whether a model is generalising well to the test set: if the test set has a similar distribution of decision function outputs to the training set, then we might say that the model generalises well. The chart can also be used to judge whether, given the selected hyper-parameters, the dataset is linearly separable.

Practical Options For Dealing With Imbalanced Data

When a dataset has a disproportionate number of examples for one class relative to another we say it is imbalanced.

This is a problem if we want to build an ML model to predict occurrences of the minority class, as we can achieve high levels of accuracy by simply misclassifying all minority examples as the majority class.

This occurs commonly in real world data, whether it is identifying malignant tissue, credit card fraud or concussions in a sport, due to the relative rarity of the incidents we wish to correctly identify.

There are two commonly accepted practices for rectifying ML models working with imbalanced data:

  1. Over-sampling the minority class / under-sampling the majority class
  2. Increasing the weighting for minority examples in the cost function

Option 1 : SMOTE

There are two ways in which we can resample data: either by removing existing examples (under-sampling) or by adding new examples (over-sampling). The most commonly accepted method is to over-sample the minority class using an algorithm called SMOTE (Synthetic Minority Over-sampling Technique) [5].

It’s much simpler than the name suggests: for each minority point in the dataset, it selects the k nearest other minority examples (typically k = 5) and randomly interpolates new minority examples along the lines ‘joining’ it to those existing minority examples.

This is a reasonable thing to do as we are simply making the assumption that by interpolating between similar existing examples we will obtain new examples of the same class.
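
A hedged sketch of using SMOTE via the imbalanced-learn package (assuming X_train and y_train already exist; only the training split should be resampled, the test set stays untouched):

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(k_neighbors=5, random_state=42)   # k = 5 neighbours, as is typical
    X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

    # Train the SVM on the balanced data, but evaluate on the original, untouched test set
    # clf = SVC(kernel="rbf").fit(X_resampled, y_resampled)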

This tends to improve the performance of the model significantly and helps to generalise the decision boundary for minority examples.

Option 2 : Introducing weights into the objective function

Another procedure which can be employed is to assign higher weights in the objective function for misclassification of minority examples. This will ‘incentivise’ the algorithm to correctly classify minority classes.
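
In sklearn this amounts to the class_weight argument of SVC; a small sketch (the specific manual weights below are purely illustrative):

    from sklearn.svm import SVC

    # 'balanced' re-weights classes inversely proportionally to their frequencies...
    clf_balanced = SVC(kernel="rbf", class_weight="balanced")

    # ...or the weights can be set explicitly, e.g. penalise minority (class 1) errors 10x more
    clf_manual = SVC(kernel="rbf", class_weight={0: 1, 1: 10})

    # Both are then fitted as usual, e.g. clf_balanced.fit(X_train, y_train)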

I’ve no personal experience using this method, but it can be used in conjunction with option 1. There is a good paper [4] which details many approaches to dealing with class imbalance.

How much data do I need to train an SVM

A reasonable rule of thumb is a minimum of around 10x as many training examples as features. If you have a large amount of training data, you are best off using fewer than 50,000 training examples, as the SVC implementation in sklearn scales worse than quadratically (roughly up to O(n³)) with the number of training examples; convergence can get pretty slow on even a decent laptop or Kaggle container.

It’s often worthwhile training on a smaller dataset first and tuning your model’s hyper-parameters. You can keep a small cross-validation set for model selection. You might be surprised at how well your model generalises when you test it on the remaining data, despite the small proportion of the actual data you may have used.

Note: if you’re new to this, a tip is to use sklearn’s train_test_split module and to fix the random seed so that your results are reproducible if you happen to go back to edit some earlier code and rerun the training / model selection procedure.
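
A minimal sketch of that tip (the split size and seed are arbitrary, and X and y are assumed to exist):

    from sklearn.model_selection import train_test_split

    # Fixing random_state makes the split (and hence your results) reproducible;
    # stratify keeps the class proportions the same in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )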

I hope this article has been helpful. If you’ve any comments / questions, leave them below and I’ll try to answer them as best I can.

Paper References

[1] Nalbantov, Georgi & Bioch, Jan C. & Groenen, Patrick (2005). Solving and Interpreting Binary Classification Problems in Marketing with SVMs. 566–573. 10.1007/3-540-31314-1_69.

[2] Cherkassky, Vladimir & Dhar, Sauptik (2010). Simple Method for Interpretation of High-Dimensional Nonlinear SVM Classification Models. 267–272.

[3] Platt, J. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Available at: http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1639.

[4] Batuwita, Rukshan & Palade, Vasile. (2013). Class Imbalance Learning Methods for Support Vector Machines. 10.1002/9781118646106.ch5.

[5] Chawla, N. V., et al. (2002). SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research. www.jair.org/index.php/jair/article/view/10302.
