Support Vector Machine

A deep dive into the SVM model


Published in Towards Data Science · 21 min read · Jul 23, 2020

In this post, we’ll discuss the use of support vector machines (SVM) as a classification model. We will start by exploring the idea behind it, translate this idea into a mathematical problem and use quadratic programming (QP) to solve it.

Let’s start by analyzing the intuition behind the model. A linear classification algorithm’s goal is to divide the input space into regions labeled by the different classes. That way, we can predict the class of any new data point based on the region it belongs to. Different statistical models diverge in how they find this decision boundary, and in this post we are going to talk about the strategy used by the SVM.

Below we have a dataset with two classes. Here we show two decision boundaries represented by lines 1 and 2. It is easy to see that there are many other lines that can divide the input space in a way that all observations are correctly classified — in fact, there are infinitely many.

So how does the SVM choose its line? There are two main ideas. Notice how point A is very far from line 1. It seems intuitive to conclude that, based on the decision boundary defined by line 1, we are more confident in predicting that point A belongs to the circle class than to say the same about point B. The SVM aims to maximize the confidence of its predictions within the training set. Also notice that there are many points very close to line 2: this means that any tweak to the decision boundary’s parameters could result in incorrectly predicting the classes of some of these points. These two ideas combined form the intuition behind the SVM.

Note: throughout this post, you’ll see the word hyperplane being used. A hyperplane is just a fancy name for a flat subset with one dimension less than its ambient space, which is the case for the SVM decision boundary. That means that if the input space is in ℝᵖ, the decision boundary has (p − 1) dimensions. For example, in the image below, the input space is in ℝ², so the SVM decision boundary is one-dimensional: a line.

[Figure: a two-class dataset with two candidate decision boundaries (lines 1 and 2) and the points A and B discussed above]

The SVM is a linear classification model. For an output y ∈ {-1, 1}, we can write the hypothesis function as a linear combination of the inputs:

h(x) = wx + b

And we predict:

ŷ = +1 if h(x) ≥ 0,   ŷ = −1 if h(x) < 0

It seems intuitive that the further away the hypothesis value is from zero, the more confident we are in our predictions. If h(x) = 10, we are more confident in predicting y=1 than when h(x) = 0.1.

That could lead to the idea that we want to find the parameters (w, b) that will maximize the values of h when y⁽ⁱ⁾ = 1, and minimize the values of h when y⁽ⁱ⁾ = -1. In other words, we might feel inclined to maximize the functional margin γ̂:

γ̂⁽ⁱ⁾ = y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b),   γ̂ = minᵢ γ̂⁽ⁱ⁾

But there is a problem with that: notice how the predicted class depends only on the sign of h. That means that we can scale the parameters, for example (w, b) → (10w, 10b), without changing the predicted classes. This would scale the values of h by a factor of 10 and give the false idea that our model is 10 times more confident in its predictions.

This is where the concept of the geometric margin comes in. The geometric margin γ⁽ⁱ⁾ is defined as the distance of the i-th observation to the decision boundary. Unlike the functional margin, this measure is invariant to the scaling of the parameters. After all, the hyperplane defined by wx + b = 0 is exactly the same as the one defined by 10wx + 10b = 0.

The SVM looks for the hyperplane that is as far as possible from the closest member of each class. Looking at the figure below, we can see that P₁ and P₂ are the closest observations from each class. The SVM decision boundary is equidistant to P₁ and P₂, that is, d₁ = d₂. Last and most importantly, of all decision boundaries that correctly classify every observation in the training set, the SVM’s is the one with the greatest minimum distance min(d₁, d₂).

In other words, if we define γ = min γ⁽ⁱ⁾, the SVM looks for the hyperplane with the largest value of γ.

The SVM aims to maximize the minimum geometric margin.

[Figure: the maximum margin hyperplane, equidistant to the closest points P₁ and P₂ of each class (d₁ = d₂)]

Before moving on, we need to talk about one very important thing when dealing with an SVM: feature scaling. Since the SVM optimization goal is based on geometric distances, axes with very different scales make the model favor the direction with the largest values.

For example, in the image below, we can see that before scaling the features, the SVM looks for a decision boundary such that the distance vector d₁ has the largest possible vertical component. This is why we should always apply feature scaling before fitting an SVM.

Always scale the features before fitting an SVM

[Figure: the same dataset and SVM decision boundary before and after feature scaling]
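
A quick aside in code — a minimal sketch assuming scikit-learn (my own illustration, not code from the article’s repo): chaining the scaler and the SVM in a pipeline guarantees that new data is standardized with the training statistics before it reaches the model.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two features on wildly different scales.
X = np.array([[1.0, 900.0], [2.0, 1100.0], [3.0, 300.0], [4.0, 500.0]])
y = np.array([1, 1, -1, -1])

# Standardize each feature (zero mean, unit variance) before the SVM sees it.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X, y)
print(model.predict([[2.5, 1000.0]]))
```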

As we said before, the SVM is a linear model — that means it is good at finding linear relationships in the data. But as you can imagine, most real-world datasets are not linear. And while we can’t change the nature of the SVM itself, we can change the inputs that we feed to the model. Let’s see why and how this would help.

The image below is a good illustration of how changing the inputs can help the SVM model non-linear relationships. On the left, the data is clearly not linearly separable, that is, we can’t find a point that would have one class to its right, and the other class to its left.

But let’s see what happens when we add one extra feature to this dataset: namely, the distance of each observation to the origin (the point 0). That means that a point defined by (-3) in the original input space is now represented by the pair (-3, 3); similarly, a point represented by (5) becomes (5, 5). We are remapping the input space from one dimension to two dimensions. And as you can see in the image to the right, this powerful and simple trick makes the data linearly separable in ℝ² and, consequently, a better fit for the SVM.

We can circumvent the linear limitations of the model by applying non-linear transformations to the original features.

[Figure: the one-dimensional dataset (left) and the same data after adding the distance-to-origin feature (right), now linearly separable in ℝ²]
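
To make the remapping concrete, here is a tiny sketch (my own illustration, not from the article’s repo): each one-dimensional point is paired with its distance to the origin, exactly as described above.

```python
import numpy as np

x = np.array([-3.0, -1.0, 0.5, 5.0])          # original 1-D inputs
X_mapped = np.column_stack([x, np.abs(x)])    # (-3) -> (-3, 3), (5) -> (5, 5)
print(X_mapped)
```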

One common way to implement this idea is by adding higher degree polynomials to the training set. The image below shows a dataset with two features (x₁, x₂) that is clearly not linearly separable in ℝ².

We need to transform the inputs in a way that makes the data linearly separable (don’t forget to scale them afterwards!). We will add polynomial features up to the third degree, remapping the input space from ℝ² to ℝ¹⁰, such that (x₁, x₂) becomes (1, x₁, x₂, x₁x₂, x₁², x₂², x₁²x₂, x₂²x₁, x₁³, x₂³). Unfortunately we can’t see the cool 10-dimensional plot, so we’ll settle for the 2D representation with the original features. Note that to make predictions about new data points, we need to apply the same polynomial transformation before feeding them to the SVM.

[Figure: a dataset that is not linearly separable in ℝ², with the decision boundary of an SVM trained on degree-3 polynomial features]
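
Here is a hedged sketch of that recipe using scikit-learn’s PolynomialFeatures (degree 3 on two features yields exactly the ten terms listed above); the article’s repo may implement the mapping differently, and the toy dataset below is my own:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

poly_svm = make_pipeline(
    PolynomialFeatures(degree=3),    # (x1, x2) -> 10 polynomial terms, remapping R^2 to R^10
    StandardScaler(),                # scale after the transformation, as noted above
    LinearSVC(C=1.0, max_iter=10_000),
)
poly_svm.fit(X, y)
# The pipeline applies the same transformation and scaling to new points automatically:
print(poly_svm.predict(X[:5]))
```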

Another way to work around the linear limitations of the SVM is by using similarity measures, which are simply functions that quantify the similarity between two data points. One of the most common similarity functions is the Gaussian Radial Basis Function (RBF), which is defined as:

ϕ(x, l) = exp( −γ ||x − l||² )

We can choose a few landmark points and remap the inputs to a vector containing the similarity to each of the landmarks. So if we have three landmark points (l₁, l₂, l₃) and we are using the RBF similarity function ϕ, an observation originally represented by x becomes (ϕ(x, l₁), ϕ(x, l₂), ϕ(x, l₃)).

But how can we pick the data points to be used as landmarks? One strategy is to just pick every point in the training set. This obviously doesn’t scale very well since we create an extra feature for every new observation, such that a training set with 500 observations and three features becomes a training set with 500 observations and 500 features (assuming we drop the original features).

Later in this post, we’ll introduce a much more elegant and efficient way to implement this idea. For now, the sketch below shows a simpler implementation of this concept — using every observation in the training set as a landmark — and the figure that follows shows how that affects the SVM.
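
A small sketch of that “every point is a landmark” idea (my own illustration, built on scikit-learn helpers that the article does not necessarily use): training data and new data are both remapped against the same landmarks, the original training points.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                        # 500 observations, 3 features
y = (np.linalg.norm(X, axis=1) > 1.7).astype(int)    # a deliberately non-linear target

gamma = 0.5
X_rbf = rbf_kernel(X, X, gamma=gamma)    # 500 x 500: one RBF similarity feature per landmark

clf = LinearSVC(max_iter=10_000).fit(X_rbf, y)

# New observations must be mapped against the same landmarks (the training set):
X_new = rng.normal(size=(5, 3))
print(clf.predict(rbf_kernel(X_new, X, gamma=gamma)))
```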

[Figure: the SVM decision boundary after remapping the inputs with RBF similarities to every training point]

As we said before, the SVM is a linear classification model, and we can write its hypothesis function as a linear combination of the inputs:

h(x) = wx + b

And we predict:

ŷ = +1 if h(x) ≥ 0,   ŷ = −1 if h(x) < 0

Now it is time to translate the SVM goal of finding the largest geometric margin into a mathematical problem. Given a point x⁽ⁱ⁾, what is its distance to the decision boundary defined by wx + b = 0? Well, we know that w is the normal vector of this hyperplane, which means that it is perpendicular to the plane. And if w is the normal vector, then w/||w|| is a unit vector perpendicular to the plane.

[Figure: a point x, its projection p onto the hyperplane wx + b = 0, the normal vector w, and the distance γ between them]

Now let γ be the distance from x to the hyperplane — we can then say that x − γ (w/||w||) lands on the hyperplane (represented by p in the image above). Lastly, since p is a point on the hyperplane, it must satisfy the hyperplane’s equation wp + b = 0. That gives us the system of equations below:

p = x − γ (w/||w||)
wp + b = 0   ⟹   w( x − γ (w/||w||) ) + b = 0

And if we solve it for γ, we find that the distance of a point x⁽ⁱ⁾ to the SVM decision boundary is given by γ⁽ⁱ⁾ = (wx⁽ⁱ⁾ + b)/||w||. Notice how this equation is only valid for points on the side that the normal vector w points to, since we subtracted it from x; otherwise we would have to add γ w/||w|| instead. With a little tweak, we can write the generalized geometric margin γ⁽ⁱ⁾ as:

γ⁽ⁱ⁾ = y⁽ⁱ⁾ ( (wx⁽ⁱ⁾ + b) / ||w|| )

We can again verify that, as we said before, the geometric margin is invariant to the scaling of parameters. Also notice that if ||w||= 1, then the geometric margin and the functional margin are the same.
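
A quick numeric sanity check of that claim (a sketch of my own, not from the article): scaling (w, b) changes the functional margin by the same factor, while the geometric margin stays put.

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x, y = np.array([1.0, 3.0]), -1

for scale in (1, 10):
    ws, bs = scale * w, scale * b
    functional = y * (ws @ x + bs)                  # functional margin
    geometric = functional / np.linalg.norm(ws)     # geometric margin
    print(scale, functional, geometric)             # the geometric margin is identical for both scales
```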

Hard margin SVM

The hard margin SVM works on the assumption that the data is linearly separable, and it forces the model to correctly classify every observation on the training set. The i-th observation is correctly classified if its functional margin is greater than zero:

y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) > 0

Now remember how we said that the predicted classes depend only on the sign of h and that the geometric margins are invariant to the scaling of the parameters? That means that we can add arbitrary constraints to the parameters without affecting the model. For any fitted SVM, we can impose a constraint like ||w|| = 1: all we need to do is rescale its parameters, and the model remains unchanged.

We can use this result along with the fact that the geometric and the functional margins are equal when ||w|| = 1 to write the optimization objective of the hard margin SVM:

maximize (over γ, w, b):   γ
subject to:   y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ≥ γ,   i = 1, …, m
              ||w|| = 1

The first constraint forces every observation to be correctly classified. The second constraint forces γ to not only be a lower bound for the functional margin, but also for the geometric margin. This is precisely the hard margin SVM optimization objective:

to maximize the minimum geometric margin without any misclassifications

If we had tools to solve this constrained optimization problem, we could stop here. But unfortunately, ||w|| = 1 is a non-convex constraint, so we will need to make some changes to get this problem into a friendlier format.

We can get rid of the nasty constraint by dividing the objective by the norm — if γ is a lower bound for the functional margin, then γ/||w|| is a lower bound for the geometric margin. So our new problem can be written as:

maximize (over γ, w, b):   γ / ||w||
subject to:   y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ≥ γ,   i = 1, …, m

Now it is the objective function that is non-convex, but we are one step closer. Remember one last time that we can add arbitrary constraints to the parameters. So we can impose γ = 1 and again, this does not change the model and it can be satisfied by simply rescaling the parameters.

Our new optimization objective is then to maximize 1/||w||, which is equivalent to minimizing ||w||. Last, since ||w|| is not differentiable at 0, we’ll minimize (1/2)||w||² instead, whose gradient is just w. Optimization algorithms work much better on differentiable functions.

Finally, we define the hard margin SVM optimization objective as:

minimize (over w, b):   (1/2) ||w||²
subject to:   y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ≥ 1,   i = 1, …, m

Now that we have the SVM optimization goal defined in a way that we can solve efficiently (a quadratic objective and linear constraints), we can write code for it. Read through the code below (or in this repo) or try to implement it yourself using SciPy’s minimize function. If you read the documentation, you’ll see that the objective function to be minimized takes a single vector as its argument, so we’ll put both w and b in the same vector. It also takes a constraints argument, which can represent an equality (the constraint function must return 0) or an inequality (the constraint function must return non-negative values). We’ll rewrite our generalized constraint as y⁽ⁱ⁾(wx⁽ⁱ⁾ + b) − 1 ≥ 0 to fit SciPy’s spec.
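
Below is a minimal sketch of that implementation (the variable names are mine, and the repo’s version may differ in its details):

```python
import numpy as np
from scipy.optimize import minimize

def fit_hard_margin_svm(X, y):
    """X: (m, n) feature matrix; y: (m,) labels in {-1, 1}."""
    m, n = X.shape

    # params = [b, w_1, ..., w_n]: the intercept and the weights in a single vector.
    def objective(params):
        w = params[1:]
        return 0.5 * np.dot(w, w)    # minimize (1/2)||w||^2

    # One inequality constraint per observation: y_i (w.x_i + b) - 1 >= 0.
    constraints = [
        {"type": "ineq", "fun": lambda p, i=i: y[i] * (X[i] @ p[1:] + p[0]) - 1.0}
        for i in range(m)
    ]

    res = minimize(objective, x0=np.zeros(n + 1), constraints=constraints)
    return res.x[1:], res.x[0]    # w, b

# Tiny linearly separable example:
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = fit_hard_margin_svm(X, y)
print(np.sign(X @ w + b))
```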

Now we can see how our model performs in a side-by-side comparison with SKLearn’s implementation. Not too bad!

[Figure: side-by-side comparison of our hard margin SVM and SKLearn’s implementation on the same dataset]

Soft margin SVM

The hard margin SVM has two very important limitations:

- it only works on linearly separable data;

- it is very sensitive to outliers.

If we want more flexibility, we need to introduce a way for the model to allow for misclassifications, and we do that using the concept of slack variables. Every observation has its own slack measure, which represents how much that particular data point can violate the margin. Our new constraint can be rewritten as: y⁽ⁱ⁾(wx⁽ⁱ⁾ + b) ≥ 1 - ϵ⁽ⁱ⁾. Notice that we now need to estimate one extra variable for every observation.

There is just one problem now. If we are still only trying to minimize ||w||, we can simply let ϵ → ∞ and be sure that the new constraints will be respected. We need the new objective function to represent the soft margin SVM’s conflicting objectives: we still want the margin to be as wide as possible, and we want the slack variables to be as small as possible to prevent margin violations. This gives us the soft margin SVM classifier objective:

minimize (over w, b, ϵ):   (1/2) ||w||² + C ∑ᵢ₌₁ᵐ ϵ⁽ⁱ⁾
subject to:   y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ≥ 1 − ϵ⁽ⁱ⁾,   i = 1, …, m
              ϵ⁽ⁱ⁾ ≥ 0,   i = 1, …, m

We introduced a new hyperparameter C that lets us control the tradeoff between the conflicting objectives. As C grows, it forces the optimization algorithm to find smaller values for ϵ. Below you can see how different values of C affect the SVM.

[Figure: soft margin SVM decision boundaries and margins for different values of C]
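
As a hedged illustration of that tradeoff (using scikit-learn’s SVC rather than our own solver, on a toy dataset of my choosing), fitting the same data with a small and a large C typically leaves more points inside the margin in the first case:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=7)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A looser margin (small C) usually keeps more support vectors.
    print(f"C={C}: {len(clf.support_)} support vectors")
```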

Before we code our own implementation of the soft margin SVM, we need to introduce a new concept.

Quadratic programming (QP) is a framework for solving a particular type of optimization problem in which the objective function is quadratic and the constraints are linear — precisely the type of problem we have been dealing with!

Exactly how different algorithms solve QP optimizations is beyond the scope of this post. Instead, we are going to focus on deriving the SVM optimization problem that is solved by libraries like libsvm.

Given the QP problem defined as:

minimize (over u):   (1/2) uᵀPu + qᵀu
subject to:   Gu ≤ h

All we need to do is figure out what values of P, q, G, and h represent the SVM problem, and we can use a solver (we’ll use cvxopt later on) to get the parameter estimates.

Notice that we will need to rewrite the SVM constraints to fit the QP constraints’ shape. They will be:

−y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) − ϵ⁽ⁱ⁾ ≤ −1,   i = 1, …, m
−ϵ⁽ⁱ⁾ ≤ 0,   i = 1, …, m

So for the soft margin SVM, let X be the training set represented by the m×(n+1) matrix (m observations and n features) left-padded with a 1 column-vector for the intercept. Then the SVM QP problem is defined with:

P = blockdiag( 0, Iₙ, 0ₘ )        (an identity block for w; zeros for b and ε)
q = [ 0, 0, …, 0, C, …, C ]ᵀ      (n+1 zeros, then C for each ε⁽ⁱ⁾)
G = [ K          −Iₘ ]
    [ 0ₘ×₍ₙ₊₁₎   −Iₘ ]
h = [ −1, …, −1, 0, …, 0 ]ᵀ       (m entries equal to −1, then m zeros)

where K is the m×(n+1) matrix whose i-th row is −y⁽ⁱ⁾ [1, x⁽ⁱ⁾].

How so? Since we need to estimate b, w, and ϵ, the vector u of the QP problem becomes [b, w₁, …, wₙ, ε₁, …, εₘ]. And then:

- the P matrix uses the identity to get w back and zeros to cancel b and ϵ;

- the q vector multiplies b and w by 0, and multiplies ϵ by C, so that we have the second part of the optimization objective C ∑ᵢ₌₁ᵐ ε⁽ⁱ⁾;

- the K matrix gives us −y⁽ⁱ⁾(wx⁽ⁱ⁾ + b) and the −Iₘ to its right subtracts ε⁽ⁱ⁾. The bottom block of rows of the G matrix represents −ϵ⁽ⁱ⁾ ≤ 0;

- lastly, the h vector stacks −1s and 0s, just like the constraints (≤ −1 and ≤ 0).

Now all we have to do is hand these values to a QP solver and we have the parameters of a soft margin SVM. You can check the code that implements this idea below (or in this repo). We have made a lot of progress already, but we are not done yet.
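
Below is a sketch of that setup with cvxopt, following the block structure described above (the naming and organization are mine and may differ from the repo’s):

```python
import numpy as np
from cvxopt import matrix, solvers

def fit_soft_margin_svm(X, y, C=1.0):
    """Primal soft margin SVM as a QP. X: (m, n) features; y: (m,) labels in {-1, 1}."""
    m, n = X.shape
    Xp = np.hstack([np.ones((m, 1)), X])    # left-pad with 1s for the intercept b

    # u = [b, w_1..w_n, eps_1..eps_m]
    P = np.zeros((1 + n + m, 1 + n + m))
    P[1:1 + n, 1:1 + n] = np.eye(n)                     # (1/2)||w||^2 only involves w
    q = np.hstack([np.zeros(1 + n), C * np.ones(m)])    # + C * sum(eps)

    # G u <= h encodes  -y_i (w.x_i + b) - eps_i <= -1  and  -eps_i <= 0.
    G_top = np.hstack([-(y[:, None] * Xp), -np.eye(m)])
    G_bot = np.hstack([np.zeros((m, 1 + n)), -np.eye(m)])
    G = np.vstack([G_top, G_bot])
    h = np.hstack([-np.ones(m), np.zeros(m)])

    solvers.options["show_progress"] = False
    u = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))["x"]).ravel()
    return u[1:1 + n], u[0]    # w, b
```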

Remember we talked about similarity measures earlier in this post as a way to work around the linear limitations of the SVM? It turns out there is a much more elegant and efficient solution to this problem than explicitly making every observation a landmark and transforming the inputs.

We’ll next talk about Lagrange duality. This will lead us to a different representation of the soft margin SVM optimization problem (called its dual form). We will be able to apply non-linear transformations over the input space in a much more efficient way, allowing the SVM to work well even in very high dimensions.

The general idea of the Lagrange method is to transform a constrained optimization problem (primal form) into an unconstrained one (dual form), by moving the constraints into the objective function. There are two main reasons for writing the SVM optimization problem in its dual form:

- kernel trick: the training and predictions of the SVM will depend on the data points only through their inner product. This powerful result allows us to apply any transformations on the training set (even remapping it to an infinite-dimensional space!) as long as we can calculate that inner product;

- support vectors: the predictions will depend only on a few points on the training set called support vectors. The dual form gives us an easy way to identify these points.

Hopefully, these reasons will become clear by the end of this post. For now, let’s leave the SVM aside for a moment and think about solving constrained optimization problems like the one below.

minimize (over x):   f₀(x)
subject to:   fᵢ(x) ≤ 0,   i = 1, …, m

For simplicity, we will analyze a problem with only inequality constraints (like the SVM), but the results could be easily extended to equalities. One way to rewrite this problem is by using an infinite step function g(u):

g(u) = 0    if u ≤ 0
g(u) = +∞   if u > 0

We can then rewrite the original problem as:

minimize (over x):   J(x) = f₀(x) + ∑ᵢ₌₁ᵐ g(fᵢ(x))

Notice that we penalize J(x) with a positive infinity term if any of the original constraints are not satisfied. And when all the constraints are met, then g(fᵢ(x)) = 0 ∀ i and J(x) = f₀(x). So minimizing J(x) is equivalent to the original problem, but we can’t really solve this new optimization yet — the step function is neither continuous nor differentiable.

What if instead of the infinite step function, we used a linear function like g(u) = α u? If we enforce α ≥ 0, then the penalty is, at least, in the right direction when a constraint is not met (α u > 0). Furthermore, g is now continuously differentiable: a much better fit for optimization problems.

So if we substitute our new linear penalty in the J(x) equation above, we get a function of x and α that is known as the Lagrangian:

L(x, α) = f₀(x) + ∑ᵢ₌₁ᵐ αᵢ fᵢ(x)

Notice that if we maximize the Lagrangian with respect to α, we get J(x) back: αᵢ = 0 if fᵢ(x) < 0 (since αᵢ fᵢ(x) ≤ 0 for any αᵢ ≥ 0, the maximum is attained at αᵢ = 0) and αᵢ → ∞ if fᵢ(x) > 0. So the original problem can now be written as a function of the Lagrangian:

minₓ J(x) = minₓ max_{α ≥ 0} L(x, α)

Last, since min max f ≥ max min f for any function f, we have that:

max_{α ≥ 0} minₓ L(x, α) ≤ minₓ max_{α ≥ 0} L(x, α)

The minₓ L(x, α) is called the dual function and its maximum is a lower bound for the original (primal) optimization problem. Moreover, it can be shown that, under certain conditions:

max_{α ≥ 0} minₓ L(x, α) = minₓ max_{α ≥ 0} L(x, α)

so that we can solve the dual problem instead of the primal problem. Luckily, the SVM satisfies these conditions (specifically, Slater’s condition): the objective and the constraint functions are convex and continuously differentiable.

Since the SVM satisfies the regularity conditions, if there is a solution to the primal problem, it will necessarily be among the stationary points (x*, α*) of the Lagrangian that respect the Karush–Kuhn–Tucker (KKT) conditions. Furthermore, in the case of the SVM (convex and differentiable), the KKT conditions are not just necessary, but also sufficient for optimality.

What are the KKT conditions? Suppose the primal problem has a solution x* and the dual problem has a solution α* such that L(x*, α*) = f₀(x*). Then the following conditions must be met:

∇ₓ L(x*, α*) = 0                      (stationarity)
fᵢ(x*) ≤ 0,   i = 1, …, m             (primal feasibility)
αᵢ* ≥ 0,   i = 1, …, m                (dual feasibility)
αᵢ* fᵢ(x*) = 0,   i = 1, …, m         (complementary slackness)

Below we have the soft margin SVM optimization problem:

minimize (over w, b, ϵ):   (1/2) ||w||² + C ∑ᵢ₌₁ᵐ ϵ⁽ⁱ⁾
subject to:   y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ≥ 1 − ϵ⁽ⁱ⁾,   i = 1, …, m
              ϵ⁽ⁱ⁾ ≥ 0,   i = 1, …, m

If we rewrite both constraints as:

1 − ϵ⁽ⁱ⁾ − y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ≤ 0,   i = 1, …, m
−ϵ⁽ⁱ⁾ ≤ 0,   i = 1, …, m

The Lagrangian can be written as:

L(w, b, ϵ, α, λ) = (1/2) ||w||² + C ∑ᵢ₌₁ᵐ ϵ⁽ⁱ⁾ + ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ [ 1 − ϵ⁽ⁱ⁾ − y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) ] − ∑ᵢ₌₁ᵐ λ⁽ⁱ⁾ ϵ⁽ⁱ⁾

Remember that since the SVM optimization problem satisfies some special conditions (the objective and constraint functions are convex and continuously differentiable), any solution to the Lagrangian is a solution to the original optimization problem, and it must satisfy the KKT conditions. From the KKT stationarity condition we have:

∂L/∂b = −∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ = 0   ⟹   ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ = 0
∇_w L = w − ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ x⁽ⁱ⁾ = 0   ⟹   w = ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ x⁽ⁱ⁾
∂L/∂ϵ⁽ⁱ⁾ = C − α⁽ⁱ⁾ − λ⁽ⁱ⁾ = 0

From the dual feasibility condition, we have:

α⁽ⁱ⁾ ≥ 0,   λ⁽ⁱ⁾ ≥ 0,   i = 1, …, m

So based on the partial derivative with respect to b, the gradient with respect to w, and the dual feasibility condition, we derive the constraints of the dual problem:

0 ≤ α⁽ⁱ⁾ ≤ C,   i = 1, …, m
∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ = 0

We are almost there! Back to the Lagrangian, let’s use the result we got from the w gradient (w = ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ x⁽ⁱ⁾) and substitute it on the last term of the Lagrangian sum:

∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ wx⁽ⁱ⁾ = w ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ x⁽ⁱ⁾ = ||w||² = ∑ᵢ₌₁ᵐ ∑ⱼ₌₁ᵐ α⁽ⁱ⁾ α⁽ʲ⁾ y⁽ⁱ⁾ y⁽ʲ⁾ (x⁽ⁱ⁾)ᵀ x⁽ʲ⁾

Let

W(α) = ∑ᵢ₌₁ᵐ ∑ⱼ₌₁ᵐ α⁽ⁱ⁾ α⁽ʲ⁾ y⁽ⁱ⁾ y⁽ʲ⁾ (x⁽ⁱ⁾)ᵀ x⁽ʲ⁾

then plugging everything back into the Lagrangian we have:

L = ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ − (1/2) W(α) − b ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ + ∑ᵢ₌₁ᵐ ϵ⁽ⁱ⁾ ( C − α⁽ⁱ⁾ − λ⁽ⁱ⁾ )

Last, since ∑ᵢ α⁽ⁱ⁾ y⁽ⁱ⁾ = 0 and C − λ⁽ⁱ⁾ − α⁽ⁱ⁾ = 0, the last two terms vanish and we finally have the dual form of the soft margin SVM (after multiplying by −1 and turning the maximization into a minimization problem):

minimize (over α):   (1/2) ∑ᵢ₌₁ᵐ ∑ⱼ₌₁ᵐ α⁽ⁱ⁾ α⁽ʲ⁾ y⁽ⁱ⁾ y⁽ʲ⁾ (x⁽ⁱ⁾)ᵀ x⁽ʲ⁾ − ∑ᵢ₌₁ᵐ α⁽ⁱ⁾
subject to:   0 ≤ α⁽ⁱ⁾ ≤ C,   i = 1, …, m
              ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ = 0

Now we just need to get back the vector of weights w and the bias term b from the Lagrange multipliers α. For the weights, we can simply use the fact that w = ∑ α⁽ⁱ⁾ y⁽ⁱ⁾ x⁽ⁱ⁾. Notice that we only care about the indices where α⁽ⁱ⁾ > 0.

To get b back, let’s take one last look at the KKT conditions — from complementary slackness, we have:

α⁽ⁱ⁾ [ y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) − 1 + ϵ⁽ⁱ⁾ ] = 0
λ⁽ⁱ⁾ ϵ⁽ⁱ⁾ = 0

From the first equation, we see that when αᵢ* > 0, the i-th constraint must be active, that is, it must hold by equality that:

y⁽ⁱ⁾ (wx⁽ⁱ⁾ + b) = 1 − ϵ⁽ⁱ⁾

Based on λ⁽ⁱ⁾ ϵ⁽ⁱ⁾ = 0 and C − α⁽ⁱ⁾ − λ⁽ⁱ⁾ = 0, we have that 0 < α⁽ⁱ⁾ < C ⟹ λ⁽ⁱ⁾ > 0 ⟹ ϵ⁽ⁱ⁾ = 0. We’ll average over this subset of the training sample to calculate b. If 0 < α⁽ⁱ⁾ < C, we have:

b = (1/nₘ) ∑_{i∈M} ( y⁽ⁱ⁾ − wx⁽ⁱ⁾ ) = (1/nₘ) ∑_{i∈M} ( y⁽ⁱ⁾ − ∑ⱼ₌₁ᵐ α⁽ʲ⁾ y⁽ʲ⁾ (x⁽ʲ⁾)ᵀ x⁽ⁱ⁾ )

where M is the subset of the training sample that satisfies 0 < α⁽ⁱ⁾ < C and nₘ is its size. Last, to make a prediction about a new observation x, we need to calculate:

ŷ = sign( wx + b ) = sign( ∑ᵢ₌₁ᵐ α⁽ⁱ⁾ y⁽ⁱ⁾ (x⁽ⁱ⁾)ᵀ x + b )

That was a long ride, but we just derived the exact optimization problem that the popular library libsvm solves! We will use a standard QP solver later on, but if you are interested in how it is solved more efficiently, you can check out this paper, which presents the sequential minimal optimization (SMO) algorithm that the libsvm library is based on.

Dual form conclusions

There are two very important results that need to be highlighted:

1. The SVM predictions depend only on the observations where α⁽ⁱ⁾ > 0. These points represent a very special subset of the training sample and are called support vectors.

Once the model is trained, we can usually discard a large portion of the training set and only retain the support vectors.

2. During training and to make predictions, we depend on the data points only through their inner product (x⁽ⁱ⁾)ᵀ x⁽ʲ⁾. This beautiful and important result is what allows the kernel trick.

In the non-linear classification section, we talked about applying non-linear transformations to the original features before fitting an SVM. This simple trick allowed the linear SVM to capture non-linear relationships in the data.

For example, let ϕ be a feature mapping function, like the one below:

[Equation: an example feature mapping ϕ that remaps (x₁, x₂) into a higher-dimensional space]

Rather than fitting the SVM using the original features (x₁, x₂), we want to use ϕ(x). We’ve already seen that, when using the dual form, the SVM depends on the data points only through their inner product. This means we only need to calculate:

ϕ(x⁽ⁱ⁾)ᵀ ϕ(x⁽ʲ⁾)

This is where kernels come in. In machine learning, a kernel is a function that calculates the inner product ϕ(x⁽ⁱ⁾)ᵀ ϕ(x⁽ʲ⁾) based only on the original vectors (x⁽ⁱ⁾, x⁽ʲ⁾). No need to compute the transformation ϕ(x) or even know about it!

Let’s look at a quick example. Let ϕ(x) be the feature mapping function defined below:

ϕ(x) = ( x₁²,  √2 x₁x₂,  x₂² )

Then the inner product of two vectors a = [a₁, a₂] and b = [b₁, b₂] after applying the transformation ϕ is:

ϕ(a)ᵀ ϕ(b) = a₁²b₁² + 2 a₁b₁ a₂b₂ + a₂²b₂² = (a₁b₁ + a₂b₂)² = (aᵀb)²

That is a very interesting result. The inner product of the transformed vectors is equal to the square of the inner product of the original vectors. You don’t need to apply the transformation at all!

If we were fitting an SVM to this data, we could simply take the square of the inner products of the original vectors and we would be, in fact, fitting a model to the transformed vectors.
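
A quick numeric check of this identity (assuming the mapping ϕ(x) = (x₁², √2·x₁x₂, x₂²) shown above):

```python
import numpy as np

def phi(v):
    # The explicit second-degree mapping: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(a) @ phi(b))    # 1.0
print((a @ b) ** 2)       # 1.0 -- the same value, without ever computing phi
```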

Another example of a kernel is the RBF function that we talked about earlier. It turns out that the transformation embedded in the RBF kernel maps each training instance to an infinite-dimensional space.

This is a perfect example of how kernels can be useful: they allow the SVM to learn in a high (or even infinite!) dimensional feature space, while only having to calculate the kernel function of the original vectors.

This all begs the question: can every function be used as a kernel? It can be shown that if a function K(a, b) respects a few conditions (Mercer’s conditions), then there exists a function ϕ that maps a and b into another space such that K(a, b) = ϕ(a)ᵀ ϕ(b). We can then use K as a kernel because we know that ϕ exists. So as long as a function meets Mercer’s conditions, it can be used as a kernel.

One last note: if we are using a kernel, we usually can’t recover the weight vector w from the Lagrange multipliers. That happens because, after transforming the inputs, w = ∑ᵢ α⁽ⁱ⁾ y⁽ⁱ⁾ ϕ(x⁽ⁱ⁾) and ϕ is not necessarily known. But this presents no problem, because everywhere else we only need to compute ϕ(x⁽ⁱ⁾)ᵀ ϕ(x⁽ʲ⁾) = K(x⁽ⁱ⁾, x⁽ʲ⁾).

Now that we know how to derive and solve the SVM optimization problem, we are ready to understand the code behind it. The original code is available in this repo — the only change made was to estimate b by averaging over the instances where 0 < α⁽ⁱ⁾ < C (instead of 0 < α⁽ⁱ⁾).
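
For reference, here is a hedged sketch of what that dual-form, kernelized implementation can look like with cvxopt and an RBF kernel (names and structure are mine; the linked repo is the authoritative version), including the estimate of b averaged over the observations with 0 < α⁽ⁱ⁾ < C:

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf(A, B, gamma=0.5):
    # Gaussian RBF kernel matrix: K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    return np.exp(-gamma * np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2))

def fit_dual_svm(X, y, C=1.0, kernel=rbf):
    """Dual soft margin SVM. X: (m, n) features; y: (m,) labels in {-1, 1}."""
    m = X.shape[0]
    K = kernel(X, X)
    P = matrix(np.outer(y, y) * K)                    # P_ij = y_i y_j K(x_i, x_j)
    q = matrix(-np.ones(m))
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))    # 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.astype(float), (1, m))               # sum_i alpha_i y_i = 0
    b = matrix(0.0)

    solvers.options["show_progress"] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

    margin = (alpha > 1e-6) & (alpha < C - 1e-6)      # 0 < alpha_i < C  =>  eps_i = 0
    b_hat = np.mean(y[margin] - (alpha * y) @ kernel(X, X[margin]))
    return alpha, b_hat

def predict(X_new, X, y, alpha, b_hat, kernel=rbf):
    # sign( sum_i alpha_i y_i K(x_i, x) + b )
    return np.sign((alpha * y) @ kernel(X, X_new) + b_hat)
```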

That is it! I hope you learned as much as I did by writing this. You can find a notebook to play with the three models we coded in this post here.

References

- http://cs229.stanford.edu/notes/cs229-notes3.pdf

- https://www-cs.stanford.edu/people/davidknowles/lagrangian_duality.pdf

- https://gist.github.com/mblondel/586753

- Bishop, Christopher — Pattern Recognition And Machine Learning (link)

- Géron, Aurélien — Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition (link)
