Support Vector Machines and Kernel Functions
By Ms. Gunjan Rani
Assistant Professor
Acharya Narendra Dev College
Support Vector Machines (SVM), among classifiers, are probably
the most intuitive and elegant, especially for binary
classification tasks.
Linearly Separable data
Generally speaking, the idea of SVM is to find a boundary that separates observations into classes.
Our SVM algorithm is then asked to find a boundary that can segregate the two classes, like so:
However, that was not the only way the two classes could have been separated. Let’s
consider the following alternatives:
Since the observations can be separated in numerous ways, SVM searches for the boundary, called a hyperplane, that best segregates the classes.
What does “best segregating” mean?
The optimum hyperplane is the one that guarantees the largest “area of freedom” for future observations: an area where they can deviate from their pattern without undermining the model. In particular, SVM looks for a boundary that is maximally far away from any data point. The distance from the decision boundary to the closest data points determines the margin of the classifier.
The best hyperplane is the one that maximizes the margin, under the constraint that each data point must lie on the correct side of the margin.
Now a further concept needs to be
introduced. If you look at the graph:
Data points which are circled play a very important role in building the model. These data
points are called Support Vectors.
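As a quick illustration, here is a minimal sketch (not from the slides; it assumes scikit-learn and a synthetic, separable dataset) showing that only the support vectors determine the hyperplane: refitting on just those points reproduces the same boundary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs stand in for a linearly separable dataset.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
print("support vectors:\n", clf.support_vectors_)

# Refit using only the support vectors: the hyperplane should be unchanged,
# showing that the circled points alone build the model.
sv_idx = clf.support_
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])
print("w (all data): ", clf.coef_)
print("w (SVs only): ", clf_sv.coef_)   # should match up to numerical tolerance
```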
Non-Linearly Separable data
There are two classes, squares and circles, but there is no way to trace a straight line that segregates them. How shall we proceed?
The answer to this question is the reason why SVM became so popular: there is a trick, known as the “kernel trick”, which allows support vector machines to move effortlessly from linear to non-linear models by using only a slightly different optimization problem from the one we have just seen.
What I want to do is provide an intuitive way to approach this task. Let’s see the idea behind it with this simple graph:
Again, you want to segregate squares from circles, and here a possible boundary might be
the one in green.
Now, the idea is the following: if we lift the circles up into a third dimension, using a given transformation function φ(x), we would be able to easily separate the observations with a plane:
But how can we compute φ(x)?
Well, actually we are not asked to do so. Moving to a high-dimensional space is the kernel functions’ job: they create an implicit feature space. Therefore, the transformation function φ(x) is not needed, as we are provided with a list of kernel functions, also known as “similarity functions”. The most popular is the Radial Basis Function (RBF) kernel, or Gaussian kernel, which looks like this:
K(x, y) = exp(−||x − y||² / (2σ²)),
with x and y being two vectors of features representing two different observations.
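For concreteness, a small sketch (not from the slides) computes this Gaussian/RBF kernel directly with NumPy and checks it against scikit-learn’s rbf_kernel, which uses the equivalent parameterisation gamma = 1/(2σ²). The vectors and σ below are illustrative values.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for two feature vectors x and y."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.5])
sigma = 1.5

print(gaussian_kernel(x, y, sigma))
# Same value via scikit-learn, with gamma = 1 / (2 * sigma^2):
print(rbf_kernel(x.reshape(1, -1), y.reshape(1, -1), gamma=1.0 / (2 * sigma ** 2))[0, 0])
```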
SVM Maths
Decision rule for a decision boundary in classification problems.
If we have a new data point, say x(x1, x2), we need to determine which class it belongs to. We can clearly see that x belongs to the -ve class, but we need a decision rule so that our machine can understand which class x belongs to.
● We take a vector w perpendicular to the decision boundary, so that we know the direction of w but not its magnitude.
● We then take a vector x and find its projection onto w, which is simply the dot product of the two vectors.
[Figure: vector w perpendicular to the decision boundary, a vector x, and the projection of x onto w compared against a threshold t, with the -ve and +ve class points on either side]
Let’s say that if the projection of x on w is greater than some constant t, then x belongs to the +ve class, otherwise to the -ve class:
-> w·x ≥ t, then y = +ve, or equivalently
-> w·x − t ≥ 0, then y = +ve.
This is our decision rule:
y_pred = +ve if wᵀx − t ≥ 0
y_pred = -ve if wᵀx − t < 0
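A minimal NumPy sketch of this decision rule (the values of w and t below are illustrative, not taken from the slides):

```python
import numpy as np

def predict(w, t, x):
    """Decision rule: +ve if w.x - t >= 0, else -ve."""
    return "+ve" if np.dot(w, x) - t >= 0 else "-ve"

w = np.array([1.0, 1.0])   # assumed direction perpendicular to the boundary
t = 3.0                    # assumed threshold
print(predict(w, t, np.array([4.0, 2.0])))  # projection large enough -> +ve
print(predict(w, t, np.array([0.5, 1.0])))  # projection too small    -> -ve
```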
Now, SVM
We are dealing with a feature space of 2 or more dimensions, so we will now refer to our decision boundary as a hyperplane.
Goal: maximize the margin around the hyperplane.
Let the size of the margin be d. We need to maximise d, but we cannot maximise it as much as we want, because of the -ve and +ve margin hyperplanes on either side of the hyperplane.
The hyperplane can be represented as
wᵀx − t = 0,
the -ve hyperplane as
wᵀx − t ≤ −1,
and the +ve hyperplane as
wᵀx − t ≥ 1.
[Figure: the hyperplane with the +ve and -ve margin hyperplanes on either side, separated by the margin d]
So, we need to maximize d such that
wᵀxᵢ − t ≥ 1 for +ve points (yᵢ = +1), and
wᵀxᵢ − t ≤ −1 for -ve points (yᵢ = −1).
These two constraints can be merged into one, so our problem becomes: maximise d s.t. yᵢ(wᵀxᵢ − t) ≥ 1.
To maximize d, all we now need is an expression for d. From the figure, d can be written as
d = (t + m)/||w|| − (t − m)/||w|| = 2m/||w||.
We have already taken m = 1, therefore
d = 2/||w||.
So our problem becomes: maximize 2/||w|| s.t. yᵢ(wᵀxᵢ − t) ≥ 1, or equivalently minimise ½||w||² s.t. yᵢ(wᵀxᵢ − t) ≥ 1:
w∗, t∗ = argmin_{w,t} ½||w||² subject to yᵢ(w·xᵢ − t) ≥ 1, 1 ≤ i ≤ n.
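To see this optimisation problem in action, here is a sketch (not the slides’ method) that hands the same primal problem to a generic constrained solver, scipy.optimize.minimize with SLSQP; the tiny dataset is made up for illustration.

```python
# Minimise 1/2 ||w||^2 subject to y_i (w.x_i - t) >= 1 (hard-margin primal).
import numpy as np
from scipy.optimize import minimize

# Tiny separable toy dataset (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])

def objective(v):              # v = [w1, w2, t]
    w = v[:2]
    return 0.5 * np.dot(w, w)

# SLSQP inequality constraints require fun(v) >= 0, i.e. y_i(w.x_i - t) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": (lambda v, i=i: y[i] * (np.dot(v[:2], X[i]) - v[2]) - 1.0)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), constraints=constraints, method="SLSQP")
w, t = res.x[:2], res.x[2]
print("w =", w, " t =", t, " margin =", 2.0 / np.linalg.norm(w))
```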
We will use Lagrange multipliers to handle the constraints along with the expression to minimize. Adding the constraints with multipliers αᵢ, one for each training example, gives the Lagrange function
L(w, t, α) = ½||w||² − Σᵢ αᵢ [yᵢ(w·xᵢ − t) − 1].
By taking the partial derivative of the Lagrange function with respect to t and setting it to 0, we find that for the optimal threshold t we have
Σᵢ αᵢ yᵢ = 0.
Similarly, taking the partial derivative with respect to w and setting it to 0, we now have
w = Σᵢ αᵢ yᵢ xᵢ.
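As a sanity check of w = Σᵢ αᵢ yᵢ xᵢ, here is a sketch (not from the slides) using scikit-learn, which stores αᵢyᵢ for the support vectors in dual_coef_, so the weighted sum can be recovered and compared with the fitted coef_.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.8, random_state=0)
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C: close to a hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors only, so the sum
# w = sum_i alpha_i * y_i * x_i runs over the support vectors alone.
w = clf.dual_coef_ @ clf.support_vectors_
print("w from the dual:", w)
print("w from coef_:   ", clf.coef_)   # the two should agree
```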
Soft SVM
However, in the real world the data is often not linearly separable, and trying to fit a maximum margin classifier could result in overfitting the model (high variance). Here is an instance of non-linearly separable data:
For the above dataset, which is non-linear and not easily separable, trying to fit a maximum margin classifier can result in a model such as the following, which looks overfitted, with high variance. Training finds it difficult to converge, and the model may fail to generalize to the larger population.
● To avoid overfitting our model, we loosen it by introducing slack variables ξᵢ, one for each example, which allow some of the examples to be inside the margin or even on the wrong side of the decision boundary – we will call these margin errors.
● Thus, we change the constraints to yᵢ(w·xᵢ − t) ≥ 1 − ξᵢ and add the sum of all slack variables to the objective function to be minimised, resulting in the following soft margin optimisation problem:
w∗, t∗, ξ∗ = argmin_{w,t,ξ} ½||w||² + C Σᵢ ξᵢ subject to yᵢ(w·xᵢ − t) ≥ 1 − ξᵢ and ξᵢ ≥ 0, 1 ≤ i ≤ n.
The C parameter tells the SVM optimization how much you want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of classifying the training points correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. C therefore controls, to some extent, the ‘complexity’ of the SVM and is often referred to as the complexity parameter.
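A small sketch of this trade-off (not from the slides; the overlapping synthetic data is illustrative): larger C gives a narrower margin 2/||w||, smaller C a wider one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes, so a soft margin is needed.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # margin width d = 2 / ||w||
    print(f"C={C:7.2f}  margin={margin:.3f}  support vectors={len(clf.support_)}")
```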
Applying Lagrange multipliers αᵢ and βᵢ to the new constraints gives
L(w, t, ξ, α, β) = ½||w||² + C Σᵢ ξᵢ − Σᵢ αᵢ [yᵢ(w·xᵢ − t) − (1 − ξᵢ)] − Σᵢ βᵢ ξᵢ.
Taking the derivative of this equation with respect to ξᵢ and setting it to 0, we get
C − αᵢ − βᵢ = 0.
Furthermore, since both αᵢ and βᵢ are non-negative, this means that αᵢ cannot be larger than C, which manifests itself as an additional upper bound on αᵢ in the dual problem:
0 ≤ αᵢ ≤ C.
Kernel Functions
What if the data we want to separate can’t be separated linearly?
For example:
To classify such problems we use the kernel trick, where kernel functions map the lower-dimensional feature space into a higher-dimensional feature space.
Example:
Can you separate this data represented in one dimension?
No!
What if we map it to a higher dimension?
It is now linearly separable.
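A sketch of this idea (the data below is made up for illustration): one-dimensional points that no single threshold can separate become linearly separable after the mapping x ↦ φ(x) = (x, x²).

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: the middle points are one class, the outer points the other,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Map to a higher dimension: phi(x) = (x, x^2). A horizontal line now separates the classes.
X_mapped = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(X_mapped, y)
print(clf.predict(X_mapped))   # matches y: linearly separable after the mapping
```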
Lower Dimension Feature Space → Kernel Functions → Higher Dimension Feature Space
Kernel functions are given by the relation
k(x, x′) = φ(x)ᵀ φ(x′)
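As a concrete check (a sketch, not from the slides), the degree-2 homogeneous polynomial kernel k(x, x′) = (x · x′)² equals the dot product φ(x)ᵀφ(x′) for the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²), so the kernel evaluates the high-dimensional dot product without ever forming φ.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel on 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, xp):
    """k(x, x') = (x . x')^2 -- computed in the original, low-dimensional space."""
    return np.dot(x, xp) ** 2

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

print(poly2_kernel(x, xp))        # kernel in the input space
print(np.dot(phi(x), phi(xp)))    # same value via the explicit map phi
```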
Types of Kernels
1. Linear Kernel
Let us say that we have two vectors x and x′; the linear kernel is then defined as the dot product of these two vectors:
k(x, x′) = x · x′
2. A stationary kernel is one that is translation invariant in input space:
k(x, x′) = k(x − x′)
3. Homogeneous Kernels
A homogeneous kernel depends only on the distance between its arguments:
k(x, x′) = k(||x − x′||)
These are also known as radial basis function (RBF) kernels, of which the Gaussian kernel is the most common example.
SVM Implementation:
1. Import the Libraries
2. Import the Dataset
3. Separate Dependent and Independent Variables.
4. Split the dataset into Training and Test set.
5. Import the SVC class from the sklearn.svm module.
6. Create the object/model of the SVC class.
Important parameter:
kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’
7. Train the model using Training Set.
8. Predict the result for the Test set.
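A minimal end-to-end sketch of steps 1–8 (the slides do not name a dataset, so a built-in scikit-learn dataset stands in here; the scaling step is an extra, commonly recommended addition):

```python
# 1. Import the libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC                      # 5. SVC class from the svm module
from sklearn.metrics import accuracy_score

# 2-3. Import the dataset and separate independent (X) and dependent (y) variables
X, y = load_breast_cancer(return_X_y=True)

# 4. Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling (not listed in the slides, but usually helpful for SVMs)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 6. Create the model; the kernel parameter defaults to 'rbf'
model = SVC(kernel="rbf", C=1.0)

# 7. Train the model on the training set
model.fit(X_train, y_train)

# 8. Predict the result for the test set
y_pred = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred))
```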
SVM Numerical
