This document provides a summary of a lecture on support vector machines (SVMs). The lecture discusses how SVMs find the optimal separating hyperplane between two classes by maximizing the margin between them. It covers both the separable and non-separable cases, and how SVMs can be extended to non-linear classification using the kernel trick. The lecture concludes by mentioning further issues such as multi-class classification and algorithms for building SVMs.
Lecture 12 - SVM
1. Introduction to Machine Learning
Lecture 12: Support Vector Machines
Albert Orriols i Puig
aorriols@salle.url.edu
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
2. Recap of Lecture 11
1st generation NN: perceptrons and others
Also multi-layer perceptrons
3. Recap of Lecture 11
2nd generation NN
Some people figured out how to adapt the weights of internal layers
Seemed to be very powerful and able to solve almost anything
Reality showed that this was not exactly true
4. Today’s Agenda
Moving to SVM
Linear SVM
The separable case
The non-separable case
Non-Linear SVM
5. Introduction
SVM (Vapnik, 1995)
A clever type of perceptron
Instead of hand-coding the layer of non-adaptive features, each training example is used to create a new feature using a fixed recipe
A clever optimization technique is used to select the best subset of features
Many NN researchers switched to SVM in the 1990s because they work better
Here, we’ll take a slow path into SVM concepts
6. Shattering Points with Oriented Hyperplanes
Remember the idea:
I want to build hyperplanes that separate the points of two classes
In a two-dimensional space, these are lines
E.g., a linear classifier: which is the best separating line?
Remember, a hyperplane is represented by the equation w · x + b = 0
7. Linear SVM
I want the line that maximizes the margin between examples of both classes!
The examples lying on the margin are the support vectors.
8. Linear SVM
In more detail:
Let’s assume two classes, yi ∈ {-1, +1}
Each example is described by a set of features x (x is a vector; for clarity, we will mark vectors in bold in the remainder of the slides)
The problem can be formulated as follows. All training examples must satisfy (in the separable case):
w · xi + b ≥ +1 for yi = +1
w · xi + b ≤ -1 for yi = -1
These can be combined into a single constraint: yi(w · xi + b) - 1 ≥ 0 for all i
9. Linear SVM
What are the support vectors?
Let’s find the points that lie on the hyperplane H1: w · x + b = +1
Their perpendicular distance to the origin is |1 - b| / ||w||
Let’s find the points that lie on the hyperplane H2: w · x + b = -1
Their perpendicular distance to the origin is |-1 - b| / ||w||
The margin is 2 / ||w||, as derived below.
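The slide states the margin without the intermediate step; it follows from the standard distance formula for parallel hyperplanes:

```latex
% H1 and H2 are the parallel hyperplanes w.x = 1 - b and w.x = -1 - b.
% Two parallel hyperplanes w.x = c_1 and w.x = c_2 lie |c_1 - c_2| / ||w|| apart, so
\[
\text{margin} \;=\; \frac{\lvert (1-b) - (-1-b) \rvert}{\lVert \mathbf{w} \rVert}
\;=\; \frac{2}{\lVert \mathbf{w} \rVert}.
\]
```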
10. Linear SVM
Therefore, the problem is:
Find the hyperplane that minimizes ½ ||w||²
Subject to yi(w · xi + b) - 1 ≥ 0 for all i
But let us change to the Lagrangian formulation, because:
The constraints will be placed on the Lagrange multipliers themselves (easier to handle)
Training data will appear only in the form of dot products between vectors
11. Linear SVM
The Lagrangian formulation comes to be
g g
Where αi are the Lagrange multipliers
So,
So now we need to
Minimize Lp w.r.t w, b
Simultaneously require that the derivatives of Lp w.r.t to α
vanish
All subject to the constraints αi ≥ 0
Slide 11
Artificial Intelligence Machine Learning
12. Linear SVM
Transformation to the dual problem:
This is a convex problem, so we can equivalently solve the dual problem
That is, maximize
LD = Σi αi - ½ Σi Σj αi αj yi yj (xi · xj)
w.r.t. the αi
Subject to the constraint Σi αi yi = 0
And with αi ≥ 0
13. Linear SVM
This is a quadratic programming problem. You can solve it with many methods, such as gradient descent.
We will not cover these methods in class, but a sketch with an off-the-shelf solver follows.
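The slides prescribe no tool; here is a minimal sketch of the dual QP above using the generic solver from cvxopt (the library choice, tolerance, and function name are my own assumptions):

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_svm_dual(X, y):
    """Hard-margin dual for separable data: X is (n, d), y is (n,) in {-1, +1}."""
    n = X.shape[0]
    K = X @ X.T                                   # Gram matrix of dot products xi . xj
    P = matrix((np.outer(y, y) * K).astype(float))  # quadratic term yi yj (xi . xj)
    q = matrix(-np.ones(n))                       # maximize sum(alpha) -> minimize -sum(alpha)
    G = matrix(-np.eye(n))                        # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))    # equality constraint sum_i alpha_i yi = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])
    sv = alpha > 1e-6                             # support vectors carry nonzero multipliers
    w = ((alpha * y)[:, None] * X).sum(axis=0)    # w = sum_i alpha_i yi xi
    bias = np.mean(y[sv] - X[sv] @ w)             # from yi (w . xi + b) = 1 on any support vector
    return alpha, w, bias
```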
14. The Non-Separable Case
What if I cannot separate the two classes?
We will not be able to solve the Lagrangian formulation proposed above
Any idea?
15. The Non-Separable Case
Just relax the constraints by p
y permitting some errors
g
Slide 15
Artificial Intelligence Machine Learning
16. The Non-Separable Case
That means that the Lagrangian is rewritten
g g
We change the objective
function to be minimized to
uco o ed o
Therefore, we are maximizing the margin and minimizing the error
C i a constant to be chosen b th user
is t tt b h by the
The dual problem becomes
Subject to and
Slide 16
Artificial Intelligence Machine Learning
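A quick illustration of the role of C, using scikit-learn's SVC on toy data (the library and the data are my own choice, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian clouds: not linearly separable.
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more margin violations (more support vectors);
    # large C penalizes errors heavily, shrinking the margin.
    print(f"C={C:>6}: {clf.support_vectors_.shape[0]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```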
17. Non-Linear SVM
What happens if the decision function is not a linear function of the data?
In our equations, data appears only in the form of dot products xi · xj
Wouldn’t you like to have polynomial, logarithmic, … functions to fit the data?
18. Non-Linear SVM
The kernel trick:
Map the data into a higher-dimensional space
Mercer’s theorem: any continuous, symmetric, positive semi-definite kernel function K(x, y) can be expressed as a dot product in a high-dimensional space
Now, we have a kernel function K(xi, xj) = Φ(xi) · Φ(xj), where Φ is the mapping
An example is checked numerically after this slide
All we have talked about still holds when using the kernel function
The only difference is that now my decision function will be f(x) = Σi αi yi K(xi, x) + b
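A small numeric check of the kernel trick (my own illustration): for the degree-2 polynomial kernel in 2-D, K(x, y) = (x · y + 1)² equals the dot product of the explicit feature maps, without ever constructing them:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (x . y + 1)^2 with x in R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

k_trick = (x @ y + 1.0) ** 2     # computed in the original 2-D space
k_explicit = phi(x) @ phi(y)     # computed in the 6-D feature space
print(k_trick, k_explicit)       # both print 4.0 -> (1*3 + 2*(-1) + 1)^2 = 4
```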
19. Non-Linear SVM
Some typical kernels:
Polynomial: K(x, y) = (x · y + 1)^p
Gaussian radial basis function: K(x, y) = exp(-||x - y||² / (2σ²))
Sigmoid: K(x, y) = tanh(κ (x · y) - δ)
[Figure: a visual example of a polynomial kernel with p = 3]
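The same three kernels written out in numpy (my own transcription of the formulas above):

```python
import numpy as np

def poly_kernel(x, y, p=3):
    return (x @ y + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma**2))

def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
    # Note: the sigmoid kernel satisfies Mercer's condition only for
    # some (kappa, delta) values.
    return np.tanh(kappa * (x @ y) - delta)
```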
20. Some Further Issues
We have to classify data:
Described by nominal attributes and continuous attributes
Probably with missing values
That may have more than two classes
How does SVM deal with them?
SVM is defined over continuous attributes: no problem!
Nominal attributes: map them into a continuous space
Multiple classes: build SVMs that discriminate each pair of classes (see the sketch below)
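The pairwise strategy from this slide, sketched with scikit-learn (my choice of library; the slides name no tool). SVC already trains one binary SVM per pair of classes internally:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes -> 3*(3-1)/2 = 3 pairwise SVMs
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(clf.decision_function(X[:1]).shape)   # (1, 3): one score per class pair
```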
21. Some Further Issues
I’ve seen lots of formulas… but I want to program an SVM builder. How do I get my SVM?
We have already mentioned that there are many methods to solve the quadratic programming problem
Many algorithms have been designed specifically for SVM
One of the most significant: Sequential Minimal Optimization (SMO)
Currently, there are many new algorithms
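The slides name SMO but give no pseudocode; below is a sketch of the simplified SMO variant, in which the second multiplier is chosen at random rather than by Platt's heuristics. All identifiers are my own; this is an illustration, not the full algorithm:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=10):
    """Train a linear soft-margin SVM; returns (alpha, b)."""
    n = X.shape[0]
    K = X @ X.T                       # linear kernel; swap in any Mercer kernel
    alpha, b = np.zeros(n), 0.0
    rng = np.random.default_rng(0)
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]     # prediction error on x_i
            # Only optimize pairs where alpha_i violates the KKT conditions.
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.integers(n - 1)
                j += j >= i                           # random j != i
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # Bounds keeping 0 <= alpha <= C and sum(alpha * y) unchanged.
                if y[i] != y[j]:
                    L, H = max(0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]  # curvature along the pair
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (aj_old - alpha[j])
                # Recompute b so the KKT conditions hold for the updated pair.
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```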
22. Next Class
Association Rules