Machine Learning for Data Mining
Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 21, 2015
Outline
1 History
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow
In the paper “Estimation of dependencies based on empirical data”
Corinna Cortes and Vladimir Vapnik in 1995
They invented its current incarnation, soft margins
At AT&T Labs
BTW Corinna Cortes
Danish computer scientist who is known for her contributions to the field of machine learning.
She is currently the Head of Google Research, New York.
Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on the theoretical foundations of support vector machines.
In addition
Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory.
He died on 22 September 2014, at Losiny Ostrov National Park.
Applications
Partial List
1 Predictive Control
Control of chaotic systems.
2 Inverse Geosounding Problem
Used to understand the internal structure of our planet.
3 Environmental Sciences
Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
Recognizing whether two different species contain similar genes.
5 Facial Expression Classification
6 Texture Classification
7 E-Learning
8 Handwriting Recognition
9 And counting...
Separable Classes
Given
x_i, i = 1, ..., N, a set of samples belonging to two classes ω1, ω2.
Objective
We want to obtain decision functions
g(x) = w^T x + w_0
Such that we can do the following
A linear separation function g(x) = w^T x + w_0
In other words ...
We have the following samples
For x_1, ..., x_m ∈ C_1
For x_1, ..., x_n ∈ C_2
We want the following decision surfaces
w^T x_i + w_0 ≥ 0 for d_i = +1 if x_i ∈ C_1
w^T x_i + w_0 ≤ 0 for d_i = −1 if x_i ∈ C_2
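As a minimal sketch of this decision rule (the weight vector, bias, and test points below are made up for illustration, not taken from the lecture), classification is just the sign of g(x):

```python
import numpy as np

# Hypothetical, already-chosen hyperplane parameters (illustration only).
w = np.array([1.0, -1.0])   # weight vector
w0 = 0.5                    # bias

def g(x):
    """Linear decision function g(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    """Assign +1 (class C1) when g(x) >= 0, otherwise -1 (class C2)."""
    return 1 if g(x) >= 0 else -1

print(classify(np.array([2.0, 0.0])))   # g = 2.5  -> +1
print(classify(np.array([0.0, 2.0])))   # g = -1.5 -> -1
```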
What do we want?
Our goal is to search for a direction w that gives the maximum possible margin
[Figure: two candidate separating directions and their margins.]
Remember
We have the following
[Figure: a separating line, the distance d from the origin to it, and the distance r from a point to its projection on the line.]
A Little of Geometry
Thus
[Figure: a right triangle with legs A and B, hypotenuse C, the distance d from the origin to the line, and the distance r from a point to the line.]
Then
d = |w_0| / √(w_1² + w_2²),   r = |g(x)| / √(w_1² + w_2²)   (1)
First, d = |w_0| / √(w_1² + w_2²)
We can use the following rule in a triangle with a 90° angle
Area = (1/2) C d   (2)
In addition, the area can also be calculated as
Area = (1/2) A B   (3)
Thus
d = AB / C
Remark: Can you get the rest of the values?
What about r = |g(x)| / √(w_1² + w_2²)?
First, remember
g(x_p) = 0 and x = x_p + r (w / ||w||)   (4)
Thus, we have
g(x) = w^T (x_p + r (w / ||w||)) + w_0
     = w^T x_p + w_0 + r (w^T w / ||w||)
     = g(x_p) + r ||w||
Then
r = g(x) / ||w||
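A quick numerical sanity check of r = g(x)/||w|| (the numbers below are made up):

```python
import numpy as np

w = np.array([3.0, 4.0])          # ||w|| = 5
w0 = -5.0

def g(x):
    return w @ x + w0

x = np.array([3.0, 4.0])          # a test point
r = g(x) / np.linalg.norm(w)      # signed distance to the hyperplane, here 4.0

# Project x back onto the hyperplane and verify both claims above.
x_p = x - r * w / np.linalg.norm(w)
print(abs(g(x_p)) < 1e-9)                             # True: x_p lies on g = 0
print(np.isclose(np.linalg.norm(x - x_p), abs(r)))    # True: the distance is |r|
```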
This has the following interpretation
The distance from the projection
[Figure: the point x, its projection x_p onto the hyperplane, and the signed distance r measured along w.]
Now
We know that the straight line that we are looking for looks like
w^T x + w_0 = 0   (5)
What about something like this
w^T x + w_0 = δ   (6)
Clearly
This will be above or below the initial line w^T x + w_0 = 0.
Come back to the hyperplanes
We have then for each border support line a specific bias!!!
[Figure: the two border support lines with the support vectors lying on them.]
Then, normalize by δ
The new margin functions
w'^T x + w_{10} = 1
w'^T x + w_{01} = −1
where w' = w/δ, w_{10} = w_0/δ, and w_{01} = w_0/δ
Now, we come back to the middle separator hyperplane, but with the normalized term
w^T x_i + w_0 ≥ w'^T x + w_{10} for d_i = +1
w^T x_i + w_0 ≤ w'^T x + w_{01} for d_i = −1
Where w_0 is the bias of that central hyperplane!! And w' is the normalized direction of w
Come back to the hyperplanes
The meaning of what I am saying!!!
[Figure: the central separating hyperplane together with the two normalized margin hyperplanes.]
A little about Support Vectors
They are the vectors
x_i such that w^T x_i + w_0 = 1 or w^T x_i + w_0 = −1
Properties
The vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name “Support Vector Machines”.
Now, we can summarize the decision rule for the hyperplane
For the support vectors
g(x_i) = w^T x_i + w_0 = ±1 for d_i = ±1   (7)
Implies
The distance to the support vectors is:
r = g(x_i) / ||w|| = 1/||w|| if d_i = +1, and −1/||w|| if d_i = −1
Therefore ...
We want the optimum value of the margin of separation as
ρ = 1/||w|| + 1/||w|| = 2/||w||   (8)
And the support vectors define the value of ρ
Quadratic Optimization
Then, we have the samples with labels
T = {(x_i, d_i)}_{i=1}^N
Then we can put the decision rule as
d_i(w^T x_i + w_0) ≥ 1,  i = 1, ..., N
Then, we have the optimization problem
The optimization problem
min_w Φ(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) ≥ 1,  i = 1, ..., N
Observations
The cost function Φ(w) is convex.
The constraints are linear with respect to w.
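In practice this quadratic program (or, more commonly, its dual derived below) is handed to a solver. As a rough sketch, scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin problem on separable data; the tiny dataset here is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data with labels d_i in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],     # class +1
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])  # class -1
d = np.array([1, 1, 1, -1, -1, -1])

# A very large C penalizes any margin violation, approximating the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, d)

w, w0 = clf.coef_[0], clf.intercept_[0]
print("w =", w, " w0 =", w0)
print("support vectors:\n", clf.support_vectors_)
```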
Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality
constrained optimization problems.
This is done by converting a constrained problem to an equivalent
unconstrained problem with the help of certain unspecified parameters
known as Lagrange multipliers.
Lagrange Multipliers
The classical problem formulation
min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}   (9)
where
L(x, λ) is the Lagrangian function.
λ is an unspecified positive or negative constant called the Lagrange Multiplier.
Finding an Optimum using Lagrange Multipliers
New problem
min L(x_1, x_2, ..., x_n, λ) = min {f(x_1, x_2, ..., x_n) − λ h_1(x_1, x_2, ..., x_n)}
We want an optimal λ = λ∗
If the minimum of L(x_1, x_2, ..., x_n, λ∗) occurs at (x_1, x_2, ..., x_n)^T = (x_1, x_2, ..., x_n)^{T∗}, and (x_1, x_2, ..., x_n)^{T∗} satisfies h_1(x_1, x_2, ..., x_n) = 0, then (x_1, x_2, ..., x_n)^{T∗} minimizes:
min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0
Trick
It is to find an appropriate value for the Lagrange multiplier λ.
Remember
Think about this
Remember First Law of Newton!!!
Yes!!!
A system in equilibrium does not move
Static Body
Lagrange Multipliers
Definition
Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems
Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium has the following equation:
F_1 + F_2 + ... + F_K = 0   (10)
But functions do not have forces?
Are you sure?
Think about the following
The gradient of a surface.
Gradient to a Surface
After all, a gradient is a measure of the maximal change
For example, the gradient of a function of three variables:
∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z   (11)
where i, j and k are the unit vectors in the directions x, y and z.
Example
We have f(x, y) = x exp{−x² − y²}
[Figure: plot of f(x, y).]
Example
With the gradient at the contours when projecting onto the 2D plane
[Figure: contour plot of f(x, y) with gradient vectors.]
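For this particular function the two gradient components work out to ∂f/∂x = (1 − 2x²) exp{−x² − y²} and ∂f/∂y = −2xy exp{−x² − y²}, which is what the arrows on such a contour plot display.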
Now, Think about this
Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters λ
Thus, we have
F_0 + λ_1 F_1 + ... + λ_K F_K = 0   (12)
where F_0 is the gradient of the principal cost function and the F_i, for i = 1, 2, ..., K, are the gradients of the constraints.
Thus
If we have the following optimization:
min f(x)
s.t. g_1(x) = 0
     g_2(x) = 0
Geometric interpretation in the case of minimization
What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by −1
Here the cost function is f(x, y) = x exp{−x² − y²} and we want to minimize it
[Figure: contours of f(x) with the constraint curves g_1(x) and g_2(x).]
−∇f(x) + λ_1 ∇g_1(x) + λ_2 ∇g_2(x) = 0
Nevertheless, it is equivalent to ∇f(x) − λ_1 ∇g_1(x) − λ_2 ∇g_2(x) = 0
Method
Steps
1 The original problem is rewritten as: minimize L(x, λ) = f(x) − λ h_1(x)
2 Take derivatives of L(x, λ) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier λ.
4 Plug x, written in terms of λ, into the constraint h_1(x) = 0 and solve for λ.
5 Calculate x by using the just-found value for λ.
From step 2
If there are n variables (i.e., x_1, ..., x_n) then you will get n equations with n + 1 unknowns (i.e., the n variables x_i and one Lagrange multiplier λ).
Example
We can apply that to the following problem
min f(x, y) = x² − 8x + y² − 12y + 48
s.t. x + y = 8
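For completeness, a worked solution following the steps above:
L(x, y, λ) = x² − 8x + y² − 12y + 48 − λ(x + y − 8)
∂L/∂x = 2x − 8 − λ = 0 ⇒ x = 4 + λ/2
∂L/∂y = 2y − 12 − λ = 0 ⇒ y = 6 + λ/2
Plugging into x + y = 8 gives 10 + λ = 8, so λ = −2, x = 3, y = 5, and f(3, 5) = −2.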
Then, Rewriting The Optimization Problem
The optimization with equality constraints
min_w Φ(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) − 1 = 0,  i = 1, ..., N
Then, for our problem
Using the Lagrange Multipliers (we will call them α_i)
We obtain the following cost function
J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1]
Observation
Minimize with respect to w and w_0.
Maximize with respect to α because it dominates
− Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1]   (13)
Karush-Kuhn-Tucker Conditions
First, An Inequality Constrained Problem P
min f(x)
s.t. g_1(x) = 0
...
g_N(x) = 0
A really minimal version!!! Hey, it is a patch work!!!
A point x is a local minimum of an equality constrained problem P only if a set of non-negative α_j’s may be found such that:
∇L(x, α) = ∇f(x) − Σ_{i=1}^N α_i ∇g_i(x) = 0
Karush-Kuhn-Tucker Conditions
Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
The corresponding α_i’s are going to be zero after optimization, if a constraint is not active, i.e. d_i(w^T x_i + w_0) − 1 ≠ 0 (Remember Maximization).
Again the Support Vectors
This actually defines the idea of support vectors!!!
Thus
Only the α_i’s with active constraints (Support Vectors) will be different from zero, when d_i(w^T x_i + w_0) − 1 = 0.
A small deviation from the SVM’s for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in R^n, and let f : R^n → R and g_i : R^n → R for i = 1, ..., m. Consider the problem P of minimizing f(x) subject to x ∈ X and g_i(x) ≤ 0, i = 1, ..., m. Let x̄ be a feasible solution, and denote I = {i | g_i(x̄) = 0}. Suppose that f and g_i for i ∈ I are differentiable at x̄ and that the g_i for i ∉ I are continuous at x̄. Furthermore, suppose that the ∇g_i(x̄) for i ∈ I are linearly independent. If x̄ solves problem P locally, there exist scalars u_i for i ∈ I such that
∇f(x̄) + Σ_{i∈I} u_i ∇g_i(x̄) = 0
u_i ≥ 0 for i ∈ I
It is more...
In addition to the above assumptions
If each g_i for i ∉ I is also differentiable at x̄, the previous conditions can be written in the following equivalent form:
∇f(x̄) + Σ_{i=1}^m u_i ∇g_i(x̄) = 0
u_i g_i(x̄) = 0 for i = 1, ..., m
u_i ≥ 0 for i = 1, ..., m
The necessary conditions for optimality
We use the previous theorem on
J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i [d_i(w^T x_i + w_0) − 1]   (14)
Condition 1
∂J(w, w_0, α)/∂w = 0
Condition 2
∂J(w, w_0, α)/∂w_0 = 0
Using the conditions
We have the first condition
∂J(w, w_0, α)/∂w = ∂[(1/2) w^T w]/∂w − ∂[Σ_{i=1}^N α_i (d_i(w^T x_i + w_0) − 1)]/∂w = 0
∂J(w, w_0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^N α_i d_i x_i
Thus
w = Σ_{i=1}^N α_i d_i x_i   (15)
In a similar way ...
We have, by the second optimality condition,
Σ_{i=1}^N α_i d_i = 0
Note
α_i [d_i(w^T x_i + w_0) − 1] = 0
Because the constraint vanishes in the optimal solution, i.e. α_i = 0 or d_i(w^T x_i + w_0) − 1 = 0.
Thus
We need something extra
Our classic trick of transforming a problem into another problem
In this case
We use the Primal-Dual Problem for Lagrangian
Where
We move from a minimization to a maximization!!!
Lagrangian Dual Problem
Consider the following nonlinear programming problem
Primal Problem P
min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     h_i(x) = 0 for i = 1, ..., l
     x ∈ X
Lagrange Dual Problem D
max Θ(u, v)
s.t. u ≥ 0
where Θ(u, v) = inf_x { f(x) + Σ_{i=1}^m u_i g_i(x) + Σ_{i=1}^l v_i h_i(x) | x ∈ X }
What does this mean?
Assume that the equality constraint does not exist
We have then
min f(x)
s.t. g_i(x) ≤ 0 for i = 1, ..., m
     x ∈ X
Now assume that we finish with only one constraint
We have then
min f(x)
s.t. g(x) ≤ 0
     x ∈ X
What does this mean?
First, we have the following figure
[Figure: the set G = {(g(x), f(x)) : x ∈ X} in the (y, z) plane, together with supporting lines of slope −u.]
What does this mean?
Thus, in the y − z plane you have
G = {(y, z) | y = g(x), z = f(x) for some x ∈ X}   (16)
Thus
Given u ≥ 0, we need to minimize f(x) + u g(x) to find θ(u), which is equivalent to solving ∇f(x) + u ∇g(x) = 0
What does this mean?
Thus, in the y − z plane, we have
z + u y = α   (17)
a line with slope −u.
Then, to minimize z + u y = α
We need to move the line z + u y = α parallel to itself, as far down as possible along its negative gradient, while remaining in contact with G.
In other words
Move the line parallel to itself until it supports G
[Figure: the set G with the supporting line of slope −u.]
Note: The set G lies above the line and touches it.
Thus
Then, the problem is to find the slope of the supporting hyperplane for G.
Then its intersection with the z-axis
Gives θ(u)
Again
We can see the θ(u)
[Figure: the supporting line of slope −u and its intercept θ(u) on the z-axis.]
Thus
The dual problem is equivalent to
Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal
Or
Such a hyperplane has slope −u and supports G at (y, z)
[Figure: the optimal supporting line touching G.]
Remark: The optimal solution is u and the optimal dual objective is z.
For more on this, Please!!!
Look at this book
“Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa and C. M. Shetty. Wiley, New York, (2006), page 260.
Example (Lagrange Dual)
Primal
min x_1² + x_2²
s.t. −x_1 − x_2 + 4 ≤ 0
x_1, x_2 ≥ 0
Lagrange Dual
Θ(u) = inf { x_1² + x_2² + u(−x_1 − x_2 + 4) | x_1, x_2 ≥ 0 }
Solution
Derive with respect to x_1 and x_2
We have two cases to take into account: u ≥ 0 and u < 0
The first case is clear
What about when u < 0?
We have that
θ(u) = −(1/2)u² + 4u if u ≥ 0,  and 4u if u < 0   (18)
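Filling in the computation behind (18): for u ≥ 0 the unconstrained minimizers of x_1² + x_2² + u(−x_1 − x_2 + 4) are x_1 = x_2 = u/2, which are feasible, giving Θ(u) = u²/4 + u²/4 + u(4 − u) = −(1/2)u² + 4u. For u < 0 each term x_i² − u x_i is increasing on x_i ≥ 0, so the infimum is at x_1 = x_2 = 0, giving Θ(u) = 4u. Maximizing the dual over u ≥ 0 yields u = 4 with Θ(4) = 8, which matches the primal optimum x_1 = x_2 = 2.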
Duality Theorem
First Property
If the Primal has an optimal solution, so does the Dual.
Thus
In order for w∗ and α∗ to be optimal solutions for the primal and dual problem respectively, it is necessary and sufficient that w∗:
is feasible for the primal problem, and
Φ(w∗) = J(w∗, w_0∗, α∗) = min_w J(w, w_0∗, α∗)
Reformulate our Equations
We have then
J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i − w_0 Σ_{i=1}^N α_i d_i + Σ_{i=1}^N α_i
Now, using our 2nd optimality condition
J(w, w_0, α) = (1/2) w^T w − Σ_{i=1}^N α_i d_i w^T x_i + Σ_{i=1}^N α_i
We have finally, for the 1st Optimality Condition:
First
w^T w = Σ_{i=1}^N α_i d_i w^T x_i = Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i
Second, setting J(w, w_0, α) = Q(α)
Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i
From here, we have the problem
This is the problem that we really solve
Given the training sample {(x_i, d_i)}_{i=1}^N, find the Lagrange multipliers {α_i}_{i=1}^N that maximize the objective function
Q(α) = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j d_i d_j x_j^T x_i
subject to the constraints
Σ_{i=1}^N α_i d_i = 0   (19)
α_i ≥ 0 for i = 1, ..., N   (20)
Note
In the Primal, we were trying to minimize the cost function; for this, it is necessary to maximize with respect to α. That is the reason why we are maximizing Q(α).
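As an illustration only (not part of the lecture), this dual can be handed to a generic constrained optimizer. The sketch below uses SciPy's SLSQP on a made-up 2-D separable dataset; real SVM implementations use specialized QP/SMO solvers instead:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data with labels d_i in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, so Q(alpha) = sum(alpha) - 0.5 alpha^T H alpha.
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(a):
    # Minimizing -Q(alpha) is the same as maximizing Q(alpha).
    return 0.5 * a @ H @ a - a.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ d}]   # Eq. 19: sum_i alpha_i d_i = 0
bounds = [(0.0, None)] * N                                # Eq. 20: alpha_i >= 0

res = minimize(neg_Q, x0=np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
print("alpha =", np.round(alpha, 4))
```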
Solving for α
We can compute w∗ once we get the optimal α_i∗ by using (Eq. 15)
w∗ = Σ_{i=1}^N α_i∗ d_i x_i
In addition, we can compute the optimal bias w_0∗ using the optimal weight w∗
For this, we use the positive margin equation:
g(x^(s)) = (w∗)^T x^(s) + w_0 = 1
corresponding to a positive support vector.
Then
w_0 = 1 − (w∗)^T x^(s) for d^(s) = 1   (21)
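Continuing the sketch above (same made-up data and the alpha just computed), w∗ and w_0 then follow directly from (Eq. 15) and (Eq. 21):

```python
# Recover w* from the optimal alphas (Eq. 15).
w = ((alpha * d)[:, None] * X).sum(axis=0)

# Pick a positive support vector (alpha_i > 0, d_i = +1) and apply Eq. 21.
s = np.argmax((alpha > 1e-6) & (d > 0))
w0 = 1.0 - w @ X[s]

print("w* =", np.round(w, 4), " w0 =", round(w0, 4))
print("margin rho = 2/||w|| =", 2.0 / np.linalg.norm(w))
```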
What do we need?
Until now, we have only a maximal margin algorithm
All this works fine when the classes are separable
Problem: what happens when they are not separable?
What can we do?
Map to a higher Dimensional Space
Assume that there exists a mapping
x ∈ R^l → y ∈ R^k
Then, it is possible to define the following mapping
Define a map to a higher Dimension
Nonlinear transformations
Given a series of nonlinear transformations
{φ_i(x)}_{i=1}^m
from the input space to the feature space,
We can define the decision surface as
Σ_{i=1}^m w_i φ_i(x) + w_0 = 0
This allows us to define
The following vector
φ(x) = (φ_0(x), φ_1(x), ..., φ_m(x))^T
that represents the mapping.
From this mapping
we can define the following kernel function
K : X × X → R
K(x_i, x_j) = φ(x_i)^T φ(x_j)
Example
Assume
x ∈ R² → y = (x_1², √2 x_1 x_2, x_2²)^T
We can show that
y_i^T y_j = (x_i^T x_j)²
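A quick numerical check of this identity (the two points are made up):

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])

lhs = phi(xi) @ phi(xj)      # inner product in the feature space
rhs = (xi @ xj) ** 2         # kernel evaluated directly in the input space
print(lhs, rhs, np.isclose(lhs, rhs))   # both equal 1.0
```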
Example of Kernels
Polynomials
k(x, z) = (x^T z + 1)^q,  q > 0
Radial Basis Functions
k(x, z) = exp( −||x − z||² / σ² )
Hyperbolic Tangents
k(x, z) = tanh(β x^T z + γ)
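A minimal sketch of these three kernels as plain functions (the parameter values q, σ, β, γ below are arbitrary placeholders):

```python
import numpy as np

def poly_kernel(x, z, q=2):
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=-1.0):
    return np.tanh(beta * (x @ z) + gamma)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```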
Now, How to select a Kernel?
We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.
Thus
In general, the Radial Basis Function kernel is a reasonable first choice.
Then
If this fails, we can try the other possible kernels.
Thus, we have something like this
Step 1
Normalize the data.
Step 2
Use cross-validation to adjust the parameters of the selected kernel.
Step 3
Train against the entire dataset.
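A sketch of those three steps with scikit-learn (the dataset, grid values, and fold count below are placeholders to be replaced by your own):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Placeholder data: substitute the real dataset here.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Step 1: normalize. Step 2: cross-validate the RBF kernel parameters.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10, 100],
                                "svc__gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)

# Step 3: with refit=True (the default), best_estimator_ is retrained on the entire dataset.
print(grid.best_params_)
```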
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxvipinkmenon1
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2RajaP95
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and usesDevarapalliHaritha
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxPoojaBan
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSCAESB
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 

Recently uploaded (20)

Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Introduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptxIntroduction to Microprocesso programming and interfacing.pptx
Introduction to Microprocesso programming and interfacing.pptx
 
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2HARMONY IN THE HUMAN BEING - Unit-II UHV-2
HARMONY IN THE HUMAN BEING - Unit-II UHV-2
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
GDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentationGDSC ASEB Gen AI study jams presentation
GDSC ASEB Gen AI study jams presentation
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 

An introduction to support vector machines

  • 1. Machine Learning for Data Mining Introduction to Support Vector Machines Andres Mendez-Vazquez June 21, 2015 1 / 85
  • 2. Outline 1 History 2 Separable Classes Separable Classes Hyperplanes 3 Support Vectors Support Vectors Quadratic Optimization Lagrange Multipliers Method Karush-Kuhn-Tucker Conditions Primal-Dual Problem for Lagrangian Properties 4 Kernel Kernel Idea Higher Dimensional Space Examples Now, How to select a Kernel? 2 / 85
  • 3. History Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963 At the Institute of Control Sciences, Moscow On the paper “Estimation of dependencies based on empirical data” Corinna Cortes and Vladimir Vapnik in 1995 They Invented their Current Incarnation - Soft Margins At the AT&T Labs BTW Corinna Cortes Danish computer scientist who is known for her contributions to the field of machine learning. She is currently the Head of Google Research, New York. Cortes is a recipient of the Paris Kanellakis Theory and Practice Award (ACM) for her work on theoretical foundations of support vector machines. 3 / 85
  • 10. In addition Alexey Yakovlevich Chervonenkis He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the "fundamental theory of learning", an important part of computational learning theory. He died on September 22nd, 2014, at Losiny Ostrov National Park. 4 / 85
  • 12. Applications Partial List 1 Predictive Control Control of chaotic systems. 2 Inverse Geosounding Problem It is used to understand the internal structure of our planet. 3 Environmental Sciences Spatio-temporal environmental data analysis and modeling. 4 Protein Fold and Remote Homology Detection Recognizing whether two different species contain similar genes. 5 Facial expression classification 6 Texture Classification 7 E-Learning 8 Handwritten Recognition 9 And counting... 5 / 85
  • 22. Separable Classes Given xi, i = 1, · · · , N A set of samples belonging to two classes ω1, ω2. Objective We want to obtain decision functions g(x) = wT x + w0 7 / 85
  • 24. Such that we can do the following A linear separation function g(x) = wT x + w0 8 / 85
  • 26. In other words ... We have the following samples For x1, · · · , xm ∈ C1 For x1, · · · , xn ∈ C2 We want the following decision surfaces wT xi + w0 ≥ 0 for di = +1 if xi ∈ C1 wT xi + w0 ≤ 0 for di = −1 if xi ∈ C2 10 / 85
  • 30. What do we want? Our goal is to search for a direction w that gives the maximum possible margin (figure: two candidate directions and the margins they produce) 11 / 85
  • 31. Remember We have the following geometry (figure: a point, its projection onto the hyperplane, the distance r to the hyperplane, and the distance d from the origin 0) 12 / 85
  • 32. A Little of Geometry Thus (figure: a right triangle with legs A, B and hypotenuse C, showing the distances r and d) Then d = |w0| / √(w1² + w2²), r = |g(x)| / √(w1² + w2²) (1) 13 / 85
  • 34. First d = |w0| / √(w1² + w2²) We can use the following rule in a triangle with a 90° angle: Area = (1/2) C d (2) In addition, the area can also be calculated as Area = (1/2) A B (3) Thus d = AB / C Remark: Can you get the rest of the values? 14 / 85
  • 37. What about r = |g(x)| / √(w1² + w2²)? First, remember g(xp) = 0 and x = xp + r (w / ||w||) (4) Thus, we have g(x) = wT (xp + r w/||w||) + w0 = wT xp + w0 + r (wT w)/||w|| = g(xp) + r ||w|| Then r = g(x) / ||w|| 15 / 85
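A minimal numerical sketch of these two formulas, the decision function g(x) = wT x + w0 and the signed distance r = g(x)/||w||. The hyperplane parameters below are made-up illustrative values, not taken from the slides.

```python
import numpy as np

# Illustrative (assumed) hyperplane parameters: w^T x + w_0 = 0 in R^2.
w = np.array([2.0, 1.0])
w0 = -4.0

def g(x):
    """Linear decision function g(x) = w^T x + w_0."""
    return w @ x + w0

x = np.array([3.0, 2.0])
label = 1 if g(x) >= 0 else -1        # decision rule: the sign of g(x)
r = g(x) / np.linalg.norm(w)          # signed distance from x to the hyperplane

print(g(x), label, r)                 # 4.0, 1, 4/sqrt(5) ≈ 1.789
```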
  • 42. This has the following interpretation (figure: the distance from a point to its projection onto the hyperplane) 16 / 85
  • 43. Now We know that the straight line that we are looking for looks like wT x + w0 = 0 (5) What about something like this wT x + w0 = δ (6) Clearly This will be above or below the initial line wT x + w0 = 0. 17 / 85
  • 46. Come back to the hyperplanes We have then, for each border support line, a specific bias!!! (figure: the two margin hyperplanes and their Support Vectors) 18 / 85
  • 47. Then, normalize by δ The new margin functions w'T x + w'10 = 1 and w'T x + w'01 = −1, where w' = w/δ, w'10 = w0/δ, and w'01 = w0/δ Now, we come back to the middle separator hyperplane, but with the normalized terms wT xi + w0 ≥ w'T x + w'10 for di = +1 and wT xi + w0 ≤ w'T x + w'01 for di = −1 Where w0 is the bias of that central hyperplane!! And w' is the normalized direction of w 19 / 85
  • 53. Come back to the hyperplanes The meaning of what I am saying!!! 20 / 85
  • 55. A little about Support Vectors They are the vectors xi such that wT xi + w0 = 1 or wT xi + w0 = −1 Properties The vectors nearest to the decision surface and the most difficult to classify. Because of that, we have the name “Support Vector Machines”. 22 / 85
  • 58. Now, we can summarize the decision rule for the hyperplane For the support vectors g(xi) = wT xi + w0 = −(+)1 for di = −(+)1 (7) Implies The distance to the support vectors is: r = g(xi)/||w|| = 1/||w|| if di = +1, and −1/||w|| if di = −1 23 / 85
  • 60. Therefore ... We want the optimum value of the margin of separation as ρ = 1/||w|| + 1/||w|| = 2/||w|| (8) And the support vectors define the value of ρ 24 / 85
  • 63. Quadratic Optimization Then, we have the samples with labels T = {(xi, di)}, i = 1, ..., N Then we can put the decision rule as di(wT xi + w0) ≥ 1, i = 1, · · · , N 26 / 85
  • 65. Then, we have the optimization problem The optimization problem min_w Φ(w) = (1/2) wT w s.t. di(wT xi + w0) ≥ 1, i = 1, · · · , N Observations The cost function Φ(w) is convex. The constraints are linear with respect to w. 27 / 85
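A small sketch of how this primal problem could be solved numerically with a general-purpose solver. The toy data set and the use of SciPy's SLSQP method are assumptions for illustration, not something taken from the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],     # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])    # class -1
d = np.array([1, 1, 1, -1, -1, -1], dtype=float)

def objective(v):            # v = [w_1, w_2, w_0]; minimize (1/2) w^T w
    w = v[:2]
    return 0.5 * w @ w

# One inequality constraint d_i (w^T x_i + w_0) - 1 >= 0 per sample.
constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: d[i] * (v[:2] @ X[i] + v[2]) - 1.0}
               for i in range(len(d))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print(w, w0, 2.0 / np.linalg.norm(w))   # weights, bias, and margin 2/||w||
```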
  • 69. Lagrange Multipliers The method of Lagrange multipliers Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems. This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers. 29 / 85
  • 71. Lagrange Multipliers The classical problem formulation min f (x1, x2, ..., xn) s.t h1 (x1, x2, ..., xn) = 0 It can be converted into min L (x1, x2, ..., xn, λ) = min {f (x1, x2, ..., xn) − λh1 (x1, x2, ..., xn)} (9) where L(x, λ) is the Lagrangian function. λ is an unspecified positive or negative constant called the Lagrange Multiplier. 30 / 85
  • 74. Finding an Optimum using Lagrange Multipliers New problem min L(x1, x2, ..., xn, λ) = min {f(x1, x2, ..., xn) − λ h1(x1, x2, ..., xn)} We want an optimal λ = λ∗ If the minimum of L(x1, x2, ..., xn, λ∗) occurs at (x1, x2, ..., xn) = (x1∗, x2∗, ..., xn∗) and (x1∗, x2∗, ..., xn∗) satisfies h1(x1∗, x2∗, ..., xn∗) = 0, then (x1∗, x2∗, ..., xn∗) minimizes: min f(x1, x2, ..., xn) s.t. h1(x1, x2, ..., xn) = 0 Trick It is to find the appropriate value for the Lagrange multiplier λ. 31 / 85
  • 78. Remember Think about this Remember the First Law of Newton!!! Yes!!! A system in equilibrium does not move (figure: a static body) 32 / 85
  • 79. Lagrange Multipliers Definition Gives a set of necessary conditions to identify optimal points of equality constrained optimization problems 33 / 85
  • 80. Lagrange was a Physicist He was thinking of the following formula A system in equilibrium has the following equation: F1 + F2 + ... + FK = 0 (10) But functions do not have forces? Are you sure? Think about the following The Gradient of a surface. 34 / 85
  • 83. Gradient to a Surface After all, a gradient is a measure of the maximal change For example, the gradient of a function of three variables: ∇f(x) = i ∂f(x)/∂x + j ∂f(x)/∂y + k ∂f(x)/∂z (11) where i, j and k are unit vectors in the directions x, y and z. 35 / 85
  • 84. Example We have f(x, y) = x exp{−x² − y²} 36 / 85
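A quick sketch that checks the gradient of this example function against a finite-difference approximation; the evaluation point and step size are arbitrary choices for illustration.

```python
import numpy as np

def f(x, y):
    # The slide's example function f(x, y) = x * exp(-x^2 - y^2).
    return x * np.exp(-x**2 - y**2)

def grad_f(x, y):
    # Analytic gradient: (1 - 2x^2) e^{-x^2 - y^2} and -2xy e^{-x^2 - y^2}.
    e = np.exp(-x**2 - y**2)
    return np.array([(1 - 2 * x**2) * e, -2 * x * y * e])

x, y, h = 0.5, -0.3, 1e-6
numeric = np.array([(f(x + h, y) - f(x - h, y)) / (2 * h),
                    (f(x, y + h) - f(x, y - h)) / (2 * h)])
print(grad_f(x, y), numeric)   # the two gradients should agree to ~1e-9
```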
  • 85. Example With the gradient at the contours when projecting onto the 2D plane 37 / 85
  • 86. Now, Think about this Yes, we can use the gradient However, we need to do some scaling of the forces by using parameters λ Thus, we have F0 + λ1 F1 + ... + λK FK = 0 (12) where F0 is the gradient of the principal cost function and the Fi are the gradients of the constraints, for i = 1, 2, ..., K. 38 / 85
  • 88. Thus If we have the following optimization: min f(x) s.t. g1(x) = 0, g2(x) = 0 39 / 85
  • 89. Geometric interpretation in the case of minimization What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by −1 Here the cost function is f(x, y) = x exp{−x² − y²} and we want to minimize it (figure: contours of f(x) with the constraints g1(x) and g2(x)) −∇f(x) + λ1 ∇g1(x) + λ2 ∇g2(x) = 0 Nevertheless, it is equivalent to ∇f(x) − λ1 ∇g1(x) − λ2 ∇g2(x) = 0 39 / 85
  • 92. Method Steps 1 Original problem is rewritten as: minimize L(x, λ) = f(x) − λ h1(x) 2 Take derivatives of L(x, λ) with respect to xi and set them equal to zero. 3 Express all xi in terms of the Lagrange multiplier λ. 4 Plug x in terms of λ into the constraint h1(x) = 0 and solve for λ. 5 Calculate x by using the just found value for λ. From step 2 If there are n variables (i.e., x1, · · · , xn) then you will get n equations with n + 1 unknowns (i.e., n variables xi and one Lagrange multiplier λ). 41 / 85
  • 94. Example We can apply that to the following problem min f(x, y) = x² − 8x + y² − 12y + 48 s.t. x + y = 8 42 / 85
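A sketch of the method's steps applied to this example with a symbolic solver (SymPy is an assumed tool here, not one the slides use): the stationarity equations and the constraint give x = 3, y = 5, λ = −2 and an optimal value of −2.

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = x**2 - 8*x + y**2 - 12*y + 48
h = x + y - 8                                   # constraint h(x, y) = 0

L = f - lam * h                                 # Lagrangian L = f - lambda * h
stationarity = [sp.diff(L, v) for v in (x, y)]  # dL/dx = 0 and dL/dy = 0
sol = sp.solve(stationarity + [h], (x, y, lam), dict=True)[0]

print(sol, f.subs(sol))                         # {x: 3, y: 5, lam: -2} and value -2
```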
  • 95. Then, Rewriting The Optimization Problem The optimization with equality constraints min_w Φ(w) = (1/2) wT w s.t. di(wT xi + w0) − 1 = 0, i = 1, · · · , N 43 / 85
  • 96. Then, for our problem Using the Lagrange Multipliers (We will call them αi) We obtain the following cost function J(w, w0, α) = (1/2) wT w − Σ_{i=1}^N αi [di(wT xi + w0) − 1] Observation Minimize with respect to w and w0. Maximize with respect to α because it dominates −Σ_{i=1}^N αi [di(wT xi + w0) − 1]. (13) 44 / 85
  • 101. Karush-Kuhn-Tucker Conditions First An Inequality Constrained Problem P min f(x) s.t. g1(x) = 0, ..., gN(x) = 0 A really minimal version!!! Hey, it is a patch work!!! A point x is a local minimum of an equality constrained problem P only if a set of non-negative αj’s may be found such that: ∇L(x, α) = ∇f(x) − Σ_{i=1}^N αi ∇gi(x) = 0 46 / 85
  • 103. Karush-Kuhn-Tucker Conditions Important Think about this: each constraint corresponds to a sample (from either class), thus The corresponding αi’s are going to be zero after optimization if a constraint is not active, i.e. di(wT xi + w0) − 1 > 0 (Remember Maximization). Again the Support Vectors This actually defines the idea of support vectors!!! Thus Only the αi’s with active constraints (Support Vectors) will be different from zero, those with di(wT xi + w0) − 1 = 0. 47 / 85
  • 106. A small deviation from the SVM’s for the sake of Vox Populi Theorem (Karush-Kuhn-Tucker Necessary Conditions) Let X be a non-empty open set in Rn, and let f : Rn → R and gi : Rn → R for i = 1, ..., m. Consider the problem P to minimize f(x) subject to x ∈ X and gi(x) ≤ 0, i = 1, ..., m. Let x be a feasible solution, and denote I = {i | gi(x) = 0}. Suppose that f and gi for i ∈ I are differentiable at x and that the gi for i ∉ I are continuous at x. Furthermore, suppose that the ∇gi(x) for i ∈ I are linearly independent. If x solves problem P locally, there exist scalars ui for i ∈ I such that ∇f(x) + Σ_{i∈I} ui ∇gi(x) = 0 and ui ≥ 0 for i ∈ I 48 / 85
  • 107. It is more... In addition to the above assumptions If gi for each i ∉ I is also differentiable at x, the previous conditions can be written in the following equivalent form: ∇f(x) + Σ_{i=1}^m ui ∇gi(x) = 0, ui gi(x) = 0 for i = 1, ..., m, and ui ≥ 0 for i = 1, ..., m 49 / 85
  • 108. The necessary conditions for optimality We use the previous theorem on J(w, w0, α) = (1/2) wT w − Σ_{i=1}^N αi [di(wT xi + w0) − 1] (14) Condition 1 ∂J(w, w0, α)/∂w = 0 Condition 2 ∂J(w, w0, α)/∂w0 = 0 50 / 85
  • 111. Using the conditions We have the first condition ∂J(w, w0, α)/∂w = ∂[(1/2) wT w]/∂w − ∂[Σ_{i=1}^N αi (di(wT xi + w0) − 1)]/∂w = 0, so ∂J(w, w0, α)/∂w = (1/2)(w + w) − Σ_{i=1}^N αi di xi Thus w = Σ_{i=1}^N αi di xi (15) 51 / 85
  • 114. In a similar way ... We have by the second optimality condition Σ_{i=1}^N αi di = 0 Note αi [di(wT xi + w0) − 1] = 0 because the constraint vanishes at the optimal solution, i.e. αi = 0 or di(wT xi + w0) − 1 = 0. 52 / 85
  • 116. Thus We need something extra Our classic trick of transforming a problem into another problem In this case We use the Primal-Dual Problem for Lagrangian Where We move from a minimization to a maximization!!! 53 / 85
  • 120. Lagrangian Dual Problem Consider the following nonlinear programming problem Primal Problem P min f(x) s.t. gi(x) ≤ 0 for i = 1, ..., m, hi(x) = 0 for i = 1, ..., l, x ∈ X Lagrange Dual Problem D max Θ(u, v) s.t. u ≥ 0 where Θ(u, v) = inf_x { f(x) + Σ_{i=1}^m ui gi(x) + Σ_{i=1}^l vi hi(x) | x ∈ X } 55 / 85
  • 122. What does this mean? Assume that the equality constraint does not exist We have then min f (x) s.t gi (x) ≤ 0 for i = 1, ..., m x ∈ X Now assume that we finish with only one constraint We have then min f (x) s.t g (x) ≤ 0 x ∈ X 56 / 85
  • 124. What does this mean? First, we have the following figure (the set G in the (y, z) plane, with points A and B and supporting lines of slope −u) 57 / 85
  • 125. What does this mean? Thus in the y − z plane you have G = {(y, z) | y = g(x), z = f(x) for some x ∈ X} (16) Thus Given u ≥ 0, we need to minimize f(x) + u g(x) over x ∈ X to find θ(u), equivalent to ∇f(x) + u ∇g(x) = 0 58 / 85
  • 127. What does this mean? Thus in the y − z plane, we have z + u y = α (17), a line with slope −u. Then, to minimize z + u y = α We need to move the line z + u y = α parallel to itself as far down as possible, along its negative gradient, while remaining in contact with G. 59 / 85
  • 129. In other words Move the line parallel to itself until it supports G (figure: the set G resting on the supporting line) Note The set G lies above the line and touches it. 60 / 85
  • 130. Thus Then, the problem is to find the slope of the supporting hyperplane for G. Its intersection with the z-axis gives θ(u) 61 / 85
  • 132. Again We can see θ(u) in the figure (the intercept of the supporting line of slope −u on the z-axis) 62 / 85
  • 133. Thus The dual problem is equivalent to finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal 63 / 85
  • 134. Or Such a hyperplane has slope −u and supports G at (y, z) (figure) Remark: The optimal solution is u and the optimal dual objective is z. 64 / 85
  • 136. For more on this Please!!! Look at this book From “Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa, and C. M. Shetty. Wiley, New York, (2006) At Page 260. 65 / 85
  • 137. Example (Lagrange Dual) Primal min x1² + x2² s.t. −x1 − x2 + 4 ≤ 0, x1, x2 ≥ 0 Lagrange Dual Θ(u) = inf {x1² + x2² + u(−x1 − x2 + 4) | x1, x2 ≥ 0} 66 / 85
  • 139. Solution Differentiate with respect to x1 and x2 We have two cases to take into account: u ≥ 0 and u < 0 The first case is clear What about when u < 0? We have that θ(u) = −(1/2)u² + 4u if u ≥ 0, and θ(u) = 4u if u < 0 (18) 67 / 85
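A brief numerical sketch of this example using SciPy (an assumed tool, not from the slides): maximizing θ(u) over u ≥ 0 gives u = 4 and θ(4) = 8, which matches the primal optimum x1 = x2 = 2 with objective 8, so there is no duality gap here.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def theta(u):
    # The dual function derived on the slide.
    return -0.5 * u**2 + 4 * u if u >= 0 else 4 * u

dual = minimize_scalar(lambda u: -theta(u), bounds=(0, 10), method='bounded')
u_star, theta_star = dual.x, -dual.fun

# Primal: min x1^2 + x2^2  s.t.  x1 + x2 >= 4,  x1, x2 >= 0.
primal = minimize(lambda x: x[0]**2 + x[1]**2, x0=[1.0, 1.0], method='SLSQP',
                  bounds=[(0, None), (0, None)],
                  constraints=[{'type': 'ineq', 'fun': lambda x: x[0] + x[1] - 4}])

print(u_star, theta_star, primal.x, primal.fun)   # ≈ 4, 8, [2, 2], 8
```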
  • 143. Duality Theorem First Property If the primal has an optimal solution, so does the dual. Thus In order for w∗ and α∗ to be optimal solutions for the primal and dual problems respectively, it is necessary and sufficient that w∗ is feasible for the primal problem and Φ(w∗) = J(w∗, w0∗, α∗) = min_w J(w, w0∗, α∗) 69 / 85
  • 145. Reformulate our Equations We have then J(w, w0, α) = (1/2) wT w − Σ_{i=1}^N αi di wT xi − w0 Σ_{i=1}^N αi di + Σ_{i=1}^N αi Now, using our 2nd optimality condition, J(w, w0, α) = (1/2) wT w − Σ_{i=1}^N αi di wT xi + Σ_{i=1}^N αi 70 / 85
  • 147. We have finally for the 1st Optimality Condition: First wT w = Σ_{i=1}^N αi di wT xi = Σ_{i=1}^N Σ_{j=1}^N αi αj di dj xjT xi Second, setting J(w, w0, α) = Q(α), Q(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N αi αj di dj xjT xi 71 / 85
  • 149. From here, we have the problem This is the problem that we really solve Given the training sample {(xi, di)}, i = 1, ..., N, find the Lagrange multipliers {αi} that maximize the objective function Q(α) = Σ_{i=1}^N αi − (1/2) Σ_{i=1}^N Σ_{j=1}^N αi αj di dj xjT xi subject to the constraints Σ_{i=1}^N αi di = 0 (19) and αi ≥ 0 for i = 1, · · · , N (20) Note In the primal, we were trying to minimize the cost function, and for this it is necessary to maximize over α. That is the reason why we are maximizing Q(α). 72 / 85
  • 151. Solving for α We can compute w∗ once we get the optimal αi∗ by using (Eq. 15) w∗ = Σ_{i=1}^N αi∗ di xi In addition, we can compute the optimal bias w0∗ using the optimal weight w∗ For this, we use the positive margin equation g(x(s)) = w∗T x(s) + w0 = 1, corresponding to a positive support vector. Then w0∗ = 1 − (w∗)T x(s) for d(s) = 1 (21) 73 / 85
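A sketch of the whole pipeline on made-up toy data (the data, the SciPy SLSQP solver, and the support-vector selection heuristic are assumptions for illustration): maximize the dual objective subject to (19) and (20), then recover w∗ from Eq. 15 and w0 from Eq. 21.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
d = np.array([1, 1, 1, -1, -1, -1], dtype=float)

G = (d[:, None] * X) @ (d[:, None] * X).T          # G_ij = d_i d_j x_i^T x_j

def neg_Q(alpha):                                  # maximize Q(alpha) <=> minimize -Q(alpha)
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(neg_Q, x0=np.zeros(len(d)), method='SLSQP',
               bounds=[(0, None)] * len(d),                           # alpha_i >= 0   (Eq. 20)
               constraints=[{'type': 'eq', 'fun': lambda a: a @ d}])  # sum alpha_i d_i = 0  (Eq. 19)

alpha = res.x
w = (alpha * d) @ X                                # Eq. 15: w* = sum_i alpha_i d_i x_i
s = np.argmax(alpha * (d > 0))                     # index of a positive-class support vector
w0 = 1.0 - w @ X[s]                                # Eq. 21: w0 = 1 - (w*)^T x^(s)
print(alpha.round(3), w, w0)
```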
  • 152. Solving for α We can compute w∗ once we get the optimal α∗ i by using (Eq. 15) w∗ = N i=1 α∗ i dixi In addition, we can compute the optimal bias w∗ 0 using the optimal weight, w∗ For this, we use the positive margin equation: g(xi) = wT x(s) + w0 = 1 corresponding to a positive support vector. Then w0 = 1 − (w∗ )T x(s) for d(s) = 1 (21) 73 / 85
  • 153. Solving for α We can compute w∗ once we get the optimal α∗ i by using (Eq. 15) w∗ = N i=1 α∗ i dixi In addition, we can compute the optimal bias w∗ 0 using the optimal weight, w∗ For this, we use the positive margin equation: g(xi) = wT x(s) + w0 = 1 corresponding to a positive support vector. Then w0 = 1 − (w∗ )T x(s) for d(s) = 1 (21) 73 / 85
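Continuing the hypothetical sketch above (the arrays X, d, and the solution alpha are assumed from it), Eq. 15 and Eq. 21 translate directly into:

```python
# Recover w* and w0* from the optimal alpha of the previous sketch.
w = ((alpha * d)[:, None] * X).sum(axis=0)     # w* = sum_i alpha_i* d_i x_i   (Eq. 15)

# Pick a positive support vector: alpha_s > 0 and d_s = +1, then apply Eq. 21.
s = int(np.argmax((alpha > 1e-8) & (d > 0)))
w0 = 1.0 - w @ X[s]                            # w0* = 1 - (w*)^T x^(s)

print("w* =", w, " w0* =", w0)
```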
• 155. What do we need?
Until now, we only have a maximal-margin algorithm. All of this works fine when the classes are separable.
Problem: what happens when they are not separable? What can we do?
• 159. Map to a Higher Dimensional Space
Assume that there exists a mapping x ∈ R^l → y ∈ R^k. Then, it is possible to define the following mapping.
• 161. Define a Map to a Higher Dimension
Nonlinear transformations: given a set of nonlinear transformations {φ_i(x)}_{i=1}^{m} from the input space to the feature space, we can define the decision surface as
Σ_{i=1}^{m} w_i φ_i(x) + w_0 = 0.
• 163. This allows us to define
The vector φ(x) = (φ_0(x), φ_1(x), ..., φ_m(x))^T, which represents the mapping.
From this mapping we can define the following kernel function K : X × X → R:
K(x_i, x_j) = φ(x_i)^T φ(x_j)
• 166. Example
Assume x ∈ R² → y = (x_1², √2 x_1 x_2, x_2²)^T
We can show that y_i^T y_j = (x_i^T x_j)²
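To make the identity concrete, here is a small numerical check (not from the slides; the example vectors are hypothetical) that the explicit map above satisfies φ(x_i)^T φ(x_j) = (x_i^T x_j)², i.e., the kernel can be evaluated in the input space without forming φ explicitly.

```python
import numpy as np

def phi(x):
    # Explicit feature map for the homogeneous second-degree polynomial kernel.
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

xi = np.array([1.0, 2.0])    # hypothetical input vectors
xj = np.array([3.0, -1.0])

lhs = phi(xi) @ phi(xj)      # inner product in the 3-D feature space
rhs = (xi @ xj) ** 2         # kernel evaluated directly in the input space

print(lhs, rhs)              # both equal 1.0 here, since (1*3 + 2*(-1))^2 = 1
```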
• 168. Examples of Kernels
Polynomials: k(x, z) = (x^T z + 1)^q,  q > 0
Radial Basis Functions: k(x, z) = exp(−||x − z||² / σ²)
Hyperbolic Tangents: k(x, z) = tanh(β x^T z + γ)
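As an illustration (again not part of the slides), these three kernels translate into one-liners; the parameter names q, sigma, beta, and gamma mirror the formulas above and their default values are arbitrary.

```python
import numpy as np

def polynomial_kernel(x, z, q=3):
    # k(x, z) = (x^T z + 1)^q, q > 0
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / sigma^2)
    return np.exp(-np.sum((x - z) ** 2) / sigma**2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    # k(x, z) = tanh(beta * x^T z + gamma)
    return np.tanh(beta * (x @ z) + gamma)
```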
• 172. Now, How to Select a Kernel?
We have a problem: selecting a specific kernel and its parameters is usually done in a trial-and-error manner.
Thus, in general, the Radial Basis Function kernel is a reasonable first choice; if it fails, we can try the other kernels.
• 175. Thus, we have something like this
Step 1: Normalize the data.
Step 2: Use cross-validation to adjust the parameters of the selected kernel.
Step 3: Train against the entire dataset.
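One way to realize these three steps is sketched below using scikit-learn; the library, the synthetic data, and the parameter grid are all assumptions for illustration, not part of the original slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical dataset standing in for the training sample {(x_i, d_i)}.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: normalization, folded into the pipeline so it is refit on every CV split.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step 2: cross-validation over the RBF kernel parameters C and gamma.
grid = GridSearchCV(pipe,
                    param_grid={"svc__C": [0.1, 1, 10, 100],
                                "svc__gamma": [0.01, 0.1, 1]},
                    cv=5)

# Step 3: GridSearchCV refits the best configuration on the entire dataset.
grid.fit(X, y)
print(grid.best_params_)
```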