These are my slides about SVMs. They are based on the chapter on SVMs in Simon Haykin's book Neural Networks, 2nd edition. They have been augmented using the idea that Lagrange was a physicist, which gave him the first intuition about dual problems.
1. Machine Learning for Data Mining
Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 21, 2015
2. Outline
1 History
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
3. History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow.
In the paper “Estimation of dependencies based on empirical data.”
Corinna Cortes and Vladimir Vapnik in 1995
They invented the current incarnation, soft margins, at AT&T Labs.
By the way, Corinna Cortes
A Danish computer scientist known for her contributions to the field of machine learning.
She is currently the Head of Google Research, New York.
Cortes is a recipient of the ACM Paris Kanellakis Theory and Practice Award for her work on the theoretical foundations of support vector machines.
10. In addition
Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the “fundamental theory of learning,” an important part of computational learning theory.
He died on 22 September 2014 at Losiny Ostrov National Park.
12. Applications
Partial List
1 Predictive Control
Control of chaotic systems.
2 Inverse Geosounding Problem
Used to understand the internal structure of our planet.
3 Environmental Sciences
Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
Recognizing whether two different species contain similar genes.
5 Facial Expression Classification
6 Texture Classification
7 E-Learning
8 Handwriting Recognition
9 And counting...
22. Separable Classes
Given
x_i, i = 1, \cdots, N
A set of samples belonging to two classes \omega_1, \omega_2.
Objective
We want to obtain the decision function
g(x) = w^T x + w_0
24. Such that we can do the following
A linear separation function g(x) = w^T x + w_0
[Figure: a linear decision boundary separating the two classes]
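To make the decision rule concrete, here is a minimal sketch in Python (NumPy assumed; the weights are made-up numbers for illustration, not values from the slides): it evaluates g(x) = w^T x + w_0 and assigns a class by the sign.

```python
import numpy as np

# Hypothetical, hand-picked parameters of a separating hyperplane.
w = np.array([2.0, -1.0])   # normal vector w
w0 = -1.0                   # bias w_0

def g(x):
    """Linear decision function g(x) = w^T x + w_0."""
    return w @ x + w0

# Points on either side of the hyperplane get opposite signs.
x = np.array([3.0, 1.0])
label = 1 if g(x) >= 0 else -1   # class omega_1 if +1, omega_2 if -1
print(g(x), label)               # 4.0  1
```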
26. In other words ...
We have the following samples
x_1, \cdots, x_m \in C_1
x_1, \cdots, x_n \in C_2
We want the following decision surface
w^T x_i + w_0 \geq 0 for d_i = +1 if x_i \in C_1
w^T x_i + w_0 \leq 0 for d_i = -1 if x_i \in C_2
30. What do we want?
Our goal is to search for a direction w that gives the maximum possible margin.
[Figure: two candidate separating directions and their margins]
32. A Little of Geometry
[Figure: a right triangle with legs A and B and hypotenuse C on the separating line; d is the distance from the origin to the line and r the distance from a point x to the line]
Then
d = |w_0| / \sqrt{w_1^2 + w_2^2}, r = |g(x)| / \sqrt{w_1^2 + w_2^2} (1)
34. First, d = |w_0| / \sqrt{w_1^2 + w_2^2}
We can use the following rule in a triangle with a 90° angle:
Area = (1/2) C d (2)
In addition, the area can also be calculated as
Area = (1/2) A B (3)
Thus
d = AB / C
Remark: Can you get the rest of the values?
37. What about r = |g(x)| / \sqrt{w_1^2 + w_2^2}?
First, remember
g(x_p) = 0 and x = x_p + r (w / \|w\|) (4)
Thus, we have
g(x) = w^T (x_p + r (w / \|w\|)) + w_0 = w^T x_p + w_0 + r (w^T w / \|w\|) = g(x_p) + r \|w\|
Then
r = g(x) / \|w\|
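A quick sanity check with made-up numbers: take w = (3, 4)^T and w_0 = -5, so \|w\| = \sqrt{3^2 + 4^2} = 5. For x = (2, 3)^T we get g(x) = 3 · 2 + 4 · 3 - 5 = 13, hence r = g(x) / \|w\| = 13/5. The sign of g(x) tells us on which side of the hyperplane x lies.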
42. This has the following interpretation
The distance r from x to its projection x_p onto the hyperplane
[Figure: x, its projection x_p, and the distance r]
43. Now
We know that the straight line we are looking for has the form
w^T x + w_0 = 0 (5)
What about something like this?
w^T x + w_0 = \delta (6)
Clearly
This will be above or below the initial line w^T x + w_0 = 0.
46. Come back to the hyperplanes
We then have a specific bias for each border support line!
[Figure: the two border support lines and their support vectors]
47. Then, normalize by \delta
The new margin functions
\hat{w}^T x + w_{10} = 1
\hat{w}^T x + w_{01} = -1
where \hat{w} = w / \delta, w_{10} = w_0 / \delta, and w_{01} = w_0 / \delta.
Now, we come back to the middle separator hyperplane, but with the normalized terms
w^T x_i + w_0 \geq \hat{w}^T x + w_{10} for d_i = +1
w^T x_i + w_0 \leq \hat{w}^T x + w_{01} for d_i = -1
where w_0 is the bias of that central hyperplane, and \hat{w} is the normalized direction of w.
53. Come back to the hyperplanes
The meaning of what I am saying!
[Figure: the central hyperplane between the two normalized margin hyperplanes]
55. A little about Support Vectors
They are the vectors
x_i such that w^T x_i + w_0 = 1 or w^T x_i + w_0 = -1
Properties
They are the vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name “Support Vector Machines.”
58. Now, we can summarize the decision rule for the hyperplane
For the support vectors
g(x_i) = w^T x_i + w_0 = \pm 1 for d_i = \pm 1 (7)
This implies
The distance to the support vectors is:
r = g(x_i) / \|w\| = 1 / \|w\| if d_i = +1, and -1 / \|w\| if d_i = -1
60. Therefore ...
We want the optimum value of the margin of separation,
\rho = 1 / \|w\| + 1 / \|w\| = 2 / \|w\| (8)
And the support vectors define the value of \rho.
[Figure: the margin \rho between the two support hyperplanes]
63. Quadratic Optimization
Then, we have the samples with labels
T = \{(x_i, d_i)\}_{i=1}^{N}
Then we can put the decision rule as
d_i(w^T x_i + w_0) \geq 1, i = 1, \cdots, N
65. Then, we have the optimization problem
The optimization problem
\min_w \Phi(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) \geq 1, i = 1, \cdots, N
Observations
The cost function \Phi(w) is convex.
The constraints are linear with respect to w.
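Because the problem is a convex quadratic program, a generic solver can already handle it directly. Below is a minimal sketch (SciPy assumed; the four training points are made up for illustration) that minimizes (1/2) w^T w subject to the margin constraints; the slides that follow solve the same problem through Lagrange multipliers and the dual instead.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables z = (w_1, w_2, w_0); minimize (1/2) w^T w.
def objective(z):
    w = z[:2]
    return 0.5 * w @ w

# SLSQP expects inequality constraints in the form fun(z) >= 0,
# so we encode d_i (w^T x_i + w_0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: d[i] * (X[i] @ z[:2] + z[2]) - 1.0}
    for i in range(len(d))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0, "margin =", 2.0 / np.linalg.norm(w))
```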
69. Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.
This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
71. Lagrange Multipliers
The classical problem formulation
\min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
\min L(x_1, x_2, ..., x_n, \lambda) = \min \{f(x_1, x_2, ..., x_n) - \lambda h_1(x_1, x_2, ..., x_n)\} (9)
where
L(x, \lambda) is the Lagrangian function.
\lambda is an unspecified positive or negative constant called the Lagrange multiplier.
74. Finding an Optimum using Lagrange Multipliers
New problem
\min L(x_1, ..., x_n, \lambda) = \min \{f(x_1, ..., x_n) - \lambda h_1(x_1, ..., x_n)\}
We want an optimal \lambda = \lambda^*
If the minimum of L(x_1, ..., x_n, \lambda^*) occurs at (x_1, ..., x_n)^T = (x_1^*, ..., x_n^*)^T, and (x_1^*, ..., x_n^*)^T satisfies h_1(x_1, ..., x_n) = 0, then (x_1^*, ..., x_n^*)^T minimizes:
\min f(x_1, ..., x_n)
s.t. h_1(x_1, ..., x_n) = 0
Trick
It is to find the appropriate value of the Lagrange multiplier \lambda.
80. Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium satisfies the equation:
F_1 + F_2 + ... + F_K = 0 (10)
But functions do not have forces?
Are you sure?
Think about the following
The gradient of a surface.
83. Gradient to a Surface
After all, a gradient is a measure of the maximal change.
For example, the gradient of a function of three variables:
\nabla f(x) = i \partial f(x)/\partial x + j \partial f(x)/\partial y + k \partial f(x)/\partial z (11)
where i, j and k are unit vectors in the directions x, y and z.
86. Now, Think about this
Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters \lambda.
Thus, we have
F_0 + \lambda_1 F_1 + ... + \lambda_K F_K = 0 (12)
where F_0 is the gradient of the principal cost function and the F_i, for i = 1, 2, ..., K, are the gradients of the constraints.
88. Thus
If we have the following optimization:
\min f(x)
s.t. g_1(x) = 0
g_2(x) = 0
89. Geometric interpretation in the case of minimization
What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by -1.
Here the cost function is f(x, y) = x exp\{-x^2 - y^2\}, and we want to minimize it.
[Figure: level sets of f(x) with the constraint curves g_1(x) and g_2(x)]
-\nabla f(x) + \lambda_1 \nabla g_1(x) + \lambda_2 \nabla g_2(x) = 0
Nevertheless, it is equivalent to \nabla f(x) - \lambda_1 \nabla g_1(x) - \lambda_2 \nabla g_2(x) = 0.
92. Method
Steps
1 The original problem is rewritten as: minimize L(x, \lambda) = f(x) - \lambda h_1(x).
2 Take the derivatives of L(x, \lambda) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier \lambda.
4 Plug x, in terms of \lambda, into the constraint h_1(x) = 0 and solve for \lambda.
5 Calculate x by using the value just found for \lambda.
From step 2
If there are n variables (i.e., x_1, \cdots, x_n), then you will get n equations with n + 1 unknowns (i.e., the n variables x_i and one Lagrange multiplier \lambda).
94. Example
We can apply this to the following problem
\min f(x, y) = x^2 - 8x + y^2 - 12y + 48
s.t. x + y = 8
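Carrying out the steps above on this example (a worked solution added here for completeness): form L(x, y, \lambda) = x^2 - 8x + y^2 - 12y + 48 - \lambda(x + y - 8). Setting the derivatives to zero gives 2x - 8 - \lambda = 0 and 2y - 12 - \lambda = 0, so x = (8 + \lambda)/2 and y = (12 + \lambda)/2. Plugging these into x + y = 8 yields 10 + \lambda = 8, i.e. \lambda = -2, hence x = 3, y = 5, and f(3, 5) = -2.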
95. Then, Rewriting The Optimization Problem
The optimization with equality constraints
\min_w \Phi(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) - 1 = 0, i = 1, \cdots, N
96. Then, for our problem
Using the Lagrange multipliers (we will call them \alpha_i)
We obtain the following cost function
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]
Observation
Minimize with respect to w and w_0.
Maximize with respect to \alpha because it dominates
-\sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]. (13)
101. Karush-Kuhn-Tucker Conditions
First, an Inequality Constrained Problem P
\min f(x)
s.t. g_1(x) \leq 0
\vdots
g_N(x) \leq 0
A really minimal version!!! Hey, it is a patchwork!!!
A point x is a local minimum of the constrained problem P only if a set of non-negative \alpha_j's may be found such that:
\nabla L(x, \alpha) = \nabla f(x) - \sum_{i=1}^{N} \alpha_i \nabla g_i(x) = 0
103. Karush-Kuhn-Tucker Conditions
Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
The corresponding \alpha_i's are going to be zero after optimization if a constraint is not active, i.e. if d_i(w^T x_i + w_0) - 1 > 0 (remember the maximization).
Again the Support Vectors
This actually defines the idea of support vectors!!!
Thus
Only the \alpha_i's with active constraints (the support vectors) will be different from zero, i.e. those with d_i(w^T x_i + w_0) - 1 = 0.
106. A small deviation from the SVM's for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in R^n, and let f : R^n \to R and g_i : R^n \to R for i = 1, ..., m. Consider the problem P to minimize f(x) subject to x \in X and g_i(x) \leq 0, i = 1, ..., m. Let \bar{x} be a feasible solution, and denote I = \{i \,|\, g_i(\bar{x}) = 0\}. Suppose that f and the g_i for i \in I are differentiable at \bar{x} and that the g_i for i \notin I are continuous at \bar{x}. Furthermore, suppose that the \nabla g_i(\bar{x}) for i \in I are linearly independent. If \bar{x} solves problem P locally, there exist scalars u_i for i \in I such that
\nabla f(\bar{x}) + \sum_{i \in I} u_i \nabla g_i(\bar{x}) = 0
u_i \geq 0 for i \in I
107. It is more...
In addition to the above assumptions
If each g_i for i \notin I is also differentiable at \bar{x}, the previous conditions can be written in the following equivalent form:
\nabla f(\bar{x}) + \sum_{i=1}^{m} u_i \nabla g_i(\bar{x}) = 0
u_i g_i(\bar{x}) = 0 for i = 1, ..., m
u_i \geq 0 for i = 1, ..., m
108. The necessary conditions for optimality
We use the previous theorem on
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1] (14)
Condition 1
\partial J(w, w_0, \alpha) / \partial w = 0
Condition 2
\partial J(w, w_0, \alpha) / \partial w_0 = 0
111. Using the conditions
We have, for the first condition,
\partial J(w, w_0, \alpha) / \partial w = \partial [(1/2) w^T w] / \partial w - \partial [\sum_{i=1}^{N} \alpha_i (d_i(w^T x_i + w_0) - 1)] / \partial w = 0
\partial J(w, w_0, \alpha) / \partial w = (1/2)(w + w) - \sum_{i=1}^{N} \alpha_i d_i x_i
Thus
w = \sum_{i=1}^{N} \alpha_i d_i x_i (15)
114. In a similar way ...
We have, by the second optimality condition,
\sum_{i=1}^{N} \alpha_i d_i = 0
Note
\alpha_i [d_i(w^T x_i + w_0) - 1] = 0
because the constraint vanishes at the optimal solution, i.e. \alpha_i = 0 or d_i(w^T x_i + w_0) - 1 = 0.
116. Thus
We need something extra
Our classic trick of transforming a problem into another problem
In this case
We use the Primal-Dual Problem for the Lagrangian
Where
We move from a minimization to a maximization!!!
120. Lagrangian Dual Problem
Consider the following nonlinear programming problem
Primal Problem P
\min f(x)
s.t. g_i(x) \leq 0 for i = 1, ..., m
h_i(x) = 0 for i = 1, ..., l
x \in X
Lagrange Dual Problem D
\max \Theta(u, v)
s.t. u \geq 0
where \Theta(u, v) = \inf_x \{f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{i=1}^{l} v_i h_i(x) \,|\, x \in X\}
122. What does this mean?
Assume that the equality constraints do not exist
We have then
\min f(x)
s.t. g_i(x) \leq 0 for i = 1, ..., m
x \in X
Now assume that we finish with only one constraint
We have then
\min f(x)
s.t. g(x) \leq 0
x \in X
124. What does this mean?
[Figure: the image set G of points (g(x), f(x)) in the (y, z) plane, with supporting lines of slope -u]
125. What does this mean?
Thus, in the (y, z) plane you have
G = \{(y, z) \,|\, y = g(x), z = f(x) for some x \in X\} (16)
Thus
Given u \geq 0, we need to minimize f(x) + u g(x) to find \theta(u), which is equivalent to minimizing z + u y over the points (y, z) \in G.
127. What does this mean?
Thus, in the (y, z) plane we have
z + u y = \alpha (17)
a line with slope -u.
Then, to minimize z + u y = \alpha
We need to move the line z + u y = \alpha parallel to itself, as far down as possible along its negative gradient, while keeping it in contact with G.
129. In other words
Move the line parallel to itself until it supports G.
[Figure: the supporting line of slope -u touching the set G from below]
Note: the set G lies above the line and touches it.
130. Thus
The problem is then to find the slope of the supporting hyperplane for G.
Its intersection with the z-axis
gives \theta(u).
133. Thus
The dual problem is equivalent to
Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.
134. Or
Such a hyperplane has slope -u and supports G at (y, z)
[Figure: the optimal supporting line of the set G]
Remark: the optimal solution is u and the optimal dual objective is z.
136. For more on this, Please!!!
Look at this book
“Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa and C. M. Shetty. Wiley, New York (2006), at page 260.
139. Solution
Differentiate with respect to x_1 and x_2
We have two cases to take into account: u \geq 0 and u < 0.
The first case is clear
What about when u < 0?
We have that
\theta(u) = -(1/2) u^2 + 4u if u \geq 0, and 4u if u < 0 (18)
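The example slides this solution refers to are missing from this extract; the \theta(u) above is consistent with the classic example from Bazaraa and Shetty: minimize x_1^2 + x_2^2 subject to -x_1 - x_2 + 4 \leq 0 with X = \{x : x_1, x_2 \geq 0\}. For u \geq 0 the Lagrangian x_1^2 + x_2^2 + u(-x_1 - x_2 + 4) is minimized at x_1 = x_2 = u/2, giving \theta(u) = -(1/2)u^2 + 4u; for u < 0 the minimum over X is at x_1 = x_2 = 0, giving \theta(u) = 4u. Maximizing \theta(u) over u \geq 0 gives u = 4 with \theta(4) = 8, which matches the primal optimum f(2, 2) = 8.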
143. Duality Theorem
First Property
If the primal has an optimal solution, so does the dual.
Thus
In order for w^* and \alpha^* to be optimal solutions of the primal and dual problems respectively, it is necessary and sufficient that w^*:
is feasible for the primal problem, and
\Phi(w^*) = J(w^*, w_0^*, \alpha^*) = \min_w J(w, w_0^*, \alpha^*)
145. Reformulate our Equations
We have then
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - w_0 \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i
Now, by our 2nd optimality condition,
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i + \sum_{i=1}^{N} \alpha_i
147. We have finally, for the 1st Optimality Condition:
First
w^T w = \sum_{i=1}^{N} \alpha_i d_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i
Second, setting J(w, w_0, \alpha) = Q(\alpha),
Q(\alpha) = \sum_{i=1}^{N} \alpha_i - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i
149. From here, we have the problem
This is the problem that we really solve
Given the training sample \{(x_i, d_i)\}_{i=1}^{N}, find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} that maximize the objective function
Q(\alpha) = \sum_{i=1}^{N} \alpha_i - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i
subject to the constraints
\sum_{i=1}^{N} \alpha_i d_i = 0 (19)
\alpha_i \geq 0 for i = 1, \cdots, N (20)
Note
In the primal, we were trying to minimize the cost function; for this it is necessary to maximize over \alpha. That is the reason why we are maximizing Q(\alpha).
151. Solving for \alpha
We can compute w^* once we get the optimal \alpha_i^* by using (Eq. 15)
w^* = \sum_{i=1}^{N} \alpha_i^* d_i x_i
In addition, we can compute the optimal bias w_0^* using the optimal weight w^*
For this, we use the positive margin equation:
g(x^{(s)}) = (w^*)^T x^{(s)} + w_0 = 1
corresponding to a positive support vector x^{(s)}.
Then
w_0 = 1 - (w^*)^T x^{(s)} for d^{(s)} = 1 (21)
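Putting the dual problem (19)-(20), Eq. (15), and Eq. (21) together, here is a minimal sketch in Python (SciPy assumed; the toy points are made up) that maximizes Q(\alpha) numerically and then recovers w^* and w_0.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, so Q(alpha) = sum(alpha) - 0.5 * alpha^T H alpha.
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(a):
    # Minimize -Q(alpha) to maximize Q(alpha).
    return 0.5 * a @ H @ a - a.sum()

res = minimize(
    neg_Q,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                              # alpha_i >= 0        (20)
    constraints=[{"type": "eq", "fun": lambda a: a @ d}],  # sum alpha_i d_i = 0 (19)
)
alpha = res.x

# Recover w* from Eq. (15) and w0 from a positive support vector, Eq. (21).
w = (alpha * d) @ X
s = np.argmax(alpha * (d > 0))   # index of a positive support vector
w0 = 1.0 - w @ X[s]
print("alpha =", alpha.round(3), "w =", w, "w0 =", w0)
```

Note that the dual depends on the data only through the inner products x_j^T x_i in H, which is exactly what the kernel section below exploits.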
155. What do we need?
Until now, we only have a maximal margin algorithm
All of this works fine when the classes are separable.
Problem: what happens when they are not separable?
What can we do?
159. Map to a higher Dimensional Space
Assume that there exists a mapping
x \in R^l \to y \in R^k
Then, it is possible to define the following mapping.
161. Define a map to a higher Dimension
Nonlinear transformations
Given a series of nonlinear transformations
\{\phi_i(x)\}_{i=1}^{m}
from the input space to the feature space,
we can define the decision surface as
\sum_{i=1}^{m} w_i \phi_i(x) + w_0 = 0
163. This allows us to define
The following vector
\phi(x) = (\phi_0(x), \phi_1(x), \cdots, \phi_m(x))^T
that represents the mapping.
From this mapping
we can define the following kernel function
K : X \times X \to R
K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
166. Example
Assume
x \in R^2 \to y = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^T
We can show that
y_i^T y_j = (x_i^T x_j)^2
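A quick numeric check of this identity (a sketch; NumPy assumed, and the two points are made up):

```python
import numpy as np

def phi(x):
    """Explicit feature map x -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

lhs = phi(xi) @ phi(xj)   # inner product in feature space
rhs = (xi @ xj) ** 2      # squared inner product in input space
print(lhs, rhs)           # both 16.0
```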
168. Example of Kernels
Polynomials
k(x, z) = (x^T z + 1)^q, q > 0
Radial Basis Functions
k(x, z) = exp\{-\|x - z\|^2 / \sigma^2\}
Hyperbolic Tangents
k(x, z) = tanh(\beta x^T z + \gamma)
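The same three kernels as plain Python functions (a sketch; the default parameter values are arbitrary choices, not values from the slides):

```python
import numpy as np

def polynomial_kernel(x, z, q=2):
    """k(x, z) = (x^T z + 1)^q with q > 0."""
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    """k(x, z) = tanh(beta * x^T z + gamma)."""
    return np.tanh(beta * (x @ z) + gamma)
```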
172. Now, How to select a Kernel?
We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.
Thus
In general, the Radial Basis Function kernel is a reasonable first choice.
Then
If this fails, we can try the other possible kernels.
175. Thus, we have something like this
Step 1
Normalize the data.
Step 2
Use cross-validation to adjust the parameters of the selected kernel.
Step 3
Train against the entire dataset.
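These three steps map directly onto standard tooling. Here is a minimal sketch using scikit-learn (an assumption, since the slides do not name a library), with an RBF kernel and an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)  # placeholder data

# Step 1: normalization, folded into the pipeline.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step 2: cross-validate over an illustrative grid of C and gamma.
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)

# Step 3: GridSearchCV refits the best model on the entire dataset by default.
print(grid.best_params_, grid.best_score_)
```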