Lecture 2: Linear SVM in the Dual
Stéphane Canu
stephane.canu@litislab.eu
Sao Paulo 2014
March 12, 2014
Road map
1 Linear SVM
Optimization in 10 slides
Equality constraints
Inequality constraints
Dual formulation of the linear SVM
Solving the dual
Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.
Linear SVM: the problem

The linear SVM is the solution of the following problem (called the primal).

Let $\{(x_i, y_i);\ i = 1:n\}$ be a set of labelled data with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$.
A support vector machine (SVM) is a linear classifier associated with the following decision function: $D(x) = \operatorname{sign}(w^\top x + b)$, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are given through the solution of the following problem:

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 = \tfrac{1}{2} w^\top w \qquad \text{with } y_i(w^\top x_i + b) \geq 1,\ i = 1, n$$

This is a quadratic program (QP):

$$\min_{z}\ \tfrac{1}{2} z^\top A z - d^\top z \qquad \text{with } Bz \leq e$$

$$z = (w, b)^\top, \quad d = (0, \ldots, 0)^\top, \quad A = \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix}, \quad B = -[\operatorname{diag}(y)X,\ y] \quad \text{and} \quad e = -(1, \ldots, 1)^\top$$
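As a concrete illustration (added here, not in the original slides), a minimal MATLAB/Octave sketch that builds these QP matrices and solves the primal with quadprog; it assumes the Optimization Toolbox is available, that X is the n-by-d data matrix, y the n-by-1 label vector, and that the data are linearly separable.

% primal linear SVM as a QP: variables z = [w; b]
[n, d] = size(X);
H = blkdiag(eye(d), 0);      % matrix A of the slide: quadratic term on w only, none on b
f = zeros(d+1, 1);           % vector d of the slide: no linear term
B = -[diag(y) * X, y];       % inequality constraints B*z <= e
e = -ones(n, 1);
z = quadprog(H, f, B, e);    % minimizes 1/2 z'Hz + f'z subject to B*z <= e
w = z(1:d);
b = z(d+1);
% decision function: D(x) = sign(w'*x + b)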
Road map
1 Linear SVM
Optimization in 10 slides
Equality constraints
Inequality constraints
Dual formulation of the linear SVM
Solving the dual
A simple example (to begin with)

$$\min_{x_1, x_2}\ J(x) = (x_1 - a)^2 + (x_2 - b)^2$$

[Figure: iso-cost lines $J(x) = k$, the gradient $\nabla_x J(x)$ at a point $x$, and the unconstrained minimizer $x^\star$]
A simple example (to begin with)

$$\min_{x_1, x_2}\ J(x) = (x_1 - a)^2 + (x_2 - b)^2 \qquad \text{with } H(x) = \alpha(x_1 - c)^2 + \beta(x_2 - d)^2 + \gamma x_1 x_2 - 1$$

$$\Omega = \{x \mid H(x) = 0\}$$

[Figure: iso-cost lines $J(x) = k$, the feasible set $\Omega$ with its tangent hyperplane at $x^\star$, the gradients $\nabla_x J(x)$ and $\nabla_x H(x)$, and the step $\Delta x$]

At the constrained solution, $\nabla_x H(x^\star) = \lambda\, \nabla_x J(x^\star)$.
The only one equality constraint case

$$\min_{x}\ J(x) \qquad\qquad J(x + \varepsilon d) \approx J(x) + \varepsilon\, \nabla_x J(x)^\top d$$
$$\text{with } H(x) = 0 \qquad\ H(x + \varepsilon d) \approx H(x) + \varepsilon\, \nabla_x H(x)^\top d$$

Loss $J$: $d$ is a descent direction if there exists $\varepsilon_0 \in \mathbb{R}$ such that $\forall \varepsilon \in \mathbb{R},\ 0 < \varepsilon \leq \varepsilon_0$:
$$J(x + \varepsilon d) < J(x) \ \Rightarrow\ \nabla_x J(x)^\top d < 0$$

Constraint $H$: $d$ is a feasible descent direction if there exists $\varepsilon_0 \in \mathbb{R}$ such that $\forall \varepsilon \in \mathbb{R},\ 0 < \varepsilon \leq \varepsilon_0$:
$$H(x + \varepsilon d) = 0 \ \Rightarrow\ \nabla_x H(x)^\top d = 0$$

If at $x^\star$ the vectors $\nabla_x J(x^\star)$ and $\nabla_x H(x^\star)$ are collinear, there is no feasible descent direction $d$. Therefore, $x^\star$ is a local solution of the problem.
Lagrange multipliers

Assume $J$ and the functions $H_i$ are continuously differentiable (and independent).

$$\mathcal{P} = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} & J(x) \\ \text{with} & H_1(x) = 0 \quad (\lambda_1) \\ \text{and} & H_2(x) = 0 \quad (\lambda_2) \\ & \ \ \vdots \\ & H_p(x) = 0 \quad (\lambda_p) \end{cases}$$

Each constraint is associated with a $\lambda_i$: the Lagrange multiplier.

Theorem (First order optimality conditions)
For $x^\star$ to be a local minimum of $\mathcal{P}$, it is necessary that:
$$\nabla_x J(x^\star) + \sum_{i=1}^{p} \lambda_i \nabla_x H_i(x^\star) = 0 \qquad \text{and} \qquad H_i(x^\star) = 0,\quad i = 1, p$$
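To make the theorem concrete, here is a small worked example (added here, not from the slides), using the quadratic cost of the simple example with one linear equality constraint:

$$\min_{x_1, x_2}\ J(x) = (x_1 - 1)^2 + (x_2 - 2)^2 \qquad \text{with } H(x) = x_1 + x_2 - 1 = 0$$
$$\nabla_x J(x^\star) + \lambda \nabla_x H(x^\star) = 0 \ \Rightarrow\ 2(x_1^\star - 1) + \lambda = 0 \ \text{ and }\ 2(x_2^\star - 2) + \lambda = 0,$$
so $x_1^\star = 1 - \lambda/2$ and $x_2^\star = 2 - \lambda/2$; the constraint $H(x^\star) = 0$ then gives $2 - \lambda = 0$, hence $\lambda = 2$ and $x^\star = (0, 1)$: the projection of $(1, 2)$ onto the line $x_1 + x_2 = 1$.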
Plan
1 Linear SVM
Optimization in 10 slides
Equality constraints
Inequality constraints
Dual formulation of the linear SVM
Solving the dual
The only one inequality constraint case

$$\min_{x}\ J(x) \qquad\qquad J(x + \varepsilon d) \approx J(x) + \varepsilon\, \nabla_x J(x)^\top d$$
$$\text{with } G(x) \leq 0 \qquad\ G(x + \varepsilon d) \approx G(x) + \varepsilon\, \nabla_x G(x)^\top d$$

Cost $J$: $d$ is a descent direction if there exists $\varepsilon_0 \in \mathbb{R}$ such that $\forall \varepsilon \in \mathbb{R},\ 0 < \varepsilon \leq \varepsilon_0$:
$$J(x + \varepsilon d) < J(x) \ \Rightarrow\ \nabla_x J(x)^\top d < 0$$

Constraint $G$: $d$ is a feasible descent direction if there exists $\varepsilon_0 \in \mathbb{R}$ such that $\forall \varepsilon \in \mathbb{R},\ 0 < \varepsilon \leq \varepsilon_0$, $G(x + \varepsilon d) \leq 0$, that is:
$$G(x) < 0:\ \text{no limit here on } d \qquad\qquad G(x) = 0:\ \nabla_x G(x)^\top d \leq 0$$

Two possibilities
If $x^\star$ lies at the limit of the feasible domain ($G(x^\star) = 0$) and the vectors $\nabla_x J(x^\star)$ and $\nabla_x G(x^\star)$ are collinear and point in opposite directions, there is no feasible descent direction $d$ at that point. Therefore, $x^\star$ is a local solution of the problem... or $\nabla_x J(x^\star) = 0$.
Two possibilities for optimality

$$\nabla_x J(x^\star) = -\mu\, \nabla_x G(x^\star) \ \text{ with } \mu > 0 \ \text{ and } \ G(x^\star) = 0$$
or
$$\nabla_x J(x^\star) = 0 \ \text{ with } \mu = 0 \ \text{ and } \ G(x^\star) < 0$$

This alternative is summarized in the so-called complementarity condition:
$$\mu\, G(x^\star) = 0 \qquad \Longleftrightarrow \qquad \begin{cases} \mu = 0 & \text{and } G(x^\star) < 0, \text{ or} \\ \mu > 0 & \text{and } G(x^\star) = 0 \end{cases}$$
First order optimality condition (1)

$$\mathcal{P} = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} & J(x) \\ \text{with} & h_j(x) = 0, \quad j = 1, \ldots, p \\ \text{and} & g_i(x) \leq 0, \quad i = 1, \ldots, q \end{cases}$$

Definition: Karush, Kuhn and Tucker (KKT) conditions

$$\begin{aligned}
\text{stationarity} \quad & \nabla J(x^\star) + \sum_{j=1}^{p} \lambda_j \nabla h_j(x^\star) + \sum_{i=1}^{q} \mu_i \nabla g_i(x^\star) = 0 \\
\text{primal admissibility} \quad & h_j(x^\star) = 0, \quad j = 1, \ldots, p \\
& g_i(x^\star) \leq 0, \quad i = 1, \ldots, q \\
\text{dual admissibility} \quad & \mu_i \geq 0, \quad i = 1, \ldots, q \\
\text{complementarity} \quad & \mu_i\, g_i(x^\star) = 0, \quad i = 1, \ldots, q
\end{aligned}$$

$\lambda_j$ and $\mu_i$ are called the Lagrange multipliers of problem $\mathcal{P}$.
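A toy example (added here, not from the slides) makes the four conditions concrete: minimize $J(x) = x^2$ under the single inequality constraint $g(x) = 1 - x \leq 0$. Stationarity reads $2x^\star - \mu = 0$ and complementarity $\mu\,(1 - x^\star) = 0$. Taking $\mu = 0$ would give $x^\star = 0$, which violates $1 - x^\star \leq 0$; so the constraint is active, $x^\star = 1$ and $\mu = 2 \geq 0$, and all four KKT conditions hold.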
First order optimality condition (2)

Theorem (12.1, Nocedal & Wright, p. 321)
If a vector $x^\star$ is a stationary point of problem $\mathcal{P}$, then there exist (under some conditions, e.g. the linear independence constraint qualification) Lagrange multipliers such that $x^\star$, $\{\lambda_j\}_{j=1:p}$, $\{\mu_i\}_{i=1:q}$ fulfill the KKT conditions.

If the problem is convex, then a stationary point is the solution of the problem.

A quadratic program (QP) is convex when...

$$\text{(QP)} \qquad \min_{z}\ \tfrac{1}{2} z^\top A z - d^\top z \qquad \text{with } Bz \leq e$$

...when matrix $A$ is positive definite.
KKT condition - Lagrangian (3)

$$\mathcal{P} = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} & J(x) \\ \text{with} & h_j(x) = 0, \quad j = 1, \ldots, p \\ \text{and} & g_i(x) \leq 0, \quad i = 1, \ldots, q \end{cases}$$

Definition: Lagrangian
The Lagrangian of problem $\mathcal{P}$ is the following function:
$$L(x, \lambda, \mu) = J(x) + \sum_{j=1}^{p} \lambda_j h_j(x) + \sum_{i=1}^{q} \mu_i g_i(x)$$

The importance of being a Lagrangian
  the stationarity condition can be written: $\nabla L(x^\star, \lambda, \mu) = 0$
  the Lagrangian saddle point: $\displaystyle\max_{\lambda, \mu} \min_{x} L(x, \lambda, \mu)$

Primal variables: $x$; dual variables: $\lambda, \mu$ (the Lagrange multipliers).
Duality – definitions (1)

Primal and (Lagrange) dual problems

$$\mathcal{P} = \begin{cases} \displaystyle\min_{x \in \mathbb{R}^n} & J(x) \\ \text{with} & h_j(x) = 0, \quad j = 1, p \\ \text{and} & g_i(x) \leq 0, \quad i = 1, q \end{cases}
\qquad\qquad
\mathcal{D} = \begin{cases} \displaystyle\max_{\lambda \in \mathbb{R}^p,\, \mu \in \mathbb{R}^q} & Q(\lambda, \mu) \\ \text{with} & \mu_j \geq 0, \quad j = 1, q \end{cases}$$

Dual objective function:
$$Q(\lambda, \mu) = \inf_{x}\ L(x, \lambda, \mu) = \inf_{x}\ J(x) + \sum_{j=1}^{p} \lambda_j h_j(x) + \sum_{i=1}^{q} \mu_i g_i(x)$$

Wolfe dual problem
$$\mathcal{W} = \begin{cases} \displaystyle\max_{x,\, \lambda \in \mathbb{R}^p,\, \mu \in \mathbb{R}^q} & L(x, \lambda, \mu) \\ \text{with} & \mu_j \geq 0, \quad j = 1, q \\ \text{and} & \displaystyle\nabla J(x^\star) + \sum_{j=1}^{p} \lambda_j \nabla h_j(x^\star) + \sum_{i=1}^{q} \mu_i \nabla g_i(x^\star) = 0 \end{cases}$$
Duality – theorems (2)

Theorem (12.12, 12.13 and 12.14, Nocedal & Wright, p. 346)
If $f$, $g$ and $h$ are convex and continuously differentiable (under some conditions, e.g. the linear independence constraint qualification), then the solution of the dual problem is the same as the solution of the primal:

$$(\lambda^\star, \mu^\star) = \text{solution of problem } \mathcal{D}$$
$$x^\star = \arg\min_{x}\ L(x, \lambda^\star, \mu^\star)$$
$$Q(\lambda^\star, \mu^\star) = \min_{x}\ L(x, \lambda^\star, \mu^\star) = L(x^\star, \lambda^\star, \mu^\star) = J(x^\star) + \lambda^{\star\top} H(x^\star) + \mu^{\star\top} G(x^\star) = J(x^\star)$$

and for any feasible point $x$:
$$Q(\lambda, \mu) \leq J(x) \quad \Rightarrow \quad 0 \leq J(x) - Q(\lambda, \mu)$$

The duality gap is the difference between the primal and dual cost functions.
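Continuing the toy problem above (an added illustration): for $\min_x x^2$ with $g(x) = 1 - x \leq 0$, the dual function is available in closed form,
$$Q(\mu) = \inf_{x}\ x^2 + \mu(1 - x) = \mu - \tfrac{\mu^2}{4} \qquad (\text{infimum attained at } x = \tfrac{\mu}{2}),$$
and maximizing it over $\mu \geq 0$ gives $\mu^\star = 2$ and $Q(\mu^\star) = 1 = J(x^\star)$ with $x^\star = 1$: the duality gap is zero at the optimum, as the theorem predicts.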
Road map
1 Linear SVM
Optimization in 10 slides
Equality constraints
Inequality constraints
Dual formulation of the linear SVM
Solving the dual
Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.
Linear SVM dual formulation - the Lagrangian

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^2 \qquad \text{with } y_i(w^\top x_i + b) \geq 1,\ i = 1, n$$

Looking for the Lagrangian saddle point $\displaystyle\max_{\alpha} \min_{w,b} L(w, b, \alpha)$ with the so-called Lagrange multipliers $\alpha_i \geq 0$:

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i(w^\top x_i + b) - 1 \big)$$

$\alpha_i$ represents the influence of the constraint, and thus the influence of the training example $(x_i, y_i)$.
Stationarity conditions

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i(w^\top x_i + b) - 1 \big)$$

Computing the gradients:
$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i
\qquad\qquad
\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i$$

We have the following optimality conditions:
$$\nabla_w L(w, b, \alpha) = 0 \ \Rightarrow\ w = \sum_{i=1}^{n} \alpha_i y_i x_i
\qquad\qquad
\frac{\partial L(w, b, \alpha)}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{n} \alpha_i y_i = 0$$
KKT conditions for SVM

$$\begin{aligned}
\text{stationarity} \quad & w^\star - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \\
\text{primal admissibility} \quad & y_i(w^{\star\top} x_i + b^\star) \geq 1, \quad i = 1, \ldots, n \\
\text{dual admissibility} \quad & \alpha_i \geq 0, \quad i = 1, \ldots, n \\
\text{complementarity} \quad & \alpha_i \big( y_i(w^{\star\top} x_i + b^\star) - 1 \big) = 0, \quad i = 1, \ldots, n
\end{aligned}$$

The complementarity condition splits the data into two sets:
  $A$, the set of active constraints: the useful points
  $$A = \{ i \in [1, n] \mid y_i(w^{\star\top} x_i + b^\star) = 1 \}$$
  its complement $\bar{A}$: the useless points; if $i \notin A$, then $\alpha_i = 0$.
The KKT conditions for SVM

The same KKT conditions, but using matrix notation and the active set $A$:

$$\begin{aligned}
\text{stationarity} \quad & w - X^\top D_y \alpha = 0 \quad \text{and} \quad \alpha^\top y = 0 \\
\text{primal admissibility} \quad & D_y (Xw + b\,\mathbf{1}) \geq \mathbf{1} \\
\text{dual admissibility} \quad & \alpha \geq 0 \\
\text{complementarity} \quad & D_y (X_A w + b\,\mathbf{1}_A) = \mathbf{1}_A \quad \text{and} \quad \alpha_{\bar{A}} = 0
\end{aligned}$$

Knowing $A$, the solution verifies the following linear system:
$$\begin{cases} w - X_A^\top D_y \alpha_A = 0 \\ -D_y X_A w - b\, y_A = -e_A \\ -y_A^\top \alpha_A = 0 \end{cases}$$
with $D_y = \operatorname{diag}(y_A)$, $\alpha_A = \alpha(A)$, $y_A = y(A)$ and $X_A = X(A, :)$.
The KKT conditions as a linear system

$$\begin{cases} w - X_A^\top D_y \alpha_A = 0 \\ -D_y X_A w - b\, y_A = -e_A \\ -y_A^\top \alpha_A = 0 \end{cases}
\qquad \text{with } D_y = \operatorname{diag}(y_A),\ \alpha_A = \alpha(A),\ y_A = y(A) \text{ and } X_A = X(A, :),$$

that is,
$$\begin{bmatrix} I & -X_A^\top D_y & 0 \\ -D_y X_A & 0 & -y_A \\ 0 & -y_A^\top & 0 \end{bmatrix}
\begin{bmatrix} w \\ \alpha_A \\ b \end{bmatrix}
= \begin{bmatrix} 0 \\ -e_A \\ 0 \end{bmatrix}.$$

We can work on it to separate $w$ from $(\alpha_A, b)$.
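A minimal MATLAB/Octave sketch (added here) of solving this linear system, assuming the active set has already been identified and is stored as an index vector A, with X the n-by-d data matrix and y the n-by-1 label vector:

% solve the KKT linear system for (w, alpha_A, b), the active set A being known
XA = X(A, :);                 % rows of X in the active set
yA = y(A);
nA = numel(A);
d  = size(X, 2);
Dy = diag(yA);
eA = ones(nA, 1);
K  = [ eye(d),      -XA' * Dy,      zeros(d, 1) ;
      -Dy * XA,      zeros(nA),    -yA          ;
       zeros(1, d), -yA',           0           ];
rhs = [zeros(d, 1); -eA; 0];
sol = K \ rhs;                % direct solve of the square KKT system
w      = sol(1:d);
alphaA = sol(d+1 : d+nA);
b      = sol(end);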
The SVM dual formulation

The SVM Wolfe dual:
$$\begin{cases} \displaystyle\max_{w, b, \alpha} & \tfrac{1}{2}\|w\|^2 - \displaystyle\sum_{i=1}^{n} \alpha_i \big( y_i(w^\top x_i + b) - 1 \big) \\ \text{with} & \alpha_i \geq 0, \quad i = 1, \ldots, n \\ \text{and} & w - \displaystyle\sum_{i=1}^{n} \alpha_i y_i x_i = 0, \quad \displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0 \end{cases}$$

Using the fact that $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, the SVM Wolfe dual without $w$ and $b$ is:
$$\begin{cases} \displaystyle\max_{\alpha} & -\tfrac{1}{2} \displaystyle\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_j \alpha_i y_i y_j\, x_j^\top x_i + \displaystyle\sum_{i=1}^{n} \alpha_i \\ \text{with} & \alpha_i \geq 0, \quad i = 1, \ldots, n \\ \text{and} & \displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0 \end{cases}$$
Linear SVM dual formulation

$$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \big( y_i(w^\top x_i + b) - 1 \big)$$

Optimality: $w = \displaystyle\sum_{i=1}^{n} \alpha_i y_i x_i$ and $\displaystyle\sum_{i=1}^{n} \alpha_i y_i = 0$, so that

$$L(\alpha) = \tfrac{1}{2} \underbrace{\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_j \alpha_i y_i y_j\, x_j^\top x_i}_{w^\top w}
\;-\; \sum_{i=1}^{n} \alpha_i y_i \underbrace{\sum_{j=1}^{n} \alpha_j y_j\, x_j^\top}_{w^\top} x_i
\;-\; b \underbrace{\sum_{i=1}^{n} \alpha_i y_i}_{=0}
\;+\; \sum_{i=1}^{n} \alpha_i
= -\tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_j \alpha_i y_i y_j\, x_j^\top x_i + \sum_{i=1}^{n} \alpha_i$$

Dual linear SVM is also a quadratic program:
$$\text{problem } \mathcal{D} \qquad \begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n} & \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha \\ \text{with} & y^\top \alpha = 0 \\ \text{and} & 0 \leq \alpha_i, \quad i = 1, n \end{cases}$$
with $G$ a symmetric $n \times n$ matrix such that $G_{ij} = y_i y_j\, x_j^\top x_i$.
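For illustration (added here; it assumes quadprog is available and the data are linearly separable), the dual QP can be solved directly and the primal solution recovered from the stationarity and complementarity conditions:

% dual linear SVM as a QP, X: n-by-d data, y: n-by-1 labels in {-1,+1}
n = size(X, 1);
G = (y * y') .* (X * X');         % Gij = yi*yj*xi'*xj
e = ones(n, 1);
% min 1/2 a'Ga - e'a   s.t.  y'a = 0,  a >= 0
alpha = quadprog(G, -e, [], [], y', 0, zeros(n, 1), []);
w  = X' * (y .* alpha);           % stationarity: w = sum_i alpha_i yi xi
sv = find(alpha > 1e-6);          % support vectors: alpha_i > 0
b  = mean(y(sv) - X(sv, :) * w);  % complementarity: yi (w'xi + b) = 1 on active points
% decision value for a new column vector x: f = w'*x + b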
SVM primal vs. dual

Primal
$$\begin{cases} \displaystyle\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} & \tfrac{1}{2}\|w\|^2 \\ \text{with} & y_i(w^\top x_i + b) \geq 1, \quad i = 1, n \end{cases}$$
  d + 1 unknowns
  n constraints
  classical QP
  perfect when d << n

Dual
$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n} & \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha \\ \text{with} & y^\top \alpha = 0 \\ \text{and} & 0 \leq \alpha_i, \quad i = 1, n \end{cases}$$
  n unknowns
  G the Gram matrix (pairwise influence matrix)
  n box constraints
  easy to solve
  to be used when d > n

$$f(x) = \sum_{j=1}^{d} w_j x_j + b = \sum_{i=1}^{n} \alpha_i y_i\, (x^\top x_i) + b$$
The bidual (the dual of the dual)

$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n} & \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha \\ \text{with} & y^\top \alpha = 0 \\ \text{and} & 0 \leq \alpha_i, \quad i = 1, n \end{cases}$$

$$L(\alpha, \lambda, \mu) = \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha + \lambda\, y^\top \alpha - \mu^\top \alpha$$
$$\nabla_\alpha L(\alpha, \lambda, \mu) = G\alpha - e + \lambda y - \mu$$

The bidual:
$$\begin{cases} \displaystyle\max_{\alpha, \lambda, \mu} & -\tfrac{1}{2}\, \alpha^\top G \alpha \\ \text{with} & G\alpha - e + \lambda y - \mu = 0 \\ \text{and} & 0 \leq \mu \end{cases}$$

Since $\|w\|^2 = \alpha^\top G \alpha$ and $D_y X w = G\alpha$:
$$\begin{cases} \displaystyle\max_{w, \lambda} & -\tfrac{1}{2}\|w\|^2 \\ \text{with} & D_y X w + \lambda y \geq e \end{cases}$$

By identification (possibly up to a sign), $b = \lambda$ is the Lagrange multiplier of the equality constraint.
Cold case: the least squares problem

Linear model:
$$y_i = \sum_{j=1}^{d} w_j x_{ij} + \varepsilon_i, \quad i = 1, n$$
with n data points and d variables, d < n:
$$\min_{w}\ \sum_{i=1}^{n} \Big( \sum_{j=1}^{d} x_{ij} w_j - y_i \Big)^2 = \|Xw - y\|^2$$

Solution: $w = (X^\top X)^{-1} X^\top y$, so that
$$f(x) = x^\top \underbrace{(X^\top X)^{-1} X^\top y}_{w}$$

What is the influence of each data point (the rows of matrix $X$)?

Shawe-Taylor & Cristianini's book, 2004
Data point influence (contribution)

For any new data point $x$:
$$f(x) = x^\top (X^\top X)(X^\top X)^{-1} \underbrace{(X^\top X)^{-1} X^\top y}_{w}
= x^\top X^\top \underbrace{X (X^\top X)^{-1} (X^\top X)^{-1} X^\top y}_{\alpha}$$

[Figure: the $n \times d$ data matrix $X$, with $w$ indexing the $d$ variables, $\alpha$ indexing the $n$ examples, and the new point $x$ compared to each example through $x^\top x_i$]

$$f(x) = \sum_{j=1}^{d} w_j x_j = \sum_{i=1}^{n} \alpha_i\, (x^\top x_i)$$

From variables to examples:
$$\alpha = X (X^\top X)^{-1} w \quad (n \text{ examples}) \qquad \text{and} \qquad w = X^\top \alpha \quad (d \text{ variables})$$

What if $d \geq n$?
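A small numerical check (added here; it assumes d < n so that X'X is invertible, matching the derivation above):

% least squares: the "variable" view and the "example" view give the same predictor
w     = (X' * X) \ (X' * y);     % d coefficients, one per variable
alpha = X * ((X' * X) \ w);      % n coefficients, one per example
norm(w - X' * alpha)             % ~ 0: the two representations coincide
x = randn(size(X, 2), 1);        % any new data point
[x' * w,  sum(alpha .* (X * x))] % same prediction f(x) computed both ways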
Road map
1 Linear SVM
Optimization in 10 slides
Equality constraints
Inequality constraints
Dual formulation of the linear SVM
Solving the dual
Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.
Solving the dual (1)

Data point influence:
  $\alpha_i = 0$: this point is useless
  $\alpha_i \neq 0$: this point is said to be a support vector

$$f(x) = \sum_{j=1}^{d} w_j x_j + b = \sum_{i=1}^{n} \alpha_i y_i\, (x^\top x_i) + b$$

In the example only three $\alpha_i$ are nonzero, so
$$f(x) = \sum_{j=1}^{d} w_j x_j + b = \sum_{i=1}^{3} \alpha_i y_i\, (x^\top x_i) + b$$

The decision border only depends on 3 points (d + 1).
Solving the dual (2)

Assume we know these 3 data points:
$$\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^n} & \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha \\ \text{with} & y^\top \alpha = 0 \\ \text{and} & 0 \leq \alpha_i, \quad i = 1, n \end{cases}
\qquad \Longrightarrow \qquad
\begin{cases} \displaystyle\min_{\alpha \in \mathbb{R}^3} & \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha \\ \text{with} & y^\top \alpha = 0 \end{cases}$$

$$L(\alpha, b) = \tfrac{1}{2}\, \alpha^\top G \alpha - e^\top \alpha + b\, y^\top \alpha$$

Solve the following linear system:
$$\begin{cases} G\alpha + b\, y = e \\ y^\top \alpha = 0 \end{cases}$$

U = chol(G);                  % upper triangular factor: G = U'*U
a = U \ (U' \ e);             % a = inv(G)*e
c = U \ (U' \ y);             % c = inv(G)*y
b = (y'*a) / (y'*c);          % b from the equality constraint y'*alpha = 0
alpha = U \ (U' \ (e - b*y)); % alpha = inv(G)*(e - b*y)
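For instance (an illustrative, hypothetical dataset added here; the snippet requires G to be positive definite, i.e. the three support vectors to be linearly independent):

X = [1 0 0; 0 1 0; 0 0 1];          % three points in R^3
y = [1; 1; -1];
G = diag(y) * (X * X') * diag(y);   % Gij = yi*yj*xi'*xj
e = ones(3, 1);
U = chol(G);
a = U \ (U' \ e);
c = U \ (U' \ y);
b = (y'*a) / (y'*c);                % here b = 1/3
alpha = U \ (U' \ (e - b*y));       % here alpha = [2/3; 2/3; 4/3], all >= 0
w = X' * (y .* alpha);              % recover the primal solution
margins = y .* (X * w + b)          % all equal to 1: the three points sit on the margin

Note that the single Cholesky factor U is reused for all three pairs of triangular solves.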
Conclusion: variables or data points?

Seeking a universal learning algorithm: no model for $\mathbb{P}(x, y)$.
The linear case: the data are separable.
The non-separable case.
Double objective: minimizing the error together with the regularity of the solution
  → multi-objective optimisation.
Duality: variables – examples
  use the primal when d < n (in the linear case) or when matrix G is hard to compute;
  otherwise use the dual.
Universality = nonlinearity
  → kernels.