# 6 Classification & Support Vector Machine
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
SVM : Support Vector Machine
Linearly Separable sample [econometric notations]
Data $(y_1,x_1),\cdots,(y_n,x_n)$ - with $y\in\{0,1\}$ - are linearly separable if there are $(\beta_0,\beta)$ such that
- $y_i=1$ if $\beta_0+x_i^\top\beta>0$
- $y_i=0$ if $\beta_0+x_i^\top\beta<0$
Linearly Separable sample [ML notations]
Data $(y_1,x_1),\cdots,(y_n,x_n)$ - with $y\in\{-1,+1\}$ - are linearly separable if there are $(b,\omega)$ such that
- $y_i=+1$ if $b+\langle x_i,\omega\rangle>0$
- $y_i=-1$ if $b+\langle x_i,\omega\rangle<0$
or equivalently $y_i\cdot(b+\langle x_i,\omega\rangle)>0$, $\forall i$.
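A minimal R sketch (with a hypothetical toy sample, not the slides' data) illustrating the separability condition $y_i\cdot(b+\langle x_i,\omega\rangle)>0$:

```r
# toy linearly separable sample (hypothetical, for illustration only)
set.seed(1)
n <- 40
x <- cbind(runif(n, -1, 1), runif(n, -1, 1))
b <- -0.2; omega <- c(1, 2)                       # a candidate hyperplane (b, omega)
y <- ifelse(drop(b + x %*% omega) > 0, +1, -1)    # labels in {-1, +1} (ML notations)

# the sample is linearly separable for (b, omega) iff all "margins" are positive
all(y * drop(b + x %*% omega) > 0)                # TRUE by construction
```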
$y_i\cdot(b+\langle x,\omega\rangle)=0$ is a hyperplane (in $\mathbb{R}^p$), orthogonal to $\omega$.
Use $m(x)=\mathbf{1}_{b+\langle x,\omega\rangle\ge 0}-\mathbf{1}_{b+\langle x,\omega\rangle<0}$ as classifier.
Problem : the equation (i.e. $(b,\omega)$) is not unique !
Canonical form : $\displaystyle{\min_{i=1,\cdots,n}\big|b+\langle x_i,\omega\rangle\big|=1}$
Problem : the solution here is still not unique !
Idea : use the widest (safety) margin $\gamma$, see Vapnik & Lerner (1963, Pattern Recognition using Generalized Portrait Method) or Cover (1965, Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition).
Consider two points, $x_{-1}$ and $x_{+1}$, on either side of the separating hyperplane, and
$$\gamma=\frac{1}{2}\,\frac{\langle\omega,x_{+1}-x_{-1}\rangle}{\|\omega\|}$$
It is minimal when $b+\langle x_{-1},\omega\rangle=-1$ and $b+\langle x_{+1},\omega\rangle=+1$, and therefore
$$\gamma=\frac{1}{\|\omega\|}$$
The optimization problem becomes
$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2\right\}\ \text{ s.t. }\ y_i\cdot(b+\langle x_i,\omega\rangle)\ge 1,\ \forall i$$
(the constraint being the separability condition in its canonical form), a convex optimization problem with linear constraints.
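As an illustration (a sketch under the assumption of the toy data above, not the author's code), $(\omega,b)$ and the margin $\gamma=1/\|\omega\|$ can be recovered from a fitted linear SVM; a very large cost makes the soft-margin fit of e1071::svm() (introduced below) mimic the hard-margin solution:

```r
library(e1071)

# near hard-margin fit on the separable toy sample above (scale = FALSE keeps raw coordinates)
df  <- data.frame(y = factor(y), x1 = x[, 1], x2 = x[, 2])
fit <- svm(y ~ x1 + x2, data = df, kernel = "linear", cost = 1e5, scale = FALSE)

# in e1071's parametrisation, coefs = alpha_i * y_i and the intercept is -rho
omega_hat <- colSums(fit$coefs[, 1] * fit$SV)   # omega = sum_i alpha_i y_i x_i
b_hat     <- -fit$rho                           # sign depends on the factor level ordering
gamma_hat <- 1 / sqrt(sum(omega_hat^2))         # margin gamma = 1 / ||omega||
```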
Consider the following problem :
$$\min_{u\in\mathbb{R}^p}\{h(u)\}\ \text{ s.t. }\ g_i(u)\ge 0,\ \forall i=1,\cdots,n$$
where $h$ is quadratic and the $g_i$'s are linear. The Lagrangian is $\mathcal{L}:\mathbb{R}^p\times\mathbb{R}^n\to\mathbb{R}$, defined as
$$\mathcal{L}(u,\alpha)=h(u)-\sum_{i=1}^n\alpha_i g_i(u)$$
where the $\alpha_i$'s are dual variables, and the dual function is
$$\Lambda:\alpha\mapsto\mathcal{L}(u_\alpha,\alpha)=\min_u\{\mathcal{L}(u,\alpha)\}\ \text{ where }\ u_\alpha=\underset{u}{\operatorname{argmin}}\{\mathcal{L}(u,\alpha)\}$$
One can solve the dual problem, $\max_\alpha\{\Lambda(\alpha)\}$ s.t. $\alpha\ge 0$. The solution is $u^\star=u_{\alpha^\star}$.
If $g_i(u_{\alpha^\star})>0$, then necessarily $\alpha_i^\star=0$ (see the Karush-Kuhn-Tucker (KKT) condition, $\alpha_i^\star\cdot g_i(u_{\alpha^\star})=0$).
Here,
$$\mathcal{L}(b,\omega,\alpha)=\frac{1}{2}\|\omega\|^2-\sum_{i=1}^n\alpha_i\cdot\big[y_i\cdot(b+\langle x_i,\omega\rangle)-1\big]$$
From the first order conditions,
$$\frac{\partial\mathcal{L}(b,\omega,\alpha)}{\partial\omega}=\omega-\sum_{i=1}^n\alpha_i\cdot y_ix_i=0,\ \text{ i.e. }\ \omega^\star=\sum_{i=1}^n\alpha_i^\star y_ix_i$$
$$\frac{\partial\mathcal{L}(b,\omega,\alpha)}{\partial b}=-\sum_{i=1}^n\alpha_i\cdot y_i=0,\ \text{ i.e. }\ \sum_{i=1}^n\alpha_i\cdot y_i=0$$
and
$$\Lambda(\alpha)=\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle$$
$$\min_\alpha\left\{\frac{1}{2}\alpha^\top Q\alpha-\mathbf{1}^\top\alpha\right\}\ \text{ s.t. }\ \begin{cases}\alpha_i\ge 0,\ \forall i\\[2pt] y^\top\alpha=0\end{cases}$$
where $Q=[Q_{i,j}]$ and $Q_{i,j}=y_iy_j\langle x_i,x_j\rangle$, and then
$$\omega^\star=\sum_{i=1}^n\alpha_i^\star y_ix_i\ \text{ and }\ b^\star=-\frac{1}{2}\Big[\min_{i:y_i=+1}\{\langle x_i,\omega^\star\rangle\}+\max_{i:y_i=-1}\{\langle x_i,\omega^\star\rangle\}\Big]$$
Points $x_i$ such that $\alpha_i^\star>0$ are called support vectors : they satisfy $y_i\cdot(b^\star+\langle x_i,\omega^\star\rangle)=1$.
Use $m^\star(x)=\mathbf{1}_{b^\star+\langle x,\omega^\star\rangle\ge 0}-\mathbf{1}_{b^\star+\langle x,\omega^\star\rangle<0}$ as classifier.
Observe that $\gamma=\Big(\sum_{i=1}^n\alpha_i^\star\Big)^{-1/2}$.
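A minimal sketch (not the author's code) of how this dual quadratic program could be solved numerically with quadprog::solve.QP, reusing the toy sample from the first sketch; the small ridge added to $Q$ is a standard numerical fix, since solve.QP requires a positive definite matrix:

```r
library(quadprog)   # generic QP solver, used only to illustrate the dual problem

n <- length(y)
Q <- (y %*% t(y)) * (x %*% t(x))     # Q_{ij} = y_i y_j <x_i, x_j>
D <- Q + 1e-8 * diag(n)              # small ridge so that D is positive definite

# min 1/2 a'Da - 1'a  s.t.  y'a = 0 (equality, meq = 1) and a_i >= 0
sol   <- solve.QP(Dmat = D, dvec = rep(1, n),
                  Amat = cbind(y, diag(n)), bvec = rep(0, n + 1), meq = 1)
alpha <- pmax(sol$solution, 0)

omega <- colSums(alpha * y * x)      # omega = sum_i alpha_i y_i x_i
sv    <- which(alpha > 1e-6)         # support vectors: alpha_i > 0
b     <- mean(y[sv] - x[sv, , drop = FALSE] %*% omega)   # from y_i (b + <x_i, omega>) = 1
```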
The optimization problem was
$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2\right\}\ \text{ s.t. }\ y_i\cdot(b+\langle x_i,\omega\rangle)\ge 1,\ \forall i,$$
which became
$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2+\sum_{i=1}^n\alpha_i\cdot\big[1-y_i\cdot(b+\langle x_i,\omega\rangle)\big]\right\},$$
or
$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2+\text{penalty}\right\}.$$
Consider now the more general case where the sample is not linearly separable. The constraint
$$(\langle\omega,x_i\rangle+b)y_i\ge 1$$
becomes
$$(\langle\omega,x_i\rangle+b)y_i\ge 1-\xi_i$$
for some slack variables $\xi_i$, and large slack variables (when $\xi_i>0$) are penalized by solving, for some cost $C>0$,
$$\min_{\omega,b,\xi}\left\{\frac{1}{2}\omega^\top\omega+C\sum_{i=1}^n\xi_i\right\}$$
subject to $\xi_i\ge 0$ and $(\langle x_i,\omega\rangle+b)y_i\ge 1-\xi_i$, $\forall i$.
This is the soft-margin extension, see e1071::svm() or kernlab::ksvm()
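A hedged sketch of this soft-margin fit with e1071::svm() (on a hypothetical non-separable sample, not the slides' data); the cost argument is the $C$ of the program above:

```r
library(e1071)

# toy non-separable sample: two overlapping Gaussian clouds (hypothetical)
set.seed(2)
n  <- 100
X  <- rbind(matrix(rnorm(n, mean = -0.5), ncol = 2),
            matrix(rnorm(n, mean = +0.5), ncol = 2))
db <- data.frame(y = factor(rep(c(-1, 1), each = n / 2)), x1 = X[, 1], x2 = X[, 2])

# soft-margin linear SVM; 'cost' is the C penalizing the slack variables xi_i
fit_soft <- svm(y ~ x1 + x2, data = db, kernel = "linear", cost = 1)
table(predicted = predict(fit_soft), observed = db$y)
```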
The dual optimization problem is now
$$\min_\alpha\left\{\frac{1}{2}\alpha^\top Q\alpha-\mathbf{1}^\top\alpha\right\}\ \text{ s.t. }\ \begin{cases}0\le\alpha_i\le C,\ \forall i\\[2pt] y^\top\alpha=0\end{cases}$$
where $Q=[Q_{i,j}]$ and $Q_{i,j}=y_iy_j\langle x_i,x_j\rangle$, and then
$$\omega^\star=\sum_{i=1}^n\alpha_i^\star y_ix_i\ \text{ and }\ b^\star=-\frac{1}{2}\Big[\min_{i:y_i=+1}\{\langle x_i,\omega^\star\rangle\}+\max_{i:y_i=-1}\{\langle x_i,\omega^\star\rangle\}\Big]$$
Note further that the (primal) optimization problem can be written
$$\min_{(b,\omega)}\left\{\frac{1}{2}\|\omega\|_2^2+C\sum_{i=1}^n\big[1-y_i\cdot(b+\langle x_i,\omega\rangle)\big]_+\right\},$$
where $(1-z)_+$ is a convex upper bound for the empirical error $\mathbf{1}_{z\le 0}$
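A small numerical check (a sketch, not from the slides) that the hinge loss $(1-z)_+$ dominates the 0-1 loss $\mathbf{1}_{z\le 0}$, with $z=y\cdot(b+\langle x,\omega\rangle)$:

```r
# hinge loss (1 - z)_+ versus 0-1 loss 1_{z <= 0}
z        <- seq(-2, 2, by = 0.5)
hinge    <- pmax(0, 1 - z)
zero_one <- as.numeric(z <= 0)
all(hinge >= zero_one)                            # TRUE: convex upper bound
rbind(z = z, hinge = hinge, zero_one = zero_one)
```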
One can also consider the kernel trick : $x_i^\top x_j$ is replaced by $\varphi(x_i)^\top\varphi(x_j)$ for some mapping $\varphi$,
$$K(x_i,x_j)=\varphi(x_i)^\top\varphi(x_j)$$
For instance $K(a,b)=(a^\top b)^3=\varphi(a)^\top\varphi(b)$ where $\varphi(a_1,a_2)=\big(a_1^3,\ \sqrt{3}\,a_1^2a_2,\ \sqrt{3}\,a_1a_2^2,\ a_2^3\big)$.
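This identity can be checked numerically; a minimal R sketch (arbitrary points, for illustration only):

```r
# check that (a'b)^3 = phi(a)'phi(b) for the cubic kernel and the feature map above
phi <- function(a) c(a[1]^3, sqrt(3) * a[1]^2 * a[2], sqrt(3) * a[1] * a[2]^2, a[2]^3)
u <- c(0.7, -1.2); v <- c(0.3, 2.0)   # two arbitrary points in R^2
sum(u * v)^3                          # kernel evaluated directly
sum(phi(u) * phi(v))                  # inner product in the feature space R^4
```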
Consider polynomial kernels
$$K(a,b)=(1+a^\top b)^p$$
or a Gaussian kernel
$$K(a,b)=\exp\big(-(a-b)^\top(a-b)\big)$$
and solve
$$\max_{\alpha_i\ge 0}\left\{\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n y_iy_j\alpha_i\alpha_jK(x_i,x_j)\right\}$$
Consider the following training sample $\{(y_i,x_{1,i},x_{2,i})\}$ with $y_i\in\{\bullet,\bullet\}$ (two colours). The next figures (not reproduced here) show the estimated classifier on that sample for
- a linear kernel,
- polynomial kernels of degree 2, 3, 4, 5 and 6,
- a radial kernel,
and, for the radial kernel, the impact of the cost $C$ and of the tuning parameter $\gamma$ (see the sketch below).
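A sketch of how such a grid over $C$ and $\gamma$ could be explored with e1071::tune.svm (reusing the hypothetical data from the soft-margin sketch above; the parameter ranges are arbitrary):

```r
# grid search over the cost C and the radial-kernel parameter gamma (10-fold CV by default)
set.seed(3)
tuned <- tune.svm(y ~ x1 + x2, data = db, kernel = "radial",
                  cost  = 10^(-1:3),
                  gamma = 10^(-2:1))
tuned$best.parameters   # (gamma, cost) with the smallest cross-validated error
plot(tuned)             # error surface over the (gamma, cost) grid
```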
The radial kernel is formed by taking an infinite sum over polynomial kernels...
$$K(x,y)=\exp\big(-\gamma\|x-y\|^2\big)=\langle\psi(x),\psi(y)\rangle$$
where $\psi$ is some $\mathbb{R}^n\to\mathbb{R}^\infty$ function, since
$$K(x,y)=\exp\big(-\gamma\|x-y\|^2\big)=\underbrace{\exp\big(-\gamma\|x\|^2-\gamma\|y\|^2\big)}_{=\text{constant}}\cdot\exp\big(2\gamma\langle x,y\rangle\big)$$
i.e.
$$K(x,y)\propto\exp\big(2\gamma\langle x,y\rangle\big)=\sum_{k=0}^\infty\frac{(2\gamma)^k\langle x,y\rangle^k}{k!}=\sum_{k=0}^\infty\frac{(2\gamma)^k}{k!}K_k(x,y)$$
where $K_k$ is the polynomial kernel of degree $k$.
If $K=K_1+K_2$ with feature maps $\psi_j:\mathbb{R}^n\to\mathbb{R}^{d_j}$, then $\psi:\mathbb{R}^n\to\mathbb{R}^d$ with $d\sim d_1+d_2$.
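A quick numerical sanity check of the series expansion (a sketch; the points and $\gamma$ are arbitrary):

```r
# exp(2 * gamma * <x, y>) as a (truncated) sum of polynomial kernels <x, y>^k
g <- 0.5
u <- c(0.2, -0.4); v <- c(1.1, 0.3)
s <- sum(u * v)                                                  # <x, y>
exp(2 * g * s)                                                   # left-hand side
sum(sapply(0:20, function(k) (2 * g)^k * s^k / factorial(k)))    # truncated series
```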
A kernel is a measure of similarity between vectors. For the radial kernel, the smaller the value of $\gamma$, the wider the kernel : two vectors can be further apart and still be given a large similarity.
Is there a probabilistic interpretation ? Platt (2000, Probabilistic Outputs for Support Vector Machines) suggested to use a logistic function over the SVM scores,
$$p(x)=\frac{\exp[b+\langle x,\omega\rangle]}{1+\exp[b+\langle x,\omega\rangle]}$$
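In e1071, a variant of this Platt scaling is what the probability = TRUE option does (a sigmoid is fitted on the SVM decision values); a hedged sketch on the toy data from the soft-margin example above:

```r
# Platt-type probabilities in e1071
fit_prob <- svm(y ~ x1 + x2, data = db, kernel = "radial", cost = 1, probability = TRUE)
pred     <- predict(fit_prob, newdata = db, probability = TRUE)
head(attr(pred, "probabilities"))   # estimated P(y = -1 | x) and P(y = +1 | x)
```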