Support Vector Machine
By: Amr Koura
Agenda
● Definition.
● Kernel Functions.
● Optimization Problem.
● Soft Margin Hyperplanes.
● V-SVC.
● SMO algorithm.
● Demo.
Definition
Definition
● A supervised learning model with associated learning algorithms that analyze data and recognize patterns.
● Applications:
- Machine learning.
- Pattern recognition.
- Classification and regression analysis.
Binary Classifier
● Given a set of points $P = \{(X_i, Y_i)\}$ such that $X_i \in \mathbb{R}^d$ and $Y_i \in \{-1, 1\}$, build a model that assigns a new example to a class in $\{-1, 1\}$.
Question
● What if the examples are not linearly
separable?
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html
Kernel Function
Kernel Function
● SVMs can efficiently perform non-linear classification using the kernel trick.
● The kernel trick implicitly maps the input into a high-dimensional space where the examples become linearly separable (a small sketch follows below).
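A minimal sketch of the idea (not from the slides): for a degree-2 polynomial kernel on 2-D inputs, the implicit feature map can be written out explicitly, and the kernel value equals an ordinary dot product in the mapped space. The data values here are arbitrary.

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

lhs = np.dot(phi(x), phi(y))   # dot product in the mapped (3-D) space
rhs = np.dot(x, y) ** 2        # kernel value k(x, y) = <x, y>^2 in the original space
print(lhs, rhs)                # both print 16.0: the kernel never computes phi explicitly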
Kernel Function
https://en.wikipedia.org/wiki/Support_vector_machine
Kernel Function
● Linear Kernel.
● Polynomial Kernel.
● Gaussian RBF Kernel.
● Sigmoid Kernel.
Linear Kernel Function
● $k(X, Y) = \langle X, Y \rangle$: the dot product between X and Y.
Polynomial Kernel Function
$k(X, Y) = (\gamma \langle X, Y \rangle + c)^d$
where $\gamma$ is a scaling factor, d is the degree of the polynomial, and c is a free parameter that trades off the influence of higher-order versus lower-order terms in the polynomial.
Gaussian RBF Kernel
$k(X, Y) = \exp\left(-\frac{\lVert X - Y \rVert^2}{2 \sigma^2}\right)$
where $\lVert X - Y \rVert^2$ denotes the squared Euclidean distance.
Other form: $k(X, Y) = \exp\left(-\gamma \lVert X - Y \rVert^2\right)$ with $\gamma = \frac{1}{2 \sigma^2}$.
Sigmoid Kernel Function
$k(X, Y) = \tanh(\gamma \langle X, Y \rangle + r)$
where $\gamma$ is a scaling factor and r is a shifting parameter.
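As a quick reference, the four kernels above can be written in a few lines of NumPy. This is only a sketch; the default parameter values (gamma, c, d, sigma, r) are arbitrary choices, not values from the slides.

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, c=1.0, d=3):
    return (gamma * np.dot(x, y) + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return np.tanh(gamma * np.dot(x, y) + r)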
Optimization Problem
Optimization Problem
● Need to find hyperplane with maximum margin.
https://en.wikipedia.org/wiki/Support_vector_machine
Optimization Problem
● Distance between the two hyperplanes: $\frac{2}{\lVert W \rVert}$.
● Goal:
1- minimize $\lVert W \rVert$.
2- prevent points from falling into the margin.
● Constraints:
$W \cdot X_i - b \geq 1$ for $Y_i = 1$ and $W \cdot X_i - b \leq -1$ for $Y_i = -1$.
● Together:
$\min_{W, b} \lVert W \rVert$, s.t. $y_i (W \cdot X_i - b) \geq 1$ for $1 \leq i \leq n$.
Optimization Problem
● Mathematically convenient form:
$\arg\min_{W, b} \frac{1}{2} \lVert W \rVert^2$, s.t. $y_i (W \cdot X_i - b) \geq 1$.
● Introducing Lagrange multipliers $\alpha_i$, the problem becomes a quadratic optimization problem:
$\arg\min_{W, b} \max_{\alpha \geq 0} \left\{ \frac{1}{2} \lVert W \rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (W \cdot X_i - b) - 1 \right] \right\}$
Optimization Problem
● The solution can be expressed as a linear combination of the $X_i$:
$W = \sum_{i=1}^{n} \alpha_i Y_i X_i$
● $\alpha_i \neq 0$ only for the points that are support vectors.
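As a one-step justification (not spelled out on the slides): differentiating the Lagrangian from the previous slide and setting the derivatives to zero gives

$\frac{\partial L}{\partial W} = W - \sum_{i=1}^{n} \alpha_i y_i X_i = 0 \;\Rightarrow\; W = \sum_{i=1}^{n} \alpha_i y_i X_i, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^{n} \alpha_i y_i = 0.$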
Optimization problem
● The QP is solved iff:
1) the KKT conditions are fulfilled for every example;
2) $Q_{ij} = y_i \, y_j \, k(X_i, X_j)$ is positive semi-definite.
● The KKT conditions are:
$\alpha_i = 0 \Rightarrow y_i f(x_i) \geq 1$
$0 < \alpha_i < C \Rightarrow y_i f(x_i) = 1$
$\alpha_i = C \Rightarrow y_i f(x_i) \leq 1$
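A minimal NumPy sketch for checking these conditions numerically on a trained model; the function name and the tolerance are hypothetical, and f is assumed to hold the decision values $f(x_i)$.

import numpy as np

def kkt_satisfied(alpha, y, f, C, tol=1e-3):
    # Check the three KKT cases above for every training example.
    m = y * f                                                              # margins y_i * f(x_i)
    ok_zero  = (alpha <= tol) & (m >= 1 - tol)                             # alpha_i = 0     -> margin >= 1
    ok_free  = (alpha > tol) & (alpha < C - tol) & (np.abs(m - 1) <= tol)  # 0 < alpha_i < C -> margin = 1
    ok_bound = (alpha >= C - tol) & (m <= 1 + tol)                         # alpha_i = C     -> margin <= 1
    return bool(np.all(ok_zero | ok_free | ok_bound))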
Soft Margin
Hyperplanes
Soft Margin Hyperplanes
● The soft margin approach chooses a hyperplane that splits the examples as cleanly as possible while still maximizing the margin.
● Non-negative slack variables $\xi_i$ measure the degree of misclassification.
Soft Margin Hyperplanes
Learning with Kernels, by Schölkopf.
Soft Margin Hyperplanes
● The optimization problem:
$\arg\min_{W, \xi, b} \frac{1}{2} \lVert W \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i$, s.t. $y_i (W \cdot X_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$.
● Using Lagrange multipliers, the dual problem is to maximize
$W(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq \frac{C}{n}$.
● C is essentially a regularisation parameter,
which controls the trade-off between achieving
a low error on the training data and minimising
the norm of the weights.
● After the optimizer computes the $\alpha_i$, W can be computed as
$W = \sum_{i=1}^{n} \alpha_i Y_i X_i$
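A minimal NumPy sketch of this recovery step for the linear kernel; alpha, Y, X and b are assumed to come from some dual solver and are hypothetical here.

import numpy as np

def primal_weights(alpha, Y, X):
    # W = sum_i alpha_i * Y_i * X_i (linear kernel only)
    return (alpha * Y) @ X

def predict(X_new, W, b):
    # decision rule sign(W . x + b), matching the constraint y_i(W . X_i + b) >= 1 - xi_i above
    return np.sign(X_new @ W + b)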
V-SVC
V-SVC
● In the previous formulation, the parameter C traded off (1) minimizing training errors against (2) maximizing the margin.
● Replace C by a parameter V (the Greek letter ν, "nu"), which controls the number of margin errors and support vectors.
● V is an upper bound on the fraction of margin errors (and hence on the training error rate) and a lower bound on the fraction of support vectors.
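In practice this formulation is available off the shelf; below is a minimal usage sketch with scikit-learn's NuSVC (assuming scikit-learn is installed), where the nu argument plays the role of the V parameter described above and the toy data is made up.

import numpy as np
from sklearn.svm import NuSVC

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.2, 0.9]])
y = np.array([-1, -1, 1, 1])

clf = NuSVC(nu=0.5, kernel='rbf', gamma='scale')   # nu bounds the fraction of margin errors
clf.fit(X, y)
print(clf.predict([[0.1, 0.0], [1.1, 1.0]]))       # expected: [-1  1]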
V-SVC
● The optimization problem becomes:
$\min_{W, \xi, b, \rho} \frac{1}{2} \lVert W \rVert^2 - V \rho + \frac{1}{n} \sum_{i=1}^{n} \xi_i$
s.t. $y_i (W \cdot X_i + b) \geq \rho - \xi_i$, $\xi_i \geq 0$, and $\rho \geq 0$.
V-SVC
● Using Lagrange multipliers, the dual problem is:
$\max_{\alpha \in \mathbb{R}^n} W(\alpha) = -\frac{1}{2} \sum_{i, j=1}^{n} \alpha_i \alpha_j Y_i Y_j k(X_i, X_j)$
s.t. $0 \leq \alpha_i \leq \frac{1}{n}$, $\sum_{i=1}^{n} \alpha_i Y_i = 0$, and $\sum_{i=1}^{n} \alpha_i \geq V$.
● The decision function is
$f(X) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i k(X, X_i) + b \right)$
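A minimal sketch of this decision function in NumPy; alpha, y, X and b are assumed to be outputs of a ν-SVC solver, and kernel can be any of the kernel functions sketched earlier.

import numpy as np

def decision(x_new, alpha, y, X, b, kernel):
    # f(x) = sgn( sum_i alpha_i * y_i * k(x, X_i) + b )
    s = sum(a * yi * kernel(x_new, xi) for a, yi, xi in zip(alpha, y, X))
    return np.sign(s + b)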
SMO Algorithm
SMO Algorithm
● Sequential Minimal Optimization (SMO) is an algorithm used to solve the quadratic programming problem.
● Algorithm (a simplified end-to-end sketch follows below):
1- select a pair of examples (details are coming).
2- optimize the target function with respect to the selected pair analytically.
3- repeat until the pair selected in step 1 is already optimal or the number of iterations exceeds a user-defined limit.
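As a concrete illustration of the loop above, here is a minimal NumPy sketch for the C-SVM dual with a linear kernel. It uses the well-known simplified SMO variant, which picks the second multiplier at random instead of using the working-set heuristic described on the following slides; all names and default parameters are illustrative only.

import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10):
    # Simplified SMO for the soft-margin SVM dual with a linear kernel.
    n = X.shape[0]
    K = X @ X.T                                       # linear kernel matrix
    alpha, b, passes = np.zeros(n), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]     # error E_i = f(x_i) - y_i
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai, aj = alpha[i], alpha[j]
                # Bounds keeping both multipliers in [0, C] with their linear relation fixed
                if y[i] != y[j]:
                    L, H = max(0.0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0.0, ai + aj - C), min(C, ai + aj)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]                    # curvature along the step (must be < 0)
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)    # analytic solution, clipped
                if abs(alpha[j] - aj) < 1e-5:
                    continue
                alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
                # Update the threshold b so the KKT conditions hold for i or j
                b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b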
SMO Algorithm
2- optimize the target function with respect to the selected pair analytically.
- the update of $\alpha_i$ and $\alpha_j$ depends on the difference between the approximation errors at $x_i$ and $x_j$, and on
$X = K_{ii} + K_{jj} - 2 Y_i Y_j K_{ij}$
Solve for two Lagrange multipliers
http://research.microsoft.com/pubs/68391/smo-book.pdf
Solve for two Lagrange multipliers
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf
Solve for two Lagrange multipliers
// Two-variable sub-problem for the case Y_i != Y_j (then Y_i*Y_j*K_ij = -K_ij,
// so X matches K_ii + K_jj - 2*Y_i*Y_j*K_ij from the previous slide).
// G[] is the gradient of the dual objective; C_i, C_j are the box constraints.
double X = Kii + Kjj + 2*Kij;
double delta = (-G[i] - G[j]) / X;   // unconstrained step along the feasible direction
double diff = alpha[i] - alpha[j];   // this difference is preserved by the update
alpha[i] += delta;
alpha[j] += delta;
// Clip the result back into the box [0, C_i] x [0, C_j]:
if (alpha[i] > C_i) { alpha[i] = C_i; alpha[j] = C_i - diff; }   // region I
if (alpha[j] > C_j) { alpha[j] = C_j; alpha[i] = C_j + diff; }   // region II
if (alpha[j] < 0)   { alpha[j] = 0;   alpha[i] = diff; }         // region III
if (alpha[i] < 0)   { alpha[i] = 0;   alpha[j] = -diff; }        // region IV
SMO Algorithm
● 1- select a pair of examples:
we need to find the pair (i, j) for which the difference between the classification errors,
$\left( f(x_i) - y_i \right) - \left( f(x_j) - y_j \right)$,
is maximum.
The pair is considered optimal if this difference is less than $2\xi$, where $\xi$ here denotes a small user-defined tolerance (not the slack variable).
SMO Algorithm
1- select a pair of examples (continued):
Define the following index sets:
$I_0 = \{ i : \alpha_i \in (0, C_i) \}$
$I_{+,0} = \{ i : \alpha_i = 0, y_i = 1 \}$, $I_{+,C} = \{ i : \alpha_i = C_i, y_i = 1 \}$
$I_{-,0} = \{ i : \alpha_i = 0, y_i = -1 \}$, $I_{-,C} = \{ i : \alpha_i = C_i, y_i = -1 \}$
Then pick $i = \arg\max_{i \in I_0 \cup I_{+,0} \cup I_{-,C}} \left( f(x_i) - y_i \right)$ (max difference)
and $j = \arg\min_{j \in I_0 \cup I_{-,0} \cup I_{+,C}} \left( f(x_j) - y_j \right)$ (min difference).
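A minimal NumPy sketch of this selection rule; alpha, y and f (the current decision values) are hypothetical inputs, C is a common box bound for all examples, and eps is the tolerance from the previous slide.

import numpy as np

def select_pair(alpha, y, f, C, eps=1e-3):
    E = f - y                                                              # approximation errors E_i = f(x_i) - y_i
    free = (alpha > 0) & (alpha < C)                                       # I_0
    up  = free | ((alpha == 0) & (y == 1)) | ((alpha == C) & (y == -1))    # I_0 u I_{+,0} u I_{-,C}
    low = free | ((alpha == 0) & (y == -1)) | ((alpha == C) & (y == 1))    # I_0 u I_{-,0} u I_{+,C}
    i = np.where(up)[0][np.argmax(E[up])]                                  # maximal E over the first set
    j = np.where(low)[0][np.argmin(E[low])]                                # minimal E over the second set
    optimal = (E[i] - E[j]) < 2 * eps                                      # stopping criterion
    return i, j, optimal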
SMO algorithm complexity
● Memory complexity: no additional matrix is required to solve the problem; only a 2×2 matrix is needed in each iteration.
● Memory complexity is therefore linear in the size of the training set.
● SMO's running time scales between linear and quadratic in the size of the training set.