The document provides an overview of support vector machines (SVMs), including:
1) SVMs are a supervised learning method for classification and regression based on Vapnik-Chervonenkis theory that aims to find a separating hyperplane with maximum margin between classes.
2) The SVM formulation can be expressed as an optimization problem to maximize the margin, which results in a convex quadratic problem.
3) The dual formulation allows SVMs to be applied to non-linearly separable data using kernel methods that implicitly map inputs to higher dimensional feature spaces.
Machine Learning Guide to Support Vector Machines
1. Support Vector Machines
(C) CDAC Mumbai Workshop on Machine Learning
Prakash B. Pimpale
CDAC Mumbai
2. Outline
o Introduction
o Towards SVM
o Basic Concept
o Implementations
o Issues
o Conclusion & References
3. Introduction:
o SVMs – supervised learning methods for
classification and regression
o Base: Vapnik-Chervonenkis theory
o First practical implementation: early nineties
o Satisfying from a theoretical point of view
o Can lead to high performance in practical
applications
o Currently considered one of the most efficient
families of algorithms in Machine Learning
4. Towards SVM
A: I found a really good function describing the training
examples using an ANN, but it couldn't classify test examples
that efficiently. What could be the problem?
B: It didn't generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM
1) generalizes well
And what's more….
2) is computationally efficient (just a convex optimization
problem)
3) is robust in high dimensions too (no overfitting)
5. A: Why is it so?
B: So many questions…!
o Vapnik & Chervonenkis Statistical Learning Theory result:
relates the ability to learn a rule for classifying training data to
the ability of the resulting rule to classify unseen examples
(Generalization)
o Let f be a rule, f ∈ F
o Empirical Risk of f: a measure of the quality of classification on
the training data
Best performance: R_emp(f) = 0
Worst performance: R_emp(f) = 1
6. What about the Generalization?
o Risk of a classifier: the probability that rule f makes a
mistake on a new sample randomly generated by the same
random process
R(f) = P(f(x) ≠ y)
Best Generalization: R(f) = 0
Worst Generalization: R(f) = 1
o Many times a small Empirical Risk implies a small Risk
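A minimal sketch (not from the slides; the rule and the data here are hypothetical) of what Empirical Risk means in code:

import numpy as np

def empirical_risk(f, X, y):
    # R_emp(f): fraction of training examples the rule f misclassifies
    return np.mean(f(X) != y)

# Hypothetical rule: classify by the sign of the first feature.
f = lambda X: np.sign(X[:, 0])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=100))

print(empirical_risk(f, X_train, y_train))  # a small R_emp alone says
# nothing about R(f) until the capacity of the class F is accounted for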
7. Is the problem solved? …….. NO!
o Is the Risk of the rule f_t selected by Empirical Risk Minimization
(ERM) near to that of the ideal rule f_i?
o No, not in the case of overfitting
o An important result of Statistical Learning Theory:
E[R(f_t)] ≤ R(f_i) + C √( V(F) / N )
where V(F) – VC dimension of the class F
N – number of observations for training
C – a universal constant
8. What it says:
o The Risk of the rule selected by ERM is not far from the Risk of
the ideal rule if:
1) N is large enough
2) the VC dimension of F is small enough
[VC dimension? In short, the larger the class F, the larger its VC dimension (Sorry Vapnik sir!)]
9. Structural Risk Minimization (SRM)
o Consider a nested family of subclasses of F s.t.
F_0 ⊂ F_1 ⊂ .......... ⊂ F_n ⊂ ......
V(F_0) ≤ V(F_1) ≤ .......... ≤ V(F_n) ≤ ......
o Find the minimum Empirical Risk for each subclass
and its VC dimension
o Select the subclass with the minimum bound on the Risk
(i.e. the sum of the VC dimension term and the empirical risk)
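A minimal sketch of this selection rule, assuming a hypothetical list of (fit, vc_dim) pairs where fit(X, y) returns the minimum empirical risk achievable within that subclass:

import math

def srm_select(classes, X, y, C=1.0):
    # Pick the subclass minimizing empirical risk + capacity penalty,
    # using the bound from slide 7 with the constant C as a knob.
    N = len(y)
    best, best_bound = None, float("inf")
    for fit, vc_dim in classes:
        bound = fit(X, y) + C * math.sqrt(vc_dim / N)
        if bound < best_bound:
            best, best_bound = (fit, vc_dim), bound
    return best, best_bound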
10. SRM Graphically:
[Figure: the bound E[R(f_t)] ≤ R(f_i) + C √( V(F) / N ) sketched against VC dimension – the empirical risk term falls while the capacity term grows, so the bound has a minimum in between.]
11. A: What does this have to do with SVM….?
B: SVM is an approximate implementation of SRM!
A: How?
B: Just in a simple way for now,
take one result on faith:
Maximizing the distance of the decision boundary from the
training points minimizes the VC dimension,
resulting in good generalization!
12. A: So from now on our target is maximizing the distance
between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well,
but can you please explain what the concept of
SVM is and how to implement it? Are there any
packages available?
B: Yeah, don't worry, there are many implementations
available, just use them for your application. The
next part of the presentation will give a basic idea
of SVM, so stay with me!
13. Basic Concept of SVM:
o Which line will classify the unseen data well?
o The dotted line! It's the line with the Maximum Margin!
14. Cont…
[Figure: the maximum-margin separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = +1 and W^T X + b = −1; the training points lying on the boundaries are the Support Vectors.]
15. Some definitions:
o Functional Margin:
w.r.t.
1) individual examples:
γ̂^(i) = Y^(i) ( W^T x^(i) + b )
2) the example set S = {(x^(i), Y^(i)); i = 1,....., m}:
γ̂ = min_{i=1,...,m} γ̂^(i)
o Geometric Margin:
w.r.t.
1) individual examples:
γ^(i) = Y^(i) ( (W / ||W||)^T x^(i) + b / ||W|| )
2) the example set S:
γ = min_{i=1,...,m} γ^(i)
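A minimal sketch (illustrative numbers, not from the slides) computing both margins with NumPy:

import numpy as np

W, b = np.array([2.0, 1.0]), -1.0
X = np.array([[1.0, 2.0], [-1.0, 0.5], [2.0, -0.5]])
Y = np.array([1.0, -1.0, 1.0])

functional = Y * (X @ W + b)                 # γ̂^(i) = Y^(i)(W^T x^(i) + b)
geometric = functional / np.linalg.norm(W)   # γ^(i) = γ̂^(i) / ||W||

print(functional.min(), geometric.min())     # margins w.r.t. the set S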
16. Problem Formulation:
[Figure: the separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = +1 and W^T X + b = −1 on either side.]
17. Cont..
o The distance of a point (u, v) from Ax + By + C = 0 is given by
|Au + Bv + C| / ||n||
where ||n|| is the norm of the normal vector n = (A, B)
o Distance of the hyperplane W^T X + b = 0 from the origin = |b| / ||W||
o Distance of point A from the origin = |b + 1| / ||W||
o Distance of point B from the origin = |b − 1| / ||W||
o Distance between points A and B (the Margin) = 2 / ||W||
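A quick numeric check of these distances (illustrative W and b):

import numpy as np

W, b = np.array([3.0, 4.0]), 2.0     # ||W|| = 5
norm = np.linalg.norm(W)

d_plus = abs(b - 1) / norm           # distance of W^T X + b = +1 from origin
d_minus = abs(b + 1) / norm          # distance of W^T X + b = -1 from origin
print(d_minus - d_plus, 2 / norm)    # both 0.4: the margin is 2 / ||W||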
18. Cont…
o We have a data set
{(X^(i), Y^(i))}, i = 1,...., m, with X^(i) ∈ R^d and Y^(i) ∈ {−1, +1}
and a separating hyperplane
W^T X + b = 0
s.t.
W^T X^(i) + b > 0 if Y^(i) = +1
W^T X^(i) + b < 0 if Y^(i) = −1
19. Cont…
o Suppose the training data also satisfy the following constraints:
W^T X^(i) + b ≥ +1 for Y^(i) = +1
W^T X^(i) + b ≤ −1 for Y^(i) = −1
Combining these into one:
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
o Our objective is to find the hyperplane (W, b) with maximal
separation between it and the closest data points while satisfying
the above constraints
20. THE PROBLEM:
max_{W,b} 2 / ||W||
such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
Also we know
||W|| = √(W^T W)
21. Cont..
So the problem can be written as:
min_{W,b} (1/2) W^T W
Such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
Notice: W^T W = ||W||^2
It is just a convex quadratic optimization problem!
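A minimal sketch of this QP on toy data, assuming the cvxopt package is available; the variable is z = (W, b) and the constraints are rewritten in the Gz ≤ h form that cvxopt's qp() expects:

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

P = matrix(np.diag([1.0] * d + [0.0]))   # (1/2) W^T W, no penalty on b
q = matrix(np.zeros(d + 1))
G = matrix(-Y[:, None] * np.hstack([X, np.ones((m, 1))]))  # Y_i(W^T X_i + b) >= 1
h = matrix(-np.ones(m))

sol = solvers.qp(P, q, G, h)
W, b = np.array(sol["x"])[:d].ravel(), sol["x"][d]
print(W, b)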
22. DUAL
o Solving the dual of our problem will let us apply SVM to
nonlinearly separable data efficiently
o It can be shown that
min(primal) = max_α ( min_{W,b} L(W, b, α) )
o Primal problem:
min_{W,b} (1/2) W^T W
Such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
23. Constructing the Lagrangian
o Lagrangian for our problem:
L(W, b, α) = (1/2) ||W||^2 − Σ_{i=1}^{m} α_i [ Y^(i) (W^T X^(i) + b) − 1 ]
where α_i ≥ 0 is a Lagrange multiplier
o Now, minimizing it w.r.t. W and b:
we set the derivatives of the Lagrangian w.r.t. W and b to zero
24. Cont…
o Setting the derivative w.r.t. W to zero gives:
W − Σ_{i=1}^{m} α_i Y^(i) X^(i) = 0
i.e.
W = Σ_{i=1}^{m} α_i Y^(i) X^(i)
o Setting the derivative w.r.t. b to zero gives:
Σ_{i=1}^{m} α_i Y^(i) = 0
25. Cont…
o Plugging these results into the Lagrangian gives
L(W, b, α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)
o Call it
D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)
o This is the result of our minimization w.r.t. W and b
26. So The DUAL:
o Now the Dual becomes:
max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j Y^(i) Y^(j) (X^(i))^T X^(j)
s.t.
α_i ≥ 0, i = 1,...., m
Σ_{i=1}^{m} α_i Y^(i) = 0
o Solving this optimization problem gives us the α_i
o Also the Karush-Kuhn-Tucker (KKT) condition is
satisfied at this solution, i.e.
α_i [ Y^(i) (W^T X^(i) + b) − 1 ] = 0, for i = 1,..., m
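A minimal sketch solving this dual on the same toy data (again assuming cvxopt); qp() minimizes, so the sign of the objective is flipped:

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(Y)

K = X @ X.T                               # Gram matrix (X^(i))^T X^(j)
P = matrix(np.outer(Y, Y) * K)            # P_ij = Y_i Y_j X_i^T X_j
q = matrix(-np.ones(m))                   # turns max into min
G, h = matrix(-np.eye(m)), matrix(np.zeros(m))   # α_i >= 0
A, c = matrix(Y[None, :]), matrix(0.0)           # Σ α_i Y_i = 0

alpha = np.array(solvers.qp(P, q, G, h, A, c)["x"]).ravel()
print(alpha)    # nonzero α_i mark the support vectors (KKT condition)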
27. Values of W and b:
o W can be found using
W* = Σ_{i=1}^{m} α_i Y^(i) X^(i)
o b can be found using:
b* = − ( max_{i: Y^(i) = −1} W*^T X^(i) + min_{i: Y^(i) = +1} W*^T X^(i) ) / 2
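Continuing the dual sketch above, W* and b* follow directly from α:

W = (alpha * Y) @ X          # W* = Σ α_i Y^(i) X^(i)
b = -(np.max(X[Y == -1] @ W) + np.min(X[Y == 1] @ W)) / 2
print(W, b)                  # should agree with the primal solution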
28. What if data is nonlinearly separable?
o The maximal margin hyperplane can classify only linearly separable data
o What if the data is not linearly separable?
o Take your data to a space where it is linearly separable (a higher dimensional space) and use the maximal margin hyperplane there!
29. Taking it to higher dimension works!
Ex. XOR
30. Doing it in higher dimensional space
o Let Φ : X → F be a non-linear mapping from the input
space X (the original space) to the feature space (higher
dimensional) F
o Then our inner (dot) product ⟨X^(i), X^(j)⟩ in the higher
dimensional space is ⟨φ(X^(i)), φ(X^(j))⟩
o Now the problem becomes:
max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j Y^(i) Y^(j) ⟨φ(X^(i)), φ(X^(j))⟩
s.t.
α_i ≥ 0, i = 1,...., m
Σ_{i=1}^{m} α_i Y^(i) = 0
31. Kernel function:
o There exists a way to compute the inner product in the feature
space as a function of the original input points – it's the kernel
function!
o Kernel function:
K(x, z) = ⟨φ(x), φ(z)⟩
o We need not know φ to compute K(x, z)
32. An example:
Let x, z ∈ R^n and K(x, z) = (x^T z)^2
i.e. K(x, z) = (Σ_{i=1}^{n} x_i z_i)(Σ_{j=1}^{n} x_j z_j)
= Σ_{i=1}^{n} Σ_{j=1}^{n} (x_i x_j)(z_i z_j)
For n = 3, the feature mapping φ is given as:
φ(x) = (x_1x_1, x_1x_2, x_1x_3, x_2x_1, x_2x_2, x_2x_3, x_3x_1, x_3x_2, x_3x_3)^T
so that K(x, z) = ⟨φ(x), φ(z)⟩
33. example cont…
o Here, for x = [1 2]^T and z = [3 4]^T:
K(x, z) = (x^T z)^2 = (1·3 + 2·4)^2 = 11^2 = 121
φ(x) = (x_1x_1, x_1x_2, x_2x_1, x_2x_2)^T = (1, 2, 2, 4)^T
φ(z) = (9, 12, 12, 16)^T
φ(x)^T φ(z) = 9 + 24 + 24 + 64 = 121
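A quick check of this arithmetic, and of the point of slide 31 – the kernel value needs no explicit φ:

import numpy as np

def phi(x):
    # explicit feature map: all pairwise products x_i * x_j
    return np.outer(x, x).ravel()

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print((x @ z) ** 2)       # 121.0 – kernel trick, O(n) work
print(phi(x) @ phi(z))    # 121.0 – explicit map, O(n^2) work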
34. So our SVM for the non-linearly separable data:
o Optimization problem:
max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j Y^(i) Y^(j) K(X^(i), X^(j))
s.t.
α_i ≥ 0, i = 1,...., m
Σ_{i=1}^{m} α_i Y^(i) = 0
o Decision function:
F(X) = Sign( Σ_{i=1}^{m} α_i Y^(i) K(X^(i), X) + b )
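A minimal sketch of the decision function, assuming α and b have already been obtained from the kernelized dual:

import numpy as np

def decision(x_new, X, Y, alpha, b, K):
    # F(x_new) = Sign( Σ α_i Y^(i) K(X^(i), x_new) + b )
    s = sum(a * y * K(x_i, x_new) for a, y, x_i in zip(alpha, Y, X))
    return np.sign(s + b)

K = lambda x, z: (x @ z) ** 2    # e.g. the quadratic kernel from slide 32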
35. Some commonly used Kernel functions:
o Linear:
K(X, Y) = X^T Y
o Polynomial of degree d:
K(X, Y) = (X^T Y + 1)^d
o Gaussian Radial Basis Function (RBF):
K(X, Y) = exp( −||X − Y||^2 / (2σ^2) )
o Tanh kernel:
K(X, Y) = tanh( ρ (X^T Y) − δ )
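Minimal NumPy sketches of the four kernels above, for vectors X and Y:

import numpy as np

linear = lambda X, Y: X @ Y
poly   = lambda X, Y, d=3: (X @ Y + 1) ** d
rbf    = lambda X, Y, sigma=1.0: np.exp(-np.sum((X - Y) ** 2) / (2 * sigma ** 2))
tanh_k = lambda X, Y, rho=1.0, delta=0.0: np.tanh(rho * (X @ Y) - delta)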
36. Implementations:
Some ready-to-use available SVM implementations:
1) LIBSVM: A library for SVM by Chih-Chung Chang and
Chih-Jen Lin
(at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: An implementation in C by Thorsten
Joachims
(at: http://svmlight.joachims.org/ )
3) Weka: A Data Mining Software in Java by the University
of Waikato
(at: http://www.cs.waikato.ac.nz/ml/weka/ )
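A minimal usage sketch; this assumes scikit-learn, whose SVC class wraps the LIBSVM library listed above:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # XOR-like toy data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.predict([[0.9, 0.1]]))   # a nonlinear kernel handles XOR
print(clf.support_)                # indices of the support vectors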
37. Issues:
o Selecting a suitable kernel: most of the time it is trial
and error
o Multiclass classification: one decision function for
each class (1 vs the other l−1), then pick the one with the maximum
value; i.e. if X belongs to class 1, then for this and the
other (l−1) classes the values of the decision functions are:
F_1(X) ≥ +1
F_2(X) ≤ −1
.
.
F_l(X) ≤ −1
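A minimal sketch of this one-vs-rest rule, with a hypothetical list of per-class decision functions:

import numpy as np

def predict_multiclass(x, decision_functions):
    # decision_functions[k](x): score of class k vs. the other l-1 classes
    scores = [f(x) for f in decision_functions]
    return int(np.argmax(scores))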
38. Cont….
o Sensitive to noise: mislabeled data can badly affect
the performance
o Good performance for applications like:
1) computational biology and medical applications
(protein and cancer classification problems)
2) image classification
3) hand-written character recognition
and many others…..
o Use SVM for high dimensional, linearly separable
data (its strength); for nonlinearly separable data, performance
depends on the choice of kernel
39. Conclusion:
Support Vector Machines provide a very
simple method for linear classification. But
performance, in the case of nonlinearly separable
data, largely depends on the choice of kernel!
40. References:
o Nello Cristianini and John Shawe-Taylor (2000).
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.
Cambridge University Press.
o Christopher J.C. Burges (1998).
A tutorial on Support Vector Machines for pattern recognition.
In Usama Fayyad, editor, Data Mining and Knowledge Discovery, 2, 121-167.
Kluwer Academic Publishers, Boston.
o Andrew Ng (2007).
CS229 Lecture Notes.
Stanford Engineering Everywhere, Stanford University.
o Support Vector Machines <http://www.svms.org> (Accessed 10.11.2008)
o Wikipedia
o Kernel-Machines.org <http://www.kernel-machines.org> (Accessed 10.11.2008)