Support Vector Machines for Classification

Support Vector Machines
(C) CDAC Mumbai Workshop on Machine Learning
Prakash B. Pimpale
CDAC Mumbai

Outline
Introduction
Towards SVM
Basic Concept
Implementations
Issues
Conclusion & References

Introduction:
SVMs: a family of supervised learning methods for
classification and regression
Base: Vapnik-Chervonenkis theory
First practical implementation: early nineties
Satisfying from a theoretical point of view
Can lead to high performance in practical
applications
Currently considered one of the most efficient
families of algorithms in Machine Learning

Towards SVM
A: I found a really good function describing the training
examples using an ANN, but it couldn’t classify test examples
that well. What could be the problem?
B: It didn’t generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM
1) Generalises well
And what's more...
2) Computationally efficient (training is just a convex optimization
problem)
3) Robust in high dimensions also (less prone to overfitting)

A: Why is it so?
B: So many questions…??
Vapnik & Chervonenkis Statistical Learning Theory result:
Relates the ability to learn a rule for classifying training data to
the ability of the resulting rule to classify unseen examples
(generalization)
Let $f$ be a rule, $f \in F$
Empirical Risk of $f$: measure of the quality of classification on
the training data
Best performance: $R_{emp}(f) = 0$
Worst performance: $R_{emp}(f) = 1$
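As a minimal sketch, the empirical risk can be read as the fraction of training examples a rule misclassifies. The rule names and the data below are made up for illustration and are not part of the original slides.

```python
def empirical_risk(rule, samples):
    """Fraction of (x, y) training samples that the rule misclassifies."""
    mistakes = sum(1 for x, y in samples if rule(x) != y)
    return mistakes / len(samples)


def sign_rule(x):
    """Hypothetical rule f: predict +1 for non-negative inputs, else -1."""
    return 1 if x >= 0 else -1


def flipped_rule(x):
    """The same rule with every prediction inverted."""
    return -sign_rule(x)


# Made-up 1-D training data: (input, label)
training = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.7, 1)]

print(empirical_risk(sign_rule, training))     # best performance: 0.0
print(empirical_risk(flipped_rule, training))  # worst performance: 1.0
```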

What about the Generalization?
Risk of classifier: probability that rule $f$ makes a
mistake on a new sample randomly generated by the
random machine
$$R(f) = P(f(x) \neq y)$$
Best generalization: $R(f) = 0$
Worst generalization: $R(f) = 1$
Many times a small Empirical Risk implies a small Risk

Is the problem solved? …….. NO!
Is the Risk of the rule $f_t$ selected by Empirical Risk Minimization
(ERM) near to that of the ideal rule $f_i$?
No, not in case of overfitting
Important result of Statistical Learning Theory:
$$E\,R(f_t) \leq R(f_i) + C \sqrt{\frac{V(F)}{N}}$$
Where, $V(F)$: VC dimension of class $F$
$N$: number of observations for training
$C$: universal constant

What it says:
The Risk of the rule selected by ERM is not far from the Risk of
the ideal rule if
1) N is large enough
2) the VC dimension of F is small enough
[VC dimension? In short, the larger a class F, the larger its VC dimension (Sorry Vapnik sir!)]

Structural Risk Minimization (SRM)
Consider a family of F =>
$$F_0 \subset F_1 \subset \dots \subset F_n \subset \dots \subset F$$
s.t.
$$V(F_0) \leq V(F_1) \leq \dots \leq V(F_n) \leq \dots \leq V(F)$$
Find the minimum Empirical Risk for each subclass
and its VC dimension
Select the subclass with the minimum bound on the Risk
(i.e. the sum of the empirical risk and the VC-dimension term)
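The selection step can be sketched as follows. This is only an illustration: the penalty form `c * sqrt(V / N)`, the constant `c = 1.0`, and all numbers in `candidates` are assumptions chosen for the example, not values from the slides.

```python
import math


def srm_select(subclasses, n_samples, c=1.0):
    """Pick the subclass minimizing empirical risk + c * sqrt(V / N)."""
    def bound(entry):
        _name, emp_risk, vc_dim = entry
        return emp_risk + c * math.sqrt(vc_dim / n_samples)
    return min(subclasses, key=bound)


# (name, minimum empirical risk, VC dimension): made-up illustrative values
candidates = [("F0", 0.30, 2), ("F1", 0.10, 10), ("F2", 0.02, 200)]

best = srm_select(candidates, n_samples=1000)
print(best[0])  # F1: neither the simplest nor the richest subclass
```

With far more training data the penalty term shrinks, so a richer subclass with lower empirical risk would be selected instead.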

SRM Graphically:
[Figure: SRM illustrated graphically, together with the bound]
$$E\,R(f_t) \leq R(f_i) + C \sqrt{\frac{V(F)}{N}}$$

A: What does this have to do with SVM?
B: SVM is an approximate implementation of SRM!
A: How?
B: In a simple way, for now,
just import a result:
Maximizing the distance of the decision boundary from the
training points minimizes the VC dimension,
resulting in good generalization!

A: So from now on our target is maximizing the distance
between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well,
but can you please explain the concept of
SVM and how to implement it? Are there any
packages available?
B: Yeah, don’t worry, there are many implementations
available, just use them for your application. The
next part of the presentation gives a basic idea
of SVM, so stay with me!

Basic Concept of SVM:
Which line will classify the unseen data well?
The dotted line! It is the line with the Maximum Margin!

Cont…
[Figure: separating hyperplane $W^T X + b = 0$ with margin boundaries $W^T X + b = +1$ and $W^T X + b = -1$; the support vectors are labeled on both sides]

Some definitions:
Functional Margin:
w.r.t.
1) individual examples: $\hat{\gamma}^{(i)} = y^{(i)} (W^T x^{(i)} + b)$
2) example set $S = \{(x^{(i)}, y^{(i)});\ i = 1, \dots, m\}$: $\hat{\gamma} = \min_{i=1,\dots,m} \hat{\gamma}^{(i)}$
Geometric Margin:
w.r.t.
1) individual examples: $\gamma^{(i)} = y^{(i)} \left( \left( \frac{W}{\|W\|} \right)^T x^{(i)} + \frac{b}{\|W\|} \right)$
2) example set S: $\gamma = \min_{i=1,\dots,m} \gamma^{(i)}$
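A small sketch of the two margins, assuming a hypothetical hyperplane and a made-up example set (neither comes from the slides). The geometric margin is the functional margin divided by ||W||.

```python
import math


def functional_margin(w, b, x, y):
    """y * (w . x + b) for a single example."""
    return y * (sum(wk * xk for wk, xk in zip(w, x)) + b)


def geometric_margin(w, b, x, y):
    """Functional margin scaled by 1 / ||w||."""
    norm = math.sqrt(sum(wk * wk for wk in w))
    return functional_margin(w, b, x, y) / norm


# Hypothetical hyperplane with ||w|| = 5 and a made-up example set S
w, b = [3.0, 4.0], -5.0
S = [([3.0, 4.0], 1), ([0.0, 0.0], -1), ([1.0, 3.0], 1)]

gamma_hat = min(functional_margin(w, b, x, y) for x, y in S)
gamma = min(geometric_margin(w, b, x, y) for x, y in S)
print(gamma_hat, gamma)  # 5.0 1.0
```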

Problem Formulation:
[Figure: hyperplane $W^T X + b = 0$ with margin boundaries $W^T X + b = +1$ and $W^T X + b = -1$]

Cont..
Distance of a point (u, v) from the line Ax+By+C=0 is given by
|Au+Bv+C|/||n||
where ||n|| is the norm of the vector n = (A, B)
Distance of the hyperplane from the origin = $\frac{b}{\|W\|}$
Distance of point A from the origin = $\frac{b+1}{\|W\|}$
Distance of point B from the origin = $\frac{b-1}{\|W\|}$
Distance between points A and B (Margin) = $\frac{2}{\|W\|}$
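The distance formula and the resulting margin width can be checked numerically. The hyperplane and the test point below are hypothetical values picked so the arithmetic is easy to follow.

```python
import math


def norm(w):
    return math.sqrt(sum(wk * wk for wk in w))


def distance_to_hyperplane(w, b, point):
    """|w . point + b| / ||w||, the point-to-hyperplane distance."""
    return abs(sum(wk * pk for wk, pk in zip(w, point)) + b) / norm(w)


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / norm(w)


w, b = [3.0, 4.0], -5.0           # hypothetical hyperplane, ||w|| = 5
on_plus_one = [2.0, 0.0]          # satisfies w.x + b = +1

print(distance_to_hyperplane(w, b, [0.0, 0.0]))       # 1.0
# Distance from a point on the +1 boundary to the -1 boundary (w.x + b + 1 = 0)
print(distance_to_hyperplane(w, b + 1, on_plus_one))  # 0.4
print(margin_width(w))                                # 0.4
```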

Cont…
We have a data set
$$\{X^{(i)}, Y^{(i)}\},\ i = 1, \dots, m, \quad X \in R^d \text{ and } Y \in R^1$$
separating hyperplane
$$W^T X + b = 0$$
s.t.
$$W^T X^{(i)} + b > 0 \text{ if } Y^{(i)} = +1$$
$$W^T X^{(i)} + b < 0 \text{ if } Y^{(i)} = -1$$

Cont…
Suppose the training data also satisfy the following constraints,
$$W^T X^{(i)} + b \geq +1 \text{ for } Y^{(i)} = +1$$
$$W^T X^{(i)} + b \leq -1 \text{ for } Y^{(i)} = -1$$
Combining these into one,
$$Y^{(i)} (W^T X^{(i)} + b) \geq 1 \text{ for } \forall i$$
Our objective is to find the hyperplane (W, b) with maximal
separation between it and the closest data points while satisfying
the above constraints
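A quick way to see the combined constraint at work is to test a candidate (W, b) against every training point. The two-point data set and both candidate hyperplanes are made up for this sketch.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def satisfies_margin_constraints(w, b, data):
    """True if Y * (W . X + b) >= 1 holds for every (X, Y) pair."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in data)


# Made-up training set: (X, Y) with Y in {+1, -1}
data = [([2.0, 2.0], 1), ([0.0, 0.0], -1)]

print(satisfies_margin_constraints([0.5, 0.5], -1.0, data))  # True
print(satisfies_margin_constraints([0.1, 0.1], -1.0, data))  # False
```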

THE PROBLEM:
$$\max_{W, b} \frac{2}{\|W\|}$$
such that
$$Y^{(i)} (W^T X^{(i)} + b) \geq 1 \text{ for } \forall i$$
Also we know
$$\|W\| = \sqrt{W^T W}$$

Cont..
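So the problem can be written as:
$$\min_{W, b} \frac{1}{2} W^T W$$
such that
$$Y^{(i)} (W^T X^{(i)} + b) \geq 1 \text{ for } \forall i$$
Notice: $W^T W = \|W\|^2$, so maximizing $\frac{2}{\|W\|}$ is the same as minimizing $\frac{1}{2} W^T W$
It is just a convex quadratic optimization problem!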

DUAL
Solving the dual of our problem will let us apply SVM to
nonlinearly separable data efficiently
It can be shown that
$$\min \text{primal} = \min_{W, b} \left( \max_{\alpha \geq 0} L(W, b, \alpha) \right)$$
Primal problem:
$$\min_{W, b} \frac{1}{2} W^T W$$
such that
$$Y^{(i)} (W^T X^{(i)} + b) \geq 1 \text{ for } \forall i$$

Constructing Lagrangian
Lagrangian for our problem:
$$L(W, b, \alpha) = \frac{1}{2} \|W\|^2 - \sum_{i=1}^{m} \alpha_i \left[ Y^{(i)} (W^T X^{(i)} + b) - 1 \right]$$
where $\alpha$ is a Lagrange multiplier and $\alpha_i \geq 0$
Now minimizing it w.r.t. W and b:
We set the derivatives of the Lagrangian w.r.t. W and b to zero

Cont…
Setting the derivative w.r.t. W to zero gives:
$$W - \sum_{i=1}^{m} \alpha_i Y^{(i)} X^{(i)} = 0$$
i.e.
$$W = \sum_{i=1}^{m} \alpha_i Y^{(i)} X^{(i)}$$
Setting the derivative w.r.t. b to zero gives:
$$\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$$

Cont…
Plugging these results into the Lagrangian gives
$$L(W, b, \alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j (X^{(i)})^T X^{(j)}$$
Call it
$$D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j (X^{(i)})^T X^{(j)}$$
This is the result of our minimization w.r.t. W and b
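The dual objective D(α) is simple to evaluate directly. The sketch below uses a made-up two-point training set; for it, the constraint Σ α_i Y^(i) = 0 forces α_1 = α_2 = a, and D reduces to 2a − 4a², which peaks at a = 0.25.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def dual_objective(alphas, data):
    """D(alpha) = sum_i alpha_i - 1/2 sum_ij Y_i Y_j alpha_i alpha_j X_i . X_j"""
    linear = sum(alphas)
    quadratic = sum(
        yi * yj * ai * aj * dot(xi, xj)
        for ai, (xi, yi) in zip(alphas, data)
        for aj, (xj, yj) in zip(alphas, data)
    )
    return linear - 0.5 * quadratic


# Made-up training set: (X, Y)
data = [([2.0, 2.0], 1), ([0.0, 0.0], -1)]

print(dual_objective([0.25, 0.25], data))  # 0.25, the maximum for this data
print(dual_objective([0.5, 0.5], data))    # 0.0, a worse feasible choice
```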

So The DUAL:
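Now the dual becomes:
$$\max_{\alpha} D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \langle X^{(i)}, X^{(j)} \rangle$$
s.t.
$$\alpha_i \geq 0,\ i = 1, \dots, m$$
$$\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$$
Solving this optimization problem gives us $\alpha_i$
Also the Karush-Kuhn-Tucker (KKT) condition is
satisfied at this solution, i.e.
$$\alpha_i \left[ Y^{(i)} (W^T X^{(i)} + b) - 1 \right] = 0 \text{ for } i = 1, \dots, m$$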
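The KKT condition says a multiplier can be non-zero only for points lying exactly on the margin (the support vectors). A minimal check, using a made-up three-point data set and hand-picked values of α, W and b that are assumptions for this illustration:

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def kkt_complementary_slackness(alphas, w, b, data, tol=1e-9):
    """True if alpha_i * [Y_i (W . X_i + b) - 1] is (numerically) 0 for all i."""
    return all(
        abs(a * (y * (dot(w, x) + b) - 1.0)) <= tol
        for a, (x, y) in zip(alphas, data)
    )


# Made-up data: the third point lies well outside the margin
data = [([2.0, 2.0], 1), ([0.0, 0.0], -1), ([4.0, 4.0], 1)]
w, b = [0.5, 0.5], -1.0

print(kkt_complementary_slackness([0.25, 0.25, 0.0], w, b, data))  # True
print(kkt_complementary_slackness([0.25, 0.25, 0.1], w, b, data))  # False
```

In the second call the non-support vector carries a non-zero multiplier, so the condition fails.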

Values of W and b:
W can be found using
$$W = \sum_{i=1}^{m} \alpha_i Y^{(i)} X^{(i)}$$
b can be found using:
$$b^* = -\frac{\max_{i: Y^{(i)} = -1} W^{*T} X^{(i)} + \min_{i: Y^{(i)} = 1} W^{*T} X^{(i)}}{2}$$
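Both formulas translate directly into code. The data set and the multipliers below are made-up illustrative values (the multipliers were worked out by hand for this toy set); the function names are hypothetical.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def recover_w(alphas, data):
    """W = sum_i alpha_i * Y_i * X_i"""
    w = [0.0] * len(data[0][0])
    for a, (x, y) in zip(alphas, data):
        for k, xk in enumerate(x):
            w[k] += a * y * xk
    return w


def recover_b(w, data):
    """b = -(max over negatives of W.X + min over positives of W.X) / 2"""
    max_neg = max(dot(w, x) for x, y in data if y == -1)
    min_pos = min(dot(w, x) for x, y in data if y == 1)
    return -(max_neg + min_pos) / 2.0


# Made-up data and hand-computed multipliers for it
data = [([2.0, 2.0], 1), ([0.0, 0.0], -1), ([4.0, 4.0], 1)]
alphas = [0.25, 0.25, 0.0]

w = recover_w(alphas, data)
b = recover_b(w, data)
print(w, b)  # [0.5, 0.5] -1.0
```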

What if data is nonlinearly separable?
The maximal margin
hyperplane can classify
only linearly separable
data
What if the data is not linearly
separable?
Take your data to a higher
dimensional space where it is linearly
separable and use the maximal margin
hyperplane there!

Taking it to higher dimension works!
Ex. XOR
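A sketch of the XOR case, under assumptions that are not taken from the slides: inputs are encoded as ±1, the label is +1 when the two inputs differ, and the hypothetical mapping `phi` appends the product x1*x2 as a third coordinate. In that 3-D space a single hyperplane separates the classes.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def phi(x):
    """Assumed mapping to 3-D: (x1, x2) -> (x1, x2, x1 * x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# XOR with +/-1 encoding: label +1 when the inputs differ
xor_data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

# Hand-picked hyperplane in the 3-D feature space
w, b = (0.0, 0.0, -1.0), 0.0

for x, y in xor_data:
    score = dot(w, phi(x)) + b
    print(x, y, score)  # label and score always share the same sign
```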

Doing it in higher dimensional space
Let $\Phi: X \to F$ be a nonlinear mapping from the input
space X (original space) to the feature space (higher
dimensional) F
Then our inner (dot) product $\langle X^{(i)}, X^{(j)} \rangle$ in the higher
dimensional space is $\langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle$
Now, the problem becomes:
$$\max_{\alpha} D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j \langle \phi(X^{(i)}), \phi(X^{(j)}) \rangle$$
s.t.
$$\alpha_i \geq 0,\ i = 1, \dots, m$$
$$\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$$

Kernel function:
There exists a way to compute the inner product in feature
space as a function of the original input points: it is the kernel
function!
$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$
We need not know $\phi$ to compute $K(x, z)$

An example:
Let $x, z \in R^n$ and
$$K(x, z) = (x^T z)^2$$
i.e.
$$K(x, z) = \left( \sum_{i=1}^{n} x_i z_i \right) \left( \sum_{j=1}^{n} x_j z_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j z_i z_j = \sum_{i,j=1}^{n} (x_i x_j)(z_i z_j)$$
For n=3, the feature mapping $\phi$ is given as:
$$\phi(x) = [x_1 x_1,\ x_1 x_2,\ x_1 x_3,\ x_2 x_1,\ x_2 x_2,\ x_2 x_3,\ x_3 x_1,\ x_3 x_2,\ x_3 x_3]^T$$
so that
$$K(x, z) = \langle \phi(x), \phi(z) \rangle$$

example cont…
Here,
$$K(x, z) = (x^T z)^2$$
for
$$x = [1\ \ 2]^T, \quad z = [3\ \ 4]^T$$
$$\phi(x) = [x_1 x_1,\ x_1 x_2,\ x_2 x_1,\ x_2 x_2]^T = [1,\ 2,\ 2,\ 4]^T$$
$$\phi(z) = [9,\ 12,\ 12,\ 16]^T$$
Computing the kernel in the input space:
$$x^T z = [1\ \ 2]\,[3\ \ 4]^T = 11, \quad K(x, z) = (x^T z)^2 = 121$$
Computing the inner product in the feature space:
$$\phi(x)^T \phi(z) = [1\ \ 2\ \ 2\ \ 4]\,[9\ \ 12\ \ 12\ \ 16]^T = 121$$
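The worked example can be reproduced in a few lines. The function names `phi` and `kernel` are just labels chosen for this sketch.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def phi(x):
    """Explicit feature map: all ordered pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]


def kernel(x, z):
    """K(x, z) = (x^T z)^2, computed without building phi."""
    return dot(x, z) ** 2


x, z = [1, 2], [3, 4]
print(phi(x))                  # [1, 2, 2, 4]
print(phi(z))                  # [9, 12, 12, 16]
print(kernel(x, z))            # 121
print(dot(phi(x), phi(z)))     # 121
```

The kernel needs n multiplications, while the explicit feature map needs n² products per vector, which is the practical reason for working with K directly.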

So our SVM for the non-linearly
separable data:
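Optimization problem:
$$\max_{\alpha} D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} Y^{(i)} Y^{(j)} \alpha_i \alpha_j K(X^{(i)}, X^{(j)})$$
s.t.
$$\alpha_i \geq 0,\ i = 1, \dots, m$$
$$\sum_{i=1}^{m} \alpha_i Y^{(i)} = 0$$
Decision function
$$F(X) = \mathrm{Sign}\left( \sum_{i=1}^{m} \alpha_i Y^{(i)} K(X^{(i)}, X) + b \right)$$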
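A sketch of the kernelized decision function on XOR-style data. The ±1 encoding, the quadratic kernel, and the values α_i = 0.125, b = 0 are assumptions for this illustration; the multipliers were worked out by hand so that every training point sits exactly on the margin.

```python
def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def quadratic_kernel(x, z):
    """K(x, z) = (x^T z)^2"""
    return dot(x, z) ** 2


def decision_function(x, alphas, data, b, kernel):
    """F(X) = Sign(sum_i alpha_i * Y_i * K(X_i, X) + b)"""
    score = sum(a * y * kernel(xi, x) for a, (xi, y) in zip(alphas, data)) + b
    return 1 if score >= 0 else -1


# XOR with +/-1 encoding (label +1 when the inputs differ)
data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
alphas = [0.125, 0.125, 0.125, 0.125]   # hand-computed for this toy set
b = 0.0

for x, y in data:
    print(x, y, decision_function(x, alphas, data, b, quadratic_kernel))

# A new point whose coordinates have opposite signs is classified as +1
print(decision_function((2, -3), alphas, data, b, quadratic_kernel))  # 1
```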

Some commonly used Kernel functions:
Linear: $K(X, Y) = X^T Y$
Polynomial of degree d: $K(X, Y) = (X^T Y + 1)^d$
Gaussian Radial Basis Function (RBF): $K(X, Y) = e^{-\frac{\|X - Y\|^2}{2 \sigma^2}}$
Tanh kernel: $K(X, Y) = \tanh(\rho (X^T Y) - \delta)$
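These four kernels can be written as plain functions. The parameter names (`d`, `sigma`, `rho`, `delta`) follow the formulas; the default values are arbitrary choices for this sketch.

```python
import math


def dot(u, v):
    return sum(uk * vk for uk, vk in zip(u, v))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    return (dot(x, y) + 1) ** d


def rbf_kernel(x, y, sigma=1.0):
    squared_distance = sum((xk - yk) ** 2 for xk, yk in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def tanh_kernel(x, y, rho=0.1, delta=1.0):
    return math.tanh(rho * dot(x, y) - delta)


x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))           # 11.0
print(polynomial_kernel(x, y, d=2))  # 144.0
print(rbf_kernel(x, x))              # 1.0 (identical points)
print(tanh_kernel(x, y))
```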

Implementations:
Some ready-to-use SVM implementations:
1) LIBSVM: A library for SVM by Chih-Chung Chang and
Chih-Jen Lin
(at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVM light: An implementation in C by Thorsten
Joachims
(at: http://svmlight.joachims.org/ )
3) Weka: A data mining software in Java by the University
of Waikato
(at: http://www.cs.waikato.ac.nz/ml/weka/ )

Issues:
Selecting a suitable kernel: most of the time it is trial
and error
Multiclass classification: one decision function for
each class (1 vs l-1) and then finding the one with the max
value, i.e. if X belongs to class 1, then for this and the
other (l-1) classes the values of the decision functions are:
$$F_1(X) \geq +1$$
$$F_2(X) \leq -1$$
$$\vdots$$
$$F_l(X) \leq -1$$
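The "pick the class with the maximum decision value" step can be sketched as below. The three linear decision functions are hypothetical stand-ins for trained one-vs-rest SVMs; their weights are made up so that each class scores at least +1 on its own region and at most -1 elsewhere.

```python
def predict_class(decision_functions, x):
    """Return the 1-based index of the class with the largest decision value."""
    scores = [f(x) for f in decision_functions]
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best + 1, scores


# Hypothetical decision functions F_1, F_2, F_3 (not trained on real data)
decision_functions = [
    lambda x: x[0] - 1.0,                  # F_1
    lambda x: x[1] - 1.0,                  # F_2
    lambda x: -0.5 * (x[0] + x[1]) - 1.0,  # F_3
]

print(predict_class(decision_functions, (2.0, 0.0)))    # (1, [1.0, -1.0, -2.0])
print(predict_class(decision_functions, (0.0, 2.0)))    # (2, [-1.0, 1.0, -2.0])
print(predict_class(decision_functions, (-2.0, -2.0)))  # (3, [-3.0, -3.0, 1.0])
```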

Cont….
Sensitive to noise: mislabeled data can badly affect
the performance
Good performance for applications like
1) computational biology and medical applications
(protein, cancer classification problems)
2) image classification
3) hand-written character recognition
And many others...
Use SVM for: high dimensional, linearly separable
data (strength); for nonlinearly separable data, performance depends on the choice of
kernel

Conclusion:
Support Vector Machines provide a very
simple method for linear classification. But
performance, in case of nonlinearly separable
data, largely depends on the choice of kernel!

References:
Nello Cristianini and John Shawe-Taylor (2000)
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
Cambridge University Press
Christopher J.C. Burges (1998)
A tutorial on Support Vector Machines for pattern recognition
Usama Fayyad, editor, Data Mining and Knowledge Discovery, 2, 121-167.
Kluwer Academic Publishers, Boston.
Andrew Ng (2007)
CS229 Lecture Notes
Stanford Engineering Everywhere, Stanford University.
Support Vector Machines <http://www.svms.org> (Accessed 10.11.2008)
Wikipedia
Kernel-Machines.org <http://www.kernel-machines.org> (Accessed 10.11.2008)

Thank You!
prakash@cdacmumbai.in ;
pbpimpale@gmail.com
