The document provides an overview of support vector machines (SVMs), including:
1) SVMs are a supervised learning method for classification and regression based on Vapnik-Chervonenkis theory that aims to find a separating hyperplane with maximum margin between classes.
2) The SVM formulation can be expressed as an optimization problem to maximize the margin, which results in a convex quadratic problem.
3) The dual formulation allows SVMs to be applied to non-linearly separable data using kernel methods that implicitly map inputs to higher dimensional feature spaces.
Machine Learning Guide to Support Vector Machines
1. Support Vector Machines
(C) CDAC Mumbai Workshop on Machine Learning
Prakash B. Pimpale
CDAC Mumbai
2. Outline
o Introduction
o Towards SVM
o Basic Concept
o Implementations
o Issues
o Conclusion & References
3. Introduction:
o SVMs – supervised learning methods for
classification and regression
o Base: Vapnik-Chervonenkis theory
o First practical implementation: early nineties
o Satisfying from a theoretical point of view
o Can lead to high performance in practical
applications
o Currently considered one of the most efficient
families of algorithms in Machine Learning
4. Towards SVM
A: I found a really good function describing the training
examples using an ANN, but it couldn't classify test examples
that efficiently. What could be the problem?
B: It didn't generalize well!
A: What should I do now?
B: Try SVM!
A: Why?
B: SVM
1) generalizes well
And what's more….
2) is computationally efficient (just a convex optimization
problem)
3) is robust in high dimensions too (no overfitting)
5. A: Why is it so?
B: So many questions…!
o Vapnik & Chervonenkis Statistical Learning Theory result:
relates the ability to learn a rule for classifying training data to
the ability of the resulting rule to classify unseen examples
(Generalization)
o Let f be a rule, f ∈ F
o Empirical Risk of f: a measure of the quality of classification on
the training data
Best performance: R_emp(f) = 0
Worst performance: R_emp(f) = 1
6. What about the Generalization?
o Risk of a classifier: the probability that rule f makes a
mistake on a new sample randomly generated by the same
random process
R(f) = P(f(x) ≠ y)
Best Generalization: R(f) = 0
Worst Generalization: R(f) = 1
o Many times a small Empirical Risk implies a small Risk
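A minimal sketch (not from the slides; the rule and the data here are hypothetical) of what Empirical Risk means in code:

import numpy as np

def empirical_risk(f, X, y):
    # R_emp(f): fraction of training examples the rule f misclassifies
    return np.mean(f(X) != y)

# Hypothetical rule: classify by the sign of the first feature.
f = lambda X: np.sign(X[:, 0])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=100))

print(empirical_risk(f, X_train, y_train))  # a small R_emp alone says
# nothing about R(f) until the capacity of the class F is accounted for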
7. Is the problem solved? …….. NO!
o Is the Risk of the rule f_t selected by Empirical Risk Minimization
(ERM) near to that of the ideal rule f_i?
o No, not in the case of overfitting
o An important result of Statistical Learning Theory:
E[R(f_t)] ≤ R(f_i) + C √( V(F) / N )
where V(F) – VC dimension of the class F
N – number of observations for training
C – a universal constant
8. What it says:
o The Risk of the rule selected by ERM is not far from the Risk of
the ideal rule if:
1) N is large enough
2) the VC dimension of F is small enough
[VC dimension? In short, the larger the class F, the larger its VC dimension (Sorry Vapnik sir!)]
9. Structural Risk Minimization (SRM)
o Consider a nested family of subclasses of F s.t.
F_0 ⊂ F_1 ⊂ .......... ⊂ F_n ⊂ ......
V(F_0) ≤ V(F_1) ≤ .......... ≤ V(F_n) ≤ ......
o Find the minimum Empirical Risk for each subclass
and its VC dimension
o Select the subclass with the minimum bound on the Risk
(i.e. the sum of the VC dimension term and the empirical risk)
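A minimal sketch of this selection rule, assuming a hypothetical list of (fit, vc_dim) pairs where fit(X, y) returns the minimum empirical risk achievable within that subclass:

import math

def srm_select(classes, X, y, C=1.0):
    # Pick the subclass minimizing empirical risk + capacity penalty,
    # using the bound from slide 7 with the constant C as a knob.
    N = len(y)
    best, best_bound = None, float("inf")
    for fit, vc_dim in classes:
        bound = fit(X, y) + C * math.sqrt(vc_dim / N)
        if bound < best_bound:
            best, best_bound = (fit, vc_dim), bound
    return best, best_bound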
10. SRM Graphically:
[Figure: the bound E[R(f_t)] ≤ R(f_i) + C √( V(F) / N ) sketched against VC dimension – the empirical risk term falls while the capacity term grows, so the bound has a minimum in between.]
11. A: What does this have to do with SVM….?
B: SVM is an approximate implementation of SRM!
A: How?
B: Just in a simple way for now,
take one result on faith:
Maximizing the distance of the decision boundary from the
training points minimizes the VC dimension,
resulting in good generalization!
12. A: So from now on our target is maximizing the distance
between the decision boundary and the training points!
B: Yeah, right!
A: OK, I am convinced that SVM will generalize well,
but can you please explain what the concept of
SVM is and how to implement it? Are there any
packages available?
B: Yeah, don't worry, there are many implementations
available, just use them for your application. The
next part of the presentation will give a basic idea
of SVM, so stay with me!
13. Basic Concept of SVM:
o Which line will classify the unseen data well?
o The dotted line! It's the line with the Maximum Margin!
14. Cont…
[Figure: the maximum-margin separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = +1 and W^T X + b = −1; the training points lying on the boundaries are the Support Vectors.]
15. Some definitions:
o Functional Margin:
w.r.t.
1) individual examples:
γ̂^(i) = Y^(i) ( W^T x^(i) + b )
2) the example set S = {(x^(i), Y^(i)); i = 1,....., m}:
γ̂ = min_{i=1,...,m} γ̂^(i)
o Geometric Margin:
w.r.t.
1) individual examples:
γ^(i) = Y^(i) ( (W / ||W||)^T x^(i) + b / ||W|| )
2) the example set S:
γ = min_{i=1,...,m} γ^(i)
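A minimal sketch (illustrative numbers, not from the slides) computing both margins with NumPy:

import numpy as np

W, b = np.array([2.0, 1.0]), -1.0
X = np.array([[1.0, 2.0], [-1.0, 0.5], [2.0, -0.5]])
Y = np.array([1.0, -1.0, 1.0])

functional = Y * (X @ W + b)                 # γ̂^(i) = Y^(i)(W^T x^(i) + b)
geometric = functional / np.linalg.norm(W)   # γ^(i) = γ̂^(i) / ||W||

print(functional.min(), geometric.min())     # margins w.r.t. the set S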
16. Problem Formulation:
[Figure: the separating hyperplane W^T X + b = 0 with the margin boundaries W^T X + b = +1 and W^T X + b = −1 on either side.]
17. Cont..
o The distance of a point (u, v) from Ax + By + C = 0 is given by
|Au + Bv + C| / ||n||
where ||n|| is the norm of the normal vector n = (A, B)
o Distance of the hyperplane W^T X + b = 0 from the origin = |b| / ||W||
o Distance of point A from the origin = |b + 1| / ||W||
o Distance of point B from the origin = |b − 1| / ||W||
o Distance between points A and B (the Margin) = 2 / ||W||
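A quick numeric check of these distances (illustrative W and b):

import numpy as np

W, b = np.array([3.0, 4.0]), 2.0     # ||W|| = 5
norm = np.linalg.norm(W)

d_plus = abs(b - 1) / norm           # distance of W^T X + b = +1 from origin
d_minus = abs(b + 1) / norm          # distance of W^T X + b = -1 from origin
print(d_minus - d_plus, 2 / norm)    # both 0.4: the margin is 2 / ||W||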
18. Cont…
o We have a data set
{(X^(i), Y^(i))}, i = 1,...., m, with X^(i) ∈ R^d and Y^(i) ∈ {−1, +1}
and a separating hyperplane
W^T X + b = 0
s.t.
W^T X^(i) + b > 0 if Y^(i) = +1
W^T X^(i) + b < 0 if Y^(i) = −1
19. Cont…
o Suppose the training data also satisfy the following constraints:
W^T X^(i) + b ≥ +1 for Y^(i) = +1
W^T X^(i) + b ≤ −1 for Y^(i) = −1
Combining these into one:
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
o Our objective is to find the hyperplane (W, b) with maximal
separation between it and the closest data points while satisfying
the above constraints
20. THE PROBLEM:
max_{W,b} 2 / ||W||
such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
Also we know
||W|| = √(W^T W)
21. Cont..
So the problem can be written as:
min_{W,b} (1/2) W^T W
Such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
Notice: W^T W = ||W||^2
It is just a convex quadratic optimization problem!
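A minimal sketch of this QP on toy data, assuming the cvxopt package is available; the variable is z = (W, b) and the constraints are rewritten in the Gz ≤ h form that cvxopt's qp() expects:

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

P = matrix(np.diag([1.0] * d + [0.0]))   # (1/2) W^T W, no penalty on b
q = matrix(np.zeros(d + 1))
G = matrix(-Y[:, None] * np.hstack([X, np.ones((m, 1))]))  # Y_i(W^T X_i + b) >= 1
h = matrix(-np.ones(m))

sol = solvers.qp(P, q, G, h)
W, b = np.array(sol["x"])[:d].ravel(), sol["x"][d]
print(W, b)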
22. DUAL
o Solving the dual of our problem will let us apply SVM to
nonlinearly separable data efficiently
o It can be shown that
min(primal) = max_α ( min_{W,b} L(W, b, α) )
o Primal problem:
min_{W,b} (1/2) W^T W
Such that
Y^(i) (W^T X^(i) + b) ≥ 1 for ∀i
23. Constructing the Lagrangian
o Lagrangian for our problem:
L(W, b, α) = (1/2) ||W||^2 − Σ_{i=1}^{m} α_i [ Y^(i) (W^T X^(i) + b) − 1 ]
where α_i ≥ 0 is a Lagrange multiplier
o Now, minimizing it w.r.t. W and b:
we set the derivatives of the Lagrangian w.r.t. W and b to zero
24. Cont…
o Setting the derivative w.r.t. W to zero gives:
W − Σ_{i=1}^{m} α_i Y^(i) X^(i) = 0
i.e.
W = Σ_{i=1}^{m} α_i Y^(i) X^(i)
o Setting the derivative w.r.t. b to zero gives:
Σ_{i=1}^{m} α_i Y^(i) = 0
25. Cont…
o Plugging these results into the Lagrangian gives
L(W, b, α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)
o Call it
D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} Y^(i) Y^(j) α_i α_j (X^(i))^T X^(j)
o This is the result of our minimization w.r.t. W and b
26. So The DUAL:
o Now the Dual becomes:
max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j Y^(i) Y^(j) (X^(i))^T X^(j)
s.t.
α_i ≥ 0, i = 1,...., m
Σ_{i=1}^{m} α_i Y^(i) = 0
o Solving this optimization problem gives us the α_i
o Also the Karush-Kuhn-Tucker (KKT) condition is
satisfied at this solution, i.e.
α_i [ Y^(i) (W^T X^(i) + b) − 1 ] = 0, for i = 1,..., m
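A minimal sketch solving this dual on the same toy data (again assuming cvxopt); qp() minimizes, so the sign of the objective is flipped:

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.0], [-2.0, 0.5]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(Y)

K = X @ X.T                               # Gram matrix (X^(i))^T X^(j)
P = matrix(np.outer(Y, Y) * K)            # P_ij = Y_i Y_j X_i^T X_j
q = matrix(-np.ones(m))                   # turns max into min
G, h = matrix(-np.eye(m)), matrix(np.zeros(m))   # α_i >= 0
A, c = matrix(Y[None, :]), matrix(0.0)           # Σ α_i Y_i = 0

alpha = np.array(solvers.qp(P, q, G, h, A, c)["x"]).ravel()
print(alpha)    # nonzero α_i mark the support vectors (KKT condition)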
27. Values of W and b:
o W can be found using
W* = Σ_{i=1}^{m} α_i Y^(i) X^(i)
o b can be found using:
b* = − ( max_{i: Y^(i) = −1} W*^T X^(i) + min_{i: Y^(i) = +1} W*^T X^(i) ) / 2
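Continuing the dual sketch above, W* and b* follow directly from α:

W = (alpha * Y) @ X          # W* = Σ α_i Y^(i) X^(i)
b = -(np.max(X[Y == -1] @ W) + np.min(X[Y == 1] @ W)) / 2
print(W, b)                  # should agree with the primal solution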
28. What if data is nonlinearly separable?
o The maximal margin hyperplane can classify only linearly separable data
o What if the data is not linearly separable?
o Take your data to a space where it is linearly separable (a higher dimensional space) and use the maximal margin hyperplane there!
29. Taking it to higher dimension works!
Ex. XOR
30. Doing it in higher dimensional space
o Let Φ : X → F be a non-linear mapping from the input
space X (the original space) to the feature space (higher
dimensional) F
o Then our inner (dot) product ⟨X^(i), X^(j)⟩ in the higher
dimensional space is ⟨φ(X^(i)), φ(X^(j))⟩
o Now the problem becomes:
max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j Y^(i) Y^(j) ⟨φ(X^(i)), φ(X^(j))⟩
s.t.
α_i ≥ 0, i = 1,...., m
Σ_{i=1}^{m} α_i Y^(i) = 0
31. Kernel function:
o There exists a way to compute the inner product in the feature
space as a function of the original input points – it's the kernel
function!
o Kernel function:
K(x, z) = ⟨φ(x), φ(z)⟩
o We need not know φ to compute K(x, z)
32. An example:
Let x, z ∈ R^n and K(x, z) = (x^T z)^2
i.e. K(x, z) = (Σ_{i=1}^{n} x_i z_i)(Σ_{j=1}^{n} x_j z_j)
= Σ_{i=1}^{n} Σ_{j=1}^{n} (x_i x_j)(z_i z_j)
For n = 3, the feature mapping φ is given as:
φ(x) = (x_1x_1, x_1x_2, x_1x_3, x_2x_1, x_2x_2, x_2x_3, x_3x_1, x_3x_2, x_3x_3)^T
so that K(x, z) = ⟨φ(x), φ(z)⟩
33. example cont…
o Here, for x = [1 2]^T and z = [3 4]^T:
K(x, z) = (x^T z)^2 = (1·3 + 2·4)^2 = 11^2 = 121
φ(x) = (x_1x_1, x_1x_2, x_2x_1, x_2x_2)^T = (1, 2, 2, 4)^T
φ(z) = (9, 12, 12, 16)^T
φ(x)^T φ(z) = 9 + 24 + 24 + 64 = 121
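A quick check of this arithmetic, and of the point of slide 31 – the kernel value needs no explicit φ:

import numpy as np

def phi(x):
    # explicit feature map: all pairwise products x_i * x_j
    return np.outer(x, x).ravel()

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print((x @ z) ** 2)       # 121.0 – kernel trick, O(n) work
print(phi(x) @ phi(z))    # 121.0 – explicit map, O(n^2) work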
34. So our SVM for the non-linearly separable data:
o Optimization problem:
max_α D(α) = Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j Y^(i) Y^(j) K(X^(i), X^(j))
s.t.
α_i ≥ 0, i = 1,...., m
Σ_{i=1}^{m} α_i Y^(i) = 0
o Decision function:
F(X) = Sign( Σ_{i=1}^{m} α_i Y^(i) K(X^(i), X) + b )
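A minimal sketch of the decision function, assuming α and b have already been obtained from the kernelized dual:

import numpy as np

def decision(x_new, X, Y, alpha, b, K):
    # F(x_new) = Sign( Σ α_i Y^(i) K(X^(i), x_new) + b )
    s = sum(a * y * K(x_i, x_new) for a, y, x_i in zip(alpha, Y, X))
    return np.sign(s + b)

K = lambda x, z: (x @ z) ** 2    # e.g. the quadratic kernel from slide 32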
35. Some commonly used Kernel functions:
o Linear:
K(X, Y) = X^T Y
o Polynomial of degree d:
K(X, Y) = (X^T Y + 1)^d
o Gaussian Radial Basis Function (RBF):
K(X, Y) = exp( −||X − Y||^2 / (2σ^2) )
o Tanh kernel:
K(X, Y) = tanh( ρ (X^T Y) − δ )
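Minimal NumPy sketches of the four kernels above, for vectors X and Y:

import numpy as np

linear = lambda X, Y: X @ Y
poly   = lambda X, Y, d=3: (X @ Y + 1) ** d
rbf    = lambda X, Y, sigma=1.0: np.exp(-np.sum((X - Y) ** 2) / (2 * sigma ** 2))
tanh_k = lambda X, Y, rho=1.0, delta=0.0: np.tanh(rho * (X @ Y) - delta)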
36. Implementations:
Some ready-to-use available SVM implementations:
1) LIBSVM: A library for SVM by Chih-Chung Chang and
Chih-Jen Lin
(at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
2) SVMlight: An implementation in C by Thorsten
Joachims
(at: http://svmlight.joachims.org/ )
3) Weka: A Data Mining Software in Java by the University
of Waikato
(at: http://www.cs.waikato.ac.nz/ml/weka/ )
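A minimal usage sketch; this assumes scikit-learn, whose SVC class wraps the LIBSVM library listed above:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])   # XOR-like toy data
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.predict([[0.9, 0.1]]))   # a nonlinear kernel handles XOR
print(clf.support_)                # indices of the support vectors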
37. Issues:
o Selecting a suitable kernel: most of the time it is trial
and error
o Multiclass classification: one decision function for
each class (1 vs the other l−1), then pick the one with the maximum
value; i.e. if X belongs to class 1, then for this and the
other (l−1) classes the values of the decision functions are:
F_1(X) ≥ +1
F_2(X) ≤ −1
.
.
F_l(X) ≤ −1
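A minimal sketch of this one-vs-rest rule, with a hypothetical list of per-class decision functions:

import numpy as np

def predict_multiclass(x, decision_functions):
    # decision_functions[k](x): score of class k vs. the other l-1 classes
    scores = [f(x) for f in decision_functions]
    return int(np.argmax(scores))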
38. Cont….
o Sensitive to noise: mislabeled data can badly affect
the performance
o Good performance for applications like:
1) computational biology and medical applications
(protein and cancer classification problems)
2) image classification
3) hand-written character recognition
and many others…..
o Use SVM for high dimensional, linearly separable
data (its strength); for nonlinearly separable data, performance
depends on the choice of kernel
39. Conclusion:
Support Vector Machines provide a very
simple method for linear classification. But
performance, in the case of nonlinearly separable
data, largely depends on the choice of kernel!
40. References:
o Nello Cristianini and John Shawe-Taylor (2000).
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.
Cambridge University Press.
o Christopher J.C. Burges (1998).
A tutorial on Support Vector Machines for pattern recognition.
In Usama Fayyad, editor, Data Mining and Knowledge Discovery, 2, 121-167.
Kluwer Academic Publishers, Boston.
o Andrew Ng (2007).
CS229 Lecture Notes.
Stanford Engineering Everywhere, Stanford University.
o Support Vector Machines <http://www.svms.org> (Accessed 10.11.2008)
o Wikipedia
o Kernel-Machines.org <http://www.kernel-machines.org> (Accessed 10.11.2008)