Support Vector Machines
Support Vector Machines: Slide 2
Copyright © 2001, 2003, Andrew W. Moore
Roadmap
• Hard-Margin Linear Classifier
• Maximize Margin
• Support Vector
• Quadratic Programming
• Soft-Margin Linear Classifier
• Maximize Margin
• Support Vector
• Quadratic Programming
• Non-Linearly Separable Problem
• XOR
• Transform to Non-Linear by Kernels
• Reference
Support Vector Machines: Slide 3
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Figure: input x passes through classifier f with parameters (w, b) to give the estimate y^est; scatter of points labeled +1 and −1, with one candidate boundary]
f(x,w,b) = sign(w · x − b)
How would you classify this data?
Support Vector Machines: Slide 4
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Figure: the same data with a second candidate boundary]
f(x,w,b) = sign(w · x − b)
How would you classify this data?
Support Vector Machines: Slide 5
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Figure: a third candidate boundary]
f(x,w,b) = sign(w · x − b)
How would you classify this data?
Support Vector Machines: Slide 6
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Figure: a fourth candidate boundary]
f(x,w,b) = sign(w · x − b)
How would you classify this data?
Support Vector Machines: Slide 7
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
[Figure: several candidate boundaries through the same data]
f(x,w,b) = sign(w · x − b)
Any of these would be fine..
..but which is best?
Support Vector Machines: Slide 8
Copyright © 2001, 2003, Andrew W. Moore
Classifier Margin
[Figure: a linear boundary with its margin band; points labeled +1 and −1]
f(x,w,b) = sign(w · x − b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Support Vector Machines: Slide 9
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
[Figure: the widest-margin boundary]
f(x,w,b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Support Vector Machines: Slide 10
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
[Figure: the widest-margin boundary, with the support vectors highlighted]
f(x,w,b) = sign(w · x − b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.
Linear SVM
Support Vector Machines: Slide 11
Copyright © 2001, 2003, Andrew W. Moore
Why Maximum Margin?
[Figure: the maximum margin linear classifier and its support vectors, as on the previous slide]
1. Intuitively this feels safest.
2. Empirically it works very well.
3. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction) this gives us the least chance of causing a misclassification.
4. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints.
5. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
Support Vector Machines: Slide 12
Copyright © 2001, 2003, Andrew W. Moore
Estimate the Margin
• What is the distance expression for a point x to a line w·x + b = 0?
[Figure: a point x and the hyperplane w·x + b = 0; points labeled +1 and −1]
d(x) = |x · w + b| / ||w||₂ = |x · w + b| / sqrt( Σ_{i=1..d} w_i² )
x – vector;  w – normal vector;  b – scale value
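A quick numeric check of the distance formula above, as a minimal sketch (NumPy assumed; the values of w, b, and x are hypothetical, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal vector
b = -2.0                   # scale value (bias)
x = np.array([1.0, 1.0])   # query point

# d(x) = |x . w + b| / sqrt(sum_i w_i^2)
dist = abs(w @ x + b) / np.linalg.norm(w)
print(dist)  # |3 + 4 - 2| / 5 = 1.0
```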
Support Vector Machines: Slide 13
Copyright © 2001, 2003, Andrew W. Moore
Estimate the Margin
• What is the expression for margin?
[Figure: boundary w·x + b = 0 with its margin band]
margin = min_{x ∈ D} d(x) = min_{x ∈ D} |x · w + b| / sqrt( Σ_{i=1..d} w_i² )
Support Vector Machines: Slide 14
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
[Figure: boundary w·x + b = 0 with its margin band]
argmax_{w,b} margin(w, b, D)
= argmax_{w,b} min_{x_i ∈ D} d(x_i)
= argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt( Σ_{i=1..d} w_i² )
Support Vector Machines: Slide 15
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
[Figure: boundary w·x + b = 0 with its margin band]
• Min-max problem → game problem
w · x_i + b ≥ 0 iff y_i = 1
w · x_i + b ≤ 0 iff y_i = −1
so in both cases y_i (w · x_i + b) ≥ 0
argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt( Σ_{i=1..d} w_i² )
subject to: ∀ x_i ∈ D, y_i (x_i · w + b) ≥ 0
Support Vector Machines: Slide 16
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
[Figure: boundary w·x + b = 0 with its margin band]
Strategy: rescale (w, b) so that min_{x_i ∈ D} |x_i · w + b| = 1.
(This is allowed because w·x + b = 0 and α(w·x + b) = 0 with α ≠ 0 describe the same plane.)
argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt( Σ_{i=1..d} w_i² ),  subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0
then becomes
argmin_{w,b} Σ_{i=1..d} w_i²,  subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1
Support Vector Machines: Slide 17
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
• How does this come about?
Strategy: rescale (w, b) so that min_{x_i ∈ D} |x_i · w + b| = 1.
Original problem:
argmax_{w,b} min_{x_i ∈ D} |x_i · w + b| / sqrt( Σ_{i=1..d} w_i² ),  subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 0
Reduced problem:
argmin_{w,b} Σ_{i=1..d} w_i²,  subject to ∀ x_i ∈ D : y_i (x_i · w + b) ≥ 1
We have
argmax_{w,b} min_i |x_i · w + b| / sqrt( Σ_{i=1..d} w_i² ) = argmax_{w',b'} K / sqrt( Σ_{i=1..d} w'_i² ),  where K = min_i |x_i · w' + b'| = 1 after rescaling.
Thus,
argmax_{w',b'} 1 / sqrt( Σ_{i=1..d} w'_i² ) = argmin_{w',b'} Σ_{i=1..d} w'_i²
Support Vector Machines: Slide 18
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin Linear Classifier
• How to solve it? (A solver sketch follows below.)
{w*, b*} = argmin_{w,b} Σ_{k=1..d} w_k²
subject to:
y_1 (w · x_1 + b) ≥ 1
y_2 (w · x_2 + b) ≥ 1
....
y_N (w · x_N + b) ≥ 1
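A minimal solver sketch for this program. It assumes the cvxpy package (not a tool named in the slides) and a hypothetical, separable toy set:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# minimize sum_k w_k^2  subject to  y_i (w . x_i + b) >= 1
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)
```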
Support Vector Machines: Slide 19
Copyright © 2001, 2003, Andrew W. Moore
Learning via Quadratic Programming
• QP is a well-studied class of optimization
algorithms to maximize a quadratic function of
some real-valued variables subject to linear
constraints.
• For the detailed solution of quadratic programs, see:
• Convex Optimization, Stephen P. Boyd (online edition, free for downloading)
Support Vector Machines: Slide 20
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming
Find:  argmax_u  c + dᵀu + uᵀR u / 2   (quadratic criterion)
Subject to the n additional linear inequality constraints:
a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
:
a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n
And subject to the e additional linear equality constraints:
a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
:
a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
Support Vector Machines: Slide 21
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming for the Linear Classifier
{w*, b*} = argmin_{w,b} Σ_i w_i²,  subject to y_i (w · x_i + b) ≥ 1 for all training data (x_i, y_i)
In the standard QP form above (c = 0, d = 0, R = −I):
{w*, b*} = argmax_{w,b} 0 + 0ᵀw + wᵀ(−I)w / 2
subject to the N inequality constraints:
y_1 (w · x_1 + b) ≥ 1
y_2 (w · x_2 + b) ≥ 1
....
y_N (w · x_N + b) ≥ 1
Support Vector Machines: Slide 22
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• Popular Tools - LibSVM
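The deck's demo uses the LibSVM tool directly; as a stand-in sketch, scikit-learn's SVC (which wraps LibSVM internally) runs the same kind of linear classifier. The toy data here are hypothetical:

```python
from sklearn.svm import SVC
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)  # a very large C approximates a hard margin
clf.fit(X, y)
print(clf.support_vectors_)        # the datapoints the margin pushes up against
print(clf.predict([[1.0, 1.0]]))
```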
Support Vector Machines: Slide 23
Copyright © 2001, 2003, Andrew W. Moore
Roadmap
• Hard-Margin Linear Classifier
• Maximize Margin
• Support Vector
• Quadratic Programming
• Soft-Margin Linear Classifier
• Maximize Margin
• Support Vector
• Quadratic Programming
• Non-Linearly Separable Problem
• XOR
• Transform to Non-Linear by Kernels
• Reference
Support Vector Machines: Slide 24
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
[Figure: data labeled +1 and −1 that is not linearly separable]
This is going to be a problem!
What should we do?
Support Vector Machines: Slide 25
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
[Figure: data labeled +1 and −1 that is not linearly separable]
This is going to be a problem! What should we do?
Idea 1: Find minimum w.w, while minimizing the number of training set errors.
Problemette: two things to minimize makes for an ill-defined optimization.
Support Vector Machines: Slide 26
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
[Figure: data labeled +1 and −1 that is not linearly separable]
This is going to be a problem! What should we do?
Idea 1.1: Minimize w.w + C (#train errors), where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
Support Vector Machines: Slide 27
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
[Figure: data labeled +1 and −1 that is not linearly separable]
Idea 1.1: Minimize w.w + C (#train errors), where C is a tradeoff parameter.
The problem: this can't be expressed as a Quadratic Programming problem, and solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
Support Vector Machines: Slide 28
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
[Figure: data labeled +1 and −1 that is not linearly separable]
Idea 2.0: Minimize w.w + C (distance of error points to their correct place)
Support Vector Machines: Slide 29
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for Noisy Data
• Any problem with the formulation below?
{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
subject to:
y_1 (w · x_1 + b) ≥ 1 − ε_1
y_2 (w · x_2 + b) ≥ 1 − ε_2
...
y_N (w · x_N + b) ≥ 1 − ε_N
[Figure: noisy data with slack variables ε_1, ε_2, ε_3 marking margin violations]
Support Vector Machines: Slide 30
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for Noisy Data
• Balance the trade-off between margin and classification errors
{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
subject to:
y_1 (w · x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
...
y_N (w · x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0
[Figure: noisy data with slack variables ε_1, ε_2, ε_3]
Support Vector Machines: Slide 31
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine for Noisy Data
{w*, b*} = argmin_{w,b} Σ_i w_i² + c Σ_{j=1..N} ε_j
subject to the N inequality constraints:
y_1 (w · x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
y_2 (w · x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
....
y_N (w · x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0
How do we determine the appropriate value for c ? (A solver sketch follows below.)
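A sketch of this soft-margin program, again assuming the cvxpy package; the value of c and the toy set (with one deliberately mislabeled point playing the noise role) are hypothetical:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])  # the last point is the "noise"
c = 1.0                                     # hypothetical tradeoff value

w = cp.Variable(2)
b = cp.Variable()
eps = cp.Variable(len(y))  # one slack variable per datapoint
prob = cp.Problem(
    cp.Minimize(cp.sum_squares(w) + c * cp.sum(eps)),
    [cp.multiply(y, X @ w + b) >= 1 - eps, eps >= 0],
)
prob.solve()
print(w.value, b.value, eps.value)  # nonzero eps_j marks a margin violation
```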
Support Vector Machines: Slide 32
Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize  Σ_{k=1..R} α_k − ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl,  where Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
0 ≤ α_k ≤ C  ∀k
Σ_{k=1..R} α_k y_k = 0
Then define:
w = Σ_{k=1..R} α_k y_k x_k
Then classify with: f(x,w,b) = sign(w · x − b)
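A sketch of this dual with cvxpy (assumed, as before). The quadratic term is rewritten as a squared norm, αᵀQα = ||Aᵀα||² with A's rows equal to y_k x_k, which keeps the problem in a form the solver accepts; the data and C are hypothetical:

```python
import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

A = y[:, None] * X            # row k is y_k x_k, so Q = A A^T
alpha = cp.Variable(len(y))
obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(A.T @ alpha))
cons = [alpha >= 0, alpha <= C, alpha @ y == 0]
cp.Problem(obj, cons).solve()

w = A.T @ alpha.value         # w = sum_k alpha_k y_k x_k
print(alpha.value, w)         # alpha_k > 0 only for the support vectors
```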
Support Vector Machines: Slide 33
Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize  Σ_{k=1..R} α_k − ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl,  where Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
0 ≤ α_k ≤ C  ∀k
Σ_{k=1..R} α_k y_k = 0
Then define:
w = Σ_{k=1..R} α_k y_k x_k
Support Vector Machines: Slide 34
Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP
Maximize  Σ_{k=1..R} α_k − ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl,  where Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
0 ≤ α_k ≤ C  ∀k
Σ_{k=1..R} α_k y_k = 0
Then define:
w = Σ_{k=1..R} α_k y_k x_k
Datapoints with α_k > 0 will be the support vectors ..so this sum only needs to be over the support vectors.
Support Vector Machines: Slide 35
Copyright © 2001, 2003, Andrew W. Moore
Support Vectors
[Figure: decision boundary with margin planes w·x + b = +1 and w·x + b = −1; the support vectors lie on these planes; points labeled +1 and −1]
The decision boundary is determined only by the support vectors!
w = Σ_{k=1..R} α_k y_k x_k
∀i : α_i [ y_i (w · x_i + b) − 1 ] = 0
α_i = 0 for non-support vectors
α_i ≠ 0 for support vectors
Support Vector Machines: Slide 36
Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize  Σ_{k=1..R} α_k − ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl,  where Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
0 ≤ α_k ≤ C  ∀k
Σ_{k=1..R} α_k y_k = 0
Then define:
w = Σ_{k=1..R} α_k y_k x_k
Then classify with: f(x,w,b) = sign(w · x − b)
How to determine b ?
Support Vector Machines: Slide 37
Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP: Determine b
{w*, b*} = argmin_{w,b} Σ_i w_i² + c Σ_{j=1..N} ε_j
subject to: y_k (w · x_k + b) ≥ 1 − ε_k,  ε_k ≥ 0,  k = 1..N
Fix w, and what remains is a linear programming problem!
b* = argmin_b Σ_{j=1..N} ε_j
subject to: y_k (w · x_k + b) ≥ 1 − ε_k,  ε_k ≥ 0,  k = 1..N
Support Vector Machines: Slide 38
Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP
Maximize  Σ_{k=1..R} α_k − ½ Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl,  where Q_kl = y_k y_l (x_k · x_l)
Subject to these constraints:
0 ≤ α_k ≤ C  ∀k
Σ_{k=1..R} α_k y_k = 0
Then define:
w = Σ_{k=1..R} α_k y_k x_k
b = y_K (1 − ε_K) − x_K · w,  where K = argmax_k α_k
Then classify with: f(x,w,b) = sign(w · x − b)
Datapoints with α_k > 0 will be the support vectors ..so this sum only needs to be over the support vectors.
Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly
• Because of further jaw-dropping developments you're about to learn.
Support Vector Machines: Slide 39
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• Parameter c is used to control the fit: it sets the tolerance to noise
[Figure: soft-margin boundary with a noisy point inside the margin]
Support Vector Machines: Slide 40
Copyright © 2001, 2003, Andrew W. Moore
Roadmap
• Hard-Margin Linear Classifier (Clean Data)
• Maximize Margin
• Support Vector
• Quadratic Programming
• Soft-Margin Linear Classifier (Noisy Data)
• Maximize Margin
• Support Vector
• Quadratic Programming
• Non-Linearly Separable Problem
• XOR
• Transform to Non-Linear by Kernels
• Reference
Support Vector Machines: Slide 41
Copyright © 2001, 2003, Andrew W. Moore
Feature Transformation ?
XOR Problem: the problem is non-linear.
Basic Idea:
• Find some trick to transform the input
• Linearly separable after the feature transformation
• What features should we use ?
Support Vector Machines: Slide 42
Copyright © 2001, 2003, Andrew W. Moore
Suppose we're in 1-dimension
What would SVMs do with this data?
[Figure: 1-D datapoints of two classes along the x-axis, around x = 0]
Support Vector Machines: Slide 43
Copyright © 2001, 2003, Andrew W. Moore
Suppose we're in 1-dimension
Not a big surprise
[Figure: the 1-D data split by a threshold at x = 0, with a positive "plane" on one side and a negative "plane" on the other]
Support Vector Machines: Slide 44
Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
That's wiped the smirk off SVM's face. What can be done about this?
[Figure: 1-D data around x = 0 with one class nested inside the other]
Support Vector Machines: Slide 45
Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
Map the data from low-dimensional space to high-dimensional space:
z_k = ( x_k , x_k² )
Let's permit them here too
[Figure: the mapped data, now separable in two dimensions]
Support Vector Machines: Slide 46
Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
Map the data from low-dimensional space to high-dimensional space:
z_k = ( x_k , x_k² )
Let's permit them here too
Feature enumeration: x_k → (transform) → z_k = φ(x_k)
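A small NumPy sketch of this map; the 1-D points and labels are hypothetical, chosen so the negatives sit near x = 0:

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])   # negatives cluster around x = 0

Z = np.column_stack([x, x ** 2])         # z_k = (x_k, x_k^2)
# No single threshold on x separates the labels, but in (x, x^2) space the
# horizontal line x^2 = 2.25 does.
print(Z)
```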
Support Vector Machines: Slide 47
Copyright © 2001, 2003, Andrew W. Moore
Non-linear SVMs: Feature spaces
• General idea: the original input space can always be mapped
to some higher-dimensional feature space where the training
set is separable:
Φ: x → φ(x)
Support Vector Machines: Slide 48
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• Polynomial features for the XOR problem
Support Vector Machines: Slide 49
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• But... is it the best margin, intuitively?
Support Vector Machines: Slide 50
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• Why not something like this ?
Support Vector Machines: Slide 51
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• Or something like this? Could we?
• A more symmetric boundary
Support Vector Machines: Slide 52
Copyright © 2001, 2003, Andrew W. Moore
Degree of Polynomial Features
[Figure: six panels showing decision boundaries for polynomial features of degree X^1 through X^6]
Support Vector Machines: Slide 53
Copyright © 2001, 2003, Andrew W. Moore
Towards Infinite Dimensions of Features
• Enumerate polynomial features of all degrees?
• Taylor expansion of the exponential function:
e^x = Σ_{i=0..∞} x^i / i! = 1/0! + x/1! + x²/2! + x³/3! + x⁴/4! + .......
z_k = ( radial basis functions of x_k )
z_k[j] = φ_j(x_k) = exp( − |x_k − c_j|² / σ² )
Support Vector Machines: Slide 54
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• "Radial basis functions" for the XOR problem
Support Vector Machines: Slide 55
Copyright © 2001, 2003, Andrew W. Moore
Efficiency Problem in Computing Features
• Feature space mapping
• Example: all degree-2 monomials: 9 multiplications via the explicit features vs. 3 via the kernel
This use of a kernel function to avoid carrying out Φ(x) explicitly is known as the kernel trick.
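A sketch of the saving for a 2-D input (the slides' 9-vs-3 count refers to their own example): the kernel K(x, z) = (x · z)² returns the same value as the dot product of the explicit degree-2 monomial features, with fewer multiplications:

```python
import numpy as np

def phi(x):
    # all degree-2 monomials of a 2-D input
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(phi(x) @ phi(z))  # explicit features: more multiplications -> 121.0
print((x @ z) ** 2)     # kernel trick, K(x, z) = (x . z)^2        -> 121.0
```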
Support Vector Machines: Slide 56
Copyright © 2001, 2003, Andrew W. Moore
Common SVM basis functions
z_k = ( polynomial terms of x_k of degree 1 to q )
z_k = ( radial basis functions of x_k ):  z_k[j] = φ_j(x_k) = exp( − |x_k − c_j|² / σ² )
z_k = ( sigmoid functions of x_k )
Support Vector Machines: Slide 57
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• The "Radial Basis Function" (Gaussian kernel) can solve complicated non-linear problems
• γ and c control the complexity of the decision boundary, where γ = 1/σ²
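A sketch of an RBF-kernel SVM on the XOR pattern, via scikit-learn's LibSVM wrapper (assumed); the γ and C values are hypothetical, not the demo's:

```python
from sklearn.svm import SVC
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])  # the XOR pattern
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", gamma=1.0, C=10.0)  # gamma and C set boundary complexity
clf.fit(X, y)
print(clf.predict(X))  # should recover [-1 -1 1 1]
```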
Support Vector Machines: Slide 58
Copyright © 2001, 2003, Andrew W. Moore
How to Control the Complexity
• Bob got up and found that breakfast was ready.
Which reasoning below is the most probable?
• Level-1: his child (underfitting)
• Level-2: his wife (reasonable)
• Level-3: the alien (overfitting)
Support Vector Machines: Slide 59
Copyright © 2001, 2003, Andrew W. Moore
How to Control the Complexity
• SVM is powerful enough to approximate any training data
• The complexity affects the performance on new data
• SVM supports parameters for controlling the complexity
• SVM does not tell you how to set these parameters
• Determine the parameters by cross-validation (see the sketch below)
[Figure: test error vs. model complexity, running from underfitting to overfitting]
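A sketch of that cross-validation step using scikit-learn's GridSearchCV; the dataset and the grid of γ and C values are hypothetical stand-ins:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # stand-in data

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)
```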
Support Vector Machines: Slide 60
Copyright © 2001, 2003, Andrew W. Moore
General Condition for Predictivity in Learning Theory
• Tomaso Poggio, Ryan Rifkin, Sayan Mukherjee and
Partha Niyogi. General Condition for Predictivity in
Learning Theory. Nature. Vol 428, March, 2004.
Support Vector Machines: Slide 61
Copyright © 2001, 2003, Andrew W. Moore
Recall The MDL principle……
• MDL stands for minimum description length
• The description length is defined as:
Space required to describe a theory
+
Space required to describe the theory's mistakes
• In our case the theory is the classifier and the
mistakes are the errors on the training data
• Aim: we want a classifier with minimal DL
• MDL principle is a model selection criterion
Support Vector Machines: Slide 62
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for Noisy Data
• Balance the trade-off between margin and classification errors
{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
            (describe the theory)    (describe the mistakes)
subject to: y_j (w · x_j + b) ≥ 1 − ε_j,  ε_j ≥ 0,  j = 1..N
[Figure: noisy data with slack variables ε_1, ε_2, ε_3]
Support Vector Machines: Slide 63
Copyright © 2001, 2003, Andrew W. Moore
SVM Performance
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known
classifier on a well-studied hand-written-character
recognition benchmark
• Another Example: Andrew knows several reliable
people doing practical real-world work who claim
that SVMs have saved them when their other
favorite classifiers did poorly.
• There is a lot of excitement and religious fervor
about SVMs as of 2001.
• Despite this, some practitioners are a little
skeptical.
Support Vector Machines: Slide 64
Copyright © 2001, 2003, Andrew W. Moore
References
• An excellent tutorial on VC-dimension and Support Vector Machines:
C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible (not for beginners, including myself):
Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
• Software: SVM-light, http://svmlight.joachims.org/; LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/; SMO in Weka.
Nov 23rd, 2001
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Regression
Support Vector Machines: Slide 66
Copyright © 2001, 2003, Andrew W. Moore
Roadmap
• Squared-Loss Linear Regression
• Little Noise
• Large Noise
• Linear-Loss Function
• Support Vector Regression
Support Vector Machines: Slide 67
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
[Figure: input x fed to regressor f, output y^est; scatter of (x, y) points with one candidate fit]
f(x,w,b) = w · x − b
How would you fit this data?
Support Vector Machines: Slide 68
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
[Figure: the same data with a second candidate fit]
f(x,w,b) = w · x − b
How would you fit this data?
Support Vector Machines: Slide 69
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
[Figure: a third candidate fit]
f(x,w,b) = w · x − b
How would you fit this data?
Support Vector Machines: Slide 70
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
[Figure: a fourth candidate fit]
f(x,w,b) = w · x − b
How would you fit this data?
Support Vector Machines: Slide 71
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
[Figure: several candidate fits through the same data]
f(x,w,b) = w · x − b
Any of these would be fine..
..but which is best?
Support Vector Machines: Slide 72
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
[Figure: a candidate fit with its residuals]
f(x,w,b) = w · x − b
How to define the fitting error of a linear regression?
Support Vector Machines: Slide 73
Copyright © 2001, 2003, Andrew W. Moore
Linear Regression
f(x,w,b) = w · x − b
How to define the fitting error of a linear regression?
Squared-Loss:  err_i = ( w · x_i − b − y_i )²
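A sketch of fitting under this squared loss with NumPy's least-squares solver; the toy data are hypothetical:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.0])          # roughly y = 2x + 1

A = np.column_stack([x, -np.ones_like(x)])  # columns for the model w*x - b
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = sol
print(w, b)  # w near 2, b near -1 (so the intercept -b is near +1)
```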
Support Vector Machines: Slide 74
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• http://www.math.csusb.edu/faculty/stanton/m262/regress/regress.html
Support Vector Machines: Slide 75
Copyright © 2001, 2003, Andrew W. Moore
Sensitive to Outliers
Outlier
Support Vector Machines: Slide 76
Copyright © 2001, 2003, Andrew W. Moore
Why ?
• Squared-loss function:  err_i = ( w · x_i − b − y_i )²
• The fitting error grows quadratically
Support Vector Machines: Slide 77
Copyright © 2001, 2003, Andrew W. Moore
How about Linear-Loss ?
• Linear-loss function:  err_i = | w · x_i − b − y_i |
• The fitting error grows linearly
Support Vector Machines: Slide 78
Copyright © 2001, 2003, Andrew W. Moore
Actually
• SVR uses the loss function below, the ε-insensitive loss function:
err_i = max( 0, | w · x_i − b − y_i | − ε )
Support Vector Machines: Slide 79
Copyright © 2001, 2003, Andrew W. Moore
Epsilon Support Vector Regression (ε-SVR)
• Given a data set {x1, ..., xn} with target values {u1, ..., un}, we want to do ε-SVR
• The optimization problem is (the standard ε-SVR primal):
min_{w,b} ½ Σ_i w_i² + C Σ_j ( ξ_j + ξ_j* )
subject to: u_j − (w · x_j + b) ≤ ε + ξ_j,  (w · x_j + b) − u_j ≤ ε + ξ_j*,  ξ_j, ξ_j* ≥ 0
• Similar to SVM, this can be solved as a quadratic programming problem
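A sketch of ε-SVR through scikit-learn's SVR (LibSVM-backed); the data and the C and ε values are hypothetical:

```python
from sklearn.svm import SVR
import numpy as np

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])          # roughly y = 2x + 1

reg = SVR(kernel="linear", C=10.0, epsilon=0.1)  # epsilon: half-width of the tube
reg.fit(X, y)
print(reg.predict([[5.0]]))                      # should land near 11
```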
Support Vector Machines: Slide 80
Copyright © 2001, 2003, Andrew W. Moore
Online Demo
• Less sensitive to outliers
Support Vector Machines: Slide 81
Copyright © 2001, 2003, Andrew W. Moore
Again, Extend to Non-Linear Case
• Similar to SVM
Support Vector Machines: Slide 82
Copyright © 2001, 2003, Andrew W. Moore
What We Learn
• Linear Classifier with Clean Data
• Linear Classifier with Noisy Data
• SVM for Noisy and Non-Linear Data
• Linear Regression with Clean Data
• Linear Regression with Noisy Data
• SVR for Noisy and Non-Linear Data
• General Condition for Predictivity in Learning Theory
Support Vector Machines: Slide 83
Copyright © 2001, 2003, Andrew W. Moore
The End
Support Vector Machines: Slide 84
Copyright © 2001, 2003, Andrew W. Moore
Saddle Point