Nov 23rd, 2001
Copyright © 2001, 2003, Andrew W. Moore
Support Vector
Machines
Modified from the slides by Dr. Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
Support Vector Machines: Slide 2
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
(diagram: input x fed to classifier f, producing estimated label y_est)
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
How would you
classify this data?
Support Vector Machines: Slide 3
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
How would you
classify this data?
Support Vector Machines: Slide 4
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
How would you
classify this data?
Support Vector Machines: Slide 5
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
How would you
classify this data?
Support Vector Machines: Slide 6
Copyright © 2001, 2003, Andrew W. Moore
Linear Classifiers
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
Any of these
would be fine..
..but which is
best?
Support Vector Machines: Slide 7
Copyright © 2001, 2003, Andrew W. Moore
Classifier Margin
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
Define the margin
of a linear
classifier as the
width that the
boundary could be
increased by
before hitting a
datapoint.
Support Vector Machines: Slide 8
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
The maximum
margin linear
classifier is the
linear classifier
with the, um,
maximum margin.
This is the
simplest kind of
SVM (Called an
LSVM)
Linear SVM
Support Vector Machines: Slide 9
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
The maximum
margin linear
classifier is the
linear classifier
with the, um,
maximum margin.
This is the
simplest kind of
SVM (Called an
LSVM)
Support Vectors
are those
datapoints that
the margin
pushes up
against
Linear SVM
Support Vector Machines: Slide 10
Copyright © 2001, 2003, Andrew W. Moore
Why Maximum Margin?
denotes +1
denotes -1
f(x,w,b) = sign(w. x - b)
The maximum
margin linear
classifier is the
linear classifier
with the, um,
maximum margin.
This is the
simplest kind of
SVM (Called an
LSVM)
Support Vectors
are those
datapoints that
the margin
pushes up
against
1. Intuitively this feels safest.
2. If we’ve made a small error in the
location of the boundary (it’s been
jolted in its perpendicular direction)
this gives us least chance of causing a
misclassification.
3. LOOCV is easy since the model is
immune to removal of any non-
support-vector datapoints.
4. There’s some theory (using VC
dimension) that is related to (but not
the same as) the proposition that this
is a good thing.
5. Empirically it works very very well.
Support Vector Machines: Slide 11
Copyright © 2001, 2003, Andrew W. Moore
Estimate the Margin
• What is the distance expression for a point x to a
line wx+b= 0?
denotes +1
denotes -1 x
w·x + b = 0

d(x) = |w·x + b| / √( Σ_{i=1..d} w_i² )
Support Vector Machines: Slide 12
Copyright © 2001, 2003, Andrew W. Moore
Estimate the Margin
• What is the expression for margin?
denotes +1
denotes -1
w·x + b = 0

margin = min_{x ∈ D} d(x) = min_{x ∈ D} |w·x + b| / √( Σ_{i=1..d} w_i² )
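A quick numeric illustration of this definition (my own toy numbers, not from the slides): compute d(x) for every point in D and take the minimum.

```python
# Minimal sketch: margin = smallest distance |w.x + b| / ||w|| over the dataset.
import numpy as np

w, b = np.array([1.0, 1.0]), -0.5
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, 0.3]])

distances = np.abs(X @ w + b) / np.sqrt(np.sum(w ** 2))   # d(x) per point
print("d(x):", distances)
print("margin =", distances.min())
```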
Support Vector Machines: Slide 13
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
denotes +1
denotes -1 wx +b = 0
argmax_{w,b} margin(w, b, D)
  = argmax_{w,b} min_{x_i ∈ D} d(x_i)
  = argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / √( Σ_{i=1..d} w_i² )
Support Vector Machines: Slide 14
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
denotes +1
denotes -1
w·x + b = 0

argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / √( Σ_{i=1..d} w_i² )
subject to ∀ x_i ∈ D : y_i (w·x_i + b) ≥ 0

• Min-max problem → game problem
Support Vector Machines: Slide 15
Copyright © 2001, 2003, Andrew W. Moore
Maximize Margin
denotes +1
denotes -1
w·x + b = 0

argmax_{w,b} min_{x_i ∈ D} |w·x_i + b| / √( Σ_{i=1..d} w_i² )
subject to ∀ x_i ∈ D : y_i (w·x_i + b) ≥ 0

Strategy: require ∀ x_i ∈ D : |w·x_i + b| ≥ 1. Then

argmin_{w,b} Σ_{i=1..d} w_i²
subject to ∀ x_i ∈ D : y_i (w·x_i + b) ≥ 1
Support Vector Machines: Slide 16
Copyright © 2001, 2003, Andrew W. Moore
Maximum Margin Linear Classifier
• How to solve it?

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i²
subject to
  y_1 (w·x_1 + b) ≥ 1
  y_2 (w·x_2 + b) ≥ 1
  ....
  y_N (w·x_N + b) ≥ 1
Support Vector Machines: Slide 17
Copyright © 2001, 2003, Andrew W. Moore
Learning via Quadratic Programming
• QP is a well-studied class of optimization problems: optimize (minimize or maximize) a quadratic function of some real-valued variables subject to linear constraints.
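As a rough illustration of what such a solver does (a sketch under my own assumptions: made-up toy data and scipy's generic SLSQP optimizer rather than a dedicated QP package), here is the maximum-margin primal from the previous slide solved directly:

```python
# Sketch: argmin_{w,b} sum_i w_i^2  subject to  y_k (w . x_k + b) >= 1
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 2.0],        # class +1
              [-1.0, -1.0], [-2.0, -1.5], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])
d = X.shape[1]

def objective(u):            # u = [w_1..w_d, b]
    w = u[:d]
    return np.dot(w, w)      # sum_i w_i^2

constraints = [{'type': 'ineq',   # y_k (w.x_k + b) - 1 >= 0
                'fun': lambda u, xk=xk, yk=yk: yk * (np.dot(u[:d], xk) + u[d]) - 1.0}
               for xk, yk in zip(X, y)]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints, method='SLSQP')
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
print("margin width =", 2.0 / np.linalg.norm(w))
```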
Support Vector Machines: Slide 18
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming
Quadratic criterion:

  Find   argmin_u   c + d^T u + (u^T R u) / 2

Subject to   n additional linear inequality constraints:

  a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
  a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
    :
  a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

And subject to   e additional linear equality constraints:

  a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
  a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
  a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
Support Vector Machines: Slide 19
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming
Quadratic criterion:

  Find   argmin_u   c + d^T u + (u^T R u) / 2

Subject to   n additional linear inequality constraints:

  a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
  a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
    :
  a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n

And subject to   e additional linear equality constraints:

  a_(n+1)1 u_1 + a_(n+1)2 u_2 + ... + a_(n+1)m u_m = b_(n+1)
  a_(n+2)1 u_1 + a_(n+2)2 u_2 + ... + a_(n+2)m u_m = b_(n+2)
    :
  a_(n+e)1 u_1 + a_(n+e)2 u_2 + ... + a_(n+e)m u_m = b_(n+e)
Support Vector Machines: Slide 20
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Programming

{w*, b*} = argmin_{w,b} Σ_i w_i²
subject to y_i (w·x_i + b) ≥ 1 for all training data (x_i, y_i)

In the generic QP template above this is a quadratic criterion with u = (w, b), R the identity on the w coordinates, d = 0, c = 0:

{w*, b*} = argmin_{w,b} w^T w
subject to (N inequality constraints)
  y_1 (w·x_1 + b) ≥ 1
  y_2 (w·x_2 + b) ≥ 1
  ....
  y_N (w·x_N + b) ≥ 1
Support Vector Machines: Slide 21
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
denotes +1
denotes -1
This is going to be a problem!
What should we do?
Support Vector Machines: Slide 22
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
denotes +1
denotes -1
This is going to be a problem!
What should we do?
Idea 1:
Find minimum w.w, while
minimizing number of
training set errors.
Problemette: Two things
to minimize makes for
an ill-defined
optimization
Support Vector Machines: Slide 23
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
denotes +1
denotes -1
This is going to be a problem!
What should we do?
Idea 1.1:
Minimize
w.w + C (#train errors)
There’s a serious practical
problem that’s about to make
us reject this approach. Can
you guess what it is?
Tradeoff parameter
Support Vector Machines: Slide 24
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
denotes +1
denotes -1
This is going to be a problem!
What should we do?
Idea 1.1:
Minimize
w.w + C (#train errors)
There’s a serious practical
problem that’s about to make
us reject this approach. Can
you guess what it is?
Tradeoff parameter
Can’t be expressed as a Quadratic
Programming problem.
Solving it may be too slow.
(Also, doesn’t distinguish between
disastrous errors and near misses)
Support Vector Machines: Slide 25
Copyright © 2001, 2003, Andrew W. Moore
Uh-oh!
denotes +1
denotes -1
This is going to be a problem!
What should we do?
Idea 2.0:
Minimize
w.w + C (distance of error
points to their
correct place)
Support Vector Machines: Slide 26
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for
Noisy Data
• Any problem with the above formulation?

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
subject to
  y_1 (w·x_1 + b) ≥ 1 − ε_1
  y_2 (w·x_2 + b) ≥ 1 − ε_2
  ...
  y_N (w·x_N + b) ≥ 1 − ε_N

denotes +1
denotes -1
(figure: slack variables ε_1, ε_2, ε_3 marked for three points on the wrong side of their margin)
Support Vector Machines: Slide 27
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine (SVM) for
Noisy Data
• Balance the trade-off between margin and classification errors

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
subject to
  y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
  y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
  ...
  y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0

denotes +1
denotes -1
(figure: slack variables ε_1, ε_2, ε_3 marked for three points on the wrong side of their margin)
Support Vector Machines: Slide 28
Copyright © 2001, 2003, Andrew W. Moore
Support Vector Machine for Noisy Data

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
subject to (N inequality constraints)
  y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
  y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
  ....
  y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0
How do we determine the appropriate value for c ?
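One common answer, sketched below (my own illustration, not from the slides): treat c as a hyperparameter and pick it by cross-validation. The sketch uses scikit-learn, where the same trade-off parameter is called C, on made-up data.

```python
# Sketch: choose the soft-margin trade-off by cross-validated grid search.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+1.5, size=(50, 2)),
               rng.normal(loc=-1.5, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"],
      "cv accuracy:", round(search.best_score_, 3))
```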
Support Vector Machines: Slide 29
Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k

Then classify with:
  f(x,w,b) = sign(w·x − b)
Support Vector Machines: Slide 30
Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k
Support Vector Machines: Slide 31
Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k

Datapoints with α_k > 0 will be the support vectors
..so this sum only needs to be over the support vectors.
Support Vector Machines: Slide 32
Copyright © 2001, 2003, Andrew W. Moore
Support Vectors
denotes +1
denotes -1

w·x + b = +1
w·x + b = −1

Support Vectors

Decision boundary is determined only by those support vectors!

w = Σ_{k=1..R} α_k y_k x_k

α_i ( y_i (w·x_i + b) − 1 ) = 0 for all i
α_i = 0 for non-support vectors
α_i > 0 for support vectors
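A small sketch of this point using scikit-learn's SVC (my own example, not from the slides): the learned w can be rebuilt from the support vectors and their α's alone.

```python
# Sketch: w = sum_k alpha_k y_k x_k over the support vectors only.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# dual_coef_ holds alpha_k * y_k for the support vectors only
w = clf.dual_coef_ @ clf.support_vectors_
print("support vector indices:", clf.support_)
print("w from alphas:", w.ravel(), " sklearn coef_:", clf.coef_.ravel())
```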
Support Vector Machines: Slide 33
Copyright © 2001, 2003, Andrew W. Moore
The Dual Form of QP
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k

Then classify with:
  f(x,w,b) = sign(w·x − b)
How to determine b ?
Support Vector Machines: Slide 34
Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP: Determine b
A linear programming problem!

{w*, b*} = argmin_{w,b} Σ_{i=1..d} w_i² + c Σ_{j=1..N} ε_j
subject to
  y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
  y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
  ....
  y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0

Fix w (from the dual solution); then

b* = argmin_b Σ_{j=1..N} ε_j
subject to
  y_1 (w·x_1 + b) ≥ 1 − ε_1,  ε_1 ≥ 0
  y_2 (w·x_2 + b) ≥ 1 − ε_2,  ε_2 ≥ 0
  ....
  y_N (w·x_N + b) ≥ 1 − ε_N,  ε_N ≥ 0
Support Vector Machines: Slide 35
Copyright © 2001, 2003, Andrew W. Moore
An Equivalent QP

Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k
  b = y_K (1 − ε_K) − x_K·w,  where K = argmax_k α_k

Then classify with:
  f(x,w,b) = sign(w·x − b)

Datapoints with α_k > 0 will be the support vectors
..so this sum only needs to be over the support vectors.

Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly
• Because of further jaw-dropping developments you're about to learn.
Support Vector Machines: Slide 36
Copyright © 2001, 2003, Andrew W. Moore
Suppose we’re in 1-dimension
What would
SVMs do with
this data?
x=0
Support Vector Machines: Slide 37
Copyright © 2001, 2003, Andrew W. Moore
Suppose we’re in 1-dimension
Not a big surprise
Positive “plane” Negative “plane”
x=0
Support Vector Machines: Slide 38
Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
That’s wiped the
smirk off SVM’s
face.
What can be
done about
this?
x=0
Support Vector Machines: Slide 39
Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
Remember how
permitting non-
linear basis
functions made
linear regression
so much nicer?
Let’s permit them
here too
x=0

z_k = (x_k, x_k²)
Support Vector Machines: Slide 40
Copyright © 2001, 2003, Andrew W. Moore
Harder 1-dimensional dataset
Remember how
permitting non-
linear basis
functions made
linear regression
so much nicer?
Let’s permit them
here too
x=0

z_k = (x_k, x_k²)
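A tiny sketch (my own toy data, not from the slides) of exactly this basis expansion: after mapping each 1-D point x_k to z_k = (x_k, x_k²), the "harder" dataset becomes linearly separable.

```python
# Sketch: negatives sit between positives on the line, but not in (x, x^2) space.
import numpy as np
from sklearn.svm import SVC

x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])   # not separable in 1-D

Z = np.column_stack([x, x ** 2])            # z_k = (x_k, x_k^2)
clf = SVC(kernel="linear", C=1e6).fit(Z, y)
print("training accuracy after the basis expansion:", clf.score(Z, y))  # 1.0
```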
Support Vector Machines: Slide 41
Copyright © 2001, 2003, Andrew W. Moore
Common SVM basis functions
zk = ( polynomial terms of xk of degree 1 to q )
zk = ( radial basis functions of xk )
zk = ( sigmoid functions of xk )
This is sensible.
Is that the end of the story?
No…there's one more trick!

z_k[j] = φ_j(x_k) = exp( −|x_k − c_j|² / σ² )
Support Vector Machines: Slide 42
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Basis Functions

Φ(x) = ( 1,                                                     ← constant term
         √2·x_1, √2·x_2, ..., √2·x_m,                           ← linear terms
         x_1², x_2², ..., x_m²,                                 ← pure quadratic terms
         √2·x_1x_2, √2·x_1x_3, ..., √2·x_1x_m,
         √2·x_2x_3, ..., √2·x_{m−1}x_m )                        ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2
= (m+2)(m+1)/2
= (as near as makes no difference) m²/2

You may be wondering what those √2's are doing.
• You should be happy that they do no harm
• You'll find out why they're there soon.
Support Vector Machines: Slide 43
Copyright © 2001, 2003, Andrew W. Moore
QP (old)
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l (x_k·x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k=1..R} α_k y_k x_k

Then classify with:
  f(x,w,b) = sign(w·x − b)
Support Vector Machines: Slide 44
Copyright © 2001, 2003, Andrew W. Moore
QP with basis functions
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:
  f(x,w,b) = sign(w·Φ(x) − b)

Most important change: x → Φ(x)
Support Vector Machines: Slide 45
Copyright © 2001, 2003, Andrew W. Moore
QP with basis functions
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:
  f(x,w,b) = sign(w·Φ(x) − b)

We must do R²/2 dot products to get this matrix ready.
Each dot product requires m²/2 additions and multiplications.
The whole thing costs R² m² /4. Yeeks!
…or does it?
Support Vector Machines: Slide 46
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Dot Products

Φ(a)·Φ(b): multiply the corresponding entries of the two quadratic basis vectors Φ(a) and Φ(b) and add them up:

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j
Support Vector Machines: Slide 47
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Just out of casual, innocent, interest, let's look at another function of a and b:

(a·b + 1)² = (a·b)² + 2 a·b + 1
           = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
           = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
           = Σ_{i=1..m} (a_i b_i)² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
Support Vector Machines: Slide 48
Copyright © 2001, 2003, Andrew W. Moore
Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1..m} a_i b_i + Σ_{i=1..m} a_i² b_i² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i a_j b_i b_j

Just out of casual, innocent, interest, let's look at another function of a and b:

(a·b + 1)² = (a·b)² + 2 a·b + 1
           = ( Σ_{i=1..m} a_i b_i )² + 2 Σ_{i=1..m} a_i b_i + 1
           = Σ_{i=1..m} Σ_{j=1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1
           = Σ_{i=1..m} (a_i b_i)² + 2 Σ_{i=1..m} Σ_{j=i+1..m} a_i b_i a_j b_j + 2 Σ_{i=1..m} a_i b_i + 1

They're the same!
And this is only O(m) to compute!
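A quick numerical check of this identity (my own sketch, not from the slides): build Φ explicitly, then compare Φ(a)·Φ(b) against (a·b + 1)².

```python
# Sketch: explicit quadratic basis expansion vs. the O(m) kernel evaluation.
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis: 1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j (i < j)."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

a = np.array([0.3, -1.2, 2.0])
b = np.array([1.5,  0.4, -0.7])

lhs = phi(a) @ phi(b)          # ~m^2/2 terms, explicit feature space
rhs = (a @ b + 1.0) ** 2       # O(m) kernel evaluation
print(lhs, rhs, np.isclose(lhs, rhs))   # identical up to rounding
```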
Support Vector Machines: Slide 49
Copyright © 2001, 2003, Andrew W. Moore
QP with Quintic basis functions
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l ( Φ(x_k)·Φ(x_l) )

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)

Then classify with:
  f(x,w,b) = sign(w·Φ(x) − b)
Support Vector Machines: Slide 50
Copyright © 2001, 2003, Andrew W. Moore
QP with Quadratic basis functions
Maximize   Σ_{k=1..R} α_k − (1/2) Σ_{k=1..R} Σ_{l=1..R} α_k α_l Q_kl
where Q_kl = y_k y_l K(x_k, x_l)

Subject to these constraints:
  0 ≤ α_k ≤ C   for all k
  Σ_{k=1..R} α_k y_k = 0

Then define:
  w = Σ_{k s.t. α_k > 0} α_k y_k Φ(x_k)
  b = y_K (1 − ε_K) − x_K·w,  where K = argmax_k α_k

Then classify with:
  f(x,w,b) = sign(K(w, x) − b)

Most important change:
  Φ(x_k)·Φ(x_l) → K(x_k, x_l)
Support Vector Machines: Slide 51
Copyright © 2001, 2003, Andrew W. Moore
Higher Order Polynomials
Polynomial | Φ(x)                           | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | Φ(a)·Φ(b)  | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
Quadratic  | All m²/2 terms up to degree 2  | m² R² / 4                               | 2,500 R²           | (a·b + 1)² | m R² / 2                           | 50 R²
Cubic      | All m³/6 terms up to degree 3  | m³ R² / 12                              | 83,000 R²          | (a·b + 1)³ | m R² / 2                           | 50 R²
Quartic    | All m⁴/24 terms up to degree 4 | m⁴ R² / 48                              | 1,960,000 R²       | (a·b + 1)⁴ | m R² / 2                           | 50 R²
Support Vector Machines: Slide 52
Copyright © 2001, 2003, Andrew W. Moore
SVM Kernel Functions
• K(a,b) = (a·b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
• Radial-Basis-style Kernel Function:
  K(a,b) = exp( −(a − b)² / (2σ²) )
• Neural-net-style Kernel Function:
  K(a,b) = tanh( κ a·b − δ )
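For concreteness, a sketch of the two kernels above as plain functions (my own code, not from the slides; σ, κ and δ are free parameters):

```python
# Sketch: radial-basis and neural-net style kernels for vector-valued a and b.
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(a, b) - delta)

a, b = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(rbf_kernel(a, b), neural_net_kernel(a, b))
```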
Support Vector Machines: Slide 53
Copyright © 2001, 2003, Andrew W. Moore
Kernel Tricks
• Replacing dot product with a kernel function
• Not all functions are kernel functions
• Need to be decomposable: K(a,b) = Φ(a)·Φ(b)
• Could K(a,b) = (a − b)³ be a kernel function?
• Could K(a,b) = (a − b)⁴ − (a + b)² be a kernel function?
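One quick empirical probe for questions like these (my own sketch; a necessary but not sufficient test that anticipates Mercer's condition on the next slide): a genuine kernel's Gram matrix on any sample must be positive semi-definite, so sample some points and inspect the eigenvalues. As a stand-in non-kernel I use the squared Euclidean distance, which is symmetric but fails the test.

```python
# Sketch: eigenvalue check of Gram matrices for a valid kernel and a non-kernel.
import numpy as np

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

valid   = lambda a, b: (np.dot(a, b) + 1.0) ** 2        # (a.b + 1)^2 -- a real kernel
suspect = lambda a, b: float(np.sum((a - b) ** 2))      # squared distance -- not a kernel

for name, k in [("(a.b+1)^2", valid), ("||a-b||^2", suspect)]:
    eig_min = np.linalg.eigvalsh(gram(k, X)).min()
    print(name, "smallest Gram eigenvalue:", round(eig_min, 6))
```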
Support Vector Machines: Slide 54
Copyright © 2001, 2003, Andrew W. Moore
Kernel Tricks
• Mercer's condition
  To expand a kernel function K(x,y) into a dot product, i.e. K(x,y) = Φ(x)·Φ(y), K(x,y) has to be a positive semi-definite function, i.e., for any function f(x) whose ∫ f(x)² dx is finite, the following inequality holds:
    ∫∫ f(x) K(x,y) f(y) dx dy ≥ 0
• Could K(x,y) = Σ_i (x_i y_i)^p be a kernel function?
Support Vector Machines: Slide 55
Copyright © 2001, 2003, Andrew W. Moore
Kernel Tricks
• Pro
• Introducing nonlinearity into the model
• Computationally cheap
• Con
• Still have potential overfitting problems
Support Vector Machines: Slide 56
Copyright © 2001, 2003, Andrew W. Moore
Nonlinear Kernel (I)
Support Vector Machines: Slide 57
Copyright © 2001, 2003, Andrew W. Moore
Nonlinear Kernel (II)
Support Vector Machines: Slide 58
Copyright © 2001, 2003, Andrew W. Moore
Kernelize Logistic Regression
How can we introduce the nonlinearity into the
logistic regression?
p(y | x) = 1 / ( 1 + exp( −y x·w ) )

l_reg(w) = Σ_{i=1..N} log[ 1 / ( 1 + exp( −y_i x_i·w ) ) ] + c Σ_k w_k²
Support Vector Machines: Slide 59
Copyright © 2001, 2003, Andrew W. Moore
Kernelize Logistic Regression

• Representation Theorem:
  w = Σ_{i=1..N} α_i Φ(x_i),   so   Φ(x)·w = Σ_{i=1..N} α_i K(x, x_i)

  p(y | x) = 1 / ( 1 + exp( −y Σ_{i=1..N} α_i K(x, x_i) ) )

  l_reg(α) = Σ_{i=1..N} log[ 1 / ( 1 + exp( −y_i Σ_{j=1..N} α_j K(x_i, x_j) ) ) ] + c Σ_{i,j=1..N} α_i α_j K(x_i, x_j)
Support Vector Machines: Slide 60
Copyright © 2001, 2003, Andrew W. Moore
Overfitting in SVM
[Two plots: classification error vs. polynomial kernel degree (1–5) on the Breast Cancer and Ionosphere datasets, showing training error and testing error curves.]
Support Vector Machines: Slide 61
Copyright © 2001, 2003, Andrew W. Moore
SVM Performance
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known
classifier on a well-studied hand-written-character
recognition benchmark
• Another Example: Andrew knows several reliable
people doing practical real-world work who claim
that SVMs have saved them when their other
favorite classifiers did poorly.
• There is a lot of excitement and religious fervor
about SVMs as of 2001.
• Despite this, some practitioners are a little
skeptical.
Support Vector Machines: Slide 62
Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel
• Kernel function describes the correlation or similarity
between two data points
• Suppose we have a function s(x,y) that describes the similarity between two data points, and assume it is non-negative and symmetric. How can we generate a kernel function based on this similarity function?
• A graph theory approach …
Support Vector Machines: Slide 63
Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel
• Create a graph for the data points
• Each vertex corresponds to a data point
• The weight of each edge is the similarity s(x,y)
• Graph Laplacian:
    L_{i,j} = s(x_i, x_j)              if i ≠ j
            = − Σ_{k ≠ i} s(x_i, x_k)  if i = j
• Properties of Laplacian
  • Negative semi-definite
Support Vector Machines: Slide 64
Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel
• Consider a simple Laplacian:
    L_{i,j} = 1           if x_i and x_j are connected (i ≠ j)
            = − |N(x_i)|  if i = j,  where N(x_i) is the set of neighbours of x_i
• Consider L², L⁴, …
• What do these matrices represent?
• A diffusion kernel
Support Vector Machines: Slide 65
Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel
• Consider a simple Laplacian:
    L_{i,j} = 1           if x_i and x_j are connected (i ≠ j)
            = − |N(x_i)|  if i = j,  where N(x_i) is the set of neighbours of x_i
• Consider L², L⁴, …
• What do these matrices represent?
• A diffusion kernel
Support Vector Machines: Slide 66
Copyright © 2001, 2003, Andrew W. Moore
Diffusion Kernel: Properties
• Positive definite
• Local relationships L induce global relationships
• Works for undirected weighted graphs with symmetric similarities s(x_i, x_j) = s(x_j, x_i)
• How to compute the diffusion kernel e^{βL}?
    K = e^{βL},  or equivalently  dK/dβ = L K
Support Vector Machines: Slide 67
Copyright © 2001, 2003, Andrew W. Moore
Computing Diffusion Kernel
• Singular value decomposition of Laplacian L:

    L = U Σ U^T = (u_1, u_2, ..., u_m) diag(σ_1, σ_2, ..., σ_m) (u_1, u_2, ..., u_m)^T = Σ_{i=1..m} σ_i u_i u_i^T

    where u_i^T u_j = δ_{i,j} = 1 if i = j, 0 otherwise

• What is L²?

    L² = ( Σ_{i=1..m} σ_i u_i u_i^T )² = Σ_{i,j=1..m} σ_i σ_j u_i (u_i^T u_j) u_j^T = Σ_{i=1..m} σ_i² u_i u_i^T
Support Vector Machines: Slide 68
Copyright © 2001, 2003, Andrew W. Moore
Computing Diffusion Kernel
• What about Lⁿ?

    Lⁿ = Σ_{i=1..m} σ_i^n u_i u_i^T

• Compute the diffusion kernel e^{βL}:

    e^{βL} = Σ_{n=0..∞} (βⁿ / n!) Lⁿ = Σ_{i=1..m} ( Σ_{n=0..∞} βⁿ σ_i^n / n! ) u_i u_i^T = Σ_{i=1..m} e^{β σ_i} u_i u_i^T
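Putting the last two slides together, a sketch (my own illustration, not from the slides; β is a free parameter) that builds the simple Laplacian of a small graph and computes e^{βL} through its eigen-decomposition:

```python
# Sketch: diffusion kernel K = e^{beta L} via the eigen-decomposition of L.
import numpy as np

A = np.array([[0, 1, 1, 0],        # adjacency of a small undirected graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = A - np.diag(A.sum(axis=1))     # off-diagonal 1s, diagonal -|N(x_i)|

beta = 0.5
sigma, U = np.linalg.eigh(L)                 # L = U diag(sigma) U^T
K = (U * np.exp(beta * sigma)) @ U.T         # sum_i e^{beta sigma_i} u_i u_i^T

print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() > 0)  # symmetric, positive definite
```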
Support Vector Machines: Slide 69
Copyright © 2001, 2003, Andrew W. Moore
Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a
categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVM’s
• SVM 1 learns “Output==1” vs “Output != 1”
• SVM 2 learns “Output==2” vs “Output != 2”
• :
• SVM N learns “Output==N” vs “Output != N”
• Then to predict the output for a new input, just
predict with each SVM and find out which one puts
the prediction the furthest into the positive region.
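A compact sketch of this one-vs-rest recipe (my own illustration with scikit-learn and made-up data): train one SVM per class and predict with the largest decision value.

```python
# Sketch: one SVM per class; predict the class whose SVM scores the point highest.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(40, 2)) for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 40)

svms = {c: SVC(kernel="linear").fit(X, (y == c).astype(int) * 2 - 1)
        for c in np.unique(y)}

def predict(x):
    scores = {c: svm.decision_function([x])[0] for c, svm in svms.items()}
    return max(scores, key=scores.get)

print(predict([3.8, 0.2]), predict([0.1, 3.9]))   # expected: 1, 2
```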
Support Vector Machines: Slide 70
Copyright © 2001, 2003, Andrew W. Moore
Ranking Problem
• Consider a problem of ranking essays
• Three ranking categories: good, ok, bad
• Given an input document, predict its ranking category
• How should we formulate this problem?
• A simple multiple-class solution
• Each ranking category is an independent class
• But, there is something missing here …
• We miss the ordinal relationship between classes !
Support Vector Machines: Slide 71
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• Which choice is better?
• How could we formulate this problem?
‘good’
‘OK’
‘bad’
w
w’
Support Vector Machines: Slide 72
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• What are the two decision boundaries?
    w·x + b_1 = 0   and   w·x + b_2 = 0
• What is the margin for ordinal regression?
    margin_1(w, b_1) = min_{x ∈ D_g ∪ D_o} d(x) = min_{x ∈ D_g ∪ D_o} |w·x + b_1| / √( Σ_{i=1..d} w_i² )
    margin_2(w, b_2) = min_{x ∈ D_o ∪ D_b} d(x) = min_{x ∈ D_o ∪ D_b} |w·x + b_2| / √( Σ_{i=1..d} w_i² )
    margin(w, b_1, b_2) = min( margin_1(w, b_1), margin_2(w, b_2) )
• Maximize margin
    {w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(w, b_1, b_2)
Support Vector Machines: Slide 73
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• What are the two decision boundaries?
    w·x + b_1 = 0   and   w·x + b_2 = 0
• What is the margin for ordinal regression?
    margin_1(w, b_1) = min_{x ∈ D_g ∪ D_o} d(x) = min_{x ∈ D_g ∪ D_o} |w·x + b_1| / √( Σ_{i=1..d} w_i² )
    margin_2(w, b_2) = min_{x ∈ D_o ∪ D_b} d(x) = min_{x ∈ D_o ∪ D_b} |w·x + b_2| / √( Σ_{i=1..d} w_i² )
    margin(w, b_1, b_2) = min( margin_1(w, b_1), margin_2(w, b_2) )
• Maximize margin
    {w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(w, b_1, b_2)
Support Vector Machines: Slide 74
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• What are the two decision boundaries?
    w·x + b_1 = 0   and   w·x + b_2 = 0
• What is the margin for ordinal regression?
    margin_1(w, b_1) = min_{x ∈ D_g ∪ D_o} d(x) = min_{x ∈ D_g ∪ D_o} |w·x + b_1| / √( Σ_{i=1..d} w_i² )
    margin_2(w, b_2) = min_{x ∈ D_o ∪ D_b} d(x) = min_{x ∈ D_o ∪ D_b} |w·x + b_2| / √( Σ_{i=1..d} w_i² )
    margin(w, b_1, b_2) = min( margin_1(w, b_1), margin_2(w, b_2) )
• Maximize margin
    {w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(w, b_1, b_2)
Support Vector Machines: Slide 75
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• How do we solve this monster?

{w*, b_1*, b_2*} = argmax_{w, b_1, b_2} margin(w, b_1, b_2)
                 = argmax_{w, b_1, b_2} min( margin_1(w, b_1), margin_2(w, b_2) )
                 = argmax_{w, b_1, b_2} min( min_{x_i ∈ D_g ∪ D_o} |w·x_i + b_1| / √( Σ_{i=1..d} w_i² ),
                                             min_{x_i ∈ D_o ∪ D_b} |w·x_i + b_2| / √( Σ_{i=1..d} w_i² ) )
subject to
  ∀ x_i ∈ D_g : w·x_i + b_1 ≥ 0
  ∀ x_i ∈ D_o : w·x_i + b_1 ≤ 0,  w·x_i + b_2 ≥ 0
  ∀ x_i ∈ D_b : w·x_i + b_2 ≤ 0
Support Vector Machines: Slide 76
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• The same old trick
• To remove the scaling invariance, set
    ∀ x_i ∈ D_g ∪ D_o : |w·x_i + b_1| ≥ 1
    ∀ x_i ∈ D_o ∪ D_b : |w·x_i + b_2| ≥ 1
• Now the problem is simplified as:
    {w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i²
    subject to
      ∀ x_i ∈ D_g : w·x_i + b_1 ≥ 1
      ∀ x_i ∈ D_o : w·x_i + b_1 ≤ −1,  w·x_i + b_2 ≥ 1
      ∀ x_i ∈ D_b : w·x_i + b_2 ≤ −1
Support Vector Machines: Slide 77
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression
• Noisy case:

    {w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i² + c Σ_{x_i ∈ D_g} ε_i + c Σ_{x_i ∈ D_o} (ε_i + ε_i′) + c Σ_{x_i ∈ D_b} ε_i
    subject to
      ∀ x_i ∈ D_g : w·x_i + b_1 ≥ 1 − ε_i,  ε_i ≥ 0
      ∀ x_i ∈ D_o : w·x_i + b_1 ≤ −1 + ε_i,  w·x_i + b_2 ≥ 1 − ε_i′,  ε_i ≥ 0,  ε_i′ ≥ 0
      ∀ x_i ∈ D_b : w·x_i + b_2 ≤ −1 + ε_i,  ε_i ≥ 0

• Is this sufficient?
Support Vector Machines: Slide 78
Copyright © 2001, 2003, Andrew W. Moore
Ordinal Regression

(figure: the ordered regions 'good', 'OK', 'bad' separated by the two parallel boundaries, with normal direction w)

{w*, b_1*, b_2*} = argmin_{w, b_1, b_2} Σ_{i=1..d} w_i² + c Σ_{x_i ∈ D_g} ε_i + c Σ_{x_i ∈ D_o} (ε_i + ε_i′) + c Σ_{x_i ∈ D_b} ε_i
subject to
  ∀ x_i ∈ D_g : w·x_i + b_1 ≥ 1 − ε_i,  ε_i ≥ 0
  ∀ x_i ∈ D_o : w·x_i + b_1 ≤ −1 + ε_i,  w·x_i + b_2 ≥ 1 − ε_i′,  ε_i ≥ 0,  ε_i′ ≥ 0
  ∀ x_i ∈ D_b : w·x_i + b_2 ≤ −1 + ε_i,  ε_i ≥ 0
  b_1 ≤ b_2
Support Vector Machines: Slide 79
Copyright © 2001, 2003, Andrew W. Moore
References
• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible (not for beginners, including myself):
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998.
• Software: SVM-light, http://svmlight.joachims.org/, free download
More Related Content

Similar to svm-jain.ppt

Tutorial on Support Vector Machine
Tutorial on Support Vector MachineTutorial on Support Vector Machine
Tutorial on Support Vector MachineLoc Nguyen
 
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques  Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques ijsc
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer visionzukun
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector MachineShao-Chuan Wang
 
Notes relating to Machine Learning and SVM
Notes relating to Machine Learning and SVMNotes relating to Machine Learning and SVM
Notes relating to Machine Learning and SVMSyedSaimGardezi
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine VisionNasir Jumani
 
Two algorithms to accelerate training of back-propagation neural networks
Two algorithms to accelerate training of back-propagation neural networksTwo algorithms to accelerate training of back-propagation neural networks
Two algorithms to accelerate training of back-propagation neural networksESCOM
 
Overfit10
Overfit10Overfit10
Overfit10okeee
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningKrzysztof Kowalczyk
 
linear SVM.ppt
linear SVM.pptlinear SVM.ppt
linear SVM.pptMahimMajee
 
TensorFlow 深度學習快速上手班--電腦視覺應用
TensorFlow 深度學習快速上手班--電腦視覺應用TensorFlow 深度學習快速上手班--電腦視覺應用
TensorFlow 深度學習快速上手班--電腦視覺應用Mark Chang
 
NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用
NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用
NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用NTC.im(Notch Training Center)
 
9.6 Systems of Inequalities and Linear Programming
9.6 Systems of Inequalities and Linear Programming9.6 Systems of Inequalities and Linear Programming
9.6 Systems of Inequalities and Linear Programmingsmiller5
 
Dm part03 neural-networks-handout
Dm part03 neural-networks-handoutDm part03 neural-networks-handout
Dm part03 neural-networks-handoutokeee
 

Similar to svm-jain.ppt (20)

SVM.ppt
SVM.pptSVM.ppt
SVM.ppt
 
Tutorial on Support Vector Machine
Tutorial on Support Vector MachineTutorial on Support Vector Machine
Tutorial on Support Vector Machine
 
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques  Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
Methodological Study Of Opinion Mining And Sentiment Analysis Techniques
 
05 history of cv a machine learning (theory) perspective on computer vision
05  history of cv a machine learning (theory) perspective on computer vision05  history of cv a machine learning (theory) perspective on computer vision
05 history of cv a machine learning (theory) perspective on computer vision
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector Machine
 
Mle
MleMle
Mle
 
Notes relating to Machine Learning and SVM
Notes relating to Machine Learning and SVMNotes relating to Machine Learning and SVM
Notes relating to Machine Learning and SVM
 
Diving into Tensorflow.js
Diving into Tensorflow.jsDiving into Tensorflow.js
Diving into Tensorflow.js
 
Lecture4 xing
Lecture4 xingLecture4 xing
Lecture4 xing
 
Introduction to Machine Vision
Introduction to Machine VisionIntroduction to Machine Vision
Introduction to Machine Vision
 
Two algorithms to accelerate training of back-propagation neural networks
Two algorithms to accelerate training of back-propagation neural networksTwo algorithms to accelerate training of back-propagation neural networks
Two algorithms to accelerate training of back-propagation neural networks
 
Overfit10
Overfit10Overfit10
Overfit10
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine Learning
 
Kmeans
KmeansKmeans
Kmeans
 
linear SVM.ppt
linear SVM.pptlinear SVM.ppt
linear SVM.ppt
 
PPT
PPTPPT
PPT
 
TensorFlow 深度學習快速上手班--電腦視覺應用
TensorFlow 深度學習快速上手班--電腦視覺應用TensorFlow 深度學習快速上手班--電腦視覺應用
TensorFlow 深度學習快速上手班--電腦視覺應用
 
NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用
NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用
NTC_TENSORFLOW深度學習快速上手班_Part3_電腦視覺應用
 
9.6 Systems of Inequalities and Linear Programming
9.6 Systems of Inequalities and Linear Programming9.6 Systems of Inequalities and Linear Programming
9.6 Systems of Inequalities and Linear Programming
 
Dm part03 neural-networks-handout
Dm part03 neural-networks-handoutDm part03 neural-networks-handout
Dm part03 neural-networks-handout
 

Recently uploaded

Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 

Recently uploaded (20)

Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 

svm-jain.ppt

  • 1. Nov 23rd, 2001 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines Modified from the slides by Dr. Andrew W. Moore http://www.cs.cmu.edu/~awm/tutorials
  • 2. Support Vector Machines: Slide 2 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 3. Support Vector Machines: Slide 3 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 4. Support Vector Machines: Slide 4 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 5. Support Vector Machines: Slide 5 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 6. Support Vector Machines: Slide 6 Copyright © 2001, 2003, Andrew W. Moore Linear Classifiers f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Any of these would be fine.. ..but which is best?
  • 7. Support Vector Machines: Slide 7 Copyright © 2001, 2003, Andrew W. Moore Classifier Margin f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
  • 8. Support Vector Machines: Slide 8 Copyright © 2001, 2003, Andrew W. Moore Maximum Margin f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Linear SVM
  • 9. Support Vector Machines: Slide 9 Copyright © 2001, 2003, Andrew W. Moore Maximum Margin f x a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against Linear SVM
  • 10. Support Vector Machines: Slide 10 Copyright © 2001, 2003, Andrew W. Moore Why Maximum Margin? denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against 1. Intuitively this feels safest. 2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification. 3. LOOCV is easy since the model is immune to removal of any non- support-vector datapoints. 4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. 5. Empirically it works very very well.
  • 11. Support Vector Machines: Slide 11 Copyright © 2001, 2003, Andrew W. Moore Estimate the Margin • What is the distance expression for a point x to a line wx+b= 0? denotes +1 denotes -1 x wx +b = 0 2 2 1 2 ( ) d i i b b d w         x w x w x w
  • 12. Support Vector Machines: Slide 12 Copyright © 2001, 2003, Andrew W. Moore Estimate the Margin • What is the expression for margin? denotes +1 denotes -1 wx +b = 0 2 1 margin min ( ) min d D D i i b d w         x x x w x Margin
  • 13. Support Vector Machines: Slide 13 Copyright © 2001, 2003, Andrew W. Moore Maximize Margin denotes +1 denotes -1 wx +b = 0 , , 2 , 1 argmax margin( , , ) = argmax min ( ) argmax min i i b i D b i d D b i i b D d b w        w x w x w w x x w Margin
  • 14. Support Vector Machines: Slide 14 Copyright © 2001, 2003, Andrew W. Moore Maximize Margin denotes +1 denotes -1 wx +b = 0   2 , 1 argmax min subject to : 0 i i d D b i i i i i b w D y b           x w x w x x w Margin • Min-max problem  game problem
  • 15. Support Vector Machines: Slide 15 Copyright © 2001, 2003, Andrew W. Moore Maximize Margin denotes +1 denotes -1 wx +b = 0   2 , 1 argmax min subject to : 0 i i d D b i i i i i b w D y b           x w x w x x w Margin Strategy: : 1 i i D b      x x w   2 1 , argmin subject to : 1 d i i b i i i w D y b        w x x w
  • 16. Support Vector Machines: Slide 16 Copyright © 2001, 2003, Andrew W. Moore Maximum Margin Linear Classifier • How to solve it?       * * 2 1 , 1 1 2 2 { , }= argmin subject to 1 1 .... 1 d k k w b N N w b w y w x b y w x b y w x b           
  • 17. Support Vector Machines: Slide 17 Copyright © 2001, 2003, Andrew W. Moore Learning via Quadratic Programming • QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
  • 18. Support Vector Machines: Slide 18 Copyright © 2001, 2003, Andrew W. Moore Quadratic Programming arg min 2 T T R c   u u u d u Find n m nm n n m m m m b u a u a u a b u a u a u a b u a u a u a             ... : ... ... 2 2 1 1 2 2 2 22 1 21 1 1 2 12 1 11 ) ( ) ( 2 2 ) ( 1 1 ) ( ) 2 ( ) 2 ( 2 2 ) 2 ( 1 1 ) 2 ( ) 1 ( ) 1 ( 2 2 ) 1 ( 1 1 ) 1 ( ... : ... ... e n m m e n e n e n n m m n n n n m m n n n b u a u a u a b u a u a u a b u a u a u a                         And subject to n additional linear inequality constraints e additional linear equality constraints Quadratic criterion Subject to
  • 19. Support Vector Machines: Slide 19 Copyright © 2001, 2003, Andrew W. Moore Quadratic Programming arg min 2 T T R c   u u u d u Find Subject to n m nm n n m m m m b u a u a u a b u a u a u a b u a u a u a             ... : ... ... 2 2 1 1 2 2 2 22 1 21 1 1 2 12 1 11 ) ( ) ( 2 2 ) ( 1 1 ) ( ) 2 ( ) 2 ( 2 2 ) 2 ( 1 1 ) 2 ( ) 1 ( ) 1 ( 2 2 ) 1 ( 1 1 ) 1 ( ... : ... ... e n m m e n e n e n n m m n n n n m m n n n b u a u a u a b u a u a u a b u a u a u a                         And subject to n additional linear inequality constraints e additional linear equality constraints Quadratic criterion
  • 20. Support Vector Machines: Slide 20 Copyright © 2001, 2003, Andrew W. Moore Quadratic Programming   * * 2 , { , }= min subject to 1 for all training data ( , ) i i w b i i i i w b w y w x b x y             * * , 1 1 2 2 { , }= argmax 0 0 1 1 inequality constraints .... 1 T w b N N w b w w w y w x b y w x b y w x b                    n I
  • 21. Support Vector Machines: Slide 21 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do?
  • 22. Support Vector Machines: Slide 22 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 1: Find minimum w.w, while minimizing number of training set errors. Problemette: Two things to minimize makes for an ill-defined optimization
  • 23. Support Vector Machines: Slide 23 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 1.1: Minimize w.w + C (#train errors) There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is? Tradeoff parameter
  • 24. Support Vector Machines: Slide 24 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 1.1: Minimize w.w + C (#train errors) There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is? Tradeoff parameter Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow. (Also, doesn’t distinguish between disastrous errors and near misses)
  • 25. Support Vector Machines: Slide 25 Copyright © 2001, 2003, Andrew W. Moore Uh-oh! denotes +1 denotes -1 This is going to be a problem! What should we do? Idea 2.0: Minimize w.w + C (distance of error points to their correct place)
  • 26. Support Vector Machines: Slide 26 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machine (SVM) for Noisy Data • Any problem with the above formulism?       d * * 2 1 1 , , 1 1 1 2 2 2 { , }= min 1 1 ... 1 N i j i j w b N N N w b w c y w x b y w x b y w x b                       denotes +1 denotes -1 1  2  3 
  • 27. Support Vector Machines: Slide 27 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machine (SVM) for Noisy Data • Balance the trade off between margin and classification errors       d * * 2 1 1 , , 1 1 1 1 2 2 2 2 { , }= min 1 , 0 1 , 0 ... 1 , 0 N i j i j w b N N N N w b w c y w x b y w x b y w x b                             denotes +1 denotes -1 1  2  3 
  • 28. Support Vector Machines: Slide 28 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machine for Noisy Data       * * 2 1 , , 1 1 1 1 2 2 2 2 { , }= argmin 1 , 0 1 , 0 inequality constraints .... 1 , 0 N i j i j w b N N N N w b w c y w x b y w x b y w x b                                   How do we determine the appropriate value for c ?
  • 29. Support Vector Machines: Slide 29 Copyright © 2001, 2003, Andrew W. Moore The Dual Form of QP Maximize       R k R l kl l k R k k Q α α α 1 1 1 2 1 where ( ) kl k l k l Q y y   x x Subject to these constraints: k C αk    0 Then define:    R k k k k y α 1 x w Then classify with: f(x,w,b) = sign(w. x - b) 0 1    R k k k y α
  • 30. Support Vector Machines: Slide 30 Copyright © 2001, 2003, Andrew W. Moore The Dual Form of QP Maximize       R k R l kl l k R k k Q α α α 1 1 1 2 1 where ( ) kl k l k l Q y y   x x Subject to these constraints: k C αk    0 Then define:    R k k k k y α 1 x w 0 1    R k k k y α
  • 31. Support Vector Machines: Slide 31 Copyright © 2001, 2003, Andrew W. Moore An Equivalent QP Maximize where ) . ( l k l k kl y y Q x x  Subject to these constraints: k C αk    0 Then define:    R k k k k y α 1 x w 0 1    R k k k y α Datapoints with ak > 0 will be the support vectors ..so this sum only needs to be over the support vectors.       R k R l kl l k R k k Q α α α 1 1 1 2 1
  • 32. Support Vector Machines: Slide 32 Copyright © 2001, 2003, Andrew W. Moore Support Vectors denotes +1 denotes -1 1 w x b    1 w x b     w Support Vectors Decision boundary is determined only by those support vectors !    R k k k k y α 1 x w       : 1 0 i i i i i y w x b a        ai = 0 for non-support vectors ai  0 for support vectors
  • 33. Support Vector Machines: Slide 33 Copyright © 2001, 2003, Andrew W. Moore The Dual Form of QP Maximize       R k R l kl l k R k k Q α α α 1 1 1 2 1 where ( ) kl k l k l Q y y   x x Subject to these constraints: k C αk    0 Then define:    R k k k k y α 1 x w Then classify with: f(x,w,b) = sign(w. x - b) 0 1    R k k k y α How to determine b ?
  • 34. Support Vector Machines: Slide 34 Copyright © 2001, 2003, Andrew W. Moore An Equivalent QP: Determine b A linear programming problem !       * * 2 1 , 1 1 1 1 2 2 2 2 { , }= argmin 1 , 0 1 , 0 .... 1 , 0 N i j i j w b N N N N w b w c y w x b y w x b y w x b                                   1 * 1 , 1 1 1 1 2 2 2 2 = argmin 1 , 0 1 , 0 .... 1 , 0 N i i N j j b N N N N b y w x b y w x b y w x b                           Fix w
  • 35. Support Vector Machines: Slide 35 Copyright © 2001, 2003, Andrew W. Moore       R k R l kl l k R k k Q α α α 1 1 1 2 1 An Equivalent QP Maximize where ) . ( l k l k kl y y Q x x  Subject to these constraints: k C αk    0 Then define:    R k k k k y α 1 x w k k K K K K α K ε y b max arg where . ) 1 (     w x Then classify with: f(x,w,b) = sign(w. x - b) 0 1    R k k k y α Datapoints with ak > 0 will be the support vectors ..so this sum only needs to be over the support vectors. Why did I tell you about this equivalent QP? • It’s a formulation that QP packages can optimize more quickly • Because of further jaw- dropping developments you’re about to learn.
  • 36. Support Vector Machines: Slide 36 Copyright © 2001, 2003, Andrew W. Moore Suppose we’re in 1-dimension What would SVMs do with this data? x=0
  • 37. Support Vector Machines: Slide 37 Copyright © 2001, 2003, Andrew W. Moore Suppose we’re in 1-dimension Not a big surprise Positive “plane” Negative “plane” x=0
  • 38. Support Vector Machines: Slide 38 Copyright © 2001, 2003, Andrew W. Moore Harder 1-dimensional dataset That’s wiped the smirk off SVM’s face. What can be done about this? x=0
  • 39. Support Vector Machines: Slide 39 Copyright © 2001, 2003, Andrew W. Moore Harder 1-dimensional dataset: remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too: $z_k = (x_k,\ x_k^2)$
  • 40. Support Vector Machines: Slide 40 Copyright © 2001, 2003, Andrew W. Moore Harder 1-dimensional dataset: remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too: $z_k = (x_k,\ x_k^2)$
  • 41. Support Vector Machines: Slide 41 Copyright © 2001, 2003, Andrew W. Moore Common SVM basis functions: $z_k$ = (polynomial terms of $x_k$ of degree 1 to $q$); $z_k$ = (radial basis functions of $x_k$), $z_k[j] = \phi_j(x_k) = \exp\bigl(-|x_k - c_j|^2/\sigma^2\bigr)$; $z_k$ = (sigmoid functions of $x_k$). This is sensible. Is that the end of the story? No... there's one more trick!
  • 42. Support Vector Machines: Slide 42 Copyright © 2001, 2003, Andrew W. Moore Quadratic Basis Functions: $\Phi(x) = \bigl(1,\ \sqrt{2}x_1, \sqrt{2}x_2, \ldots, \sqrt{2}x_m,\ x_1^2, x_2^2, \ldots, x_m^2,\ \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \ldots, \sqrt{2}x_1x_m, \sqrt{2}x_2x_3, \ldots, \sqrt{2}x_{m-1}x_m\bigr)$, i.e. the constant term, the linear terms, the pure quadratic terms and the quadratic cross-terms. Number of terms (assuming $m$ input dimensions) = $(m+2)$-choose-2 = $(m+2)(m+1)/2$ = (as near as makes no difference) $m^2/2$. You may be wondering what those $\sqrt{2}$'s are doing. • You should be happy that they do no harm • You'll find out why they're there soon.
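  A minimal sketch of this feature map in code (the function name is mine, not the slides'); it builds the explicit $\Phi(x)$ with the $\sqrt{2}$ factors:

```python
# Minimal sketch (not from the slides): the explicit quadratic feature map.
import numpy as np

def quadratic_features(x):
    x = np.asarray(x, dtype=float)
    m = x.shape[0]
    const = [1.0]
    linear = list(np.sqrt(2.0) * x)                    # sqrt(2) x_i
    squares = list(x ** 2)                             # x_i^2
    cross = [np.sqrt(2.0) * x[i] * x[j]                # sqrt(2) x_i x_j for i < j
             for i in range(m) for j in range(i + 1, m)]
    return np.array(const + linear + squares + cross)  # (m+2)(m+1)/2 terms
```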
  • 43. Support Vector Machines: Slide 43 Copyright © 2001, 2003, Andrew W. Moore QP (old): Maximize $\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,(x_k\cdot x_l)$, subject to these constraints: $0 \le \alpha_k \le C$ for all $k$, and $\sum_{k=1}^{R}\alpha_k y_k = 0$. Then define $w = \sum_{k=1}^{R}\alpha_k y_k x_k$. Then classify with: $f(x,w,b) = \operatorname{sign}(w\cdot x - b)$.
  • 44. Support Vector Machines: Slide 44 Copyright © 2001, 2003, Andrew W. Moore QP with basis functions: Maximize $\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,\bigl(\Phi(x_k)\cdot\Phi(x_l)\bigr)$, subject to these constraints: $0 \le \alpha_k \le C$ for all $k$, and $\sum_{k=1}^{R}\alpha_k y_k = 0$. Then define $w = \sum_{k\,\text{s.t.}\,\alpha_k>0}\alpha_k y_k \Phi(x_k)$. Then classify with: $f(x,w,b) = \operatorname{sign}(w\cdot\Phi(x) - b)$. Most important change: $x \rightarrow \Phi(x)$.
  • 45. Support Vector Machines: Slide 45 Copyright © 2001, 2003, Andrew W. Moore QP with basis functions: Maximize $\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,\bigl(\Phi(x_k)\cdot\Phi(x_l)\bigr)$, subject to these constraints: $0 \le \alpha_k \le C$ for all $k$, and $\sum_{k=1}^{R}\alpha_k y_k = 0$. Then define $w = \sum_{k\,\text{s.t.}\,\alpha_k>0}\alpha_k y_k \Phi(x_k)$. Then classify with: $f(x,w,b) = \operatorname{sign}(w\cdot\Phi(x) - b)$. We must do $R^2/2$ dot products to get this matrix ready. Each dot product requires $m^2/2$ additions and multiplications. The whole thing costs $R^2 m^2/4$. Yeeks! ...or does it?
  • 46. Support Vector Machines: Slide 46 Copyright © 2001, 2003, Andrew W. Moore Quadratic Dot Products: $\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m}a_i b_i + \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j$ (constant term + linear terms + pure quadratic terms + quadratic cross-terms).
  • 47. Support Vector Machines: Slide 47 Copyright © 2001, 2003, Andrew W. Moore Quadratic Dot Products: $\Phi(a)\cdot\Phi(b) = 1 + 2\sum_{i=1}^{m}a_i b_i + \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j$. Just out of casual, innocent interest, let's look at another function of $a$ and $b$: $(a\cdot b + 1)^2 = (a\cdot b)^2 + 2(a\cdot b) + 1 = \bigl(\sum_{i=1}^{m}a_i b_i\bigr)^2 + 2\sum_{i=1}^{m}a_i b_i + 1 = \sum_{i=1}^{m}\sum_{j=1}^{m}a_i b_i a_j b_j + 2\sum_{i=1}^{m}a_i b_i + 1 = \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j + 2\sum_{i=1}^{m}a_i b_i + 1$.
  • 48. Support Vector Machines: Slide 48 Copyright © 2001, 2003, Andrew W. Moore Quadratic Dot Products: $\Phi(a)\cdot\Phi(b) = 1 + 2\sum_i a_i b_i + \sum_i a_i^2 b_i^2 + 2\sum_i\sum_{j>i}a_i a_j b_i b_j$ and $(a\cdot b + 1)^2 = \sum_i a_i^2 b_i^2 + 2\sum_i\sum_{j>i}a_i a_j b_i b_j + 2\sum_i a_i b_i + 1$. They're the same! And this is only $O(m)$ to compute!
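  A quick numerical sanity check of that identity, reusing the quadratic_features sketch above (again, not from the slides):

```python
# Minimal sketch: verify Phi(a).Phi(b) == (a.b + 1)^2 on random vectors,
# using the quadratic_features function defined earlier.
import numpy as np

rng = np.random.RandomState(0)
a, b = rng.randn(5), rng.randn(5)
lhs = quadratic_features(a) @ quadratic_features(b)  # O(m^2) explicit features
rhs = (a @ b + 1.0) ** 2                             # O(m) kernel evaluation
print(np.isclose(lhs, rhs))                          # True
```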
  • 49. Support Vector Machines: Slide 49 Copyright © 2001, 2003, Andrew W. Moore QP with Quintic basis functions: Maximize $\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,\bigl(\Phi(x_k)\cdot\Phi(x_l)\bigr)$, subject to these constraints: $0 \le \alpha_k \le C$ for all $k$, and $\sum_{k=1}^{R}\alpha_k y_k = 0$. Then define $w = \sum_{k\,\text{s.t.}\,\alpha_k>0}\alpha_k y_k \Phi(x_k)$. Then classify with: $f(x,w,b) = \operatorname{sign}(w\cdot\Phi(x) - b)$.
  • 50. Support Vector Machines: Slide 50 Copyright © 2001, 2003, Andrew W. Moore QP with Quadratic basis functions: Maximize $\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,K(x_k, x_l)$, subject to these constraints: $0 \le \alpha_k \le C$ for all $k$, and $\sum_{k=1}^{R}\alpha_k y_k = 0$. Then define $w = \sum_{k\,\text{s.t.}\,\alpha_k>0}\alpha_k y_k \Phi(x_k)$ and $b = y_K(1-\varepsilon_K) - \Phi(x_K)\cdot w$ where $K = \operatorname{argmax}_k \alpha_k$. Then classify with: $f(x,w,b) = \operatorname{sign}(w\cdot\Phi(x) - b) = \operatorname{sign}\bigl(\sum_k \alpha_k y_k K(x_k, x) - b\bigr)$, so $\Phi$ never has to be evaluated explicitly. Most important change: $\Phi(x_k)\cdot\Phi(x_l) = K(x_k, x_l)$.
  • 51. Support Vector Machines: Slide 51 Copyright © 2001, 2003, Andrew W. Moore Higher Order Polynomials:
    Quadratic: $\phi(x)$ = all $m^2/2$ terms up to degree 2; cost to build the $Q_{kl}$ matrix traditionally: $m^2R^2/4$ (2,500 $R^2$ for 100 inputs); $\phi(a)\cdot\phi(b) = (a\cdot b+1)^2$; cost to build $Q_{kl}$ sneakily: $mR^2/2$ (50 $R^2$ for 100 inputs)
    Cubic: $\phi(x)$ = all $m^3/6$ terms up to degree 3; cost traditionally: $m^3R^2/12$ (83,000 $R^2$ for 100 inputs); $\phi(a)\cdot\phi(b) = (a\cdot b+1)^3$; cost sneakily: $mR^2/2$ (50 $R^2$ for 100 inputs)
    Quartic: $\phi(x)$ = all $m^4/24$ terms up to degree 4; cost traditionally: $m^4R^2/48$ (1,960,000 $R^2$ for 100 inputs); $\phi(a)\cdot\phi(b) = (a\cdot b+1)^4$; cost sneakily: $mR^2/2$ (50 $R^2$ for 100 inputs)
  • 52. Support Vector Machines: Slide 52 Copyright © 2001, 2003, Andrew W. Moore SVM Kernel Functions • $K(a,b) = (a\cdot b + 1)^d$ is an example of an SVM Kernel Function • Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function • Radial-Basis-style Kernel Function: $K(a,b) = \exp\bigl(-\frac{(a-b)^2}{2\sigma^2}\bigr)$ • Neural-net-style Kernel Function: $K(a,b) = \tanh(\kappa\, a\cdot b - \delta)$
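  These three kernel families are easy to write down directly; a minimal numpy sketch (function names and default parameter values are mine, not the slides'):

```python
# Minimal sketch: the three kernel families above as plain functions of two vectors.
import numpy as np

def poly_kernel(a, b, d=2):
    return (np.dot(a, b) + 1.0) ** d                        # (a.b + 1)^d

def rbf_kernel(a, b, sigma=1.0):
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))   # exp(-|a-b|^2 / (2 sigma^2))

def tanh_kernel(a, b, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(a, b) - delta)            # tanh(kappa a.b - delta)
```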
  • 53. Support Vector Machines: Slide 53 Copyright © 2001, 2003, Andrew W. Moore Kernel Tricks • Replacing the dot product with a kernel function • Not all functions are kernel functions • Need to be decomposable: $K(a,b) = \phi(a)\cdot\phi(b)$ • Could $K(a,b) = (a-b)^3$ be a kernel function? • Could $K(a,b) = (a-b)^4 - (a+b)^2$ be a kernel function?
  • 54. Support Vector Machines: Slide 54 Copyright © 2001, 2003, Andrew W. Moore Kernel Tricks • Mercer's condition: to expand a kernel function $K(x,y)$ into a dot product, i.e. $K(x,y) = \phi(x)\cdot\phi(y)$, $K(x,y)$ has to be a positive semi-definite function, i.e., for any function $f(x)$ whose $\int f(x)^2\,dx$ is finite, the following inequality holds: $\iint f(x)\,K(x,y)\,f(y)\,dx\,dy \ge 0$ • Could $K(x,y) = \sum_i x_i^{p_i} y_i^{p_i}$ be a kernel function?
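  Mercer's condition implies a quick necessary check: every Gram matrix built from a valid kernel must be symmetric and positive semi-definite. A minimal sketch (not from the slides) that applies this check to candidates from the previous slide, on scalar inputs:

```python
# Minimal sketch: a negative Gram-matrix eigenvalue proves a function is NOT a
# valid kernel; non-negative eigenvalues on one sample are merely consistent.
import numpy as np

def min_gram_eigenvalue(kernel, xs):
    G = np.array([[kernel(a, b) for b in xs] for a in xs])
    assert np.allclose(G, G.T)              # a kernel's Gram matrix must be symmetric
    return np.linalg.eigvalsh(G).min()

xs = np.linspace(-2.0, 2.0, 25)             # scalar inputs, as in the 1-d examples above
print(min_gram_eigenvalue(lambda a, b: (a * b + 1.0) ** 2, xs))           # >= 0
print(min_gram_eigenvalue(lambda a, b: (a - b) ** 4 - (a + b) ** 2, xs))  # < 0, so not a kernel
```

  For the first candidate, $(a-b)^3$, scalar inputs already give $K(a,b) = -K(b,a)$, so it fails the symmetry requirement before the eigenvalue check is even needed.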
  • 55. Support Vector Machines: Slide 55 Copyright © 2001, 2003, Andrew W. Moore Kernel Tricks • Pro • Introduces nonlinearity into the model • Computationally cheap • Con • Still has potential overfitting problems
  • 56. Support Vector Machines: Slide 56 Copyright © 2001, 2003, Andrew W. Moore Nonlinear Kernel (I)
  • 57. Support Vector Machines: Slide 57 Copyright © 2001, 2003, Andrew W. Moore Nonlinear Kernel (II)
  • 58. Support Vector Machines: Slide 58 Copyright © 2001, 2003, Andrew W. Moore Kernelize Logistic Regression: how can we introduce nonlinearity into logistic regression? $p(y\,|\,x) = \frac{1}{1+\exp(-y\, x\cdot w)}$, $l_{reg}(w) = \frac{1}{N}\sum_{i=1}^{N}\log\bigl(1+\exp(-y_i\, x_i\cdot w)\bigr) + c\sum_k w_k^2$
  • 59. Support Vector Machines: Slide 59 Copyright © 2001, 2003, Andrew W. Moore Kernelize Logistic Regression • Representation Theorem: write $w = \sum_{i=1}^{N}\alpha_i\,\phi(x_i)$ and map $x \rightarrow \phi(x)$, so that $w\cdot\phi(x) = \sum_{i=1}^{N}\alpha_i\,K(x_i, x)$. Then $p(y\,|\,x) = \frac{1}{1+\exp\bigl(-y\sum_{i}\alpha_i K(x_i, x)\bigr)}$ and $l_{reg}(\alpha) = \frac{1}{N}\sum_{i=1}^{N}\log\Bigl(1+\exp\bigl(-y_i\sum_{j}\alpha_j K(x_j, x_i)\bigr)\Bigr) + c\sum_{i,j=1}^{N}\alpha_i\alpha_j K(x_i, x_j)$
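  A minimal sketch of fitting those $\alpha$'s by plain gradient descent on $l_{reg}$ (the function names, learning rate and iteration count are my illustrative choices, not part of the slides):

```python
# Minimal sketch: kernel logistic regression on the dual coefficients alpha.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_kernel_logreg(K, y, c=0.1, lr=0.1, n_iter=500):
    # K: (N, N) Gram matrix with K[i, j] = K(x_i, x_j); y: labels in {-1, +1}.
    N = K.shape[0]
    alpha = np.zeros(N)
    for _ in range(n_iter):
        s = K @ alpha                              # s_i = sum_j alpha_j K(x_j, x_i)
        grad = -(K @ (y * sigmoid(-y * s))) / N    # gradient of the average log-loss
        grad += 2.0 * c * (K @ alpha)              # gradient of c * alpha^T K alpha
        alpha -= lr * grad
    return alpha

def predict_proba(K_new, alpha):
    # K_new: (M, N) with K_new[i, j] = K(x_j, x_new_i), training vs. new points.
    return sigmoid(K_new @ alpha)
```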
  • 60. Support Vector Machines: Slide 60 Copyright © 2001, 2003, Andrew W. Moore Overfitting in SVM [figure: training error and testing error (classification error, y-axis) versus polynomial kernel degree 1 to 5 (x-axis) on the Breast Cancer and Ionosphere datasets]
  • 61. Support Vector Machines: Slide 61 Copyright © 2001, 2003, Andrew W. Moore SVM Performance • Anecdotally they work very very well indeed. • Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark • Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly. • There is a lot of excitement and religious fervor about SVMs as of 2001. • Despite this, some practitioners are a little skeptical.
  • 62. Support Vector Machines: Slide 62 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel • A kernel function describes the correlation or similarity between two data points • Suppose we have a function s(x,y) that describes the similarity between two data points, and assume that it is a non-negative and symmetric function. How can we generate a kernel function based on this similarity function? • A graph theory approach ...
  • 63. Support Vector Machines: Slide 63 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel • Create a graph for the data points • Each vertex corresponds to a data point • The weight of each edge is the similarity s(x,y) • Graph Laplacian: $L_{i,j} = s(x_i, x_j)$ if $i \neq j$, and $L_{i,j} = -\sum_{k\neq i} s(x_i, x_k)$ if $i = j$ • Properties of the Laplacian • Negative semi-definite
  • 64. Support Vector Machines: Slide 64 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel • Consider a simple Laplacian: $L_{i,j} = 1$ if $i \neq j$ and $x_i$ and $x_j$ are connected, and $L_{i,j} = -\sum_{x_k \in N(x_i)} 1$ if $i = j$ (minus the number of neighbours of $x_i$) • Consider $L^2, L^4, \ldots$ • What do these matrices represent? • A diffusion kernel
  • 65. Support Vector Machines: Slide 65 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel • Consider a simple Laplacian: $L_{i,j} = 1$ if $i \neq j$ and $x_i$ and $x_j$ are connected, and $L_{i,j} = -\sum_{x_k \in N(x_i)} 1$ if $i = j$ (minus the number of neighbours of $x_i$) • Consider $L^2, L^4, \ldots$ • What do these matrices represent? • A diffusion kernel
  • 66. Support Vector Machines: Slide 66 Copyright © 2001, 2003, Andrew W. Moore Diffusion Kernel: Properties • Positive definite • Local relationships L induce global relationships • Works for undirected weighted graphs with similarities ($s(x_i, x_j) = s(x_j, x_i)$) • How to compute the diffusion kernel $K = e^{\beta L}$ (it satisfies $\frac{dK}{d\beta} = LK$), in particular $e^{L}$?
  • 67. Support Vector Machines: Slide 67 Copyright © 2001, 2003, Andrew W. Moore Computing Diffusion Kernel • Singular value decomposition of the Laplacian L: $L = U\Sigma U^T = \sum_{i=1}^{m}\lambda_i u_i u_i^T$, with $U = (u_1, u_2, \ldots, u_m)$, $\Sigma = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ and $u_i^T u_j = \delta_{i,j}$ • What is $L^2$? $L^2 = \bigl(\sum_{i=1}^{m}\lambda_i u_i u_i^T\bigr)^2 = \sum_{i,j=1}^{m}\lambda_i\lambda_j u_i (u_i^T u_j) u_j^T = \sum_{i=1}^{m}\lambda_i^2 u_i u_i^T$
  • 68. Support Vector Machines: Slide 68 Copyright © 2001, 2003, Andrew W. Moore Computing Diffusion Kernel • What about $L^n$? $L^n = \sum_{i=1}^{m}\lambda_i^n u_i u_i^T$ • Compute the diffusion kernel $e^{L} = \sum_{n=0}^{\infty}\frac{L^n}{n!} = \sum_{i=1}^{m}\Bigl(\sum_{n=0}^{\infty}\frac{\lambda_i^n}{n!}\Bigr) u_i u_i^T = \sum_{i=1}^{m} e^{\lambda_i}\, u_i u_i^T$
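  A minimal sketch of this computation in numpy, starting from a symmetric non-negative similarity matrix with zero diagonal (the function name and the $\beta$ scale parameter are my additions):

```python
# Minimal sketch: diffusion kernel e^{beta L} from a similarity matrix S.
import numpy as np

def diffusion_kernel(S, beta=1.0):
    S = np.asarray(S, dtype=float)          # S[i, j] = s(x_i, x_j), symmetric, zero diagonal
    L = S - np.diag(S.sum(axis=1))          # L_ij = s(x_i, x_j) for i != j, -sum_k s(x_i, x_k) on the diagonal
    lam, U = np.linalg.eigh(L)              # L = U diag(lam) U^T since L is symmetric
    return (U * np.exp(beta * lam)) @ U.T   # e^{beta L} = sum_i e^{beta lam_i} u_i u_i^T
```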
  • 69. Support Vector Machines: Slide 69 Copyright © 2001, 2003, Andrew W. Moore Doing multi-class classification • SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). • What can be done? • Answer: with output arity N, learn N SVMs • SVM 1 learns "Output==1" vs "Output != 1" • SVM 2 learns "Output==2" vs "Output != 2" • : • SVM N learns "Output==N" vs "Output != N" • Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
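  A minimal sketch of that recipe using scikit-learn's SVC as the underlying two-class learner (the helper names are mine; SVC's decision_function plays the role of "how far into the positive region"):

```python
# Minimal sketch: one-vs-rest multi-class classification with N binary SVMs.
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit(X, y, classes, **svc_kwargs):
    models = {}
    for c in classes:
        yc = np.where(y == c, 1, -1)                 # "Output == c" vs "Output != c"
        models[c] = SVC(kernel="rbf", **svc_kwargs).fit(X, yc)
    return models

def one_vs_rest_predict(models, X):
    labels = list(models.keys())
    scores = np.column_stack([models[c].decision_function(X) for c in labels])
    return np.array([labels[i] for i in scores.argmax(axis=1)])  # furthest into the positive region
```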
  • 70. Support Vector Machines: Slide 70 Copyright © 2001, 2003, Andrew W. Moore Ranking Problem • Consider a problem of ranking essays • Three ranking categories: good, ok, bad • Given an input document, predict its ranking category • How should we formulate this problem? • A simple multiple class solution • Each ranking category is an independent class • But, there is something missing here ... • We miss the ordinal relationship between classes!
  • 71. Support Vector Machines: Slide 71 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • Which choice is better? • How could we formulate this problem? [figure: 'good', 'OK' and 'bad' regions with two candidate weight vectors w and w']
  • 72. Support Vector Machines: Slide 72 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • What are the two decision boundaries? $w\cdot x + b_1 = 0$ and $w\cdot x + b_2 = 0$ • What is the margin for ordinal regression? $\text{margin}_1(w, b_1) = \min_{x\in D_g\cup D_o} d(x) = \min_{x\in D_g\cup D_o}\frac{|w\cdot x + b_1|}{\sqrt{\sum_i w_i^2}}$, $\text{margin}_2(w, b_2) = \min_{x\in D_o\cup D_b} d(x) = \min_{x\in D_o\cup D_b}\frac{|w\cdot x + b_2|}{\sqrt{\sum_i w_i^2}}$, $\text{margin}(w, b_1, b_2) = \min\bigl(\text{margin}_1(w, b_1), \text{margin}_2(w, b_2)\bigr)$ • Maximize margin: $\{w^*, b_1^*, b_2^*\} = \operatorname{argmax}_{w, b_1, b_2}\ \text{margin}(w, b_1, b_2)$
  • 73. Support Vector Machines: Slide 73 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • What are the two decision boundaries? $w\cdot x + b_1 = 0$ and $w\cdot x + b_2 = 0$ • What is the margin for ordinal regression? $\text{margin}_1(w, b_1) = \min_{x\in D_g\cup D_o} d(x) = \min_{x\in D_g\cup D_o}\frac{|w\cdot x + b_1|}{\sqrt{\sum_i w_i^2}}$, $\text{margin}_2(w, b_2) = \min_{x\in D_o\cup D_b} d(x) = \min_{x\in D_o\cup D_b}\frac{|w\cdot x + b_2|}{\sqrt{\sum_i w_i^2}}$, $\text{margin}(w, b_1, b_2) = \min\bigl(\text{margin}_1(w, b_1), \text{margin}_2(w, b_2)\bigr)$ • Maximize margin: $\{w^*, b_1^*, b_2^*\} = \operatorname{argmax}_{w, b_1, b_2}\ \text{margin}(w, b_1, b_2)$
  • 74. Support Vector Machines: Slide 74 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • What are the two decision boundaries? $w\cdot x + b_1 = 0$ and $w\cdot x + b_2 = 0$ • What is the margin for ordinal regression? $\text{margin}_1(w, b_1) = \min_{x\in D_g\cup D_o} d(x) = \min_{x\in D_g\cup D_o}\frac{|w\cdot x + b_1|}{\sqrt{\sum_i w_i^2}}$, $\text{margin}_2(w, b_2) = \min_{x\in D_o\cup D_b} d(x) = \min_{x\in D_o\cup D_b}\frac{|w\cdot x + b_2|}{\sqrt{\sum_i w_i^2}}$, $\text{margin}(w, b_1, b_2) = \min\bigl(\text{margin}_1(w, b_1), \text{margin}_2(w, b_2)\bigr)$ • Maximize margin: $\{w^*, b_1^*, b_2^*\} = \operatorname{argmax}_{w, b_1, b_2}\ \text{margin}(w, b_1, b_2)$
  • 75. Support Vector Machines: Slide 75 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • How do we solve this monster? $\{w^*, b_1^*, b_2^*\} = \operatorname{argmax}_{w,b_1,b_2}\ \text{margin}(w,b_1,b_2) = \operatorname{argmax}_{w,b_1,b_2}\ \min\bigl(\text{margin}_1(w,b_1), \text{margin}_2(w,b_2)\bigr) = \operatorname{argmax}_{w,b_1,b_2}\ \min\Bigl(\min_{x_i\in D_g\cup D_o}\frac{|w\cdot x_i + b_1|}{\sqrt{\sum_d w_d^2}},\ \min_{x_i\in D_o\cup D_b}\frac{|w\cdot x_i + b_2|}{\sqrt{\sum_d w_d^2}}\Bigr)$ subject to: $x_i\in D_g:\ w\cdot x_i + b_1 \ge 0$; $x_i\in D_o:\ w\cdot x_i + b_1 \le 0,\ w\cdot x_i + b_2 \ge 0$; $x_i\in D_b:\ w\cdot x_i + b_2 \le 0$
  • 76. Support Vector Machines: Slide 76 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • The same old trick • To remove the scaling invariance, set $\min_{x_i\in D_g\cup D_o}|w\cdot x_i + b_1| = 1$ and $\min_{x_i\in D_o\cup D_b}|w\cdot x_i + b_2| = 1$ • Now the problem is simplified as: $\{w^*, b_1^*, b_2^*\} = \operatorname{argmin}_{w,b_1,b_2}\sum_i w_i^2$ subject to: $x_i\in D_g:\ w\cdot x_i + b_1 \ge 1$; $x_i\in D_o:\ w\cdot x_i + b_1 \le -1,\ w\cdot x_i + b_2 \ge 1$; $x_i\in D_b:\ w\cdot x_i + b_2 \le -1$
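  Once $w$, $b_1$ and $b_2$ are learned, ranking a new document is just a matter of checking which side of the two parallel boundaries it falls on; a minimal sketch (function name mine, not the slides'):

```python
# Minimal sketch: predict the ranking category from learned w, b1, b2
# (the constraints above imply b2 > b1, so the two thresholds are ordered).
import numpy as np

def rank(x, w, b1, b2):
    s = np.dot(w, x)
    if s + b1 > 0:
        return "good"   # on the positive side of the good/OK boundary
    elif s + b2 > 0:
        return "OK"     # between the two boundaries
    else:
        return "bad"    # on the negative side of the OK/bad boundary
```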
  • 77. Support Vector Machines: Slide 77 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression • Noisy case: $\{w^*, b_1^*, b_2^*\} = \operatorname{argmin}_{w,b_1,b_2}\sum_i w_i^2 + c\sum_{x_i\in D_g}\varepsilon_i + c\sum_{x_i\in D_o}(\varepsilon_i + \varepsilon_i') + c\sum_{x_i\in D_b}\varepsilon_i$ subject to: $x_i\in D_g:\ w\cdot x_i + b_1 \ge 1-\varepsilon_i,\ \varepsilon_i \ge 0$; $x_i\in D_o:\ w\cdot x_i + b_1 \le -1+\varepsilon_i,\ w\cdot x_i + b_2 \ge 1-\varepsilon_i',\ \varepsilon_i \ge 0,\ \varepsilon_i' \ge 0$; $x_i\in D_b:\ w\cdot x_i + b_2 \le -1+\varepsilon_i,\ \varepsilon_i \ge 0$ • Is this sufficient?
  • 78. Support Vector Machines: Slide 78 Copyright © 2001, 2003, Andrew W. Moore Ordinal Regression [figure: 'good', 'OK' and 'bad' regions separated by the two parallel boundaries along w]: $\{w^*, b_1^*, b_2^*\} = \operatorname{argmin}_{w,b_1,b_2}\sum_i w_i^2 + c\sum_{x_i\in D_g}\varepsilon_i + c\sum_{x_i\in D_o}(\varepsilon_i + \varepsilon_i') + c\sum_{x_i\in D_b}\varepsilon_i$ subject to: $x_i\in D_g:\ w\cdot x_i + b_1 \ge 1-\varepsilon_i,\ \varepsilon_i \ge 0$; $x_i\in D_o:\ w\cdot x_i + b_1 \le -1+\varepsilon_i,\ w\cdot x_i + b_2 \ge 1-\varepsilon_i',\ \varepsilon_i \ge 0,\ \varepsilon_i' \ge 0$; $x_i\in D_b:\ w\cdot x_i + b_2 \le -1+\varepsilon_i,\ \varepsilon_i \ge 0$; and $b_1 \le b_2$
  • 79. Support Vector Machines: Slide 79 Copyright © 2001, 2003, Andrew W. Moore References • An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html • The VC/SRM/SVM Bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998. • Software: SVM-light, http://svmlight.joachims.org/, free download