- 1. Support Vector Machines
Nov 23rd, 2001
Modified from the slides by Dr. Andrew W. Moore
http://www.cs.cmu.edu/~awm/tutorials
Copyright © 2001, 2003, Andrew W. Moore
- 2. Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: input x feeds the classifier f(x, w, b) to produce an estimate y_est; a scatter of datapoints labeled +1 and −1, with one candidate separating line]
How would you classify this data?
- 3. Linear Classifiers
[Figure: the same +1/−1 data with a different candidate separating line]
How would you classify this data?
- 4. Linear Classifiers
[Figure: the same data, yet another candidate separating line]
How would you classify this data?
- 5. Linear Classifiers
[Figure: the same data, yet another candidate separating line]
How would you classify this data?
- 6. Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: several candidate separating lines drawn through the same data]
Any of these would be fine... but which is best?
- 7. Classifier Margin
f(x, w, b) = sign(w · x − b)
[Figure: a separating line with its margin shown as a band around it]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
- 8. Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the widest-margin separating line, labeled "Linear SVM"]
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM).
- 9. Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the linear SVM, with the support vectors highlighted on the margin]
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM). Support vectors are those datapoints that the margin pushes up against.
- 10. Why Maximum Margin?
[Figure: the maximum margin classifier and its support vectors, as on the previous slide]
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV is easy, since the model is immune to removal of any non-support-vector datapoints.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very, very well.
- 11. Estimate the Margin
• What is the distance expression for a point x to a line wx + b = 0?
[Figure: a point x and the plane w · x + b = 0]
$$d(\mathbf{x}) = \frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\|\mathbf{w}\|_2} = \frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
- 12. Estimate the Margin
• What is the expression for the margin?
[Figure: the band between the boundary and the nearest datapoints, labeled "Margin"]
$$\text{margin} = \min_{\mathbf{x}\in D} d(\mathbf{x}) = \min_{\mathbf{x}\in D}\frac{|\mathbf{w}\cdot\mathbf{x} + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
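A minimal numpy sketch of these two formulas (illustrative code, not part of the original deck):

```python
import numpy as np

def distance_to_plane(x, w, b):
    # d(x) = |w.x + b| / ||w||_2
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

def margin(X, w, b):
    # margin = smallest distance from any datapoint to the boundary
    return min(distance_to_plane(x, w, b) for x in X)

# Two points against the line x1 + x2 - 1 = 0:
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
print(margin(X, np.array([1.0, 1.0]), -1.0))  # 3/sqrt(2), about 2.12
```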
- 13. Maximize Margin
[Figure: the boundary w · x + b = 0 and its margin]
$$\underset{\mathbf{w},b}{\operatorname{argmax}}\ \text{margin}(\mathbf{w},b,D) = \underset{\mathbf{w},b}{\operatorname{argmax}}\ \min_{\mathbf{x}_i\in D} d(\mathbf{x}_i) = \underset{\mathbf{w},b}{\operatorname{argmax}}\ \min_{\mathbf{x}_i\in D}\frac{|\mathbf{w}\cdot\mathbf{x}_i + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$$
- 14. Maximize Margin
$$\underset{\mathbf{w},b}{\operatorname{argmax}}\ \min_{\mathbf{x}_i\in D}\frac{|\mathbf{w}\cdot\mathbf{x}_i + b|}{\sqrt{\sum_{i=1}^{d} w_i^2}} \quad \text{subject to } \forall \mathbf{x}_i \in D:\ y_i(\mathbf{x}_i\cdot\mathbf{w} + b) \ge 0$$
• This is a min-max (game) problem, awkward to solve directly.
- 15. Maximize Margin
Strategy: rescale w and b so that the closest points satisfy
$$\forall \mathbf{x}_i \in D:\ |\mathbf{x}_i\cdot\mathbf{w} + b| \ge 1$$
Then the problem becomes
$$\underset{\mathbf{w},b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2 \quad \text{subject to } \forall \mathbf{x}_i \in D:\ y_i(\mathbf{x}_i\cdot\mathbf{w} + b) \ge 1$$
- 16. Maximum Margin Linear Classifier
$$\{\mathbf{w}^*, b^*\} = \underset{\mathbf{w},b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2$$
subject to
$$y_1(\mathbf{w}\cdot\mathbf{x}_1 + b) \ge 1,\quad y_2(\mathbf{w}\cdot\mathbf{x}_2 + b) \ge 1,\quad \dots,\quad y_N(\mathbf{w}\cdot\mathbf{x}_N + b) \ge 1$$
• How to solve it?
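This problem is small enough to hand to a generic QP solver. A sketch, assuming the cvxopt package (not a tool the deck mentions), with the variables stacked as u = (w, b):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed dependency

def hard_margin_svm(X, y):
    """argmin_{w,b} (1/2) sum_i w_i^2  s.t.  y_k (w.x_k + b) >= 1."""
    N, d = X.shape
    P = np.zeros((d + 1, d + 1))
    P[:d, :d] = np.eye(d)          # quadratic criterion acts on w only
    P[d, d] = 1e-8                 # tiny ridge on b keeps the QP well-posed
    q = np.zeros(d + 1)
    # y_k (w.x_k + b) >= 1  rewritten in solver form  G u <= h:
    G = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    h = -np.ones(N)
    u = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))['x']).ravel()
    return u[:d], u[d]             # w*, b*  (boundary: w.x + b = 0)
```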
- 17. Learning via Quadratic Programming
• QP is a well-studied class of optimization problems: optimize a quadratic function of some real-valued variables subject to linear constraints.
- 18. Quadratic Programming
Find the quadratic criterion
$$\underset{\mathbf{u}}{\operatorname{argmin}}\ \ c + \mathbf{d}^T\mathbf{u} + \frac{\mathbf{u}^T R\,\mathbf{u}}{2}$$
Subject to n additional linear inequality constraints
$$a_{11}u_1 + a_{12}u_2 + \dots + a_{1m}u_m \le b_1$$
$$a_{21}u_1 + a_{22}u_2 + \dots + a_{2m}u_m \le b_2$$
$$\vdots$$
$$a_{n1}u_1 + a_{n2}u_2 + \dots + a_{nm}u_m \le b_n$$
And subject to e additional linear equality constraints
$$a_{(n+1)1}u_1 + a_{(n+1)2}u_2 + \dots + a_{(n+1)m}u_m = b_{(n+1)}$$
$$\vdots$$
$$a_{(n+e)1}u_1 + a_{(n+e)2}u_2 + \dots + a_{(n+e)m}u_m = b_{(n+e)}$$
- 19. Quadratic Programming
(This slide repeats the general QP formulation above.)
- 20. Quadratic Programming
$$\{\mathbf{w}^*, b^*\} = \underset{\mathbf{w},b}{\operatorname{argmin}}\ \|\mathbf{w}\|^2 \quad \text{subject to } y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \text{ for all training data } (\mathbf{x}_i, y_i)$$
This fits the QP template: the quadratic criterion is $\mathbf{w}^T\mathbf{w}$, and there are N inequality constraints
$$y_1(\mathbf{w}\cdot\mathbf{x}_1 + b) \ge 1,\ \ y_2(\mathbf{w}\cdot\mathbf{x}_2 + b) \ge 1,\ \dots,\ y_N(\mathbf{w}\cdot\mathbf{x}_N + b) \ge 1$$
- 21. Uh-oh!
[Figure: +1 and −1 points that are not linearly separable]
This is going to be a problem! What should we do?
- 22. Uh-oh!
This is going to be a problem! What should we do?
Idea 1: Find the minimum w·w, while also minimizing the number of training set errors.
Problemette: two things to minimize makes for an ill-defined optimization.
- 23. Uh-oh!
Idea 1.1: Minimize w·w + C (#train errors), where C is a tradeoff parameter.
There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
- 24. Uh-oh!
Idea 1.1 fails because it can't be expressed as a Quadratic Programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
- 25. Uh-oh!
Idea 2.0: Minimize w·w + C (distance of error points to their correct place).
- 26. Support Vector Machine (SVM) for Noisy Data
• Any problem with this formulation?
$$\{\mathbf{w}^*, b^*\} = \underset{\mathbf{w},b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2 + c\sum_{j=1}^{N}\varepsilon_j$$
subject to
$$y_1(\mathbf{w}\cdot\mathbf{x}_1 + b) \ge 1 - \varepsilon_1,\quad y_2(\mathbf{w}\cdot\mathbf{x}_2 + b) \ge 1 - \varepsilon_2,\quad \dots,\quad y_N(\mathbf{w}\cdot\mathbf{x}_N + b) \ge 1 - \varepsilon_N$$
[Figure: three points inside or beyond the margin, with slacks ε1, ε2, ε3]
- 27. Support Vector Machine (SVM) for Noisy Data
• Balance the tradeoff between the margin and classification errors:
$$\{\mathbf{w}^*, b^*\} = \underset{\mathbf{w},b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2 + c\sum_{j=1}^{N}\varepsilon_j$$
subject to
$$y_j(\mathbf{w}\cdot\mathbf{x}_j + b) \ge 1 - \varepsilon_j,\quad \varepsilon_j \ge 0 \qquad (j = 1, \dots, N)$$
- 28. Support Vector Machine for Noisy Data
$$\{\mathbf{w}^*, b^*\} = \underset{\mathbf{w},b}{\operatorname{argmin}} \sum_{i=1}^{d} w_i^2 + c\sum_{j=1}^{N}\varepsilon_j$$
subject to the N inequality constraints
$$y_j(\mathbf{w}\cdot\mathbf{x}_j + b) \ge 1 - \varepsilon_j,\quad \varepsilon_j \ge 0 \qquad (j = 1, \dots, N)$$
How do we determine the appropriate value for c? (One standard answer is sketched below.)
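The slides leave the question open. In practice a common answer is cross-validation; here is a sketch using scikit-learn (an assumption of this edit, not a tool the deck mentions):

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pick_c(X, y, candidates=(0.01, 0.1, 1.0, 10.0, 100.0)):
    # Score each candidate tradeoff value by 5-fold cross-validation
    # and keep the one with the best held-out accuracy.
    scores = {c: cross_val_score(SVC(kernel='linear', C=c), X, y, cv=5).mean()
              for c in candidates}
    return max(scores, key=scores.get)
```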
- 29. The Dual Form of QP
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl} \qquad \text{where } Q_{kl} = y_k y_l\,(\mathbf{x}_k\cdot\mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C\ \ \forall k, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k=1}^{R}\alpha_k y_k\,\mathbf{x}_k$$
Then classify with: f(x, w, b) = sign(w · x − b)
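A sketch of this dual QP with a generic solver (again assuming cvxopt; the 1e-6 support-vector threshold is an arbitrary choice, and the sign convention here is w·x + b):

```python
import numpy as np
from cvxopt import matrix, solvers  # assumed dependency

def svm_dual(X, y, C=1.0):
    """max sum_k a_k - (1/2) a'Qa  s.t.  0 <= a_k <= C,  sum_k a_k y_k = 0."""
    N = len(y)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                                   # Q_kl = y_k y_l (x_k . x_l)
    P = matrix(Q + 1e-8 * np.eye(N))                # small ridge for stability
    q = matrix(-np.ones(N))                         # solver minimizes, so negate
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))  # 0 <= a_k  and  a_k <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A, b0 = matrix(y.astype(float).reshape(1, -1)), matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b0)['x']).ravel()
    w = (alpha * y) @ X                             # w = sum_k a_k y_k x_k
    sv = alpha > 1e-6                               # the support vectors
    b = np.mean(y[sv] - X[sv] @ w)                  # from y_k (w.x_k + b) = 1
    return alpha, w, b                              # classify: sign(w.x + b)
```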
- 30. The Dual Form of QP
(This slide repeats the dual formulation above.)
- 31. An Equivalent QP
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl} \qquad \text{where } Q_{kl} = y_k y_l\,(\mathbf{x}_k\cdot\mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C\ \ \forall k, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k=1}^{R}\alpha_k y_k\,\mathbf{x}_k$$
Datapoints with α_k > 0 will be the support vectors... so this sum only needs to be over the support vectors.
- 32. Support Vectors
[Figure: the planes w · x + b = 1 and w · x + b = −1, with the support vectors lying on them]
The decision boundary is determined only by the support vectors!
$$\mathbf{w} = \sum_{k=1}^{R}\alpha_k y_k\,\mathbf{x}_k, \qquad \forall i:\ \alpha_i\left[y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1\right] = 0$$
α_i = 0 for non-support vectors; α_i > 0 for support vectors.
- 33. The Dual Form of QP
(The same dual formulation as slide 29, with the classification rule f(x, w, b) = sign(w · x − b).)
How to determine b?
- 34. An Equivalent QP: Determine b
First solve the noisy-data QP for w:
$$\{\mathbf{w}^*, b^*\} = \underset{\mathbf{w},b}{\operatorname{argmin}} \sum_{i=1}^{d}w_i^2 + c\sum_{j=1}^{N}\varepsilon_j \quad \text{s.t. } y_j(\mathbf{w}\cdot\mathbf{x}_j + b) \ge 1 - \varepsilon_j,\ \varepsilon_j \ge 0$$
Then fix w = w* and solve for b alone:
$$b^* = \underset{b}{\operatorname{argmin}} \sum_{i=1}^{N}\varepsilon_i \quad \text{s.t. } y_j(\mathbf{w}^*\cdot\mathbf{x}_j + b) \ge 1 - \varepsilon_j,\ \varepsilon_j \ge 0$$
A linear programming problem!
- 35. An Equivalent QP
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl} \qquad \text{where } Q_{kl} = y_k y_l\,(\mathbf{x}_k\cdot\mathbf{x}_l)$$
Subject to these constraints:
$$0 \le \alpha_k \le C\ \ \forall k, \qquad \sum_{k=1}^{R}\alpha_k y_k = 0$$
Then define:
$$\mathbf{w} = \sum_{k=1}^{R}\alpha_k y_k\,\mathbf{x}_k, \qquad b = y_K(1 - \varepsilon_K) - \mathbf{x}_K\cdot\mathbf{w} \ \ \text{where } K = \underset{k}{\operatorname{argmax}}\ \alpha_k$$
Then classify with: f(x, w, b) = sign(w · x − b)
Datapoints with α_k > 0 will be the support vectors... so this sum only needs to be over the support vectors.
Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly.
• Because of further jaw-dropping developments you're about to learn.
- 36. Suppose we're in 1-dimension
What would SVMs do with this data?
[Figure: 1-d datapoints placed along the x-axis, with x = 0 marked]
- 37. Suppose we're in 1-dimension
Not a big surprise.
[Figure: the maximum margin split, with the positive "plane" and negative "plane" marked]
- 38. Harder 1-dimensional dataset
That's wiped the smirk off SVM's face. What can be done about this?
[Figure: 1-d data where the classes interleave around x = 0]
- 39. Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too:
$$\mathbf{z}_k = (x_k,\ x_k^2)$$
- 40. Harder 1-dimensional dataset
[Figure: the same data mapped to z_k = (x_k, x_k²), where it becomes linearly separable]
- 41. Common SVM basis functions
z_k = ( polynomial terms of x_k of degree 1 to q )
z_k = ( radial basis functions of x_k ), e.g. (writing the kernel width as σ)
$$z_k[j] = \varphi_j(\mathbf{x}_k) = \exp\left(-\frac{|\mathbf{x}_k - \mathbf{c}_j|^2}{2\sigma^2}\right)$$
z_k = ( sigmoid functions of x_k )
This is sensible. Is that the end of the story? No... there's one more trick!
- 42. Quadratic Basis Functions
$$\Phi(\mathbf{x}) = \Big(1,\ \ \sqrt{2}x_1, \sqrt{2}x_2, \dots, \sqrt{2}x_m,\ \ x_1^2, x_2^2, \dots, x_m^2,\ \ \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \dots, \sqrt{2}x_{m-1}x_m\Big)^T$$
Constant term; linear terms; pure quadratic terms; quadratic cross-terms.
Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2.
You may be wondering what those √2's are doing. You should be happy that they do no harm. You'll find out why they're there soon.
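The same map as a small code sketch (the function name is illustrative):

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Constant, sqrt(2)*linear, pure quadratic, and sqrt(2)*cross terms:
    (m+2)(m+1)/2 features in total."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

print(phi_quadratic([1.0, 2.0, 3.0]).size)  # (3+2)(3+1)/2 = 10
```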
- 43. QP (old)
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl} \qquad \text{where } Q_{kl} = y_k y_l\,(\mathbf{x}_k\cdot\mathbf{x}_l)$$
Subject to: $0 \le \alpha_k \le C\ \forall k$ and $\sum_k \alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_k \alpha_k y_k\,\mathbf{x}_k$ and classify with f(x, w, b) = sign(w · x − b).
- 44. QP with basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl} \qquad \text{where } Q_{kl} = y_k y_l\,\big(\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l)\big)$$
Subject to: $0 \le \alpha_k \le C\ \forall k$ and $\sum_k \alpha_k y_k = 0$.
Then define:
$$\mathbf{w} = \sum_{k\ \text{s.t.}\ \alpha_k > 0}\alpha_k y_k\,\Phi(\mathbf{x}_k)$$
Then classify with: f(x, w, b) = sign(w · Φ(x) − b)
Most important change: x → Φ(x)
Copyright © 2001, 2003, Andrew W. Moore
QP with basis functions
where ))
(
).
(
( l
k
l
k
kl y
y
Q x
Φ
x
Φ
Subject to these
constraints:
k
C
αk
0
Then define:
Then classify with:
f(x,w,b) = sign(w. f(x) - b)
0
1
R
k
k
k y
α
We must do R2/2 dot products to
get this matrix ready.
Each dot product requires m2/2
additions and multiplications
The whole thing costs R2 m2 /4.
Yeeks!
…or does it?
0
s.t.
)
(
k
α
k
k
k
k y
α x
Φ
w
Maximize
R
k
R
l
kl
l
k
R
k
k Q
α
α
α
1 1
1 2
1
- 46. Quadratic Dot Products
$$\Phi(\mathbf{a})\cdot\Phi(\mathbf{b}) = 1 + 2\sum_{i=1}^{m}a_i b_i + \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j$$
(the constant term, the linear terms, the pure quadratic terms, and the cross-terms, multiplied pairwise and summed)
- 47. Quadratic Dot Products
Just out of casual, innocent interest, let's look at another function of a and b:
$$(\mathbf{a}\cdot\mathbf{b} + 1)^2 = (\mathbf{a}\cdot\mathbf{b})^2 + 2\,\mathbf{a}\cdot\mathbf{b} + 1$$
$$= \left(\sum_{i=1}^{m}a_i b_i\right)^2 + 2\sum_{i=1}^{m}a_i b_i + 1 = \sum_{i=1}^{m}\sum_{j=1}^{m}a_i b_i\,a_j b_j + 2\sum_{i=1}^{m}a_i b_i + 1$$
$$= \sum_{i=1}^{m}a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m}a_i a_j b_i b_j + 2\sum_{i=1}^{m}a_i b_i + 1$$
- 48. Quadratic Dot Products
Compare Φ(a)·Φ(b) with (a·b + 1)², term by term: they're the same! And (a·b + 1)² is only O(m) to compute!
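A quick numerical confirmation of the identity (a sketch; it rebuilds the quadratic map from slide 42):

```python
import numpy as np
from itertools import combinations

def phi(x):
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
# Phi(a).Phi(b), computed in O(m^2), agrees with (a.b + 1)^2, computed in O(m):
print(np.isclose(phi(a) @ phi(b), (a @ b + 1) ** 2))  # True
```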
- 49. QP with Quintic basis functions
(The same dual QP as before, now with quintic basis functions Φ: maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,(\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l))$, subject to $0 \le \alpha_k \le C$ and $\sum_k\alpha_k y_k = 0$; then $\mathbf{w} = \sum_{k\,:\,\alpha_k>0}\alpha_k y_k\,\Phi(\mathbf{x}_k)$ and classify with f(x, w, b) = sign(w · Φ(x) − b).)
- 50. QP with Quadratic basis functions
Maximize
$$\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl} \qquad \text{where } Q_{kl} = y_k y_l\,K(\mathbf{x}_k, \mathbf{x}_l)$$
Subject to: $0 \le \alpha_k \le C\ \forall k$ and $\sum_k\alpha_k y_k = 0$.
Then define:
$$\mathbf{w} = \sum_{k\ \text{s.t.}\ \alpha_k>0}\alpha_k y_k\,\Phi(\mathbf{x}_k), \qquad b = y_K(1-\varepsilon_K) - \mathbf{x}_K\cdot\mathbf{w} \ \ \text{where } K = \underset{k}{\operatorname{argmax}}\ \alpha_k$$
Then classify with: f(x, w, b) = sign(K(w, x) − b)
Most important change:
$$\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l)\ \rightarrow\ K(\mathbf{x}_k, \mathbf{x}_l)$$
- 51. Higher Order Polynomials

Polynomial | φ(x) | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
--- | --- | --- | --- | --- | --- | ---
Quadratic | all m²/2 terms up to degree 2 | m²R²/4 | 2,500 R² | (a·b+1)² | mR²/2 | 50 R²
Cubic | all m³/6 terms up to degree 3 | m³R²/12 | 83,000 R² | (a·b+1)³ | mR²/2 | 50 R²
Quartic | all m⁴/24 terms up to degree 4 | m⁴R²/48 | 1,960,000 R² | (a·b+1)⁴ | mR²/2 | 50 R²
- 52. SVM Kernel Functions
• K(a, b) = (a · b + 1)^d is an example of an SVM kernel function.
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function.
• Radial-basis-style kernel function:
$$K(\mathbf{a}, \mathbf{b}) = \exp\left(-\frac{(\mathbf{a}-\mathbf{b})^2}{2\sigma^2}\right)$$
• Neural-net-style kernel function:
$$K(\mathbf{a}, \mathbf{b}) = \tanh(\kappa\,\mathbf{a}\cdot\mathbf{b} - \delta)$$
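Both kernels as plain numpy functions, a sketch (σ, κ, δ are the free parameters above; the default values here are arbitrary):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # K(a,b) = exp(-||a - b||^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa=1.0, delta=0.0):
    # K(a,b) = tanh(kappa a.b - delta)
    return np.tanh(kappa * np.dot(a, b) - delta)
```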
- 53. Kernel Tricks
• Replacing the dot product with a kernel function.
• Not all functions are kernel functions.
• They need to be decomposable: K(a, b) = Φ(a) · Φ(b).
• Could K(a, b) = (a − b)³ be a kernel function?
• Could K(a, b) = (a − b)⁴ − (a + b)² be a kernel function?
- 54. Kernel Tricks
• Mercer's condition: to expand a kernel function K(x, y) into a dot product, i.e. K(x, y) = Φ(x) · Φ(y), K(x, y) has to be a positive semi-definite function, i.e., for any function f(x) whose $\int f^2(x)\,dx$ is finite, the following inequality holds:
$$\iint f(x)\,K(x, y)\,f(y)\,dx\,dy \ge 0$$
• Could $K(x, y) = \sum_i x_i^p\,y_i^p$ be a kernel function?
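Mercer's condition also suggests a cheap empirical test for candidate kernels like the ones above: sample some points and check that the Gram matrix is symmetric and positive semi-definite. A sketch (random sampling gives evidence, not a proof; all names are illustrative):

```python
import numpy as np

def looks_like_kernel(K, n=50, d=3, seed=0):
    X = np.random.default_rng(seed).normal(size=(n, d))
    G = np.array([[K(a, b) for b in X] for a in X])
    symmetric = np.allclose(G, G.T)
    psd = np.linalg.eigvalsh((G + G.T) / 2).min() >= -1e-8
    return symmetric and psd

# (a-b)^3 summed over coordinates is antisymmetric, so it fails immediately:
print(looks_like_kernel(lambda a, b: np.sum((a - b) ** 3)))     # False
print(looks_like_kernel(lambda a, b: (np.dot(a, b) + 1) ** 2))  # True
```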
- 55. Kernel Tricks
• Pro
  • Introduces nonlinearity into the model
  • Computationally cheap
• Con
  • Still has potential overfitting problems
- 58. Kernelize Logistic Regression
How can we introduce nonlinearity into logistic regression?
$$p(y\,|\,\mathbf{x}) = \frac{1}{1+\exp(-y\,\mathbf{x}\cdot\mathbf{w})}$$
$$l_{reg}(\mathbf{w}) = \sum_{i=1}^{N}\log\big(1+\exp(-y_i\,\mathbf{x}_i\cdot\mathbf{w})\big) + c\sum_{k}w_k^2$$
- 59. Kernelize Logistic Regression
• Representation Theorem:
$$\mathbf{w} = \sum_{i=1}^{N}\alpha_i\,\phi(\mathbf{x}_i), \qquad \langle\phi(\mathbf{x}), \mathbf{w}\rangle = K(\mathbf{w}, \mathbf{x}) = \sum_{i=1}^{N}\alpha_i\,K(\mathbf{x}, \mathbf{x}_i)$$
$$p(y\,|\,\mathbf{x}) = \frac{1}{1+\exp(-y\,K(\mathbf{x},\mathbf{w}))} = \frac{1}{1+\exp\left(-y\sum_{i=1}^{N}\alpha_i K(\mathbf{x}, \mathbf{x}_i)\right)}$$
$$l_{reg}(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\log\left(1+\exp\Big(-y_i\sum_{j=1}^{N}\alpha_j K(\mathbf{x}_i, \mathbf{x}_j)\Big)\right) + c\sum_{i,j=1}^{N}\alpha_i\alpha_j K(\mathbf{x}_i, \mathbf{x}_j)$$
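A gradient-descent sketch of this objective on α (plain full-batch descent with a fixed step size, chosen for clarity rather than speed; all names are illustrative):

```python
import numpy as np

def kernel_logreg(K, y, c=0.1, lr=0.01, steps=2000):
    """Minimize sum_i log(1 + exp(-y_i (K alpha)_i)) + c alpha' K alpha."""
    alpha = np.zeros(len(y))
    for _ in range(steps):
        m = np.clip(y * (K @ alpha), -30, 30)  # margins, clipped for stability
        p = 1.0 / (1.0 + np.exp(m))            # sigmoid(-m): per-point error prob
        grad = -K @ (y * p) + 2.0 * c * (K @ alpha)
        alpha -= lr * grad
    return alpha                               # predict x via sum_j alpha_j K(x, x_j)
```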
- 60. Overfitting in SVM
[Figure: classification error vs. polynomial degree (1 to 5) on the Breast Cancer dataset (errors roughly 0 to 0.08) and the Ionosphere dataset (errors roughly 0 to 0.2), each plot showing a training error curve and a testing error curve]
- 61. SVM Performance
• Anecdotally they work very very well indeed.
• Example: They are currently the best-known
classifier on a well-studied hand-written-character
recognition benchmark
• Another Example: Andrew knows several reliable
people doing practical real-world work who claim
that SVMs have saved them when their other
favorite classifiers did poorly.
• There is a lot of excitement and religious fervor
about SVMs as of 2001.
• Despite this, some practitioners are a little
skeptical.
- 62. Diffusion Kernel
• A kernel function describes the correlation or similarity between two data points.
• Suppose I have a function s(x, y) that describes the similarity between two data points, and assume that it is non-negative and symmetric. How can we generate a kernel function based on this similarity function?
• A graph theory approach ...
- 63. Diffusion Kernel
• Create a graph for the data points
  • Each vertex corresponds to a data point
  • The weight of each edge is the similarity s(x, y)
• Graph Laplacian:
$$L_{i,j} = \begin{cases} s(\mathbf{x}_i, \mathbf{x}_j) & i \ne j \\ -\sum_{k \ne i} s(\mathbf{x}_i, \mathbf{x}_k) & i = j \end{cases}$$
• Properties of the Laplacian
  • Negative semi-definite
- 64. Diffusion Kernel
• Consider a simple Laplacian:
$$L_{i,j} = \begin{cases} 1 & \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are connected } (i \ne j) \\ -|N(\mathbf{x}_i)| & i = j \end{cases}$$
(where N(x_i) is the set of neighbors of x_i)
• Consider L², L⁴, ...
• What do these matrices represent?
• A diffusion kernel
- 65. Diffusion Kernel
(This slide repeats the previous one.)
- 66. Diffusion Kernel: Properties
• Positive definite
• Local relationships in L induce global relationships
• Works for undirected weighted graphs with symmetric similarities: s(x_i, x_j) = s(x_j, x_i)
• How to compute the diffusion kernel $e^{\beta L}$:
$$K = e^{\beta L}, \quad \text{or equivalently} \quad \frac{dK}{d\beta} = LK$$
- 67. Computing Diffusion Kernel
• Eigendecomposition of the Laplacian L:
$$L = U\Sigma U^T = (\mathbf{u}_1, \dots, \mathbf{u}_m)\,\Sigma\,(\mathbf{u}_1, \dots, \mathbf{u}_m)^T = \sum_{i=1}^{m}\lambda_i\,\mathbf{u}_i\mathbf{u}_i^T, \qquad \mathbf{u}_i^T\mathbf{u}_j = \delta_{i,j}$$
• What is L²?
$$L^2 = \left(\sum_{i=1}^{m}\lambda_i\,\mathbf{u}_i\mathbf{u}_i^T\right)^2 = \sum_{i,j=1}^{m}\lambda_i\lambda_j\,\mathbf{u}_i\mathbf{u}_i^T\mathbf{u}_j\mathbf{u}_j^T = \sum_{i=1}^{m}\lambda_i^2\,\mathbf{u}_i\mathbf{u}_i^T$$
- 68. Computing Diffusion Kernel
• What about Lⁿ?
$$L^n = \sum_{i=1}^{m}\lambda_i^n\,\mathbf{u}_i\mathbf{u}_i^T$$
• Compute the diffusion kernel $e^{L}$:
$$e^{L} = \sum_{n=0}^{\infty}\frac{L^n}{n!} = \sum_{n=0}^{\infty}\frac{1}{n!}\sum_{i=1}^{m}\lambda_i^n\,\mathbf{u}_i\mathbf{u}_i^T = \sum_{i=1}^{m}\left(\sum_{n=0}^{\infty}\frac{\lambda_i^n}{n!}\right)\mathbf{u}_i\mathbf{u}_i^T = \sum_{i=1}^{m}e^{\lambda_i}\,\mathbf{u}_i\mathbf{u}_i^T$$
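The eigendecomposition route in code, a sketch (it assumes the similarity matrix S is symmetric with a zero diagonal; the β parameter generalizes the slide's e^L, with β = 1 recovering it):

```python
import numpy as np

def diffusion_kernel(S, beta=1.0):
    """K = exp(beta L) for the graph Laplacian built from similarities S."""
    L = S - np.diag(S.sum(axis=1))        # off-diagonal s(x_i,x_j); diagonal -sum_k
    lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T
    return (U * np.exp(beta * lam)) @ U.T # U diag(e^{beta lam}) U^T
```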
- 69. Doing multi-class classification
• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (a sketch follows).
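A sketch of exactly this recipe on top of scikit-learn's two-class SVC (the deck names no library; the class here is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

class OneVsRestSVM:
    def fit(self, X, y):
        # One two-class SVM per class: "Output == k" vs "Output != k".
        self.classes_ = np.unique(y)
        self.models_ = [SVC(kernel='linear').fit(X, np.where(y == k, 1, -1))
                        for k in self.classes_]
        return self

    def predict(self, X):
        # Pick the class whose SVM pushes the point furthest into +1 territory.
        scores = np.column_stack([m.decision_function(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]
```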
- 70. Ranking Problem
• Consider the problem of ranking essays
• Three ranking categories: good, ok, bad
• Given an input document, predict its ranking category
• How should we formulate this problem?
  • A simple multi-class solution: each ranking category is an independent class
  • But there is something missing here ... we miss the ordinal relationship between the classes!
- 71. Ordinal Regression
• Which choice is better?
• How could we formulate this problem?
[Figure: 'good', 'OK', and 'bad' documents projected onto two candidate weight vectors w and w']
- 72. Ordinal Regression
• What are the two decision boundaries?
$$\mathbf{w}\cdot\mathbf{x} - b_1 = 0 \quad \text{and} \quad \mathbf{w}\cdot\mathbf{x} - b_2 = 0$$
• What is the margin for ordinal regression?
$$\text{margin}_1(\mathbf{w}, b_1) = \min_{\mathbf{x}_i\in D_g\cup D_o} d(\mathbf{x}_i) = \min_{\mathbf{x}_i\in D_g\cup D_o}\frac{|\mathbf{w}\cdot\mathbf{x}_i - b_1|}{\sqrt{\sum_{i=1}^{d}w_i^2}}$$
$$\text{margin}_2(\mathbf{w}, b_2) = \min_{\mathbf{x}_i\in D_o\cup D_b} d(\mathbf{x}_i) = \min_{\mathbf{x}_i\in D_o\cup D_b}\frac{|\mathbf{w}\cdot\mathbf{x}_i - b_2|}{\sqrt{\sum_{i=1}^{d}w_i^2}}$$
$$\text{margin}(\mathbf{w}, b_1, b_2) = \min\big(\text{margin}_1(\mathbf{w}, b_1),\ \text{margin}_2(\mathbf{w}, b_2)\big)$$
• Maximize the margin:
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \underset{\mathbf{w}, b_1, b_2}{\operatorname{argmax}}\ \text{margin}(\mathbf{w}, b_1, b_2)$$
(Here $D_g$, $D_o$, $D_b$ are the 'good', 'OK', and 'bad' documents.)
- 73. Ordinal Regression
(This slide repeats the previous formulation.)
- 74. Ordinal Regression
(This slide repeats the previous formulation.)
- 75. Ordinal Regression
• How do we solve this monster?
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \underset{\mathbf{w},b_1,b_2}{\operatorname{argmax}}\ \min\left(\min_{\mathbf{x}_i\in D_g\cup D_o}\frac{|\mathbf{w}\cdot\mathbf{x}_i - b_1|}{\sqrt{\sum_{i=1}^{d}w_i^2}},\ \ \min_{\mathbf{x}_i\in D_o\cup D_b}\frac{|\mathbf{w}\cdot\mathbf{x}_i - b_2|}{\sqrt{\sum_{i=1}^{d}w_i^2}}\right)$$
subject to
$$\forall \mathbf{x}_i \in D_g:\ \mathbf{w}\cdot\mathbf{x}_i - b_1 \ge 0$$
$$\forall \mathbf{x}_i \in D_o:\ \mathbf{w}\cdot\mathbf{x}_i - b_1 \le 0,\ \ \mathbf{w}\cdot\mathbf{x}_i - b_2 \ge 0$$
$$\forall \mathbf{x}_i \in D_b:\ \mathbf{w}\cdot\mathbf{x}_i - b_2 \le 0$$
- 76. Ordinal Regression
• The same old trick
• To remove the scaling invariance, set
$$\forall \mathbf{x}_i \in D_g \cup D_o:\ |\mathbf{w}\cdot\mathbf{x}_i - b_1| \ge 1$$
$$\forall \mathbf{x}_i \in D_o \cup D_b:\ |\mathbf{w}\cdot\mathbf{x}_i - b_2| \ge 1$$
• Now the problem is simplified as:
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \underset{\mathbf{w},b_1,b_2}{\operatorname{argmin}}\ \sum_{i=1}^{d}w_i^2$$
subject to
$$\forall \mathbf{x}_i \in D_g:\ \mathbf{w}\cdot\mathbf{x}_i - b_1 \ge 1$$
$$\forall \mathbf{x}_i \in D_o:\ \mathbf{w}\cdot\mathbf{x}_i - b_1 \le -1,\ \ \mathbf{w}\cdot\mathbf{x}_i - b_2 \ge 1$$
$$\forall \mathbf{x}_i \in D_b:\ \mathbf{w}\cdot\mathbf{x}_i - b_2 \le -1$$
- 77. Ordinal Regression
• Noisy case:
$$\{\mathbf{w}^*, b_1^*, b_2^*\} = \underset{\mathbf{w},b_1,b_2}{\operatorname{argmin}}\ \sum_{i=1}^{d}w_i^2 + c\sum_{\mathbf{x}_i\in D_g}\varepsilon_i + c\sum_{\mathbf{x}_i\in D_o}(\varepsilon_i + \varepsilon_i') + c\sum_{\mathbf{x}_i\in D_b}\varepsilon_i$$
subject to
$$\forall \mathbf{x}_i \in D_g:\ \mathbf{w}\cdot\mathbf{x}_i - b_1 \ge 1 - \varepsilon_i,\ \ \varepsilon_i \ge 0$$
$$\forall \mathbf{x}_i \in D_o:\ \mathbf{w}\cdot\mathbf{x}_i - b_1 \le -1 + \varepsilon_i,\ \ \mathbf{w}\cdot\mathbf{x}_i - b_2 \ge 1 - \varepsilon_i',\ \ \varepsilon_i \ge 0,\ \varepsilon_i' \ge 0$$
$$\forall \mathbf{x}_i \in D_b:\ \mathbf{w}\cdot\mathbf{x}_i - b_2 \le -1 + \varepsilon_i,\ \ \varepsilon_i \ge 0$$
• Is this sufficient?
- 78. Ordinal Regression
[Figure: 'good', 'OK', and 'bad' regions along the direction w]
Same objective and constraints as the noisy case above, plus an ordering constraint on the two thresholds:
$$b_1 \ge b_2$$
- 79. References
• An excellent tutorial on VC-dimension and Support Vector Machines:
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible (not for beginners, including myself):
Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.
• Software: SVM-light, http://svmlight.joachims.org/, free download.