© Eric Xing @ CMU, 2006-2010 1
Machine Learning
Support Vector Machines
Eric Xing
Lecture 4, August 12, 2010
Reading:
© Eric Xing @ CMU, 2006-2010 2
What is a good Decision
Boundary?
 Why might we have such boundaries?
 Irregular distribution
 Imbalanced training sizes
 Outliers
© Eric Xing @ CMU, 2006-2010 3
Classification and Margin
 Parameterizing the decision boundary
 Let w denote a vector orthogonal to the decision boundary, and b denote a scalar
"offset" term; then we can write the decision boundary as:
$$ w^T x + b = 0 $$
[Figure: Class 1 and Class 2 separated by the boundary, with distances d- and d+ from the boundary to the closest point on each side]
© Eric Xing @ CMU, 2006-2010 4
Classification and Margin
 Parameterizing the decision boundary
 Let w denote a vector orthogonal to the decision boundary, and b denote a scalar
"offset" term; then we can write the decision boundary as:
$$ w^T x + b = 0 $$
 Margin
$$ \frac{w^T x_i + b}{\|w\|} > +\frac{c}{\|w\|} \;\text{ for all } x_i \text{ in class 2}, \qquad
\frac{w^T x_i + b}{\|w\|} < -\frac{c}{\|w\|} \;\text{ for all } x_i \text{ in class 1} $$
Or more compactly:
$$ \frac{y_i (w^T x_i + b)}{\|w\|} > \frac{c}{\|w\|} $$
The margin between two points (one on each side of the boundary):
$$ m = d_- + d_+ $$
[Figure: the two classes, the boundary $w^T x + b = 0$, its unit normal $w/\|w\|$, and the distances d- and d+]
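As a concrete illustration of the quantities above (a minimal sketch, not from the slides; the boundary parameters, points, and labels below are made up), the signed distance of a point to the hyperplane is (w^T x + b)/||w||:

import numpy as np

# Hypothetical boundary parameters (made up for illustration).
w = np.array([2.0, 1.0])
b = -4.0

# Two made-up points, one on each side of the boundary.
x_pos = np.array([3.0, 2.0])   # class 2 side (label +1)
x_neg = np.array([1.0, 0.0])   # class 1 side (label -1)

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

d_plus = signed_distance(x_pos, w, b)    # positive for the class-2 point
d_minus = -signed_distance(x_neg, w, b)  # positive for the class-1 point
print(d_plus, d_minus, "m = d- + d+ =", d_plus + d_minus)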
© Eric Xing @ CMU, 2006-2010 5
Maximum Margin Classification
 The margin is:
$$ m = \frac{w^T}{\|w\|} \left( x_{i^*} - x_{j^*} \right) = \frac{2c}{\|w\|} $$
 Here is our Maximum Margin Classification problem:
$$ \max_{w,b}\; \frac{2c}{\|w\|} \quad \text{s.t.}\quad \frac{y_i (w^T x_i + b)}{\|w\|} \ge \frac{c}{\|w\|}, \;\; \forall i $$
© Eric Xing @ CMU, 2006-2010 6
Maximum Margin Classification,
cont'd.
 The optimization problem:
$$ \max_{w,b}\; \frac{c}{\|w\|} \quad \text{s.t.}\quad \frac{y_i (w^T x_i + b)}{\|w\|} \ge \frac{c}{\|w\|}, \;\; \forall i $$
 But note that the magnitude of c merely scales w and b, and does
not change the classification boundary at all! (why?)
 So we instead work on this cleaner problem:
$$ \max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i $$
 The solution to this leads to the famous Support Vector Machines,
believed by many to be the best "off-the-shelf" supervised learning algorithm
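To spell out the "(why?)" above, here is a brief sketch of the standard rescaling argument (not reproduced verbatim from the slides): the constraint $y_i(w^T x_i + b)/\|w\| \ge c/\|w\|$ is the same as $y_i(w^T x_i + b) \ge c$, and dividing w and b by c leaves the boundary unchanged while turning the threshold into 1:
$$ \tilde{w} = \frac{w}{c}, \quad \tilde{b} = \frac{b}{c}
\;\;\Longrightarrow\;\;
y_i (w^T x_i + b) \ge c \;\Longleftrightarrow\; y_i (\tilde{w}^T x_i + \tilde{b}) \ge 1,
\qquad \frac{c}{\|w\|} = \frac{1}{\|\tilde{w}\|} $$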
© Eric Xing @ CMU, 2006-2010 7
Support vector machine
 A convex quadratic programming problem
with linear constraints:
 The attained margin is now given by
 Only a few of the classification constraints are relevant  support vectors
 Constrained optimization
 We can directly solve this using commercial quadratic programming (QP) code
 But we want to investigate Lagrange duality more carefully, and the
solution of the above problem in its dual form.
 deeper insight: support vectors, kernels …
 more efficient algorithm
$$ \max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i $$
Attained margin: $ \dfrac{1}{\|w\|} $
© Eric Xing @ CMU, 2006-2010 8
Digression to Lagrangian Duality
 The Primal Problem
Primal:
$$ \min_w\; f(w) \quad \text{s.t.}\quad g_i(w) \le 0,\; i = 1,\dots,k, \qquad h_i(w) = 0,\; i = 1,\dots,l $$
The generalized Lagrangian:
$$ \mathcal{L}(w,\alpha,\beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w) $$
the α's (αi ≥ 0) and β's are called the Lagrangian multipliers
Lemma:
$$ \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) =
\begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases} $$
A re-written Primal:
$$ \min_w\; \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) $$
© Eric Xing @ CMU, 2006-2010 9
Lagrangian Duality, cont.
 Recall the Primal Problem:
$$ \min_w\; \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) $$
 The Dual Problem:
$$ \max_{\alpha,\beta:\,\alpha_i \ge 0}\; \min_w\; \mathcal{L}(w,\alpha,\beta) $$
 Theorem (weak duality):
$$ d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0}\; \min_w\; \mathcal{L}(w,\alpha,\beta)
\;\le\; \min_w\; \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) = p^* $$
 Theorem (strong duality):
Iff there exists a saddle point of $\mathcal{L}(w,\alpha,\beta)$, we have
$$ d^* = p^* $$
© Eric Xing @ CMU, 2006-2010 10
The KKT conditions
 If there exists some saddle point of L, then the saddle point
satisfies the following "Karush-Kuhn-Tucker" (KKT)
conditions:
 Theorem: If w*, α*, and β* satisfy the KKT conditions, then they also constitute a
solution to the primal and the dual problems.
$$ \frac{\partial}{\partial w_i}\, \mathcal{L}(w,\alpha,\beta) = 0, \quad i = 1,\dots,k $$
$$ \frac{\partial}{\partial \beta_i}\, \mathcal{L}(w,\alpha,\beta) = 0, \quad i = 1,\dots,l $$
$$ \alpha_i\, g_i(w) = 0, \quad i = 1,\dots,m $$
$$ g_i(w) \le 0, \quad i = 1,\dots,m $$
$$ \alpha_i \ge 0, \quad i = 1,\dots,m $$
© Eric Xing @ CMU, 2006-2010 11
Solving optimal margin classifier
 Recall our opt problem:
$$ \max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i $$
 This is equivalent to
$$ \min_{w,b}\; \frac{1}{2}\, w^T w \quad \text{s.t.}\quad 1 - y_i (w^T x_i + b) \le 0, \;\; \forall i \qquad (*) $$
 Write the Lagrangian:
$$ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\, w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] $$
 Recall that (*) can be reformulated as
$$ \min_{w,b}\; \max_{\alpha_i \ge 0}\; \mathcal{L}(w,b,\alpha) $$
Now we solve its dual problem:
$$ \max_{\alpha_i \ge 0}\; \min_{w,b}\; \mathcal{L}(w,b,\alpha) $$
© Eric Xing @ CMU, 2006-2010 12
The Dual Problem
$$ \max_{\alpha_i \ge 0}\; \min_{w,b}\; \mathcal{L}(w,b,\alpha), \qquad
\mathcal{L}(w,b,\alpha) = \frac{1}{2}\, w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] $$
 We minimize L with respect to w and b first:
$$ \nabla_w \mathcal{L}(w,b,\alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0, \qquad (*) $$
$$ \nabla_b \mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (**) $$
Note that (*) implies:
$$ w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (***) $$
 Plug (***) back to L, and using (**), we have:
$$ \mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
© Eric Xing @ CMU, 2006-2010 13
The Dual problem, cont.
 Now we have the following dual opt problem:
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad \alpha_i \ge 0,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
 This is, again, a quadratic programming problem.
 A global maximum of αi can always be found.
 But what's the big deal??
 Note two things:
1. w can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$   See next …
2. The "kernel": $x_i^T x_j$   More later …
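As a hedged, minimal sketch of what solving this dual QP looks like in practice (not part of the original slides), the toy example below uses scipy's SLSQP solver on a tiny made-up separable data set and then recovers w and b from the optimal α; a dedicated QP solver would normally be used instead:

import numpy as np
from scipy.optimize import minimize

# Tiny made-up linearly separable data set (for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Negative of J(alpha) = sum(alpha) - 1/2 alpha^T G alpha (we minimize).
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                         # support vectors: nonzero alpha_i
b = np.mean(y[sv] - X[sv] @ w)            # b from y_i (w^T x_i + b) = 1 on the SVs
print("alpha =", np.round(alpha, 3), "w =", w, "b =", b)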
© Eric Xing @ CMU, 2006-2010 14
I. Support vectors
 Note the KKT condition --- only a few αi's can be nonzero!!
$$ \alpha_i\, g_i(w) = 0, \quad i = 1,\dots,m $$
[Figure: training points of Class 1 and Class 2 with their multipliers; only α1 = 0.8, α6 = 1.4, and α8 = 0.6 are nonzero, and all other αi = 0]
Call the training data points
whose αi's are nonzero the
support vectors (SV)
© Eric Xing @ CMU, 2006-2010 15
Support vector machines
 Once we have the Lagrange multipliers {αi}, we can
reconstruct the parameter vector w as a weighted combination
of the training examples:
 For testing with a new data point z
 Compute
and classify z as class 1 if the sum is positive, and class 2 otherwise
 Note: w need not be formed explicitly
$$ w = \sum_{i \in SV} \alpha_i y_i\, x_i $$
$$ w^T z + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b $$
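A minimal sketch of this test-time computation (the support vectors, multipliers, and offset below are made-up placeholders standing in for the output of training):

import numpy as np

def svm_predict(z, alpha_sv, y_sv, X_sv, b):
    """Classify z using only the support vectors, without forming w explicitly."""
    score = np.sum(alpha_sv * y_sv * (X_sv @ z)) + b
    return 1 if score > 0 else -1   # positive score -> class 1, per the slide's convention

# Illustrative (made-up) support-vector set, multipliers, and offset.
X_sv = np.array([[2.0, 2.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.5, 0.5])
b = -1.0
print(svm_predict(np.array([1.5, 2.5]), alpha_sv, y_sv, X_sv, b))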
© Eric Xing @ CMU, 2006-2010 16
Interpretation of support vector
machines
 The optimal w is a linear combination of a small number of
data points. This "sparse" representation can be viewed as
data compression, as in the construction of a kNN classifier
 To compute the weights {αi}, and to use support vector
machines we need to specify only the inner products (or
kernel) between the examples
 We make decisions by comparing each new example z with
only the support vectors:
Inner products: $x_i^T x_j$
$$ y^* = \operatorname{sign}\!\left( \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b \right) $$
© Eric Xing @ CMU, 2006-2010 17
 Is this data linearly-separable?
 How about a quadratic mapping φ(xi)?
II. The Kernel Trick
© Eric Xing @ CMU, 2006-2010 18
II. The Kernel Trick
 Recall the SVM optimization problem
 The data points only appear as inner product
 As long as we can calculate the inner product in the feature
space, we do not need the mapping explicitly
 Many common geometric operations (angles, distances) can
be expressed by inner products
 Define the kernel function K by
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
$$ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) $$
© Eric Xing @ CMU, 2006-2010 19
 Computation depends on the feature space
 Bad if its dimension is much larger than that of the input space
$$ \max_{\alpha}\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, K(x_i, x_j) $$
$$ \text{s.t.}\quad \alpha_i \ge 0,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0, $$
where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
$$ y^*(z) = \operatorname{sign}\!\left( \sum_{i \in SV} \alpha_i y_i\, K(x_i, z) + b \right) $$
II. The Kernel Trick
© Eric Xing @ CMU, 2006-2010 20
Transforming the Data
 Computation in the feature space can be costly because it is high
dimensional
 The feature space is typically infinite-dimensional!
 The kernel trick comes to the rescue
[Figure: points mapped by φ(·) from the input space to the feature space]
Note: the feature space is of higher dimension
than the input space in practice
© Eric Xing @ CMU, 2006-2010 21
An Example for feature mapping
and kernels
 Consider an input x=[x1,x2]
 Suppose φ(.) is given as follows
 An inner product in the feature space is
 So, if we define the kernel function as follows, there is no
need to carry out φ(.) explicitly
$$ \phi\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right)
= \left[\, 1,\;\; \sqrt{2}\, x_1,\;\; \sqrt{2}\, x_2,\;\; x_1^2,\;\; x_2^2,\;\; \sqrt{2}\, x_1 x_2 \,\right]^T $$
$$ \left\langle \phi\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right),\;
\phi\!\left( \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} \right) \right\rangle
= \left( 1 + x^T x' \right)^2 $$
$$ K(x, x') = \left( x^T x' + 1 \right)^2 $$
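A quick numeric check of this identity (a sketch with made-up inputs; the √2 weighting in φ above is what makes the identity exact):

import numpy as np

def phi(x):
    # Feature map for the degree-2 polynomial kernel K(x, x') = (1 + x^T x')^2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)          # inner product computed in feature space
rhs = (1.0 + x @ xp) ** 2       # kernel evaluated directly in input space
print(lhs, rhs)                 # the two numbers agree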
© Eric Xing @ CMU, 2006-2010 22
More examples of kernel
functions
 Linear kernel (we've seen it):
$$ K(x, x') = x^T x' $$
 Polynomial kernel (we just saw an example):
$$ K(x, x') = \left( 1 + x^T x' \right)^p $$
where p = 2, 3, … To get the feature vectors we concatenate all pth-order
polynomial terms of the components of x (weighted appropriately).
 Radial basis kernel:
$$ K(x, x') = \exp\!\left( -\tfrac{1}{2}\, \|x - x'\|^2 \right) $$
In this case the feature space consists of functions and results in a non-parametric classifier.
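These three kernels translate directly into code; a minimal sketch (the degree and bandwidth arguments are illustrative defaults, and sigma = 1 recovers the radial basis form written above):

import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1.0 + x @ xp) ** p

def rbf_kernel(x, xp, sigma=1.0):
    # exp(-||x - x'||^2 / (2 sigma^2)); sigma = 1 matches the form on the slide.
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))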
© Eric Xing @ CMU, 2006-2010 23
The essence of kernel
 Feature mapping, but "without paying a cost"
 E.g., the polynomial kernel
 How many dimensions do we get in the new space?
 How many operations does it take to compute K()?
 Kernel design: any principle?
 K(x,z) can be thought of as a similarity function between x and z
 This intuition is well reflected in the "Gaussian" (radial basis) function above
(similarly, one can easily come up with other K() in the same spirit)
 Does this necessarily lead to a "legal" kernel?
(In the above particular case, K() is a legal one; do you know how many
dimensions φ(x) has?)
© Eric Xing @ CMU, 2006-2010 24
Kernel matrix
 Suppose for now that K is indeed a valid kernel corresponding
to some feature mapping φ; then for x1, …, xm, we can
compute an m×m matrix K, where $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
 This is called a kernel matrix!
 Now, if a kernel function is indeed a valid kernel, and its
elements are dot products in the transformed feature space, it
must satisfy:
 Symmetry: K = K^T
proof
 Positive semidefinite
proof?
 Mercer's theorem
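A small numerical sketch of both properties on a made-up sample, using the radial basis kernel as an example (the eigenvalue check is a numerical stand-in for the positive-semidefiniteness proof asked for above):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # made-up sample of m = 20 points

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

m = len(X)
K = np.array([[rbf(X[i], X[j]) for j in range(m)] for i in range(m)])

print(np.allclose(K, K.T))                         # symmetry: K = K^T
print(np.linalg.eigvalsh(K).min() >= -1e-10)       # positive semidefinite (numerically)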
© Eric Xing @ CMU, 2006-2010 25
SVM examples
© Eric Xing @ CMU, 2006-2010 26
Examples for Non Linear SVMs –
Gaussian Kernel
© Eric Xing @ CMU, 2006-2010 27
 xi is a bag of words
 Define φ(xi) as a count of every n-gram up to n=k in xi.
 This is a huge space: ~26^k dimensions
 What are we measuring by φ(xi)^T φ(xj)?
 Can we compute the same quantity on input space?
 Efficient linear dynamic program!
 Kernel is a measure of similarity
 Must be positive semi-definite
Example Kernel
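A hedged sketch of computing φ(xi)^T φ(xj) directly on the input strings, without materializing the ~26^k-dimensional count vector; the simple dictionary-based implementation below stands in for the dynamic program mentioned above, and the example strings are made up:

from collections import Counter

def ngram_counts(s, k):
    """Counts of every n-gram of length 1..k in the string s."""
    return Counter(s[i:i + n] for n in range(1, k + 1)
                   for i in range(len(s) - n + 1))

def ngram_kernel(s, t, k=3):
    """phi(s)^T phi(t): sum over shared n-grams of the product of their counts."""
    cs, ct = ngram_counts(s, k), ngram_counts(t, k)
    return sum(cs[g] * ct[g] for g in cs if g in ct)

print(ngram_kernel("support vector", "supported vectors"))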
© Eric Xing @ CMU, 2006-2010 28
Non-linearly Separable Problems
 We allow “error” ξi in classification; it is based on the output of
the discriminant function wTx+b
 Σi ξi approximates the number of misclassified samples
[Figure: Class 1 and Class 2, with some points falling on the wrong side of the boundary]
© Eric Xing @ CMU, 2006-2010 29
Soft Margin Hyperplane
 Now we have a slightly different opt problem:
 ξi are “slack variables” in optimization
 Note that ξi=0 if there is no error for xi
 Σi ξi is an upper bound on the number of errors
 C : tradeoff parameter between error and margin
$$ \min_{w,b}\; \frac{1}{2}\, w^T w + C \sum_{i=1}^{m} \xi_i $$
$$ \text{s.t.}\quad y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \forall i, \qquad \xi_i \ge 0, \;\; \forall i $$
© Eric Xing @ CMU, 2006-2010 30
Hinge Loss
 Remember Ridge regression
 Min [squared loss + λ w^T w]
 How about SVM? Its objective likewise combines a
regularization term with a loss term: the hinge loss
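A minimal sketch of this analogy in code (the data, the parameters, and the λ weighting, which plays a role analogous to 1/C, are made up for illustration):

import numpy as np

def hinge_loss(w, b, X, y):
    # Hinge loss: sum_i max(0, 1 - y_i (w^T x_i + b)); equals sum_i xi_i at the optimum.
    margins = y * (X @ w + b)
    return np.sum(np.maximum(0.0, 1.0 - margins))

def svm_objective(w, b, X, y, lam=0.1):
    # Regularization + loss, analogous to ridge's [squared loss + lambda w^T w].
    return hinge_loss(w, b, X, y) + lam * (w @ w)

# Made-up data and parameters.
X = np.array([[2.0, 2.0], [-1.0, -1.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
print(svm_objective(np.array([1.0, 1.0]), -1.0, X, y))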
© Eric Xing @ CMU, 2006-2010 31
The Optimization Problem
 The dual of this new constrained optimization problem is
 This is very similar to the optimization problem in the linearly
separable case, except that there is an upper bound C on αi
now
 Once again, a QP solver can be used to find αi
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
© Eric Xing @ CMU, 2006-2010 32
The SMO algorithm
 Consider solving the unconstrained opt problem:
 We’ve already seen several opt algorithms!
 ?
 ?
 ?
 Coordinate ascent:
© Eric Xing @ CMU, 2006-2010 33
Coordinate ascent
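Since the original figure is not reproduced here, a generic sketch of coordinate ascent on a simple concave quadratic (not the SVM dual itself; Q and c are made up): each step maximizes the objective along one coordinate while holding the others fixed.

import numpy as np

# Maximize f(a) = c^T a - 1/2 a^T Q a, a simple concave quadratic (made-up Q, c).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, 1.0])

a = np.zeros(2)
for _ in range(50):                      # sweep the coordinates until (near) convergence
    for i in range(len(a)):
        # Exact maximizer along coordinate i with the rest held fixed:
        # df/da_i = c_i - Q_ii a_i - sum_{j != i} Q_ij a_j = 0
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
print(a, "vs closed form", np.linalg.solve(Q, c))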
© Eric Xing @ CMU, 2006-2010 34
Sequential minimal optimization
 Constrained optimization:
 Question: can we do coordinate ascent along one direction at a time
(i.e., hold all α[-i] fixed, and update αi)?
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
© Eric Xing @ CMU, 2006-2010 35
The SMO algorithm
Repeat till convergence
1. Select some pair αi and αj to update next (using a heuristic that tries
to pick the two that will allow us to make the biggest progress
towards the global maximum).
2. Re-optimize J(α) with respect to αi and αj, while holding all the other
αk's (k ≠ i, j) fixed.
Will this procedure converge?
© Eric Xing @ CMU, 2006-2010 36
Convergence of SMO
 Let’s hold α3 ,…, αm fixed and reopt J w.r.t. α1 and α2
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
KKT:
© Eric Xing @ CMU, 2006-2010 37
Convergence of SMO
 The constraints:
 The objective:
 Constrained opt:
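The figures for this slide are not reproduced here; as a hedged sketch, the standard two-variable reduction behind these steps (following the usual SMO derivation rather than the exact notation of the original figures) is:
$$ \alpha_1 y_1 + \alpha_2 y_2 = \zeta \;\;(\text{constant})
\;\;\Longrightarrow\;\;
\alpha_1 = (\zeta - \alpha_2 y_2)\, y_1 , $$
$$ J\big( (\zeta - \alpha_2 y_2) y_1,\ \alpha_2,\ \alpha_3, \dots, \alpha_m \big)
= A\, \alpha_2^2 + B\, \alpha_2 + \text{const}, $$
$$ \alpha_2 \leftarrow \operatorname{clip}\!\big( \alpha_2^{\text{unconstrained}},\, L,\, H \big),
\qquad 0 \le L \le \alpha_2 \le H \le C . $$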
© Eric Xing @ CMU, 2006-2010 38
Cross-validation error of SVM
 The leave-one-out cross-validation error does not depend on
the dimensionality of the feature space but only on the # of
support vectors!
$$ \text{Leave-one-out CV error} \;\le\; \frac{\#\ \text{support vectors}}{\#\ \text{training examples}} $$
© Eric Xing @ CMU, 2006-2010 39
Summary
 Max-margin decision boundary
 Constrained convex optimization
 Duality
 The KKT conditions and the support vectors
 Non-separable case and slack variables
 The SMO algorithm