© Eric Xing @ CMU, 2006-2010 1
Machine Learning
Support Vector Machines
Eric Xing
Lecture 4, August 12, 2010
Reading:
© Eric Xing @ CMU, 2006-2010 2
What is a good Decision
Boundary?
 Why might we have such boundaries?
 Irregular distribution
 Imbalanced training sizes
 Outliers
© Eric Xing @ CMU, 2006-2010 3
Classification and Margin
 Parameterizing the decision boundary
 Let w denote a vector orthogonal to the decision boundary, and b denote a scalar
"offset" term; then we can write the decision boundary as:
$$ w^T x + b = 0 $$
[Figure: Class 1 and Class 2 separated by the boundary, with distances d- and d+ from the boundary to the closest point on each side]
© Eric Xing @ CMU, 2006-2010 4
Classification and Margin
 Parameterizing the decision boundary
 Let w denote a vector orthogonal to the decision boundary, and b denote a scalar
"offset" term; then we can write the decision boundary as:
$$ w^T x + b = 0 $$
 Margin
$$ \frac{w^T x_i + b}{\|w\|} > +\frac{c}{\|w\|} \;\text{ for all } x_i \text{ in class 2}, \qquad
\frac{w^T x_i + b}{\|w\|} < -\frac{c}{\|w\|} \;\text{ for all } x_i \text{ in class 1} $$
Or more compactly:
$$ \frac{y_i (w^T x_i + b)}{\|w\|} > \frac{c}{\|w\|} $$
The margin between two points (one on each side of the boundary):
$$ m = d_- + d_+ $$
[Figure: the two classes, the boundary $w^T x + b = 0$, its unit normal $w/\|w\|$, and the distances d- and d+]
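As a concrete illustration of the quantities above (a minimal sketch, not from the slides; the boundary parameters, points, and labels below are made up), the signed distance of a point to the hyperplane is (w^T x + b)/||w||:

import numpy as np

# Hypothetical boundary parameters (made up for illustration).
w = np.array([2.0, 1.0])
b = -4.0

# Two made-up points, one on each side of the boundary.
x_pos = np.array([3.0, 2.0])   # class 2 side (label +1)
x_neg = np.array([1.0, 0.0])   # class 1 side (label -1)

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

d_plus = signed_distance(x_pos, w, b)    # positive for the class-2 point
d_minus = -signed_distance(x_neg, w, b)  # positive for the class-1 point
print(d_plus, d_minus, "m = d- + d+ =", d_plus + d_minus)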
© Eric Xing @ CMU, 2006-2010 5
Maximum Margin Classification
 The margin is:
$$ m = \frac{w^T}{\|w\|} \left( x_{i^*} - x_{j^*} \right) = \frac{2c}{\|w\|} $$
 Here is our Maximum Margin Classification problem:
$$ \max_{w,b}\; \frac{2c}{\|w\|} \quad \text{s.t.}\quad \frac{y_i (w^T x_i + b)}{\|w\|} \ge \frac{c}{\|w\|}, \;\; \forall i $$
© Eric Xing @ CMU, 2006-2010 6
Maximum Margin Classification,
cont'd.
 The optimization problem:
$$ \max_{w,b}\; \frac{c}{\|w\|} \quad \text{s.t.}\quad \frac{y_i (w^T x_i + b)}{\|w\|} \ge \frac{c}{\|w\|}, \;\; \forall i $$
 But note that the magnitude of c merely scales w and b, and does
not change the classification boundary at all! (why?)
 So we instead work on this cleaner problem:
$$ \max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i $$
 The solution to this leads to the famous Support Vector Machines,
believed by many to be the best "off-the-shelf" supervised learning algorithm
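To spell out the "(why?)" above, here is a brief sketch of the standard rescaling argument (not reproduced verbatim from the slides): the constraint $y_i(w^T x_i + b)/\|w\| \ge c/\|w\|$ is the same as $y_i(w^T x_i + b) \ge c$, and dividing w and b by c leaves the boundary unchanged while turning the threshold into 1:
$$ \tilde{w} = \frac{w}{c}, \quad \tilde{b} = \frac{b}{c}
\;\;\Longrightarrow\;\;
y_i (w^T x_i + b) \ge c \;\Longleftrightarrow\; y_i (\tilde{w}^T x_i + \tilde{b}) \ge 1,
\qquad \frac{c}{\|w\|} = \frac{1}{\|\tilde{w}\|} $$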
© Eric Xing @ CMU, 2006-2010 7
Support vector machine
 A convex quadratic programming problem
with linear constraints:
 The attained margin is now given by
 Only a few of the classification constraints are relevant  support vectors
 Constrained optimization
 We can directly solve this using commercial quadratic programming (QP) code
 But we want to investigate Lagrange duality more carefully, and the
solution of the above problem in its dual form.
 deeper insight: support vectors, kernels …
 more efficient algorithm
$$ \max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i $$
Attained margin: $ \dfrac{1}{\|w\|} $
© Eric Xing @ CMU, 2006-2010 8
Digression to Lagrangian Duality
 The Primal Problem
Primal:
$$ \min_w\; f(w) \quad \text{s.t.}\quad g_i(w) \le 0,\; i = 1,\dots,k, \qquad h_i(w) = 0,\; i = 1,\dots,l $$
The generalized Lagrangian:
$$ \mathcal{L}(w,\alpha,\beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w) $$
the α's (αi ≥ 0) and β's are called the Lagrangian multipliers
Lemma:
$$ \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) =
\begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{otherwise} \end{cases} $$
A re-written Primal:
$$ \min_w\; \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) $$
© Eric Xing @ CMU, 2006-2010 9
Lagrangian Duality, cont.
 Recall the Primal Problem:
$$ \min_w\; \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) $$
 The Dual Problem:
$$ \max_{\alpha,\beta:\,\alpha_i \ge 0}\; \min_w\; \mathcal{L}(w,\alpha,\beta) $$
 Theorem (weak duality):
$$ d^* = \max_{\alpha,\beta:\,\alpha_i \ge 0}\; \min_w\; \mathcal{L}(w,\alpha,\beta)
\;\le\; \min_w\; \max_{\alpha,\beta:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta) = p^* $$
 Theorem (strong duality):
Iff there exists a saddle point of $\mathcal{L}(w,\alpha,\beta)$, we have
$$ d^* = p^* $$
© Eric Xing @ CMU, 2006-2010 10
The KKT conditions
 If there exists some saddle point of L, then the saddle point
satisfies the following "Karush-Kuhn-Tucker" (KKT)
conditions:
 Theorem: If w*, α*, and β* satisfy the KKT conditions, then they also constitute a
solution to the primal and the dual problems.
$$ \frac{\partial}{\partial w_i}\, \mathcal{L}(w,\alpha,\beta) = 0, \quad i = 1,\dots,k $$
$$ \frac{\partial}{\partial \beta_i}\, \mathcal{L}(w,\alpha,\beta) = 0, \quad i = 1,\dots,l $$
$$ \alpha_i\, g_i(w) = 0, \quad i = 1,\dots,m $$
$$ g_i(w) \le 0, \quad i = 1,\dots,m $$
$$ \alpha_i \ge 0, \quad i = 1,\dots,m $$
© Eric Xing @ CMU, 2006-2010 11
Solving optimal margin classifier
 Recall our opt problem:
$$ \max_{w,b}\; \frac{1}{\|w\|} \quad \text{s.t.}\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i $$
 This is equivalent to
$$ \min_{w,b}\; \frac{1}{2}\, w^T w \quad \text{s.t.}\quad 1 - y_i (w^T x_i + b) \le 0, \;\; \forall i \qquad (*) $$
 Write the Lagrangian:
$$ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\, w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] $$
 Recall that (*) can be reformulated as
$$ \min_{w,b}\; \max_{\alpha_i \ge 0}\; \mathcal{L}(w,b,\alpha) $$
Now we solve its dual problem:
$$ \max_{\alpha_i \ge 0}\; \min_{w,b}\; \mathcal{L}(w,b,\alpha) $$
© Eric Xing @ CMU, 2006-2010 12
The Dual Problem
$$ \max_{\alpha_i \ge 0}\; \min_{w,b}\; \mathcal{L}(w,b,\alpha), \qquad
\mathcal{L}(w,b,\alpha) = \frac{1}{2}\, w^T w - \sum_{i=1}^{m} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right] $$
 We minimize L with respect to w and b first:
$$ \nabla_w \mathcal{L}(w,b,\alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0, \qquad (*) $$
$$ \nabla_b \mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m} \alpha_i y_i = 0. \qquad (**) $$
Note that (*) implies:
$$ w = \sum_{i=1}^{m} \alpha_i y_i x_i \qquad (***) $$
 Plug (***) back to L, and using (**), we have:
$$ \mathcal{L}(w,b,\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
© Eric Xing @ CMU, 2006-2010 13
The Dual problem, cont.
 Now we have the following dual opt problem:
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad \alpha_i \ge 0,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
 This is, again, a quadratic programming problem.
 A global maximum of αi can always be found.
 But what's the big deal??
 Note two things:
1. w can be recovered by $w = \sum_{i=1}^{m} \alpha_i y_i x_i$   See next …
2. The "kernel": $x_i^T x_j$   More later …
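As a hedged, minimal sketch of what solving this dual QP looks like in practice (not part of the original slides), the toy example below uses scipy's SLSQP solver on a tiny made-up separable data set and then recovers w and b from the optimal α; a dedicated QP solver would normally be used instead:

import numpy as np
from scipy.optimize import minimize

# Tiny made-up linearly separable data set (for illustration only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j

def neg_dual(alpha):
    # Negative of J(alpha) = sum(alpha) - 1/2 alpha^T G alpha (we minimize).
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * m                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(m), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                         # support vectors: nonzero alpha_i
b = np.mean(y[sv] - X[sv] @ w)            # b from y_i (w^T x_i + b) = 1 on the SVs
print("alpha =", np.round(alpha, 3), "w =", w, "b =", b)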
© Eric Xing @ CMU, 2006-2010 14
I. Support vectors
 Note the KKT condition --- only a few αi's can be nonzero!!
$$ \alpha_i\, g_i(w) = 0, \quad i = 1,\dots,m $$
[Figure: training points of Class 1 and Class 2 with their multipliers; only α1 = 0.8, α6 = 1.4, and α8 = 0.6 are nonzero, and all other αi = 0]
Call the training data points
whose αi's are nonzero the
support vectors (SV)
© Eric Xing @ CMU, 2006-2010 15
Support vector machines
 Once we have the Lagrange multipliers {αi}, we can
reconstruct the parameter vector w as a weighted combination
of the training examples:
 For testing with a new data point z
 Compute
and classify z as class 1 if the sum is positive, and class 2 otherwise
 Note: w need not be formed explicitly
$$ w = \sum_{i \in SV} \alpha_i y_i\, x_i $$
$$ w^T z + b = \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b $$
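A minimal sketch of this test-time computation (the support vectors, multipliers, and offset below are made-up placeholders standing in for the output of training):

import numpy as np

def svm_predict(z, alpha_sv, y_sv, X_sv, b):
    """Classify z using only the support vectors, without forming w explicitly."""
    score = np.sum(alpha_sv * y_sv * (X_sv @ z)) + b
    return 1 if score > 0 else -1   # positive score -> class 1, per the slide's convention

# Illustrative (made-up) support-vector set, multipliers, and offset.
X_sv = np.array([[2.0, 2.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.5, 0.5])
b = -1.0
print(svm_predict(np.array([1.5, 2.5]), alpha_sv, y_sv, X_sv, b))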
© Eric Xing @ CMU, 2006-2010 16
Interpretation of support vector
machines
 The optimal w is a linear combination of a small number of
data points. This "sparse" representation can be viewed as
data compression, as in the construction of a kNN classifier
 To compute the weights {αi}, and to use support vector
machines we need to specify only the inner products (or
kernel) between the examples
 We make decisions by comparing each new example z with
only the support vectors:
Inner products: $x_i^T x_j$
$$ y^* = \operatorname{sign}\!\left( \sum_{i \in SV} \alpha_i y_i\, x_i^T z + b \right) $$
© Eric Xing @ CMU, 2006-2010 17
 Is this data linearly-separable?
 How about a quadratic mapping φ(xi)?
II. The Kernel Trick
© Eric Xing @ CMU, 2006-2010 18
II. The Kernel Trick
 Recall the SVM optimization problem
 The data points only appear as inner product
 As long as we can calculate the inner product in the feature
space, we do not need the mapping explicitly
 Many common geometric operations (angles, distances) can
be expressed by inner products
 Define the kernel function K by
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
$$ K(x_i, x_j) = \phi(x_i)^T \phi(x_j) $$
© Eric Xing @ CMU, 2006-2010 19
 Computation depends on the feature space
 Bad if its dimension is much larger than that of the input space
$$ \max_{\alpha}\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, K(x_i, x_j) $$
$$ \text{s.t.}\quad \alpha_i \ge 0,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0, $$
where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
$$ y^*(z) = \operatorname{sign}\!\left( \sum_{i \in SV} \alpha_i y_i\, K(x_i, z) + b \right) $$
II. The Kernel Trick
© Eric Xing @ CMU, 2006-2010 20
Transforming the Data
 Computation in the feature space can be costly because it is high
dimensional
 The feature space is typically infinite-dimensional!
 The kernel trick comes to the rescue
[Figure: points mapped by φ(·) from the input space to the feature space]
Note: the feature space is of higher dimension
than the input space in practice
© Eric Xing @ CMU, 2006-2010 21
An Example for feature mapping
and kernels
 Consider an input x=[x1,x2]
 Suppose φ(.) is given as follows
 An inner product in the feature space is
 So, if we define the kernel function as follows, there is no
need to carry out φ(.) explicitly
$$ \phi\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right)
= \left[\, 1,\;\; \sqrt{2}\, x_1,\;\; \sqrt{2}\, x_2,\;\; x_1^2,\;\; x_2^2,\;\; \sqrt{2}\, x_1 x_2 \,\right]^T $$
$$ \left\langle \phi\!\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right),\;
\phi\!\left( \begin{bmatrix} x_1' \\ x_2' \end{bmatrix} \right) \right\rangle
= \left( 1 + x^T x' \right)^2 $$
$$ K(x, x') = \left( x^T x' + 1 \right)^2 $$
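A quick numeric check of this identity (a sketch with made-up inputs; the √2 weighting in φ above is what makes the identity exact):

import numpy as np

def phi(x):
    # Feature map for the degree-2 polynomial kernel K(x, x') = (1 + x^T x')^2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)          # inner product computed in feature space
rhs = (1.0 + x @ xp) ** 2       # kernel evaluated directly in input space
print(lhs, rhs)                 # the two numbers agree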
© Eric Xing @ CMU, 2006-2010 22
More examples of kernel
functions
 Linear kernel (we've seen it):
$$ K(x, x') = x^T x' $$
 Polynomial kernel (we just saw an example):
$$ K(x, x') = \left( 1 + x^T x' \right)^p $$
where p = 2, 3, … To get the feature vectors we concatenate all pth-order
polynomial terms of the components of x (weighted appropriately).
 Radial basis kernel:
$$ K(x, x') = \exp\!\left( -\tfrac{1}{2}\, \|x - x'\|^2 \right) $$
In this case the feature space consists of functions and results in a non-parametric classifier.
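These three kernels translate directly into code; a minimal sketch (the degree and bandwidth arguments are illustrative defaults, and sigma = 1 recovers the radial basis form written above):

import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, p=2):
    return (1.0 + x @ xp) ** p

def rbf_kernel(x, xp, sigma=1.0):
    # exp(-||x - x'||^2 / (2 sigma^2)); sigma = 1 matches the form on the slide.
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, xp), polynomial_kernel(x, xp), rbf_kernel(x, xp))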
© Eric Xing @ CMU, 2006-2010 23
The essence of kernel
 Feature mapping, but "without paying a cost"
 E.g., the polynomial kernel
 How many dimensions do we get in the new space?
 How many operations does it take to compute K()?
 Kernel design: any principle?
 K(x,z) can be thought of as a similarity function between x and z
 This intuition is well reflected in the "Gaussian" (radial basis) function above
(similarly, one can easily come up with other K() in the same spirit)
 Does this necessarily lead to a "legal" kernel?
(In the above particular case, K() is a legal one; do you know how many
dimensions φ(x) has?)
© Eric Xing @ CMU, 2006-2010 24
Kernel matrix
 Suppose for now that K is indeed a valid kernel corresponding
to some feature mapping φ; then for x1, …, xm, we can
compute an m×m matrix K, where $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
 This is called a kernel matrix!
 Now, if a kernel function is indeed a valid kernel, and its
elements are dot products in the transformed feature space, it
must satisfy:
 Symmetry: K = K^T
proof
 Positive semidefinite
proof?
 Mercer's theorem
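A small numerical sketch of both properties on a made-up sample, using the radial basis kernel as an example (the eigenvalue check is a numerical stand-in for the positive-semidefiniteness proof asked for above):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # made-up sample of m = 20 points

def rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

m = len(X)
K = np.array([[rbf(X[i], X[j]) for j in range(m)] for i in range(m)])

print(np.allclose(K, K.T))                         # symmetry: K = K^T
print(np.linalg.eigvalsh(K).min() >= -1e-10)       # positive semidefinite (numerically)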
© Eric Xing @ CMU, 2006-2010 25
SVM examples
© Eric Xing @ CMU, 2006-2010 26
Examples for Non Linear SVMs –
Gaussian Kernel
© Eric Xing @ CMU, 2006-2010 27
 xi is a bag of words
 Define φ(xi) as a count of every n-gram up to n=k in xi.
 This is a huge space: ~26^k dimensions
 What are we measuring by φ(xi)^T φ(xj)?
 Can we compute the same quantity on input space?
 Efficient linear dynamic program!
 Kernel is a measure of similarity
 Must be positive semi-definite
Example Kernel
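A hedged sketch of computing φ(xi)^T φ(xj) directly on the input strings, without materializing the ~26^k-dimensional count vector; the simple dictionary-based implementation below stands in for the dynamic program mentioned above, and the example strings are made up:

from collections import Counter

def ngram_counts(s, k):
    """Counts of every n-gram of length 1..k in the string s."""
    return Counter(s[i:i + n] for n in range(1, k + 1)
                   for i in range(len(s) - n + 1))

def ngram_kernel(s, t, k=3):
    """phi(s)^T phi(t): sum over shared n-grams of the product of their counts."""
    cs, ct = ngram_counts(s, k), ngram_counts(t, k)
    return sum(cs[g] * ct[g] for g in cs if g in ct)

print(ngram_kernel("support vector", "supported vectors"))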
© Eric Xing @ CMU, 2006-2010 28
Non-linearly Separable Problems
 We allow “error” ξi in classification; it is based on the output of
the discriminant function wTx+b
 Σi ξi approximates the number of misclassified samples
[Figure: Class 1 and Class 2, with some points falling on the wrong side of the boundary]
© Eric Xing @ CMU, 2006-2010 29
Soft Margin Hyperplane
 Now we have a slightly different opt problem:
 ξi are “slack variables” in optimization
 Note that ξi=0 if there is no error for xi
 Σi ξi is an upper bound on the number of errors
 C : tradeoff parameter between error and margin
$$ \min_{w,b}\; \frac{1}{2}\, w^T w + C \sum_{i=1}^{m} \xi_i $$
$$ \text{s.t.}\quad y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \forall i, \qquad \xi_i \ge 0, \;\; \forall i $$
© Eric Xing @ CMU, 2006-2010 30
Hinge Loss
 Remember Ridge regression
 Min [squared loss + λ w^T w]
 How about SVM? Its objective likewise combines a
regularization term with a loss term: the hinge loss
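A minimal sketch of this analogy in code (the data, the parameters, and the λ weighting, which plays a role analogous to 1/C, are made up for illustration):

import numpy as np

def hinge_loss(w, b, X, y):
    # Hinge loss: sum_i max(0, 1 - y_i (w^T x_i + b)); equals sum_i xi_i at the optimum.
    margins = y * (X @ w + b)
    return np.sum(np.maximum(0.0, 1.0 - margins))

def svm_objective(w, b, X, y, lam=0.1):
    # Regularization + loss, analogous to ridge's [squared loss + lambda w^T w].
    return hinge_loss(w, b, X, y) + lam * (w @ w)

# Made-up data and parameters.
X = np.array([[2.0, 2.0], [-1.0, -1.0], [0.5, 0.0]])
y = np.array([1.0, -1.0, 1.0])
print(svm_objective(np.array([1.0, 1.0]), -1.0, X, y))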
© Eric Xing @ CMU, 2006-2010 31
The Optimization Problem
 The dual of this new constrained optimization problem is
 This is very similar to the optimization problem in the linearly
separable case, except that there is an upper bound C on αi
now
 Once again, a QP solver can be used to find αi
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
© Eric Xing @ CMU, 2006-2010 32
The SMO algorithm
 Consider solving the unconstrained opt problem:
 We’ve already seen several opt algorithms!
 ?
 ?
 ?
 Coordinate ascent:
© Eric Xing @ CMU, 2006-2010 33
Coordinate ascent
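Since the original figure is not reproduced here, a generic sketch of coordinate ascent on a simple concave quadratic (not the SVM dual itself; Q and c are made up): each step maximizes the objective along one coordinate while holding the others fixed.

import numpy as np

# Maximize f(a) = c^T a - 1/2 a^T Q a, a simple concave quadratic (made-up Q, c).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([1.0, 1.0])

a = np.zeros(2)
for _ in range(50):                      # sweep the coordinates until (near) convergence
    for i in range(len(a)):
        # Exact maximizer along coordinate i with the rest held fixed:
        # df/da_i = c_i - Q_ii a_i - sum_{j != i} Q_ij a_j = 0
        a[i] = (c[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]
print(a, "vs closed form", np.linalg.solve(Q, c))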
© Eric Xing @ CMU, 2006-2010 34
Sequential minimal optimization
 Constrained optimization:
 Question: can we do coordinate ascent along one direction at a time
(i.e., hold all α[-i] fixed, and update αi)?
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
© Eric Xing @ CMU, 2006-2010 35
The SMO algorithm
Repeat till convergence
1. Select some pair αi and αj to update next (using a heuristic that tries
to pick the two that will allow us to make the biggest progress
towards the global maximum).
2. Re-optimize J(α) with respect to αi and αj, while holding all the other
αk's (k ≠ i, j) fixed.
Will this procedure converge?
© Eric Xing @ CMU, 2006-2010 36
Convergence of SMO
 Let’s hold α3 ,…, αm fixed and reopt J w.r.t. α1 and α2
$$ \max_{\alpha}\; J(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y_i y_j \alpha_i \alpha_j\, x_i^T x_j $$
$$ \text{s.t.}\quad 0 \le \alpha_i \le C,\; i = 1,\dots,m, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0 $$
KKT:
© Eric Xing @ CMU, 2006-2010 37
Convergence of SMO
 The constraints:
 The objective:
 Constrained opt:
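The figures for this slide are not reproduced here; as a hedged sketch, the standard two-variable reduction behind these steps (following the usual SMO derivation rather than the exact notation of the original figures) is:
$$ \alpha_1 y_1 + \alpha_2 y_2 = \zeta \;\;(\text{constant})
\;\;\Longrightarrow\;\;
\alpha_1 = (\zeta - \alpha_2 y_2)\, y_1 , $$
$$ J\big( (\zeta - \alpha_2 y_2) y_1,\ \alpha_2,\ \alpha_3, \dots, \alpha_m \big)
= A\, \alpha_2^2 + B\, \alpha_2 + \text{const}, $$
$$ \alpha_2 \leftarrow \operatorname{clip}\!\big( \alpha_2^{\text{unconstrained}},\, L,\, H \big),
\qquad 0 \le L \le \alpha_2 \le H \le C . $$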
© Eric Xing @ CMU, 2006-2010 38
Cross-validation error of SVM
 The leave-one-out cross-validation error does not depend on
the dimensionality of the feature space but only on the # of
support vectors!
$$ \text{Leave-one-out CV error} \;\le\; \frac{\#\ \text{support vectors}}{\#\ \text{training examples}} $$
© Eric Xing @ CMU, 2006-2010 39
Summary
 Max-margin decision boundary
 Constrained convex optimization
 Duality
 The KKT conditions and the support vectors
 Non-separable case and slack variables
 The SMO algorithm