Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Upcoming SlideShare
Loading in …5
×

# Support Vector Machines

1,919 views

Published on

Published in: Technology, Business
• Full Name
Comment goes here.

Are you sure you want to Yes No
Your message goes here
• Be the first to comment

### Support Vector Machines

1. 1. Support Vector Machines Andrew W. Moore Note to other teachers and users of these slides. Andrew would be delighted Professor if you found this source material useful in giving your own lectures. Feel free to use School of Computer Science these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use Carnegie Mellon University of a significant portion of these slides in your own lecture, please include this www.cs.cmu.edu/~awm message, or the following link to the source repository of Andrew’s tutorials: awm@cs.cmu.edu http://www.cs.cmu.edu/~awm/tutorials . 412-268-7599 Comments and corrections gratefully received. Copyright © 2001, 2003, Andrew W. Moore Nov 23rd, 2001 α Linear Classifiers x f yest f(x,w,b) = sign(w. x - b) denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 2 1
2. 2. α Linear Classifiers x f yest f(x,w,b) = sign(w. x - b) denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 3 α Linear Classifiers x f yest f(x,w,b) = sign(w. x - b) denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 4 2
3. 3. α Linear Classifiers x f yest f(x,w,b) = sign(w. x - b) denotes +1 denotes -1 How would you classify this data? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 5 α Linear Classifiers x f yest f(x,w,b) = sign(w. x - b) denotes +1 denotes -1 Any of these would be fine.. ..but which is best? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 6 3
4. 4. α Classifier Margin x f yest f(x,w,b) = sign(w. x - b) denotes +1 Define the margin denotes -1 of a linear classifier as the width that the boundary could be increased by before hitting a datapoint. Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 7 α Maximum Margin x f yest f(x,w,b) = sign(w. x - b) denotes +1 The maximum denotes -1 margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Linear SVM Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 8 4
5. 5. α Maximum Margin x f yest f(x,w,b) = sign(w. x - b) denotes +1 The maximum denotes -1 margin linear classifier is the linear classifier Support Vectors with the, um, are those maximum margin. datapoints that the margin This is the pushes up simplest kind of against SVM (Called an LSVM) Linear SVM Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 9 Why Maximum Margin? 1. Intuitively this feels safest. f(x,w,b) = sign(w. x - b) 2. If we’ve made a small error in the denotes +1 location of the boundary (it’s been The maximum jolted in its perpendicular direction) denotes -1 this gives us least chance linear margin of causing a misclassification. classifier is the 3. LOOCV is easy since the classifier linear model is Support Vectors immune to removal of the,non- with any um, are those support-vector datapoints. margin. maximum datapoints that 4. There’s some theory (using VC the margin dimension) that is relatedthe(but not This is to pushes up the same as) thesimplest kind of proposition that this against is a good thing. SVM (Called an LSVM) 5. Empirically it works very very well. Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 10 5
6. 6. Specifying a line and margin ” +1 Plus-Plane s= las Classifier Boundary t C one cz edi Minus-Plane “Pr - 1” s= las t C one cz edi “Pr • How do we represent this mathematically? • …in m input dimensions? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 11 Specifying a line and margin 1” Plus-Plane =+ ss Cla Classifier Boundary ict zone ed Minus-Plane “Pr - 1” s= las =1 t C one +b cz edi wx =0 “Pr +b wx b=-1 + wx • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } w . x + b >= 1 Classify as.. +1 if w . x + b <= -1 -1 if -1 < w . x + b < 1 Universe if explodes Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 12 6
7. 7. Computing the margin width ” +1 M = Margin Width s= las t C one cz edi “Pr How do we compute - 1” s= las M in terms of w =1 t C one +b cz edi wx =0 and b? “Pr +b wx b=-1 + wx • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } Claim: The vector w is perpendicular to the Plus Plane. Why? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 13 Computing the margin width 1” =+ M = Margin Width ss Cla ict zone ed “Pr How do we compute - 1” s= las M in terms of w =1 t C one +b cz edi wx =0 and b? “Pr +b wx b=-1 + wx • Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } Claim: The vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w . ( u – v ) ? And so of course the vector w is also perpendicular to the Minus Plane Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 14 7
8. 8. Computing the margin width ” +1 x+ M = Margin Width s= las t C one cz edi “Pr How do we compute ” - -1 x = ss M in terms of w Cla e =1 t on +b cz edi wx =0 and b? “Pr +b wx b=-1 + wx Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • • The vector w is perpendicular to the Plus Plane Any location in Let x- be any point on the minus plane • m not Rm:: not necessarily a Let x+ be the closest plus-plane-point to x-. • datapoint Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 15 Computing the margin width 1” = + x+ M = Margin Width s las e tC dic zon Pre “ 1” How do we compute x- - = ss M in terms of w la =1 ct C zone +b edi wx =0 and b? “Pr +b -1 wx = +b wx Plus-plane = { x : w . x + b = +1 } • Minus-plane = { x : w . x + b = -1 } • • The vector w is perpendicular to the Plus Plane Let x- be any point on the minus plane • Let x+ be the closest plus-plane-point to x-. • Claim: x+ = x- + λ w for some value of λ. Why? • Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 16 8
9. 9. Computing the margin width ” +1 x+ M = Margin Width s= las t C one The line from x- to x+ is cz edi “Pr 1” How do we compute perpendicular to the =- - sx planes.terms of w las M in =1 tC e +b dic zon wx =0 re So to getb? and from x- to x+ “P +b wx b=-1 + travel some distance in wx direction w. { x : w . x + b = +1 } • Plus-plane = Minus-plane = { x : w . x + b = -1 } • • The vector w is perpendicular to the Plus Plane Let x- be any point on the minus plane • Let x+ be the closest plus-plane-point to x-. • Claim: x+ = x- + λ w for some value of λ. Why? • Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 17 Computing the margin width 1” = + x+ M = Margin Width s las e tC dic zon Pre “ 1” x- - = ss la =1 ct C zone +b edi wx =0 “Pr +b -1 wx = +b wx What we know: • w . x+ + b = +1 • w . x- + b = -1 • x+ = x- + λ w • |x+ - x- | = M It’s now easy to get M in terms of w and b Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 18 9
10. 10. Computing the margin width ” +1 x+ M = Margin Width s= las t C one cz edi “Pr ” - -1 x = ss Cla e =1 t on +b cz edi wx w . (x - + λ w) + b = 1 =0 “Pr +b wx b=-1 + wx => What we know: w . x - + b + λ w .w = 1 • w . x+ + b = +1 • w . x- + b = -1 => • x+ = x- + λ w -1 + λ w .w = 1 • |x+ - x- | = M => It’s now easy to get M 2 λ= in terms of w and b w.w Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 19 Computing the margin width 1” 2 = + x+ M = Margin Width = ss w.w Cla ict zone ed “Pr 1” x- - = ss la =1 ct C zone +b edi wx M = |x+ - x- | =| λ w |= =0 “Pr +b -1 wx = +b wx What we know: = λ | w | = λ w.w • w . x+ + b = +1 • w . x- + b = -1 2 w.w 2 = = • x+ = x- + λ w w.w w.w • |x+ - x- | = M • 2 λ= w.w Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 20 10
11. 11. Learning the Maximum Margin Classifier ” +1 x+ 2 M = Margin Width = s= las w.w t C one cz edi “Pr ” - -1 x = ss Cla e =1 t on +b cz edi wx =0 “Pr +b wx b=-1 + wx Given a guess of w and b we can • Compute whether all data points in the correct half-planes • Compute the width of the margin So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How? Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton’s Method? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 21 Learning via Quadratic Programming • QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints. Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 22 11
12. 12. Quadratic Programming u T Ru arg max c + d u + Quadratic criterion T Find 2 u a11u1 + a12u2 + ... + a1mum ≤ b1 Subject to a21u1 + a22u2 + ... + a2 mu m ≤ b2 n additional linear inequality : constraints an1u1 + an 2u 2 + ... + anmum ≤ bn And subject to e additional linear a( n +1)1u1 + a( n +1) 2u2 + ... + a( n +1) mum = b( n +1) constraints equality a( n + 2 )1u1 + a( n + 2 ) 2u2 + ... + a( n + 2 ) mum = b( n + 2 ) : a( n + e )1u1 + a( n + e ) 2u2 + ... + a( n + e ) mu m = b( n + e ) Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 23 Quadratic Programming u T Ru arg max c + dT u + Quadratic criterion Find 2 u a11u1 + a12u2 + ... + a1mum ≤ b1g Subject to in for find orithms uadratb a21u1 + a22ist2a+ ... + ed2 m u m ≤ ic2 x u lg a n additional linear There e constrain q iciently inequality ff h s uc : more e a much an gradient constraints optim li bly th d re+a... + ent. u ≤ b an1u1 + ann u 2 asc anm m a2 n yo u fiddly… And subject to ery e additional linear a( n +1)1u1 (But( nh1)y u2e+vn’t + an(tn +o) muite = b( n +1) + a t + e 2 ar o ... w a t1 wr m constraints equality ly d p obab elf) a( n + 2 )1u1 + a(rn + 2 ) 2uon+ ... urs ( n + 2 ) mum = b( n + 2 ) e yo + a 2 : a( n + e )1u1 + a( n + e ) 2u2 + ... + a( n + e ) mu m = b( n + e ) Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 24 12
13. 13. Learning the Maximum Margin Classifier ” M = Given guess of w , b we can +1 = 2 ss Cla ne w.w •Compute whether all data ct zo edi points are in the correct “Pr - 1” half-planes = ss Cla e =1 ct zon • Compute the margin width +b edi wx =0 “Pr +b wx b=-1 Assume R datapoints, each + wx (xk,yk) where yk = +/- 1 What should our quadratic How many constraints will we optimization criterion be? have? What should they be? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 25 Learning the Maximum Margin Classifier ” M = Given guess of w , b we can +1 s= 2 las w.w •Compute whether all data t C one cz edi points are in the correct “Pr - 1” half-planes = ss Cla e =1 t • Compute the margin width +b dic zon wx =0 e “Pr +b wx b=-1 Assume R datapoints, each + wx (xk,yk) where yk = +/- 1 What should our quadratic How many constraints will we have? R optimization criterion be? Minimize w.w What should they be? w . xk + b >= 1 if yk = 1 w . xk + b <= -1 if yk = -1 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 26 13
14. 14. Uh-oh! This is going to be a problem! What should we do? denotes +1 denotes -1 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 27 Uh-oh! This is going to be a problem! What should we do? Idea 1: denotes +1 Find minimum w.w, while denotes -1 minimizing number of training set errors. Problemette: Two things to minimize makes for an ill-defined optimization Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 28 14
15. 15. Uh-oh! This is going to be a problem! What should we do? Idea 1.1: denotes +1 denotes -1 Minimize w.w + C (#train errors) Tradeoff parameter There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 29 Uh-oh! This is going to be a problem! What should we do? Idea 1.1: denotes +1 denotes -1 Minimize w.w + C (#train errors) Tradeoff parameter Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow. There’s a serious practical (Also, doesn’t distinguish between problem that’s about yto make n So… a disastrous errors and near misses) other us reject this approach. Can ideas? you guess what it is? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 30 15
16. 16. Uh-oh! This is going to be a problem! What should we do? Idea 2.0: denotes +1 denotes -1 Minimize w.w + C (distance of error points to their correct place) Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 31 Learning Maximum Margin with Noise M = Given guess of w , b we can 2 w.w •Compute sum of distances of points to their correct zones =1 • Compute the margin width +b wx =0 +b wx b=-1 Assume R datapoints, each + wx (xk,yk) where yk = +/- 1 What should our quadratic How many constraints will we optimization criterion be? have? What should they be? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 32 16
17. 17. Learning Maximum Margin with Noise M = Given guess of w , b we can ε11 2 w.w •Compute sum of distances ε2 of points to their correct zones =1 • Compute the margin width +b ε7 wx =0 +b wx b=-1 Assume R datapoints, each + wx (xk,yk) where yk = +/- 1 What should our quadratic How many constraints will we have? R optimization criterion be? Minimize What should they be? R 1 w.w + C ∑ εk w . xk + b >= 1-εk if yk = 1 2 k =1 w . xk + b <= -1+εk if yk = -1 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 33 Learning Maximum Marginmwith Noise = # input M = Given guessdimensions we can of w , b ε11 2 w.w • Compute sum of distances ε2 of points to their correct Our original (noiseless data) QP had m+1 zones variables: w1, w2, … wm, and b. =1 • Compute the margin width +b ε7 wx =0 Assume has m+1+R Our new (noisy data) QP R datapoints, each +b wx b=-1 + variables: w1, w2(x ,y ) , b, εk ,yε1 = +/- 1 ,… εR , … wm where wx kk k What should our quadratic How many constraints will we R= # records have? R optimization criterion be? Minimize What should they be? R 1 w.w + C ∑ εk w . xk + b >= 1-εk if yk = 1 2 k =1 w . xk + b <= -1+εk if yk = -1 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 34 17
18. 18. Learning Maximum Margin with Noise M = Given guess of w , b we can ε11 2 w.w •Compute sum of distances ε2 of points to their correct zones =1 • Compute the margin width +b ε7 wx =0 +b wx b=-1 Assume R datapoints, each + wx (xk,yk) where yk = +/- 1 What should our quadratic How many constraints will we have? R optimization criterion be? Minimize What should they be? R 1 w.w + C ∑ εk w . xk + b >= 1-εk if yk = 1 2 k =1 w . xk + b <= -1+εk if yk = -1 There’s a bug in this QP. Can you spot it? Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 35 Learning Maximum Margin with Noise M = Given guess of w , b we can ε11 2 w.w •Compute sum of distances ε2 of points to their correct zones =1 • Compute the margin width +b ε7 wx =0 +b wx b=-1 Assume R datapoints, each + wx (xk,yk) where yk = +/- 1 What should our quadratic How many constraints will we have? 2R optimization criterion be? Minimize What should they be? R 1 w.w + C ∑ εk w . xk + b >= 1-εk if yk = 1 2 k =1 w . xk + b <= -1+εk if yk = -1 εk >= 0 for all k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 36 18
19. 19. An Equivalent QP Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct. R 1R R Maximize ∑ αk − ∑∑ αk αl Qkl where Qkl = yk yl (x k .x l ) 2 k =1 l =1 k =1 R ∑ Subject to these 0 ≤ αk ≤ C ∀k αk yk = 0 constraints: k =1 Then define: R ∑ Then classify with: w= αk ykx k f(x,w,b) = sign(w. x - b) k =1 b = y K (1 − ε K ) − x K . w K where K = arg max α k k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 37 An Equivalent QP Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct. R 1R R Maximize ∑ αk − ∑∑ αk αl Qkl where Qkl = yk yl (x k .x l ) 2 k =1 l =1 k =1 R ∑ Subject to these 0 ≤ αk ≤ C ∀k αk yk = 0 constraints: k =1 Then define: Datapoints with αk > 0 will be the support R ∑ vectors classify with: Then w= αk ykx k f(x,w,b) = sign(w. x - b) k =1 ..so this sum only needs b = y K (1 − ε K ) − x K . w to be over the K support vectors. where K = arg max α k k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 38 19
20. 20. An Equivalent QP R 1R R Maximize ∑ αk − ∑∑ αk αl Qkl where Qkl = yk yl (x k .x l ) 2 k =1 l =1 k =1 Why did I tell you about this R ∑ Subject to these equivalent≤ C ∀k 0 ≤ αk QP? yk = 0 αk constraints: • It’s a formulation that QP k = 1 packages can optimize more Then define: Datapoints with αk > 0 quickly will be the support R ∑ vectors classify with: y k xBecause of further jaw- • Then w= αk k dropping developments you’re f(x,w,b) = sign(w. x - b) k =1 ..so this sum only needs about to learn. b = y K (1 − ε K ) − x K . w K to be over the support vectors. where K = arg max α k k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 39 Suppose we’re in 1-dimension What would SVMs do with this data? x=0 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 40 20
21. 21. Suppose we’re in 1-dimension Not a big surprise x=0 Positive “plane” Negative “plane” Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 41 Harder 1-dimensional dataset That’s wiped the smirk off SVM’s face. What can be done about this? x=0 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 42 21
22. 22. Harder 1-dimensional dataset Remember how permitting non- linear basis functions made linear regression so much nicer? Let’s permit them here too z k = ( xk , xk ) 2 x=0 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 43 Harder 1-dimensional dataset Remember how permitting non- linear basis functions made linear regression so much nicer? Let’s permit them here too z k = ( xk , xk ) 2 x=0 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 44 22
23. 23. Common SVM basis functions zk = ( polynomial terms of xk of degree 1 to q ) zk = ( radial basis functions of xk ) ⎛ | xk − c j | ⎞ z k [ j ] = φ j (x k ) = KernelFn⎜ ⎜ KW ⎟ ⎟ ⎝ ⎠ zk = ( sigmoid functions of xk ) This is sensible. Is that the end of the story? No…there’s one more trick! Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 45 Quadratic ⎛ ⎞ Constant Term 1 ⎜ ⎟ 2 x1 ⎟ ⎜ Basis Functions ⎜ 2 x2 ⎟ ⎜ ⎟ Linear Terms ⎜ ⎟ : ⎜ ⎟ Number of terms (assuming m input 2 xm ⎟ ⎜ dimensions) = (m+2)-choose-2 ⎜ ⎟ x12 ⎜ ⎟ Pure = (m+2)(m+1)/2 x22 ⎜ ⎟ Quadratic ⎜ ⎟ : Terms = (as near as makes no difference) m2/2 ⎜ ⎟ xm2 ⎜ ⎟ Φ(x) = ⎜ 2 x1 x2 ⎟ ⎜ ⎟ You may be wondering what those ⎜ 2 x1 x3 ⎟ ⎜ ⎟ 2 ’s are doing. : ⎜ ⎟ Quadratic ⎜ 2 x1 xm ⎟ •You should be happy that they do no Cross-Terms ⎜ ⎟ ⎜ 2 x2 x3 ⎟ harm ⎜ ⎟ : •You’ll find out why they’re there ⎜ ⎟ ⎜ 2 x1 xm ⎟ soon. ⎜ ⎟ : ⎜ ⎜ 2x x ⎟ ⎟ ⎝ m −1 m ⎠ Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 46 23
24. 24. QP with basis functions Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct. R 1R R ∑ αk − 2 ∑∑ αk αl Qkl where Qkl = yk yl (Φ(x k ).Φ(xl )) Maximize k =1 k =1 l =1 R ∑ Subject to these 0 ≤ αk ≤ C ∀k αk yk = 0 constraints: k =1 Then define: ∑ w= Then classify with: α k ykΦ (x k ) k s.t. α k > 0 f(x,w,b) = sign(w. φ(x) - b) b = y K (1 − ε K ) − x K . w K where K = arg max α k k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 47 QP with basis functions R 1R R ∑ αk − 2 ∑∑ αk αl Qkl where Qkl = yk yl (Φ(x k ).Φ(xl )) Maximize k =1 k =1 l =1 We must do R2/2 dot products to get this matrix ready. R 0 ≤ αk ≤ C ∀k product ∑ α k m2k = 0 Subject to these y /2 Each dot requires constraints: k =1 additions and multiplications The whole thing costs R2 m2 /4. Then define: Yeeks! ∑ …or does it? w= Then classify with: α k ykΦ (x k ) k s.t. α k > 0 f(x,w,b) = sign(w. φ(x) - b) b = y K (1 − ε K ) − x K . w K where K = arg max α k k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 48 24
25. 25. ⎛ ⎞⎛ ⎞ 1 1 Quadratic Dot ⎜ ⎟⎜ ⎟ 1 2a1 ⎟ ⎜ 2b1 ⎟ ⎜ + Products ⎜ 2 a2 ⎟ ⎜ 2b2 ⎟ ⎜ ⎟⎜ ⎟ m ∑ 2a b ⎜ ⎟⎜ ⎟ : : ii ⎜ ⎟⎜ ⎟ i =1 2 am ⎟ ⎜ 2bm ⎟ ⎜ + ⎜ ⎟ ⎜ b12 ⎟ a12 ⎜ ⎟⎜ ⎟ a2 ⎟ ⎜ b2 2 2 ⎜ ⎟ m ∑a b 22 ⎜ ⎟⎜ ⎟ ii : : ⎜ ⎟⎜ ⎟ i =1 ⎜ am ⎟ ⎜ bm 2 2 ⎟ Φ(a) • Φ(b) = ⎜ ⎟ • ⎜ 2b b ⎟ 2a1a2 ⎜ ⎟⎜ ⎟ 12 ⎜ 2a1a3 ⎟ ⎜ 2b1b3 ⎟ + ⎜ ⎟⎜ ⎟ : : ⎜ ⎟⎜ ⎟ ⎜ 2a1am ⎟ ⎜ 2b1bm ⎟ ⎜ ⎟⎜ ⎟ m m ∑ ∑ 2a a b b ⎜ 2a2 a3 ⎟ ⎜ 2b2b3 ⎟ i ji j ⎜ ⎟⎜ ⎟ i =1 j = i +1 : : ⎜ ⎟⎜ ⎟ ⎜ 2a1am ⎟ ⎜ 2b1bm ⎟ ⎜ ⎟⎜ ⎟ : : ⎜ ⎜ 2a a ⎟ ⎜ 2b b ⎟ ⎟⎜ ⎟ ⎝ m −1 m ⎠ ⎝ m −1 m ⎠ Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 49 Quadratic Dot Just out of casual, innocent, interest, let’s look at another function of a and Products b: (a.b + 1) 2 = (a.b) 2 + 2a.b + 1 2 ⎛m ⎞ m = ⎜ ∑ ai bi ⎟ + 2∑ ai bi + 1 ⎝ i =1 ⎠ i =1 Φ(a) • Φ(b) = m m m = ∑∑ ai bi a j b j + 2∑ ai bi + 1 m m m m 1 + 2∑ ai bi + ∑ ai2bi2 + ∑ ∑ 2a a b b i =1 j =1 i =1 i ji j i =1 i =1 i =1 j = i +1 m m m m = ∑ ( ai bi ) 2 + 2∑ ∑aba b + 2∑ ai bi + 1 ii j j i =1 i =1 j = i +1 i =1 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 50 25
26. 26. Quadratic Dot Just out of casual, innocent, interest, let’s look at another function of a and Products b: (a.b + 1) 2 = (a.b) 2 + 2a.b + 1 2 ⎛m ⎞ m = ⎜ ∑ ai bi ⎟ + 2∑ ai bi + 1 ⎝ i =1 ⎠ i =1 Φ(a) • Φ(b) = m m m = ∑∑ ai bi a j b j + 2∑ ai bi + 1 m m m m 1 + 2∑ ai bi + ∑ ai2bi2 + ∑ ∑ 2a a b b i =1 j =1 i =1 i ji j i =1 i =1 i =1 j = i +1 m m m m = ∑ ( ai bi ) 2 + 2∑ ∑aba b + 2∑ ai bi + 1 ii j j i =1 i =1 j = i +1 i =1 They’re the same! And this is only O(m) to compute! Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 51 QP with Quadratic basis functions Warning: up until Rong Zhang spotted my error in Oct 2003, this equation had been wrong in earlier versions of the notes. This version is correct. R 1R R ∑ αk − 2 ∑∑ αk αl Qkl where Qkl = yk yl (Φ(x k ).Φ(xl )) Maximize k =1 k =1 l =1 We must do R2/2 dot products to get this matrix ready. R 0 ≤ αk ≤ C ∀k product ∑ only y k = 0 Subject to these α Each dot now k requires constraints: k =1 m additions and multiplications Then define: ∑ w= Then classify with: α k ykΦ (x k ) k s.t. α k > 0 f(x,w,b) = sign(w. φ(x) - b) b = y K (1 − ε K ) − x K . w K where K = arg max α k k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 52 26
27. 27. Higher Order Polynomials φ(x) φ(a).φ(b) Cost to Poly- Cost to Cost if 100 Cost if build Qkl build Qkl 100 nomial inputs matrix matrix inputs tradition sneakily ally Quadratic All m2/2 m2 R2 /4 2,500 R2 (a.b+1)2 m R2 / 2 50 R2 terms up to degree 2 All m3/6 m3 R2 /12 83,000 R2 (a.b+1)3 m R2 / 2 50 R2 Cubic terms up to degree 3 All m4/24 m4 R2 /48 1,960,000 R2 (a.b+1)4 m R2 / 2 50 R2 Quartic terms up to degree 4 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 53 QP with Quintic basis functions We must do R2/2 dot products to get this R RR ∑ α + ∑∑ α α Q Qkl = yk yl (Φ(x k ).Φ(x l )) matrix ready. Maximize where k k l kl In 100-d, each 1 product now needs 103 k = dot k =1 l =1 operations instead of 75 million R But there are still worrying things lurking away. ∑ Subject to these 0 ≤ αk ≤ C ∀k αk yk = 0 What are they? constraints: k =1 Then define: ∑ w= α k ykΦ (x k ) k s.t. α k > 0 b = y K (1 − ε K ) − x K . w K Then classify with: where K = arg max α k f(x,w,b) = sign(w. φ(x) - b) k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 54 27
28. 28. QP with Quintic basis functions We must do R2/2 dot products to get this R RR ∑ α + ∑∑ α α Q Qkl = yk yl (Φ(x k ).Φ(x l )) matrix ready. Maximize where k k l kl In 100-d, each 1 product now needs 103 k = dot k =1 l =1 operations instead of 75 million R But there are still worrying things lurking away. ∑ Subject to these 0 ≤ α ≤ C ∀k =0 αy What are they? k k k constraints: •The fear of overfittingkwith this enormous =1 number of terms Then define: •The evaluation phase (doing a set of predictions on a test set) will be very ∑ expensive (why?) w= α k ykΦ (x k ) k s.t. α k > 0 b = y K (1 − ε K ) − x K . w K Then classify with: where K = arg max α k f(x,w,b) = sign(w. φ(x) - b) k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 55 QP with Quintic basis functions We must do R2/2 dot products to get this R RR ∑ α + ∑∑ α α Q Qkl = yk yl (Φ(x k ).Φ(x l )) matrix ready. Maximize where k k l kl The use of Maximum Margin In 100-d, each 1 product now needs 103 k = dot k =1 l =1 magically makes this not a operations instead of 75 million problem R But there are still worrying things lurking away. ∑ Subject to these 0 ≤ α ≤ C ∀k =0 αy What are they? k k k constraints: •The fear of overfittingkwith this enormous =1 number of terms Then define: •The evaluation phase (doing a set of predictions on a test set) will be very ∑ expensive (why?) w= α k ykΦ (x k ) Because each w. φ(x) (see below) k s.t. α k > 0 needs 75 million operations. What b = y K (1 − ε K ) − x K . w K can be done? Then classify with: where K = arg max α k f(x,w,b) = sign(w. φ(x) - b) k Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 56 28
29. 29. QP with Quintic basis functions We must do R2/2 dot products to get this R RR ∑ α + ∑∑ α α Q Qkl = yk yl (Φ(x k ).Φ(x l )) matrix ready. Maximize where k k l kl The use of Maximum Margin In 100-d, each 1 product now needs 103 k = dot k =1 l =1 magically makes this not a operations instead of 75 million problem R But there are still worrying things lurking away. ∑ Subject to these 0 ≤ α ≤ C ∀k =0 αy What are they? k k k constraints: •The fear of overfittingkwith this enormous =1 number of terms Then define: •The evaluation phase (doing a set of predictions on a test set) will be very ∑ expensive (why?) w= α k ykΦ (x k ) Because each w. φ(x) (see below) k s.t. α k > 0 needs 75 million operations. What ∑ α − x ⋅Φ x w ⋅ = ( y ) =(1 − ε y)k Φ (x kK) . w (K ) bΦx can be done? k K K k s.t. α k > 0 Then classify with: K∑=>α0arg( xmax + 1 ) k = k ⋅x k yk 5 α where k s.t. α f(x,w,b) = sign(w. φ(x) - b) k k Only Sm operations (S=#support vectors) Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 57 QP with Quintic basis functions We must do R2/2 dot products to get this R RR ∑ α + ∑∑ α α Q Qkl = yk yl (Φ(x k ).Φ(x l )) matrix ready. Maximize where k k l kl The use of Maximum Margin In 100-d, each 1 product now needs 103 k = dot k =1 l =1 magically makes this not a operations instead of 75 million problem R But there are still worrying things lurking away. ∑ Subject to these 0 ≤ α ≤ C ∀k =0 αy What are they? k k k constraints: •The fear of overfittingkwith this enormous =1 number of terms Then define: •The evaluation phase (doing a set of predictions on a test set) will be very ∑ expensive (why?) w= α k ykΦ (x k ) Because each w. φ(x) (see below) k s.t. α k > 0 needs 75 million operations. What ∑ α − x ⋅Φ x w ⋅ = ( y ) =(1 − ε y)k Φ (x kK) . w (K ) bΦx can be done? k K K k s.t. α k > 0 Then classify this many callout bubbles on When you see with: K∑=>α0arg( xmax + 1 ) k = k ⋅x k yk 5 α where a slide it’s time to wrap the author in a k s.t. α blanket, gently take him away and murmur φ(x) k k f(x,w,b) = been at the PowerPoint for too “someone’s sign(w. - b) Only Sm operations (S=#support vectors) long.” Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 58 29
30. 30. QP with Quintic basis functions R 1R R ∑ αk − 2 ∑∑ αk αl Qkl where Qklas= yk ylof(you’dx k ).Φ(xl )) Andrew’s opinion Φ( SVMs don’t Maximize why overfit much as think: k =1 k =1 l =1 No matter what the basis function, there are really R only up to R ∑ Subject to these 0 ≤ αk ≤ C ∀k =0 αy parameters: α1, α2 .. αk , and usually R k constraints: most are set to zero by the Maximum k =1 Margin. Asking for small w.w is like “weight Then define: decay” in Neural Nets and like Ridge Regression parameters in Linear ∑ w= α k ykΦ (x k ) regression and like the use of Priors in Bayesian Regression---all designed k s.t. α k > 0 to smooth the function and reduce ∑ α − x ⋅Φ x overfitting. w ⋅ = ( y ) =(1 − ε y)k Φ (x kK) . w (K ) bΦx k K K k s.t. α k > 0 Then classify with: K∑=>α0arg( xmax + 1 ) k = k ⋅x k yk 5 α where k s.t. α f(x,w,b) = sign(w. φ(x) - b) k k Only Sm operations (S=#support vectors) Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 59 SVM Kernel Functions • K(a,b)=(a . b +1)d is an example of an SVM Kernel Function • Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function • Radial-Basis-style Kernel Function: ⎛ (a − b) 2 ⎞ σ, κ and δ are magic K (a, b) = exp⎜ − ⎟ ⎜ 2σ 2 ⎟ parameters that must ⎝ ⎠ be chosen by a model • Neural-net-style Kernel Function: selection method such as CV or VCSRM* K (a, b) = tanh(κ a.b − δ ) *see last lecture Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 60 30
31. 31. VC-dimension of an SVM • Very very very loosely speaking there is some theory which under some different assumptions puts an upper bound on the VC dimension as ⎡ Diameter ⎤ ⎢ Margin ⎥ ⎢ ⎥ • where • Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set. • Margin is the smallest margin we’ll let the SVM use • This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc. • But most people just use Cross-Validation Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 61 SVM Performance • Anecdotally they work very very well indeed. • Example: They are currently the best-known classifier on a well-studied hand-written-character recognition benchmark • Another Example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly. • There is a lot of excitement and religious fervor about SVMs as of 2001. • Despite this, some practitioners (including your lecturer) are a little skeptical. Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 62 31
32. 32. Doing multi-class classification • SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). • What can be done? • Answer: with output arity N, learn N SVM’s • SVM 1 learns “Output==1” vs “Output != 1” • SVM 2 learns “Output==2” vs “Output != 2” • : • SVM N learns “Output==N” vs “Output != N” • Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region. Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 63 References • An excellent tutorial on VC-dimension and Support Vector Machines: C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):955-974, 1998. http://citeseer.nj.nec.com/burges98tutorial.html • The VC/SRM/SVM Bible: Statistical Learning Theory by Vladimir Vapnik, Wiley- Interscience; 1998 Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 64 32
33. 33. What You Should Know • Linear SVMs • The definition of a maximum margin classifier • What QP can do for you (but, for this class, you don’t need to know how it does it) • How Maximum Margin can be turned into a QP problem • How we deal with noisy (non-separable) data • How we permit non-linear boundaries • How SVM Kernel functions permit us to pretend we’re working with ultra-high-dimensional basis- function terms Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 65 33