2013-1 Machine Learning Lecture 05 - Andrew Moore - Support Vector Machines
Presentation Transcript

  • 1. Support Vector Machines. Nov 23rd, 2001. Copyright © 2001, 2003, Andrew W. Moore. Modified from the slides by Dr. Andrew W. Moore, http://www.cs.cmu.edu/~awm/tutorials
  • 2. Linear Classifiers: f(x, w, b) = sign(w·x − b). [Figure: input x feeds the classifier f to produce y_est; data points labelled +1 and −1.] How would you classify this data?
  • 3. Linear Classifiers: f(x, w, b) = sign(w·x − b). How would you classify this data? (Same text as slide 2; the figure differs.)
  • 4. Linear Classifiers: f(x, w, b) = sign(w·x − b). How would you classify this data? (Same text as slide 2; the figure differs.)
  • 5. Linear Classifiers: f(x, w, b) = sign(w·x − b). How would you classify this data? (Same text as slide 2; the figure differs.)
  • 6. Linear Classifiers: f(x, w, b) = sign(w·x − b). Any of these would be fine... but which is best?
  • 7. Classifier Margin: f(x, w, b) = sign(w·x − b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
  • 8. Maximum Margin: The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM). Linear SVM.
  • 9. Maximum Margin: The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM). Support vectors are those datapoints that the margin pushes up against. Linear SVM.
  • 10. Why Maximum Margin? 1. Intuitively this feels safest. 2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification. 3. LOOCV is easy since the model is immune to removal of any non-support-vector datapoints. 4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. 5. Empirically it works very very well.
  • 11. Estimate the Margin: What is the distance expression for a point x to the line w·x + b = 0? $d(\mathbf{x}) = \frac{|\mathbf{x}\cdot\mathbf{w}+b|}{\|\mathbf{w}\|_2} = \frac{|\mathbf{x}\cdot\mathbf{w}+b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$
  • 12. Estimate the Margin: What is the expression for the margin? $\text{margin} = \min_{\mathbf{x}\in D} d(\mathbf{x}) = \min_{\mathbf{x}\in D}\frac{|\mathbf{x}\cdot\mathbf{w}+b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$ (see the sketch below)
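The distance and margin formulas on slides 11-12 are easy to check numerically. Below is a minimal NumPy sketch; the weight vector, bias, and sample points are invented purely for illustration.

```python
import numpy as np

# Hypothetical separating hyperplane w.x + b = 0 and a few labelled points.
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 2.0],
              [0.5, -0.5],
              [2.0, 1.0]])

# Distance of each point to the hyperplane: |w.x + b| / ||w||_2  (slide 11).
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin on this data is the smallest such distance (slide 12).
margin = distances.min()
print(distances, margin)
```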
  • 13. Maximize Margin: $\arg\max_{\mathbf{w},b}\,\text{margin}(\mathbf{w},b,D) = \arg\max_{\mathbf{w},b}\,\min_{\mathbf{x}_i\in D} d(\mathbf{x}_i) = \arg\max_{\mathbf{w},b}\,\min_{\mathbf{x}_i\in D}\frac{|\mathbf{x}_i\cdot\mathbf{w}+b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$
  • 14. Maximize Margin: $\arg\max_{\mathbf{w},b}\,\min_{\mathbf{x}_i\in D}\frac{|\mathbf{x}_i\cdot\mathbf{w}+b|}{\sqrt{\sum_{i=1}^{d} w_i^2}}$ subject to $\forall\,\mathbf{x}_i\in D:\ y_i(\mathbf{x}_i\cdot\mathbf{w}+b)\ge 0$. Min-max problem → game problem.
  • 15. Maximize Margin: Strategy: require $\forall\,\mathbf{x}_i\in D:\ |\mathbf{x}_i\cdot\mathbf{w}+b|\ge 1$. The problem then reduces to $\arg\min_{\mathbf{w},b}\sum_{i=1}^{d} w_i^2$ subject to $\forall\,\mathbf{x}_i\in D:\ y_i(\mathbf{x}_i\cdot\mathbf{w}+b)\ge 1$.
  • 16. Maximum Margin Linear Classifier: How to solve it? $\{\mathbf{w}^*,b^*\} = \arg\min_{\mathbf{w},b}\sum_k w_k^2$ subject to $y_1(\mathbf{w}\cdot\mathbf{x}_1+b)\ge 1,\ y_2(\mathbf{w}\cdot\mathbf{x}_2+b)\ge 1,\ \dots,\ y_N(\mathbf{w}\cdot\mathbf{x}_N+b)\ge 1$.
  • 17. Learning via Quadratic Programming: QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
  • 18. Quadratic Programming: Find $\arg\min_{\mathbf{u}}\ c + \mathbf{d}^T\mathbf{u} + \frac{\mathbf{u}^T R\,\mathbf{u}}{2}$ (the quadratic criterion) subject to $n$ additional linear inequality constraints $a_{11}u_1 + a_{12}u_2 + \dots + a_{1m}u_m \le b_1,\ \dots,\ a_{n1}u_1 + \dots + a_{nm}u_m \le b_n$ and $e$ additional linear equality constraints $a_{(n+1)1}u_1 + \dots + a_{(n+1)m}u_m = b_{n+1},\ \dots,\ a_{(n+e)1}u_1 + \dots + a_{(n+e)m}u_m = b_{n+e}$.
  • 19. Quadratic Programming: (same formulation as slide 18, with the quadratic criterion, the $n$ linear inequality constraints, and the $e$ linear equality constraints labelled).
  • 20. Quadratic Programming: $\{\mathbf{w}^*,b^*\} = \min_{\mathbf{w},b}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for all training data $(\mathbf{x}_i,y_i)$. In the QP template this is the quadratic criterion $\mathbf{w}^T\mathbf{w}$ together with the $N$ inequality constraints $y_1(\mathbf{w}\cdot\mathbf{x}_1+b)\ge 1,\ \dots,\ y_N(\mathbf{w}\cdot\mathbf{x}_N+b)\ge 1$. (A sketch using a general-purpose solver follows.)
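Slides 16-20 reduce maximum-margin learning to a quadratic program. A dedicated QP package is the usual tool; as a rough, minimal sketch, the same program can also be handed to a general-purpose constrained solver such as SciPy's SLSQP. The toy data below is invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy problem (labels in {-1, +1}).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # +1 class
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])  # -1 class
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
d = X.shape[1]

def objective(params):
    w = params[:d]
    return 0.5 * w @ w                    # minimize ||w||^2 / 2

# One inequality constraint per training point: y_i (w.x_i + b) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': lambda p, xi=xi, yi=yi: yi * (p[:d] @ xi + p[d]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(d + 1), method='SLSQP',
               constraints=constraints)
w, b = res.x[:d], res.x[d]
print("w =", w, "b =", b)
print("predictions:", np.sign(X @ w + b))
```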
  • 21. Uh-oh! [Figure: +1 and −1 points that are not linearly separable.] This is going to be a problem! What should we do?
  • 22. Uh-oh! This is going to be a problem! What should we do? Idea 1: Find minimum w·w, while minimizing the number of training set errors. Problemette: two things to minimize makes for an ill-defined optimization.
  • 23. Uh-oh! This is going to be a problem! What should we do? Idea 1.1: Minimize w·w + C (#train errors), where C is a tradeoff parameter. There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
  • 24. Uh-oh! Idea 1.1 (minimize w·w + C (#train errors)) can't be expressed as a Quadratic Programming problem, and solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)
  • 25. Uh-oh! Idea 2.0: Minimize w·w + C (distance of error points to their correct place).
  • 26. Support Vector Machine (SVM) for Noisy Data: Any problem with the above formulation? $\{\mathbf{w}^*,b^*\} = \min_{\mathbf{w},b,\varepsilon}\sum_{i=1}^{d} w_i^2 + c\sum_{j=1}^{N}\varepsilon_j$ subject to $y_1(\mathbf{w}\cdot\mathbf{x}_1+b)\ge 1-\varepsilon_1,\ y_2(\mathbf{w}\cdot\mathbf{x}_2+b)\ge 1-\varepsilon_2,\ \dots,\ y_N(\mathbf{w}\cdot\mathbf{x}_N+b)\ge 1-\varepsilon_N$. [Figure: slack variables ε1, ε2, ε3 for points on the wrong side of the margin.]
  • 27. Support Vector Machine (SVM) for Noisy Data: Balance the trade-off between margin and classification errors: $\{\mathbf{w}^*,b^*\} = \min_{\mathbf{w},b,\varepsilon}\sum_{i=1}^{d} w_i^2 + c\sum_{j=1}^{N}\varepsilon_j$ subject to $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\varepsilon_i,\ \varepsilon_i\ge 0$ for $i=1,\dots,N$.
  • 28. Support Vector Machine for Noisy Data: $\{\mathbf{w}^*,b^*\} = \arg\min_{\mathbf{w},b,\varepsilon}\sum_{i=1}^{d} w_i^2 + c\sum_{j=1}^{N}\varepsilon_j$ subject to the inequality constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\varepsilon_i,\ \varepsilon_i\ge 0$, for $i=1,\dots,N$. How do we determine the appropriate value for c? (See the cross-validation sketch below.)
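Slide 28 asks how to choose the trade-off constant c. In practice it is usually chosen by cross-validation. A minimal scikit-learn sketch follows; the synthetic dataset and the candidate grid of C values are arbitrary choices for illustration (scikit-learn's C plays the role of the slide's c).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, slightly noisy two-class data standing in for the "noisy" case.
X, y = make_classification(n_samples=200, n_features=5, flip_y=0.05,
                           random_state=0)

# Soft-margin linear SVM: pick C by 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel='linear'),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_['C'],
      "cross-validated accuracy:", grid.best_score_)
```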
  • 29. The Dual Form of QP: Maximize $\sum_{k=1}^{R}\alpha_k - \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R}\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l(\mathbf{x}_k\cdot\mathbf{x}_l)$, subject to the constraints $0\le\alpha_k\le C$ and $\sum_{k=1}^{R}\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_{k=1}^{R}\alpha_k y_k\mathbf{x}_k$. Then classify with f(x, w, b) = sign(w·x − b).
  • 30. The Dual Form of QP: (as on slide 29, without the final classification step) maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ with $Q_{kl} = y_k y_l(\mathbf{x}_k\cdot\mathbf{x}_l)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$; then define $\mathbf{w} = \sum_k\alpha_k y_k\mathbf{x}_k$.
  • 31. An Equivalent QP: Maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l(\mathbf{x}_k\cdot\mathbf{x}_l)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_k\alpha_k y_k\mathbf{x}_k$. Datapoints with $\alpha_k > 0$ will be the support vectors, so this sum only needs to be over the support vectors.
  • 32. Support Vectors: The planes $\mathbf{w}\cdot\mathbf{x}+b = 1$ and $\mathbf{w}\cdot\mathbf{x}+b = -1$ pass through the support vectors. The decision boundary is determined only by those support vectors! $\mathbf{w} = \sum_k\alpha_k y_k\mathbf{x}_k$, and $\alpha_i\big(y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1\big) = 0$ for all $i$: $\alpha_i = 0$ for non-support vectors, $\alpha_i \neq 0$ for support vectors. (See the sketch below for recovering w from the dual coefficients.)
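Slide 32's claim that only the support vectors matter can be inspected on a fitted model: scikit-learn's SVC exposes the non-zero products α_k y_k as dual_coef_ and the corresponding points as support_vectors_, so w = Σ_k α_k y_k x_k can be reassembled by hand. A minimal sketch with invented toy data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, y)    # large C approximates a hard margin

# dual_coef_ holds alpha_k * y_k for the support vectors only (alpha = 0 elsewhere).
alphas_times_y = clf.dual_coef_.ravel()
w_from_dual = alphas_times_y @ clf.support_vectors_   # w = sum_k alpha_k y_k x_k

print("support vector indices:", clf.support_)
print("w rebuilt from the dual:", w_from_dual)
print("w reported by the solver:", clf.coef_.ravel())
```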
  • 33. The Dual Form of QP: (same as slide 29) maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ with $Q_{kl} = y_k y_l(\mathbf{x}_k\cdot\mathbf{x}_l)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$; define $\mathbf{w} = \sum_k\alpha_k y_k\mathbf{x}_k$ and classify with f(x, w, b) = sign(w·x − b). How do we determine b?
  • 34. An Equivalent QP: Determine b. Fix $\mathbf{w}$ at the value found above; then $b^* = \arg\min_{b,\varepsilon}\sum_{i=1}^{N}\varepsilon_i$ subject to $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\varepsilon_i,\ \varepsilon_i\ge 0$ for $i=1,\dots,N$, which is a linear programming problem!
  • 35. An Equivalent QP: Maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l(\mathbf{x}_k\cdot\mathbf{x}_l)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_k\alpha_k y_k\mathbf{x}_k$ and $b = y_K(1-\varepsilon_K) - \mathbf{x}_K\cdot\mathbf{w}$ where $K = \arg\max_k\alpha_k$. Then classify with f(x, w, b) = sign(w·x − b). Datapoints with $\alpha_k > 0$ will be the support vectors, so this sum only needs to be over the support vectors. Why did I tell you about this equivalent QP? (1) It's a formulation that QP packages can optimize more quickly. (2) Because of further jaw-dropping developments you're about to learn.
  • 36. Suppose we're in 1 dimension: What would SVMs do with this data? [1-D points on a line through x = 0.]
  • 37. Suppose we're in 1 dimension: Not a big surprise. [The positive "plane" and negative "plane" are just thresholds on either side of x = 0.]
  • 38. Harder 1-dimensional dataset: That's wiped the smirk off SVM's face. What can be done about this? [1-D data that no single threshold can separate.]
  • 39. Harder 1-dimensional dataset: Remember how permitting non-linear basis functions made linear regression so much nicer? Let's permit them here too: $\mathbf{z}_k = (x_k,\ x_k^2)$.
  • 40. Harder 1-dimensional dataset: (same as slide 39, now showing the data mapped to $(x_k, x_k^2)$, where it becomes linearly separable; see the sketch below).
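Slides 38-40: 1-D data that defeats a linear SVM becomes separable once each x_k is mapped to z_k = (x_k, x_k²). A minimal sketch with made-up 1-D data:

```python
import numpy as np
from sklearn.svm import SVC

# 1-D data: the negative class sits between two positive clusters,
# so no single threshold on x separates the classes.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])

# Basis expansion from slide 39: z_k = (x_k, x_k^2).
Z = np.column_stack([x, x ** 2])

clf = SVC(kernel='linear', C=1e6).fit(Z, y)
print("training predictions:", clf.predict(Z))   # now perfectly separable
```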
  • 41. Common SVM basis functions: z_k = (polynomial terms of x_k of degree 1 to q); z_k = (radial basis functions of x_k), with $\mathbf{z}_k[j] = \varphi_j(\mathbf{x}_k) = \exp\!\big(-\tfrac{|\mathbf{x}_k-\mathbf{c}_j|^2}{\mathrm{KW}^2}\big)$; z_k = (sigmoid functions of x_k). This is sensible. Is that the end of the story? No... there's one more trick!
  • 42. Quadratic Basis Functions: $\Phi(\mathbf{x}) = \big(1,\ \sqrt{2}x_1,\dots,\sqrt{2}x_m,\ x_1^2,\dots,x_m^2,\ \sqrt{2}x_1x_2,\ \sqrt{2}x_1x_3,\dots,\sqrt{2}x_{m-1}x_m\big)$: a constant term, linear terms, pure quadratic terms, and quadratic cross-terms. Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2. You may be wondering what those √2's are doing. You should be happy that they do no harm; you'll find out why they're there soon.
  • 43. QP (old): Maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l(\mathbf{x}_k\cdot\mathbf{x}_l)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_k\alpha_k y_k\mathbf{x}_k$ and classify with f(x, w, b) = sign(w·x − b).
  • 44. QP with basis functions: Maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\big(\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l)\big)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_{k\ \text{s.t.}\ \alpha_k>0}\alpha_k y_k\Phi(\mathbf{x}_k)$ and classify with f(x, w, b) = sign(w·Φ(x) − b). Most important change: x → Φ(x).
  • 45. QP with basis functions: (same formulation as slide 44). We must do R²/2 dot products to get this matrix ready. Each dot product requires m²/2 additions and multiplications. The whole thing costs R²m²/4. Yeeks! ... or does it?
  • 46. Quadratic Dot Products: $\Phi(\mathbf{a})\cdot\Phi(\mathbf{b}) = 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m} a_i a_j b_i b_j$.
  • 47. Quadratic Dot Products: Just out of casual, innocent interest, let's look at another function of a and b: $(\mathbf{a}\cdot\mathbf{b}+1)^2 = (\mathbf{a}\cdot\mathbf{b})^2 + 2\,\mathbf{a}\cdot\mathbf{b} + 1 = \big(\sum_i a_ib_i\big)^2 + 2\sum_i a_ib_i + 1 = \sum_i\sum_j a_ia_jb_ib_j + 2\sum_i a_ib_i + 1 = \sum_i a_i^2b_i^2 + 2\sum_i\sum_{j>i} a_ia_jb_ib_j + 2\sum_i a_ib_i + 1$.
  • 48. Quadratic Dot Products: Comparing $\Phi(\mathbf{a})\cdot\Phi(\mathbf{b})$ with the expansion of $(\mathbf{a}\cdot\mathbf{b}+1)^2$: they're the same! And this is only O(m) to compute! (A numerical check is sketched below.)
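The identity on slides 46-48 is easy to verify numerically: build the explicit quadratic feature map of slide 42 (including its √2 factors) and compare the dot product of the expanded vectors with (a·b + 1)². The random test vectors are arbitrary.

```python
import numpy as np
from itertools import combinations

def quadratic_phi(x):
    """Explicit quadratic basis expansion of slide 42:
    constant, sqrt(2)*linear terms, pure squares, sqrt(2)*cross terms."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

explicit = quadratic_phi(a) @ quadratic_phi(b)   # O(m^2) work
kernel = (a @ b + 1) ** 2                        # O(m) work
print(explicit, kernel)                          # the two numbers agree
```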
  • 49. QP with Quintic basis functions: Maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\big(\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l)\big)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_{k\ \text{s.t.}\ \alpha_k>0}\alpha_k y_k\Phi(\mathbf{x}_k)$ and classify with f(x, w, b) = sign(w·Φ(x) − b).
  • 50. QP with Quadratic basis functions: Maximize $\sum_k\alpha_k - \frac{1}{2}\sum_k\sum_l\alpha_k\alpha_l Q_{kl}$ where $Q_{kl} = y_k y_l\,K(\mathbf{x}_k,\mathbf{x}_l)$, subject to $0\le\alpha_k\le C$ and $\sum_k\alpha_k y_k = 0$. Then define $\mathbf{w} = \sum_{k\ \text{s.t.}\ \alpha_k>0}\alpha_k y_k\Phi(\mathbf{x}_k)$ and $b = y_K(1-\varepsilon_K) - \mathbf{x}_K\cdot\mathbf{w}$ where $K = \arg\max_k\alpha_k$. Then classify with f(x, w, b) = sign(K(w, x) − b). Most important change: $\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l) = K(\mathbf{x}_k,\mathbf{x}_l)$.
  • 51. Higher Order Polynomials:
    Polynomial | φ(x) | Cost to build Q_kl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Q_kl matrix sneakily | Cost if 100 inputs
    Quadratic | all m²/2 terms up to degree 2 | m²R²/4 | 2,500 R² | (a·b+1)² | mR²/2 | 50 R²
    Cubic | all m³/6 terms up to degree 3 | m³R²/12 | 83,000 R² | (a·b+1)³ | mR²/2 | 50 R²
    Quartic | all m⁴/24 terms up to degree 4 | m⁴R²/48 | 1,960,000 R² | (a·b+1)⁴ | mR²/2 | 50 R²
  • 52. SVM Kernel Functions: K(a, b) = (a·b + 1)^d is an example of an SVM kernel function. Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function. Radial-basis-style kernel function: $K(\mathbf{a},\mathbf{b}) = \exp\!\big(-\tfrac{(\mathbf{a}-\mathbf{b})^2}{2\sigma^2}\big)$. Neural-net-style kernel function: $K(\mathbf{a},\mathbf{b}) = \tanh(\kappa\,\mathbf{a}\cdot\mathbf{b} - \delta)$. (A sketch using an explicitly computed kernel matrix follows.)
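The kernels of slide 52 can be plugged into the dual either by name or as a precomputed Gram matrix. The sketch below builds the RBF and polynomial Gram matrices explicitly and feeds the RBF one to scikit-learn's SVC with kernel='precomputed'; the dataset, kernel width, and C value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=0)

def rbf_kernel(A, B, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2))   (slide 52)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def poly_kernel(A, B, d=2):
    # K(a, b) = (a . b + 1)^d -- would be used in exactly the same way.
    return (A @ B.T + 1) ** d

K_train = rbf_kernel(X, X)
clf = SVC(kernel='precomputed', C=1.0).fit(K_train, y)

# At prediction time we need the kernel between new points and the training set.
K_test = rbf_kernel(X[:5], X)
print(clf.predict(K_test), y[:5])
```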
  • 53. Kernel Tricks: Replacing the dot product with a kernel function. Not all functions are kernel functions: they need to be decomposable as K(a, b) = φ(a)·φ(b). Could K(a, b) = (a − b)³ be a kernel function? Could K(a, b) = (a − b)⁴ − (a + b)² be a kernel function?
  • 54. Kernel Tricks: Mercer's condition. To expand a kernel function K(x, y) into a dot product, i.e. K(x, y) = φ(x)·φ(y), K(x, y) has to be a positive semi-definite function, i.e., for any function f(x) whose $\int f(x)^2\,dx$ is finite, the following inequality holds: $\int\!\!\int f(x)\,K(x,y)\,f(y)\,dx\,dy \ge 0$. Could $K(\mathbf{x},\mathbf{y}) = \sum_i x_i^p y_i^p$ be a kernel function?
  • 55. Kernel Tricks: Pro: introduces nonlinearity into the model; computationally cheap. Con: still has potential overfitting problems. (A finite-sample check of Mercer's condition is sketched below.)
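A rough finite-sample stand-in for Mercer's condition (slides 53-54): a valid kernel must produce a positive semi-definite Gram matrix on every finite set of points, so a negative eigenvalue on some sample rules a candidate out. The sample points below are random and purely illustrative; note that K(a,b) = (a−b)³ fails even earlier, since it is not symmetric in a and b.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)   # random 1-D sample points

def min_gram_eigenvalue(kernel):
    """Smallest eigenvalue of the (symmetric) Gram matrix on the sample."""
    K = np.array([[kernel(a, b) for b in x] for a in x])
    return np.linalg.eigvalsh(K).min()

# Valid kernels give values >= 0 (up to floating-point rounding);
# the last candidate comes out clearly negative, so it cannot be a kernel.
print("rbf              :", min_gram_eigenvalue(lambda a, b: np.exp(-(a - b) ** 2)))
print("(a.b + 1)^2      :", min_gram_eigenvalue(lambda a, b: (a * b + 1) ** 2))
print("(a-b)^4 - (a+b)^2:", min_gram_eigenvalue(lambda a, b: (a - b) ** 4 - (a + b) ** 2))
```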
  • 56. Nonlinear Kernel (I): [figure only]
  • 57. Nonlinear Kernel (II): [figure only]
  • 58. Kernelize Logistic Regression: How can we introduce nonlinearity into logistic regression? $p(y\,|\,\mathbf{x}) = \frac{1}{1+\exp(-y\,\mathbf{x}\cdot\mathbf{w})}$, $\quad l_{reg}(\mathbf{w}) = \sum_{i=1}^{N}\log\frac{1}{1+\exp(-y_i\,\mathbf{x}_i\cdot\mathbf{w})} + c\sum_{k} w_k^2$.
  • 59. Kernelize Logistic Regression: Representer theorem: $\mathbf{w} = \sum_{i=1}^{N}\alpha_i\,\varphi(\mathbf{x}_i)$, so $K(\mathbf{w},\mathbf{x}) = \sum_{i=1}^{N}\alpha_i\,K(\mathbf{x}_i,\mathbf{x})$. Then $p(y\,|\,\mathbf{x}) = \frac{1}{1+\exp(-y\,K(\mathbf{w},\mathbf{x}))} = \frac{1}{1+\exp\big(-y\sum_i\alpha_i K(\mathbf{x}_i,\mathbf{x})\big)}$ and $l_{reg}(\alpha) = \sum_{i=1}^{N}\log\frac{1}{1+\exp\big(-y_i\sum_j\alpha_j K(\mathbf{x}_j,\mathbf{x}_i)\big)} + c\sum_{i,j}\alpha_i\alpha_j K(\mathbf{x}_i,\mathbf{x}_j)$. (A gradient-descent sketch follows.)
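A rough sketch of kernel logistic regression via the representer theorem of slide 59. The slides do not specify an optimizer, so plain gradient descent on the α coefficients is used here as a stand-in; the toy data, RBF kernel width, regularization constant, learning rate, and iteration count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data (labels in {-1, +1}) with a non-linear concept.
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)

def rbf(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = rbf(X, X)
alpha = np.zeros(len(X))
c, lr = 0.01, 0.1

# Gradient descent on the slide-59 objective:
#   L(alpha) = sum_i log(1 + exp(-y_i * sum_j alpha_j K(x_j, x_i))) + c * alpha^T K alpha
for _ in range(500):
    f = K @ alpha
    g = -y / (1.0 + np.exp(y * f))          # dL/df_i for the logistic loss
    grad = K @ g + 2 * c * (K @ alpha)      # chain rule through f = K alpha
    alpha -= lr * grad / len(X)

pred = np.sign(K @ alpha)
print("training accuracy:", (pred == y).mean())
```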
  • 60. Overfitting in SVM: [Plots of classification error vs. polynomial degree (1-5) on the Breast Cancer and Ionosphere datasets, showing training error and testing error.]
  • 61. SVM Performance: Anecdotally they work very very well indeed. Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark. Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly. There is a lot of excitement and religious fervor about SVMs as of 2001. Despite this, some practitioners are a little skeptical.
  • 62. Diffusion Kernel: A kernel function describes the correlation or similarity between two data points. Suppose we have a function s(x, y) that describes the similarity between two data points, and assume it is non-negative and symmetric. How can we generate a kernel function based on this similarity function? A graph theory approach ...
  • 63. Diffusion Kernel: Create a graph for the data points: each vertex corresponds to a data point, and the weight of each edge is the similarity s(x, y). Graph Laplacian: $L_{ij} = s(\mathbf{x}_i,\mathbf{x}_j)$ for $i\neq j$ and $L_{ii} = -\sum_{k\neq i} s(\mathbf{x}_i,\mathbf{x}_k)$. Properties of the Laplacian: negative semi-definite.
  • 64. Diffusion Kernel: Consider a simple Laplacian: $L_{ij} = 1$ if $\mathbf{x}_i$ and $\mathbf{x}_j$ are connected, and $L_{ii} = -|N(\mathbf{x}_i)|$ (minus the number of neighbours). Consider $L^2, L^4, \dots$ What do these matrices represent? A diffusion kernel.
  • 65. Diffusion Kernel: (repeat of slide 64).
  • 66. Diffusion Kernel: Properties: positive definite; local relationships in L induce global relationships; works for undirected weighted graphs with symmetric similarities $s(\mathbf{x}_i,\mathbf{x}_j) = s(\mathbf{x}_j,\mathbf{x}_i)$. How to compute the diffusion kernel $K = e^{L}$ (equivalently, the solution of $\frac{dK}{d\beta} = LK$)?
  • 67. Computing Diffusion Kernel: Singular value decomposition of the Laplacian L: $L = U\Sigma U^T = (\mathbf{u}_1,\dots,\mathbf{u}_m)\,\mathrm{diag}(\sigma_1,\dots,\sigma_m)\,(\mathbf{u}_1,\dots,\mathbf{u}_m)^T = \sum_{i=1}^{m}\sigma_i\,\mathbf{u}_i\mathbf{u}_i^T$, with $\mathbf{u}_i^T\mathbf{u}_j = \delta_{ij}$. What is $L^2$? $L^2 = \big(\sum_{i=1}^{m}\sigma_i\,\mathbf{u}_i\mathbf{u}_i^T\big)^2 = \sum_{i,j=1}^{m}\sigma_i\sigma_j\,\mathbf{u}_i\mathbf{u}_i^T\mathbf{u}_j\mathbf{u}_j^T = \sum_{i=1}^{m}\sigma_i^2\,\mathbf{u}_i\mathbf{u}_i^T$.
  • 68. Computing Diffusion Kernel: What about $L^n$? $L^n = \sum_{i=1}^{m}\sigma_i^n\,\mathbf{u}_i\mathbf{u}_i^T$. Compute the diffusion kernel: $e^{L} = \sum_{n=1}^{\infty}\frac{1}{n!}L^n = \sum_{n=1}^{\infty}\frac{1}{n!}\sum_{i=1}^{m}\sigma_i^n\,\mathbf{u}_i\mathbf{u}_i^T = \sum_{i=1}^{m}\Big(\sum_{n=1}^{\infty}\frac{\sigma_i^n}{n!}\Big)\mathbf{u}_i\mathbf{u}_i^T = \sum_{i=1}^{m} e^{\sigma_i}\,\mathbf{u}_i\mathbf{u}_i^T$. (A sketch of this computation follows.)
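A minimal NumPy sketch of slides 63 and 67-68: build a similarity graph, form its Laplacian, and compute the diffusion kernel through the eigendecomposition. The data, the RBF-style similarity function, and the diffusion parameter beta are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Similarity graph: s(x_i, x_j) from an RBF-style similarity (an arbitrary choice).
S = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(S, 0.0)

# Graph Laplacian of slide 63: off-diagonal s(x_i, x_j), diagonal -sum_k s(x_i, x_k).
L = S - np.diag(S.sum(axis=1))

# Diffusion kernel K = exp(beta * L) via the decomposition of slides 67-68:
# L = U diag(sigma) U^T  =>  K = U diag(exp(beta * sigma)) U^T.
beta = 0.5
sigma, U = np.linalg.eigh(L)          # L is symmetric
K = U @ np.diag(np.exp(beta * sigma)) @ U.T

print("smallest eigenvalue of K (should be > 0):", np.linalg.eigvalsh(K).min())
```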
  • 69. Doing multi-class classification: SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2). What can be done? Answer: with output arity N, learn N SVMs: SVM 1 learns "Output==1" vs "Output != 1", SVM 2 learns "Output==2" vs "Output != 2", ..., SVM N learns "Output==N" vs "Output != N". Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region. (Sketched below.)
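A minimal sketch of slide 69's one-vs-rest scheme, using scikit-learn's SVC for the individual two-class machines; the Iris dataset is an arbitrary stand-in for an N-class problem.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One SVM per class: "Output == k" vs "Output != k".
machines = {k: SVC(kernel='linear').fit(X, np.where(y == k, 1, -1))
            for k in classes}

# Predict with every machine and pick the one whose decision value is
# furthest into the positive region.
scores = np.column_stack([machines[k].decision_function(X) for k in classes])
pred = classes[scores.argmax(axis=1)]
print("training accuracy:", (pred == y).mean())
```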
  • 70. Ranking Problem: Consider a problem of ranking essays. Three ranking categories: good, ok, bad. Given an input document, predict its ranking category. How should we formulate this problem? A simple multiple-class solution: each ranking category is an independent class. But there is something missing here... We miss the ordinal relationship between the classes!
  • 71. Ordinal Regression: Which choice is better? How could we formulate this problem? [Figure: 'good', 'OK', 'bad' regions ordered along two candidate directions w and w'.]
  • 72. Ordinal Regression: What are the two decision boundaries? $\mathbf{w}\cdot\mathbf{x}+b_1 = 0$ and $\mathbf{w}\cdot\mathbf{x}+b_2 = 0$. What is the margin for ordinal regression? $\text{margin}_1(\mathbf{w},b_1) = \min_{\mathbf{x}\in D_g\cup D_o} d(\mathbf{x}) = \min_{\mathbf{x}\in D_g\cup D_o}\frac{|\mathbf{w}\cdot\mathbf{x}+b_1|}{\|\mathbf{w}\|}$, $\text{margin}_2(\mathbf{w},b_2) = \min_{\mathbf{x}\in D_o\cup D_b} d(\mathbf{x}) = \min_{\mathbf{x}\in D_o\cup D_b}\frac{|\mathbf{w}\cdot\mathbf{x}+b_2|}{\|\mathbf{w}\|}$, and $\text{margin}(\mathbf{w},b_1,b_2) = \min\big(\text{margin}_1(\mathbf{w},b_1),\ \text{margin}_2(\mathbf{w},b_2)\big)$. Maximize the margin: $\{\mathbf{w}^*,b_1^*,b_2^*\} = \arg\max_{\mathbf{w},b_1,b_2}\text{margin}(\mathbf{w},b_1,b_2)$.
  • 73. Ordinal Regression: (repeat of slide 72).
  • 74. Ordinal Regression: (repeat of slide 72).
  • 75. Ordinal Regression: How do we solve this monster? $\{\mathbf{w}^*,b_1^*,b_2^*\} = \arg\max_{\mathbf{w},b_1,b_2}\text{margin}(\mathbf{w},b_1,b_2) = \arg\max_{\mathbf{w},b_1,b_2}\min\big(\text{margin}_1(\mathbf{w},b_1),\ \text{margin}_2(\mathbf{w},b_2)\big) = \arg\max_{\mathbf{w},b_1,b_2}\min\Big(\min_{\mathbf{x}_i\in D_g\cup D_o}\frac{|\mathbf{w}\cdot\mathbf{x}_i+b_1|}{\|\mathbf{w}\|},\ \min_{\mathbf{x}_i\in D_o\cup D_b}\frac{|\mathbf{w}\cdot\mathbf{x}_i+b_2|}{\|\mathbf{w}\|}\Big)$ subject to: $\mathbf{x}_i\in D_g:\ \mathbf{w}\cdot\mathbf{x}_i+b_1\ge 0$; $\mathbf{x}_i\in D_o:\ \mathbf{w}\cdot\mathbf{x}_i+b_1\le 0,\ \mathbf{w}\cdot\mathbf{x}_i+b_2\ge 0$; $\mathbf{x}_i\in D_b:\ \mathbf{w}\cdot\mathbf{x}_i+b_2\le 0$.
  • 76. Ordinal Regression: The same old trick. To remove the scaling invariance, set $\forall\mathbf{x}_i\in D_g\cup D_o:\ |\mathbf{w}\cdot\mathbf{x}_i+b_1|\ge 1$ and $\forall\mathbf{x}_i\in D_o\cup D_b:\ |\mathbf{w}\cdot\mathbf{x}_i+b_2|\ge 1$. Now the problem is simplified as: $\{\mathbf{w}^*,b_1^*,b_2^*\} = \arg\min_{\mathbf{w},b_1,b_2}\sum_{i=1}^{d} w_i^2$ subject to: $\mathbf{x}_i\in D_g:\ \mathbf{w}\cdot\mathbf{x}_i+b_1\ge 1$; $\mathbf{x}_i\in D_o:\ \mathbf{w}\cdot\mathbf{x}_i+b_1\le -1,\ \mathbf{w}\cdot\mathbf{x}_i+b_2\ge 1$; $\mathbf{x}_i\in D_b:\ \mathbf{w}\cdot\mathbf{x}_i+b_2\le -1$. (A small numerical sketch follows.)
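A rough numerical sketch of slide 76's simplified ordinal-regression problem: one shared direction w and two thresholds b1, b2, solved here with SciPy's SLSQP as a stand-in for a QP package. The sign conventions follow the reconstruction above, and the toy "essay" features are invented so that rank increases along the direction (1, 1).

```python
import numpy as np
from scipy.optimize import minimize

# Toy essays in a 2-D feature space; rank increases along (1, 1).
X_good = np.array([[2.0, 2.5], [2.5, 2.0], [3.0, 3.0]])
X_ok   = np.array([[0.0, 0.5], [-0.5, 0.0], [0.2, -0.2]])
X_bad  = np.array([[-2.0, -2.5], [-3.0, -2.0], [-2.5, -3.0]])

d = 2  # parameter vector p = (w_1, w_2, b1, b2)

def objective(p):
    w = p[:d]
    return 0.5 * w @ w

cons = []
for x in X_good:                       # w.x + b1 >= 1
    cons.append({'type': 'ineq', 'fun': lambda p, x=x: p[:d] @ x + p[d] - 1})
for x in X_ok:                         # w.x + b1 <= -1  and  w.x + b2 >= 1
    cons.append({'type': 'ineq', 'fun': lambda p, x=x: -(p[:d] @ x + p[d]) - 1})
    cons.append({'type': 'ineq', 'fun': lambda p, x=x: p[:d] @ x + p[d + 1] - 1})
for x in X_bad:                        # w.x + b2 <= -1
    cons.append({'type': 'ineq', 'fun': lambda p, x=x: -(p[:d] @ x + p[d + 1]) - 1})

res = minimize(objective, x0=np.zeros(d + 2), method='SLSQP', constraints=cons)
w, b1, b2 = res.x[:d], res.x[d], res.x[d + 1]
print("w =", w, "b1 =", b1, "b2 =", b2)   # one shared w, two ordered thresholds
```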
  • 77. Ordinal Regression: Noisy case: $\{\mathbf{w}^*,b_1^*,b_2^*\} = \arg\min_{\mathbf{w},b_1,b_2}\sum_{i=1}^{d} w_i^2 + c\sum_{\mathbf{x}_i\in D_g}\varepsilon_i + c\sum_{\mathbf{x}_i\in D_o}(\varepsilon_i+\varepsilon_i') + c\sum_{\mathbf{x}_i\in D_b}\varepsilon_i$ subject to: $\mathbf{x}_i\in D_g:\ \mathbf{w}\cdot\mathbf{x}_i+b_1\ge 1-\varepsilon_i,\ \varepsilon_i\ge 0$; $\mathbf{x}_i\in D_o:\ \mathbf{w}\cdot\mathbf{x}_i+b_1\le -1+\varepsilon_i,\ \mathbf{w}\cdot\mathbf{x}_i+b_2\ge 1-\varepsilon_i',\ \varepsilon_i\ge 0,\ \varepsilon_i'\ge 0$; $\mathbf{x}_i\in D_b:\ \mathbf{w}\cdot\mathbf{x}_i+b_2\le -1+\varepsilon_i,\ \varepsilon_i\ge 0$. Is this sufficient?
  • 78. Ordinal Regression: Same objective and slack constraints as slide 77, with an additional ordering constraint on the two thresholds, $b_1 \le b_2$, so that the 'good', 'OK', and 'bad' regions stay in order along w.
  • 79. References: An excellent tutorial on VC-dimension and Support Vector Machines: C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html. The VC/SRM/SVM bible (not for beginners, including myself): Vladimir Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998. Software: SVM-light, http://svmlight.joachims.org/, free download.