1. Support Vector Machines in MapReduce
   Presented by Asghar Dehghani, Alpine Data Labs, and Sara Asher, Alpine Data Labs
2. Overview
   § Theory of basic SVM (binary classification, linear)
   § Generalized SVM (multi-classification)
   § MapReducing SVM
   § Handling kernels (non-linear SVM) in MapReduce
   § Demo
3. Background on SVM
   § Given a bunch of points…
4. Background on SVM
   § How do we classify a new point?
5. Background on SVM
   § Split the space using a hyper-plane
6. Background on SVM
   § Split the space using a hyper-plane
7. Background on SVM
   § Split the space using a hyper-plane
8. Background on SVM
   § Which plane do you use?
9. Background on SVM
   § Margin: the distance from the closest points to the hyper-plane
   § Idea: among the set of hyper-planes, choose the one that maximizes the margin
   (Figure: the margin ρ and the support vectors, SVs)
10. Background on SVM
   • The hyper-plane is represented by $w^T x + b = 0$; the two classes lie in the regions $w^T x + b > \rho$ and $w^T x + b < -\rho$.
   • We want to choose the $w$ and $b$ that will maximize the margin $\rho$.
   • Using some algebra and some rescaling, we can show that for the support vectors: $\text{margin} = \dfrac{1}{\lVert w \rVert}$
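The "algebra and rescaling" the slide alludes to is the standard margin argument; a brief sketch of the missing step (my own filling-in, not text from the slides):

```latex
% Distance from a point x_i to the hyper-plane w^T x + b = 0:
%   d(x_i) = |w^T x_i + b| / ||w||.
% Rescale (w, b) so that the closest points (the support vectors) satisfy
%   y_i (w^T x_i + b) = 1.
% Then the distance from a support vector to the hyper-plane is
\text{margin} \;=\; \min_i \frac{\lvert w^\top x_i + b \rvert}{\lVert w \rVert} \;=\; \frac{1}{\lVert w \rVert},
% so maximizing the margin is equivalent to minimizing ||w||.
```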
11. Background on SVM (cont.)
   § Thus the goal is to solve the following optimization problem:
     $\operatorname{Argmax}_{w,b}\; \rho = \dfrac{1}{\lVert w \rVert} \;=\; \operatorname{Argmin}_{w,b}\; \lVert w \rVert$
     subject to $y_i (w^T x_i + b) \ge 1$, $i = 1..n$
     (where $y_i = 1$ or $-1$, depending on the class of $x_i$)
12. Background on SVM (cont.)
   § To avoid square roots, we can apply the following transformation:
     $\operatorname{Argmin}_{w,b}\; \lVert w \rVert \;\;\Longrightarrow\;\; \operatorname{Argmin}_{w,b}\; \tfrac{1}{2}\lVert w \rVert^2$
     subject to $y_i (w^T x_i + b) \ge 1$, $i = 1..n$ (the constraints are unchanged)
   § Thus the problem is a quadratic minimization subject to linear constraints (well studied)
13. Background on SVM (cont.)
   § What happens if the data is not linearly separable? (i.e., there is no hyper-plane that will split the data exactly)
14. Background on SVM (cont.)
   $\operatorname{Argmin}_{w,b}\; \tfrac{1}{2}\lVert w \rVert^2$ subject to $y_i (w^T x_i + b) \ge 1$, $i = 1..n$
   • Slack variables $\xi_i$ are added to the constraints.
   • $\xi_i$ is the distance from $x_i$ to its class boundary.
15. Background on SVM (cont.)
   $\operatorname{Argmin}_{w,b}\; \tfrac{1}{2}\lVert w \rVert^2$ subject to $y_i (w^T x_i + b) \ge 1$, $i = 1..n$
   ⇓ (add slack)
   $\operatorname{Argmin}_{w,b}\; \Bigl( \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \Bigr)$ subject to $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1..n$
   • Slack variables $\xi_i$ are added to the constraints.
   • $\xi_i$ is the distance from $x_i$ to its class boundary.
   • $C$ is the regularization parameter, which controls the bias-variance trade-off (the significance of outliers).
   Cortes, Corinna, and Vladimir Vapnik, 1995. Support-vector networks. Machine Learning, 20(3), 273–297.
16. Background on SVM (cont.)
   $\operatorname{Argmin}_{w,b}\; \Bigl( \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \Bigr)$ subject to $y_i (w^T x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1..n$
   Question: how to get rid of the constraints?
   Cortes, Corinna, and Vladimir Vapnik, 1995. Support-vector networks. Machine Learning, 20(3), 273–297.
17. Background on SVM (cont.)
   Answer: Fenchel duality and the representer theorem!
   $\operatorname{Argmin}_{w,b}\; \dfrac{\lambda}{2}\lVert w \rVert^2 + \sum_{i=1}^{n} \underbrace{\max\bigl(0,\; 1 - y_i (w^T x_i + b)\bigr)}_{\text{hinge loss}}$
   We've removed the constraints! SVM minimizes the "L2-regularized hinge."
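For concreteness, a minimal NumPy sketch of this objective (the variable names and the code are my own illustration, not from the slides):

```python
import numpy as np

def l2_regularized_hinge(w, b, X, y, lam):
    """L2-regularized hinge loss for labels y in {-1, +1}.

    X: (n, d) sample matrix, w: (d,) weight vector, lam: regularization weight.
    """
    margins = y * (X @ w + b)                      # y_i (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins).sum()   # sum of per-sample hinge losses
    return 0.5 * lam * np.dot(w, w) + hinge
```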
18. Background on SVM (cont.)
   § What about multi-class situations? There are different ways to handle multi-classification:
   • One vs. all
   • One vs. one
   • Cost-sensitive hinge (Crammer and Singer 2001)
19. Background on SVM (cont.)
   Cost-sensitive formulation of the hinge loss (Crammer and Singer 2001):
   $\operatorname{Argmin}_{w,b}\; \dfrac{\lambda}{2}\lVert w \rVert^2 + \sum_{i=1}^{n} \underbrace{\max\bigl(0,\; 1 + f^{r}(x_i) - f^{t}(x_i)\bigr)}_{\text{multi-class hinge}}$
   where
   $f^{r}(x_i) = \max_{r \in Y,\, r \ne t} \,(w_r^T x_i + b_r)$ (score of the highest-scoring wrong class)
   $f^{t}(x_i) = w_t^T x_i + b_t$ (score of the true class $t$)
   This loss function is called the "cost-sensitive hinge," and the prediction function is
   $f(x) = \operatorname{Argmax}_{r \in Y} \,(w_r^T x + b_r)$
   Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265–292.
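A small NumPy sketch of the multi-class hinge and the argmax prediction rule (my own illustrative code, assuming a per-class weight matrix W of shape (num_classes, d)):

```python
import numpy as np

def multiclass_hinge(W, b, X, t, lam):
    """Crammer-Singer style multi-class hinge with L2 regularization.

    W: (k, d) per-class weights, b: (k,) per-class biases,
    X: (n, d) samples, t: (n,) integer true-class labels.
    """
    scores = X @ W.T + b                        # (n, k) class scores
    true_scores = scores[np.arange(len(t)), t]  # f^t(x_i)
    wrong = scores.copy()
    wrong[np.arange(len(t)), t] = -np.inf       # mask out the true class
    runner_up = wrong.max(axis=1)               # f^r(x_i): best wrong class
    loss = np.maximum(0.0, 1.0 + runner_up - true_scores).sum()
    return 0.5 * lam * np.sum(W * W) + loss

def predict(W, b, X):
    return np.argmax(X @ W.T + b, axis=1)       # f(x) = Argmax_r (w_r^T x + b_r)
```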
20. SVM: Implementation
   We now have the function we need to optimize. But how do we parallelize this for the MapReduce framework?
21. SVM: Implementation
   We now have the function we need to optimize. But how do we parallelize this for the MapReduce framework?
   Parallelized Stochastic Gradient Descent, by Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li. NIPS 2010.
22. Parallelized Stochastic Gradient Descent - Theory
23. Parallelized Stochastic Gradient Descent - Theory
   § Conditions:
   • The SVM loss function has a bounded gradient
   • The solver is stochastic
   § Result:
   • You can break the original sample into randomly distributed subsamples and solve on each subsample.
   • The convex combination of the sub-solutions approximates the solution for the original sample.
24. Optimization
   § Conditions:
   • The SVM loss function has a bounded gradient
   • The solver is stochastic
   § Loss: cost-sensitive hinge
   • Crammer, K., & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265–292.
   § Solver: Pegasos
   • Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated sub-gradient solver for SVM. ICML, 807–814.
   § Use the mapper to randomly distribute the samples, and use the reducer to iterate on each sub-sample (see the sketch below).
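As an illustration only (not the Alpine implementation, and simplified to the binary hinge loss rather than the cost-sensitive multi-class hinge), a Pegasos-style shard solver plus the parameter-mixing average might look like this; the shard layout, epoch count, and function names are my own assumptions:

```python
import numpy as np

def pegasos_shard(X, y, lam, epochs=5, seed=0):
    """Pegasos-style stochastic sub-gradient descent on one shard.

    X: (n, d) samples, y: (n,) labels in {-1, +1}, lam: regularization weight.
    """
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)                 # Pegasos step size 1/(lambda*t)
            if y[i] * np.dot(w, X[i]) < 1:        # hinge term is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

def parameter_mix(shards, lam):
    """Zinkevich-style parallel SGD: solve each shard independently
    (one reducer per shard), then average the per-shard weight vectors."""
    ws = [pegasos_shard(X, y, lam, seed=k) for k, (X, y) in enumerate(shards)]
    return np.mean(ws, axis=0)
```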
25. SVM: Non-tolerable data
   But what about non-tolerable data, i.e., data that no hyper-plane splits well even when slack is tolerated?
26. SVM: Non-tolerable data
   But what about non-tolerable data?
   Idea: transform the pattern space to a higher-dimensional space, called the feature space, in which the data is linearly separable
27. SVM: Non-tolerable data
   But what about non-tolerable data?
   Idea: transform the pattern space to a higher-dimensional space, called the feature space, in which the data is linearly separable
28. SVM: Non-tolerable data
   But what about non-tolerable data?
   Idea: transform the pattern space to a higher-dimensional space, called the feature space, in which the data is linearly separable
29. SVM: Kernels
   § Two questions:
   • What kind of function is a kernel?
   • What kernel is appropriate for a specific problem?
   § The answers:
   • Mercer's theorem: every positive semi-definite symmetric function is a kernel
   • It depends on the problem.
   http://www.ism.ac.jp/~fukumizu/H20_kernel/Kernel_7_theory.pdf
30. SVM: Kernels
   § Examples of popular kernel functions:
   • Gaussian kernel: $K(x_i, x_j) = e^{-\lVert x_i - x_j \rVert^2 / 2\sigma^2}$
   • Laplacian kernel: $K(x_i, x_j) = e^{-\lVert x_i - x_j \rVert / \theta}$
   • Polynomial kernel: $K(x_i, x_j) = \bigl(a\, x_i^T x_j + b\bigr)^d$
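A small NumPy sketch of these three kernels as written above (the parameter names sigma, theta, a, b, d follow the formulas; the code itself is my own illustration):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def laplacian_kernel(xi, xj, theta=1.0):
    # exp(-||xi - xj|| / theta)
    return np.exp(-np.linalg.norm(xi - xj) / theta)

def polynomial_kernel(xi, xj, a=1.0, b=1.0, d=3):
    # (a <xi, xj> + b)^d
    return (a * np.dot(xi, xj) + b) ** d
```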
31. SVM: Kernels
   § The kernel (dual) feature space is defined by the inner products between each pair $x_i$ and $x_j$
   § The kernel matrix is N × N, where N is the number of samples
   § As your sample size goes up, the kernel matrix gets huge!
   § Worse, the dual problem does not map well onto MapReduce!
   ⇒ The dual space is not feasible at scale
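To make the scaling concrete (my arithmetic, not a figure from the slides): at N = 1,000,000 samples with 8-byte entries, the kernel matrix alone is 10^6 × 10^6 × 8 bytes ≈ 8 TB, before any training work has been done.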
32. SVM: Implementation
   § Question: how can we get a non-linear SVM without paying the price of duality?
33. SVM: Implementation
   § Question: how can we get a non-linear SVM without paying the price of duality?
   § Claim: for certain kernel functions we can find a (randomized) feature map $z$ such that solving the linear problem
     $\operatorname{Argmin}_{w,b}\; \dfrac{\lambda}{2}\lVert w \rVert^2 + \sum_{i=1}^{n} \underbrace{\max\bigl(0,\; 1 - y_i\bigl(w^T z(x_i) + b\bigr)\bigr)}_{\text{hinge loss}}$
     approximates the kernelized SVM.
34. SVM: Implementation
   • Random Features for Large-Scale Kernel Machines, by Ali Rahimi and Ben Recht. NIPS 2007. Can approximate shift-invariant kernels.
   • Random Feature Maps for Dot Product Kernels, by Purushottam Kar and Harish Karnick. AISTATS 2012. Can approximate dot-product kernels.
35. Approximating a shift-invariant kernel (Random Features for Large-Scale Kernel Machines)
   Given a positive definite shift-invariant kernel $K(x, y) = f(x - y)$, we can create a randomized feature map $Z : \mathbb{R}^d \to \mathbb{R}^D$ such that $Z(x)^T Z(y) \approx K(x - y)$:
   (1) Compute the Fourier transform $p$ of the kernel $k$: $p(\omega) = \dfrac{1}{2\pi} \int e^{-j\,\omega^T \delta}\, k(\delta)\, d\delta$
   (2) Draw $D$ i.i.d. samples $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ from $p$.
   (3) Draw $D$ i.i.d. samples $b_1, \ldots, b_D \in \mathbb{R}$ from the uniform distribution on $[0, 2\pi]$.
   (4) $Z : x \mapsto \sqrt{\dfrac{2}{D}}\, \bigl[\cos(\omega_1^T x + b_1), \ldots, \cos(\omega_D^T x + b_D)\bigr]^T$
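For the Gaussian kernel the Fourier transform p is itself a Gaussian, so the sampling step reduces to a normal draw; a minimal NumPy sketch of the recipe above (the function name and the choice of D are my own, and this is an illustration rather than the slides' implementation):

```python
import numpy as np

def random_fourier_features(X, D, sigma=1.0, seed=0):
    """Approximate the Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    with D random Fourier features (Rahimi & Recht style sketch)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For the Gaussian kernel, p(omega) is a Gaussian with std 1/sigma per coordinate.
    omega = rng.normal(0.0, 1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

# Z(x)^T Z(y) approximates K(x, y), so a linear SVM trained on Z(X)
# approximates the kernelized SVM.
```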
36. SVM: Implementation
37. Approximating a dot-product kernel (Random Feature Maps for Dot Product Kernels)
   Given a positive definite dot-product kernel $K(x, y) = f(\langle x, y \rangle)$, we can create a randomized feature map $Z : \mathbb{R}^d \to \mathbb{R}^D$ such that $\langle Z(x), Z(y) \rangle \approx K(x, y)$:
   (1) Obtain the Maclaurin expansion of $f(x) = \sum_{n=0}^{\infty} a_n x^n$ by setting $a_n = \dfrac{f^{(n)}(0)}{n!}$.
   (2) Fix a value $p > 1$. For $i = 1$ to $D$:
       • Choose a non-negative integer $N$ with $P[N = n] = \dfrac{1}{p^{n+1}}$.
       • Choose $N$ vectors $\omega_1, \ldots, \omega_N \in \{-1, 1\}^d$, selecting each coordinate using fair coin tosses.
       • Let the feature map be $Z_i : x \mapsto \sqrt{a_N\, p^{N+1}}\, \prod_{j=1}^{N} \omega_j^T x$.
   (3) $Z : x \mapsto \dfrac{1}{\sqrt{D}}\bigl(Z_1(x), \ldots, Z_D(x)\bigr)$
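A NumPy sketch of this random Maclaurin feature map (my own illustration, not the slides' code): it fixes p = 2 so that P[N = n] = 1/2^(n+1) is an ordinary geometric draw, and takes the Maclaurin coefficients as a user-supplied function, which must return non-negative values.

```python
import numpy as np

def random_maclaurin_features(X, D, maclaurin_coeff, seed=0):
    """Kar & Karnick style random feature map for a dot-product kernel
    K(x, y) = f(<x, y>), with maclaurin_coeff(n) returning a_n = f^(n)(0)/n!
    (assumed non-negative). Uses p = 2; sketch only."""
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    p = 2.0
    Z = np.empty((n_samples, D))
    for i in range(D):
        N = rng.geometric(0.5) - 1                      # P[N = n] = 1 / 2^(n+1)
        omegas = rng.choice([-1.0, 1.0], size=(N, d))   # N Rademacher vectors
        # product over j of omega_j^T x (an empty product is 1 when N == 0)
        prod = np.prod(X @ omegas.T, axis=1) if N > 0 else np.ones(n_samples)
        Z[:, i] = np.sqrt(maclaurin_coeff(N) * p ** (N + 1)) * prod
    return Z / np.sqrt(D)
```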
38. SVM: Implementation Summary
   Using these approximations, we can now treat this as a linear SVM problem:
   (1) Job 1: compute stats for each feature and class (mean, variance, class cardinality, etc.)
   (2) Job 2: transform the sample by the approximate kernel map and compute stats for the new feature space.
   (3) Job 3: randomly distribute the new samples and train the model in the reducer.
   We can use MapReduce to solve non-linear, multi-classification SVM!
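The slides do not show the pipeline code, but the three jobs can be mimicked locally for illustration; the sketch below reuses random_fourier_features and pegasos_shard from the earlier sketches in this document, and its statistics step, shard count, and feature dimension D are my own assumptions rather than the Alpine implementation:

```python
import numpy as np

def train_nonlinear_svm_pipeline(raw_samples, labels, D=500, lam=1e-4, shards=16):
    """Local end-to-end sketch of the three jobs; a real deployment would
    run each step as a separate MapReduce job over HDFS data."""
    # Job 1: per-feature statistics (simplified here to mean and std for scaling).
    mu = raw_samples.mean(axis=0)
    sd = raw_samples.std(axis=0) + 1e-12

    # Job 2: kernel-approximation transform into the new feature space.
    Z = random_fourier_features((raw_samples - mu) / sd, D)

    # Job 3: randomly distribute rows to shards, train per shard in the
    # "reducers", then average the weight vectors (parameter mixing).
    idx = np.random.permutation(len(Z))
    parts = np.array_split(idx, shards)
    ws = [pegasos_shard(Z[p], labels[p], lam, seed=k) for k, p in enumerate(parts)]
    return np.mean(ws, axis=0)
```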
39. SVM: Implementation examples
   § SVM used by a large entertainment company for customer segmentation
   • Web logs containing browsing information mined for customer attributes like gender and age
   • Raw Omniture logs stored in Hadoop
   • Models built on ~10 billion rows and 1 million features
   • Models used to improve the inventory value of the company's web properties for publishers
40. Questions?
