Svm map reduce_slides

Slides from "Support Vector Machines in MapReduce" Meetup.

Published in: Technology, Education

Transcript

  • 1. Support Vector Machines in MapReduce. Presented by Asghar Dehghani and Sara Asher, Alpine Data Labs.
  • 2. Overview
    § Theory of basic SVM (binary classification, linear)
    § Generalized SVM (multi-classification)
    § MapReducing SVM
    § Handling kernels (nonlinear SVM) in MapReduce
    § Demo
  • 3. Background on SVM
    § Given a bunch of points…
  • 4. Background on SVM
    § How do we classify a new point?
  • 5. Background on SVM
    § Split the space using a hyper-plane
  • 6. Background on SVM
    § Split the space using a hyper-plane
  • 7. Background on SVM
    § Split the space using a hyper-plane
  • 8. Background on SVM
    § Which plane do you use?
  • 9. Background on SVM
    § Margin: distance from the closest points to the hyper-plane
    § Idea: among the set of hyper-planes, choose the one that maximizes the margin ρ
    [Figure: separating hyper-plane with margin ρ and support vectors (SVs)]
  • 10. Background on SVM
    • The hyper-plane is represented by $w^T x + b = 0$; the two classes fall in the regions $w^T x + b > \rho$ and $w^T x + b < -\rho$.
    • We want to choose the $w$ and $b$ that maximize the margin $\rho$.
    • Using some algebra and some rescaling, we can show that for the support vectors: $\text{margin} = \frac{1}{\|w\|}$.
  • 11. Background on SVM (cont.)
    § Thus the goal is to solve the following optimization problem:
      $\underset{w,b}{\mathrm{argmax}}\; \rho = \frac{1}{\|w\|} \;=\; \underset{w,b}{\mathrm{argmin}}\; \|w\|$
      subject to $y_i (w^T x_i + b) \ge 1,\; i = 1..n$
      (where $y_i = 1$ or $-1$, depending on which class $x_i$ belongs to)
  • 12. Background on SVM (cont.)
    § To avoid square roots, we can make the following transformation:
      $\underset{w,b}{\mathrm{argmin}}\; \|w\| \;\Rightarrow\; \underset{w,b}{\mathrm{argmin}}\; \tfrac{1}{2}\|w\|^2$
      subject to $y_i (w^T x_i + b) \ge 1,\; i = 1..n$
    § Thus, the problem is a quadratic function minimization subject to linear constraints (well studied).
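Since this is a standard quadratic program, any off-the-shelf constrained solver can handle small instances. Below is a minimal sketch, not from the deck, using SciPy's SLSQP solver on a tiny made-up dataset; the data and the choice of solver are illustrative assumptions.

```python
# Hedged sketch (not from the deck): solve the hard-margin primal QP
#   min (1/2)||w||^2  subject to  y_i (w^T x_i + b) >= 1
# on a toy 2-D dataset with a generic constrained solver.
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])  # toy, linearly separable
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):
    w = wb[:-1]                      # last entry of wb is the bias b
    return 0.5 * w.dot(w)

constraints = [
    {"type": "ineq",                 # SLSQP: "ineq" means fun(wb) >= 0
     "fun": (lambda wb, i=i: y[i] * (wb[:-1].dot(X[i]) + wb[-1]) - 1.0)}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               constraints=constraints, method="SLSQP")
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```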
  • 13. Background on SVM (cont.)
    § What happens if the data is not linearly separable? (i.e., there is no hyper-plane that will split the data exactly)
  • 14. Background on SVM (cont.)
      $\underset{w,b}{\mathrm{argmin}}\; \tfrac{1}{2}\|w\|^2$  subject to $y_i (w^T x_i + b) \ge 1,\; i = 1..n$
    • Slack variables $\xi_i$ are added to the constraints.
    • $\xi_i$ is the distance from $x_i$ to its class boundary.
  • 15. Background on SVM (cont.)
      $\underset{w,b}{\mathrm{argmin}}\; \tfrac{1}{2}\|w\|^2$  subject to $y_i (w^T x_i + b) \ge 1,\; i = 1..n$
      ⇓ (add slack)
      $\underset{w,b}{\mathrm{argmin}}\; \Big(\tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i\Big)$  subject to $y_i (w^T x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0,\; i = 1..n$
    • Slack variables $\xi_i$ are added to the constraints.
    • $\xi_i$ is the distance from $x_i$ to its class boundary.
    • C is the regularization parameter, which controls the bias–variance trade-off (the significance of outliers).
    [Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.]
  • 16. Background on SVM (cont.)
      $\underset{w,b}{\mathrm{argmin}}\; \Big(\tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i\Big)$  subject to $y_i (w^T x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0,\; i = 1..n$
    Question: how do we get rid of the constraints?
  • 17. Background on SVM (cont.)
    Answer: Fenchel duality and representer theorems!
      $\underset{w,b}{\mathrm{argmin}}\; \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{n}\underbrace{\max\big(0,\, 1 - y_i(w^T x_i + b)\big)}_{\text{hinge loss}}$
    We've removed the constraint! SVM minimizes the "L2-regularized hinge."
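The unconstrained form lends itself directly to simple (sub)gradient methods. Below is a hedged single-machine sketch of batch subgradient descent on the L2-regularized hinge loss; the learning rate, lambda, and function names are illustrative assumptions, and this is not the solver the deck actually uses (that is Pegasos, introduced later).

```python
# Hedged sketch: batch subgradient descent on the L2-regularized hinge loss
#   lambda/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i (w^T x_i + b))
# Illustrative only; the deck's actual solver is Pegasos (slide 24).
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X.dot(w) + b)
        active = margins < 1.0                      # samples currently violating the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```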
  • 18. Background on SVM (cont.)
    § What happens in the multi-class situation? There are different ways to handle multi-classification:
    • One vs. all
    • One vs. one
    • Cost-sensitive hinge (Crammer and Singer, 2001)
  • 19. Background on SVM (cont.)
    Cost-sensitive formulation of the hinge loss (Crammer and Singer, 2001):
      $\underset{w,b}{\mathrm{argmin}}\; \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{n}\underbrace{\max\big(0,\, 1 + f^r(x_i) - f^t(x_i)\big)}_{\text{multi-class hinge}}$
    where
      $f^r(x_i) = \max_{i \in Y,\, i \ne t}\,(w_i x_i + b_i)$  (the best wrong-class score)
      $f^t(x_i) = w_t x_i + b_t$  (the true-class score)
    This loss function is called the "cost-sensitive hinge." The prediction function is:
      $f(x) = \underset{i \in Y}{\mathrm{argmax}}\,(w_i x + b_i)$
    [Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265–292.]
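As a concrete illustration of the formulas above, here is a hedged sketch of the cost-sensitive hinge for a single sample and the argmax prediction rule; W (one weight row per class), B, and the function names are assumptions made for illustration only.

```python
# Hedged sketch: Crammer-Singer style cost-sensitive hinge for one sample.
# W is (num_classes, num_features), B is (num_classes,), t is the true class index.
import numpy as np

def multiclass_hinge(W, B, x, t):
    scores = W.dot(x) + B                    # f_i(x) = w_i . x + b_i for every class i
    wrong = np.delete(scores, t)             # scores of all classes except the true one
    f_r = wrong.max()                        # best wrong-class score
    f_t = scores[t]                          # true-class score
    return max(0.0, 1.0 + f_r - f_t)         # zero only if the true class wins by a margin of 1

def predict(W, B, x):
    return int(np.argmax(W.dot(x) + B))      # f(x) = argmax_i (w_i . x + b_i)
```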
  • 20. SVM: Implementation
    We now have the function we need to optimize. But how do we parallelize this for the MapReduce framework?
  • 21. SVM: Implementation
    We now have the function we need to optimize. But how do we parallelize this for the MapReduce framework?
    "Parallelized Stochastic Gradient Descent" by Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li (NIPS 2010).
  • 22. Parallelized Stochastic Gradient Descent – Theory
  • 23. Parallelized Stochastic Gradient Descent – Theory
    § Conditions:
    • The SVM loss function has a bounded gradient
    • The solver is stochastic
    § Result:
    • You can break the original sample into randomly distributed subsamples and solve on each subsample.
    • The convex combination of the sub-solutions will be the same as the solution for the original sample.
  • 24. Optimization
    § Conditions:
    • The SVM loss function has a bounded gradient
    • The solver is stochastic
    § Loss: cost-sensitive hinge
    • Crammer, K. and Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265–292.
    § Solver: Pegasos
    • Shalev-Shwartz, S., Singer, Y., and Srebro, N. (2007). Pegasos: primal estimated sub-gradient solver for SVM. ICML, 807–814.
    § Use the mapper to randomly distribute the samples, and use the reducer to iterate on each sub-sample (a single-machine sketch of this scheme follows below).
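Below is a hedged, single-process sketch of that scheme: the "mapper" assigns each sample a random partition key, each "reducer" runs a Pegasos-style sub-gradient solver on its partition, and the sub-solutions are averaged. The partition count, lambda, and iteration budget are illustrative assumptions, not Alpine's production values.

```python
# Hedged sketch of parallelized SGD for SVM (Zinkevich et al., NIPS 2010 scheme):
# randomly partition the data (map), run Pegasos on each partition (reduce),
# then average the per-partition weight vectors. Illustrative, not Alpine's code.
import numpy as np

def pegasos(X, y, lam=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, iters + 1):
        i = rng.integers(n)                     # stochastic: one random sample per step
        eta = 1.0 / (lam * t)
        if y[i] * X[i].dot(w) < 1.0:            # margin violated: hinge term is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

def mapreduce_svm(X, y, num_partitions=4, lam=0.01, seed=0):
    rng = np.random.default_rng(seed)
    keys = rng.integers(num_partitions, size=len(y))   # "mapper": random partition key per sample
    subs = [pegasos(X[keys == k], y[keys == k], lam)   # "reducers": solve each sub-sample
            for k in range(num_partitions)]
    return np.mean(subs, axis=0)                       # convex combination of sub-solutions
```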
  • 25. SVM: Non-separable data
    But what about data that is not linearly separable?
  • 26. SVM: Non-separable data
    But what about data that is not linearly separable?
    Idea: transform the pattern space to a higher-dimensional space, called the feature space, which is linearly separable.
  • 27. SVM: Non-separable data
    But what about data that is not linearly separable?
    Idea: transform the pattern space to a higher-dimensional space, called the feature space, which is linearly separable.
  • 28. SVM: Non-separable data
    But what about data that is not linearly separable?
    Idea: transform the pattern space to a higher-dimensional space, called the feature space, which is linearly separable.
  • 29. SVM: Kernels
    § Two questions:
    • What kind of function is a kernel?
    • What kernel is appropriate for a specific problem?
    § The answers:
    • Mercer's theorem: every positive semi-definite symmetric function is a kernel
    • It depends on the problem.
    [http://www.ism.ac.jp/~fukumizu/H20_kernel/Kernel_7_theory.pdf]
  • 30. SVM: Kernels
    § Examples of popular kernel functions:
    • Gaussian kernel: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$
    • Laplacian kernel: $K(x_i, x_j) = e^{-\|x_i - x_j\| / \theta}$
    • Polynomial kernel: $K(x_i, x_j) = (a\, x_i^T x_j + b)^d$
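For concreteness, a hedged sketch of these kernels as plain functions; sigma, theta, a, b, and d are hyperparameters, and the Laplacian is written in its standard exponential form (an assumption, since the slide image is not reproduced here).

```python
# Hedged sketch of the kernels listed above; all parameters are hyperparameters.
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def laplacian_kernel(xi, xj, theta=1.0):
    return np.exp(-np.linalg.norm(xi - xj) / theta)   # standard Laplacian form (assumed)

def polynomial_kernel(xi, xj, a=1.0, b=1.0, d=3):
    return (a * np.dot(xi, xj) + b) ** d
```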
  • 31. SVM: Kernels
    § The kernel (dual) feature space is defined by the inner products between each $x_i$ and $x_j$
    § The kernel matrix is N × N, where N is the number of samples
    § As your sample size goes up, the kernel matrix gets huge!
    § Moreover, working with the kernel matrix does not map well onto MapReduce
    ⇒ The dual space is not feasible at scale
  • 32. SVM: Implementation
    § Question: how can we have a non-linear SVM without paying the price of duality?
  • 33. SVM: Implementation
    § Question: how can we have a non-linear SVM without paying the price of duality?
    § Claim: for certain kernel functions we can find a feature map $z$ such that minimizing
      $\underset{w,b}{\mathrm{argmin}}\; \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{n}\underbrace{\max\big(0,\, 1 - y_i(w^T z(x_i) + b)\big)}_{\text{hinge loss}}$
    gives us the non-linear SVM.
  • 34. SVM: Implementation
    § "Random Features for Large-Scale Kernel Machines" by Ali Rahimi and Ben Recht (NIPS 2007) – can approximate shift-invariant kernels
    § "Random Feature Maps for Dot Product Kernels" by Purushottam Kar and Harish Karnick (AISTATS 2012) – can approximate dot-product kernels
  • 35. Approximating a shift-invariant kernel ("Random Features for Large-Scale Kernel Machines")
    Given a positive definite shift-invariant kernel $K(x, y) = k(x - y)$, we can create a randomized feature map $Z : \mathbb{R}^d \to \mathbb{R}^D$ such that $Z(x)^T Z(y) \approx K(x - y)$:
    • Compute the Fourier transform $p$ of the kernel $k$: $p(\omega) = \frac{1}{2\pi}\int e^{-j\,\omega^T \delta}\, k(\delta)\, d\delta$
    • Draw D i.i.d. samples $\omega_1, \ldots, \omega_D \in \mathbb{R}^d$ from $p$.
    • Draw D i.i.d. samples $b_1, \ldots, b_D \in \mathbb{R}$ from the uniform distribution on $[0, 2\pi]$.
    • $Z : x \mapsto \sqrt{\tfrac{2}{D}}\,\big[\cos(\omega_1^T x + b_1), \ldots, \cos(\omega_D^T x + b_D)\big]^T$
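For the Gaussian kernel the Fourier transform p is itself Gaussian, so the recipe above collapses to a few lines. A hedged sketch, with D and sigma as illustrative assumptions:

```python
# Hedged sketch of Rahimi & Recht random Fourier features for the Gaussian kernel
# K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); then Z(x).Z(y) approximates K(x, y).
import numpy as np

def random_fourier_features(d, D, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / sigma, size=(D, d))   # draws from the kernel's Fourier transform
    b = rng.uniform(0.0, 2 * np.pi, size=D)              # uniform phases on [0, 2*pi]
    def Z(x):
        return np.sqrt(2.0 / D) * np.cos(omega.dot(x) + b)
    return Z

# Usage: Z = random_fourier_features(d=10, D=500); then train a *linear* SVM on Z(x).
```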
  • 36. SVM: Implementation
  • 37. Approximating a dot-product kernel ("Random Feature Maps for Dot Product Kernels")
    Given a positive definite dot-product kernel $K(x, y) = f(\langle x, y\rangle)$, we can create a randomized feature map $Z : \mathbb{R}^d \to \mathbb{R}^D$ such that $\langle Z(x), Z(y)\rangle \approx K(x, y)$:
    • Obtain the Maclaurin expansion of $f(x) = \sum_{n=0}^{\infty} a_n x^n$ by setting $a_n = \frac{f^{(n)}(0)}{n!}$
    • Fix a value $p > 1$. For $i = 1$ to $D$:
      – Choose a non-negative integer $N$ with $P[N = n] = \frac{1}{p^{n+1}}$
      – Choose $N$ vectors $\omega_1, \ldots, \omega_N \in \{-1, 1\}^d$, selecting each coordinate using fair coin tosses
      – Let the feature map be $Z_i : x \mapsto \sqrt{a_N\, p^{N+1}}\,\prod_{j=1}^{N} \omega_j^T x$
    • $Z : x \mapsto \frac{1}{\sqrt{D}}\big(Z_1(x), \ldots, Z_D(x)\big)$
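A hedged sketch of this construction for the illustrative dot-product kernel K(x, y) = (⟨x, y⟩ + 1)², whose Maclaurin coefficients are simply [1, 2, 1]; the choice of kernel, p = 2, and D are assumptions for illustration, not values from the deck.

```python
# Hedged sketch: Kar & Karnick random Maclaurin features for a dot-product kernel.
# Example kernel: K(x, y) = (x.y + 1)^2 = 1 + 2*(x.y) + (x.y)^2, i.e. coeffs a = [1, 2, 1].
import numpy as np

def random_maclaurin_features(d, D, coeffs=(1.0, 2.0, 1.0), p=2.0, seed=0):
    rng = np.random.default_rng(seed)
    params = []
    for _ in range(D):
        # With p = 2, N = geometric(1/2) - 1 gives exactly P[N = n] = 1 / p^(n+1).
        N = int(rng.geometric(1.0 - 1.0 / p)) - 1
        a_N = coeffs[N] if N < len(coeffs) else 0.0      # Maclaurin coefficient a_N (0 beyond degree 2 here)
        omegas = rng.choice([-1.0, 1.0], size=(N, d))    # N Rademacher vectors (fair coin tosses)
        params.append((a_N, N, omegas))
    def Z(x):
        feats = [np.sqrt(a_N * p ** (N + 1)) * np.prod(omegas.dot(x))   # Z_i(x) = sqrt(a_N p^(N+1)) prod_j w_j.x
                 for a_N, N, omegas in params]
        return np.array(feats) / np.sqrt(D)
    return Z

# Usage: Z = random_maclaurin_features(d=10, D=1000); then Z(x).dot(Z(y)) ~ (x.dot(y) + 1)**2.
```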
  • 38. SVM: Implementation Summary
    Using these approximations, we can now treat this as a linear SVM problem:
    (1) Job 1 – compute statistics for the features and classes (mean, variance, class cardinality, etc.)
    (2) Job 2 – transform the samples by the approximate kernel map and compute statistics for the new feature space.
    (3) Job 3 – randomly distribute the new samples and train the model in the reducers.
    We can use MapReduce to solve the non-linear multi-classification SVM! (A combined sketch of the three jobs follows below.)
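Putting the pieces together, here is a hedged single-machine stand-in for the three-job pipeline, reusing the random_fourier_features and mapreduce_svm sketches from the earlier slides; the standardization step as a proxy for Job 1 and all parameter values are illustrative assumptions.

```python
# Hedged end-to-end sketch of the three-job pipeline (single-machine stand-in).
# Reuses random_fourier_features (slide 35 sketch) and mapreduce_svm (slide 24 sketch).
import numpy as np

def fit_nonlinear_svm(X, y, D=500, sigma=1.0, num_partitions=4, lam=0.01):
    # "Job 1": per-feature statistics, used here just to standardize the data.
    X_std = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # "Job 2": transform samples with the approximate (random Fourier) kernel map.
    Z = random_fourier_features(X_std.shape[1], D, sigma)
    X_feat = np.vstack([Z(x) for x in X_std])
    # "Job 3": randomly distribute the transformed samples and train in the "reducers".
    w = mapreduce_svm(X_feat, y, num_partitions=num_partitions, lam=lam)
    return Z, w

# Prediction for a new (standardized) point x: sign(w.dot(Z(x))).
```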
  • 39. SVM: Implementation examples
    § SVM used by a large entertainment company for customer segmentation
    • Web logs containing browsing information are mined for customer attributes like gender and age
    • Raw Omniture logs are stored in Hadoop
    • Models built on ~10 billion rows and 1 million features
    • Models used to improve the inventory value of the company's web properties for publishers
  • 40. Questions?
