
# Support Vector Machines in MapReduce (slides)

Slides from "Support Vector Machines in MapReduce" Meetup.

Published in: Technology, Education

### Transcript of "Support Vector Machines in MapReduce"

1. Support Vector Machines in MapReduce. Presented by Asghar Dehghani and Sara Asher, Alpine Data Labs.
2. Overview
   • Theory of basic SVM (binary classification, linear)
   • Generalized SVM (multi-classification)
   • MapReducing SVM
   • Handling kernels (nonlinear SVM) in MapReduce
   • Demo
3. Background on SVM: given a bunch of points…
4. Background on SVM: how do we classify a new point?
5. Background on SVM: split the space using a hyperplane.
6. Background on SVM: split the space using a hyperplane.
7. Background on SVM: split the space using a hyperplane.
8. Background on SVM: which hyperplane do you use?
9. Background on SVM: the margin ρ is the distance from the closest points (the support vectors) to the hyperplane. Idea: among the set of separating hyperplanes, choose the one that maximizes the margin.
10. Background on SVM: the hyperplane is represented by w^T x + b = 0, with w^T x + b ≥ ρ on one side and w^T x + b ≤ -ρ on the other. We want to choose the w and b that maximize the margin ρ. Using some algebra and some rescaling, we can show that for the support vectors the margin is 1/||w||.
11. Background on SVM (cont.): the goal is thus to solve the following optimization problem:
    argmax_{w,b} ρ = 1/||w||, equivalently argmin_{w,b} ||w||,
    subject to y_i (w^T x_i + b) ≥ 1, i = 1..n
    (where y_i = 1 or -1, depending on the class of x_i).
12. Background on SVM (cont.): to avoid square roots, we can apply the following transformation:
    argmin_{w,b} ||w||  ⟹  argmin_{w,b} (1/2)||w||^2,
    subject to y_i (w^T x_i + b) ≥ 1, i = 1..n.
    The problem is thus minimizing a quadratic function subject to linear constraints (a well-studied problem).
13. Background on SVM (cont.): what happens if the data is not linearly separable? (i.e., there is no hyperplane that splits the data exactly)
14. Background on SVM (cont.): slack variables ξ_i are added to the constraints; ξ_i is the distance from x_i to its class boundary.
    argmin_{w,b} (1/2)||w||^2, subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1..n.
15. Background on SVM (cont.): with the slack variables ξ_i added, C is the regularization parameter, which controls the bias-variance trade-off (the significance of outliers).
    argmin_{w,b} (1/2)||w||^2 + C Σ_{i=1}^n ξ_i, subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1..n.
    (Cortes, Corinna, and Vladimir Vapnik, 1995. Support-vector networks. Machine Learning, 20(3), 273-297.)
16. Background on SVM (cont.): question: how do we get rid of the constraints?
    argmin_{w,b} (1/2)||w||^2 + C Σ_{i=1}^n ξ_i, subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1..n.
17. Background on SVM (cont.): answer: Fenchel duality and the representer theorem!
    argmin_{w,b} (λ/2)||w||^2 + Σ_{i=1}^n max(0, 1 - y_i (w^T x_i + b))
    The term max(0, 1 - y_i (w^T x_i + b)) is the hinge loss. We've removed the constraint: SVM minimizes the "L2-regularized hinge."
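In code, the L2-regularized hinge objective on this slide is a few lines of NumPy; a minimal sketch (function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def l2_hinge_objective(w, b, X, y, lam):
    """(lam/2)||w||^2 + sum_i max(0, 1 - y_i (w^T x_i + b)), with y_i in {-1, +1}."""
    margins = y * (X @ w + b)                    # signed margin of each sample
    hinge = np.maximum(0.0, 1.0 - margins).sum() # only margin violators contribute
    return 0.5 * lam * np.dot(w, w) + hinge
```

Points with margin at least 1 contribute nothing, so on separable data the optimum is driven entirely by the regularizer.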
18. Background on SVM (cont.): what happens in the multi-class situation? There are different ways to handle multi-classification:
    • One vs. all
    • One vs. one
    • Cost-sensitive hinge (Crammer and Singer 2001)
19. Background on SVM (cont.): cost-sensitive formulation of the hinge loss (Crammer and Singer 2001):
    argmin_{w,b} (λ/2)||w||^2 + Σ_{i=1}^n max(0, 1 + f^r(x_i) - f^t(x_i))
    where f^r(x_i) = max_{r ∈ Y, r ≠ t} (w_r x_i + b_r) is the best score among the wrong classes and f^t(x_i) = w_t x_i + b_t is the true-class score. This loss function is called the "cost-sensitive hinge," and the prediction function is f(x) = argmax_{r ∈ Y} (w_r x + b_r).
    (Crammer, K. & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.)
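The cost-sensitive hinge for a single sample follows directly from the formulas above; a minimal NumPy sketch (here `W` holds one weight row per class and `B` one bias per class; the names are mine):

```python
import numpy as np

def multiclass_hinge(W, B, x, t):
    """max(0, 1 + f_r(x) - f_t(x)): f_t is the true-class score,
    f_r the best score among the wrong classes."""
    scores = W @ x + B
    f_t = scores[t]
    f_r = np.delete(scores, t).max()
    return max(0.0, 1.0 + f_r - f_t)

def predict(W, B, x):
    """f(x) = argmax_r (w_r x + b_r)."""
    return int(np.argmax(W @ x + B))
```

The loss is zero only when the true class beats every other class by at least 1, which is the multi-class analogue of the binary margin condition.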
20. SVM: Implementation. We now have the function we need to optimize. But how do we parallelize this for the MapReduce framework?
21. SVM: Implementation. We now have the function we need to optimize. But how do we parallelize this for the MapReduce framework?
    Answer: "Parallelized Stochastic Gradient Descent" by Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li, NIPS 2010.
22. Parallelized Stochastic Gradient Descent: Theory
23. Parallelized Stochastic Gradient Descent: Theory
    • Conditions: the SVM loss function has a bounded gradient, and the solver is stochastic.
    • Result: you can break the original sample into randomly distributed subsamples and solve on each subsample. A convex combination of the sub-solutions approximates the solution for the original sample.
24. Optimization
    • Conditions: the SVM loss function has a bounded gradient, and the solver is stochastic.
    • Loss: cost-sensitive hinge. (Crammer, K. & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265-292.)
    • Solver: Pegasos. (Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated sub-gradient solver for SVM. ICML, 807-814.)
    • Use the mapper to randomly distribute the samples, and use the reducer to iterate on each sub-sample.
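The recipe on this slide (the mapper scatters samples into random shards, each reducer runs a stochastic sub-gradient solver, and the sub-solutions are combined) can be simulated in a single process. This sketch uses a Pegasos-style step size on the binary hinge for brevity, not the cost-sensitive multi-class loss; all names are illustrative:

```python
import numpy as np

def sgd_svm_shard(X, y, lam=0.01, epochs=20, seed=0):
    """Stochastic sub-gradient descent on (lam/2)||w||^2 + hinge for one shard,
    with the Pegasos step size eta_t = 1/(lam * t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            if y[i] * (w @ X[i]) < 1:            # margin violated: hinge is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w
    return w

def parallel_sgd(X, y, shards=2, **kw):
    """'Map': scatter samples into random shards. 'Reduce': solve each shard,
    then take a convex combination (here a plain average) of the solutions."""
    rng = np.random.default_rng(42)
    parts = np.array_split(rng.permutation(len(X)), shards)
    ws = [sgd_svm_shard(X[p], y[p], seed=s, **kw) for s, p in enumerate(parts)]
    return np.mean(ws, axis=0)
```

Each shard is solved independently, which is exactly what makes the reduce-side averaging embarrassingly parallel.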
25. SVM: Non-separable data. But what about data that is not linearly separable?
26. SVM: Non-separable data. But what about data that is not linearly separable? Idea: transform the pattern space to a higher-dimensional space, called the feature space, which is linearly separable.
27. SVM: Non-separable data. Idea: transform the pattern space to a higher-dimensional space, called the feature space, which is linearly separable.
28. SVM: Non-separable data. Idea: transform the pattern space to a higher-dimensional space, called the feature space, which is linearly separable.
29. SVM: Kernels
    • Two questions: What kind of function is a kernel? What kernel is appropriate for a specific problem?
    • The answers: Mercer's theorem: every positive semi-definite symmetric function is a kernel. And: it depends on the problem.
    (http://www.ism.ac.jp/~fukumizu/H20_kernel/Kernel_7_theory.pdf)
30. SVM: Kernels. Examples of popular kernel functions:
    • Gaussian kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
    • Laplacian kernel: K(x_i, x_j) = exp(-||x_i - x_j|| / θ)
    • Polynomial kernel: K(x_i, x_j) = (a x_i^T x_j + b)^d
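For concreteness, the three kernels as NumPy one-liners, with the parameter names σ, θ, a, b, d from the slide (the transcript's Laplacian formula is garbled, so the standard form exp(-||x - y|| / θ) is assumed here):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def laplacian_kernel(x, y, theta=1.0):
    return np.exp(-np.linalg.norm(x - y) / theta)

def polynomial_kernel(x, y, a=1.0, b=1.0, d=3):
    return (a * np.dot(x, y) + b) ** d
```

The Gaussian and Laplacian kernels depend only on x - y (shift-invariant), while the polynomial kernel depends only on the dot product; this distinction is exactly what determines which approximation scheme applies later in the deck.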
31. SVM: Kernels
    • The kernel (dual) feature space is defined by the inner products between each x_i and x_j.
    • The kernel matrix is N × N, where N is the number of samples.
    • As your sample size goes up, the kernel matrix gets huge, and it does not map well onto MapReduce.
    ⇒ The dual space is not feasible at scale.
32. SVM: Implementation. Question: how can we have a nonlinear SVM without paying the price of duality?
33. SVM: Implementation. Question: how can we have a nonlinear SVM without paying the price of duality?
    Claim: for certain kernel functions we can find a feature map z such that we can solve the linear problem
    argmin_{w,b} (λ/2)||w||^2 + Σ_{i=1}^n max(0, 1 - y_i (w^T z(x_i) + b))
34. SVM: Implementation
    • "Random Features for Large-Scale Kernel Machines" by Ali Rahimi and Ben Recht, NIPS 2007: can approximate shift-invariant kernels.
    • "Random Feature Maps for Dot Product Kernels" by Purushottam Kar and Harish Karnick, AISTATS 2012: can approximate dot-product kernels.
35. Approximating a shift-invariant kernel ("Random Features for Large-Scale Kernel Machines"): given a positive definite shift-invariant kernel K(x, y) = k(x - y), we can create a randomized feature map Z : R^d → R^D such that Z(x)^T Z(y) ≈ K(x - y).
    • Compute the Fourier transform p of the kernel k: p(ω) = (1/2π) ∫ e^{-j ω^T δ} k(δ) dδ.
    • Draw D iid samples ω_1, ..., ω_D ∈ R^d from p.
    • Draw D iid samples b_1, ..., b_D ∈ R from the uniform distribution on [0, 2π].
    • Z : x → sqrt(2/D) [cos(ω_1^T x + b_1), ..., cos(ω_D^T x + b_D)]^T.
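The four steps above, specialized to the Gaussian kernel, whose Fourier transform is again Gaussian, so ω is drawn from N(0, σ⁻²I); a minimal sketch with illustrative names:

```python
import numpy as np

def rff_map(X, D, sigma=1.0, seed=0):
    """Random Fourier features for exp(-||x - y||^2 / (2 sigma^2)).
    Returns an (n, D) matrix Z with Z(x) . Z(y) ~= K(x, y)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    omega = rng.normal(0.0, 1.0 / sigma, size=(d, D))  # samples from the kernel's Fourier transform
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)          # uniform phases on [0, 2pi]
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)
```

The approximation is a Monte Carlo estimate, so its error shrinks on the order of 1/sqrt(D); a few thousand features already track the exact kernel closely.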
36. SVM: Implementation
37. Approximating a dot-product kernel ("Random Feature Maps for Dot Product Kernels"): given a positive definite dot-product kernel K(x, y) = f(⟨x, y⟩), we can create a randomized feature map Z : R^d → R^D such that ⟨Z(x), Z(y)⟩ ≈ K(x, y).
    • Obtain the Maclaurin expansion f(x) = Σ_{n=0}^∞ a_n x^n by setting a_n = f^(n)(0) / n!.
    • Fix a value p > 1. For i = 1 to D:
      • Choose a non-negative integer N with P[N = n] = 1/p^{n+1}.
      • Choose N vectors ω_1, ..., ω_N ∈ {-1, 1}^d, selecting each coordinate using fair coin tosses.
      • Let the feature map be Z_i : x → sqrt(a_N p^{N+1}) Π_{j=1}^N ω_j^T x.
    • Z : x → (1/sqrt(D)) (Z_1(x), ..., Z_D(x)).
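A sketch of the procedure above with p = 2, so that P[N = n] = 2^{-(n+1)} is exactly a geometric distribution. Maclaurin coefficients beyond the supplied list are treated as zero, and all names are mine:

```python
import numpy as np

def maclaurin_features(X, D, coeffs, seed=0):
    """Random Maclaurin features for K(x, y) = f(<x, y>),
    f(z) = sum_n coeffs[n] z^n, using p = 2."""
    p = 2.0
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    Z = np.zeros((n_samples, D))
    for i in range(D):
        N = rng.geometric(1.0 / p) - 1            # P[N = n] = (1/2)^(n+1), n >= 0
        a_N = coeffs[N] if N < len(coeffs) else 0.0
        if a_N == 0.0:
            continue                              # this random feature is identically zero
        feat = np.full(n_samples, np.sqrt(a_N * p ** (N + 1)))
        for _ in range(N):
            omega = rng.choice([-1.0, 1.0], size=d)  # fair coin toss per coordinate
            feat *= X @ omega
        Z[:, i] = feat
    return Z / np.sqrt(D)
```

Since E[(ω^T x)(ω^T y)] = ⟨x, y⟩ for Rademacher ω, each feature product is an unbiased estimate of f(⟨x, y⟩), and averaging D of them drives down the variance.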
38. SVM: Implementation Summary. Using these approximations, we can now treat this as a linear SVM problem.
    (1) Job 1: compute statistics for each feature and class (mean, variance, class cardinality, etc.).
    (2) Job 2: transform the sample by the approximate kernel map and compute statistics for the new feature space.
    (3) Job 3: randomly distribute the new samples and train the model in the reducer.
    We can use MapReduce to solve nonlinear multi-classification SVM!
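To make the "mapper scatters, reducer trains" shape of Job 3 concrete, here is a tiny in-process stand-in for a MapReduce job. This is purely illustrative: `run_job`, `scatter_mapper`, and `train_shard` are hypothetical helpers, not Alpine's or Hadoop's actual APIs, and the reducer body is a placeholder:

```python
import random
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Minimal in-memory MapReduce: group mapper output by key, reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {key: reducer(key, vals) for key, vals in groups.items()}

def scatter_mapper(sample, shards=4):
    """Job 3 mapper: send each sample to a uniformly random shard."""
    yield random.randrange(shards), sample

def train_shard(key, samples):
    """Job 3 reducer placeholder: a real reducer would run the Pegasos
    iterations on its shard and emit a sub-model; here we just report size."""
    return len(samples)
```

A final pass would average the emitted sub-models, as in the parallelized SGD result earlier in the deck.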
39. SVM: Implementation examples. SVM used by a large entertainment company for customer segmentation:
    • Web logs containing browsing information mined for customer attributes like gender and age
    • Raw Omniture logs stored in Hadoop
    • Models built on ~10 billion rows and 1 million features
    • Models used to improve the inventory value of the company's web properties for publishers
40. Questions?