Slides from "Support Vector Machines in MapReduce" Meetup.

- 1. Support Vector Machines in MapReduce. Presented by Asghar Dehghani and Sara Asher, Alpine Data Labs.
- 2. Overview: theory of basic SVM (binary classification, linear); generalized SVM (multi-classification); MapReducing SVM; handling kernels (nonlinear SVM) in MapReduce; demo.
- 3. Background on SVM: Given a bunch of points…
- 4. Background on SVM: How do we classify a new point?
- 5. Background on SVM: Split the space using a hyper-plane.
- 6. Background on SVM: Split the space using a hyper-plane.
- 7. Background on SVM: Split the space using a hyper-plane.
- 8. Background on SVM: Which plane do you use?
- 9. Background on SVM: The margin ρ is the distance from the closest points (the support vectors, SVs) to the hyper-plane. Idea: among the set of separating hyper-planes, choose the one that maximizes the margin.
- 10. Background on SVM: The hyper-plane is represented by w^T x + b = 0, with the two classes lying in the regions w^T x + b > ρ and w^T x + b < -ρ. We want to choose the w and b that maximize the margin ρ. Using some algebra and some rescaling, we can show that for the support vectors the margin is 1/||w||.
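The decision rule and margin on this slide can be checked numerically. Below is a minimal sketch (ours, not the presenters' code) with a hypothetical w and b:

```python
import numpy as np

w = np.array([2.0, -1.0])          # hypothetical weight vector
b = 0.5                            # hypothetical bias

X = np.array([[1.0, 0.0], [-1.0, 2.0], [3.0, 1.0]])
predictions = np.sign(X @ w + b)   # classify by which side of the hyper-plane a point falls on
margin = 1.0 / np.linalg.norm(w)   # margin after the rescaling used for the support vectors

print(predictions, margin)
```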
- 11. Background on SVM (cont.): Thus the goal is to solve the following optimization problem: argmax_{w,b} ρ = 1/||w||, which is the same as argmin_{w,b} ||w||, subject to y_i (w^T x_i + b) ≥ 1, i = 1..n (where y_i = 1 or -1, depending on the class of x_i).
- 12. Background on SVM (cont.): To avoid square roots, we can transform argmin_{w,b} ||w|| into argmin_{w,b} (1/2)||w||^2, with the same constraints y_i (w^T x_i + b) ≥ 1, i = 1..n. Thus the problem is a quadratic minimization subject to linear constraints (well studied).
- 13. Background on SVM (cont.): What happens if the data is not linearly separable? (i.e., there is no hyper-plane that will split the data exactly)
- 14. Background on SVM (cont.): Start from argmin_{w,b} (1/2)||w||^2 subject to y_i (w^T x_i + b) ≥ 1, i = 1..n. Slack variables ξ_i are added to the constraints; ξ_i is the distance from x_i to its class boundary.
- 15. Background on SVM (cont.): Adding slack gives argmin_{w,b} (1/2)||w||^2 + C Σ_{i=1..n} ξ_i, subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1..n. C is the regularization parameter, which controls the bias-variance trade-off (the significance of outliers). [Cortes, Corinna, and Vladimir Vapnik (1995). Support-vector networks. Machine Learning, 20(3), 273–297.]
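As a rough illustration of this constrained soft-margin problem, here is a minimal sketch that hands the primal QP to a generic convex solver (cvxpy); the solver choice and the synthetic data are ours, not the presenters':

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                      # 40 synthetic samples, 2 features
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)    # labels in {+1, -1}
C = 1.0                                           # regularization parameter

w = cp.Variable(2)
b = cp.Variable()
xi = cp.Variable(40)                              # slack variables, one per sample

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```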
- 16. Background on SVM (cont.): argmin_{w,b} (1/2)||w||^2 + C Σ_{i=1..n} ξ_i, subject to y_i (w^T x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0, i = 1..n. Question: how do we get rid of the constraints?
- 17. Background on SVM (cont.): Answer: Fenchel duality and the Representer Theorem! The constrained problem becomes argmin_{w,b} (λ/2)||w||^2 + Σ_{i=1..n} max(0, 1 - y_i (w^T x_i + b)), where the max term is the hinge loss. We've removed the constraints: SVM minimizes the "L2-regularized hinge."
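The unconstrained L2-regularized hinge objective can be minimized with plain stochastic (sub)gradient steps. The sketch below is ours, with illustrative step sizes, not the presenters' solver:

```python
import numpy as np

def hinge_sgd(X, y, lam=0.01, epochs=20, lr=0.1, seed=0):
    """Minimize (lam/2)*||w||^2 + sum_i max(0, 1 - y_i*(w.x_i + b)) by stochastic subgradient steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:      # margin violated: hinge term contributes a subgradient
                w -= lr * (lam * w - y[i] * X[i])
                b -= lr * (-y[i])
            else:                              # only the regularizer contributes
                w -= lr * lam * w
    return w, b
```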
- 18. Background on SVM (cont.): What about multi-class situations? There are different ways to handle multi-classification: one vs. all; one vs. one; cost-sensitive hinge (Crammer and Singer 2001).
- 19. Background on SVM (cont.): Cost-sensitive formulation of the hinge loss (Crammer and Singer 2001): argmin_{w,b} (λ/2)||w||^2 + Σ_{i=1..n} max(0, 1 + f^r(x_i) - f^t(x_i)), where f^t(x_i) = w_t·x_i + b_t is the score of the true class t and f^r(x_i) = max_{r ∈ Y, r ≠ t} (w_r·x_i + b_r) is the highest score among the competing classes. This loss function is called the "cost-sensitive hinge," and the prediction function is f(x) = argmax_{i ∈ Y} (w_i·x + b_i). [Crammer, K. & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 262-292.]
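A minimal sketch (ours, not code from the talk) of the cost-sensitive hinge quantities on this slide, with hypothetical parameter names W and B:

```python
import numpy as np

def multiclass_hinge_loss(W, B, x, t):
    """W: (num_classes, d) weights, B: (num_classes,) biases, t: index of the true class."""
    scores = W @ x + B
    f_t = scores[t]                            # score of the true class
    f_r = np.max(np.delete(scores, t))         # best score among the competing classes
    return max(0.0, 1.0 + f_r - f_t)

def predict(W, B, x):
    return int(np.argmax(W @ x + B))           # f(x) = argmax_i (w_i . x + b_i)
```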
- 20. SVM: Implementation. We now have the function that we need to optimize. But how do we parallelize it for a map-reduce framework?
- 21. SVM: Implementation. We now have the function that we need to optimize. But how do we parallelize it for a map-reduce framework? Answer: "Parallelized Stochastic Gradient Descent" by Martin Zinkevich, Markus Weimer, Alexander J. Smola, and Lihong Li, NIPS 2010.
- 22. Parallelized Stochastic Gradient Descent - Theory
- 23. Parallelized Stochastic Gradient Descent - Theory. Conditions: the SVM loss function has a bounded gradient, and the solver is stochastic. Result: you can break the original sample into randomly distributed subsamples and solve on each subsample; the convex combination of the sub-solutions will be the same as the solution for the original sample.
- 24. Optimization. Conditions: the SVM loss function has a bounded gradient, and the solver is stochastic. Loss: cost-sensitive hinge (Crammer, K. & Singer, Y. (2001). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 262-292). Solver: Pegasos (Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: primal estimated sub-gradient solver for SVM. ICML, 807-814). Use the mapper to randomly distribute the samples, and use the reducers to iterate on the sub-samples; a sketch follows below.
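The sketch below illustrates the scheme on this slide: shard the data at random (the mapper's job), run a stochastic solver on each shard (the reducers), and average the sub-solutions. It is our single-machine approximation of the idea, not the presenters' MapReduce code; it uses a simplified binary Pegasos without the optional projection step or a bias term:

```python
import numpy as np

def pegasos(X, y, lam=0.01, iters=1000, seed=0):
    """Simplified Pegasos: stochastic subgradient steps with the 1/(lam*t) step size, labels in {+1, -1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, iters + 1):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)
        if y[i] * (X[i] @ w) < 1:
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

def parallel_sgd(X, y, n_shards=4, lam=0.01, seed=0):
    rng = np.random.default_rng(seed)
    shards = np.array_split(rng.permutation(len(y)), n_shards)                 # "mapper": random shards
    ws = [pegasos(X[s], y[s], lam, seed=k) for k, s in enumerate(shards)]      # "reducers": one solve per shard
    return np.mean(ws, axis=0)                                                 # convex combination of sub-solutions
```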
- 25. SVM: Non-separable data. But what about data that is not linearly separable?
- 26. SVM: Non-separable data. But what about data that is not linearly separable? Idea: transform the pattern space to a higher-dimensional space, called the feature space, in which the data is linearly separable.
- 27. SVM: Non-separable data. Idea: transform the pattern space to a higher-dimensional space, called the feature space, in which the data is linearly separable.
- 28. SVM: Non-separable data. Idea: transform the pattern space to a higher-dimensional space, called the feature space, in which the data is linearly separable.
- 29. SVM: Kernels. Two questions: what kind of function is a kernel, and what kernel is appropriate for a specific problem? The answers: Mercer's theorem says every positive semi-definite symmetric function is a kernel; which kernel is appropriate depends on the problem. (http://www.ism.ac.jp/~fukumizu/H20_kernel/Kernel_7_theory.pdf)
- 30. SVM: Kernels. Examples of popular kernel functions: Gaussian kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)); Laplacian kernel K(x_i, x_j) = exp(-||x_i - x_j|| / θ); polynomial kernel K(x_i, x_j) = (a x_i^T x_j + b)^d.
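For reference, here are these three kernels written as plain functions (ours); the hyperparameters σ, θ, a, b, and d are choices you would tune per problem:

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    # exp(-||xi - xj||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2 * sigma ** 2))

def laplacian_kernel(xi, xj, theta=1.0):
    # exp(-||xi - xj|| / theta)
    return np.exp(-np.linalg.norm(xi - xj) / theta)

def polynomial_kernel(xi, xj, a=1.0, b=1.0, d=3):
    # (a * <xi, xj> + b)^d
    return (a * xi @ xj + b) ** d
```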
- 31. SVM: Kernels. The kernel (dual) feature space is defined by the inner products between each pair x_i and x_j. The kernel matrix is N × N, where N is the number of samples, so as your sample size goes up the kernel matrix gets huge, and it does not map well onto MapReduce. ⇒ The dual space is not feasible at scale.
- 32. SVM: Implementation. Question: how can we have a non-linear SVM without paying the price of duality?
- 33. SVM: Implementation. Question: how can we have a non-linear SVM without paying the price of duality? Claim: for certain kernel functions we can find a feature map z such that the problem stays in the primal form argmin_{w,b} (λ/2)||w||^2 + Σ_{i=1..n} max(0, 1 - y_i (w^T z(x_i) + b)), i.e., the hinge loss applied to the transformed features.
- 34. SVM: Implementation. "Random Features for Large-Scale Kernel Machines" by Ali Rahimi and Ben Recht, NIPS 2007, can approximate shift-invariant kernels. "Random Feature Maps for Dot Product Kernels" by Purushottam Kar and Harish Karnick, AISTATS 2012, can approximate dot-product kernels.
- 35. Approximating a shift-invariant kernel (Random Features for Large-Scale Kernel Machines): Given a positive definite shift-invariant kernel K(x, y) = f(x - y), we can create a randomized feature map Z: R^d → R^D such that Z(x)^T Z(y) ≈ K(x - y). Compute the Fourier transform p of the kernel k: p(ω) = (1/2π) ∫ e^{-j ω^T δ} k(δ) dδ. Draw D iid samples ω_1, ..., ω_D ∈ R^d from p, and D iid samples b_1, ..., b_D ∈ R from the uniform distribution on [0, 2π]. Then Z: x → sqrt(2/D) [cos(ω_1^T x + b_1), ..., cos(ω_D^T x + b_D)]^T.
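A minimal sketch (ours) of this recipe, specialized to the Gaussian kernel, whose Fourier transform is itself Gaussian; the dimensions and bandwidth are illustrative assumptions:

```python
import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    """Approximate the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) with random Fourier features."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    omega = rng.normal(scale=1.0 / sigma, size=(d, D))   # samples from the kernel's Fourier transform N(0, I/sigma^2)
    b = rng.uniform(0.0, 2 * np.pi, size=D)              # uniform phases on [0, 2*pi]
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)      # rows are Z(x); Z(x).Z(y) approximates K(x - y)
```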
- 36. SVM: Implementation
- 37. Approximating a dot-product kernel (Random Feature Maps for Dot Product Kernels): Given a positive definite dot-product kernel K(x, y) = f(⟨x, y⟩), we can create a randomized feature map Z: R^d → R^D such that ⟨Z(x), Z(y)⟩ ≈ K(x, y). Obtain the Maclaurin expansion f(x) = Σ_{n=0..∞} a_n x^n by setting a_n = f^(n)(0) / n!. Fix a value p > 1. For i = 1 to D: choose a non-negative integer N with P[N = n] = 1/p^{n+1}; choose N vectors ω_1, ..., ω_N ∈ {-1, 1}^d, selecting each coordinate using fair coin tosses; let the feature map be Z_i: x → sqrt(a_N p^{N+1}) Π_{j=1..N} ω_j^T x. Finally Z: x → (1/sqrt(D)) (Z_1(x), ..., Z_D(x)).
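A minimal sketch (ours) of this construction. For concreteness it assumes the kernel K(x, y) = exp(⟨x, y⟩), whose Maclaurin coefficients are a_n = 1/n!, and it keeps the default p = 2 so that P[N = n] = 1/2^(n+1) exactly; those choices are illustrative, not from the slides:

```python
import numpy as np
from math import factorial

def random_maclaurin_features(X, D=500, p=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, d = X.shape
    Z = np.empty((n_samples, D))
    for i in range(D):
        N = rng.geometric(1.0 - 1.0 / p) - 1            # for p = 2 this gives P[N = n] = 1/2^(n+1)
        a_N = 1.0 / factorial(N)                         # Maclaurin coefficient of exp(.) at order N
        omegas = rng.choice([-1.0, 1.0], size=(N, d))    # N vectors in {-1, +1}^d via fair coin tosses
        prod = np.prod(omegas @ X.T, axis=0)             # product over j of (omega_j . x); empty product is 1
        Z[:, i] = np.sqrt(a_N * p ** (N + 1)) * prod
    return Z / np.sqrt(D)
```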
- 38. SVM: Implementation summary. Using these approximations, we can now treat this as a linear SVM problem: (1) Job 1 computes stats for the features and classes (mean, variance, class cardinality, etc.); (2) Job 2 transforms the sample with the approximate kernel map and computes stats for the new feature space; (3) Job 3 randomly distributes the new samples and trains the model in the reducers. We can use map-reduce to solve non-linear multi-classification SVM! A simulated sketch of the pipeline follows below.
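To make the three-job pipeline concrete, here is a single-machine simulation (ours, not the Alpine implementation) that reuses the hypothetical random_fourier_features() and parallel_sgd() helpers sketched earlier in place of the Hadoop jobs; it handles the binary case only:

```python
import numpy as np

def train_nonlinear_svm(X, y, D=500, sigma=1.0, n_shards=4, lam=0.01):
    """Stand-in for the three MapReduce jobs; y holds binary labels in {+1, -1} for this simplified sketch."""
    # Job 1: per-feature statistics (here just mean and standard deviation, used to standardize).
    mean, std = X.mean(axis=0), X.std(axis=0) + 1e-12
    X_std = (X - mean) / std

    # Job 2: transform every sample with the approximate (random Fourier) kernel map.
    Z = random_fourier_features(X_std, D=D, sigma=sigma)

    # Job 3: randomly distribute the transformed samples and train in the "reducers", then average.
    w = parallel_sgd(Z, y, n_shards=n_shards, lam=lam)
    return mean, std, w
```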
- 39. SVM: Implementation examples. SVM was used by a large entertainment company for customer segmentation: web logs containing browsing information were mined for customer attributes like gender and age; raw Omniture logs were stored in Hadoop; models were built on ~10 billion rows and 1 million features; and the models were used to improve the inventory value of the company's web properties for publishers.
- 40. Questions?
