[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin, Google Research)

Abstract:

Machine learning researchers and practitioners develop computer
algorithms that "improve performance automatically through
experience". At Google, machine learning is applied to solve many
problems, such as prioritizing emails in Gmail, recommending tags for
YouTube videos, and identifying different aspects from online user
reviews. Machine learning on big data, however, is challenging. Some
"simple" machine learning algorithms with quadratic time complexity,
while running fine with hundreds of records, are almost impractical to
use on billions of records.

In this talk, I will describe lessons drawn from various Google
projects on developing large scale machine learning systems. These
systems build on top of Google's computing infrastructure such as GFS
and MapReduce, and attack the scalability problem through massively
parallel algorithms. I will present the design decisions made in
these systems and the strategies for scaling and speeding up machine
learning on web-scale data.


Speaker biography:

Max Lin is a software engineer with Google Research in the New York
City office. He is the tech lead of the Google Prediction API, a
machine learning web service in the cloud. Prior to Google, he
published research work on video content analysis, sentiment analysis,
machine learning, and cross-lingual information retrieval. He holds a
PhD in Computer Science from Carnegie Mellon University.

Slide transcript:

1. Machine Learning on Big Data: Lessons Learned from Google Projects. Max Lin, Software Engineer | Google Research. Massively Parallel Computing | Harvard CS 264, Guest Lecture, March 29th, 2011
2. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
3. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
4. “Machine Learning is a study of computer algorithms that improve automatically through experience.”
5. Training (Input X → Output Y): “The quick brown fox jumped over the lazy dog.” → English; “To err is human, but to really foul things up you need a computer.” → English; “No hay mal que por bien no venga.” → Spanish; “La tercera es la vencida.” → Spanish. Model: f(x). Testing (f(x’) = y’): “To be or not to be -- that is the question” → ?; “La fe mueve montañas.” → ?
6. Linear Classifier. The quick brown fox jumped over the lazy dog. Features: ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... x = [0, ..., 0, ..., 1, ..., 1, ..., 0, ...]; w = [0.1, ..., 132, ..., 150, ..., 200, ..., -153, ...]; f(x) = w · x = Σ_{p=1}^{P} w_p · x_p
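A minimal sketch of this slide's bag-of-words linear classifier in Python; the vocabulary and weight values mirror the slide but are otherwise made up for illustration:

    import re

    # Toy vocabulary and weights (illustrative values only).
    vocab = ['a', 'aardvark', 'dog', 'the', 'montañas']
    w = {'a': 0.1, 'aardvark': 132.0, 'dog': 150.0, 'the': 200.0, 'montañas': -153.0}

    def featurize(text):
        """Binary bag-of-words vector x over the fixed vocabulary."""
        tokens = set(re.findall(r"\w+", text.lower()))
        return {term: (1.0 if term in tokens else 0.0) for term in vocab}

    def f(x):
        """f(x) = w . x = sum over p of w_p * x_p."""
        return sum(w[term] * x[term] for term in vocab)

    x = featurize("The quick brown fox jumped over the lazy dog.")
    print(f(x))  # 200.0 for 'the' + 150.0 for 'dog' = 350.0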
7. Training Data: Input X is an N × P matrix (N examples, each with P features); Output Y is a vector of N labels.
8. Typical machine learning data at Google: N: 100 billion / 1 billion; P: 1 billion / 10 million (mean / median) http://www.flickr.com/photos/mr_t_in_dc/5469563053
9. Classifier Training • Training: Given {(x, y)} and f, minimize the following objective function: arg min_w Σ_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)
10. Use Newton’s method? w^{t+1} ← w^t − H(w^t)^{-1} ∇J(w^t) http://www.flickr.com/photos/visitfinland/5424369765/
11. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
12. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
13. Subsampling (reduce N): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; a single shard goes to a single Machine, which produces the Model.
14. Why not Small Data? [Banko and Brill, 2001]
15. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
16. Parallelize Estimates • Naive Bayes Classifier: arg min_w − Π_{i=1}^{N} Π_{p=1}^{P} P(x_p^i | y_i; w) P(y_i; w) • Maximum Likelihood Estimates: w_{the|EN} = Σ_{i=1}^{N} 1_{EN,the}(x^i) / Σ_{i=1}^{N} 1_{EN}(x^i)
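A minimal sketch of the maximum likelihood estimate above, reading the indicator functions as document-level counts (one possible interpretation); the toy training set reuses the examples from slide 5:

    import re

    training_data = [
        ("The quick brown fox jumped over the lazy dog.", "EN"),
        ("To err is human, but to really foul things up you need a computer.", "EN"),
        ("No hay mal que por bien no venga.", "ES"),
        ("La tercera es la vencida.", "ES"),
    ]

    def contains(text, term):
        return term in re.findall(r"\w+", text.lower())

    # w_{the|EN} = (# examples that are English and contain 'the') / (# English examples)
    n_en = sum(1 for _, y in training_data if y == "EN")
    n_en_the = sum(1 for x, y in training_data if y == "EN" and contains(x, "the"))
    print(n_en_the / n_en)  # 0.5: one of the two English examples contains 'the'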
17. Word Counting. Map: X: “The quick brown fox ...”, Y: EN → (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1), ... Reduce: [(‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1)] → C(‘the’|EN) = SUM of values = 3; w_{the|EN} = C(‘the’|EN) / C(EN)
18. Word Counting. Map: Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M, processed by Mapper 1, Mapper 2, Mapper 3, ..., Mapper M, which emit pairs like (‘the’|EN, 1), (‘fox’|EN, 1), ..., (‘montañas’|ES, 1). Reduce: the Reducer tallies counts and updates w, producing the Model.
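The same pipeline can be simulated with plain Python dictionaries standing in for MapReduce; the shard contents below are made up, and in a real job each mapper would run on its own machine:

    import re
    from collections import defaultdict

    shards = [
        [("The quick brown fox jumped over the lazy dog.", "EN")],
        [("To be or not to be -- that is the question.", "EN")],
        [("La fe mueve montañas.", "ES")],
    ]

    def mapper(x, y):
        """Emit ('token|label', 1) per token, plus (label, 1) per token for C(label)."""
        for token in re.findall(r"\w+", x.lower()):
            yield ("%s|%s" % (token, y), 1)
            yield (y, 1)

    def reducer(key, values):
        """Sum the counts for one key, e.g. C('the'|EN)."""
        return key, sum(values)

    intermediate = defaultdict(list)          # Map phase (one shard per machine in practice)
    for shard in shards:
        for x, y in shard:
            for key, value in mapper(x, y):
                intermediate[key].append(value)

    counts = dict(reducer(k, v) for k, v in intermediate.items())   # Reduce phase
    print(counts["the|EN"], counts["EN"])     # 3, and the total English token count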
19. Parallelize Optimization • Maximum Entropy Classifiers: arg min_w Π_{i=1}^{N} exp(Σ_{p=1}^{P} w_p x_p^i)^{y_i} / (1 + exp(Σ_{p=1}^{P} w_p x_p^i)) • Good: J(w) is concave • Bad: no closed-form solution like NB • Ugly: large N
20. Gradient Descent http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf
21. Gradient Descent • w is initialized as zero • for t in 1 to T • Calculate gradients ∇J(w) • w^{t+1} ← w^t − η∇J(w^t) • ∇J(w) = Σ_{i=1}^{N} ∇J(w; x_i, y_i)
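A minimal sketch of this loop in Python for a logistic-loss (maximum entropy) objective; the toy data, the step size η, and T are made up:

    import numpy as np

    X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])   # N x P examples (made up)
    y = np.array([1.0, 1.0, 0.0, 0.0])
    eta, T = 0.5, 100

    def grad_J(w, X, y):
        """Gradient of the negative log-likelihood: sum_i (sigmoid(w.x_i) - y_i) * x_i."""
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))
        return X.T.dot(p - y)

    w = np.zeros(X.shape[1])              # w is initialized as zero
    for t in range(T):                    # for t in 1 to T
        w = w - eta * grad_J(w, X, y)     # w_{t+1} <- w_t - eta * grad J(w_t)
    print(w)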
22. Distribute Gradient • w is initialized as zero • for t in 1 to T • Calculate gradients in parallel • w^{t+1} ← w^t − η∇J(w^t) • Training CPU: O(TPN) to O(TPN / M)
23. Distribute Gradient. Map: Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M on Machine 1, Machine 2, Machine 3, ..., Machine M, each emitting (dummy key, partial gradient sum). Reduce: sum the partial gradients and update w. Repeat Map/Reduce until convergence, producing the Model.
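A sketch of the same update with the gradient summed shard by shard; here the "machines" are just entries of a Python list, whereas a real system would compute each partial sum in its own mapper:

    import numpy as np

    shards = [                                             # made-up data, one tuple per machine
        (np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([1.0, 1.0])),
        (np.array([[0.0, 1.0], [0.0, 0.0]]), np.array([0.0, 0.0])),
    ]
    eta, T, P = 0.5, 100, 2

    def partial_gradient(w, X, y):
        """Logistic-loss gradient summed over one shard only (the Map step)."""
        p = 1.0 / (1.0 + np.exp(-X.dot(w)))
        return X.T.dot(p - y)

    w = np.zeros(P)
    for t in range(T):                                     # repeat Map/Reduce until convergence
        partials = [partial_gradient(w, X, y) for X, y in shards]   # Map: one per machine
        w = w - eta * np.sum(partials, axis=0)                      # Reduce: sum and update w
    print(w)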
24. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
25. Parallelize Subroutines • Support Vector Machines: arg min_{w,b,ζ} (1/2)||w||₂² + C Σ_{i=1}^{n} ζ_i, s.t. 1 − y_i(w · φ(x_i) + b) ≤ ζ_i, ζ_i ≥ 0 • Solve the dual problem: arg min_α (1/2) αᵀQα − αᵀ1, s.t. 0 ≤ α ≤ C, yᵀα = 0
26. The computational cost for the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory. http://www.flickr.com/photos/sea-turtle/198445204/
27. Parallel SVM [Chang et al, 2007] • Parallel, row-wise incomplete Cholesky Factorization (ICF) for Q • Parallel interior point method • Time O(n^3) becomes O(n^2 / M) • Memory O(n^2) becomes O(n√N / M) • Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/ • Implemented in MPI
28. Parallel ICF • Distribute Q by row into M machines (Machine 1: rows 1, 2; Machine 2: rows 3, 4; Machine 3: rows 5, 6; ...) • For each dimension n (up to √N) • Send local pivots to the master • The master selects the largest local pivot and broadcasts the global pivot to the workers
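A sketch of just the pivot-selection step described on this slide (workers report their best local pivot, the master picks and broadcasts the global one); the diagonal values are made up and the actual factorization update is omitted:

    # Each machine owns some rows of Q; its local pivot is its largest remaining
    # diagonal entry. Values below are made up.
    local_diagonals = {
        "machine_1": {0: 0.9, 1: 0.2},
        "machine_2": {2: 1.4, 3: 0.1},
        "machine_3": {4: 0.7, 5: 0.3},
    }

    # Workers: send (row, value) of the best local pivot to the master.
    local_pivots = {m: max(diag.items(), key=lambda kv: kv[1])
                    for m, diag in local_diagonals.items()}

    # Master: select the largest local pivot and broadcast it as the global pivot.
    global_pivot = max(local_pivots.values(), key=lambda kv: kv[1])
    print(global_pivot)   # (2, 1.4): row 2 on machine_2 is the pivot this iteration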
29. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
30. Majority Vote. Map: Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; Machine 1, Machine 2, Machine 3, ..., Machine M each train their own model (Model 1, Model 2, Model 3, ..., Model M).
31. Majority Vote • Train individual classifiers independently • Predict by taking majority votes • Training CPU: O(TPN) to O(TPN / M)
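A minimal sketch of the vote at prediction time; the three "models" below are stand-in functions rather than classifiers actually trained on shards:

    from collections import Counter

    models = [
        lambda x: 1 if x.get("the", 0) > 0 else 0,   # stand-in for a model trained on shard 1
        lambda x: 1 if x.get("dog", 0) > 0 else 0,   # stand-in for a model trained on shard 2
        lambda x: 0,                                 # stand-in for a model trained on shard 3
    ]

    def predict(x):
        """Each model votes; the most common label wins."""
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    print(predict({"the": 1.0, "dog": 0.0}))   # votes are 1, 0, 0 -> majority is 0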
32. Parameter Mixture [Mann et al, 2009]. Map: Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M on Machine 1, Machine 2, Machine 3, ..., Machine M, each emitting (dummy key, w1), (dummy key, w2), ... Reduce: average w, producing the Model.
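The Reduce step here is just an average of the per-shard weight vectors; a minimal sketch with made-up w_k values:

    import numpy as np

    per_shard_weights = [                    # w_k trained independently on shard k (made up)
        np.array([0.8, -0.1, 0.3]),
        np.array([1.0,  0.0, 0.2]),
        np.array([0.6, -0.2, 0.4]),
    ]

    w = np.mean(per_shard_weights, axis=0)   # Reduce: average w
    print(w)                                 # [ 0.8 -0.1  0.3]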
33. Much less network usage than distributed gradient descent: O(MN) vs. O(MNT). http://www.flickr.com/photos/annamatic3000/127945652/
34. Iterative Param Mixture [McDonald et al., 2010]. Map: Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M on Machine 1, Machine 2, Machine 3, ..., Machine M, each emitting (dummy key, w1), (dummy key, w2), ... Reduce after each epoch: average w, producing the Model.
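A sketch of the iterative variant: every epoch, each shard trains one pass starting from the current averaged weights, and the weights are re-averaged; the per-shard trainer below is a made-up logistic SGD pass, not the structured perceptron used in the paper:

    import numpy as np

    shards = [                                             # made-up data, one tuple per machine
        (np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([1.0, 1.0])),
        (np.array([[0.0, 1.0], [0.0, 0.0]]), np.array([0.0, 0.0])),
    ]
    eta, epochs = 0.5, 10

    def train_one_epoch(w, X, y):
        """One logistic-loss SGD pass over a single shard, starting from w."""
        w = w.copy()
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-w.dot(xi)))
            w -= eta * (p - yi) * xi
        return w

    w = np.zeros(2)
    for epoch in range(epochs):
        per_shard = [train_one_epoch(w, X, y) for X, y in shards]   # Map: one pass per shard
        w = np.mean(per_shard, axis=0)                              # Reduce after each epoch
    print(w)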
35. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
36. Scalable http://www.flickr.com/photos/mr_t_in_dc/5469563053
37. Parallel http://www.flickr.com/photos/aloshbennett/3209564747/
38. Accuracy http://www.flickr.com/photos/wanderlinse/4367261825/
39. http://www.flickr.com/photos/imagelink/4006753760/
40. Binary Classification http://www.flickr.com/photos/brenderous/4532934181/
41. Automatic Feature Discovery http://www.flickr.com/photos/mararie/2340572508/
42. Fast Response http://www.flickr.com/photos/prunejuice/3687192643/
43. Memory is the new hard disk. http://www.flickr.com/photos/jepoirrier/840415676/
44. Algorithm + Infrastructure http://www.flickr.com/photos/neubie/854242030/
45. Design for Multicores http://www.flickr.com/photos/geektechnique/2344029370/
46. Combiner
47. Multi-shard Combiner [Chandra et al., 2010]
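A combiner pre-aggregates a mapper's output locally before anything crosses the network, so the reducer receives one count per key per mapper instead of one pair per token; a minimal sketch reusing the word-count setting above (this does not reproduce the multi-shard combiner of [Chandra et al., 2010]):

    import re
    from collections import Counter

    def map_with_combiner(docs, label):
        """Mapper with a local combiner: sums its own ('token|label', 1) pairs before emitting."""
        counts = Counter()
        for doc in docs:
            for token in re.findall(r"\w+", doc.lower()):
                counts["%s|%s" % (token, label)] += 1
        return counts                                     # one (key, count) per key

    mapper_outputs = [                                    # made-up shards, one per mapper
        map_with_combiner(["The quick brown fox jumped over the lazy dog."], "EN"),
        map_with_combiner(["To be or not to be -- that is the question."], "EN"),
    ]

    reduced = sum(mapper_outputs, Counter())              # Reduce: add the per-mapper counts
    print(reduced["the|EN"])                              # 3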
48. Machine Learning on Big Data
49. Parallelize ML Algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
50. Parallel, Accuracy, Fast Response
51. Google APIs • Prediction API • machine learning service on the cloud • http://code.google.com/apis/predict • BigQuery • interactive analysis of massive data on the cloud • http://code.google.com/apis/bigquery
