Machine Learning on Big Data

Description

Max Lin gave a guest lecture about machine learning on big data in the Harvard CS 264 class on March 29, 2011.

Transcript

  1. Machine Learning on Big Data: Lessons Learned from Google Projects. Max Lin, Software Engineer, Google Research. Massively Parallel Computing | Harvard CS 264 Guest Lecture | March 29, 2011
  2–3. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
  4. “Machine Learning is a study of computer algorithms that improve automatically through experience.”
  5–23. (built up across slides) Training examples: “The quick brown fox jumped over the lazy dog.” → English; “To err is human, but to really foul things up you need a computer.” → English; “No hay mal que por bien no venga.” → Spanish; “La tercera es la vencida.” → Spanish. Test examples: “To be or not to be -- that is the question” → ?; “La fe mueve montañas.” → ?. Training takes Input X and Output Y and fits a Model f(x); Testing applies f(x’) = y’.
  24–51. Linear Classifier (built up across slides). The sentence “The quick brown fox jumped over the lazy dog.” is encoded as a bag-of-words feature vector over the vocabulary ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...: x = [ 0, ... 0, ... 1, ... 1, ... 0, ... ]. The model is a weight vector w = [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ], and the classifier computes $f(x) = \mathbf{w} \cdot \mathbf{x} = \sum_{p=1}^{P} w_p x_p$.
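  A minimal Python sketch (mine, not from the deck) of this bag-of-words encoding and the linear score; the five-word vocabulary and the weights are illustrative stand-ins for the full P-dimensional ones:

      # Hypothetical 5-word vocabulary standing in for the full P-dimensional one.
      vocab = ['a', 'aardvark', 'dog', 'the', 'montañas']

      def featurize(text):
          """Binary bag-of-words vector x over the vocabulary."""
          words = {token.strip('.,') for token in text.lower().split()}
          return [1 if v in words else 0 for v in vocab]

      def score(w, x):
          """f(x) = w . x = sum_p w_p * x_p; its sign gives the class."""
          return sum(wp * xp for wp, xp in zip(w, x))

      w = [0.1, 132.0, 150.0, 200.0, -153.0]   # illustrative learned weights
      x = featurize("The quick brown fox jumped over the lazy dog.")
      print(x, score(w, x))                    # [0, 0, 1, 1, 0] 350.0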
  52. Training Data: an N × P matrix of Input X (N examples, each with P features) paired with Output Y.
  53. Typical machine learning data at Google: N = 100 billion / 1 billion; P = 1 billion / 10 million (mean / median). http://www.flickr.com/photos/mr_t_in_dc/5469563053
  54. Classifier Training • Training: Given {(x, y)} and f, minimize the following objective function: $\arg\min_{\mathbf{w}} \sum_{i=1}^{N} L(y_i, f(x_i; \mathbf{w})) + R(\mathbf{w})$
  55. Use Newton’s method? $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - H(\mathbf{w}^{t})^{-1} \nabla J(\mathbf{w}^{t})$ http://www.flickr.com/photos/visitfinland/5424369765/
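  To see why Newton's method is a poor fit here, a sketch (my illustration, with made-up data) of one Newton step for logistic regression: the P × P Hessian costs O(P^2) memory to build and O(P^3) time to solve, which is hopeless when P is around a billion as on slide 53:

      import numpy as np

      # One Newton step for logistic regression (illustrative toy data below).
      def newton_step(w, X, y):
          p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
          grad = X.T @ (p - y)                    # gradient of the log-loss
          H = X.T @ (X * (p * (1 - p))[:, None])  # P x P Hessian: O(P^2) memory
          return w - np.linalg.solve(H, grad)     # solving it is O(P^3) time

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 5))               # N=100, P=5 toy data
      y = (X[:, 0] + rng.normal(size=100) > 0).astype(float)
      w = np.zeros(5)
      for _ in range(5):
          w = newton_step(w, X, y)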
  56. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
  57–62. Scaling Up (built up across slides) • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  63–69. Subsampling (built up across slides): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; reduce N (e.g. keep only Shard 1) so the data fits on a single Machine, and train there to produce a Model.
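  A sketch of this subsampling baseline (my code; the shard contents and sample size are illustrative), using one-pass reservoir sampling to draw a training set that fits on one machine:

      import random

      def subsample(shards, n):
          """Reservoir-sample n examples from M shards in one pass."""
          sample = []
          stream = (example for shard in shards for example in shard)
          for i, example in enumerate(stream):
              if len(sample) < n:
                  sample.append(example)
              elif random.random() < n / (i + 1):
                  sample[random.randrange(n)] = example
          return sample

      shards = [[('doc %d of shard %d' % (j, s), 'EN') for j in range(1000)]
                for s in range(4)]
      train_set = subsample(shards, 100)   # small enough for one machine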
  70. Why not Small Data? [Banko and Brill, 2001]
  71. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  72. Parallelize Estimates • Naive Bayes Classifier: $\arg\min_{\mathbf{w}} -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x_p^i \mid y_i; \mathbf{w}) \, P(y_i; \mathbf{w})$ • Maximum Likelihood Estimates: $w_{the|EN} = \frac{\sum_{i=1}^{N} \mathbf{1}_{EN,the}(x^i)}{\sum_{i=1}^{N} \mathbf{1}_{EN}(x^i)}$
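  The estimate is just a ratio of counts, so it parallelizes trivially. A sketch (mine; the three-document corpus is illustrative) of the maximum-likelihood estimate above:

      from collections import Counter

      docs = [("the quick brown fox", "EN"),
              ("to err is human", "EN"),
              ("no hay mal que por bien no venga", "ES")]

      doc_count = Counter()        # number of documents per class
      word_doc_count = Counter()   # number of documents per (word, class)
      for text, label in docs:
          doc_count[label] += 1
          for word in set(text.split()):
              word_doc_count[(word, label)] += 1

      w_the_EN = word_doc_count[('the', 'EN')] / doc_count['EN']   # 1/2 here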
  73–82. Word Counting (built up across slides). Map: given X: “The quick brown fox ...” with Y: EN, emit (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1), ... Reduce: given [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ], C(‘the’|EN) = SUM of values = 3, and $w_{the|EN} = \frac{C(the \mid EN)}{C(EN)}$.
  83–88. Word Counting (built up across slides): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; in the Map phase, Mapper 1 ... Mapper M emit pairs such as (‘the’|EN, 1), (‘fox’|EN, 1), ..., (‘montañas’|ES, 1); in the Reduce phase, a Reducer tallies the counts and updates w, producing the Model.
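  A sketch of this word-counting MapReduce in Python (my code; the in-process dictionary stands in for the real shuffle between mappers and reducers):

      from collections import defaultdict

      def word_count_map(text, label):
          """Mapper: emit ('word|CLASS', 1) for every token in the document."""
          for word in text.lower().split():
              yield ('%s|%s' % (word, label), 1)

      def word_count_reduce(key, values):
          """Reducer: C(word|class) is just the sum of the emitted 1s."""
          return key, sum(values)

      shards = [("the quick brown fox", "EN"), ("the lazy dog", "EN"),
                ("la fe mueve montañas", "ES")]

      grouped = defaultdict(list)              # stands in for the shuffle phase
      for text, label in shards:
          for key, value in word_count_map(text, label):
              grouped[key].append(value)

      counts = dict(word_count_reduce(k, v) for k, v in grouped.items())
      print(counts['the|EN'])                  # 2: once per English document here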
  89–95. Parallelize Optimization (built up across slides) • Maximum Entropy Classifiers: $\arg\min_{\mathbf{w}} -\prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x_p^i)^{y_i}}{1 + \exp(\sum_{p=1}^{P} w_p x_p^i)}$ • Good: J(w) is concave • Bad: no closed-form solution like NB • Ugly: Large N
  96. Gradient Descent http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf
  97–101. Gradient Descent (built up across slides) • w is initialized as zero • for t in 1 to T • Calculate gradients $\nabla J(\mathbf{w})$ • $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$ • $\nabla J(\mathbf{w}) = \sum_{i=1}^{N} \nabla P(\mathbf{w}, x_i, y_i)$
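  A sketch of this loop for logistic regression (my code, using the gradient from the speaker notes; the step size eta and iteration count are illustrative):

      import numpy as np

      def train(X, y, T=100, eta=0.1):
          w = np.zeros(X.shape[1])                 # w is initialized as zero
          for _ in range(T):                       # for t in 1 to T
              p = 1.0 / (1.0 + np.exp(-X @ w))     # model predictions
              grad = X.T @ (p - y)                 # sum of per-example gradients
              w = w - eta * grad                   # w^{t+1} <- w^t - eta * grad J(w)
          return w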
  102–103. Distribute Gradient • w is initialized as zero • for t in 1 to T • Calculate gradients in parallel • $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$ • Training CPU: O(TPN) to O(TPN / M)
  104–109. Distribute Gradient (built up across slides): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; in the Map phase, Machine 1 ... Machine M each emit (dummy key, partial gradient sum); in the Reduce phase, the partial sums are added and w is updated; the MapReduce rounds repeat until convergence, producing the Model.
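  A sketch of the distributed version (mine): each mapper computes the gradient sum over its own shard, and the single reducer adds the partial sums and updates w, once per MapReduce round:

      import numpy as np

      def partial_gradient(w, X_shard, y_shard):
          """Mapper: the gradient restricted to one shard's examples."""
          p = 1.0 / (1.0 + np.exp(-X_shard @ w))
          return X_shard.T @ (p - y_shard)         # (dummy key, partial gradient sum)

      def distributed_gd(shards, num_features, T=50, eta=0.1):
          w = np.zeros(num_features)
          for _ in range(T):                       # one MapReduce round per step
              grad = sum(partial_gradient(w, X, y) for X, y in shards)  # reduce
              w = w - eta * grad
          return w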
  110. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  111. Parallelize Subroutines • Support Vector Machines: $\arg\min_{\mathbf{w},b,\zeta} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \zeta_i$ s.t. $1 - y_i(\mathbf{w} \cdot \phi(x_i) + b) \le \zeta_i,\ \zeta_i \ge 0$ • Solve the dual problem: $\arg\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \alpha^T \mathbf{1}$ s.t. $0 \le \alpha \le C,\ y^T \alpha = 0$
  112. The computational cost of the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory. http://www.flickr.com/photos/sea-turtle/198445204/
  113–116. Parallel SVM [Chang et al., 2007] (built up across slides) • Parallel, row-wise Incomplete Cholesky Factorization for Q • Parallel interior point method • Time O(n^3) becomes O(n^2 / M) • Memory O(n^2) becomes O(n√N / M) • Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/ • Implemented in MPI
  117. Parallel ICF • Distribute Q by row across M machines (Machine 1: rows 1, 2; Machine 2: rows 3, 4; Machine 3: rows 5, 6; ...) • For each dimension n < √N: workers send local pivots to the master; the master selects the largest local pivot and broadcasts the global pivot to the workers
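  A much-simplified sketch (mine, not the paper's) of the pivot-selection step only; the actual factorization updates are omitted:

      def select_global_pivot(local_pivots):
          """local_pivots: machine id -> {row index: pivot value} on that machine."""
          best_rows = {m: max(rows, key=rows.get) for m, rows in local_pivots.items()}
          # The master gathers each machine's best pivot and picks the largest.
          machine, row = max(best_rows.items(),
                             key=lambda mr: local_pivots[mr[0]][mr[1]])
          return machine, row        # the master broadcasts this global pivot

      local_pivots = {1: {0: 2.5, 1: 0.3}, 2: {2: 4.1, 3: 1.0}, 3: {4: 0.2, 5: 3.7}}
      print(select_global_pivot(local_pivots))   # (2, 2): row 2 on machine 2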
  118. Scaling Up • Why big data? • Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  119–123. Majority Vote (built up across slides): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; in the Map phase, Machine 1 ... Machine M each train on their own shard, producing Model 1, Model 2, Model 3, ..., Model M.
  124. Majority Vote • Train individual classifiers independently • Predict by taking majority votes • Training CPU: O(TPN) to O(TPN / M)
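  A sketch of the scheme (mine; train_one stands for any single-machine learner, and each model is a callable returning a label):

      from collections import Counter

      def train_ensemble(shards, train_one):
          """Map: train one model per shard, embarrassingly in parallel."""
          return [train_one(shard) for shard in shards]

      def vote(models, x):
          """Predict by majority vote across the per-shard models."""
          return Counter(model(x) for model in models).most_common(1)[0][0]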
  125–130. Parameter Mixture [Mann et al., 2009] (built up across slides): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; in the Map phase, Machine 1 ... Machine M emit (dummy key, w1), (dummy key, w2), ...; in the Reduce phase, the w vectors are averaged, producing the Model.
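  A sketch (mine; train_shard stands for any single-machine trainer returning a weight vector) of the one-shot average:

      import numpy as np

      def parameter_mixture(shards, train_shard):
          weights = [train_shard(X, y) for X, y in shards]  # map: one w per shard
          return np.mean(weights, axis=0)                   # reduce: average w once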
  131. Much less network usage than distributed gradient descent: O(MN) vs. O(MNT). http://www.flickr.com/photos/annamatic3000/127945652/
  132–137. Iterative Param Mixture [McDonald et al., 2010] (built up across slides): Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; in the Map phase, Machine 1 ... Machine M emit (dummy key, w1), (dummy key, w2), ...; in the Reduce phase, run after each epoch, the w vectors are averaged, producing the Model.
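  A sketch (mine; train_epoch stands for one epoch of single-machine training started from the current average) of the per-epoch loop:

      import numpy as np

      def iterative_parameter_mixture(shards, train_epoch, num_features, epochs=10):
          w = np.zeros(num_features)
          for _ in range(epochs):
              per_shard = [train_epoch(w, X, y) for X, y in shards]  # map
              w = np.mean(per_shard, axis=0)      # reduce: average after each epoch
          return w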
  138. Outline • Machine Learning intro • Scaling machine learning algorithms up • Design choices of large scale ML systems
  139. Scalable http://www.flickr.com/photos/mr_t_in_dc/5469563053
  140. Parallel http://www.flickr.com/photos/aloshbennett/3209564747/
  141. Accuracy http://www.flickr.com/photos/wanderlinse/4367261825/
  142. http://www.flickr.com/photos/imagelink/4006753760/
  143. Binary Classification http://www.flickr.com/photos/brenderous/4532934181/
  144. Automatic Feature Discovery http://www.flickr.com/photos/mararie/2340572508/
  145. Fast Response http://www.flickr.com/photos/prunejuice/3687192643/
  146. Memory is the new hard disk. http://www.flickr.com/photos/jepoirrier/840415676/
  147. Algorithm + Infrastructure http://www.flickr.com/photos/neubie/854242030/
  148. Design for Multicores http://www.flickr.com/photos/geektechnique/2344029370/
  149. Combiner
  150. Multi-shard Combiner [Chandra et al., 2010]
  151. Machine Learning on Big Data
  152–155. Parallelize ML Algorithms (built up across slides) • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
  156–159. Parallel • Accuracy • Fast Response (built up across slides)
  160–162. Google APIs (built up across slides) • Prediction API • machine learning service on the cloud • http://code.google.com/apis/predict • BigQuery • interactive analysis of massive data on the cloud • http://code.google.com/apis/bigquery

Editor's Notes

  • (slide 54) $\arg\min_{\mathbf{w}} \sum_{i=1}^{N} L(y_i, f(x_i; \mathbf{w})) + R(\mathbf{w})$
  • (slide 72) $\arg\min_{\mathbf{w}} -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x_p^i \mid y_i; \mathbf{w}) \, P(y_i; \mathbf{w})$; $f(x) = \sum_{p=1}^{P} \log \frac{\mathbf{1}(x_p) \, w_{p|EN}}{\mathbf{1}(x_p) \, w_{p|ES}} + \log \frac{w_{EN}}{w_{ES}}$; $w_{the|EN} = \frac{\sum_{i=1}^{N} \mathbf{1}_{EN,the}(x^i)}{\sum_{i=1}^{N} \mathbf{1}_{EN}(x^i)}$
  • (slides 90–95) Concave in w; no closed form. $\arg\min_{\mathbf{w}} -\prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x_p^i)^{y_i}}{1 + \exp(\sum_{p=1}^{P} w_p x_p^i)}$
  • (slides 99–101) $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$; $\frac{\partial}{\partial w_p} J(\mathbf{w}) = \sum_{i=1}^{N} \frac{\partial}{\partial w_p} \left[ y^i \sum_p w_p x_p^i - \ln\left(1 + \exp\left(\sum_p w_p x_p^i\right)\right) \right]$
  • (slide 111) $\arg\min_{\mathbf{w},b,\zeta} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \zeta_i$ s.t. $1 - y_i(\mathbf{w} \cdot \phi(x_i) + b) \le \zeta_i,\ \zeta_i \ge 0$; dual: $\arg\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - \alpha^T \mathbf{1}$
  • Amdahl’s law; communication cost drives the choice of M.
  • Sub-sampling gives inferior performance. Parameter mixture improves on it, but is not as good as using all the data. Iterative parameter mixture is as good as using all the data. Distributed algorithms return better classifiers more quickly.
  • Billions of instances, millions of features, within reasonable resources.
  • Guarantees; state of the art.
  • Easy to use: low-friction adoption and setup. Easy to maintain: reliable, production-grade, works with batch systems.
  • New features every day; feedback changes.
  • Iterative; fast retrieval for data and model parameters.
  • Fault-tolerant pieces: MapReduce (scalable), multi-cores, GFS for data.