# Machine Learning on Big Data

Max Lin gave a guest lecture about machine learning on big data in the Harvard CS 264 class on March 29, 2011.

### Speaker notes

• Training objective: $\arg \min_{\mathbf{w}} \sum_{i=1}^{N} L(y_i, f(x_i; \mathbf{w})) + R(\mathbf{w})$
• Naive Bayes: $\arg \min_{\mathbf{w}} -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x^i_p \mid y_i; \mathbf{w})\, P(y_i; \mathbf{w})$; decision function $f(x) = \sum_{p=1}^{P} \log \frac{\mathbf{1}(x_p) \cdot w_{p|EN}}{\mathbf{1}(x_p) \cdot w_{p|ES}} + \log \frac{w_{EN}}{w_{ES}}$; maximum-likelihood estimate $w_{the|EN} = \frac{\sum_{i=1}^{N} \mathbf{1}_{EN,the}(x^i)}{\sum_{i=1}^{N} \mathbf{1}_{EN}(x^i)}$
• Maximum entropy: $\arg \min_{\mathbf{w}} \prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x^i_p)^{y_i}}{1 + \exp(\sum_{p=1}^{P} w_p x^i_p)}$ — concave in $\mathbf{w}$, but no closed-form solution
• Gradient descent: $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$, with $\frac{\partial}{\partial w_p} J(\mathbf{w}) = \sum_{i=1}^{N} \frac{\partial}{\partial w_p} \left[ y^i \sum_{p} w_p x^i_p - \ln\left(1 + \exp\left(\sum_{p} w_p x^i_p\right)\right) \right]$
• SVM primal: $\arg \min_{\mathbf{w},b,\boldsymbol{\zeta}} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \zeta_i$ s.t. $1 - y_i(\mathbf{w} \cdot \phi(x_i) + b) \le \zeta_i$, $\zeta_i \ge 0$; dual: $\arg \min_{\boldsymbol{\alpha}} \frac{1}{2} \boldsymbol{\alpha}^T \mathbf{Q} \boldsymbol{\alpha} - \boldsymbol{\alpha}^T \mathbf{1}$
• Amdahl's law; communication cost informs the choice of M
• Sub-sampling gives inferior performance. Parameter mixture improves on it, but not to the level of using all the data. Iterative parameter mixture matches the all-data result. Distributed algorithms return better classifiers sooner.
• Billions of instances and millions of features, within reasonable resources
• Guarantees and state-of-the-art accuracy
• Easy to use (adoption, setup); easy to maintain (reliable in production, works with batch systems)
• New features arrive every day; feedback changes over time
• Training is iterative and needs fast retrieval of data and of model parameters
• Fault-tolerant building blocks: MapReduce (scalable), multi-core machines, GFS for data
• ### Transcript of "Machine Learning on Big Data"

1. 1. Machine Learning on Big Data. Lessons Learned from Google Projects. Max Lin, Software Engineer | Google Research. Massively Parallel Computing | Harvard CS 264. Guest Lecture | March 29th, 2011
2. 2. Outline• Machine Learning intro• Scaling machine learning algorithms up• Design choices of large scale ML systems
3. 3. Outline• Machine Learning intro• Scaling machine learning algorithms up• Design choices of large scale ML systems
4. 4. “Machine Learning is a studyof computer algorithms that improve automatically through experience.”
5. 5. The quick brown foxjumped over the lazy dog.
6. 6. The quick brown fox Englishjumped over the lazy dog.
7. 7. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up youneed a computer.
8. 8. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.
9. 9. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bienno venga.
10. 10. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.
11. 11. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.La tercera es la vencida.
12. 12. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.La tercera es la vencida. Spanish
13. 13. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.La tercera es la vencida. SpanishTo be or not to be -- thatis the question
14. 14. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.La tercera es la vencida. SpanishTo be or not to be -- that ?is the question
15. 15. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.La tercera es la vencida. SpanishTo be or not to be -- that ?is the questionLa fe mueve montañas.
16. 16. The quick brown fox Englishjumped over the lazy dog.To err is human, but toreally foul things up you Englishneed a computer.No hay mal que por bien Spanishno venga.La tercera es la vencida. SpanishTo be or not to be -- that ?is the questionLa fe mueve montañas. ?
17. 17. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining need a computer. No hay mal que por bien Spanish no venga. La tercera es la vencida. Spanish To be or not to be -- that ? is the question La fe mueve montañas. ?
18. 18. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining Input X need a computer. No hay mal que por bien Spanish no venga. La tercera es la vencida. Spanish To be or not to be -- that ? is the question La fe mueve montañas. ?
19. 19. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining Input X need a computer. Output Y No hay mal que por bien Spanish no venga. La tercera es la vencida. Spanish To be or not to be -- that ? is the question La fe mueve montañas. ?
20. 20. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining Input X need a computer. Output Y No hay mal que por bien Spanish no venga. Model f(x) La tercera es la vencida. Spanish To be or not to be -- that ? is the question La fe mueve montañas. ?
21. 21. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining Input X need a computer. Output Y No hay mal que por bien Spanish no venga. Model f(x) La tercera es la vencida. Spanish To be or not to be -- that ?Testing is the question La fe mueve montañas. ?
22. 22. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining Input X need a computer. Output Y No hay mal que por bien Spanish no venga. Model f(x) La tercera es la vencida. Spanish To be or not to be -- that ?Testing f(x’) is the question La fe mueve montañas. ?
23. 23. The quick brown fox English jumped over the lazy dog. To err is human, but to really foul things up you EnglishTraining Input X need a computer. Output Y No hay mal que por bien Spanish no venga. Model f(x) La tercera es la vencida. Spanish To be or not to be -- that ?Testing f(x’) is the question = y’ La fe mueve montañas. ?
24. 24. Linear Classiﬁer
25. 25. Linear ClassiﬁerThe quick brown fox jumped over the lazy dog.
26. 26. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’
27. 27. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ...
28. 28. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’
29. 29. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ...
30. 30. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’
31. 31. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ...
32. 32. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’
33. 33. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ...
34. 34. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’
35. 35. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...
36. 36. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...0,
37. 37. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ...
38. 38. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0,
39. 39. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ...
40. 40. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ... 1,
41. 41. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ... 1, ...
42. 42. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ... 1, ... 1,
43. 43. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ... 1, ... 1, ...
44. 44. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ... 1, ... 1, ... 0,
45. 45. Linear Classiﬁer The quick brown fox jumped over the lazy dog.‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... 0, ... 0, ... 1, ... 1, ... 0, ...
46. 46. Linear Classiﬁer The quick brown fox jumped over the lazy dog. ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...[ 0, ... 0, ... 1, ... 1, ... 0, ...
47. 47. Linear Classiﬁer The quick brown fox jumped over the lazy dog. ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...[ 0, ... 0, ... 1, ... 1, ... 0, ... ]
48. 48. Linear Classiﬁer The quick brown fox jumped over the lazy dog. ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...x [ 0, ... 0, ... 1, ... 1, ... 0, ... ]
49. 49. Linear Classiﬁer The quick brown fox jumped over the lazy dog. ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...x [ 0, ... 0, ... 1, ... 1, ... 0, ... ] [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ]
50. 50. Linear Classiﬁer The quick brown fox jumped over the lazy dog. ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ...x [ 0, ... 0, ... 1, ... 1, ... 0, ... ]w [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ]
51. 51. Linear Classifier The quick brown fox jumped over the lazy dog. ‘a’ ... ‘aardvark’ ... ‘dog’ ... ‘the’ ... ‘montañas’ ... x [ 0, ... 0, ... 1, ... 1, ... 0, ... ] w [ 0.1, ... 132, ... 150, ... 200, ... -153, ... ] $f(x) = \mathbf{w} \cdot \mathbf{x} = \sum_{p=1}^{P} w_p x_p$
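The linear classifier above can be sketched in a few lines: a binary bag-of-words vector x dotted with a weight vector w. This is a minimal illustration; the vocabulary and the weight values are stand-ins (loosely echoing the numbers on the slide), not anything from the lecture's actual models.

```python
# Toy vocabulary mapping each word to its feature index p.
VOCAB = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'dog': 4}

def featurize(text):
    """Binary bag-of-words vector x: x_p = 1 iff word p occurs."""
    x = [0] * len(VOCAB)
    for word in text.lower().split():
        if word in VOCAB:
            x[VOCAB[word]] = 1
    return x

def f(x, w):
    """f(x) = w . x = sum_p w_p * x_p; the sign gives the predicted class."""
    return sum(wp * xp for wp, xp in zip(w, x))

w = [200.0, 132.0, 150.0, 0.1, -153.0]  # hypothetical learned weights
score = f(featurize("The quick brown fox"), w)
```

In practice x is extremely sparse (P can be a billion), so real implementations store only the nonzero indices rather than a dense vector.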
52. 52. Training Data: Input X and Output Y as a table of N instances (rows) by P features (columns)
53. 53. Typical machine learning data at Google. N: 100 billion / 1 billion. P: 1 billion / 10 million (mean / median) http://www.flickr.com/photos/mr_t_in_dc/5469563053
54. 54. Classifier Training • Training: Given {(x, y)} and f, minimize the following objective function: $\arg \min_{\mathbf{w}} \sum_{i=1}^{N} L(y_i, f(x_i; \mathbf{w})) + R(\mathbf{w})$
55. 55. Use Newton's method? $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - H(\mathbf{w}^t)^{-1} \nabla J(\mathbf{w}^t)$ http://www.flickr.com/photos/visitfinland/5424369765/
56. 56. Outline• Machine Learning intro• Scaling machine learning algorithms up• Design choices of large scale ML systems
57. 57. Scaling Up
58. 58. Scaling Up• Why big data?
59. 59. Scaling Up• Why big data?• Parallelize machine learning algorithms
60. 60. Scaling Up• Why big data?• Parallelize machine learning algorithms • Embarrassingly parallel
61. 61. Scaling Up• Why big data?• Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines
62. 62. Scaling Up• Why big data?• Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
63. 63. Subsampling
64. 64. Subsampling Big Data
65. 65. Subsampling Big DataShard 1 Shard 2 Shard 3 Shard M ...
66. 66. Subsampling Big DataReduce N Shard 1
67. 67. Subsampling Big DataReduce N Shard 1 Machine
68. 68. Subsampling Big DataReduce N Machine Shard 1
69. 69. Subsampling Big DataReduce N Machine Shard 1 Model
70. 70. Why not Small Data? [Banko and Brill, 2001]
71. 71. Scaling Up• Why big data?• Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
72. 72. Parallelize Estimates • Naive Bayes Classifier: $\arg \min_{\mathbf{w}} -\prod_{i=1}^{N} \prod_{p=1}^{P} P(x^i_p \mid y_i; \mathbf{w})\, P(y_i; \mathbf{w})$ • Maximum Likelihood Estimates: $w_{the|EN} = \frac{\sum_{i=1}^{N} \mathbf{1}_{EN,the}(x^i)}{\sum_{i=1}^{N} \mathbf{1}_{EN}(x^i)}$
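The maximum-likelihood estimate on this slide is just a ratio of counts: the fraction of documents of a given language that contain a given word. A minimal sketch, with a toy corpus of my own invention:

```python
def mle_weight(word, label, corpus):
    """w_{word|label}: fraction of documents with `label` containing `word`,
    i.e. sum_i 1_{label,word}(x^i) / sum_i 1_{label}(x^i)."""
    docs = [text.split() for text, y in corpus if y == label]
    return sum(word in doc for doc in docs) / len(docs)

corpus = [
    ("the quick brown fox", "EN"),
    ("to err is human", "EN"),
    ("no hay mal que por bien", "ES"),
]
w_the_en = mle_weight("the", "EN", corpus)  # 1 of the 2 EN docs contains 'the'
```

Because the estimate decomposes into independent counts, it parallelizes trivially — which is exactly why the next slides implement it as a word-counting MapReduce.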
73. 73. Word Counting
74. 74. Word CountingMap
75. 75. Word Counting X: “The quick brown fox ...”Map Y: EN
76. 76. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...”Map Y: EN
77. 77. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...”Map (‘quick|EN’, 1) Y: EN
78. 78. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...”Map (‘quick|EN’, 1) Y: EN (‘brown|EN’, 1)
79. 79. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...” Map (‘quick|EN’, 1) Y: EN (‘brown|EN’, 1)Reduce
80. 80. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...” Map (‘quick|EN’, 1) Y: EN (‘brown|EN’, 1)Reduce [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ]
81. 81. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...” Map (‘quick|EN’, 1) Y: EN (‘brown|EN’, 1)Reduce [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ] C(‘the’|EN) = SUM of values = 3
82. 82. Word Counting (‘the|EN’, 1) X: “The quick brown fox ...” Map (‘quick|EN’, 1) Y: EN (‘brown|EN’, 1)Reduce [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ] C(‘the’|EN) = SUM of values = 3 C( the |EN ) w the |EN = C(EN )
83. 83. Word Counting
84. 84. Word Counting Big Data
85. 85. Word Counting Big DataShard 1 Shard 2 Shard 3 ... Shard M
86. 86. Word Counting Big Data Mapper 1 Mapper 2 Mapper 3 Mapper MMap Shard 1 Shard 2 Shard 3 ... Shard M (‘the’ | EN, 1)
87. 87. Word Counting Big Data Mapper 1 Mapper 2 Mapper 3 Mapper M Map Shard 1 Shard 2 Shard 3 ... Shard M (‘the’ | EN, 1) (‘fox’ | EN, 1) ... (‘montañas’ | ES, 1) ReducerReduce Tally counts and update w
88. 88. Word Counting Big Data Mapper 1 Mapper 2 Mapper 3 Mapper M Map Shard 1 Shard 2 Shard 3 ... Shard M (‘the’ | EN, 1) (‘fox’ | EN, 1) ... (‘montañas’ | ES, 1) ReducerReduce Tally counts and update w Model
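The map/shuffle/reduce flow of slides 73-88 can be simulated in a single process. This is an illustrative sketch, not Google's implementation; the shard layout and the `'word|label'` key format are assumptions made for the example.

```python
from collections import defaultdict

def mapper(doc, label):
    """Map step: emit ('word|label', 1) for each word in the document."""
    for word in doc.split():
        yield (f"{word}|{label}", 1)

def word_count(shards):
    """Shuffle groups emitted pairs by key; reduce sums the values,
    yielding the counts C(word|label) used for the MLE weights."""
    groups = defaultdict(list)
    for shard in shards:                 # each shard: list of (doc, label)
        for doc, label in shard:         # in reality, one mapper per shard
            for key, value in mapper(doc, label):
                groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

shards = [[("the quick fox", "EN")], [("the lazy dog", "EN")]]
counts = word_count(shards)              # counts["the|EN"] == 2
```

Dividing each count by the per-label document count then gives w_{word|label} exactly as on slide 82.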
89. 89. Parallelize Optimization $\arg \min_{\mathbf{w}} \prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x^i_p)^{y_i}}{1 + \exp(\sum_{p=1}^{P} w_p x^i_p)}$
90. 90. Parallelize Optimization • Maximum Entropy Classifiers: $\arg \min_{\mathbf{w}} \prod_{i=1}^{N} \frac{\exp(\sum_{p=1}^{P} w_p x^i_p)^{y_i}}{1 + \exp(\sum_{p=1}^{P} w_p x^i_p)}$
94. 94. Parallelize Optimization • Maximum Entropy Classifiers • Good: J(w) is concave
95. 95. Parallelize Optimization • Maximum Entropy Classifiers • Good: J(w) is concave • Bad: no closed-form solution like NB
96. 96. Parallelize Optimization • Maximum Entropy Classifiers • Good: J(w) is concave • Bad: no closed-form solution like NB • Ugly: Large N
98. 98. Gradient Descent • w is initialized as zero • for t in 1 to T • Calculate gradients
99. 99. Gradient Descent • w is initialized as zero • for t in 1 to T • Calculate gradients $\nabla J(\mathbf{w})$
100. 100. Gradient Descent • w is initialized as zero • for t in 1 to T • Calculate gradients $\nabla J(\mathbf{w})$ • $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$
101. 101. Gradient Descent • w is initialized as zero • for t in 1 to T • Calculate gradients $\nabla J(\mathbf{w})$ • $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$ • $\nabla J(\mathbf{w}) = \sum_{i=1}^{N} \nabla P(\mathbf{w}, x_i, y_i)$
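The loop on these slides can be written out directly. A minimal batch-gradient-descent sketch for the logistic/maxent objective, assuming dense feature lists and illustrative hyperparameters (eta, T) that are not from the lecture:

```python
import math

def predict(w, x):
    """P(y=1|x) under the logistic model."""
    s = sum(wp * xp for wp, xp in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))

def gradient_descent(data, P, eta=0.1, T=100):
    """w starts at zero; T steps of w <- w - eta * grad J(w), where
    grad J(w) is a sum of per-instance gradients over all N examples."""
    w = [0.0] * P
    for _ in range(T):
        grad = [0.0] * P
        for x, y in data:                 # full batch: O(N P) per iteration
            err = predict(w, x) - y       # gradient of the neg. log-likelihood
            for p in range(P):
                grad[p] += err * x[p]
        w = [wp - eta * g for wp, g in zip(w, grad)]
    return w

data = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w = gradient_descent(data, P=2)
```

The O(TPN) cost of this inner loop is exactly what the next slides attack by sharding the per-instance sum across M machines.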
102. 102. Distribute Gradient • w is initialized as zero • for t in 1 to T • Calculate gradients in parallel • Training CPU: O(TPN) to O(TPN / M)
103. 103. Distribute Gradient • w is initialized as zero • for t in 1 to T • Calculate gradients in parallel • $\mathbf{w}^{t+1} \leftarrow \mathbf{w}^{t} - \eta \nabla J(\mathbf{w})$ • Training CPU: O(TPN) to O(TPN / M)
105. 105. Distribute Gradient Big Data
106. 106. Distribute Gradient Big Data Shard 1 Shard 2 Shard 3 ... Shard M
107. 107. Distribute Gradient Big Data Machine 1 Machine 2 Machine 3 Machine MMap Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, partial gradient sum)
108. 108. Distribute Gradient Big Data Machine 1 Machine 2 Machine 3 Machine M Map Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, partial gradient sum)Reduce Sum and Update w
109. 109. Distribute Gradient Big Data Machine 1 Machine 2 Machine 3 Machine M Map Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, partial gradient sum)Reduce Sum and Update w Repeat M/R until converge Model
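The distribute-gradient pattern in these slides — mappers emit a partial gradient sum per shard, one reducer adds them and updates w, and the M/R round repeats — can be simulated in one process. A sketch under the same toy logistic model; shard contents and step size are illustrative assumptions.

```python
import math

def partial_gradient(w, shard):
    """Map step: gradient contribution of one shard (the partial sum
    emitted under a dummy key on the slide)."""
    grad = [0.0] * len(w)
    for x, y in shard:
        s = sum(wp * xp for wp, xp in zip(w, x))
        err = 1.0 / (1.0 + math.exp(-s)) - y
        for p, xp in enumerate(x):
            grad[p] += err * xp
    return grad

def distributed_step(w, shards, eta=0.1):
    """Reduce step: sum the M partial gradients, then update w once."""
    total = [0.0] * len(w)
    for shard in shards:                  # in reality, M machines in parallel
        for p, g in enumerate(partial_gradient(w, shard)):
            total[p] += g
    return [wp - eta * g for wp, g in zip(w, total)]

shards = [[([1.0, 0.0], 1)], [([0.0, 1.0], 0)]]
w = [0.0, 0.0]
for _ in range(50):                       # repeat M/R rounds until converged
    w = distributed_step(w, shards)
```

Because the partial sums add up to the exact full-batch gradient, this computes the same updates as the serial loop, just M times faster per iteration (ignoring communication).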
110. 110. Scaling Up• Why big data?• Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
111. 111. Parallelize Subroutines • Support Vector Machines: $\arg \min_{\mathbf{w},b,\boldsymbol{\zeta}} \frac{1}{2} \|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \zeta_i$ s.t. $1 - y_i(\mathbf{w} \cdot \phi(x_i) + b) \le \zeta_i$, $\zeta_i \ge 0$ • Solve the dual problem: $\arg \min_{\boldsymbol{\alpha}} \frac{1}{2} \boldsymbol{\alpha}^T \mathbf{Q} \boldsymbol{\alpha} - \boldsymbol{\alpha}^T \mathbf{1}$ s.t. $0 \le \boldsymbol{\alpha} \le C$, $\mathbf{y}^T \boldsymbol{\alpha} = 0$
112. 112. The computational cost for the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory http://www.flickr.com/photos/sea-turtle/198445204/
113. 113. Parallel SVM [Chang et al, 2007]
114. 114. Parallel SVM [Chang et al, 2007] • Parallel, row-wise incomplete Cholesky Factorization for Q
115. 115. Parallel SVM [Chang et al, 2007] • Parallel, row-wise incomplete Cholesky Factorization for Q • Parallel interior point method • Time O(n^3) becomes O(n^2 / M) • Memory O(n^2) becomes O(n√N / M)
116. 116. Parallel SVM [Chang et al, 2007] • Parallel, row-wise incomplete Cholesky Factorization for Q • Parallel interior point method • Time O(n^3) becomes O(n^2 / M) • Memory O(n^2) becomes O(n√N / M) • Parallel Support Vector Machines (psvm) http://code.google.com/p/psvm/ • Implement in MPI
117. 117. Parallel ICF • Distribute Q by row into M machines: Machine 1 (rows 1, 2), Machine 2 (rows 3, 4), Machine 3 (rows 5, 6), ... • For each dimension n < √N • Send local pivots to master • Master selects the largest local pivot and broadcasts the global pivot to workers
118. 118. Scaling Up• Why big data?• Parallelize machine learning algorithms • Embarrassingly parallel • Parallelize sub-routines • Distributed learning
119. 119. Majority Vote
120. 120. Majority Vote Big Data
121. 121. Majority Vote Big DataShard 1 Shard 2 Shard 3 ... Shard M
122. 122. Majority Vote Big Data Machine 1 Machine 2 Machine 3 Machine MMap Shard 1 Shard 2 Shard 3 ... Shard M
123. 123. Majority Vote Big Data Machine 1 Machine 2 Machine 3 Machine MMap Shard 1 Shard 2 Shard 3 ... Shard M Model 1 Model 2 Model 3 Model 4
124. 124. Majority Vote• Train individual classiﬁers independently• Predict by taking majority votes• Training CPU: O(TPN) to O(TPN / M)
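The majority-vote scheme above is simple enough to sketch end to end. The per-shard "models" here are deliberately trivial stand-ins (each just memorizes its shard's majority label), invented for this example to show the voting structure only; any real classifier trained per shard would slot into `train_on_shard`.

```python
from collections import Counter

def train_on_shard(shard):
    """Hypothetical per-shard model: always predicts the shard's
    majority label, regardless of the input."""
    majority = Counter(y for _, y in shard).most_common(1)[0][0]
    return lambda x: majority

def majority_vote(models, x):
    """Predict by taking the majority vote over the M models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

shards = [[("doc a", "EN")], [("doc b", "EN")], [("doc c", "ES")]]
models = [train_on_shard(s) for s in shards]
label = majority_vote(models, "some new doc")   # two of three shards vote EN
```

Training is embarrassingly parallel (O(TPN/M)), but prediction now costs M model evaluations per query — one motivation for averaging parameters instead, as the next slides do.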
125. 125. Parameter Mixture [Mann et al, 2009]
126. 126. Parameter Mixture [Mann et al, 2009] Big Data
127. 127. Parameter Mixture [Mann et al, 2009] Big Data Shard 1 Shard 2 Shard 3 ... Shard M
128. 128. Parameter Mixture [Mann et al, 2009] Big Data Machine 1 Machine 2 Machine 3 Machine MMap Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, w1) (dummy key, w2) ...
129. 129. Parameter Mixture [Mann et al, 2009] Big Data Machine 1 Machine 2 Machine 3 Machine M Map Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, w1) (dummy key, w2) ...Reduce Average w
130. 130. Parameter Mixture [Mann et al, 2009] Big Data Machine 1 Machine 2 Machine 3 Machine M Map Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, w1) (dummy key, w2) ...Reduce Average w Model
131. 131. Much less network usage than distributed gradient descent: O(MN) vs. O(MNT) http://www.flickr.com/photos/annamatic3000/127945652/
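The parameter-mixture reduce step of slides 125-130 is a single coordinate-wise average of the M shard-trained weight vectors, sent over the network once rather than every iteration. A minimal sketch with made-up weight vectors:

```python
def parameter_mixture(shard_weights):
    """Reduce step: average M weight vectors coordinate-wise into one model."""
    M = len(shard_weights)
    P = len(shard_weights[0])
    return [sum(w[p] for w in shard_weights) / M for p in range(P)]

# Hypothetical weights learned independently on two shards.
w = parameter_mixture([[1.0, 2.0], [3.0, 4.0]])
```

Each machine ships its weight vector exactly once, which is where the O(MN) vs. O(MNT) network saving on this slide comes from.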
132. 132. Iterative Param Mixture [McDonald et al., 2010]
133. 133. Iterative Param Mixture[McDonald et al., 2010] Big Data
134. 134. Iterative Param Mixture [McDonald et al., 2010] Big Data Shard 1 Shard 2 Shard 3 ... Shard M
135. 135. Iterative Param Mixture [McDonald et al., 2010] Big Data Machine 1 Machine 2 Machine 3 Machine MMap Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, w1) (dummy key, w2) ...
136. 136. Iterative Param Mixture [McDonald et al., 2010] Big Data Machine 1 Machine 2 Machine 3 Machine M Map Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, w1) (dummy key, w2) ... Reduce after each epoch: Average w
137. 137. Iterative Param Mixture [McDonald et al., 2010] Big Data Machine 1 Machine 2 Machine 3 Machine M Map Shard 1 Shard 2 Shard 3 ... Shard M (dummy key, w1) (dummy key, w2) ... Reduce after each epoch: Average w Model
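The iterative variant interleaves the average with training: after every epoch the shard weights are averaged and broadcast back as the starting point for the next epoch. In this sketch `train_epoch` is a hypothetical stand-in for one epoch of per-shard training (it just nudges weights toward the shard's mean feature vector, purely to make the loop concrete); the structure of the map/reduce loop is the point.

```python
def train_epoch(w, shard):
    """Placeholder per-shard epoch: move each weight halfway toward the
    shard's mean feature value (illustrative only, not a real learner)."""
    means = [sum(x[p] for x, _ in shard) / len(shard) for p in range(len(w))]
    return [wp + 0.5 * (m - wp) for wp, m in zip(w, means)]

def iterative_parameter_mixture(shards, P, epochs=3):
    w = [0.0] * P
    for _ in range(epochs):
        shard_ws = [train_epoch(w, shard) for shard in shards]   # map
        w = [sum(sw[p] for sw in shard_ws) / len(shard_ws)       # reduce:
             for p in range(P)]                                  # average w
    return w                              # broadcast back each epoch

shards = [[([1.0, 0.0], 1)], [([0.0, 1.0], 0)]]
w = iterative_parameter_mixture(shards, P=2)
```

Resynchronizing every epoch costs more network traffic than a one-shot mixture but, per the lecture's summary, recovers accuracy on par with training on all the data.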
138. 138. Outline• Machine Learning intro• Scaling machine learning algorithms up• Design choices of large scale ML systems
139. 139. Scalable http://www.ﬂickr.com/photos/mr_t_in_dc/5469563053
140. 140. Parallelhttp://www.ﬂickr.com/photos/aloshbennett/3209564747/
141. 141. Accuracyhttp://www.ﬂickr.com/photos/wanderlinse/4367261825/
143. 143. Binary Classiﬁcationhttp://www.ﬂickr.com/photos/brenderous/4532934181/
144. 144. Automatic FeatureDiscovery http://www.ﬂickr.com/photos/mararie/2340572508/
145. 145. Fast Responsehttp://www.ﬂickr.com/photos/prunejuice/3687192643/
146. 146. Memory is the new hard disk. http://www.flickr.com/photos/jepoirrier/840415676/
147. 147. Algorithm + Infrastructurehttp://www.ﬂickr.com/photos/neubie/854242030/
148. 148. Design forMulticores http://www.ﬂickr.com/photos/geektechnique/2344029370/
149. 149. Combiner
150. 150. Multi-shard Combiner[Chandra et al., 2010]
151. 151. Machine Learning on Big Data
152. 152. Parallelize ML Algorithms
153. 153. Parallelize ML Algorithms• Embarrassingly parallel
154. 154. Parallelize ML Algorithms• Embarrassingly parallel• Parallelize sub-routines
155. 155. Parallelize ML Algorithms• Embarrassingly parallel• Parallelize sub-routines• Distributed learning
156. 156. Parallel
157. 157. Parallel Accuracy
158. 158. Parallel Accuracy Fast Response
159. 159. Parallel Accuracy Fast Response