Machine Learning on Big Data
Lessons Learned from Google Projects

Max Lin
Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264
Guest Lecture | March 29th, 2011
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
“Machine Learning is a study of computer algorithms that improve automatically through experience.”
Training (Input X → Output Y):
    “The quick brown fox jumped over the lazy dog.”                       → English
    “To err is human, but to really foul things up you need a computer.”  → English
    “No hay mal que por bien no venga.”                                   → Spanish
    “La tercera es la vencida.”                                           → Spanish
        ↓ learn Model f(x)

Testing (apply f(x’) = y’):
    “To be or not to be -- that is the question”                          → ?
    “La fe mueve montañas.”                                               → ?
Linear Classifier
       The quick brown fox jumped over the lazy dog.

      ‘a’  ...  ‘aardvark’  ...  ‘dog’  ...  ‘the’  ...  ‘montañas’  ...
x = [ 0,   ...   0,         ...   1,    ...   1,    ...   0,         ... ]

w = [ 0.1, ...   132,       ...   150,  ...   200,  ...   -153,      ... ]

              f(x) = w · x = \sum_{p=1}^{P} w_p ∗ x_p
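
To make the dot product concrete, here is a minimal Python sketch (not from the lecture) of a sparse bag-of-words linear classifier; the toy vocabulary and weights are invented for illustration.

```python
# A sparse bag-of-words linear classifier: featurize() builds the sparse
# binary vector x and score() computes f(x) = w . x over non-zero features.
# The vocabulary and weights below are toy values, not from the lecture.
def featurize(text, vocabulary):
    x = {}
    for token in text.lower().split():
        token = token.strip('.,!?')
        if token in vocabulary:
            x[vocabulary[token]] = 1
    return x

def score(w, x):
    return sum(w.get(p, 0.0) * value for p, value in x.items())

vocabulary = {'dog': 0, 'the': 1, 'montañas': 2}   # toy vocabulary
w = {0: 150.0, 1: 200.0, 2: -153.0}                # toy weights
x = featurize("The quick brown fox jumped over the lazy dog.", vocabulary)
print(score(w, x))   # 350.0: a large positive score for this English sentence
```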
Training Data
[Figure: the training data form an N × P matrix; Input X has N examples (rows) and P features (columns), paired with an N-vector of labels, Output Y.]
Typical machine learning
data at Google

N: 100 billion / 1 billion
P: 1 billion / 10 million
(mean / median)




                              http://www.flickr.com/photos/mr_t_in_dc/5469563053
Classifier Training


• Training: Given {(x, y)} and f, minimize the
  following objective function

        arg min_w  \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)
Use Newton’s method?
        w^{t+1} ← w^t − H(w^t)^{-1} ∇J(w^t)

                    http://www.flickr.com/photos/visitfinland/5424369765/
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Subsampling
[Figure: Big Data is split into Shard 1 ... Shard M; subsampling reduces N by keeping only one shard, which a single Machine uses to train the Model.]
Why not Small Data?

[Figure from Banko and Brill, 2001: accuracy of several learners on a natural-language task keeps improving as the training set grows from one million to one billion words.]
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Parallelize Estimates
• Naive Bayes Classifier
        arg min_w  − \prod_{i=1}^{N} P(y_i; w) \prod_{p=1}^{P} P(x_p^i | y_i; w)


• Maximum Likelihood Estimates

        w_{the|EN} = ( \sum_{i=1}^{N} 1_{EN,the}(x^i) ) / ( \sum_{i=1}^{N} 1_{EN}(x^i) )
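
A minimal sketch of this maximum likelihood estimate, assuming a toy labeled corpus: the weight for (‘the’ | EN) is the number of English documents containing ‘the’ divided by the number of English documents, following the per-document indicators above (smoothing omitted).

```python
# Maximum likelihood estimate of w_{the|EN} from a toy labeled corpus:
# count English documents containing 'the', divide by the count of English
# documents (per-document indicators, as in the formula above; no smoothing).
from collections import defaultdict

corpus = [
    ("the quick brown fox jumped over the lazy dog", "EN"),
    ("to err is human but to really foul things up you need a computer", "EN"),
    ("no hay mal que por bien no venga", "ES"),
]

label_count = defaultdict(int)        # C(label)
word_label_count = defaultdict(int)   # C(word, label) over documents
for text, label in corpus:
    label_count[label] += 1
    for word in set(text.split()):
        word_label_count[(word, label)] += 1

w_the_EN = word_label_count[("the", "EN")] / label_count["EN"]
print(w_the_EN)   # 0.5 on this toy corpus
```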
Word Counting
  Map:     X: “The quick brown fox ...”,  Y: EN
           → (‘the|EN’, 1), (‘quick|EN’, 1), (‘brown|EN’, 1), ...

  Reduce:  [ (‘the|EN’, 1), (‘the|EN’, 1), (‘the|EN’, 1) ]
           → C(‘the’|EN) = SUM of values = 3

           w_{the|EN} = C(‘the’|EN) / C(EN)
Word Counting
[Figure: Big Data is split into Shard 1 ... Shard M, one per Mapper; the Map phase emits counts such as (‘the’|EN, 1), (‘fox’|EN, 1), ..., (‘montañas’|ES, 1); a Reducer tallies the counts and updates w, producing the Model.]
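
This flow can be simulated in a few lines of plain Python (no real MapReduce runtime is assumed): mappers emit (‘word|LABEL’, 1) pairs for their shard, and the reducer sums the values per key, as in the slide.

```python
# A plain-Python simulation of the word-counting MapReduce: each mapper
# emits ('word|LABEL', 1) pairs for its shard, and the reducer sums the
# values per key.
from collections import defaultdict
from itertools import chain

def mapper(record):
    text, label = record
    for word in text.split():
        yield (word + "|" + label, 1)

def reducer(pairs):
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

shards = [
    [("the quick brown fox", "EN"), ("the lazy dog", "EN")],
    [("no hay mal que por bien no venga", "ES")],
]
mapped = chain.from_iterable(mapper(r) for shard in shards for r in shard)
counts = reducer(mapped)
print(counts["the|EN"])   # 2
```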
Parallelize Optimization
• Maximum Entropy Classifiers
        arg min_w  \prod_{i=1}^{N}  exp(\sum_{p=1}^{P} w_p ∗ x_p^i ∗ y_i) / (1 + exp(\sum_{p=1}^{P} w_p ∗ x_p^i))


• Good: J(w) is concave
• Bad: no closed-form solution like NB
• Ugly: Large N
Gradient Descent




        http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf
Gradient Descent
• w is initialized to zero
• for t in 1 to T
  • Calculate the gradient ∇J(w^t)
  • w^{t+1} ← w^t − η ∇J(w^t)

        ∇J(w) = \sum_{i=1}^{N} ∇_w L(y_i, f(x_i; w))
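
A minimal batch gradient descent sketch, assuming a logistic-style loss so the per-example gradient has a closed form; the data, step size η, and iteration count T are illustrative.

```python
# Batch gradient descent with a logistic-style loss: the gradient is a sum
# of per-example terms, recomputed from the full data at every iteration.
# Data, step size eta, and iteration count T are illustrative.
import numpy as np

def gradient(w, X, y):
    probs = 1.0 / (1.0 + np.exp(-(X @ w)))   # per-example predictions
    return X.T @ (probs - y)                  # sum of per-example gradients

def gradient_descent(X, y, eta=0.1, T=100):
    w = np.zeros(X.shape[1])                  # w initialized to zero
    for _ in range(T):
        w = w - eta * gradient(w, X, y)       # w^{t+1} <- w^t - eta * grad J(w^t)
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(gradient_descent(X, y))
```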
Distribute Gradient
• w is initialized to zero
• for t in 1 to T
  • Calculate gradients in parallel
  • w^{t+1} ← w^t − η ∇J(w^t)


• Training CPU: O(TPN) to O(TPN / M)
Distribute Gradient
[Figure: each of M machines maps over its shard of the Big Data and emits (dummy key, partial gradient sum); a Reducer sums the partial gradients and updates w; the Map/Reduce is repeated until convergence, yielding the Model.]
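
A serial simulation of one distributed gradient step, assuming the same logistic-style loss as above: each shard's mapper computes a partial gradient sum, and the reducer adds them and updates w.

```python
# A serial simulation of one distributed gradient step: each shard's mapper
# computes a partial gradient sum (the "dummy key" is implicit), and the
# reducer adds the partial sums and updates w. Logistic-style loss as above.
import numpy as np

def partial_gradient(w, shard_X, shard_y):
    probs = 1.0 / (1.0 + np.exp(-(shard_X @ w)))
    return shard_X.T @ (probs - shard_y)      # gradient summed over one shard

def distributed_gradient_step(w, shards, eta=0.1):
    partials = [partial_gradient(w, X, y) for X, y in shards]   # Map
    return w - eta * np.sum(partials, axis=0)                   # Reduce + update

shards = [
    (np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([1.0, 1.0])),
    (np.array([[0.0, 1.0], [0.0, 0.0]]), np.array([0.0, 0.0])),
]
w = np.zeros(2)
for _ in range(50):      # repeat Map/Reduce until (approximately) converged
    w = distributed_gradient_step(w, shards)
print(w)
```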
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Parallelize Subroutines
• Support Vector Machines
        arg min_{w,b,ζ}  (1/2) ||w||_2^2 + C \sum_{i=1}^{n} ζ_i

        s.t.  1 − y_i (w · φ(x_i) + b) ≤ ζ_i ,   ζ_i ≥ 0

• Solve the dual problem

        arg min_α  (1/2) α^T Q α − α^T 1

        s.t.  0 ≤ α ≤ C ,   y^T α = 0
The computational cost for the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.




http://www.flickr.com/photos/sea-turtle/198445204/
Parallel SVM                             [Chang et al, 2007]

•   Parallel, row-wise Incomplete Cholesky
    Factorization (ICF) for Q
•   Parallel interior point method
    •   Time O(n^3) becomes O(n^2 / M)
    •   Memory O(n^2) becomes O(n√N / M)
•   Parallel Support Vector Machines (psvm):
    http://code.google.com/p/psvm/
    •   Implemented in MPI
Parallel ICF
• Distribute Q by row into M machines
    Machine 1     Machine 2   Machine 3

      row 1        row 3       row 5      ...
      row 2        row 4       row 6


• For each of the ≈ √N dimensions
  • Each worker sends its local pivot to the master
  • The master selects the largest of the local pivots and
    broadcasts the global pivot to the workers
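
A hedged, serial sketch of the pivot-selection step only (the actual psvm code uses MPI): rows of Q are assigned round-robin to M simulated workers, each reports its largest remaining diagonal entry, and the master picks the global pivot.

```python
# A serial sketch of the pivot-selection step: rows of Q are assigned
# round-robin to M simulated workers; each worker reports its largest
# remaining diagonal entry as its local pivot, and the master picks the
# global pivot to broadcast.
import numpy as np

def select_global_pivot(diag, owned_rows):
    local_pivots = []
    for rows in owned_rows:                       # each worker, in turn
        best = max(rows, key=lambda r: diag[r])   # local pivot
        local_pivots.append((diag[best], best))
    _, global_pivot = max(local_pivots)           # master picks the largest
    return global_pivot

diag = np.array([0.5, 2.0, 1.5, 0.1, 3.0, 0.7])   # toy diagonal of Q
owned_rows = [[0, 3], [1, 4], [2, 5]]             # M = 3 workers, rows round-robin
print(select_global_pivot(diag, owned_rows))      # -> 4
```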
Scaling Up

• Why big data?
• Parallelize machine learning algorithms
 • Embarrassingly parallel
 • Parallelize sub-routines
 • Distributed learning
Majority Vote
[Figure: Big Data is split into Shards 1..M; Machine m trains on Shard m independently in the Map phase, producing Models 1..M.]
Majority Vote

• Train individual classifiers independently
• Predict by taking majority votes
• Training CPU: O(TPN) to O(TPN / M)
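
A minimal sketch of majority-vote prediction over M independently trained models; the "models" here are stand-in callables rather than real trained classifiers.

```python
# Majority-vote prediction over M independently trained models; the models
# here are stand-in callables returning a label rather than real classifiers.
from collections import Counter

def majority_vote(models, x):
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

models = [lambda x: "EN", lambda x: "EN", lambda x: "ES"]   # toy "models"
print(majority_vote(models, "the quick brown fox"))          # -> 'EN'
```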
Parameter Mixture                          [Mann et al, 2009]

[Figure: Machine m trains weights w_m on Shard m in the Map phase and emits (dummy key, w_m); a single Reducer averages the w_m to produce the Model.]
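
The reduce step is just an average of the per-shard weight vectors; a minimal sketch with made-up weights:

```python
# The parameter-mixture reduce step: average the per-shard weight vectors
# w_1..w_M (toy values below) that the mappers emit under a dummy key.
import numpy as np

def average_parameters(shard_weights):
    return np.mean(shard_weights, axis=0)

shard_weights = [np.array([0.2, 1.0]), np.array([0.4, 0.6]), np.array([0.0, 0.8])]
print(average_parameters(shard_weights))   # -> [0.2 0.8]
```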
Much less network usage than
distributed gradient descent:
O(MN) vs. O(MNT)




http://www.flickr.com/photos/annamatic3000/127945652/
Iterative Param Mixture                       [McDonald et al., 2010]

[Figure: same Map phase as the parameter mixture, but the Reduce step (averaging w) runs after each training epoch, and the averaged Model is sent back to the machines for the next epoch.]
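
A hedged sketch of iterative parameter mixing, assuming a perceptron-style per-shard learner (the setting of McDonald et al., 2010): after every epoch the per-shard weights are averaged and each shard resumes training from the average. The shard data below are toy values.

```python
# Iterative parameter mixing with a perceptron-style per-shard learner:
# after every epoch the per-shard weights are averaged and each shard
# resumes training from the average.
import numpy as np

def train_one_epoch(w, shard_X, shard_y, eta=0.1):
    w = w.copy()
    for x, y in zip(shard_X, shard_y):     # labels y are in {-1, +1}
        if y * (x @ w) <= 0:               # mistake-driven perceptron update
            w += eta * y * x
    return w

def iterative_parameter_mixture(shards, epochs=5, dim=2):
    w = np.zeros(dim)
    for _ in range(epochs):
        shard_ws = [train_one_epoch(w, X, y) for X, y in shards]   # Map
        w = np.mean(shard_ws, axis=0)                               # Reduce
    return w

shards = [
    (np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([1.0, 1.0])),
    (np.array([[0.0, 1.0], [0.2, 0.9]]), np.array([-1.0, -1.0])),
]
print(iterative_parameter_mixture(shards))
```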
Outline

• Machine Learning intro
• Scaling machine learning algorithms up
• Design choices of large scale ML systems
Scalable



           http://www.flickr.com/photos/mr_t_in_dc/5469563053
Parallel



http://www.flickr.com/photos/aloshbennett/3209564747/
Accuracy
http://www.flickr.com/photos/wanderlinse/4367261825/
http://www.flickr.com/photos/imagelink/4006753760/
Binary
                                                     Classification
http://www.flickr.com/photos/brenderous/4532934181/
Automatic
 Feature
Discovery


   http://www.flickr.com/photos/mararie/2340572508/
Fast
                                              Response

http://www.flickr.com/photos/prunejuice/3687192643/
Memory is the new hard disk.




http://www.flickr.com/photos/jepoirrier/840415676/
Algorithm +
                                                Infrastructure

http://www.flickr.com/photos/neubie/854242030/
Design for
Multicores
             http://www.flickr.com/photos/geektechnique/2344029370/
Combiner
Multi-shard Combiner




[Chandra et al., 2010]
Machine
Learning on
 Big Data
Parallelize ML
         Algorithms

• Embarrassingly parallel
• Parallelize sub-routines
• Distributed learning
Parallel · Accuracy · Fast Response
Google APIs
•   Prediction API
    •   machine learning service on the cloud
    •   http://code.google.com/apis/predict


•   BigQuery
    •   interactive analysis of massive data on the cloud
    •   http://code.google.com/apis/bigquery
