1
LSRS 2016
(Some) pitfalls of distributed learning
Yves Raimond, Algorithms Engineering, Netflix
2
Some background
3
2006
4
5
Netflix scale
▪ > 83M members
▪ > 190 countries
▪ > 3B hours/month
▪ > 1000 device types
▪ 36% of peak US downstream traffic
6
Recommendations @ Netflix
7
Goal
Help members find content to watch and enjoy,
to maximize member satisfaction and retention
8
9
Two potential reasons to try distributed
learning
10
Reason 1: minimizing training time
[Timeline figure for Model 1: collecting the dataset, then training, then serving; the gap before serving is the time-to-serve delay]
11
Training time vs online performance
▪ Most (all?) recommendation algorithms need to predict
future behavior from past information
▪ If model training takes days, it might miss out on important
changes
▪ New items being introduced
▪ Popularity swings
▪ Changes in underlying feature distributions
▪ Time-to-serve can be a key factor in how good the
recommendations will be online
12
Training time vs experimentation speed
▪ Faster training time
=> more offline experiments and iterations
=> better models
▪ Many other factors at play (like modularity of the ML
framework), but training time is a key one
▪ How quickly can you iterate through e.g. model architectures
if training a model takes days?
13
Reason 2: increasing dataset size
▪ If your model is complex enough (trees, DNNs, …) more data
could help
▪ … But this will have an impact on the training time
▪ … Which in turn could have a negative impact on
time-to-serve delay and experimentation speed
▪ Hard limits
14
Let’s distribute!
15
Topic-sensitive PageRank
▪ Popular graph diffusion algorithm
▪ Captures vertex importance with respect to a particular
source vertex (update rule below)
▪ Easy to distribute using Spark and GraphX
▪ Fast distributed implementation contributed by Netflix
(coming up in Spark 2.1!)
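As a reminder, the diffusion above amounts to the standard personalized-PageRank update. The notation here is assumed, not taken from the deck: α is the restart probability, P the row-normalized adjacency matrix, and e_s the indicator vector of the seed vertex s.

```latex
% Personalized PageRank power iteration (assumed standard notation)
r^{(t+1)} = (1 - \alpha)\, P^{\top} r^{(t)} + \alpha\, e_s
```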
16
Iteration 0
We start by activating a single node: “Seattle”
[Graph figure: the “Seattle” vertex and its neighborhood, with edges labelled “related to”, “shot in”, “featured in”, and “cast”]
17
Iteration 1
With some probability,
we follow outbound
edges, otherwise we
go back to the origin.
18
Iteration 2
Vertex accumulates
higher mass
19
Iteration 2
And again, until
convergence
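A minimal single-machine sketch of these iterations in plain Scala (no Spark/GraphX). The names here (personalizedPageRank, adjacency, restartProb) are illustrative, not from the Netflix/GraphX implementation; dangling and restart mass is sent back to the seed, one common convention for personalized PageRank.

```scala
// Single-machine topic-sensitive (personalized) PageRank sketch,
// assuming the graph fits in memory as an adjacency list.
object TopicSensitivePageRankSketch {

  def personalizedPageRank(
      adjacency: Map[Int, Seq[Int]], // vertex -> outbound neighbors
      seed: Int,                     // the "origin" vertex we diffuse from
      restartProb: Double = 0.15,    // probability of jumping back to the seed
      iterations: Int = 50): Map[Int, Double] = {

    val vertices = (adjacency.keys ++ adjacency.values.flatten).toSet
    var rank = vertices.map(v => v -> (if (v == seed) 1.0 else 0.0)).toMap

    for (_ <- 1 to iterations) {
      // Mass propagated along outbound edges (the "follow an edge" case).
      val spread = scala.collection.mutable.Map[Int, Double]().withDefaultValue(0.0)
      for ((v, outs) <- adjacency; if outs.nonEmpty; u <- outs)
        spread(u) += (1.0 - restartProb) * rank(v) / outs.size
      // Restart probability and dangling vertices send their mass back to the seed.
      val lost = 1.0 - spread.values.sum
      rank = vertices.map(v => v -> (spread(v) + (if (v == seed) lost else 0.0))).toMap
    }
    rank
  }

  def main(args: Array[String]): Unit = {
    val graph = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0), 3 -> Seq(2))
    personalizedPageRank(graph, seed = 0).toSeq.sortBy(-_._2)
      .foreach { case (v, r) => println(f"vertex $v%d: $r%.3f") }
  }
}
```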
20
Latent Dirichlet Allocation
▪ Popular clustering /
latent factors model
▪ Uncollapsed Gibbs
sampler is fairly easy to
distribute
[Plate-diagram labels: per-topic word distributions, per-document topic distributions, and the topic label for document d and word w; the generative process is written out below]
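For reference, the standard LDA generative process behind those three attributes (assumed standard notation, not from the deck):

```latex
% Standard LDA generative model, K topics, symmetric priors \alpha and \beta
\theta_d \sim \mathrm{Dirichlet}(\alpha)            % per-document topic distribution
\phi_k   \sim \mathrm{Dirichlet}(\beta)             % per-topic word distribution
z_{d,n}  \sim \mathrm{Categorical}(\theta_d)        % topic label for slot n of document d
w_{d,n}  \sim \mathrm{Categorical}(\phi_{z_{d,n}})  % observed word
```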
21
Distributed Gibbs Sampler
[Figure: a distributed, parameterized bipartite graph for LDA with 3 topics; document vertices (d1, d2) and word vertices (w1, w2, w3) each carry a 3-dimensional attribute vector, with an edge wherever a word appears in a document]
22
Distributed Gibbs Sampler
[Figure: same graph; a categorical distribution for one (document, word) triplet is formed from the two vertex attributes]
23
Distributed Gibbs Sampler
[Figure: same graph; categorical distributions formed for all triplets]
24
Distributed Gibbs Sampler
[Figure: same graph; a topic is sampled for every edge from its categorical distribution]
25
Distributed Gibbs Sampler
[Figure: same graph; neighborhood aggregation builds a per-vertex topic histogram from the sampled edge topics]
26
Distributed Gibbs Sampler
[Figure: same graph; samples are realized from the resulting Dirichlet posteriors to update the vertex attributes. A single-machine sketch of this whole sweep follows below.]
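A minimal single-machine sketch of one such sweep in plain Scala, following the steps above: sample a topic per edge from the product of the two vertex attributes, aggregate per-vertex topic histograms, then resample from the Dirichlet posteriors. All names (theta, phiCol, edges, alpha, beta, ...) and the toy data are illustrative, not from the Netflix GraphX implementation.

```scala
import scala.util.Random

object UncollapsedGibbsSketch {
  val rng = new Random(42)
  val K = 3 // number of topics

  // Gamma sampling (Marsaglia-Tsang), used to draw from a Dirichlet.
  def sampleGamma(shape: Double): Double =
    if (shape < 1.0) sampleGamma(shape + 1.0) * math.pow(rng.nextDouble(), 1.0 / shape)
    else {
      val d = shape - 1.0 / 3.0; val c = 1.0 / math.sqrt(9.0 * d)
      var result = -1.0
      while (result < 0) {
        val x = rng.nextGaussian(); val v = math.pow(1.0 + c * x, 3)
        if (v > 0 && math.log(rng.nextDouble()) < 0.5 * x * x + d - d * v + d * math.log(v))
          result = d * v
      }
      result
    }

  def sampleDirichlet(params: Array[Double]): Array[Double] = {
    val g = params.map(sampleGamma); val s = g.sum; g.map(_ / s)
  }

  def sampleCategorical(weights: Array[Double]): Int = {
    val u = rng.nextDouble() * weights.sum
    var acc = 0.0; var k = 0
    while (acc + weights(k) < u && k < weights.length - 1) { acc += weights(k); k += 1 }
    k
  }

  def main(args: Array[String]): Unit = {
    val alpha = 0.5; val beta = 0.1
    val docs = Seq("d1", "d2"); val words = Seq("w1", "w2", "w3")
    // Edges: (document, word) pairs where the word appears in the document.
    val edges = Seq("d1" -> "w1", "d1" -> "w2", "d2" -> "w2", "d2" -> "w3")

    // Initial vertex attributes (arbitrary values for the sketch).
    var theta  = docs.map(d => d -> sampleDirichlet(Array.fill(K)(alpha))).toMap
    var phiCol = words.map(w => w -> Array.fill(K)(1.0 / words.size)).toMap

    // 1. For every edge, form the categorical distribution from the two
    //    vertex attributes and sample a topic label.
    val z = edges.map { case (d, w) =>
      (d, w, sampleCategorical(Array.tabulate(K)(k => theta(d)(k) * phiCol(w)(k))))
    }

    // 2. Neighborhood aggregation: per-vertex topic histograms.
    val docCounts = z.groupBy(_._1).map { case (d, es) =>
      d -> Array.tabulate(K)(k => es.count(_._3 == k).toDouble) }
    val wordCounts = z.groupBy(_._2).map { case (w, es) =>
      w -> Array.tabulate(K)(k => es.count(_._3 == k).toDouble) }

    // 3. Realize samples from the Dirichlet posteriors to update the graph.
    theta = docs.map(d =>
      d -> sampleDirichlet(docCounts.getOrElse(d, Array.fill(K)(0.0)).map(_ + alpha))).toMap
    // For the word side, resample each topic's distribution over the whole
    // vocabulary, then store its column back on the word vertices.
    val phiRows = Array.tabulate(K) { k =>
      sampleDirichlet(words.map(w => wordCounts.getOrElse(w, Array.fill(K)(0.0))(k) + beta).toArray)
    }
    phiCol = words.zipWithIndex.map { case (w, i) =>
      w -> Array.tabulate(K)(k => phiRows(k)(i)) }.toMap

    println("theta(d1) = " + theta("d1").mkString(", "))
  }
}
```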
27
Now, is it faster?
28
Topic-sensitive PageRank
▪ Distributed Spark/GraphX implementation
▪ Available in Spark 2.1
▪ Propagates multiple vertices at once
▪ Alternative implementation
▪ Single-threaded and single-machine for one source vertex
▪ Works on full graph adjacency
▪ Scala/Breeze, horizontally scaled with Spark to propagate multiple
vertices at once (sketched below)
▪ Dimension: number of vertices for which we compute a
ranking
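The alternative setup above distributes over source vertices rather than over the graph: the adjacency is broadcast once and each task runs an independent single-machine ranking for one seed. A rough sketch of that pattern, assuming Spark on the classpath and the hypothetical personalizedPageRank helper from the earlier sketch; this is not the actual Netflix code.

```scala
import org.apache.spark.sql.SparkSession

object ScaleOverSeedsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ppr-over-seeds").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val adjacency = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0), 3 -> Seq(2))
    val seeds = Seq(0, 1, 2, 3)

    val adjBc = sc.broadcast(adjacency)          // full adjacency shipped to every executor
    val ranks = sc.parallelize(seeds)            // one independent task per seed vertex
      .map(seed => seed -> TopicSensitivePageRankSketch.personalizedPageRank(adjBc.value, seed))
      .collectAsMap()

    println(ranks(0).toSeq.sortBy(-_._2).take(3))
    spark.stop()
  }
}
```

The appeal of this design is that there is no per-iteration shuffle: the only communication is shipping the adjacency once and collecting the final rankings, which is why its runtime grows roughly linearly with the number of seeds.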
29
Open-source DBpedia dataset
Sublinear rise in time with Spark/GraphX vs a linear rise in
the horizontally scaled version
Doubling the size of the cluster: 2.0x speedup in the
horizontally scaled version vs 1.2x in Spark/GraphX
30
Latent Dirichlet Allocation
▪ Distributed Spark/GraphX implementation
▪ Alternative implementation
▪ Single machine, multi-threaded Java code
▪ NOT horizontally scaled
▪ Dimension: training set size
31
Netflix dataset, number of topics = 100
Spark/GraphX setup: 8x the resources of the multi-core setup
Wikipedia dataset, 100-topic LDA
Cluster: 16 x r3.2xl (source: Databricks)
For very large datasets, Spark/GraphX outperforms multi-core
32
Other comparisons
▪ Frank McSherry’s blog post
comparing different distributed
PageRank implementations with a
single-threaded Rust
implementation on his laptop
▪ 1.5B edges for twitter_rv, 3.7B
for uk_2007_05
▪ “If you are going to use a big
data system for yourself, see if it
is faster than your laptop.”
33
Other comparisons
▪ GraphChi, a single-machine large-scale graph computation engine
developed at CMU, reports similar findings
34
Now, is it faster?
No, unless your problem or dataset is huge :(
35
To conclude...
▪ When distributing an algorithm, there are two opposing
forces:
▪ 1) Communication overhead (shifting data from node to node)
▪ 2) More raw computing power available
▪ Whether one overtakes the other depends on the size of your
problem
▪ Single-machine ML can be very efficient!
▪ Smarter algorithms can beat brute force
▪ Better data structures, input data formats, caching, optimization
algorithms, etc. can all make a huge difference
▪ Good core implementation is a prerequisite to distribution
▪ Easy to get large machines!
36
To conclude...
▪ However, distribution lets you easily throw more hardware
at a problem
▪ Also, some algorithms/methods are better than others at
minimizing the communication overhead
▪ Iterative distributed graph algorithms can be inefficient in that
respect
▪ Can your problem fit on a single machine?
▪ Can your problem be partitioned?
▪ For SGD-like algos, parameter servers can be used to distribute while
keeping this overhead to a minimum
37
Questions?
(Yes, we’re hiring)
Many thanks to @EhtshamElahi!
