
(Some) pitfalls of distributed learning


Talk given at the LSRS workshop, RecSys 2016



  1. LSRS 2016: (Some) pitfalls of distributed learning. Yves Raimond, Algorithms Engineering, Netflix
  2. Some background
  3. 2006
  4. (image-only slide)
  5. Netflix scale: ▪ > 83M members ▪ > 190 countries ▪ > 3B hours/month ▪ > 1000 device types ▪ 36% of peak US downstream traffic
  6. Recommendations @ Netflix
  7. Goal: help members find content to watch and enjoy, to maximize member satisfaction and retention
  8. (image-only slide)
  9. Two potential reasons to try distributed learning
  10. Reason 1: minimizing training time. (Timeline figure: collecting dataset, then training, then serving; the gap before Model 1 is served is the time-to-serve delay.)
  11. Training time vs online performance: ▪ Most (all?) recommendation algorithms need to predict future behavior from past information ▪ If model training takes days, it might miss important changes: new items being introduced, popularity swings, changes in underlying feature distributions ▪ Time-to-serve can therefore be a key component of how good the recommendations are online
  12. Training time vs experimentation speed: ▪ Faster training time => more offline experiments and iterations => better models ▪ Many other factors are at play (like the modularity of the ML framework), but training time is a key one ▪ How quickly can you iterate through, e.g., model architectures if training a model takes days?
  13. Reason 2: increasing dataset size ▪ If your model is complex enough (trees, DNNs, …), more data could help ▪ … but this will have an impact on training time ▪ … which in turn could have a negative impact on time-to-serve delay and experimentation speed ▪ Hard limits on what fits on a single machine
  14. Let’s distribute!
  15. Topic-sensitive PageRank: ▪ Popular graph diffusion algorithm ▪ Captures vertex importance with regard to a particular source vertex ▪ Easy to distribute using Spark and GraphX ▪ Fast distributed implementation contributed by Netflix (coming up in Spark 2.1!); a usage sketch follows
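As a pointer for readers, here is a minimal usage sketch. It assumes the staticParallelPersonalizedPageRank method that GraphX gained in Spark 2.1; the toy graph, vertex IDs, and iteration count are made up for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

def example(sc: SparkContext): Unit = {
  // Toy three-vertex cycle; in practice the edges come from your data.
  val edges = sc.parallelize(Seq(
    Edge(1L, 2L, 1.0), Edge(2L, 3L, 1.0), Edge(3L, 1L, 1.0)))
  val graph = Graph.fromEdges(edges, defaultValue = 1.0)

  // Rank every vertex with regard to several source vertices at once;
  // each vertex ends up holding a vector of scores, one per source.
  // Arguments: sources, number of iterations, reset probability.
  val sources: Array[VertexId] = Array(1L, 2L)
  val ranks = graph.staticParallelPersonalizedPageRank(sources, 20, 0.15)
  ranks.vertices.collect().foreach(println)
}
```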
  16. Iteration 0: we start by activating a single vertex, “Seattle”. (Graph figure; edges are labeled “related to”, “shot in”, “featured in”, and “cast”.)
  17. Iteration 1: with some probability we follow outbound edges; otherwise we go back to the origin.
  18. Iteration 2: the vertex accumulates more mass.
  19. And again, until convergence (a single-machine sketch of this walk follows).
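The walk described above fits in a few lines on one machine. A minimal sketch, assuming a simple Map-based adjacency in which every vertex appears as a key; all names here are illustrative:

```scala
// Start all mass at the source, then repeatedly either follow outbound
// edges (probability 1 - resetProb) or jump back to the origin (resetProb).
def personalizedPageRank(adj: Map[Int, Seq[Int]], // vertex -> outbound neighbors
                         numVertices: Int,
                         source: Int,
                         resetProb: Double = 0.15,
                         iterations: Int = 30): Array[Double] = {
  var rank = Array.fill(numVertices)(0.0)
  rank(source) = 1.0                          // iteration 0: activate the source
  for (_ <- 1 to iterations) {
    val next = Array.fill(numVertices)(0.0)
    next(source) += resetProb                 // restart mass returns to the origin
    for ((v, outs) <- adj) {
      if (outs.isEmpty)                       // dangling vertex: send mass home
        next(source) += (1 - resetProb) * rank(v)
      else                                    // spread mass over outbound edges
        for (u <- outs) next(u) += (1 - resetProb) * rank(v) / outs.size
    }
    rank = next
  }
  rank
}
```

A fixed iteration count stands in for a real convergence test; checking the L1 change between successive rank vectors would be the usual stopping rule.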
  20. Latent Dirichlet Allocation: ▪ Popular clustering / latent-factor model ▪ The uncollapsed Gibbs sampler is fairly easy to distribute. (Figure labels: per-topic word distributions; per-document topic distributions; topic label for document d and word w; the corresponding updates are spelled out below.)
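For reference, these are the standard uncollapsed Gibbs updates behind the walkthrough that follows; the symbols θ, φ, z and the symmetric priors α, β are conventional LDA notation rather than something given on the slide:

```latex
% Topic label for word w in document d, given the current theta and phi:
p(z_{d,w} = k \mid \theta_d, \phi) \propto \theta_{d,k} \, \phi_{k,w}

% Conjugate updates given all labels z, where n_{d,k} counts words in
% document d assigned to topic k and n_{k,v} counts assignments of
% word v to topic k:
\theta_d \mid z \sim \mathrm{Dirichlet}(\alpha + n_{d,1}, \dots, \alpha + n_{d,K})
\qquad
\phi_k \mid z \sim \mathrm{Dirichlet}(\beta + n_{k,1}, \dots, \beta + n_{k,V})
```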
  21. Distributed Gibbs sampler: a distributed, parameterized graph for LDA with 3 topics. Documents (d1, d2) and words (w1, w2, w3) are vertices, an edge connects a word to each document it appears in, and vertices carry per-topic weights.
  22. Distributed Gibbs sampler: build the categorical distribution for an edge triplet from its two vertices’ attributes.
  23. Distributed Gibbs sampler: categorical distributions for all triplets.
  24. Distributed Gibbs sampler: sample topics for all edges.
  25. Distributed Gibbs sampler: neighborhood aggregation for topic histograms.
  26. Distributed Gibbs sampler: realize samples from the Dirichlet posteriors to update the graph. (A single-machine sketch of this sweep follows.)
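Putting slides 22 through 26 together, one sweep of the sampler might look as follows on a single machine. This is a sketch, not the talk's implementation: the Token and sweep names are invented, and Breeze's Multinomial and Dirichlet distributions are used for the sampling (recent Breeze versions also want an implicit RandBasis in scope):

```scala
import breeze.linalg.DenseVector
import breeze.stats.distributions.{Dirichlet, Multinomial}

// One (doc, word) edge with its current topic label.
case class Token(doc: Int, word: Int, var topic: Int)

def sweep(tokens: Seq[Token],
          theta: Array[DenseVector[Double]], // per-document topic weights
          phi: Array[DenseVector[Double]],   // per-topic word weights
          alpha: Double, beta: Double): Unit = {
  val K = phi.length
  val V = phi(0).length
  // Sample a topic per edge: categorical with weights theta(d)(k) * phi(k)(w).
  for (t <- tokens) {
    val w = DenseVector.tabulate(K)(k => theta(t.doc)(k) * phi(k)(t.word))
    t.topic = Multinomial(w).draw()
  }
  // Neighborhood aggregation: per-document and per-topic histograms.
  val docCounts = Array.fill(theta.length)(DenseVector.zeros[Double](K))
  val wordCounts = Array.fill(K)(DenseVector.zeros[Double](V))
  for (t <- tokens) {
    docCounts(t.doc)(t.topic) += 1.0
    wordCounts(t.topic)(t.word) += 1.0
  }
  // Realize fresh theta and phi from their Dirichlet posteriors.
  for (d <- theta.indices) theta(d) = Dirichlet(docCounts(d).map(_ + alpha)).draw()
  for (k <- 0 until K) phi(k) = Dirichlet(wordCounts(k).map(_ + beta)).draw()
}
```

In the distributed version, the per-edge sampling and the histogram aggregation become GraphX triplet and aggregateMessages-style operations; the logic per edge and vertex stays the same.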
  27. Now, is it faster?
  28. Topic-sensitive PageRank: ▪ Distributed Spark/GraphX implementation: available in Spark 2.1, propagates multiple source vertices at once ▪ Alternative implementation: single-threaded and single-machine for one source vertex, working on the full graph adjacency; Scala/Breeze, horizontally scaled with Spark to propagate multiple source vertices at once (see the sketch below) ▪ Dimension of the comparison: the number of vertices for which we compute a ranking
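The "horizontally scaled" alternative is straightforward to express with Spark, reusing the single-machine personalizedPageRank sketch from earlier (again illustrative, not the talk's code): broadcast the adjacency once, then run one independent single-threaded computation per source.

```scala
import org.apache.spark.SparkContext

def rankMany(sc: SparkContext,
             adj: Map[Int, Seq[Int]],
             numVertices: Int,
             sources: Seq[Int]): Map[Int, Array[Double]] = {
  // Ship the full adjacency to every executor once...
  val adjB = sc.broadcast(adj)
  // ...then fan out: one embarrassingly parallel, single-threaded
  // computation per source vertex, with no cross-node chatter in between.
  sc.parallelize(sources)
    .map(src => src -> personalizedPageRank(adjB.value, numVertices, src))
    .collect()
    .toMap
}
```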
  29. Open-source DBpedia dataset: sublinear growth in runtime with Spark/GraphX vs linear growth in the horizontally scaled version. Doubling the cluster size yields a 2.0x speedup in the horizontally scaled version vs 1.2x in Spark/GraphX.
  30. Latent Dirichlet Allocation: ▪ Distributed Spark/GraphX implementation ▪ Alternative implementation: single-machine, multi-threaded Java code, NOT horizontally scaled ▪ Dimension of the comparison: training-set size
  31. Netflix dataset, 100 topics: the Spark/GraphX setup uses 8x the resources of the multi-core setup. Wikipedia dataset, 100-topic LDA on a 16 x r3.2xl cluster (source: Databricks): only for very large datasets does Spark/GraphX outperform multi-core.
  32. Other comparisons: ▪ Frank McSherry’s blog post compares different distributed PageRank implementations against a single-threaded Rust implementation on his laptop ▪ 1.5B edges for twitter_rv, 3.7B for uk_2007_05 ▪ “If you are going to use a big data system for yourself, see if it is faster than your laptop.”
  33. Other comparisons: ▪ GraphChi, a single-machine large-scale graph computation engine developed at CMU, reports similar findings
  34. Now, is it faster? No, unless your problem or dataset is huge :(
  35. To conclude... ▪ When distributing an algorithm, two opposing forces are at play: 1) communication overhead (moving data from node to node) and 2) more raw computing power; which one wins depends on the size of your problem ▪ Single-machine ML can be very efficient! Smarter algorithms can beat brute force: better data structures, input data formats, caching, optimization algorithms, etc. can all make a huge difference ▪ A good core implementation is a prerequisite to distribution ▪ It’s easy to get large machines!
  36. To conclude... ▪ However, distribution lets you easily throw more hardware at a problem ▪ Some algorithms/methods are also better than others at minimizing communication overhead; iterative distributed graph algorithms can be inefficient in that respect ▪ Can your problem fit on a single machine? Can it be partitioned? ▪ For SGD-like algorithms, parameter servers can be used to distribute training while keeping this overhead to a minimum (a sketch follows)
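To make the parameter-server point concrete, here is a minimal sketch of the pattern: workers pull a (possibly stale) copy of the weights, compute a gradient on their own data, and push only that gradient back, so each step moves two dense vectors rather than shuffling the dataset. A single-process toy with invented names, not any real parameter-server API:

```scala
class ParameterServer(dim: Int) {
  private val w = Array.fill(dim)(0.0)
  // Workers read a snapshot of the current weights...
  def pull(): Array[Double] = synchronized { w.clone() }
  // ...and send back only a gradient, applied here with learning rate lr.
  def push(grad: Array[Double], lr: Double): Unit = synchronized {
    for (i <- w.indices) w(i) -= lr * grad(i)
  }
}

// One worker's loop over its own partition of the data (squared loss).
def workerLoop(ps: ParameterServer,
               batch: Seq[(Array[Double], Double)], // (features, label)
               steps: Int, lr: Double): Unit =
  for (_ <- 1 to steps) {
    val w = ps.pull()
    val grad = Array.fill(w.length)(0.0)
    for ((x, y) <- batch) {
      val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
      for (i <- grad.indices) grad(i) += 2 * err * x(i) / batch.size
    }
    ps.push(grad, lr) // only the gradient crosses the network
  }
```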
  37. Questions? (Yes, we’re hiring.) Many thanks to @EhtshamElahi!
