
Splash
by Yuchen Zhang
Published in: Data & Analytics

  1. Splash: User-friendly Programming Interface for Parallelizing Stochastic Algorithms. Yuchen Zhang and Michael Jordan, AMP Lab, UC Berkeley. April 2015.
  4. Batch Algorithm vs. Stochastic Algorithm. Consider minimizing a loss function L(w) := (1/n) Σ_{i=1}^n ℓ_i(w). Gradient Descent: iteratively update w_{t+1} = w_t − η_t ∇L(w_t). Pros: easy to parallelize (via Spark). Cons: may need hundreds of iterations to converge. [Plot: loss vs. running time for Gradient Descent with 64 threads; logistic regression, 64 cores, 500k samples, 54 dimensions.]
  6. Batch Algorithm vs. Stochastic Algorithm. Stochastic Gradient Descent (SGD): randomly draw an index t, then update w_{t+1} = w_t − η_t ∇ℓ_t(w_t). Pros: much faster convergence. Cons: a sequential algorithm, difficult to parallelize. [Plot: loss vs. running time, Gradient Descent (64 threads) vs. Stochastic Gradient Descent; same logistic regression task.] A minimal code sketch contrasting the two update rules follows.
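
To make the contrast concrete, here is a minimal sketch in plain Scala (not Splash code) of one full gradient-descent step versus one SGD step for a one-dimensional least-squares loss; the object and method names are ours, chosen only for illustration.

    // Illustrative only: one batch GD step touches every sample; one SGD step touches a single
    // randomly drawn sample, which is why SGD is cheap per step but inherently sequential.
    object GdVsSgd {
      def gdStep(w: Double, xs: Array[Double], ys: Array[Double], eta: Double): Double = {
        // Full gradient: average over all n samples.
        val grad = xs.indices.map(i => 2.0 * xs(i) * (w * xs(i) - ys(i))).sum / xs.length
        w - eta * grad
      }

      def sgdStep(w: Double, xs: Array[Double], ys: Array[Double], eta: Double,
                  rng: scala.util.Random): Double = {
        // Stochastic gradient: one randomly drawn sample per step.
        val i = rng.nextInt(xs.length)
        w - eta * 2.0 * xs(i) * (w * xs(i) - ys(i))
      }
    }
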
  10. More Stochastic Algorithms. Convex optimization: Adaptive SGD (Duchi et al.); Stochastic Average Gradient method (Schmidt et al.); Stochastic Dual Coordinate Ascent (Shalev-Shwartz and Zhang). Probabilistic model inference: Markov chain Monte Carlo and Gibbs sampling; Expectation Propagation (Minka); Stochastic Variational Inference (Hoffman et al.). SGD variants for matrix factorization, learning neural networks, and learning denoising auto-encoders. How to parallelize these algorithms?
  14. First Attempt. After processing a subsequence of random samples, the single-thread algorithm applies an incremental update w ← w + ∆. The parallel algorithm runs m threads, each on 1/m of the samples: thread i computes w ← w + ∆_i, and the parallel updates are then aggregated as w ← w + ∆_1 + · · · + ∆_m. [Plot: loss vs. running time, single-thread SGD vs. parallel SGD with 64 threads.] This doesn't work for SGD! (A sketch of the failing pattern follows.)
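
For reference, a sketch in plain Scala of the naive pattern described above, assuming a one-dimensional least-squares objective; localDelta and naiveAggregate are our illustrative names, not Splash functions.

    // Illustrative sketch of the failed "first attempt": each thread runs SGD on its own shard
    // of the data and the local deltas are simply summed at the end.
    object NaiveParallelSgd {
      // Run plain SGD on one shard starting from w0 and return the accumulated delta.
      def localDelta(shard: Array[(Double, Double)], w0: Double, eta: Double): Double = {
        var w = w0
        for ((x, y) <- shard) w -= eta * 2.0 * x * (w * x - y)
        w - w0
      }

      // In the real pattern each delta would come from a separate thread; summing them means all
      // m deltas push the same variable at once, so the combined step is roughly m times too large.
      def naiveAggregate(shards: Seq[Array[(Double, Double)]], w0: Double, eta: Double): Double =
        w0 + shards.map(localDelta(_, w0, eta)).sum
    }
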
  18. Conflicts in Parallel Updates. Reason for the failure: ∆_1, . . . , ∆_m manipulate the same variable w simultaneously, causing conflicts among the parallel updates. How to resolve the conflicts: (1) frequent communication between threads: a general approach to resolving conflicts, but inter-node (asynchronous) communication is expensive; (2) carefully partition the data so that threads never manipulate the same variable simultaneously: no frequent communication is needed, but this requires problem-specific partitioning schemes and only works for a subset of problems.
  22. Splash: A Principled Solution. Splash is a programming interface for developing stochastic algorithms, and an execution engine for running stochastic algorithms on distributed systems. Features of Splash: Easy programming: users develop single-thread algorithms via Splash, with no communication protocol, no conflict management, no data partitioning, and no hyper-parameter tuning. Fast performance: Splash adopts a novel strategy for automatic parallelization with infrequent communication, so communication is no longer a performance bottleneck. Integration with Spark: Splash takes an RDD as input and returns an RDD as output, and works with KeystoneML, MLlib and other data-analysis tools on Spark.
  23. Programming Interface
  24. Programming with Splash. Splash users implement the following function:

    def process(sample: Any, weight: Int, var: VariableSet) {
      /* implement the stochastic algorithm */
    }

where sample is a random sample from the dataset, weight means the sample should be treated as if it were duplicated weight times, and var is the set of all shared variables.
  27. Example: SGD for Linear Regression. Goal: find w* = arg min_w (1/n) Σ_{i=1}^n (w·x_i − y_i)². SGD update: randomly draw (x_i, y_i), then w ← w − η ∇_w (w·x_i − y_i)². Splash implementation:

    def process(sample: Any, weight: Int, var: VariableSet) {
      val stepsize = var.get("eta") * weight
      val gradient = sample.x * (var.get("w") * sample.x - sample.y)
      var.add("w", -stepsize * gradient)
    }

Supported operations: get, add, multiply, delayedAdd. A self-contained version of this example is sketched below.
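
Since the slide code is schematic (for one thing, var is a reserved word in Scala), here is a self-contained sketch of the same update against a hypothetical in-memory SharedVariables stand-in; this is not the real Splash VariableSet API, only an illustration of what get and add do in this example.

    // Hypothetical stand-in for the shared-variable semantics shown on the slide.
    object LinearRegressionSketch {
      class SharedVariables {
        private val table = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
        def get(key: String): Double = table(key)
        def add(key: String, delta: Double): Unit = table(key) += delta
      }

      case class Sample(x: Double, y: Double)

      // Same update as the slide: one SGD step on (w*x - y)^2, with the step size scaled by `weight`.
      def process(sample: Sample, weight: Int, vars: SharedVariables): Unit = {
        val stepsize = vars.get("eta") * weight
        val gradient = sample.x * (vars.get("w") * sample.x - sample.y)
        vars.add("w", -stepsize * gradient)
      }
    }
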
  28. Get Operations. Get the value of a variable (Double or Array[Double]): get(key) returns var[key]; getArray(key) returns varArray[key]; getArrayElement(key, index) returns varArray[key][index]; getArrayElements(key, indices) returns varArray[key][indices]. Array-based operations are more efficient than element-wise operations, because the key-value retrieval is executed only once per array operation.
  29. Add Operations. Add a quantity δ to a variable: add(key, delta): var[key] += delta; addArray(key, deltaArray): varArray[key] += deltaArray; addArrayElement(key, index, delta): varArray[key][index] += delta; addArrayElements(key, indices, deltaArrayElements): varArray[key][indices] += deltaArrayElements.
  31. Multiply Operations. Multiply a variable by a quantity γ: multiply(key, gamma): var[key] *= gamma; multiplyArray(key, gamma): varArray[key] *= gamma. The implementation is optimized so that multiplyArray runs in O(1) time, independent of the array dimension. Example: SGD with sparse features and ℓ2-norm regularization: (1) w ← (1 − λ)·w (multiply operation); (2) w ← w − η ∇f(w) (addArrayElements operation). The time complexity of (1) is O(1); the time complexity of (2) is O(nnz(∇f(w))). One way such an O(1) multiply can be realized is sketched below.
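
The slides do not say how the O(1) multiply is implemented; a common trick, shown here purely as a guess in plain Scala, is to keep a scalar scale factor next to the raw array so that multiplying the whole array only updates the scalar. ScaledArray is our illustrative name, not Splash code.

    // Hypothetical sketch: the logical value of element i is scale * raw(i).
    class ScaledArray(n: Int) {
      private val raw = Array.fill(n)(0.0)
      private var scale = 1.0

      def multiplyAll(gamma: Double): Unit = scale *= gamma                   // O(1), assumes gamma != 0
      def get(i: Int): Double = scale * raw(i)                                // fold the scalar in on read
      def addElement(i: Int, delta: Double): Unit = raw(i) += delta / scale   // O(1) per nonzero entry
    }

This matches the sparse-regularization example above: the multiply step w ← (1 − λ)·w is one scalar update, while the gradient step only touches the nonzero coordinates.
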
  33. Delayed Add Operations. Add a quantity δ to a variable, but the operation is not executed until the next time the same sample is processed by the system: delayedAdd(key, delta): var[key] += delta; delayedAddArray(key, deltaArray): varArray[key] += deltaArray; delayedAddArrayElement(key, index, delta): varArray[key][index] += delta. Example: collapsed Gibbs sampling for LDA, where the word-topic counter n_wk is updated when topic k is assigned to word w: (3) n_wk ← n_wk + weight (add operation); (4) n_wk ← n_wk − weight (delayed add operation). (3) is executed instantly; (4) is executed the next time, just before a new topic is sampled for the same word. The deferred-execution semantics are sketched below.
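
To pin down the semantics, here is a toy sketch in plain Scala of an engine that stores delayed deltas per sample and applies them just before that sample is processed again; this is a hypothetical illustration, not Splash internals.

    // Hypothetical illustration of delayed-add semantics.
    class DelayedAddEngine {
      private val values  = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
      private val pending = scala.collection.mutable.Map[Int, List[(String, Double)]]().withDefaultValue(Nil)

      def add(key: String, delta: Double): Unit = values(key) += delta
      def delayedAdd(sampleId: Int, key: String, delta: Double): Unit =
        pending(sampleId) = (key, delta) :: pending(sampleId)

      // The engine calls this right before sample `sampleId` is processed again.
      def flushBefore(sampleId: Int): Unit = {
        pending(sampleId).foreach { case (k, d) => values(k) += d }
        pending(sampleId) = Nil
      }

      def get(key: String): Double = values(key)
    }

In the LDA example, process() would add +weight to the counter of the newly sampled topic and register a delayed add of -weight on the same counter, so the old assignment is removed exactly when that token is about to be resampled.
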
  36. Running a Stochastic Algorithm. Three simple steps: (1) convert the RDD dataset to a Parametrized RDD: val paramRdd = new ParametrizedRDD(rdd); (2) set the function that implements the algorithm: paramRdd.setProcessFunction(process); (3) start running: paramRdd.run(). The three steps are combined in the sketch below.
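
Putting the three steps together, a usage sketch: ParametrizedRDD, setProcessFunction and run are the calls named on the slides, while sc, the input path, and parseSample are assumptions added here for completeness.

    // Assumes a SparkContext `sc`, a hypothetical parseSample parser, and the process function above.
    val rdd = sc.textFile("data.txt").map(parseSample)
    val paramRdd = new ParametrizedRDD(rdd)      // 1. wrap the dataset
    paramRdd.setProcessFunction(process)         // 2. register the single-thread algorithm
    paramRdd.run()                               // 3. let the engine parallelize it
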
  37. Execution Engine
  45. How does Splash work? In each iteration, the execution engine first proposes candidate degrees of parallelism m_1, . . . , m_k such that Σ_{i=1}^k m_i = m := (# of cores). For example: iteration 1: candidates = 1, 4, 16, 64, 171; iteration 2: candidates = 4, 16, 64, 172; iteration 3: candidates = 64, 192; iteration 4: candidates = 64, 192; iteration 5: candidates = 64, 192; iteration 6: candidates = 256; iteration 7: candidates = 256; and so on.
  50. How does Splash work? In each iteration, the execution engine: (1) proposes candidate degrees of parallelism m_1, . . . , m_k such that Σ_{i=1}^k m_i = m := (# of cores), and for each i ∈ [k] collects m_i cores and does the following: (a) each core gets a sub-sequence of samples (by default 1/m of the full data) and processes them sequentially using the process function, with every sample weighted by m_i; (b) the updates of all m_i cores are combined into a global update; there are different strategies for combining different types of updates, and for add operations the updates are averaged. (2) If k > 1, the best m_i is selected by a parallel cross-validation procedure. (3) The best update is broadcast to all machines and applied, and the engine proceeds to the next iteration. The combine step for add-type updates is sketched below.
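
As a reading of step (b), here is a small sketch in plain Scala of averaging add-type updates across the m_i cores; representing a local update as a key-to-delta map is our assumption, not Splash's internal data structure.

    // Each core reports the total delta it added to each shared variable; the engine averages them.
    // A core that never touched a key implicitly contributes a zero delta for it.
    def combineAddUpdates(localDeltas: Seq[Map[String, Double]]): Map[String, Double] = {
      val m = localDeltas.size.toDouble
      localDeltas
        .flatten                       // all (key, delta) pairs from all cores
        .groupBy(_._1)                 // group the deltas per shared variable
        .map { case (key, kvs) => key -> kvs.map(_._2).sum / m }
    }
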
  54. Example: Reweighting for SGD. [Figure: a 2-D example comparing (a) the optimal solution, (b) the solution with the full update, (c) local solutions with unit-weight updates, (d) the average of the local solutions in (c), (e) the aggregate of the local solutions in (c), (f) local solutions with weighted updates, and (g) the average of the local solutions in (f).] A toy 1-D simulation of the same effect follows.
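
The following toy simulation (plain Scala, written for this transcript, not Splash output) mimics the figure in one dimension: m workers each run SGD on 1/m of a least-squares dataset, and the average of the unit-weight local solutions tends to undershoot the optimum, while scaling the step size by the weight m, as in the linear-regression example earlier, brings the averaged solution close to it. The data sizes, step size, and names are arbitrary choices.

    object ReweightingToy {
      // Plain SGD over one shard; `weight` scales the step size, as in the slide's process function.
      def localSgd(shard: Array[(Double, Double)], eta: Double, weight: Int): Double = {
        var w = 0.0
        for ((x, y) <- shard) w -= eta * weight * 2.0 * x * (w * x - y)
        w
      }

      def main(args: Array[String]): Unit = {
        val rng = new scala.util.Random(0)
        val n = 4000; val m = 8
        val data = Array.fill(n) { val x = rng.nextGaussian(); (x, 3.0 * x + 0.1 * rng.nextGaussian()) }
        val shards = data.grouped(n / m).toArray
        val eta = 0.001
        val unitAvg     = shards.map(localSgd(_, eta, 1)).sum / m  // tends to undershoot w* = 3
        val weightedAvg = shards.map(localSgd(_, eta, m)).sum / m  // tends to land near w* = 3
        println(f"unit-weight average: $unitAvg%.3f, weighted average: $weightedAvg%.3f (target 3.0)")
      }
    }
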
  55. Machine Learning Package
  56. Stochastic Machine Learning Library on Splash. Goals: fast performance (an order of magnitude faster than MLlib); ease of use (call with one line of code); integration (easy to build a data-analytics pipeline). Algorithms: stochastic gradient descent; stochastic matrix factorization; Gibbs sampling for LDA. More algorithms will be implemented in the future.
  57. Experiment Setup. System: an Amazon EC2 cluster with 8 workers; each worker has 8 Intel Xeon E5-2665 cores and 30 GB of memory, and is connected to a commodity 1GB network. Algorithms: SGD for logistic regression; mini-batch SGD for collaborative filtering; Gibbs sampling for topic modelling. Datasets: MNIST 8M (LR): 8 million samples, 7,840 parameters; Netflix (CF): 100 million samples, 65 million parameters; NYTimes (LDA): 100 million samples, 200 million parameters. Baseline methods: the single-thread stochastic algorithm; MLlib (the official machine learning library for Spark).
  58. Logistic Regression on MNIST Digit Recognition. [Plots: loss vs. runtime for Splash (SGD), single-thread SGD, and MLlib (L-BFGS); speedup rate over single-thread SGD and over MLlib (L-BFGS) at several loss values.] Splash converges to a good solution in a few seconds, while the other methods take hundreds of seconds. Splash is 10x-25x faster than single-thread SGD and 15x-30x faster than parallelized L-BFGS.
  59. Netflix Movie Recommendation. [Plot: prediction loss vs. runtime for Splash (SGD), single-thread SGD, and MLlib (ALS).] Splash is 36x faster than parallelized Alternating Least Squares (ALS), and it converges to a better solution than ALS (the problem is non-convex).
  60. Topic Modelling on New York Times Articles. [Plot: predictive log-likelihood vs. runtime for Splash (Gibbs), single-thread Gibbs sampling, and MLlib (VI).] Splash is 3x-6x faster than parallelized Variational Inference (VI), and it converges to a better solution than VI.
  61. Runtime Analysis. [Bar chart: computation, waiting, and communication time per pass for MNIST 8M (LR), Netflix (CF), and NYTimes (LDA).] Waiting time is 16%, 21%, and 26% of the computation time on the three tasks, respectively; communication time is 6%, 39%, and 103% of the computation time.
  62. Summary. Splash is a general-purpose programming interface for developing stochastic algorithms, and an execution engine for automatically parallelizing stochastic algorithms. Reweighting is the key to achieving fast performance without sacrificing communication efficiency. We observe good empirical performance, and we have theoretical guarantees for SGD. Splash is online at http://zhangyuc.github.io/splash/.
