
Virginia Smith, Researcher, UC Berkeley at MLconf SF 2016


A General Framework for Communication-Efficient Distributed Optimization: Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In light of this, we propose a general framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. Our framework enjoys strong convergence guarantees and exhibits state-of-the-art empirical performance in the distributed setting. We demonstrate this performance with extensive experiments in Apache Spark, achieving speedups of up to 50x compared to leading distributed methods on common machine learning objectives.



  1. 1. CoCoA A General Framework for Communication-Efficient Distributed Optimization Virginia Smith Simone Forte ⋅ Chenxin Ma ⋅ Martin Takac ⋅ Martin Jaggi ⋅ Michael I. Jordan
  2. 2. Machine Learning with Large Datasets
  3. 3. Machine Learning with Large Datasets
  4. 4. Machine Learning with Large Datasets: image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection
  5. 5. Machine Learning Workflow
  6. 6. Machine Learning Workflow DATA & PROBLEM classification, regression, collaborative filtering, …
  7. 7. Machine Learning Workflow DATA & PROBLEM classification, regression, collaborative filtering, … MACHINE LEARNING MODEL logistic regression, lasso, support vector machines, …
  8. 8. Machine Learning Workflow DATA & PROBLEM classification, regression, collaborative filtering, … MACHINE LEARNING MODEL logistic regression, lasso, support vector machines, … OPTIMIZATION ALGORITHM gradient descent, coordinate descent, Newton’s method, …
  9. 9. Machine Learning Workflow DATA & PROBLEM classification, regression, collaborative filtering, … MACHINE LEARNING MODEL logistic regression, lasso, support vector machines, … OPTIMIZATION ALGORITHM gradient descent, coordinate descent, Newton’s method, …
  10. 10. Machine Learning Workflow DATA & PROBLEM classification, regression, collaborative filtering, … MACHINE LEARNING MODEL logistic regression, lasso, support vector machines, … OPTIMIZATION ALGORITHM gradient descent, coordinate descent, Newton’s method, … SYSTEMS SETTING multi-core, cluster, cloud, supercomputer, …
  11. 11. Machine Learning Workflow DATA & PROBLEM classification, regression, collaborative filtering, … MACHINE LEARNING MODEL logistic regression, lasso, support vector machines, … OPTIMIZATION ALGORITHM gradient descent, coordinate descent, Newton’s method, … SYSTEMS SETTING multi-core, cluster, cloud, supercomputer, … Open Problem: efficiently solving objective when data is distributed
  12. 12. Distributed Optimization
  13. 13. Distributed Optimization
  14. 14. Distributed Optimization reduce: $w = w - \alpha \sum_k \Delta w_k$
  15. 15. Distributed Optimization reduce: $w = w - \alpha \sum_k \Delta w_k$
  16. 16. Distributed Optimization “always communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  17. 17. Distributed Optimization ✔ convergence guarantees “always communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  18. 18. Distributed Optimization ✔ convergence guarantees ✗ high communication “always communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  19. 19. Distributed Optimization ✔ convergence guarantees ✗ high communication “always communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  20. 20. Distributed Optimization ✔ convergence guarantees ✗ high communication average: $w := \tfrac{1}{K} \sum_k w_k$ “always communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  21. 21. Distributed Optimization ✔ convergence guarantees ✗ high communication average: $w := \tfrac{1}{K} \sum_k w_k$ “always communicate” “never communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  22. 22. Distributed Optimization ✔ convergence guarantees ✗ high communication ✔ low communication average: $w := \tfrac{1}{K} \sum_k w_k$ “always communicate” “never communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
  23. 23. Distributed Optimization ✔ convergence guarantees ✗ high communication ✔ low communication ✗ convergence not guaranteed average: $w := \tfrac{1}{K} \sum_k w_k$ “always communicate” “never communicate” reduce: $w = w - \alpha \sum_k \Delta w_k$
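To make the two regimes above concrete, here is a minimal NumPy sketch (mine, not from the talk) of the “always communicate” reduce $w = w - \alpha \sum_k \Delta w_k$ versus the “never communicate” one-shot average $w := \tfrac{1}{K} \sum_k w_k$, assuming squared loss; all function and variable names are hypothetical.

```python
import numpy as np

def local_gradient(X_k, y_k, w):
    """Squared-loss gradient on one worker's partition (its local update delta_w_k)."""
    return X_k.T @ (X_k @ w - y_k) / len(y_k)

def always_communicate(parts, w, alpha=0.1, rounds=100):
    """Reduce every iteration: w = w - alpha * sum_k delta_w_k."""
    for _ in range(rounds):
        w = w - alpha * sum(local_gradient(X_k, y_k, w) for X_k, y_k in parts)
    return w

def never_communicate(parts, w, alpha=0.1, rounds=100):
    """Each worker solves locally, then a single average: w := (1/K) * sum_k w_k."""
    w_ks = []
    for X_k, y_k in parts:
        w_k = w.copy()
        for _ in range(rounds):
            w_k -= alpha * local_gradient(X_k, y_k, w_k)
        w_ks.append(w_k)
    return sum(w_ks) / len(w_ks)
```

The first converges to the global solution but pays one reduce per step; the second communicates only once but, as the slide notes, has no convergence guarantee to the true optimum.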
  24. 24. Mini-batch
  25. 25. Mini-batch
  26. 26. Mini-batch reduce: $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
  27. 27. Mini-batch reduce: $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
  28. 28. Mini-batch ✔ convergence guarantees reduce: $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
  29. 29. Mini-batch ✔ convergence guarantees ✔ tunable communication reduce: $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
  30. 30. Mini-batch ✔ convergence guarantees ✔ tunable communication a natural middle-ground reduce: $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
  31. 31. Mini-batch ✔ convergence guarantees ✔ tunable communication a natural middle-ground reduce: $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$
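And the middle ground on this slide, again as a hypothetical NumPy sketch: one mini-batch step $w = w - \tfrac{\alpha}{|b|} \sum_{i \in b} \Delta w_i$, where the batch size $|b|$ tunes how much work happens between reduces.

```python
import numpy as np

def minibatch_step(X, y, w, batch, alpha=0.1):
    """One reduce: w = w - (alpha/|b|) * sum_{i in b} delta_w_i (squared loss)."""
    delta = np.zeros_like(w)
    for i in batch:
        delta += (X[i] @ w - y[i]) * X[i]   # per-example gradient at the *current* w
    return w - (alpha / len(batch)) * delta
```

Every gradient in the batch is computed at the same stale w, and the sum is averaged over the batch size, which is exactly the pair of limitations the next slides call out.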
  32. 32. Mini-batch Limitations
  33. 33. Mini-batch Limitations 1. ONE-OFF METHODS
  34. 34. Mini-batch Limitations 1. ONE-OFF METHODS 2. STALE UPDATES
  35. 35. Mini-batch Limitations 1. ONE-OFF METHODS 2. STALE UPDATES 3. AVERAGE OVER BATCH SIZE
  36. 36. LARGE-SCALE OPTIMIZATION CoCoA ProxCoCoA+
  37. 37. LARGE-SCALE OPTIMIZATION CoCoA ProxCoCoA+
  38. 38. Mini-batch Limitations 1. ONE-OFF METHODS 2. STALE UPDATES 3. AVERAGE OVER BATCH SIZE
  39. 39. Mini-batch Limitations 1. ONE-OFF METHODS → Primal-Dual Framework 2. STALE UPDATES → Immediately apply local updates 3. AVERAGE OVER BATCH SIZE → Average over K << batch size
  40. 40. 1. ONE-OFF METHODS → Primal-Dual Framework 2. STALE UPDATES → Immediately apply local updates 3. AVERAGE OVER BATCH SIZE → Average over K << batch size CoCoA-v1
  41. 41. 1. Primal-Dual Framework PRIMAL ≥ DUAL
  42. 42. 1. Primal-Dual Framework PRIMAL ≥ DUAL $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$
  43. 43. 1. Primal-Dual Framework PRIMAL ≥ DUAL $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$ $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  44. 44. 1. Primal-Dual Framework PRIMAL ≥ DUAL Stopping criteria given by duality gap $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$ $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  45. 45. 1. Primal-Dual Framework PRIMAL ≥ DUAL Stopping criteria given by duality gap Good performance in practice $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$ $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  46. 46. 1. Primal-Dual Framework PRIMAL ≥ DUAL Stopping criteria given by duality gap Good performance in practice Default in software packages e.g. liblinear $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$ $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  47. 47. 1. Primal-Dual Framework PRIMAL ≥ DUAL Stopping criteria given by duality gap Good performance in practice Default in software packages e.g. liblinear Dual separates across machines (one dual variable per datapoint) $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$ $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
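A sketch of this primal-dual pair for one concrete choice of loss: squared loss, where $\ell_i(z) = \tfrac{1}{2}(z - y_i)^2$ and $\ell_i^*(u) = \tfrac{1}{2}u^2 + u y_i$. The code (plain NumPy, my own names, not the CoCoA implementation) illustrates the duality-gap stopping criterion $P(w(\alpha)) - D(\alpha)$.

```python
import numpy as np

def primal(w, X, y, lam):
    # P(w) = lam/2 * ||w||^2 + (1/n) * sum_i l_i(w^T x_i), squared loss
    return 0.5 * lam * (w @ w) + 0.5 * np.mean((X @ w - y) ** 2)

def dual(alpha, X, y, lam):
    # D(alpha) = -lam/2 * ||A alpha||^2 - (1/n) * sum_i l_i*(-alpha_i),
    # with A_i = x_i / (lam * n), so w(alpha) = A alpha
    n = len(y)
    w = X.T @ alpha / (lam * n)
    return -0.5 * lam * (w @ w) - np.mean(0.5 * alpha ** 2 - alpha * y)

def duality_gap(alpha, X, y, lam):
    """Certifiable stopping criterion: the gap is nonnegative and shrinks to 0 at the optimum."""
    n = len(y)
    w = X.T @ alpha / (lam * n)           # primal point induced by the dual iterate
    return primal(w, X, y, lam) - dual(alpha, X, y, lam)
```

For example, an outer loop could stop once duality_gap(alpha, X, y, lam) drops below a chosen tolerance.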
  48. 48. 1. Primal-Dual Framework
  49. 49. 1. Primal-Dual Framework Global objective:
  50. 50. 1. Primal-Dual Framework Global objective: $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  51. 51. 1. Primal-Dual Framework Global objective: $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  52. 52. 1. Primal-Dual Framework Global objective: $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$ Local objective:
  53. 53. 1. Primal-Dual Framework Global objective: $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$ Local objective: $\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} -\tfrac{1}{n}\sum_{i \in P_k} \ell_i^*(-\alpha_i - (\Delta\alpha_{[k]})_i) - \tfrac{1}{n} w^T A\,\Delta\alpha_{[k]} - \tfrac{\lambda}{2}\big\|\tfrac{1}{\lambda n} A\,\Delta\alpha_{[k]}\big\|^2$
  54. 54. 1. Primal-Dual Framework Global objective: $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$ Local objective: $\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} -\tfrac{1}{n}\sum_{i \in P_k} \ell_i^*(-\alpha_i - (\Delta\alpha_{[k]})_i) - \tfrac{1}{n} w^T A\,\Delta\alpha_{[k]} - \tfrac{\lambda}{2}\big\|\tfrac{1}{\lambda n} A\,\Delta\alpha_{[k]}\big\|^2$
  55. 55. 1. Primal-Dual Framework Global objective: $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$ Local objective: $\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} -\tfrac{1}{n}\sum_{i \in P_k} \ell_i^*(-\alpha_i - (\Delta\alpha_{[k]})_i) - \tfrac{1}{n} w^T A\,\Delta\alpha_{[k]} - \tfrac{\lambda}{2}\big\|\tfrac{1}{\lambda n} A\,\Delta\alpha_{[k]}\big\|^2$ Can solve the local objective using any internal optimization method
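One possible local solver for the subproblem above, to illustrate the "any internal optimization method" point: a few epochs of randomized dual coordinate ascent (SDCA-style, closed-form steps for squared loss) over the worker's own partition $P_k$, applying each update to a local copy of $w$ immediately. This is a hypothetical sketch with my own interface and names, not the CoCoA reference code; any method that returns a good enough approximate subproblem solution could be plugged in.

```python
import numpy as np

def local_sdca(X_k, y_k, alpha_k, w, lam, n, epochs=1, rng=np.random.default_rng()):
    """Approximately solve the local dual subproblem on partition P_k (squared loss).

    n is the *global* number of training points; returns (delta_alpha_k, delta_w_k).
    """
    w_local = w.copy()
    delta_alpha = np.zeros_like(alpha_k)
    for _ in range(epochs):
        for i in rng.permutation(len(y_k)):
            x_i = X_k[i]
            a_i = alpha_k[i] + delta_alpha[i]
            # closed-form optimal coordinate step for squared loss
            step = (y_k[i] - x_i @ w_local - a_i) / (1.0 + x_i @ x_i / (lam * n))
            delta_alpha[i] += step
            w_local += step * x_i / (lam * n)   # keep w_local consistent with the updated duals
    return delta_alpha, w_local - w
```

Applying each step to w_local right away is the "fresh updates" idea spelled out on the next slides.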
  56. 56. 2. Immediately Apply Updates
  57. 57. 2. Immediately Apply Updates STALE: for i ∈ b: Δw ← Δw − α ∇_i P(w); end; w ← w + Δw
  58. 58. 2. Immediately Apply Updates STALE: for i ∈ b: Δw ← Δw − α ∇_i P(w); end; w ← w + Δw FRESH: for i ∈ b: Δw ← −α ∇_i P(w); w ← w + Δw; end
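The stale/fresh contrast from this slide, written out as a hypothetical NumPy sketch (squared-loss updates, my names): the stale version accumulates all updates at the old $w$ and applies them at the end, while the fresh version folds each update into $w$ before computing the next one.

```python
import numpy as np

def stale_pass(X, y, w, batch, alpha=0.1):
    delta_w = np.zeros_like(w)
    for i in batch:
        delta_w -= alpha * (X[i] @ w - y[i]) * X[i]   # every gradient sees the old w
    return w + delta_w

def fresh_pass(X, y, w, batch, alpha=0.1):
    w = w.copy()
    for i in batch:
        w -= alpha * (X[i] @ w - y[i]) * X[i]         # each gradient sees the latest w
    return w
```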
  59. 59. 3. Average over K
  60. 60. 3. Average over K reduce: $w = w + \tfrac{1}{K} \sum_k \Delta w_k$
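Putting the three ingredients together, a sketch of one outer round in the CoCoA-v1 pattern, reusing the hypothetical local_sdca sketch from above: broadcast $w$, let each of the $K$ workers approximately solve its local subproblem with fresh updates, then combine with the averaging reduce $w = w + \tfrac{1}{K} \sum_k \Delta w_k$. The sequential for-loop stands in for the parallel map.

```python
def cocoa_v1_round(parts, alphas, w, lam, n, local_solver):
    """parts[k] = (X_k, y_k); alphas[k] = alpha_[k]; returns the new global w."""
    K = len(parts)
    delta_ws = []
    for k, (X_k, y_k) in enumerate(parts):          # executed in parallel in practice
        d_alpha, d_w = local_solver(X_k, y_k, alphas[k], w, lam, n)
        alphas[k] += d_alpha / K                    # scale dual updates to match the averaged w
        delta_ws.append(d_w)
    return w + sum(delta_ws) / K                    # reduce: w = w + (1/K) * sum_k delta_w_k
```

Calling cocoa_v1_round repeatedly and monitoring the duality gap from the earlier sketch gives a full, single-machine illustration of the loop.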
  61. 61. COCOA-v1 Limitations
  62. 62. Can we avoid having to average the partial solutions? COCOA-v1 Limitations
  63. 63. Can we avoid having to average the partial solutions? ✔ COCOA-v1 Limitations
  64. 64. Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith, et al., ICML ’15] COCOA-v1 Limitations
  65. 65. Can we avoid having to average the partial solutions? L1-regularized objectives not covered in initial framework ✔ [CoCoA+, Ma & Smith, et al., ICML ’15] COCOA-v1 Limitations
  66. 66. Can we avoid having to average the partial solutions? L1-regularized objectives not covered in initial framework ✔ [CoCoA+, Ma & Smith, et al., ICML ’15] COCOA-v1 Limitations
  67. 67. LARGE-SCALE OPTIMIZATION CoCoA ProxCoCoA+
  68. 68. LARGE-SCALE OPTIMIZATION CoCoA ProxCoCoA+
  69. 69. L1 Regularization
  70. 70. L1 Regularization Encourages sparse solutions
  71. 71. L1 Regularization Encourages sparse solutions Includes popular models:
 - lasso regression
 - sparse logistic regression
 - elastic net-regularized problems
  72. 72. L1 Regularization Encourages sparse solutions Includes popular models:
 - lasso regression
 - sparse logistic regression
 - elastic net-regularized problems Beneficial to distribute by feature
  73. 73. L1 Regularization Encourages sparse solutions Includes popular models:
 - lasso regression
 - sparse logistic regression
 - elastic net-regularized problems Beneficial to distribute by feature Can we map this to the CoCoA setup?
  74. 74. Solution: Solve Primal Directly
  75. 75. Solution: Solve Primal Directly PRIMAL ≥ DUAL $\min_{w \in \mathbb{R}^d} \big[ P(w) := \tfrac{\lambda}{2}\|w\|^2 + \tfrac{1}{n}\sum_{i=1}^n \ell_i(w^T x_i) \big]$ $\max_{\alpha \in \mathbb{R}^n} \big[ D(\alpha) := -\tfrac{\lambda}{2}\|A\alpha\|^2 - \tfrac{1}{n}\sum_{i=1}^n \ell_i^*(-\alpha_i) \big]$, with $A_i = \tfrac{1}{\lambda n} x_i$
  76. 76. Solution: Solve Primal Directly PRIMAL ≥ DUAL $\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i)$ $\min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n \ell_i^*(-x_i^T w)$
  77. 77. Solution: Solve Primal Directly PRIMAL ≥ DUAL Stopping criteria given by duality gap $\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i)$ $\min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n \ell_i^*(-x_i^T w)$
  78. 78. Solution: Solve Primal Directly PRIMAL ≥ DUAL Stopping criteria given by duality gap Good performance in practice $\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i)$ $\min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n \ell_i^*(-x_i^T w)$
  79. 79. Solution: Solve Primal Directly PRIMAL ≥ DUAL Stopping criteria given by duality gap Good performance in practice Default in software packages e.g. glmnet $\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i)$ $\min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n \ell_i^*(-x_i^T w)$
  80. 80. Solution: Solve Primal Directly PRIMAL ≥ DUAL Stopping criteria given by duality gap Good performance in practice Default in software packages e.g. glmnet Primal separates across machines (one primal variable per feature) $\min_{\alpha \in \mathbb{R}^n} f(A\alpha) + \sum_{i=1}^n g_i(\alpha_i)$ $\min_{w \in \mathbb{R}^d} f^*(w) + \sum_{i=1}^n \ell_i^*(-x_i^T w)$
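To see why "one primal variable per feature" makes distribution by feature natural, here is a hypothetical single-machine sketch for the lasso instance of the primal $\min_\alpha f(A\alpha) + \sum_i g_i(\alpha_i)$, with $f$ the squared error and $g_i = \lambda|\cdot|$: each coordinate update is a soft-thresholding step that touches only its own column of $A$ and the shared residual, so a worker owning a block of columns can update those coordinates locally. This is my illustration of the per-feature structure, not the ProxCoCoA+ implementation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_pass(A, y, alpha, lam):
    """One pass of coordinate descent on min_alpha 0.5*||A @ alpha - y||^2 + lam*||alpha||_1."""
    resid = y - A @ alpha
    for j in range(A.shape[1]):                   # with feature partitioning, column j lives on one worker
        a_j = A[:, j]
        rho = a_j @ (resid + a_j * alpha[j])      # correlation with the partial residual
        new_j = soft_threshold(rho, lam) / (a_j @ a_j)
        resid += a_j * (alpha[j] - new_j)         # keep the residual in sync
        alpha[j] = new_j
    return alpha
```

Because the soft-thresholding step sets coordinates exactly to zero, no smoothing of the L1 term is needed, which is what the sparsity comparison later in the deck relies on.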
  81. 81. 50x speedup
  82. 82. CoCoA: A Framework for Distributed Optimization
  83. 83. CoCoA: A Framework for Distributed Optimization flexible
  84. 84. CoCoA: A Framework for Distributed Optimization flexible allows for arbitrary internal methods
  85. 85. CoCoA: A Framework for Distributed Optimization flexible efficient allows for arbitrary internal methods
  86. 86. CoCoA: A Framework for Distributed Optimization flexible efficient allows for arbitrary internal methods fast! (strong convergence & low communication)
  87. 87. CoCoA: A Framework for Distributed Optimization flexible efficient general allows for arbitrary internal methods fast! (strong convergence & low communication)
  88. 88. CoCoA: A Framework for Distributed Optimization flexible efficient general allows for arbitrary internal methods fast! (strong convergence & low communication) works for a variety of ML models
  89. 89. CoCoA: A Framework for Distributed Optimization flexible efficient general allows for arbitrary internal methods fast! (strong convergence & low communication) works for a variety of ML models 1. CoCoA-v1 [NIPS ’14]
  90. 90. CoCoA: A Framework for Distributed Optimization flexible efficient general allows for arbitrary internal methods fast! (strong convergence & low communication) works for a variety of ML models 1. CoCoA-v1 [NIPS ’14] 2. CoCoA+ [ICML ’15]
  91. 91. CoCoA: A Framework for Distributed Optimization flexible efficient general allows for arbitrary internal methods fast! (strong convergence & low communication) works for a variety of ML models 1. CoCoA-v1 [NIPS ’14] 2. CoCoA+ [ICML ’15] 3. ProxCoCoA+ [current work]
  92. 92. CoCoA: A Framework for Distributed Optimization flexible efficient general allows for arbitrary internal methods fast! (strong convergence & low communication) works for a variety of ML models 1. CoCoA-v1 [NIPS ’14] 2. CoCoA+ [ICML ’15] 3. ProxCoCoA+ [current work] CoCoA Framework
  93. 93. Impact & Adoption
  94. 94. Impact & Adoption gingsmith.github.io/cocoa/
  95. 95. Impact & Adoption gingsmith.github.io/cocoa/
  96. 96. Impact & Adoption gingsmith.github.io/cocoa/
  97. 97. Impact & Adoption gingsmith.github.io/cocoa/
  98. 98. Impact & Adoption gingsmith.github.io/cocoa/ Numerous talks & demos
  99. 99. Impact & Adoption ICML gingsmith.github.io/cocoa/ Numerous talks & demos
  100. 100. Impact & Adoption ICML gingsmith.github.io/cocoa/ Numerous talks & demos Open-source code & documentation
  101. 101. Impact & Adoption ICML gingsmith.github.io/cocoa/ Numerous talks & demos Open-source code & documentation Industry & academic adoption
  102. 102. Thanks! cs.berkeley.edu/~vsmith github.com/gingsmith/proxcocoa github.com/gingsmith/cocoa
  103. 103. Empirical Results in Apache Spark
 Dataset | Training (n) | Features (d) | Sparsity | Workers (K)
 url | 2M | 3M | 4e-3% | 4
 kddb | 19M | 29M | 1e-4% | 4
 epsilon | 400K | 2K | 100% | 8
 webspam | 350K | 16M | 0.02% | 16
  104. 104. A First Approach:
  105. 105. A First Approach: CoCoA+ with Smoothing
  106. 106. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers
  107. 107. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer
  108. 108. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$
  109. 109. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$
  110. 110. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030
  111. 111. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035
  112. 112. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240
  113. 113. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465
  114. 114. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465 ✗
  115. 115. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465 ✗ CoCoA+ with smoothing doesn’t work
  116. 116. A First Approach: CoCoA+ with Smoothing Issue: CoCoA+ requires strongly-convex regularizers Approach: add a bit of L2 to the L1 regularizer: $\|\alpha\|_1 + \delta\|\alpha\|_2^2$ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465 ✗ CoCoA+ with smoothing doesn’t work. Additionally, CoCoA+ distributes by datapoint, not by feature.
  117. 117. Better Solution: ProxCoCoA+
  118. 118. Better Solution: ProxCoCoA+ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465
  119. 119. Better Solution: ProxCoCoA+ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465; ProxCoCoA+: 0.6030
  120. 120. Better Solution: ProxCoCoA+ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465; ProxCoCoA+: 0.6030
  121. 121. Better Solution: ProxCoCoA+ Amount of L2 / Final Sparsity: Ideal (δ = 0): 0.6030; δ = 0.0001: 0.6035; δ = 0.001: 0.6240; δ = 0.01: 0.6465; ProxCoCoA+: 0.6030
  122. 122. Convergence
  123. 123. Convergence Assumption (Local $\Theta$-Approximation): for $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution satisfying $\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \leq \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$
  124. 124. Convergence Theorem 1. Let $g_i$ have $L$-bounded support. Then $T \geq \tilde{\mathcal{O}}\big(\tfrac{1}{1-\Theta}\big(\tfrac{8L^2 n^2}{\tau \epsilon} + \tilde{c}\big)\big)$. Assumption (Local $\Theta$-Approximation): for $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution satisfying $\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \leq \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$
  125. 125. Convergence Theorem 1. Let $g_i$ have $L$-bounded support. Then $T \geq \tilde{\mathcal{O}}\big(\tfrac{1}{1-\Theta}\big(\tfrac{8L^2 n^2}{\tau \epsilon} + \tilde{c}\big)\big)$. Assumption (Local $\Theta$-Approximation): for $\Theta \in [0, 1)$, we assume the local solver finds an approximate solution satisfying $\mathbb{E}\big[\mathcal{G}^{\sigma'}_k(\Delta\alpha_{[k]}) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big] \leq \Theta\,\big(\mathcal{G}^{\sigma'}_k(0) - \mathcal{G}^{\sigma'}_k(\Delta\alpha^\star_{[k]})\big)$ Theorem 2. Let $g_i$ be $\mu$-strongly convex. Then $T \geq \tfrac{1}{1-\Theta}\,\tfrac{\mu\tau + n}{\mu\tau}\,\log\tfrac{n}{\epsilon}$.
