
- 1. CoCoA: A General Framework for Communication-Efficient Distributed Optimization. Virginia Smith, Simone Forte, Chenxin Ma, Martin Takac, Martin Jaggi, Michael I. Jordan
- 2–4. Machine Learning with Large Datasets: image/music/video tagging, document categorization, item recommendation, click-through rate prediction, sequence tagging, protein structure prediction, sensor data prediction, spam classification, fraud detection
- 5–11. Machine Learning Workflow
  DATA & PROBLEM: classification, regression, collaborative filtering, …
  MACHINE LEARNING MODEL: logistic regression, lasso, support vector machines, …
  OPTIMIZATION ALGORITHM: gradient descent, coordinate descent, Newton's method, …
  SYSTEMS SETTING: multi-core, cluster, cloud, supercomputer, …
  Open problem: efficiently solving the objective when the data is distributed
- 12–23. Distributed Optimization
  "Always communicate": reduce w := w − α Σ_k Δw_k
    ✔ convergence guarantees, ✗ high communication
  "Never communicate": average w := (1/K) Σ_k w_k
    ✔ low communication, ✗ convergence not guaranteed
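The two extremes above can be sketched in a few lines. This is a toy least-squares simulation, not code from the talk: shards stand in for machines, and all function names and the synthetic setup are illustrative.

```python
import numpy as np

def always_communicate_step(w, shards, alpha):
    """One 'always communicate' round: every machine computes its full
    local gradient, the driver reduces (averages) them, and takes one step."""
    grads = [X.T @ (X @ w - y) / len(y) for X, y in shards]
    return w - alpha * np.mean(grads, axis=0)

def never_communicate(shards, alpha, iters=200):
    """'Never communicate': each machine fits its own shard to completion,
    and the local solutions are averaged exactly once at the end."""
    local = []
    for X, y in shards:
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w -= alpha * X.T @ (X @ w - y) / len(y)
        local.append(w)
    return np.mean(local, axis=0)
```

The first variant pays one communication round per gradient step; the second pays a single round total but inherits no global convergence guarantee when the shards disagree.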
- 24–31. Mini-batch: a natural middle ground
  reduce: w := w − (α/|b|) Σ_{i∈b} Δw_i
  ✔ convergence guarantees, ✔ tunable communication
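The mini-batch reduce step above can be written out concretely. A hedged toy sketch (my own least-squares example, where the per-example update Δw_i is the example's gradient; nothing here is from the deck):

```python
import numpy as np

def minibatch_step(w, X, y, batch, alpha):
    """reduce: w = w - (alpha/|b|) * sum_{i in b} Δw_i,
    with Δw_i the per-example least-squares gradient (x_i^T w - y_i) x_i."""
    Xb, yb = X[batch], y[batch]
    grad_sum = Xb.T @ (Xb @ w - yb)   # sum of per-example gradients over b
    return w - (alpha / len(batch)) * grad_sum
```

The batch size |b| tunes the trade-off: larger batches mean fewer, heavier communication rounds.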
- 32–35. Mini-batch Limitations
  1. ONE-OFF METHODS
  2. STALE UPDATES
  3. AVERAGE OVER BATCH SIZE
- 36–37. LARGE-SCALE OPTIMIZATION: CoCoA, ProxCoCoA+
- 38–40. CoCoA-v1: addressing the mini-batch limitations
  1. ONE-OFF METHODS → primal-dual framework
  2. STALE UPDATES → immediately apply local updates
  3. AVERAGE OVER BATCH SIZE → average over K ≪ batch size
- 41–47. 1. Primal-Dual Framework
  PRIMAL ≥ DUAL
  Primal: min_{w ∈ R^d} P(w) := (λ/2)‖w‖² + (1/n) Σ_{i=1}^n ℓ_i(wᵀx_i)
  Dual: max_{α ∈ R^n} D(α) := −(λ/2)‖Aα‖² − (1/n) Σ_{i=1}^n ℓ*_i(−α_i), where A_i = (1/(λn)) x_i
  - Stopping criterion given by the duality gap
  - Good performance in practice
  - Default in software packages, e.g. liblinear
  - Dual separates across machines (one dual variable per datapoint)
- 48–55. 1. Primal-Dual Framework
  Global objective: max_{α ∈ R^n} D(α) := −(λ/2)‖Aα‖² − (1/n) Σ_{i=1}^n ℓ*_i(−α_i), where A_i = (1/(λn)) x_i
  Local objective: max_{Δα_[k] ∈ R^n} −(1/n) Σ_{i∈P_k} ℓ*_i(−α_i − (Δα_[k])_i) − (1/n) wᵀA Δα_[k] − (λ/2) ‖(1/(λn)) A Δα_[k]‖²
  Can solve the local objective using any internal optimization method
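To make the "any internal optimization method" point concrete, here is a hedged single-process toy of the CoCoA-v1 pattern, using closed-form SDCA coordinate updates for ridge regression as the local solver. The function names, the toy data layout, and the 1/K averaging choice are my illustration of the scheme, not code from the talk.

```python
import numpy as np

def cocoa_ridge(X, y, lam, partitions, outer_rounds=20, local_steps=50, seed=0):
    """CoCoA-v1-style loop: each machine optimizes its own block of dual
    variables locally (here with SDCA steps for the squared loss, but any
    local solver would do), then the K partial updates are averaged."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                  # maintained as w = (1/(lam*n)) X^T alpha
    K = len(partitions)
    for _ in range(outer_rounds):
        d_alpha = np.zeros(n)
        d_w = np.zeros((K, d))
        for k, P in enumerate(partitions):   # conceptually: one machine per k
            a = alpha.copy()
            w_loc = w.copy()                 # local updates applied immediately
            for _ in range(local_steps):
                i = rng.choice(P)
                # closed-form SDCA coordinate update for the squared loss
                da = (y[i] - X[i] @ w_loc - a[i]) / (1 + X[i] @ X[i] / (lam * n))
                a[i] += da
                w_loc += da * X[i] / (lam * n)
                d_alpha[i] += da
                d_w[k] += da * X[i] / (lam * n)
        alpha += d_alpha / K                 # average over K, not batch size
        w += d_w.sum(axis=0) / K
    return w
```

Only the d-dimensional Δw vectors cross machine boundaries each round; the n dual variables stay local.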
- 56–58. 2. Immediately Apply Updates
  STALE:
    for i ∈ b:
      Δw ← Δw − α ∇_i P(w)
    end
    w ← w + Δw
  FRESH:
    for i ∈ b:
      Δw ← −α ∇_i P(w)
      w ← w + Δw
    end
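The stale/fresh contrast can be sketched directly. A toy example (the quadratic gradient function and all names are mine, not from the deck):

```python
import numpy as np

def stale_batch(w, grad_fn, batch, alpha):
    """All updates are computed at the same old w and applied only at the end."""
    d_w = np.zeros_like(w)
    for i in batch:
        d_w -= alpha * grad_fn(w, i)   # every term sees the stale w
    return w + d_w

def fresh_batch(w, grad_fn, batch, alpha):
    """Each update is applied immediately, so later steps see earlier ones."""
    w = w.copy()
    for i in batch:
        w -= alpha * grad_fn(w, i)     # freshest iterate every time
    return w
```

On the same batch the two produce different iterates; applying updates immediately is what lets the local solver make real progress between communication rounds.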
- 59–60. 3. Average over K
  reduce: w := w + (1/K) Σ_k Δw_k
- 61–66. CoCoA-v1 Limitations
  - Can we avoid having to average the partial solutions? ✔ [CoCoA+, Ma & Smith et al., ICML '15]
  - L1-regularized objectives are not covered in the initial framework
- 67–68. LARGE-SCALE OPTIMIZATION: CoCoA, ProxCoCoA+
- 69–73. L1 Regularization
  - Encourages sparse solutions
  - Includes popular models: lasso regression, sparse logistic regression, elastic net-regularized problems
  - Beneficial to distribute by feature
  - Can we map this to the CoCoA setup?
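The sparsity the slide mentions comes from the proximal (soft-thresholding) operator of the L1 norm. A minimal sketch, one ISTA-style proximal-gradient step for the lasso (my own toy code, not from the talk):

```python
import numpy as np

def soft_threshold(v, t):
    """Prox of t*||.||_1: shrinks toward zero; entries with |v| <= t become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_step(w, X, y, lam, step):
    """One proximal-gradient step for (1/2n)||Xw - y||^2 + lam*||w||_1."""
    n = len(y)
    grad = X.T @ (X @ w - y) / n
    return soft_threshold(w - step * grad, step * lam)
```

The exact zeros produced by the thresholding are what make the final model sparse.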
- 74–80. Solution: Solve the Primal Directly
  PRIMAL ≥ DUAL
  Primal: min_{α ∈ R^n} f(Aα) + Σ_{i=1}^n g_i(α_i)
  Dual: min_{w ∈ R^d} f*(w) + Σ_{i=1}^n g*_i(−x_iᵀw)
  - Stopping criterion given by the duality gap
  - Good performance in practice
  - Default in software packages, e.g. glmnet
  - Primal separates across machines (one primal variable per feature)
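"One primal variable per feature" suggests block coordinate descent over columns. A hedged single-process sketch for the lasso, with coordinate-wise soft-thresholding; the shared-residual bookkeeping and all names are my illustration of the idea, not the talk's implementation:

```python
import numpy as np

def lasso_cd_by_feature(X, y, lam, partitions, rounds=50):
    """Each partition P_k owns a block of coordinates of w (columns of X),
    mirroring 'one primal variable per feature'. The residual r = y - Xw is
    the shared state that would be communicated between rounds."""
    n, d = X.shape
    w = np.zeros(d)
    r = y.copy()                        # residual at w = 0
    for _ in range(rounds):
        for P in partitions:            # conceptually one machine per block
            for j in P:
                xj = X[:, j]
                # exact coordinate minimizer of (1/2n)||y - Xw||^2 + lam*||w||_1
                rho = xj @ (r + xj * w[j]) / n
                zj = xj @ xj / n
                wj = np.sign(rho) * max(abs(rho) - lam, 0.0) / zj
                r += xj * (w[j] - wj)   # keep the residual in sync
                w[j] = wj
    return w
```

Partitioning by feature keeps each coordinate update local to one machine; only the residual needs to be synchronized.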
- 81. 50x speedup
- 82–92. CoCoA: A Framework for Distributed Optimization
  - Flexible: allows for arbitrary internal methods
  - Efficient: fast! (strong convergence & low communication)
  - General: works for a variety of ML models
  1. CoCoA-v1 [NIPS '14]
  2. CoCoA+ [ICML '15]
  3. ProxCoCoA+ [current work]
- 93–101. Impact & Adoption
  gingsmith.github.io/cocoa/
  - Numerous talks & demos (e.g. at ICML)
  - Open-source code & documentation
  - Industry & academic adoption
- 102. Thanks! cs.berkeley.edu/~vsmith github.com/gingsmith/proxcocoa github.com/gingsmith/cocoa
- 103. Empirical Results
  Dataset  | Training (n) | Features (d) | Sparsity | Workers (K)
  url      | 2M           | 3M           | 4e-3%    | 4
  kddb     | 19M          | 29M          | 1e-4%    | 4
  epsilon  | 400K         | 2K           | 100%     | 8
  webspam  | 350K         | 16M          | 0.02%    | 16
- 104–116. A First Approach: CoCoA+ with Smoothing
  Issue: CoCoA+ requires strongly convex regularizers
  Approach: add a bit of L2 to the L1 regularizer: ‖α‖₁ + δ‖α‖₂²
  Amount of L2   | Final Sparsity
  Ideal (δ = 0)  | 0.6030
  δ = 0.0001     | 0.6035
  δ = 0.001      | 0.6240
  δ = 0.01       | 0.6465
  ✗ CoCoA+ with smoothing doesn't work
  Additionally, CoCoA+ distributes by datapoint, not by feature
- 117–121. Better Solution: ProxCoCoA+
  Amount of L2   | Final Sparsity
  Ideal (δ = 0)  | 0.6030
  δ = 0.0001     | 0.6035
  δ = 0.001      | 0.6240
  δ = 0.01       | 0.6465
  ProxCoCoA+     | 0.6030
- 122–125. Convergence
  Assumption (Local Θ-Approximation): for Θ ∈ [0, 1), we assume the local solver finds an approximate solution satisfying
    E[G′_k(Δα_[k]) − G′_k(Δα*_[k])] ≤ Θ (G′_k(0) − G′_k(Δα*_[k]))
  Theorem 1: Let g_i have L-bounded support. Then
    T ≥ Õ((1/(1 − Θ)) (8L²n²/(τε) + c̃))
  Theorem 2: Let g_i be μ-strongly convex. Then
    T ≥ (1/(1 − Θ)) ((μτ + n)/(μτ)) log(n/ε)
