- 1. Adaptive Three Operator Splitting. Fabian Pedregosa, Gauthier Gidel 2018 International Symposium on Mathematical Programming, Bordeaux
- 4. Three Operator Splitting (TOS) • Recently proposed method (Davis and Yin, 2017). Solves optimization problems of the form $\min_{x \in \mathbb{R}^d} f(x) + g(x) + h(x)$, with access to $\nabla f$, $\operatorname{prox}_{\gamma g}$ and $\operatorname{prox}_{\gamma h}$. • Can be generalized to an arbitrary number of proximal terms. • Many complex penalties can be written as a sum of proximable terms: overlapping group lasso, $\ell_1$ trend filtering, isotonic constraints, total variation, intersection of constraints, etc.
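To make the "sum of proximable terms" idea concrete, here is a minimal sketch for an overlapping group lasso penalty. The function name `prox_group_l2` and the two example groups are illustrative assumptions, not taken from the talk. The point is that two overlapping groups can be assigned to separate terms $g$ and $h$, each of which then has non-overlapping groups and a closed-form prox (block soft-thresholding).

```python
import numpy as np

def prox_group_l2(x, groups, thresh):
    """Prox of thresh * sum_g ||x[g]||_2, valid only for NON-overlapping groups.

    Each group is shrunk toward zero by block soft-thresholding.
    """
    out = np.array(x, dtype=float)
    for g in groups:
        norm = np.linalg.norm(out[g])
        # Shrink the whole block; blocks with norm <= thresh are set to zero.
        scale = max(1.0 - thresh / norm, 0.0) if norm > 0 else 0.0
        out[g] = scale * out[g]
    return out

# Hypothetical overlapping groups {0,1,2} and {2,3,4} (they share index 2).
# Splitting them across g and h leaves each term with disjoint groups.
groups_g = [[0, 1, 2]]
groups_h = [[2, 3, 4]]
```

With this split, `prox_group_l2(v, groups_g, gamma * lam)` and `prox_group_l2(v, groups_h, gamma * lam)` play the roles of $\operatorname{prox}_{\gamma g}$ and $\operatorname{prox}_{\gamma h}$.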
- 5. Importance of step-size. Guaranteed to converge for any step-size $\gamma < 2/L$, with $L$ = Lipschitz constant of $\nabla f$. In practice, best performance is often achieved for $\gamma \gg 2/L$. [Figure: objective minus optimum vs. iterations, for step-sizes $\gamma$ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L.]
- 8. Motivation • $L$ is a global upper bound on the Lipschitz constant; locally it can be much smaller. • Adaptive step-size strategies (aka inexact line search) have a long tradition (Armijo, 1966) and have been adapted to proximal-gradient (Beck and Teboulle, 2009). • Goal. Can these adaptive step-size methods be adapted to the three operator splitting?
- 9. Outline 1. Revisiting the Three Operator Splitting 2. Adaptive Three Operator Splitting 3. Experiments
- 10. Revisiting the Three Operator Splitting
- 13. Three Operator Splitting (TOS) • Three operator splitting (Davis and Yin, 2017): $z_t = \operatorname{prox}_{\gamma h}(y_t)$, $x_t = \operatorname{prox}_{\gamma g}(2 z_t - y_t - \gamma \nabla f(z_t))$, $y_{t+1} = y_t - z_t + x_t$. • Generalization of both Proximal-Gradient and Douglas-Rachford. • Depends only on one step-size parameter.
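The fixed step-size iteration can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions that are not from the talk: $f(x) = \frac12\|Ax - b\|^2$, $g = \lambda\|x\|_1$ (prox = soft-thresholding), $h$ = indicator of the nonnegative orthant (prox = projection), and step-size $\gamma = 1/L$. The names `tos_fixed_step` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Prox of thresh * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def tos_fixed_step(A, b, lam, n_iter=500):
    """Three operator splitting with fixed step-size gamma = 1/L.

    Assumed problem: min_x 0.5*||Ax - b||^2 + lam*||x||_1  subject to x >= 0.
    """
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f
    gamma = 1.0 / L
    y = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = np.maximum(y, 0.0)                        # z_t = prox_{gamma h}(y_t)
        grad = A.T @ (A @ z - b)                      # grad f(z_t)
        x = soft_threshold(2 * z - y - gamma * grad,  # x_t = prox_{gamma g}(...)
                           gamma * lam)
        y = y - z + x                                 # y_{t+1}
    return np.maximum(y, 0.0)                         # feasible iterate z

# Usage: with A = I the solution is max(b - lam, 0) componentwise.
b_demo = np.array([2.0, -1.0, 0.3, 0.7])
z_demo = tos_fixed_step(np.eye(4), b_demo, lam=0.5)
```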
- 16. Revisiting the three operator splitting 1. Introduce $u_t := \frac{1}{\gamma}(y_t - z_t)$ and shift the index of $x$ by one. We can rewrite TOS as $x_{t+1} = \operatorname{prox}_{\gamma g}(z_t - \gamma(\nabla f(z_t) + u_t))$, $u_{t+1} = \operatorname{prox}_{h^*/\gamma}(u_t + x_{t+1}/\gamma)$, $z_{t+1} = x_{t+1} - \gamma(u_{t+1} - u_t)$. 2. Saddle-point reformulation of the original problem: $\min_{x \in \mathbb{R}^d} f(x) + g(x) + h(x) = \min_{x \in \mathbb{R}^d} f(x) + g(x) + \max_{u \in \mathbb{R}^d} \{\langle x, u \rangle - h^*(u)\} = \min_{x \in \mathbb{R}^d} \max_{u \in \mathbb{R}^d} \mathcal{L}(x, u)$, with $\mathcal{L}(x, u) := f(x) + g(x) + \langle x, u \rangle - h^*(u)$.
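The equivalence between the $(y, z, x)$ form and the $(x, u, z)$ form can be checked numerically. The sketch below is not from the talk: it assumes $f$ is least squares, $g$ the indicator of $x \ge 0$ and $h = \lambda\|\cdot\|_1$, and it evaluates $\operatorname{prox}_{h^*/\gamma}$ through the Moreau identity $\operatorname{prox}_{h^*/\gamma}(v) = v - \frac{1}{\gamma}\operatorname{prox}_{\gamma h}(\gamma v)$. The function name `compare_tos_forms` is hypothetical.

```python
import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def compare_tos_forms(A, b, lam, y0, n_iter=50):
    """Run both forms of TOS from matching initial points; return the z iterates."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_g = lambda v: np.maximum(v, 0.0)               # g = indicator of x >= 0
    prox_h = lambda v: soft_threshold(v, gamma * lam)   # prox_{gamma h}
    # Moreau identity gives the prox of the conjugate, scaled by 1/gamma.
    prox_h_conj = lambda v: v - prox_h(gamma * v) / gamma

    # Original form in the variables (y, z, x).
    y, z_from_y = y0.copy(), []
    for _ in range(n_iter):
        z = prox_h(y)
        z_from_y.append(z)
        x = prox_g(2 * z - y - gamma * grad_f(z))
        y = y - z + x

    # Rewritten form in the variables (x, u, z), with u_0 = (y_0 - z_0) / gamma.
    z = prox_h(y0)
    u = (y0 - z) / gamma
    z_from_u = []
    for _ in range(n_iter):
        z_from_u.append(z)
        x = prox_g(z - gamma * (grad_f(z) + u))
        u_next = prox_h_conj(u + x / gamma)
        z = x - gamma * (u_next - u)
        u = u_next
    return np.array(z_from_y), np.array(z_from_u)

rng = np.random.default_rng(0)
A_demo = rng.standard_normal((20, 5))
b_demo = rng.standard_normal(20)
z_a, z_b = compare_tos_forms(A_demo, b_demo, 0.1, rng.standard_normal(5))
```

The two arrays of iterates agree up to floating-point error, which is what the change of variables predicts.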
- 19. Minimizing with respect to primal variable. Up to the constant $-h^*(u_t)$: $\min_{x \in \mathbb{R}^d} \mathcal{L}(x, u_t) = \min_{x \in \mathbb{R}^d} \underbrace{f(x) + \langle x, u_t \rangle}_{\text{smooth}} + \underbrace{g(x)}_{\text{proximal}}$ • Proximal-gradient iteration, with $x = z_t$ as starting point: $x_{t+1} = \operatorname{prox}_{\gamma g}(z_t - \gamma(\nabla f(z_t) + u_t))$ = first step of TOS.
- 22. Minimizing with respect to the dual variable. Maximizing $\mathcal{L}(x_{t+1}, u)$ over $u$ is the minimization problem $\min_{u \in \mathbb{R}^d} \underbrace{h^*(u) - \langle x_{t+1}, u \rangle}_{\text{proximal}}$ • Proximal-point iteration: $u_{t+1} = \operatorname{prox}_{\sigma h^*}(u_t + \sigma x_{t+1})$ = second step in TOS with $\sigma = 1/\gamma$. • Third line: $z_{t+1} = x_{t+1} - \gamma(u_{t+1} - u_t)$ (extrapolation step).
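As a worked step, using only the definitions on this slide, the proximal-point update on $\varphi(u) = h^*(u) - \langle x_{t+1}, u \rangle$ with step $\sigma$ follows by completing the square (terms that do not depend on $u$ are dropped):

```latex
\begin{align*}
u_{t+1}
  &= \arg\min_{u} \Big\{ h^*(u) - \langle x_{t+1}, u \rangle
       + \tfrac{1}{2\sigma} \| u - u_t \|^2 \Big\} \\
  &= \arg\min_{u} \Big\{ h^*(u)
       + \tfrac{1}{2\sigma} \| u - (u_t + \sigma x_{t+1}) \|^2 \Big\} \\
  &= \operatorname{prox}_{\sigma h^*}(u_t + \sigma x_{t+1}).
\end{align*}
```

Setting $\sigma = 1/\gamma$ recovers $u_{t+1} = \operatorname{prox}_{h^*/\gamma}(u_t + x_{t+1}/\gamma)$.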
- 35. Revisiting the three operator splitting. $x_{t+1} = \operatorname{prox}_{\gamma g}(z_t - \gamma(\nabla f(z_t) + u_t))$, $u_{t+1} = \operatorname{prox}_{h^*/\gamma}(u_t + x_{t+1}/\gamma)$, $z_{t+1} = x_{t+1} - \gamma(u_{t+1} - u_t)$. [Animation: iterates in the $(x, u)$ plane over successive iterations, each cycling through a proximal-gradient step, a proximal-point step and an extrapolation.] Take-Home Message: TOS is (basically) alternated proximal-gradient and proximal-point. Can we adapt the adaptive step-size of proximal-gradient?
- 36. Adaptive Three Operator Splitting
- 39. Adaptive Three Operator Splitting¹. Start with an optimistic step-size $\gamma_t$ and decrease it until $f(x_{t+1}) \le f(z_t) + \langle \nabla f(z_t), x_{t+1} - z_t \rangle + \frac{1}{2\gamma_t} \|x_{t+1} - z_t\|^2$, with $x_{t+1} = \operatorname{prox}_{\gamma_t g}(z_t - \gamma_t(\nabla f(z_t) + u_t))$. Run the rest of the algorithm with that step-size: $u_{t+1} = \operatorname{prox}_{h^*/\gamma_t}(u_t + x_{t+1}/\gamma_t)$ (1), $z_{t+1} = x_{t+1} - \gamma_t(u_{t+1} - u_t)$ (2). Benefits • Automatic tuning of step-size • (Practically) hyperparameter-free. ¹ Fabian Pedregosa and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". In: Proceedings of the 35th International Conference on Machine Learning (ICML).
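The backtracking rule above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it only ever decreases the step-size (by an assumed factor `tau`), uses the same assumed test problem as a nonnegative lasso ($f$ least squares, $g = \lambda\|\cdot\|_1$, $h$ = indicator of $x \ge 0$), and adds a small numerical slack to the sufficient-decrease test so that rounding error near convergence does not trigger spurious shrinking. Steps (1) and (2) are computed in the equivalent form $z_{t+1} = \operatorname{prox}_{\gamma_t h}(x_{t+1} + \gamma_t u_t)$, $u_{t+1} = u_t + (x_{t+1} - z_{t+1})/\gamma_t$, which follows from the Moreau identity. All function and parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def adaptive_tos(A, b, lam, gamma0, tau=0.5, n_iter=500, max_backtracks=60):
    """Adaptive TOS sketch: backtrack on gamma until sufficient decrease holds.

    Assumed problem: min_x 0.5*||Ax - b||^2 + lam*||x||_1  subject to x >= 0.
    """
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
    grad_f = lambda x: A.T @ (A @ x - b)
    z = np.zeros(A.shape[1])
    u = np.zeros(A.shape[1])
    gamma = gamma0
    for _ in range(n_iter):
        fz, gz = f(z), grad_f(z)
        slack = 1e-12 * max(1.0, abs(fz))   # guards against rounding error only
        for _attempt in range(max_backtracks):
            x = soft_threshold(z - gamma * (gz + u), gamma * lam)
            d = x - z
            if f(x) <= fz + gz @ d + d @ d / (2 * gamma) + slack:
                break                       # sufficient decrease satisfied
            gamma *= tau                    # otherwise shrink the step-size
        z_next = np.maximum(x + gamma * u, 0.0)   # prox_{gamma h}(x + gamma u)
        u = u + (x - z_next) / gamma              # equals step (1)
        z = z_next                                # equals step (2)
    return z, gamma

# Usage: start from a deliberately large step-size, 10/L with L = 1 here.
b_demo = np.array([2.0, -1.0, 0.3, 0.7])
z_demo, gamma_demo = adaptive_tos(np.eye(4), b_demo, lam=0.5, gamma0=10.0)
```

On this toy problem the line search settles on a step-size below $1/L$ during the first iteration; the variants in the paper also allow the step-size to grow again, which this sketch omits.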
- 40. Performance of the adaptive step-size strategy. [Figure: objective minus optimum vs. iterations, for fixed step-sizes $\gamma$ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L and for the adaptive strategy.] Performance is as good as the best hand-tuned step-size.
- 42. Convergence rates 1/3. Convergence rate in terms of the average (aka ergodic) sequence: $s_t \stackrel{\text{def}}{=} \sum_{i=0}^{t-1} \gamma_i$, $\bar{x}_t \stackrel{\text{def}}{=} \big(\sum_{i=0}^{t-1} \gamma_i x_{i+1}\big)/s_t$, $\bar{u}_t \stackrel{\text{def}}{=} \big(\sum_{i=0}^{t-1} \gamma_i u_{i+1}\big)/s_t$. Theorem (sublinear convergence rate). For any $(x, u) \in \operatorname{dom} \mathcal{L}$: $\mathcal{L}(\bar{x}_t, u) - \mathcal{L}(x, \bar{u}_t) \le \dfrac{\|z_0 - x\|^2 + \gamma_0^2 \|u_0 - u\|^2}{2 s_t}$.
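The ergodic sequence is a step-size-weighted average of the iterates. A minimal sketch of that definition, with a hypothetical helper name and made-up toy inputs:

```python
import numpy as np

def ergodic_average(gammas, iterates):
    """Return (weighted average, s_t) for step-sizes gamma_0..gamma_{t-1}
    and the matching iterates x_1..x_t (one iterate per row)."""
    gammas = np.asarray(gammas, dtype=float)
    iterates = np.asarray(iterates, dtype=float)
    s_t = gammas.sum()
    return gammas @ iterates / s_t, s_t

# Toy inputs: two iterates in R^2 with step-sizes 1 and 3.
x_bar, s_t = ergodic_average([1.0, 3.0], [[0.0, 0.0], [4.0, 8.0]])
```

Iterates computed with larger step-sizes receive more weight, and the bound in the theorem shrinks as $s_t$ grows.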
- 44. Convergence rates 2/3. If $h$ is Lipschitz, we can bound it and obtain rates in terms of objective function suboptimality. Corollary. Let $h$ be $\beta_h$-Lipschitz. Then we have $P(\bar{x}_{t+1}) - P(x^*) \le \dfrac{\|z_0 - x^*\|^2 + 2\gamma_0^2(\|u_0\|^2 + \beta_h^2)}{2 s_t} = O(1/t)$, with $P(x) = f(x) + g(x) + h(x)$.
- 45. Convergence rates 3/3. Linear convergence under (somewhat unrealistic) assumptions. Theorem. If $f$ is $L_f$-smooth and $\mu$-strongly convex, and $h$ is $L_h$-smooth, then $\|x_{t+1} - x^*\|^2 \le \left(1 - \min\left\{\tau \frac{\mu}{L_f}, \frac{1}{1 + \gamma_0 L_h}\right\}\right)^{t+1} C_0$ (3), with $\tau$ = line search decrease factor and $C_0$ a constant that depends only on initial conditions. • Better rate than $\frac{\mu}{L_f} \times \frac{1}{(1 + \gamma L_h)^2}$ from (Davis and Yin, 2015).
- 46. Experiments
- 47. Logistic + Nearly-isotonic penalty. Problem: $\arg\min_x \operatorname{logistic}(x) + \lambda \sum_{i=1}^{p-1} \max\{x_i - x_{i+1}, 0\}$. [Figure: for $\lambda$ = $10^{-6}$, $10^{-3}$, 0.01, 0.1, panels show coefficient magnitude (estimated coefficients vs. ground truth) and objective minus optimum vs. time (in seconds), comparing Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
- 48. Logistic + Overlapping group lasso penalty. Problem: $\arg\min_x \operatorname{logistic}(x) + \lambda \sum_{g \in \mathcal{G}} \|[x]_g\|_2$. [Figure: for $\lambda$ = $10^{-6}$, $10^{-3}$, 0.01, 0.1, panels show coefficient magnitude (estimated coefficients vs. ground truth) and objective minus optimum vs. time (in seconds), comparing Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
- 49. Quadratic loss + total variation penalty. Problem: $\arg\min_x \operatorname{least\ squares}(x) + \lambda \|x\|_{\mathrm{TV}}$. [Figure: for $\lambda$ = $10^{-6}$, $10^{-5}$, $10^{-4}$, $10^{-3}$, panels show recovered coefficients and objective minus optimum vs. time (in seconds), comparing Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG and Adaptive PDHG.]
- 54. Conclusion • Sufficient decrease condition to set the step-size in three operator splitting. • (Mostly) hyperparameter-free, adaptive to local geometry. • Same convergence guarantees as the fixed step-size method. • Large empirical improvements, especially in the low-regularization and non-quadratic regime. Perspectives • Linear convergence under less restrictive assumptions? • Acceleration. https://arxiv.org/abs/1804.02339
