
Adaptive Three Operator Splitting


Full paper: https://arxiv.org/pdf/1804.02339.pdf

We propose and analyze a novel adaptive step-size variant of the Davis-Yin three operator splitting, a method that can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method leverages local information of the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step-size hyperparameter besides an initial estimate. We provide a convergence rate analysis of this method, showing a sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non-adaptive variant. Finally, an empirical comparison with related methods on six different problems illustrates the computational advantage of the adaptive step-size strategy.

Adaptive Three Operator Splitting

  1. Adaptive Three Operator Splitting. Fabian Pedregosa, Gauthier Gidel. 2018 International Symposium on Mathematical Programming, Bordeaux.
  2-4. Three Operator Splitting (TOS). Recently proposed method (Davis and Yin, 2017) that solves optimization problems of the form minimize_{x ∈ R^d} f(x) + g(x) + h(x), with access to ∇f, prox_{γg} and prox_{γh}. • Can be generalized to an arbitrary number of proximal terms. • Many complex penalties can be written as a sum of proximable terms: overlapping group lasso, ℓ1 trend filtering, isotonic constraints, total variation, intersection of constraints, etc.
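To make "access to the proximal operator" concrete, here is a small Python sketch (illustrative, not from the slides) of two proximal operators that commonly serve as g or h terms: soft-thresholding for a scaled ℓ1 norm and Euclidean projection onto a box constraint. Function names and signatures are my own.

```python
import numpy as np

def prox_l1(v, threshold):
    """Prox of threshold * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)

def prox_box(v, lo=-1.0, hi=1.0):
    """Prox of the indicator of the box [lo, hi]^d: projection onto the box."""
    return np.clip(v, lo, hi)
```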
  5. Importance of step-size. Guaranteed to converge for any step-size γ < 2/L, where L is the Lipschitz constant of ∇f. In practice, the best performance is often achieved for γ ≫ 2/L. [Figure: objective minus optimum vs. iterations for γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L.]
  6-8. Motivation. • L is a global upper bound on the Lipschitz constant; locally it can be much smaller. • Adaptive step-size strategies (aka inexact line search) have a long tradition (Armijo, 1966) and have been adapted to proximal-gradient (Beck and Teboulle, 2009). • Goal: can these adaptive step-size methods be adapted to the three operator splitting?
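As background (not part of the proposed method), a minimal sketch of the classical Armijo-style backtracking step for plain gradient descent, the "long tradition" the slide refers to; the sufficient-decrease constant `c`, initial step and shrink factor are illustrative defaults.

```python
import numpy as np

def backtracking_gradient_step(f, grad_f, x, step=1.0, shrink=0.5, c=1e-4):
    """One gradient step with an Armijo-type backtracking line search (sketch)."""
    fx, g = f(x), grad_f(x)
    sq_norm = np.dot(g, g)
    while f(x - step * g) > fx - c * step * sq_norm:
        step *= shrink  # decrease the step size until sufficient decrease holds
    return x - step * g, step
```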
  9. Outline. 1. Revisiting the Three Operator Splitting. 2. Adaptive Three Operator Splitting. 3. Experiments.
  10. Revisiting the Three Operator Splitting
  11-13. Three Operator Splitting (TOS). The three operator splitting (Davis and Yin, 2017) iterates: z_t = prox_{γh}(y_t); x_t = prox_{γg}(2z_t − y_t − γ∇f(z_t)); y_{t+1} = y_t − z_t + x_t. • Generalization of both Proximal-Gradient and Douglas-Rachford. • Depends on only one step-size parameter.
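For concreteness, a minimal sketch of the fixed step-size iteration above; `grad_f`, `prox_g` and `prox_h` are assumed to be user-supplied callables, with each prox taking a point and a step size (names and signatures are illustrative, not from the slides).

```python
def three_operator_splitting(grad_f, prox_g, prox_h, y0, step, n_iter=100):
    """Davis-Yin three operator splitting with a fixed step size (sketch)."""
    y = y0.copy()
    for _ in range(n_iter):
        z = prox_h(y, step)                                 # prox step on h
        x = prox_g(2 * z - y - step * grad_f(z), step)      # gradient + prox step on g
        y = y + x - z                                       # update the auxiliary variable
    return x
```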
  14-16. Revisiting the three operator splitting. 1. Introduce the dual variable u_t := (y_t − z_t)/γ. TOS can then be rewritten as: x_{t+1} = prox_{γg}(z_t − γ(∇f(z_t) + u_t)); u_{t+1} = prox_{h*/γ}(u_t + x_{t+1}/γ); z_{t+1} = x_{t+1} − γ(u_{t+1} − u_t). 2. Saddle-point reformulation of the original problem: min_{x∈R^d} f(x) + g(x) + h(x) = min_{x∈R^d} max_{u∈R^d} f(x) + g(x) + ⟨x, u⟩ − h*(u) =: L(x, u).
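The reformulation above relies on the Fenchel conjugate; for a closed, proper, convex h, the following standard identities hold (stated here for reference, not taken from the slides):

```latex
h^*(u) = \sup_{x \in \mathbb{R}^d} \;\langle x, u\rangle - h(x),
\qquad
h(x) = \sup_{u \in \mathbb{R}^d} \;\langle x, u\rangle - h^*(u),
```

and the second identity is what replaces h(x) by a maximization over the dual variable u, turning the original minimization into the saddle-point problem min_x max_u L(x, u).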
  17-19. Minimizing with respect to the primal variable. min_{x∈R^d} L(x, u_t) = min_{x∈R^d} [f(x) + ⟨x, u_t⟩] (smooth) + g(x) (proximal). • A proximal-gradient iteration with x = z_t as starting point gives x_{t+1} = prox_{γg}(z_t − γ(∇f(z_t) + u_t)), which is exactly the first step of TOS.
  20-22. Minimizing with respect to the dual variable. The dual update solves min_{u∈R^d} h*(u) − ⟨x, u⟩ (proximal). • A proximal-point iteration gives u_{t+1} = prox_{σh*}(u_t + σx_{t+1}), which is the second step of TOS with σ = 1/γ. • The third line, z_{t+1} = x_{t+1} − γ(u_{t+1} − u_t), is an extrapolation step.
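The dual update uses prox_{h*/γ}, while only prox_{γh} is assumed available; by the Moreau decomposition, prox_{σh*}(v) = v − σ·prox_{h/σ}(v/σ), so with σ = 1/γ the dual step can be computed from prox_{γh} alone. Below is a minimal sketch of one iteration of the rewritten form (names and signatures illustrative, as before).

```python
def tos_primal_dual_step(grad_f, prox_g, prox_h, u, z, step):
    """One iteration of the primal-dual form of TOS (sketch).

    The dual prox prox_{h*/step} is evaluated via the Moreau decomposition,
    so only prox_{step * h} is needed.
    """
    x_new = prox_g(z - step * (grad_f(z) + u), step)   # proximal-gradient step
    w = u + x_new / step                               # argument of prox_{h*/step}
    u_new = w - prox_h(step * w, step) / step          # Moreau decomposition
    z_new = x_new - step * (u_new - u)                 # extrapolation step
    return x_new, u_new, z_new
```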
  23-35. Revisiting the three operator splitting (illustration). A sequence of slides steps through several iterations of the rewritten method on a figure with primal variable x and dual variable u, highlighting in turn the proximal-gradient step x_{t+1} = prox_{γg}(z_t − γ(∇f(z_t) + u_t)), the proximal-point step u_{t+1} = prox_{h*/γ}(u_t + x_{t+1}/γ), and the extrapolation step z_{t+1} = x_{t+1} − γ(u_{t+1} − u_t). Take-Home Message: TOS is (basically) alternated proximal-gradient and proximal-point. Can we adapt the adaptive step-size of proximal-gradient?
  36. Adaptive Three Operator Splitting
  37-39. Adaptive Three Operator Splitting (Pedregosa and Gidel, 2018). Start with an optimistic step-size γ_t and decrease it until f(x_{t+1}) ≤ f(z_t) + ⟨∇f(z_t), x_{t+1} − z_t⟩ + ‖x_{t+1} − z_t‖²/(2γ_t), with x_{t+1} = prox_{γ_t g}(z_t − γ_t(∇f(z_t) + u_t)). Run the rest of the algorithm with that step-size: u_{t+1} = prox_{h*/γ_t}(u_t + x_{t+1}/γ_t); z_{t+1} = x_{t+1} − γ_t(u_{t+1} − u_t). Benefits: • Automatic tuning of the step-size. • (Practically) hyperparameter-free.
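A minimal sketch of one iteration of this adaptive strategy, combining the sufficient-decrease check above with the primal-dual updates; the grow and shrink factors are illustrative, and how γ is increased between iterations follows the paper only loosely.

```python
import numpy as np

def adaptive_tos_step(f, grad_f, prox_g, prox_h, u, z, step, shrink=0.7, grow=1.02):
    """One iteration of TOS with a backtracking ("sufficient decrease") step size (sketch)."""
    fz, gz = f(z), grad_f(z)
    step = step * grow                       # optimistic initial step size
    while True:
        x = prox_g(z - step * (gz + u), step)
        diff = x - z
        # sufficient decrease: f lies below its quadratic model at x
        if f(x) <= fz + np.dot(gz, diff) + np.dot(diff, diff) / (2 * step):
            break
        step *= shrink                       # otherwise decrease the step size
    # run the rest of the iteration with the accepted step size
    w = u + x / step
    u_new = w - prox_h(step * w, step) / step   # prox of h*/step via Moreau decomposition
    z_new = x - step * (u_new - u)
    return x, u_new, z_new, step
```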
  40. Performance of the adaptive step-size strategy. [Figure: objective minus optimum vs. iterations for γ = 1/L, 2/L, 5/L, 10/L, 20/L, 50/L and the adaptive step-size.] Performance is as good as the best hand-tuned step-size.
  41-42. Convergence rates 1/3. Convergence rate in terms of the average (aka ergodic) sequence: s_t := Σ_{i=0}^{t−1} γ_i, x̄_t := (Σ_{i=0}^{t−1} γ_i x_{i+1})/s_t, ū_t := (Σ_{i=0}^{t−1} γ_i u_{i+1})/s_t. Theorem (sublinear convergence rate). For any (x, u) ∈ dom L: L(x̄_t, u) − L(x, ū_t) ≤ (‖z_0 − x‖² + γ_0²‖u_0 − u‖²)/(2s_t).
  43-44. Convergence rates 2/3. If h is Lipschitz, the dual term can be bounded to obtain rates in terms of objective suboptimality. Corollary. Let h be β_h-Lipschitz. Then P(x̄_{t+1}) − P(x*) ≤ (‖z_0 − x*‖² + 2γ_0²(‖u_0‖² + β_h²))/(2s_t) = O(1/t), with P(x) = f(x) + g(x) + h(x).
  45. Convergence rates 3/3. Linear convergence under (somewhat unrealistic) assumptions. Theorem. If f is L_f-smooth and µ-strongly convex, and h is L_h-smooth, then ‖x_{t+1} − x*‖² ≤ (1 − min{τµ/L_f, 1/(1 + γ_0 L_h)})^{t+1} C_0, where τ is the line-search decrease factor and C_0 depends only on the initial conditions. • Better rate than µ/L_f × 1/(1 + γL_h)² from (Davis and Yin, 2015).
  46. Experiments
  47. Logistic + nearly-isotonic penalty. Problem: argmin_x logistic(x) + λ Σ_{i=1}^{p−1} max{x_i − x_{i+1}, 0}. [Figures: estimated coefficients vs. ground truth and objective minus optimum vs. time, for λ = 10^-6, 10^-3, 0.01, 0.1; methods: Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
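The nearly-isotonic penalty couples neighboring coordinates, so it is not separable. One common way to fit it into the f + g + h template (an illustration of a possible splitting, not necessarily the exact one used in the experiments) is to put the even-indexed differences into g and the odd-indexed ones into h; each term is then a sum over disjoint pairs, and the prox of a single pair term λ·max(a − b, 0) has the closed form below.

```python
def prox_pair_nearly_isotonic(a, b, lam):
    """Prox of (a, b) -> lam * max(a - b, 0) for one pair of coordinates."""
    d = a - b
    if d <= 0:
        return a, b                    # already isotonic: nothing to do
    if d >= 2 * lam:
        return a - lam, b + lam        # large violation: shrink the gap by 2 * lam
    m = 0.5 * (a + b)
    return m, m                        # small violation: average the pair
```

Applying this over the disjoint pairs of each collection gives closed-form prox_{γg} and prox_{γh} (with lam set to γλ).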
  48. Logistic + overlapping group lasso penalty. Problem: argmin_x logistic(x) + λ Σ_{g∈G} ‖[x]_g‖₂. [Figures: estimated coefficients vs. ground truth and objective minus optimum vs. time, for λ = 10^-6, 10^-3, 0.01, 0.1; same methods as above.]
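For the group penalty, the proximal operator over a single group is block soft-thresholding; when groups overlap, one way to use it within this template (again an illustration of the splitting idea, not necessarily the experimental setup) is to partition the groups into a few collections of mutually non-overlapping groups and assign one proximal term per collection.

```python
import numpy as np

def prox_group_l2(v, threshold):
    """Block soft-thresholding: prox of threshold * ||.||_2 on one group."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v
```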
  49. Quadratic loss + total variation penalty. Problem: argmin_x least_squares(x) + λ‖x‖_TV. [Figures: recovered coefficients and objective minus optimum vs. time, for λ = 10^-6, 10^-5, 10^-4, 10^-3; same methods as above.]
  50-54. Conclusion. • A sufficient-decrease condition to set the step-size in the three operator splitting. • (Mostly) hyperparameter-free, with adaptivity to the local geometry. • Same convergence guarantees as the fixed step-size method. • Large empirical improvements, especially in the low-regularization and non-quadratic regimes. Perspectives: • Linear convergence under less restrictive assumptions? • Acceleration. https://arxiv.org/abs/1804.02339
  55. References. • Armijo, Larry (1966). "Minimization of functions having Lipschitz continuous first partial derivatives". Pacific Journal of Mathematics. • Beck, Amir and Marc Teboulle (2009). "Gradient-based algorithms with applications to signal recovery". Convex Optimization in Signal Processing and Communications. • Davis, Damek and Wotao Yin (2015). "A three-operator splitting scheme and its optimization applications". Preprint arXiv:1504.01032v1. • Davis, Damek and Wotao Yin (2017). "A three-operator splitting scheme and its optimization applications". Set-Valued and Variational Analysis. • Pedregosa, Fabian and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". Proceedings of the 35th International Conference on Machine Learning (ICML).
