
Sufficient decrease is all you need

A simple condition to forget about the step-size, with applications to the Frank-Wolfe algorithm.


  1. 1. Sufficient decrease is all you need A simple condition to forget about the step-size, with applications to the Frank-Wolfe algorithm. Fabian Pedregosa June 4th, 2018. Google Brain Montreal
  2. 2. Where I Come From. ML/Optimization/Software guy. Engineer (2010–2012): first contact with ML, developing an ML library (scikit-learn). ML and Neuroscience (2012–2015): PhD applying ML to neuroscience. ML and Optimization (2015–): stochastic, parallel, constrained, hyperparameter optimization. 1/30
  5. 5. Outline. Motivation: eliminate the step-size parameter.
     1. Frank-Wolfe, a method for constrained optimization.
     2. Adaptive Frank-Wolfe: Frank-Wolfe without the step-size.
     3. Perspectives. Other applications: proximal splitting, stochastic optimization.
     With a little help from my collaborators: Armin Askari (UC Berkeley), Geoffrey Négiar (UC Berkeley), Martin Jaggi (EPFL), Gauthier Gidel (UdeM). 2/30
  6. 6. The Frank-Wolfe algorithm
  7. 7. The Frank-Wolfe (FW) algorithm, aka conditional gradient. Problem: smooth f, compact D: arg min_{x∈D} f(x).
     Algorithm 1: Frank-Wolfe (FW)
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1−γ)xt + γst)
     4      xt+1 = (1 − γt)xt + γt st
     3/30
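To make the template above concrete, here is a minimal NumPy sketch of the vanilla FW loop with the default 2/(t + 2) step, using the ℓ1 ball as an example domain (its linear subproblem is solved by a signed coordinate vector). The function names and the toy least-squares objective are illustrative, not code from the talk.

```python
import numpy as np

def lmo_l1_ball(grad, radius=1.0):
    """Linear minimization oracle: argmin of <grad, s> over the l1 ball of given radius."""
    idx = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[idx] = -radius * np.sign(grad[idx])    # the solution is always a signed vertex
    return s

def frank_wolfe(grad_f, lmo, x0, max_iter=100):
    """Vanilla Frank-Wolfe with the line-search-free step gamma_t = 2 / (t + 2)."""
    x = x0.copy()
    for t in range(max_iter):
        s = lmo(grad_f(x))                   # linear subproblem over D
        gamma = 2.0 / (t + 2.0)
        x = (1 - gamma) * x + gamma * s      # convex combination stays in D
    return x

# Toy usage: least squares constrained to an l1 ball.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
grad_f = lambda x: A.T @ (A @ x - b) / len(b)
x_hat = frank_wolfe(grad_f, lambda g: lmo_l1_ball(g, radius=10.0), np.zeros(20))
```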
  14. 14. Why people ♥ Frank-Wolfe. • Projection-free: only linear subproblems arise, vs. quadratic subproblems for projection. • The solution of the linear subproblem is always an extremal element of D. • Iterates admit a sparse representation: xt is a convex combination of at most t elements. 4/30
  15. 15. Recent applications of Frank-Wolfe. • Learning the structure of a neural network.¹ • Attention mechanisms that enforce sparsity.² • ℓ1-constrained problems with an extreme number of features.³
     ¹ Wei Ping, Qiang Liu, and Alexander T. Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
     ² Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
     ³ Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
     5/30
  17. 17. A practical issue. • Line-search only efficient when a closed form exists (quadratic objective). • Step-size γt = 2/(t + 2) is convergent, but extremely slow.
     Algorithm 2: Frank-Wolfe (FW)
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      find γt by line-search: γt ∈ arg min_{γ∈[0,1]} f((1−γ)xt + γst)
     4      xt+1 = (1 − γt)xt + γt st
     Can we do better? 6/30
  18. 18. A sufficient decrease condition
  19. 19. Down the citation rabbit hole: Vladimir Demyanov and Aleksandr Rubinov.⁴ ⁴ Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian). 7/30
  21. 21. The Demyanov-Rubinov (DR) Frank-Wolfe variant. Problem: smooth objective, compact domain: arg min_{x∈D} f(x), where f is L-smooth (L-smooth ≡ differentiable with L-Lipschitz gradient). • The step-size depends on the correlation between −∇f(xt) and the descent direction st − xt.
     Algorithm 3: FW, DR variant
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      γt = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1}
     4      xt+1 = (1 − γt)xt + γt st
     8/30
  24. 24. The Demyanov-Rubinov (DR) Frank-Wolfe variant. Where does γt = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1} come from?
     L-smooth inequality: any L-smooth function f verifies f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖x − y‖² =: Qx(y), for all x, y in the domain.
     • The right-hand side is a quadratic upper bound: Qx(y) ≥ f(y).
     9/30
  27. 27. Justification of the step-size. • The L-smooth inequality at y = xt+1(γ) = (1 − γ)xt + γst, x = xt gives f(xt+1(γ)) ≤ f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²L/2)‖st − xt‖². • Minimizing the right-hand side on γ ∈ [0, 1] gives γ = min{⟨−∇f(xt), st − xt⟩ / (L‖st − xt‖²), 1} = the Demyanov-Rubinov step-size! • ≡ exact line search on the quadratic upper bound. 10/30
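For completeness, the minimization behind that formula is one line of calculus (a standard argument written out here; clipping at 1 suffices because the FW gap ⟨−∇f(xt), st − xt⟩ is non-negative, so the unconstrained minimizer never falls below 0):

```latex
q(\gamma) = f(x_t) + \gamma \langle \nabla f(x_t),\, s_t - x_t\rangle
          + \tfrac{\gamma^2 L}{2}\,\lVert s_t - x_t\rVert^2, \qquad
q'(\gamma) = \langle \nabla f(x_t),\, s_t - x_t\rangle + \gamma L \lVert s_t - x_t\rVert^2 = 0
\;\Longrightarrow\;
\gamma^\star = \frac{\langle -\nabla f(x_t),\, s_t - x_t\rangle}{L\,\lVert s_t - x_t\rVert^2},
\qquad
\gamma_t = \min\{\gamma^\star,\, 1\}.
```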
  30. 30. Towards an Adaptive FW.
     Quadratic upper bound: the Demyanov-Rubinov variant makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1.
     Sufficient decrease is all you need: the L-smooth inequality can be replaced by f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖², with γt = min{⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1}.
     Key difference with DR: L is replaced by Lt. Potentially Lt ≪ L.
     11/30
  31. 31. The Adaptive FW algorithm. New FW variant with adaptive step-size.⁵
     Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW)
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      find Lt that verifies sufficient decrease (1), with
     4      γt = min{⟨−∇f(xt), st − xt⟩ / (Lt‖st − xt‖²), 1}
     5      xt+1 = (1 − γt)xt + γt st
     f(xt+1) ≤ f(xt) + γt⟨∇f(xt), st − xt⟩ + (γt²Lt/2)‖st − xt‖²   (1)
     ⁵ Fabian Pedregosa, Armin Askari, Geoffrey Négiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
     12/30
  34. 34. The Adaptive FW algorithm.
     [Figure: along the segment γ ∈ [0, 1], the curve f((1 − γ)xt + γst) and its quadratic model f(xt) + γ⟨∇f(xt), st − xt⟩ + (γ²Lt/2)‖st − xt‖², with γt at the model's minimizer.]
     • Worst case, Lt = L. Often Lt ≪ L ⟹ larger step-size.
     • Adaptivity to local geometry.
     • Two extra function evaluations per iteration, often given as a byproduct of the gradient computation.
     13/30
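A minimal sketch of how Lt might be found in practice, by backtracking on the sufficient-decrease condition (1). The initialization 0.9·Lt−1 and the doubling factor are illustrative defaults, not necessarily the constants used in the paper.

```python
import numpy as np

def ada_fw_step(f, grad, x, s, L_prev):
    """One adaptive FW step: find L_t satisfying condition (1), return (x_{t+1}, L_t)."""
    d = s - x                          # FW direction s_t - x_t
    gap = -grad(x) @ d                 # <-grad f(x_t), s_t - x_t>, non-negative
    if gap <= 0:                       # zero FW gap: x_t is already a solution
        return x, L_prev
    f_x = f(x)
    L_t = 0.9 * L_prev                 # optimistic: first try a smaller estimate
    while True:
        gamma = min(gap / (L_t * (d @ d)), 1.0)
        x_next = x + gamma * d
        # sufficient decrease (1): the quadratic model with L_t upper-bounds f at x_{t+1}
        if f(x_next) <= f_x - gamma * gap + 0.5 * gamma ** 2 * L_t * (d @ d):
            return x_next, L_t
        L_t *= 2.0                     # condition violated: be more conservative
```

Plugged into the FW loop from the earlier sketch (replacing the 2/(t + 2) step), this only costs the extra function evaluations mentioned above whenever the first trial value of Lt already passes the test.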
  35. 35. Extension to other FW variants
  36. 36. The zig-zagging phenomenon in FW. The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary. Some FW variants have been developed to address this issue. 14/30
  37. 37. Away-steps FW, informal. The Away-steps FW algorithm (Wolfe, 1970) (Guélat and Marcotte, 1986) adds the possibility of moving away from an active atom. 15/30
  39. 39. Away-steps FW algorithm. Keep an active set St = vertices that have been previously selected and have non-zero weight.
     Algorithm 5: Away-Steps FW
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      vt ∈ arg max_{v∈St} ⟨∇f(xt), v⟩
     4      if ⟨−∇f(xt), st − xt⟩ ≥ ⟨−∇f(xt), xt − vt⟩ then
     5          dt = st − xt   (FW step)
     6      else
     7          dt = xt − vt   (away step)
     8      find γt by line-search: γt ∈ arg min_{γ∈[0,γt^max]} f(xt + γdt)
     9      xt+1 = xt + γt dt
     16/30
  44. 44. Pairwise FW. Key idea: move weight mass between two atoms in each step. Proposed by (Lacoste-Julien and Jaggi, 2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).
     Algorithm 6: Pairwise FW
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      vt ∈ arg max_{v∈St} ⟨∇f(xt), v⟩
     4      dt = st − vt
     5      find γt by line-search: γt ∈ arg min_{γ∈[0,γt^max]} f(xt + γdt)
     6      xt+1 = xt + γt dt
     17/30
  47. 47. Away-steps FW and Pairwise FW.
     Convergence of Away-steps and Pairwise FW: • Linear convergence for strongly convex functions on polytopes (Lacoste-Julien and Jaggi, 2015). • Can we design variants with sufficient decrease?
     Introducing Adaptive Away-steps and Adaptive Pairwise: choose Lt such that it verifies f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖², with γt = min{⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1}.
     18/30
  48. 48. Adaptive Pairwise FW.
     Algorithm 7: Adaptive Pairwise FW
     1  for t = 0, 1, ... do
     2      st ∈ arg min_{s∈D} ⟨∇f(xt), s⟩
     3      vt ∈ arg max_{v∈St} ⟨∇f(xt), v⟩
     4      dt = st − vt
     5      find Lt that verifies sufficient decrease (2), with
     6      γt = min{⟨−∇f(xt), dt⟩ / (Lt‖dt‖²), 1}
     7      xt+1 = xt + γt dt
     f(xt + γtdt) ≤ f(xt) + γt⟨∇f(xt), dt⟩ + (γt²Lt/2)‖dt‖²   (2)
     19/30
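A sketch of how one adaptive pairwise step could look when the iterate is stored as an explicit convex combination of atoms. The cap of γt at the weight of the away atom mirrors the line-search range [0, γt^max] used above; the active-set bookkeeping, the backtracking constants, and the helper names are illustrative assumptions, and a fuller implementation would merge duplicate atoms and drop zero-weight ones.

```python
import numpy as np

def ada_pairwise_step(f, grad, atoms, weights, lmo, L_prev):
    """One adaptive pairwise FW step on x = sum_i weights[i] * atoms[i]."""
    x = sum(w * a for w, a in zip(weights, atoms))
    g = grad(x)
    s = lmo(g)                                                   # FW atom s_t
    i_away = max(range(len(atoms)), key=lambda i: g @ atoms[i])  # away atom v_t in S_t
    d = s - atoms[i_away]                                        # pairwise direction
    gap = -g @ d
    if gap <= 1e-12:                                             # nothing to transfer
        return x, L_prev
    gamma_max = weights[i_away]          # cannot remove more weight than v_t carries
    f_x, L_t = f(x), 0.9 * L_prev
    while True:                          # backtrack on sufficient decrease (2)
        gamma = min(gap / (L_t * (d @ d)), gamma_max)
        x_next = x + gamma * d
        if f(x_next) <= f_x - gamma * gap + 0.5 * gamma ** 2 * L_t * (d @ d):
            break
        L_t *= 2.0
    weights[i_away] -= gamma             # move gamma units of mass from v_t to s_t
    atoms.append(s)
    weights.append(gamma)
    return x_next, L_t
```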
  51. 51. Theory for Adaptive Step-size variants.
     Strongly convex f: Pairwise and Away-steps converge linearly on a polytope. For each “good step” we have f(xt+1) − f(x*) ≤ (1 − ρ)(f(xt) − f(x*)).
     Convex f: for all FW variants, f(xt) − f(x*) ≤ O(1/t).
     Non-convex f: for all FW variants, max_{s∈D} ⟨∇f(xt), xt − s⟩ ≤ O(1/√t).
     20/30
  52. 52. Experiments
  53. 53. Experiments: RCV1. Problem: ℓ1-constrained logistic regression, arg min_{‖x‖1≤α} (1/n) Σ_{i=1}^n φ(aiᵀx, bi), with φ = logistic loss. Dataset: RCV1, dimension 47,236, density 10⁻³, Lt/L = 1.3 × 10⁻².
     [Figure: objective minus optimum vs. time (in seconds) for ℓ1-ball radii 100, 200, 300; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
     21/30
  54. 54. Experiments: Madelon. Problem: ℓ1-constrained logistic regression, arg min_{‖x‖1≤α} (1/n) Σ_{i=1}^n φ(aiᵀx, bi), with φ = logistic loss. Dataset: Madelon, dimension 500, density 1, Lt/L = 3.3 × 10⁻³.
     [Figure: objective minus optimum vs. time (in seconds) for ℓ1-ball radii 13, 20, 30; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
     22/30
  55. 55. Experiments: MovieLens 1M. Problem: trace-norm-constrained robust matrix completion, arg min_{‖X‖*≤α} (1/|B|) Σ_{(i,j)∈B} h(Xi,j, Ai,j), with h = Huber loss. Dataset: MovieLens 1M, dimension 22,393,987, density 0.04, Lt/L = 1.1 × 10⁻².
     [Figure: objective minus optimum vs. time (in seconds) for trace-ball radii 300, 350, 400; methods: Adaptive FW, FW, D-FW.]
     23/30
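For these matrix problems, what makes FW attractive is that the linear subproblem over the trace-norm ball only needs the leading singular pair of the gradient, rather than the full SVD a projection would require. A minimal sketch using SciPy, illustrative rather than the code behind the experiments:

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_trace_ball(grad, radius):
    """argmin of <grad, S> over {S : ||S||_* <= radius} is -radius * u1 v1^T,
    where (u1, v1) is the leading singular pair of the gradient."""
    u, _, vt = svds(grad, k=1)     # top singular triple; works on sparse matrices
    # in practice one would keep this rank-one matrix in factored form
    return -radius * np.outer(u[:, 0], vt[0, :])
```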
  56. 56. Other applications
  57. 57. Proximal Splitting. Building a quadratic upper bound is common in proximal gradient descent (Beck and Teboulle, 2009) (Nesterov, 2013). Recently extended to the Davis-Yin three operator splitting,⁶ for objectives of the form f(x) + g(x) + h(x) with access to ∇f, prox_{γg}, prox_{γh}. Key insight: verify a sufficient decrease condition of the form f(xt+1) ≤ f(zt) + ⟨∇f(zt), xt+1 − zt⟩ + (1/(2γt))‖xt+1 − zt‖².
     ⁶ Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
     24/30
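As a reference point for that condition, here is the analogous backtracking test in plain proximal gradient descent (the Beck-Teboulle setting mentioned first on the slide). The adaptive three-operator-splitting iteration in the cited paper is more involved, so this is only a simplified sketch, with a hypothetical prox_g(y, step) signature and illustrative increase/decrease factors.

```python
import numpy as np

def prox_grad_step(f, grad_f, prox_g, x, step):
    """One proximal-gradient step, backtracking until the quadratic bound
    f(x+) <= f(x) + <grad f(x), x+ - x> + ||x+ - x||^2 / (2*step) holds."""
    g, f_x = grad_f(x), f(x)
    step *= 1.1                                  # optimistic increase, then backtrack
    while True:
        x_next = prox_g(x - step * g, step)      # forward gradient step, then prox
        diff = x_next - x
        if f(x_next) <= f_x + g @ diff + (diff @ diff) / (2 * step):
            return x_next, step
        step *= 0.5                              # bound violated: shrink the step
```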
  58. 58. Nearly-isotonic penalty. Problem: arg min_x loss(x) + λ Σ_{i=1}^{p−1} max{xi − xi+1, 0}.
     [Figure: estimated coefficients vs. ground truth for λ ∈ {10⁻⁶, 10⁻³, 0.01, 0.1}, and objective minus optimum vs. time (in seconds); methods: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
     25/30
  59. 59. Overlapping group lasso penalty. Problem: arg min_x loss(x) + λ Σ_{g∈G} ‖[x]g‖2.
     [Figure: estimated coefficients vs. ground truth for λ ∈ {10⁻⁶, 10⁻³, 0.01, 0.1}, and objective minus optimum vs. time (in seconds); methods: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
     26/30
  60. 60. Perspectives
  62. 62. Stochastic optimization. Problem: arg min_{x∈R^d} (1/n) Σ_{i=1}^n fi(x). Heuristic from⁷ to estimate L: verify at each iteration t that fi(xt − (1/L)∇fi(xt)) ≤ fi(xt) − (1/(2L))‖∇fi(xt)‖², with i a random index sampled at iteration t. This is the L-smooth inequality with y = xt − (1/L)∇fi(xt), x = xt.
     ⁷ Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
     27/30
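A small sketch of that heuristic as it could sit inside a stochastic-gradient loop; the f_i/grad_i signatures and the double-on-failure schedule are illustrative choices, not the exact rule from the SAG paper.

```python
import numpy as np

def estimate_L_stochastic(f_i, grad_i, x, i, L):
    """Increase L until f_i(x - g/L) <= f_i(x) - ||g||^2 / (2L) for the sampled component i."""
    g = grad_i(x, i)
    sq_norm = g @ g
    while f_i(x - g / L, i) > f_i(x, i) - sq_norm / (2 * L):
        L *= 2.0                      # test failed: the estimate was too optimistic
    return L

# Inside the loop one might let L decay slightly before testing, then step with 1/L:
#   L = estimate_L_stochastic(f_i, grad_i, x, i, 0.9 * L)
#   x = x - grad_i(x, i) / L
```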
  64. 64. Experiments: stochastic line search. Can we prove convergence for such (or similar) stochastic adaptive step-sizes? 28/30
  67. 67. Conclusion • Sufficient decrease condition to set step-size in FW and variants. • (Mostly) Hyperparameter-free, adaptivity to local geometry. • Applications in proximal splitting and stochastic optimization. Thanks for your attention 29/30
  68. 68. References
     Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to signal recovery”. In: Convex optimization in signal processing and communications.
     Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
     Guélat, Jacques and Patrice Marcotte (1986). “Some comments on Wolfe’s away step”. In: Mathematical Programming.
     Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
     Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank-Wolfe optimization variants”. In: Advances in Neural Information Processing Systems.
     Mitchell, B. F., Vladimir Fedorovich Demyanov, and V. N. Malozemov (1974). “Finding the point of a polyhedron closest to the origin”. In: SIAM Journal on Control.
     29/30
  69. 69. Nesterov, Yu (2013). “Gradient methods for minimizing composite functions”. In: Mathematical Programming.
     Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
     Pedregosa, Fabian et al. (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
     Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
     Ping, Wei, Qiang Liu, and Alexander T. Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
     Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
     Wolfe, Philip (1970). “Convergence theory in nonlinear programming”. In: Integer and nonlinear programming.
     30/30
