- 1. Sufficient decrease is all you need. A simple condition to forget about the step-size, with applications to the Frank-Wolfe algorithm. Fabian Pedregosa, June 4th, 2018. Google Brain Montreal.
- 2. Where I Come From: ML/Optimization/Software Guy
  - Engineer (2010–2012). First contact with ML: developing an ML library (scikit-learn).
  - ML and Neuroscience (2012–2015). PhD applying ML to neuroscience.
  - ML and Optimization (2015–). Stochastic, parallel, constrained, and hyperparameter optimization.
- 5. Outline

  Motivation: eliminate the step-size parameter.
  1. Frank-Wolfe: a method for constrained optimization.
  2. Adaptive Frank-Wolfe: Frank-Wolfe without the step-size.
  3. Perspectives. Other applications: proximal splitting, stochastic optimization.

  With a little help from my collaborators: Armin Askari (UC Berkeley), Geoffrey Négiar (UC Berkeley), Martin Jaggi (EPFL), Gauthier Gidel (UdeM).
- 11. The Frank-Wolfe (FW) algorithm, aka conditional gradient

  Problem: smooth $f$, compact $D$,
  $$\arg\min_{x \in D} f(x)$$

  Algorithm 1: Frank-Wolfe (FW)
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0,1]} f((1-\gamma)x_t + \gamma s_t)$
    - $x_{t+1} = (1-\gamma_t)x_t + \gamma_t s_t$
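The loop in Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not code from the talk: the objective is taken to be $f(x) = \tfrac{1}{2}\|x - b\|^2$, the domain is an $\ell_1$ ball, and the names `lmo_l1` and `frank_wolfe` are hypothetical. For this particular quadratic the line search has a closed form.

```python
def lmo_l1(grad, radius):
    """Linear minimization oracle over the l1 ball: returns a signed, scaled basis vector."""
    i = max(range(len(grad)), key=lambda j: abs(grad[j]))
    s = [0.0] * len(grad)
    s[i] = -radius if grad[i] > 0 else radius
    return s


def frank_wolfe(b, radius, n_iter=200):
    """Minimize f(x) = 0.5 * ||x - b||^2 over the l1 ball (assumed toy problem)."""
    x = [0.0] * len(b)
    for _ in range(n_iter):
        grad = [xi - bi for xi, bi in zip(x, b)]
        s = lmo_l1(grad, radius)
        d = [si - xi for si, xi in zip(s, x)]
        d_norm_sq = sum(di * di for di in d)
        if d_norm_sq == 0.0:
            break  # the oracle returned the current iterate
        # Exact line search: closed form because the Hessian of f is the identity.
        gap = -sum(gi * di for gi, di in zip(grad, d))
        gamma = min(max(gap / d_norm_sq, 0.0), 1.0)
        x = [xi + gamma * di for xi, di in zip(x, d)]
    return x


x_hat = frank_wolfe([3.0, 0.5, -0.2], radius=1.0)
print(x_hat)  # [1.0, 0.0, 0.0], the projection of b onto the unit l1 ball
```

Every iterate is a convex combination of feasible points, so no projection is ever computed.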
- 14. Why people ♥ Frank-Wolfe
  - Projection-free: only linear subproblems arise, versus quadratic subproblems for projection.
  - The solution of the linear subproblem can always be taken to be an extremal element of $D$.
  - Iterates admit a sparse representation: $x_t$ is a convex combination of at most $t$ elements.
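To make the "extremal element" point concrete, here is a small sketch of a linear minimization oracle over the probability simplex. The simplex is chosen only for illustration (the slides do not single it out), and `lmo_simplex` is a hypothetical name. The oracle reduces to one pass over the gradient and always returns a vertex.

```python
def lmo_simplex(grad):
    """Return a vertex of the probability simplex minimizing <grad, s>."""
    i = min(range(len(grad)), key=lambda j: grad[j])
    return [1.0 if j == i else 0.0 for j in range(len(grad))]


grad = [0.3, -1.2, 0.7]
vertex = lmo_simplex(grad)  # one-hot vector on the smallest gradient entry

# Any other point of the simplex gives a linear objective that is no smaller.
other = [0.2, 0.5, 0.3]
value_vertex = sum(g * s for g, s in zip(grad, vertex))
value_other = sum(g * s for g, s in zip(grad, other))
```

By contrast, projecting onto the same set requires solving a quadratic problem.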
- 15. Recent applications of Frank-Wolfe
  - Learning the structure of a neural network [1].
  - Attention mechanisms that enforce sparsity [2].
  - $\ell_1$-constrained problems with an extreme number of features [3].

  [1] Wei Ping, Qiang Liu, and Alexander T. Ihler (2016). "Learning Infinite RBMs with Frank-Wolfe". In: Advances in Neural Information Processing Systems.
  [2] Vlad Niculae et al. (2018). "SparseMAP: Differentiable Sparse Structured Inference". In: International Conference on Machine Learning.
  [3] Thomas Kerdreux, Fabian Pedregosa, and Alexandre d'Aspremont (2018). "Frank-Wolfe with Subsampling Oracle". In: Proceedings of the 35th International Conference on Machine Learning.
- 17. A practical issue
  - Line-search is only efficient when a closed form exists (quadratic objective).
  - The step-size $\gamma_t = 2/(t+2)$ is convergent, but extremely slow.

  Algorithm 2: Frank-Wolfe (FW)
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0,1]} f((1-\gamma)x_t + \gamma s_t)$
    - $x_{t+1} = (1-\gamma_t)x_t + \gamma_t s_t$

  Can we do better?
- 18. A sufficient decrease condition
- 20. Down the citation rabbit hole [4]: Vladimir Demyanov, Aleksandr Rubinov.

  [4] Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
- 21. The Demyanov-Rubinov (DR) Frank-Wolfe variant

  Problem: smooth objective, compact domain,
  $$\arg\min_{x \in D} f(x), \quad \text{where } f \text{ is } L\text{-smooth}$$
  ($L$-smooth ≡ differentiable with $L$-Lipschitz gradient).
  - The step-size depends on the correlation between $-\nabla f(x_t)$ and the descent direction $s_t - x_t$.

  Algorithm 3: FW, DR variant
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t), s_t - x_t \rangle}{L \|s_t - x_t\|^2},\ 1 \right\}$
    - $x_{t+1} = (1-\gamma_t)x_t + \gamma_t s_t$
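The DR step-size is a one-line formula once the gradient and the oracle output are available. The helper below is a sketch with a hypothetical name (`dr_step_size`) and plain lists for vectors. It assumes `s` comes from the linear minimization oracle and `x` is feasible, so the numerator is non-negative.

```python
def dr_step_size(grad, x, s, L):
    """Demyanov-Rubinov step: min{ <-grad, s - x> / (L * ||s - x||^2), 1 }."""
    d = [si - xi for si, xi in zip(s, x)]
    d_norm_sq = sum(di * di for di in d)
    if d_norm_sq == 0.0:
        return 0.0  # s == x, no movement possible
    gap = -sum(gi * di for gi, di in zip(grad, d))
    return min(gap / (L * d_norm_sq), 1.0)


# A large Lipschitz estimate shrinks the step; a small one saturates at 1.
gamma_large_L = dr_step_size([-2.0, 0.0], [0.0, 0.0], [1.0, 0.0], L=4.0)
gamma_small_L = dr_step_size([-2.0, 0.0], [0.0, 0.0], [1.0, 0.0], L=1.0)
```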
- 24. The Demyanov-Rubinov (DR) Frank-Wolfe variant

  Where does $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t), s_t - x_t \rangle}{L \|s_t - x_t\|^2},\ 1 \right\}$ come from?

  L-smooth inequality: any $L$-smooth function $f$ satisfies
  $$f(y) \le \underbrace{f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|x - y\|^2}_{:= Q_x(y)}$$
  for all $x, y$ in the domain.
  - The right-hand side is a quadratic upper bound: $Q_x(y) \ge f(y)$.
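- 27. Justification of the step-size
  - The L-smooth inequality at $y = x_{t+1}(\gamma) = (1-\gamma)x_t + \gamma s_t$, $x = x_t$ gives
    $$f(x_{t+1}(\gamma)) \le f(x_t) + \gamma \langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma^2 L}{2}\|s_t - x_t\|^2$$
  - Minimizing the right-hand side over $\gamma \in [0, 1]$ gives
    $$\gamma = \min\left\{ \frac{\langle -\nabla f(x_t), s_t - x_t \rangle}{L \|s_t - x_t\|^2},\ 1 \right\},$$
    which is the Demyanov-Rubinov step-size!
  - Equivalent to exact line search on the quadratic upper bound.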
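The minimization step can be written out explicitly. The shorthand $d_t = s_t - x_t$ and $g_t = \langle -\nabla f(x_t), d_t \rangle$ is introduced here for brevity; note that $g_t \ge 0$ because $s_t$ minimizes the linear function over $D$ and $x_t \in D$.

```latex
\begin{align*}
q(\gamma) &= f(x_t) - \gamma\, g_t + \frac{\gamma^2 L}{2}\,\|d_t\|^2,
  \qquad g_t = \langle -\nabla f(x_t), d_t \rangle \ge 0, \\
q'(\gamma) &= -g_t + \gamma L \|d_t\|^2 = 0
  \;\Longrightarrow\; \gamma^\star = \frac{g_t}{L\,\|d_t\|^2} \ge 0, \\
\arg\min_{\gamma \in [0,1]} q(\gamma) &= \min\{\gamma^\star,\, 1\}
  \quad \text{(since $q$ is convex in $\gamma$)}, \\
q(\gamma^\star) &= f(x_t) - \frac{g_t^2}{2L\,\|d_t\|^2}
  \quad \text{whenever } \gamma^\star \le 1 .
\end{align*}
```

The last line shows the guaranteed decrease per iteration: it grows as $L$ shrinks, which is why a smaller local estimate of $L$ is attractive.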
- 30. Towards an Adaptive FW

  Quadratic upper bound: the Demyanov-Rubinov step-size makes use of a quadratic upper bound, but the bound is only evaluated at $x_t$ and $x_{t+1}$.

  Sufficient decrease is all you need: the L-smooth inequality can be replaced by
  $$f(x_{t+1}) \le f(x_t) + \gamma_t \langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma_t^2 L_t}{2}\|s_t - x_t\|^2$$
  with
  $$\gamma_t = \min\left\{ \frac{\langle -\nabla f(x_t), s_t - x_t \rangle}{L_t \|s_t - x_t\|^2},\ 1 \right\}$$

  Key difference with DR: $L$ is replaced by $L_t$. Potentially $L_t \ll L$.
- 31. The Adaptive FW algorithm

  New FW variant with adaptive step-size [5].

  Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW)
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - Find $L_t$ that satisfies the sufficient decrease condition (1), with
      $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t), s_t - x_t \rangle}{L_t \|s_t - x_t\|^2},\ 1 \right\}$
    - $x_{t+1} = (1-\gamma_t)x_t + \gamma_t s_t$

  $$f(x_{t+1}) \le f(x_t) + \gamma_t \langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma_t^2 L_t}{2}\|s_t - x_t\|^2 \qquad (1)$$

  [5] Fabian Pedregosa, Armin Askari, Geoffrey Négiar, and Martin Jaggi (2018). "Step-Size Adaptivity in Projection-Free Optimization". In: Submitted.
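- 34. The Adaptive FW algorithm

  [Figure: for $\gamma$ between 0 and 1, the quadratic model $f(x_t) + \gamma \langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma^2 L_t}{2}\|s_t - x_t\|^2$ plotted against the objective $f((1-\gamma)x_t + \gamma s_t)$, with $\gamma_t$ marked on the axis.]
  - Worst case, $L_t = L$. Often $L_t \ll L$ ⟹ larger step-size.
  - Adaptivity to local geometry.
  - Two extra function evaluations per iteration, often available as a byproduct of the gradient.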
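One way to "find $L_t$" is a backtracking loop, sketched below. The search strategy is an assumption made for illustration (the slides do not specify it): the previous estimate is first shrunk by a factor `shrink`, then multiplied by `grow` until condition (1) holds. The function name, the default constants, and the toy problem are all hypothetical.

```python
def adaptive_fw(f, grad_f, lmo, x0, L0=1.0, n_iter=100,
                shrink=0.9, grow=2.0, max_backtrack=60):
    """Frank-Wolfe where L_t is found by backtracking on sufficient decrease (1)."""
    x, L = list(x0), L0
    for _ in range(n_iter):
        g = grad_f(x)
        s = lmo(g)
        d = [si - xi for si, xi in zip(s, x)]
        gap = -sum(gi * di for gi, di in zip(g, d))  # <-grad f(x_t), s_t - x_t>
        d_norm_sq = sum(di * di for di in d)
        if gap <= 1e-12 or d_norm_sq == 0.0:
            break  # numerically stationary
        fx = f(x)
        L *= shrink  # optimistic first guess (assumed heuristic)
        for _bt in range(max_backtrack):
            gamma = min(gap / (L * d_norm_sq), 1.0)
            x_next = [xi + gamma * di for xi, di in zip(x, d)]
            bound = fx - gamma * gap + 0.5 * gamma ** 2 * L * d_norm_sq
            if f(x_next) <= bound:
                break  # sufficient decrease (1) holds for this L
            L *= grow
        x = x_next
    return x, L


# Toy problem (assumption): f(x) = 0.5 * ||x - b||^2 over the unit l1 ball.
b = [0.6, 0.6]
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]


def lmo(g, radius=1.0):
    i = max(range(len(g)), key=lambda j: abs(g[j]))
    s = [0.0] * len(g)
    s[i] = -radius if g[i] > 0 else radius
    return s


x_ada, L_ada = adaptive_fw(f, grad_f, lmo, x0=[0.0, 0.0])
```

Each accepted step decreases the objective, since the right-hand side of (1) is strictly below $f(x_t)$ whenever the gap is positive.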
- 35. Extension to other FW variants
- 36. Zig-zagging phenomena in FW. The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary. Some FW variants have been developed to address this issue.
- 37. Away-steps FW, informal. The away-steps FW algorithm (Wolfe, 1970; Guélat and Marcotte, 1986) adds the possibility of moving away from an active atom.
- 43. Away-steps FW algorithm

  Keep an active set $S_t$ = active vertices that have been previously selected and have non-zero weight.

  Algorithm 5: Away-Steps FW
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - $v_t \in \arg\max_{v \in S_t} \langle \nabla f(x_t), v \rangle$
    - if $\langle -\nabla f(x_t), s_t - x_t \rangle \ge \langle -\nabla f(x_t), x_t - v_t \rangle$ then
      - $d_t = s_t - x_t$ (FW step)
    - else
      - $d_t = x_t - v_t$ (away step)
    - Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0, \gamma_t^{\max}]} f(x_t + \gamma d_t)$
    - $x_{t+1} = x_t + \gamma_t d_t$
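The direction choice in Algorithm 5 can be sketched as follows. The active set is represented as a dictionary from vertex (a tuple) to its weight, which is an implementation assumption. The maximum step sizes are not stated on the slide; the values used here (1 for a FW step and $\alpha_v / (1 - \alpha_v)$ for an away step, where $\alpha_v$ is the weight of $v_t$) follow Lacoste-Julien and Jaggi (2015). The function name is hypothetical.

```python
def away_step_direction(grad, x, s, active):
    """Pick the FW or away direction and its maximum step size.

    active: dict mapping vertex (tuple) -> weight > 0, with x = sum(weight * vertex).
    Returns (direction, gamma_max, kind).
    """
    dot = lambda a, c: sum(ai * ci for ai, ci in zip(a, c))
    v = max(active, key=lambda a: dot(grad, a))  # worst active vertex
    d_fw = [si - xi for si, xi in zip(s, x)]
    d_away = [xi - vi for xi, vi in zip(x, v)]
    if -dot(grad, d_fw) >= -dot(grad, d_away):
        return d_fw, 1.0, "fw"
    w = active[v]
    # Stepping by w / (1 - w) removes all of the weight on v.
    return d_away, w / (1.0 - w), "away"


# Simplex example: x = 0.2 * e1 + 0.8 * e2, and e1 is a poor vertex for this gradient.
active = {(1.0, 0.0, 0.0): 0.2, (0.0, 1.0, 0.0): 0.8}
d, gamma_max, kind = away_step_direction(
    grad=[2.0, 0.0, 0.5], x=[0.2, 0.8, 0.0], s=[0.0, 1.0, 0.0], active=active)
```

In this example the away step wins, and taking the full step `gamma_max` lands exactly on the vertex `e2`, dropping `e1` from the active set.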
- 44. Pairwise FW

  Key idea: move weight mass between two atoms in each step. Proposed by Lacoste-Julien and Jaggi (2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).

  Algorithm 6: Pairwise FW
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - $v_t \in \arg\max_{s \in S_t} \langle \nabla f(x_t), s \rangle$
    - $d_t = s_t - v_t$
    - Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0, \gamma_t^{\max}]} f(x_t + \gamma d_t)$
    - $x_{t+1} = x_t + \gamma_t d_t$
- 47. Away-steps FW and Pairwise FW

  Convergence of Away-steps and Pairwise FW
  - Linear convergence for strongly convex functions on polytopes (Lacoste-Julien and Jaggi, 2015).
  - Can we design variants with sufficient decrease?

  Introducing Adaptive Away-steps and Adaptive Pairwise: choose $L_t$ such that it satisfies
  $$f(x_t + \gamma_t d_t) \le f(x_t) + \gamma_t \langle \nabla f(x_t), d_t \rangle + \frac{\gamma_t^2 L_t}{2}\|d_t\|^2$$
  with
  $$\gamma_t = \min\left\{ \frac{\langle -\nabla f(x_t), d_t \rangle}{L_t \|d_t\|^2},\ 1 \right\}$$
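- 48. Adaptive Pairwise FW

  Algorithm 7: Adaptive Pairwise FW
  - for $t = 0, 1, \ldots$ do
    - $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
    - $v_t \in \arg\max_{s \in S_t} \langle \nabla f(x_t), s \rangle$
    - $d_t = s_t - v_t$
    - Find $L_t$ that satisfies the sufficient decrease condition (2), with
      $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t), d_t \rangle}{L_t \|d_t\|^2},\ 1 \right\}$
    - $x_{t+1} = x_t + \gamma_t d_t$

  $$f(x_t + \gamma_t d_t) \le f(x_t) + \gamma_t \langle \nabla f(x_t), d_t \rangle + \frac{\gamma_t^2 L_t}{2}\|d_t\|^2 \qquad (2)$$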
- 51. Theory for Adaptive Step-size variants
  - Strongly convex $f$: Pairwise and Away-steps converge linearly on a polytope. For each "good step" we have
    $$f(x_{t+1}) - f(x^\star) \le (1 - \rho)\,(f(x_t) - f(x^\star))$$
  - Convex $f$: for all FW variants,
    $$f(x_t) - f(x^\star) \le O(1/t)$$
  - Non-convex $f$: for all FW variants,
    $$\max_{s \in D} \langle \nabla f(x_t), x_t - s \rangle \le O(1/\sqrt{t})$$
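The quantity bounded in the non-convex case, $\max_{s \in D} \langle \nabla f(x_t), x_t - s \rangle$, is the Frank-Wolfe gap, and it comes for free from the linear subproblem. As an illustrative sketch for an $\ell_1$ ball (a domain chosen here as an assumption, with a hypothetical function name), it has the closed form $\langle \nabla f(x), x \rangle + \alpha \|\nabla f(x)\|_\infty$.

```python
def fw_gap_l1(grad, x, radius):
    """Frank-Wolfe gap max_{||s||_1 <= radius} <grad, x - s> in closed form."""
    inner = sum(gi * xi for gi, xi in zip(grad, x))
    return inner + radius * max(abs(gi) for gi in grad)


# f(x) = 0.5 * ||x - b||^2 with b = [3.0, 0.5, -0.2], unit l1 ball (toy problem).
gap_start = fw_gap_l1([-3.0, -0.5, 0.2], [0.0, 0.0, 0.0], 1.0)  # at x = 0
gap_opt = fw_gap_l1([-2.0, -0.5, 0.2], [1.0, 0.0, 0.0], 1.0)    # at the solution
```

The gap is non-negative on the domain and vanishes at a stationary point, so it doubles as a stopping criterion.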
- 52. Experiments
- 53. Experiments: RCV1

  Problem: $\ell_1$-constrained logistic regression
  $$\arg\min_{\|x\|_1 \le \alpha} \frac{1}{n}\sum_{i=1}^{n} \varphi(a_i^T x, b_i), \quad \text{with } \varphi = \text{logistic loss.}$$

  | Dataset | dimension | density | $L_t / L$ |
  |---|---|---|---|
  | RCV1 | 47236 | $10^{-3}$ | $1.3 \times 10^{-2}$ |

  [Figure: objective minus optimum (log scale) versus time in seconds, for $\ell_1$ ball radius = 100, 200, and 300. Methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
- 54. Experiments: Madelon

  Problem: $\ell_1$-constrained logistic regression
  $$\arg\min_{\|x\|_1 \le \alpha} \frac{1}{n}\sum_{i=1}^{n} \varphi(a_i^T x, b_i), \quad \text{with } \varphi = \text{logistic loss.}$$

  | Dataset | dimension | density | $L_t / L$ |
  |---|---|---|---|
  | Madelon | 500 | 1. | $3.3 \times 10^{-3}$ |

  [Figure: objective minus optimum (log scale) versus time in seconds, for $\ell_1$ ball radius = 13, 20, and 30. Methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
- 55. Experiments: MovieLens 1M

  Problem: trace-norm constrained robust matrix completion
  $$\arg\min_{\|X\|_* \le \alpha} \frac{1}{|B|}\sum_{(i,j) \in B} h(X_{i,j}, A_{i,j}), \quad \text{with } h = \text{Huber loss.}$$

  | Dataset | dimension | density | $L_t / L$ |
  |---|---|---|---|
  | MovieLens 1M | 22,393,987 | 0.04 | $1.1 \times 10^{-2}$ |

  [Figure: objective minus optimum (log scale) versus time in seconds, for trace ball radius = 300, 350, and 400. Methods: Adaptive FW, FW, D-FW.]
- 57. Proximal Splitting

  Building a quadratic upper bound is common in proximal gradient descent (Beck and Teboulle, 2009; Nesterov, 2013). Recently extended to the Davis-Yin three operator splitting [6], for objectives of the form
  $$f(x) + g(x) + h(x)$$
  with access to $\nabla f$, $\mathrm{prox}_{\gamma g}$, $\mathrm{prox}_{\gamma h}$.

  Key insight: verify a sufficient decrease condition of the form
  $$f(x_{t+1}) \le f(z_t) + \langle \nabla f(z_t), x_{t+1} - z_t \rangle + \frac{1}{2\gamma_t}\|x_{t+1} - z_t\|^2$$

  [6] Fabian Pedregosa and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". In: Proceedings of the 35th International Conference on Machine Learning.
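The same sufficient decrease test is easiest to see in the two-operator case. The sketch below is plain proximal gradient descent with backtracking, not the adaptive three operator splitting method itself. The halving rule, the function names, and the toy lasso-type problem are assumptions made for illustration.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return [math.copysign(max(abs(vi) - t, 0.0), vi) for vi in v]


def prox_grad_backtracking(f, grad_f, prox, x0, step=4.0, n_iter=50, max_backtrack=60):
    """Proximal gradient descent; the step is halved until sufficient decrease holds."""
    x = list(x0)
    for _ in range(n_iter):
        g, fx = grad_f(x), f(x)
        for _bt in range(max_backtrack):
            x_next = prox([xi - step * gi for xi, gi in zip(x, g)], step)
            diff = [a - c for a, c in zip(x_next, x)]
            bound = (fx + sum(gi * di for gi, di in zip(g, diff))
                     + sum(di * di for di in diff) / (2.0 * step))
            if f(x_next) <= bound:
                break  # quadratic upper bound verified at x_next
            step *= 0.5
        x = x_next
    return x, step


# Toy problem (assumption): 0.5 * ||x - b||^2 + lam * ||x||_1.
b, lam = [3.0, -0.5, 1.0], 1.0
f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
grad_f = lambda x: [xi - bi for xi, bi in zip(x, b)]
prox = lambda v, t: soft_threshold(v, lam * t)

x_pg, step_pg = prox_grad_backtracking(f, grad_f, prox, [0.0, 0.0, 0.0])
```

Starting from a deliberately large step of 4, the test rejects 4 and 2 and accepts 1, which matches $1/L$ for this objective.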
- 58. Nearly-isotonic penalty

  Problem:
  $$\arg\min_{x}\ \mathrm{loss}(x) + \lambda \sum_{i=1}^{p-1} \max\{x_i - x_{i+1}, 0\}$$

  [Figure, top row: estimated coefficients versus ground truth (coefficient magnitude) for four parameter values, $10^{-6}$, $10^{-3}$, 0.01, and 0.1. Bottom row: objective minus optimum (log scale) versus time in seconds for each setting. Methods: Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
- 59. Overlapping group lasso penalty

  Problem:
  $$\arg\min_{x}\ \mathrm{loss}(x) + \lambda \sum_{g \in G} \|[x]_g\|_2$$

  [Figure, top row: estimated coefficients versus ground truth (coefficient magnitude) for four parameter values, $10^{-6}$, $10^{-3}$, 0.01, and 0.1. Bottom row: objective minus optimum (log scale) versus time in seconds for each setting. Methods: Adaptive TOS (variant 1), Adaptive TOS (variant 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
- 60. Perspectives
- 62. Stochastic optimization

  Problem:
  $$\arg\min_{x \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} f_i(x)$$

  Heuristic from [7] to estimate $L$ by verifying at each iteration $t$
  $$f_i\!\left(x_t - \tfrac{1}{L}\nabla f_i(x_t)\right) \le f_i(x_t) - \frac{1}{2L}\|\nabla f_i(x_t)\|^2$$
  with $i$ a random index sampled at iteration $t$.

  This is the L-smooth inequality with $y = x_t - \frac{1}{L}\nabla f_i(x_t)$, $x = x_t$.

  [7] Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming.
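The check above can be turned into a small routine that raises the estimate of $L$ until the inequality holds for the sampled $f_i$. Doubling as the update rule is an assumption of this sketch, as are the function name and the one-dimensional example; the slide only states the inequality being verified.

```python
def estimate_lipschitz(f_i, grad_i, x, L, max_doublings=50):
    """Increase L until f_i(x - g/L) <= f_i(x) - ||g||^2 / (2L) for the sampled f_i."""
    g = grad_i(x)
    g_norm_sq = sum(gi * gi for gi in g)
    if g_norm_sq == 0.0:
        return L  # the test carries no information at a stationary point of f_i
    fx = f_i(x)
    for _ in range(max_doublings):
        x_trial = [xi - gi / L for xi, gi in zip(x, g)]
        if f_i(x_trial) <= fx - g_norm_sq / (2.0 * L):
            break
        L *= 2.0
    return L


# One sampled term f_i(x) = 4 * (x - 1)^2, whose gradient is 8-Lipschitz.
f_i = lambda x: 4.0 * (x[0] - 1.0) ** 2
grad_i = lambda x: [8.0 * (x[0] - 1.0)]

L_hat = estimate_lipschitz(f_i, grad_i, x=[0.0], L=1.0)
print(L_hat)  # 8.0
```

Only one extra evaluation of $f_i$ is needed per trial value of $L$, so the check stays cheap relative to a full pass over the data.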
- 64. Experiments: stochastic line search. Can we prove convergence for such (or similar) stochastic adaptive step-sizes?
- 67. Conclusion
  - Sufficient decrease condition to set the step-size in FW and variants.
  - (Mostly) hyperparameter-free, adaptivity to local geometry.
  - Applications in proximal splitting and stochastic optimization.

  Thanks for your attention.
- 68. References
  - Beck, Amir and Marc Teboulle (2009). "Gradient-based algorithms with applications to signal recovery". In: Convex optimization in signal processing and communications.
  - Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
  - Guélat, Jacques and Patrice Marcotte (1986). "Some comments on Wolfe's away step". In: Mathematical Programming.
  - Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d'Aspremont (2018). "Frank-Wolfe with Subsampling Oracle". In: Proceedings of the 35th International Conference on Machine Learning.
  - Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Advances in Neural Information Processing Systems.
  - Mitchell, B. F., Vladimir Fedorovich Demyanov, and V. N. Malozemov (1974). "Finding the point of a polyhedron closest to the origin". In: SIAM Journal on Control.
- 69. References (continued)
  - Nesterov, Yu (2013). "Gradient methods for minimizing composite functions". In: Mathematical Programming.
  - Niculae, Vlad et al. (2018). "SparseMAP: Differentiable Sparse Structured Inference". In: International Conference on Machine Learning.
  - Pedregosa, Fabian et al. (2018). "Step-Size Adaptivity in Projection-Free Optimization". In: Submitted.
  - Pedregosa, Fabian and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". In: Proceedings of the 35th International Conference on Machine Learning.
  - Ping, Wei, Qiang Liu, and Alexander T. Ihler (2016). "Learning Infinite RBMs with Frank-Wolfe". In: Advances in Neural Information Processing Systems.
  - Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming.
  - Wolfe, Philip (1970). "Convergence theory in nonlinear programming". In: Integer and nonlinear programming.