Sufficient decrease is all you need
A simple condition to forget about the step-size, with applications to the Frank-Wolfe algorithm.
Fabian Pedregosa
June 4th, 2018. Google Brain Montreal
Where I Come From
ML/Optimization/Software Guy

Engineer (2010–2012)
First contact with ML: developing an ML library (scikit-learn).

ML and Neuroscience (2012–2015)
PhD applying ML to neuroscience.

ML and Optimization (2015–)
Stochastic, parallel, constrained, hyperparameter optimization.
Outline
Motivation: eliminate the step-size parameter.
1. Frank-Wolfe. A method for constrained optimization.
2. Adaptive Frank-Wolfe. Frank-Wolfe without the step-size.
3. Perspectives. Other applications: proximal splitting, stochastic optimization.

With a little help from my collaborators
Armin Askari (UC Berkeley), Geoffrey Négiar (UC Berkeley), Martin Jaggi (EPFL), Gauthier Gidel (UdeM)
The Frank-Wolfe algorithm
The Frank-Wolfe (FW) algorithm, aka conditional gradient

Problem: smooth $f$, compact $\mathcal{D}$

$$\arg\min_{x \in \mathcal{D}} f(x)$$

Algorithm 1: Frank-Wolfe (FW)
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0,1]} f((1-\gamma)x_t + \gamma s_t)$
4   $x_{t+1} = (1-\gamma_t)\, x_t + \gamma_t s_t$
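To make the template concrete, here is a minimal Python sketch of Algorithm 1 over the ℓ1 ball, where the linear subproblem has a closed-form solution at a signed extreme point. The names (`lmo_l1`, `f_grad`) and the default 2/(t+2) step are illustrative choices, not code from the talk.

```python
import numpy as np

def lmo_l1(grad, alpha):
    """Linear minimization oracle for the l1 ball of radius alpha:
    argmin_{||s||_1 <= alpha} <grad, s> is attained at a signed extreme point."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad, dtype=float)
    s[i] = -alpha * np.sign(grad[i])
    return s

def frank_wolfe(f_grad, x0, alpha, max_iter=100):
    """Vanilla FW (Algorithm 1) with the default gamma_t = 2 / (t + 2) step."""
    x = x0.astype(float).copy()
    for t in range(max_iter):
        s = lmo_l1(f_grad(x), alpha)       # s_t = argmin_{s in D} <grad f(x_t), s>
        gamma = 2.0 / (t + 2.0)            # convergent but slow default step-size
        x = (1.0 - gamma) * x + gamma * s  # convex combination: x stays in D
    return x
```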
Why people ♥ Frank-Wolfe
• Projection-free. Only linear subproblems arise, vs. quadratic subproblems for projection.
• The solution of the linear subproblem is always an extremal element of $\mathcal{D}$.
• Iterates admit a sparse representation: $x_t$ is a convex combination of at most $t$ elements.
Recent applications of Frank-Wolfe
• Learning the structure of a neural network.¹
• Attention mechanisms that enforce sparsity.²
• ℓ1-constrained problems with an extreme number of features.³

¹ Wei Ping, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
² Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
³ Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
A practical issue
• Line-search is only efficient when a closed form exists (quadratic objective).
• The step-size $\gamma_t = 2/(t+2)$ is convergent, but extremely slow.

Algorithm 2: Frank-Wolfe (FW)
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0,1]} f((1-\gamma)x_t + \gamma s_t)$
4   $x_{t+1} = (1-\gamma_t)\, x_t + \gamma_t s_t$

Can we do better?
A sufficient decrease condition
Down the citation rabbit hole
Vladimir Demyanov · Aleksandr Rubinov⁴

⁴ Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
The Demyanov-Rubinov (DR) Frank-Wolfe variant

Problem: smooth objective, compact domain

$$\arg\min_{x \in \mathcal{D}} f(x), \quad \text{where } f \text{ is } L\text{-smooth}$$

($L$-smooth ≡ differentiable with $L$-Lipschitz gradient.)

• The step-size depends on the correlation between $-\nabla f(x_t)$ and the descent direction $s_t - x_t$.

Algorithm 3: FW, DR variant
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t),\, s_t - x_t \rangle}{L\,\|s_t - x_t\|^2},\; 1 \right\}$
4   $x_{t+1} = (1-\gamma_t)\, x_t + \gamma_t s_t$
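In code, the DR step-size is essentially a one-liner. A minimal sketch, assuming the Euclidean norm, a known global constant L, and that `s` comes from the linear minimization oracle (so the numerator is non-negative):

```python
def dr_step_size(grad, x, s, L):
    """Demyanov-Rubinov step-size: closed-form minimizer over [0, 1] of the
    quadratic upper bound along the segment from x_t to s_t."""
    d = s - x                      # descent direction s_t - x_t
    denom = L * d.dot(d)
    if denom == 0.0:               # s_t = x_t: nothing to move, stay put
        return 0.0
    return min(-grad.dot(d) / denom, 1.0)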
The Demyanov-Rubinov (DR) Frank-Wolfe variant

Where does $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t),\, s_t - x_t \rangle}{L\,\|s_t - x_t\|^2},\; 1 \right\}$ come from?

L-smooth inequality
Any $L$-smooth function $f$ verifies
$$f(y) \le \underbrace{f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\,\|x - y\|^2}_{:=\, Q_x(y)}$$
for all $x, y$ in the domain.

• The right-hand side is a quadratic upper bound: $Q_x(y) \ge f(y)$.
Justification of the step-size
• The L-smooth inequality at $y = x_{t+1}(\gamma) = (1-\gamma)x_t + \gamma s_t$, $x = x_t$ gives
$$f(x_{t+1}(\gamma)) \le f(x_t) + \gamma\,\langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma^2 L}{2}\,\|s_t - x_t\|^2$$
• Minimizing the right-hand side over $\gamma \in [0, 1]$ gives
$$\gamma = \min\left\{ \frac{\langle -\nabla f(x_t),\, s_t - x_t \rangle}{L\,\|s_t - x_t\|^2},\; 1 \right\}$$
= the Demyanov-Rubinov step-size!
• ≡ exact line-search on the quadratic upper bound.
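Spelling out that minimization step: the right-hand side is a convex quadratic in $\gamma$, so setting its derivative to zero gives the unconstrained minimizer, which is then clipped to $[0, 1]$:

$$\frac{\mathrm{d}}{\mathrm{d}\gamma}\left[f(x_t) + \gamma\,\langle \nabla f(x_t), s_t - x_t\rangle + \frac{\gamma^2 L}{2}\,\|s_t - x_t\|^2\right] = \langle \nabla f(x_t), s_t - x_t\rangle + \gamma L\,\|s_t - x_t\|^2 = 0 \;\Longrightarrow\; \gamma^\star = \frac{\langle -\nabla f(x_t),\, s_t - x_t\rangle}{L\,\|s_t - x_t\|^2}.$$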
Towards an Adaptive FW

Quadratic upper bound
The Demyanov-Rubinov variant makes use of a quadratic upper bound, but it is only evaluated at $x_t$, $x_{t+1}$.

Sufficient decrease is all you need
The L-smooth inequality can be replaced by
$$f(x_{t+1}) \le f(x_t) + \gamma_t\,\langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma_t^2 L_t}{2}\,\|s_t - x_t\|^2$$
with $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t),\, s_t - x_t \rangle}{L_t\,\|s_t - x_t\|^2},\; 1 \right\}$

Key difference with DR: $L$ is replaced by $L_t$. Potentially $L_t \ll L$.
The Adaptive FW algorithm
New FW variant with adaptive step-size.⁵

Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW)
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   Find $L_t$ that verifies the sufficient decrease condition (1), with
4   $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t),\, s_t - x_t \rangle}{L_t\,\|s_t - x_t\|^2},\; 1 \right\}$
5   $x_{t+1} = (1-\gamma_t)\, x_t + \gamma_t s_t$

$$f(x_{t+1}) \le f(x_t) + \gamma_t\,\langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma_t^2 L_t}{2}\,\|s_t - x_t\|^2 \tag{1}$$

⁵ Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
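A minimal Python sketch of one way to search for $L_t$ by backtracking on condition (1). The shrink/grow factors (0.9 and 2) are illustrative defaults, not necessarily the paper's exact procedure; the loop terminates because condition (1) always holds once $L_t \ge L$.

```python
def adafw_step(f, grad, x, s, Lt):
    """One AdaFW update: find an L_t satisfying condition (1), then take the
    corresponding step. Returns (x_next, Lt) so the estimate can be reused."""
    d = s - x
    g = -grad.dot(d)               # <-grad f(x_t), s_t - x_t>, >= 0 for an LMO output
    dd = d.dot(d)
    if g <= 0.0 or dd == 0.0:      # zero FW gap: x_t is already stationary
        return x, Lt
    fx = f(x)
    Lt *= 0.9                      # optimistically try a smaller local constant
    while True:
        gamma = min(g / (Lt * dd), 1.0)
        x_next = x + gamma * d
        # sufficient decrease (1): f(x_next) <= f(x) - gamma*g + gamma^2 * Lt * dd / 2
        if f(x_next) <= fx - gamma * g + 0.5 * gamma**2 * Lt * dd:
            return x_next, Lt
        Lt *= 2.0                  # condition violated: double L_t and retry
```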
The Adaptive FW algorithm

(Figure: the quadratic model $f(x_t) + \gamma\,\langle \nabla f(x_t), s_t - x_t \rangle + \frac{\gamma^2 L_t}{2}\|s_t - x_t\|^2$ upper-bounds $f((1-\gamma)x_t + \gamma s_t)$ over $\gamma \in [0, 1]$; $\gamma_t$ marks its minimizer.)

• Worst case, $L_t = L$. Often $L_t \ll L$ ⟹ larger step-sizes.
• Adaptivity to the local geometry.
• Two extra function evaluations per iteration. Often given as a byproduct of the gradient computation.
Extension to other FW variants
Zig-zagging phenomenon in FW
The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary.
Some FW variants have been developed to address this issue.
Away-steps FW, informal
The Away-steps FW algorithm (Wolfe, 1970) (Guélat and Marcotte, 1986) adds the possibility to move away from an active atom.
Away-steps FW algorithm
Keep an active set $S_t$ = the vertices that have been previously selected and have non-zero weight.

Algorithm 5: Away-steps FW
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $v_t \in \arg\max_{v \in S_t} \langle \nabla f(x_t), v \rangle$
4   if $\langle -\nabla f(x_t), s_t - x_t \rangle \ge \langle -\nabla f(x_t), x_t - v_t \rangle$ then
5     $d_t = s_t - x_t$  (FW step)
6   else
7     $d_t = x_t - v_t$  (away step)
8   Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0, \gamma_t^{\max}]} f(x_t + \gamma d_t)$
9   $x_{t+1} = x_t + \gamma_t d_t$
Pairwise FW

Key idea
Move weight mass between two atoms in each step.
Proposed by (Lacoste-Julien and Jaggi, 2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).

Algorithm 6: Pairwise FW
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $v_t \in \arg\max_{s \in S_t} \langle \nabla f(x_t), s \rangle$
4   $d_t = s_t - v_t$
5   Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0, \gamma_t^{\max}]} f(x_t + \gamma d_t)$
6   $x_{t+1} = x_t + \gamma_t d_t$
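Relative to vanilla FW, the new ingredients in both variants are the active-set bookkeeping and the direction choice. A hedged sketch of the direction selection only, representing the active set as a dict from vertex index to weight; the `vertices` array and the weight-update details are assumptions, and the cap $\gamma_t^{\max}$ keeps every weight non-negative.

```python
def choose_direction(grad, x, s, weights, vertices, pairwise=False):
    """Direction selection shared by Away-steps and Pairwise FW.
    weights: dict {vertex index: weight > 0} encoding the active set S_t."""
    # worst active atom: v_t = argmax_{v in S_t} <grad f(x_t), v>
    v_idx = max(weights, key=lambda i: grad.dot(vertices[i]))
    v = vertices[v_idx]
    if pairwise:
        d = s - v                            # shift mass from v_t onto s_t
        gamma_max = weights[v_idx]           # at most the weight carried by v_t
    elif -grad.dot(s - x) >= -grad.dot(x - v):
        d = s - x                            # FW step, as in vanilla FW
        gamma_max = 1.0
    else:
        d = x - v                            # away step, move away from v_t
        a = weights[v_idx]                   # with an exact LMO, a < 1 here
        gamma_max = a / (1.0 - a)            # step at which v_t's weight hits 0
    return d, gamma_max
```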
Away-steps FW and Pairwise FW

Convergence of Away-steps and Pairwise FW
• Linear convergence for strongly convex functions on polytopes (Lacoste-Julien and Jaggi, 2015).
• Can we design variants with sufficient decrease?

Introducing Adaptive Away-steps and Adaptive Pairwise
Choose $L_t$ such that it verifies
$$f(x_t + \gamma_t d_t) \le f(x_t) + \gamma_t\,\langle \nabla f(x_t), d_t \rangle + \frac{\gamma_t^2 L_t}{2}\,\|d_t\|^2$$
with $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t),\, d_t \rangle}{L_t\,\|d_t\|^2},\; 1 \right\}$
Adaptive Pairwise FW

Algorithm 7: Adaptive Pairwise FW
1 for t = 0, 1, . . . do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $v_t \in \arg\max_{s \in S_t} \langle \nabla f(x_t), s \rangle$
4   $d_t = s_t - v_t$
5   Find $L_t$ that verifies the sufficient decrease condition (2), with
6   $\gamma_t = \min\left\{ \dfrac{\langle -\nabla f(x_t),\, d_t \rangle}{L_t\,\|d_t\|^2},\; 1 \right\}$
7   $x_{t+1} = x_t + \gamma_t d_t$

$$f(x_t + \gamma_t d_t) \le f(x_t) + \gamma_t\,\langle \nabla f(x_t), d_t \rangle + \frac{\gamma_t^2 L_t}{2}\,\|d_t\|^2 \tag{2}$$
Theory for Adaptive Step-size variants

Strongly convex f
Pairwise and Away-steps converge linearly on a polytope. For each “good step” we have:
$$f(x_{t+1}) - f(x^\star) \le (1 - \rho)\,\big(f(x_t) - f(x^\star)\big)$$

Convex f
For all FW variants, $f(x_t) - f(x^\star) \le \mathcal{O}(1/t)$

Non-convex f
For all FW variants, $\max_{s \in \mathcal{D}} \langle \nabla f(x_t), x_t - s \rangle \le \mathcal{O}(1/\sqrt{t})$
Experiments
Experiments: RCV1
Problem: ℓ1-constrained logistic regression
$$\arg\min_{\|x\|_1 \le \alpha} \frac{1}{n} \sum_{i=1}^{n} \varphi(a_i^T x, b_i) \quad \text{with } \varphi = \text{logistic loss.}$$

Dataset   dimension   density   L_t/L
RCV1      47,236      10^-3     1.3 × 10^-2

(Figure: objective minus optimum vs. time in seconds, for ℓ1 ball radius 100, 200 and 300. Methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.)
Experiments: Madelon
Problem: ℓ1-constrained logistic regression
$$\arg\min_{\|x\|_1 \le \alpha} \frac{1}{n} \sum_{i=1}^{n} \varphi(a_i^T x, b_i) \quad \text{with } \varphi = \text{logistic loss.}$$

Dataset   dimension   density   L_t/L
Madelon   500         1.        3.3 × 10^-3

(Figure: objective minus optimum vs. time in seconds, for ℓ1 ball radius 13, 20 and 30. Methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.)
Experiments: MovieLens 1M
Problem: trace-norm-constrained robust matrix completion
$$\arg\min_{\|X\|_* \le \alpha} \frac{1}{|B|} \sum_{(i,j) \in B} h(X_{i,j}, A_{i,j}) \quad \text{with } h = \text{Huber loss.}$$

Dataset        dimension    density   L_t/L
MovieLens 1M   22,393,987   0.04      1.1 × 10^-2

(Figure: objective minus optimum vs. time in seconds, for trace ball radius 300, 350 and 400. Methods: Adaptive FW, FW, D-FW.)
Other applications
Proximal Splitting
Building a quadratic upper bound is common in proximal gradient descent (Beck and Teboulle, 2009) (Nesterov, 2013).
Recently extended to the Davis-Yin three operator splitting⁶ for problems of the form
$$f(x) + g(x) + h(x)$$
with access to $\nabla f$, $\operatorname{prox}_{\gamma g}$, $\operatorname{prox}_{\gamma h}$.
Key insight: verify a sufficient decrease condition of the form
$$f(x_{t+1}) \le f(z_t) + \langle \nabla f(z_t), x_{t+1} - z_t \rangle + \frac{1}{2\gamma_t}\,\|x_{t+1} - z_t\|^2$$

⁶ Fabian Pedregosa and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
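For intuition, a minimal sketch of the same style of test in the simpler two-term setting f + g (plain proximal gradient, not the full three-operator splitting). A routine `prox_g(v, gamma)` computing $\operatorname{prox}_{\gamma g}(v)$ is assumed given, and the 1.1/0.5 growth and shrink factors are illustrative.

```python
import numpy as np

def adaptive_prox_grad(f, f_grad, prox_g, x0, gamma=1.0, max_iter=100):
    """Proximal gradient on f + g with a backtracked step-size gamma_t,
    accepted when f(x+) <= f(x) + <grad f(x), x+ - x> + ||x+ - x||^2 / (2 gamma)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx, grad = f(x), f_grad(x)
        gamma *= 1.1                          # optimistic growth, mirrors L_t <= L
        while True:
            x_next = prox_g(x - gamma * grad, gamma)
            diff = x_next - x
            if f(x_next) <= fx + grad.dot(diff) + diff.dot(diff) / (2.0 * gamma):
                break                         # sufficient decrease holds
            gamma *= 0.5                      # condition failed: shrink the step
        x = x_next
    return x
```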
Nearly-isotonic penalty

Problem
$$\arg\min_x \; \text{loss}(x) + \lambda \sum_{i=1}^{p-1} \max\{x_i - x_{i+1},\, 0\}$$

(Figure, top row: estimated coefficients vs. ground truth for λ = 10⁻⁶, 10⁻³, 0.01, 0.1. Bottom row: objective minus optimum vs. time in seconds for each λ. Methods: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.)
Overlapping group lasso penalty

Problem
$$\arg\min_x \; \text{loss}(x) + \lambda \sum_{g \in \mathcal{G}} \|[x]_g\|_2$$

(Figure, top row: estimated coefficients vs. ground truth for λ = 10⁻⁶, 10⁻³, 0.01, 0.1. Bottom row: objective minus optimum vs. time in seconds for each λ. Methods: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.)
Perspectives
Stochastic optimization

Problem
$$\arg\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$

Heuristic from⁷ to estimate L by verifying at each iteration t
$$f_i\Big(x_t - \tfrac{1}{L}\nabla f_i(x_t)\Big) \le f_i(x_t) - \frac{1}{2L}\,\|\nabla f_i(x_t)\|^2$$
with $i$ a random index sampled at iteration $t$.
This is the L-smooth inequality with $y = x_t - \frac{1}{L}\nabla f_i(x_t)$, $x = x_t$.

⁷ Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
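A minimal sketch of this heuristic, assuming oracles `f_i(i, x)` and `grad_i(i, x)` for the individual losses. Doubling L on violation is the usual backtracking choice; the SAG paper additionally lets the estimate decrease slowly between iterations, which is omitted here.

```python
import numpy as np

def update_L(f_i, grad_i, x, L, n, rng):
    """One step of the stochastic Lipschitz estimate: sample an index i and
    double L until f_i(x - g/L) <= f_i(x) - ||g||^2 / (2L) holds."""
    i = rng.integers(n)                # random index sampled at this iteration
    g = grad_i(i, x)
    sq = g.dot(g)
    while f_i(i, x - g / L) > f_i(i, x) - sq / (2.0 * L):
        L *= 2.0                       # inequality violated: estimate too small
    return L

# usage sketch: L = update_L(f_i, grad_i, x, L, n, np.random.default_rng(0))
```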
Experiments: stochastic line search
Can we prove convergence for such (or similar) stochastic adaptive step-sizes?
Conclusion
• A sufficient decrease condition to set the step-size in FW and its variants.
• (Mostly) hyperparameter-free, adaptive to the local geometry.
• Applications in proximal splitting and stochastic optimization.

Thanks for your attention
References
Beck, Amir and Marc Teboulle (2009). “Gradient-based algorithms with applications to signal recovery”. In: Convex optimization in signal processing and communications.
Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian).
Guélat, Jacques and Patrice Marcotte (1986). “Some comments on Wolfe’s away step”. In: Mathematical Programming.
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank-Wolfe optimization variants”. In: Advances in Neural Information Processing Systems.
Mitchell, B. F., Vladimir Fedorovich Demyanov, and V. N. Malozemov (1974). “Finding the point of a polyhedron closest to the origin”. In: SIAM Journal on Control.
Nesterov, Yu (2013). “Gradient methods for minimizing composite functions”. In: Mathematical Programming.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
Pedregosa, Fabian et al. (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted.
Pedregosa, Fabian and Gauthier Gidel (2018). “Adaptive Three Operator Splitting”. In: Proceedings of the 35th International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). “Minimizing finite sums with the stochastic average gradient”. In: Mathematical Programming.
Wolfe, Philip (1970). “Convergence theory in nonlinear programming”. In: Integer and nonlinear programming.

More Related Content

What's hot

fourier series and fourier transform
fourier series and fourier transformfourier series and fourier transform
fourier series and fourier transform
Vikas Rathod
 
Properties of fourier transform
Properties of fourier transformProperties of fourier transform
Properties of fourier transform
Nisarg Amin
 
fourier representation of signal and systems
fourier representation of signal and systemsfourier representation of signal and systems
fourier representation of signal and systems
Sugeng Widodo
 
002 ray modeling dynamic systems
002 ray modeling dynamic systems002 ray modeling dynamic systems
002 ray modeling dynamic systems
Institute of Technology Telkom
 
Fourier transformation
Fourier transformationFourier transformation
Fourier transformation
zertux
 
Optics Fourier Transform Ii
Optics Fourier Transform IiOptics Fourier Transform Ii
Optics Fourier Transform Ii
diarmseven
 
Fourier transforms
Fourier transformsFourier transforms
Fourier transforms
kalung0313
 
Presentation on fourier transformation
Presentation on fourier transformationPresentation on fourier transformation
Presentation on fourier transformation
Wasim Shah
 
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Alessandro Palmeri
 
Fourier transform
Fourier transformFourier transform
Fourier transform
Solo Hermelin
 
Fourier series and applications of fourier transform
Fourier series and applications of fourier transformFourier series and applications of fourier transform
Fourier series and applications of fourier transform
Krishna Jangid
 
The FFT And Spectral Analysis
The FFT And Spectral AnalysisThe FFT And Spectral Analysis
The FFT And Spectral Analysis
Athanasios Anastasiou
 
Lecture8 Signal and Systems
Lecture8 Signal and SystemsLecture8 Signal and Systems
Lecture8 Signal and Systems
babak danyal
 
Signal Processing Introduction using Fourier Transforms
Signal Processing Introduction using Fourier TransformsSignal Processing Introduction using Fourier Transforms
Signal Processing Introduction using Fourier Transforms
Arvind Devaraj
 
Fourier Transform
Fourier TransformFourier Transform
Fourier Transform
Nidhi Baranwal
 
fourier transforms
fourier transformsfourier transforms
fourier transforms
Umang Gupta
 
Dsp final
Dsp finalDsp final
Fourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time SignalsFourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time Signals
Jayanshu Gundaniya
 

What's hot (18)

fourier series and fourier transform
fourier series and fourier transformfourier series and fourier transform
fourier series and fourier transform
 
Properties of fourier transform
Properties of fourier transformProperties of fourier transform
Properties of fourier transform
 
fourier representation of signal and systems
fourier representation of signal and systemsfourier representation of signal and systems
fourier representation of signal and systems
 
002 ray modeling dynamic systems
002 ray modeling dynamic systems002 ray modeling dynamic systems
002 ray modeling dynamic systems
 
Fourier transformation
Fourier transformationFourier transformation
Fourier transformation
 
Optics Fourier Transform Ii
Optics Fourier Transform IiOptics Fourier Transform Ii
Optics Fourier Transform Ii
 
Fourier transforms
Fourier transformsFourier transforms
Fourier transforms
 
Presentation on fourier transformation
Presentation on fourier transformationPresentation on fourier transformation
Presentation on fourier transformation
 
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
Toward an Improved Computational Strategy for Vibration-Proof Structures Equi...
 
Fourier transform
Fourier transformFourier transform
Fourier transform
 
Fourier series and applications of fourier transform
Fourier series and applications of fourier transformFourier series and applications of fourier transform
Fourier series and applications of fourier transform
 
The FFT And Spectral Analysis
The FFT And Spectral AnalysisThe FFT And Spectral Analysis
The FFT And Spectral Analysis
 
Lecture8 Signal and Systems
Lecture8 Signal and SystemsLecture8 Signal and Systems
Lecture8 Signal and Systems
 
Signal Processing Introduction using Fourier Transforms
Signal Processing Introduction using Fourier TransformsSignal Processing Introduction using Fourier Transforms
Signal Processing Introduction using Fourier Transforms
 
Fourier Transform
Fourier TransformFourier Transform
Fourier Transform
 
fourier transforms
fourier transformsfourier transforms
fourier transforms
 
Dsp final
Dsp finalDsp final
Dsp final
 
Fourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time SignalsFourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time Signals
 

Similar to Sufficient decrease is all you need

Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator Splitting
Fabian Pedregosa
 
LÍMITES Y DERIVADAS aplicados a ingenieria
LÍMITES Y DERIVADAS aplicados a ingenieriaLÍMITES Y DERIVADAS aplicados a ingenieria
LÍMITES Y DERIVADAS aplicados a ingenieria
Gonzalo Fernandez
 
1531 fourier series- integrals and trans
1531 fourier series- integrals and trans1531 fourier series- integrals and trans
1531 fourier series- integrals and trans
Dr Fereidoun Dejahang
 
Modal Analysis Basic Theory
Modal Analysis Basic TheoryModal Analysis Basic Theory
Modal Analysis Basic Theory
YuanCheng38
 
Quantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averagesQuantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averages
VjekoslavKovac1
 
Limit and continuity
Limit and continuityLimit and continuity
Limit and continuity
Digvijaysinh Gohil
 
Recursive Compressed Sensing
Recursive Compressed SensingRecursive Compressed Sensing
Recursive Compressed Sensing
Pantelis Sopasakis
 
Gradient_Descent_Unconstrained.pdf
Gradient_Descent_Unconstrained.pdfGradient_Descent_Unconstrained.pdf
Gradient_Descent_Unconstrained.pdf
MTrang34
 
Ece3075 a 8
Ece3075 a 8Ece3075 a 8
Ece3075 a 8
Aiman Malik
 
Computational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial ModelingComputational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial Modeling
Victor Zhorin
 
QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...
QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...
QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...
The Statistical and Applied Mathematical Sciences Institute
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
Pierre Jacob
 
functions limits and continuity
functions limits and continuityfunctions limits and continuity
functions limits and continuity
Pume Ananda
 
Stochastic Alternating Direction Method of Multipliers
Stochastic Alternating Direction Method of MultipliersStochastic Alternating Direction Method of Multipliers
Stochastic Alternating Direction Method of Multipliers
Taiji Suzuki
 
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
Tim Reis
 
Functions limits and continuity
Functions limits and continuityFunctions limits and continuity
Functions limits and continuity
sudersana viswanathan
 
Matrix calculus
Matrix calculusMatrix calculus
Matrix calculus
Sungbin Lim
 
Singlevaropt
SinglevaroptSinglevaropt
Singlevaropt
sheetslibrary
 
DIGITAL IMAGE PROCESSING - Day 4 Image Transform
DIGITAL IMAGE PROCESSING - Day 4 Image TransformDIGITAL IMAGE PROCESSING - Day 4 Image Transform
DIGITAL IMAGE PROCESSING - Day 4 Image Transform
vijayanand Kandaswamy
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Sean Meyn
 

Similar to Sufficient decrease is all you need (20)

Adaptive Three Operator Splitting
Adaptive Three Operator SplittingAdaptive Three Operator Splitting
Adaptive Three Operator Splitting
 
LÍMITES Y DERIVADAS aplicados a ingenieria
LÍMITES Y DERIVADAS aplicados a ingenieriaLÍMITES Y DERIVADAS aplicados a ingenieria
LÍMITES Y DERIVADAS aplicados a ingenieria
 
1531 fourier series- integrals and trans
1531 fourier series- integrals and trans1531 fourier series- integrals and trans
1531 fourier series- integrals and trans
 
Modal Analysis Basic Theory
Modal Analysis Basic TheoryModal Analysis Basic Theory
Modal Analysis Basic Theory
 
Quantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averagesQuantitative norm convergence of some ergodic averages
Quantitative norm convergence of some ergodic averages
 
Limit and continuity
Limit and continuityLimit and continuity
Limit and continuity
 
Recursive Compressed Sensing
Recursive Compressed SensingRecursive Compressed Sensing
Recursive Compressed Sensing
 
Gradient_Descent_Unconstrained.pdf
Gradient_Descent_Unconstrained.pdfGradient_Descent_Unconstrained.pdf
Gradient_Descent_Unconstrained.pdf
 
Ece3075 a 8
Ece3075 a 8Ece3075 a 8
Ece3075 a 8
 
Computational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial ModelingComputational Tools and Techniques for Numerical Macro-Financial Modeling
Computational Tools and Techniques for Numerical Macro-Financial Modeling
 
QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...
QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...
QMC: Operator Splitting Workshop, Perturbed (accelerated) Proximal-Gradient A...
 
Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...Estimation of the score vector and observed information matrix in intractable...
Estimation of the score vector and observed information matrix in intractable...
 
functions limits and continuity
functions limits and continuityfunctions limits and continuity
functions limits and continuity
 
Stochastic Alternating Direction Method of Multipliers
Stochastic Alternating Direction Method of MultipliersStochastic Alternating Direction Method of Multipliers
Stochastic Alternating Direction Method of Multipliers
 
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
The lattice Boltzmann equation: background, boundary conditions, and Burnett-...
 
Functions limits and continuity
Functions limits and continuityFunctions limits and continuity
Functions limits and continuity
 
Matrix calculus
Matrix calculusMatrix calculus
Matrix calculus
 
Singlevaropt
SinglevaroptSinglevaropt
Singlevaropt
 
DIGITAL IMAGE PROCESSING - Day 4 Image Transform
DIGITAL IMAGE PROCESSING - Day 4 Image TransformDIGITAL IMAGE PROCESSING - Day 4 Image Transform
DIGITAL IMAGE PROCESSING - Day 4 Image Transform
 
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
 

More from Fabian Pedregosa

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
Fabian Pedregosa
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
Fabian Pedregosa
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimation
Fabian Pedregosa
 
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Fabian Pedregosa
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
Fabian Pedregosa
 
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Fabian Pedregosa
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
Fabian Pedregosa
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
Fabian Pedregosa
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
Fabian Pedregosa
 

More from Fabian Pedregosa (11)

Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4Random Matrix Theory and Machine Learning - Part 4
Random Matrix Theory and Machine Learning - Part 4
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2Random Matrix Theory and Machine Learning - Part 2
Random Matrix Theory and Machine Learning - Part 2
 
Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1Random Matrix Theory and Machine Learning - Part 1
Random Matrix Theory and Machine Learning - Part 1
 
Average case acceleration through spectral density estimation
Average case acceleration through spectral density estimationAverage case acceleration through spectral density estimation
Average case acceleration through spectral density estimation
 
Asynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and AlgorithmsAsynchronous Stochastic Optimization, New Analysis and Algorithms
Asynchronous Stochastic Optimization, New Analysis and Algorithms
 
Parallel Optimization in Machine Learning
Parallel Optimization in Machine LearningParallel Optimization in Machine Learning
Parallel Optimization in Machine Learning
 
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Opti...
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Lightning: large scale machine learning in python
Lightning: large scale machine learning in pythonLightning: large scale machine learning in python
Lightning: large scale machine learning in python
 
Profiling in Python
Profiling in PythonProfiling in Python
Profiling in Python
 

Recently uploaded

Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
RitabrataSarkar3
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
David Osipyan
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
MAGOTI ERNEST
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
Hitesh Sikarwar
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
University of Hertfordshire
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
TinyAnderson
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
MaheshaNanjegowda
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
AbdullaAlAsif1
 

Recently uploaded (20)

Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
3D Hybrid PIC simulation of the plasma expansion (ISSS-14)
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptxThe use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
The use of Nauplii and metanauplii artemia in aquaculture (brine shrimp).pptx
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Applied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdfApplied Science: Thermodynamics, Laws & Methodology.pdf
Applied Science: Thermodynamics, Laws & Methodology.pdf
 
Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdfTopic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
Topic: SICKLE CELL DISEASE IN CHILDREN-3.pdf
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 

Sufficient decrease is all you need

  • 1. Sufficient decrease is all you need A simple condition to forget about the step-size, with applications to the Frank-Wolfe algorithm. Fabian Pedregosa June 4th, 2018. Google Brain Montreal
  • 2. Where I Come From ML/Optimization/Software Guy Engineer (2010–2012) First contact with ML: develop ML library (scikit-learn). ML and NeuroScience (2012–2015) PhD applying ML to neuroscience. ML and Optimization (2015–) Stochastic, Parallel, Constrained, Hyperparameter optimization. 1/30
  • 4. Outline Motivation: eliminate step-size parameter. 1. Frank-Wolfe, A method for constrained optimization. 2. Adaptive Frank-Wolfe. Frank-Wolfe without the step-size. 3. Perspectives. Other applications: proximal splitting, stochastic optimization. 2/30
  • 5. Outline Motivation: eliminate step-size parameter. 1. Frank-Wolfe, A method for constrained optimization. 2. Adaptive Frank-Wolfe. Frank-Wolfe without the step-size. 3. Perspectives. Other applications: proximal splitting, stochastic optimization. With a little help from my collaborators Armin Askari (UC Berkeley) Geoffrey N´egiar (UC Berkeley) Martin Jaggi (EPFL) Gauthier Gidel (UdeM) 2/30
  • 7. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 8. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 9. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 10. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 11. The Frank-Wolfe (FW) algorithm, aka conditional gradient Problem: smooth f , compact D arg min x∈D f (x) Algorithm 1: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 3/30
  • 12. Why people ♥ Frank-Wolfe • Projection-free. Only linear subproblems arise vs quadratic for projection. 4/30
  • 13. Why people ♥ Frank-Wolfe • Projection-free. Only linear subproblems arise vs quadratic for projection. • Solution of linear subproblem is always extremal element of D. 4/30
  • 14. Why people ♥ Frank-Wolfe • Projection-free. Only linear subproblems arise vs quadratic for projection. • Solution of linear subproblem is always extremal element of D. • Iterates admit sparse representation = xt convex combination of at most t elements. 4/30
  • 15. Recent applications of Frank-Wolfe • Learning the structure of a neural network.1 • Attention mechanisms that enforce sparsity.2 • 1-constrained problems with extreme number of features.3 1 Wei Ping, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems. 2 Vlad Niculae et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning. 3 Thomas Kerdreux, Fabian Pedregosa, and Alexandre d’Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning. 5/30
  • 16. A practical issue • Line-search only efficient when closed form exists (quadratic objective). • Step-size γt = 2/(t + 2) is convergent, but extremely slow. Algorithm 2: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst 6/30
  • 17. A practical issue • Line-search only efficient when closed form exists (quadratic objective). • Step-size γt = 2/(t + 2) is convergent, but extremely slow. Algorithm 2: Frank-Wolfe (FW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find γt by line-search: γt ∈ arg minγ∈[0,1] f ((1−γ)xt +γst) 4 xt+1 = (1 − γt)xt + γtst Can we do better? 6/30
  • 19. Down the citation rabbit hole 4 Vladimir Demyanov Alexsandr Rubinov 4 Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian). 7/30
  • 20. Down the citation rabbit hole 4 Vladimir Demyanov Alexsandr Rubinov 4 Vladimir Demyanov and Aleksandr Rubinov (1970). Approximate methods in optimization problems (translated from Russian). 7/30
  • 21. The Demyanov-Rubinov (DR) Frank-Wolfe variant Problem: smooth objective, compact domain arg min x∈D f (x), where f is L-smooth . (L-smooth ≡ differentiable with L-Lipschitz gradient). • Step-size depends on the correlation between − f (xt) and the descent direction st − xt. Algorithm 3: FW, DR variant 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 γt =min − f (xt), st − xt L st − xt 2 , 1 4 xt+1 = (1 − γt)xt + γtst 8/30
  • 22. The Demyanov-Rubinov (DR) Frank-Wolfe variant Where does γt =min − f (xt), st − xt L st − xt 2 , 1 come from? 9/30
  • 23. The Demyanov-Rubinov (DR) Frank-Wolfe variant Where does γt =min − f (xt), st − xt L st − xt 2 , 1 come from? L-smooth inequality Any L-smooth function f verifies f (y) ≤ f (x) + f (x), y − x + L 2 x − y 2 , for all x, y in the domain. 9/30
  • 24. The Demyanov-Rubinov (DR) Frank-Wolfe variant Where does γt =min − f (xt), st − xt L st − xt 2 , 1 come from? L-smooth inequality Any L-smooth function f verifies f (y) ≤ f (x) + f (x), y − x + L 2 x − y 2 :=Qx (y) , for all x, y in the domain. • The right hand side is a quadratic upper bound Qx (y) f (y) 9/30
  • 25. Justification of the step-size • L-smooth inequality at xt+1(γ) = (1 − γ)xt + γst, x = xt gives f (xt+1(γ)) ≤ f (xt) − γ f (xt), st − xt + γ2L 2 st − xt 2 10/30
  • 26. Justification of the step-size • L-smooth inequality at xt+1(γ) = (1 − γ)xt + γst, x = xt gives f (xt+1(γ)) ≤ f (xt) − γ f (xt), st − xt + γ2L 2 st − xt 2 • Minimizing right hand side on γ ∈ [0, 1] gives γ =min − f (xt), st − xt L st − xt 2 , 1 , = Demyanov-Rubinov step-size! 10/30
  • 27. Justification of the step-size • L-smooth inequality at xt+1(γ) = (1 − γ)xt + γst, x = xt gives f (xt+1(γ)) ≤ f (xt) − γ f (xt), st − xt + γ2L 2 st − xt 2 • Minimizing right hand side on γ ∈ [0, 1] gives γ =min − f (xt), st − xt L st − xt 2 , 1 , = Demyanov-Rubinov step-size! • ≡ exact line search on the quadratic upper bound. 10/30
  • 28. Towards an Adaptive FW Quadratic upper bound The Demyanov-Rubinov makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1. 11/30
  • 29. Towards an Adaptive FW Quadratic upper bound The Demyanov-Rubinov makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1. Sufficient decrease is all you need L-smooth inequality can be replaced by f (xt+1) ≤ f (xt) − γt f (xt), st − xt + γ2 t Lt 2 st − xt 2 with γt =min − f (xt), st − xt Lt st − xt 2 , 1 11/30
  • 30. Towards an Adaptive FW Quadratic upper bound The Demyanov-Rubinov makes use of a quadratic upper bound, but it is only evaluated at xt, xt+1. Sufficient decrease is all you need L-smooth inequality can be replaced by f (xt+1) ≤ f (xt) − γt f (xt), st − xt + γ2 t Lt 2 st − xt 2 with γt =min − f (xt), st − xt Lt st − xt 2 , 1 Key difference with DR: L is replaced by Lt. Potentially Lt L. 11/30
  • 31. The Adaptive FW algorithm New FW variant with adaptive step-size.5 Algorithm 4: The Adaptive Frank-Wolfe algorithm (AdaFW) 1 for t = 0, 1 . . . do 2 st ∈ arg mins∈D f (xt), s 3 Find Lt that verifies sufficient decrease (1), with 4 γt =min − f (xt), st − xt Lt st − xt 2 , 1 5 xt+1 = (1 − γt)xt + γtst f (xt+1) ≤ f (xt) + γt f (xt), st − xt + γ2 t Lt 2 st − xt 2 (1) 5 Fabian Pedregosa, Armin Askari, Geoffrey Negiar, and Martin Jaggi (2018). “Step-Size Adaptivity in Projection-Free Optimization”. In: Submitted. 12/30
• 34. The Adaptive FW algorithm
[Figure: the quadratic model $f(x_t) + \gamma \langle \nabla f(x_t), s_t - x_t \rangle + \tfrac{\gamma^2 L_t}{2} \|s_t - x_t\|^2$ upper-bounds $f((1 - \gamma) x_t + \gamma s_t)$ on $\gamma \in [0, 1]$, with minimizer $\gamma_t$.]
• Worst case, $L_t = L$. Often $L_t \ll L$ ⟹ larger step-size.
• Adaptivity to local geometry.
• Two extra function evaluations per iteration, often available as a byproduct of the gradient computation.
13/30
  • 35. Extension to other FW variants
• 36. Zig-zagging phenomenon in FW. The Frank-Wolfe algorithm zig-zags when the solution lies on a face of the boundary. Several FW variants have been developed to address this issue. 14/30
• 37. Away-steps FW, informal. The Away-steps FW algorithm (Wolfe, 1970; Guélat and Marcotte, 1986) adds the possibility to move away from an active atom. 15/30
• 43. Away-steps FW algorithm
Keep an active set $S_t$ = active vertices that have been previously selected and have non-zero weight.
Algorithm 5: Away-Steps FW
1 for t = 0, 1, ... do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $v_t \in \arg\max_{v \in S_t} \langle \nabla f(x_t), v \rangle$
4   if $\langle -\nabla f(x_t), s_t - x_t \rangle \ge \langle -\nabla f(x_t), x_t - v_t \rangle$ then
5     $d_t = s_t - x_t$  (FW step)
6   else
7     $d_t = x_t - v_t$  (away step)
8   Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0, \gamma_t^{\max}]} f(x_t + \gamma d_t)$
9   $x_{t+1} = x_t + \gamma_t d_t$
16/30
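A sketch of the direction selection and the maximal step-size, keeping the iterate as an explicit convex combination of vertices; the active-set weight bookkeeping after each step is elided for brevity, and all names are placeholders:

```python
import numpy as np

def away_steps_fw(x0_vertex, grad_f, lmo, line_search, max_iter=100):
    """Sketch of Away-steps FW on a polytope.

    weights maps a vertex (stored as a hashable tuple) to its
    convex-combination weight in the current iterate x.
    """
    weights = {tuple(x0_vertex): 1.0}
    x = np.array(x0_vertex, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        s = lmo(g)                                        # FW vertex
        v = np.array(max(weights, key=lambda u: g @ np.array(u)))
        if -g @ (s - x) >= -g @ (x - v):
            d, gamma_max = s - x, 1.0                     # FW step
        else:
            a = weights[tuple(v)]                         # a < 1 here,
            d, gamma_max = x - v, a / (1.0 - a)           # else FW step wins
        gamma = line_search(x, d, gamma_max)
        x = x + gamma * d
        # (update of weights / active set omitted for brevity)
    return x
```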
• 44. Pairwise FW
Key idea: move weight mass between two atoms in each step. Proposed by (Lacoste-Julien and Jaggi, 2015), inspired by the MDM algorithm (Mitchell, Demyanov, and Malozemov, 1974).
Algorithm 6: Pairwise FW
1 for t = 0, 1, ... do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $v_t \in \arg\max_{s \in S_t} \langle \nabla f(x_t), s \rangle$
4   $d_t = s_t - v_t$
5   Find $\gamma_t$ by line-search: $\gamma_t \in \arg\min_{\gamma \in [0, \gamma_t^{\max}]} f(x_t + \gamma d_t)$
6   $x_{t+1} = x_t + \gamma_t d_t$
17/30
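The pairwise step only shuffles mass between two atoms, so $\gamma_t^{\max}$ is the weight currently on $v_t$. A sketch of the update, reusing the active-set dictionary from the away-steps sketch above (pairwise_update is my name, not the paper's):

```python
import numpy as np

def pairwise_update(weights, x, s, v, line_search):
    """One pairwise FW step: move mass gamma from vertex v to vertex s."""
    gamma_max = weights[tuple(v)]          # all mass on v is movable
    d = s - v
    gamma = line_search(x, d, gamma_max)
    weights[tuple(v)] -= gamma                               # mass leaves v
    weights[tuple(s)] = weights.get(tuple(s), 0.0) + gamma   # ...lands on s
    if weights[tuple(v)] == 0.0:
        del weights[tuple(v)]              # v drops out of the active set
    return x + gamma * d, weights
```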
• 47. Away-steps FW and Pairwise FW
Convergence of Away-steps and Pairwise FW:
• Linear convergence for strongly convex functions on polytopes (Lacoste-Julien and Jaggi, 2015).
• Can we design variants with sufficient decrease?
Introducing Adaptive Away-steps and Adaptive Pairwise: choose $L_t$ such that it verifies
$f(x_t + \gamma_t d_t) \le f(x_t) + \gamma_t \langle \nabla f(x_t), d_t \rangle + \tfrac{\gamma_t^2 L_t}{2} \|d_t\|^2$
with $\gamma_t = \min\{ \langle -\nabla f(x_t), d_t \rangle / (L_t \|d_t\|^2),\ 1 \}$
18/30
• 48. Adaptive Pairwise FW
Algorithm 7: Adaptive Pairwise FW
1 for t = 0, 1, ... do
2   $s_t \in \arg\min_{s \in \mathcal{D}} \langle \nabla f(x_t), s \rangle$
3   $v_t \in \arg\max_{s \in S_t} \langle \nabla f(x_t), s \rangle$
4   $d_t = s_t - v_t$
5   Find $L_t$ that verifies sufficient decrease (2), with
6   $\gamma_t = \min\{ \langle -\nabla f(x_t), d_t \rangle / (L_t \|d_t\|^2),\ 1 \}$
7   $x_{t+1} = x_t + \gamma_t d_t$
$f(x_t + \gamma_t d_t) \le f(x_t) + \gamma_t \langle \nabla f(x_t), d_t \rangle + \tfrac{\gamma_t^2 L_t}{2} \|d_t\|^2$ (2)
19/30
• 51. Theory for Adaptive Step-size variants
Strongly convex $f$: Adaptive Pairwise and Adaptive Away-steps converge linearly on a polytope. For each "good step" we have:
$f(x_{t+1}) - f(x^\star) \le (1 - \rho)\,(f(x_t) - f(x^\star))$
Convex $f$: for all FW variants, $f(x_t) - f(x^\star) \le \mathcal{O}(1/t)$
Non-convex $f$: for all FW variants, $\max_{s \in \mathcal{D}} \langle \nabla f(x_t), x_t - s \rangle \le \mathcal{O}(1/\sqrt{t})$
20/30
• 53. Experiments: RCV1
Problem: $\ell_1$-constrained logistic regression: $\arg\min_{\|x\|_1 \le \alpha} \frac{1}{n} \sum_{i=1}^n \varphi(a_i^\top x, b_i)$ with $\varphi$ = logistic loss.
Dataset: RCV1; dimension 47,236; density $10^{-3}$; $L_t/L = 1.3 \times 10^{-2}$.
[Figure: objective minus optimum vs time (s), for $\ell_1$ ball radius 100, 200, 300; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
21/30
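For the $\ell_1$ ball used in these experiments, the FW linear subproblem has a closed form: the minimizer is a signed vertex of the ball. A minimal sketch (the function name is mine):

```python
import numpy as np

def lmo_l1_ball(g, alpha):
    """LMO for the l1 ball of radius alpha:
    argmin_{||s||_1 <= alpha} <g, s> is a signed, scaled basis vector."""
    i = np.argmax(np.abs(g))        # coordinate with the largest |gradient|
    s = np.zeros_like(g)
    s[i] = -alpha * np.sign(g[i])
    return s
```

This is also why FW iterates stay sparse here: each iteration adds mass on a single coordinate.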
• 54. Experiments: Madelon
Problem: $\ell_1$-constrained logistic regression: $\arg\min_{\|x\|_1 \le \alpha} \frac{1}{n} \sum_{i=1}^n \varphi(a_i^\top x, b_i)$ with $\varphi$ = logistic loss.
Dataset: Madelon; dimension 500; density 1.0; $L_t/L = 3.3 \times 10^{-3}$.
[Figure: objective minus optimum vs time (s), for $\ell_1$ ball radius 13, 20, 30; methods: AdaFW, AdaPFW, AdaAFW, FW, PFW, AFW, D-FW.]
22/30
• 55. Experiments: MovieLens 1M
Problem: trace-norm-constrained robust matrix completion: $\arg\min_{\|X\|_* \le \alpha} \frac{1}{|B|} \sum_{(i,j) \in B} h(X_{i,j}, A_{i,j})$ with $h$ = Huber loss.
Dataset: MovieLens 1M; dimension 22,393,987; density 0.04; $L_t/L = 1.1 \times 10^{-2}$.
[Figure: objective minus optimum vs time (s), for trace ball radius 300, 350, 400; methods: Adaptive FW, FW, D-FW.]
23/30
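For the trace-norm ball, the LMO only needs the leading singular pair of the gradient, far cheaper than the full SVD a projection would require. A sketch using SciPy (the function name is mine):

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_trace_ball(G, alpha):
    """LMO for the trace-norm ball: argmin_{||S||_* <= alpha} <G, S>
    is -alpha * u v^T, with (u, v) the top singular pair of G."""
    u, _, vt = svds(G, k=1)                     # leading singular triplet
    return -alpha * np.outer(u[:, 0], vt[0, :])
```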
• 57. Proximal Splitting
Building a quadratic upper bound is common in proximal gradient descent (Beck and Teboulle, 2009) (Nesterov, 2013). Recently extended to the Davis-Yin three operator splitting:⁶
$\min_x\ f(x) + g(x) + h(x)$, with access to $\nabla f$, $\mathrm{prox}_{\gamma g}$, $\mathrm{prox}_{\gamma h}$.
Key insight: verify a sufficient decrease condition of the form
$f(x_{t+1}) \le f(z_t) + \langle \nabla f(z_t), x_{t+1} - z_t \rangle + \tfrac{1}{2\gamma_t} \|x_{t+1} - z_t\|^2$
⁶ Fabian Pedregosa and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". In: Proceedings of the 35th International Conference on Machine Learning.
24/30
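To see the same condition outside FW, here is sufficient decrease driving the step-size of plain proximal gradient (one $f$, one $g$); a simplified stand-in for the full three operator splitting of the paper, with illustrative backtracking constants:

```python
import numpy as np

def adaptive_prox_grad(x0, f, grad_f, prox_g, step0=1.0, max_iter=100):
    """Proximal gradient with step-size set by sufficient decrease.

    prox_g(y, step) is assumed to compute prox_{step * g}(y).
    """
    x, step = np.array(x0, dtype=float), step0
    for _ in range(max_iter):
        g = grad_f(x)
        while True:
            x_next = prox_g(x - step * g, step)
            diff = x_next - x
            # sufficient decrease: 1/step-quadratic model upper-bounds f
            if f(x_next) <= f(x) + g @ diff + (diff @ diff) / (2 * step):
                break
            step /= 2.0                  # shrink the step and retry
        step *= 1.02                     # tentatively grow for next iteration
        x = x_next
    return x
```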
• 58. Nearly-isotonic penalty
Problem: $\arg\min_x\ \mathrm{loss}(x) + \lambda \sum_{i=1}^{p-1} \max\{x_i - x_{i+1}, 0\}$
[Figure: estimated coefficients vs ground truth for $\lambda \in \{10^{-6}, 10^{-3}, 0.01, 0.1\}$; objective minus optimum vs time (s) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
25/30
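For reference, the penalty itself is one line of NumPy, charging only downward jumps (a hypothetical helper, not from the paper's code):

```python
import numpy as np

def nearly_isotonic_penalty(x, lam):
    """Nearly-isotonic penalty: penalizes only violations x_i > x_{i+1}."""
    return lam * np.maximum(x[:-1] - x[1:], 0.0).sum()
```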
• 59. Overlapping group lasso penalty
Problem: $\arg\min_x\ \mathrm{loss}(x) + \lambda \sum_{g \in \mathcal{G}} \|[x]_g\|_2$
[Figure: estimated coefficients vs ground truth for $\lambda \in \{10^{-6}, 10^{-3}, 0.01, 0.1\}$; objective minus optimum vs time (s) for Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
26/30
• 62. Stochastic optimization
Problem: $\arg\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n f_i(x)$
Heuristic from⁷ to estimate $L$ by verifying, at each iteration $t$,
$f_i\big(x_t - \tfrac{1}{L} \nabla f_i(x_t)\big) \le f_i(x_t) - \tfrac{1}{2L} \|\nabla f_i(x_t)\|^2$
with $i$ a random index sampled at iteration $t$.
• This is the L-smooth inequality with $y = x_t - \tfrac{1}{L} \nabla f_i(x_t)$, $x = x_t$.
⁷ Mark Schmidt, Nicolas Le Roux, and Francis Bach (2017). "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming.
27/30
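A sketch of one application of that heuristic; doubling $L$ on failure follows the SAG description, while the slow decay between iterations (so $L$ can also shrink) is an illustrative detail that may differ from the paper's exact schedule:

```python
import numpy as np

def update_L_estimate(f_i, grad_f_i, x, L, n):
    """SAG-style stochastic estimate of L: test the L-smooth inequality
    along -grad f_i and double L until it holds."""
    g = grad_f_i(x)
    g_sq = g @ g
    L = L * 2.0 ** (-1.0 / n)        # slow decay so L can also decrease
    while f_i(x - g / L) > f_i(x) - g_sq / (2.0 * L):
        L = 2.0 * L                  # inequality violated: double L
    return L
```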
• 64. Experiments: stochastic line search. Can we prove convergence for such (or similar) stochastic adaptive step-sizes? 28/30
• 67. Conclusion
• Sufficient decrease condition to set the step-size in FW and variants.
• (Mostly) hyperparameter-free, adaptivity to local geometry.
• Applications in proximal splitting and stochastic optimization.
Thanks for your attention
29/30
• 68. References
Beck, Amir and Marc Teboulle (2009). "Gradient-based algorithms with applications to signal recovery". In: Convex Optimization in Signal Processing and Communications.
Demyanov, Vladimir and Aleksandr Rubinov (1970). Approximate Methods in Optimization Problems (translated from Russian).
Guélat, Jacques and Patrice Marcotte (1986). "Some comments on Wolfe's 'away step'". In: Mathematical Programming.
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d'Aspremont (2018). "Frank-Wolfe with Subsampling Oracle". In: Proceedings of the 35th International Conference on Machine Learning.
Lacoste-Julien, Simon and Martin Jaggi (2015). "On the global linear convergence of Frank-Wolfe optimization variants". In: Advances in Neural Information Processing Systems.
Mitchell, B.F., Vladimir Fedorovich Demyanov, and V.N. Malozemov (1974). "Finding the point of a polyhedron closest to the origin". In: SIAM Journal on Control.
29/30
• 69. References (continued)
Nesterov, Yu (2013). "Gradient methods for minimizing composite functions". In: Mathematical Programming.
Niculae, Vlad et al. (2018). "SparseMAP: Differentiable Sparse Structured Inference". In: International Conference on Machine Learning.
Pedregosa, Fabian et al. (2018). "Step-Size Adaptivity in Projection-Free Optimization". In: Submitted.
Pedregosa, Fabian and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". In: Proceedings of the 35th International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T. Ihler (2016). "Learning Infinite RBMs with Frank-Wolfe". In: Advances in Neural Information Processing Systems.
Schmidt, Mark, Nicolas Le Roux, and Francis Bach (2017). "Minimizing finite sums with the stochastic average gradient". In: Mathematical Programming.
Wolfe, Philip (1970). "Convergence theory in nonlinear programming". In: Integer and Nonlinear Programming.
30/30