We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient) algorithm for constrained smooth finite-sum minimization with a generalized linear prediction/structure. This class of problems includes empirical risk minimization with sparse, low-rank, or other structured constraints. The proposed method is simple to implement, does not require step-size tuning, and has a constant per-iteration cost that is independent of the dataset size. Furthermore, as a byproduct of the method we obtain a stochastic estimator of the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the setting, the proposed method matches or improves on the best computational guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several datasets highlight different regimes in which the proposed method exhibits a faster empirical convergence than related methods. Finally, we provide an implementation of all considered methods in an open-source package.
Outline

Motivation: Obtain a practical, fast version of Stochastic Frank-Wolfe for finite-sum problems.

1. Problem of interest and setting.
2. The Frank-Wolfe algorithm.
3. Stochastic Frank-Wolfe. Making Stochastic Frank-Wolfe practical: a primal-dual view.
4. Results. Convergence rates in theory and in practice.
Problem of Interest

The problem of interest is

  OPT:   min_{w ∈ C}  (1/n) Σ_{i=1}^{n} f_i(x_i^⊤ w)

• f_i(·) is the univariate loss function of observation/sample i, for i ∈ [n]
• n is the number of observations/samples
• C ⊂ R^d is a compact convex set
• d is the dimension of the model variable w

The particular structural dependence of the losses on x_i^⊤ w is a model with "generalized linear structure" or "linear prediction".
Setting

Assumptions

• For i = 1, …, n, the univariate function f_i(·) is L-smooth, namely for all a, b ∈ R it holds that

    |f_i'(a) − f_i'(b)| ≤ L |a − b|

• the Linear Minimization Oracle LMO(v):

    s ← arg min_{w ∈ C} ⟨v, w⟩

  returns an optimal solution and can be computed easily for any v
Some Examples in Statistics and Machine Learning

• LASSO

    min_w (1/2n) Σ_{i=1}^{n} (y_i − x_i^⊤ w)²   s.t. ‖w‖_1 ≤ δ,

  where f_i(·) = ½ (y_i − ·)² and C := {w : ‖w‖_1 ≤ δ}

• Sparse Logistic Regression

    min_w (1/n) Σ_{i=1}^{n} ln(1 + exp(−y_i x_i^⊤ w))   s.t. ‖w‖_1 ≤ δ,

  where f_i(·) = ln(1 + exp(−y_i ·)) and C := {w : ‖w‖_1 ≤ δ}

• Low-Rank Matrix Completion

    min_{W ∈ R^{n×p}} (1/2|Ω|) Σ_{(i,j)∈Ω} (M_{i,j} − W_{i,j})²   s.t. ‖W‖_Nuc ≤ δ,

  where f_{(i,j)}(·) = ½ (· − M_{i,j})² and C := {W ∈ R^{n×p} : ‖W‖_Nuc ≤ δ}

• Many more examples can be found in [Jaggi 2013], for instance
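To make the LMO concrete for the ℓ1-ball constraint shared by the first two examples, here is a minimal sketch (ours, not from the slides): the linear minimization over {w : ‖w‖_1 ≤ δ} is attained at a signed, scaled coordinate vector.

```python
import numpy as np

def lmo_l1_ball(v, delta):
    """LMO over the l1 ball {w : ||w||_1 <= delta}.

    Returns s = argmin_{||w||_1 <= delta} <v, w>, attained at a vertex:
    -delta * sign(v_j) * e_j for the coordinate j with largest |v_j|.
    """
    s = np.zeros_like(v)
    j = np.argmax(np.abs(v))          # coordinate of largest magnitude
    s[j] = -delta * np.sign(v[j])     # move as far as possible against v
    return s
```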
Frank-Wolfe: What is it?

Problem: smooth f, compact and convex D

  arg min_{x ∈ D} f(x)

! Beware of the notation change in this section !

Algorithm 1: Frank-Wolfe (FW)
1 for t = 0, 1, … do
2   s_t ∈ arg min_{s ∈ D} ⟨∇f(x_t), s⟩
3   Choose step-size γ_t
4   x_{t+1} = (1 − γ_t) x_t + γ_t s_t
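As an illustration only (not the authors' implementation), a minimal sketch of this loop in Python, assuming a callable `grad` for ∇f and an LMO such as `lmo_l1_ball` above, with the γ_t = 2/(t + 2) rule discussed on the next slide:

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, n_iters=100):
    """Generic Frank-Wolfe loop with the step size gamma_t = 2 / (t + 2).

    grad(x) returns the gradient of the smooth objective at x;
    lmo(v) returns argmin_{s in D} <v, s> over the compact convex set D.
    """
    x = x0.copy()
    for t in range(n_iters):
        s = lmo(grad(x))                    # linear minimization step
        gamma = 2.0 / (t + 2.0)             # "recent standard" step size
        x = (1.0 - gamma) * x + gamma * s   # convex combination stays in D
    return x
```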
Some Step-Size Rules/Strategies

• "Recent standard": γ_t = 2/(t + 2)
• Exact line-search: γ_t = arg min_{γ ∈ [0,1]} { f(x_t + γ(s_t − x_t)) }
• QA (Quadratic Approximation) step-size:

    γ_t = min{ 1, −∇f(x_t)^⊤ (s_t − x_t) / (L ‖s_t − x_t‖²) }

  See [Demyanov & Rubinov 1970]
• Simple averaging: γ_t = 1/(t + 1)
• Constant step-size: γ_t = γ for some given γ ∈ [0, 1]
• Dynamic strategy: determined by some history of optimality bounds, see [Freund & Grigas 2014]
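For instance, the QA rule drops straight into the loop above; a minimal sketch (ours), assuming the smoothness constant L is known and s_t ≠ x_t:

```python
import numpy as np

def qa_step_size(grad_x, x, s, L):
    """Quadratic-approximation (Demyanov & Rubinov) step size, clipped to [0, 1].

    Assumes s != x so the denominator is nonzero.
    """
    d = s - x                                        # FW direction
    gamma = -np.dot(grad_x, d) / (L * np.dot(d, d))  # unconstrained minimizer
    return min(1.0, max(0.0, gamma))                 # clip to the feasible range
```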
A Simple Computational Guarantee for Frank-Wolfe

Here is a simple computational guarantee:

A Computational Guarantee for the Frank-Wolfe algorithm
If the step-size sequence {γ_t} is chosen by the recent standard, the QA rule, or by exact line-search, then for all t ≥ 1 it holds that:

  f(x_t) − f^* ≤ 1 / ( 1/(f(x_0) − f^*) + t/(2K) ) < 2K/t

where K = L · diam(C)², and f is convex and L-smooth.

Related guarantees also hold for other types of step-size strategies.
Frank-Wolfe: When do we use it?

• Projection-free. Linear subproblems vs. quadratic for projected gradient descent (PGD):

    min_{x ∈ D} g^⊤ x   vs.   min_{x ∈ D} ‖y − x‖₂²

• Solution of the linear subproblem: extremal element of D.
• Sparse representation: x_t is a convex combination of at most t elements.

Recent Applications
• Learning the structure of a neural network. Ping, Liu, and Ihler, 2016
• Attention mechanisms that enforce sparsity. Niculae et al., 2018
• ℓ1-constrained problems with an extreme number of features. Kerdreux, Pedregosa, and d'Aspremont, 2018
A Practical Issue for FW

• For large n (number of samples), we need a stochastic variant of FW
• A naïve SGD-like algorithm fails in practice and in theory
• State-of-the-art bounds on suboptimality after t iterations: O(n/t) and O(1/t^{1/3})
  Lu and Freund, 2020; Mokhtari, Hassani, and Karbasi, 2018

Can we do better?
Problem setting

  OPT:   min_{w ∈ C}  (1/n) Σ_{i=1}^{n} f_i(x_i^⊤ w)

• f_i(·) is the univariate loss function of observation/sample i, for i ∈ [n]
• n is the number of observations/samples
• C ⊂ R^d is a compact convex set
• d is the dimension of the model variable w

The particular structural dependence of the losses on x_i^⊤ w is a model with "generalized linear structure" or "linear prediction".
Deterministic FW: Gradient Computation for OPT

  OPT:   f^* := min_{w ∈ C} F(w) = (1/n) Σ_{i=1}^{n} f_i(x_i^⊤ w)

Assumptions
• f_i(·) is L-smooth for i ∈ [n]: for all z, z', |f_i'(z) − f_i'(z')| ≤ L |z − z'|
• Linear Minimization Oracle LMO(r): s ← arg min_{w ∈ C} ⟨r, w⟩

Denote X := [x_1; x_2; …; x_n]

Gradient Computation

  ∇F(w) = (1/n) Σ_{i=1}^{n} x_i · f_i'(x_i^⊤ w) = X^⊤ α,  where α_i ← (1/n) f_i'(x_i^⊤ w), i ∈ [n]

Gradient computation is O(nd) operations (expensive when n ≫ 0 …)
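As a sketch of this computation (ours, using the LASSO loss f_i(z) = ½ (y_i − z)² from the examples), the full gradient assembled as X^⊤ α:

```python
import numpy as np

def full_gradient(X, y, w):
    """Full gradient of F(w) = (1/n) * sum_i 0.5 * (y_i - x_i^T w)^2.

    Each alpha_i = (1/n) * f_i'(x_i^T w); the gradient is X^T alpha.
    Costs O(n d) operations since every row of X is touched.
    """
    n = X.shape[0]
    alpha = (X @ w - y) / n     # f_i'(z) = z - y_i for the squared loss
    return X.T @ alpha
```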
Frank-Wolfe for OPT

  OPT:   f^* := min_{w ∈ C} F(w) := (1/n) Σ_{i=1}^{n} f_i(x_i^⊤ w)

Frank-Wolfe algorithm for OPT:
Initialize at w_0 ∈ C, t ← 0.
At iteration t:
1. Compute ∇F(w_{t−1}):
   • α_t^i ← (1/n) f_i'(x_i^⊤ w_{t−1}) for EVERY i ∈ [n]
   • r_t = X^⊤ α_t (= ∇F(w_{t−1}))
2. Compute s_t ← LMO(r_t).
3. Set w_t ← w_{t−1} + γ_t (s_t − w_{t−1}), where γ_t ∈ [0, 1].

Iteration cost is O(nd) operations (expensive when n ≫ 0 …)
A Naïve Stochastic Frank-Wolfe (SFW) Strategy

  OPT:   f^* := min_{w ∈ C} F(w) := (1/n) Σ_{i=1}^{n} f_i(x_i^⊤ w)

Frank-Wolfe algorithm for OPT:
Initialize at w_0 ∈ C, t ← 0.
At iteration t:
1. Estimate ∇F(w_{t−1}):
   • α_t^i ← (1/n) f_i'(x_i^⊤ w_{t−1}) for ONE i ∈ [n] (α_t^j = 0 for j ≠ i)
   • r_t = X^⊤ α_t = x_i α_t^i
2. Compute s_t ← LMO(r_t).
3. Set w_t ← w_{t−1} + γ_t (s_t − w_{t−1}), where γ_t ∈ [0, 1].

This approach does not work without growing the batch size [Hazan]
Our Stochastic Frank-Wolfe (SFW) Strategy

  OPT:   f^* := min_{w ∈ C} F(w) := (1/n) Σ_{i=1}^{n} f_i(x_i^⊤ w)

Frank-Wolfe algorithm for OPT:
Initialize at w_0 ∈ C, t ← 0.
At iteration t:
1. Estimate ∇F(w_{t−1}):
   • α_t^i ← (1/n) f_i'(x_i^⊤ w_{t−1}) for ONE i ∈ [n] (α_t^j = α_{t−1}^j for j ≠ i)
   • r_t = X^⊤ α_t = r_{t−1} + x_i (α_t^i − α_{t−1}^i)
2. Compute s_t ← LMO(r_t).
3. Set w_t ← w_{t−1} + γ_t (s_t − w_{t−1}), where γ_t ∈ [0, 1].

Iteration cost is O(d) operations! Memory cost is O(d + n)
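A minimal sketch of this strategy (our illustration for the LASSO objective, reusing the hypothetical `lmo_l1_ball` from earlier; the authors' implementation is in the copt package linked in the conclusion):

```python
import numpy as np

def stochastic_fw(X, y, delta, n_iters=10000, rng=None):
    """Sketch of the SFW strategy above for the LASSO objective, ||w||_1 <= delta.

    Keeps alpha_i = (1/n) f_i'(x_i^T w) for the last w at which sample i was
    visited, and maintains r = X^T alpha with an O(d) update per iteration.
    Relies on lmo_l1_ball defined in the earlier sketch.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    alpha = np.zeros(n)
    r = np.zeros(d)                          # running r = X^T alpha
    for t in range(n_iters):
        i = rng.integers(n)                  # sample ONE index
        alpha_new = (X[i] @ w - y[i]) / n    # (1/n) f_i'(x_i^T w), squared loss
        r += X[i] * (alpha_new - alpha[i])   # O(d) update of the gradient estimate
        alpha[i] = alpha_new
        s = lmo_l1_ball(r, delta)            # linear minimization step
        gamma = 2.0 / (t + 2.0)
        w = (1.0 - gamma) * w + gamma * s
    return w
```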
Motivation: a Primal-Dual Lens for Constructing FW

Recall the definition of the conjugate of a function f:

  f^*(α) := max_{x ∈ dom f(·)} { α^⊤ x − f(x) }

• If f is a closed convex function, then f^** = f
• f(x) := max_{α ∈ dom f^*(·)} { α^⊤ x − f^*(α) }, and
• When f is differentiable, it holds that

    ∇f(x) ← α,  where α ← arg max_{β ∈ dom f^*(·)} { β^⊤ x − f^*(β) }.
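A quick worked check of the last bullet (our own example, not from the slides): for the scalar quadratic f(x) = ½x², the conjugate is again a quadratic, and the maximizing β recovers the derivative:

```latex
% Worked example: f(x) = x^2/2.
% Conjugate: f^*(beta) = max_x { beta x - x^2/2 } = beta^2/2.
% The maximizer over beta of { beta x - f^*(beta) } is x, which equals f'(x).
\[
  f(x) = \tfrac{1}{2}x^2
  \;\Longrightarrow\;
  f^*(\beta) = \tfrac{1}{2}\beta^2,
  \qquad
  \arg\max_{\beta}\,\{\beta x - f^*(\beta)\} = x = \nabla f(x).
\]
```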
Motivation: a Primal-Dual Lens for Constructing FW

Using conjugacy we can reformulate OPT as:

  OPT:  min_{w ∈ C} f(Xw) = min_{w ∈ C} max_{α ∈ R^n} L(w, α) := ⟨Xw, α⟩ − f^*(α)

Given w_{t−1}, we construct the gradient of f(Xw) at w_{t−1} by maximizing over the dual variable α:

  α_t ∈ arg max_{α ∈ R^n} { L(w_{t−1}, α) = ⟨Xw_{t−1}, α⟩ − f^*(α) }
  ⟺ ∇_w f(Xw_{t−1}) = X^⊤ α_t

Then the LMO step corresponds to fixing the dual variable and minimizing over the primal variable w:

  s_t ← arg min_{w ∈ C} L(w, α_t) = ⟨w, X^⊤ α_t⟩ − f^*(α_t)
  ⟺ s_t ← LMO(X^⊤ α_t)

Finally,

  w_t = (1 − γ_t) w_{t−1} + γ_t s_t
Stochastic FW Methods for Finite-Sum Minimization

Computational complexity dependence on iterations t and sample size n:

Algorithm                            Complexity Bound      Non-Convex Case
Frank-Wolfe (deterministic) (1956)   O(n/t)                O(n/√t)
Reddi et al. (2016)                  O(n + n^{1/3}/√t)
Mokhtari et al. (2018)               O(1/t^{1/3})
Lu and Freund (2018)                 O(n/t)
This work                            O((D1/D∞)/t)

D1/D∞ ≤ n, and D1/D∞ ≪ n in many recognizable instances.
Theoretical guarantees: convex case

Define the ℓp-norm "diameter" of C to be D_p := max_{w,v ∈ C} ‖X(w − v)‖_p

Theorem: Computational Complexity of the Novel Stochastic Frank-Wolfe Algorithm
Let H_0 := ‖α_0 − ∇f(Xw_0)‖_1 be the initial error of the gradient estimate, and let the step-size rule be γ_t = 2/(t + 2). For t ≥ 2, it holds that:

  E[f(Xw_t) − f^*] ≤ 2(f(Xw_0) − f^*) / ((t + 1)(t + 2))
                     + (2L D_2²/n + 8L D_1 D_∞ (n − 1)/n) · t / ((t + 1)(t + 2))
                     + (2 D_∞ H_0 + 64 L D_1 D_∞) n² / ((t + 1)(t + 2)).

Let us see what this bound is really about …
Theoretical guarantees: convex case

ℓp-norm "diameter" of C is D_p := max_{w,v ∈ C} ‖X(w − v)‖_p
Define Ratio := D_1/D_∞ and note that Ratio ≤ n

The expected optimality gap bound is:

  2(f(Xw_0) − f^*) / ((t + 1)(t + 2))
  + (2L D_2²/n + 8L D_1 D_∞ (n − 1)/n) · (1/t)
  + (2 D_∞ H_0 + 64 L D_1 D_∞) n² / ((t + 1)(t + 2))

  = O((f(Xw_0) − f^*) / t²) + O(L D_∞² (1 + Ratio) / t) + O((D_∞ H_0 + L D_∞² Ratio) · n² / t²)

  ≤ O(L D_∞² Ratio / t) ≤ O(n/t)
On the Ratio D1/D∞

[Figure: plot of D_1/(n D_∞) for benchmark datasets]
Algorithms and Datasets

We compared three Stochastic Frank-Wolfe algorithms:
• Mokhtari et al. [2018]
• Lu and Freund [2018]
• Novel Stochastic Frank-Wolfe (this work)

We report here on ℓ1-constrained regression problems using the following datasets:

Problem type          Dataset              δ     d      n      D1/(n D∞)
logistic regression   breast-cancer        5     10     683    0.929
logistic regression   rcv1 (train)         100   47236  20242  0.021
linear regression     California housing   0.1   8      20640  0.040
Computational Experiments

[Figure: relative suboptimality versus number of sampled gradients processed, on three datasets (Breast Cancer, RCV1, California Housing), comparing Mokhtari et al. (2018), Lu & Freund (2018), and this work.]
Proof sketch (convex, smooth case)

A key lemma reduces the proof to that of deterministic Frank-Wolfe, with step size γ_t = 2/(t + 2).

Key Lemma (convex, smooth)
Let ε_t = f(Xw_t) − f(Xw^*). For any direction α_t, with s_t = LMO(X^⊤ α_t), we have

  ε_t ≤ (1 − γ_t) ε_{t−1} + γ_t² L D_2² / (2n) + γ_t D_∞ ‖α_t − ∇f(Xw_{t−1})‖_1.

Additionally, we show that E‖α_t − ∇f(Xw_{t−1})‖_1 = O(L D_1 / t).
This fact does not require convexity!
Questions to answer

• Linear convergence for FW variants (Stochastic Away steps, Pairwise)
• Explain the observed asymptotic accelerated rate
• Algorithm in the dual space: it is not Stochastic Coordinate Mirror Descent
Conclusion

• A practical, fast version of Stochastic Frank-Wolfe
• Hyperparameter-free
• Implementation available in https://github.com/openopt/copt
• Use FW when the structure of your problem demands it!

Thanks for your attention
References

Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d'Aspremont (2018). "Frank-Wolfe with Subsampling Oracle". In: Proceedings of the 35th International Conference on Machine Learning.
Lu, Haihao and Robert Michael Freund (2020). "Generalized stochastic Frank-Wolfe algorithm with stochastic substitute gradient for structured convex optimization". In: Math. Program.
Mokhtari, Aryan, Hamed Hassani, and Amin Karbasi (2018). "Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization". In: ArXiv abs/1804.09554.
Niculae, Vlad et al. (2018). "SparseMAP: Differentiable Sparse Structured Inference". In: International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T. Ihler (2016). "Learning Infinite RBMs with Frank-Wolfe". In: Advances in Neural Information Processing Systems.