Stochastic Frank-Wolfe for
Constrained Finite-Sum Minimization

Geoffrey Négiar^1, Gideon Dresdner^2, Alicia Yi-Ting Tsai^1, Laurent El Ghaoui^{1,5},
Francesco Locatello^2, Robert Freund^3, Fabian Pedregosa^4

June 12th, 2020. ICML, Online

1 University of California, Berkeley   2 ETH, Zurich   3 MIT
4 Google Research, Montréal   5 SumUp Analytics
Outline
Motivation: Obtain a practical, fast version of Stochastic
Frank-Wolfe for finite-sum problems.
1. Problem of interest and setting.
2. The Frank-Wolfe algorithm.
3. Stochastic Frank-Wolfe. Making Stochastic Frank-Wolfe
practical: a primal-dual view.
4. Results. Convergence rates in theory and in practice.
1/31
Problem of Interest and Setting
Problem of Interest
The problem of interest is

OPT:   min_{w ∈ C}  (1/n) Σ_{i=1}^n f_i(x_i^T w)

• f_i(·) is the univariate loss function of observation/sample i, for i ∈ [n]
• n is the number of observations/samples
• C ⊂ R^d is a compact convex set
• d is the dimension of the model variable w

The particular structural dependence of the losses on x_i^T w makes this a model
with "generalized linear structure" or "linear prediction".
2/31
Setting
Assumptions
• For i = 1, . . . , n, the univariate function f_i(·) is L-smooth, namely
  for all a, b ∈ R it holds that

      |f_i'(a) − f_i'(b)| ≤ L|a − b|

• the Linear Minimization Oracle LMO(v):

      s ← arg min_{w ∈ C} ⟨v, w⟩

  returns an optimal solution and is easy to solve for any v
3/31
Some Examples in Statistical and Machine Learning
• LASSO

      min_w  (1/2n) Σ_{i=1}^n (y_i − x_i^T w)²   s.t.  ||w||_1 ≤ δ ,

  where f_i(·) = (1/2)(y_i − ·)² and C := {w : ||w||_1 ≤ δ}

• Sparse Logistic Regression

      min_w  (1/n) Σ_{i=1}^n ln(1 + exp(−y_i x_i^T w))   s.t.  ||w||_1 ≤ δ ,

  where f_i(·) = ln(1 + exp(−y_i ·)) and C := {w : ||w||_1 ≤ δ}

• Low-Rank Matrix Completion

      min_{W ∈ R^{n×p}}  (1/2|Ω|) Σ_{(i,j)∈Ω} (M_{i,j} − W_{i,j})²   s.t.  ||W||_Nuc ≤ δ ,

  where f_{(i,j)}(·) = (1/2)(· − M_{i,j})² and C := {W ∈ R^{n×p} : ||W||_Nuc ≤ δ}

• Many more examples can be found in [Jaggi 2013], for instance
4/31
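As a concrete illustration (ours, not from the slides), here is a minimal NumPy sketch of the LMO for the ℓ1 ball C = {w : ||w||_1 ≤ δ} used in the first two examples: a linear function ⟨v, w⟩ is minimized over the ball at a vertex, i.e. ±δ on the coordinate where |v| is largest.

```python
import numpy as np

def lmo_l1_ball(v, delta):
    """LMO for C = {w : ||w||_1 <= delta}: argmin_{w in C} <v, w>.

    The minimum of a linear function over the l1 ball is attained at a
    vertex, i.e. +/- delta on the coordinate of largest |v_i|.
    """
    s = np.zeros_like(v, dtype=float)
    i = np.argmax(np.abs(v))
    s[i] = -delta * np.sign(v[i])
    return s

# Example: the LMO returns a 1-sparse vertex of the l1 ball.
v = np.array([0.3, -2.0, 1.1])
print(lmo_l1_ball(v, delta=5.0))   # -> [0., 5., 0.]
```

For the matrix-completion example, the analogous LMO over the nuclear-norm ball returns δ times a rank-one matrix built from the top singular vector pair of −V.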
The Frank-Wolfe algorithm
Frank-Wolfe: What is it?
Problem: smooth f , compact and convex D

      arg min_{x ∈ D} f(x)

! Beware of the notation change in this section !

Algorithm 1: Frank-Wolfe (FW)
  for t = 0, 1, . . . do
      s_t ∈ arg min_{s ∈ D} ⟨∇f(x_t), s⟩
      Choose step-size γ_t
      x_{t+1} = (1 − γ_t) x_t + γ_t s_t
5/31
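To make the template concrete, here is a minimal sketch (ours, not the slides') of the FW loop in NumPy. It assumes two callables that are not defined here: grad(x) returning ∇f(x) and lmo(g) returning arg min_{s ∈ D} ⟨g, s⟩ (for example the ℓ1-ball LMO sketched above), and uses the γ_t = 2/(t+2) step size discussed on the next slide.

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, max_iter=100):
    """Minimal Frank-Wolfe loop: x_{t+1} = (1 - gamma_t) x_t + gamma_t s_t.

    grad : callable returning the gradient of f at x
    lmo  : callable returning argmin_{s in D} <g, s>
    """
    x = np.asarray(x0, dtype=float).copy()
    for t in range(max_iter):
        s = lmo(grad(x))                    # linear minimization step
        gamma = 2.0 / (t + 2.0)             # "recent standard" step size
        x = (1.0 - gamma) * x + gamma * s   # convex combination stays in D
    return x
```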
Some Step-size Rules/Strategies
• "Recent standard": γ_t = 2/(t + 2)
• Exact line-search: γ_t = arg min_{γ ∈ [0,1]} f(x_t + γ(s_t − x_t))
• QA (Quadratic Approximation) step-size:

      γ_t = min{ 1 , −∇f(x_t)^T (s_t − x_t) / (L ||s_t − x_t||²) }

  See [Demyanov & Rubinov 1970]
• Simple averaging: γ_t = 1/(t + 1)
• Constant step-size: γ_t = γ for some given γ ∈ [0, 1]
• Dynamic strategy: determined by some history of optimality bounds,
  see [Freund & Grigas 2014]
6/31
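For reference, a small sketch (ours; function names and signatures are illustrative only) of two of these rules in Python: the "recent standard" decreasing step and the QA step, which needs the gradient, the FW vertex, and the smoothness constant L.

```python
import numpy as np

def stepsize_standard(t):
    """'Recent standard' decreasing step size: 2 / (t + 2)."""
    return 2.0 / (t + 2.0)

def stepsize_qa(grad_x, x, s, L):
    """Quadratic-approximation (Demyanov & Rubinov) step size,
    clipped to [0, 1] for safety."""
    d = s - x
    denom = L * np.dot(d, d)
    if denom == 0.0:
        return 0.0
    return min(1.0, max(0.0, -np.dot(grad_x, d) / denom))
```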
Simple Computational Guarantee for Frank-Wolfe
Here is a simple computational guarantee:

A Computational Guarantee for the Frank-Wolfe algorithm
If the step-size sequence {γ_t} is chosen by the recent standard, the QA
rule, or by exact line-search, then for all t ≥ 1 it holds that:

      f(x_t) − f* ≤ 1 / ( 1/(f(x_0) − f*) + t/(2K) ) < 2K/t ,

where K = L · diam(C)², and f is convex and L-smooth.

Related guarantees also hold for other types of step-size strategies.
7/31
Frank-Wolfe: When do we use it?
• Projection-free. Linear subproblems vs. quadratic for projected
  gradient descent (PGD).

      min_{x ∈ D} ⟨g, x⟩   vs.   min_{x ∈ D} ||y − x||²_2

• Solution of linear subproblem: extremal element of D.
• Sparse representation: x_t is a convex combination of at most t elements.

Recent Applications
• Learning the structure of a neural network. Ping, Liu, and Ihler, 2016
• Attention mechanisms that enforce sparsity. Niculae, 2018
• ℓ1-constrained problems with an extreme number of features.
  Kerdreux, Pedregosa, and d'Aspremont, 2018
8/31
A practical issue for FW
• For large n (number of samples), we need a stochastic variant of FW
• A naïve SGD-like algorithm fails in practice and in theory
• State-of-the-art bounds on suboptimality after t iterations:
  O(n/t) and O(1/t^{1/3})
  Lu and Freund, 2020; Mokhtari, Hassani, and Karbasi, 2018

Can we do better?
9/31
Practical Stochastic Frank-Wolfe:
a primal-dual point of view
Problem setting:

OPT:   min_{w ∈ C}  (1/n) Σ_{i=1}^n f_i(x_i^T w)

• f_i(·) is the univariate loss function of observation/sample i, for i ∈ [n]
• n is the number of observations/samples
• C ⊂ R^d is a compact convex set
• d is the dimension of the model variable w

The particular structural dependence of the losses on x_i^T w makes this a model
with "generalized linear structure" or "linear prediction".
10/31
Deterministic FW: Gradient Computation for OPT
OPT:   f* := min_{w ∈ C} F(w) = (1/n) Σ_{i=1}^n f_i(x_i^T w)

Assumptions
• f_i(·) is L-smooth for i ∈ [n]: ∀ z, z',  |f_i'(z) − f_i'(z')| ≤ L|z − z'|
• Linear Minimization Oracle LMO(r): s ← arg min_{w ∈ C} ⟨r, w⟩

Denote X := [x_1 ; x_2 ; . . . ; x_n ]

Gradient Computation

      ∇F(w) = (1/n) Σ_{i=1}^n x_i · f_i'(x_i^T w) = X^T α ,   where α_i ← (1/n) f_i'(x_i^T w), i ∈ [n]

Gradient computation is O(nd) operations (expensive when n ≫ 0 . . .)
11/31
Frank-Wolfe for OPT:
OPT:   f* := min_{w ∈ C} F(w) := (1/n) Σ_{i=1}^n f_i(x_i^T w)

Frank-Wolfe algorithm for OPT:
Initialize at w_0 ∈ C, t ← 0.
At iteration t:
1. Compute ∇F(w_{t−1}):
   • α_t^i ← (1/n) f_i'(x_i^T w_{t−1}) for EVERY i ∈ [n]
   • r_t = X^T α_t  (= ∇F(w_{t−1}))
2. Compute s_t ← LMO(r_t).
3. Set w_t ← w_{t−1} + γ_t (s_t − w_{t−1}), where γ_t ∈ [0, 1].

Iteration cost is O(nd) operations (expensive when n ≫ 0 . . .)
12/31
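In matrix form, step 1 above is a single pass over all n samples. A minimal NumPy sketch (ours) of this O(nd) computation, assuming a vectorized f_prime that maps the vector of predictions z to the vector of derivatives (f_prime(z)[i] = f_i'(z_i)):

```python
import numpy as np

def full_gradient(X, w, f_prime):
    """Deterministic FW gradient: grad F(w) = X^T alpha,
    with alpha_i = (1/n) f_i'(x_i^T w).  Total cost: O(n d)."""
    n = X.shape[0]
    z = X @ w               # all predictions x_i^T w, O(n d)
    alpha = f_prime(z) / n  # elementwise derivatives, O(n)
    return X.T @ alpha      # r = X^T alpha, O(n d)
```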
A Naïve Frank-Wolfe (SFW) Strategy

OPT:   f* := min_{w ∈ C} F(w) := (1/n) Σ_{i=1}^n f_i(x_i^T w)

Frank-Wolfe algorithm for OPT:
Initialize at w_0 ∈ C, t ← 0.
At iteration t:
1. Compute ∇F(w_{t−1}):
   • α_t^i ← (1/n) f_i'(x_i^T w_{t−1}) for ONE i ∈ [n]  (α_t^j = 0 for j ≠ i)
   • r_t = X^T α_t = x_i f_i'(x_i^T w_{t−1})
2. Compute s_t ← LMO(r_t).
3. Set w_t ← w_{t−1} + γ_t (s_t − w_{t−1}), where γ_t ∈ [0, 1].

This approach does not work without growing the batch size [Hazan]
13/31
Our Frank-Wolfe (SFW) Strategy
OPT:   f* := min_{w ∈ C} F(w) := (1/n) Σ_{i=1}^n f_i(x_i^T w)

Frank-Wolfe algorithm for OPT:
Initialize at w_0 ∈ C, t ← 0.
At iteration t:
1. Compute ∇F(w_{t−1}):
   • α_t^i ← (1/n) f_i'(x_i^T w_{t−1}) for ONE i ∈ [n]  (α_t^j = α_{t−1}^j for j ≠ i)
   • r_t = X^T α_t = r_{t−1} + x_i (α_t^i − α_{t−1}^i)
2. Compute s_t ← LMO(r_t).
3. Set w_t ← w_{t−1} + γ_t (s_t − w_{t−1}), where γ_t ∈ [0, 1].

Iteration cost is O(d) operations! Memory cost is O(d + n)
14/31
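Below is a minimal NumPy sketch (ours; the authors' implementation lives in copt, linked in the conclusion) of this strategy: keep the table α in memory, refresh one entry per iteration, and update the gradient estimate r_t = X^T α_t in O(d). Here f_prime(i, z) is assumed to return the scalar derivative f_i'(z).

```python
import numpy as np

def stochastic_fw(X, w0, lmo, f_prime, max_iter=1000, rng=None):
    """Sketch of the proposed Stochastic Frank-Wolfe (SFW).

    f_prime(i, z): derivative f_i'(z) of the i-th univariate loss.
    Memory is O(n + d): the table alpha and the estimate r = X^T alpha.
    Each iteration costs O(d) plus one LMO call.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w = np.asarray(w0, dtype=float).copy()
    alpha = np.zeros(n)                    # stale scaled derivatives
    r = np.zeros(d)                        # r = X^T alpha, gradient estimate
    for t in range(1, max_iter + 1):
        i = rng.integers(n)                # pick ONE sample
        a_new = f_prime(i, X[i] @ w) / n   # refresh its entry of alpha
        r += X[i] * (a_new - alpha[i])     # O(d) incremental update of r
        alpha[i] = a_new
        s = lmo(r)                         # LMO on the current estimate
        gamma = 2.0 / (t + 2)              # step size from the analysis
        w = (1.0 - gamma) * w + gamma * s
    return w
```

For instance, for the LASSO example one would take f_prime(i, z) = z − y[i] and use the ℓ1-ball LMO sketched earlier.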
Motivation: a Primal-Dual Lens for Constructing FW
Recall the definition of the conjugate of a function f :

      f*(α) := max_{x ∈ dom f(·)} { α^T x − f(x) }

• If f is a closed convex function, then f** = f
• f(x) := max_{α ∈ dom f*(·)} { α^T x − f*(α) } , and
• When f is differentiable, it holds that

      ∇f(x) ← α ,  where  α ← arg max_{β ∈ dom f*(·)} { β^T x − f*(β) } .
15/31
Motivation: a Primal-Dual Lens for Constructing FW
Using conjugacy we can reformulate OPT as:

OPT:   min_{w ∈ C} f(Xw) = min_{w ∈ C} max_{α ∈ R^n} L(w, α) := ⟨Xw, α⟩ − f*(α)

Given w_{t−1}, we construct the gradient of f(Xw) at w_{t−1} by maximizing over
the dual variable α:

      α_t ∈ arg max_{α ∈ R^n} { L(w_{t−1}, α) = ⟨Xw_{t−1}, α⟩ − f*(α) }
      ⇐⇒  ∇f(Xw_{t−1}) = X^T α_t

Then the LMO step corresponds to fixing the dual variable and minimizing
over the primal variable w:

      s_t ← arg min_{w ∈ C} L(w, α_t) = ⟨w, X^T α_t⟩ − f*(α_t)
      ⇐⇒  s_t ← LMO(X^T α_t)

Finally,

      w_t = (1 − γ_t) w_{t−1} + γ_t s_t
16/31
Results: Theory and Practice
Stochastic FW Methods for Finite-Sum Minimization
Computational complexity: dependence on the iteration count t and the sample size n

Algorithm                              Complexity Bound     Non-Convex Case
Frank-Wolfe (deterministic) (1956)     O(n/t)               O(n/√t)
Reddi et al. (2016)                    —                    O(n + n^{1/3}/√t)
Mokhtari et al. (2018)                 O(1/t^{1/3})         —
Lu and Freund (2018)                   O(n/t)               —
This work                              O((D_1/D_∞)/t)       —

D_1/D_∞ ≤ n, and D_1/D_∞ ≪ n in many recognizable instances.
17/31
Theoretical guarantees: convex case
Define the ℓp-norm "diameter" of C to be D_p := max_{w,v ∈ C} ||X(w − v)||_p

Theorem: Computational Complexity of Novel Stochastic Frank-Wolfe Algorithm
Let H_0 := ||α_0 − ∇f(Xw_0)||_1 be the initial error of the gradient ∇f, and let
the step-size rule be γ_t = 2/(t + 2). For t ≥ 2, it holds that:

      E[f(Xw_t) − f*] ≤ 2(f(Xw_0) − f*) / ((t + 1)(t + 2))
                        + ( 2L D_2²·(1/n) + 8L D_1 D_∞·(n − 1)/n ) · t / ((t + 1)(t + 2))
                        + ( 2 D_∞ H_0 + 64 L D_1 D_∞ ) n² / ((t + 1)(t + 2)) .

Let us see what this bound is really about . . .
18/31
Theoretical guarantees: convex case
ℓp-norm "diameter" of C is D_p := max_{w,v ∈ C} ||X(w − v)||_p
Define Ratio := D_1/D_∞ and note that Ratio ≤ n

The expected optimality gap bound is:

      2(f(Xw_0) − f*) / ((t + 1)(t + 2))
        + ( 2L D_2²·(1/n) + 8L D_1 D_∞·(n − 1)/n ) · (1/t)
        + ( 2 D_∞ H_0 + 64 L D_1 D_∞ ) n² / ((t + 1)(t + 2))

      = O( (f(Xw_0) − f*) / t² ) + O( L D_∞² (1 + Ratio) / t ) + O( (D_∞ H_0 + L D_∞² Ratio) n² / t² )

      ≤ O( L D_∞² Ratio / t )  ≤  O( n/t )
19/31
On the Ratio D1/D∞
Here is a plot of D_1/(n·D_∞) for benchmark datasets

[Figure: D_1/(n·D_∞) across benchmark datasets]
25/31
Algorithms and Datasets
We compared three Stochastic Frank-Wolfe algorithms:
• Mokhtari et al. [2018]
• Lu and Freund [2018]
• Novel Stochastic Frank-Wolfe (this work)
We report here on ℓ1-constrained regression problems using the following datasets:

Problem type          Dataset              δ     d      n      D_1/(n·D_∞)
logistic regression   breast-cancer        5     10     683    0.929
logistic regression   rcv1 (train)         100   47236  20242  0.021
linear regression     California housing   0.1   8      20640  0.040
26/31
Computational Experiments
[Figure: relative suboptimality versus number of sampled gradients processed, on three
datasets (Breast Cancer, RCV1, California Housing), comparing Mokhtari et al. (2018),
Lu & Freund (2018), and this work.]
27/31
Proof sketch (convex, smooth case)
A key lemma reduces the proof to that of deterministic Frank-Wolfe,
with step size γ_t = 2/(t + 2).

Key Lemma (convex, smooth)
Let ε_t = f(Xw_t) − f(Xw*). For any direction α_t, with s_t = LMO(X^T α_t), we have

      ε_t ≤ (1 − γ_t) ε_{t−1} + γ_t² · L D_2² / (2n) + γ_t D_∞ ||α_t − ∇f(Xw_{t−1})||_1 .

Additionally, we show that E ||α_t − ∇f(Xw_{t−1})||_1 = O(L D_1 / t).
This fact does not require convexity!
29/31
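A rough sketch (ours, under the lemma's assumptions, not the authors' full argument) of how these two facts combine: with γ_t = 2/(t + 2), both error terms in the recursion are O(1/t²),

      E[ε_t] ≤ (1 − γ_t) E[ε_{t−1}] + γ_t² · L D_2²/(2n) + γ_t D_∞ · O(L D_1/t)
             ≤ (1 − γ_t) E[ε_{t−1}] + O( L (D_2²/n + D_1 D_∞) / t² ) ,

and an induction of the usual Frank-Wolfe type over this recursion gives
E[ε_t] = O( L (D_2²/n + D_1 D_∞) / t ), matching the dominant O(1/t) term of the theorem.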
Questions to answer
• Linear convergence for FW variants (Stochastic Away steps,
Pairwise)
• Explain the observed asymptotic accelerated rate
• Algorithm in the dual space: it is not Stochastic Coordinate
Mirror Descent
30/31
Conclusion
• A practical, fast version of Stochastic Frank-Wolfe
• Hyperparameter-free
• Implementation available in
https://github.com/openopt/copt
• Use FW when the structure of your problem demands it!
Thanks for your attention
31/31
References
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018).
“Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International
Conference on Machine Learning.
Lu, Haihao and Robert Michael Freund (2020). "Generalized stochastic Frank-Wolfe
algorithm with stochastic substitute gradient for structured convex optimization".
In: Math. Program.
Mokhtari, Aryan, Hamed Hassani, and Amin Karbasi (2018). “Stochastic Conditional
Gradient Methods: From Convex Minimization to Submodular Maximization”. In:
ArXiv abs/1804.09554.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”.
In: International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with
Frank-Wolfe”. In: Advances in Neural Information Processing Systems.
31/31