Stochastic Frank-Wolfe for
Constrained Finite-Sum Minimization
Geoffrey Négiar, 2
Gideon Dresdner, 1
Alicia Yi-Ting Tsai,
Laurent El Ghaoui, 2
Francesco Locatello, 3
Robert Freund,
Fabian Pedregosa
June 12th, 2020. ICML, Online
1 University of California, Berkeley; 2 ETH Zurich; 3 MIT;
4 Google Research, Montréal; 5 SumUp Analytics
Motivation: Obtain a practical, fast version of Stochastic
Frank-Wolfe for finite-sum problems.
1. Problem of interest and setting.
2. The Frank-Wolfe algorithm.
3. Stochastic Frank-Wolfe. Making Stochastic Frank-Wolfe
practical: a primal-dual view.
4. Results. Convergence rates in theory and in practice.
Problem of Interest and Setting
Problem of Interest
The problem of interest is
OPT: $f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
• fi (·) is the univariate loss function of observation/sample i for i ∈ [n]
• n is the number of observations/samples
• $C \subset \mathbb{R}^d$ is a compact convex set
• d is the order (dimension) of the model variable w
The particular structural dependence of the losses on $x_i^\top w$ is a model
with “generalized linear structure” or “linear prediction”
• For i = 1, . . . , n, the univariate function $f_i(\cdot)$ is L-smooth, namely
for all a, b ∈ R it holds that
$|f_i'(a) - f_i'(b)| \le L|a - b|$
• the Linear Minimization Oracle LMO(v):
$s \leftarrow \arg\min_{w \in C} \langle v, w \rangle$
returns an optimal solution and is easily solved for any v
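As an illustration of why the LMO can be cheap, here is a minimal sketch for one particular choice of C, the ℓ1 ball of radius δ that appears in the examples of this talk. The function name `lmo_l1_ball` and the test vector are hypothetical, chosen for this sketch only.

```python
def lmo_l1_ball(v, delta):
    """Return a minimizer of <v, w> over the l1 ball {w : ||w||_1 <= delta}."""
    # A linear function over the l1 ball attains its minimum at a signed
    # vertex: all of the mass delta goes on a coordinate with largest |v_j|,
    # with sign opposite to v_j.
    j = max(range(len(v)), key=lambda k: abs(v[k]))
    s = [0.0] * len(v)
    if v[j] != 0.0:
        s[j] = -delta if v[j] > 0 else delta
    return s


v = [0.5, -3.0, 1.0]
s = lmo_l1_ball(v, delta=2.0)
value = sum(vj * sj for vj, sj in zip(v, s))  # equals -delta * max_j |v_j|
```

The cost is a single pass over v, and the returned point has at most one nonzero coordinate.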
Some Examples in Statistical and Machine Learning
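• ℓ1-constrained least squares (LASSO)
$\min_{w} \frac{1}{2n} \sum_{i=1}^{n} (y_i - x_i^\top w)^2$
s.t. $\|w\|_1 \le \delta$,
where $f_i(\cdot) = \frac{1}{2}(y_i - \cdot)^2$ and $C := \{w : \|w\|_1 \le \delta\}$
• Sparse Logistic Regression
$\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i x_i^\top w))$
s.t. $\|w\|_1 \le \delta$,
where $f_i(\cdot) = \ln(1 + \exp(-y_i \, \cdot))$, $C := \{w : \|w\|_1 \le \delta\}$
• Low-Rank Matrix Completion
$\min_{W \in \mathbb{R}^{n \times p}} \frac{1}{2|\Omega|} \sum_{(i,j) \in \Omega} (M_{i,j} - W_{i,j})^2$
s.t. $\|W\|_{Nuc} \le \delta$,
where $f_{(i,j)}(\cdot) = \frac{1}{2}(\cdot - M_{i,j})^2$ and $C := \{W \in \mathbb{R}^{n \times p} : \|W\|_{Nuc} \le \delta\}$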
• Many more examples can be found in [Jaggi 2013] for instance
The Frank-Wolfe algorithm
Frank-Wolfe: What is it?
Problem: smooth $f$, compact and convex $D$
$\arg\min_{x \in D} f(x)$
! Beware of the notation change in this section !
Algorithm 1: Frank-Wolfe (FW)
for t = 0, 1, . . . do
  $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
  Choose step-size $\gamma_t$.
  $x_{t+1} = (1 - \gamma_t) x_t + \gamma_t s_t$
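The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the toy objective f(x) = ½‖x − b‖², the ℓ1-ball constraint set, the step-size 2/(t + 2), and all function names are chosen for this sketch.

```python
def frank_wolfe(grad, lmo, x0, n_iters):
    """Run n_iters Frank-Wolfe steps with the step-size 2 / (t + 2)."""
    x = list(x0)
    for t in range(n_iters):
        g = grad(x)                      # gradient at the current iterate
        s = lmo(g)                       # s_t in argmin_{s in D} <g, s>
        gamma = 2.0 / (t + 2)            # step-size rule (an assumption here)
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x


# Toy instance (hypothetical): f(x) = 0.5 * ||x - b||^2 over the l1 ball
# of radius delta.
b = [1.0, -0.2, 0.0]
delta = 1.0


def f(x):
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))


def grad_f(x):
    return [xi - bi for xi, bi in zip(x, b)]


def lmo_l1(g):
    # Signed vertex of the l1 ball opposite to the largest |g_j|.
    j = max(range(len(g)), key=lambda k: abs(g[k]))
    s = [0.0] * len(g)
    if g[j] != 0.0:
        s[j] = -delta if g[j] > 0 else delta
    return s


x_fw = frank_wolfe(grad_f, lmo_l1, x0=[0.0, 0.0, 0.0], n_iters=2000)
```

On this instance the constrained minimizer is the projection of b onto the ball, (0.9, −0.1, 0), with optimal value 0.01. Every iterate stays feasible because it is a convex combination of feasible points, so no projection is ever computed.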
Some Step-size Rules/Strategies
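• “Recent standard”: $\gamma_t = \frac{2}{t+2}$
• Exact line-search: $\gamma_t = \arg\min_{\gamma \in [0,1]} f(x_t + \gamma(s_t - x_t))$
• QA (Quadratic Approximation) step-size:
$\gamma_t = \min\left\{1, \frac{-\nabla f(x_t)^\top (s_t - x_t)}{L \|s_t - x_t\|^2}\right\}$
See [Demyanov & Rubinov 1970]
• Simple averaging: $\gamma_t = \frac{1}{t+1}$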
• Constant step-size: γt = γ for some given γ ∈ [0, 1]
• Dynamic strategy: determined by some history of optimality
bounds, see [Freund & Grigas 2014]
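Three of these rules are simple enough to write down directly. The sketch below assumes vectors are plain Python lists and that the smoothness constant L is known; the function names are hypothetical. The clip at 0 in the QA rule is a defensive addition of this sketch (the ratio is already nonnegative when s comes from the LMO).

```python
def step_standard(t):
    """'Recent standard' rule: gamma_t = 2 / (t + 2)."""
    return 2.0 / (t + 2)


def step_averaging(t):
    """Simple averaging rule: gamma_t = 1 / (t + 1)."""
    return 1.0 / (t + 1)


def step_quadratic_approx(grad, x, s, L):
    """QA rule: minimize the quadratic upper bound along s - x over [0, 1]."""
    d = [si - xi for si, xi in zip(s, x)]
    sq_norm = sum(di * di for di in d)
    if sq_norm == 0.0:
        return 0.0                      # s == x: no direction to move along
    decrease = -sum(gi * di for gi, di in zip(grad, d))
    return max(0.0, min(1.0, decrease / (L * sq_norm)))


gamma_small_L = step_quadratic_approx([1.0, 0.0], [0.0, 0.0], [-1.0, 0.0], L=0.5)
gamma_large_L = step_quadratic_approx([1.0, 0.0], [0.0, 0.0], [-1.0, 0.0], L=4.0)
```

With a small L the quadratic model allows a full step (γ = 1), while a larger L shortens the step.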
Simple Computational Guarantee for Frank-Wolfe
Here is a simple computational guarantee:
A Computational Guarantee for the Frank-Wolfe algorithm
If the step-size sequence {γt} is chosen by the recent standard, the QA
rule, or by exact line-search, then for all t ≥ 1 it holds that:
$f(x_t) - f^* \le \dfrac{1}{\dfrac{1}{f(x_0) - f^*} + \dfrac{t}{2K}}$
where $K = L \cdot \mathrm{diam}(C)^2$, and f is convex and L-smooth.
Related guarantees also hold for other types of step-size strategies.
Frank-Wolfe: When do we use it?
• Projection-free. Linear subproblems vs. quadratic for
projected gradient descent (PGD):
$\min_{x \in D} g^\top x$ vs. $\min_{x \in D} \|y - x\|^2$
• Solution of linear subproblem: extremal element of D.
• Sparse representation: $x_t$ is a convex combination of at most t elements.
Recent Applications
• Learning the structure of a neural network. Ping, Liu, and
Ihler, 2016
• Attention mechanisms that enforce sparsity. Niculae, 2018
• ℓ1-constrained problems with extreme number of features.
Kerdreux, Pedregosa, and d’Aspremont, 2018
A practical issue for FW
• For large n (number of samples), we need a stochastic variant
of FW
• Naïve SGD-like algorithm fails in practice and in theory
• State of the art bounds on suboptimality after t iterations:
$O(n/t)$ and $O(1/\sqrt[3]{t})$
Lu and Freund, 2020; Mokhtari, Hassani, and Karbasi, 2018
Can we do better?
Practical Stochastic Frank-Wolfe:
a primal-dual point of view
Problem setting:
OPT: $\min_{w \in C} \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
• $f_i(\cdot)$ is the univariate loss function of observation/sample i for
i ∈ [n]
• n is the number of observations/samples
• $C \subset \mathbb{R}^d$ is a compact convex set
• d is the order (dimension) of the model variable w
The particular structural dependence of the losses on $x_i^\top w$ is a
model with “generalized linear structure” or “linear prediction”
Deterministic FW: Gradient Computation for OPT
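$f^* := \min_{w \in C} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
• $f_i(\cdot)$ is L-smooth for i ∈ [n]: $\forall z, z'$, $|f_i'(z) - f_i'(z')| \le L|z - z'|$
• Linear Minimization Oracle LMO(r): $s \leftarrow \arg\min_{w \in C} \langle r, w \rangle$
Denote $X := [x_1^\top; x_2^\top; \ldots; x_n^\top]$
Gradient Computation
$\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} x_i \cdot f_i'(x_i^\top w) = X^\top \alpha$ where $\alpha^i \leftarrow \frac{1}{n} f_i'(x_i^\top w)$, i ∈ [n]
Gradient computation is O(nd) operations (expensive when n ≫ 0 . . .)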
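To make the identity ∇F(w) = Xᵀα concrete, here is a small sketch for the logistic loss f_i(z) = ln(1 + exp(−y_i z)) from the examples. The data, the point w, and the function names are hypothetical; the code only illustrates the two-stage computation (scalar derivatives first, then one product with Xᵀ).

```python
import math


def logistic_derivative(z, y):
    """Derivative in z of f_i(z) = ln(1 + exp(-y * z))."""
    return -y / (1.0 + math.exp(y * z))


def gradient_via_alpha(X, y, w):
    """Return (X^T alpha, alpha) with alpha_i = f_i'(x_i^T w) / n."""
    n, d = len(X), len(w)
    alpha = [
        logistic_derivative(sum(xij * wj for xij, wj in zip(X[i], w)), y[i]) / n
        for i in range(n)
    ]
    # One pass over all n rows: this is the O(nd) cost mentioned above.
    grad = [sum(alpha[i] * X[i][j] for i in range(n)) for j in range(d)]
    return grad, alpha


def objective(X, y, w):
    n = len(X)
    return sum(
        math.log(1.0 + math.exp(-y[i] * sum(a * b for a, b in zip(X[i], w))))
        for i in range(n)
    ) / n


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
w = [0.3, -0.2]
grad, alpha = gradient_via_alpha(X, y, w)
```

The vector α has one entry per sample, which is what makes it possible to refresh a single coordinate at a time in a stochastic variant.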
Frank-Wolfe for OPT:
$f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
Frank-Wolfe algorithm for OPT:
Initialize at $w_0 \in C$, t ← 0.
At iteration t:
1. Compute $\nabla F(w_{t-1})$:
• $\alpha^i_t \leftarrow \frac{1}{n} f_i'(x_i^\top w_{t-1})$ for EVERY i ∈ [n]
• $r_t = X^\top \alpha_t$ $(= \nabla F(w_{t-1}))$
2. Compute $s_t \leftarrow \mathrm{LMO}(r_t)$.
3. Set $w_t \leftarrow w_{t-1} + \gamma_t (s_t - w_{t-1})$, where $\gamma_t \in [0, 1]$.
Iteration cost is O(nd) operations (expensive when n ≫ 0 . . .)
A Naïve Frank-Wolfe (SFW) Strategy
$f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
Frank-Wolfe algorithm for OPT:
Initialize at $w_0 \in C$, t ← 0.
At iteration t:
1. Compute $\nabla F(w_{t-1})$:
• $\alpha^i_t \leftarrow \frac{1}{n} f_i'(x_i^\top w_{t-1})$ for ONE i ∈ [n] ($\alpha^j_t = 0$ for $j \ne i$)
• $r_t = X^\top \alpha_t = \frac{1}{n} x_i f_i'(x_i^\top w_{t-1})$
2. Compute $s_t \leftarrow \mathrm{LMO}(r_t)$.
3. Set $w_t \leftarrow w_{t-1} + \gamma_t (s_t - w_{t-1})$, where $\gamma_t \in [0, 1]$.
This approach does not work without growing the batch size [Hazan]
Our Frank-Wolfe (SFW) Strategy
$f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
Frank-Wolfe algorithm for OPT:
Initialize at $w_0 \in C$, t ← 0.
At iteration t:
1. Compute $\nabla F(w_{t-1})$:
• $\alpha^i_t \leftarrow \frac{1}{n} f_i'(x_i^\top w_{t-1})$ for ONE i ∈ [n] ($\alpha^j_t = \alpha^j_{t-1}$ for $j \ne i$)
• $r_t = X^\top \alpha_t = r_{t-1} + x_i (\alpha^i_t - \alpha^i_{t-1})$
2. Compute $s_t \leftarrow \mathrm{LMO}(r_t)$.
3. Set $w_t \leftarrow w_{t-1} + \gamma_t (s_t - w_{t-1})$, where $\gamma_t \in [0, 1]$.
Iteration cost is O(d) operations! Memory cost is O(d + n)
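The three steps above can be sketched as follows for ℓ1-constrained least squares, where f_i(z) = ½(y_i − z)² and f_i'(z) = z − y_i. Several choices here are assumptions of the sketch rather than statements from the slides: uniform sampling of i, the initialization α_0 = 0 (hence r_0 = 0), the step-size γ_t = 2/(t + 2), the toy data, and every function name.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def lmo_l1_ball(r, delta):
    # Signed vertex of the l1 ball opposite to the largest |r_j|.
    j = max(range(len(r)), key=lambda k: abs(r[k]))
    s = [0.0] * len(r)
    if r[j] != 0.0:
        s[j] = -delta if r[j] > 0 else delta
    return s


def stochastic_frank_wolfe(X, y, delta, n_iters, seed=0):
    """SFW sketch for least squares, f_i(z) = 0.5 * (y_i - z)^2."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d          # w_0 = 0 lies in the l1 ball
    alpha = [0.0] * n      # assumed initialization alpha_0 = 0
    r = [0.0] * d          # r_0 = X^T alpha_0 = 0
    for t in range(1, n_iters + 1):
        i = rng.randrange(n)                       # ONE sampled index
        new_alpha_i = (dot(X[i], w) - y[i]) / n    # f_i'(x_i^T w_{t-1}) / n
        diff = new_alpha_i - alpha[i]
        # r_t = r_{t-1} + x_i * (alpha_t^i - alpha_{t-1}^i): O(d) work.
        r = [rj + xij * diff for rj, xij in zip(r, X[i])]
        alpha[i] = new_alpha_i
        s = lmo_l1_ball(r, delta)
        gamma = 2.0 / (t + 2)
        w = [wj + gamma * (sj - wj) for wj, sj in zip(w, s)]
    return w, alpha, r


def objective(X, y, w):
    return sum(0.5 * (yi - dot(xi, w)) ** 2 for xi, yi in zip(X, y)) / len(X)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w_true = [0.5, -0.3]
y = [dot(xi, w_true) for xi in X]
w_sfw, alpha, r = stochastic_frank_wolfe(X, y, delta=1.0, n_iters=20000)
```

The only state carried between iterations is w and r (size d) and α (size n), matching the O(d + n) memory cost. Each iteration touches a single row of X.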
Motivation: a Primal-Dual Lens for Constructing FW
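Recall the definition of the conjugate of a function f:
$f^*(\alpha) := \max_{x \in \mathrm{dom} f(\cdot)} \{\alpha^\top x - f(x)\}$
• If f is a closed convex function, then $f^{**} = f$
• $f(x) := \max_{\alpha \in \mathrm{dom} f^*(\cdot)} \{\alpha^\top x - f^*(\alpha)\}$, and
• When f is differentiable, it holds that
$\nabla f(x) \leftarrow \alpha$ where $\alpha \leftarrow \arg\max_{\beta \in \mathrm{dom} f^*(\cdot)} \{\beta^\top x - f^*(\beta)\}$.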
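A one-dimensional worked example may help fix these three facts. The choice f(x) = ½x² is made for this illustration only and is not taken from the slides.

```latex
f(x) = \tfrac{1}{2}x^2
\;\Longrightarrow\;
f^*(\alpha) = \max_{x}\{\alpha x - \tfrac{1}{2}x^2\} = \tfrac{1}{2}\alpha^2
\quad (\text{maximizer } x = \alpha),
\\
f^{**}(x) = \max_{\alpha}\{\alpha x - \tfrac{1}{2}\alpha^2\} = \tfrac{1}{2}x^2 = f(x),
\qquad
\arg\max_{\beta}\{\beta x - \tfrac{1}{2}\beta^2\} = x = f'(x).
```

In this case the dual maximizer is exactly the derivative of f at x, which is the mechanism used to read the gradient off the dual variable.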
Motivation: a Primal-Dual Lens for Constructing FW
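Using conjugacy we can reformulate OPT as:
OPT: $\min_{w \in C} f(Xw) = \min_{w \in C} \max_{\alpha} \{ L(w, \alpha) := \langle Xw, \alpha \rangle - f^*(\alpha) \}$
Given $w_{t-1}$ we construct the gradient of f(Xw) at $w_{t-1}$ by maximizing over
the dual variable α:
$\alpha_t \in \arg\max_{\alpha} \{ L(w_{t-1}, \alpha) = \langle Xw_{t-1}, \alpha \rangle - f^*(\alpha) \}$
$\iff \nabla_w f(Xw_{t-1}) = X^\top \alpha_t$
Then the LMO step corresponds to fixing the dual variable and minimizing
over the primal variable w:
$s_t \leftarrow \arg\min_{w \in C} \{ L(w, \alpha_t) = \langle w, X^\top \alpha_t \rangle - f^*(\alpha_t) \}$
$\iff s_t \leftarrow \mathrm{LMO}(X^\top \alpha_t)$
$w_t = (1 - \gamma_t) w_{t-1} + \gamma_t s_t$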
Results: Theory and Practice
Stochastic FW Methods for Finite-Sum Minimization
Computational complexity dependency on iterations t and sample size n
Algorithm Complexity Bound Non-Convex Case
Frank-Wolfe (deterministic) (1956) O
Reddi et al. (2016) O n +
Mokhtari et al. (2018) O
Lu and Freund (2018) O
This work O
$D_1/D_\infty \le n$, and $D_1/D_\infty \ll n$ in many recognizable instances.
Theoretical guarantees: convex case
Define the ℓp norm “diameter” of C to be $D_p := \max_{w, v \in C} \|X(w - v)\|_p$
Theorem: Computational Complexity of Novel Stochastic
Frank-Wolfe Algorithm
Let $H_0 = \|\alpha_0 - \nabla f(Xw_0)\|_1$ be the initial error of the gradient ∇f, and let
the step-size rule be $\gamma_t = \frac{2}{t+2}$. For t ≥ 2, it holds that:
$E[f(Xw_t) - f^*] \le \dfrac{2(f(Xw_0) - f^*)}{(t+1)(t+2)} + \left( \dfrac{2LD_2^2}{n} + 8LD_1D_\infty \dfrac{n-1}{n} \right) \dfrac{t}{(t+1)(t+2)} + \dfrac{(2D_\infty H_0 + 64LD_1D_\infty)\, n^2}{(t+1)(t+2)}$
Let us see what this bound is really about . . .
Theoretical guarantees: convex case
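ℓp norm “diameter” of C is $D_p := \max_{w, v \in C} \|X(w - v)\|_p$
Define Ratio := $D_1/D_\infty$ and note that Ratio ≤ n
The expected optimality gap bound is:
$\dfrac{2(f(Xw_0) - f^*)}{(t+1)(t+2)} + \left( \dfrac{2LD_2^2}{n} + 8LD_1D_\infty \dfrac{n-1}{n} \right) \dfrac{t}{(t+1)(t+2)} + \dfrac{(2D_\infty H_0 + 64LD_1D_\infty)\, n^2}{(t+1)(t+2)}$
$= O\!\left( \dfrac{f(Xw_0) - f^*}{t^2} \right) + O\!\left( \dfrac{LD_\infty^2 (1 + \mathrm{Ratio})}{t} \right) + O\!\left( \dfrac{(D_\infty H_0 + LD_\infty^2\, \mathrm{Ratio})\, n^2}{t^2} \right)$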
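For the ℓ1-ball constraint set used in the experiments, the diameters D_p have a closed form that is easy to compute, which gives a quick way to inspect the ratio D_1/D_∞ on a dataset. The sketch below relies on one derivation that is not in the slides: w − v ranges over the ℓ1 ball of radius 2δ, and a norm is maximized over that ball at a vertex, so D_p = 2δ · max_j ‖X[:, j]‖_p. The function names and the toy matrix are hypothetical.

```python
def lp_norm(u, p):
    if p == float("inf"):
        return max(abs(ui) for ui in u)
    return sum(abs(ui) ** p for ui in u) ** (1.0 / p)


def diameter_l1_ball(X, delta, p):
    """D_p = max_{w, v in C} ||X (w - v)||_p for C the l1 ball of radius delta."""
    # w - v ranges over the l1 ball of radius 2 * delta, and a norm is
    # maximized over that ball at a vertex +/- 2 * delta * e_j, so
    # D_p = 2 * delta * max_j ||X[:, j]||_p.
    d = len(X[0])
    columns = [[row[j] for row in X] for j in range(d)]
    return 2.0 * delta * max(lp_norm(col, p) for col in columns)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
D1 = diameter_l1_ball(X, delta=1.0, p=1)
Dinf = diameter_l1_ball(X, delta=1.0, p=float("inf"))
ratio = D1 / Dinf
```

Here n = 4 and the ratio equals 3, consistent with Ratio ≤ n. Sparse feature columns make the column ℓ1 norms, and hence the ratio, much smaller relative to n.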
On the Ratio D1/D∞
[Figure: plot of the ratio for benchmark datasets.]
Algorithms and Datasets
We compared three Stochastic Frank-Wolfe algorithms:
• Mokhtari et al. [2018]
• Lu and Freund [2018]
• Novel Stochastic Frank-Wolfe (this work)
We report here on ℓ1-constrained regression problems using the following
datasets:
Problem type         Dataset             δ     d      n      D1/(nD∞)
logistic regression  breast-cancer       5     10     683    0.929
logistic regression  rcv1 (train)        100   47236  20242  0.021
linear regression    California housing  0.1   8      20640  0.040
Computational Experiments
[Figure: optimality gap versus number of sampled gradients processed, one
panel per dataset (recovered panel titles: Breast Cancer, California Housing),
comparing Mokhtari et al. (2018), Lu & Freund (2018), and this work.]
Proof sketch (convex, smooth case)
A key lemma reduces the proof to that of deterministic
Frank-Wolfe, with step size $\gamma_t = 2/(t + 2)$.
Key Lemma (convex, smooth)
Let $\varepsilon_t = f(Xw_t) - f(Xw^*)$. For any direction $\alpha_t$, with
$s_t = \mathrm{LMO}(X^\top \alpha_t)$, we have
$\varepsilon_t \le (1 - \gamma_t)\varepsilon_{t-1} + \gamma_t^2 \dfrac{LD_2^2}{2n} + \gamma_t D_\infty \|\alpha_t - \nabla f(Xw_{t-1})\|_1$.
Additionally, we show that $E\|\alpha_t - \nabla f(Xw_{t-1})\|_1 = O(LD_1/t)$.
This fact does not require convexity!
Questions to answer
• Linear convergence for FW variants (Stochastic Away steps,
• Explain the observed asymptotic accelerated rate
• Algorithm in the dual space: it is not Stochastic Coordinate
Mirror Descent
• A practical, fast version of Stochastic Frank-Wolfe
• Hyperparameter-free
• Implementation available in
• Use FW when the structure of your problem demands it!
Thanks for your attention
Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d’Aspremont (2018).
“Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International
Conference on Machine Learning.
Lu, Haihao and Robert Michael Freund (2020). “Generalized stochastic Frank-Wolfe
algorithm with stochastic substitute gradient for structured convex optimization”.
In: Math. Program.
Mokhtari, Aryan, Hamed Hassani, and Amin Karbasi (2018). “Stochastic Conditional
Gradient Methods: From Convex Minimization to Submodular Maximization”. In:
ArXiv abs/1804.09554.
Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”.
In: International Conference on Machine Learning.
Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with
Frank-Wolfe”. In: Advances in Neural Information Processing Systems.

Internship report on mechanical engineeringInternship report on mechanical engineering
Internship report on mechanical engineering
power system scada applications and uses
power system scada applications and usespower system scada applications and uses
power system scada applications and uses
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
Heart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptxHeart Disease Prediction using machine learning.pptx
Heart Disease Prediction using machine learning.pptx
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
VICTOR MAESTRE RAMIREZ - Planetary Defender on NASA's Double Asteroid Redirec...
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
chaitra-1.pptx fake news detection using machine learning
chaitra-1.pptx  fake news detection using machine learningchaitra-1.pptx  fake news detection using machine learning
chaitra-1.pptx fake news detection using machine learning
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR

Practical Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization

  • 1. Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization. Geoffrey Négiar (1), Gideon Dresdner (2), Alicia Yi-Ting Tsai (1), Laurent El Ghaoui (1,5), Francesco Locatello (2), Robert Freund (3), Fabian Pedregosa (4). June 12th, 2020. ICML, Online. Affiliations: (1) University of California, Berkeley; (2) ETH Zurich; (3) MIT; (4) Google Research, Montréal; (5) SumUp Analytics.
  • 3. Outline. Motivation: obtain a practical, fast version of Stochastic Frank-Wolfe for finite-sum problems.
      1. Problem of interest and setting.
      2. The Frank-Wolfe algorithm.
      3. Stochastic Frank-Wolfe. Making Stochastic Frank-Wolfe practical: a primal-dual view.
      4. Results. Convergence rates in theory and in practice.
  • 4. Problem of Interest and Setting
  • 5. Problem of Interest. The problem of interest is
      OPT: $\min_{w \in C} \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
      • $f_i(\cdot)$ is the univariate loss function of observation/sample $i$, for $i \in [n]$
      • $n$ is the number of observations/samples
      • $C \subset \mathbb{R}^d$ is a compact convex set
      • $d$ is the order (dimension) of the model variable $w$
      The particular structural dependence of the losses on $x_i^\top w$ is a model with “generalized linear structure” or “linear prediction”.
  • 6. Setting. Assumptions:
      • For $i = 1, \ldots, n$, the univariate function $f_i(\cdot)$ is $L$-smooth, namely for all $a, b \in \mathbb{R}$ it holds that $|f_i'(a) - f_i'(b)| \le L|a - b|$
      • The Linear Minimization Oracle LMO($v$): $s \leftarrow \arg\min_{w \in C} \langle v, w \rangle$ returns an optimal solution and is easily solved for any $v$
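  • 7. Some Examples in Statistical and Machine Learning
      • LASSO: $\min_{w} \frac{1}{2n} \sum_{i=1}^{n} (y_i - x_i^\top w)^2$ s.t. $\|w\|_1 \le \delta$, where $f_i(\cdot) = \frac{1}{2}(y_i - \cdot)^2$ and $C := \{w : \|w\|_1 \le \delta\}$
      • Sparse Logistic Regression: $\min_{w} \frac{1}{n} \sum_{i=1}^{n} \ln(1 + \exp(-y_i x_i^\top w))$ s.t. $\|w\|_1 \le \delta$, where $f_i(\cdot) = \ln(1 + \exp(-y_i\, \cdot))$ and $C := \{w : \|w\|_1 \le \delta\}$
      • Low-Rank Matrix Completion: $\min_{W \in \mathbb{R}^{n \times p}} \frac{1}{2|\Omega|} \sum_{(i,j) \in \Omega} (M_{i,j} - W_{i,j})^2$ s.t. $\|W\|_{\mathrm{Nuc}} \le \delta$, where $f_{(i,j)}(\cdot) = \frac{1}{2}(\cdot - M_{i,j})^2$ and $C := \{W \in \mathbb{R}^{n \times p} : \|W\|_{\mathrm{Nuc}} \le \delta\}$
      • Many more examples can be found in [Jaggi 2013], for instance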
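For the ℓ1-ball constraint used in the LASSO and sparse logistic regression examples, the Linear Minimization Oracle has a closed form: a linear function over the ℓ1 ball is minimized at a signed, scaled coordinate vector. The sketch below is illustrative only; the function name `lmo_l1_ball` and the sample vector are assumptions, not part of the slides.

```python
import numpy as np

def lmo_l1_ball(v, delta):
    """Return a minimizer of <v, w> over the l1 ball {w : ||w||_1 <= delta}."""
    s = np.zeros_like(v, dtype=float)
    j = np.argmax(np.abs(v))        # coordinate with the largest |v_j|
    s[j] = -delta * np.sign(v[j])   # put all mass there, against the sign of v_j
    return s

# Hypothetical input, chosen only to illustrate the call.
v = np.array([0.5, -2.0, 1.0])
s = lmo_l1_ball(v, delta=3.0)
```

The result has a single nonzero entry, which is where the sparsity of Frank-Wolfe iterates on this constraint set comes from.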
  • 9. Frank-Wolfe: What is it? Problem: smooth $f$, compact and convex $D$: $\arg\min_{x \in D} f(x)$. Beware of the notation change in this section.
      Algorithm 1: Frank-Wolfe (FW)
      for $t = 0, 1, \ldots$ do
          $s_t \in \arg\min_{s \in D} \langle \nabla f(x_t), s \rangle$
          Choose step-size $\gamma_t$.
          $x_{t+1} = (1 - \gamma_t) x_t + \gamma_t s_t$
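The Frank-Wolfe loop can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the step-size $\gamma_t = 2/(t+2)$, an ℓ1-ball constraint, and a toy objective $f(x) = \frac{1}{2}\|x - b\|^2$; the names `frank_wolfe`, `lmo_l1_ball`, and the vector `b` are hypothetical.

```python
import numpy as np

def lmo_l1_ball(g, delta):
    """Minimizer of <g, s> over the l1 ball of radius delta."""
    s = np.zeros_like(g, dtype=float)
    j = np.argmax(np.abs(g))
    s[j] = -delta * np.sign(g[j])
    return s

def frank_wolfe(grad, lmo, x0, n_iters):
    """Frank-Wolfe with the step-size gamma_t = 2 / (t + 2)."""
    x = np.array(x0, dtype=float)
    for t in range(n_iters):
        s = lmo(grad(x))                    # linear subproblem
        gamma = 2.0 / (t + 2.0)             # step-size
        x = (1.0 - gamma) * x + gamma * s   # convex combination stays in D
    return x

# Toy problem (an assumption for illustration): minimize 0.5 * ||x - b||^2
# over the l1 ball of radius 1.
b = np.array([2.0, 0.5, 0.0])
x_hat = frank_wolfe(grad=lambda x: x - b,
                    lmo=lambda g: lmo_l1_ball(g, 1.0),
                    x0=np.zeros(3), n_iters=200)
```

No projection is ever computed: every iterate is a convex combination of the starting point and LMO outputs, so it is feasible by construction.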
  • 14. Some Step-size Rules/Strategies
      • “Recent standard”: $\gamma_t = \frac{2}{t+2}$
      • Exact line-search: $\gamma_t = \arg\min_{\gamma \in [0,1]} f(x_t + \gamma(s_t - x_t))$
      • QA (Quadratic Approximation) step-size: $\gamma_t = \min\left\{1, \frac{-\nabla f(x_t)^\top (s_t - x_t)}{L \|s_t - x_t\|^2}\right\}$. See [Demyanov & Rubinov 1970]
      • Simple averaging: $\gamma_t = \frac{1}{t+1}$
      • Constant step-size: $\gamma_t = \bar{\gamma}$ for some given $\bar{\gamma} \in [0, 1]$
      • Dynamic strategy: determined by some history of optimality bounds, see [Freund & Grigas 2014]
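Three of these rules are simple enough to write down directly. The sketch below is illustrative; the function names are hypothetical, and the toy objective $f(x) = \frac{1}{2}\|x - b\|^2$ (for which $L = 1$ and the QA step coincides with exact line-search) is an assumption made only for the example.

```python
import numpy as np

def step_standard(t):
    """'Recent standard' rule: gamma_t = 2 / (t + 2)."""
    return 2.0 / (t + 2.0)

def step_averaging(t):
    """Simple averaging: gamma_t = 1 / (t + 1)."""
    return 1.0 / (t + 1.0)

def step_quadratic_approx(grad_x, x, s, L):
    """QA rule: min(1, -grad^T (s - x) / (L * ||s - x||^2))."""
    d = s - x
    denom = L * np.dot(d, d)
    if denom == 0.0:          # s == x: no movement is possible
        return 0.0
    return min(1.0, -np.dot(grad_x, d) / denom)

# Toy quadratic f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b and L = 1.
x = np.zeros(3)
s = np.array([1.0, 0.0, 0.0])            # a vertex of the l1 ball of radius 1
gamma_a = step_quadratic_approx(x - np.array([2.0, 0.0, 0.0]), x, s, L=1.0)
gamma_b = step_quadratic_approx(x - np.array([0.25, 0.0, 0.0]), x, s, L=1.0)
```

In the first case the unconstrained minimizer along the segment lies beyond $s_t$, so the step is clipped to 1; in the second the step lands exactly on the minimizer.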
  • 15. Simple Computational Guarantee for Frank-Wolfe. If the step-size sequence $\{\gamma_t\}$ is chosen by the recent standard, the QA rule, or by exact line-search, then for all $t \ge 1$ it holds that:
      $f(x_t) - f^* \le \frac{1}{\frac{1}{f(x_0) - f^*} + \frac{t}{2K}} < \frac{2K}{t}$
      where $K = L \cdot \mathrm{diam}(C)^2$, and $f$ is convex and $L$-smooth. Related guarantees also hold for other types of step-size strategies.
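The strict inequality on the right follows in one line, assuming $f(x_0) > f^*$ so that the first term in the denominator is positive:

```latex
\frac{1}{\dfrac{1}{f(x_0) - f^*} + \dfrac{t}{2K}}
\;<\; \frac{1}{\dfrac{t}{2K}}
\;=\; \frac{2K}{t},
\qquad \text{since } \frac{1}{f(x_0) - f^*} > 0 .
```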
  • 19. Frank-Wolfe: When do we use it?
      • Projection-free. Linear subproblems vs. quadratic for projected gradient descent (PGD): $\min_{x \in D} g^\top x$ vs. $\min_{x \in D} \|y - x\|_2^2$
      • Solution of linear subproblem: extremal element of $D$.
      • Sparse representation: $x_t$ is a convex combination of at most $t$ elements.
      Recent Applications
      • Learning the structure of a neural network. Ping, Liu, and Ihler, 2016
      • Attention mechanisms that enforce sparsity. Niculae et al., 2018
      • $\ell_1$-constrained problems with extreme number of features. Kerdreux, Pedregosa, and d'Aspremont, 2018
  • 21. A practical issue for FW
      • For large $n$ (number of samples), we need a Stochastic variant of FW
      • Naïve SGD-like algorithm fails in practice and in theory
      • State of the art bounds on suboptimality after $t$ iterations: $O(n/t)$ and $O(1/\sqrt[3]{t})$. Lu and Freund, 2020; Mokhtari, Hassani, and Karbasi, 2018
      Can we do better?
  • 22. Practical Stochastic Frank-Wolfe: a primal-dual point of view
  • 23. Problem setting. OPT: $\min_{w \in C} \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
      • $f_i(\cdot)$ is the univariate loss function of observation/sample $i$, for $i \in [n]$
      • $n$ is the number of observations/samples
      • $C \subset \mathbb{R}^d$ is a compact convex set
      • $d$ is the order (dimension) of the model variable $w$
      The particular structural dependence of the losses on $x_i^\top w$ is a model with “generalized linear structure” or “linear prediction”.
  • 24. Deterministic FW: Gradient Computation for OPT. OPT: $f^* := \min_{w \in C} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
      Assumptions
      • $f_i(\cdot)$ is $L$-smooth for $i \in [n]$: $\forall z, z'$, $|f_i'(z) - f_i'(z')| \le L|z - z'|$
      • Linear Minimization Oracle LMO($r$): $s \leftarrow \arg\min_{w \in C} \langle r, w \rangle$
      Denote $X := [x_1^\top; x_2^\top; \ldots; x_n^\top]$ (one sample per row).
      Gradient Computation: $\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} x_i \cdot f_i'(x_i^\top w) = X^\top \alpha$, where $\alpha_i \leftarrow \frac{1}{n} f_i'(x_i^\top w)$, $i \in [n]$
      Gradient computation is $O(nd)$ operations (expensive when $n \gg 0$).
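As a concrete instance of $\nabla F(w) = X^\top \alpha$, the sketch below computes $\alpha$ and the full gradient for the logistic loss $f_i(z) = \ln(1 + \exp(-y_i z))$ from the examples slide. The function names and the random data are assumptions made for illustration.

```python
import numpy as np

def objective(X, y, w):
    """F(w) = (1/n) * sum_i ln(1 + exp(-y_i * x_i^T w))."""
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def dual_vector(X, y, w):
    """alpha_i = (1/n) * f_i'(x_i^T w) for the logistic loss."""
    n = X.shape[0]
    z = X @ w
    # d/dz ln(1 + exp(-y z)) = -y / (1 + exp(y z))
    return (-y / (1.0 + np.exp(y * z))) / n

def full_gradient(X, y, w):
    """grad F(w) = X^T alpha; costs O(n d) operations."""
    return X.T @ dual_vector(X, y, w)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = rng.choice([-1.0, 1.0], size=20)
w = rng.standard_normal(4)
grad = full_gradient(X, y, w)
```

Both matrix-vector products touch every row of $X$, which is the $O(nd)$ cost the slide refers to.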
  • 25. Frank-Wolfe for OPT. OPT: $f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
      Frank-Wolfe algorithm for OPT: Initialize at $w_0 \in C$, $t \leftarrow 0$. At iteration $t$:
      1. Compute $\nabla F(w_{t-1})$:
         • $\alpha_t^i \leftarrow \frac{1}{n} f_i'(x_i^\top w_{t-1})$ for EVERY $i \in [n]$
         • $r_t = X^\top \alpha_t$ $(= \nabla F(w_{t-1}))$
      2. Compute $s_t \leftarrow \mathrm{LMO}(r_t)$.
      3. Set $w_t \leftarrow w_{t-1} + \gamma_t (s_t - w_{t-1})$, where $\gamma_t \in [0, 1]$.
      Iteration cost is $O(nd)$ operations (expensive when $n \gg 0$).
  • 26. A Naïve Stochastic Frank-Wolfe (SFW) Strategy. OPT: $f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
      Frank-Wolfe algorithm for OPT: Initialize at $w_0 \in C$, $t \leftarrow 0$. At iteration $t$:
      1. Estimate $\nabla F(w_{t-1})$:
         • $\alpha_t^i \leftarrow \frac{1}{n} f_i'(x_i^\top w_{t-1})$ for ONE $i \in [n]$ ($\alpha_t^j = 0$ for $j \ne i$)
         • $r_t = X^\top \alpha_t = \frac{1}{n} x_i f_i'(x_i^\top w_{t-1})$
      2. Compute $s_t \leftarrow \mathrm{LMO}(r_t)$.
      3. Set $w_t \leftarrow w_{t-1} + \gamma_t (s_t - w_{t-1})$, where $\gamma_t \in [0, 1]$.
      This approach does not work without growing the batch size [Hazan].
  • 27. Our Stochastic Frank-Wolfe (SFW) Strategy. OPT: $f^* := \min_{w \in C} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(x_i^\top w)$
      Frank-Wolfe algorithm for OPT: Initialize at $w_0 \in C$, $t \leftarrow 0$. At iteration $t$:
      1. Estimate $\nabla F(w_{t-1})$:
         • $\alpha_t^i \leftarrow \frac{1}{n} f_i'(x_i^\top w_{t-1})$ for ONE $i \in [n]$ ($\alpha_t^j = \alpha_{t-1}^j$ for $j \ne i$)
         • $r_t = X^\top \alpha_t = r_{t-1} + x_i (\alpha_t^i - \alpha_{t-1}^i)$
      2. Compute $s_t \leftarrow \mathrm{LMO}(r_t)$.
      3. Set $w_t \leftarrow w_{t-1} + \gamma_t (s_t - w_{t-1})$, where $\gamma_t \in [0, 1]$.
      Iteration cost is $O(d)$ operations! Memory cost is $O(d + n)$.
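The update above can be sketched as follows for the LASSO example, where $f_i(z) = \frac{1}{2}(y_i - z)^2$ and $f_i'(z) = z - y_i$. This is a minimal illustration rather than the authors' implementation: the initialization $\alpha_0 = 0$, uniform sampling of $i$, the step-size $\gamma_t = 2/(t+2)$, the function names, and the synthetic data are all assumptions.

```python
import numpy as np

def lmo_l1_ball(r, delta):
    """Minimizer of <r, s> over the l1 ball of radius delta."""
    s = np.zeros_like(r, dtype=float)
    j = np.argmax(np.abs(r))
    s[j] = -delta * np.sign(r[j])
    return s

def sfw(X, y, delta, n_iters, seed=0):
    """Stochastic Frank-Wolfe sketch for least squares over an l1 ball."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)        # w_0 = 0 lies in C
    alpha = np.zeros(n)    # stored dual estimates, O(n) memory (assumed alpha_0 = 0)
    r = np.zeros(d)        # running gradient estimate r = X^T alpha, O(d) memory
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                    # sample ONE index uniformly
        new_alpha_i = (X[i] @ w - y[i]) / n    # (1/n) f_i'(x_i^T w_{t-1})
        r += X[i] * (new_alpha_i - alpha[i])   # O(d) update of X^T alpha
        alpha[i] = new_alpha_i
        s = lmo_l1_ball(r, delta)              # LMO on the estimate
        gamma = 2.0 / (t + 2.0)
        w += gamma * (s - w)                   # Frank-Wolfe step
    return w, alpha, r

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = X @ np.array([0.5, 0.0, 0.0, -0.3, 0.0])
w, alpha, r = sfw(X, y, delta=1.0, n_iters=500)
```

Only one row of $X$ is read per iteration, and `r` always equals `X.T @ alpha` without ever forming that product, which is what brings the per-iteration cost down from $O(nd)$ to $O(d)$.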
  • 28. Motivation: a Primal-Dual Lens for Constructing FW. Recall the definition of the conjugate of a function $f$: $f^*(\alpha) := \max_{x \in \mathrm{dom} f} \{\alpha^\top x - f(x)\}$
      • If $f$ is a closed convex function, then $f^{**} = f$
      • $f(x) := \max_{\alpha \in \mathrm{dom} f^*} \{\alpha^\top x - f^*(\alpha)\}$, and
      • When $f$ is differentiable, it holds that $\nabla f(x) \leftarrow \alpha$ where $\alpha \leftarrow \arg\max_{\beta \in \mathrm{dom} f^*} \{\beta^\top x - f^*(\beta)\}$.
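A one-dimensional worked example may help fix the idea. The choice $f(x) = \frac{1}{2}x^2$ is an illustration, not taken from the slides:

```latex
f(x) = \tfrac{1}{2}x^2
\;\Longrightarrow\;
f^*(\alpha) = \max_{x}\,\{\alpha x - \tfrac{1}{2}x^2\} = \tfrac{1}{2}\alpha^2
\quad (\text{attained at } x = \alpha),
\\[4pt]
\arg\max_{\beta}\,\{\beta x - f^*(\beta)\}
= \arg\max_{\beta}\,\{\beta x - \tfrac{1}{2}\beta^2\}
= x = f'(x).
```

The maximizing dual variable is exactly the derivative of $f$ at $x$, which is the property the primal-dual construction relies on.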
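  • 29. Motivation: a Primal-Dual Lens for Constructing FW. Using conjugacy we can reformulate OPT as:
      OPT: $\min_{w \in C} f(Xw) = \min_{w \in C} \max_{\alpha \in \mathbb{R}^n} \left\{ \mathcal{L}(w, \alpha) \overset{\mathrm{def}}{=} \langle Xw, \alpha \rangle - f^*(\alpha) \right\}$
      Given $w_{t-1}$ we construct the gradient of $f(Xw)$ at $w_{t-1}$ by maximizing over the dual variable $\alpha$:
      $\alpha_t \in \arg\max_{\alpha \in \mathbb{R}^n} \{\mathcal{L}(w_{t-1}, \alpha) = \langle Xw_{t-1}, \alpha \rangle - f^*(\alpha)\} \iff \nabla_w f(Xw_{t-1}) = X^\top \alpha_t$
      Then the LMO step corresponds to fixing the dual variable and minimizing over the primal variable $w$:
      $s_t \leftarrow \arg\min_{w \in C} \{\mathcal{L}(w, \alpha_t) = \langle w, X^\top \alpha_t \rangle - f^*(\alpha_t)\} \iff s_t \leftarrow \mathrm{LMO}(X^\top \alpha_t)$
      Finally, $w_t = (1 - \gamma_t) w_{t-1} + \gamma_t s_t$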
  • 31. Stochastic FW Methods for Finite-Sum Minimization. Computational complexity dependency on iterations $t$ and sample size $n$:
      Algorithm | Complexity Bound | Non-Convex Case
      Frank-Wolfe (deterministic) (1956) | $O(n/t)$ | $O(n/\sqrt{t})$
      Reddi et al. (2016) | | $O(n + n^{1/3}/\sqrt{t})$
      Mokhtari et al. (2018) | $O(1/\sqrt[3]{t})$ |
      Lu and Freund (2018) | $O(n/t)$ |
      This work | $O\big((D_1/D_\infty)/t\big)$ [1] | 0
      [1] $D_1/D_\infty \le n$, and $D_1/D_\infty \ll n$ in many recognizable instances.
  • 32. Theoretical guarantees: convex case. Define the $\ell_p$-norm “diameter” of $C$ to be $D_p := \max_{w,v \in C} \|X(w - v)\|_p$
      Theorem: Computational Complexity of Novel Stochastic Frank-Wolfe Algorithm. Let $H_0 \overset{\mathrm{def}}{=} \|\alpha_0 - \nabla f(Xw_0)\|_1$ be the initial error of the gradient $\nabla f$, and let the step-size rule be $\gamma_t = \frac{2}{t+2}$. For $t \ge 2$, it holds that:
      $\mathbb{E}[f(Xw_t) - f^*] \le \frac{2(f(Xw_0) - f^*)}{(t+1)(t+2)} + \left(\frac{2LD_2^2}{n} + 8LD_1D_\infty\frac{n-1}{n}\right)\frac{t}{(t+1)(t+2)} + \frac{(2D_\infty H_0 + 64LD_1D_\infty)\,n^2}{(t+1)(t+2)}$
      Let us see what this bound is really about.
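  • 33. Theoretical guarantees: convex case. The $\ell_p$-norm “diameter” of $C$ is $D_p := \max_{w,v \in C} \|X(w - v)\|_p$. Define $\mathrm{Ratio} := D_1/D_\infty$ and note that $\mathrm{Ratio} \le n$. The expected optimality gap bound is:
      $\frac{2(f(Xw_0) - f^*)}{(t+1)(t+2)} + \left(\frac{2LD_2^2}{n} + 8LD_1D_\infty\frac{n-1}{n}\right)\frac{1}{t} + \frac{(2D_\infty H_0 + 64LD_1D_\infty)\,n^2}{(t+1)(t+2)}$
      $= O\!\left(\frac{f(Xw_0) - f^*}{t^2}\right) + O\!\left(\frac{LD_\infty^2(1 + \mathrm{Ratio})}{t}\right) + O\!\left(\frac{(D_\infty H_0 + LD_\infty^2\,\mathrm{Ratio})\,n^2}{t^2}\right)$
      $\le O\!\left(\frac{LD_\infty^2\,\mathrm{Ratio}}{t}\right) \le O\!\left(\frac{n}{t}\right)$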
  • 39. On the Ratio $D_1/D_\infty$. [Figure: plot of $\frac{D_1}{D_\infty} \cdot \frac{1}{n}$ for benchmark datasets]
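For an ℓ1-ball constraint $C = \{w : \|w\|_1 \le \delta\}$, the diameters $D_p$ have a closed form: $w - v$ ranges over the ℓ1 ball of radius $2\delta$, and the convex function $u \mapsto \|Xu\|_p$ is maximized at a vertex, so $D_p = 2\delta \max_j \|X_{:,j}\|_p$. The sketch below uses this to compute the normalized ratio; the function name and the small matrix are assumptions for illustration, and the closed form applies only to this particular constraint set.

```python
import numpy as np

def l1_ball_diameter(X, delta, p):
    """D_p = max_{w,v in C} ||X(w - v)||_p for C the l1 ball of radius delta.

    Uses D_p = 2 * delta * (largest column p-norm of X).
    """
    col_norms = np.linalg.norm(X, ord=p, axis=0)   # p-norm of each column
    return 2.0 * delta * col_norms.max()

# Small hypothetical data matrix with n = 3 samples and d = 2 features.
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 2.0]])
n = X.shape[0]
D1 = l1_ball_diameter(X, delta=1.0, p=1)
Dinf = l1_ball_diameter(X, delta=1.0, p=np.inf)
normalized_ratio = D1 / (n * Dinf)
```

Because every column satisfies $\|X_{:,j}\|_\infty \le \|X_{:,j}\|_1 \le n \|X_{:,j}\|_\infty$, the normalized ratio always lies between $1/n$ and $1$ in this setting.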
  • 40. Algorithms and Datasets. We compared three Stochastic Frank-Wolfe algorithms:
      • Mokhtari et al. [2018]
      • Lu and Freund [2018]
      • Novel Stochastic Frank-Wolfe (this work)
      We report here on $\ell_1$-constrained regression problems using the following datasets:
      Problem type | Dataset | $\delta$ | $d$ | $n$ | $D_1/(nD_\infty)$
      logistic regression | breast-cancer | 5 | 10 | 683 | 0.929
      logistic regression | rcv1 (train) | 100 | 47236 | 20242 | 0.021
      linear regression | California housing | 0.1 | 8 | 20640 | 0.040
  • 41. Computational Experiments. [Figure: relative suboptimality versus number of sampled gradients processed, in three panels (Breast Cancer, RCV1, California Housing), comparing Mokhtari et al. (2018), Lu & Freund (2018), and this work]
  • 43. Proof sketch (convex, smooth case). A key lemma reduces the proof to that of deterministic Frank-Wolfe, with step size $\gamma_t = 2/(t+2)$.
      Key Lemma (convex, smooth). Let $\varepsilon_t = f(Xw_t) - f(Xw^*)$. For any direction $\alpha_t$, with $s_t = \mathrm{LMO}(X^\top \alpha_t)$, we have
      $\varepsilon_t \le (1 - \gamma_t)\,\varepsilon_{t-1} + \gamma_t^2 \frac{LD_2^2}{2n} + \gamma_t D_\infty \|\alpha_t - \nabla f(Xw_{t-1})\|_1$.
      Additionally, we show that $\mathbb{E}\,\|\alpha_t - \nabla f(Xw_{t-1})\|_1 = O\!\left(\frac{LD_1}{t}\right)$. This fact does not require convexity!
  • 44. Questions to answer
      • Linear convergence for FW variants (Stochastic Away steps, Pairwise)
      • Explain the observed asymptotic accelerated rate
      • Algorithm in the dual space: it is not Stochastic Coordinate Mirror Descent
  • 48. Conclusion
      • A practical, fast version of Stochastic Frank-Wolfe
      • Hyperparameter-free
      • Implementation available in
      • Use FW when the structure of your problem demands it!
      Thanks for your attention
  • 49. References
      • Kerdreux, Thomas, Fabian Pedregosa, and Alexandre d'Aspremont (2018). “Frank-Wolfe with Subsampling Oracle”. In: Proceedings of the 35th International Conference on Machine Learning.
      • Lu, Haihao and Robert Michael Freund (2020). “Generalized stochastic Frank-Wolfe algorithm with stochastic substitute gradient for structured convex optimization”. In: Math. Program.
      • Mokhtari, Aryan, Hamed Hassani, and Amin Karbasi (2018). “Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization”. In: ArXiv abs/1804.09554.
      • Niculae, Vlad et al. (2018). “SparseMAP: Differentiable Sparse Structured Inference”. In: International Conference on Machine Learning.
      • Ping, Wei, Qiang Liu, and Alexander T Ihler (2016). “Learning Infinite RBMs with Frank-Wolfe”. In: Advances in Neural Information Processing Systems.