Adaptive Three Operator Splitting.
Fabian Pedregosa, Gauthier Gidel
2018 International Symposium on Mathematical Programming, Bordeaux
Three Operator Splitting (TOS)
• Recently proposed method (Davis and Yin, 2017). Solves optimization problems of the form
\[
\min_{x \in \mathbb{R}^d} \; f(x) + g(x) + h(x),
\]
with access to \(\nabla f\), \(\operatorname{prox}_{\gamma g}\) and \(\operatorname{prox}_{\gamma h}\).
• Can be generalized to an arbitrary number of proximal terms.
• Many complex penalties can be written as a sum of proximable terms: overlapping group lasso, \(\ell_1\) trend filtering, isotonic constraints, total variation, intersection of constraints, etc.
Importance of step-size
Guaranteed to converge for any step-size \(\gamma < 2/L\), with \(L\) = Lipschitz constant of \(\nabla f\).
In practice, the best performance is often achieved for \(\gamma \gg 2/L\).
[Figure: objective minus optimum vs. iterations for step sizes \(\gamma \in \{1/L, 2/L, 5/L, 10/L, 20/L, 50/L\}\).]
Motivation
• L is a global upper bound on the Lipschitz constant; locally it can be much smaller.
• Adaptive step-size strategies (aka inexact line search) have a long tradition (Armijo, 1966) and have been adapted to proximal-gradient methods (Beck and Teboulle, 2009).
• Goal. Can these adaptive step-size methods be adapted to the three operator splitting?
Outline
1. Revisiting the Three Operator Splitting
2. Adaptive Three Operator Splitting
3. Experiments
Revisiting the Three Operator Splitting
Three Operator Splitting (TOS)
• Three operator splitting (Davis and Yin, 2017):
\[
\begin{aligned}
z_t &= \operatorname{prox}_{\gamma h}(y_t)\\
x_t &= \operatorname{prox}_{\gamma g}\bigl(2 z_t - y_t - \gamma \nabla f(z_t)\bigr)\\
y_{t+1} &= y_t - z_t + x_t
\end{aligned}
\]
• Generalization of both Proximal-Gradient and Douglas-Rachford.
• Depends only on one step-size parameter.
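As an illustration of the iteration above, here is a minimal NumPy sketch of TOS on a toy problem. The oracle names (f_grad, prox_g, prox_h) and the lasso-plus-box example are illustrative choices, not part of the slides.

```python
import numpy as np

def tos(f_grad, prox_g, prox_h, y0, step, n_iter=500):
    """Minimal sketch of the (Davis and Yin, 2017) three operator splitting."""
    y = y0.copy()
    x = y0.copy()
    for _ in range(n_iter):
        z = prox_h(y, step)                               # prox step on h
        x = prox_g(2 * z - y - step * f_grad(z), step)    # forward-backward step on f + g
        y = y - z + x                                     # update of the auxiliary variable
    return x

# Toy usage: f(x) = 0.5 * ||Ax - b||^2, g = lam * ||.||_1, h = indicator of the box [0, 1]^d.
rng = np.random.RandomState(0)
A, b, lam = rng.randn(20, 10), rng.randn(20), 0.1
f_grad = lambda x: A.T @ (A @ x - b)
prox_g = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - lam * s, 0)  # soft-thresholding
prox_h = lambda v, s: np.clip(v, 0, 1)                                 # projection onto the box
x_sol = tos(f_grad, prox_g, prox_h, np.zeros(10), step=1.0 / np.linalg.norm(A, 2) ** 2)
```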
Revisiting the three operator splitting
1. Introduce \(u_t := \tfrac{1}{\gamma}(y_t - z_t)\). We can rewrite TOS as
\[
\begin{aligned}
x_{t+1} &= \operatorname{prox}_{\gamma g}\bigl(z_t - \gamma(\nabla f(z_t) + u_t)\bigr)\\
u_{t+1} &= \operatorname{prox}_{h^*/\gamma}(u_t + x_{t+1}/\gamma)\\
z_{t+1} &= x_{t+1} - \gamma(u_{t+1} - u_t)
\end{aligned}
\]
2. Saddle-point reformulation of the original problem:
\[
\min_{x \in \mathbb{R}^d} f(x) + g(x) + h(x)
= \min_{x \in \mathbb{R}^d} \max_{u \in \mathbb{R}^d} \underbrace{f(x) + g(x) + \langle x, u\rangle - h^*(u)}_{:= \mathcal{L}(x, u)}
\]
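As a small sanity check of the conjugate in this reformulation: for h(x) = λ‖x‖₁ the conjugate h* is the indicator of the ℓ∞-ball of radius λ, so the inner maximum recovers h. This numerical check is an added example, not from the slides.

```python
import numpy as np

lam = 0.7
x = np.array([1.5, -0.2, 0.0])
# h(x) = lam * ||x||_1 has conjugate h*(u) = indicator of {||u||_inf <= lam},
# so max_{||u||_inf <= lam} <x, u> is attained at u = lam * sign(x) and equals h(x).
u_star = lam * np.sign(x)
assert np.isclose(u_star @ x, lam * np.abs(x).sum())
```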
Minimizing with respect to the primal variable
\[
\min_{x \in \mathbb{R}^d} \mathcal{L}(x, u_t) = \min_{x \in \mathbb{R}^d} \underbrace{f(x) + \langle x, u_t\rangle}_{\text{smooth}} + \underbrace{g(x)}_{\text{proximal}}
\]
• Proximal-gradient iteration, with \(x = z_t\) as starting point:
\[
x_{t+1} = \operatorname{prox}_{\gamma g}\bigl(z_t - \gamma(\nabla f(z_t) + u_t)\bigr) = \text{first step of TOS}
\]
Minimizing with respect to the dual variable
\[
\min_{u \in \mathbb{R}^d} \mathcal{L}(x_t, u) = \min_{u \in \mathbb{R}^d} \underbrace{h^*(u) - \langle x_t, u\rangle}_{\text{proximal}}
\]
• Proximal-point iteration:
\[
u_{t+1} = \operatorname{prox}_{\sigma h^*}(u_t + \sigma x_{t+1}) = \text{second step in TOS with } \sigma = 1/\gamma
\]
• Third line:
\[
z_{t+1} = x_{t+1} - \gamma(u_{t+1} - u_t) \quad \text{(extrapolation step)}
\]
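In practice prox_{h*/γ} is rarely available directly, but it can always be computed from prox_{γh} via Moreau's decomposition, prox_{h*/γ}(u) = u − (1/γ) prox_{γh}(γu). Below is a small sketch of this identity; the ℓ1/ℓ∞ example is an added illustration, not from the slides.

```python
import numpy as np

def prox_h_conj(prox_h, u, gamma):
    """prox_{h*/gamma}(u) via Moreau's decomposition:
    prox_{h*/gamma}(u) = u - prox_{gamma h}(gamma * u) / gamma."""
    return u - prox_h(gamma * u, gamma) / gamma

# Example: h = lam * ||.||_1, whose prox is soft-thresholding; the conjugate step
# then reduces to the projection onto the ball {u : ||u||_inf <= lam}.
lam = 0.5
prox_h = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - lam * s, 0)
print(prox_h_conj(prox_h, np.array([1.2, -0.3, 0.7]), gamma=2.0))  # -> [ 0.5 -0.3  0.5]
```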
Revisiting the three operator splitting
\[
\begin{aligned}
x_{t+1} &= \operatorname{prox}_{\gamma g}\bigl(z_t - \gamma(\nabla f(z_t) + u_t)\bigr)\\
u_{t+1} &= \operatorname{prox}_{h^*/\gamma}(u_t + x_{t+1}/\gamma)\\
z_{t+1} &= x_{t+1} - \gamma(u_{t+1} - u_t)
\end{aligned}
\]
[Figure: animation in the (x, u) plane stepping through iterations 1-6, highlighting in turn the proximal-gradient step, the proximal-point step, and the extrapolation step.]
Take-Home Message
TOS is (basically) alternated proximal-gradient and proximal-point.
Can we adapt the adaptive step-size of proximal-gradient?
Adaptive Three Operator Splitting
Adaptive Three Operator Splitting (Pedregosa and Gidel, 2018)
Start with an optimistic step-size \(\gamma_t\) and decrease it until:
\[
f(x_{t+1}) \le f(z_t) + \langle \nabla f(z_t), x_{t+1} - z_t\rangle + \frac{1}{2\gamma_t}\|x_{t+1} - z_t\|^2,
\quad \text{with } x_{t+1} = \operatorname{prox}_{\gamma_t g}\bigl(z_t - \gamma_t(\nabla f(z_t) + u_t)\bigr)
\]
Run the rest of the algorithm with that step-size:
\[
\begin{aligned}
u_{t+1} &= \operatorname{prox}_{h^*/\gamma_t}(u_t + x_{t+1}/\gamma_t)\\
z_{t+1} &= x_{t+1} - \gamma_t(u_{t+1} - u_t)
\end{aligned}
\]
Benefits
• Automatic tuning of the step-size
• (practically) hyperparameter-free
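A rough Python sketch of one iteration with this backtracking rule follows. This is an illustrative re-implementation, not the authors' reference code; the oracle signatures and the decrease factor of 0.5 are assumptions (prox_h_conj(v, γ) is assumed to return prox_{h*/γ}(v), e.g. via Moreau's decomposition as sketched earlier).

```python
import numpy as np

def adaptive_tos_step(f, f_grad, prox_g, prox_h_conj, z, u, step, decrease_factor=0.5):
    """One TOS iteration with a backtracking line search on the step-size."""
    fz, grad = f(z), f_grad(z)
    while True:
        x = prox_g(z - step * (grad + u), step)        # candidate proximal-gradient step
        incr = x - z
        # sufficient decrease: quadratic upper bound on f at the candidate point
        if f(x) <= fz + np.dot(grad, incr) + np.dot(incr, incr) / (2 * step):
            break
        step *= decrease_factor                        # shrink the step-size and retry
    u_new = prox_h_conj(u + x / step, step)            # proximal-point step on the dual
    z_new = x - step * (u_new - u)                     # extrapolation
    return x, u_new, z_new, step
```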
Performance of the adaptive step-size strategy
[Figure: objective minus optimum vs. iterations for fixed step sizes \(\gamma \in \{1/L, 2/L, 5/L, 10/L, 20/L, 50/L\}\) and for the adaptive variant.]
Performance is as good as the best hand-tuned step-size.
Convergence rates 1/3
Convergence rate in terms of the average (aka ergodic) sequence:
\[
s_t \stackrel{\text{def}}{=} \sum_{i=0}^{t-1} \gamma_i, \qquad
\overline{x}_t \stackrel{\text{def}}{=} \Bigl(\sum_{i=0}^{t-1} \gamma_i x_{i+1}\Bigr)/s_t, \qquad
\overline{u}_t \stackrel{\text{def}}{=} \Bigl(\sum_{i=0}^{t-1} \gamma_i u_{i+1}\Bigr)/s_t.
\]
Theorem (sublinear convergence rate)
For any \((x, u) \in \operatorname{dom}\mathcal{L}\):
\[
\mathcal{L}(\overline{x}_t, u) - \mathcal{L}(x, \overline{u}_t) \le \frac{\|z_0 - x\|^2 + \gamma_0^2\, \|u_0 - u\|^2}{2 s_t}.
\]
Convergence rates 2/3
If h is Lipschitz, we can bound the gap and obtain rates in terms of objective function suboptimality.
Corollary
Let h be \(\beta_h\)-Lipschitz. Then, with \(P(x) = f(x) + g(x) + h(x)\),
\[
P(\overline{x}_{t+1}) - P(x^*) \le \frac{\|z_0 - x^*\|^2 + 2\gamma_0^2\,(\|u_0\|^2 + \beta_h^2)}{2 s_t} = O(1/t).
\]
Convergence rates 3/3
Linear convergence under (somewhat unrealistic) assumptions.
Theorem
If f is \(L_f\)-smooth and \(\mu\)-strongly convex, and h is \(L_h\)-smooth, then
\[
\|x_{t+1} - x^*\|^2 \le \Bigl(1 - \min\Bigl\{\tau \frac{\mu}{L_f},\ \frac{1}{1 + \gamma_0 L_h}\Bigr\}\Bigr)^{t+1} C_0,
\]
where \(\tau\) is the line-search decrease factor and \(C_0\) depends only on the initial conditions.
• Better rate than \(\frac{\mu}{L_f} \times \frac{1}{(1+\gamma L_h)^2}\) from (Davis and Yin, 2015).
Experiments
Logistic + Nearly-isotonic penalty
Problem
\[
\arg\min_{x}\ \operatorname{logistic}(x) + \lambda \sum_{i=1}^{p-1} \max\{x_i - x_{i+1}, 0\}
\]
[Figure: estimated coefficients vs. ground truth, and objective minus optimum vs. time (in seconds), for \(\lambda \in \{10^{-6}, 10^{-3}, 0.01, 0.1\}\); methods compared: Adaptive TOS (variants 1 and 2), TOS (1/L), TOS (1.99/L), TOS-AOLS, PDHG, Adaptive PDHG.]
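One way to obtain two proximable terms g and h for this penalty is to split the sum into even- and odd-indexed pairs: each part is then separable over disjoint pairs, and the prox over a single pair has a simple closed form. The sketch below shows the even-pair term; the derivation and names are added here, not taken from the slides.

```python
import numpy as np

def prox_pair(p, q, lam, step):
    """Closed-form prox of (a, b) -> lam * max(a - b, 0), parameter `step`, at (p, q)."""
    d = p - q
    if d <= 0:                          # already isotonic: nothing to do
        return p, q
    if d >= 2 * lam * step:             # move both ends towards each other by lam * step
        return p - lam * step, q + lam * step
    m = (p + q) / 2.0                   # otherwise collapse the pair to its mean
    return m, m

def prox_even_pairs(x, lam, step):
    """Prox of lam * sum_i max(x_{2i} - x_{2i+1}, 0); pairs are disjoint, so the
    prox acts independently on each pair (0, 1), (2, 3), ..."""
    x = np.asarray(x, dtype=float).copy()
    for i in range(0, len(x) - 1, 2):
        x[i], x[i + 1] = prox_pair(x[i], x[i + 1], lam, step)
    return x
```

The odd-pair term is analogous (loop starting at index 1), and the two terms together reproduce the full penalty.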
Logistic + Overlapping group lasso penalty
Problem
\[
\arg\min_{x}\ \operatorname{logistic}(x) + \lambda \sum_{g \in \mathcal{G}} \|[x]_g\|_2
\]
[Figure: estimated coefficients vs. ground truth, and objective minus optimum vs. time (in seconds), for \(\lambda \in \{10^{-6}, 10^{-3}, 0.01, 0.1\}\); same methods compared as above.]
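Likewise, the overlapping group lasso term can be split into a few sets of pairwise non-overlapping groups; within one such set the penalty is separable and its prox is group-wise block soft-thresholding. A minimal sketch follows (the group layout is only an example, not from the slides).

```python
import numpy as np

def prox_group_lasso(x, groups, lam, step):
    """Block soft-thresholding: prox of lam * sum_g ||x[g]||_2 over non-overlapping groups."""
    x = np.asarray(x, dtype=float).copy()
    for g in groups:
        norm = np.linalg.norm(x[g])
        shrink = max(0.0, 1.0 - lam * step / norm) if norm > 0 else 0.0
        x[g] = shrink * x[g]
    return x

# Example with two disjoint groups {0, 1, 2} and {3, 4}.
x = np.array([0.1, -0.2, 0.3, 1.5, -2.0])
print(prox_group_lasso(x, [np.arange(3), np.arange(3, 5)], lam=0.5, step=1.0))
```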
Quadratic loss + total variation penalty
Problem
\[
\arg\min_{x}\ \operatorname{least\ squares}(x) + \lambda \|x\|_{TV}
\]
[Figure: recovered coefficients, and objective minus optimum vs. time (in seconds), for \(\lambda \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}\); same methods compared as above.]
Conclusion
• Sufficient decrease condition to set the step-size in the three operator splitting.
• (Mostly) hyperparameter-free; adapts to the local geometry.
• Same convergence guarantees as the fixed step-size method.
• Large empirical improvements, especially in the low-regularization and non-quadratic regime.
Perspectives
• Linear convergence under less restrictive assumptions?
• Acceleration.
https://arxiv.org/abs/1804.02339
References
Armijo, Larry (1966). "Minimization of functions having Lipschitz continuous first partial derivatives". In: Pacific Journal of Mathematics.
Beck, Amir and Marc Teboulle (2009). "Gradient-based algorithms with applications to signal recovery". In: Convex Optimization in Signal Processing and Communications.
Davis, Damek and Wotao Yin (2015). "A three-operator splitting scheme and its optimization applications". In: preprint arXiv:1504.01032v1.
Davis, Damek and Wotao Yin (2017). "A three-operator splitting scheme and its optimization applications". In: Set-Valued and Variational Analysis.
Pedregosa, Fabian and Gauthier Gidel (2018). "Adaptive Three Operator Splitting". In: Proceedings of the 35th International Conference on Machine Learning (ICML).