A new (more intuitive?) interpretation of inertial
algorithms
Silvia Villa
Politecnico di Milano
https://www.mate.polimi.it/analysis/?settore=people
SAMSI Workshop: Operator Splitting Methods in Data Analysis
Raleigh, March 21st, 2018
S. Villa (Polimi) 1 / 20
Introduction
Problem setting
Let
H be a Hilbert space
f : H → R convex, differentiable, with an L-Lipschitz continuous
gradient
Assume that argmin f ≠ ∅.
We consider the problem of computing

min_{x∈H} f(x)
Gradient method
Given x0 ∈ H and γ ∈ ]0, 2/L[
For k ≥ 0 define
xk+1 = xk − γ∇f(xk)
Classic convergence results:
f (xk) − min f = O(1/k)
Convergence of the iterates: xk ⇀ ¯x ∈ argmin f
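As an illustrative sketch (not from the slides), the gradient method on a small made-up quadratic, with f(x) = (1/2)‖Ax − b‖² so that ∇f(x) = Aᵀ(Ax − b):

```python
import numpy as np

# Minimize f(x) = 0.5 * ||A x - b||^2, whose gradient A^T (A x - b)
# is Lipschitz with constant L = ||A^T A||.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.norm(A.T @ A, 2)
gamma = 1.0 / L                          # any gamma in ]0, 2/L[ is admissible

x = np.zeros(2)
for k in range(2000):
    x = x - gamma * A.T @ (A @ x - b)    # x_{k+1} = x_k - gamma * grad f(x_k)

x_bar = np.linalg.solve(A, b)            # here argmin f = {A^{-1} b} and min f = 0
gap = 0.5 * np.linalg.norm(A @ x - b) ** 2
```

On this strongly convex example the iterates converge much faster than the worst-case O(1/k) rate, but the sketch shows the mechanics of the update.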
Accelerated/Inertial gradient method
Given x0 = y0 ∈ H and γ ∈ ]0, 1/L], αk ∈ ]0, 1[:
For k ≥ 0, define
xk+1 = yk − γ∇f(yk)
yk+1 = xk+1 + αk(xk+1 − xk)
Convergence results:
f (xk) − min f = O(1/k2)
xk ⇀ ¯x ∈ argmin f
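A minimal sketch of the scheme above on the same kind of quadratic. The tk-based choice of αk below is the standard FISTA rule, used here as an assumption (the slides discuss choices of αk next):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda z: A.T @ (A @ z - b)       # gradient of f(x) = 0.5 ||A x - b||^2
L = np.linalg.norm(A.T @ A, 2)
gamma = 1.0 / L

x = y = np.zeros(2)
t = 1.0
for k in range(3000):
    x_new = y - gamma * grad(y)                       # xk+1 = yk - gamma grad f(yk)
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # FISTA t-update (assumed choice)
    y = x_new + (t - 1.0) / t_new * (x_new - x)       # yk+1 = xk+1 + alpha_k (xk+1 - xk)
    x, t = x_new, t_new

gap = 0.5 * np.linalg.norm(A @ x - b) ** 2
```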
Convergence results and remarks
Convergence and its rate depend on the choice of αk
Nesterov rule for choosing αk [Nesterov, A method for solving a convex
programming problem with rate of convergence O(1/k²), 1983]
Extension to proximal point algorithm [Güler, New proximal point
algorithms for convex minimization, 1992] and to forward-backward
[Beck-Teboulle, FISTA, 2009]
Other choices for αk [Chambolle-Dossal, 2014], [Attouch-Peypouquet-Redont
2014], [Apidopoulos-Aujol-Dossal 2017]

αk = 1 − α/(k + 2)

For convergence: α > 3.
In the rest of the talk
Brief review of some approaches to show convergence
“Algebraic proof” [Nesterov; Güler; Beck-Teboulle; Chambolle-Dossal]
Estimate sequences [Nesterov; Güler; Salzo-Villa]
Primal and mirror descent combination [Zhu-Orecchia, Linear coupling: An
ultimate unification of gradient and mirror descent, 2017]
ODE approach [Su-Boyd-Candes; Apidopoulos-Aujol-Dossal;
Attouch-Cabot-Peypouquet-Redont-Chbani...]
Many results...but something is still missing :-)
Classic approaches
“Algebraic proof” - Nesterov’s choice of αk
αk = (tk − 1)/tk+1

For an arbitrary y ∈ H, it holds:

(2/L)(t²k+1 − tk+1)(f(xk) − min f) + ‖tk+1y − (tk+1 − 1)xk − ¯x‖²
≥ (2/L) t²k+1 (f(xk+1) − min f) + ‖tk+1xk+1 − (tk+1 − 1)xk − ¯x‖²

If
t²k ≥ t²k+1 − tk+1
and tk+1y = tk+1xk + (tk − 1)(xk − xk−1)
then

(2/L) t²k (f(xk) − min f) + ‖tkxk − (tk − 1)xk−1 − ¯x‖²
≥ (2/L) t²k+1 (f(xk+1) − min f) + ‖tk+1xk+1 − (tk+1 − 1)xk − ¯x‖²

At the end, there exists c > 0 such that

f(xk) − min f ≤ c/t²k =⇒ f(xk) − min f ≤ c/(k + 1)²
Similar proof for the convergence of (xk) [Chambolle-Dossal, On the weak
convergence of the iterates of “FISTA”, 2014]
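The condition t²k ≥ t²k+1 − tk+1 on the previous slide holds with equality for the standard FISTA update tk+1 = (1 + √(1 + 4t²k))/2 (an assumed choice, not stated on the slide), and tk then grows at least like (k + 2)/2, which is what turns the bound c/t²k into a bound of order 1/(k + 1)². A quick numeric check:

```python
import math

t = 1.0
ts = [t]
for k in range(1000):
    t = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0  # t_{k+1}^2 - t_{k+1} = t_k^2
    ts.append(t)

# equality in t_k^2 >= t_{k+1}^2 - t_{k+1} ...
eq_ok = all(abs(ts[k] ** 2 - (ts[k + 1] ** 2 - ts[k + 1])) <= 1e-6 * ts[k + 1] ** 2
            for k in range(1000))
# ... and linear growth t_k >= (k + 2) / 2, hence 1/t_k^2 = O(1/k^2)
growth_ok = all(ts[k] >= (k + 2) / 2.0 - 1e-9 for k in range(1001))
```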
Nesterov’s estimate sequences
Definition
A pair of sequences ϕk : X → R and βk ≥ 0 is called an estimate
sequence of the function f if for any x ∈ X and all k ≥ 0 we have
ϕk(x) − f (x) ≤ βk(ϕ0(x) − f (x)) and βk → 0.
Theorem (Nesterov ’83)
Let (ϕk, βk) be an estimate sequence of f and let min f = f (¯x). If for
some xk ∈ X we have
f (xk) ≤ min ϕk
then
f (xk) − min f ≤ βk(ϕ0(¯x) − f (¯x))
βk gives a convergence rate for f (xk) − min f
A method to build estimate sequences
Given a quadratic function ϕ0 = f(u0) + (A/2)‖· − u0‖², tk > 1, and xk+1, we
define recursively

ϕk+1(x) := (1 − 1/tk)ϕk(x) + (1/tk)(f(yk+1) + ⟨x − yk+1, ∇f(yk+1)⟩),

where the linear term f(yk+1) + ⟨x − yk+1, ∇f(yk+1)⟩ is ≤ f(x) by convexity.

We have:

ϕk(x) = ¯ϕk + (Ak/2)‖x − uk‖² is a quadratic function for every k

ϕk+1(x) − f(x) ≤ (1 − 1/tk)(ϕk(x) − f(x))
≤ ∏_{i=0}^{k} (1 − 1/ti) · (ϕ0(x) − f(x)), with βk+1 := ∏_{i=0}^{k} (1 − 1/ti).
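A numeric sanity check of the contraction inequality above on a one-dimensional convex quadratic (all data below illustrative). The check does not depend on how yk+1 is chosen, because the linear term under-estimates f by convexity:

```python
import numpy as np

f = lambda x: 0.5 * x ** 2 + x            # a convex function, f'(x) = x + 1
df = lambda x: x + 1.0

A0, u0 = 2.0, 3.0
phi = lambda x: f(u0) + 0.5 * A0 * (x - u0) ** 2   # phi_0, a quadratic model

xs = np.linspace(-5.0, 5.0, 201)           # grid of test points x
rng = np.random.default_rng(0)
ok = True
for k in range(20):
    t = 2.0 + k                             # any t_k > 1 works
    y = rng.uniform(-3.0, 3.0)              # y_{k+1}, arbitrary for this check
    prev = phi
    # phi_{k+1}(x) = (1 - 1/t) phi_k(x) + (1/t) (f(y) + (x - y) f'(y))
    phi = (lambda p, t, y: lambda x: (1 - 1 / t) * p(x) + (1 / t) * (f(y) + (x - y) * df(y)))(prev, t, y)
    # contraction: phi_{k+1} - f <= (1 - 1/t_k)(phi_k - f) pointwise
    ok = ok and bool(np.all(phi(xs) - f(xs) <= (1 - 1 / t) * (prev(xs) - f(xs)) + 1e-9))
```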
Choice of tk and yk
The parameters ¯ϕk, Ak, uk can be updated recursively and depend on the
choice of tk and yk.
Lemma
Suppose that xk is such that ¯ϕk ≥ f(xk). Set yk = (1 − 1/tk)xk + (1/tk)uk,
t²k = t²k+1 − tk+1, and xk+1 = yk − γ∇f(yk). Then

¯ϕk+1 ≥ f(xk+1)
More recent approaches
Linear coupling of gradient and mirror steps
Initialize x0 = y0 = u0.

yk+1 = (1 − 1/tk)xk + (1/tk)uk (linear coupling)
xk+1 = yk+1 − (1/L)∇f(yk+1) (gradient step)
uk+1 = uk − αk∇f(xk+1) (mirror step)
Link with estimate sequences?
Bregman estimate sequences version?
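An illustrative sketch of the coupling above on a small quadratic. The parameter choices 1/tk = 2/(k + 2) and αk = (k + 2)/(2L) below are assumptions in the spirit of the linear-coupling analysis, not values taken from the slides:

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x on R^2, grad f(x) = A x - b, L = ||A||.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda z: A @ z - b
f = lambda z: 0.5 * z @ A @ z - b @ z
L = np.linalg.norm(A, 2)
fmin = f(np.linalg.solve(A, b))

x = u = np.zeros(2)
for k in range(2000):
    tau = 2.0 / (k + 2)                  # 1/t_k (assumed schedule)
    y = (1 - tau) * x + tau * u          # linear coupling
    x = y - grad(y) / L                  # gradient step
    u = u - (k + 2) / (2 * L) * grad(x)  # mirror step, alpha_k = (k+2)/(2L) (assumed)

gap = f(x) - fmin
```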
ODE approach
The convergence properties of the algorithm are “derived” from the
asymptotic behavior, as t → +∞, of the evolution system

¨x(t) + (α/t) ˙x(t) + ∇f(x(t)) = 0
x(t0) = x0
˙x(t0) = v0.
A finite difference discretization, with tk = t0 + kh, gives

(1/h²)(xk+1 − 2xk + xk−1) + (α/(kh²))(xk − xk−1) + ∇f(yk) = 0,
with yk to be determined later.
ODE approach
We naturally obtain

xk+1 = xk + (1 − α/k)(xk − xk−1) − h²∇f(yk).

Setting γ = h²:

yk = xk + (1 − α/k)(xk − xk−1)
xk+1 = yk − γ∇f(yk)
The heavy ball
Choosing yk = xk we obtain the “heavy ball method”
xk+1 = xk + (1 − α/k)(xk − xk−1) − γ∇f(xk)
The continuous dynamics is the same, but the convergence behavior of the
algorithm is different.
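A sketch of the heavy ball iteration above on a small strongly convex quadratic (problem data illustrative; α > 3 as on the earlier slide):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])
grad = lambda z: A @ z - b               # gradient of f(x) = 0.5 x^T A x - b^T x
L = np.linalg.norm(A, 2)
gamma, alpha = 1.0 / L, 3.1

x_prev = x = np.zeros(2)
for k in range(1, 5001):
    x_next = x + (1 - alpha / k) * (x - x_prev) - gamma * grad(x)
    x_prev, x = x, x_next

d = x - np.linalg.solve(A, b)
gap = 0.5 * d @ A @ d                    # f(x_k) - min f
```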
The problem
What is the difference between the heavy ball and the inertial
algorithms?
Why does choosing yk in this way accelerate the gradient method?
New directions
Another point of view: link with numerical analysis
[Scieur-Roulet-Bach-D’Aspremont, Integration methods and Accelerated Optimization
Algorithms, 2017]
Use numerical integration schemes to approximate the solution of the
problem:
˙x(t) + ∇f(x(t)) = 0
x(t0) = x0
Linear two-step schemes
Fix h > 0. Given x0, x1 ∈ H, for k ≥ 0:
xk+1 = −ρ1xk − ρ0xk−1 + γ(σ0∇f(xk−1) + σ1∇f(xk)).
Restrictions on ρ0, ρ1, σ0, σ1 in R to guarantee convergence, i.e.

lim_{γ→0} ‖xk − x(t0 + kγ)‖ = 0, ∀k ∈ [1, K]

under appropriate initial conditions x0, x1.
Quadratic strongly convex case
Let H = Rn, ∇f(x) = Ax, with A symmetric positive definite. Then the
inertial coefficient can be chosen constant, equal to α. In Polyak’s method:

xk+1 = xk + (1 − α)(xk − xk−1) − γ∇f(xk)
The heavy ball method is a multi-step method
Recall
xk+1 = −ρ1xk − ρ0xk−1 + γ(σ0∇f(xk−1) + σ1∇f(xk)).
Then, it is enough to choose:
ρ0 = 1 − α
ρ1 = α − 2
σ0 = 0
σ1 = −1
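A numeric check that, with the coefficients ρ0 = 1 − α, ρ1 = α − 2 (note the sign, so that −ρ1 xk = (2 − α) xk), σ0 = 0, σ1 = −1, the two-step scheme reproduces the heavy ball iterates exactly; ∇f(x) = Ax as on the slide, all numerical data illustrative:

```python
import numpy as np

# Two-step scheme: x_{k+1} = -rho1 x_k - rho0 x_{k-1}
#                          + gamma (sigma0 grad(x_{k-1}) + sigma1 grad(x_k))
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda z: A @ z                   # grad f(x) = A x, A symmetric pos. def.
a, gamma = 0.5, 0.05
rho0, rho1, sig0, sig1 = 1.0 - a, a - 2.0, 0.0, -1.0

x_prev, x = np.zeros(2), np.array([1.0, 1.0])        # two starting points
h_prev, h = x_prev.copy(), x.copy()
same = True
for k in range(50):
    x_next = -rho1 * x - rho0 * x_prev + gamma * (sig0 * grad(x_prev) + sig1 * grad(x))
    h_next = h + (1.0 - a) * (h - h_prev) - gamma * grad(h)   # heavy ball step
    same = same and bool(np.allclose(x_next, h_next))
    x_prev, x = x, x_next
    h_prev, h = h, h_next
```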
Quadratic strongly convex case
Let H = Rn, ∇f(x) = Ax, with A symmetric positive definite. Then the
inertial coefficient in the accelerated algorithm can be chosen constant,
equal to α.
yk = xk + (1 − α)(xk − xk−1)
xk+1 = yk − γ∇f(yk)
Accelerated method
Nesterov’s method is a multi-step method
ρ0 = 1 − α
ρ1 = α − 2
σ0 = 1 − α
σ1 = α − 2
It suffices to write
yk+1 = yk − γ∇f(yk) + (1 − α)(yk − γ∇f(yk) − yk−1 + γ∇f(yk−1))
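A numeric check of the rewriting above: run the accelerated scheme with constant inertial coefficient (1 − α) and verify that the resulting yk satisfy the two-step recurrence in y (∇f(x) = Ax as on the slide; numerical data illustrative):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda z: A @ z
a, gamma = 0.5, 0.05

x_prev = np.array([1.0, -2.0])           # x_0
y = x_prev.copy()                        # y_0 = x_0
ys = [y]
x = y - gamma * grad(y)                  # x_1 = y_0 - gamma grad f(y_0)
for k in range(30):
    y = x + (1.0 - a) * (x - x_prev)     # y_{k+1} = x_{k+1} + (1 - a)(x_{k+1} - x_k)
    ys.append(y)
    x_prev, x = x, y - gamma * grad(y)

# Recurrence: y_{k+1} = g(y_k) + (1 - a)(g(y_k) - g(y_{k-1})),
# where g(z) = z - gamma grad f(z) is one gradient step.
g = lambda z: z - gamma * grad(z)
ok = all(
    bool(np.allclose(ys[k + 1], g(ys[k]) + (1.0 - a) * (g(ys[k]) - g(ys[k - 1]))))
    for k in range(1, len(ys) - 1)
)
```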
Conclusions/Open problems
The Polyak method turns out to be “optimal” for quadratic strongly
convex functions
What about the general case? The coefficients are not constant, but
depend on k if f is only convex.
Better choices of αk? Maybe driven by the geometry of the function?
Can this framework encompass restarting techniques?
[O’Donoghue-Candès, Adaptive restart for accelerated gradient schemes, Found.
Comput. Math., 2013.]
New algorithms? New integration methods?
Probabilistic Control of Uncertain Linear Systems Using Stochastic Reachability
 
Robust Control of Uncertain Switched Linear Systems based on Stochastic Reach...
Robust Control of Uncertain Switched Linear Systems based on Stochastic Reach...Robust Control of Uncertain Switched Linear Systems based on Stochastic Reach...
Robust Control of Uncertain Switched Linear Systems based on Stochastic Reach...
 
Hiroyuki Sato
Hiroyuki SatoHiroyuki Sato
Hiroyuki Sato
 
Harmonic Analysis and Deep Learning
Harmonic Analysis and Deep LearningHarmonic Analysis and Deep Learning
Harmonic Analysis and Deep Learning
 

More from The Statistical and Applied Mathematical Sciences Institute

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
The Statistical and Applied Mathematical Sciences Institute
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
The Statistical and Applied Mathematical Sciences Institute
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
The Statistical and Applied Mathematical Sciences Institute
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
The Statistical and Applied Mathematical Sciences Institute
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
The Statistical and Applied Mathematical Sciences Institute
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
The Statistical and Applied Mathematical Sciences Institute
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
The Statistical and Applied Mathematical Sciences Institute
 

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 

Recently uploaded

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
Jisc
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
Sandy Millin
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
Celine George
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
BhavyaRajput3
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
TechSoup
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
Vivekanand Anglo Vedic Academy
 

Recently uploaded (20)

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
The approach at University of Liverpool.pptx
The approach at University of Liverpool.pptxThe approach at University of Liverpool.pptx
The approach at University of Liverpool.pptx
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
 
2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...2024.06.01 Introducing a competency framework for languag learning materials ...
2024.06.01 Introducing a competency framework for languag learning materials ...
 
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......Ethnobotany and Ethnopharmacology ......
Ethnobotany and Ethnopharmacology ......
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCECLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
CLASS 11 CBSE B.St Project AIDS TO TRADE - INSURANCE
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup   New Member Orientation and Q&A (May 2024).pdfWelcome to TechSoup   New Member Orientation and Q&A (May 2024).pdf
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdf
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 

QMC: Operator Splitting Workshop, A New (More Intuitive?) Interpretation of Inertial Algorithms - Silvia Villa, Mar 21, 2018

  • 9. Introduction
Convergence results and remarks (continued)
Other choices for αk [Chambolle-Dossal, 2014], [Attouch-Peypouquet-Redont, 2014], [Apidopoulos-Aujol-Dossal, 2017]:
αk = 1 − α/(k + 2)
For convergence: α > 3.
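The inertial scheme with the rule αk = 1 − α/(k + 2) can be sketched numerically. Everything below other than the update rule itself is an illustrative assumption: the quadratic test problem, the parameter values, and the clamping of αk at 0 for the first iterations (where 1 − α/(k + 2) is negative).

```python
import numpy as np

def inertial_gradient(grad, x0, gamma, alpha=4.0, iters=2000):
    """Inertial gradient method with the rule alpha_k = 1 - alpha/(k + 2)."""
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()
    traj = []
    for k in range(iters):
        x = y - gamma * grad(y)                 # gradient step at the extrapolated point
        a_k = max(0.0, 1.0 - alpha / (k + 2))   # clamped at 0 for the first iterations
        y = x + a_k * (x - x_prev)              # inertial extrapolation
        x_prev = x
        traj.append(x)
    return traj

# Assumed test problem: f(x) = 0.5 x^T A x, minimized at the origin, L = 100.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

traj = inertial_gradient(grad, [1.0, 1.0], gamma=1.0 / 100, alpha=4.0)
print(f(traj[-1]))
```

On this ill-conditioned quadratic the function values decay visibly faster than plain gradient descent with the same step size, consistent with the O(1/k²) rate above.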
  • 10. Introduction
In the rest of the talk
Brief review of some approaches to show convergence:
“Algebraic proof” [Nesterov; Güler; Beck-Teboulle; Chambolle-Dossal]
Estimate sequences [Nesterov; Güler; Salzo-Villa]
Primal and mirror descent combination [Zhu-Orecchia, Linear coupling: An ultimate unification of gradient and mirror descent, 2017]
ODE approach [Su-Boyd-Candes; Apidopoulos-Aujol-Dossal; Attouch-Cabot-Peypouquet-Redont-Chbani, ...]
Many results... but something is still missing :-)
  • 13. Classic approaches
“Algebraic proof”: Nesterov’s choice of αk
αk = (tk − 1)/tk+1
For an arbitrary y ∈ H, it holds:
(2/L)(tk+1² − tk+1)(f(xk) − min f) + ‖tk+1 y − (tk+1 − 1)xk − x̄‖²
  ≥ (2/L) tk+1² (f(xk+1) − min f) + ‖tk+1 xk+1 − (tk+1 − 1)xk − x̄‖²
If tk² ≥ tk+1² − tk+1 and tk+1 y = tk+1 xk + (tk − 1)(xk − xk−1),
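The condition tk² ≥ tk+1² − tk+1 is satisfied with equality by the classical recursion tk+1 = (1 + √(1 + 4tk²))/2 with t1 = 1 (a standard fact, stated here as background rather than taken from the slide). A short sketch generating the sequence and checking the identity and the linear growth numerically:

```python
import math

def nesterov_t(n):
    """t_1 = 1, t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2: the positive root of
    t_{k+1}^2 - t_{k+1} = t_k^2, so the slide's condition holds with equality."""
    ts = [1.0]
    for _ in range(n - 1):
        t = ts[-1]
        ts.append((1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0)
    return ts

ts = nesterov_t(100)
# the defining identity holds (up to rounding) at every step
assert all(abs(ts[k] ** 2 - (ts[k + 1] ** 2 - ts[k + 1])) < 1e-6 for k in range(99))
# linear growth: list index k holds t_{k+1}, and t_{k+1} >= (k + 2)/2,
# which is what turns the c/t_k^2 bound into a c/(k + 1)^2 rate
assert all(ts[k] >= (k + 2) / 2 - 1e-12 for k in range(100))
```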
  • 14. Classic approaches
then
(2/L) tk² (f(xk) − min f) + ‖tk xk − (tk − 1)xk−1 − x̄‖²
  ≥ (2/L) tk+1² (f(xk+1) − min f) + ‖tk+1 xk+1 − (tk+1 − 1)xk − x̄‖²
At the end, there exists c > 0 such that
f(xk) − min f ≤ c/tk²  =⇒  f(xk) − min f ≤ c/(k + 1)²
Similar proof for the convergence of (xk) [Chambolle-Dossal, On the weak convergence of the iterates of “FISTA”, 2014]
  • 17. Classic approaches
Nesterov’s estimate sequences
Definition. A pair of sequences ϕk : X → R and βk ≥ 0 is called an estimate sequence of the function f if for any x ∈ X and all k ≥ 0 we have
ϕk(x) − f(x) ≤ βk(ϕ0(x) − f(x))  and  βk → 0.
Theorem (Nesterov ’83). Let (ϕk, βk) be an estimate sequence of f and let min f = f(x̄). If for some xk ∈ X we have f(xk) ≤ min ϕk, then
f(xk) − min f ≤ βk(ϕ0(x̄) − f(x̄)).
βk gives a convergence rate for f(xk) − min f.
  • 19. Classic approaches - A method to build estimate sequences
Given a quadratic function $\varphi_0 = f(u_0) + \frac{A}{2}\|\cdot - u_0\|^2$, $t_k > 1$, and $x_{k+1}$, we define recursively
$\varphi_{k+1}(x) := (1 - t_k^{-1})\varphi_k(x) + t_k^{-1}\big(f(y_{k+1}) + \langle x - y_{k+1}, \nabla f(y_{k+1})\rangle\big),$
where the last term is linear in $x$ and, by convexity, bounded above by $f(x)$. We have:
$\varphi_k = \bar{\varphi}_k + \frac{A_k}{2}\|x - u_k\|^2$ is a quadratic function for every $k$;
$\varphi_{k+1}(x) - f(x) \leq (1 - t_k^{-1})(\varphi_k(x) - f(x)) \leq \prod_{i=0}^{k}(1 - t_i^{-1})\,(\varphi_0(x) - f(x))$, with $\beta_{k+1} = \prod_{i=0}^{k}(1 - t_i^{-1})$.
  • 20. Classic approaches - Choice of $t_k$ and $y_k$
The parameters $\bar{\varphi}_k$, $A_k$, $u_k$ can be updated recursively and depend on the choice of $t_k$ and $y_k$.
Lemma. Suppose that $x_k$ is such that $\bar{\varphi}_k \geq f(x_k)$. Set $y_k = (1 - t_k^{-1})x_k + t_k^{-1}u_k$, $t_k^2 = t_{k+1}^2 - t_{k+1}$, and $x_{k+1} = y_k - \gamma\nabla f(y_k)$. Then $\bar{\varphi}_{k+1} \geq f(x_{k+1})$.
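The coupling condition $t_k^2 = t_{k+1}^2 - t_{k+1}$ in the Lemma is satisfied, with equality, by the classical update $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$, under which $t_k$ grows linearly in $k$ (which is what turns the product $\beta_k$ into an $O(1/k^2)$ rate). A quick sanity check:

```python
import math

# The update t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2 satisfies
# t_k^2 = t_{k+1}^2 - t_{k+1} exactly, and t_k grows at least like k/2.
t = 1.0
for k in range(1000):
    t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
    assert abs(t * t - (t_next * t_next - t_next)) < 1e-6
    t = t_next
```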
  • 21–23. More recent approaches - Linear coupling of gradient and mirror steps
Initialize $x_0 = y_0 = u_0$. For $k \geq 0$:
$y_{k+1} = (1 - t_k^{-1})x_k + t_k^{-1}u_k$ (linear coupling)
$x_{k+1} = y_{k+1} - \frac{1}{L}\nabla f(y_{k+1})$ (gradient step)
$u_{k+1} = u_k - \alpha_k\nabla f(x_{k+1})$ (mirror step)
Link with estimate sequences? A Bregman estimate-sequences version?
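A minimal sketch of the three steps above on a diagonal quadratic, with the Euclidean distance as mirror map and the parameter coupling $t_k^{-1} = 2/(k+2)$, $\alpha_{k+1} = (k+2)/(2L)$ in the style of Allen-Zhu and Orecchia's linear-coupling analysis (illustrative choices; other admissible couplings exist):

```python
# Diagonal quadratic f(x) = 0.5 * sum(a_i * x_i^2): min f = 0, L = max(a).
a = [1.0, 4.0, 25.0]
L = max(a)
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x = [1.0, 1.0, 1.0]        # x_0
u = x[:]                   # u_0 = x_0 (= y_0)
for k in range(500):
    tau = 2.0 / (k + 2)                                      # tau = 1/t_k
    y = [(1 - tau) * xi + tau * ui for xi, ui in zip(x, u)]  # linear coupling
    x = [yi - gi / L for yi, gi in zip(y, grad(y))]          # gradient step
    alpha = (k + 2) / (2.0 * L)                              # mirror step size
    u = [ui - alpha * gi for ui, gi in zip(u, grad(x))]      # mirror step (Euclidean)
```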
  • 24. More recent approaches - ODE approach
The convergence properties of the algorithm are "derived" from the asymptotic behavior, as $t \to +\infty$, of the evolution system
$\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla f(x(t)) = 0, \quad x(t_0) = x_0, \quad \dot{x}(t_0) = v_0.$
A finite-difference discretization, with $t_k = t_0 + kh$, gives
$\frac{1}{h^2}(x_{k+1} - 2x_k + x_{k-1}) + \frac{\alpha}{kh^2}(x_k - x_{k-1}) + \nabla f(y_k) = 0,$
with $y_k$ to be determined later.
  • 25. More recent approaches - ODE approach (continued)
We naturally obtain
$x_{k+1} = x_k + \Big(1 - \frac{\alpha}{k}\Big)(x_k - x_{k-1}) - h^2\nabla f(y_k).$
Setting $\gamma = h^2$:
$y_k = x_k + \Big(1 - \frac{\alpha}{k}\Big)(x_k - x_{k-1}), \quad x_{k+1} = y_k - \gamma\nabla f(y_k).$
  • 26. More recent approaches - The heavy ball
Choosing $y_k = x_k$ we obtain the "heavy ball" method
$x_{k+1} = x_k + \Big(1 - \frac{\alpha}{k}\Big)(x_k - x_{k-1}) - \gamma\nabla f(x_k).$
The continuous dynamic is the same, but the convergence behavior of the algorithm is different.
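The two discretizations can be run side by side; in code, the only difference is where the gradient is evaluated, at $x_k$ (heavy ball) or at the extrapolated point $y_k$ (inertial). A sketch on a toy quadratic with $\alpha = 3$ (an illustrative choice; both runs converge here, the point is that the schemes differ by one line):

```python
# Toy quadratic f(x) = 0.5 * sum(a_i * x_i^2), min f = 0.
a = [1.0, 30.0]
gamma = 1.0 / max(a)       # gamma = 1/L
alpha = 3.0
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

def run(gradient_at_y, iters=300):
    x_prev = [1.0, 1.0]
    x = [1.0, 1.0]
    for k in range(1, iters + 1):
        beta = max(0.0, 1.0 - alpha / k)                        # 1 - alpha/k
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        g = grad(y) if gradient_at_y else grad(x)               # the only difference
        x_prev, x = x, [yi - gamma * gi for yi, gi in zip(y, g)]
    return f(x)

inertial, heavy_ball = run(True), run(False)
```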
  • 27–28. More recent approaches - The problem
What is the difference between the heavy ball and the inertial algorithms? Why does choosing $y_k$ in this way accelerate the gradient method?
  • 29. New directions - Another point of view: link with numerical analysis [Scieur-Roulet-Bach-D'Aspremont, Integration Methods and Accelerated Optimization Algorithms, 2017]
Use numerical integration schemes to approximate the solution of the problem:
$\dot{x}(t) + \nabla f(x(t)) = 0, \quad x(t_0) = x_0.$
Linear two-step schemes: fix $h > 0$. Given $x_0, x_1 \in H$, for $k \geq 0$:
$x_{k+1} = -\rho_1 x_k - \rho_0 x_{k-1} + \gamma(\sigma_0\nabla f(x_{k-1}) + \sigma_1\nabla f(x_k)).$
Restrictions on $\rho_0, \rho_1, \sigma_0, \sigma_1 \in \mathbb{R}$ guarantee convergence, i.e.
$\lim_{\gamma \to 0}\|x_k - x(t_0 + k\gamma)\| = 0, \quad \forall k \in [1, K],$
under appropriate initial conditions $x_0, x_1$.
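The simplest instance of the template is explicit Euler, $x_{k+1} = x_k - \gamma\nabla f(x_k)$, i.e. $\rho_1 = -1$, $\rho_0 = \sigma_0 = 0$, $\sigma_1 = -1$. A quick consistency check on a 1-D quadratic, whose gradient flow is known in closed form (constants are illustrative):

```python
import math

# 1-D quadratic f(x) = 0.5 * lam * x^2; gradient flow: x(t) = x0 * exp(-lam * t).
lam, x0, T = 2.0, 1.0, 1.0

def euler_endpoint(gamma):
    # x_{k+1} = x_k - gamma * grad f(x_k), iterated until time T
    x = x0
    for _ in range(int(round(T / gamma))):
        x = x - gamma * lam * x
    return x

exact = x0 * math.exp(-lam * T)
errs = [abs(euler_endpoint(g) - exact) for g in (0.1, 0.01, 0.001)]
# the discretization error at time T shrinks as gamma -> 0 (consistency)
```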
  • 30. New directions - Quadratic strongly convex case
Let $H = \mathbb{R}^n$, $f(x) = \langle Ax, x\rangle$, with $A$ symmetric positive definite. Then the inertial coefficient can be chosen constant, equal to $\alpha$. Polyak's method:
$x_{k+1} = x_k + (1 - \alpha)(x_k - x_{k-1}) - \gamma\nabla f(x_k).$
The heavy ball method is a multi-step method. Recall
$x_{k+1} = -\rho_1 x_k - \rho_0 x_{k-1} + \gamma(\sigma_0\nabla f(x_{k-1}) + \sigma_1\nabla f(x_k)).$
Then it is enough to choose $\rho_0 = 1 - \alpha$, $\rho_1 = \alpha - 2$, $\sigma_0 = 0$, $\sigma_1 = -1$ (so that $-\rho_1 x_k = (2 - \alpha)x_k$ matches the momentum term).
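This identification can be checked numerically: with the sign convention $x_{k+1} = -\rho_1 x_k - \rho_0 x_{k-1} + \gamma(\sigma_0\nabla f(x_{k-1}) + \sigma_1\nabla f(x_k))$, so that $-\rho_1 = 2 - \alpha$, the two recursions generate the same iterates. A 1-D quadratic sketch (constants are illustrative):

```python
# 1-D quadratic with grad f(x) = lam * x; illustrative constants.
lam, gamma, alpha = 3.0, 0.1, 0.5
grad = lambda x: lam * x

rho0, rho1, sig0, sig1 = 1 - alpha, alpha - 2, 0.0, -1.0

xp, x = 1.0, 0.9           # x_0, x_1 (same warm start for both forms)
yp, y = 1.0, 0.9
for _ in range(50):
    x_next = x + (1 - alpha) * (x - xp) - gamma * grad(x)      # Polyak's recursion
    y_next = -rho1 * y - rho0 * yp + gamma * (sig0 * grad(yp) + sig1 * grad(y))
    assert abs(x_next - y_next) < 1e-9                          # same iterate
    xp, x = x, x_next
    yp, y = y, y_next
```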
  • 31–32. New directions - Quadratic strongly convex case (continued)
With the same quadratic $f$, the inertial coefficient in the accelerated algorithm can be chosen constant, equal to $\alpha$:
$y_k = x_k + (1 - \alpha)(x_k - x_{k-1}), \quad x_{k+1} = y_k - \gamma\nabla f(y_k)$ (accelerated method).
Nesterov's method is also a multi-step method, with $\rho_0 = 1 - \alpha$, $\rho_1 = \alpha - 2$, $\sigma_0 = 1 - \alpha$, $\sigma_1 = \alpha - 2$. It suffices to write
$y_{k+1} = y_k - \gamma\nabla f(y_k) + (1 - \alpha)\big(y_k - \gamma\nabla f(y_k) - y_{k-1} + \gamma\nabla f(y_{k-1})\big),$
whose expansion, $(2 - \alpha)y_k - (1 - \alpha)y_{k-1} - \gamma(2 - \alpha)\nabla f(y_k) + \gamma(1 - \alpha)\nabla f(y_{k-1})$, matches the two-step template with these coefficients.
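The identity above can be verified numerically: the $y$-sequence produced by the two-line accelerated method satisfies the one-line recursion in $y$ alone. A 1-D quadratic sketch (constants are illustrative):

```python
# 1-D quadratic with grad f(x) = lam * x; constant momentum 1 - alpha.
lam, gamma, alpha = 2.0, 0.1, 0.5
grad = lambda x: lam * x

x_prev = y = 1.0           # x_0 = y_0
ys = [y]
for _ in range(40):
    x = y - gamma * grad(y)              # x_{k+1} = y_k - gamma * grad f(y_k)
    y = x + (1 - alpha) * (x - x_prev)   # y_{k+1} = x_{k+1} + (1-alpha)(x_{k+1} - x_k)
    x_prev = x
    ys.append(y)

# the y-sequence satisfies the one-line recursion for k >= 1
for k in range(1, 40):
    rhs = ys[k] - gamma * grad(ys[k]) + (1 - alpha) * (
        ys[k] - gamma * grad(ys[k]) - ys[k - 1] + gamma * grad(ys[k - 1]))
    assert abs(ys[k + 1] - rhs) < 1e-9
```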
  • 33–37. New directions - Conclusions/Open problems
The Polyak method turns out to be "optimal" for quadratic strongly convex functions.
What about the general case? The coefficients are not constant, but depend on $k$ if $f$ is only convex.
Better choices of $\alpha_k$? Maybe driven by the geometry of the function?
Can this framework encompass restarting techniques? [O'Donoghue-Candès, Adaptive restart for accelerated gradient schemes, Found. Comput. Math., 2013]
New algorithms? New integration methods?