A Differential Equation for Modeling Nesterov’s
Accelerated Gradient Method: Theory and Insights
Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014
speaker: kv
MCLab, CITI, Academia Sinica
May 14, 2015
Overview
1 Introduction
Smooth Unconstrained Optimization
Accelerated Scheme
Ordinary Differential Equation
2 Derivation of the ODE
Simple Properties
3 Equivalence between the ODE and Nesterov’s scheme
Analogous Convergence Rate
Quadratic f and Bessel function
Equivalence between the ODE and Nesterov’s Scheme
4 A family of generalized Nesterov’s schemes
Continuous Optimization
Composite Optimization
5 New Restart Scheme
6 Accelerating to linear convergence by restarting
Numerical examples
7 Discussion
Introduction
Smooth Unconstrained Optimization
We wish to minimize a smooth convex function
minimize f (x)
where f : Rn → R has a Lipschitz continuous gradient
||∇f (x) − ∇f (y)|| ≤ L ||x − y||
µ-strong convexity: f (x) − µ||x||²/2 is convex
In this paper, FL denotes the class of convex functions f with L-Lipschitz continuous gradients defined on Rn; Sµ denotes the class of µ-strongly convex functions f on Rn. We set Sµ,L = FL ∩ Sµ.
Introduction: Accelerated Scheme
Nesterov’s Accelerated Gradient Scheme
xk = yk−1 − s∇f (yk−1) (1)
yk = xk + ((k − 1)/(k + 2)) (xk − xk−1) (2)
For the fixed step size s = 1/L, where L is the Lipschitz constant of ∇f , this scheme exhibits the convergence rate
f (xk) − f∗ ≤ O( L||x0 − x∗||² / k² )
This improvement relies on the introduction of the momentum term xk − xk−1 and the carefully tuned coefficient (k − 1)/(k + 2) = 1 − 3/(k + 2).
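For concreteness, a minimal NumPy sketch of iteration (1)-(2); the test function, step size, and iteration count below are illustrative choices, not taken from the slides.

```python
import numpy as np

def nesterov(grad_f, x0, s, iters):
    """Nesterov's accelerated gradient scheme, equations (1)-(2), with step size s."""
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, iters + 1):
        x = y - s * grad_f(y)                          # gradient step (1)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)   # momentum step (2)
        x_prev = x
    return x

# Illustrative run on f(x) = 0.02*x1^2 + 0.005*x2^2 (the example used later in the slides)
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
L = 0.04                                  # Lipschitz constant of the gradient
print(nesterov(grad, np.array([1.0, 1.0]), s=1.0 / L, iters=500))
```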
Accelerated Scheme: Oscillation Problem
In general, Nesterov's scheme is not monotone in the objective value, due to the momentum term.
Introduction: Second order ODE
We derive a second-order ordinary differential equation (ODE), which is the exact limit of Nesterov's scheme as the step size tends to zero:
Ẍ + (3/t) Ẋ + ∇f (X) = 0 (3)
for t > 0, with initial conditions X(0) = x0, Ẋ(0) = 0; here x0 is the starting point in Nesterov's scheme. Ẋ denotes the velocity and Ẍ the acceleration.
Small t: the large damping coefficient 3/t makes the ODE behave like an over-damped system
Large t: as t increases, the system behaves like an under-damped system, oscillating with an amplitude that decreases to zero
The time parameter in this ODE is related to the step size by t ∼ k√s
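As a rough sanity check of the correspondence t ∼ k√s, here is a sketch that integrates ODE (3) with SciPy for the quadratic example used in the following slides; starting at a small t0 > 0 to sidestep the 3/t singularity, and the solver tolerances, are my own choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

# f(x) = 0.02*x1^2 + 0.005*x2^2, the example from the next slides
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])

def rhs(t, state):
    # state = (X, V) with V = dX/dt; ODE (3) reads dV/dt = -(3/t) V - grad f(X)
    x, v = state[:2], state[2:]
    return np.concatenate([v, -(3.0 / t) * v - grad(x)])

x0 = np.array([1.0, 1.0])
t0, t_end = 1e-6, 100.0                   # start just after 0 to avoid the 3/t singularity
sol = solve_ivp(rhs, (t0, t_end), np.concatenate([x0, np.zeros(2)]),
                t_eval=np.linspace(t0, t_end, 1000), rtol=1e-8, atol=1e-10)
X = sol.y[:2].T                           # X(t); compare with iterate x_k at t = k*sqrt(s)
```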
An Example: Trajectories
Minimize f = 0.02x1² + 0.005x2²
An Example: Zoomed Trajectories
Minimize f = 0.02x1² + 0.005x2²
An Example: Errors f − f ∗
Minimize f = 0.02x1² + 0.005x2²
Derivation of the ODE
Derivation of the ODE
Assume f ∈ FL for some L > 0. Combining the two equations of (1)-(2) and applying a rescaling gives
(xk+1 − xk)/√s = ((k − 1)/(k + 2)) (xk − xk−1)/√s − √s ∇f (yk) (4)
Introduce the ansatz xk ∼ X(k√s) for a smooth curve X(t) defined for t > 0. With these approximations, we get the Taylor expansions:
(xk+1 − xk)/√s = Ẋ(t) + (1/2)Ẍ(t)√s + o(√s)
(xk − xk−1)/√s = Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)
√s ∇f (yk) = √s ∇f (X(t)) + o(√s)
where in the third equality we use yk − X(t) = o(1)
Derivation of the ODE
The formula (4) can be rewritten as
Ẋ(t) + (1/2)Ẍ(t)√s + o(√s) = (1 − 3√s/t){Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)} − √s ∇f (X(t)) + o(√s)
By comparing the coefficients of √s, we obtain
Ẍ + (3/t) Ẋ + ∇f (X) = 0
Theorem (Well posed ODE, Existence and Uniqueness)
For any f ∈ F∞ := ∪L>0 FL and any x0 ∈ Rn, the ODE (3) with initial conditions X(0) = x0, Ẋ(0) = 0 has a unique global solution X.
Simple Properties
Invariance
ODE is invariant under time change and invariant under transformation
Initial asymptotic
Assume X is sufficiently smooth, so that limt→0 Ẍ(t) exists. The Mean Value Theorem guarantees the existence of ζ ∈ (0, t) such that
Ẋ(t)/t = (Ẋ(t) − Ẋ(0))/t = Ẍ(ζ)
Hence the ODE reduces to Ẍ(t) + 3Ẍ(ζ) + ∇f (X(t)) = 0. Taking t → 0 gives Ẍ(0) = −∇f (x0)/4, so for small t
X(t) = x0 − ∇f (x0) t²/8 + o(t²)
This is consistent with the empirical observation that Nesterov's scheme moves slowly in the beginning.
Connections and Interpretations
Analogous Convergence Rate
Now, we exhibit approximate equivalence between the ODE and
Nesterov’s scheme in terms of convergence rate.
Theorem (Discrete Nesterov Scheme 3.1)
For any f ∈ FL, the sequence {xk} in (1)-(2) with step size s ≤ 1/L obeys
f (xk) − f∗ ≤ 2||x0 − x∗||² / (s(k + 1)²)
This first result, together with the following continuous analogue, indicates that the trajectory of the ODE (3) closely resembles the sequence {xk} in terms of the convergence rate.
Theorem (Continuous ODE Scheme 3.2)
For any f ∈ F∞, let X(t) be the unique global solution to (3) with initial conditions X(0) = x0, Ẋ(0) = 0. For any t > 0,
f (X(t)) − f∗ ≤ 2||x0 − x∗||² / t²
Proof of Theorem 3.2
Consider the energy functional defined as
ε(t) := t²(f (X(t)) − f∗) + 2||X + (t/2)Ẋ − x∗||²
whose time derivative is
ε̇(t) = 2t(f (X) − f∗) + t²⟨∇f , Ẋ⟩ + 4⟨X + (t/2)Ẋ − x∗, (3/2)Ẋ + (t/2)Ẍ⟩
Substituting (3/2)Ẋ + (t/2)Ẍ = −t∇f (X)/2, from the ODE (3), gives
ε̇(t) = 2t(f (X) − f∗) + 4⟨X − x∗, −t∇f (X)/2⟩ (the term t²⟨∇f , Ẋ⟩ cancels against 4⟨(t/2)Ẋ, −t∇f (X)/2⟩)
= 2t(f (X) − f∗) − 2t⟨X − x∗, ∇f (X)⟩ ≤ 0
where the last inequality follows from the convexity of f : ⟨X − x∗, ∇f (X)⟩ ≥ f (X) − f∗.
By the monotonicity of ε and the non-negativity of 2||X + (t/2)Ẋ − x∗||², the gap obeys
f (X(t)) − f∗ ≤ ε(t)/t² ≤ ε(0)/t² = 2||x0 − x∗||²/t²
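The Lyapunov argument can also be checked numerically. Below is a sketch, again on the illustrative quadratic, that integrates ODE (3) and verifies that ε(t) is (up to integration error) non-increasing; the grid, tolerances, and test function are assumptions of this sketch.

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: 0.02 * x[0]**2 + 0.005 * x[1]**2
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
x_star, f_star = np.zeros(2), 0.0

def rhs(t, state):
    x, v = state[:2], state[2:]
    return np.concatenate([v, -(3.0 / t) * v - grad(x)])

x0 = np.array([1.0, 1.0])
t = np.linspace(1e-6, 50.0, 2000)
sol = solve_ivp(rhs, (t[0], t[-1]), np.concatenate([x0, np.zeros(2)]),
                t_eval=t, rtol=1e-9, atol=1e-12)
X, V = sol.y[:2].T, sol.y[2:].T

# energy functional eps(t) = t^2 (f(X) - f*) + 2 || X + (t/2) V - x* ||^2
eps = t**2 * (np.array([f(x) for x in X]) - f_star) \
      + 2.0 * np.sum((X + 0.5 * t[:, None] * V - x_star)**2, axis=1)
print(np.all(np.diff(eps) <= 1e-6))       # eps is numerically non-increasing
```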
Quadratic f and Bessel function
For quadratic f , we have
f (x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩
where A ∈ Rn×n. A simple translation can absorb the linear term ⟨b, x⟩. We assume A is positive semi-definite, admitting the spectral decomposition A = QᵀΛQ; replacing x with Qx, we may assume f = (1/2)⟨x, Λx⟩. The ODE then decouples coordinatewise into
Ẍi + (3/t) Ẋi + λi Xi = 0
for i = 1, . . . , n
Quadratic f and Bessel function
Introduce Yi (u) = uXi (u/√λi ), which satisfies
u²Ÿi + uẎi + (u² − 1)Yi = 0
This is Bessel's differential equation of order 1. Solving it and matching the initial conditions, we obtain
Xi (t) = (2x0,i /(t√λi )) J1(t√λi )
For large t, the Bessel function has the asymptotic form
J1(t) = √(2/(πt)) (cos(t − 3π/4) + O(1/t))
Quadratic f and Bessel function: Example
Minimize f = 0.02x1² + 0.005x2², whose eigenvalues are λ1,2 = 0.02, 0.005
f (X) − f∗ = f (X) = Σi=1,...,n (2x0,i² / t²) J1(t√λi )²
Denote the two major periods by T1, T2. We get T1 = π/√λ1 = 22.214 and T2 = π/√λ2 = 44.423
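The closed-form error can be evaluated directly with scipy.special.j1; a sketch using the eigenvalues as quoted on the slide, with an assumed starting point and an arbitrary time grid.

```python
import numpy as np
from scipy.special import j1

lam = np.array([0.02, 0.005])             # eigenvalues as quoted on the slide
x0 = np.array([1.0, 1.0])                 # assumed starting point
t = np.linspace(1.0, 300.0, 3000)

# f(X(t)) - f* = sum_i 2 x_{0,i}^2 / t^2 * J1(t sqrt(lambda_i))^2
err = np.sum(2.0 * x0**2 * j1(np.outer(t, np.sqrt(lam)))**2, axis=1) / t**2

T1, T2 = np.pi / np.sqrt(lam)             # the two dominant oscillation periods
print(T1, T2)                             # approx. 22.2 and 44.4
```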
Equivalence between the ODE and Nesterov’s scheme
We study the stable step size for numerically solving the ODE. The finite difference approximation of (3) by the forward Euler method is
(X(t + ∆t) − 2X(t) + X(t − ∆t))/∆t² + (3/t)(X(t) − X(t − ∆t))/∆t + ∇f (X(t)) = 0
which is equivalent to
X(t + ∆t) = (2 − 3∆t/t)X(t) − ∆t²∇f (X(t)) − (1 − 3∆t/t)X(t − ∆t)
Assuming that f is sufficiently smooth, for a small perturbation the characteristic equation of this finite difference scheme is approximately (identifying k = t/∆t)
det{λ² − (2 − ∆t²∇²f − 3∆t/t)λ + 1 − 3∆t/t} = 0 (5)
For numerical stability, all the roots of (5) should lie in the unit circle, i.e. have modulus at most one.
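One can probe the stability region numerically by computing the roots of the scalar version of (5) for a single Hessian eigenvalue; the eigenvalue, the value of t, and the grid of ∆t below are illustrative choices.

```python
import numpy as np

def max_root_modulus(dt, t, hess_eig):
    """Largest root modulus of the scalar characteristic polynomial in (5):
    lambda^2 - (2 - dt^2 h - 3 dt/t) lambda + (1 - 3 dt/t) = 0."""
    b = -(2.0 - dt**2 * hess_eig - 3.0 * dt / t)
    c = 1.0 - 3.0 * dt / t
    return np.max(np.abs(np.roots([1.0, b, c])))

hess_eig, t = 0.04, 200.0                 # illustrative: largest Hessian eigenvalue of the quadratic above
for dt in [1.0, 5.0, 9.0, 10.0, 11.0]:
    print(dt, max_root_modulus(dt, t, hess_eig))   # stable while the modulus stays <= 1
```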
A family of generalized Nesterov’s schemes
To exploit the power of the ODE, we study the ODE (3) with the constant 3 in the coefficient of Ẋ/t replaced by a general constant r:
Ẍ + (r/t) Ẋ + ∇f (X) = 0, X(0) = x0, Ẋ(0) = 0 (6)
Using an argument similar to Theorem 2.1, this ODE is guaranteed to admit a unique global solution for any f ∈ F∞.
Generalized Nesterov’s Scheme: Continuous Optimization
Theorem (4.1)
Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F∞. Then X(t) obeys
f (X(t)) − f∗ ≤ (r − 1)²||x0 − x∗||² / (2t²)
and
∫0^∞ t(f (X(t)) − f∗) dt ≤ (r − 1)²||x0 − x∗||² / (2(r − 3))
Theorem (4.2)
For any f ∈ Sµ,L(Rn), the unique solution X to (6) with r ≥ 9/2 obeys
f (X(t)) − f∗ ≤ C r^(5/2) ||x0 − x∗||² / (t³√µ)
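A sketch that integrates the generalized ODE (6) for a few r > 3 and checks the first bound of Theorem 4.1 numerically; the quadratic objective, the values of r, and the time horizon are my own choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: 0.02 * x[0]**2 + 0.005 * x[1]**2
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
x0, x_star, f_star = np.array([1.0, 1.0]), np.zeros(2), 0.0

def gap_along_ode(r, t_end=200.0):
    """f(X(t)) - f* along the solution of (6) with damping coefficient r."""
    rhs = lambda t, s: np.concatenate([s[2:], -(r / t) * s[2:] - grad(s[:2])])
    t = np.linspace(1e-6, t_end, 2000)
    sol = solve_ivp(rhs, (t[0], t_end), np.concatenate([x0, np.zeros(2)]),
                    t_eval=t, rtol=1e-9, atol=1e-12)
    return t, np.array([f(x) for x in sol.y[:2].T]) - f_star

for r in [4.0, 4.5, 7.0]:
    t, gap = gap_along_ode(r)
    bound = (r - 1.0)**2 * np.sum((x0 - x_star)**2) / (2.0 * t**2)
    print(r, bool(np.all(gap <= bound + 1e-10)))   # the O(1/t^2) bound of Theorem 4.1 holds
```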
Generalized Nesterov’s Scheme: Continuous Optimization
For example, the solution to (6) with f (x) = ||x||²/2 is
X(t) = 2^((r−1)/2) Γ((r + 1)/2) J(r−1)/2(t) / t^((r−1)/2)
where J(r−1)/2(·) is the Bessel function of the first kind of order (r − 1)/2. For large t, this Bessel function obeys
J(r−1)/2(t) = √(2/(πt)) (cos(t − rπ/4) + O(1/t))
Hence
f (X(t)) − f∗ ≤ ||x0 − x∗||²/t^r
Generalized Nesterov’s Scheme: Composite Optimization
Skip this part
minx∈Rn f (x) = g(x) + h(x)
where g ∈ FL for some L > 0 and h is convex on Rn, possibly taking the extended value ∞. Define the proximal subgradient
Gs (x) := ( x − argminz {||z − (x − s∇g(x))||²/(2s) + h(z)} ) / s
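A sketch of Gs for the common special case h(x) = λ||x||1, where the argmin above is the soft-thresholding operator; the choice of h, of g, and all parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: elementwise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def G_s(x, grad_g, s, lam):
    """Proximal subgradient G_s(x) for f = g + lam*||.||_1."""
    z = soft_threshold(x - s * grad_g(x), s * lam)    # the argmin in the definition above
    return (x - z) / s

# Illustrative use with g(x) = ||x||^2 / 2
grad_g = lambda x: x
print(G_s(np.array([1.0, -0.05, 0.3]), grad_g, s=0.5, lam=0.1))
```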
New Restart Scheme
Restart Scheme: Previous Works
Restart: erase the memory of previous iterations and reset the momentum back to zero
Function scheme: we restart whenever
f (xk ) > f (xk−1)
Gradient scheme: we restart whenever
∇f (yk−1)ᵀ(xk − xk−1) > 0
Refer to: Adaptive Restart for Accelerated Gradient Schemes, Brendan O'Donoghue and Emmanuel Candès, 2012
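For reference, a sketch of Nesterov's scheme with these two restart heuristics layered on top; the restart simply resets the local iteration counter (and hence the momentum), and the test objective and parameters are illustrative.

```python
import numpy as np

def nesterov_restart(f, grad_f, x0, s, iters, rule="function"):
    x_prev, y, j = x0.copy(), x0.copy(), 1      # j: iterations since the last restart
    for _ in range(iters):
        g = grad_f(y)
        x = y - s * g
        restart = f(x) > f(x_prev) if rule == "function" else g @ (x - x_prev) > 0
        if restart:
            j, y, x_prev = 1, x.copy(), x.copy()   # erase the momentum, keep the iterate
            continue
        y = x + (j - 1.0) / (j + 2.0) * (x - x_prev)
        x_prev, j = x, j + 1
    return x_prev

f = lambda x: 0.02 * x[0]**2 + 0.005 * x[1]**2
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
print(nesterov_restart(f, grad, np.array([1.0, 1.0]), s=25.0, iters=300, rule="gradient"))
```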
Restart Scheme: Previous Works
Restart Scheme: Previous Works
New Restart Scheme: Speed Restart
This work provides a new restarting strategy, called the speed restarting scheme. The underlying motivation is to maintain a relatively high velocity Ẋ along the trajectory.
Definition 5.1
For the ODE (3) with X(0) = x0, Ẋ(0) = 0, let
T = T(f , x0) = sup{t > 0 : ∀u ∈ (0, t), d||Ẋ(u)||²/du > 0}
be the speed restarting time.
In words, T is the first time the velocity ||Ẋ|| decreases. Indeed, f (X(t)) is a decreasing function before time T: for t < T,
df (X(t))/dt = ⟨∇f (X), Ẋ⟩ = −(3/t)||Ẋ||² − (1/2) d||Ẋ||²/dt < 0
Accelerating to linear convergence by restarting
The speed restarting ODE is thus
Ẍ(t) + (3/tsr) Ẋ(t) + ∇f (X(t)) = 0 (7)
where tsr is reset to zero whenever ⟨Ẋ, Ẍ⟩ = 0. We have the following observations:
Xsr(t) is continuous for t ≥ 0, with Xsr(0) = x0
Xsr(t) satisfies (3) for 0 < t < T1 := T(x0; f )
Recursively define Ti+1 = T(Xsr(Σj≤i Tj ); f ) for i ≥ 1, and on each interval X̃(t) = Xsr(Σj≤i Tj + t)
Accelerating to linear convergence by restarting
Lemma 5.2
There is a universal constant C > 0 such that
f (X(T)) − f (x∗) ≤ (1 − Cµ/L)(f (x0) − f∗)
This guarantees that each restart reduces the error by a constant factor.
Lemma 5.3
There is a universal constant C̃ such that
T ≤ 4 exp(C̃L/µ) / (5√L)
This gives an upper bound for T; it confirms that restarts occur often enough.
Accelerating to linear convergence by restarting
Applying Lemmas 5.2 and 5.3, we have
Theorem (5.1)
There exist positive constants c1 and c2, which depend only on the condition number L/µ, such that for any f ∈ Sµ,L, we have
f (Xsr(t)) − f (x∗) ≤ (c1L||x0 − x∗||²/2) exp(−c2t√L)
The theorem guarantees linear convergence of the solution to (7). This is a new result in the literature.
Here c1 = exp(Cµ/L) and c2 = (5Cµ/(4L)) exp(−C̃µ/L)
Numerical examples: algorithm of speed restarting
Below we present a discrete analog of the restarted scheme.
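The slide's algorithm listing is not reproduced here; the following is a sketch of the discrete speed-restart rule as I understand it from the paper — restart whenever the step length ||xk − xk−1|| stops increasing — so treat the details as an assumption rather than the authors' exact pseudocode.

```python
import numpy as np

def nesterov_speed_restart(grad_f, x0, s, iters):
    """Nesterov's scheme that restarts when the 'speed' ||x_k - x_{k-1}|| decreases
    (a discrete analogue of the speed restarting time T; details assumed)."""
    x_prev, y, j = x0.copy(), x0.copy(), 1
    last_step = 0.0
    for _ in range(iters):
        x = y - s * grad_f(y)
        step = np.linalg.norm(x - x_prev)
        if j > 1 and step < last_step:           # speed dropped: restart the momentum
            j, last_step = 1, 0.0
            y, x_prev = x.copy(), x.copy()
            continue
        y = x + (j - 1.0) / (j + 2.0) * (x - x_prev)
        x_prev, last_step, j = x, step, j + 1
    return x_prev
```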
Quadratic
Log-sum-exp
Matrix completion
Lasso in l1-constrained form with large sparse design
References
W. Su, S. Boyd, and E. J. Candès (2014)
A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights
Advances in Neural Information Processing Systems (NIPS) 2014
thanks!