A Differential Equation for Modeling Nesterov’s
Accelerated Gradient Method: Theory and Insights
Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014
speaker: kv
MCLab, CITI, Academia Sinica
May 14, 2015
Overview
1 Introduction
Smooth Unconstrained Optimization
Accelerated Scheme
Ordinary Differential Equation
2 Derivation of the ODE
Simple Properties
3 Equivalence between the ODE and Nesterov’s scheme
Analogous Convergence Rate
Quadratic f and Bessel function
Equivalence between the ODE and Nesterov’s Scheme
4 A family of generalized Nesterov’s schemes
Continuous Optimization
Composite Optimization
5 New Restart Scheme
6 Accelerating to linear convergence by restarting
Numerical examples
7 Discussion
Introduction
Smooth Unconstrained Optimization
We wish to minimize a smooth convex function
minimize f (x)
where f : Rn → R has a Lipschitz continuous gradient
||∇f (x) − ∇f (y)|| ≤ L ||x − y||
µ-strong convexity: f (x) − µ||x||²/2 is convex
In this paper, FL denotes the class of convex functions f with L-Lipschitz continuous gradients defined on Rn; Sµ denotes the class of µ-strongly convex functions f on Rn. We set Sµ,L = FL ∩ Sµ.
Introduction: Accelerated Scheme
Nesterov’s Accelerated Gradient Scheme
xk = yk−1 − s∇f (yk−1) (1)
yk = xk + ((k − 1)/(k + 2)) (xk − xk−1) (2)
For the fixed step size s = 1/L, where L is the Lipschitz constant of ∇f , this scheme exhibits the convergence rate
f (xk) − f∗ ≤ O( L||x0 − x∗||² / k² )
This improvement relies on the introduction of the momentum term xk − xk−1 and the carefully tuned coefficient (k − 1)/(k + 2) = 1 − 3/(k + 2).
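For concreteness, a minimal NumPy sketch of iteration (1)-(2); the test function, step size, and iteration count below are illustrative choices, not taken from the slides.

```python
import numpy as np

def nesterov(grad_f, x0, s, iters):
    """Nesterov's accelerated gradient scheme, equations (1)-(2), with step size s."""
    x_prev = x0.copy()
    y = x0.copy()
    for k in range(1, iters + 1):
        x = y - s * grad_f(y)                          # gradient step (1)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)   # momentum step (2)
        x_prev = x
    return x

# Illustrative run on f(x) = 0.02*x1^2 + 0.005*x2^2 (the example used later in the slides)
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
L = 0.04                                  # Lipschitz constant of the gradient
print(nesterov(grad, np.array([1.0, 1.0]), s=1.0 / L, iters=500))
```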
Accelerated Scheme: Oscillation Problem
In general, Nesterov's scheme is not monotone in the objective value, due to the momentum term.
Introduction: Second order ODE
We derive a second-order ordinary differential equation (ODE), which is the exact limit of Nesterov's scheme as the step size tends to zero:
Ẍ + (3/t) Ẋ + ∇f (X) = 0 (3)
for t > 0, with initial conditions X(0) = x0, Ẋ(0) = 0; here x0 is the starting point in Nesterov's scheme. Ẋ denotes the velocity and Ẍ the acceleration.
Small t: the large damping coefficient 3/t makes the ODE behave like an over-damped system
Large t: as t increases, the system behaves like an under-damped system, oscillating with an amplitude that decreases to zero
The time parameter in this ODE is related to the step size by t ∼ k√s
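As a rough sanity check of the correspondence t ∼ k√s, here is a sketch that integrates ODE (3) with SciPy for the quadratic example used in the following slides; starting at a small t0 > 0 to sidestep the 3/t singularity, and the solver tolerances, are my own choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

# f(x) = 0.02*x1^2 + 0.005*x2^2, the example from the next slides
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])

def rhs(t, state):
    # state = (X, V) with V = dX/dt; ODE (3) reads dV/dt = -(3/t) V - grad f(X)
    x, v = state[:2], state[2:]
    return np.concatenate([v, -(3.0 / t) * v - grad(x)])

x0 = np.array([1.0, 1.0])
t0, t_end = 1e-6, 100.0                   # start just after 0 to avoid the 3/t singularity
sol = solve_ivp(rhs, (t0, t_end), np.concatenate([x0, np.zeros(2)]),
                t_eval=np.linspace(t0, t_end, 1000), rtol=1e-8, atol=1e-10)
X = sol.y[:2].T                           # X(t); compare with iterate x_k at t = k*sqrt(s)
```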
An Example: Trajectories
Minimize f = 0.02x1² + 0.005x2²
An Example: Zoomed Trajectories
Minimize f = 0.02x1² + 0.005x2²
An Example: Errors f − f ∗
Minimize f = 0.02x1² + 0.005x2²
Derivation of the ODE
Derivation of the ODE
Assume f ∈ FL for some L > 0. Combining the two equations of (1)-(2) and applying a rescaling gives
(xk+1 − xk)/√s = ((k − 1)/(k + 2)) (xk − xk−1)/√s − √s ∇f (yk) (4)
Introduce the ansatz xk ∼ X(k√s) for a smooth curve X(t) defined for t > 0. With these approximations, we get the Taylor expansions:
(xk+1 − xk)/√s = Ẋ(t) + (1/2)Ẍ(t)√s + o(√s)
(xk − xk−1)/√s = Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)
√s ∇f (yk) = √s ∇f (X(t)) + o(√s)
where in the third equality we use yk − X(t) = o(1)
Derivation of the ODE
The formula (4) can be rewritten as
Ẋ(t) + (1/2)Ẍ(t)√s + o(√s) = (1 − 3√s/t){Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)} − √s ∇f (X(t)) + o(√s)
By comparing the coefficients of √s, we obtain
Ẍ + (3/t) Ẋ + ∇f (X) = 0
Theorem (Well posed ODE, Existence and Uniqueness)
For any f ∈ F∞ := ∪L>0 FL and any x0 ∈ Rn, the ODE (3) with initial conditions X(0) = x0, Ẋ(0) = 0 has a unique global solution X.
Simple Properties
Invariance
ODE is invariant under time change and invariant under transformation
Initial asymptotic
Assume X is sufficiently smooth, so that limt→0 Ẍ(t) exists. The Mean Value Theorem guarantees the existence of ζ ∈ (0, t) such that
Ẋ(t)/t = (Ẋ(t) − Ẋ(0))/t = Ẍ(ζ)
Hence the ODE reduces to Ẍ(t) + 3Ẍ(ζ) + ∇f (X(t)) = 0. Taking t → 0 gives Ẍ(0) = −∇f (x0)/4, so for small t
X(t) = x0 − ∇f (x0) t²/8 + o(t²)
This is consistent with the empirical observation that Nesterov's scheme moves slowly in the beginning.
Connections and Interpretations
Analogous Convergence Rate
Now, we exhibit approximate equivalence between the ODE and
Nesterov’s scheme in terms of convergence rate.
Theorem (Discrete Nesterov Scheme 3.1)
For any f ∈ FL, the sequence {xk} in (1)-(2) with step size s ≤ 1/L obeys
f (xk) − f∗ ≤ 2||x0 − x∗||² / (s(k + 1)²)
This first result, together with the following continuous analogue, indicates that the trajectory of the ODE (3) closely resembles the sequence {xk} in terms of the convergence rate.
Theorem (Continuous ODE Scheme 3.2)
For any f ∈ F∞, let X(t) be the unique global solution to (3) with initial conditions X(0) = x0, Ẋ(0) = 0. For any t > 0,
f (X(t)) − f∗ ≤ 2||x0 − x∗||² / t²
Proof of Theorem 3.2
Consider the energy functional defined as
ε(t) := t²(f (X(t)) − f∗) + 2||X + (t/2)Ẋ − x∗||²
whose time derivative is
ε̇(t) = 2t(f (X) − f∗) + t²⟨∇f , Ẋ⟩ + 4⟨X + (t/2)Ẋ − x∗, (3/2)Ẋ + (t/2)Ẍ⟩
Substituting (3/2)Ẋ + (t/2)Ẍ = −t∇f (X)/2, from the ODE (3), gives
ε̇(t) = 2t(f (X) − f∗) + 4⟨X − x∗, −t∇f (X)/2⟩ (the term t²⟨∇f , Ẋ⟩ cancels against 4⟨(t/2)Ẋ, −t∇f (X)/2⟩)
= 2t(f (X) − f∗) − 2t⟨X − x∗, ∇f (X)⟩ ≤ 0
where the last inequality follows from the convexity of f : ⟨X − x∗, ∇f (X)⟩ ≥ f (X) − f∗.
By the monotonicity of ε and the non-negativity of 2||X + (t/2)Ẋ − x∗||², the gap obeys
f (X(t)) − f∗ ≤ ε(t)/t² ≤ ε(0)/t² = 2||x0 − x∗||²/t²
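The Lyapunov argument can also be checked numerically. Below is a sketch, again on the illustrative quadratic, that integrates ODE (3) and verifies that ε(t) is (up to integration error) non-increasing; the grid, tolerances, and test function are assumptions of this sketch.

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: 0.02 * x[0]**2 + 0.005 * x[1]**2
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
x_star, f_star = np.zeros(2), 0.0

def rhs(t, state):
    x, v = state[:2], state[2:]
    return np.concatenate([v, -(3.0 / t) * v - grad(x)])

x0 = np.array([1.0, 1.0])
t = np.linspace(1e-6, 50.0, 2000)
sol = solve_ivp(rhs, (t[0], t[-1]), np.concatenate([x0, np.zeros(2)]),
                t_eval=t, rtol=1e-9, atol=1e-12)
X, V = sol.y[:2].T, sol.y[2:].T

# energy functional eps(t) = t^2 (f(X) - f*) + 2 || X + (t/2) V - x* ||^2
eps = t**2 * (np.array([f(x) for x in X]) - f_star) \
      + 2.0 * np.sum((X + 0.5 * t[:, None] * V - x_star)**2, axis=1)
print(np.all(np.diff(eps) <= 1e-6))       # eps is numerically non-increasing
```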
Quadratic f and Bessel function
For quadratic f , we have
f (x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩
where A ∈ Rn×n. A simple translation can absorb the linear term ⟨b, x⟩. We assume A is positive semi-definite, admitting the spectral decomposition A = QᵀΛQ; replacing x with Qx, we may assume f = (1/2)⟨x, Λx⟩. The ODE then decouples coordinatewise into
Ẍi + (3/t) Ẋi + λi Xi = 0
for i = 1, . . . , n
Quadratic f and Bessel function
Introduce Yi (u) = uXi (u/√λi ), which satisfies
u²Ÿi + uẎi + (u² − 1)Yi = 0
This is Bessel's differential equation of order 1. Solving it and matching the initial conditions, we obtain
Xi (t) = (2x0,i /(t√λi )) J1(t√λi )
For large t, the Bessel function has the asymptotic form
J1(t) = √(2/(πt)) (cos(t − 3π/4) + O(1/t))
Quadratic f and Bessel function: Example
Minimize f = 0.02x1² + 0.005x2², whose eigenvalues are λ1,2 = 0.02, 0.005
f (X) − f∗ = f (X) = Σi=1,...,n (2x0,i² / t²) J1(t√λi )²
Denote the two major periods by T1, T2. We get T1 = π/√λ1 = 22.214 and T2 = π/√λ2 = 44.423
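The closed-form error can be evaluated directly with scipy.special.j1; a sketch using the eigenvalues as quoted on the slide, with an assumed starting point and an arbitrary time grid.

```python
import numpy as np
from scipy.special import j1

lam = np.array([0.02, 0.005])             # eigenvalues as quoted on the slide
x0 = np.array([1.0, 1.0])                 # assumed starting point
t = np.linspace(1.0, 300.0, 3000)

# f(X(t)) - f* = sum_i 2 x_{0,i}^2 / t^2 * J1(t sqrt(lambda_i))^2
err = np.sum(2.0 * x0**2 * j1(np.outer(t, np.sqrt(lam)))**2, axis=1) / t**2

T1, T2 = np.pi / np.sqrt(lam)             # the two dominant oscillation periods
print(T1, T2)                             # approx. 22.2 and 44.4
```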
Equivalence between the ODE and Nesterov’s scheme
We study the stable step size for numerically solving the ODE. The finite difference approximation of (3) by the forward Euler method is
(X(t + ∆t) − 2X(t) + X(t − ∆t))/∆t² + (3/t)(X(t) − X(t − ∆t))/∆t + ∇f (X(t)) = 0
which is equivalent to
X(t + ∆t) = (2 − 3∆t/t)X(t) − ∆t²∇f (X(t)) − (1 − 3∆t/t)X(t − ∆t)
Assuming that f is sufficiently smooth, for a small perturbation the characteristic equation of this finite difference scheme is approximately (identifying k = t/∆t)
det{λ² − (2 − ∆t²∇²f − 3∆t/t)λ + 1 − 3∆t/t} = 0 (5)
For numerical stability, all the roots of (5) should lie in the unit circle, i.e. have modulus at most one.
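One can probe the stability region numerically by computing the roots of the scalar version of (5) for a single Hessian eigenvalue; the eigenvalue, the value of t, and the grid of ∆t below are illustrative choices.

```python
import numpy as np

def max_root_modulus(dt, t, hess_eig):
    """Largest root modulus of the scalar characteristic polynomial in (5):
    lambda^2 - (2 - dt^2 h - 3 dt/t) lambda + (1 - 3 dt/t) = 0."""
    b = -(2.0 - dt**2 * hess_eig - 3.0 * dt / t)
    c = 1.0 - 3.0 * dt / t
    return np.max(np.abs(np.roots([1.0, b, c])))

hess_eig, t = 0.04, 200.0                 # illustrative: largest Hessian eigenvalue of the quadratic above
for dt in [1.0, 5.0, 9.0, 10.0, 11.0]:
    print(dt, max_root_modulus(dt, t, hess_eig))   # stable while the modulus stays <= 1
```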
A family of generalized Nesterov’s schemes
To exploit the power of the ODE, we study the ODE (3) with the constant 3 in the coefficient of Ẋ/t replaced by a general constant r:
Ẍ + (r/t) Ẋ + ∇f (X) = 0, X(0) = x0, Ẋ(0) = 0 (6)
Using an argument similar to Theorem 2.1, this ODE is guaranteed to admit a unique global solution for any f ∈ F∞.
Generalized Nesterov’s Scheme: Continuous Optimization
Theorem (4.1)
Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F∞. Then X(t) obeys
f (X(t)) − f∗ ≤ (r − 1)²||x0 − x∗||² / (2t²)
and
∫0^∞ t(f (X(t)) − f∗) dt ≤ (r − 1)²||x0 − x∗||² / (2(r − 3))
Theorem (4.2)
For any f ∈ Sµ,L(Rn), the unique solution X to (6) with r ≥ 9/2 obeys
f (X(t)) − f∗ ≤ C r^(5/2) ||x0 − x∗||² / (t³√µ)
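A sketch that integrates the generalized ODE (6) for a few r > 3 and checks the first bound of Theorem 4.1 numerically; the quadratic objective, the values of r, and the time horizon are my own choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: 0.02 * x[0]**2 + 0.005 * x[1]**2
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
x0, x_star, f_star = np.array([1.0, 1.0]), np.zeros(2), 0.0

def gap_along_ode(r, t_end=200.0):
    """f(X(t)) - f* along the solution of (6) with damping coefficient r."""
    rhs = lambda t, s: np.concatenate([s[2:], -(r / t) * s[2:] - grad(s[:2])])
    t = np.linspace(1e-6, t_end, 2000)
    sol = solve_ivp(rhs, (t[0], t_end), np.concatenate([x0, np.zeros(2)]),
                    t_eval=t, rtol=1e-9, atol=1e-12)
    return t, np.array([f(x) for x in sol.y[:2].T]) - f_star

for r in [4.0, 4.5, 7.0]:
    t, gap = gap_along_ode(r)
    bound = (r - 1.0)**2 * np.sum((x0 - x_star)**2) / (2.0 * t**2)
    print(r, bool(np.all(gap <= bound + 1e-10)))   # the O(1/t^2) bound of Theorem 4.1 holds
```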
Generalized Nesterov’s Scheme: Continuous Optimization
For example, the solution to (6) with f (x) = ||x||²/2 is
X(t) = 2^((r−1)/2) Γ((r + 1)/2) J(r−1)/2(t) / t^((r−1)/2)
where J(r−1)/2(·) is the Bessel function of the first kind of order (r − 1)/2. For large t, this Bessel function obeys
J(r−1)/2(t) = √(2/(πt)) (cos(t − rπ/4) + O(1/t))
Hence
f (X(t)) − f∗ ≤ ||x0 − x∗||²/t^r
Generalized Nesterov’s Scheme: Composite Optimization
Skip this part
minx∈Rn f (x) = g(x) + h(x)
where g ∈ FL for some L > 0 and h is convex on Rn, possibly taking the extended value ∞. Define the proximal subgradient
Gs (x) := ( x − argminz {||z − (x − s∇g(x))||²/(2s) + h(z)} ) / s
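A sketch of Gs for the common special case h(x) = λ||x||1, where the argmin above is the soft-thresholding operator; the choice of h, of g, and all parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: elementwise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def G_s(x, grad_g, s, lam):
    """Proximal subgradient G_s(x) for f = g + lam*||.||_1."""
    z = soft_threshold(x - s * grad_g(x), s * lam)    # the argmin in the definition above
    return (x - z) / s

# Illustrative use with g(x) = ||x||^2 / 2
grad_g = lambda x: x
print(G_s(np.array([1.0, -0.05, 0.3]), grad_g, s=0.5, lam=0.1))
```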
New Restart Scheme
Restart Scheme: Previous Works
Restart: erase the memory of previous iterations and reset the momentum back to zero
Function scheme: we restart whenever
f (xk ) > f (xk−1)
Gradient scheme: we restart whenever
∇f (yk−1)ᵀ(xk − xk−1) > 0
Refer to: Adaptive Restart for Accelerated Gradient Schemes, Brendan O'Donoghue and Emmanuel Candès, 2012
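For reference, a sketch of Nesterov's scheme with these two restart heuristics layered on top; the restart simply resets the local iteration counter (and hence the momentum), and the test objective and parameters are illustrative.

```python
import numpy as np

def nesterov_restart(f, grad_f, x0, s, iters, rule="function"):
    x_prev, y, j = x0.copy(), x0.copy(), 1      # j: iterations since the last restart
    for _ in range(iters):
        g = grad_f(y)
        x = y - s * g
        restart = f(x) > f(x_prev) if rule == "function" else g @ (x - x_prev) > 0
        if restart:
            j, y, x_prev = 1, x.copy(), x.copy()   # erase the momentum, keep the iterate
            continue
        y = x + (j - 1.0) / (j + 2.0) * (x - x_prev)
        x_prev, j = x, j + 1
    return x_prev

f = lambda x: 0.02 * x[0]**2 + 0.005 * x[1]**2
grad = lambda x: np.array([0.04 * x[0], 0.01 * x[1]])
print(nesterov_restart(f, grad, np.array([1.0, 1.0]), s=25.0, iters=300, rule="gradient"))
```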
Restart Scheme: Previous Works
Restart Scheme: Previous Works
New Restart Scheme: Speed Restart
This work provides a new restarting strategy, called the speed restarting scheme. The underlying motivation is to maintain a relatively high velocity Ẋ along the trajectory.
Definition 5.1
For the ODE (3) with X(0) = x0, Ẋ(0) = 0, let
T = T(f , x0) = sup{t > 0 : ∀u ∈ (0, t), d||Ẋ(u)||²/du > 0}
be the speed restarting time.
In words, T is the first time the velocity ||Ẋ|| decreases. Indeed, f (X(t)) is a decreasing function before time T: for t < T,
df (X(t))/dt = ⟨∇f (X), Ẋ⟩ = −(3/t)||Ẋ||² − (1/2) d||Ẋ||²/dt < 0
Accelerating to linear convergence by restarting
The speed restarting ODE is thus
Ẍ(t) + (3/tsr) Ẋ(t) + ∇f (X(t)) = 0 (7)
where tsr is reset to zero whenever ⟨Ẋ, Ẍ⟩ = 0. We have the following observations:
Xsr(t) is continuous for t ≥ 0, with Xsr(0) = x0
Xsr(t) satisfies (3) for 0 < t < T1 := T(x0; f )
Recursively define Ti+1 = T(Xsr(Σj≤i Tj ); f ) for i ≥ 1, and on each interval X̃(t) = Xsr(Σj≤i Tj + t)
Accelerating to linear convergence by restarting
Lemma 5.2
There is a universal constant C > 0 such that
f (X(T)) − f (x∗) ≤ (1 − Cµ/L)(f (x0) − f∗)
This guarantees that each restart reduces the error by a constant factor.
Lemma 5.3
There is a universal constant C̃ such that
T ≤ 4 exp(C̃L/µ) / (5√L)
This gives an upper bound for T; it confirms that restarts occur often enough.
Accelerating to linear convergence by restarting
Applying Lemmas 5.2 and 5.3, we have
Theorem (5.1)
There exist positive constants c1 and c2, which depend only on the condition number L/µ, such that for any f ∈ Sµ,L, we have
f (Xsr(t)) − f (x∗) ≤ (c1L||x0 − x∗||²/2) exp(−c2t√L)
The theorem guarantees linear convergence of the solution to (7). This is a new result in the literature.
Here c1 = exp(Cµ/L) and c2 = (5Cµ/(4L)) exp(−C̃µ/L)
Numerical examples: algorithm of speed restarting
Below we present a discrete analog of the restarted scheme.
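The slide's algorithm listing is not reproduced here; the following is a sketch of the discrete speed-restart rule as I understand it from the paper — restart whenever the step length ||xk − xk−1|| stops increasing — so treat the details as an assumption rather than the authors' exact pseudocode.

```python
import numpy as np

def nesterov_speed_restart(grad_f, x0, s, iters):
    """Nesterov's scheme that restarts when the 'speed' ||x_k - x_{k-1}|| decreases
    (a discrete analogue of the speed restarting time T; details assumed)."""
    x_prev, y, j = x0.copy(), x0.copy(), 1
    last_step = 0.0
    for _ in range(iters):
        x = y - s * grad_f(y)
        step = np.linalg.norm(x - x_prev)
        if j > 1 and step < last_step:           # speed dropped: restart the momentum
            j, last_step = 1, 0.0
            y, x_prev = x.copy(), x.copy()
            continue
        y = x + (j - 1.0) / (j + 2.0) * (x - x_prev)
        x_prev, last_step, j = x, step, j + 1
    return x_prev
```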
Quadratic
Log-sum-exp
Matrix completion
Lasso in l1-constrained form with large sparse design
References
W. Su, S. Boyd, and E. J. Candès (2014)
A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights
Advances in Neural Information Processing Systems (NIPS) 2014
thanks!