Continuous and Discrete-Time Analysis of SGD
1. Continuous and Discrete-Time Analysis of SGD
Joint work with A. Durmus and X. Fontaine.
We aim at minimizing $f : \mathbb{R}^d \to \mathbb{R}$ under the following assumptions:
• $f$ is convex and admits a minimizer $x^\star$,
• $f$ is differentiable and for any $x \in \mathbb{R}^d$, $\nabla f(x) = \int_Z H(x, z) \, \mathrm{d}\mu_Z(z)$,
• there exists $L > 0$ such that for any $x, y \in \mathbb{R}^d$ and $z \in Z$,
  $\|H(x, z) - H(y, z)\| \leq L \|x - y\|$, and $\int_Z \|H(x^\star, z)\|^2 \, \mathrm{d}\mu_Z(z) < +\infty$.
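As a concrete (hypothetical, not from the slides) instance satisfying these assumptions, consider least-squares regression with $z = (a, b)$ and $H(x, z) = (\langle a, x \rangle - b)\,a$; a minimal sketch, where `sample_z` and the synthetic model are illustrative:

```python
import numpy as np

d = 5                    # dimension (illustrative)
x_true = np.ones(d)      # hypothetical ground-truth parameter

def sample_z(rng):
    """Draw z = (a, b) ~ mu_Z from a synthetic linear model (illustrative)."""
    a = rng.uniform(-1.0, 1.0, size=d)            # bounded features, so ||a||^2 <= d
    b = a @ x_true + 0.1 * rng.standard_normal()
    return a, b

def H(x, z):
    """Stochastic gradient of the least-squares loss: H(x, z) = (<a, x> - b) a.
    For each z it is Lipschitz in x with constant ||a||^2 <= d, and E[H(x, Z)] = grad f(x)."""
    a, b = z
    return (a @ x - b) * a
```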
Stochastic Gradient Descent (discrete and continuous), $\alpha \in [0, 1)$:
(SGD-d) $X_{n+1} = X_n - \gamma (n+1)^{-\alpha} H(X_n, Z_{n+1})$, with $Z_n \sim \mu_Z$ i.i.d.
(SGD-c) $\mathrm{d}X_t = -(\gamma_\alpha + t)^{-\alpha} \big\{ \nabla f(X_t) \, \mathrm{d}t + \gamma_\alpha^{1/2} \, \Sigma^{1/2}(X_t) \, \mathrm{d}B_t \big\}$,
where $\gamma_\alpha = \gamma^{1/(1-\alpha)}$ and $\Sigma(x) = \int_Z (H(x, z) - \nabla f(x)) (H(x, z) - \nabla f(x))^\top \, \mathrm{d}\mu_Z(z)$.
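A minimal sketch (not from the slides) of the (SGD-d) recursion; `H` and `sample_z` stand for the stochastic gradient and a sampler of $\mu_Z$, e.g. the least-squares example above:

```python
import numpy as np

def sgd_d(x0, H, sample_z, gamma, alpha, n_steps, rng):
    """Run (SGD-d): X_{n+1} = X_n - gamma * (n+1)^(-alpha) * H(X_n, Z_{n+1})."""
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for n in range(n_steps):
        z = sample_z(rng)                               # Z_{n+1} ~ mu_Z, i.i.d.
        x = x - gamma * (n + 1) ** (-alpha) * H(x, z)   # decreasing step gamma * (n+1)^(-alpha)
        iterates.append(x.copy())
    return np.array(iterates)

# e.g. sgd_d(np.zeros(d), H, sample_z, gamma=1.0, alpha=0.5, n_steps=1000,
#            rng=np.random.default_rng(0))
```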
QU: Is SGD-c close to SGD-d? Can we obtain the minimax optimal rate $O(t^{-1/2})$ for $\alpha = 1/2$ using the techniques introduced in Su et al. (2016)?
2. Approximation results
QU: Can we show that SGD-d is close to SGD-c? Yes!
Finite horizon strong approximation
For any $T > 0$, there exists $C_T > 0$ such that
$\mathbb{E}^{1/2}\big[\sup_{t \in [0, T]} \|X_{\lfloor t / \gamma_\alpha \rfloor} - X_t\|^2\big] \leq C_T \, (\varepsilon^{1/2} \gamma^{\delta} + \gamma)(1 + \log(1/\gamma))$,
with $\delta = \min(1, 1/(2 - 2\alpha))$ and $\varepsilon = \sup_{n \gamma_\alpha \leq T} \mathbb{E}\big[W_2^2\big(\nu_n, \mathrm{N}(0, \Sigma(X_n))\big)\big]$,
where $\nu_n$ is the distribution of $H(X_n, \cdot) - \nabla f(X_n)$ conditionally on $X_n$.
Proof based on Milstein (1994) and Kloeden and Platen (2013).
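To compare the two dynamics numerically, (SGD-c) can be simulated with an Euler–Maruyama scheme; a minimal sketch, assuming hypothetical callables `grad_f` and `sigma_sqrt` returning $\nabla f(x)$ and $\Sigma^{1/2}(x)$:

```python
import numpy as np

def sgd_c_euler(x0, grad_f, sigma_sqrt, gamma, alpha, T, h, rng):
    """Euler-Maruyama discretisation of (SGD-c) on [0, T] with step h:
    dX_t = -(g_a + t)^(-alpha) * (grad_f(X_t) dt + sqrt(g_a) * Sigma^(1/2)(X_t) dB_t),
    where g_a = gamma^(1/(1 - alpha))."""
    g_a = gamma ** (1.0 / (1.0 - alpha))
    x = np.array(x0, dtype=float)
    for k in range(int(T / h)):
        t = k * h
        dB = np.sqrt(h) * rng.standard_normal(x.shape[0])   # Brownian increment on [t, t + h]
        x = x - (g_a + t) ** (-alpha) * (grad_f(x) * h + np.sqrt(g_a) * sigma_sqrt(x) @ dB)
    return x

# The approximation result above compares the (SGD-d) iterate X_{floor(t / gamma_alpha)}
# with the (SGD-c) solution X_t, uniformly over [0, T].
```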
If $H(x, \{z_i\}_{i=1}^M) = M^{-1} \sum_{i=1}^M \hat{\nabla} f(x, z_i)$, then $\varepsilon = O(M^{-2})$ using recent advances
in Stein’s method (Bonis, 2020) (effect of the batch size).
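A minimal sketch (illustrative names) of this mini-batch construction; averaging $M$ per-sample gradients makes the centred noise $H(X_n, \cdot) - \nabla f(X_n)$ closer to Gaussian, which is exactly what $\varepsilon$ measures:

```python
import numpy as np

def minibatch_H(x, zs, grad_hat):
    """Mini-batch stochastic gradient: H(x, {z_1, ..., z_M}) = (1/M) * sum_i grad_hat(x, z_i).
    Increasing the batch size M = len(zs) shrinks epsilon (here O(M^{-2}))."""
    return np.mean([grad_hat(x, z) for z in zs], axis=0)

# e.g. with the least-squares example above:
# minibatch_H(x, [sample_z(rng) for _ in range(M)], H)
```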
3. Convergence results
QU: What is the optimal convergence rate?
Previous works:
• Minimax lower bound → $O(t^{-1/2})$ (Agarwal et al. (2009))
• Bounded gradient case → $O(t^{-1/2})$ (Shamir and Zhang (2013))
• Our setting → $O(t^{-1/3})$ (Moulines and Bach (2011))
We close the gap between lower and upper bounds.
Optimal convergence rates
In our setting, for any $\alpha \in [0, 1)$ there exists $C_\alpha > 0$ such that for any $n \in \mathbb{N}$,
$\mathbb{E}[f(X_n) - f(x^\star)] \leq C_\alpha \max(n^{-\alpha}, n^{-1+\alpha})$.
The proof relies on the “averaging from the past” procedure of Shamir and Zhang
(2013) and is also valid for SGD-c.
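A minimal numerical sketch (a toy one-dimensional quadratic with additive Gaussian gradient noise, not from the slides) to observe the $\max(n^{-\alpha}, n^{-1+\alpha})$ behaviour; for $\alpha = 1/2$ both exponents equal $-1/2$, matching the minimax rate:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_steps, n_runs = 1.0, 0.5, 10_000, 100

gaps = np.zeros(n_steps)                      # Monte Carlo estimate of E[f(X_n) - f(x*)]
for _ in range(n_runs):
    x = 5.0
    for n in range(n_steps):
        grad = x + rng.standard_normal()      # H(x, z) = x + noise, so E[H(x, Z)] = grad f(x) = x
        x -= gamma * (n + 1) ** (-alpha) * grad
        gaps[n] += 0.5 * x ** 2 / n_runs      # f(x) - f(x*) for f(x) = x^2 / 2, x* = 0

# Log-log slope over the tail should be roughly -1/2 for alpha = 1/2.
ns = np.arange(1, n_steps + 1)
tail = slice(n_steps // 10, n_steps)
slope = np.polyfit(np.log(ns[tail]), np.log(gaps[tail]), 1)[0]
print(f"empirical rate exponent ~ {slope:.2f} (bound: -{min(alpha, 1 - alpha):.2f})")
```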
4. References
Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds
on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages
1–9, 2009.
Thomas Bonis. Stein’s method for normal approximation in Wasserstein distances with application to the multivariate
central limit theorem. Probability Theory and Related Fields, pages 1–34, 2020.
Peter E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations, volume 23. Springer
Science & Business Media, 2013.
Grigorii Noikhovich Milstein. Numerical integration of stochastic differential equations, volume 313. Springer Science
& Business Media, 1994.
Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine
learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and
optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
Weijie Su, Stephen Boyd, and Emmanuel J Candes. A differential equation for modeling Nesterov’s accelerated
gradient method: Theory and insights. The Journal of Machine Learning Research, 17(1):5312–5354, 2016.