Continuous and Discrete-Time Analysis of SGD
1. Continuous and Discrete-Time Analysis of SGD
Joint work with A. Durmus and X. Fontaine.
We aim at minimizing $f : \mathbb{R}^d \to \mathbb{R}$ under the following assumptions:
• $f$ is convex and admits a minimizer $x^\star$,
• $f$ is differentiable and for any $x \in \mathbb{R}^d$, $\nabla f(x) = \int_Z H(x, z) \, \mathrm{d}\mu_Z(z)$,
• there exists $L > 0$ such that for any $x, y \in \mathbb{R}^d$ and $z \in Z$,
  $\|H(x, z) - H(y, z)\| \leq L \|x - y\|$, and $\int_Z \|H(x^\star, z)\|^2 \, \mathrm{d}\mu_Z(z) < +\infty$.
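As a concrete (hypothetical, not from the slides) instance satisfying these assumptions, consider least-squares regression with $z = (a, b)$ and $H(x, z) = (\langle a, x \rangle - b)\,a$; a minimal sketch, where `sample_z` and the synthetic model are illustrative:

```python
import numpy as np

d = 5                    # dimension (illustrative)
x_true = np.ones(d)      # hypothetical ground-truth parameter

def sample_z(rng):
    """Draw z = (a, b) ~ mu_Z from a synthetic linear model (illustrative)."""
    a = rng.uniform(-1.0, 1.0, size=d)            # bounded features, so ||a||^2 <= d
    b = a @ x_true + 0.1 * rng.standard_normal()
    return a, b

def H(x, z):
    """Stochastic gradient of the least-squares loss: H(x, z) = (<a, x> - b) a.
    For each z it is Lipschitz in x with constant ||a||^2 <= d, and E[H(x, Z)] = grad f(x)."""
    a, b = z
    return (a @ x - b) * a
```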
Stochastic Gradient Descent (discrete and continuous), $\alpha \in [0, 1)$:
(SGD-d) $X_{n+1} = X_n - \gamma (n+1)^{-\alpha} H(X_n, Z_{n+1})$, with $Z_n \sim \mu_Z$ i.i.d.
(SGD-c) $\mathrm{d}X_t = -(\gamma_\alpha + t)^{-\alpha} \big\{ \nabla f(X_t) \, \mathrm{d}t + \gamma_\alpha^{1/2} \, \Sigma^{1/2}(X_t) \, \mathrm{d}B_t \big\}$,
where $\gamma_\alpha = \gamma^{1/(1-\alpha)}$ and $\Sigma(x) = \int_Z (H(x, z) - \nabla f(x)) (H(x, z) - \nabla f(x))^\top \, \mathrm{d}\mu_Z(z)$.
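A minimal sketch (not from the slides) of the (SGD-d) recursion; `H` and `sample_z` stand for the stochastic gradient and a sampler of $\mu_Z$, e.g. the least-squares example above:

```python
import numpy as np

def sgd_d(x0, H, sample_z, gamma, alpha, n_steps, rng):
    """Run (SGD-d): X_{n+1} = X_n - gamma * (n+1)^(-alpha) * H(X_n, Z_{n+1})."""
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for n in range(n_steps):
        z = sample_z(rng)                               # Z_{n+1} ~ mu_Z, i.i.d.
        x = x - gamma * (n + 1) ** (-alpha) * H(x, z)   # decreasing step gamma * (n+1)^(-alpha)
        iterates.append(x.copy())
    return np.array(iterates)

# e.g. sgd_d(np.zeros(d), H, sample_z, gamma=1.0, alpha=0.5, n_steps=1000,
#            rng=np.random.default_rng(0))
```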
QU: Is SGD-c close to SGD-d? Can we obtain the minimax optimal rate $O(t^{-1/2})$ for $\alpha = 1/2$ using the techniques introduced in Su et al. (2016)?
2. Approximation results
QU: Can we show that SGD-d is close to SGD-c? Yes!
Finite horizon strong approximation
For any $T > 0$, there exists $C_T > 0$ such that
$\mathbb{E}^{1/2}\big[\sup_{t \in [0, T]} \|X_{\lfloor t / \gamma_\alpha \rfloor} - X_t\|^2\big] \leq C_T \, (\varepsilon^{1/2} \gamma^{\delta} + \gamma)(1 + \log(1/\gamma))$,
with $\delta = \min(1, 1/(2 - 2\alpha))$ and $\varepsilon = \sup_{n \gamma_\alpha \leq T} \mathbb{E}\big[W_2^2\big(\nu_n, \mathrm{N}(0, \Sigma(X_n))\big)\big]$,
where $\nu_n$ is the distribution of $H(X_n, \cdot) - \nabla f(X_n)$ conditionally on $X_n$.
Proof based on Milstein (1994) and Kloeden and Platen (2013).
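To compare the two dynamics numerically, (SGD-c) can be simulated with an Euler–Maruyama scheme; a minimal sketch, assuming hypothetical callables `grad_f` and `sigma_sqrt` returning $\nabla f(x)$ and $\Sigma^{1/2}(x)$:

```python
import numpy as np

def sgd_c_euler(x0, grad_f, sigma_sqrt, gamma, alpha, T, h, rng):
    """Euler-Maruyama discretisation of (SGD-c) on [0, T] with step h:
    dX_t = -(g_a + t)^(-alpha) * (grad_f(X_t) dt + sqrt(g_a) * Sigma^(1/2)(X_t) dB_t),
    where g_a = gamma^(1/(1 - alpha))."""
    g_a = gamma ** (1.0 / (1.0 - alpha))
    x = np.array(x0, dtype=float)
    for k in range(int(T / h)):
        t = k * h
        dB = np.sqrt(h) * rng.standard_normal(x.shape[0])   # Brownian increment on [t, t + h]
        x = x - (g_a + t) ** (-alpha) * (grad_f(x) * h + np.sqrt(g_a) * sigma_sqrt(x) @ dB)
    return x

# The approximation result above compares the (SGD-d) iterate X_{floor(t / gamma_alpha)}
# with the (SGD-c) solution X_t, uniformly over [0, T].
```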
If $H(x, \{z_i\}_{i=1}^M) = M^{-1} \sum_{i=1}^M \hat{\nabla} f(x, z_i)$, then $\varepsilon = O(M^{-2})$ using recent advances
in Stein’s method (Bonis, 2020) (effect of the batch size).
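A minimal sketch (illustrative names) of this mini-batch construction; averaging $M$ per-sample gradients makes the centred noise $H(X_n, \cdot) - \nabla f(X_n)$ closer to Gaussian, which is exactly what $\varepsilon$ measures:

```python
import numpy as np

def minibatch_H(x, zs, grad_hat):
    """Mini-batch stochastic gradient: H(x, {z_1, ..., z_M}) = (1/M) * sum_i grad_hat(x, z_i).
    Increasing the batch size M = len(zs) shrinks epsilon (here O(M^{-2}))."""
    return np.mean([grad_hat(x, z) for z in zs], axis=0)

# e.g. with the least-squares example above:
# minibatch_H(x, [sample_z(rng) for _ in range(M)], H)
```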
3. Convergence results
QU: What is the optimal convergence rate?
Previous works:
• Minimax lower bound → $O(t^{-1/2})$ (Agarwal et al. (2009))
• Bounded gradient case → $O(t^{-1/2})$ (Shamir and Zhang (2013))
• Our setting → $O(t^{-1/3})$ (Moulines and Bach (2011))
We close the gap between lower and upper bounds.
Optimal convergence rates
In our setting, for any $\alpha \in [0, 1)$ there exists $C_\alpha > 0$ such that for any $n \in \mathbb{N}$,
$\mathbb{E}[f(X_n) - f(x^\star)] \leq C_\alpha \max(n^{-\alpha}, n^{-1+\alpha})$.
The proof relies on the “averaging from the past” procedure of Shamir and Zhang
(2013) and is also valid for SGD-c.
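A minimal numerical sketch (a toy one-dimensional quadratic with additive Gaussian gradient noise, not from the slides) to observe the $\max(n^{-\alpha}, n^{-1+\alpha})$ behaviour; for $\alpha = 1/2$ both exponents equal $-1/2$, matching the minimax rate:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_steps, n_runs = 1.0, 0.5, 10_000, 100

gaps = np.zeros(n_steps)                      # Monte Carlo estimate of E[f(X_n) - f(x*)]
for _ in range(n_runs):
    x = 5.0
    for n in range(n_steps):
        grad = x + rng.standard_normal()      # H(x, z) = x + noise, so E[H(x, Z)] = grad f(x) = x
        x -= gamma * (n + 1) ** (-alpha) * grad
        gaps[n] += 0.5 * x ** 2 / n_runs      # f(x) - f(x*) for f(x) = x^2 / 2, x* = 0

# Log-log slope over the tail should be roughly -1/2 for alpha = 1/2.
ns = np.arange(1, n_steps + 1)
tail = slice(n_steps // 10, n_steps)
slope = np.polyfit(np.log(ns[tail]), np.log(gaps[tail]), 1)[0]
print(f"empirical rate exponent ~ {slope:.2f} (bound: -{min(alpha, 1 - alpha):.2f})")
```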
4. References
Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds
on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages
1–9, 2009.
Thomas Bonis. Stein’s method for normal approximation in Wasserstein distances with application to the multivariate
central limit theorem. Probability Theory and Related Fields, pages 1–34, 2020.
Peter E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations, volume 23. Springer
Science & Business Media, 2013.
Grigorii Noikhovich Milstein. Numerical integration of stochastic differential equations, volume 313. Springer Science
& Business Media, 1994.
Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine
learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and
optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
Weijie Su, Stephen Boyd, and Emmanuel J Candes. A differential equation for modeling Nesterov’s accelerated
gradient method: Theory and insights. The Journal of Machine Learning Research, 17(1):5312–5354, 2016.