Continuous and Discrete-Time Analysis of SGD
Joint work with A. Durmus and X. Fontaine.
We aim at minimizing $f : \mathbb{R}^d \to \mathbb{R}$ under the following assumptions:
• $f$ is convex and admits a minimizer $x^\star$.
• $f$ is differentiable and for any $x \in \mathbb{R}^d$, $\nabla f(x) = \int_Z H(x, z) \, \mathrm{d}\mu_Z(z)$.
• there exists $L > 0$ such that for any $x, y \in \mathbb{R}^d$ and $z \in Z$, $\|H(x, z) - H(y, z)\| \leq L \|x - y\|$, and $\int_Z \|H(x^\star, z)\|^2 \, \mathrm{d}\mu_Z(z) < +\infty$.
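Stochastic Gradient Descent (discrete and continuous), with $\alpha \in [0, 1)$:
(SGD-d) $X_{n+1} = X_n - \gamma (n+1)^{-\alpha} H(X_n, Z_{n+1})$, with $(Z_n)$ i.i.d. and $Z_n \sim \mu_Z$.
(SGD-c) $\mathrm{d}X_t = -(\gamma_\alpha + t)^{-\alpha} \left\{ \nabla f(X_t) \, \mathrm{d}t + \gamma_\alpha^{1/2} \Sigma^{1/2}(X_t) \, \mathrm{d}B_t \right\}$,
where $\gamma_\alpha = \gamma^{1/(1-\alpha)}$ and $\Sigma(x) = \int_Z (H(x, z) - \nabla f(x))(H(x, z) - \nabla f(x))^\top \, \mathrm{d}\mu_Z(z)$.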
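The two recursions can be compared numerically. The sketch below is only an illustration under stated assumptions: the toy objective $f(x) = \mathbb{E}[(x - Z)^2]/2$ with $Z \sim N(0, 1)$ (so $H(x, z) = x - z$, $\nabla f(x) = x$, $\Sigma(x) = 1$) is not taken from the slides, and the Euler-Maruyama scheme with time step $h = \gamma_\alpha$ is one possible discretisation of SGD-c chosen here. The function names are hypothetical.

```python
import numpy as np

def sgd_discrete(x0, gamma, alpha, noise):
    """SGD-d on the assumed toy objective f(x) = E[(x - Z)^2] / 2,
    for which H(x, z) = x - z."""
    x = np.empty(len(noise) + 1)
    x[0] = x0
    for n, z in enumerate(noise):
        x[n + 1] = x[n] - gamma * (n + 1) ** (-alpha) * (x[n] - z)
    return x

def sgd_continuous_euler(x0, gamma, alpha, xi):
    """Euler-Maruyama discretisation of SGD-c with step h = gamma_alpha.
    For the toy objective, grad f(x) = x and Sigma(x) = 1."""
    gamma_alpha = gamma ** (1.0 / (1.0 - alpha))
    h = gamma_alpha
    x = np.empty(len(xi) + 1)
    x[0] = x0
    for n, g in enumerate(xi):
        t = n * h
        scale = (gamma_alpha + t) ** (-alpha)
        drift = x[n] * h                                 # grad f(X_t) dt
        diffusion = np.sqrt(gamma_alpha) * np.sqrt(h) * g  # gamma_alpha^{1/2} dB_t
        x[n + 1] = x[n] - scale * (drift + diffusion)
    return x

rng = np.random.default_rng(0)
z = rng.standard_normal(200)
x_d = sgd_discrete(1.0, 0.1, 0.5, z)
# In this toy model H - grad f = -z is Gaussian, so feeding -z as the
# Brownian increments makes the two recursions coincide step by step.
x_c = sgd_continuous_euler(1.0, 0.1, 0.5, -z)
```

With $h = \gamma_\alpha$ and $t = n\gamma_\alpha$, the factor $(\gamma_\alpha + t)^{-\alpha} h$ equals $\gamma (n+1)^{-\alpha}$, which is why the time scale $\gamma_\alpha = \gamma^{1/(1-\alpha)}$ lines the two processes up.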
QU: Is SGD-c close to SGD-d? Can we obtain the minimax optimal rate $O(t^{-1/2})$ for $\alpha = 1/2$ using the techniques introduced in Su et al. (2016)?
Approximation results
QU: can we show that SGD-d is close to SGD-c? Yes!
Finite horizon strong approximation
For any $T > 0$, there exists $C_T > 0$ such that for any $t \in [0, T]$,
$\mathbb{E}^{1/2}\left[ \sup_{t \in [0,T]} \| X_{\lfloor t\gamma_\alpha \rfloor} - X_t \|^2 \right] \leq C_T (\varepsilon^{1/2} \gamma^{\delta} + \gamma)(1 + \log(1/\gamma))$,
with $\delta = \min(1, 1/(2 - 2\alpha))$ and
$\varepsilon = \sup_{n\gamma_\alpha \leq T} \mathbb{E}\left[ W_2^2(\nu_n, N(0, \Sigma(X_n))) \right]$,
with $\nu_n$ the distribution of $H(X_n, \cdot) - \nabla f(X_n)$ conditionally on $X_n$.
Proof based on Milstein (1994) and Kloeden and Platen (2013).
If $H(x, \{z_i\}_{i=1}^M) = M^{-1} \sum_{i=1}^M \nabla \hat{f}(x, z_i)$, then $\varepsilon = O(M^{-2})$, using recent advances in Stein's method from Bonis (2020). This quantifies the effect of the batch size $M$.
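To get a feel for how the bound depends on $\alpha$ and $\gamma$, the following sketch evaluates the exponent $\delta$ and the $\gamma$-dependent factor $(\varepsilon^{1/2}\gamma^{\delta} + \gamma)(1 + \log(1/\gamma))$. The constant $C_T$ is not given in the slides and is left out, so the numbers only indicate scaling; the function names are hypothetical.

```python
import math

def delta(alpha):
    """Exponent delta = min(1, 1 / (2 - 2 * alpha)) for alpha in [0, 1)."""
    return min(1.0, 1.0 / (2.0 - 2.0 * alpha))

def bound_factor(gamma, eps, alpha):
    """gamma-dependent part of the strong approximation bound,
    without the unknown constant C_T. Requires 0 < gamma < 1."""
    return (math.sqrt(eps) * gamma ** delta(alpha) + gamma) * (1.0 + math.log(1.0 / gamma))

for a in (0.0, 0.25, 0.5, 0.75):
    print(a, delta(a), bound_factor(0.01, 1.0, a))
```

For $\alpha \geq 1/2$ the exponent saturates at $\delta = 1$, so the whole factor is of order $\gamma \log(1/\gamma)$; for $\alpha = 0$ it drops to $\delta = 1/2$ unless $\varepsilon$ is small, for instance through a larger batch size.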
Convergence results
QU: what is the optimal convergence rate?
Previous works:
• Minimax lower bound → $O(t^{-1/2})$ (Agarwal et al. (2009))
• Bounded gradient case → $O(t^{-1/2})$ (Shamir and Zhang (2013))
• Our setting → $O(t^{-1/3})$ (Moulines and Bach (2011))
We close the gap between lower and upper bounds.
Optimal convergence rates
In our setting, for any $\alpha \in [0, 1)$ there exists $C_\alpha \geq 0$ such that for any $n \in \mathbb{N}$,
$\mathbb{E}[f(X_n) - f(x^\star)] \leq C_\alpha \max(n^{-\alpha}, n^{-1+\alpha})$.
The proof relies on the "averaging from the past" procedure of Shamir and Zhang (2013) and is also valid for SGD-c.
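The bound $\max(n^{-\alpha}, n^{-1+\alpha})$ equals $n^{-\min(\alpha, 1-\alpha)}$ for $n \geq 1$, so the decay exponent is largest at $\alpha = 1/2$, which gives the $n^{-1/2}$ rate matching the minimax lower bound. A short sketch of this arithmetic (helper names are hypothetical, and the constant $C_\alpha$ is omitted):

```python
def rate_exponent(alpha):
    """Decay exponent of max(n^-alpha, n^-(1-alpha)) for n >= 1."""
    return min(alpha, 1.0 - alpha)

def rate_bound(n, alpha):
    """The n-dependent part of the bound, without the constant C_alpha."""
    return max(n ** (-alpha), n ** (-1.0 + alpha))

# Scan a grid of step-size exponents in [0, 1) and keep the best one.
alphas = [k / 10 for k in range(10)]
best_alpha = max(alphas, key=rate_exponent)
print(best_alpha, rate_exponent(best_alpha))
```

Any choice of $\alpha$ away from $1/2$ is penalised symmetrically: $\alpha = 0.4$ and $\alpha = 0.6$ both give a decay exponent close to $0.4$.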
References
Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
Thomas Bonis. Stein's method for normal approximation in Wasserstein distances with application to the multivariate central limit theorem. Probability Theory and Related Fields, pages 1–34, 2020.
Peter E Kloeden and Eckhard Platen. Numerical solution of stochastic differential equations, volume 23. Springer Science & Business Media, 2013.
Grigorii Noikhovich Milstein. Numerical integration of stochastic differential equations, volume 313. Springer Science & Business Media, 1994.
Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
Weijie Su, Stephen Boyd, and Emmanuel J Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. The Journal of Machine Learning Research, 17(1):5312–5354, 2016.