MTH702 Optimization
Stochastic optimization
SGD+momentum
Adagrad
AdaDelta
Adam
SGD with Momentum
Sutskever, I., Martens, J., Dahl, G., Hinton, G.
On the importance of initialization and momentum in deep learning.
In International Conference on Machine Learning, 2013.
The Algorithm
• Motivated by Nesterov's accelerated method, researchers suggested adding "momentum" to SGD, even for non-convex problems.
PyTorch Implementation
v_{t+1} = \mu v_t + g_{t+1}    (1)
p_{t+1} = p_t - \eta v_{t+1}    (2)
https://pytorch.org/docs/stable/optim.html#torch.optim.SGD
• v is the velocity
• µ is the momentum coefficient
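A minimal plain-Python sketch of updates (1)-(2), mirroring the PyTorch formulation linked above; the function name and the toy quadratic objective are illustrative, not part of the slides.

import numpy as np

def sgd_momentum_step(p, v, grad, lr=0.01, mu=0.9):
    v = mu * v + grad              # v_{t+1} = mu * v_t + g_{t+1}
    p = p - lr * v                 # p_{t+1} = p_t - eta * v_{t+1}
    return p, v

p, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    p, v = sgd_momentum_step(p, v, grad=p)   # gradient of f(p) = 0.5*||p||^2 is p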
Another motivation: smoothing noisy stochastic gradients
Google Colab Demonstration
An alternative formulation
v_{t+1} = \mu v_t + (1-\mu) g_{t+1}    (3)
p_{t+1} = p_t - \eta v_{t+1}    (4)
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
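A minimal sketch of the alternative update (3)-(4); compared with (1)-(2), the gradient is weighted by (1-µ), so v becomes an exponential moving average of the gradients. The function name is illustrative.

import numpy as np

def ema_momentum_step(p, v, grad, lr=0.1, mu=0.9):
    v = mu * v + (1 - mu) * grad   # v_{t+1} = mu*v_t + (1-mu)*g_{t+1}: EMA of gradients
    p = p - lr * v                 # p_{t+1} = p_t - eta*v_{t+1}
    return p, v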
What is v_t?

v_t = \mu v_{t-1} + (1-\mu) g_t
    = \mu(\mu v_{t-2} + (1-\mu) g_{t-1}) + (1-\mu) g_t = \mu^2 v_{t-2} + \mu(1-\mu) g_{t-1} + (1-\mu) g_t
    = \mu^3 v_{t-3} + \mu^2(1-\mu) g_{t-2} + \mu(1-\mu) g_{t-1} + (1-\mu) g_t
    = \dots
    = \mu^{t+1} v_0 + \sum_{l=0}^{t} \mu^{t-l}(1-\mu) g_l

If v_0 = 0, then v_t = (1-\mu) \sum_{l=0}^{t} \mu^{t-l} g_l.

Would v_t be "unbiased"? How much is

(1-\mu) \sum_{l=0}^{t} \mu^{t-l} = (1-\mu) \sum_{l=0}^{t} \mu^l = 1 - \mu^{t+1}
\;\Rightarrow\; \frac{1}{1-\mu^{t+1}} v_t \text{ is "unbiased"}
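A short numerical check of the identity above, as a sketch (the constant toy gradient is an illustrative assumption): with a constant gradient g, the moving average v_t underestimates g by exactly the factor 1 - µ^{t+1}, and dividing by that factor removes the bias.

mu, g = 0.9, 1.0
v = 0.0
for t in range(10):
    v = mu * v + (1 - mu) * g                # EMA update with a constant gradient g
    v_hat = v / (1 - mu ** (t + 1))          # divide by 1 - mu^{t+1} to remove the bias
    print(t, round(v, 4), round(v_hat, 4))   # v_hat equals g = 1.0 at every step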
Feature-Scaling
Motivation
• Consider f(x) = \frac{1}{2} x^\top \begin{pmatrix} 1 & 0 \\ 0 & 0.001 \end{pmatrix} x
Q: What is the smoothness parameter L? What is x^*? How well would the gradient algorithm work?
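To make the question concrete, here is a small sketch; the answers L = 1, x^* = 0 and the step size η = 1/L are assumptions stated here, not on the slide. Gradient descent contracts the second coordinate only by a factor of 0.999 per step, so the poorly scaled direction converges extremely slowly.

import numpy as np

A = np.diag([1.0, 0.001])            # Hessian of f(x) = 0.5 * x^T A x; L = largest eigenvalue = 1
x = np.array([1.0, 1.0])
eta = 1.0                            # eta = 1/L
for _ in range(1000):
    x = x - eta * (A @ x)            # gradient step: grad f(x) = A x
print(x)                             # roughly [0.0, 0.37]: the second coordinate has barely moved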
How to make it better?
Q: Any idea?
What if we modify the algorithm to use some diagonal matrix D, i.e. change the update to x_{t+1} = x_t - \eta_t D^{-1} \nabla f(x_t)? Q: What should D be to make it fast?
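One natural answer, sketched below as an assumption (the slide only poses the question), is to take D as the diagonal of the Hessian, so each coordinate is rescaled by its own curvature.

import numpy as np

A = np.diag([1.0, 0.001])
D = np.diag(A).copy()                # per-coordinate curvature (diagonal of the Hessian)
x = np.array([1.0, 1.0])
x = x - (A @ x) / D                  # x_{t+1} = x_t - eta_t * D^{-1} grad f(x_t), with eta_t = 1
print(x)                             # [0.0, 0.0]: exact minimizer in one step for this diagonal quadratic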
Adagrad
Duchi, J., Hazan, E., Singer, Y.
Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 2011
Adagrad
SGD update rule:
x^i_{k+1} = x^i_k - \eta [g_k]_i, \quad \forall i \in \{1, 2, \dots, d\}

Adagrad:

x^i_{k+1} = x^i_k - \eta \frac{1}{\sqrt{\epsilon + G_{k,ii}}} [g_k]_i, \quad \forall i \in \{1, 2, \dots, d\},
\qquad G_k = \mathrm{diag}\Big(\sum_{l=0}^{k} g_l \odot g_l\Big)

ε is usually chosen as 10^{-8} or 10^{-10}.
PS: we can write it in a "compact" way as x_{k+1} = x_k - \eta \frac{1}{\sqrt{\epsilon + G_k}} g_k.
Q: What can be BAD about this algorithm? Q: How could we do it better?
https://ruder.io/optimizing-gradient-descent/index.html#adagrad
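A minimal per-coordinate Adagrad sketch matching the update above; the function name, the default step size, and the toy quadratic are illustrative choices, not from the slide.

import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=200):
    x = x0.astype(float).copy()
    G = np.zeros_like(x)                   # accumulated squared gradients (the diagonal of G_k)
    for _ in range(steps):
        g = grad_fn(x)
        G += g * g                         # G_k only grows, so the effective step sizes only shrink
        x -= lr * g / np.sqrt(eps + G)     # x_{k+1} = x_k - eta * g_k / sqrt(eps + G_k)
    return x

A = np.diag([1.0, 0.001])
print(adagrad(lambda x: A @ x, np.array([1.0, 1.0])))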
Adadelta
Zeiler
ADADELTA: An Adaptive Learning Rate Method, 2012
Adadelta
• Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
• Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w. Q: How would you implement this? Would it be efficient? Q: Is there some hack we could do to make it efficient?
• Instead of inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients:

G_t = \gamma G_{t-1} + (1-\gamma)\, g_t \odot g_t

Then the update is similar to Adagrad:

x_{t+1} = x_t - \eta \frac{1}{\sqrt{\epsilon + G_t}} g_t
https://ruder.io/optimizing-gradient-descent/index.html#adadelta
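A sketch of the decaying-average update described on this slide; note this is the RMSProp-style core, while Zeiler's full Adadelta additionally replaces η with a decaying average of past parameter updates. The function name is illustrative.

import numpy as np

def decaying_average_step(x, G, grad, lr=0.01, gamma=0.9, eps=1e-8):
    G = gamma * G + (1 - gamma) * grad * grad   # G_t = gamma*G_{t-1} + (1-gamma)*g_t*g_t
    x = x - lr * grad / np.sqrt(eps + G)        # x_{t+1} = x_t - eta * g_t / sqrt(eps + G_t)
    return x, G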
Adam
Kingma, D. P., Ba, J. L.
Adam: A Method for Stochastic Optimization.
International Conference on Learning Representations, 2015
Adam (Adaptive Moment Estimation)
• Adam basically "combines" momentum with Adadelta.
• Adam stores an exponentially decaying average of past squared gradients:

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

• Adam also keeps an exponentially decaying average of past gradients (similar to momentum):

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

• Note that m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively, hence the name of the method.
https://ruder.io/optimizing-gradient-descent/index.html#adam
Adam (Adaptive Moment Estimation)
• m_0 and v_0 are initialized as 0. Q: What does this imply about bias?
• The authors of Adam observe that m_t and v_t are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β_1 and β_2 are close to 1). We can remove the bias by defining

\hat{m}_t = \frac{m_t}{1 - \beta_1^{t+1}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{t+1}}

and defining the update

x_{t+1} = x_t - \eta \frac{1}{\epsilon + \sqrt{\hat{v}_t}} \hat{m}_t

The authors propose the default values β_1 = 0.9, β_2 = 0.999 and ε = 10^{-8}.
https://ruder.io/optimizing-gradient-descent/index.html#adam
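A compact sketch of the full Adam update from the last two slides, using the slides' t+1 exponents in the bias correction; the function name and the toy objective are illustrative.

import numpy as np

def adam(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x = x0.astype(float).copy()
    m = np.zeros_like(x)                           # first-moment estimate (decaying average of g)
    v = np.zeros_like(x)                           # second-moment estimate (decaying average of g^2)
    for t in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** (t + 1))         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** (t + 1))         # bias-corrected second moment
        x -= lr * m_hat / (eps + np.sqrt(v_hat))   # x_{t+1} = x_t - eta * m_hat / (eps + sqrt(v_hat))
    return x

A = np.diag([1.0, 0.001])
print(adam(lambda x: A @ x, np.array([1.0, 1.0])))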
Bibliography
Thanks to Prof. Martin Jaggi and Prof. Mark Schmidt for their slides and lectures.
mbzuai.ac.ae
Mohamed bin Zayed
University of Artificial Intelligence
Masdar City
Abu Dhabi
United Arab Emirates
