MTH702 Optimization
Stochastic optimization
SGD+momentum
Adagrad
AdaDelta
Adam
SGD with Momentum
Sutskever, I., Martens, J., Dahl, G., Hinton, G.
On the importance of initialization and momentum in deep learning.
In International Conference on Machine Learning, 2013.
The Algorithm
• Motivated by Nesterov's accelerated method, researchers suggested adding "momentum" to SGD, even for non-convex problems.
PyTorch Implementation
v_{t+1} = \mu v_t + g_{t+1}    (1)
p_{t+1} = p_t - \eta v_{t+1}    (2)
https://pytorch.org/docs/stable/optim.html#torch.optim.SGD
• v is the velocity
• µ is the momentum coefficient
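A minimal plain-Python sketch of updates (1)-(2), mirroring the PyTorch formulation linked above; the function name and the toy quadratic objective are illustrative, not part of the slides.

import numpy as np

def sgd_momentum_step(p, v, grad, lr=0.01, mu=0.9):
    v = mu * v + grad              # v_{t+1} = mu * v_t + g_{t+1}
    p = p - lr * v                 # p_{t+1} = p_t - eta * v_{t+1}
    return p, v

p, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    p, v = sgd_momentum_step(p, v, grad=p)   # gradient of f(p) = 0.5*||p||^2 is p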
Another motivation: smoothing noisy stochastic gradients
Google Colab Demonstration
An alternative formulation
v_{t+1} = \mu v_t + (1-\mu) g_{t+1}    (3)
p_{t+1} = p_t - \eta v_{t+1}    (4)
https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
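A minimal sketch of the alternative update (3)-(4); compared with (1)-(2), the gradient is weighted by (1-µ), so v becomes an exponential moving average of the gradients. The function name is illustrative.

import numpy as np

def ema_momentum_step(p, v, grad, lr=0.1, mu=0.9):
    v = mu * v + (1 - mu) * grad   # v_{t+1} = mu*v_t + (1-mu)*g_{t+1}: EMA of gradients
    p = p - lr * v                 # p_{t+1} = p_t - eta*v_{t+1}
    return p, v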
What is v_t?

v_t = \mu v_{t-1} + (1-\mu) g_t
    = \mu(\mu v_{t-2} + (1-\mu) g_{t-1}) + (1-\mu) g_t = \mu^2 v_{t-2} + \mu(1-\mu) g_{t-1} + (1-\mu) g_t
    = \mu^3 v_{t-3} + \mu^2(1-\mu) g_{t-2} + \mu(1-\mu) g_{t-1} + (1-\mu) g_t
    = \dots
    = \mu^{t+1} v_0 + \sum_{l=0}^{t} \mu^{t-l}(1-\mu) g_l

If v_0 = 0, then v_t = (1-\mu) \sum_{l=0}^{t} \mu^{t-l} g_l.

Would v_t be "unbiased"? How much is

(1-\mu) \sum_{l=0}^{t} \mu^{t-l} = (1-\mu) \sum_{l=0}^{t} \mu^l = 1 - \mu^{t+1}
\;\Rightarrow\; \frac{1}{1-\mu^{t+1}} v_t \text{ is "unbiased"}
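A short numerical check of the identity above, as a sketch (the constant toy gradient is an illustrative assumption): with a constant gradient g, the moving average v_t underestimates g by exactly the factor 1 - µ^{t+1}, and dividing by that factor removes the bias.

mu, g = 0.9, 1.0
v = 0.0
for t in range(10):
    v = mu * v + (1 - mu) * g                # EMA update with a constant gradient g
    v_hat = v / (1 - mu ** (t + 1))          # divide by 1 - mu^{t+1} to remove the bias
    print(t, round(v, 4), round(v_hat, 4))   # v_hat equals g = 1.0 at every step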
Feature-Scaling
Motivation
• Consider f(x) = \frac{1}{2} x^\top \begin{pmatrix} 1 & 0 \\ 0 & 0.001 \end{pmatrix} x
Q: What is the smoothness parameter L? What is x^*? How well would the gradient algorithm work?
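To make the question concrete, here is a small sketch; the answers L = 1, x^* = 0 and the step size η = 1/L are assumptions stated here, not on the slide. Gradient descent contracts the second coordinate only by a factor of 0.999 per step, so the poorly scaled direction converges extremely slowly.

import numpy as np

A = np.diag([1.0, 0.001])            # Hessian of f(x) = 0.5 * x^T A x; L = largest eigenvalue = 1
x = np.array([1.0, 1.0])
eta = 1.0                            # eta = 1/L
for _ in range(1000):
    x = x - eta * (A @ x)            # gradient step: grad f(x) = A x
print(x)                             # roughly [0.0, 0.37]: the second coordinate has barely moved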
How to make it better?
Q: Any idea?
What if we modify the algorithm to use some diagonal matrix D, i.e. change the update to x_{t+1} = x_t - \eta_t D^{-1} \nabla f(x_t)? Q: What should D be to make it fast?
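One natural answer, sketched below as an assumption (the slide only poses the question), is to take D as the diagonal of the Hessian, so each coordinate is rescaled by its own curvature.

import numpy as np

A = np.diag([1.0, 0.001])
D = np.diag(A).copy()                # per-coordinate curvature (diagonal of the Hessian)
x = np.array([1.0, 1.0])
x = x - (A @ x) / D                  # x_{t+1} = x_t - eta_t * D^{-1} grad f(x_t), with eta_t = 1
print(x)                             # [0.0, 0.0]: exact minimizer in one step for this diagonal quadratic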
Adagrad
Duchi, J., Hazan, E., Singer, Y.
Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 2011
Adagrad
SGD update rule:
x^i_{k+1} = x^i_k - \eta [g_k]_i, \quad \forall i \in \{1, 2, \dots, d\}

Adagrad:

x^i_{k+1} = x^i_k - \eta \frac{1}{\sqrt{\epsilon + G_{k,ii}}} [g_k]_i, \quad \forall i \in \{1, 2, \dots, d\},
\qquad G_k = \mathrm{diag}\Big(\sum_{l=0}^{k} g_l \odot g_l\Big)

ε is usually chosen as 10^{-8} or 10^{-10}.
PS: we can write it in a "compact" way as x_{k+1} = x_k - \eta \frac{1}{\sqrt{\epsilon + G_k}} g_k.
Q: What can be BAD about this algorithm? Q: How could we do it better?
https://ruder.io/optimizing-gradient-descent/index.html#adagrad
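A minimal per-coordinate Adagrad sketch matching the update above; the function name, the default step size, and the toy quadratic are illustrative choices, not from the slide.

import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=200):
    x = x0.astype(float).copy()
    G = np.zeros_like(x)                   # accumulated squared gradients (the diagonal of G_k)
    for _ in range(steps):
        g = grad_fn(x)
        G += g * g                         # G_k only grows, so the effective step sizes only shrink
        x -= lr * g / np.sqrt(eps + G)     # x_{k+1} = x_k - eta * g_k / sqrt(eps + G_k)
    return x

A = np.diag([1.0, 0.001])
print(adagrad(lambda x: A @ x, np.array([1.0, 1.0])))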
Adadelta
Zeiler
ADADELTA: An Adaptive Learning Rate Method, 2012
Adadelta
• Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate.
• Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w. Q: How would you implement this? Would it be efficient? Q: Is there some hack we could do to make it efficient?
• Instead of inefficiently storing w previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients:

G_t = \gamma G_{t-1} + (1-\gamma)\, g_t \odot g_t

Then the update is similar to Adagrad:

x_{t+1} = x_t - \eta \frac{1}{\sqrt{\epsilon + G_t}} g_t
https://ruder.io/optimizing-gradient-descent/index.html#adadelta
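A sketch of the decaying-average update described on this slide; note this is the RMSProp-style core, while Zeiler's full Adadelta additionally replaces η with a decaying average of past parameter updates. The function name is illustrative.

import numpy as np

def decaying_average_step(x, G, grad, lr=0.01, gamma=0.9, eps=1e-8):
    G = gamma * G + (1 - gamma) * grad * grad   # G_t = gamma*G_{t-1} + (1-gamma)*g_t*g_t
    x = x - lr * grad / np.sqrt(eps + G)        # x_{t+1} = x_t - eta * g_t / sqrt(eps + G_t)
    return x, G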
Adam
Kingma, D. P., Ba, J. L.
Adam: A Method for Stochastic Optimization.
International Conference on Learning Representations, 2015
Adam (Adaptive Moment Estimation)
• Adam basically "combines" momentum with Adadelta.
• Adam stores an exponentially decaying average of past squared gradients:

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

• Adam also keeps an exponentially decaying average of past gradients (similar to momentum):

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

• Note that m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively, hence the name of the method.
https://ruder.io/optimizing-gradient-descent/index.html#adam
Adam (Adaptive Moment Estimation)
• m_0 and v_0 are initialized as 0. Q: What does this imply about bias?
• The authors of Adam observe that m_t and v_t are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (i.e. β_1 and β_2 are close to 1). We can remove the bias by defining

\hat{m}_t = \frac{m_t}{1 - \beta_1^{t+1}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{t+1}}

and defining the update

x_{t+1} = x_t - \eta \frac{1}{\epsilon + \sqrt{\hat{v}_t}} \hat{m}_t

The authors propose the default values β_1 = 0.9, β_2 = 0.999 and ε = 10^{-8}.
https://ruder.io/optimizing-gradient-descent/index.html#adam
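A compact sketch of the full Adam update from the last two slides, using the slides' t+1 exponents in the bias correction; the function name and the toy objective are illustrative.

import numpy as np

def adam(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    x = x0.astype(float).copy()
    m = np.zeros_like(x)                           # first-moment estimate (decaying average of g)
    v = np.zeros_like(x)                           # second-moment estimate (decaying average of g^2)
    for t in range(steps):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** (t + 1))         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** (t + 1))         # bias-corrected second moment
        x -= lr * m_hat / (eps + np.sqrt(v_hat))   # x_{t+1} = x_t - eta * m_hat / (eps + sqrt(v_hat))
    return x

A = np.diag([1.0, 0.001])
print(adam(lambda x: A @ x, np.array([1.0, 1.0])))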
Bibliography
Thanks to Prof. Martin Jaggi and Prof. Mark Schmidt for their slides and lectures.
mbzuai.ac.ae
Mohamed bin Zayed
University of Artificial Intelligence
Masdar City
Abu Dhabi
United Arab Emirates
