The paper proposes a novel hierarchical RNN architecture for a learned optimizer, aimed at addressing the scalability and generalization limitations of prior learned optimizers. The architecture stacks parameter-level, tensor-level, and global RNNs so that parameter updates can be coordinated across the model at low memory and computational cost. It also incorporates features inspired by hand-designed optimizers, such as attention (computing gradients at attended locations) and dynamic input scaling, to supply the learned optimizer with useful information. The optimizer is meta-trained on a diverse set of small problems and generalizes to new problem types, though it still struggles on very large models. Ablation studies show that the paper's design choices are important for the learned optimizer's performance.
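To make the hierarchy concrete, below is a minimal sketch of how such a three-level coordination path could be wired, under stated assumptions rather than as the authors' implementation: it uses plain tanh RNN cells instead of the small GRU cells described in the paper, a crude mean-magnitude normalization as a stand-in for dynamic input scaling, and illustrative names and hidden sizes (`optimizer_step`, `P_HID`, `T_HID`, `G_HID` are all hypothetical). Each per-parameter cell reads a scaled gradient plus a bias from its tensor-level RNN, tensor-level states are aggregated by a global RNN, and the per-parameter state is kept tiny so memory grows slowly with model size.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_cell(in_dim, hid_dim):
    # Small random weights for a tanh RNN cell: (W_in, W_hid, bias).
    s = 1.0 / np.sqrt(hid_dim)
    return (rng.normal(0.0, s, (hid_dim, in_dim)),
            rng.normal(0.0, s, (hid_dim, hid_dim)),
            np.zeros(hid_dim))

# Illustrative hidden sizes (assumptions, not the paper's values); the
# per-parameter state is kept very small so memory scales gently.
P_HID, T_HID, G_HID = 5, 10, 20
param_cell  = init_cell(1 + T_HID, P_HID)      # input: scaled grad + tensor-RNN bias
tensor_cell = init_cell(P_HID + G_HID, T_HID)  # input: mean param state + global bias
global_cell = init_cell(T_HID, G_HID)          # input: mean of tensor states
readout = rng.normal(0.0, 0.1, (1, P_HID))     # maps a parameter state to its update

def optimizer_step(grads, p_states, t_states, g_state, lr=1e-3):
    """One learned-optimizer step over a list of per-tensor gradient vectors."""
    updates, new_p, new_t = [], [], []
    for g, hp, ht in zip(grads, p_states, t_states):
        # Crude stand-in for dynamic input scaling: normalize by mean magnitude.
        scaled = (g / (np.abs(g).mean() + 1e-8))[:, None]
        # Per-parameter RNN: every parameter in the tensor gets the same tensor-level bias.
        x = np.concatenate([scaled, np.tile(ht, (len(g), 1))], axis=1)
        hp = np.tanh(x @ param_cell[0].T + hp @ param_cell[1].T + param_cell[2])
        # Per-tensor RNN: aggregates its parameters' states, plus the global bias.
        xt = np.concatenate([hp.mean(axis=0), g_state])
        ht = np.tanh(tensor_cell[0] @ xt + tensor_cell[1] @ ht + tensor_cell[2])
        updates.append(lr * (hp @ readout.T).ravel())
        new_p.append(hp)
        new_t.append(ht)
    # Global RNN: aggregates tensor states to coordinate the next step.
    xg = np.mean(new_t, axis=0)
    g_state = np.tanh(global_cell[0] @ xg + global_cell[1] @ g_state + global_cell[2])
    return updates, new_p, new_t, g_state

# Usage on two dummy "tensors" with 4 and 3 parameters.
grads = [rng.normal(size=4), rng.normal(size=3)]
p_states = [np.zeros((len(g), P_HID)) for g in grads]
t_states = [np.zeros(T_HID) for _ in grads]
g_state = np.zeros(G_HID)
updates, p_states, t_states, g_state = optimizer_step(grads, p_states, t_states, g_state)
print([u.shape for u in updates])  # -> [(4,), (3,)]
```

The point of the sketch is only the cost structure: per-parameter work involves a tiny state, while the coordination across parameters happens through the much cheaper tensor-level and global aggregates, which is how the paper keeps memory and computation low.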