Learned Optimizers
that Scale and Generalize
Wichrowska et al. ICML 2017.
Wuhyun Shin
March 4, 2019
Lab Seminar
MLAI, KAIST
• Learning to Learn by Gradient Descent by Gradient Descent. Andrychowicz et al. NIPS 2016.
• Optimization as a Model for Few-shot Learning. Ravi and Larochelle. ICLR 2017.
• Learning to Learn without Gradient Descent by Gradient Descent. Chen et al. ICML 2017.
• Learning Gradient Descent: Better Generalization and Longer Horizon. Lv et al. ICML 2017.
• Learned Optimizers that Scale and Generalize. Wichrowska et al. ICML 2017.
• Meta-Learning Update Rules for Unsupervised Representation Learning. Metz et al. ICLR 2019.
• Understanding and Correcting Pathologies in the Training of Learned Optimizers. Metz et al. (In Progress)
Existing Works on Learned Optimizers
AFAIK, this is one of the most recently published papers among the works that are
closely related to our work.
There are two primary barriers for learned optimizers:
1. Scalability to larger problems (especially during meta-training).
2. Generalization to new tasks.
To address these problems, they introduce a learned optimizer that generalizes better
to new tasks with less memory and computation overhead.
Introduction
1. Meta-training set
- ensemble of small tasks with diverse loss landscapes.
2. Hierarchical RNN architecture
- lower memory & time cost / capturing inter-parameter dependencies.
3. Features motivated by hand-designed optimizers
- Computing gradient at attended location.
- Dynamic input scaling with momentum at multiple timescales.
- Decomposition of output into direction and step length.
4. Other techniques
- Log-averaged meta-objective, sampled number of training steps, etc.
Contribution
They propose a novel hierarchical architecture that enables (1) coordination of
updates across parameters and (2) low computational and memory cost.
Architecture – Hierarchical RNN
The hierarchical RNN's parameters (ψ; the meta-parameters) are shared across all
target problems, so it can be applied to models of any scale.
Coordinate-wise RNN
Architecture – Hierarchical RNN
They propose a novel hierarchical architecture that enables (1) coordination of
updates across parameters and (2) low computational and memory cost.
(Architecture diagram) Higher-level RNN outputs enter the level below as a bias term;
lower-level latent states are averaged and fed upward as input.
← Global RNN: captures inter-tensor dependencies
← Tensor RNN: captures inter-parameter dependencies
← Parameter RNN: captures per-parameter 2nd-order information
* Parameter tensor: a subset of parameters (e.g. the parameters of one layer), e.g. parameter tensor 𝜽₁.
Architecture – Hierarchical RNN
B: Batch size
N: Number of parameters
K: Latent size of RNN
- Computational cost: O(BN + NK²) per step (O(BN) for the minibatch gradient, O(NK²) for the per-parameter RNN).
- Memory cost: O(NK) for the per-parameter hidden states.
*Note that as B is increased, the computational cost approaches that of vanilla SGD.
This architecture allows the Parameter RNN to have very few hidden units, with the larger Tensor RNN
and Global RNN keeping track of problem-level information (e.g. P=5, T=10, G=20 hidden units).
They propose a novel hierarchical architecture that enables (1) coordination of
updates across parameters and (2) low computational and memory cost.
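To make the hierarchy concrete, here is a minimal numpy sketch of the upward/downward message passing between the three RNN levels. This is not the authors' implementation: the plain tanh cells, the feature dimension F, the random meta-parameters, and feeding the higher-level state as an extra input (rather than literally as a bias term) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): two parameter tensors and small hidden sizes.
K_P, K_T, K_G = 5, 10, 20          # Parameter / Tensor / Global RNN hidden units (P=5, T=10, G=20)
tensor_sizes = [300, 100]          # number of parameters in each parameter tensor
F = 4                              # per-parameter input features (scaled gradients, etc.)

def rnn_cell(x, h, W, U, b):
    # A plain tanh RNN cell; the actual optimizer may use a gated cell.
    return np.tanh(x @ W + h @ U + b)

# Randomly initialised meta-parameters psi (shared across *all* target problems).
Wp, Up, bp = rng.normal(size=(F + K_T, K_P)), rng.normal(size=(K_P, K_P)), np.zeros(K_P)
Wt, Ut, bt = rng.normal(size=(K_P + K_G, K_T)), rng.normal(size=(K_T, K_T)), np.zeros(K_T)
Wg, Ug, bg = rng.normal(size=(K_T, K_G)), rng.normal(size=(K_G, K_G)), np.zeros(K_G)

# Hidden states: one Parameter-RNN state per parameter, one Tensor-RNN state per tensor.
h_param = [np.zeros((n, K_P)) for n in tensor_sizes]
h_tensor = [np.zeros(K_T) for _ in tensor_sizes]
h_global = np.zeros(K_G)

def hierarchical_step(per_param_inputs):
    """One optimizer step: latent states flow upward as averages and back down as extra inputs
    (standing in for the bias-term coupling shown on the slide)."""
    global h_global
    tensor_in = [h.mean(axis=0) for h in h_param]          # average Parameter states per tensor
    global_in = np.mean(np.stack(h_tensor), axis=0)        # average Tensor states
    h_global = rnn_cell(global_in[None, :], h_global[None, :], Wg, Ug, bg)[0]   # inter-tensor info
    for i, x in enumerate(per_param_inputs):
        h_tensor[i] = rnn_cell(np.concatenate([tensor_in[i], h_global])[None, :],
                               h_tensor[i][None, :], Wt, Ut, bt)[0]             # inter-parameter info
        h_param[i] = rnn_cell(np.concatenate([x, np.tile(h_tensor[i], (x.shape[0], 1))], axis=1),
                              h_param[i], Wp, Up, bp)                           # per-parameter info
    # Per-parameter updates would be read out of h_param.
    # Cost per step: O(N*K_P^2) for the Parameter RNN + O(B*N) for the minibatch gradient.

hierarchical_step([rng.normal(size=(n, F)) for n in tensor_sizes])
print([h.shape for h in h_param], h_global.shape)
```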
1. Computing the gradient at an attended location.
2. Dynamic input scaling with momentum at multiple timescales.
3. Decomposition of the output into direction and step length.
Features Inspired by Hand-designed Optimizers
They incorporate knowledge of effective strategies from existing optimizers.
They emphasize that these are not arbitrary design choices but are individually important,
as demonstrated by the ablation experiments in a later section.
Input
Features Inspired by Hand-designed Optimizers
1. Computing the gradient at an attended location (away from the current location)
The optimizer explores new regions of the loss surface by computing gradients away
from (or ahead of) the current parameter position.
They refer to this as "a cross between Nesterov momentum and an RNN attention mechanism".
- Attended location: ϕ_t^{n+1} = θ_t^n + Δϕ_t^n
- Updated location: θ_t^{n+1} = θ_t^n + Δθ_t^n
- Gradient as the input to the learned optimizer: g_t^n = ∂L/∂ϕ_t^n
(Δθ_t^n and Δϕ_t^n are outputs from the learned optimizer.)
Features Inspired by Hand-designed Optimizers
1. Computing the gradient at an attended location (away from the current location)
(Figure) Based on the gradient at the attended point ϕ_t^n, the optimizer suggests the next
updated point θ_t^{n+1} (starting from θ_t^n).
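Here is a minimal sketch of the attended-location bookkeeping on a toy quadratic. The mapping from the gradient to Δθ and Δϕ is the learned RNN in the paper; the momentum rule and look-ahead factor used below are placeholders for illustration only.

```python
import numpy as np

def loss_grad(x):
    # Toy quadratic loss L(x) = 0.5 * ||x||^2, so dL/dx = x.
    return x

theta = np.array([2.0, -1.5])    # current parameters, theta_t^n
phi = theta.copy()               # attended location, phi_t^n
velocity = np.zeros_like(theta)

for n in range(50):
    g = loss_grad(phi)                 # g_t^n = dL/dphi_t^n: gradient taken at the *attended* point
    # Stand-in for the learned optimizer's two outputs (a momentum rule, not the paper's RNN):
    velocity = 0.9 * velocity - 0.1 * g
    delta_theta = velocity             # update applied to the parameters
    delta_phi = 1.5 * velocity         # look slightly further ahead for the next gradient query
    theta, phi = theta + delta_theta, theta + delta_phi
    # theta_t^{n+1} = theta_t^n + delta_theta ;  phi_t^{n+1} = theta_t^n + delta_phi

print("final loss:", 0.5 * float(theta @ theta))
```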
Features Inspired by Hand-designed Optimizers
2. Dynamic input scaling with momentum at multiple timescales
This gives the learned optimizer access to information about:
(i) the loss surface curvature and (ii) the degree of noise in the gradient.
- Exponential moving averages of the gradient at multiple timescales s (the momentum logit β is an output from the learned optimizer):
ḡ_{t,s}^n = σ(β)^(2^-s) · ḡ_{t,s}^{n-1} + (1 − σ(β)^(2^-s)) · g_t^n
(Plot: per-step weights over time t for s = 0, 1, 2, 3.)
For a fixed logit output β, the decay weights differ according to the timescale s
(larger s → slower decay → averaging over a longer horizon of past steps).
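A small numpy sketch of the multi-timescale averaging, using the decay form reconstructed above. Fixing β and averaging a single scalar gradient are simplifications for illustration; in the optimizer, β is produced by the RNN and the averages are kept per parameter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.0                                   # momentum logit (an RNN output; fixed here for illustration)
s = np.arange(4)                             # timescales s = 0, 1, 2, 3
decay = sigmoid(beta) ** (2.0 ** -s)         # larger s -> decay closer to 1 -> longer memory
print("decay per timescale:", np.round(decay, 3))   # approx. [0.5, 0.707, 0.841, 0.917]

avg = np.zeros(len(s))                       # one running average (per timescale) of a scalar gradient
rng = np.random.default_rng(0)
for step in range(100):
    g = 1.0 + 0.5 * rng.normal()             # noisy gradient with mean 1
    avg = decay * avg + (1.0 - decay) * g    # exponential moving averages at all timescales at once

# Short-timescale averages track the noise; long-timescale averages are smoother,
# which is what exposes curvature vs. gradient-noise information to the RNN.
print("averages:", np.round(avg, 3))
```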
Features Inspired by Hand-designed Optimizers
2. Dynamic input scaling with momentum at multiple timescales
To make the optimizer invariant to parameter scale, they rescale the averaged
gradients in a fashion similar to RMSProp and ADAM.
- Dynamic input scaling by its magnitude:
Output from the
learned optimizer
Input to the learned optimizer
(for multiple timescales)
Features Inspired by Hand-designed Optimizers
2. Dynamic input scaling with momentum at multiple timescales
The relative gradient magnitude at each averaging scale s may be useful for the
learned optimizer, giving it access to how gradient magnitudes are changing over time.
- Relative gradient magnitude
Output from the
learned optimizer
Input to the learned optimizer
(for multiple timescales)
Features Inspired by Hand-designed Optimizers
1-2. Overall process for input
(Figure legend) — outputs from the learned optimizer at the previous step;
— inputs to the learned optimizer at the current step.
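A hedged sketch of how the input features from slides 12–13 could be computed from the multi-timescale averages. The exact normalization used in the paper is in equations not reproduced on the slide, so the RMS-across-timescales normalizer and the EPS constant below are illustrative assumptions.

```python
import numpy as np

EPS = 1e-12   # small constant for numerical stability (assumption)

def input_features(avg_g):
    # avg_g: shape (S, N) -- averaged gradients at S timescales for each of N parameters.
    mag = np.abs(avg_g) + EPS
    # Dynamic input scaling: divide by a per-parameter magnitude estimate so the features are
    # invariant to the overall gradient scale (in the spirit of RMSProp / ADAM).
    rms = np.sqrt((avg_g ** 2).mean(axis=0)) + EPS
    scaled_avg_g = avg_g / rms
    # Relative log gradient magnitude: each timescale's log-magnitude relative to the mean over
    # timescales -- tells the RNN how gradient magnitudes are changing over time.
    log_mag = np.log(mag)
    rel_log_mag = log_mag - log_mag.mean(axis=0, keepdims=True)
    return scaled_avg_g, rel_log_mag

avg_g = np.array([[1e-3, 2.0],     # s = 0 (short timescale)
                  [5e-4, 1.0],     # s = 1
                  [1e-4, 0.5]])    # s = 2 (long timescale), for two parameters
scaled, rel = input_features(avg_g)
print(np.round(scaled, 3))   # comparable scale for both parameters despite very different gradients
print(np.round(rel, 3))      # positive where a timescale's magnitude is above the per-parameter mean
```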
Features Inspired by Hand-designed Optimizers
3. Decomposition of output into direction and step length
They enforce a decomposition similar to RMSProp and ADAM, where the length of the
update step is controlled solely by the learning rate and is invariant to any scaling of
the gradient.
Actual outputs from
the learned optimizer
Step length
Direction
Log step length
Features Inspired by Hand-designed Optimizers
3. Decomposition of output into direction and step length
The RNN therefore controls the step length by outputting a log-additive change Δη_θ;
the initial learning rate is sampled from a log-uniform distribution (1e-6 to 1e-2).
Actual outputs from
the learned optimizer
Step length
Direction Meta-parameters
The optimizer has to judge the correct step length from the gradient history, rather
than memorizing the range of step lengths that was useful during meta-training.
Log step length
Features Inspired by Hand-designed Optimizers
3. Decomposition of output into direction and step length
To aid coordination across parameters, they also provide the RNN with the relative
log learning rate of each parameter, compared to the rest, as an input.
Step length
Direction
The optimizer has to judge the correct step length from the gradient history, rather
than memorizing the range of step lengths that was useful during meta-training.
Log step length
Input to the learned optimizer
(for multiple timescales)
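A minimal sketch of the output decomposition from slides 15–17: a normalized direction, a log step length updated additively, and the relative log learning rate as an extra input. Normalizing the whole direction vector to unit norm, and the helper names apply_update / delta_log_eta, are assumptions for illustration; the paper's exact normalization is not shown on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_params = 4

# Per-parameter log learning rate, initialised log-uniformly in [1e-6, 1e-2] as on the slide.
log_lr = rng.uniform(np.log(1e-6), np.log(1e-2), size=n_params)

def apply_update(theta, direction, delta_log_eta):
    # direction and delta_log_eta stand in for the learned optimizer's outputs.
    global log_lr
    log_lr = log_lr + delta_log_eta                              # log-additive change in step length
    unit_dir = direction / (np.linalg.norm(direction) + 1e-12)   # step length comes only from the lr
    return theta + np.exp(log_lr) * unit_dir                     # step = exp(log step length) * direction

# Relative log learning rate, fed back to the RNN as an input to aid coordination across parameters.
rel_log_lr = log_lr - log_lr.mean()

theta = np.zeros(n_params)
theta = apply_update(theta,
                     direction=rng.normal(size=n_params),
                     delta_log_eta=0.1 * rng.normal(size=n_params))
print(np.round(theta, 6), np.round(rel_log_lr, 2))
```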
Optimizer inputs and outputs
The optimizer has to judge the correct step length from the gradient history, rather
than memorizing the range of step lengths that was useful during meta-training.
Scaled average gradient
Relative log gradient magnitude
Relative log learning rate
Update direction for 𝜃 and 𝜙
Log change in step length
Momentum logits
Meta-parameters: ψ (the shared RNN weights)
(Architecture diagram, as before: higher-level outputs enter the level below as a bias term;
lower-level latent states are input as an average; parameter tensor 𝜽₁.)
Meta-training
Meta-training set: They meta-train the optimizer on an ensemble of small problems
that have been chosen to capture many commonly encountered properties of loss
landscapes and stochastic gradients.
1. Exemplar problems from literature (Surjanovic & Bingham, 2013)
2. Well behaved problems (Quadratic bowls/logistic regression)
3. Noisy gradients and minibatch problems (Noise added/randomly generated data)
4. Slow convergence problems (Optimum at infinity/sparse discontinuous gradient)
5. Transformed problems (Transformation of previously defined problems)
By meta-training on small toy problems, they also avoid memory issues one would
encounter by meta-training on very large, real-world problems.
Meta-training
Meta-objective: For the meta-training loss, they use the average log loss across all
training problems:
Averaging the logarithm resembles minimizing the final function value, since very small
values of ℓ(θ^n) make an outsized contribution to the average after taking the logarithm.
Meta parameters:
Encourages exact convergence to minima and
precise dynamic adjustment of learning rate.
(Shown in the ablation exp.)
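Written out, the meta-objective has roughly the form
L(ψ) = (1/N) Σ_{n=1..N} log ℓ(θ^n), averaged over all problems in the meta-training set,
where θ^n depends on ψ through the unrolled optimization (any stabilizing constant inside the logarithm used in the paper is omitted here).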
Meta-training
Partial Unrolling with Full Gradient
Gradients flow through this graph!
Unlike Andrychowicz et al. (2016), they compute the full gradient in TensorFlow using
the provided API, including second derivatives.
Backpropagation through time
at every partial unrolling step
Meta-training
Heavy-tailed Distribution over Training Step
To encourage the learned optimizer to generalize to long training runs,
(i) the number of partial unrollings (N) and
(ii) the number of optimization steps within each partial unroll (M)
are drawn from a heavy-tailed distribution like the one above (total optimization steps: NM).
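As a hedged illustration only: the slide specifies the distribution via a figure, so the sampler below (sample_unroll_schedule, the exponential tails, and the caps) is entirely a placeholder, showing how N and M jointly set the total number of unrolled steps NM.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_unroll_schedule(max_unrolls=20, max_steps_per_unroll=50):
    # Placeholder heavy-tailed sampler for the number of partial unrollings (N) and the number of
    # optimization steps per partial unroll (M). The paper defines its own distribution (shown only
    # as a figure on the slide); the exponential tails used here are purely illustrative.
    n = 1 + min(int(rng.exponential(scale=4.0)), max_unrolls - 1)
    m = 1 + min(int(rng.exponential(scale=10.0)), max_steps_per_unroll - 1)
    return n, m

samples = [sample_unroll_schedule() for _ in range(5)]
print(samples, [n * m for n, m in samples])   # total optimization steps per meta-training run: N*M
```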
Experiments
Failures of Existing Learned Optimizers
Comparable performance to static optimizers, even for numbers of iterations not
seen during meta-training.
Training loss vs. number of
optimization steps.
Trained on a 2-layer MLP
with sigmoid / MNIST /
batch size 64.
Experiments
Performance on Training Set Problems
Three sample problems from the meta-training set (Booth / Matyas / Rosenbrock).
The learned optimizer generally beats the others on problems in the training set.
Experiments
Generalization to New Problem Types
Trained on the meta-training problem set, which did not include any ConvNets or MLPs.
Comparable performance to other static optimizers.
Experiments
Generalization to New Problem Types
They also tested the same optimizer on Inception V3 and ResNet V2.
Training was stable for the first 10K to 20K steps but failed after that.
Experiments
Robustness to Choice of Learning Rate
Training curves on a randomly generated quadratic loss problem with different
learning-rate initializations. The learned optimizer is also sensitive to the initial learning rate, but more robust than the hand-designed optimizers.
Experiments
Ablation Experiments
Ablation study demonstrating the importance of design choices, on a small ConvNet
trained on MNIST. DEFAULT is the optimizer with all features included.
Taking the logarithm of the meta-objective
Computing gradient at attended location
Having RNN learn its own initial weights
Relative log learning rate as an input
Accumulation decay for multiple gradient timescales
Dynamic input scaling
Momentum on multiple timescales
Experiments
Wall clock comparison
For small minibatches, the learned optimizer significantly underperforms ADAM and RMSProp
in terms of wall-clock time.
However, since its overhead is constant with respect to the minibatch size, it can be made
relatively small by increasing the minibatch size.
Discussion
• If the historical information can be manually encoded at low cost, as in this work,
the performance gain from using an RNN architecture might be redundant considering its
relatively high computational cost.
• Can we believe that encouraging the learned optimizer to work in ways similar to what is
done in hand-designed adaptive optimizers is truly optimal? (The generalization power of
those adaptive optimizers has recently become controversial.)
• Couldn't the manually set timescales s also be made a free variable?
Thank you