Online Hyperparameter Meta-Learning
with Hypergradient Distillation
Hae Beom Lee1, Hayeon Lee1, Jaewoong Shin3,
Eunho Yang1,2, Timothy Hospedales4,5, Sung Ju Hwang1,2
KAIST1, AITRICS2, Lunit3, University of Edinburgh4, Samsung AI Centre Cambridge5
ICLR 2022 spotlight
Meta-Learning
Unseen tasks
Knowledge
Transfer !
Meta-test
Test
Test
Training Test
Training
Training
Meta-training
S. Ravi, H. Larochelle, Optimization as a Model for Few-shot Learning, ICLR 2017
C. Finn, P. Abbeel, S. Levine, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
MAML (Finn et al., ‘17)
• Humans generalize well because we never learn from scratch.
• Learn a model that can generalize over a distribution of tasks.
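As a concrete reference, the MAML-style objective behind this slide can be written as the following bilevel problem (the notation here is ours, a standard sketch rather than the deck's own formulas):

```latex
\min_{\theta}\ \sum_{\tau} \mathcal{L}^{\text{val}}_{\tau}\!\big(\theta^{(K)}_{\tau}\big)
\quad\text{where}\quad
\theta^{(k)}_{\tau} = \theta^{(k-1)}_{\tau} - \alpha\,\nabla_{\theta}\,\mathcal{L}^{\text{tr}}_{\tau}\!\big(\theta^{(k-1)}_{\tau}\big),
\qquad \theta^{(0)}_{\tau} = \theta
```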
Hyperparameters in Meta-Learning
Whole feature extractor
Element-wise learning rates
Interleaved (e.g. “Warp”) layers
A. Raghu*, M. Raghu*, S. Bengio, O. Vinyals, Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML, ICLR 2020
S. Flennerhag, A. A. Rusu, R. Pascanu, F. Visin, H. Yin, R. Hadsell, Meta-Learning with Warped Gradient Descent, ICLR 2020
Z. Li, F. Zhou, F. Chen, H. Li, Meta-SGD: Learning to Learn Quickly for Few-Shot Learning, 2017
• The parameters that do not participate in inner-optimization → Hyperparameters in meta-learning.
• They are usually high-dimensional.
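A minimal PyTorch sketch of this distinction (the linear model, shapes, and Meta-SGD-style element-wise learning rates are illustrative assumptions, not the authors' code): the inner loop updates only the task parameters, while the learning-rate tensors are hyperparameters it never touches.

```python
import torch
import torch.nn as nn

# Toy sketch: the inner loop updates the task parameters w; the element-wise
# learning rates (Meta-SGD style) are hyperparameters that the inner loop
# never updates -- they are optimized only by the outer (meta) objective.
model = nn.Linear(32, 10)                                    # task parameters w
lrs = {n: torch.full_like(p, 1e-2, requires_grad=True)       # hyperparameters,
       for n, p in model.named_parameters()}                 # same shape as w

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
loss = nn.functional.cross_entropy(model(x), y)
grads = torch.autograd.grad(loss, list(model.parameters()))

with torch.no_grad():                                         # one inner SGD step:
    for (n, p), g in zip(model.named_parameters(), grads):    # w <- w - lr * grad
        p -= lrs[n] * g                                       # lrs stays fixed here
```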
Hyperparameter Optimization (HO)
D. Maclaurin, D. Duvenaud, R. P. Adams, Gradient-based Hyperparameter Optimization through Reversible Learning, ICML 2015
https://towardsdatascience.com/shallow-understanding-on-bayesian-optimization-324b6c1f7083
J. Bergstra, Y. Bengio. Random search for hyper-parameter optimization, 2012
• Hyperparameter optimization (HO): the problem of choosing a set of optimal hyperparameters for a learning algorithm; a bilevel formulation is sketched below.
• Which method should we use for such high-dimensional hyperparams?
Random Search Bayesian Optimization Gradient-based HO
Not scalable to hyperparameter dimension
Whole feature extractor
Element-wise learning rates
Interleaved (e.g. “Warp”) layers
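For reference, gradient-based HO treats this as a bilevel problem and differentiates the validation loss through the inner solution (standard formulation; the notation is assumed here, not taken from the deck):

```latex
\lambda^{*} = \arg\min_{\lambda}\ \mathcal{L}^{\text{val}}\!\big(w^{*}(\lambda)\big)
\quad\text{s.t.}\quad
w^{*}(\lambda) = \arg\min_{w}\ \mathcal{L}^{\text{tr}}(w, \lambda),
\qquad
\underbrace{\frac{d\mathcal{L}^{\text{val}}}{d\lambda}}_{\text{hypergradient}}
= \frac{\partial \mathcal{L}^{\text{val}}}{\partial \lambda}
+ \frac{\partial \mathcal{L}^{\text{val}}}{\partial w^{*}}\,
\frac{\partial w^{*}}{\partial \lambda}
```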
In Case of Few-shot Learning
• In the case of few-shot learning, computing the exact gradient w.r.t. the hyperparameters (i.e. the hypergradient) is not too expensive.
• A few gradient steps are sufficient for each task.
Shared
initialization
5 steps
5 steps
5 steps
few-shot task
few-shot task
few-shot task
In Case of Few-shot Learning
• In the case of few-shot learning, computing the exact gradient w.r.t. the hyperparameters (i.e. the hypergradient) is not too expensive, as sketched below.
• A few gradient steps are sufficient for each task.
Shared
initialization
backprop
backprop
backprop
few-shot task
few-shot task
few-shot task
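A minimal PyTorch sketch of this (toy data, a linear model, and element-wise learning rates as the hyperparameter are all assumptions, not the authors' code): with only a few inner steps, the exact hypergradient is obtained by unrolling the inner loop and backpropagating through it.

```python
import torch

torch.manual_seed(0)
X_tr, y_tr = torch.randn(20, 5), torch.randn(20, 1)          # few-shot training set
X_val, y_val = torch.randn(20, 5), torch.randn(20, 1)        # validation set

w0 = torch.zeros(5, 1, requires_grad=True)                    # shared initialization
lrs = torch.full((5, 1), 1e-2, requires_grad=True)            # element-wise LRs (hyperparameter)

w = w0
for _ in range(5):                                             # 5 inner steps, kept in the graph
    loss_tr = ((X_tr @ w - y_tr) ** 2).mean()
    g = torch.autograd.grad(loss_tr, w, create_graph=True)[0]
    w = w - lrs * g                                            # differentiable update

loss_val = ((X_val @ w - y_val) ** 2).mean()
hypergrad = torch.autograd.grad(loss_val, lrs)[0]              # exact hypergradient w.r.t. lrs
```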
• Many-shot learning → Only a few gradient steps?
Shared
initialization
HO Does Matter when Horizon Gets Longer
SVHN
Flowers
→ The meta-learner may suffer from short-horizon bias (Wu et al. ‘18).
J. Shin*, H. B. Lee*, B. Gong, S. J. Hwang, Large-Scale Meta-Learning with Continual Trajectory Shifting, ICML 2021
Cars
Y. Wu*, M. Ren*, R. Liao, R. Grosse, Understanding Short-Horizon Bias in Stochastic Meta-Optimization, ICLR 2018
?
5 steps
5 steps
5 steps
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
HO Does Matter when Horizon Gets Longer
SVHN
Flowers
Cars
Shared
initialization
b….a….c….k….p….r….o….p....
HO Does Matter when Horizon Gets Longer
→ Computing a single hypergradient becomes too expensive!
• Many-shot learning → requires longer inner-learning trajectory
SVHN
Flowers
Cars
HO Does Matter when Horizon Gets Longer
→ Offline method: interval between two adjacent meta-updates is too long…
→ Meta-convergence is poor.
• Many-shot learning → requires longer inner-learning trajectory
1000 step forward
1000 step backward
meta-update meta-update meta-update
Shared
initialization
SVHN
Flowers
Cars
new trajectory
new trajectory
new trajectory
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers update
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers
update
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers update
HO Does Matter when Horizon Gets Longer
→ Online method: update the hyperparameter every inner-grad step!
• Many-shot learning → requires longer inner-learning trajectory
Shared
initialization
Flowers
much faster meta-convergence!
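A minimal sketch of this online pattern (toy data, a linear model, and a greedy 1-step hypergradient are assumptions for illustration; the paper's method replaces the greedy hypergradient with the distilled one introduced later): instead of unrolling a long trajectory before every meta-update, the hyperparameter is updated after every inner step.

```python
import torch

torch.manual_seed(0)
X_tr, y_tr = torch.randn(50, 5), torch.randn(50, 1)
X_val, y_val = torch.randn(50, 5), torch.randn(50, 1)

w = torch.zeros(5, 1, requires_grad=True)                      # task parameters
lrs = torch.full((5, 1), 1e-2, requires_grad=True)             # hyperparameters
meta_opt = torch.optim.Adam([lrs], lr=1e-3)

for step in range(1000):                                        # long inner trajectory
    loss_tr = ((X_tr @ w - y_tr) ** 2).mean()
    g = torch.autograd.grad(loss_tr, w, create_graph=True)[0]
    w_next = w - lrs * g                                        # one differentiable inner step

    # online HO: hypergradient through the last step only (greedy / short horizon)
    loss_val = ((X_val @ w_next - y_val) ** 2).mean()
    meta_opt.zero_grad()
    lrs.grad = torch.autograd.grad(loss_val, lrs)[0]
    meta_opt.step()

    w = w_next.detach().requires_grad_()                        # commit the inner step
```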
Criteria of Good HO Algorithm for Meta-Learning
1. Scalable to hyperparameter dimension
2. Less or no short-horizon bias
3. Computing a single hypergradient should not be too expensive
4. Update hyperparam every inner-grad step (i.e. online optimization)
[Recap figure: thumbnails of the previous slides illustrate each criterion — high-dimensional hyperparameters (whole feature extractor, element-wise learning rates, interleaved “Warp” layers), costly full-trajectory backprop over SVHN / Flowers / Cars from a shared initialization, and online updates giving much faster meta-convergence.]
Limitations of Existing Grad-based HO Algs
Unfortunately, the existing gradient-based HO algorithms do not satisfy all the criteria simultaneously.
Criteria                             FMD
1. Scalable to hyperparam dim         x
2. Less or no short horizon bias      o
3. Constant memory cost               o
4. Online optimization                o
L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and Reverse Gradient-Based Hyperparameter Optimization, ICML 2017
Limitations of Existing Grad-based HO Algs
Unfortunately, the existing gradient-based HO algorithms do not satisfy all the criteria simultaneously.
Criteria                             FMD   RMD
1. Scalable to hyperparam dim         x     o
2. Less or no short horizon bias      o     o
3. Constant memory cost               o     x
4. Online optimization                o     x
L. Franceschi, M. Donini, P. Frasconi, M. Pontil, Forward and Reverse Gradient-Based Hyperparameter Optimization, ICML 2017
Limitations of Existing Grad-based HO Algs
Unfortunately, the existing gradient-based HO algorithms do not satisfy all the criteria simultaneously.
Criteria                             FMD   RMD   IFT
1. Scalable to hyperparam dim         x     o     o
2. Less or no short horizon bias      o     o     o
3. Constant memory cost               o     x     o
4. Online optimization                o     x     △
J. Lorraine, P. Vicol, D. Duvenaud, Optimizing Millions of Hyperparameters by Implicit Differentiation, AISTATS 2020
Limitations of Existing Grad-based HO Algs
Unfortunately, the existing gradient-based HO algorithms do not satisfy all the criteria simultaneously.
Criteria                             FMD   RMD   IFT   1-step
1. Scalable to hyperparam dim         x     o     o      o
2. Less or no short horizon bias      o     o     o      x
3. Constant memory cost               o     x     o      o
4. Online optimization                o     x     △      o
Shared
initialization
J. Luketina, M. Berglund, K. Greff, T. Raiko, Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters, ICML 2016
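Concretely, writing one inner step as $w_t = \Phi(w_{t-1}, \lambda)$, the 1-step (greedy) approximation keeps only the dependence of the most recent step on $\lambda$ (standard formulation; the notation is assumed here):

```latex
\frac{d\,\mathcal{L}^{\text{val}}(w_t)}{d\lambda}
\;\approx\;
\frac{\partial \mathcal{L}^{\text{val}}}{\partial \lambda}
+ \frac{\partial \mathcal{L}^{\text{val}}}{\partial w_t}\,
\frac{\partial \Phi(w_{t-1}, \lambda)}{\partial \lambda}
```

This costs a single JVP and can be applied online, but it ignores how the hyperparameter shaped the earlier part of the trajectory, hence the short-horizon bias.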
Goal of This Paper
This paper aims to overcome all the aforementioned limitations at the same time.
Criteria                             FMD   RMD   IFT   1-step   Ours
1. Scalable to hyperparam dim         x     o     o      o       o
2. Less or no short horizon bias      o     o     o      x       o
3. Constant memory cost               o     x     o      o       o
4. Online optimization                o     x     △      o       o
(Ours = Hypergradient distillation)
Hypergradient Distillation
Shared
initialization
Shared
initialization
Hypergradient Distillation
requires 2t – 1 JVP computations (e.g. RMD)
response Jacobian
hypergradient
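Spelled out, with $w_s = \Phi_s(w_{s-1}, \lambda)$ denoting the $s$-th inner step, the indirect term of the hypergradient at step $t$ unrolls as (standard unrolled-differentiation identity; the notation is assumed here):

```latex
\frac{\partial \mathcal{L}^{\text{val}}}{\partial w_t}\,
\underbrace{\frac{d w_t}{d\lambda}}_{\text{response Jacobian}}
=
\frac{\partial \mathcal{L}^{\text{val}}}{\partial w_t}
\sum_{s=1}^{t}
\Bigg(\prod_{k=s+1}^{t} \frac{\partial \Phi_k}{\partial w_{k-1}}\Bigg)
\frac{\partial \Phi_s}{\partial \lambda}
```

Evaluating this with reverse-mode products along the whole trajectory takes t products with ∂Φ_s/∂λ and t−1 with ∂Φ_k/∂w_{k−1}, i.e. the 2t − 1 JVP computations on the slide.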
Hypergradient Distillation
Requires only 1 JVP computation
But it suffers from short horizon bias
Shared
initialization
hypergradient response Jacobian
Hypergradient Distillation
2t – 1 JVP
hypergradient
Shared
initialization
a single JVP
distilled weight and dataset → hypergrad direction
scaling factor → hypergrad size
For each online HO step 𝑡, distill (𝑤1, 𝐷1), (𝑤2, 𝐷2), …, (𝑤𝑡, 𝐷𝑡) → 𝑤𝑡∗, 𝐷𝑡∗
• It does not require computing the actual 𝑔𝑡^𝑆𝑂.
• We only need to keep updating a moving average of 𝑤𝑡∗ and 𝐷𝑡∗.
• The scaling factor 𝜋∗ is also efficiently estimated with a function approximator.
• We can approximately solve the distillation problem efficiently.
Please read the main paper for the technical details!
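In schematic form (this is our paraphrase of the slide above, not the paper's exact objective), the distillation replaces the full indirect term by a single JVP evaluated at the distilled weight/dataset pair and rescaled by the learned factor:

```latex
\frac{\partial \mathcal{L}^{\text{val}}}{\partial w_t}\,
\frac{d w_t}{d\lambda}
\;\approx\;
\pi^{*}\,
\frac{\partial \mathcal{L}^{\text{val}}}{\partial w_t}\,
\frac{\partial \Phi(w_t^{*}, D_t^{*}, \lambda)}{\partial \lambda}
```

Here (𝑤𝑡∗, 𝐷𝑡∗) determines the direction of the hypergradient and 𝜋∗ its size, matching the annotations above; see the paper for the actual distillation objective and estimators.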
Experimental Setup
Meta-learning models
• ANIL (Raghu et al. ‘20), WarpGrad (Flennerhag et al. ‘20), Meta-Weight-Net (Shu et al. ‘19)
Task distribution
• CIFAR100 and tinyImageNet, 10-way 250-shot
Other details
• Inner-grad steps = 100
• Use Reptile for learning the shared initialization
A. Nichol, J. Achiam, J. Schulman, On First-Order Meta-Learning Algorithms, 2018
J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, D. Meng, Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting, NeurIPS 2019
Experimental Results
Q1. Does HyperDistill provide faster convergence?
Meta-training convergence (Test Loss)
Experimental Results
Q2. Does HyperDistill provide better generalization performance?
Meta-validation performance (Test Acc)
Meta-test performance (Test Acc)
Experimental Results
Q3. Is HyperDistill a reasonable approximation to the true hypergradient?
Cosine similarity to the true hypergradient
Q4. Is HyperDistill computationally efficient?
GPU memory consumption and wall-clock runtime
Conclusion
• The existing gradient-based HO algorithms do not satisfy the four criteria that should be met for
their practical use in meta-learning.
• In this paper, we showed that for each online HO step, it is possible to efficiently distill the whole
hypergradient indirect term into a single JVP, satisfying the four criteria simultaneously.
• Thanks to the accurate hypergradient approximation, HyperDistill improves meta-training convergence and meta-test performance in a computationally efficient manner.
github.com/haebeom-lee/hyperdistill