Joaquin Vanschoren (TU Eindhoven)
Meta-Learning
A tutorial
Cover art: MC Escher, Ustwo Games
Intro: humans can easily learn from a single example
Canna Indica ‘Picasso’ ?
later
train test
2
thanks to years of learning (and eons of evolution)
Can a computer learn from a single example?
e.g. ResNet50 (23 million trainable parameters), trained on ImageNet, tested on Canna Indica ‘Picasso’
That won’t work :) Humans also don’t start from scratch.
3
Transfer learning?
e.g. ResNet50 (23 million trainable parameters)
Source task: pretrain on ImageNet (14 million images); target task: finetune on Canna Indica ‘Picasso’
A single source task (e.g. ImageNet) may not generalize well to the test task
4
Meta-learning
Learn over a series (or distribution) of many different tasks/episodes (task 1, task 2, … task 1000, new task; e.g. Canna, Trumpet vine, …)
Inductive bias: learn assumptions that you can transfer to new tasks
Prepare yourself to learn new things faster
Useful in many real-life situations: rare events, test-time constraints, data collection costs, privacy issues, …
5
Inspired by human learning
We don’t transfer from a single source task; we learn across many, many tasks (year 1, year 2, year 3, year 4)
We have a ‘drive’ to explore new, challenging, but doable, fun tasks
6
Human-like Learning***
humans learn across tasks: less trial-and-error, less data, less compute
new tasks should be related to experience (doable, fun, interesting?)
key aspects of fast learning: compositionality, causality, learning to learn
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); an inductive bias is carried over]
Lake et al. 2017; Kemp et al. 2007
7
Inductive bias (in language)
Which assumptions do we make?
Training: dax, zup, lug, wif; lug blicket wif; wif blicket dax; lug kiki wif; wif kiki dax
Test: dax blicket zup? wif blicket dax kiki lug?
Common mistakes:
• one-to-one bias: assume that every word is one color
• concatenation bias: assume that order is always left-to-right
Lake et al. 2017
8
What if there is no training data?
We can still solve problems by making assumptions.
Item pool; test queries: dax zup? zup tufa? zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup?
Commonly used assumptions:
• one-to-one bias: assume that every word is one color (and not a function or something else)
• concatenation bias: assume that order is always left-to-right
• mutual exclusivity: if an object already has a name, it doesn’t need another name
These assumptions (inductive biases) are necessary for learning quickly. Humans assume that words have consistent meanings and follow input/output constraints.
Lake et al. 2017; Lake et al. 2019
9
Meta-learning inductive biases
Capture useful assumptions from the data, ones that often cannot easily be expressed manually.
[Diagram: a model is trained on Task 1, Task 2, …, carrying state across tasks, then applied to Tasknew; each task pairs words (dax, zup, lug, wif) with colors and poses queries such as ‘wif zup dax?’, ‘dax dax?’, ‘zup wif zup?’]
Lake et al. 2017; Lake et al. 2019
10
Meta-learning goal
Learn minimal inductive biases from prior tasks instead of constructing manual ones; they should still generalize well (otherwise you meta-overfit).
Inductive bias: any assumptions added to the training data to learn more effectively. E.g.:
• Instead of general model architectures, learn better architectures (and hyperparameters)
• Instead of starting from random weights, learn good initial weights
• Instead of a standard loss/reward function, learn a better loss/reward function
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning), with the learned inductive bias as prior]
11
What can we learn to learn? (3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); inductive bias is carried over]
Clune 2019
12
Terminology
Learner: model parameters ɸ, hyper-parameters λ, optimizer
Task T: a distribution of samples q(x), outputs y, and a loss ℒ(x,y); for reinforcement learning: an initial state q(x1) plus transitions q(xt+1 | xt, yt), with ℒ = (negative) reward
Model: fɸ(x) = y’
Training: ɸ* = argmaxɸ log p(ɸ | T) = argminɸ ℒ(fɸ,λ(x), y)
[Diagram: Tasks 1 and 2 each feed data (x,y) into a learning algorithm (with hyperparameters λ1, λ2) that updates ɸ via ∇ɸ to ɸ’1, ɸ’2, evaluated through ℒT1(fɸ1,λ(x), y), ℒT2]
14
Meta-learning algorithm g: meta-parameters 𝛳 (encode some inductive bias)
Learning hyperparameters (Hutter et al. 2019)
Closely related to Automated Machine Learning (AutoML), but: meta-learn how to design architectures/pipelines and tune hyperparameters λ. Human data scientists also learn from experience.
[Diagram: AutoML maps tasks 1..N to models and performance; self-learning AutoML carries a learned bias (priors, meta-knowledge, human priors) to a new task]
The search space can be huge!
Meta-learning for AutoML: how? (Vanschoren 2018)
(here, hyperparameters = architecture + hyperparameters; in each case, metadata (λ, scores) flows from learners on prior tasks to the learner on the new task)
1. Learning hyperparameter priors: simple hyperparameter spaces
2. Warm starting (what works on similar tasks?): start with good candidates instead of starting randomly
3. Meta-models (learn how to build models/components): complex hyperparameter spaces
Observation: current AutoML strongly depends on learned priors, across simple and complex hyperparameter spaces.
Manual architecture priors
autosklearn (Feurer et al. 2015), autoWEKA (Thornton et al. 2013), hyperopt-sklearn (Komer et al. 2014), AutoGluon-Tabular (Erickson et al. 2020)
• Most successful pipelines have a similar structure
• Can we meta-learn a prior over successful structures?
+ smaller search space
- you can’t learn entirely new architectures
[Figure: pipeline structure with ensembling/stacking; source: Feurer et al. 2015]
Manual architecture priors
Successful deep networks often have repeated motifs (cells), e.g. Inception v4.
Figure source: Szegedy et al. 2016
Google NASNet (Zoph et al. 2018)
Compositionality: learn hierarchical building blocks to simplify the task
• learn parameterized building blocks (cells)
• stack cells together in a macro-architecture
+ smaller search space
+ cells can be learned on a small dataset & transferred to a larger dataset
- strong domain priors, doesn’t generalize well
Can we meta-learn hierarchies / components that generalize better?
[Figure: cell search space and cell search space prior; source: Elsken et al., 2019]
Cell search space prior: a strong prior! If you constrain the search space enough, you can get SOTA results with random search!
Li & Talwalkar 2019; Yu et al. 2019; Real et al. 2019
Figure source: Li & Talwalkar, 2019
Manual priors: weight sharing
Weight-agnostic neural networks (Gaier & Ha, 2019)
• ALL weights are shared
• Only evolve the architecture?
• Minimal description length
• Baldwin effect?
Figure source: Gaier & Ha, 2019
Learning hyperparameter priors
[Diagram: a learner emits metadata (λ, scores), over simple to complex hyperparameter spaces]
Learn hyperparameter importance
• Functional ANOVA¹: select the hyperparameters that cause variance in the evaluations
• Useful to speed up black-box optimization techniques
Example: ResNets for image classification
¹ van Rijn & Hutter 2018
Figure source: van Rijn & Hutter, 2018
Learn defaults + hyperparameter importance
• Tunability¹²³: learn good defaults, and measure importance as the improvement gained via tuning (learned defaults, tuning risk)
¹ Probst et al. 2018; ² Weerts et al. 2018; ³ van Rijn et al. 2018
Bayesian Optimization (interlude) (Mockus, 1974)
• Start with a few (random) hyperparameter configurations
• Fit a surrogate model to predict the performance of other configurations
• Probabilistic regression: mean μ and standard deviation σ (blue band)
• Use an acquisition function to trade off exploration and exploitation, e.g. Expected Improvement (EI)
• Sample the best configuration under that function
[Figure: surrogate μ, μ + σ, μ − σ over values of λi vs. performance]
Bayesian Optimization
• Repeat until some stopping criterion: a fixed budget, convergence, or an EI threshold
• Theoretical guarantees (Srinivas et al. 2010, Freitas et al. 2012, Kawaguchi et al. 2016)
• Also works for non-convex, noisy data
• Used in AlphaGo
Figure source: Shahriari 2016
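To make the loop concrete, here is a minimal sketch of Bayesian optimization with a Gaussian-process surrogate and Expected Improvement, assuming a single hyperparameter on [0, 1] and a black-box `score` function to maximize (all names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, best):
    # EI(x) = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma
    mu, sigma = gp.predict(X_cand, return_std=True)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(score, n_init=3, n_iter=20):
    X = np.random.uniform(0, 1, size=(n_init, 1))      # initial (random) design
    y = np.array([score(x[0]) for x in X])
    for _ in range(n_iter):                            # fixed-budget stopping criterion
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = np.linspace(0, 1, 1000).reshape(-1, 1)  # candidate configurations
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        X = np.vstack([X, [x_next]])                   # evaluate and extend the design
        y = np.append(y, score(x_next[0]))
    return X[np.argmax(y)], y.max()

# Example: tune a fake 'learning rate' whose noisy score peaks around 0.3
best_x, best_y = bayes_opt(lambda lr: -(lr - 0.3) ** 2 + np.random.normal(0, 0.01))
```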
Learn basis expansions for hyperparameters (Perrone et al. 2018)
• Hyperparameters can interact in very non-linear ways
• Use a neural net to learn a suitable basis expansion φz(λ) for all tasks, trained on lots of prior data (e.g. OpenML)
• Bayesian linear surrogates on φz(λ) can then replace Gaussian-process surrogates, transferring information about the configuration space across tasks
Surrogate model transfer
• If task j is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up all Sj predictions, weighted by task similarity (as in active testing)¹
• Or build a combined Gaussian process, weighted by the current performance on the new task²: S = ∑j wj Sj
[Diagram: per prior task tj, a surrogate Sj is trained on (λi, Pi,j); the meta-learner combines S1 + S2 + S3 for the new task]
¹ Wistuba et al. 2018; ² Feurer et al. 2018
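A minimal sketch of the weighted-sum idea, assuming per-task surrogates with a scikit-learn-style `.predict`, and (as one plausible choice, not the papers' exact scheme) Spearman rank correlation on the new task's few observations as the similarity weight:

```python
import numpy as np
from scipy.stats import spearmanr

def task_weights(surrogates, X_new, y_new):
    # Weight each prior-task surrogate by how well it ranks the points
    # already evaluated on the new task (clipped at 0 to ignore bad surrogates).
    return [max(spearmanr(s.predict(X_new), y_new).correlation, 0.0) + 1e-9
            for s in surrogates]

def ensemble_predict(surrogates, weights, X):
    # S(X) = sum_j w_j * S_j(X), with weights normalized to sum to 1
    w = np.asarray(weights) / np.sum(weights)
    return sum(wj * s.predict(X) for wj, s in zip(w, surrogates))
```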
Surrogate model transfer (Manolache & Vanschoren 2019)
• Store a surrogate model Sij for every pair of prior task i and algorithm j
• Simpler surrogates, better transfer
• Learn a weighted ensemble -> significant speed-up in the optimization on the new task
Warm starting (what works on similar tasks?)
Start with good candidates instead of starting randomly, using metadata (λ, scores) from learners on prior tasks.
How to measure task similarity?
• Hand-designed (statistical) meta-features that describe (tabular) datasets¹
• Task2Vec: task embeddings for image data²
• Optimal transport: a similarity measure based on comparing probability distributions³
• Metadata embeddings based on the textual dataset description⁴
• Dataset2Vec: compares batches of datasets⁵
• Distribution-based invariant deep networks⁶
¹ Vanschoren 2018; ² Achille et al. 2019; ³ Alvarez-Melis et al. 2020; ⁴ Drori et al. 2019; ⁵ Jooma et al. 2020; ⁶ de Bie et al. 2020
Figure source: Alvarez-Melis et al. 2020
Warm-starting with kNN (Feurer et al. 2015)
• Find the k most similar tasks (via meta-features mj), warm-start the search with the best λi on those tasks
• Auto-sklearn: Bayesian optimization (SMAC), warm-started with λ1..k
• Meta-learning yields better models, faster
• Winner of the AutoML Challenges
[Diagram: a meta-learner picks the best λ1..k from prior tasks’ (λi, Pi,j) and seeds Bayesian optimization on the new task]
Figure source: Feurer et al., 2015
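A minimal warm-starting sketch, assuming a matrix of meta-features for prior tasks and the best configuration found on each (names are illustrative):

```python
import numpy as np

def warm_start_configs(mf_new, mf_prior, best_configs, k=5):
    """Return the best configurations of the k prior tasks whose
    meta-features are closest (Euclidean) to the new task's."""
    d = np.linalg.norm(mf_prior - mf_new, axis=1)
    return [best_configs[i] for i in np.argsort(d)[:k]]

# These k configurations replace the random initial design of the
# Bayesian-optimization loop sketched earlier.
```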
Probabilistic Matrix Factorization (Fusi et al. 2017)
• Collaborative filtering: configurations λi are ‘rated’ by tasks tj (a matrix of performances Pi,j)
• Learn a latent representation λL for tasks T and configurations λ
• Use meta-features to warm-start on the new task tnew (warm-started with λ1..k)
• Returns probabilistic predictions p(P | λLi) for Bayesian optimization
Figure source: Fusi et al., 2017
DARTS: Differentiable NAS (Liu et al. 2018)
• Fixed (one-shot) structure; learn which operators to use (e.g. convolution, max pooling, zero)
• Give every operator a weight αi
• Optimize the architecture weights αi and model weights ωj using bilevel optimization
• Approximate ωj*(αi) by adapting ωj after every training step: interleaved optimization of αi and ωj with SGD; finally, take the argmax over αi
Figure source: Liu et al., 2018
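A minimal sketch of DARTS's continuous relaxation for one edge, assuming a small candidate set (conv, max-pool, skip; the real DARTS set is larger and includes a true zero op):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One DARTS edge: a softmax-weighted sum over candidate operators."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # convolution
            nn.MaxPool2d(3, stride=1, padding=1),         # max pooling
            nn.Identity(),                                # skip connection
        ])
        # Architecture weights alpha: optimized in the outer (meta) loop,
        # while the ops' own weights omega are optimized in the inner loop.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)              # continuous relaxation
        return sum(wi * op(x) for wi, op in zip(w, self.ops))
```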
Warm-started DARTS (Grobelnik and Vanschoren, 2021)
• Warm-start DARTS with architectures that worked well on similar problems
• Slightly better performance, but much faster (5x)
Meta-models (learn how to build models/components)
Metadata (λ, scores) from learners on many prior tasks feeds a meta-model.
Algorithm selection models
• Learn a direct mapping between meta-features mj and performance Pi,j
• Zero-shot meta-models: predict the best λ given meta-features¹
• Ranking models: return a ranking λ1..k²
• Predict which algorithms / configurations to consider / tune³
• Predict the performance / runtime for a given configuration and task⁴
• Can be integrated in larger AutoML systems: warm-starting, guiding the search, …
¹ Brazdil et al. 2009, Lemke et al. 2015; ² Sun and Pfahringer 2013, Pinto et al. 2017; ³ Sanders and Giraud-Carrier 2017; ⁴ Yang et al. 2018
Learning model components
• Learn nonlinearities: RL-based search of a space of likely useful activation functions¹, e.g. Swish(x) = x / (1 + e^(−βx)) can outperform ReLU
• Learn optimizers: RL-based search of a space of likely useful update rules², e.g. PowerSign, e^(sign(g)·sign(m)) · g (g: gradient, m: moving average), can outperform Adam, RMSprop
• Learn acquisition functions for Bayesian optimization³
¹ Ramachandran et al. 2017; ² Bello et al. 2017; ³ Volpp et al. 2020
Figure source: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
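The two learned components above, written out as plain functions (β and the exact moving-average m are assumptions about the variants shown):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

def powersign_update(g, m):
    # PowerSign: scale the gradient up when its sign agrees with the
    # moving average of past gradients m, and down when it disagrees.
    return np.exp(np.sign(g) * np.sign(m)) * g
```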
Monte Carlo Tree Search + reinforcement learning
MOSAIC (Rakotoarison et al. 2019), AlphaD3M (Drori et al. 2019)
• Self-play: the game actions insert, delete, or replace components in a pipeline
• Monte Carlo Tree Search builds pipelines given the action probabilities
• With a grammar to avoid invalid pipelines
• A neural network (LSTM) predicts pipeline performance (and can be pre-trained on prior datasets)
Figure source: Drori et al., 2019
Meta-Reinforcement Learning for NAS (Gomez & Vanschoren, 2019)
Results on increasingly difficult tasks (omniglot, vgg_flower, dtd):
• Initially slower than DQN, but faster after a few tasks
• Policy entropy shows learning/re-learning
MetaNAS: MAML + Neural Architecture Search (Elsken et al., 2020)
• Combines gradient-based meta-learning (REPTILE) with NAS
• During meta-training, it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳
• During meta-testing, the architecture can be adapted to the novel task through gradient descent
Figure source: Elsken et al., 2020
What can we learn to learn? (3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); inductive bias is carried over]
Clune 2019
43
Terminology (reminder)
Learner: model parameters ɸ, hyper-parameters λ, optimizer
Task T: a distribution of samples q(x), outputs y, and a loss ℒ(x,y); for reinforcement learning: an initial state q(x1) plus transitions q(xt+1 | xt, yt), with ℒ = (negative) reward
Model: fɸ(x) = y’
Training: ɸ* = argmaxɸ log p(ɸ | T) = argminɸ ℒ(fɸ,λ(x), y)
44
Meta-learning algorithm g: meta-parameters 𝛳 (encode some inductive bias)
Strategy 1: bilevel optimization
Parameterize some aspect of the learner that we want to learn as meta-parameters 𝛳, and meta-learn 𝛳 across tasks (x,y drawn from Task 1, 2, 3, … and Tasknew):
ɸi* = argmaxɸi log p(ɸi | Ti, 𝛳), and at meta-test time p(ɸnew | Tnew, 𝛳*)
Bilevel optimization: 𝛳* = argmax𝛳 log p(𝛳 | Ti) so that ɸi* = argmaxɸ log p(ɸ | Ti, 𝛳*)
𝛳 (the prior) could encode an initialization ɸ, the hyperparameters λ, the optimizer, …
The learned 𝛳* should learn Tnew from a small amount of data, yet generalize to a large number of tasks; it allows fine-tuning on Tnew.
45
Meta-learning with bilevel optimization
Every task Ti drawn from the task distribution p(T) is a meta-training example.
Inner loop (training): on each task, adapt the meta-parameters 𝛳 to task parameters ɸ’i on the task’s training data (x,y) and measure the task loss ℒTi(fɸi(x)) (the meta-train error).
Outer loop (meta-training): a meta-optimizer Fmeta updates 𝛳’ to minimize the meta-loss across tasks:
𝛳* = argmax𝛳 log p(𝛳 | T) = argmin𝛳 ℒmeta(ℒT1, ℒT2, …)
Meta-test: adapt (finetune) 𝛳* on Tasknew to obtain ɸ’new, and report the meta-test error (in expectation over new tasks, ETnew ℒmeta).
46
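Schematically, the two loops look as follows. This is a generic sketch: the `adapt`, `meta_loss`, and `meta_optimizer` callables are placeholders that concrete methods such as MAML instantiate, as shown later.

```python
def meta_train(theta, task_sampler, adapt, meta_loss, meta_optimizer, steps=1000):
    for _ in range(steps):
        losses = []
        for task in task_sampler():                     # tasks drawn from p(T)
            phi_i = adapt(theta, task.train)            # inner loop: train on D_i,train
            losses.append(meta_loss(phi_i, task.test))  # meta-train error on D_i,test
        theta = meta_optimizer(theta, losses)           # outer loop: update meta-parameters
    return theta                                        # theta*, later finetuned on T_new
```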
Strategy 2: black-box models
A black-box meta-model g predicts the task parameters ɸi directly from the task’s training data: ɸi = g(Di,train), with 𝛳 hidden inside g, e.g. a hypernetwork whose input embedding is learned across tasks; g can also be stateful, with memory (e.g. an RNN).
Meta-training: sample a task i from p(T), compute ɸi = g(Di,train), evaluate ℒ(ɸi, Di,test), and update 𝛳’ by the gradient ∇ of the meta-train error/reward.
Meta-test: ɸnew = g(Dnew,train), evaluated on Dnew,test.
Amortized models: no training at meta-test time, feed-forward only (learning fast by learning slow).
47
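A minimal sketch of such a meta-model: a permutation-invariant set encoder that maps a training set directly to (a vector of) task parameters ɸ. The layer sizes and mean-pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BlackBoxMetaModel(nn.Module):
    """phi = g(D_train): map a whole training set to task parameters."""
    def __init__(self, in_dim, phi_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, phi_dim)

    def forward(self, d_train):                 # d_train: (n_examples, in_dim)
        h = self.encoder(d_train).mean(dim=0)   # permutation-invariant pooling
        return self.head(h)                     # phi, used to parameterize f_phi
```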
Example: few-shot classification (2-way, 3-shot) (Vinyals et al. 2016; Triantafillou et al. 2019)
Meta-train: n train tasks, each with a Dtrain (support set) and a Dtest (query set); meta-test: held-out test tasks, again with Dtrain and Dtest.
Bilevel: on every train task, the learner adapts 𝛳 to task parameters ɸ on Dtrain; the meta-learner updates 𝛳’ from ℒmeta(ℒT1, ℒT2, …) over the task losses ℒ(ɸ, Dtest). At meta-test time, 𝛳* (which may encode an initial ɸ, an optimizer, a metric, …) is finetuned on the test task’s Dtrain and evaluated via ℒ(ɸ, Dtest).
48
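For concreteness, a minimal N-way k-shot episode sampler matching the 2-way 3-shot setup in the figure (assuming data is pre-grouped per class; names are illustrative):

```python
import random

def sample_episode(data_by_class, n_way=2, k_shot=3, n_query=2):
    """Build one few-shot task: a support set and a query set."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):          # relabel classes 0..n_way-1
        examples = random.sample(data_by_class[c], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query                        # D_train (support), D_test (query)
```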
Example: few-shot classification, black box
On every train task, the meta-learner g maps the support set Dtrain directly to model parameters ɸ; 𝛳’ is updated from ℒmeta(ℒT1, ℒT2, …) over the tasks’ ℒ(ɸ, Dtest). At meta-test time, g produces ɸ* for each test task from its Dtrain, without further training.
49
Example: meta-reinforcement learning (Yu et al. 2019)
Bilevel: the inner loop trains an RL agent over episodes of each train task (initial state: randomized object and goal positions); the outer loop (meta-training) updates 𝛳’ from Rmeta(RT1, RT2, …) via meta-policy gradients. 𝛳* can encode a fixed objective, e.g. which ɸ to learn (a prior), with the effective policy 𝝅ɸ modulated by ɸ. At meta-test time, 𝛳* is finetuned on episodes of the test tasks (other initial states or related tasks) and evaluated via R(𝝅ɸ, Dtest).
50
Example: meta-reinforcement learning, black box
A (fast) RL agent implemented as an RNN learns within each train task, over train episodes and a few test episodes; the meta-RL algorithm (the slow outer loop) updates the RNN weights 𝛳’ across tasks from Rmeta(RT1, RT2, …). On test tasks (other initial states or related tasks), the RNN adapts in its hidden state ɸ and is evaluated via R(𝝅ɸ, Dtest).
51
Taxonomy of meta-learning methods (Hospedales et al. 2020)
Like base-learners, meta-learners consist of a representation, an objective, and an optimizer:
• meta-parameters 𝛳 (the representation): a parameter initialization ɸ0 (learned instead of random), an optimizer (a learned update rule instead of plain SGD/back-propagation), a black-box model, an embedding (metric learning), hyperparameters λ, an architecture, a loss ℒ / reward R, …
• meta-objective ℒmeta: many vs few shots, multi vs single task, fast adaptation vs final performance, offline vs online
• meta-optimizer Fmeta: gradient descent, reinforcement learning, evolution
We won’t exhaustively discuss all existing combinations; let’s focus on understanding a few key strategies (Huisman et al. 2020).
54
Gradient-based methods: learning ɸinit
• 𝛳 (the prior) is the model initialization: ɸinit = 𝛳 (a learned ɸ0 instead of a random ɸ0)
• Learn a representation suitable for many tasks drawn from p(T) (e.g. a pretrained CNN), maximizing rapid learning
• Each meta-train task i yields task-adapted parameters ɸ’i through an update algorithm u: ɸi = u(𝛳, Di,train), trained in the inner loop via ∇ɸ
• The outer loop updates 𝛳’ to minimize ℒmeta = ∑i ℒ(ɸi, Di,test)
• Meta-test: finetune 𝛳* on Tasknew (in a few steps): ɸ’new = u(𝛳*, Dnew,train)
Can be seen as bilevel optimization: 𝛳* = argmin𝛳 ∑i ℒi(ɸi, Di,test) with ɸi = u(𝛳, Di,train)
55
Model agnostic meta-learning (MAML) (Finn et al. 2017)
Meta-training:
• Given the current initialization 𝛳 and model f, sample i tasks; on each, perform k SGD steps to find ɸi*, then evaluate ∇θℒi(fɸi*) (the figure shows i=3, k=3: θ1 → θ2 → θ3 = ɸi*)
• Update the task-specific parameters: ϕi = θ − α∇θℒi(fϕi*)
• Update 𝛳 to minimize the sum of per-task test-set losses, and repeat:
θ ← θ − β∇θ ∑i ℒi(fϕi) = θ − β∇θ ∑i ℒi(f(θ − α∇θℒi(fϕi*)))
The meta-gradient is a second-order gradient (the derivative of the test-set loss): it computes how changes in 𝛳 affect the gradient at the new 𝛳, and is backpropagated through the inner updates. α, β are learning rates.
Meta-testing: given the training data Dtrain of a new task and the pre-trained parameters 𝛳*, finetune: ϕ = θ* − α∇θℒ(fθ)
56
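A minimal MAML sketch in PyTorch, assuming sine-wave regression tasks, a tiny functional network, and a single inner SGD step (all task and network details are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def sample_task():
    """Hypothetical task family: regress y = A*sin(x + p) with random A, p."""
    A, p = torch.rand(1) * 4.9 + 0.1, torch.rand(1) * 3.14
    def draw(n=10):
        x = torch.rand(n, 1) * 10 - 5
        return x, A * torch.sin(x + p)
    return draw

def net(params, x):
    """A tiny functional 1-40-1 MLP, so adapted parameters can be evaluated."""
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

params = [(0.1 * torch.randn(1, 40)).requires_grad_(),
          torch.zeros(40, requires_grad=True),
          (0.1 * torch.randn(40, 1)).requires_grad_(),
          torch.zeros(1, requires_grad=True)]
meta_opt = torch.optim.Adam(params, lr=1e-3)  # outer loop (learning rate beta)
alpha = 0.01                                  # inner-loop learning rate

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                        # meta-batch of tasks
        draw = sample_task()
        (x_tr, y_tr), (x_te, y_te) = draw(), draw()        # D_train, D_test
        grads = torch.autograd.grad(F.mse_loss(net(params, x_tr), y_tr),
                                    params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]  # phi_i
        F.mse_loss(net(adapted, x_te), y_te).backward()    # 2nd-order meta-gradient
    meta_opt.step()                           # update the initialization theta
```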
Model agnostic meta-learning (MAML)
• Example for reinforcement learning: the goal is to reach a certain velocity in a certain direction
[Figure: adaptation curves for oracle, MAML, and pretrained]
Source: Finn et al. 2017
57
Other gradient-based techniques
Changing the update rule yields different variants; all use second-order gradients and meta-learn a transformation of the gradient for better adaptation (w: a weight per parameter):
• MAML¹⁶: ϕi = θ − α∇θℒi(fϕi*), with θ ← θ − β∇θ ∑i ℒi(fϕi)
• MetaSGD²: ϕi = θ − α diag(w)∇θℒi(fϕi*)
• Tnet³: ϕi = θ − α∇θℒi(fϕi*, w)
• Meta-curvature⁴: ϕi = θ − α B(θ, w)∇θℒi(fϕi*)
• WarpGrad⁵: ϕi = θ − α P(θ, ϕi)∇θℒi(fϕi*)
• Online MAML (Follow the Meta Leader)⁷: minimizes regret; robust, but the computation costs grow over time
¹ Finn et al. 2017; ² Li et al. 2017; ³ Lee et al. 2018; ⁴ Park et al. 2019; ⁵ Flennerhag et al. 2019; ⁶ Antoniou et al. 2019; ⁷ Finn et al. 2019
Figure source: Flennerhag et al. 2019 (MAML vs WarpGrad)
58
Scalability
• Backpropagating derivatives of ɸi w.r.t. 𝛳 is compute- and memory-intensive (for large models)
• First-order approximations of MAML:
• First-order MAML (FOMAML)¹ uses only the last inner gradient update: θ ← θ − β∇ ∑i ℒi(fϕi*)
• Reptile²: iterate over tasks, run SGD to ϕi*, and update 𝛳 in the direction of ϕi* − θ: θ ← θ − β(ϕi* − θ)
• Implicit gradients offer another approximation³
[Figure: inner SGD trajectory θ1 → θ2 → θ3 → θ4 = ϕi*, with the FOMAML and Reptile update directions; single-task FOMAML meta-gradient ∇θ ∑i ℒi(ϕi)]
¹ Finn et al. 2017; ² Nichol et al. 2018; ³ Rajeswaran et al. 2019
59
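A minimal Reptile sketch, reusing the `net`, `params`, and `sample_task` from the MAML sketch above; no second-order gradients are needed:

```python
import torch
import torch.nn.functional as F

beta, alpha, k = 0.1, 0.01, 5           # outer step, inner step, inner iterations

for step in range(1000):
    draw = sample_task()
    x, y = draw()
    phi = [p.detach().clone().requires_grad_() for p in params]
    for _ in range(k):                  # k plain SGD steps on one task
        loss = F.mse_loss(net(phi, x), y)
        grads = torch.autograd.grad(loss, phi)
        phi = [(p - alpha * g).detach().requires_grad_() for p, g in zip(phi, grads)]
    with torch.no_grad():               # theta <- theta + beta * (phi* - theta)
        for p, f_ in zip(params, phi):
            p += beta * (f_ - p)
```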
Generalizability
• MAML is more resilient to overfitting than many other meta-learning techniques¹
• Its effectiveness seems mainly due to feature reuse (finding a good 𝛳)²: fine-tuning only the last layer is equally good
• On few-shot learning benchmarks, a good embedding outperforms most meta-learning³: learn a representation on the entire meta-dataset (merged into a single task), then train a linear classifier on the embedded few-shot Dtrain and predict Dtest
• For meta-RL, also learn how to explore (how to sample new environments): E-MAML adds exploration to the meta-objective (allowing longer-term goals)⁴
¹ Finn et al. 2018; ² Raghu et al. 2020; ³ Tian et al. 2020; ⁴ Stadie et al. 2019
Figure source: Tian et al. 2020
60
Bayesian meta-learning (Grant et al. 2018)
Can meta-learning reason about uncertainty in the task distribution?
Point estimates: on meta-train tasks Ti = {Dtrain, Dtest}, a meta-model gθ produces ϕ ← gθ((x, y)train), predictions ŷtest ← fϕ(xtest), and updates θ ← θ − ∇θℒ(ytest); at meta-test (finetuning) on Tj, ϕj ← gθ((x, y)train) and ytest ← fϕj(xtest).
Bayesian view: keep distributions instead of point estimates: p(ϕi | (x, y)train) per task and a posterior p(θ) ← p(θ | (x, y)train) over meta-parameters; at meta-test, p(ϕj | (x, y)train, θ) yields a predictive distribution p(ytest | xtest).
61
Fully Bayesian meta-learning
• Use approximation methods to represent the uncertainty over ϕi:
• Sampling techniques + variational inference: PLATIPUS¹, BMAML², ABML³
• Laplace approximation: LLAMA⁴
• Variational approximation of the posterior: Neural Statistician⁵, Neural Processes⁶
• What if our tasks are not IID? Impose additional structure, e.g. with a task-specific variable z⁷ (mini-ImageNet with filters for non-homogeneous tasks)
¹ Finn et al. 2019; ² Kim et al. 2018; ³ Ravi et al. 2019; ⁴ Grant et al. 2018; ⁵ Edwards et al. 2017; ⁶ Garnelo et al. 2018; ⁷ Jerfel et al. 2019
Figure source: Jerfel et al. 2019
62
Meta-learning optimizers
• Our brains probably don’t do backprop¹; instead:
• Fast weights²: networks that continuously modify the weights of another network
• Gradient-based³⁴: parameterize the update rule with a neural network gθ, e.g. ϕ’i = ϕi − η gθ(∇ℒ), and learn the meta-parameters θ across tasks by gradient descent (the meta-gradient)
• Represent the update rule as an LSTM⁴⁵⁶ or a hierarchical RNN⁷
¹ Bengio et al. 1995; ² Schmidhuber 1992; ³ Runarsson and Jonsson 2000; ⁴ Hochreiter 2001; ⁵ Andrychowicz et al. 2016; ⁶ Ravi and Larochelle 2017; ⁷ Wichrowska et al. 2017
63
Meta-learning optimizers (continued)
• Meta-learned (RNN) optimizers ‘rediscover’ momentum, gradient clipping, learning rate schedules, learning rate adaptation, …¹
• RL-based optimizers: represent updates as a policy, learned using guided policy search²
• Combined with MAML: learn per-parameter learning rates³⁴, or learn a preconditioning matrix (to ‘warp’ the loss surface)⁵⁶
• Black-box optimizers: meta-learned with an RNN⁷, or with user-defined priors⁸
• Speed up backpropagation by meta-learning sparsity and weight sharing⁹
¹ Maheswaranathan et al. 2020; ² Ke and Malik 2016; ³ Li et al. 2017; ⁴ Antoniou et al. 2019; ⁵ Park et al. 2019; ⁶ Flennerhag et al. 2019; ⁷ Chen et al. 2017; ⁸ Souza et al. 2020; ⁹ Kirsch et al. 2020
Figure source: Maheswaranathan et al. 2020 (learning rate schedules, learning rate adaptation)
64
Metric learning (Vinyals et al. 2016; Sung et al. 2018)
Learn an embedding network g𝛳 that transforms the data {Dtrain, Dtest} across all tasks to a representation that allows easy similarity comparison.
Can be seen as a simple black-box meta-model; often non-parametric (independent of ɸ).
65
Prototypical networks (Snell et al. 2017)
• Use an embedding function fθ to encode each data point
• Define a prototype vc for every class c based on the training examples D(c)train of that class:
vc = 1/|D(c)train| ∑(xi,yi)∈D(c)train fθ(xi)
• The class distribution for an input x is based on the inverse distance between x and the prototypes:
p(y = c | x) = softmax(−dϕ(fθ(x), vc))
• The distance function dϕ can be any differentiable distance, e.g. the squared Euclidean distance
• Loss function to learn the embedding: ℒ(θ) = −log pθ(y = c | x)
Figure source: Snell et al. 2017
66
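A minimal episode loss for prototypical networks in PyTorch, assuming an embedding module `f_theta` and integer class labels 0..n_classes−1 (squared Euclidean distance, as above):

```python
import torch
import torch.nn.functional as F

def proto_loss(f_theta, x_support, y_support, x_query, y_query, n_classes):
    z_s, z_q = f_theta(x_support), f_theta(x_query)        # embed both sets
    protos = torch.stack([z_s[y_support == c].mean(dim=0)  # prototypes v_c
                          for c in range(n_classes)])
    dists = torch.cdist(z_q, protos) ** 2                  # squared Euclidean
    log_p = F.log_softmax(-dists, dim=1)                   # p(y=c|x) from -distance
    return F.nll_loss(log_p, y_query)                      # -log p(y=c|x)
```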
Metric learning
Quite a few other techniques exist:
• Siamese neural networks¹
• Graph neural networks²: also applicable to semi-supervised and active learning
• Attentive Recurrent Comparators³: compare inputs not as a whole but by parts (e.g. image patches)
• MetaOptNet⁴: learns embeddings such that linear models can distinguish between classes
Overall: fast at test time, although pair-wise comparisons limit the task size; mostly limited to few-shot supervised tasks; fails when test tasks are more distant, since there is no way to adapt.
¹ Koch et al. 2015; ² Garcia et al. 2017; ³ Shyam et al. 2017; ⁴ Lee et al. 2019
67
Black-box model for meta-RL (Wang et al. 2016; Duan et al. 2018)
A (fast) RL agent implemented as an RNN learns within each task over train and (few) test episodes; the meta-RL algorithm (the outer loop) updates the RNN weights 𝛳’ across training tasks from Rmeta(RT1, RT2, …), and is evaluated on test tasks via R(𝝅ɸ, Dtest).
• RNNs serve as dynamic task-embedding storage: previous inputs are stored in the hidden state
• Maximize the expected reward in each trial
• Very expressive, and performs very well on short tasks
• Longer horizons are an open challenge
Image source: Duan et al. 2018
68
Other black-box models
• Memory-augmented NNs¹: use neural Turing machines (short-term + long-term memory)
• Meta Networks²: a meta-learner that returns ‘fast weights’ for itself and for the base network solving the task
• Simple Neural Attentive Meta-Learner (SNAIL)³: aims to overcome the memory limitations of RNNs with a series of 1D convolutions
¹ Santoro et al. 2016; ² Munkhdalai et al. 2017; ³ Mishra et al. 2018
Image sources: Munkhdalai et al. 2017 (Meta Networks), Mishra et al. 2018 (SNAIL)
69
What can we learn to learn? (3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); inductive bias is carried over]
Clune 2019
70
• Ultimately, meta-learning translates constraints on the learner into constraints on the data: the biases we don’t put in manually have to be learnable from the data
• Can we automatically create new tasks to inform and challenge our meta-learners?
• Paired open-ended trailblazer (POET): evolves a parameterized environment 𝛳E for an agent 𝛳A (Wang et al. 2019; Wang et al. 2020)
• Select agents that can solve the challenges AND evolve environments so that they remain solvable
• Training task acquisition: the best-performing agents move on, which creates multiple overlapping curricula
Figure source: Wang et al. 2019
71
Does POET scale? Increasingly difficult 3D terrain, 18 degrees of freedom. (Zhou & Vanschoren, 2021)
72
Generative Teaching Networks (Such et al. 2019)
• Based on an existing dataset, generate synthetic training data for more efficient training
• Like dataset distillation, but uses meta-learning to update the generator model
• While POET has limited expressivity (limited to 𝛳E), GTNs could produce all sorts of training datasets and environments
• While being careful not to generate noise: the generator needs some grounding in reality
73
Thank you
Image source: Pesah et al. 2018 74

Meta learning tutorial

  • 1.
    Joaquin Vanschoren (TUEindhoven) Meta-Learning A tutorial Cover art: MC Escher, Ustwo Games
  • 2.
    Intro: humans caneasily learn from a single example 2 thanks to years of learning (and eons of evolution)
  • 3.
    Intro: humans caneasily learn from a single example 2 thanks to years of learning (and eons of evolution)
  • 4.
    Intro: humans caneasily learn from a single example Canna Indica ‘Picasso’ 2 thanks to years of learning (and eons of evolution)
  • 5.
    Intro: humans caneasily learn from a single example Canna Indica ‘Picasso’ ? later train test 2 thanks to years of learning (and eons of evolution)
  • 6.
    Can a computerlearn from a single example? e.g. ResNet50 (23 million trainable parameters) ImageNet Canna Indica `Picasso’ That won’t work :) Humans also don’t start from scratch. 3
  • 7.
    Can a computerlearn from a single example? e.g. ResNet50 (23 million trainable parameters) ImageNet Canna Indica `Picasso’ That won’t work :) Humans also don’t start from scratch. 3
  • 8.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Canna Indica `Picasso’ Source task Target task 4
  • 9.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Canna Indica `Picasso’ Source task Target task 4
  • 10.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Finetune Canna Indica `Picasso’ Source task Target task 4
  • 11.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Finetune A single source task (e.g. ImageNet) may not generalize well to the test task Canna Indica `Picasso’ Source task Target task 4
  • 12.
    Meta-learning Learn over aseries (or distribution) of many di ff erent tasks/episodes Inductive bias: learn assumptions that you can to transfer to new tasks Prepare yourself to learn new things faster task 1 task 2 task 1000 new task Canna Trumpet vine … inductive bias 5
  • 13.
    Meta-learning Learn over aseries (or distribution) of many di ff erent tasks/episodes Inductive bias: learn assumptions that you can to transfer to new tasks Prepare yourself to learn new things faster task 1 task 2 task 1000 new task Canna Trumpet vine … Useful in many real-life situations: rare events, test-time constraints, data collection costs, privacy issues,… inductive bias 5
  • 14.
    Inspired by humanlearning We don’t transfer from a single source task, we learn across many, many tasks year 1 year 2 year 3 year 4 We have a‘drive’to explore new, challenging, but doable, fun tasks 6
  • 15.
    Human-like Learning*** Task 1 Learning humanslearn across tasks: less trial-and-error, less data, less compute new tasks should be related to experience (doable, fun, interesting?) Task 2 Learning Task 3 Learning new tasks (of own choosing) Tasknew Learning experience Lake et al. 2017 key aspects of fast learning: compositionality, causality, learning to learn Kemp et al. 2007 inductive bias 7
  • 16.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? 8 which assumptions do we make?
  • 17.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? 8 which assumptions do we make?
  • 18.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? 8 which assumptions do we make?
  • 19.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? Common mistakes one-to-one bias: assume that every word is one color 8 which assumptions do we make?
  • 20.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? Common mistakes one-to-one bias: assume that every word is one color concatenation bias: assume that order is always left-to-right 8 which assumptions do we make?
  • 21.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 22.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 23.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 24.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 25.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 26.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 27.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 28.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 29.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) zup? zup zup? blicket wif zup? zup dax / Lake et al. 2019 9
  • 30.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) concatenation bias: assume that order is always left-to-right zup? zup zup? blicket wif zup? zup dax / zup dax Lake et al. 2019 9
  • 31.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) concatenation bias: assume that order is always left-to-right zup? zup zup? blicket wif zup? zup dax / zup dax mutual exclusivity: if object has a name, it doesn’t need another name zup dax Lake et al. 2019 9
  • 32.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) concatenation bias: assume that order is always left-to-right zup? zup zup? blicket wif zup? zup dax / zup dax mutual exclusivity: if object has a name, it doesn’t need another name zup dax These assumptions (inductive biases) are necessary for learning quickly Humans assume that words have consistent meanings and follow input/output constraints Lake et al. 2019 9
  • 33.
    Meta-learning inductive biases Task1 model Task 2 Tasknew Lake et al. 2017 Capture useful assumptions from the data - that can often not be easily expressed dax zup lug wif wif zup dax? lug dax lug zup? dax wif lug? wif wif dax? dax zup lug wif dax dax? wif lug dax? wif lug lug dax? zup zup zup? state model state dax zup lug wif dax dax? lug lug dax dax? wif lug lug dax? zup wif zup? Lake et al. 2019 10 model
  • 34.
    Meta-learning goal Task 1 Learning learnminimal inductive biases from prior tasks instead of constructing manual ones should still generalize well (otherwise you meta-over fi t) Task 2 Learning Task 3 Learning Tasknew Learning Inductive bias: any assumptions added to training data to learn more e ff ectively. E.g: • Instead of general model architectures, learn better architectures (and hyperparameters) • Instead of starting from random weights, learn good initial weights • Instead of standard loss/reward function, learn a better loss/reward function (prior) 11 inductive bias
  • 35.
    Task 1 Learning Task 2 Learning Task3 Learning Tasknew Learning What can we learn to learn? 1. architectures / pipelines (hyper-parameters) Clune 2019 part 1 new tasks (of own choosing) experience 2. learning algorithms (model parameters) 3. learning environments (task generation) inductive bias 12 part 2 part 3 3 pillars
  • 36.
    Task 1 Learning Task 2 Learning Task3 Learning Tasknew Learning What can we learn to learn? 1. architectures / pipelines (hyper-parameters) Clune 2019 part 1 new tasks (of own choosing) experience 2. learning algorithms (model parameters) 3. learning environments (task generation) inductive bias 13 part 2 part 3 3 pillars
  • 37.
    Task 1 Model Learning algorithm ɸ* =argmaxɸ log p(ɸ | T ) Task: distribution of samples q(x) outputs y, loss ℒ(x,y ) Model: fɸ(x) = y’ Task 2 Model Learning algorithm !T1(fɸ1,λ(x),y) !T2 x,y x,y training ∇ɸ ɸ’1 ɸ’2 = argminɸ ℒ(fɸ,λ(x),y ) fɸ1(x) fɸ2(x) for reinforcement learning: q(x1) + transition q(xt+1| xt, yt ) ℒ = (neg) reward Terminology Learner: model parameters ɸ, hyper-parameters λ optimizer λ1 λ2 14
  • 38.
    Task 1 Model Learning algorithm ɸ* =argmaxɸ log p(ɸ | T ) Task: distribution of samples q(x) outputs y, loss ℒ(x,y ) Model: fɸ(x) = y’ Task 2 Model Learning algorithm !T1(fɸ1,λ(x),y) !T2 x,y x,y training ∇ɸ ɸ’1 ɸ’2 = argminɸ ℒ(fɸ,λ(x),y ) fɸ1(x) fɸ2(x) for reinforcement learning: q(x1) + transition q(xt+1| xt, yt ) ℒ = (neg) reward Terminology Learner: model parameters ɸ, hyper-parameters λ optimizer λ1 λ2 14 Meta-learning algorithm g meta-parameters 𝛳 (encode some inductive bias)
  • 39.
    Task 1..N Models performance AutoMLModels Models Closely related to Automated Machine Learning (AutoML) But: meta-learn how to design architectures/pipelines and tune hyperparameters Human data scientists also learn from experience Models Models Models Learning hyperparameters Hutter et al. 2019 λ New task Models performance self-learning AutoML Models λ bias (priors, meta-knowledge, human priors) Search space can be huge!
  • 40.
    Meta-learning for AutoML:how? Learning hyperparameter priors Warm starting (what works on similar tasks?) start randomly start with good candidates Meta-models (learn how to build models/components) Complex hyperparameter space Simple hyperparameter space Vanschoren 2018 Task λ, scores λ, scores Learner Learner metadata hyperparameters = architecture + hyperparameters Task λ, scores Learner metadata Task Task
  • 41.
    Observation: current AutoML stronglydepends on learned priors Complex hyperparameter space Simple hyperparameter space observation
  • 42.
    Manual architecture priors •Most successful pipelines have a similar structure • Can we meta-learn a prior over successful structures? Ensembling/ stacking Figure source: Feurer et al. 2015
  • 43.
    Manual architecture priors •Most successful pipelines have a similar structure • Can we meta-learn a prior over successful structures? + smaller search space - you can’t learn entirely new architectures Ensembling/ stacking Figure source: Feurer et al. 2015
  • 44.
    Manual architecture priors autosklearnFeurer et al. 2015 autoWEKA Thornton et al. 2013 hyperopt-sklearn Komer et al. 2014 AutoGluon-Tabular Erickson et al. 2020 • Most successful pipelines have a similar structure • Can we meta-learn a prior over successful structures? + smaller search space - you can’t learn entirely new architectures Ensembling/ stacking Figure source: Feurer et al. 2015
  • 45.
    Successful deep networksoften have repeated motifs (cells) e.g. Inception v4: Manual architecture priors Szegedy Figure source: Szegedy et al 2016
  • 46.
    Google NASNet Zophet al 2018 Compositionality: learn hierarchical building blocks to simplify the task •learn parameterized building blocks (cells) •stack cells together in macro- architecture + smaller search space + cells can be learned on a small dataset & transferred to a larger dataset -strong domain priors, doesn’t generalize well Cell search space Cell search space prior Figure source: Elsken et al., 2019
  • 47.
    Google NASNet Zophet al 2018 Compositionality: learn hierarchical building blocks to simplify the task •learn parameterized building blocks (cells) •stack cells together in macro- architecture + smaller search space + cells can be learned on a small dataset & transferred to a larger dataset -strong domain priors, doesn’t generalize well Cell search space Can we meta-learn hierarchies / components that generalize better? Cell search space prior Figure source: Elsken et al., 2019
  • 48.
    If you constrainthe search space enough, you can get SOTA results with random search! Li & Talwalkar 2019 Yu et al. 2019 Real et al. 2019 Cell search space prior • Strong prior! Figure source: Li & Talwalkar., 2019
  • 49.
    Weight-agnostic neural networks •ALL weights are shared • Only evolve the architecture? • Minimal description length • Baldwin e ff ect? Manual priors: Weight sharing Gaier & Ha, 2019 Figure source: Gaier & HA., 2019
  • 50.
    Learning hyperparameter priors Complex hyperparameter space Simple hyperparameter space λ, scores Learner
  • 51.
    1 van Rijn& Hutter 2018 ResNets for image classi fi cation • Functional ANOVA 1 • Select hyperparameters that cause variance in the evaluations. • Useful to speed up black-box optimization techniques Learn hyperparameter importance Figure source: van Rijn & Hutter, 2018
  • 52.
    • Tunability 1,2,3 Learngood defaults, measure importance as improvement via tuning 1 Probst et al. 2018 2 Weerts et al. 2018 3 van Rijn et al. 2018 Learn defaults + hyperparameter importance Learned defaults Tuning risk
  • 53.
    • Start witha few (random) hyperparameter con fi gurations • Fit a surrogate model to predict other con fi gurations • Probabilistic regression: mean and standard deviation (blue band) • Use an acquisition function to trade o ff exploration and exploitation, e.g. Expected Improvement (EI) • Sample for the best con fi guration under that function μ σ Bayesian Optimization (interlude) Mockus, 1974 μ μ + σ μ − σ value of λi performance
  • 54.
    • Repeat untilsome stopping criterion: • Fixed budget • Convergence • EI threshold • Theoretical guarantees • Also works for non-convex, noisy data • Used in AlphaGo Srinivas et al. 2010, Freitas et al. 2012, Kawaguchi et al. 2016 Bayesian Optimization Figure source: Shahriari 2016
  • 55.
Learn basis expansions for hyperparameters (Perrone et al. 2018)
• Hyperparameters can interact in very non-linear ways
• Use a neural net to learn a suitable basis expansion φz(λ) for all tasks, trained on lots of prior (λ, score) data (e.g. from OpenML)
• On top of φz(λ) you can use Bayesian linear surrogates instead of Gaussian process surrogates; this transfers information on the configuration space
Surrogate model transfer
• If task tj is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up the predictions of all Sj, weighted by task similarity (as in active testing) 1: S = Σj wj Sj
• Or build a combined Gaussian process, weighted by current performance on the new task 2
1 Wistuba et al. 2018  2 Feurer et al. 2018
A sketch of the weighted combination follows below.
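A minimal sketch of the weighted sum S = Σj wj Sj; weighting each prior surrogate by the correlation between its predictions and the few observations on the new task is an illustrative stand-in for the similarity weights, not the exact scheme of Wistuba et al. 2018.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Toy prior tasks: each surrogate S_j maps a 1-D configuration to a score.
prior_surrogates = []
for optimum in [0.2, 0.5, 0.8]:
    X = rng.uniform(0, 1, (20, 1))
    y = np.exp(-(X[:, 0] - optimum) ** 2 / 0.05)
    prior_surrogates.append(GaussianProcessRegressor().fit(X, y))

# A few observations on the new task (its optimum is near 0.55).
new_X = rng.uniform(0, 1, (4, 1))
new_y = np.exp(-(new_X[:, 0] - 0.55) ** 2 / 0.05)

# Weight each prior surrogate by how well it predicts the new observations.
weights = [max(np.corrcoef(S.predict(new_X), new_y)[0, 1], 0.0)
           for S in prior_surrogates]
w = np.array(weights)
w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1 / len(w))

# Combined surrogate: S(cand) = sum_j w_j * S_j(cand)
cand = np.linspace(0, 1, 101).reshape(-1, 1)
combined = w @ np.stack([S.predict(cand) for S in prior_surrogates])
print("weights:", w.round(2), "suggested lambda:", cand[np.argmax(combined), 0])
```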
Surrogate model transfer (Manolache & Vanschoren 2019)
• Store a surrogate model Sij for every pair of prior task i and algorithm j
• Simpler surrogates, better transfer
• Learn a weighted ensemble on the new task -> significant speed-up in optimization
Warm starting (what works on similar tasks?)
Instead of starting the search randomly, use metadata from prior tasks to start with good candidate configurations.
How to measure task similarity?
• Hand-designed (statistical) meta-features that describe (tabular) datasets 1
• Task2Vec: task embedding for image data 2
• Optimal transport: similarity measure based on comparing probability distributions 3
• Metadata embedding based on textual dataset descriptions 4
• Dataset2Vec: compares batches of datasets 5
• Distribution-based invariant deep networks 6
1 Vanschoren 2018  2 Achille et al. 2019  3 Alvarez-Melis et al. 2020  4 Drori et al. 2019  5 Jomaa et al. 2020  6 de Bie et al. 2020
Figure source: Alvarez-Melis et al. 2020
Warm-starting with kNN
• Find the k most similar tasks, then warm-start the search with the best configurations λ1..k found on them
• Auto-sklearn: warm-started Bayesian optimization (SMAC)
• Meta-learning yields better models, faster; winner of the AutoML Challenges (Feurer et al. 2015)
Figure source: Feurer et al., 2015
A minimal sketch is shown below.
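A minimal sketch of the kNN warm-start, assuming tasks are described by simple meta-feature vectors (hypothetically: number of instances, number of features, minority-class fraction) and that we stored the best configuration per prior task.

```python
import numpy as np

def warm_start(new_mf, meta_features, best_configs, k=2):
    dist = np.linalg.norm(meta_features - new_mf, axis=1)  # task (dis)similarity
    nearest = np.argsort(dist)[:k]                         # k most similar prior tasks
    return [best_configs[i] for i in nearest]              # initial candidates for BO

# Hypothetical meta-features: (n_instances, n_features, minority-class fraction).
meta_features = np.array([[1000, 10, 0.20], [1200, 12, 0.25], [50, 3, 0.45]])
best_configs = [{"C": 1.0}, {"C": 0.5}, {"C": 100.0}]      # best lambda per prior task
print(warm_start(np.array([1100, 11, 0.22]), meta_features, best_configs))
```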
Probabilistic Matrix Factorization (Fusi et al. 2017)
• Collaborative filtering: configurations λi are 'rated' by tasks tj
• Learn a latent representation for tasks tj and configurations λi from the performance matrix Pi,j
• Use meta-features to warm-start on a new task tnew with λ1..k
• Returns probabilistic predictions p(P|λi), useful for Bayesian optimization
Figure source: Fusi et al., 2017
DARTS: Differentiable NAS (Liu et al. 2018)
• Fixed (one-shot) structure; learn which operators to use (e.g. convolution, max pooling, zero)
• Give every operator a weight αi
• Optimize the operator weights αi and the model weights ωj with bilevel optimization
• Approximate ωj*(αi) by adapting ωj after every training step: interleaved optimization of αi and ωj with SGD
Figure source: Liu et al., 2018
A first-order sketch of this interleaving is given below.
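A first-order sketch of the interleaved updates on a toy regression problem; the three candidate operations, the random data, and the use of plain alternation (rather than the paper's second-order approximation of ω*(α)) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# One mixed edge choosing among three operations via softmax(alpha).
ops = torch.nn.ModuleList([torch.nn.Linear(4, 4), torch.nn.Identity(),
                           torch.nn.Linear(4, 4, bias=False)])
alpha = torch.zeros(len(ops), requires_grad=True)       # architecture weights
opt_w = torch.optim.SGD(ops.parameters(), lr=0.1)       # operation weights omega
opt_a = torch.optim.Adam([alpha], lr=0.01)

def mixed_op(x):
    weights = torch.softmax(alpha, dim=0)               # continuous relaxation
    return sum(wi * op(x) for wi, op in zip(weights, ops))

x_train, y_train = torch.randn(32, 4), torch.randn(32, 4)
x_val, y_val = torch.randn(32, 4), torch.randn(32, 4)

for step in range(100):
    # 1) update architecture alpha on validation data (outer problem)
    opt_a.zero_grad()
    F.mse_loss(mixed_op(x_val), y_val).backward()
    opt_a.step()
    # 2) update operation weights omega on training data (inner problem)
    opt_w.zero_grad()
    F.mse_loss(mixed_op(x_train), y_train).backward()
    opt_w.step()

print("operator mixture:", torch.softmax(alpha, dim=0).detach().numpy())
```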
Warm-started DARTS (Grobelnik and Vanschoren, 2021)
• Warm-start DARTS with architectures that worked well on similar problems
• Slightly better performance, but much faster (5x)
Meta-models (learn how to build models/components)
Train a meta-learner on (λ, score) metadata collected across many tasks, and use it to propose models or components for a new task.
Algorithm selection models
• Learn a direct mapping between meta-features mj and performance Pi,j
• Zero-shot meta-models: predict the best λi given meta-features 1
• Ranking models: return a ranking λ1..k 2
• Predict which algorithms / configurations Λ to consider / tune 3
• Predict performance / runtime for a given configuration θi and task 4
• Can be integrated in larger AutoML systems: warm-starting, guiding the search,…
1 Brazdil et al. 2009, Lemke et al. 2015  2 Sun and Pfahringer 2013, Pinto et al. 2017  3 Sanders and Giraud-Carrier 2017  4 Yang et al. 2018
Learning model components
• Learn nonlinearities: RL-based search over a space of likely useful activation functions 1
• E.g. Swish can outperform ReLU 1: Swish(x) = x / (1 + e^(-βx))
• Learn optimizers: RL-based search over a space of likely useful update rules 2
• E.g. PowerSign can outperform Adam and RMSProp 2: e^(sign(g)·sign(m)) · g, with gradient g and moving average m
• Learn acquisition functions for Bayesian optimization 3
1 Ramachandran et al. 2017  2 Bello et al. 2017  3 Volpp et al. 2020
Figure source: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
Both learned components are small enough to state directly; see the sketch below.
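A sketch of the two formulas, with α = e for PowerSign as is common; the demo values are arbitrary.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

def powersign_update(g, m, alpha=np.e):
    # Scale the gradient up when its sign agrees with the moving average m,
    # and down when it disagrees: alpha^(sign(g) * sign(m)) * g
    return alpha ** (np.sign(g) * np.sign(m)) * g

x = np.linspace(-3, 3, 7)
print(swish(x))                                             # smooth, non-monotonic near zero
print(powersign_update(np.array([0.5, -0.5]), np.array([0.4, 0.4])))
```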
Monte Carlo Tree Search + reinforcement learning
MOSAIC (Rakotoarison et al. 2019), AlphaD3M (Drori et al. 2019)
• Self-play: the game actions insert, delete, or replace components in a pipeline
• Monte Carlo Tree Search builds pipelines given action probabilities, with a grammar to avoid invalid pipelines
• A neural network (LSTM) predicts pipeline performance (can be pre-trained on prior datasets)
Figure source: Drori et al., 2019
Meta-Reinforcement Learning for NAS (Gomez & Vanschoren, 2019)
Results on increasingly difficult tasks (omniglot, vgg_flower, dtd):
• Initially slower than DQN, but faster after a few tasks
• Policy entropy shows learning/re-learning
MetaNAS: MAML + Neural Architecture Search (Elsken et al., 2020)
• Combines gradient-based meta-learning (REPTILE) with NAS
• During meta-training, it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳
• During meta-testing, the architecture can be adapted to the novel task through gradient descent
Figure source: Elsken et al., 2020
What can we learn to learn? (Clune 2019: 3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
The inductive bias gained from experience across tasks transfers to new tasks (of our own choosing).
43
Terminology (reminder)
• Task T: a distribution of samples q(x) with outputs y and a loss ℒ(x, y); for reinforcement learning, an initial state q(x1) plus transitions q(xt+1 | xt, yt), and ℒ = (negative) reward
• Model: fɸ(x) = y'
• Learner: model parameters ɸ, hyper-parameters λ; training on task Ti (with an optimizer following ∇ɸ) yields ɸi = argminɸ ℒTi(fɸ,λ(x), y)
• Learning objective: ɸ* = argmaxɸ log p(ɸ | T)
• Meta-learning algorithm g: meta-parameters 𝛳 (encode some inductive bias)
44
Strategy 1: bilevel optimization
• Parameterize some aspect of the learner that we want to learn as meta-parameters 𝛳, then meta-learn 𝛳 across tasks: ɸi* = argmaxɸi log p(ɸi | Ti, 𝛳)
• Bilevel optimization: 𝛳* = argmax𝛳 Σi log p(𝛳 | Ti), so that ɸi* = argmaxɸ log p(ɸ | Ti, 𝛳*)
• 𝛳 (the prior) could encode an initialization ɸ, the hyperparameters λ, the optimizer,…
• The learned 𝛳* should allow learning Tnew from a small amount of data, p(ɸnew | Tnew, 𝛳*), e.g. by fine-tuning on Tnew, yet generalize to a large number of tasks
45
Meta-learning with bilevel optimization
• Every task Ti sampled from the task distribution p(T) is a meta-training example
• Inner loop (training): the learning algorithm adapts ɸ'i on each task's (x, y) training data, giving fɸi(x) and a meta-train error ℒTi
• Outer loop (meta-training): the meta-learning algorithm updates the meta-parameters 𝛳' with a meta-optimizer Fmeta to minimize a meta-loss over tasks: 𝛳* = argmax𝛳 log p(𝛳 | T) = argmin𝛳 ℒmeta(ℒT1, ℒT2, …)
• Meta-test: adapt 𝛳* to Tasknew (e.g. fine-tune ɸ'new) and measure the meta-test error ETnew
A schematic sketch of this loop follows below.
46
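Schematically, the whole diagram collapses to a few lines; every name below (`sample_task`, `inner_update`, `meta_loss`, `meta_opt`) is a placeholder for the corresponding component, not a specific algorithm.

```python
def meta_train(theta, sample_task, inner_update, meta_loss, meta_opt, steps=1000):
    for _ in range(steps):
        task = sample_task()                   # Ti ~ p(T): a meta-training example
        phi = inner_update(theta, task.train)  # inner loop: adapt on Di,train
        loss = meta_loss(phi, task.test)       # meta-train error on Di,test
        theta = meta_opt(theta, loss)          # outer loop: update meta-parameters
    return theta                               # theta*, fine-tuned further at meta-test
```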
Strategy 2: black-box models
• A black-box meta-model g predicts ɸi directly given Di,train: ɸi = g(Di,train), with 𝛳 hidden inside g, e.g. a hypernetwork whose input embedding is learned across tasks, or a stateful model with memory (e.g. an RNN)
• Meta-training: sample task i from p(T), compute ɸi = g(Di,train), measure the error/reward ℒ(ɸi, Di,test), and update 𝛳 via ∇𝛳
• Amortized models: no training at meta-test time, just a feed-forward pass ɸnew = g(Dnew,train) (learning fast by learning slow)
A toy sketch of such a model follows below.
47
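A toy sketch of an amortized black-box meta-model for regression: an LSTM g consumes the support set and emits a task embedding ɸ that conditions a feed-forward head. The class name, dimensions, and architecture are illustrative assumptions.

```python
import torch

class BlackBoxMetaLearner(torch.nn.Module):
    """phi = g(D_train): an LSTM encodes the support set; no inner-loop training."""
    def __init__(self, in_dim=2, phi_dim=16):
        super().__init__()
        self.g = torch.nn.LSTM(in_dim + 1, phi_dim, batch_first=True)
        self.head = torch.nn.Linear(in_dim + phi_dim, 1)

    def forward(self, x_support, y_support, x_query):
        pairs = torch.cat([x_support, y_support.unsqueeze(-1)], dim=-1)
        _, (h, _) = self.g(pairs)               # final hidden state = task code phi
        phi = h[-1].unsqueeze(1).expand(-1, x_query.size(1), -1)
        return self.head(torch.cat([x_query, phi], dim=-1)).squeeze(-1)

model = BlackBoxMetaLearner()
x_s, y_s, x_q = torch.randn(8, 5, 2), torch.randn(8, 5), torch.randn(8, 10, 2)
preds = model(x_s, y_s, x_q)   # (8, 10): one prediction per query, feed-forward only
```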
Example: few-shot classification (2-way, 3-shot) — Vinyals et al. 2016, Triantafillou et al. 2019
• Each task has a Dtrain (support set) and a Dtest (query set)
• Bilevel: during meta-training, the meta-learner produces 𝛳* (an initial ɸ, an optimizer, a metric, …) by minimizing ℒmeta(ℒT1, ℒT2, …) over the train tasks
• During meta-testing, the learner fine-tunes ɸ from 𝛳* on the test task's Dtrain and is evaluated via ℒ(ɸ, Dtest)
48
Example: few-shot classification
• Black box: during meta-training, the meta-learner g directly outputs ɸ* for each train task from its Dtrain, and 𝛳' is updated from ℒmeta(ℒT1, ℒT2, …)
• During meta-testing, g produces ɸ for the new task in a single pass and is evaluated on Dtest
49
Example: meta-reinforcement learning (Yu et al. 2019)
• Bilevel: the inner loop trains a policy 𝝅ɸ on episodes of each train task (initial states with randomized object and goal positions); the outer loop updates 𝛳* with a meta-policy gradient to maximize Rmeta(RT1, RT2, …)
• 𝛳* can be a fixed prior or objective, e.g. which ɸ to learn; the effective policy 𝝅ɸ is modulated by ɸ
• Meta-testing: fine-tune ɸ on related tasks or other initial states, evaluated via R(𝝅ɸ, Dtest)
50
Example: meta-reinforcement learning
• Black box: a (fast) RNN acts as the RL agent; it adapts its hidden state (ɸ) within each task from the train episodes
• The meta-RL algorithm updates 𝛳' in the outer loop to maximize Rmeta(RT1, RT2, …)
• Meta-testing: the RNN adapts to a new task (other initial states or related tasks) from a few test episodes, evaluated via R(𝝅ɸ, Dtest)
51
Taxonomy of meta-learning methods (Hospedales et al. 2020)
Like base-learners, meta-learners consist of a representation, an objective, and an optimizer:
• meta-parameters 𝛳, e.g. the parameter initialization ɸ0 (a learned ɸ0 rather than a random ɸ0)
• meta-objective ℒmeta, e.g. multi- vs single-task
• meta-optimizer Fmeta, e.g. gradient descent
52
Taxonomy of meta-learning methods (Hospedales et al. 2020)
• The meta-parameters 𝛳 can also be the optimizer itself: a learned optimizer (update rule) rather than normal SGD (back-propagation)
53
Taxonomy of meta-learning methods (Hospedales et al. 2020, Huisman et al. 2020)
• meta-parameters 𝛳: parameter initialization ɸ0, optimizer, black-box model, embedding (metric learning), hyperparameters λ, architecture, loss ℒ / reward R, …
• meta-objective ℒmeta: many vs few shots, multi vs single task, fast adaptation vs final performance, offline vs online
• meta-optimizer Fmeta: gradient descent, reinforcement learning, evolution
We won't exhaustively discuss all existing combinations; let's focus on understanding a few key strategies.
54
Gradient-based methods: learning ɸinit
• 𝛳 (the prior) is the model initialization: ɸinit = 𝛳
• Learn a representation suitable for many tasks (e.g. a pretrained CNN) that maximizes rapid learning
• Each task i yields task-adapted parameters ɸi = u(𝛳, Di,train) via an update algorithm u (inner loop, ∇ɸ)
• Can be seen as bilevel optimization: 𝛳* = argmin𝛳 Σi ℒi(ɸi, Di,test) with ɸi = u(𝛳, Di,train)
• Meta-test: fine-tune 𝛳* on Tnew in a few steps, ɸ'new = u(𝛳*, Dnew,train)
55
Model-agnostic meta-learning (MAML) — Finn et al. 2017
Meta-training (illustrated with i=3 tasks, k=3 inner steps):
• Given the current initialization 𝛳 and model f
• On each task i, perform k SGD steps to find ɸi*, then evaluate ∇𝛳ℒi(fɸi*)
• Update the task-specific parameters: ɸi = 𝛳 − α∇𝛳ℒi(fɸi*)
• Update 𝛳 to minimize the sum of per-task test-set losses, and repeat: 𝛳 ← 𝛳 − β∇𝛳 Σi ℒi(fɸi) = 𝛳 − β∇𝛳 Σi ℒi(f(𝛳 − α∇𝛳ℒi(fɸi*)))
• The meta-gradient is a second-order gradient: compute how changes in 𝛳 affect the gradient at the new 𝛳, and backpropagate the meta-gradient (α, β: learning rates)
Meta-testing:
• Given training data Dtrain of a new task and pre-trained parameters 𝛳*, fine-tune: ɸ = 𝛳* − α∇𝛳ℒ(f𝛳*)
A runnable toy sketch follows below.
56
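A minimal runnable sketch of MAML on sine-wave regression, the task family used by Finn et al. 2017; the network size, learning rates α and β, and batch sizes are illustrative choices, and the second-order meta-gradient comes from `create_graph=True`.

```python
import random
import torch
import torch.nn.functional as F

net = torch.nn.Sequential(torch.nn.Linear(1, 40), torch.nn.ReLU(), torch.nn.Linear(40, 1))
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)   # outer-loop optimizer (beta)
alpha = 0.01                                             # inner-loop learning rate

def sample_task(rng):
    # A task = regressing one sine wave with random amplitude and phase.
    amp, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, 3.14)
    def batch(n):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return batch

rng = random.Random(0)
for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                                   # tasks per meta-batch
        batch = sample_task(rng)
        x_tr, y_tr = batch(10)
        x_te, y_te = batch(10)
        params = list(net.parameters())
        inner_loss = F.mse_loss(net(x_tr), y_tr)
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)
        phi = [p - alpha * g for p, g in zip(params, grads)]  # one inner SGD step
        # Evaluate the adapted parameters phi on the task's test set
        # (manual functional forward pass through the same architecture).
        h = torch.relu(x_te @ phi[0].t() + phi[1])
        test_loss = F.mse_loss(h @ phi[2].t() + phi[3], y_te)
        test_loss.backward()                             # second-order meta-gradient
    meta_opt.step()
```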
Model-agnostic meta-learning (MAML)
• Example for reinforcement learning: the goal is to reach a certain velocity in a certain direction; MAML adapts much faster than a pretrained policy (an oracle is shown for reference)
Source: Finn et al. 2017
57
Other gradient-based techniques
• Changing the update rule yields different variants; all use second-order gradients, and most meta-learn a transformation of the gradient for better adaptation:
• MAML 1,6: ɸi = 𝛳 − α∇𝛳ℒi(fɸi*)
• MetaSGD 2: ɸi = 𝛳 − α diag(w)∇𝛳ℒi(fɸi*), with a learned weight w per parameter
• T-net 3: ɸi = 𝛳 − α∇𝛳ℒi(fɸi*, w)
• Meta-curvature 4: ɸi = 𝛳 − α B(𝛳, w)∇𝛳ℒi(fɸi*)
• WarpGrad 5: ɸi = 𝛳 − α P(𝛳, ɸi)∇𝛳ℒi(fɸi*)
• Online MAML (Follow the Meta Leader) 7: minimizes regret; robust, but the computation costs grow over time
1 Finn et al. 2017  2 Li et al. 2017  3 Lee et al. 2018  4 Park et al. 2019  5 Flennerhag et al. 2019  6 Antoniou et al. 2019  7 Finn et al. 2019
Figure source: Flennerhag et al. 2019
58
Scalability
• Backpropagating derivatives of ɸi w.r.t. 𝛳 is compute- and memory-intensive for large models 3
• First-order approximations of MAML:
• FOMAML 1 uses only the last inner gradient update; per-task meta-gradient ∇ɸℒi(fɸi*): 𝛳 ← 𝛳 − β Σi ∇ɸℒi(fɸi*)
• Reptile 2: iterate over tasks (running k SGD steps 𝛳1, 𝛳2, … to reach ɸi*), and update 𝛳 in the direction of ɸi* − 𝛳: 𝛳 ← 𝛳 + β(ɸi* − 𝛳)
1 Finn et al. 2017  2 Nichol et al. 2018  3 Rajeswaran et al. 2019
A toy Reptile sketch is shown below.
59
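A toy Reptile sketch where each "task" is a quadratic bowl with a random optimum; after meta-training, 𝛳 sits near the mean of the task optima, the initialization from which adaptation is fastest. All specifics here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                         # meta-learned initialization

def adapt(theta, target, k=5, alpha=0.1):
    # Inner loop: k SGD steps on the task loss ||phi - target||^2.
    phi = theta.copy()
    for _ in range(k):
        phi -= alpha * 2 * (phi - target)   # gradient of the task loss
    return phi

beta = 0.1
for step in range(200):
    target = rng.normal(size=2)             # sample a task (its toy optimum)
    phi_star = adapt(theta, target)
    theta += beta * (phi_star - theta)      # theta <- theta + beta * (phi* - theta)
```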
Generalizability
• MAML is more resilient to overfitting than many other meta-learning techniques 1
• Its effectiveness seems mainly due to feature reuse (finding a good 𝛳) 2: fine-tuning only the last layer works equally well
• On few-shot learning benchmarks, a good embedding outperforms most meta-learning 3: learn a representation on the entire meta-dataset (merged into a single task), then train a linear classifier on the embedded few-shot Dtrain and predict Dtest
• For meta-RL, also learn how to explore (how to sample new environments): E-MAML adds exploration to the meta-objective, which allows longer-term goals 4
1 Finn et al. 2018  2 Raghu et al. 2020  3 Tian et al. 2020  4 Stadie et al. 2019
Figure source: Tian et al. 2020
60
Bayesian meta-learning (Grant et al. 2018)
Can meta-learning reason about uncertainty in the task distribution?
• Point-estimate view: for meta-train tasks Ti = {Dtrain, Dtest}, a meta-model g𝛳 maps (x, y)train to ɸ̂, predictions are ytest ← fɸ̂(xtest), and 𝛳 ← 𝛳 − ∇𝛳ℒ(ytest); at meta-test time, ɸj ← g𝛳((x, y)train) with fine-tuning
• Bayesian view: maintain distributions instead of point estimates: a prior p(𝛳) updated to p(𝛳 | (x, y)train), a task posterior p(ɸj | (x, y)train, 𝛳), and a predictive distribution p(ytest | xtest)
61
Fully Bayesian meta-learning
• Use approximation methods to represent the uncertainty over ɸi
• Sampling technique + variational inference: PLATIPUS 1, BMAML 2, ABML 3
• Laplace approximation: LLAMA 4
• Variational approximation of the posterior: Neural Statistician 5, Neural Processes 6
• What if our tasks are not IID? Impose additional structure, e.g. a task-specific latent variable z 7 (e.g. mini-ImageNet with filters for non-homogeneous tasks)
1 Finn et al. 2019  2 Kim et al. 2018  3 Ravi et al. 2019  4 Grant et al. 2018  5 Edwards et al. 2017  6 Garnelo et al. 2018  7 Jerfel et al. 2019
Figure source: Jerfel et al. 2019
62
Meta-learning optimizers
• Our brains probably don't do backprop; instead:
• Fast weights 2: networks that continuously modify the weights of another network
• Gradient-based 3,4: parameterize the update rule, ɸ'i = ɸi − η g𝛳(·), using a neural network, and learn its meta-parameters 𝛳 across tasks by gradient descent (a meta-gradient) 1
• Represent the update rule as an LSTM 4,5,6 or a hierarchical RNN 7
1 Bengio et al. 1995  2 Schmidhuber 1992  3 Runarsson and Jonsson 2000  4 Hochreiter 2001  5 Andrychowicz et al. 2016  6 Ravi and Larochelle 2017  7 Wichrowska et al. 2017
A toy learned-optimizer sketch follows below.
63
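A toy sketch of a gradient-based learned optimizer in the spirit of Andrychowicz et al. 2016, heavily simplified: a small shared network maps each parameter's gradient to its update, and its meta-parameters are trained by unrolling the inner optimization. The architecture and toy quadratic objective are assumptions.

```python
import torch

# A tiny shared network maps each parameter's gradient to its update.
update_net = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
meta_opt = torch.optim.Adam(update_net.parameters(), lr=1e-3)

for meta_step in range(200):
    w = torch.randn(5, requires_grad=True)   # fresh optimizee: toy quadratic problem
    target = torch.randn(5)
    meta_loss = 0.0
    for t in range(10):                      # unrolled inner optimization
        loss = ((w - target) ** 2).sum()
        g, = torch.autograd.grad(loss, w, create_graph=True)
        w = w + update_net(g.unsqueeze(-1)).squeeze(-1)   # learned update rule
        meta_loss = meta_loss + loss         # sum of losses along the trajectory
    meta_opt.zero_grad()
    meta_loss.backward()                     # gradient w.r.t. the optimizer's weights
    meta_opt.step()
```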
Meta-learning optimizers
• Meta-learned (RNN) optimizers 'rediscover' momentum, gradient clipping, learning rate schedules, learning rate adaptation,… 1
• RL-based optimizers: represent updates as a policy, learned with guided policy search 2
• Combined with MAML: learn per-parameter learning rates 3,4, or learn a preconditioning matrix to 'warp' the loss surface 5,6
• Black-box optimizers: meta-learned with an RNN 7, or with user-defined priors 8
• Speed up backpropagation by meta-learning sparsity and weight sharing 9
1 Maheswaranathan et al. 2020  2 Ke and Malik 2016  3 Li et al. 2017  4 Antoniou et al. 2019  5 Park et al. 2019  6 Flennerhag et al. 2019  7 Chen et al. 2017  8 Souza et al. 2020  9 Kirsch et al. 2020
Figure source: Maheswaranathan et al. 2020
64
Metric learning (Vinyals et al. 2016, Sung et al. 2018)
• Learn an embedding network 𝛳 that transforms data {Dtrain, Dtest} across all tasks to a representation g𝛳(x) that allows easy similarity comparison
• Can be seen as a simple black-box meta-model
• Often non-parametric (independent of ɸ)
65
Prototypical networks (Snell et al. 2017)
• Use an embedding function f𝛳 to encode each data point
• Define a prototype for every class c as the mean embedding of that class's training examples: vc = (1/|Dtrain(c)|) Σ(xi,yi)∈Dtrain(c) f𝛳(xi)
• The class distribution for an input x is based on the inverse distance between x and the prototypes: p(y = c | x) = softmax(−d(f𝛳(x), vc))
• The distance function d can be any differentiable distance, e.g. squared Euclidean
• Loss function to learn the embedding: ℒ(𝛳) = −log p𝛳(y = c | x)
Figure source: Snell et al. 2017
A minimal episode sketch follows below.
66
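A minimal episode sketch of these formulas; the encoder, input dimensions, and the 3-way 5-shot episode are illustrative assumptions (in practice the encoder would be a CNN over images).

```python
import torch
import torch.nn.functional as F

# Placeholder encoder f_theta.
embed = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))

def episode_loss(x_support, y_support, x_query, y_query, n_classes):
    z = embed(x_support)
    # Prototype v_c: mean embedding of each class's support examples.
    prototypes = torch.stack([z[y_support == c].mean(0) for c in range(n_classes)])
    dists = torch.cdist(embed(x_query), prototypes) ** 2   # squared Euclidean
    log_p = F.log_softmax(-dists, dim=1)                   # p(y=c|x) = softmax(-d)
    return F.nll_loss(log_p, y_query)

# Toy 3-way, 5-shot episode with 5 query points per class.
y = torch.arange(3).repeat_interleave(5)
loss = episode_loss(torch.randn(15, 16), y, torch.randn(15, 16), y, n_classes=3)
loss.backward()   # the embedding is trained end-to-end over many such episodes
```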
Metric learning
• Quite a few other techniques exist:
• Siamese neural networks 1
• Graph neural networks 2, also applicable to semi-supervised and active learning
• Attentive Recurrent Comparators 3: compare inputs not as a whole but by parts (e.g. image patches)
• MetaOptNet 4: learns embeddings so that linear models can distinguish between classes
• Overall: fast at test time, although pair-wise comparisons limit task size; mostly limited to few-shot supervised tasks; fails when test tasks are more distant, since there is no way to adapt
1 Koch et al. 2015  2 Garcia et al. 2017  3 Shyam et al. 2017  4 Lee et al. 2019
67
Black-box model for meta-RL (Wang et al. 2016, Duan et al. 2018)
• RNNs serve as dynamic task-embedding storage: previous inputs are stored in the hidden state
• Maximize the expected reward in each trial
• Very expressive; performs very well on short tasks, but longer horizons remain an open challenge
Image source: Duan et al. 2018
68
Other black-box models
• Memory-augmented NNs 1: use neural Turing machines (short-term + long-term memory)
• Meta Networks 2: a meta-learner that returns 'fast weights' for itself and for the base network solving the task
• Simple Neural Attentive Meta-Learner (SNAIL) 3: aims to overcome the memory limitations of RNNs with a series of 1D convolutions
1 Santoro et al. 2016  2 Munkhdalai et al. 2017  3 Mishra et al. 2018
Image sources: Munkhdalai et al. 2017 (Meta Networks), Mishra et al. 2018 (SNAIL)
69
What can we learn to learn? (Clune 2019: 3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
70
• Ultimately, meta-learning translates constraints on the learner into constraints on the data: the biases we don't put in manually have to be learnable from data
• Can we automatically create new tasks to inform and challenge our meta-learners?
• Paired Open-Ended Trailblazer (POET): evolves a parameterized environment 𝛳E for an agent 𝛳A (Wang et al. 2019, Wang et al. 2020)
• Select agents that can solve challenges AND evolve environments so they remain solvable; the best-performing agents move on, which creates multiple overlapping curricula (training task acquisition)
Figure source: Wang et al. 2019
71
Does POET scale? Increasingly difficult 3D terrain, 18 degrees of freedom (Zhou & Vanschoren, 2021).
72
Generative Teaching Networks (Such et al. 2019)
• Based on an existing dataset, generate synthetic training data for more efficient training
• Like dataset distillation, but uses meta-learning to update the generator model
• While POET has limited expressivity (limited to 𝛳E), GTNs could produce all sorts of training datasets and environments
• While being careful not to generate noise: some grounding in reality is needed
73
Thank you
Image source: Pesah et al. 2018
74