Joaquin Vanschoren (TU Eindhoven)
Meta-Learning
A tutorial
Cover art: MC Escher, Ustwo Games
Intro: humans can easily learn from a single example
Canna Indica ‘Picasso’ ?
later
train test
2
thanks to years of learning (and eons of evolution)
Can a computer learn from a single example?
e.g. ResNet50 (23 million trainable parameters), trained on ImageNet, tested on Canna Indica ‘Picasso’
That won’t work :) Humans also don’t start from scratch.
3
Transfer learning?
e.g. ResNet50 (23 million trainable parameters)
Source task: pretrain on ImageNet (14 million images); target task: finetune on Canna Indica ‘Picasso’
A single source task (e.g. ImageNet) may not generalize well to the test task
4
Meta-learning
Learn over a series (or distribution) of many different tasks/episodes (task 1, task 2, … task 1000, new task; e.g. Canna, Trumpet vine, …)
Inductive bias: learn assumptions that you can transfer to new tasks
Prepare yourself to learn new things faster
Useful in many real-life situations: rare events, test-time constraints, data collection costs, privacy issues, …
5
Inspired by human learning
We don’t transfer from a single source task; we learn across many, many tasks (year 1, year 2, year 3, year 4)
We have a ‘drive’ to explore new, challenging, but doable, fun tasks
6
Human-like Learning***
humans learn across tasks: less trial-and-error, less data, less compute
new tasks should be related to experience (doable, fun, interesting?)
key aspects of fast learning: compositionality, causality, learning to learn
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); an inductive bias is carried over]
Lake et al. 2017; Kemp et al. 2007
7
Inductive bias (in language)
Which assumptions do we make?
Training: dax, zup, lug, wif; lug blicket wif; wif blicket dax; lug kiki wif; wif kiki dax
Test: dax blicket zup? wif blicket dax kiki lug?
Common mistakes:
• one-to-one bias: assume that every word is one color
• concatenation bias: assume that order is always left-to-right
Lake et al. 2017
8
What if there is no training data?
We can still solve problems by making assumptions.
Item pool; test queries: dax zup? zup tufa? zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup?
Commonly used assumptions:
• one-to-one bias: assume that every word is one color (and not a function or something else)
• concatenation bias: assume that order is always left-to-right
• mutual exclusivity: if an object already has a name, it doesn’t need another name
These assumptions (inductive biases) are necessary for learning quickly. Humans assume that words have consistent meanings and follow input/output constraints.
Lake et al. 2017; Lake et al. 2019
9
Meta-learning inductive biases
Capture useful assumptions from the data, ones that often cannot easily be expressed manually.
[Diagram: a model is trained on Task 1, Task 2, …, carrying state across tasks, then applied to Tasknew; each task pairs words (dax, zup, lug, wif) with colors and poses queries such as ‘wif zup dax?’, ‘dax dax?’, ‘zup wif zup?’]
Lake et al. 2017; Lake et al. 2019
10
Meta-learning goal
Learn minimal inductive biases from prior tasks instead of constructing manual ones; they should still generalize well (otherwise you meta-overfit).
Inductive bias: any assumptions added to the training data to learn more effectively. E.g.:
• Instead of general model architectures, learn better architectures (and hyperparameters)
• Instead of starting from random weights, learn good initial weights
• Instead of a standard loss/reward function, learn a better loss/reward function
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning), with the learned inductive bias as prior]
11
What can we learn to learn? (3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); inductive bias is carried over]
Clune 2019
12
Terminology
Learner: model parameters ɸ, hyper-parameters λ, optimizer
Task T: a distribution of samples q(x), outputs y, and a loss ℒ(x,y); for reinforcement learning: an initial state q(x1) plus transitions q(xt+1 | xt, yt), with ℒ = (negative) reward
Model: fɸ(x) = y’
Training: ɸ* = argmaxɸ log p(ɸ | T) = argminɸ ℒ(fɸ,λ(x), y)
[Diagram: Tasks 1 and 2 each feed data (x,y) into a learning algorithm (with hyperparameters λ1, λ2) that updates ɸ via ∇ɸ to ɸ’1, ɸ’2, evaluated through ℒT1(fɸ1,λ(x), y), ℒT2]
14
Meta-learning algorithm g: meta-parameters 𝛳 (encode some inductive bias)
Learning hyperparameters (Hutter et al. 2019)
Closely related to Automated Machine Learning (AutoML), but: meta-learn how to design architectures/pipelines and tune hyperparameters λ. Human data scientists also learn from experience.
[Diagram: AutoML maps tasks 1..N to models and performance; self-learning AutoML carries a learned bias (priors, meta-knowledge, human priors) to a new task]
The search space can be huge!
Meta-learning for AutoML: how? (Vanschoren 2018)
(here, hyperparameters = architecture + hyperparameters; in each case, metadata (λ, scores) flows from learners on prior tasks to the learner on the new task)
1. Learning hyperparameter priors: simple hyperparameter spaces
2. Warm starting (what works on similar tasks?): start with good candidates instead of starting randomly
3. Meta-models (learn how to build models/components): complex hyperparameter spaces
Observation: current AutoML strongly depends on learned priors, across simple and complex hyperparameter spaces.
Manual architecture priors
autosklearn (Feurer et al. 2015), autoWEKA (Thornton et al. 2013), hyperopt-sklearn (Komer et al. 2014), AutoGluon-Tabular (Erickson et al. 2020)
• Most successful pipelines have a similar structure
• Can we meta-learn a prior over successful structures?
+ smaller search space
- you can’t learn entirely new architectures
[Figure: pipeline structure with ensembling/stacking; source: Feurer et al. 2015]
Manual architecture priors
Successful deep networks often have repeated motifs (cells), e.g. Inception v4.
Figure source: Szegedy et al. 2016
Google NASNet (Zoph et al. 2018)
Compositionality: learn hierarchical building blocks to simplify the task
• learn parameterized building blocks (cells)
• stack cells together in a macro-architecture
+ smaller search space
+ cells can be learned on a small dataset & transferred to a larger dataset
- strong domain priors, doesn’t generalize well
Can we meta-learn hierarchies / components that generalize better?
[Figure: cell search space and cell search space prior; source: Elsken et al., 2019]
Cell search space prior: a strong prior! If you constrain the search space enough, you can get SOTA results with random search!
Li & Talwalkar 2019; Yu et al. 2019; Real et al. 2019
Figure source: Li & Talwalkar, 2019
Manual priors: weight sharing
Weight-agnostic neural networks (Gaier & Ha, 2019)
• ALL weights are shared
• Only evolve the architecture?
• Minimal description length
• Baldwin effect?
Figure source: Gaier & Ha, 2019
Learning hyperparameter priors
[Diagram: a learner emits metadata (λ, scores), over simple to complex hyperparameter spaces]
Learn hyperparameter importance
• Functional ANOVA¹: select the hyperparameters that cause variance in the evaluations
• Useful to speed up black-box optimization techniques
Example: ResNets for image classification
¹ van Rijn & Hutter 2018
Figure source: van Rijn & Hutter, 2018
Learn defaults + hyperparameter importance
• Tunability¹²³: learn good defaults, and measure importance as the improvement gained via tuning (learned defaults, tuning risk)
¹ Probst et al. 2018; ² Weerts et al. 2018; ³ van Rijn et al. 2018
Bayesian Optimization (interlude) (Mockus, 1974)
• Start with a few (random) hyperparameter configurations
• Fit a surrogate model to predict the performance of other configurations
• Probabilistic regression: mean μ and standard deviation σ (blue band)
• Use an acquisition function to trade off exploration and exploitation, e.g. Expected Improvement (EI)
• Sample the best configuration under that function
[Figure: surrogate μ, μ + σ, μ − σ over values of λi vs. performance]
Bayesian Optimization
• Repeat until some stopping criterion: a fixed budget, convergence, or an EI threshold
• Theoretical guarantees (Srinivas et al. 2010, Freitas et al. 2012, Kawaguchi et al. 2016)
• Also works for non-convex, noisy data
• Used in AlphaGo
Figure source: Shahriari 2016
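To make the loop concrete, here is a minimal sketch of Bayesian optimization with a Gaussian-process surrogate and Expected Improvement, assuming a single hyperparameter on [0, 1] and a black-box `score` function to maximize (all names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, X_cand, best):
    # EI(x) = (mu - best) * Phi(z) + sigma * phi(z), with z = (mu - best) / sigma
    mu, sigma = gp.predict(X_cand, return_std=True)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(score, n_init=3, n_iter=20):
    X = np.random.uniform(0, 1, size=(n_init, 1))      # initial (random) design
    y = np.array([score(x[0]) for x in X])
    for _ in range(n_iter):                            # fixed-budget stopping criterion
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = np.linspace(0, 1, 1000).reshape(-1, 1)  # candidate configurations
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        X = np.vstack([X, [x_next]])                   # evaluate and extend the design
        y = np.append(y, score(x_next[0]))
    return X[np.argmax(y)], y.max()

# Example: tune a fake 'learning rate' whose noisy score peaks around 0.3
best_x, best_y = bayes_opt(lambda lr: -(lr - 0.3) ** 2 + np.random.normal(0, 0.01))
```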
Learn basis expansions for hyperparameters (Perrone et al. 2018)
• Hyperparameters can interact in very non-linear ways
• Use a neural net to learn a suitable basis expansion φz(λ) for all tasks, trained on lots of prior data (e.g. OpenML)
• Bayesian linear surrogates on φz(λ) can then replace Gaussian-process surrogates, transferring information about the configuration space across tasks
Surrogate model transfer
• If task j is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up all Sj predictions, weighted by task similarity (as in active testing)¹
• Or build a combined Gaussian process, weighted by the current performance on the new task²: S = ∑j wj Sj
[Diagram: per prior task tj, a surrogate Sj is trained on (λi, Pi,j); the meta-learner combines S1 + S2 + S3 for the new task]
¹ Wistuba et al. 2018; ² Feurer et al. 2018
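A minimal sketch of the weighted-sum idea, assuming per-task surrogates with a scikit-learn-style `.predict`, and (as one plausible choice, not the papers' exact scheme) Spearman rank correlation on the new task's few observations as the similarity weight:

```python
import numpy as np
from scipy.stats import spearmanr

def task_weights(surrogates, X_new, y_new):
    # Weight each prior-task surrogate by how well it ranks the points
    # already evaluated on the new task (clipped at 0 to ignore bad surrogates).
    return [max(spearmanr(s.predict(X_new), y_new).correlation, 0.0) + 1e-9
            for s in surrogates]

def ensemble_predict(surrogates, weights, X):
    # S(X) = sum_j w_j * S_j(X), with weights normalized to sum to 1
    w = np.asarray(weights) / np.sum(weights)
    return sum(wj * s.predict(X) for wj, s in zip(w, surrogates))
```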
Surrogate model transfer (Manolache & Vanschoren 2019)
• Store a surrogate model Sij for every pair of prior task i and algorithm j
• Simpler surrogates, better transfer
• Learn a weighted ensemble -> significant speed-up in the optimization on the new task
Warm starting (what works on similar tasks?)
Start with good candidates instead of starting randomly, using metadata (λ, scores) from learners on prior tasks.
How to measure task similarity?
• Hand-designed (statistical) meta-features that describe (tabular) datasets¹
• Task2Vec: task embeddings for image data²
• Optimal transport: a similarity measure based on comparing probability distributions³
• Metadata embeddings based on the textual dataset description⁴
• Dataset2Vec: compares batches of datasets⁵
• Distribution-based invariant deep networks⁶
¹ Vanschoren 2018; ² Achille et al. 2019; ³ Alvarez-Melis et al. 2020; ⁴ Drori et al. 2019; ⁵ Jooma et al. 2020; ⁶ de Bie et al. 2020
Figure source: Alvarez-Melis et al. 2020
Warm-starting with kNN (Feurer et al. 2015)
• Find the k most similar tasks (via meta-features mj), warm-start the search with the best λi on those tasks
• Auto-sklearn: Bayesian optimization (SMAC), warm-started with λ1..k
• Meta-learning yields better models, faster
• Winner of the AutoML Challenges
[Diagram: a meta-learner picks the best λ1..k from prior tasks’ (λi, Pi,j) and seeds Bayesian optimization on the new task]
Figure source: Feurer et al., 2015
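A minimal warm-starting sketch, assuming a matrix of meta-features for prior tasks and the best configuration found on each (names are illustrative):

```python
import numpy as np

def warm_start_configs(mf_new, mf_prior, best_configs, k=5):
    """Return the best configurations of the k prior tasks whose
    meta-features are closest (Euclidean) to the new task's."""
    d = np.linalg.norm(mf_prior - mf_new, axis=1)
    return [best_configs[i] for i in np.argsort(d)[:k]]

# These k configurations replace the random initial design of the
# Bayesian-optimization loop sketched earlier.
```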
Probabilistic Matrix Factorization (Fusi et al. 2017)
• Collaborative filtering: configurations λi are ‘rated’ by tasks tj (a matrix of performances Pi,j)
• Learn a latent representation λL for tasks T and configurations λ
• Use meta-features to warm-start on the new task tnew (warm-started with λ1..k)
• Returns probabilistic predictions p(P | λLi) for Bayesian optimization
Figure source: Fusi et al., 2017
DARTS: Differentiable NAS (Liu et al. 2018)
• Fixed (one-shot) structure; learn which operators to use (e.g. convolution, max pooling, zero)
• Give every operator a weight αi
• Optimize the architecture weights αi and model weights ωj using bilevel optimization
• Approximate ωj*(αi) by adapting ωj after every training step: interleaved optimization of αi and ωj with SGD; finally, take the argmax over αi
Figure source: Liu et al., 2018
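A minimal sketch of DARTS's continuous relaxation for one edge, assuming a small candidate set (conv, max-pool, skip; the real DARTS set is larger and includes a true zero op):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """One DARTS edge: a softmax-weighted sum over candidate operators."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),  # convolution
            nn.MaxPool2d(3, stride=1, padding=1),         # max pooling
            nn.Identity(),                                # skip connection
        ])
        # Architecture weights alpha: optimized in the outer (meta) loop,
        # while the ops' own weights omega are optimized in the inner loop.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)              # continuous relaxation
        return sum(wi * op(x) for wi, op in zip(w, self.ops))
```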
Warm-started DARTS (Grobelnik and Vanschoren, 2021)
• Warm-start DARTS with architectures that worked well on similar problems
• Slightly better performance, but much faster (5x)
Meta-models (learn how to build models/components)
Metadata (λ, scores) from learners on many prior tasks feeds a meta-model.
Algorithm selection models
• Learn a direct mapping between meta-features mj and performance Pi,j
• Zero-shot meta-models: predict the best λ given meta-features¹
• Ranking models: return a ranking λ1..k²
• Predict which algorithms / configurations to consider / tune³
• Predict the performance / runtime for a given configuration and task⁴
• Can be integrated in larger AutoML systems: warm-starting, guiding the search, …
¹ Brazdil et al. 2009, Lemke et al. 2015; ² Sun and Pfahringer 2013, Pinto et al. 2017; ³ Sanders and Giraud-Carrier 2017; ⁴ Yang et al. 2018
Learning model components
• Learn nonlinearities: RL-based search of a space of likely useful activation functions¹, e.g. Swish(x) = x / (1 + e^(−βx)) can outperform ReLU
• Learn optimizers: RL-based search of a space of likely useful update rules², e.g. PowerSign, e^(sign(g)·sign(m)) · g (g: gradient, m: moving average), can outperform Adam, RMSprop
• Learn acquisition functions for Bayesian optimization³
¹ Ramachandran et al. 2017; ² Bello et al. 2017; ³ Volpp et al. 2020
Figure source: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
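The two learned components above, written out as plain functions (β and the exact moving-average m are assumptions about the variants shown):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

def powersign_update(g, m):
    # PowerSign: scale the gradient up when its sign agrees with the
    # moving average of past gradients m, and down when it disagrees.
    return np.exp(np.sign(g) * np.sign(m)) * g
```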
Monte Carlo Tree Search + reinforcement learning
MOSAIC (Rakotoarison et al. 2019), AlphaD3M (Drori et al. 2019)
• Self-play: the game actions insert, delete, or replace components in a pipeline
• Monte Carlo Tree Search builds pipelines given the action probabilities
• With a grammar to avoid invalid pipelines
• A neural network (LSTM) predicts pipeline performance (and can be pre-trained on prior datasets)
Figure source: Drori et al., 2019
Meta-Reinforcement Learning for NAS (Gomez & Vanschoren, 2019)
Results on increasingly difficult tasks (omniglot, vgg_flower, dtd):
• Initially slower than DQN, but faster after a few tasks
• Policy entropy shows learning/re-learning
MetaNAS: MAML + Neural Architecture Search (Elsken et al., 2020)
• Combines gradient-based meta-learning (REPTILE) with NAS
• During meta-training, it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳
• During meta-testing, the architecture can be adapted to the novel task through gradient descent
Figure source: Elsken et al., 2020
What can we learn to learn? (3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); inductive bias is carried over]
Clune 2019
43
Terminology (reminder)
Learner: model parameters ɸ, hyper-parameters λ, optimizer
Task T: a distribution of samples q(x), outputs y, and a loss ℒ(x,y); for reinforcement learning: an initial state q(x1) plus transitions q(xt+1 | xt, yt), with ℒ = (negative) reward
Model: fɸ(x) = y’
Training: ɸ* = argmaxɸ log p(ɸ | T) = argminɸ ℒ(fɸ,λ(x), y)
44
Meta-learning algorithm g: meta-parameters 𝛳 (encode some inductive bias)
Strategy 1: bilevel optimization
Parameterize some aspect of the learner that we want to learn as meta-parameters 𝛳, and meta-learn 𝛳 across tasks (x,y drawn from Task 1, 2, 3, … and Tasknew):
ɸi* = argmaxɸi log p(ɸi | Ti, 𝛳), and at meta-test time p(ɸnew | Tnew, 𝛳*)
Bilevel optimization: 𝛳* = argmax𝛳 log p(𝛳 | Ti) so that ɸi* = argmaxɸ log p(ɸ | Ti, 𝛳*)
𝛳 (the prior) could encode an initialization ɸ, the hyperparameters λ, the optimizer, …
The learned 𝛳* should learn Tnew from a small amount of data, yet generalize to a large number of tasks; it allows fine-tuning on Tnew.
45
Meta-learning with bilevel optimization
Every task Ti drawn from the task distribution p(T) is a meta-training example.
Inner loop (training): on each task, adapt the meta-parameters 𝛳 to task parameters ɸ’i on the task’s training data (x,y) and measure the task loss ℒTi(fɸi(x)) (the meta-train error).
Outer loop (meta-training): a meta-optimizer Fmeta updates 𝛳’ to minimize the meta-loss across tasks:
𝛳* = argmax𝛳 log p(𝛳 | T) = argmin𝛳 ℒmeta(ℒT1, ℒT2, …)
Meta-test: adapt (finetune) 𝛳* on Tasknew to obtain ɸ’new, and report the meta-test error (in expectation over new tasks, ETnew ℒmeta).
46
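Schematically, the two loops look as follows. This is a generic sketch: the `adapt`, `meta_loss`, and `meta_optimizer` callables are placeholders that concrete methods such as MAML instantiate, as shown later.

```python
def meta_train(theta, task_sampler, adapt, meta_loss, meta_optimizer, steps=1000):
    for _ in range(steps):
        losses = []
        for task in task_sampler():                     # tasks drawn from p(T)
            phi_i = adapt(theta, task.train)            # inner loop: train on D_i,train
            losses.append(meta_loss(phi_i, task.test))  # meta-train error on D_i,test
        theta = meta_optimizer(theta, losses)           # outer loop: update meta-parameters
    return theta                                        # theta*, later finetuned on T_new
```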
Strategy 2: black-box models
A black-box meta-model g predicts the task parameters ɸi directly from the task’s training data: ɸi = g(Di,train), with 𝛳 hidden inside g, e.g. a hypernetwork whose input embedding is learned across tasks; g can also be stateful, with memory (e.g. an RNN).
Meta-training: sample a task i from p(T), compute ɸi = g(Di,train), evaluate ℒ(ɸi, Di,test), and update 𝛳’ by the gradient ∇ of the meta-train error/reward.
Meta-test: ɸnew = g(Dnew,train), evaluated on Dnew,test.
Amortized models: no training at meta-test time, feed-forward only (learning fast by learning slow).
47
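A minimal sketch of such a meta-model: a permutation-invariant set encoder that maps a training set directly to (a vector of) task parameters ɸ. The layer sizes and mean-pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BlackBoxMetaModel(nn.Module):
    """phi = g(D_train): map a whole training set to task parameters."""
    def __init__(self, in_dim, phi_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, phi_dim)

    def forward(self, d_train):                 # d_train: (n_examples, in_dim)
        h = self.encoder(d_train).mean(dim=0)   # permutation-invariant pooling
        return self.head(h)                     # phi, used to parameterize f_phi
```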
Example: few-shot classification (2-way, 3-shot) (Vinyals et al. 2016; Triantafillou et al. 2019)
Meta-train: n train tasks, each with a Dtrain (support set) and a Dtest (query set); meta-test: held-out test tasks, again with Dtrain and Dtest.
Bilevel: on every train task, the learner adapts 𝛳 to task parameters ɸ on Dtrain; the meta-learner updates 𝛳’ from ℒmeta(ℒT1, ℒT2, …) over the task losses ℒ(ɸ, Dtest). At meta-test time, 𝛳* (which may encode an initial ɸ, an optimizer, a metric, …) is finetuned on the test task’s Dtrain and evaluated via ℒ(ɸ, Dtest).
48
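For concreteness, a minimal N-way k-shot episode sampler matching the 2-way 3-shot setup in the figure (assuming data is pre-grouped per class; names are illustrative):

```python
import random

def sample_episode(data_by_class, n_way=2, k_shot=3, n_query=2):
    """Build one few-shot task: a support set and a query set."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, c in enumerate(classes):          # relabel classes 0..n_way-1
        examples = random.sample(data_by_class[c], k_shot + n_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query                        # D_train (support), D_test (query)
```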
Example: few-shot classification, black box
On every train task, the meta-learner g maps the support set Dtrain directly to model parameters ɸ; 𝛳’ is updated from ℒmeta(ℒT1, ℒT2, …) over the tasks’ ℒ(ɸ, Dtest). At meta-test time, g produces ɸ* for each test task from its Dtrain, without further training.
49
Example: meta-reinforcement learning (Yu et al. 2019)
Bilevel: the inner loop trains an RL agent over episodes of each train task (initial state: randomized object and goal positions); the outer loop (meta-training) updates 𝛳’ from Rmeta(RT1, RT2, …) via meta-policy gradients. 𝛳* can encode a fixed objective, e.g. which ɸ to learn (a prior), with the effective policy 𝝅ɸ modulated by ɸ. At meta-test time, 𝛳* is finetuned on episodes of the test tasks (other initial states or related tasks) and evaluated via R(𝝅ɸ, Dtest).
50
Example: meta-reinforcement learning, black box
A (fast) RL agent implemented as an RNN learns within each train task, over train episodes and a few test episodes; the meta-RL algorithm (the slow outer loop) updates the RNN weights 𝛳’ across tasks from Rmeta(RT1, RT2, …). On test tasks (other initial states or related tasks), the RNN adapts in its hidden state ɸ and is evaluated via R(𝝅ɸ, Dtest).
51
Taxonomy of meta-learning methods (Hospedales et al. 2020)
Like base-learners, meta-learners consist of a representation, an objective, and an optimizer:
• meta-parameters 𝛳 (the representation): a parameter initialization ɸ0 (learned instead of random), an optimizer (a learned update rule instead of plain SGD/back-propagation), a black-box model, an embedding (metric learning), hyperparameters λ, an architecture, a loss ℒ / reward R, …
• meta-objective ℒmeta: many vs few shots, multi vs single task, fast adaptation vs final performance, offline vs online
• meta-optimizer Fmeta: gradient descent, reinforcement learning, evolution
We won’t exhaustively discuss all existing combinations; let’s focus on understanding a few key strategies (Huisman et al. 2020).
54
Gradient-based methods: learning ɸinit
• 𝛳 (the prior) is the model initialization: ɸinit = 𝛳 (a learned ɸ0 instead of a random ɸ0)
• Learn a representation suitable for many tasks drawn from p(T) (e.g. a pretrained CNN), maximizing rapid learning
• Each meta-train task i yields task-adapted parameters ɸ’i through an update algorithm u: ɸi = u(𝛳, Di,train), trained in the inner loop via ∇ɸ
• The outer loop updates 𝛳’ to minimize ℒmeta = ∑i ℒ(ɸi, Di,test)
• Meta-test: finetune 𝛳* on Tasknew (in a few steps): ɸ’new = u(𝛳*, Dnew,train)
Can be seen as bilevel optimization: 𝛳* = argmin𝛳 ∑i ℒi(ɸi, Di,test) with ɸi = u(𝛳, Di,train)
55
Model agnostic meta-learning (MAML) (Finn et al. 2017)
Meta-training:
• Given the current initialization 𝛳 and model f, sample i tasks; on each, perform k SGD steps to find ɸi*, then evaluate ∇θℒi(fɸi*) (the figure shows i=3, k=3: θ1 → θ2 → θ3 = ɸi*)
• Update the task-specific parameters: ϕi = θ − α∇θℒi(fϕi*)
• Update 𝛳 to minimize the sum of per-task test-set losses, and repeat:
θ ← θ − β∇θ ∑i ℒi(fϕi) = θ − β∇θ ∑i ℒi(f(θ − α∇θℒi(fϕi*)))
The meta-gradient is a second-order gradient (the derivative of the test-set loss): it computes how changes in 𝛳 affect the gradient at the new 𝛳, and is backpropagated through the inner updates. α, β are learning rates.
Meta-testing: given the training data Dtrain of a new task and the pre-trained parameters 𝛳*, finetune: ϕ = θ* − α∇θℒ(fθ)
56
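A minimal MAML sketch in PyTorch, assuming sine-wave regression tasks, a tiny functional network, and a single inner SGD step (all task and network details are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def sample_task():
    """Hypothetical task family: regress y = A*sin(x + p) with random A, p."""
    A, p = torch.rand(1) * 4.9 + 0.1, torch.rand(1) * 3.14
    def draw(n=10):
        x = torch.rand(n, 1) * 10 - 5
        return x, A * torch.sin(x + p)
    return draw

def net(params, x):
    """A tiny functional 1-40-1 MLP, so adapted parameters can be evaluated."""
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

params = [(0.1 * torch.randn(1, 40)).requires_grad_(),
          torch.zeros(40, requires_grad=True),
          (0.1 * torch.randn(40, 1)).requires_grad_(),
          torch.zeros(1, requires_grad=True)]
meta_opt = torch.optim.Adam(params, lr=1e-3)  # outer loop (learning rate beta)
alpha = 0.01                                  # inner-loop learning rate

for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                        # meta-batch of tasks
        draw = sample_task()
        (x_tr, y_tr), (x_te, y_te) = draw(), draw()        # D_train, D_test
        grads = torch.autograd.grad(F.mse_loss(net(params, x_tr), y_tr),
                                    params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]  # phi_i
        F.mse_loss(net(adapted, x_te), y_te).backward()    # 2nd-order meta-gradient
    meta_opt.step()                           # update the initialization theta
```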
Model agnostic meta-learning (MAML)
• Example for reinforcement learning: the goal is to reach a certain velocity in a certain direction
[Figure: adaptation curves for oracle, MAML, and pretrained]
Source: Finn et al. 2017
57
Other gradient-based techniques
Changing the update rule yields different variants; all use second-order gradients and meta-learn a transformation of the gradient for better adaptation (w: a weight per parameter):
• MAML¹⁶: ϕi = θ − α∇θℒi(fϕi*), with θ ← θ − β∇θ ∑i ℒi(fϕi)
• MetaSGD²: ϕi = θ − α diag(w)∇θℒi(fϕi*)
• Tnet³: ϕi = θ − α∇θℒi(fϕi*, w)
• Meta-curvature⁴: ϕi = θ − α B(θ, w)∇θℒi(fϕi*)
• WarpGrad⁵: ϕi = θ − α P(θ, ϕi)∇θℒi(fϕi*)
• Online MAML (Follow the Meta Leader)⁷: minimizes regret; robust, but the computation costs grow over time
¹ Finn et al. 2017; ² Li et al. 2017; ³ Lee et al. 2018; ⁴ Park et al. 2019; ⁵ Flennerhag et al. 2019; ⁶ Antoniou et al. 2019; ⁷ Finn et al. 2019
Figure source: Flennerhag et al. 2019 (MAML vs WarpGrad)
58
Scalability
• Backpropagating derivatives of ɸi w.r.t. 𝛳 is compute- and memory-intensive (for large models)
• First-order approximations of MAML:
• First-order MAML (FOMAML)¹ uses only the last inner gradient update: θ ← θ − β∇ ∑i ℒi(fϕi*)
• Reptile²: iterate over tasks, run SGD to ϕi*, and update 𝛳 in the direction of ϕi* − θ: θ ← θ − β(ϕi* − θ)
• Implicit gradients offer another approximation³
[Figure: inner SGD trajectory θ1 → θ2 → θ3 → θ4 = ϕi*, with the FOMAML and Reptile update directions; single-task FOMAML meta-gradient ∇θ ∑i ℒi(ϕi)]
¹ Finn et al. 2017; ² Nichol et al. 2018; ³ Rajeswaran et al. 2019
59
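A minimal Reptile sketch, reusing the `net`, `params`, and `sample_task` from the MAML sketch above; no second-order gradients are needed:

```python
import torch
import torch.nn.functional as F

beta, alpha, k = 0.1, 0.01, 5           # outer step, inner step, inner iterations

for step in range(1000):
    draw = sample_task()
    x, y = draw()
    phi = [p.detach().clone().requires_grad_() for p in params]
    for _ in range(k):                  # k plain SGD steps on one task
        loss = F.mse_loss(net(phi, x), y)
        grads = torch.autograd.grad(loss, phi)
        phi = [(p - alpha * g).detach().requires_grad_() for p, g in zip(phi, grads)]
    with torch.no_grad():               # theta <- theta + beta * (phi* - theta)
        for p, f_ in zip(params, phi):
            p += beta * (f_ - p)
```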
Generalizability
• MAML is more resilient to overfitting than many other meta-learning techniques¹
• Its effectiveness seems mainly due to feature reuse (finding a good 𝛳)²: fine-tuning only the last layer is equally good
• On few-shot learning benchmarks, a good embedding outperforms most meta-learning³: learn a representation on the entire meta-dataset (merged into a single task), then train a linear classifier on the embedded few-shot Dtrain and predict Dtest
• For meta-RL, also learn how to explore (how to sample new environments): E-MAML adds exploration to the meta-objective (allowing longer-term goals)⁴
¹ Finn et al. 2018; ² Raghu et al. 2020; ³ Tian et al. 2020; ⁴ Stadie et al. 2019
Figure source: Tian et al. 2020
60
Bayesian meta-learning (Grant et al. 2018)
Can meta-learning reason about uncertainty in the task distribution?
Point estimates: on meta-train tasks Ti = {Dtrain, Dtest}, a meta-model gθ produces ϕ ← gθ((x, y)train), predictions ŷtest ← fϕ(xtest), and updates θ ← θ − ∇θℒ(ytest); at meta-test (finetuning) on Tj, ϕj ← gθ((x, y)train) and ytest ← fϕj(xtest).
Bayesian view: keep distributions instead of point estimates: p(ϕi | (x, y)train) per task and a posterior p(θ) ← p(θ | (x, y)train) over meta-parameters; at meta-test, p(ϕj | (x, y)train, θ) yields a predictive distribution p(ytest | xtest).
61
Fully Bayesian meta-learning
• Use approximation methods to represent the uncertainty over ϕi:
• Sampling techniques + variational inference: PLATIPUS¹, BMAML², ABML³
• Laplace approximation: LLAMA⁴
• Variational approximation of the posterior: Neural Statistician⁵, Neural Processes⁶
• What if our tasks are not IID? Impose additional structure, e.g. with a task-specific variable z⁷ (mini-ImageNet with filters for non-homogeneous tasks)
¹ Finn et al. 2019; ² Kim et al. 2018; ³ Ravi et al. 2019; ⁴ Grant et al. 2018; ⁵ Edwards et al. 2017; ⁶ Garnelo et al. 2018; ⁷ Jerfel et al. 2019
Figure source: Jerfel et al. 2019
62
Meta-learning optimizers
• Our brains probably don’t do backprop¹; instead:
• Fast weights²: networks that continuously modify the weights of another network
• Gradient-based³⁴: parameterize the update rule with a neural network gθ, e.g. ϕ’i = ϕi − η gθ(∇ℒ), and learn the meta-parameters θ across tasks by gradient descent (the meta-gradient)
• Represent the update rule as an LSTM⁴⁵⁶ or a hierarchical RNN⁷
¹ Bengio et al. 1995; ² Schmidhuber 1992; ³ Runarsson and Jonsson 2000; ⁴ Hochreiter 2001; ⁵ Andrychowicz et al. 2016; ⁶ Ravi and Larochelle 2017; ⁷ Wichrowska et al. 2017
63
Meta-learning optimizers (continued)
• Meta-learned (RNN) optimizers ‘rediscover’ momentum, gradient clipping, learning rate schedules, learning rate adaptation, …¹
• RL-based optimizers: represent updates as a policy, learned using guided policy search²
• Combined with MAML: learn per-parameter learning rates³⁴, or learn a preconditioning matrix (to ‘warp’ the loss surface)⁵⁶
• Black-box optimizers: meta-learned with an RNN⁷, or with user-defined priors⁸
• Speed up backpropagation by meta-learning sparsity and weight sharing⁹
¹ Maheswaranathan et al. 2020; ² Ke and Malik 2016; ³ Li et al. 2017; ⁴ Antoniou et al. 2019; ⁵ Park et al. 2019; ⁶ Flennerhag et al. 2019; ⁷ Chen et al. 2017; ⁸ Souza et al. 2020; ⁹ Kirsch et al. 2020
Figure source: Maheswaranathan et al. 2020 (learning rate schedules, learning rate adaptation)
64
Metric learning (Vinyals et al. 2016; Sung et al. 2018)
Learn an embedding network g𝛳 that transforms the data {Dtrain, Dtest} across all tasks to a representation that allows easy similarity comparison.
Can be seen as a simple black-box meta-model; often non-parametric (independent of ɸ).
65
Prototypical networks (Snell et al. 2017)
• Use an embedding function fθ to encode each data point
• Define a prototype vc for every class c based on the training examples D(c)train of that class:
vc = 1/|D(c)train| ∑(xi,yi)∈D(c)train fθ(xi)
• The class distribution for an input x is based on the inverse distance between x and the prototypes:
p(y = c | x) = softmax(−dϕ(fθ(x), vc))
• The distance function dϕ can be any differentiable distance, e.g. the squared Euclidean distance
• Loss function to learn the embedding: ℒ(θ) = −log pθ(y = c | x)
Figure source: Snell et al. 2017
66
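A minimal episode loss for prototypical networks in PyTorch, assuming an embedding module `f_theta` and integer class labels 0..n_classes−1 (squared Euclidean distance, as above):

```python
import torch
import torch.nn.functional as F

def proto_loss(f_theta, x_support, y_support, x_query, y_query, n_classes):
    z_s, z_q = f_theta(x_support), f_theta(x_query)        # embed both sets
    protos = torch.stack([z_s[y_support == c].mean(dim=0)  # prototypes v_c
                          for c in range(n_classes)])
    dists = torch.cdist(z_q, protos) ** 2                  # squared Euclidean
    log_p = F.log_softmax(-dists, dim=1)                   # p(y=c|x) from -distance
    return F.nll_loss(log_p, y_query)                      # -log p(y=c|x)
```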
Metric learning
Quite a few other techniques exist:
• Siamese neural networks¹
• Graph neural networks²: also applicable to semi-supervised and active learning
• Attentive Recurrent Comparators³: compare inputs not as a whole but by parts (e.g. image patches)
• MetaOptNet⁴: learns embeddings such that linear models can distinguish between classes
Overall: fast at test time, although pair-wise comparisons limit the task size; mostly limited to few-shot supervised tasks; fails when test tasks are more distant, since there is no way to adapt.
¹ Koch et al. 2015; ² Garcia et al. 2017; ³ Shyam et al. 2017; ⁴ Lee et al. 2019
67
Black-box model for meta-RL (Wang et al. 2016; Duan et al. 2018)
A (fast) RL agent implemented as an RNN learns within each task over train and (few) test episodes; the meta-RL algorithm (the outer loop) updates the RNN weights 𝛳’ across training tasks from Rmeta(RT1, RT2, …), and is evaluated on test tasks via R(𝝅ɸ, Dtest).
• RNNs serve as dynamic task-embedding storage: previous inputs are stored in the hidden state
• Maximize the expected reward in each trial
• Very expressive, and performs very well on short tasks
• Longer horizons are an open challenge
Image source: Duan et al. 2018
68
Other black-box models
• Memory-augmented NNs¹: use neural Turing machines (short-term + long-term memory)
• Meta Networks²: a meta-learner that returns ‘fast weights’ for itself and for the base network solving the task
• Simple Neural Attentive Meta-Learner (SNAIL)³: aims to overcome the memory limitations of RNNs with a series of 1D convolutions
¹ Santoro et al. 2016; ² Munkhdalai et al. 2017; ³ Mishra et al. 2018
Image sources: Munkhdalai et al. 2017 (Meta Networks), Mishra et al. 2018 (SNAIL)
69
What can we learn to learn? (3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
[Diagram: Task 1, Task 2, Task 3 (Learning) → Tasknew (Learning); experience suggests new tasks (of own choosing); inductive bias is carried over]
Clune 2019
70
• Ultimately, meta-learning translates constraints on the learner into constraints on the data: the biases we don’t put in manually have to be learnable from the data
• Can we automatically create new tasks to inform and challenge our meta-learners?
• Paired open-ended trailblazer (POET): evolves a parameterized environment 𝛳E for an agent 𝛳A (Wang et al. 2019; Wang et al. 2020)
• Select agents that can solve the challenges AND evolve environments so that they remain solvable
• Training task acquisition: the best-performing agents move on, which creates multiple overlapping curricula
Figure source: Wang et al. 2019
71
Does POET scale? Increasingly difficult 3D terrain, 18 degrees of freedom. (Zhou & Vanschoren, 2021)
72
Generative Teaching Networks (Such et al. 2019)
• Based on an existing dataset, generate synthetic training data for more efficient training
• Like dataset distillation, but uses meta-learning to update the generator model
• While POET has limited expressivity (limited to 𝛳E), GTNs could produce all sorts of training datasets and environments
• While being careful not to generate noise: the generator needs some grounding in reality
73
Thank you
Image source: Pesah et al. 2018 74

Meta learning tutorial

  • 1.
    Joaquin Vanschoren (TUEindhoven) Meta-Learning A tutorial Cover art: MC Escher, Ustwo Games
  • 2.
    Intro: humans caneasily learn from a single example 2 thanks to years of learning (and eons of evolution)
  • 3.
    Intro: humans caneasily learn from a single example 2 thanks to years of learning (and eons of evolution)
  • 4.
    Intro: humans caneasily learn from a single example Canna Indica ‘Picasso’ 2 thanks to years of learning (and eons of evolution)
  • 5.
    Intro: humans caneasily learn from a single example Canna Indica ‘Picasso’ ? later train test 2 thanks to years of learning (and eons of evolution)
  • 6.
    Can a computerlearn from a single example? e.g. ResNet50 (23 million trainable parameters) ImageNet Canna Indica `Picasso’ That won’t work :) Humans also don’t start from scratch. 3
  • 7.
    Can a computerlearn from a single example? e.g. ResNet50 (23 million trainable parameters) ImageNet Canna Indica `Picasso’ That won’t work :) Humans also don’t start from scratch. 3
  • 8.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Canna Indica `Picasso’ Source task Target task 4
  • 9.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Canna Indica `Picasso’ Source task Target task 4
  • 10.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Finetune Canna Indica `Picasso’ Source task Target task 4
  • 11.
    e.g. ResNet50 (23 milliontrainable parameters) Transfer learning? ImageNet (14 million images) Pretrain Finetune A single source task (e.g. ImageNet) may not generalize well to the test task Canna Indica `Picasso’ Source task Target task 4
  • 12.
    Meta-learning Learn over aseries (or distribution) of many di ff erent tasks/episodes Inductive bias: learn assumptions that you can to transfer to new tasks Prepare yourself to learn new things faster task 1 task 2 task 1000 new task Canna Trumpet vine … inductive bias 5
  • 13.
    Meta-learning Learn over aseries (or distribution) of many di ff erent tasks/episodes Inductive bias: learn assumptions that you can to transfer to new tasks Prepare yourself to learn new things faster task 1 task 2 task 1000 new task Canna Trumpet vine … Useful in many real-life situations: rare events, test-time constraints, data collection costs, privacy issues,… inductive bias 5
  • 14.
    Inspired by humanlearning We don’t transfer from a single source task, we learn across many, many tasks year 1 year 2 year 3 year 4 We have a‘drive’to explore new, challenging, but doable, fun tasks 6
  • 15.
    Human-like Learning*** Task 1 Learning humanslearn across tasks: less trial-and-error, less data, less compute new tasks should be related to experience (doable, fun, interesting?) Task 2 Learning Task 3 Learning new tasks (of own choosing) Tasknew Learning experience Lake et al. 2017 key aspects of fast learning: compositionality, causality, learning to learn Kemp et al. 2007 inductive bias 7
  • 16.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? 8 which assumptions do we make?
  • 17.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? 8 which assumptions do we make?
  • 18.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? 8 which assumptions do we make?
  • 19.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? Common mistakes one-to-one bias: assume that every word is one color 8 which assumptions do we make?
  • 20.
    Inductive bias (inlanguage) Training Lake et al. 2017 dax zup lug wif lug blicket wif wif blicket dax Test dax blicket zup? lug kiki wif wif kiki dax wif blicket dax kiki lug? Common mistakes one-to-one bias: assume that every word is one color concatenation bias: assume that order is always left-to-right 8 which assumptions do we make?
  • 21.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 22.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 23.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 24.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 25.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 26.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 27.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 28.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? zup? zup zup? blicket wif zup? Lake et al. 2019 9
  • 29.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) zup? zup zup? blicket wif zup? zup dax / Lake et al. 2019 9
  • 30.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) concatenation bias: assume that order is always left-to-right zup? zup zup? blicket wif zup? zup dax / zup dax Lake et al. 2019 9
  • 31.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) concatenation bias: assume that order is always left-to-right zup? zup zup? blicket wif zup? zup dax / zup dax mutual exclusivity: if object has a name, it doesn’t need another name zup dax Lake et al. 2019 9
  • 32.
    What if thereis no training data? Item pool Lake et al. 2017 we can still solve problems by making assumptions dax zup? zup tufa? Test zup wif zup? zup wif blicket? Commonly used assumptions: one-to-one bias: assume that every word is one color (and not a function or something else) concatenation bias: assume that order is always left-to-right zup? zup zup? blicket wif zup? zup dax / zup dax mutual exclusivity: if object has a name, it doesn’t need another name zup dax These assumptions (inductive biases) are necessary for learning quickly Humans assume that words have consistent meanings and follow input/output constraints Lake et al. 2019 9
  • 33.
    Meta-learning inductive biases Task1 model Task 2 Tasknew Lake et al. 2017 Capture useful assumptions from the data - that can often not be easily expressed dax zup lug wif wif zup dax? lug dax lug zup? dax wif lug? wif wif dax? dax zup lug wif dax dax? wif lug dax? wif lug lug dax? zup zup zup? state model state dax zup lug wif dax dax? lug lug dax dax? wif lug lug dax? zup wif zup? Lake et al. 2019 10 model
  • 34.
    Meta-learning goal Task 1 Learning learnminimal inductive biases from prior tasks instead of constructing manual ones should still generalize well (otherwise you meta-over fi t) Task 2 Learning Task 3 Learning Tasknew Learning Inductive bias: any assumptions added to training data to learn more e ff ectively. E.g: • Instead of general model architectures, learn better architectures (and hyperparameters) • Instead of starting from random weights, learn good initial weights • Instead of standard loss/reward function, learn a better loss/reward function (prior) 11 inductive bias
  • 35.
    Task 1 Learning Task 2 Learning Task3 Learning Tasknew Learning What can we learn to learn? 1. architectures / pipelines (hyper-parameters) Clune 2019 part 1 new tasks (of own choosing) experience 2. learning algorithms (model parameters) 3. learning environments (task generation) inductive bias 12 part 2 part 3 3 pillars
  • 36.
    Task 1 Learning Task 2 Learning Task3 Learning Tasknew Learning What can we learn to learn? 1. architectures / pipelines (hyper-parameters) Clune 2019 part 1 new tasks (of own choosing) experience 2. learning algorithms (model parameters) 3. learning environments (task generation) inductive bias 13 part 2 part 3 3 pillars
  • 37.
    Task 1 Model Learning algorithm ɸ* =argmaxɸ log p(ɸ | T ) Task: distribution of samples q(x) outputs y, loss ℒ(x,y ) Model: fɸ(x) = y’ Task 2 Model Learning algorithm !T1(fɸ1,λ(x),y) !T2 x,y x,y training ∇ɸ ɸ’1 ɸ’2 = argminɸ ℒ(fɸ,λ(x),y ) fɸ1(x) fɸ2(x) for reinforcement learning: q(x1) + transition q(xt+1| xt, yt ) ℒ = (neg) reward Terminology Learner: model parameters ɸ, hyper-parameters λ optimizer λ1 λ2 14
  • 38.
    Task 1 Model Learning algorithm ɸ* =argmaxɸ log p(ɸ | T ) Task: distribution of samples q(x) outputs y, loss ℒ(x,y ) Model: fɸ(x) = y’ Task 2 Model Learning algorithm !T1(fɸ1,λ(x),y) !T2 x,y x,y training ∇ɸ ɸ’1 ɸ’2 = argminɸ ℒ(fɸ,λ(x),y ) fɸ1(x) fɸ2(x) for reinforcement learning: q(x1) + transition q(xt+1| xt, yt ) ℒ = (neg) reward Terminology Learner: model parameters ɸ, hyper-parameters λ optimizer λ1 λ2 14 Meta-learning algorithm g meta-parameters 𝛳 (encode some inductive bias)
  • 39.
    Task 1..N Models performance AutoMLModels Models Closely related to Automated Machine Learning (AutoML) But: meta-learn how to design architectures/pipelines and tune hyperparameters Human data scientists also learn from experience Models Models Models Learning hyperparameters Hutter et al. 2019 λ New task Models performance self-learning AutoML Models λ bias (priors, meta-knowledge, human priors) Search space can be huge!
  • 40.
    Meta-learning for AutoML:how? Learning hyperparameter priors Warm starting (what works on similar tasks?) start randomly start with good candidates Meta-models (learn how to build models/components) Complex hyperparameter space Simple hyperparameter space Vanschoren 2018 Task λ, scores λ, scores Learner Learner metadata hyperparameters = architecture + hyperparameters Task λ, scores Learner metadata Task Task
  • 41.
    Observation: current AutoML stronglydepends on learned priors Complex hyperparameter space Simple hyperparameter space observation
  • 42.
    Manual architecture priors •Most successful pipelines have a similar structure • Can we meta-learn a prior over successful structures? Ensembling/ stacking Figure source: Feurer et al. 2015
  • 43.
    Manual architecture priors •Most successful pipelines have a similar structure • Can we meta-learn a prior over successful structures? + smaller search space - you can’t learn entirely new architectures Ensembling/ stacking Figure source: Feurer et al. 2015
  • 44.
    Manual architecture priors autosklearnFeurer et al. 2015 autoWEKA Thornton et al. 2013 hyperopt-sklearn Komer et al. 2014 AutoGluon-Tabular Erickson et al. 2020 • Most successful pipelines have a similar structure • Can we meta-learn a prior over successful structures? + smaller search space - you can’t learn entirely new architectures Ensembling/ stacking Figure source: Feurer et al. 2015
  • 45.
    Successful deep networksoften have repeated motifs (cells) e.g. Inception v4: Manual architecture priors Szegedy Figure source: Szegedy et al 2016
  • 46.
    Google NASNet Zophet al 2018 Compositionality: learn hierarchical building blocks to simplify the task •learn parameterized building blocks (cells) •stack cells together in macro- architecture + smaller search space + cells can be learned on a small dataset & transferred to a larger dataset -strong domain priors, doesn’t generalize well Cell search space Cell search space prior Figure source: Elsken et al., 2019
  • 47.
    Google NASNet Zophet al 2018 Compositionality: learn hierarchical building blocks to simplify the task •learn parameterized building blocks (cells) •stack cells together in macro- architecture + smaller search space + cells can be learned on a small dataset & transferred to a larger dataset -strong domain priors, doesn’t generalize well Cell search space Can we meta-learn hierarchies / components that generalize better? Cell search space prior Figure source: Elsken et al., 2019
  • 48.
    If you constrainthe search space enough, you can get SOTA results with random search! Li & Talwalkar 2019 Yu et al. 2019 Real et al. 2019 Cell search space prior • Strong prior! Figure source: Li & Talwalkar., 2019
  • 49.
    Weight-agnostic neural networks •ALL weights are shared • Only evolve the architecture? • Minimal description length • Baldwin e ff ect? Manual priors: Weight sharing Gaier & Ha, 2019 Figure source: Gaier & HA., 2019
  • 50.
    Learning hyperparameter priors Complex hyperparameter space Simple hyperparameter space λ, scores Learner
  • 51.
    1 van Rijn& Hutter 2018 ResNets for image classi fi cation • Functional ANOVA 1 • Select hyperparameters that cause variance in the evaluations. • Useful to speed up black-box optimization techniques Learn hyperparameter importance Figure source: van Rijn & Hutter, 2018
  • 52.
    • Tunability 1,2,3 Learngood defaults, measure importance as improvement via tuning 1 Probst et al. 2018 2 Weerts et al. 2018 3 van Rijn et al. 2018 Learn defaults + hyperparameter importance Learned defaults Tuning risk
  • 53.
    • Start witha few (random) hyperparameter con fi gurations • Fit a surrogate model to predict other con fi gurations • Probabilistic regression: mean and standard deviation (blue band) • Use an acquisition function to trade o ff exploration and exploitation, e.g. Expected Improvement (EI) • Sample for the best con fi guration under that function μ σ Bayesian Optimization (interlude) Mockus, 1974 μ μ + σ μ − σ value of λi performance
  • 54.
    • Repeat untilsome stopping criterion: • Fixed budget • Convergence • EI threshold • Theoretical guarantees • Also works for non-convex, noisy data • Used in AlphaGo Srinivas et al. 2010, Freitas et al. 2012, Kawaguchi et al. 2016 Bayesian Optimization Figure source: Shahriari 2016
  • 55.
Learn basis expansions for hyperparameters (Perrone et al. 2018)
• Hyperparameters can interact in very non-linear ways
• Use a neural net to learn a suitable basis expansion φz(λ) for all tasks, trained on lots of prior (λ, score) data (e.g. from OpenML)
• On top of φz(λ) you can use Bayesian linear surrogates instead of Gaussian process surrogates; this transfers information on the configuration space
Surrogate model transfer
• If task tj is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up the predictions of all Sj, weighted by task similarity (as in active testing) 1: S = Σj wj Sj
• Or build a combined Gaussian process, weighted by current performance on the new task 2
1 Wistuba et al. 2018  2 Feurer et al. 2018
A sketch of the weighted combination follows below.
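A minimal sketch of the weighted sum S = Σj wj Sj; weighting each prior surrogate by the correlation between its predictions and the few observations on the new task is an illustrative stand-in for the similarity weights, not the exact scheme of Wistuba et al. 2018.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Toy prior tasks: each surrogate S_j maps a 1-D configuration to a score.
prior_surrogates = []
for optimum in [0.2, 0.5, 0.8]:
    X = rng.uniform(0, 1, (20, 1))
    y = np.exp(-(X[:, 0] - optimum) ** 2 / 0.05)
    prior_surrogates.append(GaussianProcessRegressor().fit(X, y))

# A few observations on the new task (its optimum is near 0.55).
new_X = rng.uniform(0, 1, (4, 1))
new_y = np.exp(-(new_X[:, 0] - 0.55) ** 2 / 0.05)

# Weight each prior surrogate by how well it predicts the new observations.
weights = [max(np.corrcoef(S.predict(new_X), new_y)[0, 1], 0.0)
           for S in prior_surrogates]
w = np.array(weights)
w = w / w.sum() if w.sum() > 0 else np.full(len(w), 1 / len(w))

# Combined surrogate: S(cand) = sum_j w_j * S_j(cand)
cand = np.linspace(0, 1, 101).reshape(-1, 1)
combined = w @ np.stack([S.predict(cand) for S in prior_surrogates])
print("weights:", w.round(2), "suggested lambda:", cand[np.argmax(combined), 0])
```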
Surrogate model transfer (Manolache & Vanschoren 2019)
• Store a surrogate model Sij for every pair of prior task i and algorithm j
• Simpler surrogates, better transfer
• Learn a weighted ensemble on the new task -> significant speed-up in optimization
Warm starting (what works on similar tasks?)
Instead of starting the search randomly, use metadata from prior tasks to start with good candidate configurations.
How to measure task similarity?
• Hand-designed (statistical) meta-features that describe (tabular) datasets 1
• Task2Vec: task embedding for image data 2
• Optimal transport: similarity measure based on comparing probability distributions 3
• Metadata embedding based on textual dataset descriptions 4
• Dataset2Vec: compares batches of datasets 5
• Distribution-based invariant deep networks 6
1 Vanschoren 2018  2 Achille et al. 2019  3 Alvarez-Melis et al. 2020  4 Drori et al. 2019  5 Jomaa et al. 2020  6 de Bie et al. 2020
Figure source: Alvarez-Melis et al. 2020
Warm-starting with kNN
• Find the k most similar tasks, then warm-start the search with the best configurations λ1..k found on them
• Auto-sklearn: warm-started Bayesian optimization (SMAC)
• Meta-learning yields better models, faster; winner of the AutoML Challenges (Feurer et al. 2015)
Figure source: Feurer et al., 2015
A minimal sketch is shown below.
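A minimal sketch of the kNN warm-start, assuming tasks are described by simple meta-feature vectors (hypothetically: number of instances, number of features, minority-class fraction) and that we stored the best configuration per prior task.

```python
import numpy as np

def warm_start(new_mf, meta_features, best_configs, k=2):
    dist = np.linalg.norm(meta_features - new_mf, axis=1)  # task (dis)similarity
    nearest = np.argsort(dist)[:k]                         # k most similar prior tasks
    return [best_configs[i] for i in nearest]              # initial candidates for BO

# Hypothetical meta-features: (n_instances, n_features, minority-class fraction).
meta_features = np.array([[1000, 10, 0.20], [1200, 12, 0.25], [50, 3, 0.45]])
best_configs = [{"C": 1.0}, {"C": 0.5}, {"C": 100.0}]      # best lambda per prior task
print(warm_start(np.array([1100, 11, 0.22]), meta_features, best_configs))
```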
Probabilistic Matrix Factorization (Fusi et al. 2017)
• Collaborative filtering: configurations λi are 'rated' by tasks tj
• Learn a latent representation for tasks tj and configurations λi from the performance matrix Pi,j
• Use meta-features to warm-start on a new task tnew with λ1..k
• Returns probabilistic predictions p(P|λi), useful for Bayesian optimization
Figure source: Fusi et al., 2017
DARTS: Differentiable NAS (Liu et al. 2018)
• Fixed (one-shot) structure; learn which operators to use (e.g. convolution, max pooling, zero)
• Give every operator a weight αi
• Optimize the operator weights αi and the model weights ωj with bilevel optimization
• Approximate ωj*(αi) by adapting ωj after every training step: interleaved optimization of αi and ωj with SGD
Figure source: Liu et al., 2018
A first-order sketch of this interleaving is given below.
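A first-order sketch of the interleaved updates on a toy regression problem; the three candidate operations, the random data, and the use of plain alternation (rather than the paper's second-order approximation of ω*(α)) are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# One mixed edge choosing among three operations via softmax(alpha).
ops = torch.nn.ModuleList([torch.nn.Linear(4, 4), torch.nn.Identity(),
                           torch.nn.Linear(4, 4, bias=False)])
alpha = torch.zeros(len(ops), requires_grad=True)       # architecture weights
opt_w = torch.optim.SGD(ops.parameters(), lr=0.1)       # operation weights omega
opt_a = torch.optim.Adam([alpha], lr=0.01)

def mixed_op(x):
    weights = torch.softmax(alpha, dim=0)               # continuous relaxation
    return sum(wi * op(x) for wi, op in zip(weights, ops))

x_train, y_train = torch.randn(32, 4), torch.randn(32, 4)
x_val, y_val = torch.randn(32, 4), torch.randn(32, 4)

for step in range(100):
    # 1) update architecture alpha on validation data (outer problem)
    opt_a.zero_grad()
    F.mse_loss(mixed_op(x_val), y_val).backward()
    opt_a.step()
    # 2) update operation weights omega on training data (inner problem)
    opt_w.zero_grad()
    F.mse_loss(mixed_op(x_train), y_train).backward()
    opt_w.step()

print("operator mixture:", torch.softmax(alpha, dim=0).detach().numpy())
```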
Warm-started DARTS (Grobelnik and Vanschoren, 2021)
• Warm-start DARTS with architectures that worked well on similar problems
• Slightly better performance, but much faster (5x)
Meta-models (learn how to build models/components)
Train a meta-learner on (λ, score) metadata collected across many tasks, and use it to propose models or components for a new task.
Algorithm selection models
• Learn a direct mapping between meta-features mj and performance Pi,j
• Zero-shot meta-models: predict the best λi given meta-features 1
• Ranking models: return a ranking λ1..k 2
• Predict which algorithms / configurations Λ to consider / tune 3
• Predict performance / runtime for a given configuration θi and task 4
• Can be integrated in larger AutoML systems: warm-starting, guiding the search,…
1 Brazdil et al. 2009, Lemke et al. 2015  2 Sun and Pfahringer 2013, Pinto et al. 2017  3 Sanders and Giraud-Carrier 2017  4 Yang et al. 2018
Learning model components
• Learn nonlinearities: RL-based search over a space of likely useful activation functions 1
• E.g. Swish can outperform ReLU 1: Swish(x) = x / (1 + e^(-βx))
• Learn optimizers: RL-based search over a space of likely useful update rules 2
• E.g. PowerSign can outperform Adam and RMSProp 2: e^(sign(g)·sign(m)) · g, with gradient g and moving average m
• Learn acquisition functions for Bayesian optimization 3
1 Ramachandran et al. 2017  2 Bello et al. 2017  3 Volpp et al. 2020
Figure source: Ramachandran et al., 2017 (top), Bello et al. 2017 (bottom)
Both learned components are small enough to state directly; see the sketch below.
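A sketch of the two formulas, with α = e for PowerSign as is common; the demo values are arbitrary.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

def powersign_update(g, m, alpha=np.e):
    # Scale the gradient up when its sign agrees with the moving average m,
    # and down when it disagrees: alpha^(sign(g) * sign(m)) * g
    return alpha ** (np.sign(g) * np.sign(m)) * g

x = np.linspace(-3, 3, 7)
print(swish(x))                                             # smooth, non-monotonic near zero
print(powersign_update(np.array([0.5, -0.5]), np.array([0.4, 0.4])))
```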
Monte Carlo Tree Search + reinforcement learning
MOSAIC (Rakotoarison et al. 2019), AlphaD3M (Drori et al. 2019)
• Self-play: the game actions insert, delete, or replace components in a pipeline
• Monte Carlo Tree Search builds pipelines given action probabilities, with a grammar to avoid invalid pipelines
• A neural network (LSTM) predicts pipeline performance (can be pre-trained on prior datasets)
Figure source: Drori et al., 2019
Meta-Reinforcement Learning for NAS (Gomez & Vanschoren, 2019)
Results on increasingly difficult tasks (omniglot, vgg_flower, dtd):
• Initially slower than DQN, but faster after a few tasks
• Policy entropy shows learning/re-learning
MetaNAS: MAML + Neural Architecture Search (Elsken et al., 2020)
• Combines gradient-based meta-learning (REPTILE) with NAS
• During meta-training, it optimizes the meta-architecture (DARTS weights) along with the meta-parameters (initial weights) 𝛳
• During meta-testing, the architecture can be adapted to the novel task through gradient descent
Figure source: Elsken et al., 2020
What can we learn to learn? (Clune 2019: 3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
The inductive bias gained from experience across tasks transfers to new tasks (of our own choosing).
43
Terminology (reminder)
• Task T: a distribution of samples q(x) with outputs y and a loss ℒ(x, y); for reinforcement learning, an initial state q(x1) plus transitions q(xt+1 | xt, yt), and ℒ = (negative) reward
• Model: fɸ(x) = y'
• Learner: model parameters ɸ, hyper-parameters λ; training on task Ti (with an optimizer following ∇ɸ) yields ɸi = argminɸ ℒTi(fɸ,λ(x), y)
• Learning objective: ɸ* = argmaxɸ log p(ɸ | T)
• Meta-learning algorithm g: meta-parameters 𝛳 (encode some inductive bias)
44
Strategy 1: bilevel optimization
• Parameterize some aspect of the learner that we want to learn as meta-parameters 𝛳, then meta-learn 𝛳 across tasks: ɸi* = argmaxɸi log p(ɸi | Ti, 𝛳)
• Bilevel optimization: 𝛳* = argmax𝛳 Σi log p(𝛳 | Ti), so that ɸi* = argmaxɸ log p(ɸ | Ti, 𝛳*)
• 𝛳 (the prior) could encode an initialization ɸ, the hyperparameters λ, the optimizer,…
• The learned 𝛳* should allow learning Tnew from a small amount of data, p(ɸnew | Tnew, 𝛳*), e.g. by fine-tuning on Tnew, yet generalize to a large number of tasks
45
Meta-learning with bilevel optimization
• Every task Ti sampled from the task distribution p(T) is a meta-training example
• Inner loop (training): the learning algorithm adapts ɸ'i on each task's (x, y) training data, giving fɸi(x) and a meta-train error ℒTi
• Outer loop (meta-training): the meta-learning algorithm updates the meta-parameters 𝛳' with a meta-optimizer Fmeta to minimize a meta-loss over tasks: 𝛳* = argmax𝛳 log p(𝛳 | T) = argmin𝛳 ℒmeta(ℒT1, ℒT2, …)
• Meta-test: adapt 𝛳* to Tasknew (e.g. fine-tune ɸ'new) and measure the meta-test error ETnew
A schematic sketch of this loop follows below.
46
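Schematically, the whole diagram collapses to a few lines; every name below (`sample_task`, `inner_update`, `meta_loss`, `meta_opt`) is a placeholder for the corresponding component, not a specific algorithm.

```python
def meta_train(theta, sample_task, inner_update, meta_loss, meta_opt, steps=1000):
    for _ in range(steps):
        task = sample_task()                   # Ti ~ p(T): a meta-training example
        phi = inner_update(theta, task.train)  # inner loop: adapt on Di,train
        loss = meta_loss(phi, task.test)       # meta-train error on Di,test
        theta = meta_opt(theta, loss)          # outer loop: update meta-parameters
    return theta                               # theta*, fine-tuned further at meta-test
```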
Strategy 2: black-box models
• A black-box meta-model g predicts ɸi directly given Di,train: ɸi = g(Di,train), with 𝛳 hidden inside g, e.g. a hypernetwork whose input embedding is learned across tasks, or a stateful model with memory (e.g. an RNN)
• Meta-training: sample task i from p(T), compute ɸi = g(Di,train), measure the error/reward ℒ(ɸi, Di,test), and update 𝛳 via ∇𝛳
• Amortized models: no training at meta-test time, just a feed-forward pass ɸnew = g(Dnew,train) (learning fast by learning slow)
A toy sketch of such a model follows below.
47
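A toy sketch of an amortized black-box meta-model for regression: an LSTM g consumes the support set and emits a task embedding ɸ that conditions a feed-forward head. The class name, dimensions, and architecture are illustrative assumptions.

```python
import torch

class BlackBoxMetaLearner(torch.nn.Module):
    """phi = g(D_train): an LSTM encodes the support set; no inner-loop training."""
    def __init__(self, in_dim=2, phi_dim=16):
        super().__init__()
        self.g = torch.nn.LSTM(in_dim + 1, phi_dim, batch_first=True)
        self.head = torch.nn.Linear(in_dim + phi_dim, 1)

    def forward(self, x_support, y_support, x_query):
        pairs = torch.cat([x_support, y_support.unsqueeze(-1)], dim=-1)
        _, (h, _) = self.g(pairs)               # final hidden state = task code phi
        phi = h[-1].unsqueeze(1).expand(-1, x_query.size(1), -1)
        return self.head(torch.cat([x_query, phi], dim=-1)).squeeze(-1)

model = BlackBoxMetaLearner()
x_s, y_s, x_q = torch.randn(8, 5, 2), torch.randn(8, 5), torch.randn(8, 10, 2)
preds = model(x_s, y_s, x_q)   # (8, 10): one prediction per query, feed-forward only
```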
Example: few-shot classification (2-way, 3-shot) — Vinyals et al. 2016, Triantafillou et al. 2019
• Each task has a Dtrain (support set) and a Dtest (query set)
• Bilevel: during meta-training, the meta-learner produces 𝛳* (an initial ɸ, an optimizer, a metric, …) by minimizing ℒmeta(ℒT1, ℒT2, …) over the train tasks
• During meta-testing, the learner fine-tunes ɸ from 𝛳* on the test task's Dtrain and is evaluated via ℒ(ɸ, Dtest)
48
Example: few-shot classification
• Black box: during meta-training, the meta-learner g directly outputs ɸ* for each train task from its Dtrain, and 𝛳' is updated from ℒmeta(ℒT1, ℒT2, …)
• During meta-testing, g produces ɸ for the new task in a single pass and is evaluated on Dtest
49
Example: meta-reinforcement learning (Yu et al. 2019)
• Bilevel: the inner loop trains a policy 𝝅ɸ on episodes of each train task (initial states with randomized object and goal positions); the outer loop updates 𝛳* with a meta-policy gradient to maximize Rmeta(RT1, RT2, …)
• 𝛳* can be a fixed prior or objective, e.g. which ɸ to learn; the effective policy 𝝅ɸ is modulated by ɸ
• Meta-testing: fine-tune ɸ on related tasks or other initial states, evaluated via R(𝝅ɸ, Dtest)
50
Example: meta-reinforcement learning
• Black box: a (fast) RNN acts as the RL agent; it adapts its hidden state (ɸ) within each task from the train episodes
• The meta-RL algorithm updates 𝛳' in the outer loop to maximize Rmeta(RT1, RT2, …)
• Meta-testing: the RNN adapts to a new task (other initial states or related tasks) from a few test episodes, evaluated via R(𝝅ɸ, Dtest)
51
Taxonomy of meta-learning methods (Hospedales et al. 2020)
Like base-learners, meta-learners consist of a representation, an objective, and an optimizer:
• meta-parameters 𝛳, e.g. the parameter initialization ɸ0 (a learned ɸ0 rather than a random ɸ0)
• meta-objective ℒmeta, e.g. multi- vs single-task
• meta-optimizer Fmeta, e.g. gradient descent
52
Taxonomy of meta-learning methods (Hospedales et al. 2020)
• The meta-parameters 𝛳 can also be the optimizer itself: a learned optimizer (update rule) rather than normal SGD (back-propagation)
53
Taxonomy of meta-learning methods (Hospedales et al. 2020, Huisman et al. 2020)
• meta-parameters 𝛳: parameter initialization ɸ0, optimizer, black-box model, embedding (metric learning), hyperparameters λ, architecture, loss ℒ / reward R, …
• meta-objective ℒmeta: many vs few shots, multi vs single task, fast adaptation vs final performance, offline vs online
• meta-optimizer Fmeta: gradient descent, reinforcement learning, evolution
We won't exhaustively discuss all existing combinations; let's focus on understanding a few key strategies.
54
Gradient-based methods: learning ɸinit
• 𝛳 (the prior) is the model initialization: ɸinit = 𝛳
• Learn a representation suitable for many tasks (e.g. a pretrained CNN) that maximizes rapid learning
• Each task i yields task-adapted parameters ɸi = u(𝛳, Di,train) via an update algorithm u (inner loop, ∇ɸ)
• Can be seen as bilevel optimization: 𝛳* = argmin𝛳 Σi ℒi(ɸi, Di,test) with ɸi = u(𝛳, Di,train)
• Meta-test: fine-tune 𝛳* on Tnew in a few steps, ɸ'new = u(𝛳*, Dnew,train)
55
Model-agnostic meta-learning (MAML) — Finn et al. 2017
Meta-training (illustrated with i=3 tasks, k=3 inner steps):
• Given the current initialization 𝛳 and model f
• On each task i, perform k SGD steps to find ɸi*, then evaluate ∇𝛳ℒi(fɸi*)
• Update the task-specific parameters: ɸi = 𝛳 − α∇𝛳ℒi(fɸi*)
• Update 𝛳 to minimize the sum of per-task test-set losses, and repeat: 𝛳 ← 𝛳 − β∇𝛳 Σi ℒi(fɸi) = 𝛳 − β∇𝛳 Σi ℒi(f(𝛳 − α∇𝛳ℒi(fɸi*)))
• The meta-gradient is a second-order gradient: compute how changes in 𝛳 affect the gradient at the new 𝛳, and backpropagate the meta-gradient (α, β: learning rates)
Meta-testing:
• Given training data Dtrain of a new task and pre-trained parameters 𝛳*, fine-tune: ɸ = 𝛳* − α∇𝛳ℒ(f𝛳*)
A runnable toy sketch follows below.
56
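A minimal runnable sketch of MAML on sine-wave regression, the task family used by Finn et al. 2017; the network size, learning rates α and β, and batch sizes are illustrative choices, and the second-order meta-gradient comes from `create_graph=True`.

```python
import random
import torch
import torch.nn.functional as F

net = torch.nn.Sequential(torch.nn.Linear(1, 40), torch.nn.ReLU(), torch.nn.Linear(40, 1))
meta_opt = torch.optim.Adam(net.parameters(), lr=1e-3)   # outer-loop optimizer (beta)
alpha = 0.01                                             # inner-loop learning rate

def sample_task(rng):
    # A task = regressing one sine wave with random amplitude and phase.
    amp, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, 3.14)
    def batch(n):
        x = torch.rand(n, 1) * 10 - 5
        return x, amp * torch.sin(x + phase)
    return batch

rng = random.Random(0)
for step in range(1000):
    meta_opt.zero_grad()
    for _ in range(4):                                   # tasks per meta-batch
        batch = sample_task(rng)
        x_tr, y_tr = batch(10)
        x_te, y_te = batch(10)
        params = list(net.parameters())
        inner_loss = F.mse_loss(net(x_tr), y_tr)
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)
        phi = [p - alpha * g for p, g in zip(params, grads)]  # one inner SGD step
        # Evaluate the adapted parameters phi on the task's test set
        # (manual functional forward pass through the same architecture).
        h = torch.relu(x_te @ phi[0].t() + phi[1])
        test_loss = F.mse_loss(h @ phi[2].t() + phi[3], y_te)
        test_loss.backward()                             # second-order meta-gradient
    meta_opt.step()
```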
Model-agnostic meta-learning (MAML)
• Example for reinforcement learning: the goal is to reach a certain velocity in a certain direction; MAML adapts much faster than a pretrained policy (an oracle is shown for reference)
Source: Finn et al. 2017
57
Other gradient-based techniques
• Changing the update rule yields different variants; all use second-order gradients, and most meta-learn a transformation of the gradient for better adaptation:
• MAML 1,6: ɸi = 𝛳 − α∇𝛳ℒi(fɸi*)
• MetaSGD 2: ɸi = 𝛳 − α diag(w)∇𝛳ℒi(fɸi*), with a learned weight w per parameter
• T-net 3: ɸi = 𝛳 − α∇𝛳ℒi(fɸi*, w)
• Meta-curvature 4: ɸi = 𝛳 − α B(𝛳, w)∇𝛳ℒi(fɸi*)
• WarpGrad 5: ɸi = 𝛳 − α P(𝛳, ɸi)∇𝛳ℒi(fɸi*)
• Online MAML (Follow the Meta Leader) 7: minimizes regret; robust, but the computation costs grow over time
1 Finn et al. 2017  2 Li et al. 2017  3 Lee et al. 2018  4 Park et al. 2019  5 Flennerhag et al. 2019  6 Antoniou et al. 2019  7 Finn et al. 2019
Figure source: Flennerhag et al. 2019
58
Scalability
• Backpropagating derivatives of ɸi w.r.t. 𝛳 is compute- and memory-intensive for large models 3
• First-order approximations of MAML:
• FOMAML 1 uses only the last inner gradient update; per-task meta-gradient ∇ɸℒi(fɸi*): 𝛳 ← 𝛳 − β Σi ∇ɸℒi(fɸi*)
• Reptile 2: iterate over tasks (running k SGD steps 𝛳1, 𝛳2, … to reach ɸi*), and update 𝛳 in the direction of ɸi* − 𝛳: 𝛳 ← 𝛳 + β(ɸi* − 𝛳)
1 Finn et al. 2017  2 Nichol et al. 2018  3 Rajeswaran et al. 2019
A toy Reptile sketch is shown below.
59
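A toy Reptile sketch where each "task" is a quadratic bowl with a random optimum; after meta-training, 𝛳 sits near the mean of the task optima, the initialization from which adaptation is fastest. All specifics here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                         # meta-learned initialization

def adapt(theta, target, k=5, alpha=0.1):
    # Inner loop: k SGD steps on the task loss ||phi - target||^2.
    phi = theta.copy()
    for _ in range(k):
        phi -= alpha * 2 * (phi - target)   # gradient of the task loss
    return phi

beta = 0.1
for step in range(200):
    target = rng.normal(size=2)             # sample a task (its toy optimum)
    phi_star = adapt(theta, target)
    theta += beta * (phi_star - theta)      # theta <- theta + beta * (phi* - theta)
```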
Generalizability
• MAML is more resilient to overfitting than many other meta-learning techniques 1
• Its effectiveness seems mainly due to feature reuse (finding a good 𝛳) 2: fine-tuning only the last layer works equally well
• On few-shot learning benchmarks, a good embedding outperforms most meta-learning 3: learn a representation on the entire meta-dataset (merged into a single task), then train a linear classifier on the embedded few-shot Dtrain and predict Dtest
• For meta-RL, also learn how to explore (how to sample new environments): E-MAML adds exploration to the meta-objective, which allows longer-term goals 4
1 Finn et al. 2018  2 Raghu et al. 2020  3 Tian et al. 2020  4 Stadie et al. 2019
Figure source: Tian et al. 2020
60
Bayesian meta-learning (Grant et al. 2018)
Can meta-learning reason about uncertainty in the task distribution?
• Point-estimate view: for meta-train tasks Ti = {Dtrain, Dtest}, a meta-model g𝛳 maps (x, y)train to ɸ̂, predictions are ytest ← fɸ̂(xtest), and 𝛳 ← 𝛳 − ∇𝛳ℒ(ytest); at meta-test time, ɸj ← g𝛳((x, y)train) with fine-tuning
• Bayesian view: maintain distributions instead of point estimates: a prior p(𝛳) updated to p(𝛳 | (x, y)train), a task posterior p(ɸj | (x, y)train, 𝛳), and a predictive distribution p(ytest | xtest)
61
Fully Bayesian meta-learning
• Use approximation methods to represent the uncertainty over ɸi
• Sampling technique + variational inference: PLATIPUS 1, BMAML 2, ABML 3
• Laplace approximation: LLAMA 4
• Variational approximation of the posterior: Neural Statistician 5, Neural Processes 6
• What if our tasks are not IID? Impose additional structure, e.g. a task-specific latent variable z 7 (e.g. mini-ImageNet with filters for non-homogeneous tasks)
1 Finn et al. 2019  2 Kim et al. 2018  3 Ravi et al. 2019  4 Grant et al. 2018  5 Edwards et al. 2017  6 Garnelo et al. 2018  7 Jerfel et al. 2019
Figure source: Jerfel et al. 2019
62
Meta-learning optimizers
• Our brains probably don't do backprop; instead:
• Fast weights 2: networks that continuously modify the weights of another network
• Gradient-based 3,4: parameterize the update rule, ɸ'i = ɸi − η g𝛳(·), using a neural network, and learn its meta-parameters 𝛳 across tasks by gradient descent (a meta-gradient) 1
• Represent the update rule as an LSTM 4,5,6 or a hierarchical RNN 7
1 Bengio et al. 1995  2 Schmidhuber 1992  3 Runarsson and Jonsson 2000  4 Hochreiter 2001  5 Andrychowicz et al. 2016  6 Ravi and Larochelle 2017  7 Wichrowska et al. 2017
A toy learned-optimizer sketch follows below.
63
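A toy sketch of a gradient-based learned optimizer in the spirit of Andrychowicz et al. 2016, heavily simplified: a small shared network maps each parameter's gradient to its update, and its meta-parameters are trained by unrolling the inner optimization. The architecture and toy quadratic objective are assumptions.

```python
import torch

# A tiny shared network maps each parameter's gradient to its update.
update_net = torch.nn.Sequential(torch.nn.Linear(1, 8), torch.nn.Tanh(), torch.nn.Linear(8, 1))
meta_opt = torch.optim.Adam(update_net.parameters(), lr=1e-3)

for meta_step in range(200):
    w = torch.randn(5, requires_grad=True)   # fresh optimizee: toy quadratic problem
    target = torch.randn(5)
    meta_loss = 0.0
    for t in range(10):                      # unrolled inner optimization
        loss = ((w - target) ** 2).sum()
        g, = torch.autograd.grad(loss, w, create_graph=True)
        w = w + update_net(g.unsqueeze(-1)).squeeze(-1)   # learned update rule
        meta_loss = meta_loss + loss         # sum of losses along the trajectory
    meta_opt.zero_grad()
    meta_loss.backward()                     # gradient w.r.t. the optimizer's weights
    meta_opt.step()
```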
Meta-learning optimizers
• Meta-learned (RNN) optimizers 'rediscover' momentum, gradient clipping, learning rate schedules, learning rate adaptation,… 1
• RL-based optimizers: represent updates as a policy, learned with guided policy search 2
• Combined with MAML: learn per-parameter learning rates 3,4, or learn a preconditioning matrix to 'warp' the loss surface 5,6
• Black-box optimizers: meta-learned with an RNN 7, or with user-defined priors 8
• Speed up backpropagation by meta-learning sparsity and weight sharing 9
1 Maheswaranathan et al. 2020  2 Ke and Malik 2016  3 Li et al. 2017  4 Antoniou et al. 2019  5 Park et al. 2019  6 Flennerhag et al. 2019  7 Chen et al. 2017  8 Souza et al. 2020  9 Kirsch et al. 2020
Figure source: Maheswaranathan et al. 2020
64
Metric learning (Vinyals et al. 2016, Sung et al. 2018)
• Learn an embedding network 𝛳 that transforms data {Dtrain, Dtest} across all tasks to a representation g𝛳(x) that allows easy similarity comparison
• Can be seen as a simple black-box meta-model
• Often non-parametric (independent of ɸ)
65
Prototypical networks (Snell et al. 2017)
• Use an embedding function f𝛳 to encode each data point
• Define a prototype for every class c as the mean embedding of that class's training examples: vc = (1/|Dtrain(c)|) Σ(xi,yi)∈Dtrain(c) f𝛳(xi)
• The class distribution for an input x is based on the inverse distance between x and the prototypes: p(y = c | x) = softmax(−d(f𝛳(x), vc))
• The distance function d can be any differentiable distance, e.g. squared Euclidean
• Loss function to learn the embedding: ℒ(𝛳) = −log p𝛳(y = c | x)
Figure source: Snell et al. 2017
A minimal episode sketch follows below.
66
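A minimal episode sketch of these formulas; the encoder, input dimensions, and the 3-way 5-shot episode are illustrative assumptions (in practice the encoder would be a CNN over images).

```python
import torch
import torch.nn.functional as F

# Placeholder encoder f_theta.
embed = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))

def episode_loss(x_support, y_support, x_query, y_query, n_classes):
    z = embed(x_support)
    # Prototype v_c: mean embedding of each class's support examples.
    prototypes = torch.stack([z[y_support == c].mean(0) for c in range(n_classes)])
    dists = torch.cdist(embed(x_query), prototypes) ** 2   # squared Euclidean
    log_p = F.log_softmax(-dists, dim=1)                   # p(y=c|x) = softmax(-d)
    return F.nll_loss(log_p, y_query)

# Toy 3-way, 5-shot episode with 5 query points per class.
y = torch.arange(3).repeat_interleave(5)
loss = episode_loss(torch.randn(15, 16), y, torch.randn(15, 16), y, n_classes=3)
loss.backward()   # the embedding is trained end-to-end over many such episodes
```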
Metric learning
• Quite a few other techniques exist:
• Siamese neural networks 1
• Graph neural networks 2, also applicable to semi-supervised and active learning
• Attentive Recurrent Comparators 3: compare inputs not as a whole but by parts (e.g. image patches)
• MetaOptNet 4: learns embeddings so that linear models can distinguish between classes
• Overall: fast at test time, although pair-wise comparisons limit task size; mostly limited to few-shot supervised tasks; fails when test tasks are more distant, since there is no way to adapt
1 Koch et al. 2015  2 Garcia et al. 2017  3 Shyam et al. 2017  4 Lee et al. 2019
67
Black-box model for meta-RL (Wang et al. 2016, Duan et al. 2018)
• RNNs serve as dynamic task-embedding storage: previous inputs are stored in the hidden state
• Maximize the expected reward in each trial
• Very expressive; performs very well on short tasks, but longer horizons remain an open challenge
Image source: Duan et al. 2018
68
Other black-box models
• Memory-augmented NNs 1: use neural Turing machines (short-term + long-term memory)
• Meta Networks 2: a meta-learner that returns 'fast weights' for itself and for the base network solving the task
• Simple Neural Attentive Meta-Learner (SNAIL) 3: aims to overcome the memory limitations of RNNs with a series of 1D convolutions
1 Santoro et al. 2016  2 Munkhdalai et al. 2017  3 Mishra et al. 2018
Image sources: Munkhdalai et al. 2017 (Meta Networks), Mishra et al. 2018 (SNAIL)
69
What can we learn to learn? (Clune 2019: 3 pillars)
1. architectures / pipelines (hyper-parameters): part 1
2. learning algorithms (model parameters): part 2
3. learning environments (task generation): part 3
70
• Ultimately, meta-learning translates constraints on the learner into constraints on the data: the biases we don't put in manually have to be learnable from data
• Can we automatically create new tasks to inform and challenge our meta-learners?
• Paired Open-Ended Trailblazer (POET): evolves a parameterized environment 𝛳E for an agent 𝛳A (Wang et al. 2019, Wang et al. 2020)
• Select agents that can solve challenges AND evolve environments so they remain solvable; the best-performing agents move on, which creates multiple overlapping curricula (training task acquisition)
Figure source: Wang et al. 2019
71
Does POET scale? Increasingly difficult 3D terrain, 18 degrees of freedom (Zhou & Vanschoren, 2021).
72
Generative Teaching Networks (Such et al. 2019)
• Based on an existing dataset, generate synthetic training data for more efficient training
• Like dataset distillation, but uses meta-learning to update the generator model
• While POET has limited expressivity (limited to 𝛳E), GTNs could produce all sorts of training datasets and environments
• While being careful not to generate noise: some grounding in reality is needed
73
Thank you
Image source: Pesah et al. 2018
74