Chang-Hoon Jeong
Biointelligence lab., Artificial Intelligence Institute
Seoul National University
Deep Meta Learning
Learning-to-Learn in Neural Networks
2
• Meta Learning: Overview
• Meta Supervised Learning
○ Model-based methods
○ Optimization-based Inference
○ Non-parametric approach
• Meta Reinforcement Learning
○ Recurrent policies methods
○ Optimization-based Inference
○ POMDPs approach
Agenda
Meta Learning: Overview
4
• Large, diverse datasets and large computation → broad generalization
Deep Learning: SOTA (State-Of-The-Art)
GPT-3
Agent57
Meta Pseudo Labels
Pham, Hieu, et al. "Meta pseudo labels." 2020.
Brown, Tom B., et al. "Language models are few-shot learners." 2020.
Badia, Adrià Puigdomènech, et al. "Agent57: Outperforming the atari human benchmark." 2020.
5
Difference between Machine and Human
A machine is a specialist; a human is a generalist
6
Problem Statement
What if your data has a long tail?
What if you want a general-purpose AI system in the real world?
CS330: Deep Multi-Task and Meta-Learning
9
Few-Shot Learning
Can you classify the image using just 3 datapoints?
By Van Gogh or Cezanne?
(episodic) N-way K-shot learning
In this case, 2-way 3-shot learning
How did you accomplish this?
By leveraging prior experience!
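To make the episodic setup concrete, below is a minimal sketch of how an N-way K-shot episode could be sampled from a labeled image collection; the `images_by_class` mapping and the query-set size are illustrative assumptions, not something specified on the slides.

```python
import random

def sample_episode(images_by_class, n_way=2, k_shot=3, n_query=5):
    """Sample one episodic N-way K-shot task: a support set of N*K labeled
    examples to learn from, and a query set to evaluate on.
    images_by_class: {class_name: [image, ...]} (illustrative)."""
    classes = random.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in examples[:k_shot]]
        query += [(img, label) for img in examples[k_shot:]]
    return support, query

# The painter example above is a 2-way 3-shot episode:
# support, query = sample_episode(paintings_by_artist, n_way=2, k_shot=3)
```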
10
• Difference between multi-task learning and meta learning
What is Meta Learning?
11
• Meta Learning in Computer Science
○ Meta-learning is a subfield of machine learning in which automatic learning
algorithms are applied to metadata about machine learning experiments:
learning-to-learn
• How can we understand meta-learning?
○ Mechanistic view
- Useful for implementing the algorithm
- The neural network is trained on a meta-dataset, which itself consists of many datasets
(one per task)
○ Probabilistic view
- Useful for understanding the algorithm
- Meta-learning extracts prior information from previous tasks; learning a new task
uses this prior and a small training set to infer the most likely posterior parameters
What is Meta Learning?
12
• (Conventional) Machine Learning
○ (Supervised Learning) Given a training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$,
we can train a predictive model $\hat{y} = f_\theta(x)$ parameterized by $\theta$ by solving
$\theta^* = \arg\min_\theta \mathcal{L}(\mathcal{D}; \theta, \omega)$
- $\mathcal{L}$ is a loss function that measures the match between the true labels and
those predicted by $f_\theta(\cdot)$
- $\omega$ is the condition that makes explicit the dependence of this solution on
factors such as the choice of optimizer (e.g. Adam, SGD, ...), the function class
for $f$ (e.g. CNN, RNN, ...), or the initialization (e.g. Xavier, He, ...)
Formalizing the Meta Learning
* Conventionally, $\omega$ is pre-specified. Can we learn the learning algorithm $\omega$ itself,
rather than assuming it is pre-specified and fixed?
13
• What is a task?
○ A task: $\mathcal{T}_i \triangleq \{p_i(x), p_i(y \mid x), \mathcal{L}_i\}$ (a data-generating distribution and a loss)
• Different tasks can vary based on:
• different objects
• different people
• different objectives
• different lighting conditions
• different languages
• …
Formalizing the Meta Learning
CS330: Deep Multi-Task and Meta-Learning, Tutorial #2: few-shot learning and meta-learning (by W. Zi)
14
• The bad news
oDifferent tasks need to share some structure
o If this doesn’t hold, you are better off using single-task learning
• The good news
o There are many tasks with shared structure!
The Structure for Meta Learning
CS330: Deep Multi-Task and Meta-Learning
15
• Meta Learning: Task distribution view
○ $\omega$ specifies ‘how-to-learn’ and is often evaluated in terms of performance
over a distribution of tasks $p(\mathcal{T})$ (meta-knowledge, inductive bias)
○ Learning ‘how-to-learn’ is done by solving
$\omega^* = \arg\min_\omega \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\left[\mathcal{L}(\mathcal{D}; \omega)\right]$
How to solve Meta Learning?
How can we solve this problem in practice?
16
• The learning process: Two-level optimization
○ Inner-level:
- A new task is presented
- An agent tries to quickly learn the associated concepts from the training
observations
- This quick adaptation is facilitated by knowledge that it has accumulated
across earlier tasks at the outer-level
○ Outer-level:
- Whereas the inner-level concerns a single task, the outer-level concerns
a multitude of tasks
- In the outer-level, the adaptation results from the inner-level tasks of one
cycle are integrated to update the meta-knowledge
How to solve Meta Learning?
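As a toy illustration of this two-level loop (not any particular published algorithm), the following sketch meta-learns an initialization for one inner gradient step on scalar quadratic tasks; the task distribution and step sizes are made-up choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, meta_lr = 0.5, 0.1   # inner- and outer-level step sizes (illustrative)

def adapt(w0, target):
    # Inner level: one gradient step on the task loss 0.5*(w - target)^2,
    # starting from the meta-learned initialization w0.
    return w0 - alpha * (w0 - target)

w0 = 0.0
for _ in range(500):
    target = rng.normal(3.0, 0.3)          # outer level: sample a new task
    w_adapted = adapt(w0, target)
    # Gradient of the post-adaptation loss w.r.t. w0 (chain rule through adapt):
    meta_grad = (w_adapted - target) * (1.0 - alpha)
    w0 -= meta_lr * meta_grad

print(w0)  # converges near 3.0, the center of the task distribution
```

The outer level converges toward the initialization from which one inner step works best on average over tasks, which is exactly the accumulated knowledge the inner level exploits.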
17
• Meta-dataset: the dataset of datasets
Introduction to Meta-dataset
Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016.
(figure: each row is a task built from a subset of classes; meta-test tasks use held-out classes)
20
• Meta-training (Supervised Learning)
Meta Learning: Overview
Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016.
Query set
Support set
Learning ‘how-to-learn’
(meta-learning)
Methods
• Model-based methods
• Optimization-based Inference
• Non-parametric approach
22
• Meta-testing (Supervised Learning) → Few-Shot Learning
Meta Learning: Overview
Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016.
Query set
Support set
Fast adaptation
with small data
Finally, we can evaluate the accuracy of
our meta-learner by the performance of the adapted model
on the test split (query set) of each target task
Meta Supervised Learning
24
Benchmark Datasets in Meta Supervised Learning
Dataset Description
Omniglot (Lake et al., 2011)
This dataset presents an image recognition task. Each image corresponds to one out of 1,623 characters from 50
different alphabets. Every character was drawn by 20 people. Note that in this case, the characters are the
classes/labels.
ImageNet (Deng et al., 2009)
This is the largest image classification dataset, containing more than 20K classes and over 14 million colored images.
MiniImageNet is a mini variant of the large ImageNet dataset (Deng et al., 2009) for image classification, proposed by
Vinyals et al. (2016) to reduce the engineering efforts to run experiments. The mini dataset contains 60,000 colored
images of size 84 × 84. There are 100 classes in total, each with 600 examples. TieredImageNet
(Ren et al., 2018) is another variation of the large ImageNet dataset. It is similar to miniImageNet, but contains a
hierarchical structure. That is, there are 34 classes, each with its own sub-classes.
CIFAR-10 and CIFAR-100 (Krizhevsky, 2009)
Two other image recognition datasets. Each one contains 60K RGB images of size 32×32. CIFAR-10 and CIFAR-100
contain 10 and 100 classes respectively, with a uniform number of examples per class (6,000 and 600 respectively).
Every class in CIFAR-100 also has a super-class, of which there are 20 in the full data set. Many variants of the
CIFAR datasets can be sampled, giving rise to e.g. CIFAR-FS (Bertinetto et al., 2019) and FC-100 (Oreshkin et al.,
2018).
CUB-200-2011 (Wah et al., 2011)
The CUB-200-2011 dataset contains roughly 12K RGB images of birds from 200 species. Every image has some
labeled attributes (e.g. crown color, tail shape).
MNIST (LeCun et al., 2010)
MNIST presents a hand-written digit recognition task, containing ten classes (for digits 0 through 9). In total, the
dataset is split into a 60K train and 10K test gray scale images of hand-written digits.
Meta-Dataset (Triantafillou et al., 2020)
This dataset comprises several other datasets such as Omniglot (Lake et al., 2011), CUB-200 (Wah et al., 2011),
ImageNet (Deng et al., 2009), and more (Triantafillou et al., 2020). An episode is then constructed by sampling a
dataset (e.g. Omniglot), selecting a subset of labels to create train and test splits as before. In this way, broader
generalization is enforced since the tasks are more distant from each other.
Meta-world (Yu et al., 2019)
A meta reinforcement learning dataset, containing 50 robotic manipulation tasks (control a robot arm to achieve some
pre-defined goal, e.g. unlocking a door, or playing soccer). It was specifically designed to cover a broad range of
tasks, such that meaningful generalization can be measured (Yu et al., 2019).
25
• Model-based methods (SL O, RL O)
○ Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." International conference on
machine learning. 2016.
○ Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017).
○ Munkhdalai, Tsendsuren, and Hong Yu. "Meta networks." Proceedings of machine learning research 70
(2017): 2554.
• Optimization-based Inference (SL O, RL O)
○ Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep
networks." arXiv preprint arXiv:1703.03400 (2017).
○ Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv preprint
arXiv:1803.02999 (2018).
○ Rusu, Andrei A., et al. "Meta-learning with latent embedding optimization." arXiv preprint
arXiv:1807.05960 (2018).
• Non-parametric approach (SL O, RL X)
○ Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot
image recognition." ICML deep learning workshop. Vol. 2. 2015.
○ Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in neural information
processing systems. 2016.
○ Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot
learning." Advances in neural information processing systems. 2017.
Meta Learning Algorithms
27
• General approach: Train a recurrent network to represent $y^{ts} = f_\theta(\mathcal{D}^{tr}, x^{ts})$
Model-based methods
CS330: Deep Multi-Task and Meta-Learning
Train with standard supervised learning
support set query set
29
• General approach: Train a recurrent network to represent $y^{ts} = f_\theta(\mathcal{D}^{tr}, x^{ts})$
Model-based methods
CS330: Deep Multi-Task and Meta-Learning
1. Sample task $\mathcal{T}_i$ (or a mini-batch of tasks)
2. Sample disjoint datasets $\mathcal{D}_i^{tr}$, $\mathcal{D}_i^{test}$ from $\mathcal{D}_i$
3. Compute the query loss $\mathcal{L}\left(f_\theta(\mathcal{D}_i^{tr}, x^{ts}),\, y^{ts}\right)$ on $\mathcal{D}_i^{test}$
4. Update $\theta$ using $\nabla_\theta \mathcal{L}$
support set query set
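A minimal sketch of such a black-box meta-learner, assuming an LSTM backbone and illustrative dimensions (the specific architectures used by MANN or SNAIL differ):

```python
import torch
import torch.nn as nn

class BlackBoxMetaLearner(nn.Module):
    """Sketch: an LSTM reads the support set as a sequence of (x, y) pairs,
    then predicts labels for query inputs from its task-conditioned state.
    All dimensions are illustrative assumptions."""
    def __init__(self, x_dim=16, n_classes=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(x_dim + n_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)
        self.n_classes = n_classes

    def forward(self, support_x, support_y, query_x):
        # Condition on the support set by feeding (x_i, onehot(y_i)) pairs.
        y_onehot = torch.nn.functional.one_hot(support_y, self.n_classes).float()
        context = torch.cat([support_x, y_onehot], dim=-1)
        _, (h, c) = self.lstm(context.unsqueeze(0))
        # Read out query predictions (labels unknown, so zeros in the y slot).
        q = torch.cat([query_x, torch.zeros(len(query_x), self.n_classes)], dim=-1)
        out, _ = self.lstm(q.unsqueeze(0), (h, c))
        return self.head(out.squeeze(0))  # logits; train with cross-entropy
```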
30
• One challenge: outputting all parameters is not scalable
○ No need to output all parameters of the neural net, only sufficient statistics
(Santoro et al., MANN, 2016), (Mishra et al., SNAIL, 2017)
Model-based methods
For example, in multi-task learning,
condition via concatenation or multiplication
Low-dimensional vector
Represents contextual task information
CS330: Deep Multi-Task and Meta-Learning
31
• Memory-Augmented Neural Networks
Model-based methods: Variants
Graves, Alex, et al. "Neural turing machines." arXiv, 2014,
Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." ICML 2016.
32
• A Simple Neural AttentIve Meta-Learner (SNAIL)
Model-based methods: Variants
Mishra, Nikhil, et al. "A simple neural attentive meta-learner." ICLR 2018
33
• Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv
preprint arXiv:1410.5401 (2014).
• Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory networks." arXiv
preprint arXiv:1410.3916 (2014).
• Santoro, Adam, et al. "Meta-learning with memory-augmented neural
networks." International conference on machine learning. 2016.
• Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint
arXiv:1707.03141 (2017).
• Munkhdalai, Tsendsuren, and Hong Yu. "Meta networks." Proceedings of
machine learning research 70 (2017): 2554.
• Garnelo, Marta, et al. "Conditional neural processes." arXiv preprint
arXiv:1807.01613 (2018).
Model-based methods: Further Readings
34
• Can we utilize pre-trained prior knowledge?
• One successful form of prior knowledge: initialization for fine-tuning
• What about Transfer Learning?
Optimization-based Inference
Can we do better?
35
• Model-Agnostic Meta Learning
Optimization-based Inference
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
37
• MAML for Supervised Learning (Meta-training)
Model-Agnostic Meta Learning
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
38
• MAML for Supervised Learning (Meta-training)
Model-Agnostic Meta Learning
meta-initialization (outer loop)
adaptation (inner loop)
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
41
• MAML for Supervised Learning (Meta-training)
Model-Agnostic Meta Learning
1st for loop
2nd for loop
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
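The two for-loops above can be sketched in PyTorch roughly as follows, using the sinusoid-regression toy task from the MAML paper; the network size, learning rates, and batch size here are illustrative choices, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def forward(params, x):
    # A tiny MLP applied functionally, so we can differentiate
    # through the inner-loop update.
    w1, b1, w2, b2 = params
    return torch.relu(x @ w1 + b1) @ w2 + b2

def inner_adapt(params, x_s, y_s, alpha=0.01):
    # Inner loop (2nd for loop): one gradient step on the support set.
    loss = F.mse_loss(forward(params, x_s), y_s)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - alpha * g for p, g in zip(params, grads)]

params = [torch.nn.Parameter(t) for t in
          (0.1 * torch.randn(1, 40), torch.zeros(40),
           0.1 * torch.randn(40, 1), torch.zeros(1))]
meta_opt = torch.optim.Adam(params, lr=1e-3)

for it in range(10000):            # outer loop (1st for loop)
    meta_opt.zero_grad()
    for _ in range(4):             # a mini-batch of sinusoid tasks
        amp, phase = torch.rand(1) * 4.9 + 0.1, torch.rand(1) * 3.14
        x_s, x_q = torch.rand(5, 1) * 10 - 5, torch.rand(10, 1) * 10 - 5
        y_s, y_q = amp * torch.sin(x_s + phase), amp * torch.sin(x_q + phase)
        adapted = inner_adapt(params, x_s, y_s)
        # Query loss after adaptation; backward() accumulates the
        # (second-order) meta-gradient into the shared initialization.
        (F.mse_loss(forward(adapted, x_q), y_q) / 4).backward()
    meta_opt.step()
```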
46
• MAML for Supervised Learning (Meta-testing)
Model-Agnostic Meta Learning
Few-shot adaptation
Fine-Tuning (Adaptation)
Just 1 or few gradient steps
Evaluation: N-way k-shot Accuracy
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
47
• Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast
adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
• Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." (2016).
• Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning
algorithms." arXiv preprint arXiv:1803.02999 (2018).
• Rusu, Andrei A., et al. "Meta-learning with latent embedding optimization." arXiv preprint
arXiv:1807.05960 (2018).
• Antoniou, Antreas, Harrison Edwards, and Amos Storkey. "How to train your MAML." arXiv
preprint arXiv:1810.09502 (2018).
• Lee, Yoonho, and Seungjin Choi. "Gradient-based meta-learning with learned layerwise metric
and subspace." arXiv preprint arXiv:1801.05558 (2018).
• Fallah, Alireza, Aryan Mokhtari, and Asuman Ozdaglar. "On the convergence theory of
gradient-based model-agnostic meta-learning algorithms." International Conference on
Artificial Intelligence and Statistics. 2020.
• Song, Xingyou, et al. "Es-maml: Simple hessian-free meta learning." arXiv preprint
arXiv:1910.01215 (2019).
• Triantafillou, Eleni, et al. "Meta-dataset: A dataset of datasets for learning to learn from few
examples." arXiv preprint arXiv:1903.03096 (2019).
Optimization-based Inference: Further Readings
49
• KNN(K-Nearest Neighbor) Algorithm
oLazy, Non-parametric Learner
Non-parametric approach
Red or Blue?
1-Nearest Neighbor
→ Blue
3-Nearest Neighbor
→ Red
5-Nearest Neighbor
→ Red
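For reference, a minimal NumPy sketch of the K-nearest-neighbor vote illustrated above (the toy 2D points are made up):

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify a query point by majority vote among its k nearest
    training points (Euclidean distance). No parameters are learned:
    the 'training' data is simply stored (a lazy learner)."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

# 2D toy data: 0 = red, 1 = blue
x = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array([0, 0, 1, 1, 1])
print(knn_predict(x, y, np.array([0.55, 0.55]), k=1))  # the vote can flip with k
```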
50
• In low-data regimes (e.g. 2D data), non-parametric methods are
simple and work well!
• What about high dimensional data?
o During meta-training
→ Still want to be parametric
o During meta-testing
→ Few-Shot learning ↔ low data regime
Non-parametric approach
Can we train parametric meta-learners that produce effective
non-parametric learners?
Finn, Chelsea, Sergey Levine. “Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning." ICML Tutorial 2019
51
• At meta-test time, compare the test image (query) with the training images
(support set)
o Which space do we use?
o Which distance metric do we use?
Non-parametric approach
Finn, Chelsea, Sergey Levine. “Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning." ICML Tutorial 2019
52
• Siamese Neural Networks
Non-parametric approach: Variants
meta-training
meta-testing
Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop, 2015.
53
• Siamese Neural Networks: A simple example
Non-parametric approach: Variants
54
• Siamese Neural Networks: A simple example
Non-parametric approach: Variants
Label
1
55
• Siamese Neural Networks: A simple example
Non-parametric approach: Variants
Label
0
56
• Siamese Neural Networks
Non-parametric approach: Variants
Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop, 2015.
Same class: label 1
Otherwise: label 0
Siamese twin is not depicted, but joins immediately after the 4096 unit
fully-connected layer where the L1 component-wise distance between
vectors is computed
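A small sketch of this verification-style setup; the convolutional encoder here is a stand-in with made-up dimensions, not the exact architecture from Koch et al.:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Sketch: twin encoders share weights, and a sigmoid over the
    componentwise L1 distance scores whether two images belong to the
    same class (target 1) or not (target 0)."""
    def __init__(self, emb=64):
        super().__init__()
        self.encoder = nn.Sequential(          # illustrative tiny encoder
            nn.Conv2d(1, 32, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 16, emb),
        )
        self.out = nn.Linear(emb, 1)

    def forward(self, x1, x2):
        e1, e2 = self.encoder(x1), self.encoder(x2)
        return torch.sigmoid(self.out(torch.abs(e1 - e2)))  # P(same class)

# Training: loss = nn.BCELoss()(net(x1, x2), same_label)
# One-shot test: classify a query as the support image with the highest score.
```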
58
• Siamese Neural Networks
Non-parametric approach: Variants
Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop, 2015.
Test Accuracy on Omniglot Dataset
meta-training: 2-way classification
meta-testing: N-way classification
“A simple machine learning principle:
test and train conditions must match”
(Vinyals et al., Matching Networks for One-Shot Learning)
59
• Matching Networks for One-Shot Learning
Non-parametric approach: Variants
Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in neural information processing systems. 2016.
bidirectional LSTM
Convolutional encoder
Cosine distance
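The matching-networks readout can be sketched as attention over the support set; the helper below assumes the embeddings have already been produced by the (learned) encoders, and omits the bidirectional-LSTM contextualization of the full model:

```python
import torch
import torch.nn.functional as F

def matching_net_predict(f_query, g_support, support_labels, n_classes):
    """Sketch: a query embedding attends over support embeddings with a
    cosine-similarity softmax, and the prediction is the attention-weighted
    sum of the support labels."""
    sims = F.cosine_similarity(f_query.unsqueeze(0), g_support, dim=-1)
    attention = torch.softmax(sims, dim=0)            # a(x_hat, x_i)
    y_onehot = F.one_hot(support_labels, n_classes).float()
    return attention @ y_onehot                       # P(y_hat | x_hat, S)
```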
60
• Prototypical Networks
oCan we aggregate class information to create a prototypical embedding?
Non-parametric approach: Variants
$d$: Euclidean distance
Snell, Jake, et al., "Prototypical networks for few-shot learning." Advances in neural information processing systems. 2017.
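A minimal sketch of the prototype computation and classification rule, assuming embeddings already produced by some encoder $f_\phi$:

```python
import torch

def prototypical_logits(support_emb, support_labels, query_emb, n_classes):
    """Sketch of prototypical networks: each class prototype is the mean
    of its support embeddings, and queries are classified by a softmax
    over negative squared Euclidean distances to the prototypes."""
    prototypes = torch.stack([
        support_emb[support_labels == k].mean(dim=0) for k in range(n_classes)
    ])                                           # c_k = mean of class-k embeddings
    dists = torch.cdist(query_emb, prototypes)   # (n_query, n_classes)
    return -dists ** 2                           # logits for cross-entropy
```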
61
• Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural
networks for one-shot image recognition." ICML deep learning workshop. Vol. 2.
2015.
• Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in
neural information processing systems. 2016.
• Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-
shot learning." Advances in neural information processing systems. 2017.
• Sung, Flood, et al. "Learning to compare: Relation network for few-shot
learning." Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2018.
• Allen, Kelsey R., et al. "Infinite mixture prototypes for few-shot learning." arXiv
preprint arXiv:1902.04552 (2019).
• Garcia, Victor, and Joan Bruna. "Few-shot learning with graph neural
networks." arXiv preprint arXiv:1711.04043 (2017).
Non-parametric approach: Further Readings
Meta Reinforcement Learning
63
• Reinforcement Learning
Introduction to Reinforcement Learning
Observation 𝑆𝑡
Action 𝐴𝑡
Reward 𝑅𝑡
Agent
The World
Environment
• At each time step 𝑡 the Agent
- Executes action 𝐴𝑡
- Receives observation 𝑆𝑡
- Receives scalar reward 𝑅𝑡
• The environment
- Receives action 𝐴𝑡
- Emits observation 𝑆𝑡+1
- Emits scalar reward 𝑅𝑡+1
• 𝑡 increments at env.step()
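This interaction loop looks roughly as follows in code, written against the classic Gym API (newer gym/gymnasium versions return extra values from reset() and step()):

```python
import gym  # classic Gym API assumed

env = gym.make("CartPole-v1")
state = env.reset()                      # S_0
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # a learned policy pi(a|s) goes here
    # env.step() advances t: the environment receives A_t and
    # emits S_{t+1} and the scalar reward R_{t+1}.
    state, reward, done, info = env.step(action)
    total_reward += reward
```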
67
• Reinforcement Learning
Introduction to Reinforcement Learning
Environment
State 𝑆𝑡
Action 𝐴𝑡
Reward 𝑅𝑡+1
Next State 𝑆𝑡+1
Agent
69
• Markov Decision Processes (MDPs)
○ Mathematical formulation of the RL problem
• An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$
○ $\mathcal{S}$: state set
○ $\mathcal{A}$: action set
○ $\mathcal{P}$: transition probabilities $\mathcal{P}(s' \mid s, a)$
○ $\mathcal{R}$: reward function
○ $\gamma$: discount factor
• (Stochastic) Policies
○ Probability of the agent's action: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$
Introduction to Reinforcement Learning
Goal: Find the optimal policy $\pi^* = \arg\max_\pi \mathbb{E}_\pi\left[\sum_t \gamma^t R_{t+1}\right]$
Value function: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
70
• What’s wrong with Reinforcement Learning?
Problem Statement
DQN
OpenAI Five
AlphaStar
AlphaGo
71
• What’s wrong with Reinforcement Learning?
Problem Statement
MDP 1 MDP 2 MDP 3
Can we meta-learn reinforcement learning algorithms that are much more efficient?
“Hula Beach”, “Never grow up”, “The Sled” – by artist Matt Spangler, mattspangler.com
72
• What’s wrong with Reinforcement Learning?
Problem Statement
https://bit.ly/bandit-meta-rl-ex
Reinforcement Learning
Meta Reinforcement Learning
73
• General Framework for Meta Reinforcement Learning
Meta Reinforcement Learning
Env 1 (MDP 1)
Env 2 (MDP 2)
Env 3 (MDP 3)
...
meta-training (10 envs)
Meta RL Algorithm
Env 11 (MDP 11)
meta-testing (with small interactions)
Fast meta-learned agent
74
• Recurrent policies methods
○ Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint
arXiv:1611.02779 (2016).
○ Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017).
• Optimization-based Inference
○ Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of
deep networks." arXiv preprint arXiv:1703.03400 (2017).
○ Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv
preprint arXiv:1803.02999 (2018).
• POMDPs approach
○ Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context
variables." International conference on machine learning. 2019.
○ Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv preprint
arXiv:1905.06424 (2019).
Meta Reinforcement Learning Algorithms
76
• General approach: Just train a recurrent network!
Recurrent policies methods
RNN hidden state is not reset between episodes!
Finn, Chelsea, Sergey Levine. “Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning." ICML Tutorial 2019
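A minimal sketch of such a recurrent policy; the GRU backbone and the input layout (state, previous action, previous reward, done flag) follow the general recipe, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch of an RL2-style policy: a GRU whose input at each step is
    (state, previous one-hot action, previous reward, done flag). The hidden
    state is carried across episode boundaries within a trial, so the agent
    can accumulate information about the current MDP; it is reset only when
    a new trial (new MDP) begins."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.gru = nn.GRUCell(obs_dim + n_actions + 2, hidden)
        self.pi = nn.Linear(hidden, n_actions)

    def step(self, obs, prev_action, prev_reward, prev_done, h):
        # Batched tensors assumed: obs (B, obs_dim), prev_action (B, n_actions),
        # prev_reward and prev_done (B, 1), h (B, hidden).
        x = torch.cat([obs, prev_action, prev_reward, prev_done], dim=-1)
        h = self.gru(x, h)                    # h summarizes the trial so far
        return torch.distributions.Categorical(logits=self.pi(h)), h

# Within a trial: keep h across episode boundaries.
# Between trials (new MDP): h = torch.zeros(batch, hidden).
```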
77
• RL2: Fast Reinforcement Learning via Slow Reinforcement Learning
Recurrent policies methods: Variants
Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
(Vanilla) Reinforcement Learning: Optimize episode by episode
RL Algorithm: TRPO+GAE
78
• RL2: Fast Reinforcement Learning via Slow Reinforcement Learning
Recurrent policies methods: Variants
Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
Meta Reinforcement Learning: Optimize trial by trial
preserved initialization
: Sample environment
: 𝑘’th episode in environment
RL Algorithm: TRPO+GAE
79
• A Simple Neural AttentIve Meta-Learner (SNAIL)
Recurrent policies methods: Variants
Mishra, Nikhil, et al. "A simple neural attentive meta-learner." ICLR 2018 & https://www.programmersought.com/article/1715257628/
Like RL², but:
replace the LSTM with dilated temporal convolutions
(like WaveNet) + attention
dilated temporal convolution
80
• Botvinick, Matthew, et al. "Reinforcement learning, fast and slow." Trends in
Cognitive Sciences 23.5 (2019): 408-422.
• Wang, Jane X., et al. "Learning to reinforcement learn." arXiv preprint
arXiv:1611.05763 (2016).
• Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement
learning." arXiv preprint arXiv:1611.02779 (2016).
• Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv
preprint arXiv:1707.03141 (2017).
• Ritter, Samuel, et al. "Been there, done that: Meta-learning with episodic
recall." arXiv preprint arXiv:1805.09692 (2018).
Recurrent policies methods: Further Readings
81
• Model-Agnostic Meta Learning
○ Can also be applied to Reinforcement Learning!
Optimization-based Inference
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
83
• Model-Agnostic Meta Learning
○ Can also be applied to Reinforcement Learning!
Optimization-based Inference
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
meta-initialization (outer loop)
adaptation (inner loop)
86
• Model-Agnostic Meta Learning
○ Can also be applied to Reinforcement Learning!
Optimization-based Inference
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
Roll out
90
• Model-Agnostic Meta Learning
○ Can also be applied to Reinforcement Learning!
Optimization-based Inference
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
Few-shot adaptation
Fine-Tuning (Adaptation)
Just 1 or a few gradient steps
Evaluation: Cumulative Rewards
91
• MAML-RL Demo (meta-test)
Optimization-based Inference
https://sites.google.com/view/maml
93
• Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning
for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
• Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning
algorithms." arXiv preprint arXiv:1803.02999 (2018).
• Xu, Zhongwen, Hado P. van Hasselt, and David Silver. "Meta-gradient
reinforcement learning." Advances in neural information processing systems.
2018.
• Houthooft, Rein, et al. "Evolved policy gradients." Advances in Neural
Information Processing Systems. 2018.
• Salimans, Tim, et al. "Evolution strategies as a scalable alternative to
reinforcement learning." arXiv preprint arXiv:1703.03864 (2017).
• Rothfuss, Jonas, et al. "ProMP: Proximal meta-policy search." arXiv preprint
arXiv:1810.06784 (2018).
• Fernando, Chrisantha, et al. "Meta-learning by the baldwin effect." Proceedings
of the Genetic and Evolutionary Computation Conference Companion. 2018.
Optimization-based Inference: Further Readings
94
• What are Partially Observable Markov Decision Processes (POMDPs)?
• A POMDP is an MDP with hidden states (a hidden Markov model with
actions)
• A POMDP is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma)$
○ $\mathcal{S}$ is a finite set of states
○ $\mathcal{A}$ is a finite set of actions
○ $\mathcal{O}$ is a finite set of observations
○ $\mathcal{P}$ is a state transition probability matrix: $\mathcal{P}(s' \mid s, a)$
○ $\mathcal{R}$ is a reward function: $\mathcal{R}(s, a)$
○ $\mathcal{Z}$ is an observation function: $\mathcal{Z}(o \mid s', a)$
○ $\gamma$ is a discount factor: $\gamma \in [0, 1)$
• A history is a sequence of actions, observations, and rewards
POMDPs approach
95
• A belief state $b(s) = P(S_t = s \mid H_t)$ is a probability distribution over states,
conditioned on the history
POMDPs approach
(figure: environment and the corresponding belief state)
Example observation:
• Number of adjacent walls
• A noisy version: wrong value with probability 0.1
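For a discrete POMDP, the belief state can be maintained by exact Bayesian filtering; a minimal sketch with assumed array shapes:

```python
import numpy as np

def belief_update(belief, T, Z, action, observation):
    """One step of exact Bayesian filtering for a discrete POMDP:
    b'(s') is proportional to Z[a, s', o] * sum_s T[a, s, s'] * b(s).
    T[a, s, s'] is the transition model and Z[a, s', o] the observation
    model (the array shapes are illustrative assumptions)."""
    predicted = belief @ T[action]                      # predict step
    new_belief = predicted * Z[action, :, observation]  # correct with o
    return new_belief / new_belief.sum()                # renormalize
```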
98
• Meta Reinforcement Learning as POMDPs
• POMDP for unobserved state: A simple example
POMDPs approach
Motivated by CS330: Deep Multi-Task and Meta-Learning
A simple Gridworld (only 𝑠3 gives positive reward)
Where am I?
Initial State → Action: 𝑎 = “left”, 𝑠 = 𝑠1, 𝑟 = 0
(figure: belief over 𝑠1, 𝑠2, 𝑠3 before and after the action)
101
• Meta Reinforcement Learning as POMDPs
• POMDP for unobserved task: A simple example
POMDPs approach
Motivated by CS330: Deep Multi-Task and Meta-Learning
A simple Gridworld (each task gives positive reward in a different state)
What task am I in?
Initial State → Action: 𝑎 = “left”, 𝑠 = 𝑠1, 𝑟 = 0
(figure: belief over MDP1, MDP2, MDP3 before and after the action)
102
• Meta Reinforcement Learning as Partially Observable RL
○ Control-inference duality
○ The latent task variable $z$ → everything needed to solve the task
• So, what does that mean?
○ Learning a task = inferring $z$
○ $z$ encapsulates the information the policy must have to solve the current task!
• General recipe
1. Sample $z \sim \hat{p}(z \mid \text{history})$: some approximate posterior (e.g. variational inference)
2. Act according to $\pi_\theta(a \mid s, z)$ to collect more data: act as though $z$ was correct
POMDPs approach
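A sketch of this recipe as a Thompson-sampling-style loop; `posterior` and `policy` are hypothetical interfaces standing in for whatever approximate inference and task-conditioned policy a concrete method uses:

```python
def posterior_sampling_episode(env, posterior, policy):
    """Sketch of the recipe above: sample a task hypothesis z from the
    (approximate) posterior given experience so far, act as though z were
    correct, and update the posterior with the new data.
    env/posterior/policy are illustrative interfaces, not a specific API."""
    z = posterior.sample()                  # 1. sample z ~ p_hat(z | history)
    state, done = env.reset(), False
    while not done:
        action = policy.act(state, z)       # 2. act as though z were correct
        next_state, reward, done, _ = env.step(action)  # Gym-style step
        posterior.observe(state, action, reward, next_state)
        state = next_state
```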
103
• Key Idea: How can we estimate Task Belief States?
1. Supervised Learning (if available)
- Can we use task labels during meta-training (not in the adaptation phase)?
2. Variational approach
- 𝑝(𝑧|𝑐) is intractable, can we use variational inference in meta RL?
POMDPs approach: Variants
104
• An agent consists of two modules:
○ The optimal meta-learner only needs to make decisions based on the current
state and the current task belief
1. The policy module: a policy conditioned on the current state and (an approximate
representation of) the posterior task belief
2. The belief module: learns to output an approximate representation of that belief
POMDPs approach: Auxiliary Supervised Learning
But how can we estimate the task belief?
Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv 2019
105
• Some examples of task labels
○ Multi-armed bandit
- A vector of arm probabilities
○ Semi-circle
- The angle on the semi-circle (e.g. max = 2𝜋)
○ Cheetah velocity
- The target velocity
POMDPs approach: Auxiliary Supervised Learning
Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv 2019
106
• Meta Reinforcement Learning as Task Inference
POMDPs approach: Auxiliary Supervised Learning
Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv 2019
A. A baseline LSTM agent
B. A belief network agent
C. Auxiliary head agent
• LSTMs and the information bottleneck (IB) are optional
• RL algorithms
- SVG(0) (Heess, Nicolas, et al., 2015) as the off-policy algorithm
- PPO (Schulman, John, et al., 2017) as the on-policy algorithm
107
• Can we achieve both meta-training and adaptation efficiency?
- Yes! We can build an off-policy meta-RL algorithm that disentangles task
inference and control
○ Meta-training efficiency: off-policy RL algorithm (e.g. SAC)
○ Adaptation efficiency: probabilistic encoder to estimate the task belief
POMDPs approach: Variational approach
Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." ICML 2019.
108
• Objective of variational approach
POMDPs approach: Variational approach
Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." ICML 2019.
We don’t need to know the order of transitions
in order to identify the MDP (Markov property)
Use a permutation-invariant encoder for
simplicity and speed (no need for an RNN)
Regularization (information bottleneck)
Likelihood (Bellman error)
Reparameterization trick
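Putting these pieces together, the resulting variational objective can be sketched (paraphrasing the PEARL formulation, with $c^{\mathcal{T}}$ denoting the context collected in task $\mathcal{T}$) as

$$\mathbb{E}_{\mathcal{T}}\left[\mathbb{E}_{z \sim q_\phi(z \mid c^{\mathcal{T}})}\left[R(\mathcal{T}, z) + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid c^{\mathcal{T}})\,\|\,p(z)\right)\right]\right],$$

where $R(\mathcal{T}, z)$ is the likelihood term (instantiated as a Bellman error), the KL term is the information-bottleneck regularizer against a prior $p(z)$, and the permutation-invariant encoder factorizes over context transitions as $q_\phi(z \mid c_{1:N}) \propto \prod_{n=1}^{N} \Psi_\phi(z \mid c_n)$, trained with the reparameterization trick.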
109
• PEARL (Probabilistic Embeddings for Actor-critic RL)
POMDPs approach: Variational approach
Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." ICML 2019.
110
• Osband, Ian, Daniel Russo, and Benjamin Van Roy. "(More) efficient
reinforcement learning via posterior sampling." Advances in Neural Information
Processing Systems. 2013.
• Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via
probabilistic context variables." International conference on machine learning.
2019.
• Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv
preprint arXiv:1905.06424 (2019).
• Zintgraf, Luisa, et al. "Variational Task Embeddings for Fast Adaption in Deep
Reinforcement Learning." International Conference on Learning Representations
Workshop on Structure & Priors in Reinforcement Learning. 2019.
• Zintgraf, Luisa, et al. "VariBAD: A Very Good Method for Bayes-Adaptive Deep
RL via Meta-Learning." arXiv preprint arXiv:1910.08348 (2019).
POMDPs approach: Further Readings
Thank you for your attention!

Deep Meta Learning

  • 1.
    Chang-Hoon Jeong Biointelligence lab.,Artificial Intelligence Institute Seoul National University Deep Meta Learning Learning-to-Learn in Neural Networks
  • 2.
    2 • Meta Learning:Overview • Meta Supervised Learning ○Model-based methods ○Optimization-based Inference ○Non-parametric approach • Meta Reinforcement Learning o Recurrent policies methods o Optimization-based Inference oPOMDPs approach Agenda
  • 3.
  • 4.
    4 • Large, diversedataset, large computation → Broad Generalization Deep Learning: SOTA(State-Of-The-Art) GPT-3 Agent57 Meta Pseudo Labels Pham, Hieu, et al. "Meta pseudo labels." 2020. Brown, Tom B., et al. "Language models are few-shot learners." 2020. Badia, Adrià Puigdomènech, et al. "Agent57: Outperforming the atari human benchmark." 2020.
  • 5.
    5 Difference between Machineand Human Machine is a specialist Human is a generalist
  • 6.
    6 Problem Statement What ifyour data has a long tail? What if you want a general-purpose AI system in the real world? CS330: Deep Multi-Task and Meta-Learning
  • 7.
    7 Few-Shot Learning Can youclassify the image using just 3 datapoints? By Gogh or Cezanne?
  • 8.
    8 Few-Shot Learning Can youclassify the image using just 3 datapoints? By Gogh or Cezanne? (episodic) N-way K-shot learning In this case, 2-way 3-shot learning
  • 9.
    9 Few-Shot Learning Can youclassify the image using just 3 datapoints? By Gogh or Cezanne? (episodic) N-way K-shot learning In this case, 2-way 3-shot learning How did you accomplish this? By leveraging prior experience!
  • 10.
    10 • Difference betweenmulti-task learning and meta learning What is the Meta Learning?
  • 11.
    11 • Meta Learningin Computer Science oMeta-learning is a subfield of machine learning where automatic learning algorithms are applied on metadata about machine learning experiments: learning-to-learn • How can we understand the meta-learning? ○Mechanistic view - For implementation the algorithm - Training the neural network uses a meta-dataset, which itself consists of many datasets (each for a task) ○Probabilistic view - For making it easier to understand the algorithm - Extract prior information, and learning a new task uses this prior and small training set to infer most likely posterior parameters What is the Meta Learning?
  • 12.
    12 • (Conventional) MachineLearning o(Supervised Learning) Given a training dataset , we can train a model parameterized by , by solving, - is a loss function that measures the match between true labels and those predicted by - is the condition to make explicit the dependence of this solution on factors such as choice of optimizer(e.g. Adam, SGD, …) or function class for (e.g. CNN, RNN, …), or initialization parameters(e.g. Xavier, He, …) Formalizing the Meta Learning * is pre-specified, can we learn the learning algorithm itself, rather than assuming it is pre-specified and fixed?
  • 13.
    13 • What isa task? oA task: • Different tasks can vary based on: • different objects • different people • different objectives • different lighting conditions • different languages • … Formalizing the Meta Learning CS330: Deep Multi-Task and Meta-Learning, Tutorial #2: few-shot learning and meta-learning(by W. Zi)
  • 14.
    14 • The badnews oDifferent tasks need to share some structure o If this doesn’t hold, you are better off using single-task learning • The good news o There are many tasks with shared structure! The Structure for Meta Learning CS330: Deep Multi-Task and Meta-Learning transfer! ?
  • 15.
    15 • Meta Learning:Task distribution view o specifies ‘how-to-learn’ and is often evaluated in terms of performance over a distribution of tasks (meta-knowledge, inductive bias) oLearning ‘how-to-learn’ solving by, How to solve Meta Learning? How can we solve this problem in practice?
  • 16.
    16 • The learningprocess: Two-level optimization ○Inner-level; - A new task is presented - An agent tries to quickly learn the associated concepts from the training observations - This quick adaptation is facilitated by knowledge that it has accumulated across earlier tasks at the outer-level ○Outer-level; - Whereas the inner-level concerns a single task, the outer-level concerns a multitude of tasks - In the outer-level, the tasks of the inner-level of one cycle are integrated How to solve Meta Learning?
  • 17.
    17 • Meta-dataset: thedataset of datasets Introduction to Meta-dataset Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016. task Held-out classes
  • 18.
    18 • Meta-training (SupervisedLearning) Meta Learning: Overview Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016. Query set Support set
  • 19.
    19 • Meta-training (SupervisedLearning) Meta Learning: Overview Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016. Query set Support set Learning ‘how-to-learn’ (meta-learning)
  • 20.
    20 • Meta-training (SupervisedLearning) Meta Learning: Overview Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016. Query set Support set Learning ‘how-to-learn’ (meta-learning) Methods • Model-based methods • Optimization-based Inference • Non-parametric approach
  • 21.
    21 • Meta-testing (SupervisedLearning) → Few-Shot Learning Meta Learning: Overview Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016. Query set Support set Fast adaptation with small data
  • 22.
    22 • Meta-testing (SupervisedLearning) → Few-Shot Learning Meta Learning: Overview Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." 2016. Query set Support set Fast adaptation with small data Finally, we can evaluate the accuracy of our meta-learner by the performance of on the test split of each target task
  • 23.
  • 24.
    24 Benchmark Datasets inMeta Supervised Learning Dataset Description Omniglot (Lake et al., 2011) This dataset presents an image recognition task. Each image corresponds to one out of 1,623 characters from 50 different alphabets. Every character was drawn by 20 people. Note that in this case, the characters are the classes/labels. ImageNet (Deng et al., 2009) This is the largest image classification dataset, containing more than 20K classes and over 14 million colored images. MiniImageNet is a mini variant of the large ImageNet dataset (Deng et al., 2009) for image classification, proposed by Vinyals et al. (2016) to reduce the engineering efforts to run experiments. The mini dataset contains 60,000 colored images of size 84 × 84. There are a total of 100 classes present, each accorded by 600 examples. TieredImageNet (Ren et al., 2018) is another variation of the large ImageNet dataset. It is similar to miniImageNet, but contains a hierarchical structure. That is, there are 34 classes, each with its own sub-classes. CIFAR-10 and CIFAR-100 (Krizhevsky, 2009) Two other image recognition datasets. Each one contains 60K RGB images of size 32×32. CIFAR-10 and CIFAR100 contain 10 and 100 classes respectively, with a uniform number of examples per class (6,000 and 600 respectively). Every class in CIFAR-100 also has a super-class, of which there are 20 in the full data set. Many variants of the CIFAR datasets can be sampled, giving rise to e.g. CIFAR-FS (Bertinetto et al., 2019) and FC-100 (Oreshkin et al., 2018). CUB-200-2011 (Wah et al., 2011) The CUB-200-2011 dataset contains roughly 12K RGB images of birds from 200 species. Every image has some labeled attributes(e.g. crown color, tail shape). MNIST (LeCun et al., 2010) MNIST presents a hand-written digit recognition task, containing ten classes (for digits 0 through 9). In total, the dataset is split into a 60K train and 10K test gray scale images of hand-written digits. Meta-Dataset (Triantafillou et al., 2020) This dataset comprises several other datasets such as Omniglot (Lake et al., 2011), CUB-200 (Wah et al., 2011), ImageNet (Deng et al., 2009), and more (Triantafillou et al., 2020). An episode is then constructed by sampling a dataset (e.g. Omniglot), selecting a subset of labels to create train and test splits as before. In this way, broader generalization is enforced since the tasks are more distant from each other. Meta-world (Yu et al., 2019) A meta reinforcement learning dataset, containing 50 robotic manipulation tasks (control a robot arm to achieve some pre-defined goal, e.g. unlocking a door, or playing soccer). It was specifically designed to cover a broad range of tasks, such that meaningful generalization can be measures (Yu et al., 2019).
  • 25.
    25 • Model-based methods(SL O, RL O) ○ Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." International conference on machine learning. 2016. ○ Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017). ○ Munkhdalai, Tsendsuren, and Hong Yu. "Meta networks." Proceedings of machine learning research 70 (2017): 2554. • Optimization-based Inference (SL O, RL O) ○ Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017). ○ Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv preprint arXiv:1803.02999 (2018). ○ Rusu, Andrei A., et al. "Meta-learning with latent embedding optimization." arXiv preprint arXiv:1807.05960 (2018). • Non-parametric approach (SL O, RL X) ○ Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015. ○ Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in neural information processing systems. 2016. ○ Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning." Advances in neural information processing systems. 2017. Meta Learning Algorithms
  • 26.
    26 • Model-based methods(SL O, RL O) ○ Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." International conference on machine learning. 2016. ○ Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017). ○ Munkhdalai, Tsendsuren, and Hong Yu. "Meta networks." Proceedings of machine learning research 70 (2017): 2554. • Optimization-based Inference (SL O, RL O) ○ Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017). ○ Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv preprint arXiv:1803.02999 (2018). ○ Rusu, Andrei A., et al. "Meta-learning with latent embedding optimization." arXiv preprint arXiv:1807.05960 (2018). • Non-parametric approach (SL O, RL X) ○ Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop. Vol. 2. 2015. ○ Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in neural information processing systems. 2016. ○ Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning." Advances in neural information processing systems. 2017. Meta Learning Algorithms
  • 27.
    27 • General approach:Train a recurrent networks to represent Model-based methods CS330: Deep Multi-Task and Meta-Learning Train with standard supervised learning support set query set
  • 28.
    28 • General approach:Train a recurrent networks to represent Model-based methods CS330: Deep Multi-Task and Meta-Learning 1. Sample task (or mini-batch of tasks) 2. Sample disjoint datasets from
  • 29.
    29 • General approach:Train a recurrent networks to represent Model-based methods CS330: Deep Multi-Task and Meta-Learning 1. Sample task (or mini-batch of tasks) 2. Sample disjoint datasets from 3. Compute 4. Update using support set query set
  • 30.
    30 • One ofChallenge: Output parameters does not scalable oNo need to output all parameters of neural net, only sufficient statistics (Santoro et al., MANN, 2016), (Mishra et al., SNAIL, 2017) Model-based methods For example, in multi-task learning, concatenate or multiplication Low-dimensional vector Represents contextual task information CS330: Deep Multi-Task and Meta-Learning
  • 31.
    31 • Memory-Augmented NeuralNetworks Model-based methods: Variants Graves, Alex, et al. "Neural turing machines." arXiv, 2014, Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." ICML 2016.
  • 32.
    32 • A SimpleNeural AttentIve Meta-Learner(SNAIL) Model-based methods: Variants Mishra, Nikhil, et al. "A simple neural attentive meta-learner." ICLR 2018
  • 33.
    33 • Graves, Alex,Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv p reprint arXiv:1410.5401 (2014). • Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory networks." arXiv preprint arXiv:1410.3916 (2014). • Santoro, Adam, et al. "Meta-learning with memory-augmented neural networks." International conference on machine learning. 2016. • Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017). • Munkhdalai, Tsendsuren, and Hong Yu. "Meta networks." Proceedings of machine learning research 70 (2017): 2554. • Garnelo, Marta, et al. "Conditional neural processes." arXiv preprint arXiv:1807.01613 (2018). Model-based methods: Further Readings
  • 34.
    34 • Can weutilize the pre-trained prior knowledge? • One successful form of prior knowledge: initialization for fine-tuning • What about Transfer Learning? Optimization-based Inference Can we do better?
  • 35.
    35 • Model-Agnostic MetaLearning Optimization-based Inference Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 36.
    36 • Model-Agnostic MetaLearning Optimization-based Inference Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 37.
    37 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 38.
    38 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning meta-initialization(outer-loop) adaptation(inner-loop) Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 39.
    39 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 40.
    40 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning
  • 41.
    41 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning 1st for loop 2nd for loop Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 42.
    42 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning 1st for loop 2nd for loop Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 43.
    43 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning 1st for loop 2nd for loop Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 44.
    44 • MAML forSupervised Learning(Meta training) Model-Agnostic Meta Learning Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 45.
    45 • MAML forSupervised Learning(Meta testing) Model-Agnostic Meta Learning Few-shot adaptation Fine Tuning(Adaptation) Just 1 or few gradient steps Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 46.
    46 • MAML forSupervised Learning(Meta testing) Model-Agnostic Meta Learning Few-shot adaptation Fine Tuning(Adaptation) Just 1 or few gradient steps Evaluation: N-way k-shot Accuracy Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017
  • 47.
    47 • Finn, Chelsea,Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017). • Ravi, Sachin, and Hugo Larochelle. "Optimization as a model for few-shot learning." (2016). • Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv preprint arXiv:1803.02999 (2018). • Rusu, Andrei A., et al. "Meta-learning with latent embedding optimization." arXiv preprint arXiv:1807.05960 (2018). • Antoniou, Antreas, Harrison Edwards, and Amos Storkey. "How to train your MAML." arXiv preprint arXiv:1810.09502 (2018). • Lee, Yoonho, and Seungjin Choi. "Gradient-based meta-learning with learned layerwise metric and subspace." arXiv preprint arXiv:1801.05558 (2018). • Fallah, Alireza, Aryan Mokhtari, and Asuman Ozdaglar. "On the convergence theory of gradient- based model-agnostic meta-learning algorithms." International Conference on Artificial Intelligence and Statistics. 2020. • Song, Xingyou, et al. "Es-maml: Simple hessian-free meta learning." arXiv preprint arXiv:1910.01215 (2019). • Triantafillou, Eleni, et al. "Meta-dataset: A dataset of datasets for learning to learn from few examples." arXiv preprint arXiv:1903.03096 (2019). Optimization-based Inference: Further Readings
  • 48.
    48 • KNN(K-Nearest Neighbor)Algorithm oLazy, Non-parametric Learner Non-parametric approach Red or Blue?
  • 49.
    49 • KNN(K-Nearest Neighbor)Algorithm oLazy, Non-parametric Learner Non-parametric approach Red or Blue? 1-Nearest Neighbor → Blue 3-Nearset Neighbor → Red 5-Nearset Neighbor → Red
  • 50.
    50 • In lowdata regimes (e.g. 2D data), non-parametric methods are simple, works well! • What about high dimensional data? o During meta-training → Still want to be parametric o During meta-testing → Few-Shot learning ↔ low data regime Non-parametric approach Can we train parametric meta-learners that produce effective non-parametric learners? Finn, Chelsea, Sergey Levine. “Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning." ICML Tutorial 2019
  • 51.
    51 • At meta-testtime, compare test image(query) with training images (support set) o Which space do we use? o Which distance metric do we use? Non-parametric approach Finn, Chelsea, Sergey Levine. “Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning." ICML Tutorial 2019
  • 52.
    52 • Siamese NeuralNetworks Non-parametric approach: Variants meta-training meta-testing Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop, 2015.
  • 53.
    53 • Siamese NeuralNetworks: A simple example Non-parametric approach: Variants
  • 54.
    54 • Siamese NeuralNetworks: A simple example Non-parametric approach: Variants Label 1
  • 55.
    55 • Siamese NeuralNetworks: A simple example Non-parametric approach: Variants Label 0
  • 56.
    56 • Siamese NeuralNetworks Non-parametric approach: Variants Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML deep learning workshop, 2015. Same class: Otherwise: Siamese twin is not depicted, but joins immediately after the 4096 unit fully-connected layer where the L1 component-wise distance between vectors is computed
57
Non-parametric approach: Variants
• Siamese Neural Networks
 ○ [Table: test accuracy on the Omniglot dataset (results not extracted)]
Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop, 2015.
58
Non-parametric approach: Variants
• Siamese Neural Networks
 ○ [Table: test accuracy on the Omniglot dataset (results not extracted)]
 ○ Mismatch: meta-training uses 2-way classification, but meta-testing uses N-way classification
"A simple machine learning principle: test and train conditions must match" (Vinyals et al., Matching Networks for One-Shot Learning)
Koch, Gregory, et al. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop, 2015.
59
Non-parametric approach: Variants
• Matching Networks for One-Shot Learning
 ○ Convolutional encoder for the images
 ○ Bidirectional LSTM to embed the support set
 ○ Cosine distance to compare query and support embeddings
Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
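The prediction rule reduces to attention over the support set with a cosine kernel; a sketch under that reading, assuming the embeddings have already been produced by the encoders above.

import torch
import torch.nn.functional as F

def matching_net_predict(f_query, g_support, y_support, n_way):
    """Label a query as an attention-weighted sum of one-hot support labels.
    f_query: (D,) embedded query; g_support: (M, D) embedded support set."""
    sims = F.cosine_similarity(f_query.unsqueeze(0), g_support, dim=1)  # cosine kernel
    attn = F.softmax(sims, dim=0)                                       # attention over support
    onehot = F.one_hot(y_support, n_way).float()                        # (M, n_way)
    return attn @ onehot                                                # class probabilities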
60
Non-parametric approach: Variants
• Prototypical Networks
 ○ Can we aggregate class information to create a prototypical embedding?
 ○ Prototype = mean embedding of a class's support examples: c_k = (1 / |S_k|) Σ_{(x_i, y_i) ∈ S_k} f_φ(x_i)
 ○ Classify by (squared) Euclidean distance to each prototype: p_φ(y = k | x) ∝ exp(−d(f_φ(x), c_k))
Snell, Jake, et al. "Prototypical networks for few-shot learning." Advances in Neural Information Processing Systems. 2017.
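These two equations translate almost directly into code; a sketch assuming precomputed embeddings.

import torch

def proto_predict(f_query, f_support, y_support, n_way):
    """Prototype = class mean embedding; classify by squared Euclidean distance."""
    protos = torch.stack([f_support[y_support == k].mean(dim=0)   # c_k for each class
                          for k in range(n_way)])
    d2 = ((f_query.unsqueeze(0) - protos) ** 2).sum(dim=1)        # squared distances
    return torch.softmax(-d2, dim=0)                              # p(y = k | x)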
61
Non-parametric approach: Further Readings
• Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. "Siamese neural networks for one-shot image recognition." ICML Deep Learning Workshop. Vol. 2. 2015.
• Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
• Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning." Advances in Neural Information Processing Systems. 2017.
• Sung, Flood, et al. "Learning to compare: Relation network for few-shot learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
• Allen, Kelsey R., et al. "Infinite mixture prototypes for few-shot learning." arXiv preprint arXiv:1902.04552 (2019).
• Garcia, Victor, and Joan Bruna. "Few-shot learning with graph neural networks." arXiv preprint arXiv:1711.04043 (2017).
63
Introduction to Reinforcement Learning
• Reinforcement Learning: an Agent interacts with the Environment (the world)
• At each time step 𝑡, the agent
 - Executes action 𝐴𝑡
 - Receives observation 𝑆𝑡
 - Receives scalar reward 𝑅𝑡
• The environment
 - Receives action 𝐴𝑡
 - Emits observation 𝑆𝑡+1
 - Emits scalar reward 𝑅𝑡+1
• 𝑡 increments at env.step()
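A minimal interaction loop in the classic OpenAI Gym API; the environment name and the random placeholder policy are illustrative.

import gym

env = gym.make("CartPole-v1")
obs = env.reset()                                # initial observation S_0
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()           # placeholder policy chooses A_t
    obs, reward, done, info = env.step(action)   # env.step() advances t, emits S_{t+1}, R_{t+1}
    total_reward += reward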
67
Introduction to Reinforcement Learning
• The interaction loop: the Agent observes state 𝑆𝑡 and takes action 𝐴𝑡; the Environment returns reward 𝑅𝑡+1 and next state 𝑆𝑡+1
68
Introduction to Reinforcement Learning
• Markov Decision Processes (MDPs)
 ○ Mathematical formulation of the RL problem
• An MDP is defined by a tuple (S, A, P, R, γ)
 ○ S: state set
 ○ A: action set
 ○ P: transition probability, P(s′ | s, a)
 ○ R: reward function, R(s, a)
 ○ γ: discount factor
69
Introduction to Reinforcement Learning
• Markov Decision Processes (MDPs)
 ○ Mathematical formulation of the RL problem, defined by (S, A, P, R, γ) as above
• (Stochastic) Policies
 ○ Probability of the agent's action: π(a | s) = P[𝐴𝑡 = a | 𝑆𝑡 = s]
 ○ Value function: V_π(s) = E_π[Σ_k γ^k 𝑅𝑡+k+1 | 𝑆𝑡 = s]
• Goal: find the optimal policy π* that maximizes the expected discounted return
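For a small tabular MDP, the optimal value function and a greedy optimal policy can be computed directly with value iteration; a numpy sketch, where the transition tensor P and reward matrix R are hypothetical inputs.

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Returns the optimal value function V* and a greedy (optimal) policy."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (P @ V)    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)      # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new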
70
Problem Statement
• What's wrong with Reinforcement Learning?
 ○ DQN, OpenAI Five, AlphaStar, AlphaGo
71
Problem Statement
• What's wrong with Reinforcement Learning?
 ○ Each new environment (MDP 1, MDP 2, MDP 3) is learned from scratch
 ○ Can we meta-learn reinforcement learning algorithms that are much more efficient?
"Hula Beach", "Never grow up", "The Sled" by artist Matt Spangler, mattspangler.com
72
Problem Statement
• What's wrong with Reinforcement Learning?
 ○ Bandit example comparing Reinforcement Learning vs. Meta Reinforcement Learning: https://bit.ly/bandit-meta-rl-ex
73
Meta Reinforcement Learning
• General Framework for Meta Reinforcement Learning
 ○ meta-training: a Meta RL Algorithm is trained across Env 1 (MDP 1), Env 2 (MDP 2), Env 3 (MDP 3), … (e.g. 10 environments)
 ○ meta-testing: the fast, meta-learned agent adapts to a held-out Env 11 (MDP 11) with only a small number of interactions
74
Meta Reinforcement Learning Algorithms
• Recurrent policies methods
 ○ Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
 ○ Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017).
• Optimization-based Inference
 ○ Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
 ○ Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv preprint arXiv:1803.02999 (2018).
• POMDPs approach
 ○ Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." International Conference on Machine Learning. 2019.
 ○ Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv preprint arXiv:1905.06424 (2019).
76
Recurrent policies methods
• General approach: just train a recurrent network!
 ○ The RNN hidden state is not reset between episodes!
Finn, Chelsea, Sergey Levine. "Meta-Learning: from Few-Shot Learning to Rapid Reinforcement Learning." ICML Tutorial 2019.
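A sketch of such a recurrent policy in PyTorch; the input convention (current observation concatenated with previous action, reward, and done flag) follows the RL²-style setup, and the module names are illustrative.

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """GRU policy whose hidden state carries experience across episode boundaries."""
    def __init__(self, in_dim, n_actions, hid=128):
        super().__init__()
        self.rnn = nn.GRUCell(in_dim, hid)   # input: (s_t, a_{t-1}, r_{t-1}, done flag)
        self.pi = nn.Linear(hid, n_actions)

    def act(self, x, h):
        h = self.rnn(x, h)                   # h is NOT reset when an episode ends;
        logits = self.pi(h)                  # reset it only when a new task is sampled
        a = torch.distributions.Categorical(logits=logits).sample()
        return a, h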
77
Recurrent policies methods: Variants
• RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
 ○ (Vanilla) Reinforcement Learning: optimize episode by episode
 ○ RL algorithm: TRPO + GAE
Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
78
Recurrent policies methods: Variants
• RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
 ○ Meta Reinforcement Learning: optimize trial by trial
 ○ Each trial: sample an environment (MDP), then run the 𝑘 episodes of the trial in that environment
 ○ The RNN hidden state is preserved across the episodes within a trial and reset only when a new environment is sampled
 ○ RL algorithm: TRPO + GAE
Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
79
Recurrent policies methods: Variants
• A Simple Neural AttentIve meta-Learner (SNAIL)
 ○ Like RL², but replaces the LSTM with dilated temporal convolutions (as in WaveNet) plus attention
Mishra, Nikhil, et al. "A simple neural attentive meta-learner." ICLR 2018. Figure: https://www.programmersought.com/article/1715257628/
80
Recurrent policies methods: Further Readings
• Botvinick, Matthew, et al. "Reinforcement learning, fast and slow." Trends in Cognitive Sciences 23.5 (2019): 408-422.
• Wang, Jane X., et al. "Learning to reinforcement learn." arXiv preprint arXiv:1611.05763 (2016).
• Duan, Yan, et al. "RL²: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
• Mishra, Nikhil, et al. "A simple neural attentive meta-learner." arXiv preprint arXiv:1707.03141 (2017).
• Ritter, Samuel, et al. "Been there, done that: Meta-learning with episodic recall." arXiv preprint arXiv:1805.09692 (2018).
83
Optimization-based Inference
• Model-Agnostic Meta Learning
 ○ Can also be applied to Reinforcement Learning!
 ○ Meta-initialization (outer loop): learn an initialization of the policy parameters across tasks
 ○ Adaptation (inner loop): roll out the current policy to collect trajectories, then take policy-gradient steps on them
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017.
90
Optimization-based Inference
• Model-Agnostic Meta Learning (meta-testing)
 ○ Few-shot adaptation: fine-tuning (adaptation) with just 1 or a few gradient steps
 ○ Evaluation: cumulative rewards
Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." ICML 2017.
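Putting the loops together, hedged pseudocode for one meta-update; sample_tasks, rollout, and policy_gradient are hypothetical helpers, and the outer update shown is the first-order approximation rather than the full second-order MAML gradient.

def maml_rl_step(theta, alpha, beta, sample_tasks, rollout, policy_gradient):
    """One outer-loop update of MAML-RL (first-order sketch, hypothetical helpers)."""
    meta_grad = 0
    for task in sample_tasks():
        traj_pre = rollout(theta, task)                              # roll out meta-parameters
        theta_i = theta - alpha * policy_gradient(theta, traj_pre)   # inner-loop adaptation
        traj_post = rollout(theta_i, task)                           # roll out adapted policy
        meta_grad += policy_gradient(theta_i, traj_post)             # first-order approximation
    return theta - beta * meta_grad                                  # outer-loop meta-update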
91
Optimization-based Inference
• MAML-RL Demo (meta-test): https://sites.google.com/view/maml
93
Optimization-based Inference: Further Readings
• Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." arXiv preprint arXiv:1703.03400 (2017).
• Nichol, Alex, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms." arXiv preprint arXiv:1803.02999 (2018).
• Xu, Zhongwen, Hado P. van Hasselt, and David Silver. "Meta-gradient reinforcement learning." Advances in Neural Information Processing Systems. 2018.
• Houthooft, Rein, et al. "Evolved policy gradients." Advances in Neural Information Processing Systems. 2018.
• Salimans, Tim, et al. "Evolution strategies as a scalable alternative to reinforcement learning." arXiv preprint arXiv:1703.03864 (2017).
• Rothfuss, Jonas, et al. "ProMP: Proximal meta-policy search." arXiv preprint arXiv:1810.06784 (2018).
• Fernando, Chrisantha, et al. "Meta-learning by the Baldwin effect." Proceedings of the Genetic and Evolutionary Computation Conference Companion. 2018.
94
POMDPs approach
• What is a Partially Observable Markov Decision Process (POMDP)?
• A POMDP is an MDP with hidden states (a hidden Markov model with actions)
• A POMDP is a tuple (S, A, O, P, R, Z, γ)
 ○ S: a finite set of states
 ○ A: a finite set of actions
 ○ O: a finite set of observations
 ○ P: a state transition probability matrix, P(s′ | s, a)
 ○ R: a reward function, R(s, a)
 ○ Z: an observation function, Z(o | s′, a)
 ○ γ: a discount factor, γ ∈ [0, 1]
• A history 𝐻𝑡 is a sequence of actions, observations, and rewards
95
POMDPs approach
• A belief state b(h) is a probability distribution over states, conditioned on the history h
• Example observation in a gridworld environment
 ○ The number of adjacent walls
 ○ A noisy version? The observed value is wrong with probability 0.1
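Belief states are maintained with a Bayes filter; a numpy sketch, where the transition matrices P[a] and observation matrices Z[a] are hypothetical inputs.

import numpy as np

def belief_update(b, a, o, P, Z):
    """Bayes filter: b'(s') ∝ Z[a][s', o] * sum_s P[a][s, s'] * b(s).
    b: (S,) current belief; P[a]: (S, S') transitions; Z[a]: (S', O) observation probs."""
    predicted = b @ P[a]             # predict: push the belief through the transition model
    b_new = Z[a][:, o] * predicted   # correct: weight by the observation likelihood
    return b_new / b_new.sum()       # normalize to a probability distribution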
96
POMDPs approach
• Meta Reinforcement Learning as POMDPs
• POMDP for an unobserved state: a simple example
 ○ A simple gridworld with states 𝑠1, 𝑠2, 𝑠3 (only 𝑠3 gives positive reward)
 ○ From the initial state, the agent cannot tell which cell it is in: "Where am I?"
Motivated by CS330: Deep Multi-Task and Meta-Learning
97
POMDPs approach
• POMDP for an unobserved state: a simple example (continued)
 ○ The agent acts: 𝑎 = "left", 𝑠 = 𝑠1, 𝑟 = 0
 ○ This transition is evidence about the agent's position, so its belief over {𝑠1, 𝑠2, 𝑠3} is updated
Motivated by CS330: Deep Multi-Task and Meta-Learning
99
POMDPs approach
• POMDP for an unobserved task: a simple example
 ○ The same gridworld, but now with three tasks (MDP1, MDP2, MDP3), each giving positive reward in a different state
 ○ From the initial state, the agent cannot tell which task it is in: "What task am I in?"
Motivated by CS330: Deep Multi-Task and Meta-Learning
100
POMDPs approach
• POMDP for an unobserved task: a simple example (continued)
 ○ The agent acts: 𝑎 = "left", 𝑠 = 𝑠1, 𝑟 = 0
 ○ The observed reward is evidence about the task, so the agent's belief over {MDP1, MDP2, MDP3} is updated
Motivated by CS330: Deep Multi-Task and Meta-Learning
102
POMDPs approach
• Meta Reinforcement Learning as Partially Observable RL
 ○ A control-inference duality problem
 ○ A latent task variable z encapsulates everything needed to solve the task
• So, what does that mean?
 ○ Learning a task = inferring z from experience
 ○ A policy conditioned on the encapsulated information z must solve the current task!
• General recipe
 1. Sample z from some approximate posterior over tasks (e.g. via variational inference)
 2. Act according to π(a | s, z), as though z were correct, to collect more data
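A hedged sketch of this recipe as a posterior-sampling control loop; posterior, policy, and env are hypothetical stand-ins for a learned task-inference model, a z-conditioned policy, and a classic Gym-style environment.

def posterior_sampling_episode(posterior, policy, env, context):
    """One episode of the general recipe (hypothetical components)."""
    z = posterior.sample(context)          # 1. sample a task hypothesis z ~ q(z | context)
    obs, done = env.reset(), False
    while not done:                        # 2. act as though z were correct
        action = policy.act(obs, z)
        next_obs, reward, done, _ = env.step(action)
        context.append((obs, action, reward, next_obs))  # more data for the next inference
        obs = next_obs
    return context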
103
POMDPs approach: Variants
• Key Idea: how can we estimate task belief states?
 1. Supervised learning (if available)
  - Can we use task labels during meta-training (not in the adaptation phase)?
 2. Variational approach
  - 𝑝(𝑧|𝑐) is intractable; can we use variational inference in meta RL?
104
POMDPs approach: Auxiliary Supervised Learning
• An agent consists of two modules
 ○ The optimal meta-learner only needs to make decisions based on the current state and the current belief
 1. The policy module: π(a | s, b), dependent on the current state and (an approximate representation of) the task posterior b
 2. The belief module: learns to output that approximate representation b
• But how can we estimate the task belief?
Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv 2019.
105
POMDPs approach: Auxiliary Supervised Learning
• Some examples of task labels
 ○ Multi-armed bandit: a vector of arm probabilities
 ○ Semi-circle: the angle of the semi-circle (e.g. max = 2𝜋)
 ○ Cheetah velocity: the target velocity
Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv 2019.
106
POMDPs approach: Auxiliary Supervised Learning
• Meta Reinforcement Learning as Task Inference: agent architectures
 A. A baseline LSTM agent
 B. A belief network agent
 C. An auxiliary-head agent
• LSTMs and the information bottleneck (IB) are optional
• RL algorithms
 - SVG(0) (Heess et al., 2015) as the off-policy algorithm
 - PPO (Schulman et al., 2017) as the on-policy algorithm
Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv 2019.
107
POMDPs approach: Variational approach
• Can we achieve both meta-training and adaptation efficiency?
 ○ Yes! We can build an off-policy meta-RL algorithm that disentangles task inference and control
 ○ Meta-training efficiency: an off-policy RL algorithm (e.g. SAC)
 ○ Adaptation efficiency: a probabilistic encoder to estimate the task belief
Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." ICML 2019.
108
POMDPs approach: Variational approach
• Objective of the variational approach
 ○ E_T [ E_{z ∼ q_φ(z | c^T)} [ R(T, z) + β D_KL( q_φ(z | c^T) ‖ p(z) ) ] ]
  - Likelihood term R(T, z): here, a Bellman-error (critic) objective
  - KL term: regularization toward the prior p(z) (an information bottleneck)
  - Trained end-to-end with the reparameterization trick
 ○ By the Markov property, we don't need to know the order of transitions to identify the MDP
  - So use a permutation-invariant encoder over the context, for simplicity and speed (no need for an RNN): q_φ(z | c_{1:N}) ∝ Π_n Ψ_φ(z | c_n)
Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." ICML 2019.
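The permutation-invariant encoder reduces to a product of per-transition Gaussian factors: precisions add, and means are precision-weighted. A sketch of that combination step, mirroring (but not copied from) the authors' released code.

import torch

def product_of_gaussians(mus, vars_):
    """Combine per-transition factors Ψ(z | c_n) into one Gaussian q(z | c_{1:N}).
    The result does not depend on the order of transitions (permutation-invariant).
    mus, vars_: (N, D) per-transition means and variances from the encoder network."""
    precisions = 1.0 / torch.clamp(vars_, min=1e-7)
    var = 1.0 / precisions.sum(dim=0)            # combined variance, shape (D,)
    mu = var * (mus * precisions).sum(dim=0)     # precision-weighted mean, shape (D,)
    return mu, var

# Sampling with the reparameterization trick keeps the encoder differentiable:
# mu, var = product_of_gaussians(mus, vars_); z = mu + var.sqrt() * torch.randn_like(mu)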
109
POMDPs approach: Variational approach
• PEARL (Probabilistic Embeddings for Actor-critic meta-RL)
Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." ICML 2019.
110
POMDPs approach: Further Readings
• Osband, Ian, Daniel Russo, and Benjamin Van Roy. "(More) efficient reinforcement learning via posterior sampling." Advances in Neural Information Processing Systems. 2013.
• Rakelly, Kate, et al. "Efficient off-policy meta-reinforcement learning via probabilistic context variables." International Conference on Machine Learning. 2019.
• Humplik, Jan, et al. "Meta reinforcement learning as task inference." arXiv preprint arXiv:1905.06424 (2019).
• Zintgraf, Luisa, et al. "Variational Task Embeddings for Fast Adaption in Deep Reinforcement Learning." International Conference on Learning Representations Workshop on Structure & Priors in Reinforcement Learning. 2019.
• Zintgraf, Luisa, et al. "VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning." arXiv preprint arXiv:1910.08348 (2019).
Thank you for your attention!