Unsupervised Curricula for Visual Meta-Reinforcement Learning (CARML)
1. Unsupervised Curricula for Visual Meta-Reinforcement Learning
2020. 07. 06
Jeong-Gwan Lee
Jabri, Allan, et al. "Unsupervised curricula for visual meta-reinforcement learning." Advances
in Neural Information Processing Systems. 2019.
2. Meta-Reinforcement Learning
• Regular RL: learn a policy for a single task (= specialist)
• Meta-RL: learn a meta-policy for a task distribution by learning good initial weights and an adaptation rule (= generalist)
See the appendix for the adaptation procedure in detail.
3. Preliminaries: Task Distribution
• Single task (= single MDP)
• Task distribution (= MDP distribution)
• A task is sampled from the task distribution.
• Each task can be defined as an "MDP without a general reward" plus a sampled task-specific reward (a minimal sketch follows below).
[Figure: target-reaching problem with two sampled tasks from the same start point, each with a given target-specific reward]
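As a concrete, hypothetical illustration of this decomposition (shared reward-free dynamics plus a sampled task-specific reward, as in the target-reaching example), here is a minimal Python sketch; the `Task` class and `sample_task` helper are illustrative names, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import random

# Illustration only: a task pairs the shared, reward-free MDP with a sampled reward function.
@dataclass
class Task:
    reward_fn: Callable[[Tuple[float, float]], float]  # task-specific reward over states

def sample_task(goals):
    """Target-reaching task: reward = negative distance to a sampled goal position."""
    gx, gy = random.choice(goals)
    return Task(reward_fn=lambda s: -((s[0] - gx) ** 2 + (s[1] - gy) ** 2) ** 0.5)

task = sample_task(goals=[(1.0, 0.0), (0.0, 1.0)])
print(task.reward_fn((0.5, 0.5)))  # reward of a state under this sampled task
```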
4. Preliminaries: Unsupervised Meta-RL
• Supervised meta-RL
• Given a hand-crafted task distribution.
• When a task is sampled, its reward function is provided to the meta-policy at both the meta-train and meta-test phases.
• Unsupervised meta-RL
• No task distribution or goal description at the meta-train phase.
• No reward function at the meta-train phase; the reward function is used only at the meta-test phase.
[Figure: target-reaching problem; supervised meta-RL samples tasks with a given reward (meta-train = meta-test), while unsupervised meta-RL has no reward supervision at meta-train and sees rewards only at meta-test]
Q. How can we train a meta-policy without a reward function?
5. Preliminaries: DIAYN (Diversity Is All You Need)
• Goal: learn useful skills without any reward function at training time
• Skill (z): determines how the agent moves in the state space
• Skill-conditioned policy: π(a | s, z)
• Three desiderata
1. The sampled skill (z) should control which states (s) the agent visits
= the states (s) the agent visited should make it easy to infer the skill (z) that was used
2. Only states (s), not actions (a), are used to distinguish skills (z)
3. Encourage exploration (maximize the policy's entropy)
[Figure: example skills z1-z8, a categorical variable with 8 categories, each moving in a different direction from the start point]
6. Preliminaries: DIAYN (Diversity Is All You Need)
• How do we define a pseudo-reward?
• Define the objective as a variational lower bound on the mutual information between states and skills:
I(S; Z) = H(Z) - H(Z|S) ≥ E[ log q_φ(z|s) - log p(z) ]
where the true posterior p(z|s) is intractable and is replaced by a learned discriminator q_φ(z|s), and p(z) is a fixed discrete uniform distribution (50 categories).
• DIAYN is implemented with SAC, which also maximizes the policy's entropy over actions (exploration): sample a skill variable z, take actions a conditioned on z, and observe the visited states s.
7. Preliminaries: DIAYN (Diversity Is All You Need)
• The pseudo-reward is then
r(s, z) = log q_φ(z|s) - log p(z)
where log q_φ(z|s) measures how well the sampled skill can be inferred from the visited state.
• The skill-conditioned policy π(a | s, z) is trained to maximize this reward, while the discriminator q_φ is updated to maximize log q_φ(z|s) on the states the policy visits (a code sketch follows).
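To make the construction concrete, here is a minimal PyTorch sketch of the DIAYN pseudo-reward and discriminator update. It is an illustration, not the paper's code: the MLP discriminator, layer sizes, and function names are assumptions; only the formula r(s, z) = log q_φ(z|s) - log p(z) with a fixed uniform 50-way p(z) follows the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SKILLS = 50  # p(z): fixed uniform categorical prior over 50 skills

class SkillDiscriminator(nn.Module):
    """q_phi(z|s): predicts which skill generated the visited state."""
    def __init__(self, state_dim: int, num_skills: int = NUM_SKILLS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, num_skills),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over skills

def diayn_pseudo_reward(disc: SkillDiscriminator, state: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """r(s, z) = log q_phi(z|s) - log p(z), with p(z) uniform over NUM_SKILLS."""
    log_q = F.log_softmax(disc(state), dim=-1)             # [batch, NUM_SKILLS]
    log_p_z = -torch.log(torch.tensor(float(NUM_SKILLS)))  # log(1/50)
    return log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1) - log_p_z

def discriminator_loss(disc: SkillDiscriminator, states: torch.Tensor, zs: torch.Tensor) -> torch.Tensor:
    """Update q_phi to maximize log q_phi(z|s): cross-entropy on the visited states."""
    return F.cross_entropy(disc(states), zs)
```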
8. Preliminaries: DIAYN (Diversity Is All You Need)
• Skills learned with no reward
• After training, sample a skill z ~ p(z) and roll out using π(a | s, z)
[Examples: run forward, walk forward, run backward, acrobatic]
• Accelerating learning with policy initialization:
(1) take the skill with the highest reward for each benchmark task and
(2) fine-tune this policy using the task-specific reward function.
9. Curricula for Unsupervised Meta-Reinforcement Learning: CARML
• Goal: in the unsupervised meta-RL setting, learn a meta-policy together with a task distribution that evolves via an EM algorithm.
[Diagram: a loop between the task distribution q_φ and the meta-RL policy: 1. sample a task from q_φ, 2. define the task reward, 3. update the meta-policy on the sampled tasks]
Q. How do we evolve the task distribution using the data collected by the meta-policy?
E-step: fitting the task distribution / M-step: meta-learning (a high-level loop sketch follows)
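A high-level sketch of this alternation, assuming hypothetical helpers `rollout_meta_policy`, `fit_task_distribution`, and `meta_rl_update` that stand in for the components detailed on the following slides; the names and loop structure are illustrative, not the paper's implementation.

```python
def carml(meta_policy, env, num_iterations: int, tasks_per_iter: int):
    """Alternate E-steps (fit the task distribution) and M-steps (meta-RL on derived rewards)."""
    task_dist = None  # e.g. a Gaussian mixture over state features (slides 11-12)
    for _ in range(num_iterations):
        # Collect states with the current (post-update) meta-policy.
        states = rollout_meta_policy(meta_policy, env, num_rollouts=tasks_per_iter)
        # E-step: refit the task distribution q_phi on the collected states (slides 13-14).
        task_dist = fit_task_distribution(states, prev=task_dist)
        # M-step: sample tasks z, derive rewards from q_phi, update the meta-policy (slides 15-16).
        for _ in range(tasks_per_iter):
            z = task_dist.sample_task()
            reward_fn = lambda s, z=z: task_dist.reward(s, z)
            meta_policy = meta_rl_update(meta_policy, env, z, reward_fn)
    return meta_policy, task_dist
```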
10. Diversity Structure
Organize the task distribution from the policy by maximizing the mutual information between each state s and a latent task variable z.
Objective: information maximization via EM
• The task distribution should be well structured by the task variable z
• The meta-RL policy π_θ should be as diverse as possible in state space
• The meta-RL policy π_θ should follow the sampled task variable z
(The two decompositions of this objective are written out below.)
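For reference, these requirements correspond to the two standard decompositions of the mutual information I(S; Z) that the EM procedure maximizes; the identity below is standard, and the annotations reflect the slides' reading of each form.

```latex
\begin{aligned}
I(S;Z) &= H(Z) - H(Z \mid S) && \text{visited states should identify the sampled task } z \\
       &= H(S) - H(S \mid Z) && \text{the policy should cover diverse states, organized by } z
\end{aligned}
```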
11. How to Learn the Task Distribution q_φ
Using the states collected by the post-update meta-policy π_θ in the previous M-step, q_φ estimates the state density with a Gaussian mixture model.
[Figure: density estimation: features of the collected states in state space are fit by mixture components z = 1, ..., 5]
12. How to Design the Task Distribution q_φ
Define q_φ as a Gaussian mixture model over state features:
• Categorical task variable z with prior (task distribution) q(z)
• Likelihood q_φ(s|z): a Gaussian N(μ_z, Σ_z) over the features of s, with mixture parameters φ = {μ_z, Σ_z}
• State marginal distribution q_φ(s) = Σ_z q(z) q_φ(s|z)
(A minimal sketch with scikit-learn follows.)
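A minimal sketch of such a mixture with scikit-learn, fit on an assumed array `features` of state features (shape [N, d]); the paper fits the mixture on learned ResNet features, which this sketch leaves out.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed: features of states collected by the meta-policy, shape [N, d].
features = np.random.randn(1000, 16)  # placeholder data for illustration

K = 5  # number of mixture components = number of tasks z
gmm = GaussianMixture(n_components=K, covariance_type="full").fit(features)

log_q_s = gmm.score_samples(features)        # log q(s): state marginal (log-density)
q_z_given_s = gmm.predict_proba(features)    # q(z|s): posterior over tasks
# Bayes: log q(s|z) = log q(z|s) + log q(s) - log q(z)
log_q_s_given_z = np.log(q_z_given_s + 1e-12) + log_q_s[:, None] - np.log(gmm.weights_)[None, :]
```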
13. E-step: Task Acquisition (Fitting q_φ)
Just before each meta-RL step, sample a task variable z, roll out the meta-policy, and add the visited states to the dataset.
E-step: estimate the task-distribution parameters φ given the collected states (see derivation).
We don't care which previous z each state came from; the E-step simply reorganizes the states so that the task distribution is well structured.
14. Task Acquisition via Discriminative Clustering
[1] Caron, Mathilde, et al. "Deep clustering for unsupervised learning of visual features." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
A variant of DeepCluster [1]: alternate between fitting the Gaussian mixture q_φ on encoder (ResNet) features f(s) and updating the encoder with the resulting pseudo-labels (a rough code sketch follows).
1. Update the Gaussian mixture model by MLE (the EM algorithm for GMMs) on the current features.
2. Based on the posterior q_φ(z | f(s)), assign a pseudo-label to each state.
3. Based on the pseudo-labels, update the ResNet encoder by supervised learning.
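A rough sketch of the three steps above, under stated assumptions: `encoder` is a hypothetical PyTorch feature network standing in for the ResNet, the mixture fit uses scikit-learn, and the linear classification head and training loop are illustrative rather than the paper's procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

def discriminative_clustering_step(encoder: nn.Module, states: torch.Tensor,
                                   num_tasks: int, epochs: int = 1):
    """One DeepCluster-style round: fit GMM on features, pseudo-label, train the encoder."""
    # 1. Update the Gaussian mixture by MLE (EM for GMMs) on current encoder features.
    with torch.no_grad():
        feats = encoder(states).cpu().numpy()
    gmm = GaussianMixture(n_components=num_tasks).fit(feats)
    # 2. Assign pseudo-labels from the posterior q_phi(z | f(s)).
    pseudo_labels = torch.as_tensor(gmm.predict(feats), dtype=torch.long)
    # 3. Update the encoder by supervised learning on the pseudo-labels.
    head = nn.Linear(feats.shape[1], num_tasks)  # hypothetical classification head
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
    for _ in range(epochs):
        loss = F.cross_entropy(head(encoder(states)), pseudo_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gmm, encoder
```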
15. M-step: How to Define the Reward Function?
The task-conditioned reward for the meta-policy is (see derivation)
r_z(s) = log q_φ(s|z) - log π(s)
The policy's state marginal π(s) is unknown, so we assume the fitted marginal q_φ(s) matches it, π(s) ≈ q_φ(s), which gives
r_z(s) = log q_φ(s|z) - log q_φ(s)
Both terms are known from the previous E-step: -log q_φ(s) says "be diverse", and log q_φ(s|z) says "do the task well".
16. M-step: Meta-Learning (Fitting π_θ)
We can trade off the discriminability of skills against task-specific exploration with a weight λ ∈ [0, 1]:
r_z(s) = λ log q_φ(z|s) + (1 - λ) log q_φ(s|z)
• Skill inference: if q_φ infers the sampled task z well from the visited states, the reward is high.
• Task-specific exploration: given task z, the policy is rewarded for trying unfamiliar states that are consistent with the task.
This task-conditioned reward is used to train the meta-policy (a code sketch follows).
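Continuing the scikit-learn sketch from slide 12, here is a minimal illustration of how such a trade-off reward could be computed from the fitted mixture; the exact λ-weighted form follows my reading of the paper, and the helper name is hypothetical.

```python
import numpy as np

def carml_reward(gmm, feature: np.ndarray, z: int, lam: float = 0.5) -> float:
    """r_z(s) = lam * log q(z|s) + (1 - lam) * log q(s|z), from a fitted GaussianMixture.

    lam in [0, 1] trades off skill discriminability (first term)
    against task-specific exploration (second term).
    """
    x = feature[None, :]                                          # shape [1, d]
    log_q_s = gmm.score_samples(x)[0]                             # log q(s)
    log_q_z_given_s = np.log(gmm.predict_proba(x)[0, z] + 1e-12)  # log q(z|s)
    # Bayes: log q(s|z) = log q(z|s) + log q(s) - log q(z)
    log_q_s_given_z = log_q_z_given_s + log_q_s - np.log(gmm.weights_[z])
    return lam * log_q_z_given_s + (1.0 - lam) * log_q_s_given_z
```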
17. Experiment Setting: Visual Navigation Task
VizDoom environment: a room filled with five different objects.
• Goal: reach a single target object
• Observation: egocentric images with a limited field of view
• Actions: turn right, turn left, move forward
• Reward function: inverse L2 distance to the specified target object
[Figure: example observation and bird's-eye view (2D positions)]
18. Evolution of Task Distribution
[Figure: trajectories of each mixture component (= task) projected into the true state space, after 1 E-step, after 3 E-steps / 2 M-steps, and after 5 E-steps / 4 M-steps]
• Trajectories of the random initial policy are less structured and less distinct.
• As training progresses, the tasks become more structured and well organized, showing both exploration (wide coverage) and exploitation (clear direction).
19. RL2 as the M-step Meta-Policy
• A meta-RL algorithm using an RNN policy
• When a "trial" is initiated, a new task is sampled
• One trial = two episodes, with the hidden state shared across them
• Process flow (see the figure and code sketch below)
Duan, Yan, et al. "RL^2: Fast reinforcement learning via slow reinforcement learning." arXiv preprint arXiv:1611.02779 (2016).
[Figure: RL2 structure: from the initial state and initial hidden state, an observation embedding feeds a GRU, followed by an FC layer and a softmax over actions; trial 1 (sampled task 1) spans episodes 1 and 2 with a shared hidden state, then trial 2 (sampled task 2) begins]
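A compact PyTorch sketch of the figure's embedding -> GRU -> FC -> softmax stack (hypothetical sizes). In RL2 the recurrent input typically also concatenates the previous action, reward, and termination flag, and the hidden state is reset only at trial boundaries; this sketch only shows the hidden state being carried across steps.

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Embedding -> GRU -> FC -> softmax over actions."""
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs: torch.Tensor, hidden: torch.Tensor):
        h = self.gru(torch.relu(self.embed(obs)), hidden)
        return torch.softmax(self.fc(h), dim=-1), h

policy = RL2Policy(obs_dim=32, num_actions=3)
h = torch.zeros(1, 128)            # reset only at the start of a trial, not per episode
obs = torch.zeros(1, 32)
action_probs, h = policy(obs, h)   # carry h across steps (and episodes within the trial)
```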
20. Direct Transfer and Fine-Tuning on One Test Task
• Direct transfer (markers and dotted lines): apply the meta-policy to the test task without fine-tuning.
• Fine-tuning: update the policy parameters on the specific test task.
[Plot: performance vs. number of episodes (200 to 1000)]
Only one test task is sampled (random seed 20).
Baselines:
1) PPO trained from scratch
2) Pre-training with random network distillation (RND) for unsupervised exploration
3) Supervised meta-learning, as an oracle
21. CARML as Meta-pretraining
Experiment description: train meta-RL with different initialization weights obtained from the training task distribution, and plot learning curves on the test task distribution.
Purpose: check the meta-pretraining ability of CARML.
Variant: encoder init, i.e., initialize only the pre-trained encoder, to separate the effect of the meta-policy from that of the visual representation (ResNet).
The acquired meta-learning strategies may be useful for learning related task distributions, effectively acting as a pretrained model for meta-RL.
22. Summary
They proposed a framework for inducing unsupervised, adaptive task distributions for meta-RL that scales to environments with high-dimensional pixel observations.
They showed that CARML enables (1) unsupervised acquisition of meta-learning strategies (skill maps) that transfer to test task distributions, (2) better direct evaluation and more sample-efficient fine-tuning, and (3) more sample-efficient supervised meta-learning.