Unsupervised Curricula
for Visual Meta-Reinforcement Learning
2020. 07. 06
Jeong-Gwan Lee
Jabri, Allan, et al. "Unsupervised Curricula for Visual Meta-Reinforcement Learning." Advances in Neural Information Processing Systems, 2019.
Meta Reinforcement Learning
• Regular RL: learn a policy for a single task (a specialist)
• Meta-RL: learn a meta-policy for a task distribution, by learning good initial weights and an adaptation rule (a generalist)
See the appendix for the adaptation step in detail.
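As a reference (not taken from these slides), a common MAML-style way to write the meta-RL objective, where U is the adaptation rule that maps the initial weights θ and a task 𝒯 to adapted weights θ′:

```latex
\theta^{*} = \arg\max_{\theta}\;
\mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})}\!
\left[ \mathbb{E}_{\tau \sim \pi_{\theta'}}\!\left[ R_{\mathcal{T}}(\tau) \right] \right],
\qquad \theta' = U(\theta, \mathcal{T})
```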
Preliminaries: Task Distribution
• Single task (= single MDP)
• Task distribution (= distribution over MDPs)
• A task 𝒯 is sampled from the task distribution p(𝒯).
• Each task can be defined as an "MDP without a general reward" plus a sampled task-specific reward function.
(Figure: a target-reaching problem; from the same start point, Sampled Task 1 and Sampled Task 2 each come with a given task-specific reward.)
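One standard way to write this down (the slide's own equations are images, so the notation here is assumed): all tasks share the state space, action space, dynamics, and discount, and differ only in their sampled reward.

```latex
\mathcal{T}_i = (\mathcal{S}, \mathcal{A}, P, \gamma, r_i),
\qquad \mathcal{T}_i \sim p(\mathcal{T}),
\qquad r_i : \mathcal{S} \times \mathcal{A} \to \mathbb{R}
```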
Preliminaries: Unsupervised Meta-RL
• Supervised meta-RL
• Given a hand-crafted task distribution.
• When a task is sampled, its reward function is provided to the meta-policy at both the meta-train and the meta-test phase.
• Unsupervised meta-RL
• There is no task distribution and no goal description at the meta-train phase.
• There is no reward function at the meta-train phase; a reward function is used only at the meta-test phase.
(Figure: in supervised meta-RL, meta-train = meta-test and each sampled task comes with a given reward; in unsupervised meta-RL, meta-train ≠ meta-test and there is no reward supervision at meta-train, illustrated with the target-reaching problem from the same start point.)
Q. How can we train a meta-policy without a reward function?
Preliminaries: DIAYN (Diversity Is All You Need)
• Goal: learn useful skills without any reward function at training time.
• Skill (z): determines how the agent moves through the state space.
• Skill-conditioned policy: π(a | s, z).
• Three desiderata:
1. The sampled skill (z) should control which states (s) the agent visits; equivalently, the visited states (s) should make it easy to infer which skill (z) was used.
2. Only states (s), not actions (a), are used to distinguish skills (z).
3. Exploration is encouraged by maximizing the policy's entropy.
(Figure: skill example with a categorical skill variable over 8 categories; skills z₁, …, z₈ fan out from a common start point.)
Preliminaries: DIAYN (Diversity Is All You Need)
• Then how do we define a pseudo-reward?
• Define the objective as the mutual information between skills and states, plus a policy-entropy term (reconstructed below). Exploration proceeds by sampling a skill variable z, taking actions a based on z, and observing the next states s; DIAYN is implemented with SAC, which maximizes the policy's entropy over actions.
• The skill prior is a fixed discrete uniform distribution (50 categories); the true skill posterior is intractable, which motivates the variational bound below.
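The DIAYN objective and its variational lower bound, as given in Eysenbach et al. (2019); the "intractable" annotation on the slide refers to the true skill posterior, which the bound replaces with a learned discriminator q_φ(z | s), while p(z) stays a fixed uniform distribution:

```latex
\mathcal{F}(\theta) = I(S; Z) + \mathcal{H}[A \mid S] - I(A; Z \mid S)
                    = \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z]
```

```latex
\mathcal{F}(\theta) \;\ge\;
\mathbb{E}_{z \sim p(z),\, s \sim \pi_\theta}\!\left[ \log q_\phi(z \mid s) - \log p(z) \right]
+ \mathcal{H}[A \mid S, Z] \;=\; \mathcal{G}(\theta, \phi)
```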
Preliminaries: DIAYN (Diversity Is All You Need)
• Then how do we define the pseudo-reward? The skill-conditioned policy is rewarded for its inferability with respect to the sampled skill: r_z(s) = log q_φ(z | s) − log p(z), which is high when the visited state reveals the skill.
• The discriminator q_φ(z | s) is updated to maximize the log-likelihood of the sampled skill given the visited states.
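A minimal PyTorch sketch of this pseudo-reward and discriminator update; the network sizes and helper names are illustrative, not from the DIAYN codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SKILLS = 50  # fixed discrete uniform skill prior, as on the slide

class Discriminator(nn.Module):
    """q_phi(z | s): predicts which skill produced a visited state."""
    def __init__(self, state_dim, num_skills=NUM_SKILLS):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_skills),
        )

    def forward(self, state):
        return self.net(state)  # logits over skills

disc = Discriminator(state_dim=8)
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
log_p_z = -torch.log(torch.tensor(float(NUM_SKILLS)))  # log p(z) under a uniform prior

def pseudo_reward(states, z):
    """r_z(s) = log q_phi(z | s) - log p(z): high when s reveals the sampled skill."""
    with torch.no_grad():
        log_q = F.log_softmax(disc(states), dim=-1)
    return log_q.gather(-1, z.unsqueeze(-1)).squeeze(-1) - log_p_z

def discriminator_step(states, z):
    """Maximize log q_phi(z | s) on visited states (cross-entropy minimization)."""
    loss = F.cross_entropy(disc(states), z)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```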
Preliminaries: DIAYN (Diversity Is All You Need)
• Skills learned with no reward: after training, sample z ∼ p(z) and roll out the skill-conditioned policy. Learned skills include running forward, walking forward, running backward, and acrobatic behaviors.
• Accelerating learning with policy initialization: (1) take the skill with the highest reward on each benchmark task, then (2) finetune this policy using the task-specific reward function.
Curricula for Unsupervised Meta-Reinforcement Learning: CARML
• Goal: in the unsupervised meta-RL setting, learn a meta-policy together with an evolving task distribution, via an EM algorithm.
(Diagram: an EM loop between the task distribution q_φ and the meta-RL policy. E-step: fit the task distribution to data from the policy; how do we evolve the task distribution using that data? M-step: meta-learning on the current tasks: 1. sample a task, 2. define the task reward, 3. update the meta-policy.)
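A minimal sketch of this outer loop; collect_rollouts, fit_task_distribution, make_reward, and meta_rl_update are hypothetical helpers standing in for the components described on the following slides, not the authors' API:

```python
def carml(env, policy, q_phi, num_rounds=10, num_meta_updates=100):
    """Alternate E-steps (fit task distribution) and M-steps (meta-learn)."""
    data = collect_rollouts(env, policy)                 # states from current policy
    for _ in range(num_rounds):
        # E-step: fit the task distribution (a mixture model) to visited states
        q_phi = fit_task_distribution(q_phi, data)
        # M-step: meta-learn on tasks sampled from q_phi
        for _ in range(num_meta_updates):
            z = q_phi.sample_task()                      # 1. sample a task
            reward_fn = make_reward(q_phi, z)            # 2. define the task reward
            policy = meta_rl_update(policy, env, z, reward_fn)  # 3. update meta-policy
        data = collect_rollouts(env, policy)             # fresh data for next E-step
    return policy, q_phi
```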
Objective: Information Maximization via EM
Organize the task distribution and the policy by maximizing the mutual information between each state s and a latent task variable z.
• Diversity: the meta-RL policy π_θ should be as diverse as possible in state space.
• Structure: the task distribution should be well structured by the task variable z, and the meta-RL policy π_θ should follow the sampled task variable z.
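Both readings follow from standard identities for mutual information: the first decomposition gives structure (states should identify the task), the second gives diversity (cover the state space, but stay predictable given z):

```latex
I(s; z) = \mathcal{H}(z) - \mathcal{H}(z \mid s)
        = \mathcal{H}(s) - \mathcal{H}(s \mid z)
```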
How to Learn a Task Distribution q_φ
Using the states collected by the post-update meta-policy π_θ in the previous M-step, q_φ estimates a density model over states as a Gaussian mixture model.
(Figure: density estimation over features of states; each mixture component z = 1, …, 5 covers a distinct region of the state space.)
How to Design a Task Distribution q_φ
Define q_φ as a Gaussian mixture model:
• Parameters: a categorical task variable z ∈ {1, …, K} and per-component Gaussian parameters (μ_z, Σ_z).
• Prior (task distribution): p(z), categorical.
• Likelihood: q_φ(s | z) = 𝒩(g(s); μ_z, Σ_z) over encoded state features g(s).
• State marginal distribution: q_φ(s) = Σ_z p(z) q_φ(s | z).
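A minimal sketch of fitting q_φ as a GMM over state features with scikit-learn; the data and dimensions are placeholders, and the paper additionally learns the feature encoder (next slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

K = 5  # number of mixture components (= tasks); illustrative

# features: (N, d) array of encoded states g(s) collected from the policy
features = np.random.randn(1000, 16)  # placeholder data

q_phi = GaussianMixture(n_components=K, covariance_type="full")
q_phi.fit(features)  # MLE via the EM algorithm for GMMs

posterior = q_phi.predict_proba(features)     # q_phi(z | g(s)), shape (N, K)
log_marginal = q_phi.score_samples(features)  # log q_phi(s), shape (N,)
```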
E-Step: Task Acquisition (Fitting q_φ)
Just before the meta-RL step, sample task variables z, roll out the policy, and add the visited states to a dataset 𝒟. The E-step then re-estimates φ given 𝒟 by maximum likelihood, φ ← argmax_φ E_{s∼𝒟}[log q_φ(s)] (see the derivation in the paper). We do not care which previous z each state came from; the states are simply reorganized so that the mixture is well structured.
Task Acquisition via Discriminative Clustering
A variant of DeepCluster [1], alternating three steps between the ResNet encoder g and the Gaussian mixture q_φ (a sketch follows below):
1. Update the Gaussian mixture model by MLE (the EM algorithm for GMMs) on the encoded states g(s).
2. Based on the posterior q_φ(z | g(s)), assign a pseudo-label to each state.
3. Based on the pseudo-labels, update the ResNet encoder by supervised learning.
[1] Caron, Mathilde, et al. "Deep Clustering for Unsupervised Learning of Visual Features." Proceedings of the European Conference on Computer Vision (ECCV), 2018.
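A sketch of one round of this alternation in PyTorch plus scikit-learn; encoder is any torch feature extractor (e.g. a small ResNet) and cls_head is a hypothetical linear classifier over the K pseudo-labels, not the authors' exact code:

```python
import torch
import torch.nn.functional as F
from sklearn.mixture import GaussianMixture

def discriminative_clustering_round(encoder, cls_head, states, K=5, epochs=1):
    # 1. Fit the Gaussian mixture by MLE (EM for GMMs) on current features.
    with torch.no_grad():
        feats = encoder(states).cpu().numpy()
    gmm = GaussianMixture(n_components=K).fit(feats)

    # 2. Assign pseudo-labels from the posterior q_phi(z | g(s)).
    pseudo_labels = torch.as_tensor(gmm.predict(feats))  # argmax over z

    # 3. Update the encoder by supervised learning on the pseudo-labels.
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(cls_head.parameters()), lr=1e-4)
    for _ in range(epochs):
        loss = F.cross_entropy(cls_head(encoder(states)), pseudo_labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return gmm
```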
M-step: How to Define a Reward Function?
The task-conditioned reward function for the meta-policy is derived from the mutual-information objective. The policy's true state marginal p(s) is unknown, so we assume the fitted marginal matches it, p(s) ≈ q_φ(s); both q_φ(s | z) and q_φ(s) are then known from the previous E-step. The reward decomposes into a "do the task well" term and a "be diverse" term, as reconstructed below.
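Reconstructing the slide's equation from its annotations, with p(s) ≈ q_φ(s), the mutual-information reward decomposes as:

```latex
r_z(s) = \underbrace{\log q_\phi(s \mid z)}_{\text{do the task well}}
         \underbrace{-\, \log q_\phi(s)}_{\text{be diverse}}
       = \log q_\phi(z \mid s) - \log p(z)
```

(the second equality is Bayes' rule, and recovers a DIAYN-style discriminability reward).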
M-step: Meta-Learning (Fitting π_θ)
The task-conditioned reward function for the meta-policy trades off between the discriminability of skills and task-specific exploration, weighted by λ ∈ [0, 1]:
• Inference ability of skills: if s_{t+1} makes the sampled task z easy to infer, the reward should be high.
• Task-specific exploration: given task z, if the policy visits a state that is unfamiliar for that task, the reward should be high.
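One reading of the slide's weighted reward that is consistent with its two bullets (the exact form is an assumption, since the slide's equation is an image): the first term rewards states from which z is inferable, the second rewards states that are novel under the task's own density:

```latex
r_\lambda(s_{t+1}) = \lambda \log q_\phi(z \mid s_{t+1})
                   + (1 - \lambda)\,\bigl(-\log q_\phi(s_{t+1} \mid z)\bigr),
\qquad \lambda \in [0, 1]
```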
Experiment Setting: Visual Navigation Task
VizDoom, involving a room filled with five different objects.
• Goal: reach a single target object.
• Observation: egocentric images with a limited field of view.
• Actions: turn right, turn left, move forward.
• Reward function (at meta-test): inverse L2 distance to the specified target object.
(Figure: an example egocentric observation, and a bird's-eye view of the room in 2D position space.)
Evolution of Task Distribution
(Figure: trajectories of each mixture component (= task) projected into the true 2D state space across EM rounds. After 1 E-step, the random initial policy's trajectories are less structured and less distinct; after 3 E-steps / 2 M-steps, they are more structured and well organized; after 5 E-steps / 4 M-steps, they show both exploration (wide coverage) and exploitation (consistent direction).)
RL² as the M-step Meta-Policy
• A meta-RL algorithm using an RNN policy.
• When a trial is initiated, a new task is sampled.
• One trial = two episodes, with the hidden state shared across them.
Duan, Yan, et al. "RL²: Fast Reinforcement Learning via Slow Reinforcement Learning." arXiv preprint arXiv:1611.02779, 2016.
(Figure: the RL² structure and process flow. Each input passes through an embedding φ, a GRU, a fully connected layer, and a softmax over actions; starting from the initial state and initial hidden state, the hidden state carries from Episode 1 into Episode 2 within Trial 1 (sampled Task 1) and is reset for Trial 2 (sampled Task 2).)
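A minimal sketch of an RL²-style recurrent policy in PyTorch; the layer sizes are assumptions, but the input convention (observation, previous action, previous reward, done flag) and the embedding → GRU → FC → softmax structure follow Duan et al. (2016):

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden=256):
        super().__init__()
        # input: obs + one-hot previous action + previous reward + done flag
        self.embed = nn.Linear(obs_dim + num_actions + 2, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.fc = nn.Linear(hidden, num_actions)

    def forward(self, obs, prev_action, prev_reward, done, h):
        x = torch.cat([obs, prev_action, prev_reward, done], dim=-1)
        h = self.gru(torch.relu(self.embed(x)), h)  # hidden state persists across
        logits = self.fc(h)                         # episodes within one trial
        return torch.distributions.Categorical(logits=logits), h

# Per trial: sample a task, reset h to zeros, then run two episodes
# without resetting h in between; reset h only when a new trial begins.
```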
Direct Transfer and Finetuning for One Test Task
• Direct transfer (marker and dotted line): apply the policy to the test task without finetuning.
• Finetuning: update the policy's parameters on the specific test task.
Only one test task is sampled (random seed 20).
Baselines:
1) PPO from scratch;
2) pre-training with random network distillation (RND) for unsupervised exploration;
3) supervised meta-learning, as an oracle.
(Plot: performance vs. number of episodes, 200 to 1000.)
CARML as Meta-Pretraining
Experiment description: train with different initialization weights on the train task distribution, then plot learning curves on the test task distribution.
Purpose: check the meta-pretraining ability.
Variant: encoder init, which initializes only the pre-trained encoder, to separate the effect of the meta-policy from that of the visual representation (the ResNet).
The acquired meta-learning strategies may be useful for learning related task distributions, effectively acting as a pretrained model for meta-RL.
Summary
They proposed a framework for inducing unsupervised, adaptive task distributions for meta-RL that scales to environments with high-dimensional pixel observations.
They showed that CARML enables (1) unsupervised acquisition of meta-learning strategies (skill maps) that transfer to test task distributions, in terms of (2) direct evaluation and more sample-efficient fine-tuning, and (3) more sample-efficient supervised meta-learning.
Appendix 1: EM Algorithm
• Expectation step: fix θ (p_θ), update φ (q_φ); setting q_φ(z) = p(z | x, θ_old) makes KL(q ‖ p) = 0.
• Maximization step: fix φ (q_φ), update θ (p_θ).
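The textbook identity behind these two steps (not specific to these slides):

```latex
\log p(x \mid \theta)
= \underbrace{\mathbb{E}_{q_\phi(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q_\phi(z)}\right]}_{\mathcal{L}(q_\phi,\, \theta)}
+ \mathrm{KL}\!\left(q_\phi(z) \,\Vert\, p(z \mid x, \theta)\right)
```

The E-step sets q_φ(z) = p(z | x, θ_old), driving the KL term to zero; the M-step maximizes ℒ(q_φ, θ) over θ.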
(Figures: illustrations of the E-step, which fixes θ and sets q_φ(z) = p(z | x, θ_old) so that KL(q ‖ p) = 0, and of the M-step, which fixes φ and updates θ.)