A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning (variBAD)
*백승언, 김현성
26 Mar, 2023
Contents
⚫ Introduction
▪ Meta Reinforcement Learning
⚫ A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning (variBAD)
▪ Backgrounds
▪ variBAD
⚫ Experiment results
Introduction
Meta Reinforcement Learning
⚫ Problem settings of reinforcement learning and meta-learning
▪ Reinforcement learning
• Given a certain MDP, learn a policy $\pi$ that maximizes the expected discounted return $\mathbb{E}_{\pi,p_0}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\, r_t(s_t, a_t, s_{t+1})\right]$
▪ Meta-learning
• Given data from tasks $\mathcal{T}_1, \ldots, \mathcal{T}_n$, quickly solve a new task $\mathcal{T}_{\text{test}}$
⚫ Problem setting of Meta-Reinforcement Learning (Meta-RL)
▪ Setting 1: meta-learning with diverse goals (goal as a task); see the sketch after this slide
• $\mathcal{T}_i \triangleq \{\mathcal{S}, \mathcal{A}, p(s_0), p(s' \mid s, a), r(s, a, g), g_i\}$
▪ Setting 2: meta-learning with RL tasks (MDP as a task)
• $\mathcal{T}_i \triangleq \{\mathcal{S}_i, \mathcal{A}_i, p_i(s_0), p_i(s' \mid s, a), r_i(s, a)\}$
(Figure: Meta-RL problem statement, from CS-330 (Finn))
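As a concrete illustration of Setting 1 (goal as a task), the minimal sketch below samples tasks that share the same MDP and differ only in a hidden goal, with a goal-conditioned reward. The grid size, names, and task-distribution details are assumptions for illustration, not taken from the paper; the -0.1/+1 reward values mirror the grid-world experiment described later.

```python
import random
from dataclasses import dataclass

@dataclass
class GoalTask:
    """Setting 1: tasks share S, A, p(s0), p(s'|s,a) and differ only in the goal g_i."""
    goal: tuple  # hidden goal position g_i

def sample_task(grid_size: int = 5) -> GoalTask:
    # Sample a goal uniformly over grid cells (an assumed task distribution p(T)).
    return GoalTask(goal=(random.randrange(grid_size), random.randrange(grid_size)))

def reward(state, action, task: GoalTask) -> float:
    # Goal-conditioned reward r(s, a, g): +1 at the goal, -0.1 elsewhere.
    return 1.0 if state == task.goal else -0.1
```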
A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning (variBAD)
https://arxiv.org/pdf/1910.08348.pdf
Backgrounds
⚫ Exploration-exploitation dilemma
▪ The agent learns the optimal policy through exploration and exploitation
• The agent needs to gather enough information to make the best overall decisions (exploration)
• The agent needs to make the best decision given the current information (exploitation)
⚫ Bayesian Reinforcement Learning (Bayesian RL)
▪ Bayesian methods for RL can be used to quantify uncertainty for action selection and provide a way to incorporate prior knowledge into the algorithms
▪ A Bayes-optimal policy is one that optimally trades off exploration and exploitation, and thus maximizes the expected return during learning (a toy belief-based bandit illustration follows after this slide)
• The Bayes-optimal policy is the solution of the reinforcement learning problem formalized as a Bayes-Adaptive MDP (BAMDP)
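As a toy illustration of acting under a belief, the sketch below implements a Thompson-sampling bandit: the agent keeps a Beta posterior over each arm's success probability and acts greedily with respect to a sample from that posterior. This is an assumed example, not from the slides, and it is a heuristic trade-off rather than the Bayes-optimal policy, which would additionally account for the value of information over the full horizon.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # unknown to the agent
alpha = np.ones(3)                       # Beta posterior parameters per arm (successes + 1)
beta = np.ones(3)                        # Beta posterior parameters per arm (failures + 1)

for t in range(1000):
    sampled = rng.beta(alpha, beta)      # sample one plausible model from the belief
    a = int(np.argmax(sampled))          # act greedily w.r.t. the sampled model (exploration via sampling)
    r = rng.random() < true_means[a]     # Bernoulli reward
    alpha[a] += r                        # Bayesian belief update
    beta[a] += 1 - r
```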
Backgrounds
⚫ Bayes-Adaptive MDP (BAMDP)
▪ A special case of POMDPs where the transition and reward functions constitute the hidden state, and the agent must maintain a belief over them
▪ In the Bayesian formulation of RL, the transition and reward functions are assumed to be distributed according to a prior $b_0 = p(R, T)$
• The agent does not have access to the true reward and transition functions
▪ BAMDP $M^+ = (\mathcal{S}^+, \mathcal{A}, \mathcal{R}^+, T^+, T_0^+, \gamma, H^+)$
• In this framework, the state is augmented with the belief to incorporate task uncertainty into decision-making (a toy belief-update sketch follows after this slide)
• Hyper-states: $s_t^+ \in \mathcal{S}^+ = \mathcal{S} \times \mathcal{B}$, where $\mathcal{B}$ is the belief space
• Transitions on hyper-states: $T^+(s_{t+1}^+ \mid s_t^+, a_t, r_t) = T^+(s_{t+1}, b_{t+1} \mid s_t, a_t, r_t, b_t) = T^+(s_{t+1} \mid s_t, a_t, b_t)\, T^+(b_{t+1} \mid s_t, a_t, r_t, b_t, s_{t+1}) = \mathbb{E}_{b_t}\!\left[T(s_{t+1} \mid s_t, a_t)\right] \delta\!\left(b_{t+1} = p(R, T \mid \tau_{:t+1})\right)$
• Reward function on hyper-states: $R^+(s_t^+, a_t, s_{t+1}^+) = R^+(s_t, b_t, a_t, s_{t+1}, b_{t+1}) = \mathbb{E}_{b_{t+1}}\!\left[R(s_t, a_t, s_{t+1})\right]$
• Horizon: $H^+ = N \times H$
• Objective of the agent in the BAMDP: $\mathcal{J}^+(\pi) = \mathbb{E}_{b_0, T_0^+, T^+, \pi}\!\left[\sum_{t=0}^{H^+-1} \gamma^t\, R^+(r_{t+1} \mid s_t^+, a_t, s_{t+1}^+)\right]$
– It is maximized by the Bayes-optimal policy, which automatically trades off exploration and exploitation
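A minimal sketch of the belief update $b_{t+1} = p(R, T \mid \tau_{:t+1})$ for the simplified case of a finite set of candidate reward functions with known transitions and Gaussian observation noise; the function names, the hypothesis set, and the noise model are assumptions for illustration. The hyper-state at step $t$ would then be the pair $(s_t, b_t)$.

```python
import numpy as np

def update_belief(belief, candidate_reward_fns, s, a, s_next, r_obs, noise_std=0.1):
    """Exact Bayesian update over a finite hypothesis set of reward functions.

    belief: array of probabilities over the candidate reward functions (b_t).
    Returns b_{t+1} with b_{t+1}(R_k) proportional to p(r_obs | s, a, s', R_k) * b_t(R_k).
    """
    likelihoods = np.array([
        np.exp(-0.5 * ((r_obs - R_k(s, a, s_next)) / noise_std) ** 2)
        for R_k in candidate_reward_fns
    ])
    posterior = belief * likelihoods
    return posterior / posterior.sum()
```

For continuous or high-dimensional reward/transition hypotheses this exact update becomes intractable, which is precisely the difficulty variBAD addresses with learned approximate inference.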
variBAD (I)
⚫ Overview of variBAD
▪ The authors proposed variBAD to resolve the following challenges of the problem formulated as a BAMDP:
• 1) The agent typically does not know the parameterization of the true reward and/or transition model
• 2) The belief update (computing the posterior $p(R, T \mid \tau_{:t})$) is often intractable
• 3) Even with the correct posterior, planning in belief space is typically intractable
▪ 1) They simultaneously meta-learned the reward and transition functions together with a posterior over a task embedding (see the training-loop sketch after this slide)
• $T' = p_\theta^T(s_{t+1} \mid s_t, a_t; m)$, $R' = p_\theta^R(r_{t+1} \mid s_t, a_t, s_{t+1}; m)$, where $m \sim q_\phi(m \mid \tau_{:t})$ is a task embedding variable and $q_\phi$ is the encoder
▪ 2) The posterior over the task embedding, $q_\phi(m \mid \tau_{:t})$, is trained jointly with the decoder $p_\theta(\tau_{:t} \mid m)$ through the training objectives
• Lower bound of the model objective: $\text{ELBO}_t(\phi, \theta) = \mathbb{E}_\rho\!\left[\mathbb{E}_{q_\phi(m \mid \tau_{:t})}\!\left[\log p_\theta(\tau_{:H} \mid m)\right] - KL\!\left(q_\phi(m \mid \tau_{:t})\,\|\,p_\theta(m)\right)\right]$
• Policy gradient objective: $\mathbb{E}_{p(M)}\!\left[\mathcal{J}(\psi, \phi)\right]$
▪ 3) Since the model is learned end-to-end within the inference framework, no planning is necessary at test time
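A structural sketch (assumed pseudocode, not the authors' implementation) of how these three points fit together in one training iteration: the policy conditions on the current posterior from the encoder, the encoder and decoders are trained with the ELBO, and the policy is trained with a standard policy-gradient loss, so no planning is needed. All objects and method names here (`sample_env`, `encoder.prior`, `elbo_fn`, `policy_loss_fn`, the PyTorch-style optimizers) are hypothetical interfaces.

```python
# Structural sketch of one variBAD-style training iteration (assumed interfaces throughout).
def train_iteration(sample_env, encoder, policy, vae_optimizer, policy_optimizer,
                    elbo_fn, policy_loss_fn):
    env = sample_env()                           # M_i ~ p(M); the task ID stays hidden from the agent
    s, done = env.reset(), False
    trajectory, posterior = [], encoder.prior()  # q_phi(m) before any data is seen
    while not done:
        a = policy.act(s, posterior)             # pi_psi(a | s, q_phi(m | tau_:t))
        s_next, r, done = env.step(a)
        trajectory.append((s, a, r, s_next))
        posterior = encoder(trajectory)          # belief update via the learned inference network
        s = s_next

    # 2) Train the encoder and decoders on the ELBO (reconstruct the trajectory from m).
    vae_loss = -elbo_fn(trajectory)
    vae_optimizer.zero_grad(); vae_loss.backward(); vae_optimizer.step()

    # 1) + 3) Train the policy on belief-augmented inputs; no explicit planning is required.
    rl_loss = policy_loss_fn(trajectory, posterior)
    policy_optimizer.zero_grad(); rl_loss.backward(); policy_optimizer.step()
```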
variBAD (II)
⚫ Bayes-adaptive deep RL via meta-learning
▪ The reward and transition functions that are unique to each MDP $M_i \sim p(M)$ are learned with a task description variable $m_i$
• $R_i(r_{t+1} \mid s_t, a_t, s_{t+1}) \approx R(r_{t+1} \mid s_t, a_t, s_{t+1}; m_i)$
• $T_i(s_{t+1} \mid s_t, a_t) \approx T(s_{t+1} \mid s_t, a_t; m_i)$
▪ Since the agent does not have access to the true task description or ID, they infer $m_i$ from the agent's experience up to time step $t$ collected in $M_i$
• They infer the posterior distribution $p(m_i \mid \tau_{:t}^i)$ over $m_i$ given $\tau_{:t}^i = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{t-1}, a_{t-1}, r_t, s_t)$ (from now on, the subscript $i$ is dropped for ease of notation); a sketch of such an encoder follows after this slide
▪ Given this reformulation, it is sufficient to reason about the embedding $m$ instead of the transition and reward dynamics
• The authors argue that this is especially useful because the reward and transition functions can consist of millions of parameters, whereas the embedding $m$ can be small
(Figure: proposed variBAD architecture)
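A minimal sketch of a recurrent encoder producing a Gaussian posterior $q_\phi(m \mid \tau_{:t})$ from the trajectory so far, assuming flat (dense) state, action, and reward inputs; the layer sizes, latent dimension, and module name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Recurrent encoder for q_phi(m | tau_:t): one (s, a, r) tuple per step."""
    def __init__(self, state_dim, action_dim, latent_dim=5, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim + 1, hidden_dim, batch_first=True)
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, states, actions, rewards):
        # states: (B, t, state_dim), actions: (B, t, action_dim), rewards: (B, t, 1)
        x = torch.cat([states, actions, rewards], dim=-1)
        _, h = self.gru(x)                   # h: (1, B, hidden_dim), a summary of tau_:t
        h = h.squeeze(0)
        return self.mu_head(h), self.logvar_head(h)   # parameters of q_phi(m | tau_:t)
```

Because the hidden state summarizes the trajectory incrementally, the same network can output a posterior after every time step, which is what the policy conditions on.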
variBAD (III)
⚫ Model learning objective
▪ They learn a model of the environment $p_\theta$ so that, even without knowing the true model, fast inference is possible at runtime using the following objective
• $\mathbb{E}_{\rho(M, \tau_{:H^+})}\!\left[\log p_\theta(\tau_{:H^+} \mid a_{:H^+-1})\right]$
▪ Instead of optimizing the above objective, which is intractable, they optimize a tractable lower bound (a sketch of this term follows after this slide)
• $\mathbb{E}_{\rho(M, \tau_{:H^+})}\!\left[\log p_\theta(\tau_{:H})\right] = \mathbb{E}_\rho\!\left[\log \mathbb{E}_{q_\phi(m \mid \tau_{:t})}\!\left[\frac{p_\theta(\tau_{:H}, m)}{q_\phi(m \mid \tau_{:t})}\right]\right] \geq \mathbb{E}_{\rho,\, q_\phi(m \mid \tau_{:t})}\!\left[\log \frac{p_\theta(\tau_{:H}, m)}{q_\phi(m \mid \tau_{:t})}\right] = \mathbb{E}_\rho\!\left[\mathbb{E}_{q_\phi(m \mid \tau_{:t})}\!\left[\log p_\theta(\tau_{:H} \mid m)\right] - KL\!\left(q_\phi(m \mid \tau_{:t})\,\|\,p_\theta(m)\right)\right] = \text{ELBO}_t$
⚫ Overall objective
▪ The encoder $q_\phi(m \mid \tau_{:t})$, parameterized by $\phi$
▪ An approximate transition function $T' = p_\theta^T(s_{t+1} \mid s_t, a_t; m)$ and an approximate reward function $R' = p_\theta^R(r_{t+1} \mid s_t, a_t, s_{t+1}; m)$, jointly parameterized by $\theta$
▪ A policy $\pi_\psi(a_t \mid s_t, q_\phi(m \mid \tau_{:t}))$, parameterized by $\psi$ and dependent on $\phi$
• $\mathcal{L}(\phi, \theta, \psi) = \mathbb{E}_{p(M)}\!\left[\mathcal{J}(\psi, \phi)\right] + \lambda \sum_{t=0}^{H^+} \text{ELBO}_t(\phi, \theta)$
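A sketch of one $\text{ELBO}_t$ term under Gaussian assumptions, using a reward decoder only (as in the MuJoCo experiments described later): sample $m \sim q_\phi(m \mid \tau_{:t})$ with the reparameterization trick, reconstruct the rewards of the full trajectory from $m$, and subtract the KL to a standard normal prior. The `reward_decoder` interface, the tensor shapes, and the use of negative MSE as a stand-in for the Gaussian log-likelihood are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def elbo_t(mu_t, logvar_t, reward_decoder, states, actions, next_states, rewards):
    """One ELBO_t term: E_{q_phi(m|tau_:t)}[ log p_theta(tau_:H | m) ] - KL(q || prior).

    mu_t, logvar_t: posterior parameters from the encoder after t steps, shape (B, latent_dim).
    states/actions/next_states/rewards: the full trajectory tau_:H, shape (B, H, dim).
    """
    q = Normal(mu_t, (0.5 * logvar_t).exp())
    m = q.rsample()                                          # reparameterised sample of m
    m_rep = m.unsqueeze(1).expand(-1, states.size(1), -1)    # broadcast m to every time step
    pred_r = reward_decoder(states, actions, next_states, m_rep)
    recon = -F.mse_loss(pred_r, rewards, reduction="mean")   # Gaussian log-likelihood up to constants
    prior = Normal(torch.zeros_like(mu_t), torch.ones_like(mu_t))
    kl = kl_divergence(q, prior).sum(-1).mean()
    return recon - kl
```

The overall loss then sums this term over all $t$ and adds the policy-gradient objective, weighted by $\lambda$ as in the overall objective above.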
Experiment Results
Experiment environment
⚫ Environments
▪ Didactic grid world (a minimal sketch of this environment follows after this slide)
• The goal is unobserved by the agent, inducing task uncertainty and necessitating exploration
• The goal can be anywhere except around the starting cell, which is at the bottom left
▪ MDP parameters
• Actions: up, right, down, left, stay
• Reward: -0.1 on non-goal cells, +1 on the goal cell
• Horizon: 15
▪ MuJoCo for meta-learning
• Well-known environments for continuous control
• Ant-Dir: move forward or backward (2 tasks)
• Half-Cheetah-Dir: move forward or backward (2 tasks)
• Half-Cheetah-Vel: achieve a target velocity while running forward (100 train tasks, 30 test tasks)
• Walker: the agent is initialized with some randomized system-dynamics parameters and must move forward (40 train tasks, 10 test tasks)
(Figures: different exploration strategies on the didactic grid world; continuous control tasks: the ant, half-cheetah, and walker robots)
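A minimal sketch of the didactic grid world under the MDP parameters listed above (5 actions, -0.1 off the goal, +1 on the goal, horizon 15). The 5x5 size, the exact "not around the start" rule (Manhattan distance greater than 1), and the coordinate convention are assumptions for illustration.

```python
import random

class GridWorld:
    """Didactic grid world: hidden goal, -0.1 per step off the goal, +1 on the goal, horizon 15."""
    ACTIONS = [(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)]   # up, right, down, left, stay

    def __init__(self, size=5, horizon=15):
        self.size, self.horizon = size, horizon

    def reset(self):
        self.t, self.pos = 0, (0, 0)                       # start at the bottom-left cell
        while True:                                        # goal anywhere except around the start
            self.goal = (random.randrange(self.size), random.randrange(self.size))
            if abs(self.goal[0]) + abs(self.goal[1]) > 1:
                break
        return self.pos

    def step(self, a):
        dx, dy = self.ACTIONS[a]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos, self.t = (x, y), self.t + 1
        r = 1.0 if self.pos == self.goal else -0.1
        done = self.t >= self.horizon
        return self.pos, r, done
```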
Experiment results (I)
⚫ Comparison with previous methods on the didactic grid world
▪ Experiments show the superior performance of variBAD compared to previous methods (RL2 and posterior sampling, alongside optimal and Bayes-optimal reference policies) in the didactic grid-world environment within a few episodes
▪ The behavior of variBAD closely matches that of the Bayes-optimal policy
• variBAD predicts no reward for cells it has already visited, and explores the remaining cells until it finds the goal
(Figures: results for the grid world, (a) average return and (b) learning curves per frame (20 seeds); behavior of variBAD in the grid world compared to the Bayes-optimal policy)
Experiment results (II)
⚫ Comparison with previous methods on MuJoCo
▪ Experiments show the superior performance of variBAD compared to previous methods (PEARL, EMAML, ProMP, RL2, and an Oracle trained with PPO) in the MuJoCo environments (Ant-Dir, Cheetah-Dir, Cheetah-Vel, Walker) within a few episodes
▪ For all MuJoCo environments, they trained variBAD with a reward decoder only (even for Walker, where the dynamics change, the authors found that this gives superior performance)
(Figures: average test performance for the first 5 rollouts on MuJoCo (5 seeds); learning curves on MuJoCo at (a) the first rollout and (b) the N-th rollout)
Thank you!
Q&A