1
A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning (variBAD)
백승언, 김현성, 주정헌, 이도현
26 Mar, 2023
2
 Introduction
 Meta Reinforcement Learning
 A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning (variBAD)
 Background
 variBAD
 Experiment results
Contents
3
Introduction
4
 Problem setting of reinforcement learning and meta-learning
 Reinforcement learning
• Given a certain MDP, learn a policy 𝜋 that maximizes the expected discounted return 𝔼_{𝜋,𝑝_0}[Σ_{𝑡=0}^{∞} 𝛾^𝑡 𝑟_𝑡(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})]
 Meta-learning
• Given data from 𝒯_1, …, 𝒯_𝑁, quickly solve a new task 𝒯_𝑡𝑒𝑠𝑡
 Problem setting of Meta-Reinforcement Learning (Meta-RL)
 Setting 1: meta-learning with diverse goals (goal as a task)
• 𝒯_𝑖 ≜ {𝒮, 𝒜, 𝑝(𝑠_0), 𝑝(𝑠′|𝑠, 𝑎), 𝑟(𝑠, 𝑎, 𝑔), 𝑔_𝑖}
 Setting 2: meta-learning with RL tasks (MDP as a task)
• 𝒯_𝑖 ≜ {𝒮_𝑖, 𝒜_𝑖, 𝑝_𝑖(𝑠_0), 𝑝_𝑖(𝑠′|𝑠, 𝑎), 𝑟_𝑖(𝑠, 𝑎)} (a toy sketch of both settings follows this slide)
Meta Reinforcement Learning
Meta-RL problem statement in CS-330 (Finn)
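To make the two task settings above concrete, here is a minimal Python sketch (a toy illustration; the class names, grid size, and reward shapes are assumptions of this write-up, not from the slides or CS-330). In Setting 1 all tasks share one MDP and differ only in the goal 𝑔_𝑖, while in Setting 2 each task carries its own reward and transition functions.

from dataclasses import dataclass
from typing import Callable, Tuple
import random

@dataclass
class GoalTask:
    """Setting 1: a shared MDP; the task is just a goal g_i."""
    goal: Tuple[int, int]

@dataclass
class MDPTask:
    """Setting 2: the task is a full MDP with its own reward/transition."""
    reward_fn: Callable[[float, float], float]      # r_i(s, a)
    transition_fn: Callable[[float, float], float]  # s' ~ p_i(s' | s, a)

def sample_goal_task() -> GoalTask:
    # e.g. a goal position somewhere on a 5x5 grid
    return GoalTask(goal=(random.randint(0, 4), random.randint(0, 4)))

def sample_mdp_task() -> MDPTask:
    # e.g. a target velocity that defines this MDP's reward
    target_vel = random.uniform(0.0, 3.0)
    return MDPTask(
        reward_fn=lambda s, a, v=target_vel: -abs(s - v),
        transition_fn=lambda s, a: s + a,
    )

train_tasks = [sample_goal_task() for _ in range(10)]   # T_1, ..., T_N
test_task = sample_goal_task()                          # T_test, to be solved quickly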
5
A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning (variBAD)
https://arxiv.org/pdf/1910.08348.pdf
6
 Exploration-exploitation dilemma
 The agent learns the optimal policy through exploration and exploitation
• The agent needs to gather enough information to make the best overall decisions (exploration)
• The agent needs to make the best decision given the current information (exploitation)
Background
Figure: exploration vs. exploitation trade-off
 Bayesian Reinforcement Learning (Bayesian RL)
 Bayesian methods for RL can be used to quantify uncertainty for action selection and provide a way to incorporate prior knowledge into the algorithms
 A Bayes-optimal policy is one that optimally trades off exploration and exploitation, and thus maximizes the expected return during learning
• The Bayes-optimal policy is the solution of the reinforcement learning problem formalized as a Bayes-Adaptive MDP (BAMDP)
7
 Bayes-Adaptive MDP (BAMDP)
 A special case of POMDPs in which the transition and reward functions constitute the hidden state, and the agent must maintain a belief over them
 In the Bayesian formulation of RL, the transition and reward functions are assumed to be distributed according to a prior 𝑏_0 = 𝑝(𝑅, 𝑇)
• The agent does not have access to the true reward and transition functions
 BAMDP 𝑀^+ = (𝒮^+, 𝒜, ℛ^+, 𝑇^+, 𝑇_0^+, 𝛾, 𝐻^+)
• In this framework, the state is augmented with the belief to incorporate task uncertainty into decision-making
• Hyper-states: 𝑠_𝑡^+ ∈ 𝒮^+ = 𝒮 × ℬ, where ℬ is the belief space
• Transition on hyper-states: 𝑇^+(𝑠_{𝑡+1}^+ | 𝑠_𝑡^+, 𝑎_𝑡, 𝑟_𝑡) = 𝑇^+(𝑠_{𝑡+1}, 𝑏_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑟_𝑡, 𝑏_𝑡) = 𝑇^+(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑏_𝑡) 𝑇^+(𝑏_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑟_𝑡, 𝑏_𝑡, 𝑠_{𝑡+1}) = 𝔼_{𝑏_𝑡}[𝑇(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡)] 𝛿(𝑏_{𝑡+1} = 𝑝(𝑅, 𝑇 | 𝜏_{:𝑡+1}))
• Reward function on hyper-states: 𝑅^+(𝑠_𝑡^+, 𝑎_𝑡, 𝑠_{𝑡+1}^+) = 𝑅^+(𝑠_𝑡, 𝑏_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}, 𝑏_{𝑡+1}) = 𝔼_{𝑏_{𝑡+1}}[𝑅(𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1})]
• Horizon: 𝐻^+ = 𝑁 × 𝐻
• Objective of the agent in the BAMDP: 𝒥^+(𝜋) = 𝔼_{𝑏_0, 𝑇_0^+, 𝑇^+, 𝜋}[Σ_{𝑡=0}^{𝐻^+−1} 𝛾^𝑡 𝑅^+(𝑠_𝑡^+, 𝑎_𝑡, 𝑠_{𝑡+1}^+)]
– It is maximized by the Bayes-optimal policy, which automatically trades off exploration and exploitation (a toy belief-update sketch follows this slide)
Background
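The belief update that defines the hyper-state transition can be illustrated with a toy example (my own sketch, assuming a small discrete set of candidate tasks with deterministic rewards; the real BAMDP belief is over continuous reward and transition functions, which is exactly what makes exact belief updates and planning intractable).

import numpy as np

N_CELLS = 5                        # 1-D corridor; candidate goal = one of the cells
GOALS = list(range(N_CELLS))       # candidate task k has its goal at cell k

def reward(s: int, goal: int) -> float:
    return 1.0 if s == goal else -0.1

def belief_update(b: np.ndarray, s_next: int, r_obs: float) -> np.ndarray:
    # b_{t+1}[k] ∝ p(r_obs | s_next, task k) * b_t[k]  (Bayes' rule, deterministic rewards)
    likelihood = np.array([1.0 if np.isclose(reward(s_next, g), r_obs) else 0.0 for g in GOALS])
    posterior = likelihood * b
    return posterior / posterior.sum()

b = np.ones(N_CELLS) / N_CELLS     # uniform prior b_0 over the candidate tasks
s, true_goal = 0, 3
for a in (+1, +1, +1):             # walk right; each step is one hyper-state transition
    s = max(0, min(N_CELLS - 1, s + a))
    r = reward(s, true_goal)
    b = belief_update(b, s, r)
    print(f"hyper-state s+ = ({s}, {np.round(b, 2)}), reward = {r}")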
8
 Overview of variBAD
 The authors proposed variBAD to resolve several challenges of the problem formulated as a BAMDP, as follows
• 1) The agent typically does not know the parameterization of the true reward and/or transition model
• 2) The belief update (computing the posterior 𝑝(𝑅, 𝑇 | 𝜏_{:𝑡})) is often intractable
• 3) Even with the correct posterior, planning in belief space is typically intractable
 1) They simultaneously meta-learn the reward and transition functions together with a posterior over a task embedding
• 𝑇′ = 𝑝_𝜃^𝑇(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡; 𝑚), 𝑅′ = 𝑝_𝜃^𝑅(𝑟_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}; 𝑚), where 𝑚 ~ 𝑞_𝜙(𝑚 | 𝜏_{:𝑡}) is a task embedding variable and 𝑞_𝜙 is the encoder
 2) The posterior over the task embedding, 𝑞_𝜙(𝑚 | 𝜏_{:𝑡}), is trained jointly with the decoder 𝑝(𝜏_{:𝑡} | 𝑚) through the following training objectives
• Lower bound of the model objective: ELBO_𝑡(𝜙, 𝜃) = 𝔼_𝜌[𝔼_{𝑞_𝜙(𝑚|𝜏_{:𝑡})}[log 𝑝_𝜃(𝜏_{:𝐻} | 𝑚)] − KL(𝑞_𝜙(𝑚 | 𝜏_{:𝑡}) || 𝑝_𝜃(𝑚))]
• Policy gradient objective: 𝔼_{𝑝(𝑀)}[𝒥(𝜓, 𝜙)]
 3) Since their model is learned end-to-end within the inference framework, no planning is necessary at test time (a training-loop skeleton follows this slide)
variBAD (I)
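The three fixes above can be summarized as a training-loop skeleton (a structural sketch only, with dummy stand-ins for the encoder, decoder, and policy; it shows which loss trains which component and is not the authors' implementation).

import numpy as np

rng = np.random.default_rng(0)

def encode(history):                       # q_phi(m | tau_{:t}) -> (mu, log_var), dummy encoder
    return np.full(2, 0.1 * len(history)), np.zeros(2)

def decode_reward(s, a, m):                # 1) meta-learned reward model p_theta^R(r | s, a; m), dummy here
    return float(m.sum())

def policy(s, mu, log_var):                # 3) policy conditions directly on the posterior: no planning
    return int(rng.integers(0, 2))

def elbo_t(history, full_traj):            # E_q[log p_theta(tau_{:H} | m)] - KL(q || prior)
    mu, log_var = encode(history)
    m = mu + np.exp(0.5 * log_var) * rng.standard_normal(2)       # reparameterized sample of m
    recon = -sum((r - decode_reward(s, a, m)) ** 2 for s, a, r in full_traj)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)  # KL to a N(0, I) prior
    return recon - kl

# One meta-training iteration on a sampled task:
traj, s = [], 0
for t in range(10):
    a = policy(s, *encode(traj))           # act with the current posterior as an extra input
    s_next, r = s + a, float(a)            # toy environment step
    traj.append((s, a, r))
    s = s_next

vae_loss = -sum(elbo_t(traj[:t], traj) for t in range(len(traj)))  # 2) trains encoder and decoders
rl_return = sum(r for _, _, r in traj)                             # policy trained by RL on this return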
9
 Bayes-adaptive deep RL via meta-learning
 The reward and transition functions that are unique to each MDP are learned with a task description variable 𝑚_𝑖, given an MDP 𝑀_𝑖 ~ 𝑝(𝑀)
• 𝑅_𝑖(𝑟_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}) ≈ 𝑅(𝑟_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}; 𝑚_𝑖)
• 𝑇_𝑖(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡) ≈ 𝑇(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡; 𝑚_𝑖)
 Since the agent does not have access to the true task description or ID, they infer 𝑚_𝑖 from the agent's experience up to time step 𝑡 collected in 𝑀_𝑖
variBAD (II)
Proposed variBAD architecture
• They infer the posterior distribution 𝑝(𝑚_𝑖 | 𝜏_{:𝑡}^𝑖) over 𝑚_𝑖 given 𝜏_{:𝑡}^𝑖 = (𝑠_0, 𝑎_0, 𝑟_1, 𝑠_1, 𝑎_1, 𝑟_2, …, 𝑠_{𝑡−1}, 𝑎_{𝑡−1}, 𝑟_𝑡, 𝑠_𝑡)
– (The subscript 𝑖 is dropped for ease of notation)
 Given this reformulation, it is now sufficient to reason about the embedding 𝑚 instead of the transition and reward dynamics (an encoder sketch follows this slide)
• The authors note that this is significantly useful, since the reward and transition functions can consist of millions of parameters, whereas the embedding 𝑚 can be small
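A minimal sketch of how the trajectory encoder 𝑞_𝜙(𝑚 | 𝜏_{:𝑡}) could look (the paper uses a recurrent encoder with a Gaussian posterior; the sizes and layer choices below are assumptions for illustration only).

import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=5, hidden_dim=64):
        super().__init__()
        # each step of tau_{:t} is the concatenation (s_t, a_t, r_{t+1})
        self.gru = nn.GRU(state_dim + action_dim + 1, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, states, actions, rewards):
        # states: (B, t, state_dim), actions: (B, t, action_dim), rewards: (B, t, 1)
        x = torch.cat([states, actions, rewards], dim=-1)
        _, h = self.gru(x)                         # final hidden state summarizes tau_{:t}
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)    # parameters of q_phi(m | tau_{:t})

# usage: posterior over m after t = 7 steps, for a batch of 4 trajectories
enc = TrajectoryEncoder(state_dim=3, action_dim=2)
mu, logvar = enc(torch.randn(4, 7, 3), torch.randn(4, 7, 2), torch.randn(4, 7, 1))
m = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample of m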
10
 Model learning objective
 They learn a model of the environment 𝑝_𝜃 so that, even without knowing the true model, fast inference is possible at runtime using the following objective
• 𝔼_{𝜌(𝑀, 𝜏_{:𝐻^+})}[log 𝑝_𝜃(𝜏_{:𝐻^+} | 𝑎_{:𝐻^+−1})]
 Instead of optimizing the above objective, which is intractable, they optimize a tractable lower bound (an ELBO sketch follows this slide)
• 𝔼_𝜌[log 𝑝_𝜃(𝜏_{:𝐻})] = 𝔼_𝜌[log 𝔼_{𝑞_𝜙(𝑚|𝜏_{:𝑡})}[𝑝_𝜃(𝜏_{:𝐻}, 𝑚) / 𝑞_𝜙(𝑚 | 𝜏_{:𝑡})]] ≥ 𝔼_{𝜌, 𝑞_𝜙(𝑚|𝜏_{:𝑡})}[log(𝑝_𝜃(𝜏_{:𝐻}, 𝑚) / 𝑞_𝜙(𝑚 | 𝜏_{:𝑡}))] = 𝔼_𝜌[𝔼_{𝑞_𝜙(𝑚|𝜏_{:𝑡})}[log 𝑝_𝜃(𝜏_{:𝐻} | 𝑚)] − KL(𝑞_𝜙(𝑚 | 𝜏_{:𝑡}) || 𝑝_𝜃(𝑚))] = ELBO_𝑡
 Overall objective
 The encoder 𝑞_𝜙(𝑚 | 𝜏_{:𝑡}), parameterized by 𝜙
 An approximate transition function 𝑇′ = 𝑝_𝜃^𝑇(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡; 𝑚) and an approximate reward function 𝑅′ = 𝑝_𝜃^𝑅(𝑟_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}; 𝑚), which are jointly parameterized by 𝜃
 A policy 𝜋_𝜓(𝑎_𝑡 | 𝑠_𝑡, 𝑞_𝜙(𝑚 | 𝜏_{:𝑡})), parameterized by 𝜓 and dependent on 𝜙
• ℒ(𝜙, 𝜃, 𝜓) = 𝔼_{𝑝(𝑀)}[𝒥(𝜓, 𝜙) + 𝜆 Σ_{𝑡=0}^{𝐻^+} ELBO_𝑡(𝜙, 𝜃)]
variBAD (III)
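A compact sketch of ELBO_𝑡 and the combined loss (my reading of the formulas above, not the authors' code; the standard-normal prior 𝑝_𝜃(𝑚) = N(0, I) and an MSE reconstruction term for a Gaussian reward decoder are simplifying assumptions).

import torch

def elbo_t(mu, logvar, pred_rewards, true_rewards):
    # ELBO_t = E_q[log p_theta(tau_{:H} | m)] - KL(q_phi(m | tau_{:t}) || p(m))
    recon = -((pred_rewards - true_rewards) ** 2).sum(dim=-1)        # reconstruct the full trajectory
    kl = 0.5 * (logvar.exp() + mu ** 2 - 1.0 - logvar).sum(dim=-1)   # KL of N(mu, sigma^2) to N(0, I)
    return recon - kl

def varibad_loss(policy_objective, elbos, lam=1.0):
    # L(phi, theta, psi) = J(psi, phi) + lambda * sum_t ELBO_t(phi, theta),
    # negated so that minimizing the returned value maximizes the objective
    return -(policy_objective + lam * torch.stack(elbos).sum())

# toy usage: batch of 4 trajectories, horizon 10, latent dim 5
mu, logvar = torch.zeros(4, 5), torch.zeros(4, 5)
elbos = [elbo_t(mu, logvar, torch.randn(4, 10), torch.randn(4, 10)).mean() for _ in range(10)]
loss = varibad_loss(policy_objective=torch.tensor(1.0), elbos=elbos)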
11
Experiment Results
12
 Environment
 Didactic grid world (a minimal environment sketch follows this slide)
• The goal is unobserved by the agent, inducing task uncertainty and necessitating exploration
• The goal can be anywhere except around the starting cell at the bottom left
 MDP parameters
• Actions: up, right, down, left, stay
• Reward: -0.1 on non-goal cells, +1 on the goal cell
• Horizon: 15
Experiment environment
 MuJoCo for meta-learning
• Well-known environments for continuous control
• Ant-Dir: move forward or backward (2 tasks)
• Half-Cheetah-Dir: move forward or backward (2 tasks)
• Half-Cheetah-Vel: achieve a target velocity while running forward (100 train tasks, 30 test tasks)
• Walker: the agent is initialized with randomized system dynamics parameters and must move forward (40 train tasks, 10 test tasks)
Different exploration strategies on the didactic grid world
Continuous control tasks: the ant, half cheetah, and walker robots
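A minimal environment sketch matching the grid-world MDP parameters listed above (the grid size and the exact "not around the start" rule are assumptions; the paper's layout may differ).

import random

class GridWorld:
    ACTIONS = {0: (0, 1), 1: (1, 0), 2: (0, -1), 3: (-1, 0), 4: (0, 0)}  # up, right, down, left, stay
    SIZE, HORIZON = 5, 15

    def __init__(self):
        near_start = {(x, y) for x in range(2) for y in range(2)}   # cells around the bottom-left start
        candidates = [(x, y) for x in range(self.SIZE) for y in range(self.SIZE)
                      if (x, y) not in near_start]
        self.goal = random.choice(candidates)                       # unobserved by the agent
        self.pos, self.t = (0, 0), 0

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        x = min(max(self.pos[0] + dx, 0), self.SIZE - 1)
        y = min(max(self.pos[1] + dy, 0), self.SIZE - 1)
        self.pos, self.t = (x, y), self.t + 1
        reward = 1.0 if self.pos == self.goal else -0.1              # -0.1 off-goal, +1 on the goal
        done = self.t >= self.HORIZON
        return self.pos, reward, done

env = GridWorld()
done = False
while not done:
    _, r, done = env.step(random.randrange(5))                       # random-exploration rollout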
13
 Comparison with previous methods on the didactic grid world
 Experiments show the superior performance of variBAD compared to previous methods in the didactic grid-world environment within a few episodes
• Compared methods: variBAD, RL2, Optimal, Bayes-Optimal, Posterior Sampling
 The behavior of variBAD closely matched that of the Bayes-optimal policy
• variBAD predicts no reward for cells it has already visited and explores the remaining cells until it finds the goal
Experiment results (I)
Results for the grid world: (a) average return, (b) learning curves per frame (20 seeds)
Behavior of variBAD in the grid world
14
 Comparison with previous methods on MuJoCo
 Experiments show the superior performance of variBAD compared to previous methods in the MuJoCo environments (Ant-Dir, Cheetah-Dir, Cheetah-Vel, Walker) within a few episodes
• Compared methods: variBAD, PEARL, Oracle (PPO), E-MAML, RL2, ProMP
 For all MuJoCo environments, they trained variBAD with a reward decoder only (even for Walker, where the dynamics change, they found that this has superior performance)
Experiment results (II)
Average test performance for the first 5 rollouts on MuJoCo (5 seeds)
Learning curves for MuJoCo at the (a) first and (b) N-th rollout
15
Thank you!
16
Q&A
Editor's Notes
1. Now let me begin the main presentation of the algorithm in the paper I selected.