RL Upside-Down:
Training Agents using Upside-Down RL
LEE, DOHYEON
leadh991114@gmail.com
4/16/23 Deep Learning Paper Reading Seminar - Reinforcement Learning
NeurIPS 2019 Workshop
Contents
1. Introduction
2. Methods
3. Experiments
4. Conclusion
| Some Doubts about RL
| “Deep RL Doesn’t Work Yet”
1. Sample Inefficiency
Atari Games
- Rainbow reaches 100% (median human-normalized performance) only after about 18 million frames
- roughly 83 hours of play experience
2. Nice Alternatives
Optimal Control Theory
- LQR, QP, Convex Optimization
- Model Predictive Control (MPC)
3. Hard Reward Design
The Alignment Problem
- Universe (OpenAI)
- the reward function must capture "exactly" what you want
4. Local Optima
Exploration vs. Exploitation
- HalfCheetah (UC Berkeley, BAIR)
- the exploration-exploitation dilemma is genuinely hard to solve
5. Generalization Issue
Overfitting
- multi-agent Laser Tag
- Lanctot et al., NeurIPS 2017
6. Stability & Reproducibility Problem
HalfCheetah
- Houthooft et al., NIPS 2016
- [figure: results split 75% / 25% across runs]
| Why don’t we leverage the advantages of SL?
| Advantages of SL algorithms
1. Simplicity
2. Robustness
3. Scalability
| “In general, there is no way to do this”
1. SL:
- Function: Search
- Assumption: I.I.D./Stationary Condition
- Feedback from Env: Error Signal
2. RL:
- Function: Search & Long-Term Memory
- Assumption: Non-I.I.D./Non-Stationary Condition
- Feedback from Env: Evaluation Signal
| A Trick to convert RL to SL!
| MDP → Supervised Learning; a Classification Problem!
Goal: to maximize returns in expectation
→ instead, learn to follow commands such as:
- "achieve total reward R in the next T time steps"
- "reach state S in fewer than T time steps"
(a minimal sketch of this command-conditioned setup follows)
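Below is a minimal, illustrative sketch of what this framing means concretely: the agent is just a classifier that maps a state plus a command, the pair (desired return, desired horizon), to a distribution over actions. This is not the paper's code; the class and parameter names are my own, and the paper's actual architectures differ per task.

```python
import torch
import torch.nn as nn

class BehaviorFunction(nn.Module):
    """Maps (state, command) -> action logits, where the command is the
    pair (desired return d_r, desired horizon d_h)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, command: torch.Tensor) -> torch.Tensor:
        # Plain concatenation of state and command, for illustration only.
        return self.net(torch.cat([state, command], dim=-1))
```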
QnA
1. Idea
How? Turn the usual S → A → R → S → A → R → … into S → R' → A → S → R' → A → …
The desired return R' becomes an input and the action A becomes the prediction target: RL turned upside-down (⅂ꓤ)!
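To make this reordering concrete, here is a hedged sketch of how one stored episode can be turned into supervised (state, command) → action examples. Assumptions not in the slide: episodic tasks, and segments that run to the episode's final step (a simplification of the more general choice of a later end index).

```python
import random

def make_training_examples(trajectory, num_samples=None, rng=random):
    """trajectory: list of (state, action, reward) tuples from one episode.
    Returns supervised pairs: input = (state at t1, total reward over
    [t1, end), horizon = steps remaining), target = action taken at t1."""
    T = len(trajectory)
    examples = []
    for _ in range(num_samples or T):
        t1 = rng.randrange(T)
        s, a, _ = trajectory[t1]
        d_r = sum(r for (_, _, r) in trajectory[t1:])  # desired return
        d_h = T - t1                                    # desired horizon
        examples.append(((s, d_r, d_h), a))
    return examples
```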
1. Idea
Intuitively, it answers the question:
| “if an agent is in a given state and desires a given return over a given
| horizon, which action should it take next based on past experience?”
2. Behavior Function
Notation
- d^r : desired return
- d^h : desired horizon
- S : random variable for the environment's state
- A : random variable for the agent's next action
- R^{d^h} : random variable for the return obtained by the agent during the next d^h time steps
- 𝒯 : set of trajectories
- B_π(a, s, d^r, d^h) : policy-based behavior function for a known policy π
- B_𝒯(a, s, d^r, d^h) : trajectory-based behavior function, used when only trajectories generated by an unknown policy are available; it equals N_𝒯(a, s, d^r, d^h) / N_𝒯(s, d^r, d^h), where N_𝒯(s, d^r, d^h) is the number of trajectory segments in 𝒯 that start in state s, have length d^h and total reward d^r (a toy count-based sketch follows below)
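For intuition only, here is a toy count-based version of B_𝒯 for small discrete settings. It assumes hashable states and exact matches on (d^r, d^h), which rarely holds in practice; that is exactly why the paper approximates the behavior function with a neural network trained by SL, as the next slide describes.

```python
from collections import Counter

def trajectory_behavior_function(segments):
    """segments: iterable of (s, d_r, d_h, first_action) tuples, one per
    trajectory segment.  Returns B(a, s, d_r, d_h): the fraction of segments
    starting in s with total reward d_r and length d_h whose first action
    was a."""
    joint = Counter((s, d_r, d_h, a) for (s, d_r, d_h, a) in segments)
    marginal = Counter((s, d_r, d_h) for (s, d_r, d_h, _) in segments)

    def B(a, s, d_r, d_h):
        n = marginal[(s, d_r, d_h)]
        return joint[(s, d_r, d_h, a)] / n if n else 0.0

    return B
```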
2. Behavior Function
| “Using a loss function 𝑳, it can be estimated by solving the following SL problem”
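The equation itself did not survive the slide export; roughly, it is the empirical risk minimization min_θ Σ L(B(s, d^r, d^h; θ), a) over stored trajectory segments. A minimal PyTorch-style sketch of one such update, assuming discrete actions, cross-entropy as L, and the BehaviorFunction sketch from earlier (names are illustrative):

```python
import torch
import torch.nn.functional as F

def supervised_update(behavior_fn, optimizer, batch):
    """One SL step on a batch of replayed trajectory segments.
    batch = (states, commands, actions), where commands[i] = [d_r, d_h]
    and actions[i] is the action actually taken at the segment start."""
    states, commands, actions = batch
    logits = behavior_fn(states, commands)
    loss = F.cross_entropy(logits, actions)  # the loss function L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```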
3. Algorithm
| ⅂ꓤ does not explicitly maximize returns, but…
Learning can be biased towards higher returns
by selecting the trajectories on which the behavior function is trained!
| To Do So,
Use a replay buffer with the best 𝑍 trajectories seen so far,
where 𝑍 is a fixed hyperparameter.
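A hedged sketch of such a buffer (the paper's implementation details may differ): episodes are ranked by their return and only the best Z are kept.

```python
class TopZReplayBuffer:
    """Keeps only the Z highest-return episodes seen so far (Z is fixed)."""
    def __init__(self, Z: int):
        self.Z = Z
        self.episodes = []  # list of (episode_return, trajectory)

    def add(self, trajectory):
        ep_return = sum(r for (_, _, r) in trajectory)
        self.episodes.append((ep_return, trajectory))
        self.episodes.sort(key=lambda e: e[0], reverse=True)
        del self.episodes[self.Z:]  # drop everything beyond the best Z

    def best(self, k: int):
        return self.episodes[:k]
```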
3. Algorithm
| At any time t during an episode,
the current behavior function B produces a distribution
over actions: P(a_t | s_t, c_t) = B(s_t, c_t; θ)
| Given an initial command c_0 for a new episode,
a new trajectory is generated using Algorithm 2 by sampling actions according to B and
updating the current command using the obtained rewards and time left at each time step
until the episode terminates.
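A rough sketch of this rollout loop in the spirit of Algorithm 2, assuming a Gymnasium-style environment with discrete actions and the BehaviorFunction sketch above; it is not the paper's implementation.

```python
import torch

def generate_episode(env, behavior_fn, command):
    """Roll out one episode, updating the command at every step.
    command = (desired_return, desired_horizon)."""
    d_r, d_h = command
    s, _ = env.reset()
    trajectory, done = [], False
    while not done:
        logits = behavior_fn(torch.as_tensor(s, dtype=torch.float32),
                             torch.tensor([d_r, d_h], dtype=torch.float32))
        a = torch.distributions.Categorical(logits=logits).sample().item()
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        trajectory.append((s, a, r))
        d_r -= r               # desired return shrinks by the obtained reward
        d_h = max(d_h - 1, 1)  # time left, kept at least 1
        s = s_next
    return trajectory
```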
3. Algorithm
| After each training phase the agent can be given new commands,
potentially achieving higher returns due to additional knowledge gained by further training.
| To profit from such exploration through generalization,
a set of new initial commands c_0 to be used in Algorithm 2 is generated:
1. A number of episodes with the highest returns are selected from the replay buffer.
This number is a hyperparameter and remains fixed during training.
2. The exploratory desired horizon d_0^h is set to the mean of the lengths of the selected episodes.
3. The exploratory desired returns d_0^r are sampled from the uniform distribution 𝒰[M, M + S],
where M is the mean and S is the standard deviation of the selected episodic returns (sketched in code below).
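A small sketch of this command-generation step, reusing the TopZReplayBuffer sketch above (function and argument names are illustrative):

```python
import numpy as np

def exploratory_commands(buffer, n_best, n_commands):
    """Build new initial commands from the n_best highest-return episodes."""
    best = buffer.best(n_best)            # [(episode_return, trajectory), ...]
    returns = np.array([ret for ret, _ in best])
    lengths = np.array([len(traj) for _, traj in best])
    d_h0 = lengths.mean()                 # exploratory desired horizon
    M, S = returns.mean(), returns.std()
    d_r0 = np.random.uniform(M, M + S, size=n_commands)  # exploratory desired returns
    return [(float(r), float(d_h0)) for r in d_r0]
```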
3. Algorithm
Algorithm 2 is also used to evaluate the agent at any time, using evaluation
commands derived from the most recent exploratory commands.
The initial desired return d_0^r is set to M, the lower bound of the desired
returns from the most recent exploratory command, and the initial desired
horizon d_0^h is reused.
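Under the same assumptions, the corresponding evaluation command is simply:

```python
def evaluation_command(buffer, n_best):
    """Desired return = M (the lower bound used for exploratory returns),
    desired horizon = mean length of the selected best episodes."""
    best = buffer.best(n_best)
    M = sum(ret for ret, _ in best) / len(best)
    d_h0 = sum(len(traj) for _, traj in best) / len(best)
    return (M, d_h0)
```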
QnA
1. Tasks
• Fully-connected feed-forward neural networks, except for
TakeCover-v0, where convolutional networks were used
• Environments with both low- and high-dimensional (visual)
observations, and both discrete and continuous-valued actions:
• LunarLander-v2, based on Box2D
• TakeCover-v0, based on VizDoom
• Swimmer-v2 & InvertedDoublePendulum-v2, based on MuJoCo
2. Results
2. Results: Sparse-Delayed-Reward Versions
| Since ⅂ꓤ does not use temporal differences for learning,
it is reasonable to hypothesize that its performance may degrade differently from
algorithms that do when rewards are sparse and delayed. To test this, the environments were
converted to sparse, delayed-reward (partially observable) versions by delaying all rewards
until the last step of each episode.
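One way to do such a conversion, sketched with a Gymnasium-style wrapper (the paper's exact setup may differ): rewards are accumulated silently and paid out in a single lump sum at the final step.

```python
import gymnasium as gym

class DelayedRewardWrapper(gym.Wrapper):
    """Delays all reward to the last step, making the task sparse and
    partially observable with respect to reward."""
    def reset(self, **kwargs):
        self._accumulated = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._accumulated += reward
        if terminated or truncated:
            reward, self._accumulated = self._accumulated, 0.0
        else:
            reward = 0.0
        return obs, reward, terminated, truncated, info
```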
QnA
Conclusion
?
Thank You for Listening!
[Figure: RL / SL / UL]