[Season2] Reinforcement Learning Paper Review
Deep Recurrent Q-Learning
for Partially Observable MDPs
M. Hausknecht and P. Stone, AAAI, 2015
Department of Artificial Intelligence,
Korea University, Seoul
Minsuk Sung
Table of Contents
• Introduction
• Deep Q-Learning
• Partial Observability
• DRQN Architecture
• Atari Games: MDP or POMDP?
• Flickering Pong POMDP
• Generalization Performance
• Can DRQN Mimic an Unobscured Pong Player?
• Evaluation on Standard Atari Games
• MDP to POMDP Generalization
• Discussion and Conclusion
• Implementation
INTRODUCTION
Deep Q-Network
• Q-Learning
• Estimates the state-action value (Q-value) of executing an action from a given state
• Q-value update rule
• $Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$
• Limitations
• Hard to estimate Q-values when too many unique states exist
• Deep Q-Network (DQN) [Mnih et al., 2015]
• Approximates Q-values using a neural network parameterized by weights and biases, collectively denoted 𝜃
• Loss function
• $L(s, a \mid \theta_i) = \left( r + \gamma \max_{a'} Q(s', a' \mid \theta_i) - Q(s, a \mid \theta_i) \right)^2$
• Weight parameter update
• $\theta_{i+1} = \theta_i + \alpha \nabla_\theta L(\theta_i)$
• Limitations
• Hard to approximate Q-values in a partially observable system
(Figures: Q-Learning vs DQN; DQN architecture)
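To make the loss above concrete, a minimal PyTorch sketch of one DQN update step (the names q_net, target_net, and the batch layout are illustrative assumptions, not from the paper):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD-error loss: (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
    s, a, r, s_next, done = batch  # tensors sampled uniformly from replay memory

    # Q(s, a | theta_i): predicted value of the action actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # y = r + gamma * max_a' Q(s', a'): bootstrapped target, no gradient flows here
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)

    return F.mse_loss(q_sa, y)
```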
Full Observability
• Markov Decision Process (MDP)
• Described by a 4-tuple 𝒮, 𝒜, 𝒫, ℛ
• At each timestep $t$, an agent interacting with the MDP observes a state $s_t \in \mathcal{S}$, which determines the reward $r_t \sim \mathcal{R}(s_t, a_t)$ and the next state $s_{t+1} \sim \mathcal{P}(s_t, a_t)$
• Limitations of MDP
• It is rare that the full state of the system can be provided to the agent, or even determined
Partial Observability
• Partially-Observable Markov Decision Process (POMDP)
• Described by a 6-tuple 𝒮, 𝒜, 𝒫, ℛ, Ω, Ο
• 𝒮, 𝒜, 𝒫, ℛ are the same as in the MDP
• The agent receives an observation $o \in \Omega$, generated from the underlying system state according to the probability distribution $o \sim \mathcal{O}(s)$
• Better captures the dynamics of many real-world environments
• In the general case, estimating a Q-value from an observation can be arbitrarily bad, since $Q(o, a \mid \theta) \neq Q(s, a \mid \theta)$
(Figure: examples of Frozen Lake as an MDP (left) and as a POMDP (right))
Goal
To narrow the gap between $Q(o, a \mid \theta)$ and $Q(s, a \mid \theta)$
by adding recurrency to the Deep Q-Network on POMDPs
DRQN
ARCHITECTURE
Architecture of DRQN
• DRQN = DQN + LSTM
• DRQN replaces DQN’s first fully connected layer with a
Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber 1997]
• Overall Architecture
• Input Layer
• Taking a single 84 × 84 preprocessed image,
instead of the last four images required by DQN
• Conv1 ~ Conv3 Layers
• Detecting the paddle, the ball, and interactions between these objects
• LSTM Layer
• Detecting high-level features
• The agent missing the ball
• Ball reflections off of paddles
• Ball reflections off the walls
• Fully Connected Layer
• Outputting a Q-value for each action
(Figure: DRQN architecture; the convolutional layers capture the paddle, ball movement, and ball and paddle interactions, while the LSTM captures deflections, ball velocity, direction of travel, and other high-level features)
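A minimal PyTorch sketch of this architecture, as a hedged reconstruction: the convolutional layer sizes follow the DQN convnet of Mnih et al. (2015), which DRQN otherwise keeps, and the 512-unit LSTM matches the fully connected layer it replaces; class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """DQN convnet with the first fully connected layer replaced by an LSTM.
    Input is a single 84x84 preprocessed frame per timestep."""

    def __init__(self, num_actions, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9 -> 7x7
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, frames, hx=None):
        # frames: (batch, seq_len, 1, 84, 84) -- one frame per timestep
        b, t = frames.shape[:2]
        feats = self.conv(frames.flatten(0, 1)).flatten(1)  # (b*t, 64*7*7)
        out, hx = self.lstm(feats.view(b, t, -1), hx)       # hx carries memory forward
        return self.head(out), hx                           # Q-values per timestep
```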
Q&A
( 1 / 3 )
STABLE
RECURRENT UPDATES
Two types of Updates
• Bootstrapped Sequential Updates
• Episodes are selected randomly from replay memory
• Updates begin at the beginning of the episode and proceed forward through time to the conclusion of the episode
• The targets at each timestep are generated from the target Q-network
• RNN’s hidden state is carried forward throughout the episode
• Bootstrapped Random Updates
• Episodes are selected randomly from replay memory
• Updates begin at random points in the episode and proceed for only a fixed number of timesteps (the unroll length)
• The targets at each timestep are generated from the target Q-network
• RNN’s initial state is zeroed at the start of the update
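A minimal sketch of how Bootstrapped Random Updates can be sampled (the replay-memory layout, storing whole episodes as lists of transitions, and the unroll length of 10 are illustrative assumptions):

```python
import random

def sample_random_update(replay_memory, unroll=10):
    """Pick a random episode, then a random sub-sequence of `unroll` steps.
    The LSTM hidden state is zeroed (hx=None) at the start of each update."""
    episode = random.choice(replay_memory)          # list of (s, a, r, s', done)
    if len(episode) <= unroll:
        return episode, None                        # short episode: use it whole
    start = random.randrange(len(episode) - unroll)
    return episode[start:start + unroll], None      # None -> zeroed initial state
```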
Sequential Updates vs Random Updates
• Sequential Updates
• Having the advantage of carrying the LSTM's hidden state forward from the beginning of the episode
• Violating DQN's random sampling policy by sampling experiences sequentially for a full episode
• Random Updates
• Conforming better to DQN's policy of randomly sampling experience
• LSTM hidden state must be zeroed at the start of each update
• Making it harder for the LSTM to learn functions that span longer time scales
• Experiments show both types of updates yield convergent policies with similar performance
• All results in the paper use the randomized update strategy to limit complexity
ATARI GAMES:
MDP or POMDP?
Atari Games: MDP or POMDP?
• Atari 2600 Games
• The full state is described by 128 bytes of console RAM
• The agent, however, observes only the console-generated game screens
• POMDP → MDP
• A single screen during Atari games is insufficient to determine the state of the system
• DQN infers the full state of an Atari game by expanding the state representation to encompass the last
four game screens
• Modification to the Pong game
• To introduce partial observability to Atari games without reducing the number of input frames
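For contrast, a sketch of the frame-stacking trick DQN uses to approximate the full state from the last four screens (the class and its interface are illustrative, not from the paper):

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the last k preprocessed screens as DQN's state representation."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)         # pad with the initial screen
        return np.stack(self.frames)                # (k, 84, 84) state

    def step(self, frame):
        self.frames.append(frame)                   # drop the oldest screen
        return np.stack(self.frames)
```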
FLICKERING
PONG POMDP
Flickering Pong POMDP
• Flickering Pong POMDP
• A modification to the classic game of Pong
• At each timestep, the screen is either fully revealed or fully obscured with probability 𝑝 = 0.5
• Obscuring frames induces an incomplete memory of observations
• Three types of networks to play Flickering Pong
• The recurrent 1-frame DRQN
• A standard 4-frame DQN [Mnih et al., 2015]
• An augmented 10-frame DQN
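A minimal sketch of the flickering modification as an observation wrapper (the Gym-style step interface is an assumption; the paper specifies only the probabilistic obscuring rule):

```python
import numpy as np

class FlickerWrapper:
    """With probability p, replace the screen with a fully obscured (blank) frame."""
    def __init__(self, env, p=0.5):
        self.env, self.p = env, p

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if np.random.rand() < self.p:
            obs = np.zeros_like(obs)                # fully obscured frame
        return obs, reward, done, info
```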
Generalization Performance
• Evaluating the Best Policies for DRQN, 10-frame DQN, and 4-frame DQN
• Trained on Flickering Pong with 𝑝 = 0.5
• Evaluated against different 𝑝 values
• Observation Quality
• DRQN learns a policy which allows performance to scale
as a function of observation quality
• More information leads to higher scores
• Valuable for domains in which the quality of observations
varies through time
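The evaluation protocol can be sketched as a sweep over 𝑝 (the helpers FlickerWrapper, make_env, and run_episode are assumptions tying back to the earlier sketches):

```python
def evaluate_over_flicker(policy, make_env, ps=(0.0, 0.25, 0.5, 0.75, 1.0), episodes=10):
    """Evaluate one policy trained at p = 0.5 across different flicker probabilities."""
    scores = {}
    for p in ps:
        env = FlickerWrapper(make_env(), p=p)  # p = 0 means every frame is revealed
        scores[p] = sum(run_episode(policy, env) for _ in range(episodes)) / episodes
    return scores  # average score as a function of observation quality
```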
Q&A
( 2 / 3 )
EVALUATION
ON STANDARD ATARI GAMES
Evaluation on Standard Atari Games
• Selected 9 games for evaluating DRQN
1. Asteroids
2. Beam Rider → Worse than DQN
3. Bowling
4. Centipede
5. Chopper Command
6. Double Dunk → Better than DQN
7. Frostbite → Better than DQN
8. Ice Hockey
9. Ms. Pacman
MDP to POMDP
GENERALIZATION
MDP to POMDP Generalization
• Performance comparison between DQN and DRQN
• Train: POMDP / Test: MDP
• Train: MDP / Test: POMDP
• Recurrent Controller
• Robust to missing information, even when trained with full state information
DISCUSSION
AND CONCLUSION
Discussion and Conclusion
• Better Performance than DQN on POMDP
• DRQN handles the noisy and incomplete observations characteristic of POMDPs by combining an LSTM with DQN
• Given only a single frame at each step, DRQN still integrates information across frames to detect relevant features
• Generalization Performance
• Trained with partial observations
• DRQN learns policies that are both robust enough to handle missing game screens and scalable enough to improve performance as more data becomes available
• Trained with full observations
• DRQN performs better than DQN at all levels of partial information
• Adding Recurrency
• Experiments show that an LSTM is a viable method for integrating information across multiple state observations
IMPLEMENTATION
DRQN
• GitHub
• Caffe : https://github.com/mhauskn/dqn/tree/recurrent
• PyTorch : https://github.com/mynkpl1998/Recurrent-Deep-Q-Learning
DRQN (PyTorch version)
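A minimal training-step sketch combining the DRQN and sampling sketches from earlier slides; the optimizer choice, learning rate, action-set size, and the assumption that transitions are stored as tensors are all illustrative.

```python
import torch
import torch.nn.functional as F

num_actions = 6  # e.g., Pong's action set; an illustrative choice
policy_net = DRQN(num_actions)
target_net = DRQN(num_actions)
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)

def train_step(replay_memory, gamma=0.99):
    # Bootstrapped Random Update: random sub-sequence, zeroed hidden state
    seq, hx = sample_random_update(replay_memory)
    # Assumes each transition is stored as (s, a, r, s_next, done) tensors
    s, a, r, s_next, done = (torch.stack(x).unsqueeze(0) for x in zip(*seq))
    q, _ = policy_net(s, hx)                              # (1, T, num_actions)
    q_sa = q.gather(2, a.view(1, -1, 1)).squeeze(-1)      # Q of taken actions
    with torch.no_grad():
        q_next, _ = target_net(s_next, hx)
        y = r.view(1, -1) + gamma * q_next.max(-1).values * (1 - done.view(1, -1))
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```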
Q&A
( 3 / 3 )
THANK YOU
FOR LISTENING
References
• Paper
• https://www.aaai.org/ocs/index.php/FSS/FSS15/paper/viewPaper/11673
• Blog
• https://sumniya.tistory.com/22
• https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-6-partial-observability-and-deep-recurrent-q-68463e9aeefc
• https://jay.tech.blog/2017/02/06/deep-recurrent-q-learning/
• https://whereisend.tistory.com/108
• https://ddanggle.github.io/demystifyingDL
• Code
• https://github.com/mhauskn/dqn/tree/recurrent
• https://github.com/mynkpl1998/Recurrent-Deep-Q-Learning
• https://github.com/qfettes/DeepRL-Tutorials/blob/master/11.DRQN.ipynb
• https://github.com/awjuliani/DeepRL-Agents/blob/master/Deep-Recurrent-Q-Network.ipynb