Playing Atari with Deep Reinforcement Learning
- Authors: Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
- Origin: https://arxiv.org/abs/1312.5602
- Related: https://github.com/number9473/nn-algorithm/issues/250
1. Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller.
NIPS Deep Learning Workshop 2013
Yu Kai Huang
2. Outline
● Reinforcement Learning
● Markov Decision Process
○ State, Action (Policy), Reward
○ Value function, Bellman Equation
● Optimal Policy
○ Bellman Optimality Equation
○ Q-learning
○ Deep Q-learning Network
● Experiments
○ Training and Stability
○ Evaluation
8. Reinforcement Learning
● No supervisor, only a reward signal.
● Feedback is delayed, not instantaneous.
● Time really matters (sequential, non-i.i.d. data).
● Agent’s actions affect the subsequent data it receives.
Image from
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/1-1-A-RL/
9-11. Reinforcement Learning
● State: the current situation that the agent is in.
○ e.g. moving (position, velocity, acceleration, ...)
● Action: a command that the agent can give in the game.
○ e.g. ↑, ↓, ←, →
● Reward: a scalar signal given after performing an action.
○ e.g. +1, -100
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
12. Reinforcement Learning
● Full observability: agent directly
observes environment state.
● Agent state = environment state
= information state
● Formally, this is a Markov
Decision Process (MDP).
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
14. Markov Decision Process
● Markov decision processes formally describe an environment for
reinforcement learning in which the environment is fully observable.
● Almost all RL problems can be formalised as MDPs.
15. Markov Decision Process: State
● An MDP can be drawn as a directed graph whose nodes are states and whose
edges describe transitions between Markov states.
○ State Transition Matrix
● Markov Property: “The future is independent of the past given the present”
○ The current state summarizes all past states.
○ e.g., if we only know the position of the ball but not its velocity, its state is
no longer Markov.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
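The Markov property and the state transition matrix on this slide can be written out explicitly (notation follows the Silver MDP slides linked above):

```latex
% A state S_t is Markov iff the present captures all relevant history:
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
% The state transition matrix collects the one-step dynamics:
P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]
```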
17. Markov Decision Process: Policy
● A policy fully defines the behaviour of an agent.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
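Concretely, in the notation of the cited Silver slides, a (stochastic) policy is a distribution over actions given states:

```latex
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
```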
18. Markov Decision Process: Reward and Return
● Each time you make a transition into a state, you receive a reward.
● Agents should learn to maximize cumulative future reward.
○ Return
○ Discount factor
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
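The return and discount factor mentioned above can be stated as (standard notation, matching the cited Silver slides):

```latex
% Return: total discounted reward from time-step t onwards
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1]
```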
19. Markov Decision Process: Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
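The value functions shown on this slide as images can be written out (standard definitions, consistent with the return \(G_t\) defined on the previous slide):

```latex
% State-value function: expected return starting from state s
v(s) = \mathbb{E}[G_t \mid S_t = s]
% Action-value function: expected return after taking action a in state s
q(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]
```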
20. Markov Decision Process: Bellman Equation
● If we know the value of the next state, we can compute the value of the
current state.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
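In equation form, the recursion stated in the bullet above is the Bellman equation (same notation as the value function on the previous slide):

```latex
% The value of a state decomposes into the immediate reward plus
% the discounted value of the successor state:
v(s) = \mathbb{E}[R_{t+1} + \gamma \, v(S_{t+1}) \mid S_t = s]
```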
21-23. Markov Decision Process: Bellman Equation
(Worked examples of the Bellman equation, shown as images.)
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
43. Deep Q-learning Network
● Data Preprocessing: “The raw frames are preprocessed by first converting
their RGB representation to gray-scale and down-sampling it to a 110×84
[...] cropping an 84 × 84 region of the image [...].”
● Model Architecture
○ Input size: 84x84x4
○ Output size: 4 (←, →, x, B)
○ layers:
■ conv1(16, (8, 8), strides=(4, 4))
■ conv2(32, (4, 4), strides=(2, 2))
■ Dense(256)
■ Dense(4)
Image from https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
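A quick way to sanity-check the layer sizes above is to trace the spatial dimensions through the two convolutions. This is a small Python sketch; padding = 0 is an assumption, chosen because it reproduces the 84 → 20 → 9 arithmetic:

```python
# Trace the tensor shapes through the layers listed above.
# Kernel and stride values come from the slide; padding = 0 is an
# assumption (it is what makes the arithmetic below come out).
def conv_out(size, kernel, stride):
    """Spatial output size of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1

h = conv_out(84, kernel=8, stride=4)   # conv1: 84 -> 20
h = conv_out(h, kernel=4, stride=2)    # conv2: 20 -> 9
flat = 32 * h * h                      # 32 feature maps of 9x9 feed Dense(256)
print(h, flat)  # 9 2592
```

So Dense(256) receives 32 × 9 × 9 = 2592 features, which the slide's layer list leaves implicit.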
44. Deep Q-learning Network
● Experience Replay
○ “we store the agent’s experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}),
in a data-set D = e_1, ..., e_N, pooled over many episodes into a replay
memory.”
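The replay memory quoted above can be sketched in a few lines of Python. The class name and capacity here are illustrative, not from the paper; the deque simply evicts the oldest experience once N transitions are stored, and minibatches are drawn uniformly at random:

```python
import random
from collections import deque

# Minimal sketch of the replay memory described above (names and
# capacity are illustrative).
class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences evicted first

    def store(self, state, action, reward, next_state):
        # One experience tuple e_t = (s_t, a_t, r_t, s_{t+1})
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random minibatch; random sampling breaks the
        # correlation between consecutive experiences.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=5)
for t in range(8):                  # store 8 transitions into a capacity-5 buffer
    memory.store(t, t % 2, 1.0, t + 1)
print(len(memory.buffer))           # 5 (the oldest three were evicted)
print(len(memory.sample(3)))        # 3
```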
51. Main Evaluation
● A trial: 5,000 training episodes, followed by 500 evaluation episodes.
● Average performance across 30 trials.
Image from https://arxiv.org/pdf/1312.5602.pdf
52. Ref.
[1] Jaromír Janisch: Let’s Make a DQN: Theory https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/#fn-38-6
[2] Venelin Valkov: Solving an MDP with Q-Learning from Scratch — Deep Reinforcement Learning for Hackers (Part 1)
https://medium.com/@curiousily/solving-an-mdp-with-q-learning-from-scratch-deep-reinforcement-learning-for-hackers-part-1-45d1d360c120
[3] Flood Sung: Deep Reinforcement Learning Fundamentals (DQN) https://blog.csdn.net/songrotek/article/details/50580904
[4] Flood Sung: A Review of Classic Reinforcement Learning Algorithms 1: Policy and Value Iteration
https://blog.csdn.net/songrotek/article/details/51378582
[5] mmc2015: Reinforcement Learning: Policy Evaluation, Policy Iteration, Value Iteration, Dynamic Programming
https://blog.csdn.net/mmc2015/article/details/52859611
53. Ref.
[6] Gai's Blog: Reinforcement Learning Part 1 - Introduction https://bluesmilery.github.io/blogs/481fe3af/
[7] Gai's Blog: Reinforcement Learning Part 2 - Markov Decision Process
https://bluesmilery.github.io/blogs/e4dc3fbf/
[8] Gai's Blog: Reinforcement Learning Part 3 - Planning by Dynamic Programming
https://bluesmilery.github.io/blogs/b96003ba/
[9] Rowan McAllister: Introduction to Reinforcement Learning http://mlg.eng.cam.ac.uk/rowan/files/rl/01_mdps.pdf
[10] Adrien Lucas Ecoffet: Beat Atari with Deep Reinforcement Learning! (Part 0: Intro to RL)
https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
[11] Josh Greaves: Everything You Need to Know to Get Started in Reinforcement Learning
https://joshgreaves.com/reinforcement-learning/introduction-to-reinforcement-learning/
[12] Katerina Fragkiadaki: Deep Reinforcement Learning and Control: Markov Decision Processes
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf