Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller.
NIPS Deep Learning Workshop 2013
Yu Kai Huang
Outline
● Reinforcement Learning
● Markov Decision Process
○ State, Action (Policy), Reward
○ Value function, Bellman Equation
● Optimal Policy
○ Bellman Optimality Equation
○ Q-learning
○ Deep Q-learning Network
● Experiments
○ Training and Stability
○ Evaluation
Reinforcement Learning
Reinforcement Learning
Image from https://arxiv.org/pdf/1312.5602.pdf
Reinforcement Learning
Image from https://i.imgur.com/kw5Veqz.jpg
Reinforcement Learning
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/intro_RL.pdf
● No supervisor, only a reward signal.
● Feedback is delayed, not instantaneous.
● Time really matters.
● Agent’s actions affect the subsequent data it receives.
Reinforcement Learning
Image from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/1-1-A-RL/
Reinforcement Learning
● State: the current situation that the agent is in.
  ○ e.g. moving (position, velocity, acceleration, ...)
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● State: the current situation that the agent is in.
  ○ e.g. moving (position, velocity, acceleration, ...)
● Action: a command that the agent can give in the game.
  ○ e.g. ↑, ↓, ←, →
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● State: the current situation that the agent is in.
  ○ e.g. moving (position, velocity, acceleration, ...)
● Action: a command that the agent can give in the game.
  ○ e.g. ↑, ↓, ←, →
● Reward: given after performing an action.
  ○ e.g. +1, -100
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Reinforcement Learning
● Full observability: the agent directly observes the environment state.
● Agent state = environment state = information state.
● Formally, this is a Markov Decision Process (MDP).
Image from https://homes.cs.washington.edu/~todorov/papers/TassaIROS12.pdf
Markov Decision Process
Markov Decision Process
● Markov decision processes formally describe an environment for reinforcement learning.
● The environment is fully observable.
● Almost all RL problems can be formalised as MDPs.
Markov Decision Process: State
● An MDP can be viewed as a directed graph whose nodes are states and whose edges describe transitions between states.
  ○ State Transition Matrix
● Markov Property: “The future is independent of the past given the present.”
  ○ The current state summarizes all past states.
  ○ e.g., if we only know the position of a ball but not its velocity, its state is no longer Markov.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
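In standard notation, the Markov property and the state transition matrix read:

```latex
% Markov property: the future is independent of the past given the present
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
% State transition matrix entry: probability of moving from state s to state s'
\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]
```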
Example: Student MDP
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Markov Decision Process: Policy
● A policy fully defines the behaviour of an agent.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
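In the notation of the cited slides, a (stochastic) policy is a distribution over actions given states:

```latex
\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]
```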
Markov Decision Process: Reward and Return
● Each time you make a transition into a state, you receive a reward.
● Agents should learn to maximize cumulative future reward.
○ Return
○ Discount factor
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
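The return and discount factor referred to above are, in standard notation:

```latex
% Return: total discounted reward from time step t, with discount factor gamma in [0, 1]
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```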
Markov Decision Process: Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
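The value functions shown on this slide are expected returns, in standard notation:

```latex
% State-value function: expected return starting from state s
v(s) = \mathbb{E}[G_t \mid S_t = s]
% Action-value function: expected return after taking action a in state s
q(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]
```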
Markov Decision Process: Bellman Equation
● If we know the value of the next state, we can compute the value of the current state.
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
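In equation form, the value decomposes into the immediate reward plus the discounted value of the successor state:

```latex
v(s) = \mathbb{E}[R_{t+1} + \gamma \, v(S_{t+1}) \mid S_t = s]
q(s, a) = \mathbb{E}[R_{t+1} + \gamma \, q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]
```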
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Markov Decision Process: Bellman Equation
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf,
https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Example: Student MDP
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Policy
Optimal Policy
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
Optimal Value function
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
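The optimal value functions are the maxima over all policies:

```latex
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)
```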
Bellman Optimality Equation for Q*
Image from http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf
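The Bellman optimality equation for q* named in the slide title is, in standard notation:

```latex
q_*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \max_{a'} q_*(s', a')
```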
Solving the Bellman Optimality Equation
● The Bellman optimality equation is non-linear and has no closed-form solution in general; it is solved by iterative methods such as value iteration, policy iteration, Q-learning, and SARSA.
Example: How to be a good kid?
Image from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   0               0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a1 = argmax(q(s1, a1), q(s1, a2))
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   0               0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a1 = argmax(q(s1, a1), q(s1, a2))
● Next state: s1
● Reward:
  ○ R(s1, a1) = -5
● Q-table update:
  ○ delta = target(s1, a1) - q(s1, a1) = (-5 + 1*0) - 0 = -5
  ○ q(s1, a1) = q(s1, a1) + alpha*delta = 0 + 1*(-5) = -5
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a1 = argmax(q(s1, a1), q(s1, a2))
● Next state: s1
● Reward:
  ○ R(s1, a1) = -5
● Q-table update:
  ○ delta = target(s1, a1) - q(s1, a1) = (-5 + 1*0) - 0 = -5
  ○ q(s1, a1) = q(s1, a1) + alpha*delta = 0 + 1*(-5) = -5
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a2 = argmax(q(s1, a1), q(s1, a2))
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              0
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a2 = argmax(q(s1, a1), q(s1, a2))
● Next state: s2
● Reward:
  ○ R(s1, a2) = 1
● Q-table update:
  ○ delta = target(s1, a2) - q(s1, a2) = (1 + 1*0) - 0 = 1
  ○ q(s1, a2) = q(s1, a2) + alpha*delta = 0 + 1*1 = 1
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   0               0
  s3   0               0

● Current state: s1
● Select action:
  ○ a2 = argmax(q(s1, a1), q(s1, a2))
● Next state: s2
● Reward:
  ○ R(s1, a2) = 1
● Q-table update:
  ○ delta = target(s1, a2) - q(s1, a2) = (1 + 1*0) - 0 = 1
  ○ q(s1, a2) = q(s1, a2) + alpha*delta = 0 + 1*1 = 1
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   0               0
  s3   0               0

● Current state: s2
● Select action:
  ○ a1 = argmax(q(s2, a1), q(s2, a2))
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   0               0
  s3   0               0

● Current state: s2
● Select action:
  ○ a1 = argmax(q(s2, a1), q(s2, a2))
● Next state: s1
● Reward:
  ○ R(s2, a1) = -5
● Q-table update:
  ○ delta = target(s2, a1) - q(s2, a1) = (-5 + 1*1) - 0 = -4
  ○ q(s2, a1) = q(s2, a1) + alpha*delta = 0 + 1*(-4) = -4
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
● Q-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -4              0
  s3   0               0

● Current state: s2
● Select action:
  ○ a1 = argmax(q(s2, a1), q(s2, a2))
● Next state: s1
● Reward:
  ○ R(s2, a1) = -5
● Q-table update:
  ○ delta = target(s2, a1) - q(s2, a1) = (-5 + 1*1) - 0 = -4
  ○ q(s2, a1) = q(s2, a1) + alpha*delta = 0 + 1*(-4) = -4
● Reward-table

       a1 (Watch TV)   a2 (Do homework)
  s1   -5              1
  s2   -5              5
  s3   0               0

Set discount factor gamma = 1 and learning rate alpha = 1.
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Q-Learning
Select Action
Image from https://blog.techbridge.cc/2017/11/04/openai-gym-intro-and-q-learning/
Q-Learning
target(s, a) = R(s, a) + gamma * max_a' q(s', a')
Image from https://blog.techbridge.cc/2017/11/04/openai-gym-intro-and-q-learning/
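Putting the walkthrough together, here is a minimal tabular Q-learning sketch in Python. The reward table and the gamma = alpha = 1 settings come from the slides; the deterministic transitions and the ε-greedy rule (standing in for the selection scheme pictured above) are assumptions for illustration.

```python
import numpy as np

n_states, n_actions = 3, 2                 # states s1..s3; actions a1 (Watch TV), a2 (Do homework)
R = np.array([[-5.0, 1.0],                 # Reward-table from the slides
              [-5.0, 5.0],
              [ 0.0, 0.0]])
# Deterministic transitions (an assumption; the slides only show the visited states):
next_state = {(0, 0): 0, (0, 1): 1,        # s1: watch TV -> s1, do homework -> s2
              (1, 0): 0, (1, 1): 2,        # s2: watch TV -> s1, do homework -> s3
              (2, 0): 2, (2, 1): 2}        # s3 absorbs

gamma, alpha, epsilon = 1.0, 1.0, 0.1      # gamma and alpha as set on the slides
Q = np.zeros((n_states, n_actions))        # Q-table, initialized to all zeros

rng = np.random.default_rng(0)
s = 0                                      # start in s1
for step in range(100):
    # epsilon-greedy selection: explore with probability epsilon, else act greedily
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = next_state[(s, a)]
    target = R[s, a] + gamma * Q[s_next].max()   # target(s, a) = R + gamma * max_a' q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])        # q(s, a) <- q(s, a) + alpha * delta
    s = 0 if s_next == 2 else s_next             # restart the episode once s3 is reached

print(np.round(Q, 2))
```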
Deep Q-learning Network
● Data Preprocessing: “The raw frames are preprocessed by first converting
their RGB representation to gray-scale and down-sampling it to a 110×84
[...] cropping an 84 × 84 region of the image [...].”
● Model Architecture
○ Input size: 84x84x4
○ Output size: 4 (←, →, x, B)
○ Layers (see the sketch below):
■ conv1(16, (8, 8), strides=(4, 4))
■ conv2(32, (4, 4), strides=(2, 2))
■ Dense(256)
■ Dense(4)
Image from https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
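A sketch of the preprocessing and the network in Keras, matching the bullets above. The 110×84 down-sample, 84×84 crop, and layer sizes follow the quoted paper text; the exact crop offset is an assumption (the paper only says the crop roughly captures the playing area), and ReLU hidden activations follow the paper's rectifier nonlinearity.

```python
import cv2
import numpy as np
from tensorflow.keras import layers, models

def preprocess(frame_rgb):
    """Grayscale -> down-sample to 110x84 -> crop an 84x84 playing-area region."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 110))    # cv2.resize takes (width, height)
    return small[18:102, :]                # crop offset is an assumption

def build_q_network(n_actions=4):
    """Q-network from the slide: two conv layers followed by two dense layers."""
    return models.Sequential([
        layers.Input(shape=(84, 84, 4)),   # 4 stacked preprocessed frames
        layers.Conv2D(16, (8, 8), strides=(4, 4), activation="relu"),
        layers.Conv2D(32, (4, 4), strides=(2, 2), activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_actions),           # one Q-value per action
    ])

model = build_q_network()
model.summary()
```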
Deep Q-learning Network
● Experience Replay
○ “we store the agent’s experiences at each time-step, et = (st, at, rt, st+1)
in a data-set D = e1, ..., eN , pooled over many episodes into a replay
memory.”
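A minimal sketch of the quoted replay memory; the `done` flag and the eviction-by-deque behavior are implementation details not specified in the slide.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions e_t = (s_t, a_t, r_t, s_{t+1}) in a dataset D of capacity N."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Uniformly sampled minibatch, pooled over many episodes
        return random.sample(self.buffer, batch_size)
```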
Deep Q-learning Network
● Optimal action-value function
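Concretely, the paper trains the network by minimizing a sequence of squared-error losses whose targets come from the Bellman optimality equation:

```latex
% Bellman optimality for the optimal action-value function
Q^*(s, a) = \mathbb{E}_{s'}\big[\, r + \gamma \max_{a'} Q^*(s', a') \;\big|\; s, a \,\big]
% DQN loss at iteration i, with targets computed from the previous parameters
y_i = r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}), \qquad
L_i(\theta_i) = \mathbb{E}\big[ (y_i - Q(s, a; \theta_i))^2 \big]
```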
Deep Q-learning Network
Image from https://arxiv.org/pdf/1312.5602.pdf
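For concreteness, one gradient step of the algorithm shown above might look like the following sketch, reusing `model` and `ReplayMemory` from earlier. Minibatch size 32 and RMSProp follow the paper; the learning rate, gamma = 0.99, and pixel scaling are assumptions.

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(learning_rate=2.5e-4)  # learning rate is an assumption
gamma = 0.99                                                   # discount factor (assumption)

def train_step(model, memory, batch_size=32):
    # Sample a uniform minibatch of transitions (s, a, r, s', done)
    batch = memory.sample(batch_size)
    s, a, r, s_next, done = map(np.array, zip(*batch))
    s = s.astype(np.float32) / 255.0          # pixel scaling is an assumption
    s_next = s_next.astype(np.float32) / 255.0

    # y_j = r_j for terminal s', else r_j + gamma * max_a' Q(s', a'; theta)
    q_next = model(s_next).numpy().max(axis=1)
    y = r + gamma * q_next * (1.0 - done.astype(np.float32))

    with tf.GradientTape() as tape:
        q = model(s)                                      # Q(s, .; theta)
        q_a = tf.reduce_sum(q * tf.one_hot(a, q.shape[-1]), axis=1)
        loss = tf.reduce_mean(tf.square(y - q_a))         # (y_j - Q(s_j, a_j; theta))^2
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)
```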
Experiments
Training and Stability (Evaluation Metric)
Image from https://arxiv.org/pdf/1312.5602.pdf
Training and Stability (Evaluation Metric)
Image from https://arxiv.org/pdf/1312.5602.pdf
Visualizing the Value Function
Image from https://arxiv.org/pdf/1312.5602.pdf
Main Evaluation
● A trial: 5,000 training episodes, followed by 500 evaluation episodes.
● Average performance across 30 trials.
Image from https://arxiv.org/pdf/1312.5602.pdf
Ref.
[1] Jaromír Janisch: Let's Make a DQN: Theory. https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/#fn-38-6
[2] Venelin Valkov: Solving an MDP with Q-Learning from scratch — Deep Reinforcement Learning for Hackers (Part 1). https://medium.com/@curiousily/solving-an-mdp-with-q-learning-from-scratch-deep-reinforcement-learning-for-hackers-part-1-45d1d360c120
[3] Flood Sung: Deep Reinforcement Learning Fundamentals (DQN). https://blog.csdn.net/songrotek/article/details/50580904
[4] Flood Sung: A Review of Classic Reinforcement Learning Algorithms 1: Policy and Value Iteration. https://blog.csdn.net/songrotek/article/details/51378582
[5] mmc2015: Reinforcement Learning: Policy Evaluation, Policy Iteration, Value Iteration, Dynamic Programming. https://blog.csdn.net/mmc2015/article/details/52859611
Ref.
[6] Gai's Blog: Reinforcement Learning Part 1 - Introduction. https://bluesmilery.github.io/blogs/481fe3af/
[7] Gai's Blog: Reinforcement Learning Part 2 - Markov Decision Process. https://bluesmilery.github.io/blogs/e4dc3fbf/
[8] Gai's Blog: Reinforcement Learning Part 3 - Planning by Dynamic Programming. https://bluesmilery.github.io/blogs/b96003ba/
[9] Rowan McAllister: Introduction to Reinforcement Learning. http://mlg.eng.cam.ac.uk/rowan/files/rl/01_mdps.pdf
[10] Adrien Lucas Ecoffet: Beat Atari with Deep Reinforcement Learning! (Part 0: Intro to RL). https://becominghuman.ai/lets-build-an-atari-ai-part-0-intro-to-rl-9b2c5336e0ec
[11] Josh Greaves: Everything You Need to Know to Get Started in Reinforcement Learning. https://joshgreaves.com/reinforcement-learning/introduction-to-reinforcement-learning/
[12] Katerina Fragkiadaki: Deep Reinforcement Learning and Control: Markov Decision Processes. https://www.cs.cmu.edu/~katef/DeepRLControlCourse/lectures/lecture2_mdps.pdf
Ref.
[13] 莫烦 (Morvan Zhou): What is Q-Learning? https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-2-A-q-learning/
