Very personal, introductory notes on reinforcement learning,
focusing on understanding DQN by Google DeepMind.
Mainly based on the nice article
"Demystifying Deep Reinforcement Learning" by Tambet Matiisen:
http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
9. Joo-Haeng Lee 2017 joohaeng@gmail.com
A state machine SM defined with 9 states, 3 actions, and 20 transitions.
10. Joo-Haeng Lee 2017 joohaeng@gmail.com
For a given input (action) a_t at time t, the state machine SM returns its state representation s_t and reward r_t.
11. Joo-Haeng Lee 2017 joohaeng@gmail.com
A human player: perception of the state s_t and reward r_t, and control of the action a_t.
12. Joo-Haeng Lee 2017 joohaeng@gmail.com
The system details and dynamics are unknown, or only partially known.
14. Joo-Haeng Lee 2017 joohaeng@gmail.com
A human builds a cognitive model of the system by learning.
15. Joo-Haeng Lee 2017 joohaeng@gmail.com
How can a machine learn to perform?
16. Joo-Haeng Lee 2017 joohaeng@gmail.com
A classic approach is Reinforcement Learning (RL)!
17. Joo-Haeng Lee 2017 joohaeng@gmail.com
One of the RL methods is Q-Learning (QL)!
18. Joo-Haeng Lee 2017 joohaeng@gmail.com
A recent advancement in QL is Deep Q-Learning (DQL) by DeepMind!
19. Joo-Haeng Lee 2017 joohaeng@gmail.com
Can we build Alpha Ma using DQL or its variants?
21. Joo-Haeng Lee 2017 joohaeng@gmail.com
Reinforcement Learning
• "Reinforcement learning (RL) is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward." — Wikipedia
• "Reinforcement learning is one of the problems studied in machine learning: a method by which an agent defined in some environment perceives its current state and chooses, among the available actions, the action or sequence of actions that maximizes its reward." — Wikipedia (Korean)
22. Joo-Haeng Lee 2017 joohaeng@gmail.com
Main Challenges in RL
• Credit Assignment Problem
- "Which action earned the points I just got?"
• Exploration-Exploitation Dilemma
- Prospecting vs. mining: "Should I go look for a richer vein, or keep digging here?"
23. Joo-Haeng Lee 2017 joohaeng@gmail.com
Mathematical Formalism for RL
• Markov Decision Process (MDP)
- State transitions driven by actions
- Action
- State
- Transition
[Diagram: the state machine above, now with rewards such as -10, +2, +20, +100, and -50 attached to transitions]
24. Joo-Haeng Lee 2017 joohaeng@gmail.com
Key Concepts
• Encoding long-term strategies: discounted future reward
• Estimating the future reward: table-based Q-learning
• Huge state space: the Q-table is replaced with a neural network
• Implementation tip: experience replay for stability
• Exploration-exploitation dilemma: ε-greedy exploration
26. Joo-Haeng Lee 2017 joohaeng@gmail.com
Reinforcement Learning — Breakout
• Problem description
- Input: game screen with scores
- Output: game controls = ( left || right || space )
- Training?
• An expert dataset for supervised learning — how to get it?
• Self-practice with occasional rewards, as humans do — reinforcement learning
27. Joo-Haeng Lee 2017 joohaeng@gmail.com
Markov Decision Process (MDP) — Formalism for RL
• Environment: game, system, simulator, …
• Agent: a human user, software
• State: stochastic transition to another state for an action
• Action
• Reward
• Policy
28. Joo-Haeng Lee 2017 joohaeng@gmail.com
Markov Decision Process (MDP) — Formalism for RL
• Episode
- a sequence of states, actions, and rewards in a game
- s0, a0, r1, s1, a1, r2, s2, a2, r3, ..., s_{n-1}, a_{n-1}, r_n, s_n = game over = GG
• Markov assumption
- The probability of the next state s_{i+1} depends only on the current state s_i and the performed action a_i, not on preceding states or actions.
29. Joo-Haeng Lee 2017 joohaeng@gmail.com
Discounted Future Reward (DFR)
• To play well in the long term, we need to consider current and future rewards at once.
• Total reward for an episode: R = r_1 + r_2 + r_3 + ... + r_n
• Total future reward at time t: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n
• Considering the stochastic nature of the environment, future rewards are discounted by a factor γ per step.
• Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n-t} r_n = r_t + γ R_{t+1}
• NOTE: A good strategy for an agent is to always choose, at state s_t, the action a_t that maximizes the discounted future reward R_t.
• BUT, how?
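A quick sketch (mine, not from the notes) of how the recursive relation R_t = r_t + γ R_{t+1} turns a list of rewards into discounted returns, assuming Python:

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = r_t + gamma * R_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future   # R_t = r_t + gamma * R_{t+1}
        returns[t] = future
    return returns

# Example: an episode where only the last step is rewarded.
print(discounted_returns([0, 0, 0, 1.0], gamma=0.9))   # ~[0.729, 0.81, 0.9, 1.0]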
30. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning
• Q-function: Q(s, a) — the maximum discounted future reward obtainable after performing action a in state s and acting optimally thereafter
- Q(s, a) = max R_{t+1}
- Among the myriad possible episodes, the maximum DFR is earned by a certain sequence of actions after the current action a at state s.
- The "quality" of the current action determines this DFR.
31. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning
• Policy:
- π(s) = argmax_{a'} Q(s, a') = a
- The action a that results in the maximum DFR: Q(s, a)
33. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning
• Naïve algorithm for Q-table filling:
- Initialize Q-table arbitrarily with #state rows and #action columns.
- Observe initial state s
- Repeat
• Select an action a and input to the environment E
- Action a will be carried out in E.
• Observe reward r and new state s’
• Update the table: Q(s, a) ← (1-α) Q(s, a) + α (r + γ max_{a'} Q(s', a'))
- s = s'
- until terminated
Q-table   a1    a2    a3    …    an
s1        100   130   80    …    121
s2        200   99    99    …    2
s3        50    99    150   …    2
…         …     …     …     …    …
sn        101   124   124   …    199
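A compact sketch of this table-filling loop in Python (mine, not from the notes; the env object is assumed to follow a Gym-style reset/step interface, and exploration uses the ε-greedy choice discussed on a later slide):

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))           # Q-table: #state rows, #action columns
    for _ in range(episodes):
        s = env.reset()                           # observe initial state s
        done = False
        while not done:
            # select an action a (random with probability epsilon, otherwise greedy)
            a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)      # action a is carried out in E
            # Q(s,a) = (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
            s = s_next
    return Q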
34. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning
• The estimations get more and more accurate with every iteration and it has been
shown that, if we perform this update enough times, then the Q-function will
converge and represent the true Q-value.
• OK. BUT, how can we generalize a Q-function (or Q-table) to handle many similar problems at once? — e.g., ATARI 2600.
35. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network
• Q-Learning + Deep Neural Network
• DQN
• Google DeepMind (NIPS 2013 Workshop, Nature 2015)
40. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning — Breakout
• Modeling for Breakout:
- State: description of all the game elements such as ball, bar, and bricks
- Reward: score
- Output: game controls = ( left || right || space )
• BUT, how to handle all the other ATARI 2600 games?
- The problem of generalization!
State encoding: (#bricks) × (x, y, on) + (x) for the bar + (x, y) for the ball
41. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning — All ATARI 2600 Games?
• Modeling for any Atari 2600 games:
- State: all the pixels in the game screens
- Reward: score
- Output: all the control actions in the joystick
42. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning — All ATARI 2600 Games?
• Modeling for any Atari 2600 games:
- State: 84×84 pixels × 4 frames × 256 gray levels
- Reward: score
- Output: 18 actions
[Figure (Mnih et al., Nature 2015): schematic of the convolutional network — the 84×84×4 preprocessed input is followed by three convolutional layers and two fully connected layers, with a single output for each valid action; each hidden layer is followed by a rectifier nonlinearity, max(0, x).]
43. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning — All ATARI 2600 Games?
• Modeling for any Atari 2600 games:
- State: 84×84 pixels × 4 frames × 256 gray levels = 256^(84×84×4) ≈ 10^67970 states
- Reward: score
- Output: 18 actions
44. Joo-Haeng Lee 2017 joohaeng@gmail.com
Q-Learning — All ATARI 2600 Games?
• Modeling for any Atari 2600 games:
- State: 84×84 pixels × 4 frames × 256 gray levels = 256^(84×84×4) ≈ 10^67970 states
- Reward: score
- Output: 18 actions
45. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network — All ATARI 2600 Games!
• We can hardly implement a Q-function as a table: size and sparsity!
• Now, deep learning steps in!
- A deep convolutional neural network (CNN) is especially good at extracting a small set of features from big data.
- We can replace the Q-table with a deep neural network — DQN!
[Diagram: the network takes state s as input and outputs Q(s, a1), …, Q(s, an)]
46. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network — All ATARI 2600 Games!
Layer   Input      Filter size   Stride   Num filters   Activation   Output
conv1   84x84x4    8x8           4        32            ReLU         20x20x32
conv2   20x20x32   4x4           2        64            ReLU         9x9x64
conv3   9x9x64     3x3           1        64            ReLU         7x7x64
fc4     7x7x64     -             -        512           ReLU         512
fc5     512        -             -        18            Linear       18
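A sketch of this architecture in Python/PyTorch (my own rendering of the table above, not DeepMind's code; layer shapes follow the table):

import torch
import torch.nn as nn

class DQN(nn.Module):
    # Convolutional Q-network: 84x84x4 input, one Q-value per action (18 for Atari).
    def __init__(self, n_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # conv1 -> 20x20x32
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # conv2 -> 9x9x64
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # conv3 -> 7x7x64
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),                  # fc4
            nn.Linear(512, n_actions),                              # fc5: Q(s, a1) ... Q(s, an)
        )

    def forward(self, x):
        # x: a batch of 4 stacked 84x84 grayscale frames, scaled to [0, 1]
        return self.net(x)

q_values = DQN()(torch.zeros(1, 4, 84, 84))   # shape: (1, 18)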
[Excerpt, Mnih et al., Nature 2015:] "... difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout — taking high-dimensional data (210×160 colour video at 60 Hz) as input — to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner — illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Supplementary Discussion for details).
We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available (refs 12, 15). In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games)."
[Figure: the same convolutional-network schematic as on the previous slides — the network maps the 84×84×4 input to a Q-value Q(s, a) for each valid action.]
47. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network
• Loss
- Measures how well the neural network is trained — the lower, the better.
- Current Q by prediction: Q(s, a) — forward evaluation of the neural network
- Target Q from the new reward: r + γ max_{a'} Q(s', a') — also a forward evaluation
- L = 1/2 (current - target)² = 1/2 ( Q(s, a) - ( r + γ max_{a'} Q(s', a') ) )²
- The weights of the neural network are updated to minimize the loss — back-propagation
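A sketch of this loss in Python/PyTorch, reusing the hypothetical DQN class sketched earlier; treating the target term as a constant (no gradient through it) is a common implementation choice:

import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    # batch: tensors of states, actions, rewards, next states, and done flags
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # current Q(s, a) — forward evaluation
    with torch.no_grad():                                   # target Q — forward evaluation, no gradient
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return 0.5 * F.mse_loss(q_sa, target)                   # 1/2 (current - target)^2, averaged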
48. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network
• Experience Replay
- Training efficiency: "It takes a long time, almost a week on a single GPU."
- Experience: <s, a, r, s'>
- The experience memory stores all the recent experiences. Actually, not all of them, but quite a few.
- Train on adjacent experiences?
- No! Random samples from the experience memory, to avoid getting stuck in a local minimum.
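A minimal replay-memory sketch in Python (a common implementation pattern, not specifically DeepMind's):

import random
from collections import deque

class ReplayMemory:
    # Stores recent <s, a, r, s', done> experiences and returns random minibatches.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old experiences are dropped automatically

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # random (not adjacent) samples break the correlation between consecutive frames
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)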
49. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network
• So far, we have mainly focused on the "credit assignment problem," especially in the context of Q-learning.
• Exploration-Exploitation Dilemma?
- At first, the Q-network selects actions more or less at random because of its random initialization — greedy exploration that finds a first (not the best) solution.
- However, it converges as training continues — exploitation of a local minimum.
• ε-greedy exploration
- "Maybe there is a better action — take a chance with probability ε."
- Choose between a random action and argmax_{a'} Q(s, a').
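A sketch of ε-greedy action selection in Python; the linear annealing of ε (DeepMind annealed it from 1.0 to 0.1 over the first million frames) is included for completeness:

import random
import torch

def epsilon_greedy(q_net, state, step, n_actions=18,
                   eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    # epsilon decays linearly from eps_start to eps_end, then stays at eps_end
    eps = max(eps_end, eps_start - (eps_start - eps_end) * step / anneal_steps)
    if random.random() < eps:
        return random.randrange(n_actions)               # explore: random action
    with torch.no_grad():
        return int(q_net(state.unsqueeze(0)).argmax())   # exploit: argmax_a Q(s, a)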
50. Joo-Haeng Lee 2017 joohaeng@gmail.com
[DQN with experience replay and a target network — algorithm from Mnih et al., Nature 2015:]
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                     if the episode terminates at step j+1
                  r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)     otherwise
        Perform a gradient descent step on (y_j - Q(φ_j, a_j; θ))² with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End For
End For
51. Joo-Haeng Lee 2017 joohaeng@gmail.com
Deep Q Network
• DQN Algorithm:
- Initialize replay memory D.
- Initialize the Q-network with random weights.
- Observe initial state s
- Repeat
• With probability ε select a random action a; otherwise a = argmax_{a'} Q(s, a')
• Input a to the environment E for a state transition
• Observe reward r and new state s', and store the transition <s, a, r, s'> in replay memory D
• Sample random transitions <s_d, a_d, r_d, s_d'> from replay memory D
• Calculate the target t for each minibatch transition
- If s_d' is a terminal state then t = r_d
- Otherwise t = r_d + γ max_{a'} Q(s_d', a')
• Train the Q-network with the loss L = (t - Q(s_d, a_d))² — updating the Q-network
- s = s'
- until terminated
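Putting the pieces together, a skeleton of this loop in Python (assuming the hypothetical DQN, ReplayMemory, dqn_loss, and epsilon_greedy sketches from the earlier slides; make_atari_env and to_tensors are placeholder helpers for frame preprocessing and minibatch packing):

import torch

env = make_atari_env("BreakoutNoFrameskip-v4")        # placeholder: yields stacked 84x84x4 frames
q_net = DQN(n_actions=env.action_space.n)
memory = ReplayMemory(capacity=100_000)
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

state, step = env.reset(), 0
while step < 10_000_000:
    action = epsilon_greedy(q_net, state, step, n_actions=env.action_space.n)
    next_state, reward, done, _ = env.step(action)
    memory.store(state, action, reward, next_state, done)        # store <s, a, r, s'> in D
    if len(memory) >= 32:
        batch = to_tensors(memory.sample(32))                    # placeholder: stack samples into tensors
        loss = dqn_loss(q_net, batch)                            # L = (t - Q(s_d, a_d))^2
        optimizer.zero_grad()
        loss.backward()                                          # update the Q-network
        optimizer.step()
    state = env.reset() if done else next_state
    step += 1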
60. Joo-Haeng Lee 2017 joohaeng@gmail.com
                  B. Rider   Breakout   Enduro   Pong    Q*bert   Seaquest   S. Invaders
Random            354        1.2        0        -20.4   157      110        179
Sarsa [3]         996        5.2        129      -19     614      665        271
Contingency [4]   1743       6          159      -17     960      723        268
DQN               4092       168        470      20      1952     1705       581
Human             7456       31         368      -3      18900    28010      3690
HNeat Best [8]    3616       52         106      19      1800     920        1720
HNeat Pixel [8]   1332       4          91       -16     1325     800        1145
DQN Best          5184       225        661      21      4500     1740       1075
Table 1: The upper table compares average total reward for various learning methods by running
an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of
the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that
always get the same score while DQN used an ε-greedy policy with ε = 0.05.
types of objects on the Atari screen. The HNeat Pixel score is obtained by using the special 8 color
channel representation of the Atari emulator that represents an object label map at each channel.
This method relies heavily on finding a deterministic sequence of states that represents a successful
exploit. It is unlikely that strategies learnt in this way will generalize to random perturbations;
therefore the algorithm was only evaluated on the highest scoring single episode. In contrast, our
algorithm is evaluated on ε-greedy control sequences, and must therefore generalize across a wide
variety of possible situations. Nevertheless, we show that on all the games, except Space Invaders,
not only our max evaluation results (row 8), but also our average results (row 4) achieve better
performance.
Finally, we show that our method achieves better performance than an expert human player on
Breakout, Enduro and Pong and it achieves close to human performance on Beam Rider. The games
Q*bert, Seaquest, Space Invaders, on which we are far from human performance, are more chal-
lenging because they require the network to find a strategy that extends over long time scales.
6 Conclusion
This paper introduced a new deep learning model for reinforcement learning, and demonstrated its
ability to master difficult control policies for Atari 2600 computer games, using only raw pixels
as input. We also presented a variant of online Q-learning that combines stochastic minibatch up-
dates with experience replay memory to ease the training of deep networks for RL. Our approach
gave state-of-the-art results in six of the seven games it was tested on, with no adjustment of the
architecture or hyperparameters.
References
[1] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning (ICML 1995), pages 30-37. Morgan Kaufmann, 1995.
[2] Marc Bellemare, Joel Veness, and Michael Bowling. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems 25, pages 2222-2230, 2012.
[3] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.
[4] Marc G Bellemare, Joel Veness, and Michael Bowling. Investigating contingency awareness using Atari 2600 games. In AAAI, 2012.
[5] Marc G. Bellemare, Joel Veness, and Michael Bowling. Bayesian learning of recursively factored environments. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML 2013), pages 1211-1219, 2013.
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih Koray Kavukcuoglu David Silver Alex Graves Ioannis Antonoglou
Daan Wierstra Martin Riedmiller
DeepMind Technologies
{vlad,koray,david,alex.graves,ioannis,daan,martin.riedmiller} @ deepmind.com
Abstract
We present the first deep learning model to successfully learn control policies di-
rectly from high-dimensional sensory input using reinforcement learning. The
model is a convolutional neural network, trained with a variant of Q-learning,
whose input is raw pixels and whose output is a value function estimating future
rewards. We apply our method to seven Atari 2600 games from the Arcade Learn-
ing Environment, with no adjustment of the architecture or learning algorithm. We
find that it outperforms all previous approaches on six of the games and surpasses
a human expert on three of them.
1 Introduction
Learning to control agents directly from high-dimensional sensory inputs like vision and speech is
one of the long-standing challenges of reinforcement learning (RL). Most successful RL applica-
tions that operate on these domains have relied on hand-crafted features combined with linear value
functions or policy representations. Clearly, the performance of such systems heavily relies on the
quality of the feature representation.
Recent advances in deep learning have made it possible to extract high-level features from raw sen-
sory data, leading to breakthroughs in computer vision [11, 22, 16] and speech recognition [6, 7].
These methods utilise a range of neural network architectures, including convolutional networks,
multilayer perceptrons, restricted Boltzmann machines and recurrent neural networks, and have ex-
ploited both supervised and unsupervised learning. It seems natural to ask whether similar tech-
niques could also be beneficial for RL with sensory data.
However reinforcement learning presents several challenges from a deep learning perspective.
Firstly, most successful deep learning applications to date have required large amounts of hand-
labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward
signal that is frequently sparse, noisy and delayed. The delay between actions and resulting rewards,
which can be thousands of timesteps long, seems particularly daunting when compared to the direct
association between inputs and targets found in supervised learning. Another issue is that most deep
learning algorithms assume the data samples to be independent, while in reinforcement learning one
typically encounters sequences of highly correlated states. Furthermore, in RL the data distribu-
tion changes as the algorithm learns new behaviours, which can be problematic for deep learning
methods that assume a fixed underlying distribution.
This paper demonstrates that a convolutional neural network can overcome these challenges to learn
successful control policies from raw video data in complex RL environments. The network is
trained with a variant of the Q-learning [26] algorithm, with stochastic gradient descent to update
the weights. To alleviate the problems of correlated data and non-stationary distributions, we use
Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders,
Seaquest, Beam Rider
an experience replay mechanism [13] which randomly samples previous transitions, and thereby
smooths the training distribution over many past behaviors.
We apply our approach to a range of Atari 2600 games implemented in The Arcade Learning Envi-
ronment (ALE) [3]. Atari 2600 is a challenging RL testbed that presents agents with a high dimensional visual input (210 × 160 RGB video at 60Hz) and a diverse and interesting set of tasks that
were designed to be difficult for human players. Our goal is to create a single neural network agent
that is able to successfully learn to play as many of the games as possible. The network was not pro-
vided with any game-specific information or hand-designed visual features, and was not privy to the
internal state of the emulator; it learned from nothing but the video input, the reward and terminal
signals, and the set of possible actions—just as a human player would. Furthermore the network ar-
chitecture and all hyperparameters used for training were kept constant across the games. So far the
network has outperformed all previous RL algorithms on six of the seven games we have attempted
and surpassed an expert human player on three of them. Figure 1 provides sample screenshots from
five of the games used for training.
2 Background
We consider tasks in which an agent interacts with an environment E, in this case the Atari emulator,
in a sequence of actions, observations and rewards. At each time-step the agent selects an action
at from the set of legal game actions, A = {1, . . . , K}. The action is passed to the emulator and
modifies its internal state and the game score. In general E may be stochastic. The emulator’s
internal state is not observed by the agent; instead it observes an image x_t ∈ R^d from the emulator,
which is a vector of raw pixel values representing the current screen. In addition it receives a reward
rt representing the change in game score. Note that in general the game score may depend on the
whole prior sequence of actions and observations; feedback about an action may only be received
after many thousands of time-steps have elapsed.
Since the agent only observes images of the current screen, the task is partially observed and many
emulator states are perceptually aliased, i.e. it is impossible to fully understand the current situation
from only the current screen x_t. We therefore consider sequences of actions and observations, s_t = x_1, a_1, x_2, ..., a_{t-1}, x_t, and learn game strategies that depend upon these sequences. All sequences
in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives
rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state.
As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the
complete sequence st as the state representation at time t.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximises
future rewards. We make the standard assumption that future rewards are discounted by a factor of γ per time-step, and define the future discounted return at time t as R_t = Σ_{t'=t}^{T} γ^{t'-t} r_{t'}, where T is the time-step at which the game terminates. We define the optimal action-value function Q*(s, a) as the maximum expected return achievable by following any strategy, after seeing some sequence s and then taking some action a: Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], where π is a policy mapping sequences to actions (or distributions over actions).
The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q*(s', a') of the sequence s' at the next time-step was known for all possible actions a', then the optimal strategy is to select the action a' maximising the expected value of r + γ Q*(s', a').
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. — NIPS 2013 Deep Learning Workshop
63. Joo-Haeng Lee 2017 joohaeng@gmail.com
67. Joo-Haeng Lee 2017 joohaeng@gmail.com
Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih1
VMNIH@GOOGLE.COM
Adrià Puigdomènech Badia1
ADRIAP@GOOGLE.COM
Mehdi Mirza1,2
MIRZAMOM@IRO.UMONTREAL.CA
Alex Graves1
GRAVESA@GOOGLE.COM
Tim Harley1
THARLEY@GOOGLE.COM
Timothy P. Lillicrap1
COUNTZERO@GOOGLE.COM
David Silver1
DAVIDSILVER@GOOGLE.COM
Koray Kavukcuoglu 1
KORAYK@GOOGLE.COM
1
Google DeepMind
2
Montreal Institute for Learning Algorithms (MILA), University of Montreal
Abstract
We propose a conceptually simple and
lightweight framework for deep reinforce-
ment learning that uses asynchronous gradient
descent for optimization of deep neural network
controllers. We present asynchronous variants of
four standard reinforcement learning algorithms
and show that parallel actor-learners have a
stabilizing effect on training allowing all four
methods to successfully train neural network
controllers. The best performing method, an
asynchronous variant of actor-critic, surpasses
the current state-of-the-art on the Atari domain
while training for half the time on a single
multi-core CPU instead of a GPU. Furthermore,
we show that asynchronous actor-critic succeeds
on a wide variety of continuous motor control
problems as well as on a new task of navigating
random 3D mazes using a visual input.
1. Introduction
Deep neural networks provide rich representations that can
enable reinforcement learning (RL) algorithms to perform
effectively. However, it was previously thought that the
combination of simple online RL algorithms with deep
neural networks was fundamentally unstable. Instead, a va-
riety of solutions have been proposed to stabilize the algo-
rithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Has-
selt et al., 2015; Schulman et al., 2015a). These approaches
share a common idea: the sequence of observed data en-
countered by an online RL agent is non-stationary, and on-
Proceedings of the 33rd
International Conference on Machine
Learning, New York, NY, USA, 2016. JMLR: WCP volume
48. Copyright 2016 by the author(s).
line RL updates are strongly correlated. By storing the
agent’s data in an experience replay memory, the data can
be batched (Riedmiller, 2005; Schulman et al., 2015a) or
randomly sampled (Mnih et al., 2013; 2015; Van Hasselt
et al., 2015) from different time-steps. Aggregating over
memory in this way reduces non-stationarity and decorre-
lates updates, but at the same time limits the methods to
off-policy reinforcement learning algorithms.
Deep RL algorithms based on experience replay have
achieved unprecedented success in challenging domains
such as Atari 2600. However, experience replay has several
drawbacks: it uses more memory and computation per real
interaction; and it requires off-policy learning algorithms
that can update from data generated by an older policy.
In this paper we provide a very different paradigm for deep
reinforcement learning. Instead of experience replay, we
asynchronously execute multiple agents in parallel, on mul-
tiple instances of the environment. This parallelism also
decorrelates the agents’ data into a more stationary process,
since at any given time-step the parallel agents will be ex-
periencing a variety of different states. This simple idea
enables a much larger spectrum of fundamental on-policy
RL algorithms, such as Sarsa, n-step methods, and actor-
critic methods, as well as off-policy RL algorithms such
as Q-learning, to be applied robustly and effectively using
deep neural networks.
Our parallel reinforcement learning paradigm also offers
practical benefits. Whereas previous approaches to deep re-
inforcement learning rely heavily on specialized hardware
such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015;
Schaul et al., 2015) or massively distributed architectures
(Nair et al., 2015), our experiments run on a single machine
with a standard multi-core CPU. When applied to a vari-
ety of Atari 2600 domains, on many games asynchronous
reinforcement learning achieves better results, in far less
One way of propagating rewards faster is by using n-step returns (Watkins, 1989; Peng & Williams, 1996). In n-step Q-learning, Q(s, a) is updated toward the n-step return defined as r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n max_a Q(s_{t+n}, a). This results in a single reward r directly affecting the values of n preceding state-action pairs. This makes the process of propagating rewards to relevant state-action pairs potentially much more efficient.
In contrast to value-based methods, policy-based model-free methods directly parameterize the policy π(a|s; θ) and update the parameters θ by performing, typically approximate, gradient ascent on E[R_t]. One example of such a method is the REINFORCE family of algorithms due to Williams (1992). Standard REINFORCE updates the policy parameters θ in the direction ∇_θ log π(a_t|s_t; θ) R_t, which is an unbiased estimate of ∇_θ E[R_t]. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state b_t(s_t), known as a baseline (Williams, 1992), from the return. The resulting gradient is ∇_θ log π(a_t|s_t; θ) (R_t - b_t(s_t)).
A learned estimate of the value function is commonly used as the baseline b_t(s_t) ≈ V^π(s_t), leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity R_t - b_t used to scale the policy gradient can be seen as an estimate of the advantage of action a_t in state s_t, or A(a_t, s_t) = Q(a_t, s_t) - V(s_t), because R_t is an estimate of Q^π(a_t, s_t) and b_t is an estimate of V^π(s_t). This approach can be viewed as an actor-critic architecture where the policy π is the actor and the baseline b_t is the critic (Sutton & Barto, 1998; Degris et al., 2012).
4. Asynchronous RL Framework
We now present multi-threaded asynchronous variants of
one-step Sarsa, one-step Q-learning, n-step Q-learning, and
advantage actor-critic. The aim in designing these methods
was to find RL algorithms that can train deep neural net-
work policies reliably and without large resource require-
ments. While the underlying RL methods are quite dif-
ferent, with actor-critic being an on-policy policy search
method and Q-learning being an off-policy value-based
method, we use two main ideas to make all four algorithms
practical given our design goal.
First, we use asynchronous actor-learners, similarly to the
Gorila framework (Nair et al., 2015), but instead of using
separate machines and a parameter server, we use multi-
ple CPU threads on a single machine. Keeping the learn-
ers on a single machine removes the communication costs
of sending gradients and parameters and enables us to use
Hogwild! (Recht et al., 2011) style updates for training.
Algorithm 1 — Asynchronous one-step Q-learning (pseudocode for each actor-learner thread):
    // Assume global shared θ, θ⁻, and counter T = 0.
    Initialize thread step counter t ← 0
    Initialize target network weights θ⁻ ← θ
    Initialize network gradients dθ ← 0
    Get initial state s
    repeat
        Take action a with ε-greedy policy based on Q(s, a; θ)
        Receive new state s' and reward r
        y = r                                    for terminal s'
            r + γ max_{a'} Q(s', a'; θ⁻)         for non-terminal s'
        Accumulate gradients wrt θ: dθ ← dθ + ∂(y - Q(s, a; θ))²/∂θ
        s = s'
        T ← T + 1 and t ← t + 1
        if T mod I_target == 0 then
            Update the target network θ⁻ ← θ
        end if
        if t mod I_AsyncUpdate == 0 or s is terminal then
            Perform asynchronous update of θ using dθ
            Clear gradients dθ ← 0
        end if
    until T > T_max

Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates.
Hence, we do not use a replay memory and rely on parallel
actors employing different exploration policies to perform
the stabilizing role undertaken by experience replay in the
DQN training algorithm.
In addition to stabilizing learning, using multiple parallel
actor-learners has multiple practical benefits. First, we ob-
tain a reduction in training time that is roughly linear in
the number of parallel actor-learners. Second, since we no
longer rely on experience replay for stabilizing learning we
are able to use on-policy reinforcement learning methods
such as Sarsa and actor-critic to train neural networks in a
stable way. We now describe our variants of one-step Q-
learning, one-step Sarsa, n-step Q-learning and advantage
actor-critic.
Asynchronous one-step Q-learning: Pseudocode for our
variant of Q-learning, which we call Asynchronous one-
step Q-learning, is shown in Algorithm 1. Each thread in-
teracts with its own copy of the environment and at each
step computes a gradient of the Q-learning loss. We use
a shared and slowly changing target network in comput-
ing the Q-learning loss, as was proposed in the DQN train-
ing method. We also accumulate gradients over multiple
timesteps before they are applied, which is similar to using minibatches.
Figure 1. Learning speed comparison for DQN and the new asynchronous algorithms on five Atari 2600 games. DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores. The plots are averaged over 5 runs. In the case of DQN the runs were for different seeds with fixed hyperparameters. For asynchronous methods we average over the best 5 models from 50 experiments with learning rates sampled from LogUniform(10^-4, 10^-2) and all other hyperparameters fixed.
two additional domains to evaluate only the A3C algorithm
– Mujoco and Labyrinth. MuJoCo (Todorov, 2015) is a
physics simulator for evaluating agents on continuous mo-
tor control tasks with contact dynamics. Labyrinth is a new
3D environment where the agent must learn to find rewards
in randomly generated mazes from a visual input. The pre-
cise details of our experimental setup can be found in Sup-
plementary Section 8.
5.1. Atari 2600 Games
We first present results on a subset of Atari 2600 games to
demonstrate the training speed of the new methods. Fig-
ure 1 compares the learning speed of the DQN algorithm
trained on an Nvidia K40 GPU with the asynchronous
methods trained using 16 CPU cores on five Atari 2600
games. The results show that all four asynchronous meth-
ods we presented can successfully train neural network
controllers on the Atari domain. The asynchronous meth-
ods tend to learn faster than DQN, with significantly faster
learning on some games, while training on only 16 CPU
cores. Additionally, the results suggest that n-step methods
learn faster than one-step methods on some games. Over-
all, the policy-based advantage actor-critic method signifi-
cantly outperforms all three value-based methods.
We then evaluated asynchronous advantage actor-critic on
57 Atari games. In order to compare with the state of the
art in Atari game playing, we largely followed the train-
ing and evaluation protocol of (Van Hasselt et al., 2015).
Specifically, we tuned hyperparameters (learning rate and
amount of gradient norm clipping) using a search on six
Atari games (Beamrider, Breakout, Pong, Q*bert, Seaquest
and Space Invaders) and then fixed all hyperparameters for
all 57 games. We trained both a feedforward agent with the
same architecture as (Mnih et al., 2015; Nair et al., 2015;
Van Hasselt et al., 2015) as well as a recurrent agent with an
additional 256 LSTM cells after the final hidden layer. We
additionally used the final network weights for evaluation
to make the results more comparable to the original results
Method              Training Time            Mean     Median
DQN                 8 days on GPU            121.9%   47.5%
Gorila              4 days, 100 machines     215.2%   71.3%
D-DQN               8 days on GPU            332.9%   110.9%
Dueling D-DQN       8 days on GPU            343.8%   117.1%
Prioritized DQN     8 days on GPU            463.6%   127.6%
A3C, FF             1 day on CPU             344.1%   68.2%
A3C, FF             4 days on CPU            496.8%   116.6%
A3C, LSTM           4 days on CPU            623.0%   112.6%
Table 1. Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric. Supplementary Table S3 shows the raw scores for all games.
from (Bellemare et al., 2012). We trained our agents for
four days using 16 CPU cores, while the other agents were
trained for 8 to 10 days on Nvidia K40 GPUs. Table 1
shows the average and median human-normalized scores
obtained by our agents trained by asynchronous advantage
actor-critic (A3C) as well as the current state-of-the art.
Supplementary Table S3 shows the scores on all games.
A3C significantly improves on the state-of-the-art average score over 57 games in half the training time of the other methods, while using only 16 CPU cores and no GPU. Furthermore, after just one day of training, A3C matches the
average human normalized score of Dueling Double DQN
and almost reaches the median human normalized score of
Gorila. We note that many of the improvements that are
presented in Double DQN (Van Hasselt et al., 2015) and
Dueling Double DQN (Wang et al., 2015) can be incorpo-
rated to 1-step Q and n-step Q methods presented in this
work with similar potential improvements.
5.2. TORCS Car Racing Simulator
We also compared the four asynchronous methods on
the TORCS 3D car racing game (Wymann et al., 2013).
TORCS not only has more realistic graphics than Atari
2600 games, but also requires the agent to learn the dy-
namics of the car it is controlling. At each step, an agent
received only a visual input in the form of an RGB image
68. Joo-Haeng Lee 2017 joohaeng@gmail.com
Schema Networks: Zero-shot Transfer with a Generative Causal Model of
Intuitive Physics
Ken Kansky, Tom Silver, David A. Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou,
Nimrod Dorfman, Szymon Sidor, Scott Phoenix, Dileep George
Abstract
The recent adaptation of deep neural network-
based methods to reinforcement learning and
planning domains has yielded remarkable
progress on individual tasks. Nonetheless,
progress on task-to-task transfer remains limited.
In pursuit of efficient and robust generalization,
we introduce the Schema Network, an object-
oriented generative physics simulator capable
of disentangling multiple causes of events and
reasoning backward through causes to achieve
goals. The richly structured architecture of the
Schema Network can learn the dynamics of an
environment directly from data. We compare
Schema Networks with Asynchronous Advan-
tage Actor-Critic and Progressive Networks on a
suite of Breakout variations, reporting results on
training efficiency and zero-shot generalization,
consistently demonstrating faster, more robust
learning and better transfer. We argue that
generalizing from limited data and learning
causal relationships are essential abilities on the
path toward generally intelligent systems.
1. Introduction
A longstanding ambition of research in artificial intelli-
gence is to efficiently generalize experience in one scenario
to other similar scenarios. Such generalization is essential
for an embodied agent working to accomplish a variety of
goals in a changing world. Despite remarkable progress on
individual tasks like Atari 2600 games (Mnih et al., 2015;
Van Hasselt et al., 2016; Mnih et al., 2016) and Go (Silver
et al., 2016a), the ability of state-of-the-art models to trans-
fer learning from one environment to the next remains lim-
All authors affiliated with Vicarious AI, California, USA. Cor-
respondence to: Ken Kansky ken@vicarious.com, Tom Silver
tom@vicarious.com.
Copyright 2017 by the author(s).
Figure 1. Variations of Breakout. From top left: standard version,
middle wall, half negative bricks, offset paddle, random target,
and juggling. After training on the standard version, Schema Net-
works are able to generalize to the other variations without any
additional training.
ited. For instance, consider the variations of Breakout illus-
trated in Fig. 1. In these environments the positions of ob-
jects are perturbed, but the object movements and sources
of reward remain the same. While humans have no trouble
generalizing experience from the basic Breakout to its vari-
ations, deep neural network-based models are easily fooled
(Taylor & Stone, 2009; Rusu et al., 2016).
The model-free approach of deep reinforcement learning
(Deep RL) such as the Deep-Q Network and its descen-
dants is inherently hindered by the same feature that makes
it desirable for single-scenario tasks: it makes no assump-
tions about the structure of the domain. Recent work has
suggested how to overcome this deficiency by utilizing
object-based representations (Diuk et al., 2008; Usunier
et al., 2016). Such a representation is motivated by the
still be unable to generalize from biased training data with-
out continuing to learn on the test environment. In contrast,
Schema Networks exhibit zero-shot transfer.
Schema Networks are implemented as probabilistic graph-
ical models (PGMs), which provide practical inference and
structure learning techniques. Additionally, inference with
uncertainty and explaining away are naturally supported by
PGMs. We direct the readers to (Koller & Friedman, 2009)
and (Jordan, 1998) for a thorough overview of PGMs. In
particular, early work on factored MDPs has demonstrated
how PGMs can be applied in RL and planning settings
(Guestrin et al., 2003b).
3. Schema Networks
3.1. MDPs and Notation
The traditional formalism for the Reinforcement Learning problem is the Markov Decision Process (MDP). An MDP M is a five-tuple (S, A, T, R, γ), where S is a set of states, A is a set of actions, T(s^(t+1) | s^(t), a^(t)) is the probability of transitioning from state s^(t) ∈ S to s^(t+1) ∈ S after action a^(t) ∈ A, R(r^(t+1) | s^(t), a^(t)) is the probability of receiving reward r^(t+1) ∈ R after executing action a^(t) while in state s^(t), and γ ∈ [0, 1] is the rate at which future rewards are exponentially discounted.
3.2. Model Definition
A Schema Network is a structured generative model of an
MDP. We first describe the architecture of the model infor-
mally. An image input is parsed into a list of entities, which
may be thought of as instances of objects in the sense of
OO-MDPs (Diuk et al., 2008). All entities share the same
collection of attributes. We refer to a specific attribute of
a specific entity as an entity-attribute, which is represented
as a binary variable to indicate the presence of that attribute
for an entity. An entity state is an assignment of states to
all attributes of the entity, and the complete model state is
the set of all entity states.
A grounded schema is a binary variable associated with
a particular entity-attribute in the next timestep, whose
value depends on the present values of a set of binary
entity-attributes. The event that one of these present entity-
attributes assumes the value 1 is called a precondition of the
grounded schema. When all preconditions of a grounded
schema are satisfied, we say that the schema is active, and
it predicts the activation of its associated entity-attribute.
Grounded schemas may also predict rewards and may be
conditioned on actions, both of which are represented as
binary variables. For instance, a grounded schema might
define a distribution over Entity 1’s “position” attribute at
time 5, conditioned on Entity 2’s “position” attribute at
time 4 and the action “UP” at time 4. Grounded schemas
Figure 2. Architecture of a Schema Network. An ungrounded
schema is a template for a factor that predicts either the value
of an entity-attribute (A) or a future reward (B) based on entity
states and actions taken in the present. Self-transitions (C) predict
that entity-attributes remain in the same state when no schema is
active to predict a change. Self-transitions allow continuous or
categorical variables to be represented by a set of binary variables
(depicted as smaller nodes). The grounded schema factors, instan-
tiated from ungrounded schemas at all positions, times, and entity
bindings, are combined with self-transitions to create a Schema
Network (D).
are instantiated from ungrounded schemas, which behave
like templates for grounded schemas to be instantiated at
different times and in different combinations of entities.
For example, an ungrounded schema could predict the “po-
sition” attribute of Entity x at time t + 1 conditioned on
the “position” of Entity y at time t and the action “UP”
at time t; this ungrounded schema could be instantiated at
time t = 4 with x = 1 and y = 2 to create the grounded
schema described above. In the case of attributes like “po-
sition” that are inherently continuous or categorical, several
binary variables may be used to discretely approximate the
distribution (see the smaller nodes in Figure 2). A Schema
Network is a factor graph that contains all grounded instan-
tiations of a set of ungrounded schemas over some window
of time, illustrated in Figure 2.
We now formalize the Schema Network factor graph. For
simplicity, suppose the number of entities and the num-
ber of attributes are fixed at N and M respectively. Let
E_i refer to the i-th entity and let α^(t)_{i,j} refer to the j-th attribute value of the i-th entity at time t. We use the notation E^(t)_i = (α^(t)_{i,1}, ..., α^(t)_{i,M}) to refer to the state of the i-th entity.
(a) Mini Breakout Learning Rate (b) Middle Wall Learning Rate
Figure 3. Comparison of learning rates. (a) Schema Networks and A3C were trained for 100k frames in Mini Breakout. Plot shows the
average of 5 training attempts for Schema Networks and the best of 5 training attempts for A3C, which did not converge as reliably. (b)
PNs and Schema Networks were pretrained on 100K frames of Standard Breakout, and then training continued on 45K additional frames
of the Middle Wall variation. We show performance as a function of training frames for both models. Note that Schema Networks are
ignoring all the additional training data, since all the required schemas were learned during pretraining. For Schema Networks, zero-shot
transfer learning is happening.
the input to Schema Networks did not treat any object dif-
ferently. Schema Networks were provided separate entities
for each part (pixel) of each object, and each entity con-
tained 53 attributes corresponding to the available part la-
bels (21 for bricks, 30 for the paddle, 1 for walls, and 1 for
the ball). Only one of these part attributes was active per
entity. Schema Networks had to learn that some attributes,
like parts of bricks, were irrelevant for prediction.
5.1. Transfer Learning
This experiment examines how effectively Schema Net-
works and PNs are able to learn a new Breakout variation
after pretraining, which examines how well the two mod-
els can transfer existing knowledge to a new task. Fig. 3a
shows the learning rates during 100k frames of training on
Mini Breakout. In a second experiment, we pretrained on
Large Breakout for 100k frames and continued training on
the Middle Wall variation, shown in Fig. 1b. Fig. 3b shows
that PNs require significant time to learn in this new en-
vironment, while Schema Networks do not learn anything
new because the dynamics are the same.
5.2. Zero-Shot Generalization
Many Breakout variations can be constructed that all in-
volve the same dynamics. If a model correctly learns the
dynamics from one variation, in theory the others could
be played perfectly by planning using the learned model.
Rather than comparing transfer with additional training us-
ing PNs, in these variations we can compare zero-shot gen-
eralization by training A3C only on Standard Breakout.
Fig. 1b-e shows some of these variations with the following
modifications from the training environment:
• Offset Paddle (Fig. 1d): The paddle is shifted upward
by a few pixels.
• Middle Wall (Fig. 1b): A wall is placed in the middle
of the screen, requiring the agent to aim around it to
hit the bricks.
• Random Target (Fig. 1e): A group of bricks is destroyed when the ball hits any of them and then reappears in a new random position, requiring the agent to deliberately aim at the group.
• Juggling (Fig. 1f, enlarged from actual environment
to see the balls): Without any bricks, three balls are
launched in such a way that a perfect policy could jug-
gle them without dropping any.
Table 1 shows the average scores per episode in each
Breakout variation. These results show that A3C has failed
to recognize the common dynamics and adapt its policy ac-
cordingly. This comes as no surprise, as the policy it has
learned for Standard Breakout is no longer applicable in
these variations. Simply adding an offset to the paddle is