Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu
Google DeepMind
Montreal Institute for Learning Algorithms (MILA), University of Montreal
Journal reference: ICML 2016
Cite as: arXiv:1602.01783 [cs.LG]
(or arXiv:1602.01783v2 [cs.LG] for this version)
Institute of Information Systems and Applications (資應所), 105065702, 李思叡
1
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
2
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
3
Introduction
Online RL algorithms combined with deep neural networks are fundamentally unstable:
- the sequence of observations encountered by an online RL agent is non-stationary
- online RL updates are strongly correlated
- solution idea → experience replay
(experience replay has achieved unprecedented success)
Drawbacks of experience replay:
- uses more memory and computation per real interaction
- requires an off-policy learning algorithm that can update from data generated by
an older policy
4
Introduction
In this paper:
Asynchronously execute multiple agents in parallel, each on its own instance of the environment
(instead of using experience replay).
Benefits:
- decorrelates the agents’ data into a more stationary process
- applies to both on-policy and off-policy RL algorithms
- runs on a multi-core CPU in a single machine, instead of a GPU or massively distributed machines
- takes far less training time than GPU-based algorithms
- uses far fewer resources than the massively distributed approach
5
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
6
Related Work
The General Reinforcement Learning Architecture (Gorila) performs asynchronous training of
reinforcement learning agents in a distributed setting (Nair et al., 2015).
MapReduce framework with linear function approximation (Li & Schuurmans, 2011).
A parallel version of Sarsa uses multiple separate actor-learners (Grounds & Kudenko, 2008).
Convergence properties of Q-learning in the asynchronous optimization setting. (Tsitsiklis, 1994)
The related problem of distributed dynamic programming. (Bertsekas, 1982)
7
Related Work: Gorila Architecture
Performs asynchronous training of reinforcement learning agents in a distributed setting.
Actor
- acts in its own copy of the environment
- has a separate replay memory
Learner
- samples data from the replay memory
- computes gradients of the DQN loss with respect to the policy parameters
8
Related Work: Gorila Architecture
Gradients are asynchronously sent to a central parameter server, which updates a central
copy of the model.
The updated policy parameters are sent to the actor-learners at fixed intervals.
Setting:
- 100 separate actor-learners
- 30 parameter server instances
- 130 machines in total
Performance:
- outperformed DQN on 49 Atari games
- about 20 times faster than DQN
9
Related Work
MapReduce framework for parallelizing batch reinforcement learning with linear function
approximation
- parallelism was used to speed up large matrix operations,
- not to parallelize the collection of experience or to stabilize learning
Parallel version of the Sarsa algorithm
- multiple separate actor-learners to accelerate training
- each learner learns separately and periodically sends updates for weights that have changed
significantly to the other learners, using peer-to-peer (P2P) communication
10
Related Work
Q-learning is still guaranteed to converge when some of the information is outdated, as
long as outdated information is always eventually discarded and several other
assumptions are satisfied.
Evolutionary methods
- often straightforward to parallelize by distributing fitness evaluations over multiple
machines or threads
11
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
12
Reinforcement Learning Background
one-step Q-learning (TD(0))
one-step Sarsa
n-step Q-learning (one-step targets → n-step return targets)
actor-critic
13
RL Background: Q-learning and Sarsa
Q-learning: regress Q(s, a; θ) toward the one-step target y = r + γ max_{a′} Q(s′, a′; θ⁻), i.e. minimize the loss L(θ) = E[(y − Q(s, a; θ))²]
Sarsa: same form, but the target uses the action a′ actually taken in s′: y = r + γ Q(s′, a′; θ⁻)
(the n-step variant's target is sketched after this slide)
14
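For reference, the n-step target mentioned on the previous slide (standard form, with θ⁻ the target-network parameters):
y_t = r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_{a′} Q(s_{t+n}, a′; θ⁻)
With n = 1 this reduces to the one-step Q-learning target above; larger n propagates rewards faster at the cost of higher variance.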
RL Background: Actor-Critic
Actor-critic methods follow an approximate policy gradient:
∇_θ log π(a_t | s_t; θ) A(s_t, a_t)
with advantage function A(s_t, a_t) = Q(s_t, a_t) − V(s_t)
15
(estimated here with the TD(0) error r_t + γ V(s_{t+1}) − V(s_t), for example)
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
16
Asynchronous RL Framework
Each worker thread repeatedly executes the following loop (a minimal toy sketch follows below):
- copy the global network parameters
- interact with its own copy of the environment
- accumulate gradients
- update the global network with the accumulated gradients
- copy the global network parameters again, and so on
Framework of A3C
source: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
17
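A minimal, self-contained toy sketch of this per-worker loop (not the paper's code): a few threads train a shared linear model on synthetic data, each repeatedly copying the shared parameters, accumulating a gradient over a few steps, and applying it to the shared copy. The "environment", constants, and names (GLOBAL_THETA, worker, …) are all illustrative assumptions.

```python
import threading
import numpy as np

GLOBAL_THETA = np.zeros(4)                     # shared ("global network") parameters
TRUE_THETA = np.array([1.0, -2.0, 0.5, 3.0])   # synthetic regression target (made up)
LR = 0.01
T_MAX = 5                                      # accumulate gradients over 5 steps

def worker(seed, n_updates=2000):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        theta = GLOBAL_THETA.copy()            # 1. copy the global parameters
        grad = np.zeros_like(theta)
        for _ in range(T_MAX):                 # 2. "interact": draw synthetic data
            x = rng.normal(size=4)
            y = TRUE_THETA @ x
            grad += (theta @ x - y) * x        # 3. accumulate gradient of 0.5 * err^2
        GLOBAL_THETA[:] -= LR * grad / T_MAX   # 4. lock-free update of the global copy

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("learned:", np.round(GLOBAL_THETA, 2), "target:", TRUE_THETA)
```

In A3C the copied parameters define the local policy/value network and the accumulated gradients come from actual environment interaction, but the control flow is the same.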
Asynchronous RL Framework
source: Naruto comic (315th and 617th episodes)
18
Asynchronous RL Framework
source: https://github.com/coreylynch/async-rl
19
Two main ideas in practice
Similar to the Gorila framework, but:
- separate machines → multiple CPU threads on a single machine
- removes the communication cost of sending gradients and parameters
- uses Hogwild! (Recht et al., 2011) style lock-free updates for training (see the sketch after this slide)
Multiple actor-learners running in parallel
- explore different parts of the environment
- maximize diversity
- make the data less correlated in time (decorrelation)
- no replay memory is needed
(→ on-policy RL can be used to train neural networks in a stable way)
20
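A tiny illustration of the Hogwild!-style updates referenced above (a toy sketch, not the paper's implementation): several threads apply gradient-like writes to a shared parameter vector with no locking, and the result stays close to the fully synchronized one, which is what this update style relies on. All names and constants are illustrative.

```python
import threading
import numpy as np

params = np.zeros(8)            # shared parameter vector
STEPS_PER_THREAD = 50_000

def hogwild_worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS_PER_THREAD):
        i = rng.integers(len(params))   # sparse update: touch one coordinate
        params[i] += 1e-4               # unsynchronized in-place write

threads = [threading.Thread(target=hogwild_worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# With perfect synchronization the total would be 4 * 50_000 * 1e-4 = 20.0;
# the lock-free result is close to that despite occasional lost updates.
print("sum of params:", params.sum())
```

In the asynchronous framework, threads apply their accumulated gradients to the shared model in the same lock-free fashion.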
Asynchronous RL Framework
one-step Q-learning → Asynchronous one-step Q-learning
one-step Sarsa → Asynchronous one-step Sarsa
n-step Q-learning → Asynchronous n-step Q-learning
actor-critic → Asynchronous advantage actor-critic (A3C)
21
Algo: DQN vs. Asynchronous one-step Q-learning
22
Deep Q-Network (DQN) | Asynchronous one-step Q-learning (the two algorithms shown side by side; a toy sketch of the asynchronous one follows below)
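A rough, tabular sketch of the control flow on the asynchronous side (the real algorithm uses a neural network Q(s, a; θ) and accumulated gradients; the chain MDP and all constants below are made up, and the target table plays the role of θ⁻):

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 6, 2            # tiny chain MDP: action 1 moves right, 0 moves left
GAMMA, LR, EPS = 0.9, 0.1, 0.2        # toy constants, not the paper's hyperparameters
I_ASYNC_UPDATE, I_TARGET, T_MAX_GLOBAL = 5, 500, 40_000

Q = np.zeros((N_STATES, N_ACTIONS))   # shared parameters (stand-in for θ)
Q_target = Q.copy()                   # target parameters (stand-in for θ⁻)
T = [0]                               # shared global step counter

def env_step(s, a):
    """Chain MDP: reward 1 only on reaching the right end, which is terminal."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def worker(seed):
    rng = np.random.default_rng(seed)
    s, t, acc = 0, 0, np.zeros_like(Q)        # acc plays the role of accumulated gradients
    while T[0] < T_MAX_GLOBAL:
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(np.argmax(Q[s]))
        s2, r, done = env_step(s, a)
        y = r if done else r + GAMMA * Q_target[s2].max()   # one-step target
        acc[s, a] += y - Q[s, a]              # accumulate the TD "gradient"
        s = rng.integers(N_STATES - 1) if done else s2      # random restart per episode
        t += 1
        T[0] += 1
        if T[0] % I_TARGET == 0:
            Q_target[:] = Q                   # periodically sync θ⁻ ← θ
        if t % I_ASYNC_UPDATE == 0 or done:
            Q[:] += LR * acc                  # asynchronous update of the shared table
            acc[:] = 0

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("Q[0] =", np.round(Q[0], 2))            # moving right should have the higher value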
Algo: Asynchronous Advantage Actor-Critic
23
can add entropy regularization (H is the entropy of the policy); the policy-gradient term then becomes:
∇_θ′ log π(a_t | s_t; θ′) (R_t − V(s_t; θ_v)) + β ∇_θ′ H(π(s_t; θ′))
(a small numeric example of the per-step loss terms follows below)
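The per-step loss terms implied by this objective, as an illustrative numpy sketch (not the paper's code; the 0.5 factor on the value loss and the example numbers are assumptions):

```python
import numpy as np

def a3c_step_losses(logits, action, R, value, beta=0.01):
    """logits: unnormalized action scores; R: n-step return; value: V(s)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax policy π(·|s)
    log_probs = np.log(probs)
    advantage = R - value                     # R_t − V(s_t; θ_v)
    entropy = -(probs * log_probs).sum()      # H(π(s_t; θ′))
    policy_loss = -log_probs[action] * advantage - beta * entropy
    value_loss = 0.5 * advantage ** 2
    return policy_loss, value_loss, entropy

# Example usage with arbitrary numbers:
print(a3c_step_losses(np.array([1.0, 0.5, -0.2]), action=0, R=1.2, value=0.9))
```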
Optimization
1. SGD with momentum
2. RMSProp without shared statistics
3. RMSProp with shared statistics (most robust)
→ RMSProp where the statistics g are shared across threads (see the update-rule sketch below)
24
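The RMSProp update the paper builds on is g ← αg + (1 − α)Δθ², θ ← θ − ηΔθ/√(g + ε). Below is an illustrative numpy version (variable names and the example learning rate are assumptions); "shared" simply means every thread passes the same g array:

```python
import numpy as np

def rmsprop_update(theta, g, dtheta, lr=1e-3, alpha=0.99, eps=1e-10):
    """Apply one RMSProp step in place; g is the (possibly shared) statistics array."""
    g[:] = alpha * g + (1.0 - alpha) * dtheta ** 2
    theta[:] = theta - lr * dtheta / np.sqrt(g + eps)

# Example: in the shared variant, every worker thread would pass the same g array.
theta, g = np.zeros(3), np.zeros(3)
rmsprop_update(theta, g, dtheta=np.array([0.1, -0.2, 0.05]))
print(theta)
```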
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
25
Experiments
Atari 2600 Games (experiments on 57 games)
TORCS Car Racing Simulator (3D game)
MuJoCo Physics Simulator (Continuous Action Control)
Labyrinth (3D maze game)
26
Experimental setup (A3C on Atari and TORCS)
- number of threads: 16 (on a single machine and no GPUs)
- updates every 5 actions (t_max = 5 and I_AsyncUpdate = 5)
- optimization: shared RMSProp
- network architecture: 2 Conv layers and 1 FC layer (followed by ReLU)
- input preprocessing and network architecture as (Mnih et al., 2015; 2013)
- discount factor 𝛾 = 0.99, RMSProp decay factor 𝛼 = 0.99, entropy weight 𝛽 = 0.01
- learning rate: sampled from a LogUniform(10⁻⁴, 10⁻²) distribution (a sampling snippet follows below)
(for more details, see paper sections 8 & 9)
27
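A one-line sketch of that LogUniform learning-rate sampling (numpy, illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform in log-space between 1e-4 and 1e-2, then exponentiate back.
learning_rates = 10 ** rng.uniform(np.log10(1e-4), np.log10(1e-2), size=50)
print(learning_rates[:5])
```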
Learning speed comparison (5 Atari games)
DQN: trained on an Nvidia K40 GPU
Asynchronous methods: trained on 16 CPU cores
28
Score results on 57 Atari games
All hyperparameters are fixed across the 57 games
A3C, LSTM: adds 256 LSTM cells after the final hidden layer (for an additional comparison)
Mean, median: human-normalized scores over the 57 Atari games (see the formula below)
29
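Human-normalized score is presumably the standard definition from the DQN literature (stated here as an assumption, for reference):
score_normalized = 100 × (score_agent − score_random) / (score_human − score_random)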
A3C on other environments
TORCS 3D car racing game: https://youtu.be/0xo1Ldx3L5Q
MuJoCo physics simulator: https://youtu.be/Ajjc08-iPx8
Labyrinth: https://youtu.be/nMR5mjCFZCw
30
Scalability and Data Efficiency
- superlinear speedups
(especially on one-step methods)
31
speedups are averaged over 7 Atari games
Robustness and Stability
50 different learning rates and random initializations
Result:
- robust to the choice of learning rate and random initialization
- stable: training does not collapse or diverge once it is learning
(the same conclusion holds for all four asynchronous methods)
32
Comparison of three optimization methods
50 experiments each on n-step Q-learning and A3C,
with 50 randomly sampled learning rates and random initializations
1. Momentum SGD
2. RMSProp
3. Shared RMSProp
33
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
34
Conclusions and Discussion
In this framework:
- stable training of neural networks is possible in many settings
(value-based/policy-based, on-policy/off-policy, discrete/continuous)
- training time is greatly reduced
- results could potentially be improved by using other ways of estimating the advantage
function
- a number of complementary improvements to the neural network architecture are
possible
35
Thanks for listening :)
36
Q & A
37
