Asynchronous Methods for Deep Reinforcement Learning
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu
Google DeepMind
Montreal Institute for Learning Algorithms (MILA), University of Montreal
Journal reference: ICML 2016
Cite as: arXiv:1602.01783 [cs.LG]
(or arXiv:1602.01783v2 [cs.LG] for this version)
Institute of Information Systems and Applications (資應所), 105065702, 李思叡
1
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
2
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
3
Introduction
Online RL algorithms combined with deep neural networks are fundamentally unstable:
- the sequence of observations encountered by an online RL agent is non-stationary
- online RL updates are strongly correlated
- solution idea → experience replay
(experience replay has achieved unprecedented success)
Drawbacks of experience replay:
- uses more memory and computation per real interaction
- requires an off-policy learning algorithm that can update from data generated by
an older policy
4
Introduction
In this paper:
Asynchronously execute multiple agents in parallel, each on its own instance of the environment
(instead of using experience replay).
Benefits:
- decorrelates the agents’ data into a more stationary process
- applies to both on-policy and off-policy RL algorithms
- runs on a multi-core CPU in a single machine, instead of a GPU or massively distributed machines
- takes far less training time than GPU-based algorithms
- uses far fewer resources than the massively distributed approach
5
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
6
Related Work
The General Reinforcement Learning Architecture (Gorila) performs asynchronous training of
reinforcement learning agents in a distributed setting (Nair et al., 2015).
MapReduce framework with linear function approximation (Li & Schuurmans, 2011).
A parallel version of Sarsa uses multiple separate actor-learners (Grounds & Kudenko, 2008).
Convergence properties of Q-learning in the asynchronous optimization setting. (Tsitsiklis, 1994)
The related problem of distributed dynamic programming. (Bertsekas, 1982)
7
Related Work: Gorila Architecture
Performs asynchronous training of reinforcement learning agents in a distributed setting.
Actor
- acts in its own copy of the environment
- has a separate replay memory
Learner
- samples data from the replay memory
- computes gradients of the DQN loss with respect to the policy parameters
8
Related Work: Gorila Architecture
Gradients are asynchronously sent to a central parameter server, which updates a central
copy of the model.
The updated policy parameters are sent to the actor-learners at fixed intervals.
Setting:
- 100 separate actor-learners
- 30 parameter server instances
- 130 machines in total
Performance:
- outperformed DQN on 49 Atari games
- about 20 times faster than DQN
9
Related Work
MapReduce framework for parallelizing batch reinforcement learning with linear function
approximation
- parallelism was used to speed up large matrix operations,
- not to parallelize the collection of experience or to stabilize learning
Parallel version of the Sarsa algorithm
- multiple separate actor-learners to accelerate training
- each learner learns separately and periodically sends updates for weights that have changed
significantly to the other learners, using peer-to-peer (P2P) communication
10
Related Work
Q-learning is still guaranteed to converge when some of the information is outdated, as
long as outdated information is always eventually discarded and several other
assumptions are satisfied.
Evolutionary methods
- often straightforward to parallelize by distributing fitness evaluations over multiple
machines or threads
11
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
12
Reinforcement Learning Background
one-step Q-learning (TD(0))
one-step Sarsa
n-step Q-learning (one-step targets → n-step return targets)
actor-critic
13
RL Background: Q-learning and Sarsa
Q-learning: regress Q(s, a; θ) toward the one-step target y = r + γ max_{a′} Q(s′, a′; θ⁻), i.e. minimize the loss L(θ) = E[(y − Q(s, a; θ))²]
Sarsa: same form, but the target uses the action a′ actually taken in s′: y = r + γ Q(s′, a′; θ⁻)
(the n-step variant's target is sketched after this slide)
14
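For reference, the n-step target mentioned on the previous slide (standard form, with θ⁻ the target-network parameters):
y_t = r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n max_{a′} Q(s_{t+n}, a′; θ⁻)
With n = 1 this reduces to the one-step Q-learning target above; larger n propagates rewards faster at the cost of higher variance.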
RL Background: Actor-Critic
Actor-critic methods follow an approximate policy gradient:
∇_θ log π(a_t | s_t; θ) A(s_t, a_t)
with advantage function A(s_t, a_t) = Q(s_t, a_t) − V(s_t)
15
(estimated here with the TD(0) error r_t + γ V(s_{t+1}) − V(s_t), for example)
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
16
Asynchronous RL Framework
Each worker thread repeatedly executes the following loop (a minimal toy sketch follows below):
- copy the global network parameters
- interact with its own copy of the environment
- accumulate gradients
- update the global network with the accumulated gradients
- copy the global network parameters again, and so on
Framework of A3C
source: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2
17
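A minimal, self-contained toy sketch of this per-worker loop (not the paper's code): a few threads train a shared linear model on synthetic data, each repeatedly copying the shared parameters, accumulating a gradient over a few steps, and applying it to the shared copy. The "environment", constants, and names (GLOBAL_THETA, worker, …) are all illustrative assumptions.

```python
import threading
import numpy as np

GLOBAL_THETA = np.zeros(4)                     # shared ("global network") parameters
TRUE_THETA = np.array([1.0, -2.0, 0.5, 3.0])   # synthetic regression target (made up)
LR = 0.01
T_MAX = 5                                      # accumulate gradients over 5 steps

def worker(seed, n_updates=2000):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        theta = GLOBAL_THETA.copy()            # 1. copy the global parameters
        grad = np.zeros_like(theta)
        for _ in range(T_MAX):                 # 2. "interact": draw synthetic data
            x = rng.normal(size=4)
            y = TRUE_THETA @ x
            grad += (theta @ x - y) * x        # 3. accumulate gradient of 0.5 * err^2
        GLOBAL_THETA[:] -= LR * grad / T_MAX   # 4. lock-free update of the global copy

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("learned:", np.round(GLOBAL_THETA, 2), "target:", TRUE_THETA)
```

In A3C the copied parameters define the local policy/value network and the accumulated gradients come from actual environment interaction, but the control flow is the same.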
Asynchronous RL Framework
source: Naruto comic (315th and 617th episodes)
18
Asynchronous RL Framework
source: https://github.com/coreylynch/async-rl
19
Two main ideas in practice
Similar to the Gorila framework, but:
- separate machines → multiple CPU threads on a single machine
- removes the communication cost of sending gradients and parameters
- uses Hogwild! (Recht et al., 2011) style lock-free updates for training (see the sketch after this slide)
Multiple actor-learners running in parallel
- explore different parts of the environment
- maximize diversity
- make the data less correlated in time (decorrelation)
- no replay memory is needed
(→ on-policy RL can be used to train neural networks in a stable way)
20
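A tiny illustration of the Hogwild!-style updates referenced above (a toy sketch, not the paper's implementation): several threads apply gradient-like writes to a shared parameter vector with no locking, and the result stays close to the fully synchronized one, which is what this update style relies on. All names and constants are illustrative.

```python
import threading
import numpy as np

params = np.zeros(8)            # shared parameter vector
STEPS_PER_THREAD = 50_000

def hogwild_worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS_PER_THREAD):
        i = rng.integers(len(params))   # sparse update: touch one coordinate
        params[i] += 1e-4               # unsynchronized in-place write

threads = [threading.Thread(target=hogwild_worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()

# With perfect synchronization the total would be 4 * 50_000 * 1e-4 = 20.0;
# the lock-free result is close to that despite occasional lost updates.
print("sum of params:", params.sum())
```

In the asynchronous framework, threads apply their accumulated gradients to the shared model in the same lock-free fashion.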
Asynchronous RL Framework
one-step Q-learning → Asynchronous one-step Q-learning
one-step Sarsa → Asynchronous one-step Sarsa
n-step Q-learning → Asynchronous n-step Q-learning
actor-critic → Asynchronous advantage actor-critic (A3C)
21
Algo: DQN vs. Asynchronous one-step Q-learning
22
Deep Q-Network (DQN) | Asynchronous one-step Q-learning (the two algorithms shown side by side; a toy sketch of the asynchronous one follows below)
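A rough, tabular sketch of the control flow on the asynchronous side (the real algorithm uses a neural network Q(s, a; θ) and accumulated gradients; the chain MDP and all constants below are made up, and the target table plays the role of θ⁻):

```python
import threading
import numpy as np

N_STATES, N_ACTIONS = 6, 2            # tiny chain MDP: action 1 moves right, 0 moves left
GAMMA, LR, EPS = 0.9, 0.1, 0.2        # toy constants, not the paper's hyperparameters
I_ASYNC_UPDATE, I_TARGET, T_MAX_GLOBAL = 5, 500, 40_000

Q = np.zeros((N_STATES, N_ACTIONS))   # shared parameters (stand-in for θ)
Q_target = Q.copy()                   # target parameters (stand-in for θ⁻)
T = [0]                               # shared global step counter

def env_step(s, a):
    """Chain MDP: reward 1 only on reaching the right end, which is terminal."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def worker(seed):
    rng = np.random.default_rng(seed)
    s, t, acc = 0, 0, np.zeros_like(Q)        # acc plays the role of accumulated gradients
    while T[0] < T_MAX_GLOBAL:
        a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(np.argmax(Q[s]))
        s2, r, done = env_step(s, a)
        y = r if done else r + GAMMA * Q_target[s2].max()   # one-step target
        acc[s, a] += y - Q[s, a]              # accumulate the TD "gradient"
        s = rng.integers(N_STATES - 1) if done else s2      # random restart per episode
        t += 1
        T[0] += 1
        if T[0] % I_TARGET == 0:
            Q_target[:] = Q                   # periodically sync θ⁻ ← θ
        if t % I_ASYNC_UPDATE == 0 or done:
            Q[:] += LR * acc                  # asynchronous update of the shared table
            acc[:] = 0

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("Q[0] =", np.round(Q[0], 2))            # moving right should have the higher value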
Algo: Asynchronous Advantage Actor-Critic
23
can add entropy regularization (H is the entropy of the policy); the policy-gradient term then becomes:
∇_θ′ log π(a_t | s_t; θ′) (R_t − V(s_t; θ_v)) + β ∇_θ′ H(π(s_t; θ′))
(a small numeric example of the per-step loss terms follows below)
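The per-step loss terms implied by this objective, as an illustrative numpy sketch (not the paper's code; the 0.5 factor on the value loss and the example numbers are assumptions):

```python
import numpy as np

def a3c_step_losses(logits, action, R, value, beta=0.01):
    """logits: unnormalized action scores; R: n-step return; value: V(s)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax policy π(·|s)
    log_probs = np.log(probs)
    advantage = R - value                     # R_t − V(s_t; θ_v)
    entropy = -(probs * log_probs).sum()      # H(π(s_t; θ′))
    policy_loss = -log_probs[action] * advantage - beta * entropy
    value_loss = 0.5 * advantage ** 2
    return policy_loss, value_loss, entropy

# Example usage with arbitrary numbers:
print(a3c_step_losses(np.array([1.0, 0.5, -0.2]), action=0, R=1.2, value=0.9))
```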
Optimization
1. SGD with momentum
2. RMSProp without shared statistics
3. RMSProp with shared statistics (most robust)
→ RMSProp where the statistics g are shared across threads (see the update-rule sketch below)
24
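The RMSProp update the paper builds on is g ← αg + (1 − α)Δθ², θ ← θ − ηΔθ/√(g + ε). Below is an illustrative numpy version (variable names and the example learning rate are assumptions); "shared" simply means every thread passes the same g array:

```python
import numpy as np

def rmsprop_update(theta, g, dtheta, lr=1e-3, alpha=0.99, eps=1e-10):
    """Apply one RMSProp step in place; g is the (possibly shared) statistics array."""
    g[:] = alpha * g + (1.0 - alpha) * dtheta ** 2
    theta[:] = theta - lr * dtheta / np.sqrt(g + eps)

# Example: in the shared variant, every worker thread would pass the same g array.
theta, g = np.zeros(3), np.zeros(3)
rmsprop_update(theta, g, dtheta=np.array([0.1, -0.2, 0.05]))
print(theta)
```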
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
25
Experiments
Atari 2600 Games (experiments on 57 games)
TORCS Car Racing Simulator (3D game)
MuJoCo Physics Simulator (Continuous Action Control)
Labyrinth (3D maze game)
26
Experimental setup (A3C on Atari and TORCS)
- number of threads: 16 (on a single machine and no GPUs)
- updates every 5 actions (t_max = 5 and I_AsyncUpdate = 5)
- optimization: shared RMSProp
- network architecture: 2 Conv layers and 1 FC layer (followed by ReLU)
- input preprocessing and network architecture as (Mnih et al., 2015; 2013)
- discount factor 𝛾 = 0.99, RMSProp decay factor 𝛼 = 0.99, entropy weight 𝛽 = 0.01
- learning rate: sampled from a LogUniform(10⁻⁴, 10⁻²) distribution (a sampling snippet follows below)
(for more details, see paper sections 8 & 9)
27
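A one-line sketch of that LogUniform learning-rate sampling (numpy, illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform in log-space between 1e-4 and 1e-2, then exponentiate back.
learning_rates = 10 ** rng.uniform(np.log10(1e-4), np.log10(1e-2), size=50)
print(learning_rates[:5])
```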
Learning speed comparison (5 Atari games)
DQN: trained on an Nvidia K40 GPU
Asynchronous methods: trained on 16 CPU cores
28
Score results on 57 Atari games
All hyperparameters are fixed across the 57 games
A3C, LSTM: adds 256 LSTM cells after the final hidden layer (for an additional comparison)
Mean, median: human-normalized scores over the 57 Atari games (see the formula below)
29
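Human-normalized score is presumably the standard definition from the DQN literature (stated here as an assumption, for reference):
score_normalized = 100 × (score_agent − score_random) / (score_human − score_random)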
A3C on other environments
TORCS 3D car racing game: https://youtu.be/0xo1Ldx3L5Q
MuJoCo physics simulator: https://youtu.be/Ajjc08-iPx8
Labyrinth: https://youtu.be/nMR5mjCFZCw
30
Scalability and Data Efficiency
- superlinear speedups
(especially on one-step methods)
31
speedups are averaged over 7 Atari games
Robustness and Stability
50 different learning rates and random initializations
Result:
- robust to the choice of learning rate and random initialization
- stable: training does not collapse or diverge once it is learning
(the same conclusion holds for all four asynchronous methods)
32
Comparison of three optimization methods
50 experiments each on n-step Q-learning and A3C,
with 50 randomly sampled learning rates and random initializations
1. Momentum SGD
2. RMSProp
3. Shared RMSProp
33
Outline
Introduction
Related Work
Reinforcement Learning Background
Asynchronous RL Framework
Experiments
Conclusions and Discussion
34
Conclusions and Discussion
In this framework:
- stable training of neural networks is possible in many settings
(value-based/policy-based, on-policy/off-policy, discrete/continuous)
- training time is greatly reduced
- results could potentially be improved by using other ways of estimating the advantage
function
- a number of complementary improvements to the neural network architecture are
possible
35
Thanks for listening :)
36
Q & A
37
