This document discusses scaling deep learning models to multiple nodes using parallel policy gradients. The author shows that a simple parallel policy gradients model trained on a supercomputer can learn to play Atari Pong faster than Google DeepMind's A3C model and a 7-year-old child. Optimizations including reducing the frequency of progress updates, using MKL-backed NumPy, and updating the agent every 4 seconds instead of after complete games resulted in the model learning to play Pong in under 4 minutes on 1536 cores of a supercomputer. However, humans are still around 10,000 times faster at learning the game than current AI techniques. The author concludes that HPC can play a large role in deep learning by helping researchers build high-performance multi-node models.
16. 2.2x speedup by reducing the frequency of progress updates
[Bar chart: runtime with 32 nodes — printing the error every epoch vs. every 30 epochs]
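In code this fix is a one-line gate. A minimal sketch, assuming an mpi4py setup; `train_one_epoch` is a hypothetical stand-in for the real training step, and only the every-30-epochs gating is the point:

```python
from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
PRINT_EVERY = 30   # was 1: printing every epoch cost ~2.2x at 32 nodes

def train_one_epoch():
    return random.random()   # stand-in for the real training step

for epoch in range(300):
    error = train_one_epoch()
    # Only rank 0 touches stdout, and only every 30th epoch, so the
    # other ranks never stall behind terminal/log I/O.
    if comm.rank == 0 and epoch % PRINT_EVERY == 0:
        print(f"epoch {epoch}: error {error:.4f}")
```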
19. Choosing a model: policy gradients
Run a policy for a while. See which actions led to high rewards.
Increase their probability.
Source: Andrej Karpathy’s blog. Karpathy is a genius.
20. Choosing a model: policy gradients
130 lines of python, no framework needed!
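The deck doesn't reproduce the code, but the core of such an agent — in the spirit of the Karpathy pg-pong post it cites — fits in a few dozen lines of NumPy. This is a sketch under those assumptions, not the author's actual 130-line implementation; the layer sizes and names are illustrative:

```python
import numpy as np

H, D = 200, 80 * 80        # hidden units, 80x80 preprocessed frame
gamma = 0.99               # reward discount factor
model = {"W1": np.random.randn(H, D) / np.sqrt(D),
         "W2": np.random.randn(H) / np.sqrt(H)}

def policy_forward(x):
    """Probability of moving the paddle UP for one input frame."""
    h = np.maximum(0, model["W1"] @ x)                # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(model["W2"] @ h)))      # sigmoid output
    return p, h

def discount_rewards(r):
    """Spread each point's +1/-1 reward back over the moves before it."""
    out, running = np.zeros(len(r)), 0.0
    for t in reversed(range(len(r))):
        running = running * gamma + r[t]
        out[t] = running
    return out

def policy_backward(xs, hs, dlogps):
    """Gradients that raise the log-probability of rewarded actions."""
    dW2 = hs.T @ dlogps                   # (H,)
    dh = np.outer(dlogps, model["W2"])    # (T, H)
    dh[hs <= 0] = 0                       # backprop through the ReLU
    dW1 = dh.T @ xs                       # (H, D)
    return {"W1": dW1, "W2": dW2}
```

During play the agent samples UP with probability p; after a batch, `dlogps` holds the per-step errors (label minus p) scaled by the discounted, normalized rewards, and the resulting gradients are added to the weights — the "increase their probability" step above.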
21. Policy gradients versus Google DeepMind’s A3C:
[Bar chart: hours of training required to beat Atari Pong — Policy Gradients vs. Google DeepMind A3C]
22. So we need a 100x improvement to be competitive. Let’s go!
[Illustration: gradient descent]
23. Parallel Policy Gradients on a single node:
[Chart: training speed on a single node — steps/s vs. MPI processes (1 to 8); linear fit y = 257.4x + 220]
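The slides don't show the parallelization itself. One pattern consistent with the next slide's "one emulator per real CPU core" is for each rank to gather its own rollouts and average gradients across ranks with an allreduce; this is a hedged sketch (mpi4py assumed, `rollout_gradient` is a stand-in), not necessarily the author's exact scheme:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def rollout_gradient():
    # Stand-in: each rank drives its own emulator instance, plays
    # some frames, and returns a policy-gradient estimate.
    return np.random.randn(100)

local_grad = rollout_gradient()
mean_grad = np.empty_like(local_grad)
# Sum everyone's gradient, then divide by the number of ranks:
# every process applies the same averaged update, so the model
# replicas stay in sync without a parameter server.
comm.Allreduce(local_grad, mean_grad, op=MPI.SUM)
mean_grad /= comm.size
```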
24. Is this code a good match for Intel® Xeon Phi™ Knights Landing?
Half the time is spent in well-vectorized matrix multiplications, which scale well over multiple cores.
The other half is spent emulating the Atari's 6502 processor, scaled by running one emulator per real CPU core — serial work that favours fewer, faster cores.
Conclusion: not a good match.
25. Let’s use ARCHER, the UK National Supercomputing Service
Cray XC30: Intel® Xeon™ nodes, high-performance networking
27. Here we are on our quick EC2 cluster – how does Parallel Policy Gradients perform?
[Chart: training speed — steps/s vs. MPI processes (1 to 16)]
32. So are we done now? Let’s have a look at our performance at 90 cores:
33. Insight: we don’t really care about recording complete games
Does it matter what the score is here?
[Venn diagram: HPC expertise + domain expertise → new insights]
34. Result from this insight: 25% speedup at 90 cores and 10x scalability
[Bar chart: steps/s with 90 cores — updating after a whole game vs. updating every 4 seconds]
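In sketch form, the change is to stop waiting for a full 21-point game and instead update on whatever trajectory has accumulated every few seconds. The 4-second interval is from the slide; `Env` and `Agent` below are hypothetical stand-ins for the emulator and the policy:

```python
import random, time

class Env:                       # stand-in for the Atari emulator
    def reset(self):
        return 0
    def step(self, action):
        # Returns (observation, reward, done). Pong hands out +1/-1
        # whenever a point is scored, not only at the end of a game.
        return 0, random.choice([-1, 0, 1]), random.random() < 0.01

class Agent:                     # stand-in for the policy network
    def act(self, obs):
        return random.randint(0, 1)
    def record(self, obs, reward):
        pass
    def update(self):
        pass

UPDATE_EVERY = 4.0               # seconds, per the slide

env, agent = Env(), Agent()
obs, last_update = env.reset(), time.time()
for _ in range(100_000):
    obs, reward, done = env.step(agent.act(obs))
    agent.record(obs, reward)
    # Learn from the partial game: the final score is irrelevant.
    if time.time() - last_update >= UPDATE_EVERY:
        agent.update()
        last_update = time.time()
    if done:
        obs = env.reset()
```

This works for Pong because the reward arrives point by point, so partial games still carry a learning signal and no information about the final score is needed.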
35. The showdown: Google DeepMind’s A3C vs Supercomputer vs 7-year-old child
[Log-scale chart: minutes of training time to defeat Atari Pong — DQN (DeepMind, 2015), A3C (Google, 2016), 7-year-old (James, 2016), PPG (Allinea, 2016)]
36. The showdown: how fast can we beat Atari Pong with 1536 cores on Archer?
[Results: 3m 54s, 8m 00s, 28m 00s]
37. Extra: the best strategy found by any agent
Training stats:
• 100 wallclock minutes
• 180 k input frames
• 100 kJ to solution
Comparison to PPG:
• 30 wallclock minutes
• 880,000 k input frames
• 550,000 kJ to solution
Humans still ~10³ ahead (PPG needed roughly 5,000x the frames and energy)!
(… but AI has improved by 10x/year since 2014 …)
38. Conclusions
Deep learning is going multi-node at significant scales
HPC can and should play a huge role in this
Researchers need frameworks and tools to help them build high-performance multi-node models
Simple but scalable models can converge faster than state-of-the-art single-node models
Humans aren't down for the count – yet!
39. Thank you – meet us at booth #1508 this week!
Mark O’Connor
mark@allinea.com | @yieldthought | [ github link ]