This document discusses scaling deep learning models to multiple nodes using parallel policy gradients. The author shows that a simple parallel policy gradients model trained on a supercomputer can learn to play Atari Pong faster than Google DeepMind's A3C model and a 7-year-old child. Optimizations including reducing the frequency of progress updates, using MKL-backed NumPy, and updating the agent every 4 seconds instead of after complete games resulted in the model learning to play Pong in under 4 minutes on 1536 cores of a supercomputer. However, humans are still around 10,000 times faster at learning the game than current AI techniques. The author concludes that HPC can play a large role in deep learning by helping researchers build high-performance multi-node models.
16. 2.2x speedup by reducing the frequency of progress updates
[Bar chart: runtime with 32 nodes — printing the error every epoch vs. every 30 epochs]
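In code this fix is a one-line gate. A minimal sketch, assuming an mpi4py setup; `train_one_epoch` is a hypothetical stand-in for the real training step, and only the every-30-epochs gating is the point:

```python
from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
PRINT_EVERY = 30   # was 1: printing every epoch cost ~2.2x at 32 nodes

def train_one_epoch():
    return random.random()   # stand-in for the real training step

for epoch in range(300):
    error = train_one_epoch()
    # Only rank 0 touches stdout, and only every 30th epoch, so the
    # other ranks never stall behind terminal/log I/O.
    if comm.rank == 0 and epoch % PRINT_EVERY == 0:
        print(f"epoch {epoch}: error {error:.4f}")
```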
19. Choosing a model: policy gradients
Run a policy for a while. See which actions led to high rewards.
Increase their probability.
Source: Andrej Karpathy’s blog. Karpathy is a genius.
20. Choosing a model: policy gradients
130 lines of python, no framework needed!
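The deck doesn't reproduce the code, but the core of such an agent — in the spirit of the Karpathy pg-pong post it cites — fits in a few dozen lines of NumPy. This is a sketch under those assumptions, not the author's actual 130-line implementation; the layer sizes and names are illustrative:

```python
import numpy as np

H, D = 200, 80 * 80        # hidden units, 80x80 preprocessed frame
gamma = 0.99               # reward discount factor
model = {"W1": np.random.randn(H, D) / np.sqrt(D),
         "W2": np.random.randn(H) / np.sqrt(H)}

def policy_forward(x):
    """Probability of moving the paddle UP for one input frame."""
    h = np.maximum(0, model["W1"] @ x)                # ReLU hidden layer
    p = 1.0 / (1.0 + np.exp(-(model["W2"] @ h)))      # sigmoid output
    return p, h

def discount_rewards(r):
    """Spread each point's +1/-1 reward back over the moves before it."""
    out, running = np.zeros(len(r)), 0.0
    for t in reversed(range(len(r))):
        running = running * gamma + r[t]
        out[t] = running
    return out

def policy_backward(xs, hs, dlogps):
    """Gradients that raise the log-probability of rewarded actions."""
    dW2 = hs.T @ dlogps                   # (H,)
    dh = np.outer(dlogps, model["W2"])    # (T, H)
    dh[hs <= 0] = 0                       # backprop through the ReLU
    dW1 = dh.T @ xs                       # (H, D)
    return {"W1": dW1, "W2": dW2}
```

During play the agent samples UP with probability p; after a batch, `dlogps` holds the per-step errors (label minus p) scaled by the discounted, normalized rewards, and the resulting gradients are added to the weights — the "increase their probability" step above.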
21. Policy gradients versus Google DeepMind’s A3C:
[Bar chart: hours of training required to beat Atari Pong — Policy Gradients vs. Google DeepMind A3C]
22. So we need a 100x improvement to be competitive. Let’s go!
[Illustration: gradient descent]
23. Parallel Policy Gradients on a single node:
[Chart: training speed on a single node — steps/s vs. MPI processes (1 to 8); linear fit y = 257.4x + 220]
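The slides don't show the parallelization itself. One pattern consistent with the next slide's "one emulator per real CPU core" is for each rank to gather its own rollouts and average gradients across ranks with an allreduce; this is a hedged sketch (mpi4py assumed, `rollout_gradient` is a stand-in), not necessarily the author's exact scheme:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def rollout_gradient():
    # Stand-in: each rank drives its own emulator instance, plays
    # some frames, and returns a policy-gradient estimate.
    return np.random.randn(100)

local_grad = rollout_gradient()
mean_grad = np.empty_like(local_grad)
# Sum everyone's gradient, then divide by the number of ranks:
# every process applies the same averaged update, so the model
# replicas stay in sync without a parameter server.
comm.Allreduce(local_grad, mean_grad, op=MPI.SUM)
mean_grad /= comm.size
```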
24. Is this code a good match for Intel® Xeon Phi™ Knights Landing?
Half the time is spent in well-vectorized matrix multiplications, which scale well over multiple cores.
The other half is spent emulating the Atari's 6502 processor, scaled by running one emulator per real CPU core — serial work that favours fewer, faster cores.
Conclusion: not a good match.
25. Let’s use ARCHER, the UK National Supercomputing Service
Cray XC30: Intel® Xeon™ nodes, high-performance networking
27. Here we are on our quick EC2 cluster – how does Parallel Policy Gradients perform?
[Chart: training speed — steps/s vs. MPI processes (1 to 16)]
32. So are we done now? Let’s have a look at our performance at 90 cores:
33. Insight: we don’t really care about recording complete games
Does it matter what the score is here?
[Venn diagram: HPC expertise + domain expertise → new insights]
34. Result from this insight: 25% speedup at 90 cores and 10x scalability
[Bar chart: steps/s with 90 cores — updating after a whole game vs. updating every 4 seconds]
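In sketch form, the change is to stop waiting for a full 21-point game and instead update on whatever trajectory has accumulated every few seconds. The 4-second interval is from the slide; `Env` and `Agent` below are hypothetical stand-ins for the emulator and the policy:

```python
import random, time

class Env:                       # stand-in for the Atari emulator
    def reset(self):
        return 0
    def step(self, action):
        # Returns (observation, reward, done). Pong hands out +1/-1
        # whenever a point is scored, not only at the end of a game.
        return 0, random.choice([-1, 0, 1]), random.random() < 0.01

class Agent:                     # stand-in for the policy network
    def act(self, obs):
        return random.randint(0, 1)
    def record(self, obs, reward):
        pass
    def update(self):
        pass

UPDATE_EVERY = 4.0               # seconds, per the slide

env, agent = Env(), Agent()
obs, last_update = env.reset(), time.time()
for _ in range(100_000):
    obs, reward, done = env.step(agent.act(obs))
    agent.record(obs, reward)
    # Learn from the partial game: the final score is irrelevant.
    if time.time() - last_update >= UPDATE_EVERY:
        agent.update()
        last_update = time.time()
    if done:
        obs = env.reset()
```

This works for Pong because the reward arrives point by point, so partial games still carry a learning signal and no information about the final score is needed.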
35. The showdown: Google DeepMind’s A3C vs Supercomputer vs 7-year-old child
[Log-scale chart: minutes of training time to defeat Atari Pong — DQN (DeepMind, 2015), A3C (Google, 2016), 7-year-old (James, 2016), PPG (Allinea, 2016)]
36. The showdown: how fast can we beat Atari Pong with 1536 cores on Archer?
[Results: 3m 54s, 8m 00s, 28m 00s]
37. Extra: the best strategy found by any agent
Training stats:
• 100 wallclock minutes
• 180 k input frames
• 100 kJ to solution
Comparison to PPG:
• 30 wallclock minutes
• 880,000 k input frames
• 550,000 kJ to solution
Humans still ~10³ ahead (PPG needed roughly 5,000x the frames and energy)!
(… but AI has improved by 10x/year since 2014 …)
38. Conclusions
Deep learning is going multi-node at significant scales
HPC can and should play a huge role in this
Researchers need frameworks and tools to help them build high-performance multi-node models
Simple but scalable models can converge faster than state-of-the-art single-node models
Humans aren't down for the count – yet!
39. Thank you – meet us at booth #1508 this week!
Mark O’Connor
mark@allinea.com | @yieldthought | [ github link ]