Slides from Strata+Hadoop Singapore 2016 presenting how deep learning can be scaled both vertically and horizontally, and when to use CPUs versus GPUs.
Deep Learning at scale with H2O's distributed architecture
1. Deep Learning at scale
Mateusz Dymczyk
Software Engineer
H2O.ai
Strata+Hadoop Singapore
08.12.2016
2. About me
• M.Sc. in CS @ AGH UST, Poland
• CS Ph.D. dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP and distributed
systems @ Fujitsu Laboratories and
en-japan inc.
3. • Deep learning - brief introduction
• Why scale
• Scaling models + implementations
• Demo
• A peek into the future
Agenda
5. Use Cases
Text classification (item prediction)
Fraud detection (11% accuracy boost in production)
Image classification
Machine translation
Recommendation systems
6. Deep Learning
Deep learning is a branch of machine learning based on a
set of algorithms that attempt to model high level
abstractions in data by using a deep graph with multiple
processing layers, composed of multiple linear and non-
linear transformations.*
*https://en.wikipedia.org/wiki/Deep_learning
8. SGD
• Memory efficient
• Fast
• Not easy to parallelize without speed degradation
(Diagram: until converged, (1) initialize parameters, (2) get training sample i, …)
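The loop on this slide can be sketched in a few lines of Python. This is a generic, single-threaded illustration on a toy 1-D linear model with made-up data and learning rate, not H2O's implementation:

```python
import random

def sgd(data, lr=0.1, epochs=50):
    """Minimal SGD for a toy 1-D linear model y ~ w*x."""
    w = 0.0                             # (1) initialize parameters
    for _ in range(epochs):             # repeat until converged (fixed budget here)
        random.shuffle(data)
        for x, y in data:               # (2) get training sample i
            grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
            w -= lr * grad              # (3) update using this single sample
    return w

data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # noiseless y = 3x
w = sgd(data)  # converges to w close to 3.0
```

Because each update depends on the previous one, naively sharing `w` across workers serializes on it, which is why parallelizing SGD needs the distribution schemes on the following slides.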
9. Deep Learning
PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in
multiple fields
CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits
10. Why scale?
(PRO/CON list repeated from the previous slide)
Grid search?
11. Why not to scale?
• Distribution isn’t free: overhead due to network traffic,
synchronization etc.
• Small neural network (not many layers and/or neurons)
• small+shallow network = not much computation per iteration
• Small data
• Very slow network communication
13. Distribution models
• Model parallelism
• Data parallelism
• Mixed/composed
• Parameter server vs *peer-to-peer
• Communication: Asynchronous vs Synchronous
*http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
14. Model parallelism
• Each node computes a different part of the network
• Potential to scale to large models
• Rather difficult to implement and reason about
• Originally designed for the large convolutional layers in GoogLeNet
(Diagram: nodes 1-3 exchange parameters or deltas via a parameter server)
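A toy illustration of the partitioning in pure Python, with invented layer shapes; real systems ship activations over the network and also split the backward pass:

```python
# Each "node" owns the weights of one layer only; activations are handed
# from node to node during the forward pass.

def make_layer(weights):
    # a node's layer: elementwise scale followed by ReLU
    def layer(xs):
        return [max(0.0, w * x) for w, x in zip(weights, xs)]
    return layer

# Parameters are partitioned: node 1 never sees node 2's weights.
node1 = make_layer([1.0, -1.0, 2.0])
node2 = make_layer([0.5, 0.5, 0.5])
node3 = make_layer([2.0, 2.0, 2.0])

def forward(xs):
    # activations flow node1 -> node2 -> node3 (network hops in practice)
    return node3(node2(node1(xs)))

out = forward([1.0, 1.0, 1.0])  # [1.0, 0.0, 2.0]
```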
15. Data parallelism
• Each node computes all the parameters
• Using part (or all) of its local data
• Results have to be combined
(Diagram: nodes 1-4 exchange parameters or deltas via a parameter server)
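One common way to combine the results is gradient averaging on a parameter server. A sketch under toy assumptions (1-D model, invented shards), not H2O internals:

```python
# Every "node" holds a full parameter copy and computes a gradient on its
# own data shard; the parameter server averages the per-node gradients.

def grad_on_shard(w, shard):
    # mean gradient of (w*x - y)^2 over this node's rows
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, lr=0.05, steps=200):
    w = 0.0
    for _ in range(steps):
        grads = [grad_on_shard(w, s) for s in shards]  # map: one per node
        w -= lr * sum(grads) / len(grads)              # reduce: combine on the server
    return w

shards = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]  # y = 3x over 2 nodes
w = train(shards)  # converges to w close to 3.0
```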
16. Mixed/composed
• Node distribution (model or data)
• In-node, per GPU/CPU/thread concurrent/parallel computation
• Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data
(Diagram: nodes containing multiple CPUs/GPUs or threads)
17. Sync vs. Async
At some point parameters need to be collected within the nodes and between them.
Sync
• Slower
• Reproducible
• Might overfit
• In some cases more accurate and faster*
Async
• Race conditions possible
• Not reproducible
• Faster
• Might help with overfitting (the "mistakes" add noise)
*https://arxiv.org/abs/1604.00981
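The difference can be made concrete on a toy shared parameter (Python threads, invented gradient values): synchronous waits for all workers and applies one combined update, asynchronous lets each worker apply its own as soon as it is ready.

```python
import threading

params = {"w": 0.0}
lock = threading.Lock()

def sync_round(grads, lr=0.1):
    # synchronous: barrier first, then one combined, reproducible update
    params["w"] -= lr * sum(grads) / len(grads)

def async_worker(grad, lr=0.1):
    # asynchronous: apply immediately; arrival order varies run to run,
    # and without the lock updates could even be lost (race condition)
    with lock:
        params["w"] -= lr * grad

grads = [1.0, 2.0, 3.0]
sync_round(grads)  # one averaged step: -0.1 * 2.0
workers = [threading.Thread(target=async_worker, args=(g,)) for g in grads]
for t in workers:
    t.start()
for t in workers:
    t.join()
```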
18. H2O’s architecture
H2O in-memory: non-blocking hash map (resides on all nodes)
Initial model (weights & biases)
Node computation (threads, async)
Node communication
MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
19. H2O Frame
(Diagram: frame split into per-node data, each node's data split into per-thread chunks)
* Each row is fully stored on the same node
* Each chunk contains thousands/millions of rows
* All the data is compressed (sometimes 2-4x) in a lossless fashion
20. Inside a node
• Each thread works on a part of the data
• All threads update weights/biases concurrently
• Race conditions possible (hard to reproduce, helps against overfitting)
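Slide 20 describes Hogwild!-style updates: threads share one weight vector and write to it without locks. A minimal sketch with invented data; the shards here touch disjoint weights so the outcome stays deterministic, whereas overlapping writes would race exactly as the slide warns:

```python
import threading

weights = [0.0] * 4  # shared by all threads

def worker(rows, lr=0.01):
    for idx, grad in rows:         # this thread's slice of the local data
        weights[idx] -= lr * grad  # unsynchronized read-modify-write

shards = [
    [(0, 1.0), (1, 2.0)],  # thread 1 updates weights 0 and 1
    [(2, 1.0), (3, 2.0)],  # thread 2 updates weights 2 and 3
]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
# weights is now close to [-0.01, -0.02, -0.01, -0.02]
```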
21. H2O’s architecture
H2O in-memory: non-blocking hash map (resides on all nodes)
Initial model (weights & biases)
Node computation (threads, async)
Node communication
MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
REDUCE: model averaging: average weights and biases from all the nodes (here: plain averaging; new: elastic averaging)
Updated model (weights & biases)
* Communication frequency is auto-tuned and user-controllable (affects convergence)
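The REDUCE step on this slide, plain model averaging, is just an elementwise mean of the per-node weight vectors (toy numbers below; elastic averaging instead pulls each node's parameters toward a shared moving center):

```python
def average_models(models):
    # elementwise mean across the weight vectors trained on each node
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

node_models = [
    [0.9, 2.1, -0.5],  # weights after local training on node 1
    [1.1, 1.9, -0.7],  # node 2
    [1.0, 2.0, -0.6],  # node 3
]
avg = average_models(node_models)  # close to [1.0, 2.0, -0.6]
```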
24. Airline Delays
• Data:
• airline data
• 116 million rows (6GB)
• 800+ predictors (numeric & categorical)
• Problem: predict if a flight is delayed
• Hardware: 10× Dual E5-2650 (32 cores, 2.6GHz), ~11Gb
• Platform: H2O
25. GPUs and other networks
• What if you want to use GPUs?
• What if you want to train arbitrary networks?
• What if you want to compare different frameworks?
29. Other frameworks
mxnet
• Data parallelism (default)
• Model parallelism also available
• for example for multi-layer LSTM*
• Supports both sync and async communication
• Can perform updates on the GPU or CPU
*http://mxnet.io/how_to/model_parallel_lstm.html
TensorFlow
• Data and model parallelism
• Both sync and async updates supported*
*https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
30. Summary
Single node
• small NN/small data
Multi CPU/GPU
• data fits on a single node but is too much for a single processing unit
Multi node
• data/model doesn’t fit on one node
• computation takes too long
Model parallelism
• network parameters don’t fit on a single machine
• faster computation
Data parallelism
• data doesn’t fit on a single node
• faster computation
Sync
• best accuracy
• lots of workers, or OK with slower training
Async
• faster convergence
• OK with potentially lower accuracy