Slides from Strata+Hadoop Singapore 2016 presenting how deep learning can be scaled both vertically and horizontally, and when to use CPUs versus GPUs.
Deep Learning at scale with H2O's distributed architecture
1. Deep Learning at scale
Mateusz Dymczyk
Software Engineer
H2O.ai
Strata+Hadoop Singapore
08.12.2016
2. About me
• M.Sc. in CS @ AGH UST, Poland
• CS Ph.D. dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP and distributed
systems @ Fujitsu Laboratories and
en-japan inc.
3. • Deep learning - brief introduction
• Why scale
• Scaling models + implementations
• Demo
• A peek into the future
Agenda
5. Use Cases
Text classification (item prediction)
Fraud detection (11% accuracy boost in production)
Image classification
Machine translation
Recommendation systems
6. Deep Learning
Deep learning is a branch of machine learning based on a
set of algorithms that attempt to model high level
abstractions in data by using a deep graph with multiple
processing layers, composed of multiple linear and non-
linear transformations.*
*https://en.wikipedia.org/wiki/Deep_learning
8. SGD
• Memory efficient
• Fast
• Not easy to parallelize without speed degradation
(Diagram: until converged, (1) initialize parameters, (2) get training sample i, …)
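The loop on this slide can be sketched in a few lines of Python. This is a generic, single-threaded illustration on a toy 1-D linear model with made-up data and learning rate, not H2O's implementation:

```python
import random

def sgd(data, lr=0.1, epochs=50):
    """Minimal SGD for a toy 1-D linear model y ~ w*x."""
    w = 0.0                             # (1) initialize parameters
    for _ in range(epochs):             # repeat until converged (fixed budget here)
        random.shuffle(data)
        for x, y in data:               # (2) get training sample i
            grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
            w -= lr * grad              # (3) update using this single sample
    return w

data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # noiseless y = 3x
w = sgd(data)  # converges to w close to 3.0
```

Because each update depends on the previous one, naively sharing `w` across workers serializes on it, which is why parallelizing SGD needs the distribution schemes on the following slides.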
9. Deep Learning
PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in
multiple fields
CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits
10. Why scale?
(PRO/CON list repeated from the previous slide)
Grid search?
11. Why not to scale?
• Distribution isn’t free: overhead due to network traffic,
synchronization etc.
• Small neural network (not many layers and/or neurons)
• small+shallow network = not much computation per iteration
• Small data
• Very slow network communication
13. Distribution models
• Model parallelism
• Data parallelism
• Mixed/composed
• Parameter server vs *peer-to-peer
• Communication: Asynchronous vs Synchronous
*http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
14. Model parallelism
• Each node computes a different part of the network
• Potential to scale to large models
• Rather difficult to implement and reason about
• Originally designed for the large convolutional layers in GoogLeNet
(Diagram: nodes 1-3 exchange parameters or deltas via a parameter server)
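A toy illustration of the partitioning in pure Python, with invented layer shapes; real systems ship activations over the network and also split the backward pass:

```python
# Each "node" owns the weights of one layer only; activations are handed
# from node to node during the forward pass.

def make_layer(weights):
    # a node's layer: elementwise scale followed by ReLU
    def layer(xs):
        return [max(0.0, w * x) for w, x in zip(weights, xs)]
    return layer

# Parameters are partitioned: node 1 never sees node 2's weights.
node1 = make_layer([1.0, -1.0, 2.0])
node2 = make_layer([0.5, 0.5, 0.5])
node3 = make_layer([2.0, 2.0, 2.0])

def forward(xs):
    # activations flow node1 -> node2 -> node3 (network hops in practice)
    return node3(node2(node1(xs)))

out = forward([1.0, 1.0, 1.0])  # [1.0, 0.0, 2.0]
```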
15. Data parallelism
• Each node computes all the parameters
• Using part (or all) of its local data
• Results have to be combined
(Diagram: nodes 1-4 exchange parameters or deltas via a parameter server)
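One common way to combine the results is gradient averaging on a parameter server. A sketch under toy assumptions (1-D model, invented shards), not H2O internals:

```python
# Every "node" holds a full parameter copy and computes a gradient on its
# own data shard; the parameter server averages the per-node gradients.

def grad_on_shard(w, shard):
    # mean gradient of (w*x - y)^2 over this node's rows
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train(shards, lr=0.05, steps=200):
    w = 0.0
    for _ in range(steps):
        grads = [grad_on_shard(w, s) for s in shards]  # map: one per node
        w -= lr * sum(grads) / len(grads)              # reduce: combine on the server
    return w

shards = [[(1.0, 3.0), (2.0, 6.0)], [(0.5, 1.5), (1.5, 4.5)]]  # y = 3x over 2 nodes
w = train(shards)  # converges to w close to 3.0
```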
16. Mixed/composed
• Node distribution (model or data)
• In-node, per GPU/CPU/thread concurrent/parallel computation
• Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data
(Diagram: nodes containing multiple CPUs/GPUs or threads)
17. Sync vs. Async
At some point parameters need to be collected within the nodes and between them.
Sync
• Slower
• Reproducible
• Might overfit
• In some cases more accurate and faster*
Async
• Race conditions possible
• Not reproducible
• Faster
• Might help with overfitting (the "mistakes" add noise)
*https://arxiv.org/abs/1604.00981
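The difference can be made concrete on a toy shared parameter (Python threads, invented gradient values): synchronous waits for all workers and applies one combined update, asynchronous lets each worker apply its own as soon as it is ready.

```python
import threading

params = {"w": 0.0}
lock = threading.Lock()

def sync_round(grads, lr=0.1):
    # synchronous: barrier first, then one combined, reproducible update
    params["w"] -= lr * sum(grads) / len(grads)

def async_worker(grad, lr=0.1):
    # asynchronous: apply immediately; arrival order varies run to run,
    # and without the lock updates could even be lost (race condition)
    with lock:
        params["w"] -= lr * grad

grads = [1.0, 2.0, 3.0]
sync_round(grads)  # one averaged step: -0.1 * 2.0
workers = [threading.Thread(target=async_worker, args=(g,)) for g in grads]
for t in workers:
    t.start()
for t in workers:
    t.join()
```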
18. H2O’s architecture
H2O in-memory: non-blocking hash map (resides on all nodes)
Initial model (weights & biases)
Node computation (threads, async)
Node communication
MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
19. H2O Frame
(Diagram: frame split into per-node data, each node's data split into per-thread chunks)
* Each row is fully stored on the same node
* Each chunk contains thousands/millions of rows
* All the data is compressed (sometimes 2-4x) in a lossless fashion
20. Inside a node
• Each thread works on a part of the data
• All threads update weights/biases concurrently
• Race conditions possible (hard to reproduce, helps against overfitting)
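Slide 20 describes Hogwild!-style updates: threads share one weight vector and write to it without locks. A minimal sketch with invented data; the shards here touch disjoint weights so the outcome stays deterministic, whereas overlapping writes would race exactly as the slide warns:

```python
import threading

weights = [0.0] * 4  # shared by all threads

def worker(rows, lr=0.01):
    for idx, grad in rows:         # this thread's slice of the local data
        weights[idx] -= lr * grad  # unsynchronized read-modify-write

shards = [
    [(0, 1.0), (1, 2.0)],  # thread 1 updates weights 0 and 1
    [(2, 1.0), (3, 2.0)],  # thread 2 updates weights 2 and 3
]
threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
# weights is now close to [-0.01, -0.02, -0.01, -0.02]
```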
21. H2O’s architecture
H2O in-memory: non-blocking hash map (resides on all nodes)
Initial model (weights & biases)
Node computation (threads, async)
Node communication
MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
REDUCE: model averaging: average weights and biases from all the nodes (here: plain averaging; new: elastic averaging)
Updated model (weights & biases)
* Communication frequency is auto-tuned and user-controllable (affects convergence)
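The REDUCE step on this slide, plain model averaging, is just an elementwise mean of the per-node weight vectors (toy numbers below; elastic averaging instead pulls each node's parameters toward a shared moving center):

```python
def average_models(models):
    # elementwise mean across the weight vectors trained on each node
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

node_models = [
    [0.9, 2.1, -0.5],  # weights after local training on node 1
    [1.1, 1.9, -0.7],  # node 2
    [1.0, 2.0, -0.6],  # node 3
]
avg = average_models(node_models)  # close to [1.0, 2.0, -0.6]
```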
24. Airline Delays
• Data:
• airline data
• 116 million rows (6GB)
• 800+ predictors (numeric & categorical)
• Problem: predict if a flight is delayed
• Hardware: 10× Dual E5-2650 (32 cores, 2.6GHz), ~11Gb
• Platform: H2O
25. GPUs and other networks
• What if you want to use GPUs?
• What if you want to train arbitrary networks?
• What if you want to compare different frameworks?
29. Other frameworks
mxnet
• Data parallelism (default)
• Model parallelism also available
• for example for multi-layer LSTM*
• Supports both sync and async communication
• Can perform updates on the GPU or CPU
*http://mxnet.io/how_to/model_parallel_lstm.html
TensorFlow
• Data and model parallelism
• Both sync and async updates supported*
*https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
30. Summary
Single node
• small NN/small data
Multi CPU/GPU
• data fits on a single node but is too much for a single processing unit
Multi node
• data/model doesn’t fit on one node
• computation takes too long
Model parallelism
• network parameters don’t fit on a single machine
• faster computation
Data parallelism
• data doesn’t fit on a single node
• faster computation
Sync
• best accuracy
• lots of workers, or OK with slower training
Async
• faster convergence
• OK with potentially lower accuracy