Deep Learning with H2O
0xdata, H2O.ai

Scalable In-Memory Machine Learning
Hadoop User Group, Chicago, 7/16/14
Arno Candel
Who am I?
PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
7 months at 0xdata/H2O - Machine Learning
15 years in HPC, C++, MPI, Supercomputing
@ArnoCandel
Outline
Intro & Live Demo (5 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
MNIST handwritten digits
Text classification
Weather prediction
Q & A (10 mins)
H2O: Open Source In-Memory Prediction Engine for Big Data

Distributed in-memory math platform
➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy-to-use SDK / API
➔ Java, R, Scala, Python, JSON, browser-based GUI

Businesses can use ALL of their data (with or without Hadoop)
➔ Modeling without Sampling

Big Data + Better Algorithms ➔ Better Predictions
About H2O (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the www.h2o.ai/community!
+1 Cyprien Noel for prior work
Customer Demands for Practical Machine Learning

Requirement   ➔ Value
In-Memory     ➔ Fast (Interactive)
Distributed   ➔ Big Data (No Sampling)
Open Source   ➔ Ownership of Methods
API / SDK     ➔ Extensibility
H2O was developed by 0xdata to
meet these requirements
H2O Integration
[Diagram: H2O runs Standalone, Over YARN, or On MRv1, always on HDFS, with APIs for Java, R, Scala, Python, and JSON]
H2O Architecture
[Diagram: distributed in-memory K-V store with column compression and memory manager; MapReduce-based machine learning algorithms (e.g. Deep Learning); R engine; nano-fast scoring/prediction engine]
H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
John Chambers (creator of the S language, R-core member) names the H2O R API among the top three most promising R projects
H2O R CRAN package
H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: data resides on the H2O cluster!
H2O Deep Learning in Action

Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes

MNIST = digitized handwritten digits database (Yann LeCun)
Data: 28x28 = 784 pixels with gray-scale values in 0…255

Live Demo: Build an H2O Deep Learning model on the MNIST train/test data
Yann LeCun: “Yet another advice: don't get fooled
by people who claim to have a solution to
Artificial General Intelligence. Ask them what
error rate they get on MNIST or ImageNet.”
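For readers who want to reproduce the demo offline, here is a minimal sketch in Python, assuming the newer H2O-3 Python API (the live demo used H2O's browser GUI); file paths and layer sizes are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()  # connect to (or start) a local H2O cluster

train = h2o.import_file("mnist_train.csv")  # 784 pixel columns + label
test  = h2o.import_file("mnist_test.csv")
x, y = train.columns[:-1], train.columns[-1]
train[y] = train[y].asfactor()  # treat the digit label as a 10-way class
test[y]  = test[y].asfactor()

model = H2ODeepLearningEstimator(hidden=[200, 200],  # arbitrary demo sizes
                                 activation="RectifierWithDropout",
                                 input_dropout_ratio=0.2,
                                 epochs=10)
model.train(x=x, y=y, training_frame=train, validation_frame=test)
print(model.model_performance(test))  # test set error
```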
What is Deep Learning?

Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example: input data (image) ➔ prediction (who is it?)

Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
Deep Learning is Trending

[Google Trends chart, 2011-2013: interest in deep learning rising]

Businesses are using Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the “Google Brain” (Andrew Ng)
Deep Learning History
(slides by Yann LeCun, now at Facebook)

Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter
What is NOT Deep

Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (no feature hierarchy)
SVMs and kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on the original input space)
Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!
Example Neural Network

A “fully connected” directed graph of neurons; information flows from input to output.
[Diagram: input layer (age, income, employment; 3 neurons) ➔ hidden layer 1 (4 neurons) ➔ hidden layer 2 (3 neurons) ➔ output layer (married, single; 2 neurons); #connections: 3x4, 4x3, 3x2]
Prediction: Forward Propagation

“neurons activate each other via weighted sums”: inputs x_i (age, income, employment) propagate to per-class output probabilities p_l (married, single):

y_j = tanh(sum_i(x_i*u_ij) + b_j)
z_k = tanh(sum_j(y_j*v_jk) + c_k)
p_l = softmax(sum_k(z_k*w_kl) + d_l), where softmax(x_k) = exp(x_k) / sum_k(exp(x_k)) so that sum(p_l) = 1

b_j, c_k, d_l: bias values (independent of inputs)
activation function: tanh; alternative: x -> max(0,x), the “rectifier”
p_l is a non-linear function of x_i: can approximate ANY function with enough layers!
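The three formulas above can be run directly; here is a minimal numpy sketch (weights are random placeholders rather than trained values):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.3])               # standardized age, income, employment
U, b = rng.normal(size=(3, 4)), np.zeros(4)  # input -> hidden layer 1
V, c = rng.normal(size=(4, 3)), np.zeros(3)  # hidden 1 -> hidden 2
W, d = rng.normal(size=(3, 2)), np.zeros(2)  # hidden 2 -> output

y = np.tanh(x @ U + b)   # y_j = tanh(sum_i(x_i*u_ij) + b_j)
z = np.tanh(y @ V + c)   # z_k = tanh(sum_j(y_j*v_jk) + c_k)
p = softmax(z @ W + d)   # per-class probabilities
print(p, p.sum())        # e.g. [P(married), P(single)], sums to 1.0
```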
Data preparation & Initialization

Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of data: each x_i gets mean = 0, stddev = 1
Automatic “horizontalization” (one-hot encoding) of categorical variables, e.g. {full-time, part-time, none, self-employed} ➔ {0,1,0} = part-time, {0,0,0} = self-employed

Automatic initialization of weights:
Poor man’s initialization: random weights w_kl
Default (better): uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
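A short numpy sketch of both preprocessing steps, using toy values:

```python
import numpy as np

# Standardize each input column to mean 0, stddev 1
X = np.array([[25, 30000.], [40, 52000.], [60, 41000.]])  # toy rows
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Uniform weight initialization in +/- sqrt(6/(fan_in + fan_out))
def init_weights(fan_in, fan_out, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = init_weights(2, 4)  # e.g. 2 inputs -> 4 hidden units
print(X_std.mean(axis=0), X_std.std(axis=0), W.shape)
```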
Training: Update Weights & Biases

For each training row, we make a prediction and compare with the actual label (supervised learning):

class   | predicted | actual
married | 0.8       | 1
single  | 0.2       | 0

Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = (0.2² + 0.2²)/2   “penalize differences per-class”
Cross-entropy = -log(0.8)   “strongly penalize non-1-ness”

Stochastic Gradient Descent: update weights and biases via the gradient of the error (computed via back-propagation):
w ← w - rate * ∂E/∂w
[Diagram: one gradient-descent step of size “rate” down the error surface E(w)]
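The two example losses above, computed directly (numpy sketch):

```python
import numpy as np

predicted = np.array([0.8, 0.2])  # [married, single]
actual    = np.array([1.0, 0.0])

mse = np.mean((predicted - actual) ** 2)             # (0.2^2 + 0.2^2)/2 = 0.04
cross_entropy = -np.sum(actual * np.log(predicted))  # -log(0.8) ≈ 0.223
print(mse, cross_entropy)
```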
Backward Propagation

How to compute ∂E/∂w_i for w_i ← w_i - rate * ∂E/∂w_i ?
Naive: for every i, evaluate E twice at (w_1,…,w_i±∆,…,w_N)… Slow!

Backprop: compute ∂E/∂w_i via the chain rule, going backwards through
net = sum_i(w_i*x_i) + b
y = activation(net)
E = error(y)
to get
∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
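A numpy sketch of this chain rule for a single tanh neuron with squared error (toy inputs and weights):

```python
import numpy as np

x = np.array([0.5, -1.2, 0.3])   # inputs x_i
w = np.array([0.1,  0.4, -0.2])  # weights w_i
b, target, rate = 0.0, 1.0, 0.1

net = w @ x + b                  # net = sum_i(w_i*x_i) + b
y = np.tanh(net)                 # y = activation(net)
E = 0.5 * (y - target) ** 2      # E = error(y)

dE_dy = y - target               # ∂E/∂y
dy_dnet = 1.0 - y ** 2           # ∂tanh(net)/∂net
dE_dw = dE_dy * dy_dnet * x      # chain rule: ∂E/∂w_i

w = w - rate * dE_dw             # one gradient-descent step
print(E, dE_dw)
```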
H2O Deep Learning Architecture

[Diagram: H2O nodes/JVMs with HTTPD and atomic in-memory K-V store; communication between nodes/JVMs is synchronous, between threads asynchronous]

initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads
reduce: model averaging: average the weights and biases from all nodes, e.g. w* = (w1+w2+w3+w4)/4 for 4 nodes; speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
updated model: w*, stored in the H2O atomic in-memory K-V store

Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW

*user can specify the number of total rows per MapReduce iteration
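A toy numpy sketch of the map/reduce model-averaging idea, with plain SGD on a linear model standing in for the neural network:

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
Y = X @ true_w + 0.1 * rng.normal(size=400)  # toy linear data

def local_sgd(w, Xs, Ys, rate=0.05, steps=50):
    # "map": one node trains its copy of the model on its local shard
    for _ in range(steps):
        i = rng.integers(len(Ys))
        grad = (Xs[i] @ w - Ys[i]) * Xs[i]
        w = w - rate * grad
    return w

w = np.zeros(2)                              # initial model
shards = np.array_split(np.arange(400), 4)   # 4 nodes, local data shards
for epoch in range(5):                       # keep iterating over the data
    copies = [local_sgd(w.copy(), X[s], Y[s]) for s in shards]
    w = np.mean(copies, axis=0)              # "reduce": model averaging
print(w)  # approaches [2, -1]
```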
“Secret” Sauce to Higher Accuracy

Adaptive learning rate - ADADELTA (Google): automatically set the learning rate for each neuron based on its training history

Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)

Regularization:
L1: penalizes non-zero weights
L2: penalizes large weights
Dropout: randomly ignore certain inputs
Detail: Adaptive Learning Rate

Compute the moving average of ∆w_i² at time t for window length rho:
E[∆w_i²]_t = rho * E[∆w_i²]_{t-1} + (1-rho) * ∆w_i²

Compute the RMS of ∆w_i at time t with smoothing epsilon:
RMS[∆w_i]_t = sqrt( E[∆w_i²]_t + epsilon )

Do the same for ∂E/∂w_i, then obtain the per-weight learning rate:
rate(w_i, t) = RMS[∆w_i]_{t-1} / RMS[∂E/∂w_i]_t

Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents “freezing” (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time

cf. ADADELTA paper
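A direct numpy transcription of these update rules for a single weight, on a toy quadratic error (the gradients here stand in for back-propagated ones):

```python
import numpy as np

rho, epsilon = 0.95, 1e-6
E_g2, E_dw2, w = 0.0, 0.0, 0.5   # moving averages and the weight

def rms(v):
    return np.sqrt(v + epsilon)

for step in range(100):
    g = 2 * w                                # toy gradient ∂E/∂w for E = w^2
    E_g2 = rho * E_g2 + (1 - rho) * g**2     # E[(∂E/∂w)²]_t
    dw = -(rms(E_dw2) / rms(E_g2)) * g       # rate(w,t) * gradient
    E_dw2 = rho * E_dw2 + (1 - rho) * dw**2  # E[∆w²]_t
    w += dw
print(w)  # w has moved toward 0, the minimum of E = w^2
```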
Detail: Dropout Regularization

Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.
[Diagram: random neurons crossed out during training]

Testing: use all activations, but reduce them by a factor p (to “simulate” the missing activations during training).

cf. Geoff Hinton's paper
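A minimal numpy sketch of both phases, using the convention that p is the dropped fraction, so test-time activations are scaled by (1 - p) to match the training-time expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # dropout fraction for this layer
a = np.array([0.3, -0.7, 1.2, 0.5])  # incoming activations

# Training: zero out a different random fraction p on each pass
mask = rng.random(a.shape) >= p
a_train = a * mask

# Testing: keep all activations, scaled to match training expectations
a_test = a * (1 - p)
print(a_train, a_test)
```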
MNIST: digits classification

Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft).

Time to check in on the demo! Let’s see how H2O did in the past 20 minutes!
H2O Deep Learning on MNIST: 0.87% test set error (so far)

test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours

World-class results! No pre-training, no distortions, no convolutions, no unsupervised training. Running on 4 nodes with 16 cores each.

Frequent errors: confuses 2/7 and 4/9
Weather Dataset
Predict “RainTomorrow” from Temperature,
Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction

Interactive ROC curve with real-time updates
Model: 3 hidden Rectifier layers, Dropout, L1-penalty
The 12.7% 5-fold cross-validation error is at least as good as that of GBM/RF/GLM models
Live Demo: Grid Search

How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model
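A hedged sketch of such a grid search, assuming the newer H2O-3 Python API (the live demo used the browser GUI); the file path, hyper-parameter values, and sort metric are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("weather.csv")  # placeholder path
train["RainTomorrow"] = train["RainTomorrow"].asfactor()
x = [c for c in train.columns if c != "RainTomorrow"]

# Fixed settings go in the model; the grid scans the hyper_params dict
grid = H2OGridSearch(
    model=H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                   epochs=10, nfolds=5),
    hyper_params={"hidden": [[50, 50], [100, 100, 100]],
                  "l1": [0, 1e-5],
                  "input_dropout_ratio": [0, 0.2]})
grid.train(x=x, y="RainTomorrow", training_frame=train)

best = grid.get_grid(sort_by="logloss", decreasing=False).models[0]
# ...then continue training the best model (checkpointing)
```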
Use Case: Text Classification

Goal: predict the item from the seller’s text description

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Example: “Vintage 18KT gold Rolex 2 Tone in great condition”
Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 with 1s marking the presence of “vintage”, “gold”, “condition”

Let’s see how H2O does on the ebay dataset!
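A sketch of the binary word-vector encoding, with a toy six-word vocabulary standing in for the dataset's 8,647 columns:

```python
import numpy as np

vocab = ["condition", "gold", "great", "rolex", "tone", "vintage"]
index = {w: i for i, w in enumerate(vocab)}

def encode(text):
    vec = np.zeros(len(vocab), dtype=np.int8)  # one column per word
    for word in text.lower().split():
        if word in index:
            vec[index[word]] = 1  # binary: word present or not
    return vec

print(encode("Vintage 18KT gold Rolex 2 Tone in great condition"))
# -> [1 1 1 1 1 1] for this toy vocabulary
```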
Use Case: Text Classification

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Out-of-the-box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!

Note 1: H2O's columnar-compressed in-memory store needs only 60 MB to store 5 billion values (a dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability
(for 64 epochs on MNIST, with the “0.87%” parameters; 4 cores per node, 1 epoch per node per MapReduce)

[Charts: speedup and training time (in minutes) vs. number of H2O nodes, from 1 to 63; training time drops to 2.7 mins]
Tips for H2O Deep Learning

General:
More layers for more complex functions (exponentially more non-linearity).
More neurons per layer to detect finer structure in the data (“memorizing”).
Add some regularization for less overfitting (smaller validation error).
Do a grid search to get a feel for convergence, then continue training.
Try Tanh first, then Rectifier; try max_w2 = 50 and/or L1 = 1e-5.
Try Dropout (input: 20%, hidden: 50%) with a test/validation set after finding good parameters for convergence on the training set.

Distributed: more training samples per iteration: faster, but less accuracy?
With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
Try balance_classes = true for imbalanced classes.
Use force_load_balance and replicate_training_data for small datasets.
(See the sketch below for how these tips map onto API parameters.)
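A hedged sketch of how the tips above map onto model parameters, assuming the newer H2O-3 Python API (the parameter names match H2O-3; the values are the suggested starting points):

```python
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator(
    activation="TanhWithDropout",      # try Tanh first, then Rectifier
    hidden=[100, 100],                 # more layers/neurons for complexity
    input_dropout_ratio=0.2,           # input dropout 20%
    hidden_dropout_ratios=[0.5, 0.5],  # hidden dropout 50%
    l1=1e-5, max_w2=50,                # regularization
    adaptive_rate=True,                # ADADELTA
    rho=0.99, epsilon=1e-8,
    balance_classes=True,              # for imbalanced classes
    force_load_balance=True,           # for small datasets
    replicate_training_data=True)
# model.train(x=..., y=..., training_frame=...) would then fit as usual
```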
H2O brings Deep Learning to R

All parameters are available from R… and more docs (currently in draft) are coming soon!
POJO Model Export for Production Scoring

Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
Deep Learning Auto-Encoders for Anomaly Detection

Toy example: find the anomaly in ECG heart-beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learn the low-dimensional non-linear “structure” of the data that allows reconstruction of the original data. Also works for categorical data!
Deep Learning Auto-Encoders for Anomaly Detection

Model of what’s “normal” + test set with anomaly => the test set prediction is the reconstruction, which looks “normal”; the anomaly is found by its large reconstruction error!
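A hedged sketch of this workflow, assuming the newer H2O-3 Python API; file names and layer sizes are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
normal = h2o.import_file("ecg_normal.csv")  # 20 "normal" heart beats
test   = h2o.import_file("ecg_test.csv")    # beats including the anomaly

# Auto-encoder: learn to reconstruct "normal" beats through a bottleneck
ae = H2ODeepLearningEstimator(autoencoder=True,
                              hidden=[50, 20, 50],  # low-dim middle layer
                              activation="Tanh",
                              epochs=100)
ae.train(x=normal.columns, training_frame=normal)

# Per-row reconstruction error (MSE); the anomaly has the largest error
print(ae.anomaly(test))
```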
H2O Steam: Scoring Platform
H2O Steam: More Coming Soon!
Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!

Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata
