Deep Learning with H2O
0xdata, H2O.ai

Scalable In-Memory Machine Learning
Hadoop User Group, Chicago, 7/16/14
Arno Candel
Who am I?
PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
7 months at 0xdata/H2O - Machine Learning
15 years in HPC, C++, MPI, Supercomputing
@ArnoCandel
Outline
Intro & Live Demo (5 mins)
Methods & Implementation (20 mins)
Results & Live Demos (25 mins)
MNIST handwritten digits
Text classification
Weather prediction
Q & A (10 mins)
H2O: Open Source In-Memory Prediction Engine for Big Data

Distributed in-memory math platform
➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy-to-use SDK / API
➔ Java, R, Scala, Python, JSON, browser-based GUI

Businesses can use ALL of their data (with or without Hadoop)
➔ Modeling without Sampling

Big Data + Better Algorithms ➔ Better Predictions
About H2O (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the www.h2o.ai/community!
+1 Cyprien Noel for prior work
Customer Demands for Practical Machine Learning

Requirement   ➔ Value
In-Memory     ➔ Fast (Interactive)
Distributed   ➔ Big Data (No Sampling)
Open Source   ➔ Ownership of Methods
API / SDK     ➔ Extensibility
H2O was developed by 0xdata to
meet these requirements
H2O Integration
[Diagram: H2O runs Standalone, Over YARN, or On MRv1, always on HDFS, with APIs for Java, R, Scala, Python, and JSON]
H2O Architecture
[Diagram: distributed in-memory K-V store with column compression and memory manager; MapReduce-based machine learning algorithms (e.g. Deep Learning); R engine; nano-fast scoring/prediction engine]
H2O - The Killer App on Spark
http://databricks.com/blog/2014/06/30/sparkling-water-h20-spark.html
John Chambers (creator of the S language, R-core member) names the H2O R API among the top three most promising R projects
H2O R CRAN package
H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: data resides on the H2O cluster!
H2O Deep Learning in Action

Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes

MNIST = digitized handwritten digits database (Yann LeCun)
Data: 28x28 = 784 pixels with gray-scale values in 0…255

Live Demo: Build an H2O Deep Learning model on the MNIST train/test data
Yann LeCun: “Yet another advice: don't get fooled
by people who claim to have a solution to
Artificial General Intelligence. Ask them what
error rate they get on MNIST or ImageNet.”
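For readers who want to reproduce the demo offline, here is a minimal sketch in Python, assuming the newer H2O-3 Python API (the live demo used H2O's browser GUI); file paths and layer sizes are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()  # connect to (or start) a local H2O cluster

train = h2o.import_file("mnist_train.csv")  # 784 pixel columns + label
test  = h2o.import_file("mnist_test.csv")
x, y = train.columns[:-1], train.columns[-1]
train[y] = train[y].asfactor()  # treat the digit label as a 10-way class
test[y]  = test[y].asfactor()

model = H2ODeepLearningEstimator(hidden=[200, 200],  # arbitrary demo sizes
                                 activation="RectifierWithDropout",
                                 input_dropout_ratio=0.2,
                                 epochs=10)
model.train(x=x, y=y, training_frame=train, validation_frame=test)
print(model.model_performance(test))  # test set error
```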
What is Deep Learning?

Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.

Example: input data (image) ➔ prediction (who is it?)

Facebook's DeepFace (Yann LeCun) recognises faces as well as humans
Deep Learning is Trending

[Google Trends chart, 2011-2013: interest in deep learning rising]

Businesses are using Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese search giant Baidu hires the man behind the “Google Brain” (Andrew Ng)
Deep Learning History
(slides by Yann LeCun, now at Facebook)

Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter
What is NOT Deep

Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (no feature hierarchy)
SVMs and kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on the original input space)
Deep Learning in H2O

1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!
Example Neural Network

A “fully connected” directed graph of neurons; information flows from input to output.
[Diagram: input layer (age, income, employment; 3 neurons) ➔ hidden layer 1 (4 neurons) ➔ hidden layer 2 (3 neurons) ➔ output layer (married, single; 2 neurons); #connections: 3x4, 4x3, 3x2]
Prediction: Forward Propagation

“neurons activate each other via weighted sums”: inputs x_i (age, income, employment) propagate to per-class output probabilities p_l (married, single):

y_j = tanh(sum_i(x_i*u_ij) + b_j)
z_k = tanh(sum_j(y_j*v_jk) + c_k)
p_l = softmax(sum_k(z_k*w_kl) + d_l), where softmax(x_k) = exp(x_k) / sum_k(exp(x_k)) so that sum(p_l) = 1

b_j, c_k, d_l: bias values (independent of inputs)
activation function: tanh; alternative: x -> max(0,x), the “rectifier”
p_l is a non-linear function of x_i: can approximate ANY function with enough layers!
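The three formulas above can be run directly; here is a minimal numpy sketch (weights are random placeholders rather than trained values):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 0.3])               # standardized age, income, employment
U, b = rng.normal(size=(3, 4)), np.zeros(4)  # input -> hidden layer 1
V, c = rng.normal(size=(4, 3)), np.zeros(3)  # hidden 1 -> hidden 2
W, d = rng.normal(size=(3, 2)), np.zeros(2)  # hidden 2 -> output

y = np.tanh(x @ U + b)   # y_j = tanh(sum_i(x_i*u_ij) + b_j)
z = np.tanh(y @ V + c)   # z_k = tanh(sum_j(y_j*v_jk) + c_k)
p = softmax(z @ W + d)   # per-class probabilities
print(p, p.sum())        # e.g. [P(married), P(single)], sums to 1.0
```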
Data preparation & Initialization

Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).

Automatic standardization of data: each x_i gets mean = 0, stddev = 1
Automatic “horizontalization” (one-hot encoding) of categorical variables, e.g. {full-time, part-time, none, self-employed} ➔ {0,1,0} = part-time, {0,0,0} = self-employed

Automatic initialization of weights:
Poor man’s initialization: random weights w_kl
Default (better): uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
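A short numpy sketch of both preprocessing steps, using toy values:

```python
import numpy as np

# Standardize each input column to mean 0, stddev 1
X = np.array([[25, 30000.], [40, 52000.], [60, 41000.]])  # toy rows
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Uniform weight initialization in +/- sqrt(6/(fan_in + fan_out))
def init_weights(fan_in, fan_out, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = init_weights(2, 4)  # e.g. 2 inputs -> 4 hidden units
print(X_std.mean(axis=0), X_std.std(axis=0), W.shape)
```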
Training: Update Weights & Biases

For each training row, we make a prediction and compare with the actual label (supervised learning):

class   | predicted | actual
married | 0.8       | 1
single  | 0.2       | 0

Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = (0.2² + 0.2²)/2   “penalize differences per-class”
Cross-entropy = -log(0.8)   “strongly penalize non-1-ness”

Stochastic Gradient Descent: update weights and biases via the gradient of the error (computed via back-propagation):
w ← w - rate * ∂E/∂w
[Diagram: one gradient-descent step of size “rate” down the error surface E(w)]
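The two example losses above, computed directly (numpy sketch):

```python
import numpy as np

predicted = np.array([0.8, 0.2])  # [married, single]
actual    = np.array([1.0, 0.0])

mse = np.mean((predicted - actual) ** 2)             # (0.2^2 + 0.2^2)/2 = 0.04
cross_entropy = -np.sum(actual * np.log(predicted))  # -log(0.8) ≈ 0.223
print(mse, cross_entropy)
```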
Backward Propagation

How to compute ∂E/∂w_i for w_i ← w_i - rate * ∂E/∂w_i ?
Naive: for every i, evaluate E twice at (w_1,…,w_i±∆,…,w_N)… Slow!

Backprop: compute ∂E/∂w_i via the chain rule, going backwards through
net = sum_i(w_i*x_i) + b
y = activation(net)
E = error(y)
to get
∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
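A numpy sketch of this chain rule for a single tanh neuron with squared error (toy inputs and weights):

```python
import numpy as np

x = np.array([0.5, -1.2, 0.3])   # inputs x_i
w = np.array([0.1,  0.4, -0.2])  # weights w_i
b, target, rate = 0.0, 1.0, 0.1

net = w @ x + b                  # net = sum_i(w_i*x_i) + b
y = np.tanh(net)                 # y = activation(net)
E = 0.5 * (y - target) ** 2      # E = error(y)

dE_dy = y - target               # ∂E/∂y
dy_dnet = 1.0 - y ** 2           # ∂tanh(net)/∂net
dE_dw = dE_dy * dy_dnet * x      # chain rule: ∂E/∂w_i

w = w - rate * dE_dw             # one gradient-descent step
print(E, dE_dw)
```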
H2O Deep Learning Architecture

[Diagram: H2O nodes/JVMs with HTTPD and atomic in-memory K-V store; communication between nodes/JVMs is synchronous, between threads asynchronous]

initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some* or all of) its local data, using asynchronous Fork/Join threads
reduce: model averaging: average the weights and biases from all nodes, e.g. w* = (w1+w2+w3+w4)/4 for 4 nodes; speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
updated model: w*, stored in the H2O atomic in-memory K-V store

Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW

*user can specify the number of total rows per MapReduce iteration
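A toy numpy sketch of the map/reduce model-averaging idea, with plain SGD on a linear model standing in for the neural network:

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
Y = X @ true_w + 0.1 * rng.normal(size=400)  # toy linear data

def local_sgd(w, Xs, Ys, rate=0.05, steps=50):
    # "map": one node trains its copy of the model on its local shard
    for _ in range(steps):
        i = rng.integers(len(Ys))
        grad = (Xs[i] @ w - Ys[i]) * Xs[i]
        w = w - rate * grad
    return w

w = np.zeros(2)                              # initial model
shards = np.array_split(np.arange(400), 4)   # 4 nodes, local data shards
for epoch in range(5):                       # keep iterating over the data
    copies = [local_sgd(w.copy(), X[s], Y[s]) for s in shards]
    w = np.mean(copies, axis=0)              # "reduce": model averaging
print(w)  # approaches [2, -1]
```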
“Secret” Sauce to Higher Accuracy

Adaptive learning rate - ADADELTA (Google): automatically set the learning rate for each neuron based on its training history

Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)

Regularization:
L1: penalizes non-zero weights
L2: penalizes large weights
Dropout: randomly ignore certain inputs
Detail: Adaptive Learning Rate

Compute the moving average of ∆w_i² at time t for window length rho:
E[∆w_i²]_t = rho * E[∆w_i²]_{t-1} + (1-rho) * ∆w_i²

Compute the RMS of ∆w_i at time t with smoothing epsilon:
RMS[∆w_i]_t = sqrt( E[∆w_i²]_t + epsilon )

Do the same for ∂E/∂w_i, then obtain the per-weight learning rate:
rate(w_i, t) = RMS[∆w_i]_{t-1} / RMS[∂E/∂w_i]_t

Adaptive annealing / progress: gradient-dependent learning rate; the moving window prevents “freezing” (unlike ADAGRAD: no window)
Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time

cf. ADADELTA paper
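A direct numpy transcription of these update rules for a single weight, on a toy quadratic error (the gradients here stand in for back-propagated ones):

```python
import numpy as np

rho, epsilon = 0.95, 1e-6
E_g2, E_dw2, w = 0.0, 0.0, 0.5   # moving averages and the weight

def rms(v):
    return np.sqrt(v + epsilon)

for step in range(100):
    g = 2 * w                                # toy gradient ∂E/∂w for E = w^2
    E_g2 = rho * E_g2 + (1 - rho) * g**2     # E[(∂E/∂w)²]_t
    dw = -(rms(E_dw2) / rms(E_g2)) * g       # rate(w,t) * gradient
    E_dw2 = rho * E_dw2 + (1 - rho) * dw**2  # E[∆w²]_t
    w += dw
print(w)  # w has moved toward 0, the minimum of E = w^2
```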
Detail: Dropout Regularization

Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.
[Diagram: random neurons crossed out during training]

Testing: use all activations, but reduce them by a factor p (to “simulate” the missing activations during training).

cf. Geoff Hinton's paper
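A minimal numpy sketch of both phases, using the convention that p is the dropped fraction, so test-time activations are scaled by (1 - p) to match the training-time expectation:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                               # dropout fraction for this layer
a = np.array([0.3, -0.7, 1.2, 0.5])  # incoming activations

# Training: zero out a different random fraction p on each pass
mask = rng.random(a.shape) >= p
a_train = a * mask

# Testing: keep all activations, scaled to match training expectations
a_test = a * (1 - p)
print(a_train, a_test)
```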
MNIST: digits classification

Standing world record: without distortions or convolutions, the best-ever published error rate on the test set is 0.83% (Microsoft).

Time to check in on the demo! Let’s see how H2O did in the past 20 minutes!
H2O Deep Learning on MNIST: 0.87% test set error (so far)

test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours

World-class results! No pre-training, no distortions, no convolutions, no unsupervised training. Running on 4 nodes with 16 cores each.

Frequent errors: confuses 2/7 and 4/9
Weather Dataset
Predict “RainTomorrow” from Temperature,
Humidity, Wind, Pressure, etc.
Live Demo: Weather Prediction

Interactive ROC curve with real-time updates
Model: 3 hidden Rectifier layers, Dropout, L1-penalty
The 12.7% 5-fold cross-validation error is at least as good as that of GBM/RF/GLM models
Live Demo: Grid Search

How did I find those parameters? Grid Search! (works for multiple hyper-parameters at once)
Then continue training the best model
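A hedged sketch of such a grid search, assuming the newer H2O-3 Python API (the live demo used the browser GUI); the file path, hyper-parameter values, and sort metric are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("weather.csv")  # placeholder path
train["RainTomorrow"] = train["RainTomorrow"].asfactor()
x = [c for c in train.columns if c != "RainTomorrow"]

# Fixed settings go in the model; the grid scans the hyper_params dict
grid = H2OGridSearch(
    model=H2ODeepLearningEstimator(activation="RectifierWithDropout",
                                   epochs=10, nfolds=5),
    hyper_params={"hidden": [[50, 50], [100, 100, 100]],
                  "l1": [0, 1e-5],
                  "input_dropout_ratio": [0, 0.2]})
grid.train(x=x, y="RainTomorrow", training_frame=train)

best = grid.get_grid(sort_by="logloss", decreasing=False).models[0]
# ...then continue training the best model (checkpointing)
```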
Use Case: Text Classification

Goal: predict the item from the seller’s text description

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Example: “Vintage 18KT gold Rolex 2 Tone in great condition”
Data: binary word vector, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 with 1s marking the presence of “vintage”, “gold”, “condition”

Let’s see how H2O does on the ebay dataset!
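A sketch of the binary word-vector encoding, with a toy six-word vocabulary standing in for the dataset's 8,647 columns:

```python
import numpy as np

vocab = ["condition", "gold", "great", "rolex", "tone", "vintage"]
index = {w: i for i, w in enumerate(vocab)}

def encode(text):
    vec = np.zeros(len(vocab), dtype=np.int8)  # one column per word
    for word in text.lower().split():
        if word in index:
            vec[index[word]] = 1  # binary: word present or not
    return vec

print(encode("Vintage 18KT gold Rolex 2 Tone in great condition"))
# -> [1 1 1 1 1 1] for this toy vocabulary
```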
Use Case: Text Classification

Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes

Out-of-the-box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!

Note 1: H2O's columnar-compressed in-memory store needs only 60 MB to store 5 billion values (a dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
Parallel Scalability
(for 64 epochs on MNIST, with the “0.87%” parameters; 4 cores per node, 1 epoch per node per MapReduce)

[Charts: speedup and training time (in minutes) vs. number of H2O nodes, from 1 to 63; training time drops to 2.7 mins]
Tips for H2O Deep Learning

General:
More layers for more complex functions (exponentially more non-linearity).
More neurons per layer to detect finer structure in the data (“memorizing”).
Add some regularization for less overfitting (smaller validation error).
Do a grid search to get a feel for convergence, then continue training.
Try Tanh first, then Rectifier; try max_w2 = 50 and/or L1 = 1e-5.
Try Dropout (input: 20%, hidden: 50%) with a test/validation set after finding good parameters for convergence on the training set.

Distributed: more training samples per iteration: faster, but less accuracy?
With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
Try balance_classes = true for imbalanced classes.
Use force_load_balance and replicate_training_data for small datasets.
(See the sketch below for how these tips map onto API parameters.)
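A hedged sketch of how the tips above map onto model parameters, assuming the newer H2O-3 Python API (the parameter names match H2O-3; the values are the suggested starting points):

```python
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator(
    activation="TanhWithDropout",      # try Tanh first, then Rectifier
    hidden=[100, 100],                 # more layers/neurons for complexity
    input_dropout_ratio=0.2,           # input dropout 20%
    hidden_dropout_ratios=[0.5, 0.5],  # hidden dropout 50%
    l1=1e-5, max_w2=50,                # regularization
    adaptive_rate=True,                # ADADELTA
    rho=0.99, epsilon=1e-8,
    balance_classes=True,              # for imbalanced classes
    force_load_balance=True,           # for small datasets
    replicate_training_data=True)
# model.train(x=..., y=..., training_frame=...) would then fit as usual
```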
H2O brings Deep Learning to R

All parameters are available from R… and more docs (currently in draft) are coming soon!
POJO Model Export for Production Scoring

Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
Deep Learning Auto-Encoders for Anomaly Detection

Toy example: find the anomaly in ECG heart-beat data. First, train a model on what’s “normal”: 20 time-series samples of 210 data points each.

Deep Auto-Encoder: learn the low-dimensional non-linear “structure” of the data that allows reconstruction of the original data. Also works for categorical data!
Deep Learning Auto-Encoders for Anomaly Detection

Model of what’s “normal” + test set with anomaly => the test set prediction is the reconstruction, which looks “normal”; the anomaly is found by its large reconstruction error!
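A hedged sketch of this workflow, assuming the newer H2O-3 Python API; file names and layer sizes are placeholders:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
normal = h2o.import_file("ecg_normal.csv")  # 20 "normal" heart beats
test   = h2o.import_file("ecg_test.csv")    # beats including the anomaly

# Auto-encoder: learn to reconstruct "normal" beats through a bottleneck
ae = H2ODeepLearningEstimator(autoencoder=True,
                              hidden=[50, 20, 50],  # low-dim middle layer
                              activation="Tanh",
                              epochs=100)
ae.train(x=normal.columns, training_frame=normal)

# Per-row reconstruction error (MSE); the anomaly has the largest error
print(ae.anomaly(test))
```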
H2O Steam: Scoring Platform
H2O Steam: More Coming Soon!
Key Take-Aways

H2O is a distributed in-memory data science platform. It was designed for high-performance machine learning applications on big data.

H2O Deep Learning is ready to take your advanced analytics to the next level - try it on your data!

Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata
