Deep Learning
with H2O
H2O.ai

Scalable In-Memory Machine Learning
H2O Meetup, Mountain View, 3/20/14
Arno Candel
Who am I?
PhD in Computational Physics (2005) from ETH Zurich, Switzerland
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
3 months at 0xdata/H2O - Machine Learning
10+ years in HPC, C++, MPI, Supercomputing
Arno Candel
Outline
Intro
Theory
Implementation
Results
MNIST handwritten digits classification
Live Demo
Prostate cancer classification and age regression
Text classification
Distributed in-memory math platform 

➔ GLM, GBM, RF, K-Means, PCA, Deep Learning

Easy to use SDK / API

➔ Java, R, Scala, Python, JSON, Browser-based GUI
Businesses can use ALL of their data (w or w/o Hadoop)

➔ Modeling without Sampling



Big Data + Better Algorithms 

➔ Better Predictions
H2O: Open Source In-Memory Prediction Engine for Big Data
About H2O (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the www.h2o.ai/community!
H2O w or w/o Hadoop
(diagram: client APIs R, Java, Scala, JSON, Python on top of H2O clusters running Standalone, over YARN, or on Hadoop MRv1, all reading from HDFS)
H2O Architecture
(diagram: in-memory K-V store, memory manager, MapReduce, compression; Machine Learning Algorithms, e.g. Deep Learning; R Engine; nano-fast Scoring Engine; together the Prediction Engine)
What is Deep Learning?

Wikipedia:

Deep learning is a set of algorithms in machine learning
that attempt to model high-level abstractions in data by
using architectures composed of multiple non-linear
transformations.
Facebook DeepFace (LeCun): “Almost as good as humans at recognising faces”
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Example: input data (facial image) -> prediction (person’s ID)
Deep Learning is trending
(Google Trends chart, 2011-2013)
Deep Learning in H2O
1970s multi-layer feed-forward Neural Network
(supervised learning with back-propagation)
+ distributed processing for big data
(H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup
(H2O Fork/Join worker threads update the model asynchronously)
+ smart algorithms for accuracy
(weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!
Example Neural Network
“fully connected” directed graph of neurons
(diagram: inputs age, income, employment -> Hidden layer 1 -> Hidden layer 2 -> outputs married / not married; information flows from input/output neurons through hidden neurons)
#neurons per layer: 3, 4, 3, 2
#connections: 3x4, 4x3, 3x2
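A quick arithmetic check on the sizes above; a minimal sketch of my own (not from the deck) counting weights and biases for the 3-4-3-2 network:

    # Count weights and biases for a fully connected 3-4-3-2 network
    layers = [3, 4, 3, 2]  # neurons per layer, as on the slide

    n_weights = sum(a * b for a, b in zip(layers, layers[1:]))  # 3*4 + 4*3 + 3*2 = 30
    n_biases = sum(layers[1:])                                  # 4 + 3 + 2 = 9
    print(n_weights, n_biases, n_weights + n_biases)            # 30 9 39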
Prediction: Forward Propagation
“neurons activate each other via weighted sums”
Inputs x_i (age, income, employment), hidden activations y_j and z_k, output probabilities p_l (married / not married):
y_j = tanh(sum_i(x_i*u_ij) + b_j)
z_k = tanh(sum_j(y_j*v_jk) + c_k)
p_l = softmax(sum_k(z_k*w_kl) + d_l), where softmax(x_k) = exp(x_k) / sum_k(exp(x_k))
per-class probabilities: sum_l(p_l) = 1
b_j, c_k, d_l: bias values (independent of inputs)
activation function: tanh; alternative: x -> max(0,x) “rectifier”
p_l is a non-linear function of x_i: can approximate ANY function with enough layers!
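A minimal NumPy sketch of this forward pass (my own illustration, not H2O code); the weights are random placeholders sized for the 3-4-3-2 example network:

    import numpy as np

    def tanh_layer(x, W, b):
        # y_j = tanh(sum_i(x_i * W_ij) + b_j)
        return np.tanh(x @ W + b)

    def softmax(a):
        # softmax(a_k) = exp(a_k) / sum_k exp(a_k), shifted by max for numerical stability
        e = np.exp(a - a.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    U, b = rng.normal(size=(3, 4)), np.zeros(4)   # input -> hidden layer 1
    V, c = rng.normal(size=(4, 3)), np.zeros(3)   # hidden layer 1 -> hidden layer 2
    W, d = rng.normal(size=(3, 2)), np.zeros(2)   # hidden layer 2 -> output

    x = np.array([0.5, -1.2, 0.3])                # standardized inputs (age, income, employment)
    y = tanh_layer(x, U, b)
    z = tanh_layer(y, V, c)
    p = softmax(z @ W + d)                        # per-class probabilities, p.sum() == 1
    print(p)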
Data preparation & Initialization
standardize input x_i: mean = 0, stddev = 1
horizontalize categorical variables, e.g.
{full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
Poor man’s initialization: random weights
Better: Uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated)
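A rough sketch of the preparation steps above, assuming plain NumPy; the column values and category levels are made up for illustration:

    import numpy as np

    # Standardize a numeric column: mean = 0, stddev = 1
    age = np.array([44.0, 57.0, 63.0, 79.0])
    age_std = (age - age.mean()) / age.std()

    # "Horizontalize" a categorical column: one indicator per level, with one
    # baseline level encoded as all zeros (as on the slide).
    levels = ["full-time", "part-time", "none"]          # "self-employed" -> all zeros
    employment = ["part-time", "self-employed", "none"]
    emp_enc = np.array([[1.0 if e == l else 0.0 for l in levels] for e in employment])

    # Uniform weight initialization in +/- sqrt(6 / (#units + #units_previous_layer))
    def init_weights(n_in, n_out, rng=np.random.default_rng(0)):
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    W = init_weights(3, 4)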
Stochastic Gradient Descent
SGD: improve weights and biases for EACH training row
For each training row, we make a prediction and compare with the actual label (supervised training):
predicted: 0.8 (married), 0.2 (not married); actual: 1, 0
Mean Square Error = (0.2^2 + 0.2^2)/2 “penalize differences per-class”
Cross-entropy = -log(0.8) “strongly penalize non-1-ness”
Objective: minimize prediction error (MSE or cross-entropy)
w <— w - rate * ∂E/∂w
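A small sketch (my own, not from the deck) computing both loss choices for the 0.8/0.2 prediction on the slide, plus the generic SGD update:

    import numpy as np

    predicted = np.array([0.8, 0.2])   # softmax output for (married, not married)
    actual = np.array([1.0, 0.0])      # one-hot label

    mse = np.mean((actual - predicted) ** 2)              # (0.2^2 + 0.2^2)/2 = 0.04
    cross_entropy = -np.sum(actual * np.log(predicted))   # -log(0.8) ~ 0.223

    # Generic SGD step: w <- w - rate * dE/dw (the gradient comes from backprop, next slide)
    def sgd_step(w, grad, rate=0.005):
        return w - rate * grad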
Backward Propagation
How to compute ∂E/∂w_i for w_i <— w_i - rate * ∂E/∂w_i ?
Naive: For every i, evaluate E twice at (w_1,…,w_i±∆,…,w_N)… Slow!
Backprop: Compute ∂E/∂w_i via the chain rule, going backwards:
net = sum_i(w_i*x_i) + b
y = activation(net)
E = error(y)
∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
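A minimal sketch of this chain rule for a single tanh neuron with squared error; my own illustration, not H2O's implementation:

    import numpy as np

    def backprop_single_neuron(x, w, b, target, rate=0.01):
        # Forward pass
        net = np.dot(w, x) + b          # net = sum_i(w_i * x_i) + b
        y = np.tanh(net)                # y = activation(net)
        E = 0.5 * (y - target) ** 2     # E = error(y)

        # Backward pass: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
        dE_dy = y - target
        dy_dnet = 1.0 - y ** 2          # derivative of tanh
        grad_w = dE_dy * dy_dnet * x
        grad_b = dE_dy * dy_dnet

        # Gradient-descent update: w <- w - rate * dE/dw
        return w - rate * grad_w, b - rate * grad_b, E

    w, b, E = backprop_single_neuron(np.array([0.5, -1.2, 0.3]), np.zeros(3), 0.0, target=1.0)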
H2O Deep Learning Architecture
(diagram: several nodes/JVMs, each with its own HTTPD and K-V store; nodes communicate synchronously, threads asynchronously)
initial weights and biases w
map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
reduce: average weights and biases from all nodes, w* = (w1+w2+w3+w4)/4
updated weights and biases w* go into the H2O atomic in-memory K-V store
Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW
*mini-batch: number of total rows per iteration, can be less than 1 epoch
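A toy sketch of the map/reduce weight averaging described above; this is my own simplification (per-row SGD on a linear model standing in for the real multi-threaded deep learning updates), not the actual H2O code:

    import numpy as np

    def map_train(w, local_rows, local_targets, rate=0.01):
        # Each node trains its own copy of the weights on its local data.
        w = w.copy()
        for x, t in zip(local_rows, local_targets):
            grad = (np.dot(w, x) - t) * x
            w -= rate * grad
        return w

    def reduce_average(node_weights):
        # w* = (w1 + w2 + ... + wN) / N, written back to the atomic in-memory K-V store
        return np.mean(node_weights, axis=0)

    # One "iteration": map on 4 nodes, then reduce; repeat over the data (epochs)
    rng = np.random.default_rng(0)
    w = np.zeros(3)
    nodes = [(rng.normal(size=(100, 3)), rng.normal(size=100)) for _ in range(4)]
    for _ in range(10):
        w = reduce_average([map_train(w, X, y) for X, y in nodes])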
“Secret” Sauce to Higher Accuracy
Momentum training:

keep changing weights and biases (even if there’s no error) 

“find other local minima, and go faster along valleys”
Adaptive learning rate - ADADELTA (Google):

automatically set learning rate for each neuron based on its
training history, combines annealing and momentum features
Learning rate annealing:

rate r = r0 / (1 + β*N), N = training samples
“dig deeper into local minimum”
Grid Search and Checkpointing:

Run a grid search over multiple hyper-parameters,
then continue training the best model
L1/L2/Dropout/MaxSumWeights regularization:

L1: penalizes non-zero weights, L2: penalizes large weights

Dropout: randomly ignore certain inputs “train exp. many models at once”
MaxSumWeights: Reduce all incoming weights if the sum > max value
“regularization avoids overtraining and improves generalization error”
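For reference, a compact sketch of the per-weight ADADELTA update and the annealing formula above; it follows the ADADELTA paper rather than H2O's source, and rho/epsilon are the same knobs that reappear in the demos and tips below:

    import numpy as np

    class Adadelta:
        # Per-weight adaptive learning rate: running averages of squared gradients
        # and squared updates give an annealing- and momentum-like behavior.
        def __init__(self, shape, rho=0.99, epsilon=1e-8):
            self.rho, self.eps = rho, epsilon
            self.g2 = np.zeros(shape)   # running average of squared gradients
            self.d2 = np.zeros(shape)   # running average of squared updates

        def step(self, w, grad):
            self.g2 = self.rho * self.g2 + (1 - self.rho) * grad ** 2
            delta = -np.sqrt(self.d2 + self.eps) / np.sqrt(self.g2 + self.eps) * grad
            self.d2 = self.rho * self.d2 + (1 - self.rho) * delta ** 2
            return w + delta            # called once per SGD step, one delta per weight

    def annealed_rate(r0, beta, n_samples):
        # Learning rate annealing: r = r0 / (1 + beta * N)
        return r0 / (1.0 + beta * n_samples)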
MNIST: digits classification
Train: 60,000 rows 784 integer columns 10 classes
Test: 10,000 rows 784 integer columns 10 classes
MNIST: Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with values in 0…255 (gray-scale)
One of the most popular multi-class classification problems
Without distortions or convolutions (which help), the best-ever published error rate on test set: 0.83% (Microsoft)
H2O Deep Learning on MNIST: 0.95% test set error (so far), 1 node
test set error: 1.5% after 40 epochs, 1.02% after 400 epochs, 0.95% after 4000 epochs
most frequent mistakes: confuse 4 with 6 and 9, and 7 with 2
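A hedged sketch of a comparable run using today's h2o Python package (the 2014 demo used the R/REST/browser interfaces); the file names, label column, layer sizes, and epoch count are illustrative assumptions, not the exact settings behind the quoted error rates:

    import h2o
    from h2o.estimators import H2ODeepLearningEstimator

    h2o.init()

    # Hypothetical local CSV copies of MNIST (784 pixel columns + 1 label column)
    train = h2o.import_file("mnist_train.csv")
    test = h2o.import_file("mnist_test.csv")

    y = train.columns[-1]                 # assume the label is the last column
    train[y] = train[y].asfactor()        # 10-class classification
    test[y] = test[y].asfactor()

    model = H2ODeepLearningEstimator(activation="Tanh", hidden=[1024, 1024], epochs=40)
    model.train(x=train.columns[:-1], y=y, training_frame=train, validation_frame=test)
    print(model.model_performance(test))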
Prostate Cancer Dataset
Live Demo: Cancer Prediction
Interactive ROC curve with real-time updates
Live Demo: Cancer Prediction
0% training error with only 322 model parameters, in seconds!
Live Demo: Grid Search Regression
Doing a grid search to find good hyper-parameters to predict AGE from the other 7 features,
then continue training the best model:
5 hidden layers of 50 tanh neurons, rho = 0.99, epsilon = 1e-10
Regression: 1 linear output neuron
MSE < 1 for the test set (ages in 44…79)
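A hedged sketch of the same kind of grid-search regression with the current h2o Python API; the dataset path and excluded columns are assumptions, and the live demo itself used the browser GUI:

    import h2o
    from h2o.estimators import H2ODeepLearningEstimator
    from h2o.grid import H2OGridSearch

    h2o.init()
    prostate = h2o.import_file("prostate.csv")    # hypothetical local copy of the dataset

    y = "AGE"                                      # regression target
    x = [c for c in prostate.columns if c not in (y, "ID")]   # the other 7 features (assumed names)

    # Grid search over a few hyper-parameters; the single linear output neuron is
    # implied because the response is numeric.
    grid = H2OGridSearch(
        H2ODeepLearningEstimator(activation="Tanh", epochs=10),
        hyper_params={
            "hidden": [[50] * 5, [100] * 2],
            "rho": [0.95, 0.99],
            "epsilon": [1e-8, 1e-10],
        },
    )
    grid.train(x=x, y=y, training_frame=prostate)

    # Pick the best model by MSE; continued training could point a new estimator
    # at it via the checkpoint parameter.
    best = grid.get_grid(sort_by="mse", decreasing=False).models[0]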
Live Demo: ebay Text Classification
Users enter a description when selling an item
Task: Predict the type of item
Data prep: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0
H2O parses SVMLight sparse format: label 3:1 9:1 13:1 …
“Small” sample dataset on jewelry and watches:
Train: 578,361 rows 8,647 cols 467 classes
Test: 64,263 rows 8,647 cols 143 classes
H2O compressed columnar in-memory store:
Only needs 60MB to store 5 billion entries (never inflated)
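A small sketch of turning an item description into the binary word vector / SVMLight line described above; my own illustration, with a made-up three-word vocabulary:

    def to_svmlight(label, description, vocab):
        # Binary word vector in SVMLight sparse format: "label idx:1 idx:1 ..."
        # (increasing feature indices, only non-zero entries stored)
        idxs = sorted({vocab[w] for w in description.lower().split() if w in vocab})
        return " ".join([str(label)] + [f"{i}:1" for i in idxs])

    vocab = {"silver": 3, "quartz": 9, "bracelet": 13}        # hypothetical word -> column index
    print(to_svmlight(42, "Silver quartz bracelet", vocab))    # "42 3:1 9:1 13:1"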
Live Demo: ebay Text Classification
Work in progress, shown results are for illustration only!
Default parameters, no tuning, 4 nodes (16-cores each)
Train: 578,361 rows 8,647 cols 467 classes
Test: 64,263 rows 8,647 cols 143 classes
Tips for H2O Deep Learning
General:
More layers: more complex functions (non-linearity)
More neurons per layer: detect finer structure in data
More regularization: less overfitting (better validation error)
Do a grid search to get a feel for convergence, then continue training.
Try Tanh first. For Rectifier, try max_w2 = 50 and/or L1 = 1e-5.
Try TanhDropout or RectifierDropout with a test/validation set after finding good parameters for convergence on the training set.
Distributed: Smaller mini-batch: more communication, slower, but higher accuracy.
With ADADELTA: Try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99
Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8
Try momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
Try balance_classes = true for imbalanced classes.
Try force_load_balance for small datasets.
(A hedged parameter sketch follows this list.)
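A hedged sketch of what these tips look like as parameters of today's h2o Python API (the meetup used the R and browser interfaces); the layer sizes are placeholders and the other values are simply the slide's suggestions:

    from h2o.estimators import H2ODeepLearningEstimator

    # With ADADELTA (adaptive_rate=True is the default): tune rho and epsilon.
    dl_adaptive = H2ODeepLearningEstimator(
        activation="RectifierWithDropout",   # try dropout once plain Tanh/Rectifier converges
        hidden=[200, 200],                   # placeholder layer sizes
        max_w2=50, l1=1e-5,                  # Rectifier tips from the slide
        rho=0.99, epsilon=1e-8,
        balance_classes=True,                # for imbalanced classes
        force_load_balance=True,             # for small datasets
    )

    # Without ADADELTA: manual rate, annealing, and momentum.
    dl_manual = H2ODeepLearningEstimator(
        activation="Tanh", hidden=[200, 200],
        adaptive_rate=False,
        rate=1e-3, rate_annealing=1e-6,
        momentum_start=0.5, momentum_stable=0.99,
        momentum_ramp=1.0 / 1e-6,            # slide suggests momentum_ramp = 1/rate_annealing
    )
    # Training then proceeds as usual: h2o.init(); model.train(x=..., y=..., training_frame=...)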
Summary
H2O is a distributed in-memory math platform that
allows fast prototyping in Java, R, Scala and Python.
H2O enables the development of enterprise-quality, blazing-fast machine learning applications.
H2O Deep Learning is distributed, easy to use, and early results compete with the world’s best.
Deep Learning makes better predictions!
Try it yourself and join our next meetup!

git clone https://github.com/0xdata/h2o

H2O Open Source Deep Learning, Arno Candel 03-20-14
