Webinar: Deep Learning with H2O

Deep Learning
with H2O
H2O.ai 
Scalable In-Memory Machine Learning
Webinar, 5/21/14
SriSatish Ambati, CEO and Co-Founder
Arno Candel, PhD, Physicist & Hacker

H2O Deep Learning, @ArnoCandel
Outline
Intro & Live Demo (5 mins)
Methods & Implementation (10 mins)
Results & Live Demo (10 mins)
MNIST handwritten digits
text classification
Q & A (10 mins)

About H2O (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the www.h2o.ai/community!
+1 Cyprien Noel
for prior work

Customer Demands for Practical Machine Learning
Requirements    Value
In-Memory       Fast (Interactive)
Distributed     Big Data (No Sampling)
Open Source     Ownership of Methods
API / SDK       Extensibility
H2O was developed by 0xdata to meet these requirements

H2O Integration
[Diagram: H2O running Standalone, Over YARN, and On MRv1 (Hadoop MR), each on top of HDFS; interfaces: R, Scala, JSON, Python, Java]

H2O Architecture
Distributed In-Memory K-V store
Col. compression
Memory manager
MapReduce
Machine Learning Algorithms (e.g. Deep Learning)
R Engine
Nano fast Scoring Engine
Prediction Engine

H2O + R = Happy Data Scientist
Machine Learning on Big Data with R: 
Data resides on the H2O cluster!

H2O Deep Learning in Action
MNIST = Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with (gray-scale) values in 0…255
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
Live Demo: Build an H2O Deep Learning model on MNIST train/test data
Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”

What is Deep Learning?
Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
Example: Input data (image) -> Prediction (who?)
Facebook's DeepFace (Yann LeCun) recognises faces as well as humans

Deep Learning is Trending
[Figure: Google Trends chart, 2011 to 2013]
Businesses are using Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the “Google Brain” (Andrew Ng)

What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space)

Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ breakthrough algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!

Example Neural Network
“fully connected” directed graph of neurons
Input layer (3 neurons): age, income, employment
Hidden layer 1 (4 neurons)
Hidden layer 2 (3 neurons)
Output layer (2 neurons): married, single
#connections: 3x4, 4x3, 3x2
Legend: information flow, input/output neuron, hidden neuron
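As a rough sketch of the bookkeeping behind the example network, the Python below counts the connections between consecutive layers for the layer sizes on the slide (3, 4, 3, 2). The function name count_parameters is made up for illustration, and the bias count assumes one bias value per hidden and output neuron.

```python
# Layer sizes read off the slide: 3 inputs, hidden layers of 4 and 3 neurons, 2 outputs.
layer_sizes = [3, 4, 3, 2]


def count_parameters(sizes):
    """Return (connections per layer pair, biases per non-input layer)."""
    # A fully connected layer pair has n_in * n_out connections (weights).
    weights = [n_in * n_out for n_in, n_out in zip(sizes, sizes[1:])]
    # Assumption: one bias per neuron in every layer except the input layer.
    biases = list(sizes[1:])
    return weights, biases


weights, biases = count_parameters(layer_sizes)
print(weights)                      # [12, 12, 6], i.e. 3x4, 4x3, 3x2
print(sum(weights) + sum(biases))   # 39 trainable values in total
```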

Prediction: Forward Propagation
“neurons activate each other via weighted sums”
Inputs x_i: age, income, employment
Hidden layer 1: y_j = tanh(sum_i(x_i*u_ij) + b_j)
Hidden layer 2: z_k = tanh(sum_j(y_j*v_jk) + c_k)
Output layer: p_l = softmax(sum_k(z_k*w_kl) + d_l)
softmax(x_k) = exp(x_k) / sum_m(exp(x_m))
Outputs p_l: per-class probabilities (married, single), sum(p_l) = 1
b_j, c_k, d_l: bias values (indep. of inputs)
activation function: tanh; alternative: x -> max(0,x) “rectifier”
p_l is a non-linear function of x_i: can approximate ANY function with enough layers!
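The forward pass on this slide can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not H2O's implementation: the helper names (dense, softmax, forward) and all weight, bias and input values are made up, and the layer shapes follow the 3-4-3-2 example network.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sum per neuron; weights[i][j] connects input i to neuron j."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(values):
    """exp(x_k) / sum_m(exp(x_m)), shifted by the max for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, u, b, v, c, w, d):
    y = [math.tanh(a) for a in dense(x, u, b)]   # hidden layer 1
    z = [math.tanh(a) for a in dense(y, v, c)]   # hidden layer 2
    return softmax(dense(z, w, d))               # per-class probabilities


# Made-up parameters for illustration only.
u = [[0.1, -0.2, 0.3, 0.0], [0.4, 0.1, -0.1, 0.2], [-0.3, 0.2, 0.1, 0.5]]  # 3x4
b = [0.0, 0.1, -0.1, 0.0]
v = [[0.2, -0.1, 0.3], [0.0, 0.4, -0.2], [0.1, 0.1, 0.1], [-0.3, 0.2, 0.0]]  # 4x3
c = [0.0, 0.0, 0.1]
w = [[0.5, -0.5], [0.3, 0.2], [-0.1, 0.4]]  # 3x2
d = [0.0, 0.0]

x = [0.5, -1.0, 0.25]  # hypothetical scaled values for age, income, employment
p = forward(x, u, b, v, c, w, d)
print(p, sum(p))  # two probabilities (married, single) that sum to 1
```

Swapping math.tanh for a function that returns max(0, a) would give the “rectifier” alternative named on the slide.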

Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
          predicted  actual
married   0.8        1
single    0.2        0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = (0.2^2 + 0.2^2)/2 “penalize differences per-class”
Cross-entropy = -log(0.8) “strongly penalize non-1-ness”
Stochastic Gradient Descent: Update weights and biases via gradient of the error (via back-propagation):
w <- w - rate * ∂E/∂w
[Figure: error E plotted against weight w, annotated with rate]
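To make the two error measures and the update rule concrete, here is a small Python sketch using the slide's numbers (predicted 0.8/0.2, actual 1/0). The function names and the one-dimensional toy error E(w) = (w - 3)^2 are assumptions chosen for illustration; real back-propagation computes the gradient for every weight in the network.

```python
import math

predicted = {"married": 0.8, "single": 0.2}
actual = {"married": 1.0, "single": 0.0}


def mean_square_error(pred, target):
    """Average of squared per-class differences."""
    return sum((pred[k] - target[k]) ** 2 for k in pred) / len(pred)


def cross_entropy(pred, target):
    """Negative log of the probability assigned to the true class."""
    return -sum(target[k] * math.log(pred[k]) for k in pred if target[k] > 0)


mse = mean_square_error(predicted, actual)  # (0.2^2 + 0.2^2)/2
ce = cross_entropy(predicted, actual)       # -log(0.8)
print(mse, ce)


def sgd_step(w, grad, rate):
    """One update: w <- w - rate * dE/dw."""
    return w - rate * grad


# Toy error surface (an assumption): E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
w = 0.0
for _ in range(25):
    w = sgd_step(w, 2.0 * (w - 3.0), rate=0.1)
print(w)  # moves toward the minimum at w = 3
```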

H2O Deep Learning Architecture
initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
reduce: model averaging: average weights and biases from all nodes (partial sums w1+w3 and w2+w4, then w* = (w1+w2+w3+w4)/4), speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
updated model: w*
Keep iterating over the data (“epochs”), score from time to time
communication: nodes/JVMs: sync, threads: async
H2O atomic in-memory K-V store
Query & display the model via JSON, WWW (HTTPD)
*user can specify the number of total rows per MapReduce iteration
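The map/reduce loop on this slide can be illustrated with a toy example. The sketch below is not H2O code: it assumes a tiny linear model, four hypothetical "nodes" holding made-up data shards, and sequential training per node in place of H2O's asynchronous Fork/Join threads. The names train_local_copy and average_models are invented for illustration.

```python
def train_local_copy(weights, rows, rate=0.1):
    """Map step: one node runs SGD over its local rows on its own copy of w."""
    w = list(weights)
    for x, y in rows:
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - rate * error * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Reduce step: element-wise average, w* = (w1 + w2 + ... + wn) / n."""
    return [sum(ws) / len(models) for ws in zip(*models)]


# Made-up shards, one per node; every row satisfies y = 2*x0 - 1*x1.
shards = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0)],
    [((1.0, 1.0), 1.0), ((1.0, -1.0), 3.0)],
    [((2.0, 1.0), 3.0), ((0.0, 1.0), -1.0)],
    [((1.0, 2.0), 0.0), ((1.0, 0.0), 2.0)],
]

w = [0.0, 0.0]  # initial model
for iteration in range(200):  # each pass is one MapReduce iteration
    local_models = [train_local_copy(w, rows) for rows in shards]  # map
    w = average_models(local_models)                               # reduce
print(w)  # approaches the generating weights [2.0, -1.0]
```

In this toy setting every node sees all of its local rows per iteration; the footnote on the slide says the number of rows per MapReduce iteration is user-configurable in H2O.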

“Secret” Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google)
Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing
Run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization
L1: penalizes non-zero weights
L2: penalizes large weights
Dropout: randomly ignore certain inputs
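For readers unfamiliar with ADADELTA, the following Python sketch shows the update rule as published by Zeiler (2012) for a single scalar weight. It is a generic illustration, not H2O's code: the class name, the toy error E(w) = (w - 3)^2, and the defaults rho=0.95 and epsilon=1e-6 (taken from the paper) are assumptions here.

```python
import math


class AdaDelta:
    """ADADELTA state for one weight: running averages of g^2 and step^2."""

    def __init__(self, rho=0.95, epsilon=1e-6):
        self.rho = rho
        self.epsilon = epsilon
        self.avg_sq_grad = 0.0
        self.avg_sq_step = 0.0

    def step(self, grad):
        # Decaying average of squared gradients (the "training history").
        self.avg_sq_grad = self.rho * self.avg_sq_grad + (1 - self.rho) * grad * grad
        # Step size is the ratio of RMS(previous steps) to RMS(gradients).
        delta = (
            -math.sqrt(self.avg_sq_step + self.epsilon)
            / math.sqrt(self.avg_sq_grad + self.epsilon)
            * grad
        )
        # Decaying average of squared steps, used by the next update.
        self.avg_sq_step = self.rho * self.avg_sq_step + (1 - self.rho) * delta * delta
        return delta


# Toy error surface (an assumption): E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3).
optimizer = AdaDelta()
w = 0.0
history = []
for _ in range(10):
    w += optimizer.step(2.0 * (w - 3.0))
    history.append(w)
print(history)  # w creeps toward 3 without a hand-tuned learning rate
```

No global learning rate appears anywhere in the update, which is the point of the method: the effective rate is derived from the history of gradients and steps.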

MNIST: digits classification
Standing world record: Without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
Time to check in on the demo!
Let’s see how H2O did in the past 10 minutes!

H2O Deep Learning on MNIST: 0.87% test set error (so far)
test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
Running on 4 nodes with 16 cores each
World-class results!
No pre-training
No distortions
No convolutions
No unsupervised training
Frequent errors: confuse 2/7 and 4/9

Use Case: Text Classification
Goal: Predict the item from seller’s text description
“Vintage 18KT gold Rolex 2 Tone in great condition”
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the 1s mark words present in the description, e.g. vintage, gold, condition)
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Let’s see how H2O does on the eBay dataset!
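A binary word vector like the one on this slide can be built as follows. This is a minimal sketch: the seven-word vocabulary is hypothetical (the real dataset has 8,647 columns), and simple lower-casing plus whitespace splitting is assumed as the tokenizer.

```python
def binary_word_vector(text, vocabulary):
    """Return 1 for each vocabulary term that occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical vocabulary; each entry corresponds to one input column.
vocabulary = ["antique", "condition", "gold", "leather", "rolex", "silver", "vintage"]

description = "Vintage 18KT gold Rolex 2 Tone in great condition"
vector = binary_word_vector(description, vocabulary)
print(vector)  # [0, 1, 1, 0, 1, 0, 1]
```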

Use Case: Text Classification
Out-Of-The-Box: 11.6% test set error after 10 epochs!
Predicts the correct class (out of 143) 88.4% of the time!
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
Note 2: No tuning was done (results are for illustration only)
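The "5 billion values" in Note 1 matches the size of the training set (rows times columns), and the storage figures can be turned into per-value costs with a little arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), which the slide does not specify.

```python
train_rows = 578_361
columns = 8_647
values = train_rows * columns
print(values)  # 5001087567, i.e. about 5 billion

compressed_bytes = 60 * 10**6  # 60 MB, assuming 1 MB = 10^6 bytes
csv_bytes = 18 * 10**9         # 18 GB, assuming 1 GB = 10^9 bytes

bits_per_value = compressed_bytes * 8 / values   # roughly 0.1 bit per value
csv_bytes_per_value = csv_bytes / values         # roughly 3.6 bytes per value
compression_ratio = csv_bytes / compressed_bytes
print(bits_per_value, csv_bytes_per_value, compression_ratio)

accuracy = 100.0 - 11.6  # 88.4% correct from 11.6% test set error
print(accuracy)
```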

Parallel Scalability
(for 64 epochs on MNIST, with “0.87%” parameters)
[Figure: Speedup vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
[Figure: Training time in minutes vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); annotation: 2.7 mins]

Outlook for H2O Deep Learning
Convolutional and Pooling Layers for General Image Recognition (ImageNet)
Sparse Auto-Encoders for Dimensionality Reduction and Anomaly Detection
Execution on GPU clusters for even faster training

H2O Steam: Scoring Platform

H2O Steam: More Coming Soon!

Key Take-Aways
H2O is a distributed in-memory math platform for enterprise-grade machine learning applications.
H2O Deep Learning is ready to take your advanced analytics to the next level - Try it on your data!
Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata
