Note: Make sure to download the slides to get the high-resolution version!
Also, you can find the webinar recording here (please also download for better quality): https://www.dropbox.com/s/72qi6wjzi61gs3q/H2ODeepLearningArnoCandel052114.mov
Come hear how Deep Learning in H2O is unlocking never-before-seen performance for prediction!
H2O is a Google-scale open source machine learning engine for R & Big Data. Enterprises can now use all of their data without sampling and build intelligent applications. This live webinar introduces Distributed Deep Learning concepts, implementation, and results from recent developments. Real-world classification and regression use cases on an eBay text dataset, MNIST handwritten digits, and cancer datasets demonstrate the power of this game-changing technology.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. H2O Deep Learning, @ArnoCandel
Outline
Intro & Live Demo (5 mins)
Methods & Implementation (10 mins)
Results & Live Demo (10 mins)
MNIST handwritten digits
text classification
Q & A (10 mins)
3. H2O Deep Learning, @ArnoCandel
About H2O (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the www.h2o.ai/community!
+1 Cyprien Noel
for prior work
4. H2O Deep Learning, @ArnoCandel
Customer Demands for Practical Machine Learning
Requirements    Value
In-Memory       Fast (Interactive)
Distributed     Big Data (No Sampling)
Open Source     Ownership of Methods
API / SDK       Extensibility
H2O was developed by 0xdata to meet these requirements
5. H2O Deep Learning, @ArnoCandel
H2O Integration
[Diagram: H2O deployment modes: Standalone, Over YARN, On MRv1 (Hadoop MR), with data in HDFS. Interfaces shown: R, Scala, JSON, Python, Java.]
6. H2O Deep Learning, @ArnoCandel
H2O Architecture
[Diagram components: Distributed In-Memory K-V store with column compression; Memory manager; MapReduce; Machine Learning Algorithms (e.g. Deep Learning); R Engine; Nano-fast Scoring Engine; Prediction Engine]
7. H2O Deep Learning, @ArnoCandel
H2O + R = Happy Data Scientist
Machine Learning on Big Data with R:
Data resides on the H2O cluster!
8. H2O Deep Learning, @ArnoCandel
H2O Deep Learning in Action
Train: 60,000 rows 784 integer columns 10 classes
Test: 10,000 rows 784 integer columns 10 classes
MNIST = Digitized handwritten
digits database (Yann LeCun)
Live Demo: Build an H2O Deep Learning
model on MNIST train/test data
Data: 28x28=784 pixels with
(gray-scale) values in 0…255
Yann LeCun: “Yet another advice: don't get fooled
by people who claim to have a solution to
Artificial General Intelligence. Ask them what
error rate they get on MNIST or ImageNet.”
9. H2O Deep Learning, @ArnoCandel
Wikipedia:
Deep learning is a set of algorithms in
machine learning that attempt to model
high-level abstractions in data by using
architectures composed of multiple
non-linear transformations.
What is Deep Learning?
Example:
Input data
(image)
Prediction
(who?)
Facebook's DeepFace (Yann LeCun)
recognises faces as well as humans
10. H2O Deep Learning, @ArnoCandel
Deep Learning is Trending
[Chart: Google Trends interest in Deep Learning, 2011 to 2013]
Businesses are using
Deep Learning techniques!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
FBI FACE: $1 billion face recognition project
Chinese Search Giant Baidu Hires Man Behind the “Google Brain” (Andrew Ng)
11. H2O Deep Learning, @ArnoCandel
What is NOT Deep
Linear models are not deep (by definition)
Neural nets with 1 hidden layer are not deep (no feature hierarchy)
SVMs and Kernel methods are not deep (2 layers: kernel + linear)
Classification trees are not deep (operate on original input space)
12. H2O Deep Learning, @ArnoCandel
Deep Learning in H2O
1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
+ distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
+ multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
+ breakthrough algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
= Top-notch prediction engine!
13. H2O Deep Learning, @ArnoCandel
Example Neural Network
“fully connected” directed graph of neurons
Layer            #neurons   Neurons
Input layer      3          age, income, employment
Hidden layer 1   4          (hidden)
Hidden layer 2   3          (hidden)
Output layer     2          married, single
#connections between consecutive layers: 3x4, 4x3, 3x2
[Diagram legend: information flow, input/output neuron, hidden neuron]
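As a quick check of the connection counts in this example network, the sketch below multiplies the sizes of consecutive layers. The layer sizes 3, 4, 3, 2 are read off the slide; the function name is hypothetical.

```python
# Layer sizes from the example: input, hidden layer 1, hidden layer 2, output.
layer_sizes = [3, 4, 3, 2]


def count_connections(sizes):
    """In a fully connected net, consecutive layers of a and b neurons share a*b weights."""
    return [a * b for a, b in zip(sizes, sizes[1:])]


per_layer = count_connections(layer_sizes)  # 3x4, 4x3, 3x2
total_weights = sum(per_layer)
total_biases = sum(layer_sizes[1:])  # one bias per non-input neuron
```

For this network that gives 12 + 12 + 6 = 30 weights and 9 biases, which is why fully connected layers on 784 inputs grow large quickly.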
14. H2O Deep Learning, @ArnoCandel
Prediction: Forward Propagation
“neurons activate each other via weighted sums”
Inputs $x_i$: age, income, employment
Hidden layer 1: $y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
Hidden layer 2: $z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
Outputs (married, single): $p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$
per-class probabilities: $\sum_l p_l = 1$
$\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$
$b_j$, $c_k$, $d_l$: bias values (indep. of inputs)
activation function: tanh
alternative: x -> max(0,x) “rectifier”
$p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!
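The forward pass above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not H2O's implementation: the weight and bias values are made up, and the helper names `dense`, `softmax`, and `forward` are hypothetical.

```python
import math


def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(values):
    """exp(v_k) / sum_k' exp(v_k'), shifted by the max for numerical stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, u, b, v, c, w, d):
    """Two tanh hidden layers followed by a softmax output layer."""
    y = [math.tanh(s) for s in dense(x, u, b)]
    z = [math.tanh(s) for s in dense(y, v, c)]
    return softmax(dense(z, w, d))


# Made-up parameters for the 3-4-3-2 example network (illustration only).
x = [0.5, -1.0, 0.25]                       # age, income, employment (already scaled)
u = [[0.1, -0.2, 0.3, 0.0],
     [0.4, 0.1, -0.1, 0.2],
     [-0.3, 0.2, 0.1, 0.5]]                 # 3x4
b = [0.0, 0.1, -0.1, 0.05]
v = [[0.2, -0.1, 0.0],
     [0.1, 0.3, -0.2],
     [-0.4, 0.2, 0.1],
     [0.0, -0.3, 0.2]]                      # 4x3
c = [0.0, 0.0, 0.1]
w = [[0.3, -0.3], [-0.2, 0.2], [0.1, 0.4]]  # 3x2
d = [0.05, -0.05]

p = forward(x, u, b, v, c, w, d)            # [P(married), P(single)]
```

Swapping `math.tanh` for `lambda s: max(0.0, s)` would give the "rectifier" alternative mentioned on the slide.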
15. H2O Deep Learning, @ArnoCandel
Training: Update Weights & Biases
For each training row, we make a prediction and compare with the actual label (supervised learning):
          predicted   actual
married   0.8         1
single    0.2         0
Objective: minimize prediction error (MSE or cross-entropy)
Mean Square Error = (0.2^2 + 0.2^2)/2 “penalize differences per-class”
Cross-entropy = -log(0.8) “strongly penalize non-1-ness”
Stochastic Gradient Descent: Update weights and biases via gradient of the error (via back-propagation):
w <- w - rate * ∂E/∂w
[Plot: error E as a function of a weight w, with the step size set by rate]
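The two error measures and the update rule can be checked numerically with a small sketch. The predicted and actual values are the ones on the slide; the quadratic toy error used for the gradient descent loop is an assumption chosen only because its gradient is easy to write down.

```python
import math

predicted = {"married": 0.8, "single": 0.2}
actual = {"married": 1.0, "single": 0.0}


def mean_square_error(pred, act):
    """Average of squared per-class differences."""
    return sum((pred[k] - act[k]) ** 2 for k in act) / len(act)


def cross_entropy(pred, act):
    """-sum(actual * log(predicted)); only the true class contributes."""
    return -sum(act[k] * math.log(pred[k]) for k in act if act[k] > 0)


mse = mean_square_error(predicted, actual)  # (0.2^2 + 0.2^2) / 2
ce = cross_entropy(predicted, actual)       # -log(0.8)


def sgd_step(w, grad, rate):
    """One update: w <- w - rate * dE/dw."""
    return w - rate * grad


# Toy error E(w) = (w - 3)^2 with gradient 2 * (w - 3); not from the slides.
w = 0.0
for _ in range(50):
    w = sgd_step(w, 2.0 * (w - 3.0), rate=0.1)
```

In a real network the gradient for each weight comes from back-propagation rather than a closed-form expression, but the update itself has the same shape.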
16. H2O Deep Learning, @ArnoCandel
H2O Deep Learning Architecture
initial model: weights and biases w
map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
reduce: model averaging: average weights and biases from all nodes, e.g. w* = (w1+w2+w3+w4)/4 for 4 nodes (partial sums w1+w3 and w2+w4 are combined first); speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
updated model: w*
Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW
communication: nodes/JVMs: sync, threads: async
[Diagram labels: HTTPD and K-V on each node; H2O atomic in-memory K-V store]
*user can specify the number of total rows per MapReduce iteration
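The map/reduce loop above can be sketched on a single machine. This is a simplified illustration, not H2O code: the "nodes" are plain lists of rows, the model has a single weight fitting y = w * x, and the asynchronous Fork/Join threads inside each node are not modeled. The function names and data are hypothetical.

```python
def local_train(weights, rows, rate=0.1):
    """Map step stand-in: one SGD pass over a node's local rows, fitting y = w * x."""
    w = list(weights)
    for x, y in rows:
        err = w[0] * x - y
        w[0] -= rate * err * x  # gradient of 0.5 * err^2 with respect to w
    return w


def average_models(models):
    """Reduce step: element-wise average, w* = (w1 + ... + wn) / n."""
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]


# Four hypothetical nodes, each holding a shard of data drawn from y = 2x.
shards = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(2.0, 4.0), (3.0, 6.0)],
    [(1.0, 2.0), (3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0)],
]

w = [0.0]  # initial model
for epoch in range(20):
    models = [local_train(w, shard) for shard in shards]  # map
    w = average_models(models)                            # reduce
```

Each iteration starts every node from the same averaged model, which mirrors the "initial model w, updated model w*" cycle on the slide.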
17. H2O Deep Learning, @ArnoCandel
“Secret” Sauce to Higher Accuracy
Adaptive learning rate - ADADELTA (Google): Automatically set learning rate for each neuron based on its training history
Grid Search and Checkpointing: Run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
Regularization:
L1: penalizes non-zero weights
L2: penalizes large weights
Dropout: randomly ignore certain inputs
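To make the adaptive learning rate concrete, here is a minimal sketch of the ADADELTA update for a single scalar weight, following the published update rule (running averages of squared gradients and squared steps). The values of `rho` and `eps` and the toy objective are assumptions for illustration and are not necessarily what H2O uses.

```python
import math


def adadelta_minimize(grad_fn, w, steps, rho=0.95, eps=1e-6):
    """ADADELTA for one weight: the step size adapts to the gradient history."""
    avg_sq_grad = 0.0  # running average of squared gradients
    avg_sq_step = 0.0  # running average of squared updates
    for _ in range(steps):
        g = grad_fn(w)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # Ratio of RMS(previous steps) to RMS(gradients) replaces a fixed rate.
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step * step
        w += step
    return w


# Toy objective E(w) = w^2 with gradient 2w, starting from w = 1.0.
w_final = adadelta_minimize(lambda w: 2.0 * w, w=1.0, steps=50)
```

No learning rate is passed in: the method starts with very small steps and grows them as the step history accumulates, so the weight moves toward the minimum at 0 without manual tuning.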
18. H2O Deep Learning, @ArnoCandel
MNIST: digits classification
Standing world record:
Without distortions or convolutions, the best-ever
published error rate on test set: 0.83% (Microsoft)
Time to check in
on the demo!
Let’s see how H2O did in the past 10 minutes!
19. H2O Deep Learning, @ArnoCandel
H2O Deep Learning on MNIST: 0.87% test set error (so far)
Test set error (running on 4 nodes with 16 cores each):
1.5% after 10 mins
1.0% after 1.5 hours
0.87% after 4 hours
Frequent errors: confuse 2/7 and 4/9
World-class results!
No pre-training
No distortions
No convolutions
No unsupervised training
20. H2O Deep Learning, @ArnoCandel
Use Case: Text Classification
Goal: Predict the item from
seller’s text description
Train: 578,361 rows 8,647 cols 467 classes
Test: 64,263 rows 8,647 cols 143 classes
“Vintage 18KT gold Rolex 2 Tone
in great condition”
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (each 1 marks a word present in the description, e.g. vintage, gold, condition)
Let’s see how H2O does on the eBay dataset!
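The binary word vector encoding can be sketched as follows. The eight-word vocabulary is hypothetical (the real dataset has 8,647 columns), and the whitespace tokenization is an assumption; the slides do not describe how the text was tokenized.

```python
# Hypothetical vocabulary; each word owns one column of the feature vector.
vocabulary = ["18kt", "condition", "gold", "great", "rolex", "tone", "vintage", "watch"]
index = {word: i for i, word in enumerate(vocabulary)}


def binary_word_vector(text):
    """1 if the vocabulary word occurs in the text, else 0 (counts are ignored)."""
    vec = [0] * len(vocabulary)
    for token in text.lower().split():
        if token in index:
            vec[index[token]] = 1
    return vec


vec = binary_word_vector("Vintage 18KT gold Rolex 2 Tone in great condition")
```

Tokens outside the vocabulary ("2", "in") are dropped, and with thousands of columns almost every entry is 0, which is what makes such data highly compressible.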
21. H2O Deep Learning, @ArnoCandel
Out-Of-The-Box: 11.6% test set error after 10 epochs!
Predicts the correct class (out of 143) 88.4% of the time!
Note 2: No tuning was done
(results are for illustration only)
Train: 578,361 rows 8,647 cols 467 classes
Test: 64,263 rows 8,647 cols 143 classes
Note 1: H2O columnar-compressed in-memory
store only needs 60 MB to store 5 billion
values (dense CSV needs 18 GB)
Use Case: Text Classification
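The figures in Note 1 can be checked with simple arithmetic. The sketch below assumes the "5 billion values" refers to the training set alone (rows times columns) and that GB and MB are decimal units; both are assumptions, since the slide does not spell them out.

```python
train_rows = 578_361
columns = 8_647
cells = train_rows * columns            # number of values in the training matrix

dense_csv_bytes = 18 * 10**9            # 18 GB, assuming decimal units
compressed_bytes = 60 * 10**6           # 60 MB, assuming decimal units

compression_ratio = dense_csv_bytes / compressed_bytes
bits_per_value = compressed_bytes * 8 / cells
```

Under these assumptions the training matrix holds 5,001,087,567 values, the in-memory store is 300 times smaller than the dense CSV, and each value costs less than a tenth of a bit on average, which is plausible only because the binary word vectors are almost entirely zeros.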
22. H2O Deep Learning, @ArnoCandel
Parallel Scalability
(for 64 epochs on MNIST, with “0.87%” parameters)
[Chart: Speedup vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
[Chart: Training time in minutes vs. number of H2O nodes (1, 2, 4, 8, 16, 32, 63); annotated value: 2.7 mins]
23. H2O Deep Learning, @ArnoCandel
Outlook for H2O Deep Learning
Convolutional and Pooling Layers for
General Image Recognition (ImageNet)
Sparse Auto-Encoders for Dimensionality
Reduction and Anomaly Detection
Execution on GPU clusters for even
faster training
26. H2O Deep Learning, @ArnoCandel
Key Take-Aways
H2O is a distributed in-memory math platform for
enterprise-grade machine learning applications.
H2O Deep Learning is ready to take your advanced
analytics to the next level - Try it on your data!
Join our Community and Meetups!
git clone https://github.com/0xdata/h2o
http://docs.0xdata.com
www.h2o.ai/community
@hexadata