Deep Learning 
through Examples 
Arno Candel 
! 
0xdata, H2O.ai 
Scalable In-Memory Machine Learning 
! 
Silicon Valley Big Data Science Meetup, 
Palo Alto, 9/3/14 
!
Who am I? 
@ArnoCandel 
PhD in Computational Physics, 2005 
from ETH Zurich Switzerland 
! 
6 years at SLAC - Accelerator Physics Modeling 
2 years at Skytree, Inc - Machine Learning 
9 months at 0xdata/H2O - Machine Learning 
! 
15 years in HPC/Supercomputing/Modeling 
! 
Named “2014 Big Data All-Star” by Fortune Magazine 
!
H2O Deep Learning, @ArnoCandel 
Outline 
Intro & Live Demo (10 mins) 
Methods & Implementation (20 mins) 
Results & Live Demos (25 mins) 
Higgs boson detection 
MNIST handwritten digits 
text classification 
Q & A (5 mins) 
3
H2O Deep Learning, @ArnoCandel 
About H2O (aka 0xdata) 
Java, Apache v2 Open Source 
Join the www.h2o.ai/community! 
#1 Java Machine Learning project on GitHub 
4
H2O Deep Learning, @ArnoCandel 
Customer Demands for 
Practical Machine Learning 
5 
Requirements Value 
In-Memory Fast (Interactive) 
Distributed Big Data (No Sampling) 
Open Source Ownership of Methods 
API / SDK Extensibility 
H2O was developed by 0xdata from 
scratch to meet these requirements
H2O Deep Learning, @ArnoCandel 
H2O Integration 
(diagram: H2O runs Standalone, over YARN, or on Hadoop MRv1, each on top of 
HDFS, with bindings for R, JSON, Scala, Python and Java) 
6 
H2O Deep Learning, @ArnoCandel 
H2O Architecture 
(diagram: a Prediction Engine built on a distributed in-memory K-V store with 
columnar compression, a memory manager and MapReduce; Machine Learning 
Algorithms (e.g. Deep Learning), an R Engine, and a nano-fast Scoring Engine) 
7 
H2O Deep Learning, @ArnoCandel 
H2O - The Killer App on Spark 
8 
http://databricks.com/blog/2014/06/30/ 
sparkling-water-h20-spark.html
H2O Deep Learning, @ArnoCandel 
H2O DeepLearning on Spark 
9 
// Test if we can correctly learn A, B where Y = logistic(A + B*X) 
test("deep learning log regression") { 
  val nPoints = 10000 
  val A = 2.0 
  val B = -1.5 

  // Generate testing data 
  val trainData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 42) 
  // Create RDD from testing data 
  val trainRDD = sc.parallelize(trainData, 2) 
  trainRDD.cache() 

  import H2OContext._ 
  // Create H2O data frame (will be implicit in the future) 
  val trainH2ORDD = toDataFrame(sc, trainRDD) 
  // Create a H2O DeepLearning model 
  val dlParams = new DeepLearningParameters() 
  dlParams.source = trainH2ORDD 
  dlParams.response = trainH2ORDD.lastVec() 
  dlParams.classification = true 
  val dl = new DeepLearning(dlParams) 
  val dlModel = dl.train().get() 

  // Score validation data 
  val validationData = DeepLearningSuite.generateLogisticInput(A, B, nPoints, 17) 
  val validationRDD = sc.parallelize(validationData, 2) 
  val validationH2ORDD = toDataFrame(sc, validationRDD) 
  val predictionH2OFrame = new DataFrame(dlModel.score(validationH2ORDD))('predict) 
  val predictionRDD = toRDD[DoubleHolder](sc, predictionH2OFrame) // will be implicit in the future 
  // Validate prediction 
  validatePrediction(predictionRDD.collect().map(_.predict.getOrElse(Double.NaN)), validationData) 
} 
Brand-Sparkling-New Sneak Preview!
H2O Deep Learning, @ArnoCandel 10 
H2O R CRAN package 
John Chambers (creator of the S language, R-core member) 
names the H2O R API among the top three most promising R projects
H2O Deep Learning, @ArnoCandel 
H2O + R = Happy Data Scientist 
11 
Machine Learning on Big Data with R: 
Data resides on the H2O cluster!
H2O Deep Learning, @ArnoCandel 12 
Higgs Particle Discovery 
Large Hadron Collider: Largest experiment of mankind! 
$13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1PB/day, etc. 
Higgs boson discovery (July ’12) led to 2013 Nobel prize! 
Higgs 
vs 
Background 
http://arxiv.org/pdf/1402.4735v2.pdf 
Images courtesy CERN / LHC 
Machine Learning Meets Physics 
Or rather: Back to the roots 
(WWW was invented at CERN in ’89…)
H2O Deep Learning, @ArnoCandel 13 
Higgs: Binary Classification Problem 
Current methods of choice for physicists: 
- Boosted Decision Trees 
- Neural networks with 1 hidden layer 
BUT: Must first add derived high-level features (physics formulae) 
HIGGS UCI Dataset: 
21 low-level features AND 
7 high-level derived features 
Train: 10M rows, Test: 500k rows 
Metric: AUC = Area under the ROC curve (range: 0.5…1, higher is better) 
Algorithm                   low-level H2O AUC   all features H2O AUC 
Generalized Linear Model    0.596               0.684 
Random Forest               0.764               0.840 
Gradient Boosted Trees      0.753               0.839 
Neural Net 1 hidden layer   0.760               0.830 
(adding the derived features improves the AUC for every algorithm)
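For reference, a small Scala sketch of how the AUC metric above can be computed, 
using the rank-sum (Mann-Whitney) formulation; the scores and labels are toy 
values, not the Higgs data, and ties are ignored for simplicity: 

object AucSketch { 
  // AUC = (sum of positive ranks - nPos*(nPos+1)/2) / (nPos * nNeg) 
  def auc(scores: Array[Double], labels: Array[Int]): Double = { 
    val ranked = scores.zip(labels).sortBy(_._1).zipWithIndex   // ranks 0..n-1 by score 
    val nPos = labels.count(_ == 1).toDouble 
    val nNeg = labels.length - nPos 
    val posRankSum = ranked.collect { case ((_, 1), r) => r + 1.0 }.sum 
    (posRankSum - nPos * (nPos + 1) / 2) / (nPos * nNeg) 
  } 

  def main(args: Array[String]): Unit = { 
    val scores = Array(0.9, 0.8, 0.4, 0.35, 0.2)   // model scores 
    val labels = Array(1,   1,   0,   1,    0)     // 1 = signal, 0 = background 
    println(auc(scores, labels))   // 0.833...: 5 of 6 signal/background pairs ranked correctly 
  } 
} 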
H2O Deep Learning, @ArnoCandel 14 
Higgs: Can Deep Learning Do Better? 
Algorithm                   low-level H2O AUC   all features H2O AUC 
Generalized Linear Model    0.596               0.684 
Random Forest               0.764               0.840 
Gradient Boosted Trees      0.753               0.839 
Neural Net 1 hidden layer   0.760               0.830 
Deep Learning               ?                   ? 
<Your guess goes here> 
reference paper results: baseline 0.733 
Let’s build a H2O Deep Learning model and 
find out! (That was my last weekend)
H2O Deep Learning, @ArnoCandel 
What is Deep Learning? 
Wikipedia: 
Deep learning is a set of algorithms in 
machine learning that attempt to model 
high-level abstractions in data by using 
architectures composed of multiple 
non-linear transformations. 
Example: 
Input data 
(image) 
Prediction 
(who is it?) 
15 
Facebook's DeepFace (Yann LeCun) 
recognises faces as well as humans
H2O Deep Learning, @ArnoCandel 
What is NOT Deep 
Linear models are not deep 
(by definition) 
! 
Neural nets with 1 hidden layer are not deep 
(only 1 layer - no feature hierarchy) 
! 
SVMs and Kernel methods are not deep 
(2 layers: kernel + linear) 
! 
Classification trees are not deep 
(operate on original input space, no new features generated) 
16
H2O Deep Learning, @ArnoCandel 
Deep Learning is Trending 
(Google Trends chart for “deep learning”, 2009–2013) 
17 
Businesses are using 
Deep Learning techniques! 
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton) 
! 
FBI FACE: $1 billion face recognition project 
! 
Chinese Search Giant Baidu Hires Man Behind the “Google Brain” (Andrew Ng)
H2O Deep Learning, @ArnoCandel 
Deep Learning History 
slides by Yann LeCun (now Facebook) 
18 
Deep Learning wins competitions 
AND 
makes humans, businesses and 
machines (cyborgs!?) smarter
H2O Deep Learning, @ArnoCandel 
Deep Learning in H2O 
1970s multi-layer feed-forward Neural Network 
(supervised learning with stochastic gradient descent using back-propagation) 
! 
+ distributed processing for big data 
(H2O in-memory MapReduce paradigm on distributed data) 
! 
+ multi-threaded speedup 
(H2O Fork/Join worker threads update the model asynchronously) 
! 
+ smart algorithms for accuracy 
(weight initialization, adaptive learning rate, momentum, dropout regularization, 
L1/L2 regularization, grid search, checkpointing, auto-tuning, model averaging) 
! 
= Top-notch prediction engine! 
19
H2O Deep Learning, @ArnoCandel 
Example Neural Network 
“fully connected” directed graph of neurons 
(diagram: Input layer {age, income, employment} -> Hidden layer 1 -> Hidden 
layer 2 -> Output layer {married, single}; input/output vs. hidden neurons; 
information flows from inputs to outputs) 
#neurons per layer: 3, 4, 3, 2 
#connections: 3x4, 4x3, 3x2 
20
H2O Deep Learning, @ArnoCandel 
Prediction: Forward Propagation 
“neurons activate each other via weighted sums” 
(diagram: inputs xi = {age, income, employment}, weights uij, vjk, wkl, 
hidden activations yj, zk, outputs pl = {married, single}) 
! 
yj = tanh(sumi(xi*uij) + bj) 
zk = tanh(sumj(yj*vjk) + ck) 
pl = softmax(sumk(zk*wkl) + dl) 
softmax(xk) = exp(xk) / sumj(exp(xj)) 
! 
activation function: tanh 
alternative: x -> max(0,x) “rectifier” 
! 
per-class probabilities: sum(pl) = 1 
pl is a non-linear function of xi: 
can approximate ANY function with enough layers! 
bj, ck, dl: bias values (indep. of inputs) 
21 
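The forward pass above can be written out directly. A minimal Scala sketch for 
the 3-4-3-2 example network, with made-up inputs and weights (illustration only, 
not H2O's implementation): 

object ForwardPropSketch { 
  // one tanh layer: y_j = tanh(sum_i(x_i * w_ij) + b_j) 
  def tanhLayer(x: Array[Double], w: Array[Array[Double]], b: Array[Double]): Array[Double] = 
    b.indices.map(j => math.tanh(x.indices.map(i => x(i) * w(i)(j)).sum + b(j))).toArray 

  // linear output layer, fed into softmax: p_l = softmax(sum_k(z_k * w_kl) + d_l) 
  def linear(x: Array[Double], w: Array[Array[Double]], b: Array[Double]): Array[Double] = 
    b.indices.map(l => x.indices.map(k => x(k) * w(k)(l)).sum + b(l)).toArray 

  def softmax(z: Array[Double]): Array[Double] = { 
    val e = z.map(math.exp); val s = e.sum; e.map(_ / s) 
  } 

  def main(args: Array[String]): Unit = { 
    val x = Array(0.3, -1.2, 0.5)                               // age, income, employment (standardized) 
    val u = Array.fill(3, 4)(0.1); val bj = Array.fill(4)(0.0)  // input    -> hidden 1 
    val v = Array.fill(4, 3)(0.1); val ck = Array.fill(3)(0.0)  // hidden 1 -> hidden 2 
    val w = Array.fill(3, 2)(0.1); val dl = Array.fill(2)(0.0)  // hidden 2 -> output 
    val y = tanhLayer(x, u, bj) 
    val z = tanhLayer(y, v, ck) 
    val p = softmax(linear(z, w, dl))                           // per-class probabilities 
    println(p.mkString(", ") + s"  (sum = ${p.sum})")           // sum(p) = 1 
  } 
} 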
H2O Deep Learning, @ArnoCandel 
Data preparation & Initialization 
Neural Networks are sensitive to numerical noise, 
operate best in the linear regime (not saturated) 
! 
Automatic standardization of data 
xi: mean = 0, stddev = 1 
! 
horizontalize categorical variables, e.g. 
{full-time, part-time, none, self-employed} 
-> 
{0,1,0} = part-time, {0,0,0} = self-employed 
! 
Automatic initialization of weights wkl 
Poor man’s initialization: random weights 
Default (better): Uniform distribution in 
+/- sqrt(6/(#units + #units_previous_layer)) 
22 
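A small Scala sketch of these preparation steps: standardization, 
“horizontalized” categoricals with the last level encoded as all zeros, and the 
default uniform weight initialization. The column values below are made up for 
illustration: 

import scala.util.Random 

object PrepSketch { 
  // standardize a column to mean 0, stddev 1 
  def standardize(col: Array[Double]): Array[Double] = { 
    val mean = col.sum / col.length 
    val sd   = math.sqrt(col.map(v => (v - mean) * (v - mean)).sum / col.length) 
    col.map(v => if (sd > 0) (v - mean) / sd else 0.0) 
  } 

  // {full-time, part-time, none, self-employed} -> 3 dummy columns, last level = all zeros 
  def oneHot(level: String, levels: Seq[String]): Array[Double] = 
    levels.dropRight(1).map(l => if (l == level) 1.0 else 0.0).toArray 

  // default init: Uniform(+/- sqrt(6 / (#units + #units_previous_layer))) 
  def initWeights(nIn: Int, nOut: Int, rng: Random): Array[Array[Double]] = { 
    val limit = math.sqrt(6.0 / (nIn + nOut)) 
    Array.fill(nIn, nOut)(rng.nextDouble() * 2 * limit - limit) 
  } 

  def main(args: Array[String]): Unit = { 
    val age = standardize(Array(23, 45, 31, 60).map(_.toDouble)) 
    val emp = oneHot("part-time", Seq("full-time", "part-time", "none", "self-employed")) 
    val w   = initWeights(nIn = 3, nOut = 4, new Random(42)) 
    println(age.mkString(", ")) 
    println(emp.mkString(", "))   // 0.0, 1.0, 0.0 
  } 
} 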
H2O Deep Learning, @ArnoCandel 
Training: Update Weights & Biases 
For each training row, we make a prediction and compare 
with the actual label (supervised learning): 
predicted   actual 
0.8         1        married 
0.2         0        single 
! 
Objective: minimize prediction error (MSE or cross-entropy) 
Mean Square Error = (0.2² + 0.2²)/2 “penalize differences per-class” 
! 
Cross-entropy = -log(0.8) “strongly penalize non-1-ness” 
! 
Stochastic Gradient Descent: Update weights and biases via 
gradient of the error (via back-propagation): 
w <— w - rate * ∂E/∂w 
(figure: error E as a function of a weight w; the learning rate sets the step size) 
23 
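A Scala sketch of the two objectives and the plain SGD update, using the slide's 
numbers (predicted 0.8/0.2 vs. actual 1/0); illustration only: 

object SgdSketch { 
  def mse(p: Array[Double], y: Array[Double]): Double = 
    p.indices.map(i => math.pow(p(i) - y(i), 2)).sum / p.length 

  def crossEntropy(p: Array[Double], y: Array[Double]): Double = 
    -p.indices.map(i => y(i) * math.log(p(i))).sum 

  // w <- w - rate * dE/dw 
  def sgdStep(w: Double, dEdw: Double, rate: Double): Double = w - rate * dEdw 

  def main(args: Array[String]): Unit = { 
    val predicted = Array(0.8, 0.2)   // married, single 
    val actual    = Array(1.0, 0.0) 
    println(f"MSE           = ${mse(predicted, actual)}%.3f")          // (0.2² + 0.2²)/2 = 0.040 
    println(f"Cross-entropy = ${crossEntropy(predicted, actual)}%.3f") // -log(0.8) ≈ 0.223 
    println(sgdStep(w = 0.5, dEdw = 0.1, rate = 0.005)) 
  } 
} 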
H2O Deep Learning, @ArnoCandel 
Backward Propagation 
How to compute ∂E/∂wi for wi <— wi - rate * ∂E/∂wi ? 
Naive: For every i, evaluate E twice at (w1,…,wi±Δ,…,wN)… Slow! 
Backprop: Compute ∂E/∂wi via chain rule going backwards 
(for inputs xi with weights wi and bias b) 
net = sumi(wi*xi) + b 
y = activation(net) 
E = error(y) 
∂E/∂wi = ∂E/∂y * ∂y/∂net * ∂net/∂wi 
= ∂(error(y))/∂y * ∂(activation(net))/∂net * xi 
24
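A single-neuron Scala sketch of this chain rule, assuming a tanh activation and 
squared error E = (y - target)²/2 (any differentiable choices would do); the 
input and weight values are illustrative: 

object BackpropSketch { 
  def main(args: Array[String]): Unit = { 
    val x      = Array(0.3, -1.2, 0.5) 
    var w      = Array(0.1, -0.2, 0.05) 
    val b      = 0.0 
    val target = 1.0 
    val rate   = 0.1 

    val net = x.indices.map(i => w(i) * x(i)).sum + b       // net = sum_i(wi*xi) + b 
    val y   = math.tanh(net)                                 // y = activation(net) 
    val dEdy   = y - target                                  // ∂((y-t)²/2)/∂y 
    val dydnet = 1.0 - y * y                                 // ∂tanh(net)/∂net 
    val grads  = x.map(xi => dEdy * dydnet * xi)             // ∂E/∂wi via the chain rule 
    w = w.indices.map(i => w(i) - rate * grads(i)).toArray   // wi <- wi - rate * ∂E/∂wi 
    println(w.mkString(", ")) 
  } 
} 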
H2O Deep Learning, @ArnoCandel 
H2O Deep Learning Architecture 
(diagram: several nodes/JVMs, each with its own HTTPD and a share of the 
H2O atomic in-memory K-V store; communication between nodes/JVMs is 
synchronous, between threads asynchronous) 
! 
initial model: weights and biases w1 
! 
map: 
each node trains a copy of the weights 
and biases with (some* or all of) its 
local data with asynchronous F/J threads 
! 
reduce: 
model averaging: average weights and 
biases from all nodes, e.g. w* = (w1+w2+w3+w4)/4 for 4 nodes; 
speedup is at least #nodes/log(#rows) 
arxiv:1209.4129v3 
! 
updated model: w*, stored in the H2O atomic in-memory K-V store 
! 
Query & display the model via JSON, WWW 
Keep iterating over the data (“epochs”), score from time to time 
*auto-tuned (default) or user-specified number of points per MapReduce iteration 
25
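A Scala sketch of the reduce step only: averaging per-node weight copies into 
w*. The four “nodes” here are just local arrays; in H2O the copies live in the 
distributed K-V store: 

object ModelAveragingSketch { 
  // element-wise average of the per-node weight vectors 
  def average(nodeWeights: Seq[Array[Double]]): Array[Double] = { 
    val n = nodeWeights.length 
    nodeWeights.head.indices.map(i => nodeWeights.map(_(i)).sum / n).toArray 
  } 

  def main(args: Array[String]): Unit = { 
    val w1 = Array(0.10, -0.20); val w2 = Array(0.12, -0.18) 
    val w3 = Array(0.08, -0.22); val w4 = Array(0.11, -0.19) 
    val wStar = average(Seq(w1, w2, w3, w4))   // w* = (w1+w2+w3+w4)/4 
    println(wStar.mkString(", "))              // 0.1025, -0.1975 
  } 
} 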
H2O Deep Learning, @ArnoCandel 
Adaptive learning rate - ADADELTA (Google) 
Automatically set learning rate for each neuron 
based on its training history 
Regularization 
L1: penalizes non-zero weights 
L2: penalizes large weights 
Dropout: randomly ignore certain inputs 
Grid Search and Checkpointing 
Run a grid search to scan many hyper-parameters, 
then continue training the most 
promising model(s) 
26 
“Secret” Sauce to Higher Accuracy
H2O Deep Learning, @ArnoCandel 
Detail: Adaptive Learning Rate 
! 
Compute moving average of Δwi² at time t for window length rho: 
! 
E[Δwi²]_t = rho * E[Δwi²]_(t-1) + (1-rho) * Δwi² 
! 
Compute RMS of Δwi at time t with smoothing epsilon: 
! 
RMS[Δwi]_t = sqrt( E[Δwi²]_t + epsilon ) 
Adaptive acceleration / momentum: 
accumulate previous weight updates, 
but over a window of time 
Adaptive annealing / progress: 
Gradient-dependent learning rate, 
moving window prevents “freezing” 
(unlike ADAGRAD: no window) 
Do the same for ∂E/∂wi, then 
obtain the per-weight learning rate: 
rate(wi, t) = RMS[Δwi]_(t-1) / RMS[∂E/∂wi]_t 
cf. ADADELTA paper 
27
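A per-weight ADADELTA sketch in Scala following the formulas above (moving 
averages of squared updates and squared gradients, rate = RMS[Δw]_(t-1) / 
RMS[∂E/∂w]_t); the rho/epsilon values are just typical choices, not H2O 
defaults, and the demo objective is a toy 1-D quadratic: 

class AdaDelta(rho: Double = 0.95, epsilon: Double = 1e-6) { 
  private var avgSqUpdate = 0.0   // E[Δw²] 
  private var avgSqGrad   = 0.0   // E[(∂E/∂w)²] 

  // returns the weight update Δw to add to the weight 
  def step(grad: Double): Double = { 
    avgSqGrad = rho * avgSqGrad + (1 - rho) * grad * grad 
    val rate   = math.sqrt(avgSqUpdate + epsilon) / math.sqrt(avgSqGrad + epsilon) 
    val update = -rate * grad 
    avgSqUpdate = rho * avgSqUpdate + (1 - rho) * update * update 
    update 
  } 
} 

object AdaDeltaDemo extends App { 
  var w = 0.0 
  val opt = new AdaDelta() 
  for (_ <- 1 to 10000) w += opt.step(grad = 2 * (w - 3.0))   // gradient of (w - 3)² 
  println(w)   // drifts toward the minimum at w = 3 
} 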
H2O Deep Learning, @ArnoCandel 
Detail: Dropout Regularization 
28 
Training: 
For each hidden neuron, for each training sample, for each iteration, 
ignore (zero out) a different random fraction p of input activations. 
! 
(diagram: the example network from before, with a random subset of 
activations crossed out / dropped) 
Testing: 
Use all activations, but reduce them by a factor p 
(to “simulate” the missing activations during training). 
cf. Geoff Hinton's paper
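A minimal Scala sketch of the idea: drop a random fraction p of activations 
while training, and scale by the keep probability (1-p) at test time (the 
textbook convention; H2O's exact scaling may differ): 

import scala.util.Random 

object DropoutSketch { 
  // training: zero out each activation independently with probability p 
  def train(acts: Array[Double], p: Double, rng: Random): Array[Double] = 
    acts.map(a => if (rng.nextDouble() < p) 0.0 else a) 

  // testing: use all activations, scaled by the keep probability 
  def test(acts: Array[Double], p: Double): Array[Double] = 
    acts.map(_ * (1 - p)) 

  def main(args: Array[String]): Unit = { 
    val acts = Array(0.7, -0.3, 1.2, 0.1, -0.9) 
    println(train(acts, p = 0.5, new Random(1)).mkString(", ")) 
    println(test(acts, p = 0.5).mkString(", ")) 
  } 
} 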
H2O Deep Learning, @ArnoCandel 
MNIST: digits classification 
MNIST = Digitized handwritten 
digits database (Yann LeCun) 
Yann LeCun: “Yet another advice: don't get fooled 
by people who claim to have a solution to 
Artificial General Intelligence. Ask them what 
error rate they get on MNIST or ImageNet.” 
Data: 28x28=784 pixels with 
(gray-scale) values in 0…255 
Standing world record: 
Without distortions or convolutions, 
the best-ever published error rate on 
test set: 0.83% (Microsoft) 
29 
Train: 60,000 rows 784 integer columns 10 classes 
Test: 10,000 rows 784 integer columns 10 classes 
Let’s see how H2O does on the MNIST dataset!
H2O Deep Learning, @ArnoCandel 
H2O Deep Learning on MNIST: 
0.87% test set error (so far) 
Frequent errors: confuse 2/7 and 4/9 
30 
test set error: 1.5% after 10 mins 
1.0% after 1.5 hours 
0.87% after 4 hours 
World-class 
results! 
No pre-training 
No distortions 
No convolutions 
No unsupervised 
training 
Running on 4 
nodes with 16 
cores each
H2O Deep Learning, A. Candel 
Weather Dataset 
31 
Predict “RainTomorrow” from Temperature, 
Humidity, Wind, Pressure, etc.
H2O Deep Learning, A. Candel 
Live Demo: Weather Prediction 
5-fold cross validation 
Interactive ROC curve with 
real-time updates 
32 
3 hidden Rectifier 
layers, Dropout, 
L1-penalty 
12.7% 5-fold cross-validation error is at 
least as good as GBM/RF/GLM models
H2O Deep Learning, @ArnoCandel 
Live Demo: Grid Search 
How did I find those parameters? Grid Search! 
(works for multiple hyper parameters at once) 
33 
Then continue training 
the best model
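A generic grid-search sketch in Scala: enumerate hyper-parameter combinations, 
score each, keep the best, then continue training it. trainAndScore is a 
hypothetical stand-in for launching an H2O Deep Learning run and reading back 
its validation error; the parameter names and values are illustrative only: 

object GridSearchSketch { 
  case class Params(hidden: Seq[Int], l1: Double, inputDropout: Double) 

  // placeholder "validation error" standing in for a real training run 
  def trainAndScore(p: Params): Double = 
    math.abs(p.l1 - 1e-4) + math.abs(p.inputDropout - 0.1) + p.hidden.length * 0.001 

  def main(args: Array[String]): Unit = { 
    val grid = for { 
      hidden  <- Seq(Seq(200, 200), Seq(512, 512, 512)) 
      l1      <- Seq(0.0, 1e-5, 1e-4) 
      dropout <- Seq(0.0, 0.1, 0.2) 
    } yield Params(hidden, l1, dropout) 

    val scored = grid.map(p => p -> trainAndScore(p)) 
    val (best, err) = scored.minBy(_._2) 
    println(s"best: $best, validation error: $err") 
    // ...then continue training (checkpoint) the best model with more epochs 
  } 
} 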
H2O Deep Learning, @ArnoCandel 
Text Classification 
Goal: Predict the item from 
seller’s text description 
34 
“Vintage 18KT gold Rolex 2 Tone 
in great condition” 
Data: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 
(the 1s mark the words present, e.g. “gold”, “vintage”, “condition”) 
Train: 578,361 rows 8,647 cols 467 classes 
Test: 64,263 rows 8,647 cols 143 classes 
Let’s see how H2O does on the ebay dataset!
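A Scala sketch of the binary word-vector encoding: one column per vocabulary 
word, 1 if the word occurs in the listing's description, else 0. The tiny 
vocabulary here is made up; the real dataset has 8,647 columns: 

object WordVectorSketch { 
  // binary presence vector over a fixed vocabulary 
  def encode(description: String, vocab: IndexedSeq[String]): Array[Double] = { 
    val words = description.toLowerCase.split("\\W+").toSet 
    vocab.map(w => if (words.contains(w)) 1.0 else 0.0).toArray 
  } 

  def main(args: Array[String]): Unit = { 
    val vocab = IndexedSeq("vintage", "gold", "rolex", "leather", "condition", "antique") 
    val v = encode("Vintage 18KT gold Rolex 2 Tone in great condition", vocab) 
    println(v.mkString(","))   // 1.0,1.0,1.0,0.0,1.0,0.0 
  } 
} 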
H2O Deep Learning, @ArnoCandel 
35 
Text Classification 
Train: 578,361 rows 8,647 cols 467 classes 
Test: 64,263 rows 8,647 cols 143 classes 
Out-Of-The-Box: 11.6% test set error after 10 epochs! 
Predicts the correct class (out of 143) 88.4% of the time! 
Note 1: H2O columnar-compressed in-memory 
store only needs 60 MB to store 5 billion 
values (dense CSV needs 18 GB) 
Note 2: No tuning was done 
(results are for illustration only)
H2O Deep Learning, @ArnoCandel 
Parallel Scalability 
(for 64 epochs on MNIST, with “0.87%” parameters) 
36 
(figures: Speedup and Training Time (in minutes) vs. number of H2O Nodes 
for 1, 2, 4, 8, 16, 32 and 63 nodes; the fastest training time shown is 2.7 mins) 
(4 cores per node, 1 epoch per node per MapReduce)
H2O Deep Learning, @ArnoCandel 
Deep Learning Auto-Encoders for 
Anomaly Detection 
37 
Toy example: 
Find anomaly in ECG heart 
beat data. First, train a 
model on what’s “normal”: 
20 time-series samples of 
210 data points each 
Deep Auto-Encoder: 
Learn the low-dimensional 
non-linear “structure” of the data 
that allows reconstruction of 
the original data 
Also for categorical data!
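A Scala sketch of the scoring step: flag a sample as anomalous when its 
reconstruction error is far above what was seen on “normal” data. The 
reconstruct function below is only a placeholder imitating an auto-encoder 
trained on normal beats (it reproduces the normal shape but not an injected 
spike), and the threshold is illustrative: 

object AnomalySketch { 
  // mean squared reconstruction error between a sample and its reconstruction 
  def reconstructionError(x: Array[Double], xHat: Array[Double]): Double = 
    x.indices.map(i => math.pow(x(i) - xHat(i), 2)).sum / x.length 

  def main(args: Array[String]): Unit = { 
    val normalShape = Array.tabulate(210)(i => math.sin(i / 10.0))    // a "normal" heart beat 
    def reconstruct(x: Array[Double]): Array[Double] = normalShape    // placeholder auto-encoder 

    val normalBeat  = normalShape.clone() 
    val anomalyBeat = normalShape.updated(100, 5.0)                   // injected spike 
    val threshold   = 0.05   // e.g. the max error seen on the "normal" training beats 
    for ((beat, name) <- Seq(normalBeat -> "normal", anomalyBeat -> "anomaly")) { 
      val err = reconstructionError(beat, reconstruct(beat)) 
      println(f"$name%-8s reconstruction MSE = $err%.4f  anomaly = ${err > threshold}") 
    } 
  } 
} 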
H2O Deep Learning, @ArnoCandel 38 
Deep Learning Auto-Encoders for Anomaly Detection 
Model of what’s “normal” + Test set with anomaly 
=> Test set prediction is the reconstruction and looks “normal”; 
the anomaly is found by its large reconstruction error!
H2O Deep Learning, @ArnoCandel 39 
H2O brings Deep Learning to R 
R Vignette with 
example R scripts 
http://0xdata.com/h2o/algorithms/ 
All parameters are 
available from R…
H2O Deep Learning, @ArnoCandel 
POJO Model Export for 
Production Scoring 
40 
Plain old Java code is 
auto-generated to take 
your H2O Deep Learning 
models into production!
H2O Deep Learning, @ArnoCandel 41 
Higgs Particle Discovery with H2O 
How well did H2O 
Deep Learning do? 
<Your guess goes here> 
reference paper results 
Any guesses for AUC on low-level features? 
AUC=0.76 was the best for RF/GBM/NN 
Let’s see how H2O did in the past 30 minutes!
H2O Deep Learning, @ArnoCandel 
H2O Steam: Scoring Platform 
42 
http://server:port/steam/index.html 
Higgs Dataset Demo on 10-node cluster 
Let’s score all our H2O models and compare them! 
Live Demo
H2O Deep Learning, @ArnoCandel 43 
Scoring Higgs Models in H2O Steam 
Live Demo on 10-node cluster: 
<10 minutes runtime for all algos! 
Better than LHC baseline of AUC=0.73!
H2O Deep Learning, @ArnoCandel 44 
Higgs Particle Detection with H2O 
HIGGS UCI Dataset: 
21 low-level features AND 
7 high-level derived features 
Train: 10M rows, Test: 500k rows 
Algorithm                       Paper’s l-l AUC   low-level H2O AUC   all features H2O AUC   Parameters (not heavily tuned), H2O running on 10 nodes 
Generalized Linear Model        -                 0.596               0.684                  default, binomial 
Random Forest                   -                 0.764               0.840                  50 trees, max depth 50 
Gradient Boosted Trees          0.73              0.753               0.839                  50 trees, max depth 15 
Neural Net 1 layer              0.733             0.760               0.830                  1x300 Rectifier, 100 epochs 
Deep Learning 3 hidden layers   0.836             0.850               -                      3x1000 Rectifier, L2=1e-5, 40 epochs 
Deep Learning 4 hidden layers   0.868             0.869               -                      4x500 Rectifier, L1=L2=1e-5, 300 epochs 
Deep Learning 6 hidden layers   0.880             running             -                      6x500 Rectifier, L1=L2=1e-5 
*Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf 
Deep Learning on low-level features alone beats everything else! 
H2O prelim. results compare well with paper’s results* (TMVA & Theano)
H2O Deep Learning, @ArnoCandel 
Tips for H2O Deep Learning ! 
General: 
More layers for more complex functions (exp. more non-linearity). 
More neurons per layer to detect finer structure in data (“memorizing”). 
Add some regularization for less overfitting (lower validation set error). 
Specifically: 
Do a grid search to get a feel for convergence, then continue training. 
Try Tanh/Rectifier, try max_w2=10…50, L1=1e-5..1e-3 and/or L2=1e-5…1e-3 
Try Dropout (input: up to 20%, hidden: up to 50%) with test/validation 
set. Input dropout is recommended for noisy high-dimensional input. 
Distributed: More training samples per iteration: faster, but less accuracy? 
With ADADELTA: Try epsilon = 1e-4,1e-6,1e-8,1e-10, rho = 0.9,0.95,0.99 
Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-9, 
momentum_start = 0.5…0.9, momentum_stable = 0.99, 
momentum_ramp = 1/rate_annealing. 
Try balance_classes = true for datasets with large class imbalance. 
Enable force_load_balance for small datasets. 
Enable replicate_training_data if each node can hold all the data. 
45
H2O Deep Learning, @ArnoCandel 
Extensions for H2O Deep Learning 
46 
- Vision: Convolutional & Pooling Layers PUB-644 
- Anomaly Detection PUB-806 
- Pre-Training: Stacked Auto-Encoders PUB-1014 
- Faster Training: GPGPU support PUB-1013 
- Language/Sequences: Recurrent Neural Networks 
- Benchmark vs other Deep Learning packages 
- Investigate other optimization algorithms 
Contribute to H2O! 
Add your own JIRA tickets!
H2O Deep Learning, @ArnoCandel 
Key Take-Aways 
H2O is a distributed in-memory data science 
platform. It was designed for high-performance 
machine learning applications on big data. 
! 
H2O Deep Learning is ready to take your advanced 
analytics to the next level - Try it on your data! 
! 
Join our Community and Meetups! 
https://github.com/h2oai 
http://docs.h2o.ai 
www.h2o.ai/community 
@h2oai 
47 
Thank you!
