Deep Learning with H2O
H2O.ai: Scalable In-Memory Machine Learning
PayPal, San Jose, 4/24/14
Arno Candel
H2O.ai's Distributed Deep Learning Presented at PayPal by Arno Candel 04/24/14

Invited talk given at PayPal's data science seminar on April 24, 2014.

http://docs.0xdata.com/datascience/deeplearning.html

Published in: Technology, Education

  1. Deep Learning with H2O. H2O.ai: Scalable In-Memory Machine Learning. PayPal, San Jose, 4/24/14. Arno Candel
  2. Who am I?
     PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
     6 years at SLAC - Accelerator Physics Modeling
     2 years at Skytree, Inc - Machine Learning
     4 months at 0xdata/H2O - Machine Learning
     10+ years in HPC, C++, MPI, Supercomputing
     @ArnoCandel
  3. H2O Deep Learning, A. Candel. Outline
     Intro
     Theory
     Implementation
     Results: MNIST handwritten digits classification
     Live Demo: Prostate cancer classification and age regression, text classification
  4. H2O: Open Source in-memory Prediction Engine for Big Data
     Distributed in-memory math platform ➔ GLM, GBM, RF, K-Means, PCA, Deep Learning
     Easy to use SDK / API ➔ Java, R, Scala, Python, JSON, Browser-based GUI
     Businesses can use ALL of their data (with or without Hadoop) ➔ Modeling without Sampling
     Big Data + Better Algorithms ➔ Better Predictions
  5. About H2O (aka 0xdata): Pure Java, Apache v2 Open Source. Join the www.h2o.ai/community! +1 Cyprien Noel for prior work
  6. H2O with or without Hadoop [Diagram: three deployment modes (Standalone, Over YARN, On MRv1 with Hadoop MR), each showing H2O and HDFS; interfaces: R, Java, Scala, JSON, Python]
  7. H2O Architecture [Diagram: in-memory K-V store, compression, memory manager, MapReduce, Machine Learning Algorithms (e.g. Deep Learning), R Engine, Nano fast Scoring Engine, Prediction Engine]
  8. What is Deep Learning?
     Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
     Example: Input data (facial image) -> Prediction (person’s ID)
     Facebook DeepFace (LeCun): “Almost as good as humans at recognising faces”
     Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
     FBI FACE: $1 billion face recognition project
  9. Deep Learning is trending [Figure: Google Trends chart, 2011 to 2013]
  10. Deep Learning History (slides by Yann LeCun, now Facebook). Deep Learning wins competitions AND makes humans, businesses and machines (cyborgs!?) smarter
  11. What is NOT Deep
      Linear models are not deep (by definition)
      Neural nets with 1 hidden layer are not deep (no feature hierarchy)
      SVMs and Kernel methods are not deep (2 layers: kernel + linear)
      Classification trees are not deep (operate on original input space)
  12. Deep Learning in H2O
      1970s multi-layer feed-forward Neural Network (supervised learning with stochastic gradient descent using back-propagation)
      + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data)
      + multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously)
      + smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization)
      = Top-notch prediction engine!
  13. Example Neural Network
      “Fully connected” directed graph of neurons; information flows from the input layer to the output layer
      Input layer: 3 neurons (age, income, employment)
      Hidden layer 1: 4 neurons (3x4 connections)
      Hidden layer 2: 3 neurons (4x3 connections)
      Output layer: 2 neurons (married, not married; 3x2 connections)
  14. Prediction: Forward Propagation
      “Neurons activate each other via weighted sums”
      Inputs $x_i$: age, income, employment. Outputs $p_l$: married, not married
      $y_j = \tanh(\sum_i x_i u_{ij} + b_j)$
      $z_k = \tanh(\sum_j y_j v_{jk} + c_k)$
      $p_l = \mathrm{softmax}(\sum_k z_k w_{kl} + d_l)$, per-class probabilities with $\sum_l p_l = 1$
      $\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$
      $b_j$, $c_k$, $d_l$: bias values (independent of inputs)
      Activation function: tanh; alternative: $x \to \max(0, x)$ (“rectifier”)
      $p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!
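The forward pass on this slide can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not H2O's implementation: the helper names (`dense`, `forward`, `random_layer`), the random weights, and the input values are assumptions made for the example.

```python
import math
import random


def dense(inputs, weights, biases):
    """Weighted sums: out[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]


def softmax(values):
    """exp(v) / sum(exp(v)), shifted by max(values) for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """tanh on every hidden layer, softmax on the output layer."""
    activations = x
    for weights, biases in layers[:-1]:
        activations = [math.tanh(v) for v in dense(activations, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


def random_layer(n_in, n_out, rng):
    """Illustrative random weights and zero biases (not a trained model)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# 3 inputs -> 4 hidden -> 3 hidden -> 2 outputs, the layer sizes of the example network.
layers = [random_layer(3, 4, rng), random_layer(4, 3, rng), random_layer(3, 2, rng)]
probs = forward([0.1, -0.4, 1.2], layers)  # hypothetical standardized inputs
```

Whatever the weights are, `probs` holds two per-class probabilities that sum to 1, because the output layer is a softmax.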
  15. Data preparation & Initialization
      Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated)
      Standardize input $x_i$: mean = 0, stddev = 1
      Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
      Poor man’s initialization: random weights
      Better: Uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
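A minimal sketch of the three preparation steps on this slide follows. The function names are hypothetical, the use of the population standard deviation is an assumption, and the last categorical level is the one mapped to all zeros so that the output matches the slide's {0,1,0} and {0,0,0} examples.

```python
import math
import random


def standardize(column):
    """Shift to mean 0 and scale to stddev 1 (population stddev, assumed > 0)."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def one_hot_drop_last(value, levels):
    """One indicator per level except the last, which maps to all zeros."""
    return [1 if value == level else 0 for level in levels[:-1]]


def uniform_init(n_prev, n_units, rng):
    """Weights drawn uniformly from +/- sqrt(6 / (n_units + n_prev))."""
    limit = math.sqrt(6.0 / (n_units + n_prev))
    return [[rng.uniform(-limit, limit) for _ in range(n_units)] for _ in range(n_prev)]


levels = ["full-time", "part-time", "none", "self-employed"]
ages = standardize([25.0, 35.0, 45.0, 55.0])     # made-up ages
weights = uniform_init(3, 4, random.Random(42))  # 3 inputs feeding 4 hidden units
```

Here `one_hot_drop_last("part-time", levels)` gives `[0, 1, 0]` and `one_hot_drop_last("self-employed", levels)` gives `[0, 0, 0]`.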
  16. Stochastic Gradient Descent
      SGD: improve weights and biases for EACH training row
      For each training row, we make a prediction and compare with the actual label (supervised training):
      predicted (married, not married) = (0.8, 0.2), actual = (1, 0)
      Mean Square Error = $(0.2^2 + 0.2^2)/2$ “penalize differences per-class”
      Cross-entropy = $-\log(0.8)$ “strongly penalize non-1-ness”
      Objective: minimize prediction error (MSE or cross-entropy)
      $w \leftarrow w - \mathrm{rate} \cdot \partial E/\partial w$
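The two error measures can be checked numerically for the slide's example, where the predicted probabilities are (0.8, 0.2) and the actual label is (1, 0). The function names below are illustrative only.

```python
import math


def mean_square_error(predicted, actual):
    """Average of squared per-class differences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def cross_entropy(predicted, actual):
    """-sum(actual * log(predicted)); with a one-hot label only the true class counts."""
    return -sum(a * math.log(p) for p, a in zip(predicted, actual) if a > 0)


predicted = [0.8, 0.2]
actual = [1.0, 0.0]
mse = mean_square_error(predicted, actual)
ce = cross_entropy(predicted, actual)
```

For this row the MSE is 0.04 and the cross-entropy is about 0.223; the cross-entropy grows without bound as the predicted probability of the true class approaches 0.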
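  17. Backward Propagation
      How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \mathrm{rate} \cdot \partial E/\partial w_i$?
      Naive: For every $i$, evaluate $E$ twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow!
      Backprop: Compute $\partial E/\partial w_i$ via chain rule going backwards
      $\mathrm{net} = \sum_i w_i x_i + b$, $y = \mathrm{activation}(\mathrm{net})$, $E = \mathrm{error}(y)$
      $\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \mathrm{net} \cdot \partial \mathrm{net}/\partial w_i = \partial(\mathrm{error}(y))/\partial y \cdot \partial(\mathrm{activation}(\mathrm{net}))/\partial \mathrm{net} \cdot x_i$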
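The chain rule and the naive two-evaluation approach can be compared on a single neuron. The sketch assumes a tanh activation and the squared error 0.5 * (y - target)^2, neither of which the slide fixes; the weights, inputs, and target are made up.

```python
import math


def neuron(w, b, x):
    """y = activation(net) with net = sum_i w_i * x_i + b and tanh activation."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)


def error(y, target):
    """Assumed error function: half the squared difference."""
    return 0.5 * (y - target) ** 2


def backprop_gradient(w, b, x, target):
    """dE/dw_i = dE/dy * dy/dnet * dnet/dw_i, computed with one forward pass."""
    y = neuron(w, b, x)
    dE_dy = y - target
    dy_dnet = 1.0 - y * y  # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]


def numeric_gradient(w, b, x, target, delta=1e-6):
    """Naive approach: evaluate E twice per weight, at w_i + delta and w_i - delta."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += delta
        down[i] -= delta
        diff = error(neuron(up, b, x), target) - error(neuron(down, b, x), target)
        grads.append(diff / (2 * delta))
    return grads


w, b, x, target = [0.3, -0.2, 0.5], 0.1, [1.0, 2.0, -1.5], 1.0
analytic = backprop_gradient(w, b, x, target)
numeric = numeric_gradient(w, b, x, target)
```

Both gradients agree to within the finite-difference error, but the naive version needs two forward passes per weight while backprop needs one pass in total.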
  18. H2O Deep Learning Architecture
      [Diagram: nodes/JVMs, each with HTTPD and K-V store. Communication: nodes/JVMs: sync, threads: async]
      Initial weights and biases w, stored in the H2O atomic in-memory K-V store
      map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
      reduce: model averaging: average weights and biases from all nodes, $w^* = (w_1 + w_2 + w_3 + w_4)/4$, combined pairwise as $w_1 + w_3$ and $w_2 + w_4$; speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
      Updated weights and biases w*
      Keep iterating over the data (“epochs”), score from time to time
      Query & display the model via JSON, WWW
      *user can specify the number of total rows per MapReduce iteration
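The map/reduce scheme can be illustrated with a toy one-weight model y = w * x trained on four simulated "nodes". Everything here (the model, the shards, the learning rate, the function names) is a stand-in chosen for the example; it runs sequentially and does not reproduce H2O's asynchronous threads or K-V store.

```python
def train_locally(w, rows, rate):
    """map: one SGD pass over a node's local rows, starting from the shared w."""
    for x, y in rows:
        w -= rate * (w * x - y) * x  # gradient of 0.5 * (w*x - y)^2 w.r.t. w
    return w


def average_models(weights):
    """reduce: model averaging, w* = (w1 + ... + wn) / n."""
    return sum(weights) / len(weights)


# Four nodes, each holding a shard of rows generated from y = 2 * x.
shards = [
    [(0.5, 1.0), (1.0, 2.0)],
    [(1.5, 3.0), (2.0, 4.0)],
    [(1.0, 2.0), (1.5, 3.0)],
    [(0.5, 1.0), (2.0, 4.0)],
]

w = 0.0
for _ in range(30):  # 30 MapReduce iterations
    local = [train_locally(w, rows, rate=0.1) for rows in shards]  # map
    w = average_models(local)                                      # reduce
```

After the loop `w` is close to 2, the slope used to generate the data.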
  19. “Secret” Sauce to Higher Accuracy
      Momentum training: keep changing weights and biases (even if there’s no error) “find other local minima, and go faster along valleys”
      Adaptive learning rate - ADADELTA (Google): automatically set learning rate for each neuron based on its training history, combines annealing and momentum features
      Learning rate annealing: rate $r = r_0 / (1 + \beta N)$, N = training samples “dig deeper into local minimum”
      Grid Search and Checkpointing: Run a grid search over multiple hyper-parameters, then continue training the best model
      L1/L2/Dropout/max_w2 regularization:
      L1: penalizes non-zero weights, L2: penalizes large weights
      Dropout: randomly ignore certain inputs “train exponentially many models at once”
      max_w2: Scale down all incoming weights if their squared sum > max_w2
      “regularization avoids overtraining and improves generalization error”
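Two of these ingredients are simple enough to sketch directly: the annealed learning rate and the max_w2 constraint. The function names are hypothetical, and rescaling so that the squared sum lands exactly on max_w2 is an assumption; the slide only says the weights are scaled down.

```python
import math


def annealed_rate(r0, beta, n_samples):
    """Learning rate annealing: r = r0 / (1 + beta * N)."""
    return r0 / (1.0 + beta * n_samples)


def apply_max_w2(incoming_weights, max_w2):
    """Scale down a neuron's incoming weights if their squared sum exceeds max_w2."""
    squared_sum = sum(w * w for w in incoming_weights)
    if squared_sum > max_w2:
        scale = math.sqrt(max_w2 / squared_sum)  # assumed: land exactly on max_w2
        return [w * scale for w in incoming_weights]
    return list(incoming_weights)


rate_after_1m = annealed_rate(r0=0.01, beta=1e-6, n_samples=1_000_000)
clipped = apply_max_w2([3.0, 4.0], max_w2=1.0)
untouched = apply_max_w2([0.3, 0.4], max_w2=1.0)
```

With beta = 1e-6 the rate has halved after one million training samples, and the weights (3, 4) with squared sum 25 are rescaled to roughly (0.6, 0.8).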
  20. Adaptive Learning Rate (cf. ADADELTA paper)
      Compute moving average of $\Delta w_i^2$ at time t for window length rho:
      $E[\Delta w_i^2]_t = \rho \, E[\Delta w_i^2]_{t-1} + (1 - \rho) \, \Delta w_i^2$
      Compute RMS of $\Delta w_i$ at time t with smoothing epsilon:
      $\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
      Do the same for $\partial E/\partial w_i$, then obtain per-weight learning rate:
      $\mathrm{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E/\partial w_i]_t$
      Adaptive annealing / progress: Gradient-dependent learning rate, moving window prevents “freezing” (unlike ADAGRAD: no window)
      Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
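The per-weight rate on this slide can be sketched for a single weight as follows. The update order follows the ADADELTA paper as far as the slide describes it; the function name, the toy objective (w - 3)^2, and the default values rho = 0.95 and epsilon = 1e-6 are assumptions for the example.

```python
import math


def adadelta_minimize(grad, w, steps, rho=0.95, epsilon=1e-6):
    """Minimize a 1-D function with rate = RMS[dw]_{t-1} / RMS[dE/dw]_t."""
    avg_sq_grad = 0.0   # moving average E[(dE/dw)^2]
    avg_sq_delta = 0.0  # moving average E[dw^2]
    for _ in range(steps):
        g = grad(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        rate = math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon)
        delta = -rate * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        w += delta
    return w


# Toy objective E(w) = (w - 3)^2 with gradient 2 * (w - 3), starting from w = 0.
w_final = adadelta_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, steps=50)
```

No learning rate is passed in: the step size comes entirely from the two moving averages, which is why the slide's tips tune only rho and epsilon.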
  21. Dropout Training (cf. Geoff Hinton's paper)
      Training: For each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations.
      [Diagram: example network (age, income, employment -> married, not married) with dropped activations marked X]
      Testing: Use all activations, but scale them by (1 - p), the fraction kept during training (to “simulate” the missing activations during training).
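A small sketch of the two phases follows. The function names are hypothetical. At test time the activations are scaled by 1 - p, the fraction kept during training, which is the usual convention and equals p itself for the common choice p = 0.5.

```python
import random


def dropout_train(activations, p, rng):
    """Training: zero out each input activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_test(activations, p):
    """Testing: keep all activations, scaled by the kept fraction (1 - p)."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(1)
p = 0.5
# Average one activation of 1.0 over many training-time masks.
samples = [dropout_train([1.0], p, rng)[0] for _ in range(20000)]
train_mean = sum(samples) / len(samples)
test_value = dropout_test([1.0], p)[0]
```

The training-time average of the masked activation is close to the scaled test-time value, which is the sense in which scaling "simulates" the missing activations.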
  22. MNIST: digits classification
      MNIST: Digitized handwritten digits database (Yann LeCun)
      Data: 28x28=784 pixels with values in 0…255 (gray-scale)
      Train: 60,000 rows, 784 integer columns, 10 classes
      Test: 10,000 rows, 784 integer columns, 10 classes
      One of the most popular multi-class classification problems
      Without distortions or convolutions (which help), the best-ever published error rate on test set: 0.83% (Microsoft)
  23. H2O Deep Learning on MNIST: 0.87% test set error (so far)
      Test set error: 1.5% after 10 mins, 1.0% after 1.5 hours, 0.87% after 4 hours
      Running on 4 nodes with 16 cores each
      World-class results! No pre-training, no distortions, no convolutions, no unsupervised training
      Frequent errors: confuse 2/7 and 4/9
  24. Parallel Scalability (for 64 epochs on MNIST, with “0.87%” parameters)
      [Figure: Speedup vs. H2O Nodes (1, 2, 4, 8, 16, 32, 63); 4 cores per node, 1 epoch per node per MapReduce]
      [Figure: Training Time in minutes vs. H2O Nodes (1, 2, 4, 8, 16, 32, 63); annotated value: 2.7 mins]
  25. Prostate Cancer Dataset
  26. Live Demo: Cancer Prediction. Interactive ROC curve with real-time updates
  27. Live Demo: Cancer Prediction. 0% training error with only 322 model parameters in seconds!
  28. H2O Deep Learning with Scala. Predict CAPSULE: Variable 1
  29. H2O Deep Learning with Scala
  30. Live Demo: Grid Search Regression
      Doing a grid search to find good hyper-parameters to predict AGE from other 7 features
      Then continue training the best model
      5 hidden tanh layers of 50 neurons each, rho=0.99, epsilon = 1e-10, normal distribution scale=1
      MSE = 0.5 for test set ages in 44…79
      Regression: 1 linear output neuron
  31. Live Demo: eBay Text Classification
      Users enter a description when selling an item
      Task: Predict the type of item from the words used
      Data prep: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0
      H2O parses SVMLight sparse format: label 3:1 9:1 13:1 …
      “Small” sample dataset on jewelry and watches:
      Train: 578,361 rows, 8,647 cols, 467 classes
      Test: 64,263 rows, 8,647 cols, 143 classes
      H2O compressed columnar in-memory store: Only needs 60MB to store 5 billion entries (never inflated)
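The relation between the binary word vector and the SVMLight line on this slide can be shown with a short parser. This is a sketch, not H2O's parser: the function names are made up, the label value 7 is a placeholder, and indices are treated as 1-based, which is what makes the slide's vector and its sparse line agree.

```python
def parse_svmlight_line(line):
    """Parse 'label index:value index:value ...' into (label, {index: value})."""
    fields = line.split()
    label = int(fields[0])
    features = {}
    for field in fields[1:]:
        index, value = field.split(":")
        features[int(index)] = float(value)
    return label, features


def to_dense(features, n_cols):
    """Expand the sparse entries into a dense binary word vector (1-based indices)."""
    return [1 if features.get(i, 0.0) != 0.0 else 0 for i in range(1, n_cols + 1)]


label, features = parse_svmlight_line("7 3:1 9:1 13:1")  # 7 is a placeholder label
dense_vector = to_dense(features, 13)

# Size of the dense training matrix quoted on the slide: rows * cols.
train_entries = 578_361 * 8_647
```

The first 13 entries of the dense vector come out as 0,0,1,0,0,0,0,0,1,0,0,0,1, and `train_entries` is 5,001,087,567, the "5 billion entries" the slide refers to.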
  32. Live Demo: eBay Text Classification
      No tuning (results for illustration only): 11.6% test set error (<4% for top-5) after only 10 epochs!
      Train: 578,361 rows, 8,647 cols, 467 classes
      Test: 64,263 rows, 8,647 cols, 143 classes
  33. Tips for H2O Deep Learning
      General:
      More layers for more complex functions (exponentially more non-linearity)
      More neurons per layer to detect finer structure in data (“memorizing”)
      Add some regularization for less overfitting (smaller validation error)
      Do a grid search to get a feel for convergence, then continue training.
      Try Tanh first, then Rectifier, try max_w2 = 50 and/or L1=1e-5.
      Try Dropout (input: 20%, hidden: 50%) with test/validation set after finding good parameters for convergence on training set.
      Distributed:
      More training samples per iteration: faster, but less accuracy?
      With ADADELTA: Try epsilon = 1e-4,1e-6,1e-8,1e-10, rho = 0.9,0.95,0.99
      Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8, momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing
      Try balance_classes = true for imbalanced classes.
      Use force_load_balance and replicate_training_data for small datasets.
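The ADADELTA tip amounts to a small grid of 4 x 3 = 12 settings. The snippet below only enumerates those combinations as plain dictionaries; it does not call any H2O API, and how each setting would be passed to a training run is left open.

```python
import itertools

# Candidate values taken from the ADADELTA tip above.
grid = {
    "epsilon": [1e-4, 1e-6, 1e-8, 1e-10],
    "rho": [0.9, 0.95, 0.99],
}

names = sorted(grid)
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*(grid[name] for name in names))
]
```

Each entry of `combinations` is one candidate setting such as `{"epsilon": 1e-4, "rho": 0.9}`; a grid search would train one model per entry and then continue training the best one.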
  34. Summary
      H2O is a distributed in-memory math platform that allows fast prototyping in Java, R, Scala and Python.
      H2O enables the development of enterprise-quality, blazingly fast machine learning applications.
      H2O Deep Learning is distributed, easy to use, and early results compete with the world’s best.
      Try it yourself and join our next meetup!
      git clone https://github.com/0xdata/h2o
      http://docs.0xdata.com
      www.h2o.ai/community
      Follow us on Twitter: @hexadata