H2O Deep Learning at Next.ML



Scalable Data Science and Deep Learning with H2O

In this session, we introduce the H2O data science platform. We will explain its scalable in-memory architecture and design principles and focus on the implementation of distributed deep learning in H2O. Advanced features such as adaptive learning rates, various forms of regularization, automatic data transformations, checkpointing, grid-search, cross-validation and auto-tuning turn multi-layer neural networks of the past into powerful, easy-to-use predictive analytics tools accessible to everyone. We will present a broad range of use cases and live demos that include world-record deep learning models, anomaly detection tools and approaches for Kaggle data science competitions. We also demonstrate the applicability of H2O in enterprise environments for real-world customer production use cases.

By the end of the hands-on session, attendees will have learned to perform end-to-end data science workflows with H2O using both the easy-to-use web interface and the flexible R interface. We will cover data ingest, basic feature engineering, feature selection, hyperparameter optimization with N-fold cross-validation, multi-model scoring and taking models into production. We will train supervised and unsupervised methods on realistic datasets. With best-of-breed machine learning algorithms such as elastic net, random forest, gradient boosting and deep learning, you will be able to create your own smart applications.

A local installation of RStudio is recommended for this session.

- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata

Published in: Software


  1. Scalable Data Science and Deep Learning with H2O. Next.ML Workshop, San Francisco, 1/17/15. Arno Candel, H2O.ai. http://tiny.cc/h2o_next_ml_slides
  2. Who am I?
     - PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
     - 6 years at SLAC: Accelerator Physics Modeling
     - 2 years at Skytree: Machine Learning
     - 13 months at H2O.ai: Machine Learning
     - 15 years in Supercomputing & Modeling
     - Named “2014 Big Data All-Star” by Fortune Magazine
     - @ArnoCandel
  3. H2O Deep Learning, @ArnoCandel: Outline
     - Introduction (10 mins)
     - Methods & Implementation (20 mins)
     - Results and Live Demos (20 mins): Higgs boson classification, MNIST handwritten digits, eBay text classification
     - h2o-dev Outlook: Flow, Python
     - Part 2: Hands-On Session (40 mins): Web GUI (Higgs dataset), R Studio (Adult, Higgs, MNIST datasets)
  4. Teamwork at H2O.ai: Java, Apache v2 Open-Source. #1 Java Machine Learning in GitHub. Join the community!
  5. H2O: Open-Source (Apache v2) Predictive Analytics Platform
  6. H2O Architecture: designed for speed, scale, accuracy & ease of use
     - Key technical points:
       • distributed JVMs + REST API
       • no Java GC issues (data in byte[], Double)
       • loss-less number compression
       • Hadoop integration (v1, YARN)
       • R package (CRAN)
     - Pre-built fully featured algos: K-Means, NB, PCA, CoxPH, GLM, RF, GBM, DeepLearning
  7. What is Deep Learning?
     - Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
     - Example: Facebook DeepFace (input: image, output: user ID)
  8. What is NOT Deep
     - Linear models are not deep (by definition)
     - Neural nets with 1 hidden layer are not deep (only 1 layer, no feature hierarchy)
     - SVMs and kernel methods are not deep (2 layers: kernel + linear)
     - Classification trees are not deep (operate on original input space, no new features generated)
  9. H2O Deep Learning
     - 1970s multi-layer feed-forward neural network (stochastic gradient descent with back-propagation)
     - + distributed processing for big data (fine-grain in-memory MapReduce on distributed data)
     - + multi-threaded speedup (async fork/join worker threads operate at FORTRAN speeds)
     - + smart algorithms for fast & accurate results (automatic standardization, one-hot encoding of categoricals, missing value imputation, weight & bias initialization, adaptive learning rate, momentum, dropout/L1/L2 regularization, grid search, N-fold cross-validation, checkpointing, load balancing, auto-tuning, model averaging, etc.)
     - = powerful tool for (un)supervised machine learning on real-world data
     - [Screenshot caption: all 320 cores maxed out]
  10. Example Neural Network: “fully connected” directed graph of neurons
      - Input layer (3 neurons): age, income, employment
      - Hidden layer 1 (4 neurons), hidden layer 2 (3 neurons)
      - Output layer (2 neurons): married, single
      - Connections between consecutive layers: 3x4, 4x3, 3x2
      - Information flows from the input neurons through the hidden neurons to the output neurons
  11. Prediction: Forward Propagation (“neurons activate each other via weighted sums”)
      - Inputs $x_i$: age, income, employment
      - Hidden layer 1 (weights $u_{ij}$): $y_j = \tanh(\sum_i x_i u_{ij} + b_j)$
      - Hidden layer 2 (weights $v_{jk}$): $z_k = \tanh(\sum_j y_j v_{jk} + c_k)$
      - Output layer (weights $w_{kl}$; classes married, single): $p_l = \mathrm{softmax}(\sum_k z_k w_{kl} + d_l)$
      - $p_l$ are per-class probabilities: $\sum_l p_l = 1$
      - $\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$
      - $b_j$, $c_k$, $d_l$: bias values (independent of inputs)
      - Activation function: tanh; alternative: $x \to \max(0, x)$ (“rectifier”)
      - $p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!
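The forward pass on this slide can be sketched in a few lines of plain Python. This is only an illustration of the formulas, not H2O's implementation: the weight and bias values are made up, and the helper names `dense`, `softmax` and `forward` are hypothetical.

```python
import math

def softmax(values):
    # Subtracting the maximum improves numerical stability and leaves the result unchanged.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    # weights[i][j] connects input i to output neuron j.
    return [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]

def forward(x, u, b, v, c, w, d):
    y = [math.tanh(a) for a in dense(x, u, b)]  # hidden layer 1
    z = [math.tanh(a) for a in dense(y, v, c)]  # hidden layer 2
    return softmax(dense(z, w, d))              # per-class probabilities

# Made-up weights for a 3-4-3-2 network (age, income, employment -> married, single).
u = [[0.1, -0.2, 0.3, 0.0], [0.2, 0.1, -0.1, 0.4], [-0.3, 0.2, 0.1, 0.1]]
b = [0.0, 0.1, -0.1, 0.0]
v = [[0.2, -0.1, 0.0], [0.1, 0.3, -0.2], [-0.2, 0.1, 0.4], [0.0, -0.3, 0.2]]
c = [0.1, 0.0, -0.1]
w = [[0.5, -0.5], [-0.2, 0.2], [0.3, -0.3]]
d = [0.0, 0.0]

p = forward([0.5, -1.0, 0.25], u, b, v, c, w, d)
print(p, sum(p))
```

Whatever the weights are, the two outputs are positive and sum to 1, which is what lets them be read as class probabilities.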
  12. Data preparation & Initialization
      - Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated)
      - Automatic standardization of data: each input $x_i$ gets mean = 0, stddev = 1
      - Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed}: {0,1,0} = part-time, {0,0,0} = self-employed
      - Automatic initialization of weights $w_{kl}$
      - Poor man's initialization: random weights
      - Default (better): uniform distribution in $\pm\sqrt{6 / (\#\text{units} + \#\text{units\_previous\_layer})}$
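A minimal sketch of the three preparation steps named on this slide, in plain Python. The function names are hypothetical and the sample values are made up. Treating the last categorical level as the all-zeros reference level is an assumption read off the slide's {0,0,0} = self-employed example.

```python
import math
import random

def standardize(column):
    # Shift to mean 0 and scale to standard deviation 1 (population formula).
    # Assumes the column is not constant, otherwise std would be 0.
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def one_hot(value, levels):
    # The last level acts as the reference and maps to all zeros.
    return [1 if value == level else 0 for level in levels[:-1]]

def uniform_init(units_prev, units, rng):
    # Uniform weights in +/- sqrt(6 / (#units + #units_previous_layer)).
    limit = math.sqrt(6.0 / (units + units_prev))
    return [[rng.uniform(-limit, limit) for _ in range(units)]
            for _ in range(units_prev)]

levels = ["full-time", "part-time", "none", "self-employed"]
ages = standardize([23.0, 35.0, 47.0, 59.0])
weights = uniform_init(3, 4, random.Random(42))
print(ages)
print(one_hot("part-time", levels), one_hot("self-employed", levels))
```

Keeping inputs near zero and weights small keeps the tanh units away from saturation at the start of training, which is the "linear regime" the slide refers to.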
  13. Training: Update Weights & Biases
      - For each training row, we make a prediction and compare it with the actual label (supervised learning). Example: married: predicted 0.8, actual 1; single: predicted 0.2, actual 0
      - Objective: minimize prediction error (MSE or cross-entropy)
      - Mean Square Error = $(0.2^2 + 0.2^2)/2$ (“penalize differences per-class”)
      - Cross-entropy = $-\log(0.8)$ (“strongly penalize non-1-ness”)
      - Stochastic Gradient Descent: update weights and biases via the gradient of the error (via back-propagation): $w \leftarrow w - \text{rate} \cdot \partial E/\partial w$
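The two error measures for the slide's example (predicted 0.8 / 0.2 against actual 1 / 0) and a single gradient-descent update can be worked through as follows. The weights, gradient and learning rate passed to `sgd_step` are made-up illustration values, and `sgd_step` itself is a hypothetical helper.

```python
import math

predicted = [0.8, 0.2]  # married, single
actual = [1.0, 0.0]

# Mean square error: average of the squared per-class differences.
mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Cross-entropy: only the true class contributes, giving -log(0.8).
cross_entropy = -sum(a * math.log(p) for p, a in zip(predicted, actual))

def sgd_step(w, grad, rate):
    # w <- w - rate * dE/dw, applied to each weight.
    return [wi - rate * gi for wi, gi in zip(w, grad)]

print(mse, cross_entropy)
print(sgd_step([0.5, -0.3], [0.1, -0.2], rate=0.01))
```

The mean square error comes out at about 0.04 and the cross-entropy at about 0.223.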
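  14. Backward Propagation
      - How to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \text{rate} \cdot \partial E/\partial w_i$?
      - Naive: for every $i$, evaluate $E$ twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow!
      - Backprop: compute $\partial E/\partial w_i$ via the chain rule, going backwards
      - Definitions: $\text{net} = \sum_i w_i x_i + b$, $y = \text{activation}(\text{net})$, $E = \text{error}(y)$
      - $\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \text{net} \cdot \partial \text{net}/\partial w_i = \partial(\text{error}(y))/\partial y \cdot \partial(\text{activation}(\text{net}))/\partial \text{net} \cdot x_i$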
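To make the contrast concrete, here is a sketch for a single tanh neuron that computes the gradient both ways: once with the chain rule and once with the naive "evaluate E twice per weight" approach. The squared-error loss, the input values and the function names are assumptions chosen for illustration.

```python
import math

def loss(w, b, x, target):
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    return 0.5 * (y - target) ** 2

def backprop_grad(w, b, x, target):
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = y - target      # derivative of 0.5 * (y - target)^2 with respect to y
    dy_dnet = 1.0 - y * y   # derivative of tanh(net) with respect to net
    return [dE_dy * dy_dnet * xi for xi in x]  # dnet/dw_i = x_i

def numeric_grad(w, b, x, target, delta=1e-6):
    # Naive approach: two loss evaluations for every single weight.
    grads = []
    for i in range(len(w)):
        up = list(w)
        up[i] += delta
        down = list(w)
        down[i] -= delta
        grads.append((loss(up, b, x, target) - loss(down, b, x, target)) / (2 * delta))
    return grads

w, b, x, target = [0.4, -0.6, 0.2], 0.1, [1.0, 0.5, -1.5], 1.0
analytic = backprop_grad(w, b, x, target)
numeric = numeric_grad(w, b, x, target)
print(analytic)
print(numeric)
```

Both give the same numbers up to rounding, but the chain-rule version needs one forward pass in total, while the naive version needs 2N loss evaluations for N weights.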
  15. H2O Deep Learning Architecture
      - Initial model: weights and biases $w$, held in the H2O atomic in-memory K-V store
      - Map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
      - Reduce: model averaging: average weights and biases from all nodes, e.g. $w^* = (w_1 + w_2 + w_3 + w_4)/4$; speedup is at least #nodes/log(#rows) (arxiv:1209.4129v3)
      - Updated model: $w^*$
      - Communication: sync between nodes/JVMs, async between threads
      - Keep iterating over the data (“epochs”), score from time to time
      - Query & display the model via JSON, WWW
      - *auto-tuned (default) or user-specified number of points per MapReduce iteration
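The reduce step described here, $w^* = (w_1 + w_2 + w_3 + w_4)/4$, amounts to an element-wise average of the per-node weight vectors. The sketch below shows only that averaging. The node weights are made-up numbers and `average_models` is a hypothetical helper, not an H2O function.

```python
def average_models(models):
    # Element-wise mean of the weight vectors trained on each node.
    n = len(models)
    return [sum(ws) / n for ws in zip(*models)]

# Made-up weights after one map phase on four nodes.
node_weights = [
    [0.10, -0.20, 0.30],
    [0.14, -0.22, 0.26],
    [0.08, -0.18, 0.34],
    [0.12, -0.20, 0.30],
]
w_star = average_models(node_weights)
print(w_star)
```

In the scheme on the slide, the averaged model is then sent back to every node as the starting point for the next iteration.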
  16. “Secret” Sauce to Higher Accuracy
      - Adaptive learning rate, ADADELTA (Google): automatically set learning rate for each neuron based on its training history
      - Grid Search and Checkpointing: run a grid search to scan many hyper-parameters, then continue training the most promising model(s)
      - Regularization:
        - L1: penalizes non-zero weights
        - L2: penalizes large weights
        - Dropout: randomly ignore certain inputs
        - Hogwild!: intentional race conditions
        - Distributed mode: weight averaging
  17. Detail: Adaptive Learning Rate (cf. ADADELTA paper)
      - Compute moving average of $\Delta w_i^2$ at time $t$ for window length rho: $E[\Delta w_i^2]_t = \rho \cdot E[\Delta w_i^2]_{t-1} + (1-\rho) \cdot \Delta w_i^2$
      - Compute RMS of $\Delta w_i$ at time $t$ with smoothing epsilon: $\mathrm{RMS}[\Delta w_i]_t = \sqrt{E[\Delta w_i^2]_t + \epsilon}$
      - Do the same for $\partial E/\partial w_i$, then obtain per-weight learning rate: $\text{rate}(w_i, t) = \mathrm{RMS}[\Delta w_i]_{t-1} / \mathrm{RMS}[\partial E/\partial w_i]_t$
      - Adaptive annealing / progress: gradient-dependent learning rate, moving window prevents “freezing” (unlike ADAGRAD: no window)
      - Adaptive acceleration / momentum: accumulate previous weight updates, but over a window of time
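The per-weight update from this slide can be sketched for a single scalar weight as below. The values rho = 0.99 and epsilon = 1e-8 are illustrative choices, the toy objective E(w) = 0.5 * w^2 is made up, and `adadelta_step` is a hypothetical helper rather than H2O's code.

```python
import math

def adadelta_step(w, grad, state, rho=0.99, epsilon=1e-8):
    avg_grad2, avg_delta2 = state
    # Moving average of the squared gradient at time t.
    avg_grad2 = rho * avg_grad2 + (1 - rho) * grad * grad
    # rate = RMS[delta w]_{t-1} / RMS[dE/dw]_t
    rate = math.sqrt(avg_delta2 + epsilon) / math.sqrt(avg_grad2 + epsilon)
    delta = -rate * grad
    # Moving average of the squared weight update, used at the next step.
    avg_delta2 = rho * avg_delta2 + (1 - rho) * delta * delta
    return w + delta, (avg_grad2, avg_delta2)

# Toy problem: minimize E(w) = 0.5 * w^2, whose gradient is simply w.
w, state = 1.0, (0.0, 0.0)
for _ in range(200):
    w, state = adadelta_step(w, w, state)
print(w, state)
```

No global learning rate appears anywhere: the step size comes entirely from the two moving averages, which is why the method needs little manual tuning.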
  18. Detail: Dropout Regularization (cf. Geoff Hinton's paper)
      - Training: for each hidden neuron, for each training sample, for each iteration, ignore (zero out) a different random fraction p of input activations (in the diagram, dropped activations are marked with an X).
      - Testing: use all activations, but scale them by the kept fraction 1 - p (to “simulate” the missing activations during training). For the common choice p = 0.5 this halves them.
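A minimal sketch of the two dropout modes, assuming p is the fraction of activations that gets dropped. Under that assumption the test-time scaling factor is the kept fraction 1 - p, which coincides with p for the common choice p = 0.5. The function names are hypothetical.

```python
import random

def dropout_train(activations, p, rng):
    # Zero out each activation independently with probability p.
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p):
    # Use every activation, scaled by the fraction that was kept during training.
    return [a * (1.0 - p) for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
dropped = dropout_train(acts, 0.5, rng)
kept_fraction = sum(dropped) / len(dropped)
scaled = dropout_test(acts, 0.5)
print(kept_fraction, scaled[0])
```

On average the training-time sum of activations matches the scaled test-time sum, so the next layer sees inputs of comparable magnitude in both modes.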
  19. Application: Higgs Boson Classification (Higgs vs. Background)
      - Large Hadron Collider: Largest experiment of mankind! $13+ billion, 16.8 miles long, 120 MegaWatts, -456F, 1PB/day, etc.
      - Higgs boson discovery (July ’12) led to 2013 Nobel prize!
      - HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features (physics formulae)
      - Train: 10M rows, Valid: 500k, Test: 500k rows
      - http://arxiv.org/pdf/1402.4735v2.pdf
      - Images courtesy CERN / LHC
  20. Higgs: Derived features are important!
      - Live Demo: Let's see what Deep Learning can do with low-level features alone!
      - Former baseline for AUC: 0.733 and 0.816

      | H2O Algorithm | low-level H2O AUC | all features H2O AUC |
      |---|---|---|
      | Generalized Linear Model | 0.596 | 0.684 |
      | Random Forest | 0.764 | 0.840 |
      | Gradient Boosted Trees | 0.753 | 0.839 |
      | Neural Net 1 hidden layer | 0.760 | 0.830 |
      | H2O Deep Learning | ? | |
  21. MNIST: digits classification
      - MNIST = Digitized handwritten digits database (Yann LeCun)
      - Data: 28x28=784 pixels with (gray-scale) values in 0…255
      - Train: 60,000 rows, 784 integer columns, 10 classes
      - Test: 10,000 rows, 784 integer columns, 10 classes
      - Standing world record: without distortions or convolutions, the best-ever published error rate on test set: 0.83% (Microsoft)
      - Yann LeCun: “Yet another advice: don't get fooled by people who claim to have a solution to Artificial General Intelligence. Ask them what error rate they get on MNIST or ImageNet.”
  22. H2O Deep Learning beats MNIST: World-record! 0.83% test set error
      - Standard 60k/10k data
      - No distortions
      - No convolutions
      - No unsupervised training
      - No ensemble
      - 10 hours on 10 16-core nodes
      - http://learn.h2o.ai/content/hands-on_training/deep_learning.html
  23. POJO Model Export for Production Scoring: Plain old Java code is auto-generated to take your H2O Deep Learning models into production!
  24. Parallel Scalability (for 64 epochs on MNIST, with “0.83%” parameters)
      - [Chart: speedup vs. number of H2O nodes, 1 to 63 (4 cores per node, 1 epoch per node per MapReduce)]
      - [Chart: training time in minutes vs. number of H2O nodes, 1 to 63; annotated value: 2.7 mins]
  25. Text Classification. Goal: Predict the item from seller's text description
      - Example: “Vintage 18KT gold Rolex 2 Tone in great condition”
      - Data: bag of words vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0 (the non-zero entries mark words such as vintage, gold, condition)
      - Train: 578,361 rows, 8,647 cols, 467 classes
      - Test: 64,263 rows, 8,647 cols, 143 classes
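As an illustration of the bag-of-words encoding, the sketch below turns the slide's example description into a 0/1 vector over a tiny vocabulary. The 7-word vocabulary is made up (the real data has 8,647 columns), the whitespace tokenization is an assumption, and `bag_of_words` is a hypothetical helper.

```python
def bag_of_words(text, vocabulary):
    # 1 if the vocabulary term occurs in the text, 0 otherwise.
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

# Made-up vocabulary, one vector position per term.
vocabulary = ["antique", "bracelet", "condition", "gold", "silver", "vintage", "watch"]
vector = bag_of_words("Vintage 18KT gold Rolex 2 Tone in great condition", vocabulary)
print(vector)
```

Only the positions for "condition", "gold" and "vintage" are set, so the vector is mostly zeros, which is why such data compresses well in a columnar store.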
  26. Text Classification: Out-Of-The-Box: 11.6% test set error after 10 epochs! Predicts the correct class (out of 143) 88.4% of the time!
      - Train: 578,361 rows, 8,647 cols, 467 classes
      - Test: 64,263 rows, 8,647 cols, 143 classes
      - Note 1: H2O columnar-compressed in-memory store only needs 60 MB to store 5 billion values (dense CSV needs 18 GB)
      - Note 2: No tuning was done (results are for illustration only)
  27. MNIST: Unsupervised Anomaly Detection with Deep Learning (Autoencoder): The good, the bad, the ugly. Download the script and run it yourself!
  28. Higgs: Live Demo (Continued): How well did Deep Learning do? Let's see how H2O did in the past 10 minutes! Any guesses for AUC on low-level features? AUC=0.76 was the best for RF/GBM/NN (H2O). [Chart labels: <your guess?>, reference paper results]
  29. H2O Steam: Scoring Platform. Live Demo: Higgs Dataset Demo on 10-node cluster. Let's score all our H2O models and compare them! http://server:port/steam/index.html
  30. Scoring Higgs Models in H2O Steam. Live Demo on 10-node cluster: <10 minutes runtime for all H2O algos! Better than LHC baseline of AUC=0.73!
  31. Higgs Particle Detection with H2O
      - HIGGS UCI Dataset: 21 low-level features AND 7 high-level derived features. Train: 10M rows, Test: 500k rows
      - Parameters were not heavily tuned; H2O running on 10 nodes

      | Algorithm | Paper's low-level AUC | H2O AUC (low-level) | H2O AUC (all features) | Parameters |
      |---|---|---|---|---|
      | Generalized Linear Model | - | 0.596 | 0.684 | default, binomial |
      | Random Forest | - | 0.764 | 0.840 | 50 trees, max depth 50 |
      | Gradient Boosted Trees | 0.73 | 0.753 | 0.839 | 50 trees, max depth 15 |
      | Neural Net 1 layer | 0.733 | 0.760 | 0.830 | 1x300 Rectifier, 100 epochs |
      | Deep Learning 3 hidden layers | 0.836 | 0.850 | - | 3x1000 Rectifier, L2=1e-5, 40 epochs |
      | Deep Learning 4 hidden layers | 0.868 | 0.869 | - | 4x500 Rectifier, L1=L2=1e-5, 300 epochs |
      | Deep Learning 5 hidden layers | 0.880 | 0.871 | - | 5x500 Rectifier, L1=L2=1e-5 |

      - Deep Learning on low-level features alone beats everything else!
      - Prelim. H2O results compare well with paper's results* (TMVA & Theano)
      - *Nature paper: http://arxiv.org/pdf/1402.4735v2.pdf
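All results above are reported as AUC. For readers who want to see what that number measures, here is a small sketch that computes it as the fraction of (positive, negative) pairs that the model ranks correctly, counting ties as half. The labels and scores are made up, and this pairwise formulation is only practical for small samples.

```python
def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Count the pairs where the positive example gets the higher score.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 1 = signal (Higgs), 0 = background.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
result = auc(labels, scores)
print(result)
```

Here 8 of the 9 pairs are ordered correctly. An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one, which puts the 0.596 to 0.880 range in the table into perspective.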
  32. Coming very soon: h2o-dev. New UI: Flow. New languages: Python, JavaScript
  33. h2o-dev Python Example
  34. Part 2: Hands-On Session
      - Web GUI: import Higgs data and split into train/test; train grid search Deep Learning model; continue training the best model; ROC and multi-model scoring
      - R Studio: connect to running H2O cluster from R; run ML algos on 3 different datasets
      - More: follow examples from http://learn.h2o.ai (R scripts and data at http://data.h2o.ai)
  35. H2O Docker VM: http://h2o.ai/blog/2015/01/h2o-docker/ H2O will be at http://`boot2docker ip`:8996
  36. Import Higgs data
  37. Split Into Train/Test
  38. Train Grid Search DL Model
  39. Continue Training Best Model
  40. Inspect ROC, thresholds, etc.
  41. Multi-Model Scoring
  42. Control H2O from R Studio (R scripts in GitHub)
      - 1) Paste content of http://tiny.cc/h2o_next_ml into R Studio
      - 2) Execute line by line with Ctrl-Enter to run ML algorithms on H2O Cluster via R
      - 3) Check out the links below for more info: http://learn.h2o.ai/ and http://h2o.gitbooks.io
  43. Snippets from R script: Install H2O R package & connect to H2O Server; Run Deep Learning on MNIST
  44. H2O GitBooks: Deep Learning. Also available: GBM & GLM GitBooks at http://h2o.gitbooks.io. H2O World (learn.h2o.ai): R, EC2, Hadoop
  45. H2O Kaggle Starter R Scripts
  46. Re-Live H2O World! Watch the Videos: http://h2o.ai/h2o-world/ http://learn.h2o.ai
      - Day 1: • Hands-On Training • Supervised • Unsupervised • Advanced Topics • Marketing Use Case • Product Demos • Hacker-Fest with Cliff Click (CTO, Hotspot)
      - Day 2: • Speakers from Academia & Industry • Trevor Hastie (ML) • John Chambers (S, R) • Josh Bloch (Java API) • Many use cases from customers • 3 Top Kaggle Contestants (Top 10) • 3 Panel discussions
  47. You can participate!
      - Images: Convolutional & Pooling Layers (PUB-644)
      - Sequences: Recurrent Neural Networks (PUB-1052)
      - Faster Training: GPGPU support (PUB-1013)
      - Pre-Training: Stacked Auto-Encoders (PUB-1014)
      - Ensembles (PUB-1072)
      - Use H2O at Kaggle Challenges!
  48. Key Take-Aways
      - H2O is an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning.
      - H2O Deep Learning is ready to take your advanced analytics to the next level. Try it on your data!
      - Join our Community and Meetups! https://github.com/h2oai, h2ostream community forum, www.h2o.ai, @h2oai
      - Thank you!