More information in our Deep Learning webinar: http://www.slideshare.net/0xdata/h2-o-deeplearningarnocandel052114

Latest slide deck: http://www.slideshare.net/0xdata/h2o-distributed-deep-learning-by-arno-candel-071614

Published in:
Technology

- 1. Deep Learning with H2O. H2O.ai: Scalable In-Memory Machine Learning. H2O Meetup, Mountain View, 3/20/14. Arno Candel
- 2. Who am I? (Arno Candel)
  - PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
  - 6 years at SLAC: Accelerator Physics Modeling
  - 2 years at Skytree, Inc: Machine Learning
  - 3 months at 0xdata/H2O: Machine Learning
  - 10+ years in HPC, C++, MPI, Supercomputing
- 3. Outline
  - Intro
  - Theory
  - Implementation
  - Results: MNIST handwritten digits classification
  - Live Demo: prostate cancer classification and age regression, text classification
- 4. H2O: Open Source in-memory Prediction Engine for Big Data
  - Distributed in-memory math platform ➔ GLM, GBM, RF, K-Means, PCA, Deep Learning
  - Easy to use SDK / API ➔ Java, R, Scala, Python, JSON, Browser-based GUI
  - Businesses can use ALL of their data (with or without Hadoop) ➔ Modeling without Sampling
  - Big Data + Better Algorithms ➔ Better Predictions
- 5. About H2O (aka 0xdata): Pure Java, Apache v2 Open Source. Join the www.h2o.ai/community!
- 6. H2O with or without Hadoop. [Diagram: three deployment modes (Standalone, Over YARN, On MRv1 with Hadoop MR), each showing H2O and HDFS; client interfaces: R, Java, Scala, JSON, Python.]
- 7. H2O Architecture. [Diagram components: in-memory K-V store, memory manager, compression, MapReduce, Machine Learning Algorithms (e.g. Deep Learning), R Engine, Nano fast Scoring Engine, Prediction Engine.]
- 8. What is Deep Learning?
  - Wikipedia: Deep learning is a set of algorithms in machine learning that attempt to model high-level abstractions in data by using architectures composed of multiple non-linear transformations.
  - Example: input data (facial image) ➔ prediction (person’s ID)
  - Facebook DeepFace (LeCun): “Almost as good as humans at recognising faces”
  - Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
  - FBI FACE: $1 billion face recognition project
- 9. Deep Learning is trending. [Chart: Google Trends, 2011 to 2013.]
- 10. Deep Learning in H2O: 1970s multi-layer feed-forward Neural Network (supervised learning with back-propagation) + distributed processing for big data (H2O in-memory MapReduce paradigm on distributed data) + multi-threaded speedup (H2O Fork/Join worker threads update the model asynchronously) + smart algorithms for accuracy (weight initialization, adaptive learning, momentum, dropout, regularization) = Top-notch prediction engine!
- 11. Example Neural Network: a “fully connected” directed graph of neurons
  - Input layer (3 neurons): age, income, employment
  - Hidden layer 1 (4 neurons)
  - Hidden layer 2 (3 neurons)
  - Output layer (2 neurons): married, not married
  - #connections between successive layers: 3x4, 4x3, 3x2
  - Legend: information flow, input/output neuron, hidden neuron
- 12. Prediction: Forward Propagation. “Neurons activate each other via weighted sums”
  - Inputs $x_i$: age, income, employment. Outputs $p_l$: married, not married
  - $y_j = \tanh\left(\sum_i x_i u_{ij} + b_j\right)$
  - $z_k = \tanh\left(\sum_j y_j v_{jk} + c_k\right)$
  - $p_l = \mathrm{softmax}\left(\sum_k z_k w_{kl} + d_l\right)$, where $\mathrm{softmax}(x_k) = \exp(x_k) / \sum_{k'} \exp(x_{k'})$
  - Per-class probabilities: $\sum_l p_l = 1$
  - $b_j$, $c_k$, $d_l$: bias values (independent of inputs)
  - Activation function: tanh. Alternative: $x \mapsto \max(0, x)$ (“rectifier”)
  - $p_l$ is a non-linear function of $x_i$: can approximate ANY function with enough layers!
- 13. Data preparation & Initialization
  - Standardize input $x_i$ (age, income, employment): mean = 0, stddev = 1
  - Horizontalize categorical variables, e.g. {full-time, part-time, none, self-employed} -> {0,1,0} = part-time, {0,0,0} = self-employed
  - Neural Networks are sensitive to numerical noise, operate best in the linear regime (not saturated)
  - Poor man’s initialization: random weights
  - Better: uniform distribution in $\pm\sqrt{6 / (\#\text{units} + \#\text{units\_previous\_layer})}$
- 14. Stochastic Gradient Descent (SGD): improve weights and biases for EACH training row
  - For each training row, we make a prediction and compare with the actual label (supervised training)
  - Example (married, not married): predicted = (0.8, 0.2), actual = (1, 0)
  - Mean Square Error = $(0.2^2 + 0.2^2)/2$: “penalize differences per-class”
  - Cross-entropy = $-\log(0.8)$: “strongly penalize non-1-ness”
  - Objective: minimize prediction error (MSE or cross-entropy)
  - Update: $w \leftarrow w - \text{rate} \cdot \partial E/\partial w$
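- 15. Backward Propagation: how to compute $\partial E/\partial w_i$ for $w_i \leftarrow w_i - \text{rate} \cdot \partial E/\partial w_i$?
  - Definitions: $\text{net} = \sum_i w_i x_i + b$, $y = \text{activation}(\text{net})$, $E = \text{error}(y)$
  - Naive: for every $i$, evaluate $E$ twice at $(w_1, \ldots, w_i \pm \Delta, \ldots, w_N)$… Slow!
  - Backprop: compute $\partial E/\partial w_i$ via the chain rule going backwards: $\partial E/\partial w_i = \partial E/\partial y \cdot \partial y/\partial \text{net} \cdot \partial \text{net}/\partial w_i = \partial(\text{error}(y))/\partial y \cdot \partial(\text{activation}(\text{net}))/\partial \text{net} \cdot x_i$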
- 16. H2O Deep Learning Architecture
  - Diagram legend: nodes/JVMs: sync; threads: async; communication. Each node runs a K-V store and HTTPD (H2O atomic in-memory K-V store)
  - Initial weights and biases $w$
  - map: each node trains a copy of the weights and biases with (some* or all of) its local data with asynchronous F/J threads
  - reduce: average weights and biases from all nodes; in the four-node diagram, partial sums $w_1 + w_3$ and $w_2 + w_4$ combine into $w^* = (w_1 + w_2 + w_3 + w_4)/4$
  - Updated weights and biases $w^*$
  - Keep iterating over the data (“epochs”), score from time to time
  - Query & display the model via JSON, WWW
  - *mini-batch: number of total rows per iteration, can be less than 1 epoch
- 17. “Secret” Sauce to Higher Accuracy
  - Momentum training: keep changing weights and biases (even if there’s no error). “Find other local minima, and go faster along valleys”
  - Adaptive learning rate, ADADELTA (Google): automatically set learning rate for each neuron based on its training history, combines annealing and momentum features
  - Learning rate annealing: rate $r = r_0 / (1 + \beta N)$, $N$ = training samples. “Dig deeper into local minimum”
  - Grid Search and Checkpointing: run a grid search over multiple hyper-parameters, then continue training the best model
  - L1/L2/Dropout/MaxSumWeights regularization: “regularization avoids overtraining and improves generalization error”
    - L1: penalizes non-zero weights
    - L2: penalizes large weights
    - Dropout: randomly ignore certain inputs, “train exp. many models at once”
    - MaxSumWeights: reduce all incoming weights if the sum > max value
- 18. MNIST: digits classification
  - MNIST: digitized handwritten digits database (Yann LeCun)
  - Data: 28x28 = 784 pixels with values in 0…255 (gray-scale)
  - Train: 60,000 rows, 784 integer columns, 10 classes
  - Test: 10,000 rows, 784 integer columns, 10 classes
  - One of the most popular multi-class classification problems
  - Without distortions or convolutions (which help), the best-ever published error rate on test set: 0.83% (Microsoft)
- 19. H2O Deep Learning on MNIST: 0.95% test set error (so far)
  - Test set error (1 node): 1.5% after 40 epochs, 1.02% after 400 epochs, 0.95% after 4000 epochs
  - Most frequent mistakes: confuse 4 with 6 and 9, and 7 with 2
- 20. Prostate Cancer Dataset
- 21. Live Demo: Cancer Prediction. Interactive ROC curve with real-time updates
- 22. Live Demo: Cancer Prediction. 0% training error with only 322 model parameters in seconds!
- 23. Live Demo: Grid Search Regression
  - Doing a grid search to find good hyper-parameters to predict AGE from the other 7 features, then continue training the best model
  - Regression: 1 linear output neuron
  - 5 hidden tanh layers of 50 neurons each, rho = 0.99, epsilon = 1e-10
  - MSE < 1 for test set ages in 44…79
- 24. Live Demo: ebay Text Classification
  - Users enter a description when selling an item. Task: predict the type of item
  - Data prep: binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0
  - H2O parses SVMLight sparse format: label 3:1 9:1 13:1 …
  - “Small” sample dataset on jewelry and watches:
    - Train: 578,361 rows, 8,647 cols, 467 classes
    - Test: 64,263 rows, 8,647 cols, 143 classes
  - H2O compressed columnar in-memory store: only needs 60MB to store 5 billion entries (never inflated)
- 25. Live Demo: ebay Text Classification
  - Work in progress, shown results are for illustration only!
  - Default parameters, no tuning, 4 nodes (16 cores each)
  - Train: 578,361 rows, 8,647 cols, 467 classes
  - Test: 64,263 rows, 8,647 cols, 143 classes
- 26. Tips for H2O Deep Learning
  - General:
    - More layers: more complex functions (non-linearity)
    - More neurons per layer: detect finer structure in data
    - More regularization: less overfitting (better validation error)
  - Do a grid search to get a feel for convergence, then continue training.
  - Try Tanh first. For Rectifier, try max_w2 = 50 and/or L1 = 1e-5.
  - Try TanhDropout or RectifierDropout with test/validation set after finding good parameters for convergence on training set.
  - Distributed: a smaller mini-batch means more communication and slower training, but higher accuracy.
  - With ADADELTA: try epsilon = 1e-4, 1e-6, 1e-8, 1e-10 and rho = 0.9, 0.95, 0.99.
  - Without ADADELTA: try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8.
  - Try momentum_start = 0.5, momentum_stable = 0.99, momentum_ramp = 1/rate_annealing.
  - Try balance_classes = true for imbalanced classes.
  - Try force_load_balance for small datasets.
- 27. Summary
  - H2O is a distributed in-memory math platform that allows fast prototyping in Java, R, Scala and Python.
  - H2O enables the development of enterprise-quality, blazing-fast machine learning applications.
  - H2O Deep Learning is distributed, easy to use, and early results compete with the world’s best.
  - Deep Learning makes better predictions!
  - Try it yourself and join our next meetup! git clone https://github.com/0xdata/h2o
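
The forward-propagation formulas on slide 12 and the uniform weight initialization on slide 13 can be sketched in plain Python. This is an illustrative sketch only, not H2O's implementation: the function names (`init_layer`, `affine`, `softmax`, `forward`), the zero initial biases, the fixed random seed and the sample input values are all assumptions made for the example. The layer sizes 3, 4, 3, 2 follow the example network on slide 11.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Uniform weights in +/- sqrt(6 / (n_in + n_out)), as on slide 13.

    Zero initial biases are an assumption; the slides do not specify them.
    """
    limit = math.sqrt(6.0 / (n_in + n_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(n_out)]
               for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases


def affine(x, weights, biases):
    """Weighted sum per receiving neuron: sum_i x_i * w_ij + b_j."""
    return [sum(xi * weights[i][j] for i, xi in enumerate(x)) + biases[j]
            for j in range(len(biases))]


def softmax(values):
    """exp(v_k) / sum_k exp(v_k), shifted by the max for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """tanh on every hidden layer, softmax on the output layer."""
    activation = x
    for weights, biases in layers[:-1]:
        activation = [math.tanh(v) for v in affine(activation, weights, biases)]
    weights, biases = layers[-1]
    return softmax(affine(activation, weights, biases))


rng = random.Random(0)                       # fixed seed, illustration only
sizes = [3, 4, 3, 2]                         # layer sizes from slide 11
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward([0.5, -1.2, 0.3], layers)    # standardized age, income, employment
print(probs)                                 # two class probabilities summing to 1
```

The input vector is assumed to be already standardized (mean 0, stddev 1), as slide 13 recommends.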

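The error measures on slide 14 and the chain-rule gradient on slide 15 can be checked numerically for a single neuron. The sketch below is a minimal example under stated assumptions: a tanh activation, a squared error $E = \tfrac{1}{2}(y - \text{target})^2$, and made-up weights, inputs and learning rate. All function names are hypothetical. It compares the backprop gradient against the "naive" two-evaluation finite difference that the slide calls slow.

```python
import math

# Slide 14 example: predicted (0.8, 0.2) versus actual (1, 0).
predicted = [0.8, 0.2]
actual = [1.0, 0.0]
mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
cross_entropy = -sum(a * math.log(p) for p, a in zip(predicted, actual) if a > 0)


def neuron_error(w, b, x, target):
    """E = 0.5 * (y - target)^2 with y = tanh(net), net = sum_i w_i * x_i + b."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 0.5 * (math.tanh(net) - target) ** 2


def backprop_gradient(w, b, x, target):
    """Chain rule: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i."""
    net = sum(wi * xi for wi, xi in zip(w, x)) + b
    y = math.tanh(net)
    dE_dy = y - target          # derivative of the squared error
    dy_dnet = 1.0 - y * y       # derivative of tanh
    return [dE_dy * dy_dnet * xi for xi in x]   # dnet/dw_i = x_i


def numeric_gradient(w, b, x, target, delta=1e-6):
    """Naive approach: evaluate E twice per weight, at w_i + delta and w_i - delta."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += delta
        down[i] -= delta
        grads.append((neuron_error(up, b, x, target)
                      - neuron_error(down, b, x, target)) / (2 * delta))
    return grads


def sgd_step(w, grad, rate):
    """w <- w - rate * dE/dw."""
    return [wi - rate * gi for wi, gi in zip(w, grad)]


w, b, x, target = [0.1, -0.2, 0.3], 0.0, [1.0, 0.5, -1.5], 1.0
grad = backprop_gradient(w, b, x, target)
w_new = sgd_step(w, grad, rate=0.1)
print(mse, cross_entropy)
print(neuron_error(w, b, x, target), neuron_error(w_new, b, x, target))
```

The finite-difference version needs two error evaluations per weight, while backprop obtains every $\partial E/\partial w_i$ from one forward pass and one backward pass.
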
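
The reduce step on slide 16 (averaging the per-node weights) and the learning rate annealing formula on slide 17 are both short enough to sketch directly. The helper names `average_models` and `annealed_rate`, and all numeric values below, are assumptions for illustration and are not taken from the H2O code base.

```python
def average_models(node_weights):
    """Reduce step: element-wise average of the weight vectors from all nodes,
    e.g. w* = (w1 + w2 + w3 + w4) / 4 for four nodes."""
    n_nodes = len(node_weights)
    return [sum(values) / n_nodes for values in zip(*node_weights)]


def annealed_rate(r0, beta, samples_seen):
    """Learning rate annealing: r = r0 / (1 + beta * N)."""
    return r0 / (1.0 + beta * samples_seen)


# Four nodes, each holding its own locally trained copy of a 2-weight model.
node_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w_star = average_models(node_weights)

# The rate shrinks as more training samples are processed.
rates = [annealed_rate(0.01, 1e-6, n) for n in (0, 1_000_000, 10_000_000)]
print(w_star, rates)
```

In this sketch every node contributes equally to the average, which matches the four-node formula on the slide; weighting nodes by the number of rows they processed would be a different design choice that the slides do not describe.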