
Deep Learning at Scale

Slides from Strata+Hadoop Singapore 2016 presenting how Deep Learning can be scaled both vertically and horizontally, when to use CPUs and when to use GPUs.


  1. 1. Deep Learning at scale. Mateusz Dymczyk, Software Engineer, H2O.ai. Strata+Hadoop Singapore, 08.12.2016
  2. 2. About me • M.Sc. in CS @ AGH UST, Poland • CS Ph.D. dropout • Software Engineer @ H2O.ai • Previously ML/NLP and distributed systems @ Fujitsu Laboratories and en-japan inc.
  3. 3. Agenda • Deep learning: a brief introduction • Why scale • Scaling models + implementations • Demo • A peek into the future
  4. 4. Deep Learning
  5. 5. Use Cases • Text classification (item prediction) • Fraud detection (11% accuracy boost in production) • Image classification • Machine translation • Recommendation systems
  6. 6. Deep Learning Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations.* *https://en.wikipedia.org/wiki/Deep_learning
  7. 7. Deep Learning
  8. 8. SGD • Memory efficient • Fast • Not easy to parallelize without speed degradation [Diagram: initialize parameters, then until converged: get training sample i and update the parameters (steps 1-3)]
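To make the loop on this slide concrete, here is a minimal single-node SGD sketch (the linear model, toy data, and learning rate are made up for illustration; real deep learning replaces the update line with backpropagated gradients):

```python
import numpy as np

# Toy data: y = 3*x + noise (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)

w, b = 0.0, 0.0                           # 1. initialize parameters
lr = 0.01

for epoch in range(5):                    # "until converged" (fixed number of epochs here)
    for i in rng.permutation(len(X)):     # 2. get training sample i
        err = (w * X[i, 0] + b) - y[i]
        w -= lr * err * X[i, 0]           # 3. update parameters with the gradient
        b -= lr * err                     #    of the squared loss for sample i
print(w, b)                               # approaches 3.0 and 0.0
```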
  9. 9. Deep Learning PRO • Relatively simple concept • Non-linear • Versatile and flexible • Features can be extracted • Great with big data • Very promising results in multiple fields CON • Hard to interpret • Not well understood theory • Lots of architectures • A lot of hyper-parameters • Slow training, data hungry • CPU/GPU hungry • Overfits
  10. 10. Why scale? PRO • Relatively simple concept • Non-linear • Versatile and flexible • Features can be extracted • Great with big data • Very promising results in multiple fields CON • Hard to interpret • Not well understood theory • Lots of architectures • A lot of hyper-parameters • Slow training, data hungry • CPU/GPU hungry • Overfits Grid search?
  11. 11. Why not to scale? • Distribution isn’t free: overhead due to network traffic, synchronization etc. • Small neural network (not many layers and/or neurons) • small+shallow network = not much computation per iteration • Small data • Very slow network communication
  12. 12. Distribution
  13. 13. Distribution models • Model parallelism • Data parallelism • Mixed/composed • Parameter server vs *peer-to-peer • Communication: Asynchronous vs Synchronous *http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
  14. 14. Model parallelism • Each node computes a different part of the network • Potential to scale to large models • Rather difficult to implement and reason about • Originally designed for large convolutional layers in GoogLeNet [Diagram: nodes 1-3 each hold a part of the model; a parameter server exchanges parameters or deltas]
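As a toy illustration of the idea (not H2O's or Google's implementation), the sketch below splits one network's layers across two pretend "nodes", so only activations flow between them; all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

class Node:
    """Pretend worker that owns only its own slice of the model's layers."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    def forward(self, x):
        return np.maximum(0.0, x @ self.W)   # ReLU layer owned by this node

# Model parallelism: layer 1 lives on node1, layer 2 on node2.
node1 = Node(784, 256)
node2 = Node(256, 10)

x = rng.normal(size=(32, 784))       # a mini-batch
h = node1.forward(x)                 # computed on node 1
logits = node2.forward(h)            # activations are "sent" on to node 2
print(logits.shape)                  # (32, 10)
```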
  15. 15. Data parallelism • Each node computes all the parameters • Using part (or all) of local data • Results have to be combined [Diagram: nodes 1-4 each train a full copy of the model; a parameter server exchanges parameters or deltas]
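A corresponding toy sketch of data parallelism, assuming a plain linear model: every "node" holds a full copy of the weights, computes a gradient on its own data shard, and a parameter-server step combines (averages) the results:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4000, 20))
true_w = rng.normal(size=20)
y = X @ true_w

w = np.zeros(20)                                  # parameter server state
shards = np.array_split(np.arange(len(X)), 4)     # one data shard per "node"
lr = 0.1

def local_gradient(w, idx):
    """Each node computes the full gradient vector on its own data shard."""
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx)

for step in range(200):
    grads = [local_gradient(w, idx) for idx in shards]   # run in parallel in reality
    w -= lr * np.mean(grads, axis=0)                     # server combines the results
print(np.allclose(w, true_w, atol=1e-2))                 # True
```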
  16. 16. Mixed/composed • Node distribution (model or data) • In-node, per-GPU/CPU/thread concurrent/parallel computation • Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data [Diagram: each node runs multiple threads or CPUs/GPUs]
  17. 17. Sync vs. Async At some point parameters need to be collected within the nodes and between them. Sync • Slower • Reproducible • Might overfit • In some cases more accurate and faster* Async • Race conditions possible • Not reproducible • Faster • Might help against overfitting (the update mistakes act as noise) *https://arxiv.org/abs/1604.00981
  18. 18. H2O’s architecture [Diagram: H2O in-memory non-blocking hash map (resides on all nodes) holding the initial model (weights & biases); in-node computation (threads, async); node communication] MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J (fork/join) framework
  19. 19. H2O Frame [Diagram: the frame is split into chunks: node 1 data, node 2-N data, thread 1 data, ...] * Each row is fully stored on the same node * Each chunk contains thousands/millions of rows * All the data is compressed (sometimes 2-4x) in a lossless fashion
  20. 20. Inside a node • Each thread works on a part of the data • All threads update weights/biases concurrently • Race conditions possible (hard to reproduce, but they help against overfitting)
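A toy, Hogwild-style sketch of this pattern (a conceptual illustration only; H2O's actual implementation is Java and uses its fork/join framework): several threads run SGD against one shared weight vector with no locking, so racy read-modify-write updates can interleave:

```python
import threading
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8000, 10))
true_w = rng.normal(size=10)
y = X @ true_w

w = np.zeros(10)          # shared weights, updated by all threads without locks
lr = 0.01

def worker(rows):
    """One thread: plain SGD on its slice of the rows, writing straight into w."""
    for i in rows:
        err = float(X[i] @ w - y[i])
        w[:] = w - lr * err * X[i]        # racy read-modify-write, no lock on purpose

threads = [threading.Thread(target=worker, args=(rows,))
           for rows in np.array_split(np.arange(len(X)), 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(w - true_w, 3))            # close to zero despite the races
```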
  21. 21. H2O’s architecture [Diagram: H2O in-memory non-blocking hash map (resides on all nodes); initial model in, updated model (weights & biases) out; in-node computation (threads, async); node communication] MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework REDUCE: model averaging: average the weights and biases from all the nodes (here: plain averaging; new: elastic averaging) * Communication frequency is auto-tuned and user-controllable (affects convergence)
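The map/reduce round described above can be mimicked in a few lines (a conceptual sketch, not H2O's code): in the map step each node trains its own copy of the model on its local rows, and in the reduce step the weights are averaged into the next global model:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(6000, 15))
true_w = rng.normal(size=15)
y = X @ true_w

def local_train(w, idx, lr=0.01):
    """MAP: one node trains its own copy of the model on its local rows."""
    w = w.copy()
    for i in idx:
        w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

global_w = np.zeros(15)
node_rows = np.array_split(np.arange(len(X)), 3)   # data local to each of 3 "nodes"

for comm_round in range(5):                        # one round per communication interval
    local = [local_train(global_w, idx) for idx in node_rows]   # MAP (parallel in reality)
    global_w = np.mean(local, axis=0)              # REDUCE: model averaging
print(np.round(global_w - true_w, 3))              # close to zero
```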
  22. 22. Benchmarks • Problem: MNIST (hand-written digits, 28x28 pixels = 784 features), 10-class classification • Hardware: 10x dual E5-2650 (8 cores, 2.6GHz), 10Gb • Result: trains 100 epochs (6M samples) in 10 seconds on 10 nodes
  23. 23. Demo
  24. 24. Airline Delays • Data: airline data, 116 million rows (6GB), 800+ predictors (numeric & categorical) • Problem: predict whether a flight is delayed • Hardware: 10x dual E5-2650 (32 cores, 2.6GHz), ~11Gb • Platform: H2O
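For reference, a run like this demo looks roughly as follows from H2O's Python API (a sketch only: the file path, column names, and parameter values are placeholders, not the actual demo script):

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()   # connects to (or starts) an H2O cluster; a multi-node cluster is used the same way

# Placeholder path and columns; the real demo uses the 116M-row airline data set.
airlines = h2o.import_file("airlines.csv")
airlines["IsDepDelayed"] = airlines["IsDepDelayed"].asfactor()   # classification target

model = H2ODeepLearningEstimator(
    hidden=[200, 200],                 # two hidden layers
    epochs=1,                          # passes over the (distributed) data
    train_samples_per_iteration=-2,    # auto-tune the communication frequency
)
model.train(y="IsDepDelayed",
            x=["Origin", "Dest", "UniqueCarrier", "DayOfWeek", "Distance"],
            training_frame=airlines)
print(model.auc(train=True))
```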
  25. 25. GPUs and other networks • What if you want to use GPUs? • What if you want to train arbitrary networks? • What if you want to compare different frameworks?
  26. 26. DeepWater
  27. 27. DeepWater
  28. 28. DeepWater Architecture
  29. 29. Other frameworks mxnet • Data parallelism (default) • Model parallelism also available, e.g. for multi-layer LSTM* • Supports both sync and async communication • Can perform updates on the GPU or CPU *http://mxnet.io/how_to/model_parallel_lstm.html TensorFlow • Data and model parallelism • Both sync and async updates supported *https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
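As a small illustration of how mxnet exposes this choice, its key-value store is created with a mode string; the snippet below uses the single-process 'local' mode, while 'dist_sync' / 'dist_async' select synchronous or asynchronous distributed updates with the same push/pull API (this follows mxnet's kvstore tutorial-style API circa 2016 and may differ in newer releases):

```python
import mxnet as mx

kv = mx.kv.create('local')            # 'dist_sync' / 'dist_async' for distributed training
shape = (2, 3)
kv.init(3, mx.nd.ones(shape))         # register key 3 with an initial value

# Workers push gradients/updates for a key; the store aggregates them.
kv.push(3, [mx.nd.ones(shape) * i for i in range(4)])

out = mx.nd.zeros(shape)
kv.pull(3, out=out)                   # pull the aggregated result back
print(out.asnumpy())
```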
  30. 30. Summary Single node • small NN / small data Multi CPU/GPU • data fits on a single node but is too much for a single processing unit Multi node • data/model doesn't fit on one node • computation takes too long Model parallelism • network parameters don't fit on a single machine • faster computation Data parallelism • data doesn't fit on a single node • faster computation Sync • best accuracy • lots of workers, or OK with slower training Async • faster convergence • OK with potentially lower accuracy
  31. 31. Open Source • Github: https://github.com/h2oai/h2o-3 https://github.com/h2oai/deepwater • Community: https://groups.google.com/forum/?hl=en#!forum/h2ostream http://jira.h2o.ai https://community.h2o.ai/index.html @h2oai http://www.h2o.ai
  32. 32. Thank you! @mdymczyk Mateusz Dymczyk mateusz@h2o.ai
  33. 33. Q&A
