Deep Learning at scale
Mateusz Dymczyk

Software Engineer

H2O.ai
Strata+Hadoop Singapore
08.12.2016
About me
• M.Sc. in CS @ AGH UST, Poland
• CS Ph.D. dropout
• Software Engineer @ H2O.ai
• Previously ML/NLP and distributed
systems @ Fujitsu Laboratories and
en-japan inc.
Agenda
• Deep learning - brief introduction
• Why scale
• Scaling models + implementations
• Demo
• A peek into the future
Deep Learning
Use Cases
Text classification (item prediction)
Fraud detection (11% accuracy boost in production)
Image classification
Machine translation
Recommendation systems
Deep Learning
Deep learning is a branch of machine learning based on a
set of algorithms that attempt to model high level
abstractions in data by using a deep graph with multiple
processing layers, composed of multiple linear and non-
linear transformations.*
*https://en.wikipedia.org/wiki/Deep_learning
Deep Learning
SGD
• Memory efficient
• Fast
• Not easy to parallelize without speed
degradation
[Diagram: Initialize parameters → Get training sample i → (steps 1, 2, 3) → repeat until converged]
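The SGD loop on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the linear model, the squared-error loss, the learning rate and the fixed epoch budget (standing in for "until converged") are all assumptions made for the example.

```python
import random

def sgd_linear(samples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with plain stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # initialize parameters
    for _ in range(epochs):              # fixed budget stands in for "until converged"
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:                  # get training sample i
            x, y = samples[i]
            err = (w * x + b) - y        # prediction error on this one sample
            w -= lr * err * x            # update parameters from this sample only
            b -= lr * err
    return w, b

# Hypothetical noise-free data drawn from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]
w, b = sgd_linear(data)
```

Only one sample is held in memory per update, which is why SGD is memory efficient; each update also depends on the previous one, which is what makes it hard to parallelize.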
Deep Learning
PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in
multiple fields
CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits
Why scale?
PRO
• Relatively simple concept
• Non-linear
• Versatile and flexible
• Features can be extracted
• Great with big data
• Very promising results in
multiple fields
CON
• Hard to interpret
• Not well understood theory
• Lots of architectures
• A lot of hyper-parameters
• Slow training, data hungry
• CPU/GPU hungry
• Overfits
Grid search?
Why not to scale?
• Distribution isn’t free: overhead due to network traffic,
synchronization etc.
• Small neural network (not many layers and/or neurons)
• small+shallow network = not much computation per iteration
• Small data
• Very slow network communication
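A back-of-the-envelope model makes the trade-off above concrete. It assumes, purely for illustration, that compute splits evenly across nodes while a fixed communication cost is paid per iteration; the function names and all numbers are hypothetical.

```python
def iteration_time(compute_s, nodes, comm_overhead_s):
    """Idealized time for one iteration: compute splits evenly, overhead does not."""
    if nodes == 1:
        return compute_s                 # single node: no network cost
    return compute_s / nodes + comm_overhead_s

def speedup(compute_s, nodes, comm_overhead_s):
    """Single-node time divided by distributed time."""
    return compute_s / iteration_time(compute_s, nodes, comm_overhead_s)

# Hypothetical numbers, chosen only to illustrate the trade-off.
big_model = speedup(compute_s=10.0, nodes=10, comm_overhead_s=0.5)
small_model = speedup(compute_s=0.1, nodes=10, comm_overhead_s=0.5)
```

Under these assumptions the large job speeds up by several times, while the small job gets slower than it was on one node because the overhead dominates.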
Distribution
Distribution models
• Model parallelism
• Data parallelism
• Mixed/composed
• Parameter server vs peer-to-peer*
• Communication: Asynchronous vs Synchronous
*http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
Model parallelism
Node 1 Node 2 Node 3
• Each node computes a
different part of network
• Potential to scale to large
models
• Rather difficult to implement
and reason about
• Originally designed for large
convolutional layer in
GoogleNet
Parameter
Server
Parameters or deltas
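One way to picture model parallelism is a layer-wise split, sketched below. The `LayerNode` class is a hypothetical stand-in for a machine that owns part of the network; a real system would send the activations over the network rather than through a function call, and could also split within a layer.

```python
import math

class LayerNode:
    """Hypothetical stand-in for a machine that owns one layer's parameters."""
    def __init__(self, weights, biases):
        self.weights = weights           # one row of input weights per output neuron
        self.biases = biases

    def forward(self, inputs):
        # Only this node touches its weights; just the activations leave the node.
        return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
                for row, b in zip(self.weights, self.biases)]

def forward_pipeline(nodes, inputs):
    activations = inputs
    for node in nodes:                   # in a real cluster each hop is a network transfer
        activations = node.forward(activations)
    return activations

nodes = [
    LayerNode([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]),   # "Node 1": first layer
    LayerNode([[1.0, -1.0]], [0.0]),                    # "Node 2": second layer
]
output = forward_pipeline(nodes, [1.0, 1.0])
```

No node ever holds the full set of parameters, which is what lets the model grow beyond one machine, at the cost of a transfer at every layer boundary.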
Data parallelism
Node 1
Parameter
Server
Node 3
Node 2
Node 4
• Each node computes all the
parameters
• Using part (or all) of local data
• Results have to be combined
Parameters or deltas
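A minimal sketch of data parallelism with a parameter server follows. It assumes a one-parameter model and four simulated nodes inside a single process; the function names and data are hypothetical and no real networking is involved.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5*(w*x - y)**2 over one node's local shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def train_data_parallel(shards, lr=0.01, steps=100):
    w = 0.0                                              # parameter held by the "server"
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]   # each node: full model, own data
        w -= lr * sum(grads) / len(grads)                # server combines the deltas
    return w

# Four hypothetical nodes, each holding a slice of data generated by y = 3x.
points = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [points[i::4] for i in range(4)]
w = train_data_parallel(shards)
```

Every node works on the complete model but sees only its own rows, so the traffic per step is one set of parameters or deltas per node.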
Mixed/composed
• Node distribution (model or data)
• In-node, per gpu/cpu/thread
concurrent/parallel computation
• Example: learn whole model on each
multi CPU/GPU machine, where each
CPU/GPU trains a different layer or
works with different part of data
Node
T T T
Node
CPU/GPU CPU/GPU
CPU/GPU CPU/GPU
Sync vs. Async
At some point parameters need to be collected within the nodes and between them.
Sync
• Slower
• Reproducible
• Might overfit
• In some cases more accurate and faster*
Async
• Race conditions possible
• Not reproducible
• Faster
• Might help with overfitting (you make mistakes)
*https://arxiv.org/abs/1604.00981
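The difference between the two modes can be sketched with a small simulation. It is deterministic so that it can be rerun, which real asynchronous training is not; the `staleness` parameter is an assumption used to model a worker computing its gradient from an outdated copy of the parameters.

```python
def grad(w, sample):
    x, y = sample
    return (w * x - y) * x            # gradient of 0.5 * (w*x - y)**2

def train_sync(shards, lr=0.05, rounds=200):
    w = 0.0
    for r in range(rounds):
        # Barrier: all workers read the same w; their gradients are combined in one update.
        grads = [grad(w, shard[r % len(shard)]) for shard in shards]
        w -= lr * sum(grads) / len(grads)
    return w

def train_async(shards, lr=0.05, rounds=200, staleness=1):
    w = 0.0
    history = [w]                     # earlier values of w, used to model stale reads
    for r in range(rounds):
        for shard in shards:
            stale_w = history[max(0, len(history) - 1 - staleness)]
            w -= lr * grad(stale_w, shard[r % len(shard)])   # applied without waiting
            history.append(w)
    return w

# Two hypothetical workers with data generated by y = 2x.
shards = [[(0.5, 1.0), (1.0, 2.0)], [(1.5, 3.0), (2.0, 4.0)]]
w_sync = train_sync(shards)
w_async = train_async(shards)
```

In the synchronous version nobody proceeds until every gradient has arrived; in the asynchronous version each update lands immediately but may be based on parameters that have already changed.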
H2O’s architecture
H2O in-memory
Non-blocking
hash map
(resides on all nodes)
Initial model
(weights & biases)
Node
Computation
(threads, async)
Node communication
MAP
each node trains a copy
of the whole network with
its local data (part or all) using
async F/J framework
H2O Frame
Node 1 data
Node 2-N data
Thread 1 data
* Each row is fully stored on the same node
* Each chunk contains thousands/millions of rows
* All the data is compressed (sometimes 2-4x) in a
lossless fashion
Inside a node
• Each thread works on a
part of data
• All threads update
weights/biases
concurrently
• Race conditions possible
(hard to reproduce,
may help against overfitting)
H2O’s architecture
Updated model
(weights & biases)
* Communication frequency is auto-tuned
and user-controllable (affects
convergence)
H2O in-memory
Non-blocking
hash map
(resides on all nodes)
Initial model
(weights & biases)
Node
Computation
(threads, async)
Node communication
MAP
each node trains a copy
of the whole network with
its local data (part or all) using
async F/J framework
REDUCE
Model averaging:
Average weights
and biases from
all the nodes
Here: averaging
New: Elastic averaging
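The MAP and REDUCE steps above can be sketched as follows. This is an illustration of plain model averaging under assumed settings (a linear model, three simulated nodes, made-up learning rate and round counts), not H2O's implementation, and it does not cover elastic averaging.

```python
def train_local(w, b, shard, lr=0.05, passes=5):
    """MAP step: one node refines its own copy of the model on its local rows."""
    for _ in range(passes):
        for x, y in shard:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def average_models(models):
    """REDUCE step: element-wise mean of the weights and biases from all nodes."""
    n = len(models)
    return (sum(m[0] for m in models) / n, sum(m[1] for m in models) / n)

def train_with_averaging(shards, rounds=50):
    w, b = 0.0, 0.0                                      # initial model, shared with every node
    for _ in range(rounds):
        local = [train_local(w, b, s) for s in shards]   # MAP
        w, b = average_models(local)                     # REDUCE
    return w, b

# Hypothetical data from y = 2x - 1, split across three simulated nodes.
points = [(x / 4.0, 2.0 * (x / 4.0) - 1.0) for x in range(-8, 9)]
shards = [points[i::3] for i in range(3)]
w, b = train_with_averaging(shards)
```

The `passes` argument plays the role of the communication frequency: more local work per round means less traffic, but the node copies drift further apart before they are averaged.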
Benchmarks
• Problem: MNIST (hand-written digits, 28x28 pixels, 784 features), 10-class
classification
• Hardware: 10 x dual E5-2650 (8 cores, 2.6GHz), 10Gb
• Result: trains 100 epochs (6M samples) in 10 seconds on 10 nodes
Demo
Airline Delays
• Data:
• airline data
• 116 million rows (6GB)
• 800+ predictors (numeric & categorical)
• Problem: predict if a flight is delayed
• Hardware: 10 x dual E5-2650 (32 cores, 2.6GHz), ~11Gb
• Platform: H2O
GPUs and other networks
• What if you want to use GPUs?
• What if you want to train arbitrary networks?
• What if you want to compare different frameworks?
DeepWater
DeepWater Architecture
Other frameworks
mxnet
• Data parallelism (default)
• Model parallelism also available
• for example for multi-layer LSTM
• Supports both sync and async
communication
• Can perform updates in the GPU or CPU
*http://mxnet.io/how_to/model_parallel_lstm.html
TensorFlow
• Data and model parallelism
• Both sync and async updates
supported
*https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
Summary
Multi Node
• data/model doesn’t fit on one node
• computation too long
Single Node
• small NN/small data
Multi cpu/gpu
• data fits on single node but too
much for single processing unit
Model parallelism
• network parameters don’t fit on a
single machine
• faster computation
Data parallelism
• data doesn’t fit on a single node
• faster computation
Async
• faster convergence
• OK with potentially lower accuracy
Sync
• best accuracy
• lots of workers or ok with slower
training
Open Source
• Github:
https://github.com/h2oai/h2o-3
https://github.com/h2oai/deepwater
• Community:
https://groups.google.com/forum/?hl=en#!forum/h2ostream
http://jira.h2o.ai
https://community.h2o.ai/index.html
@h2oai
http://www.h2o.ai
Thank you!
@mdymczyk
Mateusz Dymczyk
mateusz@h2o.ai
Q&A

 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 

Deep Learning at Scale

  • 1. Deep Learning at scale Mateusz Dymczyk
 Software Engineer
 H2O.ai Strata+Hadoop Singapore 08.12.2016
  • 2. About me • M.Sc. in CS @ AGH UST, Poland • CS Ph.D. dropout • Software Engineer @ H2O.ai • Previously ML/NLP and distributed systems @ Fujitsu Laboratories and en-japan inc.
  • 3. • Deep learning - brief introduction • Why scale • Scaling models + implementations • Demo • A peek into the future Agenda
  • 5. Text classification (item prediction) Use Cases Fraud detection (11% accuracy boost in production) Image classification Machine translation Recommendation systems
  • 6. Deep Learning Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non- linear transformations.* *https://en.wikipedia.org/wiki/Deep_learning
  • 8. SGD • Memory efficient • Fast • Not easy to parallelize without speed degradation Initialize Parameters Get training sample i 1 2 3 Until converged
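The SGD loop on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than H2O's implementation: the toy linear model, the `sgd_linear` name, the learning rate and the fixed epoch budget are all assumptions, and step 3 is assumed to be the usual gradient update because the transcript does not capture it.

```python
import random

def sgd_linear(samples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with plain SGD, one training sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                      # 1. initialize parameters
    data = list(samples)
    for _ in range(epochs):              # fixed budget stands in for "until converged"
        rng.shuffle(data)
        for x, y in data:                # 2. get training sample i
            err = (w * x + b) - y        # prediction error on that single sample
            w -= lr * err * x            # 3. (assumed) step along the negative gradient
            b -= lr * err
    return w, b

# Toy data generated from y = 2x + 1, without noise.
samples = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear(samples)
```

Only one sample and the current parameters are needed per update, which is why the method is memory efficient; the strictly sequential dependence between updates is what makes it hard to parallelize.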
  • 9. Deep Learning PRO • Relatively simple concept • Non-linear • Versatile and flexible • Features can be extracted • Great with big data • Very promising results in multiple fields CON • Hard to interpret • Not well understood theory • Lots of architectures • A lot of hyper-parameters • Slow training, data hungry • CPU/GPU hungry • Overfits
  • 10. Why scale? PRO • Relatively simple concept • Non-linear • Versatile and flexible • Features can be extracted • Great with big data • Very promising results in multiple fields CON • Hard to interpret • Not well understood theory • Lots of architectures • A lot of hyper-parameters • Slow training, data hungry • CPU/GPU hungry • Overfits Grid search?
  • 11. Why not to scale? • Distribution isn’t free: overhead due to network traffic, synchronization etc. • Small neural network (not many layers and/or neurons) • small+shallow network = not much computation per iteration • Small data • Very slow network communication
  • 13. Distribution models • Model parallelism • Data parallelism • Mixed/composed • Parameter server vs *peer-to-peer • Communication: Asynchronous vs Synchronous *http://nikkostrom.com/publications/interspeech2015/strom_interspeech2015.pdf
  • 14. Model parallelism • Each node computes a different part of the network • Potential to scale to large models • Rather difficult to implement and reason about • Originally designed for large convolutional layers in GoogLeNet [Diagram: Node 1, Node 2 and Node 3 exchange parameters or deltas with a parameter server]
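One simple form of model parallelism is to give each node a different layer. The sketch below assumes that layer-wise split; the `LayerNode` class, the weights and the in-process "nodes" are hypothetical stand-ins, and a real system would send the activations over the network.

```python
import math

class LayerNode:
    """Hypothetical 'node' that owns the weights of exactly one dense layer."""

    def __init__(self, weights, biases):
        self.weights = weights   # one row of input weights per output neuron
        self.biases = biases

    def forward(self, inputs):
        # Only this node's part of the network is evaluated here.
        return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
                for row, b in zip(self.weights, self.biases)]

node1 = LayerNode([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0])  # holds layer 1 (2 -> 2)
node2 = LayerNode([[1.0, -1.0]], [0.0])                   # holds layer 2 (2 -> 1)

def predict(x):
    hidden = node1.forward(x)     # activations would be shipped to the next node
    return node2.forward(hidden)  # the next node continues the forward pass

output = predict([1.0, 1.0])
```

No single node ever holds all the parameters, which is what lets the model grow beyond one machine, at the price of a communication step between every pair of neighbouring parts.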
  • 15. Data parallelism • Each node computes all the parameters • Using part (or all) of local data • Results have to be combined [Diagram: Nodes 1-4 exchange parameters or deltas with a parameter server]
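A minimal sketch of synchronous data parallelism with a parameter server, assuming a one-parameter model and four in-process "nodes". The function names, the learning rate and the choice to average gradients are illustrative assumptions, not taken from the slides.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w*x on one node's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def train_data_parallel(shards, lr=0.01, steps=100):
    w = 0.0                                             # parameter held by the server
    for _ in range(steps):
        # Every node sees the full model but only its local part of the data.
        grads = [local_gradient(w, s) for s in shards]
        # The server combines the deltas and the new value is sent back out.
        w -= lr * sum(grads) / len(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]       # toy data from y = 3x
shards = [data[i::4] for i in range(4)]                 # 4 hypothetical nodes
w = train_data_parallel(shards)
```

The combine step is the synchronization point: every node has to report before the server can update, which is the cost the sync/async trade-off is about.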
  • 16. Mixed/composed • Node distribution (model or data) • In-node, per GPU/CPU/thread concurrent/parallel computation • Example: learn the whole model on each multi-CPU/GPU machine, where each CPU/GPU trains a different layer or works with a different part of the data [Diagram: one node running threads (T), one node with several CPUs/GPUs]
  • 17. Sync vs. Async At some point parameters need to be collected within the nodes and between them. Sync • Slower • Reproducible • Might overfit • In some cases more accurate and faster* Async • Race conditions possible • Not reproducible • Faster • Might help with overfitting (you make mistakes) *https://arxiv.org/abs/1604.00981
  • 18. H2O’s architecture • H2O in-memory non-blocking hash map (resides on all nodes) • Initial model (weights & biases) • Node computation (threads, async) • Node communication • MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework
  • 19. H2O Frame Node 1 data Node 2-N data Thread 1 data * Each row is fully stored on the same node * Each chunk contains thousands/millions of rows * All the data is compressed (sometimes 2-4x) in a lossless fashion
  • 20. Inside a node • Each thread works on a part of the data • All threads update weights/biases concurrently • Race conditions possible (hard to reproduce, may help against overfitting)
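The lock-free, in-node scheme described on this slide can be illustrated with plain Python threads. This is a sketch under stated assumptions: the toy linear model, the `worker` function and the three-way split are hypothetical, and CPython's global interpreter lock means the threads interleave rather than run truly in parallel.

```python
import random
import threading

weights = [0.0, 0.0]                      # shared [w, b], updated without any lock

def worker(chunk, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    rows = list(chunk)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            # May read values that another thread is changing at the same time.
            err = weights[0] * x + weights[1] - y
            weights[0] -= lr * err * x    # unsynchronized read-modify-write
            weights[1] -= lr * err

# Toy data generated from y = 2x + 1, split so each thread works on a part of it.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
chunks = [data[i::3] for i in range(3)]
threads = [threading.Thread(target=worker, args=(c, 0.05, 200, i))
           for i, c in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the interleaving differs between runs, the final digits of `weights` are not reproducible, although on this noise-free toy problem they end up close to `[2.0, 1.0]`.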
  • 21. H2O’s architecture • H2O in-memory non-blocking hash map (resides on all nodes) • Initial model (weights & biases) • Node computation (threads, async) • Node communication • MAP: each node trains a copy of the whole network with its local data (part or all) using the async F/J framework • REDUCE: model averaging, i.e. average the weights and biases from all the nodes (here: averaging; new: elastic averaging) • Updated model (weights & biases) * Communication frequency is auto-tuned and user-controllable (affects convergence)
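The MAP/REDUCE cycle can be sketched as local training followed by plain weight averaging. This is an illustration, not H2O's code: `train_locally`, `average_models`, the toy linear model and the number of local passes per round (a stand-in for the communication frequency) are assumptions.

```python
def train_locally(w, b, shard, lr=0.05, passes=5):
    """MAP step: a node refines its own copy of the model on local data only."""
    for _ in range(passes):
        for x, y in shard:
            err = w * x + b - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def average_models(models):
    """REDUCE step: element-wise mean of every node's weights and biases."""
    n = len(models)
    return (sum(m[0] for m in models) / n, sum(m[1] for m in models) / n)

# Toy data generated from y = 2x + 1, spread over 3 hypothetical nodes.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
shards = [data[i::3] for i in range(3)]

w, b = 0.0, 0.0                                  # initial model shared with all nodes
for _ in range(50):                              # communication rounds
    local = [train_locally(w, b, s) for s in shards]
    w, b = average_models(local)                 # updated model for the next round
```

Raising `passes` means fewer, larger communication rounds; lowering it keeps the copies closer together at the cost of more network traffic, which is the convergence trade-off the slide's footnote refers to.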
  • 22. Benchmarks • Problem: MNIST (hand-written digits, 28x28 pixels (784 features)), 10-class classification • Hardware: 10 × dual E5-2650 (8 cores, 2.6GHz), 10Gb • Result: trains 100 epochs (6M samples) in 10 seconds on 10 nodes
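The "6M samples" figure follows from the size of the standard MNIST training set (60,000 images). A short calculation, assuming the full training set is used for every epoch, also gives the implied throughput:

```python
MNIST_TRAIN_ROWS = 60_000      # size of the standard MNIST training set
epochs = 100
seconds = 10
nodes = 10

samples = MNIST_TRAIN_ROWS * epochs      # total samples processed
cluster_rate = samples / seconds         # samples per second across the cluster
per_node_rate = cluster_rate / nodes     # samples per second per node
```

That works out to 6,000,000 samples in total, or 600,000 samples per second for the cluster and 60,000 per node.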
  • 23. Demo
  • 24. Airline Delays • Data: airline data, 116 million rows (6GB), 800+ predictors (numeric & categorical) • Problem: predict if a flight is delayed • Hardware: 10 × dual E5-2650 (32 cores, 2.6GHz), ~11Gb • Platform: H2O
  • 25. GPUs and other networks • What if you want to use GPUs? • What if you want to train arbitrary networks? • What if you want to compare different frameworks?
  • 29. Other frameworks mxnet: • Data parallelism (default) • Model parallelism also available, for example for multi-layer LSTM • Supports both sync and async communication • Can perform updates on the GPU or CPU *http://mxnet.io/how_to/model_parallel_lstm.html TensorFlow: • Data and model parallelism • Both sync and async updates supported *https://ischlag.github.io/2016/06/12/async-distributed-tensorflow/
  • 30. Summary • Multi node: data/model doesn’t fit on one node; computation too long • Single node: small NN/small data • Multi CPU/GPU: data fits on a single node but is too much for a single processing unit • Model parallelism: network parameters don’t fit on a single machine; faster computation • Data parallelism: data doesn’t fit on a single node; faster computation • Async: faster convergence; ok with potentially lower accuracy • Sync: best accuracy; lots of workers or ok with slower training
  • 31. Open Source • Github: https://github.com/h2oai/h2o-3 https://github.com/h2oai/deepwater • Community: https://groups.google.com/forum/?hl=en#!forum/h2ostream http://jira.h2o.ai https://community.h2o.ai/index.html @h2oai http://www.h2o.ai
  • 33. Q&A