Distributed Deep Learning with Hadoop and TensorFlow


Image Classification - 2016
The ability to understand the content of an image by using machine learning
Human performance: 95%
AI performance: 97%
https://arxiv.org/pdf/1602.07261.pdf

AI beats humans in games - 2016
Go: AlphaGo beats L. Sedol 4:1 in 2016
Chess: Komodo beats H. Nakamura 2:1 in 2016

Breast Cancer Diagnoses - 2017
Doctors often use additional tests to find or diagnose breast cancer
Pathologist performance: 73%
AI performance: 92%
The pathologist ended up spending 30 hours on this task on 130 slides
[Image: a closeup of a lymph node biopsy]
https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html

The power of 12 GB HBM2 memory and 640 Tensor
Cores, delivering 110 TeraFLOPS of performance.

AI history → Perceptron
1943 W. McCulloch, W. Pitts, "Neuron" as logical element
1958 F. Rosenblatt, "Perceptron" model, neural networks
1969 M. Minsky, S. Papert, trigger first AI winter
[Diagram labels: feed forward, OR function, XOR function]
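
The OR function and XOR function named on this slide can be tried out with a few lines of code. The following is a minimal sketch, assuming the classic error-driven perceptron update rule with a step activation; the function names, learning rate, and epoch count are illustrative choices and not taken from the slides. A single-layer perceptron separates OR, but no single line separates XOR, so the same training loop cannot classify all four XOR points correctly.

```python
def predict(w, b, x1, x2):
    """Step activation on a weighted sum: the single-layer perceptron."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=50, lr=1.0):
    """Error-driven perceptron rule; stops early once an epoch has no mistakes."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), target in samples:
            delta = target - predict(w, b, x1, x2)
            if delta != 0:
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def accuracy(samples, w, b):
    """Fraction of samples the trained perceptron classifies correctly."""
    hits = sum(predict(w, b, x1, x2) == t for (x1, x2), t in samples)
    return hits / len(samples)


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

or_acc = accuracy(OR_DATA, *train_perceptron(OR_DATA))
xor_acc = accuracy(XOR_DATA, *train_perceptron(XOR_DATA))
print("OR accuracy:", or_acc)
print("XOR accuracy:", xor_acc)
```

OR is linearly separable, so training reaches full accuracy; XOR is not, so at least one of the four points stays misclassified however long the loop runs.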

AI history → AI winter
1943 W. McCulloch, W. Pitts, neuron as logical element
1958 F. Rosenblatt, Perceptron model, neural networks
1969 M. Minsky, S. Papert, trigger first AI winter
1980 Boom of expert systems, Q&A using logical rules, Prolog
1987-1993 The second AI winter, desktop computers, LISP machines expensive
1993-2001 Moore's law, Deep Blue chess-playing, Stanford DARPA challenge

Machine Learning Problem Types

Structured data
80% of world’s data is unstructured

Fishing in the sea versus fishing in the lake
Business Intelligence helps find answers to questions you know.
Data Science helps you find the question itself.
Data Warehouse:
+ Structured data & schema-on-write
+ SQL-ish queries on database tables
+ Extract, Transform, Load
+ Expensive for large data
Data Lake:
+ Any kind of data & schema-on-read
+ Parallel processing on big data
+ Extract, Load, Transform-on-the-fly
+ Low cost on commodity hardware
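
The difference between schema-on-write and schema-on-read can be shown with a small sketch. It assumes a hypothetical two-field schema and JSON lines as the raw storage format; `SCHEMA`, `write_with_schema`, and `read_with_schema` are illustrative names, not part of any product mentioned in the slides.

```python
import json

# Hypothetical schema, used only for this illustration.
SCHEMA = {"vehicle_id": str, "speed_kmh": float}


def write_with_schema(table, record):
    """Schema-on-write: validate before storing, reject anything that does not fit."""
    if set(record) != set(SCHEMA):
        raise ValueError("record fields do not match the schema")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError("wrong type for field " + field)
    table.append(record)


def read_with_schema(raw_lines, fields):
    """Schema-on-read: raw lines are stored as-is, structure is applied per query."""
    for line in raw_lines:
        doc = json.loads(line)
        yield {field: doc.get(field) for field in fields}


# Warehouse style: the second record is rejected at load time.
warehouse = []
write_with_schema(warehouse, {"vehicle_id": "mkz-1", "speed_kmh": 42.0})
try:
    write_with_schema(warehouse, {"vehicle_id": "mkz-1", "note": "free text"})
    rejected = False
except ValueError:
    rejected = True

# Lake style: both records are kept, each query picks the fields it needs.
lake = [
    '{"vehicle_id": "mkz-1", "speed_kmh": 42.0}',
    '{"vehicle_id": "mkz-1", "note": "free text"}',
]
projected = list(read_with_schema(lake, ["vehicle_id", "note"]))
print(len(warehouse), rejected, projected)
```

The warehouse path pays the validation cost once at load time, while the lake path defers it to every read and tolerates records that a fixed schema would have rejected.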

More Data + Bigger Models
[Chart, 1990s: accuracy vs. scale (data size, model size); curves: neural networks, other approaches]
https://www.scribd.com/document/355752799/Jeff-Dean-s-Lecture-for-YC-AI

More Data + Bigger Models + More Computation
[Chart, now: accuracy vs. scale (data size, model size); curves: neural networks, other approaches; annotation: more compute]
https://www.scribd.com/document/355752799/Jeff-Dean-s-Lecture-for-YC-AI

More Data + Bigger Models + More Computation
= Better Results in Machine Learning

Millions of "trip" events each day globally: routing and price optimization
400+ billion viewing-related events per day: movie recommendation
Five billion data points for Price Tip feature: price optimization

ML specialist, single machine, small data

X X

ML specialist, single machine, big data
X X

Train and evaluate machine learning models at scale
Single machine Data center
How to run more experiments faster and in parallel?
How to share and reproduce research?
How to go from research to real products?

Distributed Machine Learning
[Diagram: data size vs. model size, from single machine to data center]
Model parallelism: training very large models
Data parallelism: speeds up the training
Also at data center scale: exploring several model architectures, hyper-parameter optimization, training several independent models
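
Data parallelism can be sketched without any framework. The example below is a toy illustration under stated assumptions: a one-parameter linear model `y = w * x`, mean squared error, and two equal-size shards that stand in for two workers. All names are hypothetical. Each worker computes a gradient on its own shard, and the averaged gradient drives a single update of the shared parameter.

```python
def gradient(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr):
    """One step: every worker handles its own shard, then the gradients are averaged."""
    grads = [gradient(w, shard) for shard in shards]  # would run on separate workers
    return w - lr * sum(grads) / len(grads)


# Toy data generated with a true weight of 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0::2], data[1::2]]  # two equal-size shards

# With equal-size shards the averaged gradient matches the full-batch gradient.
averaged = sum(gradient(0.0, s) for s in shards) / len(shards)
full_batch = gradient(0.0, data)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)
print(averaged, full_batch, w)
```

Because the shards have equal size, the averaged gradient equals the gradient over the whole data set, so the result matches single-machine training while each worker only touches half of the data.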

Compute Workload for Training and Evaluation
[Diagram: I/O intensive vs. compute intensive, from single machine to data center]

I/O Workload for Simulation and Testing
[Diagram: I/O intensive vs. compute intensive, from single machine to data center]

Distributed Machine Learning
X

12/19/17
Distributed TensorFlow on Hadoop, Mesos, Kubernetes, Spark
+ TensorFlow Standalone
+ TensorFlow On YARN
+ TensorFlow On multi-colored YARN
+ TensorFlow On Spark
+ TensorFrames
+ TensorFlow On Kubernetes
+ TensorFlow On Mesos
https://www.slideshare.net/jwiegelmann/distributed-tensorflow-on-hadoop-mesos-kubernetes-spark

Data Parallel vs. Model Parallel
http://books.nips.cc/papers/files/nips25/NIPS2012_0598.pdf
Between-Graph Replication In-Graph Replication

Data Shards vs. Data Combined
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf

Synchronous vs. Asynchronous
https://arxiv.org/pdf/1603.04467.pdf
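
The contrast between synchronous and asynchronous updates can be sketched in plain Python. This is a simplified simulation, not the mechanism described in the linked paper: it assumes a one-parameter model `y = w * x`, two workers, and a round-robin order that stands in for real timing. In the synchronous variant all workers read the same parameter and wait at a barrier; in the asynchronous variant each worker pushes a gradient computed from its last fetched, possibly stale, copy.

```python
def grad(w, shard):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def train_synchronous(w, shards, lr, steps):
    """All workers compute on the same w; one averaged update per step."""
    for _ in range(steps):
        grads = [grad(w, shard) for shard in shards]
        w -= lr * sum(grads) / len(grads)
    return w


def train_asynchronous(w, shards, lr, steps):
    """Workers update one at a time from their own stale copy, without a barrier."""
    stale = [w] * len(shards)  # the copy of w each worker fetched last
    for step in range(steps):
        k = step % len(shards)  # round-robin stands in for real arrival order
        w -= lr * grad(stale[k], shards[k])  # gradient from a stale parameter
        stale[k] = w  # worker k fetches the fresh parameter afterwards
    return w


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3
shards = [data[0::2], data[1::2]]

w_sync = train_synchronous(0.0, shards, lr=0.005, steps=400)
w_async = train_asynchronous(0.0, shards, lr=0.005, steps=400)
print(w_sync, w_async)
```

On this toy problem both variants reach the same weight. The asynchronous one avoids waiting for the slowest worker but applies gradients that are one update out of date, which is why it generally needs a more careful learning rate.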

TensorFlow Standalone
https://www.tensorflow.org/

TensorFlow Standalone
Dedicated cluster
Short & long running jobs
Flexibility
Manual scheduling of workers
No shared resources
Hard to share data with other applications
No data locality
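
"Manual scheduling of workers" means that somebody has to tell every process the full cluster layout and its own role. A minimal sketch of that bookkeeping follows, assuming the `TF_CONFIG` environment variable convention with a `cluster` and a `task` entry as read by TensorFlow's higher-level training APIs; the host names are hypothetical and the exact layout should be checked against the TensorFlow version in use.

```python
import json


def make_tf_config(ps_hosts, worker_hosts, task_type, task_index):
    """Build the JSON string one process would receive in TF_CONFIG."""
    return json.dumps({
        "cluster": {"ps": ps_hosts, "worker": worker_hosts},
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical hosts; in a standalone cluster these are assigned by hand.
PS_HOSTS = ["ps0.example.com:2222"]
WORKER_HOSTS = ["worker0.example.com:2222", "worker1.example.com:2222"]

# One configuration per process: same cluster layout, different task entry.
configs = [make_tf_config(PS_HOSTS, WORKER_HOSTS, "ps", i)
           for i in range(len(PS_HOSTS))]
configs += [make_tf_config(PS_HOSTS, WORKER_HOSTS, "worker", i)
            for i in range(len(WORKER_HOSTS))]

for config in configs:
    print("TF_CONFIG=" + config)
```

Each string would be exported on the matching host before the training script starts. The resource managers compared on the following slides aim to take over exactly this placement step.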

TensorFlow On YARN (Intel) v3
https://github.com/Intel-bigdata/TensorFlowOnYARN
released March 12, 2017 / YARN-6043

TensorFlow On YARN (Intel)
Shared cluster and data
Optimised long running jobs
Scheduling
Data locality (not yet implemented)
Not easy to have rapid adoption from upstream
Fault tolerance not yet implemented
GPU still not seen as a "native" resource on YARN
No use of YARN elasticity

TensorFlow On multi-colored YARN (Hortonworks) v3
Not yet implemented!
https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/

TensorFlow On multi-colored YARN (Hortonworks)
Shared cluster
GPUs shared by multiple tenants and applications
Centralised scheduling
YARN-3611 Docker support
YARN-4793 Native processes
Needs YARN wrapper of NVIDIA Docker (GPU driver)
Not implemented yet!

TensorFlow On Spark (Yahoo) v2
https://github.com/yahoo/TensorFlowOnSpark
released January 22, 2017

TensorFlow On Spark (Yahoo)
Shared cluster and data
Data locality through HDFS or other Spark sources
Ad-hoc training and evaluation
Slice and dice data with Spark distributed transformations
Scheduling not optimal
Necessary to "convert" existing TensorFlow application, although this is a simple process
Might need to restart Spark cluster
No GPU resource management

TensorFrames (Databricks) v2
Scala binding to TF via JNI https://github.com/databricks/tensorframes
released Feb 28, 2016

TensorFrames (Databricks)
Possible shared cluster
TensorFrames infers the shapes for small tensors (no analysis required)
Data locality via RDD
Experimental
Still no centralised scheduling, TF and Spark need to be deployed and scheduled separately
TF and Spark might not be co-located
Might need data transfer between some nodes

TensorFlow On Kubernetes
https://github.com/tensorflow/ecosystem

TensorFlow On Kubernetes
Shared cluster
Centralised scheduling by Kubernetes
Solved network orchestration, federation etc.
Experimental support for managing NVIDIA GPUs (at this time better than YARN, however)
Fault tolerance
Data locality

TensorFlow On Mesos
Marathon
https://github.com/douban/tfmesos

TensorFlow On Mesos
Shared cluster
GPU-based scheduling
Short and long running jobs
Memory footprint
Number of services relative to Kubernetes
Fault tolerance
Data locality

Hidden Technical Debt in Machine Learning Systems
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
Google, 2015

TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
http://stevenwhang.com/tfx_paper.pdf
Google, 2017

https://eng.uber.com/michelangelo/
Michelangelo: Uber’s Machine Learning Platform

http://searchbusinessanalytics.techtarget.com/feature/Machine-learning-platforms-comparison-Amazon-Azure-Google-IBM

Pricing for 890,000 real-time predictions without training (Q3 2017)
AWS: Compute Fees + Prediction Fees = $8.40 + $96.44 = $104.84 per month
Google: Prediction $0.10 per thousand predictions, plus $0.40 per hour = $377 per month
Azure: Packages $0, $100.13, $1,000.06, $9,999.98 = $1,000 per month
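
The AWS and Google monthly totals can be reproduced with a short calculation. This is a sketch under one assumption that the slide does not state: the Google hourly fee is billed for 720 hours, that is 30 days of 24 hours. With that assumption the numbers add up to the totals shown. The Azure figure is a package price and is not recomputed here.

```python
from decimal import Decimal

PREDICTIONS = 890_000
HOURS_PER_MONTH = 720  # assumption: 30 days x 24 hours, not stated on the slide

# AWS: compute fees plus prediction fees, both taken from the slide.
aws_total = Decimal("8.40") + Decimal("96.44")

# Google: $0.10 per thousand predictions plus $0.40 per hour.
google_prediction_fee = Decimal("0.10") * PREDICTIONS / 1000
google_hourly_fee = Decimal("0.40") * HOURS_PER_MONTH
google_total = google_prediction_fee + google_hourly_fee

print("AWS per month:   ", aws_total)
print("Google per month:", google_total)
```

`Decimal` is used instead of `float` so that the cent amounts add up exactly.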

High-level Development Process for Autonomous Vehicles
1 Collect sensor data
2 Model Engineering
3 Autonomous Driving
[Diagram labels: Data Logger, Control Unit, Big Data, Trained Model, Data Center]

Sensors Udacity Lincoln MKZ
Camera: 3x Blackfly GigE Camera, 20 Hz
Lidar: Velodyne HDL-32E, 9.5 Hz
IMU: Xsens, 400 Hz
GPS: 2x fixed, 1 Hz
CAN bus: 1.1 kHz
Software: Robot Operating System
Data: 3 GB per minute
https://github.com/udacity/self-driving-car
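
The sampling rates and the 3 GB per minute figure translate into storage and message volumes as sketched below. The eight-hour recording day, the decimal conversion of 1000 GB per TB, and the reading of each rate as "per device" are assumptions made for this illustration, and the helper names are hypothetical.

```python
GB_PER_MINUTE = 3  # from the slide

# Sampling rates in Hz as listed on the slide (per device, by assumption).
RATES_HZ = {"camera": 20, "lidar": 9.5, "imu": 400, "gps": 1, "can_bus": 1100}


def recorded_gb(minutes):
    """Data volume in GB for a recording of the given length."""
    return GB_PER_MINUTE * minutes


def samples(sensor, seconds):
    """Number of samples one device produces in the given time."""
    return RATES_HZ[sensor] * seconds


gb_per_hour = recorded_gb(60)
tb_per_8h_day = recorded_gb(8 * 60) / 1000  # assumption: 1 TB = 1000 GB
imu_samples_per_minute = samples("imu", 60)
lidar_scans_per_minute = samples("lidar", 60)

print(gb_per_hour, "GB per hour")
print(tb_per_8h_day, "TB per 8-hour day")
print(imu_samples_per_minute, "IMU samples per minute")
print(lidar_scans_per_minute, "lidar scans per minute")
```

A single car recording for a working day already produces more than a terabyte, which is the motivation for the scale-out storage and processing discussed on the later slides.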

Sensors Spec
Sensor     | Blinding, sunlight, darkness | Rain, fog, snow | Non-metal objects | Wind / high velocity | Resolution | Range | Data
Ultrasonic | yes | yes | yes | no  | +   | +   | +
Lidar      | yes | no  | yes | yes | +++ | ++  | +
Radar      | yes | yes | no  | yes | ++  | +++ | +
Camera     | no  | no  | yes | yes | +++ | +++ | +++

Machine Learning 101
Observations → State Estimation → Modeling & Prediction → Planning → Controls
f(x): Observations → Controls

Machine Learning for Autonomous Driving
+ Sensor Fusion: clustering, segmentation, pattern recognition
+ Road: ego-motion, image processing and pattern recognition
+ Localization: simultaneous localization and mapping
+ Situation Understanding: detection and classification
+ Trajectory Planning: motion planning and control
+ Control Strategy: reinforcement and supervised learning
+ Driver Model: image processing and pattern recognition

Machine Learning Cycle
Data collection for training/test
1 Feature engineering (I/O workload)
Model development and architecture
2 Training and evaluation (compute workload)
3 Re-Simulation and Testing (I/O workload)
Model tuning
Model deployment, versioning
Scaling and monitoring

Flux – Open Machine Learning Stack
Training & Test data
Compute + Network + Storage
Deploy model
ML Development & Catalog & REST API
ML-Specialists
Feature Engineering, Training, Evaluation, Re-Simulation, Testing
CaffeOnSpark
Sample Model Prediction Batch Regression Cluster
Dataset Correlation Centroid Anomaly Test Scores
✓ Mainly open source
✓ No vendor lock-in
✓ Scale-out architecture
✓ Multi-user support
✓ Resource management
✓ Job scheduling
✓ Speed-up training
✓ Speed-up simulation

Feature Engineering
+ Hadoop InputFormat and Record Reader for Rosbag
+ Process Rosbag with Spark, YARN, MapReduce, Hadoop Streaming API, …
+ Spark RDDs are cached and optimized for analysis
[Diagram: Rosbag (Storage) → Record Reader → RDD (Processing Engine) → DataFrame, DataSet, SQL, Spark APIs, NumPy, ROS msg (Advanced Analytics); layers: Computer, Network, Storage]

Training & Evaluation
+ TensorFlow ROSRecordDataset
+ Protocol Buffers to serialize records
+ Save time because data conversion is not needed
+ Save storage because data duplication is not needed
[Diagram: Rosbag (Storage) → ROS Dataset, ROS msg → Training Engine (Machine Learning); layers: Computer, Network, Storage]

Re-Simulation & Testing
+ Use Spark for preprocessing, transformation, cleansing, aggregation, and time window selection before publishing to ROS topics
+ Use the Re-Simulation framework of choice to subscribe to the ROS topics
[Diagram: Rosbag (Storage) → Engine → publish → ROS topic (core) → subscribe → Re-Simulation with framework of choice; layers: Computer, Network, Storage]

Time Travel
[Diagram: time axis t with fold(left), fold(right), reduce/shuffle]

Think Big: Business Strategy, Data Strategy, Technology Strategy
Agile Delivery Model
Start Small: Business Case Validation, Prototypes, MVPs, Data Exploration, Data Acquisition
Value Proposition

Data Science
+ Big Data Strategy, Consulting, Data Lab, Data Science as a Service
+ Data Collection, Cleaning, Analyzing, Modeling, Validation, Visualization
+ Business Case Validation, Prototyping, MVPs, Dashboards
Machine Learning
+ Classification, Regression, Clustering, Collaborative Filtering, Anomaly Detection
+ Supervised/Unsupervised, Reinforcement Learning, Deep Learning, CNN
+ Model Training, Evaluation, Testing, Simulation, Inference

Data Engineering
+ Data Pipelines (Acquisition, Ingestion, Analytics, Visualization)
+ Distributed Data Architectures
+ Data Processing Backend
+ Hadoop Ecosystem
+ Test Automation and Testing
Data Operations
+ Architecture, DevOps, Cloud Building
+ App. Management Hadoop Ecosystem
+ Managed Infrastructure Services
+ Compute, Network, Storage, Firewall, Load Balancer, DDoS Protection
+ Continuous Integration and Deployment

“Culture eats strategy for breakfast,
technology for lunch, and products for dinner,
and soon thereafter everything else too.”
Peter Drucker
