Deep Learning Pipelines
@joerg_schad @dcos
© 2018 Mesosphere, Inc. All Rights Reserved. 2
Jörg Schad
Tech Lead Community Projects
@joerg_schad
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Promise
3
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Process
4
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
Step 2: Inference (endpoint or data center - instantaneous):
New input from camera or sensor → trained model → Output: classification (97% Dog, 3% Panda)
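To make the two steps concrete, here is a minimal, hedged sketch using tf.keras with synthetic stand-in data (the dataset, layer sizes, and class names are illustrative, not from the talk):

# Step 1: Training -- labeled data in, trained model out.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 64).astype("float32")   # stand-in for labeled images
y_train = np.random.randint(0, 2, size=(1000,))        # toy labels: 0 = panda, 1 = dog

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)

# Step 2: Inference -- new sensor input in, class probabilities out.
new_input = np.random.rand(1, 64).astype("float32")
probs = model.predict(new_input)[0]
print("dog: %.0f%%, panda: %.0f%%" % (probs[1] * 100, probs[0] * 100))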
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: Some Insight
5
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
6
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
7
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2017 Mesosphere, Inc. All Rights Reserved.
Training Challenges
8
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters (see the sketch below)
■ # Layers
■ # Units per layer
■ Learning rate
■ …
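A hedged sketch of why hyperparameters multiply the compute bill: a naive grid over # layers, # units per layer, and learning rate retrains the model once per combination (tf.keras assumed; the toy data and value grids are illustrative):

import itertools
import numpy as np
import tensorflow as tf

x = np.random.rand(500, 64).astype("float32")
y = np.random.randint(0, 2, size=(500,))

def build_model(n_layers, units, learning_rate):
    layers = [tf.keras.layers.Dense(units, activation="relu", input_shape=(64,))]
    layers += [tf.keras.layers.Dense(units, activation="relu") for _ in range(n_layers - 1)]
    layers += [tf.keras.layers.Dense(2, activation="softmax")]
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy")
    return model

# Every combination is a separate training run on the train/dev split.
for n_layers, units, lr in itertools.product([2, 3], [32, 64], [1e-2, 1e-3]):
    model = build_model(n_layers, units, lr)
    history = model.fit(x, y, epochs=3, validation_split=0.2, verbose=0)
    print(n_layers, units, lr, history.history["val_loss"][-1])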
© 2018 Mesosphere, Inc. All Rights Reserved.
Input Data Management
9
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 10
Input Data Management
Challenges
● Training/Dev/Test + new data
● Large amounts of data
● Quality
● Availability (for the cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka (see the sketch below)
● Apache Cassandra
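As one hedged illustration of the streaming side, a consumer could pull labeled records off Kafka and hand them to the trainer in mini-batches (this assumes the kafka-python client, a hypothetical "training-data" topic, and JSON-encoded messages):

import json
from kafka import KafkaConsumer

def train_on_batch(batch):
    pass                                               # stand-in for the actual training step

consumer = KafkaConsumer(
    "training-data",                                   # hypothetical topic name
    bootstrap_servers=["broker.example.com:9092"],     # hypothetical broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append((message.value["features"], message.value["label"]))
    if len(batch) == 1024:                             # hand over a full mini-batch
        train_on_batch(batch)
        batch = []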
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning Frameworks
11
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved.
12
What is TensorFlow?
"An open-source software library for Machine Intelligence" - tensorflow.org
● Machine Intelligence is the broad term used to describe techniques allowing computers to "learn" by analyzing very large data sets using artificial neural networks
© 2018 Mesosphere, Inc. All Rights Reserved. 13
What is TensorFlow?
"An open-source software library for Machine Intelligence" - tensorflow.org
● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest
[Stack diagram: the Python TensorFlow library sits on top of the dataflow executor, compute kernel implementations, networking, etc., which run on CPUs and GPUs]
© 2018 Mesosphere, Inc. All Rights Reserved. 14
Alternatives
© 2018 Mesosphere, Inc. All Rights Reserved. 15
Alternatives
tf.enable_eager_execution()
https://www.tensorflow.org/get_started/eager
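A minimal sketch of the eager-execution alternative referenced above: with the TF 1.x API, operations run immediately and return concrete values instead of building a graph for a later Session.run():

import tensorflow as tf

tf.enable_eager_execution()

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
y = tf.matmul(x, x)     # executes immediately, no Session needed
print(y.numpy())        # [[ 7. 10.] [15. 22.]]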
© 2018 Mesosphere, Inc. All Rights Reserved. 16
Data Analytics Ecosystem
© 2018 Mesosphere, Inc. All Rights Reserved.
APIs
17
© 2018 Mesosphere, Inc. All Rights Reserved. 18
Deep Learning Frameworks
Challenges
● Different frameworks
● No single framework rules them all
Solutions
● Pick the right tool for the job
● PMML (for model interchange) if needed
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
19
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 20
Users
Challenges
● Different users / use cases
○ Data analyst / exploring
○ Production workloads (highly optimized)
● How to spawn environments?
Solutions
© 2018 Mesosphere, Inc. All Rights Reserved.
Cluster Management and Deployments
22
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your model on your data on a single node → Output: Trained Model
23
[Diagram: Input Data Set → Trained Model]
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● …
24
© 2018 Mesosphere, Inc. All Rights Reserved.
Resource Isolation and Allocation
25
© 2018 Mesosphere, Inc. All Rights Reserved.
TPU
26
© 2018 Mesosphere, Inc. All Rights Reserved.
TPUs
27
© 2017 Mesosphere, Inc. All Rights Reserved. 28
Datacenter
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines (TensorFlow, Jenkins, Kafka, Spark, ...)
© 2018 Mesosphere, Inc. All Rights Reserved.
[DC/OS overview diagram: datacenter, edge, and cloud as a single computing resource, powered by Apache Mesos; runs microservices, containers & dev tools as well as data services, machine learning & AI (100+ more packaged services) on physical infrastructure, virtual machines, and public clouds; with security & compliance, application-aware automation, multitenancy, and hybrid cloud management]
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on your data on the cluster → Output: Trained Model
30
[Diagram: Input Data Set → Trained Model]
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow*
31
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
* Any Distributed System
[Diagram: day-2 operations for any distributed system: deploy, scale, configure, recover (at 3 AM, ...) across a typical datacenter of siloed, over-provisioned, low-utilization servers running HDFS, Kafka, Kubernetes, Flink, TensorFlow, ...]
© 2018 Mesosphere, Inc. All Rights Reserved.
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
33
MESOS ARCHITECTURE
[Diagram: multiple Mesos Masters; Mesos Agents running executors and tasks (Cassandra, Spark, Docker, CDB, ...); framework schedulers (Flink, Spark, Kafka) receiving offers from the Masters]
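To illustrate the four steps above (a toy model, not the real Mesos API), here is a small Python sketch of the offer cycle: the master forwards per-agent resources to a framework scheduler, which accepts or declines, and accepted offers turn into running tasks:

class Master:
    def __init__(self, agents):
        self.agents = agents                     # resources advertised by the agents (step 1)

    def offer_cycle(self, framework):
        for agent in self.agents:                # offer resources to the framework (step 2)
            accepted = framework.resource_offer(agent["id"], agent["cpus"], agent["gpus"])
            if accepted:                         # agent reports task status back (step 4)
                print("master: task RUNNING on %s" % agent["id"])

class TrainingFramework:
    # Accepts only offers with at least one GPU, e.g. a TensorFlow worker (step 3).
    def resource_offer(self, agent_id, cpus, gpus):
        if gpus >= 1:
            print("framework: accepting offer from %s" % agent_id)
            return True
        print("framework: declining offer from %s" % agent_id)
        return False

agents = [{"id": "agent-1", "cpus": 8, "gpus": 0},
          {"id": "agent-2", "cpus": 8, "gpus": 2}]
Master(agents).offer_cycle(TrainingFramework())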
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow
34
● Hard-coding a “ClusterSpec” is incredibly tedious
○ Users need to rewrite code for every job they want to run in a distributed setting
○ True even for code they “inherit” from standard models
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
(the same hard-coded block is repeated for every job)
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyperparameters requires re-uploading code to every node
35
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on your data on the cluster → Output: Trained Model
36
[Diagram: Input Data Set → Trained Model]
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec
37
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
→ generates →
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
38
● Wrapper script to abstract away distributed TensorFlow configuration
○ Separates “deployer” responsibilities from “developer” responsibilities
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
[Diagram: user code + wrapper script]
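A hedged sketch of what such a wrapper script might do, assuming the scheduler injects the cluster layout as a TF_CONFIG-style JSON environment variable (the variable name and layout are assumptions, not the package's documented contract):

import json
import os
import tensorflow as tf

config = json.loads(os.environ["TF_CONFIG"])        # assumed to be injected by the deployer
cluster = tf.train.ClusterSpec(config["cluster"])   # e.g. {"worker": [...], "ps": [...]}
job_name = config["task"]["type"]                   # "worker" or "ps"
task_index = config["task"]["index"]

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                                   # parameter servers only serve variables
else:
    # the user's training code would build its graph against server.target here
    print("worker %d ready at %s" % (task_index, server.target))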
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
39
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
© 2018 Mesosphere, Inc. All Rights Reserved.
Model Management
40
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved.
Recall
41
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
Step 2: Inference (endpoint or data center - instantaneous):
New input from camera or sensor → trained model → Output: classification (97% Dog, 3% Panda)
© 2017 Mesosphere, Inc. All Rights Reserved.
Many Models
42
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
© 2018 Mesosphere, Inc. All Rights Reserved. 43
Model Management
Challenges
● Many models
○ Different hyperparameters
○ Different models
○ New training data
○ ...
Solutions
● Persistent storage + metadata (e.g., GFS)
© 2017 Mesosphere, Inc. All Rights Reserved.
TensorFlow Hub
44
https://www.tensorflow.org/hub/
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
45
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 46
Model Serving
Challenges
● How to deploy models?
○ Zero downtime
○ Canary deployments
Solutions
● TensorFlow Serving (see the sketch below)
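A hedged sketch of calling a model hosted by TensorFlow Serving over its REST API (available in recent versions on port 8501); the host, model name, and input shape are assumptions:

import json
import requests

# One flattened 28x28 input for a hypothetical "mnist" model.
payload = {"instances": [[0.0] * 784]}

response = requests.post(
    "http://serving.example.com:8501/v1/models/mnist:predict",
    data=json.dumps(payload),
)
print(response.json()["predictions"])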
© 2018 Mesosphere, Inc. All Rights Reserved.
TensorFlow Lite
47
https://www.tensorflow.org/mobile/tflite/
Challenges
● Small/fast models without losing too much performance (see the sketch below)
● 500 KB models…
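A hedged sketch of shrinking a trained model for on-device inference with the TensorFlow Lite converter (the exact converter name has moved between TF versions; tf.lite.TFLiteConverter and the SavedModel path are assumptions):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("./export/model")  # assumed export path
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print("TFLite model size: %.1f KB" % (len(tflite_model) / 1024.0))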
© 2018 Mesosphere, Inc. All Rights Reserved.
Rendezvous Architecture
48
https://mapr.com/ebooks/machine-learning-logistics/
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
49
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 50
Monitoring
Challenges
● Understand
● Debug
● Model quality
○ Accuracy
○ Training time
○ …
● Overall architecture
○ Availability
○ Latencies
○ ...
Solutions
● TensorBoard (see the sketch below)
● Traditional cluster monitoring tools
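As a minimal sketch of the TensorBoard side (TF 1.x summary API; the log directory and the fake loss curve are illustrative):

import tensorflow as tf

loss = tf.placeholder(tf.float32, name="loss")
summary_op = tf.summary.scalar("training_loss", loss)
writer = tf.summary.FileWriter("/tmp/tf_logs")

with tf.Session() as sess:
    for step in range(100):
        current_loss = 1.0 / (step + 1)            # stand-in for the real training loss
        summary = sess.run(summary_op, feed_dict={loss: current_loss})
        writer.add_summary(summary, global_step=step)
writer.close()

# Then inspect with: tensorboard --logdir /tmp/tf_logs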
© 2018 Mesosphere, Inc. All Rights Reserved.
Debugging
51
tfdbg
https://www.tensorflow.org/programmers_guide/debugger
© 2018 Mesosphere, Inc. All Rights Reserved.
Debugging
52
tfdbg
- GUI currently in alpha
https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
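A minimal sketch of hooking tfdbg into a TF 1.x session, following the linked guide: wrap the Session so every run() drops into the debugger CLI and can filter for inf/NaN tensors:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

a = tf.constant([1.0, 2.0])
b = tf.constant([0.0, 0.0])
bad = a / b                          # produces inf -- something worth inspecting

sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
print(sess.run(bad))                 # the debugger CLI opens around the run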
© 2018 Mesosphere, Inc. All Rights Reserved.
Profiling
53
Performance optimization differs per device:
- Keep the device occupied
Profiling + experience!
https://www.tensorflow.org/performance/performance_guide
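A hedged sketch of one way to check whether the device stays busy: trace a single session.run() with the TF 1.x run options and dump a Chrome timeline (the op and file names are illustrative):

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

trace = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())    # open in chrome://tracing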
© 2018 Mesosphere, Inc. All Rights Reserved.
Platforms
54
● AWS SageMaker
+ Spark, MXNet, TF
+ Serving / A/B testing
- Cloud only
● Google Datalab / ML Engine
+ TF, Keras, scikit-learn, XGBoost
+ Serving / A/B testing
- Cloud only
- No control over Docker images
● KubeFlow
+ TF everywhere
- TF only
● DC/OS
+ Flexibility (all of the above)
+ GPU support
- More manual setup
© 2017 Mesosphere, Inc. All Rights Reserved. 55
Demo Time
© 2018 Mesosphere, Inc. All Rights Reserved.
Related Work
56
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● KubeFlow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
© 2018 Mesosphere, Inc. All Rights Reserved.
Special Thanks to All Collaborators
57
Ben Wood Robin Oh
Evan Lezar Art Rand
Gabriel Hartmann Chris Lambert
Bo Hu
Sam Pringle Kevin Klues
© 2018 Mesosphere, Inc. All Rights Reserved.
Questions and Links
58
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow
