Deep Learning Pipelines
@joerg_schad @dcos
© 2018 Mesosphere, Inc. All Rights Reserved. 2
Jörg Schad
Tech Lead Community Projects
@joerg_schad
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Promise
3
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Process
4
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
Step 2: Inference (endpoint or data center - instantaneous):
New input from camera or sensor → trained model → Output: classification (97% Dog, 3% Panda)
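To make the two steps concrete, here is a minimal, hedged sketch using tf.keras with synthetic stand-in data (the dataset, layer sizes, and class names are illustrative, not from the talk):

# Step 1: Training -- labeled data in, trained model out.
import numpy as np
import tensorflow as tf

x_train = np.random.rand(1000, 64).astype("float32")   # stand-in for labeled images
y_train = np.random.randint(0, 2, size=(1000,))        # toy labels: 0 = panda, 1 = dog

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)

# Step 2: Inference -- new sensor input in, class probabilities out.
new_input = np.random.rand(1, 64).astype("float32")
probs = model.predict(new_input)[0]
print("dog: %.0f%%, panda: %.0f%%" % (probs[1] * 100, probs[0] * 100))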
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: Some Insight
5
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
6
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
7
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2017 Mesosphere, Inc. All Rights Reserved.
Training Challenges
8
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters (see the sketch below)
■ # Layers
■ # Units per layer
■ Learning rate
■ …
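A hedged sketch of why hyperparameters multiply the compute bill: a naive grid over # layers, # units per layer, and learning rate retrains the model once per combination (tf.keras assumed; the toy data and value grids are illustrative):

import itertools
import numpy as np
import tensorflow as tf

x = np.random.rand(500, 64).astype("float32")
y = np.random.randint(0, 2, size=(500,))

def build_model(n_layers, units, learning_rate):
    layers = [tf.keras.layers.Dense(units, activation="relu", input_shape=(64,))]
    layers += [tf.keras.layers.Dense(units, activation="relu") for _ in range(n_layers - 1)]
    layers += [tf.keras.layers.Dense(2, activation="softmax")]
    model = tf.keras.Sequential(layers)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy")
    return model

# Every combination is a separate training run on the train/dev split.
for n_layers, units, lr in itertools.product([2, 3], [32, 64], [1e-2, 1e-3]):
    model = build_model(n_layers, units, lr)
    history = model.fit(x, y, epochs=3, validation_split=0.2, verbose=0)
    print(n_layers, units, lr, history.history["val_loss"][-1])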
© 2018 Mesosphere, Inc. All Rights Reserved.
Input Data Management
9
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 10
Input Data Management
Challenges
● Training/Dev/Test + new data
● Large amounts of data
● Quality
● Availability (for the cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka (see the sketch below)
● Apache Cassandra
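As one hedged illustration of the streaming side, a consumer could pull labeled records off Kafka and hand them to the trainer in mini-batches (this assumes the kafka-python client, a hypothetical "training-data" topic, and JSON-encoded messages):

import json
from kafka import KafkaConsumer

def train_on_batch(batch):
    pass                                               # stand-in for the actual training step

consumer = KafkaConsumer(
    "training-data",                                   # hypothetical topic name
    bootstrap_servers=["broker.example.com:9092"],     # hypothetical broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

batch = []
for message in consumer:
    batch.append((message.value["features"], message.value["label"]))
    if len(batch) == 1024:                             # hand over a full mini-batch
        train_on_batch(batch)
        batch = []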
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning Frameworks
11
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved.
12
What is TensorFlow?
"An open-source software library for Machine Intelligence" - tensorflow.org
● Machine Intelligence is the broad term used to describe techniques allowing computers to "learn" by analyzing very large data sets using artificial neural networks
© 2018 Mesosphere, Inc. All Rights Reserved. 13
What is TensorFlow?
"An open-source software library for Machine Intelligence" - tensorflow.org
● TensorFlow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest
[Stack diagram: the Python TensorFlow library sits on top of the dataflow executor, compute kernel implementations, networking, etc., which run on CPUs and GPUs]
© 2018 Mesosphere, Inc. All Rights Reserved. 14
Alternatives
© 2018 Mesosphere, Inc. All Rights Reserved. 15
Alternatives
tf.enable_eager_execution()
https://www.tensorflow.org/get_started/eager
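A minimal sketch of the eager-execution alternative referenced above: with the TF 1.x API, operations run immediately and return concrete values instead of building a graph for a later Session.run():

import tensorflow as tf

tf.enable_eager_execution()

x = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
y = tf.matmul(x, x)     # executes immediately, no Session needed
print(y.numpy())        # [[ 7. 10.] [15. 22.]]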
© 2018 Mesosphere, Inc. All Rights Reserved. 16
Data Analytics Ecosystem
© 2018 Mesosphere, Inc. All Rights Reserved.
APIs
17
© 2018 Mesosphere, Inc. All Rights Reserved. 18
Deep Learning Frameworks
Challenges
● Different frameworks
● No single framework rules them all
Solutions
● Pick the right tool for the job
● PMML (for model interchange) if needed
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
19
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 20
Users
Challenges
● Different users / use cases
○ Data analyst / exploring
○ Production workloads (highly optimized)
● How to spawn environments?
Solutions
© 2018 Mesosphere, Inc. All Rights Reserved.
Cluster Management and Deployments
22
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your model on your data on a single node → Output: Trained Model
23
[Diagram: Input Data Set → Trained Model]
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● …
24
© 2018 Mesosphere, Inc. All Rights Reserved.
Resource Isolation and Allocation
25
© 2018 Mesosphere, Inc. All Rights Reserved.
TPU
26
© 2018 Mesosphere, Inc. All Rights Reserved.
TPUs
27
© 2017 Mesosphere, Inc. All Rights Reserved. 28
Datacenter
Typical Datacenter: siloed, over-provisioned servers, low utilization
Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines (TensorFlow, Jenkins, Kafka, Spark, ...)
© 2018 Mesosphere, Inc. All Rights Reserved.
[DC/OS overview diagram: datacenter, edge, and cloud as a single computing resource, powered by Apache Mesos; runs microservices, containers & dev tools as well as data services, machine learning & AI (100+ more packaged services) on physical infrastructure, virtual machines, and public clouds; with security & compliance, application-aware automation, multitenancy, and hybrid cloud management]
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on your data on the cluster → Output: Trained Model
30
[Diagram: Input Data Set → Trained Model]
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow*
31
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
* Any Distributed System
[Diagram: day-2 operations for any distributed system: deploy, scale, configure, recover (at 3 AM, ...) across a typical datacenter of siloed, over-provisioned, low-utilization servers running HDFS, Kafka, Kubernetes, Flink, TensorFlow, ...]
© 2018 Mesosphere, Inc. All Rights Reserved.
Two-level Scheduling
1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
33
MESOS ARCHITECTURE
[Diagram: multiple Mesos Masters; Mesos Agents running executors and tasks (Cassandra, Spark, Docker, CDB, ...); framework schedulers (Flink, Spark, Kafka) receiving offers from the Masters]
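To illustrate the four steps above (a toy model, not the real Mesos API), here is a small Python sketch of the offer cycle: the master forwards per-agent resources to a framework scheduler, which accepts or declines, and accepted offers turn into running tasks:

class Master:
    def __init__(self, agents):
        self.agents = agents                     # resources advertised by the agents (step 1)

    def offer_cycle(self, framework):
        for agent in self.agents:                # offer resources to the framework (step 2)
            accepted = framework.resource_offer(agent["id"], agent["cpus"], agent["gpus"])
            if accepted:                         # agent reports task status back (step 4)
                print("master: task RUNNING on %s" % agent["id"])

class TrainingFramework:
    # Accepts only offers with at least one GPU, e.g. a TensorFlow worker (step 3).
    def resource_offer(self, agent_id, cpus, gpus):
        if gpus >= 1:
            print("framework: accepting offer from %s" % agent_id)
            return True
        print("framework: declining offer from %s" % agent_id)
        return False

agents = [{"id": "agent-1", "cpus": 8, "gpus": 0},
          {"id": "agent-2", "cpus": 8, "gpus": 2}]
Master(agents).offer_cycle(TrainingFramework())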
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow
34
● Hard-coding a “ClusterSpec” is incredibly tedious
○ Users need to rewrite code for every job they want to run in a distributed setting
○ True even for code they “inherit” from standard models
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
(the same hard-coded block is repeated for every job)
© 2017 Mesosphere, Inc. All Rights Reserved.
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyperparameters requires re-uploading code to every node
35
© 2017 Mesosphere, Inc. All Rights Reserved.
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on your data on the cluster → Output: Trained Model
36
[Diagram: Input Data Set → Trained Model]
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec
37
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
→ generates →
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
38
● Wrapper script to abstract away distributed TensorFlow configuration
○ Separates “deployer” responsibilities from “developer” responsibilities
{
"service": {
"name": "mnist",
"job_url": "...",
"job_context": "..."
},
"gpu_worker": {... },
"worker": {... },
"ps": {... }
}
[Diagram: user code + wrapper script]
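A hedged sketch of what such a wrapper script might do, assuming the scheduler injects the cluster layout as a TF_CONFIG-style JSON environment variable (the variable name and layout are assumptions, not the package's documented contract):

import json
import os
import tensorflow as tf

config = json.loads(os.environ["TF_CONFIG"])        # assumed to be injected by the deployer
cluster = tf.train.ClusterSpec(config["cluster"])   # e.g. {"worker": [...], "ps": [...]}
job_name = config["task"]["type"]                   # "worker" or "ps"
task_index = config["task"]["index"]

server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                                   # parameter servers only serve variables
else:
    # the user's training code would build its graph against server.target here
    print("worker %d ready at %s" % (task_index, server.target))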
© 2017 Mesosphere, Inc. All Rights Reserved.
Running distributed TensorFlow on DC/OS
39
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
© 2018 Mesosphere, Inc. All Rights Reserved.
Model Management
40
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved.
Recall
41
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
Step 2: Inference (endpoint or data center - instantaneous):
New input from camera or sensor → trained model → Output: classification (97% Dog, 3% Panda)
© 2017 Mesosphere, Inc. All Rights Reserved.
Many Models
42
Step 1: Training (in the data center - over hours/days/weeks):
Input: lots of labeled data → deep neural network model → Output: trained model ("Dog")
© 2018 Mesosphere, Inc. All Rights Reserved. 43
Model Management
Challenges
● Many models
○ Different hyperparameters
○ Different models
○ New training data
○ ...
Solutions
● Persistent storage + metadata (e.g., GFS)
© 2017 Mesosphere, Inc. All Rights Reserved.
TensorFlow Hub
44
https://www.tensorflow.org/hub/
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
45
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 46
Model Serving
Challenges
● How to deploy models?
○ Zero downtime
○ Canary deployments
Solutions
● TensorFlow Serving (see the sketch below)
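A hedged sketch of calling a model hosted by TensorFlow Serving over its REST API (available in recent versions on port 8501); the host, model name, and input shape are assumptions:

import json
import requests

# One flattened 28x28 input for a hypothetical "mnist" model.
payload = {"instances": [[0.0] * 784]}

response = requests.post(
    "http://serving.example.com:8501/v1/models/mnist:predict",
    data=json.dumps(payload),
)
print(response.json()["predictions"])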
© 2018 Mesosphere, Inc. All Rights Reserved.
TensorFlow Lite
47
https://www.tensorflow.org/mobile/tflite/
Challenges
● Small/fast models without losing too much performance (see the sketch below)
● 500 KB models…
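A hedged sketch of shrinking a trained model for on-device inference with the TensorFlow Lite converter (the exact converter name has moved between TF versions; tf.lite.TFLiteConverter and the SavedModel path are assumptions):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("./export/model")  # assumed export path
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
print("TFLite model size: %.1f KB" % (len(tflite_model) / 1024.0))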
© 2018 Mesosphere, Inc. All Rights Reserved.
Rendezvous Architecture
48
https://mapr.com/ebooks/machine-learning-logistics/
© 2018 Mesosphere, Inc. All Rights Reserved.
Deep Learning: The Challenges
49
[Pipeline diagram: Input Data · Frameworks · Cluster + State · Models · Model Serving · Users · Monitoring & Operations]
© 2018 Mesosphere, Inc. All Rights Reserved. 50
Monitoring
Challenges
● Understand
● Debug
● Model quality
○ Accuracy
○ Training time
○ …
● Overall architecture
○ Availability
○ Latencies
○ ...
Solutions
● TensorBoard (see the sketch below)
● Traditional cluster monitoring tools
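As a minimal sketch of the TensorBoard side (TF 1.x summary API; the log directory and the fake loss curve are illustrative):

import tensorflow as tf

loss = tf.placeholder(tf.float32, name="loss")
summary_op = tf.summary.scalar("training_loss", loss)
writer = tf.summary.FileWriter("/tmp/tf_logs")

with tf.Session() as sess:
    for step in range(100):
        current_loss = 1.0 / (step + 1)            # stand-in for the real training loss
        summary = sess.run(summary_op, feed_dict={loss: current_loss})
        writer.add_summary(summary, global_step=step)
writer.close()

# Then inspect with: tensorboard --logdir /tmp/tf_logs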
© 2018 Mesosphere, Inc. All Rights Reserved.
Debugging
51
tfdbg
https://www.tensorflow.org/programmers_guide/debugger
© 2018 Mesosphere, Inc. All Rights Reserved.
Debugging
52
tfdbg
- GUI currently in alpha
https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/debugger/README.md
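A minimal sketch of hooking tfdbg into a TF 1.x session, following the linked guide: wrap the Session so every run() drops into the debugger CLI and can filter for inf/NaN tensors:

import tensorflow as tf
from tensorflow.python import debug as tf_debug

a = tf.constant([1.0, 2.0])
b = tf.constant([0.0, 0.0])
bad = a / b                          # produces inf -- something worth inspecting

sess = tf.Session()
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
sess.add_tensor_filter("has_inf_or_nan", tf_debug.has_inf_or_nan)
print(sess.run(bad))                 # the debugger CLI opens around the run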
© 2018 Mesosphere, Inc. All Rights Reserved.
Profiling
53
Performance optimization differs per device:
- Keep the device occupied
Profiling + experience!
https://www.tensorflow.org/performance/performance_guide
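A hedged sketch of one way to check whether the device stays busy: trace a single session.run() with the TF 1.x run options and dump a Chrome timeline (the op and file names are illustrative):

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.matmul(x, x)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

trace = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())    # open in chrome://tracing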
© 2018 Mesosphere, Inc. All Rights Reserved.
Platforms
54
● AWS SageMaker
+ Spark, MXNet, TF
+ Serving / A/B testing
- Cloud only
● Google Datalab / ML Engine
+ TF, Keras, scikit-learn, XGBoost
+ Serving / A/B testing
- Cloud only
- No control over Docker images
● KubeFlow
+ TF everywhere
- TF only
● DC/OS
+ Flexibility (all of the above)
+ GPU support
- More manual setup
© 2017 Mesosphere, Inc. All Rights Reserved. 55
Demo Time
© 2018 Mesosphere, Inc. All Rights Reserved.
Related Work
56
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● KubeFlow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
© 2018 Mesosphere, Inc. All Rights Reserved.
Special Thanks to All Collaborators
57
Ben Wood Robin Oh
Evan Lezar Art Rand
Gabriel Hartmann Chris Lambert
Bo Hu
Sam Pringle Kevin Klues
© 2018 Mesosphere, Inc. All Rights Reserved.
Questions and Links
58
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow
