Webinar: Deep Learning Pipelines Beyond the Learning

Deep Learning Pipelines
@joerg_schad @dcos

© 2018 Mesosphere, Inc. All Rights Reserved.
Jörg Schad
Distributed Systems Engineer
@joerg_schad

Deep Learning: The Promise

Deep Learning: The Process
Step 1: Training (In Data Center - Over Hours/Days/Weeks)
Input: Lots of Labeled Data (e.g. “Dog”) → Deep neural network model → Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous)
Input: New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)
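
The classification output in Step 2 can be read as one probability per label. Below is a minimal sketch, assuming the network ends in a softmax layer (the slides do not say which output layer is used). The labels and raw scores are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels and raw scores for one input image.
labels = ["Dog", "Panda"]
logits = [4.0, 0.5]

probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")
```

The label with the highest probability is reported as the prediction; the remaining probability mass shows how confident the model is.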

Deep Learning: Some insight

Deep Learning: The Challenges

[Figure: pipeline overview: Input Data, Frameworks, Cluster, + state, Models, Model Serving, Monitoring, Users]

Training Challenges
Step 1: Training
Input: Lots of Labeled Data (e.g. “Dog”) → Deep neural network model → Output: Trained Model
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters
■ #Layers
■ #Units per Layer
■ Learning Rate
■ ….
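
The two sources of training cost listed above can be sketched in plain Python: splitting a dataset into train/dev/test sets, and enumerating hyperparameter combinations, each of which needs its own training run. The split ratios, value ranges, and function names here are assumptions for illustration, not taken from the talk.

```python
import itertools
import random

def split_dataset(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle and split examples into train/dev/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical search space over the hyperparameters named on the slide.
search_space = {
    "num_layers": [2, 4, 8],
    "units_per_layer": [64, 128],
    "learning_rate": [0.1, 0.01, 0.001],
}

def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

train_set, dev_set, test_set = split_dataset(range(1000))
configs = list(grid(search_space))
print(len(train_set), len(dev_set), len(test_set), len(configs))
```

Even this small grid multiplies out to many training runs, which is why hyperparameter search is compute intensive.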

Input Data Management

Challenges
● Training/Dev/Test + New Data
● Large amounts
● Quality
● Availability (for cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka
● Apache Cassandra

Deep Learning Frameworks

What is TensorFlow?
“An open-source software library for Machine Intelligence” -
tensorflow.org
● Machine Intelligence is the broad term used to describe
techniques allowing computers to “learn” by analyzing very
large data sets using artificial neural networks
● TensorFlow is a software library that makes it easy for
developers to construct artificial neural networks to analyze
their data of interest
[Figure: TensorFlow stack: Python, TensorFlow Library, Dataflow Executor, Compute Kernel Implementations, Networking, etc., GPUs, CPUs]

Alternatives

Data Analytics Ecosystem

Deep Learning Frameworks
Challenges
● Different Frameworks
● No one rules them all
Solutions
● Choice
● Deployments?
● Models across Frameworks?


Users
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions

Cluster Management and Deployments

Datacenter
● Typical Datacenter: siloed, over-provisioned servers, low utilization
● Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines
[Figure: workloads shown: TensorFlow, Jenkins, Kafka, Spark]

What is DC/OS?
● DC/OS (Data Center Operating System) is an
open-source, distributed operating system
● It takes Mesos and builds upon it with
additional services and functionality
○ Built-in support for service discovery, load balancing, security, and
ease of installation
○ Extra tooling (e.g. comprehensive CLI and a GUI)
○ Built-in frameworks for launching long running services (Marathon)
and batch jobs (Metronome)
○ A repository (app-store) for installing other common packages and
frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)

Typical Developer Workflow for TensorFlow
(Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for single-node performance
● Train on your data on a single node → Output Trained Model
[Figure: Input Data Set → Trained Model]

Typical Developer Workflow for TensorFlow
(Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address
of the machine where those computations will be performed
● Deploy your code on every machine
● Train on your data on the cluster → Output Trained Model
[Figure: Input Data Set → Trained Model]
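
The step of mapping computations to exact machine addresses can be illustrated without TensorFlow itself. In TensorFlow, a cluster specification maps job names to lists of host:port addresses, and one common way to hand it to each process is the TF_CONFIG environment variable. The sketch below only builds those JSON strings. The addresses and the helper name tf_config are hypothetical.

```python
import json

# Hypothetical host:port addresses; in the manual workflow these are hard-coded.
cluster = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}

def tf_config(cluster, task_type, task_index):
    """Build the JSON string a given task would export as TF_CONFIG."""
    if task_index >= len(cluster[task_type]):
        raise ValueError(f"no {task_type} task with index {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# Every machine needs its own value, which is what makes manual setup tedious.
for task_type, hosts in cluster.items():
    for index, host in enumerate(hosts):
        print(host, tf_config(cluster, task_type, index))
```

Because every task carries the full list of addresses, replacing a single machine means updating and redeploying the configuration everywhere.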

Challenges running distributed TensorFlow
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node


Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
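
The idea of restarting failed tasks automatically can be sketched in a few lines. This only illustrates a restart loop; it is not the dcos-commons SDK or its API, and run_with_restarts and flaky_worker are hypothetical names.

```python
def run_with_restarts(task, max_restarts=3):
    """Run task(attempt); restart it after a failure, up to max_restarts times."""
    for attempt in range(max_restarts + 1):
        try:
            return task(attempt)
        except RuntimeError as err:
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError("task kept failing; giving up")

def flaky_worker(attempt):
    """Hypothetical worker that crashes twice, then resumes from a checkpoint."""
    if attempt < 2:
        raise RuntimeError("worker lost")
    return f"resumed from checkpoint on attempt {attempt}"

result = run_with_restarts(flaky_worker)
print(result)
```

A scheduler that does this on the user's behalf removes the manual stop, edit, and restart cycle described under the challenges.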

Model Management

Recall
Step 1: Training
Input: Lots of Labeled Data (e.g. “Dog”) → Deep neural network model → Output: Trained Model
Step 2: Inference (Endpoint or Data Center - Instantaneous)
Input: New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)

Many Models
Step 1: Training
Input: Lots of Labeled Data (e.g. “Dog”) → Deep neural network model → Output: Trained Model

Model Management
Challenges
● Many Models
● Different Hyperparameters
● Different Models
● New Training Data
● ...
Solutions
● Persistent Storage + Metadata
● GFS
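
One way to picture "Persistent Storage + Metadata" is a versioned directory per model with a small metadata record next to each saved artifact. The sketch below is an assumption about how such a store could look, not a description of a specific tool. The directory layout, field names, and register_model are hypothetical, and a temporary directory stands in for shared storage.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path

def register_model(store, name, model_bytes, hyperparameters, training_data):
    """Save a model plus a metadata record under store/name/<version>/."""
    model_dir = Path(store) / name
    model_dir.mkdir(parents=True, exist_ok=True)
    version = len([p for p in model_dir.iterdir() if p.is_dir()]) + 1
    version_dir = model_dir / str(version)
    version_dir.mkdir()
    (version_dir / "model.bin").write_bytes(model_bytes)
    metadata = {
        "name": name,
        "version": version,
        "hyperparameters": hyperparameters,
        "training_data": training_data,
        "sha256": hashlib.sha256(model_bytes).hexdigest(),
        "created_at": time.time(),
    }
    (version_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata

store = tempfile.mkdtemp()  # stand-in for shared persistent storage
first = register_model(store, "dog-classifier", b"weights-v1",
                       {"learning_rate": 0.01}, "images-2018-01")
second = register_model(store, "dog-classifier", b"weights-v2",
                        {"learning_rate": 0.001}, "images-2018-02")
print(first["version"], second["version"])
```

Recording hyperparameters and the training data identifier alongside each version makes it possible to tell many similar models apart later.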


Model Serving
Challenges
● How to Deploy Models?
● Zero Downtime
● Canary
● ...
Solutions
● TensorFlow Serving
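
A canary rollout sends a small share of requests to a new model version while the rest stay on the stable one. The sketch below shows one possible routing rule and builds (but does not send) a request in the shape of the TensorFlow Serving REST predict endpoint. The host, model name, version numbers, and the 10% split are assumptions for illustration.

```python
import hashlib
import json

def pick_version(request_id, stable, canary, canary_percent):
    """Deterministically send canary_percent of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return canary if bucket < canary_percent else stable

def predict_request(host, model, version, instances):
    """Build URL and JSON body for a REST predict call (nothing is sent)."""
    url = f"http://{host}/v1/models/{model}/versions/{version}:predict"
    return url, json.dumps({"instances": instances})

# Hypothetical host, model name, and version numbers.
counts = {1: 0, 2: 0}
for i in range(1000):
    counts[pick_version(f"req-{i}", stable=1, canary=2, canary_percent=10)] += 1

url, body = predict_request("serving.example.com:8501", "dog-classifier", 2, [[0.1, 0.2]])
print(counts, url, body)
```

Hashing the request ID keeps routing stable for a given request, so retries land on the same version while the canary is being evaluated.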


Monitoring
Challenges
● Understand {...}
● Debug
● Model Quality
○ Accuracy
○ Training Time
○ …
● Overall Architecture
○ Availability
○ Latencies
○ ...
Solutions
● TensorBoard
● Traditional Cluster Monitoring Tool
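
The architecture-level metrics named above (availability and latencies) can be computed from request logs with the standard library. This is a minimal sketch with made-up measurements; the function names are hypothetical, and a real deployment would rely on a cluster monitoring tool instead.

```python
import statistics

def latency_summary(latencies_ms):
    """Return median, 95th, and 99th percentile latency in milliseconds."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def availability(outcomes):
    """Fraction of requests that succeeded (outcomes is a list of booleans)."""
    return sum(outcomes) / len(outcomes)

# Made-up measurements: mostly fast requests with a slow tail.
latencies = [12.0] * 90 + [40.0] * 8 + [250.0] * 2
outcomes = [True] * 98 + [False] * 2

summary = latency_summary(latencies)
print(summary, availability(outcomes))
```

Tail percentiles matter for serving because a small number of slow predictions can dominate the user-visible experience even when the median looks healthy.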

Demo Time

Related Work
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● KubeFlow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/

Special Thanks to All Collaborators
Ben Wood
Robin Oh
Evan Lezar
Art Rand
Gabriel Hartmann
Sam Pringle
Kevin Klues

Questions and Links
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow
