Running Distributed TensorFlow on DC/OS
Kevin Klues
klueska@mesosphere.com
Kevin Klues is an Engineering Manager at Mesosphere, where he leads the DC/OS Cluster Operations team. Since joining Mesosphere, Kevin has been involved in the design and implementation of a number of Mesos's core subsystems, including GPU isolation, Pods, and Attach/Exec support. Prior to joining Mesosphere, Kevin worked at Google on an experimental operating system for data centers called Akaros. He and a few others founded the Akaros project while working on their Ph.D.s at UC Berkeley. In a past life, Kevin was a lead developer of the TinyOS project, working at Stanford, the Technical University of Berlin, and the CSIRO in Australia. When not working, you can usually find Kevin on a snowboard or up in the mountains in some capacity or another.
What is DC/OS?
● DC/OS (Data Center Operating System) is an open-source, distributed operating system
● It takes Mesos and builds upon it with additional services and functionality
  ○ Built-in support for service discovery, load balancing, security, and ease of installation
  ○ Extra tooling (e.g. a comprehensive CLI and a GUI)
  ○ Built-in frameworks for launching long-running services (Marathon) and batch jobs (Metronome)
  ○ A repository (app store) for installing other common packages and frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)
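The TensorFlow framework used later in this talk is distributed through that package repository. As a sketch of what installation looks like (assuming the package is published as beta-tensorflow, matching the CLI subcommand used later in this deck; the options file name is just a placeholder):

$ dcos package install beta-tensorflow --options=my-tf-job.json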
What is DC/OS?
[Architecture diagram of the DC/OS stack, built up over four slides, including Metronome (Batch)]
Overview of Talk
● Demo Setup (Preview)
● Typical developer workflow for TensorFlow
● Challenges running distributed TensorFlow
● Running distributed TensorFlow on DC/OS
● Demo
● Next Steps
Demo Setup - Train an Image Classifier
[Diagram of the two-step workflow:]
Step 1: Training (in the data center, over hours/days/weeks). Input: lots of labeled data, fed into a deep neural network model. Output: a trained model.
Step 2: Inference (endpoint or data center, instantaneous). Input: new data from a camera or sensor (e.g. a dog), fed into the trained model. Output: a classification, e.g. 97% Dog, 3% Panda.
Demo Setup - Model and Training Data
● Train the Inception-V3 image classification model on the CIFAR-10 dataset
  ○ Inception-V3: an open-source image recognition model
  ○ CIFAR-10: a well-known dataset with 60,000 low-resolution images of 10 classes of objects (trucks, planes, ships, birds, cats, etc.)
Demo Setup - Training Deployment Strategy
● Run two separate TensorFlow jobs
  ○ A non-distributed job with a single worker
  ○ A distributed job with several workers
[Diagram: each job takes the input data set and produces its own trained model]
Demo Setup
● Spin up a DC/OS cluster on GCE to run the jobs
  ○ 1 master, 8 agents
  ○ Each agent has:
    ■ 4 Tesla K80 GPUs
    ■ 8 CPUs
    ■ 32 GB of memory
  ○ HDFS pre-installed for serving training data
Demo Setup
● Log data from both jobs into HDFS
  ○ Use TensorBoard to monitor and compare their progress

Note: This is a serious model that would take over a week to fully train, even on a cluster of expensive machines. Our goal here is simply to demonstrate how easy it is to deploy and monitor large TensorFlow jobs on DC/OS.
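A minimal sketch of what the logging side of this setup looks like in TF 1.x user code, assuming an HDFS directory such as hdfs://namenode:9000/tf_logs/... (the hostname, path, and toy model are hypothetical, and TensorFlow must be built with HDFS support for hdfs:// paths to work):

import tensorflow as tf

# Hypothetical HDFS directory for summaries and checkpoints.
LOG_DIR = "hdfs://namenode:9000/tf_logs/cifar-single"

# Toy model: fit y = 2x with a single variable, just to have something to log.
x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])
w = tf.Variable(1.0)
loss = tf.reduce_mean(tf.square(w * x - y))
tf.summary.scalar("loss", loss)  # shows up in TensorBoard's Scalars tab

global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, global_step=global_step)

# MonitoredTrainingSession periodically writes summaries and checkpoints to LOG_DIR.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir=LOG_DIR,
        hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)

TensorBoard is then pointed at the same location (e.g. tensorboard --logdir=hdfs://namenode:9000/tf_logs) so that both jobs can be monitored and compared side by side.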
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your data on a single node → Output Trained Model
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for distributed computation
● …
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed (see the sketch after this list)
● Deploy your code on every machine
● Train your data on the cluster → Output Trained Model
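As a hedged sketch of that per-machine wiring in TF 1.x (the hostnames, ports, toy model, and the job_name/task_index values are hypothetical), each process builds the same hard-coded ClusterSpec, starts a tf.train.Server for its own role, and pins variables to the parameter servers:

import tensorflow as tf

# The same hard-coded cluster description must be baked into every copy of the code.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps":     ["ps0.example.com:2222"],
})

# Each machine is started with its own role and index, e.g. job_name="worker", task_index=1.
job_name, task_index = "worker", 0  # hypothetical values for this process
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables
else:
    # Variables land on the ps tasks; ops run on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        w = tf.Variable(1.0)
        loss = tf.square(w - 3.0)
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)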
Challenges running distributed TensorFlow
● Hard-coding a "ClusterSpec" is incredibly tedious
  ○ Users need to rewrite code for every job they want to run in a distributed setting
  ○ True even for code they "inherit" from standard models
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
Challenges running distributed TensorFlow
● Dealing with failures is not graceful
  ○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
  ○ Setting up access to a shared file system (for checkpoint and summary files) requires authenticating on each node
  ○ Tweaking hyper-parameters requires re-uploading code to every node
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec

{
  "service": {
    "name": "mnist",
    "job_url": "...",
    "job_context": "..."
  },
  "gpu_worker": {...},
  "worker": {...},
  "ps": {...}
}

This service definition is expanded into a ClusterSpec like the following:
tf.train.ClusterSpec({
"worker": [
"worker0.example.com:2222",
"worker1.example.com:2222",
"worker2.example.com:2222",
"worker3.example.com:2222",
"worker4.example.com:2222",
"worker5.example.com:2222",
...
],
"ps": [
"ps0.example.com:2222",
"ps1.example.com:2222",
"ps2.example.com:2222",
"ps3.example.com:2222",
...
]})
Running distributed TensorFlow on DC/OS
● Wrapper script to abstract away distributed TensorFlow configuration
  ○ Separates "deployer" responsibilities from "developer" responsibilities
[Diagram: the service definition JSON shown above and the user's code both feed into the wrapper script]
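A hedged sketch of that split (the environment variable names and the train() entry-point signature are assumptions for illustration, not the package's actual contract): the wrapper owns all cluster wiring, using the same environment-driven setup as in the previous sketch, and simply calls into a function the developer provides.

# wrapper.py -- deployer side (illustrative): owns all distributed wiring.
import importlib
import json
import os
import tensorflow as tf

def main():
    cluster = tf.train.ClusterSpec(json.loads(os.environ["TF_CLUSTER"]))  # hypothetical env vars,
    job_name = os.environ["TF_JOB_NAME"]                                  # populated at deploy time
    task_index = int(os.environ["TF_TASK_INDEX"])
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()  # parameter servers only host variables
        return

    # Developer side: an ordinary module with a train() entry point,
    # fetched from the job_url given in the service definition.
    user_module = importlib.import_module("user_model")
    user_module.train(server=server,
                      cluster=cluster,
                      task_index=task_index,
                      log_dir=os.environ.get("TF_LOG_DIR", "/tmp/tf_logs"),
                      context=json.loads(os.environ.get("TF_JOB_CONTEXT", "{}")))

if __name__ == "__main__":
    main()

With an interface like this, the developer only ever writes and tests train(); the deployer decides how many workers and parameter servers back it.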
Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects them to the cluster
Running distributed TensorFlow on DC/OS
● We use DC/OS Secrets (or alternatively environment variables) to pass credentials to every node in the cluster
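A small, hedged illustration of the environment-variable flavor of this (the variable name and the keytab mechanism are assumptions for the sketch; the concrete authentication flow depends on the cluster setup): each task reads its credentials from the environment before touching shared storage.

import base64
import os

# Hypothetical name: on DC/OS, a secret can be exposed to the task as an
# environment variable (or as a file in the task sandbox).
keytab_b64 = os.environ["HDFS_KEYTAB_BASE64"]

# Materialize the credential locally so whatever HDFS client the job uses
# can pick it up.
with open("hdfs.keytab", "wb") as f:
    f.write(base64.b64decode(keytab_b64))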
Running distributed TensorFlow on DC/OS
● We use a runtime configuration dictionary to quickly tweak hyper-parameters between different runs of the same model

{
  "service": {
    "name": "mnist",
    "job_url": "...",
    "job_context": "{...}"
  },
  "gpu_worker": {...},
  "worker": {...},
  "ps": {...}
}

$ dcos beta-tensorflow update start --name=/cifar-multiple --options=cifar-multiple.json
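As a hedged sketch of how user code might consume such a dictionary (the context keys and the way it reaches the code are assumptions for illustration): hyper-parameters live in job_context rather than in the code, so changing them only requires a config update like the one above.

import json
import os

# Hypothetical: the wrapper passes the service's job_context through to user
# code, e.g. as an environment variable or a local file.
context = json.loads(os.environ.get("TF_JOB_CONTEXT", "{}"))

# Hyper-parameters come from the config, with defaults in code.
batch_size = int(context.get("batch_size", 128))
learning_rate = float(context.get("learning_rate", 0.1))
num_gpus = int(context.get("num_gpus", 0))

print("Training with batch_size=%d, learning_rate=%g, num_gpus=%d"
      % (batch_size, learning_rate, num_gpus))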
Demo Setup Recap
● Train the Inception-V3 image classification model on the CIFAR-10 dataset
● A non-distributed job with a single worker
● A distributed job with several workers
[Diagram: each job takes the input data set and produces its own trained model]
DEMO
Next Steps
● What we have today
  ○ Single Framework
  ○ Installed via standard DC/OS package management tools
  ○ Need to manually start/stop and remove framework from cluster when completed
[Diagram: one framework instance with a Chief, Workers, PS tasks, HDFS/GCS/etc storage, and TensorBoard]
Next Steps
● Where we are going
  ○ Meta Framework
  ○ Able to install / run instances of the original single framework
  ○ Launch and monitor via `tensorflow` CLI extensions
  ○ Automatically start/stop and remove framework instances when their jobs complete
[Diagram: a Meta Framework, driven by the CLI, managing multiple instances of the single framework, each with its own Chief, Workers, PS tasks, HDFS/GCS/etc storage, and TensorBoard]
Next Steps

$ dcos tensorflow run train.py \
> --workers=3 \
> --ps=2
Running "train.py" on DC/OS with 3 workers and 2 parameter servers.
Special Thanks to All Collaborators

Sam Pringle (springle@mesosphere.com)
Primary Developer of the DC/OS TensorFlow Package

Jörg Schad, Ben Wood, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
  ○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
  ○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
  ○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
  ○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
  ○ Slack: chat.dcos.io #tensorflow
Editor's Notes

  • #29 (DEMO slide) — Which TensorBoard tabs to click through:
    Scalars: arbitrary variables marked in the code are visualized as they change over time.
    Images: each step processes a batch of images, and more images are processed per step in the distributed model. Takeaway: over time, the bounding box for the subject in the image gets tighter and tighter around the subject alone.
    Graphs.
    Histograms.
    Both jobs are represented because I pointed them at the same bucket.