Running Distributed TensorFlow on DC/OS
Kevin Klues
klueska@mesosphere.com
Kevin Klues is an Engineering Manager at Mesosphere, where he leads the DC/OS Cluster Operations team. Since joining Mesosphere, Kevin has been involved in the design and implementation of a number of Mesos's core subsystems, including GPU isolation, Pods, and Attach/Exec support. Prior to joining Mesosphere, Kevin worked at Google on an experimental operating system for data centers called Akaros. He and a few others founded the Akaros project while working on their Ph.D.s at UC Berkeley. In a past life, Kevin was a lead developer of the TinyOS project, working at Stanford, the Technical University of Berlin, and the CSIRO in Australia. When not working, you can usually find Kevin on a snowboard or up in the mountains in some capacity or another.
What is DC/OS?
● DC/OS (Data Center Operating System) is an open-source, distributed operating system
● It takes Mesos and builds upon it with additional services and functionality
  ○ Built-in support for service discovery, load balancing, security, and ease of installation
  ○ Extra tooling (e.g. a comprehensive CLI and a GUI)
  ○ Built-in frameworks for launching long-running services (Marathon) and batch jobs (Metronome)
  ○ A repository (app store) for installing other common packages and frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)
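The TensorFlow framework used later in this talk is distributed through that package repository. As a sketch of what installation looks like (assuming the package is published as beta-tensorflow, matching the CLI subcommand used later in this deck; the options file name is just a placeholder):

$ dcos package install beta-tensorflow --options=my-tf-job.json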
What is DC/OS?
[Architecture diagram of the DC/OS stack, built up over four slides, including Metronome (Batch)]
Overview of Talk
● Demo Setup (Preview)
● Typical developer workflow for TensorFlow
● Challenges running distributed TensorFlow
● Running distributed TensorFlow on DC/OS
● Demo
● Next Steps
Demo Setup - Train an Image Classifier
[Diagram of the two-step workflow:]
Step 1: Training (in the data center, over hours/days/weeks). Input: lots of labeled data, fed into a deep neural network model. Output: a trained model.
Step 2: Inference (endpoint or data center, instantaneous). Input: new data from a camera or sensor (e.g. a dog), fed into the trained model. Output: a classification, e.g. 97% Dog, 3% Panda.
Demo Setup - Model and Training Data
● Train the Inception-V3 image classification model on the CIFAR-10 dataset
  ○ Inception-V3: an open-source image recognition model
  ○ CIFAR-10: a well-known dataset with 60,000 low-resolution images of 10 classes of objects (trucks, planes, ships, birds, cats, etc.)
Demo Setup - Training Deployment Strategy
● Run two separate TensorFlow jobs
  ○ A non-distributed job with a single worker
  ○ A distributed job with several workers
[Diagram: each job takes the input data set and produces its own trained model]
Demo Setup
● Spin up a DC/OS cluster on GCE to run the jobs
  ○ 1 master, 8 agents
  ○ Each agent has:
    ■ 4 Tesla K80 GPUs
    ■ 8 CPUs
    ■ 32 GB of memory
  ○ HDFS pre-installed for serving training data
Demo Setup
● Log data from both jobs into HDFS
  ○ Use TensorBoard to monitor and compare their progress

Note: This is a serious model that would take over a week to fully train, even on a cluster of expensive machines. Our goal here is simply to demonstrate how easy it is to deploy and monitor large TensorFlow jobs on DC/OS.
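A minimal sketch of what the logging side of this setup looks like in TF 1.x user code, assuming an HDFS directory such as hdfs://namenode:9000/tf_logs/... (the hostname, path, and toy model are hypothetical, and TensorFlow must be built with HDFS support for hdfs:// paths to work):

import tensorflow as tf

# Hypothetical HDFS directory for summaries and checkpoints.
LOG_DIR = "hdfs://namenode:9000/tf_logs/cifar-single"

# Toy model: fit y = 2x with a single variable, just to have something to log.
x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[2.0], [4.0], [6.0]])
w = tf.Variable(1.0)
loss = tf.reduce_mean(tf.square(w * x - y))
tf.summary.scalar("loss", loss)  # shows up in TensorBoard's Scalars tab

global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss, global_step=global_step)

# MonitoredTrainingSession periodically writes summaries and checkpoints to LOG_DIR.
with tf.train.MonitoredTrainingSession(
        checkpoint_dir=LOG_DIR,
        hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)

TensorBoard is then pointed at the same location (e.g. tensorboard --logdir=hdfs://namenode:9000/tf_logs) so that both jobs can be monitored and compared side by side.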
Typical Developer Workflow for TensorFlow (Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your data on a single node → Output Trained Model
Typical Developer Workflow for TensorFlow (Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow's basic machine learning primitives
● Write your code, optimized for distributed computation
● …
Typical Developer Workflow for TensorFlow (Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed (see the sketch after this list)
● Deploy your code on every machine
● Train your data on the cluster → Output Trained Model
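As a hedged sketch of that per-machine wiring in TF 1.x (the hostnames, ports, toy model, and the job_name/task_index values are hypothetical), each process builds the same hard-coded ClusterSpec, starts a tf.train.Server for its own role, and pins variables to the parameter servers:

import tensorflow as tf

# The same hard-coded cluster description must be baked into every copy of the code.
cluster = tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps":     ["ps0.example.com:2222"],
})

# Each machine is started with its own role and index, e.g. job_name="worker", task_index=1.
job_name, task_index = "worker", 0  # hypothetical values for this process
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers just host variables
else:
    # Variables land on the ps tasks; ops run on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        w = tf.Variable(1.0)
        loss = tf.square(w - 3.0)
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=global_step)

    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)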
Challenges running distributed TensorFlow
● Hard-coding a "ClusterSpec" is incredibly tedious
  ○ Users need to rewrite code for every job they want to run in a distributed setting
  ○ True even for code they "inherit" from standard models
tf.train.ClusterSpec({
    "worker": [
        "worker0.example.com:2222",
        "worker1.example.com:2222",
        "worker2.example.com:2222",
        "worker3.example.com:2222",
        "worker4.example.com:2222",
        "worker5.example.com:2222",
        ...
    ],
    "ps": [
        "ps0.example.com:2222",
        "ps1.example.com:2222",
        "ps2.example.com:2222",
        "ps3.example.com:2222",
        ...
    ]})
Challenges running distributed TensorFlow
● Dealing with failures is not graceful
  ○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs
Challenges running distributed TensorFlow
● Manually configuring each node in a cluster takes a long time and is error-prone
  ○ Setting up access to a shared file system (for checkpoint and summary files) requires authenticating on each node
  ○ Tweaking hyper-parameters requires re-uploading code to every node
Running distributed TensorFlow on DC/OS
● We use the dcos-commons SDK to dynamically create the ClusterSpec

{
  "service": {
    "name": "mnist",
    "job_url": "...",
    "job_context": "..."
  },
  "gpu_worker": {...},
  "worker": {...},
  "ps": {...}
}

This service definition is expanded into a ClusterSpec like the following:
tf.train.ClusterSpec({
"worker": [
"worker0.example.com:2222",
"worker1.example.com:2222",
"worker2.example.com:2222",
"worker3.example.com:2222",
"worker4.example.com:2222",
"worker5.example.com:2222",
...
],
"ps": [
"ps0.example.com:2222",
"ps1.example.com:2222",
"ps2.example.com:2222",
"ps3.example.com:2222",
...
]})
Running distributed TensorFlow on DC/OS
● Wrapper script to abstract away distributed TensorFlow configuration
  ○ Separates "deployer" responsibilities from "developer" responsibilities
[Diagram: the service definition JSON shown above and the user's code both feed into the wrapper script]
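A hedged sketch of that split (the environment variable names and the train() entry-point signature are assumptions for illustration, not the package's actual contract): the wrapper owns all cluster wiring, using the same environment-driven setup as in the previous sketch, and simply calls into a function the developer provides.

# wrapper.py -- deployer side (illustrative): owns all distributed wiring.
import importlib
import json
import os
import tensorflow as tf

def main():
    cluster = tf.train.ClusterSpec(json.loads(os.environ["TF_CLUSTER"]))  # hypothetical env vars,
    job_name = os.environ["TF_JOB_NAME"]                                  # populated at deploy time
    task_index = int(os.environ["TF_TASK_INDEX"])
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

    if job_name == "ps":
        server.join()  # parameter servers only host variables
        return

    # Developer side: an ordinary module with a train() entry point,
    # fetched from the job_url given in the service definition.
    user_module = importlib.import_module("user_model")
    user_module.train(server=server,
                      cluster=cluster,
                      task_index=task_index,
                      log_dir=os.environ.get("TF_LOG_DIR", "/tmp/tf_logs"),
                      context=json.loads(os.environ.get("TF_JOB_CONTEXT", "{}")))

if __name__ == "__main__":
    main()

With an interface like this, the developer only ever writes and tests train(); the deployer decides how many workers and parameter servers back it.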
Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects them to the cluster
Running distributed TensorFlow on DC/OS
● We use DC/OS Secrets (or alternatively environment variables) to pass credentials to every node in the cluster
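A small, hedged illustration of the environment-variable flavor of this (the variable name and the keytab mechanism are assumptions for the sketch; the concrete authentication flow depends on the cluster setup): each task reads its credentials from the environment before touching shared storage.

import base64
import os

# Hypothetical name: on DC/OS, a secret can be exposed to the task as an
# environment variable (or as a file in the task sandbox).
keytab_b64 = os.environ["HDFS_KEYTAB_BASE64"]

# Materialize the credential locally so whatever HDFS client the job uses
# can pick it up.
with open("hdfs.keytab", "wb") as f:
    f.write(base64.b64decode(keytab_b64))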
Running distributed TensorFlow on DC/OS
● We use a runtime configuration dictionary to quickly tweak hyper-parameters between different runs of the same model

{
  "service": {
    "name": "mnist",
    "job_url": "...",
    "job_context": "{...}"
  },
  "gpu_worker": {...},
  "worker": {...},
  "ps": {...}
}

$ dcos beta-tensorflow update start --name=/cifar-multiple --options=cifar-multiple.json
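As a hedged sketch of how user code might consume such a dictionary (the context keys and the way it reaches the code are assumptions for illustration): hyper-parameters live in job_context rather than in the code, so changing them only requires a config update like the one above.

import json
import os

# Hypothetical: the wrapper passes the service's job_context through to user
# code, e.g. as an environment variable or a local file.
context = json.loads(os.environ.get("TF_JOB_CONTEXT", "{}"))

# Hyper-parameters come from the config, with defaults in code.
batch_size = int(context.get("batch_size", 128))
learning_rate = float(context.get("learning_rate", 0.1))
num_gpus = int(context.get("num_gpus", 0))

print("Training with batch_size=%d, learning_rate=%g, num_gpus=%d"
      % (batch_size, learning_rate, num_gpus))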
Demo Setup Recap
● Train the Inception-V3 image classification model on the CIFAR-10 dataset
● A non-distributed job with a single worker
● A distributed job with several workers
[Diagram: each job takes the input data set and produces its own trained model]
DEMO
Next Steps
● What we have today
  ○ Single Framework
  ○ Installed via standard DC/OS package management tools
  ○ Need to manually start/stop and remove framework from cluster when completed
[Diagram: one framework instance with a Chief, Workers, PS tasks, HDFS/GCS/etc storage, and TensorBoard]
Next Steps
● Where we are going
  ○ Meta Framework
  ○ Able to install / run instances of the original single framework
  ○ Launch and monitor via `tensorflow` CLI extensions
  ○ Automatically start/stop and remove framework instances when their jobs complete
[Diagram: a Meta Framework, driven by the CLI, managing multiple instances of the single framework, each with its own Chief, Workers, PS tasks, HDFS/GCS/etc storage, and TensorBoard]
Next Steps

$ dcos tensorflow run train.py \
> --workers=3 \
> --ps=2
Running "train.py" on DC/OS with 3 workers and 2 parameter servers.
Special Thanks to All Collaborators

Sam Pringle (springle@mesosphere.com)
Primary Developer of the DC/OS TensorFlow Package

Jörg Schad, Ben Wood, Evan Lezar, Art Rand, Gabriel Hartmann, Chris Lambert
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
  ○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
  ○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
  ○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
  ○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
  ○ Slack: chat.dcos.io #tensorflow
Editor's Notes

  • #29 (DEMO slide) — Which TensorBoard tabs to click through:
    Scalars: arbitrary variables marked in the code are visualized as they change over time.
    Images: each step processes a batch of images, and more images are processed per step in the distributed model. Takeaway: over time, the bounding box for the subject in the image gets tighter and tighter around the subject alone.
    Graphs.
    Histograms.
    Both jobs are represented because I pointed them at the same bucket.