Deep Learning Pipelines
@joerg_schad @dcos
© 2018 Mesosphere, Inc. All Rights Reserved. 2
Jörg Schad
Distributed Systems Engineer
@joerg_schad
Deep Learning: The Promise
Deep Learning: The Process
Step 1: Training (in the data center, over hours/days/weeks)
Input: lots of labeled data (e.g. "Dog")
Deep neural network model
Output: trained model
Step 2: Inference (at the endpoint or in the data center, instantaneous)
Input: new input from camera or sensor
Trained model
Output: classification (97% Dog, 3% Panda)
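The inference step can be illustrated with a minimal sketch. It assumes the trained model ends in a softmax layer, which the slide does not state; the raw scores and the helper names `softmax` and `classify` are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels):
    """Return (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical raw scores a trained model might emit for one camera image.
ranked = classify([3.5, 0.0], ["Dog", "Panda"])
for label, prob in ranked:
    print(f"{label}: {prob:.0%}")
```

With these invented scores the sketch prints roughly the split shown on the slide (97% Dog, 3% Panda).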
Deep Learning: Some insight
Deep Learning: The Challenges
● Input Data
● Frameworks
● Cluster + state
● Models
● Model Serving
● Monitoring
● Users
Training Challenges
Step 1: Training (in the data center, over hours/days/weeks)
Input: lots of labeled data (e.g. "Dog")
Deep neural network model
Output: trained model
● Compute Intensive
○ (Hopefully) Large Datasets
■ Train
■ Dev
■ Test
○ Hyperparameters
■ #Layers
■ #Units per Layer
■ Learning Rate
■ …
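To see why hyperparameters make training compute intensive, here is a minimal sketch that enumerates a small grid over the three hyperparameters listed above. The search space and its values are hypothetical; each resulting configuration would be one full training run.

```python
import itertools

# Hypothetical search space; the values are illustrative only.
search_space = {
    "num_layers": [2, 4],
    "units_per_layer": [64, 128],
    "learning_rate": [0.1, 0.01, 0.001],
}

def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))

configs = list(grid(search_space))
print(len(configs), "training runs needed")  # 2 * 2 * 3 = 12
```

Even this tiny grid needs 12 runs, and the count multiplies with every hyperparameter added.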
Input Data Management
Challenges
● Training/Dev/Test + New Data
● Large amounts
● Quality
● Availability (for cluster)
● Velocity
● Streaming
Solutions
● GFS
● Apache Kafka
● Apache Cassandra
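The Training/Dev/Test challenge can be sketched as a reproducible split. This is an illustration under assumptions, not the speaker's method: the 80/10/10 ratio, the fixed seed, and the function name `split_dataset` are all hypothetical.

```python
import random

def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_fraction)
    n_test = int(len(items) * test_fraction)
    dev = items[:n_dev]
    test = items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test

# Stand-in for real labeled examples.
train, dev, test = split_dataset(range(1000))
print(len(train), len(dev), len(test))
```

A fixed seed keeps the three sets stable between runs; how newly arriving data should be assigned to them is left open here, as it is on the slide.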
Deep Learning Frameworks
What is TensorFlow?
“An open-source software library for Machine Intelligence” -
tensorflow.org
● Machine Intelligence is the broad term used to describe
techniques allowing computers to “learn” by analyzing very
large data sets using artificial neural networks
● TensorFlow is a software library that makes it easy for
developers to construct artificial neural networks to analyze
their data of interest
[Diagram: Python; TensorFlow library (dataflow executor, compute kernel implementations, networking, etc.); GPUs and CPUs]
Alternatives
Data Analytics Ecosystem
Deep Learning Frameworks
Challenges
● Different Frameworks
● No one rules them all
Solutions
● Choice
● Deployments?
● Models across Frameworks?
Users
Challenges
● Different Users/Use cases
● Data Analyst/Exploring
● Production Workloads
● Highly Optimized
● How to spawn Environments?
Solutions
Cluster Management and Deployments
Datacenter
● Typical datacenter: siloed, over-provisioned servers, low utilization
● Mesos / DC/OS: automated schedulers, workload multiplexing onto the
same machines (e.g. TensorFlow, Jenkins, Kafka, Spark)
What is DC/OS?
● DC/OS (Data Center Operating System) is an
open-source, distributed operating system
● It takes Mesos and builds upon it with
additional services and functionality
○ Built-in support for service discovery, load balancing, security, and
ease of installation
○ Extra tooling (e.g. comprehensive CLI and a GUI)
○ Built-in frameworks for launching long-running services (Marathon)
and batch jobs (Metronome)
○ A repository (app-store) for installing other common packages and
frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)
Typical Developer Workflow for TensorFlow
(Single-Node)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for single-node performance
● Train your model on the input data set on a single node → Output Trained Model
Typical Developer Workflow for TensorFlow
(Distributed)
● Download and install the Python TensorFlow library
● Design your model in terms of TensorFlow’s basic machine learning primitives
● Write your code, optimized for distributed computation
● …
Typical Developer Workflow for TensorFlow
(Distributed)
● …
● Provision a set of machines to run your computation
● Install TensorFlow on them
● Write code to map distributed computations to the exact IP address
of the machine where those computations will be performed
● Deploy your code on every machine
● Train your model on the input data set across the cluster → Output Trained Model
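The step of mapping computations to exact IP addresses can be sketched in plain Python. The addresses, ports and helper names below are hypothetical. In TensorFlow 1.x a dictionary of this shape (job name to list of host:port strings) is what `tf.train.ClusterSpec` is built from, but the sketch deliberately avoids importing TensorFlow.

```python
# Hypothetical addresses; in a real cluster these are the machines you provisioned.
CLUSTER = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}

def address_of(job_name, task_index, cluster=CLUSTER):
    """Look up the host:port that a given task is pinned to."""
    return cluster[job_name][task_index]

def all_tasks(cluster=CLUSTER):
    """List every (job, index, address) triple that needs the code deployed."""
    return [(job, i, addr)
            for job, addrs in cluster.items()
            for i, addr in enumerate(addrs)]

for job, index, addr in all_tasks():
    print(f"deploy code to {addr} and start it as {job}:{index}")
```

Because the mapping is written into the code, every machine has to receive the same copy, and any change to the cluster means editing and redeploying it.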
Challenges running distributed TensorFlow
● Dealing with failures is not graceful
○ Users need to stop training, change their hard-coded ClusterSpec, and
manually restart their jobs
● Manually configuring each node in a cluster takes a long time and is error-prone
○ Setting up access to a shared file system (for checkpoint and summary files)
requires authenticating on each node
○ Tweaking hyper-parameters requires re-uploading code to every node
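The failure-handling challenge can be made concrete with a small sketch of the manual step involved. It assumes the cluster is described by a hard-coded dictionary of job names to host:port lists; the addresses and the helper `replace_failed_task` are hypothetical.

```python
import copy

def replace_failed_task(cluster, job_name, task_index, new_address):
    """Return a new cluster dict with one task's address swapped out."""
    updated = copy.deepcopy(cluster)
    updated[job_name][task_index] = new_address
    return updated

# Hypothetical hard-coded cluster description.
old = {"ps": ["10.0.0.1:2222"], "worker": ["10.0.0.2:2222", "10.0.0.3:2222"]}

# Suppose worker 1 dies and a replacement machine comes up elsewhere.
new = replace_failed_task(old, "worker", 1, "10.0.0.7:2222")
print(new["worker"])
```

Computing the new dictionary is the easy part; the pain described above is that every process still holds the old one, so training has to be stopped and every job restarted by hand.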
Running distributed TensorFlow on DC/OS
● The dcos-commons SDK cleanly restarts failed tasks and reconnects
them to the cluster
Model Management
Recall
Step 1: Training (in the data center, over hours/days/weeks)
Input: lots of labeled data (e.g. "Dog")
Deep neural network model
Output: trained model
Step 2: Inference (at the endpoint or in the data center, instantaneous)
Input: new input from camera or sensor
Trained model
Output: classification (97% Dog, 3% Panda)
Many Models
Model Management
Challenges
● Many Models
● Different Hyperparameters
● Different Models
● New Training Data
● ...
Solutions
● Persistent Storage + Metadata
● GFS
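The "Persistent Storage + Metadata" idea can be sketched as a small record stored next to each serialized model. The field names, the placeholder model bytes and the dataset label are all hypothetical; the deck does not prescribe a schema.

```python
import hashlib
import json
import time

def model_record(model_bytes, framework, hyperparameters, training_data_id):
    """Build a metadata record to store next to the serialized model."""
    return {
        "sha256": hashlib.sha256(model_bytes).hexdigest(),  # identifies the exact artifact
        "framework": framework,
        "hyperparameters": hyperparameters,
        "training_data_id": training_data_id,
        "created_at": time.time(),
    }

record = model_record(
    model_bytes=b"fake-serialized-model",   # placeholder for real model bytes
    framework="tensorflow",
    hyperparameters={"num_layers": 4, "learning_rate": 0.01},
    training_data_id="dataset-v1",          # hypothetical dataset label
)
print(json.dumps(record, indent=2, sort_keys=True))
```

With a record like this per model, questions such as "which hyperparameters and which data produced the model we are serving?" can be answered after the fact.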
Model Serving
Challenges
● How to Deploy Models?
● Zero Downtime
● Canary
● ...
Solutions
● TensorFlow Serving
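A canary rollout can be sketched as weighted routing between two model versions. This is a generic illustration, not how TensorFlow Serving implements it; the 5% fraction and the function name `pick_version` are assumptions.

```python
import random

def pick_version(canary_fraction, rng=random):
    """Route one request: 'canary' with the given probability, else 'stable'."""
    return "canary" if rng.random() < canary_fraction else "stable"

rng = random.Random(0)  # fixed seed so the sketch is repeatable
counts = {"stable": 0, "canary": 0}
for _ in range(10_000):
    counts[pick_version(0.05, rng)] += 1
print(counts)
```

Only a small share of traffic reaches the new model, so a bad version can be caught and rolled back without downtime for most users.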
Monitoring
Challenges
● Understand {...}
● Debug
● Model Quality
○ Accuracy
○ Training Time
○ …
● Overall Architecture
○ Availability
○ Latencies
○ ...
Solutions
● TensorBoard
● Traditional Cluster Monitoring Tool
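For the latency side of monitoring, a minimal sketch of a percentile summary is shown here. The nearest-rank method is one common definition among several, and the latency samples are made up.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up inference latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 12]
print("p50:", percentile(latencies_ms, 50), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```

The median (13 ms) hides the 250 ms outlier that the p99 exposes, which is why tail latencies are usually tracked alongside averages.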
Demo Time
Related Work
● DC/OS TensorFlow
https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/
● DC/OS PyTorch
https://mesosphere.com/blog/deep-learning-pytorch-gpus/
● Ted Dunning’s Machine Learning Logistics
https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/
● KubeFlow
https://github.com/kubeflow/kubeflow
● TensorFlow (+ TensorBoard and Serving)
https://www.tensorflow.org/
Special Thanks to All Collaborators
Ben Wood
Robin Oh
Evan Lezar
Art Rand
Gabriel Hartmann
Sam Pringle
Kevin Klues
Questions and Links
● DC/OS TensorFlow Package (currently closed source)
○ https://github.com/mesosphere/dcos-tensorflow
● DC/OS TensorFlow Tools
○ https://github.com/dcos-labs/dcos-tensorflow-tools/
● Tutorial for deploying TensorFlow on DC/OS
○ https://github.com/dcos/examples/tree/master/tensorflow
● Contact:
○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos
○ Slack: chat.dcos.io #tensorflow

Webinar: Deep Learning Pipelines Beyond the Learning

  • 2. © 2018 Mesosphere, Inc. All Rights Reserved. 2 Jörg Schad Distributed Systems Engineer @joerg_schad
  • 3. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Promise 3
  • 4. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Process 4 Step 1: Training (In Data Center - Over Hours/Days/Weeks) Step 2: Inference (Endpoint or Data Center - Instantaneous) Dog Input: Lots of Labeled Data Output: Trained Model Deep neural network model Trained Model Output: Classification Trained Model New Input from Camera or Sensor 97% Dog 3% Panda
  • 5. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: Some insight 5
  • 6. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Challenges 6
  • 7. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Challenges 7 Input Data Frameworks Cluster + state Models Model Serving Monitoring Users
  • 8. © 2017 Mesosphere, Inc. All Rights Reserved. Training Challenges 8 Step 1: Training (In Data Center - Over Hours/Days/Weeks) Dog Input: Lots of Labeled Data Output: Trained Model Deep neural network model ● Compute Intensive ○ (Hopefully) Large Datasets ■ Train ■ Dev ■ Test ○ Hyperparameter ■ #Layer ■ #Units per Layer ■ Learning Rate ■ ….
  • 9. © 2018 Mesosphere, Inc. All Rights Reserved. Input Data Management 9 Input Data Frameworks Cluster + state Models Model Serving Monitoring Users
  • 10. © 2018 Mesosphere, Inc. All Rights Reserved. 10 Challenges ● Training/Dev/Test + New Data ● Large amounts ● Quality ● Availability (for cluster) ● Velocity ● Streaming Solutions GFS Input Data Management Input: Lots of Labeled Data Apache Kafka Apache Cassandra
  • 11. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning Frameworks 11 Input Data Frameworks Cluster + state Models Model Serving Monitoring Users
  • 12. © 2018 Mesosphere, Inc. All Rights Reserved. ● Machine Intelligence is the broad term used to describe techniques allowing computers to “learn” by analyzing very large data sets using artificial neural networks 12 What is Tensorflow? “An open-source software library for Machine Intelligence” - tensorflow.org
  • 13. © 2018 Mesosphere, Inc. All Rights Reserved. 13 What is Tensorflow? “An open-source software library for Machine Intelligence” - tensorflow.org ● Tensorflow is a software library that makes it easy for developers to construct artificial neural networks to analyze their data of interest TensorFlow Library Python Dataflow Executor, Compute Kernel Implementations, Networking, etc. GPUs CPUs
  • 14. © 2018 Mesosphere, Inc. All Rights Reserved. 14 Alternatives
  • 15. © 2018 Mesosphere, Inc. All Rights Reserved. 15 Data Analytics Ecosystem
  • 16. © 2018 Mesosphere, Inc. All Rights Reserved. 16 Challenges ● Different Frameworks ● No one rules them all Solutions ● Choice ● Deployments? ● Models across Frameworks? Deep Learning Frameworks
  • 17. © 2018 Mesosphere, Inc. All Rights Reserved. Deep Learning: The Challenges 17 Input Data Frameworks Cluster + state Models Model Serving Monitoring Users
  • 18. © 2018 Mesosphere, Inc. All Rights Reserved. 18 Challenges ● Different Users/Use cases ● Data Analyst/Exploring ● Production Workloads ● Highly Optimized ● How to spawn Environments? Solutions Users
  • 19. © 2018 Mesosphere, Inc. All Rights Reserved. 19 Challenges ● Different Users/Use cases ● Data Analyst/Exploring ● Production Workloads ● Highly Optimized ● How to spawn Environments? Solutions Users
  • 20. © 2018 Mesosphere, Inc. All Rights Reserved. Cluster Management and Deployments 20 Input Data Frameworks Cluster + state Models Model Serving Monitoring Users
  • 21. © 2017 Mesosphere, Inc. All Rights Reserved. 21 Datacenter Typical Datacenter siloed, over-provisioned servers, low utilization Mesos/ DC/OS automated schedulers, workload multiplexing onto the same machines Tensorflow Jenkins Kafka Spark Tensorflow
  • 22. What is DC/OS? ● DC/OS (Data Center Operating System) is an open-source, distributed operating system ● It takes Mesos and builds upon it with additional services and functionality ○ Built-in support for service discovery, load balancing, security, and ease of installation ○ Extra tooling (e.g. comprehensive CLI and a GUI) ○ Built-in frameworks for launching long-running services (Marathon) and batch jobs (Metronome) ○ A repository (app store) for installing other common packages and frameworks (e.g. Spark, Kafka, Cassandra, TensorFlow)
  • 23. Typical Developer Workflow for TensorFlow (Single-Node) ● Download and install the Python TensorFlow library ● Design your model in terms of TensorFlow’s basic machine learning primitives ● Write your code, optimized for single-node performance ● Train on your data on a single node: Input Data Set → Trained Model
  • 24. Typical Developer Workflow for TensorFlow (Distributed) ● Download and install the Python TensorFlow library ● Design your model in terms of TensorFlow’s basic machine learning primitives ● Write your code, optimized for distributed computation ● …
  • 25. Typical Developer Workflow for TensorFlow (Distributed) ● … ● Provision a set of machines to run your computation ● Install TensorFlow on them ● Write code to map distributed computations to the exact IP address of the machine where those computations will be performed ● Deploy your code on every machine ● Train on your data on the cluster: Input Data Set → Trained Model
  • 26. Challenges running distributed TensorFlow ● Dealing with failures is not graceful ○ Users need to stop training, change their hard-coded ClusterSpec, and manually restart their jobs
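The hard-coded ClusterSpec problem above can be made concrete with a minimal sketch. It assumes the standard layout in which each job name maps to a list of `host:port` addresses (the dictionary shape accepted by `tf.train.ClusterSpec`) and the `TF_CONFIG` JSON layout of `cluster` plus `task`. All addresses and helper names (`CLUSTER`, `tf_config`, `replace_node`) are hypothetical and not taken from the talk; the sketch is pure Python and does not import TensorFlow.

```python
import json

# Hypothetical, hard-coded cluster layout: every address here is illustrative.
# A dict of this shape is what one would pass to tf.train.ClusterSpec.
CLUSTER = {
    "ps": ["10.0.0.1:2222"],
    "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
}


def tf_config(cluster, task_type, task_index):
    """Build the JSON string one task would export as TF_CONFIG."""
    if not 0 <= task_index < len(cluster[task_type]):
        raise IndexError(f"no {task_type} task with index {task_index}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


def replace_node(cluster, task_type, task_index, new_address):
    """Return a copy of the cluster with one failed node's address swapped.

    This is the manual step the slide describes: after a failure, the
    address list has to be edited and every task restarted with it.
    """
    updated = {job: list(addresses) for job, addresses in cluster.items()}
    updated[task_type][task_index] = new_address
    return updated
```

Because every task embeds the full address list, replacing a single worker means regenerating and redistributing the configuration for all tasks, which is the manual restart cycle the slide complains about.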
  • 27. Challenges running distributed TensorFlow ● Manually configuring each node in a cluster takes a long time and is error-prone ○ Setting up access to a shared file system (for checkpoint and summary files) requires authenticating on each node ○ Tweaking hyper-parameters requires re-uploading code to every node
  • 29. Running distributed TensorFlow on DC/OS ● The dcos-commons SDK cleanly restarts failed tasks and reconnects them to the cluster
  • 30. Model Management (pipeline diagram: Input Data, Frameworks, Cluster + state, Models, Model Serving, Monitoring, Users)
  • 31. Recall. Step 1: Training (in data center, over hours/days/weeks): Input: Lots of Labeled Data → Deep neural network model → Output: Trained Model. Step 2: Inference (endpoint or data center, instantaneous): New Input from Camera or Sensor → Trained Model → Output: Classification (97% Dog, 3% Panda)
  • 32. Many Models. Step 1: Training (in data center, over hours/days/weeks): Input: Lots of Labeled Data → Deep neural network model → Output: Trained Model
  • 33. Model Management. Challenges: ● Many Models ● Different Hyperparameters ● Different Models ● New Training Data ● ... Solutions: ● Persistent Storage + Metadata ● GFS
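As an illustration of "Persistent Storage + Metadata", the following is a minimal sketch of a file-based model store. It is an assumption about what such a store could look like, not the approach used in the talk: the directory layout, the field names, and the helpers `register_model` and `list_versions` are all hypothetical, and a local directory stands in for the shared file system.

```python
import hashlib
import json
import time
from pathlib import Path


def register_model(store, name, weights, hyperparameters, dataset_id):
    """Persist serialized model bytes next to a JSON metadata record.

    The version id is derived from the model bytes, so registering the
    same bytes twice lands in the same directory.
    """
    version = hashlib.sha256(weights).hexdigest()[:12]
    model_dir = Path(store) / name / version
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.bin").write_bytes(weights)
    metadata = {
        "name": name,
        "version": version,
        "hyperparameters": hyperparameters,
        "dataset_id": dataset_id,
        "created_at": time.time(),
    }
    (model_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return version


def list_versions(store, name):
    """Return the metadata records of all stored versions of one model."""
    pattern = (Path(store) / name).glob("*/metadata.json")
    return [json.loads(path.read_text()) for path in pattern]
```

Keeping hyperparameters and the training-data identifier beside each stored model is what makes "many models" comparable later: a record answers which settings and which data produced a given version.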
  • 34. Deep Learning: The Challenges (pipeline diagram: Input Data, Frameworks, Cluster + state, Models, Model Serving, Monitoring, Users)
  • 35. Model Serving. Challenges: ● How to Deploy Models? ● Zero Downtime ● Canary ● ... Solutions: ● TensorFlow Serving
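The "Canary" item can be sketched as a deterministic traffic split. This assumes a router in front of the serving layer that chooses a model version per request; it is not TensorFlow Serving's own API, and the function name `pick_model_version` and its parameters are hypothetical.

```python
import hashlib


def pick_model_version(request_id, stable, canary, canary_percent):
    """Route roughly canary_percent of request ids to the canary version.

    Hashing the request id keeps the decision deterministic, so a given
    id is always served by the same version while the split is unchanged.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable
```

Raising `canary_percent` step by step while watching the canary's metrics, and setting it back to 0 on regressions, is one way to approach the zero-downtime goal listed on the slide.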
  • 36. Deep Learning: The Challenges (pipeline diagram: Input Data, Frameworks, Cluster + state, Models, Model Serving, Monitoring, Users)
  • 37. Monitoring. Challenges: ● Understand {...} ● Debug ● Model Quality ● Accuracy ● Training Time ● … ● Overall Architecture ● Availability ● Latencies ● ... Solutions: ● TensorBoard ● Traditional Cluster Monitoring Tool
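For the "Availability" and "Latencies" items, a minimal sketch of the kind of summary a traditional monitoring tool computes is given below. The input format (a list of `(latency_ms, ok)` tuples) and the helper names `percentile` and `summarize` are assumptions made for illustration; the nearest-rank method is one common percentile definition among several.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests):
    """Summarize (latency_ms, ok) tuples from a model-serving endpoint."""
    latencies = [latency for latency, _ in requests]
    successes = sum(1 for _, ok in requests if ok)
    return {
        "availability": successes / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }
```

Model-quality metrics such as accuracy and training time would typically be tracked separately, for example in TensorBoard, while a summary like this covers the serving side.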
  • 38. Demo Time
  • 39. Related Work ● DC/OS TensorFlow https://mesosphere.com/blog/tensorflow-gpu-support-deep-learning/ ● DC/OS PyTorch https://mesosphere.com/blog/deep-learning-pytorch-gpus/ ● Ted Dunning’s Machine Learning Logistics https://thenewstack.io/maprs-ted-dunning-intersection-machine-learning-containers/ ● KubeFlow https://github.com/kubeflow/kubeflow ● TensorFlow (+ TensorBoard and Serving) https://www.tensorflow.org/
  • 40. Special Thanks to All Collaborators: Ben Wood, Robin Oh, Evan Lezar, Art Rand, Gabriel Hartmann, Sam Pringle, Kevin Klues
  • 41. Questions and Links ● DC/OS TensorFlow Package (currently closed source) ○ https://github.com/mesosphere/dcos-tensorflow ● DC/OS TensorFlow Tools ○ https://github.com/dcos-labs/dcos-tensorflow-tools/ ● Tutorial for deploying TensorFlow on DC/OS ○ https://github.com/dcos/examples/tree/master/tensorflow ● Contact: ○ https://groups.google.com/a/mesosphere.io/forum/#!forum/tensorflow-dcos ○ Slack: chat.dcos.io #tensorflow

Editor's Notes

  1. It is one thing, as a developer, to build a TensorFlow model on my laptop...
  2. https://jupyterhub.readthedocs.io/en/latest/ https://github.com/vigsterkr/marathonspawner https://github.com/twosigma/beakerx
  3. https://jupyterhub.readthedocs.io/en/latest/ https://github.com/vigsterkr/marathonspawner
  4. Status quo: statically partitioned into siloed clusters, each dedicated to running an individual datacenter-scale application. Data: SQL, HDFS, Cassandra. Services: compute (Spark, MapReduce), microservices, Docker. Users: by department/team, per-user dev clusters. Environment: dev/qa/prod.
  5. https://www.tensorflow.org/
  6. https://www.tensorflow.org/