Taming the Deep Learning Workflow
Neil Conway
CTO, Determined AI
January 24, 2019
Today’s Reality
Deep Learning is very difficult.
DL requires finding scarce talent and making a major
investment in a high-performance GPU cluster.
Even so, most organizations struggle. Time-to-market
for DL applications is often measured in years!
Key Challenge
Better DL Infrastructure Software!
Wait, what about TensorFlow?
TensorFlow is great! (So is Keras, PyTorch, etc.)
However, these tools are focused on
solving the problems of 1 researcher,
training 1 model, using ~1 GPU.
[Diagram] Training A Single Model, shown together with:
• Hyperparameter Tuning, Architecture Search
• GPU Cluster Scheduling
• Metrics Collection and Storage
• Model Management
• Training Data Management
• Collaboration
• Deployment
• Operations and Monitoring
• Data Augmentation
• Data Prep and ETL
• Parallel and Distributed Training
What Are Your Options?
• For some of these problems: no OSS solutions.
• For others: narrow technical tools. Up to you to
figure out how to put them together!

Result: highly trained DL researchers spend most of
their time on drudgery!
We’re in the Golden Age of Deep Learning,
but Deep Learning infrastructure is still stuck
in the Dark Ages!
[Figure labels: Deep Learning; Deep Learning Infrastructure ☹]
What Do We Need?
• End-to-end system design, not narrow technical tools
• Driven by a deep understanding of real-world DL
workflows
• New APIs, new abstractions, and new platforms!
Determined AI
Deep Dive: Hyperparameter Tuning
Hyperparameter Tuning
Search over a space of similar models to find the
“best” model configuration.
• Large, complex HP spaces are common
(e.g., optimization method, batch size,
LR, model architecture, etc.)
• Evaluating a single HP configuration can take
10-100+ GPU hours!
• DL-specific challenges
= Hard Problem!
HP Tuning Today
• Intuition! Pick a few points and try them out manually.
• Grid Search: Exhaustive search over all points in grid.
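To make the cost of exhaustive grid search concrete, here is a minimal sketch that enumerates a small grid and estimates its GPU cost. The search space, its parameter names, and its values are hypothetical and chosen only for illustration; the 10 GPU hours per configuration is the low end of the 10-100+ range quoted earlier in the talk.

```python
import itertools

# Hypothetical search space; names and values are illustrative only.
SPACE = {
    "optimizer": ["sgd", "adam", "rmsprop"],
    "batch_size": [32, 64, 128, 256],
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "num_layers": [2, 4, 8],
}


def grid_points(space):
    """Yield every configuration in the grid as a dict."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def grid_cost_gpu_hours(space, hours_per_config):
    """GPU hours needed to evaluate every grid point exactly once."""
    n_points = 1
    for values in space.values():
        n_points *= len(values)
    return n_points * hours_per_config


configs = list(grid_points(SPACE))
print(len(configs))                    # 3 * 4 * 4 * 3 = 144 configurations
print(grid_cost_gpu_hours(SPACE, 10))  # 1440 GPU hours at 10 hours each
```

Adding one more hyperparameter with four candidate values would multiply both numbers by four, which is why grids stop being practical as the space grows.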
Step 1: Smarter Searching
• Lots of academic research on HP tuning algorithms
• Recent work: Hyperband [ICLR 2017]
• Intuition: spend more compute time on “promising”
configurations, give up on “bad” configurations
quickly
• 5-50x faster than prior methods!
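The intuition above can be sketched with successive halving, the early-stopping subroutine that Hyperband builds on. This is not the full Hyperband algorithm (which runs several such brackets with different trade-offs) and not Determined AI's implementation. The toy loss function, the `lr` search range, and the budget units are assumptions made purely for illustration.

```python
import math
import random


def toy_validation_loss(config, budget):
    """Hypothetical stand-in for training `config` for `budget` units.

    The loss shrinks as the budget grows and is lowest near lr = 1e-2.
    """
    distance = abs(math.log10(config["lr"]) + 2.0)
    return distance + 1.0 / budget


def successive_halving(configs, min_budget=1, eta=3, evaluate=toy_validation_loss):
    """Evaluate all configs briefly, keep the best 1/eta, repeat with eta times the budget."""
    survivors = list(configs)
    budget = min_budget
    total_budget = 0
    while len(survivors) > 1:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))
        total_budget += budget * len(survivors)
        keep = max(1, len(survivors) // eta)
        survivors = scored[:keep]   # give up on the "bad" configurations
        budget *= eta               # spend more on the "promising" ones
    return survivors[0], total_budget


rng = random.Random(0)
candidates = [{"lr": 10 ** rng.uniform(-5, 0)} for _ in range(27)]
best, spent = successive_halving(candidates)
print(spent)  # 27*1 + 9*3 + 3*9 = 81 budget units
```

For comparison, training all 27 candidates at the largest per-configuration budget used here (9 units) would cost 243 units. The sketch charges each round in full, so a real system that resumes surviving trials from checkpoints would spend less.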
Example: Random Search
Example: Hyperband
Step 2: Scheduler Integration
• What if the job scheduler were deeply integrated with
the HP search algorithm?
• Smarter scheduling
• Intelligent fault tolerance and task migration
• More efficient caching
• Aside: Much more efficient than distributed training of a
single model!
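One way to picture the fault-tolerance and migration point is a scheduler loop that checkpoints each trial and re-queues it after a failure, so the trial resumes instead of restarting. The following is a toy, single-process sketch under that assumption; the function name, the `fail_at` failure-injection argument, and the per-step checkpointing policy are all hypothetical and do not describe any particular scheduler.

```python
from collections import deque


def run_trials(trials, steps_per_trial, fail_at):
    """Toy scheduler loop for independent HP trials.

    `fail_at` maps a trial name to the step at which one simulated machine
    failure occurs. A failed trial is re-queued and later resumes from its
    last checkpoint. Returns (total steps executed, trial names in finish order).
    """
    checkpoints = {name: 0 for name in trials}  # last completed step per trial
    pending_failures = dict(fail_at)
    queue = deque(trials)
    executed = 0
    finished = []
    while queue:
        name = queue.popleft()
        step = checkpoints[name]                # resume from the checkpoint
        while step < steps_per_trial:
            if pending_failures.get(name) == step:
                del pending_failures[name]      # each failure happens only once
                queue.append(name)              # "migrate": run again later
                break
            step += 1
            executed += 1
            checkpoints[name] = step            # checkpoint after every step
        else:
            finished.append(name)               # loop ended without a failure
    return executed, finished


executed, finished = run_trials(["a", "b", "c"], steps_per_trial=5, fail_at={"b": 3})
print(executed)  # 15, versus 18 if trial "b" had restarted from scratch
print(finished)  # ['a', 'c', 'b']
```

Because trials in an HP search do not communicate with each other, re-queuing one of them does not stall the others, which is part of why this workload is easier to schedule than distributed training of a single model.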
Step 3: Metadata Storage
• A single HP search might involve thousands
of tasks on hundreds of machines, and run
for days or weeks
• Result: lots of crucial metadata!
• Training and validation metrics
• Hyperparameter settings
• Library versions, random seeds, logs, etc.
• Where does this data live? How can your
teammates make use of it?
• What happens when you want to replace the
production model 9 months later?
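As a rough illustration of the metadata listed above, the sketch below keeps one record per trial and serializes it to a JSON line that teammates or later tooling could read back. The `TrialRecord` class and its field names are hypothetical, not an existing schema; a real system would likely store these records in a shared database rather than in memory.

```python
import json
import platform
from dataclasses import asdict, dataclass, field


@dataclass
class TrialRecord:
    """Hypothetical per-trial metadata record."""
    trial_id: int
    hyperparameters: dict
    random_seed: int
    library_versions: dict
    metrics: list = field(default_factory=list)  # one dict per logged step

    def log_metrics(self, step, train_loss, val_loss):
        """Append training and validation metrics for one step."""
        self.metrics.append(
            {"step": step, "train_loss": train_loss, "val_loss": val_loss}
        )

    def to_json(self):
        """Serialize the record as one JSON line."""
        return json.dumps(asdict(self), sort_keys=True)


record = TrialRecord(
    trial_id=1,
    hyperparameters={"lr": 0.01, "batch_size": 64},
    random_seed=42,
    library_versions={"python": platform.python_version()},
)
record.log_metrics(step=100, train_loss=0.9, val_loss=1.1)

line = record.to_json()                     # what would be written to storage
restored = TrialRecord(**json.loads(line))  # what a teammate would read back
```

Keeping the seed and library versions next to the metrics is what makes it possible to reproduce or replace a model months later.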
Takeaways
1. Progress on deep learning is held back by the current state of DL
infrastructure.
2. End-to-end system design can yield massive performance and
usability wins.
3. What are the key high-level DL workflows we need infra to support?
What are the right APIs and abstractions for doing so?
https://determined.ai
