
Taming Your Deep Learning Workflow by Determined AI


Determined AI gave a talk on "Taming the Deep Learning Workflow" at Re-Work DL Summit on January 24, 2019.


  1. Taming the Deep Learning Workflow. Neil Conway, CTO, Determined AI. January 24, 2019
  2. Today’s Reality: Deep learning is very difficult. DL requires finding scarce talent and making a major investment in a high-performance GPU cluster. Even so, most organizations struggle: time-to-market for DL applications is often measured in years!
  3. Key Challenge: Better DL Infrastructure Software!
  4. Wait, what about TensorFlow? TensorFlow is great! (So are Keras, PyTorch, etc.) However, these tools are focused on solving the problems of 1 researcher, training 1 model, using ~1 GPU.
  5. Training a Single Model
     • Hyperparameter Tuning and Architecture Search
     • GPU Cluster Scheduling
     • Metrics Collection and Storage
     • Model Management
     • Training Data Management
     • Collaboration
     • Deployment
     • Operations and Monitoring
     • Data Augmentation
     • Data Prep and ETL
     • Parallel and Distributed Training
  6. What Are Your Options? • For some of these problems: no OSS solutions. • For others: narrow technical tools. It is up to you to figure out how to put them together! Result: highly trained DL researchers spend most of their time on drudgery!
  7. We’re in the Golden Age of Deep Learning, but Deep Learning infrastructure is still stuck in the Dark Ages!
  8. What Do We Need? • End-to-end system design, not narrow technical tools • Driven by a deep understanding of real-world DL workflows • New APIs, new abstractions, and new platforms!
  9. Determined AI
  10. Deep Dive: Hyperparameter Tuning
  11. Hyperparameter Tuning: Search over a space of similar models to find the “best” model configuration. Hard problem! DL-specific challenges: large, complex HP spaces are common (e.g., optimization method, batch size, LR, model architecture, etc.), and evaluating a single HP configuration can take 10-100+ GPU hours!
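The combinatorics make the problem concrete. A back-of-the-envelope sketch in Python, assuming a small hypothetical grid (the candidate values below are illustrative, not from the talk):

```python
# Hypothetical candidate values for each hyperparameter mentioned on the slide.
optimizers = ["sgd", "adam", "rmsprop"]          # optimization method: 3 choices
batch_sizes = [32, 64, 128, 256]                 # batch size: 4 choices
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]  # LR: 5 choices
layer_counts = [2, 4, 8, 16]                     # model architecture: 4 choices

# Size of the full cross-product grid.
grid_size = (len(optimizers) * len(batch_sizes)
             * len(learning_rates) * len(layer_counts))
print(grid_size)  # 240 configurations

# At the talk's estimate of 10-100+ GPU hours per configuration, even this
# small grid costs 2,400-24,000+ GPU hours to search exhaustively.
```

Even a modest space like this is out of reach for exhaustive evaluation, which motivates the smarter search strategies on the following slides.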
  12. HP Tuning Today: Intuition! Pick a few points and try them out manually. Grid search: exhaustive search over all points in a grid.
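The two baselines above (plus random search, shown on a later slide) can be sketched as follows. This is a simplified sketch: `train_and_validate` is a hypothetical stand-in for a full training run, which in practice is the 10-100+ GPU-hour step.

```python
import random

def train_and_validate(config):
    """Hypothetical stand-in for a full training run; returns a
    validation metric. A real evaluation would train a model for hours."""
    return random.random()

def grid_search(grid):
    """Exhaustively evaluate every point in the cross-product grid."""
    configs = [{"lr": lr, "batch_size": bs}
               for lr in grid["lr"]
               for bs in grid["batch_size"]]
    return max(configs, key=train_and_validate)

def random_search(space, budget):
    """Sample a fixed budget of points from the space at random."""
    configs = [{"lr": random.choice(space["lr"]),
                "batch_size": random.choice(space["batch_size"])}
               for _ in range(budget)]
    return max(configs, key=train_and_validate)

space = {"lr": [1e-4, 1e-3, 1e-2], "batch_size": [32, 64, 128]}
best_grid = grid_search(space)        # evaluates all 9 configurations
best_rand = random_search(space, 4)   # evaluates only 4 sampled points
```

Note that neither method uses information from earlier evaluations to decide what to try next; both spend a full training run on every candidate.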
  13. Step 1: Smarter Searching • Lots of academic research on HP tuning algorithms • Recent work: Hyperband [ICLR 2017] • Intuition: spend more compute time on “promising” configurations, give up on “bad” configurations quickly • 5-50x faster than prior methods!
  14. Example: Random Search
  15. Example: Hyperband
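The core mechanism behind the Hyperband example, successive halving, can be sketched as below. This is a simplified sketch: `train_briefly` and its scoring model are hypothetical stand-ins for partial training, and full Hyperband additionally runs several such brackets with different starting budgets.

```python
import random

def train_briefly(config, epochs):
    """Hypothetical stand-in for partial training: the validation score
    improves with more epochs, plus noise. A real system would resume
    training from a checkpoint rather than recompute from scratch."""
    return config["quality"] * (1 - 0.5 ** epochs) + random.gauss(0, 0.01)

def successive_halving(configs, min_epochs=1, eta=3):
    """Train all configs briefly, keep the top 1/eta, give the survivors
    eta times more budget, and repeat until one configuration remains."""
    budget = min_epochs
    while len(configs) > 1:
        scored = [(train_briefly(cfg, budget), cfg) for cfg in configs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        configs = [cfg for _, cfg in scored[: max(1, len(configs) // eta)]]
        budget *= eta  # survivors earn more compute
    return configs[0]
```

This captures the slide's intuition directly: "bad" configurations are abandoned after a small budget, so most GPU hours go to the "promising" ones.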
  16. Step 2: Scheduler Integration • What if the job scheduler were deeply integrated with the HP search algorithm? • Smarter scheduling • Intelligent fault tolerance and task migration • More efficient caching • Aside: much more efficient than distributed training of a single model!
  17. Step 3: Metadata Storage • A single HP search might involve thousands of tasks on hundreds of machines, and run for days or weeks • Result: lots of crucial metadata! • Training and validation metrics • Hyperparameter settings • Library versions, random seeds, logs, etc. • Where does this data live? How can your teammates make use of it? • What happens when you want to replace the production model 9 months later?
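A minimal sketch of persisting the metadata listed above, assuming a simple append-only JSON-lines store. The function name and field names are illustrative, not Determined AI's schema; a real system would use a database shared across the team.

```python
import json
import time
import uuid

def record_trial(path, hyperparameters, metrics, seed, library_versions):
    """Append one trial's metadata to a JSON-lines file and return its id."""
    entry = {
        "trial_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "hyperparameters": hyperparameters,    # e.g. {"lr": 1e-3, "batch_size": 64}
        "metrics": metrics,                    # training and validation metrics
        "seed": seed,                          # random seed, for reproducibility
        "library_versions": library_versions,  # e.g. {"tensorflow": "1.12.0"}
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["trial_id"]
```

Capturing seeds and library versions alongside metrics is what makes it possible to answer the slide's last question: reproducing or replacing a production model months later.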
  18. Takeaways 1. Progress on deep learning is held back by the current state of DL infrastructure. 2. End-to-end system design can yield massive performance and usability wins. 3. What are the key high-level DL workflows we need infra to support? What are the right APIs and abstractions for doing so?