Taming the Deep Learning Workflow
CTO, Determined AI
January 24, 2019
Deep Learning is very difficult.
DL requires finding scarce talent and making a major investment in a high-performance GPU cluster.
Even so, most organizations struggle: time-to-market for DL applications is often measured in years!
Wait, what about TensorFlow?
TensorFlow is great! (So is Keras, etc.)
However, these tools are focused on solving the problems of 1 researcher, training 1 model, using ~1 GPU.
Training A Single Model
What Are Your Options?
• For some of these problems: no OSS solutions.
• For others: narrow technical tools. Up to you to figure out how to put them together!
Result: highly trained DL researchers spend most of their time on drudgery!
We’re in the Golden Age of Deep Learning, but Deep Learning infrastructure is still stuck in the Dark Ages!
[Figure: “Deep Learning” vs. “Deep Learning Infrastructure” ☹]
What Do We Need?
• End-to-end system design, not narrow technical tools
• Driven by a deep understanding of real-world DL
• New APIs, new abstractions, and new platforms!
Hyperparameter (HP) tuning: search over a space of similar models to find the “best” model configuration = Hard Problem!
Large, complex HP spaces are common (e.g., optimization method, batch size, LR, model architecture, etc.)
Evaluating a single HP configuration can take 10-100+ GPU hours!
HP Tuning Today
• Intuition: pick a few points and try them
• Grid search: exhaustive search over all points (see the sketch below)
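To make the cost concrete, here is a minimal grid-search sketch; the hyperparameter names and values are purely illustrative. It simply enumerates every combination in a small grid, and at 10-100+ GPU hours per configuration even this toy grid becomes very expensive.

```python
from itertools import product

# Hypothetical HP grid -- the names and values are purely illustrative.
grid = {
    "optimizer": ["sgd", "adam"],
    "batch_size": [32, 64, 128, 256],
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "num_layers": [2, 4, 8],
}

# Grid search: enumerate every combination in the grid.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))   # 2 * 4 * 4 * 3 = 96 configurations
# At 10-100+ GPU hours per configuration, even this toy grid costs
# roughly 1,000-10,000 GPU hours to evaluate exhaustively.
```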
Step 1: Smarter Searching
• Lots of academic research on HP tuning algorithms
• Recent work: Hyperband [ICLR 2017]
• Intuition: spend more compute time on “promising” configurations, give up on “bad” configurations early (see the sketch below)
• 5-50x faster than prior methods!
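A minimal sketch of the successive-halving idea at the heart of Hyperband, assuming a placeholder train_and_evaluate(config, budget) function (not any particular library's API) that trains a configuration for a given budget and returns a validation score:

```python
import random

def train_and_evaluate(config, budget):
    """Placeholder: pretend to train `config` for `budget` units and return a validation score.
    A real implementation would launch a (partial) training job on a GPU."""
    return random.random() * config["learning_rate"]

def successive_halving(configs, min_budget=1, eta=3):
    """Spend more compute on "promising" configurations; drop "bad" ones early."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(train_and_evaluate(c, budget), c) for c in survivors]
        scores.sort(key=lambda pair: pair[0], reverse=True)    # higher score = better
        survivors = [c for _, c in scores[: max(1, len(survivors) // eta)]]
        budget *= eta                                          # survivors get a bigger budget
    return survivors[0]

# Example: 27 randomly sampled configurations over a hypothetical learning-rate range.
candidates = [{"learning_rate": 10 ** random.uniform(-4, -1)} for _ in range(27)]
print(successive_halving(candidates))
```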
Step 2: Scheduler Integration
• What if the job scheduler were deeply integrated with the HP search algorithm?
• Smarter scheduling
• Intelligent fault tolerance and task migration
• More efficient caching
• Aside: much more efficient than distributed training of a single model
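As a rough sketch of why that integration helps (the class and function names here are hypothetical, not an actual scheduler API): if every trial checkpoints the state it needs to resume, the scheduler can pause, migrate, or restart trials as the search algorithm re-prioritizes them, instead of treating each trial as an opaque batch job.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trial:
    """One HP configuration plus enough state to pause, migrate, or resume it."""
    config: dict
    step: int = 0
    metric: float = 0.0
    checkpoint: dict = field(default_factory=dict)   # stand-in for serialized model state

def train_one_unit(trial):
    """Placeholder for one unit of training that resumes from the trial's checkpoint."""
    trial.step = trial.checkpoint.get("step", 0) + 1
    trial.metric = random.random() * trial.config["lr"]   # fake validation metric
    trial.checkpoint = {"step": trial.step}               # enough state to resume elsewhere
    return trial

def scheduler_loop(trials, extra_units=20):
    """Toy search-aware scheduler: because every trial checkpoints after each unit of
    work, the scheduler is free to pause it, migrate it after a machine failure, or
    drop it when the search algorithm loses interest. Here it simply keeps feeding
    compute to whichever trial currently looks most promising."""
    for t in trials:                                   # give every trial one unit first
        train_one_unit(t)
    for _ in range(extra_units):
        train_one_unit(max(trials, key=lambda t: t.metric))
    return max(trials, key=lambda t: t.metric)

trials = [Trial({"lr": 10 ** random.uniform(-4, -1)}) for _ in range(8)]
print(scheduler_loop(trials).config)
```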
Step 3: Metadata Storage
• A single HP search might involve thousands of tasks on hundreds of machines, and run for days or weeks
• Result: lots of crucial metadata!
• Training and validation metrics
• Hyperparameter settings
• Library versions, random seeds, logs, etc.
• Where does this data live? How can your teammates make use of it?
• What happens when you want to replace the production model 9 months later?
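A sketch of the kind of per-trial record such a system might store; the field names and storage paths are illustrative, not a specific system's schema. Persisting something like this alongside every checkpoint is what lets a teammate, or you nine months later, reconstruct how a model was produced.

```python
import json, platform, random, sys, time

def trial_metadata(hparams, metrics, seed):
    """Illustrative per-trial record: enough to understand or reproduce the trial later."""
    return {
        "timestamp": time.time(),
        "hyperparameters": hparams,            # e.g. optimizer, batch size, LR
        "metrics": metrics,                    # training and validation metrics
        "random_seed": seed,
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            # in practice also: framework versions, CUDA/driver versions, git commit, ...
        },
        # illustrative artifact locations, not a real storage layout
        "artifacts": {"checkpoint": "s3://bucket/trial-123/ckpt",
                      "logs": "s3://bucket/trial-123/logs"},
    }

# Example record for one made-up trial.
record = trial_metadata(
    hparams={"optimizer": "adam", "batch_size": 64, "learning_rate": 1e-3},
    metrics={"train_loss": 0.42, "val_accuracy": 0.91},
    seed=random.randrange(2**31),
)
print(json.dumps(record, indent=2))
```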
1. Progress on deep learning is held back by the current state of DL infrastructure
2. End-to-end system design can yield massive performance and productivity gains
3. What are the key high-level DL workflows we need infra to support? What are the right APIs and abstractions for doing so?