Tuning the Untunable
Techniques for Deep Learning Optimization
Patrick Hayes, CTO
November 2018
Empower experts everywhere to amplify and accelerate their modeling impact
Our model management philosophy
● Experts Focus on Data Science: tasks that benefit from domain expertise (e.g., metric-function selection)
● Software Automates Repeatable Tasks: tasks that do not benefit from domain expertise (e.g., training orchestration, model tuning)
● DevOps Builds and Maintains Proprietary Infrastructure: tasks that depend on your particular infrastructure (e.g., model lifecycle management, model deployment)
Model tuning
Model tuning, also called hyperparameter optimization or hyperparameter search, is one part of training & tuning; deep learning architecture search is a closely related problem.
Common methods: grid search, random search, Bayesian optimization, evolutionary algorithms.
How we optimize models
● We never access your data or models
● Iterative, automated optimization
● Built specifically for scalable enterprise use cases
1. Install the client library
2. Create the experiment
3. Parameterize your model
3. Parameterize your model (continued…)
4. Run the optimization loop
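Taken together, steps 1 through 4 might look like the minimal sketch below, written against the SigOpt Python client; the API token, parameter choices, metric name, and the train_and_evaluate stub are illustrative placeholders rather than content from the original slides.

# Step 1: install the client library (pip install sigopt)
import sigopt

conn = sigopt.Connection(client_token="YOUR_API_TOKEN")  # placeholder token

def train_and_evaluate(assignments):
    # Placeholder for your model code: train with the suggested hyperparameters
    # (assignments["learning_rate"], assignments["batch_size"]) and return the
    # validation metric to be maximized.
    raise NotImplementedError

# Steps 2 and 3: create the experiment and parameterize your model.
experiment = conn.experiments().create(
    name="CNN tuning (sketch)",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-5, max=1e-1)),
        dict(name="batch_size", type="int", bounds=dict(min=32, max=256)),
    ],
    metrics=[dict(name="accuracy")],
    observation_budget=60,
)

# Step 4: run the optimization loop.
while experiment.progress.observation_count < experiment.observation_budget:
    suggestion = conn.experiments(experiment.id).suggestions().create()
    accuracy = train_and_evaluate(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=accuracy,
    )
    experiment = conn.experiments(experiment.id).fetch()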
Easily track, manage and reproduce experiments
● Uncover model insights with parameter importance
● Monitor performance improvement as the experiment progresses via API, the web, or your mobile phone
● Cycle through analysis, suggestions, history, and other experiment insights
Benefits: Better, cheaper, faster model development
● 90% Cost Savings: maximize utilization of compute (https://aws.amazon.com/blogs/machine-learning/fast-cnn-tuning-with-aws-gpu-instances-and-sigopt/)
● 10x Faster Time to Tune: less expert time per model (https://devblogs.nvidia.com/sigopt-deep-learning-hyperparameter-optimization/)
● Better Performance: no free lunch, but optimize any model (https://arxiv.org/pdf/1603.09441.pdf)
Overview of features behind SigOpt

Optimization Engine
● Multimetric optimization
● Continuous, categorical, or integer parameters
● Constraints and failure regions
● Up to 10k observations, 100 parameters
● Multitask optimization and high parallelism
● Conditional parameters

Experiment Insights
● Reproducibility
● Intuitive web dashboards
● Cross-team permissions and collaboration
● Advanced experiment visualizations
● Organizational experiment analysis
● Parameter importance analysis

Enterprise Platform
● Infrastructure agnostic
● REST API
● Model agnostic
● Black-box interface (doesn't touch data)
● Libraries for Python, Java, R, and MATLAB
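To make the parameter types above concrete, here is a hypothetical experiment definition using the Python client; the parameter names, bounds, and categories are invented for illustration.

import sigopt

conn = sigopt.Connection(client_token="YOUR_API_TOKEN")  # placeholder token

# Hypothetical experiment mixing continuous, integer, and categorical parameters.
experiment = conn.experiments().create(
    name="Parameter types (sketch)",
    parameters=[
        dict(name="learning_rate", type="double",
             bounds=dict(min=1e-6, max=1e-1)),                 # continuous
        dict(name="num_layers", type="int",
             bounds=dict(min=2, max=8)),                       # integer
        dict(name="optimizer", type="categorical",
             categorical_values=["adam", "sgd", "rmsprop"]),   # categorical
    ],
    observation_budget=100,
)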
Applied AI introduces unique challenges
● Failed observations
● Constraints
● Uncertainty
● Competing objectives
● Lengthy training cycles
● Cluster orchestration
sigopt.com/blog
How do you more efficiently tune models
that take days (or weeks) to train?
Source: AI & Compute, OpenAI Blog, May 2018 (chart spanning speech recognition, computer vision, and deep reinforcement learning)
Training ResNet-50 on ImageNet takes 12 hours
Tuning 12 parameters requires at least 120 distinct models (on the order of 10 observations per parameter)
That equals 1,440 hours, or 60 days, of training time
Tuning & training inefficiency
Training cluster management
Multitask Optimization
Start with a simple idea: we can use information about “partially trained” models to more efficiently inform hyperparameter tuning.
Building on prior research related to successive halving and Bayesian techniques, multitask optimization samples lower-cost tasks to inexpensively learn about the model and accelerate full Bayesian optimization.
(Swersky, Snoek, and Adams, “Multi-Task Bayesian Optimization”, http://papers.nips.cc/paper/5086-multi-task-bayesian-optimization.pdf)
“Cheap approximations promise a route to tractability, but bias and noise complicate their use. An unknown bias arises whenever a computational model incompletely models a real-world phenomenon, and is pervasive in applications.”
(Poloczek, Wang, and Frazier, “Multi-Information Source Optimization”, https://papers.nips.cc/paper/7016-multi-information-source-optimization.pdf)
Visualizing multitask: learning from approximation
(Figure from Klein et al., https://arxiv.org/pdf/1605.07079.pdf, comparing partial and full training)
Visualizing multitask: Power of correlated functions
Source: Swersky, Snoek, & Adams, http://papers.nips.cc/paper/5086-multi-task-bayesian-optimization
Alternative approaches to lengthy training cycles
Early Termination
(e.g., Hyperband)
Multitask Optimization
Case: Putting multitask optimization to the test
Source: Klein et al., https://arxiv.org/pdf/1605.07079.pdf
Goal: Benchmark the performance of Multitask and Early Termination methods
Model: SVM
Dataset: Covertype, Vehicle, MNIST
Methods:
● Multitask Enhanced (Fabolas)
● Multitask Basic (MTBO)
● Early Termination (Hyperband)
● Baseline 1 (Expected Improvement)
● Baseline 2 (Entropy Search)
Result: Multitask outperforms other methods
(Results figure from Klein et al., https://arxiv.org/pdf/1605.07079.pdf)
Multitask Optimization in Practice
Making multitask optimization accessible for anyone
Allow users to flexibly define low-cost tasks, as sketched below
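A rough sketch of how a user might declare low-cost tasks with the Python client follows; the task names, relative costs, and the mapping from cost to training epochs are assumptions made for illustration, not details from the slides.

import sigopt

conn = sigopt.Connection(client_token="YOUR_API_TOKEN")  # placeholder token

# Hypothetical multitask experiment: each task is a cheaper approximation of the
# full training run, identified by its cost relative to the full task (cost 1.0).
experiment = conn.experiments().create(
    name="Multitask CNN tuning (sketch)",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-5, max=1e-1)),
        dict(name="batch_size", type="int", bounds=dict(min=32, max=256)),
    ],
    tasks=[
        dict(name="cheapest", cost=0.1),  # e.g., a small slice of the epochs or data
        dict(name="medium", cost=0.3),
        dict(name="full", cost=1.0),      # the full training run
    ],
    observation_budget=120,
)

# Each suggestion specifies which task to run; scale the training budget accordingly
# (the 90-epoch full budget below is an assumption for the example).
suggestion = conn.experiments(experiment.id).suggestions().create()
epochs = max(1, int(90 * suggestion.task.cost))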
Multitask experiment insights
Case: Putting multitask optimization to the test
Goal: Benchmark the performance of Multitask and Early Termination methods across a broad variety of tasks and strategies to get a more complete sense of performance
Model: CNN
Dataset: CIFAR-10
Methods:
● Multitask Optimization
● Early Termination (Hyperband)
● Random Search
Multitask shows best performance
Benchmark: Which optimization technique most efficiently tunes 10 hyperparameters under compute constraints?
Tuning & training inefficiency
Training cluster management
Complexity of deep learning DevOps
Basic case: training one model, no optimization
Advanced case: multiple users running concurrent optimization experiments, with concurrent model configuration evaluations and multiple GPUs per model
Problems: infrastructure, scheduling, dependencies, code, logging
Solution: SigOpt Orchestrate is a CLI for managing training infrastructure and running optimization experiments
How it works: Command-line orchestration with SigOpt Orchestrate
1. Spin up and share training clusters ($ sigopt create cluster)
2. Schedule optimization experiments for your containerized model ($ sigopt run -f orchestrate.yml)
3. Integrate with the optimization API
4. Monitor experiment and infrastructure
Demo
Seamless integration into your model code
Easily define optimization experiments
Easily kick off optimization experiment jobs
Check the status of active and completed experiments
View experiment logs across multiple workers
Track metadata and monitor your results
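In terms of commands, the demo comes down to the two CLI calls already shown on the "How it works" slide; the comments below simply map them onto the demo steps, while checking status, viewing logs across workers, and tracking metadata happen through additional Orchestrate commands and the SigOpt dashboard (not reproduced here).

# Spin up (and share) a training cluster for the team
$ sigopt create cluster

# Kick off an optimization experiment job described by orchestrate.yml,
# the experiment definition for the containerized model code
$ sigopt run -f orchestrate.yml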
Automated cluster management
Efficient training and tuning
Training ResNet-50 on ImageNet takes 4 hours (down from 12)
Tuning 12 parameters requires at least 120 distinct models
That equals 480 hours (down from 1,440), or 20 days (down from 60), of training time
While training on 10 machines, wall-clock time is 2 days
● Failure regions
● Constraints
● Uncertainty
● Competing objectives
● Lengthy training cycles
● Cluster orchestration
sigopt.com/blog
Thank you
Try SigOpt Orchestrate: https://sigopt.com/orchestrate
Free access for Academics & Nonprofits: https://sigopt.com/edu
Solution-oriented program for the Enterprise: https://sigopt.com/pricing
Leading applied optimization research: https://sigopt.com/research
… and we're hiring! https://sigopt.com/careers
