Advanced HPO for Deep Learning
Maneesh Bhide, Databricks
Review: HPO Approaches
Grid search:
• PRO: can be run in one time-step (all configurations evaluated in parallel)
• CON: naive, computationally expensive, suffers from the curse of dimensionality, may alias over global optima
Random search:
• PRO: suffers less from curse of dimensionality, can be run in one time-step
• CON: naive, no certainty about results, still computationally expensive
Population based:
• PRO: implicit predictions, can be run in several time-steps, good at resolving many optima
• CON: computationally expensive, may converge to local optima
Bayesian:
• PRO: explicit predictions, computationally efficient
• CON: requires sequential observations
Review: Best Practices
• Tune entire pipeline, not individual models
• How you phrase parameters matters!
– Are categoricals really categorical?
• [2, 4, 8, 16, 32] → search an integer in {1, 5, step 1} and use 2^param
– Use transformations to your advantage (see the sketch after this list)
• For learning_rate, instead of (0, 1) → search {-10, 0} and use 10^param
• Don’t restrict to traditional hyperparameters
– SGD Flavor
– Architecture
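
To make the phrasing advice concrete, a minimal sketch using HyperOpt (one of the libraries discussed later); the parameter names and exact ranges are illustrative, not from the talk.

from hyperopt import hp

# Instead of a categorical over [2, 4, 8, 16, 32], search an integer exponent
# in {1..5} and reconstruct 2**exponent, so the optimizer sees an ordered axis.
# Instead of learning_rate in (0, 1), search the base-10 exponent in [-10, 0].
space = {
    "batch_exponent": hp.quniform("batch_exponent", 1, 5, 1),
    "lr_exponent": hp.uniform("lr_exponent", -10, 0),
}

def decode(params):
    # Map the searched exponents back to the values the model actually uses.
    return {
        "batch_size": 2 ** int(params["batch_exponent"]),
        "learning_rate": 10 ** params["lr_exponent"],
    }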
HPO for Neural Networks
• Can benefit from the compute efficiency of Bayesian Optimization, as the parameter space can explode
– Challenge of sequential training and long training time
• Optimize more than just hyperparameters
– Challenge of parameters depending on other parameters
• Production models often have multiple criteria
– Challenge of trading off between objectives
Agenda
• Challenge of sequential training and long training time
– Early Termination
• Challenge of parameters depending on other parameters
– Awkward/Conditional Spaces
• Challenge of trading off between objectives
– Multimetric Optimization
How Early Termination works
From the HyperBand Paper…
1. Select an initial candidate configuration set
2. Train configurations for X_n epochs
3. Evaluate performance (preferably, the objective metric)
4. Use SuccessiveHalving (eliminate the worse half), run the remaining configurations for an additional X_n epochs
5. X_{n+1} = 2 * X_n
6. Goto step 2
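
A minimal sketch of that loop (SuccessiveHalving only, not the full HyperBand bracket schedule), assuming a hypothetical train_and_score(config, epochs) callback that resumes training to a total epoch budget and returns the objective metric (higher is better):

def successive_halving(configs, train_and_score, initial_epochs=1):
    survivors = list(configs)
    epochs = initial_epochs                                   # X_n
    while len(survivors) > 1:
        # Train every surviving configuration up to the current epoch budget.
        scored = [(cfg, train_and_score(cfg, epochs)) for cfg in survivors]
        # SuccessiveHalving: keep the better-performing half.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        survivors = [cfg for cfg, _ in scored[: len(survivors) // 2]]
        epochs *= 2                                           # X_{n+1} = 2 * X_n
    return survivors[0]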
Credit: https://www.automl.org/blog_bohb/
Assumptions
• Well-behaved learning curves
• Model performance: don’t need the best model, need a good model faster
Credit: https://www.automl.org/blog_bohb/
Scenario Walkthrough
ResNet-50 on ImageNet
9 Hyperparameters for HPO
128 configurations
1 p2.xlarge ($.90/hour)
12 hours training time
Standard Training
12 hours × 128 configurations
Total Compute Time: 1,536 hours
Total Cost: $1,382.40
With HyperBand
% trained        | 0.78% | 1.56% | 3.12% | 6.25% | 12.5% | 25%  | 50%  | 100%
Cumulative hours | 0.09  | 0.19  | 0.37  | 0.75  | 1.5   | 3    | 6    | 12
Configs trained  | 128   | 64    | 32    | 16    | 8     | 4    | 2    | 1
Terminated (ET)  | 64    | 32    | 16    | 8     | 4     | 2    | 1    | --
Hours consumed   | 5.76  | 6.08  | 5.92  | 6     | 6     | 6    | 6    | 12
(Hours consumed = configurations stopping at that checkpoint × that checkpoint's cumulative hours; the final column is the one surviving configuration's full 12 hours.)
Total Compute Time: 53.76 hours
Total Cost: $48.38
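
A quick sanity check of the table's accounting (the schedule and the $0.90/hour price are taken from the scenario above):

# Each configuration terminated at a checkpoint has consumed that checkpoint's
# cumulative hours; the single surviving configuration trains the full 12 hours.
hours      = [0.09, 0.19, 0.37, 0.75, 1.5, 3, 6]   # cumulative hours per checkpoint
terminated = [64,   32,   16,   8,    4,   2, 1]   # configurations dropped there
total_hours = sum(t * h for t, h in zip(terminated, hours)) + 1 * 12
print(round(total_hours, 2))         # 53.76 hours of compute
print(round(total_hours * 0.90, 2))  # $48.38 at $0.90/hour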
Scenario Summary
w/o Early Termination: 1536 hours
w/ Early Termination: 53.76 hours
96.5% Reduction in Compute (and Cost!)
Bayesian + HyperBand
1. Articulate checkpoints
2. Optimizer selects an initial sample (bootstrapping)
3. Train for “checkpoint N” epochs
4. Evaluate performance (preferably, objective metric)
5. Use Bayesian method to select new candidates
6. Increment N
7. Goto step 3
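
A minimal sketch of that loop, where suggest(history) stands in for any Bayesian optimizer (surrogate model + acquisition function) and train_to(config, epochs) is a hypothetical callback that resumes training to the checkpoint's epoch count and returns the objective metric:

def bayesian_with_checkpoints(suggest, train_to, checkpoints, n_bootstrap=8):
    history = []                                                  # (config, epochs, metric)
    candidates = [suggest(history) for _ in range(n_bootstrap)]   # step 2: bootstrap
    for epochs in checkpoints:                                    # steps 3-7
        for cfg in candidates:
            metric = train_to(cfg, epochs)                        # step 3: train to checkpoint N
            history.append((cfg, epochs, metric))                 # step 4: evaluate
        candidates = [suggest(history) for _ in range(n_bootstrap)]  # step 5: new candidates
    return max(history, key=lambda rec: rec[2])                   # best observation so far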
Credit: https://www.automl.org/blog_bohb/
Assumptions
None
• Black box optimization
• Allows user to account for potential stagnation in checkpoint selection
• Regret intrinsically accounted for
Random vs Bayesian
1. Number of initial candidates
– Random: scales exponentially with number of parameters
– Bayesian: scales linearly with number of parameters
2. Candidate selection
– Random: naïve, static
– Bayesian: adaptive
3. Regret Implementation
– Random: User must explicitly define
– Bayesian: Surrogate + acquisition function
Which is Better?
Does this Actually Work?
Summary
• Attempts to optimize resource allocation
• Dramatically reduces compute and wall-clock time to convergence
• Better implementations include a “regret” mechanism to recover configurations
• Bayesian outperforms Random
– But in principle, compatible with any underlying hyperparameter optimization technique
What about Keras/TF EarlyStopping?
NOT THE SAME THING
It evaluates a single model against a pre-determined rate of loss improvement:
1. Terminate stagnating configurations
2. Prevent overtraining
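
For contrast, a minimal Keras sketch of that single-model behavior; the monitored metric and thresholds are illustrative.

import tensorflow as tf

# EarlyStopping watches ONE training run and stops it when the monitored metric
# stops improving; it never compares configurations against each other.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # metric watched on the single model
    min_delta=1e-3,              # pre-determined rate of improvement
    patience=3,                  # epochs without improvement before stopping
    restore_best_weights=True,   # guard against overtraining
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])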
Libraries
Open Source: HyperBand
• HpBandSter (with Random search)
Open Source: Conceptually Similar
• HpBandSter (with HyperOpt search)
• Fabolas* (RoBo)
Commercial: Conceptually Similar
• SigOpt
Code
• HpBandSter: https://automl.github.io/HpBandSter/build/html/auto_examples/index.html
• Fabolas: https://github.com/automl/RoBO/blob/master/examples/example_fabolas.py
• SigOpt: https://app.sigopt.com/docs/overview/multitask
Awkward/Conditional Spaces
The range, or existence, of one hyperparameter is dependent on the value of another hyperparameter.
Examples
• Optimize Gradient Descent algorithm selection
• Neural network topology refinement
• Neural Architecture Search
• Ensemble models as featurizers
Credit: https://devblogs.nvidia.com/optimizing-end-to-end-memory-networks-using-sigopt-gpus/
Why does this matter?
• Bayesian/adaptive algorithms learn from the prior
• For every hyperparameter, it will require some number of samples to "learn" dependencies
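
A minimal sketch of a conditional space in HyperOpt (listed below), picking up the "optimize Gradient Descent algorithm selection" example: the momentum/beta parameters only exist for the SGD flavor that uses them. Names and ranges are illustrative.

from hyperopt import hp

space = hp.choice("sgd_flavor", [
    {"optimizer": "sgd",
     "momentum": hp.uniform("momentum", 0.0, 0.99)},
    {"optimizer": "adam",
     "beta_1": hp.uniform("beta_1", 0.8, 0.999),
     "beta_2": hp.uniform("beta_2", 0.9, 0.9999)},
])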
Libraries
Open Source
• HyperOpt
• HpBandSter
Commercial
• SigOpt
Multimetric Optimization
Use Case            | Metric 1                    | Metric 2
Fraud Detection     | Minimize activity           | Minimize dollars lost
Realtime Classifier | Maximize accuracy           | Minimize inference time
Anomaly Detection   | Maximize precision          | Maximize recall
Trading Algorithm   | Maximize return (alpha)     | Minimize risk (beta)
E-comm Search       | Maximize quality of results | Maximize profitability of results
“Pareto Efficiency”
Scenario: MNIST for Realtime
• Naïve
– argmax(accuracy)
• Custom Objective Function:
– argmax(accuracy – test_time)
• Statistical Methods (Inverse Efficiency Score)
– argmin(test_time / accuracy)
For all scenarios, log both the accuracy and the test_time
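
A minimal sketch of the three single-objective formulations, logging both raw metrics in every case (here with MLflow; the accuracy and test_time values are assumed to come from the evaluation step):

import mlflow

def objective(accuracy, test_time, method="ies"):
    # Assumes an active MLflow run (e.g. inside `with mlflow.start_run():`).
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("test_time", test_time)
    if method == "naive":
        return -accuracy                    # argmax(accuracy)
    if method == "custom":
        return -(accuracy - test_time)      # argmax(accuracy - test_time)
    return test_time / accuracy             # IES: argmin(test_time / accuracy)

(Returned values are to be minimized, so the maximization cases are negated.)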
Comparison
Method | Accuracy | Delta | Test Time | Delta
Naive  | .981     | --    | .47       | --
Custom | .986     | .5%   | .44       | 7%
IES    | .99      | .9%   | .411      | 14%
Challenges
• Custom Objective Function: susceptible to unintended consequences
– Lengthscale
– Negative values
– Fractional values
• Statistical Methods: make a priori assumptions
– IES, Fβ-score: the weighting (e.g. β) must be chosen up front
Fourth Approach
True multimetric optimization
return [{'name': 'f1', 'value': f1_val}, {'name': 'f2', 'value': f2_val}]
• Optimize for competing objectives independently
• Use Surrogates and Acquisition Functions to model the relationship/tradeoff between objectives
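
Since a true multimetric optimizer reports the set of Pareto-efficient configurations, here is a small illustrative helper (not from the talk) that filters observed metric pairs, assuming both metrics are maximized:

def pareto_front(points):
    # Keep the points not dominated by any other point on both metrics.
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

# e.g. (accuracy, -test_time) pairs from the MNIST comparison above:
print(pareto_front([(0.981, -0.47), (0.986, -0.44), (0.990, -0.411)]))
# -> the IES point dominates here: [(0.99, -0.411)]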
Summary
Optimizer  | Early Termination | Conditionals | Multimetric (single) | Multimetric (multiple) | Open Source
HyperOpt   | No                | Yes          | Yes                  | No                     | Yes
HpBandSter | Yes               | Yes          | Yes                  | No                     | Yes
Fabolas    | Yes               | NA           | NA                   | NA                     | Yes
SigOpt     | Yes               | Yes          | Yes                  | Yes                    | No
Spearmint  | No                | No           | Yes                  | No                     | Yes
GPyOpt     | No                | No           | Yes                  | No                     | Yes
Referenced Papers
• Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
  Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar
  Journal of Machine Learning Research 18 (2018)
• Combining Hyperband and Bayesian Optimization
  Stefan Falkner, Aaron Klein, Frank Hutter
  BayesOpt Workshop @ NeurIPS 2018
• Optimizing End-to-End Memory Networks Using SigOpt and GPUs
  Meghana Ravikumar, Nick Payton, Ben Hsu, Scott Clark
  NVIDIA Developer Blog
THANK YOU!
