Advanced HPO for Deep Learning
Maneesh Bhide, Databricks
Review: HPO Approaches
Grid search:
• PRO: can be run in one time-step (all configurations evaluated in parallel)
• CON: naive, computationally expensive, suffers from the curse of dimensionality, may alias over global optima
Random search:
• PRO: suffers less from curse of dimensionality, can be run in one time-step
• CON: naive, no certainty about results, still computationally expensive
Population based:
• PRO: implicit predictions, can be run in several time-steps, good at resolving many optima
• CON: computationally expensive, may converge to local optima
Bayesian:
• PRO: explicit predictions, computationally efficient
• CON: requires sequential observations
Review: Best Practices
• Tune entire pipeline, not individual models
• How you phrase parameters matters!
– Are categoricals really categorical?
• [2, 4, 8, 16, 32] → search an integer in {1, 5, step 1} and use 2^param
– Use transformations to your advantage (see the sketch after this list)
• For learning_rate, instead of (0, 1) → search {-10, 0} and use 10^param
• Don’t restrict to traditional hyperparameters
– SGD Flavor
– Architecture
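
To make the phrasing advice concrete, a minimal sketch using HyperOpt (one of the libraries discussed later); the parameter names and exact ranges are illustrative, not from the talk.

from hyperopt import hp

# Instead of a categorical over [2, 4, 8, 16, 32], search an integer exponent
# in {1..5} and reconstruct 2**exponent, so the optimizer sees an ordered axis.
# Instead of learning_rate in (0, 1), search the base-10 exponent in [-10, 0].
space = {
    "batch_exponent": hp.quniform("batch_exponent", 1, 5, 1),
    "lr_exponent": hp.uniform("lr_exponent", -10, 0),
}

def decode(params):
    # Map the searched exponents back to the values the model actually uses.
    return {
        "batch_size": 2 ** int(params["batch_exponent"]),
        "learning_rate": 10 ** params["lr_exponent"],
    }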
HPO for Neural Networks
• Can benefit from the compute efficiency of Bayesian Optimization, as the parameter space can explode
– Challenge of sequential training and long training time
• Optimize more than just hyperparameters
– Challenge of parameters depending on other parameters
• Production models often have multiple criteria
– Challenge of trading off between objectives
Agenda
• Challenge of sequential training and long training time
– Early Termination
• Challenge of parameters depending on other parameters
– Awkward/Conditional Spaces
• Challenge of trading off between objectives
– Multimetric Optimization
How Early Termination works
From the HyperBand Paper…
1. Select an initial candidate configuration set
2. Train configurations for X_n epochs
3. Evaluate performance (preferably, the objective metric)
4. Use SuccessiveHalving (eliminate the worse half), run the remaining configurations for an additional X_n epochs
5. X_{n+1} = 2 * X_n
6. Goto step 2
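
A minimal sketch of that loop (SuccessiveHalving only, not the full HyperBand bracket schedule), assuming a hypothetical train_and_score(config, epochs) callback that resumes training to a total epoch budget and returns the objective metric (higher is better):

def successive_halving(configs, train_and_score, initial_epochs=1):
    survivors = list(configs)
    epochs = initial_epochs                                   # X_n
    while len(survivors) > 1:
        # Train every surviving configuration up to the current epoch budget.
        scored = [(cfg, train_and_score(cfg, epochs)) for cfg in survivors]
        # SuccessiveHalving: keep the better-performing half.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        survivors = [cfg for cfg, _ in scored[: len(survivors) // 2]]
        epochs *= 2                                           # X_{n+1} = 2 * X_n
    return survivors[0]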
Credit: https://www.automl.org/blog_bohb/
Assumptions
• Well-behaved learning curves
• Model performance: don’t need the best model, need a good model faster
Credit: https://www.automl.org/blog_bohb/
Scenario Walkthrough
ResNet-50 on ImageNet
9 Hyperparameters for HPO
128 configurations
1 p2.xlarge ($.90/hour)
12 hours training time
Standard Training
12 hours × 128 configurations
Total Compute Time: 1,536 hours
Total Cost: $1,382.40
With HyperBand
% trained        | 0.78% | 1.56% | 3.12% | 6.25% | 12.5% | 25%  | 50%  | 100%
Cumulative hours | 0.09  | 0.19  | 0.37  | 0.75  | 1.5   | 3    | 6    | 12
Configs trained  | 128   | 64    | 32    | 16    | 8     | 4    | 2    | 1
Terminated (ET)  | 64    | 32    | 16    | 8     | 4     | 2    | 1    | --
Hours consumed   | 5.76  | 6.08  | 5.92  | 6     | 6     | 6    | 6    | 12
(Hours consumed = configurations stopping at that checkpoint × that checkpoint's cumulative hours; the final column is the one surviving configuration's full 12 hours.)
Total Compute Time: 53.76 hours
Total Cost: $48.38
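
A quick sanity check of the table's accounting (the schedule and the $0.90/hour price are taken from the scenario above):

# Each configuration terminated at a checkpoint has consumed that checkpoint's
# cumulative hours; the single surviving configuration trains the full 12 hours.
hours      = [0.09, 0.19, 0.37, 0.75, 1.5, 3, 6]   # cumulative hours per checkpoint
terminated = [64,   32,   16,   8,    4,   2, 1]   # configurations dropped there
total_hours = sum(t * h for t, h in zip(terminated, hours)) + 1 * 12
print(round(total_hours, 2))         # 53.76 hours of compute
print(round(total_hours * 0.90, 2))  # $48.38 at $0.90/hour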
Scenario Summary
w/o Early Termination: 1536 hours
w/ Early Termination: 53.76 hours
96.5% Reduction in Compute (and Cost!)
Bayesian + HyperBand
1. Articulate checkpoints
2. Optimizer selects an initial sample (bootstrapping)
3. Train for “checkpoint N” epochs
4. Evaluate performance (preferably, objective metric)
5. Use Bayesian method to select new candidates
6. Increment N
7. Goto step 3
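
A minimal sketch of that loop, where suggest(history) stands in for any Bayesian optimizer (surrogate model + acquisition function) and train_to(config, epochs) is a hypothetical callback that resumes training to the checkpoint's epoch count and returns the objective metric:

def bayesian_with_checkpoints(suggest, train_to, checkpoints, n_bootstrap=8):
    history = []                                                  # (config, epochs, metric)
    candidates = [suggest(history) for _ in range(n_bootstrap)]   # step 2: bootstrap
    for epochs in checkpoints:                                    # steps 3-7
        for cfg in candidates:
            metric = train_to(cfg, epochs)                        # step 3: train to checkpoint N
            history.append((cfg, epochs, metric))                 # step 4: evaluate
        candidates = [suggest(history) for _ in range(n_bootstrap)]  # step 5: new candidates
    return max(history, key=lambda rec: rec[2])                   # best observation so far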
Credit: https://www.automl.org/blog_bohb/
Assumptions
None
• Black box optimization
• Allows user to account for potential stagnation in checkpoint selection
• Regret intrinsically accounted for
Random vs Bayesian
1. Number of initial candidates
– Random: scales exponentially with number of parameters
– Bayesian: scales linearly with number of parameters
2. Candidate selection
– Random: naïve, static
– Bayesian: adaptive
3. Regret Implementation
– Random: User must explicitly define
– Bayesian: Surrogate + acquisition function
Which is Better?
Does this Actually Work?
Summary
• Attempts to optimize resource allocation
• Dramatically reduces compute and wall-clock time to convergence
• Better implementations include a “regret” mechanism to recover configurations
• Bayesian outperforms Random
– But in principle, compatible with any underlying hyperparameter optimization technique
What about Keras/TF EarlyStopping?
NOT THE SAME THING
It evaluates a single model against a pre-determined rate of loss improvement:
1. Terminate stagnating configurations
2. Prevent overtraining
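
For contrast, a minimal Keras sketch of that single-model behavior; the monitored metric and thresholds are illustrative.

import tensorflow as tf

# EarlyStopping watches ONE training run and stops it when the monitored metric
# stops improving; it never compares configurations against each other.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # metric watched on the single model
    min_delta=1e-3,              # pre-determined rate of improvement
    patience=3,                  # epochs without improvement before stopping
    restore_best_weights=True,   # guard against overtraining
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])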
Libraries
Open Source: HyperBand
• HpBandSter (with Random search)
Open Source: Conceptually Similar
• HpBandSter (with HyperOpt search)
• Fabolas* (RoBo)
Commercial: Conceptually Similar
• SigOpt
Code
• HpBandSter: https://automl.github.io/HpBandSter/build/html/auto_examples/index.html
• Fabolas: https://github.com/automl/RoBO/blob/master/examples/example_fabolas.py
• SigOpt: https://app.sigopt.com/docs/overview/multitask
Awkward/Conditional Spaces
The range, or existence, of one hyperparameter is dependent on the value of another hyperparameter.
Examples
• Optimize Gradient Descent algorithm selection
• Neural network topology refinement
• Neural Architecture Search
• Ensemble models as featurizers
Credit: https://devblogs.nvidia.com/optimizing-end-to-end-memory-networks-using-sigopt-gpus/
Why does this matter?
• Bayesian/adaptive algorithms learn from the prior
• For every hyperparameter, it will require some number of samples to "learn" dependencies
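
A minimal sketch of a conditional space in HyperOpt (listed below), picking up the "optimize Gradient Descent algorithm selection" example: the momentum/beta parameters only exist for the SGD flavor that uses them. Names and ranges are illustrative.

from hyperopt import hp

space = hp.choice("sgd_flavor", [
    {"optimizer": "sgd",
     "momentum": hp.uniform("momentum", 0.0, 0.99)},
    {"optimizer": "adam",
     "beta_1": hp.uniform("beta_1", 0.8, 0.999),
     "beta_2": hp.uniform("beta_2", 0.9, 0.9999)},
])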
Libraries
Open Source
• HyperOpt
• HpBandSter
Commercial
• SigOpt
Multimetric Optimization
Use Case            | Metric 1                    | Metric 2
Fraud Detection     | Minimize activity           | Minimize dollars lost
Realtime Classifier | Maximize accuracy           | Minimize inference time
Anomaly Detection   | Maximize precision          | Maximize recall
Trading Algorithm   | Maximize return (alpha)     | Minimize risk (beta)
E-comm Search       | Maximize quality of results | Maximize profitability of results
“Pareto Efficiency”
Scenario: MNIST for Realtime
• Naïve
– argmax(accuracy)
• Custom Objective Function:
– argmax(accuracy – test_time)
• Statistical Methods (Inverse Efficiency Score)
– argmin(test_time / accuracy)
For all scenarios, log both the accuracy and the test_time
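
A minimal sketch of the three single-objective formulations, logging both raw metrics in every case (here with MLflow; the accuracy and test_time values are assumed to come from the evaluation step):

import mlflow

def objective(accuracy, test_time, method="ies"):
    # Assumes an active MLflow run (e.g. inside `with mlflow.start_run():`).
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("test_time", test_time)
    if method == "naive":
        return -accuracy                    # argmax(accuracy)
    if method == "custom":
        return -(accuracy - test_time)      # argmax(accuracy - test_time)
    return test_time / accuracy             # IES: argmin(test_time / accuracy)

(Returned values are to be minimized, so the maximization cases are negated.)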
Comparison
Method | Accuracy | Delta | Test Time | Delta
Naive  | .981     | --    | .47       | --
Custom | .986     | .5%   | .44       | 7%
IES    | .99      | .9%   | .411      | 14%
Challenges
• Custom Objective Function: susceptible to unintended consequences
– Lengthscale
– Negative values
– Fractional values
• Statistical Methods: make a priori assumptions
– IES, Fβ-score: the weighting (e.g. β) must be chosen up front
Fourth Approach
True multimetric optimization
return [{'name': 'f1', 'value': f1_val}, {'name': 'f2', 'value': f2_val}]
• Optimize for competing objectives independently
• Use Surrogates and Acquisition Functions to model the relationship/tradeoff between objectives
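
Since a true multimetric optimizer reports the set of Pareto-efficient configurations, here is a small illustrative helper (not from the talk) that filters observed metric pairs, assuming both metrics are maximized:

def pareto_front(points):
    # Keep the points not dominated by any other point on both metrics.
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

# e.g. (accuracy, -test_time) pairs from the MNIST comparison above:
print(pareto_front([(0.981, -0.47), (0.986, -0.44), (0.990, -0.411)]))
# -> the IES point dominates here: [(0.99, -0.411)]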
Summary
Optimizer  | Early Termination | Conditionals | Multimetric (single) | Multimetric (multiple) | Open Source
HyperOpt   | No                | Yes          | Yes                  | No                     | Yes
HpBandSter | Yes               | Yes          | Yes                  | No                     | Yes
Fabolas    | Yes               | NA           | NA                   | NA                     | Yes
SigOpt     | Yes               | Yes          | Yes                  | Yes                    | No
Spearmint  | No                | No           | Yes                  | No                     | Yes
GPyOpt     | No                | No           | Yes                  | No                     | Yes
Referenced Papers
• Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
  Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar
  Journal of Machine Learning Research 18 (2018)
• Combining Hyperband and Bayesian Optimization
  Stefan Falkner, Aaron Klein, Frank Hutter
  BayesOpt Workshop @ NeurIPS 2018
• Optimizing End-to-End Memory Networks Using SigOpt and GPUs
  Meghana Ravikumar, Nick Payton, Ben Hsu, Scott Clark
  NVIDIA Developer Blog
THANK YOU!
