Advanced Hyperparameter Optimization for Deep Learning with MLflow

Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will show how to leverage MLflow with these tools and techniques to analyze the performance of our models.

Advanced Hyperparameter Optimization for Deep Learning with MLflow

  1. WIFI SSID: SparkAISummit | Password: UnifiedAnalytics
  2. Advanced HPO for Deep Learning (Maneesh Bhide, Databricks)
  3. Review: HPO Approaches
      • Grid search
        – PRO: can be run in one time-step
        – CON: naive, computationally expensive, suffers from the curse of dimensionality, may alias over global optima
      • Random search
        – PRO: suffers less from the curse of dimensionality, can be run in one time-step
        – CON: naive, no certainty about results, still computationally expensive
      • Population based
        – PRO: implicit predictions, can be run in several time-steps, good at resolving many optima
        – CON: computationally expensive, may converge to local optima
      • Bayesian
        – PRO: explicit predictions, computationally efficient
        – CON: requires sequential observations
  4. Review: Best Practices
      • Tune the entire pipeline, not individual models
      • How you phrase parameters matters!
        – Are categoricals really categorical?
          • [2, 4, 8, 16, 32] → integer parameter {1, 5, step 1} and use 2^param
        – Use transformations to your advantage
          • For learning_rate, instead of {0, 1} use {-10, 0} and 10^param
      • Don't restrict yourself to traditional hyperparameters
        – SGD flavor
        – Architecture
      (A sketch of these rephrasings in code follows below.)
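      A minimal sketch of the two rephrasings, assuming HyperOpt (one of the open source tools listed later in the deck); train_and_eval is a hypothetical placeholder for the real training pipeline.

        # Rephrase a power-of-two categorical as an integer exponent, and search learning_rate on a log scale
        import math
        from hyperopt import fmin, tpe, hp

        space = {
            # [2, 4, 8, 16, 32] rephrased as an integer exponent in {1..5}; the model sees 2**exponent
            'batch_exponent': hp.quniform('batch_exponent', 1, 5, 1),
            # learning_rate searched as an exponent in [-10, 0]; the model sees 10**exponent
            'log10_lr': hp.uniform('log10_lr', -10, 0),
        }

        def train_and_eval(batch_size, learning_rate):
            # Placeholder objective standing in for real training; returns a loss to minimize
            return (math.log10(learning_rate) + 3) ** 2 + abs(batch_size - 8) * 0.01

        def objective(params):
            return train_and_eval(batch_size=int(2 ** params['batch_exponent']),
                                  learning_rate=10 ** params['log10_lr'])

        best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)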
  5. HPO for Neural Networks
      • Can benefit from the compute efficiency of Bayesian optimization, since the parameter space can explode
        – Challenge: sequential training and long training times
      • Optimize more than just hyperparameters
        – Challenge: parameters that depend on other parameters
      • Production models often have multiple criteria
        – Challenge: trading off between objectives
  6. Agenda
      • Challenge of sequential training and long training time
        – Early Termination
      • Challenge of parameters depending on other parameters
        – Awkward/Conditional Spaces
      • Challenge of trading off between objectives
        – Multimetric Optimization
  7. How Early Termination Works (from the HyperBand paper)
      1. Select an initial set of candidate configurations
      2. Train the configurations for X_n epochs
      3. Evaluate performance (preferably on the objective metric)
      4. Use SuccessiveHalving (eliminate half), then run the remaining configurations for an additional X_n epochs
      5. Set X_{n+1} = 2 * X_n
      6. Go to step 2
      (A minimal sketch of this loop follows below.)
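      A minimal plain-Python sketch of the SuccessiveHalving loop described above; train_and_score is a hypothetical placeholder for training a configuration a bit longer and returning the objective metric.

        import random

        def train_and_score(config, epochs):
            # Placeholder: train `config` for `epochs` more epochs and return the objective metric (higher is better)
            return random.random()

        def successive_halving(configs, initial_epochs=1, rounds=4):
            epochs, survivors = initial_epochs, list(configs)
            for _ in range(rounds):
                scored = sorted(((train_and_score(c, epochs), c) for c in survivors),
                                key=lambda pair: pair[0], reverse=True)
                survivors = [c for _, c in scored[:max(1, len(scored) // 2)]]  # eliminate the bottom half
                epochs *= 2  # X_{n+1} = 2 * X_n
            return survivors

        finalists = successive_halving([{'log10_lr': -i} for i in range(1, 9)])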
  8. [Figure] Credit: https://www.automl.org/blog_bohb/
  9. Assumptions
      • Well-behaved learning curves
      • Model performance: you don't need the best model, you need a good model faster
  10. [Figure] Credit: https://www.automl.org/blog_bohb/
  11. Scenario Walkthrough
      • ResNet-50 on ImageNet
      • 9 hyperparameters for HPO
      • 128 configurations
      • 1 p2.xlarge ($0.90/hour)
      • 12 hours training time per configuration
  12. Standard Training
      • 12 hours × 128 configurations
      • Total compute time: 1,536 hours
      • Total cost: $1,382.40
  13. With HyperBand
      | % of full training | 0.78% | 1.56% | 3.12% | 6.25% | 12.5% | 25% | 50% | 100% |
      | Hours              | 0.09  | 0.19  | 0.37  | 0.75  | 1.5   | 3   | 6   | 12   |
      | Configs            | 128   | 64    | 32    | 16    | 8     | 4   | 2   | 1    |
      | Eliminated (ET)    | 64    | 32    | 16    | 8     | 4     | 2   | 1   | --   |
      | Total hours        | 5.76  | 6.08  | 5.92  | 6     | 6     | 6   | 6   | 12   |
      Total compute time: 53.76 hours; total cost: $48.38
  14. Scenario Summary
      • Without early termination: 1,536 hours
      • With early termination: 53.76 hours
      • 96.5% reduction in compute (and cost!)
      (A quick arithmetic check follows below.)
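      A back-of-the-envelope check of the numbers from the last two slides (p2.xlarge at $0.90/hour).

        rate = 0.90
        standard_hours = 12 * 128                                         # every config trained to completion
        hyperband_hours = sum([5.76, 6.08, 5.92, 6, 6, 6, 6, 12])         # per-stage totals from the table above
        print(standard_hours, round(standard_hours * rate, 2))            # 1536 hours, 1382.4 dollars
        print(round(hyperband_hours, 2), round(hyperband_hours * rate, 2))  # 53.76 hours, 48.38 dollars
        print(round(100 * (1 - hyperband_hours / standard_hours), 1))     # 96.5 (% reduction)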
  15. Bayesian + HyperBand
      1. Articulate checkpoints
      2. Optimizer selects an initial sample (bootstrapping)
      3. Train for "checkpoint N" epochs
      4. Evaluate performance (preferably on the objective metric)
      5. Use a Bayesian method to select new candidates
      6. Increment N
      7. Go to step 3
      (A schematic sketch of this loop follows below.)
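      A schematic sketch of the control flow only: ToyOptimizer and train_to_epoch are hypothetical stand-ins (a real implementation would use a surrogate model and acquisition function, and resume training from checkpoints), not any specific library's API.

        import random

        class ToyOptimizer:
            # Stand-in for a Bayesian optimizer: suggest() proposes a config, observe() would
            # normally update a surrogate model (omitted here; this stub just records history)
            def __init__(self):
                self.history = []
            def suggest(self):
                return {'log10_lr': random.uniform(-6, -1)}
            def observe(self, config, metric, budget):
                self.history.append((config, metric, budget))

        def train_to_epoch(config, epochs):
            # Placeholder for resuming training up to `epochs`; returns a synthetic validation score
            return 1 - (config['log10_lr'] + 3) ** 2 / 25 + 0.01 * epochs

        def bayesian_with_checkpoints(optimizer, checkpoints, n_initial=8):
            candidates = [optimizer.suggest() for _ in range(n_initial)]   # step 2: bootstrap sample
            for epochs in checkpoints:                                     # steps 3-7
                for cfg in candidates:
                    optimizer.observe(cfg, train_to_epoch(cfg, epochs), budget=epochs)
                # the surrogate + acquisition function choose what to run at the next checkpoint;
                # weak configurations simply stop being proposed (implicit early termination)
                candidates = [optimizer.suggest() for _ in range(max(2, n_initial // 2))]
            return max(optimizer.history, key=lambda rec: rec[1])

        best = bayesian_with_checkpoints(ToyOptimizer(), checkpoints=[1, 2, 4, 8, 16])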
  16. [Figure] Credit: https://www.automl.org/blog_bohb/
  17. Assumptions: None
      • Black-box optimization
      • Allows the user to account for potential stagnation in checkpoint selection
      • Regret is intrinsically accounted for
  18. Random vs. Bayesian
      1. Number of initial candidates
         – Random: scales exponentially with the number of parameters
         – Bayesian: scales linearly with the number of parameters
      2. Candidate selection
         – Random: naive, static
         – Bayesian: adaptive
      3. Regret implementation
         – Random: the user must explicitly define it
         – Bayesian: surrogate + acquisition function
  19. Which is Better?
  20. Does this Actually Work?
  21. Summary
      • Attempts to optimize resource allocation
      • Dramatically reduces compute and wall-clock time to convergence
      • Better implementations include a "regret" mechanism to recover prematurely eliminated configurations
      • Bayesian outperforms Random
        – But in principle, compatible with any underlying hyperparameter optimization technique
  22. What about Keras/TF EarlyStopping?
      • NOT THE SAME THING
      • Evaluates a single model against a predetermined rate of loss improvement, in order to:
        1. Terminate a stagnating configuration
        2. Prevent overtraining
      (A minimal Keras sketch follows below.)
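      To make the distinction concrete, a minimal sketch of the Keras callback on a toy model: it monitors one model's own validation loss and never compares across configurations.

        import numpy as np
        import tensorflow as tf

        # Per-model early stopping: stops THIS run when val_loss fails to improve by at least
        # min_delta for `patience` epochs; it does not allocate budget across configurations.
        early_stop = tf.keras.callbacks.EarlyStopping(
            monitor='val_loss', min_delta=1e-3, patience=5, restore_best_weights=True)

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer='adam', loss='mse')

        x, y = np.random.rand(256, 10), np.random.rand(256, 1)   # toy data just to make the sketch runnable
        model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)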
  23. Libraries
      • Open source, HyperBand:
        – HpBandSter (with random search)
      • Open source, conceptually similar:
        – HpBandSter (with HyperOpt search)
        – Fabolas* (RoBO)
      • Commercial, conceptually similar:
        – SigOpt
  24. Code
      • HpBandSter: https://automl.github.io/HpBandSter/build/html/auto_examples/index.html
      • Fabolas: https://github.com/automl/RoBO/blob/master/examples/example_fabolas.py
      • SigOpt: https://app.sigopt.com/docs/overview/multitask
      (A condensed HpBandSter-style sketch follows below.)
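      A condensed sketch following the pattern of the HpBandSter examples linked above (consult that page for the authoritative version); the toy objective and parameter names are our own.

        import numpy as np
        import ConfigSpace as CS
        import hpbandster.core.nameserver as hpns
        from hpbandster.core.worker import Worker
        from hpbandster.optimizers import BOHB

        class MyWorker(Worker):
            def compute(self, config, budget, **kwargs):
                # `budget` is the training budget (e.g. epochs) BOHB grants this evaluation
                loss = (config['x'] - 0.5) ** 2 + np.random.randn() * 0.1 / budget
                return {'loss': float(loss), 'info': {'budget': budget}}

            @staticmethod
            def get_configspace():
                cs = CS.ConfigurationSpace()
                cs.add_hyperparameter(CS.UniformFloatHyperparameter('x', lower=0.0, upper=1.0))
                return cs

        NS = hpns.NameServer(run_id='bohb_demo', host='127.0.0.1', port=None)
        NS.start()
        worker = MyWorker(nameserver='127.0.0.1', run_id='bohb_demo')
        worker.run(background=True)

        bohb = BOHB(configspace=MyWorker.get_configspace(), run_id='bohb_demo',
                    nameserver='127.0.0.1', min_budget=1, max_budget=16)
        result = bohb.run(n_iterations=4)
        bohb.shutdown(shutdown_workers=True)
        NS.shutdown()

        print(result.get_id2config_mapping()[result.get_incumbent_id()])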
  25. Awkward/Conditional Spaces
      • The range, or existence, of one hyperparameter depends on the value of another hyperparameter
  26. Examples
      • Optimizing gradient descent algorithm selection (see the sketch below)
      • Neural network topology refinement
      • Neural Architecture Search
      • Ensemble models as featurizers
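      A minimal conditional space in HyperOpt (one of the open source options on a later slide): momentum only exists when SGD is chosen, and each branch gets its own learning rate. The objective here is a runnable placeholder, not a real model.

        from hyperopt import fmin, tpe, hp

        space = hp.choice('optimizer', [
            {'name': 'sgd',
             'learning_rate': hp.loguniform('sgd_lr', -10, 0),
             'momentum': hp.uniform('sgd_momentum', 0.0, 0.99)},
            {'name': 'adam',
             'learning_rate': hp.loguniform('adam_lr', -10, 0),
             'beta_1': hp.uniform('adam_beta_1', 0.8, 0.999)},
        ])

        def objective(opt_config):
            # Placeholder: in practice, build and train the model with opt_config
            penalty = 0.0 if opt_config['name'] == 'adam' else opt_config.get('momentum', 0.0) * 0.01
            return (opt_config['learning_rate'] - 1e-3) ** 2 + penalty

        best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)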
  27. [Figure] Credit: https://devblogs.nvidia.com/optimizing-end-to-end-memory-networks-using-sigopt-gpus/
  28. Why does this matter?
      • Bayesian/adaptive algorithms learn from prior observations
      • For every hyperparameter, the optimizer needs some number of samples to "learn" its dependencies
  29. Libraries
      • Open source: HyperOpt, HpBandSter
      • Commercial: SigOpt
  30. Multimetric Optimization
      | Use Case            | Metric 1                    | Metric 2                           |
      | Fraud Detection     | Minimize activity           | Minimize dollars lost              |
      | Realtime Classifier | Maximize accuracy           | Minimize inference time            |
      | Anomaly Detection   | Maximize precision          | Maximize recall                    |
      | Trading Algorithm   | Maximize return (alpha)     | Minimize risk (beta)               |
      | E-comm Search       | Maximize quality of results | Maximize profitability of results  |
  31. "Pareto Efficiency"
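      To illustrate the idea, a small helper that keeps only the Pareto-efficient (accuracy, test_time) points: those no other point beats on both metrics at once. The numbers are toy data.

        def pareto_front(points):
            # points: list of (accuracy, test_time) tuples; higher accuracy and lower test_time are better
            front = []
            for acc, t in points:
                dominated = any(a >= acc and tt <= t and (a > acc or tt < t) for a, tt in points)
                if not dominated:
                    front.append((acc, t))
            return front

        runs = [(0.981, 0.41), (0.986, 0.44), (0.990, 0.47), (0.970, 0.50)]
        print(pareto_front(runs))   # the last run is dominated; the first three trade accuracy against latency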
  32. Scenario: MNIST for Realtime
      • Naive: argmax(accuracy)
      • Custom objective function: argmax(accuracy - test_time)
      • Statistical methods (Inverse Efficiency Score): argmin(test_time / accuracy)
      • For all scenarios, log both accuracy and test_time (a logging sketch follows below)
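      A sketch of logging each candidate's raw metrics and the derived objectives with MLflow, so the tradeoff can be revisited later regardless of which selection rule was used; the run and metric names here are our own choices.

        import mlflow

        def log_candidate(run_name, accuracy, test_time):
            # Always record the raw metrics; the derived objectives are just convenience columns
            with mlflow.start_run(run_name=run_name):
                mlflow.log_metric('accuracy', accuracy)
                mlflow.log_metric('test_time', test_time)
                mlflow.log_metric('custom_objective', accuracy - test_time)     # to be maximized
                mlflow.log_metric('inverse_efficiency', test_time / accuracy)   # to be minimized

        log_candidate('candidate_0', accuracy=0.981, test_time=0.47)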
  33. Comparison
      | Method | Accuracy | Delta | Test Time | Delta |
      | Naive  | 0.981    | --    | 0.47      | --    |
      | Custom | 0.986    | 0.5%  | 0.44      | 7%    |
      | IES    | 0.990    | 0.9%  | 0.411     | 14%   |
  34. Challenges
      • Custom objective function: susceptible to unintended consequences
        – Length scale, negative values, fractional values
      • Statistical methods: make a priori assumptions
        – IES, Fβ-score: the weighting (e.g., β) must be chosen up front
  35. Fourth Approach: True Multimetric Optimization
      • return [{'name': 'f1', 'value': f1_val}, {'name': 'f2', 'value': f2_val}]
      • Optimize the competing objectives independently
      • Use surrogates and acquisition functions to model the relationship/tradeoff between objectives
  36. Summary
      | Optimizer  | Early Termination | Conditionals | Multimetric (single) | Multimetric (multiple) | Open Source |
      | HyperOpt   | No                | Yes          | Yes                  | No                     | Yes         |
      | HpBandSter | Yes               | Yes          | Yes                  | No                     | Yes         |
      | Fabolas    | Yes               | NA           | NA                   | NA                     | Yes         |
      | SigOpt     | Yes               | Yes          | Yes                  | Yes                    | No          |
      | Spearmint  | No                | No           | Yes                  | No                     | Yes         |
      | GPyOpt     | No                | No           | Yes                  | No                     | Yes         |
  37. Referenced Papers
      • Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar. Journal of Machine Learning Research 18 (2018).
      • Combining Hyperband and Bayesian Optimization. Stefan Falkner, Aaron Klein, Frank Hutter. BayesOpt Workshop @ NeurIPS 2018.
      • Optimizing End-to-End Memory Networks Using SigOpt and GPUs. Meghana Ravikumar, Nick Payton, Ben Hsu, Scott Clark. NVIDIA Developer Blog.
  38. THANK YOU!
      • Don't forget to rate and review the sessions
      • Search: Spark + AI Summit
