Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Using Optimal Learning to Tune Deep Learning Pipelines

808 views

Published on

SigOpt talk from NVIDIA GTC 2017 and AWS SF AI Day

We'll introduce Bayesian optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time consuming or expensive. Deep learning pipelines are notoriously expensive to train and often have many tunable parameters, including hyperparameters, the architecture, and feature transformations, that can have a large impact on the efficacy of the model. We'll provide several example applications using multiple open source deep learning frameworks and open datasets. We'll compare the results of Bayesian optimization to standard techniques like grid search, random search, and expert tuning. Additionally, we'll present a robust benchmark suite for comparing these methods in general.

Published in: Engineering
  • Be the first to comment

Using Optimal Learning to Tune Deep Learning Pipelines

  1. 1. BAYESIAN GLOBAL OPTIMIZATION Using Optimal Learning to Tune Deep Learning Pipelines Scott Clark scott@sigopt.com
  2. 2. OUTLINE 1. Why is Tuning AI Models Hard? 2. Comparison of Tuning Methods 3. Bayesian Global Optimization 4. Deep Learning Examples 5. Evaluating Optimization Strategies
  3. 3. Deep Learning / AI is extremely powerful Tuning these systems is extremely non-intuitive
  4. 4. https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3 What is the most important unresolved problem in machine learning? “...we still don't really know why some configurations of deep neural networks work in some case and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.” Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix)
  5. 5. Photo: Joe Ross
  6. 6. TUNABLE PARAMETERS IN DEEP LEARNING
  7. 7. TUNABLE PARAMETERS IN DEEP LEARNING
  8. 8. Photo: Tammy Strobel
  9. 9. STANDARD METHODS FOR HYPERPARAMETER SEARCH
  10. 10. STANDARD TUNING METHODS Parameter Configuration ? Grid Search Random Search Manual Search - Weights - Thresholds - Window sizes - Transformations ML / AI Model Testing Data Cross Validation Training Data
  11. 11. OPTIMIZATION FEEDBACK LOOP Objective Metric Better Results REST API New configurations ML / AI Model Testing Data Cross Validation Training Data
  12. 12. BAYESIAN GLOBAL OPTIMIZATION
  13. 13. … the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive. Prof. Warren Powell - Princeton What is the most efficient way to collect information? Prof. Peter Frazier - Cornell How do we make the most money, as fast as possible? Scott Clark - CEO, SigOpt OPTIMAL LEARNING
  14. 14. ● Optimize objective function ○ Loss, Accuracy, Likelihood ● Given parameters ○ Hyperparameters, feature/architecture params ● Find the best hyperparameters ○ Sample function as few times as possible ○ Training on big data is expensive BAYESIAN GLOBAL OPTIMIZATION
  15. 15. SMBO Sequential Model-Based Optimization HOW DOES IT WORK?
  16. 16. 1. Build Gaussian Process (GP) with points sampled so far 2. Optimize the fit of the GP (covariance hyperparameters) 3. Find the point(s) of highest Expected Improvement within parameter domain 4. Return optimal next best point(s) to sample GP/EI SMBO
  17. 17. GAUSSIAN PROCESSES
  18. 18. GAUSSIAN PROCESSES
  19. 19. GAUSSIAN PROCESSES
  20. 20. GAUSSIAN PROCESSES
  21. 21. GAUSSIAN PROCESSES
  22. 22. GAUSSIAN PROCESSES
  23. 23. GAUSSIAN PROCESSES
  24. 24. GAUSSIAN PROCESSES
  25. 25. overfit good fit underfit GAUSSIAN PROCESSES
  26. 26. EXPECTED IMPROVEMENT
  27. 27. EXPECTED IMPROVEMENT
  28. 28. EXPECTED IMPROVEMENT
  29. 29. EXPECTED IMPROVEMENT
  30. 30. EXPECTED IMPROVEMENT
  31. 31. EXPECTED IMPROVEMENT
  32. 32. DEEP LEARNING EXAMPLES
  33. 33. ● Classify movie reviews using a CNN in MXNet SIGOPT + MXNET
  34. 34. TEXT CLASSIFICATION PIPELINE ML / AI Model (MXNet) Testing Text Validation Accuracy Better Results REST API Hyperparameter Configurations and Feature Transformations Training Text
  35. 35. TUNABLE PARAMETERS IN DEEP LEARNING
  36. 36. ● Comparison of several RMSProp SGD parametrizations STOCHASTIC GRADIENT DESCENT
  37. 37. ARCHITECTURE PARAMETERS
  38. 38. Grid Search Random Search ? TUNING METHODS
  39. 39. MULTIPLICATIVE TUNING SPEED UP
  40. 40. SPEED UP #1: CPU -> GPU
  41. 41. SPEED UP #2: RANDOM/GRID -> SIGOPT
  42. 42. CONSISTENTLY BETTER AND FASTER
  43. 43. ● Classify house numbers in an image dataset (SVHN) SIGOPT + TENSORFLOW
  44. 44. COMPUTER VISION PIPELINE ML / AI Model (Tensorflow) Testing Images Cross Validation Accuracy Better Results REST API Hyperparameter Configurations and Feature Transformations Training Images
  45. 45. METRIC OPTIMIZATION
  46. 46. ● All convolutional neural network ● Multiple convolutional and dropout layers ● Hyperparameter optimization mixture of domain expertise and grid search (brute force) SIGOPT + NEON http://arxiv.org/pdf/1412.6806.pdf
  47. 47. COMPARATIVE PERFORMANCE ● Expert baseline: 0.8995 ○ (using neon) ● SigOpt best: 0.9011 ○ 1.6% reduction in error rate ○ No expert time wasted in tuning
  48. 48. SIGOPT + NEON http://arxiv.org/pdf/1512.03385v1.pdf ● Explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions ● Variable depth ● Hyperparameter optimization mixture of domain expertise and grid search (brute force)
  49. 49. COMPARATIVE PERFORMANCE Standard Method ● Expert baseline: 0.9339 ○ (from paper) ● SigOpt best: 0.9436 ○ 15% relative error rate reduction ○ No expert time wasted in tuning
  50. 50. EVALUATING THE OPTIMIZER
  51. 51. OUTLINE ● Metric Definitions ● Benchmark Suite ● Eval Infrastructure ● Visualization Tool ● Baseline Comparisons
  52. 52. What is the best value found after optimization completes? METRIC: BEST FOUND BLUE RED BEST_FOUND 0.7225 0.8949
  53. 53. How quickly is optimum found? (area under curve) METRIC: AUC BLUE RED BEST_FOUND 0.9439 0.9435 AUC 0.8299 0.9358
  54. 54. STOCHASTIC OPTIMIZATION
  55. 55. ● Optimization functions from literature ● ML datasets: LIBSVM, Deep Learning, etc BENCHMARK SUITE TEST FUNCTION TYPE COUNT Continuous Params 184 Noisy Observations 188 Parallel Observations 45 Integer Params 34 Categorical Params / ML 47 Failure Observations 30 TOTAL 489
  56. 56. ● On-demand cluster in AWS for parallel eval function optimization ● Full eval consists of ~20000 optimizations, taking ~30 min INFRASTRUCTURE
  57. 57. RANKING OPTIMIZERS 1. Mann-Whitney U tests using BEST_FOUND 2. Tied results then partially ranked using AUC 3. Any remaining ties, stay as ties for final ranking
  58. 58. RANKING AGGREGATION ● Aggregate partial rankings across all eval functions using Borda count (sum of methods ranked lower)
  59. 59. SHORT RESULTS SUMMARY
  60. 60. BASELINE COMPARISONS
  61. 61. SIGOPT SERVICE
  62. 62. OPTIMIZATION FEEDBACK LOOP Objective Metric Better Results REST API New configurations ML / AI Model Testing Data Cross Validation Training Data
  63. 63. SIMPLIFIED OPTIMIZATION Client Libraries ● Python ● Java ● R ● Matlab ● And more... Framework Integrations ● TensorFlow ● scikit-learn ● xgboost ● Keras ● Neon ● And more... Live Demo
  64. 64. DISTRIBUTED TRAINING ● SigOpt serves as a distributed scheduler for training models across workers ● Workers access the SigOpt API for the latest parameters to try for each model ● Enables easy distributed training of non-distributed algorithms across any number of models
  65. 65. COMPARATIVE PERFORMANCE ● Better Results, Faster and Cheaper Quickly get the most out of your models with our proven, peer-reviewed ensemble of Bayesian and Global Optimization Methods ○ A Stratified Analysis of Bayesian Optimization Methods (ICML 2016) ○ Evaluation System for a Bayesian Optimization Service (ICML 2016) ○ Interactive Preference Learning of Utility Functions for Multi-Objective Optimization (NIPS 2016) ○ And more... ● Fully Featured Tune any model in any pipeline ○ Scales to 100 continuous, integer, and categorical parameters and many thousands of evaluations ○ Parallel tuning support across any number of models ○ Simple integrations with many languages and libraries ○ Powerful dashboards for introspecting your models and optimization ○ Advanced features like multi-objective optimization, failure region support, and more ● Secure Black Box Optimization Your data and models never leave your system
  66. 66. https://sigopt.com/getstarted Try it yourself!
  67. 67. Questions? contact@sigopt.com https://sigopt.com @SigOpt

×