Using Bayesian Optimization to Tune Machine Learning Models

  1. USING BAYESIAN OPTIMIZATION TO TUNE MACHINE LEARNING MODELS. Scott Clark, Co-founder and CEO of SigOpt. scott@sigopt.com, @DrScottClark
  2. TRIAL AND ERROR WASTES EXPERT TIME. Machine Learning is extremely powerful; tuning Machine Learning systems is extremely non-intuitive.
  3. UNRESOLVED PROBLEM IN ML. What is the most important unresolved problem in machine learning? "...we still don't really know why some configurations of deep neural networks work in some case and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters." Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix). https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
  4. LOTS OF TUNABLE PARAMETERS
  5. COMMON APPROACH 1. Random search or grid search 2. Expert-defined grid search near "good" points 3. Refine the domain and repeat ("grad student descent"). Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012
  6. COMMON APPROACH ● Expert intensive ● Computationally intensive ● Potentially finds only local optima ● Does not fully exploit useful information. 1. Random search or grid search 2. Expert-defined grid search near "good" points 3. Refine the domain and repeat ("grad student descent"). Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012
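For concreteness, here is a minimal sketch of the grid-search and random-search baselines described on these two slides, using scikit-learn; the model, dataset, and parameter ranges are illustrative choices, not from the talk.

```python
# Grid search vs. random search over a small hyperparameter space
# (illustrative model and data, not from the talk).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Expert-defined grid: cost grows multiplicatively with every added parameter.
grid = GridSearchCV(
    SVC(), {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]}, cv=3
)
grid.fit(X, y)

# Random search: sample the same space with a fixed budget of evaluations.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-1, 1e2), "gamma": loguniform(1e-4, 1e-2)},
    n_iter=12, cv=3, random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```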
  7. OPTIMAL LEARNING ● "... the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive." Prof. Warren Powell, Princeton ● "What is the most efficient way to collect information?" Prof. Peter Frazier, Cornell ● "How do we make the most money, as fast as possible?" Me, @DrScottClark
  8. BAYESIAN GLOBAL OPTIMIZATION ● Optimize some Overall Evaluation Criterion (OEC): loss, accuracy, likelihood, revenue ● Given tunable parameters: hyperparameters, feature parameters ● In an efficient way: sample the function as few times as possible, because training on big data is expensive. Details at https://sigopt.com/research
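A minimal sketch of this setup, with cross-validated accuracy standing in for the OEC; the model, data, and bounds are illustrative assumptions, not part of the original slides.

```python
# An OEC (here, mean cross-validated accuracy) as a function of tunable
# parameters, plus the parameter domain. Each objective call is one
# expensive model training run.
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Parameter domain: each entry is (lower bound, upper bound).
DOMAIN = {"learning_rate": (0.01, 0.3), "max_depth": (2, 8), "subsample": (0.5, 1.0)}

def objective(learning_rate, max_depth, subsample):
    """OEC: mean 3-fold accuracy for one parameter configuration."""
    model = GradientBoostingClassifier(
        learning_rate=learning_rate, max_depth=int(max_depth), subsample=subsample
    )
    return cross_val_score(model, X, y, cv=3).mean()
```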
  9. [Figure: sampling patterns of Grid Search vs. Random Search]
  10. GRID SEARCH SCALES EXPONENTIALLY [Figure: grid of sample points in a 4D space]
  11. BAYESIAN OPT SCALES LINEARLY [Figure: sample points in a 6D space]
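A back-of-the-envelope illustration of the scaling claim on these two slides; the grid resolution and per-dimension budget below are arbitrary numbers chosen for illustration, not a rule.

```python
# An evenly spaced grid needs k**d evaluations, while a Bayesian optimizer's
# budget is often set to grow roughly linearly with the dimension d.
for d in (2, 4, 6):
    grid_points = 5 ** d      # 5 values per parameter
    bayes_budget = 10 * d     # e.g. ~10 evaluations per parameter (illustrative)
    print(f"{d}D: grid = {grid_points:>6} evaluations, bayes ≈ {bayes_budget}")
```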
  12. HOW DOES IT FIT IN THE STACK? [Diagram: Big Data → Machine Learning Models with tunable parameters]
  13. HOW DOES IT FIT IN THE STACK? [Diagram adds: the models report an Objective Metric, and the optimizer optimally suggests new parameters]
  14. HOW DOES IT FIT IN THE STACK? [Diagram adds: the loop produces Better Models]
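The loop in this diagram can be sketched in a few lines; `Optimizer`, `suggest`, and `observe` below are hypothetical names used for illustration and do not reflect the actual SigOpt client API.

```python
# Where the optimizer sits in the stack: a suggest / train / report loop.
def tuning_loop(optimizer, objective, budget=20):
    best = None
    for _ in range(budget):
        params = optimizer.suggest()        # optimizer proposes new parameters
        metric = objective(**params)        # expensive training + evaluation
        optimizer.observe(params, metric)   # report the objective metric back
        if best is None or metric > best[1]:
            best = (params, metric)
    return best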
  15. QUICK EXAMPLES
  16. EX: LOAN CLASSIFICATION (xgboost) [Diagram: Loan Applications (Income, Credit Score, Loan Amount) → Default Prediction with tunable ML parameters → Prediction Accuracy; the optimizer optimally suggests new parameters → Better Accuracy]
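A hedged sketch of what the tunable loan-classification model might look like with xgboost; the synthetic data, feature names, and hyperparameter choices are stand-ins for the slide's inputs, not the actual experiment.

```python
# An xgboost default-prediction model whose hyperparameters are the tunable
# parameters reported back to the optimizer (synthetic, illustrative data).
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))  # stand-ins for income, credit score, loan amount
y = (X @ [1.5, -2.0, 1.0] + rng.normal(size=2000) > 0).astype(int)  # default label

def default_prediction_accuracy(max_depth, learning_rate, n_estimators, subsample):
    """The metric (prediction accuracy) reported back to the optimizer."""
    model = XGBClassifier(
        max_depth=int(max_depth), learning_rate=learning_rate,
        n_estimators=int(n_estimators), subsample=subsample,
    )
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
```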
  17. COMPARATIVE PERFORMANCE ● 8.2% better accuracy than baseline ● 100x faster than standard tuning methods [Chart: AUC (0.675 to 0.698) vs. iterations (1,000 to 100,000) for Grid Search and Random Search]
  18. EXAMPLE: ALGORITHMIC TRADING [Diagram: Market Data (Closing Prices, Day of Week, Market Volatility) → Trading Strategy with tunable weights and thresholds → Expected Revenue; the optimizer optimally suggests new parameters → Higher Returns]
  19. COMPARATIVE PERFORMANCE ● 200% higher model returns than expert ● 10x faster than standard methods [Chart: returns for Standard Method vs. Expert]
  20. HOW BAYESIAN OPTIMIZATION WORKS
  21. HOW DOES IT WORK? 1. Build a Gaussian Process (GP) with the points sampled so far 2. Optimize the fit of the GP (covariance hyperparameters) 3. Find the point(s) of highest Expected Improvement within the parameter domain 4. Return the next best point(s) to sample
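A minimal sketch of steps 1 through 4 on a one-dimensional toy problem, using scikit-learn's Gaussian process and the standard Expected Improvement formula; the kernel choice and candidate grid are illustrative, and production systems (including SigOpt) use more sophisticated machinery.

```python
# Fit a GP to the points sampled so far and score candidates by
# Expected Improvement (1D toy problem).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X_obs = np.array([[0.1], [0.4], [0.9]])   # parameters sampled so far
y_obs = np.array([0.3, 0.7, 0.5])         # objective values observed

# Steps 1-2: build the GP and optimize its covariance hyperparameters
# (fit() maximizes the marginal likelihood internally).
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_obs, y_obs)

# Step 3: Expected Improvement over a candidate set spanning the domain.
candidates = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(candidates, return_std=True)
best = y_obs.max()
z = (mu - best) / np.maximum(sigma, 1e-12)
ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Step 4: the next point to sample is the EI maximizer.
print("next suggestion:", candidates[np.argmax(ei)])
```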
  22. HOW DOES IT WORK? 1. User reports data 2. SigOpt builds a statistical model (Gaussian Process) 3. SigOpt finds the points of highest Expected Improvement 4. SigOpt suggests the best parameters to test next 5. User tests those parameters and reports the results to SigOpt 6. Repeat
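The same suggest-and-report loop is also available off the shelf in open-source libraries; the example below uses scikit-optimize's gp_minimize as an analogue (it is not the SigOpt API, and the objective is a toy function chosen for illustration).

```python
# End-to-end Bayesian optimization of a toy objective with scikit-optimize.
from skopt import gp_minimize

def objective(params):
    x, y = params
    return (x - 0.3) ** 2 + (y + 0.5) ** 2  # gp_minimize minimizes, so report a loss

result = gp_minimize(
    objective,
    dimensions=[(-1.0, 1.0), (-1.0, 1.0)],  # parameter domain
    n_calls=20,
    random_state=0,
)
print(result.x, result.fun)
```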
  28. EXTENDED EXAMPLE: EFFICIENTLY BUILDING CONVNETS
  29. PROBLEM ● Classify house numbers with more training data and a more sophisticated model
  30. CONVNET STRUCTURE ● TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset?
  31. STOCHASTIC GRADIENT DESCENT ● Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best ● They still require careful selection of the learning rate (α), momentum (β), and decay (γ) terms
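As a concrete reference point, here is how those terms appear when constructing an RMSProp optimizer in tf.keras; the argument names follow current TensorFlow, and mapping the slide's α/β/γ onto them is an interpretation, not something stated in the talk.

```python
# RMSProp hyperparameters in tf.keras (illustrative values).
import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=1e-3,  # alpha: step size
    momentum=0.9,        # beta: momentum term
    rho=0.9,             # gamma: decay of the squared-gradient moving average
)
```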
  32. STOCHASTIC GRADIENT DESCENT ● Comparison of several RMSProp SGD parametrizations ● It is not obvious which configurations will work best on a given dataset without experimentation
  33. RESULTS
  34. PERFORMANCE ● Average hold-out accuracy after 5 optimization runs of 80 objective evaluations each ● Optimized on a single 80/20 CV fold of the training set; accuracy reported on the held-out test set
      Method | Hold-out ACC
      SigOpt (TensorFlow CNN) | 0.8130 (+315.2% vs. the untuned CNN)
      Random Search (TensorFlow CNN) | 0.5690
      No Tuning (sklearn RF) | 0.5278
      No Tuning (TensorFlow CNN) | 0.1958
  35. COST ANALYSIS
      Model performance (CV acc. threshold) | Random Search cost | SigOpt cost | SigOpt cost savings | Potential savings in production (50 GPUs)
      87% | $275 | $42 | 84% | $12,530
      85% | $195 | $23 | 88% | $8,750
      80% | $46 | $21 | 55% | $1,340
      70% | $29 | $21 | 27% | $400
  36. EXAMPLE: TUNING DNN CLASSIFIERS ● CIFAR10 dataset: photos of objects, 10 classes ● Metric: accuracy in [0.1, 1.0] ● Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
  37. USE CASE: ALL CONVOLUTIONAL ● All-convolutional neural network ● Multiple convolutional and dropout layers ● Hyperparameter optimization was a mixture of domain expertise and grid search (brute force) ● http://arxiv.org/pdf/1412.6806.pdf
  38. MANY TUNABLE PARAMETERS ● epochs: "number of epochs to run fit", int [1, ∞) ● learning rate: influence of each step on the current weights, double (0, 1] ● momentum coefficient: "the coefficient of momentum", double (0, 1] ● weight decay: parameter affecting how quickly the weights decay, double (0, 1] ● depth: parameter affecting the number of layers in the net, int [1, ~20] ● gaussian scale: standard deviation of the initialization normal distribution, double (0, ∞) ● momentum step change: multiplicative amount by which to decrease momentum, double (0, 1] ● momentum step schedule start: epoch at which to start decreasing momentum, int [1, ∞) ● momentum schedule width: epoch stride for decreasing momentum, int [1, ∞) ... optimal values are non-intuitive
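Written out as a search domain, the list above might look like the following; the bounds are copied from the slide where given, and the open-ended ranges are capped at illustrative values.

```python
# The slide's tunable parameters as (type, lower bound, upper bound) entries.
# Caps on the open-ended ranges are illustrative choices, not from the talk.
PARAMETER_DOMAIN = {
    "epochs":                       ("int",    1,    200),
    "learning_rate":                ("double", 1e-6, 1.0),
    "momentum_coefficient":         ("double", 1e-6, 1.0),
    "weight_decay":                 ("double", 1e-6, 1.0),
    "depth":                        ("int",    1,    20),
    "gaussian_scale":               ("double", 1e-6, 10.0),
    "momentum_step_change":         ("double", 1e-6, 1.0),
    "momentum_step_schedule_start": ("int",    1,    200),
    "momentum_schedule_width":      ("int",    1,    50),
}
```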
  39. COMPARATIVE PERFORMANCE ● Expert baseline: 0.8995 accuracy (using neon) ● SigOpt best: 0.9011 accuracy, a 1.6% relative reduction in error rate, with no expert time wasted in tuning
  40. USE CASE: DEEP RESIDUAL ● Explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions ● Variable depth ● Hyperparameter optimization was a mixture of domain expertise and grid search (brute force) ● http://arxiv.org/pdf/1512.03385v1.pdf
  41. COMPARATIVE PERFORMANCE ● Expert baseline: 0.9339 accuracy (from the paper) ● SigOpt best: 0.9436 accuracy, a 15% relative reduction in error rate, with no expert time wasted in tuning
  42. Questions? scott@sigopt.com @DrScottClark https://sigopt.com @SigOpt
  43. TRY OUT SIGOPT FOR FREE ● https://sigopt.com/getstarted ● Quick example and intro to SigOpt ● No signup required ● Visual and code examples
  44. MORE EXAMPLES ● https://github.com/sigopt/sigopt-examples: examples of using SigOpt in a variety of languages and contexts ● Tuning Machine Learning Models (with code): a comparison of different hyperparameter optimization methods ● Using Model Tuning to Beat Vegas (with code): using SigOpt to tune a model for predicting basketball scores ● Learn more about the technology behind SigOpt at https://sigopt.com/research
  45. GPs: FUNCTIONAL VIEW
  46. GPs: FITTING THE GP [Figure panels: overfit, good fit, underfit]
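A small illustration of those three panels: the same GP fit with its kernel length scale fixed too short (overfit), learned from the data (good fit), and fixed too long (underfit); the toy data and kernel here are assumptions for illustration.

```python
# Over-, well-, and under-fit GPs via the kernel length scale
# (sklearn's default optimizer corresponds to step 2: optimizing
# the covariance hyperparameters).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 15).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=15)

for name, kernel, opt in [
    ("overfit",  RBF(length_scale=0.1),  None),             # fixed, too short
    ("good fit", RBF(length_scale=1.0),  "fmin_l_bfgs_b"),  # learned from data
    ("underfit", RBF(length_scale=50.0), None),             # fixed, too long
]:
    gp = GaussianProcessRegressor(kernel=kernel, optimizer=opt, alpha=0.01).fit(X, y)
    print(name, gp.kernel_)
```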
  47. USE CASE: CLASSIFICATION MODELS ● Problem: Machine Learning models have many non-intuitive tunable hyperparameters ● Before: standard methods use high resources for low performance ● After: SigOpt finds better parameters with 10x fewer evaluations than standard methods
  48. USE CASE: SIMULATIONS ● Problem: expensive simulations require high resources for every run ● Before: a brute-force tuning approach is prohibitively expensive ● After: SigOpt finds better results, +450% faster, with fewer required simulations
