MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk on Bayesian Global Optimization: Using Optimal Learning to Tune ML Models.


  1. 1. BAYESIAN GLOBAL OPTIMIZATION Using Optimal Learning to Tune ML Models Scott Clark scott@sigopt.com
  2. 2. OUTLINE 1. Why is Tuning ML Models Hard? 2. Standard Tuning Methods 3. Bayesian Global Optimization 4. Comparing Optimizers 5. Real World Examples
  3. 3. Machine Learning is extremely powerful
  4. 4. Machine Learning is extremely powerful Tuning Machine Learning systems is extremely non-intuitive
  5. 5. https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3 What is the most important unresolved problem in machine learning? “...we still don't really know why some configurations of deep neural networks work in some case and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.” Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix)
  6. 6. Photo: Joe Ross
  7. 7. TUNABLE PARAMETERS IN DEEP LEARNING
  8. 8. TUNABLE PARAMETERS IN DEEP LEARNING
  9. 9. TUNABLE PARAMETERS IN DEEP LEARNING
  10. 10. Photo: Tammy Strobel
  11. 11. STANDARD METHODS FOR HYPERPARAMETER SEARCH
  12. 12. EXAMPLE: FRANKE FUNCTION
  13. 13. Grid Search Random Search
  14. 14. TUNING MACHINE LEARNING MODELS [diagram: Big Data → Predictive Models → Objective Metric; New parameters → Better Models]
  15. 15. BAYESIAN GLOBAL OPTIMIZATION
  16. 16. … the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive. Prof. Warren Powell - Princeton What is the most efficient way to collect information? Prof. Peter Frazier - Cornell How do we make the most money, as fast as possible? Scott Clark - CEO, SigOpt OPTIMAL LEARNING
  17. 17. ● Optimize objective function ○ Loss, Accuracy, Likelihood ● Given parameters ○ Hyperparameters, feature parameters ● Find the best hyperparameters ○ Sample function as few times as possible ○ Training on big data is expensive BAYESIAN GLOBAL OPTIMIZATION
  18. 18. 1. Build Gaussian Process (GP) with points sampled so far 2. Optimize the fit of the GP (covariance hyperparameters) 3. Find the point(s) of highest Expected Improvement within parameter domain 4. Return optimal next best point(s) to sample HOW DOES IT WORK?
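A minimal sketch of this four-step loop, assuming scikit-learn and SciPy rather than SigOpt's own implementation; the Matern kernel, the random candidate search over the domain, and the helper names are illustrative choices, not the talk's code.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    # EI(x) = E[max(y_best - f(x), 0)] under the GP posterior (minimization)
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def propose_next(X_obs, y_obs, bounds, n_candidates=2000):
    # 1-2. Build a GP on the points sampled so far and fit its covariance
    #      hyperparameters (maximum marginal likelihood is the sklearn default).
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)
    # 3. Search the parameter domain for the point of highest Expected Improvement
    #    (here by scoring random candidates; bounds is an array of (min, max) rows).
    X_cand = np.random.uniform(bounds[:, 0], bounds[:, 1],
                               size=(n_candidates, bounds.shape[0]))
    ei = expected_improvement(gp, X_cand, y_obs.min())
    # 4. Return the next best point to sample
    return X_cand[np.argmax(ei)]
```

In practice step 3 would optimize Expected Improvement directly (e.g. with multi-start local search) rather than scoring random candidates.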
  19. 19. GAUSSIAN PROCESSES
  20. 20. GAUSSIAN PROCESSES
  21. 21. GAUSSIAN PROCESSES
  22. 22. GAUSSIAN PROCESSES
  23. 23. GAUSSIAN PROCESSES
  24. 24. GAUSSIAN PROCESSES
  25. 25. GAUSSIAN PROCESSES
  26. 26. GAUSSIAN PROCESSES
  27. 27. overfit good fit underfit GAUSSIAN PROCESSES
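A small illustration of the overfit / good fit / underfit comparison on slide 27, assuming scikit-learn; the data and length-scale values are made up. Fixing the kernel length scale far too small or too large gives the two bad extremes, while letting the GP maximize its log marginal likelihood (step 2 of the loop) recovers the good fit.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.random.uniform(0, 10, size=(15, 1))
y = np.sin(X).ravel() + 0.1 * np.random.randn(15)

for kernel, label in [
    (RBF(length_scale=0.05), "overfit: tiny fixed length scale"),
    (RBF(length_scale=50.0), "underfit: huge fixed length scale"),
    (RBF(length_scale=1.0),  "good fit: length scale learned"),
]:
    # Freeze the hyperparameters in the first two cases; in the third the GP
    # tunes them by maximizing the log marginal likelihood.
    optimizer = "fmin_l_bfgs_b" if "learned" in label else None
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.01,  # alpha = noise variance
                                  optimizer=optimizer).fit(X, y)
    print(label, gp.log_marginal_likelihood_value_)
```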
  28. 28. EXPECTED IMPROVEMENT
  29. 29. EXPECTED IMPROVEMENT
  30. 30. EXPECTED IMPROVEMENT
  31. 31. EXPECTED IMPROVEMENT
  32. 32. EXPECTED IMPROVEMENT
  33. 33. EXPECTED IMPROVEMENT
  34. 34. EVALUATING THE OPTIMIZER
  35. 35. What is the best value found after optimization completes? METRIC: BEST FOUND [plot of two optimizers: BLUE BEST_FOUND 0.7225, RED BEST_FOUND 0.8949]
  36. 36. How quickly is the optimum found? (area under the best-seen curve) METRIC: AUC [plot of two optimizers: BLUE BEST_FOUND 0.9439, AUC 0.8299; RED BEST_FOUND 0.9435, AUC 0.9358]
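A hypothetical helper for the two metrics on slides 35-36; the slides don't give SigOpt's exact AUC normalization, so the mean of the best-seen trace is used here as a stand-in.

```python
import numpy as np

def best_found_and_auc(values, maximize=True):
    # values: the objective values observed at each function evaluation, in order
    values = np.asarray(values, dtype=float)
    trace = np.maximum.accumulate(values) if maximize else np.minimum.accumulate(values)
    best_found = trace[-1]       # METRIC: BEST FOUND
    auc = trace.mean()           # METRIC: AUC, area under the best-seen trace / length
    return best_found, auc

print(best_found_and_auc([0.61, 0.70, 0.68, 0.88, 0.87, 0.89]))
```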
  37. 37. ● Optimization functions (e.g. Branin, Ackley, Rosenbrock) ● ML datasets (LIBSVM) BENCHMARK SUITE
      TEST FUNCTION TYPE        COUNT
      Continuous Params           184
      Noisy Observations          188
      Parallel Observations        45
      Integer Params               34
      Categorical Params / ML      47
      Failure Observations         30
      TOTAL                       489
  38. 38. ● On-demand cluster in AWS for parallel eval function optimization ● Full eval consists of ~10000 optimizations, taking ~4 hours INFRASTRUCTURE
  39. 39. VIZ TOOL : BEST SEEN TRACES
  40. 40. METRICS: STOCHASTICITY ● Run each 20 times ● Mann-Whitney U test for significance
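A sketch of the significance test on slide 40, assuming SciPy; each optimizer is run 20 times and the two samples of BEST_FOUND values are compared. The sample values here are made up for illustration.

```python
from scipy.stats import mannwhitneyu

best_found_blue = [0.71, 0.73, 0.70, 0.74, 0.72] * 4   # 20 runs of optimizer "blue"
best_found_red  = [0.88, 0.90, 0.89, 0.91, 0.87] * 4   # 20 runs of optimizer "red"

stat, p_value = mannwhitneyu(best_found_blue, best_found_red, alternative="two-sided")
significant = p_value < 0.05   # reject "no difference between optimizers" at the 5% level
print(stat, p_value, significant)
```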
  41. 41. RANKING OPTIMIZERS ● Alternate methods exist for black box optimization: Spearmint, TPE, SMAC, PSO, RND Search, Grid Search ● Important to understand / track method performance disparity on high-level categories of functions ● For a given test function, want a partial ranking (allowing for ties) of method performance
  42. 42. RANKING OPTIMIZERS ● First, Mann-Whitney U tests using BEST_FOUND ● Tied results then partially ranked using AUC ● Any remaining ties, stay as ties for final ranking
  43. 43. RANKING AGGREGATION ● Aggregate partial rankings across all eval functions using Borda count (sum of methods ranked lower)
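A hypothetical implementation of the Borda-count aggregation described on slide 43: a method's score on each test function is the number of methods ranked strictly below it, summed across all eval functions. The example rankings are made up.

```python
from collections import defaultdict

def borda_scores(partial_rankings):
    # partial_rankings: list of dicts {method: rank}; rank 1 is best, ties share a rank
    totals = defaultdict(int)
    for ranking in partial_rankings:
        for method, rank in ranking.items():
            # count the methods ranked strictly lower than `method` on this function
            totals[method] += sum(1 for r in ranking.values() if r > rank)
    return dict(totals)

rankings = [
    {"SigOpt": 1, "TPE": 2, "SMAC": 2, "RandomSearch": 4},
    {"SigOpt": 1, "TPE": 1, "SMAC": 3, "RandomSearch": 4},
]
print(sorted(borda_scores(rankings).items(), key=lambda kv: -kv[1]))
```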
  44. 44. SHORT RESULTS SUMMARY
  45. 45. SIGOPT SERVICE
  46. 46. HOW DOES SIGOPT INTEGRATE? [diagram: Big Data → Predictive Models → Objective Metric; New parameters → Better Models]
  47. 47. SIMPLIFIED MANAGEMENT Before SigOpt
  48. 48. DISTRIBUTED MODEL TRAINING ● SigOpt serves as an AWS-ready distributed scheduler for training models across workers ● Each worker accesses the SigOpt API for the latest parameters to try ● Enables distributed training of non-distributed algorithms
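A hedged sketch of the per-worker loop described on slide 48, assuming the SigOpt Python client's suggestion/observation endpoints; the experiment id and training function are placeholders, not the talk's code. Every worker runs the same loop, which is what lets non-distributed training code be parallelized across machines.

```python
from sigopt import Connection

EXPERIMENT_ID = "12345"  # placeholder experiment id

def train_and_evaluate(assignments):
    # placeholder: train a model with these hyperparameters and return its metric
    raise NotImplementedError

conn = Connection(client_token="SIGOPT_API_TOKEN")
while True:  # each worker loops until the experiment's observation budget is spent
    # Ask the API for the latest parameters to try ...
    suggestion = conn.experiments(EXPERIMENT_ID).suggestions().create()
    # ... train and evaluate a model with them on this worker ...
    value = train_and_evaluate(suggestion.assignments)
    # ... and report the observed metric back.
    conn.experiments(EXPERIMENT_ID).observations().create(
        suggestion=suggestion.id, value=float(value))
```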
  49. 49. INTEGRATIONS REST API
  50. 50. Questions? scott@sigopt.com @DrScottClark https://sigopt.com @SigOpt
  51. 51. SHORT EXAMPLES
  52. 52. EXAMPLE: LOAN DATA [diagram: Loan Applications (Income, Credit Score, Loan Amount) → Default Prediction with tunable ML parameters → Prediction Accuracy; New parameters → Better Accuracy]
  53. 53. COMPARATIVE PERFORMANCE [chart: accuracy (AUC, .675-.698) vs. tuning cost ($1,000 / 100 hrs up to $100,000 / 10,000 hrs) for Grid Search and Random Search] ● Better: 22% fewer bad loans vs baseline ● Faster/Cheaper: 100x less time and AWS cost than standard tuning methods
  54. 54. EXAMPLE: ALGORITHMIC TRADING [diagram: Market Data (Closing Prices, Day of Week, Market Volatility) → Trading Strategy with tunable weights and thresholds → Expected Revenue; New parameters → Higher Returns]
  55. 55. COMPARATIVE PERFORMANCE [chart comparing returns against Standard Method and Expert] ● Better: 200% higher model returns than expert ● Faster/Cheaper: 10x faster than standard methods
  56. 56. 1. SigOpt Live Demo 2. More Examples a. Text Classification b. Unsupervised + Supervised c. Neural Nets with TensorFlow ADDITIONAL TOPICS
  57. 57. AUTOMATICALLY TUNING TEXT SENTIMENT CLASSIFIER
  58. 58. ● Automatically tune text sentiment classifier ● Amazon product review dataset (35K labels), e.g.: “This DVD is great. It brings back all the memories of the holidays as a young child.” ● Logistic regression is generally a good place to start PROBLEM
  59. 59. ● Maximize mean of k-fold cross-validation accuracies ● k = 5 folds, train and valid randomly split 70%, 30% OBJECTIVE FUNCTION
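A sketch of this objective, assuming scikit-learn; `make_vectorizer` and `build_classifier` are hypothetical helpers sketched after slides 60 and 61 below, and `texts` / `labels` stand for the review corpus.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

def objective(assignments, texts, labels):
    # Rebuild the feature extractor and classifier from a SigOpt suggestion
    # (make_vectorizer / build_classifier are the hypothetical helpers below).
    model = make_pipeline(
        make_vectorizer(assignments["min_n_gram"], assignments["ngram_offset"],
                        assignments["log_min_df"], assignments["df_offset"]),
        build_classifier(assignments["log_reg_coef"], assignments["l1_coef"]),
    )
    # k = 5 folds, each a random 70% / 30% train / validation split
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    return np.mean(cross_val_score(model, texts, labels, cv=cv, scoring="accuracy"))
```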
  60. 60. ● n-gram vocabulary selection parameters ● (min_n_gram, ngram_offset) determine which n-grams ● (log_min_df, df_offset) filter for n-grams within df range TEXT FEATURE PARAMETERS Original Text “SigOpt optimizes any complicated system” 1-grams { “SigOpt”, “optimizes”, “any”, “complicated”, “system”} 2-grams { “SigOpt_optimizes”, “optimizes_any”, “any_complicated” … } 3-grams { “SigOpt_optimizes_any”, “optimizes_any_complicated” … }
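A hypothetical mapping of these feature parameters onto scikit-learn's CountVectorizer: (min_n_gram, ngram_offset) pick the n-gram range and (log_min_df, df_offset) bound the allowed document-frequency range. The parameter names come from the talk; the exact mapping is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

def make_vectorizer(min_n_gram, ngram_offset, log_min_df, df_offset):
    min_df = 10.0 ** log_min_df             # lower document-frequency bound (fraction of docs)
    return CountVectorizer(
        ngram_range=(min_n_gram, min_n_gram + ngram_offset),
        min_df=min_df,
        max_df=min(1.0, min_df + df_offset),  # upper document-frequency bound
    )

# e.g. the 1-grams and 2-grams of the slide's example sentence:
analyzer = make_vectorizer(1, 1, -3, 1.0).build_analyzer()
print(analyzer("SigOpt optimizes any complicated system"))
```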
  61. 61. ● Logistic regression error cost parameters: M = number of training examples; θ = vector of weights the algorithm will learn for each n-gram in vocabulary; y_i = training data label, {-1, 1} for our two-class problem; x_i = training data input vector (BOW vectors described in the previous section); α = weight of regularization term (log_reg_coef in our experiment); ρ = weight of l1 norm term (l1_coef in our experiment) ERROR COST PARAMETERS
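Given these symbol definitions, the cost on the slide is presumably the usual elastic-net-regularized logistic loss; a hedged scikit-learn sketch, where `alpha` plays the role of α (log_reg_coef) and `l1_ratio` the role of ρ (l1_coef). The log scaling of log_reg_coef is an assumption.

```python
from sklearn.linear_model import SGDClassifier

def build_classifier(log_reg_coef, l1_coef):
    return SGDClassifier(
        loss="log",                 # logistic regression ("log_loss" in newer scikit-learn)
        penalty="elasticnet",       # combined l1 / l2 regularization
        alpha=10.0 ** log_reg_coef, # overall regularization weight
        l1_ratio=l1_coef,           # weight of the l1 term
    )
```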
  62. 62. ● 50 line python snippet to train and tune classifier with SigOpt ● 20 lines to define 6 parameter experiment and run optimization loop using SigOpt PYTHON CODE
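A hedged sketch of what such a loop looks like (not the talk's actual snippet), assuming the SigOpt Python client; the parameter bounds are illustrative guesses, and `objective`, `texts`, `labels` come from the earlier sketches.

```python
from sigopt import Connection

conn = Connection(client_token="SIGOPT_API_TOKEN")
experiment = conn.experiments().create(
    name="Amazon review sentiment classifier",
    parameters=[
        dict(name="min_n_gram",   type="int",    bounds=dict(min=1,    max=2)),
        dict(name="ngram_offset", type="int",    bounds=dict(min=0,    max=2)),
        dict(name="log_min_df",   type="double", bounds=dict(min=-5.0, max=-2.0)),
        dict(name="df_offset",    type="double", bounds=dict(min=0.0,  max=0.5)),
        dict(name="log_reg_coef", type="double", bounds=dict(min=-6.0, max=1.0)),
        dict(name="l1_coef",      type="double", bounds=dict(min=0.0,  max=1.0)),
    ],
)

for _ in range(60):  # 60 function evaluations per run, as on slide 63
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = float(objective(suggestion.assignments, texts, labels))
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=value)
```

Grid and random search baselines can reuse the same `objective`, which is what makes the slide-63 comparison apples-to-apples.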
  63. 63. ● E[f(λ)] after 20 runs, each run consisting of 60 function evaluations ● For Grid Search: 64 evenly spaced parameter configurations (order shuffled randomly) ● SigOpt is statistically significantly better than grid and random search (p = 0.0001, Mann-Whitney U test) PERFORMANCE
      METHOD                 BEST FOUND
      SigOpt                 0.8760 (+5.72%)
      Rnd. Search            0.8673 (+4.67%)
      Grid Search            0.8680 (+4.76%)
      No Tuning (Baseline)   0.8286
  64. 64. EXPLOITING UNLABELLED DATA
  65. 65. ● Classify house number digits with lack of labelled data ● Challenging digit variations, image clutter with neighboring digits PROBLEM
  66. 66. ● In general we’ll search for an optimized ML pipeline OBJECTIVE
  67. 67. ● Transform image patches into vectors of centroid distances, then pool to form final representation ● SigOpt optimizes selection of w, pool_r, K UNSUPERVISED MODEL PARAMS
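A hedged sketch of this featurization in the spirit of Coates-style k-means features (the exact pipeline isn't spelled out in the slides): map each w x w patch to its distances from K learned centroids, then mean-pool the distance maps over a pool_r x pool_r grid of regions. The names w, pool_r, K come from the slide; the details are assumptions, and 2-D grayscale image arrays are assumed.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.image import extract_patches_2d

def fit_centroids(images, w=6, K=200):
    patches = np.vstack([extract_patches_2d(img, (w, w)).reshape(-1, w * w)
                         for img in images])
    return MiniBatchKMeans(n_clusters=K).fit(patches)

def centroid_distance_features(img, km, w=6, pool_r=2):
    patches = extract_patches_2d(img, (w, w)).reshape(-1, w * w)
    d = km.transform(patches)                 # distances to the K centroids
    side = int(np.sqrt(d.shape[0]))           # assume a square grid of patches
    d = d[: side * side].reshape(side, side, -1)
    rows = np.array_split(np.arange(side), pool_r)
    cols = np.array_split(np.arange(side), pool_r)
    pooled = [d[np.ix_(r, c)].mean(axis=(0, 1)) for r in rows for c in cols]
    return np.concatenate(pooled)             # final representation, length K * pool_r**2
```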
  68. 68. ● Whitening transform often useful as image data pre-processing step, expose εZCA to SigOpt UNSUPERVISED MODEL PARAMS
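A minimal ZCA-whitening sketch for this step, assuming data X of shape (n_samples, n_features); eps_zca is the regularization term εZCA exposed to SigOpt on the slide, and the default value here is only a placeholder.

```python
import numpy as np

def zca_whiten(X, eps_zca=1e-2):
    X = X - X.mean(axis=0)                     # zero-center the data
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA transform: rotate to the eigenbasis, rescale, rotate back
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps_zca)) @ eigvecs.T
    return X @ W
```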
  69. 69. ● Tune sparsity of centroid distance transform ● SigOpt optimizes threshold (active_p) selection UNSUPERVISED MODEL PARAMS
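A hypothetical sketch of the sparsity step: zero out every centroid distance above the active_p quantile of its feature vector, keeping only the closest centroids. The exact thresholding rule used in the talk is not given in the slides, so this is only one plausible choice.

```python
import numpy as np

def sparsify(features, active_p=0.25):
    # features: array of centroid-distance vectors, one row per example
    cutoff = np.quantile(features, active_p, axis=-1, keepdims=True)
    return np.where(features <= cutoff, features, 0.0)
```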
  70. 70. ● Learning rate, number of trees, and tree parameters (max_depth, sub_sample_sz) exposed to SigOpt SUPERVISED MODEL PARAMS
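A hedged sketch of the supervised model, assuming a scikit-learn gradient boosting classifier; the talk's actual model may differ (e.g. xgboost), but the exposed parameters map naturally onto these arguments.

```python
from sklearn.ensemble import GradientBoostingClassifier

def build_supervised_model(learning_rate, n_trees, max_depth, sub_sample_sz):
    return GradientBoostingClassifier(
        learning_rate=learning_rate,
        n_estimators=n_trees,      # number of trees
        max_depth=max_depth,
        subsample=sub_sample_sz,   # fraction of samples used to fit each tree
    )
```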
  71. 71. METRIC OPTIMIZATION
  72. 72. ● 20 optimization runs, each run consisting of 90 / 40 function evaluations for Unsup / Raw feature settings ● Optimized single CV fold on training set, ACC reported on test set as hold out PERFORMANCE
      METHOD                       HOLD OUT ACC
      SigOpt (Unsup Feats)         0.8708 (+51.4%)
      Rnd Search (Unsup Feats)     0.8583
      SigOpt (Raw Feats)           0.6844
      Rnd Search (Raw Feats)       0.6739
      No Tuning RF (Raw Feats)     0.5751
  73. 73. EFFICIENTLY BUILDING CONVNETS
  74. 74. ● Classify house numbers with more training data and more sophisticated model PROBLEM
  75. 75. ● TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset? CONVNET STRUCTURE
  76. 76. ● Per parameter adaptive SGD variants like RMSProp and Adagrad seem to work best ● Still require careful selection of learning rate (α), momentum (β), decay (γ) terms STOCHASTIC GRADIENT DESCENT
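A hedged sketch of exposing these terms, assuming TensorFlow's 1.x-era RMSProp optimizer (contemporary with the talk); the slide's α, β, γ map onto the learning rate, momentum, and decay arguments.

```python
import tensorflow as tf

def make_optimizer(alpha, beta, gamma):
    # TF 1.x API; in TF 2.x the equivalent lives under tf.keras.optimizers.RMSprop
    return tf.train.RMSPropOptimizer(
        learning_rate=alpha,
        momentum=beta,
        decay=gamma,
    )

# Example use, given a `loss` tensor from the CNN graph:
# train_op = make_optimizer(1e-3, 0.9, 0.9).minimize(loss)
```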
  77. 77. ● Comparison of several RMSProp SGD parametrizations ● Not obvious which configurations will work best on a given dataset without experimentation STOCHASTIC GRADIENT DESCENT
  78. 78. METRIC OPTIMIZATION
  79. 79. ● Avg hold out accuracy after 5 optimization runs consisting of 80 objective evaluations ● Optimized single 80/20 CV fold on training set, ACC reported on test set as hold out PERFORMANCE
      METHOD                          HOLD OUT ACC
      SigOpt (TensorFlow CNN)         0.8130 (+315.2%)
      Rnd Search (TensorFlow CNN)     0.5690
      No Tuning (sklearn RF)          0.5278
      No Tuning (TensorFlow CNN)      0.1958
  80. 80. COST ANALYSIS
      MODEL PERFORMANCE       RANDOM        SIGOPT   SIGOPT COST   POTENTIAL SAVINGS IN
      (CV ACC. THRESHOLD)     SEARCH COST   COST     SAVINGS       PRODUCTION (50 GPUs)
      87%                     $275          $42      84%           $12,530
      85%                     $195          $23      88%           $8,750
      80%                     $46           $21      55%           $1,340
      70%                     $29           $21      27%           $400
  81. 81. https://sigopt.com/getstarted Try it yourself!
  82. 82. MORE EXAMPLES Automatically Tuning Text Classifiers (with code) A short example using SigOpt and scikit-learn to build and tune a text sentiment classifier. Tuning Machine Learning Models (with code) A comparison of different hyperparameter optimization methods. Using Model Tuning to Beat Vegas (with code) Using SigOpt to tune a model for predicting basketball scores. Learn more about the technology behind SigOpt at https://sigopt.com/research
