This document discusses Bayesian global optimization and its application to tuning machine learning models. It begins by outlining some of the challenges of tuning ML models, such as the non-intuitive nature of the task. It then introduces Bayesian global optimization as an approach to efficiently search the hyperparameter space to find optimal configurations. The key aspects of Bayesian global optimization are described, including using Gaussian processes to build models of the objective function from sampled points and finding the next best point to sample via expected improvement. Several examples are provided demonstrating how Bayesian global optimization outperforms standard tuning methods in optimizing real-world ML tasks.
6. https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
What is the most important unresolved problem in machine learning?
“...we still don't really know why some configurations of deep neural networks work in some cases and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.”
Xavier Amatriain, VP Engineering at Quora
(former Director of Research at Netflix)
17. … the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time-consuming and expensive.
Prof. Warren Powell - Princeton
What is the most efficient way to collect information?
Prof. Peter Frazier - Cornell
How do we make the most money, as fast as possible?
Scott Clark - CEO, SigOpt
OPTIMAL LEARNING
18. ● Optimize objective function
○ Loss, Accuracy, Likelihood
● Given parameters
○ Hyperparameters, feature parameters
● Find the best hyperparameters
○ Sample function as few times as possible
○ Training on big data is expensive
BAYESIAN GLOBAL OPTIMIZATION
19. 1. Build a Gaussian Process (GP) with the points sampled so far
2. Optimize the fit of the GP (covariance hyperparameters)
3. Find the point(s) of highest Expected Improvement within the parameter domain
4. Return the next best point(s) to sample (a minimal sketch of the whole loop follows)
HOW DOES IT WORK?
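A minimal sketch of this loop under stated assumptions: scikit-learn's GaussianProcessRegressor stands in for the GP, expected improvement is the closed-form version for maximization, and the EI maximizer is approximated over random candidate points. The objective f and its bounds are placeholders; this is not SigOpt's implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    # EI for maximization: E[max(f(x) - y_best, 0)] under the GP posterior
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, bounds, n_init=5, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    dim = len(bounds)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_init, dim))      # initial random samples
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        # Steps 1-2: fit the GP; covariance hyperparameters are tuned by
        # marginal-likelihood maximization inside .fit()
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                      normalize_y=True).fit(X, y)
        # Step 3: approximate the EI maximizer over random candidates
        cand = rng.uniform(lo, hi, size=(2048, dim))
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        # Step 4: sample the objective at the suggested point
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmax(y)], y.max()
```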
42. RANKING OPTIMIZERS
● Alternate methods exist for black-box optimization: Spearmint, TPE, SMAC, PSO, random search, grid search
● It is important to understand and track how method performance differs across high-level categories of functions
● For a given test function, we want a partial ranking (allowing for ties) of method performance
43. RANKING OPTIMIZERS
● First, Mann-Whitney U tests using BEST_FOUND (see the sketch below)
● Tied results are then partially ranked using AUC
● Any remaining ties stay as ties in the final ranking
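A sketch of the first step, assuming each method's BEST_FOUND values from repeated runs are collected in lists (the numbers below are illustrative placeholders, not real benchmark results); scipy's mannwhitneyu supplies the test, and pairs it cannot separate fall through to the AUC tie-break.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# best_found[name] = BEST_FOUND value from each repeated run of that method
# (illustrative placeholder values)
best_found = {
    "sigopt":      [0.874, 0.876, 0.871, 0.875],
    "rnd_search":  [0.861, 0.868, 0.859, 0.866],
    "grid_search": [0.868, 0.867, 0.868, 0.866],
}

# Methods are considered tied unless the U test rejects at alpha
alpha = 0.01
for a, b in combinations(best_found, 2):
    stat, p = mannwhitneyu(best_found[a], best_found[b],
                           alternative="two-sided")
    verdict = "distinguishable" if p < alpha else "tied (break by AUC)"
    print(f"{a} vs {b}: p={p:.4f} -> {verdict}")
```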
49. DISTRIBUTED MODEL TRAINING
● SigOpt serves as an AWS-ready distributed scheduler for training models across workers
● Each worker accesses the SigOpt API for the latest parameters to try (see the worker-loop sketch below)
● Enables distributed training of non-distributed algorithms
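A sketch of one worker's loop against the SigOpt API; the token, experiment ID, and train_and_evaluate are placeholders. Because every worker runs the same pull-evaluate-report loop, training code that is not itself distributed can still be scheduled across many machines.

```python
from sigopt import Connection

conn = Connection(client_token="SIGOPT_API_TOKEN")  # placeholder token
EXPERIMENT_ID = 12345                               # placeholder ID

def train_and_evaluate(assignments):
    """Placeholder: train the model with these hyperparameters on this
    worker and return the objective value."""
    ...

# Every worker runs this identical loop; the SigOpt API hands each one
# the next parameters to try and collects the reported observations
while True:
    suggestion = conn.experiments(EXPERIMENT_ID).suggestions().create()
    value = train_and_evaluate(suggestion.assignments)
    conn.experiments(EXPERIMENT_ID).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
```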
54. COMPARATIVE PERFORMANCE
[Chart: model accuracy (AUC, .675–.698) vs. tuning cost ($1,000 / 100 hrs up to $100,000 / 10,000 hrs) for grid search and random search]
● Better: 22% fewer bad loans vs. baseline
● Faster/Cheaper: 100x less time and AWS cost than standard tuning methods
55. EXAMPLE: ALGORITHMIC TRADING
[Diagram: market data (closing prices, day of week, market volatility) feeds a trading strategy with tunable weights and thresholds; new parameters are suggested and evaluated against expected revenue, yielding higher returns]
59. ● Automatically tune a text sentiment classifier
● Amazon product review dataset (35K labels), e.g., “This DVD is great. It brings back all the memories of the holidays as a young child.”
● Logistic regression is generally a good place to start
PROBLEM
60. ● Maximize the mean of k-fold cross-validation accuracies
● k = 5 folds, each a random 70% / 30% train/validation split (see the sketch below)
OBJECTIVE FUNCTION
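A sketch of this objective in scikit-learn terms; make_classifier is a placeholder factory for the model being tuned, and ShuffleSplit produces the k = 5 random 70/30 splits.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, cross_val_score

def objective(make_classifier, params, X, y):
    # k = 5 folds, each a random 70% train / 30% validation split
    cv = ShuffleSplit(n_splits=5, test_size=0.30, random_state=0)
    scores = cross_val_score(make_classifier(**params), X, y,
                             cv=cv, scoring="accuracy")
    return np.mean(scores)   # maximize mean CV accuracy
```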
61. ● n-gram vocabulary selection parameters
● (min_n_gram, ngram_offset) determine which n-gram sizes are included
● (log_min_df, df_offset) filter n-grams by document frequency (df) range; both pairs are sketched in code below
TEXT FEATURE PARAMETERS
Original Text “SigOpt optimizes any complicated system”
1-grams { “SigOpt”, “optimizes”, “any”, “complicated”, “system”}
2-grams { “SigOpt_optimizes”, “optimizes_any”, “any_complicated” … }
3-grams { “SigOpt_optimizes_any”, “optimizes_any_complicated” … }
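One plausible mapping of these four parameters onto scikit-learn's CountVectorizer; the deck does not show the exact transformation, so the min_df/max_df arithmetic here is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

def make_vectorizer(min_n_gram, ngram_offset, log_min_df, df_offset):
    # (min_n_gram, ngram_offset): include n-grams of size
    # min_n_gram .. min_n_gram + ngram_offset
    # (log_min_df, df_offset): keep n-grams whose document frequency lies
    # in [10**log_min_df, 10**log_min_df + df_offset] (assumed form)
    min_df = 10 ** log_min_df
    return CountVectorizer(
        ngram_range=(min_n_gram, min_n_gram + ngram_offset),
        min_df=min_df,
        max_df=min_df + df_offset,
    )

corpus = ["SigOpt optimizes any complicated system",
          "SigOpt tunes machine learning models"]
vec = make_vectorizer(min_n_gram=1, ngram_offset=1,
                      log_min_df=-1.0, df_offset=0.5)
print(vec.fit(corpus).get_feature_names_out())
```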
62. ● Logistic regression error cost parameters (see the code sketch below); the cost being minimized is the elastic-net regularized log loss:

$$E(\theta) \;=\; \frac{1}{M}\sum_{i=1}^{M}\log\!\left(1 + e^{-y_i\,\theta^{\top}x_i}\right) \;+\; \alpha\left(\rho\,\lVert\theta\rVert_1 + \frac{1-\rho}{2}\,\lVert\theta\rVert_2^2\right)$$

M — number of training examples
θ — vector of weights the algorithm will learn for each n-gram in the vocabulary
y_i — training data label: {-1, 1} for our two-class problem
x_i — training data input vector: the BOW vectors described in the previous section
α — weight of the regularization term (log_reg_coef in our experiment)
ρ — weight of the l1 norm term (l1_coef in our experiment)
ERROR COST PARAMETERS
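A sketch of how the two cost parameters could map onto scikit-learn's elastic-net logistic regression; the 10**log_reg_coef transform is an assumption suggested by the log-scaled parameter name.

```python
from sklearn.linear_model import SGDClassifier

def make_classifier(log_reg_coef, l1_coef):
    # alpha = overall regularization weight, l1_ratio = l1/l2 mix (rho)
    return SGDClassifier(
        loss="log_loss",           # logistic regression objective
        penalty="elasticnet",
        alpha=10 ** log_reg_coef,  # log-scaled search, hence 10**x (assumed)
        l1_ratio=l1_coef,
        max_iter=1000,
    )
```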
63. ● A ~50-line Python snippet trains and tunes the classifier with SigOpt
● ~20 lines define the 6-parameter experiment and run the optimization loop using SigOpt (a condensed sketch follows)
PYTHON CODE
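The snippet itself is not reproduced in the deck; below is a condensed sketch of what a 6-parameter experiment definition and optimization loop look like with the SigOpt Python client. The bounds are illustrative, and evaluate_fold is a placeholder for the CV objective above.

```python
from sigopt import Connection

conn = Connection(client_token="SIGOPT_API_TOKEN")  # placeholder token

def evaluate_fold(assignments):
    """Placeholder: build the vectorizer + classifier from the
    assignments and return mean CV accuracy (slide 60's objective)."""
    ...

# Define the 6-parameter experiment (bounds are illustrative)
experiment = conn.experiments().create(
    name="Sentiment classifier",
    parameters=[
        dict(name="min_n_gram",   type="int",    bounds=dict(min=1, max=2)),
        dict(name="ngram_offset", type="int",    bounds=dict(min=0, max=2)),
        dict(name="log_min_df",   type="double", bounds=dict(min=-4.0, max=-1.0)),
        dict(name="df_offset",    type="double", bounds=dict(min=0.0, max=0.5)),
        dict(name="log_reg_coef", type="double", bounds=dict(min=-6.0, max=1.0)),
        dict(name="l1_coef",      type="double", bounds=dict(min=0.0, max=1.0)),
    ],
)

# Optimization loop: 60 function evaluations per run
for _ in range(60):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=evaluate_fold(suggestion.assignments),
    )
```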
64. ● E[f(λ)] after 20 runs, each run consisting of 60 function evaluations
● For grid search: 64 evenly spaced parameter configurations (order shuffled randomly)
● SigOpt's improvement is statistically significant over both grid and random search (p = 0.0001, Mann-Whitney U test)
PERFORMANCE
Best Found:
  SigOpt                0.8760 (+5.72%)
  Rnd. Search           0.8673 (+4.67%)
  Grid Search           0.8680 (+4.76%)
  No Tuning (Baseline)  0.8286
66. ● Classify house-number digits with limited labeled data
● Challenging digit variations and image clutter from neighboring digits
PROBLEM
67. ● In general we’ll search for an optimized ML pipeline
OBJECTIVE
68. ● Transform image patches into vectors of centroid distances, then pool to form the final representation
● SigOpt optimizes the selection of w, pool_r, K (see the sketch below)
UNSUPERVISED MODEL PARAMS
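A sketch of that pipeline using scikit-learn building blocks; w is the patch size, K the number of centroids, and pool_r the number of pooling groups. The grid-average pooling here is an assumed stand-in for the deck's pooling scheme.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.image import extract_patches_2d

def centroid_distance_features(images, w, pool_r, K, seed=0):
    # Learn K centroids from random w-by-w patches (unsupervised step)
    patches = np.vstack([
        extract_patches_2d(img, (w, w), max_patches=50, random_state=seed)
          .reshape(-1, w * w)
        for img in images
    ])
    km = MiniBatchKMeans(n_clusters=K, random_state=seed).fit(patches)

    feats = []
    for img in images:
        # Distance of every patch to every centroid ...
        p = extract_patches_2d(img, (w, w)).reshape(-1, w * w)
        d = km.transform(p)                      # (n_patches, K)
        # ... then pool over pool_r groups of patches (assumed pooling)
        groups = np.array_split(d, pool_r, axis=0)
        feats.append(np.concatenate([g.mean(axis=0) for g in groups]))
    return np.array(feats)                       # (n_images, pool_r * K)
```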
69. ● A whitening transform is often useful as an image-data pre-processing step; expose ε_ZCA to SigOpt (sketched below)
UNSUPERVISED MODEL PARAMS
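A sketch of ZCA whitening with ε_ZCA exposed as the tunable regularizer:

```python
import numpy as np

def zca_whiten(X, eps_zca):
    # X: (n_samples, n_features) patch matrix; eps_zca regularizes small
    # eigenvalues so the transform doesn't amplify noise
    X = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps_zca)) @ eigvecs.T
    return X @ W
```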
70. ● Tune the sparsity of the centroid distance transform
● SigOpt optimizes the threshold (active_p) selection (see the sketch below)
UNSUPERVISED MODEL PARAMS
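The deck does not spell out the thresholding rule; one common choice, assumed here, keeps only each patch's closest active_p fraction of centroids and zeros the rest.

```python
import numpy as np

def sparsify(distances, active_p):
    # distances: (n_patches, K) centroid distances; keep only the closest
    # active_p fraction of centroids per patch (assumed rule)
    k_keep = max(1, int(active_p * distances.shape[1]))
    out = np.zeros_like(distances)
    idx = np.argsort(distances, axis=1)[:, :k_keep]   # closest centroids
    np.put_along_axis(out, idx,
                      np.take_along_axis(distances, idx, axis=1), axis=1)
    return out
```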
71. ● Learning rate, number of trees, and tree parameters (max_depth, sub_sample_sz) exposed to SigOpt (see the sketch below)
SUPERVISED MODEL PARAMS
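These four knobs map naturally onto scikit-learn's GradientBoostingClassifier, which is an assumed stand-in for the deck's supervised model.

```python
from sklearn.ensemble import GradientBoostingClassifier

def make_gbm(learning_rate, n_trees, max_depth, sub_sample_sz):
    # The four SigOpt-tuned knobs; the estimator choice is an assumption
    return GradientBoostingClassifier(
        learning_rate=learning_rate,
        n_estimators=n_trees,
        max_depth=max_depth,
        subsample=sub_sample_sz,   # fraction of samples used per tree
    )
```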
73. ● 20 optimization runs, each consisting of 90 / 40 function evaluations for the Unsup / Raw feature settings
● Optimized a single CV fold on the training set; ACC reported on the held-out test set
PERFORMANCE
Hold Out ACC:
  SigOpt (Unsup Feats)       0.8708 (+51.4%)
  Rnd Search (Unsup Feats)   0.8583
  SigOpt (Raw Feats)         0.6844
  Rnd Search (Raw Feats)     0.6739
  No Tuning RF (Raw Feats)   0.5751
75. ● Classify house numbers with more training data and a more sophisticated model
PROBLEM
76. ● TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset? (A parameterized sketch follows.)
CONVNET STRUCTURE
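A sketch of a structurally parameterized convnet in Keras terms; the filter counts, kernel size, and dense width are the tunable structure, and the architecture shown is illustrative, not the deck's.

```python
import tensorflow as tf

def build_cnn(n_filters_1, n_filters_2, kernel_size, dense_width,
              num_classes=10):
    # The structure itself is parameterized so the optimizer can search
    # over it; input shape matches 32x32 RGB house-number crops
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(n_filters_1, kernel_size, activation="relu",
                               input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(n_filters_2, kernel_size, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(dense_width, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```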
77. ● Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best
● They still require careful selection of the learning rate (α), momentum (β), and decay (γ) terms
STOCHASTIC GRADIENT DESCENT
78. ● Comparison of several RMSProp SGD parametrizations
● It is not obvious which configurations will work best on a given dataset without experimentation (see the sketch below)
STOCHASTIC GRADIENT DESCENT
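A sketch of exposing the three terms named above, using the Keras RMSprop optimizer (a newer API than the deck's; the argument mapping α → learning_rate, β → momentum, γ → rho is an assumption) together with the build_cnn sketch from the previous slide.

```python
import tensorflow as tf

def make_rmsprop(alpha, beta, gamma):
    # alpha: learning rate, beta: momentum, gamma: decay of the
    # squared-gradient moving average (rho in Keras terms, assumed mapping)
    return tf.keras.optimizers.RMSprop(
        learning_rate=alpha, momentum=beta, rho=gamma)

model = build_cnn(n_filters_1=32, n_filters_2=64, kernel_size=3,
                  dense_width=128)
model.compile(optimizer=make_rmsprop(alpha=1e-3, beta=0.9, gamma=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```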
80. ● Average hold-out accuracy after 5 optimization runs, each consisting of 80 objective evaluations
● Optimized a single 80/20 CV fold on the training set; ACC reported on the held-out test set
PERFORMANCE
Hold Out ACC:
  SigOpt (TensorFlow CNN)      0.8130 (+315.2%)
  Rnd Search (TensorFlow CNN)  0.5690
  No Tuning (sklearn RF)       0.5278
  No Tuning (TensorFlow CNN)   0.1958
83. MORE EXAMPLES
Automatically Tuning Text Classifiers (with code)
A short example using SigOpt and scikit-learn to build and tune a text
sentiment classifier.
Tuning Machine Learning Models (with code)
A comparison of different hyperparameter optimization methods.
Using Model Tuning to Beat Vegas (with code)
Using SigOpt to tune a model for predicting basketball scores.
Learn more about the technology behind SigOpt at
https://sigopt.com/research