BAYESIAN GLOBAL OPTIMIZATION
Using Optimal Learning to Tune ML Models
Scott Clark
scott@sigopt.com
OUTLINE
1. Why is Tuning ML Models Hard?
2. Standard Tuning Methods
3. Bayesian Global Optimization
4. Comparing Optimizers
5. Real World Examples
Machine Learning is
extremely powerful
Tuning Machine Learning systems is
extremely non-intuitive
https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
What is the most important unresolved
problem in machine learning?
“...we still don't really know why some configurations of
deep neural networks work in some case and not others,
let alone having a more or less automatic approach to
determining the architectures and the
hyperparameters.”
Xavier Amatriain, VP Engineering at Quora
(former Director of Research at Netflix)
TUNABLE PARAMETERS IN DEEP LEARNING
STANDARD METHODS
FOR HYPERPARAMETER SEARCH
EXAMPLE: FRANKE FUNCTION
[Figure: grid search vs. random search sampling patterns on the Franke function]
TUNING MACHINE LEARNING MODELS
[Diagram: big data feeds predictive models; the resulting objective metric is reported to the optimizer, which suggests new parameters, yielding better models.]
BAYESIAN GLOBAL OPTIMIZATION
… the challenge of how to collect information as efficiently as
possible, primarily for settings where collecting information is
time consuming and expensive.
Prof. Warren Powell - Princeton
What is the most efficient way to collect information?
Prof. Peter Frazier - Cornell
How do we make the most money, as fast as possible?
Scott Clark - CEO, SigOpt
OPTIMAL LEARNING
● Optimize objective function
○ Loss, Accuracy, Likelihood
● Given parameters
○ Hyperparameters, feature parameters
● Find the best hyperparameters
○ Sample function as few times as possible
○ Training on big data is expensive
BAYESIAN GLOBAL OPTIMIZATION
1. Build a Gaussian Process (GP) from the points sampled so far
2. Optimize the fit of the GP (covariance hyperparameters)
3. Find the point(s) of highest Expected Improvement within the parameter domain
4. Return the next point(s) to sample (a sketch of this loop follows below)
HOW DOES IT WORK?
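The loop above can be sketched in a few lines of Python. This is a minimal illustration only (not SigOpt's implementation); it assumes scikit-learn's GaussianProcessRegressor, a pre-computed grid of candidate points, and brute-force maximization of Expected Improvement.

# Minimal sketch of the Bayesian optimization loop described above.
# Assumes scikit-learn >= 0.18; not SigOpt's actual implementation.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expected_improvement(mu, sigma, best_so_far):
    # Closed-form EI for maximization (see the formula later in this section).
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * norm.cdf(z) + sigma * norm.pdf(z)

def propose_next_point(X_sampled, y_sampled, candidates):
    # Steps 1-2: build a GP on the points sampled so far; fit() optimizes
    # the covariance (kernel) hyperparameters by maximum likelihood.
    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_sampled, y_sampled)
    # Step 3: evaluate Expected Improvement over the candidate domain.
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, np.max(y_sampled))
    # Step 4: return the next best point to sample.
    return candidates[np.argmax(ei)]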
GAUSSIAN PROCESSES
[Figure sequence: a GP posterior built from the sampled points, with mean and uncertainty bands updating as observations are added; the covariance hyperparameters control the fit: overfit, good fit, underfit]
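The overfit / good fit / underfit panels come down to the covariance hyperparameters. A minimal sketch of that effect, assuming scikit-learn and an arbitrary 1-D test function (the length-scale values are illustrative choices, not from the talk):

# How the RBF length scale (a covariance hyperparameter) changes the GP fit.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(12, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(12)

for length_scale, label in [(0.1, "overfit"), (1.5, "good fit"), (20.0, "underfit")]:
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=length_scale),
                                  optimizer=None,  # hold the length scale fixed
                                  alpha=0.01)      # observation noise
    gp.fit(X, y)
    print(label, gp.log_marginal_likelihood())  # step 2 maximizes this quantity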
EXPECTED IMPROVEMENT
[Figure sequence: the Expected Improvement acquisition function computed from the GP posterior; its maximizer is chosen as the next point to sample]
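For reference, the standard closed form of Expected Improvement under a GP posterior with mean μ(x) and standard deviation σ(x), when maximizing and the best value observed so far is f*, is (this is the textbook expression, not copied from the slides):

\mathrm{EI}(x) \;=\; \mathbb{E}\big[\max(f(x) - f^{*},\, 0)\big]
\;=\; \big(\mu(x) - f^{*}\big)\,\Phi(z) \;+\; \sigma(x)\,\phi(z),
\qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)},

where Φ and φ are the standard normal CDF and PDF.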
EVALUATING THE OPTIMIZER
What is the best value found after optimization
completes?
METRIC: BEST FOUND
            BLUE     RED
BEST_FOUND  0.7225   0.8949
How quickly is the optimum found? (area under the best-seen curve)
METRIC: AUC
            BLUE     RED
BEST_FOUND  0.9439   0.9435
AUC         0.8299   0.9358
● Optimization test functions (e.g., Branin, Ackley, Rosenbrock)
● ML datasets (LIBSVM)
BENCHMARK SUITE
TEST FUNCTION TYPE        COUNT
Continuous Params         184
Noisy Observations        188
Parallel Observations     45
Integer Params            34
Categorical Params / ML   47
Failure Observations      30
TOTAL                     489
● On-demand AWS cluster for optimizing evaluation functions in parallel
● A full evaluation consists of ~10,000 optimizations and takes ~4 hours
INFRASTRUCTURE
VIZ TOOL: BEST SEEN TRACES
METRICS: STOCHASTICITY
● Run each optimizer 20 times per test function
● Use the Mann-Whitney U test to assess significance
RANKING OPTIMIZERS
● Alternative methods exist for black-box optimization: Spearmint, TPE, SMAC, PSO, Random Search, Grid Search
● It is important to understand and track how method performance differs across high-level categories of functions
● For a given test function, we want a partial ranking (allowing for ties) of method performance
RANKING OPTIMIZERS
● First, Mann-Whitney U tests using BEST_FOUND
● Tied results are then partially ranked using AUC
● Any remaining ties stay as ties in the final ranking
RANKING AGGREGATION
● Aggregate the partial rankings across all evaluation functions using a Borda count (each method scores the number of methods ranked below it); see the sketch below
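A minimal sketch of this aggregation, assuming each test function already yields a partial ranking expressed as a rank per method (1 = best, ties share a rank); the method names and ranks below are placeholders, not results from the benchmark.

# Hedged sketch of Borda-count aggregation of partial rankings.
from collections import defaultdict

def borda_aggregate(partial_rankings):
    scores = defaultdict(int)
    for ranking in partial_rankings:
        for method, rank in ranking.items():
            # Borda score: number of methods ranked strictly lower on this function.
            scores[method] += sum(1 for r in ranking.values() if r > rank)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Placeholder partial rankings from two test functions:
rankings = [
    {"SigOpt": 1, "TPE": 2, "SMAC": 2, "Random Search": 4},
    {"SigOpt": 1, "TPE": 1, "SMAC": 3, "Random Search": 4},
]
print(borda_aggregate(rankings))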
SHORT RESULTS SUMMARY
SIGOPT SERVICE
HOW DOES SIGOPT INTEGRATE?
[Diagram: big data feeds your predictive models; the objective metric is reported to SigOpt via the REST API, and SigOpt returns new parameters, yielding better models.]
SIMPLIFIED MANAGEMENT
[Diagram: the tuning workflow before SigOpt]
DISTRIBUTED MODEL TRAINING
● SigOpt serves as an AWS-ready distributed scheduler for training models across workers
● Each worker accesses the SigOpt API for the latest parameters to try (see the worker sketch below)
● Enables distributed training of non-distributed algorithms
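Each worker can run a loop like the following. This is a hedged sketch based on the public SigOpt Python client; the API token, experiment ID, and train_and_evaluate() helper are placeholders.

# Hedged sketch of a single tuning worker; many such workers can run in
# parallel on separate machines. Assumes the SigOpt Python client
# (pip install sigopt); train_and_evaluate() stands in for your
# model-training code and returns the objective metric.
from sigopt import Connection

conn = Connection(client_token="YOUR_API_TOKEN")  # placeholder token
EXPERIMENT_ID = 12345                             # placeholder experiment id

while True:
    # Ask the API for the next parameters to try.
    suggestion = conn.experiments(EXPERIMENT_ID).suggestions().create()
    value = train_and_evaluate(suggestion.assignments)  # e.g. CV accuracy
    # Report the result so the optimizer can update its model.
    conn.experiments(EXPERIMENT_ID).observations().create(
        suggestion=suggestion.id,
        value=value,
    )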
INTEGRATIONS
REST API
Questions?
scott@sigopt.com
@DrScottClark
https://sigopt.com
@SigOpt
SHORT EXAMPLES
EXAMPLE: LOAN DATA
[Diagram: loan applications (income, credit score, loan amount) feed a default-prediction model with tunable ML parameters; prediction accuracy is reported to SigOpt, which suggests new parameters, yielding better accuracy.]
COMPARATIVE PERFORMANCE
[Chart: model AUC vs. tuning cost for SigOpt, Grid Search, and Random Search; AUC axis from .675 to .698, cost axis from $1,000 / 100 hrs to $100,000 / 10,000 hrs]
● Better: 22% fewer bad loans vs. baseline
● Faster/Cheaper: 100x less time and AWS cost than standard tuning methods
EXAMPLE: ALGORITHMIC TRADING
[Diagram: market data (closing prices, day of week, market volatility) feeds a trading strategy with tunable weights and thresholds; expected revenue is reported to SigOpt, which suggests new parameters, yielding higher returns.]
COMPARATIVE PERFORMANCE
[Chart: model returns for SigOpt vs. the standard method and the expert baseline]
● Better: 200% higher model returns than the expert baseline
● Faster/Cheaper: 10x faster than standard tuning methods
1. SigOpt Live Demo
2. More Examples
a. Text Classification
b. Unsupervised + Supervised
c. Neural Nets with TensorFlow
ADDITIONAL TOPICS
AUTOMATICALLY TUNING TEXT
SENTIMENT CLASSIFIER
● Automatically tune text sentiment classifier
● Amazon product review dataset (35K labels)
e.g., “This DVD is great. It brings back all the memories of the holidays as a young child.”
● Logistic regression is generally a good place to start
PROBLEM
● Maximize the mean of k-fold cross-validation accuracies
● k = 5 folds; train and validation sets randomly split 70% / 30%
OBJECTIVE FUNCTION
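A minimal sketch of this objective with scikit-learn. The classifier (SGDClassifier with logistic loss and an elastic-net penalty, consistent with the error-cost parameters described later) and the mapping of log_reg_coef / l1_coef to alpha / l1_ratio are assumptions.

# Hedged sketch of the objective: mean accuracy over k = 5 random 70/30 splits.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

def objective(X, y, log_reg_coef, l1_coef):
    # loss="log" is the logistic loss (spelled "log_loss" in newer scikit-learn).
    clf = SGDClassifier(loss="log", penalty="elasticnet",
                        alpha=10 ** log_reg_coef,  # assumed log10 parametrization
                        l1_ratio=l1_coef)
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
    return np.mean(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))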
● n-gram vocabulary selection parameters
● (min_n_gram, ngram_offset) determine which n-gram lengths are used
● (log_min_df, df_offset) filter out n-grams outside a document-frequency (df) range
TEXT FEATURE PARAMETERS
Original Text “SigOpt optimizes any complicated system”
1-grams { “SigOpt”, “optimizes”, “any”, “complicated”, “system”}
2-grams { “SigOpt_optimizes”, “optimizes_any”, “any_complicated” … }
3-grams { “SigOpt_optimizes_any”, “optimizes_any_complicated” … }
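One plausible way these four parameters could configure a scikit-learn vectorizer; the exact mapping (log base, use of proportions for min_df / max_df) is an assumption, not taken from the talk.

# Hedged sketch: n-gram / document-frequency parameters -> CountVectorizer.
# The offset parametrization keeps the upper bound >= the lower bound for
# any values the optimizer proposes.
from sklearn.feature_extraction.text import CountVectorizer

def build_vectorizer(min_n_gram, ngram_offset, log_min_df, df_offset):
    min_df = 10 ** log_min_df              # assumed log10 of a df proportion
    max_df = min(1.0, min_df + df_offset)  # assumed df upper bound
    return CountVectorizer(
        ngram_range=(min_n_gram, min_n_gram + ngram_offset),
        min_df=min_df,
        max_df=max_df,
    )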
● Logistic regression error cost parameters:
  M   - number of training examples
  θ   - vector of weights the algorithm will learn for each n-gram in the vocabulary
  y_i - training data label: {-1, 1} for our two-class problem
  x_i - training data input vector: the BOW vectors described in the previous section
  α   - weight of the regularization term (log_reg_coef in our experiment)
  ρ   - weight of the l1 norm term (l1_coef in our experiment)
ERROR COST PARAMETERS
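The cost function itself did not survive extraction. Under the definitions above, a standard elastic-net regularized logistic loss of the following form is consistent with the parameters (a reconstruction assuming the scikit-learn SGDClassifier convention, not copied from the slide):

J(\theta) \;=\; \frac{1}{M} \sum_{i=1}^{M} \log\!\left(1 + e^{-y_i\, \theta^{\top} x_i}\right)
\;+\; \alpha \left( \rho\, \lVert \theta \rVert_1 \;+\; \frac{1 - \rho}{2}\, \lVert \theta \rVert_2^2 \right)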
● ~50-line Python snippet to train and tune the classifier with SigOpt
● ~20 lines to define the 6-parameter experiment and run the optimization loop using SigOpt (a hedged sketch follows below)
PYTHON CODE
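A hedged sketch of what that loop could look like with the SigOpt Python client. The parameter names come from the slides, but the types, bounds, and the evaluate_pipeline() helper are placeholders; this is not the talk's actual snippet.

# Hedged sketch: define the 6-parameter experiment and run the loop.
from sigopt import Connection

conn = Connection(client_token="YOUR_API_TOKEN")
experiment = conn.experiments().create(
    name="Sentiment classifier tuning",
    parameters=[
        dict(name="min_n_gram",   type="int",    bounds=dict(min=1, max=2)),
        dict(name="ngram_offset", type="int",    bounds=dict(min=0, max=2)),
        dict(name="log_min_df",   type="double", bounds=dict(min=-5.0, max=-2.0)),
        dict(name="df_offset",    type="double", bounds=dict(min=0.01, max=0.5)),
        dict(name="log_reg_coef", type="double", bounds=dict(min=-6.0, max=0.0)),
        dict(name="l1_coef",      type="double", bounds=dict(min=0.0, max=1.0)),
    ],
)

for _ in range(60):  # 60 function evaluations per run, as in the slides
    suggestion = conn.experiments(experiment.id).suggestions().create()
    # Placeholder: build the vectorizer + classifier from the suggested
    # parameters and return the cross-validation accuracy.
    value = evaluate_pipeline(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=value)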
● E[f(λ)] after 20 runs, each run consisting of 60 function evaluations
● For grid search: 64 evenly spaced parameter configurations (order shuffled randomly)
● SigOpt is statistically significantly better than grid and random search (p = 0.0001, Mann-Whitney U test)
PERFORMANCE
            SigOpt            Rnd. Search       Grid Search       No Tuning (Baseline)
Best Found  0.8760 (+5.72%)   0.8673 (+4.67%)   0.8680 (+4.76%)   0.8286
EXPLOITING UNLABELLED DATA
● Classify house-number digits with little labelled data
● Challenging digit variations and image clutter from neighboring digits
PROBLEM
● In general we’ll search for an optimized ML pipeline
OBJECTIVE
● Transform image patches into vectors of centroid distances, then pool to form the final representation
● SigOpt optimizes the selection of w, pool_r, K
UNSUPERVISED MODEL PARAMS
● A whitening transform is often useful as an image pre-processing step; ε_ZCA is exposed to SigOpt
UNSUPERVISED MODEL PARAMS
● Tune the sparsity of the centroid distance transform
● SigOpt optimizes the threshold (active_p) selection
UNSUPERVISED MODEL PARAMS
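A hedged numpy sketch of the centroid-distance featurization with a sparsity threshold. The centroids are assumed to come from k-means on whitened patches, and interpreting active_p as the fraction of closest centroids kept per patch is an assumption about the parametrization, not taken from the talk; pooling is only indicated in a comment.

# Hedged sketch: sparse centroid-distance features for image patches.
import numpy as np

def centroid_distance_features(patches, centroids, active_p):
    # patches: (n_patches, patch_dim); centroids: (K, patch_dim)
    dists = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=2)
    # Keep only the active_p fraction of smallest distances per patch;
    # zero out the rest to control sparsity.
    k_active = max(1, int(active_p * centroids.shape[0]))
    kth_smallest = np.partition(dists, k_active - 1, axis=1)[:, k_active - 1:k_active]
    features = np.where(dists <= kth_smallest, dists, 0.0)
    # Downstream: pool these features over a pool_r x pool_r grid per image.
    return features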
● Supervised model parameters exposed to SigOpt: learning rate, number of trees, and tree parameters (max_depth, sub_sample_sz); see the hedged sketch below
SUPERVISED MODEL PARAMS
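These parameter names suggest a gradient-boosted tree classifier; here is a minimal sketch assuming scikit-learn's GradientBoostingClassifier (the library choice and hyperparameter mapping are assumptions).

# Hedged sketch: the supervised model with the parameters above exposed for tuning.
from sklearn.ensemble import GradientBoostingClassifier

def build_supervised_model(learning_rate, n_trees, max_depth, sub_sample_sz):
    return GradientBoostingClassifier(
        learning_rate=learning_rate,
        n_estimators=n_trees,        # number of trees
        max_depth=max_depth,
        subsample=sub_sample_sz,     # fraction of samples used per tree
    )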
METRIC OPTIMIZATION
● 20 optimization runs, each consisting of 90 / 40 function evaluations for the Unsup / Raw feature settings respectively
● Optimized a single CV fold on the training set; ACC reported on the held-out test set
PERFORMANCE
              SigOpt           Rnd Search      SigOpt        Rnd Search    No Tuning RF
              (Unsup Feats)    (Unsup Feats)   (Raw Feats)   (Raw Feats)   (Raw Feats)
Hold Out ACC  0.8708 (+51.4%)  0.8583          0.6844        0.6739        0.5751
EFFICIENTLY BUILDING CONVNETS
● Classify house numbers with more training data and a more sophisticated model
PROBLEM
● TensorFlow makes it easier to design DNN architectures,
but what structure works best on a given dataset?
CONVNET STRUCTURE
● Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best
● They still require careful selection of the learning rate (α), momentum (β), and decay (γ) terms
STOCHASTIC GRADIENT DESCENT
● Comparison of several RMSProp SGD parametrizations
● It is not obvious which configuration will work best on a given dataset without experimentation (a hedged sketch of a tunable RMSProp optimizer follows below)
STOCHASTIC GRADIENT DESCENT
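A minimal sketch of exposing those three terms for tuning with the TensorFlow 1.x API of the talk's era; the loss tensor and any parameter bounds are placeholders.

# Hedged sketch: RMSProp with the learning rate, momentum, and decay terms
# exposed as tunable hyperparameters (TensorFlow 1.x API).
import tensorflow as tf

def build_train_op(loss, alpha, beta, gamma):
    # alpha = learning rate, beta = momentum, gamma = decay
    optimizer = tf.train.RMSPropOptimizer(learning_rate=alpha,
                                          decay=gamma,
                                          momentum=beta)
    return optimizer.minimize(loss)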
METRIC OPTIMIZATION
● Average hold-out accuracy after 5 optimization runs, each consisting of 80 objective evaluations
● Optimized a single 80/20 CV fold on the training set; ACC reported on the held-out test set
PERFORMANCE
              SigOpt             Rnd Search         No Tuning      No Tuning
              (TensorFlow CNN)   (TensorFlow CNN)   (sklearn RF)   (TensorFlow CNN)
Hold Out ACC  0.8130 (+315.2%)   0.5690             0.5278         0.1958
COST ANALYSIS
Model Performance     Random        SigOpt   SigOpt Cost   Potential Savings In
(CV Acc. threshold)   Search Cost   Cost     Savings       Production (50 GPUs)
87%                   $275          $42      84%           $12,530
85%                   $195          $23      88%           $8,750
80%                   $46           $21      55%           $1,340
70%                   $29           $21      27%           $400
https://sigopt.com/getstarted
Try it yourself!
MORE EXAMPLES
Automatically Tuning Text Classifiers (with code)
A short example using SigOpt and scikit-learn to build and tune a text
sentiment classifier.
Tuning Machine Learning Models (with code)
A comparison of different hyperparameter optimization methods.
Using Model Tuning to Beat Vegas (with code)
Using SigOpt to tune a model for predicting basketball scores.
Learn more about the technology behind SigOpt at
https://sigopt.com/research

Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016