Bayesian Optimization
of ML Hyper-parameters
Maksym Bevza
Research Engineer at Grammarly
ML solves complex problems
Computer vision
Machine translation
Speech recognition
Game playing
Other complex problems
● And many more
○ Recommender systems
○ Natural language understanding
○ Robotics
○ Grammatical error correction
○ ...
Growth of the number of parameters
● The number of parameters grows tremendously
○ Number of layers
○ Convolution kernel size
○ Number of neurons
○ Dropout drop rate
○ Learning rate
○ Batch size
● Preprocessing params
Tuning parameters is magic
● Complex systems are hard to analyse
● Impact of parameters on success is obscure
Tuning parameters is crucial
● Success of an ML algorithm depends on
○ Data
○ A good algorithm/architecture
○ Good parameter settings
Goals
● Introduce Bayesian Optimization to the audience
● Share personal experience
○ Results on digit recognition problem
○ Toolkits for Bayesian Optimization
Overview
● Tuning ML hyper-parameters
● Bayesian Optimization
● Available software
● Experiments in the research field
● My experiments
Tuning ML hyper-parameters
Tuning ML hyper-parameters
● Grid search
● Random search
● Grad student descent
Grid Search
1. Define a search space
2. Try all 4 * 3 = 12 configurations
Search space for SVM Classifier
{
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf']
}
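A minimal sketch of running this grid with scikit-learn's GridSearchCV (the digits dataset is just a stand-in for your own data):

# Exhaustive grid search over the SVM space above: 4 * 3 * 1 = 12 configurations.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf'],
}

X, y = load_digits(return_X_y=True)             # placeholder dataset
search = GridSearchCV(SVC(), param_grid, cv=3)  # runs all 12 configurations
search.fit(X, y)
print(search.best_params_, search.best_score_)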
Random Search
1. Define the search space
2. Sample the search space and run the ML algorithm
Search space for SVM Classifier
{
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf']
}
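A minimal sketch of the same search with scikit-learn's RandomizedSearchCV; note that the budget (n_iter) is fixed upfront:

# Random search: sample 20 configurations from the distributions above.
import scipy.stats
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'],
}

X, y = load_digits(return_X_y=True)             # placeholder dataset
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)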
Grid Search: pros & cons
● Fully automatic
● Parallelizable
● The number of experiments grows exponentially with the number of parameters
● Wastes time on unimportant parameters
● Some points in the search space are not reachable
● Does not learn from previous iterations
Random Search: pros & cons
● Fully automatic
● Parallelizable
● The number of iterations is set upfront
● No time wasted on unimportant parameters
● All points in the search space are reachable
● Does not learn from previous iterations
● Does not take into account evaluation cost
● f(x, y) = g(x) + h(y)
● h(y) contributes much less than g(x), i.e. y is an unimportant parameter
● Grid search evaluates g(x) at only a few distinct values of x, while random search covers many more
Grid Search vs Random Search
Grad Student Descent
● The researcher fiddles with the parameters until it works
The name of the method is due to Ryan Adams
Grad Student Descent: pros & cons
● Learns from previous iterations
● Takes into account evaluation cost
● Parallelizable
● Benefits from understanding semantics of hyper-parameters
● Search is biased
● Requires a lot of manual work
Comparison of all methods
Criterion                       | Grid Search | Random Search | Grad Student Descent
Fully automatic                 | Yes         | Yes           | No
Learns from previous iterations | No          | No            | Yes
Takes into account eval. cost   | No          | No            | Yes
Parallelizable                  | Yes         | Yes           | Yes
Reasonable search time          | No          | Yes           | Yes
Handles unimportant parameters  | No          | Yes           | Yes
Search is NOT biased            | Yes         | Yes           | No
Good software                   | Yes         | Yes           | N/A
Bayesian Optimization: the goal
● Fully automatic
● Learns from previous iterations
● Takes into account evaluation cost
● Search is not biased
● Parallelizable
● Caveat: the available software is non-free and not stable
Bayesian Optimization (BO)
What is it?
● Let’s treat our ML algorithm as a function f : X -> Y
● X is our search space of hyper-parameters
● Y is the set of scores we want to optimize
● Let’s consider the other parameters fixed (e.g. the dataset)
Background
● X is the search space, e.g.
{
    'C': [1, 1000],
    'gamma': [0.0001, 0.1],
    'kernel': ['rbf'],
}
Background: Examples
● We can optimize towards any score (even non-differentiable)
○ Validation error rate
○ AUC
○ Recall at fixed FPR
○ Many more
Background: Examples
● Our ML algorithm f gets similar scores for similar settings
● We can leverage this to try more promising settings
● For custom scores, make sure this assumption holds
Intuition
● Let’s consider a one-dimensional function f : R -> R
● Suppose we want to minimize f
An example
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
● Build all possible functions
● Less smooth functions are less probable
An example
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
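A minimal sketch of the surrogate idea above with scikit-learn's GaussianProcessRegressor (an illustration only; the objective f below is made up):

# Fit a GP to a few observed evaluations and query its posterior mean and uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):                                                # hypothetical expensive 1-D objective
    return np.sin(3 * x) + 0.1 * x ** 2

X_observed = np.array([[-2.0], [-0.5], [1.0], [2.5]])    # points evaluated so far
y_observed = f(X_observed).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

X_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_grid, return_std=True)          # posterior mean and std at candidates
# Smooth functions dominate the posterior; sigma is large far from observed points.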
Which point to try next?
● Exploration: Try places with high variance
● Exploitation: Try places with low mean
Exploration / Exploitation tradeoff
● Probability of Improvement (PI)
● Expected Improvement (EI)
● Other complicated ones
Strategies of choosing next point
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
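For concreteness, a sketch of Expected Improvement for minimization, computed from a GP posterior mean and standard deviation (illustrative code, not any particular toolkit's API):

# EI(x) = E[max(y_best - f(x), 0)] under the GP posterior.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # mu, sigma: posterior mean/std at candidate points; y_best: lowest observed value.
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improvement = y_best - mu
    with np.errstate(divide='ignore', invalid='ignore'):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0.0] = 0.0                     # no uncertainty -> no expected gain
    return ei

# The next point to evaluate is the candidate with the highest EI, e.g.
# x_next = X_grid[np.argmax(expected_improvement(mu, sigma, y_observed.min()))]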
Let’s go step by step
(Step-by-step illustrations of the optimization picking new points; figures from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf)
What about cost of evaluation?
● Hyper-parameters often impact the evaluation time
● Number of hidden layers, neurons per layer (Deep Learning)
● Number and depth of trees (Random Forest)
● Number of estimators (Gradient Boosting)
Time limits vs evaluation limits
● In practice we deal with time limits
● E.g. what’s the best set-up we can get in 7 days?
● Try cheap evaluations first
● Once a rough characterization of f is available, try expensive evaluations
How to account for cost of evaluation?
● Let’s estimate two functions at a time:
○ The function f itself
○ The cost of evaluation (duration) of function f
● We can use BO to estimate those functions
● Previously we chose the point with the highest Expected Improvement
● Pick the point with the highest EI per second instead
Strategy of choosing next point with cost
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
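A sketch in the spirit of EI per second (Snoek et al., 2012): model the log of the evaluation time with a second GP and rank candidates by EI divided by the predicted cost (illustrative code, variable names are mine):

# Cost-aware acquisition: prefer points that are promising AND cheap to evaluate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def ei_per_second(ei, X_observed, durations_sec, X_candidates):
    # durations_sec: measured run times (seconds) of the evaluations at X_observed.
    cost_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    cost_gp.fit(X_observed, np.log(durations_sec))     # modeling log-duration keeps predictions positive
    predicted_sec = np.exp(cost_gp.predict(X_candidates))
    return ei / np.maximum(predicted_sec, 1e-6)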
Comparison of all methods
Criterion                       | Grid Search | Random Search | Grad Student Descent | Bayesian Optimization
Fully automatic                 | Yes         | Yes           | No                   | Yes
Learns from previous iterations | No          | No            | Yes                  | Yes
Takes into account eval. cost   | No          | No            | Yes                  | Yes
Parallelizable                  | Yes         | Yes           | Yes                  | Tricky
Reasonable search time          | No          | Yes           | Yes                  | Yes
Handles unimportant parameters  | No          | Yes           | Yes                  | Yes
Search is NOT biased            | Yes         | Yes           | No                   | Yes
Good software                   | Yes         | Yes           | N/A                  | No
What’s the catch?
● Bayesian optimization software is tricky to build
● Leveraging clusters for parallelization is hard
● No hype around it
Available software
● The toolkits built by researchers are not well supported
○ Spearmint
○ SMAC
○ HyperOpt
○ BayesOpt
● Alternatives not based on Gaussian processes
○ TPE (Tree-structured Parzen Estimator)
○ PSO (Particle Swarm Optimization)
Available software
● SigOpt provides Bayesian Optimization as a service
● Claims state-of-the-art Bayesian Optimization
● Their customers
○ Prudential
○ Huawei
○ MIT
○ Hotwire
○ ...
SigOpt
def evaluate_model(assignments):
    return train_and_evaluate_cv(**assignments)
SigOpt API
from sigopt import Connection
conn = Connection(client_token='TOKEN')
SigOpt API
experiment = conn.experiments().create(
    name='Some Optimization (Python)',
    parameters=[
        dict(name='C', type='double', bounds=dict(min=0.0, max=1.0)),
        dict(name='gamma', type='double', bounds=dict(min=0.0, max=1.0)),
    ],
)
SigOpt API
for _ in range(30):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = evaluate_model(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
SigOpt API
Experiments in the research field
Snoek et al. (2012)
● CIFAR-10
○ 60000 images
○ 32x32 colour
○ 10 classes
● Error rate: 14.98%
○ New state-of-the-art result (in 2012)
Snoek et al. (2012)
● Error rate: 14.98%
● Previous: 18%
Extensive analysis by Clark et al. (2016)
● Extensive analysis of BO and other search methods
● Different types of functions
○ Oscillatory
○ Discrete values
○ Boring
○ ...
Extensive analysis by Clark et al. (2016)
● Comparison method
○ Best found
○ AUC
Extensive analysis by Clark et al. (2016)
● For each function
○ First placed
○ Top three
○ Borda
Extensive analysis by Clark et al. (2016)
My experiments
Task
● Digit recognition
● MNIST dataset
○ 70000 images
○ 28x28 grayscale
○ 10 classes
Model
Conv → Pool → Dropout → Conv → Pool → Dropout → Fully Connected → Dropout → Fully Connected → Dropout → Output (10)
● 6 parameters tuned
○ Number of filters per layer (1)
○ Number of convolution layers (1)
○ Dense layer sizes (2)
○ Batch size (1)
○ Learning rate (1)
Parameters of the model
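Purely for illustration, the search space for these six parameters might be written down roughly like this (names and ranges are hypothetical; the exact config format depends on the toolkit):

# Hypothetical search space for the six tuned parameters.
search_space = {
    'n_conv_layers': {'type': 'INT',   'min': 1,    'max': 3},
    'n_filters':     {'type': 'INT',   'min': 16,   'max': 128},
    'dense_size_1':  {'type': 'INT',   'min': 64,   'max': 1024},
    'dense_size_2':  {'type': 'INT',   'min': 64,   'max': 1024},
    'batch_size':    {'type': 'INT',   'min': 32,   'max': 256},
    'log10_lr':      {'type': 'FLOAT', 'min': -5.0, 'max': -1.0},  # learning rate on a log scale
}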
● Features
○ Parameter types: INT, FLOAT, ENUM
○ Evaluation data stored in MongoDB
○ Works with noisy functions
● License: Non-commercial usage
Spearmint
Results
MNIST Results: Random Search
MNIST Results: Bayesian Optimized
MNIST Results: Random vs Bayesian
● Best Random (avg): 1.20%
● Best Bayesian (avg): 0.86%
● Relative decrease in error rate: 28%
Final points
● Spearmint tries the boundaries of the search space first
○ Be cautious when setting up your search space
● Use logarithmic scales where they make sense
● Recommendations on the iteration limit
○ 10-20 iterations per parameter
Gotchas
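A tiny sketch of the logarithmic-scale gotcha (variable names are mine): let the optimizer tune the exponent and convert back before training.

# Search over log10(learning_rate) in [-5, -1] instead of learning_rate in [1e-5, 1e-1],
# so every order of magnitude gets comparable attention.
log10_lr = -3.2                     # value suggested by the optimizer
learning_rate = 10 ** log10_lr      # ~6.3e-4, used for training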
Conclusions
● Bayesian Optimization leads to better results
● SigOpt is hopefully the first stable implementation of BO
Thanks!
Maksym Bevza
Research Engineer at Grammarly
maksym.bevza@grammarly.com
www.grammarly.com
