Bayesian Optimization
of ML Hyper-parameters
Maksym Bevza
Research Engineer at Grammarly
ML solves complex problems
Computer vision
Machine translation
Speech recognition
Game playing
Other complex problems
● And many more
○ Recommender systems
○ Natural language understanding
○ Robotics
○ Grammatical error correction
○ ...
Growth of the number of parameters
● The number of parameters grows tremendously
○ Number of layers
○ Convolution kernel size
○ Number of neurons
○ Dropout drop rate
○ Learning rate
○ Batch size
● Preprocessing params
Tuning parameters is magic
● Complex systems are hard to analyse
● Impact of parameters on success is obscure
Tuning parameters is crucial
● Success of an ML algorithm depends on
○ Data
○ A good algorithm/architecture
○ Good parameter settings
Goals
● Introduce Bayesian Optimization to the audience
● Share personal experience
○ Results on digit recognition problem
○ Toolkits for Bayesian Optimization
Overview
● Tuning ML hyper-parameters
● Bayesian Optimization
● Available software
● Experiments in the research field
● My experiments
Tuning ML hyper-parameters
Tuning ML hyper-parameters
● Grid search
● Random search
● Grad student descent
Grid Search
1. Define a search space
2. Try all 4 * 3 = 12 configurations
Search space for SVM Classifier
{
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf']
}
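A minimal sketch of running this grid with scikit-learn's GridSearchCV (the digits dataset is just a stand-in for your own data):

# Exhaustive grid search over the SVM space above: 4 * 3 * 1 = 12 configurations.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [1, 10, 100, 1000],
    'gamma': [1e-2, 1e-3, 1e-4],
    'kernel': ['rbf'],
}

X, y = load_digits(return_X_y=True)             # placeholder dataset
search = GridSearchCV(SVC(), param_grid, cv=3)  # runs all 12 configurations
search.fit(X, y)
print(search.best_params_, search.best_score_)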
Random Search
1. Define the search space
2. Sample the search space and run the ML algorithm
Search space for SVM Classifier
{
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf']
}
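A minimal sketch of the same search with scikit-learn's RandomizedSearchCV; note that the budget (n_iter) is fixed upfront:

# Random search: sample 20 configurations from the distributions above.
import scipy.stats
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

param_distributions = {
    'C': scipy.stats.expon(scale=100),
    'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'],
}

X, y = load_digits(return_X_y=True)             # placeholder dataset
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)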
Grid Search: pros & cons
● Fully automatic
● Parallelizable
● The number of experiments grows exponentially with the number of parameters
● Wastes time on unimportant parameters
● Some points in the search space are not reachable
● Does not learn from previous iterations
Random Search: pros & cons
● Fully automatic
● Parallelizable
● The number of iterations is set upfront
● No time wasted on unimportant parameters
● All points in the search space are reachable
● Does not learn from previous iterations
● Does not take into account evaluation cost
● f(x, y) = g(x) + h(y)
● h(y) contributes much less than g(x), i.e. y is an unimportant parameter
● Grid search evaluates g(x) at only a few distinct values of x, while random search covers many more
Grid Search vs Random Search
Grad Student Descent
● The researcher fiddles with the parameters until it works
The name of the method is due to Ryan Adams
Grad Student Descent: pros & cons
● Learns from previous iterations
● Takes into account evaluation cost
● Parallelizable
● Benefits from understanding semantics of hyper-parameters
● Search is biased
● Requires a lot of manual work
Comparison of all methods
Criterion                       | Grid Search | Random Search | Grad Student Descent
Fully automatic                 | Yes         | Yes           | No
Learns from previous iterations | No          | No            | Yes
Takes into account eval. cost   | No          | No            | Yes
Parallelizable                  | Yes         | Yes           | Yes
Reasonable search time          | No          | Yes           | Yes
Handles unimportant parameters  | No          | Yes           | Yes
Search is NOT biased            | Yes         | Yes           | No
Good software                   | Yes         | Yes           | N/A
Bayesian Optimization: the goal
● Fully automatic
● Learns from previous iterations
● Takes into account evaluation cost
● Search is not biased
● Parallelizable
● Caveat: the available software is non-free and not stable
Bayesian Optimization (BO)
What is it?
● Let’s treat our ML algorithm as a function f : X -> Y
● X is our search space of hyper-parameters
● Y is the set of scores we want to optimize
● Let’s consider the other parameters fixed (e.g. the dataset)
Background
● X is the search space, e.g.
{
    'C': [1, 1000],
    'gamma': [0.0001, 0.1],
    'kernel': ['rbf'],
}
Background: Examples
● We can optimize towards any score (even non-differentiable)
○ Validation error rate
○ AUC
○ Recall at fixed FPR
○ Many more
Background: Examples
● Our ML algorithm f gets similar scores for similar settings
● We can leverage this to try more promising settings
● For custom scores, make sure this assumption holds
Intuition
● Let’s consider a one-dimensional function f : R -> R
● Suppose we want to minimize f
An example
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
● Build all possible functions
● Less smooth functions are less probable
An example
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
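A minimal sketch of the surrogate idea above with scikit-learn's GaussianProcessRegressor (an illustration only; the objective f below is made up):

# Fit a GP to a few observed evaluations and query its posterior mean and uncertainty.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):                                                # hypothetical expensive 1-D objective
    return np.sin(3 * x) + 0.1 * x ** 2

X_observed = np.array([[-2.0], [-0.5], [1.0], [2.5]])    # points evaluated so far
y_observed = f(X_observed).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

X_grid = np.linspace(-3, 3, 200).reshape(-1, 1)
mu, sigma = gp.predict(X_grid, return_std=True)          # posterior mean and std at candidates
# Smooth functions dominate the posterior; sigma is large far from observed points.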
Which point to try next?
● Exploration: Try places with high variance
● Exploitation: Try places with low mean
Exploration / Exploitation tradeoff
● Probability of Improvement (PI)
● Expected Improvement (EI)
● Other complicated ones
Strategies of choosing next point
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
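For concreteness, a sketch of Expected Improvement for minimization, computed from a GP posterior mean and standard deviation (illustrative code, not any particular toolkit's API):

# EI(x) = E[max(y_best - f(x), 0)] under the GP posterior.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    # mu, sigma: posterior mean/std at candidate points; y_best: lowest observed value.
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improvement = y_best - mu
    with np.errstate(divide='ignore', invalid='ignore'):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    ei[sigma == 0.0] = 0.0                     # no uncertainty -> no expected gain
    return ei

# The next point to evaluate is the candidate with the highest EI, e.g.
# x_next = X_grid[np.argmax(expected_improvement(mu, sigma, y_observed.min()))]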
Let’s go step by step
(Step-by-step illustrations of the optimization picking new points; figures from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf)
What about cost of evaluation?
● Hyper-parameters often impact the evaluation time
● Number of hidden layers, neurons per layer (Deep Learning)
● Number and depth of trees (Random Forest)
● Number of estimators (Gradient Boosting)
Time limits vs evaluation limits
● In practice we deal with time limits
● E.g. what’s the best set-up we can get in 7 days?
● Try cheap evaluations first
● Once a rough characterization of f is available, try expensive evaluations
How to account for cost of evaluation?
● Let’s estimate two functions at a time:
○ The function f itself
○ The cost of evaluation (duration) of function f
● We can use BO to estimate those functions
● Previously we chose the point with the highest Expected Improvement
● Pick the point with the highest EI per second instead
Strategy of choosing next point with cost
Image from https://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/Ryan_adams_140814_bayesopt_ncap.pdf
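A sketch in the spirit of EI per second (Snoek et al., 2012): model the log of the evaluation time with a second GP and rank candidates by EI divided by the predicted cost (illustrative code, variable names are mine):

# Cost-aware acquisition: prefer points that are promising AND cheap to evaluate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def ei_per_second(ei, X_observed, durations_sec, X_candidates):
    # durations_sec: measured run times (seconds) of the evaluations at X_observed.
    cost_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    cost_gp.fit(X_observed, np.log(durations_sec))     # modeling log-duration keeps predictions positive
    predicted_sec = np.exp(cost_gp.predict(X_candidates))
    return ei / np.maximum(predicted_sec, 1e-6)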
Comparison of all methods
Criterion                       | Grid Search | Random Search | Grad Student Descent | Bayesian Optimization
Fully automatic                 | Yes         | Yes           | No                   | Yes
Learns from previous iterations | No          | No            | Yes                  | Yes
Takes into account eval. cost   | No          | No            | Yes                  | Yes
Parallelizable                  | Yes         | Yes           | Yes                  | Tricky
Reasonable search time          | No          | Yes           | Yes                  | Yes
Handles unimportant parameters  | No          | Yes           | Yes                  | Yes
Search is NOT biased            | Yes         | Yes           | No                   | Yes
Good software                   | Yes         | Yes           | N/A                  | No
What’s the catch?
● Bayesian optimization software is tricky to build
● Leveraging clusters for parallelization is hard
● No hype around it
Available software
● The toolkits built by researchers are not well supported
○ Spearmint
○ SMAC
○ HyperOpt
○ BayesOpt
● Alternatives not based on Gaussian processes
○ TPE (Tree-structured Parzen Estimator)
○ PSO (Particle Swarm Optimization)
Available software
● SigOpt provides Bayesian Optimization as a service
● Claims state-of-the-art Bayesian Optimization
● Their customers
○ Prudential
○ Huawei
○ MIT
○ Hotwire
○ ...
SigOpt
def evaluate_model(assignments):
    return train_and_evaluate_cv(**assignments)
SigOpt API
from sigopt import Connection
conn = Connection(client_token='TOKEN')
SigOpt API
experiment = conn.experiments().create(
    name='Some Optimization (Python)',
    parameters=[
        dict(name='C', type='double', bounds=dict(min=0.0, max=1.0)),
        dict(name='gamma', type='double', bounds=dict(min=0.0, max=1.0)),
    ],
)
SigOpt API
for _ in range(30):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = evaluate_model(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
SigOpt API
Experiments in the research field
Snoek et al. (2012)
● CIFAR-10
○ 60000 images
○ 32x32 colour
○ 10 classes
● Error rate: 14.98%
○ New state-of-the-art result (in 2012)
Snoek et al. (2012)
● Error rate: 14.98%
● Previous: 18%
Extensive analysis by Clark et al. (2016)
● Extensive analysis of BO and other search methods
● Different types of functions
○ Oscillatory
○ Discrete values
○ Boring
○ ...
Extensive analysis by Clark et al. (2016)
● Comparison method
○ Best found
○ AUC
Extensive analysis by Clark et al. (2016)
● For each function
○ First placed
○ Top three
○ Borda
Extensive analysis by Clark et al. (2016)
My experiments
Task
● Digit recognition
● MNIST dataset
○ 70000 images
○ 28x28 grayscale
○ 10 classes
Model
Conv → Pool → Dropout → Conv → Pool → Dropout → Fully Connected → Dropout → Fully Connected → Dropout → Output (10)
● 6 parameters tuned
○ Number of filters per layer (1)
○ Number of convolution layers (1)
○ Dense layer sizes (2)
○ Batch size (1)
○ Learning rate (1)
Parameters of the model
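Purely for illustration, the search space for these six parameters might be written down roughly like this (names and ranges are hypothetical; the exact config format depends on the toolkit):

# Hypothetical search space for the six tuned parameters.
search_space = {
    'n_conv_layers': {'type': 'INT',   'min': 1,    'max': 3},
    'n_filters':     {'type': 'INT',   'min': 16,   'max': 128},
    'dense_size_1':  {'type': 'INT',   'min': 64,   'max': 1024},
    'dense_size_2':  {'type': 'INT',   'min': 64,   'max': 1024},
    'batch_size':    {'type': 'INT',   'min': 32,   'max': 256},
    'log10_lr':      {'type': 'FLOAT', 'min': -5.0, 'max': -1.0},  # learning rate on a log scale
}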
● Features
○ Parameter types: INT, FLOAT, ENUM
○ Evaluation data stored in MongoDB
○ Works with noisy functions
● License: Non-commercial usage
Spearmint
Results
MNIST Results: Random Search
MNIST Results: Bayesian Optimized
MNIST Results: Random vs Bayesian
● Best Random (avg): 1.20%
● Best Bayesian (avg): 0.86%
● Relative decrease in error rate: 28%
Final points
● Spearmint tries the boundaries of the search space first
○ Be cautious when setting up your search space
● Use logarithmic scales where they make sense
● Recommendations on the iteration limit
○ 10-20 iterations per parameter
Gotchas
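A tiny sketch of the logarithmic-scale gotcha (variable names are mine): let the optimizer tune the exponent and convert back before training.

# Search over log10(learning_rate) in [-5, -1] instead of learning_rate in [1e-5, 1e-1],
# so every order of magnitude gets comparable attention.
log10_lr = -3.2                     # value suggested by the optimizer
learning_rate = 10 ** log10_lr      # ~6.3e-4, used for training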
Conclusions
● Bayesian Optimization leads to better results
● SigOpt is hopefully the first stable implementation of BO
Thanks!
Maksym Bevza
Research Engineer at Grammarly
maksym.bevza@grammarly.com
www.grammarly.com
