Bayesian Optimization Primer
Ian Dewancker IAN@SIGOPT.COM
Michael McCourt MIKE@SIGOPT.COM
Scott Clark SCOTT@SIGOPT.COM
1. SigOpt
SigOpt employs Bayesian optimization to help experts tune
machine learning models and simulations. Instead of re-
sorting to standard techniques like grid search, random
search, or manual tuning, Bayesian optimization efficiently
trades off exploration and exploitation of the parameter
space to quickly guide the user into the configuration that
best optimizes some overall evaluation criterion (OEC) like
accuracy, AUC, or likelihood. In this short introduction
we present Bayesian optimization and several techniques
that SigOpt uses to optimize users' models and simulations.
For applications and examples of SigOpt using Bayesian
optimization on real-world problems, please visit
https://sigopt.com/research.
2. Bayesian Optimization
Bayesian optimization is a powerful tool for optimizing ob-
jective functions which are very costly or slow to evaluate
(Martinez-Cantin et al., 2007; Brochu et al., 2010; Snoek
et al., 2012). In particular, we consider problems where the
maximum is sought for an expensive function $f : X \to \mathbb{R}$,

$$x_{\text{opt}} = \operatorname*{arg\,max}_{x \in X} f(x),$$

within a domain $X \subset \mathbb{R}^d$ which is a bounding box (tensor
product of bounded and connected univariate domains).
Hyperparameter optimization for machine learning mod-
els is of particular relevance as the computational costs for
evaluating model variations are high, d is typically small,
and hyperparameter gradients are typically not available.
It is worth noting that Bayesian optimization techniques
can be effective in practice even if the underlying function
f being optimized is stochastic, non-convex, or even
discontinuous.
3. Bayesian Optimization Methods
Bayesian optimization methods (summarized effectively in
(Shahriari et al., 2015)) can be differentiated at a high level
by their regression models (discussed in Section 3.2) and
acquisition functions (discussed in Section 3.3). Several
open source Bayesian optimization software packages ex-
ist and many of their methods and techniques are incor-
porated into SigOpt, where applicable. Spearmint (Snoek
et al., 2012; 2014b;a), Hyperopt (Bergstra et al., 2013b;a),
SMAC (Hutter et al., 2011b;a; Falkner, 2014) and MOE
(Clark et al., 2014) are summarized in Table 1.
Table 1. Bayesian Optimization Software
SOFTWARE     REGR. MODEL         ACQ. FUNCTION
SPEARMINT    GAUSSIAN PROCESS    EXP. IMPROVEMENT
MOE          GAUSSIAN PROCESS    EXP. IMPROVEMENT
HYPEROPT     TREE PARZEN EST.    EXP. IMPROVEMENT
SMAC         RANDOM FOREST       EXP. IMPROVEMENT
In addition to these methods, non-Bayesian alternatives
for performing hyperparameter optimization include grid
search, random search (Bergstra & Bengio, 2012), and
particle swarm optimization (Kennedy & Eberhart, 1995).
These non-Bayesian techniques are often used in practice
due to the administrative overhead and expertise required
to get reasonable results from these, and other, open source
Bayesian optimization packages. SigOpt wraps a wide
swath of Bayesian Optimization research around a simple
API, allowing experts to quickly and easily tune their mod-
els and leverage these powerful techniques.
3.1. Sequential Model-Based Optimization
Sequential model-based optimization (SMBO) is a succinct
formalism of Bayesian optimization and useful when dis-
cussing variations (Hutter et al., 2011b; Bergstra et al.,
2011; Hoffman & Shahriari, 2014). In this section we will
use this formalism to contrast some of the methods em-
ployed by the open source techniques from Table 1, upon
which SigOpt draws. Typically, a probabilistic regression
model M is initialized using a small set of samples from
the domain X. Following this initialization phase, new lo-
cations within the domain are sequentially selected by op-
timizing an acquisition function S which uses the current
model as a cheap surrogate for the expensive objective f.
Each suggested function evaluation produces an observed
result yi = f(xi); note that the observation may be ran-
dom either because f is random or because the observation
process is subject to noise. This result is appended to the
historical set D = {(x1, y1), . . . , (xi, yi)}, which is used
to update the regression model for generating the next sug-
gestion. In practice there is often a quota on the total time
or resources available, which imposes a limit T on the total
number of function evaluations. Algorithm 1 encapsulates
this process.
Algorithm 1 Sequential Model-Based Optimization
  Input: f, X, S, M
  D ← INITSAMPLES(f, X)
  for i ← |D| to T do
    p(y | x, D) ← FITMODEL(M, D)
    x_i ← arg max_{x ∈ X} S(x, p(y | x, D))
    y_i ← f(x_i)            ▷ Expensive step
    D ← D ∪ {(x_i, y_i)}
  end for
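As a rough illustration, the loop in Algorithm 1 might be sketched in Python as follows; the helper functions `init_samples`, `fit_model`, and `propose_next` are hypothetical stand-ins for the initialization, model-fitting, and acquisition-maximization steps rather than functions from any of the packages above.

```python
# Minimal SMBO loop sketch. The helpers init_samples, fit_model, and
# propose_next are hypothetical stand-ins for the steps in Algorithm 1.
def smbo(f, domain, budget_T, init_samples, fit_model, propose_next):
    # Initialization phase: a small set of (x, y) observations from the domain.
    history = init_samples(f, domain)

    while len(history) < budget_T:
        # Fit the probabilistic regression model M to the history D.
        model = fit_model(history)

        # Maximize the acquisition function S over the domain using the
        # cheap surrogate model, yielding the next suggestion x_i.
        x_next = propose_next(model, domain)

        # Expensive step: evaluate the true objective.
        y_next = f(x_next)

        # Append the new observation and continue.
        history.append((x_next, y_next))

    # Return the best observation found (maximization).
    return max(history, key=lambda pair: pair[1])
```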
The initialization sampling strategy is often an important
consideration in the SMBO formalism. Viable approaches
include random, quasi-random, and Latin hypercube sam-
pling of the domain (Hoffman & Shahriari, 2014).
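For example, a Latin hypercube initialization over a bounding-box domain could be generated as sketched below; the two-dimensional bounds and sample count are illustrative only, and `scipy.stats.qmc` is just one library that provides such samplers.

```python
import numpy as np
from scipy.stats import qmc

# Illustrative bounding-box domain: two hyperparameters with made-up ranges.
lower = np.array([1e-4, 10.0])   # e.g. learning rate, number of trees
upper = np.array([1e-1, 500.0])

# Latin hypercube sample of 8 initial points in the unit cube, then scaled to
# the domain; random or quasi-random (e.g. Sobol) samplers work similarly.
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_points = sampler.random(n=8)
init_points = qmc.scale(unit_points, lower, upper)
```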
3.2. Probabilistic Regression Models ( M )
Various probabilistic regression models can be used in the
Bayesian optimization context (Eggensperger et al., 2013);
however, to operate in the standard SMBO algorithm the
model must define a predictive distribution p(y | x, D).
This distribution captures the uncertainty in the surrogate
reconstruction of the objective function.
3.2.1. GAUSSIAN PROCESSES
Gaussian processes (GPs) (Rasmussen & Williams, 2006)
have become a standard surrogate for modeling objective
functions in Bayesian optimization (Snoek et al., 2012;
Martinez-Cantin, 2014). In this setting, the function f is
assumed to be a realization of a GP with mean function $\mu$
and covariance kernel $K$, $f \sim \mathcal{GP}(\mu, K)$.
The choice of kernel function K in particular can have a
drastic effect on the quality of the surrogate reconstruc-
tion (Rasmussen & Williams, 2006). Automatic relevance
determination (ARD) kernels, also known as anisotropic
kernels, are common variants (Snoek et al., 2012; Duve-
naud, 2014). By default, Spearmint and MOE use the ARD
Matérn and ARD squared exponential kernels, respectively.
These kernels define important quantities needed to com-
pute the predictive distribution, including
$$\mathbf{k}(x) = \bigl[\, K(x, x_1) \;\; \cdots \;\; K(x, x_i) \,\bigr]^T, \qquad \mathbf{K}_{j,k} = K(x_j, x_k).$$

Predictions follow a normal distribution, therefore we
know $p(y \mid x, D) = \mathcal{N}(y \mid \hat{\mu}, \hat{\sigma}^2)$ (Fasshauer & McCourt,
2015), where, assuming $\mu(x) \equiv 0$,

$$\mathbf{y} = \bigl[\, y_1 \;\; \cdots \;\; y_i \,\bigr]^T,$$
$$\hat{\mu} = \mathbf{k}(x)^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y},$$
$$\hat{\sigma}^2 = K(x, x) - \mathbf{k}(x)^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}(x).$$
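As a concrete sketch, the predictive mean and variance above can be evaluated directly with NumPy. The ARD squared exponential kernel, lengthscales, and noise level below are assumptions made for illustration and are not the exact defaults of Spearmint or MOE.

```python
import numpy as np

def ard_sq_exp(X1, X2, lengthscales, variance=1.0):
    """ARD squared exponential kernel between rows of X1 and rows of X2."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior(x, X_obs, y_obs, lengthscales, noise_var=1e-6):
    """Predictive mean and variance at a single point x, assuming mu(x) = 0."""
    x = np.atleast_2d(x)
    K = ard_sq_exp(X_obs, X_obs, lengthscales)          # K_{j,k}
    k_x = ard_sq_exp(X_obs, x, lengthscales).ravel()    # k(x)
    # Solve (K + sigma_n^2 I) alpha = y rather than forming an explicit inverse.
    A = K + noise_var * np.eye(len(X_obs))
    alpha = np.linalg.solve(A, y_obs)
    mean = k_x @ alpha
    var = ard_sq_exp(x, x, lengthscales).item() - k_x @ np.linalg.solve(A, k_x)
    return mean, var
```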
Spearmint can be flexibly configured to either assume the
objective function is noiseless or attempt to estimate the
process noise parameter $\sigma_n^2$, while MOE allows for only a
manual specification of $\sigma_n^2$ for now.
3.2.2. RANDOM FORESTS
Another choice for the probabilistic regression model is
an ensemble of regression trees (Breiman et al., 1984;
Breiman, 2001). Random forests are used by the Sequen-
tial Model-based Algorithm Configuration (SMAC) library
(Hutter et al., 2011b). One common approach in con-
structing the predictive distribution is to assume a Gaussian
$\mathcal{N}(y \mid \hat{\mu}, \hat{\sigma}^2)$. The parameters $\hat{\mu}$ and $\hat{\sigma}$ may be chosen as
the empirical mean and variance of the regression values,
$r(x)$, from the set of regression trees $B$ in the forest,

$$\hat{\mu} = \frac{1}{|B|} \sum_{r \in B} r(x), \qquad \hat{\sigma}^2 = \frac{1}{|B| - 1} \sum_{r \in B} \bigl( r(x) - \hat{\mu} \bigr)^2.$$
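A minimal sketch of this empirical predictive distribution is shown below, using scikit-learn's random forest purely as a convenient ensemble; SMAC uses its own forest implementation, so this is an illustration of the idea rather than its code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_predictive(forest, x):
    """Empirical mean and variance of per-tree predictions at a point x."""
    x = np.atleast_2d(x)
    # One regression value r(x) per tree in the ensemble B.
    tree_preds = np.array([tree.predict(x)[0] for tree in forest.estimators_])
    mu_hat = tree_preds.mean()
    sigma2_hat = tree_preds.var(ddof=1)   # 1/(|B|-1) normalization
    return mu_hat, sigma2_hat

# Illustrative usage on observations X_obs, y_obs (assumed given):
# forest = RandomForestRegressor(n_estimators=100).fit(X_obs, y_obs)
# mu, sigma2 = forest_predictive(forest, x_candidate)
```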
An attractive property of regression trees is that they nat-
urally support domains with conditional variables (Hutter
et al., 2009; Bergstra et al., 2011), that is, variables that
only exist when another takes on a certain configuration
or range. As an example, consider a variable that controls
the selection between several machine learning algorithms.
Each algorithm may have different parameters on different
domains; these algorithm parameters are conditional vari-
ables, only active depending on the choice of algorithm
(Eggensperger et al., 2013). SMAC supports such condi-
tional variables, while the GP backed Spearmint and MOE
currently do not.
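One simple way to picture such a conditional search space, assuming an illustrative nested-dictionary representation and made-up parameter names, is sketched below: parameters are sampled only when their parent algorithm choice makes them active.

```python
import random

# Illustrative conditional search space: algorithm-specific parameters only
# exist (are "active") when the corresponding algorithm is selected.
search_space = {
    "algorithm": {
        "svm": {"C": (1e-3, 1e3), "gamma": (1e-4, 1e1)},
        "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 32)},
    }
}

def sample_configuration(space, rng=random):
    # Pick the parent variable first, then sample only its active children.
    algo = rng.choice(list(space["algorithm"].keys()))
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in space["algorithm"][algo].items()
    }
    return {"algorithm": algo, **params}
```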
3.2.3. TREE PARZEN ESTIMATORS
Using a tree-structured Parzen estimator (TPE) (Bergstra
et al., 2011; 2013b) deviates from the standard SMBO al-
gorithm in that it does not define a predictive distribution
over the objective function; instead, it creates two hierar-
chical processes, $\ell(x)$ and $g(x)$, acting as generative mod-
els for all domain variables. These processes model the
domain variables when the objective function is below and
above a specified quantile $y^*$,

$$p(x \mid y, D) = \begin{cases} \ell(x), & \text{if } y < y^*, \\ g(x), & \text{if } y \geq y^*. \end{cases}$$
Univariate Parzen estimators (Duda et al., 2000; Murphy,
2012) are organized in a tree structure, preserving any spec-
ified conditional dependence and resulting in a fit per vari-
able for each process $\ell(x)$, $g(x)$. With these two distribu-
tions, one can optimize a closed form term proportional to
expected improvement (Bergstra et al., 2011).
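A rough single-variable sketch of this selection rule is shown below, using kernel density estimates in place of the full tree-structured Parzen machinery; it follows the minimization convention of the original TPE formulation, so observations below the $y^*$ quantile are treated as the favorable set modeled by $\ell(x)$.

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(x_hist, y_hist, domain, gamma=0.25, n_candidates=256, seed=0):
    """Pick the candidate maximizing l(x)/g(x) for one continuous variable.

    Simplified single-variable sketch of the TPE idea, not the tree-structured
    estimator itself; x_hist and y_hist are 1-D arrays with at least a few
    observations on each side of the quantile split.
    """
    rng = np.random.default_rng(seed)
    y_star = np.quantile(y_hist, gamma)
    good = x_hist[y_hist < y_star]     # observations modeled by l(x)
    bad = x_hist[y_hist >= y_star]     # observations modeled by g(x)
    l_kde, g_kde = gaussian_kde(good), gaussian_kde(bad)

    candidates = rng.uniform(domain[0], domain[1], size=n_candidates)
    # Maximizing l(x)/g(x) corresponds to maximizing the term proportional
    # to expected improvement in the TPE formulation.
    scores = l_kde(candidates) / np.maximum(g_kde(candidates), 1e-12)
    return candidates[np.argmax(scores)]
```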
While this tree structure preserves the specified conditional
relationships, other variable interdependencies may not be
captured. Gaussian processes and random forests, in con-
trast, model the objective function as dependent on the en-
tire joint variable configuration. One benefit to using TPE
is that it naturally supports domains with specified condi-
tional variables.
3.3. Acquisition Function ( S )
Acquisition functions define a balance between exploring
new areas in the objective space and exploiting areas that
are already known to have favorable values. Many suitable
functions have been proposed (Jones et al., 1998; Brochu,
2010; Hoffman et al., 2011), including recent work involv-
ing information theory (Hernández-Lobato et al., 2014).
The acquisition function used by each method discussed in
this article is expected improvement (Jones et al., 1998).
Intuitively, it defines the nonnegative expected improve-
ment over the best previously observed objective value (de-
noted $f_{\text{best}}$) at a given location $x$,

$$EI(x \mid D) = \int_{f_{\text{best}}}^{\infty} (y - f_{\text{best}}) \, p(y \mid x, D) \, dy.$$
When the predictive distribution p(y | x, D) is Gaussian,
EI(x) has a convenient closed form (Jones et al., 1998).
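For reference, that closed form for a Gaussian predictive distribution with mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ can be written as the short sketch below (stated here for the maximization setting).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian predictive distribution (maximization)."""
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```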
A complete Bayesian treatment of EI involves integrat-
ing over parameters of the probabilistic regression model
(Snoek et al., 2012), as Spearmint does; this contrasts with
MOE and SMAC, which fit only a single parameterization.
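A hedged sketch of this integrated acquisition is given below: evaluate EI under each of several plausible hyperparameter settings (for example, MCMC samples) and average the results. The callables `posterior_fn` and `ei_fn` are assumed interfaces, such as the illustrative helpers sketched earlier.

```python
import numpy as np

def integrated_ei(x, posterior_fn, ei_fn, hyperparameter_samples, f_best):
    """Average EI over sampled model hyperparameters theta (e.g. from MCMC).

    posterior_fn(x, theta) -> (mu, var) and ei_fn(mu, sigma, f_best) are
    assumed callables, for instance the sketches above.
    """
    values = []
    for theta in hyperparameter_samples:
        mu, var = posterior_fn(x, theta)
        values.append(ei_fn(mu, np.sqrt(var), f_best))
    return float(np.mean(values))
```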
4. Conclusion
Bayesian optimization represents a powerful tool in helping
experts optimize their machine learning models and simu-
lations. Selecting the best Bayesian optimization technique
for each problem of interest is often non-intuitive. SigOpt
eliminates these issues by wrapping this powerful research
behind a simple API, allowing users to quickly realize the
promise of Bayesian optimization for their problems.
Detailed comparisons between these and other methods can
be found at https://sigopt.com/research and in a forthcoming
ICML paper.
References
Bergstra, James and Bengio, Yoshua. Random search for
hyper-parameter optimization. The Journal of Machine
Learning Research, 13(1):281–305, 2012.
Bergstra, James, Yamins, Dan, and Cox, David D.
Hyperopt: A python library for optimizing the
hyperparameters of machine learning algorithms.
https://github.com/hyperopt/hyperopt,
2013a.
Bergstra, James, Yamins, Daniel, and Cox, David. Making
a science of model search: Hyperparameter optimiza-
tion in hundreds of dimensions for vision architectures.
In Proceedings of The 30th International Conference on
Machine Learning, pp. 115–123, 2013b.
Bergstra, James S, Bardenet, Rémi, Bengio, Yoshua, and
Kégl, Balázs. Algorithms for hyper-parameter optimiza-
tion. In Advances in Neural Information Processing Sys-
tems, pp. 2546–2554, 2011.
Breiman, Leo. Random forests. Machine learning, 45(1):
5–32, 2001.
Breiman, Leo, Friedman, Jerome H, Olshen, Richard A,
and Stone, Charles J. Classification and Regression Trees.
Wadsworth, Belmont, CA, 1984.
Brochu, Eric. Interactive Bayesian Optimization. PhD
thesis, The University of British Columbia (Vancouver),
2010.
Brochu, Eric, Brochu, Tyson, and de Freitas, Nando. A
bayesian interactive optimization approach to procedu-
ral animation design. In Proceedings of the 2010 ACM
SIGGRAPH/Eurographics Symposium on Computer An-
imation, pp. 103–112. Eurographics Association, 2010.
Clark, Scott, Liu, Eric, Frazier, Peter, Wang, JiaLei, Oktay,
Deniz, and Vesdapunt, Norases. MOE: A global, black
box optimization engine for real world metric optimiza-
tion. https://github.com/Yelp/MOE, 2014.
Duda, Richard O., Hart, Peter E., and Stork, David G. Pat-
tern Classification (2nd Edition). Wiley-Interscience,
2000. ISBN 0471056693.
Duvenaud, David. Automatic model construction with
Gaussian processes. PhD thesis, University of Cam-
bridge, 2014.
Eggensperger, Katharina, Feurer, Matthias, Hutter, Frank,
Bergstra, James, Snoek, Jasper, Hoos, Holger, and
Leyton-Brown, Kevin. Towards an empirical foundation
for assessing bayesian optimization of hyperparameters.
In NIPS workshop on Bayesian Optimization in Theory
and Practice, 2013.
Falkner, Stefan. pysmac: simple Python interface to SMAC.
https://github.com/automl/pysmac, 2014.
Fasshauer, Gregory and McCourt, Michael. Kernel-based
Approximation Methods in MATLAB. World Scientific,
2015.
Hernández-Lobato, José Miguel, Hoffman, Matthew W,
and Ghahramani, Zoubin. Predictive entropy search for
efficient global optimization of black-box functions. In
Advances in Neural Information Processing Systems, pp.
918–926, 2014.
Hoffman, Matthew D, Brochu, Eric, and de Freitas, Nando.
Portfolio allocation for bayesian optimization. In UAI,
pp. 327–336. Citeseer, 2011.
Hoffman, Matthew W and Shahriari, Bobak. Modular
mechanisms for bayesian optimization. In NIPS Work-
shop on Bayesian Optimization, 2014.
Hutter, Frank, Hoos, Holger H, Leyton-Brown, Kevin, and
Stützle, Thomas. ParamILS: an automatic algorithm con-
figuration framework. Journal of Artificial Intelligence
Research, 36(1):267–306, 2009.
Hutter, Frank, Hoos, Holger, Leyton-Brown, Kevin,
Murphy, Kevin, and Ramage, Steve. SMAC:
Sequential model-based algorithm configuration.
http://www.cs.ubc.ca/labs/beta/Projects/SMAC/,
2011a.
Hutter, Frank, Hoos, Holger H, and Leyton-Brown, Kevin.
Sequential model-based optimization for general algo-
rithm configuration. In Learning and Intelligent Opti-
mization, pp. 507–523. Springer, 2011b.
Jones, Donald R, Schonlau, Matthias, and Welch,
William J. Efficient global optimization of expensive
black-box functions. Journal of Global Optimization, 13
(4):455–492, 1998.
Kennedy, James and Eberhart, Russell C. Particle swarm
optimization. In Proceedings of the 1995 IEEE Inter-
national Conference on Neural Networks, volume 4, pp.
1942–1948, Perth, Australia, IEEE Service Center, Pis-
cataway, NJ, 1995.
Martinez-Cantin, Ruben. Bayesopt: A bayesian optimiza-
tion library for nonlinear optimization, experimental de-
sign and bandits. The Journal of Machine Learning Re-
search, 15(1):3735–3739, 2014.
Martinez-Cantin, Ruben, de Freitas, Nando, Doucet, Ar-
naud, and Castellanos, José A. Active policy learning
for robot planning and exploration under uncertainty. In
Robotics: Science and Systems, pp. 321–328, 2007.
Murphy, Kevin P. Machine learning: a probabilistic per-
spective. MIT press, 2012.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Pro-
cesses for Machine Learning. The MIT Press, 2006.
Shahriari, Bobak, Swersky, Kevin, Wang, Ziyu, Adams,
Ryan P., and de Freitas, Nando. Taking the human out
of the loop: A review of bayesian optimization. Techni-
cal report, Universities of Harvard, Oxford, Toronto, and
Google DeepMind, 2015.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P.
Practical bayesian optimization of machine learning al-
gorithms. In Advances in neural information processing
systems, pp. 2951–2959, 2012.
Snoek, Jasper, Swersky, Kevin, Larochelle, Hugo,
Gelbart, Michael, Adams, Ryan P, and Zemel,
Richard S. Spearmint: Bayesian optimization soft-
ware. https://github.com/HIPS/Spearmint,
2014a.
Snoek, Jasper, Swersky, Kevin, Zemel, Richard S., and
Adams, Ryan Prescott. Input warping for bayesian op-
timization of non-stationary functions. In International
Conference on Machine Learning, 2014b.