An Expected Improvement Criterion for the Global Optimization of a
Noisy Computer Simulator
by
Kanika Anand
Thesis
submitted in partial fulfillment of the requirements for
the Degree of Master of Science (Mathematics and Statistics)
Acadia University
Spring Convocation 2015
© Kanika Anand, 2015
This thesis by Kanika Anand was defended successfully in an oral examination on
March 2, 2015.
The examining committee for the thesis was:
Dr. Michael Stokesbury, Chair
Dr. Chunfang Devon Lin, External Reader
Dr. Wilson Lu, Internal Reader
Dr. P. Ranjan, Supervisor
Dr. H. Chipman, Supervisor
Dr. J. Hooper, Head
This thesis is accepted in its present form by the Division of Research and Graduate
Studies as satisfying the thesis requirements for the degree Master of Science
(Mathematics and Statistics).
I, Kanika Anand, grant permission to the University Librarian at Acadia Univer-
sity to reproduce, loan or distribute copies of my thesis in microform, paper or
electronic formats on a non-profit basis. I, however, retain the copyright in my
thesis.
Author
Supervisor
Supervisor
Date
Acknowledgments
I would like to thank my supervisors Dr. P. Ranjan and Dr. H. Chipman for
their kind guidance and support throughout this research project. Their guid-
ance and encouragement made this a successful learning experience. I would like
to thank Acadia University for funding my research and giving me innumerable
opportunities to excel in my field. I would like to thank all the professors in the
Department of Mathematics and Statistics who helped me widen my knowledge
through coursework. Lastly, I am grateful to all my friends and family who always
support me in all my endeavors.
Contents
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract
1 Introduction
2 Bayesian Approach to Data Modeling
2.1 Overview
2.2 Markov Chain Monte Carlo
2.3 Advantages of Bayesian Approach
3 Surrogate Models
3.1 Deterministic Simulator
3.1.1 GP Model
3.2 Non-Deterministic Simulator
3.2.1 GP Model for Noisy Data
3.2.2 Bayesian Additive Regression Trees Model (BART)
3.2.3 Details of BART Model
3.2.4 Implementation of BART
3.2.5 Advantages of BART
3.3 Surrogate Model Fit
4 Global Optimization
4.1 Sequential Design
4.2 EI Criteria
4.2.1 Improvement Functions for Noisy Simulators
4.2.2 EI Criteria for GP Model
4.2.3 EI Criteria for BART Model
4.3 EI Optimization
4.3.1 Optimization using a One-shot Space Filling Design
4.3.2 Optimization using Genetic Algorithm
5 Results: One Dimensional Simulators
5.1 First 1-dimensional Computer Simulator
5.2 Second 1-dimensional Computer Simulator
5.3 Summary
6 Results: Higher Dimensional Simulators
6.1 2-dimensional Computer Simulator
6.2 4-dimensional Computer Simulator
6.3 Summary
7 Conclusions and Future Work
List of Figures
3.1 A simple realization of g(x; T, M) in the BART model
3.2 Three priors on residual variance τ when τ̂ = 2 is assumed for the BART model. In the legend, taken from CGM, "df" is ν and "quantile" corresponds to the quantile q at τ̂ = 2.
3.3 Surrogate models fitted to the simulator outputs generated using a one-dimensional test function: y(x) = −sin(4x − 2) − 2 exp(−480(x − 0.5)²) + ε, ε ~ N(0, 0.15²)
4.1 Tidal energy application
5.1 1d simulator in Example 1. The black curve is the underlying deterministic simulator and the noisy outputs are shown in blue.
5.2 1d simulator in Example 1. Each panel shows the median, 10% and 90% quantiles of the surrogate model minimum for the Picheny and Ranjan methods. Experimental settings are n0 = 15, nnew = 30 and δ = 0.05. The four panels show combinations of surrogate method (GP and BART) with EI optimizer (GA/one-shot space filling design approach). The horizontal lines show the simulator global minimum.
5.3 1d simulator in Example 1. Each panel shows the median, 10% and 90% quantiles of the distance between the simulator minimizer and surrogate model minimizer for the Picheny and Ranjan methods. Experimental settings are n0 = 15, nnew = 30 and δ = 0.05. The four panels show combinations of surrogate method (GP and BART) with EI optimizer (GA/one-shot space filling design approach).
5.4 1d simulator in Example 1. Surrogate model minimum results for δ = 0.2. Figure layout is the same as Figure 5.2.
5.5 1d simulator in Example 1. Distance between the simulator minimizer and surrogate model minimizer results for δ = 0.2. Figure layout is the same as Figure 5.3.
5.6 1d simulator in Example 2. The black curve is the underlying deterministic simulator and the noisy outputs are shown in blue.
5.7 1d simulator in Example 2. Surrogate model minimum results for δ = 0.05. Figure layout is the same as Figure 5.2.
5.8 1d simulator in Example 2. Distance between the simulator minimizer and surrogate model minimizer results for δ = 0.05. Figure layout is the same as Figure 5.3.
5.9 1d simulator in Example 2. Surrogate model minimum results for δ = 0.2. Figure layout is the same as Figure 5.2.
5.10 1d simulator in Example 2. Distance between the simulator minimizer and surrogate model minimizer results for δ = 0.2. Figure layout is the same as Figure 5.3.
6.1 Conditional plot of log-Goldprice function in Example 3.
6.2 Contour plot of log-Goldprice function in Example 3.
6.3 2d simulator in Example 3. Surrogate model minimum results with n0 = 20, nnew = 40 and δ = 0.05. Figure layout is the same as Figure 5.2.
6.4 2d simulator in Example 3. Distance between the simulator minimizer and surrogate model minimizer results with n0 = 20, nnew = 40 and δ = 0.05. Figure layout is the same as Figure 5.3.
6.5 2d simulator in Example 3. Surrogate model minimum results for δ = 0.2. Figure layout is the same as Figure 5.2.
6.6 2d simulator in Example 3. Distance between the simulator minimizer and surrogate model minimizer results for δ = 0.2. Figure layout is the same as Figure 5.3.
6.7 4d simulator in Example 4. Surrogate model minimum results with n0 = 40, nnew = 120 and δ = 0.05. Figure layout is the same as Figure 5.2.
6.8 4d simulator in Example 4. Distance between the simulator minimizer and surrogate model minimizer results with n0 = 40, nnew = 120 and δ = 0.05. Figure layout is the same as Figure 5.3.
6.9 4d simulator in Example 4. Plot of the response y vs. each coordinate of the input x for the GP model. This plot shows the sequential design procedure with n0 = 40, nnew = 120 and δ = 0.05.
6.10 4d simulator in Example 4. Surrogate model minimum results for δ = 0.2. Figure layout is the same as Figure 5.2.
6.11 4d simulator in Example 4. Distance between the simulator minimizer and surrogate model minimizer results for δ = 0.2. Figure layout is the same as Figure 5.3.
6.12 4d simulator in Example 4. Plot of the response y vs. each coordinate of the input x for the GP model. This plot shows the sequential design procedure with n0 = 40, nnew = 120 and δ = 0.2.
List of Tables
5.1 Summary of Results for Example 1
5.2 Summary of Results for Example 2
6.1 Summary of Results for Example 3
6.2 Summary of Results for Example 4
Abstract
A computer experiment is often used when physical experimentation is complex,
time consuming or expensive. In computer experiments, large computer codes
called computer simulators are written to represent numerical models of real phe-
nomena. Realistic simulators are often time consuming to run, and thus, we
approximate them with surrogate statistical models. In this thesis we consider
two surrogates, Gaussian process (GP) models and Bayesian Additive Regression
Trees (BART) models. Many simulators are deterministic, that is, re-running
the code with the same inputs gives identical results. Yet, it is well known that
many simulators often display numerical noise. Rather than lying on a smooth
curve, results appear to contain a random scatter about a smooth trend. In this
thesis, we focus on minimizing simulator output observed with noise. Efficient
optimization of an expensive simulator is a challenging problem. Jones, Schonlau
& Welch (1998) proposed a merit based criterion called Expected Improvement
(EI) for carefully choosing points in a sequential manner to identify the global
minimum of a deterministic simulator. Our objective is to compare the improve-
ment functions proposed by Picheny, Ginsbourger, Richet & Caplin (2013) and
Ranjan (2013) for global optimization of a noisy simulator. Four test functions are
used as simulators for performance comparison, and the EI optimization is done
either using a one-shot space-filling design or a genetic algorithm (GA).
Chapter 1
Introduction
Many phenomena such as car crashes, nuclear fusion, power generation and uni-
verse expansion are of utmost interest to us. Yet, clearly, direct physical ex-
periments to examine these complex phenomena are either impossible (universe
expansion), expensive (car crashes), infeasible (nuclear fusion) or time consuming
(power generation). As a result, mathematical models are often used to build
a realistic representation of these phenomena, enabling experimentation. For in-
stance, using a closed-form mathematical expression one can simulate the flow rate
through a borehole which is drilled from the ground surface through two aquifers
(Worley, 1987). These mathematical models also known as computer simulators,
typically take a number of inputs and when they are run generate a particular
output. Running computer simulators can also be time consuming when output
values are desired for a large number of different input settings or when the un-
derlying process is complex. In such situations, a second level of approximation,
a surrogate model is used to approximate the input and output relationship of
the simulator. These surrogates are flexible regression models, taking an input
vector x and predicting a real-valued output y. As with most statistical regression
models, a surrogate model is estimated using training data which consists of n
observations (x1, y1), ..., (xn, yn).
Unlike many real-world phenomena, a computer simulator has been tradition-
ally assumed to be deterministic. That is, every time the simulator is run with
input x, the same numeric value of output y is obtained. Surrogate models that
capture such behavior, i.e., exactly interpolating the training data, are popular in
computer experiments (Sacks, Welch, Mitchell & Wynn, 1989; Williams, Santner
& Notz, 2000).
Yet, it is well known that many computer simulators display numerical noise.
For instance, Forrester, Keane & Bressloff (2006) illustrates the use of a com-
putational fluid dynamics (CFD) simulator to calculate aerodynamic forces and
recognizes the noise in such simulations. In this thesis we focus on the computer
simulator with outputs that contain noise.
Noise in Computer Experiments
In physical experiments, noise usually accounts for several uncontrolled variables
such as variations of the experimental setup, measurement precision, etc. For
CFD simulations, Forrester et al. (2006) recognizes the noise due to three main
reasons: discretization error, incomplete convergence and inaccurate application
of boundary conditions.
Noise in computer experiments can have many sources including those observed
in the CFD simulations. The nature of the noise usually depends on the associated
simulator. When Monte Carlo simulations are involved in the output evaluation,
error can occur due to the finite number of Monte Carlo samples. The error
in Monte Carlo experiments is independent from one simulation to another, even
for measurements with the same input variables. See Gramacy and Lee (2012)
for further discussion on sources of error in computer simulators. Such simulators
with independent errors are popular in computer experiments and often referred
to as non-deterministic simulators.
Noisy simulator output is of the form

ỹ(x) = y(x) + ε,   (1.1)

where ε ~ N(0, τ²) is assumed to be independent noise for different input configurations, and y(x) ∈ R is the underlying deterministic simulator output for x ∈ D ⊂ R^d. The design problem of prime interest in this thesis is to estimate the minimum of y(x) and to identify the minimizer x when ỹ(x) is observed.
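To make (1.1) concrete, the following is a minimal R sketch of a noisy simulator, using the one-dimensional test function of Figure 3.3 as the deterministic part y(x); the function names are ours and are used for illustration only.

```r
# Minimal sketch of a noisy simulator of the form (1.1), using the 1-d test
# function of Figure 3.3 as the deterministic part y(x).
y <- function(x) -sin(4 * x - 2) - 2 * exp(-480 * (x - 0.5)^2)

# noisy simulator: adds independent N(0, tau^2) noise to every run
y.tilde <- function(x, tau = 0.15) y(x) + rnorm(length(x), mean = 0, sd = tau)

set.seed(1)
x <- runif(10)                       # 10 random input configurations in [0, 1]
cbind(x, y = y(x), y.tilde = y.tilde(x))
```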
We use a sequential design approach for finding the global minimum of y(x).
The approach starts by evaluating the response surface (i.e., the true computer
simulator) at a few points, modeling the response surface with a surrogate model
and then sequentially adding new points by maximizing a figure of merit, such
as the Expected Improvement (EI) criterion proposed by Jones et al. (1998)
and updating the surrogate model. This allows for the refinement of the surro-
gate model and for an increase in the prediction accuracy of the simulator global
minimizer.
Picheny et al. (2013) generalized the EI criterion of Jones et al. (1998) by
proposing a quantile-based criterion for the sequential optimization of a noisy
simulator. This allows for an elegant treatment of heterogeneous response preci-
sions. Ranjan (2013) proposed a slightly different formulation of the improvement
function as compared to the one in Picheny et al. (2013). The main objective of
this thesis is to compare the two improvement criteria proposed by Picheny et al.
(2013) and Ranjan (2013) under two surrogate models (GP and BART), where EI
optimization is done by either a one-shot space filling design approach or a genetic
algorithm (GA).
The remainder of the thesis is organized as follows. Chapter 2 reviews the
Bayesian literature to lay a foundation for building the surrogate models and
other methodologies. Chapter 3 discusses the statistical surrogate models used in
this research. In Chapter 4, we include details of the components of sequential
design and the expected improvement criteria. Chapter 5 and Chapter 6 present
simulation results. Finally, in Chapter 7 we conclude with some important remarks
and future recommendations.
Chapter 2
Bayesian Approach to Data
Modeling
Modern Bayesian statistics is a rich and powerful framework for inferential data
analysis. Since about 1990, there has been a dramatic growth in the use of
Bayesian methods. In some application areas today where data are scarce or ex-
hibit complex structures, a Bayesian approach is almost a hallmark of leading-edge
research. However, frequentist methods are still dominant in the more traditional
application areas of statistics. For the research presented in this thesis, a Bayesian
approach to data modeling provided significant advantages.
In this chapter we introduce the relevant Bayesian terminologies. Section 2.1
gives an overview of the Bayesian approach to statistics. Section 2.2 addresses
Bayesian computation via Markov chain Monte Carlo (MCMC) and provides some
background useful for EI computation. Section 2.3 lists the advantages of using
the Bayesian approach to statistical inference.
2.1 Overview
As with other approaches to statistics, the objective of analyzing data is to make
inferences about some unknown parameters. Knowledge about the parameters of
the model before observing the data is termed as “prior information”. A key fea-
ture of the Bayesian approach to statistical inference is that the prior information
can be incorporated into the learning process in a mathematically rigorous and
conceptually clear manner. That is, the parameter θ of a model is given a prior
distribution p(θ), and inference, given data y proceeds by combining the prior
with the likelihood p(y|θ) in Bayes’ theorem. This yields a posterior distribution
p(θ|y), defined as
p(θ|y) = p(y|θ) p(θ) / p(y).
Learning is an ongoing process and Bayes’ theorem reflects this fact in a very
elegant way. Bayes’ theorem can be applied sequentially to assimilate data piece
by piece. Thus at any point, the prior distribution represents the information that
is available before observing a particular piece of data. The resulting information
now is called the posterior distribution. In a way, “Today’s posterior is tomorrow’s
prior”. There are two key steps of a basic Bayesian method:
1. Bayesian Modeling: Bayesian modeling is a two-step process: (a) Identify
the unknown parameters and the inference questions about these parame-
ters that are to be answered. (b) Construct the likelihood and the prior
distribution to represent the available data and prior information.
2. Bayesian Analysis: Obtain the posterior distribution and derive inferences.
Bayesian analysis was very difficult until the advent of powerful computational
tools. For a multivariate θ, it is necessary to obtain marginal posterior distribu-
tions by integrating the posterior w.r.t. other elements of θ to make meaningful
inferential statements. It is typically this integration which is difficult especially
for high-dimensional θ.
Families of priors that combine with a likelihood to produce posterior distri-
butions in the same family are called conjugate. Conjugate priors are particularly
convenient because they lead to analytically tractable posteriors. In the period
from the birth of modern Bayesian thinking in the 1950s to at least the mid-1980s,
Bayesian analysis was restricted to situations in which conjugate prior distribu-
tions were available or very simple problems. Problems that could be analyzed
routinely by frequentist methods, such as generalized linear modeling with many
explanatory variables, were outside the reach of Bayesian methods. This changed
with the development of the computationally intensive but very powerful “Markov
chain Monte Carlo” (MCMC) algorithms, so that modeling in very complex multi-parameter situations is possible.
2.2 Markov Chain Monte Carlo
The main idea of MCMC is to establish a Markov chain whose stationary distribu-
tion is the posterior distribution of interest and to collect samples from that chain.
MCMC is based on two conceptually very simple ideas. The first is sampling based
computation, and the second is the theory of Markov chains. Suppose we wish to
compute the posterior mean of the parameter θ1 which is the first element of the
vector θ of say, k parameters. Formally this is
E(θ1|y) = ∫ θ1 p(θ|y) dθ1 dθ2 ⋯ dθk.
For moderate to large k (k ≥ 5) this computation is very intensive using numerical
integration. However, imagine that we could take a sample of N values from the
posterior distribution p(θ|y). Denote these by θ^(1), θ^(2), ..., θ^(N). Then we would in particular have a sample of N values of the first parameter θ1, obtained by taking the first element in each of the vectors θ^(i), i = 1, ..., N. We could use the sample mean as the approximation to E(θ1|y). By increasing the number of posterior samples, N, we can improve the accuracy of the approximation to E(θ1|y). Directly sampling like this from the posterior is sometimes feasible even in some quite large and complex problems and is referred to as Monte Carlo computation.
However, in most serious applications of Bayesian analysis, the posterior distri-
bution is too complex and high dimensional for this direct approach to be feasible.
We then employ the second device, which is based on the theory of Markov chains.
We again obtain a series of vectors θ^(1), θ^(2), ..., θ^(N), but these are not sampled directly from p(θ|y) and they are not independent. Instead each θ^(i) depends on the previous θ^(i−1) and is sampled from a proposal distribution g(θ^(i) | θ^(i−1)). This means that the θ^(i)'s form a Markov chain. The conditional distribution g depends on the data y as well and is known as the transition kernel of the chain. g is chosen such that, for sufficiently large i, the distribution of θ^(i) converges to the posterior distribution p(θ|y). Markov chain theory provides simple criteria under which this convergence will occur, and in practice there are numerous ways of constructing a suitable kernel to sample from any posterior distribution.
The combination of two ideas of sample-based computation and Markov chain
is MCMC. The following are a few important remarks to note while using MCMC.
• The number of runs until the Markov chain approaches stationarity depends on the starting point θ^(1). A poor choice of the starting point can greatly increase the time required for the chain to converge.
• Although convergence is guaranteed eventually, it is not possible to say how
large a sample must be taken before successive values can be considered to
be sampled from the posterior distribution.
• Successive points in the Markov chain are correlated, and the strength of this
correlation is very important. A highly correlated chain converges slowly and
moves slowly around the parameter space, so that a larger sample is needed
to compute relevant inferences accurately.
Thus for the successful implementation of an MCMC algorithm we have to make
the following choices:
Burn in
As a general practice, an initial portion of draws (such as a quarter) of the chain
are discarded. These samples are known as the burn-in. Burn-in makes our draws
closer to the stationary distribution and less dependent on the starting point.
Thinning
In order to break the dependence between draws in the Markov chain, it is a
common practice to keep every d-th draw of the chain. This is known as thinning.
The resultant sample will have less autocorrelation, and be closer to i.i.d. draws.
Thinning also saves memory since only a fraction of the draws are saved.
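As a small illustration (not tied to any particular model), the following R sketch applies burn-in and thinning to a matrix of MCMC draws and then approximates a posterior mean by a sample average; the matrix `draws` is a stand-in for real MCMC output.

```r
# Minimal sketch: apply burn-in and thinning to MCMC draws, then approximate
# a posterior mean by a sample average.  `draws` stands in for real MCMC
# output, with one row per iteration and one column per parameter.
n.iter <- 4000
draws  <- matrix(rnorm(n.iter * 3), ncol = 3)

burn.in <- 500
kept <- draws[-(1:burn.in), ]                    # discard the burn-in iterations
kept <- kept[seq(1, nrow(kept), by = 10), ]      # thinning: keep every 10th draw

colMeans(kept)                                   # sample-mean approximation to E(theta | y)
```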
2.3 Advantages of Bayesian Approach
A Bayesian approach to data modeling provides significant advantages to this
research in the following ways:
• MCMC provides full accounting of uncertainty. The posterior distribution
implicitly contains a full summary of the estimated model, rather than just
point estimates of its parameters. This provides flexibility in the formulation of the improvement function and a more reliable calculation of the Bayesian EI.
• Another major advantage of using a Bayesian approach is the ability to
find the predictive distribution of future observations. This, in turn, guides
the sequential design of the experiment and helps to identify the global
minimum.
Chapter 3
Surrogate Models
A complex mathematical model that produces a set of output values from a set
of input values is commonly referred to as a computer simulator. The name
stems from the necessity to have computers do extensive computations when the
model cannot be solved analytically in closed form and/or it requires an iterative
solution. Computer simulators can either be deterministic or non-deterministic. A
surrogate model is a cheap approximation to a computer simulator and can be used
to make predictions at new points in the input space without running the simulator
again. Section 3.1 reviews the Gaussian process (GP) model, a commonly used
surrogate model for deterministic simulators. Section 3.2 discusses the surrogate
models (GP and BART) for non-deterministic simulators and Section 3.3 uses
simple illustrations to compare and contrast the deterministic GP with the non-
deterministic GP and BART models.
3.1 Deterministic Simulator
Deterministic computer simulators are distinct from models of data from physical
experiments in the sense that they are often not subject to replication error or
observation error. Due to the lack of random error, traditional statistical modeling
approaches are not useful. A simulator is said to be deterministic if the replicate
runs of the same inputs will yield identical responses. Sacks et al. (1989) proposed
modeling (or emulating) such an expensive deterministic simulator as a realization
of a Gaussian stochastic process (GP) model. Mathematically, the deterministic
case corresponds to τ = 0 in (1.1). This implies ỹ(x) = y(x). In Section 3.1.1 we
shall write the response as y(x) to emphasize its deterministic nature.
3.1.1 GP Model
Let the i-th input and output of the computer simulator be denoted by a d-dimensional vector, xi = (xi1, ..., xid)', and the univariate response, yi = y(xi), respectively. The experimental design D0 = {x1, ..., xn} is the set of n input trials. The outputs of the simulation trials are held in the n-dimensional vector Y = y(D0) = (y1, ..., yn)'.

The simulator output, y(xi), is modeled as

y(xi) = µ + z(xi),   i = 1, ..., n,   (3.1)

where µ is the overall mean and z(xi) is a GP with E(z(xi)) = 0, Var(z(xi)) = σ², and Cov(z(xi), z(xj)) = σ²Rij for a suitably defined positive definite correlation structure Rij = R(xi, xj). In general, y(D0) has a multivariate normal distribution Nn(1nµ, Σ), where Σ = σ²R, 1n is an n × 1 vector of all ones, and R is the n × n correlation matrix.
Although there are several choices for the correlation structure Rij = R(xi, xj),
we use the Gaussian correlation because of its properties such as smoothness and
popularity in other areas like machine learning and geostatistics. The Gaussian
correlation structure is a special case (pk = 2) of the power exponential correlation
family,
R(xi, xj) = Π_{k=1}^{d} exp{ −θk |xik − xjk|^pk }   for all i, j.   (3.2)
The parameter θk ≥ 0 in (3.2) controls the sensitivity of the GP w.r.t. the k-th
coordinate, and a large θk results in y values that can vary quickly along this axis,
making the response function more complex. The correlation between y(xi) and
y(xj) falls quickly to zero as the difference between xi and xj in the k-th coordinate
increases. If θk = 0, the k-th input does not appear in the surrogate model, leading
to a reduction in dimension.
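For illustration, a minimal R sketch of the Gaussian correlation matrix in (3.2) (with pk = 2) is given below; `gauss_corr` is our own helper name, not part of any package.

```r
# Minimal sketch of the Gaussian correlation matrix in (3.2) with p_k = 2;
# X is an n x d design matrix and theta a length-d vector of correlation parameters.
gauss_corr <- function(X, theta) {
  n <- nrow(X)
  R <- matrix(1, n, n)
  for (i in 1:n) {
    for (j in 1:n) {
      R[i, j] <- exp(-sum(theta * (X[i, ] - X[j, ])^2))
    }
  }
  R
}

X <- matrix(runif(10 * 2), ncol = 2)    # a toy 10-point design in [0, 1]^2
R <- gauss_corr(X, theta = c(5, 0.5))   # larger theta_k => faster correlation decay in coordinate k
```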
The parameter estimation of a GP model is briefly reviewed here.
Parameter Estimation
Given a set of n simulator outputs y = (y1, ..., yn) at D0={x1, ..., xn}, and a cor-
relation structure R(xi, xj), suppose we wish to fit the GP model (i.e., to estimate
Ω = (θ1, ..., θd, µ, σ²)). The likelihood function for Ω is written as

L(Ω) = (2π)^(-n/2) |Σ|^(-1/2) exp{ −(1/2) (y − µ1n)' Σ⁻¹ (y − µ1n) },   (3.3)

which implies

L(Ω) ∝ |Σ|^(-1/2) exp{ −(1/2) (y − µ1n)' Σ⁻¹ (y − µ1n) }.

From this we obtain the log likelihood

l(Ω) = −(1/2) log(|Σ|) − (1/2) (y − µ1n)' Σ⁻¹ (y − µ1n) + constant
     = −(n/2) log(σ²) − (1/2) log(|R|) − (y − µ1n)' R⁻¹ (y − µ1n) / (2σ²).   (3.4)
Assuming θ = (θ1, ..., θd) as known and following the steps below, we find µ̂ and σ̂². By expanding the last term of the log likelihood (3.4) we get

(y − µ1n)' R⁻¹ (y − µ1n) = y' R⁻¹ y − y' R⁻¹ 1n µ − µ 1n' R⁻¹ y + µ² 1n' R⁻¹ 1n.

On differentiating the log likelihood (3.4) w.r.t. µ we get

dl/dµ = −(1/(2σ²)) ( −y' R⁻¹ 1n − 1n' R⁻¹ y + 2µ 1n' R⁻¹ 1n ).

Setting this derivative equal to zero gives

2µ 1n' R⁻¹ 1n = y' R⁻¹ 1n + 1n' R⁻¹ y = 2 · 1n' R⁻¹ y,

so that

µ̂ = (1n' R⁻¹ y) / (1n' R⁻¹ 1n).   (3.5)
Similarly, on differentiating the log likelihood (3.4) w.r.t. σ² we get

dl/dσ² = −n/(2σ²) + (y − µ1n)' R⁻¹ (y − µ1n) / (2σ⁴).

Setting this derivative equal to zero gives the maximum likelihood estimate

σ̂² = (y − µ̂1n)' R⁻¹ (y − µ̂1n) / n.   (3.6)
The correlation matrix R in the maximum likelihood estimates of µ and σ² in (3.5) and (3.6) depends on θ. Closed form solutions for θ = (θ1, ..., θd) are not possible, so numerical methods are used to calculate values for θ̂. When θ is assumed to be known, parameter estimation for µ and σ² is trivial, so the maximum likelihood estimation is done separately for θ and (µ, σ²).
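A minimal R sketch of this profile likelihood idea is given below; it assumes the `gauss_corr` helper from the earlier sketch, and in practice the GPfit package mentioned later automates this fit.

```r
# Minimal sketch of the profile log likelihood of theta, with mu and sigma^2
# replaced by their closed-form estimates (3.5)-(3.6).  Assumes the gauss_corr()
# helper from the earlier sketch; y is the output vector, X the design matrix.
profile_loglik <- function(theta, X, y) {
  n    <- length(y)
  R    <- gauss_corr(X, theta)
  Rinv <- solve(R)
  one  <- rep(1, n)
  mu.hat   <- as.numeric(t(one) %*% Rinv %*% y / (t(one) %*% Rinv %*% one))
  sig2.hat <- as.numeric(t(y - mu.hat * one) %*% Rinv %*% (y - mu.hat * one)) / n
  -(n / 2) * log(sig2.hat) - 0.5 * as.numeric(determinant(R, logarithm = TRUE)$modulus)
}
# theta can then be estimated by maximizing profile_loglik() numerically
# (e.g. with optim()); in practice the GPfit package automates this fit.
```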
Best Linear Unbiased Predictor
The next important part of the problem is to find the predictor ˆy(x) at an arbitrary
point x in the input space. A good predictor should have the following features:
• Unbiased: E(ŷ(x)) = E(y(x)) = µ
• Linear in Y: ŷ(x) = C'Y for some n × 1 vector C
• Best: ŷ(x) should have the minimum variance in the class of all linear unbiased predictors.
The objective is to find a vector C such that ŷ(x) = C'Y and E(ŷ(x) − y(x))² is minimized. It is equivalent to find the BLUP for y*(x) = y(x) − µ. That is, to find C* such that ŷ*(x) = C*'Y* and E(ŷ*(x) − y*(x))² is minimized, where Y* = Y − µ̂1n. Writing r = r(x) = (R(x, x1), ..., R(x, xn))' for the vector of correlations between y(x) and the observed responses, we have

E(ŷ*(x) − y*(x))² = E(C*'Y* − y*(x))²
                  = E(C*'Y*(Y*)'C* + (y*(x))² − 2C*'Y* y*(x))
                  = C*'σ²R C* + σ² − 2C*'σ²r.   (3.7)
Differentiating (3.7) w.r.t. C* and equating it to zero gives

2σ²R C* − 2σ²r = 0,   so that   C* = R⁻¹r.

Using the condition ŷ*(x) = C*'Y* and y*(x) = y(x) − µ,

ŷ(x) − µ̂ = C*'(Y − µ̂1n)
ŷ(x) = µ̂ + C*'(Y − µ̂1n)
     = (1n'R⁻¹Y)/(1n'R⁻¹1n) + r'R⁻¹[ Y − ((1n'R⁻¹Y)/(1n'R⁻¹1n)) 1n ]
     = [ ((1 − r'R⁻¹1n)/(1n'R⁻¹1n)) 1n'R⁻¹ + r'R⁻¹ ] Y.

Following from the definition ŷ(x) = C'Y, we get

C' = ((1 − r'R⁻¹1n)/(1n'R⁻¹1n)) 1n'R⁻¹ + r'R⁻¹.
From (3.7), uncertainty in the predicted ŷ(x) is as follows:

s²(x) = E(ŷ(x) − y(x))² = E(ŷ*(x) − y*(x))²
      = E(C*'Y* − y*(x))²
      = C*'σ²R C* + σ² − 2C*'σ²r
      = σ²(1 + C*'R C* − 2C*'r),

which, once the estimation of µ by µ̂ is accounted for, becomes

s²(x) = σ²[ 1 − r'R⁻¹r + (1 − 1n'R⁻¹r)² / (1n'R⁻¹1n) ].
Hence the predicted value for y(x) at an arbitrary x in the input space,

ŷ(x) = µ̂ + r'R⁻¹(Y − µ̂1n)
     = [ ((1 − r'R⁻¹1n)/(1n'R⁻¹1n)) 1n'R⁻¹ + r'R⁻¹ ] Y,   (3.8)

is the best linear unbiased predictor (BLUP), and the associated mean squared error is

s²(x) = σ²[ 1 − r'R⁻¹r + (1 − 1n'R⁻¹r)² / (1n'R⁻¹1n) ].   (3.9)
A GP model is a traditional surrogate model used to approximate outputs to
deterministic computer simulators. GPfit (an R package by MacDonald, Ranjan
and Chipman 2014) facilitates an easy implementation of fitting such GP mod-
els. The GP model is conceptually straightforward, easily accommodates prior
knowledge in the form of covariance structure, and returns estimates of simulator
response with uncertainty. In spite of its simplicity, there are a few important
limitations of a GP model:
1. Inference on the GP model scales poorly with the number of data points, typically requiring computing time that grows as O(n³) for sample size n. This is due to the inversion of the n × n correlation matrix R.
2. A standard GP model assumes a constant noise variance. This leads to
restrictive modeling for complex datasets in which noise varies over the input
space.
3. The GP model is assumed to have a stationary correlation structure.
4. The uncertainty estimate, s(x), associated with a predicted response under
a GP model does not directly depend on any of the observed responses.
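For reference, a minimal sketch of fitting such a GP surrogate with the GPfit package mentioned above is shown below (toy data; exact argument and output names may differ across package versions).

```r
# Minimal sketch (toy data): fitting a deterministic GP surrogate with the
# GPfit package; exact argument and output names may differ across versions.
library(GPfit)

y.det <- function(x) -sin(4 * x - 2) - 2 * exp(-480 * (x - 0.5)^2)
X <- matrix(seq(0, 1, length.out = 10), ncol = 1)    # 10 design points in [0, 1]
Y <- y.det(X[, 1])

fit  <- GP_fit(X, Y)                                  # maximum likelihood fit
xnew <- matrix(seq(0, 1, length.out = 100), ncol = 1)
pred <- predict(fit, xnew)                            # BLUP (3.8) and MSE (3.9) at new inputs
head(pred$Y_hat); head(pred$MSE)
```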
3.2 Non-Deterministic Simulator
A simulator is said to be non-deterministic if replicate runs of the same inputs
will yield different responses. The non-deterministic simulator, like the deterministic one,
can be emulated as a realization of a GP model with some modifications, or using
other flexible regression models such as Bayesian additive regression trees (BART).
BART (Chipman, George & McCulloch, 2010, henceforth denoted CGM) uses an
ensemble of trees structure to emulate simulator outputs. We describe both a GP
for noisy data and BART before showing in Chapter 4 how they can be used in
an EI-based sequential design framework for estimating the global minimum.
3.2.1 GP Model for Noisy Data
The simulator output, ỹ(xi), is observed with noise and is modeled as

ỹ(xi) = µ + z(xi) + εi,   i = 1, ..., n,   (3.10)

where µ is the overall mean, z(·) is a GP with mean 0, variance σ²z and correlation structure Rij (defined in (3.2)), and εi is independent N(0, τ²). The variance of the observed response Ỹ = (ỹ(x1), ..., ỹ(xn))' is

Var(Ỹ) = σ²z R + τ² I = σ²z (R + δ I),   (3.11)

where δ is called the nugget parameter and I is the n × n identity matrix. From (3.11) the nugget parameter can be written as

δ = τ²/σ²z.   (3.12)

The nugget parameter is such that δ < 1 so as to ensure that the numerical uncertainty is smaller than the process uncertainty. Thus the total uncertainty in the output is given by

σ²total = σ²z + τ².   (3.13)
The log likelihood, after profiling out the parameters µ and σ²z, for the noisy GP can be written as

−2 log Lp = log(|R + δI|) + n log[ (Ỹ − 1nµ̂(θ))' (R + δI)⁻¹ (Ỹ − 1nµ̂(θ)) ] + constant,

which can be maximized to get the maximum likelihood estimates of the model parameters θ and δ. Using the same procedure as in the deterministic case, the resulting BLUP is given by

ŷδ(x*) = [ ((1 − r'(R + δI)⁻¹1n)/(1n'(R + δI)⁻¹1n)) 1n'(R + δI)⁻¹ + r'(R + δI)⁻¹ ] Ỹ   (3.14)

with mean squared error

s²δ(x*) = σ²z [ 1 + δ − r'(R + δI)⁻¹r + (1 − 1n'(R + δI)⁻¹r)² / (1n'(R + δI)⁻¹1n) ].   (3.15)

The nugget δ also increases the numerical stability in the computation of (R + δI)⁻¹. The GP fit to the noisy data is achieved using the bgp function in the R package tgp (Gramacy, 2007). The tgp package uses a Bayesian approach to fit the GP model.
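A minimal sketch of such a noisy GP fit with bgp, under assumed defaults and on the one-dimensional test function of Figure 3.3, could look as follows; output component names may differ across tgp versions.

```r
# Minimal sketch (assumed defaults): fitting the noisy GP of Section 3.2.1
# with the bgp function from the tgp package on the 1-d test function with
# N(0, 0.15^2) noise.
library(tgp)

y.det <- function(x) -sin(4 * x - 2) - 2 * exp(-480 * (x - 0.5)^2)
set.seed(1)
X  <- matrix(runif(20), ncol = 1)                    # 20 noisy training runs
Z  <- y.det(X[, 1]) + rnorm(20, sd = 0.15)
XX <- matrix(seq(0, 1, length.out = 100), ncol = 1)  # prediction grid

fit <- bgp(X = X, Z = Z, XX = XX)
head(cbind(mean = fit$ZZ.mean, lower = fit$ZZ.q1, upper = fit$ZZ.q2))
```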
3.2.2 Bayesian Additive Regression Trees Model (BART)
The BART model uses regression trees to model the data in a Bayesian framework.
BART represents the output y as a sum of m adaptively chosen functions and an
independent normal error. It seeks to approximate
y(x) = E(˜y|x)
using a sum of trees. The model can be written as
ỹ(x) = Σ_{j=1}^{m} g(x; Tj, Mj) + ε = y(x) + ε,   ε ~ N(0, τ²).   (3.16)
The function g(x; T, M) produces an output when provided with d-dimensional
input x = (x1, ..., xd) and parameters T and M. It denotes a regression tree
model. The predictions for a particular value of x are generated by following the
sequence of decision rules in a tree T until arriving at a terminal node b at which an
associated scalar prediction µb is returned. Decision rules differ in different trees Tj and Tj′. For a tree T with B terminal nodes (i.e., partitioning the input space
into B rectangular regions) let M = (µ1, ..., µB) denote the collection of terminal
node predictions. Thus for an input vector x, tree model g gives a piecewise-
constant output. By combining together an “ensemble” of m such tree models in
(3.16) a flexible modeling framework is created. For example, if each individual Tj
uses partitions on only a single input variable, then the BART model becomes an
additive model. BART is capable of capturing both non-stationary and complex
relationships by choosing the structure and individual rules of the Tj’s. Many
individual trees can place split points in the same area, allowing the predicted
function to change rapidly nearby, effectively capturing non-stationary behavior
such as abrupt changes in the response. This statistical model has a number of
parameters, T1, ..., Tm, M1, ..., Mm, τ:
• Tj : tree topology
• Mj = (µ1, ..., µBj) = outputs
• τ² = Var(ε)
Figure 3.1 depicts a tree model g(x; T, M) with terminal node µ’s. The function
g(x; T, M) assigns a µ value to x where
• T denotes the tree structure including the decision rules.
• M = (µ1, µ2, µ3) is the set of terminal node µ’s.
For example, the input x = (1.1, 5.4, 0.1, 2.3, 0.5) would lead to prediction µ2 = 5
since we branch left on x5 = 0.5 < 1 and then right on x2 = 5.4 ≥ 4.
Figure 3.1: A simple realization of g(x; T, M) in the BART model
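As a small illustration (not taken from the thesis), the single tree just described can be written as a piecewise-constant R function; the exact tree in Figure 3.1 and the values of µ1 and µ3 are assumptions made only for this sketch.

```r
# Illustrative single-tree function g(x; T, M) consistent with the worked
# example above: branch on x5, then on x2.  The exact tree in Figure 3.1 and
# the values of mu1 and mu3 are assumptions made only for this sketch.
g_tree <- function(x, M = c(mu1 = 2, mu2 = 5, mu3 = 8)) {
  if (x[5] < 1) {                            # branch left on x5
    if (x[2] >= 4) M["mu2"] else M["mu1"]    # branch right on x2 returns mu2
  } else {
    M["mu3"]
  }
}

x <- c(1.1, 5.4, 0.1, 2.3, 0.5)
g_tree(x)   # returns mu2 = 5, matching the example in the text
```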
BART uses a sum-of-trees model which is vastly more flexible than a single
tree model and is capable of accounting for high-order interaction effects. The addi-
tive structure with multivariate components makes it a very favorable statistical
surrogate for optimizing a noisy simulator. It is an ensemble model which gives
great in-sample fit and out-of-sample predictive performance.
3.2.3 Details of BART Model
In this section we describe the form of the prior and algorithms for MCMC sam-
pling of the posterior of the BART model. The prior specifications used in this
thesis follow from CGM.
Prior specification
CGM consider values for m, the number of trees, between 50 and 200. We used
m = 100 for obtaining results in this thesis.
CGM specifies a prior structure as follows:
p((T1, M1), (T2, M2), ..., (Tm, Mm), τ) = p(T1, T2, ..., Tm) p(M1, M2, ..., Mm | T1, T2, ..., Tm) p(τ).
Figure 3.2: Three priors on residual variance τ when τ̂ = 2 is assumed for the BART model. In the legend, taken from CGM, “df” is ν and “quantile” corresponds to the quantile q at τ̂ = 2.
Since the dimension of Mj depends on Tj, this conditional structure is essential.
CGM makes several simplifying assumptions about the prior on the Tj and Mj,
for instance,
p(T1, T2, ..., Tm) = Π_{j=1}^{m} p(Tj),   (3.17)

p(M1, M2, ..., Mm | T1, T2, ..., Tm) = Π_{j=1}^{m} p(Mj | Tj),   (3.18)

and

p(Mj | Tj) = Π_{i=1}^{Bj} p(µi,j | Tj).   (3.19)
CGM specifies a prior on the residual variance τ² as

τ² ~ νλ/χ²,

where χ² is a chi-squared random variable with ν degrees of freedom. The parameter ν determines the spread of the prior and λ determines the location of the prior. Both ν and λ should be chosen to place a good amount of prior mass on plausible τ values. An equivalent way of specifying λ is to specify a guess at the upper q-th quantile of the prior distribution.
Figure 3.2 shows an illustration in which τ̂ = 2 is the guess of the upper q-th
quantile of the prior distribution. CGM suggests the three combinations of the
hyperparameters ν and q as illustrated. These three combinations are referred to
as conservative, default and aggressive. The df (ν) is usually taken to be between
and 10. In the simulation study in Chapters 5 and 6, we chose τ̂ as a fraction of the total data variation, i.e.,

τ̂ = 0.2 · sd(ỹ),   (3.20)
where sd(˜y) is the sample standard deviation of the training ˜y values. This prior
specification allows for some noise in the response values. A choice such as (3.20)
is one of the two ways that CGM recommends for specifying τ̂. The other approach
is a linear model estimate which is more appropriate for functions with overall
trend as a linear model will capture the general form of the trend. Using the
standard deviation of ˜y(x) as an estimated value of τ is a “total data variation”
estimate and is more appropriate when the mean function does not have a trend
as can be the case in our simulation experiments. Of note here is that strong prior
beliefs of small τ can lead to overfitting, so they ought to be avoided.
In the simulation study in Chapters 5 and 6, we choose ν = 3 and use quantile
q = 0.9 along with (3.20) to specify the τ prior.
A prior is put on each tree T using a tree growing process. This implies a prior
on tree size. The tree size limits the number of variables used in each weak learner
g(x; T, M). We adopt the default choices for the tree prior described in CGM.
CGM specifies a prior for the terminal node parameters µj as follows. Suppose we have m = 100 trees; then the prediction of y(x) will be a sum of 100 µ's, one from each tree:

E(ỹ|x) = y(x) = Σ_{j=1}^{100} µj.
Using equation (3.19), the variance of the quantity y(x) can be written as

Var(y(x)) = Σ_{j=1}^{100} Var(µj) = 100 Var(µj).

Specifying how much we expect the mean of y(x) given x to vary, we take

Var(µj) = Var(y(x))/100   and   sd(µj) = sd(y(x))/√100.   (3.21)
If we take

sd(y(x)) ≈ range(ỹ)/4

as a guess of the variation in y(x) we might see over the input values, then equation (3.21) becomes

sd(µj) = range(ỹ)/(4√100).
The amount of shrinkage of the µ's depends on the number of trees (here taken to be 100). CGM specifies a normal distribution with mean 0 for each µj, resulting in the default prior µ ~ N(0, σ²µ) = N(0, range(ỹ)²/(4k²m)) with k = 2. We relaxed this prior to k = 1. Choosing a smaller value of k (i.e., k = 1) increases the prior variance of the output y(x) = E(ỹ|x), applying less shrinkage (or smoothness) to the response so that the fitted values come closer to interpolating the observed ỹ values. A similar relaxation of k was used in Chipman, Ranjan & Wang (2012).

Although not part of the prior specification, several other operating parameters of BART are chosen as follows: the number of trees in the ensemble is chosen as m = 100, and individual trees are allowed to split on a fine grid of 1,000 cutpoints along each axis.
A common concern with Bayesian approaches is sensitivity to prior parame-
ters. CGM found that the results were robust over a reasonably wide range of prior parameters, including ν, q and σµ, as well as the number of trees m. The value of m needs to be large enough to provide the complexity needed to capture ỹ(x), but making m too large does not
appreciably degrade accuracy although it does make MCMC sampling slower to
run. CGM provides guidelines for choosing m.
MCMC sampling of the posterior
Posterior samples will be obtained using the MCMC algorithm outlined in CGM
and summarized below.
Let T(−j) be all the trees except Tj, and define M(−j) similarly. The MCMC algorithm is as follows:
• WHILE number of MCMC samples (i = 1, ..., N) DO repeat:
  – WHILE number of trees (j = 1, ..., m) DO repeat:
    1. Metropolis–Hastings step: draw Tj conditional on ỹ, T(−j), τ
    2. Draw Mj given ỹ, T1, T2, ..., Tm, M(−j), τ
    3. Draw τ given ỹ and all other parameters.
  – End WHILE
• End WHILE
The sample of Tj at step i is actually the modification of the Tj sample at i − 1.
The MCMC algorithm uses 4000 iterations, discarding the first 500 (burn-in) and
keeping every 10-th thereafter for a sample of 350 posterior draws. Larger posterior
samples might be desirable, but given the quick mixing behavior of BART observed by CGM, a sample of this size will be sufficient for the sequential design.
Final Prediction from MCMC
Each sweep of the algorithm yields a draw from the posterior of y(x) = g(x; T1, M1)+
g(x; T2, M2) + ... + g(x; Tm, Mm). An average of draws of y(x) gives the posterior
average of y(x). We have a sample of 350 posterior values of y(x) from the MCMC
algorithm. Uncertainty in y(x) is available from the posterior distribution on y(x).
The uncertainty bounds of y(x) can be obtained by the 10-th and 90-th quantile
of this sample.
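A minimal R sketch of this summary step is given below; the matrix `draws` is a stand-in for the 350 retained posterior draws of y(x) over a set of test inputs.

```r
# Minimal sketch of the final-prediction step: `draws` stands in for the 350
# retained posterior draws of y(x), one row per draw and one column per test input.
draws <- matrix(rnorm(350 * 50, mean = 1), nrow = 350)          # stand-in for BART output

post.mean <- colMeans(draws)                                    # posterior average of y(x)
bounds    <- apply(draws, 2, quantile, probs = c(0.1, 0.9))     # 10th and 90th quantiles
```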
3.2.4 Implementation of BART
The existing package in R called “BayesTree” (CGM) works for implementing
the Bayesian ensemble model but it takes a long time to run. Pratola et al.
(2014) implements the coding of BayesTree on a parallel computing system with
the MPI (“Message Passing Interface”) protocol. The MPI implementation of
BART runs faster than BayesTree. At each MCMC iteration, BayesTree pro-
duces posterior samples for y. These posterior samples are saved and the T or
M’s are not saved. To make predictions at new x locations, BayesTree must
refit the model. The MPI implementation of BART saves the posterior sam-
ples ((T1, M1), (T2, M2), ..., (Tm, Mm), τ) of the parameters for later use and thus
predictions at future locations are available without re-training the BART model.
Our simulation study in Chapters 5 and 6 uses the MPI implementation of BART.
3.2.5 Advantages of BART
The BART model is flexible and it adapts itself to the data. BART does not
assume the continuity of the response, thus making it appropriate when there
are abrupt changes or non-stationarity in the response. This is in contrast to a
GP model which for many correlation functions implies continuity and a constant
amount of “shift” of the response. It is computationally faster than GP model-
ing, has a quick burn-in and convergence of MCMC, and has the ability to identify
low-dimensional structure in high dimensional data by using a “sum of trees”
structure and specifying a prior on trees that encourages individual trees to have
a smaller number of terminal nodes. It provides robustness of prediction to prior
specification and is a competitive tool for predictive accuracy.
3.3 Surrogate Model Fit
In Figure 3.3 we present the surrogate model fits to illustrate three comparisons:

1. Deterministic versus non-deterministic simulator. Fig 3.3(a) shows the deterministic simulator as a dashed line, and the design points lie on the deterministic simulator. Fig 3.3(b) and Fig 3.3(c) show the case of a non-deterministic simulator, where the design points do not lie perfectly on the simulator (dashed line), showing that the output has random noise and deviates from the deterministic trend.

2. Deterministic GP versus noisy GP model fit, illustrated by Fig 3.3(a) and Fig 3.3(b). Fig 3.3(a) shows the GP fit for a deterministic computer simulator. The model prediction, denoted by the blue curve, is an interpolator of the design points and the uncertainty bounds in red are football shaped, whereas Fig 3.3(b) shows the GP fit for a noisy computer simulator. Since the computer simulator output is noisy, the GP model fit (blue curve) is not an interpolator of all the design points and the uncertainty bounds are somewhat football shaped.

3. Noisy GP versus BART model fit, illustrated by Fig 3.3(b) and Fig 3.3(c). Fig 3.3(b) shows the GP fit for a noisy computer simulator; the blue curve is not an interpolator of the design points and the uncertainty bounds are football shaped. Fig 3.3(c) shows the BART fit for a noisy computer simulator. The blue curve is the BART fit, which tries to interpolate the design points. The BART fit is a sum of predictions from many trees and thus provides a piecewise constant prediction. BART exhibits football shaped uncertainty bounds, as typically shown by GP models for deterministic simulators (Fig 3.3(a)).
[Figure 3.3 appears here: three panels plotting the computer simulator output against x on [0, 1]. (a) GP model fit to deterministic simulator outputs (panel title "Deterministic GP"); legend: model prediction, uncertainty bounds, simulator, design points. (b) GP model fitted to non-deterministic simulator outputs (panel title "Noisy GP"). (c) BART model fitted to non-deterministic simulator outputs (panel title "BART").]

Figure 3.3: Surrogate models fitted to the simulator outputs generated using a one-dimensional test function: y(x) = −sin(4x − 2) − 2 exp(−480(x − 0.5)²) + ε, ε ~ N(0, 0.15²)
Chapter 4
Global Optimization
One application of global optimization that we recognize here in Nova Scotia is in tidal energy, where an important objective is to find an optimal turbine location
for maximizing the power function. Figure 4.1 shows a map of the Minas Passage
highlighting the favorable area and the corresponding power function. For such
applications (with expensive simulators) one is interested in minimizing the total
number of evaluations needed to find the global extremum.
(a) A portion of Bay of Fundy, NS (around the Minas Passage) showing the bathymetric profile.
(b) A low resolution tidal power simulator to be optimized for finding optimal turbine location.
Figure 4.1: Tidal energy application
In this chapter we explore sequential design strategies which simultaneously improve the accuracy of the surrogate model (e.g., GP and BART) and efficiently estimate the global optimum of the simulator. Section 4.1 introduces the sequential
design framework in global optimization. Section 4.2 elaborates on the expected
improvement criterion used in the framework. Section 4.3 defines two approaches
to identify the input point with the best value of the expected improvement.
4.1 Sequential Design
Sequential design is a form of active learning procedure frequently used in com-
puter experiments. The technique is similar to regression in that we use training
data to build a model, and then we try to predict output using the model and
values of inputs. The difference is that in sequential design we can sequentially
choose the training data. By “actively learning” about the model we hope to
accurately find the global minimum with less data. The main idea is to rebuild
the model as data are collected and use it to decide where to sample more data.
Here we consider a sequential design in which the goal is to find the optimum
of a noisy simulator. In the remainder of this thesis we will consider the goal as
minimization as it is trivial to shift from minimization to maximization.
The approach presented here is similar to that of Jones et al. (1998) for
optimization of deterministic simulators. The sequential design algorithm can be
summarized as follows:
1. Obtain an initial design Xn0 with n0 points, and evaluate the simulator at
these points, yielding the corresponding simulator outputs Yn0 . This set of
initial design points is referred to as the training set.
2. Set iteration i = 0.
3. Fit the surrogate model (e.g., GP or BART) using Xn0+i and Yn0+i.
4. Maximize EI over the input space, and let x* = argmax E(I(x)).
5. Evaluate the simulator at x*, augment Xn0+i and Yn0+i with x* and ỹ(x*), and set i = i + 1.
6. Unless a stopping condition is met, go to Step 3.
7. Obtain the final model fit.
For choosing an initial design in Step 1, Latin hypercube sampling (LHS) schemes
(McKay 1979) are particularly useful because they have a space filling property,
i.e., they uniformly cover the x−domain to explore the function globally. We
consider an LHS in [0, 1]^d. To ensure the coverage of the input space, we augment
the (n0 − 2)-point LHS with corner points x = (0, 0, ..., 0) and x = (1, 1, ..., 1).
This is done to aid GP and BART in assessing uncertainty near the boundaries
of the input space. The number of points sampled at the initial stage may vary
with the input dimension and the complexity of the simulator. The EI criterion in
Step 4 is based on the idea that the additional function evaluation leads to further
reduction in the global minimum estimate if possible. This is discussed further in
Section 4.2. For Step 6, we propose to stop when we have added a predetermined
number of sample points (nnew) to our initial design. Instead of this stopping
condition, the experiment can be run until the accuracy of model fit is within the
desired tolerance or until the value of the improvement function is small.
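A compact R sketch of the sequential design loop above is given below; `fit_surrogate` and `max_EI` are simple stand-ins for the surrogate fit (GP or BART) and the EI optimizer of Section 4.3, and `simulator` is the noisy 1-d test function, so the skeleton runs but is for illustration only.

```r
# Skeleton sketch of the sequential design algorithm above, with stand-ins:
# fit_surrogate() and max_EI() are placeholders for the GP/BART fit and the
# EI optimizer of Section 4.3; simulator() is the noisy 1-d test function.
library(lhs)

simulator     <- function(x) -sin(4 * x[1] - 2) - 2 * exp(-480 * (x[1] - 0.5)^2) + rnorm(1, sd = 0.15)
fit_surrogate <- function(X, Y) list(X = X, Y = Y)              # stand-in for a surrogate fit
max_EI        <- function(fit, d) matrix(runif(d), ncol = d)    # stand-in for argmax of EI

n0 <- 15; n.new <- 30; d <- 1
X <- rbind(randomLHS(n0 - 2, d), matrix(0, 1, d), matrix(1, 1, d))  # Step 1: LHS plus corner points
Y <- apply(X, 1, simulator)

for (i in seq_len(n.new)) {
  fit   <- fit_surrogate(X, Y)     # Step 3: fit the surrogate
  x.new <- max_EI(fit, d)          # Step 4: maximize EI over the input space
  X <- rbind(X, x.new)             # Step 5: augment the design and responses
  Y <- c(Y, simulator(x.new))
}
final.fit <- fit_surrogate(X, Y)   # Step 7: final model fit
```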
4.2 EI Criteria
Most of the sequential design procedures iterate between parameter estimation
(rebuilding the model) and optimizing a criterion for choosing new points. An
optimality criterion proposed by Jones et al. (1998) is called the Expected Im-
provement (EI). EI combines the mean and the variance structures of the surrogate
model fit such that the method explores the experimental domain and at the same
time exploits the potential areas where local optima occur. Jones et al. (1998)
named the resulting algorithm, Efficient Global Optimization (EGO).
Here we outline the EI criterion for a deterministic simulator. The superscripts (in parentheses) in the mathematical expressions below denote the number of points used in the surrogate model fit. Let f_min^(n) = min{y(xr), 1 ≤ r ≤ n} be the smallest function value among the n points sampled thus far. Jones et al. (1998) defines the improvement at a point x as

I(x) = max(f_min^(n) − y(x), 0),   (4.1)
which is positive if the (unobserved) response y(x) at location x is less than the current best value f_min^(n), and 0 otherwise. Since y(x) is unobserved, the expected value of I(x) is used as a sequential design criterion. For a GP model, the expectation of this improvement function can be obtained in closed form using the maximum likelihood approach. Assuming y(x) ~ N(ŷ(x), s²(x)), where ŷ(x) and s²(x) are the BLUP and the associated mean squared error at x, the EI can be shown to be

E(I(x)) = (f_min^(n) − ŷ(x)) Φ( (f_min^(n) − ŷ(x)) / s(x) ) + s(x) φ( (f_min^(n) − ŷ(x)) / s(x) ),   (4.2)

where Φ(·) and φ(·) are the standard normal cumulative distribution and probability density functions. The first term in (4.2) captures local search that seeks to
improve the estimate of the minimum near a currently identified minimum. The
second term captures the global search which places the points in regions where
there is sufficient uncertainty that the minimum response could be nearby.
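For reference, a minimal R sketch of the deterministic EI criterion (4.2) is:

```r
# Minimal sketch of the deterministic EI criterion (4.2): y.hat and s are the
# BLUP and its standard error at a candidate x, f.min the current best value.
expected_improvement <- function(y.hat, s, f.min) {
  if (s <= 0) return(max(f.min - y.hat, 0))   # no predictive uncertainty left
  u <- (f.min - y.hat) / s
  (f.min - y.hat) * pnorm(u) + s * dnorm(u)
}

expected_improvement(y.hat = 0.2, s = 0.1, f.min = 0.25)
```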
4.2.1 Improvement Functions for Noisy Simulators
The EI criterion defined in (4.2) is for deterministic computer simulators. Picheny
et al. (2013) proposed an extension of this EI using quantiles of noisy simulator
outputs. The EI criterion of Picheny et al. (2013) depends on the noise variances of
the sampled simulator values and of the new candidate measurement. Picheny et
al. (2013) generalized the improvement function of Jones et al. (1998), presented
in (4.1), for the selection of the next training point x as
I1(x) = max( q_min^(n) − Q^(n+1)(x), 0 ),   (4.3)

where

q_min^(n) = min_{1≤r≤n} Q^(n)(xr)

and Q^(k)(x) = ŷ^(k)(x) + Φ⁻¹(β) s^(k)(x) denotes the β quantile of the predicted response based on the surrogate fit obtained using k observed data points. In particular, q_min^(n) selects the smallest upper quantile based on the fit obtained from the n already observed points. Similar to the improvement function in Jones et al. (1998), Q^(n+1)(x) is an unobservable quantity based on an unobserved ỹ(x).
Ranjan (2013) proposes an alternative improvement function for minimizing
the noisy simulator output
I2(x) = max( q_min^(n) − Q′^(n+1)(x), 0 ),   (4.4)

where

q_min^(n) = min_{1≤r≤n} Q′^(n)(xr)

and Q′^(k)(x) = ŷ^(k)(x) − Φ⁻¹(β) s^(k)(x), which represents the (1 − β) quantile of the predicted response based on the surrogate fit obtained using k observed data points.
Both (4.3) and (4.4) require a numerical value of β. We use β = 0.9 through-
out the thesis. The main objective of this thesis is to compare the EI criteria corresponding to the two improvement functions I1(x) and I2(x).
4.2.2 EI Criteria for GP Model
Taking expectation of I1(x), the Picheny et al. (2013) definition of improvement
function, w.r.t. the predictive distribution leads to the following closed form ex-
pression:
E[I1(x)] = [ q_min^(n) − (ŷ^(n)(x) + Φ⁻¹(β) s^(n)(x)) ] Φ(u) + s^(n)(x) φ(u),   (4.5)

where

u = [ q_min^(n) − (ŷ^(n)(x) + Φ⁻¹(β) s^(n)(x)) ] / s^(n)(x).
The new point is chosen by maximizing E(I1(x)) which finds the point with the
smallest upper quantile in the predictive distribution of ˜y. Similarly for the Ranjan
method, the closed form expression for EI is
E[I2(x)] = [ q_min^(n) − (ŷ^(n)(x) − Φ⁻¹(β) s^(n)(x)) ] Φ(u) + s^(n)(x) φ(u),   (4.6)

where

u = [ q_min^(n) − (ŷ^(n)(x) − Φ⁻¹(β) s^(n)(x)) ] / s^(n)(x).
As expected, the new point is the one with the smallest lower quantile in the
predictive distribution of ˜y.
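A minimal R sketch of the two criteria (4.5) and (4.6), with the quantile level β = 0.9 used in this thesis, is given below; the function name and arguments are ours.

```r
# Minimal sketch of the EI criteria (4.5) and (4.6) for a GP surrogate: y.hat
# and s are the GP prediction and its standard error at a candidate x, and
# q.min is the current best quantile among the observed points.
ei_noisy_gp <- function(y.hat, s, q.min, beta = 0.9, method = c("picheny", "ranjan")) {
  method <- match.arg(method)
  sgn <- if (method == "picheny") +1 else -1   # +: beta quantile (4.5), -: (1 - beta) quantile (4.6)
  Q <- y.hat + sgn * qnorm(beta) * s
  u <- (q.min - Q) / s
  (q.min - Q) * pnorm(u) + s * dnorm(u)
}

ei_noisy_gp(y.hat = 0.2, s = 0.1, q.min = 0.4, method = "ranjan")
```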
4.2.3 EI Criteria for BART Model
For BART, MCMC is used for model fitting. In this thesis we formulate the
EI criteria from a fully Bayesian perspective which results in a corresponding
Bayesian EGO method. The key steps of the sequential design algorithm are as
follows. First, choose the initial design Xn0 of n0 points, and evaluate the simulator
at these points, yielding the corresponding simulator outputs Yn0 . The next step
is to run the MCMC and collect N samples from the posterior distribution. For
each posterior draw of the parameters,
Θ^(i) = (T1^(i), T2^(i), ..., Tm^(i), M1^(i), M2^(i), ..., Mm^(i), τ^(i)),   i = 1, ..., N,

we can compute a corresponding posterior draw for the response mean y(x) = Σ_{j=1}^{m} g(x; Tj, Mj) and uncertainty (τ). Recall from (3.16) we have

ỹ(x) = Σ_{j=1}^{m} g(x; Tj, Mj) + ε = y(x) + ε,   ε ~ N(0, τ²).
These posterior samples can be evaluated at any x although our focus will be on
x in the test set.
Picheny et al. (2013) define the improvement function as follows:
I1^(i)(x) = max( q_min^(n) − Q^(n+1)(x), 0 ),   (4.7)

where

q_min^(n) = min_{1≤r≤n} Q^(n)(xr)

and Q^(k)(x) = ŷ^(k)(x) + Φ⁻¹(β)τ, where ŷ^(k)(x) and τ are obtained from the i-th MCMC sample. In the above expressions, the dependence of Q^(n)(x), q_min^(n), ŷ^(k)(x) and τ on MCMC sample i is suppressed. Q^(k)(x) represents the β quantile of the predicted response based on the surrogate fit obtained from k observed data points.
Similarly, Ranjan (2013) defines the improvement function as

$$I^{(i)}_2(x) = \max\left\{ q'^{(n)}_{\min} - Q'^{(n+1)}(x),\ 0 \right\}, \qquad (4.8)$$

where

$$q'^{(n)}_{\min} = \min_{1 \le r \le n} Q'^{(n)}(x_r)$$

and $Q'^{(k)}(x) = \hat{y}^{(k)}(x) - \Phi^{-1}(\beta)\,\tau$ represents the $(1 - \beta)$ quantile of the predicted response based on the surrogate fit obtained from $k$ observed data points.
The next step is to obtain an MCMC approximation to the EI, which can be calculated by taking a sample average of the $I^{(i)}_1(x)$ or $I^{(i)}_2(x)$ values over the $N$ MCMC posterior draws. Thus, for the Picheny method the approximation to EI is given by

$$E[I_1(x)] = \frac{1}{N}\sum_{i=1}^{N} I^{(i)}_1(x). \qquad (4.9)$$

Similarly, for the Ranjan method the approximation to EI is

$$E[I_2(x)] = \frac{1}{N}\sum_{i=1}^{N} I^{(i)}_2(x). \qquad (4.10)$$
We maximize E(I1(x)) and E(I2(x)) over the input space to find the follow-up
input points.
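A minimal sketch of this sample-average approximation is given below, assuming the N posterior draws of the BART mean at the design points and at the candidate points, together with the corresponding draws of τ, have already been collected into arrays; all names are hypothetical and no particular BART implementation is implied.

```python
import numpy as np
from scipy.stats import norm

def ei_bart_mcmc(mean_cand, mean_design, tau, beta=0.9, method="ranjan"):
    """
    MCMC approximation of EI as in (4.9)/(4.10).
      mean_cand:   (N, n_cand) posterior draws of y(x) at candidate points
      mean_design: (N, n)      posterior draws of y(x) at the n design points
      tau:         (N,)        posterior draws of the residual standard deviation
    Returns the estimated EI at each candidate point.
    """
    z = norm.ppf(beta)
    sign = 1.0 if method == "picheny" else -1.0        # +z*tau: beta quantile; -z*tau: (1-beta) quantile
    q_design = mean_design + sign * z * tau[:, None]   # per-draw quantiles at the design points
    q_cand = mean_cand + sign * z * tau[:, None]       # per-draw quantiles at the candidates
    q_min = q_design.min(axis=1, keepdims=True)        # per-draw q_min over the n design points
    improvement = np.maximum(q_min - q_cand, 0.0)      # per-draw improvement I^(i)(x)
    return improvement.mean(axis=0)                    # average over the N posterior draws
```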
4.3 EI Optimization
Two strategies are used in this thesis for EI optimization: a) a one-shot space filling design and b) a genetic algorithm (GA). Both approaches are described here in general terms for a response function f(x) with an objective of maximization.
4.3.1 Optimization using a One-shot Space Filling Design
In this section we discuss building the candidate set of points for evaluation and
optimization of f(x). A random Latin hypercube design evenly samples the points
in every 1-dimensional projection and obtains some degree of space filling in higher
dimensions. For 1-dimensional (1d) and 2-dimensional (2d) simulators, a random
Latin hypercube design (Carnell, 2009) is used as a fixed candidate set to sys-
tematically evaluate f(x) at many points and optimize f(x) (or EI in our case).
Use of a space filling design is a naive method of global optimization. Such an approach is convergent under mild assumptions on the differentiability of f(x), but it is not practical in higher dimensional problems. For simulators with high-dimensional inputs, an effective one-shot space filling design would require a very dense set of thousands of points, and identifying the optimal location over such a dense design is computationally expensive and difficult. For instance, in 1d a space-filling design of 100 points may sample the input space densely; in 2d, however, even 100² points would not achieve the same sampling density. Sparsity therefore increases with the dimension d, and choosing enough points quickly makes the process infeasible. Hence, in higher dimensions we use an iterative approach such as the genetic algorithm, as discussed in the next section.
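For reference, a short sketch of the one-shot strategy: generate a random Latin hypercube candidate set once, evaluate the objective (EI in our case) at every candidate, and return the best candidate. The LHD generator below is an illustrative stand-in, not the design package used in the thesis.

```python
import numpy as np

def random_lhd(n, d, rng=None):
    """Random Latin hypercube design of n points in [0, 1]^d."""
    rng = np.random.default_rng(rng)
    strata = np.tile(np.arange(n), (d, 1))            # one stratum index per dimension
    strata = rng.permuted(strata, axis=1).T           # shuffle the strata independently per dimension
    return (strata + rng.random((n, d))) / n          # jitter each point within its stratum

def one_shot_maximize(f, n, d, rng=None):
    """Evaluate f on a fixed LHD candidate set and return the best candidate point."""
    candidates = random_lhd(n, d, rng)
    return candidates[np.argmax(f(candidates))]
```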
4.3.2 Optimization using Genetic Algorithm
GAs were invented to mimic some of the processes observed in natural evolution, the idea being to harness this power of evolution to solve optimization problems. The original GA was invented by John Holland in the early 1970s. GAs are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics; as such, they represent an intelligent exploitation of random search for solving optimization problems. Although randomized, GAs are by no means undirected: they exploit historical information to direct the search into regions of better performance within the search space. A GA uses three main types of rules at each step to create the next generation from the current population:
1. Crossover rules combine two parents to form children for the next generation.
2. Mutation rules apply random changes to individual parents to form children.
3. Selection rules choose the individuals that continue to the next generation.
Here the “individuals”, “parents” and “children” are x values and the selection
rules will correspond to choosing individuals with good values of the response
function f(x).
Outline of our GA implementation
The following outline summarizes how the GA works (a code sketch follows the outline):
• The algorithm begins by creating an initial population of size 200d using a random Latin hypercube design of x points in [0, 1]^d.
• The algorithm is run for 10 generations to get the best solution.
• Within each generation, the algorithm uses the individuals in the current
generation to create the next population using the following steps:
1. Produce children (or offspring) by mutation and crossover. We apply crossover to the current population (of size 200d) by randomly combining pairs of parents.
2. These offspring are then subjected to mutation via a small perturbation of N(0, (0.05/3)²). At the end, we have 200d new offspring.
3. We now augment the parent population (of size 200d) with the children (of size 200d), and pass them through the selection process. That is, we keep the best 200d members of the 400d individuals in the combined population for the next generation.
• The algorithm stops when the end condition (10 generations) is satisfied and
it returns the optimal x in the population.
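The following sketch implements the outline above. The population size 200d, the mutation scale 0.05/3, and the 10 generations follow the settings described; the uniform crossover rule, the clipping to [0, 1]^d, the plain uniform initialization, and all names are illustrative choices rather than a description of the exact implementation used in the thesis.

```python
import numpy as np

def ga_maximize(f, d, n_gen=10, sd=0.05 / 3, rng=None):
    """Simple real-coded GA maximizing f over [0, 1]^d; f takes an (m, d) array and returns m values."""
    rng = np.random.default_rng(rng)
    pop_size = 200 * d
    # initial population: a random LHD in the thesis; plain uniform sampling keeps the sketch self-contained
    pop = rng.random((pop_size, d))
    for _ in range(n_gen):
        # crossover: pair each parent with a random mate and mix their coordinates uniformly
        mates = pop[rng.permutation(pop_size)]
        mask = rng.random((pop_size, d)) < 0.5
        children = np.where(mask, pop, mates)
        # mutation: small Gaussian perturbation, kept inside the unit hypercube
        children = np.clip(children + rng.normal(0.0, sd, size=children.shape), 0.0, 1.0)
        # selection: keep the best pop_size of the 2 * pop_size parents and children
        combined = np.vstack([pop, children])
        pop = combined[np.argsort(f(combined))[::-1][:pop_size]]
    return pop[np.argmax(f(pop))]
```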
The GA approach is useful because of its robustness to noise in the objective function. GAs can be used to solve a variety of optimization problems in which the objective function is discontinuous, non-differentiable, stochastic or highly non-linear. When searching a large, multi-modal, or high-dimensional input space, a GA may offer significant benefits over more typical optimization techniques such as gradient-based search.
Chapter 5
Results: One Dimensional
Simulators
This chapter presents a performance comparison of the EI criteria in Picheny et
al. (2013) and Ranjan (2013) for finding the global minimum of one dimensional
simulators observed with low and high noise. The optimization of the EI function
is done using either the GA or a one-shot space filling design approach. We take
a random Latin hypercube design of 4000 points for evaluating EI using the space
filling design. For the GA, we start with an initial population of 200 and go up
to 10 generations. These GA settings give the same total number of EI function
evaluations as for the space filling design.
As in any sequential design procedure, the size and configuration of points
in the initial design (n0) can impact the performance in the sequential design.
Loeppky, Sacks & Welch (2009) suggest using n0 = 10d points as a reasonable rule-of-thumb for an initial design in the case of deterministic simulators; however, Chipman et al. (2012) observed that the optimal choice of n0 depends on the complexity of the simulator. Certainly, if a very small number of points is used in the
initial design, the algorithm based on BART or the GP model may take longer
to find the optimum. On the other hand, one may not want to use a very large
initial design and put several points in the unimportant regions of the input space.
Following Chipman et al. (2012) for deterministic simulators, we choose an initial
design of n0 = 15 in the case of one-dimensional examples of noisy simulators.
Two one-dimensional test functions with additive independent Gaussian noise are
used for generating outputs from noisy simulators.
Recall that the outputs from the noisy computer simulator are assumed to be of the form ỹ(x) = y(x) + ε, where ε ~ N(0, τ²), Var(y(x)) = σ_z² and Var(ε) = τ². We control the noise variance τ² in the experiment using the factor δ = τ/σ_z; for a specified value of δ, the noise standard deviation is τ = δ·σ_z. Note that δ is the square root of the nugget parameter in (3.12). We consider a "low" and a "high" noise case, corresponding to δ = 0.05 and δ = 0.2 respectively. The simulation results are obtained for the GP as well as BART.
In both examples, an initial space-filling design of n0 = 15 points was used, and
an additional 2n0 = 30 points were added one at a time in the sequential design
using I1 or I2. The surrogate model minimum can sometimes be smaller than
the simulator minimum because ˆy(x) is an estimate based on noisy data. In each
case, we do 25 replicates of the experiment, each replicate loops over addition
of 30 additional points. For each replicate after the addition of each of the 30
points, we generate a running estimate of the predicted minimum response of the
surrogate model. This gives a 30 × 25 matrix of predicted minima. For each of
the 30 iteration numbers, we take the median over the 25 predicted minima. This
gives a sequence summarizing the estimated minima over the addition of the 30
points in sequential design. Sequences of 10th and 90th quantiles are generated in
the same manner. A second performance measure, based on the optimal location
x where the surrogate minimum occurs, is generated in a similar fashion. For each
of the 30 × 25 minimizers, we calculate the Manhattan distance from the location
of the true minimizer. These distances are summarized by sequences of medians,
10th and 90th quantiles in the same fashion as the predicted minima.
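A short sketch of how these summary curves can be assembled, assuming the 30 × 25 matrix of running surrogate minima (or of Manhattan distances to the true minimizer) has already been recorded; the names are illustrative.

```python
import numpy as np

def summarize_replicates(values):
    """
    values: array of shape (n_added, n_reps), e.g. (30, 25), holding the estimated
    surrogate minimum (or the distance to the true minimizer) after each added point,
    for each replicate. Returns the median, 10th and 90th percentile curves.
    """
    median = np.median(values, axis=1)
    q10, q90 = np.percentile(values, [10, 90], axis=1)
    return median, q10, q90
```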
From here onwards, we refer to the EI proposed by Picheny et al. (2013) as the
“Picheny method” and the EI proposed by Ranjan (2013) as the “Ranjan method”.
We hypothesize that the Ranjan method performs better than the Picheny method
in achieving the global minimum of one-dimensional noisy simulators with low
noise (δ=0.05) and high noise (δ=0.2). The simulation results presented here are
summarized on the basis of the following factors:
a) EI criterion: Ranjan method vs. Picheny method
b) EI optimization: GA vs. one-shot space filling design approach
c) Surrogate model: GP vs. BART
d) Noise level: low (δ = 0.05) vs. high (δ = 0.2).
5.1 First 1-dimensional Computer Simulator
Example 1 : Suppose the underlying deterministic simulator outputs are gener-
ated using the one-dimensional test function
$$y(x) = \frac{\sin(20\pi x + 5\pi)}{4x + 1} + (2x - 0.5)^4,$$
where x ∈ [0, 1]. This test function is a one-dimensional function from Ranjan
(2013). The simulator is plotted in Figure 5.1 with noisy realizations corresponding
to δ = 0.05 (Figure 5.1(a)) and δ = 0.2 (Figure 5.1(b)). The global minimum for
the underlying deterministic simulator is ymin = −0.868 at x = 0.025, and the
range of the function is −0.868 ≤ y(x) ≤ 5.06 for x ∈ [0, 1].
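As a sketch, noisy outputs for this example can be generated by adding Gaussian noise with standard deviation τ = δ·σ_z to the deterministic response; the estimate of σ_z from a dense grid below is an illustrative stand-in for however it was computed in the thesis.

```python
import numpy as np

def y_det(x):
    """Deterministic 1d test function of Example 1."""
    return np.sin(20 * np.pi * x + 5 * np.pi) / (4 * x + 1) + (2 * x - 0.5) ** 4

def noisy_simulator(x, delta=0.05, rng=None):
    """Noisy simulator output: y(x) + eps with eps ~ N(0, tau^2) and tau = delta * sigma_z."""
    rng = np.random.default_rng(rng)
    sigma_z = np.std(y_det(np.linspace(0.0, 1.0, 10001)))   # rough stand-in for sigma_z
    return y_det(x) + rng.normal(0.0, delta * sigma_z, size=np.shape(x))
```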
Each of Figures 5.2 − 5.5 presents four panels. Each panel has two median
curves, and the 10-th and the 90-th quantiles of two improvement functions (I1 and
I2). The green colored horizontal line denotes the deterministic simulator global
minimum, ymin = −0.868, in each panel of Figures 5.2 and 5.4. In Figures 5.3 and
5.5, the green horizontal line at zero denotes zero distance from the true simulator
minimizer. The four panels correspond to the combination of two models (GP
and BART) with two EI optimization strategies (GA and space filling design
approach).
Figure 5.1: 1d simulator in Example 1. The black curve is the underlying deter-
ministic simulator and the noisy outputs are shown in blue.
Figure 5.2 shows the surrogate model minimum for low noise (δ = 0.05). Within every replicate, predictions are obtained by refitting the surrogate model after the addition of each point, so the median curves are not monotonically decreasing. The smaller values of the estimated minimum in the right panels indicate that BART performs better than the GP. Within the BART results, there is little difference between the Picheny method and the Ranjan method. The GP, on the other hand, identifies the simulator global minimum only at the 10-th quantile of the replicates. The quantile curves for the GP are more spread out, showing that there is more variation with the GP. For both the GP and BART, the GA and the one-shot space filling design approach lead to comparable EI optima and hence comparable model fits.
Figure 5.3 presents the Manhattan distance between the true deterministic simulator minimizer and the fitted surrogate model minimizer for low noise, using the same layout as Figure 5.2. It shows that the 90-th quantile of the GP for the Ranjan method is closer to the simulator minimizer.
Figure 5.2: 1d simulator in Example 1. Each panel shows the median, 10% and
90% quantiles of surrogate model minimum for Picheny and Ranjan method.
Experimental settings are n0 = 15, nnew = 30 and δ = 0.05. The four panels show
combinations of surrogate method (GP and BART) with EI optimizer (GA/One-
shot space filling design approach). The horizontal lines show the simulator global
minimum.
Figure 5.3: 1d simulator in Example 1. Each panel shows the median, 10% and
90% quantiles of distance between the simulator minimizer and surrogate
model minimizer for Picheny and Ranjan method. Experimental settings are
n0 = 15, nnew = 30 and δ = 0.05. The four panels show combinations of surrogate
method (GP and BART) with EI optimizer (GA/One-shot space filling design
approach).
Figure 5.4: 1d simulator in Example 1. Surrogate model minimum results for
δ = 0.2. Figure layout is the same as Figure 5.2.
But the median curve of the GP is basically flat. Thus, it appears that the sequential addition of points
does not help the GP model in finding the global minimum. However, for BART,
the median distances from the simulator minimizer decrease towards zero as more
points are added. This shows that BART is better at identifying the location of
the simulator minimizer as more points are added in the sequential design. The
GA and space filling design approach both exhibit comparable performance.
Figure 5.4 presents a similar layout of the surrogate model minimum for high noise (δ = 0.2). It shows that the GP did not identify the simulator global minimum at all, whereas BART succeeds in identifying the global minimum based on its median performance.
Figure 5.5: 1d simulator in Example 1. Distance between the simulator
minimizer and surrogate model minimizer results for δ = 0.2. Figure layout
is the same as Figure 5.3.
For BART, the space filling design approach works better than the GA at identifying the global minimum. For the GP, there is little difference between the space filling design approach and the GA. In all
cases, there is little difference between the Picheny and the Ranjan criteria.
Figure 5.5 presents a similar layout of the Manhattan distance between the
underlying deterministic simulator minimizer and the surrogate model minimizer.
Clearly, the GP fails to recognize the location of the simulator minimizer every time,
whereas the decreasing trend of the BART median towards the simulator min-
imizer shows that BART is better than the GP. The Picheny and the Ranjan
improvement criteria perform equally well for optimizing this high noise simula-
tor.
Of note here is the difficult nature of the problem of optimizing this particular simulator: it has a number of local minima very close to the global minimum. Despite this, BART outperforms the GP and identifies the global minimum for both low and high noise. This leads us to conclude that BART is an effective model especially in cases of abrupt changes, high noise and non-stationarity. There is little difference between the Picheny and the Ranjan method for this one-dimensional test function.
5.2 Second 1-dimensional Computer Simulator
Example 2 : Suppose the underlying deterministic simulator outputs are gener-
ated using the simple one-dimensional test function (DiMatteo, Genovese & Kass
(2001)):
$$y(x) = -\sin(4x - 2) - 2\exp\left(-480(x - 0.5)^2\right),$$
where x ∈ [0, 1]. The simulator is plotted in Figure 5.6 with noisy realizations
corresponding to δ = 0.05 (Figure 5.6(a)) and δ = 0.2 (Figure 5.6(b)). The global
minimum for the underlying deterministic simulator is ymin = −2.00 at x = 0.5
and the range of the function is such that −2.00 ≤ y(x) ≤ 0.99 for x ∈ [0, 1]. The
same experimental setup (25 replicates, low and high noise levels) as Example 1
is used.
Results are presented in Figures 5.7 − 5.10, following the same layout as Ex-
ample 1. Figure 5.7 presents the surrogate model minimum for low noise data.
Figure 5.6: 1d simulator in Example 2. The black curve is the underlying deter-
ministic simulator and the noisy outputs are shown in blue.
It clearly shows that both the GP and BART are successful in identifying the simulator minimum. There is little difference between the Picheny method and the Ranjan method. The quantile curves for the GP are more spread out, indicating more variation with the GP. For BART, the space filling design approach is better than the GA at identifying the simulator global minimum.
Figure 5.8 presents the Manhattan distance between the simulator minimizer
and the surrogate model minimizer. Unlike GP, BART places points much closer
to the simulator minimizer right from the beginning of the sequential design pro-
cedure. Both GP and BART seem to identify the location of the simulator global
minimum in a reasonable time. The GA and space filling design approaches both
have comparable performance in identifying the minimizer.
Figure 5.7: 1d simulator in Example 2. Surrogate model minimum results for
δ = 0.05. Figure layout is the same as Figure 5.2.
Figure 5.8: 1d simulator in Example 2. Distance between the simulator
minimizer and surrogate model minimizer results for δ = 0.05. Figure
layout is the same as Figure 5.3.
Figure 5.9: 1d simulator in Example 2. Surrogate model minimum results for
δ = 0.2. Figure layout is the same as Figure 5.2.
Figure 5.10: 1d simulator in Example 2. Distance between the simulator
minimizer and surrogate model minimizer results for δ = 0.2. Figure layout
is the same as Figure 5.3.
Figure 5.9 presents the surrogate model minimum for high noise data. The
quantile curves are spread out for both the GP and BART. The model trend with the Ranjan method changes more quickly, identifying the simulator global minimum faster. On
the 90-th quantile, the experiment with the Ranjan improvement function leads
the sequential design closer to the simulator minimum in both GP and BART.
For BART, the space filling design approach works better than the GA for EI
optimization.
Figure 5.10 presents the Manhattan distance between the simulator minimizer
and the surrogate model minimizer for high noise. On the 90-th quantile for GP,
the Ranjan method places points much closer to the simulator global minimizer.
Overall, the Ranjan method is clearly better than the Picheny method for
both surrogate models (GP and BART) in optimization of this high noise one-
dimensional simulator.
5.3 Summary
Both one-dimensional noisy simulators illustrated here are complex with multiple
local minima very close to the global minimum. BART performs well in both
situations and acts as an effective engine for sequential design and optimization.
In Example 2, the GP model appears to be an appropriate surrogate but opti-
mization with BART is still competitive. BART places points much closer to the
simulator global minimum right from the beginning (clearly evident in Figure 5.8)
and delivers the optimum quickly with fewer runs of the simulator. In Tables
5.1 and 5.2 we summarize the key findings for both one-dimensional computer
simulators.
Table 5.1: Summary of Results for Example 1

GP
  Low noise: fails to identify the simulator global minimum.
  High noise: fails to identify the simulator global minimum.

BART
  Low noise: identifies the simulator global minimum; BART performs better than the GP.
  High noise: identifies the simulator global minimum; the one-shot space filling design performs better than the GA; BART performs better than the GP.
Table 5.2: Summary of Results for Example 2

GP
  Low noise: identifies the simulator global minimum.
  High noise: identifies the simulator global minimum; the Ranjan method performs better than the Picheny method.

BART
  Low noise: identifies the simulator minimum; the one-shot space filling design performs better than the GA.
  High noise: identifies the simulator minimum; the Ranjan method performs better than the Picheny method; the one-shot space filling design performs better than the GA.
Chapter 6
Results: Higher Dimensional
Simulators
In this chapter we consider two-dimensional and four-dimensional noisy computer simulators. The simulation results presented here are summarized on the basis of the
following factors:
a) EI criterion: Ranjan method vs. Picheny method
b) EI optimization: GA vs. one-shot space filling design approach
c) Surrogate model: GP vs. BART
d) Noise level: low (δ = 0.05) vs. high (δ = 0.2).
As with the one-dimensional test functions in Chapter 5, we hypothesize that in
higher dimensions, the Ranjan method performs better than the Picheny method
in identifying the global minimum of a noisy simulator. In each case we do 25 repli-
cates of the experiment, each replicate looping over the addition of new points. We
measure performance over the replicates by median and quantiles of the estimated
minimum and distance from the true simulator minimizer.
In Example 3, an initial space-filling design of n0 = 20 points was used, and
an additional 2n0 = 40 points were added, one at a time, in the sequential design
using the Picheny and Ranjan criteria. In Example 4, an initial space-filling design
of n0 = 40 points was used, and an additional 3n0 = 120 points were added.
6.1 2-dimensional Computer Simulator
Example 3 : Suppose the underlying deterministic simulator outputs are gener-
ated using the two-dimensional log-Goldprice function.
$$
\begin{aligned}
y(x_1, x_2) = {}& \log\!\left[1 + (4x_1 + 4x_2 - 3)^2\left(99 - 104x_1 - 104x_2 + 96x_1x_2 + 3(4x_1 - 2)^2 + 3(4x_2 - 2)^2\right)\right] \\
&+ \log\!\left[30 + (8x_1 - 12x_2 + 2)^2\left(12(4x_1 - 2)^2 + 27(4x_2 - 2)^2 + 160x_1 + 480x_2 - 576x_1x_2 - 158\right)\right].
\end{aligned}
$$
The input x = (x1, x2) is defined on [0, 1]². Here again, we consider the simulator with low noise (δ = 0.05) and high noise (δ = 0.2), with δ as defined in Chapter 5. The global minimum for the underlying deterministic simulator is ymin = 1.504 at x = (0.5, 0.26), and 1.504 ≤ y(x) ≤ 13.830 for x ∈ [0, 1]².
Figure 6.1 shows plots of the deterministic response y along x2 conditional on x1
(Figure 6.1(a)) and along x1 conditional on x2 (Figure 6.1(b)). Figure 6.2 shows
the contour plot of the response y.
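A direct implementation of this test function, written as a sketch that follows the formula above, is:

```python
import numpy as np

def log_goldprice(x1, x2):
    """Two-dimensional log-Goldprice test function on [0, 1]^2 (Example 3)."""
    a = 1 + (4 * x1 + 4 * x2 - 3) ** 2 * (
        99 - 104 * x1 - 104 * x2 + 96 * x1 * x2
        + 3 * (4 * x1 - 2) ** 2 + 3 * (4 * x2 - 2) ** 2)
    b = 30 + (8 * x1 - 12 * x2 + 2) ** 2 * (
        12 * (4 * x1 - 2) ** 2 + 27 * (4 * x2 - 2) ** 2
        + 160 * x1 + 480 * x2 - 576 * x1 * x2 - 158)
    return np.log(a) + np.log(b)
```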
Each of Figures 6.3 − 6.6 shows results based on the estimated minimum and
the distance from the true minimizer, following the same layout as in Chapter 5.
Figure 6.3 presents the surrogate model minimum for low noise data. For the
GP model, the Ranjan method leads the experiment towards the global minimum
faster than the Picheny method. For BART, on the median, there is a visible
difference between the Ranjan and the Picheny methods. The Ranjan method is
slightly better than the Picheny method. Overall, GP outperforms BART and
achieves the simulator global minimum faster. Both the GA and space filling
design approaches perform equally well.
Figure 6.4 presents the Manhattan distance between the simulator minimizer
and the surrogate model minimizer for low noise. For the GP, the experiment with the Ranjan improvement function leads the sequential design much closer to the simulator minimum.
(a) Conditional plot of underlying deterministic simulator along x2 given x1.
(b) Conditional plot of underlying deterministic simulator along x1 given x2.
Figure 6.1: Conditional plot of log-Goldprice function in Example 3.
For BART, the distance from the simulator minimizer shows a decreasing trend, indicating that the Ranjan method leads the sequential design much closer to the simulator minimizer. BART identifies the location of the simulator minimizer only at the 10-th quantile. Clearly the GP outperforms BART, with the Ranjan method outperforming the Picheny method.
Figure 6.5 presents the surrogate model minimum for high noise data.
Figure 6.2: Contour plot of log-Goldprice function in Example 3.
The quantile curves are more spread out for the GP. The median lines for the GP with the Picheny improvement criterion seem to plateau well above the true minimum, whereas the corresponding lines with the Ranjan method continue to decrease towards the simulator global minimum. Compared to the low-noise case, it takes much longer here to find the simulator global minimum. BART identifies the simulator
global minimum only on the 10-th quantile.
Figure 6.6 presents the Manhattan distance between the simulator minimizer
and the surrogate model minimizer for high noise. On the median and 90-th quan-
tile for GP, the Ranjan method places points much closer to the simulator global
minimizer compared to the Picheny method. BART with the Ranjan method
places points much closer to the simulator minimizer but it identifies the correct
location on only the 10-th quantile.
Thus, for the optimization of this two-dimensional noisy simulator, the GP outperforms BART in the presence of either low or high noise.
Figure 6.3: 2d simulator in Example 3. Surrogate model minimum results
with n0 = 20, nnew = 40 and δ = 0.05. Figure layout is the same as Figure 5.2
Figure 6.4: 2d simulator in Example 3. Distance between the simulator
minimizer and surrogate model minimizer results with n0 = 20, nnew = 40
and δ = 0.05. Figure layout is the same as Figure 5.3.
Figure 6.5: 2d simulator in Example 3. Surrogate model minimum results for
δ = 0.2. Figure layout is the same as Figure 5.2.
Figure 6.6: 2d simulator in Example 3. Distance between the simulator
minimizer and surrogate model minimizer results for δ = 0.2. Figure layout
is the same as Figure 5.3.
6.2 4-dimensional Computer Simulator
Example 4 : Suppose the computer outputs are generated using the four-dimensional
test function given by
$$y(x) = \sum_{i=1}^{4}\left[-\sin(4x_i - 2) - 2\exp\left(-480(x_i - 0.5)^2\right)\right],$$
where x = (x1, x2, x3, x4) ∈ [0, 1]⁴. Each of the four terms in this test function is equal to the one-dimensional function from Example 2, so y has a unique global minimum of ymin = −8 at x = (0.5, 0.5, 0.5, 0.5), and −8 ≤ y(x) ≤ 3.96 for x ∈ [0, 1]⁴. Here again, we consider noisy output with low noise (δ = 0.05) and high noise (δ = 0.2). Figure 5.6 plots the underlying one-dimensional function for a single term. That one-dimensional function has a global minimum of ymin = −2 at x = 0.5, which sits in a narrow spike, and a local minimum of y = −1 in the vicinity of x = 0.9. The local minimum is much easier to find than the global minimum because the curvature of y is smaller near the local minimum. Detecting the global minimum in the four-dimensional space would require at least a few design points in [0.4, 0.6]⁴, that is, in 0.16% of the total volume. Without such points, the surrogate models can be misled by the overall shape of the function (i.e., excluding the spike).
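A sketch of this four-dimensional test function, which simply sums the Example 2 term over the four coordinates:

```python
import numpy as np

def y_4d(x):
    """Four-dimensional test function of Example 4; x has shape (..., 4) with entries in [0, 1]."""
    x = np.asarray(x)
    return np.sum(-np.sin(4 * x - 2) - 2 * np.exp(-480 * (x - 0.5) ** 2), axis=-1)
```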
The number of points required for a space filling design to have a point close to the global optimum grows as the d-th power of the per-dimension density in d dimensions, so it becomes computationally challenging to find the optimal point by evaluating the function at that many points. Hence, in four dimensions it is no longer practical to use a space filling design to search for the minimum or to find the EI-optimal point, and only the GA approach is considered.
The sequential design procedure is initialized with n0 = 40, which is less than the n0 = 15 · 4 = 60 suggested by the one-dimensional examples, because with n0 = 60 the sequential design procedure would be too simple for BART.
Figure 6.7: 4d simulator in Example 4. Surrogate model minimum results
with n0 = 40, nnew = 120 and δ = 0.05. Figure layout is the same as Figure 5.2.
Figure 6.8: 4d simulator in Example 4. Distance between the simulator
minimizer and surrogate model minimizer results with n0 = 40, nnew = 120
and δ = 0.05. Figure layout is the same as Figure 5.3.
A smaller initial design set seems to be an appropriate choice to challenge the procedure. Then
nnew = 120 follow-up points are sequentially added and the discovery of the spiky
regions is left to the follow-up runs. To ensure the evaluation of the EI criterion
and the surrogate model in the spiky region, GA is considered with population
size= 800 and number of generations = 10.
The GA procedure is doing batch analysis at every generation step. This means
that the GA evaluates the fitness value for a batch of points and then chooses
some good points for next generation. It then evaluates the fitness function on
another batch of points including the points selected from previous generation at
the next generation step. This is time efficient and reliable. The batch analysis
is possible with the MPI implementation of BART since it stores the trees which
enables prediction at new locations without refitting the model. This is unlike the
“BayesTree” package which generates predictions while running MCMC and thus
predictions at new locations would require refitting the model.
As in previous examples, Figure 6.7 presents the surrogate model minimum
for low noise data. The GP fails to identify the simulator global minimum. On
the median, the Ranjan method leads the sequential design much closer to the
simulator global minimum. The Picheny method is flat after addition of about 40
points in the sequential design procedure and fails to identify the simulator global
minimum. BART is an effective choice for this simulator as every replicate brings
the procedure much closer to the simulator global minimum than does the GP.
Figure 6.8 presents the Manhattan distance between the simulator minimizer
and the surrogate model minimizer for low noise. For the GP, on the median,
the Ranjan method places points much closer to the simulator global minimum.
BART shows an overall decreasing trend in the distance from the simulator minimizer, indicating that it identifies the location of the simulator minimizer in every replicate, and does so quickly. The GP has failed to identify the global minimum of this high dimensional
simulator in the presence of noise. This is because the deterministic simulator
function changes very rapidly near the global minimum, and the stationarity assumption of the GP makes it difficult to model such behavior.
Figure 6.9 plots the GP predicted response y against each coordinate of the input vector x for the four dimensional computer simulator in the presence of low noise. The GP predictions are based on the final step in the sequential design, with 160 training points. The plot clearly shows that the GP model finds the simulator global minimum in only 2 of the 4 coordinates (x1 and x4); in the other 2 coordinates it finds the local minimum. Because of the smoothness constraint of GP data modeling, the GP fails to locate all four coordinates of the minimizer and thus fails to recognize the global minimum.
Figure 6.10 presents the surrogate model minimum for high noise. The GP
clearly fails to recognize the simulator global minimum in the presence of high
noise. On the other hand, BART is effective in identifying the simulator global
minimum. On the median, the decreasing trend of the BART model confirms that BART
is able to guide the sequential design much closer to the simulator global minimum.
On the 90-th quantile, the Ranjan method works a bit better than the Picheny
method.
Figure 6.11 presents the Manhattan distance between the simulator minimizer
and the surrogate model minimizer in the presence of high noise. The GP predic-
tions are based on the final step in the sequential design, with 160 training points.
For BART, the decreasing trend of the distance from the true minimizer location
shows that it is successful. Based on the performance at the median and the 90-th quantile, the Ranjan method works better than the Picheny method.
Figure 6.12 shows the plot of the GP predicted response y vs. the input vector
x for the four dimensional computer simulator in the presence of high noise. It
shows that the GP model identifies the simulator global minimizer coordinate in
only one (x2) of the four coordinates. Hence the GP model is not an appropriate surrogate for this four-dimensional noisy simulator.
Figure 6.9: 4d simulator in Example 4. Plot of the response y vs. each coordinate
of the input x for the GP model. This plot shows the sequential design procedure
with n0 = 40, nnew = 120 and δ = 0.05.
Figure 6.10: 4d simulator in Example 4. Surrogate model minimum results
for δ = 0.2. Figure layout is the same as Figure 5.2.
  • 12. List of Tables 5.1 Summary of Results for Example 1 . . . . . . . . . . . . . . . . . . 52 5.2 Summary of Results for Example 2 . . . . . . . . . . . . . . . . . . 53 6.1 Summary of Results for Example 3 . . . . . . . . . . . . . . . . . . 69 6.2 Summary of Results for Example 4 . . . . . . . . . . . . . . . . . . 70 xii
Abstract

A computer experiment is often used when physical experimentation is complex, time consuming or expensive. In computer experiments, large computer codes called computer simulators are written to represent numerical models of real phenomena. Realistic simulators are often time consuming to run, and thus we approximate them with surrogate statistical models. In this thesis we consider two surrogates, Gaussian process (GP) models and Bayesian Additive Regression Trees (BART) models. Many simulators are deterministic, that is, re-running the code with the same inputs gives identical results. Yet it is well known that many simulators display numerical noise: rather than lying on a smooth curve, results appear to contain a random scatter about a smooth trend. In this thesis, we focus on minimizing simulator output observed with noise. Efficient optimization of an expensive simulator is a challenging problem. Jones, Schonlau & Welch (1998) proposed a merit-based criterion called Expected Improvement (EI) for carefully choosing points in a sequential manner to identify the global minimum of a deterministic simulator. Our objective is to compare the improvement functions proposed by Picheny, Ginsbourger, Richet & Caplin (2013) and Ranjan (2013) for global optimization of a noisy simulator. Four test functions are used as simulators for performance comparison, and the EI optimization is done either using a one-shot space-filling design or a genetic algorithm (GA).
  • 14. Chapter 1 Introduction Many phenomena such as car crashes, nuclear fusion, power generation and uni- verse expansion are of utmost interest to us. Yet, clearly, direct physical ex- periments to examine these complex phenomena are either impossible (universe expansion), expensive (car crashes), infeasible (nuclear fusion) or time consuming (power generation). As a result, mathematical models are often used to build a realistic representation of these phenomena, enabling experimentation. For in- stance, using a closed-form mathematical expression one can simulate the flow rate through a borehole which is drilled from the ground surface through two aquifers (Worley, 1987). These mathematical models also known as computer simulators, typically take a number of inputs and when they are run generate a particular output. Running computer simulators can also be time consuming when output values are desired for a large number of different input settings or when the un- derlying process is complex. In such situations, a second level of approximation, a surrogate model is used to approximate the input and output relationship of the simulator. These surrogates are flexible regression models, taking an input vector x and predicting a real-valued output y. As with most statistical regression models, a surrogate model is estimated using training data which consists of n observations (x1, y1), ..., (xn, yn). Unlike many real-world phenomena, a computer simulator has been tradition- 1
  • 15. ally assumed to be deterministic. That is, every time the simulator is run with input x, the same numeric value of output y is obtained. Surrogate models that capture such behavior, i.e., exactly interpolating the training data, are popular in computer experiments (Sacks, Welch, Mitchell & Wynn, 1989; Williams, Santner & Notz, 2000). Yet, it is well known that many computer simulators display numerical noise. For instance, Forrester, Keane & Bressloff (2006) illustrates the use of a com- putational fluid dynamics (CFD) simulator to calculate aerodynamic forces and recognizes the noise in such simulations. In this thesis we focus on the computer simulator with outputs that contain noise. Noise in Computer Experiments In physical experiments, noise usually accounts for several uncontrolled variables such as variations of the experimental setup, measurement precision, etc. For CFD simulations, Forrester et al. (2006) recognizes the noise due to three main reasons: discretization error, incomplete convergence and inaccurate application of boundary conditions. Noise in computer experiments can have many sources including those observed in the CFD simulations. The nature of the noise usually depends on the associated simulator. When Monte carlo simulations are involved in the output evaluation, error can occur due to the finite number of the Monte carlo samples. The error in Monte carlo experiments is independent from one simulation to another, even for measurements with the same input variables. See Gramacy and Lee (2012) for further discussion on sources of error in computer simulators. Such simulators with independent errors are popular in computer experiments and often referred to as non-deterministic simulators. 2
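To make the Monte Carlo source of noise concrete, here is a toy non-deterministic simulator in R. It is purely illustrative (not one of the simulators studied in this thesis): the output is a Monte Carlo estimate based on a finite number of samples, so replicate runs at the same input give slightly different values, exactly the kind of independent error described above.

# Toy non-deterministic simulator: the output is a Monte Carlo estimate of an
# integral, so re-running it at the same input x returns a different value.
mc_simulator <- function(x, n.mc = 1000) {
  u <- runif(n.mc)
  mean(exp(-x * u))          # Monte Carlo estimate of integral_0^1 exp(-x*u) du
}
mc_simulator(2); mc_simulator(2)   # two runs at the same input differ slightly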
Noisy simulator output is of the form

  ỹ(x) = y(x) + ε,    (1.1)

where ε ∼ N(0, τ²) is assumed to be independent noise for different input configurations, and y(x) ∈ R is the underlying deterministic simulator output for x ∈ D ⊂ R^d. The design problem of prime interest in this thesis is to estimate the minimum of y(x) and to identify the minimizer x when ỹ(x) is observed.

We use a sequential design approach for finding the global minimum of y(x). The approach starts by evaluating the response surface (i.e., the true computer simulator) at a few points, modeling the response surface with a surrogate model, and then sequentially adding new points by maximizing a figure of merit, such as the Expected Improvement (EI) criterion proposed by Jones et al. (1998), and updating the surrogate model. This allows for the refinement of the surrogate model and for an increase in the prediction accuracy of the simulator global minimizer. Picheny et al. (2013) generalized the EI criterion of Jones et al. (1998) by proposing a quantile-based criterion for the sequential optimization of a noisy simulator. This allows for an elegant treatment of heterogeneous response precisions. Ranjan (2013) proposed a slightly different formulation of the improvement function as compared to the one in Picheny et al. (2013). The main objective of this thesis is to compare the two improvement criteria proposed by Picheny et al. (2013) and Ranjan (2013) under two surrogate models (GP and BART), where EI optimization is done by either a one-shot space filling design approach or a genetic algorithm (GA).

The remainder of the thesis is organized as follows. Chapter 2 reviews the Bayesian literature to lay a foundation for building the surrogate models and other methodologies. Chapter 3 discusses the statistical surrogate models used in
  • 17. this research. In Chapter 4, we include details of the components of sequential design and the expected improvement criteria. Chapter 5 and Chapter 6 present simulation results. Finally, in Chapter 7 we conclude with some important remarks and future recommendations. 4
  • 18. Chapter 2 Bayesian Approach to Data Modeling Modern Bayesian statistics is a rich and powerful framework for inferential data analysis. Since about 1990, there has been a dramatic growth in the use of Bayesian methods. In some application areas today where data are scarce or ex- hibit complex structures, a Bayesian approach is almost a hallmark of leading-edge research. However, frequentist methods are still dominant in the more traditional application areas of statistics. For the research presented in this thesis, a Bayesian approach to data modeling provided significant advantages. In this chapter we introduce the relevant Bayesian terminologies. Section 2.1 gives an overview of the Bayesian approach to statistics. Section 2.2 addresses Bayesian computation via Markov chain Monte Carlo (MCMC) and provides some background useful for EI computation. Section 2.3 lists the advantages of using the Bayesian approach to statistical inference. 2.1 Overview As with other approaches to statistics, the objective of analyzing data is to make inferences about some unknown parameters. Knowledge about the parameters of the model before observing the data is termed as “prior information”. A key fea- 5
ture of the Bayesian approach to statistical inference is that the prior information can be incorporated into the learning process in a mathematically rigorous and conceptually clear manner. That is, the parameter θ of a model is given a prior distribution p(θ), and inference, given data y, proceeds by combining the prior with the likelihood p(y|θ) in Bayes' theorem. This yields a posterior distribution p(θ|y), defined as

  p(θ|y) = p(y|θ) p(θ) / p(y).

Learning is an ongoing process and Bayes' theorem reflects this fact in a very elegant way. Bayes' theorem can be applied sequentially to assimilate data piece by piece. Thus at any point, the prior distribution represents the information that is available before observing a particular piece of data. The resulting information is then called the posterior distribution. In a way, "Today's posterior is tomorrow's prior".

There are two key steps of a basic Bayesian method:

1. Bayesian Modeling: Bayesian modeling is a two-step process: (a) Identify the unknown parameters and the inference questions about these parameters that are to be answered. (b) Construct the likelihood and the prior distribution to represent the available data and prior information.

2. Bayesian Analysis: Obtain the posterior distribution and derive inferences.

Bayesian analysis was very difficult until the advent of powerful computational tools. For a multivariate θ, it is necessary to obtain marginal posterior distributions by integrating the posterior w.r.t. the other elements of θ to make meaningful inferential statements. It is typically this integration which is difficult, especially for high-dimensional θ. Families of priors that combine with a likelihood to produce posterior distributions in the same family are called conjugate. Conjugate priors are particularly convenient because they lead to analytically tractable posteriors. In the period
  • 20. from the birth of modern Bayesian thinking in the 1950s to at least the mid-1980s, Bayesian analysis was restricted to situations in which conjugate prior distribu- tions were available or very simple problems. Problems that could be analyzed routinely by frequentist methods, such as generalized linear modeling with many explanatory variables, were outside the reach of Bayesian methods. This changed with the development of the computationally intensive but very powerful “Markov chain Monte Carlo” (MCMC) algorithms, so that modeling in very complex multi parameter situations is possible. 2.2 Markov Chain Monte Carlo The main idea of MCMC is to establish a Markov chain whose stationary distribu- tion is the posterior distribution of interest and to collect samples from that chain. MCMC is based on two conceptually very simple ideas. The first is sampling based computation, and the second is the theory of Markov chains. Suppose we wish to compute the posterior mean of the parameter θ1 which is the first element of the vector θ of say, k parameters. Formally this is E(θ1|y) = θ1 p(θ|y) dθ2dθ3...dθk. For moderate to large k (k ≥ 5) this computation is very intensive using numerical integration. However, imagine that we could take a sample of N values from the posterior distribution p(θ|y). Denote these by θ1 , θ2 , ..., θN . Then we would in par- ticular have a sample of N values of the first parameter θ1 obtained by taking the first element in each of the vectors θi , i = 1, ..., N. We could use the sample mean as the approximation to E(θ1|y). By increasing the number of posterior samples, N, we can improve the accuracy of the approximation to E(θ1|y). Directly sam- pling like this from the posterior is sometimes feasible even in some quite large and complex problems and is referred to as Monte carlo computation. 7
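As a small illustration of this sample-based computation, the R snippet below uses simulated draws in place of a real posterior (so that the true mean is known by construction) and shows the sample mean approximating E(θ1|y) more accurately as N grows:

# Illustration only: pretend the posterior of theta_1 is N(2, 1), so draws can be
# generated directly and the sample mean compared to the known posterior mean.
set.seed(1)
draws <- rnorm(10000, mean = 2, sd = 1)   # N posterior samples of theta_1
mean(draws[1:100])                        # rough approximation with N = 100
mean(draws)                               # better approximation with N = 10000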
  • 21. However, in most serious applications of Bayesian analysis, the posterior distri- bution is too complex and high dimensional for this direct approach to be feasible. We then employ the second device, which is based on the theory of Markov chains. We again obtain a series of vectors θ1 , θ2 , ..., θN , but these are not sampled directly from p(θ|y) and they are not independent. Instead each θi depends on the pre- vious θi−1 and is sampled from a proposal distribution g(θi |θi−1 ). This means that the θi s form a Markov chain. The conditional distribution g depends on the data y as well and is known as the transitional kernel of the chain. g is chosen such that, for sufficiently large i, the distribution of θi converges to the posterior distribution p(θ|y). Markov chain theory provides simple criteria under which this convergence will occur and in practice there are numerous ways of constructing a suitable kernel to sample from any posterior distribution. The combination of two ideas of sample-based computation and Markov chain is MCMC. The following are a few important remarks to note while using MCMC. • The number of runs until the Markov chain approaches stationarity depends on the starting point θ1 . A poor choice of the starting point can greatly increase the required time for a chain to converge. • Although convergence is guaranteed eventually, it is not possible to say how large a sample must be taken before successive values can be considered to be sampled from the posterior distribution. • Successive points in the Markov chain are correlated, and the strength of this correlation is very important. A highly correlated chain converges slowly and moves slowly around the parameter space, so that a larger sample is needed to compute relevant inferences accurately. Thus for the successful implementation of an MCMC algorithm we have to make the following choices : 8
  • 22. Burn in As a general practice, an initial portion of draws (such as a quarter) of the chain are discarded. These samples are known as the burn-in. Burn-in makes our draws closer to the stationary distribution and less dependent on the starting point. Thinning In order to break the dependence between draws in the Markov chain, it is a common practice to keep every d-th draw of the chain. This is known as thinning. The resultant sample will have less autocorrelation, and be closer to i.i.d. draws. Thinning also saves memory since only a fraction of the draws are saved. 2.3 Advantages of Bayesian Approach A Bayesian approach to data modeling provides significant advantages to this research in the following ways: • MCMC provides full accounting of uncertainty. The posterior distribution implicitly contains a full summary of the estimated model, rather than just point estimates of its parameters. This provides flexibility in formulation of improvement function and a more reliable calculation of Bayesian EI. • Another major advantage of using a Bayesian approach is the ability to find the predictive distribution of future observations. This, in turn, guides the sequential design of the experiment and helps to identify the global minimum. 9
  • 23. Chapter 3 Surrogate Models A complex mathematical model that produces a set of output values from a set of input values is commonly referred to as a computer simulator. The name stems from the necessity to have computers do extensive computations when the model cannot be solved analytically in closed form and/or it requires an iterative solution. Computer simulators can either be deterministic or non-deterministic. A surrogate model is a cheap approximation to a computer simulator and can be used to make predictions at new points in the input space without running the simulator again. Section 3.1 reviews the Gaussian process (GP) model, a commonly used surrogate model for deterministic simulators. Section 3.2 discusses the surrogate models (GP and BART) for non-deterministic simulators and Section 3.3 uses simple illustrations to compare and contrast the deterministic GP with the non- deterministic GP and BART models. 3.1 Deterministic Simulator Deterministic computer simulators are distinct from models of data from physical experiments in the sense that they are often not subject to replication error or observation error. Due to the lack of random error, traditional statistical modeling approaches are not useful. A simulator is said to be deterministic if the replicate runs of the same inputs will yield identical responses. Sacks et al. (1989) proposed 10
modeling (or emulating) such an expensive deterministic simulator as a realization of a Gaussian stochastic process (GP) model. Mathematically, the deterministic case corresponds to τ = 0 in (1.1), which implies ỹ(x) = y(x). In Section 3.1.1 we write the response as y(x) to emphasize its deterministic nature.

3.1.1 GP Model

Let the i-th input and output of the computer simulator be denoted by a d-dimensional vector, x_i = (x_{i1}, ..., x_{id})', and the univariate response, y_i = y(x_i), respectively. The experimental design D_0 = {x_1, ..., x_n} is the set of n input trials. The outputs of the simulation trials are held in the n-dimensional vector Y = y(D_0) = (y_1, ..., y_n)'. The simulator output y(x_i) is modeled as

  y(x_i) = µ + z(x_i),  i = 1, ..., n,    (3.1)

where µ is the overall mean, and z(x_i) is a GP with E(z(x_i)) = 0, Var(z(x_i)) = σ² and Cov(z(x_i), z(x_j)) = σ² R_{ij} for a suitably defined positive definite correlation structure R_{ij} = R(x_i, x_j). In general, y(D_0) has a multivariate normal distribution N_n(1_n µ, Σ), where Σ = σ² R, 1_n is an n × 1 vector of all ones, and R is the n × n correlation matrix. Although there are several choices for the correlation structure R_{ij} = R(x_i, x_j), we use the Gaussian correlation because of its properties such as smoothness and popularity in other areas like machine learning and geostatistics. The Gaussian correlation structure is a special case (p_k = 2) of the power exponential correlation family,

  R(x_i, x_j) = \prod_{k=1}^{d} exp{ −θ_k |x_{ik} − x_{jk}|^{p_k} }  for all i, j.    (3.2)

The parameter θ_k ≥ 0 in (3.2) controls the sensitivity of the GP w.r.t. the k-th coordinate; a large θ_k results in y values that can vary quickly along this axis, making the response function more complex. The correlation between y(x_i) and y(x_j) falls quickly to zero as the difference between x_i and x_j in the k-th coordinate increases. If θ_k = 0, the k-th input does not appear in the surrogate model, leading to a reduction in dimension. The parameter estimation of a GP model is briefly reviewed here.

Parameter Estimation

Given a set of n simulator outputs y = (y_1, ..., y_n)' at D_0 = {x_1, ..., x_n}, and a correlation structure R(x_i, x_j), suppose we wish to fit the GP model (i.e., to estimate Ω = (θ_1, ..., θ_d, µ, σ²)). The likelihood function for Ω is written as

  L(Ω) = (1/√(2π))^n · |Σ|^{-1/2} · exp{ −(1/2)(y − µ1_n)' Σ^{-1} (y − µ1_n) },    (3.3)

which implies

  L(Ω) ∝ |Σ|^{-1/2} exp{ −(1/2)(y − µ1_n)' Σ^{-1} (y − µ1_n) }.

From this we obtain the log likelihood

  l(Ω) = −(1/2) log(|Σ|) − (1/2)(y − µ1_n)' Σ^{-1} (y − µ1_n) + constant
       = −(n/2) log(σ²) − (1/2) log(|R|) − (y − µ1_n)' R^{-1} (y − µ1_n) / (2σ²).    (3.4)

Assuming θ = (θ_1, ..., θ_d) is known and following the steps below, we find µ̂ and σ̂². Expanding the last term of the log likelihood (3.4) gives

  (y − µ1_n)' R^{-1} (y − µ1_n) = y'R^{-1}y − y'R^{-1}µ1_n − µ1_n'R^{-1}y + µ² 1_n'R^{-1}1_n.

Differentiating the log likelihood (3.4) w.r.t. µ gives

  dl/dµ ∝ y'R^{-1}1_n + 1_n'R^{-1}y − 2µ 1_n'R^{-1}1_n.

Setting this derivative equal to zero gives

  2µ 1_n'R^{-1}1_n = 2 · 1_n'R^{-1}y,  i.e.,  µ̂ = (1_n'R^{-1}y) / (1_n'R^{-1}1_n).    (3.5)

Similarly, differentiating the log likelihood (3.4) w.r.t. σ² gives

  dl/dσ² = −n/(2σ²) + (y − µ1_n)' R^{-1} (y − µ1_n) / (2σ⁴).

Setting this derivative equal to zero gives the maximum likelihood estimate

  σ̂² = (y − µ̂1_n)' R^{-1} (y − µ̂1_n) / n.    (3.6)

The correlation matrix R in the maximum likelihood estimates of µ and σ² in (3.5) and (3.6) depends on θ. Closed form solutions for θ = (θ_1, ..., θ_d) are not possible, so numerical methods are used to calculate values for θ̂. When θ is assumed known, parameter estimation for µ and σ² is trivial, so the maximum likelihood estimation is done separately for θ and (µ, σ²).

Best Linear Unbiased Predictor

The next important part of the problem is to find the predictor ŷ(x) at an arbitrary point x in the input space. A good predictor should have the following features:

• Unbiased: E(ŷ(x)) = E(y(x)) = µ
• Linear in Y: ŷ(x) = C'Y for some n × 1 vector C
• Best: ŷ(x) should have the minimum variance in the class of all linear unbiased predictors.

The objective is to find a vector C such that ŷ(x) = C'Y and E(ŷ(x) − y(x))² is minimized. It is equivalent to find the BLUP for y*(x) = y(x) − µ. That is,
to find C* such that ŷ*(x) = C*'Y* and E(ŷ*(x) − y*(x))² is minimized, where Y* = Y − µ̂1_n. Writing r = r(x) = (R(x, x_1), ..., R(x, x_n))' for the vector of correlations between y(x) and the observed responses, we have

  E(ŷ*(x) − y*(x))² = E(C*'Y* − y*(x))²
                    = E(C*'Y*(Y*)'C* + (y*(x))² − 2C*'Y*y*(x))
                    = C*'σ²RC* + σ² − 2C*'σ²r.    (3.7)

Differentiating (3.7) w.r.t. C* and equating it to zero gives

  2σ²RC* − 2σ²r = 0,  so  C* = R^{-1}r.

Using the conditions ŷ*(x) = C*'Y* and y*(x) = y(x) − µ,

  ŷ(x) − µ̂ = C*'(Y − µ̂1_n)
  ŷ(x) = µ̂ + C*'(Y − µ̂1_n)
        = (1_n'R^{-1}Y)/(1_n'R^{-1}1_n) + r'R^{-1}[ Y − ((1_n'R^{-1}Y)/(1_n'R^{-1}1_n)) 1_n ]
        = [ 1_n'R^{-1}/(1_n'R^{-1}1_n) + r'R^{-1} − ((r'R^{-1}1_n)/(1_n'R^{-1}1_n)) 1_n'R^{-1} ] Y.

Following from the definition ŷ(x) = C'Y, we get

  C' = 1_n'R^{-1}/(1_n'R^{-1}1_n) + r'R^{-1} − ((r'R^{-1}1_n)/(1_n'R^{-1}1_n)) 1_n'R^{-1}.

From (3.7), the uncertainty in the predicted ŷ(x) is as follows:

  s²(x) = E(ŷ(x) − y(x))² = E(ŷ*(x) − y*(x))² = E(C*'Y* − y*(x))²
        = C*'σ²RC* + σ² − 2C*'σ²r
        = σ²(1 + C*'RC* − 2C*'r)
        = σ²(1 + C*'(RC* − 2r))
        = σ²[ 1 − r'R^{-1}r + (1 − 2(1_n'R^{-1}r) + (r'R^{-1}1_n)(1_n'R^{-1}r)) / (1_n'R^{-1}1_n) ]
        = σ²[ 1 − r'R^{-1}r + (1 − 1_n'R^{-1}r)² / (1_n'R^{-1}1_n) ].

Hence the predicted value for y(x) at an arbitrary x in the input space,

  ŷ(x) = µ̂ + r'R^{-1}(Y − µ̂1_n) = [ ((1 − r'R^{-1}1_n)/(1_n'R^{-1}1_n)) 1_n' + r'R^{-1} ] Y,    (3.8)

is the best linear unbiased predictor (BLUP), and the associated mean squared error is

  s²(x) = σ²[ 1 − r'R^{-1}r + (1 − 1_n'R^{-1}r)² / (1_n'R^{-1}1_n) ].    (3.9)

A GP model is a traditional surrogate model used to approximate outputs of deterministic computer simulators. GPfit (an R package by MacDonald, Ranjan and Chipman, 2014) facilitates an easy implementation of fitting such GP models. The GP model is conceptually straightforward, easily accommodates prior knowledge in the form of a covariance structure, and returns estimates of the simulator response with uncertainty. In spite of its simplicity, there are a few important limitations of a GP model:

1. Inference on the GP model scales poorly with the number of data points, typically requiring computing time that grows as O(n³) for sample size n. This is due to the inversion of the n × n correlation matrix R.
2. A standard GP model assumes a constant noise variance. This leads to restrictive modeling for complex datasets in which noise varies over the input space.
3. The GP model is assumed to have a stationary correlation structure.
4. The uncertainty estimate, s(x), associated with a predicted response under a GP model does not directly depend on any of the observed responses.
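The quantities in (3.2), (3.5)–(3.6) and (3.8)–(3.9) translate directly into a few lines of R. The sketch below is illustrative only: it assumes the correlation parameters θ are already known (so the numerical optimization over θ is not shown) and is not the GPfit implementation mentioned above.

# Gaussian correlation matrix (3.2) with p_k = 2, for row-wise inputs X1, X2 and theta (length d)
corr_matrix <- function(X1, X2, theta) {
  R <- matrix(1, nrow(X1), nrow(X2))
  for (k in seq_along(theta)) {
    D <- outer(X1[, k], X2[, k], "-")
    R <- R * exp(-theta[k] * D^2)
  }
  R
}

# Closed-form estimates (3.5) and (3.6), assuming theta (and hence R) is known
gp_mle <- function(X, y, theta) {
  n <- length(y); one <- rep(1, n)
  Rinv <- solve(corr_matrix(X, X, theta))
  mu.hat <- drop(t(one) %*% Rinv %*% y) / drop(t(one) %*% Rinv %*% one)
  sigma2.hat <- drop(t(y - mu.hat * one) %*% Rinv %*% (y - mu.hat * one)) / n
  list(mu = mu.hat, sigma2 = sigma2.hat, Rinv = Rinv)
}

# BLUP (3.8) and mean squared error (3.9) at a new input xnew (a 1 x d matrix)
gp_predict <- function(X, y, theta, fit, xnew) {
  n <- length(y); one <- rep(1, n)
  r <- drop(corr_matrix(xnew, X, theta))       # correlations r(xnew) with the design points
  y.hat <- fit$mu + drop(t(r) %*% fit$Rinv %*% (y - fit$mu * one))
  s2 <- fit$sigma2 * (1 - drop(t(r) %*% fit$Rinv %*% r) +
          (1 - drop(t(one) %*% fit$Rinv %*% r))^2 / drop(t(one) %*% fit$Rinv %*% one))
  c(fit = y.hat, s2 = max(s2, 0))
}

A typical call would be fit <- gp_mle(X, y, theta) followed by gp_predict(X, y, theta, fit, xnew), with X an n × d design matrix scaled to [0, 1]^d.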
3.2 Non-Deterministic Simulator

A simulator is said to be non-deterministic if replicate runs of the same inputs yield different responses. The non-deterministic simulator, like the deterministic one, can be emulated as a realization of a GP model with some modifications, or using other flexible regression models such as Bayesian additive regression trees (BART). BART (Chipman, George & McCulloch, 2010, henceforth denoted CGM) uses an ensemble-of-trees structure to emulate simulator outputs. We describe both a GP for noisy data and BART before showing in Chapter 4 how they can be used in an EI-based sequential design framework for estimating the global minimum.

3.2.1 GP Model for Noisy Data

The simulator output ỹ(x_i) is observed with noise and is modeled as

  ỹ(x_i) = µ + z(x_i) + ε_i,  i = 1, ..., n,    (3.10)

where µ is the overall mean, z(·) is a GP with mean 0, variance σ²_z and correlation structure R_{ij} (defined in (3.2)), and ε_i is independent N(0, τ²). The variance of the observed response Ỹ = (ỹ(x_1), ..., ỹ(x_n))' is

  Var(Ỹ) = σ²_z R + τ² I = σ²_z (R + δ I),    (3.11)

where δ is called the nugget parameter and I is the n × n identity matrix. From (3.11) the nugget parameter can be written as

  δ = τ² / σ²_z.    (3.12)

The nugget parameter is such that δ < 1, so as to ensure that the numerical uncertainty is smaller than the process uncertainty. Thus the total uncertainty in the output is given by

  σ²_total = σ²_z + τ².    (3.13)
The log likelihood, after profiling out the parameters µ and σ², for the noisy GP can be written as

  −2 log L_p = log(|R + δI|) + n log[ (Ỹ − 1_n µ̂(θ))' (R + δI)^{-1} (Ỹ − 1_n µ̂(θ)) ] + constant,

which can be maximized to get the maximum likelihood estimates of the model parameters θ and δ. Using the same procedure as in the deterministic case, the resulting BLUP is given by

  ŷ_δ(x*) = [ ((1 − r'(R + δI)^{-1}1_n)/(1_n'(R + δI)^{-1}1_n)) 1_n' + r'(R + δI)^{-1} ] Ỹ    (3.14)

with mean squared error

  s²_δ(x*) = σ²_z [ 1 + δ − r'(R + δI)^{-1}r + (1 − 1_n'(R + δI)^{-1}r)² / (1_n'(R + δI)^{-1}1_n) ].    (3.15)

The nugget δ also increases the numerical stability in the computation of (R + δI)^{-1}. The GP fit to the noisy data is achieved using the bgp function in the R package tgp (Gramacy, 2007). The tgp package uses a Bayesian approach to fit the GP model.

3.2.2 Bayesian Additive Regression Trees Model (BART)

The BART model uses regression trees to model the data in a Bayesian framework. BART represents the output ỹ as a sum of m adaptively chosen functions and an independent normal error. It seeks to approximate y(x) = E(ỹ|x) using a sum of trees. The model can be written as

  ỹ(x) = \sum_{j=1}^{m} g(x; T_j, M_j) + ε = y(x) + ε;  ε ∼ N(0, τ²).    (3.16)

The function g(x; T, M) produces an output when provided with a d-dimensional input x = (x_1, ..., x_d) and parameters T and M. It denotes a regression tree
  • 31. model. The predictions for a particular value of x are generated by following the sequence of decision rules in a tree T until arriving at a terminal node b at which an associated scalar prediction µb is returned. Decision rules differ in different trees Tj and Tj . For a tree T with B terminal nodes (i.e, partitioning the input space into B rectangular regions) let M = (µ1, ..., µB) denote the collection of terminal node predictions. Thus for an input vector x, tree model g gives a piecewise- constant output. By combining together an “ensemble” of m such tree models in (3.16) a flexible modeling framework is created. For example, if each individual Tj uses partitions on only a single input variable, then the BART model becomes an additive model. BART is capable of capturing both non-stationary and complex relationships by choosing the structure and individual rules of the Tj’s. Many individual trees can place split points in the same area, allowing the predicted function to change rapidly nearby, effectively capturing non-stationary behavior such as abrupt changes in the response. This statistical model has a number of parameters, T1, ..., Tm, M1, ..., Mm, τ : • Tj : Tree topology • Mj = (µ1, ..., µBj ) = outputs • τ2 = Var( ) Figure 3.1 depicts a tree model g(x; T, M) with terminal node µ’s. The function g(x; T, M) assigns a µ value to x where • T denotes the tree structure including the decision rules. • M = (µ1, µ2, µ3) is the set of terminal node µ’s. For example, the input x = (1.1, 5.4, 0.1, 2.3, 0.5) would lead to prediction µ2 = 5 since we branch left on x5 = 0.5 < 1 and then right on x2 = 5.4 ≥ 4. 18
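To make this concrete, the single tree of Figure 3.1 can be written as a tiny prediction function. The decision rules (split on x5 < 1, then x2 < 4) and the value µ2 = 5 come from the example above; the values used here for µ1 and µ3 are hypothetical placeholders, since they are not stated in the text.

# The tree of Figure 3.1 as a function g(x; T, M). mu1 and mu3 are placeholder values.
g_tree <- function(x, M = c(mu1 = 2, mu2 = 5, mu3 = 8)) {
  if (x[5] < 1) {              # branch left at the root on x5
    if (x[2] < 4) M["mu1"] else M["mu2"]
  } else {                     # branch right at the root
    M["mu3"]
  }
}

g_tree(c(1.1, 5.4, 0.1, 2.3, 0.5))   # returns mu2 = 5, as in the example above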
  • 32. Figure 3.1: A simple realization of g(x; T, M) in the BART model BART uses a sum-of-trees model which is vastly more flexible than a single tree model and is capable to account for high order interaction effects. The addi- tive structure with multivariate components makes it a very favorable statistical surrogate for optimizing a noisy simulator. It is an ensemble model which gives great in-sample fit and out-of-sample predictive performance. 3.2.3 Details of BART Model In this section we describe the form of the prior and algorithms for MCMC sam- pling of the posterior of the BART model. The prior specifications used in this thesis follow from CGM. Prior specification CGM consider values for m, the number of trees, between 50 and 200. We used m = 100 for obtaining results in this thesis. CGM specifies a prior structure as follows: p((T1, M1), (T2, M2), ..., (Tm, Mm), τ) = p(T1, T2, ...Tm) p(M1, M2, ..., Mm|T1, T2, ..., Tm)p(τ). 19
Figure 3.2: Three priors on residual variance τ when τ̂ = 2 is assumed for the BART model. In the legend, taken from CGM, "df" is ν and "quantile" corresponds to the quantile q at τ̂ = 2.

Since the dimension of M_j depends on T_j, this conditional structure is essential. CGM makes several simplifying assumptions about the prior on the T_j and M_j, for instance,

  p(T_1, T_2, ..., T_m) = \prod_{j=1}^{m} p(T_j),    (3.17)

  p(M_1, M_2, ..., M_m | T_1, T_2, ..., T_m) = \prod_{j=1}^{m} p(M_j | T_j),    (3.18)

and

  p(M_j | T_j) = \prod_{i=1}^{B_j} p(µ_{i,j} | T_j).    (3.19)

CGM specifies a prior on the residual variance τ² as

  τ² ∼ νλ / χ²_ν,

where χ²_ν is a chi-squared random variable with ν degrees of freedom. The parameter ν determines the spread of the prior and λ determines the location of the prior. Both ν and λ should be chosen to place a good amount of prior mass on plausible τ values. An equivalent way of specifying λ is to specify a guess at the upper q-th quantile of the prior distribution. Figure 3.2 shows an illustration in which τ̂ = 2 is the guess of the upper q-th quantile of the prior distribution. CGM suggests the three combinations of the hyperparameters ν and q as illustrated. These three combinations are referred to as conservative, default and aggressive. The df (ν) is usually taken to be between 3 and 10. In the simulation study in Chapters 5 and 6, we chose τ̂ as a fraction of the total data variation, i.e.,

  τ̂ = 0.2 · sd(ỹ),    (3.20)

where sd(ỹ) is the sample standard deviation of the training ỹ values. This prior specification allows for some noise in the response values. A choice such as (3.20) is one of the two ways that CGM recommends for specifying τ̂. The other approach is a linear model estimate, which is more appropriate for functions with an overall trend, as a linear model will capture the general form of the trend. Using the standard deviation of ỹ(x) as an estimated value of τ is a "total data variation" estimate and is more appropriate when the mean function does not have a trend, as can be the case in our simulation experiments. Of note here is that a strong prior belief in a small τ can lead to overfitting, so it ought to be avoided. In the simulation study in Chapters 5 and 6, we choose ν = 3 and use quantile q = 0.9 along with (3.20) to specify the τ prior.

A prior is put on each tree T using a tree-growing process. This implies a prior on tree size. The tree size limits the number of variables used in each weak learner g(x; T, M). We adopt the default choices for the tree prior described in CGM.

CGM specifies a prior for the terminal node parameters µ_j as follows. Suppose we have m = 100 trees; then the prediction of y(x) will be a sum of 100 µ's, one from
  • 35. each tree : E(˜y|x) = y(x) = 100 j=1 µj. Using equation (3.19), the variance of the quantity y(x) can be written as Var(y(x)) = 100 j=1 Var(µj) = 100 Var(µj). Specifying how much we expect the mean of y(x) given x to vary and take, Var(µj) = Var(y(x)) 100 and sd(µj) = sd(y(x)) √ 100 . (3.21) If we take sd(y(x)) ≈ range(˜y) 4 as a guess of the variation in the y(x) we might see over the input values, then equation (3.21) becomes sd(µj) = range(˜y) 4 √ 100 . The amount for shrinkage of µ’s depends on the number of trees (here taken to be 100). CGM specifies a normal distribution with mean 0 for each µj resulting in the default prior µ ∼ N(0, σ2 µ) = N(0, range(˜y)2 /(4k2 m)) with k = 2. We relaxed this prior to k = 1. Choosing a smaller value of k (i.e., k = 1) increases the prior variance of output y(x) = E(˜y|x), applying less shrinkage (or smoothness) of the response so that the fitted values come closer to interpolating the observed ˜y values. A similar relaxation of k was used in Chipman, Ranjan & Wang (2012). Although not part of the prior specification, several other operating parameters of BART are chosen as follows: The number of trees in the ensemble is chosen as m = 100. Individual trees are allowed to split on a fine grid of 1, 000 cutpoints along each axis. 22
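Given the choices above (ν = 3, q = 0.9 and τ̂ from (3.20)), the remaining hyperparameter λ can be backed out from the requirement that the prior on τ place probability q below the guess τ̂. The short calculation below is a sketch of one way to do this under the τ² ∼ νλ/χ²_ν prior described earlier; it is not taken from CGM's software.

# Choose lambda so that P(tau <= tau.hat) = q under tau^2 ~ nu*lambda/chisq_nu:
# P(tau^2 <= tau.hat^2) = P(chisq_nu >= nu*lambda/tau.hat^2) = q
lambda_from_quantile <- function(tau.hat, nu = 3, q = 0.9) {
  tau.hat^2 * qchisq(1 - q, df = nu) / nu
}

# Example with the settings used in this thesis: tau.hat = 0.2 * sd(y.tilde)
y.tilde <- rnorm(30)                               # stand-in for observed training responses
lambda_from_quantile(0.2 * sd(y.tilde), nu = 3, q = 0.9)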
  • 36. A common concern with Bayesian approaches is sensitivity to prior parame- ters. CGM found that the results were robust to reasonably wide range of prior parameters, including ν, q, σµ as well as number of trees m. m needs to be large enough to provide complexity to capture ˜y(x), but making m too large does not appreciably degrade accuracy although it does make MCMC sampling slower to run. CGM provides guidelines for choosing m. MCMC sampling of the posterior Posterior samples will be obtained using the MCMC algorithm outlined in CGM and summarized below. Let T(−j) be all the trees except Tj, define M(−j) similarly. The MCMC algo- rithm is as follows : • WHILE number of MCMC samples (i = 1, ..., N) DO repeat: – WHILE number of trees (j = 1, ..., m) DO repeat: 1. Metropolis Hastings step : Draw Tj conditional on ˜y, T(−j), τ 2. Draw Mj given ˜y, T1, T2, ..., Tm, M−j, τ 3. Draw τ given ˜y and all other parameters. – End WHILE • End WHILE The sample of Tj at step i is actually the modification of the Tj sample at i − 1. The MCMC algorithm uses 4000 iterations, discarding the first 500 (burn-in) and keeping every 10-th thereafter for a sample of 350 posterior draws. Larger posterior samples might be desirable but with quick mixing behavior of BART observed by CGM, a sample of this size will be sufficient for the sequential design. 23
  • 37. Final Prediction from MCMC Each sweep of the algorithm yields a draw from the posterior of y(x) = g(x; T1, M1)+ g(x; T2, M2) + ... + g(x; Tm, Mm). An average of draws of y(x) gives the posterior average of y(x). We have a sample of 350 posterior values of y(x) from the MCMC algorithm. Uncertainty in y(x) is available from the posterior distribution on y(x). The uncertainty bounds of y(x) can be obtained by the 10-th and 90-th quantile of this sample. 3.2.4 Implementation of BART The existing package in R called “BayesTree” (CGM) works for implementing the Bayesian ensemble model but it takes a long time to run. Pratola et al. (2014) implements the coding of BayesTree on a parallel computing system with the MPI (“Message Passing Interface”) protocol. The MPI implementation of BART runs faster than BayesTree. At each MCMC iteration, BayesTree pro- duces posterior samples for y. These posterior samples are saved and the T or M’s are not saved. To make predictions at new x locations, BayesTree must refit of the model. The MPI implementation of BART saves the posterior sam- ples ((T1, M1), (T2, M2), ..., (Tm, Mm), τ) of the parameters for later use and thus predictions at future locations are available without re-training the BART model. Our simulation study in Chapters 5 and 6 uses the MPI implementation of BART. 3.2.5 Advantages of BART The BART model is flexible and it adapts itself to the data. BART does not assume the continuity of the response, thus making it appropriate when there are abrupt changes or non-stationarity in the response. This is in contrast to a GP model which for many correlation functions implies continuity and a constant amount of “shift” of the response. It is computationally faster than GP model- 24
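The bookkeeping implied by these settings (4000 sweeps, a burn-in of 500, keeping every 10th draw) and the posterior summaries just described are simple to express in R. The snippet below is a generic sketch; the matrix of retained draws of y(x) is a stand-in for whatever the BART sampler returns.

# Iteration bookkeeping: 4000 sweeps, burn-in of 500, keep every 10th draw thereafter
keep <- seq(from = 501, to = 4000, by = 10)
length(keep)                                       # 350 retained posterior draws

# Stand-in for the retained draws of y(x): one row per draw, one column per test point
yx.draws <- matrix(rnorm(350 * 5), nrow = 350, ncol = 5)
post.mean  <- colMeans(yx.draws)                             # posterior average of y(x)
post.lower <- apply(yx.draws, 2, quantile, probs = 0.10)     # 10th quantile bound
post.upper <- apply(yx.draws, 2, quantile, probs = 0.90)     # 90th quantile bound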
  • 38. ing, has a quick burn in and convergence of MCMC, and has ability to identify low-dimensional structure in high dimensional data by using a “sum of trees” structure and specifying a prior on trees that encourages individual trees to have a smaller number of terminal nodes. It provides robustness of prediction to prior specification and is a competitive tool for predictive accuracy. 3.3 Surrogate Model Fit In Figure 3.3 we present the surrogate model fits to illustrate three comparisons: 1) Deterministic versus Non-deterministic simulator. Fig 3.3(a) shows the deter- ministic simulator as a dashed line and the design points lie on the deterministic simulator. Fig 3.3(b) and Fig 3.3(c) show the case of non-deterministic simu- lator where the design points don’t lie perfectly on the simulator (dashed line) showing that the output has random noise and it deviates from the deterministic trend. 2) Deterministic GP versus Noisy GP model fit illustrated by Fig 3.3(a) and Fig 3.3(b). Fig 3.3(a) shows the Gaussian data fit for a deterministic com- puter simulator. The model prediction denoted by the blue curve is an interpo- lator of the design points and the uncertainty bounds in red are football shaped whereas Fig 3.3(b) shows the Gaussian fit for a noisy computer simulator. Since the computer simulator output is noisy, the GP model fit (blue curve) is not an interpolator of all the design points and the uncertainty bounds are somewhat football shaped. 3) Noisy GP versus BART model fit illustrated by Fig 3.3(b) and Fig 3.3(c). Fig 3.3(b) shows the GP fit for a noisy computer simulator. The blue line curve is not an interpolator of design points and the uncertainty bounds are football shaped. Figure 3.3(c) shows the BART fit for a noisy computer simu- lator. The blue line curve is the BART fit which tries to interpolate the design points. The BART fit is a sum of predictions from many trees and thus provides a piecewise constant prediction. BART exhibits football shaped uncertainty bounds as typically shown by GP models for deterministic simulators (Fig 3.3(a)). 25
[Figure with three panels: (a) GP model fit to deterministic simulator outputs; (b) GP model fitted to non-deterministic simulator outputs; (c) BART model fitted to non-deterministic simulator outputs. Each panel shows the model prediction, uncertainty bounds, the simulator, and the design points.]

Figure 3.3: Surrogate models fitted to the simulator outputs generated using a one-dimensional test function: y(x) = −sin(4x − 2) − 2 exp(−480(x − 0.5)²) + ε, ε ∼ N(0, 0.15²)
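For reference, noisy training data of the kind shown in Figure 3.3 can be generated in a few lines; the test function and noise level below are the ones stated in the caption, while the number and placement of design points are arbitrary choices for illustration.

# One-dimensional test function from Figure 3.3, observed with N(0, 0.15^2) noise
y.det <- function(x) -sin(4 * x - 2) - 2 * exp(-480 * (x - 0.5)^2)
set.seed(2015)
x <- runif(10)                                   # a small set of design points in [0, 1]
y.noisy <- y.det(x) + rnorm(length(x), sd = 0.15)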
Chapter 4
Global Optimization

One of the areas of global optimization that we recognize here in Nova Scotia is tidal energy, where an important objective is to find an optimal turbine location for maximizing the power function. Figure 4.1 shows a map of the Minas Passage highlighting the favorable area and the corresponding power function. For such applications (with expensive simulators) one is interested in minimizing the total number of evaluations needed to find the global extremum.

[Figure with two panels: (a) A portion of the Bay of Fundy, NS (around the Minas Passage) showing the bathymetric profile; (b) A low-resolution tidal power simulator to be optimized for finding the optimal turbine location.]

Figure 4.1: Tidal energy application

In this chapter we explore sequential design strategies which simultaneously improve the accuracy of the surrogate model (e.g., GP and BART) and efficiently estimate the global optimum of the simulator. Section 4.1 introduces the sequential design framework in global optimization. Section 4.2 elaborates on the expected improvement criterion used in the framework. Section 4.3 defines two approaches
  • 41. to identify the input point with the best value of the expected improvement. 4.1 Sequential Design Sequential design is a form of active learning procedure frequently used in com- puter experiments. The technique is similar to regression in that we use training data to build a model, and then we try to predict output using the model and values of inputs. The difference is that in sequential design we can sequentially choose the training data. By “actively learning” about the model we hope to accurately find the global minimum with less data. The main idea is to rebuild the model as data are collected and use it to decide where to sample more data. Here we consider a sequential design in which the goal is to find the optimum of a noisy simulator. In the remainder of this thesis we will consider the goal as minimization as it is trivial to shift from minimization to maximization. The approach presented here is similar to that of Jones et al. (1998) for optimization of deterministic simulators. The sequential design algorithm can be summarized as follows: 1. Obtain an initial design Xn0 with n0 points, and evaluate the simulator at these points, yielding the corresponding simulator outputs Yn0 . This set of initial design points is referred to as the training set. 2. Set iteration i = 0. 3. Fit the surrogate model (e.g., GP or BART) using Xn0+i and Yn0+i. 4. Maximize EI over the input space, and let x∗ = argmax E(I(x)). 5. Evaluate the simulator at x∗ , augment Xn0+i and Yn0+i with x∗ and ˜y(x∗ ), and set i = i + 1. 6. Unless a stopping condition is met, go to Step 3. 28
7. Obtain the final model fit.

For choosing an initial design in Step 1, Latin hypercube sampling (LHS) schemes (McKay, 1979) are particularly useful because they have a space-filling property, i.e., they uniformly cover the x-domain to explore the function globally. We consider an LHS in [0, 1]^d. To ensure the coverage of the input space, we augment the (n0 − 2)-point LHS with corner points x = (0, 0, ..., 0) and x = (1, 1, ..., 1). This is done to aid GP and BART in assessing uncertainty near the boundaries of the input space. The number of points sampled at the initial stage may vary with the input dimension and the complexity of the simulator. The EI criterion in Step 4 is based on the idea that the additional function evaluation leads to a further reduction in the global minimum estimate if possible. This is discussed further in Section 4.2. For Step 6, we propose to stop when we have added a predetermined number of sample points (n_new) to our initial design. Instead of this stopping condition, the experiment can be run until the accuracy of the model fit is within the desired tolerance or until the value of the improvement function is small.

4.2 EI Criteria

Most sequential design procedures iterate between parameter estimation (rebuilding the model) and optimizing a criterion for choosing new points. An optimality criterion proposed by Jones et al. (1998) is called the Expected Improvement (EI). EI combines the mean and the variance structures of the surrogate model fit such that the method explores the experimental domain and at the same time exploits the potential areas where local optima occur. Jones et al. (1998) named the resulting algorithm Efficient Global Optimization (EGO).

Here we outline the EI criterion for a deterministic simulator. The superscripts (in parentheses) in the mathematical expressions below denote the number of points used in the surrogate model fit. Let f_min^{(n)} = min{ y(x_r), 1 ≤ r ≤ n } be the smallest function value among the n points sampled thus far. Jones et al. (1998) defines the improvement at a point x as

  I(x) = max( f_min^{(n)} − y(x), 0 ),    (4.1)

which is positive if the (unobserved) response y(x) at location x is less than the current best value f_min^{(n)}, and 0 otherwise. Since y(x) is unobserved, the expected value of I(x) is used as a sequential design criterion. For a GP model, the expectation of this improvement function can be obtained in closed form using the maximum likelihood approach. Assuming y(x) ∼ N(ŷ(x), s²(x)), where ŷ(x) and s²(x) are the BLUP and the associated mean squared error at x, the EI can be shown to be

  E(I(x)) = ( f_min^{(n)} − ŷ(x) ) Φ( (f_min^{(n)} − ŷ(x)) / s(x) ) + s(x) φ( (f_min^{(n)} − ŷ(x)) / s(x) ),    (4.2)

where Φ(·) and φ(·) are the standard normal cumulative distribution function and probability density function. The first term in (4.2) captures local search that seeks to improve the estimate of the minimum near a currently identified minimum. The second term captures the global search, which places points in regions where there is sufficient uncertainty that the minimum response could be nearby.

4.2.1 Improvement Functions for Noisy Simulators

The EI criterion defined in (4.2) is for deterministic computer simulators. Picheny et al. (2013) proposed an extension of this EI using quantiles of noisy simulator outputs. The EI criterion of Picheny et al. (2013) depends on the noise variances of the sampled simulator values and of the new candidate measurement. Picheny et al. (2013) generalized the improvement function of Jones et al. (1998), presented in (4.1), for the selection of the next training point x as

  I_1(x) = max( q_min^{(n)} − Q^{(n+1)}(x), 0 ),    (4.3)
where q_min^{(n)} = min_{1≤r≤n} Q^{(n)}(x_r) and Q^{(k)}(x) = ŷ^{(k)}(x) + Φ^{-1}(β) s^{(k)}(x) denotes the β quantile of the predicted response based on the surrogate fit obtained using k observed data points. In particular, q_min^{(n)} selects the smallest upper quantile based on the fit obtained from the n already observed points. Similar to the improvement function in Jones et al. (1998), Q^{(n+1)}(x) is an unobservable quantity based on an unobserved ỹ(x).

Ranjan (2013) proposes an alternative improvement function for minimizing the noisy simulator output,

  I_2(x) = max( q'_min^{(n)} − Q'^{(n+1)}(x), 0 ),    (4.4)

where q'_min^{(n)} = min_{1≤r≤n} Q'^{(n)}(x_r) and Q'^{(k)}(x) = ŷ^{(k)}(x) − Φ^{-1}(β) s^{(k)}(x), which represents the (1 − β) quantile of the predicted response based on the surrogate fit obtained using k observed data points.

Both (4.3) and (4.4) require a numerical value of β. We use β = 0.9 throughout the thesis. The main objective of this thesis is to compare the EI criteria corresponding to the two improvement functions I_1(x) and I_2(x).

4.2.2 EI Criteria for GP Model

Taking the expectation of I_1(x), the Picheny et al. (2013) definition of the improvement function, w.r.t. the predictive distribution leads to the following closed form expression:

  E[I_1(x)] = ( q_min^{(n)} − (ŷ^{(n)}(x) + Φ^{-1}(β) s^{(n)}(x)) ) Φ(u) + s^{(n)}(x) φ(u),    (4.5)

where

  u = ( q_min^{(n)} − (ŷ^{(n)}(x) + Φ^{-1}(β) s^{(n)}(x)) ) / s^{(n)}(x).

The new point is chosen by maximizing E(I_1(x)), which finds the point with the smallest upper quantile in the predictive distribution of ỹ. Similarly, for the Ranjan method, the closed form expression for EI is

  E[I_2(x)] = ( q'_min^{(n)} − (ŷ^{(n)}(x) − Φ^{-1}(β) s^{(n)}(x)) ) Φ(u) + s^{(n)}(x) φ(u),    (4.6)

where

  u = ( q'_min^{(n)} − (ŷ^{(n)}(x) − Φ^{-1}(β) s^{(n)}(x)) ) / s^{(n)}(x).

As expected, the new point is the one with the smallest lower quantile in the predictive distribution of ỹ.

4.2.3 EI Criteria for BART Model

For BART, MCMC is used for model fitting. In this thesis we formulate the EI criteria from a fully Bayesian perspective, which results in a corresponding Bayesian EGO method. The key steps of the sequential design algorithm are as follows. First, choose the initial design X_{n0} of n0 points, and evaluate the simulator at these points, yielding the corresponding simulator outputs Y_{n0}. The next step is to run the MCMC and collect N samples from the posterior distribution. For each posterior draw of the parameters,

  Θ^i = (T_1^i, T_2^i, ..., T_m^i, M_1^i, M_2^i, ..., M_m^i, τ^i),  i = 1, ..., N,

we can compute a corresponding posterior draw for the response mean y(x) = \sum_{j=1}^{m} g(x; T_j, M_j) and uncertainty (τ). Recall from (3.16) we have

  ỹ(x) = \sum_{j=1}^{m} g(x; T_j, M_j) + ε = y(x) + ε;  ε ∼ N(0, τ²).

These posterior samples can be evaluated at any x, although our focus will be on x in the test set.
Picheny et al. (2013) define the improvement function as follows:

  I_1^{(i)}(x) = max( q_min^{(n)} − Q^{(n+1)}(x), 0 ),    (4.7)

where q_min^{(n)} = min_{1≤r≤n} Q^{(n)}(x_r) and Q^{(k)}(x) = ŷ^{(k)}(x) + Φ^{-1}(β) τ, with ŷ^{(k)}(x) and τ obtained from the i-th MCMC sample. In the above expressions, the dependence of Q^{(n)}(x), q_min^{(n)}, ŷ^{(k)}(x) and τ on the MCMC sample i is suppressed. Q^{(k)}(x) represents the β quantile of the predicted response based on the surrogate fit obtained from k observed data points.

Similarly, Ranjan (2013) defines the improvement function as

  I_2^{(i)}(x) = max( q'_min^{(n)} − Q'^{(n+1)}(x), 0 ),    (4.8)

where q'_min^{(n)} = min_{1≤r≤n} Q'^{(n)}(x_r) and Q'^{(k)}(x) = ŷ^{(k)}(x) − Φ^{-1}(β) τ represents the (1 − β) quantile of the predicted response based on the surrogate fit obtained from k observed data points.

The next step is to obtain an MCMC approximation to the EI, which can be calculated by taking a sample average of the I_1^{(i)}(x) or I_2^{(i)}(x) values over the N MCMC posterior draws. Thus, for the Picheny method, the approximation to EI is given by

  E(I_1(x)) = (1/N) \sum_{i=1}^{N} I_1^{(i)}(x).    (4.9)

Similarly, for the Ranjan method, the approximation to EI is

  E(I_2(x)) = (1/N) \sum_{i=1}^{N} I_2^{(i)}(x).    (4.10)

We maximize E(I_1(x)) and E(I_2(x)) over the input space to find the follow-up input points.
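Both EI criteria are straightforward to evaluate once a surrogate fit is available. The sketch below is illustrative rather than a reproduction of the software used for the simulation studies. The first two functions evaluate the closed forms (4.5) and (4.6) for a GP fit, taking vectors of predictions ŷ(x) and standard errors s(x) at candidate points and the current q_min (assumed to be computed beforehand from the design points); the last one shows the MCMC averaging in (4.9)–(4.10), assuming a matrix of posterior draws of y(x), a vector of matching draws of τ, and a vector holding q_min computed draw by draw.

# EI of Picheny et al. (2013) for a GP fit, eq. (4.5); y.hat and s are vectors over candidates
ei_picheny_gp <- function(y.hat, s, q.min, beta = 0.9) {
  Q <- y.hat + qnorm(beta) * s               # beta quantile of the predictive distribution
  u <- (q.min - Q) / s
  (q.min - Q) * pnorm(u) + s * dnorm(u)
}

# EI of Ranjan (2013) for a GP fit, eq. (4.6); uses the (1 - beta) quantile instead
ei_ranjan_gp <- function(y.hat, s, q.min, beta = 0.9) {
  Q <- y.hat - qnorm(beta) * s
  u <- (q.min - Q) / s
  (q.min - Q) * pnorm(u) + s * dnorm(u)
}

# MCMC approximation for BART, eqs. (4.7)-(4.10): y.draws is an N x ncand matrix of
# posterior draws of y(x) at the candidates, tau.draws the matching draws of tau,
# and q.min a vector of length N holding q_min^(n) computed draw by draw.
ei_bart <- function(y.draws, tau.draws, q.min, beta = 0.9, ranjan = FALSE) {
  sgn <- if (ranjan) -1 else 1
  Q <- y.draws + sgn * qnorm(beta) * tau.draws   # one quantile surface per posterior draw
  imp <- pmax(q.min - Q, 0)                      # improvement for each draw and candidate
  colMeans(imp)                                  # average over the N posterior draws
}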
  • 47. 4.3 EI Optimization There are two strategies that are used in this thesis for EI optimization: a) one- shot space filling design and b) genetic algorithm (GA). The description of space filling design approach and GA approach is given here in general terms for a response function f(x) with an objective of maximization. 4.3.1 Optimization using a One-shot Space Filling Design In this section we discuss building the candidate set of points for evaluation and optimization of f(x). A random Latin hypercube design evenly samples the points in every 1-dimensional projection and obtains some degree of space filling in higher dimensions. For 1-dimensional (1d) and 2-dimensional (2d) simulators, a random Latin hypercube design (Carnell, 2009) is used as a fixed candidate set to sys- tematically evaluate f(x) at many points and optimize f(x) (or EI in our case). Use of a space filling design is a naive method in global optimization. A space filling design method of optimization is convergent under mild assumptions of dif- ferentiability of f(x) and is not practical in higher dimensional problems. For simulators with high-dimensional inputs, an effective one-shot space filling design approach would require a very dense set of thousands of points, and identification of the optimal location of this dense design is computationally expensive and dif- ficult. For instance, in 1d, a space-filling design of 100 points may densely sample the input space, however, in 2d, even 1002 points would not be able to achieve the same density of the sample. Thus sparsity increases with the dimension d and choosing many points makes the process infeasible quickly. Hence in higher dimensions, we use an iterative approach like the genetic algorithm as discussed in the next section. 34
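A minimal sketch of the one-shot approach, using the lhs package (Carnell, 2009) referenced above to build the candidate set (4000 points, as used for the low-dimensional examples in Chapter 5); ei_values() is a hypothetical stand-in for whichever EI criterion is being maximized.

library(lhs)                                   # Carnell (2009)

d <- 2
cand <- randomLHS(4000, d)                     # fixed space-filling candidate set in [0, 1]^d
# ei <- ei_values(cand)                        # EI evaluated at every candidate (surrogate-specific)
ei <- runif(nrow(cand))                        # placeholder values so the sketch runs
x.next <- cand[which.max(ei), , drop = FALSE]  # candidate with the largest EI is sampled next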
  • 48. 4.3.2 Optimization using Genetic Algorithm GAs were invented to mimic some of the processes observed in natural evolution. The idea with the GA is to use this power of evolution to solve optimization problems. The father of the original GA was John Holland who invented it in the early 1970’s. GAs are adaptive heuristic search algorithms based on the evolution- ary ideas of natural selection and genetics. As such they represent an intelligent exploitation of a random search used to solve optimization problems. Although randomized, GAs are by no means undirected, instead they exploit historical in- formation to direct the search into the region of better performance within the search space. A GA uses three main types of rules at each step to create the next generation from the current population: 1. Crossover rules combine two parents to form children for the next generation. 2. Mutation rules apply random changes to individual parents to form children. 3. Selection rules choose the individuals that continue to the next generation. Here the “individuals”, “parents” and “children” are x values and the selection rules will correspond to choosing individuals with good values of the response function f(x). Outline of our GA implementation The following outline summarizes how the GA works : • The algorithm begins by creating an initial population of size 200d using a random Latin hypercube design of x points in [0, 1]d . • The algorithm is run for 10 generations to get the best solution. 35
  • 49. • Within each generation, the algorithm uses the individuals in the current generation to create the next population using the following steps: 1. Produce children (or offsprings) by mutation and crossover. We apply crossover to the current population (of size 200d) by randomly combin- ing pairs of parents. 2. These offsprings are now subjected to mutation via a small perturbation of N(0, (0.05/3)2 ). At the end, we have (200d) new offsprings. 3. We now augment the parent population (of size 200d) with the children (of size 200d), and pass them through the selection process. That is, we keep the best 200d members of the 400d individuals from the combined population for the next generation. • The algorithm stops when the end condition (10 generations) is satisfied and it returns the optimal x in the population. The GA approach is useful because of its robustness to handle problems for the inputs in the presence of noise. GAs can be used to solve a variety of optimiza- tion problems in which the objective function is discontinuous, non-differentiable, stochastic or highly non-linear. In searching a large input space, multi-modal in- put space, or a high-dimensional surface, a GA may offer significant benefits over more typical optimization techniques like gradient based search. 36
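A compact sketch of the GA just outlined, written for maximizing a generic objective f over [0, 1]^d (in our case f is the EI surface). The population size 200d, the 10 generations and the N(0, (0.05/3)²) mutation follow the description above; the crossover rule is a simplified stand-in for the one actually used.

library(lhs)

ga_maximize <- function(f, d, pop.size = 200 * d, generations = 10, mut.sd = 0.05 / 3) {
  pop <- randomLHS(pop.size, d)                          # initial population in [0, 1]^d
  for (gen in seq_len(generations)) {
    # Crossover: combine randomly chosen pairs of parents coordinate by coordinate
    p1 <- pop[sample(pop.size, replace = TRUE), , drop = FALSE]
    p2 <- pop[sample(pop.size, replace = TRUE), , drop = FALSE]
    pick <- matrix(runif(pop.size * d) < 0.5, nrow = pop.size, ncol = d)
    kids <- ifelse(pick, p1, p2)
    # Mutation: small normal perturbation, kept inside the unit hypercube
    kids <- kids + matrix(rnorm(length(kids), sd = mut.sd), nrow = pop.size)
    kids <- pmin(pmax(kids, 0), 1)
    # Selection: keep the best pop.size of the combined parents and children
    all.x <- rbind(pop, kids)
    fit <- apply(all.x, 1, f)
    pop <- all.x[order(fit, decreasing = TRUE)[1:pop.size], , drop = FALSE]
  }
  pop[1, ]                                               # best individual found
}

# Example: maximize a toy 2-d function with its maximum at (0.3, 0.3)
ga_maximize(function(x) -sum((x - 0.3)^2), d = 2)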
• 50. Chapter 5
Results: One Dimensional Simulators
This chapter presents a performance comparison of the EI criteria in Picheny et al. (2013) and Ranjan (2013) for finding the global minimum of one-dimensional simulators observed with low and high noise. The EI function is optimized using either the GA or the one-shot space filling design approach. For the space filling design, we evaluate EI over a random Latin hypercube design of 4000 points. For the GA, we start with an initial population of 200 and run for 10 generations. These GA settings give the same total number of EI function evaluations as the space filling design. As in any sequential design procedure, the size and configuration of points in the initial design (n0) can affect performance. Loeppky, Sacks & Welch (2009) suggest n0 = 10d points as a reasonable rule of thumb for an initial design for deterministic simulators; however, Chipman et al. (2012) observed that the optimal choice of n0 depends on the complexity of the simulator. Certainly, if a very small number of points is used in the initial design, the algorithm based on BART or the GP model may take longer to find the optimum. On the other hand, one may not want to use a very large initial design and place many points in unimportant regions of the input space. Following Chipman et al. (2012) for deterministic simulators, we choose an initial design of n0 = 15 points for the one-dimensional examples of noisy simulators. 37
• 51. Two one-dimensional test functions with additive independent Gaussian noise are used for generating outputs from noisy simulators. Recall that the outputs from the noisy computer simulator are assumed to be of the form ỹ(x) = y(x) + ε, where ε ∼ N(0, τ²), Var(y(x)) = σz² and Var(ε) = τ². We control the noise variance τ² in the experiment using the factor δ = τ/σz. For a specified value of δ, the noise standard deviation is τ = δσz. Note that δ, as defined here, is the square root of the nugget parameter in (3.12). We consider a "low" and a "high" noise case, corresponding to δ = 0.05 and δ = 0.2 respectively. The simulation results are obtained for the GP as well as BART. In both examples, an initial space-filling design of n0 = 15 points was used, and an additional 2n0 = 30 points were added one at a time in the sequential design using I1 or I2. The surrogate model minimum can sometimes be smaller than the simulator minimum because ŷ(x) is an estimate based on noisy data. In each case, we run 25 replicates of the experiment, each replicate looping over the addition of 30 points. Within each replicate, after each of the 30 added points, we record a running estimate of the predicted minimum response of the surrogate model. This gives a 30 × 25 matrix of predicted minima. For each of the 30 iteration numbers, we take the median over the 25 predicted minima, giving a sequence that summarizes the estimated minima over the addition of the 30 points in the sequential design. Sequences of 10th and 90th quantiles are generated in the same manner. A second performance measure, based on the location x at which the surrogate minimum occurs, is generated in a similar fashion. For each of the 30 × 25 minimizers, we calculate the Manhattan distance from the location of the true minimizer. These distances are summarized by sequences of medians, 10th and 90th quantiles in the same fashion as the predicted minima. 38
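As an illustration of how these summary curves are computed, the sketch below assumes a matrix pred_min with 30 rows (one per added point) and 25 columns (one per replicate); the object names are hypothetical.

    # Summarizing the 30 x 25 matrix of predicted minima (sketch)
    # Rows index the 30 sequentially added points, columns the 25 replicates.
    med_curve <- apply(pred_min, 1, median)
    q10_curve <- apply(pred_min, 1, quantile, probs = 0.10)
    q90_curve <- apply(pred_min, 1, quantile, probs = 0.90)
    # The 30 x 25 matrix of Manhattan distances between estimated and true
    # minimizers is summarized with the same three statistics.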
• 52. From here onwards, we refer to the EI proposed by Picheny et al. (2013) as the "Picheny method" and the EI proposed by Ranjan (2013) as the "Ranjan method". We hypothesize that the Ranjan method performs better than the Picheny method in locating the global minimum of one-dimensional noisy simulators with low noise (δ = 0.05) and high noise (δ = 0.2). The simulation results presented here are summarized on the basis of the following factors:
a) EI criterion: Ranjan method vs. Picheny method
b) EI optimization: GA vs. one-shot space filling design approach
c) Surrogate model: GP vs. BART
d) Noise level: low (δ = 0.05) vs. high (δ = 0.2).
5.1 First 1-dimensional Computer Simulator
Example 1: Suppose the underlying deterministic simulator outputs are generated using the one-dimensional test function
y(x) = sin(20πx + 5π)/(4x + 1) + (2x − 0.5)⁴,
where x ∈ [0, 1]. This test function is taken from Ranjan (2013). The simulator is plotted in Figure 5.1 with noisy realizations corresponding to δ = 0.05 (Figure 5.1(a)) and δ = 0.2 (Figure 5.1(b)). The global minimum of the underlying deterministic simulator is ymin = −0.868 at x = 0.025, and the range of the function is −0.868 ≤ y(x) ≤ 5.06 for x ∈ [0, 1]. 39
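For reference, a direct transcription of this test function and of the additive-noise mechanism described above is sketched below; treating σz as a fixed constant (here 1) is a simplification for illustration only.

    # Example 1 test function (Ranjan, 2013) and noisy simulator outputs
    y1 <- function(x) sin(20 * pi * x + 5 * pi) / (4 * x + 1) + (2 * x - 0.5)^4

    # Noisy version: y~(x) = y(x) + eps, eps ~ N(0, tau^2), with tau = delta * sigma_z
    noisy_y1 <- function(x, delta, sigma_z = 1) {
      y1(x) + rnorm(length(x), mean = 0, sd = delta * sigma_z)
    }

    x_obs <- runif(200)                      # e.g. a low-noise realization
    y_obs <- noisy_y1(x_obs, delta = 0.05)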
• 53. Each of Figures 5.2 − 5.5 presents four panels. Each panel shows the median, 10-th and 90-th quantile curves for the two improvement functions (I1 and I2). The green horizontal line denotes the deterministic simulator global minimum, ymin = −0.868, in each panel of Figures 5.2 and 5.4. In Figures 5.3 and 5.5, the green horizontal line at zero denotes zero distance from the true simulator minimizer. The four panels correspond to the combinations of the two models (GP and BART) with the two EI optimization strategies (GA and the space filling design approach).
[Figure 5.1: 1d simulator in Example 1, with panels (a) low noise: δ = 0.05 and (b) high noise: δ = 0.2. The black curve is the underlying deterministic simulator and the noisy outputs are shown in blue.]
Figure 5.2 shows the surrogate model minimum for low noise (δ = 0.05). Within every replicate, predictions are obtained by refitting the surrogate model after the addition of each point, so the median curves are not monotonically decreasing. The smaller values of the estimated minimum in the right panels indicate that BART performs better than the GP. Within the BART results, there is little difference between the Picheny method and the Ranjan method. On the other hand, the GP identifies the simulator global minimum only at the 10-th quantile of the replicates. The quantile curves for the GP are more spread out, showing that there is more variation with the GP. For both the GP and BART, the GA and the one-shot space filling design approach lead to comparable EI optima and hence comparable model fits.
Figure 5.3 presents the Manhattan distance between the true deterministic simulator minimizer and the fitted surrogate model minimizer for low noise. The same layout as Figure 5.2 is used. It shows that the 90-th quantile of the GP for the Ranjan method is closer to the simulator minimizer. 40
• 54. [Figure 5.2: 1d simulator in Example 1. Each panel shows the median, 10% and 90% quantiles of the surrogate model minimum for the Picheny and Ranjan methods, plotted against the total number of design points. Experimental settings are n0 = 15, nnew = 30 and δ = 0.05. The four panels show combinations of surrogate model (GP and BART) with EI optimizer (GA/one-shot space filling design approach). The horizontal lines show the simulator global minimum.] 41
• 55. [Figure 5.3: 1d simulator in Example 1. Each panel shows the median, 10% and 90% quantiles of the distance between the simulator minimizer and the surrogate model minimizer for the Picheny and Ranjan methods, plotted against the total number of design points. Experimental settings are n0 = 15, nnew = 30 and δ = 0.05. The four panels show combinations of surrogate model (GP and BART) with EI optimizer (GA/one-shot space filling design approach).] 42
• 56. [Figure 5.4: 1d simulator in Example 1. Surrogate model minimum results for δ = 0.2, plotted against the total number of design points. Figure layout is the same as Figure 5.2.]
But the median curve of the GP is basically flat. Thus, it appears that the sequential addition of points does not help the GP model find the global minimum. For BART, however, the median distance from the simulator minimizer decreases towards zero as more points are added, showing that BART becomes better at identifying the location of the simulator minimizer as the sequential design grows. The GA and the space filling design approach exhibit comparable performance.
Figure 5.4 presents a similar layout of the surrogate model minimum for high noise (δ = 0.2). It shows that the GP did not identify the simulator global minimum at all, whereas BART is successful in identifying the global minimum based on its median performance. 43
• 57. [Figure 5.5: 1d simulator in Example 1. Distance between the simulator minimizer and the surrogate model minimizer for δ = 0.2, plotted against the total number of design points. Figure layout is the same as Figure 5.3.] 44
• 58. For BART, the space filling design approach works better than the GA at identifying the global minimum. For the GP, there is little difference between the space filling design approach and the GA. In all cases, there is little difference between the Picheny and the Ranjan criteria.
Figure 5.5 presents a similar layout of the Manhattan distance between the underlying deterministic simulator minimizer and the surrogate model minimizer. Clearly the GP fails to recognize the location of the simulator minimizer every time, whereas the decreasing trend of the BART median towards the simulator minimizer shows that BART is better than the GP. The Picheny and the Ranjan improvement criteria perform equally well for optimizing this high noise simulator.
Of note here is the difficult nature of this particular optimization problem: the simulator has a number of local minima very close to the global minimum. Despite this difficulty, BART outperforms the GP and identifies the global minimum for both low and high noise. This leads us to conclude that BART is an effective model, especially in cases of abrupt changes, high noise and non-stationarity. There is little difference between the Picheny and the Ranjan method for this one-dimensional test function.
5.2 Second 1-dimensional Computer Simulator
Example 2: Suppose the underlying deterministic simulator outputs are generated using the simple one-dimensional test function (DiMatteo, Genovese & Kass, 2001):
y(x) = − sin(4x − 2) − 2 exp(−480(x − 0.5)²),
where x ∈ [0, 1]. The simulator is plotted in Figure 5.6 with noisy realizations corresponding to δ = 0.05 (Figure 5.6(a)) and δ = 0.2 (Figure 5.6(b)). The global minimum for the underlying deterministic simulator is ymin = −2.00 at x = 0.5, and the range of the function is such that −2.00 ≤ y(x) ≤ 0.99 for x ∈ [0, 1]. 45
• 59. The same experimental setup (25 replicates, low and high noise levels) as in Example 1 is used. Results are presented in Figures 5.7 − 5.10, following the same layout as Example 1.
[Figure 5.6: 1d simulator in Example 2, with panels (a) low noise: δ = 0.05 and (b) high noise: δ = 0.2. The black curve is the underlying deterministic simulator and the noisy outputs are shown in blue.]
Figure 5.7 presents the surrogate model minimum for low noise data. It clearly shows that both the GP and BART are successful in identifying the simulator minimum. There is little difference between the Picheny method and the Ranjan method. The quantile curves for the GP are more spread out, indicating more variation with the GP. For BART, the space filling design approach is better than the GA at identifying the simulator global minimum.
Figure 5.8 presents the Manhattan distance between the simulator minimizer and the surrogate model minimizer. Unlike the GP, BART places points much closer to the simulator minimizer right from the beginning of the sequential design procedure. Both the GP and BART identify the location of the simulator global minimum in a reasonable time. The GA and space filling design approaches have comparable performance in identifying the minimizer. 46
• 60. [Figure 5.7: 1d simulator in Example 2. Surrogate model minimum results for δ = 0.05, plotted against the total number of design points. Figure layout is the same as Figure 5.2.] 47
• 61. [Figure 5.8: 1d simulator in Example 2. Distance between the simulator minimizer and the surrogate model minimizer for δ = 0.05, plotted against the total number of design points. Figure layout is the same as Figure 5.3.] 48
• 62. [Figure 5.9: 1d simulator in Example 2. Surrogate model minimum results for δ = 0.2, plotted against the total number of design points. Figure layout is the same as Figure 5.2.] 49
• 63. [Figure 5.10: 1d simulator in Example 2. Distance between the simulator minimizer and the surrogate model minimizer for δ = 0.2, plotted against the total number of design points. Figure layout is the same as Figure 5.3.] 50
• 64. Figure 5.9 presents the surrogate model minimum for high noise data. The quantile curves are widely spread for both the GP and BART. The trend with the Ranjan method changes more quickly, identifying the simulator global minimum faster. At the 90-th quantile, the experiment with the Ranjan improvement function leads the sequential design closer to the simulator minimum for both the GP and BART. For BART, the space filling design approach works better than the GA for EI optimization.
Figure 5.10 presents the Manhattan distance between the simulator minimizer and the surrogate model minimizer for high noise. At the 90-th quantile for the GP, the Ranjan method places points much closer to the simulator global minimizer. Overall, the Ranjan method is clearly better than the Picheny method for both surrogate models (GP and BART) in optimizing this high noise one-dimensional simulator.
5.3 Summary
Both one-dimensional noisy simulators illustrated here are complex, with multiple local minima very close to the global minimum. BART performs well in both situations and acts as an effective engine for sequential design and optimization. In Example 2, the GP model appears to be an appropriate surrogate, but optimization with BART is still competitive. BART places points much closer to the simulator global minimum right from the beginning (clearly evident in Figure 5.8) and delivers the optimum quickly, with fewer runs of the simulator. Tables 5.1 and 5.2 summarize the key findings for both one-dimensional computer simulators. 51
• 65. Table 5.1: Summary of Results for Example 1
GP, low noise: fails to identify the simulator global minimum.
GP, high noise: fails to identify the simulator global minimum.
BART, low noise: identifies the simulator global minimum; BART performs better than the GP.
BART, high noise: identifies the simulator global minimum; the one-shot space filling design performs better than the GA; BART performs better than the GP. 52
• 66. Table 5.2: Summary of Results for Example 2
GP, low noise: identifies the simulator global minimum.
GP, high noise: identifies the simulator global minimum; the Ranjan method performs better than the Picheny method.
BART, low noise: identifies the simulator minimum; the one-shot space filling design performs better than the GA.
BART, high noise: identifies the simulator minimum; the Ranjan method performs better than the Picheny method; the one-shot space filling design performs better than the GA. 53
• 67. Chapter 6
Results: Higher Dimensional Simulators
In this chapter we consider two- and four-dimensional noisy computer simulators. The simulation results presented here are summarized on the basis of the following factors:
a) EI criterion: Ranjan method vs. Picheny method
b) EI optimization: GA vs. one-shot space filling design approach
c) Surrogate model: GP vs. BART
d) Noise level: low (δ = 0.05) vs. high (δ = 0.2).
As with the one-dimensional test functions in Chapter 5, we hypothesize that in higher dimensions the Ranjan method performs better than the Picheny method in identifying the global minimum of a noisy simulator. In each case we run 25 replicates of the experiment, each replicate looping over the addition of new points. We measure performance over the replicates by the median and quantiles of the estimated minimum and of the distance from the true simulator minimizer. In Example 3, an initial space-filling design of n0 = 20 points was used, and an additional 2n0 = 40 points were added, one at a time, in the sequential design using the Picheny and Ranjan criteria. In Example 4, an initial space-filling design of n0 = 40 points was used, and an additional 3n0 = 120 points were added. 54
• 68. 6.1 2-dimensional Computer Simulator
Example 3: Suppose the underlying deterministic simulator outputs are generated using the two-dimensional log-Goldprice function
y(x1, x2) = log[1 + (4x1 + 4x2 − 3)² (99 − 104x1 − 104x2 + 96x1x2 + 3(4x1 − 2)² + 3(4x2 − 2)²)] + log[30 + (8x1 − 12x2 + 2)² (12(4x1 − 2)² + 27(4x2 − 2)² + 160x1 + 480x2 − 576x1x2 − 158)].
The input x = (x1, x2) is defined on [0, 1]². Here again, we consider the simulator with low noise (δ = 0.05) and high noise (δ = 0.2), with δ as defined in Chapter 5. The global minimum for the underlying deterministic simulator is ymin = 1.504 at x = (0.5, 0.26), and 1.504 ≤ y(x) ≤ 13.830 for x ∈ [0, 1]². Figure 6.1 shows plots of the deterministic response y along x2 conditional on x1 (Figure 6.1(a)) and along x1 conditional on x2 (Figure 6.1(b)). Figure 6.2 shows the contour plot of the response y. 55
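A direct transcription of the log-Goldprice function, together with a contour plot in the spirit of Figure 6.2, is sketched below; the grid resolution is arbitrary.

    # Two-dimensional log-Goldprice test function (transcribed from the formula above)
    log_goldprice <- function(x1, x2) {
      log(1 + (4 * x1 + 4 * x2 - 3)^2 *
            (99 - 104 * x1 - 104 * x2 + 96 * x1 * x2 +
             3 * (4 * x1 - 2)^2 + 3 * (4 * x2 - 2)^2)) +
        log(30 + (8 * x1 - 12 * x2 + 2)^2 *
              (12 * (4 * x1 - 2)^2 + 27 * (4 * x2 - 2)^2 +
               160 * x1 + 480 * x2 - 576 * x1 * x2 - 158))
    }

    # Contour plot of the response over [0, 1]^2
    s <- seq(0, 1, length.out = 101)
    contour(s, s, outer(s, s, log_goldprice), xlab = "x1", ylab = "x2")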
• 69. [Figure 6.1: Conditional plots of the log-Goldprice function in Example 3: (a) the deterministic response y along x2 given x1, and (b) along x1 given x2.]
Each of Figures 6.3 − 6.6 shows results based on the estimated minimum and the distance from the true minimizer, following the same layout as in Chapter 5.
Figure 6.3 presents the surrogate model minimum for low noise data. For the GP model, the Ranjan method leads the experiment towards the global minimum faster than the Picheny method. For BART, on the median, there is a visible difference between the Ranjan and the Picheny methods; the Ranjan method is slightly better. Overall, the GP outperforms BART and reaches the simulator global minimum faster. The GA and space filling design approaches perform equally well.
Figure 6.4 presents the Manhattan distance between the simulator minimizer and the surrogate model minimizer for low noise. For the GP, the experiment with the Ranjan improvement function leads the sequential design much closer to the simulator minimum. For BART, the decreasing trend of the distance from the simulator minimizer shows that the Ranjan method also leads the sequential design much closer to the simulator minimizer; however, BART identifies the location of the simulator minimizer only at the 10-th quantile. Clearly the GP outperforms BART, with the Ranjan method outperforming the Picheny method.
Figure 6.5 presents the surrogate model minimum for high noise data. 56
• 70. [Figure 6.2: Contour plot of the log-Goldprice function in Example 3.]
The quantile curves are more spread out for the GP. The median lines for the GP with the Picheny improvement criterion seem to plateau well above the true minimum, whereas the corresponding lines with the Ranjan method continue to decrease towards the simulator global minimum. Compared to the low-noise case, it takes much longer here to find the simulator global minimum. BART identifies the simulator global minimum only at the 10-th quantile.
Figure 6.6 presents the Manhattan distance between the simulator minimizer and the surrogate model minimizer for high noise. At the median and 90-th quantile for the GP, the Ranjan method places points much closer to the simulator global minimizer than the Picheny method. BART with the Ranjan method also places points much closer to the simulator minimizer, but it identifies the correct location only at the 10-th quantile. Thus, for the optimization of this two-dimensional noisy simulator, the GP outperforms BART in the presence of either low or high noise. 57
• 71. [Figure 6.3: 2d simulator in Example 3. Surrogate model minimum results with n0 = 20, nnew = 40 and δ = 0.05, plotted against the total number of design points. Figure layout is the same as Figure 5.2.] 58
• 72. [Figure 6.4: 2d simulator in Example 3. Distance between the simulator minimizer and the surrogate model minimizer with n0 = 20, nnew = 40 and δ = 0.05, plotted against the total number of design points. Figure layout is the same as Figure 5.3.] 59
• 73. [Figure 6.5: 2d simulator in Example 3. Surrogate model minimum results for δ = 0.2, plotted against the total number of design points. Figure layout is the same as Figure 5.2.] 60
• 74. [Figure 6.6: 2d simulator in Example 3. Distance between the simulator minimizer and the surrogate model minimizer for δ = 0.2, plotted against the total number of design points. Figure layout is the same as Figure 5.3.] 61
• 75. 6.2 4-dimensional Computer Simulator
Example 4: Suppose the computer outputs are generated using the four-dimensional test function
y(x) = Σ_{i=1}^{4} [− sin(4xi − 2) − 2 exp(−480(xi − 0.5)²)],
where x = (x1, x2, x3, x4) ∈ [0, 1]⁴. Each of the four terms in this test function is the one-dimensional function from Example 2 applied to one coordinate, and thus the function has a unique global minimum ymin = −8 at x = (0.5, 0.5, 0.5, 0.5), with −8 ≤ y(x) ≤ 3.96 for x ∈ [0, 1]⁴. Here again, we consider noisy output with low noise (δ = 0.05) and high noise (δ = 0.2). Figure 5.6 plots one term of the underlying one-dimensional function. The one-dimensional function has a global minimum ymin = −2 at x = 0.5, which is a narrow spike, and a local minimum of y = −1 in the vicinity of x = 0.9. This local minimum is much easier to find than the global minimum because the curvature of y is smaller near the local minimum. Detecting the global minimum in the four-dimensional space requires at least a few design points in [0.4, 0.6]⁴, that is, in 0.16% of the total volume. Without such points, the surrogate models can be misled by the overall shape of the function (i.e., excluding the spike). The number of points required for a space filling design to have a point close to the global optimum grows as the q-th power in dimension q, so it becomes computationally challenging to find the optimal point by evaluating the function at so many candidates. Hence, in four dimensions it is no longer practical to use a space filling design to search for the minimum or to find the EI-optimal point; only the GA approach is considered. The sequential design procedure is initialized with n0 = 40 points, which is less than the n0 = 15 · 4 = 60 suggested by scaling up the one-dimensional examples, because with n0 = 60 the sequential design problem would be too easy for BART. 62
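For reference, the sketch below transcribes this four-dimensional test function; it simply applies the Example 2 function to each coordinate and sums the four terms.

    # Four-dimensional test function of Example 4: sum over the four coordinates
    # of the one-dimensional Example 2 function
    y4 <- function(x) {
      x <- matrix(x, ncol = 4)        # accepts a single point or an n x 4 matrix
      rowSums(-sin(4 * x - 2) - 2 * exp(-480 * (x - 0.5)^2))
    }

    y4(c(0.5, 0.5, 0.5, 0.5))         # global minimum, -8
    y4(rep(0.9, 4))                   # near the easier local minimum, about -4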
• 76. [Figure 6.7: 4d simulator in Example 4. Surrogate model minimum results with n0 = 40, nnew = 120 and δ = 0.05, plotted against the total number of design points. Figure layout is the same as Figure 5.2.]
[Figure 6.8: 4d simulator in Example 4. Distance between the simulator minimizer and the surrogate model minimizer with n0 = 40, nnew = 120 and δ = 0.05, plotted against the total number of design points. Figure layout is the same as Figure 5.3.] 63
• 77. A smaller initial design set seems to be an appropriate choice to challenge the procedure. Then nnew = 120 follow-up points are sequentially added, and the discovery of the spiky region is left to the follow-up runs. To ensure evaluation of the EI criterion and the surrogate model in the spiky region, the GA is run with a population size of 800 and 10 generations. The GA performs a batch analysis at every generation step: it evaluates the fitness value for a batch of points, chooses some good points for the next generation, and then evaluates the fitness function on another batch of points, including the points selected from the previous generation, at the next generation step. This is time-efficient and reliable. The batch analysis is possible with the MPI implementation of BART, since it stores the trees and hence enables prediction at new locations without refitting the model. This is unlike the "BayesTree" package, which generates predictions while running MCMC, so predictions at new locations would require refitting the model.
As in previous examples, Figure 6.7 presents the surrogate model minimum for low noise data. The GP fails to identify the simulator global minimum. On the median, the Ranjan method leads the sequential design much closer to the simulator global minimum, while the Picheny method is flat after the addition of about 40 points in the sequential design procedure and fails to identify the simulator global minimum. BART is an effective choice for this simulator, as every replicate brings the procedure much closer to the simulator global minimum than does the GP.
Figure 6.8 presents the Manhattan distance between the simulator minimizer and the surrogate model minimizer for low noise. For the GP, on the median, the Ranjan method places points much closer to the simulator global minimizer. BART has an overall decreasing trend in the distance from the simulator minimizer, showing that it identifies the location of the simulator minimizer in every replicate, and does so faster. The GP has failed to identify the global minimum of this high-dimensional simulator in the presence of noise. 64
• 78. This is because the deterministic simulator function changes very rapidly near the global minimum, and the stationarity assumption of the GP makes it difficult to model such behavior. Figure 6.9 shows the GP predicted response y plotted against each coordinate of the input x for the four-dimensional computer simulator in the presence of low noise. The GP predictions are based on the final step of the sequential design, with 160 training points. The figure clearly shows that the GP model finds the simulator global minimum in only 2 of the 4 coordinates (x1 and x4); in the other 2 coordinates it finds the local minimum. Because of the smoothness constraint in GP data modeling, the model fails to get all four coordinates right and thus fails to recognize the global minimum.
Figure 6.10 presents the surrogate model minimum for high noise. The GP clearly fails to recognize the simulator global minimum in the presence of high noise. On the other hand, BART is effective in identifying the simulator global minimum. On the median, the decreasing trend of the BART curves confirms that BART is able to guide the sequential design much closer to the simulator global minimum. At the 90-th quantile, the Ranjan method works a bit better than the Picheny method.
Figure 6.11 presents the Manhattan distance between the simulator minimizer and the surrogate model minimizer in the presence of high noise. The GP predictions are based on the final step of the sequential design, with 160 training points. For BART, the decreasing trend of the distance from the true minimizer location shows that it is successful. The Ranjan method, based on the performance of the median and 90-th quantile, works better than the Picheny method.
Figure 6.12 shows the GP predicted response y plotted against each coordinate of the input x for the four-dimensional computer simulator in the presence of high noise. It shows that the GP model identifies the simulator global minimizer coordinate in only one (x2) of the four coordinates. Hence the GP model is not an appropriate 65
• 79. [Figure 6.9: 4d simulator in Example 4. Plot of the response y vs. each coordinate of the input x for the GP model, based on the sequential design procedure with n0 = 40, nnew = 120 and δ = 0.05.]
[Figure 6.10: 4d simulator in Example 4. Surrogate model minimum results for δ = 0.2. Figure layout is the same as Figure 5.2.] 66