We approach the screening problem, i.e. detecting which inputs of a computer model significantly impact the output, from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affects the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems, normally seen as unrelated, have challenging connections here, since the priors proposed in the literature are specifically designed to have posterior modes on the boundary of the parameter space, hence precluding the application of approximate integration techniques based on, e.g., Laplace approximations. We explore several ways of circumventing this difficulty, comparing the different methodologies on synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
MUMS: Bayesian, Fiducial, and Frequentist Conference - Model Selection in the Context of Computer Models, Rui Paulo, April 30, 2019
1. Variable selection in the context of computer models
MUMS Foundations of Model Uncertainty
Gonzalo Garcia-Donato (1) and Rui Paulo (2)
(1) Universidad de Castilla-La Mancha (Spain), (2) Universidade de Lisboa (Portugal)
Duke University, BFF conference, April 30, 2019
2. Screening
Computer model
Let $y(\cdot)$ denote the output of the computer model with generic input $x \in S \subset \mathbb{R}^p$.
Screening
The question we want to answer is: which are the inputs that significantly impact the output? These are the so-called active inputs; the other inputs are called inert.
3. Gaussian process prior
We place a Gaussian process prior on $y(\cdot)$:
$$y(\cdot) \mid \theta, \sigma^2 \sim GP(0, \sigma^2 c(\cdot, \cdot))$$
with correlation function
$$c(x, x') = \prod_{k=1}^{p} c_k(x_k, x_k') .$$
We will assume that $c_k$ is a one-dimensional correlation function with fixed (known) roughness parameter and unknown range parameter $\gamma_k > 0$. We adopt the Matérn 5/2 correlation function: with $d_k = |x_k - x_k'|$,
$$c_k(x_k, x_k') = c_k(d_k) = \left( 1 + \frac{\sqrt{5}\, d_k}{\gamma_k} + \frac{5 d_k^2}{3 \gamma_k^2} \right) \exp\left( -\frac{\sqrt{5}\, d_k}{\gamma_k} \right)$$
so that $\theta = (\gamma_1, \ldots, \gamma_p)$.
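For concreteness, a minimal R sketch of this correlation function (the function name matern52 is ours; distances are assumed already computed on the scale of the inputs):

    matern52 <- function(d, gamma) {
      # Matern 5/2 correlation at distance d = |x_k - x_k'|, range gamma > 0
      u <- sqrt(5) * d / gamma
      (1 + u + u^2 / 3) * exp(-u)
    }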
4. Our approach to screening
The information to address the screening question comes in the form of model data: a design $D = \{x_1, \ldots, x_n\}$ is selected, the computer model is run at each of these configurations, and we define
$$y = (y(x_1), \ldots, y(x_n)) .$$
Consider all $2^p$ models for $y$ that result from allowing only a subset of the $p$ inputs to be active:
Let $\delta = (\delta_1, \ldots, \delta_p) \in \{0, 1\}^p$ identify each of the subsets
If $\delta \neq 0$, $M_\delta : y \mid \sigma^2, \theta \sim N(0, \sigma^2 R_\delta)$ where
$$R_\delta = \left[ \prod_{k : \delta_k = 1} c_k(x_{ki}, x_{kj}) \right]_{i,j = 1, \ldots, n}$$
If $\delta = 0$, $M_\delta : y \mid \sigma^2 \sim N(0, \sigma^2 I)$
Screening is now tantamount to assessing the support that $y$ lends to each of the $M_\delta$, i.e. a model selection exercise.
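As an illustration, a sketch of $R_\delta$ in R, building on the hypothetical matern52() above (all names are ours):

    corr_matrix <- function(X, delta, gamma) {
      # X: n x p design matrix; delta: 0/1 inclusion vector; gamma: range parameters
      R <- matrix(1, nrow(X), nrow(X))
      for (k in which(delta == 1)) {
        dk <- abs(outer(X[, k], X[, k], "-"))  # pairwise distances in input k
        R <- R * matern52(dk, gamma[k])        # product over active inputs only
      }
      R
    }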
5. Parametrizations
In multiple linear regression, the $2^p$ models are formally obtained by setting to zero subsets of the vector of regression coefficients of the full model.
Here, it depends on the parametrization used for the correlation function:
$\gamma_k$, the range parameter: $x_k$ is inert $\Leftrightarrow \gamma_k \to +\infty$
$\beta_k = 1/\gamma_k$, the inverse range parameter: $x_k$ is inert $\Leftrightarrow \beta_k = 0$
$\xi_k = \ln \beta_k$, the log-inverse range parameter: $x_k$ is inert $\Leftrightarrow \xi_k \to -\infty$
These parametrizations are free of the roughness parameter (Gu et al. 2018)
6. Bayes factors and posterior inclusion probabilities
Our answer to the screening question is then obtained by considering the marginal likelihoods
$$m(y \mid \delta) \equiv \int N(y \mid 0, \sigma^2 R_\delta)\, \pi(\theta, \sigma^2 \mid \delta)\, d\sigma^2\, d\theta$$
which allow us to compute Bayes factors and, paired with prior model probabilities, posterior model probabilities $p(\delta \mid y)$.
Marginal posterior inclusion probabilities are particularly interesting in this context:
$$p(x_k \mid y) = \sum_{\delta : \delta_k = 1} p(\delta \mid y)$$
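Once all $2^p$ posterior model probabilities are available, the inclusion probabilities are simple sums; a minimal R sketch (function and variable names are ours):

    # Delta: 2^p x p matrix of model indicators, enumerable with, e.g.,
    #   Delta <- as.matrix(expand.grid(rep(list(0:1), p)))
    # post_prob: vector of p(delta | y), in the same row order
    inclusion_probs <- function(post_prob, Delta) {
      as.vector(crossprod(Delta, post_prob))  # p(x_k | y): sum over models with delta_k = 1
    }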
7. Difficulties
Conceptual: selecting $\pi(\theta, \sigma^2 \mid \delta)$
Computational: computing the integral in $m(y \mid \delta)$
If $p$ is large, enumeration of all $2^p$ models is not practical, so obtaining $p(x_k \mid y)$ is problematic
8. Priors for Gaussian processes
Berger, De Oliveira and Sansó, 2001, JASA is a seminal paper
$c(x, x') = c(\|x - x'\|, \theta)$ with $\theta$ a unidimensional range parameter
Focus on spatial statistics
Some of the commonly used priors give rise to improper posteriors
Reference prior is derived and posterior propriety is proved
$\pi(\theta, \sigma^2) = \pi(\theta)/\sigma^2$, with $\pi(\theta)$ proper as long as the mean of the GP has a constant term
Paulo, 2005, AoS
Focus on emulation of computer models
$c(x, x') = \prod_k c_k(|x_k - x_k'|, \theta_k)$
Reference prior is obtained and posterior propriety established when $D$ is a Cartesian product
Many extensions are considered, e.g. to include a nugget, to focus on spatial models, etc.
Gu, Wang and Berger, 2018, AoS focuses on robust emulation
Gu 2018, BA, the jointly robust prior
9. Priors for Gaussian processes
The reference prior is proper for many correlation functions as long as there is an unknown mean, but it changes very little if the GP has zero mean
It is hence very appealing for variable selection; however,
the normalizing constant is unknown (and we need it)
it is computationally very costly
Gu (2018) jointly robust prior:
$$\pi(\beta) = C_0 \left( \sum_{k=1}^{p} C_k \beta_k \right)^a \exp\left( -b \sum_{k=1}^{p} C_k \beta_k \right)$$
where $C_0$ is known; this prior mimics the tail behavior of the reference prior
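A sketch of this prior on the log scale, up to the known constant $C_0$ (hyperparameters $a$, $b$ and $C_k$ as in Gu, 2018; the function name is ours):

    log_jr_prior <- function(beta, Ck, a, b) {
      s <- sum(Ck * beta)   # sum_k C_k beta_k
      a * log(s) - b * s    # log of s^a exp(-b s), dropping log C0
    }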
This talk is about using the jointly robust prior to answer the
screening problem
10. Robust emulation
Emulation: using $y(\cdot) \mid y$ in lieu of the actual computer model
Obtaining the posterior $\pi(\theta \mid y)$ via MCMC and using it to get to $y(\cdot) \mid y$ is not always practical
The alternative is to compute an estimate of $\theta$ and plug it into the conditional posterior predictive of $y(\cdot)$
Problems arise because often
a) $\hat{R} \approx I$: prediction reverts to the mean, becoming an impulse function close to $y$
b) $\hat{R} \approx \mathbf{1}\mathbf{1}^T$: numerical instability increases
Gu et al. (2018) term this phenomenon lack of robustness; because of the product nature of the correlation function, it happens when
a) $\hat{\gamma}_k \to 0$ for at least one $k$
b) $\hat{\gamma}_k \to +\infty$ for all $k$
11. Robust emulation (cont.)
Gu et al. (2018) show that if the $\hat{\gamma}_k$ are computed by maximizing
$$f(y \mid \gamma)\, \pi(\gamma)$$
where $\pi(\gamma, \sigma^2, \mu) \propto \pi(\gamma)/\sigma^2$ is the reference prior, then robustness is guaranteed
Remarks:
Since $\gamma_k \to +\infty$ means that $x_k$ is inert, the reference prior will not be appropriate for detecting, through the marginal posterior modes, the situation where all the inputs are inert
Nor will the reference prior detect the situation where some of the inputs are inert (by looking at marginal posterior modes)
The mode is not invariant with respect to reparametrization, hence the parametrization matters
Gu et al. (2018) discourage the use of $\beta_k$
12. Laplace approximation
The jointly robust prior does not allow for a closed-form expression for $m(y \mid \delta)$
There is a large literature on computing (ratios of) normalizing constants, often relying on MCMC samples from the posterior
Our goal was to explore the possibility of using the Laplace approximation to compute $m(y \mid \delta)$, given that code is available to obtain $\hat{\beta}_k$, $k = 1, \ldots, p$ (the R package RobustGaSP [Gu, Palomo and Berger, 2018]), and these estimates possess nice properties
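A brief usage sketch; the argument and slot names below follow our reading of the RobustGaSP documentation and should be checked against it:

    library(RobustGaSP)
    fit <- rgasp(design = X, response = y)  # X: n x p design, y: n model runs
    beta_hat <- fit@beta_hat                # robust estimates of beta_1, ..., beta_p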
13. Laplace approximation (cont.)
For each model indexed by $\delta \neq 0$,
$$m(y) \propto \int \exp[h(\beta)]\, d\beta$$
where, with $\pi(\beta)$ denoting the JR prior,
$$h(\beta) = \ln L(\beta \mid y) + \ln \pi(\beta)$$
and
$$L(\beta \mid y) \propto (y^T R^{-1} y)^{-n/2}\, |R|^{-1/2}$$
14. Laplace approximation (and BIC)
Having obtained $\hat{\beta} = \arg\max h(\beta)$, expand $h(\beta)$ around that mode to obtain (up to a constant)
$$m(y) \approx (2\pi)^{k/2} \exp[h(\hat{\beta})]\, |H|^{-1/2}, \qquad H = -\frac{\partial^2 h}{\partial \beta\, \partial \beta^T}(\hat{\beta})$$
where $k = \mathbf{1}^T \delta$ is the number of inputs in the model indexed by $\delta \neq 0$.
One can obtain explicit formulae for all these quantities (except for $\hat{\beta}$!) as functions of
$$R^{-1}, \quad \frac{\partial R}{\partial \beta_j}, \quad \frac{\partial^2 R}{\partial \beta_i\, \partial \beta_j}$$
One also has all the quantities needed to compute BIC-based posterior marginals:
$$m(y) \approx n^{-k/2} L(\hat{\beta})$$
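A sketch of both approximations in R, assuming the log-posterior kernel h() and the log-likelihood log_lik() have been coded elsewhere and that beta_start and n are defined; here the Hessian is numerical, via optim(), rather than via the explicit formulae:

    # minimize -h so that opt$hessian is H = -d2h/(dbeta dbeta^T) at the mode
    opt <- optim(beta_start, function(b) -h(b), method = "L-BFGS-B",
                 lower = 1e-10, hessian = TRUE)
    k <- length(beta_start)
    log_m_laplace <- (k / 2) * log(2 * pi) - opt$value -
      0.5 * as.numeric(determinant(opt$hessian, logarithm = TRUE)$modulus)
    log_m_bic <- -(k / 2) * log(n) + log_lik(opt$par)  # log of n^(-k/2) L(beta_hat)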
15. Testbeds
Linkletter et al. 2006, Technometrics deals with variable selection in the context of computer models.
They consider several examples:
Sinusoidal: $y(x_1, \ldots, x_{10}) = \sin(x_1) + \sin(5 x_2)$
Simple linear model: $y(x_1, \ldots, x_{10}) = 0.2 x_1 + 0.2 x_2 + 0.2 x_3 + 0.2 x_4$
Decreasing impact: $y(x_1, \ldots, x_{10}) = \sum_{i=1}^{8} (0.2/2^{i-1})\, x_i$
No signal: $y(x_1, \ldots, x_{10}) = 0$
The model data are obtained with a 54-run maximin Latin hypercube design in $[0, 1]^{10}$, and iid $N(0, 0.05^2)$ noise is added to the model runs.
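These data are easy to recreate; a sketch for the sinusoidal example, using the lhs package for the maximin design (the seed is arbitrary):

    library(lhs)
    set.seed(1)
    X <- maximinLHS(54, 10)          # 54-run maximin LHD in [0, 1]^10
    y <- sin(X[, 1]) + sin(5 * X[, 2]) + rnorm(54, sd = 0.05)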
16. Sinusoidal function — initial results
Upon trying to obtain the Laplace approximation for all $2^p - 1 = 1023$ models, we encountered models where the mode had entries at the boundary: $\hat{\beta}_k = 0$
Those modes were of two types, $\partial h/\partial \beta_k < 0$ or $\partial h/\partial \beta_k > 0$ at the boundary, but in both cases with a non-zero gradient
This precludes the use of the usual Laplace approximation, as
the mode must be an interior point of the parameter space
17. Variable selection: two approaches
There are mainly two approaches to variable selection:
Estimation-based: the full model is assumed to be true and the prior (or penalty) encourages sparsity; a criterion is established for determining whether a variable is included or not
Model selection-based, in which formally all possible models are considered; this is what we are trying to accomplish here
The issue of multiple comparisons is present in both, but the priors on the parameters are, one might say, of a different nature
The jointly robust prior differs from the reference prior in that it can detect the case where some of the inputs are inert, whereas the reference prior cannot
This is what we are observing: the jointly robust prior encourages sparsity
18. The reference prior revisited...
The motivation for the JR prior is that it matches the "exponential and polynomial tail decaying rates of the reference prior"
Exponential decay when $\gamma \to 0$ prevents $R$ from being nearly diagonal
Polynomial decay when $\gamma \to +\infty$ allows the likelihood to come into play; large values usually produce better fit
Which polynomial? If $\gamma_i \to +\infty$ for $i \in E \subset \{x_1, \ldots, x_p\}$ and the other $\gamma_i$ are bounded,
$$\pi^R(\gamma) \propto \prod_{i \in E} \gamma_i^{-3} \quad \text{or} \quad \pi^R(\beta) \propto \prod_{i \in E} \beta_i$$
This motivates considering the modified prior
$$\pi^{JR\star}(\beta) \propto \prod_{i=1}^{p} \beta_i \; \pi^{JR}(\beta)$$
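In code, this is a one-line modification of the earlier sketch (log_jr_prior() as defined above):

    log_jr_star <- function(beta, Ck, a, b) {
      sum(log(beta)) + log_jr_prior(beta, Ck, a, b)  # extra prod_i beta_i factor
    }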
19. Sinusoidal example revisited...
This prior pushes the $\beta_k$ away from zero, so no modes at the boundary were found
Some of the modes were pushed away from zero by an order of magnitude, so the corresponding inputs were deemed more relevant
The marginal posteriors of the parameter vectors that had some components of the mode at zero did not seem to change much
The marginal posteriors of models with some support did not seem to change much either
20. Sinusoidal example results
Recall that only x1 and x2 are active; the prior on the model space
was constant.
With our modification of the jointly robust prior, these were the
results obtained:
                x1   x2   x3      x4      x5      x6      x7      x8      x9      x10
    p(xk | y)   1    1    0.0808  0.0358  0.5720  0.1917  0.2891  0.7042  0.8249  0.0019

The true model was ranked in 44th place, with posterior model probability 0.00011.

     δ    p(δ | y)   x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
    404   0.3235     1   1   0   0   1   0   0   1   1   0
    452   0.2567     1   1   0   0   0   0   1   1   1   0
     52   0.1442     1   1   0   0   1   1   0   0   0   0
    408   0.0510     1   1   1   0   1   0   0   1   1   0
    292   0.0422     1   1   0   0   0   1   0   0   1   0
    260   0.0376     1   1   0   0   0   0   0   0   1   0
21. The modified prior
The culprit is not the inaccuracy of the Laplace approximation, nor correlations in the design, nor the prior or its hyperparameters: it's the model!
We need a nugget: for $\delta \neq 0$,
$$M_\delta : y \mid \sigma^2, \sigma_0^2, \theta \sim N(0, \sigma^2 R_\delta + \sigma_0^2 I)$$
The GP prior implies that the emulator is an interpolator at the observed data; that no longer happens when we remove inputs from the correlation matrix, hence defining an inadequate model...
22. Adding the nugget
By introducing $\eta = \sigma_0^2/\sigma^2$, the reference prior is now
$$\pi(\beta, \eta, \sigma^2) \propto \frac{\pi(\beta, \eta)}{\sigma^2}$$
and $\sigma^2$ can be integrated out
The jointly robust prior is
$$\pi(\beta, \eta) = C_0 \left( \sum_{k=1}^{p} C_k \beta_k + \eta \right)^a \exp\left( -b \left( \sum_{k=1}^{p} C_k \beta_k + \eta \right) \right)$$
As for the tail of the reference prior, when $\gamma_i \to +\infty$ for $i \in E \subset \{x_1, \ldots, x_p\}$, $\eta \to 0$, and the other $\gamma_i$ are bounded,
$$\pi^R(\gamma, \eta) \propto \prod_{i \in E} \gamma_i^{-3} \quad \text{or} \quad \pi^R(\beta, \eta) \propto \prod_{i \in E} \beta_i$$
i.e., there is no penalization in $\eta$
23. The modified prior
For robustness, Gu et al. (2018) recommend the parametrization in terms of $(\ln \beta, \ln \eta) = (\xi, \tau)$, so we set
$$\pi(\beta, \eta) \propto \prod_{k=1}^{p} \beta_k \left( \sum_{k=1}^{p} C_k \beta_k + \eta \right)^a \exp\left( -b \left( \sum_{k=1}^{p} C_k \beta_k + \eta \right) \right)$$
but do all the calculations (mode and Laplace approximation) in the $(\xi, \tau)$ parametrization
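A sketch of the resulting log target in the $(\xi, \tau)$ scale; loglik_nugget(), the integrated log-likelihood with nugget, and the starting value par0 are assumed coded elsewhere, and note the Jacobian of the log transform:

    log_post <- function(par, Ck, a, b) {
      p <- length(par) - 1
      beta <- exp(par[1:p]); eta <- exp(par[p + 1])  # (xi, tau) -> (beta, eta)
      s <- sum(Ck * beta) + eta
      loglik_nugget(beta, eta) +
        sum(log(beta)) + a * log(s) - b * s +  # modified JR prior, log scale
        sum(log(beta)) + log(eta)              # Jacobian of the log transform
    }
    opt <- optim(par0, log_post, Ck = Ck, a = a, b = b,
                 control = list(fnscale = -1), hessian = TRUE)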
We next show, for the four examples ($M = 250$ simulations):
the Laplace-based inclusion probabilities using a flat prior on the model space
the Laplace-based inclusion probabilities using a Scott and Berger (2010) prior on the model space
the BIC-based inclusion probabilities using a flat prior on the model space
24. [Figure: "Flat vs Scott and Berger": prior model probabilities P(M_p) plotted against p, from 0 to 10, for the two priors on the model space.]
29. Work in progress
In the simulations, we find models for which $H$ has negative entries on the diagonal. Upon inspection, the marginals seem well behaved, so this is numerical instability; we need more stable ways of computing $H$
Investigate the possibility of using this method for variable selection in the bias function
Produce an R package
30. Summary
A fully automatic, formal Bayesian variable selection approach to screening which is computationally very simple
Based on a correlation function that possesses attractive properties and on a prior which mimics the behavior of the reference prior
Need for a nugget when considering all $2^p$ models
BIC behaves quite well and does not require $H$
Thank you for your attention!