Lecture 4: Conventional Priors for
Variable Selection in the Linear Model
Outline
• Notation for the linear model and Bayes model selection
• Choice of prior probabilities of models
• G-priors and Zellner-Siow conventional priors for linear models
• Example: determining the distance from Cepheid stars.
• The Robust conventional prior
• The R package BayesVarSel
• Selecting a single model for prediction: the median probability
model
Notation for the normal linear model
• The full model: observe independent Y1, Y2, . . . , Yn, where
  Yi = [x0,i1 β01 + · · · + x0,ik0 β0k0] + xi1 β1 + · · · + xip βp + εi ,
  – the x0,ij and xij being given covariates;
  – the β’s being unknown;
  – the εi being independent N(0, σ²) errors, with σ² unknown.
• Defining Y = (Y1, . . . , Yn)^t, X0 as the n × k0 matrix with elements x0,ij, X as the n × p matrix with elements xij, β0 = (β01, . . . , β0k0)^t, and β = (β1, . . . , βp)^t, this model can be written
  MF : Y ∼ Nn(X0 β0 + X β, σ² I) .
• The simplest model is assumed to be
M0 : Y ∼ Nn(X0 β0, σ² I) ;
with X0 consisting of the covariates that are to be included in all
models (e.g., the intercept in ordinary linear regression).
• Between M0 and MF are 2^p − 2 other models Mi, each additionally including a non-null subset of ki of the remaining p covariates:
  Mi : Y ∼ Nn(X0 β0 + Xi βi, σ² I) ;
Xi is the n × ki matrix consisting of the chosen covariates, i.e., the
chosen columns of X; the corresponding vector of unknown
parameters is denoted βi.
• (β0, σ²) are the common parameters in all models,
• πi(β0, βi, σ²) is the prior distribution of the parameters in Mi,
• Pr(Mi) is the prior probability of model Mi.
Bayes model selection
• Is based on posterior probabilities for each model:
  Pr(Mi | y) = mi(y) Pr(Mi) / Σ_{j=1}^{N} mj(y) Pr(Mj) = [ 1 + Σ_{j≠i} πji Bji ]^{−1}
• πji is the prior odds Pr(Mj)/Pr(Mi)
• Bji is the Bayes factor mj(y)/mi(y), where
  mk(y) = ∫ fk(y | β0, βk, σ²) πk(β0, βk, σ²) dβ0 dβk dσ² ,
  with fk(y | β0, βk, σ²) denoting the normal likelihood for Mk.
mk(y) is the (integrated) likelihood of Mk, quantifying how
likely the observed data is under that model.
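A minimal numerical sketch (Python, with hypothetical marginal likelihoods and prior probabilities) of the identity above: the posterior probabilities are renormalized products mi(y) Pr(Mi), and the same numbers follow from the prior-odds/Bayes-factor form.

```python
# A minimal sketch with hypothetical numbers: compute Pr(Mi | y) from
# marginal likelihoods m_i(y) and prior probabilities Pr(Mi).
import numpy as np

m = np.array([2.1e-4, 8.7e-4, 1.3e-5])   # hypothetical marginal likelihoods m_i(y)
prior = np.array([0.5, 0.25, 0.25])      # hypothetical prior probabilities Pr(Mi)

post = m * prior / np.sum(m * prior)     # Pr(Mi | y) by direct renormalization

# Equivalent form via prior odds and Bayes factors, here for model i = 0:
i = 0
B = m / m[i]                             # Bayes factors B_{ji}
prior_odds = prior / prior[i]            # prior odds pi_{ji}
post_i = 1.0 / np.sum(prior_odds * B)    # matches post[0]
print(post, post_i)
```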
Prior probabilities of models
Two common choices are
• Pr(Mi) = 2^{−p} (i.e., each of the 2^p models has equal prior probability);
• Pr(Mi) = 1 / [ (p + 1) (p choose ki) ] .
– This is equivalent to assigning the collection of all models having
k parameters (in addition to the common parameters) probability
1/(p + 1), and dividing this probability equally among all the
models of size k.
The first choice is bad because it does not adjust for the multiple
testing problem inherent in variable selection (see the next slide).
The second choice is excellent (it does adjust for multiple testing)
and should be used as the default assignment of prior probabilities.
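The sketch below (Python, using math.comb) evaluates both assignments for a hypothetical p and checks that each sums to one over all 2^p models.

```python
# A minimal sketch: the two prior assignments above, for a model Mi with
# ki of the p candidate covariates.
from math import comb

def equal_prior(p, ki):
    return 2.0 ** (-p)                        # every model equally likely

def multiplicity_adjusted_prior(p, ki):
    return 1.0 / ((p + 1) * comb(p, ki))      # 1/(p+1) split equally within size ki

p = 10  # hypothetical number of candidate covariates
# both assignments sum to 1 over all 2^p models:
print(sum(comb(p, k) * equal_prior(p, k) for k in range(p + 1)))                  # 1.0
print(sum(comb(p, k) * multiplicity_adjusted_prior(p, k) for k in range(p + 1)))  # 1.0
```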
Equal model probabilities Bayes variable inclusion
Number of noise variables Number of noise variables
Signal 1 10 40 90 1 10 40 90
β1 : −1.08 .999 .999 .999 .999 .999 .999 .999 .999
β2 : −0.84 .999 .999 .999 .999 .999 .999 .999 .988
β3 : −0.74 .999 .999 .999 .999 .999 .999 .999 .998
β4 : −0.51 .977 .977 .999 .999 .991 .948 .710 .345
β5 : −0.30 .292 .289 .288 .127 .552 .248 .041 .008
β6 : +0.07 .259 .286 .055 .008 .519 .251 .039 .011
β7 : +0.18 .219 .248 .244 .275 .455 .216 .033 .009
β8 : +0.35 .773 .771 .994 .999 .896 .686 .307 .057
β9 : +0.41 .927 .912 .999 .999 .969 .861 .567 .222
β10 : +0.63 .995 .995 .999 .999 .996 .990 .921 .734
False Positives 0 2 5 10 0 1 0 0
(incl. prob. > 0.5)
Table 1: Posterior inclusion probabilities (i.e. the posterior probability that a
variable is in the model) for 10 real variables in a simulated data set.
Priors for the model parameters
There have been many efforts (over more than 30 years) to develop ‘objective model selection priors,’ πi(β0, βi, σ²), for the parameters of model Mi, including
• conventional priors (Jeffreys 1961; Zellner and Siow 1980)
• intrinsic priors (Berger and Pericchi 1996; Moreno et al. 1998)
• fractional priors (O’Hagan 1997)
• expected posterior priors (Pérez and Berger 2002),
• integral priors (Cano et al. 2008),
• divergence based priors (Bayarri and García-Donato 2008)
• Plus many many proposals for particular situations.
Here we only consider the conventional prior approach.
The g- and Zellner-Siow conventional priors for
variable selection
The Zellner g-prior for a model of size ki:
  π^g_i(β0, βi, σ²) = (1/σ²) × Normal_{ki}( βi | 0, n σ² (V_i^t V_i)^{−1} ) ,
with V_i = (In − X0 (X0^t X0)^{−1} X0^t) Xi . This prior is very popular because the computations are all closed form, but it has the bad flaw of being information inconsistent (i.e., when the F statistic between two models goes to ∞, the Bayes factor stays bounded).
The Zellner-Siow prior: As above, but with Normal replaced by
Cauchy. This is a great conventional prior but, unfortunately, the
resulting marginal likelihoods are not closed form.
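To see the information inconsistency concretely, the sketch below uses what I take to be the standard closed-form Bayes factor under the g-prior with g = n (an assumption, not a formula from these slides): B_{i0} = (1 + g)^{(n−k0−ki)/2} [1 + g Q_{i0}]^{−(n−k0)/2}, with Q_{i0} = SSE_i/SSE_0. As Q_{i0} → 0 (the F statistic → ∞), B_{i0} converges to the finite bound (1 + g)^{(n−k0−ki)/2} instead of diverging.

```python
# A sketch of information inconsistency under the g-prior (assuming the
# standard closed-form Bayes factor B_{i0} = (1+g)^{(n-k0-ki)/2} *
# (1 + g*Q_{i0})^{-(n-k0)/2}, with g = n and Q_{i0} = SSE_i/SSE_0).
def g_prior_bf(n, k0, ki, Q, g=None):
    g = n if g is None else g
    return (1 + g) ** ((n - k0 - ki) / 2) * (1 + g * Q) ** (-(n - k0) / 2)

n, k0, ki = 20, 1, 2
for Q in (0.5, 0.1, 0.01, 1e-6, 0.0):   # Q -> 0 means overwhelming evidence for Mi
    print(Q, g_prior_bf(n, k0, ki, Q))
# The Bayes factor converges to the finite bound (1+n)^((n-k0-ki)/2) ~ 1.7e11
# rather than diverging to infinity.
```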
Example: Bayesian Model Selection and Analysis for
Cepheid Star Oscillations
(Berger, Müller, Jefferys and Barnes (2001))
• The astronomical problem
• Bayesian model selection
• The data and likelihood function
• Choice of prior distributions
• Computation and results
The Astronomical Problem
• A Cepheid star pulsates, regularly varying its luminosity (light
output) and size.
• From the Doppler shift as the star surface pulsates, one can
compute surface velocities at certain phases of the star’s period,
thus learning how the radius of the star changes.
• From the luminosity and ‘color’ of the star, one can learn about
the angular radius of the star (the angle from Earth to opposite
edges of the star).
• Combining these allows estimation of s, the star’s distance.
Figure 1: The x’s give the radial velocity measurements at various phases of the star’s oscillation. The curve is a 5th-order trigonometric polynomial fit.
Curve Fitting
To determine the overall change in radius of the star over the star’s
period, the surface velocity must be estimated at phases other than
those actually observed, leading to a curve fitting problem (also for
luminosity). Difficulties:
• Observations have measurement error.
• Phases at which observations are made are unequally spaced.
• Number of possible models (curve fits) entertained varies
between 50 and thousands.
• Resulting models have from 10 to 1000 parameters.
The Data and Statistical Model
• Data:
– m observed radial velocities Ui, i = 1, . . . , m .
– n vectors of photometry data consisting of luminosity
Vi, i = 1, . . . , n, and color index Ci, i = 1, . . . , n.
• Specified standard deviations σ_Ui, σ_Vi, and σ_Ci; unknown adjustment factors are inserted, leading to variances σ²_Ui/τu, σ²_Vi/τv, σ²_Ci/τc.
• The statistical model for measurement error:
  Ui ∼ N(ui, σ²_Ui/τu), Vi ∼ N(vi, σ²_Vi/τv), Ci ∼ N(ci, σ²_Ci/τc),
where ui, vi, and ci denote the true unknown mean velocity,
luminosity, and color index.
Curve Fitting I: Fourier Analysis
• Model the true periodic velocity u, at phase φ, as a
trigonometric polynomial
u = u0 + Σ_{j=1}^{M} [ β1j cos(jφ) + β2j sin(jφ) ],
where u0 is the mean velocity of the star, M is the (unknown)
order of the trigonometric polynomial, and the β1j and β2j are
the unknown Fourier coefficients of the trigonometric polynomial.
• There is a similar polynomial model for the true luminosity v,
having unknown order N.
The resulting statistical models for the column vectors U and V of
observed radial velocities and luminosities are the linear models
U = u01 + Xuβu + εu and V = v01 + Xvβv + εv,
• where u0 and v0 are the (unknown) mean velocity and luminosity
and 1 is the column vector of ones;
• Xu and Xv are matrices of the trigonometric covariates (e.g.,
terms like sin(jφ));
• βu and βv are column vectors of the unknown Fourier
coefficients;
• εu and εv are independently multivariate normal errors: N(0, Gu/τu) and N(0, Gv/τv), where Gu and Gv are the known diagonal matrices of the variances σ²_Ui and σ²_Vi.
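A minimal sketch (Python, hypothetical phase values) of how a design matrix of trigonometric covariates such as Xu can be assembled for a given order M:

```python
# A minimal sketch: build the design matrix X_u of trigonometric covariates
# cos(j*phi), sin(j*phi), j = 1..M, at the observed phases (hypothetical values).
import numpy as np

def trig_design(phases, M):
    cols = []
    for j in range(1, M + 1):
        cols.append(np.cos(j * phases))
        cols.append(np.sin(j * phases))
    return np.column_stack(cols)            # shape (len(phases), 2*M)

phases = np.array([0.05, 0.22, 0.41, 0.58, 0.77, 0.93]) * 2 * np.pi  # hypothetical
Xu = trig_design(phases, M=5)               # order-5 fit, as in Figure 1
print(Xu.shape)                             # (6, 10)
```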
Color need not be modeled separately because it is related to
luminosity (v) and velocity (or change in radius) by
ci = −a[0.1vi + b + 0.5 log (φ0 + ∆ri/s)],
where a and b are known constants, φ0 and s are the angular size
and distance of the star, and ∆r, the change in radius corresponding
to phase φ, is given by
∆r = −g Σ_{j=1}^{M} (1/j) [ β1j sin(j(φ − ∆φ)) − β2j cos(j(φ − ∆φ)) ],
with ‘phase shift’ ∆φ and g a known constant.
Choice of Cepheid Prior Distributions
• The orders of the trigonometric polynomials, (M, N), are given a
uniform distribution up to some cut-off (e.g., (10, 10)).
• τu, τv, τc, which adjust the measurement standard errors, are
given the standard objective priors for ‘scale parameters,’ namely
the Jeffreys-rule priors p(τu) = 1/τu, p(τv) = 1/τv, and p(τc) = 1/τc.
• The mean velocity and luminosity, u0 and v0, are ‘location
parameters’ and so can be assigned the standard objective priors
p(u0) = 1 and p(v0) = 1.
• The angular diameter φ0 and the unknown phase shift ∆φ are
also assigned the objective priors p(∆φ) = 1 and p(φ0) = 1. It is
unclear if these are ‘optimal’ objective priors but the choice was
found to have negligible impact on the answers.
• The Fourier coefficients, βu and βv, occur in linear models, so
the Zellner-Siow priors can be utilized.
• The prior for distance s of the star should account for
– Lutz-Kelker bias: a uniform spatial distribution of Cepheid stars would yield a prior proportional to s².
– The distribution of Cepheids is flattened wrt the galactic plane; we use an exponential distribution.
– So, we use p(s) ∝ s² exp(−|s sin β|/z0),
∗ β being the known galactic latitude of the star (its angle
above the galactic plane),
∗ z0 being the ‘scale height,’ assigned a uniform prior over
the range z0 = 97 ± 7 parsecs.
Computation
(Not considered for the SAMSI MUMS course)
A reversible-jump MCMC algorithm of the type reviewed in
Dellaportas, Forster and Ntzoufras (2000) is used to move between
models and generate posterior distributions and estimates.
• The full conditional distributions for the variance and precision
parameters and hyperparameters are standard gamma and
inverse-gamma distributions and are sampled with Gibbs sampling.
• For ∆φ, φ0 and s, we employ a random-walk Metropolis algorithm
using, as the proposal distribution, a multivariate normal distribution
centered on the current values and with a covariance matrix found
from linearizing the problem for these three parameters.
• The Fourier coefficients βu and βv, as well as u0 and v0, are also
sampled via Metropolis. The natural proposal distributions are found
by combining the normal likelihoods with the normal part of the
Zellner-Siow priors, leading to conjugate normal posterior
distributions.
• Proposal for moves between models:
– A ‘burn-in’ portion of the MCMC with uniform
model proposals yielded initial posterior model
probabilities, which were then used as the proposal
for subsequent model moves.
– Simultaneously, new values were proposed for the Fourier coefficients.
• The proposal distributions listed above led to a well-mixed Markov
chain, so that only 10,000 iterations of the MCMC computation
needed to be performed.
[Figure: MCMC trace of the parallax of T Mon across the 10,000 iterations (x-axis: Trial; y-axis: Parallax).]
[Figure: T Mon: posterior probabilities of the velocity models, by model index (1–7).]
[Figure: T Mon: posterior probabilities of the V photometry models, by model index (1–7).]
[Figure: histogram of the posterior distribution of the parallax (arcsec, proportional to the inverse of distance) of T Mon, determined by model averaging over the various possible models for radial velocity and photometry.]
The ‘Robust’ conventional priors for variable selection (Bayarri, Berger, Forte and García-Donato, 2012; a generalization of proposals by Strawderman (1971, 1973) and Berger (1976, 1980, 1985)): Defining Σi = σ² (V_i^t V_i)^{−1},
  π^R_i(β0, βi, σ²) = (1/(2σ²)) ∫_0^1 N_{ki}( βi | 0, [ (1 + n)/(λ(k0 + ki)) − 1 ] Σi ) λ^{−1/2} dλ .
Although this prior is not closed form, it gives closed form marginal likelihoods, and closed form Bayes factors
  B_{i0} = [ (n + 1)/(ki + k0) ]^{−ki/2} · [ Q_{i0}^{−(n−k0)/2} / (ki + 1) ] · 2F1( (ki + 1)/2 ; (n − k0)/2 ; (ki + 3)/2 ; (1 − Q_{i0}^{−1})(ki + k0)/(1 + n) ) ,
where 2F1 is the standard (Gauss) hypergeometric function and Q_{i0} = SSE_i/SSE_0 is the ratio of the sums of squared errors of models Mi and M0.
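A minimal sketch (Python/SciPy) evaluating the closed-form Bayes factor displayed above from n, k0, ki and the two sums of squared errors; the numbers in the example call are hypothetical.

```python
# A minimal sketch: evaluate B_{i0} for the Robust prior using the closed form
# reproduced above, via the Gauss hypergeometric function 2F1.
from scipy.special import hyp2f1

def robust_bayes_factor(n, k0, ki, sse_i, sse_0):
    """Bayes factor B_{i0} of M_i against M_0 under the Robust prior."""
    Q = sse_i / sse_0                                  # Q_{i0} = SSE_i / SSE_0
    z = (1.0 - 1.0 / Q) * (ki + k0) / (1.0 + n)
    return ((n + 1.0) / (ki + k0)) ** (-ki / 2.0) \
        * Q ** (-(n - k0) / 2.0) / (ki + 1.0) \
        * hyp2f1((ki + 1.0) / 2.0, (n - k0) / 2.0, (ki + 3.0) / 2.0, z)

# e.g. n = 50 observations, intercept-only null (k0 = 1), ki = 3 extra covariates
print(robust_bayes_factor(n=50, k0=1, ki=3, sse_i=30.0, sse_0=100.0))
```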
R package BayesVarSel (García-Donato and Forte 12-12-12)
• freely available at CRAN (for sequential or parallel computation)
• programmed in C using GNU-gsl libraries
• priors allowed:
– prior.betas =
∗ “Robust”,
∗ “Liangetal” (not discussed in the lecture)
∗ “gZellner”
∗ “ZellnerSiow”.
– prior.models =
∗ “Constant” (Pr(Mi) = 1/[# models])
∗ “Jeffreys” (Pr(Mi) = 1 / [ (p + 1) (p choose ki) ])
Common Bayesian outputs given in the package
• The posterior inclusion probability (called Inclusion Probability) for variable i is
  pi = Σ_{l : variable i is in Ml} Pr(Ml | y),
  the overall posterior probability that variable i is in the model.
• The highest probability model (HPM).
• The median probability model (MPM) is the model consisting of those variables whose posterior inclusion probability is ≥ 1/2.
• The Bayesian model averaged predictor of y, at covariates x, is ŷ = x β̂, where β̂ = Σ_l Pr(Ml | y) β̂_l (called the Estimate in the package), with β̂_l being the posterior mean under model Ml.
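The sketch below (Python, with hypothetical posterior model probabilities and posterior means) computes the same four summaries: inclusion probabilities, the HPM, the MPM, and the model-averaged estimate.

```python
# A minimal sketch with hypothetical numbers: the main summaries computed from
# posterior model probabilities and within-model posterior means.
import numpy as np

p = 3  # number of candidate covariates
models = {                      # model (tuple of included variables) ->
    (1,):      (0.10, {1: 1.8}),                   # (P(Ml | y), posterior means)
    (1, 2):    (0.55, {1: 1.5, 2: -0.7}),
    (1, 2, 3): (0.25, {1: 1.4, 2: -0.6, 3: 0.1}),
    (2, 3):    (0.10, {2: -0.9, 3: 0.2}),
}

incl = {i: sum(pr for m, (pr, _) in models.items() if i in m) for i in range(1, p + 1)}
hpm  = max(models, key=lambda m: models[m][0])          # highest probability model
mpm  = tuple(i for i, q in incl.items() if q >= 0.5)    # median probability model
beta_bma = np.zeros(p)                                  # model-averaged estimate
for m, (pr, betas) in models.items():
    for i, b in betas.items():
        beta_bma[i - 1] += pr * b                       # zero for excluded variables

print(incl, hpm, mpm, beta_bma)
```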
Example: Infant obesity
• Understand which factors affect infant obesity
• Response variable: Body Mass Index (BMI ∝ weight/height²)
• Set of 16 explanatory covariates (such as ‘born weight’ and ‘born
height’) with 1002 observations.
• There are 2^16 = 65,536 models.
Figure 2: This page of the printout lists the ten most probable models,
indicating the variables that are in each model. The posterior probability
of each model is also given. Note that the highest probability model only
has probability 0.094; a small HPM probability is common when there are
so many models.
Selecting a single model for prediction:
The Median Probability Model
Context: Prediction with Normal linear models
• Under the full model, the n × 1 observation vector would follow
  y = Xβ + ε ,
  where X is the n × p design matrix, β is the p × 1 vector of unknown coefficients, and ε is N(0, σ² I).
• The possible submodels are
  Ml : y = Xl βl + ε ,
  where l = (l1, l2, . . . , lp) is the model index, li being either 1 or 0 as covariate xi is in or out of the model.
• Assume that one of these models is true, and our goal is to
predict a future observation.
Basics of Bayesian prediction
• The goal is to predict a future y* = x* β + ε, at covariate values x*, using squared error loss (y* − ŷ*)².
• Combining the data and prior yields, for all l,
  – P(Ml | y), the posterior probability of model Ml;
  – πl(βl, σ² | y), the posterior distribution of (βl, σ²).
• The best predictor of y* is, via model averaging,
  ȳ* = x* β̄ ≡ x* Σ_l P(Ml | y) β̄_l ,
  where β̄_l is the posterior mean for β under Ml. (This follows from the fact that, under squared error loss, the optimal Bayes estimate is the posterior mean, and the posterior mean of y* is ȳ*.)
Selecting a single model
• Often a single model Ml is desired for prediction, with the prediction then being ŷl = x* β̄_l.
• A common misperception is that the best single model is that with the largest P(Ml | y);
  – this is true if there are only two models;
  – this is true if X′X is diagonal, σ² is known, and suitable priors are used.
• The best single model will typically depend on x*.
• An important case is when the model will repeatedly be used to make future predictions, and the future covariates will be like the past covariates, in the sense that E[(x*)′ x*] = X′X. Then the averaged squared error predictive loss for future predictions, using Ml when Ml* is the true model, is (adding any needed zeroes to the β’s)
  E_{x*}(y* − ŷl)² = (β_{l*} − β̄_l)′ E[(x*)′ x*] (β_{l*} − β̄_l) + σ²
                   = (β_{l*} − β̄_l)′ X′X (β_{l*} − β̄_l) + σ² .
  Thus our goal is to find the best Ml in terms of the posterior expectation of this loss.
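A minimal sketch (Python, hypothetical numbers) of the loss just displayed: pad the coefficient vectors with zeros for excluded covariates and evaluate (β_{l*} − β̄_l)′ X′X (β_{l*} − β̄_l) + σ².

```python
# A minimal sketch (hypothetical numbers): the averaged squared-error predictive
# loss of using model Ml when Ml* is true, with zeros added for excluded covariates.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))     # hypothetical design matrix
XtX = X.T @ X
sigma2 = 1.0

beta_true = np.array([1.0, -0.5, 0.0, 0.3])   # coefficients under the true model Ml*
beta_l    = np.array([0.9, -0.4, 0.0, 0.0])   # posterior mean under Ml (x4 excluded -> 0)

d = beta_true - beta_l
loss = d @ XtX @ d + sigma2                   # E_{x*}(y* - yhat_l)^2
print(loss)
```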
Posterior inclusion probabilities
The posterior inclusion probability for variable i is
pi ≡ Σ_{l : li=1} P(Ml | y),
i.e., the overall posterior probability that variable i is in the model.
These are of considerable independent interest
• as basic quantities of interest,
• as aids in searches of model space,
• in defining the median probability model.
The (posterior) median probability model
If it exists, the median probability model, Ml∗ , is defined to be the
model consisting of those variables whose posterior inclusion
probability is at least 1/2. Formally, l* is defined, coordinatewise, by
  l*_i = 1 if pi ≥ 1/2, and l*_i = 0 otherwise.   (1)
Note: If computation is done by a model-jumping MCMC, the
median probability model consists of those coordinates that were
present in over half the iterations.
Existence of the median probability model
The median probability model exists when the models under
consideration follow a graphical model structure, including
• when any subset of variables is allowed;
• the situation in which the allowed variables consist of main
effects and interactions, but a higher order interaction is allowed
only if lower order interactions are included;
• a sequence of nested models, such as arises in polynomial
regression and autoregressive time series.
Example (Polynomial Regression): Model Mj is y = Σ_{i=0}^{j} βi x^i + ε.
(Model) j                  0     1     2     3     4     5     6
P(Mj | y)                 ∼0   0.06  0.22  0.29  0.38  0.05   ∼0
(Covariate) i              0     1     2     3     4     5     6
P(x^i is in model | y)     1     1   0.94  0.72  0.43  0.05    0
Thus M3 is the median probability (optimal predictive) model, while
M4 is the maximum probability model.
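Because the models are nested, the inclusion probability of x^i is just the total posterior probability of the models of order at least i; the sketch below reproduces the second row of the table from the first and picks out the median and maximum probability models.

```python
# A minimal sketch: reproduce the inclusion probabilities in the table from the
# model probabilities, using the nested structure (x^i is in Mj iff j >= i).
post = [0.0, 0.06, 0.22, 0.29, 0.38, 0.05, 0.0]      # P(Mj | y), j = 0..6

incl = [round(sum(post[i:]), 2) for i in range(7)]   # [1.0, 1.0, 0.94, 0.72, 0.43, 0.05, 0.0]
mpm = max(i for i in range(7) if incl[i] >= 0.5)     # 3 -> median probability model M3
hpm = post.index(max(post))                          # 4 -> maximum probability model M4
print(incl, mpm, hpm)
```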
Three optimality theorems
Theorem 1. If (i) the models under consideration have graphical structure; (ii) X′X is diagonal, and (iii) the posterior mean of βl is simply the relevant coordinates of β (the posterior mean in the full model), then the best predictive model is the median probability model. Condition (iii) is satisfied under any mix of
• constant priors for the βi;
• independent N(0, σ² λi) priors for the βi, with the λi given (objectively or subjectively specified, or estimated via empirical Bayes) and any prior for σ².
Corollary (Clyde and Parmigiani, 1996). If any submodel of the full model is allowed, X′X is diagonal, N(0, σ² λi) priors are used for the βi, with the λi given and σ² known, and the prior probabilities of the models satisfy
  P(Ml) = Π_{i=1}^{p} (p_i^0)^{li} (1 − p_i^0)^{(1−li)} ,
where p_i^0 is the prior probability that variable xi is in the model, then the optimal predictive model is that with highest posterior probability (which is also the median probability model).
Theorem 2. Suppose a sequence of nested linear models is under consideration. If
(i) prediction is desired at ‘future covariates like the past’ and
(ii) the posterior mean under Ml satisfies β̄_l = b β̂_l, where β̂_l is the least squares estimate and b is common across models,
then the best predictive model is the median probability model.
Condition (ii) is satisfied if we use either
• the objective priors for model parameters; or
• g-type N_{kl}(0, c σ² (X′_l X_l)^{−1}) priors, with the same constant c > 0 for each model, and any prior for σ².
Theorem 3. Theorems 1 and 2 essentially remain true even if there
are non-orthogonal nuisance parameters (i.e., parameters common to
all models) that are assigned the usual noninformative priors.
Example: Nonparametric Regression:
• yi = f(xi) + εi, i = 1, . . . , n, εi ∼ N(0, σ²).
• Represent f by an (orthonormal) series f(x) = Σ_{j=1}^{∞} βj φj(x).
• Base prior distribution: βi ∼ N(0, vi), with vi = c/i^a, where c is unknown and a is specified.
• The model Mj, for j = 1, 2, . . . , n, is given by
  yi = Σ_{k=1}^{j} βk φk(xi) + εi,  εi ∼ N(0, σ²).
• Choose equal prior probabilities (reasonable here) for the models Mj, j = 1, 2, . . . , n. Use the base prior for β_j = (β1, . . . , βj).
• For the data y = (y1, . . . , yn), compute P(Mj | y), the posterior probability of model Mj, for j ≤ n.
• Within Mj, predict β_j by its posterior mean, β̃_j.
Example 1. The Shibata Example
• f(x) = − log(1 − x) for −1 < x < 1.
• Choose {φ1(x), φ2(x), . . .} to be the Chebyshev polynomials.
• Then βi = 2/i, so the ‘optimal’ choice of the prior variances would be vi = 4/i², i.e., c = 4 and a = 2.
• Measure the predictive capability of a model by expected squared
error loss relative to the true function (here known) – thus we
use a frequentist evaluation, as did Shibata.
MaxPr MedPr ModAv BIC AIC
a = 1 0.99 [8] 0.89 [10] 0.84 1.14 [8] 1.09 [7]
a = 2 0.88 [10] 0.80 [16] 0.81 1.14 [8] 1.09 [7]
a = 3 0.88 [9] 0.84 [17] 0.85 1.14 [8] 1.09 [7]
Table 2: For n = 30 and σ² = 1, the expected loss and [average model size] for the maximum probability model (MaxPr), the Median Probability Model (MedPr), Model Averaging (ModAv), and BIC and AIC, in the Shibata example.
Note: It is possible for the median probability model to perform
better than model averaging here, because this is a frequentist
evaluation of error with respect to the true known function.
MaxPr MedPr ModAv BIC AIC
a = 1 0.54 [14] 0.51 [19] 0.47 0.59 [11] 0.59 [13]
a = 2 0.47 [23] 0.43 [43] 0.44 0.59 [11] 0.59 [13]
a = 3 0.47 [22] 0.46 [45] 0.46 0.59 [11] 0.59 [13]
Table 3: For n = 100 and σ² = 1, the expected loss and average model size for the maximum probability model (MaxPr), the Median Probability Model (MedPr), Model Averaging (ModAv), and BIC and AIC, in the Shibata example.
MaxPr MedPr ModAv BIC AIC
a = 1 0.34 [23] 0.33 [26] 0.30 0.41 [12] 0.38 [21]
a = 2 0.26 [42] 0.25 [51] 0.25 0.41 [12] 0.38 [21]
a = 3 0.29 [38] 0.29 [50] 0.29 0.41 [12] 0.38 [21]
Table 4: For n = 2000 and σ² = 3, the expected loss and average model size for the maximum probability model (MaxPr), the Median Probability Model (MedPr), Model Averaging (ModAv), and BIC and AIC, in the Shibata example.
Comments
• AIC is better than BIC (as Shibata showed), but the true
Bayesian procedures are best.
• Model averaging is generally best (not obvious), followed closely
by the median probability model. The maximum probability
model can be considerably inferior.
• BIC is a poor approximation to the Bayesian answers here.
• The true Bayesian answers choose substantially larger models
than AIC (and then shrink towards 0).
ANOVA
Many ANOVA problems, when written in linear model form, yield diagonal X′X, and any such problems will naturally fit under our theory. In particular, this is true for any balanced ANOVA in which each factor has only two levels.
As an example, consider the full two-way ANOVA model with interactions:
  y_{ijk} = µ + a_i + b_j + ab_{ij} + ε_{ijk}
with i = 1, 2, j = 1, 2, k = 1, 2, . . . , K and ε_{ijk} independent N(0, σ²), with σ² unknown. In linear model form, this leads to X′X = 4K I_4.
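A minimal sketch (Python, assuming the usual ±1 effect coding of the two-level factors) verifying that the design matrix of this 2 × 2 layout with K replicates per cell satisfies X′X = 4K I_4:

```python
# A minimal sketch (assuming ±1 effect coding): build the design matrix for the
# 2x2 ANOVA with interaction and check that X'X = 4K * I_4.
import numpy as np

K = 3  # replicates per treatment combination (hypothetical)
rows = []
for a in (+1, -1):          # factor A at two levels
    for b in (+1, -1):      # factor B at two levels
        for _ in range(K):  # K replicates per cell
            rows.append([1, a, b, a * b])   # columns: mu, a_i, b_j, ab_ij
X = np.array(rows)

print(X.T @ X)              # equals 4K * I_4 (here 12 * I_4)
```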
Possible modeling scenarios
We use the simplified notation M1011 instead of M(1,0,1,1),
representing the model having all parameters except a1.
Scenario 1 - All models with the constant µ: Thus the set of
models under consideration is
{M1000, M1100, M1010, M1001, M1101, M1011, M1110, M1111}.
Scenario 2 - Interactions present only with main effects, and µ
included: The set of models under consideration here is
{M1000, M1100, M1010, M1110, M1111}. Note that this set of models
has graphical structure.
Scenario 3 - An analogue of an unusual classical test: In classical
ANOVA testing, it is sometimes argued that one might be interested
in testing for no interaction effect followed by testing for the main
effects, even if the no-interaction test rejected. The four models that
are under consideration in this process, including the constant µ in
all, are {M1101, M1011, M1110, M1111}. This class of models does not
have graphical model structure and yet the median probability model
is guaranteed to be in the class.
Example 2. Montgomery (1991, pp.271–274) considers the effects
of the concentration of a reactant and the amount of a catalyst on
the yield in a chemical process. The reactant concentration is factor
A and has two levels, 15% and 25%. The catalyst is factor B, with
the two levels ‘one bag’ and ‘two bags’ of catalyst. The experiment
was replicated three times and the data are
treatment combination    replicate I   replicate II   replicate III
A low, B low                  28            25             27
A high, B low                 36            32             32
A low, B high                 18            19             23
A high, B high                31            30             29
For each modeling scenario, two Bayesian analyses were carried out,
both satisfying the earlier conditions so that the median probability
model is known to be the optimal predictive model.
• I. The reference prior π(µ, σ) ∝ 1/σ was used for the common parameters, while the standard N(0, σ²) g-prior was used for a1, b1 and ab11. In each scenario, the models under consideration were given equal prior probabilities of being true.
• II. The g-prior was also used for the common µ.
model posterior probability posterior expected loss
M1000 0.0009 237.21
M1100 0.0347 60.33
M1010 0.0009 177.85
M1110 0.6103 0.97
M1111 0.3532 3.05
Table 5: Scenario 2 – graphical models, prior I. The posterior inclusion
probabilities are p2 = 0.9982, p3 = 0.9644, and p4 = 0.3532; thus
M1110 is the median probability model.
model posterior probability posterior expected loss
M1011 0.124 143.03
M1101 0.286 36.78
M1110 0.456 10.03
M1111 0.134 9.41
Table 6: Scenario 3 – unusual classical models, prior II. The posterior
inclusion probabilities are p2 = 0.876, p3 = 0.714, and p4 = 0.544;
thus M1111 is the median probability model.
When the median probability model can fail (Merlise Clyde)
• Suppose
  – under consideration are M0, the model with only a constant term, and the models Mi, having a constant term and the single covariate xi, i = 1, . . . , p, with p ≥ 3;
  – the models have equal prior probability of 1/(p + 1);
  – all covariates are nearly perfectly correlated, together and with y.
• Then
  – the posterior probability of M0 will be near zero, and that of each of the other Mi will be approximately 1/p;
  – thus the posterior inclusion probabilities will also be approximately 1/p < 1/2;
  – so the median probability model is M0, which will have poor predictive performance compared to any other model.
But in practice, the median probability model works:
Example: Consider Hald’s regression data (Draper and Smith,
1981), consisting of n = 13 observations on a dependent variable y,
with four potential regressors: x1, x2, x3, x4. The full model is thus
y = c + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ε,  ε ∼ N(0, σ²),
σ² unknown. There is high correlation between covariates here.
• All models that include the constant term are considered. This
example does not formally satisfy the theory, since the models are not
nested and the conditions of Theorem 3 do not apply.
• Least squares estimates are used for parameters.
• Default posterior model probabilities, P(Ml|y), are computed using
the Encompassing Arithmetic Intrinsic Bayes Factor (Berger and
Pericchi, 1996), together with equal prior model probabilities.
• Predictive risks, R(Ml), are computed.
Model       P(Ml|y)    R(Ml)          Model        P(Ml|y)    R(Ml)
c           0.000003   2652.44        c,2,3        0.000229    353.72
c,1         0.000012   1207.04        c,2,4        0.000018    821.15
c,2         0.000026    854.85        c,3,4        0.003785    118.59
c,3         0.000002   1864.41        c,1,2,3      0.170990      1.21
c,4         0.000058    838.20        c,1,2,4      0.190720      0.18
c,1,2       0.275484      8.19        c,1,3,4      0.159959      1.71
c,1,3       0.000006   1174.14        c,2,3,4      0.041323     20.42
c,1,4       0.107798     29.73        c,1,2,3,4    0.049587      0.47
• The posterior inclusion probabilities are
  p1 = Σ_{l:l1=1} P(Ml|y) = 0.95,   p2 = Σ_{l:l2=1} P(Ml|y) = 0.73,
  p3 = Σ_{l:l3=1} P(Ml|y) = 0.43,   p4 = Σ_{l:l4=1} P(Ml|y) = 0.55.
• Thus the median probability model is {c, x1, x2, x4}, which clearly coincides with the optimal predictive model.
• Note that the risk of the maximum probability model {c, x1, x2}
is considerably higher than that of the median probability model.
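A minimal sketch (Python) recomputing these inclusion probabilities and the median probability model directly from the table of P(Ml | y) above:

```python
# A minimal sketch: recompute the inclusion probabilities and the median
# probability model from the table of P(Ml | y) for the Hald data.
post = {  # model (variables beyond the constant c) -> P(Ml | y), from the table
    ():        0.000003, (1,):      0.000012, (2,):      0.000026, (3,):      0.000002,
    (4,):      0.000058, (1, 2):    0.275484, (1, 3):    0.000006, (1, 4):    0.107798,
    (2, 3):    0.000229, (2, 4):    0.000018, (3, 4):    0.003785,
    (1, 2, 3): 0.170990, (1, 2, 4): 0.190720, (1, 3, 4): 0.159959,
    (2, 3, 4): 0.041323, (1, 2, 3, 4): 0.049587,
}
incl = {i: sum(p for m, p in post.items() if i in m) for i in (1, 2, 3, 4)}
mpm = [i for i, p in incl.items() if p >= 0.5]
print(incl)   # approximately {1: 0.95, 2: 0.73, 3: 0.43, 4: 0.55}
print(mpm)    # [1, 2, 4]  -> the median probability model {c, x1, x2, x4}
```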
Conclusions
• Bayesian analysis for the normal linear model (including variable
selection) can easily be done with conventional priors.
– Software exists.
– Adjustment for the multiple testing problem inherent in variable
selection is possible (and is the default method).
– Summaries (over the 2^p models) of key quantities – such as posterior inclusion probabilities and model averaged predictive parameter estimates – are available and easily interpretable.
• If a single model is required for prediction (or understanding),
consider the median probability model.
– It will typically be the optimal predictive model (better than the
highest probability model).
– It is easy to compute (posterior inclusion probabilities are easy).
More Related Content

What's hot

Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)jillmitchell8778
 
Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)jillmitchell8778
 
Pysap#3.1 Pythonでショートコーディング
Pysap#3.1 PythonでショートコーディングPysap#3.1 Pythonでショートコーディング
Pysap#3.1 PythonでショートコーディングFumihito Yokoyama
 
Algorithm Design
Algorithm DesignAlgorithm Design
Algorithm Designsyou6162
 
Module iii sp
Module iii spModule iii sp
Module iii spVijaya79
 
Optimal debt maturity management
Optimal debt maturity managementOptimal debt maturity management
Optimal debt maturity managementADEMU_Project
 
Journey to structure from motion
Journey to structure from motionJourney to structure from motion
Journey to structure from motionJa-Keoung Koo
 
Prévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAMPrévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAMCdiscount
 
NIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual imagesNIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual imageszukun
 
Open GL T0074 56 sm4
Open GL T0074 56 sm4Open GL T0074 56 sm4
Open GL T0074 56 sm4Roziq Bahtiar
 

What's hot (13)

Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)
 
Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)
 
Pysap#3.1 Pythonでショートコーディング
Pysap#3.1 PythonでショートコーディングPysap#3.1 Pythonでショートコーディング
Pysap#3.1 Pythonでショートコーディング
 
Algorithm Design
Algorithm DesignAlgorithm Design
Algorithm Design
 
Module iii sp
Module iii spModule iii sp
Module iii sp
 
09 review
09 review09 review
09 review
 
Optimal debt maturity management
Optimal debt maturity managementOptimal debt maturity management
Optimal debt maturity management
 
Test
TestTest
Test
 
Journey to structure from motion
Journey to structure from motionJourney to structure from motion
Journey to structure from motion
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Prévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAMPrévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAM
 
NIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual imagesNIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual images
 
Open GL T0074 56 sm4
Open GL T0074 56 sm4Open GL T0074 56 sm4
Open GL T0074 56 sm4
 

Similar to 2018 MUMS Fall Course - Conventional Priors for Bayesian Model Uncertainty - Jim Berger, September 18, 2018

DETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTDETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTAM Publications
 
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part INon-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part IShiga University, RIKEN
 
Approximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelApproximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelMatt Moores
 
R package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable LikelihoodsR package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable LikelihoodsMatt Moores
 
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)마이캠퍼스
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesAdi Handarbeni
 
Epidemic processes on switching networks
Epidemic processes on switching networksEpidemic processes on switching networks
Epidemic processes on switching networksNaoki Masuda
 
Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...Alexander Decker
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Vu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptxVu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptxQucngV
 
Management of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamicsManagement of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamicsAlexander Litvinenko
 

Similar to 2018 MUMS Fall Course - Conventional Priors for Bayesian Model Uncertainty - Jim Berger, September 18, 2018 (20)

Math Exam Help
Math Exam HelpMath Exam Help
Math Exam Help
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
DETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTDETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECT
 
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part INon-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
 
Approximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelApproximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts model
 
Input analysis
Input analysisInput analysis
Input analysis
 
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
 
Ch11.kriging
Ch11.krigingCh11.kriging
Ch11.kriging
 
R package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable LikelihoodsR package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable Likelihoods
 
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
 
Mk slides.ppt
Mk slides.pptMk slides.ppt
Mk slides.ppt
 
Anov af03
Anov af03Anov af03
Anov af03
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resources
 
Epidemic processes on switching networks
Epidemic processes on switching networksEpidemic processes on switching networks
Epidemic processes on switching networks
 
Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...
 
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Lecture9
Lecture9Lecture9
Lecture9
 
Vu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptxVu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptx
 
Management of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamicsManagement of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamics
 

More from The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
 

Recently uploaded

Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 

Recently uploaded (20)

Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 

2018 MUMS Fall Course - Conventional Priors for Bayesian Model Uncertainty - Jim Berger, September 18, 2018

• 8. Priors for the model parameters
There have been many efforts (over more than 30 years) to develop ‘objective model selection priors’ πi(β0, βi, σ^2) for the parameters of model Mi, including
  • conventional priors (Jeffreys 1961; Zellner and Siow 1980)
  • intrinsic priors (Berger and Pericchi 1996; Moreno et al. 1998)
  • fractional priors (O’Hagan 1997)
  • expected posterior priors (Pérez and Berger 2002)
  • integral priors (Cano et al. 2008)
  • divergence-based priors (Bayarri and García-Donato 2008)
  • plus many proposals for particular situations.
Here we consider only the conventional prior approach.
• 9. The g- and Zellner-Siow conventional priors for variable selection
The Zellner g-prior for a model of size ki:
    π_i^g(β0, βi, σ^2) = (1/σ^2) × Normal_ki( βi | 0, n σ^2 (V_i^t V_i)^{-1} ),  with V_i = (I_n − X0 (X0^t X0)^{-1} X0^t) X_i .
This prior is very popular because the computations are all closed form, but it has the serious flaw of being information inconsistent (i.e., as the F statistic comparing two models goes to ∞, the Bayes factor stays bounded).
The Zellner-Siow prior: as above, but with the Normal replaced by a Cauchy. This is an excellent conventional prior but, unfortunately, the resulting marginal likelihoods are not closed form.
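One useful fact: the Zellner-Siow Cauchy prior can be written as a scale mixture of g-priors, with g ~ Inverse-Gamma(1/2, n/2); one common computational route is therefore to integrate the closed-form g-prior quantities over g numerically. Below is a minimal base-R sketch of a single draw from the prior under that representation; V, sigma2 and n are placeholder inputs, not quantities from the lecture.

## One draw of beta_i from the Zellner-Siow prior, using its representation as a
## g-prior with g ~ Inverse-Gamma(1/2, n/2) (a sketch; all inputs are placeholders).
draw_zs <- function(V, sigma2, n) {
  ki    <- ncol(V)
  g     <- 1 / rgamma(1, shape = 0.5, rate = n / 2)   # g ~ InvGamma(1/2, n/2)
  Sigma <- g * sigma2 * solve(crossprod(V))           # g * sigma^2 * (V'V)^{-1}
  drop(t(chol(Sigma)) %*% rnorm(ki))                  # one multivariate normal draw
}

set.seed(1)
V <- matrix(rnorm(50 * 3), 50, 3)   # hypothetical covariate matrix, n = 50, ki = 3
draw_zs(V, sigma2 = 1, n = 50)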
• 10. Example: Bayesian Model Selection and Analysis for Cepheid Star Oscillations (Berger, Müller, Jefferys and Barnes, 2001)
  • The astronomical problem
  • Bayesian model selection
  • The data and likelihood function
  • Choice of prior distributions
  • Computation and results
• 11. The Astronomical Problem
  • A Cepheid star pulsates, regularly varying its luminosity (light output) and size.
  • From the Doppler shift as the star surface pulsates, one can compute surface velocities at certain phases of the star’s period, thus learning how the radius of the star changes.
  • From the luminosity and ‘color’ of the star, one can learn about the angular radius of the star (the angle subtended at Earth by opposite edges of the star).
  • Combining the two allows estimation of s, the star’s distance.
• 12. Figure 1: The x’s give the radial velocity measurements at various phases of the star’s oscillation. The curve is a 5th-order trigonometric polynomial fit.
• 13. Curve Fitting
To determine the overall change in radius of the star over the star’s period, the surface velocity must be estimated at phases other than those actually observed, leading to a curve-fitting problem (and similarly for luminosity). Difficulties:
  • Observations have measurement error.
  • Phases at which observations are made are unequally spaced.
  • The number of possible models (curve fits) entertained varies between 50 and thousands.
  • The resulting models have from 10 to 1000 parameters.
• 14. The Data and Statistical Model
  • Data:
    – m observed radial velocities Ui, i = 1, . . . , m.
    – n vectors of photometry data consisting of luminosity Vi, i = 1, . . . , n, and color index Ci, i = 1, . . . , n.
  • Specified standard deviations σ_Ui, σ_Vi, and σ_Ci; unknown adjustment factors are inserted, leading to variances σ_Ui^2/τu, σ_Vi^2/τv, σ_Ci^2/τc.
  • The statistical model for measurement error:
      Ui ∼ N(ui, σ_Ui^2/τu),  Vi ∼ N(vi, σ_Vi^2/τv),  Ci ∼ N(ci, σ_Ci^2/τc),
    where ui, vi, and ci denote the true unknown mean velocity, luminosity, and color index.
• 15. Curve Fitting I: Fourier Analysis
  • Model the true periodic velocity u, at phase φ, as a trigonometric polynomial
      u = u0 + Σ_{j=1}^{M} [ β_{1j} cos(jφ) + β_{2j} sin(jφ) ],
    where u0 is the mean velocity of the star, M is the (unknown) order of the trigonometric polynomial, and the β_{1j} and β_{2j} are the unknown Fourier coefficients of the trigonometric polynomial.
  • There is a similar polynomial model for the true luminosity v, having unknown order N.
• 16. The resulting statistical models for the column vectors U and V of observed radial velocities and luminosities are the linear models U = u0 1 + Xu βu + εu and V = v0 1 + Xv βv + εv,
  • where u0 and v0 are the (unknown) mean velocity and luminosity and 1 is the column vector of ones;
  • Xu and Xv are matrices of the trigonometric covariates (e.g., terms like sin(jφ); a small construction sketch follows below);
  • βu and βv are column vectors of the unknown Fourier coefficients;
  • εu and εv are independent multivariate normal errors: N(0, Gu/τu) and N(0, Gv/τv), where Gu and Gv are the known diagonal matrices of the variances σ_Ui^2 and σ_Vi^2.
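To make the structure of Xu concrete, here is a minimal base-R sketch that builds the trigonometric design matrix for a given vector of phases and order M; the phases below are randomly generated placeholders, not the actual observations.

## Build the order-M trigonometric design matrix: columns cos(j*phi), sin(j*phi), j = 1,...,M.
trig_design <- function(phi, M) {
  cols <- lapply(1:M, function(j) cbind(cos(j * phi), sin(j * phi)))
  X <- do.call(cbind, cols)
  colnames(X) <- as.vector(rbind(paste0("cos", 1:M), paste0("sin", 1:M)))
  X
}

set.seed(1)
phi <- runif(30, 0, 2 * pi)      # hypothetical, unequally spaced phases (radians)
Xu  <- trig_design(phi, M = 5)   # 30 x 10 design matrix for a 5th-order fit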
• 17. Color need not be modeled separately because it is related to luminosity (v) and velocity (or change in radius) by
    ci = −a [ 0.1 vi + b + 0.5 log(φ0 + ∆ri/s) ],
where a and b are known constants, φ0 and s are the angular size and distance of the star, and ∆r, the change in radius corresponding to phase φ, is given by
    ∆r = −g Σ_{j=1}^{M} (1/j) [ β_{1j} sin(j(φ − ∆φ)) − β_{2j} cos(j(φ − ∆φ)) ],
with ‘phase shift’ ∆φ and g a known constant.
• 18. Choice of Cepheid Prior Distributions
  • The orders of the trigonometric polynomials, (M, N), are given a uniform distribution up to some cut-off (e.g., (10, 10)).
  • τu, τv, τc, which adjust the measurement standard errors, are given the standard objective priors for ‘scale parameters,’ namely the Jeffreys-rule priors p(τu) = 1/τu, p(τv) = 1/τv, and p(τc) = 1/τc.
  • The mean velocity and luminosity, u0 and v0, are ‘location parameters’ and so can be assigned the standard objective priors p(u0) = 1 and p(v0) = 1.
  • The angular diameter φ0 and the unknown phase shift ∆φ are also assigned the objective priors p(∆φ) = 1 and p(φ0) = 1. It is unclear whether these are ‘optimal’ objective priors, but the choice was found to have negligible impact on the answers.
• 19.
  • The Fourier coefficients, βu and βv, occur in linear models, so the Zellner-Siow priors can be utilized.
  • The prior for the distance s of the star should account for
    – Lutz-Kelker bias: a uniform spatial distribution of Cepheid stars would yield a prior proportional to s^2;
    – the flattening of the distribution of Cepheids with respect to the galactic plane, for which we use an exponential distribution.
  • So we use p(s) ∝ s^2 exp(−|s sin β|/z0) (evaluated in the sketch below), with
    – β the known galactic latitude of the star (its angle above the galactic plane),
    – z0 the ‘scale height,’ assigned a uniform prior over the range z0 = 97 ± 7 parsecs.
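A minimal base-R sketch of this distance prior, evaluated and normalized on a grid; the galactic latitude and distance range used here are hypothetical illustration values, not those of any particular star.

## Unnormalized prior p(s) proportional to s^2 * exp(-|s sin(beta)| / z0),
## normalized numerically over a grid of distances (in parsecs).
distance_prior <- function(s, beta_gal, z0) s^2 * exp(-abs(s * sin(beta_gal)) / z0)

s_grid <- seq(1, 5000, by = 1)                               # hypothetical distance grid
dens   <- distance_prior(s_grid, beta_gal = 10 * pi / 180,   # hypothetical latitude
                         z0 = 97)                            # central scale-height value
dens   <- dens / sum(dens)                                   # normalize over the grid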
• 20. Computation (not considered for the SAMSI MUMS course)
A reversible-jump MCMC algorithm of the type reviewed in Dellaportas, Forster and Ntzoufras (2000) is used to move between models and to generate posterior distributions and estimates.
  • The full conditional distributions for the variance and precision parameters and hyperparameters are standard gamma and inverse-gamma distributions and are sampled with Gibbs sampling.
• 21.
  • For ∆φ, φ0 and s, we employ a random-walk Metropolis algorithm using, as the proposal distribution, a multivariate normal distribution centered on the current values and with a covariance matrix found from linearizing the problem for these three parameters.
  • The Fourier coefficients βu and βv, as well as u0 and v0, are also sampled via Metropolis. The natural proposal distributions are found by combining the normal likelihoods with the normal part of the Zellner-Siow priors, leading to conjugate normal posterior distributions.
  • Proposal for moves between models:
    – A ‘burn-in’ portion of the MCMC with uniform model proposals yielded initial posterior model probabilities, which were then used as the proposal for subsequent model moves.
    – Simultaneously, new values were proposed for the Fourier coefficients.
• 22. The proposal distributions listed above led to a well-mixed Markov chain, so that only 10,000 iterations of the MCMC computation needed to be performed.
[Figure: trace plot of the sampled parallax of T Mon against trial (iteration) number, 0 to 10,000.]
• 23. [Figure: posterior probabilities of the T Mon velocity models, plotted against model index 1 to 7.]
• 24. [Figure: posterior probabilities of the T Mon V photometry models, plotted against model index 1 to 7.]
• 25. Figure: The posterior distribution of the parallax (in arcsec) of T Mon; parallax is proportional to the inverse of distance. This was determined by model averaging over the various possible models for radial velocity and photometry.
• 26. The ‘Robust’ conventional prior for variable selection
(Bayarri, Berger, Forte and García-Donato, 2012; a generalization of proposals by Strawderman (1971, 1973) and Berger (1976, 1980, 1985).)
Defining Σ_i = σ^2 (V_i^t V_i)^{-1},
    π_i^R(β0, βi, σ^2) = (1/(2σ^2)) ∫_0^1 N_ki( βi | 0, [ (1+n)/(λ(k0+ki)) − 1 ] Σ_i ) λ^{−1/2} dλ .
Although this prior is not closed form, it gives closed-form marginal likelihoods, and closed-form Bayes factors (see the sketch below)
    B_{i0} = [ (n+1)/(ki+k0) ]^{−ki/2} × Q_{i0}^{−(n−k0)/2} / (ki+1) × 2F1[ (ki+1)/2 ; (n−k0)/2 ; (ki+3)/2 ; (1 − Q_{i0}^{−1})(ki+k0)/(1+n) ] ,
where 2F1 is the standard (Gauss) hypergeometric function and Q_{i0} = SSE_i/SSE_0 is the ratio of the sums of squared errors of models Mi and M0.
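A minimal R sketch of this closed-form Bayes factor, using the Gauss hypergeometric function from the gsl package (an assumption of convenience; any 2F1 implementation would do). The numbers in the example call are made up.

## Robust-prior Bayes factor B_i0 as displayed above; Qi0 = SSE_i / SSE_0.
# install.packages("gsl")
library(gsl)

robust_BF <- function(Qi0, n, ki, k0) {
  z <- (1 - 1 / Qi0) * (ki + k0) / (1 + n)
  ((n + 1) / (ki + k0))^(-ki / 2) * Qi0^(-(n - k0) / 2) / (ki + 1) *
    hyperg_2F1(a = (ki + 1) / 2, b = (n - k0) / 2, c = (ki + 3) / 2, x = z)
}

robust_BF(Qi0 = 0.6, n = 50, ki = 3, k0 = 1)   # hypothetical inputs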
• 27. R package BayesVarSel (García-Donato and Forte, 12-12-12)
  • freely available at CRAN (for sequential or parallel computation)
  • programmed in C using GNU gsl libraries
  • priors allowed (a usage sketch follows below):
    – prior.betas =
      ∗ “Robust”
      ∗ “Liangetal” (not discussed in the lecture)
      ∗ “gZellner”
      ∗ “ZellnerSiow”
    – prior.models =
      ∗ “Constant” (Pr(Mi) = 1/[# models])
      ∗ “Jeffreys” (Pr(Mi) = 1/[(p+1) (p choose ki)])
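A minimal, hypothetical usage sketch: the data frame and formula are placeholders, and the prior.betas / prior.models labels are those listed on this slide; argument names and defaults may differ across versions of the package, so check ?Bvs in your installation.

## Hypothetical call to BayesVarSel::Bvs (data set, formula and variable names made up).
# install.packages("BayesVarSel")
library(BayesVarSel)

fit <- Bvs(formula = y ~ x1 + x2 + x3 + x4, data = mydata,
           prior.betas  = "Robust",     # robust conventional prior for the betas
           prior.models = "Jeffreys")   # multiplicity-adjusting prior over models
fit   # prints the most probable models and the posterior inclusion probabilities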
• 28. Common Bayesian outputs given in the package
  • The posterior inclusion probability (called Inclusion Probability) for variable i is
      p_i = Σ_{l : variable i is in Ml} Pr(Ml | y),
    the overall posterior probability that variable i is in the model.
  • The highest probability model (HPM).
  • The median probability model (MPM) is the model consisting of those variables whose posterior inclusion probability is ≥ 1/2.
  • The Bayesian model-averaged predictor of y, at covariates x, is ŷ = x β̂, where β̂ = Σ_l Pr(Ml | y) β̂_l (called the Estimate in the package), with β̂_l being the posterior mean under model Ml.
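These summaries are easy to compute directly from any table of models, posterior model probabilities, and within-model posterior means. A minimal base-R sketch with made-up numbers (three covariates, four models):

## Toy illustration of inclusion probabilities, HPM, MPM and the BMA estimate.
models <- rbind(c(0,0,0), c(1,0,0), c(1,1,0), c(1,1,1))                # inclusion indicators per model
postpr <- c(0.05, 0.40, 0.35, 0.20)                                    # Pr(M_l | y), made up
betas  <- rbind(c(0,0,0), c(1.2,0,0), c(1.0,0.6,0), c(0.9,0.5,0.1))    # posterior means, made up

incl <- colSums(models * postpr)      # posterior inclusion probabilities p_i
hpm  <- models[which.max(postpr), ]   # highest probability model
mpm  <- as.integer(incl >= 0.5)       # median probability model
bma  <- colSums(betas * postpr)       # model-averaged coefficient estimate ('Estimate')

incl; hpm; mpm; bma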
• 29. Example: Infant obesity
  • Understand which factors affect infant obesity.
  • Response variable: Body Mass Index (BMI ∝ weight/height^2).
  • Set of 16 explanatory covariates (such as ‘born weight’ and ‘born height’) with 1002 observations.
  • There are 2^16 = 65,536 models.
• 30. Figure 2: This page of the printout lists the ten most probable models, indicating the variables that are in each model. The posterior probability of each model is also given. Note that the highest probability model has probability only 0.094; a small HPM probability is common when there are so many models.
• 34. Selecting a single model for prediction: The Median Probability Model
• 35. Context: Prediction with Normal linear models
  • Under the full model, the n × 1 observation vector would follow
      y = Xβ + ǫ,
    where X is the n × p design matrix, β is the p × 1 vector of unknown coefficients, and ǫ is N(0, σ^2 I).
  • The possible submodels are
      Ml : y = Xl βl + ǫ,
    where l = (l1, l2, . . . , lp) is the model index, li being 1 or 0 according to whether covariate xi is in or out of the model.
  • Assume that one of these models is true, and that our goal is to predict a future observation.
• 36. Basics of Bayesian prediction
  • The goal is to predict a future y* = x* β + ǫ, at covariate values x*, under squared error loss (y* − ŷ*)^2.
  • Combining the data and prior yields, for all l,
    – P(Ml | y), the posterior probability of model Ml;
    – πl(βl, σ^2 | y), the posterior distribution of (βl, σ^2).
  • The best predictor of y* is, via model averaging,
      ȳ* = x* β̄ ≡ x* Σ_l P(Ml | y) β̄_l,
    where β̄_l is the posterior mean for β under Ml. (This follows from the fact that, under squared error loss, the optimal Bayes estimate is the posterior mean, and the posterior mean of y* is ȳ*.)
• 37. Selecting a single model
  • Often a single model Ml is desired for prediction, with the prediction then being ŷ_l = x* β̄_l.
  • A common misperception is that the best single model is the one with the largest P(Ml | y);
    – this is true if there are only two models;
    – this is true if X'X is diagonal, σ^2 is known, and suitable priors are used.
  • The best single model will typically depend on x*.
• 38. An important case is when the model will repeatedly be used to make future predictions, and the future covariates will be like the past covariates, in the sense that E[(x*)' x*] = X'X. Then the averaged squared-error predictive loss for future predictions, using Ml when Ml* is the true model, is (adding any needed zeroes to the β)
    E_{x*}(y* − ŷ_l)^2 = (β_{l*} − β_l)' E[(x*)' x*] (β_{l*} − β_l) + σ^2 = (β_{l*} − β_l)' X'X (β_{l*} − β_l) + σ^2 .
Thus our goal is to find the best Ml in terms of the posterior expectation of this loss; a small sketch of the loss follows below.
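A minimal base-R sketch of this loss as a function; zeros are filled in for coefficients that a submodel excludes, and all inputs in the example are made up.

## Averaged squared-error predictive loss: (b_true - b_l)' X'X (b_true - b_l) + sigma^2.
pred_loss <- function(beta_true, beta_l, X, sigma2) {
  d <- beta_true - beta_l
  drop(t(d) %*% crossprod(X) %*% d) + sigma2
}

set.seed(2)
X <- matrix(rnorm(40), 20, 2)                                        # hypothetical design
pred_loss(beta_true = c(1, 0.5), beta_l = c(1.1, 0), X, sigma2 = 1)  # hypothetical coefficients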
• 39. Posterior inclusion probabilities
The posterior inclusion probability for variable i is
    p_i ≡ Σ_{l : li = 1} P(Ml | y),
i.e., the overall posterior probability that variable i is in the model. These are of considerable independent interest
  • as basic quantities of interest,
  • as aids in searches of model space,
  • in defining the median probability model.
• 40. The (posterior) median probability model
If it exists, the median probability model, Ml*, is defined to be the model consisting of those variables whose posterior inclusion probability is at least 1/2. Formally, l* is defined, coordinatewise, by
    l*_i = 1 if p_i ≥ 1/2, and l*_i = 0 otherwise.   (1)
Note: If computation is done by a model-jumping MCMC, the median probability model consists of those coordinates that were present in over half the iterations (see the small sketch below).
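A one-line base-R version of that note, for a hypothetical (iterations x p) 0/1 matrix recording which variables were in the model at each MCMC iteration:

## Median probability model read off a model-jumping MCMC run; 'visits' is 0/1, one row per iteration.
mpm_from_mcmc <- function(visits) as.integer(colMeans(visits) >= 0.5)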
• 41. Existence of the median probability model
The median probability model exists when the models under consideration follow a graphical model structure, including
  • when any subset of variables is allowed;
  • the situation in which the allowed variables consist of main effects and interactions, but a higher-order interaction is allowed only if the lower-order interactions are included;
  • a sequence of nested models, such as arises in polynomial regression and autoregressive time series.
• 42. Example (Polynomial Regression): Model j is y = Σ_{i=0}^{j} βi x^i + ǫ.
    (Model) j:                 0      1      2      3      4      5      6
    P(Mj | y):                ∼0    0.06   0.22   0.29   0.38   0.05    ∼0
    (Covariate) i:             0      1      2      3      4      5      6
    P(x^i is in model | y):    1      1    0.94   0.72   0.43   0.05     0
Thus M3 is the median probability (optimal predictive) model, while M4 is the maximum probability model.
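Because the models are nested, covariate x^i appears in Mj exactly when j ≥ i, so each inclusion probability in the table is just a tail sum of the model probabilities. A quick base-R check:

## Reproduce the inclusion probabilities from the model probabilities above.
postpr <- c(0, 0.06, 0.22, 0.29, 0.38, 0.05, 0)   # P(M_j | y), j = 0, ..., 6
incl   <- rev(cumsum(rev(postpr)))[-1]            # p_i = sum_{j >= i} P(M_j | y), i = 1, ..., 6
round(incl, 2)                                    # 1.00 0.94 0.72 0.43 0.05 0.00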
• 43. Three optimality theorems
Theorem 1. If (i) the models under consideration have graphical structure; (ii) X'X is diagonal; and (iii) the posterior mean of βl is simply the relevant coordinates of the posterior mean of β in the full model, then the best predictive model is the median probability model.
Condition (iii) is satisfied under any mix of
  • constant priors for the βi;
  • independent N(0, σ^2 λi) priors for the βi, with the λi given (objectively or subjectively specified, or estimated via empirical Bayes) and any prior for σ^2.
• 44. Corollary (Clyde and Parmigiani, 1996). If any submodel of the full model is allowed, X'X is diagonal, N(0, σ^2 λi) priors are used for the βi with the λi given and σ^2 known, and the prior probabilities of the models satisfy
    P(Ml) = Π_{i=1}^{p} (p_i^0)^{li} (1 − p_i^0)^{1−li},
where p_i^0 is the prior probability that variable xi is in the model, then the optimal predictive model is the one with highest posterior probability (which is also the median probability model).
• 45. Theorem 2. Suppose a sequence of nested linear models is under consideration. If (i) prediction is desired at ‘future covariates like the past’ and (ii) the posterior mean under Ml satisfies β̄_l = b β̂_l, where β̂_l is the least squares estimate and b is common across models, then the best predictive model is the median probability model.
Condition (ii) is satisfied if we use either
  • the objective priors for model parameters; or
  • g-type N_kl(0, c σ^2 (Xl' Xl)^{-1}) priors, with the same constant c > 0 for each model, and any prior for σ^2.
Theorem 3. Theorems 1 and 2 essentially remain true even if there are non-orthogonal nuisance parameters (i.e., parameters common to all models) that are assigned the usual noninformative priors.
• 46. Example: Nonparametric Regression
  • yi = f(xi) + ǫi, i = 1, . . . , n, with ǫi ∼ N(0, σ^2).
  • Represent f by an (orthonormal) series f(x) = Σ_{j=1}^{∞} βj φj(x).
  • Base prior distribution: βi ∼ N(0, vi), with vi = c/i^a, where c is unknown and a is specified.
  • The model Mj, for j = 1, 2, . . . , n, is given by yi = Σ_{k=1}^{j} βk φk(xi) + ǫi, ǫi ∼ N(0, σ^2).
  • Choose equal prior probabilities (reasonable here) for the models Mj, j = 1, 2, . . . , n. Use the base prior for βj = (β1, . . . , βj).
  • For the data y = (y1, . . . , yn), compute P(Mj | y), the posterior probability of model Mj, for j ≤ n.
  • Within Mj, predict βj by its posterior mean, β̃j.
• 47. Example 1. The Shibata Example
  • f(x) = −log(1 − x) for −1 < x < 1.
  • Choose {φ1(x), φ2(x), . . .} to be the Chebyshev polynomials.
  • Then βi = 2/i, so the ‘optimal’ choice of the prior variances would be vi = 4/i^2, i.e., c = 4 and a = 2.
  • Measure the predictive capability of a model by the expected squared error loss relative to the true function (here known); thus we use a frequentist evaluation, as did Shibata.
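A quick numerical check of the βi = 2/i statement, assuming the standard expansion −log(1 − x) = log 2 + Σ_{k≥1} (2/k) Tk(x), with Tk(x) = cos(k·arccos x) on (−1, 1); the grid and truncation order below are arbitrary choices.

## Compare the truncated Chebyshev series with coefficients 2/k to -log(1 - x).
x <- seq(-0.95, 0.95, by = 0.01)
K <- 500
approx_f <- log(2) + Reduce(`+`, lapply(1:K, function(k) (2 / k) * cos(k * acos(x))))
max(abs(approx_f - (-log(1 - x))))   # maximum truncation error over the grid (largest near x = 1)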
• 48.
            MaxPr        MedPr        ModAv    BIC         AIC
    a = 1   0.99 [8]     0.89 [10]    0.84     1.14 [8]    1.09 [7]
    a = 2   0.88 [10]    0.80 [16]    0.81     1.14 [8]    1.09 [7]
    a = 3   0.88 [9]     0.84 [17]    0.85     1.14 [8]    1.09 [7]
Table 2: For n = 30 and σ^2 = 1, the expected loss and [average model size] for the maximum probability model (MaxPr), the median probability model (MedPr), model averaging (ModAv), and BIC and AIC, in the Shibata example.
Note: It is possible for the median probability model to perform better than model averaging here, because this is a frequentist evaluation of error with respect to the true, known function.
• 49.
            MaxPr        MedPr        ModAv    BIC         AIC
    a = 1   0.54 [14]    0.51 [19]    0.47     0.59 [11]   0.59 [13]
    a = 2   0.47 [23]    0.43 [43]    0.44     0.59 [11]   0.59 [13]
    a = 3   0.47 [22]    0.46 [45]    0.46     0.59 [11]   0.59 [13]
Table 3: For n = 100 and σ^2 = 1, the expected loss and [average model size] for the maximum probability model (MaxPr), the median probability model (MedPr), model averaging (ModAv), and BIC and AIC, in the Shibata example.
• 50.
            MaxPr        MedPr        ModAv    BIC         AIC
    a = 1   0.34 [23]    0.33 [26]    0.30     0.41 [12]   0.38 [21]
    a = 2   0.26 [42]    0.25 [51]    0.25     0.41 [12]   0.38 [21]
    a = 3   0.29 [38]    0.29 [50]    0.29     0.41 [12]   0.38 [21]
Table 4: For n = 2000 and σ^2 = 3, the expected loss and [average model size] for the maximum probability model (MaxPr), the median probability model (MedPr), model averaging (ModAv), and BIC and AIC, in the Shibata example.
• 51. Comments
  • AIC is better than BIC (as Shibata showed), but the true Bayesian procedures are best.
  • Model averaging is generally best (not obvious), followed closely by the median probability model. The maximum probability model can be considerably inferior.
  • BIC is a poor approximation to the Bayesian answers here.
  • The true Bayesian answers choose substantially larger models than AIC (and then shrink towards 0).
• 52. ANOVA
Many ANOVA problems, when written in linear model form, yield a diagonal X'X, and any such problem naturally fits under our theory. In particular, this is true for any balanced ANOVA in which each factor has only two levels. As an example, consider the full two-way ANOVA model with interactions:
    y_{ijk} = µ + a_i + b_j + ab_{ij} + ǫ_{ijk},  i = 1, 2, j = 1, 2, k = 1, 2, . . . , K,
with the ǫ_{ijk} independent N(0, σ^2), σ^2 unknown. In linear model form, this leads to X'X = 4K I_4 (verified in the small sketch below).
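A minimal base-R check of that claim, assuming the usual +/-1 (effect) coding for a, b and their interaction, with K replicates per cell:

## Verify X'X = 4K * I_4 for the 2 x 2 ANOVA with interaction (effect coding).
K     <- 3
cells <- expand.grid(a = c(-1, 1), b = c(-1, 1))
X     <- do.call(rbind, replicate(K, cbind(mu = 1, a = cells$a, b = cells$b,
                                           ab = cells$a * cells$b), simplify = FALSE))
crossprod(X)   # 4K times the 4 x 4 identity (here 12 * I_4)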
• 53. Possible modeling scenarios
We use the simplified notation M1011 instead of M(1,0,1,1), representing the model having all parameters except a1.
Scenario 1 – All models with the constant µ: the set of models under consideration is
    {M1000, M1100, M1010, M1001, M1101, M1011, M1110, M1111}.
Scenario 2 – Interactions present only with main effects, and µ included: the set of models under consideration here is
    {M1000, M1100, M1010, M1110, M1111}.
Note that this set of models has graphical structure.
• 54. Scenario 3 – An analogue of an unusual classical test: In classical ANOVA testing, it is sometimes argued that one might be interested in testing for no interaction effect, followed by testing for the main effects, even if the no-interaction test rejected. The four models under consideration in this process, all including the constant µ, are
    {M1101, M1011, M1110, M1111}.
This class of models does not have graphical model structure, and yet the median probability model is guaranteed to be in the class.
• 55. Example 2. Montgomery (1991, pp. 271–274) considers the effects of the concentration of a reactant and the amount of a catalyst on the yield in a chemical process. The reactant concentration is factor A and has two levels, 15% and 25%. The catalyst is factor B, with the two levels ‘one bag’ and ‘two bags’ of catalyst. The experiment was replicated three times, and the data are:
    treatment combination     replicate I    replicate II    replicate III
    A low,  B low                  28             25              27
    A high, B low                  36             32              32
    A low,  B high                 18             19              23
    A high, B high                 31             30              29
• 56. For each modeling scenario, two Bayesian analyses were carried out, both satisfying the earlier conditions, so that the median probability model is known to be the optimal predictive model.
  • I. The reference prior π(µ, σ) ∝ 1/σ was used for the common parameters, while the standard N(0, σ^2) g-prior was used for a1, b1 and ab11. In each scenario, the models under consideration were given equal prior probabilities of being true.
  • II. The g-prior was also used for the common µ.
• 57.
    model     posterior probability    posterior expected loss
    M1000            0.0009                   237.21
    M1100            0.0347                    60.33
    M1010            0.0009                   177.85
    M1110            0.6103                     0.97
    M1111            0.3532                     3.05
Table 5: Scenario 2 – graphical models, prior I. The posterior inclusion probabilities are p2 = 0.9982, p3 = 0.9644, and p4 = 0.3532; thus M1110 is the median probability model.
• 58.
    model     posterior probability    posterior expected loss
    M1011            0.124                    143.03
    M1101            0.286                     36.78
    M1110            0.456                     10.03
    M1111            0.134                      9.41
Table 6: Scenario 3 – unusual classical models, prior II. The posterior inclusion probabilities are p2 = 0.876, p3 = 0.714, and p4 = 0.544; thus M1111 is the median probability model.
• 59. When the median probability model can fail (Merlise Clyde)
  • Suppose
    – under consideration are M0, the model with only a constant term, and the models Mi having a constant term and the single covariate xi, i = 1, . . . , p, with p ≥ 3;
    – the models have equal prior probability 1/(p+1);
    – all covariates are nearly perfectly correlated, with each other and with y.
  • Then
    – the posterior probability of M0 will be near zero, and that of each of the other Mi will be approximately 1/p;
    – thus the posterior inclusion probabilities will also be approximately 1/p < 1/2;
    – so the median probability model is M0, which will have poor predictive performance compared to any other model.
• 60. But in practice, the median probability model works.
Example: Consider Hald’s regression data (Draper and Smith, 1981), consisting of n = 13 observations on a dependent variable y, with four potential regressors: x1, x2, x3, x4. The full model is thus
    y = c + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ǫ,  ǫ ∼ N(0, σ^2), σ^2 unknown.
There is high correlation between the covariates here.
  • All models that include the constant term are considered. This example does not formally satisfy the theory, since the models are not nested and the conditions of Theorem 3 do not apply.
  • Least squares estimates are used for the parameters.
  • Default posterior model probabilities, P(Ml | y), are computed using the Encompassing Arithmetic Intrinsic Bayes Factor (Berger and Pericchi, 1996), together with equal prior model probabilities.
• 61.
  • Predictive risks, R(Ml), are computed.
    Model        P(Ml|y)     R(Ml)
    c            0.000003    2652.44
    c,1          0.000012    1207.04
    c,2          0.000026     854.85
    c,3          0.000002    1864.41
    c,4          0.000058     838.20
    c,1,2        0.275484       8.19
    c,1,3        0.000006    1174.14
    c,1,4        0.107798      29.73
    c,2,3        0.000229     353.72
    c,2,4        0.000018     821.15
    c,3,4        0.003785     118.59
    c,1,2,3      0.170990       1.21
    c,1,2,4      0.190720       0.18
    c,1,3,4      0.159959       1.71
    c,2,3,4      0.041323      20.42
    c,1,2,3,4    0.049587       0.47
• 62.
  • The posterior inclusion probabilities are
      p1 = Σ_{l: l1=1} P(Ml|y) = 0.95,   p2 = Σ_{l: l2=1} P(Ml|y) = 0.73,
      p3 = Σ_{l: l3=1} P(Ml|y) = 0.43,   p4 = Σ_{l: l4=1} P(Ml|y) = 0.55.
  • Thus the median probability model is {c, x1, x2, x4}, which clearly coincides with the optimal predictive model (it has the smallest predictive risk, 0.18, in the table above).
  • Note that the risk of the maximum probability model, {c, x1, x2}, is considerably higher than that of the median probability model.
• 63. Conclusions
  • Bayesian analysis for the normal linear model (including variable selection) can easily be done with conventional priors.
    – Software exists.
    – Adjustment for the multiple testing problem inherent in variable selection is possible (and is the default method).
    – Summaries (over the 2^p models) of key quantities, such as posterior inclusion probabilities and model-averaged predictive parameter estimates, are available and easily interpretable.
  • If a single model is required for prediction (or understanding), consider the median probability model.
    – It will typically be the optimal predictive model (better than the highest probability model).
    – It is easy to compute (posterior inclusion probabilities are easy).