Lecture 4: Conventional Priors for
Variable Selection in the Linear Model
Outline
• Notation for the linear model and Bayes model selection
• Choice of prior probabilities of models
• G-priors and Zellner-Siow conventional priors for linear models
• Example: determining the distance from Cepheid stars.
• The Robust conventional prior
• The R package BayesVarSel
• Selecting a single model for prediction: the median probability
model
Notation for the normal linear model
• The full model: observe independent Y1, Y2, . . . , Yn, where
  Yi = [x0,i1 β01 + · · · + x0,ik0 β0k0] + xi1 β1 + · · · + xip βp + εi ,
  – the x0,ij and xij being given covariates;
  – the β’s being unknown;
  – the εi being independent N(0, σ²) errors, with σ² unknown.
• Defining Y = (Y1, . . . , Yn)^t, X0 as the n × k0 matrix with elements x0,ij, X as the n × p matrix with elements xij, β0 = (β01, . . . , β0k0)^t, and β = (β1, . . . , βp)^t, this model can be written
  MF : Y ∼ Nn(X0 β0 + X β, σ² I) .
• The simplest model is assumed to be
M0 : Y ∼ Nn(X0 β0, σ² I) ;
with X0 consisting of the covariates that are to be included in all
models (e.g., the intercept in ordinary linear regression).
• Between M0 and MF are 2^p − 2 other models Mi, each additionally including a non-null subset of ki of the remaining p covariates:
  Mi : Y ∼ Nn(X0 β0 + Xi βi, σ² I) ;
Xi is the n × ki matrix consisting of the chosen covariates, i.e., the
chosen columns of X; the corresponding vector of unknown
parameters is denoted βi.
• (β0, σ²) are the common parameters in all models,
• πi(β0, βi, σ²) is the prior distribution of the parameters in Mi,
• Pr(Mi) is the prior probability of model Mi.
Bayes model selection
• Is based on posterior probabilities for each model:
  Pr(Mi | y) = mi(y) Pr(Mi) / Σ_{j=1}^{N} mj(y) Pr(Mj) = [ 1 + Σ_{j≠i} πji Bji ]^{−1}
• πji is the prior odds Pr(Mj)/Pr(Mi)
• Bji is the Bayes factor mj(y)/mi(y), where
  mk(y) = ∫ fk(y | β0, βk, σ²) πk(β0, βk, σ²) dβ0 dβk dσ² ,
  with fk(y | β0, βk, σ²) denoting the normal likelihood for Mk.
mk(y) is the (integrated) likelihood of Mk, quantifying how
likely the observed data is under that model.
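A minimal numerical sketch (Python, with hypothetical marginal likelihoods and prior probabilities) of the identity above: the posterior probabilities are renormalized products mi(y) Pr(Mi), and the same numbers follow from the prior-odds/Bayes-factor form.

```python
# A minimal sketch with hypothetical numbers: compute Pr(Mi | y) from
# marginal likelihoods m_i(y) and prior probabilities Pr(Mi).
import numpy as np

m = np.array([2.1e-4, 8.7e-4, 1.3e-5])   # hypothetical marginal likelihoods m_i(y)
prior = np.array([0.5, 0.25, 0.25])      # hypothetical prior probabilities Pr(Mi)

post = m * prior / np.sum(m * prior)     # Pr(Mi | y) by direct renormalization

# Equivalent form via prior odds and Bayes factors, here for model i = 0:
i = 0
B = m / m[i]                             # Bayes factors B_{ji}
prior_odds = prior / prior[i]            # prior odds pi_{ji}
post_i = 1.0 / np.sum(prior_odds * B)    # matches post[0]
print(post, post_i)
```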
Prior probabilities of models
Two common choices are
• Pr(Mi) = 2^{−p} (i.e., each of the 2^p models has equal prior probability);
• Pr(Mi) = 1 / [ (p + 1) (p choose ki) ] .
– This is equivalent to assigning the collection of all models having
k parameters (in addition to the common parameters) probability
1/(p + 1), and dividing this probability equally among all the
models of size k.
The first choice is bad because it does not adjust for the multiple
testing problem inherent in variable selection (see the next slide).
The second choice is excellent (it does adjust for multiple testing)
and should be used as the default assignment of prior probabilities.
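The sketch below (Python, using math.comb) evaluates both assignments for a hypothetical p and checks that each sums to one over all 2^p models.

```python
# A minimal sketch: the two prior assignments above, for a model Mi with
# ki of the p candidate covariates.
from math import comb

def equal_prior(p, ki):
    return 2.0 ** (-p)                        # every model equally likely

def multiplicity_adjusted_prior(p, ki):
    return 1.0 / ((p + 1) * comb(p, ki))      # 1/(p+1) split equally within size ki

p = 10  # hypothetical number of candidate covariates
# both assignments sum to 1 over all 2^p models:
print(sum(comb(p, k) * equal_prior(p, k) for k in range(p + 1)))                  # 1.0
print(sum(comb(p, k) * multiplicity_adjusted_prior(p, k) for k in range(p + 1)))  # 1.0
```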
Equal model probabilities Bayes variable inclusion
Number of noise variables Number of noise variables
Signal 1 10 40 90 1 10 40 90
β1 : −1.08 .999 .999 .999 .999 .999 .999 .999 .999
β2 : −0.84 .999 .999 .999 .999 .999 .999 .999 .988
β3 : −0.74 .999 .999 .999 .999 .999 .999 .999 .998
β4 : −0.51 .977 .977 .999 .999 .991 .948 .710 .345
β5 : −0.30 .292 .289 .288 .127 .552 .248 .041 .008
β6 : +0.07 .259 .286 .055 .008 .519 .251 .039 .011
β7 : +0.18 .219 .248 .244 .275 .455 .216 .033 .009
β8 : +0.35 .773 .771 .994 .999 .896 .686 .307 .057
β9 : +0.41 .927 .912 .999 .999 .969 .861 .567 .222
β10 : +0.63 .995 .995 .999 .999 .996 .990 .921 .734
False Positives 0 2 5 10 0 1 0 0
(incl. prob. > 0.5)
Table 1: Posterior inclusion probabilities (i.e. the posterior probability that a
variable is in the model) for 10 real variables in a simulated data set.
Priors for the model parameters
There have been many efforts (over more than 30 years) to develop ‘objective model selection priors,’ πi(β0, βi, σ²), for the parameters of model Mi, including
• conventional priors (Jeffreys 1961; Zellner and Siow 1980)
• intrinsic priors (Berger and Pericchi 1996; Moreno et al. 1998)
• fractional priors (O’Hagan 1997)
• expected posterior priors (Pérez and Berger 2002),
• integral priors (Cano et al. 2008),
• divergence based priors (Bayarri and García-Donato 2008)
• Plus many many proposals for particular situations.
Here we only consider the conventional prior approach.
The g- and Zellner-Siow conventional priors for
variable selection
The Zellner g-prior for a model of size ki:
  π^g_i(β0, βi, σ²) = (1/σ²) × Normal_{ki}( βi | 0, n σ² (V_i^t V_i)^{−1} ) ,
with V_i = (In − X0 (X0^t X0)^{−1} X0^t) Xi . This prior is very popular because the computations are all closed form, but it has the bad flaw of being information inconsistent (i.e., when the F statistic between two models goes to ∞, the Bayes factor stays bounded).
The Zellner-Siow prior: As above, but with Normal replaced by
Cauchy. This is a great conventional prior but, unfortunately, the
resulting marginal likelihoods are not closed form.
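To see the information inconsistency concretely, the sketch below uses what I take to be the standard closed-form Bayes factor under the g-prior with g = n (an assumption, not a formula from these slides): B_{i0} = (1 + g)^{(n−k0−ki)/2} [1 + g Q_{i0}]^{−(n−k0)/2}, with Q_{i0} = SSE_i/SSE_0. As Q_{i0} → 0 (the F statistic → ∞), B_{i0} converges to the finite bound (1 + g)^{(n−k0−ki)/2} instead of diverging.

```python
# A sketch of information inconsistency under the g-prior (assuming the
# standard closed-form Bayes factor B_{i0} = (1+g)^{(n-k0-ki)/2} *
# (1 + g*Q_{i0})^{-(n-k0)/2}, with g = n and Q_{i0} = SSE_i/SSE_0).
def g_prior_bf(n, k0, ki, Q, g=None):
    g = n if g is None else g
    return (1 + g) ** ((n - k0 - ki) / 2) * (1 + g * Q) ** (-(n - k0) / 2)

n, k0, ki = 20, 1, 2
for Q in (0.5, 0.1, 0.01, 1e-6, 0.0):   # Q -> 0 means overwhelming evidence for Mi
    print(Q, g_prior_bf(n, k0, ki, Q))
# The Bayes factor converges to the finite bound (1+n)^((n-k0-ki)/2) ~ 1.7e11
# rather than diverging to infinity.
```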
Example: Bayesian Model Selection and Analysis for
Cepheid Star Oscillations
(Berger, Müller, Jefferys and Barnes (2001))
• The astronomical problem
• Bayesian model selection
• The data and likelihood function
• Choice of prior distributions
• Computation and results
The Astronomical Problem
• A Cepheid star pulsates, regularly varying its luminosity (light
output) and size.
• From the Doppler shift as the star surface pulsates, one can
compute surface velocities at certain phases of the star’s period,
thus learning how the radius of the star changes.
• From the luminosity and ‘color’ of the star, one can learn about
the angular radius of the star (the angle from Earth to opposite
edges of the star).
• Combining these allows estimation of s, the star’s distance.
Figure 1: The x’s give the radial velocity measurements at various phases of the star’s oscillation. The curve is a 5th-order trigonometric polynomial fit.
Curve Fitting
To determine the overall change in radius of the star over the star’s
period, the surface velocity must be estimated at phases other than
those actually observed, leading to a curve fitting problem (also for
luminosity). Difficulties:
• Observations have measurement error.
• Phases at which observations are made are unequally spaced.
• Number of possible models (curve fits) entertained varies
between 50 and thousands.
• Resulting models have from 10 to 1000 parameters.
The Data and Statistical Model
• Data:
– m observed radial velocities Ui, i = 1, . . . , m .
– n vectors of photometry data consisting of luminosity
Vi, i = 1, . . . , n, and color index Ci, i = 1, . . . , n.
• Specified standard deviations σ_Ui, σ_Vi, and σ_Ci; unknown adjustment factors are inserted, leading to variances σ²_Ui/τu, σ²_Vi/τv, σ²_Ci/τc.
• The statistical model for measurement error:
  Ui ∼ N(ui, σ²_Ui/τu), Vi ∼ N(vi, σ²_Vi/τv), Ci ∼ N(ci, σ²_Ci/τc),
where ui, vi, and ci denote the true unknown mean velocity,
luminosity, and color index.
Curve Fitting I: Fourier Analysis
• Model the true periodic velocity u, at phase φ, as a
trigonometric polynomial
u = u0 + Σ_{j=1}^{M} [ β1j cos(jφ) + β2j sin(jφ) ],
where u0 is the mean velocity of the star, M is the (unknown)
order of the trigonometric polynomial, and the β1j and β2j are
the unknown Fourier coefficients of the trigonometric polynomial.
• There is a similar polynomial model for the true luminosity v,
having unknown order N.
The resulting statistical models for the column vectors U and V of
observed radial velocities and luminosities are the linear models
U = u01 + Xuβu + εu and V = v01 + Xvβv + εv,
• where u0 and v0 are the (unknown) mean velocity and luminosity
and 1 is the column vector of ones;
• Xu and Xv are matrices of the trigonometric covariates (e.g.,
terms like sin(jφ));
• βu and βv are column vectors of the unknown Fourier
coefficients;
• εu and εv are independently multivariate normal errors: N(0, Gu/τu) and N(0, Gv/τv), where Gu and Gv are the known diagonal matrices of the variances σ²_Ui and σ²_Vi.
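A minimal sketch (Python, hypothetical phase values) of how a design matrix of trigonometric covariates such as Xu can be assembled for a given order M:

```python
# A minimal sketch: build the design matrix X_u of trigonometric covariates
# cos(j*phi), sin(j*phi), j = 1..M, at the observed phases (hypothetical values).
import numpy as np

def trig_design(phases, M):
    cols = []
    for j in range(1, M + 1):
        cols.append(np.cos(j * phases))
        cols.append(np.sin(j * phases))
    return np.column_stack(cols)            # shape (len(phases), 2*M)

phases = np.array([0.05, 0.22, 0.41, 0.58, 0.77, 0.93]) * 2 * np.pi  # hypothetical
Xu = trig_design(phases, M=5)               # order-5 fit, as in Figure 1
print(Xu.shape)                             # (6, 10)
```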
Color need not be modeled separately because it is related to
luminosity (v) and velocity (or change in radius) by
ci = −a[0.1vi + b + 0.5 log (φ0 + ∆ri/s)],
where a and b are known constants, φ0 and s are the angular size
and distance of the star, and ∆r, the change in radius corresponding
to phase φ, is given by
∆r = −g Σ_{j=1}^{M} (1/j) [ β1j sin(j(φ − ∆φ)) − β2j cos(j(φ − ∆φ)) ],
with ‘phase shift’ ∆φ and g a known constant.
Choice of Cepheid Prior Distributions
• The orders of the trigonometric polynomials, (M, N), are given a
uniform distribution up to some cut-off (e.g., (10, 10)).
• τu, τv, τc, which adjust the measurement standard errors, are
given the standard objective priors for ‘scale parameters,’ namely
the Jeffreys-rule priors p(τu) = 1/τu, p(τv) = 1/τv, and p(τc) = 1/τc.
• The mean velocity and luminosity, u0 and v0, are ‘location
parameters’ and so can be assigned the standard objective priors
p(u0) = 1 and p(v0) = 1.
• The angular diameter φ0 and the unknown phase shift ∆φ are
also assigned the objective priors p(∆φ) = 1 and p(φ0) = 1. It is
unclear if these are ‘optimal’ objective priors but the choice was
found to have negligible impact on the answers.
• The Fourier coefficients, βu and βv, occur in linear models, so
the Zellner-Siow priors can be utilized.
• The prior for distance s of the star should account for
– Lutz-Kelker bias: a uniform spatial distribution of Cepheid stars would yield a prior proportional to s².
– The distribution of Cepheids is flattened wrt the galactic plane; we use an exponential distribution.
– So, we use p(s) ∝ s² exp(−|s sin β|/z0),
∗ β being the known galactic latitude of the star (its angle
above the galactic plane),
∗ z0 being the ‘scale height,’ assigned a uniform prior over
the range z0 = 97 ± 7 parsecs.
Computation
(Not considered for the SAMSI MUMS course)
A reversible-jump MCMC algorithm of the type reviewed in
Dellaportas, Forster and Ntzoufras (2000) is used to move between
models and generate posterior distributions and estimates.
• The full conditional distributions for the variance and precision
parameters and hyperparameters are standard gamma and
inverse-gamma distributions and are sampled with Gibbs sampling.
• For ∆φ, φ0 and s, we employ a random-walk Metropolis algorithm
using, as the proposal distribution, a multivariate normal distribution
centered on the current values and with a covariance matrix found
from linearizing the problem for these three parameters.
• The Fourier coefficients βu and βv, as well as u0 and v0, are also
sampled via Metropolis. The natural proposal distributions are found
by combining the normal likelihoods with the normal part of the
Zellner-Siow priors, leading to conjugate normal posterior
distributions.
• Proposal for moves between models:
– A ‘burn-in’ portion of the MCMC with uniform
model proposals yielded initial posterior model
probabilities, which were then used as the proposal
for subsequent model moves.
– Simultaneously, new values were proposed for the Fourier coefficients.
• The proposal distributions listed above led to a well-mixed Markov
chain, so that only 10,000 iterations of the MCMC computation
needed to be performed.
[Figure: MCMC trace of the parallax of T Mon across the 10,000 iterations (x-axis: Trial; y-axis: Parallax).]
[Figure: T Mon: posterior probabilities of the velocity models, by model index (1–7).]
[Figure: T Mon: posterior probabilities of the V photometry models, by model index (1–7).]
[Figure: histogram of the posterior distribution of the parallax (arcsec, proportional to the inverse of distance) of T Mon, determined by model averaging over the various possible models for radial velocity and photometry.]
The ‘Robust’ conventional priors for variable selection (Bayarri, Berger, Forte and García-Donato, 2012; a generalization of proposals by Strawderman (1971, 1973) and Berger (1976, 1980, 1985)): Defining Σi = σ² (V_i^t V_i)^{−1},
  π^R_i(β0, βi, σ²) = (1/(2σ²)) ∫_0^1 N_{ki}( βi | 0, [ (1 + n)/(λ(k0 + ki)) − 1 ] Σi ) λ^{−1/2} dλ .
Although this prior is not closed form, it gives closed form marginal likelihoods, and closed form Bayes factors
  B_{i0} = [ (n + 1)/(ki + k0) ]^{−ki/2} · [ Q_{i0}^{−(n−k0)/2} / (ki + 1) ] · 2F1( (ki + 1)/2 ; (n − k0)/2 ; (ki + 3)/2 ; (1 − Q_{i0}^{−1})(ki + k0)/(1 + n) ) ,
where 2F1 is the standard (Gauss) hypergeometric function and Q_{i0} = SSE_i/SSE_0 is the ratio of the sums of squared errors of models Mi and M0.
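A minimal sketch (Python/SciPy) evaluating the closed-form Bayes factor displayed above from n, k0, ki and the two sums of squared errors; the numbers in the example call are hypothetical.

```python
# A minimal sketch: evaluate B_{i0} for the Robust prior using the closed form
# reproduced above, via the Gauss hypergeometric function 2F1.
from scipy.special import hyp2f1

def robust_bayes_factor(n, k0, ki, sse_i, sse_0):
    """Bayes factor B_{i0} of M_i against M_0 under the Robust prior."""
    Q = sse_i / sse_0                                  # Q_{i0} = SSE_i / SSE_0
    z = (1.0 - 1.0 / Q) * (ki + k0) / (1.0 + n)
    return ((n + 1.0) / (ki + k0)) ** (-ki / 2.0) \
        * Q ** (-(n - k0) / 2.0) / (ki + 1.0) \
        * hyp2f1((ki + 1.0) / 2.0, (n - k0) / 2.0, (ki + 3.0) / 2.0, z)

# e.g. n = 50 observations, intercept-only null (k0 = 1), ki = 3 extra covariates
print(robust_bayes_factor(n=50, k0=1, ki=3, sse_i=30.0, sse_0=100.0))
```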
R package BayesVarSel (García-Donato and Forte 12-12-12)
• freely available at CRAN (for sequential or parallel computation)
• programmed in C using GNU-gsl libraries
• priors allowed:
– prior.betas =
∗ “Robust”,
∗ “Liangetal” (not discussed in the lecture)
∗ “gZellner”
∗ “ZellnerSiow”.
– prior.models =
∗ “Constant” (Pr(Mi) = 1/[# models])
∗ “Jeffreys” (Pr(Mi) = 1 / [ (p + 1) (p choose ki) ])
Common Bayesian outputs given in the package
• The posterior inclusion probability (called Inclusion Probability) for variable i is
  pi = Σ_{l : variable i is in Ml} Pr(Ml | y),
  the overall posterior probability that variable i is in the model.
• The highest probability model (HPM).
• The median probability model (MPM) is the model consisting of those variables whose posterior inclusion probability is ≥ 1/2.
• The Bayesian model averaged predictor of y, at covariates x, is ŷ = x β̂, where β̂ = Σ_l Pr(Ml | y) β̂_l (called the Estimate in the package), with β̂_l being the posterior mean under model Ml.
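The sketch below (Python, with hypothetical posterior model probabilities and posterior means) computes the same four summaries: inclusion probabilities, the HPM, the MPM, and the model-averaged estimate.

```python
# A minimal sketch with hypothetical numbers: the main summaries computed from
# posterior model probabilities and within-model posterior means.
import numpy as np

p = 3  # number of candidate covariates
models = {                      # model (tuple of included variables) ->
    (1,):      (0.10, {1: 1.8}),                   # (P(Ml | y), posterior means)
    (1, 2):    (0.55, {1: 1.5, 2: -0.7}),
    (1, 2, 3): (0.25, {1: 1.4, 2: -0.6, 3: 0.1}),
    (2, 3):    (0.10, {2: -0.9, 3: 0.2}),
}

incl = {i: sum(pr for m, (pr, _) in models.items() if i in m) for i in range(1, p + 1)}
hpm  = max(models, key=lambda m: models[m][0])          # highest probability model
mpm  = tuple(i for i, q in incl.items() if q >= 0.5)    # median probability model
beta_bma = np.zeros(p)                                  # model-averaged estimate
for m, (pr, betas) in models.items():
    for i, b in betas.items():
        beta_bma[i - 1] += pr * b                       # zero for excluded variables

print(incl, hpm, mpm, beta_bma)
```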
Example: Infant obesity
• Understand which factors affect infant obesity
• Response variable: Body Mass Index (BMI ∝ weight/height²)
• Set of 16 explanatory covariates (such as ‘born weight’ and ‘born
height’) with 1002 observations.
• There are 2^16 = 65,536 models.
Figure 2: This page of the printout lists the ten most probable models,
indicating the variables that are in each model. The posterior probability
of each model is also given. Note that the highest probability model only
has probability 0.094; a small HPM probability is common when there are
so many models.
Selecting a single model for prediction:
The Median Probability Model
Context: Prediction with Normal linear models
• Under the full model, the n × 1 observation vector would follow
  y = Xβ + ε ,
  where X is the n × p design matrix, β is the p × 1 vector of unknown coefficients, and ε is N(0, σ² I).
• The possible submodels are
  Ml : y = Xl βl + ε ,
  where l = (l1, l2, . . . , lp) is the model index, li being either 1 or 0 as covariate xi is in or out of the model.
• Assume that one of these models is true, and our goal is to
predict a future observation.
Basics of Bayesian prediction
• The goal is to predict a future y* = x* β + ε, at covariate values x*, using squared error loss (y* − ŷ*)².
• Combining the data and prior yields, for all l,
  – P(Ml | y), the posterior probability of model Ml;
  – πl(βl, σ² | y), the posterior distribution of (βl, σ²).
• The best predictor of y* is, via model averaging,
  ȳ* = x* β̄ ≡ x* Σ_l P(Ml | y) β̄_l ,
  where β̄_l is the posterior mean for β under Ml. (This follows from the fact that, under squared error loss, the optimal Bayes estimate is the posterior mean, and the posterior mean of y* is ȳ*.)
Selecting a single model
• Often a single model Ml is desired for prediction, with the prediction then being ŷl = x* β̄_l.
• A common misperception is that the best single model is that with the largest P(Ml | y);
  – this is true if there are only two models;
  – this is true if X′X is diagonal, σ² is known, and suitable priors are used.
• The best single model will typically depend on x*.
• An important case is when the model will repeatedly be used to make future predictions, and the future covariates will be like the past covariates, in the sense that E[(x*)′ x*] = X′X. Then the averaged squared error predictive loss for future predictions, using Ml when Ml* is the true model, is (adding any needed zeroes to the β’s)
  E_{x*}(y* − ŷl)² = (β_{l*} − β̄_l)′ E[(x*)′ x*] (β_{l*} − β̄_l) + σ²
                   = (β_{l*} − β̄_l)′ X′X (β_{l*} − β̄_l) + σ² .
  Thus our goal is to find the best Ml in terms of the posterior expectation of this loss.
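A minimal sketch (Python, hypothetical numbers) of the loss just displayed: pad the coefficient vectors with zeros for excluded covariates and evaluate (β_{l*} − β̄_l)′ X′X (β_{l*} − β̄_l) + σ².

```python
# A minimal sketch (hypothetical numbers): the averaged squared-error predictive
# loss of using model Ml when Ml* is true, with zeros added for excluded covariates.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))     # hypothetical design matrix
XtX = X.T @ X
sigma2 = 1.0

beta_true = np.array([1.0, -0.5, 0.0, 0.3])   # coefficients under the true model Ml*
beta_l    = np.array([0.9, -0.4, 0.0, 0.0])   # posterior mean under Ml (x4 excluded -> 0)

d = beta_true - beta_l
loss = d @ XtX @ d + sigma2                   # E_{x*}(y* - yhat_l)^2
print(loss)
```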
Posterior inclusion probabilities
The posterior inclusion probability for variable i is
pi ≡ Σ_{l : li=1} P(Ml | y),
i.e., the overall posterior probability that variable i is in the model.
These are of considerable independent interest
• as basic quantities of interest,
• as aids in searches of model space,
• in defining the median probability model.
The (posterior) median probability model
If it exists, the median probability model, Ml∗ , is defined to be the
model consisting of those variables whose posterior inclusion
probability is at least 1/2. Formally, l* is defined, coordinatewise, by
  l*_i = 1 if pi ≥ 1/2, and l*_i = 0 otherwise.   (1)
Note: If computation is done by a model-jumping MCMC, the
median probability model consists of those coordinates that were
present in over half the iterations.
Existence of the median probability model
The median probability model exists when the models under
consideration follow a graphical model structure, including
• when any subset of variables is allowed;
• the situation in which the allowed variables consist of main
effects and interactions, but a higher order interaction is allowed
only if lower order interactions are included;
• a sequence of nested models, such as arises in polynomial
regression and autoregressive time series.
Example (Polynomial Regression): Model Mj is y = Σ_{i=0}^{j} βi x^i + ε.
(Model) j                  0     1     2     3     4     5     6
P(Mj | y)                 ∼0   0.06  0.22  0.29  0.38  0.05   ∼0
(Covariate) i              0     1     2     3     4     5     6
P(x^i is in model | y)     1     1   0.94  0.72  0.43  0.05    0
Thus M3 is the median probability (optimal predictive) model, while
M4 is the maximum probability model.
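Because the models are nested, the inclusion probability of x^i is just the total posterior probability of the models of order at least i; the sketch below reproduces the second row of the table from the first and picks out the median and maximum probability models.

```python
# A minimal sketch: reproduce the inclusion probabilities in the table from the
# model probabilities, using the nested structure (x^i is in Mj iff j >= i).
post = [0.0, 0.06, 0.22, 0.29, 0.38, 0.05, 0.0]      # P(Mj | y), j = 0..6

incl = [round(sum(post[i:]), 2) for i in range(7)]   # [1.0, 1.0, 0.94, 0.72, 0.43, 0.05, 0.0]
mpm = max(i for i in range(7) if incl[i] >= 0.5)     # 3 -> median probability model M3
hpm = post.index(max(post))                          # 4 -> maximum probability model M4
print(incl, mpm, hpm)
```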
Three optimality theorems
Theorem 1. If (i) the models under consideration have graphical structure; (ii) X′X is diagonal, and (iii) the posterior mean of βl is simply the relevant coordinates of β (the posterior mean in the full model), then the best predictive model is the median probability model. Condition (iii) is satisfied under any mix of
• constant priors for the βi;
• independent N(0, σ² λi) priors for the βi, with the λi given (objectively or subjectively specified, or estimated via empirical Bayes) and any prior for σ².
Corollary (Clyde and Parmigiani, 1996). If any submodel of the full model is allowed, X′X is diagonal, N(0, σ² λi) priors are used for the βi, with the λi given and σ² known, and the prior probabilities of the models satisfy
  P(Ml) = Π_{i=1}^{p} (p_i^0)^{li} (1 − p_i^0)^{(1−li)} ,
where p_i^0 is the prior probability that variable xi is in the model, then the optimal predictive model is that with highest posterior probability (which is also the median probability model).
Theorem 2. Suppose a sequence of nested linear models is under consideration. If
(i) prediction is desired at ‘future covariates like the past’ and
(ii) the posterior mean under Ml satisfies β̄_l = b β̂_l, where β̂_l is the least squares estimate and b is common across models,
then the best predictive model is the median probability model.
Condition (ii) is satisfied if we use either
• the objective priors for model parameters; or
• g-type N_{kl}(0, c σ² (X′_l X_l)^{−1}) priors, with the same constant c > 0 for each model, and any prior for σ².
Theorem 3. Theorems 1 and 2 essentially remain true even if there
are non-orthogonal nuisance parameters (i.e., parameters common to
all models) that are assigned the usual noninformative priors.
Example: Nonparametric Regression:
• yi = f(xi) + εi, i = 1, . . . , n, εi ∼ N(0, σ²).
• Represent f by an (orthonormal) series f(x) = Σ_{j=1}^{∞} βj φj(x).
• Base prior distribution: βi ∼ N(0, vi), with vi = c/i^a, where c is unknown and a is specified.
• The model Mj, for j = 1, 2, . . . , n, is given by
  yi = Σ_{k=1}^{j} βk φk(xi) + εi,  εi ∼ N(0, σ²).
• Choose equal prior probabilities (reasonable here) for the models Mj, j = 1, 2, . . . , n. Use the base prior for β_j = (β1, . . . , βj).
• For the data y = (y1, . . . , yn), compute P(Mj | y), the posterior probability of model Mj, for j ≤ n.
• Within Mj, predict β_j by its posterior mean, β̃_j.
Example 1. The Shibata Example
• f(x) = − log(1 − x) for −1 < x < 1.
• Choose {φ1(x), φ2(x), . . .} to be the Chebyshev polynomials.
• Then βi = 2/i, so the ‘optimal’ choice of the prior variances would be vi = 4/i², i.e., c = 4 and a = 2.
• Measure the predictive capability of a model by expected squared
error loss relative to the true function (here known) – thus we
use a frequentist evaluation, as did Shibata.
MaxPr MedPr ModAv BIC AIC
a = 1 0.99 [8] 0.89 [10] 0.84 1.14 [8] 1.09 [7]
a = 2 0.88 [10] 0.80 [16] 0.81 1.14 [8] 1.09 [7]
a = 3 0.88 [9] 0.84 [17] 0.85 1.14 [8] 1.09 [7]
Table 2: For n = 30 and σ² = 1, the expected loss and [average model size] for the maximum probability model (MaxPr), the Median Probability Model (MedPr), Model Averaging (ModAv), and BIC and AIC, in the Shibata example.
Note: It is possible for the median probability model to perform
better than model averaging here, because this is a frequentist
evaluation of error with respect to the true known function.
MaxPr MedPr ModAv BIC AIC
a = 1 0.54 [14] 0.51 [19] 0.47 0.59 [11] 0.59 [13]
a = 2 0.47 [23] 0.43 [43] 0.44 0.59 [11] 0.59 [13]
a = 3 0.47 [22] 0.46 [45] 0.46 0.59 [11] 0.59 [13]
Table 3: For n = 100 and σ² = 1, the expected loss and average model size for the maximum probability model (MaxPr), the Median Probability Model (MedPr), Model Averaging (ModAv), and BIC and AIC, in the Shibata example.
MaxPr MedPr ModAv BIC AIC
a = 1 0.34 [23] 0.33 [26] 0.30 0.41 [12] 0.38 [21]
a = 2 0.26 [42] 0.25 [51] 0.25 0.41 [12] 0.38 [21]
a = 3 0.29 [38] 0.29 [50] 0.29 0.41 [12] 0.38 [21]
Table 4: For n = 2000 and σ² = 3, the expected loss and average model size for the maximum probability model (MaxPr), the Median Probability Model (MedPr), Model Averaging (ModAv), and BIC and AIC, in the Shibata example.
Comments
• AIC is better than BIC (as Shibata showed), but the true
Bayesian procedures are best.
• Model averaging is generally best (not obvious), followed closely
by the median probability model. The maximum probability
model can be considerably inferior.
• BIC is a poor approximation to the Bayesian answers here.
• The true Bayesian answers choose substantially larger models
than AIC (and then shrink towards 0).
ANOVA
Many ANOVA problems, when written in linear model form, yield diagonal X′X, and any such problems will naturally fit under our theory. In particular, this is true for any balanced ANOVA in which each factor has only two levels.
As an example, consider the full two-way ANOVA model with interactions:
  y_{ijk} = µ + a_i + b_j + ab_{ij} + ε_{ijk}
with i = 1, 2, j = 1, 2, k = 1, 2, . . . , K and ε_{ijk} independent N(0, σ²), with σ² unknown. In linear model form, this leads to X′X = 4K I_4.
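A minimal sketch (Python, assuming the usual ±1 effect coding of the two-level factors) verifying that the design matrix of this 2 × 2 layout with K replicates per cell satisfies X′X = 4K I_4:

```python
# A minimal sketch (assuming ±1 effect coding): build the design matrix for the
# 2x2 ANOVA with interaction and check that X'X = 4K * I_4.
import numpy as np

K = 3  # replicates per treatment combination (hypothetical)
rows = []
for a in (+1, -1):          # factor A at two levels
    for b in (+1, -1):      # factor B at two levels
        for _ in range(K):  # K replicates per cell
            rows.append([1, a, b, a * b])   # columns: mu, a_i, b_j, ab_ij
X = np.array(rows)

print(X.T @ X)              # equals 4K * I_4 (here 12 * I_4)
```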
Possible modeling scenarios
We use the simplified notation M1011 instead of M(1,0,1,1),
representing the model having all parameters except a1.
Scenario 1 - All models with the constant µ: Thus the set of
models under consideration is
{M1000, M1100, M1010, M1001, M1101, M1011, M1110, M1111}.
Scenario 2 - Interactions present only with main effects, and µ
included: The set of models under consideration here is
{M1000, M1100, M1010, M1110, M1111}. Note that this set of models
has graphical structure.
Scenario 3 - An analogue of an unusual classical test: In classical
ANOVA testing, it is sometimes argued that one might be interested
in testing for no interaction effect followed by testing for the main
effects, even if the no-interaction test rejected. The four models that
are under consideration in this process, including the constant µ in
all, are {M1101, M1011, M1110, M1111}. This class of models does not
have graphical model structure and yet the median probability model
is guaranteed to be in the class.
Example 2. Montgomery (1991, pp.271–274) considers the effects
of the concentration of a reactant and the amount of a catalyst on
the yield in a chemical process. The reactant concentration is factor
A and has two levels, 15% and 25%. The catalyst is factor B, with
the two levels ‘one bag’ and ‘two bags’ of catalyst. The experiment
was replicated three times and the data are
treatment combination    replicate I   replicate II   replicate III
A low, B low                  28            25             27
A high, B low                 36            32             32
A low, B high                 18            19             23
A high, B high                31            30             29
For each modeling scenario, two Bayesian analyses were carried out,
both satisfying the earlier conditions so that the median probability
model is known to be the optimal predictive model.
• I. The reference prior π(µ, σ) ∝ 1/σ was used for the common parameters, while the standard N(0, σ²) g-prior was used for a1, b1 and ab11. In each scenario, the models under consideration were given equal prior probabilities of being true.
• II. The g-prior was also used for the common µ.
model posterior probability posterior expected loss
M1000 0.0009 237.21
M1100 0.0347 60.33
M1010 0.0009 177.85
M1110 0.6103 0.97
M1111 0.3532 3.05
Table 5: Scenario 2 – graphical models, prior I. The posterior inclusion
probabilities are p2 = 0.9982, p3 = 0.9644, and p4 = 0.3532; thus
M1110 is the median probability model.
model posterior probability posterior expected loss
M1011 0.124 143.03
M1101 0.286 36.78
M1110 0.456 10.03
M1111 0.134 9.41
Table 6: Scenario 3 – unusual classical models, prior II. The posterior
inclusion probabilities are p2 = 0.876, p3 = 0.714, and p4 = 0.544;
thus M1111 is the median probability model.
When the median probability model can fail (Merlise Clyde)
• Suppose
  – under consideration are M0, the model with only a constant term, and the models Mi, having a constant term and the single covariate xi, i = 1, . . . , p, with p ≥ 3;
  – the models have equal prior probability of 1/(p + 1);
  – all covariates are nearly perfectly correlated, together and with y.
• Then
  – the posterior probability of M0 will be near zero, and that of each of the other Mi will be approximately 1/p;
  – thus the posterior inclusion probabilities will also be approximately 1/p < 1/2;
  – so the median probability model is M0, which will have poor predictive performance compared to any other model.
But in practice, the median probability model works:
Example: Consider Hald’s regression data (Draper and Smith,
1981), consisting of n = 13 observations on a dependent variable y,
with four potential regressors: x1, x2, x3, x4. The full model is thus
y = c + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ε,  ε ∼ N(0, σ²),
σ² unknown. There is high correlation between covariates here.
• All models that include the constant term are considered. This
example does not formally satisfy the theory, since the models are not
nested and the conditions of Theorem 3 do not apply.
• Least squares estimates are used for parameters.
• Default posterior model probabilities, P(Ml|y), are computed using
the Encompassing Arithmetic Intrinsic Bayes Factor (Berger and
Pericchi, 1996), together with equal prior model probabilities.
• Predictive risks, R(Ml), are computed.
Model       P(Ml|y)    R(Ml)          Model        P(Ml|y)    R(Ml)
c           0.000003   2652.44        c,2,3        0.000229    353.72
c,1         0.000012   1207.04        c,2,4        0.000018    821.15
c,2         0.000026    854.85        c,3,4        0.003785    118.59
c,3         0.000002   1864.41        c,1,2,3      0.170990      1.21
c,4         0.000058    838.20        c,1,2,4      0.190720      0.18
c,1,2       0.275484      8.19        c,1,3,4      0.159959      1.71
c,1,3       0.000006   1174.14        c,2,3,4      0.041323     20.42
c,1,4       0.107798     29.73        c,1,2,3,4    0.049587      0.47
• The posterior inclusion probabilities are
  p1 = Σ_{l:l1=1} P(Ml|y) = 0.95,   p2 = Σ_{l:l2=1} P(Ml|y) = 0.73,
  p3 = Σ_{l:l3=1} P(Ml|y) = 0.43,   p4 = Σ_{l:l4=1} P(Ml|y) = 0.55.
• Thus the median probability model is {c, x1, x2, x4}, which clearly coincides with the optimal predictive model.
• Note that the risk of the maximum probability model {c, x1, x2}
is considerably higher than that of the median probability model.
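A minimal sketch (Python) recomputing these inclusion probabilities and the median probability model directly from the table of P(Ml | y) above:

```python
# A minimal sketch: recompute the inclusion probabilities and the median
# probability model from the table of P(Ml | y) for the Hald data.
post = {  # model (variables beyond the constant c) -> P(Ml | y), from the table
    ():        0.000003, (1,):      0.000012, (2,):      0.000026, (3,):      0.000002,
    (4,):      0.000058, (1, 2):    0.275484, (1, 3):    0.000006, (1, 4):    0.107798,
    (2, 3):    0.000229, (2, 4):    0.000018, (3, 4):    0.003785,
    (1, 2, 3): 0.170990, (1, 2, 4): 0.190720, (1, 3, 4): 0.159959,
    (2, 3, 4): 0.041323, (1, 2, 3, 4): 0.049587,
}
incl = {i: sum(p for m, p in post.items() if i in m) for i in (1, 2, 3, 4)}
mpm = [i for i, p in incl.items() if p >= 0.5]
print(incl)   # approximately {1: 0.95, 2: 0.73, 3: 0.43, 4: 0.55}
print(mpm)    # [1, 2, 4]  -> the median probability model {c, x1, x2, x4}
```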
Conclusions
• Bayesian analysis for the normal linear model (including variable
selection) can easily be done with conventional priors.
– Software exists.
– Adjustment for the multiple testing problem inherent in variable
selection is possible (and is the default method).
– Summaries (over the 2^p models) of key quantities – such as posterior inclusion probabilities and model averaged predictive parameter estimates – are available and easily interpretable.
• If a single model is required for prediction (or understanding),
consider the median probability model.
– It will typically be the optimal predictive model (better than the
highest probability model).
– It is easy to compute (posterior inclusion probabilities are easy).
More Related Content

What's hot

Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)jillmitchell8778
 
Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)jillmitchell8778
 
Pysap#3.1 Pythonでショートコーディング
Pysap#3.1 PythonでショートコーディングPysap#3.1 Pythonでショートコーディング
Pysap#3.1 PythonでショートコーディングFumihito Yokoyama
 
Algorithm Design
Algorithm DesignAlgorithm Design
Algorithm Designsyou6162
 
Module iii sp
Module iii spModule iii sp
Module iii spVijaya79
 
Optimal debt maturity management
Optimal debt maturity managementOptimal debt maturity management
Optimal debt maturity managementADEMU_Project
 
Journey to structure from motion
Journey to structure from motionJourney to structure from motion
Journey to structure from motionJa-Keoung Koo
 
Prévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAMPrévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAMCdiscount
 
NIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual imagesNIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual imageszukun
 
Open GL T0074 56 sm4
Open GL T0074 56 sm4Open GL T0074 56 sm4
Open GL T0074 56 sm4Roziq Bahtiar
 

What's hot (13)

Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)Statistics lecture 13 (chapter 13)
Statistics lecture 13 (chapter 13)
 
Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)Statistics lecture 11 (chapter 11)
Statistics lecture 11 (chapter 11)
 
Pysap#3.1 Pythonでショートコーディング
Pysap#3.1 PythonでショートコーディングPysap#3.1 Pythonでショートコーディング
Pysap#3.1 Pythonでショートコーディング
 
Algorithm Design
Algorithm DesignAlgorithm Design
Algorithm Design
 
Module iii sp
Module iii spModule iii sp
Module iii sp
 
09 review
09 review09 review
09 review
 
Optimal debt maturity management
Optimal debt maturity managementOptimal debt maturity management
Optimal debt maturity management
 
Test
TestTest
Test
 
Journey to structure from motion
Journey to structure from motionJourney to structure from motion
Journey to structure from motion
 
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
Causal Inference Opening Workshop - A Bracketing Relationship between Differe...
 
Prévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAMPrévision de consommation électrique avec adaptive GAM
Prévision de consommation électrique avec adaptive GAM
 
NIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual imagesNIPS2008: tutorial: statistical models of visual images
NIPS2008: tutorial: statistical models of visual images
 
Open GL T0074 56 sm4
Open GL T0074 56 sm4Open GL T0074 56 sm4
Open GL T0074 56 sm4
 

Similar to 2018 MUMS Fall Course - Conventional Priors for Bayesian Model Uncertainty - Jim Berger, September 18, 2018

DETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTDETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTAM Publications
 
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part INon-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part IShiga University, RIKEN
 
Approximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelApproximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelMatt Moores
 
R package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable LikelihoodsR package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable LikelihoodsMatt Moores
 
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)마이캠퍼스
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesAdi Handarbeni
 
Epidemic processes on switching networks
Epidemic processes on switching networksEpidemic processes on switching networks
Epidemic processes on switching networksNaoki Masuda
 
Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...Alexander Decker
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibrationzukun
 
Vu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptxVu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptxQucngV
 
Management of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamicsManagement of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamicsAlexander Litvinenko
 

Similar to 2018 MUMS Fall Course - Conventional Priors for Bayesian Model Uncertainty - Jim Berger, September 18, 2018 (20)

Math Exam Help
Math Exam HelpMath Exam Help
Math Exam Help
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
DETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECTDETECTION OF MOVING OBJECT
DETECTION OF MOVING OBJECT
 
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part INon-Gaussian Methods for Learning Linear Structural Equation Models: Part I
Non-Gaussian Methods for Learning Linear Structural Equation Models: Part I
 
Approximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts modelApproximate Bayesian computation for the Ising/Potts model
Approximate Bayesian computation for the Ising/Potts model
 
Input analysis
Input analysisInput analysis
Input analysis
 
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
2018 MUMS Fall Course - Introduction to statistical and mathematical model un...
 
Ch11.kriging
Ch11.krigingCh11.kriging
Ch11.kriging
 
R package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable LikelihoodsR package bayesImageS: Scalable Inference for Intractable Likelihoods
R package bayesImageS: Scalable Inference for Intractable Likelihoods
 
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)슬로우캠퍼스:  scikit-learn & 머신러닝 (강박사)
슬로우캠퍼스: scikit-learn & 머신러닝 (강박사)
 
Mk slides.ppt
Mk slides.pptMk slides.ppt
Mk slides.ppt
 
Anov af03
Anov af03Anov af03
Anov af03
 
Introduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resourcesIntroduction geostatistic for_mineral_resources
Introduction geostatistic for_mineral_resources
 
Epidemic processes on switching networks
Epidemic processes on switching networksEpidemic processes on switching networks
Epidemic processes on switching networks
 
Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...Development of a test statistic for testing equality of two means under unequ...
Development of a test statistic for testing equality of two means under unequ...
 
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
 
Lecture9 camera calibration
Lecture9 camera calibrationLecture9 camera calibration
Lecture9 camera calibration
 
Lecture9
Lecture9Lecture9
Lecture9
 
Vu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptxVu_HPSC2012_02.pptx
Vu_HPSC2012_02.pptx
 
Management of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamicsManagement of uncertainties in numerical aerodynamics
Management of uncertainties in numerical aerodynamics
 

More from The Statistical and Applied Mathematical Sciences Institute

More from The Statistical and Applied Mathematical Sciences Institute (20)

Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
Causal Inference Opening Workshop - Latent Variable Models, Causal Inference,...
 
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
2019 Fall Series: Special Guest Lecture - 0-1 Phase Transitions in High Dimen...
 
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
Causal Inference Opening Workshop - Causal Discovery in Neuroimaging Data - F...
 
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
Causal Inference Opening Workshop - Smooth Extensions to BART for Heterogeneo...
 
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
Causal Inference Opening Workshop - Testing Weak Nulls in Matched Observation...
 
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...Causal Inference Opening Workshop - Difference-in-differences: more than meet...
Causal Inference Opening Workshop - Difference-in-differences: more than meet...
 
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
Causal Inference Opening Workshop - New Statistical Learning Methods for Esti...
 
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
Causal Inference Opening Workshop - Bipartite Causal Inference with Interfere...
 
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
Causal Inference Opening Workshop - Bridging the Gap Between Causal Literatur...
 
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
Causal Inference Opening Workshop - Some Applications of Reinforcement Learni...
 
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
Causal Inference Opening Workshop - Bracketing Bounds for Differences-in-Diff...
 
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
Causal Inference Opening Workshop - Assisting the Impact of State Polcies: Br...
 
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
Causal Inference Opening Workshop - Experimenting in Equilibrium - Stefan Wag...
 
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
Causal Inference Opening Workshop - Targeted Learning for Causal Inference Ba...
 
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
Causal Inference Opening Workshop - Bayesian Nonparametric Models for Treatme...
 
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
2019 Fall Series: Special Guest Lecture - Adversarial Risk Analysis of the Ge...
 
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
2019 Fall Series: Professional Development, Writing Academic Papers…What Work...
 
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
2019 GDRR: Blockchain Data Analytics - Machine Learning in/for Blockchain: Fu...
 
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
2019 GDRR: Blockchain Data Analytics - QuTrack: Model Life Cycle Management f...
 
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
2019 GDRR: Blockchain Data Analytics - Modeling Cryptocurrency Markets with T...
 

Recently uploaded

Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 

Recently uploaded (20)

Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 

2018 MUMS Fall Course - Conventional Priors for Bayesian Model Uncertainty - Jim Berger, September 18, 2018

• 8. Priors for the model parameters
There have been many efforts (over more than 30 years) to develop ‘objective model selection priors’ πi(β0, βi, σ^2) for the parameters of model Mi, including
  • conventional priors (Jeffreys 1961; Zellner and Siow 1980)
  • intrinsic priors (Berger and Pericchi 1996; Moreno et al. 1998)
  • fractional priors (O’Hagan 1997)
  • expected posterior priors (Pérez and Berger 2002)
  • integral priors (Cano et al. 2008)
  • divergence-based priors (Bayarri and García-Donato 2008)
  • plus many proposals for particular situations.
Here we consider only the conventional prior approach.
• 9. The g- and Zellner-Siow conventional priors for variable selection
The Zellner g-prior for a model of size ki:
    π_i^g(β0, βi, σ^2) = (1/σ^2) × Normal_ki( βi | 0, n σ^2 (V_i^t V_i)^{-1} ),  with V_i = (I_n − X0 (X0^t X0)^{-1} X0^t) X_i .
This prior is very popular because the computations are all closed form, but it has the serious flaw of being information inconsistent (i.e., as the F statistic comparing two models goes to ∞, the Bayes factor stays bounded).
The Zellner-Siow prior: as above, but with the Normal replaced by a Cauchy. This is an excellent conventional prior but, unfortunately, the resulting marginal likelihoods are not closed form.
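One useful fact: the Zellner-Siow Cauchy prior can be written as a scale mixture of g-priors, with g ~ Inverse-Gamma(1/2, n/2); one common computational route is therefore to integrate the closed-form g-prior quantities over g numerically. Below is a minimal base-R sketch of a single draw from the prior under that representation; V, sigma2 and n are placeholder inputs, not quantities from the lecture.

## One draw of beta_i from the Zellner-Siow prior, using its representation as a
## g-prior with g ~ Inverse-Gamma(1/2, n/2) (a sketch; all inputs are placeholders).
draw_zs <- function(V, sigma2, n) {
  ki    <- ncol(V)
  g     <- 1 / rgamma(1, shape = 0.5, rate = n / 2)   # g ~ InvGamma(1/2, n/2)
  Sigma <- g * sigma2 * solve(crossprod(V))           # g * sigma^2 * (V'V)^{-1}
  drop(t(chol(Sigma)) %*% rnorm(ki))                  # one multivariate normal draw
}

set.seed(1)
V <- matrix(rnorm(50 * 3), 50, 3)   # hypothetical covariate matrix, n = 50, ki = 3
draw_zs(V, sigma2 = 1, n = 50)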
• 10. Example: Bayesian Model Selection and Analysis for Cepheid Star Oscillations (Berger, Müller, Jefferys and Barnes, 2001)
  • The astronomical problem
  • Bayesian model selection
  • The data and likelihood function
  • Choice of prior distributions
  • Computation and results
• 11. The Astronomical Problem
  • A Cepheid star pulsates, regularly varying its luminosity (light output) and size.
  • From the Doppler shift as the star surface pulsates, one can compute surface velocities at certain phases of the star’s period, thus learning how the radius of the star changes.
  • From the luminosity and ‘color’ of the star, one can learn about the angular radius of the star (the angle subtended at Earth by opposite edges of the star).
  • Combining the two allows estimation of s, the star’s distance.
• 12. Figure 1: The x’s give the radial velocity measurements at various phases of the star’s oscillation. The curve is a 5th-order trigonometric polynomial fit.
• 13. Curve Fitting
To determine the overall change in radius of the star over the star’s period, the surface velocity must be estimated at phases other than those actually observed, leading to a curve-fitting problem (and similarly for luminosity). Difficulties:
  • Observations have measurement error.
  • Phases at which observations are made are unequally spaced.
  • The number of possible models (curve fits) entertained varies between 50 and thousands.
  • The resulting models have from 10 to 1000 parameters.
• 14. The Data and Statistical Model
  • Data:
    – m observed radial velocities Ui, i = 1, . . . , m.
    – n vectors of photometry data consisting of luminosity Vi, i = 1, . . . , n, and color index Ci, i = 1, . . . , n.
  • Specified standard deviations σ_Ui, σ_Vi, and σ_Ci; unknown adjustment factors are inserted, leading to variances σ_Ui^2/τu, σ_Vi^2/τv, σ_Ci^2/τc.
  • The statistical model for measurement error:
      Ui ∼ N(ui, σ_Ui^2/τu),  Vi ∼ N(vi, σ_Vi^2/τv),  Ci ∼ N(ci, σ_Ci^2/τc),
    where ui, vi, and ci denote the true unknown mean velocity, luminosity, and color index.
• 15. Curve Fitting I: Fourier Analysis
  • Model the true periodic velocity u, at phase φ, as a trigonometric polynomial
      u = u0 + Σ_{j=1}^{M} [ β_{1j} cos(jφ) + β_{2j} sin(jφ) ],
    where u0 is the mean velocity of the star, M is the (unknown) order of the trigonometric polynomial, and the β_{1j} and β_{2j} are the unknown Fourier coefficients of the trigonometric polynomial.
  • There is a similar polynomial model for the true luminosity v, having unknown order N.
• 16. The resulting statistical models for the column vectors U and V of observed radial velocities and luminosities are the linear models U = u0 1 + Xu βu + εu and V = v0 1 + Xv βv + εv,
  • where u0 and v0 are the (unknown) mean velocity and luminosity and 1 is the column vector of ones;
  • Xu and Xv are matrices of the trigonometric covariates (e.g., terms like sin(jφ); a small construction sketch follows below);
  • βu and βv are column vectors of the unknown Fourier coefficients;
  • εu and εv are independent multivariate normal errors: N(0, Gu/τu) and N(0, Gv/τv), where Gu and Gv are the known diagonal matrices of the variances σ_Ui^2 and σ_Vi^2.
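To make the structure of Xu concrete, here is a minimal base-R sketch that builds the trigonometric design matrix for a given vector of phases and order M; the phases below are randomly generated placeholders, not the actual observations.

## Build the order-M trigonometric design matrix: columns cos(j*phi), sin(j*phi), j = 1,...,M.
trig_design <- function(phi, M) {
  cols <- lapply(1:M, function(j) cbind(cos(j * phi), sin(j * phi)))
  X <- do.call(cbind, cols)
  colnames(X) <- as.vector(rbind(paste0("cos", 1:M), paste0("sin", 1:M)))
  X
}

set.seed(1)
phi <- runif(30, 0, 2 * pi)      # hypothetical, unequally spaced phases (radians)
Xu  <- trig_design(phi, M = 5)   # 30 x 10 design matrix for a 5th-order fit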
• 17. Color need not be modeled separately because it is related to luminosity (v) and velocity (or change in radius) by
    ci = −a [ 0.1 vi + b + 0.5 log(φ0 + ∆ri/s) ],
where a and b are known constants, φ0 and s are the angular size and distance of the star, and ∆r, the change in radius corresponding to phase φ, is given by
    ∆r = −g Σ_{j=1}^{M} (1/j) [ β_{1j} sin(j(φ − ∆φ)) − β_{2j} cos(j(φ − ∆φ)) ],
with ‘phase shift’ ∆φ and g a known constant.
• 18. Choice of Cepheid Prior Distributions
  • The orders of the trigonometric polynomials, (M, N), are given a uniform distribution up to some cut-off (e.g., (10, 10)).
  • τu, τv, τc, which adjust the measurement standard errors, are given the standard objective priors for ‘scale parameters,’ namely the Jeffreys-rule priors p(τu) = 1/τu, p(τv) = 1/τv, and p(τc) = 1/τc.
  • The mean velocity and luminosity, u0 and v0, are ‘location parameters’ and so can be assigned the standard objective priors p(u0) = 1 and p(v0) = 1.
  • The angular diameter φ0 and the unknown phase shift ∆φ are also assigned the objective priors p(∆φ) = 1 and p(φ0) = 1. It is unclear whether these are ‘optimal’ objective priors, but the choice was found to have negligible impact on the answers.
• 19.
  • The Fourier coefficients, βu and βv, occur in linear models, so the Zellner-Siow priors can be utilized.
  • The prior for the distance s of the star should account for
    – Lutz-Kelker bias: a uniform spatial distribution of Cepheid stars would yield a prior proportional to s^2;
    – the flattening of the distribution of Cepheids with respect to the galactic plane, for which we use an exponential distribution.
  • So we use p(s) ∝ s^2 exp(−|s sin β|/z0) (evaluated in the sketch below), with
    – β the known galactic latitude of the star (its angle above the galactic plane),
    – z0 the ‘scale height,’ assigned a uniform prior over the range z0 = 97 ± 7 parsecs.
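A minimal base-R sketch of this distance prior, evaluated and normalized on a grid; the galactic latitude and distance range used here are hypothetical illustration values, not those of any particular star.

## Unnormalized prior p(s) proportional to s^2 * exp(-|s sin(beta)| / z0),
## normalized numerically over a grid of distances (in parsecs).
distance_prior <- function(s, beta_gal, z0) s^2 * exp(-abs(s * sin(beta_gal)) / z0)

s_grid <- seq(1, 5000, by = 1)                               # hypothetical distance grid
dens   <- distance_prior(s_grid, beta_gal = 10 * pi / 180,   # hypothetical latitude
                         z0 = 97)                            # central scale-height value
dens   <- dens / sum(dens)                                   # normalize over the grid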
• 20. Computation (not considered for the SAMSI MUMS course)
A reversible-jump MCMC algorithm of the type reviewed in Dellaportas, Forster and Ntzoufras (2000) is used to move between models and to generate posterior distributions and estimates.
  • The full conditional distributions for the variance and precision parameters and hyperparameters are standard gamma and inverse-gamma distributions and are sampled with Gibbs sampling.
• 21.
  • For ∆φ, φ0 and s, we employ a random-walk Metropolis algorithm using, as the proposal distribution, a multivariate normal distribution centered on the current values and with a covariance matrix found from linearizing the problem for these three parameters.
  • The Fourier coefficients βu and βv, as well as u0 and v0, are also sampled via Metropolis. The natural proposal distributions are found by combining the normal likelihoods with the normal part of the Zellner-Siow priors, leading to conjugate normal posterior distributions.
  • Proposal for moves between models:
    – A ‘burn-in’ portion of the MCMC with uniform model proposals yielded initial posterior model probabilities, which were then used as the proposal for subsequent model moves.
    – Simultaneously, new values were proposed for the Fourier coefficients.
• 22. The proposal distributions listed above led to a well-mixed Markov chain, so that only 10,000 iterations of the MCMC computation needed to be performed.
[Figure: trace plot of the sampled parallax of T Mon against trial (iteration) number, 0 to 10,000.]
• 23. [Figure: posterior probabilities of the T Mon velocity models, plotted against model index 1 to 7.]
• 24. [Figure: posterior probabilities of the T Mon V photometry models, plotted against model index 1 to 7.]
• 25. Figure: The posterior distribution of the parallax (in arcsec) of T Mon; parallax is proportional to the inverse of distance. This was determined by model averaging over the various possible models for radial velocity and photometry.
• 26. The ‘Robust’ conventional prior for variable selection
(Bayarri, Berger, Forte and García-Donato, 2012; a generalization of proposals by Strawderman (1971, 1973) and Berger (1976, 1980, 1985).)
Defining Σ_i = σ^2 (V_i^t V_i)^{-1},
    π_i^R(β0, βi, σ^2) = (1/(2σ^2)) ∫_0^1 N_ki( βi | 0, [ (1+n)/(λ(k0+ki)) − 1 ] Σ_i ) λ^{−1/2} dλ .
Although this prior is not closed form, it gives closed-form marginal likelihoods, and closed-form Bayes factors (see the sketch below)
    B_{i0} = [ (n+1)/(ki+k0) ]^{−ki/2} × Q_{i0}^{−(n−k0)/2} / (ki+1) × 2F1[ (ki+1)/2 ; (n−k0)/2 ; (ki+3)/2 ; (1 − Q_{i0}^{−1})(ki+k0)/(1+n) ] ,
where 2F1 is the standard (Gauss) hypergeometric function and Q_{i0} = SSE_i/SSE_0 is the ratio of the sums of squared errors of models Mi and M0.
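A minimal R sketch of this closed-form Bayes factor, using the Gauss hypergeometric function from the gsl package (an assumption of convenience; any 2F1 implementation would do). The numbers in the example call are made up.

## Robust-prior Bayes factor B_i0 as displayed above; Qi0 = SSE_i / SSE_0.
# install.packages("gsl")
library(gsl)

robust_BF <- function(Qi0, n, ki, k0) {
  z <- (1 - 1 / Qi0) * (ki + k0) / (1 + n)
  ((n + 1) / (ki + k0))^(-ki / 2) * Qi0^(-(n - k0) / 2) / (ki + 1) *
    hyperg_2F1(a = (ki + 1) / 2, b = (n - k0) / 2, c = (ki + 3) / 2, x = z)
}

robust_BF(Qi0 = 0.6, n = 50, ki = 3, k0 = 1)   # hypothetical inputs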
• 27. R package BayesVarSel (García-Donato and Forte, 12-12-12)
  • freely available at CRAN (for sequential or parallel computation)
  • programmed in C using GNU gsl libraries
  • priors allowed (a usage sketch follows below):
    – prior.betas =
      ∗ “Robust”
      ∗ “Liangetal” (not discussed in the lecture)
      ∗ “gZellner”
      ∗ “ZellnerSiow”
    – prior.models =
      ∗ “Constant” (Pr(Mi) = 1/[# models])
      ∗ “Jeffreys” (Pr(Mi) = 1/[(p+1) (p choose ki)])
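A minimal, hypothetical usage sketch: the data frame and formula are placeholders, and the prior.betas / prior.models labels are those listed on this slide; argument names and defaults may differ across versions of the package, so check ?Bvs in your installation.

## Hypothetical call to BayesVarSel::Bvs (data set, formula and variable names made up).
# install.packages("BayesVarSel")
library(BayesVarSel)

fit <- Bvs(formula = y ~ x1 + x2 + x3 + x4, data = mydata,
           prior.betas  = "Robust",     # robust conventional prior for the betas
           prior.models = "Jeffreys")   # multiplicity-adjusting prior over models
fit   # prints the most probable models and the posterior inclusion probabilities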
• 28. Common Bayesian outputs given in the package
  • The posterior inclusion probability (called Inclusion Probability) for variable i is
      p_i = Σ_{l : variable i is in Ml} Pr(Ml | y),
    the overall posterior probability that variable i is in the model.
  • The highest probability model (HPM).
  • The median probability model (MPM) is the model consisting of those variables whose posterior inclusion probability is ≥ 1/2.
  • The Bayesian model-averaged predictor of y, at covariates x, is ŷ = x β̂, where β̂ = Σ_l Pr(Ml | y) β̂_l (called the Estimate in the package), with β̂_l being the posterior mean under model Ml.
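These summaries are easy to compute directly from any table of models, posterior model probabilities, and within-model posterior means. A minimal base-R sketch with made-up numbers (three covariates, four models):

## Toy illustration of inclusion probabilities, HPM, MPM and the BMA estimate.
models <- rbind(c(0,0,0), c(1,0,0), c(1,1,0), c(1,1,1))                # inclusion indicators per model
postpr <- c(0.05, 0.40, 0.35, 0.20)                                    # Pr(M_l | y), made up
betas  <- rbind(c(0,0,0), c(1.2,0,0), c(1.0,0.6,0), c(0.9,0.5,0.1))    # posterior means, made up

incl <- colSums(models * postpr)      # posterior inclusion probabilities p_i
hpm  <- models[which.max(postpr), ]   # highest probability model
mpm  <- as.integer(incl >= 0.5)       # median probability model
bma  <- colSums(betas * postpr)       # model-averaged coefficient estimate ('Estimate')

incl; hpm; mpm; bma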
• 29. Example: Infant obesity
  • Understand which factors affect infant obesity.
  • Response variable: Body Mass Index (BMI ∝ weight/height^2).
  • Set of 16 explanatory covariates (such as ‘born weight’ and ‘born height’) with 1002 observations.
  • There are 2^16 = 65,536 models.
• 30. Figure 2: This page of the printout lists the ten most probable models, indicating the variables that are in each model. The posterior probability of each model is also given. Note that the highest probability model has probability only 0.094; a small HPM probability is common when there are so many models.
• 34. Selecting a single model for prediction: The Median Probability Model
• 35. Context: Prediction with Normal linear models
  • Under the full model, the n × 1 observation vector would follow
      y = Xβ + ǫ,
    where X is the n × p design matrix, β is the p × 1 vector of unknown coefficients, and ǫ is N(0, σ^2 I).
  • The possible submodels are
      Ml : y = Xl βl + ǫ,
    where l = (l1, l2, . . . , lp) is the model index, li being 1 or 0 according to whether covariate xi is in or out of the model.
  • Assume that one of these models is true, and that our goal is to predict a future observation.
• 36. Basics of Bayesian prediction
  • The goal is to predict a future y* = x* β + ǫ, at covariate values x*, under squared error loss (y* − ŷ*)^2.
  • Combining the data and prior yields, for all l,
    – P(Ml | y), the posterior probability of model Ml;
    – πl(βl, σ^2 | y), the posterior distribution of (βl, σ^2).
  • The best predictor of y* is, via model averaging,
      ȳ* = x* β̄ ≡ x* Σ_l P(Ml | y) β̄_l,
    where β̄_l is the posterior mean for β under Ml. (This follows from the fact that, under squared error loss, the optimal Bayes estimate is the posterior mean, and the posterior mean of y* is ȳ*.)
• 37. Selecting a single model
  • Often a single model Ml is desired for prediction, with the prediction then being ŷ_l = x* β̄_l.
  • A common misperception is that the best single model is the one with the largest P(Ml | y);
    – this is true if there are only two models;
    – this is true if X'X is diagonal, σ^2 is known, and suitable priors are used.
  • The best single model will typically depend on x*.
• 38. An important case is when the model will repeatedly be used to make future predictions, and the future covariates will be like the past covariates, in the sense that E[(x*)' x*] = X'X. Then the averaged squared-error predictive loss for future predictions, using Ml when Ml* is the true model, is (adding any needed zeroes to the β)
    E_{x*}(y* − ŷ_l)^2 = (β_{l*} − β_l)' E[(x*)' x*] (β_{l*} − β_l) + σ^2 = (β_{l*} − β_l)' X'X (β_{l*} − β_l) + σ^2 .
Thus our goal is to find the best Ml in terms of the posterior expectation of this loss; a small sketch of the loss follows below.
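A minimal base-R sketch of this loss as a function; zeros are filled in for coefficients that a submodel excludes, and all inputs in the example are made up.

## Averaged squared-error predictive loss: (b_true - b_l)' X'X (b_true - b_l) + sigma^2.
pred_loss <- function(beta_true, beta_l, X, sigma2) {
  d <- beta_true - beta_l
  drop(t(d) %*% crossprod(X) %*% d) + sigma2
}

set.seed(2)
X <- matrix(rnorm(40), 20, 2)                                        # hypothetical design
pred_loss(beta_true = c(1, 0.5), beta_l = c(1.1, 0), X, sigma2 = 1)  # hypothetical coefficients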
• 39. Posterior inclusion probabilities
The posterior inclusion probability for variable i is
    p_i ≡ Σ_{l : li = 1} P(Ml | y),
i.e., the overall posterior probability that variable i is in the model. These are of considerable independent interest
  • as basic quantities of interest,
  • as aids in searches of model space,
  • in defining the median probability model.
• 40. The (posterior) median probability model
If it exists, the median probability model, Ml*, is defined to be the model consisting of those variables whose posterior inclusion probability is at least 1/2. Formally, l* is defined, coordinatewise, by
    l*_i = 1 if p_i ≥ 1/2, and l*_i = 0 otherwise.   (1)
Note: If computation is done by a model-jumping MCMC, the median probability model consists of those coordinates that were present in over half the iterations (see the small sketch below).
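A one-line base-R version of that note, for a hypothetical (iterations x p) 0/1 matrix recording which variables were in the model at each MCMC iteration:

## Median probability model read off a model-jumping MCMC run; 'visits' is 0/1, one row per iteration.
mpm_from_mcmc <- function(visits) as.integer(colMeans(visits) >= 0.5)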
• 41. Existence of the median probability model
The median probability model exists when the models under consideration follow a graphical model structure, including
  • when any subset of variables is allowed;
  • the situation in which the allowed variables consist of main effects and interactions, but a higher-order interaction is allowed only if the lower-order interactions are included;
  • a sequence of nested models, such as arises in polynomial regression and autoregressive time series.
• 42. Example (Polynomial Regression): Model j is y = Σ_{i=0}^{j} βi x^i + ǫ.
    (Model) j:                 0      1      2      3      4      5      6
    P(Mj | y):                ∼0    0.06   0.22   0.29   0.38   0.05    ∼0
    (Covariate) i:             0      1      2      3      4      5      6
    P(x^i is in model | y):    1      1    0.94   0.72   0.43   0.05     0
Thus M3 is the median probability (optimal predictive) model, while M4 is the maximum probability model.
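Because the models are nested, covariate x^i appears in Mj exactly when j ≥ i, so each inclusion probability in the table is just a tail sum of the model probabilities. A quick base-R check:

## Reproduce the inclusion probabilities from the model probabilities above.
postpr <- c(0, 0.06, 0.22, 0.29, 0.38, 0.05, 0)   # P(M_j | y), j = 0, ..., 6
incl   <- rev(cumsum(rev(postpr)))[-1]            # p_i = sum_{j >= i} P(M_j | y), i = 1, ..., 6
round(incl, 2)                                    # 1.00 0.94 0.72 0.43 0.05 0.00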
• 43. Three optimality theorems
Theorem 1. If (i) the models under consideration have graphical structure; (ii) X'X is diagonal; and (iii) the posterior mean of βl is simply the relevant coordinates of the posterior mean of β in the full model, then the best predictive model is the median probability model.
Condition (iii) is satisfied under any mix of
  • constant priors for the βi;
  • independent N(0, σ^2 λi) priors for the βi, with the λi given (objectively or subjectively specified, or estimated via empirical Bayes) and any prior for σ^2.
• 44. Corollary (Clyde and Parmigiani, 1996). If any submodel of the full model is allowed, X'X is diagonal, N(0, σ^2 λi) priors are used for the βi with the λi given and σ^2 known, and the prior probabilities of the models satisfy
    P(Ml) = Π_{i=1}^{p} (p_i^0)^{li} (1 − p_i^0)^{1−li},
where p_i^0 is the prior probability that variable xi is in the model, then the optimal predictive model is the one with highest posterior probability (which is also the median probability model).
• 45. Theorem 2. Suppose a sequence of nested linear models is under consideration. If (i) prediction is desired at ‘future covariates like the past’ and (ii) the posterior mean under Ml satisfies β̄_l = b β̂_l, where β̂_l is the least squares estimate and b is common across models, then the best predictive model is the median probability model.
Condition (ii) is satisfied if we use either
  • the objective priors for model parameters; or
  • g-type N_kl(0, c σ^2 (Xl' Xl)^{-1}) priors, with the same constant c > 0 for each model, and any prior for σ^2.
Theorem 3. Theorems 1 and 2 essentially remain true even if there are non-orthogonal nuisance parameters (i.e., parameters common to all models) that are assigned the usual noninformative priors.
• 46. Example: Nonparametric Regression
  • yi = f(xi) + ǫi, i = 1, . . . , n, with ǫi ∼ N(0, σ^2).
  • Represent f by an (orthonormal) series f(x) = Σ_{j=1}^{∞} βj φj(x).
  • Base prior distribution: βi ∼ N(0, vi), with vi = c/i^a, where c is unknown and a is specified.
  • The model Mj, for j = 1, 2, . . . , n, is given by yi = Σ_{k=1}^{j} βk φk(xi) + ǫi, ǫi ∼ N(0, σ^2).
  • Choose equal prior probabilities (reasonable here) for the models Mj, j = 1, 2, . . . , n. Use the base prior for βj = (β1, . . . , βj).
  • For the data y = (y1, . . . , yn), compute P(Mj | y), the posterior probability of model Mj, for j ≤ n.
  • Within Mj, predict βj by its posterior mean, β̃j.
• 47. Example 1. The Shibata Example
  • f(x) = −log(1 − x) for −1 < x < 1.
  • Choose {φ1(x), φ2(x), . . .} to be the Chebyshev polynomials.
  • Then βi = 2/i, so the ‘optimal’ choice of the prior variances would be vi = 4/i^2, i.e., c = 4 and a = 2.
  • Measure the predictive capability of a model by the expected squared error loss relative to the true function (here known); thus we use a frequentist evaluation, as did Shibata.
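A quick numerical check of the βi = 2/i statement, assuming the standard expansion −log(1 − x) = log 2 + Σ_{k≥1} (2/k) Tk(x), with Tk(x) = cos(k·arccos x) on (−1, 1); the grid and truncation order below are arbitrary choices.

## Compare the truncated Chebyshev series with coefficients 2/k to -log(1 - x).
x <- seq(-0.95, 0.95, by = 0.01)
K <- 500
approx_f <- log(2) + Reduce(`+`, lapply(1:K, function(k) (2 / k) * cos(k * acos(x))))
max(abs(approx_f - (-log(1 - x))))   # maximum truncation error over the grid (largest near x = 1)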
• 48.
            MaxPr        MedPr        ModAv    BIC         AIC
    a = 1   0.99 [8]     0.89 [10]    0.84     1.14 [8]    1.09 [7]
    a = 2   0.88 [10]    0.80 [16]    0.81     1.14 [8]    1.09 [7]
    a = 3   0.88 [9]     0.84 [17]    0.85     1.14 [8]    1.09 [7]
Table 2: For n = 30 and σ^2 = 1, the expected loss and [average model size] for the maximum probability model (MaxPr), the median probability model (MedPr), model averaging (ModAv), and BIC and AIC, in the Shibata example.
Note: It is possible for the median probability model to perform better than model averaging here, because this is a frequentist evaluation of error with respect to the true, known function.
• 49.
            MaxPr        MedPr        ModAv    BIC         AIC
    a = 1   0.54 [14]    0.51 [19]    0.47     0.59 [11]   0.59 [13]
    a = 2   0.47 [23]    0.43 [43]    0.44     0.59 [11]   0.59 [13]
    a = 3   0.47 [22]    0.46 [45]    0.46     0.59 [11]   0.59 [13]
Table 3: For n = 100 and σ^2 = 1, the expected loss and [average model size] for the maximum probability model (MaxPr), the median probability model (MedPr), model averaging (ModAv), and BIC and AIC, in the Shibata example.
• 50.
            MaxPr        MedPr        ModAv    BIC         AIC
    a = 1   0.34 [23]    0.33 [26]    0.30     0.41 [12]   0.38 [21]
    a = 2   0.26 [42]    0.25 [51]    0.25     0.41 [12]   0.38 [21]
    a = 3   0.29 [38]    0.29 [50]    0.29     0.41 [12]   0.38 [21]
Table 4: For n = 2000 and σ^2 = 3, the expected loss and [average model size] for the maximum probability model (MaxPr), the median probability model (MedPr), model averaging (ModAv), and BIC and AIC, in the Shibata example.
• 51. Comments
  • AIC is better than BIC (as Shibata showed), but the true Bayesian procedures are best.
  • Model averaging is generally best (not obvious), followed closely by the median probability model. The maximum probability model can be considerably inferior.
  • BIC is a poor approximation to the Bayesian answers here.
  • The true Bayesian answers choose substantially larger models than AIC (and then shrink towards 0).
• 52. ANOVA
Many ANOVA problems, when written in linear model form, yield a diagonal X'X, and any such problem naturally fits under our theory. In particular, this is true for any balanced ANOVA in which each factor has only two levels. As an example, consider the full two-way ANOVA model with interactions:
    y_{ijk} = µ + a_i + b_j + ab_{ij} + ǫ_{ijk},  i = 1, 2, j = 1, 2, k = 1, 2, . . . , K,
with the ǫ_{ijk} independent N(0, σ^2), σ^2 unknown. In linear model form, this leads to X'X = 4K I_4 (verified in the small sketch below).
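A minimal base-R check of that claim, assuming the usual +/-1 (effect) coding for a, b and their interaction, with K replicates per cell:

## Verify X'X = 4K * I_4 for the 2 x 2 ANOVA with interaction (effect coding).
K     <- 3
cells <- expand.grid(a = c(-1, 1), b = c(-1, 1))
X     <- do.call(rbind, replicate(K, cbind(mu = 1, a = cells$a, b = cells$b,
                                           ab = cells$a * cells$b), simplify = FALSE))
crossprod(X)   # 4K times the 4 x 4 identity (here 12 * I_4)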
• 53. Possible modeling scenarios
We use the simplified notation M1011 instead of M(1,0,1,1), representing the model having all parameters except a1.
Scenario 1 – All models with the constant µ: the set of models under consideration is
    {M1000, M1100, M1010, M1001, M1101, M1011, M1110, M1111}.
Scenario 2 – Interactions present only with main effects, and µ included: the set of models under consideration here is
    {M1000, M1100, M1010, M1110, M1111}.
Note that this set of models has graphical structure.
• 54. Scenario 3 – An analogue of an unusual classical test: In classical ANOVA testing, it is sometimes argued that one might be interested in testing for no interaction effect, followed by testing for the main effects, even if the no-interaction test rejected. The four models under consideration in this process, all including the constant µ, are
    {M1101, M1011, M1110, M1111}.
This class of models does not have graphical model structure, and yet the median probability model is guaranteed to be in the class.
• 55. Example 2. Montgomery (1991, pp. 271–274) considers the effects of the concentration of a reactant and the amount of a catalyst on the yield in a chemical process. The reactant concentration is factor A and has two levels, 15% and 25%. The catalyst is factor B, with the two levels ‘one bag’ and ‘two bags’ of catalyst. The experiment was replicated three times, and the data are:
    treatment combination     replicate I    replicate II    replicate III
    A low,  B low                  28             25              27
    A high, B low                  36             32              32
    A low,  B high                 18             19              23
    A high, B high                 31             30              29
• 56. For each modeling scenario, two Bayesian analyses were carried out, both satisfying the earlier conditions, so that the median probability model is known to be the optimal predictive model.
  • I. The reference prior π(µ, σ) ∝ 1/σ was used for the common parameters, while the standard N(0, σ^2) g-prior was used for a1, b1 and ab11. In each scenario, the models under consideration were given equal prior probabilities of being true.
  • II. The g-prior was also used for the common µ.
• 57.
    model     posterior probability    posterior expected loss
    M1000            0.0009                   237.21
    M1100            0.0347                    60.33
    M1010            0.0009                   177.85
    M1110            0.6103                     0.97
    M1111            0.3532                     3.05
Table 5: Scenario 2 – graphical models, prior I. The posterior inclusion probabilities are p2 = 0.9982, p3 = 0.9644, and p4 = 0.3532; thus M1110 is the median probability model.
• 58.
    model     posterior probability    posterior expected loss
    M1011            0.124                    143.03
    M1101            0.286                     36.78
    M1110            0.456                     10.03
    M1111            0.134                      9.41
Table 6: Scenario 3 – unusual classical models, prior II. The posterior inclusion probabilities are p2 = 0.876, p3 = 0.714, and p4 = 0.544; thus M1111 is the median probability model.
• 59. When the median probability model can fail (Merlise Clyde)
  • Suppose
    – under consideration are M0, the model with only a constant term, and the models Mi having a constant term and the single covariate xi, i = 1, . . . , p, with p ≥ 3;
    – the models have equal prior probability 1/(p+1);
    – all covariates are nearly perfectly correlated, with each other and with y.
  • Then
    – the posterior probability of M0 will be near zero, and that of each of the other Mi will be approximately 1/p;
    – thus the posterior inclusion probabilities will also be approximately 1/p < 1/2;
    – so the median probability model is M0, which will have poor predictive performance compared to any other model.
• 60. But in practice, the median probability model works.
Example: Consider Hald’s regression data (Draper and Smith, 1981), consisting of n = 13 observations on a dependent variable y, with four potential regressors: x1, x2, x3, x4. The full model is thus
    y = c + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ǫ,  ǫ ∼ N(0, σ^2), σ^2 unknown.
There is high correlation between the covariates here.
  • All models that include the constant term are considered. This example does not formally satisfy the theory, since the models are not nested and the conditions of Theorem 3 do not apply.
  • Least squares estimates are used for the parameters.
  • Default posterior model probabilities, P(Ml | y), are computed using the Encompassing Arithmetic Intrinsic Bayes Factor (Berger and Pericchi, 1996), together with equal prior model probabilities.
• 61.
  • Predictive risks, R(Ml), are computed.
    Model        P(Ml|y)     R(Ml)
    c            0.000003    2652.44
    c,1          0.000012    1207.04
    c,2          0.000026     854.85
    c,3          0.000002    1864.41
    c,4          0.000058     838.20
    c,1,2        0.275484       8.19
    c,1,3        0.000006    1174.14
    c,1,4        0.107798      29.73
    c,2,3        0.000229     353.72
    c,2,4        0.000018     821.15
    c,3,4        0.003785     118.59
    c,1,2,3      0.170990       1.21
    c,1,2,4      0.190720       0.18
    c,1,3,4      0.159959       1.71
    c,2,3,4      0.041323      20.42
    c,1,2,3,4    0.049587       0.47
• 62.
  • The posterior inclusion probabilities are
      p1 = Σ_{l: l1=1} P(Ml|y) = 0.95,   p2 = Σ_{l: l2=1} P(Ml|y) = 0.73,
      p3 = Σ_{l: l3=1} P(Ml|y) = 0.43,   p4 = Σ_{l: l4=1} P(Ml|y) = 0.55.
  • Thus the median probability model is {c, x1, x2, x4}, which clearly coincides with the optimal predictive model (it has the smallest predictive risk, 0.18, in the table above).
  • Note that the risk of the maximum probability model, {c, x1, x2}, is considerably higher than that of the median probability model.
• 63. Conclusions
  • Bayesian analysis for the normal linear model (including variable selection) can easily be done with conventional priors.
    – Software exists.
    – Adjustment for the multiple testing problem inherent in variable selection is possible (and is the default method).
    – Summaries (over the 2^p models) of key quantities, such as posterior inclusion probabilities and model-averaged predictive parameter estimates, are available and easily interpretable.
  • If a single model is required for prediction (or understanding), consider the median probability model.
    – It will typically be the optimal predictive model (better than the highest probability model).
    – It is easy to compute (posterior inclusion probabilities are easy).