Bayesian Variable Selection in Linear Regression and A Comparison

Hacettepe Journal of Mathematics and Statistics
Volume 31 (2002), 63–76
BAYESIAN VARIABLE SELECTION IN
LINEAR REGRESSION AND A COMPARISON
Atilla Yardımcı∗
and Aydın Erar†
Received 17. 04. 2002
Abstract
In this study, Bayesian approaches, such as Zellner, Occam’s Window
and Gibbs sampling, have been compared in terms of selecting the correct
subset for the variable selection in a linear regression model. The aim of
this comparison is to analyze Bayesian variable selection and the behavior
of classical criteria by taking into consideration the different values of β and
σ and prior expected levels.
Key Words: Bayesian variable selection, Prior distribution, Gibbs Sampling, Markov
Chain Monte Carlo.
1. Introduction
A multiple linear regression equation with n observations and k independent vari-
ables can be defined as
Y = Xβ + , (1)
where Yn×1 is a response vector; Xn×k is a non-stochastic input matrix with rank
k , where k = k + 1; βk ×1 is an unknown vector of coefficients; n×1 is an error
vector with E( ) = 0 and E( ) = σ2
In. Estimation and prediction equations can
be defined as Y = Xˆβ + e and ˆY = Xˆβ, where e is the residual vector of the
estimated model, and ˆβ is the estimator of the model parameter obtained using
the Least Square Error (LSE) method or the Maximum Likelihood (ML) method
under the assumption that is normally distributed. Thus the joint distribution
function of the Y’s is obtained by maximizing the likelihood function as follows:
L(β, σ/Y) = f(Y/β, σ, X) =
1
(2πσ2)n/2
exp −
1
2σ2
(Y − Xβ) (Y − Xβ)
ˆβ = (X X)−1
X Y (2)
ˆσ2
= (Y − Xˆβ) (Y − Xˆβ)/(n − k ) = SSE(ˆβ)/(n − k ) (3)
∗Ba¸skent University, Faculty of Science and Letters, Department of Statistics and Computer
science, Ba˘glıca, Ankara.
†Hacettepe University, Faculty of Science, Department of Statistics, Beytepe, Ankara.
63

64 A. Yardımcı, A. Erar
where Y ∼ N(Xβ, σ2
In) and ˆσ2
is corrected such that E(ˆσ2
) = σ2
[4]. Variable
selection in regression analysis can be used to make estimation with minimum
cost, less MSE, leaving out the non-informative independent variables from the
model and low standard errors in the presence of variables with a high degree of
collinearity.
2. Classical Variable Selection Criteria
The stepwise and all possible subset methods for variable selection are often used
in a linear regression model. The all possible subset method is used in relation with
mean square error, coefficient of determination, F statistics, Cp and PRESSp [4].
In recent years, the AIC, AICc and BIC criteria have also been used in variable
selection procedures having respect to the amount of information. If p is the
number of parameters in the subset, AIC and AICc aim to select the subset which
minimizes the values of AIC and AICc obtained by the equations
AIC = n log(AKO + 1) + 2p (4)
AICc = AIC +
2(p + 1)(p + 2)
n − p − 2
. (5)
If f(Y/ˆβp, σp, X) is the likelihood function for the subset p, the criterion BIC
aims to select the subset which maximizes the equation given below [2]
BIC = log f(Y/ˆβp, σp, X) −
p
2
log n.
Posterior model probabilities and risk criteria have been also been used occasion-
ally in variable selection. Posterior model probabilities have been assessed for all
the subsets using a logarithmic approach given by Wassermann [16] under the as-
sumption of equal prior subset probabilities. The risk criterion has been adapted
to variable selection with the help of the equation
R(ˆβ) = σ2
tr(V(ˆβ)) (6)
given by Toutenburg [15], where V(ˆβ) is the variance-covariance matrix of ˆβ.
The rate (6) found for all the probable subsets corresponds to the risk of the
estimations. The aim is to get the model with the smallest risk.
On the other hand, Yardımcı [17] claims that the rates of risk and posterior
probability should be evaluated together during the interpretation. Ideally, the
aim is to find a subset that has high posterior probability and low risk. Since it is
hard to satisfy this condition, preferences are made within acceptable limits [17].
Furthermore, the Risk Inflation Criterion (RIC) has been found for all probable
subsets with the help of the equation defined by Foster and George [5] given below
RIC = SSEp + 2pMSEp log(k), (7)
where SSEp and MSEp are the sum of squares error and mean square error for the
subset p, respectively. It is claimed that RIC will produce the lowest values and
thus give more consistent deductions when trying to select the correct variables
[17].

Bayesian Variable Selection in Linear Regression 65
3. Bayesian Variable Selection Approaches
Bayesian variable selection approaches make use of hypothesis testing and stochas-
tic searching methods in linear regression. Apart from these approaches, there are
various methods based on the selection of the subset where the probability of the
posterior subset is the highest.
3.1. Posterior Odds and the Zellner Approach
Bayesian approaches to hypothesis testing are quite different from classical prob-
ability ratio tests. The difference comes from the use of prior probabilities and
distributions. The researcher has to define prior probabilities P(H0) and P(H1)
for the correctness of the hypothesis. Here H0 symbolizes the null hypothesis and
H1 stands for the alternative hypothesis. To choose one of the hypotheses, pos-
terior probabilities have to be assessed after defining the prior probabilities. The
Bayes Factor of H1 against H0 has been given by Lee [11] as follows:
BF =
p(H0/y)
p(H1/y)
P(H1)
P(H2)
. (8)
Thus, the Bayes factor is the ratio of H0’s posterior odds to its prior odds. If
the prior probabilities that verify the hypotheses H0 and H1 are equal, that is
P(H0) = P(H1) = 0.5, then Bayes factor is equal to the posterior odds of H0.
Although the interpretation of Bayesian hypothesis testing depends on the nature
of the research, Kass and Raftery [9] tables are often used. In the Zellner approach,
k independent variables in the model are divided into two sets. The variables in
the first set are those that are needed in the model and are shown by X1 and β1.
The variables in the second set are those that should be removed from the model
and are shown by X2 and β2. Here, X1 and X2 = V are n×k1 and n×k2 matrices
respectively, where k = k1 + k2. The vectors β1 and β2 have dimensions k1 × 1
and k2 × 1. I denotes the unit vector. The model can be rewritten as follows:
Y = αI + X1η + Vβ2 + ,
where V = I − X1(X1X1)−1
X1 X2, η = β1 + (X1X1)−1
X1X2β2 and X1V = 0
[19]. The hypotheses that will be tested in the case of the division of variables
into two sets may be written as H0 : β2 = 0 and H1 : β2 = 0.
In conclusion, when the prior probabilities for the rejection or acceptance of
the hypothesis are defined to be equal and the prior beliefs for the parameters are
defined as noninformative, then the odds ratio B01 is given by Moulton [13] as
follows:
BF01 = b(v1/2)k2/2
(v1s2
1/v0s2
1/v − 0s2
0)(v1−1)/2
= b(v1/2)k2/2
[1 + (k2/v1)Fk2,v1
]
(v1−1)/2
(9)
and
¯y =
n
i=1
yi
n
, ˆη = (X1X1)−1
X1Y,

v0s2
0 = (y − ¯yI − X1 ˆη), v0 = n − k1 − 1,
ˆβ2 = (V V)−1
V Y,
v1s2
1 = (Y − ¯yI − X1 ˆη − Vˆβ2) (Y − ¯yI − X1 ˆη − Vˆβ2), v1 = n − k1 − k2 − 1,
b =
√
π/Γ[(k2 + 1)/2],
where Fh = Fk2,v1
= ˆβ2V Vˆβ2/k2s2
1 is the F-test statistics. Thus the posterior
odds ratio given by equation (9) can be used as the Bayesian test for the variables
to be included in the model and those to be removed from the model [19].
3.2. Occam’s Window Approach
In this approach, if a model is not as strong as the others in terms of their posterior
odds, then the model and its sub models are eliminated from consideration. If the
posterior probabilities of two models are equal, the simple one is accepted (accord-
ing to parsimony rule). Hoeting [8] has claimed that if the aim is to interpret the
relationship between the variables and to make quick assessments, then Occam
Window (OW) approach is useful. If q = 2k
is the number of all possible models,
then M = M1, . . . , Mq will show all possible subsets. The posterior probability
of the subset Mm is obtained as
p(Y/Mn) = p(Y/β, Mm)p(β/Mm) dβ. (10)
Here, p(Y/β, Mm) is the marginal likelihood of the subset Mm; β is the parameter
vector; p(b/Mm) is the prior probability of β for the subset Mm. The equation
(10) can be rewritten as
p(∆/Y, Hm) = p(∆/Y, βm, Hm)p(βm/Y, Hm) dβm.
with the help of the Bayes theorem [9]. Here βm is the vector of independent
variables for the subset Mm. The OW approach has two main characteristics: if
the subset estimation is weaker than the subset under consideration, then it is
eliminated [12]. Thus, the eliminated subset is deﬁned by Hoeting [8] as follows:
A = Mm;
max(p(Mi/Y ))
p(Mm/Y )
> OL .
Here, OL is determined by the researcher. The second characteristic is that it
determines the alternative subset with the support of data. For this searching,
B = Mm; ∃ Mi ∈ A, Mi ⊆ Mq,
p(Mi/Y )
p(Mm/Y )
> OR
is used [14]. Note that OR is also determined by the researcher. Generally, OR is
taken to be 1 [8]. Also, ∆ denotes a situation which can be regarded as a future
observation or a loss of an event. It will be written as [8]
p(∆/Y ) = Mm∈A p(∆/MmY )p(Y/Mm)p(Mm)
Mm∈A p(Y/M + m)p(Mm)
.

In the OW approach, the rates OL and OR are determined by subjective belief.
For example, Madigan and Raftery [12] accepted OL = 1/20 and OR = 1. In an
application of this approach, model selection is done in two stages: First a logical
model from all the 2k
−1 possible models is chosen. Then “up or down algorithms”
are used to add or remove variables from the models [8].
3.3. The Gibbs Sampling Approach
Bayesian approaches concentrate on the determination of posterior probabilities
or distributions. However, in some cases, the analytic solution of the integrals
needed for the assessment of the posterior moments are impossible or hard to
achieve. In such cases, approaches involving the creation of Markov chains and
the use of convergence characteristics to obtain the posterior distributions are used
[17]. Thus, by using Markov Chain and stochastic simulation techniques together,
Markov Chain Monte Carlo (MCMC) approaches have been developed.
The aim of the MCMC approach is to create a random walk which converges
to the target posterior distributions [1]. In the MCMC approach, a sequence θj
converging to the posterior distribution is produced in which θj+1
depends on the
previous rate θj
. With this in mind, sampling is done using the Markov transi-
tion distribution g(θ/θj
) [7]. For a given posterior distribution, there are various
methods of taking samples from the transition distribution and processing this in-
formation. Metropolis algorithms and Gibbs sampling approaches are used to show
how these samples taken from a posterior distribution achieve the characteristic
of a Markov chain.
In Gibbs sampling, initially starting points (X0
1 , X0
2 , . . . , X0
k) are determined
when j = 0, and then points X(j)
= (X
(j)
1 , X
(j)
2 , . . . , X
(j)
k ) are produced in the
state x(j−1)
. Switching to the state j = j + 1, the process continues until conver-
gence is obtained [6]. When convergence is achieved the points x(j)
correspond to
the rates p(x1, . . . , xk/y). There are various Bayesian variable selection approaches
that depend on Gibbs sampling. Here the approach of Kuo and Mallick [10], that
is used in this study, will be explained.
In Gibbs sampling, a dummy variable given by γ = (γ1, . . . , γk) is used to
define the position of the 2k−1
independent variable subsets in the model. If the
variable considered is in the model, then γj = 1, if not γj = 0. Since the γ rate
is not known, it has to be added to the Bayes process. Thus, the joint prior that
will be used in the variable selection process is defined as
p(β, σ, γ) = p(β/σ, γ)p(σ/γ)p(γ).
Here, the prior defined by Kuo and Mallick [10] for the β coefficients is assumed
to be
β/σ, γ ∼ N(β0, D0I).
In addition, a noninformative prior for σ is defined by
p(σ) ∝
1
σ
.

The prior distribution of σ is obtained with the help of the inverted gamma dis-
tribution and the assumptions α → 0 and η → 0. The noninformative prior for γ
is given as
γj ∼ Bernoulli(1, p).
The deductions concentrate on these distributions since the marginal posterior dis-
tribution p(γ/Y) contains all the information that is needed for variable selection.
In Kuo and Mallick [10], it has been shown that p(γ/Y) can be estimated using
Gibbs sampling. Moreover, the least square estimation for the full model can be
used when the starting values of β0
and σ0
, that are needed for this process, are
γj = 1, (j = 1, . . . , k). Initially we may take γ0
= (1, 1, . . . , 1) . In this way, the
marginal posterior density of γj can be given as
p(γj/γ−j, β, σ, Y) = B(1, ˜pj), j = 1, 2, . . . , k,
where, if γ−j = (γ1, . . . , γj−1, γj+1, . . . , γk), then
˜pj =
cj
(cj + dj)
,
cj = pj exp −
1
2σ2
(Y − Xv∗
j ) (Y − Xv∗
j ) ,
dj = (1 − pj) exp −
1
2σ2
(Y − Xv∗∗
j ) (Y − Xv∗∗
j ) .
Here, the vector v∗
j is the column vector of ν with the jth
entry replaced by βj,
similarly, v∗∗
j is the column vector of ν with the jth
entry replaced by 0. Thus, cj
is a probability for the acceptance of the jth
variable and dj a probability for its
rejection.
4. Comparison of the Selection Criteria with Artificial Data
For a comparison of selection criteria in parallel to the study done by Kuo and
Mallick [10] and George and McCulloch, artificial data was produced in this study
for n = 20 and k = 5. The distribution of the variables Xj has been assumed to
be normal with N(0, 1). Four types of β structure, (Beta1,. . . ,Beta4), have been
determined to observe the effects of the significance level of variables in the model
that will be used in selection:
Beta 1 : (3, 2, 1, 0, 0) Beta 2 : (1, 1, 1, 0, 0)
Beta 3 : (3, −2, 1, 0, 0) Beta 4 : (1, −1, 1, 0, 0)
On the other hand, the levels σ2
= 0.01, 1.0, 10 are chosen to see the effects of σ2
on the results. However, the results for σ2
= 1.0 are not given in the tables, but
they are mentioned in interpretations. In this study, to observe the effects of prior
information levels on the result, the prior assumptions given in Table 4.1 are used.
Furthermore, the assumption ei ∼ N(0, σ2
) is made for the errors. The list of all
possible subsets for 5 independent variables is given in Appendix 1.

Table 4.1: Prior Assumptions
Bayesian Approach Prior Assumption
Zellner Equal subset prior probability
a. Equal subset prior probability.
Occam’s Window b. For the 2nd
and 5th
subsets, prior probability is 0.1.
c. For the 26th
subset, prior probability is 0.10.
a. D0 = 16, 1
Gibbs Sampling b. Equal prior probability for variables
c. Different prior probability for variables
During the comparison of these approaches, the BARVAS (Bayesian Regression-
Variable Selection) computer program, developed especially for the Bayesian vari-
able selection process by Yardımcı [17] is used. The BARVAS program can be
used for Bayesian regression and variable selection in linear regression, and it is a
packet program where the classical variable selection criteria for variable selection
could be also used together [18].
4.1. Classical Variable Selection
In the comparison of classical variable selection, the well-known criteria (Cp, AICc,
BIC, Press) will be referred to as Group I, and posterior model probability, Risk,
RIC as Group II. The results of classical variable selection for the data are summa-
rized in Table 4.2. In this table, the criteria are classified among themselves and
the best three subsets are given. It has been observed that the selection criteria in
Group I have not been affected by β and σ2
. For the β structures under consid-
eration, all the criteria in Group I find the best subset, which is subset number 5.
Also, subsets 10 and 26 are found in all criteria as the best selection subsets. The
AICc criterion is more appropriate for a model which has fewer variables because
of its structure. Group II criteria show parallel results with other criteria. Subset
posterior model probability (P) gave the same classification as Cp in all structural
cases. The risk criterion gives better results for subsets with fewer variables. Al-
though subsets with lower risk accord with other criteria, these subsets have lower
dimension for the condition σ2
= 0.01. RIC gives similar results to Group I. As a
result of the analysis of different classifications it is observed that the RIC rates
are very close to one another.
4.2. Bayesian Approaches
The Bayesian variable selection approaches, which are calculated using the BAR-
VAS program, are given in Table 4.2. The numbers in the table show the numbers
of the selected models. The detailed results for the Bayesian approaches are given
as follows.

Table 4.2. Number of Selected Models through the Classical and
Bayesian Approaches
(See the Appendix for these numbers).
Classical Criteria Bayesian Approaches
Zellner OW Gibbs
σ2 Beta Cp AICc BIC Press P Risk RIC BF Fh (equal D0 = 1 D0 = 16
prior)
5 2 5 5 5 1 5 5 10 10 5 2
1 10 1 10 2 10 2 10 2 26 5 2 5
2 5 26 13 26 5 26 10 5 21 21 1
5 3 5 5 5 3 5 5 10 5 5 5
2 10 5 10 26 10 4 10 10 26 10 21 3
26 4 26 10 26 7 26 26 5 26 3 6
0.01
5 5 5 5 5 5 5 5 26 26 5 5
3 26 2 26 26 26 10 10 10 10 5 21 2
10 26 10 10 10 2 26 26 5 21 2 18
5 5 5 5 5 7 5 5 26 26 5 5
4 26 6 26 26 26 6 10 26 10 21 21 2
10 26 10 10 10 5 26 10 5 5 26 3
5 5 5 5 5 5 5 5 10 10 5 5
1 10 2 10 26 10 10 26 19 26 21 10 2
26 10 26 10 26 26 10 26 5 5 21 10
5 5 5 5 5 5 5 5 10 10 5 5
2 10 1 10 26 10 26 26 10 26 26 21 1
26 2 26 10 26 10 10 26 5 21 10 2
10
5 2 5 5 5 5 5 5 26 5 5 5
3 26 5 26 10 26 26 26 10 10 26 21 2
10 13 10 2 10 2 10 26 5 10 2 13
5 5 5 5 5 5 5 5 10 10 5 5
4 10 10 10 26 10 10 10 10 26 5 21 3
26 26 26 10 26 26 26 26 5 21 26 21
The Zellner Approach
The Zellner approach preferred strongly the best subset number 5 for all the
different cases of beta and σ2
. The BF rates for the subset 5 are different from
those for the alternative subsets 10 and 26. This difference is also reflected in the
interpretations of Jeffreys, Kass and Rafftery. The results of the Zellner approach
on standardized data are given in Table 4.3. When the rates Fh are considered,
it is observed that the alternative subsets 10 and 26 are preferred. Since these
subsets have a larger dimension than the best subset, they give better results for
the purpose of this study. However, generally when the Zellner approach is used
for normally distributed variables, β and σ2
have no affect.

Table 4.3. Results of the Zellner Approach
σ2 = 0.01 σ2 = 10.0
for max BF for min Fh for max BF for min Fh
Beta Subset BF J KR Subset Fh T Subset BF J KR Subset Fh T
No. No. No. No.
5 13.68 S S 10 0.09 A 5 12.16 S H 10 0.02 A
1 2 7.17 H H 26 0.31 A 10 4.46 H H 26 0.57 A
10 4.50 H H 5 0.39 A 26 3.61 H H 5 0.65 A
5 15.29 S S 10 0.06 A 5 13.37 S S 10 0.11 A
2 10 4.55 H H 26 0.11 A 10 4.48 H H 26 0.20 A
26 4.46 H H 5 0.14 A 26 4.29 H H 5 0.44 A
5 14.52 S S 26 0.04 A 5 15.50 S S 26 0.00 A
3 10 4.59 H H 10 0.24 A 10 4.69 H H 10 0.10 A
26 4.21 H H 5 0.25 A 26 4.47 H H 5 0.11 A
5 0.38 H H 26 0.04 A 5 13.74 S S 10 0.02 A
4 26 4.60 H H 10 1.22 A 10 4.65 H H 26 0.37 A
10 2.27 W H 5 1.25 A 26 3.96 H H 5 0.38 A
J: Jeffreys, KR: Kass ve Rafftery, T: according to the F table
A: Acceptance, S: Strong, H: High, W: Weak, R: Rejection
Occam’s Window Approach
The OW approach primarily determines the subsets to be studied, rather than
determining the best subset. As a result, the researcher chooses the suitable subset
for his aims. On artificial data, while the posterior odds are calculated for the
values OL = 0.5 and OR=1, the prior probability for all subsets is considered to
be equal at the first stage. The OW approach evaluates the full set as a good model.
However, the OW approach concentrates more on larger dimensional subsets. The
results for the OW approach are given in Table 4.4. In this study, the best subset
number 5 is generally selected. However the posterior probabilities of subsets in the
classification are close to one another. It may be observed that the OW approach
is not much affected by a change in the β and σ2
structures. At the second stage,
Case I and II are defined to observe the effects of prior probabilities on the results.
In Case I, the prior probability of subset 2 and 5 is 0.10. In Case II, a 0.10 prior
probability is given to subset 26. The prior probabilities of the other subsets are
0.90/29 for Case I and 0.90/30 for Case II. It may be observed that the subset
prior probabilities have a considerable effect on the OW results. The effect in Case
I is more than in Case II. In Case I, the prior probability determined for subset 2
is not sufficient for it to be taken into the classification, but subset 5 has a high
posterior probability.
The Gibbs Sampling Approach
The Gibbs sampling approach used by Kuo and Mallick [10] is applied using the
BARVAS program, and the results given in Table 4.5. Gibbs sampling is applied
for all cases using simulated data over 10.000 iterations. The sampling time lasted
approximately two minutes for each of the σ2
, β and D0 levels. By considering the

prior parameter estimation, which is β0 = 0, the noninformative prior information
has been utilized in the Kuo and Mallick approach to Gibbs sampling.
Table 4.4. The Results for Occam’s Window Approach
σ2
= 0.01
Equal prior Case 1 Case 2
Beta Subset No. Pos. P Subset No. Pos. P Subset No. Pos. P
10 0.25 5 0.50 26 0.47
1 5 0.23 10 0.15 10 0.17
21 0.21 21 0.13 5 0.15
5 0.27 5 0.58 26 0.52
2 10 0.25 10 0.15 5 0.17
26 0.24 26 0.14 10 0.16
26 0.27 5 0.56 26 0.56
3 5 0.26 26 0.16 5 0.16
21 0.23 21 0.14 21 0.14
26 0.36 5 0.44 26 0.65
4 21 0.30 26 0.24 21 0.17
5 0.18 21 0.21 5 0.09
σ2
= 10.0
Equal prior Case 1 Case 2
Beta Subset No. Pos. P Subset No. Pos. P Subset No. Pos. P
10 0.31 5 0.51 26 0.46
1 21 0.26 10 0.20 10 0.21
5 0.22 21 0.17 21 0.17
10 0.27 5 0.53 26 0.52
2 26 0.25 10 0.17 10 0.16
21 0.24 26 0.15 21 0.15
5 0.26 5 0.56 26 0.53
3 26 0.25 26 0.15 5 0.17
10 0.23 10 0.14 10 0.15
10 0.28 5 0.55 26 0.48
4 5 0.25 10 0.17 10 0.19
21 0.24 21 0.14 5 0.17
Case 1: 0.10 prior probability is given to subsets 2 and 5;
Case 2: 0.10 prior probability is given to subset 26.
Pos. P: Subset posterior probability.
The eﬀect on the results have been observed by using prior variance values
of D0 = 1 and 16. The number of zero subsets (Zero) reached, and the subset
numbers (Processed) which were processed were given by the BARVAS program
for Gibbs sampling. It may be observed that the number of zero subsets is aﬀected

by the β structure. A lot of computation time was needed for the zero subsets
for σ2
= 0.01 and σ2
= 1.0 with Beta 2, and for σ2
= 10 with Beta 4. It
may be observed that D0, which is regarded as representing the prior information
level, has no effect on the number of processed subsets, but is effective on the
appearance of zero subsets. In this study, it may be observed that β and σ2
are
not very influential in obtaining subsets. However for σ2
= 10, the posterior model
probabilities for the correct subset are greater than the others. During this study,
the high prior probability rates and the different prior probabilities in the model
are given for redundant variables in all β structures in order to observe the effects
of the independent variables on the process. As a result of this application, it has
been observed that the prior probabilities reversibly determined for variables in
the model have no effect on the results.
Table 4.5 The Results for the Gibbs Sampling Approach
σ2
= 0.01 σ2
= 10
D0 = 1 D0 = 16 D0 = 1 D0 = 16
Beta Subset Pos. P Subset Pos. P Subset Pos. P Subset Pos. P
No. No. No. No.
5 0.37 2 0.63 5 0.62 5 0.70
2 0.31 5 0.27 10 0.12 2 0.21
1 21 0.24 1 0.05 21 0.10 10 0.05
Zero = 4, Zero = 0, Zero = 48, Zero = 9,
Processed = 16 Processed = 14 Processed = 14 Processed = 10
5 0.72 5 0.66 5 0.74 5 0.70
21 0.17 3 0.22 21 0.15 1 0.20
2 3 0.05 6 0.03 10 0.05 2 0.05
5 0.70 5 0.75 5 0.71 5 0.68
21 0.10 2 0.19 21 0.14 2 0.26
3 2 0.05 18 0.02 2 0.09 13 0.03
5 0.64 5 0.59 5 0.80 5 0.88
21 0.13 2 0.15 21 0.16 3 0.06
4 26 0.11 3 0.12 26 0.03 21 0.03
5. Application
An application using air pollution data [3] is also used for comparing the criteria.
The dependent variable is the quantity of SO2. There are 8 independent variables
and 255 subsets for these data. In our study, the classical criterion such as Cp,
BIC, PRESSp, P (posterior model probability) and RIC have selected the same

subsets which given with model numbers 162, 163 in Table 5.1. AICc has selected
the models 156, 163. The results of Bayesian criterion are also shown in Table 5.1.
The models selected by Bayesian criterion are fairly like to be the results of classical
selection. While Occam’s Window method tends to select the more large subsets,
Gibbs sampling used by Kuo and Mallick tends to select the subsets with less
variables. The advantage of Gibbs sampling is to have processed only 54 instead
of 255 subsets for D0=1 or 42 subsets for D0=4. If the models with less variables
is preferred according the parsimony rule or the aim of prediction, the models 156
or 163 can be used to make the estimation of parameters of these models. Using
least square or Bayesian methods can make the estimation of parameters of these
models.
Table 5.1 The Variables and Results for Bayesian Selection.
Zellner Occams Window Gibbs
No. Variables Max. BF Min. Fh Pos. P1 Pos. P2 Pos. P Pos. P
(D0 = 1) (D0 = 4)
99 x2, x5, x7 0.22
156 x2, x5, x7, x8 200.0 0.07 0.20 0.33
162 x1, x2, x5, x6, x7, x8 162.2 0.07 0.179 0.160 0.22
163 x2, x5, x6, x7, x8 255.1 0.31 0.23
165 x1, x2, x3, x5, x6, x7, x8 0.02 0.179 0.160
170 x1, x2, x3, x4, x5, x6, x7, x8 0.176 0.157
1) Posterior probability for equal prior, 2) 0.10 prior probability for model 156.
6. Conclusion
As a result of this study, it can be claimed that Bayesian approaches are affected
by prior changes. Generally, the β and σ2
structures do not have any effect on the
results. However Bayesian approaches define the correct subset most of the time.
In Gibbs sampling, the prior probabilities defined for variable selection are
not effective. However, it can be said that prior information as used in parameter
estimation directly affects the results. It is also observed that the prior information
level is influential in obtaining zero subsets. In Gibbs sampling it is also observed
that few subsets having high posterior probability, are dealt with.
It has been observed that the risk criterion proposed by Yardımcı [17] prefers
smaller dimensional subsets than the other criteria, and that these subsets have
average posterior probabilities. In other words, analyzing and interpreting together
the posterior probabilities of subsets with low risk will help the researcher to
determine the good subsets. In the Zellner and Occam’s window approaches,
subset prior probability is effective on the results. It has also been observed that
Occam’s window prefers larger dimensional subsets than both the Zellner approach
and Gibbs sampling.
The Bayesian approaches are more effective, though not very much so, than
all possible subset selection. They have the advantage that only the evaluation of
subsets with a high posterior probability is needed.

7. Appendix
List of subsets for k = 5
Subset Variables Subset Variables
No No
1 x1 17 x1, x4, x5
2 x1, x2 18 x1, x2, x4, x5
3 x2 19 x2, x4, x5
4 x2, x3 20 x2, x3, x4, x5
5 x1, x2, x3 21 x1, x2, x3, x4, x5
6 x1, x3 22 x1, x3, x4, x5
7 x3 23 x3, x4, x5
8 x3, x4 24 x3, x5
9 x1, x3, x4 25 x1, x3, x5
10 x1, x2, x3, x4 26 x1, x2, x3, x5
11 x2, x3, x4 27 x2, x3, x5
12 x2, x4 28 x2, x5
13 x1, x2, x4 29 x1, x2, x5
14 x1, x4 30 x1, x5
15 x4 31 x5
16 x4, x5
References
[1] Carlin, B.P. and Chib, S. Bayesian model choice via markov chain monte carlo
methods, J. R. Statist. Soc. B 57, (3), 473–484, 1995.
[2] Cavanaugh, J. E. Unifying the derivations for the Akaike and corrected Akaike
information criteria, Statistics and Probability Letters 33, 201–208, 1997.
[3] Candan, M. The Robust Estimation in Linear Regression Analysis, Unpub-
lished MSc. Thesis (in Turkish), Hacettepe University, Ankara, 2000.
[4] Draper, N. and Smith, H. Applied Regression Analysis, John Wiley, 709 p,
1981.
[5] Foster, D. P. and George, E. I. The risk inﬂation criterion for multiple regres-
sion, The Annals of Statistics 22 (4), 1947–1975, 1995.
[6] Gamerman, D. Markov Chain Monte Carlo: Stochastic Simulation for
Bayesian Inference, Chapman and Hall, London, 245p, 1997.
[7] George, E. I. and McCulloch, R. E. Approaches for Bayesian variable selec-
tion, Statistica Sinica 7, 339–373, 1997.

[8] Hoeting, J. A. Accounting for model uncertainty in linear regression, PhD.
Thesis, University of Washington, 1994.
[9] Kass, R. E. and Raftery, A. E. Bayes factors, JASA 90, 773–795, 1995.
[10] Kuo, L. and Mallick, B. Variable selection for regression models, Sankhya B
60, 65–81, 1998.
[11] Lee, P. M. Bayesian Statistics: An Introduction, Oxford, 264p, 1989.
[12] Madigan, D. and Raftery, A. E. Model selection and accounting for model
uncertainty in graphical models using Occam’s window, JASA 89, 1535–1546,
1994.
[13] Moulton, B. I. A Bayesian approach to regression selection and estimation,
with application to price index for radio services, Journal of Econometrics 49,
169–193, 1991.
[14] Raftery, A., Madigan, D. and Hoeting, J. Model selection and accounting
for model uncertainty in linear regression models, University of Washington,
Tech. Report. No. 262., 24p, 1993.
[15] Toutenburg, H. Prior information in linear models, John Wiley Inc., 215p,
1982.
[16] Wasserman, L. Bayesian model selection, Symposium on Methods for Model
Selection, Indiana Uni., Bloomington, 1997.
[17] Yardımcı, A. Comparison of Bayesian Approaches to Variable Selection in
Linear Regression, Unpublished PhD. Thesis (in Turkish), Hacettepe Univer-
sity, Ankara, 2000.
[18] Yardımcı, A. and Erar, A. Bayesian variable selection in linear regression
and BARVAS Software Package (in Turkish), Second Statistical Congress,
Antalya, 386–390, 2001.
[19] Zellner, A. and Siow, A. Posterior odds ratio for selected regression hypothe-
sis, Bayesian Statistics, University Press, Valencia, 585–603, 1980.

Bayesian Variable Selection in Linear Regression and A Comparison

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (15)

Similar to Bayesian Variable Selection in Linear Regression and A Comparison

Similar to Bayesian Variable Selection in Linear Regression and A Comparison (20)

More from Atilla YARDIMCI

More from Atilla YARDIMCI (16)

Recently uploaded

Recently uploaded (20)

Bayesian Variable Selection in Linear Regression and A Comparison