In Bayesian analysis, one starts with some prior knowledge (sometimes imprecise) expressed as a distribution on the parameter space and updates that knowledge through the posterior distribution given the data. It is therefore of utmost importance to know whether the updated knowledge becomes more and more accurate and precise as data are collected indefinitely. This requirement is called consistency of the posterior distribution. Although it is an asymptotic property, consistency is a fundamental benchmark: its violation is clearly undesirable, and one may have serious doubts about inferences based on an inconsistent posterior distribution.
1.2. Formal definition and choice of vector norm
As was previously stated, the notion of posterior consistency considered herein is the convergence of the posterior distribution of $\beta$ to degeneracy at $\beta_0$ with $P_0$-probability 1. We now state a formal definition of posterior consistency.
Definition. Let $\beta_{0n} \in \mathbb{R}^{p_n}$ for each $n \geq 1$ and $\sigma_0^2 > 0$. Now let $P_0$ denote the distribution of $\{\hat{\beta}_n, n \geq 1\}$ under the model $y_n = X_n \beta_{0n} + e_n$ for each $n \geq 1$, where $e_n \sim N_n(0_n, \sigma_0^2 I_n)$ for each $n \geq 1$. The sequence of posterior distributions $P_M(\beta_n \mid \hat{\beta}_n)$ is said to be consistent at $\{(\beta_{0n}, \sigma_0^2), n \geq 1\}$ if $P_M(\|\beta_n - \beta_{0n}\|_\infty > \epsilon \mid \hat{\beta}_n) \to 0$ a.s. $(P_0)$ for every $\epsilon > 0$.
The choice of the $\ell_\infty$ norm in our definition of posterior consistency warrants some discussion. In the case where the number of covariates $p$ is fixed, it is clear that the particular choice of vector norm is irrelevant, since the $\ell_\infty$ norm could be replaced by any other $\ell_r$ norm, $1 \leq r < \infty$, and the definition would still be equivalent. However, the distinction becomes relevant when $p$ tends to infinity at some rate along with the sample size, in which case $p$, $\beta$, and $\beta_0$ become $p_n$, $\beta_n$, and $\beta_{0n}$. If we wish to allow $p_n$ to grow in proportion to $n$, then the conventional $\ell_2$ norm, defined as $\|x\|_2 = (\sum_{i=1}^{p_n} x_i^2)^{1/2}$, makes posterior consistency unreasonably difficult to achieve. As justification, note that under the $\ell_2$ norm, even the MLE itself fails to achieve classical frequentist consistency. Thus, we instead consider posterior consistency under the $\ell_\infty$ norm $\|x\|_\infty = \max_{1 \leq i \leq p_n} |x_i|$.
The following lemma and corollary illustrate why the $\ell_2$ norm is not sufficiently flexible for our purposes.
Lemma 1. Let $Z_n \sim N_{p_n}(0_{p_n}, n^{-1} V_n)$, where $p_n < n$, and where the eigenvalues $\omega_{n,1}, \ldots, \omega_{n,p_n}$ of $V_n$ satisfy $0 < \omega_{\min} \leq \inf_{n,i} \omega_{n,i} \leq \sup_{n,i} \omega_{n,i} \leq \omega_{\max} < \infty$ for some $\omega_{\min}$ and $\omega_{\max}$. Then $\|Z_n\|_2 \to 0$ almost surely if and only if $p_n/n \to 0$.
Proof. Note that $n^{1/2} V_{n,ii}^{-1/2} Z_{n,i} \sim N(0,1)$ for each $i$. Now let $U_n = n \sum_{i=1}^{p_n} V_{n,ii}^{-1} Z_{n,i}^2$, so that $U_n \sim \chi^2_{p_n}$. By the properties of the chi-squared distribution, $U_n/n \to 0$ almost surely if and only if $p_n/n \to 0$. Then since
$$\frac{\omega_{\min} U_n}{n} \leq \|Z_n\|_2^2 \leq \frac{\omega_{\max} U_n}{n},$$
it follows that $\|Z_n\|_2 \to 0$ almost surely if and only if $p_n/n \to 0$.
Corollary 1. $\|\hat{\beta}_n - \beta_{0n}\|_2 \to 0$ a.s. $(P_0)$ if and only if $p_n/n \to 0$.
Proof. Apply Lemma 1 under $P_0$ with $Z_n = \hat{\beta}_n - \beta_{0n}$ and $V_n = \sigma_0^2 (\frac{1}{n} X_n^T X_n)^{-1}$.
As is clear from Corollary 1, not even the MLE $\hat{\beta}_n$ achieves almost sure consistency under the $\ell_2$ norm when $p_n$ grows at the same rate as $n$. Thus, any attempt to establish posterior consistency of a Bayesian regression model under the $\ell_2$ norm in the same circumstances would be futile. However, the following lemma and corollary motivate the choice of the $\ell_\infty$ norm instead.
Lemma 2. Let $Z_n \sim N_{p_n}(0_{p_n}, n^{-1} V_n)$, where $p_n < n$, and where the eigenvalues $\omega_{n,1}, \ldots, \omega_{n,p_n}$ of $V_n$ satisfy $\sup_{n,i} \omega_{n,i} \leq \omega_{\max} < \infty$ for some $\omega_{\max}$. Then $\|Z_n\|_\infty \to 0$ almost surely.
Proof. Let $\epsilon > 0$. Note that $\mathrm{Var}(Z_{n,i}) = n^{-1} V_{n,ii} \leq n^{-1} \omega_{\max}$, and $n^{1/2} V_{n,ii}^{-1/2} Z_{n,i} \sim N(0,1)$. Then
$$\sum_{n=1}^\infty P(\|Z_n\|_\infty > \epsilon) = \sum_{n=1}^\infty P\left(\max_{1 \leq i \leq p_n} |Z_{n,i}| > \epsilon\right) \leq \sum_{n=1}^\infty \sum_{i=1}^{p_n} P\left(n^{1/2} V_{n,ii}^{-1/2} |Z_{n,i}| > \epsilon\, (n^{-1} V_{n,ii})^{-1/2}\right)$$
$$\leq \sum_{n=1}^\infty \sum_{i=1}^{p_n} P\left(n^{1/2} V_{n,ii}^{-1/2} |Z_{n,i}| > \epsilon\, \omega_{\max}^{-1/2} n^{1/2}\right) \leq \sum_{n=1}^\infty \sum_{i=1}^{p_n} \frac{15\, \omega_{\max}^3}{\epsilon^6 n^3} < \infty,$$
by applying Markov's inequality to $n^3 V_{n,ii}^{-3} Z_{n,i}^6$ (the sixth moment of a standard normal variable is 15), and the result follows from the Borel-Cantelli lemma, noting that $p_n < n$.
Corollary 2. $\|\hat{\beta}_n - \beta_{0n}\|_\infty \to 0$ a.s. $(P_0)$.
Proof. Apply Lemma 2 under $P_0$ with $Z_n = \hat{\beta}_n - \beta_{0n}$ and $V_n = \sigma_0^2 (\frac{1}{n} X_n^T X_n)^{-1}$.
Although Corollaries 1 and 2 are not posterior consistency results per se, they nonetheless demonstrate the added flexibility that can arise from the use of the $\ell_\infty$ norm instead of the $\ell_2$ norm when proving consistency results. For this reason, we choose to work with the $\ell_\infty$ norm throughout our work.
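As a numerical illustration of Corollaries 1 and 2 (a hypothetical sketch of ours, not part of the original development), the following Python snippet simulates $Z_n \sim N_{p_n}(0_{p_n}, n^{-1} I_{p_n})$ with $p_n = n/2$, so that $p_n/n \not\to 0$; the $\ell_2$ norm stabilizes near $(p_n/n)^{1/2}$ while the $\ell_\infty$ norm shrinks toward zero:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in [10**2, 10**3, 10**4, 10**5]:
        p = n // 2  # p_n grows in proportion to n, so p_n/n does not vanish
        z = rng.normal(scale=np.sqrt(1.0 / n), size=p)  # Z_n ~ N_p(0, n^{-1} I_p)
        # E||Z_n||_2^2 = p/n = 1/2, so the l2 norm stays near sqrt(1/2),
        # while ||Z_n||_inf is of order sqrt(2 log p / n) and tends to 0.
        print(n, np.linalg.norm(z), np.abs(z).max())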
1.3. Conditional independence prior and the Bayesian lasso
The most common approach to prior specification in Bayesian regression models is to first place a prior on $\sigma^2$, and to then place a prior on $\beta \mid \sigma^2$ such that the prior variance of $\beta \mid \sigma^2$ is proportional to $\sigma^2$. The conjugate choice for the prior on $\sigma^2$ is the inverse gamma with shape parameter $a/2$ and scale parameter $b/2$ (the factor of $1/2$ is included for later convenience), where $a, b > 0$. One may also wish to use an improper prior proportional to $1/\sigma^2$, $1/\sigma$, or $1$, but these improper priors can be seen to have the same basic form as inverse gamma densities if the parameter restrictions are relaxed to $a \geq -2$ and $b \geq 0$.
Although there exist various prior structures for the coefficient vector with interesting properties and applications, perhaps the most obvious alternative is to simply replace $(X_n^T X_n)^{-1}$ in the prior variance of $\beta \mid \sigma^2$ with some diagonal matrix $D_\tau = \mathrm{Diag}(\tau_1^2, \ldots, \tau_p^2)$, yielding
$$\beta \mid \sigma^2 \sim N_p(\gamma, \sigma^2 D_\tau).$$
Thus, the components of $\beta$ are independent a priori when conditioned on $\sigma^2$. The values of $\tau_1^2, \ldots, \tau_p^2$ can be taken as fixed, or they can be set equal to a common value $\tau^2$ which is then estimated through an empirical Bayesian approach. However, the most important application of this model is the extension to a hierarchical model in which $\tau_1^2, \ldots, \tau_p^2$ are assigned independent exponential priors with common rate parameter $\lambda^2/2$. As noted by Park and Casella (2008), this formulation leads to a Bayesian version of the lasso of Tibshirani (1996) if the point estimate of the coefficient vector $\beta$ is taken to be its posterior mode. Park and Casella observe that the resulting Bayesian lasso typically yields results quite similar to those of the ordinary lasso, but with the advantage of automatic interval estimates for all parameters via any of the usual constructions of Bayesian credible intervals. Of course, this still leaves the question of how to specify the parameter $\lambda$. Casella (2001) examines the replacement of $\lambda$ with an empirical Bayesian estimate $\hat{\lambda}_{EB}$ derived by maximizing the marginal likelihood of $\lambda$. Alternatively, the hierarchical structure can be extended further by specifying a prior on $\lambda$, though Park and Casella advise caution here, as seemingly innocuous improper priors such as $1/\lambda^2$ can lead to impropriety of the posterior. Further discussion of Bayesian lasso methods can be found in Kyung et al. (2010).
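To make the hierarchy concrete, the following Monte Carlo sketch (ours, with arbitrary illustrative values $\lambda = 2$ and $\sigma = 1$) draws $\beta$ by first sampling $\tau^2 \sim \mathrm{Exp}(\lambda^2/2)$ and then $\beta \mid \tau^2 \sim N(0, \sigma^2 \tau^2)$, and checks that the draws match the double exponential marginal, whose tail probability is $P(|\beta| > t) = \exp(-\lambda t/\sigma)$:

    import numpy as np

    rng = np.random.default_rng(1)
    lam, sigma, m = 2.0, 1.0, 200000  # illustrative values, not from the paper
    tau2 = rng.exponential(scale=2.0 / lam**2, size=m)  # tau^2 ~ Exp(rate lam^2/2)
    beta = rng.normal(scale=sigma * np.sqrt(tau2))      # beta | tau^2 ~ N(0, sigma^2 tau^2)
    for t in [0.5, 1.0, 2.0]:
        # Monte Carlo tail vs. the double exponential (Laplace) marginal tail
        print(t, (np.abs(beta) > t).mean(), np.exp(-lam * t / sigma))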
A slight but significant modification of the above structure is to take $\beta$ and $\sigma^2$ to be a priori independent, removing the dependence on $\sigma^2$ from the prior given above for the coefficient vector $\beta$. However, Park and Casella (2008) show that this unconditional prior can easily lead to a bimodal posterior on $(\beta, \sigma^2)$. In contrast, they show that the conditional prior always leads to a unimodal posterior as long as $\sigma^2 \sim \text{Inverse-Gamma}(a/2, b/2)$, where we permit $a \geq -2$ and $b \geq 0$, as before.
Moreover, Kyung et al. (2010) illustrate other lasso-type penalized regression schemes that
can be represented through hierarchical extensions of the conditional independence prior. In
addition to Tibshirani’s original lasso, both the group lasso of Yuan and Lin (2006) and the
elastic net of Zou and Hastie (2005) can be represented in this fashion. A general examination
of posterior consistency under hierarchical extensions of the conditional independence prior
could provide conditions under which these lasso-type regression techniques are consistent
in the frequentist sense.
1.4. Shrinkage priors
Shrinkage estimation through continuous priors (Griffin and Brown, 2007; Park and Casella, 2008; Hans, 2009; Carvalho et al., 2010; Griffin and Brown, 2010) has attracted much attention in recent years, along with its frequentist analogues in the regularization framework (Knight and Fu, 2000; Fan and Li, 2001; Yuan and Lin, 2005; Zhao and Yu, 2006; Zou, 2006; Zou and Li, 2008). The lasso of Tibshirani (1996) and its Bayesian analogues relying on double exponential priors (Park and Casella, 2008; Hans, 2009) have drawn particular attention, with many variations being proposed. These priors yield undeniable computational advantages in regression models over Bayesian variable selection approaches that require a search over a huge discrete model space (George and McCulloch, 1993; Raftery et al., 1997; Chipman et al., 2001; Liang et al., 2008; Clyde et al., 2010).
Consider the linear model $y_n = X_n \beta_{0n} + \epsilon_n$, where $y_n$ is an $n$-dimensional vector of responses, $X_n$ is the $n \times p_n$ design matrix, $\epsilon_n \sim N_n(0, \sigma_0^2 I_n)$ with fixed $\sigma_0^2$, and some of the components of $\beta_{0n}$ are zero.
In the Bayesian framework, to justify the use of such priors in high-dimensional settings, it is important to establish posterior consistency in cases in which the number of parameters $p$ increases
with sample size $n$. Armagan et al. (2013) investigated the asymptotic behavior of posterior distributions of regression coefficients in high-dimensional linear models as the number of parameters grows with the number of observations. Their main contribution is a simple sufficient condition on the prior concentration for strong posterior consistency (in $\ell_2$ norm) when $p_n = o(n)$. Their particular focus is on shrinkage priors, including the Laplace, Student $t$, generalized double Pareto, and horseshoe-type priors (Johnstone and Silverman, 2004; Griffin and Brown, 2007; Carvalho et al., 2010; Armagan et al., 2011a).
In this paper, we focus on the Bayesian lasso model with an orthogonal design and a fixed variance parameter, in the regime where the number of parameters grows with the sample size. The main objective is to derive sufficient conditions for posterior consistency in the Bayesian lasso model. In Section 2, we introduce the model and provide the main result.
2. Main result
Consider the following Bayesian lasso (Park and Casella, 2008) model, where we treat the variance parameter $\sigma^2$ as a non-random quantity:
$$y_n \mid X_n, \beta_n, \sigma^2 \sim N_n(X_n \beta_n, \sigma^2 I_n),$$
$$\beta_n \mid \sigma^2, \tau_1^2, \ldots, \tau_{p_n}^2 \sim N_{p_n}(0_{p_n}, \sigma^2 D_\tau), \quad \text{where } D_\tau = \mathrm{diag}(\tau_1^2, \ldots, \tau_{p_n}^2),$$
i.e., $\beta_j \mid \sigma^2, \tau_j^2 \stackrel{\text{ind}}{\sim} N(0, \sigma^2 \tau_j^2)$, $j = 1, \ldots, p_n$, and
$$\tau_j^2 \stackrel{\text{iid}}{\sim} \mathrm{Exp}(\lambda^2/2), \quad j = 1, \ldots, p_n, \qquad \tau_1^2, \ldots, \tau_{p_n}^2 > 0.$$
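For concreteness, a minimal Gibbs sampler for this model can be sketched as follows (our illustration, based on the full conditionals of Park and Casella (2008) specialized to fixed $\sigma^2$; the function name and defaults are ours). It alternates between $\beta \mid \tau^2, y \sim N(A^{-1} X^T y, \sigma^2 A^{-1})$ with $A = X^T X + D_\tau^{-1}$, and $\tau_j^{-2} \mid \beta \sim \text{Inverse-Gaussian}(\lambda \sigma / |\beta_j|, \lambda^2)$:

    import numpy as np

    def bayes_lasso_gibbs(X, y, lam, sigma2, n_iter=2000, seed=0):
        """Sketch of a Gibbs sampler for the Bayesian lasso with fixed sigma^2."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        XtX, Xty = X.T @ X, X.T @ y
        beta = np.linalg.lstsq(X, y, rcond=None)[0]  # initialize at least squares
        draws = np.empty((n_iter, p))
        for t in range(n_iter):
            # tau_j^{-2} | beta ~ Inverse-Gaussian(lam*sigma/|beta_j|, lam^2);
            # numpy's wald(mean, scale) uses exactly this parameterization.
            mu = lam * np.sqrt(sigma2) / np.maximum(np.abs(beta), 1e-12)
            inv_tau2 = rng.wald(mu, lam**2)
            # beta | tau^2, y ~ N(A^{-1} X'y, sigma2 A^{-1}), A = X'X + D_tau^{-1}
            A = XtX + np.diag(inv_tau2)
            L = np.linalg.cholesky(np.linalg.inv(A))
            beta = np.linalg.solve(A, Xty) + np.sqrt(sigma2) * (L @ rng.standard_normal(p))
            draws[t] = beta
        return draws

Under the orthogonal design of Theorem 1 below, $X_n^T X_n = n I_{p_n}$, so $A$ is diagonal and the draw of $\beta$ decouples across coordinates.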
Now, suppose the true model is $y_n = X_n \beta_{0n} + \epsilon_n$, where $\epsilon_n \sim N_n(0, \sigma_0^2 I_n)$. We then need to find conditions on $\{X_n\}_{n \geq 1}$, $\{\beta_{0n}\}_{n \geq 1}$, and $\sigma_0^2$ such that $P_n(\|\beta_n - \beta_{0n}\|_\infty > \epsilon \mid y_n) \to 0$ a.s. as $n \to \infty$, for every $\epsilon > 0$.
We investigate the posterior consistency of the above Bayesian lasso model under a much more relaxed growth restriction on the dimension, namely $p_n = O(n)$. As discussed above, posterior consistency in the $\ell_2$ norm is unrealistic to expect under this growth condition. Hence, we consider posterior consistency in the weaker $\ell_\infty$ norm. We prove the following theorem.
Theorem 1. Let $s_n$ be the number of true nonzero regression coefficients, and suppose $s_n = O(n^{\delta/4})$. Then for an orthogonal design, i.e., $X_n^T X_n = n I_{p_n}$, and under the condition $\|\beta_{0n}\|_2^2 = O(n^{2-\delta})$, $\delta > 0$, on the true regression coefficients, posterior consistency of the regression coefficients can be achieved, i.e., $P_n(\|\beta_n - \beta_{0n}\|_\infty > \epsilon \mid y_n) \to 0$ a.s. as $n \to \infty$, for every $\epsilon > 0$.
Remark: The assumptions on $s_n$ and $\beta_{0n}$ are satisfied, for example, when $s_n = O(n^{2/5})$ and the entries of $\beta_{0n}$ are uniformly bounded in $n$: taking $\delta = 8/5$ gives $s_n = O(n^{2/5}) = O(n^{\delta/4})$ and $\|\beta_{0n}\|_2^2 = O(s_n) = O(n^{2/5}) = O(n^{2-\delta})$.
Proof.
$$P_n(\|\beta_n - \beta_{0n}\|_\infty > \epsilon \mid y_n) = P_n\left(\left\|\beta_n - E(\beta_n \mid \sigma^2, \tau^2, y_n) + E(\beta_n \mid \sigma^2, \tau^2, y_n) - \beta_{0n}\right\|_\infty > \epsilon \mid y_n\right)$$
$$\leq P_n\left(\left\|\beta_n - E(\beta_n \mid \sigma^2, \tau^2, y_n)\right\|_\infty > \epsilon/2 \mid y_n\right) + P_n\left(\left\|E(\beta_n \mid \sigma^2, \tau^2, y_n) - \beta_{0n}\right\|_\infty > \epsilon/2 \mid y_n\right)$$
$$= (I) + (II), \text{ say.} \quad (2.1)$$
Let $\tilde{\beta}(\sigma^2, \tau^2, y_n) = E(\beta_n \mid \sigma^2, \tau^2, y_n) = (n I_{p_n} + D_\tau^{-1})^{-1} X_n^T y_n$, noting that $X_n^T y_n = n \hat{\beta}_n$ since $X_n^T X_n = n I_{p_n}$. Notice that $\beta_n - \tilde{\beta}(\sigma^2, \tau^2, y_n) \mid \sigma^2, \tau^2, y_n \sim N_{p_n}(0, \sigma^2 (n I_{p_n} + D_\tau^{-1})^{-1})$. Also, let $v_{ii}$ be the $i$th diagonal element of $(n I_{p_n} + D_\tau^{-1})^{-1}$, so that $v_{ii} = (n + \tau_i^{-2})^{-1} \leq n^{-1}$. Then, by applying the tower property and the Bonferroni bound consecutively, we obtain
$$(I) = P_n\left(\|\beta_n - \tilde{\beta}(\sigma^2, \tau^2, y_n)\|_\infty > \epsilon/2 \mid y_n\right) = E\left[P_n\left(\|\beta_n - \tilde{\beta}(\sigma^2, \tau^2, y_n)\|_\infty > \epsilon/2 \mid \sigma^2, \tau^2, y_n\right) \mid y_n\right]$$
$$\leq \sum_{i=1}^{p_n} E\left[P\left(\frac{|\beta_i - \tilde{\beta}_i(\sigma^2, \tau^2, y_n)|}{\sigma \sqrt{v_{ii}}} > \frac{\epsilon}{2 \sigma \sqrt{v_{ii}}} \,\middle|\, \sigma^2, \tau^2, y_n\right) \mid y_n\right]$$
$$= \sum_{i=1}^{p_n} E\left[P\left(|Z_i| > \frac{\epsilon}{2 \sigma \sqrt{v_{ii}}}\right) \mid y_n\right], \text{ where } Z_i \sim N(0,1),$$
$$\leq \sum_{i=1}^{p_n} P\left(|Z_i| > \frac{\epsilon \sqrt{n}}{2 \sigma}\right) = p_n P\left(|Z| > \frac{\epsilon \sqrt{n}}{2 \sigma}\right) \to 0, \text{ as } n \to \infty, \quad (2.2)$$
where the last inequality uses $v_{ii} \leq n^{-1}$, and the convergence holds because the standard normal tail bound gives $p_n P(|Z| > \epsilon \sqrt{n}/(2 \sigma)) \leq 2 p_n \exp\{-\epsilon^2 n/(8 \sigma^2)\} \to 0$ for $p_n = O(n)$.
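As a quick numerical sanity check on this last step (an illustration of ours, with arbitrary $\epsilon = 0.1$, $\sigma = 1$, and the extreme case $p_n = n$):

    import numpy as np
    from scipy.stats import norm

    eps, sigma = 0.1, 1.0  # illustrative values
    for n in [10**4, 10**5, 10**6]:
        p_n = n  # extreme case of the growth condition p_n = O(n)
        bound = p_n * 2 * norm.sf(eps * np.sqrt(n) / (2 * sigma))  # p_n P(|Z| > eps sqrt(n)/(2 sigma))
        print(n, bound)  # the Gaussian tail beats the linear factor p_n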
Next observe that the $i$th component of $(n I_{p_n} + D_\tau^{-1})^{-1} X_n^T y_n$ is $\frac{n}{n + \tau_i^{-2}} \hat{\beta}_{n,i} = \hat{\beta}_{n,i} - \frac{\tau_i^{-2}}{n + \tau_i^{-2}} \hat{\beta}_{n,i}$, so by the triangle inequality,
$$(II) = P_n\left(\|(n I_{p_n} + D_\tau^{-1})^{-1} X_n^T y_n - \beta_{0n}\|_\infty > \epsilon/2 \mid y_n\right)$$
$$\leq P_n\left(\max_{1 \leq i \leq p_n} |\hat{\beta}_{n,i} - \beta_{0,i}| > \epsilon/4 \mid y_n\right) + P_n\left(\max_{1 \leq i \leq p_n} \frac{\tau_i^{-2}}{n + \tau_i^{-2}} |\hat{\beta}_{n,i}| > \epsilon/4 \mid y_n\right)$$
$$= (III) + (IV), \text{ say.} \quad (2.3)$$
By Corollary 2, it is easy to see that (III) → 0 a.s. Also,
$$(IV) = P_n\left(\max_{1 \leq i \leq p_n} \frac{\tau_i^{-2}}{n + \tau_i^{-2}} |\hat{\beta}_{n,i}| > \epsilon/4 \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4\right)$$
$$+ P_n\left(\max_{1 \leq i \leq p_n} \frac{\tau_i^{-2}}{n + \tau_i^{-2}} |\hat{\beta}_{n,i}| > \epsilon/4 \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty > \epsilon/4\right)$$
$$= (V) + (VI), \text{ say.} \quad (2.4)$$
Clearly, $(VI) \to 0$ a.s., since by Corollary 2 the indicator $I(\|\hat{\beta}_n - \beta_{0n}\|_\infty > \epsilon/4)$ vanishes for all sufficiently large $n$, almost surely. Now
$$(V) = P_n\left(\max_{1 \leq i \leq p_n} \frac{\tau_i^{-2}}{n + \tau_i^{-2}} |\hat{\beta}_{n,i}| > \epsilon/4 \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4\right)$$
$$\leq \sum_{i : |\hat{\beta}_{n,i}| > \epsilon/4} P_n\left(\frac{\tau_i^{-2}}{n + \tau_i^{-2}} |\hat{\beta}_{n,i}| > \epsilon/4 \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4\right)$$
$$\leq \sum_{i : |\hat{\beta}_{n,i}| > \epsilon/4} P_n\left(\tau_i^{-2} > \frac{n \epsilon/4}{|\hat{\beta}_{n,i}|} \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4\right), \quad (2.5)$$
where the sum may be restricted to indices with $|\hat{\beta}_{n,i}| > \epsilon/4$ because $\tau_i^{-2}/(n + \tau_i^{-2}) < 1$, and the last step uses $\tau_i^{-2}/(n + \tau_i^{-2}) < \tau_i^{-2}/n$. Since $\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4$ on the indicator event and $\|\beta_{0n}\|_2^2 = O(n^{2-\delta})$, $\delta > 0$, we have $|\hat{\beta}_{n,i}| = O(n^{1-\delta/2})$ uniformly in $i$. Thus, for large $n$, using $K (> 0)$ as a generic constant,
$$\sum_{i : |\hat{\beta}_{n,i}| > \epsilon/4} P_n\left(\tau_i^{-2} > \frac{n \epsilon/4}{|\hat{\beta}_{n,i}|} \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4\right)$$
$$\leq \sum_{i : |\hat{\beta}_{n,i}| > \epsilon/4} P_n\left(\tau_i^{-2} > K n^{\delta/2} \mid y_n\right) I\left(\|\hat{\beta}_n - \beta_{0n}\|_\infty \leq \epsilon/4\right)$$
$$\leq \sum_{i : |\hat{\beta}_{n,i}| > \epsilon/4} E\left[P_n\left(\tau_i^{-2} > K n^{\delta/2} \mid \beta_n, y_n\right) \mid y_n\right]$$
$$\leq s_n\, E\left[P_n\left(\tau_i^{-2} > K n^{\delta/2} \mid \beta_n, y_n\right) \mid y_n\right], \quad (2.6)$$
where the last inequality holds because, on the indicator event, any index with $|\hat{\beta}_{n,i}| > \epsilon/4$ must satisfy $\beta_{0,i} \neq 0$, so the sum contains at most $s_n$ terms.
Next we observe that $\tau_i^{-2} \mid \beta_n, y_n \sim \text{Inverse-Gaussian}\left(\frac{\lambda \sigma}{|\beta_i|}, \lambda^2\right)$. In order to find an upper bound for the inner probability in (2.6), we need the following lemma.
Lemma 3. Suppose $X \sim \text{Inverse-Gaussian}(\mu, \lambda)$. Then
$$P(X > M) \leq \sqrt{\frac{2 \lambda}{\pi M}} \exp\left(\frac{\lambda}{\mu}\right) \exp\left(-\frac{\lambda M}{2 \mu^2}\right).$$
Proof.
$$P(X > M) = \int_M^\infty \sqrt{\frac{\lambda}{2 \pi x^3}} \exp\left(-\frac{\lambda (x - \mu)^2}{2 \mu^2 x}\right) dx = \sqrt{\frac{\lambda}{2 \pi}} \exp\left(\frac{\lambda}{\mu}\right) \int_M^\infty \frac{1}{x^{3/2}} \exp\left(-\frac{\lambda x}{2 \mu^2}\right) \exp\left(-\frac{\lambda}{2 x}\right) dx$$
$$\leq \sqrt{\frac{\lambda}{2 \pi}} \exp\left(\frac{\lambda}{\mu}\right) \int_M^\infty \frac{1}{x^{3/2}} \exp\left(-\frac{\lambda x}{2 \mu^2}\right) dx \leq \sqrt{\frac{\lambda}{2 \pi}} \exp\left(\frac{\lambda}{\mu}\right) \exp\left(-\frac{\lambda M}{2 \mu^2}\right) \int_M^\infty \frac{dx}{x^{3/2}}$$
$$= \sqrt{\frac{\lambda}{2 \pi}} \cdot \frac{2}{\sqrt{M}} \exp\left(\frac{\lambda}{\mu}\right) \exp\left(-\frac{\lambda M}{2 \mu^2}\right) = \sqrt{\frac{2 \lambda}{\pi M}} \exp\left(\frac{\lambda}{\mu}\right) \exp\left(-\frac{\lambda M}{2 \mu^2}\right),$$
using $\int_M^\infty x^{-3/2}\, dx = 2 M^{-1/2}$.
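The bound of Lemma 3 can be checked numerically against the exact inverse-Gaussian tail. In SciPy's parameterization, $\text{Inverse-Gaussian}(\mu, \lambda)$ corresponds to invgauss(mu/lam, scale=lam); the parameter values below are arbitrary illustrations of ours:

    import numpy as np
    from scipy.stats import invgauss

    mu, lam = 0.8, 2.5  # illustrative Inverse-Gaussian(mu, lambda) parameters
    for M in [1.0, 2.0, 5.0, 10.0]:
        exact = invgauss(mu / lam, scale=lam).sf(M)  # exact P(X > M)
        bound = np.sqrt(2 * lam / (np.pi * M)) * np.exp(lam / mu - lam * M / (2 * mu**2))
        print(M, exact, bound, bool(exact <= bound))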
Applying the above lemma with $\mu = \lambda \sigma / |\beta_i|$, shape parameter $\lambda^2$, and $M = K n^{\delta/2}$, an upper bound for (2.6) is given by
$$s_n \frac{K}{n^{\delta/4}} E\left[\exp\left(\frac{\lambda |\beta_i|}{\sigma} - \frac{K n^{\delta/2} \beta_i^2}{2 \sigma^2}\right) \,\middle|\, y_n\right], \quad (2.7)$$
where $K$ again denotes a generic positive constant.
Next observe that, a priori, the $\beta_i$ are iid with common pdf $f(\beta \mid \lambda, \sigma) = \frac{\lambda}{2 \sigma} \exp\{-(\lambda/\sigma) |\beta|\}$. Also, since $X_n^T X_n = n I_{p_n}$, we can write $\|y_n - X_n \beta_n\|^2 = \|y_n - X_n \hat{\beta}_n\|^2 + \|X_n(\hat{\beta}_n - \beta_n)\|^2$ with $\|X_n(\hat{\beta}_n - \beta_n)\|^2 = n \|\hat{\beta}_n - \beta_n\|^2 = n \sum_{i=1}^{p_n} (\beta_i - \hat{\beta}_{n,i})^2$. Hence, the posterior of $\beta_i \mid y_n$ is
$$\pi(\beta_i \mid y_n) \propto \exp\left\{-\frac{n}{2 \sigma^2} (\beta_i - \hat{\beta}_{n,i})^2 - \frac{\lambda |\beta_i|}{\sigma}\right\}. \quad (2.8)$$
In view of (2.7),
$$E\left[\exp\left(\frac{\lambda |\beta_i|}{\sigma} - \frac{K n^{\delta/2} \beta_i^2}{2 \sigma^2}\right) \,\middle|\, y_n\right] = \frac{\displaystyle\int_{-\infty}^\infty \exp\left\{-\frac{n}{2 \sigma^2} (\beta_i - \hat{\beta}_{n,i})^2 - \frac{K n^{\delta/2} \beta_i^2}{2 \sigma^2}\right\} d\beta_i}{\displaystyle\int_{-\infty}^\infty \exp\left\{-\frac{n}{2 \sigma^2} (\beta_i - \hat{\beta}_{n,i})^2 - \frac{\lambda |\beta_i|}{\sigma}\right\} d\beta_i} = N/D \text{ (say)}, \quad (2.9)$$
since the factor $\exp(\lambda |\beta_i|/\sigma)$ cancels the double exponential prior in the numerator.
Now,
$$N \leq \int_{-\infty}^\infty \exp\left\{-\frac{n}{2 \sigma^2} (\beta_i - \hat{\beta}_{n,i})^2\right\} d\beta_i = (2 \pi \sigma^2 / n)^{1/2}. \quad (2.10)$$
In fact, completing the square in the numerator of (2.9) gives the sharper bound $N \leq (2 \pi \sigma^2 / n)^{1/2} \exp\{-K' n^{\delta/2} \hat{\beta}_{n,i}^2 / (2 \sigma^2)\}$ for large $n$ (with $K'$ a generic constant), which decays rapidly for the indices with $|\hat{\beta}_{n,i}| > \epsilon/4$ appearing in (2.5).
Also, splitting the integral at zero and completing the square in each piece,
$$D = \int_0^\infty \exp\left\{-\frac{n \beta_i^2}{2 \sigma^2} + \frac{n \beta_i \hat{\beta}_{n,i}}{\sigma^2} - \frac{n \hat{\beta}_{n,i}^2}{2 \sigma^2} - \frac{\lambda \beta_i}{\sigma}\right\} d\beta_i + \int_{-\infty}^0 \exp\left\{-\frac{n \beta_i^2}{2 \sigma^2} + \frac{n \beta_i \hat{\beta}_{n,i}}{\sigma^2} - \frac{n \hat{\beta}_{n,i}^2}{2 \sigma^2} + \frac{\lambda \beta_i}{\sigma}\right\} d\beta_i$$
$$= \exp\left(\frac{\lambda^2}{2 n} - \frac{\lambda \hat{\beta}_{n,i}}{\sigma}\right) \int_0^\infty \exp\left\{-\frac{n}{2 \sigma^2} \left(\beta_i - \hat{\beta}_{n,i} + \frac{\lambda \sigma}{n}\right)^2\right\} d\beta_i + \exp\left(\frac{\lambda^2}{2 n} + \frac{\lambda \hat{\beta}_{n,i}}{\sigma}\right) \int_{-\infty}^0 \exp\left\{-\frac{n}{2 \sigma^2} \left(\beta_i - \hat{\beta}_{n,i} - \frac{\lambda \sigma}{n}\right)^2\right\} d\beta_i$$
$$= \left[\exp\left(\frac{\lambda^2}{2 n} - \frac{\lambda \hat{\beta}_{n,i}}{\sigma}\right) \Phi\left(\frac{\sqrt{n}}{\sigma}\left(\hat{\beta}_{n,i} - \frac{\lambda \sigma}{n}\right)\right) + \exp\left(\frac{\lambda^2}{2 n} + \frac{\lambda \hat{\beta}_{n,i}}{\sigma}\right) \Phi\left(-\frac{\sqrt{n}}{\sigma}\left(\hat{\beta}_{n,i} + \frac{\lambda \sigma}{n}\right)\right)\right] (2 \pi \sigma^2 / n)^{1/2}, \quad (2.11)$$
where $\Phi$ denotes the standard normal cdf.
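The closed form (2.11) can be verified by numerical quadrature (a check of ours, with arbitrary illustrative values of $n$, $\sigma$, $\lambda$, and $\hat{\beta}_{n,i}$):

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    n, sigma, lam, bhat = 50, 1.0, 2.0, 0.7  # illustrative values
    # D as defined in (2.9): the normalizing constant of the posterior (2.8)
    D_quad, _ = quad(lambda b: np.exp(-n * (b - bhat)**2 / (2 * sigma**2)
                                      - lam * abs(b) / sigma), -np.inf, np.inf)
    s = np.sqrt(n) / sigma
    D_closed = np.sqrt(2 * np.pi * sigma**2 / n) * (
        np.exp(lam**2 / (2 * n) - lam * bhat / sigma) * norm.cdf(s * (bhat - lam * sigma / n))
        + np.exp(lam**2 / (2 * n) + lam * bhat / sigma) * norm.cdf(-s * (bhat + lam * sigma / n)))
    print(D_quad, D_closed)  # the two values agree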
Since $\hat{\beta}_{n,i} \to \beta_{0,i}$ a.s., the bracketed factor in (2.11) is bounded away from zero almost surely, and from (2.10) and (2.11) it follows that $N/D = O(1)$ a.s. as $n \to \infty$; combined with the exponential decay of $N$ noted after (2.10), which dominates the factor $s_n$ in (2.7), this shows the bound (2.7) tends to zero. Hence, from (2.7) to (2.11), it follows that $(V) \to 0$ a.s. as $n \to \infty$.
3. Discussion
The Bayesian lasso is a widely used method for sparse Bayesian estimation in linear regression. In this paper, we have established posterior consistency of the Bayesian lasso under an orthogonal design and a fixed variance parameter, in a regime where the number of parameters grows with the sample size. Using the insights obtained from the analysis in this paper, we are currently investigating the high-dimensional posterior consistency of the Bayesian lasso for an arbitrary design matrix and a stochastic variance parameter. In future work, we would also like to undertake a careful analysis of the convergence rates.
Acknowledgment
The author would like to thank Prof. Malay Ghosh and Prof. Kshitij Khare for their help with
the paper.
References
Armagan, A., Dunson, D.B., Lee, J., Bajwa, W.U., Strawn, N. (2013). Posterior consistency in linear models under shrinkage priors. Biometrika 100:1011–1018.
Armagan, A., Dunson, D.B., Clyde, M. (2011a). Generalized beta mixtures of Gaussians. Adv. Neural Info. Proces. Syst. (NIPS).
Carvalho, C.M., Polson, N.G., Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika
97:465–480.
Casella, G. (2001). Empirical Bayes Gibbs sampling. Biostatistics 2:485–500.
Chipman, H., George, E.I., McCulloch, R.E. (2001). The practical implementation of Bayesian model selection. IMS Lect. Notes - Monograph Ser. 38.
Clyde, M., Ghosh, J., Littman, M.L. (2010). Bayesian adaptive sampling for variable selection and model
averaging. J. Comput. Graph. Stat. 20(1):80–101.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
J. Am. Stat. Assoc. 96:1348–1360.
Griffin, J.E., Brown, P.J. (2007). Bayesian adaptive lassos with non-convex penalization. Technical Report.
Griffin, J.E., Brown, P.J. (2010). Inference with normal-gamma prior distributions in regression prob-
lems. Bayesian Anal. 5:171–188.
George, E.I., McCulloch, R.E. (1993). Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 88:881–889.
Hans, C. (2009). Bayesian lasso regression. Biometrika 96:835–845.
Johnstone, I.M., Silverman, B.W. (2004). Needles and straw in haystacks: Empirical Bayes estimates of
possibly sparse sequences. Ann. Stat. 32:1594–1649.
Knight, K., Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Stat. 28:1356–1378.
Kyung, M., Gill, J., Ghosh, M., Casella, G. (2010). Penalized regression, standard errors, and Bayesian
lassos. Bayesian Anal. 5:369–412.
Liang, F., Paulo, R., Molina, G., Clyde, M., Berger, J. (2008). Mixtures of g priors for Bayesian variable
selection. J. Am. Stat. Assoc. 103:410–423.
Park, T., Casella, G. (2008). The Bayesian lasso. J. Am. Stat. Assoc. 103:681–686.
Raftery, A.E., Madigan, D., Hoeting, J.A. (1997). Bayesian model averaging for linear regression models. J. Am. Stat. Assoc. 92:179–191.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Ser. B 58:267–
288.
Yuan, M., Lin, Y. (2005). Efficient empirical Bayes variable selection and estimation in linear models. J. Am. Stat. Assoc. 100:1215–1225.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Royal Stat. Soc. Ser. B 68:49–67.
Zhao, P., Yu, B. (2006). On model selection consistency of lasso. J. Mach. Learn. Res. 7:2541–2563.
Zou, H. (2006). The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101:1418–1429.
Zou, H., Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Royal Stat. Soc. Ser. B 67:301–320.
Zou, H., Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat.
36:1509–1533.