Contributions to Scalable Bayesian Computation
Niloy Biswas
Department of Statistics
Harvard University
PhD Thesis
October 2, 2022
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 1 / 44
References
Niloy Biswas, Pierre Jacob and Paul Vanetti.
Estimating convergence of Markov chains with L-lag couplings.
Advances in Neural Information Processing Systems (NeurIPS), 2019.
Niloy Biswas, Anirban Bhattacharya, Pierre Jacob and James Johndrow.
Coupling-based convergence assessment of some Gibbs samplers for
high-dimensional Bayesian regression with shrinkage priors.
Journal of the Royal Statistical Society: Series B (JRSSB), 2022.
Niloy Biswas and Lester Mackey.
Bounding Wasserstein distance with couplings.
Revision requested, Journal of the American Statistical Association (JASA), 2022.
Niloy Biswas, Xiao-Li Meng and Lester Mackey.
Scalable Spike-and-Slab.
International Conference on Machine Learning (ICML), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 2 / 44
Motivation
Motivation
Some tales from the crypt in Bayesian computation
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 3 / 44
Motivation
Motivation
Target distribution, P.
Monte Carlo methods generate samples from P.
Example applications:
Calculate integrals: ∫ h(x)P(x) dx = E_{Xi∼P}[ (1/N) Σ_{i=1}^N h(Xi) ]
Bayesian inference: sample from posterior
Variational autoencoders, generative models, . . .
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
Motivation
Tales from the crypt¹
Folklore about Monte Carlo in large scale applications:
“Markov chain Monte Carlo (MCMC) algorithms have prohibitively high
computational cost per iteration”,
“MCMC algorithms converge slowly for high-dimensional, multimodal
target distributions”,
“Bayesian inference is computationally too expensive”.
We will revisit some of these tales from the crypt.
¹ Chopin & Papaspiliopoulos, Laplace's Demon seminar on Bayesian Machine Learning at Scale, 2020.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
Motivation
Perfect, Exact and Approximate Sampling
Target distribution, P.
Monte Carlo methods generate samples from P. How?
1. Perfect sampling
Directly sample from P. Difficult in general.
2. (Asymptotically) Exact sampling
Generate samples from distributions Pt such that Pt ⇒ P as t → ∞
E.g. Markov chain Monte Carlo (MCMC)
3. Approximate sampling
Target Q, for some Q ≈ P
E.g. Approximate MCMC, Variational Inference
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
Motivation
Markov chain Monte Carlo
MCMC:
Initialize X0 ∼ P0. For all t ≥ 1, sample Xt ∼ K(Xt−1, ·).
Kernel K is P-invariant, so that Xt ⇒ X∞ ∼ P as t → ∞.
Must stop algorithm at some finite time T.
How close is XT ∼ PT to X∞ ∼ P?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 7 / 44
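A minimal sketch in Python of one such P-invariant kernel, for the stylized example that follows (a Gaussian random-walk Metropolis-Hastings chain targeting N(0, 1), started at X0 = 10); this is illustrative only, not the thesis code, and the step size is an arbitrary assumption:

import numpy as np

rng = np.random.default_rng(0)

def log_target(x):                    # log density of P = N(0, 1), up to a constant
    return -0.5 * x**2

def mh_kernel(x, step=1.0):           # one draw from the P-invariant kernel K(x, .)
    prop = x + step * rng.normal()    # Gaussian random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        return prop                   # accept
    return x                          # reject: stay at x

T = 150
X = np.empty(T + 1)
X[0] = 10.0                           # X_0 ~ P_0, here started far from P
for t in range(T):
    X[t + 1] = mh_kernel(X[t])        # X_t ~ P_t, and P_t => P as t -> infinity
# The algorithm is stopped at finite T; the question is how close P_T is to P.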
L-Lag couplings of Markov chains
How close is XT ∼ PT to X∞ ∼ P?
L-Lag couplings of Markov chains
Biswas, Jacob and Vanetti. Estimating convergence of Markov chains with L-lag
couplings. Advances in Neural Information Processing Systems (NeurIPS), 2019
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 8 / 44
L-Lag couplings of Markov chains
MCMC: Stylized Example
X0 = 10. Target: P = N(0, 1).
[Figure: trace plot of the chain (Xt) over iterations t = 0, ..., 150, started at X0 = 10.]
How close is XT ∼ PT to X∞ ∼ P?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 10 / 44
L-Lag couplings of Markov chains
Distance between probability distributions
How close is XT ∼ PT to X∞ ∼ P?
Total variation distance (TV): dTV(PT, P) = (1/2) sup_{h: |h|≤1} E[h(XT) − h(X∞)]
Captures errors on histograms and credible intervals.
Wasserstein distance: captures errors between moments.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 11 / 44
L-Lag couplings of Markov chains
Bounds from L-Lag Couplings
[Figure: trace plot of (Xt) alongside total variation upper bounds over t = 0, ..., 150, for lags L = 1 and L = 150, compared with the exact dTV.]
dTV(Pt, P) = (1/2) sup_{h: |h|≤1} E[h(Xt) − h(X∞)]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 12 / 44
L-Lag couplings of Markov chains
L-Lag Couplings
A pair of Markov chains (Xt, Yt)t≥0 such that:
Same marginal distributions: Xt ∼ Yt ∼ Pt ∀t ≥ 0, with Pt ⇒ P as t → ∞
Xt and Yt−L meet exactly at time τ(L) := inf {t > L : Xt = Yt−L}
[Figure: coupled trace plots of the two chains, which meet exactly after a finite number of iterations.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 13 / 44
L-Lag couplings of Markov chains
Back to stylized example
How close is Xt ∼ Pt to X∞ ∼ P = N(0, 1)?
[Figure: trace plot of (Xt) alongside total variation upper bounds over t = 0, ..., 150, for lags L = 1 and L = 150, compared with the exact dTV.]
dTV(Pt, P) = (1/2) sup_{h: |h|≤1} E[h(Xt) − h(X∞)] ≤ E[ max(0, ⌈(τ(L) − L − t)/L⌉) ]
(a minimal sketch of this bound follows below)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 14 / 44
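A minimal sketch in Python of this bound with lag L = 1, for the stylized example, using an autoregressive Gaussian kernel K(x, ·) = N(a·x, 1 − a²) that is N(0, 1)-invariant; the kernel, its parameters and all helper names are illustrative assumptions, not the thesis code:

import numpy as np

rng = np.random.default_rng(1)
a = 0.9
sd = np.sqrt(1 - a**2)                      # AR(1) kernel N(a*x, 1 - a^2) targets N(0, 1)

def maximal_coupling_normals(m1, m2, s):
    """Draw (X, Y) with X ~ N(m1, s^2), Y ~ N(m2, s^2), maximising P(X = Y)."""
    X = rng.normal(m1, s)
    if np.log(rng.uniform()) - 0.5 * ((X - m1) / s) ** 2 <= -0.5 * ((X - m2) / s) ** 2:
        return X, X                          # the two chains output the same value
    while True:
        Y = rng.normal(m2, s)
        if np.log(rng.uniform()) - 0.5 * ((Y - m2) / s) ** 2 > -0.5 * ((Y - m1) / s) ** 2:
            return X, Y

def meeting_time(x0=10.0, L=1, max_iter=10_000):
    """tau(L) = inf{t > L : X_t = Y_{t-L}} for two coupled chains started at x0."""
    X = x0
    for _ in range(L):                       # advance the X chain alone for L steps
        X = a * X + sd * rng.normal()
    Y, t = x0, L
    while t < max_iter:
        X, Y = maximal_coupling_normals(a * X, a * Y, sd)
        t += 1
        if X == Y:                           # exact meeting, by the maximal coupling
            return t
    return max_iter

taus = np.array([meeting_time() for _ in range(1000)])

def tv_upper_bound(t, taus, L=1):
    """Empirical estimate of E[max(0, ceil((tau(L) - L - t) / L))]."""
    return float(np.mean(np.maximum(0.0, np.ceil((taus - L - t) / L))))

print([round(tv_upper_bound(t, taus), 3) for t in (0, 20, 40, 80)])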
L-Lag couplings of Markov chains
Existing (and ongoing) work on MCMC convergence
How close is XT ∼ PT to X∞ ∼ P?
Analytical results:
Rosenthal (1995), Roberts and Rosenthal (2004), Khare and Hobert (2013),
Durmus et al. (2016), Qin and Hobert (2019, 2020), etc.
Bounds often of the form C(P0)f (t). Constant C(P0) unknown.
Empirical techniques:
Cowles and Rosenthal (1998), Johnson (1996).
Popular convergence diagnostics:
Gelman and Brooks (1998), Gelman and Rubin (1992), Vehtari et al. (2021)
Extensions and applications:
• Craiu and Meng (2020), Kelly and Ryder (2021), Ju et al. (2021), Jacob et
al. (2021), Papp and Sherlock (2022), . . .
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 16 / 44
L-Lag couplings of Markov chains
Back to stylized example
How close is XT ∼ PT to X∞ ∼ P = N(0, 1)?
[Figure: trace plot of (Xt) and total variation upper bounds over t = 0, ..., 150, for lags L = 1 and L = 150, compared with the exact dTV.]
Can our diagnostics work in large-scale settings?
In high dimensions, we cannot rely on trace plots.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 17 / 44
Assessing convergence of some Gibbs samplers
Can our diagnostics work in large-scale settings?
Assessing convergence of some Gibbs samplers
Biswas, Bhattacharya, Jacob and Johndrow. Coupling-based convergence assessment of some
Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors. Journal of the
Royal Statistical Society: Series B (JRSSB), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 18 / 44
Assessing convergence of some Gibbs samplers
High-dimensional Bayesian regression
Data: y ∈ R^n, X ∈ R^{n×p}, n ≪ p.
Model:
y = Xβ + σϵ, ϵ ∼ N(0, I_n)
Want to do inference on β ∈ R^p
Half-t Shrinkage Prior:
βj | σ², ξ, ηj ∼ N(0, σ²/(ξηj))
σ^{−2} ∼ Gamma(a0/2, b0/2)
Global: ξ^{−1/2} ∼ Cauchy+(0, 1)
Local: ηj^{−1/2} ∼ t+(ν)
Popular example:
Horseshoe: ηj^{−1/2} ∼ t+(1) = Cauchy+(0, 1)
Gelman (2006); Carvalho (2009); Polson and Scott (2012).
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
Assessing convergence of some Gibbs samplers
Half-t priors: statistical estimation benefits
∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^{−(1+ν)} as |βj| → +∞.
[Figure: marginal prior density of βj, with a pole at zero and polynomial tails.]
van der Pas et al. (2014, 2017); Ghosh and Chakrabarti (2017).
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
Assessing convergence of some Gibbs samplers
Half-t priors: computation challenges
Pole at 0:
π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0
Numerical instability for gradient-based samplers.
Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a).
Polynomial tails:
π(λβ⊥ | σ², ξ, y) ≍ λ^{−p(1+ν)} as λ → +∞,
for any β⊥ ∈ R^p with non-zero entries s.t. Xβ⊥ = 0.
Slow convergence of Gaussian-proposal-based MH samplers.
Roberts & Tweedie (1996); Jarner & Tweedie (2003); Livingstone et al. (2019a).
Trade–off between statistical estimation and “generic” sampling algorithms
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
Assessing convergence of some Gibbs samplers
A Gibbs sampler for Half-t priors
Markov chain (βt, ηt)t≥0 on R^p × R^p. (ξ, σ² fixed for simplicity.)
β | η ∼ N(Σ_η^{−1} X^T y, σ² Σ_η^{−1}), where Σ_η := X^T X + ξ Diag(η)
π(η | β) = ∏_{j=1}^p π(ηj | βj)
Polson & Scott (2012); Johndrow et al. (2020).
The chain is geometrically ergodic.
The proof is not trivial; it addresses an open question about such Gibbs samplers with heavy-tailed priors.
Can we numerically assess convergence?
Coupling-based assessment: Johnson (1996), Biswas et al. (2019)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
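A naive Python sketch of the β | η update above via a Cholesky factorisation of Σ_η; this costs O(p³) and is purely illustrative (the thesis and Johndrow et al. (2020) use faster linear algebra exploiting n ≪ p):

import numpy as np

def sample_beta(X, y, eta, xi, sigma2, rng):
    """Draw beta | eta ~ N(Sigma_eta^{-1} X^T y, sigma2 * Sigma_eta^{-1})."""
    Sigma_eta = X.T @ X + xi * np.diag(eta)
    L = np.linalg.cholesky(Sigma_eta)            # Sigma_eta = L L^T
    mean = np.linalg.solve(Sigma_eta, X.T @ y)
    z = rng.standard_normal(X.shape[1])
    w = np.linalg.solve(L.T, z)                  # w ~ N(0, Sigma_eta^{-1})
    return mean + np.sqrt(sigma2) * w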
Assessing convergence of some Gibbs samplers
A two-scale coupling
Single Chain:
β | η ∼ N(Σ_η^{−1} X^T y, σ² Σ_η^{−1}), where Σ_η := X^T X + ξ Diag(η)
π(η | β) = ∏_{j=1}^p π(ηj | βj)
Coupled Chain:
β1, β2 | η1, η2 ∼ Common random numbers (CRN; 'synchronous')
η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > d_threshold; Maximal coupling when d(β1, β2) ≤ d_threshold
Why is this a good idea? Which metric d?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
Assessing convergence of some Gibbs samplers
A two-scale coupling
Coupled Chain:
β1, β2 | η1, η2 ∼ Common random numbers (CRN)
η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > d_threshold; Maximal coupling when d(β1, β2) ≤ d_threshold (a schematic sketch follows below)
Why is this a good idea?
Insight: when (β1, β2) are far apart, attempts to make (η1, η2) meet exactly are wasteful.
When far apart, use CRN so the chains get close in the future.
Only when close, attempt an exact meeting.
Which metric d?
We want d to capture the probability of meeting exactly:
d(β1,t, β2,t) := P_MaxCouple(η1,t+1 ≠ η2,t+1 | β1,t, β2,t)
Roberts & Rosenthal (2002)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
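A schematic Python sketch of the switching rule above; the helpers crn_update, max_coupling_update and meet_prob are hypothetical placeholders for the CRN update of (η1, η2), the coordinate-wise maximal coupling, and the metric d:

def coupled_eta_update(beta1, beta2, d_threshold,
                       crn_update, max_coupling_update, meet_prob):
    """Two-scale update of (eta1, eta2) given the current (beta1, beta2)."""
    if meet_prob(beta1, beta2) > d_threshold:
        # Chains are still far apart: exact meetings are unlikely, so use common
        # random numbers and let the chains contract towards each other first.
        return crn_update(beta1, beta2)
    # Chains are close: attempt an exact meeting of the eta components.
    return max_coupling_update(beta1, beta2)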
Assessing convergence of some Gibbs samplers
GWAS dataset example: Half-t(2) prior
n ≈ 2,266 different maize lines.
p ≈ 98,385 covariates per maize line, linked to SNPs in the genome.
(βt, ηt)t≥0 Markov chain now on R^98,385 × R^98,385.
How long does it take to converge? < 1,000 steps.
[Figure: total variation distance upper bounds, falling to zero within 1,000 iterations.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
Assessing convergence of some Gibbs samplers
Open questions
Convergence complexity analysis
Interplay between posterior concentration and MCMC convergence
Alternative coupling algorithms
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 26 / 44
Bounding Wasserstein distance with couplings
What about diagnostics for approximate samplers?
Bounding Wasserstein distance with couplings
Biswas and Mackey. Bounding Wasserstein distance with couplings. Revision
requested, Journal of the American Statistical Association (JASA), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 27 / 44
Bounding Wasserstein distance with couplings
Exact and Approximate MCMC
Exact MCMC:
Initialize X0 ∼ P0. For t ≥ 1, sample Xt ∼ K1(Xt−1, ·).
Kernel K1 is P-invariant, so that Xt ⇒ P as t → ∞.
Approximate MCMC:
Initialize Y0 ∼ Q0. For t ≥ 1, sample Yt ∼ K2(Yt−1, ·).
Kernel K2 is Q-invariant, so that Yt ⇒ Q as t → ∞.
K2 similar to but computationally cheaper than K1
Examples:
Stochastic gradients for tall data
Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021
Matrix approximations for wide data
Johndrow et al. 2020; Narisetty et al. 2019; Atchadé and Wang 2021
How close is P to Q?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
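To illustrate the first example, a Python sketch of one step of a cheaper approximate kernel K2 built from stochastic gradients (in the spirit of SGLD, Welling and Teh, 2011), for a posterior whose log density sums over n data points; the function names and arguments are assumptions made for the sketch, and the exact kernel K1 would instead use the full-data gradient with a Metropolis-Hastings correction:

import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, step, batch_size, rng):
    """One unadjusted Langevin step using a minibatch gradient estimate."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    grad = grad_log_prior(theta)
    grad += (len(data) / batch_size) * sum(grad_log_lik(theta, data[i]) for i in idx)
    return theta + 0.5 * step**2 * grad + step * rng.standard_normal(theta.shape)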
Bounding Wasserstein distance with couplings
Wasserstein distance
How close is P to Q?
For a metric space (X, c) and p ≥ 1,
Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^{1/p}.
A geometrically faithful metric between probability measures
Villani, 2008; Peyré and Cuturi, 2019
Can control the absolute difference between moments of P and Q
Gelbrich, 1990; Sriperumbudur et al., 2012; Huggins et al., 2020
We will estimate upper bounds of Wp(P, Q) with couplings.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
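For Gaussians there is a closed form (Gelbrich, 1990), which gives the "True W2" baseline in the stylized example below; a short illustrative sketch in Python:

import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """W2 distance between N(m1, S1) and N(m2, S2) under the Euclidean metric."""
    S2_half = sqrtm(S2)
    cross = sqrtm(S2_half @ S1 @ S2_half)
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(np.real(w2_sq)))        # sqrtm can return tiny imaginary parts

d = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
print(w2_gaussians(np.zeros(d), Sigma, np.zeros(d), np.eye(d)))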
Bounding Wasserstein distance with couplings
Couplings
Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X),
K̄((x, y), (A, X)) = K1(x, A) and K̄((x, y), (X, A)) = K2(y, A)
Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018
We simulate such couplings to empirically assess sample quality
Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·)
For such independent coupled trajectories (X_t^(i), Y_t^(i))_{t=1}^T with i = 1, . . . , I, define the estimator
CUB_p ≜ ( (1/(I(T − S))) Σ_{i=1}^I Σ_{t=S+1}^T c(X_t^(i), Y_t^(i))^p )^{1/p}.
Use CUBp to estimate upper bounds of Wp(P, Q)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
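A direct Python sketch of the CUB_p estimator from stored coupled trajectories; here xs and ys are assumed to be arrays of shape (I, T + 1, dim) holding X_t^(i) and Y_t^(i), and c is the Euclidean distance:

import numpy as np

def cub(xs, ys, S, p=2):
    """Average c(X_t, Y_t)^p over chains i and iterations t = S+1, ..., T, then take the 1/p power."""
    diffs = np.linalg.norm(xs[:, S + 1:] - ys[:, S + 1:], axis=-1)
    return float(np.mean(diffs ** p) ** (1.0 / p))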
Bounding Wasserstein distance with couplings
A stylized example
P = N(0, Σ) for [Σ]_{i,j} = 0.5^{|i−j|}
Q = N(0, I_d) on R^d
Apply common random numbers (CRN) coupling of MALA kernels marginally
targeting P and Q.
[Figure: W2 upper bounds against trajectory length T at dimension d = 100, comparing the independent coupling, CUB2, the empirical W2, and the true W2.]
Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^{1/2} = (2d)^{1/2}.
CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^{1/2}.
Empirical W2: W2(P̂_{1000I}, Q̂_{1000I}).
True W2: W2(P, Q).
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
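A rough Python sketch of this CRN coupling with a single trajectory; the step size, seed and initialisation are illustrative assumptions rather than the experiment's settings:

import numpy as np

d = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
Sigma_inv = np.linalg.inv(Sigma)

def log_p(x): return -0.5 * x @ Sigma_inv @ x     # log N(0, Sigma), up to constants
def grad_log_p(x): return -Sigma_inv @ x
def log_q(y): return -0.5 * y @ y                 # log N(0, I_d), up to constants
def grad_log_q(y): return -y

def mala_step(x, log_pi, grad_log_pi, h, z, log_u):
    """One MALA move driven by shared noise z and shared uniform log_u (the CRN coupling)."""
    prop = x + 0.5 * h**2 * grad_log_pi(x) + h * z
    fwd = -0.5 * np.sum((prop - x - 0.5 * h**2 * grad_log_pi(x)) ** 2) / h**2
    bwd = -0.5 * np.sum((x - prop - 0.5 * h**2 * grad_log_pi(prop)) ** 2) / h**2
    log_alpha = log_pi(prop) - log_pi(x) + bwd - fwd
    return prop if log_u < log_alpha else x

rng = np.random.default_rng(2)
h, T = 0.3, 1000
X = rng.multivariate_normal(np.zeros(d), Sigma)   # chain with kernel targeting P
Y = rng.standard_normal(d)                        # chain with kernel targeting Q
sq_dists = []
for t in range(T):
    z, log_u = rng.standard_normal(d), np.log(rng.uniform())   # shared randomness
    X = mala_step(X, log_p, grad_log_p, h, z, log_u)
    Y = mala_step(Y, log_q, grad_log_q, h, z, log_u)
    sq_dists.append(np.sum((X - Y) ** 2))
print(np.sqrt(np.mean(sq_dists)))                 # CUB_2 with S = 0 and I = 1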
Bounding Wasserstein distance with couplings
A stylized example
P = N(0, Σ) for [Σ]_{i,j} = 0.5^{|i−j|}
Q = N(0, I_d) on R^d
Apply common random numbers (CRN) coupling of MALA kernels marginally
targeting P and Q.
[Figure: W2 upper bounds against trajectory length T at dimension d = 100, and against dimension d from 200 to 1,000, comparing the independent coupling, CUB2, the empirical W2, and the true W2.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 32 / 44
Bounding Wasserstein distance with couplings
Algorithms to sample from the coupled kernel K̄
To simulate the coupled chain (Xt, Yt)t≥0, we consider a kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X),
K̄((x, y), (A, X)) = K1(x, A) and K̄((x, y), (X, A)) = K2(y, A)
To construct and analyze K̄, we make use of:
1. A Markovian coupling Γ1 of kernel K1: for all x, y ∈ X, Γ1(x, y) is a coupling of the distributions K1(x, ·) and K1(y, ·).
2. A coupling Γ∆ of kernels K1 and K2 from the same point: for all z ∈ X, Γ∆(z) is a coupling of the distributions K1(z, ·) and K2(z, ·).
[Figure: the joint kernel K̄ on X × X, drawing (Xt+1, Zt+1) via Γ1 from (Xt, Yt) and (Zt+1, Yt+1) via Γ∆ from Yt.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
Bounding Wasserstein distance with couplings
Interpretable upper bounds for CUB
Theorem (CUB upper bound)
Let (Xt, Yt)t≥0 denote a coupled Markov chain generated using joint kernel K̄.
Suppose there exists a constant ρ ∈ (0, 1) such that for all Xt, Yt ∈ X and
(Xt+1, Yt+1)|(Xt, Yt) ∼ Γ1(Xt, Yt),
E[c(Xt+1, Yt+1)^p | Xt, Yt]^{1/p} ≤ ρ c(Xt, Yt).
Then
E[CUB_{p,t}^p]^{1/p} = E[c(Xt, Yt)^p]^{1/p} ≤ ρ^t E[c(X0, Y0)^p]^{1/p} + Σ_{i=1}^t ρ^{t−i} E[∆p(Y_{i−1})]^{1/p}
for all t ≥ 0, where ∆p(z) := E[c(X, Y)^p | z] for (X, Y) | z ∼ Γ∆(z).
Generalizes existing W1 results from Markov chain perturbation theory to: (i) Wp for all p ≥ 1, and (ii) couplings which may not be Wasserstein-optimal.
Extensions with weaker assumptions are in the paper.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 34 / 44
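A sketch of the one-step argument behind this recursion (our reconstruction from the construction of K̄ above; the full proof, and its weaker-assumption variants, are in the paper). Writing Zt+1 for the shared draw, with (Xt+1, Zt+1) ∼ Γ1(Xt, Yt) and (Zt+1, Yt+1) ∼ Γ∆(Yt), the triangle inequality for c and Minkowski's inequality give
E[c(Xt+1, Yt+1)^p]^{1/p} ≤ E[c(Xt+1, Zt+1)^p]^{1/p} + E[c(Zt+1, Yt+1)^p]^{1/p} ≤ ρ E[c(Xt, Yt)^p]^{1/p} + E[∆p(Yt)]^{1/p},
and unrolling this recursion from t = 0 yields the geometric-sum bound in the theorem.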
Bounding Wasserstein distance with couplings
MCMC for tall data
Bayesian logistic regression on:
Pima Indians (n = 768 observations and d = 8 covariates)
DS1 life sciences dataset (n = 26,732 and d = 10)
We apply CRN coupling of MALA kernels.
[Figure: W2 upper and lower bounds for the Pima and DS1 datasets, across approximate MCMC and variational procedures: Mean Field VB, Laplace, SGLD 10%, SGLD 50%, and ULA.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
Bounding Wasserstein distance with couplings
Approximate MCMC for high-dimensional linear regression
Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088).
y = Xβ + σϵ̃, ϵ̃ ∼ N(0, I_n)
ξ^{−1/2} ∼ C+(0, 1), ηj^{−1/2} i.i.d. ∼ t+(2), σ^{−2} ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj))
Polson and Scott, 2020; Biswas et al. 2022
Exact MCMC: O(n²d) cost from X Diag(ηt)^{−1} X^T
Approximate MCMC: X Diag(η̃t)^{−1} X^T for η̃j,t ≜ ηj,t 1{ηj,t^{−1} > ϵ} (Johndrow et al., 2020)
Apply CRN coupling of the exact and the approximate chain.
[Figure: W2 upper and lower bounds between the exact and the approximate chain, for approximation thresholds ε ∈ {0, 1e−04, 0.001, 0.01}.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
Bounding Wasserstein distance with couplings
Open questions
Can we avoid sampling from the exact MCMC chain (Xt)t≥0?
Consider constructing a Markov chain (Y′t, Yt)t≥0 such that:
1. Y′t ∼ Yt for all t ≥ 0, both marginally distributed according to the approximate MCMC chain.
2. E[c(Xt, Y′t)^p] = E[c(Xt, Yt)^p] ≤ E[c(Y′t, Yt)^p] for all t ≥ 0.
Then E[c(Xt, Yt)^p]^{1/p} ≤ E[c(Y′t, Yt)^p]^{1/p} ≤ 2 E[c(Xt, Yt)^p]^{1/p}.
Upper bounds from (Y′t, Yt)t≥0 are only loose by a constant factor of 2.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
Scalable Spike-and-Slab
Scalable Spike-and-Slab
Biswas, Mackey and Meng. Scalable Spike-and-Slab.
International Conference on Machine Learning (ICML), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 38 / 44
Scalable Spike-and-Slab
Variable selection with spike-and-slab priors
High-dimensional data: y ∈ R^n, design matrix X ∈ R^{n×p}, n ≪ p.
Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, I_n)
Continuous Spike-and-Slab Prior [George and McCulloch, 1993]
σ² ∼ InvGamma(a0/2, b0/2)
zj ∼ Bernoulli(q) i.i.d. for j = 1, . . . , p
βj | zj, σ² ∼ (1 − zj) N(0, σ²τ0²) [Spike] + zj N(0, σ²τ1²) [Slab], independently for j = 1, . . . , p
Inference using P(zj = 1|y). Guan and Stephens, 2011, Zhou et al., 2013, . . .
How to sample from the posterior?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
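A small Python sketch drawing once from this prior; the hyperparameter values are placeholder assumptions, not those used in the thesis:

import numpy as np

rng = np.random.default_rng(3)
p, q, tau0, tau1, a0, b0 = 1000, 0.05, 0.01, 1.0, 1.0, 1.0

sigma2 = 1.0 / rng.gamma(a0 / 2.0, 2.0 / b0)             # sigma^2 ~ InvGamma(a0/2, b0/2)
z = rng.binomial(1, q, size=p)                           # z_j ~ Bernoulli(q), i.i.d.
scale = np.sqrt(sigma2) * np.where(z == 1, tau1, tau0)   # slab vs spike standard deviation
beta = rng.normal(0.0, scale)                            # beta_j | z_j, sigma^2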
Scalable Spike-and-Slab
Bayesian computation for spike-and-slab priors
High-dimensional data: y ∈ R^n, design matrix X ∈ R^{n×p}, n ≪ p.
Markov chain Monte Carlo methods:
Naïve MCMC: O(p³) cost per iteration
State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration
For large datasets, the O(n²p) cost becomes prohibitive,
e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration
Approximate inference methods:
Approx. MCMC: O(max{n∥zt∥₁², np}) cost at iteration t (Narisetty et al., 2019)
Variational inference (Ray et al., 2020; Ray and Szabó, 2021)
These do not converge to the spike-and-slab posterior
Speed up Bayesian computation without compromising on sample quality?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
Scalable Spike-and-Slab
Speed up Bayesian computation without compromising on sample quality?
High-dimensional data: y ∈ R^n, design matrix X ∈ R^{n×p}, n ≪ p.
Markov chain Monte Carlo methods:
State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration
Scalable Spike-and-Slab (S³):
Same MCMC kernel as SOTA.
O(max{n²pt, np}) cost at iteration t for linear and probit regression
(O(max{n²pt, n³, np}) for logistic regression), where
pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥₁}.
(i) sparsity, (ii) posterior concentration, and (iii) positive auto-correlation
all give smaller pt and lower computational cost.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
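For concreteness, pt can be computed directly from consecutive inclusion vectors; a small illustrative Python sketch, reading ∥zt∥ as the number of active coordinates:

import numpy as np

def p_t(z_t, z_prev):
    """p_t = min{ ||z_t||, p - ||z_t||, ||z_t - z_{t-1}||_1 } for binary z in {0, 1}^p."""
    return int(min(z_t.sum(), z_t.size - z_t.sum(), np.abs(z_t - z_prev).sum()))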
Scalable Spike-and-Slab
S³: computational cost
Binary response data: y ∈ {0, 1}^n, design matrix X ∈ R^{n×p}, n ≪ p.
[Figure: time per iteration (ms) against dimension p, comparing S³ Logistic, SOTA Logistic, Skinny Gibbs Logistic, S³ Probit, and SOTA Probit.]
E.g. for n ≈ 4,000, p ≈ 40,000, E[pt] ≈ 10: S³ is 50× faster than SOTA.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 42 / 44
Scalable Spike-and-Slab
S³: sample quality
Binary response data: y ∈ {0, 1}^n, design matrix X ∈ R^{n×p}, n ≪ p.
[Figure: true positive rate (TPR) and false discovery rate (FDR) against dimension p, comparing S³ Logistic, S³ Probit, and Skinny Gibbs Logistic.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 43 / 44
Scalable Spike-and-Slab
Revisiting Tales from the crypt²
(Folklore about) Monte Carlo in large scale applications:
“Markov chain Monte Carlo (MCMC) algorithms (have prohibitively
high) can have lower computational cost per iteration”,
“MCMC algorithms (converge slowly) can converge quickly for high-
dimensional, multimodal target distributions”,
“Bayesian inference is computationally too expensive?”
This thesis participates in a wider quest to scale Bayesian inference.
Welling and Teh (2011); Gorham (2015); Bardenet et al. (2017); Blei et al. (2017); Narisetty et
al. (2019), Bierkens et al. (2019), Papaspiliopoulos et al. (2019), Pollock et al. (2020), Jacob
et al. (2020), Johndrow et al. (2020), Nemeth and Fearnhead (2021), Ray and Szabó (2021) . . .
² Chopin & Papaspiliopoulos, Laplace's Demon seminar on Bayesian Machine Learning at Scale, 2020.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44

  • 3. Motivation Motivation Some tales from the crypt in Bayesian computation Niloy Biswas (Harvard) PhD Thesis October 2, 2022 3 / 44
  • 4. Motivation Motivation Target distribution, P. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 5. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 6. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Example applications: Calculate integrals: R h(x)P(x)dx = EXi ∼P[ 1 N PN i=1 h(Xi )] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 7. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Example applications: Calculate integrals: R h(x)P(x)dx = EXi ∼P[ 1 N PN i=1 h(Xi )] Bayesian inference: sample from posterior Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 8. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Example applications: Calculate integrals: R h(x)P(x)dx = EXi ∼P[ 1 N PN i=1 h(Xi )] Bayesian inference: sample from posterior Variational autoencoders, generative models, . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 9. Motivation Tales from the crypt1 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 10. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 11. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 12. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, “MCMC algorithms converge slowly for high-dimensional, multimodal target distributions”, 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 13. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, “MCMC algorithms converge slowly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive”. 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 14. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, “MCMC algorithms converge slowly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive”. We will revisit some of these tales from the crypt. 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 15. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 16. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 17. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? 1. Perfect sampling Directly sample from P. Difficult in general. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 18. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? 1. Perfect sampling Directly sample from P. Difficult in general. 2. (Asymptotically) Exact sampling Generate samples from distributions Pt such that Pt ⇒ P as t → ∞. E.g. Markov chain Monte Carlo (MCMC) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 19. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? 1. Perfect sampling Directly sample from P. Difficult in general. 2. (Asymptotically) Exact sampling Generate samples from distributions Pt such that Pt ⇒ P as t → ∞. E.g. Markov chain Monte Carlo (MCMC) 3. Approximate sampling Target Q, for some Q ≈ P. E.g. Approximate MCMC, Variational Inference Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 20. Motivation Markov chain Monte Carlo MCMC: Initialize X0 ∼ P0. For all t ≥ 1, sample Xt ∼ K(Xt−1, ·). Kernel K is P-invariant, so that Xt ⇒ X∞ ∼ P as t → ∞. Must stop algorithm at some finite time T. How close is XT ∼ PT to X∞ ∼ P? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 7 / 44
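To make this recipe concrete before the stylized example that follows, here is a minimal Python sketch (not from the slides) of a random-walk Metropolis-Hastings chain targeting P = N(0, 1) from the far-away start X0 = 10; the proposal scale 0.5 and the number of iterations are illustrative choices.

import numpy as np

def rwmh_chain(n_iters, x0=10.0, step=0.5, seed=0):
    # Random-walk Metropolis-Hastings chain targeting P = N(0, 1).
    rng = np.random.default_rng(seed)
    log_target = lambda x: -0.5 * x**2                      # log N(0, 1) density, up to a constant
    xs = np.empty(n_iters + 1)
    xs[0] = x0
    for t in range(1, n_iters + 1):
        prop = xs[t - 1] + step * rng.standard_normal()     # Gaussian proposal around the current state
        log_accept = log_target(prop) - log_target(xs[t - 1])
        xs[t] = prop if np.log(rng.uniform()) < log_accept else xs[t - 1]
    return xs

chain = rwmh_chain(150)    # the chain drifts from 10 towards the bulk of N(0, 1), as in the traceplots below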
  • 21. L-Lag couplings of Markov chains How close is XT ∼ PT to X∞ ∼ P? L-Lag couplings of Markov chains Biswas, Jacob and Vanetti. Estimating convergence of Markov chains with L-lag couplings. Advances in Neural Information Processing System (NeurIPS), 2019 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 8 / 44
  • 22. L-Lag couplings of Markov chains MCMC: Stylized Example X0 = 10. Target: P = N(0, 1). [Figure: traceplot of the chain Xt over iterations t = 0, . . . , 150] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 9 / 44
  • 23. L-Lag couplings of Markov chains MCMC: Stylized Example X0 = 10. Target: P = N(0, 1). [Figure: traceplot of the chain Xt over iterations t = 0, . . . , 150] How close is XT ∼ PT to X∞ ∼ P? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 10 / 44
  • 24. L-Lag couplings of Markov chains Distance between probability distributions How close is XT ∼ PT to X∞ ∼ P? Total variation distance (TV): dTV(PT, P) = (1/2) sup_{h:|h|≤1} E[h(XT) − h(X∞)]. Captures errors on histograms and credible intervals. Wasserstein distance: captures errors between moments. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 11 / 44
  • 25. L-Lag couplings of Markov chains Bounds from L-Lag Couplings [Figure: traceplot of Xt, and dTV upper bounds over iterations t for lags L = 1 and L = 150 alongside the exact dTV curve] dTV(Pt, P) = (1/2) sup_{h:|h|≤1} E[h(Xt) − h(X∞)] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 12 / 44
  • 26. L-Lag couplings of Markov chains L-Lag Couplings A pair of Markov chains (Xt, Yt)t≥0 such that: Same marginal distributions: Xt ∼ Yt ∼ Pt for all t ≥ 0, with Pt ⇒ P as t → ∞. Xt and Yt−L meet exactly at time τ(L) := inf{t > L : Xt = Yt−L}. [Figure: traceplots of the two coupled chains meeting exactly] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 13 / 44
  • 27. L-Lag couplings of Markov chains Back to stylized example How close is Xt ∼ Pt to X∞ ∼ P = N(0, 1)? [Figure: traceplot of Xt, and dTV upper bounds for lags L = 1 and L = 150 alongside the exact dTV curve] dTV(Pt, P) = (1/2) sup_{h:|h|≤1} E[h(Xt) − h(X∞)] ≤ E[max(0, ⌈(τ(L) − L − t)/L⌉)] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 14 / 44
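A minimal sketch (assuming the random-walk MH chain of the earlier sketch, lag L = 1, and illustrative tuning constants) of how the coupling bound above can be estimated: couple the proposals with a standard maximal-coupling routine, share the acceptance uniform, record the meeting time τ(L), and average max(0, ⌈(τ(L) − L − t)/L⌉) over independent replicates.

import numpy as np

rng = np.random.default_rng(1)
log_target = lambda x: -0.5 * x**2                           # log N(0, 1) density, up to a constant

def max_coupling_normals(mu1, mu2, s):
    # Sample (X, Y) from a maximal coupling of N(mu1, s^2) and N(mu2, s^2).
    logp = lambda z, m: -0.5 * ((z - m) / s) ** 2
    x = mu1 + s * rng.standard_normal()
    if np.log(rng.uniform()) <= logp(x, mu2) - logp(x, mu1):
        return x, x
    while True:
        y = mu2 + s * rng.standard_normal()
        if np.log(rng.uniform()) > logp(y, mu1) - logp(y, mu2):
            return x, y

def coupled_mh_step(x, y, s=0.5):
    # One coupled random-walk MH step; chains that have met stay together (faithfulness).
    xp, yp = max_coupling_normals(x, y, s)
    log_u = np.log(rng.uniform())                            # common acceptance uniform
    x_new = xp if log_u < log_target(xp) - log_target(x) else x
    y_new = yp if log_u < log_target(yp) - log_target(y) else y
    return x_new, y_new

def meeting_time(L=1, x0=10.0, s=0.5, t_max=10_000):
    # tau(L) = inf{t > L : X_t = Y_{t-L}} for an L-lag coupled pair started from x0.
    x = x0
    for _ in range(L):                                       # advance the X-chain alone for L steps
        x, _ = coupled_mh_step(x, x, s)
    y, t = x0, L
    while True:
        x, y = coupled_mh_step(x, y, s)
        t += 1
        if x == y or t >= t_max:
            return t

L = 1
taus = np.array([meeting_time(L=L) for _ in range(200)])
tv_bound = lambda t: np.mean(np.maximum(0.0, np.ceil((taus - L - t) / L)))
print(tv_bound(50), tv_bound(150))                           # estimated upper bounds on d_TV(P_t, P)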
  • 28. L-Lag couplings of Markov chains Existing work on MCMC convergence How close is XT ∼ PT to X∞ ∼ P? Analytical results: Rosenthal (1995), Roberts and Rosenthal (2004), Khare and Hobert (2013), Durmus et al. (2016), Qin and Hobert (2019, 2020), etc. Bounds often of the form C(P0)f (t). Constant C(P0) unknown. Empirical techniques: Cowles and Rosenthal (1998), Johnson (1996). Popular convergence diagnostics: Gelman and Brooks (1998), Gelman and Rubin (1992), Vehtari et al.(2021) Extensions: • Craiu and Meng (2020), Kelly and Ryder (2021), Ju et al. (2021), Jacob et al. (2021), Papp and Sherlock (2022), . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 15 / 44
  • 29. L-Lag couplings of Markov chains Existing (and ongoing) work on MCMC convergence How close is XT ∼ PT to X∞ ∼ P? Analytical results: Rosenthal (1995), Roberts and Rosenthal (2004), Khare and Hobert (2013), Durmus et al. (2016), Qin and Hobert (2019, 2020), etc. Bounds often of the form C(P0)f (t). Constant C(P0) unknown. Empirical techniques: Cowles and Rosenthal (1998), Johnson (1996). Popular convergence diagnostics: Gelman and Brooks (1998), Gelman and Rubin (1992), Vehtari et al.(2021) Extensions and applications: • Craiu and Meng (2020), Kelly and Ryder (2021), Ju et al. (2021), Jacob et al. (2021), Papp and Sherlock (2022), . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 16 / 44
  • 30. L-Lag couplings of Markov chains Back to stylized example How close is XT ∼ PT to X∞ ∼ P = N(0, 1)? [Figure: traceplot of Xt, and dTV upper bounds for lags L = 1 and L = 150 alongside the exact dTV curve] Can our diagnostics work in large-scale setting? In high-dimensions, cannot rely on traceplots. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 17 / 44
  • 31. Assessing convergence of some Gibbs samplers Can our diagnostics work in large-scale setting? Assessing convergence of some Gibbs samplers Biswas, Bhattacharya, Jacob and Johndrow. Coupling-based convergence assessment of some Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors. Journal of the Royal Statistical Society: Series B (JRSSB), 2022. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 18 / 44
  • 32. Assessing convergence of some Gibbs samplers High-dimensional Bayesian regression Data: y ∈ Rn, X ∈ Rn×p, n ≪ p. Model: y = Xβ + σϵ ϵ ∼ N(0, In) Want to do inference on β ∈ Rp Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
  • 33. Assessing convergence of some Gibbs samplers High-dimensional Bayesian regression Data: y ∈ Rn, X ∈ Rn×p, n ≪ p. Model: y = Xβ + σϵ, ϵ ∼ N(0, In). Want to do inference on β ∈ Rp. Half-t Shrinkage Prior: βj | σ², ξ, ηj ∼ N(0, σ²/(ξηj)), σ⁻² ∼ Gamma(a0/2, b0/2), Global: ξ^(−1/2) ∼ Cauchy+(0, 1), Local: ηj^(−1/2) ∼ t+(ν) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
  • 34. Assessing convergence of some Gibbs samplers High-dimensional Bayesian regression Data: y ∈ Rn, X ∈ Rn×p, n ≪ p. Model: y = Xβ + σϵ, ϵ ∼ N(0, In). Want to do inference on β ∈ Rp. Half-t Shrinkage Prior: βj | σ², ξ, ηj ∼ N(0, σ²/(ξηj)), σ⁻² ∼ Gamma(a0/2, b0/2), Global: ξ^(−1/2) ∼ Cauchy+(0, 1), Local: ηj^(−1/2) ∼ t+(ν) Popular example: Horseshoe: ηj^(−1/2) ∼ t+(1) = Cauchy+(0, 1). Gelman (2006); Carvalho (2009); Polson and Scott (2012). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
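A short sketch of drawing coefficients from this prior hierarchy (σ² fixed and hyperparameters illustrative); reading the conditional prior variance as σ²/(ξηj) is an interpretation of the flattened slide formula that matches the Gibbs updates later in the deck, so treat it as an assumption.

import numpy as np

def sample_half_t_prior(p, nu=2.0, sigma2=1.0, rng=None):
    # One draw of beta from the half-t shrinkage prior hierarchy.
    rng = np.random.default_rng(0) if rng is None else rng
    xi_inv_sqrt = np.abs(rng.standard_cauchy())              # xi^(-1/2) ~ Cauchy+(0, 1), global scale
    eta_inv_sqrt = np.abs(rng.standard_t(df=nu, size=p))     # eta_j^(-1/2) ~ t+(nu), local scales
    xi, eta = xi_inv_sqrt ** (-2), eta_inv_sqrt ** (-2)
    return rng.normal(0.0, np.sqrt(sigma2 / (xi * eta)))     # beta_j ~ N(0, sigma^2/(xi eta_j)), assumed parametrization

rng = np.random.default_rng(2)
draws = np.array([sample_half_t_prior(1, nu=1.0, rng=rng) for _ in range(10_000)]).ravel()
# With nu = 1 this is the horseshoe: most draws sit very close to zero, with occasional very large values.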
  • 35. Assessing convergence of some Gibbs samplers Half-t priors: statistical estimation benefits ∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^(−(1+ν)) as |βj| → +∞. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
  • 36. Assessing convergence of some Gibbs samplers Half-t priors: statistical estimation benefits ∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^(−(1+ν)) as |βj| → +∞. [Figure: marginal prior density of βj, with a pole at zero and heavy tails] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
  • 37. Assessing convergence of some Gibbs samplers Half-t priors: statistical estimation benefits ∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^(−(1+ν)) as |βj| → +∞. [Figure: marginal prior density of βj, with a pole at zero and heavy tails] van der Pas et al. (2014, 2017); Ghosh and Chakrabarti (2017). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
  • 38. Assessing convergence of some Gibbs samplers Half-t priors: computation challenges Pole at 0: π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0. Numerical instability for gradient-based samplers. Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
  • 39. Assessing convergence of some Gibbs samplers Half-t priors: computation challenges Pole at 0: π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0. Numerical instability for gradient-based samplers. Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a). Polynomial tails: π(λβ⊥ | σ², ξ, y) ≍ λ^(−p(1+ν)) as λ → +∞, for any β⊥ ∈ Rp with non-zero entries s.t. Xβ⊥ = 0. Slow convergence of Gaussian proposal based MH samplers. Roberts & Tweedie (1996); Jarner & Tweedie (2003); Livingstone et al. (2019a). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
  • 40. Assessing convergence of some Gibbs samplers Half-t priors: computation challenges Pole at 0: π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0. Numerical instability for gradient-based samplers. Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a). Polynomial tails: π(λβ⊥ | σ², ξ, y) ≍ λ^(−p(1+ν)) as λ → +∞, for any β⊥ ∈ Rp with non-zero entries s.t. Xβ⊥ = 0. Slow convergence of Gaussian proposal based MH samplers. Roberts & Tweedie (1996); Jarner & Tweedie (2003); Livingstone et al. (2019a). Trade-off between statistical estimation and “generic” sampling algorithms Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
  • 41. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ2 fixed for simplicity.) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
  • 42. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ² fixed for simplicity.) β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), where Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Polson & Scott (2012); Johndrow et al. (2020). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
  • 43. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ² fixed for simplicity.) β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), where Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Polson & Scott (2012); Johndrow et al. (2020). The chain is geometrically ergodic. The proof is non-trivial and addresses an open question about such Gibbs samplers with heavy-tailed priors. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
  • 44. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ² fixed for simplicity.) β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), where Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Polson & Scott (2012); Johndrow et al. (2020). The chain is geometrically ergodic. The proof is non-trivial and addresses an open question about such Gibbs samplers with heavy-tailed priors. Can we numerically assess convergence? Coupling-based assessment: Johnson (1996), Biswas et al. (2019). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
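A schematic Python sketch of one sweep of this blocked Gibbs sampler, with ξ and σ² fixed as in the slide. The β | η block is the exact multivariate normal draw above (implemented naively at O(p³) cost); the coordinate-wise η | β block is shown as a generic Metropolis-within-Gibbs move on log ηj targeting the conditional density implied by the half-t prior, a stand-in for the exact conditional update used by the samplers referenced above. The synthetic data, ν and all tuning constants are illustrative.

import numpy as np

def gibbs_sweep(beta, eta, X, y, xi=1.0, sigma2=1.0, nu=2.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = X.shape[1]
    # beta | eta ~ N(Sigma^{-1} X^T y, sigma2 Sigma^{-1}), with Sigma = X^T X + xi Diag(eta).
    Sigma = X.T @ X + xi * np.diag(eta)
    chol = np.linalg.cholesky(Sigma)
    mean = np.linalg.solve(Sigma, X.T @ y)                    # naive O(p^3) linear algebra; fine for a sketch
    beta = mean + np.sqrt(sigma2) * np.linalg.solve(chol.T, rng.standard_normal(p))
    # Log conditional density of eta_j given beta_j, up to a constant:
    # half-t(nu) prior on eta_j^(-1/2) times the N(0, sigma2/(xi eta_j)) likelihood of beta_j.
    def log_cond(eta_j, beta_j):
        log_prior = -0.5 * (nu + 1) * np.log1p(1.0 / (nu * eta_j)) - 1.5 * np.log(eta_j)
        log_lik = 0.5 * np.log(eta_j) - 0.5 * xi * eta_j * beta_j**2 / sigma2
        return log_prior + log_lik
    for j in range(p):                                        # random-walk move on log eta_j (stand-in update)
        prop = eta[j] * np.exp(0.5 * rng.standard_normal())
        log_ratio = log_cond(prop, beta[j]) - log_cond(eta[j], beta[j]) + np.log(prop / eta[j])
        if np.log(rng.uniform()) < log_ratio:
            eta[j] = prop
    return beta, eta

rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)            # 5 truly active covariates
beta, eta = np.zeros(p), np.ones(p)
for _ in range(200):
    beta, eta = gibbs_sweep(beta, eta, X, y, rng=rng)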
  • 45. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 46. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 47. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN; ‘synchronous’) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 48. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN; ‘synchronous’) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 49. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN; ‘synchronous’) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Which metric d? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 50. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 51. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 52. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 53. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 54. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Want d to capture probability of exactly meeting. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 55. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Want d to capture probability of exactly meeting. d(β1,t, β2,t) := PMaxCouple(η1,t+1 ≠ η2,t+1 | β1,t, β2,t) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 56. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Want d to capture probability of exactly meeting. d(β1,t, β2,t) := PMaxCouple(η1,t+1 ≠ η2,t+1 | β1,t, β2,t) Roberts & Rosenthal (2002) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
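A schematic sketch of the two-scale switching rule. To keep the code self-contained, a made-up exponential conditional is used for ηj | βj (explicitly not the Gibbs sampler's actual conditional), and the metric d is estimated by Monte Carlo as the maximal-coupling non-meeting probability; the rate function, threshold and sample sizes are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
rate = lambda b: 1.0 + b**2                                   # hypothetical conditional: eta | beta ~ Exp(1 + beta^2)
log_dens = lambda e, b: np.log(rate(b)) - rate(b) * e

def crn_pair(b1, b2):
    # Common-random-numbers coupling: one uniform drives both conditional draws.
    u = rng.uniform()
    return -np.log(u) / rate(b1), -np.log(u) / rate(b2)

def max_pair(b1, b2):
    # Maximal coupling of the two conditionals: maximises the chance that the draws are equal.
    e1 = rng.exponential(1.0 / rate(b1))
    if np.log(rng.uniform()) <= log_dens(e1, b2) - log_dens(e1, b1):
        return e1, e1
    while True:
        e2 = rng.exponential(1.0 / rate(b2))
        if np.log(rng.uniform()) > log_dens(e2, b1) - log_dens(e2, b2):
            return e1, e2

def metric_d(b1, b2, n_mc=200):
    # Monte Carlo estimate of d(b1, b2) = P_MaxCouple(eta1 != eta2 | b1, b2).
    draws = [max_pair(b1, b2) for _ in range(n_mc)]
    return 1.0 - np.mean([e1 == e2 for e1, e2 in draws])

def two_scale_update(b1, b2, d_threshold=0.5):
    # CRN when the beta-chains are far apart, maximal coupling when they are close.
    return crn_pair(b1, b2) if metric_d(b1, b2) > d_threshold else max_pair(b1, b2)

print(two_scale_update(10.0, 0.0))    # far apart: CRN draw, the pair contracts but does not meet
print(two_scale_update(0.30, 0.31))   # close: maximal-coupling draw, usually exactly equal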
  • 57. Assessing convergence of some Gibbs samplers GWAS dataset example: Half-t(2) prior n ≈ 2,266 different maize lines. p ≈ 98,385 covariates per maize line, linked to SNPs in the genome. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
  • 58. Assessing convergence of some Gibbs samplers GWAS dataset example: Half-t(2) prior n ≈ 2,266 different maize lines. p ≈ 98,385 covariates per maize line, linked to SNPs in the genome. (βt, ηt)t≥0 Markov chain now in R^98,385 × R^98,385. How long does it take to converge? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
  • 59. Assessing convergence of some Gibbs samplers GWAS dataset example: Half-t(2) prior n ≈ 2,266 different maize lines. p ≈ 98,385 covariates per maize line, linked to SNPs in the genome. (βt, ηt)t≥0 Markov chain now in R^98,385 × R^98,385. How long does it take to converge? 1,000 steps. [Figure: estimated total variation distance upper bound against iteration, decaying to zero within roughly 1,000 iterations] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
  • 60. Assessing convergence of some Gibbs samplers Open questions Convergence complexity analysis Interplay between posterior concentration and MCMC convergence Alternative coupling algorithms Niloy Biswas (Harvard) PhD Thesis October 2, 2022 26 / 44
  • 61. Bounding Wasserstein distance with couplings What about diagnostics for approximate samplers? Bounding Wasserstein distance with couplings Biswas and Mackey. Bounding Wasserstein distance with couplings. Revision requested, Journal of the American Statistical Association (JASA), 2022. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 27 / 44
  • 62. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 63. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 64. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Examples: Stochastic gradients for tall data Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 65. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Examples: Stochastic gradients for tall data Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021 Matrix approximations for wide data Johndrow et al. 2020; Narisetty et al. 2019; Atchadé and Wang 2021 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 66. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Examples: Stochastic gradients for tall data Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021 Matrix approximations for wide data Johndrow et al. 2020; Narisetty et al. 2019; Atchadé and Wang 2021 How close is P to Q? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 67. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 68. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 69. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). A geometrically faithful metric between probability measures Villani, 2008; Peyré and Cuturi, 2019 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 70. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). A geometrically faithful metric between probability measures Villani, 2008; Peyré and Cuturi, 2019 Can control the absolute difference between moments of P and Q Gelbrich, 1990; Sriperumbudur et al., 2012; Huggins et al., 2020 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 71. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). A geometrically faithful metric between probability measures Villani, 2008; Peyré and Cuturi, 2019 Can control the absolute difference between moments of P and Q Gelbrich, 1990; Sriperumbudur et al., 2012; Huggins et al., 2020 We will estimate upper bounds of Wp(P, Q) with couplings. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 72. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
  • 73. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 We simulate such couplings to empirically assess sample quality. Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
  • 74. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 We simulate such couplings to empirically assess sample quality. Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·). For such independent coupled trajectories (X_t^(i), Y_t^(i)), t = 1, . . . , T, with i = 1, . . . , I, define the estimator CUBp ≜ ( (1/(I(T − S))) Σ_{i=1}^I Σ_{t=S+1}^T c(X_t^(i), Y_t^(i))^p )^(1/p). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
  • 75. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 We simulate such couplings to empirically assess sample quality. Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·). For such independent coupled trajectories (X_t^(i), Y_t^(i)), t = 1, . . . , T, with i = 1, . . . , I, define the estimator CUBp ≜ ( (1/(I(T − S))) Σ_{i=1}^I Σ_{t=S+1}^T c(X_t^(i), Y_t^(i))^p )^(1/p). Use CUBp to estimate upper bounds of Wp(P, Q). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
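A minimal sketch of the CUBp estimator just defined, assuming the I coupled trajectories are stored as arrays of shape (I, T, dim) and taking c to be the Euclidean distance; the toy trajectories only illustrate the call.

import numpy as np

def cub(X, Y, S=0, p=2):
    # CUB_p from I independent coupled trajectories of length T (arrays of shape (I, T, dim));
    # S initial iterations are discarded, c(x, y) is the Euclidean distance.
    dists = np.linalg.norm(X[:, S:, :] - Y[:, S:, :], axis=-1)
    return np.mean(dists ** p) ** (1.0 / p)

I_, T_, dim = 10, 1000, 5
X = np.random.default_rng(0).standard_normal((I_, T_, dim))
Y = X + 0.1                                                   # toy "coupled" trajectories offset by 0.1 per coordinate
print(cub(X, Y, S=100, p=2))                                  # equals 0.1 * sqrt(dim) here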
  • 76. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 77. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 78. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 79. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 80. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^(1/2). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 81. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^(1/2). Empirical W2: W2(P̂_{1000I}, Q̂_{1000I}). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 82. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^(1/2). Empirical W2: W2(P̂_{1000I}, Q̂_{1000I}). True W2: W2(P, Q). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 83. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 32 / 44
  • 84. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] [Figure: the same W2 upper bounds against varying dimension d, from 200 to 1000] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 32 / 44
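A minimal sketch of the CRN coupling of MALA kernels in this example: the same Gaussian innovation and the same acceptance uniform drive one MALA chain targeting P = N(0, Σ) and one targeting Q = N(0, Id). The step size, initialization and the single-trajectory average at the end (a crude stand-in for CUB2 with I = 1 and S = 0) are illustrative.

import numpy as np

d = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))   # [Sigma]_ij = 0.5^|i-j|
Sigma_inv = np.linalg.inv(Sigma)
rng = np.random.default_rng(0)

logp = lambda x: -0.5 * x @ Sigma_inv @ x                     # log density of P = N(0, Sigma), up to a constant
gradp = lambda x: -Sigma_inv @ x
logq = lambda x: -0.5 * x @ x                                 # log density of Q = N(0, I_d), up to a constant
gradq = lambda x: -x

def crn_mala_step(x, y, h=0.1):
    # One CRN-coupled MALA step: shared noise z and shared uniform for both chains.
    z = rng.standard_normal(d)
    log_u = np.log(rng.uniform())
    def advance(state, logf, gradf):
        prop = state + 0.5 * h * gradf(state) + np.sqrt(h) * z
        fwd = -np.sum((prop - state - 0.5 * h * gradf(state)) ** 2) / (2 * h)   # log q(prop | state)
        bwd = -np.sum((state - prop - 0.5 * h * gradf(prop)) ** 2) / (2 * h)    # log q(state | prop)
        return prop if log_u < logf(prop) - logf(state) + bwd - fwd else state
    return advance(x, logp, gradp), advance(y, logq, gradq)

x, y = rng.standard_normal(d), rng.standard_normal(d)
sq_dists = []
for _ in range(1000):
    x, y = crn_mala_step(x, y)
    sq_dists.append(np.sum((x - y) ** 2))
print(np.sqrt(np.mean(sq_dists)))                             # crude CUB_2 estimate from one coupled trajectory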
  • 85. Bounding Wasserstein distance with couplings Algorithms to sample from the coupled kernel K̄ To simulate the coupled chain (Xt, Yt)t≥0, we consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
  • 86. Bounding Wasserstein distance with couplings Algorithms to sample from the coupled kernel K̄ To simulate the coupled chain (Xt, Yt)t≥0, we consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). To construct and analyze K̄, we make use of: 1. A Markovian coupling Γ1 of kernel K1: for all x, y ∈ X, Γ1(x, y) is a coupling of the distributions K1(x, ·) and K1(y, ·). 2. A coupling Γ∆ of kernels K1 and K2 from the same point: for all z ∈ X, Γ∆(z) is a coupling of the distributions K1(z, ·) and K2(z, ·). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
  • 87. Bounding Wasserstein distance with couplings Algorithms to sample from the coupled kernel K̄ To simulate the coupled chain (Xt, Yt)t≥0, we consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). To construct and analyze K̄, we make use of: 1. A Markovian coupling Γ1 of kernel K1: for all x, y ∈ X, Γ1(x, y) is a coupling of the distributions K1(x, ·) and K1(y, ·). 2. A coupling Γ∆ of kernels K1 and K2 from the same point: for all z ∈ X, Γ∆(z) is a coupling of the distributions K1(z, ·) and K2(z, ·). [Figure: schematic of the joint kernel K̄ on X × X, with nodes Xt, Yt, Xt+1, Zt+1, Yt+1 and arrows labelled Γ1 and Γ∆] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
  • 88. Bounding Wasserstein distance with couplings Interpretable upper bounds for CUB Theorem (CUB upper bound) Let (Xt, Yt)t≥0 denote a coupled Markov chain generated using joint kernel K̄. Suppose there exists a constant ρ ∈ (0, 1) such that, for all Xt, Yt ∈ X and (Xt+1, Yt+1) | (Xt, Yt) ∼ Γ1(Xt, Yt), E[c(Xt+1, Yt+1)^p | Xt, Yt]^(1/p) ≤ ρ c(Xt, Yt). Then E[CUB_{p,t}^p]^(1/p) = E[c(Xt, Yt)^p]^(1/p) ≤ ρ^t E[c(X0, Y0)^p]^(1/p) + Σ_{i=1}^t ρ^(t−i) E[∆p(Y_{i−1})]^(1/p) for all t ≥ 0, where ∆p(z) := E[c(X, Y)^p | z] for (X, Y) | z ∼ Γ∆(z). Generalizes existing results on Markov chain perturbation theory (in W1) to: (i) Wp for all p ≥ 1, (ii) couplings which may not be Wasserstein-optimal. Extensions with weaker assumptions in the paper. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 34 / 44
  • 89. Bounding Wasserstein distance with couplings MCMC for tall data Bayesian logistic regression on: Pima Indians (n = 768 observations and d = 8 covariates); DS1 life sciences dataset (n = 26,732 and d = 10). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
  • 90. Bounding Wasserstein distance with couplings MCMC for tall data Bayesian logistic regression on: Pima Indians (n = 768 observations and d = 8 covariates); DS1 life sciences dataset (n = 26,732 and d = 10). We apply CRN coupling of MALA kernels. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
  • 91. Bounding Wasserstein distance with couplings MCMC for tall data Bayesian logistic regression on: Pima Indians (n = 768 observations and d = 8 covariates); DS1 life sciences dataset (n = 26,732 and d = 10). We apply CRN coupling of MALA kernels. [Figure: W2 upper and lower bounds for each approximate MCMC or variational procedure (Mean Field VB, Laplace, SGLD 10%, SGLD 50%, ULA), on the DS1 and Pima datasets] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
  • 92. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 93. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 94. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Approximate MCMC: X Diag(η̃t)⁻¹ Xᵀ for η̃j,t ≜ ηj,t 1{ηj,t⁻¹ > ϵ}. Johndrow et al., 2020 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 95. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Approximate MCMC: X Diag(η̃t)⁻¹ Xᵀ for η̃j,t ≜ ηj,t 1{ηj,t⁻¹ > ϵ}. Johndrow et al., 2020 Apply CRN coupling of the exact and the approximate chain. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 96. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Approximate MCMC: X Diag(η̃t)⁻¹ Xᵀ for η̃j,t ≜ ηj,t 1{ηj,t⁻¹ > ϵ}. Johndrow et al., 2020 Apply CRN coupling of the exact and the approximate chain. [Figure: W2 upper and lower bounds against the approximate MCMC threshold ε ∈ {0, 10⁻⁴, 10⁻³, 10⁻²}] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
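A small sketch of where the cost saving in the thresholded approximation comes from: coordinates with ηj⁻¹ ≤ ε are treated as contributing nothing to X Diag(η)⁻¹ Xᵀ (this convention for the dropped coordinates, and every numerical value below, is an assumption for illustration), so only the active columns enter the product and the O(n²p) cost shrinks to O(n² × #active).

import numpy as np

def xdx_exact(X, eta):
    # Exact M = X Diag(eta)^{-1} X^T: the O(n^2 p) bottleneck of each exact MCMC iteration.
    return (X / eta) @ X.T

def xdx_thresholded(X, eta, eps=1e-3):
    # Keep only coordinates with eta_j^{-1} > eps; dropped coordinates contribute nothing.
    active = (1.0 / eta) > eps
    Xa = X[:, active]
    return (Xa / eta[active]) @ Xa.T

rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.standard_normal((n, p))
eta = 1e4 * rng.exponential(1.0, size=p)                      # mostly huge eta_j: heavily shrunk coordinates
eta[:10] = 1e-2                                               # a handful of active coordinates
M, M_eps = xdx_exact(X, eta), xdx_thresholded(X, eta, eps=1e-3)
print(np.linalg.norm(M - M_eps) / np.linalg.norm(M))          # relative error introduced by the threshold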
  • 97. Bounding Wasserstein distance with couplings Open questions Can we avoid sampling from the exact MCMC chain (Xt)t≥0? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
  • 98. Bounding Wasserstein distance with couplings Open questions Can we avoid sampling from the exact MCMC chain (Xt)t≥0? Consider constructing a Markov chain (Y ′ t , Yt)t≥0 such that: 1 Y ′ t ∼ Yt for all t ≥ 0, both marginally distributed according to the approximate MCMC chain. 2 E[c(Xt, Y ′ t )p ] = E[c(Xt, Yt)p ] ≤ E[c(Y ′ t , Yt)p ] for all t ≥ 0. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
  • 99. Bounding Wasserstein distance with couplings Open questions Can we avoid sampling from the exact MCMC chain (Xt)t≥0? Consider constructing a Markov chain (Y ′ t , Yt)t≥0 such that: 1 Y ′ t ∼ Yt for all t ≥ 0, both marginally distributed according to the approximate MCMC chain. 2 E[c(Xt, Y ′ t )p ] = E[c(Xt, Yt)p ] ≤ E[c(Y ′ t , Yt)p ] for all t ≥ 0. Then E[c(Xt, Yt)p]1/p ≤ E[c(Y ′ t , Yt)p]1/p ≤ 2E[c(Xt, Yt)p]1/p. Upper bounds from (Y ′ t , Yt)t≥0 only loose by a constant factor of 2. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
  • 100. Scalable Spike-and-Slab Scalable Spike-and-Slab Biswas, Mackey and Meng. Scalable Spike-and-Slab. International Conference on Machine Learning (ICML), 2022. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 38 / 44
  • 101. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 102. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 103. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In). Continuous Spike-and-Slab Prior [George and McCulloch, 1993]: σ² ∼ InvGamma(a0/2, b0/2); zj i.i.d. ∼ Bernoulli(q), j = 1, . . . , p; βj | zj, σ² ind. ∼ (1 − zj) N(0, σ²τ0²) [spike] + zj N(0, σ²τ1²) [slab], j = 1, . . . , p. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 104. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In). Continuous Spike-and-Slab Prior [George and McCulloch, 1993]: σ² ∼ InvGamma(a0/2, b0/2); zj i.i.d. ∼ Bernoulli(q), j = 1, . . . , p; βj | zj, σ² ind. ∼ (1 − zj) N(0, σ²τ0²) [spike] + zj N(0, σ²τ1²) [slab], j = 1, . . . , p. Inference using P(zj = 1 | y). Guan and Stephens, 2011; Zhou et al., 2013, . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 105. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In). Continuous Spike-and-Slab Prior [George and McCulloch, 1993]: σ² ∼ InvGamma(a0/2, b0/2); zj i.i.d. ∼ Bernoulli(q), j = 1, . . . , p; βj | zj, σ² ind. ∼ (1 − zj) N(0, σ²τ0²) [spike] + zj N(0, σ²τ1²) [slab], j = 1, . . . , p. Inference using P(zj = 1 | y). Guan and Stephens, 2011; Zhou et al., 2013, . . . How to sample from the posterior? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
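A minimal sketch of two ingredients this prior implies: drawing (z, β) from it, and the conditional inclusion probability P(zj = 1 | βj, σ²) that a Gibbs sampler for this prior uses (averaging the sampled zj, or these conditionals, across MCMC iterations estimates P(zj = 1 | y)); the hyperparameters q, τ0, τ1 are illustrative.

import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

def sample_prior(p, q=0.05, tau0=0.01, tau1=1.0, sigma2=1.0, rng=None):
    # Draw (z, beta) from the continuous spike-and-slab prior (sigma^2 fixed here).
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.binomial(1, q, size=p)
    sd = np.sqrt(sigma2) * np.where(z == 1, tau1, tau0)       # slab sd for z_j = 1, spike sd for z_j = 0
    return z, rng.normal(0.0, sd)

def inclusion_prob(beta, q=0.05, tau0=0.01, tau1=1.0, sigma2=1.0):
    # Full conditional P(z_j = 1 | beta_j, sigma^2) under the two-component normal mixture.
    slab = q * normal_pdf(beta, np.sqrt(sigma2) * tau1)
    spike = (1 - q) * normal_pdf(beta, np.sqrt(sigma2) * tau0)
    return slab / (slab + spike)

z, beta = sample_prior(10, rng=np.random.default_rng(3))
print(z)
print(inclusion_prob(beta))     # typically near 1 for slab coordinates and near 0 for spike coordinates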
  • 106. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 107. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: Naïve MCMC: O(p³) cost per iteration. State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration. For large datasets, the O(n²p) cost becomes prohibitive, e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 108. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: Naïve MCMC: O(p³) cost per iteration. State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration. For large datasets, the O(n²p) cost becomes prohibitive, e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration. Approximate inference methods: Approx. MCMC: O(max{n∥zt∥₁², np}) cost at iteration t. Narisetty et al., 2019. Variational inference: Ray et al., 2020; Ray and Szabó, 2021. Does not converge to the spike-and-slab posterior. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 109. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: Naïve MCMC: O(p³) cost per iteration. State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration. For large datasets, the O(n²p) cost becomes prohibitive, e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration. Approximate inference methods: Approx. MCMC: O(max{n∥zt∥₁², np}) cost at iteration t. Narisetty et al., 2019. Variational inference: Ray et al., 2020; Ray and Szabó, 2021. Does not converge to the spike-and-slab posterior. Speed up Bayesian computation without compromising on sample quality? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 110. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 111. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 112. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 113. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 114. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 115. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 116. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, (ii) posterior concentration, Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 117. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, (ii) posterior concentration, and (iii) positive auto-correlation Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 118. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, (ii) posterior concentration, and (iii) positive auto-correlation all give smaller pt and lower computational cost. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 119. Scalable Spike-and-Slab S3: computational cost Binary response data: y ∈ {0, 1}n, design matrix X ∈ Rn×p, n ≪ p. [Figure: time per iteration (ms) against dimension p for S3 logistic, SOTA logistic, Skinny Gibbs logistic, S3 probit and SOTA probit samplers] E.g. for n ≈ 4000, p ≈ 40000, E[pt] ≈ 10: S3 is 50× faster than SOTA. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 42 / 44
  • 120. Scalable Spike-and-Slab S3: sample quality Binary response data: y ∈ {0, 1}n, design matrix X ∈ Rn×p, n ≪ p. [Figure: true positive rate (TPR) and false discovery rate (FDR) against dimension p for S3 logistic, S3 probit and Skinny Gibbs logistic samplers] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 43 / 44
  • 121. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 122. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 123. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 124. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, “MCMC algorithms (converge slowly) can converge quickly for high-dimensional, multimodal target distributions”, 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 125. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, “MCMC algorithms (converge slowly) can converge quickly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive?” 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 126. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, “MCMC algorithms (converge slowly) can converge quickly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive?” This thesis participates in a wider quest to scale Bayesian inference. Welling and Teh (2011); Gorham (2015); Bardenet et al. (2017); Blei et al. (2017); Narisetty et al. (2019), Bierkens et al. (2019), Papaspiliopoulos et al. (2019), Pollock et al. (2020), Jacob et al. (2020), Johndrow et al. (2020), Nemeth and Fearnhead (2021), Ray and Szabó (2021) . . . 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44