Contributions to Scalable Bayesian Computation
Niloy Biswas
Department of Statistics
Harvard University
PhD Thesis
October 2, 2022
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 1 / 44
References
Niloy Biswas, Pierre Jacob and Paul Vanetti.
Estimating convergence of Markov chains with L-lag couplings.
Advances in Neural Information Processing Systems (NeurIPS), 2019.
Niloy Biswas, Anirban Bhattacharya, Pierre Jacob and James Johndrow.
Coupling-based convergence assessment of some Gibbs samplers for
high-dimensional Bayesian regression with shrinkage priors.
Journal of the Royal Statistical Society: Series B (JRSSB), 2022.
Niloy Biswas and Lester Mackey.
Bounding Wasserstein distance with couplings.
Revision requested, Journal of the American Statistical Association (JASA), 2022.
Niloy Biswas, Xiao-Li Meng and Lester Mackey.
Scalable Spike-and-Slab.
International Conference on Machine Learning (ICML), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 2 / 44
Motivation
Motivation
Some tales from the crypt in Bayesian computation
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 3 / 44
Motivation
Motivation
Target distribution, P.
Monte Carlo methods generate samples from P.
Example applications:
Calculate integrals: ∫ h(x)P(x) dx = E_{Xi∼P}[ (1/N) Σ_{i=1}^N h(Xi) ]
Bayesian inference: sample from posterior
Variational autoencoders, generative models, . . .
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
Motivation
Tales from the crypt¹
Folklore about Monte Carlo in large scale applications:
“Markov chain Monte Carlo (MCMC) algorithms have prohibitively high
computational cost per iteration”,
“MCMC algorithms converge slowly for high-dimensional, multimodal
target distributions”,
“Bayesian inference is computationally too expensive”.
We will revisit some of these tales from the crypt.
¹ Chopin & Papaspiliopoulos, Laplace's Demon seminar on Bayesian Machine Learning at Scale, 2020.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
Motivation
Perfect, Exact and Approximate Sampling
Target distribution, P.
Monte Carlo methods generate samples from P. How?
1. Perfect sampling
Directly sample from P. Difficult in general.
2. (Asymptotically) Exact sampling
Generate samples from distributions Pt such that Pt ⇒ P as t → ∞
E.g. Markov chain Monte Carlo (MCMC)
3. Approximate sampling
Target Q, for some Q ≈ P
E.g. Approximate MCMC, Variational Inference
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
Motivation
Markov chain Monte Carlo
MCMC:
Initialize X0 ∼ P0. For all t ≥ 1, sample Xt ∼ K(Xt−1, ·).
Kernel K is P-invariant, so that Xt ⇒ X∞ ∼ P as t → ∞.
Must stop algorithm at some finite time T.
How close is XT ∼ PT to X∞ ∼ P?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 7 / 44
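A minimal sketch in Python of one such P-invariant kernel, for the stylized example that follows (a Gaussian random-walk Metropolis-Hastings chain targeting N(0, 1), started at X0 = 10); this is illustrative only, not the thesis code, and the step size is an arbitrary assumption:

import numpy as np

rng = np.random.default_rng(0)

def log_target(x):                    # log density of P = N(0, 1), up to a constant
    return -0.5 * x**2

def mh_kernel(x, step=1.0):           # one draw from the P-invariant kernel K(x, .)
    prop = x + step * rng.normal()    # Gaussian random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        return prop                   # accept
    return x                          # reject: stay at x

T = 150
X = np.empty(T + 1)
X[0] = 10.0                           # X_0 ~ P_0, here started far from P
for t in range(T):
    X[t + 1] = mh_kernel(X[t])        # X_t ~ P_t, and P_t => P as t -> infinity
# The algorithm is stopped at finite T; the question is how close P_T is to P.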
L-Lag couplings of Markov chains
How close is XT ∼ PT to X∞ ∼ P?
L-Lag couplings of Markov chains
Biswas, Jacob and Vanetti. Estimating convergence of Markov chains with L-lag
couplings. Advances in Neural Information Processing Systems (NeurIPS), 2019
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 8 / 44
L-Lag couplings of Markov chains
MCMC: Stylized Example
X0 = 10. Target: P = N(0, 1).
[Figure: trace plot of the chain (Xt) over iterations t = 0, ..., 150, started at X0 = 10.]
How close is XT ∼ PT to X∞ ∼ P?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 10 / 44
L-Lag couplings of Markov chains
Distance between probability distributions
How close is XT ∼ PT to X∞ ∼ P?
Total variation distance (TV): dTV(PT, P) = (1/2) sup_{h: |h|≤1} E[h(XT) − h(X∞)]
Captures errors on histograms and credible intervals.
Wasserstein distance: captures errors between moments.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 11 / 44
L-Lag couplings of Markov chains
Bounds from L-Lag Couplings
[Figure: trace plot of (Xt) alongside total variation upper bounds over t = 0, ..., 150, for lags L = 1 and L = 150, compared with the exact dTV.]
dTV(Pt, P) = (1/2) sup_{h: |h|≤1} E[h(Xt) − h(X∞)]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 12 / 44
L-Lag couplings of Markov chains
L-Lag Couplings
A pair of Markov chains (Xt, Yt)t≥0 such that:
Same marginal distributions: Xt ∼ Yt ∼ Pt ∀t ≥ 0, with Pt ⇒ P as t → ∞
Xt and Yt−L meet exactly at time τ(L) := inf {t > L : Xt = Yt−L}
[Figure: coupled trace plots of the two chains, which meet exactly after a finite number of iterations.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 13 / 44
L-Lag couplings of Markov chains
Back to stylized example
How close is Xt ∼ Pt to X∞ ∼ P = N(0, 1)?
[Figure: trace plot of (Xt) alongside total variation upper bounds over t = 0, ..., 150, for lags L = 1 and L = 150, compared with the exact dTV.]
dTV(Pt, P) = (1/2) sup_{h: |h|≤1} E[h(Xt) − h(X∞)] ≤ E[ max(0, ⌈(τ(L) − L − t)/L⌉) ]
(a minimal sketch of this bound follows below)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 14 / 44
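A minimal sketch in Python of this bound with lag L = 1, for the stylized example, using an autoregressive Gaussian kernel K(x, ·) = N(a·x, 1 − a²) that is N(0, 1)-invariant; the kernel, its parameters and all helper names are illustrative assumptions, not the thesis code:

import numpy as np

rng = np.random.default_rng(1)
a = 0.9
sd = np.sqrt(1 - a**2)                      # AR(1) kernel N(a*x, 1 - a^2) targets N(0, 1)

def maximal_coupling_normals(m1, m2, s):
    """Draw (X, Y) with X ~ N(m1, s^2), Y ~ N(m2, s^2), maximising P(X = Y)."""
    X = rng.normal(m1, s)
    if np.log(rng.uniform()) - 0.5 * ((X - m1) / s) ** 2 <= -0.5 * ((X - m2) / s) ** 2:
        return X, X                          # the two chains output the same value
    while True:
        Y = rng.normal(m2, s)
        if np.log(rng.uniform()) - 0.5 * ((Y - m2) / s) ** 2 > -0.5 * ((Y - m1) / s) ** 2:
            return X, Y

def meeting_time(x0=10.0, L=1, max_iter=10_000):
    """tau(L) = inf{t > L : X_t = Y_{t-L}} for two coupled chains started at x0."""
    X = x0
    for _ in range(L):                       # advance the X chain alone for L steps
        X = a * X + sd * rng.normal()
    Y, t = x0, L
    while t < max_iter:
        X, Y = maximal_coupling_normals(a * X, a * Y, sd)
        t += 1
        if X == Y:                           # exact meeting, by the maximal coupling
            return t
    return max_iter

taus = np.array([meeting_time() for _ in range(1000)])

def tv_upper_bound(t, taus, L=1):
    """Empirical estimate of E[max(0, ceil((tau(L) - L - t) / L))]."""
    return float(np.mean(np.maximum(0.0, np.ceil((taus - L - t) / L))))

print([round(tv_upper_bound(t, taus), 3) for t in (0, 20, 40, 80)])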
L-Lag couplings of Markov chains
Existing (and ongoing) work on MCMC convergence
How close is XT ∼ PT to X∞ ∼ P?
Analytical results:
Rosenthal (1995), Roberts and Rosenthal (2004), Khare and Hobert (2013),
Durmus et al. (2016), Qin and Hobert (2019, 2020), etc.
Bounds often of the form C(P0)f (t). Constant C(P0) unknown.
Empirical techniques:
Cowles and Rosenthal (1998), Johnson (1996).
Popular convergence diagnostics:
Gelman and Brooks (1998), Gelman and Rubin (1992), Vehtari et al. (2021)
Extensions and applications:
• Craiu and Meng (2020), Kelly and Ryder (2021), Ju et al. (2021), Jacob et
al. (2021), Papp and Sherlock (2022), . . .
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 16 / 44
L-Lag couplings of Markov chains
Back to stylized example
How close is XT ∼ PT to X∞ ∼ P = N(0, 1)?
[Figure: trace plot of (Xt) and total variation upper bounds over t = 0, ..., 150, for lags L = 1 and L = 150, compared with the exact dTV.]
Can our diagnostics work in large-scale settings?
In high dimensions, we cannot rely on trace plots.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 17 / 44
Assessing convergence of some Gibbs samplers
Can our diagnostics work in large-scale settings?
Assessing convergence of some Gibbs samplers
Biswas, Bhattacharya, Jacob and Johndrow. Coupling-based convergence assessment of some
Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors. Journal of the
Royal Statistical Society: Series B (JRSSB), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 18 / 44
Assessing convergence of some Gibbs samplers
High-dimensional Bayesian regression
Data: y ∈ R^n, X ∈ R^{n×p}, n ≪ p.
Model:
y = Xβ + σϵ, ϵ ∼ N(0, I_n)
Want to do inference on β ∈ R^p
Half-t Shrinkage Prior:
βj | σ², ξ, ηj ∼ N(0, σ²/(ξηj))
σ^{−2} ∼ Gamma(a0/2, b0/2)
Global: ξ^{−1/2} ∼ Cauchy+(0, 1)
Local: ηj^{−1/2} ∼ t+(ν)
Popular example:
Horseshoe: ηj^{−1/2} ∼ t+(1) = Cauchy+(0, 1)
Gelman (2006); Carvalho (2009); Polson and Scott (2012).
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
Assessing convergence of some Gibbs samplers
Half-t priors: statistical estimation benefits
∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^{−(1+ν)} as |βj| → +∞.
[Figure: marginal prior density of βj, with a pole at zero and polynomial tails.]
van der Pas et al. (2014, 2017); Ghosh and Chakrabarti (2017).
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
Assessing convergence of some Gibbs samplers
Half-t priors: computation challenges
Pole at 0:
π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0
Numerical instability for gradient-based samplers.
Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a).
Polynomial tails:
π(λβ⊥ | σ², ξ, y) ≍ λ^{−p(1+ν)} as λ → +∞,
for any β⊥ ∈ R^p with non-zero entries s.t. Xβ⊥ = 0.
Slow convergence of Gaussian-proposal-based MH samplers.
Roberts & Tweedie (1996); Jarner & Tweedie (2003); Livingstone et al. (2019a).
Trade–off between statistical estimation and “generic” sampling algorithms
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
Assessing convergence of some Gibbs samplers
A Gibbs sampler for Half-t priors
Markov chain (βt, ηt)t≥0 on R^p × R^p. (ξ, σ² fixed for simplicity.)
β | η ∼ N(Σ_η^{−1} X^T y, σ² Σ_η^{−1}), where Σ_η := X^T X + ξ Diag(η)
π(η | β) = ∏_{j=1}^p π(ηj | βj)
Polson & Scott (2012); Johndrow et al. (2020).
The chain is geometrically ergodic.
The proof is not trivial; it addresses an open question about such Gibbs samplers with heavy-tailed priors.
Can we numerically assess convergence?
Coupling-based assessment: Johnson (1996), Biswas et al. (2019)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
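A naive Python sketch of the β | η update above via a Cholesky factorisation of Σ_η; this costs O(p³) and is purely illustrative (the thesis and Johndrow et al. (2020) use faster linear algebra exploiting n ≪ p):

import numpy as np

def sample_beta(X, y, eta, xi, sigma2, rng):
    """Draw beta | eta ~ N(Sigma_eta^{-1} X^T y, sigma2 * Sigma_eta^{-1})."""
    Sigma_eta = X.T @ X + xi * np.diag(eta)
    L = np.linalg.cholesky(Sigma_eta)            # Sigma_eta = L L^T
    mean = np.linalg.solve(Sigma_eta, X.T @ y)
    z = rng.standard_normal(X.shape[1])
    w = np.linalg.solve(L.T, z)                  # w ~ N(0, Sigma_eta^{-1})
    return mean + np.sqrt(sigma2) * w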
Assessing convergence of some Gibbs samplers
A two-scale coupling
Single Chain:
β | η ∼ N(Σ_η^{−1} X^T y, σ² Σ_η^{−1}), where Σ_η := X^T X + ξ Diag(η)
π(η | β) = ∏_{j=1}^p π(ηj | βj)
Coupled Chain:
β1, β2 | η1, η2 ∼ Common random numbers (CRN; 'synchronous')
η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > d_threshold; Maximal coupling when d(β1, β2) ≤ d_threshold
Why is this a good idea? Which metric d?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
Assessing convergence of some Gibbs samplers
A two-scale coupling
Coupled Chain:
β1, β2 | η1, η2 ∼ Common random numbers (CRN)
η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > d_threshold; Maximal coupling when d(β1, β2) ≤ d_threshold (a schematic sketch follows below)
Why is this a good idea?
Insight: when (β1, β2) are far apart, attempts to make (η1, η2) meet exactly are wasteful.
When far apart, use CRN so the chains get close in the future.
Only when close, attempt an exact meeting.
Which metric d?
We want d to capture the probability of meeting exactly:
d(β1,t, β2,t) := P_MaxCouple(η1,t+1 ≠ η2,t+1 | β1,t, β2,t)
Roberts & Rosenthal (2002)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
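A schematic Python sketch of the switching rule above; the helpers crn_update, max_coupling_update and meet_prob are hypothetical placeholders for the CRN update of (η1, η2), the coordinate-wise maximal coupling, and the metric d:

def coupled_eta_update(beta1, beta2, d_threshold,
                       crn_update, max_coupling_update, meet_prob):
    """Two-scale update of (eta1, eta2) given the current (beta1, beta2)."""
    if meet_prob(beta1, beta2) > d_threshold:
        # Chains are still far apart: exact meetings are unlikely, so use common
        # random numbers and let the chains contract towards each other first.
        return crn_update(beta1, beta2)
    # Chains are close: attempt an exact meeting of the eta components.
    return max_coupling_update(beta1, beta2)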
Assessing convergence of some Gibbs samplers
GWAS dataset example: Half-t(2) prior
n ≈ 2,266 different maize lines.
p ≈ 98,385 covariates per maize line, linked to SNPs in the genome.
(βt, ηt)t≥0 Markov chain now on R^98,385 × R^98,385.
How long does it take to converge? < 1,000 steps.
[Figure: total variation distance upper bounds, falling to zero within 1,000 iterations.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
Assessing convergence of some Gibbs samplers
Open questions
Convergence complexity analysis
Interplay between posterior concentration and MCMC convergence
Alternative coupling algorithms
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 26 / 44
Bounding Wasserstein distance with couplings
What about diagnostics for approximate samplers?
Bounding Wasserstein distance with couplings
Biswas and Mackey. Bounding Wasserstein distance with couplings. Revision
requested, Journal of the American Statistical Association (JASA), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 27 / 44
Bounding Wasserstein distance with couplings
Exact and Approximate MCMC
Exact MCMC:
Initialize X0 ∼ P0. For t ≥ 1, sample Xt ∼ K1(Xt−1, ·).
Kernel K1 is P-invariant, so that Xt ⇒ P as t → ∞.
Approximate MCMC:
Initialize Y0 ∼ Q0. For t ≥ 1, sample Yt ∼ K2(Yt−1, ·).
Kernel K2 is Q-invariant, so that Yt ⇒ Q as t → ∞.
K2 similar to but computationally cheaper than K1
Examples:
Stochastic gradients for tall data
Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021
Matrix approximations for wide data
Johndrow et al. 2020; Narisetty et al. 2019; Atchadé and Wang 2021
How close is P to Q?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
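To illustrate the first example, a Python sketch of one step of a cheaper approximate kernel K2 built from stochastic gradients (in the spirit of SGLD, Welling and Teh, 2011), for a posterior whose log density sums over n data points; the function names and arguments are assumptions made for the sketch, and the exact kernel K1 would instead use the full-data gradient with a Metropolis-Hastings correction:

import numpy as np

def sgld_step(theta, data, grad_log_prior, grad_log_lik, step, batch_size, rng):
    """One unadjusted Langevin step using a minibatch gradient estimate."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    grad = grad_log_prior(theta)
    grad += (len(data) / batch_size) * sum(grad_log_lik(theta, data[i]) for i in idx)
    return theta + 0.5 * step**2 * grad + step * rng.standard_normal(theta.shape)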
Bounding Wasserstein distance with couplings
Wasserstein distance
How close is P to Q?
For a metric space (X, c) and p ≥ 1,
Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^{1/p}.
A geometrically faithful metric between probability measures
Villani, 2008; Peyré and Cuturi, 2019
Can control the absolute difference between moments of P and Q
Gelbrich, 1990; Sriperumbudur et al., 2012; Huggins et al., 2020
We will estimate upper bounds of Wp(P, Q) with couplings.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
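For Gaussians there is a closed form (Gelbrich, 1990), which gives the "True W2" baseline in the stylized example below; a short illustrative sketch in Python:

import numpy as np
from scipy.linalg import sqrtm

def w2_gaussians(m1, S1, m2, S2):
    """W2 distance between N(m1, S1) and N(m2, S2) under the Euclidean metric."""
    S2_half = sqrtm(S2)
    cross = sqrtm(S2_half @ S1 @ S2_half)
    w2_sq = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(np.real(w2_sq)))        # sqrtm can return tiny imaginary parts

d = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
print(w2_gaussians(np.zeros(d), Sigma, np.zeros(d), np.eye(d)))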
Bounding Wasserstein distance with couplings
Couplings
Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X),
K̄((x, y), (A, X)) = K1(x, A) and K̄((x, y), (X, A)) = K2(y, A)
Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018
We simulate such couplings to empirically assess sample quality
Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·)
For such independent coupled trajectories (X_t^(i), Y_t^(i))_{t=1}^T with i = 1, . . . , I, define the estimator
CUB_p ≜ ( (1/(I(T − S))) Σ_{i=1}^I Σ_{t=S+1}^T c(X_t^(i), Y_t^(i))^p )^{1/p}.
Use CUBp to estimate upper bounds of Wp(P, Q)
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
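A direct Python sketch of the CUB_p estimator from stored coupled trajectories; here xs and ys are assumed to be arrays of shape (I, T + 1, dim) holding X_t^(i) and Y_t^(i), and c is the Euclidean distance:

import numpy as np

def cub(xs, ys, S, p=2):
    """Average c(X_t, Y_t)^p over chains i and iterations t = S+1, ..., T, then take the 1/p power."""
    diffs = np.linalg.norm(xs[:, S + 1:] - ys[:, S + 1:], axis=-1)
    return float(np.mean(diffs ** p) ** (1.0 / p))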
Bounding Wasserstein distance with couplings
A stylized example
P = N(0, Σ) for [Σ]_{i,j} = 0.5^{|i−j|}
Q = N(0, I_d) on R^d
Apply common random numbers (CRN) coupling of MALA kernels marginally
targeting P and Q.
[Figure: W2 upper bounds against trajectory length T at dimension d = 100, comparing the independent coupling, CUB2, the empirical W2, and the true W2.]
Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^{1/2} = (2d)^{1/2}.
CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^{1/2}.
Empirical W2: W2(P̂_{1000I}, Q̂_{1000I}).
True W2: W2(P, Q).
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
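A rough Python sketch of this CRN coupling with a single trajectory; the step size, seed and initialisation are illustrative assumptions rather than the experiment's settings:

import numpy as np

d = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
Sigma_inv = np.linalg.inv(Sigma)

def log_p(x): return -0.5 * x @ Sigma_inv @ x     # log N(0, Sigma), up to constants
def grad_log_p(x): return -Sigma_inv @ x
def log_q(y): return -0.5 * y @ y                 # log N(0, I_d), up to constants
def grad_log_q(y): return -y

def mala_step(x, log_pi, grad_log_pi, h, z, log_u):
    """One MALA move driven by shared noise z and shared uniform log_u (the CRN coupling)."""
    prop = x + 0.5 * h**2 * grad_log_pi(x) + h * z
    fwd = -0.5 * np.sum((prop - x - 0.5 * h**2 * grad_log_pi(x)) ** 2) / h**2
    bwd = -0.5 * np.sum((x - prop - 0.5 * h**2 * grad_log_pi(prop)) ** 2) / h**2
    log_alpha = log_pi(prop) - log_pi(x) + bwd - fwd
    return prop if log_u < log_alpha else x

rng = np.random.default_rng(2)
h, T = 0.3, 1000
X = rng.multivariate_normal(np.zeros(d), Sigma)   # chain with kernel targeting P
Y = rng.standard_normal(d)                        # chain with kernel targeting Q
sq_dists = []
for t in range(T):
    z, log_u = rng.standard_normal(d), np.log(rng.uniform())   # shared randomness
    X = mala_step(X, log_p, grad_log_p, h, z, log_u)
    Y = mala_step(Y, log_q, grad_log_q, h, z, log_u)
    sq_dists.append(np.sum((X - Y) ** 2))
print(np.sqrt(np.mean(sq_dists)))                 # CUB_2 with S = 0 and I = 1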
Bounding Wasserstein distance with couplings
A stylized example
P = N(0, Σ) for [Σ]_{i,j} = 0.5^{|i−j|}
Q = N(0, I_d) on R^d
Apply common random numbers (CRN) coupling of MALA kernels marginally
targeting P and Q.
[Figure: W2 upper bounds against trajectory length T at dimension d = 100, and against dimension d from 200 to 1,000, comparing the independent coupling, CUB2, the empirical W2, and the true W2.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 32 / 44
Bounding Wasserstein distance with couplings
Algorithms to sample from the coupled kernel K̄
To simulate the coupled chain (Xt, Yt)t≥0, we consider a kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X),
K̄((x, y), (A, X)) = K1(x, A) and K̄((x, y), (X, A)) = K2(y, A)
To construct and analyze K̄, we make use of:
1. A Markovian coupling Γ1 of kernel K1: for all x, y ∈ X, Γ1(x, y) is a coupling of the distributions K1(x, ·) and K1(y, ·).
2. A coupling Γ∆ of kernels K1 and K2 from the same point: for all z ∈ X, Γ∆(z) is a coupling of the distributions K1(z, ·) and K2(z, ·).
[Figure: the joint kernel K̄ on X × X, drawing (Xt+1, Zt+1) via Γ1 from (Xt, Yt) and (Zt+1, Yt+1) via Γ∆ from Yt.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
Bounding Wasserstein distance with couplings
Interpretable upper bounds for CUB
Theorem (CUB upper bound)
Let (Xt, Yt)t≥0 denote a coupled Markov chain generated using joint kernel K̄.
Suppose there exists a constant ρ ∈ (0, 1) such that for all Xt, Yt ∈ X and
(Xt+1, Yt+1)|(Xt, Yt) ∼ Γ1(Xt, Yt),
E[c(Xt+1, Yt+1)^p | Xt, Yt]^{1/p} ≤ ρ c(Xt, Yt).
Then
E[CUB_{p,t}^p]^{1/p} = E[c(Xt, Yt)^p]^{1/p} ≤ ρ^t E[c(X0, Y0)^p]^{1/p} + Σ_{i=1}^t ρ^{t−i} E[∆p(Y_{i−1})]^{1/p}
for all t ≥ 0, where ∆p(z) := E[c(X, Y)^p | z] for (X, Y) | z ∼ Γ∆(z).
Generalizes existing W1 results from Markov chain perturbation theory to: (i) Wp for all p ≥ 1, and (ii) couplings which may not be Wasserstein-optimal.
Extensions with weaker assumptions are in the paper.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 34 / 44
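A sketch of the one-step argument behind this recursion (our reconstruction from the construction of K̄ above; the full proof, and its weaker-assumption variants, are in the paper). Writing Zt+1 for the shared draw, with (Xt+1, Zt+1) ∼ Γ1(Xt, Yt) and (Zt+1, Yt+1) ∼ Γ∆(Yt), the triangle inequality for c and Minkowski's inequality give
E[c(Xt+1, Yt+1)^p]^{1/p} ≤ E[c(Xt+1, Zt+1)^p]^{1/p} + E[c(Zt+1, Yt+1)^p]^{1/p} ≤ ρ E[c(Xt, Yt)^p]^{1/p} + E[∆p(Yt)]^{1/p},
and unrolling this recursion from t = 0 yields the geometric-sum bound in the theorem.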
Bounding Wasserstein distance with couplings
MCMC for tall data
Bayesian logistic regression on:
Pima Indians (n = 768 observations and d = 8 covariates)
DS1 life sciences dataset (n = 26,732 and d = 10)
We apply CRN coupling of MALA kernels.
[Figure: W2 upper and lower bounds for the Pima and DS1 datasets, across approximate MCMC and variational procedures: Mean Field VB, Laplace, SGLD 10%, SGLD 50%, and ULA.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
Bounding Wasserstein distance with couplings
Approximate MCMC for high-dimensional linear regression
Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088).
y = Xβ + σϵ̃, ϵ̃ ∼ N(0, I_n)
ξ^{−1/2} ∼ C+(0, 1), ηj^{−1/2} i.i.d. ∼ t+(2), σ^{−2} ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj))
Polson and Scott, 2020; Biswas et al. 2022
Exact MCMC: O(n²d) cost from X Diag(ηt)^{−1} X^T
Approximate MCMC: X Diag(η̃t)^{−1} X^T for η̃j,t ≜ ηj,t 1{ηj,t^{−1} > ϵ} (Johndrow et al., 2020)
Apply CRN coupling of the exact and the approximate chain.
[Figure: W2 upper and lower bounds between the exact and the approximate chain, for approximation thresholds ε ∈ {0, 1e−04, 0.001, 0.01}.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
Bounding Wasserstein distance with couplings
Open questions
Can we avoid sampling from the exact MCMC chain (Xt)t≥0?
Consider constructing a Markov chain (Y′t, Yt)t≥0 such that:
1. Y′t ∼ Yt for all t ≥ 0, both marginally distributed according to the approximate MCMC chain.
2. E[c(Xt, Y′t)^p] = E[c(Xt, Yt)^p] ≤ E[c(Y′t, Yt)^p] for all t ≥ 0.
Then E[c(Xt, Yt)^p]^{1/p} ≤ E[c(Y′t, Yt)^p]^{1/p} ≤ 2 E[c(Xt, Yt)^p]^{1/p}.
Upper bounds from (Y′t, Yt)t≥0 are only loose by a constant factor of 2.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
Scalable Spike-and-Slab
Scalable Spike-and-Slab
Biswas, Mackey and Meng. Scalable Spike-and-Slab.
International Conference on Machine Learning (ICML), 2022.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 38 / 44
Scalable Spike-and-Slab
Variable selection with spike-and-slab priors
High-dimensional data: y ∈ R^n, design matrix X ∈ R^{n×p}, n ≪ p.
Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, I_n)
Continuous Spike-and-Slab Prior [George and McCulloch, 1993]
σ² ∼ InvGamma(a0/2, b0/2)
zj ∼ Bernoulli(q) i.i.d. for j = 1, . . . , p
βj | zj, σ² ∼ (1 − zj) N(0, σ²τ0²) [Spike] + zj N(0, σ²τ1²) [Slab], independently for j = 1, . . . , p
Inference using P(zj = 1|y). Guan and Stephens, 2011, Zhou et al., 2013, . . .
How to sample from the posterior?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
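A small Python sketch drawing once from this prior; the hyperparameter values are placeholder assumptions, not those used in the thesis:

import numpy as np

rng = np.random.default_rng(3)
p, q, tau0, tau1, a0, b0 = 1000, 0.05, 0.01, 1.0, 1.0, 1.0

sigma2 = 1.0 / rng.gamma(a0 / 2.0, 2.0 / b0)             # sigma^2 ~ InvGamma(a0/2, b0/2)
z = rng.binomial(1, q, size=p)                           # z_j ~ Bernoulli(q), i.i.d.
scale = np.sqrt(sigma2) * np.where(z == 1, tau1, tau0)   # slab vs spike standard deviation
beta = rng.normal(0.0, scale)                            # beta_j | z_j, sigma^2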
Scalable Spike-and-Slab
Bayesian computation for spike-and-slab priors
High-dimensional data: y ∈ R^n, design matrix X ∈ R^{n×p}, n ≪ p.
Markov chain Monte Carlo methods:
Naïve MCMC: O(p³) cost per iteration
State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration
For large datasets, the O(n²p) cost becomes prohibitive,
e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration
Approximate inference methods:
Approx. MCMC: O(max{n∥zt∥₁², np}) cost at iteration t (Narisetty et al., 2019)
Variational inference (Ray et al., 2020; Ray and Szabó, 2021)
These do not converge to the spike-and-slab posterior
Speed up Bayesian computation without compromising on sample quality?
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
Scalable Spike-and-Slab
Speed up Bayesian computation without compromising on sample quality?
High-dimensional data: y ∈ R^n, design matrix X ∈ R^{n×p}, n ≪ p.
Markov chain Monte Carlo methods:
State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration
Scalable Spike-and-Slab (S³):
Same MCMC kernel as SOTA.
O(max{n²pt, np}) cost at iteration t for linear and probit regression
(O(max{n²pt, n³, np}) for logistic regression), where
pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥₁}.
(i) sparsity, (ii) posterior concentration, and (iii) positive auto-correlation
all give smaller pt and lower computational cost.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
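For concreteness, pt can be computed directly from consecutive inclusion vectors; a small illustrative Python sketch, reading ∥zt∥ as the number of active coordinates:

import numpy as np

def p_t(z_t, z_prev):
    """p_t = min{ ||z_t||, p - ||z_t||, ||z_t - z_{t-1}||_1 } for binary z in {0, 1}^p."""
    return int(min(z_t.sum(), z_t.size - z_t.sum(), np.abs(z_t - z_prev).sum()))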
Scalable Spike-and-Slab
S³: computational cost
Binary response data: y ∈ {0, 1}^n, design matrix X ∈ R^{n×p}, n ≪ p.
[Figure: time per iteration (ms) against dimension p, comparing S³ Logistic, SOTA Logistic, Skinny Gibbs Logistic, S³ Probit, and SOTA Probit.]
E.g. for n ≈ 4,000, p ≈ 40,000, E[pt] ≈ 10: S³ is 50× faster than SOTA.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 42 / 44
Scalable Spike-and-Slab
S³: sample quality
Binary response data: y ∈ {0, 1}^n, design matrix X ∈ R^{n×p}, n ≪ p.
[Figure: true positive rate (TPR) and false discovery rate (FDR) against dimension p, comparing S³ Logistic, S³ Probit, and Skinny Gibbs Logistic.]
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 43 / 44
Scalable Spike-and-Slab
Revisiting Tales from the crypt²
(Folklore about) Monte Carlo in large scale applications:
“Markov chain Monte Carlo (MCMC) algorithms (have prohibitively
high) can have lower computational cost per iteration”,
“MCMC algorithms (converge slowly) can converge quickly for high-
dimensional, multimodal target distributions”,
“Bayesian inference is computationally too expensive?”
This thesis participates in a wider quest to scale Bayesian inference.
Welling and Teh (2011); Gorham (2015); Bardenet et al. (2017); Blei et al. (2017); Narisetty et
al. (2019), Bierkens et al. (2019), Papaspiliopoulos et al. (2019), Pollock et al. (2020), Jacob
et al. (2020), Johndrow et al. (2020), Nemeth and Fearnhead (2021), Ray and Szabó (2021) . . .
² Chopin & Papaspiliopoulos, Laplace's Demon seminar on Bayesian Machine Learning at Scale, 2020.
Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44

  • 3. Motivation Motivation Some tales from the crypt in Bayesian computation Niloy Biswas (Harvard) PhD Thesis October 2, 2022 3 / 44
  • 4. Motivation Motivation Target distribution, P. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 5. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 6. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Example applications: Calculate integrals: R h(x)P(x)dx = EXi ∼P[ 1 N PN i=1 h(Xi )] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 7. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Example applications: Calculate integrals: R h(x)P(x)dx = EXi ∼P[ 1 N PN i=1 h(Xi )] Bayesian inference: sample from posterior Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 8. Motivation Motivation Target distribution, P. Monte Carlo methods generate samples from P. Example applications: Calculate integrals: R h(x)P(x)dx = EXi ∼P[ 1 N PN i=1 h(Xi )] Bayesian inference: sample from posterior Variational autoencoders, generative models, . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 4 / 44
  • 9. Motivation Tales from the crypt1 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 10. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 11. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 12. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, “MCMC algorithms converge slowly for high-dimensional, multimodal target distributions”, 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 13. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, “MCMC algorithms converge slowly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive”. 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 14. Motivation Tales from the crypt1 Folklore about Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms have prohibitively high computational cost per iteration”, “MCMC algorithms converge slowly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive”. We will revisit some of these tales from the crypt. 1 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 5 / 44
  • 15. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 16. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 17. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? 1. Perfect sampling Directly sample from P. Difficult in general. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 18. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? 1. Perfect sampling Directly sample from P. Difficult in general. 2. (Asymptotically) Exact sampling Generate samples from distributions Pt such that Pt ⇒ P as t → ∞. E.g. Markov chain Monte Carlo (MCMC) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 19. Motivation Perfect, Exact and Approximate Sampling Target distribution, P. Monte Carlo methods generate samples from P. How? 1. Perfect sampling Directly sample from P. Difficult in general. 2. (Asymptotically) Exact sampling Generate samples from distributions Pt such that Pt ⇒ P as t → ∞. E.g. Markov chain Monte Carlo (MCMC) 3. Approximate sampling Target Q, for some Q ≈ P. E.g. Approximate MCMC, Variational Inference Niloy Biswas (Harvard) PhD Thesis October 2, 2022 6 / 44
  • 20. Motivation Markov chain Monte Carlo MCMC: Initialize X0 ∼ P0. For all t ≥ 1, sample Xt ∼ K(Xt−1, ·). Kernel K is P-invariant, so that Xt ⇒ X∞ ∼ P as t → ∞. Must stop algorithm at some finite time T. How close is XT ∼ PT to X∞ ∼ P? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 7 / 44
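To make this recipe concrete before the stylized example that follows, here is a minimal Python sketch (not from the slides) of a random-walk Metropolis-Hastings chain targeting P = N(0, 1) from the far-away start X0 = 10; the proposal scale 0.5 and the number of iterations are illustrative choices.

import numpy as np

def rwmh_chain(n_iters, x0=10.0, step=0.5, seed=0):
    # Random-walk Metropolis-Hastings chain targeting P = N(0, 1).
    rng = np.random.default_rng(seed)
    log_target = lambda x: -0.5 * x**2                      # log N(0, 1) density, up to a constant
    xs = np.empty(n_iters + 1)
    xs[0] = x0
    for t in range(1, n_iters + 1):
        prop = xs[t - 1] + step * rng.standard_normal()     # Gaussian proposal around the current state
        log_accept = log_target(prop) - log_target(xs[t - 1])
        xs[t] = prop if np.log(rng.uniform()) < log_accept else xs[t - 1]
    return xs

chain = rwmh_chain(150)    # the chain drifts from 10 towards the bulk of N(0, 1), as in the traceplots below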
  • 21. L-Lag couplings of Markov chains How close is XT ∼ PT to X∞ ∼ P? L-Lag couplings of Markov chains Biswas, Jacob and Vanetti. Estimating convergence of Markov chains with L-lag couplings. Advances in Neural Information Processing System (NeurIPS), 2019 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 8 / 44
  • 22. L-Lag couplings of Markov chains MCMC: Stylized Example X0 = 10. Target: P = N(0, 1). [Figure: traceplot of the chain Xt over iterations t = 0, . . . , 150] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 9 / 44
  • 23. L-Lag couplings of Markov chains MCMC: Stylized Example X0 = 10. Target: P = N(0, 1). [Figure: traceplot of the chain Xt over iterations t = 0, . . . , 150] How close is XT ∼ PT to X∞ ∼ P? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 10 / 44
  • 24. L-Lag couplings of Markov chains Distance between probability distributions How close is XT ∼ PT to X∞ ∼ P? Total variation distance (TV): dTV(PT, P) = (1/2) sup_{h:|h|≤1} E[h(XT) − h(X∞)]. Captures errors on histograms and credible intervals. Wasserstein distance: captures errors between moments. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 11 / 44
  • 25. L-Lag couplings of Markov chains Bounds from L-Lag Couplings [Figure: traceplot of Xt, and dTV upper bounds over iterations t for lags L = 1 and L = 150 alongside the exact dTV curve] dTV(Pt, P) = (1/2) sup_{h:|h|≤1} E[h(Xt) − h(X∞)] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 12 / 44
  • 26. L-Lag couplings of Markov chains L-Lag Couplings A pair of Markov chains (Xt, Yt)t≥0 such that: Same marginal distributions: Xt ∼ Yt ∼ Pt for all t ≥ 0, with Pt ⇒ P as t → ∞. Xt and Yt−L meet exactly at time τ(L) := inf{t > L : Xt = Yt−L}. [Figure: traceplots of the two coupled chains meeting exactly] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 13 / 44
  • 27. L-Lag couplings of Markov chains Back to stylized example How close is Xt ∼ Pt to X∞ ∼ P = N(0, 1)? [Figure: traceplot of Xt, and dTV upper bounds for lags L = 1 and L = 150 alongside the exact dTV curve] dTV(Pt, P) = (1/2) sup_{h:|h|≤1} E[h(Xt) − h(X∞)] ≤ E[max(0, ⌈(τ(L) − L − t)/L⌉)] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 14 / 44
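A minimal sketch (assuming the random-walk MH chain of the earlier sketch, lag L = 1, and illustrative tuning constants) of how the coupling bound above can be estimated: couple the proposals with a standard maximal-coupling routine, share the acceptance uniform, record the meeting time τ(L), and average max(0, ⌈(τ(L) − L − t)/L⌉) over independent replicates.

import numpy as np

rng = np.random.default_rng(1)
log_target = lambda x: -0.5 * x**2                           # log N(0, 1) density, up to a constant

def max_coupling_normals(mu1, mu2, s):
    # Sample (X, Y) from a maximal coupling of N(mu1, s^2) and N(mu2, s^2).
    logp = lambda z, m: -0.5 * ((z - m) / s) ** 2
    x = mu1 + s * rng.standard_normal()
    if np.log(rng.uniform()) <= logp(x, mu2) - logp(x, mu1):
        return x, x
    while True:
        y = mu2 + s * rng.standard_normal()
        if np.log(rng.uniform()) > logp(y, mu1) - logp(y, mu2):
            return x, y

def coupled_mh_step(x, y, s=0.5):
    # One coupled random-walk MH step; chains that have met stay together (faithfulness).
    xp, yp = max_coupling_normals(x, y, s)
    log_u = np.log(rng.uniform())                            # common acceptance uniform
    x_new = xp if log_u < log_target(xp) - log_target(x) else x
    y_new = yp if log_u < log_target(yp) - log_target(y) else y
    return x_new, y_new

def meeting_time(L=1, x0=10.0, s=0.5, t_max=10_000):
    # tau(L) = inf{t > L : X_t = Y_{t-L}} for an L-lag coupled pair started from x0.
    x = x0
    for _ in range(L):                                       # advance the X-chain alone for L steps
        x, _ = coupled_mh_step(x, x, s)
    y, t = x0, L
    while True:
        x, y = coupled_mh_step(x, y, s)
        t += 1
        if x == y or t >= t_max:
            return t

L = 1
taus = np.array([meeting_time(L=L) for _ in range(200)])
tv_bound = lambda t: np.mean(np.maximum(0.0, np.ceil((taus - L - t) / L)))
print(tv_bound(50), tv_bound(150))                           # estimated upper bounds on d_TV(P_t, P)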
  • 28. L-Lag couplings of Markov chains Existing work on MCMC convergence How close is XT ∼ PT to X∞ ∼ P? Analytical results: Rosenthal (1995), Roberts and Rosenthal (2004), Khare and Hobert (2013), Durmus et al. (2016), Qin and Hobert (2019, 2020), etc. Bounds often of the form C(P0)f (t). Constant C(P0) unknown. Empirical techniques: Cowles and Rosenthal (1998), Johnson (1996). Popular convergence diagnostics: Gelman and Brooks (1998), Gelman and Rubin (1992), Vehtari et al.(2021) Extensions: • Craiu and Meng (2020), Kelly and Ryder (2021), Ju et al. (2021), Jacob et al. (2021), Papp and Sherlock (2022), . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 15 / 44
  • 29. L-Lag couplings of Markov chains Existing (and ongoing) work on MCMC convergence How close is XT ∼ PT to X∞ ∼ P? Analytical results: Rosenthal (1995), Roberts and Rosenthal (2004), Khare and Hobert (2013), Durmus et al. (2016), Qin and Hobert (2019, 2020), etc. Bounds often of the form C(P0)f (t). Constant C(P0) unknown. Empirical techniques: Cowles and Rosenthal (1998), Johnson (1996). Popular convergence diagnostics: Gelman and Brooks (1998), Gelman and Rubin (1992), Vehtari et al.(2021) Extensions and applications: • Craiu and Meng (2020), Kelly and Ryder (2021), Ju et al. (2021), Jacob et al. (2021), Papp and Sherlock (2022), . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 16 / 44
  • 30. L-Lag couplings of Markov chains Back to stylized example How close is XT ∼ PT to X∞ ∼ P = N(0, 1)? [Figure: traceplot of Xt, and dTV upper bounds for lags L = 1 and L = 150 alongside the exact dTV curve] Can our diagnostics work in large-scale setting? In high-dimensions, cannot rely on traceplots. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 17 / 44
  • 31. Assessing convergence of some Gibbs samplers Can our diagnostics work in large-scale setting? Assessing convergence of some Gibbs samplers Biswas, Bhattacharya, Jacob and Johndrow. Coupling-based convergence assessment of some Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors. Journal of the Royal Statistical Society: Series B (JRSSB), 2022. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 18 / 44
  • 32. Assessing convergence of some Gibbs samplers High-dimensional Bayesian regression Data: y ∈ Rn, X ∈ Rn×p, n ≪ p. Model: y = Xβ + σϵ ϵ ∼ N(0, In) Want to do inference on β ∈ Rp Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
  • 33. Assessing convergence of some Gibbs samplers High-dimensional Bayesian regression Data: y ∈ Rn, X ∈ Rn×p, n ≪ p. Model: y = Xβ + σϵ, ϵ ∼ N(0, In). Want to do inference on β ∈ Rp. Half-t Shrinkage Prior: βj | σ², ξ, ηj ∼ N(0, σ²/(ξηj)), σ⁻² ∼ Gamma(a0/2, b0/2), Global: ξ^(−1/2) ∼ Cauchy+(0, 1), Local: ηj^(−1/2) ∼ t+(ν) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
  • 34. Assessing convergence of some Gibbs samplers High-dimensional Bayesian regression Data: y ∈ Rn, X ∈ Rn×p, n ≪ p. Model: y = Xβ + σϵ, ϵ ∼ N(0, In). Want to do inference on β ∈ Rp. Half-t Shrinkage Prior: βj | σ², ξ, ηj ∼ N(0, σ²/(ξηj)), σ⁻² ∼ Gamma(a0/2, b0/2), Global: ξ^(−1/2) ∼ Cauchy+(0, 1), Local: ηj^(−1/2) ∼ t+(ν) Popular example: Horseshoe: ηj^(−1/2) ∼ t+(1) = Cauchy+(0, 1). Gelman (2006); Carvalho (2009); Polson and Scott (2012). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 19 / 44
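A short sketch of drawing coefficients from this prior hierarchy (σ² fixed and hyperparameters illustrative); reading the conditional prior variance as σ²/(ξηj) is an interpretation of the flattened slide formula that matches the Gibbs updates later in the deck, so treat it as an assumption.

import numpy as np

def sample_half_t_prior(p, nu=2.0, sigma2=1.0, rng=None):
    # One draw of beta from the half-t shrinkage prior hierarchy.
    rng = np.random.default_rng(0) if rng is None else rng
    xi_inv_sqrt = np.abs(rng.standard_cauchy())              # xi^(-1/2) ~ Cauchy+(0, 1), global scale
    eta_inv_sqrt = np.abs(rng.standard_t(df=nu, size=p))     # eta_j^(-1/2) ~ t+(nu), local scales
    xi, eta = xi_inv_sqrt ** (-2), eta_inv_sqrt ** (-2)
    return rng.normal(0.0, np.sqrt(sigma2 / (xi * eta)))     # beta_j ~ N(0, sigma^2/(xi eta_j)), assumed parametrization

rng = np.random.default_rng(2)
draws = np.array([sample_half_t_prior(1, nu=1.0, rng=rng) for _ in range(10_000)]).ravel()
# With nu = 1 this is the horseshoe: most draws sit very close to zero, with occasional very large values.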
  • 35. Assessing convergence of some Gibbs samplers Half-t priors: statistical estimation benefits ∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^(−(1+ν)) as |βj| → +∞. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
  • 36. Assessing convergence of some Gibbs samplers Half-t priors: statistical estimation benefits ∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^(−(1+ν)) as |βj| → +∞. [Figure: marginal prior density of βj, with a pole at zero and heavy tails] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
  • 37. Assessing convergence of some Gibbs samplers Half-t priors: statistical estimation benefits ∫ π(βj | ηj, ξ, σ²) π(ηj) dηj = π(βj | ξ, σ²) ≍ −log|βj| as |βj| → 0, and ≍ |βj|^(−(1+ν)) as |βj| → +∞. [Figure: marginal prior density of βj, with a pole at zero and heavy tails] van der Pas et al. (2014, 2017); Ghosh and Chakrabarti (2017). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 20 / 44
  • 38. Assessing convergence of some Gibbs samplers Half-t priors: computation challenges Pole at 0: π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0. Numerical instability for gradient-based samplers. Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
  • 39. Assessing convergence of some Gibbs samplers Half-t priors: computation challenges Pole at 0: π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0. Numerical instability for gradient-based samplers. Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a). Polynomial tails: π(λβ⊥ | σ², ξ, y) ≍ λ^(−p(1+ν)) as λ → +∞, for any β⊥ ∈ Rp with non-zero entries s.t. Xβ⊥ = 0. Slow convergence of Gaussian proposal based MH samplers. Roberts & Tweedie (1996); Jarner & Tweedie (2003); Livingstone et al. (2019a). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
  • 40. Assessing convergence of some Gibbs samplers Half-t priors: computation challenges Pole at 0: π(β | σ², ξ, y) ≍ −∏_{j=1}^p log(|βj|) as ∥β∥ → 0. Numerical instability for gradient-based samplers. Bou-Rabee & Sanz-Serna (2018); Livingstone et al. (2019a). Polynomial tails: π(λβ⊥ | σ², ξ, y) ≍ λ^(−p(1+ν)) as λ → +∞, for any β⊥ ∈ Rp with non-zero entries s.t. Xβ⊥ = 0. Slow convergence of Gaussian proposal based MH samplers. Roberts & Tweedie (1996); Jarner & Tweedie (2003); Livingstone et al. (2019a). Trade-off between statistical estimation and “generic” sampling algorithms Niloy Biswas (Harvard) PhD Thesis October 2, 2022 21 / 44
  • 41. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ2 fixed for simplicity.) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
  • 42. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ² fixed for simplicity.) β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), where Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Polson & Scott (2012); Johndrow et al. (2020). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
  • 43. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ² fixed for simplicity.) β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), where Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Polson & Scott (2012); Johndrow et al. (2020). The chain is geometrically ergodic. The proof is non-trivial and addresses an open question about such Gibbs samplers with heavy-tailed priors. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
  • 44. Assessing convergence of some Gibbs samplers A Gibbs sampler for Half-t priors Markov chain (βt, ηt)t≥0 on Rp × Rp. (ξ, σ² fixed for simplicity.) β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), where Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Polson & Scott (2012); Johndrow et al. (2020). The chain is geometrically ergodic. The proof is non-trivial and addresses an open question about such Gibbs samplers with heavy-tailed priors. Can we numerically assess convergence? Coupling-based assessment: Johnson (1996), Biswas et al. (2019). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 22 / 44
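A schematic Python sketch of one sweep of this blocked Gibbs sampler, with ξ and σ² fixed as in the slide. The β | η block is the exact multivariate normal draw above (implemented naively at O(p³) cost); the coordinate-wise η | β block is shown as a generic Metropolis-within-Gibbs move on log ηj targeting the conditional density implied by the half-t prior, a stand-in for the exact conditional update used by the samplers referenced above. The synthetic data, ν and all tuning constants are illustrative.

import numpy as np

def gibbs_sweep(beta, eta, X, y, xi=1.0, sigma2=1.0, nu=2.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = X.shape[1]
    # beta | eta ~ N(Sigma^{-1} X^T y, sigma2 Sigma^{-1}), with Sigma = X^T X + xi Diag(eta).
    Sigma = X.T @ X + xi * np.diag(eta)
    chol = np.linalg.cholesky(Sigma)
    mean = np.linalg.solve(Sigma, X.T @ y)                    # naive O(p^3) linear algebra; fine for a sketch
    beta = mean + np.sqrt(sigma2) * np.linalg.solve(chol.T, rng.standard_normal(p))
    # Log conditional density of eta_j given beta_j, up to a constant:
    # half-t(nu) prior on eta_j^(-1/2) times the N(0, sigma2/(xi eta_j)) likelihood of beta_j.
    def log_cond(eta_j, beta_j):
        log_prior = -0.5 * (nu + 1) * np.log1p(1.0 / (nu * eta_j)) - 1.5 * np.log(eta_j)
        log_lik = 0.5 * np.log(eta_j) - 0.5 * xi * eta_j * beta_j**2 / sigma2
        return log_prior + log_lik
    for j in range(p):                                        # random-walk move on log eta_j (stand-in update)
        prop = eta[j] * np.exp(0.5 * rng.standard_normal())
        log_ratio = log_cond(prop, beta[j]) - log_cond(eta[j], beta[j]) + np.log(prop / eta[j])
        if np.log(rng.uniform()) < log_ratio:
            eta[j] = prop
    return beta, eta

rng = np.random.default_rng(0)
n, p = 50, 100
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(n)            # 5 truly active covariates
beta, eta = np.zeros(p), np.ones(p)
for _ in range(200):
    beta, eta = gibbs_sweep(beta, eta, X, y, rng=rng)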
  • 45. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 46. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 47. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN; ‘synchronous’) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 48. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN; ‘synchronous’) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 49. Assessing convergence of some Gibbs samplers A two-scale coupling Single Chain: β | η ∼ N(Ση⁻¹ Xᵀy, σ²Ση⁻¹), Ση := XᵀX + ξ Diag(η); π(η | β) = ∏_{j=1}^p π(ηj | βj). Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN; ‘synchronous’) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Which metric d? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 23 / 44
  • 50. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 51. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 52. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 53. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 54. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Want d to capture probability of exactly meeting. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 55. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Want d to capture probability of exactly meeting. d(β1,t, β2,t) := PMaxCouple(η1,t+1 ≠ η2,t+1 | β1,t, β2,t) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
  • 56. Assessing convergence of some Gibbs samplers A two-scale coupling Coupled Chain: β1, β2 | η1, η2 ∼ Common random numbers (CRN) η1, η2 | β1, β2 ∼ CRN when d(β1, β2) > dthreshold; Max Coupling when d(β1, β2) ≤ dthreshold Why is this a good idea? Insight: When (β1, β2) far away, attempts for (η1, η2) to meet exactly wasteful. When far away, use CRN to get close in the future. Only when close, attempt to meet exactly. Which metric d? Want d to capture probability of exactly meeting. d(β1,t, β2,t) := PMaxCouple(η1,t+1 ≠ η2,t+1 | β1,t, β2,t) Roberts & Rosenthal (2002) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 24 / 44
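A schematic sketch of the two-scale switching rule. To keep the code self-contained, a made-up exponential conditional is used for ηj | βj (explicitly not the Gibbs sampler's actual conditional), and the metric d is estimated by Monte Carlo as the maximal-coupling non-meeting probability; the rate function, threshold and sample sizes are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
rate = lambda b: 1.0 + b**2                                   # hypothetical conditional: eta | beta ~ Exp(1 + beta^2)
log_dens = lambda e, b: np.log(rate(b)) - rate(b) * e

def crn_pair(b1, b2):
    # Common-random-numbers coupling: one uniform drives both conditional draws.
    u = rng.uniform()
    return -np.log(u) / rate(b1), -np.log(u) / rate(b2)

def max_pair(b1, b2):
    # Maximal coupling of the two conditionals: maximises the chance that the draws are equal.
    e1 = rng.exponential(1.0 / rate(b1))
    if np.log(rng.uniform()) <= log_dens(e1, b2) - log_dens(e1, b1):
        return e1, e1
    while True:
        e2 = rng.exponential(1.0 / rate(b2))
        if np.log(rng.uniform()) > log_dens(e2, b1) - log_dens(e2, b2):
            return e1, e2

def metric_d(b1, b2, n_mc=200):
    # Monte Carlo estimate of d(b1, b2) = P_MaxCouple(eta1 != eta2 | b1, b2).
    draws = [max_pair(b1, b2) for _ in range(n_mc)]
    return 1.0 - np.mean([e1 == e2 for e1, e2 in draws])

def two_scale_update(b1, b2, d_threshold=0.5):
    # CRN when the beta-chains are far apart, maximal coupling when they are close.
    return crn_pair(b1, b2) if metric_d(b1, b2) > d_threshold else max_pair(b1, b2)

print(two_scale_update(10.0, 0.0))    # far apart: CRN draw, the pair contracts but does not meet
print(two_scale_update(0.30, 0.31))   # close: maximal-coupling draw, usually exactly equal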
  • 57. Assessing convergence of some Gibbs samplers GWAS dataset example: Half-t(2) prior n ≈ 2,266 different maize lines. p ≈ 98,385 covariates per maize line, linked to SNPs in the genome. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
  • 58. Assessing convergence of some Gibbs samplers GWAS dataset example: Half-t(2) prior n ≈ 2,266 different maize lines. p ≈ 98,385 covariates per maize line, linked to SNPs in the genome. (βt, ηt)t≥0 Markov chain now in R^98,385 × R^98,385. How long does it take to converge? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
  • 59. Assessing convergence of some Gibbs samplers GWAS dataset example: Half-t(2) prior n ≈ 2,266 different maize lines. p ≈ 98,385 covariates per maize line, linked to SNPs in the genome. (βt, ηt)t≥0 Markov chain now in R^98,385 × R^98,385. How long does it take to converge? 1,000 steps. [Figure: estimated total variation distance upper bound against iteration, decaying to zero within roughly 1,000 iterations] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 25 / 44
  • 60. Assessing convergence of some Gibbs samplers Open questions Convergence complexity analysis Interplay between posterior concentration and MCMC convergence Alternative coupling algorithms Niloy Biswas (Harvard) PhD Thesis October 2, 2022 26 / 44
  • 61. Bounding Wasserstein distance with couplings What about diagnostics for approximate samplers? Bounding Wasserstein distance with couplings Biswas and Mackey. Bounding Wasserstein distance with couplings. Revision requested, Journal of the American Statistical Association (JASA), 2022. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 27 / 44
  • 62. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 63. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 64. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Examples: Stochastic gradients for tall data Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 65. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Examples: Stochastic gradients for tall data Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021 Matrix approximations for wide data Johndrow et al. 2020; Narisetty et al. 2019; Atchadé and Wang 2021 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 66. Bounding Wasserstein distance with couplings Exact and Approximate MCMC Exact MCMC: Initialize X0 ∼ P0. Sample Xt t≥1 ∼ K1(Xt−1, ·) Kernel K1 is P-invariant, so that Xt t→∞ ⇒ P Approximate MCMC: Initialize Y0 ∼ Q0. Sample Yt t≥1 ∼ K2(Yt−1, ·) Kernel K2 is Q-invariant, so that Yt t→∞ ⇒ Q K2 similar to but computationally cheaper than K1 Examples: Stochastic gradients for tall data Welling and Teh, 2011; Bardenet et al., 2017; Nemeth and Fearnhead, 2021 Matrix approximations for wide data Johndrow et al. 2020; Narisetty et al. 2019; Atchadé and Wang 2021 How close is P to Q? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 28 / 44
  • 67. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 68. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 69. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). A geometrically faithful metric between probability measures Villani, 2008; Peyré and Cuturi, 2019 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 70. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). A geometrically faithful metric between probability measures Villani, 2008; Peyré and Cuturi, 2019 Can control the absolute difference between moments of P and Q Gelbrich, 1990; Sriperumbudur et al., 2012; Huggins et al., 2020 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 71. Bounding Wasserstein distance with couplings Wasserstein distance How close is P to Q? For metric space (X, c), p ≥ 1, Wp(P, Q) ≜ inf_{X∼P, Y∼Q} E[c(X, Y)^p]^(1/p). A geometrically faithful metric between probability measures Villani, 2008; Peyré and Cuturi, 2019 Can control the absolute difference between moments of P and Q Gelbrich, 1990; Sriperumbudur et al., 2012; Huggins et al., 2020 We will estimate upper bounds of Wp(P, Q) with couplings. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 29 / 44
  • 72. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
  • 73. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 We simulate such couplings to empirically assess sample quality. Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
  • 74. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 We simulate such couplings to empirically assess sample quality. Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·). For such independent coupled trajectories (X_t^(i), Y_t^(i)), t = 1, . . . , T, with i = 1, . . . , I, define the estimator CUBp ≜ ( (1/(I(T − S))) Σ_{i=1}^I Σ_{t=S+1}^T c(X_t^(i), Y_t^(i))^p )^(1/p). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
  • 75. Bounding Wasserstein distance with couplings Couplings Consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Pillai and Smith, 2015; Johndrow and Mattingly, 2018; Rudolf and Schweizer, 2018 We simulate such couplings to empirically assess sample quality. Given K̄, sample (Xt+1, Yt+1) | (Xt, Yt) ∼ K̄((Xt, Yt), ·). For such independent coupled trajectories (X_t^(i), Y_t^(i)), t = 1, . . . , T, with i = 1, . . . , I, define the estimator CUBp ≜ ( (1/(I(T − S))) Σ_{i=1}^I Σ_{t=S+1}^T c(X_t^(i), Y_t^(i))^p )^(1/p). Use CUBp to estimate upper bounds of Wp(P, Q). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 30 / 44
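A minimal sketch of the CUBp estimator just defined, assuming the I coupled trajectories are stored as arrays of shape (I, T, dim) and taking c to be the Euclidean distance; the toy trajectories only illustrate the call.

import numpy as np

def cub(X, Y, S=0, p=2):
    # CUB_p from I independent coupled trajectories of length T (arrays of shape (I, T, dim));
    # S initial iterations are discarded, c(x, y) is the Euclidean distance.
    dists = np.linalg.norm(X[:, S:, :] - Y[:, S:, :], axis=-1)
    return np.mean(dists ** p) ** (1.0 / p)

I_, T_, dim = 10, 1000, 5
X = np.random.default_rng(0).standard_normal((I_, T_, dim))
Y = X + 0.1                                                   # toy "coupled" trajectories offset by 0.1 per coordinate
print(cub(X, Y, S=100, p=2))                                  # equals 0.1 * sqrt(dim) here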
  • 76. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 77. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 78. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 79. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 80. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^(1/2). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 81. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^(1/2). Empirical W2: W2(P̂_{1000I}, Q̂_{1000I}). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 82. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Indep: E_{X∼P, Y∼Q, X⊥Y}[∥X − Y∥₂²]^(1/2) = (2d)^(1/2). CUB2: ( (1/(IT)) Σ_{i=1}^I Σ_{t=1}^T ∥X_t^(i) − Y_t^(i)∥₂² )^(1/2). Empirical W2: W2(P̂_{1000I}, Q̂_{1000I}). True W2: W2(P, Q). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 31 / 44
  • 83. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 32 / 44
  • 84. Bounding Wasserstein distance with couplings A stylized example P = N(0, Σ) for [Σ]i,j = 0.5^|i−j|, Q = N(0, Id) on Rd. Apply common random numbers (CRN) coupling of MALA kernels marginally targeting P and Q. [Figure: W2 upper bounds against trajectory length T, comparing the independent coupling bound, CUB2, the empirical W2 and the true W2; dimension d = 100] [Figure: the same W2 upper bounds against varying dimension d, from 200 to 1000] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 32 / 44
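A minimal sketch of the CRN coupling of MALA kernels in this example: the same Gaussian innovation and the same acceptance uniform drive one MALA chain targeting P = N(0, Σ) and one targeting Q = N(0, Id). The step size, initialization and the single-trajectory average at the end (a crude stand-in for CUB2 with I = 1 and S = 0) are illustrative.

import numpy as np

d = 100
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))   # [Sigma]_ij = 0.5^|i-j|
Sigma_inv = np.linalg.inv(Sigma)
rng = np.random.default_rng(0)

logp = lambda x: -0.5 * x @ Sigma_inv @ x                     # log density of P = N(0, Sigma), up to a constant
gradp = lambda x: -Sigma_inv @ x
logq = lambda x: -0.5 * x @ x                                 # log density of Q = N(0, I_d), up to a constant
gradq = lambda x: -x

def crn_mala_step(x, y, h=0.1):
    # One CRN-coupled MALA step: shared noise z and shared uniform for both chains.
    z = rng.standard_normal(d)
    log_u = np.log(rng.uniform())
    def advance(state, logf, gradf):
        prop = state + 0.5 * h * gradf(state) + np.sqrt(h) * z
        fwd = -np.sum((prop - state - 0.5 * h * gradf(state)) ** 2) / (2 * h)   # log q(prop | state)
        bwd = -np.sum((state - prop - 0.5 * h * gradf(prop)) ** 2) / (2 * h)    # log q(state | prop)
        return prop if log_u < logf(prop) - logf(state) + bwd - fwd else state
    return advance(x, logp, gradp), advance(y, logq, gradq)

x, y = rng.standard_normal(d), rng.standard_normal(d)
sq_dists = []
for _ in range(1000):
    x, y = crn_mala_step(x, y)
    sq_dists.append(np.sum((x - y) ** 2))
print(np.sqrt(np.mean(sq_dists)))                             # crude CUB_2 estimate from one coupled trajectory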
  • 85. Bounding Wasserstein distance with couplings Algorithms to sample from the coupled kernel K̄ To simulate the coupled chain (Xt, Yt)t≥0, we consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
  • 86. Bounding Wasserstein distance with couplings Algorithms to sample from the coupled kernel K̄ To simulate the coupled chain (Xt, Yt)t≥0, we consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). To construct and analyze K̄, we make use of: 1. A Markovian coupling Γ1 of kernel K1: for all x, y ∈ X, Γ1(x, y) is a coupling of the distributions K1(x, ·) and K1(y, ·). 2. A coupling Γ∆ of kernels K1 and K2 from the same point: for all z ∈ X, Γ∆(z) is a coupling of the distributions K1(z, ·) and K2(z, ·). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
  • 87. Bounding Wasserstein distance with couplings Algorithms to sample from the coupled kernel K̄ To simulate the coupled chain (Xt, Yt)t≥0, we consider kernel K̄ on X × X s.t. ∀x, y ∈ X, ∀A ∈ B(X), K̄((x, y), A × X) = K1(x, A) and K̄((x, y), X × A) = K2(y, A). To construct and analyze K̄, we make use of: 1. A Markovian coupling Γ1 of kernel K1: for all x, y ∈ X, Γ1(x, y) is a coupling of the distributions K1(x, ·) and K1(y, ·). 2. A coupling Γ∆ of kernels K1 and K2 from the same point: for all z ∈ X, Γ∆(z) is a coupling of the distributions K1(z, ·) and K2(z, ·). [Figure: schematic of the joint kernel K̄ on X × X, with nodes Xt, Yt, Xt+1, Zt+1, Yt+1 and arrows labelled Γ1 and Γ∆] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 33 / 44
  • 88. Bounding Wasserstein distance with couplings Interpretable upper bounds for CUB Theorem (CUB upper bound) Let (Xt, Yt)t≥0 denote a coupled Markov chain generated using joint kernel K̄. Suppose there exists a constant ρ ∈ (0, 1) such that, for all Xt, Yt ∈ X and (Xt+1, Yt+1) | (Xt, Yt) ∼ Γ1(Xt, Yt), E[c(Xt+1, Yt+1)^p | Xt, Yt]^(1/p) ≤ ρ c(Xt, Yt). Then E[CUB_{p,t}^p]^(1/p) = E[c(Xt, Yt)^p]^(1/p) ≤ ρ^t E[c(X0, Y0)^p]^(1/p) + Σ_{i=1}^t ρ^(t−i) E[∆p(Y_{i−1})]^(1/p) for all t ≥ 0, where ∆p(z) := E[c(X, Y)^p | z] for (X, Y) | z ∼ Γ∆(z). Generalizes existing results on Markov chain perturbation theory (in W1) to: (i) Wp for all p ≥ 1, (ii) couplings which may not be Wasserstein-optimal. Extensions with weaker assumptions in the paper. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 34 / 44
  • 89. Bounding Wasserstein distance with couplings MCMC for tall data Bayesian logistic regression on: Pima Indians (n = 768 observations and d = 8 covariates); DS1 life sciences dataset (n = 26,732 and d = 10). Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
  • 90. Bounding Wasserstein distance with couplings MCMC for tall data Bayesian logistic regression on: Pima Indians (n = 768 observations and d = 8 covariates); DS1 life sciences dataset (n = 26,732 and d = 10). We apply CRN coupling of MALA kernels. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
  • 91. Bounding Wasserstein distance with couplings MCMC for tall data Bayesian logistic regression on: Pima Indians (n = 768 observations and d = 8 covariates); DS1 life sciences dataset (n = 26,732 and d = 10). We apply CRN coupling of MALA kernels. [Figure: W2 upper and lower bounds for each approximate MCMC or variational procedure (Mean Field VB, Laplace, SGLD 10%, SGLD 50%, ULA), on the DS1 and Pima datasets] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 35 / 44
  • 92. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 93. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 94. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Approximate MCMC: X Diag(η̃t)⁻¹ Xᵀ for η̃j,t ≜ ηj,t 1{ηj,t⁻¹ > ϵ}. Johndrow et al., 2020 Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 95. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Approximate MCMC: X Diag(η̃t)⁻¹ Xᵀ for η̃j,t ≜ ηj,t 1{ηj,t⁻¹ > ϵ}. Johndrow et al., 2020 Apply CRN coupling of the exact and the approximate chain. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
  • 96. Bounding Wasserstein distance with couplings Approximate MCMC for high-dimensional linear regression Data: Riboflavin bacteria GWAS dataset (n = 500 and d = 4,088). y = Xβ + σϵ̃, ϵ̃ ∼ N(0, In); ξ^(−1/2) ∼ C+(0, 1), ηj^(−1/2) i.i.d. ∼ t+(2), σ⁻² ∼ Gamma(a0/2, b0/2), βj | η, ξ, σ² ind. ∼ N(0, σ²/(ξηj)). Polson and Scott, 2020; Biswas et al. 2022 Exact MCMC: O(n²d) cost from X Diag(ηt)⁻¹ Xᵀ. Approximate MCMC: X Diag(η̃t)⁻¹ Xᵀ for η̃j,t ≜ ηj,t 1{ηj,t⁻¹ > ϵ}. Johndrow et al., 2020 Apply CRN coupling of the exact and the approximate chain. [Figure: W2 upper and lower bounds against the approximate MCMC threshold ε ∈ {0, 10⁻⁴, 10⁻³, 10⁻²}] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 36 / 44
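A small sketch of where the cost saving in the thresholded approximation comes from: coordinates with ηj⁻¹ ≤ ε are treated as contributing nothing to X Diag(η)⁻¹ Xᵀ (this convention for the dropped coordinates, and every numerical value below, is an assumption for illustration), so only the active columns enter the product and the O(n²p) cost shrinks to O(n² × #active).

import numpy as np

def xdx_exact(X, eta):
    # Exact M = X Diag(eta)^{-1} X^T: the O(n^2 p) bottleneck of each exact MCMC iteration.
    return (X / eta) @ X.T

def xdx_thresholded(X, eta, eps=1e-3):
    # Keep only coordinates with eta_j^{-1} > eps; dropped coordinates contribute nothing.
    active = (1.0 / eta) > eps
    Xa = X[:, active]
    return (Xa / eta[active]) @ Xa.T

rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.standard_normal((n, p))
eta = 1e4 * rng.exponential(1.0, size=p)                      # mostly huge eta_j: heavily shrunk coordinates
eta[:10] = 1e-2                                               # a handful of active coordinates
M, M_eps = xdx_exact(X, eta), xdx_thresholded(X, eta, eps=1e-3)
print(np.linalg.norm(M - M_eps) / np.linalg.norm(M))          # relative error introduced by the threshold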
  • 97. Bounding Wasserstein distance with couplings Open questions Can we avoid sampling from the exact MCMC chain (Xt)t≥0? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
  • 98. Bounding Wasserstein distance with couplings Open questions Can we avoid sampling from the exact MCMC chain (Xt)t≥0? Consider constructing a Markov chain (Y ′ t , Yt)t≥0 such that: 1 Y ′ t ∼ Yt for all t ≥ 0, both marginally distributed according to the approximate MCMC chain. 2 E[c(Xt, Y ′ t )p ] = E[c(Xt, Yt)p ] ≤ E[c(Y ′ t , Yt)p ] for all t ≥ 0. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
  • 99. Bounding Wasserstein distance with couplings Open questions Can we avoid sampling from the exact MCMC chain (Xt)t≥0? Consider constructing a Markov chain (Y ′ t , Yt)t≥0 such that: 1 Y ′ t ∼ Yt for all t ≥ 0, both marginally distributed according to the approximate MCMC chain. 2 E[c(Xt, Y ′ t )p ] = E[c(Xt, Yt)p ] ≤ E[c(Y ′ t , Yt)p ] for all t ≥ 0. Then E[c(Xt, Yt)p]1/p ≤ E[c(Y ′ t , Yt)p]1/p ≤ 2E[c(Xt, Yt)p]1/p. Upper bounds from (Y ′ t , Yt)t≥0 only loose by a constant factor of 2. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 37 / 44
  • 100. Scalable Spike-and-Slab Scalable Spike-and-Slab Biswas, Mackey and Meng. Scalable Spike-and-Slab. International Conference on Machine Learning (ICML), 2022. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 38 / 44
  • 101. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 102. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 103. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In). Continuous Spike-and-Slab Prior [George and McCulloch, 1993]: σ² ∼ InvGamma(a0/2, b0/2); zj i.i.d. ∼ Bernoulli(q), j = 1, . . . , p; βj | zj, σ² ind. ∼ (1 − zj) N(0, σ²τ0²) [spike] + zj N(0, σ²τ1²) [slab], j = 1, . . . , p. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 104. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In). Continuous Spike-and-Slab Prior [George and McCulloch, 1993]: σ² ∼ InvGamma(a0/2, b0/2); zj i.i.d. ∼ Bernoulli(q), j = 1, . . . , p; βj | zj, σ² ind. ∼ (1 − zj) N(0, σ²τ0²) [spike] + zj N(0, σ²τ1²) [slab], j = 1, . . . , p. Inference using P(zj = 1 | y). Guan and Stephens, 2011; Zhou et al., 2013, . . . Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
  • 105. Scalable Spike-and-Slab Variable selection with spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Linear regression: y = Xβ + σϵ where ϵ ∼ N(0, In). Continuous Spike-and-Slab Prior [George and McCulloch, 1993]: σ² ∼ InvGamma(a0/2, b0/2); zj i.i.d. ∼ Bernoulli(q), j = 1, . . . , p; βj | zj, σ² ind. ∼ (1 − zj) N(0, σ²τ0²) [spike] + zj N(0, σ²τ1²) [slab], j = 1, . . . , p. Inference using P(zj = 1 | y). Guan and Stephens, 2011; Zhou et al., 2013, . . . How to sample from the posterior? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 39 / 44
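A minimal sketch of two ingredients this prior implies: drawing (z, β) from it, and the conditional inclusion probability P(zj = 1 | βj, σ²) that a Gibbs sampler for this prior uses (averaging the sampled zj, or these conditionals, across MCMC iterations estimates P(zj = 1 | y)); the hyperparameters q, τ0, τ1 are illustrative.

import numpy as np

def normal_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (np.sqrt(2.0 * np.pi) * sd)

def sample_prior(p, q=0.05, tau0=0.01, tau1=1.0, sigma2=1.0, rng=None):
    # Draw (z, beta) from the continuous spike-and-slab prior (sigma^2 fixed here).
    rng = np.random.default_rng(0) if rng is None else rng
    z = rng.binomial(1, q, size=p)
    sd = np.sqrt(sigma2) * np.where(z == 1, tau1, tau0)       # slab sd for z_j = 1, spike sd for z_j = 0
    return z, rng.normal(0.0, sd)

def inclusion_prob(beta, q=0.05, tau0=0.01, tau1=1.0, sigma2=1.0):
    # Full conditional P(z_j = 1 | beta_j, sigma^2) under the two-component normal mixture.
    slab = q * normal_pdf(beta, np.sqrt(sigma2) * tau1)
    spike = (1 - q) * normal_pdf(beta, np.sqrt(sigma2) * tau0)
    return slab / (slab + spike)

z, beta = sample_prior(10, rng=np.random.default_rng(3))
print(z)
print(inclusion_prob(beta))     # typically near 1 for slab coordinates and near 0 for spike coordinates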
  • 106. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 107. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: Naïve MCMC: O(p³) cost per iteration. State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration. For large datasets, the O(n²p) cost becomes prohibitive, e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 108. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: Naïve MCMC: O(p³) cost per iteration. State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration. For large datasets, the O(n²p) cost becomes prohibitive, e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration. Approximate inference methods: Approx. MCMC: O(max{n∥zt∥₁², np}) cost at iteration t. Narisetty et al., 2019. Variational inference: Ray et al., 2020; Ray and Szabó, 2021. Does not converge to the spike-and-slab posterior. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 109. Scalable Spike-and-Slab Bayesian computation for spike-and-slab priors High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: Naïve MCMC: O(p³) cost per iteration. State-of-the-art (SOTA) MCMC: O(n²p) cost per iteration. For large datasets, the O(n²p) cost becomes prohibitive, e.g. GWAS with n ≈ 10³, p ≈ 10⁵: SOTA takes 1 minute per iteration. Approximate inference methods: Approx. MCMC: O(max{n∥zt∥₁², np}) cost at iteration t. Narisetty et al., 2019. Variational inference: Ray et al., 2020; Ray and Szabó, 2021. Does not converge to the spike-and-slab posterior. Speed up Bayesian computation without compromising on sample quality? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 40 / 44
  • 110. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 111. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 112. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 113. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 114. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 115. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 116. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, (ii) posterior concentration, Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 117. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, (ii) posterior concentration, and (iii) positive auto-correlation Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 118. Scalable Spike-and-Slab Speed up Bayesian computation without compromising on sample quality? High-dimensional data: y ∈ Rn, design matrix X ∈ Rn×p, n ≪ p. Markov chain Monte Carlo methods: State-of-the-art (SOTA) MCMC: O(n2 p) cost per iteration Scalable Spike-and-Slab (S3): Same MCMC kernel as SOTA. O(max{n2 pt, np}) cost at iteration t for linear and probit, where (O(max{n2 pt, n3 , np}) for logistic regression) pt = min{∥zt∥, p − ∥zt∥, ∥zt − zt−1∥1}. (i) sparsity, (ii) posterior concentration, and (iii) positive auto-correlation all give smaller pt and lower computational cost. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 41 / 44
  • 119. Scalable Spike-and-Slab S3: computational cost Binary response data: y ∈ {0, 1}n, design matrix X ∈ Rn×p, n ≪ p. [Figure: time per iteration (ms) against dimension p for S3 logistic, SOTA logistic, Skinny Gibbs logistic, S3 probit and SOTA probit samplers] E.g. for n ≈ 4000, p ≈ 40000, E[pt] ≈ 10: S3 is 50× faster than SOTA. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 42 / 44
  • 120. Scalable Spike-and-Slab S3: sample quality Binary response data: y ∈ {0, 1}n, design matrix X ∈ Rn×p, n ≪ p. [Figure: true positive rate (TPR) and false discovery rate (FDR) against dimension p for S3 logistic, S3 probit and Skinny Gibbs logistic samplers] Niloy Biswas (Harvard) PhD Thesis October 2, 2022 43 / 44
  • 121. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 122. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 123. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 124. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, “MCMC algorithms (converge slowly) can converge quickly for high-dimensional, multimodal target distributions”, 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 125. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, “MCMC algorithms (converge slowly) can converge quickly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive?” 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44
  • 126. Scalable Spike-and-Slab Revisiting Tales from the crypt 2 (Folklore about) Monte Carlo in large scale applications: “Markov chain Monte Carlo (MCMC) algorithms (have prohibitively high) can have lower computational cost per iteration”, “MCMC algorithms (converge slowly) can converge quickly for high-dimensional, multimodal target distributions”, “Bayesian inference is computationally too expensive?” This thesis participates in a wider quest to scale Bayesian inference. Welling and Teh (2011); Gorham (2015); Bardenet et al. (2017); Blei et al. (2017); Narisetty et al. (2019), Bierkens et al. (2019), Papaspiliopoulos et al. (2019), Pollock et al. (2020), Jacob et al. (2020), Johndrow et al. (2020), Nemeth and Fearnhead (2021), Ray and Szabó (2021) . . . 2 Chopin & Papaspiliopoulos, Laplace’s Demon seminar on Bayesian Machine Learning at Scale, 2020. Niloy Biswas (Harvard) PhD Thesis October 2, 2022 44 / 44