Monte Carlo methods for some not-quite-but-almost Bayesian problems

Monte Carlo methods for some
not-quite-but-almost Bayesian problems
Pierre E. Jacob
Department of Statistics, Harvard University
joint work with
Ruobin Gong, Paul T. Edlefsen, Arthur P. Dempster
John O’Leary, Yves F. Atchad´e, Niloy Biswas, Paul Vanetti
and others
November 21, 2019
Department of Statistical Science, University of Toronto
Pierre E. Jacob Monte Carlo for not quite Bayes

Introduction
A lot of questions in statistics give rise to non-trivial
computational problems.
Among these, some are numerical integration problems ⇔
about sampling from probability distributions.
Besag, Markov chain Monte Carlo for statistical inference, 2001.
Computational challenges arise in deviations from standard
Bayesian inference, motivated by three questions,
quantifying ignorance / Dempster–Shafer analysis,
model misspeciﬁcation / modular Bayesian inference,
robustness to some perturbation of the data / BayesBag.

Outline
1 Dempster–Shafer analysis of count data
2 Unbiased MCMC and diagnostics of convergence
3 Modular Bayesian inference
4 Bagging posterior distributions

Inference with count data
Notation: [N] := {1, . . . , N}. Simplex ∆.
Observations : xn ∈ [K] := {1, . . . , K}, x = (x1, . . . , xN ).
Index sets : Ik = {n ∈ [N] : xn = k}.
Counts : Nk = |Ik|.
Model: xn
iid
∼ Categorical(θ) with θ = (θk)k∈[K] ∈ ∆,
i.e. P(xn = k) = θk for all n, k.
Goal: estimate θ, predict, etc.
Maximum likelihood estimator: ˆθk = Nk/N.
Bayesian inference combines likelihood with prior on θ into a
posterior distribution, assigning a probability ∈ [0, 1] to any
measurable subset Σ of the simplex ∆.

Arthur Dempster’s approach to inference
Observations x = (xn)n∈[N] are ﬁxed.
We will specify a sampling mechanism, on top of the likelihood,
e.g. xn = m(un, θ) for some function m and random variable un.
We will seek u = (un)n∈[N] that could have generated x for
some θ. For arbitrary u, such a θ might not exist.
If a set of feasible θ exists denote it by F(u). Dempster’s
approach deﬁnes lower/upper probabilities for subsets Σ of
interest, as expectations with respect to non-empty F(u).
Arthur P. Dempster. New methods for reasoning towards posterior
distributions based on sample data. Annals of Mathematical Statistics, 1966.
Arthur P. Dempster. Statistical inference from a Dempster—Shafer
perspective. Past, Present, and Future of Statistical Science, 2014.

Sampling from a Categorical distribution
2 3
1
∆1(θ)
∆2(θ)∆3(θ)
θ
Subsimplex ∆k(θ), for θ ∈ ∆:
{z ∈ ∆ : ∀ ∈ [K] z /zk ≥ θ /θk}.
Sampling mechanism, for θ ∈ ∆:
- draw un uniform on ∆,
- deﬁne xn such that un ∈ ∆xn (θ).
Then P(xn = k) = θk,
because Vol(∆k(θ)) = θk.

Draws in the simplex
Counts: (2, 3, 1). Let’s draw N = 6 uniform samples on ∆.
2 3
1
q
q
q
q
q
q

Each un is associated to an observed xn ∈ {11, 22, 33}.
2 3
1
q
q
q
q
q
q

If there exists a feasible θ, it cannot be just anywhere.
2 3
1
q
q
q
q
q
q

The uns of each category add constraints on θ.
2 3
1
q
q
q
q
q
q

Overall the constraints deﬁne a polytope for θ, or an empty set.
2 3
1
q
q
q
q
q
q

Here, there is a polytope of θ such that ∀n ∈ [N] un ∈ ∆xn (θ).
2 3
1
q
q
q
q
q
q

Any θ in the polytope separates the uns appropriately.
2 3
1
qqq
q
q
q
q
q
q

Let’s try again with fresh uniform samples on ∆.
2 3
1
q q
q
q
q
q

Here there is no θ ∈ ∆ such that ∀n ∈ [N] un ∈ ∆xn (θ).
2 3
1
q q
q
q
q
q

Lower and upper probabilities
Consider the set
Rx = (u1, . . . , uN ) ∈ ∆N
: ∃θ ∈ ∆ ∀n ∈ [N] un ∈ ∆xn (θ) .
and denote by νx the uniform distribution on Rx.
For u ∈ Rx, there is a set F(u) = {θ ∈ ∆ : ∀n un ∈ ∆xn (θ)}.
For a set Σ ⊂ ∆ of interest, deﬁne
(lower probability) P(Σ) = 1(F(u) ⊂ Σ)νx(du),
(upper probability) ¯P(Σ) = 1(F(u) ∩ Σ = ∅)νx(du).

Summary and Monte Carlo problem
Arthur Dempster’s approach, later called Dempster–Shafer
theory of belief functions, is based on a distribution of
feasible sets,
F(u) = {θ ∈ ∆ ∀n ∈ [N] un ∈ ∆xn (θ)},
where u ∼ νx, the uniform distribution on Rx.
How do we obtain samples from this distribution?
Rejection rate 99%, for data (2, 3, 1).
Hit-and-run algorithm?
Our proposed strategy is a Gibbs sampler. Starting from
some u ∈ Rx, we will iteratively refresh some components
un of u given others.

Gibbs sampler: initialization
We can obtain some u in Rx as follows.
Choose an arbitrary θ ∈ ∆.
For all n ∈ [N] sample un uniformly in ∆k(θ) where xn = k.
2 3
1
∆1(θ)
∆2(θ)∆3(θ)
θ
q
q
q
q
q
q
To sample components un given
others, we will express Rx,
{u : ∃θ ∀n un ∈ ∆xn (θ)}
in terms of relations that the
components un must satisfy with
respect to one another.

Equivalent representation
For any θ ∈ ∆,
∀k ∈ [K] ∀n ∈ Ik un ∈ ∆k(θ)
⇔ ∀k ∈ [K] ∀n ∈ Ik ∀ ∈ [K]
un,
un,k
≥
θ
θk
.
This is equivalent to
∀k ∈ [K] ∀ ∈ [K] min
n∈Ik
un,
un,k
≥
θ
θk
.

Linear constraints
Counts: (9, 8, 3), u in Rx.
Values ηk→ = minn∈Ik
un, /un,k deﬁne linear constraints on θ.
2 3
1
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
q
θ3 θ1 = η1→3
θ2 θ1 = η1→2

Some inequalities
Next, assume u ∈ Rx, write ηk→ = minn∈Ik
un, /un,k, and
consider some implications.
There exists θ ∈ ∆ such that θ /θk ≤ ηk→ for all k, ∈ [K].
Then, for all k,
θ
θk
≤ ηk→ , and
θk
θ
≤ η →k, thus ηk→ η →k ≥ 1.

More inequalities
We can continue, if K ≥ 3: for all k, , j,
η−1
→k ≤
θ
θk
=
θ
θj
θj
θk
≤ ηj→ ηk→j,
thus ηk→jηj→ η →k ≥ 1.
And if K ≥ 4, for all k, , j, m
ηk→jηj→ η →mηm→k ≥ 1.
Generally,
∀L ∈ [K] ∀j1, . . . , jL ∈ [K] ηj1→j2 ηj2→j3 . . . ηjL→j1 ≥ 1.

Main result
So far, if ∃θ ∈ ∆ such that θ /θk ≤ ηk→ for k, ∈ [K] then
The reverse implication holds too.
This would mean
Rx = {u : ∃θ ∀k, ∈ [K] θ /θk ≤ ηk→ }
= {u : ∀L ∈ [K] ∀j1, . . . , jL ∈ [K] ηj1→j2 ηj2→j3 . . . ηjL→j1 ≥ 1}.
i.e. Rx is represented by relations between components (un).
This helps computing conditional distributions under νx,
leading to a Gibbs sampler.

Some remarks on these inequalities
We can consider only unique indices in j1, . . . , jL,
since the other cases can be deduced from those.
Example: η1→2η2→4η4→3η3→2η2→1 ≥ 1,
follows from η1→2η2→1 ≥ 1 and η2→4η4→3η3→2 ≥ 1.
The indices j1 → j2 → · · · → jL → j1 form a cycle.

Graphs
Fully connected graph with weight log ηk→ on edge (k, ).
1
2
3
log(η1→2)
log(η2→1)
Value of a path = sum of the weights along the path.
Negative cycle = path from vertex to itself with negative value

Graphs
∀L ∀j1, . . . , jL ηj1→j2 . . . ηjL→j1 ≥ 1
⇔ ∀L ∀j1, . . . , jL log(ηj1→j2 ) + . . . + log(ηjL→j1 ) ≥ 0
⇔ there are no negative cycles in the graph.
1
2
3
log(η1→2)
log(η2→1)

Proof
Proof of claim: “inequalities” ⇒ “∃θ : θ /θk ≤ ηk→ ∀k, ”.
min(k → ) := minimum value of path from k to in the graph.
Finite ∀k, because of absence of negative cycles in the graph.
Deﬁne θ via θk ∝ exp(min(K → k)).
Then θ ∈ ∆. Furthermore, for all k,
min(K → ) ≤ min(K → k) + log(ηk→ ),
therefore θ /θk ≤ ηk→ .

So far. . .
We want to sample uniformly on the set Rx,
Rx = {u : ∃θ ∀k, ∈ [K] θ /θk ≤ ηk→ }.
We have proved that this set can also be written
{u : ∀L ∈ [K] ∀j1, . . . , jL ∈ [K] ηj1→j2 ηj2→j3 . . . ηjL→j1 ≥ 1}.
The inequalities hold if and only if some graph with weight
log ηk→ on edge (k, ) does not contain negative cycles.

Conditional distributions
We can obtain conditional distributions of un for n ∈ Ik given
(un)n/∈Ik
with respect to νx:
un given (un)n/∈Ik
are i.i.d. uniform in ∆k(θ ),
where θ ∝ exp(− min( → k)) for all ,
with min( → k) := minimum value of path from to k.
Shortest paths can be computed in polynomial time.

Counts: (9, 8, 3). What is the conditional distribution of
(un)n∈Ik
given (un)n/∈Ik
under νx?
2 3
1
q
q
q
qq q
q
q
q
q
q
q q
q
q
q
q
q
q
q

Counts: (9, 8, 3). What is the conditional distribution of
(un)n∈Ik
given (un)n/∈Ik
under νx?
2 3
1
q
q
q q
q
q
q
q
q
q
q

Gibbs sampler
Initial u(0) ∈ Rx.
At each iteration t ≥ 1, for each category k ∈ [K],
1 compute θ such that, for n ∈ Ik,
un given other components is uniform on ∆k(θ ).
2 Draw u
(t)
n ∼ ∆k(θ ) for n ∈ Ik.
3 Update η
(t)
k→ for ∈ [K].
In step 1, θ is obtained by computing shortest path in graph
with weights η
(t)
k→ on edge (k, ).
Computed e.g. with Bellman–Ford algorithm, implemented in
Cs´ardi & Nepusz, igraph package, 2006.
Alternatively, we can compute θ by solving a linear program,
Berkelaar, Eikland & Notebaert, lpsolve package, 2004

Gibbs sampler
Counts: (9, 8, 3), 100 polytopes generated by the sampler.
2 3
1

Cost per iteration
Cost in seconds for 100 full sweeps.
0.0
0.3
0.6
0.9
4 8 12 16
K
elapsed
N 256 512 1024 2048
https://github.com/pierrejacob/dempsterpolytope

Cost per iteration
Cost in seconds for 100 full sweeps.
0.0
0.3
0.6
0.9
256 512 1024 2048
N
elapsed
K 4 8 12 16

How many iterations for convergence?
Let ν(t) by the distribution of u(t) after t iterations.
TV(ν(t), νx) = supA |ν(t)(A) − νx(A)|.
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
iteration
TVupperbounds
K 5 10 20

How many iterations for convergence?
Let ν(t) by the distribution of u(t) after t iterations.
TV(ν(t), νx) = supA |ν(t)(A) − νx(A)|.
0.00
0.25
0.50
0.75
1.00
0 50 100 150 200
iteration
TVupperbounds
N 50 100 150 200

Summary
A Gibbs sampler can be used to approximate lower and upper
probabilities in the Dempster–Shafer framework.
Is perfect sampling possible here?
Extensions for hierarchical counts, hidden Markov models?
Jacob, Gong, Edlefsen & Dempster, A Gibbs sampler for a class of
random convex polytopes. On arXiv and researchers.one.

Coupled chains
Glynn & Rhee, Exact estimation for MC equilibrium expectations, 2014.
Generate two chains (Xt) and (Yt), going to π, as follows:
sample X0 and Y0 from π0 (independently, or not),
sample Xt|Xt−1 ∼ P(Xt−1, ·) for t = 1, . . . , L,
for t ≥ L + 1, sample
(Xt, Yt−L)|(Xt−1, Yt−L−1) ∼ ¯P ((Xt−1, Yt−L−1), ·).
¯P must be such that
Xt+1|Xt ∼ P(Xt, ·) and Yt|Yt−1 ∼ P(Yt−1, ·)
(thus Xt and Yt have the same distribution for all t ≥ 0),
there exists a random time τ such that Xt = Yt−L for t ≥ τ
(the chains meet and remain “faithful”).

Coupled chains
0
4
8
0 50 100 150 200
iteration
x
π = N(0, 1), RWMH with Normal proposal std = 0.5, π0 = N(10, 32
)

Unbiased estimators
Under some conditions, the estimator
1
m − k + 1
m
t=k
h(Xt)
+
1
m − k + 1
τ−1
t=k+L
min m − k + 1,
t − k
L
(h(Xt) − h(Yt−L)),
has expectation h(x)π(dx), finite cost and finite variance.
“MCMC estimator + bias correction terms”
Its efficiency can be close to that of MCMC estimators,
if k, m chosen appropriately (and L also).
Jacob, O’Leary & Atchadé, Unbiased MCMC with couplings, 2019.

Finite-time bias of MCMC
Total variation distance between Xt ∼ πt and π = limt→∞ πt:
πt − π TV ≤ E[max(0, (τ − L − t)/L )].
0.000
0.005
0.010
0.015
0 50 100 150 200
τ − lag
lag = 1
1e−04
1e−03
1e−02
1e−01
1e+00
1e+01
1e+02
0 50 100 150 200
iteration
TVupperbounds
Biswas, Jacob & Vanetti, Estimating Convergence of Markov chains
with L-Lag Couplings, 2019.

πt − π TV ≤ E[max(0, (τ − L − t)/L )].
0.000
0.005
0.010
0.015
0 50 100 150
τ − lag
lag = 50
1e−04
1e−03
1e−02
1e−01
1e+00
1e+01
1e+02
0 50 100 150 200
iteration
TVupperbounds

πt − π TV ≤ E[max(0, (τ − L − t)/L )].
0.000
0.005
0.010
0.015
0 50 100 150
τ − lag
lag = 100
1e−04
1e−03
1e−02
1e−01
1e+00
1e+01
1e+02
0 50 100 150 200
iteration
TVupperbounds

Upper bounds can also be obtained for e.g. 1-Wasserstein.
And perhaps lower bounds?
Applicable in e.g. high-dimensional and/or discrete spaces.

Example: Gibbs sampler for Dempster’s analysis of counts.
0.00
0.25
0.50
0.75
1.00
0 50 100 150 200
iteration
TVupperbounds
N 50 100 150 200
This quantiﬁes bias of MCMC estimators, not variance.

Models made of modules
First module:
parameter θ1, data Y1
prior: p1(θ1)
likelihood: p1(Y1|θ1)
Second module:
parameter θ2, data Y2
prior: p2(θ2|θ1)
likelihood: p2 (Y2|θ1, θ2)
We are interested in the estimation of θ1, θ2 or both.

Joint model approach
Parameter (θ1, θ2), with prior
p(θ1, θ2) = p1(θ1)p2(θ2|θ1).
Data (Y1, Y2), likelihood
p(Y1, Y2|θ1, θ2) = p1(Y1|θ1)p2(Y2|θ1, θ2).
Posterior distribution
π (θ1, θ2|Y1, Y2) ∝ p1 (θ1) p1(Y1|θ1)p2 (θ2|θ1) p2 (Y2|θ1, θ2).

Joint model approach
In the joint model approach, all data are used to
simultaneously infer all parameters. . .
. . . so that uncertainty about θ1 is propagated to the
estimation of θ2. . .
. . . but misspeciﬁcation of the 2nd module can damage the
estimation of θ1.
What about allowing uncertainty propagation, but
preventing feedback of some modules on others?

Cut distribution
One might want to propagate uncertainty without allowing
“feedback” of second module on first module.
Cut distribution:
πcut
(θ1, θ2; Y1, Y2) = p1(θ1|Y1)p2 (θ2|θ1, Y2).
Different from the posterior distribution under joint model,
under which the first marginal is π(θ1|Y1, Y2).

Example: epidemiological study
Model of virus prevalence
∀i = 1, . . . , I Zi ∼ Binomial(Ni, ϕi),
Zi is number of women infected with high-risk HPV in a
sample of size Ni in country i.
Beta(1,1) prior on each ϕi, independently.
Impact of prevalence onto cervical cancer occurrence
∀i = 1, . . . , I Yi ∼ Poisson(λiTi), log(λi) = θ2,1 + θ2,2ϕi,
Yi is number of cancer cases arising from Ti woman-years of
follow-up in country i.
N(0, 103) on θ2,1, θ2,2, independently.
Plummer, Cuts in Bayesian graphical models, 2014.
Jacob, Holmes, Murray, Robert & Nicholson, Better together?
Statistical learning in models made of modules.

Monte Carlo with joint model approach
Joint model posterior has density
π (θ1, θ2|Y1, Y2) ∝ p1 (θ1) p1 (Y1|θ1)p2 (θ2|θ1) p2 (Y2|θ1, θ2).
The computational complexity typically grows
super-linearly with the number of modules.
Diﬃculties stack up. . .
intractability, multimodality, ridges, etc.

WinBUGS’ approach via the cut function: alternate between
sampling θ1 from K1(θ1 → dθ1), targeting p1(dθ1|Y1);
sampling θ2 from K2
θ1
(θ2 → dθ2), targeting p2(dθ2|θ1, Y2).
This does not leave the cut distribution invariant!
Iterating the kernel K2
θ1
enough times mitigates the issue.
Plummer, Cuts in Bayesian graphical models, 2014.

In a perfect world, we could sample i.i.d.
θi
1 from p1(θ1|Y1),
θi
2 given θi
1 from p2(θ2|θi
1, Y2),
then (θi
1, θi
2) would be i.i.d. from the cut distribution.

In an MCMC world, we can sample
θi
1 approximately from p1(θ1|Y1) using MCMC,
θi
2 given θi
1 approximately from p2(θ2|θi
1, Y2) using MCMC,
then resulting samples approximate the cut distribution,
in the limit of the numbers of iterations, at both stages.

In an unbiased MCMC world, we can approximate expectations
h(x)π(dx) without bias, in ﬁnite compute time.
We can obtain an unbiased approximation of p1(θ1|Y1), and for
each θ1, an unbiased approximation of p2(θ2|θ1, Y2).
Thus, by the tower property, we can unbiasedly estimate
h(θ1, θ2)p2(dθ2|θ1, Y1)p1(dθ1|Y1).
Jacob, O’Leary & Atchad´e, Unbiased MCMC with couplings, 2019.

Example: epidemiological study
0
1
2
3
−2.5 −2.0 −1.5
θ2,1
density
0.00
0.05
0.10
0.15
10 15 20 25
θ2,2
densityApproximation of the marginals of the cut distribution of
(θ2,1, θ2,2), the parameters of the Poisson regression module in
the epidemiological model of Plummer (2014).
Jacob, Holmes, Murray, Robert & Nicholson, Better together?
Statistical learning in models made of modules.

Bagging posterior distributions
We can stabilize the posterior distribution by using a
bootstrap and aggregation scheme, in the spirit of bag-
ging (Breiman, 1996b). In a nutshell, denote by D a
bootstrap or subsample of the data D. The posterior of
the random parameters θ given the data D has c.d.f.
F(·|D), and we can stabilize this using
FBayesBag(·|D) = E [F(·|D )],
where E is with respect to the bootstrap- or subsam-
pling scheme. We call it the BayesBag estimator. It
can be approximated by averaging over B posterior com-
putations for bootstrap- or subsamples, which might be
a rather demanding task (although say B=10 would al-
ready stabilize to a certain extent).
B¨uhlmann, Discussion of Big Bayes Stories and BayesBag, 2014.

Bagging posterior distributions
For b = 1, . . . , B
Sample data set D(b) by bootstrapping from D.
Obtain MCMC approximation ˆπ(b) of posterior given D(b).
Finally obtain B−1 B
b=1 ˆπ(b).
Converges to “BayesBag” distribution as both B and number of
MCMC samples go to inﬁnity.
If we can obtain unbiased approximation of posterior given any
D, the resulting approximation of “BayesBag” would be
consistent as B → ∞ only.
Exactly the same reasoning as for the cut distribution.
Example at https://statisfaction.wordpress.com/2019/
10/02/bayesbag-and-how-to-approximate-it/

Discussion
Some existing alternatives to standard Bayesian inference
are well motivated, but raise computational questions.
There are on-going efforts toward scalable Monte Carlo
methods, e.g. using coupled Markov chains or regeneration
techniques, in addition to sustained search for new MCMC
algorithms.
Quantification of variance is commonly done, quantification
of bias is also possible.
What makes a computational method convenient? It does
not seem to be entirely about asymptotic efficiency when
method is optimally tuned.
Thank you for listening!
Funding provided by the National Science Foundation,
grants DMS-1712872 and DMS-1844695.

References
Practical couplings in the literature. . .
Propp & Wilson, Exact sampling with coupled Markov chains
and applications to statistical mechanics, Random Structures &
Algorithms, 1996.
Johnson, Studying convergence of Markov chain Monte Carlo
algorithms using coupled sample paths, JASA, 1996.
Neal, Circularly-coupled Markov chain sampling, UoT tech
report, 1999.
Glynn & Rhee, Exact estimation for Markov chain equilibrium
expectations, Journal of Applied Probability, 2014.
Agapiou, Roberts & Vollmer, Unbiased Monte Carlo: posterior
estimation for intractable/inﬁnite-dimensional models, Bernoulli,
2018.

References
Finite-time bias of MCMC. . .
Brooks & Roberts, Assessing convergence of Markov chain
Monte Carlo algorithms, STCO, 1998.
Cowles & Rosenthal, A simulation approach to convergence rates
for Markov chain Monte Carlo algorithms, STCO, 1998.
Johnson, Studying convergence of Markov chain Monte Carlo
algorithms using coupled sample paths, JASA, 1996.
Gorham, Duncan, Vollmer & Mackey, Measuring Sample Quality
with Diﬀusions, AAP, 2019.

References
Own work. . .
with John O’Leary, Yves F. Atchad´e
Unbiased Markov chain Monte Carlo with couplings, 2019.
with Fredrik Lindsten, Thomas Sch¨on
Smoothing with Couplings of Conditional Particle Filters, 2019.
with Jeremy Heng
Unbiased Hamiltonian Monte Carlo with couplings, 2019.
with Lawrence Middleton, George Deligiannidis, Arnaud
Doucet
Unbiased Markov chain Monte Carlo for intractable target
distributions, 2019.
Unbiased Smoothing using Particle Independent
Metropolis-Hastings, 2019.

References
with Maxime Rischard, Natesh Pillai
Unbiased estimation of log normalizing constants with
applications to Bayesian cross-validation.
with Niloy Biswas, Paul Vanetti
Estimating Convergence of Markov chains with L-Lag Couplings,
2019.
with Chris Holmes, Lawrence Murray, Christian Robert,
George Nicholson
Better together? Statistical learning in models made of modules.

Monte Carlo methods for some not-quite-but-almost Bayesian problems

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Monte Carlo methods for some not-quite-but-almost Bayesian problems

Similar to Monte Carlo methods for some not-quite-but-almost Bayesian problems (20)

More from Pierre Jacob

More from Pierre Jacob (11)

Recently uploaded

Recently uploaded (20)

Monte Carlo methods for some not-quite-but-almost Bayesian problems