Multiple estimators for Monte Carlo
approximations
Christian P. Robert
Université Paris-Dauphine, Paris & University of Warwick, Coventry
with M. Banterle, G. Casella, R. Douc, V. Elvira, J.M. Marin
Outline
1 Introduction
2 Multiple Importance Sampling
3 Rao-Blackwellisation
4 Vanilla Rao–Blackwellisation
5 Delayed acceptance
6 Combination of estimators
Monte Carlo problems
Approximation of an integral
$$I = \int_{\mathcal X} H(x)\,\mathrm{d}x$$
based on pseudo-random simulations
$$x_1, \ldots, x_T \overset{\text{i.i.d.}}{\sim} f(x)$$
and expectation representation
$$I = \int_{\mathcal X} h(x)\,f(x)\,\mathrm{d}x = \mathbb{E}_f[h(X)]$$
[Casella and Robert, 1996]
Multiple estimators
Even with a single sample, several representations available
• path sampling
• safe harmonic means
• mixtures like AMIS (aka IMIS, MAMIS)
• Rao-Blackwellisations
• recycling schemes
• quasi-Monte Carlo
• measure averaging (NPMLE) `a la Kong et al. (2003)
depending on how black the black-box mechanism is
[Chen et al., 2000]
An outlier
Given
$$X_{ij} \sim f_i(x) = c_i\,g_i(x)$$
with g_i known and c_i unknown (i = 1, ..., k, j = 1, ..., n_i),
constants c_i estimated by a “reverse logistic regression” based on
the quasi-likelihood
$$L(\eta) = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \log p_i(x_{ij}, \eta)$$
with
$$p_i(x, \eta) = \frac{\exp\{\eta_i\}\,g_i(x)}{\sum_{i=1}^{k} \exp\{\eta_i\}\,g_i(x)}$$
[Anderson, 1972; Geyer, 1992]
Approximation
$$\log \hat c_i = \log n_i/n - \hat\eta_i$$
A statistical framework
Existence of a central limit theorem:
$$\sqrt{n}\,(\hat\eta_n - \eta) \overset{\mathcal L}{\longrightarrow} \mathcal N_k\!\left(0,\, B^+ A B^+\right)$$
[Geyer, 1992; Doss & Tan, 2015]
• strong convergence properties
• asymptotic approximation of the precision
• connection with bridge sampling and auxiliary model [mixture]
• ...but nothing statistical there [no estimation or data]
[Kong et al., 2003; Chopin & Robert, 2011]
Evidence from a posterior sample
Harmonic identity
$$\mathbb E\!\left[\frac{\varphi(X)}{h(X)\,f(X)}\right] = \int \frac{\varphi(x)}{h(x)\,f(x)}\;\frac{h(x)\,f(x)}{I}\,\mathrm dx = \frac{1}{I}$$
(expectation taken under the normalised density h(x)f(x)/I), no matter what the proposal density ϕ(x) is
[Gelfand & Dey, 1994; Bartolucci et al., 2006]
Harmonic mean estimator
Constraint opposed to usual importance sampling constraints:
ϕ(x) must have lighter (rather than fatter) tails than h(x)g(x) for
the approximation
$$\hat I = 1 \bigg/ \frac{1}{T} \sum_{t=1}^{T} \frac{\varphi(x^{(t)})}{h(x^{(t)})\,g(x^{(t)})}$$
to have a finite variance [safe harmonic mean]
[Robert & Wraith, 2012]
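As a concrete illustration, here is a minimal Python sketch of the safe harmonic mean on a toy one-dimensional target whose normalising constant is known, so the output can be checked; the HPD-style uniform proposal and all settings are illustrative choices, not those of Robert & Wraith (2012).

```python
# Minimal sketch of the safe harmonic mean on a toy 1-D target with known
# normalising constant I (so the output can be checked). h(x)g(x) stands for
# the unnormalised posterior; phi is a uniform over a central (HPD-like)
# interval, hence with lighter tails than h*g. All settings are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

I_true = 2.0                                     # true normalising constant
unnorm = lambda x: I_true * stats.norm.pdf(x)    # h(x)g(x), integrates to I_true

x = rng.normal(size=10_000)                      # draws from h*g / I (here N(0,1))

a, b = np.quantile(x, [0.1, 0.9])                # central interval for phi
phi = lambda t: ((t >= a) & (t <= b)) / (b - a)  # normalised, lighter tails

I_hat = 1.0 / np.mean(phi(x) / unnorm(x))        # safe harmonic mean estimate
print(I_hat)                                     # close to 2.0
```

Because ϕ vanishes outside [a, b], the ratio ϕ/(hg) is bounded, which is what keeps the variance finite.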
“The Worst Monte Carlo Method Ever”
“The good news is that the Law of Large Numbers guarantees that this
estimator is consistent ie, it will very likely be very close to the correct
answer if you use a sufficiently large number of points from the posterior
distribution.
The bad news is that the number of points required for this estimator to
get close to the right answer will often be greater than the number of
atoms in the observable universe. The even worse news is that it’s easy
for people to not realize this, and to naïvely accept estimates that are
nowhere close to the correct value of the marginal likelihood.”
[Radford Neal’s blog, Aug. 23, 2008]
Mixtures as proposals
Design specific mixture for simulation purposes, with density
$$\tilde\varphi(x) \propto \omega_1\,h(x)\,g(x) + \varphi(x)\,,$$
where ϕ(x) is arbitrary (but normalised)
Note: ω₁ is not a probability weight
[Chopin & Robert, 2011]
Multiple estimators
1 Introduction
2 Multiple Importance Sampling
3 Rao-Blackwellisation
4 Vanilla Rao–Blackwellisation
5 Delayed acceptance
6 Combination of estimators
Multiple Importance Sampling
Recycling: given several proposals Q₁, ..., Q_T, for 1 ≤ t ≤ T
generate an iid sample
$$x_1^t, \ldots, x_N^t \sim Q_t$$
and estimate I by
$$\hat I_{Q,N}(h) = T^{-1} \sum_{t=1}^{T} N^{-1} \sum_{i=1}^{N} h(x_i^t)\,\omega_i^t$$
where
$$\omega_i^t = \frac{f(x_i^t)}{q_t(x_i^t)}$$
correct...
Multiple Importance Sampling
Recycling: given several proposals Q₁, ..., Q_T, for 1 ≤ t ≤ T
generate an iid sample
$$x_1^t, \ldots, x_N^t \sim Q_t$$
and estimate I by
$$\hat I_{Q,N}(h) = T^{-1} \sum_{t=1}^{T} N^{-1} \sum_{i=1}^{N} h(x_i^t)\,\omega_i^t$$
where
$$\omega_i^t = \frac{f(x_i^t)}{T^{-1} \sum_{\ell=1}^{T} q_\ell(x_i^t)}$$
still correct!
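A minimal sketch of the two weightings on the same draws, with a toy N(0,1) target and three Gaussian proposals (all settings illustrative): the first estimator uses each proposal's own density in the denominator, the second the deterministic mixture of all of them.

```python
# Sketch contrasting the two weightings on the same draws: individual weights
# f(x)/q_t(x) versus the deterministic-mixture weights f(x)/(T^{-1} sum_l q_l(x)),
# on a toy N(0,1) target with three Gaussian proposals (all settings illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
f = stats.norm(0, 1)                      # target density
mus = [-3.0, 0.0, 3.0]                    # proposal means, T = 3
N = 2_000
h = lambda x: x**2                        # integrand, E_f[h(X)] = 1

est_single, est_mix = [], []
for mu in mus:
    x = rng.normal(mu, 1.0, size=N)                            # x ~ Q_t
    w_single = f.pdf(x) / stats.norm(mu, 1).pdf(x)             # own denominator
    q_mix = np.mean([stats.norm(m, 1).pdf(x) for m in mus], axis=0)
    w_mix = f.pdf(x) / q_mix                                   # mixture denominator
    est_single.append(np.mean(h(x) * w_single))
    est_mix.append(np.mean(h(x) * w_mix))

print(np.mean(est_single), np.mean(est_mix))   # both near 1, the latter far more stable
```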
Mixture representation
Deterministic mixture correction of the weights proposed by Owen
and Zhou (JASA, 2000)
• The corresponding estimator is still unbiased [if not
self-normalised]
• All particles are on the same weighting scale rather than their
own
• Large variance proposals Qt do not take over
• Variance reduction thanks to weight stabilization & recycling
• [k.o.] removes the randomness in the component choice
[=Rao-Blackwellisation]
Global adaptation
AMIS algorithm
At iteration t = 1, ..., T,
1 For 1 ≤ i ≤ N₁, generate
$$x_i^t \sim T_3(\hat\mu^{t-1}, \hat\Sigma^{t-1})$$
2 Calculate the mixture importance weight of particle x_i^t
$$\omega_i^t = f(x_i^t)\big/\delta_i^t \qquad\text{where}\qquad \delta_i^t = \sum_{l=0}^{t-1} q_{T(3)}\!\left(x_i^t; \hat\mu^{l}, \hat\Sigma^{l}\right)$$
Backward reweighting
3 If t ≥ 2, actualize the weights of all past particles x_i^l, 1 ≤ l ≤ t − 1,
$$\omega_i^l = f(x_i^l)\big/\delta_i^l \qquad\text{where}\qquad \delta_i^l \leftarrow \delta_i^l + q_{T(3)}\!\left(x_i^l; \hat\mu^{t-1}, \hat\Sigma^{t-1}\right)$$
4 Compute IS estimates of target mean and variance, μ̂^t and Σ̂^t, where
$$\hat\mu_j^t = \sum_{l=1}^{t} \sum_{i=1}^{N_1} \omega_i^l\,(x_j)_i^l \bigg/ \sum_{l=1}^{t} \sum_{i=1}^{N_1} \omega_i^l\,, \quad \ldots$$
[Cornuet et al., 2012; Marin et al., 2017]
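The following is a minimal numpy sketch of these four steps on a two-dimensional Gaussian target, with illustrative settings (N₁ draws per stage, T stages, t₃ proposals) and a hand-coded Student-t density; it is a sketch of the idea, not the reference implementation of Cornuet et al. (2012) or Marin et al. (2017).

```python
# Minimal numpy sketch of the AMIS steps above on a 2-D Gaussian target, with
# illustrative settings (N1 draws per stage, T stages, t_3 proposals) and a
# hand-coded Student-t density; not the reference code of Cornuet et al. (2012).
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
d, N1, T, df = 2, 500, 10, 3.0

def target(x):                       # unnormalised target: N((1, -1), I)
    return np.exp(-0.5 * np.sum((x - np.array([1.0, -1.0]))**2, axis=1))

def t_pdf(x, mu, Sigma, df):         # multivariate Student-t density
    dlt = x - mu
    L = np.linalg.cholesky(Sigma)
    quad = np.sum(np.linalg.solve(L, dlt.T)**2, axis=0)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    logc = gammaln((df + d) / 2) - gammaln(df / 2) \
           - 0.5 * (d * np.log(df * np.pi) + logdet)
    return np.exp(logc - 0.5 * (df + d) * np.log1p(quad / df))

def t_rvs(mu, Sigma, df, n):         # draws from the multivariate Student-t
    g = rng.chisquare(df, size=n) / df
    z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    return mu + z / np.sqrt(g)[:, None]

mu, Sigma = np.zeros(d), np.eye(d)   # initial proposal parameters
xs, deltas, params = [], [], []

for t in range(T):
    params.append((mu.copy(), Sigma.copy()))
    x = t_rvs(mu, Sigma, df, N1)                                 # step 1
    deltas.append(sum(t_pdf(x, m, S, df) for m, S in params))    # step 2
    xs.append(x)
    for l in range(t):                                           # step 3: backward reweighting
        deltas[l] = deltas[l] + t_pdf(xs[l], mu, Sigma, df)
    X = np.vstack(xs)                                            # step 4: weighted moments
    W = target(X) / np.concatenate(deltas)
    W = W / W.sum()
    mu = W @ X
    Sigma = (X - mu).T @ ((X - mu) * W[:, None]) + 1e-6 * np.eye(d)

print(mu)   # close to the target mean (1, -1)
```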
NPMLE for Monte Carlo
“The task of estimating an integral by Monte Carlo
methods is formulated as a statistical model using
simulated observations as data.
The difficulty in this exercise is that we ordinarily have at
our disposal all of the information required to compute
integrals exactly by calculus or numerical integration, but
we choose to ignore some of the information for
simplicity or computational feasibility.”
[Kong, McCullagh, Meng, Nicolae & Tan, 2003]
NPMLE for Monte Carlo
“Our proposal is to use a semiparametric statistical
model that makes explicit what information is ignored
and what information is retained. The parameter space in
this model is a set of measures on the sample space,
which is ordinarily an infinite dimensional object.
None-the-less, from simulated data the base-line measure
can be estimated by maximum likelihood, and the
required integrals computed by a simple formula
previously derived by Geyer and by Meng and Wong using
entirely different arguments.”
[Kong, McCullagh, Meng, Nicolae & Tan, 2003]
NPMLE for Monte Carlo
“By contrast with Geyer’s retrospective likelihood, a
correct estimate of simulation error is available directly
from the Fisher information. The principal advantage of
the semiparametric model is that variance reduction
techniques are associated with submodels in which the
maximum likelihood estimator in the submodel may have
substantially smaller variance than the traditional
estimator.”
[Kong, McCullagh, Meng, Nicolae & Tan, 2003]
NPMLE for Monte Carlo
“At first glance, the problem appears to be an exercise in calculus
or numerical analysis, and not amenable to statistical formulation”
• use of Fisher information
• non-parametric MLE based on simulations
• comparison of sampling schemes through variances
• Rao–Blackwellised improvements by invariance constraints
NPMLE for Monte Carlo
Observing
$$Y_{ij} \sim F_i(t) = c_i^{-1} \int_{-\infty}^{t} \omega_i(x)\,\mathrm dF(x)$$
with ω_i known and F unknown
“Maximum likelihood estimate” defined by weighted empirical cdf
$$\sum_{i,j} \omega_i(y_{ij})\,p(y_{ij})\,\delta_{y_{ij}}$$
maximising in p
$$\prod_{ij} c_i^{-1}\,\omega_i(y_{ij})\,p(y_{ij})$$
Result such that
$$\sum_{ij} \frac{\hat c_r^{-1}\,\omega_r(y_{ij})}{\sum_s n_s\,\hat c_s^{-1}\,\omega_s(y_{ij})} = 1$$
[Vardi, 1985]
Bridge sampling estimator
$$\sum_{ij} \frac{\hat c_r^{-1}\,\omega_r(y_{ij})}{\sum_s n_s\,\hat c_s^{-1}\,\omega_s(y_{ij})} = 1$$
[Gelman & Meng, 1998; Tan, 2004]
end of the 2002 discussion
“...essentially every Monte Carlo activity may be interpreted as
parameter estimation by maximum likelihood in a statistical model.
We do not claim that this point of view is necessary; nor do we
seek to establish a working principle from it.”
• restriction to discrete support measures [may be] suboptimal
[Ritov & Bickel, 1990; Robins et al., 1997, 2000, 2003]
• group averaging versions in-between multiple mixture estimators and quasi-Monte Carlo versions
[Veach & Guibas, 1995; Owen & Zhou, 2000; Cornuet et al., 2012; Owen, 2003]
• statistical analogy provides at best a narrative thread
end of the 2002 discussion
“The hard part of the exercise is to construct a submodel such
that the gain in precision is sufficient to justify the additional
computational effort”
• garden of forking paths, with infinite possibilities
• no free lunch (variance, budget, time)
• Rao–Blackwellisation may be detrimental in Markov setups
Multiple estimators
1 Introduction
2 Multiple Importance Sampling
3 Rao-Blackwellisation
4 Vanilla Rao–Blackwellisation
5 Delayed acceptance
6 Combination of estimators
Accept-Reject
Given a density f(·) to simulate, take a density g(·) such that
$$f(x) \le M\,g(x)$$
for M ≥ 1
To simulate X ∼ f, it is sufficient to generate
$$Y \sim g\,, \qquad U \mid Y = y \sim \mathcal U(0, M g(y))$$
until
$$0 < u < f(y)$$
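A minimal sketch of the scheme with a Beta(2, 4) target and a uniform proposal (an illustrative choice, not taken from the slides):

```python
# Minimal accept-reject sketch: target f = Beta(2, 4), proposal g = U(0, 1),
# envelope constant M = sup_x f(x)/g(x) = f(1/4). Illustrative choice of target.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
f = stats.beta(2, 4).pdf
M = f(0.25)                              # Beta(2, 4) mode is at x = 1/4

def accept_reject(n):
    out = []
    while len(out) < n:
        y = rng.uniform()                # Y ~ g
        u = rng.uniform(0, M)            # U | Y = y ~ U(0, M g(y)), with g(y) = 1
        if u < f(y):                     # accept when 0 < u < f(y)
            out.append(y)
    return np.array(out)

x = accept_reject(10_000)
print(x.mean())                          # close to E[Beta(2, 4)] = 1/3
```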
Much ado about...
[Exercise 3.33, MCSM]
Raw outcome: iid sequences Y₁, Y₂, ..., Yₜ ∼ g and U₁, U₂, ..., Uₜ ∼ U(0, 1)
Random number of accepted Y_i's
$$\mathbb P(N = n) = \binom{n-1}{t-1}\,(1/M)^{t}\,(1 - 1/M)^{n-t}\,,$$
Much ado about...
[Exercise 3.33, MCSM]
Raw outcome: iid sequences Y₁, Y₂, ..., Yₜ ∼ g and U₁, U₂, ..., Uₜ ∼ U(0, 1)
Joint density of (N, Y, U)
$$\mathbb P(N = n, Y_1 \le y_1, \ldots, Y_n \le y_n, U_1 \le u_1, \ldots, U_n \le u_n)$$
$$= \int_{-\infty}^{y_n} g(t_n)\,(u_n \wedge w_n)\,\mathrm dt_n \int_{-\infty}^{y_1}\!\!\cdots\!\int_{-\infty}^{y_{n-1}} g(t_1) \cdots g(t_{n-1}) \times \sum_{(i_1,\cdots,i_{t-1})} \prod_{j=1}^{t-1} (w_{i_j} \wedge u_{i_j}) \prod_{j=t}^{n-1} (u_{i_j} - w_{i_j})^+\,\mathrm dt_1 \cdots \mathrm dt_{n-1},$$
where w_i = f(y_i)/Mg(y_i) and the sum is over all subsets of
{1, ..., n − 1} of size t − 1
Much ado about...
[Exercise 3.33, MCSM]
Raw outcome: iid sequences Y₁, Y₂, ..., Yₜ ∼ g and U₁, U₂, ..., Uₜ ∼ U(0, 1)
Marginal joint density of (Y_i, U_i) | N = n, i < n
$$\mathbb P(N = n, Y_1 \le y, U_1 \le u_1) = \binom{n-1}{t-1}\left(\frac 1 M\right)^{t-1}\left(1 - \frac 1 M\right)^{n-t-1} \times \left[\frac{t-1}{n-1}\,(w_1 \wedge u_1)\left(1 - \frac 1 M\right) + \frac{n-t}{n-1}\,(u_1 - w_1)^+\,\frac 1 M\right] \int_{-\infty}^{y} g(t_1)\,\mathrm dt_1$$
and marginal distribution of Y_i
$$m(y) = \frac{t-1}{n-1}\,f(y) + \frac{n-t}{n-1}\,\frac{g(y) - \rho f(y)}{1 - \rho}$$
$$\mathbb P(U_1 \le w(y) \mid Y_1 = y, N = n) = \frac{g(y)\,w(y)\,M\,(t-1)/(n-1)}{m(y)}$$
Much ado about noise
Accept-reject sample (X₁, ..., X_m) associated with (U₁, ..., U_N) and (Y₁, ..., Y_N)
N is stopping time for acceptance of m variables among the Y_j's
Rewrite estimator of E[h] as
$$\frac 1 m \sum_{i=1}^{m} h(X_i) = \frac 1 m \sum_{j=1}^{N} h(Y_j)\,\mathbb I_{U_j \le w_j}\,,$$
with w_j = f(Y_j)/Mg(Y_j)
[Casella and Robert (1996)]
Much ado about noise
Rao-Blackwellisation: smaller variance produced by integrating out the U_i's,
$$\frac 1 m \sum_{j=1}^{N} \mathbb E\big[\mathbb I_{U_j \le w_j} \mid N, Y_1, \ldots, Y_N\big]\,h(Y_j) = \frac 1 m \sum_{i=1}^{N} \rho_i\,h(Y_i)\,,$$
where (i < n)
$$\rho_i = \mathbb P(U_i \le w_i \mid N = n, Y_1, \ldots, Y_n) = w_i\,\frac{\displaystyle\sum_{(i_1,\ldots,i_{m-2})} \prod_{j=1}^{m-2} w_{i_j} \prod_{j=m-1}^{n-2} (1 - w_{i_j})}{\displaystyle\sum_{(i_1,\ldots,i_{m-1})} \prod_{j=1}^{m-1} w_{i_j} \prod_{j=m}^{n-1} (1 - w_{i_j})}\,,$$
and ρ_n = 1.
Numerator sum over all subsets of {1, ..., i − 1, i + 1, ..., n − 1} of size m − 2, and denominator sum over all subsets of size m − 1
[Casella and Robert (1996)]
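The subset sums above are Poisson-binomial quantities: the sum over size-k subsets of ∏ w_j ∏ (1 − w_j) is the coefficient of z^k in ∏_j [(1 − w_j) + w_j z]. The sketch below computes the ρ_i that way, reading the denominator as ranging over size-(m − 1) subsets of {1, ..., n − 1} (an interpretation of the implicit index set, consistent with the product indices above); the toy check uses the fact that the ρ_i must then sum to m.

```python
# Sketch of the Rao-Blackwellised accept-reject weights rho_i. The subset sums
# are Poisson-binomial probabilities: the sum over size-k subsets of
# prod_S w_j * prod_{S^c} (1 - w_j) is the coefficient of z^k in
# prod_j [(1 - w_j) + w_j z]. The denominator is read as summing over
# size-(m-1) subsets of {1, ..., n-1}.
import numpy as np

def poisson_binomial_pmf(w):
    """Coefficients of prod_j [(1 - w_j) + w_j z]: entry k = P(k acceptances)."""
    pmf = np.array([1.0])
    for wj in w:
        pmf = np.convolve(pmf, [1.0 - wj, wj])
    return pmf

def rb_weights(w, m):
    """w = (w_1, ..., w_n) with w_j = f(Y_j)/M g(Y_j); m = number accepted."""
    n = len(w)
    rho = np.empty(n)
    denom = poisson_binomial_pmf(w[: n - 1])[m - 1]   # m-1 accepts among first n-1
    for i in range(n - 1):
        others = np.delete(w[: n - 1], i)
        num = poisson_binomial_pmf(others)[m - 2]     # m-2 accepts among the others
        rho[i] = w[i] * num / denom
    rho[n - 1] = 1.0                                  # the last draw is the m-th accept
    return rho

rng = np.random.default_rng(4)
w = rng.uniform(0.1, 0.9, size=12)
rho = rb_weights(w, m=4)
print(rho.sum())   # equals m = 4: given N = n, exactly m-1 of the first n-1 are accepted
```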
extension to Metropolis–Hastings case
Sample produced by Metropolis–Hastings algorithm
$$x^{(1)}, \ldots, x^{(T)}$$
based on two samples, y₁, ..., y_T (proposals) and u₁, ..., u_T (uniforms)
Ergodic mean rewritten as
$$\delta^{\mathrm{MH}} = \frac 1 T \sum_{t=1}^{T} h(x^{(t)}) = \frac 1 T \sum_{t=1}^{T} h(y_t) \sum_{i=t}^{T} \mathbb I_{x^{(i)} = y_t}$$
Conditional expectation
$$\delta^{\mathrm{RB}} = \frac 1 T \sum_{t=1}^{T} h(y_t)\,\mathbb E\!\left[\sum_{i=t}^{T} \mathbb I_{X^{(i)} = y_t} \,\Big|\, y_1, \ldots, y_T\right] = \frac 1 T \sum_{t=1}^{T} h(y_t) \sum_{i=t}^{T} \mathbb P\big(X^{(i)} = y_t \mid y_1, \ldots, y_T\big)$$
with smaller variance
[Casella and Robert (1996)]
weight derivation
Take
$$\rho_{ij} = \frac{f(y_j)/q(y_j|y_i)}{f(y_i)/q(y_i|y_j)} \wedge 1 \quad (j > i),$$
$$\tilde\rho_{ij} = \rho_{ij}\,q(y_{j+1}|y_j)\,, \qquad \bar\rho_{ij} = (1 - \rho_{ij})\,q(y_{j+1}|y_i) \quad (i < j < T),$$
$$\zeta_{jj} = 1\,, \qquad \zeta_{jt} = \prod_{l=j+1}^{t} \bar\rho_{jl} \quad (i < j < T),$$
$$\tau_0 = 1\,, \qquad \tau_j = \sum_{t=0}^{j-1} \tau_t\,\zeta_{t(j-1)}\,\tilde\rho_{tj}\,, \qquad \tau_T = \sum_{t=0}^{T-1} \tau_t\,\zeta_{t(T-1)}\,\rho_{tT} \quad (i < T),$$
$$\omega_T^i = 1\,, \qquad \omega_i^j = \tilde\rho_{ji}\,\omega_{i+1}^i + \bar\rho_{ji}\,\omega_{i+1}^j \quad (0 \le j < i < T).$$
[Casella and Robert (1996)]
weight derivation
Theorem
The estimator δ^RB satisfies
$$\delta^{\mathrm{RB}} = \frac{\sum_{i=0}^{T} \varphi_i\,h(y_i)}{\sum_{i=0}^{T-1} \tau_i\,\zeta_{i(T-1)}}\,,$$
with (i < T)
$$\varphi_i = \tau_i \left[\sum_{j=i}^{T-1} \zeta_{ij}\,\omega_{j+1}^i + \zeta_{i(T-1)}\,(1 - \rho_{iT})\right]$$
and ϕ_T = τ_T.
[Casella and Robert (1996)]
Multiple estimators
1 Introduction
2 Multiple Importance Sampling
3 Rao-Blackwellisation
4 Vanilla Rao–Blackwellisation
5 Delayed acceptance
6 Combination of estimators
Some properties of the Metropolis–Hastings algorithm
Alternative representation of Metropolis–Hastings estimator δ as
$$\delta = \frac 1 n \sum_{t=1}^{n} h(x^{(t)}) = \frac 1 n \sum_{i=1}^{M_n} n_i\,h(z_i)\,,$$
where
• {x^(t)}_t original Markov (MCMC) chain with stationary distribution π(·) and transition q(·|·)
• z_i's are the accepted y_j's
• M_n is the number of accepted y_j's till time n
• n_i is the number of times z_i appears in the sequence {x^(t)}_t
The ”accepted candidates”
Define
$$\tilde q(\cdot|z_i) = \frac{\alpha(z_i, \cdot)\,q(\cdot|z_i)}{p(z_i)} \le \frac{q(\cdot|z_i)}{p(z_i)}$$
where p(z_i) = ∫ α(z_i, y) q(y|z_i) dy
To simulate from q̃(·|z_i)
1 Propose a candidate y ∼ q(·|z_i)
2 Accept with probability
$$\frac{\tilde q(y|z_i)}{q(y|z_i)/p(z_i)} = \alpha(z_i, y)$$
Otherwise, reject it and start again.
this is the transition of the Metropolis–Hastings algorithm
The ”accepted candidates”
Define
$$\tilde q(\cdot|z_i) = \frac{\alpha(z_i, \cdot)\,q(\cdot|z_i)}{p(z_i)} \le \frac{q(\cdot|z_i)}{p(z_i)}$$
where p(z_i) = ∫ α(z_i, y) q(y|z_i) dy
The transition kernel q̃ admits π̃ as a stationary distribution:
$$\tilde\pi(x)\,\tilde q(y|x) = \underbrace{\frac{\pi(x)\,p(x)}{\int \pi(u)\,p(u)\,\mathrm du}}_{\tilde\pi(x)}\;\underbrace{\frac{\alpha(x, y)\,q(y|x)}{p(x)}}_{\tilde q(y|x)} = \frac{\pi(x)\,\alpha(x, y)\,q(y|x)}{\int \pi(u)\,p(u)\,\mathrm du} = \frac{\pi(y)\,\alpha(y, x)\,q(x|y)}{\int \pi(u)\,p(u)\,\mathrm du} = \tilde\pi(y)\,\tilde q(x|y)\,.$$
The ”accepted chain”
Lemma (Douc & X., AoS, 2011)
The sequence (z_i, n_i) satisfies
1 (z_i, n_i)_i is a Markov chain;
2 z_{i+1} and n_i are independent given z_i;
3 n_i is distributed as a geometric random variable with probability parameter
$$p(z_i) := \int \alpha(z_i, y)\,q(y|z_i)\,\mathrm dy\,; \qquad (1)$$
4 (z_i)_i is a Markov chain with transition kernel Q̃(z, dy) = q̃(y|z)dy and stationary distribution π̃ such that
$$\tilde q(\cdot|z) \propto \alpha(z, \cdot)\,q(\cdot|z) \quad\text{and}\quad \tilde\pi(\cdot) \propto \pi(\cdot)\,p(\cdot)\,.$$
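A minimal sketch of this decomposition: run a random-walk Metropolis–Hastings chain on a toy N(0,1) target, split it into accepted values z_i and occupation counts n_i, and check that the two representations of δ coincide (all settings illustrative).

```python
# Sketch: run a random-walk Metropolis-Hastings chain on a toy N(0,1) target,
# split it into accepted values z_i with occupation counts n_i, and check that
# (1/n) sum_t h(x^(t)) equals (1/n) sum_i n_i h(z_i). Settings are illustrative.
import numpy as np

rng = np.random.default_rng(5)
log_pi = lambda x: -0.5 * x**2            # unnormalised N(0, 1) target
n, x = 50_000, 0.0
chain = np.empty(n)

for t in range(n):
    y = x + rng.normal(scale=2.0)         # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
        x = y                             # accept
    chain[t] = x

change = np.r_[True, chain[1:] != chain[:-1]]        # start of each z_i block
z = chain[change]                                    # accepted values z_i
counts = np.diff(np.r_[np.flatnonzero(change), n])   # occupation counts n_i

h = lambda x: x**2
print(np.mean(h(chain)), np.sum(counts * h(z)) / n)  # the two values coincide
```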
Importance sampling perspective
1 A natural idea:
$$\delta^* = \frac 1 n \sum_{i=1}^{M_n} \frac{h(z_i)}{p(z_i)}\,,$$
or, in self-normalised form,
$$\delta^* = \frac{\sum_{i=1}^{M_n} h(z_i)/p(z_i)}{\sum_{i=1}^{M_n} 1/p(z_i)} = \frac{\sum_{i=1}^{M_n} \frac{\pi(z_i)}{\tilde\pi(z_i)}\,h(z_i)}{\sum_{i=1}^{M_n} \frac{\pi(z_i)}{\tilde\pi(z_i)}}\,.$$
2 But p not available in closed form.
3 The geometric n_i is the replacement, an obvious solution that is used in the original Metropolis–Hastings estimate since E[n_i] = 1/p(z_i).
The Bernoulli factory
The crude estimate of 1/p(z_i),
$$n_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \mathbb I\{u_\ell \ge \alpha(z_i, y_\ell)\}\,,$$
can be improved:
Lemma (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i), the quantity
$$\hat\xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \{1 - \alpha(z_i, y_\ell)\}$$
is an unbiased estimator of 1/p(z_i) whose variance, conditional on z_i, is lower than the conditional variance of n_i, {1 − p(z_i)}/p²(z_i).
Rao-Blackwellised, for sure?
$$\hat\xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \{1 - \alpha(z_i, y_\ell)\}$$
1 Infinite sum but finite with at least positive probability:
$$\alpha(x^{(t)}, y_t) = \min\left\{1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\,\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\right\}$$
For example: take a symmetric random walk as a proposal.
2 What if we wish to be sure that the sum is finite?
Finite horizon k version:
$$\hat\xi_i^k = 1 + \sum_{j=1}^{\infty} \prod_{1 \le \ell \le k \wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{k+1 \le \ell \le j} \mathbb I\{u_\ell \ge \alpha(z_i, y_\ell)\}$$
Variance improvement
Proposition (Douc & X., AoS, 2011)
If (y_j)_j is an iid sequence with distribution q(y|z_i) and (u_j)_j is an iid uniform sequence, for any k ≥ 0, the quantity
$$\hat\xi_i^k = 1 + \sum_{j=1}^{\infty} \prod_{1 \le \ell \le k \wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{k+1 \le \ell \le j} \mathbb I\{u_\ell \ge \alpha(z_i, y_\ell)\}$$
is an unbiased estimator of 1/p(z_i) with an almost surely finite number of terms. Moreover, for k ≥ 1,
$$\mathbb V\big[\hat\xi_i^k \mid z_i\big] = \frac{1 - p(z_i)}{p^2(z_i)} - \frac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)}\;\frac{2 - p(z_i)}{p^2(z_i)}\;\big(p(z_i) - r(z_i)\big)\,,$$
where p(z_i) := ∫ α(z_i, y) q(y|z_i) dy and r(z_i) := ∫ α²(z_i, y) q(y|z_i) dy. Therefore, we have
$$\mathbb V\big[\hat\xi_i \mid z_i\big] \le \mathbb V\big[\hat\xi_i^k \mid z_i\big] \le \mathbb V\big[\hat\xi_i^0 \mid z_i\big] = \mathbb V\big[n_i \mid z_i\big]\,.$$
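A minimal sketch of this variance ordering for a fixed current value z on a toy N(0,1) target with a symmetric random-walk proposal (z, the proposal scale and k are illustrative): both the geometric count n_i and the finite-horizon ξ̂ᵏ_i average to 1/p(z), with the latter showing the smaller spread.

```python
# Sketch comparing, for a fixed value z, the geometric count n_i and the finite
# horizon estimator xi_i^k as estimators of 1/p(z), on a toy N(0,1) target with
# a symmetric random-walk proposal. z, the proposal scale and k are illustrative.
import numpy as np

rng = np.random.default_rng(6)
log_pi = lambda x: -0.5 * x**2
z, scale, k = 1.5, 3.0, 10

def alpha(z, y):                          # MH acceptance probability (symmetric q)
    return min(1.0, np.exp(log_pi(y) - log_pi(z)))

def geometric_count(z):                   # n_i: proposals until one is accepted
    count = 1
    while rng.uniform() >= alpha(z, z + rng.normal(scale=scale)):
        count += 1
    return count

def xi_k(z, k):                           # finite-horizon Rao-Blackwellised estimate
    total, prod, j = 1.0, 1.0, 0
    while prod > 0.0:
        j += 1
        a = alpha(z, z + rng.normal(scale=scale))
        if j <= k:
            prod *= 1.0 - a                       # uniforms integrated out
        else:
            prod *= float(rng.uniform() >= a)     # indicators kept beyond k
        total += prod
    return total

R = 20_000
n_draws = np.array([geometric_count(z) for _ in range(R)])
xi_draws = np.array([xi_k(z, k) for _ in range(R)])
print(n_draws.mean(), xi_draws.mean())    # both estimate 1/p(z)
print(n_draws.var(), xi_draws.var())      # the finite-horizon version has smaller variance
```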
Multiple estimators
1 Introduction
2 Multiple Importance Sampling
3 Rao-Blackwellisation
4 Vanilla Rao–Blackwellisation
5 Delayed acceptance
6 Combination of estimators
Non-informative inference for mixture models
Standard mixture of distributions model
$$\sum_{i=1}^{k} w_i\,f(x|\theta_i)\,, \qquad\text{with}\quad \sum_{i=1}^{k} w_i = 1\,. \qquad (1)$$
[Titterington et al., 1985; Frühwirth-Schnatter, 2006]
Jeffreys' prior for mixtures not available, for computational reasons: it has not been tested so far
[Jeffreys, 1939]
Warning: Jeffreys' prior improper in some settings
[Grazian & Robert, 2017]
Non-informative inference for mixture models
Grazian & Robert (2017) consider genuine Jeffreys' prior for the complete set of parameters in (1), deduced from Fisher's information matrix
Computation of prior density costly, relying on many integrals like
$$\int_{\mathcal X} \frac{\partial^2 \log\left\{\sum_{i=1}^{k} w_i\,f(x|\theta_i)\right\}}{\partial\theta_h\,\partial\theta_j}\;\sum_{i=1}^{k} w_i\,f(x|\theta_i)\,\mathrm dx$$
Integrals with no analytical expression, hence involving numerical or Monte Carlo (costly) integration
Non-informative inference for mixture models
When building Metropolis–Hastings proposal over the (w_i, θ_i)'s, prior ratio more expensive than likelihood and proposal ratios
Suggestion: split the acceptance rule
$$\alpha(x, y) := 1 \wedge r(x, y)\,, \qquad r(x, y) := \frac{\pi(y|\mathcal D)\,q(y, x)}{\pi(x|\mathcal D)\,q(x, y)}$$
into
$$\tilde\alpha(x, y) := \left\{1 \wedge \frac{f(\mathcal D|y)\,q(y, x)}{f(\mathcal D|x)\,q(x, y)}\right\} \times \left\{1 \wedge \frac{\pi(y)}{\pi(x)}\right\}$$
The “Big Data” plague
Simulation from posterior distribution with large sample size n
• Computing time at least of order O(n)
• solutions using likelihood decomposition
$$\prod_{i=1}^{n} \ell(\theta|x_i)$$
and handling subsets on different processors (CPU), graphical units (GPU), or computers
[Korattikara et al. (2013), Scott et al. (2013)]
• no consensus on method of choice, with instabilities from removing most prior input and uncalibrated approximations
[Neiswanger et al. (2013), Wang and Dunson (2013)]
Proposed solution
“There is no problem an absence of decision cannot solve.” Anonymous
Given α(x, y) := 1 ∧ r(x, y), factorise
$$r(x, y) = \prod_{k=1}^{d} \rho_k(x, y)$$
under the constraint ρ_k(x, y) = ρ_k(y, x)^{-1}
Delayed Acceptance Markov kernel given by
$$\tilde P(x, A) := \int_A q(x, y)\,\tilde\alpha(x, y)\,\mathrm dy + \left(1 - \int_{\mathcal X} q(x, y)\,\tilde\alpha(x, y)\,\mathrm dy\right) \mathbb 1_A(x)$$
where
$$\tilde\alpha(x, y) := \prod_{k=1}^{d} \{1 \wedge \rho_k(x, y)\}\,.$$
Proposed solution
Algorithm 1 Delayed Acceptance
To sample from P̃(x, ·):
1 Sample y ∼ Q(x, ·).
2 For k = 1, ..., d:
• with probability 1 ∧ ρ_k(x, y) continue
• otherwise stop and output x
3 Output y
Arrange terms in the product so that the most computationally intensive ones are calculated ‘at the end’, hence least often
Generalization of Fox and Nicholls (1997) and Christen and Fox (2005), where testing for acceptance with an approximation before computing the exact likelihood was first suggested
More recent occurrences in the literature
[Shestopaloff & Neal, 2013; Golightly, 2014]
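A minimal delayed-acceptance sketch with d = 2 factors on a toy Gaussian model, screening the cheap prior ratio before the “expensive” likelihood ratio; the model and settings are illustrative, not the paper's examples.

```python
# Minimal delayed-acceptance Metropolis-Hastings sketch with d = 2 factors: the
# cheap prior ratio is screened first, the "expensive" likelihood ratio second.
# Toy 1-D Gaussian model with a N(0, 10^2) prior; all settings illustrative.
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(1.0, 1.0, size=100)

log_prior = lambda th: -0.5 * th**2 / 100.0           # cheap factor
log_lik = lambda th: -0.5 * np.sum((data - th)**2)    # expensive factor

def da_mh(n_iter, sigma=0.5):
    th, out = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        y = th + rng.normal(scale=sigma)              # symmetric proposal
        # stage 1: rho_1 = prior ratio
        if np.log(rng.uniform()) < log_prior(y) - log_prior(th):
            # stage 2: rho_2 = likelihood ratio, only computed if stage 1 passes
            if np.log(rng.uniform()) < log_lik(y) - log_lik(th):
                th = y
        out[t] = th
    return out

chain = da_mh(20_000)
print(chain.mean())    # close to the posterior mean, roughly the sample mean of data
```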
Potential drawbacks
• Delayed Acceptance efficiently reduces computing cost only
when approximation ˜π is “good enough” or “flat enough”
• Probability of acceptance always smaller than in the original
Metropolis–Hastings scheme
• Decomposition of original data into likelihood bits may however lead to deterioration of algorithmic properties without impacting computational efficiency...
• ...e.g., case of a term explosive at x = 0 and computed by itself: leaving x = 0 near impossible
Potential drawbacks
Figure: (left) Fit of delayed Metropolis–Hastings algorithm on a
Beta-binomial posterior p|x ∼ Be(x + a, n + b − x) when N = 100,
x = 32, a = 7.5 and b = .5. Binomial B(N, p) likelihood replaced with
product of 100 Bernoulli terms. Histogram based on 10⁵ iterations, with
overall acceptance rate of 9%; (centre) raw sequence of p’s in Markov
chain; (right) autocorrelogram of the above sequence.
The “Big Data” plague
Delayed Acceptance intended for likelihoods or priors, but
not a clear solution for “Big Data” problems
1 all product terms must be computed
2 all terms previously computed either stored for future
comparison or recomputed
3 sequential approach limits parallel gains...
4 ...unless prefetching scheme added to delays
[Angelino et al. (2014), Strid (2010)]
Validation of the method
Lemma (1)
For any Markov chain with transition kernel Π of the form
$$\Pi(x, A) = \int_A q(x, y)\,a(x, y)\,\mathrm dy + \left(1 - \int_{\mathcal X} q(x, y)\,a(x, y)\,\mathrm dy\right) \mathbb 1_A(x)\,,$$
and satisfying detailed balance, the function a(·) satisfies (for π-a.e. x, y)
$$\frac{a(x, y)}{a(y, x)} = r(x, y)\,.$$
Validation of the method
Lemma (2)
(X̃_n)_{n≥1}, the Markov chain associated with P̃, is a π-reversible Markov chain.
Proof.
From Lemma 1 we just need to check that
$$\frac{\tilde\alpha(x, y)}{\tilde\alpha(y, x)} = \prod_{k=1}^{d} \frac{1 \wedge \rho_k(x, y)}{1 \wedge \rho_k(y, x)} = \prod_{k=1}^{d} \rho_k(x, y) = r(x, y)\,,$$
since ρ_k(y, x) = ρ_k(x, y)^{-1} and (1 ∧ a)/(1 ∧ a^{-1}) = a
Comparisons of P and P̃
The acceptance probability ordering
$$\tilde\alpha(x, y) = \prod_{k=1}^{d} \{1 \wedge \rho_k(x, y)\} \le 1 \wedge \prod_{k=1}^{d} \rho_k(x, y) = 1 \wedge r(x, y) = \alpha(x, y)\,,$$
follows from (1 ∧ a)(1 ∧ b) ≤ (1 ∧ ab) for a, b ∈ ℝ⁺.
Remark
By construction of P̃,
$$\mathrm{var}(f, P) \le \mathrm{var}(f, \tilde P)$$
for any f ∈ L²(X, π), using Peskun ordering (Peskun, 1973; Tierney, 1998), since α̃(x, y) ≤ α(x, y) for any (x, y) ∈ X².
Comparisons of P and P̃
Condition (1)
Defining A := {(x, y) ∈ X² : r(x, y) ≥ 1}, there exists c such that
$$\inf_{(x, y) \in A}\ \min_{k \in \{1, \ldots, d\}} \rho_k(x, y) \ge c\,.$$
Ensures that when α(x, y) = 1, the acceptance probability α̃(x, y) is uniformly lower-bounded by a positive constant.
Reversibility implies α̃(x, y) uniformly lower-bounded by a constant multiple of α(x, y) for all x, y ∈ X.
Proposition (1)
Under Condition (1), Lemma 34 in Andrieu & Lee (2013) implies
$$\mathrm{Gap}\big(\tilde P\big) \ge \epsilon\,\mathrm{Gap}(P) \qquad\text{and}\qquad \mathrm{var}\big(f, \tilde P\big) \le (\epsilon^{-1} - 1)\,\mathrm{var}_\pi(f) + \epsilon^{-1}\,\mathrm{var}(f, P)$$
with f ∈ L²₀(E, π), ε = c^{d−1}.
Hence if P has a right spectral gap, then so does P̃.
Plus, quantitative bounds on asymptotic variance of MCMC estimates using (X̃_n)_{n≥1} in relation to those using (X_n)_{n≥1} available
Comparisons of P and P̃
Easiest use of above: modify any candidate factorisation
Given a factorisation of r
$$r(x, y) = \prod_{k=1}^{d} \rho_k(x, y)$$
satisfying the balance condition, define a sequence of functions ρ̃_k such that both r(x, y) = ∏_{k=1}^{d} ρ̃_k(x, y) and Condition (1) hold.
Comparisons of P and P̃
Take c ∈ (0, 1], define b = c^{1/(d−1)} and set
$$\tilde\rho_k(x, y) := \min\left\{\frac 1 b,\ \max\{b, \rho_k(x, y)\}\right\}\,, \quad k \in \{1, \ldots, d - 1\},$$
and
$$\tilde\rho_d(x, y) := \frac{r(x, y)}{\prod_{k=1}^{d-1} \tilde\rho_k(x, y)}\,.$$
Then:
Proposition (2)
Under this scheme, the previous proposition holds with
$$\epsilon = c^2 = b^{2(d-1)}\,.$$
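A small sketch of this capping construction (the value of c and the initial factors are illustrative):

```python
# Sketch of the capping construction: clip the first d-1 factors into [b, 1/b]
# with b = c^{1/(d-1)} and let the last factor absorb the remainder, so that the
# product is unchanged. The values of c and of the initial factors are illustrative.
import numpy as np

def capped_factors(rho, c=0.01):
    rho = np.asarray(rho, dtype=float)
    b = c ** (1.0 / (len(rho) - 1))
    capped = np.clip(rho[:-1], b, 1.0 / b)        # first d-1 factors, now in [b, 1/b]
    last = np.prod(rho) / np.prod(capped)         # absorbs whatever was clipped away
    return np.append(capped, last)

rho = np.array([0.001, 5.0, 2.0])                 # some factorisation of r(x, y)
tilde = capped_factors(rho)
print(np.prod(rho), np.prod(tilde))               # same product r(x, y)
```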
Multiple estimators
1 Introduction
2 Multiple Importance Sampling
3 Rao-Blackwellisation
4 Vanilla Rao–Blackwellisation
5 Delayed acceptance
6 Combination of estimators
Variance reduction
Given estimators I₁, ..., I_d of the same integral I, optimal combination
$$\hat I = \sum_{i=1}^{d} \hat\omega_i\,I_i$$
where
$$\hat\omega = \arg\min_{\omega:\ \sum_i \omega_i = 1}\ \omega^{\mathsf T} \Sigma\,\omega$$
with Σ covariance matrix of (I₁, ..., I_d)
covariance matrix estimated from the samples behind the I_i's
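A minimal sketch of the optimal combination with an estimated covariance matrix, on toy replicated estimators (the covariance values and replication scheme are illustrative): the weights are ω̂ ∝ Σ̂⁻¹1, normalised to sum to one.

```python
# Sketch of the optimal linear combination with an estimated covariance matrix,
# on toy replicated estimators (covariance values and replication are illustrative):
# the weights solve min_w w' Sigma w subject to sum(w) = 1, i.e. w ~ Sigma^{-1} 1.
import numpy as np

rng = np.random.default_rng(8)
R, I_true = 200, 1.0
Sigma_true = np.array([[1.0, 0.3, 0.0],
                       [0.3, 0.5, 0.1],
                       [0.0, 0.1, 2.0]])
estimates = I_true + rng.multivariate_normal(np.zeros(3), Sigma_true, size=R)

Sigma_hat = np.cov(estimates, rowvar=False)   # covariance estimated from the replicates
w = np.linalg.solve(Sigma_hat, np.ones(3))
w = w / w.sum()                               # hat-omega = Sigma^{-1} 1 / (1' Sigma^{-1} 1)

combined = estimates @ w
print(estimates.var(axis=0), combined.var())  # combined variance below each individual one
```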
Issues
• Linear combination of estimators cannot exclude outliers and
infinite variance estimators
• Worse: retains infinite variance from any component
• Ignores case of estimators based on identical simulations (e.g.,
multiple AMIS solutions)
• Does not measure added value of new estimator
Issues
Comparison of several combinations of three importance samples of length 10³ [olive stands for h-median cdf]
Combining via cdfs
Revisit Monte Carlo estimates as expectations of distribution estimates
Collection of estimators by-produces evaluations of the cdf of h(X) as
$$\hat F_j(h) = \sum_{i=1}^{n} \omega_{ij}\,\mathbb I_{h(x_{ij}) < h}$$
New perspective on combining cdf estimates with different distributions and potential correlations. Average of cdfs poor choice (infinite variance, variance ignored)
Combining via cdf median
Towards robustness and elimination of infinite variance, use of [vertical] median cdf
$$\hat F_{\mathrm{med}}(h) = \mathrm{median}_j\ \hat F_j(h)$$
[plot: empirical cdfs and their vertical median]
which does not account for correlation and repetitions
Combining via cdf median
Alternative horizontal median cdf
$$\hat F^{-1}_{\mathrm{hmed}}(\alpha) = \mathrm{median}_j\ \hat F_j^{-1}(\alpha)$$
which does not show visible variation from the vertical median
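A minimal sketch of the vertical median of weighted empirical cdfs for three importance samples of a toy N(0,1) target (the proposals, grid and sample sizes are illustrative), with the combined estimate read off the median cdf:

```python
# Sketch of the vertical median of weighted empirical cdfs: three importance
# samples of a toy N(0,1) target yield hat-F_j, the combined cdf is their
# pointwise median, and E[h(X)] is read off that median cdf. Proposals, grid
# and sample sizes are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
f = stats.norm(0, 1)                          # target
h = lambda x: x                               # integrand, E_f[h(X)] = 0
proposals = [stats.norm(0, 2), stats.norm(1, 1), stats.t(df=3)]
n = 1_000

grid = np.linspace(-8, 8, 400)
cdfs = []
for q in proposals:
    x = q.rvs(size=n, random_state=rng)
    w = f.pdf(x) / q.pdf(x)
    w = w / w.sum()                           # self-normalised weights omega_ij
    cdfs.append(np.array([np.sum(w[h(x) <= t]) for t in grid]))

F_med = np.median(np.vstack(cdfs), axis=0)    # vertical median cdf
est = np.sum(grid[1:] * np.diff(F_med))       # Riemann-Stieltjes sum for E[h(X)]
print(est)                                    # close to 0
```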
Combining via cdf median
[boxplot comparing the estimators normal.5, normal4, amis2, amis3, hmedian]
Comparison of eight estimators for three importance samples of length 10³
More Related Content

What's hot

ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
Approximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsApproximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forests
Christian Robert
 
random forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimationrandom forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimation
Christian Robert
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergence
Christian Robert
 
ABC short course: survey chapter
ABC short course: survey chapterABC short course: survey chapter
ABC short course: survey chapter
Christian Robert
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
Christian Robert
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
Christian Robert
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
Christian Robert
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
Christian Robert
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
Christian Robert
 
accurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannaccurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannolli0601
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?
Christian Robert
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conference
Christian Robert
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
Christian Robert
 
ABC short course: final chapters
ABC short course: final chaptersABC short course: final chapters
ABC short course: final chapters
Christian Robert
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
Christian Robert
 
Nested sampling
Nested samplingNested sampling
Nested sampling
Christian Robert
 
Big model, big data
Big model, big dataBig model, big data
Big model, big data
Christian Robert
 
ABC short course: model choice chapter
ABC short course: model choice chapterABC short course: model choice chapter
ABC short course: model choice chapter
Christian Robert
 
An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testing
Christian Robert
 

What's hot (20)

ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Approximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forestsApproximate Bayesian model choice via random forests
Approximate Bayesian model choice via random forests
 
random forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimationrandom forests for ABC model choice and parameter estimation
random forests for ABC model choice and parameter estimation
 
CISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergenceCISEA 2019: ABC consistency and convergence
CISEA 2019: ABC consistency and convergence
 
ABC short course: survey chapter
ABC short course: survey chapterABC short course: survey chapter
ABC short course: survey chapter
 
ABC workshop: 17w5025
ABC workshop: 17w5025ABC workshop: 17w5025
ABC workshop: 17w5025
 
Laplace's Demon: seminar #1
Laplace's Demon: seminar #1Laplace's Demon: seminar #1
Laplace's Demon: seminar #1
 
asymptotics of ABC
asymptotics of ABCasymptotics of ABC
asymptotics of ABC
 
Coordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like samplerCoordinate sampler : A non-reversible Gibbs-like sampler
Coordinate sampler : A non-reversible Gibbs-like sampler
 
NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)NCE, GANs & VAEs (and maybe BAC)
NCE, GANs & VAEs (and maybe BAC)
 
accurate ABC Oliver Ratmann
accurate ABC Oliver Ratmannaccurate ABC Oliver Ratmann
accurate ABC Oliver Ratmann
 
Can we estimate a constant?
Can we estimate a constant?Can we estimate a constant?
Can we estimate a constant?
 
Poster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conferencePoster for Bayesian Statistics in the Big Data Era conference
Poster for Bayesian Statistics in the Big Data Era conference
 
Bayesian inference on mixtures
Bayesian inference on mixturesBayesian inference on mixtures
Bayesian inference on mixtures
 
ABC short course: final chapters
ABC short course: final chaptersABC short course: final chapters
ABC short course: final chapters
 
Likelihood-free Design: a discussion
Likelihood-free Design: a discussionLikelihood-free Design: a discussion
Likelihood-free Design: a discussion
 
Nested sampling
Nested samplingNested sampling
Nested sampling
 
Big model, big data
Big model, big dataBig model, big data
Big model, big data
 
ABC short course: model choice chapter
ABC short course: model choice chapterABC short course: model choice chapter
ABC short course: model choice chapter
 
An overview of Bayesian testing
An overview of Bayesian testingAn overview of Bayesian testing
An overview of Bayesian testing
 

Similar to Multiple estimators for Monte Carlo approximations

ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
Christian Robert
 
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Christian Robert
 
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Umberto Picchini
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
Christian Robert
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep Learning
RayKim51
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
Frank Nielsen
 
Discussion of Persi Diaconis' lecture at ISBA 2016
Discussion of Persi Diaconis' lecture at ISBA 2016Discussion of Persi Diaconis' lecture at ISBA 2016
Discussion of Persi Diaconis' lecture at ISBA 2016
Christian Robert
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
Christian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
Christian Robert
 
Richard Everitt's slides
Richard Everitt's slidesRichard Everitt's slides
Richard Everitt's slides
Christian Robert
 
NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015
Christian Robert
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
Christian Robert
 
Montecarlophd
MontecarlophdMontecarlophd
Montecarlophd
Marco Delogu
 
Markov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing themMarkov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing them
Pierre Jacob
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
Arthur Charpentier
 
Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...
Cigdem Aslay
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
The Statistical and Applied Mathematical Sciences Institute
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
The Statistical and Applied Mathematical Sciences Institute
 
Machine learning mathematicals.pdf
Machine learning mathematicals.pdfMachine learning mathematicals.pdf
Machine learning mathematicals.pdf
King Khalid University
 

Similar to Multiple estimators for Monte Carlo approximations (20)

ABC-Gibbs
ABC-GibbsABC-Gibbs
ABC-Gibbs
 
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
Talk at 2013 WSC, ISI Conference in Hong Kong, August 26, 2013
 
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...Bayesian inference for mixed-effects models driven by SDEs and other stochast...
Bayesian inference for mixed-effects models driven by SDEs and other stochast...
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep Learning
 
Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...Pattern learning and recognition on statistical manifolds: An information-geo...
Pattern learning and recognition on statistical manifolds: An information-geo...
 
Discussion of Persi Diaconis' lecture at ISBA 2016
Discussion of Persi Diaconis' lecture at ISBA 2016Discussion of Persi Diaconis' lecture at ISBA 2016
Discussion of Persi Diaconis' lecture at ISBA 2016
 
Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?Inferring the number of components: dream or reality?
Inferring the number of components: dream or reality?
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
Richard Everitt's slides
Richard Everitt's slidesRichard Everitt's slides
Richard Everitt's slides
 
NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015NBBC15, Reyjavik, June 08, 2015
NBBC15, Reyjavik, June 08, 2015
 
Workshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael MartinWorkshop in honour of Don Poskitt and Gael Martin
Workshop in honour of Don Poskitt and Gael Martin
 
Montecarlophd
MontecarlophdMontecarlophd
Montecarlophd
 
2주차
2주차2주차
2주차
 
Markov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing themMarkov chain Monte Carlo methods and some attempts at parallelizing them
Markov chain Monte Carlo methods and some attempts at parallelizing them
 
Machine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & InsuranceMachine Learning in Actuarial Science & Insurance
Machine Learning in Actuarial Science & Insurance
 
Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...Workload-aware materialization for efficient variable elimination on Bayesian...
Workload-aware materialization for efficient variable elimination on Bayesian...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
Program on Mathematical and Statistical Methods for Climate and the Earth Sys...
 
Machine learning mathematicals.pdf
Machine learning mathematicals.pdfMachine learning mathematicals.pdf
Machine learning mathematicals.pdf
 

More from Christian Robert

Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
Christian Robert
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
Christian Robert
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
Christian Robert
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
Christian Robert
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
Christian Robert
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
Christian Robert
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
Christian Robert
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
Christian Robert
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
Christian Robert
 
short course at CIRM, Bayesian Masterclass, October 2018
short course at CIRM, Bayesian Masterclass, October 2018short course at CIRM, Bayesian Masterclass, October 2018
short course at CIRM, Bayesian Masterclass, October 2018
Christian Robert
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distances
Christian Robert
 
prior selection for mixture estimation
prior selection for mixture estimationprior selection for mixture estimation
prior selection for mixture estimation
Christian Robert
 
Coordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerCoordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like sampler
Christian Robert
 

More from Christian Robert (13)

Adaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte CarloAdaptive Restore algorithm & importance Monte Carlo
Adaptive Restore algorithm & importance Monte Carlo
 
Asymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de FranceAsymptotics of ABC, lecture, Collège de France
Asymptotics of ABC, lecture, Collège de France
 
discussion of ICML23.pdf
discussion of ICML23.pdfdiscussion of ICML23.pdf
discussion of ICML23.pdf
 
How many components in a mixture?
How many components in a mixture?How many components in a mixture?
How many components in a mixture?
 
restore.pdf
restore.pdfrestore.pdf
restore.pdf
 
Testing for mixtures at BNP 13
Testing for mixtures at BNP 13Testing for mixtures at BNP 13
Testing for mixtures at BNP 13
 
discussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihooddiscussion on Bayesian restricted likelihood
discussion on Bayesian restricted likelihood
 
eugenics and statistics
eugenics and statisticseugenics and statistics
eugenics and statistics
 
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment modelsa discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
a discussion of Chib, Shin, and Simoni (2017-8) Bayesian moment models
 
short course at CIRM, Bayesian Masterclass, October 2018
short course at CIRM, Bayesian Masterclass, October 2018short course at CIRM, Bayesian Masterclass, October 2018
short course at CIRM, Bayesian Masterclass, October 2018
 
ABC with Wasserstein distances
ABC with Wasserstein distancesABC with Wasserstein distances
ABC with Wasserstein distances
 
prior selection for mixture estimation
prior selection for mixture estimationprior selection for mixture estimation
prior selection for mixture estimation
 
Coordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like samplerCoordinate sampler: A non-reversible Gibbs-like sampler
Coordinate sampler: A non-reversible Gibbs-like sampler
 

Recently uploaded

Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
Sérgio Sacani
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
sachin783648
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
Astronomy Update- Curiosity’s exploration of Mars _ Local Briefs _ leadertele...
NathanBaughman3
 
FAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable PredictionsFAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable Predictions
Michel Dumontier
 
insect morphology and physiology of insect
insect morphology and physiology of insectinsect morphology and physiology of insect
insect morphology and physiology of insect
anitaento25
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
AADYARAJPANDEY1
 
Large scale production of streptomycin.pptx
Large scale production of streptomycin.pptxLarge scale production of streptomycin.pptx
Large scale production of streptomycin.pptx
Cherry
 
Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
aishnasrivastava
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
Penicillin...........................pptx
Penicillin...........................pptxPenicillin...........................pptx
Penicillin...........................pptx
Cherry
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SELF-EXPLANATORY
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
IvanMallco1
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
muralinath2
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
Richard Gill
 

Recently uploaded (20)

Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...Multi-source connectivity as the driver of solar wind variability in the heli...
Multi-source connectivity as the driver of solar wind variability in the heli...
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 

Multiple estimators for Monte Carlo approximations

  • 1. Multiple estimators for Monte Carlo approximations Christian P. Robert Université Paris-Dauphine, Paris & University of Warwick, Coventry with M. Banterle, G. Casella, R. Douc, V. Elvira, J.M. Marin
  • 2. Outline 1 Introduction 2 Multiple Importance Sampling 3 Rao-Blackwellisation 4 Vanilla Rao–Blackwellisation 5 Delayed acceptance 6 Combination of estimators
  • 3. Monte Carlo problems Approximation of an integral $I = \int_X H(x)\,dx$ based on pseudo-random simulations $x_1, \ldots, x_T \overset{\text{i.i.d.}}{\sim} f(x)$ and expectation representation $I = \int_X h(x) f(x)\,dx = \mathbb{E}_f[h(X)]$ [Casella and Robert, 1996]
  • 5. Multiple estimators Even with a single sample, several representations available • path sampling • safe harmonic means • mixtures like AMIS (aka IMIS, MAMIS) • Rao-Blackwellisations • recycling schemes • quasi-Monte Carlo • measure averaging (NPMLE) à la Kong et al. (2003) depending on how black is the black box mechanism [Chen et al., 2000]
  • 6. An outlier Given $X_{ij} \sim f_i(x) = c_i g_i(x)$ with $g_i$ known and $c_i$ unknown ($i = 1, \ldots, k$, $j = 1, \ldots, n_i$), constants $c_i$ estimated by a “reverse logistic regression” based on the quasi-likelihood $L(\eta) = \sum_{i=1}^k \sum_{j=1}^{n_i} \log p_i(x_{ij}, \eta)$ with $p_i(x, \eta) = \exp\{\eta_i\} g_i(x) \big/ \sum_{i=1}^k \exp\{\eta_i\} g_i(x)$ [Anderson, 1972; Geyer, 1992] Approximation $\log \hat{c}_i = \log n_i/n - \hat{\eta}_i$
  • 8. A statistical framework Existence of a central limit theorem: $\sqrt{n}\,(\hat{\eta}_n - \eta) \overset{\mathcal{L}}{\longrightarrow} \mathcal{N}_k(0, B^+ A B^+)$ [Geyer, 1992; Doss & Tan, 2015] • strong convergence properties • asymptotic approximation of the precision • connection with bridge sampling and auxiliary model [mixture] • ...but nothing statistical there [no estimation or data] [Kong et al., 2003; Chopin & Robert, 2011]
  • 10. Evidence from a posterior sample Harmonic identity $\mathbb{E}\Big[\frac{\varphi(X)}{h(X) f(X)}\Big] = \int \frac{\varphi(x)}{h(x) f(x)}\,\frac{h(x) f(x)}{I}\,dx = \frac{1}{I}$, the expectation being taken under the posterior density $h(x) f(x)/I$, no matter what the proposal density $\varphi$ is [Gelfand & Dey, 1994; Bartolucci et al., 2006]
  • 11. Harmonic mean estimator Constraint opposed to usual importance sampling constraints: $\varphi(x)$ must have lighter (rather than fatter) tails than $h(x) g(x)$ for the approximation $\hat{I} = 1 \bigg/ \frac{1}{T}\sum_{t=1}^T \frac{\varphi(x^{(t)})}{h(x^{(t)})\, g(x^{(t)})}$ to have a finite variance [safe harmonic mean] [Robert & Wraith, 2012]
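To make the finite-variance requirement concrete, here is a minimal Python sketch (not part of the original slides) on a conjugate Normal-Normal toy model whose evidence is known in closed form; the light-tailed $\varphi$ is a Normal centred at the posterior mean with a deliberately reduced scale, an arbitrary choice made only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy model: x | theta ~ N(theta, 1), theta ~ N(0, 1), one observation x = 1.2
x_obs, prior_sd, lik_sd = 1.2, 1.0, 1.0
post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / lik_sd**2)
post_mean = post_var * x_obs / lik_sd**2
true_evidence = stats.norm(0.0, np.sqrt(prior_sd**2 + lik_sd**2)).pdf(x_obs)

# Posterior draws (available in closed form for this toy model)
T = 10_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=T)

# Light-tailed phi: Normal with half the posterior standard deviation (arbitrary)
phi = stats.norm(post_mean, 0.5 * np.sqrt(post_var))

# Safe harmonic mean: 1 / mean( phi(theta) / [likelihood(theta) * prior(theta)] )
numer = phi.pdf(theta)
denom = stats.norm(theta, lik_sd).pdf(x_obs) * stats.norm(0.0, prior_sd).pdf(theta)
evidence_hat = 1.0 / np.mean(numer / denom)

print(f"true evidence = {true_evidence:.5f}, harmonic-mean estimate = {evidence_hat:.5f}")
```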
  • 12. “The Worst Monte Carlo Method Ever” “The good news is that the Law of Large Numbers guarantees that this estimator is consistent ie, it will very likely be very close to the correct answer if you use a sufficiently large number of points from the posterior distribution. The bad news is that the number of points required for this estimator to get close to the right answer will often be greater than the number of atoms in the observable universe. The even worse news is that it’s easy for people to not realize this, and to naïvely accept estimates that are nowhere close to the correct value of the marginal likelihood.” [Radford Neal’s blog, Aug. 23, 2008]
  • 14. Mixtures as proposals Design specific mixture for simulation purposes, with density $\tilde{\varphi}(x) \propto \omega_1 h(x) g(x) + \varphi(x)$, where $\varphi(\cdot)$ is arbitrary (but normalised) Note: $\omega_1$ is not a probability weight [Chopin & Robert, 2011]
  • 16. Multiple estimators 1 Introduction 2 Multiple Importance Sampling 3 Rao-Blackwellisation 4 Vanilla Rao–Blackwellisation 5 Delayed acceptance 6 Combination of estimators
  • 17. Multiple Importance Sampling Recycling: given several proposals $Q_1, \ldots, Q_T$, for $1 \le t \le T$ generate an iid sample $x_1^t, \ldots, x_N^t \sim Q_t$ and estimate $I$ by $\hat{I}_{Q,N}(h) = T^{-1}\sum_{t=1}^T N^{-1}\sum_{i=1}^N h(x_i^t)\, \omega_i^t$ where $\omega_i^t = f(x_i^t)/q_t(x_i^t)$ correct...
  • 18. Multiple Importance Sampling Recycling: given several proposals $Q_1, \ldots, Q_T$, for $1 \le t \le T$ generate an iid sample $x_1^t, \ldots, x_N^t \sim Q_t$ and estimate $I$ by $\hat{I}_{Q,N}(h) = T^{-1}\sum_{t=1}^T N^{-1}\sum_{i=1}^N h(x_i^t)\, \omega_i^t$ where $\omega_i^t = f(x_i^t)\Big/ T^{-1}\sum_{\ell=1}^T q_\ell(x_i^t)$ still correct!
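The difference between the two weighting schemes can be sketched in a few lines of Python (an illustration added here, not in the slides); the target, test function and proposal scales are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Target f: standard normal; h(x) = x^2, so I = E_f[X^2] = 1
f = stats.norm(0.0, 1.0)
h = lambda x: x**2

# Two proposals with rather different scales (arbitrary choices)
proposals = [stats.norm(0.0, 0.5), stats.norm(0.0, 3.0)]
N = 5_000
samples = [q.rvs(size=N, random_state=rng) for q in proposals]

# Naive weights: each sample weighted by its own proposal only
naive = np.mean([np.mean(h(x) * f.pdf(x) / q.pdf(x))
                 for q, x in zip(proposals, samples)])

# Deterministic-mixture weights: common mixture denominator over all proposals
def mix_density(x):
    return np.mean([q.pdf(x) for q in proposals], axis=0)

mixture = np.mean([np.mean(h(x) * f.pdf(x) / mix_density(x))
                   for x in samples])

print(f"naive MIS   = {naive:.4f}")
print(f"mixture MIS = {mixture:.4f}  (target value 1)")
```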
  • 19. Mixture representation Deterministic mixture correction of the weights proposed by Owen and Zhou (JASA, 2000) • The corresponding estimator is still unbiased [if not self-normalised] • All particles are on the same weighting scale rather than their own • Large variance proposals Qt do not take over • Variance reduction thanks to weight stabilization & recycling • [k.o.] removes the randomness in the component choice [=Rao-Blackwellisation]
  • 20. Global adaptation AMIS algorithm At iteration $t = 1, \ldots, T$: 1. For $1 \le i \le N_1$, generate $x_i^t \sim T_3(\hat{\mu}^{t-1}, \hat{\Sigma}^{t-1})$ 2. Calculate the mixture importance weight of particle $x_i^t$: $\omega_i^t = f(x_i^t)\big/\delta_i^t$ where $\delta_i^t = \sum_{l=0}^{t-1} q_{T_3}(x_i^t; \hat{\mu}^l, \hat{\Sigma}^l)$
  • 21. Backward reweighting 3. If $t \ge 2$, update the weights of all past particles $x_i^l$, $1 \le l \le t-1$: $\omega_i^l = f(x_i^l)\big/\delta_i^l$ where $\delta_i^l \leftarrow \delta_i^l + q_{T_3}(x_i^l; \hat{\mu}^{t-1}, \hat{\Sigma}^{t-1})$ 4. Compute IS estimates of the target mean and variance, $\hat{\mu}^t$ and $\hat{\Sigma}^t$, where $\hat{\mu}_j^t = \sum_{l=1}^t \sum_{i=1}^{N_1} \omega_i^l\, (x_j)_i^l \Big/ \sum_{l=1}^t \sum_{i=1}^{N_1} \omega_i^l$ ... [Cornuet et al., 2012; Marin et al., 2017]
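Below is a compressed Python sketch of the two steps above on a one-dimensional toy target, substituting Gaussian proposals for the $T_3$ proposals of the slides to keep the code short; all tuning constants (iterations, particles, starting values) are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy target: N(3, 2^2), fully known here so the adaptation can be checked
f = stats.norm(3.0, 2.0).pdf

T, N1 = 10, 500                      # iterations and particles per iteration
mu, sig = 0.0, 5.0                   # initial proposal parameters (arbitrary)
xs, deltas, past = [], [], []        # particles, mixture denominators, past proposals

for t in range(T):
    past.append((mu, sig))
    x = rng.normal(mu, sig, size=N1)                                       # step 1
    delta = np.sum([stats.norm(m, s).pdf(x) for (m, s) in past], axis=0)   # step 2
    xs.append(x); deltas.append(delta)
    for l in range(t):               # step 3: backward reweighting of past particles
        deltas[l] = deltas[l] + stats.norm(mu, sig).pdf(xs[l])
    w = np.concatenate([f(xl) / dl for xl, dl in zip(xs, deltas)])         # step 4
    allx = np.concatenate(xs)
    mu = np.sum(w * allx) / np.sum(w)
    sig = np.sqrt(np.sum(w * (allx - mu) ** 2) / np.sum(w))

print(f"AMIS estimates after {T} iterations: mean = {mu:.3f}, sd = {sig:.3f} (target: 3, 2)")
```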
  • 22. NPMLE for Monte Carlo “The task of estimating an integral by Monte Carlo methods is formulated as a statistical model using simulated observations as data. The difficulty in this exercise is that we ordinarily have at our disposal all of the information required to compute integrals exactly by calculus or numerical integration, but we choose to ignore some of the information for simplicity or computational feasibility.” [Kong, McCullagh, Meng, Nicolae & Tan, 2003]
  • 23. NPMLE for Monte Carlo “Our proposal is to use a semiparametric statistical model that makes explicit what information is ignored and what information is retained. The parameter space in this model is a set of measures on the sample space, which is ordinarily an infinite dimensional object. None-the-less, from simulated data the base-line measure can be estimated by maximum likelihood, and the required integrals computed by a simple formula previously derived by Geyer and by Meng and Wong using entirely different arguments.” [Kong, McCullagh, Meng, Nicolae & Tan, 2003]
  • 24. NPMLE for Monte Carlo “By contrast with Geyer’s retrospective likelihood, a correct estimate of simulation error is available directly from the Fisher information. The principal advantage of the semiparametric model is that variance reduction techniques are associated with submodels in which the maximum likelihood estimator in the submodel may have substantially smaller variance than the traditional estimator.” [Kong, McCullagh, Meng, Nicolae & Tan, 2003]
  • 25. NPMLE for Monte Carlo “At first glance, the problem appears to be an exercise in calculus or numerical analysis, and not amenable to statistical formulation” • use of Fisher information • non-parametric MLE based on simulations • comparison of sampling schemes through variances • Rao–Blackwellised improvements by invariance constraints
  • 26. NPMLE for Monte Carlo Observing $Y_{ij} \sim F_i(t) = c_i^{-1} \int_{-\infty}^t \omega_i(x)\, dF(x)$ with $\omega_i$ known and $F$ unknown
  • 27. NPMLE for Monte Carlo Observing $Y_{ij} \sim F_i(t) = c_i^{-1} \int_{-\infty}^t \omega_i(x)\, dF(x)$ with $\omega_i$ known and $F$ unknown “Maximum likelihood estimate” defined by weighted empirical cdf $\sum_{i,j} \omega_i(y_{ij})\, p(y_{ij})\, \delta_{y_{ij}}$, maximising in $p$ $\prod_{ij} c_i^{-1}\, \omega_i(y_{ij})\, p(y_{ij})$
  • 28. NPMLE for Monte Carlo Observing $Y_{ij} \sim F_i(t) = c_i^{-1} \int_{-\infty}^t \omega_i(x)\, dF(x)$ with $\omega_i$ known and $F$ unknown “Maximum likelihood estimate” defined by weighted empirical cdf $\sum_{i,j} \omega_i(y_{ij})\, p(y_{ij})\, \delta_{y_{ij}}$, maximising in $p$ $\prod_{ij} c_i^{-1}\, \omega_i(y_{ij})\, p(y_{ij})$ Result such that $\sum_{ij} \frac{\hat{c}_r^{-1}\, \omega_r(y_{ij})}{\sum_s n_s\, \hat{c}_s^{-1}\, \omega_s(y_{ij})} = 1$ [Vardi, 1985]
  • 29. NPMLE for Monte Carlo Observing $Y_{ij} \sim F_i(t) = c_i^{-1} \int_{-\infty}^t \omega_i(x)\, dF(x)$ with $\omega_i$ known and $F$ unknown Result such that $\sum_{ij} \frac{\hat{c}_r^{-1}\, \omega_r(y_{ij})}{\sum_s n_s\, \hat{c}_s^{-1}\, \omega_s(y_{ij})} = 1$ [Vardi, 1985] Bridge sampling estimator $\sum_{ij} \frac{\hat{c}_r^{-1}\, \omega_r(y_{ij})}{\sum_s n_s\, \hat{c}_s^{-1}\, \omega_s(y_{ij})} = 1$ [Gelman & Meng, 1998; Tan, 2004]
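The fixed-point equation above suggests a simple iterative solver; here is a hedged Python sketch on two unnormalised Gaussian weight functions whose constant ratio is known (only ratios of the $c_r$ are identified, so the estimate is reported relative to $c_1$). The number of iterations and the toy functions are choices made only for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two unnormalised weight functions with known normalising constants (for checking):
# omega_1 = unnormalised N(0,1), omega_2 = unnormalised N(1, 2^2); true ratio c2/c1 = 2
omegas = [lambda x: np.exp(-0.5 * x**2),
          lambda x: np.exp(-0.5 * (x - 1.0)**2 / 4.0)]
n = np.array([2_000, 2_000])
y = np.concatenate([rng.normal(0.0, 1.0, n[0]), rng.normal(1.0, 2.0, n[1])])

W = np.vstack([w(y) for w in omegas])   # W[r, m] = omega_r(y_m) over the pooled sample
c = np.ones(2)                          # starting values

# Fixed-point iteration derived from the displayed equation
for _ in range(200):
    denom = (n[:, None] / c[:, None] * W).sum(axis=0)   # sum_s n_s c_s^{-1} omega_s(y_m)
    c = (W / denom).sum(axis=1)                          # c_r = sum_m omega_r(y_m) / denom_m
    c = c / c[0]                                         # only ratios are identified

print(f"estimated c2/c1 = {c[1]:.3f}  (true value 2)")
```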
  • 30. end of the 2002 discussion “...essentially every Monte Carlo activity may be interpreted as parameter estimation by maximum likelihood in a statistical model. We do not claim that this point of view is necessary; nor do we seek to establish a working principle from it.” • restriction to discrete support measures [may be] suboptimal [Ritov & Bickel, 1990; Robins et al., 1997, 2000, 2003] • group averaging versions in-between multiple mixture estimators and quasi-Monte Carlo versions [Veach & Guibas, 1995; Owen & Zhou, 2000; Cornuet et al., 2012; Owen, 2003] • statistical analogy provides at best a narrative thread
  • 31. end of the 2002 discussion “The hard part of the exercise is to construct a submodel such that the gain in precision is sufficient to justify the additional computational effort” • garden of forking paths, with infinite possibilities • no free lunch (variance, budget, time) • Rao–Blackwellisation may be detrimental in Markov setups
  • 32. Multiple estimators 1 Introduction 2 Multiple Importance Sampling 3 Rao-Blackwellisation 4 Vanilla Rao–Blackwellisation 5 Delayed acceptance 6 Combination of estimators
  • 33. Accept-Reject Given a density $f(\cdot)$ to simulate from, take a density $g(\cdot)$ such that $f(x) \le M g(x)$ for some $M \ge 1$ To simulate $X \sim f$, it is sufficient to generate $Y \sim g$, $U \mid Y = y \sim \mathcal{U}(0, M g(y))$ until $0 < u < f(y)$
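As a runnable illustration of the accept-reject step (added here, not in the slides), the following Python sketch simulates from a Beta(2,4) target under a uniform proposal; the bound M is computed numerically on a grid and padded slightly, an expedient that is good enough for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Target f: Beta(2, 4); proposal g: Uniform(0, 1), so g(y) = 1 on (0, 1)
f = stats.beta(2, 4).pdf
M = 1.001 * f(np.linspace(0.0, 1.0, 10_001)).max()   # padded numerical bound on f/g

def accept_reject(n_accept):
    out = []
    while len(out) < n_accept:
        y = rng.uniform()                  # Y ~ g
        u = rng.uniform(0.0, M)            # U | Y = y ~ U(0, M g(y))
        if u < f(y):                       # accept when U falls under the target density
            out.append(y)
    return np.array(out)

x = accept_reject(20_000)
print(f"sample mean = {x.mean():.3f} (true mean of Beta(2,4) = {2/6:.3f})")
```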
  • 34. Much ado about... [Exercise 3.33, MCSM] Raw outcome: iid sequences $Y_1, Y_2, \ldots, Y_t \sim g$ and $U_1, U_2, \ldots, U_t \sim \mathcal{U}(0, 1)$ Random number $N$ of proposed $Y_i$’s needed for $t$ acceptances: $P(N = n) = \binom{n-1}{t-1}\,(1/M)^t\,(1 - 1/M)^{n-t}$,
  • 35. Much ado about... [Exercise 3.33, MCSM] Raw outcome: iid sequences $Y_1, Y_2, \ldots, Y_t \sim g$ and $U_1, U_2, \ldots, U_t \sim \mathcal{U}(0, 1)$ Joint density of $(N, Y, U)$: $P(N = n, Y_1 \le y_1, \ldots, Y_n \le y_n, U_1 \le u_1, \ldots, U_n \le u_n) = \int_{-\infty}^{y_n} g(t_n)(u_n \wedge w_n)\,dt_n \int_{-\infty}^{y_1}\!\cdots\!\int_{-\infty}^{y_{n-1}} g(t_1)\cdots g(t_{n-1}) \times \sum_{(i_1,\cdots,i_{t-1})} \prod_{j=1}^{t-1}(w_{i_j} \wedge u_{i_j}) \prod_{j=t}^{n-1}(u_{i_j} - w_{i_j})^+\, dt_1\cdots dt_{n-1}$, where $w_i = f(y_i)/M g(y_i)$ and the sum is over all subsets of $\{1, \ldots, n-1\}$ of size $t-1$
  • 36. Much ado about... [Exercise 3.33, MCSM] Raw outcome: iid sequences $Y_1, Y_2, \ldots, Y_t \sim g$ and $U_1, U_2, \ldots, U_t \sim \mathcal{U}(0, 1)$ Marginal joint density of $(Y_i, U_i) \mid N = n$, $i < n$: $P(N = n, Y_1 \le y, U_1 \le u_1) = \binom{n-1}{t-1}\Big(\frac{1}{M}\Big)^{t-1}\Big(1 - \frac{1}{M}\Big)^{n-t-1} \times \Big[\frac{t-1}{n-1}(w_1 \wedge u_1)\Big(1 - \frac{1}{M}\Big) + \frac{n-t}{n-1}(u_1 - w_1)^+\frac{1}{M}\Big]\int_{-\infty}^{y} g(t_1)\,dt_1$ and marginal distribution of $Y_i$: $m(y) = \frac{t-1}{n-1} f(y) + \frac{n-t}{n-1}\,\frac{g(y) - \rho f(y)}{1 - \rho}$, $P(U_1 \le w(y) \mid Y_1 = y, N = n) = \frac{g(y)\, w(y)\, M\,\frac{t-1}{n-1}}{m(y)}$
  • 37. Much ado about noise Accept-reject sample $(X_1, \ldots, X_m)$ associated with $(U_1, \ldots, U_N)$ and $(Y_1, \ldots, Y_N)$; $N$ is the stopping time for acceptance of $m$ variables among the $Y_j$’s Rewrite estimator of $\mathbb{E}[h]$ as $\frac{1}{m}\sum_{i=1}^m h(X_i) = \frac{1}{m}\sum_{j=1}^N h(Y_j)\, \mathbb{I}_{U_j \le w_j}$, with $w_j = f(Y_j)/M g(Y_j)$ [Casella and Robert (1996)]
  • 38. Much ado about noise Rao-Blackwellisation: smaller variance produced by integrating out the $U_i$’s, $\frac{1}{m}\sum_{j=1}^N \mathbb{E}[\mathbb{I}_{U_j \le w_j} \mid N, Y_1, \ldots, Y_N]\, h(Y_j) = \frac{1}{m}\sum_{i=1}^N \rho_i\, h(Y_i)$, where ($i < n$) $\rho_i = P(U_i \le w_i \mid N = n, Y_1, \ldots, Y_n) = w_i\, \dfrac{\sum_{(i_1,\ldots,i_{m-2})} \prod_{j=1}^{m-2} w_{i_j} \prod_{j=m-1}^{n-2}(1 - w_{i_j})}{\sum_{(i_1,\ldots,i_{m-1})} \prod_{j=1}^{m-1} w_{i_j} \prod_{j=m}^{n-1}(1 - w_{i_j})}$, and $\rho_n = 1$. Numerator: sum over all subsets of $\{1, \ldots, i-1, i+1, \ldots, n-1\}$ of size $m-2$; denominator: sum over all subsets of $\{1, \ldots, n-1\}$ of size $m-1$ [Casella and Robert (1996)]
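To make the conditional weights $\rho_i$ concrete, here is a brute-force Python sketch that enumerates the subsets explicitly (so it is only usable for very small $n$, unlike the recursive formulas of Casella and Robert, 1996); the Beta target, uniform proposal and bound are arbitrary choices.

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(5)

# Toy accept-reject: target f = Beta(2, 2), proposal g = U(0, 1), bound M >= sup f/g
f, M = stats.beta(2, 2).pdf, 1.6

def run_until(m):
    """Run accept-reject until m acceptances; return proposals, weights, indicators."""
    Y, w, acc = [], [], []
    while sum(acc) < m:
        y = rng.uniform()
        Y.append(y); w.append(f(y) / M)            # w_j = f(Y_j) / (M g(Y_j)), g = 1
        acc.append(rng.uniform() < w[-1])
    return np.array(Y), np.array(w), np.array(acc)

def subset_sum(w, size):
    """Sum over subsets S of the given size of prod_{j in S} w_j * prod_{j not in S} (1 - w_j)."""
    return sum(np.prod(w[list(S)]) * np.prod(np.delete(1.0 - w, list(S)))
               for S in combinations(range(len(w)), size))

def rb_weights(w, m):
    """Conditional acceptance probabilities rho_i given N = len(w) and the Y's."""
    n = len(w)
    denom = subset_sum(w[:n - 1], m - 1)
    rho = np.empty(n)
    for i in range(n - 1):
        rho[i] = w[i] * subset_sum(np.delete(w[:n - 1], i), m - 2) / denom
    rho[-1] = 1.0                                  # the last proposal is accepted by design
    return rho

m, h = 4, (lambda x: x)
Y, w, acc = run_until(m)
rho = rb_weights(w, m)
print(f"sum of rho_i = {rho.sum():.3f} (should equal m = {m})")
print(f"standard estimate: {np.sum(h(Y)[acc]) / m:.3f}   RB estimate: {np.sum(h(Y) * rho) / m:.3f}")
```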
  • 39. extension to Metropolis–Hastings case Sample produced by Metropolis–Hastings algorithm $x^{(1)}, \ldots, x^{(T)}$ based on two samples, $y_1, \ldots, y_T$ proposals and $u_1, \ldots, u_T$ uniforms [Casella and Robert (1996)]
  • 40. extension to Metropolis–Hastings case Sample produced by Metropolis–Hastings algorithm $x^{(1)}, \ldots, x^{(T)}$ based on two samples, $y_1, \ldots, y_T$ proposals and $u_1, \ldots, u_T$ uniforms Ergodic mean rewritten as $\delta_{MH} = \frac{1}{T}\sum_{t=1}^T h(x^{(t)}) = \frac{1}{T}\sum_{t=1}^T h(y_t) \sum_{i=t}^T \mathbb{I}_{x^{(i)} = y_t}$ [Casella and Robert (1996)]
  • 41. extension to Metropolis–Hastings case Sample produced by Metropolis–Hastings algorithm $x^{(1)}, \ldots, x^{(T)}$ based on two samples, $y_1, \ldots, y_T$ proposals and $u_1, \ldots, u_T$ uniforms Conditional expectation $\delta_{RB} = \frac{1}{T}\sum_{t=1}^T h(y_t)\, \mathbb{E}\Big[\sum_{i=t}^T \mathbb{I}_{X^{(i)} = y_t} \,\Big|\, y_1, \ldots, y_T\Big] = \frac{1}{T}\sum_{t=1}^T h(y_t) \sum_{i=t}^T P(X^{(i)} = y_t \mid y_1, \ldots, y_T)$ with smaller variance [Casella and Robert (1996)]
  • 42. weight derivation Take $\rho_{ij} = \frac{f(y_j)/q(y_j|y_i)}{f(y_i)/q(y_i|y_j)} \wedge 1$ ($j > i$), $\tilde{\rho}_{ij} = \rho_{ij}\, q(y_{j+1}|y_j)$, $\bar{\rho}_{ij} = (1 - \rho_{ij})\, q(y_{j+1}|y_i)$ ($i < j < T$), $\zeta_{jj} = 1$, $\zeta_{jt} = \prod_{l=j+1}^t \bar{\rho}_{jl}$ ($i < j < T$), $\tau_0 = 1$, $\tau_j = \sum_{t=0}^{j-1} \tau_t\, \zeta_{t(j-1)}\, \tilde{\rho}_{tj}$, $\tau_T = \sum_{t=0}^{T-1} \tau_t\, \zeta_{t(T-1)}\, \rho_{tT}$ ($i < T$), $\omega_T^i = 1$, $\omega_i^j = \tilde{\rho}_{ji}\, \omega_{i+1}^i + \bar{\rho}_{ji}\, \omega_{i+1}^j$ ($0 \le j < i < T$). [Casella and Robert (1996)]
  • 43. weight derivation Theorem The estimator $\delta_{RB}$ satisfies $\delta_{RB} = \dfrac{\sum_{i=0}^T \varphi_i\, h(y_i)}{\sum_{i=0}^{T-1} \tau_i\, \zeta_{i(T-1)}}$, with ($i < T$) $\varphi_i = \tau_i\Big[\sum_{j=i}^{T-1} \zeta_{ij}\, \omega_{j+1}^i + \zeta_{i(T-1)}(1 - \rho_{iT})\Big]$ and $\varphi_T = \tau_T$. [Casella and Robert (1996)]
  • 44. Multiple estimators 1 Introduction 2 Multiple Importance Sampling 3 Rao-Blackwellisation 4 Vanilla Rao–Blackwellisation 5 Delayed acceptance 6 Combination of estimators
  • 45. Some properties of the Metropolis–Hastings algorithm Alternative representation of Metropolis–Hastings estimator $\delta$ as $\delta = \frac{1}{n}\sum_{t=1}^n h(x^{(t)}) = \frac{1}{n}\sum_{i=1}^{M_n} n_i\, h(z_i)$, where • $\{x^{(t)}\}_t$ original Markov (MCMC) chain with stationary distribution $\pi(\cdot)$ and transition $q(\cdot|\cdot)$ • the $z_i$’s are the accepted $y_j$’s • $M_n$ is the number of accepted $y_j$’s till time $n$ • $n_i$ is the number of times $z_i$ appears in the sequence $\{x^{(t)}\}_t$
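A short Python check of this representation (an addition, not in the slides): run a random-walk Metropolis-Hastings sampler on a standard normal target, recover the accepted values $z_i$ and their occupation counts $n_i$ by run-length encoding the chain, and verify that both averages coincide. Target, proposal scale and $h$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)

def log_pi(x):                 # toy target: N(0, 1), known up to a constant
    return -0.5 * x**2

n, scale = 50_000, 2.5
chain, x = np.empty(n), 0.0
for t in range(n):
    y = x + scale * rng.normal()
    if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
        x = y
    chain[t] = x

# accepted values z_i and occupation counts n_i (run-length encoding of the chain)
change = np.empty(n, dtype=bool)
change[0] = True
change[1:] = chain[1:] != chain[:-1]
z = chain[change]
counts = np.diff(np.append(np.flatnonzero(change), n))

h = lambda x: x**2
delta_raw = np.mean(h(chain))
delta_z = np.sum(counts * h(z)) / n
print(f"M_n = {len(z)} accepted values; both representations agree: "
      f"{delta_raw:.5f} = {delta_z:.5f}")
```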
  • 46. The “accepted candidates” Define $\tilde{q}(\cdot|z_i) = \dfrac{\alpha(z_i, \cdot)\, q(\cdot|z_i)}{p(z_i)} \le \dfrac{q(\cdot|z_i)}{p(z_i)}$ where $p(z_i) = \int \alpha(z_i, y)\, q(y|z_i)\,dy$ To simulate from $\tilde{q}(\cdot|z_i)$: 1. Propose a candidate $y \sim q(\cdot|z_i)$ 2. Accept with probability $\dfrac{\tilde{q}(y|z_i)}{q(y|z_i)/p(z_i)} = \alpha(z_i, y)$ Otherwise, reject it and start again. [this is the transition of the Metropolis–Hastings algorithm]
  • 47. The “accepted candidates” Define $\tilde{q}(\cdot|z_i) = \dfrac{\alpha(z_i, \cdot)\, q(\cdot|z_i)}{p(z_i)} \le \dfrac{q(\cdot|z_i)}{p(z_i)}$ where $p(z_i) = \int \alpha(z_i, y)\, q(y|z_i)\,dy$ The transition kernel $\tilde{q}$ admits $\tilde{\pi}$ as a stationary distribution: $\tilde{\pi}(x)\,\tilde{q}(y|x) = \dfrac{\pi(x)\, p(x)}{\int \pi(u)\, p(u)\,du} \cdot \dfrac{\alpha(x, y)\, q(y|x)}{p(x)}$ [first factor $= \tilde{\pi}(x)$, second $= \tilde{q}(y|x)$]
  • 48. The “accepted candidates” Define $\tilde{q}(\cdot|z_i) = \dfrac{\alpha(z_i, \cdot)\, q(\cdot|z_i)}{p(z_i)} \le \dfrac{q(\cdot|z_i)}{p(z_i)}$ where $p(z_i) = \int \alpha(z_i, y)\, q(y|z_i)\,dy$ The transition kernel $\tilde{q}$ admits $\tilde{\pi}$ as a stationary distribution: $\tilde{\pi}(x)\,\tilde{q}(y|x) = \dfrac{\pi(x)\,\alpha(x, y)\, q(y|x)}{\int \pi(u)\, p(u)\,du}$
  • 49. The “accepted candidates” Define $\tilde{q}(\cdot|z_i) = \dfrac{\alpha(z_i, \cdot)\, q(\cdot|z_i)}{p(z_i)} \le \dfrac{q(\cdot|z_i)}{p(z_i)}$ where $p(z_i) = \int \alpha(z_i, y)\, q(y|z_i)\,dy$ The transition kernel $\tilde{q}$ admits $\tilde{\pi}$ as a stationary distribution: $\tilde{\pi}(x)\,\tilde{q}(y|x) = \dfrac{\pi(y)\,\alpha(y, x)\, q(x|y)}{\int \pi(u)\, p(u)\,du}$ [by detailed balance of the original Metropolis–Hastings acceptance]
  • 50. The “accepted candidates” Define $\tilde{q}(\cdot|z_i) = \dfrac{\alpha(z_i, \cdot)\, q(\cdot|z_i)}{p(z_i)} \le \dfrac{q(\cdot|z_i)}{p(z_i)}$ where $p(z_i) = \int \alpha(z_i, y)\, q(y|z_i)\,dy$ The transition kernel $\tilde{q}$ admits $\tilde{\pi}$ as a stationary distribution: $\tilde{\pi}(x)\,\tilde{q}(y|x) = \tilde{\pi}(y)\,\tilde{q}(x|y)$,
  • 51. The “accepted chain” Lemma (Douc & X., AoS, 2011) The sequence $(z_i, n_i)$ satisfies 1. $(z_i, n_i)_i$ is a Markov chain; 2. $z_{i+1}$ and $n_i$ are independent given $z_i$; 3. $n_i$ is distributed as a geometric random variable with probability parameter $p(z_i) := \int \alpha(z_i, y)\, q(y|z_i)\,dy$; (1) 4. $(z_i)_i$ is a Markov chain with transition kernel $\tilde{Q}(z, dy) = \tilde{q}(y|z)\,dy$ and stationary distribution $\tilde{\pi}$ such that $\tilde{q}(\cdot|z) \propto \alpha(z, \cdot)\, q(\cdot|z)$ and $\tilde{\pi}(\cdot) \propto \pi(\cdot)\, p(\cdot)$.
  • 55. Importance sampling perspective 1. A natural idea: $\delta^* = \frac{1}{n}\sum_{i=1}^{M_n} \frac{h(z_i)}{p(z_i)}$,
  • 56. Importance sampling perspective 1. A natural idea: $\delta^* \simeq \dfrac{\sum_{i=1}^{M_n} h(z_i)/p(z_i)}{\sum_{i=1}^{M_n} 1/p(z_i)} = \dfrac{\sum_{i=1}^{M_n} \{\pi(z_i)/\tilde{\pi}(z_i)\}\, h(z_i)}{\sum_{i=1}^{M_n} \pi(z_i)/\tilde{\pi}(z_i)}$.
  • 57. Importance sampling perspective 1. A natural idea: $\delta^* \simeq \dfrac{\sum_{i=1}^{M_n} h(z_i)/p(z_i)}{\sum_{i=1}^{M_n} 1/p(z_i)} = \dfrac{\sum_{i=1}^{M_n} \{\pi(z_i)/\tilde{\pi}(z_i)\}\, h(z_i)}{\sum_{i=1}^{M_n} \pi(z_i)/\tilde{\pi}(z_i)}$. 2. But $p$ is not available in closed form.
  • 58. Importance sampling perspective 1. A natural idea: $\delta^* \simeq \dfrac{\sum_{i=1}^{M_n} h(z_i)/p(z_i)}{\sum_{i=1}^{M_n} 1/p(z_i)} = \dfrac{\sum_{i=1}^{M_n} \{\pi(z_i)/\tilde{\pi}(z_i)\}\, h(z_i)}{\sum_{i=1}^{M_n} \pi(z_i)/\tilde{\pi}(z_i)}$. 2. But $p$ is not available in closed form. 3. The geometric $n_i$ provides an obvious replacement, already used in the original Metropolis–Hastings estimate, since $\mathbb{E}[n_i] = 1/p(z_i)$.
  • 59. The Bernoulli factory The crude estimate of $1/p(z_i)$, $n_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\}$, can be improved: Lemma (Douc & X., AoS, 2011) If $(y_j)_j$ is an iid sequence with distribution $q(y|z_i)$, the quantity $\hat{\xi}_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \{1 - \alpha(z_i, y_\ell)\}$ is an unbiased estimator of $1/p(z_i)$ whose variance, conditional on $z_i$, is lower than the conditional variance of $n_i$, $\{1 - p(z_i)\}/p^2(z_i)$.
  • 60. Rao-Blackwellised, for sure? $\hat{\xi}_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell \le j} \{1 - \alpha(z_i, y_\ell)\}$ 1. Infinite sum, but it terminates with positive probability since $\alpha(x^{(t)}, y_t) = \min\Big(1, \frac{\pi(y_t)}{\pi(x^{(t)})}\frac{q(x^{(t)}|y_t)}{q(y_t|x^{(t)})}\Big)$ may equal one. For example: take a symmetric random walk as a proposal. 2. What if we wish to be sure that the sum is finite? Finite horizon $k$ version: $\hat{\xi}_i^k = 1 + \sum_{j=1}^{\infty} \prod_{1 \le \ell \le k \wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{k+1 \le \ell \le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\}$
  • 62. Variance improvement Proposition (Douc & X., AoS, 2011) If $(y_j)_j$ is an iid sequence with distribution $q(y|z_i)$ and $(u_j)_j$ is an iid uniform sequence, for any $k \ge 0$, the quantity $\hat{\xi}_i^k = 1 + \sum_{j=1}^{\infty} \prod_{1 \le \ell \le k \wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{k+1 \le \ell \le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\}$ is an unbiased estimator of $1/p(z_i)$ with an almost surely finite number of terms.
  • 63. Variance improvement Proposition (Douc & X., AoS, 2011) [as above] Moreover, for $k \ge 1$, $\mathbb{V}\big[\hat{\xi}_i^k \mid z_i\big] = \dfrac{1 - p(z_i)}{p^2(z_i)} - \dfrac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)} \cdot \dfrac{2 - p(z_i)}{p^2(z_i)}\,(p(z_i) - r(z_i))$, where $p(z_i) := \int \alpha(z_i, y)\, q(y|z_i)\,dy$ and $r(z_i) := \int \alpha^2(z_i, y)\, q(y|z_i)\,dy$.
  • 64. Variance improvement Proposition (Douc & X., AoS, 2011) [as above] Therefore, we have $\mathbb{V}[\hat{\xi}_i \mid z_i] \le \mathbb{V}[\hat{\xi}_i^k \mid z_i] \le \mathbb{V}[\hat{\xi}_i^0 \mid z_i] = \mathbb{V}[n_i \mid z_i]$.
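The following Python sketch (not from the slides) compares, for a fixed accepted value $z$, the crude geometric $n_i$ and the finite-horizon $\hat{\xi}_i^k$ as estimators of $1/p(z)$, on a standard normal target with a symmetric random-walk proposal; the point $z$, horizon $k$, proposal scale and replication count are arbitrary. The estimator is realised exactly because every term after the first post-horizon acceptance vanishes.

```python
import numpy as np

rng = np.random.default_rng(7)
log_pi = lambda x: -0.5 * x**2            # standard normal target (up to a constant)
SCALE = 2.5                               # random-walk proposal scale (arbitrary)

def alpha(z, y):                          # symmetric proposal, so the q-ratio cancels
    return min(1.0, np.exp(log_pi(y) - log_pi(z)))

def xi_k(z, k):
    """Finite-horizon Rao-Blackwellised estimator of 1/p(z)."""
    total, prefix, ind, j = 1.0, 1.0, 1.0, 0
    while True:
        j += 1
        a = alpha(z, z + SCALE * rng.normal())
        if j <= k:
            prefix *= 1.0 - a                        # averaged factors up to the horizon
        else:
            ind *= float(rng.uniform() >= a)         # plain rejection indicators afterwards
        total += prefix * ind
        if j > k and ind == 0.0:                     # all remaining terms vanish
            return total

def n_geom(z):
    """Crude estimate: number of proposals until the first acceptance."""
    j = 1
    while rng.uniform() >= alpha(z, z + SCALE * rng.normal()):
        j += 1
    return j

z, k, R = 1.0, 5, 5_000
xi = np.array([xi_k(z, k) for _ in range(R)])
nn = np.array([n_geom(z) for _ in range(R)])
print(f"n_i    : mean {nn.mean():.3f}, variance {nn.var():.3f}")
print(f"xi_i^k : mean {xi.mean():.3f}, variance {xi.var():.3f} (same target, smaller variance)")
```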
  • 65. Multiple estimators 1 Introduction 2 Multiple Importance Sampling 3 Rao-Blackwellisation 4 Vanilla Rao–Blackwellisation 5 Delayed acceptance 6 Combination of estimators
  • 66. Non-informative inference for mixture models Standard mixture of distributions model $\sum_{i=1}^k w_i\, f(x|\theta_i)$, with $\sum_{i=1}^k w_i = 1$. (1) [Titterington et al., 1985; Frühwirth-Schnatter (2006)] Jeffreys’ prior for the mixture not available for computational reasons: it has not been tested so far [Jeffreys, 1939] Warning: Jeffreys’ prior improper in some settings [Grazian & Robert, 2017]
  • 67. Non-informative inference for mixture models Grazian & Robert (2017) consider the genuine Jeffreys’ prior for the complete set of parameters in (1), deduced from Fisher’s information matrix Computation of the prior density is costly, relying on many integrals like $\int_X \dfrac{\partial^2 \log\big[\sum_{i=1}^k w_i f(x|\theta_i)\big]}{\partial\theta_h\, \partial\theta_j}\, \Big[\textstyle\sum_{i=1}^k w_i f(x|\theta_i)\Big]\, dx$ Integrals with no analytical expression, hence involving numerical or Monte Carlo (costly) integration
  • 68. Non-informative inference for mixture models When building a Metropolis-Hastings proposal over the $(w_i, \theta_i)$’s, the prior ratio is more expensive than the likelihood and proposal ratios Suggestion: split the acceptance rule $\alpha(x, y) := 1 \wedge r(x, y)$, $r(x, y) := \dfrac{\pi(y|D)\, q(y, x)}{\pi(x|D)\, q(x, y)}$ into $\tilde{\alpha}(x, y) := \Big\{1 \wedge \dfrac{f(D|y)\, q(y, x)}{f(D|x)\, q(x, y)}\Big\} \times \Big\{1 \wedge \dfrac{\pi(y)}{\pi(x)}\Big\}$
  • 69. The “Big Data” plague Simulation from a posterior distribution with large sample size $n$ • Computing time at least of order $O(n)$ • solutions using the likelihood decomposition $\prod_{i=1}^n \ell(\theta|x_i)$ and handling subsets on different processors (CPU), graphical units (GPU), or computers [Korattikara et al. (2013), Scott et al. (2013)] • no consensus on the method of choice, with instabilities from removing most prior input and uncalibrated approximations [Neiswanger et al. (2013), Wang and Dunson (2013)]
  • 70. Proposed solution “There is no problem an absence of decision cannot solve.” Anonymous Given $\alpha(x, y) := 1 \wedge r(x, y)$, factorise $r(x, y) = \prod_{k=1}^d \rho_k(x, y)$ under constraint $\rho_k(x, y) = \rho_k(y, x)^{-1}$ Delayed Acceptance Markov kernel given by $\tilde{P}(x, A) := \int_A q(x, y)\,\tilde{\alpha}(x, y)\,dy + \Big(1 - \int_X q(x, y)\,\tilde{\alpha}(x, y)\,dy\Big)\mathbb{I}_A(x)$ where $\tilde{\alpha}(x, y) := \prod_{k=1}^d \{1 \wedge \rho_k(x, y)\}$.
  • 71. Proposed solution “There is no problem an absence of decision cannot solve.” Anonymous Algorithm 1 Delayed Acceptance To sample from $\tilde{P}(x, \cdot)$: 1. Sample $y \sim Q(x, \cdot)$. 2. For $k = 1, \ldots, d$: • with probability $1 \wedge \rho_k(x, y)$ continue • otherwise stop and output $x$ 3. Output $y$ Arrange the terms in the product so that the most computationally intensive ones are calculated ‘at the end’, hence least often
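Here is a minimal Python sketch of Algorithm 1 for the two-factor split of the acceptance ratio used earlier (likelihood ratio and prior ratio), on a toy Normal model with a symmetric proposal; in the mixture application of the slides the expensive factor is the prior ratio, so it would be ranked last, but every model-specific choice below is only illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Toy model: x_i | theta ~ N(theta, 1), i = 1..200, with prior theta ~ N(0, 10^2)
data = rng.normal(1.5, 1.0, size=200)

def log_lik(th):
    return np.sum(stats.norm(th, 1.0).logpdf(data))

def log_prior(th):
    return stats.norm(0.0, 10.0).logpdf(th)

def delayed_acceptance_step(x, scale=0.2):
    y = x + scale * rng.normal()          # symmetric proposal: the q-ratio cancels
    # factor 1: likelihood ratio; early rejection skips the remaining factor(s)
    if np.log(rng.uniform()) >= log_lik(y) - log_lik(x):
        return x
    # factor 2: prior ratio, only evaluated if the first stage was passed
    if np.log(rng.uniform()) >= log_prior(y) - log_prior(x):
        return x
    return y

n_iter = 20_000
chain, x = np.empty(n_iter), 0.0
for t in range(n_iter):
    x = delayed_acceptance_step(x)
    chain[t] = x

print(f"posterior mean estimate = {chain[5_000:].mean():.3f} (data mean = {data.mean():.3f})")
```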
  • 72. Proposed solution “There is no problem an absence of decision cannot solve.” Anonymous Algorithm 1 Delayed Acceptance To sample from $\tilde{P}(x, \cdot)$: 1. Sample $y \sim Q(x, \cdot)$. 2. For $k = 1, \ldots, d$: • with probability $1 \wedge \rho_k(x, y)$ continue • otherwise stop and output $x$ 3. Output $y$ Generalization of Fox and Nicholls (1997) and Christen and Fox (2005), where testing for acceptance with an approximation before computing the exact likelihood was first suggested More recent occurrences in the literature [Shestopaloff & Neal, 2013; Golightly, 2014]
  • 73. Potential drawbacks • Delayed Acceptance efficiently reduces the computing cost only when the approximation $\tilde{\pi}$ is “good enough” or “flat enough” • The probability of acceptance is always smaller than in the original Metropolis–Hastings scheme • Decomposing the original data into likelihood bits may however lead to a deterioration of the algorithmic properties without impacting the computational efficiency... • ...e.g., the case of a term that explodes at $x = 0$ and is computed by itself: leaving $x = 0$ becomes near impossible
  • 74. Potential drawbacks Figure: (left) Fit of the delayed Metropolis–Hastings algorithm on a Beta-binomial posterior $p|x \sim \mathcal{B}e(x + a, n + b - x)$ when $N = 100$, $x = 32$, $a = 7.5$ and $b = .5$. Binomial $\mathcal{B}(N, p)$ likelihood replaced with a product of 100 Bernoulli terms. Histogram based on $10^5$ iterations, with an overall acceptance rate of 9%; (centre) raw sequence of $p$’s in the Markov chain; (right) autocorrelogram of the above sequence.
  • 75. The “Big Data” plague Delayed Acceptance intended for likelihoods or priors, but not a clear solution for “Big Data” problems 1 all product terms must be computed 2 all terms previously computed either stored for future comparison or recomputed 3 sequential approach limits parallel gains... 4 ...unless prefetching scheme added to delays [Angelino et al. (2014), Strid (2010)]
  • 76. Validation of the method Lemma (1) For any Markov chain with transition kernel $\Pi$ of the form $\Pi(x, A) = \int_A q(x, y)\, a(x, y)\,dy + \Big(1 - \int_X q(x, y)\, a(x, y)\,dy\Big)\mathbb{I}_A(x)$, and satisfying detailed balance, the function $a(\cdot, \cdot)$ satisfies (for $\pi$-a.e. $x, y$) $\dfrac{a(x, y)}{a(y, x)} = r(x, y)$.
  • 77. Validation of the method Lemma (2) $(\tilde{X}_n)_{n \ge 1}$, the Markov chain associated with $\tilde{P}$, is a $\pi$-reversible Markov chain. Proof. From Lemma 1 we just need to check that $\dfrac{\tilde{\alpha}(x, y)}{\tilde{\alpha}(y, x)} = \prod_{k=1}^d \dfrac{1 \wedge \rho_k(x, y)}{1 \wedge \rho_k(y, x)} = \prod_{k=1}^d \rho_k(x, y) = r(x, y)$, since $\rho_k(y, x) = \rho_k(x, y)^{-1}$ and $(1 \wedge a)/(1 \wedge a^{-1}) = a$
  • 78. Comparisons of $P$ and $\tilde{P}$ The acceptance probability ordering $\tilde{\alpha}(x, y) = \prod_{k=1}^d \{1 \wedge \rho_k(x, y)\} \le 1 \wedge \prod_{k=1}^d \rho_k(x, y) = 1 \wedge r(x, y) = \alpha(x, y)$, follows from $(1 \wedge a)(1 \wedge b) \le (1 \wedge ab)$ for $a, b \in \mathbb{R}_+$. Remark By construction of $\tilde{P}$, $\text{var}(f, P) \le \text{var}(f, \tilde{P})$ for any $f \in L^2(X, \pi)$, using Peskun ordering (Peskun, 1973, Tierney, 1998), since $\tilde{\alpha}(x, y) \le \alpha(x, y)$ for any $(x, y) \in X^2$.
  • 79. Comparisons of $P$ and $\tilde{P}$ Condition (1) Defining $A := \{(x, y) \in X^2 : r(x, y) \ge 1\}$, there exists $c$ such that $\inf_{(x,y)\in A} \min_{k\in\{1,\ldots,d\}} \rho_k(x, y) \ge c$. Ensures that when $\alpha(x, y) = 1$ then the acceptance probability $\tilde{\alpha}(x, y)$ is uniformly lower-bounded by a positive constant. Reversibility implies $\tilde{\alpha}(x, y)$ uniformly lower-bounded by a constant multiple of $\alpha(x, y)$ for all $x, y \in X$.
  • 80. Comparisons of $P$ and $\tilde{P}$ Condition (1) Defining $A := \{(x, y) \in X^2 : r(x, y) \ge 1\}$, there exists $c$ such that $\inf_{(x,y)\in A} \min_{k\in\{1,\ldots,d\}} \rho_k(x, y) \ge c$. Proposition (1) Under Condition (1), Lemma 34 in Andrieu & Lee (2013) implies $\text{Gap}(\tilde{P}) \ge \epsilon\, \text{Gap}(P)$ and $\text{var}(f, \tilde{P}) \le (\epsilon^{-1} - 1)\,\text{var}_\pi(f) + \epsilon^{-1}\,\text{var}(f, P)$ with $f \in L_0^2(E, \pi)$, $\epsilon = c^{d-1}$.
  • 81. Comparisons of $P$ and $\tilde{P}$ Proposition (1) Under Condition (1), Lemma 34 in Andrieu & Lee (2013) implies $\text{Gap}(\tilde{P}) \ge \epsilon\, \text{Gap}(P)$ and $\text{var}(f, \tilde{P}) \le (\epsilon^{-1} - 1)\,\text{var}_\pi(f) + \epsilon^{-1}\,\text{var}(f, P)$ with $f \in L_0^2(E, \pi)$, $\epsilon = c^{d-1}$. Hence if $P$ has a right spectral gap, then so does $\tilde{P}$. Plus, quantitative bounds on the asymptotic variance of MCMC estimates using $(\tilde{X}_n)_{n\ge1}$ relative to those using $(X_n)_{n\ge1}$ are available
  • 82. Comparisons of $P$ and $\tilde{P}$ Easiest use of the above: modify any candidate factorisation Given a factorisation of $r$, $r(x, y) = \prod_{k=1}^d \rho_k(x, y)$, satisfying the balance condition, define a sequence of functions $\tilde{\rho}_k$ such that both $r(x, y) = \prod_{k=1}^d \tilde{\rho}_k(x, y)$ and Condition 1 hold.
  • 83. Comparisons of $P$ and $\tilde{P}$ Take $c \in (0, 1]$, define $b = c^{1/(d-1)}$ and set $\tilde{\rho}_k(x, y) := \min\big\{\tfrac{1}{b}, \max\{b, \rho_k(x, y)\}\big\}$, $k \in \{1, \ldots, d-1\}$, and $\tilde{\rho}_d(x, y) := r(x, y)\Big/\prod_{k=1}^{d-1}\tilde{\rho}_k(x, y)$. Then: Proposition (2) Under this scheme, the previous proposition holds with $\epsilon = c^2 = b^{2(d-1)}$.
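A small Python sketch of this capping construction (the function names are mine): the first $d-1$ factors are clipped to $[b, 1/b]$ and the last factor absorbs the remainder, so the product, and hence the exact target, is unchanged.

```python
import numpy as np

def capped_factorisation(rhos, c):
    """Given factor functions rho_k(x, y) with prod_k rho_k = r, return modified factors:
    the first d-1 clipped to [b, 1/b] with b = c^{1/(d-1)}, the last one defined as the
    remainder so that the product is unchanged."""
    d = len(rhos)
    b = c ** (1.0 / (d - 1))
    clipped = [lambda x, y, rho=rho: float(np.clip(rho(x, y), b, 1.0 / b))
               for rho in rhos[:-1]]

    def last(x, y):
        r_full = np.prod([rho(x, y) for rho in rhos])            # r(x, y)
        return r_full / np.prod([g(x, y) for g in clipped])      # tilde-rho_d

    return clipped + [last]

# tiny check on an artificial 3-factor decomposition of r(x, y) = y / x on (0, inf)
rhos = [lambda x, y: (y / x) ** 0.5, lambda x, y: (y / x) ** 0.3, lambda x, y: (y / x) ** 0.2]
tilde = capped_factorisation(rhos, c=0.01)
x, y = 1.0, 7.0
print(np.prod([g(x, y) for g in tilde]), y / x)   # the two products agree
```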
  • 84. Multiple estimators 1 Introduction 2 Multiple Importance Sampling 3 Rao-Blackwellisation 4 Vanilla Rao–Blackwellisation 5 Delayed acceptance 6 Combination of estimators
  • 85. Variance reduction Given estimators $I_1, \ldots, I_d$ of the same integral $I$, optimal combination $\hat{I} = \sum_{i=1}^d \hat{\omega}_i I_i$ where $\hat{\omega} = \arg\min_{\omega:\, \sum_i \omega_i = 1} \omega^{\mathsf{T}} \Sigma\, \omega$ with $\Sigma$ the covariance matrix of $(I_1, \ldots, I_d)$, the covariance matrix being estimated from the samples behind the $I_\ell$’s
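A sketch of the optimal combination in Python (added for illustration): the weights minimising $\omega^{\mathsf{T}}\Sigma\,\omega$ under $\sum_i \omega_i = 1$ are $\Sigma^{-1}\mathbf{1}/(\mathbf{1}^{\mathsf{T}}\Sigma^{-1}\mathbf{1})$, with $\Sigma$ estimated here from batch means of three independent importance-sampling estimators of $\mathbb{E}[X^2] = 1$; the batching device and the proposal scales are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(9)

def batch_estimates(sd, n_batches=200, batch=100):
    """Batch means of an importance-sampling estimator of E[X^2] under N(0,1),
    using a N(0, sd^2) proposal."""
    x = rng.normal(0.0, sd, size=(n_batches, batch))
    w = sd * np.exp(-0.5 * x**2 + 0.5 * (x / sd) ** 2)   # density ratio N(0,1) / N(0,sd^2)
    return np.mean(w * x**2, axis=1)

B = np.vstack([batch_estimates(s) for s in (1.0, 1.5, 3.0)])   # shape (3, n_batches)
I_hat = B.mean(axis=1)                  # the three individual estimators
Sigma = np.cov(B) / B.shape[1]          # estimated covariance of the three overall averages

ones = np.ones(len(I_hat))
w = np.linalg.solve(Sigma, ones)
w /= w @ ones                           # omega = Sigma^{-1} 1 / (1' Sigma^{-1} 1)
print("weights:", np.round(w, 3), "| combined estimate:", round(float(w @ I_hat), 4))
```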
  • 87. Issues • A linear combination of estimators cannot exclude outliers and infinite-variance estimators • Worse: it retains the infinite variance of any such component • It ignores the case of estimators based on identical simulations (e.g., multiple AMIS solutions) • It does not measure the added value of a new estimator
  • 88. Issues Comparison of several combinations of three importance samples of length $10^3$ [olive stands for h-median cdf]
  • 89. Combining via cdfs Revisit Monte Carlo estimates as expectations of distribution estimates Each estimator in the collection produces, as a by-product, an evaluation of the cdf of $h(X)$ as $\hat{F}_j(h) = \sum_{i=1}^n \omega_{ij}\, \mathbb{I}_{h(x_{ij}) < h}$ New perspective on combining cdf estimates with different distributions and potential correlations Average of cdfs a poor choice (infinite variance, variance ignored)
  • 91. Combining via cdf median Towards robustness and elimination of infinite variance, use of the [vertical] median cdf $\hat{F}_{\text{med}}(h) = \text{median}_j\, \hat{F}_j(h)$ [figure: overlaid empirical cdfs] which does not account for correlation and repetitions
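A Python sketch of the vertical median (added here): build the weighted empirical cdfs of $h(X)$ from three importance samples, take the pointwise median on a grid, and read an estimate of $\mathbb{E}[h(X)]$ off the median cdf via a Riemann-Stieltjes sum; the grid, proposals and truncation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(10)

# Target N(0,1); h(x) = x^2, so E[h(X)] = 1. Three importance samples with
# Gaussian proposals of different scales (arbitrary choices).
def weighted_sample(sd, n=2_000):
    x = rng.normal(0.0, sd, size=n)
    w = sd * np.exp(-0.5 * x**2 + 0.5 * (x / sd) ** 2)   # density ratio N(0,1) / N(0,sd^2)
    return x**2, w / w.sum()                             # h-values and normalised weights

samples = [weighted_sample(sd) for sd in (1.0, 1.5, 3.0)]

# Weighted empirical cdfs of h(X) on a common grid, then the vertical (pointwise) median
grid = np.linspace(0.0, 15.0, 600)       # the upper tail is truncated at 15 for simplicity
cdfs = np.array([[np.sum(w[v <= t]) for t in grid] for v, w in samples])
F_med = np.median(cdfs, axis=0)

# Estimate of E[h(X)] read off the median cdf (mass beyond the grid is ignored)
I_med = np.sum(grid[1:] * np.diff(F_med))
print("individual estimates:", [round(float(np.sum(v * w)), 3) for v, w in samples])
print(f"median-cdf estimate : {I_med:.3f} (target value 1)")
```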
  • 92. Combining via cdf median Alternative horizontal median cdf $\hat{F}^{-1}_{\text{hmed}}(\alpha) = \text{median}_j\, \hat{F}_j^{-1}(\alpha)$ which does not show visible variation from the vertical median
  • 93. Combining via cdf median [figure: boxplots of the normal.5, normal4, amis2, amis3 and hmedian estimates] Comparison of eight estimators for three importance samples of length $10^3$