Inferring the number of mixture components:
dream or reality?
Christian P. Robert
Université Paris-Dauphine, Paris
&
University of Warwick, Coventry
Statistical methods and models for complex data
24 Sept. 2022
Padova800
Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
Mixture models
Convex combination of densities
x ∼ fj with probability pj ,
for j = 1, 2, . . . , k, with overall density
p1f1(x) + · · · + pkfk(x) .
Usual case: parameterised components
$$\sum_{i=1}^{k} p_i\, f(x\mid\theta_i) \qquad\text{with}\qquad \sum_{i=1}^{k} p_i = 1$$
where the weights $p_i$ are distinguished from the component parameters $\theta_i$
Motivations
▶ Dataset made of several latent/missing/unobserved
groups/strata/subpopulations. Mixture structure due to the
missing origin/allocation of each observation to a specific
subpopulation/stratum. Inference on either the allocations
(clustering) or on the parameters (θi, pi) or on the number of
components
▶ Semiparametric perspective where mixtures are functional
basis approximations of unknown distributions
▶ Nonparametric perspective where the number of components is
infinite (e.g., Dirichlet process mixtures)
Likelihood
“I have decided that mixtures, like tequila, are inherently
evil and should be avoided at all costs.” L. Wasserman
For a sample of independent random variables (x1, · · · , xn), the likelihood
$$\prod_{i=1}^{n} \left\{ p_1 f_1(x_i) + \cdots + p_k f_k(x_i) \right\}$$
is computable [pointwise] in O(kn) time.
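As a concrete illustration of the O(kn) pointwise evaluation (a sketch, not part of the original slides; the Gaussian components and function names are illustrative assumptions), the mixture log-likelihood can be computed stably with a log-sum-exp:

```python
# Sketch: pointwise mixture log-likelihood in O(kn) for a k-component
# Gaussian mixture with common standard deviation (illustrative choice).
import numpy as np
from scipy.special import logsumexp

def mixture_loglik(x, weights, means, sd=1.0):
    """log prod_i sum_j p_j N(x_i | mu_j, sd^2), computed stably."""
    x = np.asarray(x, dtype=float)[:, None]          # shape (n, 1)
    means = np.asarray(means, dtype=float)           # shape (k,)
    log_comp = (np.log(weights)                      # log p_j
                - 0.5 * np.log(2 * np.pi * sd**2)
                - 0.5 * (x - means)**2 / sd**2)      # shape (n, k)
    return logsumexp(log_comp, axis=1).sum()
```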
Normal mean mixture
Normal mixture
p N(µ1, 1) + (1 − p) N(µ2, 1)
with only means µi unknown
Identifiability
Parameters µ1 and µ2 identifiable: µ1
cannot be confused with µ2 when p is
different from 0.5.
Presence of an atavistic mode, better
understood by letting p go to 0.5
Bayesian inference on mixtures
For any prior π (θ, p), posterior distribution of (θ, p) available up
to a multiplicative constant
$$\pi(\theta, p\mid x) \;\propto\; \left[\prod_{i=1}^{n} \sum_{j=1}^{k} p_j\, f(x_i\mid\theta_j)\right] \pi(\theta, p)$$
at a cost of order O(kn)
Challenge
Despite this, the derivation of posterior characteristics such as posterior
expectations is only possible in exponential time, of order $O(k^n)$!
Challenges from likelihood
1. Number of modes of the likelihood of order O(k!):
   ⇒ maximization / exploration of the posterior surface is hard
2. Under exchangeable / permutation invariant priors on (θ, p),
   all posterior marginals are identical:
   ⇒ all posterior expectations of the θi are equal
3. Estimating the density is much simpler
[Marin & X, 2007]
Missing variable representation
Demarginalise: Associate to each xi a missing/latent/auxiliary
variable zi that indicates its component:
zi |p ∼ Mk(p1, . . . , pk)
and
xi |zi , θ ∼ f (·|θzi )
Completed likelihood
$$\ell^{C}(\theta, p\mid x, z) = \prod_{i=1}^{n} p_{z_i}\, f(x_i\mid\theta_{z_i})$$
and
$$\pi(\theta, p\mid x, z) \;\propto\; \left[\prod_{i=1}^{n} p_{z_i}\, f(x_i\mid\theta_{z_i})\right] \pi(\theta, p)$$
where z = (z1, . . . , zn)
Gibbs sampling for mixture models
Take advantage of the missing data structure:
Algorithm
▶ Initialization: choose p(0) and θ(0) arbitrarily
▶ Step t. For t = 1, . . .
  1. Generate $z_i^{(t)}$ (i = 1, . . . , n) from (j = 1, . . . , k)
     $$\mathbb{P}\left(z_i^{(t)} = j \,\Big|\, p_j^{(t-1)}, \theta_j^{(t-1)}, x_i\right) \;\propto\; p_j^{(t-1)}\, f\!\left(x_i \mid \theta_j^{(t-1)}\right)$$
  2. Generate p(t) from π(p | z(t)),
  3. Generate θ(t) from π(θ | z(t), x).
[Brooks & Gelman, 1990; Diebolt & X, 1990, 1994; Escobar & West, 1991]
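A minimal sketch of this Gibbs sampler for the two-component normal mean mixture of the earlier slides, assuming a known weight p, unit variances, and N(0, 10²) priors on the means (all illustrative choices, not taken from the slides):

```python
# Minimal Gibbs sketch for p N(mu1,1) + (1-p) N(mu2,1), p known,
# with independent N(0, prior_sd^2) priors on mu1, mu2 (illustrative).
import numpy as np

def gibbs_mean_mixture(x, p=0.3, n_iter=1000, prior_sd=10.0, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu = np.array([x.min(), x.max()])            # arbitrary initialisation
    w = np.array([p, 1.0 - p])
    chain = np.empty((n_iter, 2))
    for t in range(n_iter):
        # 1. allocations: P(z_i = j | ...) ∝ p_j exp{-(x_i - mu_j)^2 / 2}
        logp = np.log(w) - 0.5 * (x[:, None] - mu)**2
        probs = np.exp(logp - logp.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        z = (rng.random(n) < probs[:, 1]).astype(int)
        # 2. means: conjugate normal update given the allocations
        for j in (0, 1):
            nj = np.sum(z == j)
            var = 1.0 / (nj + 1.0 / prior_sd**2)
            mean = var * x[z == j].sum()
            mu[j] = rng.normal(mean, np.sqrt(var))
        chain[t] = mu
    return chain
```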
Normal mean mixture (cont’d)
[Figure: Gibbs samples of (µ1, µ2): (a) initialised at random; (b) initialised close to the lower mode]
[X & Casella, 2009]
Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
weakly informative priors
“Mirrors and copulation are abominable, because they multiply
the number of men.” − Jorge Luis Borges, Ficciones
▶ possible symmetric empirical Bayes priors
  $$p \sim \mathcal{D}(\gamma, \ldots, \gamma), \qquad \theta_i \sim \mathcal{N}(\hat\mu, \hat\omega\,\sigma_i^2), \qquad \sigma_i^{-2} \sim \mathcal{G}a(\nu, \hat\nu)$$
  which can be replaced with hierarchical priors
  [Diebolt & X, 1990; Richardson & Green, 1997]
▶ independent improper priors on the θj's prohibited, thus standard
  “flat” and Jeffreys priors impossible to use (except with the
  exclude-empty-component trick)
  [Diebolt & X, 1990; Wasserman, 1999]
weakly informative priors
▶ reparameterization to a compact set for use of uniform priors
  $$\mu_i \longrightarrow \frac{e^{\mu_i}}{1 + e^{\mu_i}}, \qquad \sigma_i \longrightarrow \frac{\sigma_i}{1 + \sigma_i}$$
  [Chopin, 2000]
▶ dependent weakly informative priors
  $$p \sim \mathcal{D}(k, \ldots, 1), \qquad \theta_i \sim \mathcal{N}(\theta_{i-1}, \zeta\sigma_{i-1}^2), \qquad \sigma_i \sim \mathcal{U}([0, \sigma_{i-1}])$$
  [Mengersen & X, 1996; X & Titterington, 1998]
▶ reference priors
  $$p \sim \mathcal{D}(1, \ldots, 1), \qquad \theta_i \sim \mathcal{N}(\mu_0, (\sigma_i^2 + \tau_0^2)/2), \qquad \sigma_i^2 \sim \mathcal{C}^+(0, \tau_0^2)$$
  [Moreno & Liseo, 1999]
Ban on improper priors
Difficult to use improper priors in the setting of mixtures because independent improper priors,
$$\pi(\theta) = \prod_{i=1}^{k} \pi_i(\theta_i), \qquad \text{with} \qquad \int \pi_i(\theta_i)\,d\theta_i = \infty,$$
end up, for all n's, with the property
$$\int \pi(\theta, p\mid x)\,d\theta\,dp = \infty$$
Reason
There are (k − 1)^n terms among the k^n terms in the expansion
that allocate no observation at all to the i-th component
Estimating k
When k is unknown, setting a prior on k leads to a mixture of
finite mixtures (MFM)
$$K \sim p_K, \quad \text{a pmf over } \mathbb{N}^*$$
Consistent estimation when pK puts mass on every integer
[Nobile, 1994; Miller & Harrison, 2018]
Implementation by
▶ reversible jump
  [Richardson & Green, 1997; Frühwirth-Schnatter, 2011]
▶ saturation by superfluous components
  [Rousseau & Mengersen, 2011]
▶ Bayesian model comparison
  [Berkhof, Mechelen & Gelman, 2003; Lee & X, 2018]
▶ cluster estimation
  [Malsiner-Walli, Frühwirth-Schnatter & Grün, 2016]
Clusters versus components
“In the mixture of finite mixtures, it is perhaps intuitively
clear that, under the prior at least, the number of clusters
behaves very similarly to the number of components K
when n is large. It turns out that under the posterior they
also behave very similarly for large n.” Miller & Harrison (2018)
However Miller & Harrison (2013, 2014) showed that the Dirichlet
process mixture posterior on the number of clusters is typically not
consistent for the number of components
Dirichlet process mixture (DPM)
Extension to the k = ∞ (non-parametric) case
$$x_i \mid z_i, \theta \overset{\text{i.i.d.}}{\sim} f(x_i\mid\theta_{z_i}), \qquad i = 1, \ldots, n$$
$$\mathbb{P}(Z_i = k) = \pi_k, \qquad k = 1, 2, \ldots$$
$$\pi_1, \pi_2, \ldots \sim \mathrm{GEM}(M), \qquad M \sim \pi(M), \qquad \theta_1, \theta_2, \ldots \overset{\text{i.i.d.}}{\sim} G_0$$
with GEM (Griffiths-Engen-McCloskey) defined by the stick-breaking representation
$$\pi_k = v_k \prod_{i=1}^{k-1}(1 - v_i), \qquad v_i \sim \mathrm{Beta}(1, M)$$
[Sethuraman, 1994]
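A small sketch of the stick-breaking construction above, truncated at a finite number of sticks (the truncation level is an illustrative assumption, not part of the slides):

```python
# Truncated stick-breaking draw of GEM(M) weights (illustrative truncation).
import numpy as np

def stick_breaking(M, truncation=100, rng=None):
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, M, size=truncation)                 # v_i ~ Beta(1, M)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                                   # pi_k = v_k prod_{i<k}(1 - v_i)
```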
Dirichlet process mixture (DPM)
Resulting in an infinite mixture
$$x \sim \prod_{i=1}^{n} \sum_{j=1}^{\infty} \pi_j\, f(x_i\mid\theta_j)$$
with (prior) cluster allocation
$$\pi(z\mid M) = \frac{\Gamma(M)}{\Gamma(M + n)}\, M^{K_+} \prod_{j=1}^{K_+} \Gamma(n_j)$$
and conditional likelihood
$$p(x\mid z, M) = \prod_{j=1}^{K_+} \int \prod_{i:\, z_i = j} f(x_i\mid\theta_j)\, dG_0(\theta_j)$$
available in closed form when G0 is conjugate
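For illustration, a sketch evaluating the log of the allocation prior π(z|M) above, assuming integer cluster labels and using scipy's log-gamma:

```python
# log pi(z | M) for the DPM allocation prior above (sketch).
import numpy as np
from scipy.special import gammaln

def dpm_log_allocation_prior(z, M):
    z = np.asarray(z)                      # integer cluster labels (0-based)
    n = len(z)
    counts = np.bincount(z)
    counts = counts[counts > 0]            # n_j over the non-empty clusters
    k_plus = len(counts)                   # K_+ = number of non-empty clusters
    return (gammaln(M) - gammaln(M + n)
            + k_plus * np.log(M)
            + gammaln(counts).sum())
```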
Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
Label switching paradox
▶ We should observe the exchangeability of the components
  [label switching] to conclude about convergence of the Gibbs
  sampler.
  [Holmes, Jasra & Stephens, 2005]
▶ If we observe it, then marginals are useless for estimating the
  parameters.
  [Frühwirth-Schnatter, 2001, 2004; Green, 2019]
▶ If we do not, then we are uncertain about the convergence!!!
  [Celeux, Hurn & X, 2000]
Constraints
Usual reply to lack of identifiability: impose constraints like
µ1 ≤ · · · ≤ µk
in the prior
Mostly incompatible with the topology of the posterior surface:
posterior expectations then depend on the choice of the
constraints.
Computational “detail”
The constraint need not be imposed during the simulation but can
instead be imposed after simulation, reordering MCMC output
according to constraints. [This avoids possible negative effects on
convergence]
Relabeling towards the mode
Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP
$$(\theta, p)^{(i^*)} \qquad \text{with} \qquad i^* = \arg\max_{i=1,\ldots,M}\, \pi\!\left((\theta, p)^{(i)} \mid x\right)$$
Pivotal Reordering
At iteration i ∈ {1, . . . , M},
1. Compute the optimal permutation
$$\tau_i = \arg\min_{\tau\in\mathfrak{S}_k}\, d\!\left(\tau\big((\theta^{(i)}, p^{(i)})\big),\, (\theta^{(i^*)}, p^{(i^*)})\right)$$
where d(·, ·) is a distance in the parameter space.
2. Set $(\theta^{(i)}, p^{(i)}) = \tau_i\big((\theta^{(i)}, p^{(i)})\big)$.
[Celeux, 1998; Stephens, 2000; Celeux, Hurn & X, 2000]
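A minimal sketch of pivotal reordering, assuming the Gibbs output is stored as arrays of component parameters and weights and using a squared Euclidean distance (both illustrative choices):

```python
# Pivotal reordering of Gibbs output towards the approximate MAP draw (sketch).
from itertools import permutations
import numpy as np

def pivotal_reorder(theta, p, log_post):
    """theta: (M, k) component parameters, p: (M, k) weights, log_post: (M,) log-posterior values."""
    i_star = int(np.argmax(log_post))                 # index of the approximate MAP draw
    pivot = np.concatenate([theta[i_star], p[i_star]])
    M, k = theta.shape
    out_theta, out_p = theta.copy(), p.copy()
    for i in range(M):
        best, best_d = None, np.inf
        for perm in permutations(range(k)):           # search over the k! relabelings
            idx = list(perm)
            cand = np.concatenate([theta[i, idx], p[i, idx]])
            d = float(np.sum((cand - pivot) ** 2))    # squared Euclidean distance
            if d < best_d:
                best, best_d = idx, d
        out_theta[i], out_p[i] = theta[i, best], p[i, best]
    return out_theta, out_p
```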
Loss functions for mixture estimation
Global loss function that considers the distance between predictives
$$\mathcal{L}(\xi, \hat\xi) = \int_{\mathcal{X}} f_\xi(x) \log\left\{ f_\xi(x) \big/ f_{\hat\xi}(x) \right\} dx$$
eliminates the labelling effect
Similar solution for estimating clusters through the allocation variables
$$\mathcal{L}(z, \hat z) = \sum_{i<j} \left\{ \mathbb{I}_{[z_i = z_j]}\big(1 - \mathbb{I}_{[\hat z_i = \hat z_j]}\big) + \mathbb{I}_{[\hat z_i = \hat z_j]}\big(1 - \mathbb{I}_{[z_i = z_j]}\big) \right\}.$$
[Celeux, Hurn & X, 2000]
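A small sketch of the allocation-based loss, counting pairwise co-clustering disagreements between two allocation vectors (the vectorised form is an implementation choice):

```python
# Pairwise-disagreement loss between two allocation vectors (sketch).
import numpy as np

def allocation_loss(z, z_hat):
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    same = z[:, None] == z[None, :]              # I[z_i = z_j]
    same_hat = z_hat[:, None] == z_hat[None, :]  # I[zhat_i = zhat_j]
    disagree = same != same_hat                  # exactly one of the two clusterings pairs i and j
    return int(np.triu(disagree, k=1).sum())     # count each pair i < j once
```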
Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
Bayesian model choice
Comparison of models Mi by Bayesian means:
probabilise the entire model/parameter space
▶ allocate probabilities pi to all models Mi
▶ define priors πi(θi) for each parameter space Θi
▶ compute
$$\pi(M_i\mid x) = \frac{p_i \displaystyle\int_{\Theta_i} f_i(x\mid\theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\displaystyle\sum_j p_j \int_{\Theta_j} f_j(x\mid\theta_j)\,\pi_j(\theta_j)\,d\theta_j}$$
Computational difficulty on its own
[Chen, Shao & Ibrahim, 2000; Marin & X, 2007]
Bayesian model choice
Comparison of models Mi by Bayesian means:
Relies on a central notion: the evidence
$$Z_k = \int_{\Theta_k} \pi_k(\theta_k)\, L_k(\theta_k)\, d\theta_k,$$
aka the marginal likelihood.
Computational difficulty on its own
[Chen, Shao & Ibrahim, 2000; Marin & X, 2007]
Validation in mixture setting
“In principle, the Bayes factor for the MFM versus the
DPM could be used as an empirical criterion for choos-
ing between the two models, and in fact, it is quite easy
to compute an approximation to the Bayes factor using
importance sampling” Miller & Harrison (2018)
Bayes Factor consistent for selecting the number of components
[Nobile, 1994; Ishwaran et al., 2001; Casella & Moreno, 2009; Chib and Kuffner, 2016]
Bayes Factor consistent for testing parametric versus nonparametric alternatives
[Verdinelli & Wasserman, 1997; Dass & Lee, 2004; McVinish et al., 2009]
Consistent evidence for location DPM
Consistency of Bayes factor comparing finite mixtures against
(location) Dirichlet Process Mixture
[Figure: log(BF) against n; (left) data from a finite mixture; (right) data not from a finite mixture]
Consistent evidence for location DPM
Under generic assumptions, when x1, · · · , xn are iid from fP0 with
$$P_0 = \sum_{j=1}^{k_0} p_j^0\, \delta_{\theta_j^0}$$
and a Dirichlet DP(M, G0) prior on P, there exists t > 0 such that, for all ε > 0,
$$P_{f_0}\left( m_{DP}(x) \ge \varepsilon\, n^{-(k_0 - 1 + d k_0 + t)/2} \right) = o(1)$$
Moreover there exists q > 0 such that
$$\Pi_{DP}\left( \|f_0 - f_P\|_1 \le \frac{(\log n)^q}{\sqrt{n}} \,\Big|\, x \right) = 1 + o_{P_{f_0}}(1).$$
[Hairault, X & Rousseau, 2022]
Chib’s representation
Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),
$$Z_k = m_k(x) = \frac{f_k(x\mid\theta_k)\, \pi_k(\theta_k)}{\pi_k(\theta_k\mid x)}$$
Replace with an approximation to the posterior
$$\widehat{Z}_k = \widehat{m}_k(x) = \frac{f_k(x\mid\theta_k^*)\, \pi_k(\theta_k^*)}{\widehat\pi_k(\theta_k^*\mid x)}.$$
[Chib, 1995]
Case of latent variables
For missing variables z as in mixture models, natural Rao-Blackwell estimate
$$\widehat\pi_k(\theta_k^*\mid x) = \frac{1}{T} \sum_{t=1}^{T} \pi_k\big(\theta_k^*\mid x, z_k^{(t)}\big),$$
where the $z_k^{(t)}$'s are Gibbs-sampled latent variables
[Diebolt & Robert, 1990; Chib, 1995]
Compensation for label switching
For mixture models, $z_k^{(t)}$ usually fails to visit all configurations in a
balanced way, despite the symmetry predicted by the theory
Consequences on the numerical approximation, biased by a factor of order k!
Recover the theoretical symmetry by using
$$\widetilde\pi_k(\theta_k^*\mid x) = \frac{1}{T\,k!} \sum_{\sigma\in\mathfrak{S}_k} \sum_{t=1}^{T} \pi_k\big(\sigma(\theta_k^*)\mid x, z_k^{(t)}\big)$$
for all σ's in $\mathfrak{S}_k$, the set of all permutations of {1, . . . , k}
[Berkhof, Mechelen & Gelman, 2003]
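A sketch of the symmetrised Rao-Blackwell estimate, assuming a user-supplied function returning the full conditional density π_k(θ | x, z) for the model at hand (an illustrative interface, not a fixed API):

```python
# Symmetrised Rao-Blackwell estimate of pi_k(theta* | x) for Chib's identity (sketch).
# `cond_dens(theta, z)` is assumed to return pi_k(theta | x, z) for the chosen model.
from itertools import permutations

def symmetrised_posterior_density(theta_star, z_draws, cond_dens, k):
    """theta_star: (k, d) NumPy array of component parameters; z_draws: list of Gibbs allocation vectors."""
    total, count = 0.0, 0
    for sigma in permutations(range(k)):         # all k! relabelings of theta*
        theta_perm = theta_star[list(sigma)]
        for z in z_draws:
            total += cond_dens(theta_perm, z)
            count += 1
    return total / count                          # (1/(T k!)) double sum

# Chib's identity then gives, on the log scale,
# log Zhat_k = log f_k(x | theta*) + log pi_k(theta*) - log(symmetrised density).
```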
Galaxy dataset (k)
Using Chib’s estimate, with θ*_k as MAP estimator,
$$\log \widehat Z_k(x) = -105.1396$$
for k = 3, while introducing permutations leads to
$$\log \widehat Z_k(x) = -103.3479$$
Note that −105.1396 + log(3!) = −103.3479

k        2        3        4        5        6        7        8
Zk(x)   -115.68  -103.35  -102.66  -101.93  -102.88  -105.48  -108.44

Estimations of the marginal likelihoods by the symmetrised Chib’s approximation (based on 10^5 Gibbs iterations and, for k > 5, 100 permutations selected at random in S_k).
[Lee et al., 2008]
More efficient sampling
Difficulty with the explosive number of terms in
$$\widetilde\pi_k(\theta_k^*\mid x) = \frac{1}{T\,k!} \sum_{\sigma\in\mathfrak{S}_k} \sum_{t=1}^{T} \pi_k\big(\sigma(\theta_k^*)\mid x, z_k^{(t)}\big)$$
when most terms are equal to zero...
Iterative bridge sampling:
$$\widehat{\mathcal{E}}^{(t)}(k) = \widehat{\mathcal{E}}^{(t-1)}(k)\;
\frac{\displaystyle M_1^{-1} \sum_{l=1}^{M_1} \frac{\widehat\pi(\tilde\theta_l\mid x)}{M_1\, q(\tilde\theta_l) + M_2\, \widehat\pi(\tilde\theta_l\mid x)}}
{\displaystyle M_2^{-1} \sum_{m=1}^{M_2} \frac{q(\hat\theta_m)}{M_1\, q(\hat\theta_m) + M_2\, \widehat\pi(\hat\theta_m\mid x)}}$$
[Frühwirth-Schnatter, 2004]
More efficient sampling
Iterative bridge sampling (cont’d), where
$$q(\theta) = \frac{1}{J_1} \sum_{j=1}^{J_1} p\big(\theta\mid z^{(j)}\big) \prod_{i=1}^{k} p\big(\xi_i \mid \xi_{ij}^{(j)}, \xi_{ij}^{(j-1)}, z^{(j)}, x\big)$$
[Frühwirth-Schnatter, 2004]
More efficient sampling
Iterative bridge sampling (cont’d), or where
$$q(\theta) = \frac{1}{k!} \sum_{\sigma\in\mathfrak{S}(k)} p\big(\theta\mid \sigma(z^{o})\big) \prod_{i=1}^{k} p\big(\xi_i \mid \sigma(\xi_{ij}^{o}), \sigma(\xi_{ij}^{o}), \sigma(z^{o}), x\big)$$
[Frühwirth-Schnatter, 2004]
Further efficiency
After de-switching (un-switching?), representation of the importance function as
$$q(\theta) = \frac{1}{J\,k!} \sum_{j=1}^{J} \sum_{\sigma\in\mathfrak{S}_k} \pi\big(\theta\mid \sigma(\varphi^{(j)}), x\big) = \frac{1}{k!} \sum_{\sigma\in\mathfrak{S}_k} h_\sigma(\theta)$$
where hσ is associated with a particular mode of q
Assuming generations
$$(\theta^{(1)}, \ldots, \theta^{(T)}) \sim h_{\sigma_c}(\theta),$$
how many of the hσ(θ^{(t)}) are non-zero?
Sparsity for the sum
Contribution of each term relative to q(θ)
$$\eta_\sigma(\theta) = \frac{h_\sigma(\theta)}{k!\,q(\theta)} = \frac{h_{\sigma_i}(\theta)}{\sum_{\sigma\in\mathfrak{S}_k} h_\sigma(\theta)}$$
and importance of permutation σ evaluated by
$$\widehat{\mathbb{E}}_{h_{\sigma_c}}\big[\eta_{\sigma_i}(\theta)\big] = \frac{1}{M} \sum_{l=1}^{M} \eta_{\sigma_i}\big(\theta^{(l)}\big), \qquad \theta^{(l)} \sim h_{\sigma_c}(\theta)$$
The approximate set A(k) ⊆ S(k) consists of [σ1, · · · , σn] for the smallest n that satisfies the condition
$$\widehat\varphi_n = \frac{1}{M} \sum_{l=1}^{M} \Big| \tilde q_n\big(\theta^{(l)}\big) - q\big(\theta^{(l)}\big) \Big| < \tau$$
Dual importance sampling with approximation (DIS2A)
1. Randomly select {z^{(j)}, θ^{(j)}}_{j=1}^{J} from the Gibbs sample and un-switch;
   construct q(θ)
2. Choose h_{σ_c}(θ) and generate particles {θ^{(t)}}_{t=1}^{T} ∼ h_{σ_c}(θ)
3. Construction of the approximation q̃(θ) using a first M-sample
   3.1 Compute Ê_{h_{σ_c}}[η_{σ_1}(θ)], · · · , Ê_{h_{σ_c}}[η_{σ_{k!}}(θ)]
   3.2 Reorder the σ's such that Ê_{h_{σ_c}}[η_{σ_1}(θ)] ≥ · · · ≥ Ê_{h_{σ_c}}[η_{σ_{k!}}(θ)]
   3.3 Initially set n = 1 and compute the q̃_n(θ^{(t)})'s and φ̂_n. If φ̂_n < τ,
       go to Step 4. Otherwise increase n to n + 1
4. Replace q(θ^{(1)}), . . . , q(θ^{(T)}) with q̃(θ^{(1)}), . . . , q̃(θ^{(T)}) to estimate Ê
[Lee & X, 2014]
Illustrations

Fishery data
k    k!    |A(k)|     ∆(A)
3     6     1.0000    0.1675
4    24     2.7333    0.1148

Galaxy data
k    k!    |A(k)|     ∆(A)
3     6     1.0000    0.1675
4    24    15.7000    0.6545
6   720   298.1200    0.4146

Table: Mean estimates of approximate set sizes, |A(k)|, and the reduction rate of the number of evaluated h-terms, ∆(A), for (a) the fishery and (b) the galaxy datasets
Sequential importance sampling [1]
Tempered sequence of targets (t = 1, . . . , T)
$$\pi_{kt}(\theta_k) \propto p_{kt}(\theta_k) = \pi_k(\theta_k)\, f_k(x\mid\theta_k)^{\lambda_t}, \qquad \lambda_1 = 0 < \cdots < \lambda_T = 1$$
particles (simulations) (i = 1, . . . , Nt)
$$\theta_t^{i} \overset{\text{i.i.d.}}{\sim} \pi_{kt}(\theta_k)$$
usually obtained by an MCMC step
$$\theta_t^{i} \sim K_t(\theta_{t-1}^{i}, \theta)$$
with importance weights (i = 1, . . . , Nt)
$$\omega_t^{i} = f_k(x\mid\theta_k)^{\lambda_t - \lambda_{t-1}}$$
[Del Moral et al., 2006; Buchholz et al., 2021]
Sequential importance sampling [1]
Tempered sequence of targets (t = 1, . . . , T)
$$\pi_{kt}(\theta_k) \propto p_{kt}(\theta_k) = \pi_k(\theta_k)\, f_k(x\mid\theta_k)^{\lambda_t}, \qquad \lambda_1 = 0 < \cdots < \lambda_T = 1$$
Produces an approximation of the evidence
$$\widehat Z_k = \prod_{t} \frac{1}{N_t} \sum_{i=1}^{N_t} \omega_t^{i}$$
[Del Moral et al., 2006; Buchholz et al., 2021]
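A compact sketch of such a tempered SMC evidence estimator, with a fixed linear temperature ladder, multinomial resampling, and a single random-walk Metropolis move per particle (all illustrative simplifications; `log_prior`, `log_like` and `prior_sampler` are assumed user-supplied):

```python
# Tempered SMC sketch for the evidence Z_k (generic log-prior / log-likelihood;
# the random-walk move and the temperature ladder are illustrative choices).
import numpy as np

def smc_evidence(log_prior, log_like, prior_sampler,
                 n_particles=500, n_temps=30, step=0.3, rng=None):
    rng = np.random.default_rng(rng)
    lambdas = np.linspace(0.0, 1.0, n_temps)
    theta = prior_sampler(n_particles)            # (N, d) draws from the prior (lambda_1 = 0)
    ll = np.array([log_like(th) for th in theta])
    log_Z = 0.0
    for lam_prev, lam in zip(lambdas[:-1], lambdas[1:]):
        # incremental weights  w_i = f(x | theta_i)^(lam - lam_prev)
        logw = (lam - lam_prev) * ll
        log_Z += np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
        # multinomial resampling
        w = np.exp(logw - logw.max()); w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        theta, ll = theta[idx], ll[idx]
        # one random-walk Metropolis move per particle, targeting pi(theta) f(x|theta)^lam
        prop = theta + step * rng.standard_normal(theta.shape)
        ll_prop = np.array([log_like(th) for th in prop])
        lp_prop = np.array([log_prior(th) for th in prop])
        lp_curr = np.array([log_prior(th) for th in theta])
        log_acc = (lp_prop + lam * ll_prop) - (lp_curr + lam * ll)
        accept = np.log(rng.random(n_particles)) < log_acc
        theta[accept], ll[accept] = prop[accept], ll_prop[accept]
    return log_Z                                  # log of prod_t (1/N) sum_i w_i^t
```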
Rethinking Chib’s solution
Alternate Rao-Blackwellisation by marginalising into partitions
Apply candidate’s/Chib’s formula to a chosen partition:
$$m_k(x) = \frac{f_k(x\mid \mathcal{C}_0)\, \pi_k(\mathcal{C}_0)}{\pi_k(\mathcal{C}_0\mid x)}$$
with
$$\pi_k(\mathcal{C}(z)) = \frac{k!}{(k - k_+)!}\, \frac{\Gamma\big(\sum_{j=1}^{k} \alpha_j\big)}{\Gamma\big(\sum_{j=1}^{k} \alpha_j + n\big)} \prod_{j=1}^{k} \frac{\Gamma(n_j + \alpha_j)}{\Gamma(\alpha_j)}$$
where
C(z) is the partition of {1, . . . , n} induced by the cluster membership z,
$n_j = \sum_{i=1}^{n} \mathbb{I}_{\{z_i = j\}}$ is the number of observations assigned to cluster j, and
$k_+ = \sum_{j=1}^{k} \mathbb{I}_{\{n_j > 0\}}$ is the number of non-empty clusters
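A sketch evaluating log π_k(C(z)) above under a symmetric Dirichlet weight prior (α_j ≡ a, an illustrative restriction):

```python
# log prior probability of the partition C(z) with symmetric Dirichlet(a,...,a) weights (sketch).
import numpy as np
from scipy.special import gammaln

def log_partition_prior(z, k, a=1.0):
    z = np.asarray(z)                              # integer cluster labels
    n = len(z)
    counts = np.bincount(z, minlength=k).astype(float)
    counts = counts[counts > 0]                    # n_j over the non-empty clusters
    k_plus = len(counts)
    return (gammaln(k + 1) - gammaln(k - k_plus + 1)     # k! / (k - k_+)!
            + gammaln(k * a) - gammaln(k * a + n)        # Gamma(sum a_j) / Gamma(sum a_j + n)
            + np.sum(gammaln(counts + a) - gammaln(a)))  # prod_j Gamma(n_j + a) / Gamma(a)
```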
Rethinking Chib’s solution
Under conjugate priors G0 on θ,
$$f_k(x\mid \mathcal{C}(z)) = \prod_{j=1}^{k} \underbrace{\int_{\Theta} \prod_{i:\, z_i = j} f(x_i\mid\theta)\, G_0(d\theta)}_{m(\mathcal{C}_j(z))}$$
and
$$\widehat\pi_k(\mathcal{C}^0\mid x) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{\mathcal{C}^0 \equiv \mathcal{C}(z^{(t)})}$$
▶ considerably lower computational demand
▶ no label switching issue
Sequential importance sampling [2]
For conjugate priors, (marginal) particle filter representation of a proposal:
$$\pi^*(z\mid x) = \pi(z_1\mid x_1) \prod_{i=2}^{n} \pi(z_i\mid x_{1:i}, z_{1:i-1})$$
with importance weight
$$\frac{\pi(z\mid x)}{\pi^*(z\mid x)} = \frac{\pi(x, z)}{m(x)}\, \frac{m(x_1)}{\pi(z_1, x_1)}\, \frac{m(z_1, x_1, x_2)}{\pi(z_1, x_1, z_2, x_2)} \cdots \frac{\pi(z_{1:n-1}, x)}{\pi(z, x)} = \frac{w(z, x)}{m(x)}$$
leading to an unbiased estimator of the evidence
$$\widehat Z_k(x) = \frac{1}{T} \sum_{t=1}^{T} w\big(z^{(t)}, x\big)$$
[Kong, Liu & Wong, 1994; Carvalho et al., 2010]
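A sketch of this sequential scheme for a k-component normal mean mixture with a symmetric Dirichlet prior on the weights and conjugate normal priors on the means (all modelling choices here are illustrative assumptions, not taken from the slides):

```python
# SIS sketch for a k-component normal mean mixture with conjugate priors:
# weights ~ Dirichlet(a,...,a), mu_j ~ N(0, tau^2), known unit variance.
import numpy as np

def sis_evidence(x, k, a=1.0, tau=2.0, n_particles=1000, rng=None):
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    log_w = np.zeros(n_particles)
    for t in range(n_particles):
        counts = np.zeros(k)                  # n_j: past allocations to cluster j
        sums = np.zeros(k)                    # sum of observations allocated to j
        logw = 0.0
        for i, xi in enumerate(x):
            # Dirichlet-multinomial predictive for z_i given z_{1:i-1}
            log_prior = np.log(counts + a) - np.log(i + k * a)
            # normal posterior predictive of x_i given past allocations to j
            post_var = 1.0 / (1.0 / tau**2 + counts)
            post_mean = post_var * sums
            pred_var = 1.0 + post_var
            log_like = (-0.5 * np.log(2 * np.pi * pred_var)
                        - 0.5 * (xi - post_mean)**2 / pred_var)
            log_joint = log_prior + log_like
            c = log_joint.max()
            probs = np.exp(log_joint - c)
            norm = probs.sum()
            logw += c + np.log(norm)          # incremental weight = normalising constant
            j = rng.choice(k, p=probs / norm) # sample z_i from pi(z_i | x_{1:i}, z_{1:i-1})
            counts[j] += 1
            sums[j] += xi
        log_w[t] = logw
    # log of the unbiased estimate (1/T) sum_t w(z^(t), x)
    return np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
```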
Galactic illustration
[Figure: boxplots of evidence estimates m(y) for the Galaxy dataset with K = 3, 5, 6, 8 components, comparing Chib, Chib with (random) permutations, bridge sampling, adaptive SMC, SIS, Chib over partitions, and the arithmetic mean estimator]
Galactic illustration
[Figure: log(MSE) versus log(time) for ChibPartitions, Bridge Sampling, ChibPerm, SIS and SMC]

Inferring the number of components: dream or reality?

  • 1. Inferring the number of mixture components: dream or reality? Christian P. Robert Université Paris-Dauphine, Paris & University of Warwick, Coventry Statistical methods and models for complex data 24 Sept. 2022 Padova800
  • 2. Outline early Gibbs sampling weakly informative priors imperfect sampling Bayes factor even less informative prior
  • 3. Mixture models Convex combination of densities x ∼ fj with probability pj , for j = 1, 2, . . . , k, with overall density p1f1(x) + · · · + pkfk(x) . Usual case: parameterised components k X i=1 pi f (x|θi ) with n X i=1 pi = 1 where weights pi ’s are distinguished from component parameters θi
  • 4. Motivations I Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering) or on the parameters (θi , pi ) or on the number of components I Semiparametric perspective where mixtures are functional basis approximations of unknown distributions I Nonparametric perspective where number of components infinite (e.g., Dirichlet process mixtures)
  • 5. Motivations I Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering) or on the parameters (θi , pi ) or on the number of components I Semiparametric perspective where mixtures are functional basis approximations of unknown distributions I Nonparametric perspective where number of components infinite (e.g., Dirichlet process mixtures)
  • 6. Motivations I Dataset made of several latent/missing/unobserved groups/strata/subpopulations. Mixture structure due to the missing origin/allocation of each observation to a specific subpopulation/stratum. Inference on either the allocations (clustering) or on the parameters (θi , pi ) or on the number of components I Semiparametric perspective where mixtures are functional basis approximations of unknown distributions I Nonparametric perspective where number of components infinite (e.g., Dirichlet process mixtures)
  • 7. Likelihood “I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” L. Wasserman For a sample of independent random variables (x1, · · · , xn), likelihood n Y i=1 {p1f1(xi ) + · · · + pkfk(xi )} . computable [pointwise] in O(kn) time.
  • 8. Likelihood For a sample of independent random variables (x1, · · · , xn), likelihood n Y i=1 {p1f1(xi ) + · · · + pkfk(xi )} . computable [pointwise] in O(kn) time.
  • 9. Normal mean mixture Normal mixture p N(µ1, 1) + (1 − p) N(µ2, 1) with only means µi unknown Identifiability Parameters µ1 and µ2 identifiable: µ1 cannot be confused with µ2 when p is different from 0.5. Presence of atavistic mode, better understood by letting p go to 0.5
  • 10. Normal mean mixture Normal mixture p N(µ1, 1) + (1 − p) N(µ2, 1) with only means µi unknown Identifiability Parameters µ1 and µ2 identifiable: µ1 cannot be confused with µ2 when p is different from 0.5. Presence of atavistic mode, better understood by letting p go to 0.5
  • 11. Bayesian inference on mixtures For any prior π (θ, p), posterior distribution of (θ, p) available up to a multiplicative constant π(θ, p|x) ∝   n Y i=1 k X j=1 pj f (xi |θj )   π (θ, p) at a cost of order O(kn) Challenge Despite this, derivation of posterior characteristics like posterior expectations only possible in an exponential time of order O(kn)!
  • 12. Bayesian inference on mixtures For any prior π (θ, p), posterior distribution of (θ, p) available up to a multiplicative constant π(θ, p|x) ∝   n Y i=1 k X j=1 pj f (xi |θj )   π (θ, p) at a cost of order O(kn) Challenge Despite this, derivation of posterior characteristics like posterior expectations only possible in an exponential time of order O(kn)!
  • 13. Challenges from likelihood 1. Number of modes of the likelihood of order O(k!): © Maximization / exploration of posterior surface hard 2. Under exchangeable / permutation invariant priors on (θ, p) all posterior marginals are identical: © All posterior expectations of θi equal 3. Estimating the density much simpler [Marin & X, 2007]
  • 14. Challenges from likelihood 1. Number of modes of the likelihood of order O(k!): © Maximization / exploration of posterior surface hard 2. Under exchangeable / permutation invariant priors on (θ, p) all posterior marginals are identical: © All posterior expectations of θi equal 3. Estimating the density much simpler [Marin & X, 2007]
  • 15. Missing variable representation Demarginalise: Associate to each xi a missing/latent/auxiliary variable zi that indicates its component: zi |p ∼ Mk(p1, . . . , pk) and xi |zi , θ ∼ f (·|θzi ) Completed likelihood `C (θ, p|x, z) = n Y i=1 pzi f (xi |θzi ) and π(θ, p|x, z) ∝ " n Y i=1 pzi f (xi |θzi ) # π (θ, p) where z = (z1, . . . , zn)
  • 16. Missing variable representation Demarginalise: Associate to each xi a missing/latent/auxiliary variable zi that indicates its component: zi |p ∼ Mk(p1, . . . , pk) and xi |zi , θ ∼ f (·|θzi ) Completed likelihood `C (θ, p|x, z) = n Y i=1 pzi f (xi |θzi ) and π(θ, p|x, z) ∝ " n Y i=1 pzi f (xi |θzi ) # π (θ, p) where z = (z1, . . . , zn)
  • 17. Gibbs sampling for mixture models Take advantage of the missing data structure: Algorithm I Initialization: choose p(0) and θ(0) arbitrarily I Step t. For t = 1, . . . 1. Generate z (t) i (i = 1, . . . , n) from (j = 1, . . . , k) P z (t) i = j|p (t−1) j , θ (t−1) j , xi ∝ p (t−1) j f xi |θ (t−1) j 2. Generate p(t) from π(p|z(t) ), 3. Generate θ(t) from π(θ|z(t) , x). [Brooks Gelman, 1990; Diebolt X, 1990, 1994; Escobar West, 1991]
  • 18. Normal mean mixture (cont’d) −1 0 1 2 3 4 −1 0 1 2 3 4 µ1 µ 2 (a) initialised at random [X Casella, 2009]
  • 19. Normal mean mixture (cont’d) −1 0 1 2 3 4 −1 0 1 2 3 4 µ1 µ 2 (a) initialised at random −2 0 2 4 −2 0 2 4 µ1 µ 2 (b) initialised close to the lower mode [X Casella, 2009]
  • 20. Outline early Gibbs sampling weakly informative priors imperfect sampling Bayes factor even less informative prior
  • 21. weakly informative priors “Los espejos y la cópula son abominables, porque multiplican el número de los hombres.” − Jorge Luis Borges, Ficciones I possible symmetric empirical Bayes priors p ∼ D(γ, . . . , γ), θi ∼ N(^ µ, ^ ωσ2 i ), σ−2 i ∼ Ga(ν, ^ ν) which can be replaced with hierarchical priors [Diebolt X, 1990; Richardson Green, 1997] I independent improper priors on θj ’s prohibited, thus standard “flat” and Jeffreys priors impossible to use (except with the exclude-empty-component trick) [Diebolt X, 1990; Wasserman, 1999]
  • 22. weakly informative priors I reparameterization to compact set for use of uniform priors µi −→ eµi 1 + eµi , σi −→ σi 1 + σi [Chopin, 2000] I dependent weakly informative priors p ∼ D(k, . . . , 1), θi ∼ N(θi−1, ζσ2 i−1), σi ∼ U([0, σi−1]) [Mengersen X, 1996; X Titterington, 1998] I reference priors p ∼ D(1, . . . , 1), θi ∼ N(µ0, (σ2 i + τ2 0)/2), σ2 i ∼ C+ (0, τ2 0) [Moreno Liseo, 1999]
  • 23. Ban on improper priors Difficult to use improper priors in the setting of mixtures because independent improper priors, π (θ) = k Y i=1 πi (θi ) , with Z πi (θi )dθi = ∞ end up, for all n’s, with the property Z π(θ, p|x)dθdp = ∞ Reason There are (k − 1)n terms among the kn terms in the expansion that allocate no observation at all to the i-th component
  • 24. Ban on improper priors Difficult to use improper priors in the setting of mixtures because independent improper priors, π (θ) = k Y i=1 πi (θi ) , with Z πi (θi )dθi = ∞ end up, for all n’s, with the property Z π(θ, p|x)dθdp = ∞ Reason There are (k − 1)n terms among the kn terms in the expansion that allocate no observation at all to the i-th component
  • 25. Estimating k When k is unknown, setting a prior on k leads to a mixture of finite mixtures (MFM) K ∼ pK pmf over N∗ Consistent estimation when pK puts mass on every integer [Nobile, 1994; Miller Harrison, 2018] Implementation by I reversible jump [Richardson Green, 1997; Frühwirth-Schnatter, 2011] I saturation by superfluous components [Rousseau Mengersen, 2011] I Bayesian model comparison [Berkhof, Mechelen, Gelman, 2003; Lee X, 2018] I cluster estimation [Malsiner-Walli, Frühwirth-Schnatter Grün, 2016]
• 27. Clusters versus components
"In the mixture of finite mixtures, it is perhaps intuitively clear that, under the prior at least, the number of clusters behaves very similarly to the number of components K when n is large. It turns out that under the posterior they also behave very similarly for large n." − Miller & Harrison (2018)
However Miller & Harrison (2013, 2014) showed that the Dirichlet process mixture posterior on the number of clusters is typically not consistent for the number of components
• 29. Dirichlet process mixture (DPM)
Extension to the k = ∞ (non-parametric) case
    x_i \mid z_i, \theta \overset{\text{i.i.d.}}{\sim} f(x_i \mid \theta_{z_i}), \qquad i = 1, \ldots, n
    P(Z_i = k) = \pi_k, \qquad k = 1, 2, \ldots
    \pi_1, \pi_2, \ldots \sim \text{GEM}(M), \qquad M \sim \pi(M)
    \theta_1, \theta_2, \ldots \overset{\text{i.i.d.}}{\sim} G_0
with GEM (Griffiths-Engen-McCloskey) defined by the stick-breaking representation
    \pi_k = v_k \prod_{i=1}^{k-1} (1 - v_i), \qquad v_i \sim \text{Beta}(1, M)
[Sethuraman, 1994]
• 30. Dirichlet process mixture (DPM)
Resulting in an infinite mixture
    x \sim \prod_{i=1}^{n} \sum_{j=1}^{\infty} \pi_j\, f(x_i \mid \theta_j)
with (prior) cluster allocation
    \pi(z \mid M) = \frac{\Gamma(M)}{\Gamma(M + n)}\, M^{K_+} \prod_{j=1}^{K_+} \Gamma(n_j)
and conditional likelihood
    p(x \mid z, M) = \prod_{j=1}^{K_+} \int \prod_{i: z_i = j} f(x_i \mid \theta_j)\, dG_0(\theta_j),
available in closed form when G_0 is conjugate
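A minimal sketch of the stick-breaking construction above, truncated at a finite level so that it can be simulated; the truncation level, the N(0, 1) base measure G0 and the unit-variance Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def stick_breaking_weights(M, truncation=1000, rng=None):
    """pi_1, pi_2, ... ~ GEM(M) via the (truncated) stick-breaking representation."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, M, size=truncation)                       # v_i ~ Beta(1, M)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # pi_k = v_k prod_{i<k} (1 - v_i)
    return pi

def sample_dpm_prior(n, M=1.0, truncation=1000, rng=None):
    """Simulate n draws from a DPM of N(theta, 1) kernels with base measure G0 = N(0, 1)."""
    rng = np.random.default_rng(rng)
    pi = stick_breaking_weights(M, truncation, rng)
    theta = rng.normal(0.0, 1.0, size=truncation)               # atoms theta_k ~ G0
    z = rng.choice(truncation, size=n, p=pi / pi.sum())         # cluster allocations
    return rng.normal(theta[z], 1.0), z
```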
• 31. Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
• 32. Label switching paradox
▶ We should observe the exchangeability of the components [label switching] to conclude about convergence of the Gibbs sampler. [Holmes, Jasra & Stephens, 2005]
▶ If we observe it, then the marginals are useless for estimating the parameters. [Frühwirth-Schnatter, 2001, 2004; Green, 2019]
▶ If we do not, then we are uncertain about the convergence!!! [Celeux, Hurn & X, 2000]
• 35. Constraints
Usual reply to lack of identifiability: impose constraints like
    \mu_1 \leq \cdots \leq \mu_k
in the prior.
Mostly incompatible with the topology of the posterior surface: posterior expectations then depend on the choice of the constraints.
Computational "detail"
The constraint need not be imposed during the simulation but can instead be imposed after simulation, by reordering the MCMC output according to the constraints. [This avoids possible negative effects on convergence]
• 38. Relabeling towards the mode
Selection of one of the k! modal regions of the posterior, post-simulation, by computing the approximate MAP
    (\theta, p)^{(i^*)} \qquad \text{with} \qquad i^* = \arg\max_{i=1,\ldots,M}\; \pi\big((\theta, p)^{(i)} \mid x\big)
Pivotal Reordering
At iteration i ∈ {1, . . . , M},
  1. Compute the optimal permutation
       \tau_i = \arg\min_{\tau \in \mathfrak{S}_k}\; d\Big(\tau\big\{(\theta^{(i)}, p^{(i)})\big\},\, (\theta^{(i^*)}, p^{(i^*)})\Big),
     where d(·, ·) is a distance in the parameter space.
  2. Set (\theta^{(i)}, p^{(i)}) = \tau_i\big((\theta^{(i)}, p^{(i)})\big).
[Celeux, 1998; Stephens, 2000; Celeux, Hurn & X, 2000]
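A minimal sketch of pivotal reordering applied to stored MCMC output, assuming the draws are kept as (M, k) arrays of means and weights and using a Euclidean distance; the array layout and the distance are assumptions, and the exhaustive search over the k! permutations is only intended for small k.

```python
import numpy as np
from itertools import permutations

def pivotal_reorder(theta, p, log_post):
    """Relabel every stored draw towards the approximate MAP draw.

    theta, p : arrays of shape (M, k) holding the M saved draws
    log_post : array of shape (M,) with log pi((theta, p)^(i) | x), up to a constant
    """
    M, k = theta.shape
    i_star = int(np.argmax(log_post))                        # approximate MAP draw, the pivot
    pivot = np.concatenate([theta[i_star], p[i_star]])
    perms = [np.array(s) for s in permutations(range(k))]    # exhaustive: fine for small k
    for i in range(M):
        # keep the relabelling of draw i closest (in Euclidean distance) to the pivot
        best = min(perms, key=lambda s: np.sum(
            (np.concatenate([theta[i, s], p[i, s]]) - pivot) ** 2))
        theta[i], p[i] = theta[i, best], p[i, best]
    return theta, p
```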
• 40. Loss functions for mixture estimation
Global loss function that considers the distance between predictives,
    L(\xi, \hat\xi) = \int_{\mathcal{X}} f_\xi(x)\, \log\big\{ f_\xi(x) \big/ f_{\hat\xi}(x) \big\}\, dx,
eliminates the labelling effect
Similar solution for estimating clusters through the allocation variables
    L(z, \hat z) = \sum_{i,j} \Big( \mathbb{I}_{[z_i = z_j]}\big(1 - \mathbb{I}_{[\hat z_i = \hat z_j]}\big) + \mathbb{I}_{[\hat z_i = \hat z_j]}\big(1 - \mathbb{I}_{[z_i = z_j]}\big) \Big)
[Celeux, Hurn & X, 2000]
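A small sketch of the allocation-based loss above, counting the pairs on which two clusterings disagree about being grouped together; purely illustrative.

```python
import numpy as np

def allocation_loss(z, z_hat):
    """Pairwise-disagreement loss L(z, z_hat) between two cluster allocations."""
    z, z_hat = np.asarray(z), np.asarray(z_hat)
    same = z[:, None] == z[None, :]               # I[z_i = z_j]
    same_hat = z_hat[:, None] == z_hat[None, :]   # I[z_hat_i = z_hat_j]
    return int(np.sum(same != same_hat))          # counts both kinds of mismatch over all (i, j)
```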
• 41. Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
• 42. Bayesian model choice
Comparison of models Mi by Bayesian means: probabilise the entire model/parameter space
▶ allocate probabilities p_i to all models M_i
▶ define priors π_i(θ_i) for each parameter space Θ_i
▶ compute
    \pi(M_i \mid x) = \frac{p_i \int_{\Theta_i} f_i(x \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i}{\sum_j p_j \int_{\Theta_j} f_j(x \mid \theta_j)\, \pi_j(\theta_j)\, d\theta_j}
Computational difficulty on its own
[Chen, Shao & Ibrahim, 2000; Marin & X, 2007]
• 43. Bayesian model choice
Comparison of models Mi by Bayesian means relies on a central notion: the evidence
    Z_k = \int_{\Theta_k} \pi_k(\theta_k)\, L_k(\theta_k)\, d\theta_k,
aka the marginal likelihood
Computational difficulty on its own
[Chen, Shao & Ibrahim, 2000; Marin & X, 2007]
• 44. Validation in mixture setting
"In principle, the Bayes factor for the MFM versus the DPM could be used as an empirical criterion for choosing between the two models, and in fact, it is quite easy to compute an approximation to the Bayes factor using importance sampling" − Miller & Harrison (2018)
Bayes factor consistent for selecting the number of components
[Nobile, 1994; Ishwaran et al., 2001; Casella & Moreno, 2009; Chib and Kuffner, 2016]
Bayes factor consistent for testing parametric versus nonparametric alternatives
[Verdinelli & Wasserman, 1997; Dass & Lee, 2004; McVinish et al., 2009]
• 45. Consistent evidence for location DPM
Consistency of the Bayes factor comparing finite mixtures against a (location) Dirichlet process mixture
[Figure: log(BF) against sample size n; (left) data from a finite mixture, (right) data not from a finite mixture]
• 46. Consistent evidence for location DPM
Under generic assumptions, when x_1, \ldots, x_n \overset{\text{iid}}{\sim} f_{P_0} with
    P_0 = \sum_{j=1}^{k_0} p_j^0\, \delta_{\theta_j^0}
and a Dirichlet process DP(M, G_0) prior on P, there exists t > 0 such that, for all ε > 0,
    P_{f_0}\Big( m_{DP}(x) \geq \varepsilon\, n^{-(k_0 - 1 + d k_0 + t)/2} \Big) = o(1)
Moreover there exists q > 0 such that
    \Pi_{DP}\Big( \| f_0 - f_P \|_1 \leq (\log n)^q / \sqrt{n} \;\Big|\; x \Big) = 1 + o_{P_{f_0}}(1).
[Hairault, X & Rousseau, 2022]
• 51. Chib's representation
Direct application of Bayes' theorem: given x ∼ f_k(x | θ_k) and θ_k ∼ π_k(θ_k),
    Z_k = m_k(x) = \frac{f_k(x \mid \theta_k)\, \pi_k(\theta_k)}{\pi_k(\theta_k \mid x)},
an identity holding for any value of θ_k. Replace the posterior density with an approximation:
    \hat Z_k = \hat m_k(x) = \frac{f_k(x \mid \theta_k^*)\, \pi_k(\theta_k^*)}{\hat\pi_k(\theta_k^* \mid x)}.
[Chib, 1995]
• 53. Case of latent variables
For missing variables z as in mixture models, natural Rao-Blackwell estimate
    \hat\pi_k(\theta_k^* \mid x) = \frac{1}{T} \sum_{t=1}^{T} \pi_k\big(\theta_k^* \mid x, z_k^{(t)}\big),
where the z_k^{(t)}'s are Gibbs-sampled latent variables
[Diebolt & X, 1990; Chib, 1995]
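A generic sketch combining Chib's identity with the Rao-Blackwell estimate above; the three callables passed in (log-likelihood, log-prior, closed-form conditional posterior) are assumptions standing in for whichever mixture model is used.

```python
import numpy as np
from scipy.special import logsumexp

def chib_log_evidence(log_lik, log_prior, log_cond_post, theta_star, z_draws):
    """log m(x) ~ log f(x|theta*) + log pi(theta*) - log[(1/T) sum_t pi(theta*|x, z^(t))].

    log_lik(theta)          : log f_k(x | theta)
    log_prior(theta)        : log pi_k(theta)
    log_cond_post(theta, z) : log pi_k(theta | x, z), available in closed form
    theta_star              : high-posterior-density point, e.g. an approximate MAP
    z_draws                 : Gibbs-sampled allocation vectors z^(1), ..., z^(T)
    """
    log_terms = np.array([log_cond_post(theta_star, z) for z in z_draws])
    log_post_at_star = logsumexp(log_terms) - np.log(len(z_draws))   # Rao-Blackwell estimate
    return log_lik(theta_star) + log_prior(theta_star) - log_post_at_star
```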
• 54. Compensation for label switching
For mixture models, the z_k^{(t)}'s usually fail to visit all configurations in a balanced way, despite the symmetry predicted by the theory.
Consequence: the numerical approximation is biased, by a factor of order k!
Recover the theoretical symmetry by using
    \tilde\pi_k(\theta_k^* \mid x) = \frac{1}{T\, k!} \sum_{\sigma \in \mathfrak{S}_k} \sum_{t=1}^{T} \pi_k\big(\sigma(\theta_k^*) \mid x, z_k^{(t)}\big)
over all permutations σ in \mathfrak{S}_k, the set of all permutations of {1, . . . , k}
[Berkhof, Mechelen & Gelman, 2003]
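A sketch of the symmetrised version of the previous estimate, reusing the hypothetical log_cond_post callable and assuming a permute helper that applies a component relabelling σ to θ*.

```python
import numpy as np
from itertools import permutations
from scipy.special import logsumexp

def symmetrised_log_post(log_cond_post, permute, theta_star, z_draws, k):
    """Estimate log pi_k(theta* | x) by averaging over the draws and all k! relabellings of theta*."""
    log_terms = [log_cond_post(permute(theta_star, sigma), z)
                 for sigma in permutations(range(k))
                 for z in z_draws]
    return logsumexp(log_terms) - np.log(len(log_terms))   # len(log_terms) = T * k!
```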
• 56. Galaxy dataset (k)
Using Chib's estimate, with θ*_k the MAP estimator, log(Ẑ_k(x)) = −105.1396 for k = 3, while introducing the permutations leads to log(Ẑ_k(x)) = −103.3479
Note that −105.1396 + log(3!) = −103.3479

  k               2        3        4        5        6        7        8
  log Ẑ_k(x)   -115.68  -103.35  -102.66  -101.93  -102.88  -105.48  -108.44

Estimates of the marginal likelihoods by the symmetrised Chib's approximation (based on 10^5 Gibbs iterations and, for k > 5, 100 permutations selected at random in S_k).
[Lee et al., 2008]
• 59. More efficient sampling
Difficulty with the explosive number of terms in
    \tilde\pi_k(\theta_k^* \mid x) = \frac{1}{T\, k!} \sum_{\sigma \in \mathfrak{S}_k} \sum_{t=1}^{T} \pi_k\big(\sigma(\theta_k^*) \mid x, z_k^{(t)}\big)
when most terms are equal to zero...
Iterative bridge sampling:
    \hat{\mathcal{E}}^{(t)}(k) = \hat{\mathcal{E}}^{(t-1)}(k)\;
       \frac{M_1^{-1} \sum_{l=1}^{M_1} \hat\pi(\tilde\theta_l \mid x) \big/ \big[ M_1\, q(\tilde\theta_l) + M_2\, \hat\pi(\tilde\theta_l \mid x) \big]}
            {M_2^{-1} \sum_{m=1}^{M_2} q(\hat\theta_m) \big/ \big[ M_1\, q(\hat\theta_m) + M_2\, \hat\pi(\hat\theta_m \mid x) \big]}
[Frühwirth-Schnatter, 2004]
where
    q(\theta) = \frac{1}{J_1} \sum_{j=1}^{J_1} p\big(\theta \mid z^{(j)}\big) \prod_{i=1}^{k} p\big(\xi_i \mid \xi_{<i}^{(j)}, \xi_{>i}^{(j-1)}, z^{(j)}, x\big)
or where
    q(\theta) = \frac{1}{k!} \sum_{\sigma \in \mathfrak{S}_k} p\big(\theta \mid \sigma(z^{o})\big) \prod_{i=1}^{k} p\big(\xi_i \mid \sigma(\xi_{<i}^{o}), \sigma(\xi_{>i}^{o}), \sigma(z^{o}), x\big)
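For reference, a minimal sketch of the iterative (Meng-Wong) bridge-sampling fixed point that underlies the update displayed above, written with generic callables for the unnormalised posterior and the proposal q; the names and the stopping rule are illustrative assumptions, not Frühwirth-Schnatter's implementation.

```python
import numpy as np

def bridge_sampling_evidence(log_post_unnorm, log_q, post_draws, q_draws,
                             n_iter=100, tol=1e-10):
    """Iterative bridge-sampling estimate of the evidence m(x).

    log_post_unnorm : callable, log f(x|theta) + log pi(theta)
    log_q           : callable, log of the normalised proposal density q
    post_draws      : draws from the posterior (e.g. relabelled MCMC output)
    q_draws         : draws from the proposal q
    """
    n1, n2 = len(post_draws), len(q_draws)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    # ratios (unnormalised posterior) / q at the two sets of draws
    l1 = np.exp([log_post_unnorm(t) - log_q(t) for t in post_draws])
    l2 = np.exp([log_post_unnorm(t) - log_q(t) for t in q_draws])
    z = 1.0
    for _ in range(n_iter):
        num = np.mean(l2 / (s1 * l2 + s2 * z))     # average over proposal draws
        den = np.mean(1.0 / (s1 * l1 + s2 * z))    # average over posterior draws
        z_new = num / den
        if abs(z_new - z) <= tol * abs(z):
            break
        z = z_new
    return z
```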
• 62. Further efficiency
After de-switching (un-switching?), representation of the importance function as
    q(\theta) = \frac{1}{J\, k!} \sum_{j=1}^{J} \sum_{\sigma \in \mathfrak{S}_k} \pi\big(\theta \mid \sigma(\varphi^{(j)}), x\big) = \frac{1}{k!} \sum_{\sigma \in \mathfrak{S}_k} h_\sigma(\theta)
where h_σ is associated with a particular mode of q
Assuming generations (θ^{(1)}, . . . , θ^{(T)}) ∼ h_{σ_c}(θ), how many of the h_σ(θ^{(t)})'s are non-zero?
• 63. Sparsity for the sum
Contribution of each term relative to q(θ):
    \eta_\sigma(\theta) = \frac{h_\sigma(\theta)}{k!\, q(\theta)} = \frac{h_\sigma(\theta)}{\sum_{\sigma' \in \mathfrak{S}_k} h_{\sigma'}(\theta)}
and importance of permutation σ evaluated by
    \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_\sigma(\theta)] = \frac{1}{M} \sum_{l=1}^{M} \eta_\sigma\big(\theta^{(l)}\big), \qquad \theta^{(l)} \sim h_{\sigma_c}(\theta)
The approximate set A(k) ⊆ \mathfrak{S}_k consists of [σ_1, · · · , σ_n] for the smallest n such that the estimated cumulated contribution
    \hat\phi_n = \frac{1}{M} \sum_{l=1}^{M} \sum_{i=1}^{n} \eta_{\sigma_i}\big(\theta^{(l)}\big)
satisfies the threshold condition \hat\phi_n > τ
• 70. Dual importance sampling with approximation (DIS2A)
1. Randomly select {z^{(j)}, θ^{(j)}}_{j=1}^{J} from the Gibbs sample and un-switch; construct q(θ)
2. Choose h_{σ_c}(θ) and generate particles {θ^{(t)}}_{t=1}^{T} ∼ h_{σ_c}(θ)
3. Construct the approximation q̃(θ) using a first M-sample:
   3.1 Compute \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\theta)], . . . , \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\theta)]
   3.2 Reorder the σ's so that \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_1}(\theta)] \geq \cdots \geq \hat{\mathbb{E}}_{h_{\sigma_c}}[\eta_{\sigma_{k!}}(\theta)]
   3.3 Initially set n = 1 and compute the q̃_n(θ^{(t)})'s and \hat\phi_n. If \hat\phi_n > τ, go to Step 4; otherwise increase n to n + 1
4. Replace q(θ^{(1)}), . . . , q(θ^{(T)}) with q̃(θ^{(1)}), . . . , q̃(θ^{(T)}) to estimate \hat{\mathcal{E}}
[Lee & X, 2014]
• 71. Illustrations

Fishery data
  k    k!    |A(k)|    ∆(A)
  3    6     1.0000    0.1675
  4    24    2.7333    0.1148

Galaxy data
  k    k!    |A(k)|      ∆(A)
  3    6     1.000       0.1675
  4    24    15.7000     0.6545
  6    720   298.1200    0.4146

Table: mean estimates of the approximate set sizes |A(k)| and of the reduction rate ∆(A) in the number of evaluated h-terms, for (a) the fishery and (b) the galaxy datasets
• 72. Sequential importance sampling [1]
Tempered sequence of targets (t = 1, . . . , T)
    \pi_{kt}(\theta_k) \propto p_{kt}(\theta_k) = \pi_k(\theta_k)\, f_k(x \mid \theta_k)^{\lambda_t}, \qquad \lambda_1 = 0 < \cdots < \lambda_T = 1
particles (simulations) (i = 1, . . . , N_t)
    \theta_t^{i} \overset{\text{i.i.d.}}{\sim} \pi_{kt}(\theta_k),
usually obtained by an MCMC step θ_t^i ∼ K_t(θ_{t−1}^i, θ), with importance weights (i = 1, . . . , N_t)
    \omega_t^{i} = f_k\big(x \mid \theta_{t-1}^{i}\big)^{\lambda_t - \lambda_{t-1}}
[Del Moral et al., 2006; Bucholz et al., 2021]
• 73. Sequential importance sampling [1]
The tempered sequence produces an approximation of the evidence,
    \hat Z_k = \prod_{t} \frac{1}{N_t} \sum_{i=1}^{N_t} \omega_t^{i}
[Del Moral et al., 2006; Bucholz et al., 2021]
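A schematic sketch of the tempered SIS/SMC evidence estimate above, with user-supplied log-likelihood and log-prior callables, multinomial resampling and one random-walk Metropolis move per temperature; the temperature ladder, proposal scale and all names are illustrative assumptions.

```python
import numpy as np

def smc_evidence(log_lik, log_prior, prior_sampler, n_particles=1000,
                 temps=None, step=0.1, rng=None):
    """Tempered-sequence estimate of Z_k = int pi_k(theta) f_k(x|theta) dtheta."""
    rng = np.random.default_rng(rng)
    temps = np.linspace(0.0, 1.0, 50) if temps is None else temps
    theta = np.array([prior_sampler(rng) for _ in range(n_particles)])
    loglik = np.array([log_lik(t) for t in theta])
    log_z = 0.0
    for lam_prev, lam in zip(temps[:-1], temps[1:]):
        # incremental weights omega_i = f(x | theta_i)^(lambda_t - lambda_{t-1})
        logw = (lam - lam_prev) * loglik
        log_z += logw.max() + np.log(np.mean(np.exp(logw - logw.max())))
        # multinomial resampling proportionally to the weights
        probs = np.exp(logw - logw.max())
        idx = rng.choice(n_particles, size=n_particles, p=probs / probs.sum())
        theta, loglik = theta[idx], loglik[idx]
        # one random-walk Metropolis move per particle, targeting pi(theta) f(x|theta)^lambda_t
        prop = theta + step * rng.normal(size=theta.shape)
        loglik_prop = np.array([log_lik(t) for t in prop])
        log_acc = (lam * (loglik_prop - loglik)
                   + np.array([log_prior(t) for t in prop])
                   - np.array([log_prior(t) for t in theta]))
        accept = np.log(rng.uniform(size=n_particles)) < log_acc
        theta[accept], loglik[accept] = prop[accept], loglik_prop[accept]
    return np.exp(log_z)
```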
• 74. Rethinking Chib's solution
Alternate Rao-Blackwellisation by marginalising into partitions: apply the candidate's (Chib's) formula to a chosen partition,
    m_k(x) = \frac{f_k(x \mid C^{0})\, \pi_k(C^{0})}{\pi_k(C^{0} \mid x)}
with
    \pi_k(C(z)) = \frac{k!}{(k - k_+)!}\; \frac{\Gamma\big(\sum_{j=1}^{k} \alpha_j\big)}{\Gamma\big(\sum_{j=1}^{k} \alpha_j + n\big)} \prod_{j=1}^{k} \frac{\Gamma(n_j + \alpha_j)}{\Gamma(\alpha_j)}
where
▶ C(z) is the partition of {1, . . . , n} induced by the cluster membership z
▶ n_j = \sum_{i=1}^{n} \mathbb{I}\{z_i = j\} is the number of observations assigned to cluster j
▶ k_+ = \sum_{j=1}^{k} \mathbb{I}\{n_j > 0\} is the number of non-empty clusters
• 75. Rethinking Chib's solution
Under a conjugate prior G_0 on θ,
    f_k(x \mid C(z)) = \prod_{j=1}^{k} \underbrace{\int_{\Theta} \prod_{i: z_i = j} f(x_i \mid \theta)\, G_0(d\theta)}_{m(C_j(z))}
and
    \hat\pi_k(C^{0} \mid x) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}_{C^{0} \equiv C(z^{(t)})}
▶ considerably lower computational demand
▶ no label switching issue
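A small sketch of the partition prior π_k(C(z)) displayed above and of the Rao-Blackwell frequency estimate of π_k(C⁰ | x); gammaln is used for numerical stability and the function names are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_partition_prior(z, k, alpha):
    """log pi_k(C(z)) under a Dirichlet D(alpha_1, ..., alpha_k) prior on the weights."""
    z, alpha = np.asarray(z), np.asarray(alpha, dtype=float)
    n = len(z)
    n_j = np.array([np.sum(z == j) for j in range(k)])
    k_plus = int(np.sum(n_j > 0))
    return (gammaln(k + 1) - gammaln(k - k_plus + 1)            # k! / (k - k_+)!
            + gammaln(alpha.sum()) - gammaln(alpha.sum() + n)
            + np.sum(gammaln(n_j + alpha) - gammaln(alpha)))

def partition_posterior_freq(z0, z_draws):
    """Rao-Blackwell estimate of pi_k(C^0 | x): fraction of draws inducing the same partition."""
    def canon(z):
        labels = {}
        # relabel clusters by order of first appearance so that partitions compare directly
        return tuple(labels.setdefault(c, len(labels)) for c in np.asarray(z).tolist())
    target = canon(z0)
    return float(np.mean([canon(z) == target for z in z_draws]))
```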
• 76. Sequential importance sampling [2]
For conjugate priors, a (marginal) particle filter representation of the proposal
    \pi^*(z \mid x) = \pi(z_1 \mid x_1) \prod_{i=2}^{n} \pi(z_i \mid x_{1:i}, z_{1:i-1})
with importance weight
    \frac{\pi(z \mid x)}{\pi^*(z \mid x)}
       = \frac{\pi(x, z)}{m(x)}\; \frac{m(x_1)}{\pi(z_1, x_1)}\; \frac{m(z_1, x_1, x_2)}{\pi(z_1, x_1, z_2, x_2)} \cdots \frac{\pi(z_{1:n-1}, x)}{\pi(z, x)}
       = \frac{w(z, x)}{m(x)}
leading to an unbiased estimator of the evidence
    \hat Z_k(x) = \frac{1}{T} \sum_{t=1}^{T} w\big(z^{(t)}, x\big)
[Kong, Liu & Wong, 1994; Carvalho et al., 2010]
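A generic sketch of the sequential allocation proposal above for a k-component mixture with Dirichlet(α) weights; the conjugate predictive is passed in as a hypothetical log_pred(x_new, x_members) callable (which must handle an empty member set), and averaging exp(log_w) over repeated draws estimates the evidence.

```python
import numpy as np

def sis_draw(x, k, alpha, log_pred, rng):
    """One sequential draw of the allocations z together with log w(z, x)."""
    x, alpha = np.asarray(x), np.asarray(alpha, dtype=float)
    n = len(x)
    z = np.empty(n, dtype=int)
    log_w = 0.0
    for i in range(n):
        # pi(z_i = j | x_1:i, z_1:i-1) proportional to (n_j + alpha_j) * predictive_j(x_i)
        log_probs = np.array([np.log(np.sum(z[:i] == j) + alpha[j])
                              + log_pred(x[i], x[:i][z[:i] == j]) for j in range(k)])
        log_norm = np.logaddexp.reduce(log_probs)
        z[i] = rng.choice(k, p=np.exp(log_probs - log_norm))
        # incremental weight: p(x_i | x_1:i-1, z_1:i-1) = exp(log_norm) / (i + sum(alpha))
        log_w += log_norm - np.log(i + alpha.sum())
    return z, log_w
```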
• 78. Galactic illustration
[Figure: log(MSE) against log(time) for ChibPartitions, Bridge Sampling, ChibPerm, SIS and SMC]
• 79. Empirical conclusions
▶ Bridge sampling, the arithmetic mean and the original Chib's method fail to scale with the sample size n
▶ Partition Chib's is increasingly variable
▶ Adaptive SMC ultimately fails
▶ SIS remains the most reliable method
[Hairault, X & Rousseau, 2022]
• 80. Approximating the DPM evidence
Extension of Chib's formula by marginalising over z and θ,
    m_{DP}(x) = \frac{p(x \mid M^*, G_0)\, \pi(M^*)}{\pi(M^* \mid x)},
and using the estimate
    \hat\pi(M^* \mid x) = \frac{1}{T} \sum_{t=1}^{T} \pi\big(M^* \mid x, \eta^{(t)}, K_+^{(t)}\big),
available provided the prior on M is a Γ(a, b) distribution, since
    M \mid x, \eta, K_+ \sim \omega\, \Gamma\big(a + K_+,\, b - \log\eta\big) + (1 - \omega)\, \Gamma\big(a + K_+ - 1,\, b - \log\eta\big)
with ω = (a + K_+ − 1)/\{n(b − \log\eta) + a + K_+ − 1\} and η | x, M ∼ Beta(M + 1, n)
[Basu & Chib, 2003]
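A small sketch of the two-component Gamma mixture conditional for M displayed above, usable both to update M within the sampler and to evaluate one term of the Rao-Blackwell average π̂(M* | x); parameter names are illustrative.

```python
import numpy as np
from scipy.stats import beta, gamma

def sample_eta(M, n, rng):
    """Auxiliary eta | x, M ~ Beta(M + 1, n)."""
    return beta.rvs(M + 1, n, random_state=rng)

def sample_M(K_plus, eta, n, a, b, rng):
    """M | x, eta, K_+ drawn from the two-component Gamma mixture."""
    rate = b - np.log(eta)
    w = (a + K_plus - 1) / (n * rate + a + K_plus - 1)
    shape = a + K_plus if rng.uniform() < w else a + K_plus - 1
    return gamma.rvs(shape, scale=1.0 / rate, random_state=rng)

def conditional_density_M(M_star, K_plus, eta, n, a, b):
    """pi(M* | x, eta, K_+), one term of the Rao-Blackwell average for pi_hat(M* | x)."""
    rate = b - np.log(eta)
    w = (a + K_plus - 1) / (n * rate + a + K_plus - 1)
    return (w * gamma.pdf(M_star, a + K_plus, scale=1.0 / rate)
            + (1 - w) * gamma.pdf(M_star, a + K_plus - 1, scale=1.0 / rate))
```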
• 81. Approximating the DPM likelihood
Intractable likelihood p(x | M^*, G_0) approximated by sequential importance sampling, generating z from the proposal
    \pi^*(z \mid x, M) = \prod_{i=1}^{n} \pi(z_i \mid x_{1:i}, z_{1:i-1}, M)
and using the approximation
    \hat L(x \mid M^*, G_0) = \frac{1}{T} \sum_{t=1}^{T} p\big(x_1 \mid z_1^{(t)}, G_0\big) \prod_{i=2}^{n} p\big(x_i \mid x_{1:i-1}, z_{1:i-1}^{(t)}, G_0\big)
[Kong, Liu & Wong, 1994; Basu & Chib, 2003]
• 82. Approximating the DPM evidence (bis)
Reverse logistic regression (RLR) applies to the DPM, with importance functions
    \pi_1(z, M) := \pi^*(z \mid x, M)\, \pi(M) \qquad \text{and} \qquad \pi_2(z, M) = \pi(z, M \mid x) = \frac{\pi(z, M, x)}{m(x)}
Given samples {z^{(1,j)}, M^{(1,j)}}_{j=1}^{T} and {z^{(2,j)}, M^{(2,j)}}_{j=1}^{T} from π_1 and π_2, the marginal likelihood m(x) is estimated as the intercept of a logistic regression with covariate log{π_1(z, M)/π̃_2(z, M)}
[Geyer, 1994; Chen & Shao, 1997]
• 84. Outline
early Gibbs sampling
weakly informative priors
imperfect sampling
Bayes factor
even less informative prior
• 85. Jeffreys priors for mixtures
True Jeffreys prior for mixtures of distributions, defined as the root of the determinant of the expected Fisher information matrix,
    \pi^J(\theta, p) \propto \big| I(\theta, p) \big|^{1/2}
▶ information matrix of dimension O(k)
▶ unavailable in closed form except in special cases
▶ unidimensional integrals approximated by Monte Carlo tools
[Grazian & X, 2015]
• 90. Difficulties
▶ complexity grows in O(k^2)
▶ significant computing requirement (reduced by delayed acceptance) [Banterle et al., 2014]
▶ differs from the component-wise Jeffreys prior [Diebolt & X, 1990; Stoneking, 2014]
▶ when is the posterior proper?
▶ how to check properness via MCMC outputs?
• 91. Further reference priors
Reparameterisation of a location-scale mixture in terms of its global mean µ and global variance σ² as
    \mu_i = \mu + \sigma\alpha_i \qquad \text{and} \qquad \sigma_i = \sigma\tau_i, \qquad 1 \leq i \leq k,
where τ_i > 0 and α_i ∈ ℝ
Induces a compact space on the other parameters:
    \sum_{i=1}^{k} p_i \alpha_i = 0 \qquad \text{and} \qquad \sum_{i=1}^{k} p_i \tau_i^2 + \sum_{i=1}^{k} p_i \alpha_i^2 = 1
The posterior associated with the prior π(µ, σ) = 1/σ is proper with Gaussian components as soon as there are at least two observations in the sample
[Kamary, Lee & X, 2018]
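A tiny sketch of the reparameterisation above, mapping (µ, σ, p, α, τ) back to the component parameters and checking that the two constraints do force µ and σ² to be the global mean and variance of the mixture; purely illustrative.

```python
import numpy as np

def to_components(mu, sigma, p, alpha, tau):
    """Map (mu, sigma, p, alpha, tau) back to the component parameters (mu_i, sigma_i)."""
    p, alpha, tau = (np.asarray(a, dtype=float) for a in (p, alpha, tau))
    mu_i, sigma_i = mu + sigma * alpha, sigma * tau
    # the two constraints force (mu, sigma^2) to be the global mean and variance of the mixture
    assert np.isclose(p @ alpha, 0.0)
    assert np.isclose(p @ tau**2 + p @ alpha**2, 1.0)
    assert np.isclose(p @ mu_i, mu)
    assert np.isclose(p @ (sigma_i**2 + mu_i**2) - (p @ mu_i)**2, sigma**2)
    return mu_i, sigma_i
```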
  • 92. Dream or reality? The clock’s run out, time’s up over, bloah! Snap back to reality, Oh there goes gravity [Lose Yourself, Eminem, 2002]