This document describes the Space Alternating Data Augmentation (SADA) algorithm, an efficient Markov chain Monte Carlo method for sampling from posterior distributions. SADA extends the Data Augmentation algorithm by introducing multiple sets of missing data, with each set corresponding to a subset of model parameters. These are sampled in a "space alternating" manner to improve convergence. The document applies SADA to finite mixtures of Gaussians, introducing different types of missing data to update parameter subsets. Simulation results show SADA provides better mixing and convergence than standard Data Augmentation.
Space Alternating Data Augmentation for Finite Mixture of Gaussians
1. Space Alternating Data Augmentation
Application to Finite Mixture of Gaussians
Arnaud Doucet, Tomoko Matsui, Stéphane Sénécal
Research Organization of Information and Systems,
The Institute of Statistical Mathematics
18/11/2004
Thanks to the support of EPSRC, The Institute of Statistical Mathematics,
and the Japan Society for the Promotion of Science
3. MCMC techniques
Idea: sample $x^{(0)}, x^{(1)}, \ldots, x^{(i)}, \ldots$
• from recursive application of a transition kernel $K(x^{(i)} \mid x^{(i-1)})$
• such that $x^{(i)} \sim \pi$ asymptotically
→ how to obtain a fast-converging simulation scheme?
4. Missing Data, Data Augmentation
Idea: extend the sampling space $\pi(x) \to \pi(x, z)$
with the constraint $\int \pi(x, z)\, dz = \pi(x)$
such that the Markov chain $(x^{(i)}, z^{(i)})$ converges to $\pi$ faster
• Optimization : Expectation-Maximization (EM) algorithm
• Simulation : Data Augmentation, Gibbs sampling
5. Efficient Data Augmentation Schemes
The information introduced through the missing data governs convergence
Idea: construct the missing data space to be as uninformative as possible
EM algorithm → Space Alternating Generalized EM (SAGE) algorithm,
Fessler and Hero, 1994:
• update parameter components by subblocks
• a specific missing data space associated with each subblock
• less informative complete data spaces → improved convergence rate
6. Efficient Data Augmentation Sampling Schemes
SAGE Idea → efficient MCMC algorithm:
• sample parameter components by subblocks
• each subblock of parameters is sampled conditional on a specific
missing data set
⇒ Space Alternating Data Augmentation (SADA)
• Optimization : EM algorithm → SAGE algorithm
• Simulation : DA, Gibbs sampling → SADA
7. Overview
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• Application to Finite Mixture of Gaussians
8. EM and SAGE Algorithms
Bayesian framework: obtaining the MAP estimate of a random variable X
given a realization of Y = y
$x_{\mathrm{MAP}} = \arg\max_x p(x \mid y)$
where
$p(x \mid y) \propto p(y \mid x)\, p(x)$
X is a random vector whose components are partitioned into n subsets
$X = X_{1:n} = (X_1, \ldots, X_n)$
Notation: $X_{-k} = X_{1:n} \setminus \{X_k\} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_n)$ and
$Z_{k:j} = (Z_k, Z_{k+1}, \ldots, Z_j)$
9. Expectation-Maximization (EM) algorithm
→ Maximize $p(x \mid y)$
⇒ introduce missing data Z with a given conditional distribution
$p(z \mid y, x)$
EM, iteration i:
$x^{(i)} = \arg\max_x \int \log p(x, z \mid y)\; p\big(z \mid y, x^{(i-1)}\big)\, dz$
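For concreteness, a minimal Python sketch of this iteration on a toy model: a two-component univariate Gaussian mixture with known unit variances and a flat prior, so only the means play the role of $x$ (the model, names and settings are illustrative, not the slides' setting):

```python
# One EM iteration for a toy two-component 1-D Gaussian mixture with
# known unit variances and flat prior; mu plays the role of x.
import numpy as np
from scipy.stats import norm

def em_step(y, mu, w=(0.5, 0.5)):
    # E-step: responsibilities p(z_t = j | y_t, mu^(i-1)).
    dens = np.stack([w[j] * norm.pdf(y, mu[j], 1.0) for j in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: maximize the expected complete-data log-posterior
    # (flat prior, so this reduces to responsibility-weighted means).
    return [float(np.sum(resp[j] * y) / np.sum(resp[j])) for j in range(2)]

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
mu = [-1.0, 1.0]
for _ in range(50):
    mu = em_step(y, mu)   # converges to a local ML/MAP estimate
```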
10. Space Alternating EM (SAGE) algorithm
→ Maximize $p(x \mid y)$
⇒ introduce n missing data sets $Z_{1:n}$, where each random variable $Z_k$
is given a conditional distribution $p(z_k \mid y, x_{1:n})$ satisfying
$p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$
11. Space Alternating EM (SAGE) algorithm
SAGE, iteration i:
• select index $k \in \{1, \ldots, n\}$
• set $x_k^{(i)}$ as
$\arg\max_{x_k} \int \log p\big(x_{-k}^{(i-1)}, x_k, z_k \mid y\big)\; p\big(z_k \mid y, x^{(i-1)}\big)\, dz_k$
and $x_{-k}^{(i)} = x_{-k}^{(i-1)}$
Components are updated cyclically: iteration i updates component
$k = (i \bmod n) + 1$
12. DA and SADA Algorithms
Bayesian framework: the objective is not only to maximize $p(x \mid y)$ but to
obtain random samples $X^{(i)}$ distributed according to $p(x \mid y)$
Based on the samples $X^{(i)}$, approximation of the MMSE estimate:
$\hat{x}_{\mathrm{MMSE}} = \frac{1}{N} \sum_{i=1}^{N} X^{(i)} \;\to\; x_{\mathrm{MMSE}} = \int x\, p(x \mid y)\, dx$
Also possible to compute posterior variances, confidence intervals or
predictive distributions.
Construction of efficient MCMC algorithms is typically difficult
→ introduction of missing data
13. Data Augmentation, Gibbs sampling
→ Sample $p(x \mid y)$
⇒ introduce missing data Z with joint posterior distribution
$p(x, z \mid y) = p(x \mid y)\, p(z \mid y, x)$
Data Augmentation algorithm, iteration i, given $X^{(i-1)}$ (sketched in code below):
• Sample $Z^{(i)} \sim p\big(\cdot \mid y, X^{(i-1)}\big)$
• Sample $X^{(i)} \sim p\big(\cdot \mid y, Z^{(i)}\big)$
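In code, one DA iteration is just the two conditional draws above. A minimal sketch, where sample_z_given_x and sample_x_given_z are hypothetical model-specific samplers for $p(z \mid y, x)$ and $p(x \mid y, z)$:

```python
# One Data Augmentation iteration: alternate the two conditional draws.
def da_step(y, x_prev, sample_z_given_x, sample_x_given_z, rng):
    z = sample_z_given_x(y, x_prev, rng)   # Z^(i) ~ p(. | y, X^(i-1))
    x = sample_x_given_z(y, z, rng)        # X^(i) ~ p(. | y, Z^(i))
    return x, z
```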
14. Convergence of DA/Gibbs sampling algorithm
• The transition kernel associated with $(X^{(i)}, Z^{(i)})$ admits $p(x, z \mid y)$ as
invariant distribution
• Under weak additional assumptions (irreducibility and aperiodicity),
the instantaneous distribution of $(X^{(i)}, Z^{(i)})$ converges towards
$p(x, z \mid y)$ as $i \to +\infty$
15. Space Alternating Data Augmentation
→ Sample $p(x \mid y)$
⇒ introduce n missing data sets $Z_{1:n}$, where each random variable $Z_k$
is given a conditional distribution $p(z_k \mid y, x_{1:n})$, defining the joint
posterior distribution
$p(x_{1:n}, z_{1:n} \mid y) = p(x_{1:n} \mid y) \prod_{k=1}^{n} p(z_k \mid y, x_{1:n})$
Typically $p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$, although this is not necessary
16. Space Alternating Data Augmentation
SADA algorithm, iteration i, given $X_{1:n}^{(i-1)}$, with $k = (i \bmod n) + 1$ (sketched in code below):
• Sample $Z_k^{(i)} \sim p\big(\cdot \mid y, X^{(i-1)}\big)$
• Sample $X_k^{(i)} \sim p\big(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)}\big)$
• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$
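A schematic SADA sweep in Python, assuming per-block samplers (hypothetical callables) for $p(z_k \mid y, x)$ and $p(x_k \mid y, z_k, x_{-k})$; note the 0-based index in code versus $k = (i \bmod n) + 1$ on the slide:

```python
# One SADA iteration: only block k and its own missing data Z_k are
# refreshed; the remaining blocks X_{-k} are carried over unchanged.
def sada_step(i, y, x, sample_zk, sample_xk, rng):
    n = len(x)
    k = i % n                            # cyclic block choice (0-based)
    zk = sample_zk[k](y, x, rng)         # Z_k^(i) ~ p(. | y, X^(i-1))
    x = list(x)
    x[k] = sample_xk[k](y, zk, x, rng)   # X_k^(i) ~ p(. | y, Z_k^(i), X_{-k}^(i-1))
    return tuple(x)                      # X_{-k}^(i) = X_{-k}^(i-1)
```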
17. Validity of SADA sampling algorithm
Generation of a Markov chain $(X_{1:n}^{(i)}, Z_{1:n}^{(i)})$ with invariant distribution
$p(x_{1:n}, z_{1:n} \mid y)$
Idea: SADA is equivalent to
• Sample $(Z_k^{(i)}, Z_{-k}) \sim p\big(\cdot \mid y, X_{1:n}^{(i-1)}\big)$
• Sample $(X_k^{(i)}, Z_{-k}) \sim p\big(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)}\big)$
• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$
18. Validity of SADA sampling algorithm
SADA → the extended scheme simulates not only $Z_k$ and $X_k$ but also $Z_{-k}$ at each iteration
sampling according to the full conditional distributions $p(z_{1:n} \mid y, x_{1:n})$
and $p(x_{1:n} \mid y, z_{1:n})$ ⇒ the desired invariant distribution $p(x_{1:n}, z_{1:n} \mid y)$
the sampling of $Z_{-k}$ is not necessary → discarded
19. Overview
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• ⇒ Application to Finite Mixture of Gaussians
20. Finite Mixture of Gaussians
EM/DA algorithms are routinely used to perform ML/MAP parameter
estimation and to sample from the posterior distribution
Straightforward extensions to hidden Markov chains with Gaussian
observations
T i.i.d. observations $Y_{1:T}$ in $\mathbb{R}^d$, distributed according to a finite
mixture of s Gaussians
$Y_t \sim \sum_{j=1}^{s} \pi_j\, \mathcal{N}(\mu_j, \Sigma_j)$
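Such data are simulated by ancestral sampling: draw a component label, then the corresponding Gaussian. A short sketch with illustrative sizes and parameter values (not the slides' experimental setting):

```python
# Draw T samples from a finite mixture of s Gaussians in R^d.
import numpy as np

rng = np.random.default_rng(0)
s, d, T = 3, 2, 100                       # illustrative sizes
pi = np.array([0.5, 0.3, 0.2])            # mixture weights, sum to 1
mu = rng.normal(size=(s, d))              # component means
Sigma = np.stack([np.eye(d)] * s)         # component covariances

labels = rng.choice(s, size=T, p=pi)      # Z_t with Pr(Z_t = j) = pi_j
Y = np.stack([rng.multivariate_normal(mu[j], Sigma[j]) for j in labels])
```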
22. Bayesian Estimation
$\Sigma^{-1} \sim \mathcal{W}(r, C)$: Wishart distribution, with density proportional to
$|\Sigma^{-1}|^{\frac{1}{2}(r - d - 1)} \exp\big(-\tfrac{1}{2}\, \mathrm{tr}\big(\Sigma^{-1} C^{-1}\big)\big)$
$(\pi_1, \ldots, \pi_s) \sim \mathcal{D}(\zeta_1, \ldots, \zeta_s)$: Dirichlet distribution restricted to the
simplex, with density proportional to $\prod_{k=1}^{s} \pi_k^{\zeta_k - 1}$
Hyperparameters $\{(\alpha_j, \lambda_j, r_j, C_j, \zeta_j);\; j = 1, \ldots, s\}$ are assumed fixed but
could be estimated from the data in a hierarchical Bayes model
23. Missing Data for Finite Mixture of Gaussians
EM/DA introduce the i.i.d. missing data $Z_t \in \{1, \ldots, s\}$ such that
$Y_t \mid Z_t = j \sim \mathcal{N}(\mu_j, \Sigma_j)$
$\Pr(Z_t = j) = \pi_j$
Gibbs sampling algorithm, iteration i (see the sketch below):
• sample the discrete latent variables $Z_t^{(i)} \sim p\big(\cdot \mid y_t, X^{(i-1)}\big)$
• compute the sufficient statistics $n_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}$,
$n_j^{(i)} \bar{y}_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t$ and $S_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t y_t^T$
• sample the parameters
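The latent-variable draw and the sufficient statistics translate directly into numpy; a sketch under the notation above (function and argument names are illustrative):

```python
# Gibbs step: sample labels Z_t, then accumulate n_j, sum_t y_t, S_j.
import numpy as np
from scipy.stats import multivariate_normal

def sample_labels_and_stats(Y, pi, mu, Sigma, rng):
    T, d = Y.shape
    s = len(pi)
    # p(Z_t = j | y_t, X) proportional to pi_j N(y_t; mu_j, Sigma_j)
    dens = np.stack([pi[j] * multivariate_normal.pdf(Y, mu[j], Sigma[j])
                     for j in range(s)], axis=1)               # shape (T, s)
    probs = dens / dens.sum(axis=1, keepdims=True)
    Z = np.array([rng.choice(s, p=p) for p in probs])          # Z_t^(i)
    n = np.array([np.sum(Z == j) for j in range(s)])           # n_j
    ysum = np.stack([Y[Z == j].sum(axis=0) for j in range(s)]) # n_j * ybar_j
    S = np.stack([Y[Z == j].T @ Y[Z == j] for j in range(s)])  # S_j
    return Z, n, ysum, S
```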
24. Gibbs sampling for Finite Mixture of Gaussians
sampling the parameters, iteration i:
$\Sigma_j^{-1(i)} \sim \mathcal{W}\big(r_j + n_j^{(i)},\; \bar{\Sigma}_j^{-1(i)}\big)$
then
$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\Big(m_j^{(i)},\; \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\Big)$
and $\big(\pi_1^{(i)}, \ldots, \pi_s^{(i)}\big) \sim \mathcal{D}\big(n_1^{(i)} + \zeta_1, \ldots, n_s^{(i)} + \zeta_s\big)$, where
$m_j^{(i)} = \frac{\lambda_j \alpha_j + n_j^{(i)} \bar{y}_j^{(i)}}{\lambda_j + n_j^{(i)}}$
and
$\bar{\Sigma}_j^{(i)} = C_j^{-1} + \lambda_j \alpha_j \alpha_j^T + S_j^{(i)} - \big(\lambda_j + n_j^{(i)}\big)\, m_j^{(i)} m_j^{(i)T}$
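These conditional draws map onto scipy as follows. A sketch, assuming scipy's wishart(df, scale) convention (mean df·scale) matches the $\mathcal{W}(r, C)$ density above (verify the scale convention before reuse; all function names are illustrative):

```python
# Conditional parameter draws for the Gibbs sampler above.
import numpy as np
from scipy.stats import wishart, dirichlet

def sample_params(n, ysum, S, alpha, lam, r, C_inv, zeta, rng):
    s, d = ysum.shape
    mu, Sig = np.empty((s, d)), np.empty((s, d, d))
    for j in range(s):
        m = (lam[j] * alpha[j] + ysum[j]) / (lam[j] + n[j])    # m_j
        B = (C_inv[j] + lam[j] * np.outer(alpha[j], alpha[j])  # Sigma-bar_j
             + S[j] - (lam[j] + n[j]) * np.outer(m, m))
        prec = wishart.rvs(df=r[j] + n[j], scale=np.linalg.inv(B),
                           random_state=rng)                   # Sigma_j^{-1}
        Sig[j] = np.linalg.inv(prec)
        mu[j] = rng.multivariate_normal(m, Sig[j] / (lam[j] + n[j]))
    pi = dirichlet.rvs(n + zeta, random_state=rng)[0]          # weights
    return mu, Sig, pi
```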
25. Less Informative Missing Data
update only $(\mu_j, \Sigma_j)$, with $(\mu_{-j}, \Sigma_{-j})$ fixed
→ binary missing data $Z_{t,j} \in \{0, j\}$ such that $\Pr(Z_{t,j} = j) = \pi_j$
$Z_{t,j}$ = "observation coming from component j or not", less
informative than knowing "from which particular component the
observation is derived"
constraint $\sum_{j=1}^{s} \pi_j = 1$ ⇒ cannot update $\pi_j$; use of the standard EM
approach
26. Less Informative Missing Data
updating jointly the parameters of two components j and k
→ missing data $Z_{t,j,k} \in \{0, j, k\}$ such that
$\Pr(Z_{t,j,k} = j) = \pi_j, \quad \Pr(Z_{t,j,k} = k) = \pi_k$
and
$Y_t \mid Z_{t,j,k} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$
$Y_t \mid Z_{t,j,k} = k \sim \mathcal{N}(\mu_k, \Sigma_k)$
$Y_t \mid Z_{t,j,k} = 0 \sim \dfrac{\sum_{l \neq j,\, l \neq k} \pi_l\, \mathcal{N}(\mu_l, \Sigma_l)}{\sum_{l \neq j,\, l \neq k} \pi_l}$
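The conditional posterior of the ternary variable follows from Bayes' rule on these three cases; a small sketch (illustrative function name):

```python
# Posterior probabilities of Z_{t,j,k} in {j, k, 0} given y_t and the
# current parameters, by Bayes' rule on the three cases above.
import numpy as np
from scipy.stats import multivariate_normal

def ternary_probs(y_t, j, k, pi, mu, Sigma):
    others = [l for l in range(len(pi)) if l not in (j, k)]
    w = np.array([
        pi[j] * multivariate_normal.pdf(y_t, mu[j], Sigma[j]),   # Z = j
        pi[k] * multivariate_normal.pdf(y_t, mu[k], Sigma[k]),   # Z = k
        sum(pi[l] * multivariate_normal.pdf(y_t, mu[l], Sigma[l])
            for l in others),                                    # Z = 0
    ])
    # Order (j, k, 0); the 1/sum_l pi_l factor of the Z = 0 density
    # cancels against its prior probability sum_l pi_l.
    return w / w.sum()
```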
27. SAGE algorithm for Finite Mixture of Gaussians
update for $(\mu_j, \Sigma_j)$, iteration i:
$\mu_j^{(i)} = \dfrac{\lambda_j \alpha_j + \sum_{t=1}^{T} y_t\, p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}{\lambda_j + \sum_{t=1}^{T} p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}$
$\Sigma_j^{(i)} = \dfrac{C_j^{-1} + \lambda_j \big(\mu_j^{(i)} - \alpha_j\big)\big(\mu_j^{(i)} - \alpha_j\big)^T + \sum_{t=1}^{T} \big(y_t - \mu_j^{(i)}\big)\big(y_t - \mu_j^{(i)}\big)^T p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}{r_j - d - 1 + \lambda_j + \sum_{t=1}^{T} p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}$
28. SAGE algorithm for Finite Mixture of Gaussians
update for $\pi_j$, iteration i:
$\pi_j^{(i)} = \dfrac{1 - \sum_{l \neq j,\, l \neq k} \pi_l^{(i-1)}}{1 + \dfrac{\sum_{t=1}^{T} p\big(Z_{t,j,k} = k \mid y_t, X^{(i-1)}\big) + (\zeta_k - 1)}{\sum_{t=1}^{T} p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big) + (\zeta_j - 1)}}$
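Numerically this is a ratio of pseudo-counts; a one-function sketch, assuming the responsibilities have already been summed over t (argument names are illustrative):

```python
# SAGE update of pi_j given a_j = sum_t p(Z_{t,j,k} = j | y_t, X^(i-1))
# and a_k defined likewise; pi_rest = sum of the other, fixed weights.
def update_pi_j(a_j, a_k, zeta_j, zeta_k, pi_rest):
    return (1.0 - pi_rest) / (1.0 + (a_k + zeta_k - 1.0) / (a_j + zeta_j - 1.0))
```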
29. SADA algorithm for Finite Mixture of Gaussians
SADA algorithm, iteration i, sampling $(\mu_j, \Sigma_j, \pi_j)$:
• sample the discrete latent variables $Z_{t,j,k}^{(i)} \sim p\big(\cdot \mid y_t, X^{(i-1)}\big)$
• compute the sufficient statistics $n_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}$ and
$n_j^{(i)} \bar{y}_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t$, $S_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t y_t^T$
• sample the parameters
30. SADA algorithm for Finite Mixture of Gaussians
sampling the parameters, iteration i:
$\Sigma_j^{-1(i)} \sim \mathcal{W}\big(r_j + n_j^{(i)},\; \bar{\Sigma}_j^{-1(i)}\big)$
then
$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\Big(m_j^{(i)},\; \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\Big)$
and
$\big(\pi_j^{(i)}, \pi_k^{(i)}\big) \sim \Big(1 - \sum_{l \neq j,\, l \neq k} \pi_l^{(i-1)}\Big)\, \mathcal{D}\big(n_j^{(i)} + \zeta_j,\; n_k^{(i)} + \zeta_k\big)$
with $m_j^{(i)}$ and $\bar{\Sigma}_j^{(i)}$ defined as in the Gibbs sampler above
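The pair $(\pi_j, \pi_k)$ is thus a two-component Dirichlet draw rescaled to the probability mass left over by the unchanged weights; a sketch:

```python
# SADA draw of (pi_j, pi_k): Dirichlet sample scaled by the mass not
# taken by the other, fixed components.
import numpy as np

def sample_pi_pair(n_j, n_k, zeta_j, zeta_k, pi_rest, rng):
    w = rng.dirichlet([n_j + zeta_j, n_k + zeta_k])   # sums to 1
    return (1.0 - pi_rest) * w                        # (pi_j, pi_k)
```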
31. Simulations
Mixture of $s = 5$ to $10$ Gaussians in dimension $d = 10$, $T = 100$ observations;
the parameters of the components are sampled from the prior with
hyperparameters $\zeta_j = 1$, $\alpha_j = 0$, $\lambda_j = 0.01$, $r_j = d + 1$ and $C_j = 0.01\, I$
200 iterations of EM and SAGE, repeated 50 times
5000 iterations of DA and SADA, repeated 10 times
Results:
• EM/SAGE: mean of the log-posterior values at the final iteration
• DA/SADA: mean of the average log-posterior values over the last 1000
iterations
32. Simulations Results
 s      EM      SAGE      DA      SADA
 5    -915.8   -671.5   -873.7   -886.0
 6    -929.6   -603.2   -877.3   -886.7
 7    -941.4   -576.5   -893.9   -906.9
 8    -965.7   -559.2   -904.9   -875.0
 9    -968.9   -503.0   -898.8   -882.5
10    -983.2   -478.1   -924.0   -906.6
Log-posterior values at the final iteration for EM/SAGE and average
log-posterior values for DA/SADA
33. References - EM/SAGE/MCMC
• G. J. McLachlan and T. Krishnan, The EM Algorithm and
Extensions, Wiley Series in Probability and Statistics, 1997
• J. A. Fessler and A. O. Hero, Space-alternating generalized
expectation-maximization algorithm, IEEE Trans. Sig. Proc.,
42:2664–2677, 1994
• C. P. Robert and G. Casella, Monte Carlo Statistical Methods,
Springer-Verlag, 1999
34. References - Finite Mixture of Gaussians
• J. L. Gauvain and C. H. Lee, Maximum a posteriori estimation
for multivariate Gaussian mixture observations of Markov chains,
IEEE Trans. Speech Audio Proc., 2:291–298, 1994
• G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley
Series in Probability and Statistics, 2000
• G. Celeux, S. Chrétien, F. Forbes and A. Mkhadri, A
component-wise EM algorithm for mixtures, J. Comp. Graph.
Stat., 10:699–712, 2001