This document describes the Space Alternating Data Augmentation (SADA) algorithm, an efficient Markov chain Monte Carlo method for sampling from posterior distributions. SADA extends the Data Augmentation algorithm by introducing multiple sets of missing data, with each set corresponding to a subset of model parameters. These are sampled in a "space alternating" manner to improve convergence. The document applies SADA to finite mixtures of Gaussians, introducing different types of missing data to update parameter subsets. Simulation results show SADA provides better mixing and convergence than standard Data Augmentation.
Space Alternating Data Augmentation for Finite Mixture of Gaussians
1. Space Alternating Data Augmentation
Application to Finite Mixture of Gaussians
Arnaud Doucet, Tomoko Matsui, Stéphane Sénécal
Research Organization of Information and Systems,
The Institute of Statistical Mathematics
18/11/2004
Thanks to the support of EPSRC, The Institute of Statistical Mathematics,
and the Japan Society for the Promotion of Science
3. MCMC techniques
Idea: sample $x^{(0)}, x^{(1)}, \ldots, x^{(i)}, \ldots$
• from recursive application of a transition kernel $K(x^{(i)} \mid x^{(i-1)})$
• such that $x^{(i)} \sim \pi$ asymptotically
→ how to obtain a fast-converging simulation scheme?
4. Missing Data, Data Augmentation
Idea: extend the sampling space $\pi(x) \to \pi(x, z)$
with the constraint $\int \pi(x, z)\, dz = \pi(x)$
such that the Markov chain $(x^{(i)}, z^{(i)})$ converges to $\pi$ faster
• Optimization : Expectation-Maximization (EM) algorithm
• Simulation : Data Augmentation, Gibbs sampling
5. Efficient Data Augmentation Schemes
The information introduced through the missing data governs convergence
Idea: construct the missing data space to be as uninformative as possible
EM algorithm → Space Alternating Generalized EM (SAGE) algorithm,
Fessler and Hero, 1994:
• update parameter components by subblocks
• a specific missing data space associated with each subblock
• less informative complete data spaces → improved convergence rate
6. Efficient Data Augmentation Sampling Schemes
SAGE Idea → efficient MCMC algorithm:
• sample parameter components by subblocks
• each subblock of parameters is sampled conditional on a specific
missing data set
⇒ Space Alternating Data Augmentation (SADA)
• Optimization : EM algorithm → SAGE algorithm
• Simulation : DA, Gibbs sampling → SADA
7. Overview
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• Application to Finite Mixture of Gaussians
8. EM and SAGE Algorithms
Bayesian framework: obtaining the MAP estimate of a random variable X
given a realization of Y = y
$x_{\mathrm{MAP}} = \arg\max_x p(x \mid y)$
where
$p(x \mid y) \propto p(y \mid x)\, p(x)$
X is a random vector whose components are partitioned into n subsets
$X = X_{1:n} = (X_1, \ldots, X_n)$
Notation: $X_{-k} = X_{1:n} \setminus \{X_k\} = (X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_n)$ and
$Z_{k:j} = (Z_k, Z_{k+1}, \ldots, Z_j)$
9. Expectation-Maximization (EM) algorithm
→ Maximize $p(x \mid y)$
⇒ introduce missing data Z with a given conditional distribution
$p(z \mid y, x)$
EM, iteration i:
$x^{(i)} = \arg\max_x \int \log p(x, z \mid y)\; p\big(z \mid y, x^{(i-1)}\big)\, dz$
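For concreteness, a minimal Python sketch of this iteration on a toy model: a two-component univariate Gaussian mixture with known unit variances and a flat prior, so only the means play the role of $x$ (the model, names and settings are illustrative, not the slides' setting):

```python
# One EM iteration for a toy two-component 1-D Gaussian mixture with
# known unit variances and flat prior; mu plays the role of x.
import numpy as np
from scipy.stats import norm

def em_step(y, mu, w=(0.5, 0.5)):
    # E-step: responsibilities p(z_t = j | y_t, mu^(i-1)).
    dens = np.stack([w[j] * norm.pdf(y, mu[j], 1.0) for j in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: maximize the expected complete-data log-posterior
    # (flat prior, so this reduces to responsibility-weighted means).
    return [float(np.sum(resp[j] * y) / np.sum(resp[j])) for j in range(2)]

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
mu = [-1.0, 1.0]
for _ in range(50):
    mu = em_step(y, mu)   # converges to a local ML/MAP estimate
```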
10. Space Alternating EM (SAGE) algorithm
→ Maximize $p(x \mid y)$
⇒ introduce n missing data sets $Z_{1:n}$, where each random variable $Z_k$
is given a conditional distribution $p(z_k \mid y, x_{1:n})$ satisfying
$p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$
11. Space Alternating EM (SAGE) algorithm
SAGE, iteration i:
• select index $k \in \{1, \ldots, n\}$
• set $x_k^{(i)}$ as
$\arg\max_{x_k} \int \log p\big(x_{-k}^{(i-1)}, x_k, z_k \mid y\big)\; p\big(z_k \mid y, x^{(i-1)}\big)\, dz_k$
and $x_{-k}^{(i)} = x_{-k}^{(i-1)}$
Components are updated cyclically: iteration i updates component
$k = (i \bmod n) + 1$
12. DA and SADA Algorithms
Bayesian framework: the objective is not only to maximize $p(x \mid y)$ but to
obtain random samples $X^{(i)}$ distributed according to $p(x \mid y)$
Based on the samples $X^{(i)}$, approximation of the MMSE estimate:
$\hat{x}_{\mathrm{MMSE}} = \frac{1}{N} \sum_{i=1}^{N} X^{(i)} \;\to\; x_{\mathrm{MMSE}} = \int x\, p(x \mid y)\, dx$
Also possible to compute posterior variances, confidence intervals or
predictive distributions.
Construction of efficient MCMC algorithms is typically difficult
→ introduction of missing data
13. Data Augmentation, Gibbs sampling
→ Sample $p(x \mid y)$
⇒ introduce missing data Z with joint posterior distribution
$p(x, z \mid y) = p(x \mid y)\, p(z \mid y, x)$
Data Augmentation algorithm, iteration i, given $X^{(i-1)}$ (sketched in code below):
• Sample $Z^{(i)} \sim p\big(\cdot \mid y, X^{(i-1)}\big)$
• Sample $X^{(i)} \sim p\big(\cdot \mid y, Z^{(i)}\big)$
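In code, one DA iteration is just the two conditional draws above. A minimal sketch, where sample_z_given_x and sample_x_given_z are hypothetical model-specific samplers for $p(z \mid y, x)$ and $p(x \mid y, z)$:

```python
# One Data Augmentation iteration: alternate the two conditional draws.
def da_step(y, x_prev, sample_z_given_x, sample_x_given_z, rng):
    z = sample_z_given_x(y, x_prev, rng)   # Z^(i) ~ p(. | y, X^(i-1))
    x = sample_x_given_z(y, z, rng)        # X^(i) ~ p(. | y, Z^(i))
    return x, z
```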
14. Convergence of DA/Gibbs sampling algorithm
• The transition kernel associated with $(X^{(i)}, Z^{(i)})$ admits $p(x, z \mid y)$ as
invariant distribution
• Under weak additional assumptions (irreducibility and aperiodicity),
the instantaneous distribution of $(X^{(i)}, Z^{(i)})$ converges towards
$p(x, z \mid y)$ as $i \to +\infty$
15. Space Alternating Data Augmentation
→ Sample $p(x \mid y)$
⇒ introduce n missing data sets $Z_{1:n}$, where each random variable $Z_k$
is given a conditional distribution $p(z_k \mid y, x_{1:n})$, defining the joint
posterior distribution
$p(x_{1:n}, z_{1:n} \mid y) = p(x_{1:n} \mid y) \prod_{k=1}^{n} p(z_k \mid y, x_{1:n})$
Typically $p(y \mid x_{1:n}, z_k) = p(y \mid x_{-k}, z_k)$, although this is not necessary
16. Space Alternating Data Augmentation
SADA algorithm, iteration i, given $X_{1:n}^{(i-1)}$, with $k = (i \bmod n) + 1$ (sketched in code below):
• Sample $Z_k^{(i)} \sim p\big(\cdot \mid y, X^{(i-1)}\big)$
• Sample $X_k^{(i)} \sim p\big(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)}\big)$
• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$
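A schematic SADA sweep in Python, assuming per-block samplers (hypothetical callables) for $p(z_k \mid y, x)$ and $p(x_k \mid y, z_k, x_{-k})$; note the 0-based index in code versus $k = (i \bmod n) + 1$ on the slide:

```python
# One SADA iteration: only block k and its own missing data Z_k are
# refreshed; the remaining blocks X_{-k} are carried over unchanged.
def sada_step(i, y, x, sample_zk, sample_xk, rng):
    n = len(x)
    k = i % n                            # cyclic block choice (0-based)
    zk = sample_zk[k](y, x, rng)         # Z_k^(i) ~ p(. | y, X^(i-1))
    x = list(x)
    x[k] = sample_xk[k](y, zk, x, rng)   # X_k^(i) ~ p(. | y, Z_k^(i), X_{-k}^(i-1))
    return tuple(x)                      # X_{-k}^(i) = X_{-k}^(i-1)
```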
17. Validity of SADA sampling algorithm
Generation of a Markov chain $(X_{1:n}^{(i)}, Z_{1:n}^{(i)})$ with invariant distribution
$p(x_{1:n}, z_{1:n} \mid y)$
Idea: SADA is equivalent to
• Sample $(Z_k^{(i)}, Z_{-k}) \sim p\big(\cdot \mid y, X_{1:n}^{(i-1)}\big)$
• Sample $(X_k^{(i)}, Z_{-k}) \sim p\big(\cdot \mid y, Z_k^{(i)}, X_{-k}^{(i-1)}\big)$
• Set $X_{-k}^{(i)} = X_{-k}^{(i-1)}$
18. Validity of SADA sampling algorithm
SADA → the extended scheme simulates not only $Z_k$ and $X_k$ but also $Z_{-k}$ at each iteration
sampling according to the full conditional distributions $p(z_{1:n} \mid y, x_{1:n})$
and $p(x_{1:n} \mid y, z_{1:n})$ ⇒ the desired invariant distribution $p(x_{1:n}, z_{1:n} \mid y)$
the sampling of $Z_{-k}$ is not necessary → discarded
19. Overview
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• ⇒ Application to Finite Mixture of Gaussians
20. Finite Mixture of Gaussians
EM/DA algorithms are routinely used to perform ML/MAP parameter
estimation and to sample from the posterior distribution
Straightforward extensions to hidden Markov chains with Gaussian
observations
T i.i.d. observations $Y_{1:T}$ in $\mathbb{R}^d$, distributed according to a finite
mixture of s Gaussians
$Y_t \sim \sum_{j=1}^{s} \pi_j\, \mathcal{N}(\mu_j, \Sigma_j)$
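Such data are simulated by ancestral sampling: draw a component label, then the corresponding Gaussian. A short sketch with illustrative sizes and parameter values (not the slides' experimental setting):

```python
# Draw T samples from a finite mixture of s Gaussians in R^d.
import numpy as np

rng = np.random.default_rng(0)
s, d, T = 3, 2, 100                       # illustrative sizes
pi = np.array([0.5, 0.3, 0.2])            # mixture weights, sum to 1
mu = rng.normal(size=(s, d))              # component means
Sigma = np.stack([np.eye(d)] * s)         # component covariances

labels = rng.choice(s, size=T, p=pi)      # Z_t with Pr(Z_t = j) = pi_j
Y = np.stack([rng.multivariate_normal(mu[j], Sigma[j]) for j in labels])
```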
22. Bayesian Estimation
$\Sigma^{-1} \sim \mathcal{W}(r, C)$: Wishart distribution, with density proportional to
$|\Sigma^{-1}|^{\frac{1}{2}(r - d - 1)} \exp\big(-\tfrac{1}{2}\, \mathrm{tr}\big(\Sigma^{-1} C^{-1}\big)\big)$
$(\pi_1, \ldots, \pi_s) \sim \mathcal{D}(\zeta_1, \ldots, \zeta_s)$: Dirichlet distribution restricted to the
simplex, with density proportional to $\prod_{k=1}^{s} \pi_k^{\zeta_k - 1}$
Hyperparameters $\{(\alpha_j, \lambda_j, r_j, C_j, \zeta_j);\; j = 1, \ldots, s\}$ are assumed fixed but
could be estimated from the data in a hierarchical Bayes model
23. Missing Data for Finite Mixture of Gaussians
EM/DA introduce the i.i.d. missing data $Z_t \in \{1, \ldots, s\}$ such that
$Y_t \mid Z_t = j \sim \mathcal{N}(\mu_j, \Sigma_j)$
$\Pr(Z_t = j) = \pi_j$
Gibbs sampling algorithm, iteration i (see the sketch below):
• sample the discrete latent variables $Z_t^{(i)} \sim p\big(\cdot \mid y_t, X^{(i-1)}\big)$
• compute the sufficient statistics $n_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}$,
$n_j^{(i)} \bar{y}_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t$ and $S_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_t^{(i)}, j}\, y_t y_t^T$
• sample the parameters
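The latent-variable draw and the sufficient statistics translate directly into numpy; a sketch under the notation above (function and argument names are illustrative):

```python
# Gibbs step: sample labels Z_t, then accumulate n_j, sum_t y_t, S_j.
import numpy as np
from scipy.stats import multivariate_normal

def sample_labels_and_stats(Y, pi, mu, Sigma, rng):
    T, d = Y.shape
    s = len(pi)
    # p(Z_t = j | y_t, X) proportional to pi_j N(y_t; mu_j, Sigma_j)
    dens = np.stack([pi[j] * multivariate_normal.pdf(Y, mu[j], Sigma[j])
                     for j in range(s)], axis=1)               # shape (T, s)
    probs = dens / dens.sum(axis=1, keepdims=True)
    Z = np.array([rng.choice(s, p=p) for p in probs])          # Z_t^(i)
    n = np.array([np.sum(Z == j) for j in range(s)])           # n_j
    ysum = np.stack([Y[Z == j].sum(axis=0) for j in range(s)]) # n_j * ybar_j
    S = np.stack([Y[Z == j].T @ Y[Z == j] for j in range(s)])  # S_j
    return Z, n, ysum, S
```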
24. Gibbs sampling for Finite Mixture of Gaussians
sampling the parameters, iteration i:
$\Sigma_j^{-1(i)} \sim \mathcal{W}\big(r_j + n_j^{(i)},\; \bar{\Sigma}_j^{-1(i)}\big)$
then
$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\Big(m_j^{(i)},\; \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\Big)$
and $\big(\pi_1^{(i)}, \ldots, \pi_s^{(i)}\big) \sim \mathcal{D}\big(n_1^{(i)} + \zeta_1, \ldots, n_s^{(i)} + \zeta_s\big)$, where
$m_j^{(i)} = \frac{\lambda_j \alpha_j + n_j^{(i)} \bar{y}_j^{(i)}}{\lambda_j + n_j^{(i)}}$
and
$\bar{\Sigma}_j^{(i)} = C_j^{-1} + \lambda_j \alpha_j \alpha_j^T + S_j^{(i)} - \big(\lambda_j + n_j^{(i)}\big)\, m_j^{(i)} m_j^{(i)T}$
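These conditional draws map onto scipy as follows. A sketch, assuming scipy's wishart(df, scale) convention (mean df·scale) matches the $\mathcal{W}(r, C)$ density above (verify the scale convention before reuse; all function names are illustrative):

```python
# Conditional parameter draws for the Gibbs sampler above.
import numpy as np
from scipy.stats import wishart, dirichlet

def sample_params(n, ysum, S, alpha, lam, r, C_inv, zeta, rng):
    s, d = ysum.shape
    mu, Sig = np.empty((s, d)), np.empty((s, d, d))
    for j in range(s):
        m = (lam[j] * alpha[j] + ysum[j]) / (lam[j] + n[j])    # m_j
        B = (C_inv[j] + lam[j] * np.outer(alpha[j], alpha[j])  # Sigma-bar_j
             + S[j] - (lam[j] + n[j]) * np.outer(m, m))
        prec = wishart.rvs(df=r[j] + n[j], scale=np.linalg.inv(B),
                           random_state=rng)                   # Sigma_j^{-1}
        Sig[j] = np.linalg.inv(prec)
        mu[j] = rng.multivariate_normal(m, Sig[j] / (lam[j] + n[j]))
    pi = dirichlet.rvs(n + zeta, random_state=rng)[0]          # weights
    return mu, Sig, pi
```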
25. Less Informative Missing Data
update only $(\mu_j, \Sigma_j)$, with $(\mu_{-j}, \Sigma_{-j})$ fixed
→ binary missing data $Z_{t,j} \in \{0, j\}$ such that $\Pr(Z_{t,j} = j) = \pi_j$
$Z_{t,j}$ = "observation coming from component j or not", less
informative than knowing "from which particular component the
observation is derived"
constraint $\sum_{j=1}^{s} \pi_j = 1$ ⇒ cannot update $\pi_j$; use of the standard EM
approach
26. Less Informative Missing Data
updating jointly the parameters of two components j and k
→ missing data $Z_{t,j,k} \in \{0, j, k\}$ such that
$\Pr(Z_{t,j,k} = j) = \pi_j, \quad \Pr(Z_{t,j,k} = k) = \pi_k$
and
$Y_t \mid Z_{t,j,k} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$
$Y_t \mid Z_{t,j,k} = k \sim \mathcal{N}(\mu_k, \Sigma_k)$
$Y_t \mid Z_{t,j,k} = 0 \sim \dfrac{\sum_{l \neq j,\, l \neq k} \pi_l\, \mathcal{N}(\mu_l, \Sigma_l)}{\sum_{l \neq j,\, l \neq k} \pi_l}$
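The conditional posterior of the ternary variable follows from Bayes' rule on these three cases; a small sketch (illustrative function name):

```python
# Posterior probabilities of Z_{t,j,k} in {j, k, 0} given y_t and the
# current parameters, by Bayes' rule on the three cases above.
import numpy as np
from scipy.stats import multivariate_normal

def ternary_probs(y_t, j, k, pi, mu, Sigma):
    others = [l for l in range(len(pi)) if l not in (j, k)]
    w = np.array([
        pi[j] * multivariate_normal.pdf(y_t, mu[j], Sigma[j]),   # Z = j
        pi[k] * multivariate_normal.pdf(y_t, mu[k], Sigma[k]),   # Z = k
        sum(pi[l] * multivariate_normal.pdf(y_t, mu[l], Sigma[l])
            for l in others),                                    # Z = 0
    ])
    # Order (j, k, 0); the 1/sum_l pi_l factor of the Z = 0 density
    # cancels against its prior probability sum_l pi_l.
    return w / w.sum()
```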
27. SAGE algorithm for Finite Mixture of Gaussians
update for $(\mu_j, \Sigma_j)$, iteration i:
$\mu_j^{(i)} = \dfrac{\lambda_j \alpha_j + \sum_{t=1}^{T} y_t\, p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}{\lambda_j + \sum_{t=1}^{T} p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}$
$\Sigma_j^{(i)} = \dfrac{C_j^{-1} + \lambda_j \big(\mu_j^{(i)} - \alpha_j\big)\big(\mu_j^{(i)} - \alpha_j\big)^T + \sum_{t=1}^{T} \big(y_t - \mu_j^{(i)}\big)\big(y_t - \mu_j^{(i)}\big)^T p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}{r_j - d - 1 + \lambda_j + \sum_{t=1}^{T} p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big)}$
28. SAGE algorithm for Finite Mixture of Gaussians
update for $\pi_j$, iteration i:
$\pi_j^{(i)} = \dfrac{1 - \sum_{l \neq j,\, l \neq k} \pi_l^{(i-1)}}{1 + \dfrac{\sum_{t=1}^{T} p\big(Z_{t,j,k} = k \mid y_t, X^{(i-1)}\big) + (\zeta_k - 1)}{\sum_{t=1}^{T} p\big(Z_{t,j,k} = j \mid y_t, X^{(i-1)}\big) + (\zeta_j - 1)}}$
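Numerically this is a ratio of pseudo-counts; a one-function sketch, assuming the responsibilities have already been summed over t (argument names are illustrative):

```python
# SAGE update of pi_j given a_j = sum_t p(Z_{t,j,k} = j | y_t, X^(i-1))
# and a_k defined likewise; pi_rest = sum of the other, fixed weights.
def update_pi_j(a_j, a_k, zeta_j, zeta_k, pi_rest):
    return (1.0 - pi_rest) / (1.0 + (a_k + zeta_k - 1.0) / (a_j + zeta_j - 1.0))
```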
29. SADA algorithm for Finite Mixture of Gaussians
SADA algorithm, iteration i, sampling $(\mu_j, \Sigma_j, \pi_j)$:
• sample the discrete latent variables $Z_{t,j,k}^{(i)} \sim p\big(\cdot \mid y_t, X^{(i-1)}\big)$
• compute the sufficient statistics $n_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}$ and
$n_j^{(i)} \bar{y}_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t$, $S_j^{(i)} = \sum_{t=1}^{T} \delta_{Z_{t,j,k}^{(i)}, j}\, y_t y_t^T$
• sample the parameters
30. SADA algorithm for Finite Mixture of Gaussians
sampling the parameters, iteration i:
$\Sigma_j^{-1(i)} \sim \mathcal{W}\big(r_j + n_j^{(i)},\; \bar{\Sigma}_j^{-1(i)}\big)$
then
$\mu_j^{(i)} \mid \Sigma_j^{(i)} \sim \mathcal{N}\Big(m_j^{(i)},\; \frac{\Sigma_j^{(i)}}{\lambda_j + n_j^{(i)}}\Big)$
and
$\big(\pi_j^{(i)}, \pi_k^{(i)}\big) \sim \Big(1 - \sum_{l \neq j,\, l \neq k} \pi_l^{(i-1)}\Big)\, \mathcal{D}\big(n_j^{(i)} + \zeta_j,\; n_k^{(i)} + \zeta_k\big)$
with $m_j^{(i)}$ and $\bar{\Sigma}_j^{(i)}$ defined as in the Gibbs sampler above
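The pair $(\pi_j, \pi_k)$ is thus a two-component Dirichlet draw rescaled to the probability mass left over by the unchanged weights; a sketch:

```python
# SADA draw of (pi_j, pi_k): Dirichlet sample scaled by the mass not
# taken by the other, fixed components.
import numpy as np

def sample_pi_pair(n_j, n_k, zeta_j, zeta_k, pi_rest, rng):
    w = rng.dirichlet([n_j + zeta_j, n_k + zeta_k])   # sums to 1
    return (1.0 - pi_rest) * w                        # (pi_j, pi_k)
```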
31. Simulations
Mixture of $s = 5$ to $10$ Gaussians in dimension $d = 10$, $T = 100$ observations;
the parameters of the components are sampled from the prior with
hyperparameters $\zeta_j = 1$, $\alpha_j = 0$, $\lambda_j = 0.01$, $r_j = d + 1$ and $C_j = 0.01\, I$
200 iterations of EM and SAGE, repeated 50 times
5000 iterations of DA and SADA, repeated 10 times
Results:
• EM/SAGE: mean of the log-posterior values at the final iteration
• DA/SADA: mean of the average log-posterior values over the last 1000
iterations
32. Simulations Results
 s      EM      SAGE      DA      SADA
 5    -915.8   -671.5   -873.7   -886.0
 6    -929.6   -603.2   -877.3   -886.7
 7    -941.4   -576.5   -893.9   -906.9
 8    -965.7   -559.2   -904.9   -875.0
 9    -968.9   -503.0   -898.8   -882.5
10    -983.2   -478.1   -924.0   -906.6
Log-posterior values at the final iteration for EM/SAGE and average
log-posterior values for DA/SADA
33. References - EM/SAGE/MCMC
• G. J. McLachlan and T. Krishnan, The EM Algorithm and
Extensions, Wiley Series in Probability and Statistics, 1997
• J. A. Fessler and A. O. Hero, Space-alternating generalized
expectation-maximization algorithm, IEEE Trans. Sig. Proc.,
42:2664–2677, 1994
• C. P. Robert and G. Casella, Monte Carlo Statistical Methods,
Springer-Verlag, 1999
34. References - Finite Mixture of Gaussians
• J. L. Gauvain and C. H. Lee, Maximum a posteriori estimation
for multivariate Gaussian mixture observations of Markov chains,
IEEE Trans. Speech Audio Proc., 2:291–298, 1994
• G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley
Series in Probability and Statistics, 2000
• G. Celeux, S. Chrétien, F. Forbes and A. Mkhadri, A
component-wise EM algorithm for mixtures, J. Comp. Graph.
Stat., 10:699–712, 2001