Space Alternating Data Augmentation
Application to Finite Mixture of Gaussians
Arnaud Doucet, Tomoko Matsui, Stéphane Sénécal
Research Organization of Information and Systems,
The Institute of Statistical Mathematics
18/11/2004
thanks to the support of EPSRC, The Institute of Statistical Mathematics
and the Japan Society for the Promotion of Science
1
Bayesian Estimation
Computation of
E_π{f} = ∫ f(x) π(x) dx
Approximation
⇒ Markov chain Monte Carlo (MCMC) simulation methods
2
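To make the Monte Carlo approximation concrete, here is a minimal sketch (not part of the original slides): given samples x^{(i)} ∼ π, E_π{f} is estimated by the empirical average. The target π and test function f below are illustrative assumptions.

```python
import numpy as np

def mc_expectation(f, samples):
    """Plain Monte Carlo estimate of E_pi{f} from samples x^(1), ..., x^(N) ~ pi."""
    return np.mean([f(x) for x in samples])

# Toy illustration: pi = N(0, 1) and f(x) = x^2, so E_pi{f} = 1.
rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
print(mc_expectation(lambda x: x ** 2, samples))  # approximately 1.0
```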
MCMC techniques
Idea: sample x^{(0)}, x^{(1)}, . . . , x^{(i)}, . . .
• from recursive application of transition kernel K(x^{(i)} | x^{(i−1)})
• such that x^{(i)} ∼ π asymptotically
→ how to obtain fast converging simulation scheme?
3
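As one concrete instance of recursively applying a transition kernel K(x^{(i)} | x^{(i−1)}) that leaves π invariant, here is a hedged sketch of a random-walk Metropolis kernel for a one-dimensional target; the target density and step size are illustrative assumptions, not the scheme developed later in the slides.

```python
import numpy as np

def rw_metropolis_kernel(x, log_pi, rng, step=0.5):
    """One application of a random-walk Metropolis kernel K(.|x) with invariant density pi."""
    proposal = x + step * rng.normal()
    accept = np.log(rng.uniform()) < log_pi(proposal) - log_pi(x)
    return proposal if accept else x

log_pi = lambda x: -0.5 * x ** 2        # target pi = N(0, 1), up to a constant
rng = np.random.default_rng(1)
chain = [0.0]
for _ in range(5_000):                  # x^(i) ~ pi asymptotically
    chain.append(rw_metropolis_kernel(chain[-1], log_pi, rng))
```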
Missing Data, Data Augmentation
Idea: extend sampling space π(x) → π(x, z)
with constraint ∫ π(x, z) dz = π(x)
such that Markov chain (x^{(i)}, z^{(i)}) ∼ π faster
• Optimization : Expectation-Maximization (EM) algorithm
• Simulation : Data Augmentation, Gibbs sampling
4
Efficient Data Augmentation Schemes
Information introduced by the missing data governs convergence speed
Idea: construct the missing data space to be as little informative as possible
EM algorithm → Space Alternating Generalized EM: SAGE
algorithm, Hero and Fessler 1994:
• update parameter components by subblocks
• specific missing data space associated with each subblock
• less informative complete data spaces → improved convergence rate
5
Efficient Data Augmentation Sampling Schemes
SAGE Idea → efficient MCMC algorithm:
• sample parameter components by subblocks
• each subblock of parameters is sampled conditional on a specific
missing data set
⇒ Space Alternating Data Augmentation (SADA)
• Optimization : EM algorithm → SAGE algorithm
• Simulation : DA, Gibbs sampling → SADA
6
Overview
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• Application to Finite Mixture of Gaussians
7
EM and SAGE Algorithms
Bayesian framework: obtaining MAP estimate of random variable X
given realization of Y = y
x_MAP = arg max_x p(x|y)
where
p(x|y) ∝ p(y|x) p(x)
X is a random vector whose components are partitioned into n subsets
X = X_{1:n} = (X_1, . . . , X_n)
Notation: X_{−k} = X_{1:n} \ {X_k} = (X_1, . . . , X_{k−1}, X_{k+1}, . . . , X_n) and
Z_{k:j} = (Z_k, Z_{k+1}, . . . , Z_j)
8
Expectation-Maximization (EM) algorithm
→ Maximize p (x|y)
⇒ introduce missing data Z with given conditional distribution
p (z|y, x)
EM, iteration i:
x^{(i)} = arg max_x ∫ log(p(x, z|y)) p(z|y, x^{(i−1)}) dz
9
Space Alternating EM (SAGE) algorithm
→ Maximize p (x|y)
⇒ introduce n missing data sets Z_{1:n}, where each random variable Z_k
is given a conditional distribution p(z_k|y, x_{1:n}) satisfying
p(y|x_{1:n}, z_k) = p(y|x_{−k}, z_k)
10
Space Alternating EM (SAGE) algorithm
SAGE, iteration i:
• select index k ∈ {1, . . . , n}
• set x_k^{(i)} as
arg max_{x_k} ∫ log p(x_{−k}^{(i−1)}, x_k, z_k | y) p(z_k | y, x^{(i−1)}) dz_k
and x_{−k}^{(i)} = x_{−k}^{(i−1)}
Components updated cyclically: iteration i update of component
k = (i mod n) + 1
11
DA and SADA Algorithms
Bayesian framework: objective not only to maximize p (x|y) but to
obtain random samples X^{(i)} distributed according to p(x|y)
Based on samples X^{(i)}, approximation of MMSE estimate:
x_MMSE = (1/N) ∑_{i=1}^{N} X^{(i)} → x_MMSE = ∫ x p(x|y) dx
Also possible to compute posterior variances, confidence intervals or
predictive distributions.
Construction of efficient MCMC algorithms typically difficult
→ introduction of missing data
12
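A minimal sketch of the MMSE approximation above, assuming the MCMC draws X^{(i)} are stacked row-wise in a NumPy array (an illustrative convention, not from the slides):

```python
import numpy as np

def mmse_estimate(draws):
    """Empirical average (1/N) sum_i X^(i), approximating the integral of x p(x|y) dx."""
    return np.asarray(draws).mean(axis=0)

def posterior_variance(draws):
    """Componentwise posterior variance, obtained from the same draws."""
    return np.asarray(draws).var(axis=0)
```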
Data Augmentation, Gibbs sampling
→ Sample p (x|y)
⇒ introduce missing data Z with joint posterior distribution
p (x, z|y) = p (x|y) p (z|y, x)
Data Augmentation algorithm, iteration i given X^{(i−1)}:
• Sample Z^{(i)} ∼ p(· | y, X^{(i−1)})
• Sample X^{(i)} ∼ p(· | y, Z^{(i)})
13
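The two-step Data Augmentation iteration above can be written as a generic loop. The sketch below is one way to organize it in Python, with the model-specific full conditionals supplied as callables (hypothetical names, not from the slides).

```python
def data_augmentation(x0, sample_z_given_x, sample_x_given_z, n_iter, rng):
    """Generic Data Augmentation sweep:
       Z^(i) ~ p(.|y, X^(i-1)), then X^(i) ~ p(.|y, Z^(i))."""
    x, draws = x0, []
    for _ in range(n_iter):
        z = sample_z_given_x(x, rng)   # Z^(i) ~ p(z | y, x^(i-1))
        x = sample_x_given_z(z, rng)   # X^(i) ~ p(x | y, z^(i))
        draws.append(x)
    return draws
```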
Convergence of DA/Gibbs sampling algorithm
• Transition kernel associated to (X^{(i)}, Z^{(i)}) admits p(x, z|y) as
invariant distribution
• Under weak additional assumptions (irreducibility and aperiodicity),
the instantaneous distribution of (X^{(i)}, Z^{(i)}) converges towards
p(x, z|y) as i → +∞
14
Space Alternating Data Augmentation
→ Sample p (x|y)
⇒ introduce n missing data sets Z_{1:n}, where each random variable Z_k
is given a conditional distribution p(z_k|y, x_{1:n}), defining the joint
posterior distribution
p(x_{1:n}, z_{1:n}|y) = p(x_{1:n}|y) ∏_{k=1}^{n} p(z_k|y, x_{1:n})
Typically p(y|x_{1:n}, z_k) = p(y|x_{−k}, z_k), although not necessary
15
Space Alternating Data Augmentation
SADA algorithm, iteration i given X_{1:n}^{(i−1)}, with k = (i mod n) + 1:
• Sample Z_k^{(i)} ∼ p(· | y, X^{(i−1)})
• Sample X_k^{(i)} ∼ p(· | y, Z_k^{(i)}, X_{−k}^{(i−1)})
• Set X_{−k}^{(i)} = X_{−k}^{(i−1)}
16
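A hedged skeleton of the SADA sweep above, using 0-based indexing so the block refreshed at iteration i is k = i mod n (the slide's k = (i mod n) + 1 with 1-based indexing); the block-specific samplers are hypothetical callables.

```python
def sada(x0, sample_zk, sample_xk, n_iter, rng):
    """Generic SADA sweep (a sketch, not the authors' code).
       x is a list of n parameter sub-blocks; block k is refreshed through
       its own missing-data set Z_k while X_{-k} is left unchanged."""
    x = list(x0)
    n = len(x)
    history = []
    for i in range(n_iter):
        k = i % n                           # cyclic block selection
        zk = sample_zk(k, x, rng)           # Z_k^(i) ~ p(. | y, X^(i-1))
        x[k] = sample_xk(k, zk, x, rng)     # X_k^(i) ~ p(. | y, Z_k^(i), X_{-k}^(i-1))
        history.append(list(x))             # X_{-k}^(i) = X_{-k}^(i-1)
    return history
```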
Validity of SADA sampling algorithm
Generation of Markov chain (X_{1:n}^{(i)}, Z_{1:n}^{(i)}) with invariant distribution
p(x_{1:n}, z_{1:n}|y)
Idea: SADA equivalent to
• Sample (Z_k^{(i)}, Z_{−k}) ∼ p(· | y, X_{1:n}^{(i−1)})
• Sample (X_k^{(i)}, Z_{−k}) ∼ p(· | y, Z_k^{(i)}, X_{−k}^{(i−1)})
• Set X_{−k}^{(i)} = X_{−k}^{(i−1)}
17
Validity of SADA sampling algorithm
SADA → simulating Z_k and X_k, but also Z_{−k}, at each iteration
sampling according to full conditional distributions p(z_{1:n}|y, x_{1:n})
and p(x_{1:n}|y, z_{1:n}) ⇒ desired invariant distribution p(x_{1:n}, z_{1:n}|y)
sampling of Z_{−k} not necessary → discarded
18
Overview
• Introduction to EM and SAGE algorithms
• Introduction to Data Augmentation and SADA algorithms
• ⇒ Application to Finite Mixture of Gaussians
19
Finite Mixture of Gaussians
EM/DA algorithms routinely used to perform ML/MAP parameter
estimation/to sample the posterior distribution
Straightforward extensions to hidden Markov chains with Gaussian
observations
T i.i.d. observations Y_{1:T} in R^d, distributed according to a finite
mixture of s Gaussians
Y_t ∼ ∑_{j=1}^{s} π_j N(µ_j; Σ_j)
20
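For illustration, the sketch below simulates T i.i.d. observations from a finite mixture of s Gaussians as defined above; the particular weights, means and covariances are assumptions for the example only.

```python
import numpy as np

def sample_mixture(T, weights, means, covs, rng):
    """Draw T i.i.d. observations from sum_j pi_j N(mu_j, Sigma_j)."""
    labels = rng.choice(len(weights), size=T, p=weights)
    y = np.stack([rng.multivariate_normal(means[j], covs[j]) for j in labels])
    return y, labels

rng = np.random.default_rng(2)
weights = np.array([0.5, 0.3, 0.2])
means = [np.zeros(2), np.full(2, 3.0), np.array([-3.0, 3.0])]
covs = [np.eye(2)] * 3
y, z_true = sample_mixture(100, weights, means, covs, rng)
```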
Bayesian Estimation
Parameters X = {(µ_j, Σ_j, π_j) ; j = 1, . . . , s} unknown, random,
distributed according to conjugate prior distributions
µ_j|Σ_j ∼ N(α_j, Σ_j/λ_j)
Σ_j^{−1} ∼ W(r_j, C_j)
(π_1, . . . , π_s) ∼ D(ζ_1, . . . , ζ_s)
21
Bayesian Estimation
Σ^{−1} ∼ W(r, C): Wishart distribution, density proportional to
|Σ^{−1}|^{(r−d−1)/2} exp(−(1/2) tr(Σ^{−1} C^{−1}))
(π_1, . . . , π_s) ∼ D(ζ_1, . . . , ζ_s): Dirichlet distribution restricted to the
simplex, density proportional to ∏_{k=1}^{s} π_k^{ζ_k−1}
Hyperparameters {(αj, λj, rj, Cj, ζj) ; j = 1, . . . , s} assumed fixed but
could be estimated from data in a hierarchical Bayes model
22
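The conjugate priors of slides 21-22 can be sampled directly with NumPy/SciPy. In the sketch below, reading the Wishart density above, W(r, C) is mapped to scipy.stats.wishart(df=r, scale=C) for the precision Σ^{−1}; that mapping, and the hyperparameter values (taken from the simulation slide), are my assumptions.

```python
import numpy as np
from scipy.stats import wishart

def sample_prior(s, d, alpha, lam, r, C, zeta, rng):
    """One draw of (mu_1:s, Sigma_1:s, pi_1:s) from the conjugate priors of slide 21."""
    precisions = [wishart(df=r[j], scale=C[j]).rvs(random_state=rng) for j in range(s)]
    covs = [np.linalg.inv(P) for P in precisions]                    # Sigma_j
    means = [rng.multivariate_normal(alpha[j], covs[j] / lam[j]) for j in range(s)]
    weights = rng.dirichlet(zeta)                                    # (pi_1, ..., pi_s)
    return means, covs, weights

# Illustrative hyperparameters as on the simulation slide: zeta_j = 1, alpha_j = 0,
# lambda_j = 0.01, r_j = d + 1, C_j = 0.01 I (a small d is used here for readability).
d, s = 2, 3
rng = np.random.default_rng(3)
means, covs, weights = sample_prior(
    s, d,
    alpha=[np.zeros(d)] * s, lam=[0.01] * s,
    r=[d + 1] * s, C=[0.01 * np.eye(d)] * s,
    zeta=[1.0] * s, rng=rng)
```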
Missing Data for Finite Mixture of Gaussians
EM/DA introduce the i.i.d. missing data Z_t ∈ {1, . . . , s} such that
Y_t|Z_t = j ∼ N(µ_j; Σ_j)
Pr(Z_t = j) = π_j
Gibbs sampling algorithm, iteration i:
• sample discrete latent variables Z_t^{(i)} ∼ p(· | y_t, X^{(i−1)})
• compute sufficient statistics n_j^{(i)} = ∑_{t=1}^{T} δ_{Z_t^{(i)}, j},
n_j^{(i)} y_j^{(i)} = ∑_{t=1}^{T} δ_{Z_t^{(i)}, j} y_t and S_j^{(i)} = ∑_{t=1}^{T} δ_{Z_t^{(i)}, j} y_t y_t^T
• sample parameters
23
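A sketch of the latent-variable draw and the sufficient statistics of the Gibbs step above; y is a (T, d) array, and the categorical probabilities are proportional to π_j N(y_t; µ_j, Σ_j) (my reading of p(· | y_t, X^{(i−1)})).

```python
import numpy as np
from scipy.stats import multivariate_normal

def sample_labels(y, weights, means, covs, rng):
    """Z_t ~ p(. | y_t, X): categorical draw with weights prop. to pi_j N(y_t; mu_j, Sigma_j)."""
    s = len(weights)
    logp = np.stack([np.log(weights[j]) + multivariate_normal(means[j], covs[j]).logpdf(y)
                     for j in range(s)], axis=1)                     # shape (T, s)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(s, p=pt) for pt in p])

def sufficient_stats(y, z, s):
    """n_j, sum_t delta(z_t = j) y_t and S_j = sum_t delta(z_t = j) y_t y_t^T, j = 0..s-1."""
    n = np.array([(z == j).sum() for j in range(s)])
    sum_y = np.stack([y[z == j].sum(axis=0) for j in range(s)])
    S = np.stack([y[z == j].T @ y[z == j] for j in range(s)])
    return n, sum_y, S
```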
Gibbs sampling for Finite Mixture of Gaussians
sampling parameters, iteration i:
Σ_j^{−1(i)} ∼ W(r_j + n_j^{(i)}, Σ_j^{−1(i)})
then
µ_j^{(i)} | Σ_j^{(i)} ∼ N(m_j^{(i)}, Σ_j^{(i)} / (λ_j + n_j^{(i)}))
and (π_1^{(i)}, . . . , π_s^{(i)}) ∼ D(n_1^{(i)} + ζ_1, . . . , n_s^{(i)} + ζ_s) where
m_j^{(i)} = (λ_j α_j + n_j^{(i)} y_j^{(i)}) / (λ_j + n_j^{(i)})
and
Σ_j^{(i)} = C_j^{−1} + λ_j α_j α_j^T + S_j^{(i)} − (λ_j + n_j^{(i)}) m_j^{(i)} m_j^{(i)T}
24
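The parameter draws above can be implemented with the conjugate-posterior quantities m_j^{(i)} and the scale matrix defined on the slide. The sketch below is my reading: the Wishart scale passed to SciPy is the inverse of C_j^{−1} + λ_j α_j α_j^T + S_j − (λ_j + n_j) m_j m_j^T, consistent with the density convention of slide 22.

```python
import numpy as np
from scipy.stats import wishart

def sample_component(n_j, sum_y_j, S_j, alpha_j, lam_j, r_j, C_j, rng):
    """Draw (mu_j, Sigma_j) from the conditional posterior of slide 24 (a sketch)."""
    ybar_j = sum_y_j / max(n_j, 1)
    m_j = (lam_j * alpha_j + n_j * ybar_j) / (lam_j + n_j)
    inv_scale = (np.linalg.inv(C_j) + lam_j * np.outer(alpha_j, alpha_j)
                 + S_j - (lam_j + n_j) * np.outer(m_j, m_j))
    precision = wishart(df=r_j + n_j, scale=np.linalg.inv(inv_scale)).rvs(random_state=rng)
    Sigma_j = np.linalg.inv(precision)
    mu_j = rng.multivariate_normal(m_j, Sigma_j / (lam_j + n_j))
    return mu_j, Sigma_j

def sample_weights(n, zeta, rng):
    """(pi_1, ..., pi_s) ~ D(n_1 + zeta_1, ..., n_s + zeta_s)."""
    return rng.dirichlet(np.asarray(n) + np.asarray(zeta))
```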
Less Informative Missing Data
update only (µ_j, Σ_j), with (µ_{−j}, Σ_{−j}) fixed
→ binary missing data Z_{t,j} ∈ {0, j} such that Pr(Z_{t,j} = j) = π_j
Z_{t,j} = “observation coming from component j or not”, less
informative than knowing “from which particular component the
observation is derived”
constraint ∑_{j=1}^{s} π_j = 1 ⇒ cannot update π_j, use of standard EM
approach
25
Less Informative Missing Data
updating jointly the parameters of two components j and k
→ missing data Z_{t,j,k} ∈ {0, j, k} such that
Pr(Z_{t,j,k} = j) = π_j, Pr(Z_{t,j,k} = k) = π_k
and
Y_t|Z_{t,j,k} = j ∼ N(µ_j; Σ_j)
Y_t|Z_{t,j,k} = k ∼ N(µ_k; Σ_k)
Y_t|Z_{t,j,k} = 0 ∼ (∑_{l≠j,k} π_l N(µ_l; Σ_l)) / (∑_{l≠j,k} π_l)
26
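Both the SAGE updates and the SADA draws that follow require the posterior probabilities of the three-valued missing data Z_{t,j,k}. The sketch below computes them from the current parameters; it follows my reading that p(Z_{t,j,k} = l | y_t, X) ∝ π_l N(y_t; µ_l, Σ_l) for l ∈ {j, k}, and ∝ ∑_{l'≠j,k} π_{l'} N(y_t; µ_{l'}, Σ_{l'}) for the pooled state 0.

```python
import numpy as np
from scipy.stats import multivariate_normal

def z_jk_posteriors(y, j, k, weights, means, covs):
    """p(Z_{t,j,k} = j | y_t, X), p(Z_{t,j,k} = k | y_t, X), p(Z_{t,j,k} = 0 | y_t, X)."""
    s = len(weights)
    dens = np.stack([multivariate_normal(means[l], covs[l]).pdf(y) for l in range(s)], axis=1)
    p_j = weights[j] * dens[:, j]
    p_k = weights[k] * dens[:, k]
    others = [l for l in range(s) if l not in (j, k)]
    p_0 = dens[:, others] @ np.asarray([weights[l] for l in others])
    total = p_j + p_k + p_0
    return p_j / total, p_k / total, p_0 / total
```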
SAGE algorithm for Finite Mixture of Gaussians
update for (µ_j, Σ_j), iteration i:
µ_j^{(i)} = [λ_j α_j + ∑_{t=1}^{T} y_t p(Z_{t,j,k} = j | y_t, X^{(i−1)})] / [λ_j + ∑_{t=1}^{T} p(Z_{t,j,k} = j | y_t, X^{(i−1)})]
Σ_j^{(i)} = [C_j^{−1} + λ_j (µ_j^{(i)} − α_j)(µ_j^{(i)} − α_j)^T + ∑_{t=1}^{T} (y_t − µ_j^{(i)})(y_t − µ_j^{(i)})^T p(Z_{t,j,k} = j | y_t, X^{(i−1)})] / [r_j − d − 1 + λ_j + ∑_{t=1}^{T} p(Z_{t,j,k} = j | y_t, X^{(i−1)})]
27
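A direct transcription of the (µ_j, Σ_j) update above (a sketch, not the authors' code); p_j is the vector of probabilities p(Z_{t,j,k} = j | y_t, X^{(i−1)}), e.g. as returned by the helper sketched after slide 26.

```python
import numpy as np

def sage_mean_cov_update(y, p_j, alpha_j, lam_j, r_j, C_j):
    """SAGE update of (mu_j, Sigma_j) from slide 27; y is (T, d), p_j is (T,)."""
    d = y.shape[1]
    w = p_j.sum()
    mu_j = (lam_j * alpha_j + (p_j[:, None] * y).sum(axis=0)) / (lam_j + w)
    resid = y - mu_j
    scatter = (p_j[:, None, None] * np.einsum('ti,tj->tij', resid, resid)).sum(axis=0)
    Sigma_j = (np.linalg.inv(C_j)
               + lam_j * np.outer(mu_j - alpha_j, mu_j - alpha_j)
               + scatter) / (r_j - d - 1 + lam_j + w)
    return mu_j, Sigma_j
```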
SAGE algorithm for Finite Mixture of Gaussians
update for π_j, iteration i:
π_j^{(i)} = (1 − ∑_{l≠j,k} π_l^{(i−1)}) / (1 + [∑_{t=1}^{T} p(Z_{t,j,k} = k | y_t, X^{(i−1)}) + (ζ_k − 1)] / [∑_{t=1}^{T} p(Z_{t,j,k} = j | y_t, X^{(i−1)}) + (ζ_j − 1)])
28
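The π_j update above, in the same style; the slide gives the formula for π_j only, so assigning the remaining mass within the pair to π_k is an assumption on my part.

```python
import numpy as np

def sage_weight_update(p_j, p_k, pi_prev, j, k, zeta_j, zeta_k):
    """SAGE update of pi_j from slide 28; p_j[t] = p(Z_{t,j,k} = j | y_t, X^(i-1)), p_k likewise."""
    free_mass = 1.0 - sum(pi_prev[l] for l in range(len(pi_prev)) if l not in (j, k))
    ratio = (p_k.sum() + (zeta_k - 1.0)) / (p_j.sum() + (zeta_j - 1.0))
    pi_j = free_mass / (1.0 + ratio)
    pi_k = free_mass - pi_j          # assumed: pi_k takes the rest of the pair's mass
    return pi_j, pi_k
```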
SADA algorithm for Finite Mixture of Gaussians
SADA algorithm, iteration i, sample (µ_j, Σ_j, π_j):
• sample discrete latent variables Z_{t,j,k}^{(i)} ∼ p(· | y_t, X^{(i−1)})
• compute sufficient statistics n_j^{(i)} = ∑_{t=1}^{T} δ_{Z_{t,j,k}^{(i)}, j},
n_j^{(i)} y_j^{(i)} = ∑_{t=1}^{T} δ_{Z_{t,j,k}^{(i)}, j} y_t and S_j^{(i)} = ∑_{t=1}^{T} δ_{Z_{t,j,k}^{(i)}, j} y_t y_t^T
• sample parameters
29
SADA algorithm for Finite Mixture of Gaussians
sampling parameters, iteration i:
Σ_j^{−1(i)} ∼ W(r_j + n_j^{(i)}, Σ_j^{−1(i)})
then
µ_j^{(i)} | Σ_j^{(i)} ∼ N(m_j^{(i)}, Σ_j^{(i)} / (λ_j + n_j^{(i)}))
and
(π_j^{(i)}, π_k^{(i)}) ∼ (1 − ∑_{l≠j,k} π_l^{(i−1)}) D(n_j^{(i)} + ζ_j, n_k^{(i)} + ζ_k)
30
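A sketch of the (π_j, π_k) draw above: a two-dimensional Dirichlet draw rescaled by the mass left over by the components that stay fixed. The (µ_j, Σ_j) draw is identical to the Gibbs sketch after slide 24, only with the sufficient statistics computed from Z_{t,j,k}.

```python
import numpy as np

def sada_weight_draw(n_j, n_k, zeta_j, zeta_k, pi_prev, j, k, rng):
    """(pi_j, pi_k) ~ (1 - sum_{l != j,k} pi_l) * D(n_j + zeta_j, n_k + zeta_k)."""
    free_mass = 1.0 - sum(pi_prev[l] for l in range(len(pi_prev)) if l not in (j, k))
    u = rng.dirichlet([n_j + zeta_j, n_k + zeta_k])    # (u_j, u_k) with u_j + u_k = 1
    return free_mass * u[0], free_mass * u[1]
```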
Simulations
Mixture of s = 5 Gaussians of dimension d = 10, T = 100 observations; parameters
of components sampled from the prior with ζ_j = 1, α_j = 0,
λ_j = 0.01, r_j = d + 1 and C_j = 0.01 I
200 iterations of EM and SAGE, repeated 50 times
5000 iterations of DA and SADA, repeated 10 times
Results:
• EM/SAGE: mean of log-posterior values at final iteration
• DA/SADA: mean of average log-posterior values over the last 1000
iterations
31
Simulations Results
s       EM       SAGE     DA       SADA
5     -915.8   -671.5   -873.7   -886.0
6     -929.6   -603.2   -877.3   -886.7
7     -941.4   -576.5   -893.9   -906.9
8     -965.7   -559.2   -904.9   -875.0
9     -968.9   -503.0   -898.8   -882.5
10    -983.2   -478.1   -924.0   -906.6
Log-posterior values at the final iteration for EM/SAGE, and average
log-posterior values for DA/SADA
32
References - EM/SAGE/MCMC
• G. J. McLachlan and T. Krishnan, The EM Algorithm and
Extensions, Wiley Series in Probability and Statistics, 1997
• J. A. Fessler and A. O. Hero, Space-alternating generalized
expectation-maximization algorithm, IEEE Trans. Sig. Proc.,
42:2664–2677, 1994
• C. P. Robert and G. Casella, Monte Carlo Statistical Methods,
Springer-Verlag, 1999
33
References - Finite Mixture of Gaussians
• J. L. Gauvain and C. H. Lee, Maximum a Posteriori estimation
for multivariate Gaussian mixture observations of Markov chains,
IEEE Trans. Speech Audio Proc., 2:291-298, 1994
• G. J. McLachlan and D. Peel, Finite Mixture Models, Wiley
Series in Probability and Statistics, 2000
• G. Celeux, S. Chrétien, F. Forbes and A. Mkhadri, A
component-wise EM algorithm for mixtures, J. Comp. Graph.
Stat., 10, 699-712, 2001
34