RJMCMC in clustering

Clustering by mixture model

Pham The Thong

April 22, 2011


Outline

1 RJMCMC in clustering
    Clustering overview
    Reversible Jump MCMC
2 Richardson & Green (1997): On Bayesian Analysis of Mixtures with an Unknown Number of Components
    Overview
    Split/Merge and Birth/Death Mechanism
    Algorithm
    Result
3 Tadesse et al. (2005): Bayesian Variable Selection in Clustering High-Dimensional Data
    Overview
    Variable Selection
    RJMCMC Mechanism
    Result
    Weakness of the model


RJMCMC in clustering: Clustering overview


Clustering overview

Divide the observations into groups.
Predict the group of a new observation.
Model-based clustering: select a probabilistic model
that underlies the observations and make
statistical inferences based on that model. One
popular choice is the mixture model.



Clustering via mixture model
Let $X = (x_1, \dots, x_n)$ be independent p-dimensional
observations from G populations:

\[
f(x_i \mid w, \theta) = \sum_{k=1}^{G} w_k \, f(x_i \mid \theta_k)
\]

$f(x_i \mid \theta_k)$ is the density of an observation $x_i$ from the kth
component.
$w = (w_1, \dots, w_G)^T$ are the component weights.
$\theta = (\theta_1, \dots, \theta_G)^T$ are the component parameters.
Clustering is done via the allocation vector
$y = (y_1, \dots, y_n)^T$: $y_i = k$ if the ith observation $x_i$ comes
from component k.
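
As a rough illustration of the mixture density and the allocation vector, here is a minimal sketch. It assumes univariate normal components (p = 1); the function names and the example parameter values are hypothetical, and component labels are 0-based rather than 1-based.

```python
import numpy as np

def mixture_density(x, w, mu, sigma):
    """Evaluate f(x | w, theta) = sum_k w_k N(x; mu_k, sigma_k^2) at each point of x."""
    x = np.asarray(x, dtype=float)[:, None]                # shape (n, 1)
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return comp @ w                                        # weighted sum over components

def sample_mixture(n, w, mu, sigma, rng):
    """Draw the allocation y_i first, then x_i from component y_i."""
    y = rng.choice(len(w), size=n, p=w)
    x = rng.normal(mu[y], sigma[y])
    return x, y

# Hypothetical two-component example.
w = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])

rng = np.random.default_rng(0)
x, y = sample_mixture(500, w, mu, sigma, rng)

grid = np.linspace(-20.0, 20.0, 4001)
dens = mixture_density(grid, w, mu, sigma)
total_mass = dens.sum() * (grid[1] - grid[0])              # Riemann sum, close to 1
```

The allocation vector y is exactly the latent label used to generate each x_i; clustering amounts to inferring it from x alone.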


Some approaches

Model selection: fit fixed-G models for several values
of G and compare them with a model selection
criterion to choose the best G. Inference on a fixed-G
model is often done via the EM algorithm or a Gibbs sampler.
Nonparametric method: use a Dirichlet process.
Trans-dimensional Markov chain Monte Carlo
(MCMC): allow G to change during the
inference process by combining a Gibbs sampler with
MCMC moves that can change the dimension of the
model. Reversible jump MCMC (RJMCMC) is one
possible scheme.

RJMCMC in clustering: Reversible Jump MCMC


Overview

First developed in Green (1995) [1].
Its applications range well beyond mixture model
analysis.
Its power for mixture model analysis was first demonstrated in
Richardson & Green (1997) [2], who considered only the
1-dimensional case.
Applied to the multidimensional setting in Tadesse et al.
(2005) [3].



Some advantages of clustering by
RJMCMC

Avoids the separate task of model selection.
Provides a coherent Bayesian framework: the cluster
number G is treated like any other parameter.
Can provide useful summaries of the data that are
difficult to obtain by other methods.



General ideas of RJMCMC I

Simulate a Markov chain that converges to the
full posterior distribution p(G, y, w, θ | X).
The hybrid sampler consists of a Gibbs sampler (the base)
and jump moves (the extension).
The Gibbs sampler samples (y, w, θ). The jump moves
sample the cluster number G.
The jump moves come in pairs: Split/Merge and
Birth/Death.



General ideas of RJMCMC II
Split move: split one component into two
components.
Merge move: combine two components into one
component.
Birth move: create an empty component.
Death move: delete an empty component.
At each iteration, propose a Split (Birth)
move with some fixed probability $b_k$ (k being the current
number of components), and propose a Merge (Death)
move with probability $1 - b_k$.
For each proposal, compute all the changes to the
model as if the move had been made.
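
The choice between the two moves of a pair can be sketched as follows. The specific values used here (always propose Split/Birth with one component, never with the maximum number, 0.5 otherwise) are an assumption of this sketch, as are the names `split_probability`, `choose_move`, and the cap `G_max` (assumed to be at least 2).

```python
import random

def split_probability(G, G_max):
    """b_G: probability of proposing Split (or Birth) when there are G components."""
    if G == 1:
        return 1.0      # nothing to merge or delete
    if G == G_max:
        return 0.0      # no room for another component
    return 0.5

def choose_move(G, G_max, pair, rng):
    """Pick one move from the pair 'split/merge' or 'birth/death'."""
    up, down = ("split", "merge") if pair == "split/merge" else ("birth", "death")
    return up if rng.random() < split_probability(G, G_max) else down

rng = random.Random(0)
first = choose_move(1, 10, "split/merge", rng)    # only Split is possible
last = choose_move(10, 10, "birth/death", rng)    # only Death is possible
```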


General ideas of RJMCMC III

Calculate the acceptance ratio A, which is the
product of three terms:
the ratio of the posterior of the new model to that of the
old model,
the ratio of the probability of proposing the reverse move (from the
new model back to the old model) to that of proposing the
forward move (from the old model to the new model),
the Jacobian arising from the change of dimension.
To ensure convergence to the desired distribution,
carry out the move only with probability
min(1, A).
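
In practice the three terms are usually combined in log space to avoid overflow. The following is a minimal sketch of that bookkeeping; the function and argument names are hypothetical, and each argument is assumed to be computed elsewhere for the specific move.

```python
import math
import random

def log_acceptance_ratio(log_post_new, log_post_old,
                         log_q_reverse, log_q_forward, log_abs_jacobian):
    """log A = log posterior ratio + log proposal ratio + log |Jacobian|."""
    return ((log_post_new - log_post_old)
            + (log_q_reverse - log_q_forward)
            + log_abs_jacobian)

def accept_move(log_A, rng):
    """Carry out the move with probability min(1, A)."""
    return rng.random() < math.exp(min(0.0, log_A))

rng = random.Random(0)
log_A = log_acceptance_ratio(-10.0, -12.0, -1.0, -3.0, 0.5)
always = accept_move(5.0, rng)             # A > 1, so the move is always made
never = accept_move(float("-inf"), rng)    # A = 0, so the move is never made
```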


Richardson & Green (1997): Overview


Overview

1-dimensional data.
Goals:
Clustering the data.
Estimating the component parameters.
Estimating the distribution of the data.
Predicting the group of new data.
Demonstrated on three real datasets: Enzyme, Acidity,
and Galaxy.


Richardson & Green (1997): Split/Merge and Birth/Death Mechanism



Split/Merge Mechanism

In a Split move, select one component $(w_{j^*}, \mu_{j^*}, \sigma_{j^*})$
to split into two components $(w_{j_1}, \mu_{j_1}, \sigma_{j_1})$ and
$(w_{j_2}, \mu_{j_2}, \sigma_{j_2})$.
In a Merge move, select two components $(w_{j_1}, \mu_{j_1}, \sigma_{j_1})$
and $(w_{j_2}, \mu_{j_2}, \sigma_{j_2})$ to merge into one new component
$(w_{j^*}, \mu_{j^*}, \sigma_{j^*})$.
In both moves, the zeroth, first, and second moments of the
single component are matched to those of the combination of the
two components.
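
The moment matching can be sketched in code. Components are (weight, mean, standard deviation) triples. The split uses three auxiliary numbers u1, u2, u3 in (0, 1), which is one standard construction that preserves the three moments; how they are drawn (for example from Beta distributions) is left to the caller and is an assumption of this sketch.

```python
import math

def merge(c1, c2):
    """Combine two components so that the zeroth, first, and second moments match."""
    (w1, m1, s1), (w2, m2, s2) = c1, c2
    w = w1 + w2                                            # zeroth moment
    m = (w1 * m1 + w2 * m2) / w                            # first moment
    second = (w1 * (m1**2 + s1**2) + w2 * (m2**2 + s2**2)) / w
    return w, m, math.sqrt(second - m**2)                  # second moment

def split(c, u1, u2, u3):
    """Split one component into two; merge(*split(c, ...)) recovers c."""
    w, m, s = c
    w1, w2 = w * u1, w * (1.0 - u1)
    m1 = m - u2 * s * math.sqrt(w2 / w1)
    m2 = m + u2 * s * math.sqrt(w1 / w2)
    s1 = math.sqrt(u3 * (1.0 - u2**2) * s**2 * w / w1)
    s2 = math.sqrt((1.0 - u3) * (1.0 - u2**2) * s**2 * w / w2)
    return (w1, m1, s1), (w2, m2, s2)

original = (0.4, 1.0, 2.0)
c1, c2 = split(original, u1=0.3, u2=0.5, u3=0.6)
recovered = merge(c1, c2)
```

Because the split introduces three new numbers and three new parameters, the change of variables has a well-defined Jacobian, which enters the acceptance ratio.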



Birth/Death Mechanism

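Birth move
Generate $w_{j^*}, \mu_{j^*}, \sigma_{j^*}$ for the new empty component from proposal distributions.
Rescale the existing weights so that all weights sum to 1.
Death move
Delete a randomly chosen empty component.
Rescale the remaining weights so that they sum to 1.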
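
A minimal sketch of the two moves, focusing on the weight rescaling. The new component's parameters are passed in by the caller; drawing them (for instance a Beta(1, G) draw for the weight and prior draws for the mean and standard deviation) is an assumption left outside this sketch. `counts` is a hypothetical array holding the number of observations allocated to each component.

```python
import numpy as np

def birth(w, mu, sigma, new_w, new_mu, new_sigma):
    """Add an empty component and shrink the old weights by (1 - new_w)."""
    w_new = np.append(w * (1.0 - new_w), new_w)
    return w_new, np.append(mu, new_mu), np.append(sigma, new_sigma)

def death(w, mu, sigma, counts, rng):
    """Delete a randomly chosen empty component; return None if there is none."""
    empty = np.flatnonzero(counts == 0)
    if empty.size == 0:
        return None
    j = rng.choice(empty)
    keep = np.arange(len(w)) != j
    return w[keep] / w[keep].sum(), mu[keep], sigma[keep]

rng = np.random.default_rng(0)
w = np.array([0.5, 0.5])
mu = np.array([0.0, 3.0])
sigma = np.array([1.0, 1.0])

w_b, mu_b, sigma_b = birth(w, mu, sigma, new_w=0.2, new_mu=6.0, new_sigma=2.0)
counts = np.array([3, 2, 0])                 # the new component is empty
w_d, mu_d, sigma_d = death(w_b, mu_b, sigma_b, counts, rng)
```

Since neither move touches any observation, the allocation vector y is unchanged by Birth and Death.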


Richardson & Green (1997): Algorithm


One iteration contains
Gibbs Sampler:
Updating the weights w
Updating the parameters µ, σ
Updating the allocation y
Split/Merge move
Birth/Death move
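
Two of the Gibbs updates can be sketched as follows, assuming a symmetric Dirichlet prior with parameter `delta` on the weights (an assumption of this sketch). The update of µ and σ depends on the chosen priors and is omitted; the function names are hypothetical and labels are 0-based.

```python
import numpy as np

def update_weights(y, G, delta, rng):
    """Draw w from its full conditional Dirichlet(delta + n_1, ..., delta + n_G)."""
    counts = np.bincount(y, minlength=G)
    return rng.dirichlet(delta + counts)

def update_allocations(x, w, mu, sigma, rng):
    """Draw each y_i with probability proportional to w_k N(x_i; mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)[:, None]
    log_p = np.log(w) - np.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2
    log_p -= log_p.max(axis=1, keepdims=True)       # stabilise before exponentiating
    p = np.exp(log_p)
    p /= p.sum(axis=1, keepdims=True)
    u = rng.random((len(p), 1))
    # Inverse-CDF sampling per row; clip guards against rounding in the cumsum.
    return np.minimum((u > p.cumsum(axis=1)).sum(axis=1), len(w) - 1)

rng = np.random.default_rng(0)
x = np.array([-10.0, -10.1, 10.0, 10.2])            # two well-separated groups
mu = np.array([-10.0, 10.0])
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

y = update_allocations(x, w, mu, sigma, rng)
w = update_weights(y, G=2, delta=1.0, rng=rng)
```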


Richardson & Green (1997): Result



Post simulation

By processing the raw output of the simulation,
one can
cluster the data by selecting the allocation vector y
that has the highest frequency,
estimate the component parameters by their posterior
means,
estimate the distribution of the data,
predict the group of new data.
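
Two of these summaries can be sketched directly from stored draws. The sketch assumes the draws are kept as plain lists, one entry per iteration, and that component labels are comparable across iterations (for example because the components are kept in a fixed order); the function names and the toy draws are hypothetical.

```python
from collections import Counter
import numpy as np

def most_frequent_allocation(allocations):
    """Return the allocation vector y that appears most often among the draws."""
    counts = Counter(tuple(int(v) for v in y) for y in allocations)
    best, _ = counts.most_common(1)[0]
    return np.array(best)

def posterior_of_G(G_draws):
    """Estimate p(G | X) by the relative frequency of each G among the draws."""
    values, counts = np.unique(G_draws, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

# Toy draws standing in for real sampler output.
allocations = [[0, 0, 1], [0, 0, 1], [0, 1, 1]]
G_draws = [2, 2, 3, 2]

y_hat = most_frequent_allocation(allocations)
p_G = posterior_of_G(G_draws)
```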



The three datasets

Enzyme data: enzymatic activity of one enzyme in
the blood of 245 unrelated people. The interest is in
identifying subgroups of slow or fast activity as a
marker of genetic polymorphism in the general
population (i.e., to some extent, people in the same
subgroup may have a similar genetic structure
although they are unrelated).
Acidity data: acidity level of 155 lakes in Wisconsin.
Galaxy data: velocities of 82 galaxies diverging from
our galaxy.


Tadesse et al. (2005): Overview


Overview

High-dimensional data.
Goals:
Selecting variables.
Clustering the data.
Predicting the group of new data.
Applied to microarray data.


Tadesse et al. (2005): Variable Selection



Concept

Perhaps not all variables are useful for clustering.
By throwing away non-discriminating variables
(irrelevant variables) and clustering only on
discriminating variables (relevant variables) we may
improve clustering accuracy.
We can think of variable selection as one way to
generalize the basic approach “clustering by the full
set of variables” to “clustering by a subset of
variables”.



The model of Tadesse et al. I
Introduce $\gamma = (\gamma_1, \dots, \gamma_p)$: $\gamma_j = 1$ if the jth variable is
a discriminating variable and 0 if it is not.
Use $(\gamma)$ and $(\gamma^c)$ to index the discriminating variables and
the non-discriminating variables.
Three assumptions:
The set of discriminating variables and the set of
non-discriminating variables are independent.
Restricted to $(\gamma^c)$, the data $X_{(\gamma^c)}$ follow a single
normal distribution (hence are unsuitable for clustering).
Restricted to $(\gamma)$, the data $X_{(\gamma)}$ follow a mixture
of G normal components (hence are
suitable for clustering).


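The model of Tadesse et al. II
$(\eta_{(\gamma^c)}, \Omega_{(\gamma^c)})$: mean and covariance for the
non-discriminating variables.
$(\mu_{k(\gamma)}, \Sigma_{k(\gamma)})$: mean and covariance for the kth
component $C_k$.
The three assumptions can be written as

\[
p(X \mid G, \gamma, w, y, \mu, \Sigma, \eta, \Omega)
= \prod_{i=1}^{n} N\left(x_{i(\gamma^c)}, \eta_{(\gamma^c)}, \Omega_{(\gamma^c)}\right)
\prod_{k=1}^{G} \prod_{x_i \in C_k} N\left(x_{i(\gamma)}, \mu_{k(\gamma)}, \Sigma_{k(\gamma)}\right)
\]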
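
The likelihood above can be evaluated in log form as sketched below. The sketch assumes that `eta` and `Omega` are already restricted to the non-discriminating variables and that `mu[k]` and `Sigma[k]` are already restricted to the discriminating ones; the function names and the toy values are hypothetical, and labels in `y` are 0-based.

```python
import numpy as np

def log_mvn(X, mean, cov):
    """Sum over the rows of X of log N(x; mean, cov)."""
    n, d = X.shape
    if n == 0 or d == 0:
        return 0.0
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * (n * (d * np.log(2.0 * np.pi) + logdet) + maha)

def log_likelihood(X, gamma, y, mu, Sigma, eta, Omega):
    """log p(X | G, gamma, w, y, mu, Sigma, eta, Omega) for the model above."""
    gamma = np.asarray(gamma, dtype=bool)
    total = log_mvn(X[:, ~gamma], eta, Omega)           # non-discriminating block
    for k in range(len(mu)):                            # one term per component C_k
        total += log_mvn(X[y == k][:, gamma], mu[k], Sigma[k])
    return total

# Toy check: 4 observations, 3 variables, only the first one discriminating.
X = np.zeros((4, 3))
gamma = [1, 0, 0]
y = np.array([0, 0, 1, 1])
mu = [np.zeros(1), np.zeros(1)]
Sigma = [np.eye(1), np.eye(1)]
value = log_likelihood(X, gamma, y, mu, Sigma, eta=np.zeros(2), Omega=np.eye(2))
expected = -6.0 * np.log(2.0 * np.pi)                   # 12 standard normal terms at 0
```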



Searching for γ

The problem of variable selection is recast as the
problem of searching for the most probable binary
vector γ.
Use a Metropolis search (of which simulated
annealing is one type).
At each step, randomly choose one of two
transition moves, flip one bit or swap two bits
of γ, and accept the move with probability

\[
\min\left\{1, \frac{p(\gamma^{new} \mid X, y, w, G)}{p(\gamma^{old} \mid X, y, w, G)}\right\}
\]
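
A minimal sketch of this search on a toy problem. `log_post` is a hypothetical callable standing in for log p(γ | X, y, w, G) up to a constant; here it is replaced by a toy function that simply penalises disagreement with a known target vector. Choosing the move type with probability 0.5 and swapping a 1 with a 0 are assumptions of this sketch; they keep the proposal symmetric so the plain posterior ratio applies.

```python
import math
import random
from collections import Counter

def propose(gamma, rng):
    """Flip one bit, or swap a 1 with a 0, each with probability 0.5."""
    new = list(gamma)
    if rng.random() < 0.5:
        j = rng.randrange(len(new))
        new[j] = 1 - new[j]
    else:
        ones = [j for j, g in enumerate(new) if g == 1]
        zeros = [j for j, g in enumerate(new) if g == 0]
        if ones and zeros:                  # otherwise propose no change
            new[rng.choice(ones)] = 0
            new[rng.choice(zeros)] = 1
    return new

def metropolis_step(gamma, log_post, rng):
    """Accept the proposal with probability min(1, p(new) / p(old))."""
    new = propose(gamma, rng)
    log_ratio = log_post(new) - log_post(gamma)
    return new if rng.random() < math.exp(min(0.0, log_ratio)) else gamma

TARGET = [1, 0, 1, 0]                       # toy "true" set of variables

def toy_log_post(gamma):
    return -5.0 * sum(a != b for a, b in zip(gamma, TARGET))

rng = random.Random(0)
gamma = [0, 0, 0, 0]
visits = Counter()
for _ in range(3000):
    gamma = metropolis_step(gamma, toy_log_post, rng)
    visits[tuple(gamma)] += 1
most_visited = visits.most_common(1)[0][0]
```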


Tadesse et al. (2005): RJMCMC Mechanism



Difficulties in high dimension

Unlike in the 1-dimensional case, there is no obvious way
to split a covariance matrix into two covariance
matrices. Even if this can be done [4], the Jacobian
may not have a closed form.
The number of model parameters grows
on the order of $p^2$, so the chain may converge very slowly.



Approach of Tadesse et al.

Integrate out the mean vectors and the covariance
matrices to obtain a marginalized posterior in which
only G, w, γ, and y are involved.
Although quite tedious, the math follows a
standard framework: define conjugate priors for the
means and covariance matrices, then integrate them out.
Only the component weights need to be split or merged
in the Split/Merge move. The Birth/Death
moves are the same as in the 1-dimensional case.



Algorithm

One iteration contains
Metropolis search for γ
Gibbs sampler:
Updating the weights w
Updating the allocation y
Split/Merge move
Birth/Death move


Tadesse et al. (2005): Result



Post simulation

Since the means and covariances are integrated out,
there are no estimates of the component parameters.
Variable selection:
Method 1: select the vector γ that has the highest
frequency.
Method 2: select all variables j whose marginal posterior probability
$p(\gamma_j = 1 \mid X, G)$ is at least some threshold a.
Clustering and group prediction can be done in the
same way as in the univariate case.
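
Both selection methods reduce to counting over the stored γ draws. The sketch below assumes the draws have already been restricted to iterations with the chosen G; the function names, the toy draws, and the threshold value are hypothetical.

```python
from collections import Counter
import numpy as np

def most_frequent_gamma(gamma_draws):
    """Method 1: the vector gamma visited most often."""
    counts = Counter(tuple(int(g) for g in draw) for draw in gamma_draws)
    return list(counts.most_common(1)[0][0])

def select_by_threshold(gamma_draws, a):
    """Method 2: variables whose estimated p(gamma_j = 1 | X, G) is at least a."""
    inclusion = np.asarray(gamma_draws).mean(axis=0)
    return np.flatnonzero(inclusion >= a), inclusion

# Toy draws standing in for real sampler output.
gamma_draws = [[1, 0, 1], [1, 0, 0], [1, 1, 0], [1, 0, 0]]

best_gamma = most_frequent_gamma(gamma_draws)
selected, inclusion = select_by_threshold(gamma_draws, a=0.5)
```

Method 2 is usually the more stable of the two when p is large, since any single vector γ may be visited only a handful of times.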



Microarray data

14 samples (taken from tissues).
The variables are genes. There are 762 variables.
By clustering the samples into subgroups, one may
find out which genes are relevant to each subgroup.


Tadesse et al. (2005): Weakness of the model



Weakness of the model [5]

The independence assumption can often lead to
the wrong conclusion that an irrelevant variable is
a discriminating one, because it is
related to some discriminating variables.
It is not known whether one can relax this
assumption while still being able to perform an
RJMCMC-based full Bayesian analysis.



References
[1] P. J. Green (1995), Reversible jump Markov chain Monte Carlo
computation and Bayesian model determination, Biometrika
82(4), 711-732.
[2] S. Richardson and P. J. Green (1997), On Bayesian Analysis of
Mixtures with an Unknown Number of Components, J. R. Statist.
Soc. B 59(4), 731-792.
[3] M. G. Tadesse, N. Sha, and M. Vannucci (2005), Bayesian Variable
Selection in Clustering High-Dimensional Data, Journal of the
American Statistical Association 100(470), 602-617.
[4] P. Dellaportas and I. Papageorgiou (2006), Multivariate
mixtures of normals with unknown number of components, Statistics
and Computing 16(1), 57-68.
[5] Maugis et al. (2009), Variable Selection for Clustering with
Gaussian Mixture Models, Biometrics 65, 701-709.



Thank you for your attention

