Comparing estimation algorithms for block clustering models

Gilles Celeux

Projet SELECT, INRIA Saclay-Île-de-France

January 6, 2011 - BIG’MC seminar
Block clustering setting

   Block clustering of (binary) data

       Let y = {(y_ij); i ∈ I, j ∈ J} be an n × d binary matrix,
       where I is a set of n objects and J a set of d variables

       Permuting the rows and columns of y to discover a
       clustering structure on I × J.

       Getting a simple summary of the data matrix y.

       Many applications : recommendation systems, genomic
       data analysis, text mining, archeology, ...
Example


   [Figure: a 10 × 7 binary matrix (rows A-J, columns 1-7) shown in four
   panels; in the last panel the matrix is summarised by row clusters
   a, b, c and column clusters I, II, III.]

   (1)   Binary data matrix
   (2)   A partition on I
   (3)   A couple of partitions on I and J
   (4)   Summary of the binary matrix
Model-based clustering framework


      Assume that the data arise from a finite mixture of
      parametrised densities.

      A cluster is made of observations arising from the same
      density.

      In a block clustering model, clusters are defined on blocks
      of I × J.

      In a block clustering model, the data of a block are modelled
      by the same unidimensional density.
Latent block mixture model

   The density of the observed data is assumed to be

       f(y | g, m, φ, α) = Σ_{u ∈ U} p(u | g, m, φ) f(y | g, m, u, α)

   where u is the block indicator vector.
   It is assumed that u_{ijkℓ} = z_{ik} w_{jℓ}, z (resp. w) being the row (resp.
   column) cluster indicator vector.
   Assuming that the n × d variables Y_{ij} are conditionally
   independent given z and w leads to the model

       f(y | g, m, π, ρ, α) = Σ_{(z,w) ∈ Z×W} Π_{i,k} π_k^{z_{ik}} Π_{j,ℓ} ρ_ℓ^{w_{jℓ}} Π_{i,j,k,ℓ} φ(y_{ij} | g, m, α_{kℓ})
An example : Bernoulli latent block model

   Mixing proportions
   For fixed g, the mixing proportions for the rows are π1 , . . . , πg .
   For fixed m, the mixing proportions for the columns are ρ1 , . . . , ρm .

   The Bernoulli density per block

       φ(y_{ij}; α_{kℓ}) = (α_{kℓ})^{y_{ij}} (1 − α_{kℓ})^{1−y_{ij}},   α_{kℓ} ∈ (0, 1).

   The mixture density is

       f(y | g, m, π, ρ, α) = Σ_{(z,w) ∈ Z×W} Π_{i,k} π_k^{z_{ik}} Π_{j,ℓ} ρ_ℓ^{w_{jℓ}} Π_{i,j,k,ℓ} (α_{kℓ})^{y_{ij}} (1 − α_{kℓ})^{1−y_{ij}}.

   The parameters to be estimated are the πs, the ρs and the αs.
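
   A minimal Python/NumPy sketch of how a binary table can be drawn from this
   Bernoulli latent block model; the function name simulate_lbm and the example
   dimensions are illustrative assumptions.

```python
import numpy as np

def simulate_lbm(n, d, pi, rho, alpha, seed=None):
    """Draw (y, z, w): z_i ~ pi, w_j ~ rho, y_ij ~ Bernoulli(alpha[z_i, w_j])."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)                 # row labels
    w = rng.choice(len(rho), size=d, p=rho)               # column labels
    y = rng.binomial(1, alpha[z[:, None], w[None, :]])    # block-wise Bernoulli draws
    return y, z, w

# illustrative parameter values
pi = np.array([1/3, 1/3, 1/3])
rho = np.array([0.5, 0.5])
alpha = np.array([[0.6, 0.6], [0.4, 0.6], [0.6, 0.4]])
y, z, w = simulate_lbm(100, 60, pi, rho, alpha, seed=0)
```
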
Maximum likelihood estimation
  The loglikelihood of the model parameter is
  L(θ) = log f(y | g, m, π, ρ, α) (g and m fixed).

  L(θ) = log p(y, w, z | g, m, θ) − log p(w, z | y; g, m, θ)
       = IE[log p(y, w, z; θ) | y; θ^{(c)}] − IE[log p(w, z | y; θ) | y; θ^{(c)}]
       = Q(θ | θ^{(c)}) − H(θ | θ^{(c)})

  If θ̃ ∈ arg max_θ Q(θ | θ^{(c)}), then

      L(θ̃) − L(θ^{(c)}) = Q(θ̃ | θ^{(c)}) − Q(θ^{(c)} | θ^{(c)}) + H(θ^{(c)} | θ^{(c)}) − H(θ̃ | θ^{(c)}) ≥ 0

  EM algorithm
      E step : computing the conditional expectation of the
      complete loglikelihood, Q(θ | θ^{(c)})
      M step : maximising Q(θ | θ^{(c)}) in θ : θ^{(c)} → θ̃
Conditional expectation of the complete loglikelihood


   For the latent block model, it is

       Q(θ | θ^{(c)}) = Σ_{i,k} s_{ik}^{(c)} log π_k + Σ_{j,ℓ} t_{jℓ}^{(c)} log ρ_ℓ + Σ_{i,j,k,ℓ} e_{ijkℓ}^{(c)} log φ(y_{ij}; α_{kℓ})

   where

       s_{ik}^{(c)} = P(Z_{ik} = 1 | θ^{(c)}, y),    t_{jℓ}^{(c)} = P(W_{jℓ} = 1 | θ^{(c)}, y)

   and

       e_{ijkℓ}^{(c)} = P(Z_{ik} W_{jℓ} = 1 | θ^{(c)}, y).

   → The e_{ijkℓ}^{(c)} are difficult to compute... approximations are needed.
Variational interpretation of EM
   From the identity L(θ) = log p(y, z, w | θ) − log p(z, w | y, θ), we get

       L(θ) = IE_{q_{zw}} [ log ( p(y, z, w | θ) / q_{zw}(z, w) ) ] + KL(q_{zw} || p(z, w | y; θ))
            = F(q_{zw}, θ) + KL(q_{zw} || p(z, w | y; θ))


   EM as an alternating optimisation algorithm of F(q_{zw}, θ)
       E step : maximising F(q_{zw}, θ^{(c)}) in q_{zw}(.) with θ^{(c)} fixed leads to

           q_{zw}^{(c)}(.) = p(z, w | y; θ^{(c)}) = arg min_{q_{zw}} KL(q_{zw} || p(z, w | y; θ^{(c)}))

       M step : maximising F(q_{zw}^{(c)}, θ) in θ with q_{zw}^{(c)}(.) fixed ; it amounts
       to finding
           arg max_θ Q(θ | θ^{(c)}).
Variational approximation of EM (VEM)
  Restricting q_{zw} to a set of functions for which the E step is easily
  tractable : it is assumed that q_{zw}(z, w) = q_z(z) q_w(w). Then

      s_{ik}^{(c)} = P_{q_z}(Z_{ik} = 1 | θ^{(c)}, y),    t_{jℓ}^{(c)} = P_{q_w}(W_{jℓ} = 1 | θ^{(c)}, y),

      e_{ijkℓ}^{(c)} = s_{ik}^{(c)} t_{jℓ}^{(c)}.


  Govaert and Nadif (2008)
    1. E step : maximising the free energy F(q_{zw}, θ^{(c)}) until
       convergence
       1.1 computing the s_{ik} with the t_{jℓ}^{(c)} and θ^{(c)} fixed
       1.2 computing the t_{jℓ} with the s_{ik}^{(c+1)} and θ^{(c)} fixed
           → s^{(c+1)} and t^{(c+1)}
    2. M step : updating θ^{(c+1)}
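
  A minimal Python sketch of one possible VEM implementation for the Bernoulli
  latent block model, assuming the mean-field factorisation above; S and T hold
  the variational row and column probabilities, and all names are illustrative.
  For brevity the E step below does a single s/t pass per iteration, whereas
  Govaert and Nadif iterate it until convergence.

```python
import numpy as np

def softmax_rows(logp):
    """Row-wise softmax, used to normalise variational log-probabilities."""
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def vem(Y, g, m, n_iter=100, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    S = softmax_rows(rng.random((n, g)))      # q_z : row-cluster probabilities s_ik
    T = softmax_rows(rng.random((d, m)))      # q_w : column-cluster probabilities t_jl
    pi, rho = np.full(g, 1 / g), np.full(m, 1 / m)
    alpha = rng.uniform(0.2, 0.8, size=(g, m))
    for _ in range(n_iter):
        A1, A0 = np.log(alpha + eps), np.log(1 - alpha + eps)
        # variational E step (one pass here; iterate until convergence in practice)
        S = softmax_rows(np.log(pi + eps) + Y @ T @ A1.T + (1 - Y) @ T @ A0.T)
        T = softmax_rows(np.log(rho + eps) + Y.T @ S @ A1 + (1 - Y).T @ S @ A0)
        # M step
        pi, rho = S.mean(axis=0), T.mean(axis=0)
        alpha = (S.T @ Y @ T) / (S.sum(axis=0)[:, None] * T.sum(axis=0)[None, :] + eps)
    return pi, rho, alpha, S, T
```
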
Some characteristics of VEM



      The optimised free energy F(qzw , θ) is a lower bound of
      the observed loglikelihood.


      The parameter maximising the free energy could be
      expected to be a good, if not consistent, approximation of
      the maximum likelihood estimator.


      Since VEM is minimising KL(qzw ||p(z, w|y; θ)) rather than
      KL(p(z, w|y; θ)||qzw ), it is expected to be sensitive to
      starting values.
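
      The lower bound mentioned in the first point can be monitored numerically.
      A minimal sketch of the mean-field free energy F(q_{zw}, θ) for the
      Bernoulli latent block model, assuming S, T, pi, rho and alpha are the
      quantities produced by the VEM sketch of the previous slide (the function
      name is illustrative):

```python
import numpy as np

def free_energy(Y, S, T, pi, rho, alpha, eps=1e-10):
    """Mean-field free energy F(q_zw, theta) for the Bernoulli latent block model."""
    A1, A0 = np.log(alpha + eps), np.log(1 - alpha + eps)
    data_term = np.sum(S * (Y @ T @ A1.T + (1 - Y) @ T @ A0.T))   # E_q[log p(y | z, w, theta)]
    label_term = np.sum(S * np.log(pi + eps)) + np.sum(T * np.log(rho + eps))
    entropy = -np.sum(S * np.log(S + eps)) - np.sum(T * np.log(T + eps))
    return data_term + label_term + entropy
```

      This quantity should increase across the VEM iterations sketched above.
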
The SEM-Gibbs algorithm

  SEM
  The SEM algorithm (Celeux and Diebolt, 1985) : after the E step, an
  S step is introduced to simulate the missing data according to
  the distribution p(z, w | y; θ^{(c)}).
  A difficulty for the latent block model is to simulate p(z, w | y; θ).

  Gibbs sampling
  The distribution p(z, w | y; θ^{(c)}) is simulated using a Gibbs
  sampler. Repeat

          Simulate z^{(t+1)} according to p(z | y, w^{(t)}; θ^{(c)})
          Simulate w^{(t+1)} according to p(w | y, z^{(t+1)}; θ^{(c)})

  → The stationary distribution of this Markov chain is p(z, w | y; θ^{(c)}).
SEM-Gibbs for Bernoulli latent block model
    1. E and S steps :
       1.1 computation of p(z | y, w^{(c)}; θ^{(c)}), then simulation of z^{(c+1)} :

           p(z_i = k | y_{i·}, w^{(c)}) = π_k ψ_k(y_{i·}, α_{k·}) / Σ_{k'} π_{k'} ψ_{k'}(y_{i·}, α_{k'·}),   k = 1, . . . , g

           ψ_k(y_{i·}, α_{k·}) = Π_ℓ α_{kℓ}^{u_{iℓ}} (1 − α_{kℓ})^{d_ℓ − u_{iℓ}},   u_{iℓ} = Σ_j w_{jℓ}^{(c)} y_{ij},   d_ℓ = Σ_j w_{jℓ}^{(c)}

       1.2 computation of p(w | y, z^{(c+1)}; θ^{(c)}), then simulation of w^{(c+1)}
           → w^{(c+1)} and z^{(c+1)}
    2. M step :

           π_k^{(c+1)} = Σ_i z_{ik}^{(c+1)} / n,    ρ_ℓ^{(c+1)} = Σ_j w_{jℓ}^{(c+1)} / d

       and

           α_{kℓ}^{(c+1)} = Σ_{ij} z_{ik}^{(c+1)} w_{jℓ}^{(c+1)} y_{ij} / Σ_{ij} z_{ik}^{(c+1)} w_{jℓ}^{(c+1)}
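
    The iteration above can be written compactly with one-hot label matrices.
    Below is a minimal Python sketch of SEM-Gibbs for the Bernoulli latent block
    model, with a single Gibbs scan per E-S step (the default choice discussed
    on the next slide); all names are illustrative assumptions.

```python
import numpy as np

def softmax_rows(logp):
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def sample_rows(P, rng):
    """Draw one categorical label per row of the row-stochastic matrix P."""
    u = rng.random((P.shape[0], 1))
    return (P.cumsum(axis=1) > u).argmax(axis=1)

def sem_gibbs(Y, g, m, n_iter=2000, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    w = rng.integers(m, size=d)                    # random initial column labels
    pi, rho = np.full(g, 1 / g), np.full(m, 1 / m)
    alpha = rng.uniform(0.2, 0.8, size=(g, m))
    chain = []
    for _ in range(n_iter):
        A1, A0 = np.log(alpha + eps), np.log(1 - alpha + eps)
        # S step for the rows: p(z | y, w; theta), then one simulation of z
        W = np.eye(m)[w]                           # one-hot column labels (d x m)
        U, dl = Y @ W, W.sum(axis=0)               # u_il and column-cluster sizes d_l
        z = sample_rows(softmax_rows(np.log(pi + eps) + U @ A1.T + (dl - U) @ A0.T), rng)
        # S step for the columns: p(w | y, z; theta), then one simulation of w
        Z = np.eye(g)[z]                           # one-hot row labels (n x g)
        V, nk = Y.T @ Z, Z.sum(axis=0)
        w = sample_rows(softmax_rows(np.log(rho + eps) + V @ A1 + (nk - V) @ A0), rng)
        W = np.eye(m)[w]
        # M step from the simulated labels
        pi, rho = Z.mean(axis=0), W.mean(axis=0)
        alpha = (Z.T @ Y @ W) / (nk[:, None] * W.sum(axis=0)[None, :] + eps)
        chain.append((pi, rho, alpha))
    return chain
```
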
SEM features


     SEM does not increase the loglikelihood at each iteration.

     SEM generates an irreducible Markov chain with a
     unique stationary distribution.

     The parameter estimates fluctuate around the ML estimate
     → a natural estimator of (θ, z, w) is the mean of the
     (θ^{(c)}, z^{(c)}, w^{(c)}), c = B, . . . , B + C, obtained after a burn-in
     period (a small sketch of this averaging is given below).

     How many Gibbs iterations inside the E-S step ?
     → default version : one Gibbs sampler iteration.
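
     A minimal sketch of the post-burn-in averaging, assuming `chain` is the
     list of (pi, rho, alpha) draws returned by the sem_gibbs sketch above;
     B and C are the burn-in length and the number of retained iterations,
     and the values chosen here are illustrative.

```python
import numpy as np

B, C = 500, 1500                      # illustrative burn-in and number of kept draws
kept = chain[B:B + C]                 # `chain` comes from the sem_gibbs sketch above
pi_hat    = np.mean([p for p, _, _ in kept], axis=0)
rho_hat   = np.mean([r for _, r, _ in kept], axis=0)
alpha_hat = np.mean([a for _, _, a in kept], axis=0)
```
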
Numerical experiments

  Simulation design
     n = 100 rows, d = 60 columns,
     g = 3 components for I, m = 2 components for J,
     equal proportions on I and J.
      The parameters α have the form :

          α = ( 1−ε   1−ε
                 ε    1−ε
                1−ε    ε  )

      where ε defines the overlap between the mixture
      components.
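
      A small sketch building this α matrix as a function of the overlap ε
      (the pattern of the matrix is itself a reconstruction, inferred from the
      parameter values of the SEM convergence slide below):

```python
import numpy as np

def alpha_design(eps):
    """3 x 2 Bernoulli parameter matrix of the simulation design, overlap eps."""
    return np.array([[1 - eps, 1 - eps],
                     [eps,     1 - eps],
                     [1 - eps, eps]])

print(alpha_design(0.4))   # reproduces the alpha values of the SEM convergence slide
```
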
Comparing VEM and SEM-Gibbs


  Criteria of comparison

       Estimated parameter values vs. actual parameter values for θ.

       Distance between the MAP partition and the actual partition,
       where the distance between two couples of partitions
       u = (z, w) and u' = (z', w') is the relative frequency of
       disagreements

           δ(u, u') = 1 − (1/(nd)) Σ_{i,j,k,ℓ} z_{ik} w_{jℓ} z'_{ik} w'_{jℓ}.
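
   Since Σ_k z_{ik} z'_{ik} equals 1 exactly when row i gets the same label in
   both partitions (and similarly for the columns), δ reduces to one minus the
   product of the row and column agreement rates. A minimal sketch, assuming
   the cluster labels of the two couples of partitions are already aligned:

```python
import numpy as np

def delta(z, w, z2, w2):
    """Disagreement rate between the couples of partitions (z, w) and (z2, w2)."""
    row_agree = np.mean(np.asarray(z) == np.asarray(z2))   # fraction of rows with z_i = z'_i
    col_agree = np.mean(np.asarray(w) == np.asarray(w2))   # fraction of columns with w_j = w'_j
    return 1.0 - row_agree * col_agree

print(delta([0, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 1]))   # identical partitions -> 0.0
```
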
SEM Convergence
  n=100, d=60 , π = (0.43, 0.36, 0.21), ρ = (0.53, 0.47),
  α11 = 0.6, α21 = 0.4, α31 = 0.6, α12 = 0.6, α22 = 0.6, α32 = 0.4

   [Figure: trace plots of the SEM draws over 2000 iterations, one panel for
   (pi1, pi2, pi3), one for (rho1, rho2), one for (a11, a21, a31) and one for
   (a12, a22, a32).]
SEM variance from a unique starting position
   n=100, d=60 , π = (0.30, 0.34, 0.36), ρ = (0.53, 0.47),
   δSEM = 0.18(0.01), δVEM = 0.18

    [Figure: five boxplots summarising the variability of the SEM estimates
    across repeated runs from the same starting position.]
Comparing VEM and SEM with starting position at θ0
  The comparison is made on 100 different samples
                      δVEM = 0.28(0.17), δSEM = 0.34(0.17)

     [Figure: twelve boxplots of the estimates, six for VEM followed by six
     for SEM.]
VEM and SEM with random starting positions
  Comparisons made on a single sample, from 100 different starting positions
                       δVEM = 0.49(0.16), δSEM = 0.17(0.02)
     [Figure: boxplots of the estimated α_{kℓ}, six for VEM (labelled BEM in
     the plot) followed by six for SEM.]
Same comparison : less noisy case
  Comparisons made on a single sample, from 100 different starting positions
                       δVEM = 0.20(0.23), δSEM = 0.045(0.004)
     [Figure: boxplots of the estimated α_{kℓ}, six for VEM (labelled BEM in
     the plot) followed by six for SEM.]
Discussion : VEM vs. SEM


  Numerical comparisons lead to the conclusions

       VEM quickly leads to reasonable parameter estimates
       when its initial position is close enough to the ML estimate.

       VEM is quite sensitive to starting values.

       SEM-Gibbs is (essentially) insensitive to starting values.

    → Coupling SEM and VEM should be beneficial to derive
      sensible ML estimates for the latent block model.
Difficulties with Maximum likelihood

   Those difficulties concern the computation of information
   criteria for model selection.

        The likelihood remains difficult to compute.

       What is the sample size in a latent block model ?

       There are many combinations (g, m) to be considered to
       choose a relevant number of blocks.

    → Bayesian inference could be thought of as attractive for the
      latent block model.
Bayesian inference : choosing the priors

   Choosing conjugate priors is essential for the latent block
   model.
        The choice is easy in the binary case : the priors for π, ρ
        and α are D(1, . . . , 1) or D(1/2, . . . , 1/2). They are
        non-informative priors.
       In the continuous case, the conjugate priors for α = (µ, σ 2 )
       are weakly informative.

   Priors for the number of clusters
   This sensitive choice jeopardizes Bayesian inference for
   mixtures (Aitken 2000).
   It seems that choosing truncated Poisson P(1) priors over the
   range 1, . . . , gmax and 1, . . . , mmax is often a reasonable
   choice (Nobile 2005).
Bayesian inference : Reversible Jump sampler


   A possible advantage of Bayesian inference could be to make
   use of an RJMCMC sampler to choose relevant values for g and
   m, since the likelihood is unavailable.


       But, in the latent block context, the standard RJMCMC is
       (remains ?...) unattractive since there are two sets of
       clusters (rows and columns) to deal with.


      Fortunately, the allocation sampler of Nobile and Fearnside
      (2007) could be used instead.
The allocation sampler : collapsing
   The point of the allocation sampler is to use a (RJ)MCMC algorithm
   on a collapsed model.
   Collapsed joint posterior
   Using conjugacy properties, we get by integrating the full
   posterior with respect to π, ρ and α
       P(g, m, z, w | y) = P(g) P(m) CF(·) Π_{k=1}^{g} Π_{ℓ=1}^{m} M_{kℓ}

    where CF(·) is a closed-form function made of Gamma
    functions and

       M_{kℓ} = ∫ P(α_{kℓ}) Π_{i: z_i = k} Π_{j: w_j = ℓ} p(y_{ij} | α_{kℓ}) dα_{kℓ}.
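
    In the binary case each block marginal M_{kℓ} is available in closed form.
    The sketch below assumes a conjugate Beta(a, b) prior on each α_{kℓ}
    (a = b = 1/2 being the binary counterpart of the D(1/2, . . . , 1/2) choice
    mentioned earlier); the function names are illustrative.

```python
import numpy as np
from scipy.special import betaln

def log_block_marginal(y_block, a=0.5, b=0.5):
    """log M_kl = log B(a + #ones, b + #zeros) - log B(a, b) for one block."""
    n1 = y_block.sum()
    n0 = y_block.size - n1
    return betaln(a + n1, b + n0) - betaln(a, b)

def log_collapsed_blocks(Y, z, w, g, m, a=0.5, b=0.5):
    """Sum of log M_kl over all (k, l) blocks, given row labels z and column labels w."""
    z, w = np.asarray(z), np.asarray(w)
    return sum(log_block_marginal(Y[np.ix_(z == k, w == l)], a, b)
               for k in range(g) for l in range(m))
```
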
The allocation sampler : MCMC moves

   Moves with fixed numbers of clusters
       Updating the label of row i from cluster k to cluster k' :

           P̃(z_i = k') ∝ (n_{k'} + 1) / n_k · Π_{ℓ=1}^{m} ( M_{k'ℓ}^{+i} M_{kℓ}^{−i} ) / ( M_{k'ℓ} M_{kℓ} ),   k' ≠ k.

       Other moves are possible (Nobile and Fearnside, 2007).

   Moves to split or combine clusters
   Two reversible moves to split a cluster or combine two clusters,
   analogous to the RJMCMC moves of Richardson and Green (1997), are defined.
   But, thanks to collapsing, those moves are of fixed dimension.
   Integrating out the parameters reduces the sampling variability.
The allocation sampler : label switching

    Following Nobile and Fearnside (2007), Wyse and Friel (2010)
    used a post-processing procedure with the cost function

        C(k1, k2) = Σ_{t=1}^{T−1} Σ_{i=1}^{n} I( z_i^{(t)} = k1 , z_i^{(T)} = k2 ).

      1 The z^{(t)} MCMC sequence has been rearranged so that,
        for s < t, z^{(s)} uses no more components than z^{(t)}.
      2 An algorithm returns the permutation σ(.) of the labels in
        z^{(T)} which minimises the total cost Σ_{k=1}^{g} C(k, σ(k)).
      3 z^{(T)} is relabelled using the permutation σ(.).
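
    One way to implement step 2 is to solve an assignment problem on the
    coincidence counts C(k1, k2): the permutation that makes z^{(T)} agree as
    much as possible with the earlier draws (equivalently, that minimises the
    mismatches) can be found with the Hungarian algorithm. The sketch below is
    one possible reading of the procedure, not necessarily the exact algorithm
    of Nobile and Fearnside; the function name is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_last_draw(Z, g):
    """Z: (T, n) array of row-label draws; returns the last draw relabelled
    to agree as much as possible with the earlier draws."""
    past, last = Z[:-1], Z[-1]
    C = np.zeros((g, g))
    for k1 in range(g):
        for k2 in range(g):
            C[k1, k2] = np.sum((past == k1) & (last == k2)[None, :])
    k_old, k_new = linear_sum_assignment(C, maximize=True)   # best matching of labels
    sigma = dict(zip(k_new, k_old))                          # label in z^(T) -> earlier label
    return np.array([sigma[label] for label in last])
```
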
Remarks on the procedure to deal with label switching



      Due to collapsing, the cost function does not involve
      sampled model parameters.

       The row and column allocations are post-processed
       separately.

       Simple algebra leads to an efficient on-line post-processing
       procedure.

      When g and m are large, g! and m! are tremendous.
Summarizing MCMC output




      Most visited model : for each (g, m), its posterior probability
      is estimated by the relative frequency of visits after
      post-processing to undo label switching (a small sketch is given below).


     MAP cluster model : it is the visited (g, m, z, w) having
     highest probability a posteriori from the MCMC samples.
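
      A small sketch of the first summary, assuming `draws` is a list of
      (g, m, z, w) states collected from the allocation sampler after
      relabelling:

```python
from collections import Counter

def gm_posterior(draws):
    """Relative frequency of visits to each (g, m) in the MCMC output."""
    counts = Counter((g, m) for g, m, _, _ in draws)
    total = sum(counts.values())
    return {gm: c / total for gm, c in counts.items()}
```
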
Simulated data
  A 200 × 200 binary table. The posterior model probability of the
  generating model was respectively (from left to right and from
  top to bottom) : .96, .95, .90 ; .93, .89, .84 ; .80, .30, .15.
Congressional voting data
   The data set records the votes of 435 members (267
   Democrats, 168 Republicans) of the 98th US Congress on 16
   different key issues.

   [Figure: the voting data alongside the reorderings obtained by the
   collapsed LBM and by BEM.]
An example on microarray experiments
   The data consist of the expression levels of 419 genes under 70
   conditions.
   Weakly informative hyperprior parameters have been chosen.
   The sampler was run for 220,000 iterations, with 20,000 discarded
   as burn-in.
   Below is a detail of the posterior distribution over the numbers of
   row and column clusters :

                              column clusters (m)
       row clusters (g)       3       4       5
             24             .064    .071    .042
             25             .102    .120    .070
             26             .037    .046    .023

  Most visited model : (25, 4)
  MAP cluster model : (26, 4).
References


      Govaert, G. and Nadif, M. (2008) Block clustering with
      Bernoulli mixture models : Comparison of different
      approaches. Computational Statistics and Data Analysis,
      52, 3233-3245.

      Nobile, A. and Fearnside, A. T. (2007) Bayesian finite
      mixtures with an unknown number of components : The
      allocation sampler. Statistics and Computing, 17, 147-162.

      Wyse, J. and Friel, N. (2010) Block clustering with
      collapsed latent block models. In revision at Statistics and
      Computing (http://arxiv.org/abs/1011.2948).
