Comparing estimation algorithms for block clustering models

Gilles Celeux

Projet SELECT, INRIA Saclay-Île-de-France

January 6, 2011 - BIG’MC seminar
Block clustering setting

   Block clustering of (binary) data

       Let y = {(y_ij); i ∈ I, j ∈ J} be an n × d binary matrix,
       where I is a set of n objects and J a set of d variables

       Permuting the rows and columns of y to discover a
       clustering structure on I × J.

       Getting a simple summary of the data matrix y.

       Many applications : recommendation systems, genomic
       data analysis, text mining, archeology, ...
Example


   [Figure: a 10 × 7 binary matrix (rows A-J, columns 1-7) shown in four
   panels; in the last panel the matrix is summarised by row clusters
   a, b, c and column clusters I, II, III.]

   (1)   Binary data matrix
   (2)   A partition on I
   (3)   A couple of partitions on I and J
   (4)   Summary of the binary matrix
Model-based clustering framework


      Assume that the data arise from a finite mixture of
      parametrised densities.

      A cluster is made of observations arising from the same
      density.

      In a block clustering model, clusters are defined on blocks
      of I × J.

      In a block clustering model, the data of a block are modelled
      by the same unidimensional density.
Latent block mixture model

   The density of the observed data is assumed to be

       f(y | g, m, φ, α) = Σ_{u ∈ U} p(u | g, m, φ) f(y | g, m, u, α)

   where u is the block indicator vector.
   It is assumed that u_{ijkℓ} = z_{ik} w_{jℓ}, z (resp. w) being the row (resp.
   column) cluster indicator vector.
   Assuming that the n × d variables Y_{ij} are conditionally
   independent given z and w leads to the model

       f(y | g, m, π, ρ, α) = Σ_{(z,w) ∈ Z×W} Π_{i,k} π_k^{z_{ik}} Π_{j,ℓ} ρ_ℓ^{w_{jℓ}} Π_{i,j,k,ℓ} φ(y_{ij} | g, m, α_{kℓ})
An example : Bernoulli latent block model

   Mixing proportions
   For fixed g, the mixing proportions for the rows are π1 , . . . , πg .
   For fixed m, the mixing proportions for the columns are ρ1 , . . . , ρm .

   The Bernoulli density per block

       φ(y_{ij}; α_{kℓ}) = (α_{kℓ})^{y_{ij}} (1 − α_{kℓ})^{1−y_{ij}},   α_{kℓ} ∈ (0, 1).

   The mixture density is

       f(y | g, m, π, ρ, α) = Σ_{(z,w) ∈ Z×W} Π_{i,k} π_k^{z_{ik}} Π_{j,ℓ} ρ_ℓ^{w_{jℓ}} Π_{i,j,k,ℓ} (α_{kℓ})^{y_{ij}} (1 − α_{kℓ})^{1−y_{ij}}.

   The parameters to be estimated are the πs, the ρs and the αs.
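
   A minimal Python/NumPy sketch of how a binary table can be drawn from this
   Bernoulli latent block model; the function name simulate_lbm and the example
   dimensions are illustrative assumptions.

```python
import numpy as np

def simulate_lbm(n, d, pi, rho, alpha, seed=None):
    """Draw (y, z, w): z_i ~ pi, w_j ~ rho, y_ij ~ Bernoulli(alpha[z_i, w_j])."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)                 # row labels
    w = rng.choice(len(rho), size=d, p=rho)               # column labels
    y = rng.binomial(1, alpha[z[:, None], w[None, :]])    # block-wise Bernoulli draws
    return y, z, w

# illustrative parameter values
pi = np.array([1/3, 1/3, 1/3])
rho = np.array([0.5, 0.5])
alpha = np.array([[0.6, 0.6], [0.4, 0.6], [0.6, 0.4]])
y, z, w = simulate_lbm(100, 60, pi, rho, alpha, seed=0)
```
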
Maximum likelihood estimation
  The loglikelihood of the model parameter is
  L(θ) = log f(y | g, m, π, ρ, α) (g and m fixed).

  L(θ) = log p(y, w, z | g, m, θ) − log p(w, z | y; g, m, θ)
       = IE[log p(y, w, z; θ) | y; θ^{(c)}] − IE[log p(w, z | y; θ) | y; θ^{(c)}]
       = Q(θ | θ^{(c)}) − H(θ | θ^{(c)})

  If θ̃ ∈ arg max_θ Q(θ | θ^{(c)}), then

      L(θ̃) − L(θ^{(c)}) = Q(θ̃ | θ^{(c)}) − Q(θ^{(c)} | θ^{(c)}) + H(θ^{(c)} | θ^{(c)}) − H(θ̃ | θ^{(c)}) ≥ 0

  EM algorithm
      E step : computing the conditional expectation of the
      complete loglikelihood, Q(θ | θ^{(c)})
      M step : maximising Q(θ | θ^{(c)}) in θ : θ^{(c)} → θ̃
Conditional expectation of the complete loglikelihood


   For the latent block model, it is

       Q(θ | θ^{(c)}) = Σ_{i,k} s_{ik}^{(c)} log π_k + Σ_{j,ℓ} t_{jℓ}^{(c)} log ρ_ℓ + Σ_{i,j,k,ℓ} e_{ijkℓ}^{(c)} log φ(y_{ij}; α_{kℓ})

   where

       s_{ik}^{(c)} = P(Z_{ik} = 1 | θ^{(c)}, y),    t_{jℓ}^{(c)} = P(W_{jℓ} = 1 | θ^{(c)}, y)

   and

       e_{ijkℓ}^{(c)} = P(Z_{ik} W_{jℓ} = 1 | θ^{(c)}, y).

   → The e_{ijkℓ}^{(c)} are difficult to compute... approximations are needed.
Variational interpretation of EM
   From the identity L(θ) = log p(y, z, w | θ) − log p(z, w | y, θ), we get

       L(θ) = IE_{q_{zw}} [ log ( p(y, z, w | θ) / q_{zw}(z, w) ) ] + KL(q_{zw} || p(z, w | y; θ))
            = F(q_{zw}, θ) + KL(q_{zw} || p(z, w | y; θ))


   EM as an alternating optimisation algorithm of F(q_{zw}, θ)
       E step : maximising F(q_{zw}, θ^{(c)}) in q_{zw}(.) with θ^{(c)} fixed leads to

           q_{zw}^{(c)}(.) = p(z, w | y; θ^{(c)}) = arg min_{q_{zw}} KL(q_{zw} || p(z, w | y; θ^{(c)}))

       M step : maximising F(q_{zw}^{(c)}, θ) in θ with q_{zw}^{(c)}(.) fixed ; it amounts
       to finding
           arg max_θ Q(θ | θ^{(c)}).
Variational approximation of EM (VEM)
  Restricting q_{zw} to a set of functions for which the E step is easily
  tractable : it is assumed that q_{zw}(z, w) = q_z(z) q_w(w). Then

      s_{ik}^{(c)} = P_{q_z}(Z_{ik} = 1 | θ^{(c)}, y),    t_{jℓ}^{(c)} = P_{q_w}(W_{jℓ} = 1 | θ^{(c)}, y),

      e_{ijkℓ}^{(c)} = s_{ik}^{(c)} t_{jℓ}^{(c)}.


  Govaert and Nadif (2008)
    1. E step : maximising the free energy F(q_{zw}, θ^{(c)}) until
       convergence
       1.1 computing the s_{ik} with the t_{jℓ}^{(c)} and θ^{(c)} fixed
       1.2 computing the t_{jℓ} with the s_{ik}^{(c+1)} and θ^{(c)} fixed
           → s^{(c+1)} and t^{(c+1)}
    2. M step : updating θ^{(c+1)}
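
  A minimal Python sketch of one possible VEM implementation for the Bernoulli
  latent block model, assuming the mean-field factorisation above; S and T hold
  the variational row and column probabilities, and all names are illustrative.
  For brevity the E step below does a single s/t pass per iteration, whereas
  Govaert and Nadif iterate it until convergence.

```python
import numpy as np

def softmax_rows(logp):
    """Row-wise softmax, used to normalise variational log-probabilities."""
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def vem(Y, g, m, n_iter=100, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    S = softmax_rows(rng.random((n, g)))      # q_z : row-cluster probabilities s_ik
    T = softmax_rows(rng.random((d, m)))      # q_w : column-cluster probabilities t_jl
    pi, rho = np.full(g, 1 / g), np.full(m, 1 / m)
    alpha = rng.uniform(0.2, 0.8, size=(g, m))
    for _ in range(n_iter):
        A1, A0 = np.log(alpha + eps), np.log(1 - alpha + eps)
        # variational E step (one pass here; iterate until convergence in practice)
        S = softmax_rows(np.log(pi + eps) + Y @ T @ A1.T + (1 - Y) @ T @ A0.T)
        T = softmax_rows(np.log(rho + eps) + Y.T @ S @ A1 + (1 - Y).T @ S @ A0)
        # M step
        pi, rho = S.mean(axis=0), T.mean(axis=0)
        alpha = (S.T @ Y @ T) / (S.sum(axis=0)[:, None] * T.sum(axis=0)[None, :] + eps)
    return pi, rho, alpha, S, T
```
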
Some characteristics of VEM



      The optimised free energy F(qzw , θ) is a lower bound of
      the observed loglikelihood.


      The parameter maximising the free energy could be
      expected to be a good, if not consistent, approximation of
      the maximum likelihood estimator.


      Since VEM is minimising KL(qzw ||p(z, w|y; θ)) rather than
      KL(p(z, w|y; θ)||qzw ), it is expected to be sensitive to
      starting values.
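
      The lower bound mentioned in the first point can be monitored numerically.
      A minimal sketch of the mean-field free energy F(q_{zw}, θ) for the
      Bernoulli latent block model, assuming S, T, pi, rho and alpha are the
      quantities produced by the VEM sketch of the previous slide (the function
      name is illustrative):

```python
import numpy as np

def free_energy(Y, S, T, pi, rho, alpha, eps=1e-10):
    """Mean-field free energy F(q_zw, theta) for the Bernoulli latent block model."""
    A1, A0 = np.log(alpha + eps), np.log(1 - alpha + eps)
    data_term = np.sum(S * (Y @ T @ A1.T + (1 - Y) @ T @ A0.T))   # E_q[log p(y | z, w, theta)]
    label_term = np.sum(S * np.log(pi + eps)) + np.sum(T * np.log(rho + eps))
    entropy = -np.sum(S * np.log(S + eps)) - np.sum(T * np.log(T + eps))
    return data_term + label_term + entropy
```

      This quantity should increase across the VEM iterations sketched above.
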
The SEM-Gibbs algorithm

  SEM
  The SEM algorithm (Celeux and Diebolt, 1985) : after the E step, an
  S step is introduced to simulate the missing data according to
  the distribution p(z, w | y; θ^{(c)}).
  A difficulty for the latent block model is to simulate p(z, w | y; θ).

  Gibbs sampling
  The distribution p(z, w | y; θ^{(c)}) is simulated using a Gibbs
  sampler. Repeat

          Simulate z^{(t+1)} according to p(z | y, w^{(t)}; θ^{(c)})
          Simulate w^{(t+1)} according to p(w | y, z^{(t+1)}; θ^{(c)})

  → The stationary distribution of this Markov chain is p(z, w | y; θ^{(c)}).
SEM-Gibbs for Bernoulli latent block model
    1. E and S steps :
       1.1 computation of p(z | y, w^{(c)}; θ^{(c)}), then simulation of z^{(c+1)} :

           p(z_i = k | y_{i·}, w^{(c)}) = π_k ψ_k(y_{i·}, α_{k·}) / Σ_{k'} π_{k'} ψ_{k'}(y_{i·}, α_{k'·}),   k = 1, . . . , g

           ψ_k(y_{i·}, α_{k·}) = Π_ℓ α_{kℓ}^{u_{iℓ}} (1 − α_{kℓ})^{d_ℓ − u_{iℓ}},   u_{iℓ} = Σ_j w_{jℓ}^{(c)} y_{ij},   d_ℓ = Σ_j w_{jℓ}^{(c)}

       1.2 computation of p(w | y, z^{(c+1)}; θ^{(c)}), then simulation of w^{(c+1)}
           → w^{(c+1)} and z^{(c+1)}
    2. M step :

           π_k^{(c+1)} = Σ_i z_{ik}^{(c+1)} / n,    ρ_ℓ^{(c+1)} = Σ_j w_{jℓ}^{(c+1)} / d

       and

           α_{kℓ}^{(c+1)} = Σ_{ij} z_{ik}^{(c+1)} w_{jℓ}^{(c+1)} y_{ij} / Σ_{ij} z_{ik}^{(c+1)} w_{jℓ}^{(c+1)}
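
    The iteration above can be written compactly with one-hot label matrices.
    Below is a minimal Python sketch of SEM-Gibbs for the Bernoulli latent block
    model, with a single Gibbs scan per E-S step (the default choice discussed
    on the next slide); all names are illustrative assumptions.

```python
import numpy as np

def softmax_rows(logp):
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def sample_rows(P, rng):
    """Draw one categorical label per row of the row-stochastic matrix P."""
    u = rng.random((P.shape[0], 1))
    return (P.cumsum(axis=1) > u).argmax(axis=1)

def sem_gibbs(Y, g, m, n_iter=2000, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    w = rng.integers(m, size=d)                    # random initial column labels
    pi, rho = np.full(g, 1 / g), np.full(m, 1 / m)
    alpha = rng.uniform(0.2, 0.8, size=(g, m))
    chain = []
    for _ in range(n_iter):
        A1, A0 = np.log(alpha + eps), np.log(1 - alpha + eps)
        # S step for the rows: p(z | y, w; theta), then one simulation of z
        W = np.eye(m)[w]                           # one-hot column labels (d x m)
        U, dl = Y @ W, W.sum(axis=0)               # u_il and column-cluster sizes d_l
        z = sample_rows(softmax_rows(np.log(pi + eps) + U @ A1.T + (dl - U) @ A0.T), rng)
        # S step for the columns: p(w | y, z; theta), then one simulation of w
        Z = np.eye(g)[z]                           # one-hot row labels (n x g)
        V, nk = Y.T @ Z, Z.sum(axis=0)
        w = sample_rows(softmax_rows(np.log(rho + eps) + V @ A1 + (nk - V) @ A0), rng)
        W = np.eye(m)[w]
        # M step from the simulated labels
        pi, rho = Z.mean(axis=0), W.mean(axis=0)
        alpha = (Z.T @ Y @ W) / (nk[:, None] * W.sum(axis=0)[None, :] + eps)
        chain.append((pi, rho, alpha))
    return chain
```
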
SEM features


     SEM does not increase the loglikelihood at each iteration.

     SEM generates an irreducible Markov chain with a
     unique stationary distribution.

     The parameter estimates fluctuate around the ML estimate
     → a natural estimator of (θ, z, w) is the mean of the
     (θ^{(c)}, z^{(c)}, w^{(c)}), c = B, . . . , B + C, obtained after a burn-in
     period (a small sketch of this averaging is given below).

     How many Gibbs iterations inside the E-S step ?
     → default version : one Gibbs sampler iteration.
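
     A minimal sketch of the post-burn-in averaging, assuming `chain` is the
     list of (pi, rho, alpha) draws returned by the sem_gibbs sketch above;
     B and C are the burn-in length and the number of retained iterations,
     and the values chosen here are illustrative.

```python
import numpy as np

B, C = 500, 1500                      # illustrative burn-in and number of kept draws
kept = chain[B:B + C]                 # `chain` comes from the sem_gibbs sketch above
pi_hat    = np.mean([p for p, _, _ in kept], axis=0)
rho_hat   = np.mean([r for _, r, _ in kept], axis=0)
alpha_hat = np.mean([a for _, _, a in kept], axis=0)
```
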
Numerical experiments

  Simulation design
     n = 100 rows, d = 60 columns,
     g = 3 components for I, m = 2 components for J,
     equal proportions on I and J.
      The parameters α have the form :

          α = ( 1−ε   1−ε
                 ε    1−ε
                1−ε    ε  )

      where ε defines the overlap between the mixture
      components.
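
      A small sketch building this α matrix as a function of the overlap ε
      (the pattern of the matrix is itself a reconstruction, inferred from the
      parameter values of the SEM convergence slide below):

```python
import numpy as np

def alpha_design(eps):
    """3 x 2 Bernoulli parameter matrix of the simulation design, overlap eps."""
    return np.array([[1 - eps, 1 - eps],
                     [eps,     1 - eps],
                     [1 - eps, eps]])

print(alpha_design(0.4))   # reproduces the alpha values of the SEM convergence slide
```
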
Comparing VEM and SEM-Gibbs


  Criteria of comparison

       Estimated parameter values vs. actual parameter values for θ.

       Distance between the MAP partition and the actual partition,
       where the distance between two couples of partitions
       u = (z, w) and u' = (z', w') is the relative frequency of
       disagreements

           δ(u, u') = 1 − (1/(nd)) Σ_{i,j,k,ℓ} z_{ik} w_{jℓ} z'_{ik} w'_{jℓ}.
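
   Since Σ_k z_{ik} z'_{ik} equals 1 exactly when row i gets the same label in
   both partitions (and similarly for the columns), δ reduces to one minus the
   product of the row and column agreement rates. A minimal sketch, assuming
   the cluster labels of the two couples of partitions are already aligned:

```python
import numpy as np

def delta(z, w, z2, w2):
    """Disagreement rate between the couples of partitions (z, w) and (z2, w2)."""
    row_agree = np.mean(np.asarray(z) == np.asarray(z2))   # fraction of rows with z_i = z'_i
    col_agree = np.mean(np.asarray(w) == np.asarray(w2))   # fraction of columns with w_j = w'_j
    return 1.0 - row_agree * col_agree

print(delta([0, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 1]))   # identical partitions -> 0.0
```
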
SEM Convergence
  n=100, d=60 , π = (0.43, 0.36, 0.21), ρ = (0.53, 0.47),
  α11 = 0.6, α21 = 0.4, α31 = 0.6, α12 = 0.6, α22 = 0.6, α32 = 0.4

   [Figure: trace plots of the SEM draws over 2000 iterations, one panel for
   (pi1, pi2, pi3), one for (rho1, rho2), one for (a11, a21, a31) and one for
   (a12, a22, a32).]
SEM variance from a unique starting position
   n=100, d=60 , π = (0.30, 0.34, 0.36), ρ = (0.53, 0.47),
   δSEM = 0.18(0.01), δVEM = 0.18

    [Figure: five boxplots summarising the variability of the SEM estimates
    across repeated runs from the same starting position.]
Comparing VEM and SEM with starting position at θ0
  The comparison is made on 100 different samples
                      δVEM = 0.28(0.17), δSEM = 0.34(0.17)

     [Figure: twelve boxplots of the estimates, six for VEM followed by six
     for SEM.]
VEM and SEM with random starting positions
  Comparisons made on a single sample, from 100 different starting positions
                       δVEM = 0.49(0.16), δSEM = 0.17(0.02)
     [Figure: boxplots of the estimated α_{kℓ}, six for VEM (labelled BEM in
     the plot) followed by six for SEM.]
Same comparison : less noisy case
  Comparisons made on a single sample, from 100 different starting positions
                       δVEM = 0.20(0.23), δSEM = 0.045(0.004)
     [Figure: boxplots of the estimated α_{kℓ}, six for VEM (labelled BEM in
     the plot) followed by six for SEM.]
Discussion : VEM vs. SEM


  Numerical comparisons lead to the conclusions

       VEM quickly leads to reasonable parameter estimates
       when its initial position is close enough to the ML estimate.

       VEM is quite sensitive to starting values.

       SEM-Gibbs is (essentially) insensitive to starting values.

    → Coupling SEM and VEM should be beneficial to derive
      sensible ML estimates for the latent block model.
Difficulties with Maximum likelihood

   Those difficulties concern the computation of information
   criteria for model selection.

        The likelihood remains difficult to compute.

       What is the sample size in a latent block model ?

       There are many combinations (g, m) to be considered to
       choose a relevant number of blocks.

    → Bayesian inference could be thought of as attractive for the
      latent block model.
Bayesian inference : choosing the priors

   Choosing conjugate priors is essential for the latent block
   model.
        The choice is easy in the binary case : the priors for π, ρ
        and α are D(1, . . . , 1) or D(1/2, . . . , 1/2). They are
        non-informative priors.
       In the continuous case, the conjugate priors for α = (µ, σ 2 )
       are weakly informative.

   Priors for the number of clusters
   This sensitive choice jeopardizes Bayesian inference for
   mixtures (Aitken 2000).
   It seems that choosing truncated Poisson P(1) priors over the
   range 1, . . . , gmax and 1, . . . , mmax is often a reasonable
   choice (Nobile 2005).
Bayesian inference : Reversible Jump sampler


   A possible advantage of Bayesian inference could be to make
   use of an RJMCMC sampler to choose relevant values for g and
   m, since the likelihood is unavailable.


       But, in the latent block context, the standard RJMCMC is
       (remains ?...) unattractive since there are two sets of
       clusters (rows and columns) to deal with.


      Fortunately, the allocation sampler of Nobile and Fearnside
      (2007) could be used instead.
The allocation sampler : collapsing
   The point of the allocation sampler is to use a (RJ)MCMC algorithm
   on a collapsed model.
   Collapsed joint posterior
   Using conjugacy properties, we get by integrating the full
   posterior with respect to π, ρ and α
       P(g, m, z, w | y) = P(g) P(m) CF(·) Π_{k=1}^{g} Π_{ℓ=1}^{m} M_{kℓ}

    where CF(·) is a closed-form function made of Gamma
    functions and

       M_{kℓ} = ∫ P(α_{kℓ}) Π_{i: z_i = k} Π_{j: w_j = ℓ} p(y_{ij} | α_{kℓ}) dα_{kℓ}.
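
    In the binary case each block marginal M_{kℓ} is available in closed form.
    The sketch below assumes a conjugate Beta(a, b) prior on each α_{kℓ}
    (a = b = 1/2 being the binary counterpart of the D(1/2, . . . , 1/2) choice
    mentioned earlier); the function names are illustrative.

```python
import numpy as np
from scipy.special import betaln

def log_block_marginal(y_block, a=0.5, b=0.5):
    """log M_kl = log B(a + #ones, b + #zeros) - log B(a, b) for one block."""
    n1 = y_block.sum()
    n0 = y_block.size - n1
    return betaln(a + n1, b + n0) - betaln(a, b)

def log_collapsed_blocks(Y, z, w, g, m, a=0.5, b=0.5):
    """Sum of log M_kl over all (k, l) blocks, given row labels z and column labels w."""
    z, w = np.asarray(z), np.asarray(w)
    return sum(log_block_marginal(Y[np.ix_(z == k, w == l)], a, b)
               for k in range(g) for l in range(m))
```
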
The allocation sampler : MCMC moves

   Moves with fixed numbers of clusters
       Updating the label of row i from cluster k to cluster k' :

           P̃(z_i = k') ∝ (n_{k'} + 1) / n_k · Π_{ℓ=1}^{m} ( M_{k'ℓ}^{+i} M_{kℓ}^{−i} ) / ( M_{k'ℓ} M_{kℓ} ),   k' ≠ k.

       Other moves are possible (Nobile and Fearnside, 2007).

   Moves to split or combine clusters
   Two reversible moves to split a cluster or combine two clusters,
   analogous to the RJMCMC moves of Richardson and Green (1997), are defined.
   But, thanks to collapsing, those moves are of fixed dimension.
   Integrating out the parameters reduces the sampling variability.
The allocation sampler : label switching

    Following Nobile and Fearnside (2007), Wyse and Friel (2010)
    used a post-processing procedure with the cost function

        C(k1, k2) = Σ_{t=1}^{T−1} Σ_{i=1}^{n} I( z_i^{(t)} = k1 , z_i^{(T)} = k2 ).

      1 The z^{(t)} MCMC sequence has been rearranged so that,
        for s < t, z^{(s)} uses no more components than z^{(t)}.
      2 An algorithm returns the permutation σ(.) of the labels in
        z^{(T)} which minimises the total cost Σ_{k=1}^{g} C(k, σ(k)).
      3 z^{(T)} is relabelled using the permutation σ(.).
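
    One way to implement step 2 is to solve an assignment problem on the
    coincidence counts C(k1, k2): the permutation that makes z^{(T)} agree as
    much as possible with the earlier draws (equivalently, that minimises the
    mismatches) can be found with the Hungarian algorithm. The sketch below is
    one possible reading of the procedure, not necessarily the exact algorithm
    of Nobile and Fearnside; the function name is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_last_draw(Z, g):
    """Z: (T, n) array of row-label draws; returns the last draw relabelled
    to agree as much as possible with the earlier draws."""
    past, last = Z[:-1], Z[-1]
    C = np.zeros((g, g))
    for k1 in range(g):
        for k2 in range(g):
            C[k1, k2] = np.sum((past == k1) & (last == k2)[None, :])
    k_old, k_new = linear_sum_assignment(C, maximize=True)   # best matching of labels
    sigma = dict(zip(k_new, k_old))                          # label in z^(T) -> earlier label
    return np.array([sigma[label] for label in last])
```
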
Remarks on the procedure to deal with label switching



      Due to collapsing, the cost function does not involve
      sampled model parameters.

       The row and column allocations are post-processed
       separately.

       Simple algebra leads to an efficient on-line post-processing
       procedure.

      When g and m are large, g! and m! are tremendous.
Summarizing MCMC output




      Most visited model : for each (g, m), its posterior probability
      is estimated by the relative frequency of visits after
      post-processing to undo label switching (a small sketch is given below).


     MAP cluster model : it is the visited (g, m, z, w) having
     highest probability a posteriori from the MCMC samples.
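
      A small sketch of the first summary, assuming `draws` is a list of
      (g, m, z, w) states collected from the allocation sampler after
      relabelling:

```python
from collections import Counter

def gm_posterior(draws):
    """Relative frequency of visits to each (g, m) in the MCMC output."""
    counts = Counter((g, m) for g, m, _, _ in draws)
    total = sum(counts.values())
    return {gm: c / total for gm, c in counts.items()}
```
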
Simulated data
  A 200 × 200 binary table. The posterior model probability of the
  generating model was respectively (from left to right and from
  top to bottom) : .96, .95, .90 ; .93, .89, .84 ; .80, .30, .15.
Congressional voting data
   The data set records the votes of 435 members (267
   Democrats, 168 Republicans) of the 98th US Congress on 16
   different key issues.

   [Figure: the voting data alongside the reorderings obtained by the
   collapsed LBM and by BEM.]
An example on microarray experiments
   The data consist of the expression levels of 419 genes under 70
   conditions.
   Weakly informative hyperprior parameters have been chosen.
   The sampler was run for 220,000 iterations, with 20,000 discarded
   as burn-in.
   Below is a detail of the posterior distribution over the numbers of
   row and column clusters :

                              column clusters (m)
       row clusters (g)       3       4       5
             24             .064    .071    .042
             25             .102    .120    .070
             26             .037    .046    .023

  Most visited model : (25, 4)
  MAP cluster model : (26, 4).
References


      Govaert, G. and Nadif, M. (2008) Block clustering with
      Bernoulli mixture models : Comparison of different
      approaches. Computational Statistics and Data Analysis,
      52, 3233-3245.

      Nobile, A. and Fearnside, A. T. (2007) Bayesian finite
      mixtures with an unknown number of components : The
      allocation sampler. Statistics and Computing, 17, 147-162.

      Wyse, J. and Friel, N. (2010) Block clustering with
      collapsed latent block models. In revision at Statistics and
      Computing (http://arxiv.org/abs/1011.2948).
