Information-Theoretic Co-Clustering



Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log
and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses
the co-clustering problem as an optimization problem in information theory — the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters. We present an innovative co-clustering algorithm
that monotonically increases the preserved mutual information by intertwining both the row and column clusterings at all stages. Using the practical example of simultaneous
word-document clustering, we demonstrate that our algorithm works well in practice, especially in the presence of sparsity and high-dimensionality.


  1. Information-Theoretic Co-Clustering
     Authors: Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha
     Conference: ACM SIGKDD '03, August 24-27, 2003, Washington
     Presenter: Meng-Lun Wu
  2. Outline
     - Introduction
     - Problem Formulation
     - Co-Clustering Algorithm
     - Experimental Results
     - Conclusions and Future Work
  3. Introduction
     - Clustering is a fundamental tool in unsupervised learning.
     - Most clustering algorithms focus on one-way clustering.
     (figure: clustering)
  4. Introduction (cont.)
     - It is often desirable to co-cluster, i.e., to simultaneously cluster both dimensions.
     - Normalize the non-negative contingency table into a joint probability distribution between two discrete random variables.
     - The optimal co-clustering is the one that leads to the largest mutual information between the clustered random variables.
  5. Introduction (cont.)
     - Equivalently, the optimal co-clustering is the one that minimizes the loss in mutual information.
     - The mutual information of two random variables measures their mutual dependence.
     - Formally, mutual information is defined as:
       I(X; Y) = Σ_x Σ_y p(x, y) log ( p(x, y) / (p(x) p(y)) )
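The definition above can be checked numerically. A minimal sketch (not from the slides; the joint distribution below is made up):

```python
import numpy as np

# Hypothetical joint distribution p(x, y); rows index X, columns index Y.
p = np.array([
    [0.10, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.40],
])

def mutual_information(p):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ), in bits."""
    px = p.sum(axis=1, keepdims=True)     # marginal p(x), shape (m, 1)
    py = p.sum(axis=0, keepdims=True)     # marginal p(y), shape (1, n)
    mask = p > 0                          # 0 * log 0 = 0 by convention
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

print(round(mutual_information(p), 4))    # → 0.349
```

A joint distribution that factors as p(x)p(y) gives I(X; Y) = 0, which is a quick sanity check.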
  6. Introduction (cont.)
     - The Kullback-Leibler (K-L) divergence measures the difference between two probability distributions.
     - Given the true distribution p(x, y) and another distribution q(x, y), it is defined as:
       D(p || q) = Σ_x Σ_y p(x, y) log ( p(x, y) / q(x, y) )
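A small sketch of this definition, with made-up distributions, also illustrates that K-L divergence is asymmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum p * log2(p/q), with 0*log 0 = 0; assumes q > 0 where p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

# Non-negative, zero only when the distributions match, and asymmetric:
print(round(kl_divergence([0.5, 0.5], [0.75, 0.25]), 4))  # → 0.2075
print(round(kl_divergence([0.75, 0.25], [0.5, 0.5]), 4))  # → 0.1887
```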
  7. Problem Formulation
     - Let X and Y be discrete random variables taking values in {x1, …, xm} and {y1, …, yn}.
     - Let p(X, Y) denote their joint probability distribution.
     - Write the k clusters of X as {x̂1, x̂2, …, x̂k} and the l clusters of Y as {ŷ1, ŷ2, …, ŷl}.
  8. Problem Formulation (cont.)
     - Definition: an optimal co-clustering minimizes the loss in mutual information
       I(X; Y) − I(X̂; Ŷ)
       subject to constraints on the number of row and column clusters.
     - For a fixed co-clustering (C_X, C_Y), we can write the loss in mutual information in closed form.
  9. Problem Formulation (cont.)
     - The loss in mutual information can be expressed as a K-L divergence:
       I(X; Y) − I(X̂; Ŷ) = D( p(X, Y) || q(X, Y) ),
       where q(X, Y) is an approximating distribution determined by the co-clustering (defined on the next slide).
  10. Problem Formulation (cont.)
      - q(X, Y) is a distribution of the form
        q(x, y) = p(x̂, ŷ) p(x | x̂) p(y | ŷ), for x ∈ x̂, y ∈ ŷ.
      (figure: a small numerical example of p and the induced q)
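The construction of q from a given co-clustering can be sketched as follows; the distribution and cluster assignments below are made up, not the slide's example:

```python
import numpy as np

# Hypothetical joint distribution and cluster assignments.
p = np.array([
    [0.10, 0.10, 0.02, 0.03],
    [0.08, 0.12, 0.03, 0.02],
    [0.02, 0.03, 0.10, 0.10],
    [0.03, 0.02, 0.12, 0.08],
])
row_cluster = np.array([0, 0, 1, 1])   # C_X: row i belongs to cluster row_cluster[i]
col_cluster = np.array([0, 0, 1, 1])   # C_Y: column j belongs to cluster col_cluster[j]

def approx_q(p, rc, cc, k, l):
    """q(x, y) = p(xhat, yhat) * p(x | xhat) * p(y | yhat)."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    phat = np.zeros((k, l))              # p(xhat, yhat): mass of each co-cluster block
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            phat[rc[i], cc[j]] += p[i, j]
    pxhat, pyhat = phat.sum(axis=1), phat.sum(axis=0)
    q = np.empty_like(p)
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            q[i, j] = (phat[rc[i], cc[j]]
                       * (px[i] / pxhat[rc[i]])      # p(x | xhat)
                       * (py[j] / pyhat[cc[j]]))     # p(y | yhat)
    return q

q = approx_q(p, row_cluster, col_cluster, 2, 2)
print(round(q.sum(), 6))   # q is itself a joint distribution → 1.0
```

A useful property of this construction is that q preserves the row and column marginals of p; only the within-block detail is smoothed.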
  11. Co-Clustering Algorithm
      - Input: the joint probability distribution p(X, Y), the desired number of row clusters k, and the desired number of column clusters l.
      - Output: the partition functions C†X and C†Y.
  12. Co-Clustering Algorithm (cont.)
      (figure: row-clustering update step on the running example, reassigning rows among x̂1, x̂2, x̂3)
  13. Co-Clustering Algorithm (cont.)
      (figure: column-clustering update step on the running example, with clusters ŷ1, ŷ2)
  14. Co-Clustering Algorithm (cont.)
      - On the running example the algorithm converges with D(p || q) = 0.02881.
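The alternating updates traced on slides 12-14 can be sketched end-to-end. This is a simplified reconstruction (random initialization, fixed iteration count, epsilon guards against empty clusters), not the authors' exact pseudocode: each row x moves to the cluster x̂ minimizing D(p(Y|x) || q(Y|x̂)), and symmetrically for columns.

```python
import numpy as np

def kl(p, q):
    """D(p||q) in bits, with the 0*log 0 = 0 convention."""
    m = p > 0
    return float((p[m] * np.log2(p[m] / q[m])).sum())

def itcc(p, k, l, n_iter=20, seed=0):
    """Alternately reassign rows to argmin_xhat D(p(Y|x) || q(Y|xhat))
    and columns to argmin_yhat D(p(X|y) || q(X|yhat))."""
    rng = np.random.default_rng(seed)
    m, n = p.shape
    rc = rng.integers(0, k, size=m)      # row cluster labels C_X
    cc = rng.integers(0, l, size=n)      # column cluster labels C_Y
    px, py = p.sum(axis=1), p.sum(axis=0)
    eps = 1e-12

    def block_mass(rc, cc):
        phat = np.zeros((k, l))          # p(xhat, yhat)
        for i in range(m):
            for j in range(n):
                phat[rc[i], cc[j]] += p[i, j]
        return phat

    for _ in range(n_iter):
        # Row step: q(y | xhat) = q(yhat | xhat) * p(y | yhat)
        phat = block_mass(rc, cc)
        pxhat = phat.sum(axis=1) + eps
        pyhat = phat.sum(axis=0) + eps
        q_y = (phat / pxhat[:, None])[:, cc] * (py / pyhat[cc])[None, :]
        for i in range(m):
            rc[i] = min(range(k), key=lambda a: kl(p[i] / (px[i] + eps), q_y[a] + eps))
        # Column step: q(x | yhat) = q(xhat | yhat) * p(x | xhat)
        phat = block_mass(rc, cc)
        pxhat = phat.sum(axis=1) + eps
        pyhat = phat.sum(axis=0) + eps
        q_x = (phat.T / pyhat[:, None])[:, rc] * (px / pxhat[rc])[None, :]
        for j in range(n):
            cc[j] = min(range(l), key=lambda a: kl(p[:, j] / (py[j] + eps), q_x[a] + eps))
    return rc, cc

# Nearly block-diagonal toy distribution: rows 0-1 (and columns 0-1) have
# nearly identical conditionals, as do rows 2-3 (and columns 2-3).
p_demo = np.array([
    [0.12, 0.12, 0.01, 0.00],
    [0.12, 0.12, 0.00, 0.01],
    [0.01, 0.00, 0.12, 0.12],
    [0.00, 0.01, 0.12, 0.12],
])
rows, cols = itcc(p_demo, k=2, l=2)
print(rows, cols)
```

Each step can only decrease the objective D(p || q), which is why the slides can claim monotone convergence to a local minimum in finitely many steps.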
  15. Experimental Results
      - We use various subsets of the 20-Newsgroups dataset (NG20).
      - 1D-clustering denotes document clustering without any word clustering.
      - Evaluation measures: micro-averaged precision and micro-averaged recall.
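Micro-averaged precision can be sketched as follows: map each cluster to its majority class, then micro-average over documents. This is a common convention for evaluating clusterings against labels; the paper's exact protocol may differ.

```python
from collections import Counter

def micro_averaged_precision(true_labels, cluster_labels):
    """Assign each cluster its majority class, then return the fraction of
    documents whose cluster's majority class matches their true class."""
    members = {}
    for t, c in zip(true_labels, cluster_labels):
        members.setdefault(c, []).append(t)
    correct = sum(Counter(v).most_common(1)[0][1] for v in members.values())
    return correct / len(true_labels)

# 5 documents, 2 classes, 2 clusters (made-up labels):
print(micro_averaged_precision(["a", "a", "b", "b", "b"],
                               [0, 0, 0, 1, 1]))        # → 0.8
```

For uni-labeled data like NG20, micro-averaged precision and recall coincide under this convention.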
  16. Experimental Results (cont.) (results figure)
  17. Experimental Results (cont.) (results figure)
  18. Experimental Results (cont.) (results figure)
  19. Conclusions and Future Work
      - The information-theoretic co-clustering algorithm is guaranteed to reach a local minimum in a finite number of steps.
      - Co-clustering applies to the joint distribution of two discrete random variables.
      - In this paper, the numbers of row and column clusters are pre-specified.
      - We hope that an information-theoretic regularization procedure may allow us to select the number of clusters automatically.