Multiple Non-Redundant Spectral Clustering Views

Donglin Niu, Jennifer G. Dy
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

Michael I. Jordan
EECS and Statistics Departments, University of California, Berkeley
Given medical data:
 • From a doctor's view: group according to type of disease.
 • From an insurance company's view: group based on the patient's cost/risk.
Two kinds of approaches: Iterative & Simultaneous

Iterative: given an existing clustering, find another clustering.
 • Conditional Information Bottleneck. Gondek and Hofmann (2004)
 • COALA. Bae and Bailey (2006)
 • Minimizing KL-divergence. Qi and Davidson (2009)
 • Multiple alternative clusterings: Orthogonal Projection. Cui et al. (2007)

Simultaneous: discovery of all the possible partitionings.
 • Meta Clustering. Caruana et al. (2006)
 • De-correlated k-means. Jain et al. (2008)

Related but distinct paradigms: Ensemble Clustering, Hierarchical Clustering.
[Figure: two clustering views of the same data. VIEW 1: three rings in features f1, f2. VIEW 2: two half-moons in features f3, f4.]
There are O(K^N) possible clustering solutions.
We'd like to find solutions that:
 1. have high cluster quality, and
 2. are non-redundant,
and we'd like to simultaneously
 3. learn the subspace in each view.
Normalized Cut (On Spectral Clustering, Ng et al.)
 - maximize within-cluster similarity and minimize between-cluster similarity.

Let U be the cluster assignment:

    max_U  tr(U^T D^{-1/2} K D^{-1/2} U)
    s.t.   U^T U = I
Advantage: Can discover arbitrarily-shaped clusters.
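As a rough illustration (not the paper's code), the relaxed solution above comes from the top eigenvectors of the normalized similarity matrix D^{-1/2} K D^{-1/2}; the sketch below assumes a precomputed similarity/kernel matrix K and NumPy.

import numpy as np

def spectral_embedding(K, n_clusters):
    """Return the relaxed cluster-assignment matrix U (n x n_clusters)."""
    d = K.sum(axis=1)                          # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    M = D_inv_sqrt @ K @ D_inv_sqrt            # normalized similarity matrix
    _, eigvecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    U = eigvecs[:, -n_clusters:]               # eigenvectors with largest eigenvalues
    return U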
There are several possible criteria: correlation, mutual information.

 • Correlation: can capture only linear dependencies.
 • Mutual information: can capture non-linear dependencies, but requires estimating the joint
   probability distribution.

In this approach, we choose the Hilbert-Schmidt Independence Criterion (HSIC):

    HSIC(x, y) = || C_xy ||^2_HS

Advantage: can detect non-linear dependence and does not require estimating joint probability
distributions.
HSIC is the norm of a cross-covariance matrix in kernel space:

    HSIC(x, y) = || C_xy ||^2_HS,   where   C_xy = E_xy[ (φ(x) − μ_x) ⊗ (ψ(y) − μ_y) ]

Empirical estimate of HSIC:

    HSIC(X, Y) := (1/n^2) tr(K H L H)

    s.t.  H, K, L ∈ R^{n×n},   K_ij := k(x_i, x_j),   L_ij := l(y_i, y_j),
          H = I − (1/n) 1_n 1_n^T

where n is the number of observations and k, l are the kernel functions.
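A small sketch of the empirical estimate above, assuming Gaussian (RBF) kernels for both variables; the kernel choice and bandwidths here are illustrative, not prescribed by the slide.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC(X, Y) = tr(K H L H) / n^2."""
    n = X.shape[0]
    K = rbf_kernel(X, sigma_x)
    L = rbf_kernel(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / n**2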
Objective: cluster quality (normalized cut) regularized by redundancy (HSIC), with trade-off
parameter λ:

    max_{U_v ∈ R^{n×c}, W_v}   Σ_v tr(U_v^T D_v^{-1/2} K_v D_v^{-1/2} U_v)  −  λ Σ_{v≠q} tr(K_v H K_q H)

    s.t.  U_v^T U_v = I,   W_v^T W_v = I,   K_{v,ij} = K(W_v^T x_i, W_v^T x_j)

where U_v is the embedding, K_v is the kernel matrix, and D_v is the degree matrix for each view v;
H is the matrix that centers the kernel matrix. All of these are defined in the subspace W_v.
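As a hedged sketch of how this objective could be evaluated, assuming the kernel matrices K_v have already been built from the projected data W_v^T x and that `lam` (an illustrative name) plays the role of λ:

import numpy as np

def msc_objective(K_list, U_list, lam):
    """Quality term minus lam times the pairwise HSIC redundancy penalty."""
    n = K_list[0].shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                     # centering matrix
    quality, redundancy = 0.0, 0.0
    for v, (K, U) in enumerate(zip(K_list, U_list)):
        d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
        M = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]   # D^{-1/2} K D^{-1/2}
        quality += np.trace(U.T @ M @ U)                    # normalized-cut term
        for q, Kq in enumerate(K_list):
            if q != v:
                redundancy += np.trace(K @ H @ Kq @ H)      # HSIC penalty between views
    return quality - lam * redundancy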
We use a coordinate ascent approach.
Step 1: Fix Wv, optimize Uv
 • The solution for Uv is given by the eigenvectors with the largest eigenvalues of the
   normalized kernel similarity matrix.

Step 2: Fix Uv, optimize Wv
 • We use gradient ascent on the Stiefel manifold.

Repeat Steps 1 & 2 until convergence.

K-means step:
 • Normalize Uv. Apply k-means on Uv.
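The alternation above might look roughly like the following sketch. `build_kernel` and `stiefel_ascent_step` are hypothetical stand-ins for the kernel construction on the projected data W_v^T x and for one gradient-ascent-with-retraction update of W_v; this illustrates the loop structure, not the paper's exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def msc_coordinate_ascent(X, W_list, n_clusters, build_kernel,
                          stiefel_ascent_step, n_iters=20):
    U_list = [None] * len(W_list)
    for _ in range(n_iters):
        # Step 1: with W_v fixed, U_v = top eigenvectors of D^{-1/2} K_v D^{-1/2}
        for v, W in enumerate(W_list):
            K = build_kernel(X @ W)
            d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
            M = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
            _, vecs = np.linalg.eigh(M)
            U_list[v] = vecs[:, -n_clusters:]
        # Step 2: with U_v fixed, one gradient step for each W_v on the Stiefel manifold
        for v in range(len(W_list)):
            W_list[v] = stiefel_ascent_step(X, W_list, U_list, v)
    # K-means step: row-normalize each U_v and cluster it
    labels = []
    for U in U_list:
        U_norm = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
        labels.append(KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U_norm))
    return labels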
Initialization of the subspaces Wv:
 • Cluster the features using spectral clustering.
 • Data x = [f1 f2 f3 f4 f5 … fd]
 • Feature similarity is based on HSIC(fi, fj).

[Figure: feature groups (e.g., {f1, f2, f4, …}, {f15, f34, f7, …}, {f21, f3, f9, …}) define the
initial transformation matrix Wv, a 0/1 matrix whose columns select the features in each group.]
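A rough sketch of this initialization, under the assumption that pairwise HSIC values serve as the feature-feature affinity and scikit-learn's spectral clustering groups the features; it reuses the `hsic` helper sketched earlier, and `n_views` (the number of views) is assumed to be given.

import numpy as np
from sklearn.cluster import SpectralClustering

def init_subspaces(X, n_views):
    """Group features by HSIC similarity; each group becomes a 0/1 selection matrix W_v."""
    d = X.shape[1]
    S = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            S[i, j] = S[j, i] = hsic(X[:, [i]], X[:, [j]])   # feature-feature similarity
    groups = SpectralClustering(n_clusters=n_views,
                                affinity='precomputed').fit_predict(S)
    W_list = []
    for v in range(n_views):
        idx = np.where(groups == v)[0]
        W = np.zeros((d, len(idx)))
        W[idx, np.arange(len(idx))] = 1.0                    # select this view's features
        W_list.append(W)
    return W_list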
[Figures: Synthetic Data 1 and Synthetic Data 2, each shown in two views (View 1 and View 2).]
mSC: our algorithm
OPC: orthogonal projection (Cui et al., 2007)
DK:  de-correlated k-means (Jain et al., 2008)
SC:  spectral clustering

Normalized Mutual Information (NMI) results:

           DATA 1              DATA 2
           VIEW 1    VIEW 2    VIEW 1    VIEW 2
mSC        0.94      0.95      0.90      0.93
OPC        0.89      0.85      0.02      0.07
DK         0.87      0.94      0.03      0.05
SC         0.37      0.42      0.31      0.25
Kmeans     0.36      0.34      0.03      0.05
Face data: Identity (ID) view and Pose view.

NMI results (FACE):
           ID       POSE
mSC        0.79     0.42
OPC        0.67     0.37
DK         0.70     0.40
SC         0.67     0.22
Kmeans     0.64     0.24

[Figure: mean face of each cluster; the number below each image is cluster purity.]
WebKB Data: High-Weight Words

Highest-weight words in each subspace view:
view 1    Cornell, Texas, Wisconsin, Madison, Washington
view 2    homework, student, professor, project, Ph.D.

NMI results (WebKB):
           Univ.    Type
mSC        0.81     0.54
OPC        0.43     0.53
DK         0.48     0.57
SC         0.25     0.39
Kmeans     0.10     0.50
NSF Award Data: High-Frequency Words

              Subjects                            Work Type
Physics      Information    Biology        experimental     theoretical
materials    control        cell           methods          Experiments
chemical     programming    gene           mathematical     Processes
metal        information    protein        develop          Techniques
optical      function       DNA            equation         Measurements
quantum      languages      Biological     theoretical      surface
Machine Sound Data

                   Motor       Fan      Pump
     mSC            0.82       0.75     0.83
     OPC            0.73       0.68     0.47
     DK             0.64       0.58     0.75
     SC             0.42       0.16     0.09
     Kmeans         0.57       0.16     0.09

     Normalized Mutual Information (NMI) Results
 • Most clustering algorithms find only a single clustering solution. However, data may be
   multi-faceted (i.e., it can be interpreted in many different ways).
 • We introduced a new method for discovering multiple non-redundant clusterings.
 • Our approach, mSC, optimizes both a spectral clustering objective (to measure quality) and
   an HSIC regularizer (to measure redundancy).
 • mSC can discover multiple clusterings with flexible cluster shapes, while simultaneously
   finding the subspace in which each clustering view resides.
Thank you!

2010 ICML

Editor's Notes

  • #2 Good afternoon. My name is Donglin Niu and I'm going to talk about "Multiple Non-Redundant Spectral Clustering Views." This is work I did with my advisor, Jennifer Dy, from Northeastern University and with Mike Jordan from UC Berkeley.
  • #3 Clustering is often the first step in exploring data. Most clustering algorithms only find one clustering solution. However, data may be multi-faceted by nature (i.e., a single dataset can be interpreted in many different ways). For example, let's say our data is a bunch of web-pages as shown here. One way to cluster this data is by grouping faculty webpages together in one cluster and the student webpages into another cluster. Another way is to group them according to the university they belong to.
  • #4 Another example: given medical data, a doctor may be interested in grouping the data based on disease type. An insurance company may be interested in grouping the patients according to their cost/risk.
  • #5 Because of the realization of the need for finding multiple alternative clustering interpretations, there is recent interest in this new clustering research paradigm. There are two kinds of approaches to solving this problem: iterative and simultaneous. In iterative methods, one is given an existing clustering, and the goal is to find an alternative clustering. Gondek and Hofmann find an alternative clustering using a conditional information bottleneck approach, Bae and Bailey apply must-link and cannot-link constraints with agglomerative clustering, and Qi and Davidson minimize a KL-divergence criterion. In many cases, one may be interested in finding not just one but multiple alternative clusterings. Cui et al. introduced an iterative orthogonal projection approach for finding multiple alternative clustering solutions.
  • #6 Another type of solution is to simultaneously discover all the possible partitionings. Meta Clustering by Caruana et al. generates several alternative solutions by random projection, then applies hierarchical clustering to the clustering solutions. De-correlated k-means by Jain et al. minimizes both the k-means sum-squared error for each clustering solution and their correlation with each other, to find multiple cluster partitionings. Our approach is a simultaneous approach. However, unlike meta-clustering, which applies random projection, we find multiple alternative clusterings based on an objective function. Unlike de-correlated k-means, which is based on k-means and is thereby limited to finding only spherical clusters, our approach can discover non-convex-shaped clusters. Moreover, de-correlated k-means uses all the features in all the views; our approach learns the subspace in each clustering view.
  • #7 The paradigm of finding multiple alternative clusterings is different from ensemble methods. Like this paradigm, ensemble clustering generates several alternative clusterings, but its ultimate goal is to find a SINGLE consensus clustering solution. Hierarchical clustering also generates several partitionings; however, it generates a hierarchy of coarse-to-fine clusters, such that samples that belong to the same cluster at the lower or finer levels of the hierarchy stay together at the higher or coarser levels. In our case, samples that belong to the same cluster in one view or solution can belong to different clusters in other views.
  • #8 Let's say we have data in four dimensions. In features F1 and F2 it has a three-ring cluster structure as shown in View 1, and a two-half-moon cluster structure in features F3 and F4 in View 2. A standard clustering algorithm will have the dilemma of selecting which of these two structures is more interesting to discover. Instead of finding one of them, our goal is to find all possible interesting cluster structures/views. There are O(K^n) possible ways to cluster n samples into K groups, modulo permutation of the clusters. We do not want to show all of these to the user, as that would overwhelm the data analyst. We'd like to find solutions that have high cluster quality, and we'd like to provide non-redundant cluster views. Moreover, we've noticed that typically the different alternative clusterings reside in different subspaces (i.e., they utilize different similarity metrics to find these clusters). Thus, in our formulation, we also simultaneously learn the subspace in which the clusterings reside in each view. I'll discuss each component in the following slides.
  • #9 We'd like to capture arbitrarily-shaped clusters. We employ the normalized-cut criterion and spectral clustering to define cluster quality. Normalized cut maximizes the within-cluster similarity and minimizes the between-cluster similarity. Let U be the cluster assignment. In spectral clustering, we relax the cluster assignment U to take on any real value; the normalized-cut clustering objective then becomes maximizing the trace of U^T times the normalized similarity matrix times U, subject to the constraint that U is orthonormal. The advantage of this criterion is that it can discover arbitrarily-shaped clusters.
  • #10 We'd like the clustering solutions we discover to be non-redundant with each other. There are several possible criteria for measuring non-redundancy: correlation or mutual information. (Read slide)
  • #11 HSIC is a norm of a cross-covariance matrix in kernel space. Empirically, we can estimate the HSIC between two random variables X and Y as the trace of two kernel matrices K and L. H here simply centers the kernel matrices.
  • #12 Our overall objective is then to maximize this function. The first term optimizes for cluster quality, the spectral clustering criterion. The second term minimizes the redundancies among the clustering views. Lambda is the regularization parameter that controls the trade-off between these two criteria. We incorporate discovering the subspace in which the clustering solutions in each view reside by learning the transformation matrix W_v. Note that W_v is inside the kernel and operates on the original input x.
  • #13 We optimize our objective to solve for the cluster embedding Uv and the subspace Wv in each view as follows. (Read slide) We discretize by applying a K-means step: (read slide)
  • #14 Our approach is only guaranteed to find local optima. Thus, the solution is dependent on initialization. We initialize the subspaces Wv in each view as follows. We cluster the features (i.e., columns of x) using spectral clustering and apply HSIC(f_i, f_j) between features as a measure of similarity. This groups features that are dependent on each other into the same cluster and those that are independent from each other into different groups. Each feature group forms the transformation matrix Wv as follows (click through the animation and explain). Note that even though each view started with disjoint features, after running our algorithm to convergence, each feature will have some weight in all views. Note too that the dimensions in each view are set by the number of features in each view in our initialization.