Good afternoon. My name is Donglin Niu and I’m going to talk about “Multiple Non-Redundant Spectral Clustering Views.” This is work I did with my advisor, Jennifer Dy, from Northeastern University and with Mike Jordan from UC Berkeley.
Clustering is often the first step in exploring data. Most clustering algorithms find only one clustering solution. However, data may be multi-faceted by nature (i.e., a single data set can be interpreted in many different ways). For example, let’s say our data is a collection of web pages, as shown here. One way to cluster this data is to group the faculty web pages into one cluster and the student web pages into another. Another way is to group them according to the university they belong to.
Another example: given medical data, a doctor may be interested in grouping the data based on disease type, while an insurance company may be interested in grouping the patients according to their cost/risk.
Because of the need for finding multiple alternative clustering interpretations, there has been recent interest in this new clustering research paradigm. There are two kinds of approaches to this problem: iterative and simultaneous. In iterative methods, one is given an existing clustering, and the goal is to find an alternative clustering. Gondek and Hofmann find an alternative clustering using a conditional information bottleneck approach, Bae and Bailey apply must-link and cannot-link constraints with agglomerative clustering, and Qi and Davidson minimize a KL-divergence criterion. In many cases, one may be interested in finding not just one but multiple alternative clusterings. Cui et al. introduced an iterative orthogonal projection approach for finding multiple alternative clustering solutions.
The other type of approach discovers all the possible partitionings simultaneously. Meta Clustering by Caruana et al. generates several alternative solutions by random projection, then applies hierarchical clustering to the clustering solutions. De-correlated k-means by Jain et al. minimizes both the k-means sum-squared error of each clustering solution and the correlation between solutions to find multiple cluster partitionings. Our approach is a simultaneous approach. However, unlike meta clustering, which applies random projection, we find multiple alternative clusterings based on an objective function. Unlike de-correlated k-means, which is based on k-means and thereby limited to finding spherical clusters, our approach can discover non-convex shaped clusters. Moreover, de-correlated k-means uses all the features in all the views; our approach learns the subspace in each clustering view.
The paradigm of finding multiple alternative clusterings is different from ensemble methods. Like this paradigm, ensemble clustering generates several alternative clusterings, but its ultimate goal is to find a SINGLE consensus clustering solution. Hierarchical clustering also generates several partitionings; however, it generates a hierarchy of coarse-to-fine clusters, such that samples that belong to the same cluster at the lower, finer levels of the hierarchy stay together at the higher, coarser levels. In our case, samples that belong to the same cluster in one view or solution can belong to different clusters in other views.
Let’s say we have data in four dimensions. In features F1 and F2 it has a three-ring cluster structure, as shown in View 1, and in features F3 and F4 it has a two half-moon cluster structure, as shown in View 2. A standard clustering algorithm faces the dilemma of selecting which of these two structures is more interesting to discover. Instead of finding one of them, our goal is to find all possible interesting cluster structures/views. There are O(K^n) possible ways to cluster n samples into K groups, modulo permutation of the clusters. We do not want to show all of these to the user, as that would overwhelm the data analyst. We’d like to find solutions that have high cluster quality, and we’d like the cluster views to be non-redundant. Moreover, we’ve noticed that the different alternative clusterings typically reside in different subspaces (i.e., they utilize different similarity metrics to find these clusters). Thus, in our formulation, we also simultaneously learn the subspace in which the clustering resides in each view. I’ll discuss each component in the following slides.
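To give a sense of scale for the count of possible clusterings mentioned above: the exact number of ways to split n labeled samples into K non-empty, unlabeled groups is the Stirling number of the second kind, which grows roughly like K^n / K!. The following is a minimal illustrative sketch; the function name `stirling2` is an assumption introduced here and is not part of the talk.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Ways to partition n labeled samples into k non-empty, unlabeled groups."""
    if n == k:
        return 1  # every sample is its own group (also covers n = k = 0)
    if k == 0 or k > n:
        return 0  # no valid partition exists
    # The n-th sample either starts a new group, or joins one of k existing groups.
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)


def candidate_clusterings(n: int, k: int) -> int:
    """Convenience wrapper: number of distinct clusterings of n samples into k groups."""
    return stirling2(n, k)
```

Even for a few dozen samples, `candidate_clusterings(n, 3)` is far too large to enumerate, which is why only a small set of high-quality, non-redundant views is sought.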
We’d like to capture arbitrarily-shaped clusters. We employ the normalized-cut criterion and spectral clustering to define cluster quality. Normalized cut maximizes the within-cluster similarity and minimizes the between-cluster similarity. Let U be the cluster assignment. In spectral clustering, we relax the cluster assignment U to take on any real value; the normalized-cut clustering objective then becomes maximizing the trace of (U transposed times the normalized similarity matrix times U), subject to the constraint that U is orthonormal. The advantage of this criterion is that it can discover arbitrarily-shaped clusters.
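The relaxed normalized-cut problem described here has a closed-form solution: the columns of U are the eigenvectors of the normalized similarity matrix D^{-1/2} K D^{-1/2} with the largest eigenvalues. Below is a minimal NumPy sketch of that step. The Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions made for illustration; the talk does not specify them at this point.

```python
import numpy as np


def gaussian_kernel(X, sigma=1.0):
    """Pairwise Gaussian (RBF) similarities between the rows of X (assumed kernel)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))


def spectral_embedding(K, c):
    """Maximize tr(U^T D^{-1/2} K D^{-1/2} U) subject to U^T U = I.

    Returns the n x c embedding U and the normalized similarity matrix.
    """
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))       # diagonal of D^{-1/2}
    L = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    U = eigvecs[:, -c:]                              # eigenvectors of the c largest
    return U, L
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the constraint U^T U = I holds automatically, and the attained objective value equals the sum of the c largest eigenvalues.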
We’d like the clustering solutions we discover to be non-redundant with each other. There are several possible criteria for measuring non-redundancy, such as correlation or mutual information. (Read slide)
HSIC is a norm of a cross-covariance matrix in kernel space. Empirically, we can estimate the HSIC between two random variables X and Y as a scaled trace of the product KHLH, where K and L are the kernel matrices computed on X and Y. H here simply centers the kernel matrices.
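The empirical estimate described here can be sketched in a few lines of NumPy. This is an illustrative sketch only: it assumes the 1/n^2 scaling (some references use 1/(n-1)^2 instead), and the function names are introduced here for illustration.

```python
import numpy as np


def centering_matrix(n):
    """H = I - (1/n) 1 1^T, which removes the mean from kernel matrices."""
    return np.eye(n) - np.ones((n, n)) / n


def empirical_hsic(K, L):
    """Empirical HSIC from two n x n kernel matrices: tr(K H L H) / n^2.

    The 1/n^2 normalization is an assumption of this sketch.
    """
    n = K.shape[0]
    H = centering_matrix(n)
    return np.trace(K @ H @ L @ H) / n**2
```

A quick sanity check on the design: if one of the kernel matrices is constant (it carries no information about its variable), centering maps it to zero and the estimate vanishes, while the HSIC of a non-constant variable with itself is strictly positive.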
Our overall objective is then to maximize this function. The first term optimizes for cluster quality, using the spectral clustering criterion. The second term minimizes the redundancies among the clustering views. Lambda is the regularization parameter that controls the trade-off between these two criteria. We incorporate discovering the subspace in which the clustering solutions in each view reside by learning a transformation matrix W_v. Note that W_v is inside the kernel and operates on the original input x.
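To make the two terms concrete, the sketch below evaluates the objective for given embeddings and subspaces: the sum of per-view spectral clustering terms minus lambda times the pairwise HSIC-style penalty tr(K_v H K_q H). Several details are assumptions of this sketch and not taken from the talk: the Gaussian kernel and its bandwidth `sigma`, the function names, the convention that data rows are samples (so W_v^T x_i is a row of X @ W_v), and summing the penalty over ordered pairs v != q.

```python
import numpy as np


def gaussian_kernel(Z, sigma=1.0):
    """Gaussian kernel matrix on the rows of Z (assumed kernel choice)."""
    sq = np.sum(Z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))


def msc_objective(X, Ws, Us, lam, sigma=1.0):
    """Cluster quality summed over views minus lam times pairwise redundancy.

    X  : n x d data matrix (rows are samples)
    Ws : list of d x q_v projection matrices, one per view
    Us : list of n x c embeddings, one per view
    """
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    # The kernel of each view is computed on the projected data W_v^T x_i.
    Ks = [gaussian_kernel(X @ W, sigma) for W in Ws]

    quality = 0.0
    for K, U in zip(Ks, Us):
        d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
        L = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
        quality += np.trace(U.T @ L @ U)

    redundancy = 0.0
    for v in range(len(Ks)):
        for q in range(len(Ks)):
            if v != q:  # ordered pairs, so each unordered pair is counted twice
                redundancy += np.trace(Ks[v] @ H @ Ks[q] @ H)

    return quality - lam * redundancy
```

With a single view the penalty is empty, so the value does not depend on `lam`; with several views, a larger `lam` favors subspaces whose kernel matrices are less dependent on each other.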
We optimize our objective to solve for the cluster embedding Uv and the subspace Wv in each view as follows. (Read slide) We discretize by applying a k-means step: (read slide)
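The discretization step mentioned here (normalize the embedding Uv, then run k-means on its rows) can be sketched as follows. This is a minimal illustration under stated assumptions: the k-means is a plain Lloyd iteration with a deterministic farthest-point initialization chosen for simplicity, and all function names are introduced here; the talk does not say which k-means variant or initialization was used.

```python
import numpy as np


def row_normalize(U, eps=1e-12):
    """Scale each row of the embedding to unit length."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, eps)


def assign(Z, centers):
    """Index of the nearest center for each row of Z."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(Z, k, n_iter=100):
    """Plain Lloyd iterations with farthest-point initialization (illustrative)."""
    centers = [Z[0]]
    for _ in range(1, k):
        # Distance of every point to its closest already-chosen center.
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d2)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        labels = assign(Z, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(Z, centers)


def discretize(U, k):
    """Turn a relaxed embedding into hard cluster labels for one view."""
    return kmeans(row_normalize(U), k)
```

Calling `discretize(U_v, k)` once per view gives one hard partition per view; the same sample may land in different clusters in different views.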
Our approach is only guaranteed to find local optima. Thus, the solution is dependent on initialization. We initialize the subspaces Wv in each view as follows. We cluster the features (i.e., columns of x) using spectral clustering and use HSIC(f_i, f_j) between features as the measure of similarity. This groups features that are dependent on each other into the same cluster and features that are independent of each other into different clusters. Each feature group forms the transformation matrix Wv of one view as follows. (click through the animation and explain). Note that even though each view starts with disjoint features, after running our algorithm to convergence, each feature will have some weight in all views. Note too that the number of dimensions in each view is set by the number of features in that view in our initialization.
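Two pieces of this initialization lend themselves to a short sketch: the feature-by-feature HSIC similarity matrix, and the construction of an initial Wv from one group of features. The sketch below assumes a Gaussian kernel on each single feature, a 1/n^2 HSIC scaling, and that the feature group labels have already been obtained (for example by spectral clustering on the similarity matrix); the function names are introduced here for illustration.

```python
import numpy as np


def feature_hsic_matrix(X, sigma=1.0):
    """d x d matrix S with S[i, j] = empirical HSIC between features i and j.

    X is n x d (rows are samples). Kernel and scaling are assumptions.
    """
    n, d = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    centered = []
    for j in range(d):
        diff = X[:, j][:, None] - X[:, j][None, :]
        Kj = np.exp(-diff**2 / (2.0 * sigma**2))
        centered.append(H @ Kj @ H)
    S = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            # tr(K_i H K_j H) equals the elementwise inner product of the
            # two centered (symmetric) kernel matrices.
            S[i, j] = np.sum(centered[i] * centered[j]) / n**2
    return S


def selection_matrix(feature_labels, group, d):
    """Initial W_v: a d x q_v 0/1 matrix whose columns pick the features in `group`."""
    idx = np.flatnonzero(np.asarray(feature_labels) == group)
    W = np.zeros((d, len(idx)))
    W[idx, np.arange(len(idx))] = 1.0
    return W
```

Each column of the selection matrix has a single 1, so it satisfies the orthonormality constraint on Wv, and `X @ W` simply keeps the columns of X that belong to that feature group.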
Donglin Niu, Jennifer G. Dy
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
Michael I. Jordan
EECS and Statistics Departments, University of California, Berkeley
Given medical data:
  From the doctor’s view: according to type of disease
  From the insurance company’s view: based on patient’s cost/risk
Two kinds of approaches: Iterative & Simultaneous
Iterative
  Given an existing clustering, find another clustering
    Conditional Information Bottleneck. Gondek and Hofmann (2004)
    COALA. Bae and Bailey (2006)
    Minimizing KL-divergence. Qi and Davidson (2009)
  Multiple alternative clusterings
    Orthogonal Projection. Cui et al. (2007)
Simultaneous
  Discovery of all the possible partitionings
    Meta Clustering. Caruana et al. (2006)
    De-correlated k-means. Jain et al. (2008)
VIEW 1   VIEW 2
There are O(K^N) possible clustering solutions.
We’d like to find solutions that:
  1. have high cluster quality, and
  2. are non-redundant,
and we’d like to simultaneously
  3. learn the subspace in each view.
Normalized Cut (On Spectral Clustering, Ng et al.)
  Maximize within-cluster similarity and minimize between-cluster similarity.
  Let U be the cluster assignment:
$$\max_{U} \; \mathrm{tr}\left(U^{T} D^{-1/2} K D^{-1/2} U\right) \quad \text{s.t.} \quad U^{T} U = I$$
Advantage: Can discover arbitrarily-shaped clusters.
There are several possible criteria: correlation, mutual information.
  Correlation: can capture only linear dependencies.
  Mutual information: can capture non-linear dependencies, but requires estimating the joint probability distribution.
In this approach, we choose the Hilbert-Schmidt Independence Criterion:
$$\mathrm{HSIC}(x, y) = \left\| C_{xy} \right\|_{HS}^{2}$$
Advantage: Can detect non-linear dependence; does not need to estimate joint probability distributions.
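HSIC is the norm of a cross-covariance matrix in kernel space.
$$\mathrm{HSIC}(x, y) = \left\| C_{xy} \right\|_{HS}^{2}, \qquad C_{xy} = E_{xy}\left[ (\phi(x) - \mu_x) \otimes (\psi(y) - \mu_y) \right]$$
Empirical estimate of HSIC:
$$\mathrm{HSIC}(X, Y) := \frac{1}{n^{2}} \, \mathrm{tr}(KHLH), \qquad H, K, L \in \mathbb{R}^{n \times n}, \quad K_{ij} := k(x_i, x_j), \quad L_{ij} := l(y_i, y_j), \quad H = I - \frac{1}{n} 1_n 1_n^{T}$$
where n is the number of observations, and k and l are kernel functions.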
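Cluster quality: Normalized Cut. Redundancy: HSIC.
$$\underset{U_v \in \mathbb{R}^{n \times c}}{\text{maximize}} \;\; \sum_{v} \mathrm{tr}\left(U_v^{T} D_v^{-1/2} K_v D_v^{-1/2} U_v\right) \; - \; \lambda \sum_{v \neq q} \mathrm{tr}\left(K_v H K_q H\right)$$
$$\text{s.t.} \quad U_v^{T} U_v = I, \quad W_v^{T} W_v = I, \quad K_{v,ij} = K\left(W_v^{T} x_i, W_v^{T} x_j\right)$$
where Uv is the embedding, Kv is the kernel matrix, and Dv is the degree matrix for each view v. H is the matrix that centers the kernel matrix. All of these are defined in the subspace Wv.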
We use a coordinate ascent approach.
Step 1: Fix Wv, optimize for Uv
  The solution for Uv is given by the eigenvectors with the largest eigenvalues of the normalized kernel similarity matrix.
Step 2: Fix Uv, optimize for Wv
  We use gradient ascent on a Stiefel manifold.
Repeat Steps 1 & 2 until convergence.
K-means step: Normalize Uv. Apply k-means on Uv.
Cluster the features using spectral clustering.
Data x = [f1 f2 f3 f4 f5 … fd]
Feature similarity based on HSIC(fi, fj).
[Figure: the features (f1, f2, f4, f15, f34, f21, f3, f7, f9, …) grouped into clusters, and the resulting transformation matrix Wv, a 0/1 matrix that selects the features of one cluster.]
Identity (ID) View   Pose View
NMI Results
FACE      ID     POSE
mSC       0.79   0.42
OPC       0.67   0.37
DK        0.70   0.40
SC        0.67   0.22
Kmeans    0.64   0.24
• Mean face
• Number below each image is cluster purity
Webkb Data: High Weight Words
High-weight words in each subspace view
  view 1: Cornell, Texas, Wisconsin, Madison, Washington
  view 2: homework, student, professor, project, Ph.d
NMI Results
Webkb     Univ.  Type
mSC       0.81   0.54
OPC       0.43   0.53
DK        0.48   0.57
SC        0.25   0.39
Kmeans    0.10   0.50
NSF Award Data: High Frequent Words
Subjects                               Work Type
Physics    Information   Biology       experimental   theoretical
materials  control       cell          methods        Experiments
chemical   programming   gene          mathematical   Processes
metal      information   protein       develop        Techniques
optical    function      DNA           equation       Measurements
quantum    languages     Biological    theoretical    surface
Machine Sound Data
Normalized Mutual Information (NMI) Results
          Motor  Fan    Pump
mSC       0.82   0.75   0.83
OPC       0.73   0.68   0.47
DK        0.64   0.58   0.75
SC        0.42   0.16   0.09
Kmeans    0.57   0.16   0.09
Most clustering algorithms find only one clustering solution. However, data may be multi-faceted (i.e., it can be interpreted in many different ways). We introduced a new method for discovering multiple non-redundant clusterings. Our approach, mSC, optimizes both a spectral clustering criterion (to measure quality) and an HSIC regularization term (to measure redundancy). mSC can discover multiple clusterings with flexible cluster shapes while simultaneously finding the subspace in which each of these clustering views resides.