Donglin Niu and Jennifer G. Dy, Department of Electrical and Computer Engineering, Northeastern University, Boston, MA
Michael I. Jordan, EECS and Statistics Departments, University of California, Berkeley
Motivation: the same data can admit several meaningful clusterings. Given medical data, from a doctor's view, patients group according to type of disease; from an insurance company's view, they group based on patient cost/risk.
Two kinds of approaches: iterative and simultaneous.
Iterative (given an existing clustering, find another clustering):
- Conditional Information Bottleneck, Gondek and Hofmann (2004)
- COALA, Bae and Bailey (2006)
- Minimizing KL-divergence, Qi and Davidson (2009)
- Multiple alternative clusterings via Orthogonal Projection, Cui et al. (2007)
Simultaneous (discover all the possible partitionings at once):
- Meta Clustering, Caruana et al. (2006)
- De-correlated k-means, Jain et al. (2008)
Related areas: ensemble clustering, hierarchical clustering.
(Figure: the same data partitioned two ways, View 1 and View 2.)
There are O(K^N) possible clustering solutions. We would like to find solutions that: 1. have high cluster quality, 2. are non-redundant with each other, and, simultaneously, 3. learn the subspace in which each view resides.
Normalized Cut (spectral clustering; Ng et al., "On Spectral Clustering"): maximize within-cluster similarity and minimize between-cluster similarity. Let U be the relaxed cluster assignment:

    max_U  tr(U^T D^{-1/2} K D^{-1/2} U)   s.t.  U^T U = I

Advantage: can discover arbitrarily shaped clusters.
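The relaxation above can be sketched in a few lines: take the top eigenvectors of the normalized similarity matrix, then run k-means on the rows. This is a minimal illustration assuming an RBF kernel for K and a deterministic farthest-first k-means initialization; it is not the authors' implementation.

```python
import numpy as np

def spectral_clustering(X, n_clusters, sigma=1.0):
    # Similarity matrix K via an RBF kernel (illustrative choice).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-d2 / (2 * sigma**2))
    # Normalized similarity D^{-1/2} K D^{-1/2}.
    d = K.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    M = s[:, None] * K * s[None, :]
    # U = top eigenvectors maximize tr(U^T M U) subject to U^T U = I.
    _, V = np.linalg.eigh(M)           # eigenvalues in ascending order
    U = V[:, -n_clusters:]
    # Row-normalize the embedding, then cluster rows with k-means.
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    centers = [U[0]]                   # farthest-first seeding (deterministic)
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(100):
        labels = np.argmin(((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new = np.array([U[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                        for k in range(n_clusters)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

Because the clusters are found through the kernel similarity graph rather than centroid distances, this recovers non-convex cluster shapes that plain k-means misses.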
How should redundancy between views be measured? There are several possible criteria: correlation and mutual information. Correlation can capture only linear dependencies. Mutual information can capture non-linear dependencies, but requires estimating the joint probability distribution. In this approach, we instead choose the Hilbert-Schmidt Independence Criterion:

    HSIC(x, y) = ||C_xy||^2_HS

Advantage: it can detect non-linear dependence and does not need to estimate joint probability distributions.
HSIC is the squared Hilbert-Schmidt norm of the cross-covariance operator in kernel space:

    HSIC(x, y) = ||C_xy||^2_HS,   C_xy = E_xy[(φ(x) − μ_x) ⊗ (ψ(y) − μ_y)]

Empirical estimate of HSIC:

    HSIC(X, Y) := (1/n^2) tr(K H L H)

where n is the number of observations; H, K, L ∈ R^{n×n}; K_ij := k(x_i, x_j) and L_ij := l(y_i, y_j) for kernel functions k and l; and H = I − (1/n) 1_n 1_n^T is the centering matrix.
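The empirical estimate is a one-line trace computation once the two kernel matrices are built. A minimal sketch, assuming RBF kernels with an illustrative bandwidth:

```python
import numpy as np

def rbf_kernel(Z, sigma=1.0):
    # K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))
    sq = np.sum(Z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    # Empirical HSIC(X, Y) = (1/n^2) tr(K H L H).
    n = X.shape[0]
    K = rbf_kernel(X, sigma)               # K_ij = k(x_i, x_j)
    L = rbf_kernel(Y, sigma)               # L_ij = l(y_i, y_j)
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(K @ H @ L @ H) / n**2
```

As a sanity check, a non-linearly dependent pair (e.g., y = x^2 plus noise) yields a larger HSIC value than an independent pair, which is exactly why correlation would fail here but HSIC does not.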
Objective: combine cluster quality (Normalized Cut) with a redundancy penalty (HSIC):

    max_{U_v, W_v}  Σ_v tr(U_v^T D_v^{-1/2} K_v D_v^{-1/2} U_v)  −  λ Σ_{v≠q} tr(K_v H K_q H)
    s.t.  U_v^T U_v = I,  W_v^T W_v = I,  K_{v,ij} = k(W_v^T x_i, W_v^T x_j)

where U_v is the spectral embedding, K_v is the kernel matrix, and D_v is the degree matrix for each view v; H is the matrix that centers the kernel matrices. All of these are defined in the subspace W_v.
We use a coordinate ascent approach:
Step 1: Fix W_v, optimize for U_v. The solution for U_v is the set of eigenvectors with the largest eigenvalues of the normalized kernel similarity matrix D_v^{-1/2} K_v D_v^{-1/2}.
Step 2: Fix U_v, optimize for W_v. We use gradient ascent on a Stiefel manifold.
Repeat Steps 1 and 2 until convergence.
K-means step: normalize the rows of U_v, then apply k-means on U_v to obtain the labels for each view.
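The two steps above can be sketched for a single view pair. This is an illustrative simplification, not the paper's derivation: Step 1 is folded in by scoring a candidate W with the sum of the top eigenvalues (which equals max_U tr(U^T M U) under U^T U = I), and the Stiefel-manifold step is approximated by a finite-difference gradient step followed by a QR retraction. The function names, the RBF kernel, and the step sizes are all assumptions.

```python
import numpy as np

def rbf(Z, sigma=1.0):
    sq = np.sum(Z**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * Z @ Z.T) / (2 * sigma**2))

def view_objective(X, W, Xq, Wq, n_clusters, lam=1.0):
    """Normalized-cut quality of view W minus HSIC redundancy with view Wq."""
    n = X.shape[0]
    K = rbf(X @ W)                        # kernel in the projected subspace
    d = K.sum(axis=1)
    M = K / np.sqrt(np.outer(d, d))       # D^{-1/2} K D^{-1/2}
    w = np.linalg.eigvalsh(M)
    quality = w[-n_clusters:].sum()       # = max_U tr(U^T M U), U^T U = I
    Kq = rbf(Xq @ Wq)
    H = np.eye(n) - np.ones((n, n)) / n
    redundancy = np.trace(K @ H @ Kq @ H) / n**2
    return quality - lam * redundancy

def stiefel_step(X, W, Xq, Wq, n_clusters, lr=1e-2, eps=1e-5):
    """One finite-difference ascent step on W, retracted to the Stiefel manifold."""
    G = np.zeros_like(W)
    base = view_objective(X, W, Xq, Wq, n_clusters)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp = W.copy()
            Wp[i, j] += eps
            G[i, j] = (view_objective(X, Wp, Xq, Wq, n_clusters) - base) / eps
    Q, _ = np.linalg.qr(W + lr * G)       # QR retraction keeps W^T W = I
    return Q
```

Alternating this step across views with the eigenvector step for U_v gives the coordinate ascent loop; in practice the paper's analytic gradient would replace the finite differences.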
Initialization: cluster the features using spectral clustering. Given data x = [f1 f2 f3 f4 f5 ... fd], measure feature similarity by HSIC(f_i, f_j) and group mutually dependent features together. Each group defines a 0/1 transformation matrix W_v whose columns select the features belonging to that view.
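This initialization can be sketched as follows. The sketch builds the feature-feature HSIC matrix, embeds it spectrally, and assigns features to views; a simple farthest-first assignment stands in for the k-means step, and the per-feature RBF kernels with unit bandwidth are illustrative assumptions.

```python
import numpy as np

def init_views(X, n_views, sigma=1.0):
    """Group features by pairwise HSIC and return one 0/1 matrix W_v per view."""
    n, d = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    # Centered RBF kernel matrix for each individual feature.
    Ks = []
    for j in range(d):
        z = X[:, j:j + 1]
        Ks.append(H @ np.exp(-(z - z.T) ** 2 / (2 * sigma**2)) @ H)
    # Feature-feature similarity S_ij = empirical HSIC(f_i, f_j).
    S = np.array([[np.trace(Ks[i] @ Ks[j]) / n**2 for j in range(d)]
                  for i in range(d)])
    # Spectral embedding of the feature-similarity graph.
    deg = S.sum(axis=1)
    M = S / np.sqrt(np.outer(deg, deg))
    _, V = np.linalg.eigh(M)
    E = V[:, -n_views:]
    # Farthest-first anchors, then nearest-anchor assignment (stand-in for k-means).
    anchors = [0]
    for _ in range(1, n_views):
        dist = np.min([np.sum((E - E[a]) ** 2, axis=1) for a in anchors], axis=0)
        anchors.append(int(np.argmax(dist)))
    groups = np.argmin([np.sum((E - E[a]) ** 2, axis=1) for a in anchors], axis=0)
    # Binary W_v: each column selects one feature assigned to view v.
    return [np.eye(d)[:, groups == v] for v in range(n_views)]
```

On data whose features split into two dependent blocks, the returned W_v matrices select one block each, giving the optimizer a sensible starting subspace per view.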
Face Data: Identity (ID) view and Pose view.

NMI Results:
Method  | ID   | Pose
mSC     | 0.79 | 0.42
OPC     | 0.67 | 0.37
DK      | 0.70 | 0.40
SC      | 0.67 | 0.22
Kmeans  | 0.64 | 0.24

(Figure: mean face per cluster; the number below each image is cluster purity.)
WebKB Data: high-weight words in each subspace view.
View 1: Cornell, Texas, Wisconsin, Madison, Washington
View 2: homework, student, professor, project, Ph.D.

NMI Results:
Method  | Univ. | Type
mSC     | 0.81  | 0.54
OPC     | 0.43  | 0.53
DK      | 0.48  | 0.57
SC      | 0.25  | 0.39
Kmeans  | 0.10  | 0.50
NSF Award Data: most frequent words in each view.
Subjects view:
- Physics: materials, chemical, metal, optical, quantum, surface
- Information: control, programming, information, function, languages
- Biology: cell, gene, protein, DNA, biological
Work Type view: experimental, theoretical, methods, experiments, mathematical, processes, develop, techniques, equation, measurements
Machine Sound Data: Normalized Mutual Information (NMI) results.
Method  | Motor | Fan  | Pump
mSC     | 0.82  | 0.75 | 0.83
OPC     | 0.73  | 0.68 | 0.47
DK      | 0.64  | 0.58 | 0.75
SC      | 0.42  | 0.16 | 0.09
Kmeans  | 0.57  | 0.16 | 0.09
Conclusions: Most clustering algorithms find only a single clustering solution. However, data may be multi-faceted, i.e., it can be interpreted in many different ways. We introduced a new method for discovering multiple non-redundant clusterings. Our approach, mSC, jointly optimizes a spectral clustering objective (to measure quality) and an HSIC regularizer (to measure redundancy). mSC can discover multiple clusterings with flexible cluster shapes while simultaneously finding the subspace in which each clustering view resides.