Multiple Non-Redundant Spectral Clustering Views

Donglin Niu, Jennifer G. Dy
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA

Michael I. Jordan
EECS and Statistics Departments, University of California, Berkeley
Given medical data:
 • From a doctor's view: group according to type of disease.
 • From an insurance company's view: group based on the patient's cost/risk.
Two kinds of approaches: Iterative & Simultaneous

Iterative: given an existing clustering, find another clustering.
 • Conditional Information Bottleneck. Gondek and Hofmann (2004)
 • COALA. Bae and Bailey (2006)
 • Minimizing KL-divergence. Qi and Davidson (2009)
 • Multiple alternative clusterings: Orthogonal Projection. Cui et al. (2007)

Simultaneous: discovery of all the possible partitionings.
 • Meta Clustering. Caruana et al. (2006)
 • De-correlated k-means. Jain et al. (2008)

Related but distinct paradigms: Ensemble Clustering, Hierarchical Clustering.
[Figure: two clustering views of the same data. VIEW 1: three rings in features f1, f2. VIEW 2: two half-moons in features f3, f4.]
There are O(K^N) possible clustering solutions.
We'd like to find solutions that:
 1. have high cluster quality, and
 2. are non-redundant,
and we'd like to simultaneously
 3. learn the subspace in each view.
Normalized Cut (On Spectral Clustering, Ng et al.)
 - maximize within-cluster similarity and minimize between-cluster similarity.

Let U be the cluster assignment:

    max_U  tr(U^T D^{-1/2} K D^{-1/2} U)
    s.t.   U^T U = I
Advantage: Can discover arbitrarily-shaped clusters.
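As a rough illustration (not the paper's code), the relaxed solution above comes from the top eigenvectors of the normalized similarity matrix D^{-1/2} K D^{-1/2}; the sketch below assumes a precomputed similarity/kernel matrix K and NumPy.

import numpy as np

def spectral_embedding(K, n_clusters):
    """Return the relaxed cluster-assignment matrix U (n x n_clusters)."""
    d = K.sum(axis=1)                          # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^{-1/2}
    M = D_inv_sqrt @ K @ D_inv_sqrt            # normalized similarity matrix
    _, eigvecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    U = eigvecs[:, -n_clusters:]               # eigenvectors with largest eigenvalues
    return U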
There are several possible criteria: correlation, mutual information.

 • Correlation: can capture only linear dependencies.
 • Mutual information: can capture non-linear dependencies, but requires estimating the joint
   probability distribution.

In this approach, we choose the Hilbert-Schmidt Independence Criterion (HSIC):

    HSIC(x, y) = || C_xy ||^2_HS

Advantage: can detect non-linear dependence and does not require estimating joint probability
distributions.
HSIC is the norm of a cross-covariance matrix in kernel space:

    HSIC(x, y) = || C_xy ||^2_HS,   where   C_xy = E_xy[ (φ(x) − μ_x) ⊗ (ψ(y) − μ_y) ]

Empirical estimate of HSIC:

    HSIC(X, Y) := (1/n^2) tr(K H L H)

    s.t.  H, K, L ∈ R^{n×n},   K_ij := k(x_i, x_j),   L_ij := l(y_i, y_j),
          H = I − (1/n) 1_n 1_n^T

where n is the number of observations and k, l are the kernel functions.
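A small sketch of the empirical estimate above, assuming Gaussian (RBF) kernels for both variables; the kernel choice and bandwidths here are illustrative, not prescribed by the slide.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC(X, Y) = tr(K H L H) / n^2."""
    n = X.shape[0]
    K = rbf_kernel(X, sigma_x)
    L = rbf_kernel(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / n**2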
Objective: cluster quality (normalized cut) regularized by redundancy (HSIC), with trade-off
parameter λ:

    max_{U_v ∈ R^{n×c}, W_v}   Σ_v tr(U_v^T D_v^{-1/2} K_v D_v^{-1/2} U_v)  −  λ Σ_{v≠q} tr(K_v H K_q H)

    s.t.  U_v^T U_v = I,   W_v^T W_v = I,   K_{v,ij} = K(W_v^T x_i, W_v^T x_j)

where U_v is the embedding, K_v is the kernel matrix, and D_v is the degree matrix for each view v;
H is the matrix that centers the kernel matrix. All of these are defined in the subspace W_v.
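As a hedged sketch of how this objective could be evaluated, assuming the kernel matrices K_v have already been built from the projected data W_v^T x and that `lam` (an illustrative name) plays the role of λ:

import numpy as np

def msc_objective(K_list, U_list, lam):
    """Quality term minus lam times the pairwise HSIC redundancy penalty."""
    n = K_list[0].shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                     # centering matrix
    quality, redundancy = 0.0, 0.0
    for v, (K, U) in enumerate(zip(K_list, U_list)):
        d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
        M = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]   # D^{-1/2} K D^{-1/2}
        quality += np.trace(U.T @ M @ U)                    # normalized-cut term
        for q, Kq in enumerate(K_list):
            if q != v:
                redundancy += np.trace(K @ H @ Kq @ H)      # HSIC penalty between views
    return quality - lam * redundancy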
We use a coordinate ascent approach.
Step 1: Fix Wv, optimize Uv
 • The solution for Uv is given by the eigenvectors with the largest eigenvalues of the
   normalized kernel similarity matrix.

Step 2: Fix Uv, optimize Wv
 • We use gradient ascent on the Stiefel manifold.

Repeat Steps 1 & 2 until convergence.

K-means step:
 • Normalize Uv. Apply k-means on Uv.
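The alternation above might look roughly like the following sketch. `build_kernel` and `stiefel_ascent_step` are hypothetical stand-ins for the kernel construction on the projected data W_v^T x and for one gradient-ascent-with-retraction update of W_v; this illustrates the loop structure, not the paper's exact procedure.

import numpy as np
from sklearn.cluster import KMeans

def msc_coordinate_ascent(X, W_list, n_clusters, build_kernel,
                          stiefel_ascent_step, n_iters=20):
    U_list = [None] * len(W_list)
    for _ in range(n_iters):
        # Step 1: with W_v fixed, U_v = top eigenvectors of D^{-1/2} K_v D^{-1/2}
        for v, W in enumerate(W_list):
            K = build_kernel(X @ W)
            d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
            M = d_inv_sqrt[:, None] * K * d_inv_sqrt[None, :]
            _, vecs = np.linalg.eigh(M)
            U_list[v] = vecs[:, -n_clusters:]
        # Step 2: with U_v fixed, one gradient step for each W_v on the Stiefel manifold
        for v in range(len(W_list)):
            W_list[v] = stiefel_ascent_step(X, W_list, U_list, v)
    # K-means step: row-normalize each U_v and cluster it
    labels = []
    for U in U_list:
        U_norm = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
        labels.append(KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U_norm))
    return labels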
Initialization of the subspaces Wv:
 • Cluster the features using spectral clustering.
 • Data x = [f1 f2 f3 f4 f5 … fd]
 • Feature similarity is based on HSIC(fi, fj).

[Figure: feature groups (e.g., {f1, f2, f4, …}, {f15, f34, f7, …}, {f21, f3, f9, …}) define the
initial transformation matrix Wv, a 0/1 matrix whose columns select the features in each group.]
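A rough sketch of this initialization, under the assumption that pairwise HSIC values serve as the feature-feature affinity and scikit-learn's spectral clustering groups the features; it reuses the `hsic` helper sketched earlier, and `n_views` (the number of views) is assumed to be given.

import numpy as np
from sklearn.cluster import SpectralClustering

def init_subspaces(X, n_views):
    """Group features by HSIC similarity; each group becomes a 0/1 selection matrix W_v."""
    d = X.shape[1]
    S = np.zeros((d, d))
    for i in range(d):
        for j in range(i, d):
            S[i, j] = S[j, i] = hsic(X[:, [i]], X[:, [j]])   # feature-feature similarity
    groups = SpectralClustering(n_clusters=n_views,
                                affinity='precomputed').fit_predict(S)
    W_list = []
    for v in range(n_views):
        idx = np.where(groups == v)[0]
        W = np.zeros((d, len(idx)))
        W[idx, np.arange(len(idx))] = 1.0                    # select this view's features
        W_list.append(W)
    return W_list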
[Figures: Synthetic Data 1 and Synthetic Data 2, each shown in two views (View 1 and View 2).]
mSC: our algorithm
OPC: orthogonal projection (Cui et al., 2007)
DK:  de-correlated k-means (Jain et al., 2008)
SC:  spectral clustering

Normalized Mutual Information (NMI) results:

           DATA 1              DATA 2
           VIEW 1    VIEW 2    VIEW 1    VIEW 2
mSC        0.94      0.95      0.90      0.93
OPC        0.89      0.85      0.02      0.07
DK         0.87      0.94      0.03      0.05
SC         0.37      0.42      0.31      0.25
Kmeans     0.36      0.34      0.03      0.05
Face data: Identity (ID) view and Pose view.

NMI results (FACE):
           ID       POSE
mSC        0.79     0.42
OPC        0.67     0.37
DK         0.70     0.40
SC         0.67     0.22
Kmeans     0.64     0.24

[Figure: mean face of each cluster; the number below each image is cluster purity.]
WebKB Data: High-Weight Words

Highest-weight words in each subspace view:
view 1    Cornell, Texas, Wisconsin, Madison, Washington
view 2    homework, student, professor, project, Ph.D.

NMI results (WebKB):
           Univ.    Type
mSC        0.81     0.54
OPC        0.43     0.53
DK         0.48     0.57
SC         0.25     0.39
Kmeans     0.10     0.50
NSF Award Data: High-Frequency Words

              Subjects                            Work Type
Physics      Information    Biology        experimental     theoretical
materials    control        cell           methods          Experiments
chemical     programming    gene           mathematical     Processes
metal        information    protein        develop          Techniques
optical      function       DNA            equation         Measurements
quantum      languages      Biological     theoretical      surface
Machine Sound Data

                   Motor       Fan      Pump
     mSC            0.82       0.75     0.83
     OPC            0.73       0.68     0.47
     DK             0.64       0.58     0.75
     SC             0.42       0.16     0.09
     Kmeans         0.57       0.16     0.09

     Normalized Mutual Information (NMI) Results
 • Most clustering algorithms find only a single clustering solution. However, data may be
   multi-faceted (i.e., it can be interpreted in many different ways).
 • We introduced a new method for discovering multiple non-redundant clusterings.
 • Our approach, mSC, optimizes both a spectral clustering objective (to measure quality) and
   an HSIC regularizer (to measure redundancy).
 • mSC can discover multiple clusterings with flexible cluster shapes, while simultaneously
   finding the subspace in which each clustering view resides.
Thank you!

2010 ICML

Editor's Notes

  • #2 Good afternoon. My name is Donglin Niu and I'm going to talk about "Multiple Non-Redundant Spectral Clustering Views." This is work I did with my advisor, Jennifer Dy, from Northeastern University and with Mike Jordan from UC Berkeley.
  • #3 Clustering is often the first step in exploring data. Most clustering algorithms only find one clustering solution. However, data may be multi-faceted by nature (i.e., a single dataset can be interpreted in many different ways). For example, let's say our data is a bunch of web-pages as shown here. One way to cluster this data is by grouping faculty webpages together in one cluster and the student webpages into another cluster. Another way is to group them according to the university they belong to.
  • #4 Another example: given medical data, a doctor may be interested in grouping the data based on disease type. An insurance company may be interested in grouping the patients according to their cost/risk.
  • #5 Because of the realization of the need for finding multiple alternative clustering interpretations, there is recent interest in this new clustering research paradigm. There are two kinds of approaches to solving this problem: iterative and simultaneous. In iterative methods, one is given an existing clustering, and the goal is to find an alternative clustering. Gondek and Hofmann find an alternative clustering using a conditional information bottleneck approach, Bae and Bailey apply must-link and cannot-link constraints with agglomerative clustering, and Qi and Davidson minimize a KL-divergence criterion. In many cases, one may be interested in finding not just one but multiple alternative clusterings. Cui et al. introduced an iterative orthogonal projection approach for finding multiple alternative clustering solutions.
  • #6 Another type of solution is to simultaneously discover all the possible partitionings. Meta Clustering by Caruana et al. generates several alternative solutions by random projection, then applies hierarchical clustering to the clustering solutions. De-correlated k-means by Jain et al. minimizes both the k-means sum-squared error for each clustering solution and their correlation with each other, to find multiple cluster partitionings. Our approach is a simultaneous approach. However, unlike meta-clustering, which applies random projection, we find multiple alternative clusterings based on an objective function. Unlike de-correlated k-means, which is based on k-means and is thereby limited to finding only spherical clusters, our approach can discover non-convex-shaped clusters. Moreover, de-correlated k-means uses all the features in all the views; our approach learns the subspace in each clustering view.
  • #7 The paradigm of finding multiple alternative clusterings is different from ensemble methods. Like this paradigm, ensemble clustering generates several alternative clusterings, but its ultimate goal is to find a SINGLE consensus clustering solution. Hierarchical clustering also generates several partitionings; however, it generates a hierarchy of coarse-to-fine clusters, such that samples that belong to the same cluster at the lower or finer levels of the hierarchy stay together at the higher or coarser levels. In our case, samples that belong to the same cluster in one view or solution can belong to different clusters in other views.
  • #8 Let's say we have data in four dimensions. In features F1 and F2 it has a three-ring cluster structure as shown in View 1, and a two-half-moon cluster structure in features F3 and F4 in View 2. A standard clustering algorithm will have the dilemma of selecting which of these two structures is more interesting to discover. Instead of finding one of them, our goal is to find all possible interesting cluster structures/views. There are O(K^n) possible ways to cluster n samples into K groups, modulo permutation of the clusters. We do not want to show all of these to the user, as that would overwhelm the data analyst. We'd like to find solutions that have high cluster quality, and we'd like to provide non-redundant cluster views. Moreover, we've noticed that typically the different alternative clusterings reside in different subspaces (i.e., they utilize different similarity metrics to find these clusters). Thus, in our formulation, we also simultaneously learn the subspace in which the clusterings reside in each view. I'll discuss each component in the following slides.
  • #9 We'd like to capture arbitrarily-shaped clusters. We employ the normalized-cut criterion and spectral clustering to define cluster quality. Normalized cut maximizes the within-cluster similarity and minimizes the between-cluster similarity. Let U be the cluster assignment. In spectral clustering, we relax the cluster assignment U to take on any real value; the normalized-cut clustering objective then becomes maximizing the trace of U^T times the normalized similarity matrix times U, subject to the constraint that U is orthonormal. The advantage of this criterion is that it can discover arbitrarily-shaped clusters.
  • #10 We'd like the clustering solutions we discover to be non-redundant with each other. There are several possible criteria for measuring non-redundancy: correlation or mutual information. (Read slide)
  • #11 HSIC is a norm of a cross-covariance matrix in kernel space. Empirically, we can estimate the HSIC between two random variables X and Y as the trace of two kernel matrices K and L. H here simply centers the kernel matrices.
  • #12 Our overall objective is then to maximize this function. The first term optimizes for cluster quality, the spectral clustering criterion. The second term minimizes the redundancies among the clustering views. Lambda is the regularization parameter that controls the trade-off between these two criteria. We incorporate discovering the subspace in which the clustering solutions in each view reside by learning the transformation matrix W_v. Note that W_v is inside the kernel and operates on the original input x.
  • #13 We optimize our objective to solve for the cluster embedding Uv and the subspace Wv in each view as follows. (Read slide) We discretize by applying a K-means step: (read slide)
  • #14 Our approach is only guaranteed to find local optima. Thus, the solution is dependent on initialization. We initialize the subspaces Wv in each view as follows. We cluster the features (i.e., columns of x) using spectral clustering and apply HSIC(f_i, f_j) between features as a measure of similarity. This groups features that are dependent on each other into the same cluster and those that are independent from each other into different groups. Each feature group forms the transformation matrix Wv as follows (click through the animation and explain). Note that even though each view started with disjoint features, after running our algorithm to convergence, each feature will have some weight in all views. Note too that the dimensions in each view are set by the number of features in each view in our initialization.