A Tutorial on Spectral Clustering


                                                    Chris Ding
                                   Computational Research Division
                                Lawrence Berkeley National Laboratory
                                       University of California

                     Supported by Office of Science, U.S. Dept. of Energy




Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California   1
Some historical notes
         • Fiedler, 1973, 1975, graph Laplacian matrix
         • Donath & Hoffman, 1973, bounds
         • Pothen, Simon, Liou, 1990, Spectral graph
           partitioning (many related papers thereafter)
         • Hagen & Kahng, 1992, Ratio-cut
         • Chan, Schlag & Zien, multi-way Ratio-cut
         • Chung, 1997, Spectral graph theory book
         • Shi & Malik, 2000, Normalized Cut


Spectral Gold-Rush of 2001
                                   9 papers on spectral clustering

     • Meila & Shi, AI-Stat 2001. Random Walk interpretation of
           Normalized Cut
     • Ding, He & Zha, KDD 2001. Perturbation analysis of Laplacian
                    matrix on sparsely connected graphs
     • Ng, Jordan & Weiss, NIPS 2001. K-means algorithm on the
           embedded eigen-space
     • Belkin & Niyogi, NIPS 2001. Spectral Embedding
     • Dhillon, KDD 2001, Bipartite graph clustering
     • Zha et al, CIKM 2001, Bipartite graph clustering
     • Zha et al, NIPS 2001. Spectral Relaxation of K-means
     • Ding et al, ICDM 2001. MinMaxCut, Uniqueness of relaxation.
     • Gu et al, K-way Relaxation of NormCut and MinMaxCut
Part I: Basic Theory, 1973 – 2001




Spectral Graph Partitioning

       MinCut: min cutsize
       cutsize = # of cut edges
       Constraint on sizes: |A| = |B|




2-way Spectral Graph Partitioning
     Partition membership indicator:

         q_i = +1 if i ∈ A,   q_i = −1 if i ∈ B

         J = CutSize = (1/4) ∑_{i,j} w_ij (q_i − q_j)²
                     = (1/4) ∑_{i,j} w_ij (q_i² + q_j² − 2 q_i q_j)
                     = (1/2) ∑_{i,j} q_i (d_i δ_ij − w_ij) q_j
                     = (1/2) qᵀ(D − W)q
        Relaxing the indicators q_i from discrete to continuous values,
        the solution for min J(q) is given by the eigenvectors of

            (D − W)q = λq        (Fiedler, 1973, 1975; Pothen, Simon, Liou, 1990)
Properties of Graph Laplacian

            Laplacian matrix of the graph: L = D − W

          • L is positive semi-definite: xᵀLx ≥ 0 for any x.
          • The first eigenvector is q1 = (1, …, 1)ᵀ = e, with λ1 = 0.
          • The second eigenvector q2 is the desired solution.
          • The smaller λ2, the better the quality of the
            partitioning. Perturbation analysis gives

                λ2 = cutsize/|A| + cutsize/|B|

          • Higher eigenvectors are also useful.
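These properties can be checked concretely. Below is a minimal numpy sketch (the barbell graph of two triangles is an invented example, not from the tutorial) that forms L = D − W and splits the graph on the sign of the second eigenvector:

```python
import numpy as np

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by one edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))
L = D - W                          # graph Laplacian

vals, vecs = np.linalg.eigh(L)     # eigh: L is symmetric
q2 = vecs[:, 1]                    # second eigenvector (Fiedler vector)

A = sorted(np.where(q2 < 0)[0].tolist())
B = sorted(np.where(q2 >= 0)[0].tolist())
```

The smallest eigenvalue comes out ≈ 0 with a constant eigenvector, as stated above, and the sign split recovers the two triangles.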
Recovering Partitions

          From the definition of cluster indicators:
          Partitions A, B are determined by:

               A = {i | q2(i) < 0},   B = {i | q2(i) ≥ 0}

           However, the objective function J(q) is
           insensitive to an additive constant c:

               J = CutSize = (1/4) ∑_{i,j} w_ij [(q_i + c) − (q_j + c)]²

           Thus, we sort q2 in increasing order and cut at the
           midpoint.
Multi-way Graph Partitioning

          • Recursively applying the 2-way partitioning
             • Recursive 2-way partitioning
             • Using Kernighan-Lin for local refinements
          • Using higher eigenvectors
             • Use q3 to further partition the clusters obtained
               via q2.
          • Popular graph partitioning packages
             • Metis, Univ of Minnesota
             • Chaco, Sandia Nat'l Lab

2-way Spectral Clustering

               • Undirected graphs (pairwise similarities)
               • Bipartite graphs (contingency tables)
               • Directed graphs (web graphs)




Spectral Clustering
         min cutsize, without explicit size constraints
          But where to cut?




            Need to balance sizes

Clustering Objective Functions
              s(A, B) = ∑_{i∈A} ∑_{j∈B} w_ij,   d_A = ∑_{i∈A} d_i

          • Ratio Cut

              J_Rcut(A, B) = s(A, B)/|A| + s(A, B)/|B|

          • Normalized Cut

              J_Ncut(A, B) = s(A, B)/d_A + s(A, B)/d_B
                           = s(A, B)/[s(A, A) + s(A, B)] + s(A, B)/[s(B, B) + s(A, B)]

          • Min-Max-Cut

              J_MMC(A, B) = s(A, B)/s(A, A) + s(A, B)/s(B, B)
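All three objectives can be evaluated straight from these definitions; the helper below is a minimal sketch (the function name and the toy graph are invented for illustration). Note that s(A,A) sums over ordered pairs, so d_A = s(A,A) + s(A,B) holds exactly as in the Normalized Cut identity above.

```python
import numpy as np

def cut_objectives(W, A, B):
    """Ratio Cut, Normalized Cut and MinMaxCut for a 2-way partition (A, B)."""
    W = np.asarray(W, dtype=float)
    sAB = W[np.ix_(A, B)].sum()          # between-cluster similarity s(A,B)
    sAA = W[np.ix_(A, A)].sum()          # within-cluster similarity s(A,A)
    sBB = W[np.ix_(B, B)].sum()
    dA, dB = W[A, :].sum(), W[B, :].sum()  # d_A = s(A,A) + s(A,B), likewise d_B
    J_rcut = sAB / len(A) + sAB / len(B)
    J_ncut = sAB / dA + sAB / dB
    J_mmc = sAB / sAA + sAB / sBB
    return J_rcut, J_ncut, J_mmc

# Two triangles joined by one edge: s(A,B)=1, s(A,A)=s(B,B)=6, d_A=d_B=7.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
J_rcut, J_ncut, J_mmc = cut_objectives(W, [0, 1, 2], [3, 4, 5])
```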

Ratio Cut (Hagen & Kahng, 1992)

        Min similarity between A, B:

            s(A, B) = ∑_{i∈A} ∑_{j∈B} w_ij

        Size balance (Wei & Cheng, 1989):

            J_Rcut(A, B) = s(A, B)/|A| + s(A, B)/|B|

        Cluster membership indicator:

            q(i) = +√(n2/(n1 n)) if i ∈ A,   q(i) = −√(n1/(n2 n)) if i ∈ B

        Normalization: qᵀq = 1, qᵀe = 0

        Substituting q leads to J_Rcut(q) = qᵀ(D − W)q.
        Now relax q; the solution is given by the 2nd eigenvector of L.

Normalized Cut (Shi & Malik, 1997)

            Min similarity between A & B:

                s(A, B) = ∑_{i∈A} ∑_{j∈B} w_ij

            Balance weights:

                J_Ncut(A, B) = s(A, B)/d_A + s(A, B)/d_B,   d_A = ∑_{i∈A} d_i

            Cluster indicator:

                q(i) = +√(d_B/(d_A d)) if i ∈ A,   q(i) = −√(d_A/(d_B d)) if i ∈ B,   d = ∑_{i∈G} d_i

            Normalization: qᵀDq = 1, qᵀDe = 0

            Substituting q leads to J_Ncut(q) = qᵀ(D − W)q:

                min_q qᵀ(D − W)q + λ(qᵀDq − 1)

            The solution is the eigenvector of (D − W)q = λDq
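(D − W)q = λDq is a generalized symmetric eigenproblem and can be handed directly to a generalized eigensolver; a minimal scipy sketch on an assumed toy graph:

```python
import numpy as np
from scipy.linalg import eigh

# Toy graph: two triangles joined by one edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))

# Generalized problem (D - W) q = λ D q; D is positive definite here.
vals, vecs = eigh(D - W, D)
q2 = vecs[:, 1]                    # second generalized eigenvector
A = sorted(np.where(q2 < 0)[0].tolist())
B = sorted(np.where(q2 >= 0)[0].tolist())
```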
MinMaxCut (Ding et al 2001)
              Min similarity between A & B:

                  s(A, B) = ∑_{i∈A} ∑_{j∈B} w_ij

              Max similarity within A & B:

                  s(A, A) = ∑_{i∈A} ∑_{j∈A} w_ij

                  J_MMC(A, B) = s(A, B)/s(A, A) + s(A, B)/s(B, B)

       Cluster indicator:

           q(i) = +√(d_B/(d_A d)) if i ∈ A,   q(i) = −√(d_A/(d_B d)) if i ∈ B

       Substituting,

           J_MMC(q) = (1 + d_B/d_A)/(J_m + d_B/d_A) + (1 + d_A/d_B)/(J_m + d_A/d_B) − 2,   J_m = qᵀWq / qᵀDq

       Because dJ_MMC/dJ_m < 0,  min J_MMC ⇒ max J_m(q):

           Wq = ξDq   ⇒   (D − W)q = λDq
A simple example
                 2 dense clusters, with sparse connections
                 between them.
               [Figure: adjacency matrix (left) and eigenvector q2 (right)]




Comparison of Clustering Objectives
          • If clusters are well separated, all three give
            very similar and accurate results.
          • When clusters are marginally separated,
            NormCut and MinMaxCut give better results.
          • When clusters overlap significantly,
            MinMaxCut tends to give more compact and
            balanced clusters.

              J_Ncut = s(A, B)/[s(A, A) + s(A, B)] + s(A, B)/[s(B, B) + s(A, B)]

          Cluster compactness ⇒ max s(A, A)
2-way Clustering of Newsgroups

     Newsgroup pair                     RatioCut      NormCut       MinMaxCut
     Atheism / Comp.graphics           63.2 ± 16.2   97.2 ± 0.8    97.2 ± 1.1
     Baseball / Hockey                 54.9 ± 2.5    74.4 ± 20.4   79.5 ± 11.0
     Politics.mideast / Politics.misc  53.6 ± 3.1    57.5 ± 0.9    83.6 ± 2.5

Cluster Balance Analysis I:
                                Random Graph Model
        • Random graph: edges are randomly assigned
          with probability p, 0 ≤ p ≤ 1.
        • RatioCut & NormCut show no size dependence:

            J_Rcut(A, B) = p|A||B|/|A| + p|A||B|/|B| = np = constant

            J_Ncut(A, B) = p|A||B|/(p|A|(n−1)) + p|A||B|/(p|B|(n−1)) = n/(n−1) = constant

        • MinMaxCut favors balanced clusters, |A| = |B|:

            J_MMC(A, B) = p|A||B|/(p|A|(|A|−1)) + p|A||B|/(p|B|(|B|−1)) = |B|/(|A|−1) + |A|/(|B|−1)
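These identities are exact for the expected adjacency matrix w_ij = p (i ≠ j); a quick numerical check (n, p, and the split sizes are arbitrary choices):

```python
import numpy as np

n, p = 12, 0.3
W = p * (np.ones((n, n)) - np.eye(n))    # expected adjacency: w_ij = p, i != j

def objectives(A, B):
    sAB = W[np.ix_(A, B)].sum()
    sAA = W[np.ix_(A, A)].sum()
    sBB = W[np.ix_(B, B)].sum()
    dA, dB = W[A, :].sum(), W[B, :].sum()
    return (sAB / len(A) + sAB / len(B),   # RatioCut
            sAB / dA + sAB / dB,           # NormCut
            sAB / sAA + sAB / sBB)         # MinMaxCut

# Evaluate every split size k; RatioCut and NormCut should be constant,
# MinMaxCut smallest at the balanced split k = n/2.
results = {k: objectives(list(range(k)), list(range(k, n)))
           for k in range(2, n - 1)}
```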

2-way Clustering of Newsgroups
    Cluster Balance

    [Figure: sorted eigenvector values, with JNcut(i) and JMMC(i)
    evaluated at each candidate cut point i]
Cluster Balance Analysis II:
                                                         Large Overlap Case
                 f = s(A, B) / ((1/2)[s(A, A) + s(B, B)]) > 0.5

             Conditions for skewed cuts:

                 NormCut:    s(A, A) ≥ (1/(2f) − 1/2) s(A, B)   (= s(A, B)/2 at f = 0.5)

                 MinMaxCut:  s(A, A) ≥ (1/(2f)) s(A, B)         (= s(A, B) at f = 0.5)

             Thus MinMaxCut is much less prone to skewed cuts.

Spectral Clustering of Bipartite Graphs

            Simultaneous clustering of rows and columns
            of a contingency table (adjacency matrix B )

            Examples of bipartite graphs
            • Information Retrieval: word-by-document matrix
            • Market basket data: transaction-by-item matrix
            • DNA Gene expression profiles
            • Protein vs protein-complex
Spectral Clustering of Bipartite Graphs
          Simultaneous clustering of rows and columns
                     (adjacency matrix B )
                 s(B_{R1,C2}) = ∑_{r_i∈R1} ∑_{c_j∈C2} b_ij

           [Figure: bipartite graph with a cut separating (R1, C1) from (R2, C2)]

           min between-cluster sums of weights: s(R1, C2), s(R2, C1)
           max within-cluster sums of weights: s(R1, C1), s(R2, C2)

               J_MMC(C1, C2; R1, R2) = [s(B_{R1,C2}) + s(B_{R2,C1})] / (2 s(B_{R1,C1}))
                                     + [s(B_{R1,C2}) + s(B_{R2,C1})] / (2 s(B_{R2,C2}))

                                                                    (Ding, AI-STAT 2003)
Bipartite Graph Clustering
          Clustering indicators for rows and columns:

              f(i) = +1 if r_i ∈ R1,  −1 if r_i ∈ R2
              g(i) = +1 if c_i ∈ C1,  −1 if c_i ∈ C2

              B = [ B_{R1,C1}  B_{R1,C2} ]      W = [ 0    B ]      q = [ f ]
                  [ B_{R2,C1}  B_{R2,C2} ]          [ Bᵀ   0 ]          [ g ]

     Substitute and obtain

              J_MMC(C1, C2; R1, R2) = s(W12)/s(W11) + s(W12)/s(W22)

     f, g are determined by

              ( [ D_r  0   ]   [ 0    B ] ) [ f ]       [ D_r  0   ] [ f ]
              ( [ 0    D_c ] − [ Bᵀ   0 ] ) [ g ]  =  λ [ 0    D_c ] [ g ]
Clustering of Bipartite Graphs
        Let

            B̃ = D_r^{−1/2} B D_c^{−1/2},    z = [ u ] = D^{1/2} q = [ D_r^{1/2} f ]
                                                [ v ]               [ D_c^{1/2} g ]

        We obtain

            [ 0    B̃ ] [ u ]       [ u ]
            [ B̃ᵀ   0 ] [ v ]  =  λ [ v ]

        The solution is the SVD:

            B̃ = ∑_{k=1}^{m} u_k λ_k v_kᵀ

                                            (Zha et al, 2001; Dhillon, 2001)
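A minimal numpy sketch of this SVD-based co-clustering (the word-by-document matrix is invented for illustration): scale B, take the second left/right singular vectors, and split rows and columns on their signs:

```python
import numpy as np

# Toy word-by-document matrix: two blocks plus one weak cross link.
B = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 0.],
              [3., 3., 1., 0.],
              [0., 0., 2., 3.],
              [0., 0., 3., 2.]])

# B~ = Dr^{-1/2} B Dc^{-1/2}
dr = 1.0 / np.sqrt(B.sum(axis=1))
dc = 1.0 / np.sqrt(B.sum(axis=0))
Bt = dr[:, None] * B * dc[None, :]

U, s, Vt = np.linalg.svd(Bt)
f2, g2 = U[:, 1], Vt[1, :]           # second singular vector pair

R1 = sorted(np.where(f2 < 0)[0].tolist())
R2 = sorted(np.where(f2 >= 0)[0].tolist())
C1 = sorted(np.where(g2 < 0)[0].tolist())
C2 = sorted(np.where(g2 >= 0)[0].tolist())
```

The signs of f2 and g2 are coordinated through the SVD, so rows and columns of the same block land on the same side of the cut.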




Clustering of Bipartite Graphs
         Recovering row clusters:

             R1 = {r_i | f2(i) < z_r},   R2 = {r_i | f2(i) ≥ z_r}

         Recovering column clusters:

             C1 = {c_i | g2(i) < z_c},   C2 = {c_i | g2(i) ≥ z_c}

         z_r = z_c = 0 are the dividing points; the relaxation is
         invariant up to a constant shift.
         Algorithm: search for optimal cut points i_cut, j_cut,
         letting z_r = f2(i_cut) and z_c = g2(j_cut), such that
         J_MMC(C1, C2; R1, R2) is minimized. (Zha et al, 2001)

Clustering of Directed Graphs
          Min directed edge weights between A & B:

              s(A, B) = ∑_{i∈A} ∑_{j∈B} (w_ij + w_ji)

          Max directed edge weights within A & B:

              s(A, A) = ∑_{i∈A} ∑_{j∈A} (w_ij + w_ji)

      • Equivalent to dealing with W̃ = W + Wᵀ
      • All spectral methods apply to W̃
      • For example, web graphs are clustered in this way
                                                                                (He, Ding, Zha, Simon, ICDM 2001)
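A one-step preprocessing sketch (the 3-node directed cycle is an invented example): symmetrize W and hand W̃ to any of the undirected methods above:

```python
import numpy as np

# Directed 3-cycle: 0 -> 1 -> 2 -> 0.
W = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])

W_sym = W + W.T          # w~_ij = w_ij + w_ji, matching s(A,B) above
L = np.diag(W_sym.sum(axis=1)) - W_sym   # Laplacian of the symmetrized graph
```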
K-way Spectral Clustering
                        K≥2




K-way Clustering Objectives

          • Ratio Cut

              J_Rcut(C1, …, CK) = ∑_{k<l} [ s(Ck, Cl)/|Ck| + s(Ck, Cl)/|Cl| ] = ∑_k s(Ck, G−Ck)/|Ck|

          • Normalized Cut

              J_Ncut(C1, …, CK) = ∑_{k<l} [ s(Ck, Cl)/d_k + s(Ck, Cl)/d_l ] = ∑_k s(Ck, G−Ck)/d_k

          • Min-Max-Cut

              J_MMC(C1, …, CK) = ∑_{k<l} [ s(Ck, Cl)/s(Ck, Ck) + s(Ck, Cl)/s(Cl, Cl) ] = ∑_k s(Ck, G−Ck)/s(Ck, Ck)
K-way Spectral Relaxation

          • Prove that the solutions lie in the subspace
            spanned by the first k eigenvectors
         • Ratio Cut
         • Normalized Cut
         • Min-Max-Cut




K-way Spectral Relaxation
      Unsigned cluster indicators:

          h1 = (1…1, 0…0, …, 0…0)ᵀ
          h2 = (0…0, 1…1, …, 0…0)ᵀ
               ⋮
          hk = (0…0, 0…0, …, 1…1)ᵀ

        Re-write:

          J_Rcut(h1, …, hk) = h1ᵀ(D−W)h1 / h1ᵀh1  + … + hkᵀ(D−W)hk / hkᵀhk

          J_Ncut(h1, …, hk) = h1ᵀ(D−W)h1 / h1ᵀDh1 + … + hkᵀ(D−W)hk / hkᵀDhk

          J_MMC(h1, …, hk)  = h1ᵀ(D−W)h1 / h1ᵀWh1 + … + hkᵀ(D−W)hk / hkᵀWhk


K-way Ratio Cut Spectral Relaxation
      Unsigned cluster indicators:

          x_k = (0…0, 1…1, 0…0)ᵀ / n_k^{1/2}     (n_k ones in block k)

      Re-write:

          J_Rcut(x1, …, xk) = x1ᵀ(D−W)x1 + … + xkᵀ(D−W)xk
                            = Tr(Xᵀ(D−W)X),   X = (x1, …, xk)

      Optimize:

          min_X Tr(Xᵀ(D−W)X),  subject to XᵀX = I

      By Ky Fan's theorem, the optimal solution is given by the
      eigenvectors X = (v1, v2, …, vk), (D−W)v_k = λ_k v_k,
      with the lower bound

          λ1 + … + λk ≤ min J_Rcut(x1, …, xk)
                                                                  (Chan, Schlag, Zien, 1994)
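Ky Fan's theorem can be sanity-checked numerically: for any orthonormal X, Tr(Xᵀ(D−W)X) is at least λ1 + … + λk. A sketch with a random weighted graph (sizes and sample count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
W = rng.random((n, n))
W = (W + W.T) / 2.0                      # symmetric random similarity
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W           # Laplacian D - W

bound = np.sort(np.linalg.eigvalsh(L))[:k].sum()   # λ1 + ... + λk

# Tr(Xᵀ L X) for random orthonormal n×k frames never beats the bound.
traces = []
for _ in range(200):
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
    traces.append(np.trace(X.T @ L @ X))
```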

K-way Normalized Cut Spectral Relaxation
     Unsigned cluster indicators:

         y_k = D^{1/2} h_k / ||D^{1/2} h_k||

      Re-write:

         J_Ncut(y1, …, yk) = y1ᵀ(I − W̃)y1 + … + ykᵀ(I − W̃)yk
                           = Tr(Yᵀ(I − W̃)Y),   W̃ = D^{−1/2} W D^{−1/2}

      Optimize:

         min_Y Tr(Yᵀ(I − W̃)Y),  subject to YᵀY = I

   By Ky Fan's theorem, the optimal solution is given by the
   eigenvectors Y = (v1, v2, …, vk), (I − W̃)v_k = λ_k v_k, equivalently

         (D − W)u_k = λ_k D u_k,   u_k = D^{−1/2} v_k

         λ1 + … + λk ≤ min J_Ncut(y1, …, yk)          (Gu, et al, 2001)

K-way Min-Max Cut Spectral Relaxation
     Unsigned cluster indicators:

         y_k = D^{1/2} h_k / ||D^{1/2} h_k||,   W̃ = D^{−1/2} W D^{−1/2}

      Re-write:

         J_MMC(y1, …, yk) = 1/(y1ᵀW̃y1) + … + 1/(ykᵀW̃yk) − k

     Optimize:

         min_Y J_MMC(Y),  subject to YᵀY = I,  y_kᵀW̃y_k > 0

   Theorem. The optimal solution is given by the eigenvectors
   Y = (v1, v2, …, vk), W̃v_k = λ_k v_k, with the lower bound

         k²/(λ1 + … + λk) − k ≤ min J_MMC(y1, …, yk)     (Gu, et al, 2001)


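The MinMaxCut bound can likewise be sanity-checked. A minimal sketch under assumed toy weights (two triangles with internal weights 1 and 2, joined by weak 0.1 cross-links; not from the tutorial): evaluate J_MMC at one feasible partition and compare it with k²/(λ_1 + … + λ_k) − k computed from the largest eigenvalues of W̃.

```python
import numpy as np

# Toy graph: two triangles (internal weights 1 and 2) with weak cross-links.
T = np.ones((3, 3)) - np.eye(3)
C = 0.1 * np.ones((3, 3))
W = np.block([[T, C], [C, 2 * T]])
d = W.sum(axis=1)
D_sqrt = np.sqrt(d)
W_tilde = W / np.outer(D_sqrt, D_sqrt)        # D^{-1/2} W D^{-1/2}

def J_mmc(clusters):
    """J_MMC = sum_k 1/(y_k^T W~ y_k) - K, with y_k = D^{1/2} h_k / ||D^{1/2} h_k||."""
    total = 0.0
    for members in clusters:
        h = np.zeros(len(d))
        h[members] = 1.0
        y = D_sqrt * h / np.linalg.norm(D_sqrt * h)
        total += 1.0 / (y @ W_tilde @ y)      # feasibility requires y^T W~ y > 0
    return total - len(clusters)

# Largest eigenvalues of W~ give the Gu et al. (2001) lower bound.
lam = np.sort(np.linalg.eigvalsh(W_tilde))[::-1]
k = 2
bound = k**2 / lam[:k].sum() - k
print(bound, J_mmc([[0, 1, 2], [3, 4, 5]]))   # the bound never exceeds the objective
```

For this partition J_MMC = 0.9/6 + 0.9/12 = 0.225 (cut weight over within-cluster weight for each side), and the eigenvalue bound stays below it, as the theorem requires.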
K-way Spectral Clustering

          • Embedding (similar to the PCA subspace approach)
                – Embed data points in the subspace of the K eigenvectors
                – Cluster the embedded points using another algorithm, such as
                  K-means (Shi & Malik; Ng et al.; Zha et al.)
          • Recursive 2-way clustering (standard graph partitioning)
                – If the desired K is not a power of 2, how to optimally choose
                  the next sub-cluster to split? (Ding et al., 2002)
          • Neither of the above approaches uses the K-way clustering
            objective functions directly.
          • Refining the obtained clusters using the K-way clustering
            objective function typically improves the results (Ding et al.,
            2002).



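The embedding approach above can be sketched end to end: embed the nodes in the subspace of the top-K eigenvectors of W̃, row-normalize (in the style of Ng et al.), and run K-means on the embedded points. The two-triangle toy graph and the seeding are illustrative choices, not from the tutorial.

```python
import numpy as np

# Toy graph: two triangles joined by weak links (illustrative values).
T = np.ones((3, 3)) - np.eye(3)
W = np.block([[T, 0.05 * np.ones((3, 3))],
              [0.05 * np.ones((3, 3)), T]])
d = W.sum(axis=1)
W_tilde = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}

K = 2
lam, V = np.linalg.eigh(W_tilde)               # eigenvalues ascending
X = V[:, -K:]                                  # top-K eigenvectors as coordinates
X /= np.linalg.norm(X, axis=1, keepdims=True)  # row-normalize (Ng et al. style)

# Minimal K-means (Lloyd iterations) on the embedded points,
# seeded with one point from each triangle.
centers = X[[0, 3]]
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(K)])

print(labels)   # -> [0 0 0 1 1 1]: the two triangles are recovered
```

In practice any K-means implementation can replace the hand-rolled loop; the point is that the clustering happens in the eigenvector subspace, not on the raw data.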
DNA Gene Expression

     Lymphoma Cancer
     (Alizadeh et al., 2000)

     [Heatmap: genes (rows) × tissue samples (columns)]

   Effects of feature selection:
     select 900 genes out of
           4025 genes
Lymphoma Cancer
       Tissue samples:
      B-cell lymphomas go through
          different stages
                – 3 cancer stages
                – 3 normal stages
   Key question: can we detect
      them automatically?

                                                          [PCA 2D display of the tissue samples]
Brief summary of Part I
     •   Spectral graph partitioning as origin
     •   Clustering objective functions and solutions
     •   Extensions to bipartite and directed graphs
     •   Characteristics
           –   Principled approach
           –   Well-motivated objective functions
            –   Clear and unambiguous
           –   A framework of rich structures and contents
           –   Everything is proved rigorously (within the relaxation
               framework, i.e., using continuous approximation of the discrete
               variables)
      • The above results were mostly obtained by 2001.
     • More to come in Part II
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 

icml2004 tutorial on spectral clustering part I

  • 1. A Tutorial on Spectral Clustering Chris Ding Computational Research Division Lawrence Berkeley National Laboratory University of California Supported by Office of Science, U.S. Dept. of Energy Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California 1
  • 2. Some historical notes
       • Fiedler, 1973, 1975, graph Laplacian matrix
       • Donath & Hoffman, 1973, bounds
       • Pothen, Simon, Liou, 1990, spectral graph partitioning (many related papers thereafter)
       • Hagen & Kahng, 1992, Ratio-cut
       • Chan, Schlag & Zien, multi-way Ratio-cut
       • Chung, 1997, Spectral graph theory book
       • Shi & Malik, 2000, Normalized Cut
  • 3. Spectral Gold-Rush of 2001: 9 papers on spectral clustering
       • Meila & Shi, AI-Stat 2001. Random walk interpretation of Normalized Cut
       • Ding, He & Zha, KDD 2001. Perturbation analysis of the Laplacian matrix on sparsely connected graphs
       • Ng, Jordan & Weiss, NIPS 2001. K-means algorithm on the embedded eigenspace
       • Belkin & Niyogi, NIPS 2001. Spectral embedding
       • Dhillon, KDD 2001. Bipartite graph clustering
       • Zha et al, CIKM 2001. Bipartite graph clustering
       • Zha et al, NIPS 2001. Spectral relaxation of K-means
       • Ding et al, ICDM 2001. MinMaxCut, uniqueness of relaxation
       • Gu et al. K-way relaxation of NormCut and MinMaxCut
  • 4. Part I: Basic Theory, 1973 – 2001
  • 5. Spectral Graph Partitioning
       MinCut: min cutsize, with a constraint on sizes: |A| = |B|
       cutsize = # of cut edges
  • 6. 2-way Spectral Graph Partitioning
       Partition membership indicator: q_i = +1 if i ∈ A, q_i = −1 if i ∈ B
       J = CutSize = (1/4) Σ_{i,j} w_ij (q_i − q_j)^2
         = (1/4) Σ_{i,j} w_ij (q_i^2 + q_j^2 − 2 q_i q_j) = (1/2) Σ_{i,j} q_i (d_i δ_ij − w_ij) q_j
         = (1/2) q^T (D − W) q
       Relaxing the indicators q_i from discrete to continuous values, the solution of min J(q)
       is given by the eigenvectors of (D − W) q = λq
       (Fiedler, 1973, 1975; Pothen, Simon, Liou, 1990)
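The quadratic-form identity and the relaxed eigenvector solution on this slide can be checked numerically. A minimal numpy sketch; the 6-node toy graph (two triangles joined by one edge) is an illustrative assumption, not from the tutorial:

```python
import numpy as np

# Toy graph (an assumption for illustration): two triangles joined by one edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # graph Laplacian

# Identity from the slide: (1/4) sum_ij w_ij (q_i - q_j)^2 = (1/2) q^T (D - W) q
q = np.array([1, 1, 1, -1, -1, -1], dtype=float)
J = 0.25 * np.sum(W * (q[:, None] - q[None, :]) ** 2)
assert np.isclose(J, 0.5 * q @ L @ q)

# Relaxed solution: eigenvector of L with the second-smallest eigenvalue
evals, evecs = np.linalg.eigh(L)
q2 = evecs[:, 1]
# the signs of q2 separate the two triangles (up to a global sign flip)
```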
  • 7. Properties of the Graph Laplacian
       Laplacian matrix of the graph: L = D − W
       • L is positive semi-definite: x^T L x ≥ 0 for any x
       • First eigenvector is q1 = (1, …, 1)^T = e, with λ1 = 0
       • Second eigenvector q2 is the desired solution
       • The smaller λ2, the better the quality of the partitioning;
         perturbation analysis gives λ2 = cutsize/|A| + cutsize/|B|
       • Higher eigenvectors are also useful
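The bullet-point properties of L are easy to verify numerically; a small sketch on a random symmetric weight matrix (the graph itself is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 8))
W = np.triu(A, 1)
W = W + W.T                      # random symmetric nonnegative weights
D = np.diag(W.sum(axis=1))
L = D - W

evals, evecs = np.linalg.eigh(L)

assert np.all(evals >= -1e-10)   # L is positive semi-definite
x = rng.standard_normal(8)
assert x @ L @ x >= -1e-10       # x^T L x >= 0 for any x

e = np.ones(8)
assert np.allclose(L @ e, 0.0)   # q1 = (1,...,1)^T is an eigenvector with eigenvalue 0
assert np.isclose(evals[0], 0.0)
```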
  • 8. Recovering Partitions
       From the definition of cluster indicators, partitions A, B are determined by:
       A = {i | q2(i) < 0},  B = {i | q2(i) ≥ 0}
       However, the objective function J(q) is insensitive to an additive constant c:
       J = CutSize = (1/4) Σ_{i,j} w_ij [(q_i + c) − (q_j + c)]^2
       Thus we sort q2 in increasing order and cut at the middle point.
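The shift-invariance and the sign-based recovery rule can both be demonstrated on a toy graph (the graph is an assumption, not from the tutorial):

```python
import numpy as np

# Toy graph (assumed for illustration): two triangles joined by one edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
q2 = np.linalg.eigh(L)[1][:, 1]

# J(q) is insensitive to an additive constant, so only the ORDER of q2 matters
J = lambda q: 0.25 * np.sum(W * (q[:, None] - q[None, :]) ** 2)
assert np.isclose(J(q2), J(q2 + 3.7))

# Recover the partition by the sign of q2 (equivalently: sort and cut in the middle)
A = {i for i in range(6) if q2[i] < 0}
B = set(range(6)) - A
# A and B are the two triangles, in some order
```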
  • 9. Multi-way Graph Partitioning
       • Recursively applying 2-way partitioning
         • Recursive 2-way partitioning
         • Using Kernighan-Lin for local refinements
       • Using higher eigenvectors
         • Using q3 to further partition the clusters obtained via q2
       • Popular graph partitioning packages
         • Metis, Univ of Minnesota
         • Chaco, Sandia Nat'l Lab
  • 10. 2-way Spectral Clustering
       • Undirected graphs (pairwise similarities)
       • Bipartite graphs (contingency tables)
       • Directed graphs (web graphs)
  • 11. Spectral Clustering
       min cutsize, without explicit size constraints
       But where to cut? Need to balance sizes.
  • 12. Clustering Objective Functions
       s(A,B) = Σ_{i∈A} Σ_{j∈B} w_ij
       • Ratio Cut:      J_Rcut(A,B) = s(A,B)/|A| + s(A,B)/|B|
       • Normalized Cut: J_Ncut(A,B) = s(A,B)/d_A + s(A,B)/d_B,  where d_A = Σ_{i∈A} d_i
                         = s(A,B)/[s(A,A) + s(A,B)] + s(A,B)/[s(B,B) + s(A,B)]
       • Min-Max-Cut:    J_MMC(A,B) = s(A,B)/s(A,A) + s(A,B)/s(B,B)
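The three objectives fit in a short helper. A numpy sketch; the function name `cut_objectives` and the toy graph are assumptions for illustration:

```python
import numpy as np

def cut_objectives(W, A, B):
    """J_Rcut, J_Ncut, J_MMC for a bipartition (A, B); s(X,Y) sums all ordered pairs."""
    A, B = list(A), list(B)
    s_AB = W[np.ix_(A, B)].sum()
    s_AA = W[np.ix_(A, A)].sum()
    s_BB = W[np.ix_(B, B)].sum()
    d = W.sum(axis=1)
    dA, dB = d[A].sum(), d[B].sum()
    # d_A = s(A,A) + s(A,B): the two Normalized Cut forms on the slide agree
    assert np.isclose(dA, s_AA + s_AB)
    return (s_AB / len(A) + s_AB / len(B),   # Ratio Cut
            s_AB / dA + s_AB / dB,           # Normalized Cut
            s_AB / s_AA + s_AB / s_BB)       # Min-Max-Cut

# Two triangles joined by one edge (toy graph, assumed for illustration)
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
J_rcut, J_ncut, J_mmc = cut_objectives(W, [0, 1, 2], [3, 4, 5])
```

On this graph the values come out to J_Rcut = 2/3, J_Ncut = 2/7, J_MMC = 1/3.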
  • 13. Ratio Cut (Hagen & Kahng, 1992)
       Min similarity between A, B: s(A,B) = Σ_{i∈A} Σ_{j∈B} w_ij
       Size balance (Wei & Cheng, 1989): J_Rcut(A,B) = s(A,B)/|A| + s(A,B)/|B|
       Cluster membership indicator:
       q(i) = √(n2/(n1 n)) if i ∈ A,  q(i) = −√(n1/(n2 n)) if i ∈ B
       Normalization: q^T q = 1, q^T e = 0
       Substituting q leads to J_Rcut(q) = q^T (D − W) q
       Now relax q; the solution is given by the 2nd eigenvector of L
  • 14. Normalized Cut (Shi & Malik, 1997)
       Min similarity between A & B: s(A,B) = Σ_{i∈A} Σ_{j∈B} w_ij
       Balance by weights: J_Ncut(A,B) = s(A,B)/d_A + s(A,B)/d_B,  where d_A = Σ_{i∈A} d_i
       Cluster indicator:
       q(i) = √(d_B/(d_A d)) if i ∈ A,  q(i) = −√(d_A/(d_B d)) if i ∈ B,  where d = Σ_{i∈G} d_i
       Normalization: q^T D q = 1, q^T D e = 0
       Substituting q leads to J_Ncut(q) = q^T (D − W) q
       min_q q^T (D − W) q + λ (q^T D q − 1)
       Solution is the eigenvector of (D − W) q = λ D q
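The generalized eigenproblem (D − W)q = λDq can be handed directly to a generalized symmetric solver. A scipy sketch (the toy graph is an assumption for illustration):

```python
import numpy as np
from scipy.linalg import eigh

# Two triangles joined by one edge (toy graph, assumed for illustration)
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
D = np.diag(W.sum(axis=1))
L = D - W

# eigh(a, b) solves a v = lambda b v for symmetric a, positive-definite b
evals, evecs = eigh(L, D)
assert np.isclose(evals[0], 0.0)   # trivial solution q1 ~ e
q2 = evecs[:, 1]                   # Ncut solution: 2nd generalized eigenvector
s = np.sign(q2)                    # signs recover the two groups
```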
  • 15. MinMaxCut (Ding et al, 2001)
       Min similarity between A & B: s(A,B) = Σ_{i∈A} Σ_{j∈B} w_ij
       Max similarity within A & B: s(A,A) = Σ_{i∈A} Σ_{j∈A} w_ij
       J_MMC(A,B) = s(A,B)/s(A,A) + s(A,B)/s(B,B)
       Cluster indicator: q(i) = √(d_B/(d_A d)) if i ∈ A,  q(i) = −√(d_A/(d_B d)) if i ∈ B
       Substituting,
       J_MMC(q) = (1 + d_B/d_A)/(J_m + d_B/d_A) + (1 + d_A/d_B)/(J_m + d_A/d_B) − 2,
       where J_m = q^T W q / q^T D q
       Because dJ_MMC/dJ_m < 0,  min J_MMC ⇒ max J_m(q)
       ⇒ W q = ξ D q ⇒ (D − W) q = λ D q
  • 16. A simple example 2 dense clusters, with sparse connections between them. Adjacency matrix Eigenvector q2 Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California 16
  • 17. Comparison of Clustering Objectives
       • If clusters are well separated, all three give very similar and accurate results.
       • When clusters are marginally separated, NormCut and MinMaxCut give better results.
       • When clusters overlap significantly, MinMaxCut tends to give more compact and balanced clusters.
       J_Ncut = s(A,B)/[s(A,A) + s(A,B)] + s(A,B)/[s(B,B) + s(A,B)]
       Cluster compactness ⇒ max s(A,A)
  • 18. 2-way Clustering of Newsgroups
       Newsgroups pair                    RatioCut      NormCut       MinMaxCut
       Atheism / Comp.graphics            63.2 ± 16.2   97.2 ± 0.8    97.2 ± 1.1
       Baseball / Hockey                  54.9 ± 2.5    74.4 ± 20.4   79.5 ± 11.0
       Politics.mideast / Politics.misc   53.6 ± 3.1    57.5 ± 0.9    83.6 ± 2.5
  • 19. Cluster Balance Analysis I: Random Graph Model
       • Random graph: edges are randomly assigned with probability p, 0 ≤ p ≤ 1.
       • RatioCut & NormCut show no size dependence:
         J_Rcut(A,B) = p|A||B|/|A| + p|A||B|/|B| = np = constant
         J_Ncut(A,B) = p|A||B|/(p|A|(n−1)) + p|A||B|/(p|B|(n−1)) = n/(n−1) = constant
       • MinMaxCut favors balanced clusters, |A| = |B|:
         J_MMC(A,B) = p|A||B|/(p|A|(|A|−1)) + p|A||B|/(p|B|(|B|−1)) = |B|/(|A|−1) + |A|/(|B|−1)
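The algebra on this slide can be checked numerically on the expected adjacency matrix of the random graph model, where every off-diagonal weight equals p so that s(A,B) = p|A||B| holds exactly (the sizes n = 12, p = 0.3 are illustrative assumptions):

```python
import numpy as np

n, p = 12, 0.3
W = p * (np.ones((n, n)) - np.eye(n))   # expected adjacency of the random graph
d = W.sum(axis=1)

def objectives(A, B):
    s_AB = W[np.ix_(A, B)].sum()
    s_AA = W[np.ix_(A, A)].sum()
    s_BB = W[np.ix_(B, B)].sum()
    return (s_AB / len(A) + s_AB / len(B),          # Ratio Cut
            s_AB / d[A].sum() + s_AB / d[B].sum(),  # Normalized Cut
            s_AB / s_AA + s_AB / s_BB)              # Min-Max-Cut

mmc = {}
for k in (2, 4, 6):
    A, B = list(range(k)), list(range(k, n))
    J_rcut, J_ncut, J_mmc = objectives(A, B)
    assert np.isclose(J_rcut, n * p)        # constant: no size dependence
    assert np.isclose(J_ncut, n / (n - 1))  # constant: no size dependence
    mmc[k] = J_mmc

# MinMaxCut alone favors the balanced split |A| = |B| = n/2
assert mmc[6] < mmc[4] < mmc[2]
```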
  • 20. 2-way Clustering of Newsgroups Cluster Balance Eigenvector JNcut(i) JMMC(i) Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California 20
  • 21. Cluster Balance Analysis II: Large Overlap Case
       f = s(A,B) / ((1/2)[s(A,A) + s(B,B)]) > 0.5
       Conditions for skewed cuts:
       NormCut:    s(A,A) ≥ (1/(2f) − 1/2) s(A,B) = s(A,B)/2
       MinMaxCut:  s(A,A) ≥ (1/(2f)) s(A,B) = s(A,B)
       Thus MinMaxCut is much less prone to skewed cuts.
  • 22. Spectral Clustering of Bipartite Graphs
       Simultaneous clustering of rows and columns of a contingency table (adjacency matrix B)
       Examples of bipartite graphs:
       • Information retrieval: word-by-document matrix
       • Market basket data: transaction-by-item matrix
       • DNA gene expression profiles
       • Protein vs protein-complex
  • 23. Spectral Clustering of Bipartite Graphs
       Simultaneous clustering of rows and columns (adjacency matrix B)
       s(B_{R1,C2}) = Σ_{r_i∈R1} Σ_{c_j∈C2} b_ij
       min between-cluster cut weights: s(R1,C2), s(R2,C1)
       max within-cluster weights: s(R1,C1), s(R2,C2)
       J_MMC(C1,C2; R1,R2) = [s(B_{R1,C2}) + s(B_{R2,C1})] / (2 s(B_{R1,C1}))
                           + [s(B_{R1,C2}) + s(B_{R2,C1})] / (2 s(B_{R2,C2}))
       (Ding, AI-STAT 2003)
  • 24. Bipartite Graph Clustering
       Clustering indicators for rows and columns:
       f(i) = 1 if r_i ∈ R1, f(i) = −1 if r_i ∈ R2;  g(i) = 1 if c_i ∈ C1, g(i) = −1 if c_i ∈ C2
       B = [B_{R1,C1}  B_{R1,C2}; B_{R2,C1}  B_{R2,C2}],  W = [0  B; B^T  0],  q = (f; g)
       Substitute and obtain: J_MMC(C1,C2; R1,R2) = s(W12)/s(W11) + s(W12)/s(W22)
       f, g are determined by
       ([D_r  0; 0  D_c] − [0  B; B^T  0]) (f; g) = λ [D_r  0; 0  D_c] (f; g)
  • 25. Clustering of Bipartite Graphs
       Let B~ = D_r^{−1/2} B D_c^{−1/2},  z = (u; v) = D^{1/2} q = (D_r^{1/2} f; D_c^{1/2} g)
       We obtain [0  B~; B~^T  0] (u; v) = λ (u; v)
       Solution is the SVD: B~ = Σ_{k=1}^{m} u_k λ_k v_k^T
       (Zha et al, 2001; Dhillon, 2001)
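The SVD route can be sketched end to end in numpy. The 4×4 contingency table below, with an obvious 2×2 block structure, is a hypothetical example, not from the tutorial:

```python
import numpy as np

# Hypothetical 4x4 contingency table with a clear 2x2 block structure
B = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [0., 1., 5., 4.],
              [1., 0., 4., 5.]])
dr, dc = B.sum(axis=1), B.sum(axis=0)   # row and column degrees

# B~ = Dr^{-1/2} B Dc^{-1/2}
Bt = B / np.sqrt(dr)[:, None] / np.sqrt(dc)[None, :]
U, sv, Vt = np.linalg.svd(Bt)

# z = (u; v) = D^{1/2} q, so f = Dr^{-1/2} u and g = Dc^{-1/2} v
f2 = U[:, 1] / np.sqrt(dr)    # row indicator from the 2nd left singular vector
g2 = Vt[1] / np.sqrt(dc)      # column indicator from the 2nd right singular vector

rows, cols = np.sign(f2), np.sign(g2)
# signs recover the row blocks {0,1} vs {2,3} and the matching column blocks
```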
  • 26. Clustering of Bipartite Graphs
       Recovering row clusters: R1 = {r_i | f2(i) < z_r}, R2 = {r_i | f2(i) ≥ z_r}
       Recovering column clusters: C1 = {c_i | g2(i) < z_c}, C2 = {c_i | g2(i) ≥ z_c}
       z_r = z_c = 0 are the natural dividing points; the relaxation is invariant up to a constant shift.
       Algorithm: search for optimal cut points i_cut, j_cut, setting z_r = f2(i_cut), z_c = g2(j_cut),
       such that J_MMC(C1,C2; R1,R2) is minimized (Zha et al, 2001).
  • 27. Clustering of Directed Graphs
       Min directed edge weights between A & B: s(A,B) = Σ_{i∈A} Σ_{j∈B} (w_ij + w_ji)
       Max directed edge weights within A & B: s(A,A) = Σ_{i∈A} Σ_{j∈A} (w_ij + w_ji)
       • Equivalent to dealing with W~ = W + W^T
       • All spectral methods apply to W~
       • For example, web graphs are clustered this way (He, Ding, Zha, Simon, ICDM 2001)
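The reduction to the undirected case is a one-liner; a small sketch with a random asymmetric weight matrix (an assumption for illustration):

```python
import numpy as np

# Directed toy graph (assumed): random asymmetric nonnegative weights
rng = np.random.default_rng(1)
W = rng.random((6, 6)) * (rng.random((6, 6)) < 0.5)
np.fill_diagonal(W, 0.0)

Wt = W + W.T                     # W~ = W + W^T
assert np.allclose(Wt, Wt.T)     # symmetric: all spectral machinery applies

# s(A,B) on the slide equals the ordinary (undirected) cut weight of W~
A, B = [0, 1, 2], [3, 4, 5]
s_AB = sum(W[i, j] + W[j, i] for i in A for j in B)
assert np.isclose(s_AB, Wt[np.ix_(A, B)].sum())

L = np.diag(Wt.sum(axis=1)) - Wt
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)   # a valid graph Laplacian
```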
  • 28. K-way Spectral Clustering K≥2 Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California 28
  • 29. K-way Clustering Objectives
       • Ratio Cut:
         J_Rcut(C1, …, CK) = Σ_{k<l} [s(Ck,Cl)/|Ck| + s(Ck,Cl)/|Cl|] = Σ_k s(Ck, G−Ck)/|Ck|
       • Normalized Cut:
         J_Ncut(C1, …, CK) = Σ_{k<l} [s(Ck,Cl)/d_k + s(Ck,Cl)/d_l] = Σ_k s(Ck, G−Ck)/d_k
       • Min-Max-Cut:
         J_MMC(C1, …, CK) = Σ_{k<l} [s(Ck,Cl)/s(Ck,Ck) + s(Ck,Cl)/s(Cl,Cl)] = Σ_k s(Ck, G−Ck)/s(Ck,Ck)
  • 30. K-way Spectral Relaxation
       Prove that the solution lies in the subspace spanned by the first k eigenvectors:
       • Ratio Cut
       • Normalized Cut
       • Min-Max-Cut
  • 31. K-way Spectral Relaxation
       Unsigned cluster indicators:
       h1 = (1…1, 0…0, …, 0…0)^T
       h2 = (0…0, 1…1, …, 0…0)^T
       …
       hk = (0…0, 0…0, …, 1…1)^T
       Re-write:
       J_Rcut(h1, …, hk) = h1^T(D−W)h1 / h1^T h1 + … + hk^T(D−W)hk / hk^T hk
       J_Ncut(h1, …, hk) = h1^T(D−W)h1 / h1^T D h1 + … + hk^T(D−W)hk / hk^T D hk
       J_MMC(h1, …, hk) = h1^T(D−W)h1 / h1^T W h1 + … + hk^T(D−W)hk / hk^T W hk
  • 32. K-way Ratio Cut Spectral Relaxation
       Unsigned cluster indicators: x_k = (0…0, 1…1, 0…0)^T / n_k^{1/2}
       Re-write: J_Rcut(x1, …, xk) = x1^T(D−W)x1 + … + xk^T(D−W)xk = Tr(X^T(D−W)X),  X = (x1, …, xk)
       Optimize: min_X Tr(X^T(D−W)X), subject to X^T X = I
       By Ky Fan's theorem, the optimal solution is given by the eigenvectors X = (v1, v2, …, vk), (D−W)v_k = λ_k v_k,
       with lower bound λ1 + … + λk ≤ min J_Rcut(x1, …, xk)
       (Chan, Schlag, Zien, 1994)
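The Ky Fan lower bound can be checked numerically: for any discrete k-way partition, the sum of the k smallest Laplacian eigenvalues bounds J_Rcut from below. A sketch (the random graph and the particular partition are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((9, 9))
W = np.triu(A, 1)
W = W + W.T                           # random symmetric nonnegative weights
L = np.diag(W.sum(axis=1)) - W
evals = np.linalg.eigvalsh(L)

k = 3
parts = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
X = np.zeros((9, k))
for c, idx in enumerate(parts):
    X[idx, c] = 1.0 / np.sqrt(len(idx))   # x_k = (0..0,1..1,0..0)^T / n_k^{1/2}
assert np.allclose(X.T @ X, np.eye(k))    # the constraint X^T X = I holds

J_rcut = np.trace(X.T @ L @ X)            # = Tr(X^T (D - W) X)
assert evals[:k].sum() <= J_rcut + 1e-10  # Ky Fan lower bound
```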
  • 33. K-way Normalized Cut Spectral Relaxation
       Unsigned cluster indicators: y_k = D^{1/2} h_k / ||D^{1/2} h_k||
       Re-write: J_Ncut(y1, …, yk) = y1^T(I−W~)y1 + … + yk^T(I−W~)yk = Tr(Y^T(I−W~)Y),  W~ = D^{−1/2} W D^{−1/2}
       Optimize: min_Y Tr(Y^T(I−W~)Y), subject to Y^T Y = I
       By Ky Fan's theorem, the optimal solution is given by the eigenvectors Y = (v1, v2, …, vk), (I−W~)v_k = λ_k v_k,
       equivalently (D−W)u_k = λ_k D u_k with u_k = D^{−1/2} v_k
       λ1 + … + λk ≤ min J_Ncut(y1, …, yk)
       (Gu et al, 2001)
  • 34. K-way Min-Max Cut Spectral Relaxation
       Unsigned cluster indicators: y_k = D^{1/2} h_k / ||D^{1/2} h_k||,  W~ = D^{−1/2} W D^{−1/2}
       Re-write: J_MMC(y1, …, yk) = 1/(y1^T W~ y1) + … + 1/(yk^T W~ yk) − k
       Optimize: min_Y J_MMC(Y), subject to Y^T Y = I, y_k^T W~ y_k > 0
       Theorem: the optimal solution is given by the eigenvectors Y = (v1, v2, …, vk), W~ v_k = λ_k v_k,
       with lower bound k²/(λ1 + … + λk) − k ≤ min J_MMC(y1, …, yk)
       (Gu et al, 2001)
  • 35. K-way Spectral Clustering
       • Embedding (similar to the PCA subspace approach)
         – Embed data points in the subspace of the K eigenvectors
         – Cluster the embedded points using another algorithm, such as K-means (Shi & Malik; Ng et al; Zha et al)
       • Recursive 2-way clustering (standard graph partitioning)
         – If the desired K is not a power of 2, how to optimally choose the next sub-cluster to split? (Ding et al, 2002)
       • Neither approach above uses the K-way clustering objective functions directly.
       • Refining the obtained clusters using the K-way clustering objective function typically improves the results (Ding et al, 2002).
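The embed-then-cluster pipeline can be sketched in a few lines of numpy. The three-clique toy graph, the tiny Lloyd's K-means, and the deterministic farthest-first seeding are all illustrative assumptions, not the tutorial's specific procedure:

```python
import numpy as np

# Three 3-node cliques with weak links between them (toy data, assumed)
W = np.zeros((9, 9))
for lo in (0, 3, 6):
    W[lo:lo + 3, lo:lo + 3] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = W[5, 6] = W[6, 5] = 0.1

L = np.diag(W.sum(axis=1)) - W
k = 3
Y = np.linalg.eigh(L)[1][:, :k]   # embed nodes in the first k eigenvectors

# Tiny Lloyd's K-means with farthest-first seeding (deterministic sketch)
centers = [0]
dist = ((Y - Y[0]) ** 2).sum(axis=1)
for _ in range(k - 1):
    centers.append(int(np.argmax(dist)))
    dist = np.minimum(dist, ((Y - Y[centers[-1]]) ** 2).sum(axis=1))
C = Y[centers]
for _ in range(20):
    labels = np.argmin(((Y[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([Y[labels == c].mean(axis=0) for c in range(k)])
# each clique should come out as one cluster
```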
  • 36. DNA Gene Expression: Lymphoma Cancer (Alizadeh et al, 2000)
       Effects of feature selection: select 900 genes out of 4025 genes
       (Figure: genes × tissue samples expression matrix)
  • 37. Lymphoma Cancer Tissue Samples
       B-cell lymphoma goes through different stages:
       – 3 cancer stages
       – 3 normal stages
       Key question: can we detect them automatically?
       (Figure: PCA 2D display)
  • 38. Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California 38
  • 39. Brief summary of Part I
       • Spectral graph partitioning as the origin
       • Clustering objective functions and solutions
       • Extensions to bipartite and directed graphs
       • Characteristics
         – Principled approach
         – Well-motivated objective functions
         – Clear, unambiguous
         – A framework of rich structures and contents
         – Everything is proved rigorously (within the relaxation framework, i.e., using continuous approximation of the discrete variables)
       • Above results mostly done by 2001
       • More to come in Part II