Principal component analysis and matrix factorizations for learning (part 2) ding - icml 2005 tutorial - 2005

Part 2. Spectral Clustering from
Matrix Perspective

A brief tutorial emphasizing recent developments

(More detailed tutorial is given in ICML’04 )

PCA & Matrix Factorizations for Learning, ICML 2005 Tutorial, Chris Ding 56

From PCA to spectral clustering
using generalized eigenvectors
Consider the kernel matrix: Wij = φ ( xi ),φ ( x j )

In Kernel PCA we compute eigenvector: Wv = λv

Generalized Eigenvector: Wq = λDq

D = diag (d1,L, dn ) di = ∑w j ij

This leads to Spectral Clustering !

Indicator Matrix Quadratic Clustering
Framework

Unsigned Cluster indicator Matrix H=(h1, …, hK)
Kernel K-means clustering:

max Tr( H T WH ), s.t. H T H = I , H ≥ 0
H

K-means: W = XT X; Kernel K-means W = (< φ ( xi ),φ ( x j ) >)

Spectral clustering (normalized cut)

max Tr( H T WH ), s.t. H T DH = I , H ≥ 0
H


Brief Introduction to Spectral Clustering
(Laplacian matrix based clustering)


Some historical notes
• Fiedler, 1973, 1975, graph Laplacian matrix
• Donath & Hoffman, 1973, bounds
• Hall, 1970, Quadratic Placement (embedding)
• Pothen, Simon, Liou, 1990, Spectral graph
partitioning (many related papers there after)
• Hagen & Kahng, 1992, Ratio-cut
• Chan, Schlag & Zien, multi-way Ratio-cut
• Chung, 1997, Spectral graph theory book
• Shi & Malik, 2000, Normalized Cut

Spectral Gold-Rush of 2001
9 papers on spectral clustering

• Meila & Shi, AI-Stat 2001. Random Walk interpreation of
Normalized Cut
• Ding, He & Zha, KDD 2001. Perturbation analysis of Laplacian
matrix on sparsely connected graphs
• Ng, Jordan & Weiss, NIPS 2001, K-means algorithm on the
embeded eigen-space
• Belkin & Niyogi, NIPS 2001. Spectral Embedding
• Dhillon, KDD 2001, Bipartite graph clustering
• Zha et al, CIKM 2001, Bipartite graph clustering
• Zha et al, NIPS 2001. Spectral Relaxation of K-means
• Ding et al, ICDM 2001. MinMaxCut, Uniqueness of relaxation.
• Gu et al, K-way Relaxation of NormCut and MinMaxCut

Spectral Clustering
min cutsize , without explicit size constraints
But where to cut ?

Need to balance sizes

Graph Clustering
min between-cluster similarities (weights)
sim(A,B) = ∑∑ wij
i∈ A j∈B

Balance weight
Balance size
Balance volume

sim(A,A) = ∑∑ wij
i∈ A j∈ A

max within-cluster similarities (weights)

Clustering Objective Functions
s(A,B) = ∑∑ w ij

• Ratio Cut i∈ A j∈B
s(A,B) s(A,B)
J Rcut (A,B) = +
|A| |B|
• Normalized Cut dA = ∑d i
i∈A
s ( A, B) s ( A, B)
J Ncut ( A, B) = +
dA dB
s ( A, B) s ( A, B)
= +
s ( A, A) + s ( A, B) s(B, B) + s ( A, B)
• Min-Max-Cut
s(A,B) s(A,B)
J MMC(A,B) = +
s(A,A) s(B,B)

Normalized Cut (Shi & Malik, 2000)

Min similarity between A & B: s(A,B) = ∑ ∑ wij
i∈ A j∈B
Balance weights s ( A, B) s ( A, B)
J Ncut ( A, B) =
dA
+
dB dA = ∑d
i∈A
i

⎧ d B / d Ad
⎪ if i ∈ A
Cluster indicator: q (i ) = ⎨
⎪− d A / d B d
⎩ if i ∈ B d= ∑d
i∈G
i

Normalization: q Dq = 1, q De = 0
T T

Substitute q leads to J Ncut (q) = q T ( D − W )q

min q q T ( D − W )q + λ (q T Dq − 1)
Solution is eigenvector of ( D − W )q = λDq

A simple example
2 dense clusters, with sparse connections
between them.
Adjacency matrix Eigenvector q2


K-way Spectral Clustering
K≥2


K-way Clustering Objectives

• Ratio Cut
⎛ s (C k ,Cl ) s (C k ,Cl ) ⎞ s (C k ,G − C k )
J Rcut (C1 , L , C K ) = ∑ ⎜
⎜ |C | + |C | ⎟ =
< k ,l > ⎝ k l
⎟
⎠
∑
k
|C k|

• Normalized Cut
J Ncut (C1 , L , C K ) =
< k ,l >
∑⎜
⎜ d
⎝ k
+
dl
⎟=
⎟
⎠
∑
k
dk

• Min-Max-Cut
J MMC (C1 , L , C K ) = ⎜
< k ,l > ⎝
∑ k k l l ⎠
⎟
⎜ s (C , C ) + s (C , C ) ⎟ = ∑ k
s (C k , C k )


K-way Spectral Relaxation
h1 = (1L1,0 L 0,0 L 0)T
Unsigned cluster indicators:
h2 = (0L 0,1L1,0 L 0)T
LLL
Re-write: hk = (0 L 0,0L 0,1L1)T

h1 ( D − W )h1
T
hk ( D − W )hk
T
J Rcut (h1 , L, hk ) = T
+L+ T
h1 h1 hk hk

h1 ( D − W )h1
T
hk ( D − W )hk
T
J Ncut (h1 , L, hk ) = T
+L+ T
h1 Dh1 hk Dhk
h1 ( D − W )h1
T
hk ( D − W )hk
T
J MMC (h1 , L , hk ) = T
+L+ T
h1 Wh1 hk Whk


K-way Normalized Cut Spectral Relaxation
Unsigned cluster indicators: nk
}
y k = D1/ 2 (0 L 0,1L1,0L 0)T / || D1/ 2 hk ||
Re-write: ~ ~
J Ncut ( y1 , L , y k ) = T
y1 ( I − W ) y1 + L + y k ( I − W ) y k
T

~ ~
= Tr (Y T ( I − W )Y ) W = D −1/ 2WD −1/ 2
~
Optimize : min Tr (Y ( I − W )Y ), subject to Y T Y = I
T
Y
By K. Fan’s theorem, optimal solution is
~
eigenvectors: Y=(v1,v2, …, vk), ( I − W )vk = λk vk
( D − W )u k = λk Du k , u k = D −1/ 2 vk

λ1 + L + λk ≤ min J Ncut ( y1 ,L , y k ) (Gu, et al, 2001)


K-way Spectral Clustering is difficult
• Spectral clustering is best applied to 2-way
clustering
– positive entries for one cluster
– negative entries for another cluster
• For K-way (K>2) clustering
– Positive and negative signs make cluster
assignment difficult
– Recursive 2-way clustering
– Low-dimension embedding. Project the data to
eigenvector subspace; use another clustering
method such as K-means to cluster the data (Ng
et al; Zha et al; Back & Jordan, etc)
– Linearized cluster assignment using spectral ordering and
cluster crossing

Scaled PCA: a Unified Framework
for clustering and ordering

• Scaled PCA has two optimality properties
– Distance sensitive ordering
– Min-max principle Clustering
• SPCA on contingency table ⇒ Correspondence Analysis
– Simultaneous ordering of rows and columns
– Simultaneous clustering of rows and columns


Scaled PCA
similarity matrix S=(sij) (generated from XXT)
D = diag(d1 ,L, d n ) di = si.
~ −1 −1 ~
Nonlinear re-scaling: S = D SD , sij = sij /(si.s j. )
2 2 1/ 2

~
Apply SVD on S⇒
~ 1 ⎡ T⎤
S = D S D = D ∑ zk λk z k D = D ⎢∑ qk λk qk ⎥ D
1
1 1
2 2
2 T 2

k ⎣k ⎦
qk = D-1/2 zk is the scaled principal component
Subtract trivial component λ = 1, z = d 1/ 2 /s.., q =1
0 0 0

⇒ S − dd T /s.. = D ∑ qk λk qT D
k
k =1 (Ding, et al, 2002)

Scaled PCA on a Rectangle Matrix
⇒ Correspondence Analysis
~ −1 −1 ~
Nonlinear re-scaling: P = D 2 PD 2 , p = p /( p p )1/ 2
r c ij ij i. j.

~
Apply SVD on P Subtract trivial component

P − rc / p.. = Dr ∑ f k λk g Dc
T T r = ( p1.,L, pn. )
T
k
k =1
−1 −1
c = ( p.1,L, p.n ) T

fk = D u , gk = D v
r
2
k
2
c k
are the scaled row and column principal
component (standard coordinates in CA)

Correspondence Analysis (CA)
• Mainly used in graphical display of data
• Popular in France (Benzécri, 1969)
• Long history
– Simultaneous row and column regression (Hirschfeld,
1935)
– Reciprocal averaging (Richardson & Kuder, 1933;
Horst, 1935; Fisher, 1940; Hill, 1974)
– Canonical correlations, dual scaling, etc.
• Formulation is a bit complicated (“convoluted”
Jolliffe, 2002, p.342)
• “A neglected method”, (Hill, 1974)


Clustering of Bipartite Graphs (rectangle matrix)

Simultaneous clustering of rows and columns
of a contingency table (adjacency matrix B )

Examples of bipartite graphs
• Information Retrieval: word-by-document matrix
• Market basket data: transaction-by-item matrix
• DNA Gene expression profiles
• Protein vs protein-complex

Bipartite Graph Clustering
Clustering indicators for rows and columns:
⎧ 1 if ri ∈ R1 ⎧ 1 if ci ∈ C1
f (i ) = ⎨ g (i ) = ⎨
⎩− 1 if ri ∈ R2 ⎩− 1 if ci ∈ C2

⎛ BR1 ,C1 BR1 ,C2 ⎞ ⎛ 0 B⎞ ⎛f ⎞
B=⎜ ⎟ W =⎜ T ⎟ q=⎜ ⎟
⎜g⎟
⎜ BR ,C BR2 ,C2 ⎟ ⎜B 0⎟ ⎝ ⎠
⎝ 2 1 ⎠ ⎝ ⎠
Substitute and obtain
s (W12 ) s (W12 )
J MMC (C1 , C 2 ; R1 , R2 ) = +
s (W11 ) s (W22 )
f,g are determined by
⎡⎛ Dr ⎞ ⎛ 0 B ⎞⎤⎛ f ⎞ ⎛ Dr ⎞⎛ f ⎞
⎜
⎢⎜ ⎟−⎜ T ⎟⎥⎜ ⎟ = λ ⎜ ⎟⎜ ⎟
⎢⎝
⎣ Dc ⎟ ⎜ B
⎠ ⎝ 0 ⎟⎥⎜ g ⎟
⎠⎦⎝ ⎠
⎜
⎝ Dc ⎟⎜ g ⎟
⎠⎝ ⎠

Spectral Clustering of Bipartite Graphs
Simultaneous clustering of rows and columns
(adjacency matrix B )
s ( BR1 ,C2 ) = ∑ ∑b
ri ∈R1c j ∈C 2
ij

min between-cluster sum of
xyz weights: s(R1,C2), s(R2,C1)
max within-cluster sum of xyz
cut xyz weights: s(R1,C1), s(R2,C2)

s ( BR1 ,C2 ) + s ( B R2 ,C1 ) s ( B R1 ,C2 ) + s ( B R2 ,C1 )
J MMC (C1 , C 2 ; R1 , R2 ) = +
2 s ( B R1 ,C1 ) 2 s ( B R2 ,C2 )
(Ding, AI-STAT 2003)

Internet Newsgroups

Simultaneous clustering
of documents and words


Embedding in Principal Subspace

Cluster Self-Aggregation
(proved in perturbation analysis)

(Hall, 1970, “quadratic placement” (embedding) a graph)


Spectral Embedding: Self-aggregation

• Compute K eigenvectors of the Laplacian.
• Embed objects in the K-dim eigenspace

(Ding, 2004)

Spectral embedding is not
topology preserving

700 3-D data points form
2 interlock rings

In eigenspace, they
shrink and separate


Spectral Embedding

Simplex Embedding Theorem.
Objects self-aggregate to K centroids
Centroids locate on K corners of a simplex
• Simplex consists K basis vectors + coordinate origin
• Simplex is rotated by an orthogonal transformation T
•T are determined by perturbation analysis

(Ding, 2004)

Perturbation Analysis
Wq = λDq Wˆ z = ( D −1 / 2WD −1 / 2 ) z = λz q = D −1 / 2 z

Assume data has 3 dense clusters sparsely connected.

C2

⎡W W W ⎤
11 12 13 C1

W = ⎢ 21 W22 W23⎥
⎢W ⎥
⎢ 31 W32 W33⎥
⎣W ⎦ C3

Off-diagonal blocks are between-cluster connections,
assumed small and are treated as a perturbation
(Ding et al, KDD’01)

Spectral Perturbation Theorem

Orthogonal Transform Matrix T = (t1 ,L , t K )

T are determined by: Γt k = λ k t k
−1 −1
Spectral Perturbation Matrix Γ=Ω 2
ΓΩ 2

⎡ h11 − s12 L − s1K ⎤ s pq = s (C p , Cq )
⎢− s L − s2 K ⎥
Γ = ⎢ 21
⎢ M
h22
M L M ⎥
⎥ hkk = ∑ s
p| p ≠ k kp
⎢ ⎥
⎣− s K 1 − s K 2 L hKK ⎦ Ω = diag[ ρ (C1 ),L, ρ (Ck )]


Connectivity Network
⎧ 1 if i, j belong to same cluster
Cij = ⎨
⎩ 0 otherwise

K
Scaled PCA provides C≅D ∑k =1
qk λk qT D
k

K

∑
1
Green’s function : C ≈G = qk qT
k =2
1 − λk k
K
Projection matrix: C≈P≡ ∑k =1
qk qT
k

PCA & Matrix Factorizations for Learning, ICML 2005 Tutorial, Chris Ding
(Ding et al, 2002)
86

Similarity matrix W 1st order Perturbation: Example 1

1st order
solution
Connectivity

λ2 = 0.300, λ2 = 0.268
matrix

Between-cluster connections suppressed
Within-cluster connections enhanced
Effects of self-aggregation

Optimality Properties of Scaled PCA
Scaled principal components have optimality properties:
Ordering
– Adjacent objects along the order are similar
– Far-away objects along the order are dissimilar
– Optimal solution for the permutation index are given by
scaled PCA.

Clustering
– Maximize within-cluster similarity
– Minimize between-cluster similarity
– Optimal solution for cluster membership indicators given
by scaled PCA.


Spectral Graph Ordering
(Barnard, Pothen, Simon, 1993), envelop reduction of sparse
matrix: find ordering such that the envelop is minimized

min ∑ max j | i − j | wij ⇒ min ∑ ( xi − x j ) wij 2

i ij

(Hall, 1970), “quadratic placement of a graph”:
Find coordinate x to minimize
J= ∑ ij
( xi − x j ) 2 wij = x T ( D − W ) x

Solution are eigenvectors of Laplacian

Distance Sensitive Ordering
Given a graph. Find an optimal Ordering of the nodes.
π permutation indexes
J (π ) = ∑
d
n−d
i =1 π i ,π i + d
π (1,L, n) = (π 1 ,L, π n )
w

∩∩
∩∩∩∩∩∩∩∩
wπ1 ,π 3
J d =2 (π ) :
∪∪∪∪∪∪∪∪
min J (π ) = ∑ n −1
d =1 d J d (π )
2
π
The larger distance, the larger weights, panelity.

J (π ) = ∑ (i − j ) wπ i ,π j = ∑ (i − j ) wπ i ,π j
2 2

ij π i ,π j

= ∑ (π − π ) wi , j
i
−1 −1 2
j
ij
n2 π i−1 −( n +1) / 2 π −1 −( n +1) / 2 2
=
8 ij
∑( n/2 − j
n/2 ) wi , j
Define: shifted and rescaled inverse permutation indexes
π i−1 − (n + 1) /2 1− n 3 − n n −1
qi = ={ , ,L, }
n /2 n n n
J (π ) = n2
8 ∑ (qi − q j ) wij = q ( D − W )q
2 n2
4
T

ij

Once q2 is computed, since

q2 (i ) < q2 ( j ) ⇒ π i
−1
<π −1
j

π i
−1
can be uniquely recovered from q2

Implementation: sort q2 induces π


Re-ordering of Genes and Tissues

J (π )
r=
J (random)

r = 0.18

J d =1 (π )
rd =1=
J d =1 ( random )

rd =1 = 3.39


Spectral clustering vs Spectral ordering
• Continuous approximation of both integer
programming problems are given by the same
eigenvector
• Different problems could have the same
continuous approximate solution.
• Quality of the approximation:
Ordering: better quality: the solution relax
from a set of evenly spaced discrete values
Clustering: less better quality: solution relax
from 2 discrete values

Linearized Cluster Assignment

Turn spectral clustering to 1D clustering problem

• Spectral ordering on connectivity network
• Cluster crossing
– Sum of similarities along anti-diagonal
– Gives 1-D curve with valleys and peaks
– Divide valleys and peaks into clusters


Cluster overlap and crossing
Given similarity W, and clusters A,B.
• Cluster overlap s(A,B) = ∑∑ w
i∈ A j∈B
ij

• Cluster crossing compute a smaller fraction of cluster
overlap.
• Cluster crossing depends on an ordering o. It sums
weights cross the site i along the order
m
ρ (i ) = ∑ wo (i− j ),o (i+ j )
j =1

• This is a sum along anti-diagonals of W.

cluster crossing


K-way Clustering Experiments

Accuracy of clustering results:

Method Linearized Recursive 2-way Embedding
Assignment clustering + K-means
Data A 89.0% 82.8% 75.1%
Data B 75.7% 67.2% 56.4%


Some Additional
Advanced/related Topics

• Random talks and normalized cut
• Semi-definite programming
• Sub-sampling in spectral clustering
• Extending to semi-supervised classification
• Green’s function approach
• Out-of-sample embeding


Principal component analysis and matrix factorizations for learning (part 2) ding - icml 2005 tutorial - 2005

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Principal component analysis and matrix factorizations for learning (part 2) ding - icml 2005 tutorial - 2005

Similar to Principal component analysis and matrix factorizations for learning (part 2) ding - icml 2005 tutorial - 2005 (20)

More from zukun

More from zukun (20)

Recently uploaded

Recently uploaded (20)

Principal component analysis and matrix factorizations for learning (part 2) ding - icml 2005 tutorial - 2005