ICML 2004 Tutorial on Spectral Clustering, Part I
Presentation Transcript

    • A Tutorial on Spectral Clustering
      Chris Ding, Computational Research Division, Lawrence Berkeley National Laboratory, University of California
      Supported by the Office of Science, U.S. Dept. of Energy
    • Some historical notes
      • Fiedler, 1973, 1975: graph Laplacian matrix
      • Donath & Hoffman, 1973: bounds
      • Pothen, Simon & Liou, 1990: spectral graph partitioning (many related papers thereafter)
      • Hagen & Kahng, 1992: Ratio-cut
      • Chan, Schlag & Zien, 1994: multi-way Ratio-cut
      • Chung, 1997: Spectral Graph Theory book
      • Shi & Malik, 2000: Normalized Cut
    • Spectral Gold-Rush of 2001: 9 papers on spectral clustering
      • Meila & Shi, AI-Stat 2001: random-walk interpretation of Normalized Cut
      • Ding, He & Zha, KDD 2001: perturbation analysis of the Laplacian matrix on sparsely connected graphs
      • Ng, Jordan & Weiss, NIPS 2001: K-means algorithm on the embedded eigenspace
      • Belkin & Niyogi, NIPS 2001: spectral embedding
      • Dhillon, KDD 2001: bipartite graph clustering
      • Zha et al., CIKM 2001: bipartite graph clustering
      • Zha et al., NIPS 2001: spectral relaxation of K-means
      • Ding et al., ICDM 2001: MinMaxCut, uniqueness of the relaxation
      • Gu et al., 2001: K-way relaxation of NormCut and MinMaxCut
    • Part I: Basic Theory, 1973 – 2001
    • Spectral Graph Partitioning
      MinCut: minimize the cutsize (the number of cut edges), subject to the size constraint |A| = |B|.
    • 2-way Spectral Graph Partitioning
      Partition membership indicator:
      $$q_i = \begin{cases} +1 & \text{if } i \in A \\ -1 & \text{if } i \in B \end{cases}$$
      $$J = \text{CutSize} = \frac{1}{4}\sum_{i,j} w_{ij}(q_i - q_j)^2 = \frac{1}{4}\sum_{i,j} w_{ij}(q_i^2 + q_j^2 - 2 q_i q_j) = \frac{1}{2}\sum_{i,j} q_i (d_i \delta_{ij} - w_{ij}) q_j = \frac{1}{2}\, q^T (D - W)\, q$$
      Relaxing the indicators $q_i$ from discrete to continuous values, the solution minimizing $J(q)$ is given by the eigenvectors of
      $$(D - W)\, q = \lambda q$$
      (Fiedler, 1973, 1975; Pothen, Simon & Liou, 1990)
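The relaxed minimizer is straightforward to compute numerically. Below is a minimal sketch, assuming a small hypothetical adjacency matrix (the graph is illustrative, not from the slides), that builds $L = D - W$ and splits nodes on the sign of the second eigenvector:

```python
import numpy as np

# Hypothetical symmetric adjacency matrix of a small undirected graph:
# two triangles joined by a single edge.
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))           # degree matrix
L = D - W                            # graph Laplacian

# eigh returns eigenpairs in ascending order of eigenvalue.
eigvals, eigvecs = np.linalg.eigh(L)
q2 = eigvecs[:, 1]                   # second eigenvector (Fiedler vector)

A = np.where(q2 < 0)[0]              # relaxed 2-way partition by sign
B = np.where(q2 >= 0)[0]
print("lambda_2 =", eigvals[1], "A =", A, "B =", B)
```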
    • Properties of the Graph Laplacian
      Laplacian matrix of the graph: $L = D - W$
      • $L$ is positive semi-definite: $x^T L x \geq 0$ for any $x$.
      • The first eigenvector is $q_1 = (1, \ldots, 1)^T = e$, with $\lambda_1 = 0$.
      • The second eigenvector $q_2$ is the desired solution.
      • The smaller $\lambda_2$, the better the quality of the partitioning. Perturbation analysis gives
      $$\lambda_2 = \frac{\text{cutsize}}{|A|} + \frac{\text{cutsize}}{|B|}$$
      • Higher eigenvectors are also useful.
    • Recovering Partitions
      From the definition of the cluster indicators, partitions A, B are determined by
      $$A = \{\, i \mid q_2(i) < 0 \,\}, \qquad B = \{\, i \mid q_2(i) \geq 0 \,\}$$
      However, the objective function $J(q)$ is insensitive to an additive constant $c$:
      $$J = \text{CutSize} = \frac{1}{4}\sum_{i,j} w_{ij}\,[(q_i + c) - (q_j + c)]^2$$
      Thus, we sort $q_2$ into increasing order and cut at the middle point, as in the sketch below.
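A sketch of that recovery step, continuing from the previous snippet (W and q2 are assumed from there); cutting at the midpoint of the sorted Fiedler vector enforces |A| = |B| regardless of any additive shift:

```python
# Continuing from the previous sketch: W and q2 are assumed defined.
order = np.argsort(q2)               # nodes sorted by Fiedler-vector value
n = len(q2)
A = order[: n // 2]                  # first half of the sorted order
B = order[n // 2 :]                  # second half

def cutsize(W, A, B):
    """Sum of edge weights crossing between node sets A and B."""
    return W[np.ix_(A, B)].sum()

print("A =", A, "B =", B, "cutsize =", cutsize(W, A, B))
```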
    • Multi-way Graph Partitioning
      • Recursively apply the 2-way partitioning
        • Recursive 2-way partitioning
        • Use Kernighan–Lin for local refinements
      • Use higher eigenvectors
        • Use $q_3$ to further partition the clusters obtained via $q_2$
      • Popular graph partitioning packages
        • Metis, Univ. of Minnesota
        • Chaco, Sandia Nat'l Lab
    • 2-way Spectral Clustering
      • Undirected graphs (pairwise similarities)
      • Bipartite graphs (contingency tables)
      • Directed graphs (web graphs)
    • Spectral Clustering
      Minimize the cutsize, without explicit size constraints. But where to cut? We need to balance the cluster sizes.
    • Clustering Objective Functions
      Between-cluster similarity: $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$
      • Ratio Cut:
      $$J_{Rcut}(A,B) = \frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$$
      • Normalized Cut, with $d_A = \sum_{i \in A} d_i$:
      $$J_{Ncut}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B} = \frac{s(A,B)}{s(A,A) + s(A,B)} + \frac{s(A,B)}{s(B,B) + s(A,B)}$$
      • Min-Max-Cut:
      $$J_{MMC}(A,B) = \frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$$
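A minimal sketch computing the three objectives for a given partition, continuing from the earlier snippets (W, A, B assumed defined there):

```python
def s(W, X, Y):
    """Between-set similarity s(X, Y) = sum of w_ij over i in X, j in Y."""
    return W[np.ix_(X, Y)].sum()

def ratio_cut(W, A, B):
    return s(W, A, B) / len(A) + s(W, A, B) / len(B)

def normalized_cut(W, A, B):
    d = W.sum(axis=1)                          # node degrees
    return s(W, A, B) / d[A].sum() + s(W, A, B) / d[B].sum()

def min_max_cut(W, A, B):
    return s(W, A, B) / s(W, A, A) + s(W, A, B) / s(W, B, B)

print(ratio_cut(W, A, B), normalized_cut(W, A, B), min_max_cut(W, A, B))
```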
    • Ratio Cut (Hagen & Kahng, 1992)
      Minimize similarity between A and B: $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$
      Size balance (Wei & Cheng, 1989):
      $$J_{Rcut}(A,B) = \frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$$
      Cluster membership indicator:
      $$q(i) = \begin{cases} \sqrt{n_2 / n_1 n} & \text{if } i \in A \\ -\sqrt{n_1 / n_2 n} & \text{if } i \in B \end{cases}$$
      Normalization: $q^T q = 1$, $q^T e = 0$.
      Substituting $q$ leads to $J_{Rcut}(q) = q^T (D - W)\, q$.
      Now relax $q$; the solution is given by the 2nd eigenvector of $L$.
    • Normalized Cut (Shi & Malik, 1997)
      Minimize similarity between A and B: $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$
      Balance by weights, with $d_A = \sum_{i \in A} d_i$ and $d = \sum_{i \in G} d_i$:
      $$J_{Ncut}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}$$
      Cluster indicator:
      $$q(i) = \begin{cases} \sqrt{d_B / d_A d} & \text{if } i \in A \\ -\sqrt{d_A / d_B d} & \text{if } i \in B \end{cases}$$
      Normalization: $q^T D q = 1$, $q^T D e = 0$.
      Substituting $q$ leads to $J_{Ncut}(q) = q^T (D - W)\, q$, i.e.
      $$\min_q \; q^T (D - W)\, q + \lambda\,(q^T D q - 1)$$
      The solution is an eigenvector of $(D - W)\, q = \lambda D q$.
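The Normalized Cut relaxation is a generalized eigenproblem, which scipy solves directly. A sketch, reusing the toy W from the first snippet:

```python
from scipy.linalg import eigh        # generalized symmetric eigensolver

# Reusing W from the first sketch.
D = np.diag(W.sum(axis=1))
vals, vecs = eigh(D - W, D)          # solves (D - W) q = lambda * D q
q2 = vecs[:, 1]                      # second generalized eigenvector
A = np.where(q2 < 0)[0]
B = np.where(q2 >= 0)[0]
print("lambda_2 =", vals[1], "A =", A, "B =", B)
```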
    • MinMaxCut (Ding et al., 2001)
      Minimize similarity between A and B: $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$
      Maximize similarity within A and B: $s(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$
      $$J_{MMC}(A,B) = \frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$$
      Cluster indicator:
      $$q(i) = \begin{cases} \sqrt{d_B / d_A d} & \text{if } i \in A \\ -\sqrt{d_A / d_B d} & \text{if } i \in B \end{cases}$$
      Substituting,
      $$J_{MMC}(q) = \frac{1 + \sqrt{d_B / d_A}}{J_m + \sqrt{d_B / d_A}} + \frac{1 + \sqrt{d_A / d_B}}{J_m + \sqrt{d_A / d_B}} - 2, \qquad J_m = \frac{q^T W q}{q^T D q}$$
      Because $dJ_{MMC}/dJ_m < 0$:
      $$\min J_{MMC} \;\Rightarrow\; \max J_m(q) \;\Rightarrow\; Wq = \xi D q \;\Rightarrow\; (D - W)\, q = \lambda D q$$
    • A simple example
      Two dense clusters, with sparse connections between them.
      [Figure: adjacency matrix and eigenvector $q_2$]
    • Comparison of Clustering Objectives
      • If clusters are well separated, all three objectives give very similar and accurate results.
      • When clusters are marginally separated, NormCut and MinMaxCut give better results.
      • When clusters overlap significantly, MinMaxCut tends to give more compact and balanced clusters.
      $$J_{Ncut} = \frac{s(A,B)}{s(A,A) + s(A,B)} + \frac{s(A,B)}{s(B,B) + s(A,B)}$$
      Cluster compactness $\Rightarrow \max s(A,A)$
    • 2-way Clustering of Newsgroups

      Newsgroup pair                     RatioCut      NormCut       MinMaxCut
      Atheism / Comp.graphics            63.2 ± 16.2   97.2 ± 0.8    97.2 ± 1.1
      Baseball / Hockey                  54.9 ± 2.5    74.4 ± 20.4   79.5 ± 11.0
      Politics.mideast / Politics.misc   53.6 ± 3.1    57.5 ± 0.9    83.6 ± 2.5
    • Cluster Balance Analysis I: Random Graph Model
      • Random graph: edges are randomly assigned with probability $p$, $0 \leq p \leq 1$.
      • RatioCut & NormCut show no size dependence:
      $$J_{Rcut}(A,B) = \frac{p|A||B|}{|A|} + \frac{p|A||B|}{|B|} = np = \text{constant}$$
      $$J_{Ncut}(A,B) = \frac{p|A||B|}{p|A|(n-1)} + \frac{p|A||B|}{p|B|(n-1)} = \frac{n}{n-1} = \text{constant}$$
      • MinMaxCut favors balanced clusters, $|A| = |B|$ (see the simulation sketch below):
      $$J_{MMC}(A,B) = \frac{p|A||B|}{p|A|(|A|-1)} + \frac{p|A||B|}{p|B|(|B|-1)} = \frac{|B|}{|A|-1} + \frac{|A|}{|B|-1}$$
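A quick numeric check of this analysis (an illustrative simulation, not from the slides; it reuses the objective functions defined in the earlier sketch): on a G(n, p) random graph, J_Rcut and J_Ncut stay nearly constant as the split varies, while J_MMC drops as the split approaches |A| = |B|.

```python
# Illustrative simulation on a G(n, p) random graph.
rng = np.random.default_rng(0)
n, p = 200, 0.1
G = (rng.random((n, n)) < p).astype(float)
G = np.triu(G, 1)
G = G + G.T                                  # symmetric, zero diagonal

nodes = np.arange(n)
for size_A in (20, 50, 100):                 # increasingly balanced splits
    A, B = nodes[:size_A], nodes[size_A:]
    print(size_A,
          round(ratio_cut(G, A, B), 2),      # ~ n*p, roughly constant
          round(normalized_cut(G, A, B), 3), # ~ n/(n-1), roughly constant
          round(min_max_cut(G, A, B), 3))    # shrinks toward balance
```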
    • 2-way Clustering of Newsgroups: Cluster Balance
      [Figure: eigenvector, $J_{Ncut}(i)$, and $J_{MMC}(i)$ curves]
    • Cluster Balance Analysis II: Large Overlap Case
      $$f = \frac{s(A,B)}{\tfrac{1}{2}\,[\, s(A,A) + s(B,B) \,]} > 0.5$$
      Conditions for skewed cuts (the right-hand equalities evaluate the bounds at $f = 0.5$):
      $$\text{NormCut:}\quad s(A,A) \geq \left( \frac{1}{2f} - \frac{1}{2} \right) s(A,B) = s(A,B)/2$$
      $$\text{MinMaxCut:}\quad s(A,A) \geq \frac{1}{2f}\, s(A,B) = s(A,B)$$
      Thus MinMaxCut is much less prone to skewed cuts.
    • Spectral Clustering of Bipartite Graphs
      Simultaneous clustering of the rows and columns of a contingency table (adjacency matrix $B$).
      Examples of bipartite graphs:
      • Information retrieval: word-by-document matrix
      • Market basket data: transaction-by-item matrix
      • DNA gene expression profiles
      • Protein vs. protein-complex
    • Spectral Clustering of Bipartite Graphs
      Simultaneous clustering of rows and columns (adjacency matrix $B$), with
      $$s(B_{R_1, C_2}) = \sum_{r_i \in R_1} \sum_{c_j \in C_2} b_{ij}$$
      Minimize the between-cluster sums of cut weights, $s(R_1, C_2)$ and $s(R_2, C_1)$; maximize the within-cluster sums of weights, $s(R_1, C_1)$ and $s(R_2, C_2)$:
      $$J_{MMC}(C_1, C_2; R_1, R_2) = \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2\, s(B_{R_1,C_1})} + \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2\, s(B_{R_2,C_2})}$$
      (Ding, AI-STAT 2003)
    • Bipartite Graph Clustering
      Clustering indicators for rows and columns:
      $$f(i) = \begin{cases} 1 & \text{if } r_i \in R_1 \\ -1 & \text{if } r_i \in R_2 \end{cases} \qquad g(i) = \begin{cases} 1 & \text{if } c_i \in C_1 \\ -1 & \text{if } c_i \in C_2 \end{cases}$$
      $$B = \begin{pmatrix} B_{R_1,C_1} & B_{R_1,C_2} \\ B_{R_2,C_1} & B_{R_2,C_2} \end{pmatrix}, \qquad W = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}, \qquad q = \begin{pmatrix} f \\ g \end{pmatrix}$$
      Substituting, we obtain
      $$J_{MMC}(C_1, C_2; R_1, R_2) = \frac{s(W_{12})}{s(W_{11})} + \frac{s(W_{12})}{s(W_{22})}$$
      $f$ and $g$ are determined by
      $$\left[ \begin{pmatrix} D_r & 0 \\ 0 & D_c \end{pmatrix} - \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \right] \begin{pmatrix} f \\ g \end{pmatrix} = \lambda \begin{pmatrix} D_r & 0 \\ 0 & D_c \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix}$$
    • Clustering of Bipartite Graphs
      Let
      $$\tilde{B} = D_r^{-1/2} B D_c^{-1/2}, \qquad z = \begin{pmatrix} u \\ v \end{pmatrix} = D^{1/2} q = \begin{pmatrix} D_r^{1/2} f \\ D_c^{1/2} g \end{pmatrix}$$
      We obtain
      $$\begin{pmatrix} 0 & \tilde{B} \\ \tilde{B}^T & 0 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \lambda \begin{pmatrix} u \\ v \end{pmatrix}$$
      The solution is the SVD:
      $$\tilde{B} = \sum_{k=1}^{m} \lambda_k u_k v_k^T$$
      (Zha et al., 2001; Dhillon, 2001)
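A sketch of this computation on a hypothetical word-by-document matrix (the matrix and the sign-based split are illustrative assumptions):

```python
import numpy as np

# Hypothetical word-by-document contingency table (rows: words, cols: docs).
B = np.array([[3, 2, 0, 0],
              [2, 3, 1, 0],
              [0, 1, 3, 2],
              [0, 0, 2, 3]], dtype=float)

Dr_inv_sqrt = np.diag(1.0 / np.sqrt(B.sum(axis=1)))   # Dr^{-1/2}
Dc_inv_sqrt = np.diag(1.0 / np.sqrt(B.sum(axis=0)))   # Dc^{-1/2}
B_tilde = Dr_inv_sqrt @ B @ Dc_inv_sqrt

U, svals, Vt = np.linalg.svd(B_tilde)
u2, v2 = U[:, 1], Vt[1, :]           # second left/right singular vectors
f2 = Dr_inv_sqrt @ u2                # f = Dr^{-1/2} u (same signs as u2)
g2 = Dc_inv_sqrt @ v2                # g = Dc^{-1/2} v
print("row split:", np.where(f2 < 0)[0], "|", np.where(f2 >= 0)[0])
print("col split:", np.where(g2 < 0)[0], "|", np.where(g2 >= 0)[0])
```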
    • Clustering of Bipartite Graphs
      Recovering row clusters:
      $$R_1 = \{\, r_i \mid f_2(i) < z_r \,\}, \qquad R_2 = \{\, r_i \mid f_2(i) \geq z_r \,\}$$
      Recovering column clusters:
      $$C_1 = \{\, c_i \mid g_2(i) < z_c \,\}, \qquad C_2 = \{\, c_i \mid g_2(i) \geq z_c \,\}$$
      $z_r = z_c = 0$ are the dividing points, but the relaxation is invariant up to a constant shift.
      Algorithm: search for optimal cut points $i_{cut}$, $j_{cut}$; let $z_r = f_2(i_{cut})$ and $z_c = g_2(j_{cut})$, such that $J_{MMC}(C_1, C_2; R_1, R_2)$ is minimized (Zha et al., 2001). See the search sketch below.
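A brute-force sketch of that cut-point search, continuing from the previous snippet (B, f2, g2 assumed defined there); real implementations scan the sorted orders more cleverly, but the exhaustive version shows the objective being minimized:

```python
def j_mmc_bipartite(B, R1, R2, C1, C2):
    """Bipartite MinMaxCut objective for row split (R1, R2), col split (C1, C2)."""
    between = B[np.ix_(R1, C2)].sum() + B[np.ix_(R2, C1)].sum()
    w11, w22 = B[np.ix_(R1, C1)].sum(), B[np.ix_(R2, C2)].sum()
    if w11 == 0 or w22 == 0:
        return np.inf                 # degenerate cut, skip
    return between / (2 * w11) + between / (2 * w22)

rows, cols = np.argsort(f2), np.argsort(g2)
best_val, best_cut = np.inf, None
for i in range(1, len(rows)):         # candidate row cut points
    for j in range(1, len(cols)):     # candidate column cut points
        val = j_mmc_bipartite(B, rows[:i], rows[i:], cols[:j], cols[j:])
        if val < best_val:
            best_val, best_cut = val, (rows[:i], rows[i:], cols[:j], cols[j:])
print("min J_MMC =", best_val)
```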
    • Clustering of Directed Graphs
      Minimize directed edge weights between A and B: $s(A,B) = \sum_{i \in A} \sum_{j \in B} (w_{ij} + w_{ji})$
      Maximize directed edge weights within A and B: $s(A,A) = \sum_{i \in A} \sum_{j \in A} (w_{ij} + w_{ji})$
      • Equivalent to dealing with $\tilde{W} = W + W^T$
      • All spectral methods apply to $\tilde{W}$
      • For example, web graphs are clustered in this way (He, Ding, Zha, Simon, ICDM 2001)
    • K-way Spectral Clustering, $K \geq 2$
    • K-way Clustering Objectives
      • Ratio Cut:
      $$J_{Rcut}(C_1, \ldots, C_K) = \sum_{\langle k,l \rangle} \left[ \frac{s(C_k, C_l)}{|C_k|} + \frac{s(C_k, C_l)}{|C_l|} \right] = \sum_k \frac{s(C_k, G - C_k)}{|C_k|}$$
      • Normalized Cut:
      $$J_{Ncut}(C_1, \ldots, C_K) = \sum_{\langle k,l \rangle} \left[ \frac{s(C_k, C_l)}{d_k} + \frac{s(C_k, C_l)}{d_l} \right] = \sum_k \frac{s(C_k, G - C_k)}{d_k}$$
      • Min-Max-Cut:
      $$J_{MMC}(C_1, \ldots, C_K) = \sum_{\langle k,l \rangle} \left[ \frac{s(C_k, C_l)}{s(C_k, C_k)} + \frac{s(C_k, C_l)}{s(C_l, C_l)} \right] = \sum_k \frac{s(C_k, G - C_k)}{s(C_k, C_k)}$$
    • K-way Spectral Relaxation
      Prove that the solution lies in the subspace spanned by the first $k$ eigenvectors, for:
      • Ratio Cut
      • Normalized Cut
      • Min-Max-Cut
    • K-way Spectral Relaxation
      Unsigned cluster indicators:
      $$h_1 = (1 \cdots 1,\, 0 \cdots 0,\, 0 \cdots 0)^T, \quad h_2 = (0 \cdots 0,\, 1 \cdots 1,\, 0 \cdots 0)^T, \quad \ldots, \quad h_k = (0 \cdots 0,\, 0 \cdots 0,\, 1 \cdots 1)^T$$
      Re-write:
      $$J_{Rcut}(h_1, \ldots, h_k) = \frac{h_1^T (D - W) h_1}{h_1^T h_1} + \cdots + \frac{h_k^T (D - W) h_k}{h_k^T h_k}$$
      $$J_{Ncut}(h_1, \ldots, h_k) = \frac{h_1^T (D - W) h_1}{h_1^T D h_1} + \cdots + \frac{h_k^T (D - W) h_k}{h_k^T D h_k}$$
      $$J_{MMC}(h_1, \ldots, h_k) = \frac{h_1^T (D - W) h_1}{h_1^T W h_1} + \cdots + \frac{h_k^T (D - W) h_k}{h_k^T W h_k}$$
    • K-way Ratio Cut Spectral Relaxation
      Unsigned cluster indicators:
      $$x_k = (0 \cdots 0,\, \underbrace{1 \cdots 1}_{n_k},\, 0 \cdots 0)^T / n_k^{1/2}$$
      Re-write:
      $$J_{Rcut}(x_1, \ldots, x_k) = x_1^T (D - W) x_1 + \cdots + x_k^T (D - W) x_k = \mathrm{Tr}\,( X^T (D - W) X ), \qquad X = (x_1, \ldots, x_k)$$
      Optimize:
      $$\min_X \mathrm{Tr}\,( X^T (D - W) X ), \quad \text{subject to } X^T X = I$$
      By Ky Fan's theorem, the optimal solution is given by eigenvectors $X = (v_1, v_2, \ldots, v_k)$, $(D - W) v_k = \lambda_k v_k$, with the lower bound
      $$\lambda_1 + \cdots + \lambda_k \leq \min J_{Rcut}(x_1, \ldots, x_k)$$
      (Chan, Schlag & Zien, 1994)
    • K-way Normalized Cut Spectral Relaxation
      Unsigned cluster indicators:
      $$y_k = D^{1/2} (0 \cdots 0,\, \underbrace{1 \cdots 1}_{n_k},\, 0 \cdots 0)^T / \| D^{1/2} h_k \|$$
      Re-write:
      $$J_{Ncut}(y_1, \ldots, y_k) = y_1^T (I - \tilde{W}) y_1 + \cdots + y_k^T (I - \tilde{W}) y_k = \mathrm{Tr}\,( Y^T (I - \tilde{W}) Y ), \qquad \tilde{W} = D^{-1/2} W D^{-1/2}$$
      Optimize:
      $$\min_Y \mathrm{Tr}\,( Y^T (I - \tilde{W}) Y ), \quad \text{subject to } Y^T Y = I$$
      By Ky Fan's theorem, the optimal solution is given by eigenvectors $Y = (v_1, v_2, \ldots, v_k)$, $(I - \tilde{W})\, v_k = \lambda_k v_k$, equivalently $(D - W)\, u_k = \lambda_k D u_k$ with $u_k = D^{-1/2} v_k$, and
      $$\lambda_1 + \cdots + \lambda_k \leq \min J_{Ncut}(y_1, \ldots, y_k)$$
      (Gu et al., 2001)
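A sketch of this relaxation on a hypothetical 3-block graph (the graph and k are illustrative): the bottom-k eigenvalues of I − W̃ give the lower bound, and their eigenvectors span the relaxed solution subspace.

```python
import numpy as np

k = 3
# Hypothetical graph: three dense 4-node blocks with weak links between them.
W = np.kron(np.eye(3), np.ones((4, 4))) - np.eye(12)
W[3, 4] = W[4, 3] = W[7, 8] = W[8, 7] = 0.2

d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
W_tilde = W * np.outer(d_inv_sqrt, d_inv_sqrt)         # D^{-1/2} W D^{-1/2}
vals, vecs = np.linalg.eigh(np.eye(len(W)) - W_tilde)  # ascending eigenvalues
Y = vecs[:, :k]                       # relaxed K-way indicator subspace
print("lower bound on J_Ncut:", vals[:k].sum())
```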
    • K-way Min-Max Cut Spectral Relaxation
      Unsigned cluster indicators:
      $$y_k = D^{1/2} h_k / \| D^{1/2} h_k \|, \qquad \tilde{W} = D^{-1/2} W D^{-1/2}$$
      Re-write:
      $$J_{MMC}(y_1, \ldots, y_k) = \frac{1}{y_1^T \tilde{W} y_1} + \cdots + \frac{1}{y_k^T \tilde{W} y_k} - k$$
      Optimize:
      $$\min_Y J_{MMC}(Y), \quad \text{subject to } Y^T Y = I, \;\; y_k^T \tilde{W} y_k > 0$$
      Theorem: the optimal solution is given by eigenvectors $Y = (v_1, v_2, \ldots, v_k)$, $\tilde{W} v_k = \lambda_k v_k$, with
      $$\frac{k^2}{\lambda_1 + \cdots + \lambda_k} - k \leq \min J_{MMC}(y_1, \ldots, y_k)$$
      (Gu et al., 2001)
    • K-way Spectral Clustering
      • Embedding (similar to the PCA subspace approach)
        – Embed the data points in the subspace of the K eigenvectors
        – Cluster the embedded points using another algorithm, such as K-means (Shi & Malik; Ng et al.; Zha et al.) — see the sketch below
      • Recursive 2-way clustering (standard graph partitioning)
        – If the desired K is not a power of 2, how to optimally choose the next sub-cluster to split? (Ding et al., 2002)
      • Neither approach above uses the K-way clustering objective functions.
      • Refining the obtained clusters using the K-way clustering objective function typically improves the results (Ding et al., 2002).
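A sketch of the embedding approach, in the spirit of the Ng, Jordan & Weiss variant (the row-normalization step and the use of scikit-learn's KMeans are illustrative assumptions, not the slides' prescription):

```python
import numpy as np
from sklearn.cluster import KMeans    # assumed dependency for the final step

def spectral_kway(W, k):
    """Embed nodes with the k leading eigenvectors of D^{-1/2} W D^{-1/2},
    row-normalize, then cluster the embedded points with K-means."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    W_tilde = W * np.outer(d_inv_sqrt, d_inv_sqrt)
    vals, vecs = np.linalg.eigh(W_tilde)
    Y = vecs[:, -k:]                               # k leading eigenvectors
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)

# Hypothetical graph: two dense blocks with one weak cross edge.
W = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
W[3, 4] = W[4, 3] = 0.1
print(spectral_kway(W, 2))            # e.g. [0 0 0 0 1 1 1 1] up to labels
```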
    • DNA Gene Expression: Lymphoma Cancer (Alizadeh et al., 2000)
      [Figure: genes × tissue-sample expression matrix]
      Effect of feature selection: select 900 genes out of 4025 genes.
    • Lymphoma Cancer Tissue Samples
      B-cell lymphoma goes through different stages:
      – 3 cancer stages
      – 3 normal stages
      Key question: can we detect them automatically?
      [Figure: PCA 2D display]
    • Brief Summary of Part I
      • Spectral graph partitioning as the origin
      • Clustering objective functions and solutions
      • Extensions to bipartite and directed graphs
      • Characteristics:
        – Principled approach
        – Well-motivated objective functions
        – Clear and unambiguous
        – A framework of rich structure and content
        – Everything is proved rigorously (within the relaxation framework, i.e., using continuous approximation of the discrete variables)
      • The results above were mostly done by 2001.
      • More to come in Part II