# ICML 2004 Tutorial on Spectral Clustering, Part II


## Slide 1: A Tutorial on Spectral Clustering. Part 2: Advanced/Related Topics

Chris Ding, Computational Research Division, Lawrence Berkeley National Laboratory, University of California.

Tutorial on Spectral Clustering, ICML 2004, Chris Ding © University of California
## Slide 2: Advanced/Related Topics

- Spectral embedding: simplex cluster structure
- Perturbation analysis
- K-means clustering in embedded space
- Equivalence of K-means clustering and PCA
- Connectivity networks: scaled PCA and Green's function
- Extension to bipartite graphs: correspondence analysis
- Random walks and spectral clustering
- Semi-definite programming and spectral clustering
- Spectral ordering (distance-sensitive ordering)
- Webpage spectral ranking: PageRank and HITS
## Slide 3: Spectral Embedding: Simplex Cluster Structure

- Compute the K eigenvectors of the Laplacian.
- Embed the objects in the K-dimensional eigenspace.

What is the structure of the clusters?

**Simplex Embedding Theorem** (Ding, 2004). Assume the objects are well characterized by the spectral clustering objective function. In the embedded space, the objects aggregate around K distinct centroids:

- The centroids lie on the K corners of a simplex.
- The simplex consists of K basis vectors plus the coordinate origin.
- The simplex is rotated by an orthogonal transformation T.
- The columns of T are eigenvectors of a K × K embedding matrix Γ.
## Slide 4: K-way Clustering Objectives

$$J = \sum_{1 \le p < q \le K} \left[ \frac{s(C_p, C_q)}{\rho(C_p)} + \frac{s(C_p, C_q)}{\rho(C_q)} \right] = \sum_{k} \frac{s(C_k, G - C_k)}{\rho(C_k)}$$

$$\rho(C_k) = \begin{cases} n_k = |C_k| & \text{for Ratio Cut} \\ d(C_k) = \sum_{i \in C_k} d_i & \text{for Normalized Cut} \\ s(C_k, C_k) = \sum_{i \in C_k,\, j \in C_k} w_{ij} & \text{for MinMaxCut} \end{cases}$$

$G - C_k$ is the graph complement of $C_k$.
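The three weighting choices above can be checked numerically. A minimal NumPy sketch (the helper `cut_objectives` and the two-clique example are illustrative assumptions, not code from the tutorial):

```python
import numpy as np

# Hypothetical helper: evaluate the three K-way cut objectives for a
# given partition `labels` of a weighted graph W.
def cut_objectives(W, labels):
    """Return (ratio_cut, normalized_cut, minmax_cut) for the partition."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                          # degrees d_i
    J_rcut = J_ncut = J_mmc = 0.0
    for k in np.unique(labels):
        in_k = labels == k
        s_out = W[in_k][:, ~in_k].sum()        # s(C_k, G - C_k)
        J_rcut += s_out / in_k.sum()           # rho(C_k) = n_k
        J_ncut += s_out / d[in_k].sum()        # rho(C_k) = d(C_k)
        J_mmc += s_out / W[in_k][:, in_k].sum()  # rho(C_k) = s(C_k, C_k)
    return J_rcut, J_ncut, J_mmc

# Two disconnected edges: the true split cuts nothing, so all objectives are 0.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
labels = np.array([0, 0, 1, 1])
print(cut_objectives(W, labels))   # (0.0, 0.0, 0.0)
```

For any partition that cuts edges, the three objectives differ only in the denominator ρ(C_k), exactly as in the case table above.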
## Slide 5: Simplex Spectral Embedding Theorem

The columns of the simplex orthogonal transform matrix $T = (t_1, \dots, t_K)$ are determined by

$$\Gamma t_k = \lambda_k t_k$$

where the spectral perturbation matrix is $\Gamma = \Omega^{-1/2}\, \bar{\Gamma}\, \Omega^{-1/2}$ with

$$\bar{\Gamma} = \begin{pmatrix} h_{11} & -s_{12} & \cdots & -s_{1K} \\ -s_{21} & h_{22} & \cdots & -s_{2K} \\ \vdots & & \ddots & \vdots \\ -s_{K1} & -s_{K2} & \cdots & h_{KK} \end{pmatrix}, \quad s_{pq} = s(C_p, C_q), \quad h_{kk} = \sum_{p \ne k} s_{kp},$$

$$\Omega = \mathrm{diag}[\rho(C_1), \dots, \rho(C_K)]$$
## Slide 6: Properties of Spectral Embedding

- Original basis vectors: $h_k = (0 \cdots 0, \overbrace{1 \cdots 1}^{n_k}, 0 \cdots 0)^T / \sqrt{n_k}$.
- The dimension of the embedding is K − 1: $(q_2, \dots, q_K)$, since $q_1 = (1, \dots, 1)^T$ is the constant trivial eigenvector.
- The eigenvalues of Γ (= eigenvalues of D − W) determine how well the clustering objective function characterizes the data.
- An exact solution exists for K = 2.
## Slide 7: 2-way Spectral Embedding (Exact Solution)

Eigenvalues:

$$\lambda_{Rcut} = \frac{s(C_1, C_2)}{n(C_1)} + \frac{s(C_1, C_2)}{n(C_2)}, \quad \lambda_{Ncut} = \frac{s(C_1, C_2)}{d(C_1)} + \frac{s(C_1, C_2)}{d(C_2)}, \quad \lambda_{MMC} = \frac{s(C_1, C_2)}{s(C_1, C_1)} + \frac{s(C_1, C_2)}{s(C_2, C_2)}$$

These recover the original 2-way clustering objectives. For Normalized Cut, the orthogonal transform T rotates $h_1 = (1 \cdots 1, 0 \cdots 0)^T$ and $h_2 = (0 \cdots 0, 1 \cdots 1)^T$ into $q_1 = (1, \dots, 1)^T$ and $q_2 = (a, \dots, a, -b, \dots, -b)^T$. Spectral clustering is inherently consistent! (Ding et al., KDD'01)
## Slide 8: Perturbation Analysis

$$Wq = \lambda D q \;\Leftrightarrow\; \hat{W} z = (D^{-1/2} W D^{-1/2})\, z = \lambda z, \quad q = D^{-1/2} z$$

Assume the data has 3 dense clusters that are sparsely connected:

$$W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}$$

The off-diagonal blocks are between-cluster connections; they are assumed small and are treated as a perturbation. (Ding et al., KDD'01)
## Slide 9: Perturbation Analysis (continued)

0th order:

$$\hat{W}^{(0)} = \begin{pmatrix} \hat{W}_{11}^{(0)} & & \\ & \hat{W}_{22}^{(0)} & \\ & & \hat{W}_{33}^{(0)} \end{pmatrix}$$

1st order:

$$\hat{W}^{(1)} = \begin{pmatrix} \hat{W}_{11} - \hat{W}_{11}^{(0)} & \hat{W}_{12} & \hat{W}_{13} \\ \hat{W}_{21} & \hat{W}_{22} - \hat{W}_{22}^{(0)} & \hat{W}_{23} \\ \hat{W}_{31} & \hat{W}_{32} & \hat{W}_{33} - \hat{W}_{33}^{(0)} \end{pmatrix}$$

where $\hat{W}_{pq} = D_{pp}^{-1/2}\, W_{pq}\, D_{qq}^{-1/2}$ and $\hat{W}_{pq}^{(0)} = (D_{p1} + D_{p2} + D_{p3})^{-1/2}\, W_{pq}\, (D_{q1} + D_{q2} + D_{q3})^{-1/2}$.
## Slide 10: K-means Clustering

- Developed in the 1960s (Lloyd, MacQueen, etc.)
- Computationally efficient (order mN)
- Widely used in practice; a benchmark for evaluating other algorithms

Given n points in m dimensions, $X = (x_1, x_2, \dots, x_n)$, K-means solves

$$\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - c_k \|^2$$
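The objective above is usually minimized by Lloyd-style alternation. A minimal sketch (an assumed textbook implementation, not the tutorial's code; the toy data set is hypothetical):

```python
import numpy as np

# Lloyd's algorithm sketch: alternately assign each point to its nearest
# centroid, then recompute each centroid as its cluster mean.
def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # squared distances from every point to every centroid
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):          # converged
            break
        centers = new_centers
    return labels, centers

# Two well-separated pairs of points: each pair should form one cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centers = kmeans(X, 2)
print(labels)
```

Each iteration costs order mN per cluster assignment, matching the "order mN" efficiency claim on the slide.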
## Slide 11: K-means Clustering in Spectral Embedded Space

The simplex spectral embedding theorem provides the theoretical basis for K-means clustering in the embedded eigenspace:

- Cluster centroids are well separated (corners of the simplex).
- K-means clustering is invariant under (i) coordinate rotation x → Tx and (ii) shift x → x + a.
- Thus the orthogonal transform T in the simplex embedding is unnecessary.

Many variants of K-means exist (Ng et al.; Bach & Jordan; Zha et al.; Shi & Xu; etc.)
## Slide 12

We have proved that spectral embedding followed by K-means clustering is the appropriate method.

We now show that K-means itself is solved by PCA.
## Slide 13: Equivalence of K-means Clustering and Principal Component Analysis

- Cluster indicators specify the solution of K-means clustering.
- Principal components are eigenvectors of the Gram (kernel) matrix, i.e., the data projections onto the principal directions of the covariance matrix.
- Optimal solution of K-means clustering: the continuous solutions of the discrete cluster indicators of K-means are given by the principal components. (Zha et al., NIPS'01; Ding & He, 2003)
## Slide 14: Principal Component Analysis (PCA)

- Widely used in a large number of different fields:
  - Best low-rank approximation (SVD theorem; Eckart & Young, 1930): noise reduction
  - Unsupervised dimension reduction
  - Many generalizations
- The conventional perspective is inadequate to explain the effectiveness of PCA.
- New result: principal components are cluster indicators for a well-motivated clustering objective.
## Slide 15: Principal Component Analysis

n points in m dimensions: $X = (x_1, x_2, \dots, x_n)$.

Principal directions $u_k$: eigenvectors of the covariance $S = XX^T$,

$$XX^T u_k = \lambda_k u_k$$

Principal components $v_k$: eigenvectors of the Gram (kernel) matrix $X^T X$,

$$X^T X v_k = \lambda_k v_k$$

Singular value decomposition:

$$X = \sum_{k=1}^{m} \sqrt{\lambda_k}\, u_k v_k^T$$
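The three relations above can be verified numerically in a few lines; this sketch uses random illustrative data (an assumption, not the tutorial's data):

```python
import numpy as np

# Check: XX^T u_k = lam_k u_k,  X^T X v_k = lam_k v_k,
# and X = sum_k sqrt(lam_k) u_k v_k^T, with lam_k the squared singular values.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 6))            # m = 3 dims, n = 6 points (columns)

U, sv, Vt = np.linalg.svd(X, full_matrices=False)
lam = sv ** 2                              # eigenvalues of XX^T and X^T X

# principal directions u_k: eigenvectors of the covariance XX^T
assert np.allclose(X @ X.T @ U, U * lam)
# principal components v_k: eigenvectors of the Gram matrix X^T X
assert np.allclose(X.T @ X @ Vt.T, Vt.T * lam)
# SVD reconstruction X = sum_k sqrt(lam_k) u_k v_k^T
assert np.allclose(X, (U * sv) @ Vt)
print("all three PCA/SVD identities hold")
```

Note that the singular values are the square roots of the shared eigenvalues, which is why the same λ_k appears in both eigenproblems.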
## Slide 16: 2-way K-means Clustering

Cluster membership indicator:

$$q(i) = \begin{cases} +\sqrt{n_2 / (n_1 n)} & \text{if } i \in C_1 \\ -\sqrt{n_1 / (n_2 n)} & \text{if } i \in C_2 \end{cases}$$

$$J_K = n \langle x^2 \rangle - J_D, \quad J_D = \frac{n_1 n_2}{n} \left[ \frac{2\, d(C_1, C_2)}{n_1 n_2} - \frac{d(C_1, C_1)}{n_1^2} - \frac{d(C_2, C_2)}{n_2^2} \right]$$

Define the distance matrix $D = (d_{ij})$, $d_{ij} = |x_i - x_j|^2$. Then

$$J_D = -q^T D q = -q^T \tilde{D} q = 2\, q^T X^T X q$$

where $\tilde{D}$ is the centered distance matrix. Hence $\min J_K \Rightarrow \max J_D$.
## Slide 17: 2-way K-means Clustering

The cluster indicator satisfies $\sum_i q(i) = 0$ and $\sum_i q^2(i) = 1$.

Relax the restriction that q(i) take discrete values; let it take continuous values in [−1, 1]. The solution for q is then an eigenvector of the Gram matrix.

**Theorem.** The (continuous) optimal solution of q is given by the principal component $v_1$. The clusters $C_1, C_2$ are determined by

$$C_1 = \{\, i \mid v_1(i) < 0 \,\}, \quad C_2 = \{\, i \mid v_1(i) \ge 0 \,\}$$

Once $C_1, C_2$ are computed, iterate K-means to convergence.
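The theorem can be illustrated directly: split points by the sign of the first principal component of the centered data. A sketch with assumed two-cluster toy data:

```python
import numpy as np

# Two Gaussian blobs (hypothetical demo data), one near the origin and one
# shifted to (4, 0); rows are points.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2)) * 0.2
B = rng.standard_normal((10, 2)) * 0.2 + [4.0, 0.0]
X = np.vstack([A, B])
Y = X - X.mean(axis=0)            # center the data

# Principal component v1 = top eigenvector of the Gram matrix over points.
G = Y @ Y.T
w, V = np.linalg.eigh(G)          # ascending eigenvalues
v1 = V[:, -1]

# Split by the sign of v1, as in the theorem (sign of v1 is arbitrary,
# so which blob gets label 1 can flip).
labels = (v1 >= 0).astype(int)
print(labels)
```

The two blobs receive opposite signs in v1, so the sign split recovers the 2-way clustering before any K-means iteration.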
## Slide 18: Multi-way K-means Clustering

Unsigned cluster membership indicators $h_1, \dots, h_K$; for example, four objects in three clusters (columns correspond to $C_1, C_2, C_3$):

$$(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
## Slide 19: Multi-way K-means Clustering

For K ≥ 2,

$$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i, j \in C_k} x_i^T x_j$$

With the (unsigned) cluster membership indicators $h_k = (0 \cdots 0, \overbrace{1 \cdots 1}^{n_k}, 0 \cdots 0)^T / \sqrt{n_k}$,

$$J_K = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k$$

Let $H_K = (h_1, \dots, h_K)$. Then

$$J_K = \sum_i x_i^2 - \mathrm{Tr}(H_K^T X^T X H_K)$$
## Slide 20: Regularized Relaxation of K-means Clustering

There is a redundancy in $h_1, \dots, h_K$:

$$\sum_{k=1}^{K} n_k^{1/2} h_k = e = (1, \dots, 1)^T$$

Transform to signed indicator vectors $q_1, \dots, q_K$ via a K × K orthogonal matrix T:

$$(q_1, \dots, q_K) = (h_1, \dots, h_K)\, T, \quad Q_K = H_K T$$

Require the first column of T to be $(n_1^{1/2}, \dots, n_K^{1/2})^T / n^{1/2}$. Thus $q_1 = e / n^{1/2} = \text{const}$.
## Slide 21: Regularized Relaxation of K-means Clustering

$$J_K = \mathrm{Tr}(Y^T Y) - \mathrm{Tr}(Q_{K-1}^T\, Y^T Y\, Q_{K-1}), \quad Q_{K-1} = (q_2, \dots, q_K)$$

(Regularized relaxation.)

**Theorem.** The optimal solutions of $q_2, \dots, q_K$ are given by the principal components $v_2, \dots, v_K$. $J_K$ is bounded below by the total variance minus the sum of the K − 1 largest eigenvalues of the covariance:

$$n \langle y^2 \rangle - \sum_{k=1}^{K-1} \lambda_k \;\le\; \min J_K \;\le\; n \langle y^2 \rangle$$
## Slide 22: Scaled PCA

Similarity matrix $S = (s_{ij})$ (e.g., generated from $XX^T$); $D = \mathrm{diag}(d_1, \dots, d_n)$, $d_i = s_{i.}$.

Nonlinear re-scaling: $\tilde{S} = D^{-1/2} S D^{-1/2}$, $\tilde{s}_{ij} = s_{ij} / (s_{i.}\, s_{j.})^{1/2}$.

Apply SVD on $\tilde{S}$:

$$S = D^{1/2} \tilde{S} D^{1/2} = D^{1/2} \Big( \sum_k z_k \lambda_k z_k^T \Big) D^{1/2} = D \sum_k q_k \lambda_k q_k^T D$$

where $q_k = D^{-1/2} z_k$ is the scaled principal component. Subtracting the trivial component $\lambda_0 = 1$, $z_0 = d^{1/2} / s_{..}^{1/2}$, $q_0 = 1$:

$$S - d\, d^T / s_{..} = D \sum_{k \ge 1} q_k \lambda_k q_k^T D$$

(Ding et al., 2002)
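A small numerical sketch of the construction above (the 3 × 3 similarity matrix is an illustrative assumption): rescale S, take its eigenvectors, and confirm that the trivial component has λ₀ = 1 with a constant scaled principal component q₀.

```python
import numpy as np

# Illustrative symmetric similarity matrix with zero diagonal.
S = np.array([[0.0, 5.0, 1.0],
              [5.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
d = S.sum(axis=1)                       # d_i = s_i.
D_isqrt = np.diag(1.0 / np.sqrt(d))
S_tilde = D_isqrt @ S @ D_isqrt         # s~_ij = s_ij / sqrt(s_i. s_j.)

lam, Z = np.linalg.eigh(S_tilde)
order = np.argsort(-lam)                # sort eigenvalues descending
lam, Z = lam[order], Z[:, order]
Q = D_isqrt @ Z                         # scaled principal components q_k

# Trivial component: lambda_0 = 1 and q_0 is constant (z_0 ~ d^{1/2}).
assert np.isclose(lam[0], 1.0)
assert np.allclose(Q[:, 0], Q[0, 0])
print(np.round(lam, 4))
```

The eigenvalue 1 appears because $\tilde{S}$ is similar to the row-stochastic matrix $D^{-1} S$, which is also why q₀ is constant.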
## Slide 23: Optimality Properties of Scaled PCA

Scaled principal components have optimality properties.

Ordering:
- Adjacent objects along the order are similar.
- Far-away objects along the order are dissimilar.
- The optimal solution for the permutation indexes is given by scaled PCA.

Clustering:
- Maximize within-cluster similarity.
- Minimize between-cluster similarity.
- The optimal solution for the cluster membership indicators is given by scaled PCA.
## Slide 24: Difficulty of K-way Clustering

- 2-way clustering uses a single eigenvector.
- K-way clustering uses several eigenvectors.
- How do we recover the 0-1 cluster indicators H?

The eigenvectors $Q = (q_1, \dots, q_K)$ have both positive and negative entries, while the indicators $H = (h_1, \dots, h_K)$ are related by $Q = HT$.

Avoid computing the transformation T:
- Do K-means, which is invariant under T.
- Compute the connectivity network $QQ^T$, which cancels T.
## Slide 25: Connectivity Network

$$C_{ij} = \begin{cases} 1 & \text{if } i, j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases}$$

Scaled PCA provides

$$C \cong D \sum_{k=1}^{K} q_k \lambda_k q_k^T D$$

Green's function:

$$C \approx G = \sum_{k=2}^{K} \frac{q_k q_k^T}{1 - \lambda_k}$$

Projection matrix:

$$C \approx P \equiv \sum_{k=1}^{K} q_k q_k^T$$

(Ding et al., 2002)
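The projection-matrix form is easy to demonstrate. A sketch on an assumed toy graph with two tight pairs joined by one weak edge: the outer product of the top K = 2 eigenvectors approximates the block cluster-indicator matrix.

```python
import numpy as np

# Toy graph (hypothetical): nodes {0,1} and {2,3} are tightly connected,
# with one weak link between the clusters.
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 9.0
W[2, 3] = W[3, 2] = 9.0
W[1, 2] = W[2, 1] = 0.1

d = W.sum(axis=1)
D_isqrt = np.diag(d ** -0.5)
lam, Z = np.linalg.eigh(D_isqrt @ W @ D_isqrt)
Q = Z[:, np.argsort(-lam)[:2]]   # top K = 2 eigenvectors

P = Q @ Q.T                      # projection-matrix connectivity network
print(np.round(P, 2))            # large within clusters, ~0 between
```

Within-cluster entries come out near 0.5 and between-cluster entries near 0, i.e., between-cluster connections are suppressed and within-cluster ones enhanced, as the next slide's figure shows.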
## Slide 26: Connectivity Network

- Similar to a Hopfield network
- Mathematical basis: projection matrix
- Shows self-aggregation clearly
- Drawback: how to recover the clusters?
  - Apply K-means directly on C
  - Use linearized assignment with cluster crossing and spectral ordering (ICML'04)
## Slide 27: Connectivity Network: Example

[Figure: similarity matrix W and the resulting connectivity matrix; λ₂ = 0.300, 0.268.] Between-cluster connections are suppressed and within-cluster connections are enhanced: the effect of self-aggregation.
## Slide 28: Connectivity of Internet Newsgroups

Five newsgroups, 100 articles from each group, 1000 words, tf.idf weights, cosine similarity:

- NG2: comp.graphics
- NG9: rec.motorcycles
- NG10: rec.sport.baseball
- NG15: sci.space
- NG18: talk.politics.mideast

Accuracy: spectral clustering 89%, direct K-means 66%. [Figure: cosine similarity matrix and connectivity matrix.]
## Slide 29: Spectral Embedding Is Not Topology Preserving

700 3-D data points form 2 interlocking rings. In eigenspace, the rings shrink and separate.
## Slide 30: Correspondence Analysis (CA)

- Mainly used for graphical display of data
- Popular in France (Benzécri, 1969)
- Long history:
  - Simultaneous row and column regression (Hirschfeld, 1935)
  - Reciprocal averaging (Richardson & Kuder, 1933; Horst, 1935; Fisher, 1940; Hill, 1974)
  - Canonical correlations, dual scaling, etc.
- The formulation is a bit complicated ("convoluted", Jolliffe, 2002, p. 342)
- "A neglected method" (Hill, 1974)
## Slide 31: Scaled PCA on a Contingency Table ⇒ Correspondence Analysis

Nonlinear re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, $\tilde{p}_{ij} = p_{ij} / (p_{i.}\, p_{.j})^{1/2}$.

Apply SVD on $\tilde{P}$ and subtract the trivial component:

$$P - r\, c^T / p_{..} = D_r \sum_{k \ge 1} f_k \lambda_k g_k^T D_c$$

where $r = (p_{1.}, \dots, p_{n.})^T$, $c = (p_{.1}, \dots, p_{.n})^T$, and $f_k = D_r^{-1/2} u_k$, $g_k = D_c^{-1/2} v_k$ are the scaled row and column principal components (the standard coordinates in CA).
## Slide 32: Information Retrieval

Bell Labs tech memos: 5 computer-science and 4 applied-math memo titles.

- C1: Human machine interface for lab ABC computer applications
- C2: A survey of user opinion of computer system response time
- C3: The EPS user interface management system
- C4: System and human system engineering testing of EPS
- C5: Relation of user-perceived response time to error management
- M1: The generation of random, binary, unordered trees
- M2: The intersection graph of paths in trees
- M3: Graph minors IV: widths of trees and well-quasi-ordering
- M4: Graph minors: A survey
## Slide 33: Word-Document Matrix: Row/Column Clustering

Word-document count matrix, with rows (words) and columns (documents, in the order c4, c1, c3, c5, c2, m4, m3, m2, m1) permuted to reveal the block structure. The column alignment of the counts was lost in extraction; the nonzero counts per word are:

    human     1 1
    EPS       1 1
    interface 1 1
    system    2 1 1
    computer  1
    user      1 1
    response  1 1
    time      1 1
    survey    1 1
    minors    1 1
    graph     1 1 1
    tree      1 1 1
## Slide 34: Bipartite Graph: 3 Types of Connectivity Networks

Stack the row and column embeddings, $Q_K = (q_1, \dots, q_K) = \binom{F_K}{G_K}$, so that

$$Q_K Q_K^T = \begin{pmatrix} F_K F_K^T & F_K G_K^T \\ G_K F_K^T & G_K G_K^T \end{pmatrix}$$

Row-row clustering:

$$F_K F_K^T = \begin{pmatrix} e_{r_1} e_{r_1}^T / 2 s_{11} & 0 \\ 0 & e_{r_2} e_{r_2}^T / 2 s_{22} \end{pmatrix}$$

Column-column clustering:

$$G_K G_K^T = \begin{pmatrix} e_{c_1} e_{c_1}^T / 2 s_{11} & 0 \\ 0 & e_{c_2} e_{c_2}^T / 2 s_{22} \end{pmatrix}$$

Row-column association:

$$F_K G_K^T = \begin{pmatrix} e_{r_1} e_{c_1}^T / 2 s_{11} & 0 \\ 0 & e_{r_2} e_{c_2}^T / 2 s_{22} \end{pmatrix}$$
## Slide 35: Example

[Figure: original data matrix and the three connectivity networks: row-row $FF^T$, column-column $GG^T$, and row-column $FG^T$; λ₂ = 0.456, 0.477.]
## Slide 36: Internet Newsgroups

Simultaneous clustering of documents and words. [Figure.]
## Slide 37: Random Walks and Normalized Cut

Given the similarity matrix W, $P = D^{-1} W$ is a stochastic matrix. Its equilibrium distribution satisfies $\pi^T P = \pi^T$, which gives $\pi = d$ (up to normalization).

$$Px = \lambda x \;\Rightarrow\; Wx = \lambda Dx \;\Rightarrow\; (D - W)\, x = (1 - \lambda)\, Dx$$

Random walks between A and B:

$$J_{NormCut} = \frac{P(A \to B)}{\pi(A)} + \frac{P(B \to A)}{\pi(B)}$$

(Meila & Shi, 2001)

PageRank: $P = \alpha L D_{out}^{-1} + (1 - \alpha)\, e e^T$
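The random-walk identities above are quick to verify numerically; a sketch on an assumed random symmetric similarity matrix:

```python
import numpy as np

# P = D^{-1} W is row-stochastic; its equilibrium is pi ~ d; and
# P x = lambda x is equivalent to W x = lambda D x.
rng = np.random.default_rng(2)
W = rng.random((5, 5))
W = (W + W.T) / 2                   # symmetric similarity matrix
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
P = W / d[:, None]                  # D^{-1} W

pi = d / d.sum()                    # claimed equilibrium distribution
assert np.allclose(pi @ P, pi)      # pi^T P = pi^T

lam, X = np.linalg.eig(P)           # any eigenpair of P
x = X[:, 0]
assert np.allclose(W @ x, lam[0] * d * x)   # W x = lambda D x
print("random-walk identities hold")
```

The second assertion is exactly the slide's chain: multiplying $Px = \lambda x$ by D on the left turns the stochastic-matrix eigenproblem into the generalized eigenproblem $Wx = \lambda Dx$ used by Normalized Cut.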
## Slide 38: Semi-definite Programming for Normalized Cut

Normalized Cut: $y_k = D^{1/2} (0 \cdots 0, \overbrace{1 \cdots 1}^{n_k}, 0 \cdots 0)^T / \| D^{1/2} h_k \|$, with $\tilde{W} = D^{-1/2} W D^{-1/2}$.

Optimize:

$$\min_Y \mathrm{Tr}\big( Y^T (I - \tilde{W})\, Y \big), \quad \text{subject to } Y^T Y = I$$

$$\Rightarrow\; \mathrm{Tr}\big[ (I - \tilde{W})\, Y Y^T \big] \;\Rightarrow\; \min_Z \mathrm{Tr}\big[ (I - \tilde{W})\, Z \big], \quad Z = Y Y^T$$

subject to $Z \ge 0$ (elementwise), $Z \succeq 0$, $Zd = d$, $\mathrm{Tr}\, Z = K$, $Z = Z^T$.

Compute Z via SDP; then $Z = Y' Y'^T$, $Y'' = D^{-1/2} Y'$, and run K-means on $Y''$. (Xing & Jordan, 2003)

Z is a connectivity network.
## Slide 39: Spectral Ordering

Hill, 1970 (spectral embedding): find a coordinate x that minimizes

$$J = \sum_{ij} (x_i - x_j)^2\, w_{ij} = x^T (D - W)\, x$$

The solutions are eigenvectors of the Laplacian.

Barnard, Pothen & Simon, 1993 (envelope reduction of a sparse matrix): find an ordering such that the envelope is minimized:

$$\min \sum_{ij} (i - j)^2\, w_{ij} \;\Rightarrow\; \min \sum_{ij} (\pi_i - \pi_j)^2\, w_{ij}$$
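In practice the ordering comes from sorting vertices by the second-smallest Laplacian eigenvector (the Fiedler vector). A sketch on an assumed path graph given in scrambled vertex labels:

```python
import numpy as np

# Path graph 1-3-0-2 stored under scrambled vertex labels (illustrative data).
W = np.zeros((4, 4))
for i, j in [(1, 3), (3, 0), (0, 2)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))
lam, V = np.linalg.eigh(D - W)      # Laplacian eigenpairs, ascending
fiedler = V[:, 1]                   # second-smallest eigenvector

order = np.argsort(fiedler)         # sort vertices by Fiedler coordinate
print(order)                        # the path order 1-3-0-2, or its reverse
```

Sorting by the Fiedler coordinate recovers the underlying path order (up to reversal, since the eigenvector's sign is arbitrary), which is exactly the minimizer the two formulations above are relaxing.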
## Slide 40: Distance-Sensitive Ordering

The ordering is determined by permutation indexes $\pi(1, \dots, n) = (\pi_1, \dots, \pi_n)$:

$$J_d(\pi) = \sum_{i=1}^{n-d} s_{\pi_i, \pi_{i+d}}, \quad \min_\pi J, \quad J(\pi) = \sum_{d=1}^{n-1} d^2\, J_d(\pi)$$

For example, with 4 variables a given ordering has three distance-1 pairs, two d = 2 pairs, and one d = 3 pair:

$$\begin{pmatrix} 0 & 1 & 2 & 3 \\ & 0 & 1 & 2 \\ & & 0 & 1 \\ & & & 0 \end{pmatrix}$$

The larger the distance, the larger the weight: large-distance similarities are reduced more than small-distance similarities. (Ding & He, ICML'04)
## Slide 41: Distance-Sensitive Ordering

**Theorem.** The continuous optimal solution for the discrete inverse permutation indexes is given by the scaled principal component $q_1$. The shifted and scaled inverse permutation indexes are

$$q_i = \frac{\pi_i^{-1} - (n + 1)/2}{n/2} \in \left\{ \frac{1 - n}{n}, \frac{3 - n}{n}, \dots, \frac{n - 1}{n} \right\}$$

Relax the restriction on q and allow it to be continuous. The solution for q becomes the eigenvector of

$$(D - S)\, q = \lambda D q$$
## Slide 42: Re-ordering of Genes and Tissues

[Figure: re-ordered gene-tissue expression matrix.]

$$r = \frac{J(\pi)}{J(\text{random})} = 0.18, \quad r_{d=1} = \frac{J_{d=1}(\pi)}{J_{d=1}(\text{random})} = 3.39$$

(C. Ding, RECOMB 2002)
## Slide 43: Webpage Spectral Ranking

Rank webpages from the hyperlink topology; L is the adjacency matrix of the web subgraph.

PageRank (Page & Brin): rank according to the principal eigenvector π (the equilibrium distribution):

$$\pi^T T = \pi^T, \quad T = 0.8\, D_{out}^{-1} L + 0.2\, e e^T$$

HITS (Kleinberg): rank according to the principal eigenvector of the authority matrix:

$$(L^T L)\, q = \lambda q$$

The eigenvectors can be obtained in closed form.
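Both rankings are short computations. A sketch on an assumed 4-page toy graph, with the teleportation term normalized as $ee^T/n$ so that T is row-stochastic (an assumption about the slide's shorthand):

```python
import numpy as np

# Toy hyperlink graph (hypothetical): row i links to column j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = len(L)
d_out = L.sum(axis=1)

# PageRank: power iteration on T = 0.8 D_out^{-1} L + 0.2 e e^T / n.
T = 0.8 * (L / d_out[:, None]) + 0.2 * np.ones((n, n)) / n
pi = np.ones(n) / n
for _ in range(200):
    pi = pi @ T                     # pi^T T = pi^T at the fixed point
pi /= pi.sum()

# HITS: authority scores = principal eigenvector of L^T L.
lam, V = np.linalg.eigh(L.T @ L)
auth = np.abs(V[:, -1])             # eigh sorts ascending; take the last

print(np.round(pi, 3), np.argmax(auth))
```

Page 2 has the highest indegree (three in-links), and both PageRank and HITS rank it first on this graph, previewing the indegree-ranking results on the next two slides.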
## Slide 44: Webpage Spectral Ranking

HITS (Kleinberg) ranking algorithm. Assume the web graph is a fixed-degree-sequence random graph (Aiello, Chung & Lu, 2000).

**Theorem.** The eigenvalues of $L^T L$ satisfy

$$\lambda_1 \approx h_1 > \lambda_2 \approx h_2 > \cdots, \quad h_i = d_i - n^{-1} d_i^2$$

with eigenvectors

$$u_k = \left( \frac{d_1}{\lambda_k - h_1}, \frac{d_2}{\lambda_k - h_2}, \dots, \frac{d_n}{\lambda_k - h_n} \right)^T$$

The principal eigenvector $u_1$ is monotonically decreasing if $d_1 > d_2 > d_3 > \cdots$, so HITS ranking is identical to indegree ranking. (Ding et al., SIAM Review '04)
## Slide 45: Webpage Spectral Ranking

PageRank: weight normalization. HITS: mutual reinforcement. Combine PageRank and HITS, and generalize:

Rank based on the similarity graph $S = L^T D_{out}^{-1} L$. Random walks on this similarity graph have the equilibrium distribution

$$(d_1, d_2, \dots, d_n)^T / 2E$$

so PageRank ranking is identical to indegree ranking (a 1st-order approximation, due to the combination of PageRank and HITS). (Ding et al., SIGIR'02)
## Slide 46: PCA: a Unified Framework for Clustering and Ordering

- PCA is equivalent to K-means clustering.
- Scaled PCA has two optimality properties:
  - Distance-sensitive ordering
  - Min-max principle clustering
- SPCA on a contingency table ⇒ correspondence analysis:
  - Simultaneous ordering of rows and columns
  - Simultaneous clustering of rows and columns
- Resolves open problems:
  - The relationship between correspondence analysis and PCA (open since the 1940s)
  - The relationship between PCA and K-means clustering (open since the 1960s)
## Slide 47: Spectral Clustering

A rich spectrum of topics; a comprehensive framework for learning. A tutorial review of spectral clustering. The tutorial website will post all related papers (send your papers).
## Slide 48: Acknowledgment

- Hongyuan Zha, Penn State
- Horst Simon, Lawrence Berkeley Lab
- Ming Gu, UC Berkeley
- Xiaofeng He, Lawrence Berkeley Lab
- Michael Jordan, UC Berkeley
- Michael Berry, U. Tennessee, Knoxville
- Inderjit Dhillon, UT Austin
- George Karypis, U. Minnesota
- Haesun Park, U. Minnesota

Work supported by the Office of Science, Dept. of Energy.