# Principal Component Analysis and Matrix Factorizations for Learning (Part 2), Chris Ding, ICML 2005 Tutorial

## by zukun on May 06, 2011


## Presentation Transcript

• **Part 2: Spectral Clustering from a Matrix Perspective.** A brief tutorial emphasizing recent developments. (A more detailed tutorial was given at ICML'04.) *PCA & Matrix Factorizations for Learning, ICML 2005 Tutorial, Chris Ding*
• **From PCA to spectral clustering using generalized eigenvectors.** Consider the kernel matrix $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$. In kernel PCA we compute the eigenvector $Wv = \lambda v$. The generalized eigenvector problem is $Wq = \lambda D q$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$. This leads to spectral clustering!
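As a concrete sketch (the toy similarity matrix and all variable names below are illustrative assumptions, not from the tutorial), the generalized eigenproblem $Wq = \lambda Dq$ can be solved directly:

```python
import numpy as np
from scipy.linalg import eigh

# Toy similarity matrix: two dense 3-node blocks with one weak 0.1 link
W = np.array([[0.0, 1.0, 1.0, 0.1, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]])

D = np.diag(W.sum(axis=1))       # degree matrix, d_i = sum_j w_ij

# Generalized eigenproblem W q = lambda D q (eigenvalues ascending)
evals, evecs = eigh(W, D)

# The largest eigenvalue is 1 (constant eigenvector, since W e = D e);
# the second-largest eigenvector q2 separates the two blocks by sign.
q2 = evecs[:, -2]
labels = (q2 > 0).astype(int)
```

On this near-block-diagonal matrix the sign pattern of $q_2$ recovers the two planted clusters.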
• **Indicator Matrix Quadratic Clustering Framework.** Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$. Kernel K-means clustering: $\max_H \mathrm{Tr}(H^T W H)$ s.t. $H^T H = I$, $H \geq 0$. For K-means, $W = X^T X$; for kernel K-means, $W = (\langle \phi(x_i), \phi(x_j) \rangle)$. Spectral clustering (normalized cut): $\max_H \mathrm{Tr}(H^T W H)$ s.t. $H^T D H = I$, $H \geq 0$.
• **Brief Introduction to Spectral Clustering** (Laplacian-matrix-based clustering)
• **Some historical notes.**
– Fiedler, 1973, 1975: graph Laplacian matrix
– Donath & Hoffman, 1973: bounds
– Hall, 1970: quadratic placement (embedding)
– Pothen, Simon & Liou, 1990: spectral graph partitioning (many related papers thereafter)
– Hagen & Kahng, 1992: Ratio-cut
– Chan, Schlag & Zien: multi-way Ratio-cut
– Chung, 1997: spectral graph theory book
– Shi & Malik, 2000: Normalized Cut
• **Spectral Gold-Rush of 2001:** nine papers on spectral clustering.
– Meila & Shi, AI-Stat 2001: random-walk interpretation of Normalized Cut
– Ding, He & Zha, KDD 2001: perturbation analysis of the Laplacian matrix on sparsely connected graphs
– Ng, Jordan & Weiss, NIPS 2001: K-means on the embedded eigenspace
– Belkin & Niyogi, NIPS 2001: spectral embedding
– Dhillon, KDD 2001: bipartite graph clustering
– Zha et al., CIKM 2001: bipartite graph clustering
– Zha et al., NIPS 2001: spectral relaxation of K-means
– Ding et al., ICDM 2001: MinMaxCut, uniqueness of the relaxation
– Gu et al.: K-way relaxation of NormCut and MinMaxCut
• **Spectral Clustering.** Minimizing the cut size without explicit size constraints leaves the question: where to cut? The cluster sizes need to be balanced.
• **Graph Clustering.** Minimize the between-cluster similarities (weights), $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$, and maximize the within-cluster similarities, $s(A,A) = \sum_{i \in A} \sum_{j \in A} w_{ij}$. The balancing can be by weight, by size, or by volume.
• **Clustering Objective Functions.** With $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$:
– Ratio Cut: $J_{Rcut}(A,B) = \frac{s(A,B)}{|A|} + \frac{s(A,B)}{|B|}$
– Normalized Cut: $J_{Ncut}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B} = \frac{s(A,B)}{s(A,A) + s(A,B)} + \frac{s(A,B)}{s(B,B) + s(A,B)}$, where $d_A = \sum_{i \in A} d_i$
– Min-Max-Cut: $J_{MMC}(A,B) = \frac{s(A,B)}{s(A,A)} + \frac{s(A,B)}{s(B,B)}$
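The three objectives are easy to compute directly; a minimal sketch (the function name and the toy matrix are my own illustration, not from the tutorial):

```python
import numpy as np

def cut_objectives(W, A, B):
    """Ratio Cut, Normalized Cut and Min-Max-Cut for a 2-way partition."""
    W = np.asarray(W, dtype=float)
    s = lambda X, Y: W[np.ix_(X, Y)].sum()   # s(X,Y) = sum_{i in X, j in Y} w_ij
    sAB = s(A, B)
    d = W.sum(axis=1)                        # degrees d_i
    dA, dB = d[A].sum(), d[B].sum()          # note d_A = s(A,A) + s(A,B)
    J_rcut = sAB / len(A) + sAB / len(B)
    J_ncut = sAB / dA + sAB / dB
    J_mmc = sAB / s(A, A) + sAB / s(B, B)
    return J_rcut, J_ncut, J_mmc

# Two dense 3-node blocks joined by one weak 0.1 edge
W = np.array([[0.0, 1.0, 1.0, 0.1, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]])
jr, jn, jm = cut_objectives(W, [0, 1, 2], [3, 4, 5])
```

Here $s(A,B) = 0.2$ (both orders of the 0.1 edge), $s(A,A) = s(B,B) = 6$, $d_A = d_B = 6.1$, so the three objectives differ only in their denominators.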
• **Normalized Cut (Shi & Malik, 2000).** Minimize the similarity between $A$ and $B$, $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$, while balancing the weights: $J_{Ncut}(A,B) = \frac{s(A,B)}{d_A} + \frac{s(A,B)}{d_B}$, $d_A = \sum_{i \in A} d_i$. Cluster indicator: $q(i) = \sqrt{d_B / (d_A d)}$ if $i \in A$ and $q(i) = -\sqrt{d_A / (d_B d)}$ if $i \in B$, where $d = \sum_{i \in G} d_i$. Normalization: $q^T D q = 1$, $q^T D e = 0$. Substituting $q$ gives $J_{Ncut}(q) = q^T (D - W) q$, and minimizing $q^T (D - W) q + \lambda (q^T D q - 1)$ shows that the solution is an eigenvector of $(D - W) q = \lambda D q$.
• **A simple example.** Two dense clusters with sparse connections between them. (Figure: adjacency matrix and eigenvector $q_2$.)
• **K-way Spectral Clustering** ($K \geq 2$)
• **K-way Clustering Objectives.**
– Ratio Cut: $J_{Rcut}(C_1, \ldots, C_K) = \sum_{\langle k,l \rangle} \left( \frac{s(C_k, C_l)}{|C_k|} + \frac{s(C_k, C_l)}{|C_l|} \right) = \sum_k \frac{s(C_k, G - C_k)}{|C_k|}$
– Normalized Cut: $J_{Ncut}(C_1, \ldots, C_K) = \sum_{\langle k,l \rangle} \left( \frac{s(C_k, C_l)}{d_k} + \frac{s(C_k, C_l)}{d_l} \right) = \sum_k \frac{s(C_k, G - C_k)}{d_k}$
– Min-Max-Cut: $J_{MMC}(C_1, \ldots, C_K) = \sum_{\langle k,l \rangle} \left( \frac{s(C_k, C_l)}{s(C_k, C_k)} + \frac{s(C_k, C_l)}{s(C_l, C_l)} \right) = \sum_k \frac{s(C_k, G - C_k)}{s(C_k, C_k)}$
• **K-way Spectral Relaxation.** Unsigned cluster indicators: $h_1 = (1 \cdots 1, 0 \cdots 0, 0 \cdots 0)^T$, $h_2 = (0 \cdots 0, 1 \cdots 1, 0 \cdots 0)^T$, ..., $h_K = (0 \cdots 0, 0 \cdots 0, 1 \cdots 1)^T$. Re-write:
– $J_{Rcut}(h_1, \ldots, h_K) = \frac{h_1^T (D - W) h_1}{h_1^T h_1} + \cdots + \frac{h_K^T (D - W) h_K}{h_K^T h_K}$
– $J_{Ncut}(h_1, \ldots, h_K) = \frac{h_1^T (D - W) h_1}{h_1^T D h_1} + \cdots + \frac{h_K^T (D - W) h_K}{h_K^T D h_K}$
– $J_{MMC}(h_1, \ldots, h_K) = \frac{h_1^T (D - W) h_1}{h_1^T W h_1} + \cdots + \frac{h_K^T (D - W) h_K}{h_K^T W h_K}$
• **K-way Normalized Cut Spectral Relaxation.** Scaled indicators: $y_k = D^{1/2} (0 \cdots 0, 1 \cdots 1, 0 \cdots 0)^T / \| D^{1/2} h_k \|$ (with $n_k$ ones). Re-write: $J_{Ncut}(y_1, \ldots, y_K) = y_1^T (I - \tilde{W}) y_1 + \cdots + y_K^T (I - \tilde{W}) y_K = \mathrm{Tr}(Y^T (I - \tilde{W}) Y)$, where $\tilde{W} = D^{-1/2} W D^{-1/2}$. Optimize: $\min_Y \mathrm{Tr}(Y^T (I - \tilde{W}) Y)$ subject to $Y^T Y = I$. By Ky Fan's theorem, the optimal solution is given by the eigenvectors $Y = (v_1, v_2, \ldots, v_K)$, $(I - \tilde{W}) v_k = \lambda_k v_k$, equivalently $(D - W) u_k = \lambda_k D u_k$ with $u_k = D^{-1/2} v_k$, and $\lambda_1 + \cdots + \lambda_K \leq \min J_{Ncut}(y_1, \ldots, y_K)$ (Gu et al., 2001).
• **K-way Spectral Clustering is difficult.** Spectral clustering is best applied to 2-way clustering: positive entries indicate one cluster, negative entries the other. For K-way ($K > 2$) clustering, mixed positive and negative signs make cluster assignment difficult. Options:
– Recursive 2-way clustering
– Low-dimensional embedding: project the data onto the eigenvector subspace, then use another clustering method such as K-means to cluster the data (Ng et al.; Zha et al.; Bach & Jordan, etc.)
– Linearized cluster assignment using spectral ordering and cluster crossing
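A sketch of the low-dimensional-embedding route in the spirit of Ng et al. (the function name, the row-normalization step, and the toy graph are assumptions on my part):

```python
import numpy as np
from scipy.linalg import eigh, block_diag
from scipy.cluster.vq import kmeans2

def spectral_kmeans(W, K, seed=0):
    """Embed in the K leading eigenvectors of D^{-1/2} W D^{-1/2},
    row-normalize, then cluster the embedded points with K-means."""
    d = W.sum(axis=1)
    Dis = np.diag(1.0 / np.sqrt(d))
    Wn = Dis @ W @ Dis                   # normalized similarity
    _, evecs = eigh(Wn)                  # eigenvalues ascending
    Y = evecs[:, -K:]                    # K leading eigenvectors
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    _, labels = kmeans2(Y, K, minit='++', seed=seed)
    return labels

# Three 3-node cliques plus a small uniform background weight
blk = np.ones((3, 3)) - np.eye(3)
W = block_diag(blk, blk, blk) + 0.01
np.fill_diagonal(W, 0.0)
labels = spectral_kmeans(W, K=3)
```

The embedded points form three tight groups, so K-means on the eigenspace recovers the three cliques.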
• **Scaled PCA: a Unified Framework for Clustering and Ordering.** Scaled PCA has two optimality properties: distance-sensitive ordering and min-max principle clustering. SPCA on a contingency table yields Correspondence Analysis: simultaneous ordering of rows and columns, and simultaneous clustering of rows and columns.
• **Scaled PCA.** Given a similarity matrix $S = (s_{ij})$ (e.g. generated from $XX^T$), let $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = s_{i.}$. Nonlinear re-scaling: $\tilde{S} = D^{-1/2} S D^{-1/2}$, i.e. $\tilde{s}_{ij} = s_{ij} / (s_{i.} s_{j.})^{1/2}$. Applying SVD on $\tilde{S}$ gives $S = D^{1/2} \tilde{S} D^{1/2} = D^{1/2} \sum_k z_k \lambda_k z_k^T D^{1/2} = D \left[ \sum_k q_k \lambda_k q_k^T \right] D$, where $q_k = D^{-1/2} z_k$ is the scaled principal component. Subtracting the trivial component $\lambda_0 = 1$, $z_0 = d^{1/2} / s_{..}^{1/2}$, $q_0 = \mathbf{1}$ gives $S - d d^T / s_{..} = D \sum_{k \geq 1} q_k \lambda_k q_k^T D$ (Ding et al., 2002).
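The decomposition above can be checked numerically; a sketch with an assumed toy similarity matrix:

```python
import numpy as np
from scipy.linalg import eigh

# Toy symmetric similarity matrix (assumed example)
S = np.array([[2.0, 1.0, 0.2],
              [1.0, 2.0, 0.5],
              [0.2, 0.5, 2.0]])
d = S.sum(axis=1)                    # d_i = s_i.
s_tot = S.sum()                      # s_..
Dh = np.diag(np.sqrt(d))             # D^{1/2}
Dhi = np.diag(1.0 / np.sqrt(d))      # D^{-1/2}

S_tilde = Dhi @ S @ Dhi              # s~_ij = s_ij / (s_i. s_j.)^{1/2}
lam, Z = eigh(S_tilde)               # eigenvalues ascending

Q = Dhi @ Z                          # scaled principal components q_k

# Trivial component: lambda = 1 with z_0 = d^{1/2} / s_..^{1/2}
z0 = np.sqrt(d) / np.sqrt(s_tot)

# S - d d^T / s_.. is reconstructed from the non-trivial components
recon = Dh @ (Z[:, :-1] * lam[:-1]) @ Z[:, :-1].T @ Dh
```

The top eigenpair of $\tilde{S}$ is exactly $(1, d^{1/2}/s_{..}^{1/2})$, and the remaining components reconstruct $S - dd^T/s_{..}$.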
• **Scaled PCA on a Rectangular Matrix ⇒ Correspondence Analysis.** Nonlinear re-scaling: $\tilde{P} = D_r^{-1/2} P D_c^{-1/2}$, i.e. $\tilde{p}_{ij} = p_{ij} / (p_{i.} p_{.j})^{1/2}$. Apply SVD on $\tilde{P}$ and subtract the trivial component: $P - r c^T / p_{..} = D_r \sum_{k \geq 1} f_k \lambda_k g_k^T D_c$, where $r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$, and $f_k = D_r^{-1/2} u_k$, $g_k = D_c^{-1/2} v_k$ are the scaled row and column principal components (standard coordinates in CA).
• **Correspondence Analysis (CA).** Mainly used for graphical display of data; popular in France (Benzécri, 1969). Long history:
– Simultaneous row and column regression (Hirschfeld, 1935)
– Reciprocal averaging (Richardson & Kuder, 1933; Horst, 1935; Fisher, 1940; Hill, 1974)
– Canonical correlations, dual scaling, etc.
The formulation is a bit complicated ("convoluted", Jolliffe, 2002, p. 342); "a neglected method" (Hill, 1974).
• **Clustering of Bipartite Graphs (rectangular matrix).** Simultaneous clustering of the rows and columns of a contingency table (adjacency matrix $B$). Examples of bipartite graphs:
– Information retrieval: word-by-document matrix
– Market basket data: transaction-by-item matrix
– DNA gene expression profiles
– Protein vs. protein-complex
• **Bipartite Graph Clustering.** Clustering indicators for rows and columns: $f(i) = 1$ if $r_i \in R_1$, $f(i) = -1$ if $r_i \in R_2$; $g(i) = 1$ if $c_i \in C_1$, $g(i) = -1$ if $c_i \in C_2$. With $B = \begin{pmatrix} B_{R_1,C_1} & B_{R_1,C_2} \\ B_{R_2,C_1} & B_{R_2,C_2} \end{pmatrix}$, $W = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$, $q = \begin{pmatrix} f \\ g \end{pmatrix}$, substitution gives $J_{MMC}(C_1, C_2; R_1, R_2) = \frac{s(W_{12})}{s(W_{11})} + \frac{s(W_{12})}{s(W_{22})}$, and $f, g$ are determined by $\left[ \begin{pmatrix} D_r & \\ & D_c \end{pmatrix} - \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \right] \begin{pmatrix} f \\ g \end{pmatrix} = \lambda \begin{pmatrix} D_r & \\ & D_c \end{pmatrix} \begin{pmatrix} f \\ g \end{pmatrix}$.
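A runnable sketch of this construction (the word-by-document matrix $B$ below is a made-up example):

```python
import numpy as np
from scipy.linalg import eigh

# Toy word-by-document matrix: words 0,1 co-occur with docs 0,1;
# words 2,3 with docs 2,3; the 0.1 entries are weak cross-links.
B = np.array([[3.0, 2.0, 0.1, 0.0],
              [2.0, 3.0, 0.0, 0.1],
              [0.0, 0.1, 3.0, 2.0],
              [0.1, 0.0, 2.0, 3.0]])
nr, nc = B.shape

# Bipartite adjacency W = [[0, B], [B^T, 0]] and its degree matrix
W = np.block([[np.zeros((nr, nr)), B],
              [B.T, np.zeros((nc, nc))]])
D = np.diag(W.sum(axis=1))

# (D - W) q = lambda D q; the second-smallest eigenvector is q = (f; g)
evals, evecs = eigh(D - W, D)
q = evecs[:, 1]
f, g = q[:nr], q[nr:]
row_labels = (f > 0).astype(int)     # co-cluster assignment of words
col_labels = (g > 0).astype(int)     # co-cluster assignment of documents
```

The signs of $f$ and $g$ co-cluster words 0, 1 with documents 0, 1, and words 2, 3 with documents 2, 3.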
• **Spectral Clustering of Bipartite Graphs.** Simultaneous clustering of rows and columns (adjacency matrix $B$), with $s(B_{R_1,C_2}) = \sum_{r_i \in R_1} \sum_{c_j \in C_2} b_{ij}$. Minimize the between-cluster cut weights $s(R_1, C_2)$, $s(R_2, C_1)$; maximize the within-cluster weights $s(R_1, C_1)$, $s(R_2, C_2)$: $J_{MMC}(C_1, C_2; R_1, R_2) = \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2 s(B_{R_1,C_1})} + \frac{s(B_{R_1,C_2}) + s(B_{R_2,C_1})}{2 s(B_{R_2,C_2})}$ (Ding, AI-STAT 2003).
• **Internet Newsgroups.** Simultaneous clustering of documents and words.
• **Embedding in Principal Subspace.** Cluster self-aggregation (proved by perturbation analysis). (Hall, 1970: "quadratic placement" (embedding) of a graph.)
• **Spectral Embedding: Self-aggregation.** Compute $K$ eigenvectors of the Laplacian, then embed the objects in the K-dimensional eigenspace (Ding, 2004).
• **Spectral embedding is not topology-preserving.** 700 3-D data points form two interlocking rings; in eigenspace, the rings shrink and separate.
• **Spectral Embedding: Simplex Embedding Theorem.** Objects self-aggregate to $K$ centroids, and the centroids are located at the $K$ corners of a simplex. The simplex consists of $K$ basis vectors plus the coordinate origin, and is rotated by an orthogonal transformation $T$, which is determined by perturbation analysis (Ding, 2004).
• **Perturbation Analysis.** $W q = \lambda D q$ becomes $\hat{W} z = (D^{-1/2} W D^{-1/2}) z = \lambda z$ with $q = D^{-1/2} z$. Assume the data has 3 dense clusters $C_1, C_2, C_3$, sparsely connected: $W = \begin{pmatrix} W_{11} & W_{12} & W_{13} \\ W_{21} & W_{22} & W_{23} \\ W_{31} & W_{32} & W_{33} \end{pmatrix}$. The off-diagonal blocks are between-cluster connections, assumed small and treated as a perturbation (Ding et al., KDD'01).
• **Spectral Perturbation Theorem.** The orthogonal transform matrix $T = (t_1, \ldots, t_K)$ is determined by $\bar{\Gamma} t_k = \lambda_k t_k$, where the spectral perturbation matrix is $\bar{\Gamma} = \Omega^{-1/2} \Gamma \Omega^{-1/2}$, with $\Gamma = \begin{pmatrix} h_{11} & -s_{12} & \cdots & -s_{1K} \\ -s_{21} & h_{22} & \cdots & -s_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ -s_{K1} & -s_{K2} & \cdots & h_{KK} \end{pmatrix}$, $s_{pq} = s(C_p, C_q)$, $h_{kk} = \sum_{p \neq k} s_{kp}$, and $\Omega = \mathrm{diag}[\rho(C_1), \ldots, \rho(C_K)]$.
• **Connectivity Network.** $C_{ij} = 1$ if $i, j$ belong to the same cluster, $0$ otherwise. Scaled PCA provides $C \cong D \sum_{k=1}^{K} q_k \lambda_k q_k^T D$. Green's function: $C \approx G \equiv \sum_{k=2}^{K} \frac{1}{1 - \lambda_k} q_k q_k^T$. Projection matrix: $C \approx P \equiv \sum_{k=1}^{K} q_k q_k^T$ (Ding et al., 2002).
• **Example 1: first-order perturbation.** (Figure: similarity matrix $W$ and the connectivity matrix from the first-order solution; $\lambda_2 = 0.300$ vs. $\lambda_2 = 0.268$ for the first-order solution.) Between-cluster connections are suppressed and within-cluster connections are enhanced: the effects of self-aggregation.
• **Optimality Properties of Scaled PCA.** Scaled principal components have optimality properties:
– Ordering: adjacent objects along the order are similar, far-away objects along the order are dissimilar. The optimal solution for the permutation index is given by scaled PCA.
– Clustering: maximize within-cluster similarity and minimize between-cluster similarity. The optimal solution for the cluster membership indicators is given by scaled PCA.
• **Spectral Graph Ordering.** (Barnard, Pothen & Simon, 1993) Envelope reduction of a sparse matrix: find an ordering such that the envelope is minimized, $\min \sum_i \max_j |i - j| \, w_{ij} \Rightarrow \min \sum_{ij} (x_i - x_j)^2 w_{ij}$. (Hall, 1970) "Quadratic placement of a graph": find coordinates $x$ minimizing $J = \sum_{ij} (x_i - x_j)^2 w_{ij} = x^T (D - W) x$. The solutions are eigenvectors of the Laplacian.
• **Distance-Sensitive Ordering.** Given a graph, find an optimal ordering of its nodes. A permutation is $\pi(1, \ldots, n) = (\pi_1, \ldots, \pi_n)$, and $J_d(\pi) = \sum_{i=1}^{n-d} w_{\pi_i, \pi_{i+d}}$ sums the weights at distance $d$ along the order (e.g. $J_{d=2}(\pi)$ contains terms like $w_{\pi_1, \pi_3}$). Minimize $J(\pi) = \sum_{d=1}^{n-1} d^2 J_d(\pi)$: the larger the distance, the larger the penalty weight.
• **Distance-Sensitive Ordering.** $J(\pi) = \sum_{i<j} (i - j)^2 w_{\pi_i, \pi_j} = \sum_{i<j} (\pi_i^{-1} - \pi_j^{-1})^2 w_{ij}$. Define shifted and rescaled inverse permutation indexes $q_i = \frac{\pi_i^{-1} - (n+1)/2}{n/2} \in \left\{ \frac{1-n}{n}, \frac{3-n}{n}, \ldots, \frac{n-1}{n} \right\}$. Then $J(\pi) = \frac{n^2}{8} \sum_{ij} (q_i - q_j)^2 w_{ij} = \frac{n^2}{4} q^T (D - W) q$.
• **Distance-Sensitive Ordering.** Once $q_2$ is computed, since $q_2(i) < q_2(j) \Rightarrow \pi_i^{-1} < \pi_j^{-1}$, the inverse permutation $\pi_i^{-1}$ can be uniquely recovered from $q_2$. Implementation: sorting $q_2$ induces $\pi$.
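A sketch of the whole pipeline on a scrambled chain graph (the example graph is my own; for the relaxation I use plain Laplacian eigenvectors, as in Hall's quadratic placement):

```python
import numpy as np
from scipy.linalg import eigh

# Chain graph 0-1-2-...-7, then scramble the node order
n = 8
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
perm = np.random.default_rng(0).permutation(n)
W = A[np.ix_(perm, perm)]            # similarity in scrambled order

# Second eigenvector of the Laplacian D - W (the Fiedler vector)
D = np.diag(W.sum(axis=1))
_, evecs = eigh(D - W)
q2 = evecs[:, 1]

order = np.argsort(q2)               # sorting q2 induces the ordering
W_ordered = W[np.ix_(order, order)]  # chain structure is recovered
```

For a chain, the Fiedler vector is strictly monotone along the chain, so sorting it recovers the chain order (possibly reversed) and the re-ordered matrix is banded.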
• **Re-ordering of Genes and Tissues.** (Figure: gene expression matrix before and after re-ordering.) $r = J(\pi) / J(\mathrm{random}) = 0.18$; $r_{d=1} = J_{d=1}(\pi) / J_{d=1}(\mathrm{random}) = 3.39$.
• **Spectral Clustering vs. Spectral Ordering.** The continuous approximations of both integer programming problems are given by the same eigenvector; different problems can share the same continuous approximate solution. The quality of the approximation differs: ordering has better quality, since the solution is relaxed from a set of evenly spaced discrete values, while clustering has lower quality, since the solution is relaxed from only 2 discrete values.
• **Linearized Cluster Assignment.** Turn spectral clustering into a 1-D clustering problem:
– Spectral ordering on the connectivity network
– Cluster crossing: sum the similarities along the anti-diagonal, giving a 1-D curve with valleys and peaks
– Divide the valleys and peaks into clusters
• **Cluster overlap and crossing.** Given a similarity matrix $W$ and clusters $A$, $B$:
– Cluster overlap: $s(A,B) = \sum_{i \in A} \sum_{j \in B} w_{ij}$
– Cluster crossing computes a small fraction of the cluster overlap. It depends on an ordering $o$: it sums the weights crossing site $i$ along the order, $\rho(i) = \sum_{j=1}^{m} w_{o(i-j), o(i+j)}$
– This is a sum along the anti-diagonals of $W$.
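A sketch of $\rho(i)$ as a partial anti-diagonal sum (the function name and toy data are my own illustration):

```python
import numpy as np

def cluster_crossing(W, order, m=2):
    """rho(i) = sum_{j=1..m} w_{o(i-j), o(i+j)}: the weight crossing
    site i along ordering o, an anti-diagonal sum of the re-ordered W."""
    Wo = W[np.ix_(order, order)]
    n = len(order)
    rho = np.zeros(n)
    for i in range(n):
        for j in range(1, m + 1):
            if 0 <= i - j and i + j < n:
                rho[i] += Wo[i - j, i + j]
    return rho

# Two 3-node blocks in natural order: rho dips at the block boundary
W = np.array([[0.0, 1.0, 1.0, 0.1, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]])
rho = cluster_crossing(W, np.arange(6))
```

The curve peaks inside each block ($\rho(1) = \rho(4) = 1$) and drops to zero at the boundary sites 2 and 3, which is the valley that marks the cluster cut.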
• **Cluster crossing.** (Figure.)
• **K-way Clustering Experiments.** Accuracy of clustering results:

| Data | Linearized Assignment | Recursive 2-way clustering | Embedding + K-means |
|------|-----------------------|----------------------------|---------------------|
| Data A | 89.0% | 82.8% | 75.1% |
| Data B | 75.7% | 67.2% | 56.4% |
• **Some Additional Advanced/Related Topics.**
– Random walks and normalized cut
– Semi-definite programming
– Sub-sampling in spectral clustering
– Extending to semi-supervised classification
– Green's function approach
– Out-of-sample embedding