Principal Component Analysis and Matrix Factorizations for Learning (Part 1), Chris Ding, ICML 2005 Tutorial

  1. Principal Component Analysis and Matrix Factorizations for Learning
     Chris Ding, Lawrence Berkeley National Laboratory
     Supported by the Office of Science, U.S. Dept. of Energy
  2. Many unsupervised learning methods are closely related in a simple way:
     • PCA, NMF, K-means, Spectral clustering
     • Indicator Matrix Quadratic Clustering covers clustering, semi-supervised
       classification, semi-supervised clustering, and outlier detection
  3. Part 1.A. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
     • Widely used in a large number of different fields
     • Most widely known as PCA (multivariate statistics)
     • SVD is the theoretical basis for PCA
  4. Brief history
     • PCA
       – Draw a plane closest to the data points (Pearson, 1901)
       – Retain most variance (Hotelling, 1933)
     • SVD
       – Low-rank approximation (Eckart-Young, 1936)
       – Practical application / efficient computation (Golub-Kahan, 1965)
     • Many generalizations
  5. PCA and SVD
     Data: n points in p dimensions: $X = (x_1, x_2, \ldots, x_n)$
     Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$
     Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$
     Principal directions $u_k$ (principal axis, subspace); principal components $v_k$ (projection on the subspace)
     Underlying basis, the SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
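     Not part of the original slides: a minimal NumPy sketch of the relations above,
     checking that the principal directions from the covariance matrix coincide with the
     left singular vectors of X (data and sizes are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))          # p = 5 dimensions, n = 100 points (columns)
X = X - X.mean(axis=1, keepdims=True)  # center each variable

# SVD: X = U Sigma V^T
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Principal directions u_k are eigenvectors of the covariance C = X X^T,
# with eigenvalues lambda_k = sigma_k^2.
C = X @ X.T
lam, U_eig = np.linalg.eigh(C)
assert np.allclose(np.sort(lam)[::-1], sigma**2)

# Principal components v_k are eigenvectors of the Gram matrix X^T X.
components = Vt                        # rows are v_k^T
```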
  6. Further Developments
     SVD/PCA
     • Principal Curves
     • Independent Component Analysis
     • Sparse SVD/PCA (many approaches)
     • Mixture of Probabilistic PCA
     • Generalization to the exponential family, max-margin
     • Connection to K-means clustering
     Kernel (inner-product)
     • Kernel PCA
  7. Methods of PCA Utilization
     Data: $X = (x_1, x_2, \ldots, x_n)$
     Principal components (uncorrelated random variables):
       $u_k^T x = u_k(1)\, X_1 + \cdots + u_k(d)\, X_d$
     Dimension reduction: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
     Projection to a low-dimensional subspace: $U = (u_1, \ldots, u_k)$, $\tilde{X} = U^T X$
     Sphering the data (transform to unit covariance): $\tilde{X} = C^{-1/2} X = U \Sigma^{-1} U^T X$
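     A small sketch of my own illustrating the two uses listed above, projection onto a
     low-dimensional subspace and sphering; the normalization convention is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 200))          # p = 10 variables, n = 200 samples (columns)
X = X - X.mean(axis=1, keepdims=True)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
X_proj = U[:, :k].T @ X                 # projection onto the top-k principal subspace

# Sphering: X_tilde = C^{-1/2} X = U Sigma^{-1} U^T X (up to a factor of sqrt(n-1),
# depending on how the covariance is normalized).
X_white = U @ np.diag(1.0 / sigma) @ U.T @ X
cov_white = X_white @ X_white.T         # approximately the identity matrix
```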
  8. Applications of PCA/SVD
     • Most popular in multivariate statistics
     • Image processing, signal processing
     • Physics: principal axis, diagonalization of 2nd-order tensors (mass)
     • Climate: Empirical Orthogonal Functions (EOF)
     • Kalman filter: $s^{(t+1)} = A s^{(t)} + \varepsilon$, $P^{(t+1)} = A P^{(t)} A^T$
     • Reduced-order analysis
  9. Applications of PCA/SVD
     • PCA/SVD is as widely used as the Fast Fourier Transform
       – Both are spectral expansions
       – FFT is used more for partial differential equations
       – PCA/SVD is used more for discrete (data) analysis
       – PCA/SVD surpasses FFT as computational science advances further
     • PCA/SVD
       – Selects combinations of variables
       – Dimension reduction: an image has $10^4$ pixels, but its true dimension is ~20!
 10. PCA is a Matrix Factorization (spectral/eigen decomposition)
     Principal directions: $U = (u_1, u_2, \ldots, u_k)$
     Principal components: $V = (v_1, v_2, \ldots, v_k)$
     Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T = U \Lambda U^T$
     Kernel matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T = V \Lambda V^T$
     Underlying basis, the SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T = U \Sigma V^T$
 11. From PCA to spectral clustering using generalized eigenvectors
     Consider the kernel matrix $W_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$
     In kernel PCA we compute the eigenvector: $W v = \lambda v$
     Generalized eigenvector: $W q = \lambda D q$, where $D = \mathrm{diag}(d_1, \ldots, d_n)$, $d_i = \sum_j w_{ij}$
     This leads to spectral clustering!
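     A minimal sketch (toy data and Gaussian similarity are my own assumptions) contrasting
     the kernel-PCA eigenproblem $Wv = \lambda v$ with the generalized eigenproblem
     $Wq = \lambda Dq$ used in spectral clustering.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / 2.0)                   # similarity (kernel) matrix

D = np.diag(W.sum(axis=1))              # degree matrix, d_i = sum_j w_ij

# Kernel-PCA-style eigenvectors: W v = lambda v
lam_v, V = eigh(W)

# Generalized eigenvectors: W q = lambda D q (the second-largest one splits the clusters)
lam_q, Q = eigh(W, D)
q2 = Q[:, -2]
labels = (q2 > 0).astype(int)           # sign gives the two-way cut
```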
 12. Scaled PCA ⇒ Spectral Clustering
     PCA: $W = \sum_k v_k \lambda_k v_k^T$
     Scaled PCA: $W = D^{\frac12} \tilde{W} D^{\frac12} = D \left( \sum_k q_k \lambda_k q_k^T \right) D$
     where $\tilde{W} = D^{-\frac12} W D^{-\frac12}$, $\tilde{w}_{ij} = w_{ij}/(d_i d_j)^{1/2}$,
     and $q_k = D^{-\frac12} v_k$ is the scaled principal component.
 13. Scaled PCA on a Rectangular Matrix ⇒ Correspondence Analysis
     Re-scaling: $\tilde{P} = D_r^{-\frac12} P D_c^{-\frac12}$, $\tilde{p}_{ij} = p_{ij}/(p_{i.}\, p_{.j})^{1/2}$
     Apply SVD on $\tilde{P}$ and subtract the trivial component:
     $P - r c^T / p_{..} = D_r \sum_k f_k \lambda_k g_k^T D_c$
     with $r = (p_{1.}, \ldots, p_{n.})^T$, $c = (p_{.1}, \ldots, p_{.n})^T$,
     $f_k = D_r^{-\frac12} u_k$, $g_k = D_c^{-\frac12} v_k$
     the scaled row and column principal components (standard coordinates in CA).
     (Zha et al., CIKM 2001; Ding et al., PKDD 2002)
 14. Nonnegative Matrix Factorization
     Data matrix: n points in p dimensions, $X = (x_1, x_2, \ldots, x_n)$, where each $x_i$ is an image, document, webpage, etc.
     Decomposition (low-rank approximation): $X \approx F G^T$
     Nonnegative matrices: $X_{ij} \ge 0$, $F_{ij} \ge 0$, $G_{ij} \ge 0$
     $F = (f_1, f_2, \ldots, f_k)$, $G = (g_1, g_2, \ldots, g_k)$
 15. Solving NMF with multiplicative updates
     $J = \|X - F G^T\|^2$, $F \ge 0$, $G \ge 0$
     Fix F, solve for G; fix G, solve for F.
     Lee & Seung (2000) propose
     $F_{ik} \leftarrow F_{ik} \dfrac{(XG)_{ik}}{(F G^T G)_{ik}}$,  $G_{jk} \leftarrow G_{jk} \dfrac{(X^T F)_{jk}}{(G F^T F)_{jk}}$
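     Not part of the tutorial itself: a short NumPy sketch of the Lee & Seung multiplicative
     updates written on the slide (the rank, iteration count, and epsilon guard are my own
     choices).

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    """Factor X (p x n, nonnegative) as F G^T with F (p x k), G (n x k) nonnegative."""
    rng = np.random.default_rng(0)
    p, n = X.shape
    F = rng.random((p, k))
    G = rng.random((n, k))
    for _ in range(n_iter):
        # Fix F, update G; then fix G, update F (eps avoids division by zero).
        G *= (X.T @ F) / (G @ F.T @ F + eps)
        F *= (X @ G) / (F @ G.T @ G + eps)
    return F, G

X = np.abs(np.random.default_rng(1).normal(size=(30, 50)))
F, G = nmf(X, k=5)
err = np.linalg.norm(X - F @ G.T)   # the objective J decreases under these updates
```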
 16. Matrix Factorization Summary
     Symmetric matrix (kernel matrix, graph):
       • PCA: $W = V \Lambda V^T$
       • Scaled PCA: $W = D^{\frac12} \tilde{W} D^{\frac12} = D Q \Lambda Q^T D$
       • NMF: $W \approx Q Q^T$
     Rectangular matrix (contingency table, bipartite graph):
       • PCA (SVD): $X = U \Sigma V^T$
       • Scaled PCA: $X = D_r^{\frac12} \tilde{X} D_c^{\frac12} = D_r F \Lambda G^T D_c$
       • NMF: $X \approx F G^T$
 17. Indicator Matrix Quadratic Clustering
     Unsigned cluster indicator matrix $H = (h_1, \ldots, h_K)$
     Kernel K-means clustering: $\max_H \mathrm{Tr}(H^T W H)$, s.t. $H^T H = I$, $H \ge 0$
       K-means: $W = X^T X$;  kernel K-means: $W = (\langle \phi(x_i), \phi(x_j) \rangle)$
     Spectral clustering (normalized cut): $\max_H \mathrm{Tr}(H^T W H)$, s.t. $H^T D H = I$, $H \ge 0$
     The difference between the two is the orthogonality constraint on H.
 18. Indicator Matrix Quadratic Clustering: additional features
     Semi-supervised classification: $\max_H \mathrm{Tr}(H^T W H + C^T H)$
     Semi-supervised clustering with (A) must-link and (B) cannot-link constraints:
       $\max_H \mathrm{Tr}(H^T W H + \alpha H^T A H - \beta H^T B H)$
     Outlier detection: $\max_H \mathrm{Tr}(H^T W H)$, allowing zero rows in H
     Nonnegative Lagrangian Relaxation:
       $H_{ik} \leftarrow H_{ik} \dfrac{(WH)_{ik} + C_{ik}/2}{(H\alpha)_{ik}}$,  $\alpha = H^T W H + H^T C$
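     A rough NumPy sketch (my own, not from the tutorial) of the multiplicative update
     exactly as written on the slide, for the semi-supervised objective
     $\mathrm{Tr}(H^T W H + C^T H)$; the toy data, initialization, and stopping rule are
     assumptions.

```python
import numpy as np

def nlr_cluster(W, C, K, n_iter=300, eps=1e-9):
    """W: n x n similarity matrix, C: n x K label-reward matrix, K clusters."""
    n = W.shape[0]
    rng = np.random.default_rng(0)
    H = rng.random((n, K)) + 0.1
    for _ in range(n_iter):
        alpha = H.T @ W @ H + H.T @ C            # Lagrange multiplier matrix
        H *= (W @ H + C / 2.0) / (H @ alpha + eps)
    return H

# Toy example: two blobs, no supervision (C = 0).
rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])
W = np.exp(-((pts[:, None] - pts[None]) ** 2).sum(-1))
H = nlr_cluster(W, C=np.zeros((30, 2)), K=2)
labels = H.argmax(axis=1)                        # hard assignment from the soft indicator
```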
 19. Tutorial Outline
     • PCA
       – Recent developments on PCA/SVD
       – Equivalence to K-means clustering
     • Scaled PCA
       – Laplacian matrix
       – Spectral clustering
       – Spectral ordering
     • Nonnegative Matrix Factorization
       – Equivalence to K-means clustering
       – Holistic vs. parts-based
     • Indicator Matrix Quadratic Clustering
       – Uses Nonnegative Lagrangian Relaxation
       – Includes K-means and spectral clustering, semi-supervised classification,
         semi-supervised clustering, and outlier detection
 20. Part 1.B. Recent Developments on PCA and SVD
     • Principal Curves
     • Independent Component Analysis
     • Kernel PCA
     • Mixture of PCA (probabilistic PCA)
     • Sparse PCA/SVD (semi-discrete, truncation, L1 constraint, direct sparsification)
     • Column-Partitioned Matrix Factorizations
     • 2D-PCA/SVD
     • Equivalence to K-means clustering
 21. PCA and SVD
     Data matrix: $X = (x_1, x_2, \ldots, x_n)$
     Covariance: $C = XX^T = \sum_{k=1}^{p} \lambda_k u_k u_k^T$
     Gram (kernel) matrix: $X^T X = \sum_{k=1}^{r} \lambda_k v_k v_k^T$
     Principal directions $u_k$ (principal axis, subspace); principal components $v_k$ (projection on the subspace)
     Underlying basis, the SVD: $X = \sum_{k=1}^{p} \sigma_k u_k v_k^T$
 22. Kernel PCA
     Map the data: $x_i \to \phi(x_i)$; kernel $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$
     PCA component $v$; feature extraction: $\langle v, \phi(x) \rangle = \sum_i v_i \langle \phi(x_i), \phi(x) \rangle$
     Indefinite kernels; generalization to graphs with nonnegative weights
     (Schölkopf, Smola, Müller, 1996)
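     A minimal kernel-PCA sketch (my own, assuming an RBF kernel): eigendecompose the
     centered kernel matrix and use its eigenvectors as the components $v$ on the slide.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))                     # 50 points in 2-D

sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                            # kernel K_ij = <phi(x_i), phi(x_j)>

# Center the implicit feature map phi(x_i) via the kernel matrix.
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

lam, V = np.linalg.eigh(Kc)                      # eigenvectors v of the kernel matrix
lam, V = lam[::-1], V[:, ::-1]                   # sort by decreasing eigenvalue

# Feature extraction for the training points: <v_k, phi(x_j)> = sum_i v_k(i) K(x_i, x_j)
features = (V[:, :2].T @ Kc).T                   # top-2 kernel principal components
```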
 23. Mixture of PCA
     • Data has local structure
       – Global PCA on all the data is not useful
     • Clustering PCA (Hinton et al.):
       – Use clustering to partition the data into clusters
       – Perform PCA within each cluster
       – No explicit generative model
     • Probabilistic PCA (Tipping & Bishop)
       – Latent variables
       – Generative model (Gaussian)
       – Mixture of Gaussians ⇒ mixture of PCA
       – Adding Markov dynamics for the latent variables (Linear Gaussian Models)
 24. Probabilistic PCA: Linear Gaussian Model
     Latent variables $S = (s_1, \ldots, s_n)$
     $x_i = W s_i + \mu + \varepsilon$, $\varepsilon \sim N(0, \sigma_\varepsilon^2 I)$
     Gaussian prior: $P(s) \sim N(s_0, \sigma_s^2 I)$
     Marginal: $x \sim N(W s_0 + \mu,\; \sigma_\varepsilon^2 I + \sigma_s^2 W W^T)$
     Linear Gaussian Model (with dynamics): $s_{i+1} = A s_i + \eta$, $x_i = W s_i + \varepsilon$
     (Tipping & Bishop, 1995; Roweis & Ghahramani, 1999)
 25. Sparse PCA
     • Compute a factorization $X \approx U V^T$ where U, V, or both are sparse
     • Why sparse?
       – Variable selection (sparse U)
       – Useful when $n \gg d$
       – Storage savings
       – Other new reasons?
     • L1 and L2 constraints
 26. Sparse PCA: Truncation and Discretization
     $X \approx U \Sigma V^T$, $U = (u_1, \ldots, u_k)$, $V = (v_1, \ldots, v_k)$
     • Sparsified SVD
       – Compute $\{u_k, v_k\}$ one pair at a time and truncate entries below a threshold
       – Recursively compute all pairs using deflation: $X \leftarrow X - \sigma u v^T$
       – (Zhang, Zha, Simon, 2002)
     • Semi-discrete decomposition
       – U, V only contain entries in {-1, 0, 1}
       – Iterative algorithm to compute U, V using deflation
       – (Kolda & O'Leary, 1999)
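     A rough sketch (threshold and data are my own choices) of the truncate-and-deflate
     idea above: take the leading singular pair, zero out small entries, deflate, repeat.

```python
import numpy as np

def sparsified_svd(X, k, thresh=0.1):
    X = X.copy().astype(float)
    us, vs, ss = [], [], []
    for _ in range(k):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        u, sigma, v = U[:, 0], s[0], Vt[0]
        u = np.where(np.abs(u) < thresh, 0.0, u)     # truncate small entries
        v = np.where(np.abs(v) < thresh, 0.0, v)
        us.append(u); vs.append(v); ss.append(sigma)
        X -= sigma * np.outer(u, v)                  # deflation: X <- X - sigma u v^T
    return np.column_stack(us), np.array(ss), np.column_stack(vs)

X = np.random.default_rng(5).normal(size=(20, 40))
U, s, V = sparsified_svd(X, k=3)
```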
 27. Sparse PCA: L1 constraint
     • LASSO (Tibshirani, 1996): $\min \|y - X^T \beta\|^2$, $\|\beta\|_1 \le t$
     • SCoTLASS (Jolliffe & Uddin, 2003): $\max\ u^T (XX^T) u$, s.t. $\|u\|_1 \le t$, $u^T u_h = 0$
     • Least Angle Regression (Efron et al., 2004)
     • Sparse PCA (Zou, Hastie, Tibshirani, 2004):
       $\min_{\alpha,\beta} \sum_{i=1}^{n} \|x_i - \alpha \beta^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1$, s.t. $\alpha^T \alpha = I$, with $v_j = \beta_j / \|\beta_j\|$
 28. Sparse PCA: Direct Sparsification
     • Sparse SVD with explicit sparsification
       $\min_{u,v} \|X - u d v^T\|_F + \mathrm{nnz}(u) + \mathrm{nnz}(v)$
       – rank-one approximation; minimize a bound; deflation (Zhang, Zha, Simon, 2003)
     • Direct sparse PCA, on the covariance matrix S:
       $\max\ u^T S u = \max \mathrm{Tr}(S u u^T) = \max \mathrm{Tr}(SU)$
       s.t. $\mathrm{Tr}(U) = 1$, $\mathrm{nnz}(U) \le k^2$, $U \succeq 0$, $\mathrm{rank}(U) = 1$
       (d'Aspremont, El Ghaoui, Jordan, Lanckriet, 2004)
 29. Sparse PCA Summary
     • Many different approaches
       – Truncation, discretization
       – L1 constraint
       – Direct sparsification
       – Other approaches
     • Sparse matrix factorization in general: L1 constraint
     • Many open questions
       – Orthogonality
       – Unique solution, global solution
 30. PCA: Further Generalizations
     • Generalization to the Exponential Family
       – (Collins, Dasgupta, Schapire, 2001)
     • Maximum Margin Matrix Factorization (Srebro, Rennie, Jaakkola, 2004)
       – Collaborative filtering; input Y is binary
       – Hard margin: $Y_{ia} X_{ia} \ge 1$, $\forall\, ia \in S$
       – Soft margin: $\min \|X\|_\Sigma + c \sum_{ia \in S} \max(0,\, 1 - Y_{ia} X_{ia})$
         where $X = U V^T$ and $\|X\|_\Sigma = \tfrac{1}{2}(\|U\|_{\mathrm{Fro}}^2 + \|V\|_{\mathrm{Fro}}^2)$
 31. Column-Partitioned Matrix Factorizations
     Column-partitioned data matrix, $n_1 + \cdots + n_k = n$:
       $X = (x_1, \ldots, x_n) = (x_1 \cdots x_{n_1},\; x_{n_1+1} \cdots x_{n_2},\; \ldots,\; x_{n_{k-1}+1} \cdots x_n)$  (Zhang & Zha, 2001)
     • Partitions are generated by clustering (Dhillon & Modha, 2001)
     • Centroid matrix $U = (u_1, \ldots, u_k)$, where $u_k$ is a cluster centroid (Park, Jeon & Rosen, 2003)
       – Fix U, compute V: $\min \|X - U V^T\|_F^2$ gives $V = X^T U (U^T U)^{-1}$
     • Represent each partition by an SVD; pick the leading U's to form
       $U = (U_1, \ldots, U_l) = (u_1^{(1)} \cdots u_{k_1}^{(1)}, \ldots, u_1^{(l)} \cdots u_{k_l}^{(l)})$
       – Fix U, compute V (Castelli, Thomasian & Li, 2003)
     • Several other variations (Zeimpekis & Gallopoulos, 2004)
 32. Two-Dimensional SVD
     • A large number of data objects are 2-D: images, maps
     • Standard method:
       – Convert (re-order) each image into a 1-D vector
       – Collect all 1-D vectors into a single (big) matrix
       – Apply SVD to the big matrix
     • 2D-SVD is developed for 2-D objects
       – Extension of standard SVD
       – Keeps the 2-D characteristics
       – Improves the quality of the low-dimensional approximation
       – Reduces computation and storage
 33. Linearize a 2-D object into a 1-D object
     [Figure: an image is re-ordered into a single pixel vector, e.g. $(0.0, 0.5, 0.7, 1.0, \ldots, 0.8, 0.2, 0.0)^T$]
 34. SVD and 2D-SVD
     SVD: $X = (x_1, x_2, \ldots, x_n)$; eigenvectors of $XX^T$ and $X^T X$
       $X = U \Sigma V^T$, $\Sigma = U^T X V$
     2D-SVD: $\{A\} = \{A_1, A_2, \ldots, A_n\}$; eigenvectors of
       $F = \sum_i (A_i - \bar{A})(A_i - \bar{A})^T$  (row-row covariance)
       $G = \sum_i (A_i - \bar{A})^T (A_i - \bar{A})$  (column-column covariance)
       $A_i = U M_i V^T$, $M_i = U^T A_i V$
 35. 2D-SVD
     $\{A\} = \{A_1, A_2, \ldots, A_n\}$, assume $\bar{A} = 0$
     Row-row covariance: $F = \sum_i A_i A_i^T = \sum_k \lambda_k u_k u_k^T$
     Column-column covariance: $G = \sum_i A_i^T A_i = \sum_k \zeta_k v_k v_k^T$
     Bilinear subspace: $U = (u_1, u_2, \ldots, u_k)$, $V = (v_1, v_2, \ldots, v_k)$
     $M_i = U^T A_i V$, $A_i = U M_i V^T$, $i = 1, \ldots, n$
     $A_i \in \mathbb{R}^{r \times c}$, $U \in \mathbb{R}^{r \times k}$, $V \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$
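     A small NumPy sketch of the 2D-SVD construction above (image sizes and rank are my own
     choices): eigenvectors of the row-row and column-column covariances give U and V, and
     each image is represented by the small core matrix $M_i = U^T A_i V$.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(100, 32, 24))        # n = 100 images, each r x c = 32 x 24
A = A - A.mean(axis=0)                    # enforce Abar = 0 as assumed on the slide

F = np.einsum('nij,nkj->ik', A, A)        # row-row covariance: sum_i A_i A_i^T  (r x r)
G = np.einsum('nji,njk->ik', A, A)        # col-col covariance: sum_i A_i^T A_i  (c x c)

k = 5
_, U = np.linalg.eigh(F); U = U[:, ::-1][:, :k]   # top-k eigenvectors of F
_, V = np.linalg.eigh(G); V = V[:, ::-1][:, :k]   # top-k eigenvectors of G

M = np.einsum('ri,nrc,ck->nik', U, A, V)          # M_i = U^T A_i V, one k x k core per image
A_rec = np.einsum('ri,nik,ck->nrc', U, M, V)      # A_i ~ U M_i V^T
```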
 36. 2D-SVD Error Analysis
     SVD: $\min \|X - U \Sigma V^T\|^2 = \sum_{i=k+1}^{p} \sigma_i^2$
     $A_i \approx L M_i R^T$, $A_i \in \mathbb{R}^{r \times c}$, $L \in \mathbb{R}^{r \times k}$, $R \in \mathbb{R}^{c \times k}$, $M_i \in \mathbb{R}^{k \times k}$
     $\min J_1 = \sum_{i=1}^{n} \|A_i - L M_i\|^2 = \sum_{j=k+1}^{c} \zeta_j$
     $\min J_2 = \sum_{i=1}^{n} \|A_i - M_i R^T\|^2 = \sum_{j=k+1}^{r} \lambda_j$
     $\min J_3 = \sum_{i=1}^{n} \|A_i - L M_i R^T\|^2 \cong \sum_{j=k+1}^{r} \lambda_j + \sum_{j=k+1}^{c} \zeta_j$
     $\min J_4 = \sum_{i=1}^{n} \|A_i - L M_i L^T\|^2 \cong 2 \sum_{j=k+1}^{r} \lambda_j$
 37. Temperature maps (January over 100 years)
     [Figure: reconstructed temperature maps]
     Reconstruction error ratio SVD/2D-SVD = 1.1; storage ratio SVD/2D-SVD = 8
 38. Reconstructed image
     [Figure: SVD vs. 2D-SVD reconstructions]
     SVD (K = 15): storage 160,560;  2D-SVD (K = 15): storage 93,060
 39. 2D-SVD Summary
     • 2D-SVD is an extension of the standard SVD
     • Provides optimal solutions for 4 representations of 2-D images/maps
     • Substantial improvements in storage, computation, and quality of reconstruction
     • Captures 2-D characteristics
 40. Part 1.C. K-means Clustering ⇔ Principal Component Analysis
     (Equivalence between PCA and K-means)
 41. K-means Clustering
     • Also called "isodata", "vector quantization"
     • Developed in the 1960s (Lloyd, MacQueen, Hartigan, etc.)
     • Computationally efficient (order mN)
     • Widely used in practice
       – Benchmark to evaluate other algorithms
     Given n points in m dimensions, $X = (x_1, x_2, \ldots, x_n)$, the K-means objective is
     $\min J_K = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - c_k\|^2$
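     Not from the tutorial: a compact sketch of Lloyd's algorithm for the objective $J_K$
     above (initialization, iteration count, and toy data are my own choices).

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """X: n x m data matrix (rows are points). Returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid, then recompute the centroids.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids

X = np.vstack([np.random.default_rng(7).normal(m, 0.5, (50, 2)) for m in (0, 4, 8)])
labels, centroids = kmeans(X, K=3)
J_K = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(3))
```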
 42. PCA is equivalent to K-means
     The continuous optimal solution for the cluster indicators in K-means clustering is given by the principal components.
     The subspace spanned by the K cluster centroids is given by the PCA subspace.
 43. 2-way K-means Clustering
     Cluster membership indicator:
       $q(i) = +\sqrt{n_2/(n_1 n)}$ if $i \in C_1$;  $q(i) = -\sqrt{n_1/(n_2 n)}$ if $i \in C_2$
     $J_K = n \langle x^2 \rangle - J_D$, where
     $J_D = \dfrac{n_1 n_2}{n}\left[ \dfrac{2\, d(C_1, C_2)}{n_1 n_2} - \dfrac{d(C_1, C_1)}{n_1^2} - \dfrac{d(C_2, C_2)}{n_2^2} \right]$
     Define the distance matrix $D = (d_{ij})$, $d_{ij} = |x_i - x_j|^2$. Then
     $J_D = -q^T D q = -q^T \tilde{D} q = 2 q^T (X^T X) q = 2 q^T K q$
     $\min J_K \Rightarrow \max J_D$; the solution is the principal eigenvector $v_1$ of K.
     The clusters are determined by: $C_1 = \{i \mid v_1(i) < 0\}$, $C_2 = \{i \mid v_1(i) \ge 0\}$
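     A quick numerical check (toy data of my own) of the slide's claim: for two-way
     K-means, the sign of the principal eigenvector $v_1$ of the centered Gram matrix
     splits the data into the two clusters.

```python
import numpy as np

rng = np.random.default_rng(8)
X = np.hstack([rng.normal(0, 0.4, (2, 30)), rng.normal(3, 0.4, (2, 20))])  # 2-dim, n = 50
X = X - X.mean(axis=1, keepdims=True)        # center the data

K = X.T @ X                                  # Gram (kernel) matrix
lam, V = np.linalg.eigh(K)
v1 = V[:, -1]                                # principal eigenvector (largest eigenvalue)

C1 = np.where(v1 < 0)[0]                     # C1 = {i : v1(i) < 0}
C2 = np.where(v1 >= 0)[0]                    # C2 = {i : v1(i) >= 0}
```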
 44. A simple illustration
     [Figure: toy example of the 2-way split]
 45. DNA Gene Expression File for Leukemia
     Using $v_1$, the tissue samples separate into 2 clusters with 3 errors.
     Running one more step of K-means reduces this to 1 error.
 46. Multi-way K-means Clustering
     Unsigned cluster membership indicators $h_1, \ldots, h_K$. Example with clusters $C_1, C_2, C_3$:
     $(h_1, h_2, h_3) = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
 47. Multi-way K-means Clustering
     $J_K = \sum_i x_i^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} x_i^T x_j = \sum_i x_i^2 - \sum_{k=1}^{K} h_k^T X^T X h_k$
     With the (unsigned) cluster indicators $H_K = (h_1, \ldots, h_K)$:
     $J_K = \sum_i x_i^2 - \mathrm{Tr}(H_K^T X^T X H_K)$
     Regularized relaxation. Redundancy: $\sum_{k=1}^{K} n_k^{1/2} h_k = e$.
     Transform $h_1, \ldots, h_K$ to $q_1, \ldots, q_K$ via an orthogonal matrix T:
     $(q_1, \ldots, q_K) = (h_1, \ldots, h_K)\, T$, i.e. $Q_K = H_K T$, with $q_1 = e/n^{1/2}$.
 48. Multi-way K-means Clustering (continued)
     $\max \mathrm{Tr}[Q_{K-1}^T (X^T X) Q_{K-1}]$, with $Q_{K-1} = (q_2, \ldots, q_K)$
     The optimal solutions for $q_2, \ldots, q_K$ are given by the principal components $v_2, \ldots, v_K$.
     $J_K$ is bounded below by the total variance minus the sum of the K-1 leading eigenvalues of the covariance:
     $n \langle x^2 \rangle - \sum_{k=1}^{K-1} \lambda_k < \min J_K < n \langle x^2 \rangle$
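     A small numerical check (toy data of my own, using scikit-learn's KMeans for the
     clustering step) of the bound above: any K-means objective value lies between the
     PCA-based lower bound and the total variance.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
X = np.hstack([rng.normal(m, 0.5, (2, 40)) for m in (0, 4, 8)])   # 3 blobs, columns are points
X = X - X.mean(axis=1, keepdims=True)
K = 3

km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X.T)
J_K = km.inertia_                            # K-means objective for this run

lam = np.linalg.eigvalsh(X.T @ X)[::-1]      # eigenvalues of the Gram matrix, descending
total_var = (X ** 2).sum()                   # n <x^2>
lower_bound = total_var - lam[:K - 1].sum()
assert lower_bound <= J_K <= total_var
```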
 49. Consistency: 2-way and K-way approaches
     Orthogonal transform: $T = \begin{pmatrix} \sqrt{n_2/n} & -\sqrt{n_1/n} \\ \sqrt{n_1/n} & \sqrt{n_2/n} \end{pmatrix}$
     T transforms $(h_1, h_2)$ to $(q_1, q_2)$, where $h_1 = (1 \cdots 1, 0 \cdots 0)^T$, $h_2 = (0 \cdots 0, 1 \cdots 1)^T$:
     $q_1 = (1, \ldots, 1)^T / n^{1/2}$,
     $q_2 = (a, \ldots, a, -b, \ldots, -b)^T$ with $a = \sqrt{n_2/(n_1 n)}$, $b = \sqrt{n_1/(n_2 n)}$
     This recovers the original 2-way cluster indicator.
 50. Test of Lower Bounds of K-means Clustering
     Measured relative gap: $|J_{\mathrm{opt}} - J_{\mathrm{LB}}| / J_{\mathrm{opt}}$
     The lower bound is within 0.6-1.5% of the optimal value.
 51. Cluster Subspace (spanned by the K centroids) = PCA Subspace
     Given a data point x, project it into the cluster subspace with $P = \sum_k c_k c_k^T$
     The centroids are $c_k = \sum_i h_k(i)\, x_i = X h_k$, so
     $P = \sum_k c_k c_k^T = X \Big(\sum_k h_k h_k^T\Big) X^T = X \Big(\sum_k v_k v_k^T\Big) X^T = \sum_k \lambda_k u_k u_k^T$
     $P_{K\text{-means}} = \sum_k \lambda_k u_k u_k^T \;\Leftrightarrow\; \sum_k u_k u_k^T \equiv P_{\mathrm{PCA}}$
     PCA automatically projects into the cluster subspace; PCA is an unsupervised version of LDA.
 52. Effectiveness of PCA Dimension Reduction
     [Figure]
 53. Kernel K-means Clustering
     Map the data: $x_i \to \phi(x_i)$. The kernel K-means objective:
     $\min J_K^\phi = \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - \phi(c_k)\|^2 = \sum_i |\phi(x_i)|^2 - \sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \phi(x_i)^T \phi(x_j)$
     Kernel K-means thus maximizes
     $\sum_{k=1}^{K} \frac{1}{n_k} \sum_{i,j \in C_k} \langle \phi(x_i), \phi(x_j) \rangle$
 54. Kernel K-means Clustering is Equivalent to Kernel PCA
     The continuous optimal solution for the cluster indicators is given by the kernel PCA components.
     The subspace spanned by the K cluster centroids is given by the kernel PCA principal subspace.
