- 1. Principal Components Analysis, Expectation Maximization, and more.
  Harsh Vardhan Sharma (Statistical Speech Technology Group, Beckman Institute for Advanced Science and Technology; Dept. of Electrical & Computer Engineering, University of Illinois at Urbana-Champaign).
  Group Meeting: December 01, 2009
- 2. Material for this presentation derived from:
  Tipping and Bishop, "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society (1999) 61:3, 611-622.
  Tipping and Bishop, "Mixtures of Principal Component Analyzers," Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13-18.
- 3. Outline:
  1. Principal Components Analysis
  2. Basic Model / Model Basics
  3. A brief digression: Inference and Learning
  4. Probabilistic PCA
  5. Expectation Maximization for Probabilistic PCA
  6. Mixture of Principal Component Analyzers
- 4. Outline (section: Principal Components Analysis).
- 5. PCA :: standard PCA in 1 slide.
  A well-established technique for dimensionality reduction. The most common derivation of PCA is as the linear projection maximizing the variance in the projected space:
  1. Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^N$ in a $p \times N$ matrix $X$ after subtracting the mean $\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i$.
  2. Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^k$: the $k$ eigenvectors of the data-covariance matrix $S_y = \frac{1}{N}\sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).
  3. The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The components of $x_i$ are then uncorrelated, and the projection-covariance matrix $S_x$ is diagonal with the $k$ largest eigenvalues of $S_y$.
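A minimal NumPy sketch of these three steps (not from the slides; observations are stored as rows rather than in the slide's $p \times N$ layout, and all names are this sketch's own):

```python
import numpy as np

def standard_pca(Y, k):
    """Y: (N, p) data matrix, one observation per row; k: number of components."""
    y_bar = Y.mean(axis=0)                     # sample mean
    Yc = Y - y_bar                             # mean-subtracted observations
    S_y = Yc.T @ Yc / Y.shape[0]               # p x p data-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S_y)     # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    W = eigvecs[:, idx]                        # p x k matrix of principal axes
    X = Yc @ W                                 # N x k principal components
    return X, W, eigvals[idx]

# Example: reduce 500 observations in R^10 to their first 2 principal components.
Y = np.random.default_rng(0).normal(size=(500, 10))
X, W, lam = standard_pca(Y, k=2)
```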
- 6. PCA :: Things to think about :: Assumptions behind standard PCA.
  1. Linearity: the problem is one of changing the basis. We have measurements in a particular basis and want to view the data in a basis that best expresses it; we restrict ourselves to bases that are linear combinations of the measurement basis.
  2. Large variances = important structure: we believe the data has high SNR, so the dynamics of interest are assumed to lie along the directions of largest variance, while lower-variance directions pertain to noise.
  3. Principal components are orthogonal: decorrelation-based dimensionality reduction removes redundancy in the original data representation.
- 7. PCA :: Things to think about :: Limitations of standard PCA.
  1. Decorrelation is not always the best approach: it is useful only when first- and second-order statistics are sufficient statistics for revealing all dependencies in the data (e.g., Gaussian-distributed data).
  2. The linearity assumption is not always justifiable: it fails when the data structure is captured by a nonlinear function of the dimensions in the measurement basis.
  3. Non-parametricity: there is no probabilistic model for the observed data. (Advantages of a probabilistic extension coming up!)
  4. Calculation of the data covariance matrix: when $p$ and $N$ are very large, difficulties arise in terms of computational complexity and data scarcity.
- 8. PCA :: Things to think about :: Handling the decorrelation and linearity caveats.
  Example solutions: Independent Components Analysis (imposing a more general notion of statistical dependency), kernel PCA (nonlinearly transforming the data to a more appropriate naive basis).
- 9. PCA :: Things to think about :: Handling the non-parametricity caveat: motivation for probabilistic PCA.
  - A probabilistic perspective provides a log-likelihood measure for comparison with other density-estimation techniques.
  - Bayesian inference methods may be applied (e.g., for model comparison).
  - pPCA can be used as a constrained Gaussian density model; potential applications include classification and novelty detection.
  - Multiple pPCA models can be combined as a probabilistic mixture.
  - Standard PCA uses a naive way to assess covariance (squared distance from the observed data); pPCA defines a proper covariance structure whose parameters can be estimated via EM.
- 10. PCA :: Things to think about :: Handling the computational caveat: motivation for EM-based PCA.
  - Computing the sample covariance itself is $O(Np^2)$.
  - Data scarcity: we often don't have enough data for the sample covariance to be full rank.
  - Computational complexity: direct diagonalization is $O(p^3)$.
  - Standard PCA doesn't deal properly with missing data; EM algorithms can estimate ML values of missing data.
  - EM-based PCA doesn't require computing the sample covariance and has $O(kNp)$ complexity.
- 11. Outline (section: Basic Model / Model Basics).
- 12. Basic Model :: PCA as a limiting case of linear Gaussian models.
  linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA
  $y_m - \mu = C x_m + \epsilon_m$, where for $m = 1, \ldots, N$:
  - $x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$: (hidden) state vector
  - $y_m \in \mathbb{R}^p$: output/observable vector
  - $C \in \mathbb{R}^{p \times k}$: observation/measurement matrix
  - $\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$: zero-mean white Gaussian noise
  So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
- 13. Basic Model :: PCA as a limiting case of linear Gaussian models.
  $y_m - \mu = C x_m + \epsilon_m$
  - The restriction to a zero-mean noise source is not a loss of generality.
  - All of the structure in $Q$ can be moved into $C$, so we can use $Q = I_{k \times k}$.
  - $R$ in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
  - Assumed: $C$ is of rank $k$; $Q$ and $R$ are always full rank.
- 14. Outline (section: A brief digression: Inference and Learning).
- 15. Inference, Learning :: Latent Variable Models and Probability Computations.
  Case 1: we know what the hidden states are and just want to estimate them (we can write down $C$ a priori based on the problem physics). Estimating the states given observations and a model → inference.
  Case 2: we have observation data, the observation process is mostly unknown, and there is no explicit model for the "causes". Learning a few parameters that model the data well (in the ML sense) → learning.
- 16. Inference, Learning :: Latent Variable Models and Probability Computations :: Inference.
  $y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$ gives us
  $P(x_m \mid y_m) = \dfrac{P(y_m \mid x_m)\, P(x_m)}{P(y_m)} = \dfrac{\mathcal{N}(\mu + C x_m, R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$
  Therefore, $x_m \mid y_m \sim \mathcal{N}\left(\beta (y_m - \mu),\ I - \beta C\right)$, where $\beta = C^T W^{-1} = C^T (C C^T + R)^{-1}$.
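A small illustrative sketch (NumPy; the helper name and argument conventions are this sketch's, not the slides') of the posterior just derived:

```python
import numpy as np

def latent_posterior(y, mu, C, R):
    """Posterior mean and covariance of x_m given y_m in the linear Gaussian model."""
    p, k = C.shape
    W = C @ C.T + R                       # marginal covariance of y_m
    beta = C.T @ np.linalg.inv(W)         # beta = C^T W^{-1}
    return beta @ (y - mu), np.eye(k) - beta @ C
```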
- 17. Inference, Learning :: Latent Variable Models and Probability Computations :: Learning, via Expectation Maximization.
  Given a likelihood function $L(\theta;\ Y = \{y_m\}_m,\ X = \{x_m\}_m)$, where $\theta$ is the parameter vector, $Y$ is the observed data, and $X$ represents the unobserved latent variables or missing values, the maximum likelihood estimate (MLE) of $\theta$ is obtained iteratively as follows:
  Expectation step: $\tilde{Q}\left(\theta \mid \theta^{(u)}\right) = E_{X \mid Y, \theta^{(u)}}\left[\log L(\theta; Y, X)\right]$
  Maximization step: $\theta^{(u+1)} = \arg\max_{\theta} \tilde{Q}\left(\theta \mid \theta^{(u)}\right)$
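Schematically (a purely illustrative skeleton, not from the slides; `e_step` and `m_step` stand in for the model-specific computations developed on the next slides), the iteration looks like:

```python
def em(theta, Y, e_step, m_step, n_iters=50):
    """Schematic EM loop: alternate expected sufficient statistics and re-estimation."""
    for _ in range(n_iters):
        stats = e_step(theta, Y)   # E-step: expectations under x | y, theta^(u)
        theta = m_step(stats, Y)   # M-step: parameters maximizing Q(theta | theta^(u))
    return theta
```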
- 18. Inference, Learning :: Latent Variable Models and Probability Computations :: Learning, via Expectation Maximization, for linear Gaussian models.
  Use the solution to the inference problem to estimate the unknown latent variables / missing values $X$, given $Y$ and $\theta^{(u)}$. Then use this fictitious "complete" data to solve for $\theta^{(u+1)}$.
  Expectation step: obtain the conditional latent sufficient statistics $\langle x_m \rangle^{(u)}$ and $\langle x_m x_m^T \rangle^{(u)}$ from
  $x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\right)$,
  where $\beta^{(u)} = C^{(u)T} \left(W^{(u)}\right)^{-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + R^{(u)}\right)^{-1}$.
  Maximization step: choose $C$, $R$ to maximize the joint likelihood of $X$, $Y$.
- 19. Outline (section: Probabilistic PCA).
- 20. pPCA :: PCA as a limiting case of linear Gaussian models (recap).
  $y_m - \mu = C x_m + \epsilon_m$
  - The restriction to a zero-mean noise source is not a loss of generality.
  - All of the structure in $Q$ can be moved into $C$, so we can use $Q = I_{k \times k}$.
  - $R$ in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
  - Assumed: $C$ is of rank $k$; $Q$ and $R$ are always full rank.
  So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
- 21. pPCA :: PCA as a limiting case of linear Gaussian models :: linear Gaussian model → Factor Analysis.
  - $k < p$: we look for a more parsimonious representation of the observed data.
  - $R$ needs to be restricted: otherwise the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing $C = 0$ and $R = W =$ data-sample covariance).
  - Since $y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$, we can do no better than having the model covariance equal the data-sample covariance.
- 22. pPCA :: Factor Analysis ≡ restricting R to be diagonal.
  - $x_m \equiv \{x_{mi}\}_{i=1}^{k}$: factors explaining the correlations between the $p$ observation variables $y_m \equiv \{y_{mj}\}_{j=1}^{p}$.
  - The $\{y_{mj}\}_{j=1}^{p}$ are conditionally independent given $\{x_{mi}\}_{i=1}^{k}$.
  - $\epsilon_j$: variability unique to a particular $y_{mj}$; $R = \mathrm{diag}(r_{jj})$: the "uniquenesses".
  - Different from standard PCA, which effectively treats covariance and variance identically.
- 23. pPCA :: Probabilistic PCA ≡ constraining R to $\sigma^2 I$.
  - Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
  - $y_m \mid x_m \sim \mathcal{N}(\mu + C x_m,\ \sigma^2 I)$
  - $y_m \sim \mathcal{N}(\mu,\ W = C C^T + \sigma^2 I)$
  - $x_m \mid y_m \sim \mathcal{N}\left(\beta (y_m - \mu),\ I - \beta C\right)$, where $\beta = C^T W^{-1} = C^T (C C^T + \sigma^2 I)^{-1}$
  - Equivalently, $x_m \mid y_m \sim \mathcal{N}\left(\kappa (y_m - \mu),\ \sigma^2 M^{-1}\right)$, where $\kappa = M^{-1} C^T = (C^T C + \sigma^2 I)^{-1} C^T$ and $M = C^T C + \sigma^2 I$
- 24. pPCA :: PCA as a limiting case of linear Gaussian models.
  $\beta = C^T W^{-1} = C^T \left(C C^T + \sigma^2 I_{p \times p}\right)^{-1}$
  $\kappa = M^{-1} C^T = \left(C^T C + \sigma^2 I_{k \times k}\right)^{-1} C^T$
  It can be shown (by applying the Woodbury matrix identity to $M$) that
  1. $I - \beta C = \sigma^2 M^{-1}$
  2. $\beta = \kappa$
  Then, by letting $\sigma^2 \to 0$, we obtain $x_m \mid y_m \to \delta\left(x_m - (C^T C)^{-1} C^T (y_m - \mu)\right)$, which is standard PCA.
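The two identities are easy to verify numerically; a quick NumPy sanity check on random parameters (illustrative only, dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma2 = 6, 2, 0.5
C = rng.normal(size=(p, k))

W = C @ C.T + sigma2 * np.eye(p)          # p x p marginal covariance
M = C.T @ C + sigma2 * np.eye(k)          # k x k matrix from the kappa form
beta = C.T @ np.linalg.inv(W)             # C^T W^{-1}
kappa = np.linalg.inv(M) @ C.T            # M^{-1} C^T

assert np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M))   # identity 1
assert np.allclose(beta, kappa)                                       # identity 2
```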
- 25. pPCA :: closed-form ML learning.
  Log-likelihood for the pPCA model:
  $L(\theta; Y) = -\frac{N}{2}\left[p \log(2\pi) + \log|W| + \mathrm{trace}\left(W^{-1} S_y\right)\right]$
  where $W = C C^T + \sigma^2 I$ and $S_y = \frac{1}{N}\sum_{m=1}^{N} (y_m - \mu)(y_m - \mu)^T$.
  The ML estimates of $\mu$ and $S_y$ are the sample mean and sample covariance matrix respectively:
  $\hat{\mu} = \frac{1}{N}\sum_{m=1}^{N} y_m$, $\quad \hat{S}_y = \frac{1}{N}\sum_{m=1}^{N} (y_m - \hat{\mu})(y_m - \hat{\mu})^T$.
- 26. pPCA :: closed-form ML learning.
  ML estimates of $C$ and $\sigma^2$:
  $\hat{C} = U_k \left(\Lambda_k - \sigma^2 I\right)^{1/2} V$: maps the latent space (containing $X$) to the principal subspace of $Y$.
  - Columns of $U_k \in \mathbb{R}^{p \times k}$: principal eigenvectors of $S_y$.
  - $\Lambda_k \in \mathbb{R}^{k \times k}$: diagonal matrix of the corresponding eigenvalues of $S_y$.
  - $V \in \mathbb{R}^{k \times k}$: arbitrary rotation matrix, can be set to $I_{k \times k}$.
  $\hat{\sigma}^2 = \frac{1}{p-k}\sum_{r=k+1}^{p} \lambda_r$: the variance lost in the projection process, averaged over the number of dimensions projected out/away.
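A hedged NumPy sketch of these closed-form estimates (observations as rows; the rotation $V$ set to the identity as the slide allows; all names are this sketch's own):

```python
import numpy as np

def ppca_closed_form(Y, k):
    """Closed-form ML estimates (mu, C, sigma^2) for pPCA; Y is (N, p)."""
    N, p = Y.shape
    mu_hat = Y.mean(axis=0)
    S_y = (Y - mu_hat).T @ (Y - mu_hat) / N               # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S_y)
    order = np.argsort(eigvals)[::-1]
    lam, U = eigvals[order], eigvecs[:, order]
    sigma2_hat = lam[k:].mean()                           # mean of the discarded eigenvalues
    C_hat = U[:, :k] * np.sqrt(lam[:k] - sigma2_hat)      # U_k (Lambda_k - sigma^2 I)^(1/2)
    return mu_hat, C_hat, sigma2_hat
```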
- 27. Outline (section: Expectation Maximization for Probabilistic PCA).
- 28. EM for pPCA :: EM-based ML learning for linear Gaussian models.
  Expectation step: $x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\right)$,
  where $\beta^{(u)} = C^{(u)T} \left(W^{(u)}\right)^{-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + R^{(u)}\right)^{-1}$.
  Maximization step: choose $C$, $R$ to maximize the joint likelihood of $X$, $Y$.
- 29. EM for pPCA :: EM-based ML learning for probabilistic PCA.
  Expectation step: $x_m \mid y_m \sim \mathcal{N}\left(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)}\right)$,
  where $\beta^{(u)} = C^{(u)T} \left(W^{(u)}\right)^{-1} = C^{(u)T} \left(C^{(u)} C^{(u)T} + \sigma^{2(u)} I\right)^{-1}$.
  Maximization step: choose $C$, $\sigma^2$ to maximize the joint likelihood of $X$, $Y$.
- 30. EM for pPCA :: EM-based ML learning for probabilistic PCA.
  Expectation step: $x_m \mid y_m \sim \mathcal{N}\left(\kappa^{(u)} (y_m - \mu),\ \sigma^{2(u)} \left(M^{(u)}\right)^{-1}\right)$,
  where $\kappa^{(u)} = \left(M^{(u)}\right)^{-1} C^{(u)T} = \left(C^{(u)T} C^{(u)} + \sigma^{2(u)} I\right)^{-1} C^{(u)T}$.
  Maximization step: choose $C$, $\sigma^2$ to maximize the joint likelihood of $X$, $Y$.
- 31. EM for pPCA :: EM-based ML learning for probabilistic PCA.
  Expectation step: compute, for $m = 1, \ldots, N$,
  $\langle x_m \rangle = \left(M^{(u)}\right)^{-1} C^{(u)T} (y_m - \hat{\mu})$
  $\langle x_m x_m^T \rangle = \sigma^{2(u)} \left(M^{(u)}\right)^{-1} + \langle x_m \rangle \langle x_m \rangle^T$
  Maximization step: set
  $C^{(u+1)} = \left[\sum_{m=1}^{N} (y_m - \hat{\mu}) \langle x_m \rangle^T\right]\left[\sum_{m=1}^{N} \langle x_m x_m^T \rangle\right]^{-1}$
  $\sigma^{2(u+1)} = \frac{1}{Np}\sum_{m=1}^{N}\left[\left\|y_m - \hat{\mu}\right\|^2 - 2 \langle x_m \rangle^T C^{(u+1)T} (y_m - \hat{\mu}) + \mathrm{trace}\left(\langle x_m x_m^T \rangle C^{(u+1)T} C^{(u+1)}\right)\right]$
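These per-sample updates vectorize naturally; a minimal sketch of one EM iteration (illustrative variable names, data in rows, $\hat{\mu}$ assumed precomputed as the sample mean):

```python
import numpy as np

def ppca_em_step(Y, mu, C, sigma2):
    """One E+M step. Y: (N, p); C: (p, k); returns updated (C, sigma2)."""
    N, p = Y.shape
    k = C.shape[1]
    Minv = np.linalg.inv(C.T @ C + sigma2 * np.eye(k))   # (M^(u))^{-1}
    Yc = Y - mu
    Ex = Yc @ C @ Minv                                   # rows are <x_m>^T
    sum_Exx = N * sigma2 * Minv + Ex.T @ Ex              # sum_m <x_m x_m^T>
    C_new = (Yc.T @ Ex) @ np.linalg.inv(sum_Exx)         # M-step for C
    resid = (np.sum(Yc ** 2)
             - 2.0 * np.sum((Yc @ C_new) * Ex)
             + np.trace(sum_Exx @ C_new.T @ C_new))
    sigma2_new = resid / (N * p)                         # M-step for sigma^2
    return C_new, sigma2_new
```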
- 32. EM for pPCA :: EM-based ML learning for probabilistic PCA.
  Equivalently, in terms of the sample covariance:
  $C^{(u+1)} = \hat{S}_y C^{(u)} \left(\sigma^{2(u)} I + \left(M^{(u)}\right)^{-1} C^{(u)T} \hat{S}_y C^{(u)}\right)^{-1}$
  $\sigma^{2(u+1)} = \frac{1}{p}\,\mathrm{trace}\left(\hat{S}_y - \hat{S}_y C^{(u)} \left(M^{(u)}\right)^{-1} C^{(u+1)T}\right)$
  - Convergence: the only stable local extremum is the global maximum, at which the true principal subspace is found. The paper doesn't discuss any initialization scheme(s).
  - Complexity: we require terms of the form $SC$ and $\mathrm{trace}(S)$.
    Computing $SC$ as $\sum_m y_m \left(y_m^T C\right)$ is $O(kNp)$ and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to forming $S$ explicitly ($O(Np^2)$). Very efficient for $k \ll p$.
    We require $\mathrm{trace}(S)$, not $S$, so computing only the variance along each coordinate is sufficient.
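The complexity point is just an associativity choice; a small sketch (assuming mean-centered data stored in rows, names invented here) of computing $S C$ without ever forming the $p \times p$ covariance:

```python
import numpy as np

def cov_times_C(Y, mu, C):
    """S_y @ C in O(kNp) time, without materializing the p x p sample covariance."""
    Yc = Y - mu                                # (N, p) mean-subtracted data
    return Yc.T @ (Yc @ C) / Y.shape[0]        # sum_m (y_m - mu)((y_m - mu)^T C) / N
```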
- 33. Outline (section: Mixture of Principal Component Analyzers).
- 34. Mixture of pPCAs :: the model.
  Likelihood of the observed data:
  $L\left(\theta = \left\{\mu_r, C_r, \sigma_r^2\right\}_{r=1}^{M};\ Y\right) = \sum_{m=1}^{N} \log\left[\sum_{r=1}^{M} \pi_r \, p(y_m \mid r)\right]$
  where, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$:
  - $y_m \mid r \sim \mathcal{N}\left(\mu_r,\ W_r = C_r C_r^T + \sigma_r^2 I\right)$: a single pPCA model
  - $\{\pi_r\}$, $\pi_r \geq 0$, $\sum_{r=1}^{M} \pi_r = 1$: mixture weights
  - $M$ independent latent variables $x_{mr}$ for each $y_m$.
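A hedged sketch (plain NumPy; the function and argument names are invented here) of evaluating this log-likelihood, using a log-sum-exp over components for numerical stability:

```python
import numpy as np

def mixture_loglik(Y, pis, mus, Cs, sigma2s):
    """Log-likelihood of Y (N, p) under a mixture of pPCA components."""
    N, p = Y.shape
    log_joint = []                                        # columns: log[pi_r * p(y_m | r)]
    for pi_r, mu_r, C_r, s2_r in zip(pis, mus, Cs, sigma2s):
        W_r = C_r @ C_r.T + s2_r * np.eye(p)              # component covariance
        _, logdet = np.linalg.slogdet(W_r)
        diff = Y - mu_r
        maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(W_r), diff)
        log_joint.append(np.log(pi_r) - 0.5 * (p * np.log(2 * np.pi) + logdet + maha))
    log_joint = np.stack(log_joint, axis=1)               # shape (N, M)
    a = log_joint.max(axis=1, keepdims=True)              # log-sum-exp shift
    return float(np.sum(a[:, 0] + np.log(np.exp(log_joint - a).sum(axis=1))))
```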
- 35. Mixture of pPCAs :: EM-based ML learning.
  Stage 1: new estimates of the component-specific $\pi$, $\mu$.
  Expectation step: the component's responsibility for generating an observation,
  $R_{mr}^{(u+1)} = P^{(u)}(r \mid y_m) = \dfrac{p^{(u)}(y_m \mid r)\, \pi_r^{(u)}}{\sum_{r'=1}^{M} p^{(u)}(y_m \mid r')\, \pi_{r'}^{(u)}}$
  Maximization step:
  $\pi_r^{(u+1)} = \frac{1}{N}\sum_{m=1}^{N} R_{mr}^{(u+1)}$
  $\hat{\mu}_r^{(u+1)} = \dfrac{\sum_{m=1}^{N} R_{mr}^{(u+1)} y_m}{\sum_{m=1}^{N} R_{mr}^{(u+1)}}$
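A minimal sketch of Stage 1 (illustrative; `log_p` holds precomputed values of $\log p^{(u)}(y_m \mid r)$, e.g. obtained the way the earlier density sketch computes them):

```python
import numpy as np

def mixture_stage1(Y, log_p, pis):
    """Stage 1 updates. log_p: (N, M) array of log p(y_m | r); pis: (M,) weights."""
    log_post = np.log(pis) + log_p                       # log[pi_r * p(y_m | r)]
    log_post -= log_post.max(axis=1, keepdims=True)      # stabilize before exponentiating
    R = np.exp(log_post)
    R /= R.sum(axis=1, keepdims=True)                    # responsibilities R_mr
    pis_new = R.mean(axis=0)                             # pi_r^(u+1) = (1/N) sum_m R_mr
    mus_new = (R.T @ Y) / R.sum(axis=0)[:, None]         # responsibility-weighted means
    return R, pis_new, mus_new
```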
- 36. Mixture of pPCAs :: EM-based ML learning.
  Stage 2: new estimates of the component-specific $C$, $\sigma^2$.
  Expectation step: compute, for $m = 1, \ldots, N$ and $r = 1, \ldots, M$,
  $\langle x_{mr} \rangle = \left(M_r^{(u)}\right)^{-1} C_r^{(u)T} \left(y_m - \hat{\mu}_r^{(u+1)}\right)$
  $\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} \left(M_r^{(u)}\right)^{-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$
  Maximization step: set, for $r = 1, \ldots, M$,
  $C_r^{(u+1)} = \left[\sum_{m=1}^{N} R_{mr}^{(u+1)} \left(y_m - \hat{\mu}_r^{(u+1)}\right) \langle x_{mr} \rangle^T\right]\left[\sum_{m=1}^{N} R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle\right]^{-1}$
  $\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p}\sum_{m=1}^{N} R_{mr}^{(u+1)}\left[\left\|y_m - \hat{\mu}_r^{(u+1)}\right\|^2 - 2 \langle x_{mr} \rangle^T C_r^{(u+1)T}\left(y_m - \hat{\mu}_r^{(u+1)}\right) + \mathrm{trace}\left(\langle x_{mr} x_{mr}^T \rangle C_r^{(u+1)T} C_r^{(u+1)}\right)\right]$
- 37. Mixture of pPCAs :: EM-based ML learning.
  Equivalently, in terms of responsibility-weighted sample covariances:
  $C_r^{(u+1)} = \hat{S}_{yr}^{(u+1)} C_r^{(u)}\left(\sigma_r^{2(u)} I + \left(M_r^{(u)}\right)^{-1} C_r^{(u)T} \hat{S}_{yr}^{(u+1)} C_r^{(u)}\right)^{-1}$
  $\sigma_r^{2(u+1)} = \frac{1}{p}\,\mathrm{trace}\left(\hat{S}_{yr}^{(u+1)} - \hat{S}_{yr}^{(u+1)} C_r^{(u)} \left(M_r^{(u)}\right)^{-1} C_r^{(u+1)T}\right)$
  where
  $\hat{S}_{yr}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N}\sum_{m=1}^{N} R_{mr}^{(u+1)}\left(y_m - \hat{\mu}_r^{(u+1)}\right)\left(y_m - \hat{\mu}_r^{(u+1)}\right)^T$
  $M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$
- 38. Mixture of pPCAs :: EM-based ML learning.
  (The compact updates for $C_r^{(u+1)}$ and $\sigma_r^{2(u+1)}$ are as on the previous slide.)
  Complexity: as in the single-model case, we require only terms of the form $SC$ and $\mathrm{trace}(S)$.
  - Computing $SC$ as $\sum_m y_m \left(y_m^T C\right)$ is $O(kNp)$ and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to forming $S$ explicitly ($O(Np^2)$). Very efficient for $k \ll p$.
  - We require $\mathrm{trace}(S)$, not $S$, so computing only the variance along each coordinate is sufficient.
- 39. Thank You
