Probabilistic PCA, EM, and more

This is a presentation that I gave to my research group. It is about probabilistic extensions to Principal Components Analysis, as proposed by Tipping and Bishop.


  1. Principal Components Analysis, Expectation Maximization, and more. Harsh Vardhan Sharma (Statistical Speech Technology Group, Beckman Institute for Advanced Science and Technology; Dept. of Electrical & Computer Engineering, University of Illinois at Urbana-Champaign). Group Meeting: December 01, 2009.
  2. Material for this presentation is derived from: Tipping and Bishop, "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society (1999) 61:3, 611-622; and Tipping and Bishop, "Mixtures of Principal Component Analyzers," Proceedings of the Fifth International Conference on Artificial Neural Networks (1997), 13-18.
  3. Outline: (1) Principal Components Analysis; (2) Basic Model / Model Basics; (3) A brief digression: Inference and Learning; (4) Probabilistic PCA; (5) Expectation Maximization for Probabilistic PCA; (6) Mixture of Principal Component Analyzers.
  4. Outline (section divider): Principal Components Analysis.
  5. PCA: standard PCA in 1 slide. A well-established technique for dimensionality reduction. The most common derivation of PCA is as the linear projection maximizing the variance in the projected space:
     (1) Organize the observed data $\{y_i \in \mathbb{R}^p\}_{i=1}^{N}$ in a $p \times N$ matrix X after subtracting the mean $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.
     (2) Obtain the $k$ principal axes $\{w_j \in \mathbb{R}^p\}_{j=1}^{k}$: the $k$ eigenvectors of the data-covariance matrix $S_y = \frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})(y_i - \bar{y})^T$ corresponding to the $k$ largest eigenvalues ($k < p$).
     (3) The $k$ principal components of $y_i$ are $x_i = W^T (y_i - \bar{y})$, where $W = (w_1, \ldots, w_k)$. The components of $x_i$ are then uncorrelated, and the projection-covariance matrix $S_x$ is diagonal with the $k$ largest eigenvalues of $S_y$.
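A minimal NumPy sketch of these three steps, not taken from the slides; the function name, variable names, and the toy data are ours:

    import numpy as np

    def standard_pca(Y, k):
        """Standard PCA of data Y (p x N): returns the principal axes W (p x k)
        and the principal components X (k x N), following steps 1-3 above."""
        p, N = Y.shape
        y_bar = Y.mean(axis=1, keepdims=True)       # sample mean
        Yc = Y - y_bar                              # step 1: mean-subtracted data
        S_y = (Yc @ Yc.T) / N                       # data-covariance matrix (p x p)
        eigvals, eigvecs = np.linalg.eigh(S_y)      # eigenvalues in ascending order
        idx = np.argsort(eigvals)[::-1][:k]         # k largest eigenvalues
        W = eigvecs[:, idx]                         # step 2: principal axes w_1..w_k
        X = W.T @ Yc                                # step 3: principal components
        return W, X

    # Usage: 500 samples in R^10 projected down to k = 2 dimensions.
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(10, 500))
    W, X = standard_pca(Y, k=2)
    print(np.round(np.cov(X, bias=True), 3))        # approximately diagonal
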
  6. PCA, things to think about: assumptions behind standard PCA.
     (1) Linearity: the problem is that of changing the basis. We have measurements in a particular basis and want to see the data in a basis that best expresses it; we restrict ourselves to bases that are linear combinations of the measurement basis.
     (2) Large variances correspond to important structure: we believe the data has high SNR, so the dynamics of interest are assumed to lie along the directions of largest variance, while lower-variance directions pertain to noise.
     (3) Principal components are orthogonal: decorrelation-based dimensionality reduction removes redundancy in the original data representation.
  7. PCA, things to think about: limitations of standard PCA.
     (1) Decorrelation is not always the best approach: it is useful only when first- and second-order statistics are sufficient statistics for revealing all dependencies in the data (e.g., Gaussian-distributed data).
     (2) The linearity assumption is not always justifiable: it is not valid when the data structure is captured by a nonlinear function of the dimensions in the measurement basis.
     (3) Non-parametricity: there is no probabilistic model for the observed data (advantages of the probabilistic extension coming up!).
     (4) Calculation of the data covariance matrix: when p and N are very large, difficulties arise in terms of computational complexity and data scarcity.
  8. PCA, things to think about: handling the decorrelation and linearity caveats. Example solutions: Independent Components Analysis (imposing a more general notion of statistical dependency) and kernel PCA (nonlinearly transforming the data to a more appropriate naive basis).
  9. PCA, things to think about: handling the non-parametricity caveat (motivation for probabilistic PCA).
     A probabilistic perspective provides a log-likelihood measure for comparison with other density-estimation techniques.
     Bayesian inference methods may be applied (e.g., for model comparison).
     pPCA can be used as a constrained Gaussian density model, with potential applications in classification and novelty detection.
     Multiple pPCA models can be combined as a probabilistic mixture.
     Standard PCA uses a naive notion of covariance (squared distance from the observed data); pPCA defines a proper covariance structure whose parameters can be estimated via EM.
  10. PCA, things to think about: handling the computational caveat (motivation for EM-based PCA).
     Computing the sample covariance itself is O(Np²).
     Data scarcity: often there is not enough data for the sample covariance to be full-rank.
     Computational complexity: direct diagonalization is O(p³).
     Standard PCA doesn't deal properly with missing data; EM algorithms can estimate ML values of the missing data.
     EM-based PCA doesn't require computing the sample covariance and has O(kNp) complexity.
  11. Outline (section divider): Basic Model / Model Basics.
  12. Basic Model: PCA as a limiting case of linear Gaussian models (linear Gaussian model ≡ latent variable model → Factor Analysis → probabilistic PCA → standard PCA).
     $y_m - \mu = C x_m + \epsilon_m$, where for $m = 1, \ldots, N$:
     $x_m \in \mathbb{R}^k \sim \mathcal{N}(0, Q)$: (hidden) state vector
     $y_m \in \mathbb{R}^p$: output/observable vector
     $C \in \mathbb{R}^{p \times k}$: observation/measurement matrix
     $\epsilon_m \in \mathbb{R}^p \sim \mathcal{N}(0, R)$: zero-mean white Gaussian noise
     So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
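As a concrete illustration (not from the slides), a short snippet that draws samples from this generative model with Q = I; the dimensions and noise level are arbitrary choices:

    import numpy as np

    rng = np.random.default_rng(1)
    p, k, N = 5, 2, 1000                      # observed dim, latent dim, sample count
    C = rng.normal(size=(p, k))               # observation matrix
    mu = rng.normal(size=p)                   # mean of the observations
    R = 0.1 * np.eye(p)                       # noise covariance (spherical here)

    X = rng.normal(size=(k, N))               # x_m ~ N(0, I), i.e. Q = I
    E = rng.multivariate_normal(np.zeros(p), R, size=N).T   # eps_m ~ N(0, R)
    Y = mu[:, None] + C @ X + E               # y_m = mu + C x_m + eps_m

    # The empirical covariance of Y approaches W = C C^T + R for large N.
    W = C @ C.T + R
    print(np.round(np.cov(Y, bias=True) - W, 2))
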
  13. Basic Model: PCA as a limiting case of linear Gaussian models, $y_m - \mu = C x_m + \epsilon_m$.
     The restriction to a zero-mean noise source is not a loss of generality.
     All of the structure in Q can be moved into C, so we can take $Q = I_{k \times k}$.
     R in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled.
     Assumed: C is of rank k; Q and R are always full rank.
  14. Outline (section divider): A brief digression - Inference and Learning.
  15. Inference, Learning: latent variable models and probability computations.
     Case 1: we know what the hidden states are and just want to estimate them (we can write down C a priori based on the problem physics). Estimating the states given observations and a model → inference.
     Case 2: we have observation data, but the observation process is mostly unknown and there is no explicit model for the "causes". Learning a few parameters that model the data well (in the ML sense) → learning.
  16. Inference, Learning: inference.
     $y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$ gives us
     $P(x_m \mid y_m) = \frac{P(y_m \mid x_m)\, P(x_m)}{P(y_m)} = \frac{\mathcal{N}(\mu + C x_m,\ R)\big|_{y_m} \cdot \mathcal{N}(0, I)\big|_{x_m}}{\mathcal{N}(\mu, W)\big|_{y_m}}$
     Therefore $x_m \mid y_m \sim \mathcal{N}(\beta (y_m - \mu),\ I - \beta C)$, where $\beta = C^T W^{-1} = C^T (C C^T + R)^{-1}$.
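A minimal sketch of this inference computation (the function name and argument conventions are ours, not the slides'):

    import numpy as np

    def latent_posterior(Y, C, mu, R):
        """Posterior p(x_m | y_m) = N(beta (y_m - mu), I - beta C), with
        beta = C^T W^{-1} and W = C C^T + R, for every column y_m of Y (p x N)."""
        k = C.shape[1]
        W = C @ C.T + R
        beta = C.T @ np.linalg.inv(W)               # k x p
        post_means = beta @ (Y - mu[:, None])       # one posterior mean per column
        post_cov = np.eye(k) - beta @ C             # same covariance for every m
        return post_means, post_cov
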
  17. Inference, Learning: learning via Expectation Maximization.
     Given a likelihood function $L(\theta;\ Y = \{y_m\}_m,\ X = \{x_m\}_m)$, where θ is the parameter vector, Y is the observed data, and X represents the unobserved latent variables or missing values, the maximum-likelihood estimate (MLE) of θ is obtained iteratively as follows.
     Expectation step: $\tilde{Q}(\theta \mid \theta^{(u)}) = E_{X \mid Y, \theta^{(u)}}\left[\log L(\theta; Y, X)\right]$
     Maximization step: $\theta^{(u+1)} = \arg\max_{\theta} \tilde{Q}(\theta \mid \theta^{(u)})$
  18. Inference, Learning: learning via Expectation Maximization for linear Gaussian models.
     Use the solution to the inference problem to estimate the unknown latent variables / missing values X given Y and $\theta^{(u)}$; then use this fictitious "complete" data to solve for $\theta^{(u+1)}$.
     Expectation step: obtain the conditional latent sufficient statistics $\langle x_m \rangle$ and $\langle x_m x_m^T \rangle$ from $x_m \mid y_m \sim \mathcal{N}(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)})$, where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} (C^{(u)} C^{(u)T} + R^{(u)})^{-1}$.
     Maximization step: choose C, R to maximize the joint likelihood of X, Y.
  19. Outline (section divider): Probabilistic PCA.
  20. pPCA: PCA as a limiting case of linear Gaussian models, $y_m - \mu = C x_m + \epsilon_m$ (recap of slide 13).
     The restriction to a zero-mean noise source is not a loss of generality; all of the structure in Q can be moved into C, so we can take $Q = I_{k \times k}$; R in general cannot be restricted, since the $y_m$ are observed and cannot be whitened/rescaled; assumed: C is of rank k, and Q and R are always full rank.
     So we have $y_m \sim \mathcal{N}(\mu,\ W = C Q C^T + R = C C^T + R)$.
  21. pPCA: PCA as a limiting case of linear Gaussian models (linear Gaussian model → Factor Analysis).
     k < p: we are looking for a more parsimonious representation of the observed data.
     R needs to be restricted: otherwise the learning procedure could explain all the structure in the data as noise (i.e., obtain maximal likelihood by choosing C = 0 and R = W = the data-sample covariance).
     Since $y_m \sim \mathcal{N}(\mu,\ W = C C^T + R)$, we can do no better than having the model covariance equal the data-sample covariance.
  22. pPCA: Factor Analysis ≡ restricting R to be diagonal.
     $x_m \equiv \{x_{mi}\}_{i=1}^{k}$: factors explaining the correlations between the p observation variables $y_m \equiv \{y_{mj}\}_{j=1}^{p}$.
     The $\{y_{mj}\}_{j=1}^{p}$ are conditionally independent given $\{x_{mi}\}_{i=1}^{k}$.
     $\epsilon_j$: variability unique to a particular $y_{mj}$; $R = \mathrm{diag}(r_{jj})$: the "uniquenesses".
     This differs from standard PCA, which effectively treats covariance and variance identically.
  23. pPCA: probabilistic PCA ≡ constraining R to $\sigma^2 I$.
     Noise: $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
     $y_m \mid x_m \sim \mathcal{N}(\mu + C x_m,\ \sigma^2 I)$
     $y_m \sim \mathcal{N}(\mu,\ W = C C^T + \sigma^2 I)$
     $x_m \mid y_m \sim \mathcal{N}(\beta (y_m - \mu),\ I - \beta C)$, where $\beta = C^T W^{-1} = C^T (C C^T + \sigma^2 I)^{-1}$
     Equivalently, $x_m \mid y_m \sim \mathcal{N}(\kappa (y_m - \mu),\ \sigma^2 M^{-1})$, where $\kappa = M^{-1} C^T = (C^T C + \sigma^2 I)^{-1} C^T$ and $M = C^T C + \sigma^2 I$.
  24. pPCA: PCA as a limiting case of linear Gaussian models (probabilistic PCA → standard PCA).
     $\beta = C^T W^{-1} = C^T (C C^T + \sigma^2 I_{p \times p})^{-1}$
     $\kappa = M^{-1} C^T = (C^T C + \sigma^2 I_{k \times k})^{-1} C^T$
     It can be shown (by applying the Woodbury matrix identity to M) that (1) $I - \beta C = \sigma^2 M^{-1}$ and (2) $\beta = \kappa$.
     Then, letting $\sigma^2 \to 0$, we obtain $x_m \mid y_m \to \delta\big(x_m - (C^T C)^{-1} C^T (y_m - \mu)\big)$, which is standard PCA.
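A quick numerical check of these two identities (not part of the slides; the random C and the value of σ² are chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    p, k, sigma2 = 6, 2, 0.5
    C = rng.normal(size=(p, k))

    W = C @ C.T + sigma2 * np.eye(p)          # p x p model covariance
    M = C.T @ C + sigma2 * np.eye(k)          # k x k
    beta = C.T @ np.linalg.inv(W)
    kappa = np.linalg.inv(M) @ C.T

    print(np.allclose(beta, kappa))                                      # identity (2)
    print(np.allclose(np.eye(k) - beta @ C, sigma2 * np.linalg.inv(M)))  # identity (1)
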
  25. pPCA: closed-form ML learning.
     Log-likelihood of the pPCA model: $L(\theta; Y) = -\frac{N}{2}\left[\,p \log(2\pi) + \log |W| + \mathrm{trace}(W^{-1} S_y)\,\right]$, where $W = C C^T + \sigma^2 I$ and $S_y = \frac{1}{N}\sum_{m=1}^{N} (y_m - \mu)(y_m - \mu)^T$.
     The ML estimates of µ and $S_y$ are the sample mean and sample covariance matrix respectively: $\hat{\mu} = \frac{1}{N}\sum_{m=1}^{N} y_m$ and $\hat{S}_y = \frac{1}{N}\sum_{m=1}^{N} (y_m - \hat{\mu})(y_m - \hat{\mu})^T$.
  26. pPCA: closed-form ML learning (continued).
     ML estimates of C and σ², i.e. $\hat{C}$ and $\hat{\sigma}^2$:
     $\hat{C} = U_k (\Lambda_k - \sigma^2 I)^{1/2} V$ maps the latent space (containing X) to the principal subspace of Y, where the columns of $U_k \in \mathbb{R}^{p \times k}$ are the principal eigenvectors of $S_y$, $\Lambda_k \in \mathbb{R}^{k \times k}$ is the diagonal matrix of the corresponding eigenvalues, and $V \in \mathbb{R}^{k \times k}$ is an arbitrary rotation matrix, which can be set to $I_{k \times k}$.
     $\hat{\sigma}^2 = \frac{1}{p-k}\sum_{r=k+1}^{p} \lambda_r$: the variance lost in the projection process, averaged over the number of dimensions projected out.
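A sketch of these closed-form estimates in NumPy, taking V = I (the function name and variable names are ours):

    import numpy as np

    def ppca_closed_form(Y, k):
        """Closed-form ML estimates (mu, C, sigma2) of the pPCA model for data Y (p x N)."""
        p, N = Y.shape
        mu = Y.mean(axis=1)
        Yc = Y - mu[:, None]
        S_y = (Yc @ Yc.T) / N                               # sample covariance
        eigvals, eigvecs = np.linalg.eigh(S_y)
        order = np.argsort(eigvals)[::-1]                   # descending eigenvalues
        lam, U = eigvals[order], eigvecs[:, order]
        sigma2 = lam[k:].mean()                             # average of the discarded eigenvalues
        C = U[:, :k] @ np.diag(np.sqrt(lam[:k] - sigma2))   # C_hat with V = I
        return mu, C, sigma2
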
  27. Outline (section divider): Expectation Maximization for Probabilistic PCA.
  28. EM for pPCA: EM-based ML learning for linear Gaussian models.
     Expectation step: $x_m \mid y_m \sim \mathcal{N}(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)})$, where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} (C^{(u)} C^{(u)T} + R^{(u)})^{-1}$.
     Maximization step: choose C, R to maximize the joint likelihood of X, Y.
  29. EM for pPCA: EM-based ML learning for probabilistic PCA.
     Expectation step: $x_m \mid y_m \sim \mathcal{N}(\beta^{(u)} (y_m - \mu),\ I - \beta^{(u)} C^{(u)})$, where $\beta^{(u)} = C^{(u)T} W^{(u)-1} = C^{(u)T} (C^{(u)} C^{(u)T} + \sigma^{2(u)} I)^{-1}$.
     Maximization step: choose C, σ² to maximize the joint likelihood of X, Y.
  30. EM for pPCA: EM-based ML learning for probabilistic PCA (equivalent form).
     Expectation step: $x_m \mid y_m \sim \mathcal{N}(\kappa^{(u)} (y_m - \mu),\ \sigma^{2(u)} M^{(u)-1})$, where $\kappa^{(u)} = M^{(u)-1} C^{(u)T} = (C^{(u)T} C^{(u)} + \sigma^{2(u)} I)^{-1} C^{(u)T}$.
     Maximization step: choose C, σ² to maximize the joint likelihood of X, Y.
  31. EM for pPCA: EM-based ML learning for probabilistic PCA (explicit updates).
     Expectation step: compute for m = 1, ..., N
     $\langle x_m \rangle = M^{(u)-1} C^{(u)T} (y_m - \hat{\mu})$
     $\langle x_m x_m^T \rangle = \sigma^{2(u)} M^{(u)-1} + \langle x_m \rangle \langle x_m \rangle^T$
     Maximization step: set
     $C^{(u+1)} = \left[\sum_{m=1}^{N} (y_m - \hat{\mu}) \langle x_m \rangle^T\right] \left[\sum_{m=1}^{N} \langle x_m x_m^T \rangle\right]^{-1}$
     $\sigma^{2(u+1)} = \frac{1}{Np} \sum_{m=1}^{N} \left\{ \|y_m - \hat{\mu}\|^2 - 2\, \langle x_m \rangle^T C^{(u+1)T} (y_m - \hat{\mu}) + \mathrm{trace}\left(\langle x_m x_m^T \rangle C^{(u+1)T} C^{(u+1)}\right) \right\}$
  32. EM for pPCA: EM-based ML learning for probabilistic PCA (compact form).
     $C^{(u+1)} = \hat{S}_y C^{(u)} \left(\sigma^{2(u)} I + M^{(u)-1} C^{(u)T} \hat{S}_y C^{(u)}\right)^{-1}$
     $\sigma^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\left(\hat{S}_y - \hat{S}_y C^{(u)} M^{(u)-1} C^{(u+1)T}\right)$
     Convergence: the only stable local extremum is the global maximum, at which the true principal subspace is found. The paper doesn't discuss any initialization scheme(s).
     Complexity: the updates require terms of the form SC and trace(S). Computing SC as $\sum_m y_m (y_m^T C)$ is O(kNp) and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to finding S explicitly (O(Np²)). This is very efficient for k ≪ p. Since we require trace(S) rather than S itself, computing only the variance along each coordinate suffices.
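A sketch of this EM iteration in the compact form (names and the random initialization are ours; for readability the sample covariance S is formed explicitly, although the slide's point is that the S C and trace(S) terms can be accumulated without ever building S):

    import numpy as np

    def ppca_em(Y, k, n_iter=100, seed=0):
        """EM for pPCA using the compact updates:
        C_new  = S C (sigma2 I + M^{-1} C^T S C)^{-1}
        s2_new = (1/p) trace(S - S C M^{-1} C_new^T)."""
        p, N = Y.shape
        rng = np.random.default_rng(seed)
        mu = Y.mean(axis=1)
        Yc = Y - mu[:, None]
        S = (Yc @ Yc.T) / N
        C, sigma2 = rng.normal(size=(p, k)), 1.0          # arbitrary initialization
        for _ in range(n_iter):
            M_inv = np.linalg.inv(C.T @ C + sigma2 * np.eye(k))
            SC = S @ C
            C_new = SC @ np.linalg.inv(sigma2 * np.eye(k) + M_inv @ C.T @ SC)
            sigma2 = np.trace(S - SC @ M_inv @ C_new.T) / p
            C = C_new
        return mu, C, sigma2
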
  33. Outline (section divider): Mixture of Principal Component Analyzers.
  34. Mixture of pPCAs: the model.
     Log-likelihood of the observed data:
     $L\left(\theta = \{\mu_r\}, \{C_r\}, \{\sigma_r^2\}_{r=1}^{M};\ Y\right) = \sum_{m=1}^{N} \log \left[\sum_{r=1}^{M} \pi_r\, p(y_m \mid r)\right]$
     where, for m = 1, ..., N and r = 1, ..., M:
     $y_m \mid r \sim \mathcal{N}(\mu_r,\ W_r = C_r C_r^T + \sigma_r^2 I)$: a single pPCA model
     $\{\pi_r\}$, $\pi_r \ge 0$, $\sum_{r=1}^{M} \pi_r = 1$: mixture weights
     There are M independent latent variables $x_{mr}$ for each $y_m$.
  35. Mixture of pPCAs: EM-based ML learning, Stage 1 (new estimates of the component-specific π, µ).
     Expectation step: a component's responsibility for generating an observation,
     $R_{mr}^{(u+1)} = P^{(u)}(r \mid y_m) = \frac{p^{(u)}(y_m \mid r)\, \pi_r^{(u)}}{\sum_{r'=1}^{M} p^{(u)}(y_m \mid r')\, \pi_{r'}^{(u)}}$
     Maximization step:
     $\pi_r^{(u+1)} = \frac{1}{N}\sum_{m=1}^{N} R_{mr}^{(u+1)}$, $\qquad \hat{\mu}_r^{(u+1)} = \frac{\sum_{m=1}^{N} R_{mr}^{(u+1)}\, y_m}{\sum_{m=1}^{N} R_{mr}^{(u+1)}}$
  36. Mixture of pPCAs: EM-based ML learning, Stage 2 (new estimates of the component-specific C, σ²).
     Expectation step: compute for m = 1, ..., N and r = 1, ..., M
     $\langle x_{mr} \rangle = M_r^{(u)-1} C_r^{(u)T} (y_m - \hat{\mu}_r^{(u+1)})$
     $\langle x_{mr} x_{mr}^T \rangle = \sigma_r^{2(u)} M_r^{(u)-1} + \langle x_{mr} \rangle \langle x_{mr} \rangle^T$
     Maximization step: set for r = 1, ..., M
     $C_r^{(u+1)} = \left[\sum_{m=1}^{N} R_{mr}^{(u+1)} (y_m - \hat{\mu}_r^{(u+1)}) \langle x_{mr} \rangle^T\right] \left[\sum_{m=1}^{N} R_{mr}^{(u+1)} \langle x_{mr} x_{mr}^T \rangle\right]^{-1}$
     $\sigma_r^{2(u+1)} = \frac{1}{\pi_r^{(u+1)} N p} \sum_{m=1}^{N} R_{mr}^{(u+1)} \left\{ \|y_m - \hat{\mu}_r^{(u+1)}\|^2 - 2\, \langle x_{mr} \rangle^T C_r^{(u+1)T} (y_m - \hat{\mu}_r^{(u+1)}) + \mathrm{trace}\left(\langle x_{mr} x_{mr}^T \rangle C_r^{(u+1)T} C_r^{(u+1)}\right)\right\}$
  37. Mixture of pPCAs: EM-based ML learning (compact per-component form).
     $C_r^{(u+1)} = \hat{S}_{y_r}^{(u+1)} C_r^{(u)} \left(\sigma_r^{2(u)} I + M_r^{(u)-1} C_r^{(u)T} \hat{S}_{y_r}^{(u+1)} C_r^{(u)}\right)^{-1}$
     $\sigma_r^{2(u+1)} = \frac{1}{p}\, \mathrm{trace}\left(\hat{S}_{y_r}^{(u+1)} - \hat{S}_{y_r}^{(u+1)} C_r^{(u)} M_r^{(u)-1} C_r^{(u+1)T}\right)$
     where $\hat{S}_{y_r}^{(u+1)} = \frac{1}{\pi_r^{(u+1)} N} \sum_{m=1}^{N} R_{mr}^{(u+1)} (y_m - \hat{\mu}_r^{(u+1)})(y_m - \hat{\mu}_r^{(u+1)})^T$ and $M_r^{(u)} = \sigma_r^{2(u)} I + C_r^{(u)T} C_r^{(u)}$.
  38. Mixture of pPCAs: EM-based ML learning (the compact updates of slide 37, repeated), plus complexity.
     Complexity: the updates require terms of the form SC and trace(S). Computing SC as $\sum_m y_m (y_m^T C)$ is O(kNp) and more efficient than $\left(\sum_m y_m y_m^T\right) C$, which is equivalent to finding S explicitly (O(Np²)). This is very efficient for k ≪ p. Since we require trace(S) rather than S itself, computing only the variance along each coordinate suffices.
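A condensed sketch of one EM iteration for the mixture, combining Stage 1 with the compact per-component updates above; the function name and data layout are ours, the responsibilities use SciPy's Gaussian density, and (as the slides note for the single-model case) no initialization scheme is prescribed:

    import numpy as np
    from scipy.stats import multivariate_normal

    def mppca_em_step(Y, pi, mu, C, sigma2):
        """One EM iteration for a mixture of M pPCA models.
        Y: (p, N); pi: (M,); mu: (M, p); C: list of (p, k) matrices; sigma2: (M,)."""
        p, N = Y.shape
        M = len(pi)
        # Stage 1: responsibilities R_mr, then new pi and mu.
        dens = np.stack([multivariate_normal.pdf(
                             Y.T, mean=mu[r],
                             cov=C[r] @ C[r].T + sigma2[r] * np.eye(p))
                         for r in range(M)])                     # (M, N)
        R = dens * pi[:, None]
        R /= R.sum(axis=0, keepdims=True)
        pi_new = R.mean(axis=1)
        mu_new = (R @ Y.T) / R.sum(axis=1, keepdims=True)        # (M, p)
        # Stage 2: per-component C_r, sigma_r^2 via the compact updates.
        C_new, s2_new = [], np.empty(M)
        for r in range(M):
            Yc = Y - mu_new[r][:, None]
            S_r = (Yc * R[r]) @ Yc.T / (pi_new[r] * N)           # responsibility-weighted covariance
            k = C[r].shape[1]
            M_inv = np.linalg.inv(C[r].T @ C[r] + sigma2[r] * np.eye(k))
            SC = S_r @ C[r]
            Cr_new = SC @ np.linalg.inv(sigma2[r] * np.eye(k) + M_inv @ C[r].T @ SC)
            s2_new[r] = np.trace(S_r - SC @ M_inv @ Cr_new.T) / p
            C_new.append(Cr_new)
        return pi_new, mu_new, C_new, s2_new
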
  39. Thank You
