Learning group em - 20171025 - copy

EM algorithm and GMM

  1. The Expectation Maximization Algorithm. Presenter: Shuai Zhang, CSE, UNSW
  2. Content • Introduction • Preliminary • The EM Algorithm • Example: Gaussian Mixture Model
  3. Introduction. EM is one of the top 10 data mining algorithms; the other nine are C4.5, k-means, SVM, Apriori, PageRank, AdaBoost, kNN, Naïve Bayes, and CART. It is a popular tool for statistical estimation problems involving incomplete data, such as mixture estimation. The EM algorithm is an efficient iterative procedure for computing the maximum likelihood (ML) estimate in the presence of missing or hidden data.
  4. Preliminary. Definition 1: Let $f$ be a real-valued function defined on an interval $I = [a, b]$. $f$ is said to be convex on $I$ if for all $x_1, x_2 \in I$ and $\lambda \in [0, 1]$, $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$. $f$ is said to be strictly convex if the inequality is strict.
  5. Preliminary. Intuitively, this definition states that the graph of the function lies on or below the straight line joining the points $(x_1, f(x_1))$ and $(x_2, f(x_2))$.
  6. Preliminary. Definition 2: $f$ is concave if $-f$ is convex. Definition 3: If $f(x)$ is twice differentiable on $[a, b]$ and $f''(x) \ge 0$ on $[a, b]$, then $f(x)$ is convex on $[a, b]$.
  7. Preliminary. Jensen's inequality: Let $f$ be a convex function defined on an interval $I$. If $x_1, x_2, \dots, x_n \in I$ and $\lambda_1, \lambda_2, \dots, \lambda_n \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$, then $f\!\left(\sum_{i=1}^{n} \lambda_i x_i\right) \le \sum_{i=1}^{n} \lambda_i f(x_i)$. Proof: for $n = 1$ the statement is trivial, and for $n = 2$ it is exactly the definition of convexity; the general case follows by induction (next slide).
  8. Preliminary. Assume the theorem holds for some $n$; we prove that it also holds for $n + 1$. Note that $\sum_{i=1}^{n+1} \lambda_i = 1$ (this is important). Then $f\!\left(\sum_{i=1}^{n+1} \lambda_i x_i\right) = f\!\left(\lambda_{n+1} x_{n+1} + \sum_{i=1}^{n} \lambda_i x_i\right) = f\!\left(\lambda_{n+1} x_{n+1} + (1 - \lambda_{n+1}) \frac{1}{1 - \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i\right) \le \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) f\!\left(\frac{1}{1 - \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i\right) = \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) f\!\left(\sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_{n+1}} x_i\right)$ (continued on the next slide).
  9. Preliminary. Setting $\hat{\lambda}_i = \frac{\lambda_i}{1 - \lambda_{n+1}}$, which satisfies $\sum_{i=1}^{n} \hat{\lambda}_i = 1$, the induction hypothesis gives $\le \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) \sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_{n+1}} f(x_i) = \lambda_{n+1} f(x_{n+1}) + \sum_{i=1}^{n} \lambda_i f(x_i) = \sum_{i=1}^{n+1} \lambda_i f(x_i)$, which completes the proof. Jensen's inequality also provides a simple proof that the arithmetic mean is greater than or equal to the geometric mean.
  10. Preliminary. A useful example is the natural log function $\ln(x)$: since it is concave, Jensen's inequality flips and we have $\ln\!\left(\sum_{i=1}^{n} \lambda_i x_i\right) \ge \sum_{i=1}^{n} \lambda_i \ln(x_i)$. We will use this inequality in the derivation of the EM algorithm.
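As a quick numerical sanity check (not part of the original slides), the inequality can be verified for arbitrary positive points and weights; the values below are made up for illustration.

```python
# Verify Jensen's inequality for the concave natural log:
# ln(sum_i lambda_i * x_i) >= sum_i lambda_i * ln(x_i).
import numpy as np

x = np.array([0.5, 2.0, 7.0, 11.0])    # arbitrary points x_i > 0
lam = np.array([0.1, 0.4, 0.3, 0.2])   # non-negative weights summing to 1
assert np.isclose(lam.sum(), 1.0)

lhs = np.log(np.dot(lam, x))           # ln of the weighted average
rhs = np.dot(lam, np.log(x))           # weighted average of the logs
print(f"ln(sum lam_i x_i) = {lhs:.4f}, sum lam_i ln(x_i) = {rhs:.4f}")
assert lhs >= rhs                      # Jensen's inequality for a concave function
```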
  11. The EM algorithm. Two steps: • E-step: the missing data are estimated given the observed data and the current estimate of the model parameters. • M-step: the likelihood function is maximized under the assumption that the missing data are known. Convergence is assured because the algorithm is guaranteed never to decrease the likelihood from one iteration to the next.
  12. The EM algorithm. We want to maximize $P(X \mid \theta)$, i.e., estimate the parameter $\theta$. Equivalently, we maximize the log-likelihood $L(\theta) = \ln P(X \mid \theta)$. Assume that after the $n$-th iteration the current estimate of $\theta$ is $\theta_n$. We wish to compute an updated estimate $\theta$ such that $L(\theta) > L(\theta_n)$; equivalently, we want to maximize the difference $L(\theta) - L(\theta_n) = \ln P(X \mid \theta) - \ln P(X \mid \theta_n)$.
  13. The EM algorithm. Sometimes $P(X \mid \theta)$ is intractable to optimize directly, so we introduce a latent variable, denoted $z$. We can rewrite $P(X \mid \theta)$ as $P(X \mid \theta) = \sum_z P(X \mid z, \theta) P(z \mid \theta)$. Then we have $L(\theta) - L(\theta_n) = \ln \sum_z P(X \mid z, \theta) P(z \mid \theta) - \ln P(X \mid \theta_n)$.
  14. The EM algorithm. $L(\theta) - L(\theta_n) = \ln \sum_z P(X \mid z, \theta) P(z \mid \theta) - \ln P(X \mid \theta_n) = \ln \sum_z P(X \mid z, \theta) P(z \mid \theta) \frac{P(z \mid X, \theta_n)}{P(z \mid X, \theta_n)} - \ln P(X \mid \theta_n) = \ln \sum_z P(z \mid X, \theta_n) \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n)} - \ln P(X \mid \theta_n) \ge \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n)} - \ln P(X \mid \theta_n) = \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n) P(X \mid \theta_n)} \triangleq \Delta(\theta \mid \theta_n)$, where the inequality is Jensen's inequality applied to the concave $\ln$.
  15. The EM algorithm. So $L(\theta) \ge L(\theta_n) + \Delta(\theta \mid \theta_n)$. For convenience, we define $l(\theta \mid \theta_n) \triangleq L(\theta_n) + \Delta(\theta \mid \theta_n)$, so that $L(\theta) \ge l(\theta \mid \theta_n)$.
  16. The EM algorithm. We also observe that $l(\theta_n \mid \theta_n) = L(\theta_n) + \Delta(\theta_n \mid \theta_n) = L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta_n) P(z \mid \theta_n)}{P(z \mid X, \theta_n) P(X \mid \theta_n)} = L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln \frac{P(X, z \mid \theta_n)}{P(X, z \mid \theta_n)} = L(\theta_n)$. So for $\theta = \theta_n$, the functions $l(\theta \mid \theta_n)$ and $L(\theta)$ are equal.
  17. The EM algorithm. From the two relations $L(\theta) \ge l(\theta \mid \theta_n)$ and $l(\theta_n \mid \theta_n) = L(\theta_n)$, any $\theta$ that increases $l(\theta \mid \theta_n)$ above $l(\theta_n \mid \theta_n)$ also increases $L(\theta)$ above $L(\theta_n)$. The EM algorithm therefore selects the $\theta$ that maximizes $l(\theta \mid \theta_n)$; we denote this updated value by $\theta_{n+1}$.
  18. The EM algorithm
  19. The EM algorithm. $\theta_{n+1} = \arg\max_\theta \{ l(\theta \mid \theta_n) \} = \arg\max_\theta \left\{ L(\theta_n) + \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n) P(X \mid \theta_n)} \right\}$; dropping the terms that are constant with respect to $\theta$, this equals $\arg\max_\theta \left\{ \sum_z P(z \mid X, \theta_n) \ln P(X \mid z, \theta) P(z \mid \theta) \right\} = \arg\max_\theta \left\{ \sum_z P(z \mid X, \theta_n) \ln \frac{P(X, z, \theta)}{P(z, \theta)} \frac{P(z, \theta)}{P(\theta)} \right\} = \arg\max_\theta \sum_z P(z \mid X, \theta_n) \ln P(X, z \mid \theta) = \arg\max_\theta E_{Z \mid X, \theta_n}\{\ln P(X, z \mid \theta)\}$.
  20. The EM algorithm. E-step: determine the conditional expectation $E_{Z \mid X, \theta_n}\{\ln P(X, z \mid \theta)\}$. M-step: maximize this expression with respect to $\theta$. One problem with the EM algorithm is that it can converge to a local maximum or a saddle point rather than the global maximum.
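The E/M alternation can be written as a short driver loop. The sketch below is not from the slides: `run_em`, `e_step`, `m_step`, and `log_likelihood` are hypothetical names, and a concrete model (for instance the Gaussian mixture of the following slides) has to supply the last three.

```python
# A minimal, model-agnostic EM driver (a sketch under the assumptions above).
def run_em(X, theta0, e_step, m_step, log_likelihood, max_iter=100, tol=1e-6):
    """Alternate E and M steps until the observed-data log-likelihood stops improving."""
    theta = theta0
    prev_ll = log_likelihood(X, theta)
    for _ in range(max_iter):
        posteriors = e_step(X, theta)   # E-step: compute P(z | X, theta_n)
        theta = m_step(X, posteriors)   # M-step: maximize E_{Z|X,theta_n}[ln P(X, z | theta)]
        ll = log_likelihood(X, theta)
        if ll - prev_ll < tol:          # EM never decreases ll, so this works as a stopping test
            break
        prev_ll = ll
    return theta
```

A concrete instantiation for the Gaussian mixture model is sketched after slide 27.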
  21. Example: Gaussian Mixture Model. The Gaussian mixture model has the following form: $P_w(x) = \sum_{k=1}^{K} w_k \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$, a mixture of $K$ Gaussians, where $w_k$ is the weight of the $k$-th Gaussian component (the mixture coefficient).
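To make the notation concrete, the mixture density can be evaluated directly. The sketch below is not from the slides; it uses made-up parameters for a two-component mixture in two dimensions, with `scipy.stats.multivariate_normal` supplying the Gaussian density.

```python
# Evaluate the mixture density P_w(x) = sum_k w_k * N(x; mu_k, Sigma_k).
import numpy as np
from scipy.stats import multivariate_normal

w = np.array([0.4, 0.6])                                   # mixture coefficients, sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]          # component means (illustrative)
Sigma = [np.eye(2), np.array([[1.0, 0.3], [0.3, 2.0]])]    # component covariances (illustrative)

def gmm_pdf(x):
    """Density of the Gaussian mixture at a point x."""
    return sum(w[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(w)))

print(gmm_pdf(np.array([1.0, 1.0])))
```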
  22. Example: Gaussian Mixture Model
  23. Example: Gaussian Mixture Model. Suppose we observe $x_1, x_2, \dots, x_i, \dots, x_N$ as samples from a mixture of Gaussians, and suppose we augment the data with latent variables $z_{ij}$ ($i = 1, \dots, N$; $j = 1, \dots, K$) indicating which of the $K$ Gaussians observation $x_i$ came from. Recall that we need to determine the conditional expectation $E_{Z \mid X, \theta}\{\ln P(X, z \mid \theta)\}$.
  24. Example: Gaussian Mixture Model. E-step: we first compute the posterior of the latent variable $z$. From Bayes' rule, $P_w(z_i = j \mid x_i) = \frac{P(z_i = j) P_w(x_i \mid z_i = j)}{P_w(x_i)} = \frac{w_j P(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{K} w_l P(x_i \mid \mu_l, \Sigma_l)}$, where $w_j = \frac{N_j}{N}$ and $N_j$ is interpreted as the effective number of points assigned to cluster $j$.
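This posterior (often called the "responsibility" of component $j$ for point $x_i$) is straightforward to compute. The sketch below is not from the slides and uses arbitrary illustrative parameters.

```python
# E-step for a GMM: responsibilities P(z_i = j | x_i) via Bayes' rule.
import numpy as np
from scipy.stats import multivariate_normal

w = np.array([0.4, 0.6])                     # illustrative mixture weights
mu = np.array([[0.0, 0.0], [3.0, 3.0]])      # illustrative means
Sigma = np.array([np.eye(2), np.eye(2)])     # illustrative covariances

def responsibilities(X):
    """Return an (N, K) matrix whose (i, j) entry is P(z_i = j | x_i)."""
    num = np.column_stack([w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                           for j in range(len(w))])   # numerator: w_j * N(x_i; mu_j, Sigma_j)
    return num / num.sum(axis=1, keepdims=True)       # denominator: sum_l w_l * N(x_i; mu_l, Sigma_l)

X = np.array([[0.1, -0.2], [2.8, 3.1]])      # two toy observations
print(responsibilities(X))                   # each row sums to 1
```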
  25. Example: Gaussian Mixture Model. M-step: re-estimate the parameters by maximizing the conditional expectation $E_{Z \mid X, \theta}\{\ln P(X, z \mid \theta)\}$: $\theta_{n+1} = \arg\max_\theta \sum_j P(z_i = j \mid x_i, \theta_n) \ln \prod_{i=1}^{N} P(x_i, z_i = j \mid \theta)$, where the complete-data likelihood factorizes as $P(X, z \mid \theta) = \prod_{i=1}^{N} P(x_i, z_i = j \mid \theta) = \prod_{i=1}^{N} P(x_i \mid z_i = j, \theta) P(z_i = j \mid \theta) = \prod_{i=1}^{N} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$.
  26. Example: Gaussian Mixture Model. $\theta_{n+1} = \arg\max_\theta \sum_j P(z_i = j \mid x_i, \theta_n) \ln \prod_{i=1}^{N} w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) = \arg\max_\theta \sum_{i=1}^{N} \sum_j \ln\!\big(w_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)\big) P(z_i = j \mid x_i, \theta_n) = \arg\max_\theta \left\{ \sum_{i=1}^{N} \sum_j \ln(w_j) P(z_i = j \mid x_i, \theta_n) + \sum_{i=1}^{N} \sum_j \ln\!\big(\mathcal{N}(x_i \mid \mu_j, \Sigma_j)\big) P(z_i = j \mid x_i, \theta_n) \right\}$
  27. Example: Gaussian Mixture Model. E-step: $E[z_{ij}] = P_w(z_i = j \mid x_i)$. M-step: $\mu_j = \frac{\sum_{i=1}^{N} E[z_{ij}] \, x_i}{\sum_{i=1}^{N} E[z_{ij}]}$; $w_j = \frac{1}{N} \sum_{i=1}^{N} E[z_{ij}]$; $\Sigma_j = \frac{\sum_{i=1}^{N} E[z_{ij}] \, (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} E[z_{ij}]}$.
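Putting the E-step and M-step together gives the full procedure. The sketch below follows the updates on this slide but is not from the original deck: the initialization, the fixed iteration count, and the toy data are illustrative choices, and no covariance regularization is added.

```python
# EM for a Gaussian mixture, following the E-step and M-step updates above (a sketch).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Illustrative initialization: K random data points as means, identity covariances, uniform weights.
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.eye(d) for _ in range(K)])
    w = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: E[z_ij] = P(z_i = j | x_i)
        r = np.column_stack([w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
                             for j in range(K)])
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters using the effective counts N_j = sum_i E[z_ij]
        Nj = r.sum(axis=0)
        w = Nj / N                               # w_j = N_j / N
        mu = (r.T @ X) / Nj[:, None]             # mu_j = sum_i E[z_ij] x_i / N_j
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (r[:, j, None] * diff).T @ diff / Nj[j]   # Sigma_j update
    return w, mu, Sigma

# Toy usage: two well-separated clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), rng.normal(5.0, 1.0, size=(100, 2))])
w, mu, Sigma = em_gmm(X, K=2)
print(w)
print(mu)
```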
  28. Example: Gaussian Mixture Model
  29. Example: Gaussian Mixture Model
  30. Example: Gaussian Mixture Model
  31. References 1. Wu, Xindong, et al. "Top 10 algorithms in data mining." Knowledge and Information Systems 14.1 (2008): 1-37. 2. http://blog.csdn.net/u014157632/article/details/65442165 3. http://www.cse.iitm.ac.in/~vplab/courses/DVP/PDF/gmm.pdf 4. http://www-staff.it.uts.edu.au/~ydxu/ml_course/em.pdf 5. http://people.csail.mit.edu/dsontag/courses/ml12/slides/lecture21.pdf 6. https://www.cs.utah.edu/~piyush/teaching/EM_algorithm.pdf 7. Zhou, Zhi-Hua. Machine Learning.
  32. Thanks!
  33. Last Week • Implemented WRMF and Lrec under our framework with side information • Our research goal is to investigate the efficacy of incorporating multi-faceted side information into our framework • There are many ways to embed side information • We do not yet have a definite answer on whether the side information can improve the original model • Deep learning for recommender systems, the magazine paper • Prepared the learning group
  34. This Week • Adjust our model and conduct experiments • The magazine paper • Prepare the slides for next week's talk at RMIT (Mark Sanderson's group)
