- 1. UNCLASSIFIED Statistical Clustering: k-means, Gaussian Mixtures, Variational Inference 22-FEB-2012
- 2. What is Clustering? Design considerations: features; dimension; model (distance / cost); bias / variance. Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document.
- 3. Why do we care?
- 4. Scope of Talk: Main Takeaway Point. It's all about the posterior $p(L \mid D)$. K-means: how it works, the math behind it, issues. GMM: how it works, the math behind it, issues. Variational: just the facts. Methods: variational inference; GMM, EM (graph cuts, spectral clustering); k-means, vector quantization.
- 5. Scope of Talk. Main takeaway point: it's all just posterior estimation. Methods: variational / MCMC; GMM; k-means / vector quantization. K-means: how it works, the math behind it, issues. GMM: how it works, the math behind it, issues. Variational: just the facts.
- 6. K-means: How it works. Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$. Iterative two-step process. E-step: assign each data point to the nearest prototype. M-step: update each prototype to be the mean of its cluster. Simple version: Euclidean distance, requires whitening. Design considerations: features; dimension; model (distance / cost); bias / variance.
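
The two-step loop on this slide can be sketched as below. This is a minimal illustration, not the presenter's code: the function name `kmeans`, the explicit `init_centers` argument, and the toy data are assumptions, and it uses plain Euclidean distance on data assumed to be already whitened.

```python
import math


def kmeans(points, init_centers, n_iter=100):
    """Plain k-means with Euclidean distance, starting from given prototypes."""
    centers = [list(c) for c in init_centers]
    k = len(centers)
    labels = None
    for _ in range(n_iter):
        # E-step: assign each data point to its nearest prototype.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # M-step: move each prototype to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old prototype
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels


# Toy data (assumed): two well-separated blobs in 2-D.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = kmeans(data, init_centers=[(0.0, 0.0), (5.0, 5.0)])
```

Passing the starting prototypes explicitly keeps the sketch deterministic; the result still depends on that starting point.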
- 15. Converged
- 16. k-means: Math. Responsibilities: assign data to cluster. Cost function example.
- 17. Minimizing the Cost Function
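
The equations on slides 16 and 17 did not survive extraction. As a hedged reconstruction, the standard k-means cost and its minimizer are written out below; the symbols $J$ and $r_{nk}$ are assumed notation, with $r_{nk}$ playing the role of the hard "responsibility" of cluster $k$ for point $x_n$.

```latex
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2,
\qquad
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}

\frac{\partial J}{\partial \mu_k}
= -2 \sum_{n=1}^{N} r_{nk} \, (x_n - \mu_k) = 0
\quad\Longrightarrow\quad
\mu_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}
```

Choosing $r_{nk}$ with the prototypes fixed is the E-step, and solving for $\mu_k$ with the assignments fixed is the M-step. Neither step can increase $J$.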
- 18. What can go wrong?
- 19. What can go wrong? A great deal. How do we choose K? (Gap statistic / prediction strength.) How do we initialize? (k-means++ seems to be the best.) Local minima: run hundreds of times with different initializations. Are we overfitting? Probably. But hey, it is simple to understand and does not cost too many cycles.
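
A sketch of k-means++ style seeding, the initialization this slide appears to refer to. The function name `kmeans_pp_init`, the fixed seed, and the toy data are assumptions; the idea is that each new seed is drawn with probability proportional to its squared distance from the nearest seed already chosen.

```python
import math
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial prototypes from the data, spreading them out."""
    rng = random.Random(seed)
    # First seed: a data point chosen uniformly at random.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Next seed: sampled with probability proportional to d2.
        # (Raises ValueError if every point coincides with a chosen seed.)
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


# Toy data (assumed): two well-separated blobs in 2-D.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
seeds = kmeans_pp_init(data, k=2)
```

Seeding like this reduces, but does not remove, the need for repeated restarts: one would still rerun with different `seed` values and keep the lowest-cost result.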
- 20. Quick word on distances (k-medoids). Mahalanobis: not dependent on the scale of measurement; tuning parameter. Manhattan / city block: dampens outliers. Euclidean: need to whiten; outliers are an issue.
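
For concreteness, the three distances named on this slide can be written as below. The function names are illustrative, and `cov_inv` (the inverse covariance matrix as a nested list) is assumed to be supplied by the caller rather than estimated here.

```python
import math


def euclidean(x, y):
    """Straight-line distance; sensitive to feature scale and to outliers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance; large coordinate differences are not squared."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mahalanobis(x, y, cov_inv):
    """Distance under the quadratic form d^T * cov_inv * d."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * cov_inv[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


x, y = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
# Inverse of a diagonal covariance with variances 9 and 16 (assumed example).
scaled = [[1.0 / 9.0, 0.0], [0.0, 1.0 / 16.0]]
```

With the identity matrix, Mahalanobis distance reduces to Euclidean distance; with `scaled`, each coordinate difference is measured in units of its own standard deviation, which is what makes it independent of the measurement scale.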
- 21. Quicker word on flavors. Exclusive clustering: k-means, weighted k-means. Overlapping clustering: fuzzy c-means. Nonlinear clustering: kernel k-means (spectral clustering, normalized cuts). Hierarchical clustering: hierarchical.
- 22. Probabilistic Clustering. Represent the probability distribution of the data as a mixture model. Captures uncertainty in cluster assignments. Gives a model for the data distribution. Bayesian mixture: we can figure out K more easily. Consider a mixture of Gaussians.
- 23. Multivariate Gaussian Distribution Review
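
The body of this slide was not recovered. For reference, the standard multivariate Gaussian density is given below; the symbol $d$ for the dimension of $x$ is assumed notation.

```latex
\mathcal{N}(x \mid \mu, \Sigma)
= \frac{1}{(2\pi)^{d/2} \, \lvert \Sigma \rvert^{1/2}}
\exp\!\left( -\tfrac{1}{2} \, (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```

The quadratic form in the exponent is the squared Mahalanobis distance between $x$ and $\mu$.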
- 24. Likelihood Function. Maximum likelihood: what is the best fit to my data? An approximation of the posterior!
- 25. Maximum Likelihood Solution for One Gaussian. Sample mean. Sample covariance.
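
A small sketch of the single-Gaussian maximum likelihood estimates named on this slide. The function name `gaussian_ml` and the two-point data set are illustrative; note that the ML covariance divides by N, not N - 1.

```python
def gaussian_ml(points):
    """Maximum likelihood mean and covariance for one Gaussian."""
    n = len(points)
    dim = len(points[0])
    # Sample mean: average of each coordinate.
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    # ML covariance: average outer product of deviations from the mean.
    cov = [
        [
            sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


mean, cov = gaussian_ml([(1.0, 2.0), (3.0, 6.0)])
```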
- 26. Gaussian Mixtures. Linear superposition of Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$. Normalization and positivity require $\sum_{k=1}^{K} \pi_k = 1$ and $0 \le \pi_k \le 1$. The mixing coefficients can be interpreted as prior probabilities. [Aside] We can sample from this: given the mixing coefficients, means, and variances, draw samples from $p(x)$ to get our data set.
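
The sampling aside on this slide can be sketched in one dimension as follows. The function name `sample_gmm`, the seed, and the parameter values are assumptions chosen for illustration: first pick a component according to the mixing coefficients, then draw from that component's Gaussian.

```python
import random


def sample_gmm(weights, means, stds, n, seed=0):
    """Draw n (value, component) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Pick which Gaussian generates this point (prior = mixing coefficient).
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Draw the observed value from the chosen Gaussian.
        samples.append((rng.gauss(means[k], stds[k]), k))
    return samples


samples = sample_gmm([0.3, 0.7], [-10.0, 10.0], [1.0, 1.0], n=1000)
```

In a real clustering problem only the values are observed; the component index `k` is exactly the label that is missing.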
- 27. Fitting the Gaussian Mixture. We wish to invert this sampling process: given the data, find the corresponding parameters (as we did for the single Gaussian case), namely the mixing coefficients, means, and covariances. If we knew which Gaussian each data point "belonged" to, or was the responsibility of, we could use our single-Gaussian ML solution. Problem: we don't have labels, which complicates things. Solution: introduce a latent (hidden) variable z that tells us which data point goes with which Gaussian.
- 28. Posterior of the latent variable. $\pi_k \equiv p(z_k = 1)$: the probability that a data point was generated by the $k$-th Gaussian before observing $x$. $\gamma_k(x) \equiv p(z_k = 1 \mid x)$: the probability that the data point $x$ was generated by the $k$-th Gaussian after observing $x$. By Bayes' rule, $\gamma_k(x) = \dfrac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$. These are also called responsibilities.
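
A one-dimensional sketch of the responsibility computation on this slide. The helper names `normal_pdf` and `responsibilities` and the parameter values are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(x, weights, means, stds):
    """Posterior p(z_k = 1 | x) for every component k."""
    # Numerator of Bayes' rule: prior times likelihood, per component.
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    # Denominator: the mixture density p(x). For x far from every
    # component this can underflow to zero; no guard is included here.
    total = sum(joint)
    return [j / total for j in joint]


gamma = responsibilities(1.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Unlike the hard assignments in k-means, each point here gets a weight for every component, and the weights sum to one.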
- 29. Maximum Likelihood for GMM. The log likelihood takes this form: $\ln p(D \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$. Notice that the sum is inside the log, so there is no closed-form solution. Solve with the expectation-maximization (EM) algorithm. Derivative w.r.t. $\mu_k$.
- 30. EM: notice that each one of these is dependent on the responsibilities. Do the same for the covariance. Use a Lagrange multiplier for the mixing coefficients.
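
The update equations on this slide were not recovered. As a hedged sketch, the standard EM updates for a one-dimensional Gaussian mixture look like this; the function name `em_gmm_1d`, the starting values, and the toy data are assumptions. Each M-step update is the single-Gaussian ML estimate reweighted by the responsibilities.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, weights, means, variances, n_iter=50):
    """EM for a 1-D Gaussian mixture; also returns the log-likelihood trace."""
    weights, means, variances = list(weights), list(means), list(variances)
    n, k = len(xs), len(weights)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i][j] = p(z_j = 1 | x_i).
        gamma = []
        log_lik = 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            gamma.append([v / total for v in joint])
        log_liks.append(log_lik)
        # M-step: responsibility-weighted mean, variance, and mixing weight.
        for j in range(k):
            nj = sum(g[j] for g in gamma)  # effective number of points
            means[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / nj
            variances[j] = sum(g[j] * (x - means[j]) ** 2
                               for g, x in zip(gamma, xs)) / nj
            weights[j] = nj / n
    return weights, means, variances, log_liks


# Toy data (assumed): two tight groups around -2 and +2.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
weights, means, variances, log_liks = em_gmm_1d(
    xs, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
)
```

The log likelihood should not decrease from one iteration to the next. The sketch has no guard against a component collapsing onto a single point, where its variance goes to zero.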
- 37. Relation to k-means
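
The body of this slide was not recovered. One standard way to state the connection, offered here as an assumption about what the slide showed: fix every covariance to $\Sigma_k = \epsilon I$ and let $\epsilon \to 0$. With all $\pi_k > 0$ and a unique nearest mean, the soft responsibilities then collapse to hard assignments ($r_{nk}$ is assumed notation for the hard assignment of $x_n$ to cluster $k$).

```latex
\gamma_k(x_n)
= \frac{\pi_k \exp\!\left( -\lVert x_n - \mu_k \rVert^2 / 2\epsilon \right)}
       {\sum_{j=1}^{K} \pi_j \exp\!\left( -\lVert x_n - \mu_j \rVert^2 / 2\epsilon \right)}
\;\xrightarrow{\;\epsilon \to 0\;}\;
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_j \lVert x_n - \mu_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
```

In that limit the EM mean update becomes the k-means prototype update: the average of the points assigned to each cluster.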
- 38. Fast food example. http://nutrition.mcdonalds.com/nutritionexchange/nutritionfacts.pdf
- 39. Dessert Cluster: Caramel Mocha; Frappe Caramel; Iced Hazelnut Latte; Iced Coffee; Strawberry Triple Thick Shake; Snack Size McFlurry; Hot Caramel Sundae; Baked Hot Apple Pie; Cinnamon Melts; Kiddie Cone; Strawberry Sundae
- 40. Burger-like cluster: Hamburger; Cheeseburger; Filet-O-Fish; Quarter Pounder with Cheese; Premium Grilled Chicken Club Sandwich; Ranch Snack Wrap; Premium Asian Salad with Crispy Chicken; Butter Garlic Croutons; Sausage McMuffin; Sausage McGriddles
- 41. Salad Cluster: Premium Southwest Salad with Grilled Chicken; Premium Caesar Salad with Grilled Chicken; Side Salad; Premium Asian Salad without Chicken; Premium Bacon Ranch Salad without Chicken
- 42. Sauces Cluster 2/6: Hot Mustard Sauce; Spicy Buffalo Sauce; Newman's Own Low Fat Balsamic Vinaigrette; Ketchup Packet; Barbeque Sauce; Chipotle Barbeque Sauce
- 43. Creamy Sauces: Creamy Ranch Sauce; Newman's Own Creamy Caesar Dressing; Coffee Cream; Iced Coffee with Sugar Free Vanilla Syrup
- 44. Oatmeal and Apples on their own
- 45. Breakfast artery-clogging cluster: Sausage McMuffin with Egg; Sausage Burrito; Egg McMuffin; Bacon, Egg & Cheese Biscuit; McSkillet Burrito with Sausage; Big Breakfast with Hotcakes