UNCLASSIFIED
Statistical Clustering: k-means, Gaussian
Mixtures, Variational Inference
22-FEB-2012
What is Clustering?
2 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document.
Design Considerations
• Features
• Dimension
• Model: Distance / Cost
• Bias / Variance
Why do we care?
Scope of Talk – Main Take Away Point
It’s all About the Posterior
$p(L \mid D)$
K-means
How does it work
Math behind it
Issues
GMM
How does it work
Math behind it
Issues
Variational
Just the facts
Variational Inference
GMM, EM, (Graph Cuts, Spectral Clustering)
K-means, vector quantization
Scope of Talk
Main Take Away Point
It’s all Just Posterior Estimation
Variational / MCMC
GMM
K-means / vector quantization
K-means
How does it work
Math behind it
Issues
GMM
How does it work
Math behind it
Issues
Variational
Just the facts
K-means – How it works
Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype $\boldsymbol{\mu}_k$
Iterative two-step process:
• E-step: assign each data point to the nearest prototype
• M-step: update each prototype to be the mean of its cluster
Simple version: Euclidean distance, requires whitening
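The two-step loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the presenter's code: the function name `kmeans`, the random initialization from data points, and the toy two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain k-means with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize the prototypes as K distinct data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # "E-step": squared distance from every point to every prototype, shape (N, K).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # "M-step": move each prototype to the mean of the points assigned to it.
        new_mu = mu.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old prototype
                new_mu[k] = members.mean(axis=0)
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    # Final assignment against the final prototypes.
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return mu, d2.argmin(axis=1)

# Toy data: two well-separated blobs (made up for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
mu, labels = kmeans(X, K=2)
```

On data this well separated the two prototypes should land near (0, 0) and (5, 5); on harder data the result depends on the initialization.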
Design Considerations
• Features
• Dimension
• Model: Distance / Cost
• Bias / Variance
[Slides 7–15: figures only, ending with a frame labeled "Converged".]
k-means - Math
• Responsibilities – assign data to cluster
• Cost Function
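A common way to write these two quantities, assuming data points $x_n$ and binary indicators $r_{nk}$ (this notation is an assumption, chosen to match the prototypes $\boldsymbol{\mu}_k$ used earlier):

```latex
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert x_n - \boldsymbol{\mu}_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\qquad
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \boldsymbol{\mu}_k \rVert^2
```

Each point contributes to the cost only through the one cluster it is assigned to, so $J$ is the total squared distance from the points to their prototypes.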
Minimizing the Cost Function
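A sketch of the usual argument, assuming the cost is the sum of squared distances $J$ with binary assignments $r_{nk}$ (notation assumed, not taken from the slide). With the prototypes held fixed, $J$ is minimized by assigning each point to its nearest prototype. With the assignments held fixed, setting the gradient with respect to $\boldsymbol{\mu}_k$ to zero gives the cluster mean:

```latex
\begin{aligned}
J &= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \boldsymbol{\mu}_k \rVert^2 \\
\frac{\partial J}{\partial \boldsymbol{\mu}_k}
  &= -2 \sum_{n=1}^{N} r_{nk} \, (x_n - \boldsymbol{\mu}_k) = 0
\quad\Longrightarrow\quad
\boldsymbol{\mu}_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}
\end{aligned}
```

Neither step can increase $J$, which is why the iteration converges, though possibly to a local minimum.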
What can go wrong?
What can go wrong? A great deal.
• How do we choose K? (gap statistic / prediction strength)
• How do we initialize? (k-means++ seems to be the best)
• Local minima – run hundreds of times with different initializations
• Are we overfitting? Probably.
• But hey – it's simple to understand and does not cost too many cycles
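Two of the remedies above, k-means++ seeding and repeated restarts, can be sketched together. This is an illustrative sketch under stated assumptions: the helper names `kmeans_pp_init`, `lloyd`, and `best_of_restarts`, the restart count, and the toy data are all hypothetical, and the code assumes the data contain more than K distinct points.

```python
import numpy as np

def kmeans_pp_init(X, K, rng):
    """k-means++ seeding: pick initial prototypes that are spread out."""
    # First center: a data point chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # Squared distance from each point to its nearest already-chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next center: sampled with probability proportional to that distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers, dtype=float)

def lloyd(X, mu, n_iter=100):
    """Run the assign/update loop from given prototypes; return the final cost too."""
    for _ in range(n_iter):
        labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(len(mu))])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return mu, d2.argmin(axis=1), d2.min(axis=1).sum()

def best_of_restarts(X, K, n_restarts=20, seed=0):
    """Keep the run with the lowest cost over several seeded restarts."""
    rng = np.random.default_rng(seed)
    runs = [lloyd(X, kmeans_pp_init(X, K, rng)) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Toy data: three blobs around made-up centers.
true_centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(c, 0.5, (40, 2)) for c in true_centers])
mu, labels, cost = best_of_restarts(X, K=3)
```

Restarts do not remove local minima; they only make it more likely that at least one run lands in a good one.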
Quick word on distances (k-medoids)
Mahalanobis
• Not dependent on scale of measurement
• Tuning parameter
Manhattan / City Block
• Dampens outliers
Euclidean
• Need to whiten
• Outliers are an issue
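The three distances can be compared on a small example. This is a sketch with made-up points and a made-up covariance matrix; the function names are chosen for the illustration only. The last lines check the "not dependent on scale" property: rescaling one coordinate (and the covariance with it) leaves the Mahalanobis distance unchanged.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def mahalanobis(x, y, cov):
    diff = x - y
    # sqrt(diff^T cov^{-1} diff), using a linear solve instead of an explicit inverse.
    return np.sqrt(diff @ np.linalg.solve(cov, diff))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
cov = np.array([[9.0, 0.0], [0.0, 16.0]])  # made-up covariance for illustration

print(euclidean(x, y))         # 5.0
print(manhattan(x, y))         # 7.0
print(mahalanobis(x, y, cov))  # sqrt(9/9 + 16/16) = sqrt(2), about 1.414

# Rescale the first coordinate by 10 (for example, a change of units).
S = np.diag([10.0, 1.0])
rescaled = mahalanobis(S @ x, S @ y, S @ cov @ S.T)
```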
Quicker word on flavors
• Exclusive Clustering: k-means, weighted k-means
• Overlapping Clustering: fuzzy c-means
• Nonlinear Clustering: kernel k-means (spectral clustering, normalized cuts)
• Hierarchical Clustering: Hierarchical
Probabilistic Clustering
• Represent the probability distribution of the data as a mixture model
• Captures uncertainty in cluster assignments
• Gives a model for the data distribution
• Bayesian mixture – we can figure out K more easily
• Consider a mixture of Gaussians
Multivariate Gaussian Distribution Review
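For reference, the density of a $D$-dimensional Gaussian with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ in its usual textbook form (assumed here to be the form under review):

```latex
\mathcal{N}(x \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2} \, \lvert \boldsymbol{\Sigma} \rvert^{1/2}}
    \exp\!\left\{ -\tfrac{1}{2} (x - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (x - \boldsymbol{\mu}) \right\}
```

The quadratic form in the exponent is the squared Mahalanobis distance between $x$ and $\boldsymbol{\mu}$.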
Likelihood Function
Maximum Likelihood
What is the best fit to my data
Approximation of Posterior!
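One way to read the "Approximation of Posterior!" remark, assuming $\theta$ stands for the model parameters and $D$ for the data (symbols chosen for this note, not taken from the slide): by Bayes' rule the posterior is proportional to likelihood times prior, so maximizing the likelihood alone corresponds to taking the posterior mode under a flat prior.

```latex
p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)} \;\propto\; p(D \mid \theta) \, p(\theta),
\qquad
\theta_{\mathrm{ML}} = \arg\max_{\theta} \, p(D \mid \theta)
```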
Maximum Likelihood Solution for One Gaussian
• Sample mean
• Sample covariance
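As a small illustration of these two estimates, the sketch below fits a single Gaussian to synthetic data. The function name `fit_gaussian_ml` and the "true" parameters are made up for the example. Note that the maximum-likelihood covariance divides by N rather than N - 1.

```python
import numpy as np

def fit_gaussian_ml(X):
    """Maximum-likelihood mean and covariance of a single Gaussian."""
    N = X.shape[0]
    mu = X.mean(axis=0)            # sample mean
    centered = X - mu
    # ML covariance divides by N (not N - 1), so it is slightly biased.
    sigma = centered.T @ centered / N
    return mu, sigma

# Synthetic data from a known Gaussian (parameters made up for illustration).
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_sigma, size=5000)

mu_hat, sigma_hat = fit_gaussian_ml(X)
```

With 5000 samples the estimates should be close to the true parameters, and `sigma_hat` should agree with `np.cov(X, rowvar=False, bias=True)`.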
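Gaussian Mixtures
• Linear superposition of Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
• Normalization and positivity require $\sum_{k=1}^{K} \pi_k = 1$ and $0 \le \pi_k \le 1$
• Can interpret the mixing coefficients $\pi_k$ as prior probabilities
• [Aside] We can sample from this. Given the mixing coefficients, means, and covariances, draw samples from $p(x)$ – our dataset.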
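The sampling process in the aside can be sketched as a two-stage draw: first pick a component according to the mixing coefficients, then draw from that component's Gaussian. The function name `sample_gmm` and all parameter values below are made up for illustration.

```python
import numpy as np

def sample_gmm(pi, mus, sigmas, n, rng):
    """Draw n points from a Gaussian mixture by ancestral sampling."""
    # Step 1: choose a component index for each sample with probabilities pi.
    z = rng.choice(len(pi), size=n, p=pi)
    # Step 2: draw each point from the Gaussian selected for it.
    X = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
    return X, z

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                         # mixing coefficients
mus = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])   # component means
sigmas = np.array([np.eye(2) * 0.3] * 3)               # component covariances
X, z = sample_gmm(pi, mus, sigmas, n=2000, rng=rng)
```

In a real clustering problem only `X` is observed; the component labels `z` are exactly what has to be inferred.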
Fitting the Gaussian Mixture
• We wish to invert this sampling process: given the data, find the corresponding parameters (as we did for the single Gaussian case)
  • Mixing coefficients
  • Means
  • Covariances
• If we knew which Gaussian each data point "belonged" to (was the responsibility of), we could use our single-Gaussian ML solution
• Problem: we don't have labels, which complicates things.
• Solution: create a latent (hidden) variable z that tells us which data point goes with which Gaussian
Posterior of latent variable
• $\pi_k \equiv p(z_k = 1)$: the probability that the data point $x$ was generated by the $k$-th Gaussian, with no prior knowledge of $x$.
• $\gamma_k(x) \equiv p(z_k = 1 \mid x)$: the probability that the data point $x$ was generated by the $k$-th Gaussian, after observing $x$.
• $\gamma_k(x) = \dfrac{\pi_k \, \mathcal{N}(x \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• Also called responsibilities
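The responsibility formula is Bayes' rule applied per data point, and can be sketched directly. The helper names `gaussian_pdf` and `responsibilities` and the example parameters are assumptions for illustration. A practical implementation would usually work in log space to avoid underflow.

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Multivariate Gaussian density evaluated at every row of X."""
    D = X.shape[1]
    diff = X - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for each row.
    quad = np.sum((diff @ np.linalg.inv(sigma)) * diff, axis=1)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

def responsibilities(X, pi, mus, sigmas):
    """gamma[n, k] = p(z_k = 1 | x_n): prior times likelihood, normalized over k."""
    weighted = np.column_stack(
        [pi[k] * gaussian_pdf(X, mus[k], sigmas[k]) for k in range(len(pi))]
    )
    return weighted / weighted.sum(axis=1, keepdims=True)

# Two equally weighted unit-covariance components (made-up parameters).
pi = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [4.0, 4.0]])
sigmas = np.array([np.eye(2), np.eye(2)])
X = np.array([[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]])

gamma = responsibilities(X, pi, mus, sigmas)
```

The first two points sit on a component mean and are assigned almost entirely to it; the third is equidistant from both, so its responsibility splits evenly.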
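Maximum Likelihood for GMM
• The log likelihood takes this form:
• $\ln p(D \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
• Notice that the sum sits inside the log, so there is no closed-form solution.
• Solve with the expectation-maximization (EM) algorithm
• Derivative w.r.t. $\boldsymbol{\mu}_k$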
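A sketch of that derivative, using the responsibilities $\gamma_k(x_n)$ defined earlier and introducing $N_k$ for the effective number of points in component $k$ (a symbol added here for convenience):

```latex
\begin{aligned}
\frac{\partial}{\partial \boldsymbol{\mu}_k} \ln p(D \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  &= \sum_{n=1}^{N}
     \frac{\pi_k \, \mathcal{N}(x_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
          {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
     \, \boldsymbol{\Sigma}_k^{-1} (x_n - \boldsymbol{\mu}_k) \\
  &= \sum_{n=1}^{N} \gamma_k(x_n) \, \boldsymbol{\Sigma}_k^{-1} (x_n - \boldsymbol{\mu}_k) = 0 \\
\Longrightarrow\quad
\boldsymbol{\mu}_k &= \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k(x_n) \, x_n,
\qquad N_k = \sum_{n=1}^{N} \gamma_k(x_n)
\end{aligned}
```

This is not a closed-form solution, because $\gamma_k(x_n)$ itself depends on $\boldsymbol{\mu}_k$; it is a responsibility-weighted mean that EM re-evaluates on each iteration.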
EM – notice each one of these depends on the responsibilities
• Do the same for the covariances
• Use a Lagrange multiplier for the mixing coefficients
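Putting the pieces together, EM for a Gaussian mixture alternates between computing responsibilities and re-estimating the parameters with responsibility-weighted averages. The sketch below is one possible implementation, not the presenter's: the names `log_gaussian` and `em_gmm`, the random initialization, the fixed iteration count, and the small `reg` term added to each covariance for numerical stability are all assumptions.

```python
import numpy as np

def log_gaussian(X, mu, sigma):
    """Log density of a multivariate Gaussian at every row of X."""
    D = X.shape[1]
    diff = X - mu
    quad = np.sum((diff @ np.linalg.inv(sigma)) * diff, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + quad)

def em_gmm(X, K, n_iter=100, seed=0, reg=1e-6):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Assumed initialization: uniform weights, random data points as means,
    # and the overall data covariance for every component.
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    sigmas = np.array([np.cov(X, rowvar=False) + reg * np.eye(D)] * K)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_w = np.column_stack(
            [np.log(pi[k]) + log_gaussian(X, mus[k], sigmas[k]) for k in range(K)]
        )
        log_norm = np.logaddexp.reduce(log_w, axis=1)  # log of the sum over k
        gamma = np.exp(log_w - log_norm[:, None])
        log_liks.append(log_norm.sum())                # log likelihood before this update
        # M-step: every update is a responsibility-weighted average.
        Nk = gamma.sum(axis=0)
        pi = Nk / N                                    # from the Lagrange-multiplier step
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D)
    return pi, mus, sigmas, np.array(log_liks)

# Toy data: a 60/40 mixture of two well-separated blobs (made up for illustration).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, (360, 2)), rng.normal(5.0, 0.5, (240, 2))])
pi, mus, sigmas, log_liks = em_gmm(X, K=2)
```

Tracking `log_liks` is a useful sanity check: the log likelihood should rise over the run and flatten out as EM converges.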
Relation to k-means
Fast food example
http://nutrition.mcdonalds.com/nutritionexchange/nutritionfacts.pdf
Dessert Cluster
Caramel Mocha
Frappe Caramel
Iced Hazelnut Latte
Iced Coffee
Strawberry Triple Thick Shake
Snack Size McFlurry
Hot Caramel Sundae
Baked Hot Apple Pie
Cinnamon Melts
Kiddie Cone
Strawberry Sundae
Burger-like cluster
Hamburger
Cheeseburger
Filet-O-Fish
Quarter Pounder with Cheese
Premium Grilled Chicken Club Sandwich
Ranch Snack Wrap
Premium Asian Salad with Crispy Chicken
Butter Garlic Croutons
Sausage McMuffin
Sausage McGriddles
Salad Cluster
Premium Southwest Salad with Grilled Chicken
Premium Caesar Salad with Grilled Chicken
Side Salad
Premium Asian Salad without Chicken
Premium Bacon Ranch Salad without Chicken
Sauces Cluster 2 /6
Hot Mustard Sauce
Spicy Buffalo Sauce
Newman’s Own Low Fat Balsamic Vinaigrette
Ketchup Packet
Barbeque Sauce
Chipotle Barbeque Sauce
Creamy Sauces
Creamy Ranch Sauce
Newman’s Own Creamy Caesar Dressing
Coffee Cream
Iced Coffee with Sugar Free Vanilla Syrup
Oatmeal and Apples on their own
Breakfast artery clogging cluster
Sausage McMuffin with Egg
Sausage Burrito
Egg McMuffin
Bacon, Egg & Cheese Biscuit
McSkillet Burrito with Sausage
Big Breakfast with Hotcakes

More Related Content

More from Tushar Tank

Image Processing Background Elimination in Video Editting
Image Processing Background Elimination in Video EdittingImage Processing Background Elimination in Video Editting
Image Processing Background Elimination in Video EdittingTushar Tank
 
Intuition behind Monte Carlo Markov Chains
Intuition behind Monte Carlo Markov ChainsIntuition behind Monte Carlo Markov Chains
Intuition behind Monte Carlo Markov ChainsTushar Tank
 
Bayesian Analysis Fundamentals with Examples
Bayesian Analysis Fundamentals with ExamplesBayesian Analysis Fundamentals with Examples
Bayesian Analysis Fundamentals with ExamplesTushar Tank
 
Review of CausalImpact / Bayesian Structural Time-Series Analysis
Review of CausalImpact / Bayesian Structural Time-Series AnalysisReview of CausalImpact / Bayesian Structural Time-Series Analysis
Review of CausalImpact / Bayesian Structural Time-Series AnalysisTushar Tank
 
Tech Talk overview of xgboost and review of paper
Tech Talk overview of xgboost and review of paperTech Talk overview of xgboost and review of paper
Tech Talk overview of xgboost and review of paperTushar Tank
 
Shapley Tech Talk - SHAP and Shapley Discussion
Shapley Tech Talk - SHAP and Shapley DiscussionShapley Tech Talk - SHAP and Shapley Discussion
Shapley Tech Talk - SHAP and Shapley DiscussionTushar Tank
 
Variational Inference
Variational InferenceVariational Inference
Variational InferenceTushar Tank
 
Time Frequency Analysis for Poets
Time Frequency Analysis for PoetsTime Frequency Analysis for Poets
Time Frequency Analysis for PoetsTushar Tank
 
Kalman filter upload
Kalman filter uploadKalman filter upload
Kalman filter uploadTushar Tank
 

More from Tushar Tank (10)

Image Processing Background Elimination in Video Editting
Image Processing Background Elimination in Video EdittingImage Processing Background Elimination in Video Editting
Image Processing Background Elimination in Video Editting
 
Intuition behind Monte Carlo Markov Chains
Intuition behind Monte Carlo Markov ChainsIntuition behind Monte Carlo Markov Chains
Intuition behind Monte Carlo Markov Chains
 
Bayesian Analysis Fundamentals with Examples
Bayesian Analysis Fundamentals with ExamplesBayesian Analysis Fundamentals with Examples
Bayesian Analysis Fundamentals with Examples
 
Review of CausalImpact / Bayesian Structural Time-Series Analysis
Review of CausalImpact / Bayesian Structural Time-Series AnalysisReview of CausalImpact / Bayesian Structural Time-Series Analysis
Review of CausalImpact / Bayesian Structural Time-Series Analysis
 
Tech Talk overview of xgboost and review of paper
Tech Talk overview of xgboost and review of paperTech Talk overview of xgboost and review of paper
Tech Talk overview of xgboost and review of paper
 
Shapley Tech Talk - SHAP and Shapley Discussion
Shapley Tech Talk - SHAP and Shapley DiscussionShapley Tech Talk - SHAP and Shapley Discussion
Shapley Tech Talk - SHAP and Shapley Discussion
 
Hindu ABC Book
Hindu ABC BookHindu ABC Book
Hindu ABC Book
 
Variational Inference
Variational InferenceVariational Inference
Variational Inference
 
Time Frequency Analysis for Poets
Time Frequency Analysis for PoetsTime Frequency Analysis for Poets
Time Frequency Analysis for Poets
 
Kalman filter upload
Kalman filter uploadKalman filter upload
Kalman filter upload
 

Recently uploaded

Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxJennifer Lim
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...CzechDreamin
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimaginedpanagenda
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceSamy Fodil
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024Stephen Perrenod
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomCzechDreamin
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 

Recently uploaded (20)

Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
Behind the Scenes From the Manager's Chair: Decoding the Secrets of Successfu...
 
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties ReimaginedEasier, Faster, and More Powerful – Notes Document Properties Reimagined
Easier, Faster, and More Powerful – Notes Document Properties Reimagined
 
WebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM PerformanceWebAssembly is Key to Better LLM Performance
WebAssembly is Key to Better LLM Performance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024TopCryptoSupers 12thReport OrionX May2024
TopCryptoSupers 12thReport OrionX May2024
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 

Statistical Clustering

  • 1. UNCLASSIFIED Statistical Clustering: k-means, Gaussian Mixtures, Variational Inference 22-FEB-2012
  • 2. UNCLASSIFIED What is Clustering? 22FEB12 2 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. Design Considerations • Features • Dimension • Model: Distance / Cost • Bias / Variance
  • 3. UNCLASSIFIED Why do we care? 22FEB12 3 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document.
  • 4. UNCLASSIFIED Scope of Talk – Main Take Away Point 22FEB12 4 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. It’s all About the Posterior 𝑝 𝐿 𝐷 K-means How does it work Math behind it Issues GMM How does it work Math behind it Issues Variational Just the facts Variational Inference GMM, EM, (Graph Cuts, Spectral Clustering) K-means, vector quantization
  • 5. UNCLASSIFIED Scope of Talk 22FEB12 5 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. Main Take Away Point It’s all Just Posterior Estimation Variational / MCNC GMM K-means / vector quantization K-means How does it work Math behind it Issues GMM How does it work Math behind it Issues Variational Just the facts
  • 6. UNCLASSIFIED K-means – How it works 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 6 Goal: represent a data set in terms of K clusters each of which is summarized by a prototype 𝝁 𝒌 Iterative Two step process: E-step: assign each data point to nearest prototype M-step: update prototype to be the cluster means Simple version: Euclidean distance, requires whitening Design Considerations • Features • Dimension • Model: Distance / Cost • Bias / Variance
  • 7. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 7
  • 8. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 8
  • 9. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 9
  • 10. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 10
  • 11. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 11
  • 12. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 12
  • 13. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 13
  • 14. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 14
  • 15. UNCLASSIFIED 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 15 Converged
  • 16. UNCLASSIFIED k-means - Math  Responsibilities – assign data to cluster 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 16  Cost Function example
  • 17. UNCLASSIFIED Minimizing the Cost Function 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 17
  • 18. UNCLASSIFIED What can go wrong? 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 18
  • 19. UNCLASSIFIED What can go wrong? A great deal.  How do we choose K? (gap statistic / prediction strength)  How do we initialize? (k++ seems to be the best)  Local minimums – run hundreds of time with different initializations  Are we overfitting? Probably.  But hey – it simple to understand and does not cost too many cycles 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 19
  • 20. UNCLASSIFIED Quick word on distances (k-medioids) 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 20 Mahalanobis Not dependent on scale of measurement Tuning parameter Manhattan / City Block Dampens outliers Euclidean Need to whiten Outliers are an issue
  • 21. UNCLASSIFIED  Exclusive Clustering: k-means, weighted k-means  Overlapping Clustering: fuzzy c-means,  Nonlinear Clustering: kernel k-means (spectral clustering, normalized cuts)  Hierarchical Clustering: Hierarchical Quicker word on flavors 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 21
  • 22. UNCLASSIFIED Probabilistic Clustering  Represent the probability distribution of the data as a mixture model  Captures uncertainty in cluster assignments  Gives model for data distribution  Bayesian mixture – we can figure out K easier  Consider a mixture of Gaussians 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 22
  • 23. UNCLASSIFIED Multivariate Gaussian Distribution Review 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 23
  • 24. UNCLASSIFIED Likelihood Function 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 24 Maximum Likelihood What is the best fit to my data Approximation of Posterior!
  • 25. UNCLASSIFIED Maximum Likelihood Solution for One Gaussian  Sample mean  Sample Covariance 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 25
  • 26. UNCLASSIFIED Gaussian Mixtures  Linear super-position of Gaussians  Normalization and positivity require  Can interpret mixing coefficients as prior probabilities  [Aside]We can sample from this. Given mixing coeff, mean, variance – get a sample from p(x) – our dataset. 22FEB12 Notice: Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this document. 26
  • 27. Fitting the Gaussian Mixture. We wish to invert this sampling process: given the data, find the corresponding parameters (mixing coefficients, means, covariances), as we did for the single Gaussian case. If we knew which Gaussian each data point "belonged" to, or was the responsibility of, we could use our single-Gaussian ML solution. Problem: we don't have labels, which complicates things. Solution: introduce a latent (hidden) variable z that tells us which data point goes with which Gaussian.
  • 28. Posterior of latent variable. $\pi_k \equiv p(z_k = 1)$: the probability that the data point $x$ was generated by the $k$th Gaussian, with no prior knowledge of $x$. $\gamma_k(x) \equiv p(z_k = 1 \mid x)$: the probability that the data point $x$ was generated by the $k$th Gaussian, after observing $x$. By Bayes' rule, $\gamma_k(x) = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}$. These are also called responsibilities.
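A small sketch of the responsibility calculation for a 1-D mixture, following Bayes' rule: prior times likelihood, normalized over all components. The function names are hypothetical, and the 1-D restriction is an assumption made to keep the example dependency-free.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Posterior probability of each component having generated x."""
    # Numerator of Bayes' rule for each component: prior * likelihood.
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    # Denominator: the mixture density p(x), summed over every component j.
    total = sum(joint)
    return [j / total for j in joint]
```

A point midway between two equally weighted, equally wide components gets responsibility 0.5 from each; a point sitting on one component's mean is assigned almost entirely to that component.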
  • 29. Maximum Likelihood for GMM. The log likelihood takes this form: $\ln p(D \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Notice that the sum sits inside the log, so there is no closed-form solution. Solve with the expectation-maximization (EM) algorithm. Start with the derivative w.r.t. $\boldsymbol{\mu}_k$.
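The log likelihood can be evaluated numerically even though it has no closed-form maximizer. This is a sketch for a 1-D mixture with hypothetical function names. It assumes strictly positive mixing coefficients and uses the log-sum-exp trick, an implementation choice not mentioned on the slide, to keep the inner sum from underflowing.

```python
import math

def log_normal_pdf(x, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def gmm_log_likelihood(data, weights, means, variances):
    """Sum over data points of the log of the mixture density."""
    total = 0.0
    for x in data:
        # Log of each term pi_k * N(x | mu_k, var_k) inside the inner sum.
        terms = [math.log(w) + log_normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        # log-sum-exp: factor out the largest term before exponentiating.
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total
```

Parameters that place the components on the data score higher than parameters that place them far away, which is the quantity EM climbs.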
  • 30. EM: notice that each of these updates depends on the responsibilities. Do the same for the covariance. Use a Lagrange multiplier for the mixing coefficients.
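One EM iteration for a 1-D mixture can be sketched as follows, using the standard updates in which every new parameter is a responsibility-weighted average. The function names are hypothetical, and the small variance floor is an added safeguard against a component collapsing onto a single point; it is an assumption of this sketch, not something stated in the talk.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances, var_floor=1e-6):
    """Run one E-step and one M-step; return updated (weights, means, variances)."""
    K = len(weights)
    # E-step: responsibilities of each component for each data point.
    resp = []
    for x in data:
        joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(K)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate every parameter from the responsibilities.
    n = len(data)
    new_w, new_m, new_v = [], [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)  # effective number of points in component k
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        new_w.append(nk / n)  # the result of the Lagrange-multiplier constraint
        new_m.append(mu)
        new_v.append(max(var, var_floor))  # assumed floor, see lead-in
    return new_w, new_m, new_v
```

Calling `em_step` repeatedly, feeding each result back in, is the whole algorithm. In practice one would stop when the log likelihood stops improving.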
  • 37. Relation to k-means
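One standard way to see the relation, offered here as an illustration rather than as the slide's own argument: give every component equal weight and the same variance, then shrink that variance. The soft responsibilities harden into the nearest-prototype assignment that k-means uses. The function name and the 1-D setting are assumptions of this sketch.

```python
import math

def shared_variance_responsibilities(x, means, var):
    """Responsibilities for equally weighted 1-D Gaussians sharing one variance."""
    # With equal weights and variances, only the squared distances matter.
    logs = [-(x - m) ** 2 / (2 * var) for m in means]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

For a point at 0.4 between prototypes at 0 and 1, a large variance gives responsibilities close to one half each, while a tiny variance assigns the point almost entirely to the nearer prototype at 0.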
  • 38. Fast food example. http://nutrition.mcdonalds.com/nutritionexchange/nutritionfacts.pdf
  • 39. Dessert Cluster: Caramel Mocha Frappe Caramel Iced Hazelnut Latte Iced Coffee Strawberry Triple Thick Shake Snack Size McFlurry Hot Caramel Sundae Baked Hot Apple Pie Cinnamon Melts Kiddie Cone Strawberry Sundae
  • 40. Burger-like cluster: Hamburger; Cheeseburger; Filet-O-Fish; Quarter Pounder with Cheese; Premium Grilled Chicken Club Sandwich; Ranch Snack Wrap; Premium Asian Salad with Crispy Chicken; Butter Garlic Croutons; Sausage McMuffin; Sausage McGriddles
  • 41. Salad Cluster: Premium Southwest Salad with Grilled Chicken; Premium Caesar Salad with Grilled Chicken; Side Salad; Premium Asian Salad without Chicken; Premium Bacon Ranch Salad without Chicken
  • 42. Sauces Cluster 2/6: Hot Mustard Sauce; Spicy Buffalo Sauce; Newman’s Own Low Fat Balsamic Vinaigrette; Ketchup Packet; Barbeque Sauce; Chipotle Barbeque Sauce
  • 43. Creamy Sauces: Creamy Ranch Sauce; Newman’s Own Creamy Caesar Dressing; Coffee Cream; Iced Coffee with Sugar Free Vanilla Syrup
  • 44. Oatmeal and Apples on their own
  • 45. Breakfast artery-clogging cluster: Sausage McMuffin with Egg; Sausage Burrito; Egg McMuffin; Bacon, Egg & Cheese Biscuit; McSkillet Burrito with Sausage; Big Breakfast with Hotcakes