SlideShare a Scribd company logo
1 of 36
Chapter 06
Clustering
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
β€’ Overview
β€’ Clustering algorithms
β€’ π‘˜-means Clustering
β€’ EM Clustering
β€’ DBSCAN Clustering
β€’ Single Linkage Clustering
β€’ Comparison of the Clustering Algorithms
β€’ Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Example of Clustering
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Clustering
The General Problem
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Clustering
Object 1
Object 2
Object 3
Object 4
Object n
…
Object 1
Object 2
Object 3
Object 4
Object n
The Formal Problem
β€’ Object space
β€’ 𝑂 = π‘œπ‘π‘—π‘’π‘π‘‘1, π‘œπ‘π‘—π‘’π‘π‘‘2, …
β€’ Often infinite
β€’ Representations of the objects in a (numeric) feature space
β€’ β„± = {πœ™ π‘œ , π‘œ ∈ 𝑂}
β€’ Clustering
β€’ Grouping of the objects
β€’ Objects in the same group 𝑔 ∈ 𝐺 should be similar
β€’ 𝑐: β„± β†’ 𝐺
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
How do you
measure
similarity?
Measuring Similarity Distances
β€’ Small distance = similar
β€’ Euclidean Distance
β€’ Based on the Eucledian norm π‘₯ 2
β€’ 𝑑 π‘₯, 𝑦 = 𝑦 βˆ’ π‘₯ 2
= 𝑦1 βˆ’ π‘₯1
2 + β‹― + 𝑦𝑛 βˆ’ π‘₯𝑛
2
β€’ Manhattan Distance
β€’ Based on the Manhattan norm π‘₯ 1
β€’ 𝑑 π‘₯, 𝑦 = 𝑦 βˆ’ π‘₯ 1
= 𝑦1 βˆ’ π‘₯1 + β‹― + 𝑦𝑛 βˆ’ π‘₯𝑛
β€’ Chebyshev Distance
β€’ Based on the maximum norm π‘₯ ∞
β€’ 𝑑 π‘₯, 𝑦 = 𝑦 βˆ’ π‘₯ ∞
= max
𝑖=1..𝑛
|𝑦𝑖 βˆ’ π‘₯𝑖|
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2 2 2 2 2
2 1 1 1 2
2 1 0 1 2
2 1 1 1 2
2 2 2 2 2
y
x
y
x
Evaluation of Clustering Results
β€’ No general metrics, depends on algorithms
β€’ Low variance for π‘˜-Means
β€’ High density for DBSCAN
β€’ Good fit in comparison to model variables for EM clustering
β€’ …
β€’ Often manual checks
β€’ Do the clusters make sense?
β€’ Can be difficult
β€’ Very large data
β€’ Many clusters
β€’ High dimensional data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
β€’ Overview
β€’ Clustering algorithms
β€’ π’Œ-means Clustering
β€’ EM Clustering
β€’ DBSCAN Clustering
β€’ Single Linkage Clustering
β€’ Comparison of the Clustering Algorithms
β€’ Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Idea Behind π‘˜β€“means Clustering
β€’ Clusters are described by their center
β€’ The centers are called centroid
β€’ Centroid-based clustering
β€’ Objects are assigned to the closests centroid
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
How do you
get the
centroids?
Simple Algorithm
β€’ Select initial centroids 𝐢1, … , πΆπ‘˜
β€’ Randomized
β€’ Assign each object to closest centroid
β€’ 𝑐 π‘₯ = argmin𝑖=1..π‘˜ 𝑑(π‘₯, 𝐢𝑖)
β€’ Update centroid
β€’ Arithmetic mean of assigned objects
β€’ 𝐢𝑖 =
1
π‘₯:𝑐 π‘₯ =𝑖 π‘₯:𝑐 π‘₯ =𝑖 π‘₯𝑖
β€’ Repeat update and assignment
β€’ Until convergence, or
β€’ Until maximum number of iterations
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Visualization of the π‘˜β€“means Algorithm
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Selecting π‘˜
β€’ Intuition and knowledge about data
β€’ Based on looking at plots
β€’ Based on domain knowledge
β€’ Due to goal
β€’ Fixed number of groups desired
β€’ Based on best fit
β€’ Within-sum-of-squares
β€’ π‘Šπ‘†π‘† = 𝑖=1
π‘˜
π‘₯: 𝑐 π‘₯ =𝑖 𝑑 π‘₯, 𝐢𝑖
2
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Results for π‘˜ = 2, … , 5
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Big changes in slope
(elbows) indicate
potentially good values for
k
2, 3, and 4 all okay
οƒ  use domain knowledge to decide
Splits like these indicate too
many clusters
Problems of π‘˜-Means
β€’ Depends on initial clusters
β€’ Results may be unstable
β€’ Wrong π‘˜ can lead to bad results
β€’ All features must have a similar scale
β€’ Differences in scale introduce artificial weights between features
β€’ Large scales dominate small scales
β€’ Only works well for β€œroundβ€œ clusters
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
β€’ Overview
β€’ Clustering algorithms
β€’ π‘˜-means Clustering
β€’ EM Clustering
β€’ DBSCAN Clustering
β€’ Single Linkage Clustering
β€’ Comparison of the Clustering Algorithms
β€’ Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Idea Behind EM Clustering
β€’ Clusters are described by probability distributions
β€’ Usually normal distribution (β€œGaussian Mixture Modelβ€œ)
β€’ Distribution-based clustering
β€’ Objects are assigned to the β€œmost likelyβ€œ cluster
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
How do you
get the
distributions?
Mean
Covariance
(Simplified!) EM Algorithm
β€’ Task: Determine π‘˜ normal distributions that β€œfit” the data well
β€’ 𝐢1~ πœ‡1, 𝜎1 , … , πΆπ‘˜~ πœ‡π‘˜, πœŽπ‘˜ ,
β€’ Estimate start values similar to π‘˜-means
β€’ Expectation step
β€’ Calculate weights of objects
β€’ Weights define the likelihood that an object belongs to a cluster
β€’ 𝑀𝑗(π‘₯) =
𝑝(π‘₯|πœ‡π‘—,πœŽπ‘—)
𝑖=1
π‘˜
𝑝(π‘₯|πœ‡π‘–,πœŽπ‘–)
for all objects π‘₯ ∈ 𝑋
β€’ Maximization step
β€’ Update mean values
β€’ πœ‡π‘— =
1
|𝑋| π‘₯βˆˆπ‘‹ 𝑀𝑗 π‘₯ βˆ™ π‘₯
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
WARNING:
This is a correct, but simplified
version of the algorithm that ignores
the update of the (co)variance.
Visualization of the EM Algorithm
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Selecting π‘˜
β€’ Same as π‘˜-means: Intuition, knowledge, goal
β€’ Bayesian Information Criterion (BIC)
β€’ Difference between the model complexity and the likelihood of the
clusters
β€’ 𝐡𝐼𝐢 = ln |X| kβ€² βˆ’ 2 β‹… ln(𝐿 𝐢1, … , πΆπ‘˜; 𝑋 )
β€’ kβ€²
is the number of model parameters (i.e., mean values, covariances)
β€’ 𝐿 𝐢1, … , πΆπ‘˜; 𝑋 = 𝑝(𝐢1, … πΆπ‘˜|𝑋) is the likelihood function
β€’ The lower the better
β€’ Decreases with less complex models
β€’ Decreases with better likelihood
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Results for π‘˜ = 1, … , 4
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Minimum = optimal ratio
between model complexity
and goodness of fit
Problems of EM Clustering
β€’ Depends on initial clusters
β€’ Results may be unstable
β€’ Wrong π‘˜ can lead to bad results
β€’ May not converge
β€’ Only works well with normally distributed clusters
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
β€’ Overview
β€’ Clustering algorithms
β€’ π‘˜-means Clustering
β€’ EM Clustering
β€’ DBSCAN Clustering
β€’ Single Linkage Clustering
β€’ Comparison of the Clustering Algorithms
β€’ Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Idea behind DBSCAN
β€’ Clusters are described by other objects close by
β€’ Density-based clustering
β€’ Scan area around an object for other objects
β€’ If objects are found, they probably belong to the same group
β€’ If no objects are found, the object is probably noise
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Noise
(Relatively) Simple Algorithm
β€’ Two parameters
β€’ Neighborhood size πœ–
β€’ Minimal number of points to be considered dense π‘šπ‘–π‘›π‘ƒπ‘‘π‘ 
β€’ Determine all objects with dense neighborhoods (core points)
β€’ π‘₯ ∈ 𝑋 π‘ π‘’π‘β„Ž π‘‘β„Žπ‘Žπ‘‘ π‘₯β€²
∈ 𝑋: 𝑑 π‘₯, π‘₯β€²
≀ πœ– β‰₯ π‘šπ‘–π‘›π‘ƒπ‘‘π‘ 
β€’ Grow clusters by assigning all points that share a neighborhood
to the same cluster
β€’ All points that are neither core points nor in the neighborhood of
a core point are noise
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Visualization of the DBSCAN Algorithm
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Problems of DBSCAN
β€’ All features must be in the same range
β€’ What if different clusters have different densities?
οƒ  Main problem of DBSCAN!
β€’ This is also related to the size of the data
οƒ  DBSCAN is very sensitive to sampling
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
β€’ Overview
β€’ Clustering algorithms
β€’ π‘˜-means Clustering
β€’ EM Clustering
β€’ DBSCAN Clustering
β€’ Single Linkage Clustering
β€’ Comparison of the Clustering Algorithms
β€’ Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Idea behind Hierarchical Clustering
β€’ Clusters are described by hierarchies of similarity
β€’ Hierarchical clustering (also called connectivity-based clustering)
β€’ Find most similar pair of objects and establish link
β€’ β€œNearest Neighbor Clusteringβ€œ
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Simple Single Linkage Algorithm (SLINK)
β€’ Every object has its own cluster at the beginning
β€’ The level of all these basic clusters is 0
β€’ 𝐿 𝐢 = 0 for all 𝐢 = π‘₯ with π‘₯ ∈ 𝑋
β€’ Find two closest clusters
β€’ 𝐢, 𝐢′
= argmin𝐢,πΆβ€²βˆˆπΆπ‘™π‘’π‘ π‘‘π‘’π‘Ÿπ‘  𝑑 𝐢, 𝐢′
β€’ 𝑑 𝐢, 𝐢′
= min
π‘₯∈𝐢,π‘₯β€²βˆˆπΆβ€²
𝑑(π‘₯, π‘₯β€²
)
β€’ Merge 𝐢, 𝐢′ into a new cluster 𝐢𝑛𝑒𝑀 = 𝐢 βˆͺ 𝐢′
β€’ The level is the distance between the initial clusters
β€’ 𝐿 𝐢𝑛𝑒𝑀 = 𝑑 𝐢, 𝐢′
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Dendrograms of Clustering
β€’ Visualizes clustering as a tree
β€’ Horizontal line: Merging of two clusters
β€’ Vertical line: Increase of the level due to merge
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Cutoff for the β€œrealβ€œ
clusters, must be
chosen manually
Each object is a leaf
node
Nodes that are
subsequently merged 15
times are suppressed
Problems with Hierarchical Clustering
β€’ Often scales badly in terms of memory consumption
β€’ Standard algorithm requires square matrix of distances between all
objects
β€’ All features must be in the same range
β€’ Different densities in different clusters may be problematic
β€’ Hard to find single cut-off
β€’ Can be solved by visual analysis of the dendrogram
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
β€’ Overview
β€’ Clustering algorithms
β€’ π‘˜-means Clustering
β€’ EM Clustering
β€’ DBSCAN Clustering
β€’ Single Linkage Clustering
β€’ Comparison of the Clustering Algorithms
β€’ Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Comparison of Clusters
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Comparison of Execution Times
β€’ Single linkage requires too much memory for larger clusters
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Strengths and Weaknesses
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Cluster
number
Explanatory
value
Concise
representation
Categorical
features
Missing
features
Correlated
features
π‘˜-means - + + - - -
EM 0 + + - - 0
DBSCAN + - - - - -
SLINK 0 + - - - -
There are clustering algorithms for
categorical data, e.g., π‘˜-modes
Summary
β€’ Clustering is concerned with the inference of groups for objects
β€’ Works well for numeric data but is often not well suited for
categorical data
β€’ Scales are very important for most clustering algorithms
β€’ Different types of clustering algorithms
β€’ Centroid-based
β€’ Distribution-based
β€’ Density-based
β€’ Hierarchical / connectivity-based
β€’ Evaluation often difficult and requires manual intervention
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science

More Related Content

Similar to 06-Clustering.pptx

Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
Β 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
Β 
about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.MohammadMoreb
Β 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)Ali-ziane Myriam
Β 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-parisMaarten Breddels
Β 
Unsupervised learning Modi.pptx
Unsupervised learning Modi.pptxUnsupervised learning Modi.pptx
Unsupervised learning Modi.pptxssusere1fd42
Β 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
Β 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptxShree Shree
Β 
Data minig.pptx
Data minig.pptxData minig.pptx
Data minig.pptxSabthamiS1
Β 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdfbintis1
Β 
Classification and Clustering
Classification and ClusteringClassification and Clustering
Classification and ClusteringYogendra Tamang
Β 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondFrank Kelly
Β 
Hierarichal cluster ml
Hierarichal cluster   mlHierarichal cluster   ml
Hierarichal cluster mlBangalore
Β 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxsandeepsandy494692
Β 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesLuiz Henrique Zambom Santana
Β 
Java 103 intro to java data structures
Java 103   intro to java data structuresJava 103   intro to java data structures
Java 103 intro to java data structuresagorolabs
Β 
Java OOP Programming language (Part 4) - Collection
Java OOP Programming language (Part 4) - CollectionJava OOP Programming language (Part 4) - Collection
Java OOP Programming language (Part 4) - CollectionOUM SAOKOSAL
Β 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptxNTUConcepts1
Β 

Similar to 06-Clustering.pptx (20)

Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
Β 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
Β 
about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.about data mining and Exp about data mining and Exp.
about data mining and Exp about data mining and Exp.
Β 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
Β 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
Β 
Unsupervised learning Modi.pptx
Unsupervised learning Modi.pptxUnsupervised learning Modi.pptx
Unsupervised learning Modi.pptx
Β 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
Β 
Clusteryanam
ClusteryanamClusteryanam
Clusteryanam
Β 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
Β 
Data minig.pptx
Data minig.pptxData minig.pptx
Data minig.pptx
Β 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
Β 
Classification and Clustering
Classification and ClusteringClassification and Clustering
Classification and Clustering
Β 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
Β 
Hierarichal cluster ml
Hierarichal cluster   mlHierarichal cluster   ml
Hierarichal cluster ml
Β 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
Β 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
Β 
Exploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queriesExploiting the query structure for efficient join ordering in SPARQL queries
Exploiting the query structure for efficient join ordering in SPARQL queries
Β 
Java 103 intro to java data structures
Java 103   intro to java data structuresJava 103   intro to java data structures
Java 103 intro to java data structures
Β 
Java OOP Programming language (Part 4) - Collection
Java OOP Programming language (Part 4) - CollectionJava OOP Programming language (Part 4) - Collection
Java OOP Programming language (Part 4) - Collection
Β 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
Β 

More from Shree Shree

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptxShree Shree
Β 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptxShree Shree
Β 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptxShree Shree
Β 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptxShree Shree
Β 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptxShree Shree
Β 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptxShree Shree
Β 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdfShree Shree
Β 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptxShree Shree
Β 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptxShree Shree
Β 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptxShree Shree
Β 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptxShree Shree
Β 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptxShree Shree
Β 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptxShree Shree
Β 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptxShree Shree
Β 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptxShree Shree
Β 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptxShree Shree
Β 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptxShree Shree
Β 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptxShree Shree
Β 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptxShree Shree
Β 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptxShree Shree
Β 

More from Shree Shree (20)

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
Β 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
Β 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
Β 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
Β 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
Β 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
Β 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
Β 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
Β 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
Β 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
Β 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
Β 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
Β 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
Β 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
Β 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
Β 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
Β 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
Β 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
Β 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
Β 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
Β 

Recently uploaded

Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
Β 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
Β 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
Β 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
Β 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
Β 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
Β 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
Β 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
Β 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
Β 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
Β 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
Β 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
Β 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
Β 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
Β 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
Β 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
Β 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
Β 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
Β 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
Β 

Recently uploaded (20)

Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Β 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
Β 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
Β 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
Β 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
Β 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Β 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
Β 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
Β 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Β 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
Β 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
Β 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
Β 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Β 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
Β 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
Β 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Β 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
Β 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
Β 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
Β 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
Β 

06-Clustering.pptx

  • 1. Chapter 06 Clustering Dr. Steffen Herbold herbold@cs.uni-goettingen.de Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 2. Outline β€’ Overview β€’ Clustering algorithms β€’ π‘˜-means Clustering β€’ EM Clustering β€’ DBSCAN Clustering β€’ Single Linkage Clustering β€’ Comparison of the Clustering Algorithms β€’ Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 3. Example of Clustering Introduction to Data Science https://sherbold.github.io/intro-to-data-science Clustering
  • 4. The General Problem Introduction to Data Science https://sherbold.github.io/intro-to-data-science Clustering Object 1 Object 2 Object 3 Object 4 Object n … Object 1 Object 2 Object 3 Object 4 Object n
  • 5. The Formal Problem β€’ Object space β€’ 𝑂 = π‘œπ‘π‘—π‘’π‘π‘‘1, π‘œπ‘π‘—π‘’π‘π‘‘2, … β€’ Often infinite β€’ Representations of the objects in a (numeric) feature space β€’ β„± = {πœ™ π‘œ , π‘œ ∈ 𝑂} β€’ Clustering β€’ Grouping of the objects β€’ Objects in the same group 𝑔 ∈ 𝐺 should be similar β€’ 𝑐: β„± β†’ 𝐺 Introduction to Data Science https://sherbold.github.io/intro-to-data-science How do you measure similarity?
  • 6. Measuring Similarity Distances β€’ Small distance = similar β€’ Euclidean Distance β€’ Based on the Eucledian norm π‘₯ 2 β€’ 𝑑 π‘₯, 𝑦 = 𝑦 βˆ’ π‘₯ 2 = 𝑦1 βˆ’ π‘₯1 2 + β‹― + 𝑦𝑛 βˆ’ π‘₯𝑛 2 β€’ Manhattan Distance β€’ Based on the Manhattan norm π‘₯ 1 β€’ 𝑑 π‘₯, 𝑦 = 𝑦 βˆ’ π‘₯ 1 = 𝑦1 βˆ’ π‘₯1 + β‹― + 𝑦𝑛 βˆ’ π‘₯𝑛 β€’ Chebyshev Distance β€’ Based on the maximum norm π‘₯ ∞ β€’ 𝑑 π‘₯, 𝑦 = 𝑦 βˆ’ π‘₯ ∞ = max 𝑖=1..𝑛 |𝑦𝑖 βˆ’ π‘₯𝑖| Introduction to Data Science https://sherbold.github.io/intro-to-data-science 2 2 2 2 2 2 1 1 1 2 2 1 0 1 2 2 1 1 1 2 2 2 2 2 2 y x y x
  • 7. Evaluation of Clustering Results β€’ No general metrics, depends on algorithms β€’ Low variance for π‘˜-Means β€’ High density for DBSCAN β€’ Good fit in comparison to model variables for EM clustering β€’ … β€’ Often manual checks β€’ Do the clusters make sense? β€’ Can be difficult β€’ Very large data β€’ Many clusters β€’ High dimensional data Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 8. Outline β€’ Overview β€’ Clustering algorithms β€’ π’Œ-means Clustering β€’ EM Clustering β€’ DBSCAN Clustering β€’ Single Linkage Clustering β€’ Comparison of the Clustering Algorithms β€’ Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 9. Idea Behind π‘˜β€“means Clustering β€’ Clusters are described by their center β€’ The centers are called centroid β€’ Centroid-based clustering β€’ Objects are assigned to the closests centroid Introduction to Data Science https://sherbold.github.io/intro-to-data-science How do you get the centroids?
  • 10. Simple Algorithm β€’ Select initial centroids 𝐢1, … , πΆπ‘˜ β€’ Randomized β€’ Assign each object to closest centroid β€’ 𝑐 π‘₯ = argmin𝑖=1..π‘˜ 𝑑(π‘₯, 𝐢𝑖) β€’ Update centroid β€’ Arithmetic mean of assigned objects β€’ 𝐢𝑖 = 1 π‘₯:𝑐 π‘₯ =𝑖 π‘₯:𝑐 π‘₯ =𝑖 π‘₯𝑖 β€’ Repeat update and assignment β€’ Until convergence, or β€’ Until maximum number of iterations Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 11. Visualization of the π‘˜β€“means Algorithm Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 12. Selecting π‘˜ β€’ Intuition and knowledge about data β€’ Based on looking at plots β€’ Based on domain knowledge β€’ Due to goal β€’ Fixed number of groups desired β€’ Based on best fit β€’ Within-sum-of-squares β€’ π‘Šπ‘†π‘† = 𝑖=1 π‘˜ π‘₯: 𝑐 π‘₯ =𝑖 𝑑 π‘₯, 𝐢𝑖 2 Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 13. Results for π‘˜ = 2, … , 5 Introduction to Data Science https://sherbold.github.io/intro-to-data-science Big changes in slope (elbows) indicate potentially good values for k 2, 3, and 4 all okay οƒ  use domain knowledge to decide Splits like these indicate too many clusters
  • 14. Problems of π‘˜-Means β€’ Depends on initial clusters β€’ Results may be unstable β€’ Wrong π‘˜ can lead to bad results β€’ All features must have a similar scale β€’ Differences in scale introduce artificial weights between features β€’ Large scales dominate small scales β€’ Only works well for β€œroundβ€œ clusters Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 15. Outline β€’ Overview β€’ Clustering algorithms β€’ π‘˜-means Clustering β€’ EM Clustering β€’ DBSCAN Clustering β€’ Single Linkage Clustering β€’ Comparison of the Clustering Algorithms β€’ Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 16. Idea Behind EM Clustering β€’ Clusters are described by probability distributions β€’ Usually normal distribution (β€œGaussian Mixture Modelβ€œ) β€’ Distribution-based clustering β€’ Objects are assigned to the β€œmost likelyβ€œ cluster Introduction to Data Science https://sherbold.github.io/intro-to-data-science How do you get the distributions? Mean Covariance
  • 17. (Simplified!) EM Algorithm β€’ Task: Determine π‘˜ normal distributions that β€œfit” the data well β€’ 𝐢1~ πœ‡1, 𝜎1 , … , πΆπ‘˜~ πœ‡π‘˜, πœŽπ‘˜ , β€’ Estimate start values similar to π‘˜-means β€’ Expectation step β€’ Calculate weights of objects β€’ Weights define the likelihood that an object belongs to a cluster β€’ 𝑀𝑗(π‘₯) = 𝑝(π‘₯|πœ‡π‘—,πœŽπ‘—) 𝑖=1 π‘˜ 𝑝(π‘₯|πœ‡π‘–,πœŽπ‘–) for all objects π‘₯ ∈ 𝑋 β€’ Maximization step β€’ Update mean values β€’ πœ‡π‘— = 1 |𝑋| π‘₯βˆˆπ‘‹ 𝑀𝑗 π‘₯ βˆ™ π‘₯ Introduction to Data Science https://sherbold.github.io/intro-to-data-science WARNING: This is a correct, but simplified version of the algorithm that ignores the update of the (co)variance.
  • 18. Visualization of the EM Algorithm Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 19. Selecting π‘˜ β€’ Same as π‘˜-means: Intuition, knowledge, goal β€’ Bayesian Information Criterion (BIC) β€’ Difference between the model complexity and the likelihood of the clusters β€’ 𝐡𝐼𝐢 = ln |X| kβ€² βˆ’ 2 β‹… ln(𝐿 𝐢1, … , πΆπ‘˜; 𝑋 ) β€’ kβ€² is the number of model parameters (i.e., mean values, covariances) β€’ 𝐿 𝐢1, … , πΆπ‘˜; 𝑋 = 𝑝(𝐢1, … πΆπ‘˜|𝑋) is the likelihood function β€’ The lower the better β€’ Decreases with less complex models β€’ Decreases with better likelihood Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 20. Results for π‘˜ = 1, … , 4 Introduction to Data Science https://sherbold.github.io/intro-to-data-science Minimum = optimal ratio between model complexity and goodness of fit
  • 21. Problems of EM Clustering β€’ Depends on initial clusters β€’ Results may be unstable β€’ Wrong π‘˜ can lead to bad results β€’ May not converge β€’ Only works well with normally distributed clusters Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 22. Outline β€’ Overview β€’ Clustering algorithms β€’ π‘˜-means Clustering β€’ EM Clustering β€’ DBSCAN Clustering β€’ Single Linkage Clustering β€’ Comparison of the Clustering Algorithms β€’ Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 23. Idea behind DBSCAN β€’ Clusters are described by other objects close by β€’ Density-based clustering β€’ Scan area around an object for other objects β€’ If objects are found, they probably belong to the same group β€’ If no objects are found, the object is probably noise Introduction to Data Science https://sherbold.github.io/intro-to-data-science Noise
  • 24. (Relatively) Simple Algorithm β€’ Two parameters β€’ Neighborhood size πœ– β€’ Minimal number of points to be considered dense π‘šπ‘–π‘›π‘ƒπ‘‘π‘  β€’ Determine all objects with dense neighborhoods (core points) β€’ π‘₯ ∈ 𝑋 π‘ π‘’π‘β„Ž π‘‘β„Žπ‘Žπ‘‘ π‘₯β€² ∈ 𝑋: 𝑑 π‘₯, π‘₯β€² ≀ πœ– β‰₯ π‘šπ‘–π‘›π‘ƒπ‘‘π‘  β€’ Grow clusters by assigning all points that share a neighborhood to the same cluster β€’ All points that are neither core points nor in the neighborhood of a core point are noise Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 25. Visualization of the DBSCAN Algorithm Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 26. Problems of DBSCAN β€’ All features must be in the same range β€’ What if different clusters have different densities? οƒ  Main problem of DBSCAN! β€’ This is also related to the size of the data οƒ  DBSCAN is very sensitive to sampling Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 27. Outline β€’ Overview β€’ Clustering algorithms β€’ π‘˜-means Clustering β€’ EM Clustering β€’ DBSCAN Clustering β€’ Single Linkage Clustering β€’ Comparison of the Clustering Algorithms β€’ Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 28. Idea behind Hierarchical Clustering β€’ Clusters are described by hierarchies of similarity β€’ Hierarchical clustering (also called connectivity-based clustering) β€’ Find most similar pair of objects and establish link β€’ β€œNearest Neighbor Clusteringβ€œ Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 29. Simple Single Linkage Algorithm (SLINK) β€’ Every object has its own cluster at the beginning β€’ The level of all these basic clusters is 0 β€’ 𝐿 𝐢 = 0 for all 𝐢 = π‘₯ with π‘₯ ∈ 𝑋 β€’ Find two closest clusters β€’ 𝐢, 𝐢′ = argmin𝐢,πΆβ€²βˆˆπΆπ‘™π‘’π‘ π‘‘π‘’π‘Ÿπ‘  𝑑 𝐢, 𝐢′ β€’ 𝑑 𝐢, 𝐢′ = min π‘₯∈𝐢,π‘₯β€²βˆˆπΆβ€² 𝑑(π‘₯, π‘₯β€² ) β€’ Merge 𝐢, 𝐢′ into a new cluster 𝐢𝑛𝑒𝑀 = 𝐢 βˆͺ 𝐢′ β€’ The level is the distance between the initial clusters β€’ 𝐿 𝐢𝑛𝑒𝑀 = 𝑑 𝐢, 𝐢′ Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 30. Dendrograms of Clustering β€’ Visualizes clustering as a tree β€’ Horizontal line: Merging of two clusters β€’ Vertical line: Increase of the level due to merge Introduction to Data Science https://sherbold.github.io/intro-to-data-science Cutoff for the β€œrealβ€œ clusters, must be chosen manually Each object is a leaf node Nodes that are subsequently merged 15 times are suppressed
  • 31. Problems with Hierarchical Clustering β€’ Often scales badly in terms of memory consumption β€’ Standard algorithm requires square matrix of distances between all objects β€’ All features must be in the same range β€’ Different densities in different clusters may be problematic β€’ Hard to find single cut-off β€’ Can be solved by visual analysis of the dendrogram Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 32. Outline β€’ Overview β€’ Clustering algorithms β€’ π‘˜-means Clustering β€’ EM Clustering β€’ DBSCAN Clustering β€’ Single Linkage Clustering β€’ Comparison of the Clustering Algorithms β€’ Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 33. Comparison of Clusters Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 34. Comparison of Execution Times β€’ Single linkage requires too much memory for larger clusters Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 35. Strengths and Weaknesses Introduction to Data Science https://sherbold.github.io/intro-to-data-science Cluster number Explanatory value Concise representation Categorical features Missing features Correlated features π‘˜-means - + + - - - EM 0 + + - - 0 DBSCAN + - - - - - SLINK 0 + - - - - There are clustering algorithms for categorical data, e.g., π‘˜-modes
  • 36. Summary β€’ Clustering is concerned with the inference of groups for objects β€’ Works well for numeric data but is often not well suited for categorical data β€’ Scales are very important for most clustering algorithms β€’ Different types of clustering algorithms β€’ Centroid-based β€’ Distribution-based β€’ Density-based β€’ Hierarchical / connectivity-based β€’ Evaluation often difficult and requires manual intervention Introduction to Data Science https://sherbold.github.io/intro-to-data-science