Clustering is a technique used in data analysis and machine learning to group similar data points together based on their characteristics or attributes. It is an unsupervised learning method that aims to discover inherent patterns or structures in the data without any predefined labels or target variables.
Here's an overview of the key aspects of clustering:
Objective: The goal of clustering is to partition a dataset into groups or clusters, where data points within the same cluster are similar to each other, and data points in different clusters are dissimilar. The objective is to maximize the intra-cluster similarity and minimize the inter-cluster similarity.
Similarity Measures: Clustering algorithms use similarity or distance measures to quantify how similar or dissimilar data points are. Common choices include Euclidean distance, Manhattan distance, cosine similarity, and correlation coefficients, depending on the nature of the data and the problem domain.
Algorithms: Various clustering algorithms exist, each with its own approach and underlying assumptions. Some popular clustering algorithms include K-means clustering, hierarchical clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), and Gaussian mixture models. Each algorithm has different strengths, weaknesses, and suitability for different types of data and clustering scenarios.
Feature Selection and Data Preprocessing: Prior to clustering, it is often important to select relevant features and preprocess the data. Feature selection helps in reducing noise, improving the quality of clustering results, and focusing on the most informative attributes. Data preprocessing steps may involve scaling, normalization, or handling missing values to ensure the data is suitable for clustering algorithms.
Evaluation: Evaluating the quality and effectiveness of clustering results is important. However, since clustering is an unsupervised technique, there is no definitive ground truth to compare against. Evaluation methods include internal validation measures such as silhouette coefficient or cohesion and separation indices, as well as external validation by comparing clustering results with known ground truth or expert judgment.
Applications: Clustering has a wide range of applications in various fields. It can be used for customer segmentation, market research, anomaly detection, image recognition, text mining, social network analysis, and more. Clustering can reveal hidden patterns, identify subgroups or clusters within a population, and provide insights for decision-making or further analysis.
Interpretation: Once clustering is performed, it is important to interpret and analyze the results. This involves understanding the characteristics and behavior of each cluster, identifying the most representative or central data points, and extracting meaningful insights or patterns from the clusters. Visualization techniques such as scatter plots or heatmaps can help with this interpretation.
1. Chapter 06
Clustering
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2. Outline
• Overview
• Clustering algorithms
  • k-means Clustering
  • EM Clustering
  • DBSCAN Clustering
  • Single Linkage Clustering
• Comparison of the Clustering Algorithms
• Summary
4. The General Problem
[Figure: clustering maps an unstructured collection of objects (Object 1, Object 2, …, Object n) into groups of similar objects.]
5. The Formal Problem
• Object space
  • $O = \{object_1, object_2, \dots\}$
  • Often infinite
• Representations of the objects in a (numeric) feature space (see the toy sketch below)
  • $\mathcal{F} = \{\phi(o) : o \in O\}$
• Clustering
  • Grouping of the objects
  • Objects in the same group $g \in G$ should be similar
  • $c: \mathcal{F} \to G$
How do you measure similarity?
6. Measuring Similarity: Distances
• Small distance = similar (all three distances are computed in the sketch below)
• Euclidean Distance
  • Based on the Euclidean norm $\|x\|_2$
  • $d(x, y) = \|y - x\|_2 = \sqrt{(y_1 - x_1)^2 + \dots + (y_n - x_n)^2}$
• Manhattan Distance
  • Based on the Manhattan norm $\|x\|_1$
  • $d(x, y) = \|y - x\|_1 = |y_1 - x_1| + \dots + |y_n - x_n|$
• Chebyshev Distance
  • Based on the maximum norm $\|x\|_\infty$
  • $d(x, y) = \|y - x\|_\infty = \max_{i=1..n} |y_i - x_i|$
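A minimal NumPy sketch computing all three distances (the vectors are made-up examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean distance: L2 norm of the difference vector
d_euclid = np.linalg.norm(y - x, ord=2)          # sqrt(3^2 + 2^2 + 0^2) ~ 3.61

# Manhattan distance: L1 norm, sum of absolute differences
d_manhattan = np.linalg.norm(y - x, ord=1)       # 3 + 2 + 0 = 5.0

# Chebyshev distance: maximum norm, largest absolute difference
d_chebyshev = np.linalg.norm(y - x, ord=np.inf)  # max(3, 2, 0) = 3.0
```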
[Figure: 5×5 grid of Chebyshev distances from the center cell; all cells on the same square ring around the center have the same distance.]
7. Evaluation of Clustering Results
• No general metrics, depends on algorithms
  • Low variance for k-Means
  • High density for DBSCAN
  • Good fit in comparison to model variables for EM clustering
  • …
• Often manual checks (one internal measure is sketched below)
  • Do the clusters make sense?
• Can be difficult
  • Very large data
  • Many clusters
  • High dimensional data
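One internal measure that can complement such manual checks is the silhouette coefficient, mentioned in the overview at the top (not part of the slides themselves); a minimal sketch with scikit-learn:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with a known group structure
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Ranges from -1 (bad) to +1 (dense, well-separated clusters)
print(silhouette_score(X, labels))
```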
8. Outline
• Overview
• Clustering algorithms
  • k-means Clustering
  • EM Clustering
  • DBSCAN Clustering
  • Single Linkage Clustering
• Comparison of the Clustering Algorithms
• Summary
9. Idea Behind k-means Clustering
• Clusters are described by their center
  • The centers are called centroids
  • Centroid-based clustering
• Objects are assigned to the closest centroid
How do you get the centroids?
10. Simple Algorithm
• Select initial centroids $C_1, \dots, C_k$
  • Randomized
• Assign each object to the closest centroid
  • $c(x) = \arg\min_{i=1..k} d(x, C_i)$
• Update centroids
  • Arithmetic mean of the assigned objects
  • $C_i = \frac{1}{|\{x : c(x) = i\}|} \sum_{x : c(x) = i} x$
• Repeat update and assignment (see the sketch below)
  • Until convergence, or
  • Until maximum number of iterations
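A minimal NumPy sketch of this loop, assuming a numeric data matrix X and Euclidean distance (a simplified illustration, not the lecture's reference implementation):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomized initialization: pick k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each object
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: arithmetic mean of the objects assigned to each centroid
        # (empty clusters keep their old centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```

In practice, implementations such as sklearn.cluster.KMeans additionally use smarter initialization (k-means++) to reduce the dependence on the initial centroids.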
11. Visualization of the k-means Algorithm
12. Selecting k
• Intuition and knowledge about data
  • Based on looking at plots
  • Based on domain knowledge
• Due to goal
  • Fixed number of groups desired
• Based on best fit
  • Within-sum-of-squares (see the elbow sketch below)
  • $WSS = \sum_{i=1}^{k} \sum_{x : c(x) = i} d(x, C_i)^2$
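A sketch of an elbow analysis with scikit-learn; the fitted model's inertia_ attribute is exactly the within-sum-of-squares:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

ks = range(2, 11)
wss = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
       for k in ks]

plt.plot(ks, wss, marker="o")
plt.xlabel("k")
plt.ylabel("WSS")
plt.show()  # look for "elbows" where the slope changes sharply
```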
13. Results for k = 2, …, 5
[Figure: WSS curve and cluster plots for k = 2, …, 5. Big changes in slope (elbows) indicate potentially good values for k; here 2, 3, and 4 are all okay → use domain knowledge to decide. Splits of coherent groups indicate too many clusters.]
14. Problems of k-Means
• Depends on initial clusters
  • Results may be unstable
• Wrong k can lead to bad results
• All features must have a similar scale (see the scaling sketch below)
  • Differences in scale introduce artificial weights between features
  • Large scales dominate small scales
• Only works well for "round" clusters
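Because large scales dominate small ones, features are usually standardized before clustering; a minimal sketch with scikit-learn (the feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income in euros vs. age in years: very different scales, so income
# would dominate any distance computation
X = np.array([[30000.0, 25.0],
              [52000.0, 61.0],
              [41000.0, 33.0]])

# Standardize each feature to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
```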
15. Outline
• Overview
• Clustering algorithms
  • k-means Clustering
  • EM Clustering
  • DBSCAN Clustering
  • Single Linkage Clustering
• Comparison of the Clustering Algorithms
• Summary
16. Idea Behind EM Clustering
• Clusters are described by probability distributions
  • Usually normal distribution ("Gaussian Mixture Model")
  • Distribution-based clustering
• Objects are assigned to the "most likely" cluster
How do you get the distributions?
[Figure: a normal distribution annotated with its mean and covariance.]
17. (Simplified!) EM Algorithm
• Task: Determine k normal distributions that "fit" the data well
  • $C_1 \sim N(\mu_1, \Sigma_1), \dots, C_k \sim N(\mu_k, \Sigma_k)$
• Estimate start values similar to k-means
• Expectation step
  • Calculate weights of objects
  • Weights define the likelihood that an object belongs to a cluster
  • $w_i(x) = \frac{p(x \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{k} p(x \mid \mu_j, \Sigma_j)}$ for all objects $x \in X$
• Maximization step
  • Update mean values (a complete implementation is sketched below)
  • $\mu_i = \frac{1}{|X|} \sum_{x \in X} w_i(x) \cdot x$
WARNING: This is a correct, but simplified version of the algorithm that ignores the update of the (co)variance.
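In practice one would use a complete implementation such as scikit-learn's GaussianMixture, which also updates the covariances; a minimal sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

labels = gmm.predict(X)         # hard assignment: the "most likely" cluster
weights = gmm.predict_proba(X)  # soft assignment: the weights w_i(x)
```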
18. Visualization of the EM Algorithm
19. Selecting k
• Same as k-means: Intuition, knowledge, goal
• Bayesian Information Criterion (BIC) (see the sketch below)
  • Difference between the model complexity and the likelihood of the clusters
  • $BIC = \ln(|X|) \cdot k' - 2 \cdot \ln(L(C_1, \dots, C_k; X))$
  • $k'$ is the number of model parameters (i.e., mean values, covariances)
  • $L(C_1, \dots, C_k; X) = p(C_1, \dots, C_k \mid X)$ is the likelihood function
  • The lower the better
  • Decreases with less complex models
  • Decreases with better likelihood
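A sketch of selecting k via the BIC with scikit-learn:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit one Gaussian mixture model per candidate k and compare their BICs
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 5)}
best_k = min(bics, key=bics.get)  # minimum BIC = best trade-off
```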
20. Results for k = 1, …, 4
[Figure: BIC values for k = 1, …, 4; the minimum marks the optimal ratio between model complexity and goodness of fit.]
21. Problems of EM Clustering
• Depends on initial clusters
  • Results may be unstable
• Wrong k can lead to bad results
• May not converge
• Only works well with normally distributed clusters
22. Outline
• Overview
• Clustering algorithms
  • k-means Clustering
  • EM Clustering
  • DBSCAN Clustering
  • Single Linkage Clustering
• Comparison of the Clustering Algorithms
• Summary
23. Idea behind DBSCAN
• Clusters are described by other objects close by
  • Density-based clustering
• Scan area around an object for other objects
  • If objects are found, they probably belong to the same group
  • If no objects are found, the object is probably noise
24. (Relatively) Simple Algorithm
• Two parameters
  • Neighborhood size $\epsilon$
  • Minimal number of points to be considered dense $minPts$
• Determine all objects with dense neighborhoods (core points)
  • $x \in X$ such that $|\{x' \in X : d(x, x') \le \epsilon\}| \ge minPts$
• Grow clusters by assigning all points that share a neighborhood to the same cluster
• All points that are neither core points nor in the neighborhood of a core point are noise (see the sketch below)
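A minimal sketch with scikit-learn's DBSCAN, whose eps and min_samples parameters correspond to $\epsilon$ and $minPts$:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two half-moon shapes: not "round", so k-means would struggle here
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
# The label -1 marks noise points; all other labels are cluster ids
```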
25. Visualization of the DBSCAN Algorithm
26. Problems of DBSCAN
• All features must be in the same range
• What if different clusters have different densities?
  → Main problem of DBSCAN!
• This is also related to the size of the data
  → DBSCAN is very sensitive to sampling
27. Outline
• Overview
• Clustering algorithms
  • k-means Clustering
  • EM Clustering
  • DBSCAN Clustering
  • Single Linkage Clustering
• Comparison of the Clustering Algorithms
• Summary
28. Idea behind Hierarchical Clustering
• Clusters are described by hierarchies of similarity
  • Hierarchical clustering (also called connectivity-based clustering)
• Find most similar pair of objects and establish link
  • "Nearest Neighbor Clustering"
29. Simple Single Linkage Algorithm (SLINK)
• Every object has its own cluster at the beginning
  • The level of all these basic clusters is 0
  • $L(C) = 0$ for all $C = \{x\}$ with $x \in X$
• Find the two closest clusters
  • $(C, C') = \arg\min_{C, C' \in Clusters} d(C, C')$
  • $d(C, C') = \min_{x \in C, x' \in C'} d(x, x')$
• Merge $C, C'$ into a new cluster $C_{new} = C \cup C'$ (see the sketch below)
  • The level is the distance between the initial clusters
  • $L(C_{new}) = d(C, C')$
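A sketch using SciPy's hierarchical clustering; method="single" gives exactly this minimum-distance merge rule, and the cutoff for flat clusters must be chosen manually (for instance from the dendrogram discussed next):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Single linkage: the merge level is the minimum pairwise distance
Z = linkage(X, method="single")

# Flat clusters from a manually chosen cutoff level
labels = fcluster(Z, t=1.0, criterion="distance")

dendrogram(Z)  # each object is a leaf; merges are drawn at their level
plt.show()
```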
30. Dendrograms of Clustering
• Visualizes clustering as a tree
• Horizontal line: Merging of two clusters
• Vertical line: Increase of the level due to merge
[Figure: dendrogram. The cutoff for the "real" clusters must be chosen manually; each object is a leaf node; nodes that are subsequently merged 15 times are suppressed.]
31. Problems with Hierarchical Clustering
• Often scales badly in terms of memory consumption
  • The standard algorithm requires a square matrix of distances between all objects
• All features must be in the same range
• Different densities in different clusters may be problematic
• Hard to find a single cut-off
  • Can be solved by visual analysis of the dendrogram
32. Outline
• Overview
• Clustering algorithms
  • k-means Clustering
  • EM Clustering
  • DBSCAN Clustering
  • Single Linkage Clustering
• Comparison of the Clustering Algorithms
• Summary
34. Comparison of Execution Times
• Single linkage requires too much memory for larger datasets
35. Strengths and Weaknesses
          Cluster  Explanatory  Concise         Categorical  Missing   Correlated
          number   value        representation  features     features  features
k-means   -        +            +               -            -         -
EM        0        +            +               -            -         0
DBSCAN    +        -            -               -            -         -
SLINK     0        +            -               -            -         -
There are clustering algorithms for categorical data, e.g., k-modes
36. Summary
• Clustering is concerned with the inference of groups for objects
  • Works well for numeric data but is often not well suited for categorical data
  • Scales are very important for most clustering algorithms
• Different types of clustering algorithms
  • Centroid-based
  • Distribution-based
  • Density-based
  • Hierarchical / connectivity-based
• Evaluation often difficult and requires manual intervention