SlideShare a Scribd company logo
Prof. Pier Luca Lanzi
Hierarchical Clustering
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
2
Prof. Pier Luca Lanzi
3
Prof. Pier Luca Lanzi
4
Prof. Pier Luca Lanzi
5
Prof. Pier Luca Lanzi
6
Prof. Pier Luca Lanzi
7
Prof. Pier Luca Lanzi
8
Prof. Pier Luca Lanzi
9
Prof. Pier Luca Lanzi
What is Hierarchical Clustering?
• Suppose we have five items, a, b, c, d, and e.
• Initially, we consider one cluster for each item
• Then, at each step we merge together the most similar clusters,
until we generate one cluster
a
b
c
d
e
a,b
d,e
c,d,e
a,b,c,d,e
Step 0 Step 1 Step 2 Step 3 Step 4
10
Prof. Pier Luca Lanzi
What is Hierarchical Clustering?
• Alternatively, we start from one cluster containing the five
elements
• Then, at each step we split one cluster to improve intracluster
similarity, until all the elements are contained in one cluster
c
a
b
d
e
d,e
a,b,c,d,e
a,b
c,d,e
Step 4 Step 3 Step 2 Step 1 Step 0
Prof. Pier Luca Lanzi
What is Hierarchical Clustering?
• By far, it is the most common clustering technique
• Produces a hierarchy of nested clusters
• The hiearchy be visualized as a dendrogram: a tree like diagram
that records the sequences of merges or splits
a
b
c
d
e
a,b
d,e
c,d,e
a,b,c,d,e
12
Prof. Pier Luca Lanzi
What Approaches?
• Agglomerative
§ Start individual clusters, at each step, merge the closest pair of clusters
until only one cluster (or k clusters) left
• Divisive
§ Start with one cluster, at each step, split a cluster until each cluster
contains a point (or there are k clusters)
13
a
b
c
d
e
a,b
d,e
c,d,e
a,b,c,d,e
agglomerative
divisive
Prof. Pier Luca Lanzi
Strengths of Hierarchical Clustering
• No need to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• They may correspond to meaningful taxonomies
• Example in biological sciences include animal kingdom, phylogeny
reconstruction, etc.
• Traditional hierarchical algorithms use a similarity
or distance matrix to merge or split one cluster at a time
14
Prof. Pier Luca Lanzi
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Compute the proximity matrix
• Let each data point be a cluster
• Repeat
§Merge the two closest clusters
§ Update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
• Different approaches to defining the distance between clusters
distinguish the different algorithms
15
Prof. Pier Luca Lanzi
Hierarchical Clustering:
Time and Space Requirements
• O(N2) space since it uses the proximity matrix.
§N is the number of points.
• O(N3) time in many cases
§There are N steps and at each step the size, N2, proximity
matrix must be updated and searched
§Complexity can be reduced to O(N2 log(N) )
time for some approaches
16
Prof. Pier Luca Lanzi
Efficient Implementation
• Compute the distance between all pairs of points [O(N2)]
• Insert the pairs and their distances into a priority queue to fine the min in one
step [O(N2)]
• When two clusters are merged, we remove all entries in the priority queue
involving one of these two clusters [O(Nlog N)]
• Compute all the distances between the new cluster and the re- maining
clusters [O(NlogN)]
• Since the last two steps are executed at most N time, the complexity of the
whole algorithms is O(N2logN)
17
Prof. Pier Luca Lanzi
Distance Between Clusters
Prof. Pier Luca Lanzi
Initial Configuration
• Start with clusters of individual points and the distance matrix
...
p1 p2 p3 p4 p9 p10 p11 p12
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5 . . .
.
.
. Distance Matrix
19
Prof. Pier Luca Lanzi
Intermediate Situation
• After some merging steps, we have some clusters
...
p1 p2 p3 p4 p9 p10 p11 p12
C1
C4
C2 C5
C3
C2C1
C1
C3
C5
C4
C2
C3 C4 C5
Distance Matrix
20
Prof. Pier Luca Lanzi
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
...
p1 p2 p3 p4 p9 p10 p11 p12
C1
C4
C2 C5
C3
C2C1
C1
C3
C5
C4
C2
C3 C4 C5
Distance Matrix
21
Prof. Pier Luca Lanzi
After Merging
• The question is “How do we update the proximity matrix?”
...
p1 p2 p3 p4 p9 p10 p11 p12
C1
C4
C2 U C5
C3
? ? ? ?
?
?
?
C2
U
C5C1
C1
C3
C4
C2 U C5
C3 C4
Distance Matrix
22
Prof. Pier Luca Lanzi
Similarity?
Prof. Pier Luca Lanzi
Single Linkage or MIN
Prof. Pier Luca Lanzi
Complete Linkage or MAX
Prof. Pier Luca Lanzi
Average or Group Average
Prof. Pier Luca Lanzi
Distance between Centroids
´ ´
Prof. Pier Luca Lanzi
Typical Alternatives to Calculate the
Distance Between Clusters
• Single link (or MIN)
§smallest distance between an element in one cluster and an
element in the other, i.e., d(Ci, Cj) = min(ti,p, tj,q)
• Complete link (or MAX)
§largest distance between an element in one cluster and
an element in the other, i.e., d(Ci, Cj) = max(ti,p, tj,q)
• Average (or group average)
§average distance between an element in one cluster and an
element in the other, i.e., d(Ci, Cj) = avg(d(ti,p, tj,q))
• Centroid
§distance between the centroids of two clusters, i.e.,
d(Ci, Cj) = d(μi, μj) where μi and μi are the centroids
• …
28
Prof. Pier Luca Lanzi
Example
• Suppose we have five items, a, b, c, d, and e.
• We wanto to perform hierarchical clustering on
five instances following an agglomerative approach
• First: we compute the distance or similarity matrix
• Dij is the distance between instancce “i” and “j”
÷
÷
÷
÷
÷
÷
ø
ö
ç
ç
ç
ç
ç
ç
è
æ
=
0003050809
000409010
000506
0002
00
.....
....
...
..
.
D
29
Prof. Pier Luca Lanzi
Example
• Group the two instances that are closer
• In this case, a and b are the closest items (D2,1=2)
• Compute again the distance matrix, and start again.
• Suppose we apply single-linkage (MIN), we need to compute the
distance between the new cluster {1,2} and the others
§d(12)3 = min[d13,d23] = d23 = 5.0
§d(12)4 = min[d14,d24] = d24 = 9.0
§d(12)5 = min[d15,d25] = d25 = 8.0
30
Prof. Pier Luca Lanzi
Example
• The new distance matrix is,
÷
÷
÷
÷
÷
ø
ö
ç
ç
ç
ç
ç
è
æ
=
0.00.30.50.8
0.00.40.9
0.00.5
0.0
D
31
• At the end, we obtain the
following dendrogram
Prof. Pier Luca Lanzi
Determining the Number of Clusters
32
Prof. Pier Luca Lanzi
hierarchical clustering generates
a set of N possible partitions
which one should I choose?
Prof. Pier Luca Lanzi
From the previous lecture we know ideally
a good cluster should partition points so that …
Data points in the same cluster should have
a small distance from one another
Data points in different clusters should be at
a large distance from one another.
Prof. Pier Luca Lanzi
Within/Between Clusters Sum of
Squares
• Within-cluster sum of squares
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
where μ is the centroid of the whole dataset
35
Prof. Pier Luca Lanzi
Within/Between Clusters Sum of
Squares (for distance function d)
• Within-cluster sum of squares
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
where μ is the centroid of the whole dataset
36
Prof. Pier Luca Lanzi
Evaluation of Hierarchical Clustering
using Knee/Elbow Analysis
plot the WSS and BSS for every clustering and look
for a knee in the plot that show a significant
modification in the evaluation metrics
Prof. Pier Luca Lanzi
Run the Python notebook
for hierarchical clustering
Prof. Pier Luca Lanzi
Example data generated using the make_blob function of Scikit-Learn
Prof. Pier Luca Lanzi
Dendrogram computed using single linkage.
Prof. Pier Luca Lanzi
BSS and WSS for values of k from 1 until 19.
Prof. Pier Luca Lanzi
Clusters produced for values of k from 2 to 7.
Prof. Pier Luca Lanzi
Clusters produced for values of k from 5 to 10.
Prof. Pier Luca Lanzi
How can we represent clusters?
Prof. Pier Luca Lanzi
Euclidean vs Non-Euclidean Spaces
• Euclidean Spaces
§ We can identify a cluster using for instance its centroid
(e.g. computed as the average among all its data points)
§ Alternatively, we can use its convex hull
• Non-Euclidean Spaces
§ We can define a distance (jaccard, cosine, edit)
§ We cannot compute a centroid and we can introduce the concept of
clustroid
• Clustroid
§ An existing data point that we take as a cluster representative
§ It can be the point that minimizes the sum of the distances to the other
points in the cluster
§ Or, the one minimizing the maximum distance to another point
§ Or, the sum of the squares of the distances to the other points in the
cluster
45
Prof. Pier Luca Lanzi
Examples using KNIME
Prof. Pier Luca Lanzi
Evaluation of the result from hierarchical clustering with
3 clusters and average linkage against existing labels
Prof. Pier Luca Lanzi
Comparison of hierarchical clustering with 3 clusters
and average linkage against k-Means with k=3
Prof. Pier Luca Lanzi
Computing cluster quality from one to 20 clusters
using the entropy scorer
Prof. Pier Luca Lanzi
Examples using R
Prof. Pier Luca Lanzi
Hierarchical Clustering in R
# init the seed to be able to repeat the experiment
set.seed(1234)
par(mar=c(0,0,0,0))
# randomly generates the data
x<-rnorm(12, mean=rep(1:3,each=4), sd=0.2)
y<-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
plot(x,y,pch=19,cex=2,col="blue")
# distance matrix
d <- data.frame(x,y)
dm <- dist(d)
# generate the
cl <- hclust(dm)
plot(cl)
# other ways to plot dendrograms
# http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html
51
Prof. Pier Luca Lanzi
Evaluation of Clustering in R
library(GMD)
###
### checking the quality of the previous cluster
###
# init two vectors that will contain the evaluation
# in terms of within and between sum of squares
plot_wss = rep(0,12)
plot_bss = rep(0,12)
# evaluate every clustering
for(i in 1:12)
{
clusters <- cutree(cl,i)
eval <- css(dm,clusters);
plot_wss[i] <- eval$totwss
plot_bss[i] <- eval$totbss
}
52
Prof. Pier Luca Lanzi
Evaluation of Clustering in R
# plot the results
x = 1:12
plot(x, y=plot_bss, main="Between Cluster Sum-of-square",
cex=2, pch=18, col="blue", xlab="Number of Clusters",
ylab="Evaluation")
lines(x, plot_bss, col="blue")
par(new=TRUE)
plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="")
lines(x,plot_wss, col="red");
53
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering 54
Prof. Pier Luca Lanzi
Hierarchical Clustering in R – Iris2D
library(foreign)
iris = read.arff("iris.2D.arff")
with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2))
dm <- dist(iris[,1:2])
cl <- hclust(iris_dist, method="single")
#clustering <- hclust(dist(iris[,1:2],method="manhattan"), method="single")
plot(cl)
cl_average <- hclust(iris_dist, method="average")
plot(clustering)
cutree(clustering,2)
55
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering for
iris2D
56
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering for
iris
57
Prof. Pier Luca Lanzi
Summary
Prof. Pier Luca Lanzi
Hierarchical Clustering:
Problems and Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one
or more of the following:
§Sensitivity to noise and outliers
§Difficulty handling different sized clusters
and convex shapes
§Breaking large clusters
• Major weakness of agglomerative clustering methods
§They do not scale well: time complexity of at least O(n2),
where n is the number of total objects
§They can never undo what was done previously
59

More Related Content

What's hot

Semi-supervised Learning
Semi-supervised LearningSemi-supervised Learning
Semi-supervised Learning
butest
 
Deep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and RegularizationDeep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and Regularization
Yan Xu
 
딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향
홍배 김
 
Graph Neural Network (한국어)
Graph Neural Network (한국어)Graph Neural Network (한국어)
Graph Neural Network (한국어)
Jungwon Kim
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descent
kandelin
 
Ways to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performanceWays to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performance
Mala Deep Upadhaya
 
Binarized CNN on FPGA
Binarized CNN on FPGABinarized CNN on FPGA
Binarized CNN on FPGA
홍배 김
 
時系列分析による異常検知入門
時系列分析による異常検知入門時系列分析による異常検知入門
時系列分析による異常検知入門Yohei Sato
 
Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnn
Kuppusamy P
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 I
Sungbin Lim
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Sanghyuk Chun
 
Rによる決定木解析の一例
Rによる決定木解析の一例Rによる決定木解析の一例
Rによる決定木解析の一例LINE Corp.
 
Metaheuristics
MetaheuristicsMetaheuristics
Metaheuristics
CherifRehouma
 
Building Random Forest at Scale
Building Random Forest at ScaleBuilding Random Forest at Scale
Building Random Forest at Scale
Sri Ambati
 
時系列分析入門
時系列分析入門時系列分析入門
時系列分析入門
Miki Katsuragi
 
Manifold learning with application to object recognition
Manifold learning with application to object recognitionManifold learning with application to object recognition
Manifold learning with application to object recognition
zukun
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
홍배 김
 
Chapter 19 Variational Inference
Chapter 19 Variational InferenceChapter 19 Variational Inference
Chapter 19 Variational Inference
KyeongUkJang
 
Classification
ClassificationClassification
Classification
CloudxLab
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것
NAVER Engineering
 

What's hot (20)

Semi-supervised Learning
Semi-supervised LearningSemi-supervised Learning
Semi-supervised Learning
 
Deep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and RegularizationDeep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and Regularization
 
딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향딥러닝을 이용한 자연어처리의 연구동향
딥러닝을 이용한 자연어처리의 연구동향
 
Graph Neural Network (한국어)
Graph Neural Network (한국어)Graph Neural Network (한국어)
Graph Neural Network (한국어)
 
Optimization/Gradient Descent
Optimization/Gradient DescentOptimization/Gradient Descent
Optimization/Gradient Descent
 
Ways to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performanceWays to evaluate a machine learning model’s performance
Ways to evaluate a machine learning model’s performance
 
Binarized CNN on FPGA
Binarized CNN on FPGABinarized CNN on FPGA
Binarized CNN on FPGA
 
時系列分析による異常検知入門
時系列分析による異常検知入門時系列分析による異常検知入門
時系列分析による異常検知入門
 
Recurrent neural networks rnn
Recurrent neural networks   rnnRecurrent neural networks   rnn
Recurrent neural networks rnn
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 I
 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
 
Rによる決定木解析の一例
Rによる決定木解析の一例Rによる決定木解析の一例
Rによる決定木解析の一例
 
Metaheuristics
MetaheuristicsMetaheuristics
Metaheuristics
 
Building Random Forest at Scale
Building Random Forest at ScaleBuilding Random Forest at Scale
Building Random Forest at Scale
 
時系列分析入門
時系列分析入門時系列分析入門
時系列分析入門
 
Manifold learning with application to object recognition
Manifold learning with application to object recognitionManifold learning with application to object recognition
Manifold learning with application to object recognition
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gene...
 
Chapter 19 Variational Inference
Chapter 19 Variational InferenceChapter 19 Variational Inference
Chapter 19 Variational Inference
 
Classification
ClassificationClassification
Classification
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것
 

Similar to DMTM Lecture 12 Hierarchical clustering

DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
Pier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
Pier Luca Lanzi
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
Pier Luca Lanzi
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
Pier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
Pier Luca Lanzi
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
Subrata Kumer Paul
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
p_manimozhi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
Pier Luca Lanzi
 
Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
nadimhossain24
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
Pier Luca Lanzi
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
Pier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
Pier Luca Lanzi
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
SandinoBerutu1
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
ImXaib
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
Iwan Sofana
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
Mostafa G. M. Mostafa
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
Pier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
Pier Luca Lanzi
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest Neighbors
Marina Santini
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
Sudhakar Chavan
 

Similar to DMTM Lecture 12 Hierarchical clustering (20)

DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest Neighbors
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
Pier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
Pier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
Pier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
Pier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
Pier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
Pier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
Pier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
Pier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
Pier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
Pier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
Pier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
Pier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
Pier Luca Lanzi
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
Pier Luca Lanzi
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
Pier Luca Lanzi
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
Pier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
Pier Luca Lanzi
 

More from Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 

Recently uploaded

Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
amberjdewit93
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
adhitya5119
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
taiba qazi
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
thanhdowork
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
tarandeep35
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
eBook.com.bd (প্রয়োজনীয় বাংলা বই)
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
Celine George
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
Nicholas Montgomery
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
Academy of Science of South Africa
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
AyyanKhan40
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
Jean Carlos Nunes Paixão
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Ashish Kohli
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 

Recently uploaded (20)

Digital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental DesignDigital Artefact 1 - Tiny Home Environmental Design
Digital Artefact 1 - Tiny Home Environmental Design
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
Main Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docxMain Java[All of the Base Concepts}.docx
Main Java[All of the Base Concepts}.docx
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
DRUGS AND ITS classification slide share
DRUGS AND ITS classification slide shareDRUGS AND ITS classification slide share
DRUGS AND ITS classification slide share
 
A Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptxA Survey of Techniques for Maximizing LLM Performance.pptx
A Survey of Techniques for Maximizing LLM Performance.pptx
 
S1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptxS1-Introduction-Biopesticides in ICM.pptx
S1-Introduction-Biopesticides in ICM.pptx
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
 
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdfবাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
বাংলাদেশ অর্থনৈতিক সমীক্ষা (Economic Review) ২০২৪ UJS App.pdf
 
How to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold MethodHow to Build a Module in Odoo 17 Using the Scaffold Method
How to Build a Module in Odoo 17 Using the Scaffold Method
 
Film vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movieFilm vocab for eal 3 students: Australia the movie
Film vocab for eal 3 students: Australia the movie
 
South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)South African Journal of Science: Writing with integrity workshop (2024)
South African Journal of Science: Writing with integrity workshop (2024)
 
PIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf IslamabadPIMS Job Advertisement 2024.pdf Islamabad
PIMS Job Advertisement 2024.pdf Islamabad
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
 
Lapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdfLapbook sobre os Regimes Totalitários.pdf
Lapbook sobre os Regimes Totalitários.pdf
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 

DMTM Lecture 12 Hierarchical clustering

  • 1. Prof. Pier Luca Lanzi Hierarchical Clustering Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi 2
  • 3. Prof. Pier Luca Lanzi 3
  • 4. Prof. Pier Luca Lanzi 4
  • 5. Prof. Pier Luca Lanzi 5
  • 6. Prof. Pier Luca Lanzi 6
  • 7. Prof. Pier Luca Lanzi 7
  • 8. Prof. Pier Luca Lanzi 8
  • 9. Prof. Pier Luca Lanzi 9
  • 10. Prof. Pier Luca Lanzi What is Hierarchical Clustering? • Suppose we have five items, a, b, c, d, and e. • Initially, we consider one cluster for each item • Then, at each step we merge together the most similar clusters, until we generate one cluster a b c d e a,b d,e c,d,e a,b,c,d,e Step 0 Step 1 Step 2 Step 3 Step 4 10
  • 11. Prof. Pier Luca Lanzi What is Hierarchical Clustering? • Alternatively, we start from one cluster containing the five elements • Then, at each step we split one cluster to improve intracluster similarity, until all the elements are contained in one cluster c a b d e d,e a,b,c,d,e a,b c,d,e Step 4 Step 3 Step 2 Step 1 Step 0
  • 12. Prof. Pier Luca Lanzi What is Hierarchical Clustering? • By far, it is the most common clustering technique • Produces a hierarchy of nested clusters • The hiearchy be visualized as a dendrogram: a tree like diagram that records the sequences of merges or splits a b c d e a,b d,e c,d,e a,b,c,d,e 12
  • 13. Prof. Pier Luca Lanzi What Approaches? • Agglomerative § Start individual clusters, at each step, merge the closest pair of clusters until only one cluster (or k clusters) left • Divisive § Start with one cluster, at each step, split a cluster until each cluster contains a point (or there are k clusters) 13 a b c d e a,b d,e c,d,e a,b,c,d,e agglomerative divisive
  • 14. Prof. Pier Luca Lanzi Strengths of Hierarchical Clustering • No need to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Example in biological sciences include animal kingdom, phylogeny reconstruction, etc. • Traditional hierarchical algorithms use a similarity or distance matrix to merge or split one cluster at a time 14
  • 15. Prof. Pier Luca Lanzi Agglomerative Clustering Algorithm • More popular hierarchical clustering technique • Compute the proximity matrix • Let each data point be a cluster • Repeat §Merge the two closest clusters § Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms 15
  • 16. Prof. Pier Luca Lanzi Hierarchical Clustering: Time and Space Requirements • O(N2) space since it uses the proximity matrix. §N is the number of points. • O(N3) time in many cases §There are N steps and at each step the size, N2, proximity matrix must be updated and searched §Complexity can be reduced to O(N2 log(N) ) time for some approaches 16
  • 17. Prof. Pier Luca Lanzi Efficient Implementation • Compute the distance between all pairs of points [O(N2)] • Insert the pairs and their distances into a priority queue to fine the min in one step [O(N2)] • When two clusters are merged, we remove all entries in the priority queue involving one of these two clusters [O(Nlog N)] • Compute all the distances between the new cluster and the re- maining clusters [O(NlogN)] • Since the last two steps are executed at most N time, the complexity of the whole algorithms is O(N2logN) 17
  • 18. Prof. Pier Luca Lanzi Distance Between Clusters
  • 19. Prof. Pier Luca Lanzi Initial Configuration • Start with clusters of individual points and the distance matrix ... p1 p2 p3 p4 p9 p10 p11 p12 p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . . . Distance Matrix 19
  • 20. Prof. Pier Luca Lanzi Intermediate Situation • After some merging steps, we have some clusters ... p1 p2 p3 p4 p9 p10 p11 p12 C1 C4 C2 C5 C3 C2C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix 20
  • 21. Prof. Pier Luca Lanzi Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. ... p1 p2 p3 p4 p9 p10 p11 p12 C1 C4 C2 C5 C3 C2C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix 21
  • 22. Prof. Pier Luca Lanzi After Merging • The question is “How do we update the proximity matrix?” ... p1 p2 p3 p4 p9 p10 p11 p12 C1 C4 C2 U C5 C3 ? ? ? ? ? ? ? C2 U C5C1 C1 C3 C4 C2 U C5 C3 C4 Distance Matrix 22
  • 23. Prof. Pier Luca Lanzi Similarity?
  • 24. Prof. Pier Luca Lanzi Single Linkage or MIN
  • 25. Prof. Pier Luca Lanzi Complete Linkage or MAX
  • 26. Prof. Pier Luca Lanzi Average or Group Average
  • 27. Prof. Pier Luca Lanzi Distance between Centroids ´ ´
  • 28. Prof. Pier Luca Lanzi Typical Alternatives to Calculate the Distance Between Clusters • Single link (or MIN) §smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min(ti,p, tj,q) • Complete link (or MAX) §largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max(ti,p, tj,q) • Average (or group average) §average distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = avg(d(ti,p, tj,q)) • Centroid §distance between the centroids of two clusters, i.e., d(Ci, Cj) = d(μi, μj) where μi and μi are the centroids • … 28
  • 29. Prof. Pier Luca Lanzi Example • Suppose we have five items, a, b, c, d, and e. • We wanto to perform hierarchical clustering on five instances following an agglomerative approach • First: we compute the distance or similarity matrix • Dij is the distance between instancce “i” and “j” ÷ ÷ ÷ ÷ ÷ ÷ ø ö ç ç ç ç ç ç è æ = 0003050809 000409010 000506 0002 00 ..... .... ... .. . D 29
  • 30. Prof. Pier Luca Lanzi Example • Group the two instances that are closer • In this case, a and b are the closest items (D2,1=2) • Compute again the distance matrix, and start again. • Suppose we apply single-linkage (MIN), we need to compute the distance between the new cluster {1,2} and the others §d(12)3 = min[d13,d23] = d23 = 5.0 §d(12)4 = min[d14,d24] = d24 = 9.0 §d(12)5 = min[d15,d25] = d25 = 8.0 30
  • 31. Prof. Pier Luca Lanzi Example • The new distance matrix is, ÷ ÷ ÷ ÷ ÷ ø ö ç ç ç ç ç è æ = 0.00.30.50.8 0.00.40.9 0.00.5 0.0 D 31 • At the end, we obtain the following dendrogram
  • 32. Prof. Pier Luca Lanzi Determining the Number of Clusters 32
  • 33. Prof. Pier Luca Lanzi hierarchical clustering generates a set of N possible partitions which one should I choose?
  • 34. Prof. Pier Luca Lanzi From the previous lecture we know ideally a good cluster should partition points so that … Data points in the same cluster should have a small distance from one another Data points in different clusters should be at a large distance from one another.
  • 35. Prof. Pier Luca Lanzi Within/Between Clusters Sum of Squares • Within-cluster sum of squares where μi is the centroid of cluster Ci (in case of Euclidean spaces) • Between-cluster sum of squares where μ is the centroid of the whole dataset 35
  • 36. Prof. Pier Luca Lanzi Within/Between Clusters Sum of Squares (for distance function d) • Within-cluster sum of squares where μi is the centroid of cluster Ci (in case of Euclidean spaces) • Between-cluster sum of squares where μ is the centroid of the whole dataset 36
  • 37. Prof. Pier Luca Lanzi Evaluation of Hierarchical Clustering using Knee/Elbow Analysis plot the WSS and BSS for every clustering and look for a knee in the plot that show a significant modification in the evaluation metrics
  • 38. Prof. Pier Luca Lanzi Run the Python notebook for hierarchical clustering
  • 39. Prof. Pier Luca Lanzi Example data generated using the make_blob function of Scikit-Learn
  • 40. Prof. Pier Luca Lanzi Dendrogram computed using single linkage.
  • 41. Prof. Pier Luca Lanzi BSS and WSS for values of k from 1 until 19.
  • 42. Prof. Pier Luca Lanzi Clusters produced for values of k from 2 to 7.
  • 43. Prof. Pier Luca Lanzi Clusters produced for values of k from 5 to 10.
  • 44. Prof. Pier Luca Lanzi How can we represent clusters?
  • 45. Prof. Pier Luca Lanzi Euclidean vs Non-Euclidean Spaces • Euclidean Spaces § We can identify a cluster using for instance its centroid (e.g. computed as the average among all its data points) § Alternatively, we can use its convex hull • Non-Euclidean Spaces § We can define a distance (jaccard, cosine, edit) § We cannot compute a centroid and we can introduce the concept of clustroid • Clustroid § An existing data point that we take as a cluster representative § It can be the point that minimizes the sum of the distances to the other points in the cluster § Or, the one minimizing the maximum distance to another point § Or, the sum of the squares of the distances to the other points in the cluster 45
  • 46. Prof. Pier Luca Lanzi Examples using KNIME
  • 47. Prof. Pier Luca Lanzi Evaluation of the result from hierarchical clustering with 3 clusters and average linkage against existing labels
  • 48. Prof. Pier Luca Lanzi Comparison of hierarchical clustering with 3 clusters and average linkage against k-Means with k=3
  • 49. Prof. Pier Luca Lanzi Computing cluster quality from one to 20 clusters using the entropy scorer
  • 50. Prof. Pier Luca Lanzi Examples using R
  • 51. Prof. Pier Luca Lanzi Hierarchical Clustering in R # init the seed to be able to repeat the experiment set.seed(1234) par(mar=c(0,0,0,0)) # randomly generates the data x<-rnorm(12, mean=rep(1:3,each=4), sd=0.2) y<-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2) plot(x,y,pch=19,cex=2,col="blue") # distance matrix d <- data.frame(x,y) dm <- dist(d) # generate the cl <- hclust(dm) plot(cl) # other ways to plot dendrograms # http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html 51
  • 52. Prof. Pier Luca Lanzi Evaluation of Clustering in R library(GMD) ### ### checking the quality of the previous cluster ### # init two vectors that will contain the evaluation # in terms of within and between sum of squares plot_wss = rep(0,12) plot_bss = rep(0,12) # evaluate every clustering for(i in 1:12) { clusters <- cutree(cl,i) eval <- css(dm,clusters); plot_wss[i] <- eval$totwss plot_bss[i] <- eval$totbss } 52
  • 53. Prof. Pier Luca Lanzi Evaluation of Clustering in R # plot the results x = 1:12 plot(x, y=plot_bss, main="Between Cluster Sum-of-square", cex=2, pch=18, col="blue", xlab="Number of Clusters", ylab="Evaluation") lines(x, plot_bss, col="blue") par(new=TRUE) plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="") lines(x,plot_wss, col="red"); 53
  • 54. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering 54
  • 55. Prof. Pier Luca Lanzi Hierarchical Clustering in R – Iris2D library(foreign) iris = read.arff("iris.2D.arff") with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2)) dm <- dist(iris[,1:2]) cl <- hclust(iris_dist, method="single") #clustering <- hclust(dist(iris[,1:2],method="manhattan"), method="single") plot(cl) cl_average <- hclust(iris_dist, method="average") plot(clustering) cutree(clustering,2) 55
  • 56. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering for iris2D 56
  • 57. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering for iris 57
  • 58. Prof. Pier Luca Lanzi Summary
  • 59. Prof. Pier Luca Lanzi Hierarchical Clustering: Problems and Limitations • Once a decision is made to combine two clusters, it cannot be undone • No objective function is directly minimized • Different schemes have problems with one or more of the following: §Sensitivity to noise and outliers §Difficulty handling different sized clusters and convex shapes §Breaking large clusters • Major weakness of agglomerative clustering methods §They do not scale well: time complexity of at least O(n2), where n is the number of total objects §They can never undo what was done previously 59