Prof. Pier Luca Lanzi
Hierarchical Clustering
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
What is Hierarchical Clustering?
• Suppose we have five items, a, b, c, d, and e.
• Initially, we consider one cluster for each item
• Then, at each step we merge together the most similar clusters,
until we generate one cluster
[Figure: agglomerative merging of the items a, b, c, d, e. Step 0: a, b, c, d, e; Step 1: {a,b}; Step 2: {d,e}; Step 3: {c,d,e}; Step 4: {a,b,c,d,e}]
What is Hierarchical Clustering?
• Alternatively, we start from one cluster containing the five
elements
• Then, at each step we split one cluster to improve intracluster
similarity, until each element is in a cluster of its own
[Figure: the same hierarchy read top-down, from {a,b,c,d,e} at Step 0, through {a,b}, {c,d,e} and {d,e}, down to the single items a, b, c, d, e at Step 4]
What is Hierarchical Clustering?
• By far, it is the most common clustering technique
• Produces a hierarchy of nested clusters
• The hierarchy can be visualized as a dendrogram: a tree-like diagram
that records the sequence of merges or splits
[Figure: dendrogram over a, b, c, d, e with internal nodes {a,b}, {d,e}, {c,d,e} and root {a,b,c,d,e}]
What Approaches?
• Agglomerative
§ Start with each point as its own cluster; at each step, merge the closest
pair of clusters until only one cluster (or k clusters) is left
• Divisive
§ Start with one all-inclusive cluster; at each step, split a cluster until
each cluster contains a single point (or there are k clusters)
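[Figure: the hierarchy over a, b, c, d, e with nodes {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}; the agglomerative approach builds it bottom-up, the divisive approach top-down]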
Strengths of Hierarchical Clustering
• No need to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• The clusters may correspond to meaningful taxonomies
• Examples in the biological sciences include the animal kingdom,
phylogeny reconstruction, etc.
• Traditional hierarchical algorithms use a similarity
or distance matrix to merge or split one cluster at a time
Agglomerative Clustering Algorithm
• The more popular hierarchical clustering technique
• Compute the proximity matrix
• Let each data point be a cluster
• Repeat
§ Merge the two closest clusters
§ Update the proximity matrix
• Until only a single cluster remains
• The key operation is the computation of the proximity of two clusters
• Different approaches to defining the distance between clusters
distinguish the different algorithms
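The loop above can be sketched in a few lines of Python. This is a minimal, deliberately naive illustration and not the course notebook: the function name `agglomerative`, the parameters `k` and `linkage`, and the toy data are all assumptions made for the example. Instead of maintaining an explicit proximity matrix between clusters, it recomputes cluster proximities from the point-to-point distances at every pass.

```python
from itertools import combinations

def agglomerative(dist, k=1, linkage=min):
    """Naive agglomerative clustering on a symmetric distance matrix.

    dist    -- list of lists, dist[i][j] is the distance between points i and j
    k       -- stop when this many clusters remain (1 builds the full hierarchy)
    linkage -- min for single link, max for complete link
    Returns the remaining clusters and the list of merges performed.
    """
    # Let each data point be a cluster.
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []

    def proximity(a, b):
        # Proximity of two clusters, derived from the point-to-point distances.
        return linkage(dist[i][j] for i in a for j in b)

    # Repeat until only k clusters remain.
    while len(clusters) > k:
        # Find the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: proximity(*pair))
        merges.append((sorted(a), sorted(b), proximity(a, b)))
        # Merge them; proximities are recomputed on demand in the next pass.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, merges

# Hypothetical toy data: four points on a line at positions 0, 1, 5, 6.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
clusters, merges = agglomerative(dist, k=2)
```

Passing `linkage=max` switches the sketch to complete link; stopping at `k` clusters corresponds to cutting the hierarchy at the level where k clusters are left.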
Hierarchical Clustering:
Time and Space Requirements
• O(N²) space, since it uses the proximity matrix
§ N is the number of points
• O(N³) time in many cases
§ There are N steps and, at each step, the proximity matrix of size N²
must be updated and searched
§ Complexity can be reduced to O(N² log N) time for some approaches
Efficient Implementation
• Compute the distance between all pairs of points [O(N²)]
• Insert the pairs and their distances into a priority queue, so that the minimum
can be found in one step [O(N²)]
• When two clusters are merged, remove all entries in the priority queue
involving one of these two clusters [O(N log N)]
• Compute all the distances between the new cluster and the remaining
clusters [O(N log N)]
• Since the last two steps are executed at most N times, the complexity of the
whole algorithm is O(N² log N)
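The priority-queue idea can be sketched with Python's `heapq` module. Two assumptions are made here that the slide does not state: the linkage is single link (so the distance to a merged cluster is the smaller of the two old distances), and stale queue entries are skipped when popped (lazy deletion) instead of being removed eagerly, because `heapq` has no efficient delete. The function name `single_link_heap` and the cluster-id scheme are illustrative only.

```python
import heapq

def single_link_heap(dist):
    """Single-link agglomerative clustering driven by a priority queue.

    dist -- symmetric list-of-lists distance matrix
    Returns the merges as (cluster_id_a, cluster_id_b, distance, new_id).
    """
    n = len(dist)
    # d[(i, j)] with i < j holds the current distance between two clusters.
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    heap = [(v, i, j) for (i, j), v in d.items()]
    heapq.heapify(heap)  # builds the queue over all pairs
    alive = set(range(n))
    next_id = n
    merges = []

    while len(alive) > 1:
        v, a, b = heapq.heappop(heap)
        # Lazy deletion: skip entries that mention an already merged cluster.
        if a not in alive or b not in alive:
            continue
        alive -= {a, b}
        new = next_id
        next_id += 1
        merges.append((a, b, v, new))
        # Distances between the new cluster and every remaining cluster.
        for c in alive:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            d[(c, new)] = min(dac, dbc)  # single-link update; c < new always
            heapq.heappush(heap, (d[(c, new)], c, new))
        alive.add(new)
    return merges

# Hypothetical toy data: four points on a line at positions 0, 1, 5, 6.
points = [0.0, 1.0, 5.0, 6.0]
dist = [[abs(p - q) for q in points] for p in points]
merges = single_link_heap(dist)
```

Each merge pushes at most N new entries at O(log N) each, which is where the O(N² log N) total comes from in this sketch.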
Distance Between Clusters
Initial Configuration
• Start with clusters of individual points and the distance matrix
[Figure: the individual points p1, p2, ..., p12 and the distance matrix with rows and columns p1, p2, p3, p4, p5, ...]
Intermediate Situation
• After some merging steps, we have some clusters
[Figure: the points p1, ..., p12 grouped into clusters C1, C2, C3, C4, C5 and the distance matrix with rows and columns C1, ..., C5]
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: the clusters C1, ..., C5 over the points p1, ..., p12 and the distance matrix with rows and columns C1, ..., C5]
After Merging
• The question is “How do we update the proximity matrix?”
[Figure: the clusters C1, C3, C4 and the merged cluster C2 U C5; in the distance matrix, the row and column of C2 U C5 are filled with "?"]
Similarity?
Single Linkage or MIN
Complete Linkage or MAX
Average or Group Average
Distance between Centroids
Typical Alternatives to Calculate the
Distance Between Clusters
• Single link (or MIN)
§ smallest distance between an element in one cluster and an
element in the other, i.e., $d(C_i, C_j) = \min_{p,q} d(t_{i,p}, t_{j,q})$
• Complete link (or MAX)
§ largest distance between an element in one cluster and
an element in the other, i.e., $d(C_i, C_j) = \max_{p,q} d(t_{i,p}, t_{j,q})$
• Average (or group average)
§ average distance between an element in one cluster and an
element in the other, i.e., $d(C_i, C_j) = \mathrm{avg}_{p,q}\, d(t_{i,p}, t_{j,q})$
• Centroid
§ distance between the centroids of two clusters, i.e.,
$d(C_i, C_j) = d(\mu_i, \mu_j)$, where $\mu_i$ and $\mu_j$ are the centroids
• …
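The four alternatives listed above can be written as small Python functions. This is a sketch under stated assumptions: points are tuples of coordinates, the underlying distance is Euclidean (`math.dist`, available from Python 3.8), and the function names and sample clusters are made up for the illustration.

```python
from itertools import product
from math import dist  # Euclidean distance between two points

def single_link(ci, cj):
    # MIN: smallest distance between an element of ci and an element of cj.
    return min(dist(p, q) for p, q in product(ci, cj))

def complete_link(ci, cj):
    # MAX: largest distance between an element of ci and an element of cj.
    return max(dist(p, q) for p, q in product(ci, cj))

def average_link(ci, cj):
    # Group average: mean of all pairwise distances across the two clusters.
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def centroid(c):
    # Coordinate-wise mean of the points in the cluster.
    return tuple(sum(coord) / len(c) for coord in zip(*c))

def centroid_link(ci, cj):
    # Distance between the centroids of the two clusters.
    return dist(centroid(ci), centroid(cj))

# Hypothetical clusters in the plane.
ci = [(0.0, 0.0), (0.0, 2.0)]
cj = [(3.0, 0.0), (5.0, 0.0)]
d_min = single_link(ci, cj)
d_max = complete_link(ci, cj)
d_avg = average_link(ci, cj)
d_cen = centroid_link(ci, cj)
```

For any pair of clusters the group average lies between the single-link and the complete-link distance, which is a quick sanity check when comparing linkages.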
Example
• Suppose we have five items, a, b, c, d, and e.
• We want to perform hierarchical clustering on
five instances following an agglomerative approach
• First: we compute the distance or similarity matrix
• Dij is the distance between instances “i” and “j”
$$
D = \begin{pmatrix}
0.0 & & & & \\
2.0 & 0.0 & & & \\
6.0 & 5.0 & 0.0 & & \\
10.0 & 9.0 & 4.0 & 0.0 & \\
9.0 & 8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix}
$$
Example
• Group the two instances that are closest
• In this case, a and b (items 1 and 2) are the closest items (D2,1 = 2)
• Compute the distance matrix again, and repeat.
• Suppose we apply single linkage (MIN): we need to compute the
distance between the new cluster {1,2} and the others
§ d(12)3 = min[d13, d23] = d23 = 5.0
§ d(12)4 = min[d14, d24] = d24 = 9.0
§ d(12)5 = min[d15, d25] = d25 = 8.0
Example
• The new distance matrix is,
$$
D = \begin{pmatrix}
0.0 & & & \\
5.0 & 0.0 & & \\
9.0 & 4.0 & 0.0 & \\
8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix}
$$
(rows and columns ordered as (12), 3, 4, 5)
• At the end, we obtain the
following dendrogram
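The whole example can be replayed with a short Python sketch. The distance values are those read from the example's matrix (d21 = 2, d23 = 5, d24 = 9 and d25 = 8 are quoted in the text; the remaining entries are taken as reconstructed from the slide). Naming merged clusters by concatenating their labels is only a convenience of this illustration.

```python
# Lower triangle of the example's distance matrix for the items a..e.
labels = ["a", "b", "c", "d", "e"]
lower = [
    [0.0],
    [2.0, 0.0],
    [6.0, 5.0, 0.0],
    [10.0, 9.0, 4.0, 0.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]

# Current distances between clusters, keyed by the unordered pair of names.
d = {frozenset((labels[i], labels[j])): lower[i][j]
     for i in range(5) for j in range(i)}
clusters = list(labels)
merges = []

while len(clusters) > 1:
    # Closest pair of clusters under the current distances.
    pair = min(d, key=d.get)
    x, y = sorted(pair)
    height = d[pair]
    merged = x + y
    merges.append((merged, height))
    rest = [c for c in clusters if c not in pair]
    # Keep the distances that do not involve the two merged clusters.
    new_d = {k: v for k, v in d.items() if not (k & pair)}
    # Single-link (MIN) update for the merged cluster.
    for c in rest:
        new_d[frozenset((merged, c))] = min(d[frozenset((x, c))],
                                            d[frozenset((y, c))])
    d = new_d
    clusters = rest + [merged]

for name, height in merges:
    print(f"merge into {{{','.join(name)}}} at distance {height}")
```

The merge order is {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}, and the merge distances are the heights at which the corresponding branches join in the dendrogram.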
Determining the Number of Clusters
Hierarchical clustering generates
a set of N possible partitions.
Which one should I choose?
From the previous lecture we know that, ideally,
a good clustering should partition the points so that …
Data points in the same cluster should have
a small distance from one another.
Data points in different clusters should be at
a large distance from one another.
Within/Between Clusters Sum of
Squares
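• Within-cluster sum of squares
$WSS = \sum_{i} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
$BSS = \sum_{i} |C_i| \, \lVert \mu - \mu_i \rVert^2$
where μ is the centroid of the whole dataset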
Within/Between Clusters Sum of
Squares (for distance function d)
• Within-cluster sum of squares
$WSS = \sum_{i} \sum_{x \in C_i} d(x, \mu_i)^2$
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
$BSS = \sum_{i} |C_i| \, d(\mu, \mu_i)^2$
where μ is the centroid of the whole dataset
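As a concrete illustration, the two quantities can be computed as follows in the Euclidean case. The sketch assumes the usual definitions: WSS sums the squared distance of every point to the centroid of its cluster, and BSS sums, over the clusters, the cluster size times the squared distance between the cluster centroid and the centroid of the whole dataset. The function names and the sample clusters are hypothetical.

```python
def centroid(points):
    # Coordinate-wise mean of a list of equal-length tuples.
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def sq_dist(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def wss_bss(clusters):
    """clusters -- list of clusters, each a list of equal-length tuples."""
    all_points = [p for c in clusters for p in c]
    mu = centroid(all_points)  # centroid of the whole dataset
    # Within: every point against the centroid of its own cluster.
    wss = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    # Between: every cluster centroid against mu, weighted by cluster size.
    bss = sum(len(c) * sq_dist(centroid(c), mu) for c in clusters)
    return wss, bss

# Hypothetical clustering of four points into two clusters.
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
wss, bss = wss_bss(clusters)
tss = sum(sq_dist(p, centroid([p for c in clusters for p in c]))
          for c in clusters for p in c)
```

In the Euclidean case WSS + BSS equals the total sum of squares, so as the number of clusters grows WSS can only be traded against BSS; this is what makes plotting the two curves against k informative.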
Evaluation of Hierarchical Clustering
using Knee/Elbow Analysis
Plot the WSS and BSS for every clustering and look
for a knee in the plot that shows a significant
change in the evaluation metrics
Run the Python notebook
for hierarchical clustering
Example data generated using the make_blobs function of scikit-learn
Dendrogram computed using single linkage.
BSS and WSS for values of k from 1 to 19.
Clusters produced for values of k from 2 to 7.
Clusters produced for values of k from 5 to 10.
How can we represent clusters?
Euclidean vs Non-Euclidean Spaces
• Euclidean Spaces
§ We can identify a cluster using for instance its centroid
(e.g. computed as the average among all its data points)
§ Alternatively, we can use its convex hull
• Non-Euclidean Spaces
§ We can define a distance (Jaccard, cosine, edit)
§ We cannot compute a centroid, so we introduce the concept of
clustroid
• Clustroid
§ An existing data point that we take as a cluster representative
§ It can be the point that minimizes the sum of the distances to the other
points in the cluster
§ Or the one minimizing the maximum distance to another point
§ Or the one minimizing the sum of the squares of the distances to the
other points in the cluster
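The three clustroid criteria can be sketched as a single Python function. The example assumes the Jaccard distance on sets as the non-Euclidean distance; the function names, the `criterion` parameter and the sample cluster are hypothetical.

```python
def jaccard_distance(a, b):
    # 1 minus the ratio between the size of the intersection and of the union.
    return 1.0 - len(a & b) / len(a | b)

def clustroid(cluster, distance, criterion="sum"):
    """Return the existing point of the cluster that best represents it.

    criterion -- "sum": minimize the sum of the distances to the other points
                 "max": minimize the maximum distance to another point
                 "sum_sq": minimize the sum of the squared distances
    """
    def score(p):
        ds = [distance(p, q) for q in cluster if q is not p]
        if criterion == "sum":
            return sum(ds)
        if criterion == "max":
            return max(ds, default=0.0)
        if criterion == "sum_sq":
            return sum(x * x for x in ds)
        raise ValueError(f"unknown criterion: {criterion}")

    # The clustroid is the point with the smallest score.
    return min(cluster, key=score)

# Hypothetical cluster of three sets of characters.
cluster = [frozenset("abc"), frozenset("abd"), frozenset("abcd")]
representative = clustroid(cluster, jaccard_distance)
```

On this small cluster all three criteria select the same point; on larger or more skewed clusters they can disagree, since the maximum-based criterion is the most sensitive to outliers.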
Examples using KNIME
Evaluation of the result from hierarchical clustering with
3 clusters and average linkage against existing labels
Comparison of hierarchical clustering with 3 clusters
and average linkage against k-Means with k=3
Computing cluster quality from one to 20 clusters
using the entropy scorer
Examples using R
Hierarchical Clustering in R
# init the seed to be able to repeat the experiment
set.seed(1234)
par(mar=c(0,0,0,0))
# randomly generates the data
x<-rnorm(12, mean=rep(1:3,each=4), sd=0.2)
y<-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
plot(x,y,pch=19,cex=2,col="blue")
# distance matrix
d <- data.frame(x,y)
dm <- dist(d)
# generate the hierarchical clustering (hclust defaults to complete linkage)
cl <- hclust(dm)
plot(cl)
# other ways to plot dendrograms
# http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html
Evaluation of Clustering in R
library(GMD)
###
### checking the quality of the previous cluster
###
# init two vectors that will contain the evaluation
# in terms of within and between sum of squares
plot_wss = rep(0,12)
plot_bss = rep(0,12)
# evaluate every clustering
for(i in 1:12)
{
clusters <- cutree(cl,i)
eval <- css(dm,clusters);
plot_wss[i] <- eval$totwss
plot_bss[i] <- eval$totbss
}
Evaluation of Clustering in R
# plot the results
x = 1:12
plot(x, y=plot_bss, main="Between Cluster Sum-of-square",
cex=2, pch=18, col="blue", xlab="Number of Clusters",
ylab="Evaluation")
lines(x, plot_bss, col="blue")
par(new=TRUE)
plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="")
lines(x,plot_wss, col="red");
Knee/Elbow Analysis of Clustering
Hierarchical Clustering in R – Iris2D
library(foreign)
iris = read.arff("iris.2D.arff")
with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2))
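# distance matrix on the two petal attributes
iris_dist <- dist(iris[,1:2])
# single-linkage clustering
cl <- hclust(iris_dist, method="single")
#cl <- hclust(dist(iris[,1:2],method="manhattan"), method="single")
plot(cl)
# average-linkage clustering, then cut the dendrogram into 2 clusters
cl_average <- hclust(iris_dist, method="average")
plot(cl_average)
cutree(cl_average,2)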
Knee/Elbow Analysis of Clustering for
iris2D
Knee/Elbow Analysis of Clustering for
iris
Summary
Hierarchical Clustering:
Problems and Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one
or more of the following:
§Sensitivity to noise and outliers
§ Difficulty handling clusters of different sizes
and non-convex shapes
§Breaking large clusters
• Major weakness of agglomerative clustering methods
§ They do not scale well: time complexity of at least O(n²),
where n is the total number of objects
§They can never undo what was done previously
59

More Related Content

What's hot

Efficient initialization for nonnegative matrix factorization based on nonneg...
Efficient initialization for nonnegative matrix factorization based on nonneg...Efficient initialization for nonnegative matrix factorization based on nonneg...
Efficient initialization for nonnegative matrix factorization based on nonneg...Daichi Kitamura
 
画像生成・生成モデル メタサーベイ
画像生成・生成モデル メタサーベイ画像生成・生成モデル メタサーベイ
画像生成・生成モデル メタサーベイcvpaper. challenge
 
グラフニューラルネットワーク入門
グラフニューラルネットワーク入門グラフニューラルネットワーク入門
グラフニューラルネットワーク入門ryosuke-kojima
 
[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...
[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...
[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...Deep Learning JP
 
Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験myxymyxomatosis
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 ISungbin Lim
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法MITSUNARI Shigeo
 
競技プログラミング頻出アルゴリズム攻略
競技プログラミング頻出アルゴリズム攻略競技プログラミング頻出アルゴリズム攻略
競技プログラミング頻出アルゴリズム攻略K Moneto
 
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5Toshinori Hanya
 
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)Preferred Networks
 
東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太
東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太
東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太Preferred Networks
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density ModelsSangwoo Mo
 
알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder홍배 김
 
道具としての機械学習:直感的概要とその実際
道具としての機械学習:直感的概要とその実際道具としての機械学習:直感的概要とその実際
道具としての機械学習:直感的概要とその実際Ichigaku Takigawa
 
【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language ModelsDeep Learning JP
 
【DL輪読会】"A Generalist Agent"
【DL輪読会】"A Generalist Agent"【DL輪読会】"A Generalist Agent"
【DL輪読会】"A Generalist Agent"Deep Learning JP
 
多倍長整数の乗算と高速フーリエ変換
多倍長整数の乗算と高速フーリエ変換多倍長整数の乗算と高速フーリエ変換
多倍長整数の乗算と高速フーリエ変換京大 マイコンクラブ
 
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion ModelsDeep Learning JP
 
Shinyユーザのための非同期プログラミング入門
Shinyユーザのための非同期プログラミング入門Shinyユーザのための非同期プログラミング入門
Shinyユーザのための非同期プログラミング入門hoxo_m
 

What's hot (20)

Efficient initialization for nonnegative matrix factorization based on nonneg...
Efficient initialization for nonnegative matrix factorization based on nonneg...Efficient initialization for nonnegative matrix factorization based on nonneg...
Efficient initialization for nonnegative matrix factorization based on nonneg...
 
画像生成・生成モデル メタサーベイ
画像生成・生成モデル メタサーベイ画像生成・生成モデル メタサーベイ
画像生成・生成モデル メタサーベイ
 
グラフニューラルネットワーク入門
グラフニューラルネットワーク入門グラフニューラルネットワーク入門
グラフニューラルネットワーク入門
 
[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...
[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...
[DL輪読会]Multi-Modal and Multi-Domain Embedding Learning for Fashion Retrieval ...
 
Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験Transformerを用いたAutoEncoderの設計と実験
Transformerを用いたAutoEncoderの設計と実験
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 I
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
競技プログラミング頻出アルゴリズム攻略
競技プログラミング頻出アルゴリズム攻略競技プログラミング頻出アルゴリズム攻略
競技プログラミング頻出アルゴリズム攻略
 
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5Deep Learningで似た画像を見つける技術 | OHS勉強会#5
Deep Learningで似た画像を見つける技術 | OHS勉強会#5
 
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
PFNにおける研究開発(2022/10/19 東大大学院「融合情報学特別講義Ⅲ」)
 
C++の黒魔術
C++の黒魔術C++の黒魔術
C++の黒魔術
 
東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太
東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太
東大大学院 電子情報学特論講義資料「深層学習概論と理論解析の課題」大野健太
 
Explicit Density Models
Explicit Density ModelsExplicit Density Models
Explicit Density Models
 
알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder알기쉬운 Variational autoencoder
알기쉬운 Variational autoencoder
 
道具としての機械学習:直感的概要とその実際
道具としての機械学習:直感的概要とその実際道具としての機械学習:直感的概要とその実際
道具としての機械学習:直感的概要とその実際
 
【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models【DL輪読会】Scaling Laws for Neural Language Models
【DL輪読会】Scaling Laws for Neural Language Models
 
【DL輪読会】"A Generalist Agent"
【DL輪読会】"A Generalist Agent"【DL輪読会】"A Generalist Agent"
【DL輪読会】"A Generalist Agent"
 
多倍長整数の乗算と高速フーリエ変換
多倍長整数の乗算と高速フーリエ変換多倍長整数の乗算と高速フーリエ変換
多倍長整数の乗算と高速フーリエ変換
 
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
【DL輪読会】High-Resolution Image Synthesis with Latent Diffusion Models
 
Shinyユーザのための非同期プログラミング入門
Shinyユーザのための非同期プログラミング入門Shinyユーザのための非同期プログラミング入門
Shinyユーザのための非同期プログラミング入門
 

Similar to DMTM Lecture 12 Hierarchical clustering

DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringPier Luca Lanzi
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringPier Luca Lanzi
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringPier Luca Lanzi
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringPier Luca Lanzi
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringPier Luca Lanzi
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfp_manimozhi
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationPier Luca Lanzi
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationPier Luca Lanzi
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesPier Luca Lanzi
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsPier Luca Lanzi
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptImXaib
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSandinoBerutu1
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017Iwan Sofana
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 RegressionPier Luca Lanzi
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesPier Luca Lanzi
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsMarina Santini
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 

Similar to DMTM Lecture 12 Hierarchical clustering (20)

DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest Neighbors
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 

More from Pier Luca Lanzi

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i VideogiochiPier Luca Lanzi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiPier Luca Lanzi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomePier Luca Lanzi
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...Pier Luca Lanzi
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaPier Luca Lanzi
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Pier Luca Lanzi
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationPier Luca Lanzi
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationPier Luca Lanzi
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningPier Luca Lanzi
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningPier Luca Lanzi
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesPier Luca Lanzi
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringPier Luca Lanzi
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesPier Luca Lanzi
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesPier Luca Lanzi
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationPier Luca Lanzi
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationPier Luca Lanzi
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionPier Luca Lanzi
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningPier Luca Lanzi
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelinePier Luca Lanzi
 

More from Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 

DMTM Lecture 12 Hierarchical clustering

  • 1. Prof. Pier Luca Lanzi, Hierarchical Clustering, Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
  • 10. What is Hierarchical Clustering?
      • Suppose we have five items, a, b, c, d, and e.
      • Initially, we consider one cluster for each item.
      • Then, at each step, we merge together the most similar clusters, until we generate one cluster.
      • [Figure: merges from Step 0 to Step 4: {a}, {b}, {c}, {d}, {e}; then {a,b}; then {d,e}; then {c,d,e}; finally {a,b,c,d,e}]
  • 11. What is Hierarchical Clustering?
      • Alternatively, we start from one cluster containing the five elements.
      • Then, at each step, we split one cluster to improve intracluster similarity, until each element is in a cluster of its own.
      • [Figure: the same hierarchy with the steps numbered in reverse, from Step 0 ({a,b,c,d,e}) to Step 4 (five singleton clusters)]
  • 12. What is Hierarchical Clustering?
      • By far, it is the most common clustering technique.
      • It produces a hierarchy of nested clusters.
      • The hierarchy can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
      • [Figure: dendrogram over a, b, c, d, e with internal nodes {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}]
  • 13. What Approaches?
      • Agglomerative
          - Start with each point as an individual cluster and, at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
      • Divisive
          - Start with one cluster and, at each step, split a cluster until each cluster contains a single point (or there are k clusters)
      • [Figure: the same hierarchy over a, b, c, d, e, with arrows marking the agglomerative and divisive directions]
  • 14. Strengths of Hierarchical Clustering
      • No need to assume any particular number of clusters
      • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
      • The hierarchies may correspond to meaningful taxonomies
      • Examples in the biological sciences include the animal kingdom, phylogeny reconstruction, etc.
      • Traditional hierarchical algorithms use a similarity or distance matrix to merge or split one cluster at a time
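Cutting the dendrogram to obtain k clusters amounts to replaying only the first N - k merges. The sketch below illustrates this under an assumed encoding that is not in the slides: the function name `cut` is hypothetical, merges are pairs of cluster ids, and each new cluster receives the next free id (N, N + 1, ...). The merge list is the a, b, c, d, e sequence from the introductory figure.

```python
def cut(n_points, merges, k):
    """Return the k clusters obtained by replaying the first n_points - k merges.

    merges is a list of (i, j) cluster-id pairs in merge order; the cluster
    created at step s gets the id n_points + s (an assumed convention).
    """
    clusters = {i: [i] for i in range(n_points)}
    for step, (i, j) in enumerate(merges[: n_points - k]):
        clusters[n_points + step] = clusters.pop(i) + clusters.pop(j)
    return sorted(sorted(c) for c in clusters.values())


# Items a..e are numbered 0..4; merges: {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}.
merges = [(0, 1), (3, 4), (2, 6), (5, 7)]
two_clusters = cut(5, merges, 2)
three_clusters = cut(5, merges, 3)
print(two_clusters)    # [[0, 1], [2, 3, 4]]
print(three_clusters)  # [[0, 1], [2], [3, 4]]
```

The same hierarchy therefore yields every partition from 5 clusters down to 1 without rerunning the clustering.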
  • 15. Agglomerative Clustering Algorithm
      • The more popular hierarchical clustering technique
      • Compute the proximity matrix
      • Let each data point be a cluster
      • Repeat
          - Merge the two closest clusters
          - Update the proximity matrix
      • Until only a single cluster remains
      • The key operation is the computation of the proximity of two clusters
      • Different approaches to defining the distance between clusters distinguish the different algorithms
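The loop above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the lecture's implementation: the function name `agglomerative` and its parameters are assumptions, and instead of updating a proximity matrix in place it recomputes each cluster-to-cluster distance from the point distances, which is simpler but slower.

```python
from itertools import combinations


def agglomerative(dist, k=1, linkage=min):
    """Naive agglomerative clustering on a symmetric distance matrix.

    dist is a square list of lists, k is the number of clusters to stop at,
    and linkage aggregates point-to-point distances (min = single link,
    max = complete link). Returns (clusters, merges).
    """
    clusters = [[i] for i in range(len(dist))]  # let each point be a cluster
    merges = []
    while len(clusters) > k:
        # Search for the closest pair of clusters.
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(dist[p][q] for p in clusters[a] for q in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a stays valid
    return clusters, merges


# Four points on a line at positions 0, 1, 5, 6.
xs = [0, 1, 5, 6]
dist = [[abs(p - q) for q in xs] for p in xs]
clusters, merges = agglomerative(dist, k=2)
print(clusters)  # [[0, 1], [2, 3]]
```

Passing `linkage=max` switches the same loop to complete linkage, which is the sense in which the cluster distance definition distinguishes the algorithms.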
  • 16. Hierarchical Clustering: Time and Space Requirements
      • O(N^2) space, since it uses the proximity matrix
          - N is the number of points
      • O(N^3) time in many cases
          - There are N steps and, at each step, the proximity matrix (of size N^2) must be updated and searched
          - Complexity can be reduced to O(N^2 log N) time for some approaches
  • 17. Efficient Implementation
      • Compute the distance between all pairs of points [O(N^2)]
      • Insert the pairs and their distances into a priority queue to find the minimum in one step [O(N^2)]
      • When two clusters are merged, remove all entries in the priority queue involving one of these two clusters [O(N log N)]
      • Compute all the distances between the new cluster and the remaining clusters [O(N log N)]
      • Since the last two steps are executed at most N times, the complexity of the whole algorithm is O(N^2 log N)
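A minimal sketch of the priority-queue idea, using Python's `heapq`. Two choices here are assumptions rather than the slide's method: stale queue entries are skipped when popped (lazy deletion) instead of being removed eagerly, and the update rule shown is the single-linkage one, where the distance from the merged cluster to any other cluster is the minimum of the two old distances. The name `single_link_heap` is hypothetical.

```python
import heapq


def single_link_heap(dist):
    """Single-linkage clustering driven by a priority queue of cluster pairs."""
    n = len(dist)
    active = {i: [i] for i in range(n)}  # cluster id -> member points
    # D[a][c] holds the current distance between active clusters a and c.
    D = {i: {j: dist[i][j] for j in range(n) if j != i} for i in range(n)}
    heap = [(dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)                  # all pairs go into the queue
    merges = []
    next_id = n
    while len(active) > 1:
        d, a, b = heapq.heappop(heap)
        if a not in active or b not in active:
            continue                     # stale entry: one side was already merged
        merged = active.pop(a) + active.pop(b)
        D[next_id] = {}
        for c in active:                 # distances from the new cluster to the rest
            dc = min(D[a][c], D[b][c])
            D[next_id][c] = D[c][next_id] = dc
            heapq.heappush(heap, (dc, c, next_id))
        active[next_id] = merged
        merges.append((sorted(merged), d))
        next_id += 1
    return merges


xs = [0, 1, 5, 6]
dist = [[abs(p - q) for q in xs] for p in xs]
heap_merges = single_link_heap(dist)
print(heap_merges)  # [([0, 1], 1), ([2, 3], 1), ([0, 1, 2, 3], 4)]
```

Lazy deletion leaves outdated pairs in the queue, so the queue can grow larger than with eager removal, but each pair is still pushed and popped at most once.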
  • 18. Distance Between Clusters
  • 19. Initial Configuration
      • Start with clusters of individual points and the distance matrix
      • [Figure: points p1, p2, ..., p12 as singleton clusters, next to the distance matrix indexed by p1, p2, p3, p4, p5, ...]
  • 20. Intermediate Situation
      • After some merging steps, we have some clusters
      • [Figure: clusters C1 to C5 and the distance matrix indexed by C1, ..., C5]
  • 21. Intermediate Situation
      • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
      • [Figure: clusters C1 to C5 and the distance matrix indexed by C1, ..., C5]
  • 22. After Merging
      • The question is “How do we update the proximity matrix?”
      • [Figure: clusters C1, C3, C4, and C2 ∪ C5; the row and column for C2 ∪ C5 in the distance matrix are marked with “?”]
  • 23. Similarity?
  • 24. Single Linkage or MIN
  • 25. Complete Linkage or MAX
  • 26. Average or Group Average
  • 27. Distance between Centroids
  • 28. Typical Alternatives to Calculate the Distance Between Clusters
      • Single link (or MIN)
          - smallest distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \min_{p,q} d(t_{i,p}, t_{j,q})$
      • Complete link (or MAX)
          - largest distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \max_{p,q} d(t_{i,p}, t_{j,q})$
      • Average (or group average)
          - average distance between an element in one cluster and an element in the other, i.e., $d(C_i, C_j) = \mathrm{avg}_{p,q}\, d(t_{i,p}, t_{j,q})$
      • Centroid
          - distance between the centroids of two clusters, i.e., $d(C_i, C_j) = d(\mu_i, \mu_j)$, where $\mu_i$ and $\mu_j$ are the centroids
      • …
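The four definitions translate almost directly into code. The sketch below assumes Euclidean points stored as tuples and uses `math.dist` (Python 3.8+) as the point distance; the function names are illustrative, not taken from the lecture.

```python
from itertools import product
from math import dist, isclose, sqrt


def single_link(ci, cj):
    """Smallest distance between a point of ci and a point of cj."""
    return min(dist(p, q) for p, q in product(ci, cj))


def complete_link(ci, cj):
    """Largest distance between a point of ci and a point of cj."""
    return max(dist(p, q) for p, q in product(ci, cj))


def average_link(ci, cj):
    """Average distance over all pairs (one point from each cluster)."""
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))


def centroid(c):
    """Coordinate-wise mean of the points in c."""
    return tuple(sum(coords) / len(c) for coords in zip(*c))


def centroid_link(ci, cj):
    """Distance between the two centroids."""
    return dist(centroid(ci), centroid(cj))


ci = [(0.0, 0.0), (0.0, 2.0)]
cj = [(3.0, 0.0), (3.0, 2.0)]
print(single_link(ci, cj), complete_link(ci, cj),
      average_link(ci, cj), centroid_link(ci, cj))
```

For these two clusters the single-link and centroid distances are both 3, the complete-link distance is the diagonal (the square root of 13), and the group average falls between the two.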
  • 29. Example
      • Suppose we have five items, a, b, c, d, and e.
      • We want to perform hierarchical clustering on the five instances following an agglomerative approach.
      • First, we compute the distance or similarity matrix.
      • $D_{ij}$ is the distance between instances i and j:
        $$D = \begin{pmatrix} 0.0 & & & & \\ 2.0 & 0.0 & & & \\ 6.0 & 5.0 & 0.0 & & \\ 10.0 & 9.0 & 4.0 & 0.0 & \\ 9.0 & 8.0 & 5.0 & 3.0 & 0.0 \end{pmatrix}$$
  • 30. Example
      • Group the two instances that are closest.
      • In this case, a and b are the closest items ($D_{2,1} = 2$).
      • Compute the distance matrix again, and repeat.
      • Suppose we apply single linkage (MIN): we need to compute the distance between the new cluster {1,2} and the others
          - $d_{(12)3} = \min[d_{13}, d_{23}] = d_{23} = 5.0$
          - $d_{(12)4} = \min[d_{14}, d_{24}] = d_{24} = 9.0$
          - $d_{(12)5} = \min[d_{15}, d_{25}] = d_{25} = 8.0$
  • 31. Example
      • The new distance matrix, with rows and columns ordered as (12), 3, 4, 5, is
        $$D = \begin{pmatrix} 0.0 & & & \\ 5.0 & 0.0 & & \\ 9.0 & 4.0 & 0.0 & \\ 8.0 & 5.0 & 3.0 & 0.0 \end{pmatrix}$$
      • At the end, we obtain the following dendrogram
      • [Figure: dendrogram]
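The worked example can be carried through to the final dendrogram with a short script. It is a sketch that assumes the distance values read from the example (a-b = 2, a-c = 6, b-c = 5, a-d = 10, b-d = 9, c-d = 4, a-e = 9, b-e = 8, c-e = 5, d-e = 3) and applies the single-linkage update $\min[d_{ik}, d_{jk}]$ after every merge. Cluster names are formed by concatenating item letters, which is purely a naming convenience.

```python
# Lower-triangular distances for items a..e, one row per item.
names = ["a", "b", "c", "d", "e"]
rows = {
    "b": [2.0],
    "c": [6.0, 5.0],
    "d": [10.0, 9.0, 4.0],
    "e": [9.0, 8.0, 5.0, 3.0],
}
D = {}
for name, values in rows.items():
    for other, value in zip(names, values):
        D[frozenset((name, other))] = value

clusters = list(names)
merges = []
while len(clusters) > 1:
    pair = min(D, key=D.get)              # closest pair of current clusters
    height = D[pair]
    x, y = sorted(pair)
    new = x + y
    rest = [c for c in clusters if c not in pair]
    for c in rest:                        # single-linkage update: min of old distances
        D[frozenset((new, c))] = min(D[frozenset((x, c))], D[frozenset((y, c))])
    # Drop every entry that still refers to one of the two merged clusters.
    D = {k: v for k, v in D.items() if x not in k and y not in k}
    clusters = rest + [new]
    merges.append((new, height))

print(merges)
# [('ab', 2.0), ('de', 3.0), ('cde', 4.0), ('abcde', 5.0)]
```

The merge heights 2, 3, 4, and 5 are the levels at which the dendrogram branches join; the order of merges matches the a, b, c, d, e sequence shown at the start of the lecture.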
  • 32. Determining the Number of Clusters
  • 33. Hierarchical clustering generates a set of N possible partitions. Which one should I choose?
  • 34. From the previous lecture, we know that ideally a good clustering should partition points so that:
      • data points in the same cluster have a small distance from one another;
      • data points in different clusters are at a large distance from one another.
  • 35. Within/Between Clusters Sum of Squares
      • Within-cluster sum of squares
        $$WSS = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
        where $\mu_i$ is the centroid of cluster $C_i$ (in case of Euclidean spaces)
      • Between-cluster sum of squares
        $$BSS = \sum_{i=1}^{k} |C_i| \, \lVert \mu - \mu_i \rVert^2$$
        where $\mu$ is the centroid of the whole dataset
  • 36. Within/Between Clusters Sum of Squares (for distance function d)
      • Within-cluster sum of squares
        $$WSS = \sum_{i=1}^{k} \sum_{x \in C_i} d(x, \mu_i)^2$$
        where $\mu_i$ is the centroid of cluster $C_i$ (in case of Euclidean spaces)
      • Between-cluster sum of squares
        $$BSS = \sum_{i=1}^{k} |C_i| \, d(\mu, \mu_i)^2$$
        where $\mu$ is the centroid of the whole dataset
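A small sketch of both quantities for Euclidean points. It assumes the usual definitions: WSS sums the squared distance of every point to its own cluster centroid, and BSS sums, over clusters, the cluster size times the squared distance between the cluster centroid and the centroid of the whole dataset. The helper names are illustrative.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wss_bss(clusters):
    """Return (WSS, BSS) for a list of clusters, each a list of points."""
    mu = centroid([p for c in clusters for p in c])  # centroid of the whole dataset
    wss = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    bss = sum(len(c) * sq_dist(centroid(c), mu) for c in clusters)
    return wss, bss


clusters = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (4.0, 2.0)]]
wss, bss = wss_bss(clusters)
print(wss, bss)  # 4.0 16.0
```

With these definitions WSS + BSS equals the total sum of squares around the dataset centroid (20 in this example), so as the number of clusters grows WSS falls and BSS rises by the same amount.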
  • 37. Evaluation of Hierarchical Clustering using Knee/Elbow Analysis
      • Plot the WSS and BSS for every clustering and look for a knee in the plot, that is, a point where the evaluation metrics change significantly
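The slide suggests reading the knee off the plot. One simple way to approximate that by code, offered here only as an assumed heuristic and not as the lecture's method, is to pick the point where the curve bends most sharply, measured by the largest second difference. The WSS values below are made up for illustration.

```python
def knee_index(values):
    """Return the 0-based index with the largest second difference."""
    best_i, best_curv = None, float("-inf")
    for i in range(1, len(values) - 1):
        curv = values[i - 1] - 2 * values[i] + values[i + 1]
        if curv > best_curv:
            best_i, best_curv = i, curv
    return best_i


# Hypothetical WSS values for k = 1..6 (not taken from the lecture).
wss = [100.0, 60.0, 15.0, 12.0, 10.0, 9.0]
k = knee_index(wss) + 1  # indices are 0-based, k starts at 1
print(k)  # 3
```

On noisy curves this heuristic can pick a spurious bend, so it is best treated as a starting point for the visual check rather than a replacement.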
  • 38. Run the Python notebook for hierarchical clustering
  • 39. Example data generated using the make_blobs function of Scikit-Learn
  • 40. Dendrogram computed using single linkage
  • 41. BSS and WSS for values of k from 1 to 19
  • 42. Clusters produced for values of k from 2 to 7
  • 43. Clusters produced for values of k from 5 to 10
  • 44. How can we represent clusters?
  • 45. Euclidean vs Non-Euclidean Spaces
      • Euclidean Spaces
          - We can identify a cluster using, for instance, its centroid (e.g., computed as the average of all its data points)
          - Alternatively, we can use its convex hull
      • Non-Euclidean Spaces
          - We can define a distance (Jaccard, cosine, edit)
          - We cannot compute a centroid, so we introduce the concept of clustroid
      • Clustroid
          - An existing data point that we take as the cluster representative
          - It can be the point that minimizes the sum of the distances to the other points in the cluster
          - Or the one that minimizes the maximum distance to another point
          - Or the one that minimizes the sum of the squares of the distances to the other points in the cluster
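The three clustroid criteria can be sketched for strings under edit distance, a space where no centroid exists. The example assumes Levenshtein distance (insertions, deletions, and substitutions each cost 1); the function names, the `criterion` parameter, and the word list are illustrative choices.

```python
def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def clustroid(points, distance, criterion="sum"):
    """Return the existing point that minimizes the chosen criterion."""
    def score(i):
        ds = [distance(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "sum":
            return sum(ds)
        if criterion == "max":
            return max(ds)
        if criterion == "sum_sq":
            return sum(d * d for d in ds)
        raise ValueError(f"unknown criterion: {criterion}")
    return points[min(range(len(points)), key=score)]


words = ["cat", "cart", "chart", "cast"]
print(clustroid(words, edit_distance))  # cart
```

Here all three criteria agree on "cart", which is within one edit of every other word; in general the criteria can select different representatives.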
  • 46. Examples using KNIME
  • 47. Evaluation of the result from hierarchical clustering with 3 clusters and average linkage against existing labels
  • 48. Comparison of hierarchical clustering with 3 clusters and average linkage against k-Means with k=3
  • 49. Computing cluster quality from one to 20 clusters using the entropy scorer
  • 50. Examples using R
  • 51. Hierarchical Clustering in R
        # init the seed to be able to repeat the experiment
        set.seed(1234)
        par(mar=c(0,0,0,0))

        # randomly generate the data
        x <- rnorm(12, mean=rep(1:3,each=4), sd=0.2)
        y <- rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
        plot(x, y, pch=19, cex=2, col="blue")

        # distance matrix
        d <- data.frame(x,y)
        dm <- dist(d)

        # generate the hierarchy (hclust uses complete linkage by default)
        cl <- hclust(dm)
        plot(cl)

        # other ways to plot dendrograms
        # http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html
  • 52. Evaluation of Clustering in R
        library(GMD)

        ###
        ### checking the quality of the previous cluster
        ###

        # init two vectors that will contain the evaluation
        # in terms of within and between sum of squares
        plot_wss = rep(0,12)
        plot_bss = rep(0,12)

        # evaluate every clustering
        for(i in 1:12) {
          clusters <- cutree(cl,i)
          eval <- css(dm,clusters)
          plot_wss[i] <- eval$totwss
          plot_bss[i] <- eval$totbss
        }
  • 53. Evaluation of Clustering in R
        # plot the results
        x = 1:12
        plot(x, y=plot_bss, main="Between Cluster Sum-of-square",
             cex=2, pch=18, col="blue",
             xlab="Number of Clusters", ylab="Evaluation")
        lines(x, plot_bss, col="blue")

        # overlay the within-cluster curve on the same plot
        par(new=TRUE)
        plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="")
        lines(x, plot_wss, col="red")
  • 54. Knee/Elbow Analysis of Clustering
  • 55. Hierarchical Clustering in R – Iris2D
        library(foreign)
        iris = read.arff("iris.2D.arff")
        with(iris, plot(petallength, petalwidth, col="blue", pch=19, cex=2))

        # distance matrix on the two petal attributes
        iris_dist <- dist(iris[,1:2])

        # single linkage
        cl <- hclust(iris_dist, method="single")
        # cl <- hclust(dist(iris[,1:2], method="manhattan"), method="single")
        plot(cl)

        # average linkage, then cut the hierarchy into two clusters
        cl_average <- hclust(iris_dist, method="average")
        plot(cl_average)
        cutree(cl_average, 2)
  • 56. Knee/Elbow Analysis of Clustering for iris2D
  • 57. Knee/Elbow Analysis of Clustering for iris
  • 58. Summary
  • 59. Hierarchical Clustering: Problems and Limitations
      • Once a decision is made to combine two clusters, it cannot be undone
      • No objective function is directly minimized
      • Different schemes have problems with one or more of the following:
          - Sensitivity to noise and outliers
          - Difficulty handling different sized clusters and convex shapes
          - Breaking large clusters
      • Major weaknesses of agglomerative clustering methods
          - They do not scale well: time complexity of at least O(n^2), where n is the total number of objects
          - They can never undo what was done previously