SlideShare a Scribd company logo
1 of 59
Download to read offline
Prof. Pier Luca Lanzi
Hierarchical Clustering
Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
Prof. Pier Luca Lanzi
2
Prof. Pier Luca Lanzi
3
Prof. Pier Luca Lanzi
4
Prof. Pier Luca Lanzi
5
Prof. Pier Luca Lanzi
6
Prof. Pier Luca Lanzi
7
Prof. Pier Luca Lanzi
8
Prof. Pier Luca Lanzi
9
Prof. Pier Luca Lanzi
What is Hierarchical Clustering?
• Suppose we have five items, a, b, c, d, and e.
• Initially, we consider one cluster for each item
• Then, at each step we merge together the most similar clusters,
until we generate one cluster
a
b
c
d
e
a,b
d,e
c,d,e
a,b,c,d,e
Step 0 Step 1 Step 2 Step 3 Step 4
10
Prof. Pier Luca Lanzi
What is Hierarchical Clustering?
• Alternatively, we start from one cluster containing the five
elements
• Then, at each step we split one cluster to improve intracluster
similarity, until all the elements are contained in one cluster
c
a
b
d
e
d,e
a,b,c,d,e
a,b
c,d,e
Step 4 Step 3 Step 2 Step 1 Step 0
Prof. Pier Luca Lanzi
What is Hierarchical Clustering?
• By far, it is the most common clustering technique
• Produces a hierarchy of nested clusters
• The hiearchy be visualized as a dendrogram: a tree like diagram
that records the sequences of merges or splits
a
b
c
d
e
a,b
d,e
c,d,e
a,b,c,d,e
12
Prof. Pier Luca Lanzi
What Approaches?
• Agglomerative
§ Start individual clusters, at each step, merge the closest pair of clusters
until only one cluster (or k clusters) left
• Divisive
§ Start with one cluster, at each step, split a cluster until each cluster
contains a point (or there are k clusters)
13
a
b
c
d
e
a,b
d,e
c,d,e
a,b,c,d,e
agglomerative
divisive
Prof. Pier Luca Lanzi
Strengths of Hierarchical Clustering
• No need to assume any particular number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the
dendrogram at the proper level
• They may correspond to meaningful taxonomies
• Example in biological sciences include animal kingdom, phylogeny
reconstruction, etc.
• Traditional hierarchical algorithms use a similarity
or distance matrix to merge or split one cluster at a time
14
Prof. Pier Luca Lanzi
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Compute the proximity matrix
• Let each data point be a cluster
• Repeat
§Merge the two closest clusters
§ Update the proximity matrix
• Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
• Different approaches to defining the distance between clusters
distinguish the different algorithms
15
Prof. Pier Luca Lanzi
Hierarchical Clustering:
Time and Space Requirements
• O(N2) space since it uses the proximity matrix.
§N is the number of points.
• O(N3) time in many cases
§There are N steps and at each step the size, N2, proximity
matrix must be updated and searched
§Complexity can be reduced to O(N2 log(N) )
time for some approaches
16
Prof. Pier Luca Lanzi
Efficient Implementation
• Compute the distance between all pairs of points [O(N2)]
• Insert the pairs and their distances into a priority queue to fine the min in one
step [O(N2)]
• When two clusters are merged, we remove all entries in the priority queue
involving one of these two clusters [O(Nlog N)]
• Compute all the distances between the new cluster and the re- maining
clusters [O(NlogN)]
• Since the last two steps are executed at most N time, the complexity of the
whole algorithms is O(N2logN)
17
Prof. Pier Luca Lanzi
Distance Between Clusters
Prof. Pier Luca Lanzi
Initial Configuration
• Start with clusters of individual points and the distance matrix
...
p1 p2 p3 p4 p9 p10 p11 p12
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5 . . .
.
.
. Distance Matrix
19
Prof. Pier Luca Lanzi
Intermediate Situation
• After some merging steps, we have some clusters
...
p1 p2 p3 p4 p9 p10 p11 p12
C1
C4
C2 C5
C3
C2C1
C1
C3
C5
C4
C2
C3 C4 C5
Distance Matrix
20
Prof. Pier Luca Lanzi
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
...
p1 p2 p3 p4 p9 p10 p11 p12
C1
C4
C2 C5
C3
C2C1
C1
C3
C5
C4
C2
C3 C4 C5
Distance Matrix
21
Prof. Pier Luca Lanzi
After Merging
• The question is “How do we update the proximity matrix?”
...
p1 p2 p3 p4 p9 p10 p11 p12
C1
C4
C2 U C5
C3
? ? ? ?
?
?
?
C2
U
C5C1
C1
C3
C4
C2 U C5
C3 C4
Distance Matrix
22
Prof. Pier Luca Lanzi
Similarity?
Prof. Pier Luca Lanzi
Single Linkage or MIN
Prof. Pier Luca Lanzi
Complete Linkage or MAX
Prof. Pier Luca Lanzi
Average or Group Average
Prof. Pier Luca Lanzi
Distance between Centroids
´ ´
Prof. Pier Luca Lanzi
Typical Alternatives to Calculate the
Distance Between Clusters
• Single link (or MIN)
§smallest distance between an element in one cluster and an
element in the other, i.e., d(Ci, Cj) = min(ti,p, tj,q)
• Complete link (or MAX)
§largest distance between an element in one cluster and
an element in the other, i.e., d(Ci, Cj) = max(ti,p, tj,q)
• Average (or group average)
§average distance between an element in one cluster and an
element in the other, i.e., d(Ci, Cj) = avg(d(ti,p, tj,q))
• Centroid
§distance between the centroids of two clusters, i.e.,
d(Ci, Cj) = d(μi, μj) where μi and μi are the centroids
• …
28
Prof. Pier Luca Lanzi
Example
• Suppose we have five items, a, b, c, d, and e.
• We wanto to perform hierarchical clustering on
five instances following an agglomerative approach
• First: we compute the distance or similarity matrix
• Dij is the distance between instancce “i” and “j”
÷
÷
÷
÷
÷
÷
ø
ö
ç
ç
ç
ç
ç
ç
è
æ
=
0003050809
000409010
000506
0002
00
.....
....
...
..
.
D
29
Prof. Pier Luca Lanzi
Example
• Group the two instances that are closer
• In this case, a and b are the closest items (D2,1=2)
• Compute again the distance matrix, and start again.
• Suppose we apply single-linkage (MIN), we need to compute the
distance between the new cluster {1,2} and the others
§d(12)3 = min[d13,d23] = d23 = 5.0
§d(12)4 = min[d14,d24] = d24 = 9.0
§d(12)5 = min[d15,d25] = d25 = 8.0
30
Prof. Pier Luca Lanzi
Example
• The new distance matrix is,
÷
÷
÷
÷
÷
ø
ö
ç
ç
ç
ç
ç
è
æ
=
0.00.30.50.8
0.00.40.9
0.00.5
0.0
D
31
• At the end, we obtain the
following dendrogram
Prof. Pier Luca Lanzi
Determining the Number of Clusters
32
Prof. Pier Luca Lanzi
hierarchical clustering generates
a set of N possible partitions
which one should I choose?
Prof. Pier Luca Lanzi
From the previous lecture we know ideally
a good cluster should partition points so that …
Data points in the same cluster should have
a small distance from one another
Data points in different clusters should be at
a large distance from one another.
Prof. Pier Luca Lanzi
Within/Between Clusters Sum of
Squares
• Within-cluster sum of squares
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
where μ is the centroid of the whole dataset
35
Prof. Pier Luca Lanzi
Within/Between Clusters Sum of
Squares (for distance function d)
• Within-cluster sum of squares
where μi is the centroid of cluster Ci (in case of Euclidean spaces)
• Between-cluster sum of squares
where μ is the centroid of the whole dataset
36
Prof. Pier Luca Lanzi
Evaluation of Hierarchical Clustering
using Knee/Elbow Analysis
plot the WSS and BSS for every clustering and look
for a knee in the plot that show a significant
modification in the evaluation metrics
Prof. Pier Luca Lanzi
Run the Python notebook
for hierarchical clustering
Prof. Pier Luca Lanzi
Example data generated using the make_blob function of Scikit-Learn
Prof. Pier Luca Lanzi
Dendrogram computed using single linkage.
Prof. Pier Luca Lanzi
BSS and WSS for values of k from 1 until 19.
Prof. Pier Luca Lanzi
Clusters produced for values of k from 2 to 7.
Prof. Pier Luca Lanzi
Clusters produced for values of k from 5 to 10.
Prof. Pier Luca Lanzi
How can we represent clusters?
Prof. Pier Luca Lanzi
Euclidean vs Non-Euclidean Spaces
• Euclidean Spaces
§ We can identify a cluster using for instance its centroid
(e.g. computed as the average among all its data points)
§ Alternatively, we can use its convex hull
• Non-Euclidean Spaces
§ We can define a distance (jaccard, cosine, edit)
§ We cannot compute a centroid and we can introduce the concept of
clustroid
• Clustroid
§ An existing data point that we take as a cluster representative
§ It can be the point that minimizes the sum of the distances to the other
points in the cluster
§ Or, the one minimizing the maximum distance to another point
§ Or, the sum of the squares of the distances to the other points in the
cluster
45
Prof. Pier Luca Lanzi
Examples using KNIME
Prof. Pier Luca Lanzi
Evaluation of the result from hierarchical clustering with
3 clusters and average linkage against existing labels
Prof. Pier Luca Lanzi
Comparison of hierarchical clustering with 3 clusters
and average linkage against k-Means with k=3
Prof. Pier Luca Lanzi
Computing cluster quality from one to 20 clusters
using the entropy scorer
Prof. Pier Luca Lanzi
Examples using R
Prof. Pier Luca Lanzi
Hierarchical Clustering in R
# init the seed to be able to repeat the experiment
set.seed(1234)
par(mar=c(0,0,0,0))
# randomly generates the data
x<-rnorm(12, mean=rep(1:3,each=4), sd=0.2)
y<-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2)
plot(x,y,pch=19,cex=2,col="blue")
# distance matrix
d <- data.frame(x,y)
dm <- dist(d)
# generate the
cl <- hclust(dm)
plot(cl)
# other ways to plot dendrograms
# http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html
51
Prof. Pier Luca Lanzi
Evaluation of Clustering in R
library(GMD)
###
### checking the quality of the previous cluster
###
# init two vectors that will contain the evaluation
# in terms of within and between sum of squares
plot_wss = rep(0,12)
plot_bss = rep(0,12)
# evaluate every clustering
for(i in 1:12)
{
clusters <- cutree(cl,i)
eval <- css(dm,clusters);
plot_wss[i] <- eval$totwss
plot_bss[i] <- eval$totbss
}
52
Prof. Pier Luca Lanzi
Evaluation of Clustering in R
# plot the results
x = 1:12
plot(x, y=plot_bss, main="Between Cluster Sum-of-square",
cex=2, pch=18, col="blue", xlab="Number of Clusters",
ylab="Evaluation")
lines(x, plot_bss, col="blue")
par(new=TRUE)
plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="")
lines(x,plot_wss, col="red");
53
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering 54
Prof. Pier Luca Lanzi
Hierarchical Clustering in R – Iris2D
library(foreign)
iris = read.arff("iris.2D.arff")
with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2))
dm <- dist(iris[,1:2])
cl <- hclust(iris_dist, method="single")
#clustering <- hclust(dist(iris[,1:2],method="manhattan"), method="single")
plot(cl)
cl_average <- hclust(iris_dist, method="average")
plot(clustering)
cutree(clustering,2)
55
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering for
iris2D
56
Prof. Pier Luca Lanzi
Knee/Elbow Analysis of Clustering for
iris
57
Prof. Pier Luca Lanzi
Summary
Prof. Pier Luca Lanzi
Hierarchical Clustering:
Problems and Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one
or more of the following:
§Sensitivity to noise and outliers
§Difficulty handling different sized clusters
and convex shapes
§Breaking large clusters
• Major weakness of agglomerative clustering methods
§They do not scale well: time complexity of at least O(n2),
where n is the number of total objects
§They can never undo what was done previously
59

More Related Content

What's hot

大規模ネットワークの性質と先端グラフアルゴリズム
大規模ネットワークの性質と先端グラフアルゴリズム大規模ネットワークの性質と先端グラフアルゴリズム
大規模ネットワークの性質と先端グラフアルゴリズム
Takuya Akiba
 

What's hot (20)

Regularization
RegularizationRegularization
Regularization
 
カークマンの女学生問題と有限幾何
カークマンの女学生問題と有限幾何カークマンの女学生問題と有限幾何
カークマンの女学生問題と有限幾何
 
画像認識における特徴表現 -SSII技術マップの再考-
画像認識における特徴表現 -SSII技術マップの再考-画像認識における特徴表現 -SSII技術マップの再考-
画像認識における特徴表現 -SSII技術マップの再考-
 
Graph Neural Network in practice
Graph Neural Network in practiceGraph Neural Network in practice
Graph Neural Network in practice
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
大規模ネットワーク分析 篠田
大規模ネットワーク分析 篠田大規模ネットワーク分析 篠田
大規模ネットワーク分析 篠田
 
Juliaで前処理
Juliaで前処理Juliaで前処理
Juliaで前処理
 
Wasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 IWasserstein GAN 수학 이해하기 I
Wasserstein GAN 수학 이해하기 I
 
20190706cvpr2019_3d_shape_representation
20190706cvpr2019_3d_shape_representation20190706cvpr2019_3d_shape_representation
20190706cvpr2019_3d_shape_representation
 
Pca ppt
Pca pptPca ppt
Pca ppt
 
Introduction to Genetic Algorithms
Introduction to Genetic AlgorithmsIntroduction to Genetic Algorithms
Introduction to Genetic Algorithms
 
Deep sets
Deep setsDeep sets
Deep sets
 
【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?【DL輪読会】Can Neural Network Memorization Be Localized?
【DL輪読会】Can Neural Network Memorization Be Localized?
 
プログラミングコンテストでのデータ構造 2 ~動的木編~
プログラミングコンテストでのデータ構造 2 ~動的木編~プログラミングコンテストでのデータ構造 2 ~動的木編~
プログラミングコンテストでのデータ構造 2 ~動的木編~
 
大規模ネットワークの性質と先端グラフアルゴリズム
大規模ネットワークの性質と先端グラフアルゴリズム大規模ネットワークの性質と先端グラフアルゴリズム
大規模ネットワークの性質と先端グラフアルゴリズム
 
ECCV2022 paper reading - MultiMAE: Multi-modal Multi-task Masked Autoencoders...
ECCV2022 paper reading - MultiMAE: Multi-modal Multi-task Masked Autoencoders...ECCV2022 paper reading - MultiMAE: Multi-modal Multi-task Masked Autoencoders...
ECCV2022 paper reading - MultiMAE: Multi-modal Multi-task Masked Autoencoders...
 
Clustering on database systems rkm
Clustering on database systems rkmClustering on database systems rkm
Clustering on database systems rkm
 
実務と論文で学ぶジョブレコメンデーション最前線2022
実務と論文で学ぶジョブレコメンデーション最前線2022実務と論文で学ぶジョブレコメンデーション最前線2022
実務と論文で学ぶジョブレコメンデーション最前線2022
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandas
 
乱択データ構造の最新事情 -MinHash と HyperLogLog の最近の進歩-
乱択データ構造の最新事情 -MinHash と HyperLogLog の最近の進歩-乱択データ構造の最新事情 -MinHash と HyperLogLog の最近の進歩-
乱択データ構造の最新事情 -MinHash と HyperLogLog の最近の進歩-
 

Similar to DMTM Lecture 12 Hierarchical clustering

Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
ImXaib
 

Similar to DMTM Lecture 12 Hierarchical clustering (20)

DMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical ClusteringDMTM 2015 - 07 Hierarchical Clustering
DMTM 2015 - 07 Hierarchical Clustering
 
DMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clusteringDMTM Lecture 13 Representative based clustering
DMTM Lecture 13 Representative based clustering
 
DMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based ClusteringDMTM 2015 - 08 Representative-Based Clustering
DMTM 2015 - 08 Representative-Based Clustering
 
DMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to ClusteringDMTM 2015 - 06 Introduction to Clustering
DMTM 2015 - 06 Introduction to Clustering
 
DMTM Lecture 11 Clustering
DMTM Lecture 11 ClusteringDMTM Lecture 11 Clustering
DMTM Lecture 11 Clustering
 
Data Mining Lecture_7.pptx
Data Mining Lecture_7.pptxData Mining Lecture_7.pptx
Data Mining Lecture_7.pptx
 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
 
DMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluationDMTM Lecture 15 Clustering evaluation
DMTM Lecture 15 Clustering evaluation
 
Clustering.pdf
Clustering.pdfClustering.pdf
Clustering.pdf
 
DMTM Lecture 04 Classification
DMTM Lecture 04 ClassificationDMTM Lecture 04 Classification
DMTM Lecture 04 Classification
 
DMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification EnsemblesDMTM 2015 - 15 Classification Ensembles
DMTM 2015 - 15 Classification Ensembles
 
DMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethodsDMTM Lecture 09 Other classificationmethods
DMTM Lecture 09 Other classificationmethods
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Slide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.pptSlide-TIF311-DM-10-11.ppt
Slide-TIF311-DM-10-11.ppt
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Csc446: Pattern Recognition
Csc446: Pattern Recognition Csc446: Pattern Recognition
Csc446: Pattern Recognition
 
DMTM Lecture 03 Regression
DMTM Lecture 03 RegressionDMTM Lecture 03 Regression
DMTM Lecture 03 Regression
 
DMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensemblesDMTM Lecture 10 Classification ensembles
DMTM Lecture 10 Classification ensembles
 
Lecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest NeighborsLecture 8: Decision Trees & k-Nearest Neighbors
Lecture 8: Decision Trees & k-Nearest Neighbors
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 

More from Pier Luca Lanzi

More from Pier Luca Lanzi (20)

11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi11 Settembre 2021 - Giocare con i Videogiochi
11 Settembre 2021 - Giocare con i Videogiochi
 
Breve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei VideogiochiBreve Viaggio al Centro dei Videogiochi
Breve Viaggio al Centro dei Videogiochi
 
Global Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning WelcomeGlobal Game Jam 19 @ POLIMI - Morning Welcome
Global Game Jam 19 @ POLIMI - Morning Welcome
 
Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018Data Driven Game Design @ Campus Party 2018
Data Driven Game Design @ Campus Party 2018
 
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
GGJ18 al Politecnico di Milano - Presentazione che precede la presentazione d...
 
GGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di aperturaGGJ18 al Politecnico di Milano - Presentazione di apertura
GGJ18 al Politecnico di Milano - Presentazione di apertura
 
Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018Presentation for UNITECH event - January 8, 2018
Presentation for UNITECH event - January 8, 2018
 
DMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparationDMTM Lecture 20 Data preparation
DMTM Lecture 20 Data preparation
 
DMTM Lecture 19 Data exploration
DMTM Lecture 19 Data explorationDMTM Lecture 19 Data exploration
DMTM Lecture 19 Data exploration
 
DMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph miningDMTM Lecture 18 Graph mining
DMTM Lecture 18 Graph mining
 
DMTM Lecture 17 Text mining
DMTM Lecture 17 Text miningDMTM Lecture 17 Text mining
DMTM Lecture 17 Text mining
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
DMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clusteringDMTM Lecture 14 Density based clustering
DMTM Lecture 14 Density based clustering
 
DMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rulesDMTM Lecture 08 Classification rules
DMTM Lecture 08 Classification rules
 
DMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision treesDMTM Lecture 07 Decision trees
DMTM Lecture 07 Decision trees
 
DMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluationDMTM Lecture 06 Classification evaluation
DMTM Lecture 06 Classification evaluation
 
DMTM Lecture 05 Data representation
DMTM Lecture 05 Data representationDMTM Lecture 05 Data representation
DMTM Lecture 05 Data representation
 
DMTM Lecture 01 Introduction
DMTM Lecture 01 IntroductionDMTM Lecture 01 Introduction
DMTM Lecture 01 Introduction
 
DMTM Lecture 02 Data mining
DMTM Lecture 02 Data miningDMTM Lecture 02 Data mining
DMTM Lecture 02 Data mining
 
VDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipelineVDP2016 - Lecture 16 Rendering pipeline
VDP2016 - Lecture 16 Rendering pipeline
 

Recently uploaded

會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
中 央社
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
CaitlinCummins3
 

Recently uploaded (20)

24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
24 ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH SỞ GIÁO DỤC HẢI DƯ...
 
Including Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdfIncluding Mental Health Support in Project Delivery, 14 May.pdf
Including Mental Health Support in Project Delivery, 14 May.pdf
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17
 
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community PartnershipsSpring gala 2024 photo slideshow - Celebrating School-Community Partnerships
Spring gala 2024 photo slideshow - Celebrating School-Community Partnerships
 
“O BEIJO” EM ARTE .
“O BEIJO” EM ARTE                       .“O BEIJO” EM ARTE                       .
“O BEIJO” EM ARTE .
 
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽會考英聽
 
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
Basic Civil Engineering notes on Transportation Engineering, Modes of Transpo...
 
diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....diagnosting testing bsc 2nd sem.pptx....
diagnosting testing bsc 2nd sem.pptx....
 
demyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptxdemyelinated disorder: multiple sclerosis.pptx
demyelinated disorder: multiple sclerosis.pptx
 
IPL Online Quiz by Pragya; Question Set.
IPL Online Quiz by Pragya; Question Set.IPL Online Quiz by Pragya; Question Set.
IPL Online Quiz by Pragya; Question Set.
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...When Quality Assurance Meets Innovation in Higher Education - Report launch w...
When Quality Assurance Meets Innovation in Higher Education - Report launch w...
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio App
 
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
ĐỀ THAM KHẢO KÌ THI TUYỂN SINH VÀO LỚP 10 MÔN TIẾNG ANH FORM 50 CÂU TRẮC NGHI...
 
II BIOSENSOR PRINCIPLE APPLICATIONS AND WORKING II
II BIOSENSOR PRINCIPLE APPLICATIONS AND WORKING IIII BIOSENSOR PRINCIPLE APPLICATIONS AND WORKING II
II BIOSENSOR PRINCIPLE APPLICATIONS AND WORKING II
 
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinhĐề tieng anh thpt 2024 danh cho cac ban hoc sinh
Đề tieng anh thpt 2024 danh cho cac ban hoc sinh
 
ANTI PARKISON DRUGS.pptx
ANTI         PARKISON          DRUGS.pptxANTI         PARKISON          DRUGS.pptx
ANTI PARKISON DRUGS.pptx
 
How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17How to Analyse Profit of a Sales Order in Odoo 17
How to Analyse Profit of a Sales Order in Odoo 17
 
Championnat de France de Tennis de table/
Championnat de France de Tennis de table/Championnat de France de Tennis de table/
Championnat de France de Tennis de table/
 
The Liver & Gallbladder (Anatomy & Physiology).pptx
The Liver &  Gallbladder (Anatomy & Physiology).pptxThe Liver &  Gallbladder (Anatomy & Physiology).pptx
The Liver & Gallbladder (Anatomy & Physiology).pptx
 

DMTM Lecture 12 Hierarchical clustering

  • 1. Prof. Pier Luca Lanzi Hierarchical Clustering Data Mining andText Mining (UIC 583 @ Politecnico di Milano)
  • 2. Prof. Pier Luca Lanzi 2
  • 3. Prof. Pier Luca Lanzi 3
  • 4. Prof. Pier Luca Lanzi 4
  • 5. Prof. Pier Luca Lanzi 5
  • 6. Prof. Pier Luca Lanzi 6
  • 7. Prof. Pier Luca Lanzi 7
  • 8. Prof. Pier Luca Lanzi 8
  • 9. Prof. Pier Luca Lanzi 9
  • 10. Prof. Pier Luca Lanzi What is Hierarchical Clustering? • Suppose we have five items, a, b, c, d, and e. • Initially, we consider one cluster for each item • Then, at each step we merge together the most similar clusters, until we generate one cluster a b c d e a,b d,e c,d,e a,b,c,d,e Step 0 Step 1 Step 2 Step 3 Step 4 10
  • 11. Prof. Pier Luca Lanzi What is Hierarchical Clustering? • Alternatively, we start from one cluster containing the five elements • Then, at each step we split one cluster to improve intracluster similarity, until all the elements are contained in one cluster c a b d e d,e a,b,c,d,e a,b c,d,e Step 4 Step 3 Step 2 Step 1 Step 0
  • 12. Prof. Pier Luca Lanzi What is Hierarchical Clustering? • By far, it is the most common clustering technique • Produces a hierarchy of nested clusters • The hiearchy be visualized as a dendrogram: a tree like diagram that records the sequences of merges or splits a b c d e a,b d,e c,d,e a,b,c,d,e 12
  • 13. Prof. Pier Luca Lanzi What Approaches? • Agglomerative § Start individual clusters, at each step, merge the closest pair of clusters until only one cluster (or k clusters) left • Divisive § Start with one cluster, at each step, split a cluster until each cluster contains a point (or there are k clusters) 13 a b c d e a,b d,e c,d,e a,b,c,d,e agglomerative divisive
  • 14. Prof. Pier Luca Lanzi Strengths of Hierarchical Clustering • No need to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Example in biological sciences include animal kingdom, phylogeny reconstruction, etc. • Traditional hierarchical algorithms use a similarity or distance matrix to merge or split one cluster at a time 14
  • 15. Prof. Pier Luca Lanzi Agglomerative Clustering Algorithm • More popular hierarchical clustering technique • Compute the proximity matrix • Let each data point be a cluster • Repeat §Merge the two closest clusters § Update the proximity matrix • Until only a single cluster remains • Key operation is the computation of the proximity of two clusters • Different approaches to defining the distance between clusters distinguish the different algorithms 15
  • 16. Prof. Pier Luca Lanzi Hierarchical Clustering: Time and Space Requirements • O(N2) space since it uses the proximity matrix. §N is the number of points. • O(N3) time in many cases §There are N steps and at each step the size, N2, proximity matrix must be updated and searched §Complexity can be reduced to O(N2 log(N) ) time for some approaches 16
  • 17. Prof. Pier Luca Lanzi Efficient Implementation • Compute the distance between all pairs of points [O(N2)] • Insert the pairs and their distances into a priority queue to fine the min in one step [O(N2)] • When two clusters are merged, we remove all entries in the priority queue involving one of these two clusters [O(Nlog N)] • Compute all the distances between the new cluster and the re- maining clusters [O(NlogN)] • Since the last two steps are executed at most N time, the complexity of the whole algorithms is O(N2logN) 17
  • 18. Prof. Pier Luca Lanzi Distance Between Clusters
  • 19. Prof. Pier Luca Lanzi Initial Configuration • Start with clusters of individual points and the distance matrix ... p1 p2 p3 p4 p9 p10 p11 p12 p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . . . Distance Matrix 19
  • 20. Prof. Pier Luca Lanzi Intermediate Situation • After some merging steps, we have some clusters ... p1 p2 p3 p4 p9 p10 p11 p12 C1 C4 C2 C5 C3 C2C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix 20
  • 21. Prof. Pier Luca Lanzi Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. ... p1 p2 p3 p4 p9 p10 p11 p12 C1 C4 C2 C5 C3 C2C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix 21
  • 22. Prof. Pier Luca Lanzi After Merging • The question is “How do we update the proximity matrix?” ... p1 p2 p3 p4 p9 p10 p11 p12 C1 C4 C2 U C5 C3 ? ? ? ? ? ? ? C2 U C5C1 C1 C3 C4 C2 U C5 C3 C4 Distance Matrix 22
  • 23. Prof. Pier Luca Lanzi Similarity?
  • 24. Prof. Pier Luca Lanzi Single Linkage or MIN
  • 25. Prof. Pier Luca Lanzi Complete Linkage or MAX
  • 26. Prof. Pier Luca Lanzi Average or Group Average
  • 27. Prof. Pier Luca Lanzi Distance between Centroids ´ ´
  • 28. Prof. Pier Luca Lanzi Typical Alternatives to Calculate the Distance Between Clusters • Single link (or MIN) §smallest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = min(ti,p, tj,q) • Complete link (or MAX) §largest distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = max(ti,p, tj,q) • Average (or group average) §average distance between an element in one cluster and an element in the other, i.e., d(Ci, Cj) = avg(d(ti,p, tj,q)) • Centroid §distance between the centroids of two clusters, i.e., d(Ci, Cj) = d(μi, μj) where μi and μi are the centroids • … 28
  • 29. Prof. Pier Luca Lanzi Example • Suppose we have five items, a, b, c, d, and e. • We wanto to perform hierarchical clustering on five instances following an agglomerative approach • First: we compute the distance or similarity matrix • Dij is the distance between instancce “i” and “j” ÷ ÷ ÷ ÷ ÷ ÷ ø ö ç ç ç ç ç ç è æ = 0003050809 000409010 000506 0002 00 ..... .... ... .. . D 29
  • 30. Prof. Pier Luca Lanzi Example • Group the two instances that are closer • In this case, a and b are the closest items (D2,1=2) • Compute again the distance matrix, and start again. • Suppose we apply single-linkage (MIN), we need to compute the distance between the new cluster {1,2} and the others §d(12)3 = min[d13,d23] = d23 = 5.0 §d(12)4 = min[d14,d24] = d24 = 9.0 §d(12)5 = min[d15,d25] = d25 = 8.0 30
  • 31. Prof. Pier Luca Lanzi Example • The new distance matrix is, ÷ ÷ ÷ ÷ ÷ ø ö ç ç ç ç ç è æ = 0.00.30.50.8 0.00.40.9 0.00.5 0.0 D 31 • At the end, we obtain the following dendrogram
  • 32. Prof. Pier Luca Lanzi Determining the Number of Clusters 32
  • 33. Prof. Pier Luca Lanzi hierarchical clustering generates a set of N possible partitions which one should I choose?
  • 34. Prof. Pier Luca Lanzi From the previous lecture we know ideally a good cluster should partition points so that … Data points in the same cluster should have a small distance from one another Data points in different clusters should be at a large distance from one another.
  • 35. Prof. Pier Luca Lanzi Within/Between Clusters Sum of Squares • Within-cluster sum of squares where μi is the centroid of cluster Ci (in case of Euclidean spaces) • Between-cluster sum of squares where μ is the centroid of the whole dataset 35
  • 36. Prof. Pier Luca Lanzi Within/Between Clusters Sum of Squares (for distance function d) • Within-cluster sum of squares where μi is the centroid of cluster Ci (in case of Euclidean spaces) • Between-cluster sum of squares where μ is the centroid of the whole dataset 36
  • 37. Prof. Pier Luca Lanzi Evaluation of Hierarchical Clustering using Knee/Elbow Analysis plot the WSS and BSS for every clustering and look for a knee in the plot that show a significant modification in the evaluation metrics
  • 38. Prof. Pier Luca Lanzi Run the Python notebook for hierarchical clustering
  • 39. Prof. Pier Luca Lanzi Example data generated using the make_blob function of Scikit-Learn
  • 40. Prof. Pier Luca Lanzi Dendrogram computed using single linkage.
  • 41. Prof. Pier Luca Lanzi BSS and WSS for values of k from 1 until 19.
  • 42. Prof. Pier Luca Lanzi Clusters produced for values of k from 2 to 7.
  • 43. Prof. Pier Luca Lanzi Clusters produced for values of k from 5 to 10.
  • 44. Prof. Pier Luca Lanzi How can we represent clusters?
  • 45. Prof. Pier Luca Lanzi Euclidean vs Non-Euclidean Spaces • Euclidean Spaces § We can identify a cluster using for instance its centroid (e.g. computed as the average among all its data points) § Alternatively, we can use its convex hull • Non-Euclidean Spaces § We can define a distance (jaccard, cosine, edit) § We cannot compute a centroid and we can introduce the concept of clustroid • Clustroid § An existing data point that we take as a cluster representative § It can be the point that minimizes the sum of the distances to the other points in the cluster § Or, the one minimizing the maximum distance to another point § Or, the sum of the squares of the distances to the other points in the cluster 45
  • 46. Prof. Pier Luca Lanzi Examples using KNIME
  • 47. Prof. Pier Luca Lanzi Evaluation of the result from hierarchical clustering with 3 clusters and average linkage against existing labels
  • 48. Prof. Pier Luca Lanzi Comparison of hierarchical clustering with 3 clusters and average linkage against k-Means with k=3
  • 49. Prof. Pier Luca Lanzi Computing cluster quality from one to 20 clusters using the entropy scorer
  • 50. Prof. Pier Luca Lanzi Examples using R
  • 51. Prof. Pier Luca Lanzi Hierarchical Clustering in R # init the seed to be able to repeat the experiment set.seed(1234) par(mar=c(0,0,0,0)) # randomly generates the data x<-rnorm(12, mean=rep(1:3,each=4), sd=0.2) y<-rnorm(12, mean=rep(c(1,2,1),each=4), sd=0.2) plot(x,y,pch=19,cex=2,col="blue") # distance matrix d <- data.frame(x,y) dm <- dist(d) # generate the cl <- hclust(dm) plot(cl) # other ways to plot dendrograms # http://rstudio-pubs-static.s3.amazonaws.com/1876_df0bf890dd54461f98719b461d987c3d.html 51
  • 52. Prof. Pier Luca Lanzi Evaluation of Clustering in R library(GMD) ### ### checking the quality of the previous cluster ### # init two vectors that will contain the evaluation # in terms of within and between sum of squares plot_wss = rep(0,12) plot_bss = rep(0,12) # evaluate every clustering for(i in 1:12) { clusters <- cutree(cl,i) eval <- css(dm,clusters); plot_wss[i] <- eval$totwss plot_bss[i] <- eval$totbss } 52
  • 53. Prof. Pier Luca Lanzi Evaluation of Clustering in R # plot the results x = 1:12 plot(x, y=plot_bss, main="Between Cluster Sum-of-square", cex=2, pch=18, col="blue", xlab="Number of Clusters", ylab="Evaluation") lines(x, plot_bss, col="blue") par(new=TRUE) plot(x, y=plot_wss, cex=2, pch=19, col="red", ylab="", xlab="") lines(x,plot_wss, col="red"); 53
  • 54. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering 54
  • 55. Prof. Pier Luca Lanzi Hierarchical Clustering in R – Iris2D library(foreign) iris = read.arff("iris.2D.arff") with(iris, plot(petallength,petalwidth, col="blue", pch=19, cex=2)) dm <- dist(iris[,1:2]) cl <- hclust(iris_dist, method="single") #clustering <- hclust(dist(iris[,1:2],method="manhattan"), method="single") plot(cl) cl_average <- hclust(iris_dist, method="average") plot(clustering) cutree(clustering,2) 55
  • 56. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering for iris2D 56
  • 57. Prof. Pier Luca Lanzi Knee/Elbow Analysis of Clustering for iris 57
  • 58. Prof. Pier Luca Lanzi Summary
  • 59. Prof. Pier Luca Lanzi Hierarchical Clustering: Problems and Limitations • Once a decision is made to combine two clusters, it cannot be undone • No objective function is directly minimized • Different schemes have problems with one or more of the following: §Sensitivity to noise and outliers §Difficulty handling different sized clusters and convex shapes §Breaking large clusters • Major weakness of agglomerative clustering methods §They do not scale well: time complexity of at least O(n2), where n is the number of total objects §They can never undo what was done previously 59