SlideShare a Scribd company logo
1 of 31
Download to read offline
Clustering
Lecture 3: Hierarchical Methods
Jing Gao
SUNY Buffalo
1
Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering,
etc.
2
3
Hierarchical Clustering
• Agglomerative approach
b
d
c
e
a
a b
d e
c d e
a b c d e
Step 0 Step 1 Step 2 Step 3 Step 4 bottom-up
Initialization:
Each object is a cluster
Iteration:
Merge two clusters which are
most similar to each other;
Until all objects are merged
into a single cluster
4
Hierarchical Clustering
• Divisive Approaches
b
d
c
e
a a b
d e
c d e
a b c d e
Step 4 Step 3 Step 2 Step 1 Step 0 Top-down
Initialization:
All objects stay in one cluster
Iteration:
Select a cluster and split it into
two sub clusters
Until each leaf cluster contains
only one object
5
Dendrogram
• A tree that shows how clusters are merged/split
hierarchically
• Each node on the tree is a cluster; each leaf node is a
singleton cluster
6
Dendrogram
• A clustering of the data objects is obtained by cutting
the dendrogram at the desired level, then each
connected component forms a cluster
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Compute the distance matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the distance matrix
6. Until only a single cluster remains
• Key operation is the computation of the distance between
two clusters
– Different approaches to defining the distance between clusters
distinguish the different algorithms
7
Starting Situation
• Start with clusters of individual points and a distance matrix
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5 . . .
.
.
. Distance Matrix
...
p1 p2 p3 p4 p9 p10 p11 p12
8
Intermediate Situation
• After some merging steps, we have some clusters
• Choose two clusters that has the smallest
distance (largest similarity) to merge
C1
C4
C2 C5
C3
C2
C1
C1
C3
C5
C4
C2
C3 C4 C5
Distance Matrix
...
p1 p2 p3 p4 p9 p10 p11 p12
9
Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update
the distance matrix.
C1
C4
C2 C5
C3
C2
C1
C1
C3
C5
C4
C2
C3 C4 C5
Distance Matrix
...
p1 p2 p3 p4 p9 p10 p11 p12
10
After Merging
• The question is “How do we update the distance matrix?”
C1
C4
C2 U C5
C3
? ? ? ?
?
?
?
C2
U
C5
C1
C1
C3
C4
C2 U C5
C3 C4
Distance Matrix
...
p1 p2 p3 p4 p9 p10 p11 p12
11
How to Define Inter-Cluster Distance
p1
p3
p5
p4
p2
p1 p2 p3 p4 p5 . . .
.
.
.
Distance?
 MIN
 MAX
 Group Average
 Distance Between Centroids
 ……
Distance Matrix
12
MIN or Single Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the closest pair of data objects belonging to
different clusters.
– Determined by one pair of points, i.e., by one link in the
proximity graph
)
,
(
min
)
,
(
,
min q
p
d
C
C
d
j
i C
q
C
p
j
i



13
MIN
Nested Clusters Dendrogram
1
2
3
4
5
6
1
2
3
4
5
3 6 2 5 4 1
0
0.05
0.1
0.15
0.2
14
Strength of MIN
Original Points Two Clusters
• Can handle non-elliptical shapes
15
Limitations of MIN
Original Points Two Clusters
• Sensitive to noise and outliers
16
MAX or Complete Link
• Inter-cluster distance
– The distance between two clusters is represented by the
distance of the farthest pair of data objects belonging to
different clusters
)
,
(
max
)
,
(
,
min q
p
d
C
C
d
j
i C
q
C
p
j
i



17
MAX
Nested Clusters Dendrogram
3 6 4 1 2 5
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
1
2
3
4
5
6
1
2 5
3
4
18
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise and outliers
19
Limitations of MAX
Original Points
•Tends to break large clusters
20
21
MIN (2 clusters) MAX (2 clusters)
Limitations of MAX
•Biased towards globular clusters
Group Average or Average Link
• Inter-cluster distance
– The distance between two clusters is represented by the
average distance of all pairs of data objects belonging to
different clusters
– Determined by all pairs of points in the two clusters
)
,
(
)
,
(
,
min q
p
d
avg
C
C
d
j
i C
q
C
p
j
i



22
Group Average
Nested Clusters Dendrogram
3 6 4 1 2 5
0
0.05
0.1
0.15
0.2
0.25
1
2
3
4
5
6
1
2
5
3
4
23
Group Average
• Compromise between Single and Complete
Link
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
24
Centroid Distance
• Inter-cluster distance
– The distance between two clusters is represented by the
distance between the centers of the clusters
– Determined by cluster centroids
 
)
,
(
)
,
( j
i
j
i
mean m
m
d
C
C
d 
25
Ward’s Method
• Similarity of two clusters is based on the increase
in squared error when two clusters are merged
– Similar to group average if distance between points is
distance squared
• Less susceptible to noise and outliers
• Biased towards globular clusters
• Hierarchical analogue of K-means
– Can be used to initialize K-means
26
Comparison
Group Average
Ward’s Method
1
2
3
4
5
6
1
2
5
3
4
MIN MAX
1
2
3
4
5
6
1
2
5
3
4
1
2
3
4
5
6
1
2 5
3
4
1
2
3
4
5
6
1
2
3
4
5
27
Time and Space Requirements
• O(N2) space since it uses the distance matrix
– N is the number of points
• O(N3) time in many cases
– There are N steps and at each step the size, N2,
distance matrix must be updated and searched
– Complexity can be reduced to O(N2 log(N) ) time
for some approaches
28
Strengths
• Do not have to assume any particular number
of clusters
– Any desired number of clusters can be obtained
by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful
taxonomies
– e.g., shopping websites—electronics (computer,
camera, ..), furniture, groceries
29
Problems and Limitations
• Once a decision is made to combine two clusters,
it cannot be undone
• No objective function is directly minimized
• Different schemes have problems with one or
more of the following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and
irregular shapes
– Breaking large clusters
30
Take-away Message
• Agglomerative and divisive hierarchical clustering
• Several ways of defining inter-cluster distance
• The properties of clusters outputted by different
approaches based on different inter-cluster distance
definition
• Pros and cons of hierarchical clustering
31

More Related Content

Similar to clustering_hierarchical ckustering notes.pdf

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptxNTUConcepts1
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Maninda Edirisooriya
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptxNANDHINIS900805
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
 
Network sampling, community detection
Network sampling, community detectionNetwork sampling, community detection
Network sampling, community detectionroberval mariano
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Yan Xu
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptxRvishnupriya2
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017Iwan Sofana
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012Ted Dunning
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
Unsupervised learning Modi.pptx
Unsupervised learning Modi.pptxUnsupervised learning Modi.pptx
Unsupervised learning Modi.pptxssusere1fd42
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 

Similar to clustering_hierarchical ckustering notes.pdf (20)

machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
Hierarchical clustering.pptx
Hierarchical clustering.pptxHierarchical clustering.pptx
Hierarchical clustering.pptx
 
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
Lecture 11 - KNN and Clustering, a lecture in subject module Statistical & Ma...
 
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
3b318431-df9f-4a2c-9909-61ecb6af8444.pptx
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
 
Network sampling, community detection
Network sampling, community detectionNetwork sampling, community detection
Network sampling, community detection
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptx
 
Training machine learning k means 2017
Training machine learning k means 2017Training machine learning k means 2017
Training machine learning k means 2017
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Clustering
ClusteringClustering
Clustering
 
Unsupervised learning Modi.pptx
Unsupervised learning Modi.pptxUnsupervised learning Modi.pptx
Unsupervised learning Modi.pptx
 
Clustering on DSS
Clustering on DSSClustering on DSS
Clustering on DSS
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 

Recently uploaded

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 

Recently uploaded (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 

clustering_hierarchical ckustering notes.pdf

  • 1. Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1
  • 2. Outline • Basics – Motivation, definition, evaluation • Methods – Partitional – Hierarchical – Density-based – Mixture model – Spectral methods • Advanced topics – Clustering ensemble – Clustering in MapReduce – Semi-supervised clustering, subspace clustering, co-clustering, etc. 2
  • 3. 3 Hierarchical Clustering • Agglomerative approach b d c e a a b d e c d e a b c d e Step 0 Step 1 Step 2 Step 3 Step 4 bottom-up Initialization: Each object is a cluster Iteration: Merge two clusters which are most similar to each other; Until all objects are merged into a single cluster
  • 4. 4 Hierarchical Clustering • Divisive Approaches b d c e a a b d e c d e a b c d e Step 4 Step 3 Step 2 Step 1 Step 0 Top-down Initialization: All objects stay in one cluster Iteration: Select a cluster and split it into two sub clusters Until each leaf cluster contains only one object
  • 5. 5 Dendrogram • A tree that shows how clusters are merged/split hierarchically • Each node on the tree is a cluster; each leaf node is a singleton cluster
  • 6. 6 Dendrogram • A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster
  • 7. Agglomerative Clustering Algorithm • More popular hierarchical clustering technique • Basic algorithm is straightforward 1. Compute the distance matrix 2. Let each data point be a cluster 3. Repeat 4. Merge the two closest clusters 5. Update the distance matrix 6. Until only a single cluster remains • Key operation is the computation of the distance between two clusters – Different approaches to defining the distance between clusters distinguish the different algorithms 7
  • 8. Starting Situation • Start with clusters of individual points and a distance matrix p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . . . Distance Matrix ... p1 p2 p3 p4 p9 p10 p11 p12 8
  • 9. Intermediate Situation • After some merging steps, we have some clusters • Choose two clusters that has the smallest distance (largest similarity) to merge C1 C4 C2 C5 C3 C2 C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix ... p1 p2 p3 p4 p9 p10 p11 p12 9
  • 10. Intermediate Situation • We want to merge the two closest clusters (C2 and C5) and update the distance matrix. C1 C4 C2 C5 C3 C2 C1 C1 C3 C5 C4 C2 C3 C4 C5 Distance Matrix ... p1 p2 p3 p4 p9 p10 p11 p12 10
  • 11. After Merging • The question is “How do we update the distance matrix?” C1 C4 C2 U C5 C3 ? ? ? ? ? ? ? C2 U C5 C1 C1 C3 C4 C2 U C5 C3 C4 Distance Matrix ... p1 p2 p3 p4 p9 p10 p11 p12 11
  • 12. How to Define Inter-Cluster Distance p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . . . Distance?  MIN  MAX  Group Average  Distance Between Centroids  …… Distance Matrix 12
  • 13. MIN or Single Link • Inter-cluster distance – The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters. – Determined by one pair of points, i.e., by one link in the proximity graph ) , ( min ) , ( , min q p d C C d j i C q C p j i    13
  • 14. MIN Nested Clusters Dendrogram 1 2 3 4 5 6 1 2 3 4 5 3 6 2 5 4 1 0 0.05 0.1 0.15 0.2 14
  • 15. Strength of MIN Original Points Two Clusters • Can handle non-elliptical shapes 15
  • 16. Limitations of MIN Original Points Two Clusters • Sensitive to noise and outliers 16
  • 17. MAX or Complete Link • Inter-cluster distance – The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters ) , ( max ) , ( , min q p d C C d j i C q C p j i    17
  • 18. MAX Nested Clusters Dendrogram 3 6 4 1 2 5 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 1 2 3 4 5 6 1 2 5 3 4 18
  • 19. Strength of MAX Original Points Two Clusters • Less susceptible to noise and outliers 19
  • 20. Limitations of MAX Original Points •Tends to break large clusters 20
  • 21. 21 MIN (2 clusters) MAX (2 clusters) Limitations of MAX •Biased towards globular clusters
  • 22. Group Average or Average Link • Inter-cluster distance – The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters – Determined by all pairs of points in the two clusters ) , ( ) , ( , min q p d avg C C d j i C q C p j i    22
  • 23. Group Average Nested Clusters Dendrogram 3 6 4 1 2 5 0 0.05 0.1 0.15 0.2 0.25 1 2 3 4 5 6 1 2 5 3 4 23
  • 24. Group Average • Compromise between Single and Complete Link • Strengths – Less susceptible to noise and outliers • Limitations – Biased towards globular clusters 24
  • 25. Centroid Distance • Inter-cluster distance – The distance between two clusters is represented by the distance between the centers of the clusters – Determined by cluster centroids   ) , ( ) , ( j i j i mean m m d C C d  25
  • 26. Ward’s Method • Similarity of two clusters is based on the increase in squared error when two clusters are merged – Similar to group average if distance between points is distance squared • Less susceptible to noise and outliers • Biased towards globular clusters • Hierarchical analogue of K-means – Can be used to initialize K-means 26
  • 27. Comparison Group Average Ward’s Method 1 2 3 4 5 6 1 2 5 3 4 MIN MAX 1 2 3 4 5 6 1 2 5 3 4 1 2 3 4 5 6 1 2 5 3 4 1 2 3 4 5 6 1 2 3 4 5 27
  • 28. Time and Space Requirements • O(N2) space since it uses the distance matrix – N is the number of points • O(N3) time in many cases – There are N steps and at each step the size, N2, distance matrix must be updated and searched – Complexity can be reduced to O(N2 log(N) ) time for some approaches 28
  • 29. Strengths • Do not have to assume any particular number of clusters – Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies – e.g., shopping websites—electronics (computer, camera, ..), furniture, groceries 29
  • 30. Problems and Limitations • Once a decision is made to combine two clusters, it cannot be undone • No objective function is directly minimized • Different schemes have problems with one or more of the following: – Sensitivity to noise and outliers – Difficulty handling different sized clusters and irregular shapes – Breaking large clusters 30
  • 31. Take-away Message • Agglomerative and divisive hierarchical clustering • Several ways of defining inter-cluster distance • The properties of clusters outputted by different approaches based on different inter-cluster distance definition • Pros and cons of hierarchical clustering 31