Hierarchical clustering methods group data points into a hierarchy of clusters based on their distance or similarity. There are two main approaches: agglomerative, which starts with each point as a separate cluster and merges them, and divisive, which starts with all points in one cluster and splits them. AGNES and DIANA are representative agglomerative and divisive algorithms, respectively. Hierarchical clustering represents the hierarchy as a dendrogram (a tree structure) and allows exploring the data at different cluster granularities.
Hierarchical Clustering
Uses the distance matrix as the clustering criterion
Does not require the number of clusters k as an input, but needs a termination condition
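For orientation, a minimal sketch of this idea using SciPy's hierarchical-clustering utilities; the random data and the 0.3 cut threshold are illustrative assumptions, not from the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)          # 20 illustrative points in 2-D
D = pdist(X)                       # condensed pairwise-distance matrix
Z = linkage(D, method='single')    # agglomerative merges, single-link

# No k needed up front: terminate by cutting the dendrogram where the
# merge distance exceeds a chosen threshold (0.3 here is arbitrary).
labels = fcluster(Z, t=0.3, criterion='distance')
```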
Figure: AGNES (agglomerative) merges {a}, {b}, {c}, {d}, {e} into {a, b} and {d, e}, then {c, d, e}, then {a, b, c, d, e} over Steps 0 to 4; DIANA (divisive) performs the same sequence in reverse, from Step 4 back to Step 0.
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages
Uses the single-link method and the dissimilarity matrix
Merges the clusters with the least dissimilarity
Continues in a non-descending fashion (merge distances never decrease)
Eventually all nodes belong to the same cluster
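A naive from-scratch sketch of AGNES with single-link merging, for illustration only (O(n³); the full distance matrix D is assumed precomputed):

```python
import numpy as np

def agnes_single_link(D, n_clusters=1):
    """Naive AGNES: repeatedly merge the two clusters whose closest
    members are least dissimilar (single link). D: full distance matrix."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > n_clusters:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance between the closest pair of members
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters.pop(b))   # merge cluster b into a
    return clusters
```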
Figure: three scatter plots (both axes 0 to 10) showing AGNES progressively merging nearby points into larger clusters.
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Works in the inverse order of AGNES: starts with one all-inclusive cluster and splits
Eventually each object forms a cluster on its own
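A sketch of one DIANA split step under the usual "splinter group" heuristic (seed the new cluster with the object of maximal average dissimilarity, then move over objects that are closer to the splinter group); D is an assumed precomputed dissimilarity matrix:

```python
import numpy as np

def diana_split(D, cluster):
    """One DIANA split: peel a 'splinter group' off `cluster`
    (a list of point indices) using the dissimilarity matrix D."""
    rest = list(cluster)
    # Seed with the object of maximal average dissimilarity to the rest.
    avg = [np.mean([D[i][j] for j in rest if j != i]) for i in rest]
    splinter = [rest.pop(int(np.argmax(avg)))]
    moved = True
    while moved and len(rest) > 1:
        moved = False
        for i in list(rest):
            if len(rest) == 1:
                break
            d_rest = np.mean([D[i][j] for j in rest if j != i])
            d_spl = np.mean([D[i][j] for j in splinter])
            if d_spl < d_rest:          # i is closer to the splinter group
                rest.remove(i)
                splinter.append(i)
                moved = True
    return splinter, rest
```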
Figure: three scatter plots (both axes 0 to 10) showing DIANA progressively splitting one cluster into smaller ones.
Distance Measures
Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
Nearest-neighbor algorithm
Single-linkage algorithm: terminates when distance > threshold
Agglomerative hierarchical clustering with $d_{\min}$ corresponds to building a minimal spanning tree
Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
Farthest-neighbor clustering algorithm
Complete-linkage: terminates when distance > threshold
Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$, where $m_i$ is the mean of cluster $C_i$
Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
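The four measures translate directly into code; a minimal NumPy sketch, assuming Ci and Cj are arrays of points:

```python
import numpy as np

def dmin(Ci, Cj):    # minimum (single-link) distance
    return min(np.linalg.norm(p - q) for p in Ci for q in Cj)

def dmax(Ci, Cj):    # maximum (complete-link) distance
    return max(np.linalg.norm(p - q) for p in Ci for q in Cj)

def dmean(Ci, Cj):   # distance between the cluster means m_i and m_j
    return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))

def davg(Ci, Cj):    # average over all pairwise distances
    total = sum(np.linalg.norm(p - q) for p in Ci for q in Cj)
    return total / (len(Ci) * len(Cj))
```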
Difficulties Faced in Hierarchical Clustering
Selection of merge/split points is critical
Merge/split operations cannot be undone (no backtracking to correct earlier decisions)
Poor scalability: at least O(n²) time for n objects
Recent Hierarchical Clustering Methods
Integration of hierarchical clustering with other techniques:
BIRCH: uses tree structures and incrementally adjusts the quality of sub-clusters
CURE: represents a cluster by a fixed number of representative objects
ROCK: clusters categorical data by neighbor and link analysis
CHAMELEON: hierarchical clustering using dynamic modeling
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies
Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records
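A usage sketch with scikit-learn's Birch implementation (parameter values here are illustrative; note that scikit-learn's threshold bounds the sub-cluster radius rather than the diameter):

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(1000, 2)          # illustrative data
# branching_factor: max children per CF-tree node
# threshold: bound on sub-cluster size at the leaves
model = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)        # a single pass builds the CF tree
```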
CF-Tree in BIRCH
Clustering Feature (CF): a summary of the statistics for a given sub-cluster
Registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
A nonleaf node in the tree has descendants or "children"; it stores the sums of the CFs of its children
A CF tree has two parameters:
Branching factor: specifies the maximum number of children per node
Threshold: the maximum diameter of sub-clusters stored at the leaf nodes
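A minimal sketch of the CF triple, commonly written as CF = (N, LS, SS) (the count, linear sum, and sum of squares of the points), together with the additivity property that lets nonleaf nodes store the sums of their children's CFs; the helper names are my own:

```python
import numpy as np

def cf(points):
    """CF = (N, LS, SS) for a set of points."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum()

def merge(cf1, cf2):
    # CF additivity: a parent's CF is the component-wise sum of its children's.
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid(cf_):
    n, ls, _ = cf_
    return ls / n

def radius(cf_):
    # R = sqrt(SS/N - ||LS/N||^2): RMS distance of points from the centroid,
    # computable from the summary alone, without the original points.
    n, ls, ss = cf_
    c = ls / n
    return np.sqrt(max(ss / n - np.dot(c, c), 0.0))
```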
CURE (Clustering Using Representatives)
Supports non-spherical clusters
A fixed number of representative points is used
Select scattered objects and shrink them by moving toward the cluster center
Merge the clusters with the closest pair of representatives
For large databases:
Draw a random sample S
Partition the sample
Partially cluster each partition
Eliminate outliers and slow-growing clusters
Continue clustering
Label clusters by their representatives
CURE: Shrinking Representative Points
Shrink the multiple representative points toward the gravity center by a fraction α
Multiple representatives capture the shape of the cluster
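A minimal sketch of choosing and shrinking representative points; the farthest-point selection heuristic and the values of c and α are illustrative assumptions:

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points, then shrink them toward the
    cluster's gravity center by a fraction alpha."""
    X = np.asarray(points, dtype=float)
    center = X.mean(axis=0)
    # Farthest-point heuristic: start with the point farthest from the center,
    # then repeatedly add the point farthest from the chosen representatives.
    reps = [X[np.argmax(np.linalg.norm(X - center, axis=1))]]
    while len(reps) < min(c, len(X)):
        d = np.min([np.linalg.norm(X - r, axis=1) for r in reps], axis=0)
        reps.append(X[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (center - reps)   # shrink toward the center
```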
Clustering Categorical Data: The ROCK Algorithm
ROCK: RObust Clustering using linKs
Major ideas
Use links to measure similarity/proximity
Not distance-based
The link between two points is their number of common neighbors
Algorithm: sampling-based clustering
Draw a random sample
Cluster with links
Label the data on disk
Similarity between two transactions (sets) $T_1$ and $T_2$: $\mathrm{Sim}(T_1, T_2) = \dfrac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$ (the Jaccard coefficient)
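A small sketch of the link computation, assuming transactions are Python sets and using the Jaccard similarity above with an illustrative neighbor threshold θ:

```python
import numpy as np

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def links(transactions, theta=0.5):
    """link(p, q) = number of common neighbors, where p and q are
    neighbors iff sim(p, q) >= theta."""
    n = len(transactions)
    A = np.zeros((n, n), dtype=int)      # neighbor (adjacency) matrix
    for i in range(n):
        for j in range(n):
            if i != j and jaccard(transactions[i], transactions[j]) >= theta:
                A[i, j] = 1
    return A @ A                         # (A^2)[p, q] counts common neighbors
```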
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
Measures the similarity between clusters based on a dynamic model
Two clusters are merged only if the interconnectivity and closeness (proximity) between them are high
A two-phase algorithm:
1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
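As a small illustration of the first step, a sparse k-nearest-neighbor graph can be built with scikit-learn (k = 10 is an illustrative choice; CHAMELEON itself then hands this graph to a graph-partitioning package such as hMETIS):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(500, 2)   # illustrative data set
# Sparse k-NN graph: each point is connected only to its 10 nearest
# neighbors, with edges weighted by distance.
G = kneighbors_graph(X, n_neighbors=10, mode='distance')
```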