2. Cluster Analysis
• Cluster analysis or simply clustering is the process of partitioning a set
of data objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters.
• Cluster analysis has been widely used in many applications such as
business intelligence, image pattern recognition, Web search, biology,
and security.
3. Cluster Analysis
• Because a cluster is a collection of data objects that are similar to one
another within the cluster and dissimilar to objects in other clusters, a
cluster of data objects can be treated as an implicit class. In this sense,
clustering is sometimes called automatic classification.
• Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their
similarity.
• Clustering can also be used for outlier detection, where outliers (values
that are “far away” from any cluster) may be more interesting than
common cases.
• Clustering is known as unsupervised learning because the class label
information is not present. For this reason, clustering is a form of learning
by observation, rather than learning by examples.
4. Requirements for Cluster Analysis (I)
The following are typical requirements of clustering in data mining:
• Scalability
• Ability to deal with different types of attributes
―Numeric, binary, nominal (categorical), and ordinal data, or mixtures of these
data types
• Discovery of clusters with arbitrary shape
―Algorithms based on Euclidean or Manhattan distance measures tend to find
spherical clusters; clusters of arbitrary shape call for other approaches
• Requirements for domain knowledge to determine input parameters
―such as the desired number of clusters
• Ability to deal with noisy data
5. Requirements for Cluster Analysis (II)
• Incremental clustering and insensitivity to input order
• Capability of clustering high-dimensional data
―When clustering documents, for example, each keyword can be regarded as a
dimension
• Constraint-based clustering
―choose the locations for a given number of new automatic teller machines
(ATMs) in a city.
• Interpretability and usability
―interpretations and applications
6. Quality: What Is Good Clustering?
• A good clustering method will produce high-quality clusters:
―high intra-class similarity: cohesive within clusters
―low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
―the similarity measure used by the method,
―its implementation, and
―its ability to discover some or all of the hidden patterns
7. Measure the Quality of Clustering
• Dissimilarity/Similarity metric
― Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
― The definitions of distance functions are usually rather different for interval-scaled,
boolean, categorical, ordinal, ratio, and vector variables
― Weights should be associated with different variables based on applications
and data semantics
• Quality of clustering:
― There is usually a separate “quality” function that measures the “goodness”
of a cluster.
― It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
8. Considerations for Cluster Analysis
• Partitioning criteria
―Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
―Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
• Similarity measure
―Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
• Clustering space
―Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
9. Major Clustering Approaches (I)
• Partitioning approach:
―Construct various partitions and then evaluate them by some criterion, e.g., minimizing the
sum of square errors.
―Distance based
―May use the mean or medoid (etc.) to represent a cluster center
―Effective for small- to medium-sized data sets
―Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
―Create a hierarchical decomposition of the set of data (or objects) using some criterion
―Cannot correct erroneous merges or splits
―Top down or bottom up approach may be followed
―Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
10. Major Clustering Approaches (II)
• Density-based approach:
―Based on connectivity and density functions
―Each point must have minimum number of points within its neighborhood
―Can find arbitrary shaped clusters.
―May filter out outliers
―Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
―Based on a multiple-level granularity structure
―Fast processing
―Typical methods: STING, WaveCluster, CLIQUE
12. Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such
that the sum of squared distances is minimized (where ci is the centroid or medoid of
cluster Ci)
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
―Global optimum: exhaustively enumerate all partitions
―Heuristic methods: k-means and k-medoids algorithms
―k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of
the cluster
―k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
13. The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment does not change
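The four steps above can be sketched in a few lines of Python. This is a minimal 1-D illustration with a naive first-k initialization (an assumption; the slide does not prescribe one); the function name `kmeans` is hypothetical, not from the slides.

```python
# Minimal k-means sketch (illustrative): 1-D points, first-k seeds,
# Lloyd-style iteration until the assignments stop changing.
def kmeans(points, k):
    centroids = list(points[:k])          # step 1: initial seeds (naive choice)
    assignment = None
    while True:
        # step 3: assign each object to the cluster of the nearest centroid
        new_assignment = [min(range(k), key=lambda j: abs(p - centroids[j]))
                          for p in points]
        if new_assignment == assignment:  # step 4: stop when assignments are stable
            return centroids, assignment
        assignment = new_assignment
        # step 2: recompute each centroid as the mean of its current cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)

centroids, labels = kmeans([1, 2, 3, 8, 9, 10], k=2)
```

With these six points the loop converges in three passes to centroids 2 and 9.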
15. An Example of K-Means Clustering
(Figure: K = 2; starting from the initial data set, objects are arbitrarily partitioned into k groups, the cluster centroids are updated, objects are reassigned, and the loop continues if needed.)
• Partition objects into k nonempty subsets
• Repeat
―Compute the centroid (i.e., mean point) of each partition
―Assign each object to the cluster of its nearest centroid
• Until no change
16. • The difference between an object p ∈ Ci and ci, the representative of the
cluster, is measured by dist(p, ci), where dist(x, y) is the Euclidean
distance between two points x and y.
• The quality of cluster Ci can be measured by the within-cluster
variation, which is the sum of squared errors between all objects in Ci
and the centroid ci, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2
17. Problem
For the data set {1, 2, 3, 8, 9, 10, 25},
classify the objects using the k-means
clustering algorithm with k = 2.
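One way to work the exercise is to run k-means with a short script. The sketch below assumes the first two points are the initial seeds (the slide does not fix an initialization, so other seeds can give a different partition).

```python
# Worked sketch of the exercise: k-means with k = 2 on {1, 2, 3, 8, 9, 10, 25}.
# Initial seeds are the first two points (an assumption; the slide does not fix them).
points = [1, 2, 3, 8, 9, 10, 25]
centroids = [1.0, 2.0]
labels = None
while True:
    # assign each object to the nearest centroid
    new_labels = [0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
                  for p in points]
    if new_labels == labels:      # stop when assignments no longer change
        break
    labels = new_labels
    for j in (0, 1):              # recompute each centroid as the cluster mean
        members = [p for p, a in zip(points, labels) if a == j]
        centroids[j] = sum(members) / len(members)

clusters = [[p for p, a in zip(points, labels) if a == j] for j in (0, 1)]
```

With these seeds the result is {1, 2, 3} and {8, 9, 10, 25}, with centroids 2 and 13; note how the outlier 25 drags its cluster centroid away from 8, 9, 10, foreshadowing the outlier sensitivity discussed on a later slide.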
18. Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t
<< n.
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: often terminates at a local optimum.
• Weakness
―Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
―Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
―Sensitive to noisy data and outliers
―Not suitable to discover clusters with non-convex shapes
19. Variations of the K-Means Method
• Most variants of k-means differ in
―Selection of the initial k means
―Dissimilarity calculations
―Strategies to calculate cluster means
• Handling categorical data: k-modes
―Replacing means of clusters with modes
―Using new dissimilarity measures to deal with categorical objects
―Using a frequency-based method to update modes of clusters
―A mixture of categorical and numerical data: k-prototype method
20. What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers !
―Since an object with an extremely large value may substantially distort the
distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in a cluster
22. PAM (Partitioning Around Medoids): A Typical K-Medoids Algorithm
(Figure: K = 2.)
• Arbitrarily choose k objects as the initial medoids
• Assign each remaining object to the nearest medoid (total cost = 20)
• Randomly select a nonmedoid object, O_random
• Compute the total cost of swapping (total cost = 26)
• Swap O and O_random if the quality is improved
• Loop until no change
23. Minimizing Dissimilarity
• The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and
its corresponding representative object. That is, an absolute-error
criterion is used, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, O_i)

where E is the sum of the absolute error for all objects p in the data
set, and O_i is the representative object of C_i. This is the basis for the
k-medoids method, which groups n objects into k clusters by minimizing
the absolute error.
24. The K-Medoids Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
―PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
• Efficiency improvement on PAM
―CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
―CLARANS (Ng & Han, 1994): Randomized re-sampling
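A compact PAM-style sketch, for illustration only: it greedily applies the best improving swap rather than reproducing the published algorithm verbatim, and the helper names are hypothetical.

```python
# PAM-style k-medoids sketch (illustrative): swap a medoid with a non-medoid
# whenever the swap lowers the absolute error; stop when no swap improves it.
def absolute_error(points, medoids):
    return sum(min(abs(p - m) for p in [p]) if False else
               min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])            # naive initial medoids (an assumption)
    while True:
        best = (absolute_error(points, medoids), medoids)
        for i in range(len(medoids)):
            for o in points:
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = absolute_error(points, candidate)
                if cost < best[0]:
                    best = (cost, candidate)
        if best[1] == medoids:            # no improving swap: done
            return medoids
        medoids = best[1]

medoids = pam([1, 2, 3, 8, 9, 10, 25], k=2)
```

On the exercise data this converges to medoids {2, 9} with an absolute error of 20; unlike the k-means centroid, the medoid of the second cluster is not dragged toward the outlier 25.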
25. CLARA (Clustering LARge Applications)
• Instead of taking the whole data set into consideration, CLARA uses a
random sample of the data set.
• The PAM algorithm is then applied to compute the best medoids from the
sample.
• Ideally, the sample should closely represent the original data set. In many
cases, a large sample works well if it is created so that each object has
equal probability of being selected into the sample.
• CLARA builds clusterings from multiple random samples and returns the
best clustering as the output.
• The complexity of computing the medoids on a random sample is
O(ks² + k(n − k)), where s is the size of the sample, k is the number of
clusters, and n is the total number of objects. CLARA can deal with larger
data sets than PAM.
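The sampling idea can be sketched as follows. This is an illustrative simplification that runs an exhaustive k-medoids search on each tiny sample (the published CLARA instead runs PAM on samples of a fixed size, e.g. 40 + 2k); the names and parameters are hypothetical.

```python
import random
from itertools import combinations

# CLARA-style sketch (illustrative): search for medoids on several small random
# samples and keep the medoid set that scores best on the FULL data set.
def cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def clara(points, k, sample_size=5, num_samples=5, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, sample_size)
        # exhaustive search is feasible here only because the sample is tiny
        ms = min(combinations(sample, k), key=lambda ms: cost(sample, ms))
        c = cost(points, ms)              # evaluate against all objects
        if c < best_cost:
            best_medoids, best_cost = list(ms), c
    return best_medoids, best_cost

medoids, total = clara([1, 2, 3, 8, 9, 10, 25], k=2)
```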
27. Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does not require the
number of clusters k as an input, but needs a termination condition
(Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES, AGglomerative NESting) runs bottom-up from Step 0 to Step 4: a and b merge into {a, b}, d and e into {d, e}, then {c, d, e}, and finally {a, b, c, d, e}. Divisive clustering (DIANA, DIvisive ANAlysis) runs the same steps top-down in the reverse order.)
28. AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Use the single-link method and the dissimilarity matrix
―Each cluster is represented by all the objects in the cluster, and the similarity between two
clusters is measured by the similarity of the closest pair of data points belonging to different
clusters.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
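A minimal single-link agglomerative sketch (1-D, illustrative; AGNES as published also records the full merge hierarchy, which this sketch omits):

```python
# Single-link agglomerative sketch: start with singleton clusters and repeatedly
# merge the pair whose closest members are nearest, until few enough clusters remain.
def single_link(points, num_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # inter-cluster distance = distance of the closest pair of members
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(abs(a - b)
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)    # merge the least-dissimilar pair
    return clusters

clusters = single_link([1, 2, 3, 8, 9, 10, 25], num_clusters=3)
```

Stopping at three clusters recovers {1, 2, 3}, {8, 9, 10}, and the singleton {25}.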
29. DIANA (DIvisive ANAlysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
• All the objects are used to form one initial cluster.
• The cluster is split according to some principle such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.
• The cluster-splitting process repeats until, eventually, each new cluster contains
only a single object.
30. Dendrogram: Shows How Clusters are Merged
―Decompose data objects into several levels of nested partitioning (a tree of clusters), called a
dendrogram
―A clustering of the data objects is obtained by cutting the dendrogram at the desired level,
then each connected component forms a cluster
31. Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other,
i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the other, i.e.,
dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
―Medoid: a chosen, centrally located object in the cluster
32. Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster

C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

• Radius: square root of the average squared distance from any point of the cluster
to its centroid

R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

• Diameter: square root of the average squared distance between all pairs of points
in the cluster

D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
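A quick 1-D sketch of the three statistics (illustrative):

```python
from math import sqrt

# Centroid, radius, and diameter of a small 1-D cluster.
points = [1.0, 2.0, 3.0, 4.0]
N = len(points)

centroid = sum(points) / N
radius = sqrt(sum((p - centroid) ** 2 for p in points) / N)
# mean squared distance over all ordered pairs (the i = j terms contribute zero)
diameter = sqrt(sum((p - q) ** 2 for p in points for q in points) / (N * (N - 1)))
```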
33. Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
―Can never undo what was done previously
―Do not scale well: time complexity of at least O(n²), where n is the
number of total objects
• Integration of hierarchical & distance-based clustering
―BIRCH (1996): uses CF-tree and incrementally adjusts the quality
of sub-clusters
―CHAMELEON (1999): hierarchical clustering using dynamic
modeling
34. BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
―Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
―Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the
CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality
with a few additional scans
• Weakness: handles only numeric data, and is sensitive to the order of the
data records
35. BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• BIRCH uses the notions of clustering feature to summarize a cluster,
and clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed and
scalability in large or even streaming databases, and also make it
effective for incremental and dynamic clustering of incoming objects.
36. BIRCH- clustering feature (CF)
• Consider a cluster of n d-dimensional data objects or points.
• The clustering feature (CF) of the cluster is a 3-D vector summarizing
information about clusters of objects. It is defined as
CF = {n, LS, SS}
• where LS is the linear sum of the n points (i.e., \sum_{i=1}^{n} x_i), and SS is the
square sum of the data points (i.e., \sum_{i=1}^{n} x_i^2).
• A clustering feature is essentially a summary of the statistics for the
given cluster. Using a clustering feature, we can easily derive many
useful statistics of a cluster.
38. BIRCH- Summarizing a cluster using the
clustering feature
• Summarizing a cluster using the clustering feature can avoid storing
the detailed information about individual objects or points.
• Instead, we only need a constant amount of space to store the clustering
feature. This is the key to BIRCH's efficiency in space.
• Moreover, clustering features are additive.
―That is, for two disjoint clusters, C1 and C2, with clustering features
CF1 = {n1, LS1, SS1} and CF2 = {n2, LS2, SS2}, respectively, the clustering
feature for the cluster formed by merging C1 and C2 is simply
CF1 + CF2 = {n1 + n2, LS1 + LS2, SS1 + SS2}.
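The additivity property is easy to verify in code (1-D sketch; the helper name `cf` is hypothetical):

```python
# CF additivity sketch (1-D for simplicity): merging two disjoint clusters just
# adds their clustering features component-wise.
def cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

c1, c2 = [1.0, 2.0, 3.0], [8.0, 9.0]
n1, ls1, ss1 = cf(c1)
n2, ls2, ss2 = cf(c2)

merged = (n1 + n2, ls1 + ls2, ss1 + ss2)
assert merged == cf(c1 + c2)   # identical to computing the CF from scratch
```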
39. BIRCH
• A CF-tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
• The nonleaf nodes store sums of the CFs of their children, and thus
summarize clustering information about their children.
• A CF-tree has two parameters: branching factor, B, and threshold, T.
―The branching factor specifies the maximum number of children per nonleaf
node.
―The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
• These two parameters implicitly control the resulting tree’s size.
40. The CF-Tree Structure
(Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CFi, each with a pointer childi to a child node; leaf nodes hold CF entries for subclusters and are chained by prev/next pointers.)
41. The Birch Algorithm
• Cluster Diameter
• For each point in the input
―Find closest leaf entry
―Add point to leaf entry and update CF
―If entry diameter > max_diameter, then split leaf, and possibly parents
• Algorithm is O(n)
• Concerns
―Sensitive to insertion order of data points
―Since the size of leaf nodes is fixed, the resulting clusters may not be natural
―Clusters tend to be spherical given the radius and diameter measures
Cluster diameter:

D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n \cdot SS - 2 \cdot LS^2}{n(n-1)}}
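The identity can be checked numerically (1-D sketch; for d-dimensional data, LS² denotes the dot product LS · LS):

```python
from math import sqrt

# The cluster diameter can be computed from the CF components (n, LS, SS)
# alone, without revisiting the individual points.
points = [1.0, 2.0, 3.0, 8.0, 9.0]
n = len(points)
LS = sum(points)
SS = sum(p * p for p in points)

d_direct = sqrt(sum((x - y) ** 2 for x in points for y in points) / (n * (n - 1)))
d_from_cf = sqrt((2 * n * SS - 2 * LS ** 2) / (n * (n - 1)))
assert abs(d_direct - d_from_cf) < 1e-9   # both routes agree
```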
42. BIRCH- Effectiveness
• The time complexity of the algorithm is O(n) where n is the number of
objects to be clustered.
• Experiments have shown the linear scalability of the algorithm with
respect to the number of objects, and good quality of clustering of the
data.
• However, since each node in a CF-tree can hold only a limited number of
entries due to its size, a CF-tree node does not always correspond to what
a user may consider a natural cluster.
• Moreover, if the clusters are not spherical in shape, BIRCH does not
perform well because it uses the notion of radius or diameter to control
the boundary of a cluster.
43. CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
(1999)
• CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
• Measures the similarity based on a dynamic model
― Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the clusters
• Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a large number of
relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine
clusters by repeatedly combining these sub-clusters
44. CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
(1999)
• Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph,
where each vertex of the graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among the k-most similar objects to the
other.
• The edges are weighted to reflect the similarity between objects.
• Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor
graph into a large number of relatively small subclusters such that it minimizes the
edge cut.
• That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight
of the edges that would be cut should C be bisected into Ci and Cj .
• It assesses the absolute interconnectivity between clusters Ci and Cj .
45. CHAMELEON
• Chameleon then uses an agglomerative hierarchical clustering algorithm that
iteratively merges subclusters based on their similarity.
• To determine the pairs of most similar subclusters, it takes into account both
the interconnectivity and the closeness of the clusters.
• Specifically, Chameleon determines the similarity between each pair of clusters
Ci and Cj according to their relative interconnectivity, RI(Ci, Cj), and their
relative closeness, RC(Ci, Cj).
46. CHAMELEON
• The relative interconnectivity, RI(Ci ,Cj), between two clusters, Ci and Cj , is
defined as the absolute interconnectivity between Ci and Cj , normalized with
respect to the internal interconnectivity of the two clusters, Ci and Cj . That is,
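The formula itself did not survive in this text; in the standard CHAMELEON formulation (stated here as an assumption based on the cited paper, not recovered from this document), it reads:

```latex
RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}
```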
• where EC{Ci ,Cj} is the edge cut as previously defined for a cluster containing both Ci
and Cj .
• Similarly, ECCi (or ECCj ) is the minimum sum of the cut edges that partition Ci (or
Cj) into two roughly equal parts.
47. CHAMELEON
• The relative closeness, RC (Ci ,Cj), between a pair of clusters, Ci and Cj
, is the absolute closeness between Ci and Cj , normalized with
respect to the internal closeness of the two clusters, Ci and Cj . It is
defined as
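The formula itself did not survive in this text; in the standard CHAMELEON formulation (stated here as an assumption based on the cited paper, not recovered from this document), it reads:

```latex
RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}
```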
• where SEC{Ci ,Cj} is the average weight of the edges that connect
vertices in Ci to vertices in Cj , and SECCi (or SECCj ) is the average
weight of the edges that belong to the mincut bisector of cluster Ci
(or Cj ).
48. CHAMELEON
• Chameleon has been shown to have greater power at discovering
arbitrarily shaped clusters of high quality than several well-known
algorithms such as BIRCH and density-based DBSCAN.
• However, the processing cost for high-dimensional data may require
O(n²) time for n objects in the worst case.
49. Overall Framework of CHAMELEON
(Figure: Data Set → construct a k-NN sparse graph → partition the graph → merge partitions → final clusters.)
• K-NN graph: p and q are connected if q is among the top-k closest neighbors of p
• Relative interconnectivity: connectivity of c1 and c2 over internal connectivity
• Relative closeness: closeness of c1 and c2 over internal closeness
52. Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-connected
points
• Major features:
―Discover clusters of arbitrary shape
―Handle noise
―One scan
―Need density parameters as termination condition
• Several interesting studies:
―DBSCAN: Ester, et al. (KDD’96)
―OPTICS: Ankerst, et al. (SIGMOD’99)
―DENCLUE: Hinneburg & D. Keim (KDD’98)
―CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
53. Density-Based Clustering: Basic Concepts
• Two parameters:
―Eps (ϵ): maximum radius of the neighbourhood
―MinPts: minimum number of points in an Eps-neighbourhood of that point
• N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}, the Eps-neighbourhood of p
• Directly density-reachable: a point p is directly density-reachable from a point q
w.r.t. Eps, MinPts if
―p belongs to N_Eps(q)
―core point condition: |N_Eps(q)| ≥ MinPts
(Figure: MinPts = 5, Eps = 1 cm; q is a core point and p lies in its Eps-neighbourhood.)
54. Density-Reachable and Density-Connected
• Density-reachable:
―A point p is density-reachable from a
point q w.r.t. ϵ-neighbourhood, MinPts if
there is a chain of points p1, …, pn, p1 = q,
pn = p such that pi+1 is directly density-
reachable from pi
• Density-connected
―A point p is density-connected to a point
q w.r.t. ϵ-neighbourhood, MinPts if there
is a point o such that both, p and q are
density-reachable from o w.r.t. ϵ-
neighbourhood and MinPts
(Figures: a chain of points p1, …, pn with p1 = q and pn = p; and a point o from which both p and q are density-reachable.)
55. DBSCAN: Density-Based Spatial Clustering of Applications with
Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set
of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
(Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5.)
56. DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. ϵ-neighbourhood and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database
• Continue the process until all of the points have been processed
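The procedure above can be sketched as follows (1-D, illustrative; a minimal label-propagation variant, not the original published pseudocode):

```python
# Minimal DBSCAN sketch: grow a cluster from each unvisited core point,
# expanding only through points that are themselves core points.
def dbscan(points, eps, min_pts):
    labels = {}                     # point index -> cluster id, or -1 for noise
    cluster_id = -1
    for i, p in enumerate(points):
        if i in labels:
            continue
        neighbors = [j for j, q in enumerate(points) if abs(p - q) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1          # tentatively noise; may become a border point
            continue
        cluster_id += 1             # p is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if j in labels:
                if labels[j] == -1:
                    labels[j] = cluster_id   # noise reached from a core: border point
                continue
            labels[j] = cluster_id
            j_neighbors = [m for m, q in enumerate(points)
                           if abs(points[j] - q) <= eps]
            if len(j_neighbors) >= min_pts:  # expand only through core points
                seeds.extend(j_neighbors)
    return [labels[i] for i in range(len(points))]

labels = dbscan([1, 2, 3, 8, 9, 10, 25], eps=1.5, min_pts=3)
```

On {1, 2, 3, 8, 9, 10, 25} with Eps = 1.5 and MinPts = 3 this yields two clusters and marks 25 as noise; 1 and 8 end up as border points.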
59. Disadvantages of DBSCAN
• The user carries the responsibility of selecting parameter values that will
lead to the discovery of acceptable clusters
• Real-world, high-dimensional data sets often have very skewed distributions,
so their intrinsic clustering structure may not be well characterized by
a single set of global density parameters
60. OPTICS: A Cluster-Ordering Method (1999)
• OPTICS: Ordering Points To Identify the Clustering Structure
―Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
―Produces a special order of the database wrt its density-based clustering
structure
―This is a linear list of all objects under analysis and represents the density-
based clustering structure of the data.
―Objects in a denser cluster are listed closer to each other in the cluster
ordering.
―This cluster ordering contains information equivalent to the density-based
clusterings corresponding to a broad range of parameter settings
―Good for both automatic and interactive cluster analysis, including finding
intrinsic clustering structure.
61. OPTICS: Some Extension from DBSCAN
• This ordering is equivalent to density-based clustering obtained from
a wide range of parameter settings.
• OPTICS does not require the user to provide a specific density
threshold.
• The cluster ordering can be used to extract basic clustering
information (e.g., cluster centers, or arbitrary-shaped clusters), derive
the intrinsic clustering structure, as well as provide a visualization of
the clustering.
65. DENCLUE: Clustering Based on Density
Distribution Functions
• DENCLUE (DENsity-based CLUstEring) is a clustering method based on
a set of density distribution functions.
• In DBSCAN and OPTICS, density is calculated by counting the number
of objects in a neighborhood defined by a radius parameter, ϵ. Such
density estimates can be highly sensitive to the radius value used.
• To overcome this problem, kernel density estimation can be used,
which is a nonparametric density estimation approach from statistics.
67. Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually contain data points and
manages these cells in a tree-based access structure
• Influence function: describes the impact of a data point within its neighborhood
• Overall density of the data space can be calculated as the sum of the influence function of all data
points
• Clusters can be determined mathematically by identifying density attractors
• Density attractors are local maxima of the overall density function
• Center-defined clusters: assign to each density attractor the points density-attracted to it
• Arbitrary-shaped clusters: merge density attractors that are connected through paths of high
density (> threshold)
69. Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
―STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz
(1997)
―WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach using wavelet method
―CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
70. STING: A Statistical Information Grid Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution
71. The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller cells in the next
lower level
• Statistical info of each cell is calculated and stored beforehand and is used to
answer queries
• Parameters of higher level cells can be easily calculated from parameters of lower
level cell
―count, mean, standard deviation (s), min, max
―type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of cells
• For each cell in the current level compute the confidence interval
72. STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When finish examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
―Query-independent, easy to parallelize, incremental update
―O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
―All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected
73. CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space that allow better clustering
than original space
• CLIQUE can be considered as both density-based and grid-based
―It partitions each dimension into the same number of equal-length intervals
―It partitions an m-dimensional data space into non-overlapping rectangular units
―A unit is dense if the fraction of total data points contained in the unit exceeds an
input model parameter
―A cluster is a maximal set of connected dense units within a subspace
74. CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the
partition.
• Identify the candidate search space
―Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
―Determine dense units in all subspaces of interest
―Determine connected dense units in all subspaces of interest
• Generate minimal description for the clusters
―Determine maximal regions that cover a cluster of connected dense units for each
cluster
―Determination of minimal cover for each cluster
75. • CLIQUE performs clustering in two steps.
• In the first step, CLIQUE partitions the d-dimensional data space into
nonoverlapping rectangular units, identifying the dense units among
these.
• CLIQUE finds dense cells in all of the subspaces.
• To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least l points, where l is the density
threshold.
76. • In the second step, CLIQUE uses the dense cells in each subspace to
assemble clusters, which can be of arbitrary shape.
• The idea is to apply the Minimum Description Length (MDL) principle, using
maximal regions to cover the connected dense cells,
• where a maximal region is a hyperrectangle in which every cell falling
into the region is dense, and the region cannot be extended further
in any dimension in the subspace.
78. Strength and Weakness of CLIQUE
• Strength
―automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
―insensitive to the order of records in input and does not presume some
canonical data distribution
―scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
―The accuracy of the clustering result may be degraded at the expense of
simplicity of the method
80. Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness with a statistical test: the Hopkins Statistic
―Given a data set D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
―Sample n points, p1, …, pn, uniformly from the data space of D. For each pi, find its nearest
neighbor in D: xi = min{dist(pi, v) | v ∈ D}
―Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D − {qi}:
yi = min{dist(qi, v) | v ∈ D, v ≠ qi}
―Calculate the Hopkins Statistic: H = Σ yi / (Σ xi + Σ yi)
―If D is uniformly distributed, Σ xi and Σ yi will be close to each other and H will be close to 0.5.
If D is highly skewed, H will be close to 0
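A minimal Hopkins-statistic sketch (1-D; the uniform sample is drawn from the bounding interval of D as a stand-in for the data space, which is a simplifying assumption of this sketch):

```python
import random

# Hopkins-statistic sketch: p_i drawn uniformly from the (approximate) data
# space, q_i drawn uniformly from the data set D itself.
def hopkins(D, n, seed=0):
    rng = random.Random(seed)
    lo, hi = min(D), max(D)
    # x: nearest-neighbor distances of uniform points to the data
    x = sum(min(abs(p - v) for v in D)
            for p in (rng.uniform(lo, hi) for _ in range(n)))
    # y: nearest-neighbor distances of sampled data points to the rest of D
    y = sum(min(abs(q - v) for v in D if v != q)
            for q in rng.sample(D, n))
    return y / (x + y)

# Two tight clusters: nearest-neighbor distances within D are tiny, so H is small.
H = hopkins([1.0, 1.1, 1.2, 9.0, 9.1, 9.2], n=3)
```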
81. Determine the Number of Clusters (I)
• Empirical method
―# of clusters ≈ √(n/2) for a data set of n points, with each cluster
having about √(2n) points
• Elbow method
―Use the turning point in the curve of the sum of within-cluster
variance w.r.t. the # of clusters
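The elbow method can be sketched by computing the within-cluster SSE for several values of k (1-D, with a naive k-means inside; the function name is hypothetical):

```python
# Elbow-method sketch: run k-means for several k and record the within-cluster
# sum of squared errors; look for the k where the curve bends.
def sse_for_k(points, k):
    centroids, labels = list(points[:k]), None
    while True:
        new = [min(range(k), key=lambda j: (p - centroids[j]) ** 2)
               for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, a in zip(points, labels) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return sum((p - centroids[a]) ** 2 for p, a in zip(points, labels))

points = [1, 2, 3, 8, 9, 10]
curve = {k: sse_for_k(points, k) for k in (1, 2, 3)}
```

For this data the SSE drops from 77.5 (k = 1) to 4 (k = 2) but only to 2.5 at k = 3, so the elbow is at k = 2.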
82. Determine the Number of Clusters(II)
• Cross validation method
―Divide a given data set into m parts
―Use m – 1 parts to obtain a clustering model
―Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use the sum of squared
distance between all points in the test set and the closest centroids to measure how well
the model fits the test set
―For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and
find # of clusters that fits the data the best
83. Determine the Number of Clusters(III)
• The average silhouette approach determines how well each
object lies within its cluster. A high average silhouette
width indicates a good clustering.
• The average silhouette method computes the average silhouette
of observations for different values of k. The optimal number
of clusters k is the one that maximizes the average silhouette
over a range of possible values for k (Kaufman and
Rousseeuw, 1990).
• The procedure is similar to the elbow method and can be
computed as follows:
― Run a clustering algorithm (e.g., k-means) for different
values of k, for instance varying k from 1 to 10 clusters
― For each k, calculate the average silhouette of observations (avg.sil)
― Plot the curve of avg.sil against the number of clusters k
― The location of the maximum is taken as the appropriate number
of clusters
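The silhouette computation for a fixed clustering can be sketched directly from the definition s = (b − a) / max(a, b), where a is the mean intra-cluster distance of an object and b its mean distance to the nearest other cluster (illustrative, 1-D):

```python
# Average-silhouette sketch: score a fixed clustering, then compare clusterings.
def avg_silhouette(clusters):
    scores = []
    for ci in clusters:
        for p in ci:
            # a: mean distance to the other members of p's own cluster
            a = (sum(abs(p - q) for q in ci if q != p) / (len(ci) - 1)
                 if len(ci) > 1 else 0.0)
            # b: smallest mean distance to any other cluster
            b = min(sum(abs(p - q) for q in cj) / len(cj)
                    for cj in clusters if cj is not ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

good = avg_silhouette([[1, 2, 3], [8, 9, 10]])    # natural grouping
bad = avg_silhouette([[1, 2], [3, 8, 9, 10]])     # 3 placed in the wrong cluster
```

The natural grouping scores about 0.8 on average while the misassigned one scores well below it, so the silhouette criterion prefers the former.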
84. Measuring Clustering Quality
• Two methods: extrinsic vs. intrinsic
• Extrinsic: supervised, i.e., the ground truth is available
―Compare a clustering against the ground truth using certain clustering quality
measure
―Ex. BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
―Evaluate the goodness of a clustering by considering how well the clusters are
separated, and how compact the clusters are
―Ex. Silhouette coefficient
86. Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
―Cluster homogeneity: the purer, the better
―Cluster completeness: objects belonging to the same category in
the ground truth should be assigned to the same cluster
―Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
―Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces