Cluster Analysis
Ref: Data Mining Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
1
Cluster Analysis
• Cluster analysis or simply clustering is the process of partitioning a set
of data objects (or observations) into subsets.
• Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters.
• Cluster analysis has been widely used in many applications such as
business intelligence, image pattern recognition, Web search, biology,
and security.
2
Cluster Analysis
• Because a cluster is a collection of data objects that are similar to one
another within the cluster and dissimilar to objects in other clusters, a
cluster of data objects can be treated as an implicit class. In this sense,
clustering is sometimes called automatic classification.
• Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their
similarity.
• Clustering can also be used for outlier detection, where outliers (values
that are “far away” from any cluster) may be more interesting than
common cases.
• Clustering is known as unsupervised learning because the class label
information is not present. For this reason, clustering is a form of learning
by observation, rather than learning by examples
3
Requirements for Cluster Analysis(I)
Following are typical requirements of clustering in data mining
• Scalability
• Ability to deal with different types of attributes
―Numeric, binary, nominal (categorical), and ordinal data, or mixtures of these
data types
• Discovery of clusters with arbitrary shape
―algorithms based on Euclidean or Manhattan distance measures tend to find only spherical clusters of similar size and density.
• Requirements for domain knowledge to determine input parameters
―such as the desired number of clusters
• Ability to deal with noisy data
4
Requirements for Cluster Analysis(II)
• Incremental clustering and insensitivity to input order
• Capability of clustering high-dimensionality data
―When clustering documents, for example, each keyword can be regarded as a
dimension
• Constraint-based clustering
―choose the locations for a given number of new automatic teller machines
(ATMs) in a city.
• Interpretability and usability
―interpretations and applications
5
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
―high intra-class similarity: cohesive within clusters
―low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
―the similarity measure used by the method
―its implementation, and
―Its ability to discover some or all of the hidden patterns
6
Measure the Quality of Clustering
• Dissimilarity/Similarity metric
― Similarity is expressed in terms of a distance function, typically metric: d(i, j)
― The definitions of distance functions are usually rather different for interval-scaled,
boolean, categorical, ordinal, ratio, and vector variables
― Weights should be associated with different variables based on applications
and data semantics
• Quality of clustering:
― There is usually a separate “quality” function that measures the “goodness”
of a cluster.
― It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
7
Considerations for Cluster Analysis
• Partitioning criteria
―Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
―Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
• Similarity measure
―Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
• Clustering space
―Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering) 8
Major Clustering Approaches (I)
• Partitioning approach:
―Construct various partitions and then evaluate them by some criterion, e.g., minimizing the
sum of square errors.
―Distance based
―May use the mean or medoid (etc.) to represent the cluster center.
―Effective for small to medium size data sets.
―Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
―Create a hierarchical decomposition of the set of data (or objects) using some criterion
―Cannot correct erroneous merges or splits
―Top down or bottom up approach may be followed
―Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
9
Major Clustering Approaches (II)
• Density-based approach:
―Based on connectivity and density functions
―Each point must have minimum number of points within its neighborhood
―Can find arbitrary shaped clusters.
―May filter out outliers
―Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
―Based on a multiple-level granularity structure
―Fast processing
―Typical methods: STING, WaveCluster, CLIQUE
10
Partitioning approach
11
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such
that the sum of squared distances is minimized (where ci is the centroid or medoid of
cluster Ci)
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
―Global optimal: exhaustively enumerate all partitions
―Heuristic methods: k-means and k-medoids algorithms
―k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of
the cluster
―k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
12
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2, stop when the assignment does not change
13
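As an illustration of the four steps above, here is a minimal NumPy sketch of the k-means (Lloyd) loop. The helper name `kmeans`, its signature, and the random-seed initialization are illustrative choices, not part of the slides.

```python
# Minimal sketch of the k-means (Lloyd) loop described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick k objects at random as the initial seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean point of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the assignment (and hence the centroids) no longer changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```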
14
An Example of K-Means Clustering
15
(Figure: K = 2. The initial data set is arbitrarily partitioned into k groups; the cluster
centroids are updated, objects are reassigned to the nearest centroid, the centroids are
updated again, and the loop continues if needed.)
 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) of each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
• The difference between an object pϵCi and ci, the representative of the
cluster, is measured by dist(p, ci), where dist(x,y) is the Euclidean
distance between two points x and y.
• The quality of cluster Ci can be measured by the within cluster
variation, which is the sum of squared error between all objects in Ci
and the centroid ci, defined as
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2$$
16
Problem
For the data set 1, 2, 3, 8, 9, 10, and 25.
Classify them using k-means clustering
algorithm where k=2
17
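One way to work this exercise in code, assuming the illustrative `kmeans` sketch from the earlier slide is available:

```python
import numpy as np

X = np.array([1, 2, 3, 8, 9, 10, 25], dtype=float).reshape(-1, 1)
labels, centroids = kmeans(X, k=2, seed=0)   # reuses the illustrative sketch above
print(labels, centroids.ravel())
```

Note that the result depends on the initial seeds: both {1, 2, 3} | {8, 9, 10, 25} (within-cluster SSE = 2 + 194 = 196) and {1, 2, 3, 8, 9, 10} | {25} (SSE = 77.5 + 0 = 77.5) are fixed points of the algorithm, and the second partition has the lower sum of squared errors.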
Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t
<< n.
• Comparing: PAM: O(k(n − k)²), CLARA: O(ks² + k(n − k))
• Comment: Often terminates at a local optimum.
• Weakness
―Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
―Need to specify k, the number of clusters, in advance (there are ways to automatically
determine the best k; see Hastie et al., 2009)
―Sensitive to noisy data and outliers
―Not suitable to discover clusters with non-convex shapes
18
Variations of the K-Means Method
• Most variants of k-means differ in
―Selection of the initial k means
―Dissimilarity calculations
―Strategies to calculate cluster means
• Handling categorical data: k-modes
―Replacing means of clusters with modes
―Using new dissimilarity measures to deal with categorical objects
―Using a frequency-based method to update modes of clusters
―A mixture of categorical and numerical data: k-prototype method
19
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers !
―Since an object with an extremely large value may substantially distort the
distribution of the data
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object
in a cluster
20
21
22
PAM(Partitioning Around Medoids): A Typical K-Medoids
Algorithm
(Figure, K = 2, total cost = 20: arbitrarily choose k objects as initial medoids; assign each
remaining object to the nearest medoid. Randomly select a non-medoid object, Orandom,
and compute the total cost of swapping (total cost = 26). Swap O and Orandom if the
quality is improved, and repeat the loop until no change.)
Minimizing dissimilarity
• The partitioning method is then performed based on the principle of minimizing the
sum of the dissimilarities between each object p and its corresponding representative
object. That is, an absolute-error criterion is used, defined as

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, O_i)$$

where E is the sum of the absolute error for all objects p in the data
set, and Oi is the representative object of Ci. This is the basis for the k-
medoids method, which groups n objects into k clusters by minimizing
the absolute error
23
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
―PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
• Efficiency improvement on PAM
―CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
―CLARANS (Ng & Han, 1994): Randomized re-sampling
24
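A simplified alternating k-medoids sketch, which updates each medoid to the in-cluster point with minimum total distance rather than performing the full PAM swap search; the helper name `kmedoids` is illustrative.

```python
import numpy as np

def kmedoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign every object to its nearest medoid.
        dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member with the smallest sum of distances to the
            # other members (it minimizes the absolute-error criterion within the cluster).
            intra = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
            new_idx[j] = members[intra.sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
            break
        medoid_idx = new_idx
    return labels, medoid_idx
```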
CLARA (Clustering LARge Applications)
• Instead of taking the whole data set into consideration, CLARA uses a
random sample of the data set.
• The PAM algorithm is then applied to compute the best medoids from the
sample.
• Ideally, the sample should closely represent the original data set. In many
cases, a large sample works well if it is created so that each object has
equal probability of being selected into the sample.
• CLARA builds clusterings from multiple random samples and returns the
best clustering as the output.
• The complexity of computing the medoids on a random sample is
O(ks² + k(n − k)), where s is the size of the sample, k is the number of
clusters, and n is the total number of objects. CLARA can deal with larger
data sets than PAM.
25
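A CLARA-style sketch built on the illustrative `kmedoids` helper above: medoids are computed on random samples and the set with the lowest absolute error on the full data set is kept. The sample size and number of samples used here are arbitrary assumptions.

```python
import numpy as np

def clara(X, k, sample_size=40, n_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sample = X[idx]
        _, med_idx = kmedoids(sample, k)              # PAM-like step on the sample
        medoids = sample[med_idx]
        # Evaluate the candidate medoids on the whole data set (absolute error).
        cost = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost
```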
Hierarchical Methods
26
Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does not require the
number of clusters k as an input, but needs a termination condition
27
(Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES, AGglomerative NESting)
proceeds from Step 0 to Step 4, merging a with b, d with e, then c with {d, e}, and finally
{a, b} with {c, d, e}. Divisive clustering (DIANA, DIvisive ANAlysis) follows the inverse
order, from Step 4 back to Step 0.)
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Use the single-link method and the dissimilarity matrix
―Each cluster is represented by all the objects in the cluster, and the similarity between two
clusters is measured by the similarity of the closest pair of data points belonging to different
clusters.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
28
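A single-link (AGNES-style) agglomerative clustering can be run with SciPy's hierarchical clustering routines; the random data below is only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))
Z = linkage(X, method='single')                   # merge least-dissimilar clusters first
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the dendrogram into 3 clusters
# scipy.cluster.hierarchy.dendrogram(Z) can be plotted to inspect the merge order.
```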
DIANA (DIvisive ANAlysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
• All the objects are used to form one initial cluster.
• The cluster is split according to some principle such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.
• The cluster-splitting process repeats until, eventually, each new cluster contains
only a single object.
29
Dendrogram: Shows How Clusters are Merged
―Decompose data objects into several levels of nested partitioning (a tree of clusters), called a
dendrogram
―A clustering of the data objects is obtained by cutting the dendrogram at the desired level,
then each connected component forms a cluster
30
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other,
i.e., dist(Ki, Kj) = min dist(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = max dist(tip, tjq)
• Average: average distance between an element in one cluster and an element in the other, i.e.,
dist(Ki, Kj) = avg dist(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
―Medoid: a chosen, centrally located object in the cluster
31
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
• Centroid: the “middle” of a cluster
• Radius: square root of average distance from any point of the cluster
to its centroid
• Diameter: square root of average mean squared distance between all
pairs of points in the cluster
32
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$$

$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$$

$$D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{q=1}^{N} (t_{ip} - t_{iq})^2}{N(N-1)}}$$
Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods
―Can never undo what was done previously
―Do not scale well: time complexity of at least O(n2), where n is the
number of total objects
• Integration of hierarchical & distance-based clustering
―BIRCH (1996): uses CF-tree and incrementally adjusts the quality
of sub-clusters
―CHAMELEON (1999): hierarchical clustering using dynamic
modeling
33
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
―Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
―Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the
CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality
with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data
record 34
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
• BIRCH uses the notions of clustering feature to summarize a cluster,
and clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed and
scalability in large or even streaming databases, and also make it
effective for incremental and dynamic clustering of incoming objects.
35
BIRCH- clustering feature (CF)
• Consider a cluster of n d-dimensional data objects or points.
• The clustering feature (CF) of the cluster is a 3-D vector summarizing
information about clusters of objects. It is defined as
CF = {n, LS, SS}
• where LS is the linear sum of the n points (i.e., $\sum_{i=1}^{n} x_i$), and SS is the
square sum of the data points (i.e., $\sum_{i=1}^{n} x_i^2$).
• A clustering feature is essentially a summary of the statistics for the
given cluster. Using a clustering feature, we can easily derive many
useful statistics of a cluster
36
BIRCH
• The cluster’s centroid, x0, radius, R, and diameter, D, are

$$x_0 = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}$$

$$R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}} = \sqrt{\frac{n \cdot SS - LS^2}{n^2}}$$

$$D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n \cdot SS - 2\,LS^2}{n(n-1)}}$$
37
BIRCH- Summarizing a cluster using the
clustering feature
• Summarizing a cluster using the clustering feature can avoid storing
the detailed information about individual objects or points.
• Instead, we only need a constant size of space to store the clustering
feature. This is the key to BIRCH efficiency in space.
• Moreover, clustering features are additive.
―That is, for two disjoint clusters, C1 and C2, with the clustering features
CF1={n1,LS1,SS1} and CF2={n2,LS2,SS2}, respectively, the clustering feature
for the cluster that formed by merging C1 and C2 is simply
CF1+CF2 ={n1+n2,LS1+LS2,SS1+SS2}.
38
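A small sketch of a clustering feature as an (n, LS, SS) summary with additive merging and statistics derived from it, following the formulas above; the class name `CF` and its methods are illustrative, not part of BIRCH itself.

```python
import numpy as np

class CF:
    """Clustering feature sketch: CF = (n, LS, SS) for an (n, d) array of points."""
    def __init__(self, points):
        pts = np.atleast_2d(np.asarray(points, dtype=float))
        self.n = pts.shape[0]
        self.LS = pts.sum(axis=0)          # linear sum of the points
        self.SS = float((pts ** 2).sum())  # square sum of the points

    def merge(self, other):
        # Additivity: CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2).
        merged = CF.__new__(CF)
        merged.n, merged.LS, merged.SS = self.n + other.n, self.LS + other.LS, self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.n

    def diameter(self):
        # D = sqrt((2n*SS - 2*||LS||^2) / (n*(n-1)))
        return np.sqrt((2 * self.n * self.SS - 2 * self.LS @ self.LS)
                       / (self.n * (self.n - 1)))
```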
BIRCH
• A CF-tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
• The nonleaf nodes store sums of the CFs of their children, and thus
summarize clustering information about their children.
• A CF-tree has two parameters: branching factor, B, and threshold, T.
―The branching factor specifies the maximum number of children per nonleaf
node.
―The threshold parameter specifies the maximum diameter of subclusters
stored at the leaf nodes of the tree.
• These two parameters implicitly control the resulting tree’s size.
39
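For comparison, scikit-learn ships a BIRCH implementation whose `threshold` and `branching_factor` arguments correspond to the CF-tree parameters T and B described above; the data and parameter values below are only illustrative.

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.default_rng(0).normal(size=(200, 2))
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)   # Phase 2: the CF-tree leaves are clustered into 3 groups
```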
The CF Tree Structure
40
(Figure: a CF-tree with branching factor B = 7 and maximum leaf entries L = 6. The root
and non-leaf nodes hold entries CFi, each with a pointer childi to a child node; leaf nodes
hold CF entries (e.g., CF111, CF112, …, CF116) and are linked by prev/next pointers.)
The Birch Algorithm
• Cluster Diameter
• For each point in the input
―Find closest leaf entry
―Add point to leaf entry and update CF
―If entry diameter > max_diameter, then split leaf, and possibly parents
• Algorithm is O(n)
• Concerns
―Sensitive to insertion order of data points
―Since the size of the leaf nodes is fixed, the resulting clusters may not be natural
―Clusters tend to be spherical given the radius and diameter measures
41
$$D = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n \cdot SS - 2\,LS^2}{n(n-1)}}$$
BIRCH- Effectiveness
• The time complexity of the algorithm is O(n) where n is the number of
objects to be clustered.
• Experiments have shown the linear scalability of the algorithm with
respect to the number of objects, and good quality of clustering of the
data.
• However, since each node in a CF-tree can hold only a limited number of
entries due to its size, a CF-tree node does not always correspond to what
a user may consider a natural cluster.
• Moreover, if the clusters are not spherical in shape, BIRCH does not
perform well because it uses the notion of radius or diameter to control
the boundary of a cluster.
42
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
(1999)
• CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
• Measures the similarity based on a dynamic model
― Two clusters are merged only if the interconnectivity and closeness
(proximity) between two clusters are high relative to the internal
interconnectivity of the clusters and closeness of items within the clusters
• Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a large number of
relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine
clusters by repeatedly combining these sub-clusters
43
CHAMELEON: Hierarchical Clustering Using Dynamic Modeling
(1999)
• Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph,
where each vertex of the graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among the k-most similar objects to the
other.
• The edges are weighted to reflect the similarity between objects.
• Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor
graph into a large number of relatively small subclusters such that it minimizes the
edge cut.
• That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight
of the edges that would be cut should C be bisected into Ci and Cj .
• It assesses the absolute interconnectivity between clusters Ci and Cj .
44
CHAMELEON
• Chameleon then uses an agglomerative hierarchical clustering algorithm that
iteratively merges subclusters based on their similarity.
• To determine the pairs of most similar subclusters, it takes into account both
the interconnectivity and the closeness of the clusters.
• Specifically, Chameleon determines the similarity between each pair of clusters
Ci and Cj according to their relative interconnectivity, RI(Ci ,Cj), and their
relative closeness,RC(Ci ,Cj).
45
CHAMELEON
• The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is
defined as the absolute interconnectivity between Ci and Cj, normalized with
respect to the internal interconnectivity of the two clusters, Ci and Cj. That is,

$$RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}$$

• where EC{Ci,Cj} is the edge cut as previously defined for a cluster containing both Ci
and Cj.
• Similarly, ECCi (or ECCj) is the minimum sum of the cut edges that partition Ci (or
Cj) into two roughly equal parts.
46
CHAMELEON
• The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj,
is the absolute closeness between Ci and Cj, normalized with
respect to the internal closeness of the two clusters, Ci and Cj. It is
defined as

$$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}$$

• where S̄EC{Ci,Cj} is the average weight of the edges that connect
vertices in Ci to vertices in Cj, and S̄ECCi (or S̄ECCj) is the average
weight of the edges that belong to the min-cut bisector of cluster Ci
(or Cj).
47
CHAMELEON
• Chameleon has been shown to have greater power at discovering
arbitrarily shaped clusters of high quality than several well-known
algorithms such as BIRCH and density based DBSCAN.
• However, the processing cost for high-dimensional data may require
O(n2)time for n objects in the worst case.
48
Overall Framework of CHAMELEON
49
(Figure: overall framework of CHAMELEON. From the data set, construct a sparse k-NN
graph, in which p and q are connected if q is among the top k closest neighbors of p;
partition the graph into many relatively small sub-clusters; then merge the partitions into
the final clusters using relative interconnectivity (connectivity of c1 and c2 over their
internal connectivity) and relative closeness (closeness of c1 and c2 over their internal
closeness).)
CHAMELEON (Clustering Complex Objects)
50
Density-Based Methods
51
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-connected
points
• Major features:
―Discover clusters of arbitrary shape
―Handle noise
―One scan
―Need density parameters as termination condition
• Several interesting studies:
―DBSCAN: Ester, et al. (KDD’96)
―OPTICS: Ankerst, et al (SIGMOD’99).
―DENCLUE: Hinneburg & D. Keim (KDD’98)
―CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
52
Density-Based Clustering: Basic Concepts
• Two parameters:
―ϵ-neighbourhood: Maximum radius of the neighbourhood
―MinPts: Minimum number of points in an ϵ -neighbourhood of that point
• NEps(p): {q ∈ D | dist(p, q) ≤ Eps}, the Eps-neighbourhood of a point p
• Directly density-reachable: A point p is directly density-reachable from a point q
w.r.t. ϵ-neighbourhood, MinPts if
―p belongs to NEps(q)
―core point condition:
|NEps (q)| ≥ MinPts
53
(Figure: MinPts = 5, Eps = 1 cm; point p lies within the Eps-neighbourhood of the core point q.)
Density-Reachable and Density-Connected
• Density-reachable:
―A point p is density-reachable from a
point q w.r.t. ϵ-neighbourhood, MinPts if
there is a chain of points p1, …, pn, p1 = q,
pn = p such that pi+1 is directly density-
reachable from pi
• Density-connected
―A point p is density-connected to a point
q w.r.t. ϵ-neighbourhood, MinPts if there
is a point o such that both, p and q are
density-reachable from o w.r.t. ϵ-
neighbourhood and MinPts
54
(Figure: p is density-reachable from q through a chain of points p1, …, pn; p and q are density-connected through a common point o.)
DBSCAN: Density-Based Spatial Clustering of Applications with
Noise
• Relies on a density-based notion of cluster: A cluster is defined as a maximal set
of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
55
(Figure: core, border, and outlier (noise) points for Eps = 1 cm, MinPts = 5.)
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. ϵ-neighbourhood and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p and DBSCAN visits
the next point of the database
• Continue the process until all of the points have been processed
56
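A quick way to try this out is scikit-learn's DBSCAN, where `eps` is the neighbourhood radius and `min_samples` plays the role of MinPts; the data and parameter values below are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(300, 2))
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)   # label -1 marks noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```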
57
DBSCAN: Sensitive to Parameters
58
Disadvantages of DBSCAN
• The user has the responsibility of selecting parameter values that will lead to the
discovery of acceptable clusters.
• Real-world, high-dimensional data sets often have very skewed distributions, such
that their intrinsic clustering structure may not be well characterized by a single set
of global density parameters.
59
OPTICS: A Cluster-Ordering Method (1999)
• OPTICS: Ordering Points To Identify the Clustering Structure
―Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
―Produces a special order of the database wrt its density-based clustering
structure
―This is a linear list of all objects under analysis and represents the density-
based clustering structure of the data.
―Objects in a denser cluster are listed closer to each other in the cluster
ordering.
―This cluster-ordering contains info equiv to the density-based clusterings
corresponding to a broad range of parameter settings
―Good for both automatic and interactive cluster analysis, including finding
intrinsic clustering structure.
60
OPTICS: Some Extension from DBSCAN
• This ordering is equivalent to density-based clustering obtained from
a wide range of parameter settings.
• OPTICS does not require the user to provide a specific density
threshold.
• The cluster ordering can be used to extract basic clustering
information (e.g., cluster centers, or arbitrary-shaped clusters), derive
the intrinsic clustering structure, as well as provide a visualization of
the clustering.
61
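scikit-learn also provides an OPTICS implementation that exposes the cluster ordering and reachability values; the data and parameters below are illustrative.

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.default_rng(0).normal(size=(300, 2))
model = OPTICS(min_samples=5, xi=0.05).fit(X)         # no global density threshold required
labels = model.labels_
reachability = model.reachability_[model.ordering_]   # basis for a reachability plot
```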
62
63
Density-Based Clustering: OPTICS & Its Applications
64
DENCLUE: Clustering Based on Density
Distribution Functions
• DENCLUE (DENsity-based CLUstEring) is a clustering method based on
a set of density distribution functions.
• In DBSCAN and OPTICS, density is calculated by counting the number
of objects in a neighborhood defined by a radius parameter,ϵ. Such
density estimates can be highly sensitive to the radius value used.
• To overcome this problem, kernel density estimation can be used,
which is a nonparametric density estimation approach from statistics.
65
66
Denclue: Technical Essence
• Uses grid cells but only keeps information about grid cells that do actually contain data points and
manages these cells in a tree-based access structure
• Influence function: describes the impact of a data point within its neighborhood
• Overall density of the data space can be calculated as the sum of the influence function of all data
points
• Clusters can be determined mathematically by identifying density attractors
• Density attractors are local maxima of the overall density function
• Center-defined clusters: assign to each density attractor the points density-attracted to it
• Arbitrary shaped cluster: merge density attractors that are connected through paths of high
density (> threshold)
67
68
Grid-Based Methods
68
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
―STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz
(1997)
―WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
• A multi-resolution clustering approach using wavelet method
―CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
69
STING: A Statistical Information Grid Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution
70
The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller cells in the next
lower level
• Statistical info of each cell is calculated and stored beforehand and is used to
answer queries
• Parameters of higher-level cells can be easily calculated from the parameters of lower-
level cells
―count, mean, standard deviation (s), min, max
―type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of cells
• For each cell in the current level compute the confidence interval
71
STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When finish examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
―Query-independent, easy to parallelize, incremental update
―O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
―All the cluster boundaries are either horizontal or vertical, and no diagonal
boundary is detected
72
73
CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high dimensional data space that allow better clustering
than original space
• CLIQUE can be considered as both density-based and grid-based
―It partitions each dimension into the same number of equal-length intervals
―It partitions an m-dimensional data space into non-overlapping rectangular units
―A unit is dense if the fraction of total data points contained in the unit exceeds the input
model parameter
―A cluster is a maximal set of connected dense units within a subspace
74
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the
partition.
• Identify the candidate search space
―Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
―Determine dense units in all subspaces of interest
―Determine connected dense units in all subspaces of interest.
• Generate minimal description for the clusters
―Determine maximal regions that cover a cluster of connected dense units for each
cluster
―Determination of minimal cover for each cluster
• CLIQUE performs clustering in two steps.
• In the first step, CLIQUE partitions the d-dimensional data space into
nonoverlapping rectangular units, identifying the dense units among
these.
• CLIQUE finds dense cells in all of the subspaces.
• To do so, CLIQUE partitions every dimension into intervals, and
identifies intervals containing at least l points, where l is the density
threshold.
75
• In the second step, CLIQUE uses the dense cells in each subspace to
assemble clusters, which can be of arbitrary shape.
• The idea is to apply the Minimum Description Length (MDL) to use
the maximal regions to cover connected dense cells,
• where a maximal region is a hyperrectangle where every cell falling
into this region is dense, and the region cannot be extended further
in any dimension in the subspace
76
77
(Figure: dense units found with density threshold τ = 3 in the (age, salary) subspace, with
salary in units of $10,000, and in the (age, vacation) subspace, with vacation in weeks; the
dense intervals on age (roughly 30–50) intersect to give a candidate dense region in the
3-D (age, salary, vacation) space.)
78
Strength and Weakness of CLIQUE
• Strength
―automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
―insensitive to the order of records in input and does not presume some
canonical data distribution
―scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
―The accuracy of the clustering result may be degraded at the expense of
simplicity of the method
79
Evaluation of Clustering
79
Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins Statistic
―Given a dataset D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
―Sample n points, p1, …, pn, uniformly from the data space of D. For each pi, find its nearest neighbor in D:
xi = min{dist (pi, v)} where v in D
―Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}:
yi = min{dist (qi, v)} where v in D and v ≠ qi
―Calculate the Hopkins Statistic:
H = ∑ yi / (∑ xi + ∑ yi)
―If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H is close to 0.5. If D
is highly skewed, H is close to 0
80
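A sketch of the Hopkins statistic following the procedure above; the helper name `hopkins` is illustrative, and the points pi are drawn uniformly from the bounding box of the data space, which matches the stated behavior (H ≈ 0.5 for uniform data, H close to 0 for highly skewed data).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, n=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)

    # x_i: nearest-neighbour distance in D of n points sampled uniformly
    # from the bounding box of the data space.
    p = rng.uniform(D.min(axis=0), D.max(axis=0), size=(n, D.shape[1]))
    x = NearestNeighbors(n_neighbors=1).fit(D).kneighbors(p)[0].ravel()

    # y_i: nearest-neighbour distance of n points sampled from D itself
    # (take the second neighbour to exclude the point q_i).
    q = D[rng.choice(len(D), size=n, replace=False)]
    y = NearestNeighbors(n_neighbors=2).fit(D).kneighbors(q)[0][:, 1]

    return y.sum() / (x.sum() + y.sum())
```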
Determine the Number of Clusters (I)
• Empirical method
―# of clusters ≈ √(n/2) for a dataset of n points, with each cluster having about √(2n) points.
• Elbow method
―Use the turning point in the curve of
sum of within cluster variance w.r.t the
# of clusters
81
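A minimal elbow-method sketch using scikit-learn's KMeans, whose `inertia_` attribute is the within-cluster sum of squared distances; the data and the range of k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)     # within-cluster sum of squared distances
# Plot k against inertias (e.g. with matplotlib) and pick k at the turning point ("elbow").
```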
Determine the Number of Clusters(II)
• Cross validation method
―Divide a given data set into m parts
―Use m – 1 parts to obtain a clustering model
―Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use the sum of squared
distance between all points in the test set and the closest centroids to measure how well
the model fits the test set
―For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and
find # of clusters that fits the data the best
82
Determine the Number of Clusters(III)
• The average silhouette approach determines how well each
object lies within its cluster. A high average silhouette width
indicates a good clustering.
• The average silhouette method computes the average silhouette
of observations for different values of k. The optimal number
of clusters k is the one that maximizes the average silhouette
over a range of possible values for k (Kaufman and
Rousseeuw, 1990).
• The procedure is similar to the elbow method and can be
computed as follows:
― Compute clustering algorithm (e.g., k-means clustering) for different
values of k. For instance, by varying k from 1 to 10 clusters
― For each k, calculate the average silhouette of observations (avg.sil)
― Plot the curve of avg.sil according to the number of clusters k.
― The location of the maximum is considered as the appropriate number
of clusters.
83
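A sketch of the average-silhouette procedure using scikit-learn's `silhouette_score`; the data and the range of k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))
avg_sil = {}
for k in range(2, 11):                               # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    avg_sil[k] = silhouette_score(X, labels)
best_k = max(avg_sil, key=avg_sil.get)               # k that maximizes the average silhouette
```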
Measuring Clustering Quality
• Two methods: extrinsic vs. intrinsic
• Extrinsic: supervised, i.e., the ground truth is available
―Compare a clustering against the ground truth using certain clustering quality
measure
―Ex. BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
―Evaluate the goodness of a clustering by considering how well the clusters are
separated, and how compact the clusters are
―Ex. Silhouette coefficient
84
85
Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
―Cluster homogeneity: the purer, the better
―Cluster completeness: should assign objects belong to the same category in
the ground truth to the same cluster
―Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
―Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces
86

More Related Content

What's hot

2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagationKrish_ver2
 
Reinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningReinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningSeung Jae Lee
 
Clustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesClustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesKush Kulshrestha
 
Classification by back propagation, multi layered feed forward neural network...
Classification by back propagation, multi layered feed forward neural network...Classification by back propagation, multi layered feed forward neural network...
Classification by back propagation, multi layered feed forward neural network...bihira aggrey
 
Intro to Machine Learning & AI
Intro to Machine Learning & AIIntro to Machine Learning & AI
Intro to Machine Learning & AIMostafa Elsheikh
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning BasicsSuresh Arora
 
Deep Learning Tutorial
Deep Learning TutorialDeep Learning Tutorial
Deep Learning TutorialAmr Rashed
 
Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat omarodibat
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and RegressionMegha Sharma
 
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...Simplilearn
 
Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Grigory Sapunov
 
Machine learning ppt
Machine learning ppt Machine learning ppt
Machine learning ppt Poojamanic
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
 
Concept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithmConcept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithmswapnac12
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMPuneet Kulyana
 
Machine learning
Machine learningMachine learning
Machine learningeonx_32
 

What's hot (20)

2.5 backpropagation
2.5 backpropagation2.5 backpropagation
2.5 backpropagation
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Reinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference LearningReinforcement Learning 6. Temporal Difference Learning
Reinforcement Learning 6. Temporal Difference Learning
 
Clustering - Machine Learning Techniques
Clustering - Machine Learning TechniquesClustering - Machine Learning Techniques
Clustering - Machine Learning Techniques
 
Classification by back propagation, multi layered feed forward neural network...
Classification by back propagation, multi layered feed forward neural network...Classification by back propagation, multi layered feed forward neural network...
Classification by back propagation, multi layered feed forward neural network...
 
Intro to Machine Learning & AI
Intro to Machine Learning & AIIntro to Machine Learning & AI
Intro to Machine Learning & AI
 
Machine learning
Machine learning Machine learning
Machine learning
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Deep Learning Tutorial
Deep Learning TutorialDeep Learning Tutorial
Deep Learning Tutorial
 
Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat
 
Classification and Regression
Classification and RegressionClassification and Regression
Classification and Regression
 
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
Naive Bayes Classifier | Naive Bayes Algorithm | Naive Bayes Classifier With ...
 
Web usage mining
Web usage miningWeb usage mining
Web usage mining
 
Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018
 
Machine learning ppt
Machine learning ppt Machine learning ppt
Machine learning ppt
 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
 
Concept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithmConcept learning and candidate elimination algorithm
Concept learning and candidate elimination algorithm
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHM
 
Machine learning
Machine learningMachine learning
Machine learning
 
Decision tree learning
Decision tree learningDecision tree learning
Decision tree learning
 

Similar to UNIT_V_Cluster Analysis.pptx

Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit vmalathieswaran29
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptxRvishnupriya2
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptSubrata Kumer Paul
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapterNaveenKumar5162
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasicengrasi
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Salah Amean
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptxJK970901
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdfbintis1
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10mqasimsheikh5
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 

Similar to UNIT_V_Cluster Analysis.pptx (20)

Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
Data mining techniques unit v
Data mining techniques unit vData mining techniques unit v
Data mining techniques unit v
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
DM_clustering.ppt
DM_clustering.pptDM_clustering.ppt
DM_clustering.ppt
 
Cluster Analysis.pptx
Cluster Analysis.pptxCluster Analysis.pptx
Cluster Analysis.pptx
 
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.pptChapter 10. Cluster Analysis Basic Concepts and Methods.ppt
Chapter 10. Cluster Analysis Basic Concepts and Methods.ppt
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
CLUSTERING
CLUSTERINGCLUSTERING
CLUSTERING
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
data mining cocepts and techniques chapter
data mining cocepts and techniques chapterdata mining cocepts and techniques chapter
data mining cocepts and techniques chapter
 
10 clusbasic
10 clusbasic10 clusbasic
10 clusbasic
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
ClusetrigBasic.ppt
ClusetrigBasic.pptClusetrigBasic.ppt
ClusetrigBasic.ppt
 
Dataa miining
Dataa miiningDataa miining
Dataa miining
 
DS9 - Clustering.pptx
DS9 - Clustering.pptxDS9 - Clustering.pptx
DS9 - Clustering.pptx
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10Data mining concepts and techniques Chapter 10
Data mining concepts and techniques Chapter 10
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 

Recently uploaded

Digital C-Type Printing: Revolutionizing The Future Of Photographic Prints
Digital C-Type Printing: Revolutionizing The Future Of Photographic PrintsDigital C-Type Printing: Revolutionizing The Future Of Photographic Prints
Digital C-Type Printing: Revolutionizing The Future Of Photographic PrintsMatte Image
 
一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证
一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证
一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证khuurq8kz
 
SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)
SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)
SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)River / Thao Phan
 
Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155
Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155
Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155SaketCallGirlsCallUs
 
Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...
Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...
Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...Nitya salvi
 
SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)
SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)
SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)River / Thao Phan
 
Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...
Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...
Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...Sheetaleventcompany
 
Completed Event Presentation for Huma 1305
Completed Event Presentation for Huma 1305Completed Event Presentation for Huma 1305
Completed Event Presentation for Huma 1305jazlynjacobs51
 
Orai call girls 📞 8617370543At Low Cost Cash Payment Booking
Orai call girls 📞 8617370543At Low Cost Cash Payment BookingOrai call girls 📞 8617370543At Low Cost Cash Payment Booking
Orai call girls 📞 8617370543At Low Cost Cash Payment BookingNitya salvi
 
Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...
Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...
Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...Nitya salvi
 
Call Girls In Chattarpur | Contact Me ☎ +91-9953040155
Call Girls In Chattarpur | Contact Me ☎ +91-9953040155Call Girls In Chattarpur | Contact Me ☎ +91-9953040155
Call Girls In Chattarpur | Contact Me ☎ +91-9953040155SaketCallGirlsCallUs
 
WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...
WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...
WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...Nitya salvi
 
FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶
FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶
FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶delhimunirka15
 
HUMA Final Presentation About Chicano Culture
HUMA Final Presentation About Chicano CultureHUMA Final Presentation About Chicano Culture
HUMA Final Presentation About Chicano Culturekarinamercado2462
 
Sonbhadra Escorts 📞 8617370543 | Sonbhadra Call Girls
Sonbhadra  Escorts 📞 8617370543 | Sonbhadra Call GirlsSonbhadra  Escorts 📞 8617370543 | Sonbhadra Call Girls
Sonbhadra Escorts 📞 8617370543 | Sonbhadra Call GirlsNitya salvi
 
Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...
Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...
Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...delhimunirka15
 
Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155
Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155
Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155SaketCallGirlsCallUs
 
Storyboard short: Ferrarius Tries to Sing
Storyboard short: Ferrarius Tries to SingStoryboard short: Ferrarius Tries to Sing
Storyboard short: Ferrarius Tries to SingLyneSun
 
Turn Off The Air Con - The Singapore Punk Scene
Turn Off The Air Con - The Singapore Punk SceneTurn Off The Air Con - The Singapore Punk Scene
Turn Off The Air Con - The Singapore Punk SceneLuca Vergano
 
Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...
Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...
Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...Nitya salvi
 

Recently uploaded (20)

Digital C-Type Printing: Revolutionizing The Future Of Photographic Prints
Digital C-Type Printing: Revolutionizing The Future Of Photographic PrintsDigital C-Type Printing: Revolutionizing The Future Of Photographic Prints
Digital C-Type Printing: Revolutionizing The Future Of Photographic Prints
 
一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证
一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证
一比一原版美国西雅图大学毕业证(Seattle毕业证书)毕业证成绩单留信认证
 
SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)
SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)
SB_ Pretzel and the puppies_ Rough_ RiverPhan (2024)
 
Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155
Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155
Call Girls In Dilshad Garden | Contact Me ☎ +91-9953040155
 
Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...
Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...
Call Girls Sultanpur Just Call 📞 8617370543 Top Class Call Girl Service Avail...
 
SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)
SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)
SB_ Dragons Riders of Berk_ Rough_ RiverPhan (2024)
 
Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...
Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...
Call Girl In Chandigarh ☎ 08868886958✅ Just Genuine Call Call Girls Chandigar...
 
Completed Event Presentation for Huma 1305
Completed Event Presentation for Huma 1305Completed Event Presentation for Huma 1305
Completed Event Presentation for Huma 1305
 
Orai call girls 📞 8617370543At Low Cost Cash Payment Booking
Orai call girls 📞 8617370543At Low Cost Cash Payment BookingOrai call girls 📞 8617370543At Low Cost Cash Payment Booking
Orai call girls 📞 8617370543At Low Cost Cash Payment Booking
 
Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...
Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...
Just Call Vip call girls Farrukhabad Escorts ☎️8617370543 Two shot with one g...
 
Call Girls In Chattarpur | Contact Me ☎ +91-9953040155
Call Girls In Chattarpur | Contact Me ☎ +91-9953040155Call Girls In Chattarpur | Contact Me ☎ +91-9953040155
Call Girls In Chattarpur | Contact Me ☎ +91-9953040155
 
WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...
WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...
WhatsApp Chat: 📞 8617370543 Call Girls In Siddharth Nagar At Low Cost Cash Pa...
 
FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶
FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶
FULL ENJOY —📞9711106444 ✦/ Vℐℙ Call Girls in Jasola Vihar, | Delhi🫶
 
HUMA Final Presentation About Chicano Culture
HUMA Final Presentation About Chicano CultureHUMA Final Presentation About Chicano Culture
HUMA Final Presentation About Chicano Culture
 
Sonbhadra Escorts 📞 8617370543 | Sonbhadra Call Girls
Sonbhadra  Escorts 📞 8617370543 | Sonbhadra Call GirlsSonbhadra  Escorts 📞 8617370543 | Sonbhadra Call Girls
Sonbhadra Escorts 📞 8617370543 | Sonbhadra Call Girls
 
Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...
Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...
Pari Chowk Call Girls ☎️ ((#9711106444)), 💘 Full enjoy Low rate girl💘 Genuine...
 
Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155
Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155
Call Girls In Dwarka Mor | Contact Me ☎ +91-9953040155
 
Storyboard short: Ferrarius Tries to Sing
Storyboard short: Ferrarius Tries to SingStoryboard short: Ferrarius Tries to Sing
Storyboard short: Ferrarius Tries to Sing
 
Turn Off The Air Con - The Singapore Punk Scene
Turn Off The Air Con - The Singapore Punk SceneTurn Off The Air Con - The Singapore Punk Scene
Turn Off The Air Con - The Singapore Punk Scene
 
Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...
Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...
Azamgarh Call Girls WhatsApp Chat: 📞 8617370543 (24x7 ) Service Available Nea...
 

UNIT_V_Cluster Analysis.pptx

• 8. Considerations for Cluster Analysis
• Partitioning criteria
―Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
―Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
• Similarity measure
―Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
―Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
• 9. Major Clustering Approaches (I)
• Partitioning approach:
―Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
―Distance based
―May use a mean or medoid (etc.) to represent the cluster center
―Effective for small to medium-size data sets
―Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
―Create a hierarchical decomposition of the set of data (or objects) using some criterion
―Cannot correct erroneous merges or splits
―Top-down or bottom-up approach may be followed
―Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• 10. Major Clustering Approaches (II)
• Density-based approach:
―Based on connectivity and density functions
―Each point must have a minimum number of points within its neighborhood
―Can find arbitrarily shaped clusters
―May filter out outliers
―Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
―Based on a multiple-level granularity structure
―Fast processing
―Typical methods: STING, WaveCluster, CLIQUE
• 12. Partitioning Algorithms: Basic Concept
• Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances
  E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2
  is minimized (where c_i is the centroid or medoid of cluster C_i)
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
―Global optimum: exhaustively enumerate all partitions
―Heuristic methods: the k-means and k-medoids algorithms
―k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
―k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
• 13. The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment no longer changes
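To make the four steps concrete, here is a minimal NumPy sketch of Lloyd's k-means. It is an illustrative implementation, not the textbook's code; the function and variable names are my own.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal Lloyd's k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: start from k distinct objects chosen at random as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned objects.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids (equivalently, the assignment) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```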
• 15. An Example of K-Means Clustering (K = 2)
―Arbitrarily partition the objects into k nonempty groups and compute the initial centroids from the initial data set
―Repeat: compute the centroid (i.e., mean point) of each partition, assign each object to the cluster of its nearest centroid, and update the cluster centroids
―Loop (reassign objects and update centroids) until no assignment changes
• 16.
• The difference between an object p ∈ C_i and c_i, the representative of the cluster, is measured by dist(p, c_i), where dist(x, y) is the Euclidean distance between two points x and y.
• The quality of cluster C_i can be measured by the within-cluster variation, which is the sum of squared errors between all objects in C_i and the centroid c_i, defined as
  E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2
• 17. Problem
For the data set {1, 2, 3, 8, 9, 10, 25}, cluster the points using the k-means algorithm with k = 2.
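One way to check a hand calculation is to run the algorithm. The sketch below uses scikit-learn's KMeans on the seven 1-D points; note that the final partition depends on the initial seeds, since both {1, 2, 3} vs. {8, 9, 10, 25} and {1, 2, 3, 8, 9, 10} vs. {25} are fixed points of the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([1, 2, 3, 8, 9, 10, 25], dtype=float).reshape(-1, 1)

# n_init=10 restarts k-means from several random seeds and keeps the
# partition with the lowest within-cluster sum of squared errors E.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for label in range(2):
    members = X[km.labels_ == label].ravel()
    print(f"cluster {label}: {members}, centroid = {km.cluster_centers_[label][0]:.2f}")
```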
• 18. Comments on the K-Means Method
• Strength: efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
―For comparison: PAM is O(k(n-k)^2) per iteration, and CLARA is O(ks^2 + k(n-k))
• Comment: often terminates at a local optimum
• Weaknesses
―Applicable only to objects in a continuous n-dimensional space
 • Use the k-modes method for categorical data
 • In comparison, k-medoids can be applied to a wider range of data
―Need to specify k, the number of clusters, in advance (there are ways to determine the best k automatically; see Hastie et al., 2009)
―Sensitive to noisy data and outliers
―Not suitable for discovering clusters with non-convex shapes
• 19. Variations of the K-Means Method
• Most variants of k-means differ in
―Selection of the initial k means
―Dissimilarity calculations
―Strategies for calculating cluster means
• Handling categorical data: k-modes
―Replaces the means of clusters with modes
―Uses new dissimilarity measures to deal with categorical objects
―Uses a frequency-based method to update the modes of clusters
―For a mixture of categorical and numerical data: the k-prototype method
• 20. What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
―Since an object with an extremely large value may substantially distort the distribution of the data
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
• 22. PAM (Partitioning Around Medoids): A Typical K-Medoids Algorithm (K = 2)
―Arbitrarily choose k objects as the initial medoids
―Assign each remaining object to the nearest medoid
―Randomly select a non-medoid object, O_random, and compute the total cost of swapping a medoid with O_random
―Swap the medoid with O_random if the quality is improved
―Loop until no change
• 23. Minimizing Dissimilarity
• The partitioning is then performed based on the principle of minimizing the sum of the dissimilarities between each object p and its corresponding representative object. That is, an absolute-error criterion is used, defined as
  E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, o_i)
  where E is the sum of the absolute error for all objects p in the data set, and o_i is the representative object of C_i.
• This is the basis for the k-medoids method, which groups n objects into k clusters by minimizing the absolute error.
• 24. The K-Medoids Clustering Method
• K-medoids clustering: find representative objects (medoids) in clusters
―PAM (Partitioning Around Medoids; Kaufmann & Rousseeuw, 1987)
 • Starts from an initial set of medoids and iteratively replaces one of the medoids with a non-medoid if doing so improves the total distance of the resulting clustering
 • PAM works effectively for small data sets but does not scale well to large data sets (due to its computational complexity)
• Efficiency improvements on PAM
―CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
―CLARANS (Ng & Han, 1994): randomized re-sampling
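The swap-based search can be sketched in a few lines. This is a simplified, greedy PAM-style sketch that evaluates the absolute-error criterion E for every candidate (medoid, non-medoid) swap; it is illustrative only and does not use PAM's incremental cost bookkeeping.

```python
import numpy as np

def absolute_error(X, medoid_idx):
    """E = sum over all objects of the distance to their nearest medoid."""
    medoids = X[medoid_idx]                                   # (k, d)
    d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam_like(X, k, seed=0):
    """Greedy PAM-style search: try every (medoid, non-medoid) swap, keep improvements."""
    rng = np.random.default_rng(seed)
    medoid_idx = list(rng.choice(len(X), size=k, replace=False))
    best = absolute_error(X, np.array(medoid_idx))
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o in range(len(X)):
                if o in medoid_idx:
                    continue
                candidate = medoid_idx.copy()
                candidate[i] = o                              # swap medoid i with non-medoid o
                cost = absolute_error(X, np.array(candidate))
                if cost < best:                               # keep the swap only if E decreases
                    best, medoid_idx, improved = cost, candidate, True
    return medoid_idx, best
```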
• 25. CLARA (Clustering LARge Applications)
• Instead of taking the whole data set into consideration, CLARA uses a random sample of the data set.
• The PAM algorithm is then applied to compute the best medoids from the sample.
• Ideally, the sample should closely represent the original data set. In many cases, a large sample works well if it is created so that each object has an equal probability of being selected into the sample.
• CLARA builds clusterings from multiple random samples and returns the best clustering as the output.
• The complexity of computing the medoids on a random sample is O(ks^2 + k(n - k)), where s is the size of the sample, k is the number of clusters, and n is the total number of objects. CLARA can deal with larger data sets than PAM.
• 27. Hierarchical Clustering
• Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
• Agglomerative (AGNES, AGglomerative NESting) proceeds bottom-up: starting from singleton clusters {a}, {b}, {c}, {d}, {e}, it merges step by step (e.g., {a,b}, {d,e}, {c,d,e}) until all objects form one cluster {a,b,c,d,e}.
• Divisive (DIANA, DIvisive ANAlysis) proceeds top-down in the reverse order, splitting the single all-inclusive cluster until each object stands alone.
• 28. AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Uses the single-link method and the dissimilarity matrix
―Each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
• Merges the nodes that have the least dissimilarity
• Continues in a non-descending fashion
• Eventually all nodes belong to the same cluster
• 29. DIANA (DIvisive ANAlysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Inverse order of AGNES; eventually each node forms a cluster on its own
• All of the objects are used to form one initial cluster
• The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster
• The cluster-splitting process repeats until, eventually, each new cluster contains only a single object
• 30. Dendrogram: Shows How Clusters Are Merged
―Decomposes the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
―A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
• 31. Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min d(t_ip, t_jq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max d(t_ip, t_jq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = avg d(t_ip, t_jq)
• Centroid: distance between the centroids of the two clusters, i.e., dist(K_i, K_j) = d(C_i, C_j)
• Medoid: distance between the medoids of the two clusters, i.e., dist(K_i, K_j) = d(M_i, M_j)
―Medoid: a chosen, centrally located object in the cluster
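These inter-cluster distances correspond to the linkage options of standard hierarchical-clustering libraries. A brief sketch, assuming SciPy is available; the toy 2-D data is made up for illustration (medoid linkage is not built into SciPy).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# 'single', 'complete', 'average', and 'centroid' mirror the distances defined above.
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # bottom-up (AGNES-style) merge history
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
    print(method, labels)
```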
• 32. Centroid, Radius and Diameter of a Cluster (for numerical data sets)
• Centroid: the "middle" of a cluster,
  C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}
• Radius: square root of the average distance from any point of the cluster to its centroid,
  R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}
• Diameter: square root of the average mean squared distance between all pairs of points in the cluster,
  D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jp})^2}{N(N-1)}}
• 33. Extensions to Hierarchical Clustering
• Major weaknesses of agglomerative clustering methods
―Can never undo what was done previously
―Do not scale well: time complexity of at least O(n^2), where n is the total number of objects
• Integration of hierarchical and distance-based clustering
―BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
―CHAMELEON (1999): hierarchical clustering using dynamic modeling
• 34. BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD'96
• Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
―Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
―Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
• Weaknesses: handles only numeric data, and is sensitive to the order of the data records
• 35. BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
• BIRCH uses the notion of a clustering feature to summarize a cluster, and a clustering feature tree (CF-tree) to represent a cluster hierarchy.
• These structures help the clustering method achieve good speed and scalability in large or even streaming databases, and also make it effective for incremental and dynamic clustering of incoming objects.
• 36. BIRCH: Clustering Feature (CF)
• Consider a cluster of n d-dimensional data objects or points.
• The clustering feature (CF) of the cluster is a 3-D vector summarizing information about the cluster. It is defined as
  CF = \{n, LS, SS\}
  where LS is the linear sum of the n points (i.e., \sum_{i=1}^{n} x_i) and SS is the square sum of the data points (i.e., \sum_{i=1}^{n} x_i^2).
• A clustering feature is essentially a summary of the statistics for the given cluster. Using a clustering feature, we can easily derive many useful statistics of a cluster.
• 37. BIRCH
• The cluster's centroid x_0, radius R, and diameter D can be derived directly from CF = {n, LS, SS}:
  x_0 = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}
  R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - x_0)^2}{n}} = \sqrt{\frac{n \cdot SS - LS^2}{n^2}}
  D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n \cdot SS - 2\,LS^2}{n(n-1)}}
  (For multidimensional data, LS^2 denotes the dot product LS \cdot LS.)
• 38. BIRCH: Summarizing a Cluster Using the Clustering Feature
• Summarizing a cluster by its clustering feature avoids storing detailed information about the individual objects or points.
• Instead, we only need a constant amount of space to store the clustering feature. This is the key to BIRCH's efficiency in space.
• Moreover, clustering features are additive.
―That is, for two disjoint clusters C1 and C2 with clustering features CF1 = {n1, LS1, SS1} and CF2 = {n2, LS2, SS2}, the clustering feature for the cluster formed by merging C1 and C2 is simply CF1 + CF2 = {n1 + n2, LS1 + LS2, SS1 + SS2}.
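A small sketch of the idea (illustrative helper names, not BIRCH's actual implementation): compute a clustering feature for a block of points, merge two CFs additively, and recover the centroid and radius from the summary alone.

```python
import numpy as np

def clustering_feature(X):
    """CF = (n, LS, SS) for an (n, d) block of points."""
    n = len(X)
    LS = X.sum(axis=0)                 # linear sum (a d-dimensional vector)
    SS = (X ** 2).sum()                # square sum (a scalar)
    return n, LS, SS

def merge(cf1, cf2):
    """CFs are additive: merging two disjoint clusters just adds their CFs."""
    n1, LS1, SS1 = cf1
    n2, LS2, SS2 = cf2
    return n1 + n2, LS1 + LS2, SS1 + SS2

def centroid_radius(cf):
    n, LS, SS = cf
    x0 = LS / n                                        # centroid
    R = np.sqrt(max(SS / n - np.dot(x0, x0), 0.0))     # radius = sqrt(SS/n - ||x0||^2)
    return x0, R

X1 = np.array([[1.0, 1.0], [2.0, 2.0]])
X2 = np.array([[3.0, 3.0]])
cf = merge(clustering_feature(X1), clustering_feature(X2))
print(centroid_radius(cf))   # same result as computing the CF of all three points at once
```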
• 39. BIRCH
• A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
• The non-leaf nodes store sums of the CFs of their children, and thus summarize the clustering information about their children.
• A CF-tree has two parameters: the branching factor B and the threshold T.
―The branching factor specifies the maximum number of children per non-leaf node.
―The threshold specifies the maximum diameter of the subclusters stored at the leaf nodes of the tree.
• These two parameters implicitly control the resulting tree's size.
• 40. The CF-Tree Structure
―(Figure: a CF-tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold CF entries with child pointers, and the leaf nodes hold subcluster CFs chained by prev/next pointers.)
• 41. The BIRCH Algorithm
• Cluster diameter:
  D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n \cdot SS - 2\,LS^2}{n(n-1)}}
• For each point in the input
―Find the closest leaf entry
―Add the point to that leaf entry and update its CF
―If the entry diameter > max_diameter, split the leaf, and possibly its parents
• The algorithm is O(n)
• Concerns
―Sensitive to the insertion order of the data points
―Since the size of leaf nodes is fixed, the clusters may not be very natural
―Clusters tend to be spherical given the radius and diameter measures
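For reference, scikit-learn ships a BIRCH implementation whose parameters roughly map onto the threshold T and branching factor B described above. A minimal usage sketch on made-up toy data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two Gaussian blobs as toy input.
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),
               rng.normal(5, 0.5, size=(100, 2))])

# threshold plays the role of T (limit on subcluster size at the leaves),
# branching_factor plays the role of B; n_clusters runs a global clustering
# over the leaf subclusters (BIRCH's Phase 2).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(np.bincount(model.labels_))
```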
• 42. BIRCH: Effectiveness
• The time complexity of the algorithm is O(n), where n is the number of objects to be clustered.
• Experiments have shown the linear scalability of the algorithm with respect to the number of objects, and good quality of clustering of the data.
• However, since each node in a CF-tree can hold only a limited number of entries due to its size, a CF-tree node does not always correspond to what a user may consider a natural cluster.
• Moreover, if the clusters are not spherical in shape, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.
• 43. CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
• CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
• Measures the similarity based on a dynamic model
―Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
• Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
• 44. CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
• Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each vertex represents a data object, and there is an edge between two vertices (objects) if one object is among the k most similar objects to the other.
• The edges are weighted to reflect the similarity between objects.
• Chameleon uses a graph-partitioning algorithm to partition the k-nearest-neighbor graph into a large number of relatively small subclusters such that it minimizes the edge cut.
―That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the weight of the edges that would be cut should C be bisected into Ci and Cj.
―This edge cut assesses the absolute interconnectivity between clusters Ci and Cj.
• 45. CHAMELEON
• Chameleon then uses an agglomerative hierarchical clustering algorithm that iteratively merges subclusters based on their similarity.
• To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters.
• Specifically, Chameleon determines the similarity between each pair of clusters Ci and Cj according to their relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj).
• 46. CHAMELEON
• The relative interconnectivity, RI(Ci, Cj), between two clusters Ci and Cj is defined as the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters. That is,
  RI(C_i, C_j) = \frac{|EC_{\{C_i, C_j\}}|}{\tfrac{1}{2}\left(|EC_{C_i}| + |EC_{C_j}|\right)}
• Here EC_{Ci,Cj} is the edge cut, as previously defined, for a cluster containing both Ci and Cj.
• Similarly, EC_Ci (or EC_Cj) is the minimum sum of the cut edges that partition Ci (or Cj) into two roughly equal parts.
• 47. CHAMELEON
• The relative closeness, RC(Ci, Cj), between a pair of clusters Ci and Cj is the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters. It is defined as
  RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}
• Here \bar{S}_{EC_{\{C_i,C_j\}}} is the average weight of the edges that connect vertices in Ci to vertices in Cj, and \bar{S}_{EC_{C_i}} (or \bar{S}_{EC_{C_j}}) is the average weight of the edges that belong to the min-cut bisector of cluster Ci (or Cj).
• 48. CHAMELEON
• Chameleon has been shown to have greater power at discovering arbitrarily shaped clusters of high quality than several well-known algorithms, such as BIRCH and the density-based DBSCAN.
• However, the processing cost for high-dimensional data may require O(n^2) time for n objects in the worst case.
• 49. Overall Framework of CHAMELEON
―(Figure: data set → construct a sparse k-NN graph → partition the graph → merge partitions → final clusters.)
―k-NN graph: p and q are connected if q is among the top-k closest neighbors of p
―Relative interconnectivity: connectivity of c1 and c2 over their internal connectivity
―Relative closeness: closeness of c1 and c2 over their internal closeness
• 52. Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
―Discover clusters of arbitrary shape
―Handle noise
―One scan
―Need density parameters as a termination condition
• Several interesting studies:
―DBSCAN: Ester, et al. (KDD'96)
―OPTICS: Ankerst, et al. (SIGMOD'99)
―DENCLUE: Hinneburg & D. Keim (KDD'98)
―CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)
• 53. Density-Based Clustering: Basic Concepts
• Two parameters:
―Eps (ϵ): the maximum radius of the neighbourhood
―MinPts: the minimum number of points in an ϵ-neighbourhood of a point
• The ϵ-neighbourhood of a point p: NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
―p belongs to NEps(q), and
―q satisfies the core point condition: |NEps(q)| ≥ MinPts
• (Illustration: MinPts = 5, Eps = 1 cm)
• 54. Density-Reachable and Density-Connected
• Density-reachable:
―A point p is density-reachable from a point q w.r.t. Eps and MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi
• Density-connected:
―A point p is density-connected to a point q w.r.t. Eps and MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
• 55. DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
• (Illustration: core, border, and outlier points for Eps = 1 cm and MinPts = 5)
• 56. DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
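A brief usage sketch with scikit-learn's DBSCAN, where eps and min_samples correspond to Eps and MinPts; the two-moons data is just an example of arbitrarily shaped clusters and is not from the slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ Eps, min_samples ~ MinPts
labels = db.labels_                          # label -1 marks noise (outliers)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} clusters, {np.sum(labels == -1)} noise points")
```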
• 58. DBSCAN: Sensitive to Parameters (figure)
• 59. Disadvantages of DBSCAN
• The user carries the responsibility of selecting parameter values that will lead to the discovery of acceptable clusters.
• Real-world, high-dimensional data sets often have very skewed distributions, so their intrinsic clustering structure may not be well characterized by a single set of global density parameters.
• 60. OPTICS: A Cluster-Ordering Method (1999)
• OPTICS: Ordering Points To Identify the Clustering Structure
―Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
―Produces a special order of the database w.r.t. its density-based clustering structure
―This is a linear list of all objects under analysis and represents the density-based clustering structure of the data
―Objects in a denser cluster are listed closer to each other in the cluster ordering
―This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
―Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure
• 61. OPTICS: Some Extensions from DBSCAN
• This ordering is equivalent to the density-based clusterings obtained from a wide range of parameter settings.
• OPTICS does not require the user to provide a specific density threshold.
• The cluster ordering can be used to extract basic clustering information (e.g., cluster centers, or arbitrarily shaped clusters), derive the intrinsic clustering structure, and provide a visualization of the clustering.
• 64. Density-Based Clustering: OPTICS and Its Applications (figure)
• 65. DENCLUE: Clustering Based on Density Distribution Functions
• DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density distribution functions.
• In DBSCAN and OPTICS, density is calculated by counting the number of objects in a neighborhood defined by a radius parameter ϵ. Such density estimates can be highly sensitive to the radius value used.
• To overcome this problem, kernel density estimation can be used, which is a nonparametric density-estimation approach from statistics.
• 67. DENCLUE: Technical Essence
• Uses grid cells, but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
• Influence function: describes the impact of a data point within its neighborhood
• The overall density of the data space can be calculated as the sum of the influence functions of all data points
• Clusters can then be determined mathematically by identifying density attractors
• Density attractors are local maxima of the overall density function
• Center-defined clusters: assign to each density attractor the points density-attracted to it
• Arbitrarily shaped clusters: merge density attractors that are connected through paths of high density (> threshold)
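To illustrate the idea of influence functions and density attractors, here is a tiny sketch (my own simplification, not the DENCLUE implementation) that sums Gaussian influence functions over a 1-D data set and locates the attractors as local maxima of the overall density on a grid.

```python
import numpy as np

data = np.array([1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 9.0])   # toy 1-D points
sigma = 0.5                                             # kernel width (plays the role of the radius)

def density(x):
    """Overall density = sum of Gaussian influence functions of all data points."""
    return np.exp(-((x[:, None] - data[None, :]) ** 2) / (2 * sigma ** 2)).sum(axis=1)

grid = np.linspace(0, 10, 1001)
f = density(grid)

# Density attractors: grid points that are local maxima of the overall density.
attractors = grid[1:-1][(f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])]
print(np.round(attractors, 2))   # roughly one attractor per group of points
```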
• 69. Grid-Based Clustering Methods
• Use a multi-resolution grid data structure
• Several interesting methods
―STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
―WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
―CLIQUE: Agrawal, et al. (SIGMOD'98): both grid-based and subspace clustering
• 70. STING: A Statistical Information Grid Approach
• Wang, Yang and Muntz (VLDB'97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution
• 71. The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller cells at the next lower level
• Statistical information for each cell is calculated and stored beforehand and is used to answer queries
• Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells
―count, mean, standard deviation, min, max
―type of distribution: normal, uniform, etc.
• Uses a top-down approach to answer spatial data queries
• Starts from a pre-selected layer, typically one with a small number of cells
• For each cell in the current level, computes the confidence interval
• 72. STING Algorithm and Its Analysis
• Remove the irrelevant cells from further consideration
• When finished examining the current layer, proceed to the next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
―Query-independent, easy to parallelize, incremental update
―O(K), where K is the number of grid cells at the lowest level
• Disadvantages:
―All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
• 73. CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based
―It partitions each dimension into the same number of equal-length intervals
―It partitions an m-dimensional data space into non-overlapping rectangular units
―A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
―A cluster is a maximal set of connected dense units within a subspace
• 74. CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition
• Identify the candidate search space
―Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
―Determine dense units in all subspaces of interest
―Determine connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
―Determine the maximal regions that cover a cluster of connected dense units for each cluster
―Determine a minimal cover for each cluster
• 75.
• CLIQUE performs clustering in two steps.
• In the first step, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among them.
• CLIQUE finds dense cells in all of the subspaces.
• To do so, CLIQUE partitions every dimension into intervals and identifies intervals containing at least l points, where l is the density threshold.
• 76.
• In the second step, CLIQUE uses the dense cells in each subspace to assemble clusters, which can be of arbitrary shape.
• The idea is to apply the Minimum Description Length (MDL) principle to use maximal regions to cover the connected dense cells,
―where a maximal region is a hyperrectangle in which every cell falling into the region is dense, and the region cannot be extended further in any dimension of the subspace.
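The first step (finding dense units on a grid) can be illustrated in a few lines of NumPy. This is a simplified 2-D sketch with made-up data and a made-up density threshold, not the full CLIQUE algorithm (there is no Apriori-style subspace search here).

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D data: a dense blob plus uniform background noise.
X = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(200, 2)),
               rng.uniform(0, 10, size=(100, 2))])

xi = 10            # number of equal-length intervals per dimension (the grid resolution)
tau = 15           # density threshold: a unit is dense if it holds at least tau points

# Count how many points fall in each rectangular unit of the xi-by-xi grid.
counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=xi, range=[[0, 10], [0, 10]])
dense_units = np.argwhere(counts >= tau)
print("dense units (grid coordinates):", dense_units.tolist())
```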
• 77. (Figure: dense units in the (age, salary) and (age, vacation) subspaces, with density threshold τ = 3; the overlap of the dense regions around ages 30 to 50 suggests a cluster in the combined (age, salary, vacation) subspace.)
• 78. Strength and Weakness of CLIQUE
• Strengths
―Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
―Insensitive to the order of records in the input and does not presume any canonical data distribution
―Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
• Weakness
―The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
• 80. Assessing Clustering Tendency
• Assess whether non-random structure exists in the data by measuring the probability that the data were generated by a uniform distribution
• Test spatial randomness with a statistical test: the Hopkins statistic
―Given a data set D regarded as a sample of a random variable o, determine how far o is from being uniformly distributed in the data space
―Sample n points p1, ..., pn uniformly from the data space of D. For each pi, find its nearest neighbor in D: xi = min{dist(pi, v)}, where v ∈ D
―Sample n points q1, ..., qn uniformly from D. For each qi, find its nearest neighbor in D − {qi}: yi = min{dist(qi, v)}, where v ∈ D and v ≠ qi
―Calculate the Hopkins statistic:
  H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}
―If D is uniformly distributed, Σxi and Σyi will be close to each other and H will be close to 0.5. If D is highly skewed (clustered), H will be close to 0.
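A small sketch of the Hopkins statistic as described above; the helper name and the use of the data's bounding box as the sampling window are my own illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def hopkins(D, n=50, seed=0):
    """H = sum(y) / (sum(x) + sum(y)); about 0.5 for uniform data, near 0 for clustered data."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(D)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # x_i: nearest-neighbour distances of n points sampled uniformly from the data space (bounding box).
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x, _ = tree.query(P, k=1)

    # y_i: nearest-neighbour distances of n real data points to the rest of D (k=2 skips the point itself).
    Q = D[rng.choice(len(D), size=n, replace=False)]
    y, _ = tree.query(Q, k=2)
    y = y[:, 1]

    return y.sum() / (x.sum() + y.sum())

rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.1, (100, 2)), rng.normal(3, 0.1, (100, 2))])
uniform = rng.uniform(0, 3, (200, 2))
print(round(hopkins(clustered), 2), round(hopkins(uniform), 2))   # expect a small value vs. about 0.5
```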
• 81. Determine the Number of Clusters (I)
• Empirical method
―# of clusters ≈ \sqrt{n/2} for a data set of n points, with each cluster then having about \sqrt{2n} points
• Elbow method
―Use the turning point in the curve of the sum of within-cluster variance w.r.t. the number of clusters
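A brief sketch of the elbow method using scikit-learn's KMeans, whose inertia_ attribute is the within-cluster sum of squared errors; the blob data and the range of k are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

# Total within-cluster variance E for k = 1..10; look for the "elbow" where
# the curve stops dropping sharply (here it should bend around k = 4).
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```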
• 82. Determine the Number of Clusters (II)
• Cross-validation method
―Divide a given data set into m parts
―Use m − 1 parts to obtain a clustering model
―Use the remaining part to test the quality of the clustering
 • E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
―For each k > 0, repeat the procedure m times, compare the overall quality measures for the different values of k, and choose the number of clusters that fits the data best
• 83. Determine the Number of Clusters (III)
• The average silhouette approach measures how well each object lies within its cluster; a high average silhouette width indicates a good clustering.
• The average silhouette method computes the average silhouette of the observations for different values of k. The optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values of k (Kaufman and Rousseeuw, 1990).
• The procedure is similar to the elbow method and can be computed as follows:
―Run the clustering algorithm (e.g., k-means) for different values of k, for instance varying k from 2 to 10 clusters
―For each k, calculate the average silhouette of the observations (avg.sil)
―Plot the curve of avg.sil against the number of clusters k
―The location of the maximum is taken as the appropriate number of clusters
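A sketch of the average-silhouette procedure using scikit-learn; silhouette_score returns the mean silhouette coefficient over all observations, and k must be at least 2. The blob data is made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # average silhouette over all observations

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))       # the k that maximizes the average silhouette
```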
• 84. Measuring Clustering Quality
• Two kinds of methods: extrinsic vs. intrinsic
• Extrinsic: supervised, i.e., the ground truth is available
―Compare a clustering against the ground truth using a clustering quality measure
―E.g., the BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
―Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are
―E.g., the silhouette coefficient
• 86. Measuring Clustering Quality: Extrinsic Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg
• Q is good if it satisfies the following four essential criteria
―Cluster homogeneity: the purer the clusters, the better
―Cluster completeness: objects belonging to the same category in the ground truth should be assigned to the same cluster
―Rag bag: putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag (i.e., a "miscellaneous" or "other" category)
―Small cluster preservation: splitting a small category into pieces is more harmful than splitting a large category into pieces