1. 20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
2. UNIT IV - CLUSTERING AND
OUTLIER ANALYSIS
Cluster Analysis – Types of Data – Categorization of
Major Clustering Methods – K-means – Partitioning
Methods – Hierarchical Methods – Density-Based
Methods – Grid-Based Methods – Model-Based
Clustering Methods – Clustering High-Dimensional
Data – Constraint-Based Cluster Analysis –
Outlier Analysis
3. Introduction
Clustering is the process of grouping the data into classes or
clusters, so that objects within a cluster have high similarity
in comparison to one another but are very dissimilar to
objects in other clusters.
Dissimilarities are assessed based on the attribute values
describing the objects. Often, distance measures are used.
Clustering has its roots in many areas, including data
mining, statistics, biology, and machine learning.
4. Cluster Analysis
The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering.
A cluster is a collection of data objects that are similar to
one another within the same cluster.
In business, clustering can help marketers discover distinct
groups in their customer bases and characterize customer
groups based on purchasing patterns.
5. Cluster Analysis
Clustering is also called data segmentation in some
applications because clustering partitions large data sets into
groups according to their similarity.
Clustering can also be used for outlier detection, where
outliers (values that are “far away” from any cluster) may be
more interesting than common cases.
Applications of outlier detection include the detection of
credit card fraud and the monitoring of criminal activities in
electronic commerce.
6. Cluster Analysis
In machine learning, clustering is an example of
unsupervised learning.
Unlike classification, clustering and unsupervised
learning do not rely on predefined classes and
class-labeled training examples.
For this reason, clustering is a form of learning
by observation, rather than learning by examples.
7. Cluster Analysis
Requirements of clustering:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noisy data
8. Cluster Analysis
Requirements of clustering:
Incremental clustering and insensitivity to the
order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
9. Types of Data in Cluster Analysis
Interval-Scaled Variables:
Interval-scaled variables are continuous measurements on a
roughly linear scale. Examples include weight and height,
latitude and longitude coordinates (e.g., when clustering houses),
and weather temperature.
Distance measures commonly used for computing the dissimilarity
of objects described by such variables include the Euclidean,
Manhattan, and Minkowski distances.
10. Types of Data in Cluster Analysis
Interval-Scaled Variables:
The measurement unit used can affect the clustering
analysis.
For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight,
may lead to a very different clustering structure.
To help avoid dependence on the choice of measurement
units, the data should be standardized.
11. Types of Data in Cluster Analysis
Interval-Scaled Variables:
To standardize measurements, one choice is to convert
the original measurements to unitless variables.
Calculate the mean absolute deviation, s_f, of a variable f over
the n objects:
s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|),
where m_f is the mean value of f.
Calculate the standardized measurement, or z-score:
z_if = (x_if - m_f) / s_f
12. Types of Data in Cluster Analysis
Interval-Scaled Variables:
The most popular distance measure is Euclidean
distance, which is defined as
d(i, j) = sqrt((x_i1 - x_j1)² + (x_i2 - x_j2)² + ... + (x_in - x_jn)²)
Another well-known metric is Manhattan (or city
block) distance, defined as
d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|
13. Types of Data in Cluster Analysis
Interval-Scaled Variables:
14. Types of Data in Cluster Analysis
Binary Variables:
A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it is
present.
Treating binary variables as if they are interval-scaled
can lead to misleading clustering results. Therefore,
methods specific to binary data are necessary for
computing dissimilarities.
15. Types of Data in Cluster Analysis
Binary Variables:
A binary variable is symmetric if both of its states are
equally valuable and carry the same weight; that is, there is
no preference on which outcome should be coded as 0 or 1.
One such example could be the attribute gender having the
states male and female.
Its dissimilarity (or distance) measure is based on a contingency
table of the two objects: d(i, j) = (r + s) / (q + r + s + t), where q
is the number of variables that equal 1 for both objects, t is the
number that equal 0 for both, and r and s are the numbers of
variables on which the two objects disagree.
16. Types of Data in Cluster Analysis
Binary Variables:
A binary variable is asymmetric if the outcomes of the
states are not equally important, such as the positive
and negative outcomes of a disease test.
The dissimilarity based on such variables is called
asymmetric binary dissimilarity, where the number of
negative matches, t, is considered unimportant and
thus is ignored in the computation:
d(i, j) = (r + s) / (q + r + s)
17. Types of Data in Cluster Analysis
Binary Variables:
We can measure the distance between two binary
variables based on the notion of similarity instead of
dissimilarity.
For example, the asymmetric binary similarity
between the objects i and j, or sim(i, j), can be
computed as
sim(i, j) = q / (q + r + s) = 1 - d(i, j),
which is also known as the Jaccard coefficient.
18. Types of Data in Cluster Analysis
Categorical Variables
A categorical variable is a generalization of the binary
variable in that it can take on more than two states.
For example, map color is a categorical variable that may
have, say, five states: red, yellow, green, pink, and blue.
The dissimilarity between two objects i and j can be
computed based on the ratio of mismatches:
d(i, j) = (p - m) / p,
where m is the number of variables on which i and j match and p
is the total number of variables.
19. Types of Data in Cluster Analysis
Ordinal Variables
A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal value
are ordered in a meaningful sequence.
For example, professional ranks are often enumerated
in a sequential order, such as assistant, associate, and
full for professors
20. Types of Data in Cluster Analysis
Ordinal Variables
A continuous ordinal variable looks like a set of
continuous data of an unknown scale; that is, the
relative ordering of the values is essential but their
actual magnitude is not.
For example, the relative ranking in a particular sport
(e.g., gold, silver, bronze)
21. Types of Data in Cluster Analysis
Ratio-Scaled Variables
A ratio-scaled variable makes a positive
measurement on a nonlinear scale, such as an
exponential scale, approximately following the
formula Ae^(Bt) or Ae^(-Bt), where A and B are positive
constants and t typically represents time.
22. Types of Data in Cluster Analysis
Variables of Mixed Types
One approach is to group each kind of variable together,
performing a separate cluster analysis for each variable type.
This is feasible if these analyses derive compatible results.
A more preferable approach is to process all variable types
together, performing a single cluster analysis.
One such technique combines the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables
onto a common scale of the interval [0.0,1.0].
23. Types of Data in Cluster Analysis
Vector Objects
In some applications, such as information retrieval,
text document clustering, and biological taxonomy, we
need to compare and cluster complex objects (such as
documents) containing a large number of symbolic
entities (such as keywords and phrases).
One popular way is to define the similarity function as
a cosine measure:
s(x, y) = (x · y) / (||x|| ||y||),
where x and y are two vectors and ||x|| is the Euclidean norm of x.
24. Partitioning Methods
Given D, a data set of n objects, and k, the
number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k
≤ n), where each partition represents a cluster.
The objects within a cluster are “similar,” whereas
the objects of different clusters are “dissimilar” in
terms of the data set attributes.
25. Partitioning Methods
Centroid-Based Technique: The k-Means Method
The k-means algorithm takes the input parameter, k,
and partitions a set of n objects into k clusters so that
the resulting intracluster similarity is high but the
intercluster similarity is low.
Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed
as the cluster’s centroid or center of gravity.
26. Partitioning Methods
Centroid-Based Technique: The k-Means Method
Algorithm: k-means. The k-means algorithm for
partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
27. Partitioning Methods
Centroid-Based Technique: The k-Means Method
Method:
arbitrarily choose k objects from D as the initial cluster centers;
repeat
(re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
until no change;
28. Partitioning Methods
Representative Object-Based Technique: The
k-Medoids Method
The k-means algorithm is sensitive to outliers
because an object with an extremely large value
may substantially distort the distribution of data.
29. Partitioning Methods
PAM(Partitioning Around Medoids)
It attempts to determine k partitions for n objects.
After an initial random selection of k representative
objects, the algorithm repeatedly tries to make a better
choice of cluster representatives.
All of the possible pairs of objects are analyzed, where one
object in each pair is considered a representative object and
the other is not.
30. Partitioning Methods
PAM(Partitioning Around Medoids)
The quality of the resulting clustering is calculated for each such
combination.
An object, oj, is replaced with the object causing the greatest
reduction in error.
The set of best objects for each cluster in one iteration forms the
representative objects for the next iteration.
The final set of representative objects are the respective medoids
of the clusters.
31. Partitioning Methods
Which method is more robust—k-means or k-
medoids?
The k-medoids method is more robust than k-
means in the presence of noise and outliers.
A medoid is less influenced by outliers or other
extreme values than a mean.
32. Partitioning Methods
Partitioning Methods in Large Databases: From k-
Medoids to CLARANS
A typical k-medoids partitioning algorithm like PAM
works effectively for small data sets, but does not scale
well for large data sets.
To deal with larger data sets, a sampling-based method,
called CLARA (Clustering LARge Applications), can be
used
33. Partitioning Methods
Partitioning Methods in Large Databases: From k-
Medoids to CLARANS
Instead of taking the whole set of data into consideration, a
small portion of the actual data is chosen as a representative
of the data.
Medoids are then chosen from this sample using PAM.
If the sample is selected in a fairly random manner, it
should closely represent the original data set.
34. Partitioning Methods
Partitioning Methods in Large Databases: From k-Medoids to
CLARANS
The representative objects (medoids) chosen will likely be similar to
those that would have been chosen from the whole data set.
CLARA draws multiple samples of the data set, applies PAM on each
sample, and returns its best clustering as the output.
The complexity of each iteration now becomes O(ks² + k(n-k)), where s
is the size of the sample, k is the number of clusters, and n is the total
number of objects.
35. Partitioning Methods
Partitioning Methods in Large Databases: From
k-Medoids to CLARANS
The effectiveness of CLARA depends on the sample
size.
CLARA searches for the best k medoids among the
selected sample of the data set.
If any of the best k medoids of the complete data set is not
present in the sample, CLARA will never find the best clustering.
36. Hierarchical Methods
A hierarchical clustering method works by grouping data objects
into a tree of clusters.
Hierarchical clustering methods can be further classified as either
agglomerative or divisive, depending on whether the hierarchical
decomposition is formed in a bottom-up (merging) or top-down
(splitting) fashion.
The quality of a pure hierarchical clustering method suffers from its
inability to perform adjustment once a merge or split decision has
been executed.
37. Hierarchical Methods
Agglomerative and Divisive Hierarchical
Clustering:
Agglomerative hierarchical clustering: This bottom-up
strategy starts by placing each object in its own cluster
and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are
satisfied.
38. Hierarchical Methods
Agglomerative and Divisive Hierarchical Clustering:
Divisive hierarchical clustering: This top-down strategy does
the reverse of agglomerative hierarchical clustering by starting
with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until it satisfies certain
termination conditions, such as a desired number of clusters is
obtained or the diameter of each cluster is within a certain
threshold.
40. Hierarchical Methods
Agglomerative and Divisive Hierarchical
Clustering:
A tree structure called a dendrogram is commonly
used to represent the process of hierarchical
clustering. It shows how objects are grouped
together step by step.
42. Hierarchical Methods
Agglomerative and Divisive Hierarchical
Clustering:
Four widely used measures for distance between
clusters are as follows, where |p-p’| is the distance
between two objects or points, p and p’; mi is the mean
for cluster Ci; and ni is the number of objects in Ci:
Minimum distance: dmin(Ci, Cj) = min |p-p’| over p in Ci, p’ in Cj
Maximum distance: dmax(Ci, Cj) = max |p-p’| over p in Ci, p’ in Cj
Mean distance: dmean(Ci, Cj) = |mi - mj|
Average distance: davg(Ci, Cj) = (1/(ni·nj)) Σ |p-p’| over p in Ci, p’ in Cj
44. Hierarchical Methods
Agglomerative and Divisive Hierarchical Clustering:
When an algorithm uses the minimum distance, dmin(Ci,
Cj), to measure the distance between clusters, it is
sometimes called a nearest-neighbor clustering algorithm.
When an algorithm uses the maximum distance, dmax(Ci,
Cj), to measure the distance between clusters, it is
sometimes called a farthest-neighbor clustering algorithm.
45. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
BIRCH is designed for clustering a large amount of numerical
data by integration of hierarchical clustering (at the initial
microclustering stage) and other clustering methods such as
iterative partitioning (at the later macroclustering stage).
It overcomes the two difficulties of agglomerative clustering
methods: (1) scalability and (2) the inability to undo what was
done in the previous step.
46. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
BIRCH introduces two concepts, clustering feature and
clustering feature tree (CF tree), which are used to summarize
cluster representations.
These structures help the clustering method achieve good speed
and scalability in large databases and also make it effective for
incremental and dynamic clustering of incoming objects.
47. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
A clustering feature (CF) is a three-dimensional vector
summarizing information about clusters of objects.
Given n d-dimensional objects or points in a cluster, {xi}, then
the CF of the cluster is defined as
CF=<n, LS, SS>
where n is the number of points in the cluster, LS is the linear sum
of the n points, and SS is the square sum of the data points
48. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and
Clustering Using Hierarchies:
Clustering feature. Suppose that there are three points,
(2, 5), (3, 2), and (4, 3), in a cluster, C1. The clustering
feature of C1 is
CF1 = <3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)>
= <3, (9, 10), (29, 38)>
49. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
Clustering features are sufficient for calculating all of the
measurements that are needed for making clustering decisions in
BIRCH.
BIRCH thus utilizes storage efficiently by employing the
clustering features to summarize information about the clusters
of objects, thereby bypassing the need to store all objects.
50. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
A CF tree has two parameters: branching factor, B, and threshold, T.
The branching factor specifies the maximum number of children per
nonleaf node.
The threshold parameter specifies the maximum diameter of
subclusters stored at the leaf nodes of the tree.
51. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and
Clustering Using Hierarchies:
The computation complexity of the algorithm is O(n),
where n is the number of objects to be clustered.
Each node in a CF tree can hold only a limited
number of entries due to its size, so a CF tree node does not
always correspond to what a user may consider a natural cluster.
52. Hierarchical Methods
ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
ROCK (RObust Clustering using linKs)
It explores the concept of links (the number of common
neighbors between two objects) for data with categorical
attributes.
Distance measures cannot lead to high-quality clusters
when clustering categorical data.
53. Hierarchical Methods
ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
ROCK takes a more global approach to clustering by
considering the neighborhoods of individual pairs of points.
If two similar points also have similar neighborhoods, then
the two points likely belong to the same cluster and so can
be merged.
54. Hierarchical Methods
ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
Based on these ideas, ROCK first constructs a sparse graph
from a given data similarity matrix using a similarity
threshold and the concept of shared neighbors.
It then performs agglomerative hierarchical clustering on
the sparse graph.
Random sampling is used for scaling up to large data sets.
55. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses
dynamic modeling to determine the similarity between
pairs of clusters.
In Chameleon, cluster similarity is assessed based on how
well-connected objects are within a cluster and on the
proximity of clusters.
56. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
That is, two clusters are merged if their
interconnectivity is high and they are close together.
The merge process facilitates the discovery of natural
and homogeneous clusters and applies to all types of
data as long as a similarity function can be specified.
58. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
Chameleon uses a k-nearest-neighbor graph approach
to construct a sparse graph, where each vertex of the
graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among
the k-most-similar objects of the other.
59. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm Using
Dynamic Modeling
The edges are weighted to reflect the similarity between objects.
It then uses an agglomerative hierarchical clustering algorithm
that repeatedly merges subclusters based on their similarity
To determine the pairs of most similar subclusters, it takes into
account both the interconnectivity as well as the closeness of the
clusters.
60. Hierarchical Methods
Chameleon: A Hierarchical Clustering
Algorithm Using Dynamic Modeling
Chameleon determines the similarity between
each pair of clusters Ci and Cj according to their
relative interconnectivity, RI(Ci, Cj), and their
relative closeness, RC(Ci, Cj)
61. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
The relative interconnectivity, RI(Ci, Cj), between
two clusters, Ci and Cj, is defined as the absolute
interconnectivity between Ci and Cj, normalized with
respect to the internal interconnectivity of the two
clusters, Ci and Cj.
62. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
The relative closeness, RC(Ci, Cj), between a pair of
clusters,Ci and Cj, is the absolute closeness between
Ci and Cj, normalized with respect to the internal
closeness of the two clusters, Ci and Cj.
63. Density-Based Methods
To discover clusters with arbitrary shape, density-
based clustering methods have been developed.
These typically regard clusters as dense regions
of objects in the data space that are separated by
regions of low density (representing noise).
64. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based on
Connected Regions with Sufficiently High Density
DBSCAN (Density-Based Spatial Clustering of Applications
with Noise)
The algorithm grows regions with sufficiently high density into
clusters and discovers clusters of arbitrary shape in spatial
databases with noise.
It defines a cluster as a maximal set of density-connected points.
65. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
The neighborhood within a radius ε of a given object is
called the ε-neighborhood of the object.
If the ε-neighborhood of an object contains at least a
minimum number, MinPts, of objects, then the object is
called a core object.
66. Density-Based Methods
DBSCAN: A Density-Based Clustering Method
Based on Connected Regions with Sufficiently
High Density
Given a set of objects, D, we say that an object p is
directly density-reachable from object q if p is within
the ε-neighborhood of q, and q is a core object.
67. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
Directly density-reachable: A point is directly density-
reachable if it has a core point in its neighborhood.
For example, the point (1, 2) has the core point (1.2, 2.5) in its
neighborhood, so it is directly density-reachable from that core
point.
68. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
Density-reachable: A point is density-reachable from
another point if they are connected through a chain of
core points. For example, the points (1, 3) and
(1.5, 2.5) are connected through the core point
(1.2, 2.5), so they are density-reachable from each
other.
69. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based on
Connected Regions with Sufficiently High Density
DBSCAN searches for clusters by checking the ε-neighborhood of
each point in the database.
If the ε-neighborhood of a point p contains more than MinPts
objects, a new cluster with p as a core object is created.
DBSCAN then iteratively collects directly density-reachable objects
from these core objects, which may involve the merge of a few density-
reachable clusters.
The process terminates when no new point can be added to any cluster.
73. Density-Based Methods
DBSCAN: A Density-Based Clustering Method
Based on Connected Regions with Sufficiently
High Density
If a spatial index is used, the computational
complexity of DBSCAN is O(n log n), where n is the
number of database objects.
Otherwise, it is O(n²).
74. Density-Based Methods
OPTICS: Ordering Points to Identify the Clustering Structure
OPTICS computes an augmented cluster ordering for automatic and
interactive cluster analysis.
This ordering represents the density-based clustering structure of the
data.
It contains information that is equivalent to density-based clustering
obtained from a wide range of parameter settings.
The cluster ordering can be used to extract basic clustering information
(such as cluster centers or arbitrary-shaped clusters) as well as provide
the intrinsic clustering structure.
75. Density-Based Methods
OPTICS: Ordering Points to Identify the
Clustering Structure
The core-distance of an object p is the smallest ε′
value that makes p a core object. If p is not a
core object, the core-distance of p is undefined.
77. Density-Based Methods
OPTICS: Ordering Points to Identify the Clustering
Structure
The reachability-distance of an object q with respect to
another object p is the greater value of the core-distance of
p and the Euclidean distance between p and q.
If p is not a core object, the reachability-distance between p
and q is undefined.
78. Density-Based Methods
OPTICS: Ordering Points to Identify the Clustering
Structure
An algorithm was proposed to extract clusters based on the
ordering information produced by OPTICS.
Such information is sufficient for the extraction of all
density-based clusterings with respect to any distance ε′
that is smaller than the distance ε used in generating the
ordering.
79. Density-Based Methods
OPTICS: Ordering Points to Identify the
Clustering Structure
Because of the structural equivalence of the OPTICS
algorithm to DBSCAN, the OPTICS algorithm has the
same runtime complexity as that of DBSCAN, that is,
O(n log n) if a spatial index is used, where n is the
number of objects.
81. Density-Based Methods
DENCLUE: Clustering Based on Density Distribution Functions
DENCLUE (DENsity-based CLUstEring)
The method is built on the following ideas: the influence of each data point can
be formally modeled using a mathematical function, called an influence
function, which describes the impact of a data point within its neighborhood;
The overall density of the data space can be modeled analytically as the sum of
the influence function applied to all data points; and
Clusters can then be determined mathematically by identifying density
attractors, where density attractors are local maxima of the overall density
function.
82. Density-Based Methods
DENCLUE: Clustering Based on Density Distribution
Functions
The influence function of a data object y on a point x can,
in principle, be an arbitrary function determined by the
distance between the two objects in a neighborhood.
The distance function, d(x, y), should be reflexive and
symmetric, such as the Euclidean distance function
83. Density-Based Methods
DENCLUE: Clustering Based on Density
Distribution Functions
The density function at an object or point x ∈ F^d is
defined as the sum of the influence functions of all
data points.
84. Density-Based Methods
DENCLUE: Clustering Based on Density Distribution
Functions
From the density function, we can define the gradient of
the function and the density attractor, the local maxima of
the overall density function.
An arbitrary-shape cluster for a set of density attractors is the
set of points that are density-attracted to their respective
density attractors.
85. Grid-Based Methods
The grid-based clustering approach uses a
multiresolution grid data structure.
It quantizes the object space into a finite number
of cells that form a grid structure on which all of
the operations for clustering are performed.
The main advantage of the approach is its fast
processing time
86. Grid-Based Methods
STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique
in which the spatial area is divided into rectangular cells.
There are usually several levels of such rectangular cells
corresponding to different levels of resolution, and these
cells form a hierarchical structure: each cell at a high level
is partitioned to form a number of cells at the next lower
level.
87. Grid-Based Methods
STING: STatistical INformation Grid
Statistical information regarding the attributes in
each grid cell (such as the mean, maximum, and
minimum values) is precomputed and stored.
These statistical parameters are useful for query
processing
89. Grid-Based Methods
STING: STatistical INformation Grid
Statistical parameters of higher-level cells can easily be
computed from the parameters of the lower-level cells.
These parameters include the following: the attribute-
independent parameter, count; the attribute-dependent
parameters, mean, stdev (standard deviation), min (minimum),
max (maximum); and the type of distribution that the attribute
value in the cell follows, such as normal, uniform, exponential,
or none.
90. Grid-Based Methods
STING: STatistical INformation Grid
When the data are loaded into the database, the
parameters count, mean, stdev, min, and max of the
bottom-level cells are calculated directly from the data.
The value of distribution may either be assigned by the
user if the distribution type is known beforehand or
obtained by hypothesis tests such as the χ² test.
91. Grid-Based Methods
STING: STatistical INformation Grid
The type of distribution of a higher-level cell can be
computed based on the majority of distribution types
of its corresponding lower-level cells in conjunction
with a threshold filtering process.
If the distributions of the lower level cells disagree
with each other and fail the threshold test, the
distribution type of the high-level cell is set to none.
92. Grid-Based Methods
STING: STatistical INformation Grid
First, a layer within the hierarchical structure is
determined from which the query-answering process is
to start.
This layer typically contains a small number of cells.
For each cell in the current layer, we compute the
confidence interval (or estimated range of probability)
reflecting the cell’s relevancy to the given query.
93. Grid-Based Methods
STING: STatistical INformation Grid
The irrelevant cells are removed from further consideration.
Processing of the next lower level examines only the remaining
relevant cells. This process is repeated until the bottom layer is reached.
At this time, if the query specification is met, the regions of relevant
cells that satisfy the query are returned.
Otherwise, the data that fall into the relevant cells are retrieved and
further processed until they meet the requirements of the query.
94. Grid-Based Methods
STING: STatistical INformation Grid
The grid-based computation is query-independent,
because the statistical information stored in each cell
represents the summary information of the data in the
grid cell, independent of the query;
The grid structure facilitates parallel processing and
incremental updating
95. Grid-Based Methods
WaveCluster: Clustering Using Wavelet
Transformation
WaveCluster is a multiresolution clustering algorithm that
first summarizes the data by imposing a multidimensional
grid structure onto the data space.
It then uses a wavelet transformation to transform the
original feature space, finding dense regions in the
transformed space.
96. Grid-Based Methods
WaveCluster: Clustering Using Wavelet
Transformation
In this approach, each grid cell summarizes the
information of a group of points that map into the cell.
This summary information typically fits into main
memory for use by the multiresolution wavelet
transform and the subsequent cluster analysis.
97. Grid-Based Methods
WaveCluster: Clustering Using Wavelet Transformation
A wavelet transform is a signal processing technique that
decomposes a signal into different frequency subbands.
The wavelet model can be applied to d-dimensional signals by
applying a one-dimensional wavelet transform d times.
In applying a wavelet transform, data are transformed so as to
preserve the relative distance between objects at different levels
of resolution.
98. Grid-Based Methods
WaveCluster: Clustering Using Wavelet
Transformation
It provides unsupervised clustering
The multiresolution property of wavelet transformations
can help detect clusters at varying levels of accuracy.
Wavelet-based clustering is very fast, with a computational
complexity of O(n)
99. Model-Based Clustering Methods
Model-based clustering methods attempt to
optimize the fit between the given data and
some mathematical model.
Expectation-Maximization
Conceptual Clustering
Neural Network Approach
100. Model-Based Clustering Methods
Expectation-Maximization
Each cluster can be represented mathematically by a parametric
probability distribution.
The entire data is a mixture of these distributions, where each
individual distribution is typically referred to as a component
distribution.
The EM(Expectation-Maximization) algorithm is a popular
iterative refinement algorithm that can be used for finding the
parameter estimates.
101. Model-Based Clustering Methods
Expectation-Maximization
It can be viewed as an extension of the k-means
paradigm, which assigns an object to the cluster with
which it is most similar, based on the cluster mean.
Instead of assigning each object to a dedicated cluster,
EM assigns each object to a cluster according to a
weight representing the probability of membership.
102. Model-Based Clustering Methods
Expectation-Maximization – Algorithm:
Make an initial guess of the parameter vector: This
involves randomly selecting k objects to represent the
cluster means or centers (as in k-means partitioning),
as well as making guesses for the additional
parameters.
Iteratively refine the parameters (or clusters)
103. Model-Based Clustering Methods
Conceptual Clustering
Given a set of unlabeled objects, it produces a classification
scheme over the objects.
It also finds characteristic descriptions for each group,
where each group represents a concept or class.
Conceptual clustering is a two-step process: clustering
is performed first, followed by characterization.
104. Model-Based Clustering Methods
Conceptual Clustering – Classification Tree:
Each node in a classification tree refers to a concept
and contains a probabilistic description of that concept,
which summarizes the objects classified under the
node.
The sibling nodes at a given level of a classification
tree are said to form a partition.
105. Model-Based Clustering Methods
Conceptual Clustering – Classification Tree:
To classify an object using a classification tree, a
partial matching function is employed to descend
the tree along a path of “best” matching nodes.
107. Model-Based Clustering Methods
Neural Network Approach:
The neural network approach is motivated by
biological neural networks.
A neural network is a set of connected
input/output units, where each connection has a
weight associated with it.
108. Model-Based Clustering Methods
Neural Network Approach:
First, neural networks are inherently parallel and
distributed processing architectures.
Second, neural networks learn by adjusting their
interconnection weights so as to best fit the data.
Third, neural networks process numerical vectors and
require object patterns to be represented by quantitative
features only.
109. Clustering High-Dimensional Data
Most clustering methods are designed for clustering low-
dimensional data and encounter challenges when the dimensionality
of the data grows really high.
When dimensionality increases, data usually become increasingly
sparse because the data points are likely located in different
dimensional subspaces.
To overcome this difficulty, we may consider using feature (or
attribute) transformation and feature (or attribute) selection
techniques
110. Clustering High-Dimensional Data
Feature transformation methods:
Transform the data onto a smaller space
They summarize data by creating linear combinations
of the attributes, and may discover hidden structures in
the data.
Techniques do not actually remove any of the original
attributes from analysis.
111. Clustering High-Dimensional Data
Feature transformation methods:
This is problematic when there are a large number of
irrelevant attributes.
The transformed features (attributes) are often difficult to
interpret, making the clustering results less useful.
Feature transformation is only suited to data sets where
most of the dimensions are relevant to the clustering task.
112. Clustering High-Dimensional Data
Attribute subset selection methods:
Attribute subset selection (or feature subset selection)
is commonly used for data reduction by removing
irrelevant or redundant dimensions (or attributes).
It is most commonly performed by supervised
learning—the most relevant set of attributes are found
with respect to the given class labels
113. Clustering High-Dimensional Data
Subspace clustering:
Subspace clustering is an extension to attribute
subset selection.
Subspace clustering searches for groups of
clusters within different subspaces of the same data
set.
114. Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace Clustering
Method
CLIQUE (CLustering In QUEst)
The clustering process starts at single-dimensional
subspaces and grows upward to higher-dimensional ones.
CLIQUE partitions each dimension like a grid
structure and determines whether a cell is dense based on
the number of points it contains.
115. Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace
Clustering Method
An integration of density-based and grid-based
clustering methods.
116. Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace Clustering
Method – Ideas:
Given a large set of multidimensional data points, the data space
is usually not uniformly occupied by the data points. CLIQUE’s
clustering identifies the sparse and the “crowded” areas in space
(or units), thereby discovering the overall distribution patterns of
the data set.
In CLIQUE, a cluster is defined as a maximal set of connected
dense units.
117. Clustering High-Dimensional Data
How does CLIQUE work?
In the first step, CLIQUE partitions the d-dimensional
data space into non-overlapping rectangular units,
identifying the dense units among these. This is done
(in 1-D) for each dimension.
In the second step, for each cluster, it determines the
maximal region that covers the cluster of connected
dense units.
118. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
PROCLUS (PROjected CLUStering)
It starts by finding an initial approximation of the clusters
in the high-dimensional attribute space.
Each dimension is then assigned a weight for each cluster,
and the updated weights are used in the next iteration to
regenerate the clusters.
119. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
It adopts a distance measure called Manhattan
segmental distance, which is the Manhattan distance
on a set of relevant dimensions.
The PROCLUS algorithm consists of three phases:
initialization, iteration, and cluster refinement.
120. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
In the initialization phase, it ensures that each cluster is
represented by at least one object in the selected set.
The iteration phase selects a random set of k medoids from
this reduced set (of medoids), and replaces “bad” medoids
with randomly chosen new medoids if the clustering is
improved.
121. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction
Subspace Clustering Method:
The refinement phase computes new dimensions
for each medoid based on the clusters found,
reassigns points to medoids, and removes outliers.
122. Clustering High-Dimensional Data
Frequent Pattern–Based Clustering Methods
It searches for patterns (such as sets of items or
objects) that occur frequently in large data sets.
Frequent pattern mining can lead to the discovery
of interesting associations and correlations among
data objects.
123. Clustering High-Dimensional Data
Frequent Pattern–Based Clustering Methods - Frequent
term–based text clustering:
Text documents are clustered based on the frequent terms
they contain.
A term is any sequence of characters separated from other
terms by a delimiter.
A term can be made up of a single word or several words.
124. Clustering High-Dimensional Data
Frequent Pattern–Based Clustering Methods -
Frequent term–based text clustering:
In general, we first remove nontext information
(such as HTML tags and punctuation) and stop
words. Terms are then extracted.
125. Constraint-Based Cluster Analysis
Constraint-based clustering finds clusters that satisfy
user-specified preferences or constraints.
Constraints on individual objects: We can specify
constraints on the objects to be clustered.
In a real estate application, for example, one may like to
spatially cluster only those luxury mansions worth over a
million dollars.
This constraint confines the set of objects to be clustered.
126. Constraint-Based Cluster Analysis
Constraints on the selection of clustering parameters:
Clustering parameters are usually quite specific to the
given clustering algorithm.
Examples of parameters include k, the desired number of
clusters in the k-means algorithm, or ε (the radius) and
MinPts (the minimum number of points) in the DBSCAN
algorithm.
127. Constraint-Based Cluster Analysis
Constraints on the selection of clustering
parameters:
Although such user-specified parameters may
strongly influence the clustering results, they are
usually confined to the algorithm itself.
128. Constraint-Based Cluster Analysis
Constraints on distance or similarity functions:
We can specify different distance or similarity
functions for specific attributes of the objects to be
clustered, or different distance measures for specific
pairs of objects.
When clustering sportsmen, for example, we may use
different weighting schemes for height, body weight,
age, and skill level.
129. Constraint-Based Cluster Analysis
User-specified constraints on the properties of
individual clusters:
A user may like to specify desired characteristics
of the resulting clusters, which may strongly
influence the clustering process.
130. Outlier Analysis
There exist data objects that do not comply
with the general behavior or model of the data.
Such data objects, which are grossly different
from or inconsistent with the remaining set of
data, are called outliers.
131. Outlier Analysis
Data visualization methods are weak in
detecting outliers in data with many
categorical attributes or in data of high
dimensionality, since human eyes are good at
visualizing numeric data of only two to three
dimensions.
132. Outlier Analysis
There are two basic types of procedures for detecting outliers:
Block procedures: In this case, either all of the suspect objects are
treated as outliers or all of them are accepted as consistent.
Consecutive (or sequential) procedures: An example of such a
procedure is the inside out procedure. Its main idea is that the object
that is least “likely” to be an outlier is tested first. If it is found to be an
outlier, then all of the more extreme values are also considered outliers;
otherwise, the next most extreme object is tested, and so on. This
procedure tends to be more effective than block procedures.
133. Outlier Analysis
How effective is the statistical approach at outlier
detection?
A major drawback is that most tests are for single attributes, yet
many data mining problems require finding outliers in
multidimensional space.
Statistical methods do not guarantee that all outliers will be
found for the cases where no specific test was developed, or
where the observed distribution cannot be adequately modeled
with any standard distribution.