20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
UNIT IV - CLUSTERING AND
OUTLIER ANALYSIS
Cluster Analysis - Types of Data – Categorization of
Major Clustering Methods – K-means– Partitioning
Methods – Hierarchical Methods - Density-Based
Methods – Grid Based Methods – Model-Based
Clustering Methods – Clustering High Dimensional
Data – Constraint – Based Cluster Analysis –
Outlier Analysis
Introduction
 Clustering is the process of grouping the data into classes or
clusters, so that objects within a cluster have high similarity
in comparison to one another but are very dissimilar to
objects in other clusters.
 Dissimilarities are assessed based on the attribute values
describing the objects. Often, distance measures are used.
 Clustering has its roots in many areas, including data
mining, statistics, biology, and machine learning.
Cluster Analysis
 The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering.
 A cluster is a collection of data objects that are similar to
one another within the same cluster.
 In business, clustering can help marketers discover distinct
groups in their customer bases and characterize customer
groups based on purchasing patterns.
Cluster Analysis
 Clustering is also called data segmentation in some
applications because clustering partitions large data sets into
groups according to their similarity.
 Clustering can also be used for outlier detection, where
outliers (values that are “far away” from any cluster) may be
of more interest than common cases.
 Applications of outlier detection include the detection of
credit card fraud and the monitoring of criminal activities in
electronic commerce.
Cluster Analysis
In machine learning, clustering is an example of
unsupervised learning.
 Unlike classification, clustering and unsupervised
learning do not rely on predefined classes and
class-labeled training examples.
 For this reason, clustering is a form of learning
by observation, rather than learning by examples.
Cluster Analysis
 Requirements of clustering:
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Ability to deal with noisy data
Cluster Analysis
 Requirements of clustering:
Incremental clustering and insensitivity to the
order of input records
 High dimensionality
 Constraint-based clustering
 Interpretability and usability
Types of Data in Cluster Analysis
 Interval-Scaled Variables:
 Distance measures commonly used for computing the
dissimilarity of objects described by such variables include the
Euclidean, Manhattan, and Minkowski distances.
 Interval-scaled variables are continuous measurements of a
roughly linear scale. Examples include weight and height,
latitude and longitude coordinates (e.g., when clustering houses),
and weather temperature
Types of Data in Cluster Analysis
 Interval-Scaled Variables:
 The measurement unit used can affect the clustering
analysis.
 For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight,
may lead to a very different clustering structure.
 To help avoid dependence on the choice of measurement
units, the data should be standardized.
Types of Data in Cluster Analysis
 Interval-Scaled Variables:
 To standardize measurements, one choice is to convert
the original measurements to unitless variables.
 Calculate the mean absolute deviation, s_f:
 Calculate the standardized measurement, or z-score, z_if:
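In standard notation, for a variable f with measured values x_1f, x_2f, ..., x_nf and mean m_f, these two quantities are

  s_f = (1/n) ( |x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f| )
  z_if = (x_if - m_f) / s_f

so that each standardized value z_if is unitless.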
Types of Data in Cluster Analysis
 Interval-Scaled Variables:
 The most popular distance measure is Euclidean
distance, which is defined as
 Another well-known metric is Manhattan (or city
block) distance, defined as
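For two p-dimensional objects i = (x_i1, ..., x_ip) and j = (x_j1, ..., x_jp), the standard definitions are

  Euclidean:  d(i, j) = sqrt( |x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|² )
  Manhattan:  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
  Minkowski:  d(i, j) = ( |x_i1 - x_j1|^q + ... + |x_ip - x_jp|^q )^(1/q)

where the Minkowski distance generalizes both (q = 2 gives Euclidean, q = 1 gives Manhattan).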
Types of Data in Cluster Analysis
 Binary Variables:
 A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it is
present.
 Treating binary variables as if they are interval-scaled
can lead to misleading clustering results. Therefore,
methods specific to binary data are necessary for
computing dissimilarities.
Types of Data in Cluster Analysis
 Binary Variables:
 A binary variable is symmetric if both of its states are
equally valuable and carry the same weight; that is, there is
no preference on which outcome should be coded as 0 or 1.
 One such example could be the attribute gender having the
states male and female.
 Its dissimilarity (or distance) measure can be computed
from a contingency table of matches and mismatches, as defined below.
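In the standard 2 x 2 contingency table for objects i and j, q is the number of variables that equal 1 for both objects, r the number that equal 1 for i but 0 for j, s the number that equal 0 for i but 1 for j, and t the number that equal 0 for both. The symmetric (simple matching) dissimilarity is then

  d(i, j) = (r + s) / (q + r + s + t)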
Types of Data in Cluster Analysis
 Binary Variables:
 A binary variable is asymmetric if the outcomes of the
states are not equally important, such as the positive
and negative outcomes of a disease test.
 The dissimilarity based on such variables is called
asymmetric binary dissimilarity, where the number of
negative matches, t, is considered unimportant and
thus is ignored in the computation
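With the same q, r, s, t notation as above, the asymmetric binary dissimilarity drops t from the denominator:

  d(i, j) = (r + s) / (q + r + s)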
Types of Data in Cluster Analysis
 Binary Variables:
 We can measure the distance between two binary
variables based on the notion of similarity instead of
dissimilarity.
 For example, the asymmetric binary similarity
between the objects i and j, or sim(i, j), can be
computed as,
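In the same notation,

  sim(i, j) = q / (q + r + s) = 1 - d(i, j)

which is also known as the Jaccard coefficient.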
Types of Data in Cluster Analysis
 Categorical Variables
 A categorical variable is a generalization of the binary
variable in that it can take on more than two states.
For example, map color is a categorical variable that may
have, say, five states: red, yellow, green, pink, and blue.
 The dissimilarity between two objects i and j can be
computed based on the ratio of mismatches:
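If m is the number of matches (variables on which i and j have the same state) and p is the total number of variables, then

  d(i, j) = (p - m) / p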
Types of Data in Cluster Analysis
 Ordinal Variables
 A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal value
are ordered in a meaningful sequence.
 For example, professional ranks are often enumerated
in a sequential order, such as assistant, associate, and
full for professors
Types of Data in Cluster Analysis
 Ordinal Variables
 A continuous ordinal variable looks like a set of
continuous data of an unknown scale; that is, the
relative ordering of the values is essential but their
actual magnitude is not.
 For example, the relative ranking in a particular sport
(e.g., gold, silver, bronze)
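Ordinal variables are usually treated like interval-scaled ones after converting values to ranks: if variable f has M_f ordered states, the value x_if is replaced by its rank r_if in {1, ..., M_f} and normalized onto [0.0, 1.0] by

  z_if = (r_if - 1) / (M_f - 1)

after which any of the interval-scaled distance measures can be applied.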
Types of Data in Cluster Analysis
 Ratio-Scaled Variables
 A ratio-scaled variable makes a positive
measurement on a nonlinear scale, such as an
exponential scale, approximately following the
formula Ae^(Bt) or Ae^(-Bt).
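Such variables are typically handled in one of three ways: treat them as interval-scaled (often a poor choice, since the nonlinear scale distorts distances), apply a logarithmic transformation y_if = log(x_if) and treat the transformed values as interval-scaled, or treat them as continuous ordinal data and use their ranks as interval-scaled values.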
Types of Data in Cluster Analysis
 Variables of Mixed Types
 One approach is to group each kind of variable together,
performing a separate cluster analysis for each variable type.
 This is feasible if these analyses derive compatible results.
 A preferable approach is to process all variable types
together, performing a single cluster analysis.
 One such technique combines the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables
onto a common scale of the interval [0.0,1.0].
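The usual combined measure (following Han and Kamber) is

  d(i, j) = ( Σ_f δ_ij(f) · d_ij(f) ) / ( Σ_f δ_ij(f) )

where the sum runs over all p variables, d_ij(f) is the contribution of variable f computed according to its type, and the indicator δ_ij(f) is 0 if x_if or x_jf is missing (or if both values are 0 for an asymmetric binary variable) and 1 otherwise.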
Types of Data in Cluster Analysis
 Vector Objects
 In some applications, such as information retrieval,
text document clustering, and biological taxonomy, we
need to compare and cluster complex objects (such as
documents) containing a large number of symbolic
entities (such as keywords and phrases).
 One popular way is to define the similarity function as
a cosine measure as follows:
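For two vectors x and y,

  s(x, y) = (x^t · y) / ( ||x|| · ||y|| )

where x^t is the transpose of x, ||x|| is the Euclidean norm of x, and s(x, y) is the cosine of the angle between the two vectors.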
Partitioning Methods
 Given D, a data set of n objects, and k, the
number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k
≤ n), where each partition represents a cluster.
The objects within a cluster are “similar,” whereas
the objects of different clusters are “dissimilar” in
terms of the data set attributes.
Partitioning Methods
 Centroid-Based Technique: The k-Means Method
 The k-means algorithm takes the input parameter, k,
and partitions a set of n objects into k clusters so that
the resulting intracluster similarity is high but the
intercluster similarity is low.
Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed
as the cluster’s centroid or center of gravity.
Partitioning Methods
 Centroid-Based Technique: The k-Means Method
 Algorithm: k-means. The k-means algorithm for
partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Partitioning Methods
 Centroid-Based Technique: The k-Means Method
 Method:
 arbitrarily choose k objects from D as the initial cluster centers;
 repeat
(re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
 until no change;
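A minimal sketch of this loop in Python/NumPy, assuming numeric data (the function and parameter names here are illustrative, not from any particular library, and empty clusters are not handled):

import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (re)assign each object to the cluster whose mean is nearest
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update the cluster means
        new_centers = np.array([D[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):  # until no change
            break
        centers = new_centers
    return labels, centers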
Partitioning Methods
Representative Object-Based Technique: The
k-Medoids Method
 The k-means algorithm is sensitive to outliers
because an object with an extremely large value
may substantially distort the distribution of data.
Partitioning Methods
 PAM(Partitioning Around Medoids)
 It attempts to determine k partitions for n objects.
 After an initial random selection of k representative
objects, the algorithm repeatedly tries to make a better
choice of cluster representatives.
 All of the possible pairs of objects are analyzed, where one
object in each pair is considered a representative object and
the other is not.
Partitioning Methods
 PAM(Partitioning Around Medoids)
 The quality of the resulting clustering is calculated for each such
combination.
 An object, oj, is replaced with the object causing the greatest
reduction in error.
 The set of best objects for each cluster in one iteration forms the
representative objects for the next iteration.
 The final set of representative objects are the respective medoids
of the clusters.
Partitioning Methods
 Which method is more robust—k-means or k-
medoids?
 The k-medoids method is more robust than k-
means in the presence of noise and outliers.
 A medoid is less influenced by outliers or other
extreme values than a mean.
Partitioning Methods
 Partitioning Methods in Large Databases: From k-
Medoids to CLARANS
 A typical k-medoids partitioning algorithm like PAM
works effectively for small data sets, but does not scale
well for large data sets.
 To deal with larger data sets, a sampling-based method,
called CLARA (Clustering LARge Applications), can be
used
Partitioning Methods
 Partitioning Methods in Large Databases: From k-
Medoids to CLARANS
 Instead of taking the whole set of data into consideration, a
small portion of the actual data is chosen as a representative
of the data.
 Medoids are then chosen from this sample using PAM.
 If the sample is selected in a fairly random manner, it
should closely represent the original data set.
Partitioning Methods
 Partitioning Methods in Large Databases: From k-Medoids to
CLARANS
 The representative objects (medoids) chosen will likely be similar to
those that would have been chosen from the whole data set.
 CLARA draws multiple samples of the data set, applies PAM on each
sample, and returns its best clustering as the output.
 The complexity of each iteration now becomes O(ks² + k(n - k)), where s
is the size of the sample, k is the number of clusters, and n is the total
number of objects.
Partitioning Methods
 Partitioning Methods in Large Databases: From
k-Medoids to CLARANS
 The effectiveness of CLARA depends on the sample
size.
 CLARA searches for the best k medoids among the
selected sample of the data set.
 If any of the best k medoids of the full data set is not
among this sample, CLARA will never find the best clustering.
Hierarchical Methods
 A hierarchical clustering method works by grouping data objects
into a tree of clusters.
 Hierarchical clustering methods can be further classified as either
agglomerative or divisive, depending on whether the hierarchical
decomposition is formed in a bottom-up (merging) or top-down
(splitting) fashion.
 The quality of a pure hierarchical clustering method suffers from its
inability to perform adjustment once a merge or split decision has
been executed.
Hierarchical Methods
 Agglomerative and Divisive Hierarchical
Clustering:
 Agglomerative hierarchical clustering: This bottom-up
strategy starts by placing each object in its own cluster
and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are
satisfied.
Hierarchical Methods
 Agglomerative and Divisive Hierarchical Clustering:
 Divisive hierarchical clustering: This top-down strategy does
the reverse of agglomerative hierarchical clustering by starting
with all objects in one cluster.
 It subdivides the cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until it satisfies certain
termination conditions, such as a desired number of clusters is
obtained or the diameter of each cluster is within a certain
threshold.
Hierarchical Methods
 Agglomerative and Divisive Hierarchical
Clustering:
 A tree structure called a dendrogram is commonly
used to represent the process of hierarchical
clustering. It shows how objects are grouped
together step by step.
Hierarchical Methods
 Agglomerative and Divisive Hierarchical
Clustering:
 Four widely used measures for distance between
clusters are as follows, where |p - p'| is the distance
between two objects or points p and p', m_i is the mean
of cluster C_i, and n_i is the number of objects in C_i.
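The four measures, in this notation, are:

  Minimum distance:  d_min(C_i, C_j) = min |p - p'|  over p in C_i, p' in C_j
  Maximum distance:  d_max(C_i, C_j) = max |p - p'|  over p in C_i, p' in C_j
  Mean distance:     d_mean(C_i, C_j) = |m_i - m_j|
  Average distance:  d_avg(C_i, C_j) = (1 / (n_i · n_j)) Σ_{p in C_i} Σ_{p' in C_j} |p - p'|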
Hierarchical Methods
 Agglomerative and Divisive Hierarchical Clustering:
 When an algorithm uses the minimum distance, d_min(C_i,
C_j), to measure the distance between clusters, it is
sometimes called a nearest-neighbor clustering algorithm.
 When an algorithm uses the maximum distance, d_max(C_i,
C_j), to measure the distance between clusters, it is
sometimes called a farthest-neighbor clustering algorithm.
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
 BIRCH is designed for clustering a large amount of numerical
data by integration of hierarchical clustering (at the initial
microclustering stage) and other clustering methods such as
iterative partitioning (at the later macroclustering stage).
 It overcomes the two difficulties of agglomerative clustering
methods: (1) scalability and (2) the inability to undo what was
done in the previous step.
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
 BIRCH introduces two concepts, clustering feature and
clustering feature tree (CF tree), which are used to summarize
cluster representations.
 These structures help the clustering method achieve good speed
and scalability in large databases and also make it effective for
incremental and dynamic clustering of incoming objects.
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
 A clustering feature (CF) is a three-dimensional vector
summarizing information about clusters of objects.
 Given n d-dimensional objects or points in a cluster, {xi}, then
the CF of the cluster is defined as
CF=<n, LS, SS>
where n is the number of points in the cluster, LS is the linear sum
of the n points, and SS is the square sum of the data points
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and
Clustering Using Hierarchies:
 Clustering feature. Suppose that there are three points,
(2, 5), (3, 2), and (4, 3), in a cluster, C1. The clustering
feature of C1 is
 CF1 = <3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)> =
<3, (9, 10), (29, 38)>.
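Clustering features are additive: if CF1 = <n1, LS1, SS1> and CF2 = <n2, LS2, SS2> summarize two disjoint clusters, the cluster formed by merging them is summarized by CF1 + CF2 = <n1 + n2, LS1 + LS2, SS1 + SS2>. This additivity is what lets BIRCH maintain the CF tree incrementally as objects arrive.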
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
 Clustering features are sufficient for calculating all of the
measurements that are needed for making clustering decisions in
BIRCH.
 BIRCH thus utilizes storage efficiently by employing the
clustering features to summarize information about the clusters
of objects, thereby bypassing the need to store all objects.
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
 A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
 A CF tree has two parameters: branching factor, B, and threshold, T.
 The branching factor specifies the maximum number of children per
nonleaf node.
 The threshold parameter specifies the maximum diameter of
subclusters stored at the leaf nodes of the tree.
Hierarchical Methods
 BIRCH: Balanced Iterative Reducing and
Clustering Using Hierarchies:
 The computation complexity of the algorithm is O(n),
where n is the number of objects to be clustered.
 Each node in a CF tree can hold only a limited
number of entries due to its size.
Hierarchical Methods
 ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
 ROCK (RObust Clustering using linKs)
 It explores the concept of links (the number of common
neighbors between two objects) for data with categorical
attributes.
 Traditional distance measures do not lead to high-quality
clusters when clustering categorical data.
Hierarchical Methods
 ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
 ROCK takes a more global approach to clustering by
considering the neighborhoods of individual pairs of points.
If two similar points also have similar neighborhoods, then
the two points likely belong to the same cluster and so can
be merged.
Hierarchical Methods
 ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
 Based on these ideas, ROCK first constructs a sparse graph
from a given data similarity matrix using a similarity
threshold and the concept of shared neighbors.
 It then performs agglomerative hierarchical clustering on
the sparse graph.
Random sampling is used for scaling up to large data sets.
Hierarchical Methods
 Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
 Chameleon is a hierarchical clustering algorithm that uses
dynamic modeling to determine the similarity between
pairs of clusters.
 In Chameleon, cluster similarity is assessed based on how
well-connected objects are within a cluster and on the
proximity of clusters.
Hierarchical Methods
 Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
 That is, two clusters are merged if their
interconnectivity is high and they are close together.
 The merge process facilitates the discovery of natural
and homogeneous clusters and applies to all types of
data as long as a similarity function can be specified.
Hierarchical Methods
 Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
 Chameleon uses a k-nearest-neighbor graph approach
to construct a sparse graph, where each vertex of the
graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among
the k-most-similar objects of the other.
Hierarchical Methods
 Chameleon: A Hierarchical Clustering Algorithm Using
Dynamic Modeling
 The edges are weighted to reflect the similarity between objects.
 It then uses an agglomerative hierarchical clustering algorithm
that repeatedly merges subclusters based on their similarity
 To determine the pairs of most similar subclusters, it takes into
account both the interconnectivity as well as the closeness of the
clusters.
Hierarchical Methods
 Chameleon: A Hierarchical Clustering
Algorithm Using Dynamic Modeling
 Chameleon determines the similarity between
each pair of clusters Ci and Cj according to their
relative interconnectivity, RI(Ci, Cj), and their
relative closeness, RC(Ci, Cj)
Hierarchical Methods
 Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
 The relative interconnectivity, RI(Ci, Cj), between
two clusters, Ci and Cj, is defined as the absolute
interconnectivity between Ci and Cj, normalized with
respect to the internal interconnectivity of the two
clusters, Ci and Cj. That is,
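In the usual formulation (following Han and Kamber),

  RI(C_i, C_j) = |EC(C_i, C_j)| / ( ( |EC(C_i)| + |EC(C_j)| ) / 2 )

where EC(C_i, C_j) is the edge cut between the two clusters (the sum of the weights of the edges connecting C_i and C_j), and EC(C_i) is the minimum-weight edge cut that bisects C_i into two roughly equal parts.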
Hierarchical Methods
 Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
 The relative closeness, RC(Ci, Cj), between a pair of
clusters,Ci and Cj, is the absolute closeness between
Ci and Cj, normalized with respect to the internal
closeness of the two clusters, Ci and Cj. It is defined as
Density-Based Methods
 To discover clusters with arbitrary shape, density-
based clustering methods have been developed.
 These typically regard clusters as dense regions
of objects in the data space that are separated by
regions of low density (representing noise).
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method Based on
Connected Regions with Sufficiently High Density
 DBSCAN (Density-Based Spatial Clustering of Applications
with Noise)
 The algorithm grows regions with sufficiently high density into
clusters and discovers clusters of arbitrary shape in spatial
databases with noise.
 It defines a cluster as a maximal set of density-connected points.
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
 The neighborhood within a radius ε of a given object is
called the ε-neighborhood of the object.
If the ε-neighborhood of an object contains at least a
minimum number, MinPts, of objects, then the object is
called a core object.
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method
Based on Connected Regions with Sufficiently
High Density
 Given a set of objects, D, we say that an object p is
directly density-reachable from object q if p is within
the ε-neighborhood of q, and q is a core object.
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
 Direct density reachability: a point is directly density-reachable
if there is a core point within its ε-neighborhood.
For example, if the point (1, 2) has the core point (1.2, 2.5) in its
ε-neighborhood, then (1, 2) is directly density-reachable
from (1.2, 2.5).
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
 Density reachability: a point is density-reachable
from another point if they are connected through a chain of
core points. For example, the points (1, 3) and
(1.5, 2.5) are density-reachable from each other because
they are connected through the core point (1.2, 2.5).
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method Based on
Connected Regions with Sufficiently High Density
 DBSCAN searches for clusters by checking the ε-neighborhood of
each point in the database.
 If the ε-neighborhood of a point p contains more than MinPts points, a new
cluster with p as a core object is created.
 DBSCAN then iteratively collects directly density-reachable objects
from these core objects, which may involve the merge of a few density-
reachable clusters.
 The process terminates when no new point can be added to any cluster.
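A minimal usage sketch with scikit-learn's DBSCAN implementation (the sample points, eps, and min_samples values here are illustrative choices, not taken from the slides):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.2, 2.5], [1.5, 2.5], [1.0, 3.0],   # one dense group
              [8.0, 8.0], [8.2, 7.9], [8.1, 8.3],               # another dense group
              [25.0, 80.0]])                                    # an isolated point

# eps is the neighborhood radius; min_samples plays the role of MinPts
db = DBSCAN(eps=1.0, min_samples=3).fit(X)
print(db.labels_)   # cluster label per point; noise points are labeled -1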
Density-Based Methods
 DBSCAN: A Density-Based Clustering Method
Based on Connected Regions with Sufficiently
High Density
 If a spatial index is used, the computational
complexity of DBSCAN is O(n log n), where n is the
number of database objects.
 Otherwise, it is O(n²).
Density-Based Methods
 OPTICS: Ordering Points to Identify the Clustering Structure
 OPTICS computes an augmented cluster ordering for automatic and
interactive cluster analysis.
 This ordering represents the density-based clustering structure of the
data.
 It contains information that is equivalent to density-based clustering
obtained from a wide range of parameter settings.
 The cluster ordering can be used to extract basic clustering information
(such as cluster centers or arbitrary-shaped clusters) as well as provide
the intrinsic clustering structure.
Density-Based Methods
 OPTICS: Ordering Points to Identify the
Clustering Structure
 The core-distance of an object p is the smallest ε′
value that makes p a core object. If p is not a
core object, the core-distance of p is undefined.
Density-Based Methods
 OPTICS: Ordering Points to Identify the Clustering
Structure
 The reachability-distance of an object q with respect to
another object p is the greater value of the core-distance of
p and the Euclidean distance between p and q.
If p is not a core object, the reachability-distance between p
and q is undefined.
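The core-distance of a core object p is thus the distance from p to its MinPts-th nearest neighbor, and for any object q,

  reachability-distance(q, p) = max( core-distance(p), d(p, q) )

provided p is a core object.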
Density-Based Methods
 OPTICS: Ordering Points to Identify the Clustering
Structure
 An algorithm was proposed to extract clusters based on the
ordering information produced by OPTICS.
Such information is sufficient for the extraction of all
density-based clusterings with respect to any distance ε′
that is smaller than the distance ε used in generating the
order.
Density-Based Methods
 OPTICS: Ordering Points to Identify the
Clustering Structure
 Because of the structural equivalence of the OPTICS
algorithm to DBSCAN, the OPTICS algorithm has the
same runtime complexity as that of DBSCAN, that is,
O(nlogn) if a spatial index is used, where n is the
number of objects.
Density-Based Methods
 DENCLUE: Clustering Based on Density Distribution Functions
 DENCLUE (DENsity-based CLUstEring)
 The method is built on the following ideas: the influence of each data point can
be formally modeled using a mathematical function, called an influence
function, which describes the impact of a data point within its neighborhood;
 The overall density of the data space can be modeled analytically as the sum of
the influence function applied to all data points; and
 Clusters can then be determined mathematically by identifying density
attractors, where density attractors are local maxima of the overall density
function.
Density-Based Methods
 DENCLUE: Clustering Based on Density Distribution
Functions
 The influence function of a data object y on an object x
describes the impact of y at the position of x.
 The influence function can be an arbitrary function
determined by the distance between the two objects in a
neighborhood.
 The distance function, d(x, y), should be reflexive and
symmetric, such as the Euclidean distance function
Density-Based Methods
 DENCLUE: Clustering Based on Density
Distribution Functions
 The density function at an object or point x ∈ F^d is
defined as the sum of the influence functions of all
data points.
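With the commonly used Gaussian influence function, the influence of y on x and the resulting density at x over a data set D = {x_1, ..., x_n} are

  f_Gauss(x, y) = exp( - d(x, y)² / (2σ²) )
  f_Gauss^D(x) = Σ_{i=1..n} exp( - d(x, x_i)² / (2σ²) )

where σ is a parameter controlling how quickly the influence decays with distance.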
Density-Based Methods
 DENCLUE: Clustering Based on Density Distribution
Functions
 From the density function, we can define the gradient of
the function and the density attractor, the local maxima of
the overall density function.
 An arbitrary-shape cluster for a set of density attractors is a
set of points, each of which is density-attracted to one of the
attractors, where the attractors themselves are connected by
paths of sufficiently high density.
Grid-Based Methods
 The grid-based clustering approach uses a
multiresolution grid data structure.
 It quantizes the object space into a finite number
of cells that form a grid structure on which all of
the operations for clustering are performed.
 The main advantage of the approach is its fast
processing time
Grid-Based Methods
 STING: STatistical INformation Grid
 STING is a grid-based multiresolution clustering technique
in which the spatial area is divided into rectangular cells.
 There are usually several levels of such rectangular cells
corresponding to different levels of resolution, and these
cells form a hierarchical structure: each cell at a high level
is partitioned to form a number of cells at the next lower
level.
Grid-Based Methods
 STING: STatistical INformation Grid
 Statistical information regarding the attributes in
each grid cell (such as the mean, maximum, and
minimum values) is precomputed and stored.
 These statistical parameters are useful for query
processing
Grid-Based Methods
 STING: STatistical INformation Grid
 Statistical parameters of higher-level cells can easily be
computed from the parameters of the lower-level cells.
 These parameters include the following: the attribute-
independent parameter, count; the attribute-dependent
parameters, mean, stdev (standard deviation), min (minimum),
max (maximum); and the type of distribution that the attribute
value in the cell follows, such as normal, uniform, exponential,
or none.
Grid-Based Methods
 STING: STatistical INformation Grid
 When the data are loaded into the database, the
parameters count, mean, stdev, min, and max of the
bottom-level cells are calculated directly from the data.
The value of distribution may either be assigned by the
user if the distribution type is known beforehand or
obtained by hypothesis tests such as the χ² test.
Grid-Based Methods
 STING: STatistical INformation Grid
 The type of distribution of a higher-level cell can be
computed based on the majority of distribution types
of its corresponding lower-level cells in conjunction
with a threshold filtering process.
If the distributions of the lower level cells disagree
with each other and fail the threshold test, the
distribution type of the high-level cell is set to none.
Grid-Based Methods
 STING: STatistical INformation Grid
 First, a layer within the hierarchical structure is
determined from which the query-answering process is
to start.
 This layer typically contains a small number of cells.
For each cell in the current layer, we compute the
confidence interval (or estimated range of probability)
reflecting the cell’s relevancy to the given query.
Grid-Based Methods
 STING: STatistical INformation Grid
 The irrelevant cells are removed from further consideration.
 Processing of the next lower level examines only the remaining
relevant cells. This process is repeated until the bottom layer is reached.
 At this time, if the query specification is met, the regions of relevant
cells that satisfy the query are returned.
 Otherwise, the data that fall into the relevant cells are retrieved and
further processed until they meet the requirements of the query.
Grid-Based Methods
 STING: STatistical INformation Grid
 The grid-based computation is query-independent,
because the statistical information stored in each cell
represents the summary information of the data in the
grid cell, independent of the query;
The grid structure facilitates parallel processing and
incremental updating
Grid-Based Methods
 WaveCluster: Clustering Using Wavelet
Transformation
 WaveCluster is a multiresolution clustering algorithm that
first summarizes the data by imposing a multidimensional
grid structure onto the data space.
It then uses a wavelet transformation to transform the
original feature space, finding dense regions in the
transformed space.
Grid-Based Methods
 WaveCluster: Clustering Using Wavelet
Transformation
 In this approach, each grid cell summarizes the
information of a group of points that map into the cell.
 This summary information typically fits into main
memory for use by the multiresolution wavelet
transform and the subsequent cluster analysis.
Grid-Based Methods
 WaveCluster: Clustering Using Wavelet Transformation
 A wavelet transform is a signal processing technique that
decomposes a signal into different frequency subbands.
 The wavelet model can be applied to d-dimensional signals by
applying a one-dimensional wavelet transform d times.
 In applying a wavelet transform, data are transformed so as to
preserve the relative distance between objects at different levels
of resolution.
Grid-Based Methods
 WaveCluster: Clustering Using Wavelet
Transformation
 It provides unsupervised clustering
 The multiresolution property of wavelet transformations
can help detect clusters at varying levels of accuracy.
 Wavelet-based clustering is very fast, with a computational
complexity of O(n)
Model-Based Clustering Methods
 Model-based clustering methods attempt to
optimize the fit between the given data and
some mathematical model.
 Expectation-Maximization
 Conceptual Clustering
 Neural Network Approach
Model-Based Clustering Methods
 Expectation-Maximization
 Each cluster can be represented mathematically by a parametric
probability distribution.
 The entire data is a mixture of these distributions, where each
individual distribution is typically referred to as a component
distribution.
 The EM(Expectation-Maximization) algorithm is a popular
iterative refinement algorithm that can be used for finding the
parameter estimates.
Model-Based Clustering Methods
Expectation-Maximization
 It can be viewed as an extension of the k-means
paradigm, which assigns an object to the cluster with
which it is most similar, based on the cluster mean.
 Instead of assigning each object to a dedicated cluster,
EM assigns each object to a cluster according to a
weight representing the probability of membership.
Model-Based Clustering Methods
Expectation-Maximization – Algorithm:
 Make an initial guess of the parameter vector: This
involves randomly selecting k objects to represent the
cluster means or centers (as in k-means partitioning),
as well as making guesses for the additional
parameters.
 Iteratively refine the parameters (or clusters)
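In the usual Gaussian-mixture formulation, each refinement iteration alternates two steps:
(a) Expectation step: assign each object x_i to cluster C_k with the probability
  P(x_i ∈ C_k) = p(C_k | x_i) = p(C_k) p(x_i | C_k) / p(x_i)
(b) Maximization step: use these probabilities to re-estimate the model parameters, for example the cluster means
  m_k = ( Σ_i x_i · P(x_i ∈ C_k) ) / ( Σ_i P(x_i ∈ C_k) )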
Model-Based Clustering Methods
Conceptual Clustering
 Given a set of unlabeled objects, conceptual clustering produces a
classification scheme over the objects.
 It finds characteristic descriptions for each group,
where each group represents a concept or class.
 Conceptual clustering is a two-step process: clustering
is performed first, followed by characterization.
Model-Based Clustering Methods
Conceptual Clustering – Classification Tree:
 Each node in a classification tree refers to a concept
and contains a probabilistic description of that concept,
which summarizes the objects classified under the
node.
 The sibling nodes at a given level of a classification
tree are said to form a partition.
Model-Based Clustering Methods
Conceptual Clustering – Classification Tree:
 To classify an object using a classification tree, a
partial matching function is employed to descend
the tree along a path of “best” matching nodes.
Model-Based Clustering Methods
Neural Network Approach:
 The neural network approach is motivated by
biological neural networks.
 A neural network is a set of connected
input/output units, where each connection has a
weight associated with it.
Model-Based Clustering Methods
 Neural Network Approach:
 First, neural networks are inherently parallel and
distributed processing architectures.
 Second, neural networks learn by adjusting their
interconnection weights so as to best fit the data.
 Third, neural networks process numerical vectors and
require object patterns to be represented by quantitative
features only.
Clustering High-Dimensional Data
 Most clustering methods are designed for clustering low-
dimensional data and encounter challenges when the dimensionality
of the data grows very high.
 When dimensionality increases, data usually become increasingly
sparse because the data points are likely located in different
dimensional subspaces.
 To overcome this difficulty, we may consider using feature (or
attribute) transformation and feature (or attribute) selection
techniques
Clustering High-Dimensional Data
 Feature transformation methods:
 Transform the data onto a smaller space
 They summarize data by creating linear combinations
of the attributes, and may discover hidden structures in
the data.
 Techniques do not actually remove any of the original
attributes from analysis.
Clustering High-Dimensional Data
 Feature transformation methods:
 This is problematic when there are a large number of
irrelevant attributes.
 The transformed features (attributes) are often difficult to
interpret, making the clustering results less useful.
 Feature transformation is only suited to data sets where
most of the dimensions are relevant to the clustering task.
Clustering High-Dimensional Data
 Feature selection methods:
 Attribute subset selection (or feature subset selection)
is commonly used for data reduction by removing
irrelevant or redundant dimensions (or attributes).
 It is most commonly performed by supervised
learning—the most relevant set of attributes are found
with respect to the given class labels
Clustering High-Dimensional Data
 Feature selection methods:
 Subspace clustering is an extension to attribute
subset selection.
 Subspace clustering searches for groups of
clusters within different subspaces of the same data
set.
Clustering High-Dimensional Data
 CLIQUE: A Dimension-Growth Subspace Clustering
Method
 CLIQUE (CLustering In QUEst)
 The clustering process starts at single-dimensional
subspaces and grows upward to higher-dimensional ones.
 CLIQUE partitions each dimension like a grid
structure and determines whether a cell is dense based on
the number of points it contains.
Clustering High-Dimensional Data
 CLIQUE: A Dimension-Growth Subspace
Clustering Method
 An integration of density-based and grid-based
clustering methods.
Clustering High-Dimensional Data
 CLIQUE: A Dimension-Growth Subspace Clustering
Method – Ideas:
 Given a large set of multidimensional data points, the data space
is usually not uniformly occupied by the data points. CLIQUE’s
clustering identifies the sparse and the “crowded” areas in space
(or units), thereby discovering the overall distribution patterns of
the data set.
 In CLIQUE, a cluster is defined as a maximal set of connected
dense units.
Clustering High-Dimensional Data
 How does CLIQUE work?
 In the first step, CLIQUE partitions the d-dimensional
data space into non-overlapping rectangular units,
identifying the dense units among these. This is done
(in 1-D) for each dimension.
 In the second step, for each cluster, CLIQUE determines the
maximal region that covers the cluster of connected
dense units.
Clustering High-Dimensional Data
 PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
 PROCLUS (PROjected CLUStering)
 It starts by finding an initial approximation of the clusters
in the high-dimensional attribute space.
 Each dimension is then assigned a weight for each cluster,
and the updated weights are used in the next iteration to
regenerate the clusters.
Clustering High-Dimensional Data
 PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
 It adopts a distance measure called Manhattan
segmental distance, which is the Manhattan distance
on a set of relevant dimensions.
 The PROCLUS algorithm consists of three phases:
initialization, iteration, and cluster refinement.
Clustering High-Dimensional Data
 PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
 In the initialization phase, it ensures that each cluster is
represented by at least one object in the selected set.
 The iteration phase selects a random set of k medoids from
this reduced set (of medoids), and replaces “bad” medoids
with randomly chosen new medoids if the clustering is
improved.
Clustering High-Dimensional Data
 PROCLUS: A Dimension-Reduction
Subspace Clustering Method:
 The refinement phase computes new dimensions
for each medoid based on the clusters found,
reassigns points to medoids, and removes outliers.
Clustering High-Dimensional Data
 Frequent Pattern–Based Clustering Methods
 It searches for patterns (such as sets of items or
objects) that occur frequently in large data sets.
 Frequent pattern mining can lead to the discovery
of interesting associations and correlations among
data objects.
Clustering High-Dimensional Data
 Frequent Pattern–Based Clustering Methods - Frequent
term–based text clustering:
 Text documents are clustered based on the frequent terms
they contain.
A term is any sequence of characters separated from other
terms by a delimiter.
 A term can be made up of a single word or several words.
Clustering High-Dimensional Data
 Frequent Pattern–Based Clustering Methods -
Frequent term–based text clustering:
 In general, we first remove nontext information
(such as HTML tags and punctuation) and stop
words. Terms are then extracted.
Constraint-Based Cluster Analysis
 Constraint-based clustering finds clusters that satisfy
user-specified preferences or constraints.
 Constraints on individual objects: We can specify
constraints on the objects to be clustered.
 In a real estate application, for example, one may like to
spatially cluster only those luxury mansions worth over a
million dollars.
 This constraint confines the set of objects to be clustered.
Constraint-Based Cluster Analysis
 Constraints on the selection of clustering parameters:
 Clustering parameters are usually quite specific to the
given clustering algorithm.
 Examples of parameters include k, the desired number of
clusters in a k-means algorithm; or ε (the radius) and
MinPts (the minimum number of points) in the DBSCAN
algorithm.
Constraint-Based Cluster Analysis
 Constraints on the selection of clustering
parameters:
 Although such user-specified parameters may
strongly influence the clustering results, they are
usually confined to the algorithm itself.
Constraint-Based Cluster Analysis
 Constraints on distance or similarity functions:
 We can specify different distance or similarity
functions for specific attributes of the objects to be
clustered, or different distance measures for specific
pairs of objects.
 When clustering sportsmen, for example, we may use
different weighting schemes for height, body weight,
age, and skill level.
Constraint-Based Cluster Analysis
 User-specified constraints on the properties of
individual clusters:
 A user may like to specify desired characteristics
of the resulting clusters, which may strongly
influence the clustering process.
Outlier Analysis
 There exist data objects that do not comply
with the general behavior or model of the data.
 Such data objects, which are grossly different
from or inconsistent with the remaining set of
data, are called outliers.
Outlier Analysis
 Data visualization methods are weak in
detecting outliers in data with many
categorical attributes or in data of high
dimensionality, since human eyes are good at
visualizing numeric data of only two to three
dimensions.
Outlier Analysis
 There are two basic types of procedures for detecting outliers:
 Block procedures: In this case, either all of the suspect objects are
treated as outliers or all of them are accepted as consistent.
 Consecutive (or sequential) procedures: An example of such a
procedure is the inside out procedure. Its main idea is that the object
that is least “likely” to be an outlier is tested first. If it is found to be an
outlier, then all of the more extreme values are also considered outliers;
otherwise, the next most extreme object is tested, and so on. This
procedure tends to be more effective than block procedures.
Outlier Analysis
 How effective is the statistical approach at outlier
detection?
 A major drawback is that most tests are for single attributes, yet
many data mining problems require finding outliers in
multidimensional space.
 Statistical methods do not guarantee that all outliers will be
found for the cases where no specific test was developed, or
where the observed distribution cannot be adequately modeled
with any standard distribution.
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 

20IT501_DWDM_PPT_Unit_IV.ppt

  • 1. 20IT501 – Data Warehousing and Data Mining III Year / V Semester
  • 2. UNIT IV - CLUSTERING AND OUTLIER ANALYSIS Cluster Analysis - Types of Data – Categorization of Major Clustering Methods – K-means– Partitioning Methods – Hierarchical Methods - Density-Based Methods – Grid Based Methods – Model-Based Clustering Methods – Clustering High Dimensional Data – Constraint – Based Cluster Analysis – Outlier Analysis
  • 3. Introduction  Clustering is the process of grouping the data into classes or clusters, so that objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters.  Dissimilarities are assessed based on the attribute values describing the objects. Often, distance measures are used.  Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning.
  • 4. Cluster Analysis  The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.  A cluster is a collection of data objects that are similar to one another within the same cluster.  In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.
  • 5. Cluster Analysis  Clustering is also called data segmentation in some applications because clustering partitions large data sets into groups according to their similarity.  Clustering can also be used for outlier detection, where outliers (values that are “far away” from any cluster)  Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
  • 6. Cluster Analysis In machine learning, clustering is an example of unsupervised learning.  Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples.  For this reason, clustering is a form of learning by observation, rather than learning by examples.
  • 7. Cluster Analysis  Requirements of clustering:  Scalability  Ability to deal with different types of attributes  Discovery of clusters with arbitrary shape  Minimal requirements for domain knowledge to determine input parameters  Ability to deal with noisy data
  • 8. Cluster Analysis  Requirements of clustering: Incremental clustering and insensitivity to the order of input records  High dimensionality  Constraint-based clustering  Interpretability and usability
  • 9. Types of Data in Cluster Analysis  Interval-Scaled Variables:  It then describes distance measures that are commonly used for computing the dissimilarity of objects described by such variables. These measures include the Euclidean, Manhattan, and Minkowski distances.  Interval-scaled variables are continuous measurements of a roughly linear scale. Examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature
  • 10. Types of Data in Cluster Analysis  Interval-Scaled Variables:  The measurement unit used can affect the clustering analysis.  For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to a very different clustering structure.  To help avoid dependence on the choice of measurement units, the data should be standardized.
  • 11. Types of Data in Cluster Analysis  Interval-Scaled Variables:  To standardize measurements, one choice is to convert the original measurements to unitless variables.  Calculate the mean absolute deviation, sf :  Calculate the standardized measurement, or z-score:
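Both calculations fit in a few lines of Python; this is a minimal sketch, with m_f and s_f mirroring the slide's notation:

    import numpy as np

    def standardize(x):
        # x: 1-D array of measurements of one interval-scaled variable f
        m_f = x.mean()                  # mean of the variable
        s_f = np.abs(x - m_f).mean()    # mean absolute deviation, s_f
        return (x - m_f) / s_f          # standardized measurement (z-score)

    # Example: heights recorded in meters
    # standardize(np.array([1.62, 1.75, 1.80, 1.68]))

Using the mean absolute deviation rather than the standard deviation makes the standardization less sensitive to outliers, because the deviations are not squared.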
  • 12. Types of Data in Cluster Analysis  Interval-Scaled Variables:  The most popular distance measure is Euclidean distance, which is defined as  Another well-known metric is Manhattan (or city block) distance, defined as
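The two definitions referenced above (their formulas are not reproduced in this transcript) can be captured by one small Minkowski helper; p = 1 gives Manhattan (city block) distance and p = 2 gives Euclidean distance:

    import numpy as np

    def minkowski(x, y, p):
        # Minkowski distance of order p between two d-dimensional objects
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    x = np.array([1.0, 2.0])
    y = np.array([4.0, 6.0])
    print(minkowski(x, y, 2))   # Euclidean distance: 5.0
    print(minkowski(x, y, 1))   # Manhattan distance: 7.0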
  • 13. Types of Data in Cluster Analysis  Interval-Scaled Variables:
  • 14. Types of Data in Cluster Analysis  Binary Variables:  A binary variable has only two states: 0 or 1, where 0 means that the variable is absent, and 1 means that it is present.  Treating binary variables as if they are interval-scaled can lead to misleading clustering results. Therefore, methods specific to binary data are necessary for computing dissimilarities.
  • 15. Types of Data in Cluster Analysis  Binary Variables:  A binary variable is symmetric if both of its states are equally valuable and carry the same weight; that is, there is no preference on which outcome should be coded as 0 or 1.  One such example could be the attribute gender having the states male and female.  Its dissimilarity (or distance) measure is computed from the counts of matching and mismatching variable states for the two objects.
  • 16. Types of Data in Cluster Analysis  Binary Variables:  A binary variable is asymmetric if the outcomes of the states are not equally important, such as the positive and negative outcomes of a disease test.  The dissimilarity based on such variables is called asymmetric binary dissimilarity, where the number of negative matches, t, is considered unimportant and thus is ignored in the computation
  • 17. Types of Data in Cluster Analysis  Binary Variables:  Alternatively, we can measure the distance between two objects described by binary variables based on the notion of similarity instead of dissimilarity.  For example, the asymmetric binary similarity between the objects i and j, or sim(i, j), can be computed as follows.
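A sketch of these binary measures in terms of the usual contingency-table counts, where q counts variables that are 1 for both objects, r counts those that are 1 for i and 0 for j, s counts those that are 0 for i and 1 for j, and t counts the negative (0/0) matches:

    import numpy as np

    def binary_dissimilarity(i, j, symmetric=True):
        # i, j: 0/1 arrays over the same set of binary variables
        q = np.sum((i == 1) & (j == 1))
        r = np.sum((i == 1) & (j == 0))
        s = np.sum((i == 0) & (j == 1))
        t = np.sum((i == 0) & (j == 0))      # negative matches
        if symmetric:
            return (r + s) / (q + r + s + t)
        return (r + s) / (q + r + s)         # asymmetric: t is ignored

    def asymmetric_similarity(i, j):
        # sim(i, j), also known as the Jaccard coefficient
        q = np.sum((i == 1) & (j == 1))
        r = np.sum((i == 1) & (j == 0))
        s = np.sum((i == 0) & (j == 1))
        return q / (q + r + s)

Note that sim(i, j) = 1 - d(i, j) for the asymmetric case.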
  • 18. Types of Data in Cluster Analysis  Categorical Variables  A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.  The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
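With p total variables and m matches, the mismatch ratio d(i, j) = (p - m) / p is straightforward to compute:

    def categorical_dissimilarity(i, j):
        # i, j: equal-length sequences of categorical values (e.g., map colors)
        p = len(i)                               # total number of variables
        m = sum(a == b for a, b in zip(i, j))    # number of matches
        return (p - m) / p

    # categorical_dissimilarity(["red", "green"], ["red", "blue"])  -> 0.5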
  • 19. Types of Data in Cluster Analysis  Ordinal Variables  A discrete ordinal variable resembles a categorical variable, except that the M states of the ordinal value are ordered in a meaningful sequence.  For example, professional ranks are often enumerated in a sequential order, such as assistant, associate, and full for professors
  • 20. Types of Data in Cluster Analysis  Ordinal Variables  A continuous ordinal variable looks like a set of continuous data of an unknown scale; that is, the relative ordering of the values is essential but their actual magnitude is not.  For example, the relative ranking in a particular sport (e.g., gold, silver, bronze)
  • 21. Types of Data in Cluster Analysis  Ratio-Scaled Variables  A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale, approximately following the formula Ae^(Bt) or Ae^(-Bt), where A and B are positive constants and t typically represents time.
  • 22. Types of Data in Cluster Analysis  Variables of Mixed Types  One approach is to group each kind of variable together, performing a separate cluster analysis for each variable type.  This is feasible if these analyses derive compatible results.  A more preferable approach is to process all variable types together, performing a single cluster analysis.  One such technique combines the different variables into a single dissimilarity matrix, bringing all of the meaningful variables onto a common scale of the interval [0.0,1.0].
  • 23. Types of Data in Cluster Analysis  Vector Objects  In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases).  One popular way is to define the similarity function as a cosine measure as follows:
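A minimal cosine-measure sketch; the two keyword-count vectors below are purely illustrative:

    import numpy as np

    def cosine_similarity(x, y):
        # s(x, y) = (x . y) / (||x|| * ||y||)
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    d1 = np.array([3, 0, 1, 2])   # term counts for document 1
    d2 = np.array([1, 0, 0, 2])   # term counts for document 2
    print(cosine_similarity(d1, d2))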
  • 24. Partitioning Methods  Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The objects within a cluster are “similar,” whereas the objects of different clusters are “dissimilar” in terms of the data set attributes.
  • 25. Partitioning Methods  Centroid-Based Technique: The k-Means Method  The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster’s centroid or center of gravity.
  • 26. Partitioning Methods  Centroid-Based Technique: The k-Means Method  Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster. Input: k: the number of clusters, D: a data set containing n objects. Output: A set of k clusters.
  • 27. Partitioning Methods  Centroid-Based Technique: The k-Means Method  Method:  arbitrarily choose k objects from D as the initial cluster centers;  repeat (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster; update the cluster means, i.e., calculate the mean value of the objects for each cluster;  until no change;
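The three steps translate into a short sketch (a minimal illustration, not an optimized implementation; an empty cluster simply keeps its previous center):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Arbitrarily choose k objects from D as the initial cluster centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # (Re)assign each object to the cluster whose mean is closest.
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # Update the cluster means.
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):   # until no change
                break
            centers = new_centers
        return centers, labels

    # X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
    # centers, labels = k_means(X, k=2)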
  • 28. Partitioning Methods Representative Object-Based Technique: The k-Medoids Method  The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data.
  • 29. Partitioning Methods  PAM(Partitioning Around Medoids)  It attempts to determine k partitions for n objects.  After an initial random selection of k representative objects, the algorithm repeatedly tries to make a better choice of cluster representatives.  All of the possible pairs of objects are analyzed, where one object in each pair is considered a representative object and the other is not.
  • 30. Partitioning Methods  PAM(Partitioning Around Medoids)  The quality of the resulting clustering is calculated for each such combination.  An object, oj, is replaced with the object causing the greatest reduction in error.  The set of best objects for each cluster in one iteration forms the representative objects for the next iteration.  The final set of representative objects are the respective medoids of the clusters.
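A compact sketch of the swap step just described: every (representative, non-representative) pair is tried, and the swap that gives the greatest reduction in total error (the sum of distances of objects to their nearest medoid) is kept. This is a simplified illustration of PAM, not a faithful reproduction of its original bookkeeping:

    import numpy as np

    def total_cost(X, medoids):
        # Sum of distances from every object to its closest medoid.
        d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
        return d.min(axis=1).sum()

    def pam(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        medoids = list(rng.choice(n, size=k, replace=False))   # initial random medoids
        best = total_cost(X, medoids)
        for _ in range(max_iter):
            best_swap = None
            for i in range(k):                 # each current representative object
                for o in range(n):             # each candidate replacement
                    if o in medoids:
                        continue
                    trial = medoids[:i] + [o] + medoids[i + 1:]
                    cost = total_cost(X, trial)
                    if cost < best:
                        best, best_swap = cost, trial
            if best_swap is None:              # no swap reduces the error
                break
            medoids = best_swap
        labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :],
                                axis=2).argmin(axis=1)
        return medoids, labels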
  • 31. Partitioning Methods  Which method is more robust, k-means or k-medoids?  The k-medoids method is more robust than k-means in the presence of noise and outliers.  A medoid is less influenced by outliers or other extreme values than a mean.
  • 32. Partitioning Methods  Partitioning Methods in Large Databases: From k-Medoids to CLARANS  A typical k-medoids partitioning algorithm like PAM works effectively for small data sets, but does not scale well for large data sets.  To deal with larger data sets, a sampling-based method, called CLARA (Clustering LARge Applications), can be used.
  • 33. Partitioning Methods  Partitioning Methods in Large Databases: From k-Medoids to CLARANS  Instead of taking the whole set of data into consideration, a small portion of the actual data is chosen as a representative of the data.  Medoids are then chosen from this sample using PAM.  If the sample is selected in a fairly random manner, it should closely represent the original data set.
  • 34. Partitioning Methods  Partitioning Methods in Large Databases: From k-Medoids to CLARANS  The representative objects (medoids) chosen will likely be similar to those that would have been chosen from the whole data set.  CLARA draws multiple samples of the data set, applies PAM on each sample, and returns its best clustering as the output.  The complexity of each iteration now becomes O(ks2+k(n-k)), where s is the size of the sample, k is the number of clusters, and n is the total number of objects.
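A quick back-of-the-envelope comparison of the two per-iteration costs, with illustrative sizes that are not from the slides (PAM's per-iteration cost is on the order of k(n-k)^2):

    k, n, s = 10, 100_000, 60              # illustrative problem and sample sizes
    pam_cost   = k * (n - k) ** 2          # ~1.0e11 distance evaluations (PAM, full data)
    clara_cost = k * s ** 2 + k * (n - k)  # ~1.0e6  distance evaluations (CLARA, one sample)
    print(pam_cost, clara_cost)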
  • 35. Partitioning Methods  Partitioning Methods in Large Databases: From k-Medoids to CLARANS  The effectiveness of CLARA depends on the sample size.  CLARA searches for the best k medoids among the selected sample of the data set.  If any of the best k medoids of the whole data set is not among the selected sample, CLARA will never find the best clustering.
  • 36. Hierarchical Methods  A hierarchical clustering method works by grouping data objects into a tree of clusters.  Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.  The quality of a pure hierarchical clustering method suffers from its inability to perform adjustment once a merge or split decision has been executed.
  • 37. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:  Agglomerative hierarchical clustering: This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
  • 38. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:  Divisive hierarchical clustering: This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.  It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
  • 39. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:
  • 40. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:  A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together step by step.
  • 41. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:
  • 42. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:  Four widely used measures for distance between clusters are as follows, where |p-p’| is the distance between two objects or points, p and p’; mi is the mean for cluster, Ci; and ni is the number of objects in Ci.
  • 43. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:
  • 44. Hierarchical Methods  Agglomerative and Divisive Hierarchical Clustering:  When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance between clusters, it is sometimes called a nearest-neighbor clustering algorithm.  When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance between clusters, it is sometimes called a farthest-neighbor clustering algorithm.
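The four measures (minimum, maximum, mean, and average distance) can be written as small helpers, each taking two clusters given as arrays of points:

    import numpy as np

    def d_min(Ci, Cj):    # nearest-neighbor (single-link) distance
        return min(np.linalg.norm(p - q) for p in Ci for q in Cj)

    def d_max(Ci, Cj):    # farthest-neighbor (complete-link) distance
        return max(np.linalg.norm(p - q) for p in Ci for q in Cj)

    def d_mean(Ci, Cj):   # distance between the cluster means
        return np.linalg.norm(np.mean(Ci, axis=0) - np.mean(Cj, axis=0))

    def d_avg(Ci, Cj):    # average pairwise distance
        return float(np.mean([np.linalg.norm(p - q) for p in Ci for q in Cj]))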
  • 45. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  BIRCH is designed for clustering a large amount of numerical data by integration of hierarchical clustering (at the initial microclustering stage) and other clustering methods such as iterative partitioning (at the later macroclustering stage).  It overcomes the two difficulties of agglomerative clustering methods: (1) scalability and (2) the inability to undo what was done in the previous step.
  • 46. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  BIRCH introduces two concepts, clustering feature and clustering feature tree (CF tree), which are used to summarize cluster representations.  These structures help the clustering method achieve good speed and scalability in large databases and also make it effective for incremental and dynamic clustering of incoming objects.
  • 47. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  A clustering feature (CF) is a three-dimensional vector summarizing information about clusters of objects.  Given n d-dimensional objects or points in a cluster, {xi}, then the CF of the cluster is defined as CF=<n, LS, SS> where n is the number of points in the cluster, LS is the linear sum of the n points, and SS is the square sum of the data points
  • 48. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  Clustering feature. Suppose that there are three points, (2, 5), (3, 2), and (4, 3), in a cluster, C1. The clustering feature of C1 is  CF1 = <3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)> = <3, (9, 10), (29, 38)>.
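The CF in this example can be checked numerically; the same summary also illustrates why CFs are so compact: they are additive, so a parent node's CF is simply the sum of its children's CFs:

    import numpy as np

    points = np.array([[2, 5], [3, 2], [4, 3]])   # cluster C1 from the example
    n  = len(points)                              # 3
    LS = points.sum(axis=0)                       # linear sum  -> (9, 10)
    SS = (points ** 2).sum(axis=0)                # square sum  -> (29, 38)
    print(n, LS, SS)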
  • 49. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  Clustering features are sufficient for calculating all of the measurements that are needed for making clustering decisions in BIRCH.  BIRCH thus utilizes storage efficiently by employing the clustering features to summarize information about the clusters of objects, thereby bypassing the need to store all objects.
  • 50. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.  A CF tree has two parameters: branching factor, B, and threshold, T.  The branching factor specifies the maximum number of children per nonleaf node.  The threshold parameter specifies the maximum diameter of subclusters stored at the leaf nodes of the tree.
  • 51. Hierarchical Methods  BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies:  The computational complexity of the algorithm is O(n), where n is the number of objects to be clustered.  Each node in a CF tree can hold only a limited number of entries due to its size, so a CF tree node does not always correspond to what a user may consider a natural cluster.
  • 52. Hierarchical Methods  ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes  ROCK (RObust Clustering using linKs)  It explores the concept of links (the number of common neighbors between two objects) for data with categorical attributes.  Distance measures cannot lead to high-quality clusters when clustering categorical data.
  • 53. Hierarchical Methods  ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes  ROCK takes a more global approach to clustering by considering the neighborhoods of individual pairs of points. If two similar points also have similar neighborhoods, then the two points likely belong to the same cluster and so can be merged.
  • 54. Hierarchical Methods  ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes  Based on these ideas, ROCK first constructs a sparse graph from a given data similarity matrix using a similarity threshold and the concept of shared neighbors.  It then performs agglomerative hierarchical clustering on the sparse graph. Random sampling is used for scaling up to large data sets.
  • 55. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters.  In Chameleon, cluster similarity is assessed based on how well-connected objects are within a cluster and on the proximity of clusters.
  • 56. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  That is, two clusters are merged if their interconnectivity is high and they are close together.  The merge process facilitates the discovery of natural and homogeneous clusters and applies to all types of data as long as a similarity function can be specified.
  • 57. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
  • 58. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where each vertex of the graph represents a data object, and there exists an edge between two vertices (objects) if one object is among the k-most-similar objects of the other.
  • 59. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  The edges are weighted to reflect the similarity between objects.  It then uses an agglomerative hierarchical clustering algorithm that repeatedly merges subclusters based on their similarity  To determine the pairs of most similar subclusters, it takes into account both the interconnectivity as well as the closeness of the clusters.
  • 60. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  Chameleon determines the similarity between each pair of clusters Ci and Cj according to their relative interconnectivity, RI(Ci, Cj), and their relative closeness, RC(Ci, Cj)
  • 61. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  The relative interconnectivity, RI(Ci, Cj), between two clusters, Ci and Cj, is defined as the absolute interconnectivity between Ci and Cj, normalized with respect to the internal interconnectivity of the two clusters, Ci and Cj. That is,
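The slide's equation is not reproduced in this transcript; in the standard formulation from the Chameleon literature, with EC_{Ci,Cj} denoting the edges (the edge cut) that connect Ci to Cj in the k-nearest-neighbor graph and EC_Ci the min-cut bisector of Ci, it reads:

    RI(Ci, Cj) = |EC_{Ci,Cj}| / ( (|EC_Ci| + |EC_Cj|) / 2 )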
  • 62. Hierarchical Methods  Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling  The relative closeness, RC(Ci, Cj), between a pair of clusters, Ci and Cj, is the absolute closeness between Ci and Cj, normalized with respect to the internal closeness of the two clusters, Ci and Cj. It is defined as
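Likewise, writing S(E) for the average weight of the edges in an edge set E, the standard formulation is:

    RC(Ci, Cj) = S(EC_{Ci,Cj}) / ( (|Ci| / (|Ci| + |Cj|)) * S(EC_Ci) + (|Cj| / (|Ci| + |Cj|)) * S(EC_Cj) )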
  • 63. Density-Based Methods  To discover clusters with arbitrary shape, density-based clustering methods have been developed.  These typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise).
  • 64. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  DBSCAN (Density-Based Spatial Clustering of Applications with Noise)  The algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise.  It defines a cluster as a maximal set of density-connected points.
  • 65. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
  • 66. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the ε-neighborhood of q, and q is a core object.
  • 67. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  Directly density-reachable: A point is directly density-reachable if it has a core point in its neighborhood. For example, the point (1, 2) has a core point (1.2, 2.5) in its neighborhood, so it is directly density-reachable from that core point.
  • 68. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  Density-reachable: A point is density-reachable from another point if they are connected through a series of core points. For example, the points (1, 3) and (1.5, 2.5) are connected through the core point (1.2, 2.5), so they are density-reachable from each other.
  • 69. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database.  If the ε-neighborhood of a point p contains more than MinPts objects, a new cluster with p as a core object is created.  DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merge of a few density-reachable clusters.  The process terminates when no new point can be added to any cluster.
  • 70. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
  • 71. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  DBSCAN searches for clusters by checking the ε-neighborhood of each point in the database.  If the ε-neighborhood of a point p contains more than MinPts objects, a new cluster with p as a core object is created.
  • 72. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  DBSCAN then iteratively collects directly density-reachable objects from these core objects, which may involve the merge of a few density-reachable clusters. The process terminates when no new point can be added to any cluster.
  • 73. Density-Based Methods  DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density  If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects.  Otherwise, it is O(n²).
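A straightforward sketch of the procedure described above; it computes all pairwise distances up front, so it corresponds to the index-free O(n²) variant:

    import numpy as np
    from collections import deque

    def dbscan(X, eps, min_pts):
        n = len(X)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]   # eps-neighborhoods
        labels = np.full(n, -1)               # -1 marks noise / unassigned
        visited = np.zeros(n, dtype=bool)
        cluster = 0
        for p in range(n):
            if visited[p]:
                continue
            visited[p] = True
            if len(neighbors[p]) < min_pts:   # p is not a core object
                continue
            labels[p] = cluster               # start a new cluster at core object p
            queue = deque(neighbors[p])
            while queue:                      # collect density-reachable objects
                q = queue.popleft()
                if labels[q] == -1:
                    labels[q] = cluster
                if not visited[q]:
                    visited[q] = True
                    if len(neighbors[q]) >= min_pts:   # q is also a core object
                        queue.extend(neighbors[q])
            cluster += 1
        return labels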
  • 74. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure  OPTICS computes an augmented cluster ordering for automatic and interactive cluster analysis.  This ordering represents the density-based clustering structure of the data.  It contains information that is equivalent to density-based clustering obtained from a wide range of parameter settings.  The cluster ordering can be used to extract basic clustering information (such as cluster centers or arbitrary-shaped clusters) as well as provide the intrinsic clustering structure.
  • 75. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure  The core-distance of an object p is the smallest ε′ value that makes {p} a core object. If p is not a core object, the core-distance of p is undefined.
  • 76. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure
  • 77. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure  The reachability-distance of an object q with respect to another object p is the greater value of the core-distance of p and the Euclidean distance between p and q. If p is not a core object, the reachability-distance between p and q is undefined.
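The two definitions can be written down directly; in this sketch an object counts as part of its own neighborhood, and None stands in for "undefined":

    import numpy as np

    def core_distance(X, p, eps, min_pts):
        # Smallest eps' <= eps that would make p a core object.
        d = np.sort(np.linalg.norm(X - X[p], axis=1))
        if len(d) < min_pts or d[min_pts - 1] > eps:
            return None                       # p is not a core object
        return d[min_pts - 1]                 # distance to its MinPts-th nearest point

    def reachability_distance(X, q, p, eps, min_pts):
        # max(core-distance(p), dist(p, q)); undefined if p is not a core object.
        cd = core_distance(X, p, eps, min_pts)
        if cd is None:
            return None
        return max(cd, float(np.linalg.norm(X[p] - X[q])))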
  • 78. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure  An algorithm was proposed to extract clusters based on the ordering information produced by OPTICS. Such information is sufficient for the extraction of all density-based clusterings with respect to any distance ε′ that is smaller than the distance ε used in generating the order.
  • 79. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure  Because of the structural equivalence of the OPTICS algorithm to DBSCAN, the OPTICS algorithm has the same runtime complexity as that of DBSCAN, that is, O(n log n) if a spatial index is used, where n is the number of objects.
  • 80. Density-Based Methods  OPTICS: Ordering Points to Identify the Clustering Structure
  • 81. Density-Based Methods  DENCLUE: Clustering Based on Density Distribution Functions  DENCLUE (DENsity-based CLUstEring)  The method is built on the following ideas: the influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood;  The overall density of the data space can be modeled analytically as the sum of the influence function applied to all data points; and  Clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.
  • 82. Density-Based Methods  DENCLUE: Clustering Based on Density Distribution Functions  The influence of a data object y on a point x is described by an influence function.  In principle, the influence function can be an arbitrary function determined by the distance between two objects in a neighborhood.  The distance function, d(x, y), should be reflexive and symmetric, such as the Euclidean distance function.
  • 83. Density-Based Methods  DENCLUE: Clustering Based on Density Distribution Functions  The density function at an object or point x ∈ F^d is defined as the sum of the influence functions of all data points.
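With a Gaussian influence function, one common choice (here sigma plays the role of the neighborhood-width parameter), both ideas fit in a few lines:

    import numpy as np

    def gaussian_influence(x, y, sigma):
        # Influence of data object y on point x.
        d = np.linalg.norm(x - y)
        return np.exp(-(d ** 2) / (2 * sigma ** 2))

    def density(x, data, sigma):
        # Overall density at x: the sum of the influences of all data points.
        return sum(gaussian_influence(x, np.asarray(y), sigma) for y in data)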
  • 84. Density-Based Methods  DENCLUE: Clustering Based on Density Distribution Functions  From the density function, we can define the gradient of the function and the density attractors, the local maxima of the overall density function.  An arbitrary-shape cluster for a set of density attractors is then a set of points, each of which is density-attracted to one of those density attractors.
  • 85. Grid-Based Methods  The grid-based clustering approach uses a multiresolution grid data structure.  It quantizes the object space into a finite number of cells that form a grid structure on which all of the operations for clustering are performed.  The main advantage of the approach is its fast processing time
  • 86. Grid-Based Methods  STING: STatistical INformation Grid  STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells.  There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level.
  • 87. Grid-Based Methods  STING: STatistical INformation Grid  Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored.  These statistical parameters are useful for query processing
  • 88. Grid-Based Methods  STING: STatistical INformation Grid
  • 89. Grid-Based Methods  STING: STatistical INformation Grid  Statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells.  These parameters include the following: the attribute- independent parameter, count; the attribute-dependent parameters, mean, stdev (standard deviation), min (minimum), max (maximum); and the type of distribution that the attribute value in the cell follows, such as normal, uniform, exponential, or none.
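For the numeric parameters this roll-up is simple arithmetic. A sketch of how a parent cell's statistics follow from its children's (assuming the stored stdev values are population standard deviations):

    import numpy as np

    def aggregate(children):
        # children: list of dicts with keys count, mean, stdev, min, max
        n = sum(c["count"] for c in children)
        mean = sum(c["count"] * c["mean"] for c in children) / n
        # Recover E[X^2] for the parent from each child's mean and stdev.
        ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2) for c in children) / n
        return {"count": n,
                "mean": mean,
                "stdev": float(np.sqrt(ex2 - mean ** 2)),
                "min": min(c["min"] for c in children),
                "max": max(c["max"] for c in children)}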
  • 90. Grid-Based Methods  STING: STatistical INformation Grid  When the data are loaded into the database, the parameters count, mean, stdev, min, and max of the bottom-level cells are calculated directly from the data. The value of distribution may either be assigned by the user if the distribution type is known beforehand or obtained by hypothesis tests such as the χ² (chi-square) test.
  • 91. Grid-Based Methods  STING: STatistical INformation Grid  The type of distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process. If the distributions of the lower level cells disagree with each other and fail the threshold test, the distribution type of the high-level cell is set to none.
  • 92. Grid-Based Methods  STING: STatistical INformation Grid  First, a layer within the hierarchical structure is determined from which the query-answering process is to start.  This layer typically contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range of probability) reflecting the cell’s relevancy to the given query.
  • 93. Grid-Based Methods  STING: STatistical INformation Grid  The irrelevant cells are removed from further consideration.  Processing of the next lower level examines only the remaining relevant cells. This process is repeated until the bottom layer is reached.  At this time, if the query specification is met, the regions of relevant cells that satisfy the query are returned.  Otherwise, the data that fall into the relevant cells are retrieved and further processed until they meet the requirements of the query.
  • 94. Grid-Based Methods  STING: STatistical INformation Grid  The grid-based computation is query-independent, because the statistical information stored in each cell represents the summary information of the data in the grid cell, independent of the query; The grid structure facilitates parallel processing and incremental updating
  • 95. Grid-Based Methods  WaveCluster: Clustering Using Wavelet Transformation  WaveCluster is a multiresolution clustering algorithm that first summarizes the data by imposing a multidimensional grid structure onto the data space. It then uses a wavelet transformation to transform the original feature space, finding dense regions in the transformed space.
  • 96. Grid-Based Methods  WaveCluster: Clustering Using Wavelet Transformation  In this approach, each grid cell summarizes the information of a group of points that map into the cell.  This summary information typically fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis.
  • 97. Grid-Based Methods  WaveCluster: Clustering Using Wavelet Transformation  A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands.  The wavelet model can be applied to d-dimensional signals by applying a one-dimensional wavelet transform d times.  In applying a wavelet transform, data are transformed so as to preserve the relative distance between objects at different levels of resolution.
  • 98. Grid-Based Methods  WaveCluster: Clustering Using Wavelet Transformation  It provides unsupervised clustering  The multiresolution property of wavelet transformations can help detect clusters at varying levels of accuracy.  Wavelet-based clustering is very fast, with a computational complexity of O(n)
  • 99. Model-Based Clustering Methods  Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model.  Expectation-Maximization  Conceptual Clustering  Neural Network Approach
  • 100. Model-Based Clustering Methods  Expectation-Maximization  Each cluster can be represented mathematically by a parametric probability distribution.  The entire data is a mixture of these distributions, where each individual distribution is typically referred to as a component distribution.  The EM(Expectation-Maximization) algorithm is a popular iterative refinement algorithm that can be used for finding the parameter estimates.
  • 101. Model-Based Clustering Methods Expectation-Maximization  It can be viewed as an extension of the k-means paradigm, which assigns an object to the cluster with which it is most similar, based on the cluster mean.  Instead of assigning each object to a dedicated cluster, EM assigns each object to a cluster according to a weight representing the probability of membership.
  • 102. Model-Based Clustering Methods Expectation-Maximization – Algorithm:  Make an initial guess of the parameter vector: This involves randomly selecting k objects to represent the cluster means or centers (as in k-means partitioning), as well as making guesses for the additional parameters.  Iteratively refine the parameters (or clusters)
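A minimal one-dimensional Gaussian-mixture sketch that makes the two refinement steps concrete (E-step: compute each object's membership weight in each cluster; M-step: re-estimate means, variances, and mixing weights from the weighted objects):

    import numpy as np

    def em_gmm_1d(x, k, iters=50, seed=0):
        # x: 1-D array of observations; each cluster is one Gaussian component.
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=k, replace=False)   # initial guesses for the means
        var = np.full(k, x.var())                   # initial variances
        pi = np.full(k, 1.0 / k)                    # mixing weights
        for _ in range(iters):
            # E-step: probability of membership of each object in each cluster.
            dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            w = dens / dens.sum(axis=1, keepdims=True)
            # M-step: re-estimate the parameters.
            nk = w.sum(axis=0)
            mu = (w * x[:, None]).sum(axis=0) / nk
            var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nk
            pi = nk / len(x)
        return mu, var, pi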
  • 103. Model-Based Clustering Methods Conceptual Clustering  Given a set of unlabeled objects, conceptual clustering produces a classification scheme over the objects.  It also finds characteristic descriptions for each group, where each group represents a concept or class.  Conceptual clustering is thus a two-step process: clustering is performed first, followed by characterization.
  • 104. Model-Based Clustering Methods Conceptual Clustering – Classification Tree:  Each node in a classification tree refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node.  The sibling nodes at a given level of a classification tree are said to form a partition.
  • 105. Model-Based Clustering Methods Conceptual Clustering – Classification Tree:  To classify an object using a classification tree, a partial matching function is employed to descend the tree along a path of “best” matching nodes.
  • 106. Model-Based Clustering Methods Conceptual Clustering – Classification Tree:
  • 107. Model-Based Clustering Methods Neural Network Approach:  The neural network approach is motivated by biological neural networks.  A neural network is a set of connected input/output units, where each connection has a weight associated with it.
  • 108. Model-Based Clustering Methods  Neural Network Approach:  First, neural networks are inherently parallel and distributed processing architectures.  Second, neural networks learn by adjusting their interconnection weights so as to best fit the data.  Third, neural networks process numerical vectors and require object patterns to be represented by quantitative features only.
  • 109. Clustering High-Dimensional Data  Most clustering methods are designed for clustering low-dimensional data and encounter challenges when the dimensionality of the data is very high.  When dimensionality increases, data usually become increasingly sparse because the data points are likely located in different dimensional subspaces.  To overcome this difficulty, we may consider using feature (or attribute) transformation and feature (or attribute) selection techniques.
  • 110. Clustering High-Dimensional Data  Feature transformation methods:  Transform the data onto a smaller space  They summarize data by creating linear combinations of the attributes, and may discover hidden structures in the data.  Techniques do not actually remove any of the original attributes from analysis.
  • 111. Clustering High-Dimensional Data  Feature transformation methods:  This is problematic when there are a large number of irrelevant attributes.  The transformed features (attributes) are often difficult to interpret, making the clustering results less useful.  Feature transformation is only suited to data sets where most of the dimensions are relevant to the clustering task.
  • 112. Clustering High-Dimensional Data  Feature selection methods:  Attribute subset selection (or feature subset selection) is commonly used for data reduction by removing irrelevant or redundant dimensions (or attributes).  It is most commonly performed by supervised learning: the most relevant set of attributes is found with respect to the given class labels.
  • 113. Clustering High-Dimensional Data  Feature selection methods:  Subspace clustering is an extension of attribute subset selection.  Subspace clustering searches for groups of clusters within different subspaces of the same data set.
  • 114. Clustering High-Dimensional Data  CLIQUE: A Dimension-Growth Subspace Clustering Method  CLIQUE (CLustering In QUEst)  The clustering process starts at single-dimensional subspaces and grows upward to higher-dimensional ones.  CLIQUE partitions each dimension like a grid structure and determines whether a cell is dense based on the number of points it contains.
  • 115. Clustering High-Dimensional Data  CLIQUE: A Dimension-Growth Subspace Clustering Method  It can therefore be viewed as an integration of density-based and grid-based clustering methods.
  • 116. Clustering High-Dimensional Data  CLIQUE: A Dimension-Growth Subspace Clustering Method – Ideas:  Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points. CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units), thereby discovering the overall distribution patterns of the data set.  In CLIQUE, a cluster is defined as a maximal set of connected dense units.
  • 117. Clustering High-Dimensional Data  How does CLIQUE work?  In the first step, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among these. This is done (in 1-D) for each dimension.  In the second step, For each cluster, it determines the maximal region that covers the cluster of connected dense units.
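The first step, taken one dimension at a time, amounts to equal-width binning followed by a density threshold. A small sketch (the number of intervals and the threshold are illustrative parameters a user would supply):

    import numpy as np

    def dense_units_1d(values, n_intervals, density_threshold):
        # Partition one dimension into equal-width units and keep the dense ones.
        lo, hi = values.min(), values.max()
        edges = np.linspace(lo, hi, n_intervals + 1)
        idx = np.clip(np.searchsorted(edges, values, side="right") - 1,
                      0, n_intervals - 1)
        counts = np.bincount(idx, minlength=n_intervals)
        return [u for u in range(n_intervals) if counts[u] >= density_threshold]

Dense units found in single dimensions are then combined, level by level, into candidate dense units of higher-dimensional subspaces, in the same spirit as Apriori candidate generation.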
  • 118. Clustering High-Dimensional Data  PROCLUS: A Dimension-Reduction Subspace Clustering Method:  PROCLUS (PROjected CLUStering)  It starts by finding an initial approximation of the clusters in the high-dimensional attribute space.  Each dimension is then assigned a weight for each cluster, and the updated weights are used in the next iteration to regenerate the clusters.
  • 119. Clustering High-Dimensional Data  PROCLUS: A Dimension-Reduction Subspace Clustering Method:  It adopts a distance measure called Manhattan segmental distance, which is the Manhattan distance on a set of relevant dimensions.  The PROCLUS algorithm consists of three phases: initialization, iteration, and cluster refinement.
  • 120. Clustering High-Dimensional Data  PROCLUS: A Dimension-Reduction Subspace Clustering Method:  In the initialization phase, it ensures that each cluster is represented by at least one object in the selected set.  The iteration phase selects a random set of k medoids from this reduced set (of medoids), and replaces “bad” medoids with randomly chosen new medoids if the clustering is improved.
  • 121. Clustering High-Dimensional Data  PROCLUS: A Dimension-Reduction Subspace Clustering Method:  The refinement phase computes new dimensions for each medoid based on the clusters found, reassigns points to medoids, and removes outliers.
  • 122. Clustering High-Dimensional Data  Frequent Pattern–Based Clustering Methods  It searches for patterns (such as sets of items or objects) that occur frequently in large data sets.  Frequent pattern mining can lead to the discovery of interesting associations and correlations among data objects.
  • 123. Clustering High-Dimensional Data  Frequent Pattern–Based Clustering Methods - Frequent term–based text clustering:  Text documents are clustered based on the frequent terms they contain. A term is any sequence of characters separated from other terms by a delimiter.  A term can be made up of a single word or several words.
  • 124. Clustering High-Dimensional Data  Frequent Pattern–Based Clustering Methods - Frequent term–based text clustering:  In general, we first remove nontext information (such as HTML tags and punctuation) and stop words. Terms are then extracted.
  • 125. Constraint-Based Cluster Analysis  Constraint-based clustering finds clusters that satisfy user-specified preferences or constraints.  Constraints on individual objects: We can specify constraints on the objects to be clustered.  In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars.  This constraint confines the set of objects to be clustered.
  • 126. Constraint-Based Cluster Analysis  Constraints on the selection of clustering parameters:  Clustering parameters are usually quite specific to the given clustering algorithm.  Examples of parameters include k, the desired number of clusters in a k-means algorithm; or ε (the radius) and MinPts (the minimum number of points) in the DBSCAN algorithm.
  • 127. Constraint-Based Cluster Analysis  Constraints on the selection of clustering parameters:  Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself.
  • 128. Constraint-Based Cluster Analysis  Constraints on distance or similarity functions:  We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects.  When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level.
  • 129. Constraint-Based Cluster Analysis  User-specified constraints on the properties of individual clusters:  A user may like to specify desired characteristics of the resulting clusters, which may strongly influence the clustering process.
  • 130. Outlier Analysis  There exist data objects that do not comply with the general behavior or model of the data.  Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
  • 131. Outlier Analysis  Data visualization methods are weak in detecting outliers in data with many categorical attributes or in data of high dimensionality, since human eyes are good at visualizing numeric data of only two to three dimensions.
  • 132. Outlier Analysis  There are two basic types of procedures for detecting outliers:  Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.  Consecutive (or sequential) procedures: An example of such a procedure is the inside out procedure. Its main idea is that the object that is least “likely” to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
  • 133. Outlier Analysis  How effective is the statistical approach at outlier detection?  A major drawback is that most tests are for single attributes, yet many data mining problems require finding outliers in multidimensional space.  Statistical methods do not guarantee that all outliers will be found for the cases where no specific test was developed, or where the observed distribution cannot be adequately modeled with any standard distribution.
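As a concrete illustration of that limitation, the simplest single-attribute test flags values whose z-score is large; note that a lone extreme value also inflates the standard deviation and can partly mask itself, which is one reason consecutive (inside-out) procedures and multidimensional methods are used:

    import numpy as np

    def z_score_outliers(x, threshold=2.5):
        # Flag values more than `threshold` standard deviations from the mean.
        z = (x - x.mean()) / x.std()
        return np.where(np.abs(z) > threshold)[0]

    # x = np.array([10.1, 9.8, 10.4, 9.9, 10.2, 9.7, 10.0, 10.3, 9.6, 55.0])
    # z_score_outliers(x)   # flags the index of the extreme value 55.0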