This document discusses different types of clustering methods used in data analysis. It covers partitioning methods like k-means and k-medoids that group data into a predefined number of clusters. It also describes different data types that can be clustered, including interval-scaled, binary, categorical, ordinal and ratio-scaled variables. The document provides details on calculating distances and similarities between objects for different variable types during clustering.
Clustering is the step-by-step process of grouping objects whose attribute values are nearly similar. A cluster is therefore a collection of objects with nearly the same attribute values: the properties of an object in a cluster are similar to those of other objects in the same cluster, but different from those of objects in other clusters. Clustering is used in a wide range of applications such as pattern recognition, image processing, data analysis, and machine learning. Nowadays more attention is being paid to categorical data than to numerical data, where the range of a numerical attribute is organized into classes such as small, medium, and high. A wide range of algorithms is used to cluster categorical data. Our approach enhances the well-known k-modes clustering algorithm to improve its accuracy. We propose a new approach named “High Accuracy Clustering Algorithm for Categorical Datasets”.
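The abstract above builds on k-modes, which adapts the k-means loop to categorical data by replacing the mean with the per-attribute mode and Euclidean distance with simple matching dissimilarity. A minimal sketch of those two pieces (an illustration of standard k-modes, not the proposed enhancement; the example records are made up):

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching distance used by k-modes: number of mismatched attributes."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Cluster representative: the most frequent value in each attribute position."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(cluster_mode(cluster))                                        # ('red', 'small')
print(matching_dissimilarity(("red", "small"), ("blue", "small")))  # 1
```

In the full algorithm these replace the centroid update and the distance computation of k-means, while the assign-then-update loop stays the same.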
Unsupervised learning Algorithms and Assumptions (refedey275)
Topics :
Introduction to unsupervised learning
Unsupervised learning Algorithms and Assumptions
K-Means algorithm – introduction
Implementation of K-means algorithm
Hierarchical Clustering – need and importance of hierarchical clustering
Agglomerative Hierarchical Clustering
Working of dendrogram
Steps for implementation of AHC using Python
Gaussian Mixture Models – Introduction, importance and need of the model
Normal (Gaussian) distribution
Implementation of Gaussian mixture model
Understand the different distance metrics used in clustering
Euclidean, Manhattan, Cosine, Mahalanobis
Features of a Cluster – Labels, Centroids, Inertia, Eigenvectors and Eigenvalues
Principal component analysis
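The four distance metrics named in the topics above are all available in SciPy; a small sketch (the sample points and the toy dataset used to estimate the covariance for Mahalanobis are made-up assumptions):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

print(distance.euclidean(a, b))   # 5.0  (straight-line distance)
print(distance.cityblock(a, b))   # 7.0  (Manhattan: sum of absolute differences)
print(distance.cosine(a, b))      # 1 - cosine similarity of the two vectors

# Mahalanobis additionally needs the inverse covariance of the data
# the points are drawn from, so it accounts for correlated, scaled features.
data = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 6.0], [3.0, 5.0]])
VI = np.linalg.inv(np.cov(data.T))
print(distance.mahalanobis(a, b, VI))
```

Euclidean is the default in k-means; Mahalanobis is useful when features are correlated or on different scales.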
Supervised learning (classification)
Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Types of Hierarchical Clustering
There are mainly two types of hierarchical clustering:
Agglomerative hierarchical clustering
Divisive Hierarchical clustering
A distribution in statistics is a function that shows the possible values for a variable and how often they occur.
In probability theory and statistics, the normal distribution, also called the Gaussian distribution, is the most important continuous probability distribution. It is sometimes also called a bell curve.
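For reference, the density behind the bell curve, for mean $\mu$ and standard deviation $\sigma$, is:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The curve is symmetric about $\mu$, and $\sigma$ controls its width; Gaussian mixture models combine several such densities with mixing weights.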
Cluster analysis is an unsupervised machine learning technique used to group unlabeled data points into clusters based on similarities. It involves finding groups of objects such that objects within a cluster are more similar to each other than objects in different clusters. The key goals of cluster analysis are to maximize intra-cluster similarity while minimizing inter-cluster similarity. Common applications of cluster analysis include market segmentation, document classification, and identifying homogeneous groups in biological data.
This document provides a short review of clustering techniques for students. It defines clustering and different types of grouping methods such as hard vs soft clustering. It discusses popular clustering algorithms like hierarchical clustering, k-means clustering, and density-based clustering. It also covers cluster validity, usability, preprocessing techniques, meta methods, and visual clustering. Open problems in clustering mentioned include how to identify outlier objects and accelerate classification.
1. Singular Value Decomposition (SVD) is a matrix factorization technique that decomposes a matrix into three other matrices.
2. SVD is primarily used for dimensionality reduction, information extraction, and noise reduction.
3. Key applications of SVD include matrix approximation, principal component analysis, image compression, recommendation systems, and signal processing.
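As a small illustration of points 1–3, NumPy's `np.linalg.svd` returns the three factor matrices, and truncating the singular values gives a low-rank approximation (the matrix `A` here is an arbitrary example):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Factorization A = U @ diag(s) @ Vt  (the "three other matrices")
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Rank-1 approximation: keep only the largest singular value.
# This is the dimensionality-reduction / compression use of SVD.
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.norm(A - A1))  # Frobenius error equals the dropped singular value
```

By the Eckart–Young theorem, this truncation is the best rank-1 approximation, which is why SVD underlies PCA and image compression.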
EDAB Module 5 Singular Value Decomposition (SVD).pptx (rajalakshmi5921)
This document discusses various clustering techniques used in data mining. It begins by defining clustering as an unsupervised learning technique that groups similar objects together. It then discusses advantages of clustering such as quality improvement and reuse opportunities. Several clustering methods are described such as K-means clustering, which aims to partition observations into k clusters where each observation belongs to the cluster with the nearest mean. The document concludes by discussing advantages of K-means clustering such as its linear time complexity and its use for spherical cluster shapes.
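The assign-then-recompute loop described above can be sketched in plain NumPy (a minimal Lloyd's-iteration sketch under assumed toy data; the `seed` and points are illustrative, not from any of the summarized documents):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each observation to the cluster with the nearest mean
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # clusters have stabilized
            break
        centroids = new
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)  # points 0,1 share one label; points 2,3 share the other
```

Swapping the mean update for a "closest actual data point" update turns this into the k-medoids variant mentioned alongside it.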
This document provides an introduction to cluster analysis, including definitions of key concepts, types of data and clusters, and clustering techniques. It defines cluster analysis as finding groups of similar objects that are different from objects in other groups. There are various types of clusters that can be formed, such as well-separated, center-based, contiguous, and density-based clusters. Common clustering techniques include partitional clustering (e.g., k-means), hierarchical clustering, and density-based clustering. The document discusses considerations for different data types and provides details on partitioning clustering algorithms.
Cluster analysis is an unsupervised machine learning technique that groups similar data objects into clusters. It finds internal structures within unlabeled data by partitioning it into groups based on similarity. Some key applications of cluster analysis include market segmentation, document classification, and identifying subtypes of diseases. The quality of clusters depends on both the similarity measure used and how well objects are grouped within each cluster versus across clusters.
This document provides an overview of different techniques for clustering categorical data. It discusses various clustering algorithms that have been used for categorical data, including K-modes, ROCK, COBWEB, and EM algorithms. It also reviews more recently developed algorithms for categorical data clustering, such as algorithms based on particle swarm optimization, rough set theory, and feature weighting schemes. The document concludes that clustering categorical data remains an important area of research, with opportunities to develop techniques that initialize cluster centers better.
This document provides an overview of cluster analysis techniques. It begins by defining cluster analysis and its applications. It then categorizes major clustering methods into partitioning methods (like k-means and k-medoids), hierarchical methods, density-based methods, grid-based methods, and model-based methods. The document discusses different data types that can be clustered and measures for determining cluster quality. It also outlines requirements for effective clustering in data mining.
UNIT 3: Data Warehousing and Data Mining (Nandakumar P)
UNIT-III Classification and Prediction: Issues Regarding Classification and Prediction – Classification by Decision Tree Introduction – Bayesian Classification – Rule Based Classification – Classification by Back propagation – Support Vector Machines – Associative Classification – Lazy Learners – Other Classification Methods – Prediction – Accuracy and Error Measures – Evaluating the Accuracy of a Classifier or Predictor – Ensemble Methods – Model Selection.
Data Mining Steps: Problem Definition, Market Analysis (Csharondabriggs)
Data Mining Steps
Problem Definition
Market Analysis
Customer Profiling, Identifying Customer Requirements, Cross Market Analysis, Target Marketing, Determining Customer purchasing pattern
Corporate Analysis and Risk Management
Finance Planning and Asset Evaluation, Resource Planning, Competition
Fraud Detection
Customer Retention
Production Control
Science Exploration
> Data Preparation
Data preparation is about constructing a dataset from one or more data sources to be used for exploration and modeling. It is good practice to start with an initial dataset to become familiar with the data, discover first insights, and gain a good understanding of any possible data quality issues. The datasets provided in these projects were obtained from kaggle.com.
Variable selection and description
Numerical – Ratio, Interval
Categorical – Ordinal, Nominal
Simplifying variables: From continuous to discrete
Formatting the data
Basic data integrity checks: missing data, outliers
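The integrity checks in the last step above might look like this in pandas (the toy DataFrame is hypothetical, standing in for a Kaggle download; the 1.5×IQR rule is one common outlier convention, not the only one):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value per column and one outlier
df = pd.DataFrame({"age": [23, 31, np.nan, 45, 120],
                   "city": ["NY", "LA", "NY", None, "SF"]})

# Missing data: count of nulls per column
print(df.isna().sum())

# Outliers via the 1.5*IQR rule on a numeric column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)  # the age-120 row falls outside the fences
```

Rows flagged here would then be imputed, capped, or dropped before modeling, depending on the project.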
> Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques.
· Data Visualization: Univariate analysis explores variables (attributes) one by one. Variables can be either categorical or numerical.
Univariate Analysis - Categorical

Statistic | Visualization | Description
Count     | Bar Chart     | The number of values of the specified variable.
Count%    | Pie Chart     | The percentage of values of the specified variable.
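The Count and Count% statistics for a categorical variable map directly onto pandas `value_counts` (the toy series is illustrative):

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colors.value_counts()                          # input for a bar chart
percents = colors.value_counts(normalize=True) * 100.0  # input for a pie chart
print(counts)
print(percents.round(1))  # red 50.0, blue 33.3, green 16.7
```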
Univariate Analysis - Numerical

Statistic                | Visualization | Equation  | Description
Count                    | Histogram     | N         | The number of values (observations) of the variable.
Minimum                  | Box Plot      | Min       | The smallest value of the variable.
Maximum                  | Box Plot      | Max       | The largest value of the variable.
Mean                     | Box Plot      |           | The sum of the values divided by the count.
Median                   | Box Plot      |           | The middle value; an equal number of values lie below and above it.
Mode                     | Histogram     |           | The most frequent value. There can be more than one mode.
Quantile                 | Box Plot      |           | A set of 'cut points' that divide the data into groups containing equal numbers of values (quartile, quintile, percentile, ...).
Range                    | Box Plot      | Max - Min | The difference between the maximum and the minimum.
Variance                 | Histogram     |           | A measure of data dispersion.
Standard Deviation       | Histogram     |           | The square root of the variance.
Coefficient of Variation | Histogram     |           | The standard deviation divided by the mean.
Skewness                 | Histogram     |           | A measure of symmetry or asymmetry in the distribution of the data.
Kurtosis                 | Histogram     |           | A measure of whether the data are peaked or flat relative to a normal distribution.
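Most statistics in the numerical table map directly onto pandas Series methods (the data are illustrative; note that pandas uses the sample versions of variance, skewness, and kurtosis by default):

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

stats = {
    "count": x.count(), "min": x.min(), "max": x.max(),
    "mean": x.mean(), "median": x.median(), "mode": x.mode().tolist(),
    "range": x.max() - x.min(),
    "variance": x.var(),              # sample variance (ddof=1)
    "std": x.std(),                   # square root of the variance
    "cv": x.std() / x.mean(),         # coefficient of variation
    "skewness": x.skew(), "kurtosis": x.kurt(),
}
for name, value in stats.items():
    print(name, value)
```

`x.describe()` bundles the count, mean, std, min, max, and quartiles in one call.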
Note: There are two types of numerical variables, interval and ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good example is temperature in Centigrade degrees. Data on an int ...
Clustering is an unsupervised learning technique used to group unlabeled data points into clusters based on similarity. It is widely used in data mining applications. The k-means algorithm is one of the simplest clustering algorithms that partitions data into k predefined clusters, where each data point belongs to the cluster with the nearest mean. It works by assigning data points to their closest cluster centroid and recalculating the centroids until clusters stabilize. The k-medoids algorithm is similar but uses actual data points as centroids instead of means, making it more robust to outliers.
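The outlier robustness of k-medoids mentioned above comes from the center being an actual data point; a minimal sketch of just the medoid-selection step (not a full PAM implementation, and the cluster with its single extreme outlier is a made-up example):

```python
import numpy as np

def medoid(points):
    """The point with the smallest total distance to all other points in the cluster."""
    d = np.linalg.norm(points[:, None] - points[None], axis=2)  # pairwise distances
    return points[d.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [2.0, 2.0], [1.5, 1.8], [100.0, 100.0]])  # one outlier
print(medoid(cluster))       # an actual point near the bulk of the cluster
print(cluster.mean(axis=0))  # the mean is dragged far toward the outlier
```

In k-medoids this selection replaces the mean update of k-means, at the cost of computing pairwise distances within each cluster.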
This document provides an overview of clustering and classification techniques in data mining. It defines clustering and classification as unsupervised and supervised learning respectively. The document discusses how classification works by building a model from training data and then using the model to classify new data. For clustering, it explains that clusters are formed by grouping similar data objects without predefined labels. The document also describes different types of clustering techniques like hierarchical, partitioning, and probabilistic clustering. Finally, it provides a step-by-step explanation of the k-means clustering algorithm.
Clustering is an unsupervised machine learning technique used to group unlabeled data points. There are two main approaches: hierarchical clustering and partitioning clustering. Partitioning clustering algorithms like k-means and k-medoids attempt to partition data into k clusters by optimizing a criterion function. Hierarchical clustering creates nested clusters by merging or splitting clusters. Examples of hierarchical algorithms include agglomerative clustering, which builds clusters from bottom-up, and divisive clustering, which separates clusters from top-down. Clustering can group both numerical and categorical data.
The document provides an overview of different clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods like agglomerative and divisive, and density-based methods like DBSCAN and OPTICS. It discusses the basic concepts of clustering, requirements for effective clustering like scalability and ability to handle different data types and shapes. It also summarizes clustering algorithms like BIRCH that aim to improve scalability for large datasets.
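The agglomerative (bottom-up) merging described in these summaries is what SciPy's `linkage` computes; cutting the resulting merge tree, which is what a dendrogram draws, yields flat clusters (the toy points are assumed for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.1],   # tight pair
              [5.0, 5.0], [5.1, 5.0],   # second tight pair
              [10.0, 0.0]])             # isolated point

# Bottom-up merge with average linkage; Z encodes the dendrogram
Z = linkage(X, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)  # points 0,1 together; 2,3 together; point 4 alone
```

Changing `method` to `"single"`, `"complete"`, or `"ward"` swaps the inter-cluster distance used at each merge step.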
Read first few slides: cluster analysis (Kritika Jain)
Cluster analysis is a technique used to group similar objects together based on their characteristics. It involves calculating distances between objects to determine which ones are most similar and should be clustered together. There are hierarchical and non-hierarchical approaches to cluster analysis. Hierarchical clustering starts with each object in its own cluster and then combines the most similar clusters together in each step, while non-hierarchical clustering assigns objects directly to pre-specified clusters. A combination of the two methods is often best to determine the optimal number of clusters.
The document describes different types of clustering algorithms, including partitioning, hierarchical, density-based, and grid-based methods. Partitioning methods like k-means and k-medoids aim to partition objects into k clusters by optimizing an objective function. Hierarchical clustering builds a hierarchy of clusters based on distance, either through an agglomerative (bottom-up) or divisive (top-down) approach. Density-based methods identify clusters based on density rather than distance. Grid-based methods quantize the space into a finite number of cells that form a grid structure.
This document summarizes Chapter 10 of the book "Data Mining: Concepts and Techniques (3rd ed.)" which covers cluster analysis. The chapter introduces different types of clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods, density-based methods, and grid-based methods. It discusses how to evaluate the quality of clustering results and highlights considerations for cluster analysis such as similarity measures, clustering space, and challenges like scalability and high dimensionality.
This document provides an overview of clustering and k-means clustering algorithms. It begins by defining clustering as the process of grouping similar objects together and dissimilar objects separately. K-means clustering is introduced as an algorithm that partitions data points into k clusters by minimizing total intra-cluster variance, iteratively updating cluster means. The k-means algorithm and an example are described in detail. Weaknesses and applications are discussed. Finally, vector quantization and principal component analysis are briefly introduced.
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
This document provides an overview of clustering and k-means clustering algorithms. It begins by defining clustering as the process of grouping similar objects together and dissimilar objects separately. K-means clustering is introduced as an algorithm that partitions data points into k clusters by minimizing total intra-cluster variance, iteratively updating cluster means. The k-means algorithm and an example are described in detail. Weaknesses and applications are discussed. Finally, vector quantization and principal component analysis are briefly introduced.
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
1. 20IT501 – Data Warehousing and
Data Mining
III Year / V Semester
2. UNIT IV - CLUSTERING AND
OUTLIER ANALYSIS
Cluster Analysis - Types of Data – Categorization of
Major Clustering Methods – K-means– Partitioning
Methods – Hierarchical Methods - Density-Based
Methods – Grid Based Methods – Model-Based
Clustering Methods – Clustering High Dimensional
Data – Constraint – Based Cluster Analysis –
Outlier Analysis
3. Introduction
Clustering is the process of grouping the data into classes or
clusters, so that objects within a cluster have high similarity
in comparison to one another but are very dissimilar to
objects in other clusters.
Dissimilarities are assessed based on the attribute values
describing the objects. Often, distance measures are used.
Clustering has its roots in many areas, including data
mining, statistics, biology, and machine learning.
4. Cluster Analysis
The process of grouping a set of physical or abstract objects
into classes of similar objects is called clustering.
A cluster is a collection of data objects that are similar to
one another within the same cluster.
In business, clustering can help marketers discover distinct
groups in their customer bases and characterize customer
groups based on purchasing patterns.
5. Cluster Analysis
Clustering is also called data segmentation in some
applications because clustering partitions large data sets into
groups according to their similarity.
Clustering can also be used for outlier detection, where
outliers (values that are “far away” from any cluster) may
be more interesting than common cases.
Applications of outlier detection include the detection of
credit card fraud and the monitoring of criminal activities in
electronic commerce.
6. Cluster Analysis
In machine learning, clustering is an example of
unsupervised learning.
Unlike classification, clustering and unsupervised
learning do not rely on predefined classes and
class-labeled training examples.
For this reason, clustering is a form of learning
by observation, rather than learning by examples.
7. Cluster Analysis
Requirements of clustering:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noisy data
8. Cluster Analysis
Requirements of clustering:
Incremental clustering and insensitivity to the
order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
9. Types of Data in Cluster Analysis
Interval-Scaled Variables:
Interval-scaled variables are continuous measurements of a
roughly linear scale. Examples include weight and height,
latitude and longitude coordinates (e.g., when clustering houses),
and weather temperature.
Distance measures commonly used for computing the
dissimilarity of objects described by such variables include
the Euclidean, Manhattan, and Minkowski distances.
10. Types of Data in Cluster Analysis
Interval-Scaled Variables:
The measurement unit used can affect the clustering
analysis.
For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight,
may lead to a very different clustering structure.
To help avoid dependence on the choice of measurement
units, the data should be standardized.
11. Types of Data in Cluster Analysis
Interval-Scaled Variables:
To standardize measurements, one choice is to convert the
original measurements to unitless variables.
Calculate the mean absolute deviation, sf:
sf = (1/n)(|x1f − mf| + |x2f − mf| + … + |xnf − mf|),
where mf is the mean of variable f.
Calculate the standardized measurement, or z-score:
zif = (xif − mf) / sf
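The standardization described above can be sketched in Python (the function name and sample data are invented for illustration):

```python
def standardize(values):
    """Convert raw measurements to unitless z-scores using the
    mean absolute deviation (more robust to outliers than the
    standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
z = standardize(heights_m)
```

Rescaling every measurement by a constant (e.g., meters to inches) leaves the z-scores unchanged, which is exactly the independence from measurement units that the slide motivates.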
12. Types of Data in Cluster Analysis
Interval-Scaled Variables:
The most popular distance measure is Euclidean
distance, which is defined as
d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + … + |xin − xjn|²)
Another well-known metric is Manhattan (or city
block) distance, defined as
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xin − xjn|
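These measures can be sketched as a single Minkowski distance function, of which Euclidean and Manhattan distances are the p = 2 and p = 1 special cases:

```python
def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, 2)

def manhattan(x, y):
    return minkowski(x, y, 1)
```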
13. Types of Data in Cluster Analysis
Interval-Scaled Variables:
14. Types of Data in Cluster Analysis
Binary Variables:
A binary variable has only two states: 0 or 1, where 0
means that the variable is absent, and 1 means that it is
present.
Treating binary variables as if they are interval-scaled
can lead to misleading clustering results. Therefore,
methods specific to binary data are necessary for
computing dissimilarities.
15. Types of Data in Cluster Analysis
Binary Variables:
A binary variable is symmetric if both of its states are
equally valuable and carry the same weight; that is, there is
no preference on which outcome should be coded as 0 or 1.
One such example could be the attribute gender having the
states male and female.
Its dissimilarity (or distance) measure is the simple
matching coefficient:
d(i, j) = (r + s) / (q + r + s + t),
where q is the number of variables equal to 1 for both
objects, r the number equal to 1 for i but 0 for j, s the
number equal to 0 for i but 1 for j, and t the number
equal to 0 for both.
16. Types of Data in Cluster Analysis
Binary Variables:
A binary variable is asymmetric if the outcomes of the
states are not equally important, such as the positive
and negative outcomes of a disease test.
The dissimilarity based on such variables is called
asymmetric binary dissimilarity, where the number of
negative matches, t, is considered unimportant and
thus is ignored in the computation:
d(i, j) = (r + s) / (q + r + s)
17. Types of Data in Cluster Analysis
Binary Variables:
We can measure the distance between two binary
variables based on the notion of similarity instead of
dissimilarity.
For example, the asymmetric binary similarity
between the objects i and j, or sim(i, j), can be
computed as
sim(i, j) = q / (q + r + s) = 1 − d(i, j),
which is also known as the Jaccard coefficient.
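The binary measures above can be sketched from the contingency counts q, r, s, t (function names are illustrative):

```python
def binary_counts(i, j):
    """Contingency counts for two 0/1 vectors:
    q = both 1, r = 1 in i only, s = 1 in j only, t = both 0."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    return q, r, s, t

def d_symmetric(i, j):
    """Simple matching: d = (r + s) / (q + r + s + t)."""
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def d_asymmetric(i, j):
    """Negative matches t are ignored: d = (r + s) / (q + r + s)."""
    q, r, s, _ = binary_counts(i, j)
    return (r + s) / (q + r + s)

def sim_asymmetric(i, j):
    """Jaccard coefficient: sim = q / (q + r + s) = 1 - d."""
    return 1 - d_asymmetric(i, j)
```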
18. Types of Data in Cluster Analysis
Categorical Variables
A categorical variable is a generalization of the binary
variable in that it can take on more than two states.
For example, map color is a categorical variable that may
have, say, five states: red, yellow, green, pink, and blue.
The dissimilarity between two objects i and j can be
computed based on the ratio of mismatches:
d(i, j) = (p − m) / p,
where m is the number of matches and p is the total
number of variables.
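The mismatch ratio is short enough to sketch directly (the color values echo the map-color example above):

```python
def d_categorical(i, j):
    """Dissimilarity for categorical vectors: (p - m) / p,
    where p is the number of variables and m the number of matches."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p
```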
19. Types of Data in Cluster Analysis
Ordinal Variables
A discrete ordinal variable resembles a categorical
variable, except that the M states of the ordinal value
are ordered in a meaningful sequence.
For example, professional ranks are often enumerated
in a sequential order, such as assistant, associate, and
full for professors
20. Types of Data in Cluster Analysis
Ordinal Variables
A continuous ordinal variable looks like a set of
continuous data of an unknown scale; that is, the
relative ordering of the values is essential but their
actual magnitude is not.
For example, the relative ranking in a particular sport
(e.g., gold, silver, bronze)
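A standard follow-up (not shown on the slide, but the usual treatment of ordinal variables in the same textbook) is to map the rank r of a variable with M ordered states onto [0.0, 1.0] via z = (r − 1)/(M − 1), after which the interval-scaled distance measures above can be applied; a sketch:

```python
def ordinal_to_interval(ranks, M):
    """Map ranks 1..M onto [0.0, 1.0] via z = (r - 1) / (M - 1),
    so interval-scaled distance measures can then be used."""
    return [(r - 1) / (M - 1) for r in ranks]
```

For example, gold/silver/bronze (ranks 1, 2, 3 with M = 3) become 0.0, 0.5, and 1.0.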
21. Types of Data in Cluster Analysis
Ratio-Scaled Variables
A ratio-scaled variable makes a positive
measurement on a nonlinear scale, such as an
exponential scale, approximately following the
formula Ae^(Bt) or Ae^(−Bt)
22. Types of Data in Cluster Analysis
Variables of Mixed Types
One approach is to group each kind of variable together,
performing a separate cluster analysis for each variable type.
This is feasible if these analyses derive compatible results.
A more preferable approach is to process all variable types
together, performing a single cluster analysis.
One such technique combines the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables
onto a common scale of the interval [0.0,1.0].
23. Types of Data in Cluster Analysis
Vector Objects
In some applications, such as information retrieval,
text document clustering, and biological taxonomy, we
need to compare and cluster complex objects (such as
documents) containing a large number of symbolic
entities (such as keywords and phrases).
One popular way is to define the similarity function as
a cosine measure as follows:
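The cosine measure can be sketched as follows (the term-count vectors in the test are illustrative):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors, e.g. term-frequency
    vectors of documents: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm
```

Documents sharing no keywords get similarity 0; documents with proportional term counts get similarity 1, regardless of length.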
24. Partitioning Methods
Given D, a data set of n objects, and k, the
number of clusters to form, a partitioning
algorithm organizes the objects into k partitions (k
≤ n), where each partition represents a cluster.
The objects within a cluster are “similar,” whereas
the objects of different clusters are “dissimilar” in
terms of the data set attributes.
25. Partitioning Methods
Centroid-Based Technique: The k-Means Method
The k-means algorithm takes the input parameter, k,
and partitions a set of n objects into k clusters so that
the resulting intracluster similarity is high but the
intercluster similarity is low.
Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed
as the cluster’s centroid or center of gravity.
26. Partitioning Methods
Centroid-Based Technique: The k-Means Method
Algorithm: k-means. The k-means algorithm for
partitioning, where each cluster’s center is represented
by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
27. Partitioning Methods
Centroid-Based Technique: The k-Means Method
Method:
arbitrarily choose k objects from D as the initial cluster centers;
repeat
(re)assign each object to the cluster to which the object is the most
similar, based on the mean value of the objects in the cluster;
update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
until no change;
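The method above can be sketched as a minimal pure-Python k-means (the toy data set and names are illustrative, not from the slides):

```python
import random

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means following the slide's method: arbitrarily choose
    k objects as initial centers, then alternate (re)assignment to the
    nearest mean and mean updates until no change."""
    random.seed(seed)
    centers = random.sample(D, k)            # arbitrarily choose k objects from D
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in D:                          # (re)assign to the most similar mean
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
            clusters[i].append(x)
        # update the cluster means
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:                   # until no change
            break
        centers = new
    return centers, clusters

D = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(D, k=2)
```

On this toy data the two means converge to the centroids of the two obvious groups.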
28. Partitioning Methods
Representative Object-Based Technique: The
k-Medoids Method
The k-means algorithm is sensitive to outliers
because an object with an extremely large value
may substantially distort the distribution of data.
29. Partitioning Methods
PAM(Partitioning Around Medoids)
It attempts to determine k partitions for n objects.
After an initial random selection of k representative
objects, the algorithm repeatedly tries to make a better
choice of cluster representatives.
All of the possible pairs of objects are analyzed, where one
object in each pair is considered a representative object and
the other is not.
30. Partitioning Methods
PAM(Partitioning Around Medoids)
The quality of the resulting clustering is calculated for each such
combination.
An object, oj, is replaced with the object causing the greatest
reduction in error.
The set of best objects for each cluster in one iteration forms the
representative objects for the next iteration.
The final set of representative objects are the respective medoids
of the clusters.
31. Partitioning Methods
Which method is more robust—k-means or k-
medoids?
The k-medoids method is more robust than k-
means in the presence of noise and outliers.
A medoid is less influenced by outliers or other
extreme values than a mean.
32. Partitioning Methods
Partitioning Methods in Large Databases: From k-
Medoids to CLARANS
A typical k-medoids partitioning algorithm like PAM
works effectively for small data sets, but does not scale
well for large data sets.
To deal with larger data sets, a sampling-based method,
called CLARA (Clustering LARge Applications), can be
used
33. Partitioning Methods
Partitioning Methods in Large Databases: From k-
Medoids to CLARANS
Instead of taking the whole set of data into consideration, a
small portion of the actual data is chosen as a representative
of the data.
Medoids are then chosen from this sample using PAM.
If the sample is selected in a fairly random manner, it
should closely represent the original data set.
34. Partitioning Methods
Partitioning Methods in Large Databases: From k-Medoids to
CLARANS
The representative objects (medoids) chosen will likely be similar to
those that would have been chosen from the whole data set.
CLARA draws multiple samples of the data set, applies PAM on each
sample, and returns its best clustering as the output.
The complexity of each iteration now becomes O(ks² + k(n − k)),
where s is the size of the sample, k is the number of clusters,
and n is the total number of objects.
35. Partitioning Methods
Partitioning Methods in Large Databases: From
k-Medoids to CLARANS
The effectiveness of CLARA depends on the sample
size.
CLARA searches for the best k medoids among the
selected sample of the data set.
If any of the best k medoids is not among the selected
sample, CLARA will never find the best clustering.
36. Hierarchical Methods
A hierarchical clustering method works by grouping data objects
into a tree of clusters.
Hierarchical clustering methods can be further classified as either
agglomerative or divisive, depending on whether the hierarchical
decomposition is formed in a bottom-up (merging) or top-down
(splitting) fashion.
The quality of a pure hierarchical clustering method suffers from its
inability to perform adjustment once a merge or split decision has
been executed.
37. Hierarchical Methods
Agglomerative and Divisive Hierarchical
Clustering:
Agglomerative hierarchical clustering: This bottom-up
strategy starts by placing each object in its own cluster
and then merges these atomic clusters into larger and
larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are
satisfied.
38. Hierarchical Methods
Agglomerative and Divisive Hierarchical Clustering:
Divisive hierarchical clustering: This top-down strategy does
the reverse of agglomerative hierarchical clustering by starting
with all objects in one cluster.
It subdivides the cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until it satisfies certain
termination conditions, such as a desired number of clusters is
obtained or the diameter of each cluster is within a certain
threshold.
40. Hierarchical Methods
Agglomerative and Divisive Hierarchical
Clustering:
A tree structure called a dendrogram is commonly
used to represent the process of hierarchical
clustering. It shows how objects are grouped
together step by step.
42. Hierarchical Methods
Agglomerative and Divisive Hierarchical
Clustering:
Four widely used measures for distance between
clusters are as follows, where |p − p′| is the distance
between two objects or points, p and p′; mi is the mean
for cluster Ci; and ni is the number of objects in Ci:
Minimum distance: dmin(Ci, Cj) = min p∈Ci, p′∈Cj |p − p′|
Maximum distance: dmax(Ci, Cj) = max p∈Ci, p′∈Cj |p − p′|
Mean distance: dmean(Ci, Cj) = |mi − mj|
Average distance: davg(Ci, Cj) = (1/(ni nj)) Σ p∈Ci Σ p′∈Cj |p − p′|
44. Hierarchical Methods
Agglomerative and Divisive Hierarchical Clustering:
When an algorithm uses the minimum distance, dmin(Ci,
Cj), to measure the distance between clusters, it is
sometimes called a nearest-neighbor clustering algorithm.
When an algorithm uses the maximum distance, dmax(Ci,
Cj), to measure the distance between clusters, it is
sometimes called a farthest-neighbor clustering algorithm.
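The four inter-cluster distance measures can be sketched directly (the clusters used in the test are illustrative):

```python
import math

def d_min(Ci, Cj):
    """Minimum distance (nearest-neighbor / single link)."""
    return min(math.dist(p, q) for p in Ci for q in Cj)

def d_max(Ci, Cj):
    """Maximum distance (farthest-neighbor / complete link)."""
    return max(math.dist(p, q) for p in Ci for q in Cj)

def d_mean(Ci, Cj):
    """Distance between the cluster means mi and mj."""
    mi = [sum(col) / len(Ci) for col in zip(*Ci)]
    mj = [sum(col) / len(Cj) for col in zip(*Cj)]
    return math.dist(mi, mj)

def d_avg(Ci, Cj):
    """Average of all pairwise distances between the clusters."""
    return sum(math.dist(p, q) for p in Ci for q in Cj) / (len(Ci) * len(Cj))
```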
45. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
BIRCH is designed for clustering a large amount of numerical
data by integration of hierarchical clustering (at the initial
microclustering stage) and other clustering methods such as
iterative partitioning (at the later macroclustering stage).
It overcomes the two difficulties of agglomerative clustering
methods: (1) scalability and (2) the inability to undo what was
done in the previous step.
46. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
BIRCH introduces two concepts, clustering feature and
clustering feature tree (CF tree), which are used to summarize
cluster representations.
These structures help the clustering method achieve good speed
and scalability in large databases and also make it effective for
incremental and dynamic clustering of incoming objects.
47. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
A clustering feature (CF) is a three-dimensional vector
summarizing information about clusters of objects.
Given n d-dimensional objects or points in a cluster, {xi}, then
the CF of the cluster is defined as
CF=<n, LS, SS>
where n is the number of points in the cluster, LS is the linear sum
of the n points, and SS is the square sum of the data points
48. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and
Clustering Using Hierarchies:
Clustering feature. Suppose that there are three points,
(2, 5), (3, 2), and (4, 3), in a cluster, C1. The clustering
feature of C1 is
CF1 = <3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)>
= <3, (9, 10), (29, 38)>
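The CF computation from this example, plus the additivity property BIRCH relies on when merging subclusters, can be sketched as:

```python
def clustering_feature(points):
    """CF = <n, LS, SS>: count, per-dimension linear sum, and
    per-dimension square sum of a set of d-dimensional points."""
    n = len(points)
    LS = tuple(sum(col) for col in zip(*points))
    SS = tuple(sum(x * x for x in col) for col in zip(*points))
    return n, LS, SS

def merge_cf(cf_a, cf_b):
    """CF additivity: the CF of a merged cluster is the
    component-wise sum of the two subclusters' CFs."""
    return (cf_a[0] + cf_b[0],
            tuple(a + b for a, b in zip(cf_a[1], cf_b[1])),
            tuple(a + b for a, b in zip(cf_a[2], cf_b[2])))
```

Additivity is what lets BIRCH merge subclusters without revisiting the original points.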
49. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
Clustering features are sufficient for calculating all of the
measurements that are needed for making clustering decisions in
BIRCH.
BIRCH thus utilizes storage efficiently by employing the
clustering features to summarize information about the clusters
of objects, thereby bypassing the need to store all objects.
50. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and Clustering Using
Hierarchies:
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering.
A CF tree has two parameters: branching factor, B, and threshold, T.
The branching factor specifies the maximum number of children per
nonleaf node.
The threshold parameter specifies the maximum diameter of
subclusters stored at the leaf nodes of the tree.
51. Hierarchical Methods
BIRCH: Balanced Iterative Reducing and
Clustering Using Hierarchies:
The computation complexity of the algorithm is O(n),
where n is the number of objects to be clustered.
Each node in a CF tree can hold only a limited
number of entries due to its size, so a CF tree node
does not always correspond to what a user may
consider a natural cluster.
52. Hierarchical Methods
ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
ROCK (RObust Clustering using linKs)
It explores the concept of links (the number of common
neighbors between two objects) for data with categorical
attributes.
Traditional distance measures cannot lead to
high-quality clusters when clustering categorical data.
53. Hierarchical Methods
ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
ROCK takes a more global approach to clustering by
considering the neighborhoods of individual pairs of points.
If two similar points also have similar neighborhoods, then
the two points likely belong to the same cluster and so can
be merged.
54. Hierarchical Methods
ROCK: A Hierarchical Clustering Algorithm for
Categorical Attributes
Based on these ideas, ROCK first constructs a sparse graph
from a given data similarity matrix using a similarity
threshold and the concept of shared neighbors.
It then performs agglomerative hierarchical clustering on
the sparse graph.
Random sampling is used for scaling up to large data sets.
55. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
Chameleon is a hierarchical clustering algorithm that uses
dynamic modeling to determine the similarity between
pairs of clusters.
In Chameleon, cluster similarity is assessed based on how
well-connected objects are within a cluster and on the
proximity of clusters.
56. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
That is, two clusters are merged if their
interconnectivity is high and they are close together.
The merge process facilitates the discovery of natural
and homogeneous clusters and applies to all types of
data as long as a similarity function can be specified.
58. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
Chameleon uses a k-nearest-neighbor graph approach
to construct a sparse graph, where each vertex of the
graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among
the k-most-similar objects of the other.
59. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm Using
Dynamic Modeling
The edges are weighted to reflect the similarity between objects.
It then uses an agglomerative hierarchical clustering algorithm
that repeatedly merges subclusters based on their similarity
To determine the pairs of most similar subclusters, it takes into
account both the interconnectivity as well as the closeness of the
clusters.
60. Hierarchical Methods
Chameleon: A Hierarchical Clustering
Algorithm Using Dynamic Modeling
Chameleon determines the similarity between
each pair of clusters Ci and Cj according to their
relative interconnectivity, RI(Ci, Cj), and their
relative closeness, RC(Ci, Cj)
61. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
The relative interconnectivity, RI(Ci, Cj), between
two clusters, Ci and Cj, is defined as the absolute
interconnectivity between Ci and Cj, normalized with
respect to the internal interconnectivity of the two
clusters, Ci and Cj. That is,
62. Hierarchical Methods
Chameleon: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
The relative closeness, RC(Ci, Cj), between a pair of
clusters,Ci and Cj, is the absolute closeness between
Ci and Cj, normalized with respect to the internal
closeness of the two clusters, Ci and Cj. It is defined as
63. Density-Based Methods
To discover clusters with arbitrary shape, density-
based clustering methods have been developed.
These typically regard clusters as dense regions
of objects in the data space that are separated by
regions of low density (representing noise).
64. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based on
Connected Regions with Sufficiently High Density
DBSCAN (Density-Based Spatial Clustering of Applications
with Noise)
The algorithm grows regions with sufficiently high density into
clusters and discovers clusters of arbitrary shape in spatial
databases with noise.
It defines a cluster as a maximal set of density-connected points.
65. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
The neighborhood within a radius ε of a given object is
called the ε-neighborhood of the object.
If the ε-neighborhood of an object contains at least a
minimum number, MinPts, of objects, then the object is
called a core object.
66. Density-Based Methods
DBSCAN: A Density-Based Clustering Method
Based on Connected Regions with Sufficiently
High Density
Given a set of objects, D, we say that an object p is
directly density-reachable from object q if p is within
the ε-neighborhood of q, and q is a core object.
67. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
Direct density reachable: A point is called direct density
reachable if it has a core point in its neighbourhood.
Consider the point (1, 2), it has a core point (1.2, 2.5) in its
neighbourhood, hence, it will be a direct density reachable
point.
68. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based
on Connected Regions with Sufficiently High Density
Density Reachable: A point is called density reachable
from another point if they are connected through a series of
core points. For example, consider the points (1, 3) and
(1.5, 2.5), since they are connected through a core point
(1.2, 2.5), they are called density reachable from each
other.
69. Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based on
Connected Regions with Sufficiently High Density
DBSCAN searches for clusters by checking the ε-neighborhood of
each point in the database.
If the ε-neighborhood of a point p contains more than MinPts
objects, a new cluster with p as a core object is created.
DBSCAN then iteratively collects directly density-reachable objects
from these core objects, which may involve the merge of a few density-
reachable clusters.
The process terminates when no new point can be added to any cluster.
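This search procedure can be sketched in plain Python (a minimal, unoptimized version; the parameter names eps and min_pts are illustrative):

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: label each point with a cluster id, -1 = noise."""
    n = len(points)
    labels = [None] * n                      # None = not yet visited

    def neighbors(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:             # i is not a core point
            labels[i] = -1                   # tentatively noise (may become border)
            continue
        cluster += 1                         # i is a core point: grow a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                         # collect density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors(j)) >= min_pts: # j is also a core point: expand
                queue.extend(neighbors(j))
    return labels
```

Two well-separated groups end up in distinct clusters, and an isolated point keeps the noise label -1.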
73. Density-Based Methods
DBSCAN: A Density-Based Clustering Method
Based on Connected Regions with Sufficiently
High Density
If a spatial index is used, the computational
complexity of DBSCAN is O(n log n), where n is the
number of database objects.
Otherwise, it is O(n²).
74. Density-Based Methods
OPTICS: Ordering Points to Identify the Clustering Structure
OPTICS computes an augmented cluster ordering for automatic and
interactive cluster analysis.
This ordering represents the density-based clustering structure of the
data.
It contains information that is equivalent to density-based clustering
obtained from a wide range of parameter settings.
The cluster ordering can be used to extract basic clustering information
(such as cluster centers or arbitrary-shaped clusters) as well as provide
the intrinsic clustering structure.
75. Density-Based Methods
OPTICS: Ordering Points to Identify the
Clustering Structure
The core-distance of an object p is the smallest ε′
value that makes {p} a core object. If p is not a
core object, the core-distance of p is undefined.
77. Density-Based Methods
OPTICS: Ordering Points to Identify the Clustering
Structure
The reachability-distance of an object q with respect to
another object p is the greater value of the core-distance of
p and the Euclidean distance between p and q.
If p is not a core object, the reachability-distance between p
and q is undefined.
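In symbols, with d(p, q) the Euclidean distance, the standard OPTICS definition is:

```latex
\text{reachability-dist}(q, p) = \max\bigl(\text{core-dist}(p),\; d(p, q)\bigr)
```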
78. Density-Based Methods
OPTICS: Ordering Points to Identify the Clustering
Structure
An algorithm was proposed to extract clusters based on the
ordering information produced by OPTICS.
Such information is sufficient for the extraction of all
density-based clusterings with respect to any distance ε′
that is smaller than the distance ε used in generating the
order.
79. Density-Based Methods
OPTICS: Ordering Points to Identify the
Clustering Structure
Because of the structural equivalence of the OPTICS
algorithm to DBSCAN, the OPTICS algorithm has the
same runtime complexity as that of DBSCAN, that is,
O(n log n) if a spatial index is used, where n is the
number of objects.
81. Density-Based Methods
DENCLUE: Clustering Based on Density Distribution Functions
DENCLUE (DENsity-based CLUstEring)
The method is built on the following ideas: the influence of each data point can
be formally modeled using a mathematical function, called an influence
function, which describes the impact of a data point within its neighborhood;
The overall density of the data space can be modeled analytically as the sum of
the influence function applied to all data points; and
Clusters can then be determined mathematically by identifying density
attractors, where density attractors are local maxima of the overall density
function.
82. Density-Based Methods
DENCLUE: Clustering Based on Density Distribution
Functions
The influence function of data object y on x is a function,
the influence function can be an arbitrary function that can be
determined by the distance between two objects in a
neighborhood.
The distance function, d(x, y), should be reflexive and
symmetric, such as the Euclidean distance function
83. Density-Based Methods
DENCLUE: Clustering Based on Density
Distribution Functions
The density function at an object or point x ∈ F^d is
defined as the sum of influence functions of all
data points.
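With a Gaussian influence function, for example, the density at x over data points x1, ..., xn is (σ being a user-chosen smoothing parameter):

```latex
f^{D}_{\text{Gauss}}(x) = \sum_{i=1}^{n} e^{-\frac{d(x,\, x_i)^2}{2\sigma^2}}
```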
84. Density-Based Methods
DENCLUE: Clustering Based on Density Distribution
Functions
From the density function, we can define the gradient of
the function and the density attractor, the local maxima of
the overall density function.
An arbitrary-shape cluster for a set of density attractors is a
set of points, each being density-attracted to its respective
density attractor.
85. Grid-Based Methods
The grid-based clustering approach uses a
multiresolution grid data structure.
It quantizes the object space into a finite number
of cells that form a grid structure on which all of
the operations for clustering are performed.
The main advantage of the approach is its fast
processing time
86. Grid-Based Methods
STING: STatistical INformation Grid
STING is a grid-based multiresolution clustering technique
in which the spatial area is divided into rectangular cells.
There are usually several levels of such rectangular cells
corresponding to different levels of resolution, and these
cells form a hierarchical structure: each cell at a high level
is partitioned to form a number of cells at the next lower
level.
87. Grid-Based Methods
STING: STatistical INformation Grid
Statistical information regarding the attributes in
each grid cell (such as the mean, maximum, and
minimum values) is precomputed and stored.
These statistical parameters are useful for query
processing
89. Grid-Based Methods
STING: STatistical INformation Grid
Statistical parameters of higher-level cells can easily be
computed from the parameters of the lower-level cells.
These parameters include the following: the attribute-
independent parameter, count; the attribute-dependent
parameters, mean, stdev (standard deviation), min (minimum),
max (maximum); and the type of distribution that the attribute
value in the cell follows, such as normal, uniform, exponential,
or none.
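The roll-up of these parameters can be sketched in Python: count, min, and max combine trivially, while the parent mean and (population) stdev follow from the children's first and second moments. The dictionary field names here are illustrative:

```python
from math import sqrt

def merge_cells(cells):
    """Compute a higher-level STING cell's statistics from its child cells.

    Each cell is a dict with count, mean, stdev (population), min, max.
    """
    n = sum(c["count"] for c in cells)
    mean = sum(c["count"] * c["mean"] for c in cells) / n
    # Pooled second moment E[x^2]; population variance is E[x^2] - mean^2.
    ex2 = sum(c["count"] * (c["stdev"] ** 2 + c["mean"] ** 2) for c in cells) / n
    return {
        "count": n,
        "mean": mean,
        "stdev": sqrt(ex2 - mean ** 2),
        "min": min(c["min"] for c in cells),
        "max": max(c["max"] for c in cells),
    }
```

Merging a cell holding {1, 3} with a cell holding {5} reproduces exactly the statistics of the combined data {1, 3, 5}.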
90. Grid-Based Methods
STING: STatistical INformation Grid
When the data are loaded into the database, the
parameters count, mean, stdev, min, and max of the
bottom-level cells are calculated directly from the data.
The value of distribution may either be assigned by the
user if the distribution type is known beforehand or
obtained by hypothesis tests such as the χ² test.
91. Grid-Based Methods
STING: STatistical INformation Grid
The type of distribution of a higher-level cell can be
computed based on the majority of distribution types
of its corresponding lower-level cells in conjunction
with a threshold filtering process.
If the distributions of the lower level cells disagree
with each other and fail the threshold test, the
distribution type of the high-level cell is set to none.
92. Grid-Based Methods
STING: STatistical INformation Grid
First, a layer within the hierarchical structure is
determined from which the query-answering process is
to start.
This layer typically contains a small number of cells.
For each cell in the current layer, we compute the
confidence interval (or estimated range of probability)
reflecting the cell’s relevancy to the given query.
93. Grid-Based Methods
STING: STatistical INformation Grid
The irrelevant cells are removed from further consideration.
Processing of the next lower level examines only the remaining
relevant cells. This process is repeated until the bottom layer is reached.
At this time, if the query specification is met, the regions of relevant
cells that satisfy the query are returned.
Otherwise, the data that fall into the relevant cells are retrieved and
further processed until they meet the requirements of the query.
94. Grid-Based Methods
STING: STatistical INformation Grid
The grid-based computation is query-independent,
because the statistical information stored in each cell
represents the summary information of the data in the
grid cell, independent of the query;
The grid structure facilitates parallel processing and
incremental updating
95. Grid-Based Methods
WaveCluster: Clustering Using Wavelet
Transformation
WaveCluster is a multiresolution clustering algorithm that
first summarizes the data by imposing a multidimensional
grid structure onto the data space.
It then uses a wavelet transformation to transform the
original feature space, finding dense regions in the
transformed space.
96. Grid-Based Methods
WaveCluster: Clustering Using Wavelet
Transformation
In this approach, each grid cell summarizes the
information of a group of points that map into the cell.
This summary information typically fits into main
memory for use by the multiresolution wavelet
transform and the subsequent cluster analysis.
97. Grid-Based Methods
WaveCluster: Clustering Using Wavelet Transformation
A wavelet transform is a signal processing technique that
decomposes a signal into different frequency subbands.
The wavelet model can be applied to d-dimensional signals by
applying a one-dimensional wavelet transform d times.
In applying a wavelet transform, data are transformed so as to
preserve the relative distance between objects at different levels
of resolution.
98. Grid-Based Methods
WaveCluster: Clustering Using Wavelet
Transformation
It provides unsupervised clustering
The multiresolution property of wavelet transformations
can help detect clusters at varying levels of accuracy.
Wavelet-based clustering is very fast, with a computational
complexity of O(n)
99. Model-Based Clustering Methods
Model-based clustering methods attempt to
optimize the fit between the given data and
some mathematical model.
Expectation-Maximization
Conceptual Clustering
Neural Network Approach
100. Model-Based Clustering Methods
Expectation-Maximization
Each cluster can be represented mathematically by a parametric
probability distribution.
The entire data is a mixture of these distributions, where each
individual distribution is typically referred to as a component
distribution.
The EM(Expectation-Maximization) algorithm is a popular
iterative refinement algorithm that can be used for finding the
parameter estimates.
101. Model-Based Clustering Methods
Expectation-Maximization
It can be viewed as an extension of the k-means
paradigm, which assigns an object to the cluster with
which it is most similar, based on the cluster mean.
Instead of assigning each object to a dedicated cluster,
EM assigns each object to a cluster according to a
weight representing the probability of membership.
102. Model-Based Clustering Methods
Expectation-Maximization – Algorithm:
Make an initial guess of the parameter vector: This
involves randomly selecting k objects to represent the
cluster means or centers (as in k-means partitioning),
as well as making guesses for the additional
parameters.
Iteratively refine the parameters (or clusters)
103. Model-Based Clustering Methods
Conceptual Clustering
Given a set of unlabeled objects, it produces a classification
scheme over the objects.
It also finds characteristic descriptions for each group,
where each group represents a concept or class.
Conceptual clustering is thus a two-step process: clustering
is performed first, followed by characterization.
104. Model-Based Clustering Methods
Conceptual Clustering – Classification Tree:
Each node in a classification tree refers to a concept
and contains a probabilistic description of that concept,
which summarizes the objects classified under the
node.
The sibling nodes at a given level of a classification
tree are said to form a partition.
105. Model-Based Clustering Methods
Conceptual Clustering – Classification Tree:
To classify an object using a classification tree, a
partial matching function is employed to descend
the tree along a path of “best” matching nodes.
107. Model-Based Clustering Methods
Neural Network Approach:
The neural network approach is motivated by
biological neural networks.
A neural network is a set of connected
input/output units, where each connection has a
weight associated with it.
108. Model-Based Clustering Methods
Neural Network Approach:
First, neural networks are inherently parallel and
distributed processing architectures.
Second, neural networks learn by adjusting their
interconnection weights so as to best fit the data.
Third, neural networks process numerical vectors and
require object patterns to be represented by quantitative
features only.
109. Clustering High-Dimensional Data
Most clustering methods are designed for clustering low-
dimensional data and encounter challenges when the dimensionality
of the data grows very high.
When dimensionality increases, data usually become increasingly
sparse because the data points are likely located in different
dimensional subspaces.
To overcome this difficulty, we may consider using feature (or
attribute) transformation and feature (or attribute) selection
techniques
110. Clustering High-Dimensional Data
Feature transformation methods:
Transform the data into a smaller space
They summarize data by creating linear combinations
of the attributes, and may discover hidden structures in
the data.
These techniques do not actually remove any of the original
attributes from analysis.
111. Clustering High-Dimensional Data
Feature transformation methods:
This is problematic when there are a large number of
irrelevant attributes.
The transformed features (attributes) are often difficult to
interpret, making the clustering results less useful.
Feature transformation is only suited to data sets where
most of the dimensions are relevant to the clustering task.
112. Clustering High-Dimensional Data
Attribute subset selection:
Attribute subset selection (or feature subset selection)
is commonly used for data reduction by removing
irrelevant or redundant dimensions (or attributes).
It is most commonly performed by supervised
learning—the most relevant set of attributes are found
with respect to the given class labels
113. Clustering High-Dimensional Data
Subspace clustering:
Subspace clustering is an extension to attribute
subset selection.
Subspace clustering searches for groups of
clusters within different subspaces of the same data
set.
114. Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace Clustering
Method
CLIQUE (CLustering In QUEst)
The clustering process starts at single-dimensional
subspaces and grows upward to higher-dimensional ones.
CLIQUE partitions each dimension like a grid
structure and determines whether a cell is dense based on
the number of points it contains.
115. Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace
Clustering Method
An integration of density-based and grid-based
clustering methods.
116. Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace Clustering
Method – Ideas:
Given a large set of multidimensional data points, the data space
is usually not uniformly occupied by the data points. CLIQUE’s
clustering identifies the sparse and the “crowded” areas in space
(or units), thereby discovering the overall distribution patterns of
the data set.
In CLIQUE, a cluster is defined as a maximal set of connected
dense units.
117. Clustering High-Dimensional Data
How does CLIQUE work?
In the first step, CLIQUE partitions the d-dimensional
data space into non-overlapping rectangular units,
identifying the dense units among these. This is done
(in 1-D) for each dimension.
In the second step, for each cluster, it determines the
maximal region that covers the cluster of connected
dense units.
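The first step, restricted to one dimension, can be sketched as follows; the number of units and the density threshold tau are user parameters, and all names are illustrative:

```python
def dense_units_1d(values, lo, hi, n_units, tau):
    """Partition [lo, hi] into equal-width units; return indexes of dense units.

    A unit is dense when it contains at least tau points.
    """
    width = (hi - lo) / n_units
    counts = [0] * n_units
    for v in values:
        i = min(int((v - lo) / width), n_units - 1)  # clamp v == hi into last unit
        counts[i] += 1
    return [i for i, c in enumerate(counts) if c >= tau]
```

CLIQUE would then join dense units that survive in multiple dimensions, Apriori-style, into higher-dimensional candidate units.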
118. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
PROCLUS (PROjected CLUStering)
It starts by finding an initial approximation of the clusters
in the high-dimensional attribute space.
Each dimension is then assigned a weight for each cluster,
and the updated weights are used in the next iteration to
regenerate the clusters.
119. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
It adopts a distance measure called Manhattan
segmental distance, which is the Manhattan distance
on a set of relevant dimensions.
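In the PROCLUS formulation this distance is averaged over the relevant dimensions, which can be written as a one-line sketch (names illustrative):

```python
def manhattan_segmental(x, y, dims):
    """Manhattan distance over the relevant dimensions dims, averaged over them."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)
```

Averaging makes distances comparable between clusters whose relevant-dimension sets have different sizes.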
The PROCLUS algorithm consists of three phases:
initialization, iteration, and cluster refinement.
120. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction Subspace
Clustering Method:
In the initialization phase, it ensures that each cluster is
represented by at least one object in the selected set.
The iteration phase selects a random set of k medoids from
this reduced set (of medoids), and replaces “bad” medoids
with randomly chosen new medoids if the clustering is
improved.
121. Clustering High-Dimensional Data
PROCLUS: A Dimension-Reduction
Subspace Clustering Method:
The refinement phase computes new dimensions
for each medoid based on the clusters found,
reassigns points to medoids, and removes outliers.
122. Clustering High-Dimensional Data
Frequent Pattern–Based Clustering Methods
It searches for patterns (such as sets of items or
objects) that occur frequently in large data sets.
Frequent pattern mining can lead to the discovery
of interesting associations and correlations among
data objects.
123. Clustering High-Dimensional Data
Frequent Pattern–Based Clustering Methods - Frequent
term–based text clustering:
Text documents are clustered based on the frequent terms
they contain.
A term is any sequence of characters separated from other
terms by a delimiter.
A term can be made up of a single word or several words.
124. Clustering High-Dimensional Data
Frequent Pattern–Based Clustering Methods -
Frequent term–based text clustering:
In general, we first remove nontext information
(such as HTML tags and punctuation) and stop
words. Terms are then extracted.
125. Constraint-Based Cluster Analysis
Constraint-based clustering finds clusters that satisfy
user-specified preferences or constraints.
Constraints on individual objects: We can specify
constraints on the objects to be clustered.
In a real estate application, for example, one may like to
spatially cluster only those luxury mansions worth over a
million dollars.
This constraint confines the set of objects to be clustered.
126. Constraint-Based Cluster Analysis
Constraints on the selection of clustering parameters:
Clustering parameters are usually quite specific to the
given clustering algorithm.
Examples of parameters include k, the desired number of
clusters in a k-means algorithm; or ε (the radius) and
MinPts (the minimum number of points) in the DBSCAN
algorithm.
127. Constraint-Based Cluster Analysis
Constraints on the selection of clustering
parameters:
Although such user-specified parameters may
strongly influence the clustering results, they are
usually confined to the algorithm itself.
128. Constraint-Based Cluster Analysis
Constraints on distance or similarity functions:
We can specify different distance or similarity
functions for specific attributes of the objects to be
clustered, or different distance measures for specific
pairs of objects.
When clustering sportsmen, for example, we may use
different weighting schemes for height, body weight,
age, and skill level.
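Such a constraint can be as simple as per-attribute weights applied inside the distance function; a minimal sketch (the attributes, e.g. height and body weight, and all names are illustrative):

```python
def weighted_manhattan(x, y, weights):
    """Manhattan distance with per-attribute weights expressing user preferences."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))
```

Raising an attribute's weight makes differences in that attribute count for more when objects are grouped.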
129. Constraint-Based Cluster Analysis
User-specified constraints on the properties of
individual clusters:
A user may like to specify desired characteristics
of the resulting clusters, which may strongly
influence the clustering process.
130. Outlier Analysis
There exist data objects that do not comply
with the general behavior or model of the data.
Such data objects, which are grossly different
from or inconsistent with the remaining set of
data, are called outliers.
131. Outlier Analysis
Data visualization methods are weak in
detecting outliers in data with many
categorical attributes or in data of high
dimensionality, since human eyes are good at
visualizing numeric data of only two to three
dimensions.
132. Outlier Analysis
There are two basic types of procedures for detecting outliers:
Block procedures: In this case, either all of the suspect objects are
treated as outliers or all of them are accepted as consistent.
Consecutive (or sequential) procedures: An example of such a
procedure is the inside-out procedure. Its main idea is that the object
that is least “likely” to be an outlier is tested first. If it is found to be an
outlier, then all of the more extreme values are also considered outliers;
otherwise, the next most extreme object is tested, and so on. This
procedure tends to be more effective than block procedures.
133. Outlier Analysis
How effective is the statistical approach at outlier
detection?
A major drawback is that most tests are for single attributes, yet
many data mining problems require finding outliers in
multidimensional space.
Statistical methods do not guarantee that all outliers will be
found for the cases where no specific test was developed, or
where the observed distribution cannot be adequately modeled
with any standard distribution.