Chapter 5
Clustering
5.1 Introduction
Clustering is similar to classification. However, unlike classification, the
groups are not predefined. The groups are called clusters. Clustering can be
viewed as similar to unsupervised learning. The basic features of clustering
as opposed to classification:
• The best number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.
Definitions of clusters have been proposed:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between
a point in the cluster and any point outside it.
A given set of data may be clustered on different attributes.
Different Clustering Attributes
A given set of data may be clustered on different attributes. Here a
group of homes in a geographic area is shown.
Classification of Clustering Algorithms
• Recent clustering algorithms look at categorical data and are targeted to larger, or dynamic,
databases. They may adapt to memory constraints by either sampling the database or using data
structures, which can be compressed or pruned to fit into memory regardless of the size of the
database.
• Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency
matrix containing the distances between objects.
• Agglomerative implies that the clusters are created in a bottom-up fashion, while divisive
algorithms work in a top-down fashion.
• With hierarchical clustering, a nested set of clusters is created; the desired number of clusters is not an input.
• With partitional clustering, the algorithm creates only one set of clusters.
5.2 Similarity and Distance Measures
• Definition. Given a database 𝐷 = {𝑡1, 𝑡2, ⋯ , 𝑡𝑛} of tuples, a similarity measure 𝑠𝑖𝑚(𝑡𝑖, 𝑡𝑙) defined between
any two tuples 𝑡𝑖, 𝑡𝑙 ∈ 𝐷, and an integer value 𝑘, the clustering problem is to define a mapping 𝑓: 𝐷 → {1, ⋯ , 𝑘}
where each 𝑡𝑖 is assigned to one cluster 𝐾𝑗, 1 ≤ 𝑗 ≤ 𝑘. Given a cluster 𝐾𝑗, ∀𝑡𝑗𝑙, 𝑡𝑗𝑚 ∈ 𝐾𝑗 and 𝑡𝑖 ∉ 𝐾𝑗,
𝑠𝑖𝑚(𝑡𝑗𝑙, 𝑡𝑗𝑚) > 𝑠𝑖𝑚(𝑡𝑗𝑙, 𝑡𝑖).
• We assume the definition of a similarity measure 𝑠𝑖𝑚(𝑡𝑖, 𝑡𝑙) exists between any two tuples 𝑡𝑖, 𝑡𝑙 ∈ 𝐷.
A distance measure, 𝑑𝑖𝑠(𝑡𝑖, 𝑡𝑗), as opposed to similarity, is often used in clustering. The clustering problem
then has the desirable property that given a cluster 𝐾𝑗, ∀𝑡𝑗𝑙, 𝑡𝑗𝑚 ∈ 𝐾𝑗 and 𝑡𝑖 ∉ 𝐾𝑗, 𝑑𝑖𝑠 𝑡𝑗𝑙, 𝑡𝑗𝑚 ≤ 𝑑𝑖𝑠(𝑡𝑗𝑙, 𝑡𝑖).
• Given a cluster 𝐾𝑚 of 𝑁 points {𝑡𝑚1, 𝑡𝑚2, ⋯ , 𝑡𝑚𝑁}, the cluster can be described by the following characteristic
values (a small numeric sketch follows):

centroid:  𝐶𝑚 = (1/𝑁) Σ_{i=1..N} 𝑡𝑚𝑖

radius:    𝑅𝑚 = sqrt( (1/𝑁) Σ_{i=1..N} (𝑡𝑚𝑖 − 𝐶𝑚)² )

diameter:  𝐷𝑚 = sqrt( Σ_{i=1..N} Σ_{j=1..N} (𝑡𝑚𝑖 − 𝑡𝑚𝑗)² / (𝑁(𝑁 − 1)) )
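Below is a minimal NumPy sketch (not from the text; the point data are made up) that computes these three characteristic values for a cluster given as an N x d array.

```python
import numpy as np

def cluster_summary(points):
    """Return centroid, radius, and diameter of a cluster of points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    centroid = points.mean(axis=0)                          # C_m
    radius = np.sqrt(((points - centroid) ** 2).sum() / n)  # R_m
    # Pairwise squared distances for the diameter term.
    diff = points[:, None, :] - points[None, :, :]
    sq = (diff ** 2).sum(axis=-1)
    diameter = np.sqrt(sq.sum() / (n * (n - 1)))            # D_m
    return centroid, radius, diameter

print(cluster_summary([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]))
```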
Alternative Distance Measures
There are several standard alternatives to calculate the distance
between clusters 𝐾𝑖 and 𝐾𝑗. For all 𝑡𝑖𝑙 ∈ 𝐾𝑖 (with 𝑡𝑖𝑙 ∉ 𝐾𝑗) and all 𝑡𝑗𝑚 ∈ 𝐾𝑗 (with 𝑡𝑗𝑚 ∉ 𝐾𝑖):
• Single link: Smallest distance between an element in one cluster and
an element in the other: 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = min 𝑑𝑖𝑠 𝑡𝑖𝑙, 𝑡𝑗𝑚
• Complete link: Largest distance between an element in one cluster
and an element in the other 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = max(𝑑𝑖𝑠(𝑡𝑖𝑙, 𝑡𝑗𝑚))
• Average: Average distance between an element in one cluster and an
element in the other 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = mean(𝑑𝑖𝑠(𝑡𝑖𝑙, 𝑡𝑗𝑚))
• Centroid: If clusters have representative centroids, then the centroid
distance is defined as the distance between the centroids:
𝑑𝑖𝑠(𝐾𝑖, 𝐾𝑗) = 𝑑𝑖𝑠(𝐶𝑖, 𝐶𝑗)
• Medoid: Using a medoid to represent each cluster, the distance
between the clusters can be defined by the distance between the
medoids 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = 𝑑𝑖𝑠(𝑀𝑖, 𝑀𝑗).
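As an illustration only, the following sketch computes the single, complete, average, and centroid link distances between two small clusters, using SciPy's cdist for the pairwise distance matrix; the cluster arrays are assumed toy data.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(Ki, Kj, method="single"):
    """Distance between two clusters given as arrays of points."""
    d = cdist(Ki, Kj)                      # all pairwise dis(t_il, t_jm)
    if method == "single":                 # smallest pairwise distance
        return d.min()
    if method == "complete":               # largest pairwise distance
        return d.max()
    if method == "average":                # mean pairwise distance
        return d.mean()
    if method == "centroid":               # distance between centroids
        return np.linalg.norm(np.mean(Ki, axis=0) - np.mean(Kj, axis=0))
    raise ValueError(method)

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[3.0, 0.0], [4.0, 1.0]])
for m in ("single", "complete", "average", "centroid"):
    print(m, cluster_distance(Ki, Kj, m))
```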
5.3 Outliers
• Some clustering techniques do not
perform well with the presence of
outliers.
• Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering or other algorithms may then choose to remove these values or treat them differently. However, care must be taken when actually removing outliers.
5.4 Hierarchical Algorithms
• A tree data structure called a dendrogram can be used to illustrate the hierarchical clustering technique.
Each level in the tree is associated with the distance measure that was used to merge the clusters. The root
of the dendrogram contains one cluster where all elements are together. Each leaf consists of a single-element
cluster. The clusters at a particular level are combined when the children clusters have a distance
between them less than the distance value associated with this level in the tree.
5.4.1 Agglomerative Algorithms
• Agglomerative algorithms start with each individual item in its own cluster
and iteratively merge clusters until all items belong in one cluster.
• It assumes that a set of elements and the distances between them are given as an
input adjacency matrix. The output of the algorithm is a dendrogram, which
represents a set of clusters rather than just one clustering. The user can
determine which of the clusterings (based on a distance threshold) he or she
wishes to use.
• Algorithms differ in how they merge clusters from the prior level to create
the next level of clusters (for example, whether only the closest pair or all sufficiently close pairs are merged, and how equal distances are treated).
• In addition, the technique used to determine distance between clusters
may vary: single link, complete link and average link.
Single Link Technique
• With the single link approach, two clusters are merged if there is at least
one edge that connects the two clusters; that is, if the minimum distance
between any two points is less than or equal to the threshold distance
being considered. It is often called nearest neighbor clustering.
• One variation of the single link approach is based on the use of a minimum
spanning tree (MST). Here we assume that a procedure, MST, produces a
minimum spanning tree given an adjacency matrix as input. The clusters
are merged in increasing order of the distances found in the MST.
• The single linkage approach is infamous for its chain effect; that is, two
clusters may be merged if only two of their points are close to each other.
• These algorithms are both of 𝑂(𝑛²) complexity in space and time.
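A short, hedged sketch of single link (nearest neighbor) clustering using SciPy's hierarchical clustering routines; the data and the threshold are assumed for illustration, and method can be changed to "complete" or "average" for the techniques on the next slide.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (assumed data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])

# Single link merges clusters by the smallest pairwise distance.
Z = linkage(X, method="single")

# Cut the dendrogram at a distance threshold to obtain one clustering.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)   # cluster label per point
```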
Complete Link, Average Link Technique
• Here a procedure is used to find the maximum distance between any clusters so that two clusters are
merged if the maximum distance is less than or equal to the distance threshold.
• Complete link is an 𝑂(𝑛²) algorithm.
• Clusters found with the complete link method tend to be more compact than those found using the single
link technique.
• The average link technique merges two clusters if the average distance between any two points in the two
target clusters is below the distance threshold.
5.5 Partitional Algorithms
• Nonhierarchical or partitional clustering creates the clusters in one step as opposed to
several steps. Only one set of clusters is created, and the user must input the desired number
𝑘 of clusters.
• In addition, some metric or criterion function is used to determine the goodness of any
proposed solution. This measure of quality could be the average distance between
clusters or some other metric. The solution with the best value for the criterion function
is the clustering solution used.
• One common measure is a squared error metric, which measures the squared distance
from each point to the centroid of its associated cluster:

Σ_{m=1..k} Σ_{t_mi ∈ K_m} 𝑑𝑖𝑠(𝐶𝑚, 𝑡𝑚𝑖)²
• Partitional algorithms suffer from a combinatorial explosion due to the number of
possible solutions. Thus, most algorithms look only at a small subset of all clusters.
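A minimal sketch of the squared error criterion above; the helper name squared_error and the toy clusters are assumptions, not from the book.

```python
import numpy as np

def squared_error(clusters, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(((np.asarray(K) - C) ** 2).sum()
               for K, C in zip(clusters, centroids))

clusters = [np.array([[1.0, 1.0], [2.0, 1.0]]), np.array([[8.0, 8.0]])]
centroids = [K.mean(axis=0) for K in clusters]
print(squared_error(clusters, centroids))
```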
Squared Error Clustering, K-Means, K-Modes
• The squared error clustering algorithm minimizes the squared error for the clustering:

𝑠𝑒𝐾 = Σ_{m=1..k} Σ_{t_mi ∈ K_m} ‖𝐶𝑚 − 𝑡𝑚𝑖‖²

During each iteration each tuple is assigned to the cluster with the closest center. This repeats until the difference between
successive squared errors is below a threshold.
• The K-means clustering algorithm requires that some definition of a cluster mean
exists, e.g., 𝑚𝑖 = (1/m) Σ_{j=1..m} 𝑡𝑖𝑗, but it does not have to be this particular one; it also
assumes that the desired number of clusters 𝑘 is an input parameter. The
convergence criteria could be based on the squared error, but they need not be.
For example, the algorithm could stop when no, or only a very small number of, tuples are reassigned
to different clusters. K-means does not work on categorical data because the
mean must be defined on the attribute type. Only convex-shaped clusters are
found. It does not handle outliers well. K-means finds a local optimum and may
actually miss the global optimum.
• K-modes uses modes instead of means, and it does handle categorical data.
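The following is a bare-bones K-means sketch in NumPy, intended only to illustrate the assign-then-recompute-means loop; the initialization, sample data, and convergence test are simplifying assumptions (a library implementation such as scikit-learn's KMeans would normally be used).

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]    # initial means
    for _ in range(iters):
        # Assign each tuple to the closest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each cluster mean; keep the old center if a cluster empties.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                     # convergence check
            break
        centers = new
    return labels, centers

X = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [0, 1]], dtype=float)
print(kmeans(X, k=2))
```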
Nearest Neighbor, PAM, CLARA, CLARANS
• The nearest neighbor algorithm is similar to the single link technique. Items are iteratively merged into
the existing clusters that are closest. In this algorithm a threshold 𝑡 is used to determine if items
will be added to existing clusters or if a new cluster is created.
• The PAM (partitioning around medoids) algorithm represents each cluster by a medoid, which handles
outliers well. Quality here is measured by the sum of all distances from a non-medoid object to
the medoid for the cluster it is in. An item is assigned to the cluster represented by the medoid to
which it is closest (minimum distance). PAM does not scale well to large databases.
• CLARA (Clustering LARge Applications) applies PAM to a sample of the underlying database and
then uses the medoids found as the medoids for the complete clustering. Several samples can be
drawn with PAM applied to each. The sample chosen as the final clustering is the one that
performs the best. CLARA is more efficient than PAM for large databases. Five samples of size
40 + 2𝑘 seem to give good results.
• CLARANS (clustering large applications based on randomized search) uses multiple different
samples. It requires two additional parameters: numlocal and maxneighbor. Numlocal indicates the
number of samples to be taken. Maxneighbor is the number of neighbors of a node to which any
specific node can be compared. Studies indicate that 𝑛𝑢𝑚𝑙𝑜𝑐𝑎𝑙 = 2 and 𝑚𝑎𝑥𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 =
max(0.0125 × 𝑘(𝑛 − 𝑘), 250) is a good choice. It assumes that all data are in main memory.
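As a hedged illustration of the PAM quality measure described above (not the book's algorithm), this sketch computes the total distance from every item to its closest medoid; PAM's swap search and CLARA's sampling would be built on top of a cost function like this. The data and medoid indices are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def medoid_cost(X, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = cdist(X, X[medoid_idx])          # distances to the candidate medoids
    return d.min(axis=1).sum()

X = np.array([[0, 0], [0, 1], [5, 5], [6, 5], [10, 0]], dtype=float)
# PAM would try swapping medoids and keep the swap that lowers this cost;
# CLARA applies the same search to samples of roughly 40 + 2k items.
print(medoid_cost(X, [0, 2]), medoid_cost(X, [1, 3]))
```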
5.5.1 Genetic Algorithms
• In this approach a new solution is generated from the previous
solution using crossover and mutation operations.
• A fitness function must be used; it may be defined based on the inverse
of the squared error.
• Genetic clustering algorithms perform a global search rather than a
local search of potential solutions.
5.5.2 Clustering with Neural Networks
• Neural Networks that use unsupervised learning attempt to find features in the
data that characterize the desired output. There are two basic types of
unsupervised learning: noncompetitive and competitive.
• With the noncompetitive or Hebbian learning, the weight between two nodes is
changed to be proportional to both output values Δ𝑤𝑗𝑖 = 𝜂𝑦𝑗 𝑦𝑖.
• With competitive learning, nodes are allowed to compete. This approach usually
assumes a two-layer NN in which all nodes from one layer are connected to all
nodes in the other layer. Imagine every input tuple having each attribute value
input to a specific input node in the NN. The number of input nodes is the same
as the number of attributes. When a tuple is input to the NN, all output nodes
produce an output value. The node with the weights most similar to the input
tuple is declared the winner. Its weights are then adjusted.
• This process continues with each tuple input from the training set. With a large
and varied enough training set, over time each output node should become
associated with a set of tuples.
Self-organizing Feature Map (SOFM)
• Perhaps the most common example of a SOFM is the Kohonen self-organizing
map. There is one input layer and one special competitive layer, which produces
output values that compete. Nodes in this layer are viewed as a two-dimensional
grid of nodes. Each input node is connected to each node in this grid.
• Each arc has an associated weight and each node in the competitive layer has an
activation function. Thus, each node in the competitive layer produces an output
value, and the node with the best output wins. An attractive feature of Kohonen
nets is that the data can be fed into the multiple competitive nodes in parallel.
• A common approach is to initialize the weights on the input arcs to the competitive
layer with normalized values. Given an input tuple 𝑋 = 𝑥1, ⋯ , 𝑥ℎ and weights on
arcs input to a competitive node 𝑖 as 𝑤1𝑖, ⋯ , 𝑤ℎ𝑖, the similarity between 𝑋 and 𝑖
can be calculated by:
𝑠𝑖𝑚(𝑋, 𝑖) = Σ_{j=1..h} 𝑥𝑗 𝑤𝑗𝑖
• The competitive node with highest similarity value wins and the weights coming
into 𝑖 as well as its neighbors ( those immediately surrounding it in the matrix) are
changed to be closer to that of the tuple. Given a node 𝑖, we use the notation 𝑁𝑖 to
represent the union of 𝑖 and the nodes near it. Thus the learning process uses:
Δ𝑤𝑘𝑗 = 𝑐(𝑥𝑘 − 𝑤𝑘𝑗) if 𝑗 ∈ 𝑁𝑖, and Δ𝑤𝑘𝑗 = 0 otherwise
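A compact sketch of one competitive (Kohonen) update step implementing the similarity and weight-change rules above; the grid size, learning rate c, neighborhood radius, and random data are assumptions for illustration.

```python
import numpy as np

def som_step(W, grid, X, c=0.1, radius=1.0):
    """One update: W is (nodes, h) weights, grid is (nodes, 2) node coordinates."""
    sims = W @ X                                   # sim(X, i) = sum_j x_j * w_ji
    winner = int(np.argmax(sims))                  # competitive node with best output
    # N_i: the winner together with nodes within `radius` on the 2-D grid.
    near = np.linalg.norm(grid - grid[winner], axis=1) <= radius
    W[near] += c * (X - W[near])                   # Δw_kj = c (x_k − w_kj) for j in N_i
    return W, winner

rng = np.random.default_rng(0)
grid = np.array([(r, col) for r in range(3) for col in range(3)], dtype=float)
W = rng.random((9, 4))                             # 3x3 competitive grid, 4 input attributes
W, winner = som_step(W, grid, rng.random(4))
print(winner)
```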
5.6 Clustering Large Databases
In order to perform effectively on large databases, a clustering algorithm
should:
• Require no more than one scan of the database.
• Have the ability to provide status and best answer so far during the
execution (referred to as ability to be online).
• Be suspendable, stoppable, and resumable.
• Be able to update the results incrementally as data are added or removed
from the database.
• Work with limited main memory.
• Be capable of performing different techniques for scanning the database.
This may include sampling.
• Process each tuple only once.
5.6.1 BIRCH (Balanced Iterative Reducing and
Clustering using Hierarchies)
BIRCH is designed for clustering a large amount of metric data and achieves linear I/O
time, requiring only one database scan. BIRCH applies only to numeric data and uses the
clustering feature defined below.
• Definition. A clustering feature (CF) is a triple (𝑁, 𝐿𝑆, 𝑆𝑆), where the number of the
points in the cluster is 𝑁, 𝐿𝑆 is the sum of the points in the cluster, and 𝑆𝑆 is the sum of
the squares of the points in the cluster.
• As points are added to the clustering problem, the CF tree is built. Each node in the CF
tree contains clustering feature information about each of its children subclusters. A
point is inserted into the cluster (represented by a leaf node) to which it is closest. The
subcluster in a leaf node must have a diameter no greater than a given threshold value 𝑇.
When a new item is inserted, the CF value for that node is updated appropriately, as is each
triple on the path from the root down to the leaf. If the item does not satisfy the
threshold value, then it is added to that node as a single-item cluster.
• Parameters needed for the CF tree construction include: its branching factor, the page
block size, and memory size.
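A minimal sketch (not BIRCH itself) of the clustering feature triple, showing how the centroid and the radius from Section 5.2 can be derived from (N, LS, SS) alone and how an insert only updates the triple; the class name CF is an assumption.

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) for a subcluster."""
    def __init__(self, dim):
        self.N = 0
        self.LS = np.zeros(dim)   # linear sum of the points
        self.SS = 0.0             # sum of squared norms of the points

    def add(self, x):             # inserting a point only updates the triple
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.LS += x
        self.SS += float(x @ x)

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # R^2 = SS/N - |centroid|^2, algebraically equal to the radius in 5.2
        c = self.centroid()
        return np.sqrt(max(self.SS / self.N - float(c @ c), 0.0))

cf = CF(2)
for p in [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]:
    cf.add(p)
print(cf.N, cf.centroid(), cf.radius())
```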
Clustering Feature Tree
Program clustering: a neighbor-joining tree that clusters the classification programs based on their similar attributes.
5.6.2 DBSCAN
• Density-based spatial clustering of applications with
noise creates clusters with a minimum size and
density. Density is defined as a minimum number of
points within a certain distance of each other.
• One input parameter, 𝑀𝑖𝑛𝑃𝑡𝑠, indicates the
minimum number of points in any cluster. In
addition, for each point in a cluster there must be
another point in the cluster whose distance from it is
less than a threshold input value 𝐸𝑝𝑠.
• The desired number of clusters, k, is not an input; rather,
it is determined by the algorithm.
• When the algorithm finishes, there will be points not
assigned to a cluster. These are defined as noise.
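A short usage sketch with scikit-learn's DBSCAN, assuming toy data; eps plays the role of Eps and min_samples the role of MinPts described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense groups plus one isolated point.
X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [25, 25]], dtype=float)

labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)
print(labels)   # points labeled -1 are the noise points left unassigned
```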
5.6.3 CURE (Clustering Using Representatives)
• One objective of CURE is to handle outliers well.
• First, a constant number of points, 𝑐, are chosen from each cluster. These well-scattered points
are then shrunk toward the cluster’s centroid by applying a shrinkage factor, 𝛼. When 𝛼 is 1, all
points are shrunk to a single point, the centroid. These points represent a cluster better than a single
point (such as a medoid or centroid) could.
• At each step clusters with the closest pair of representative points are chosen to be merged. The
distance between them is defined as the minimum distance between any pair of points in the
representative sets from the two clusters.
• CURE handles limited memory by obtaining a random sample to find the initial clusters. The
random sample is partitioned, and each partition is then partially clustered. These resulting
clusters are then completely clustered in a second pass.
• When the clustering of the sample is complete, the labeling of data on disk is performed. A data
item is assigned to the cluster with the closest representative points.
• Outliers are eliminated by the use of two different techniques. The first technique eliminates
clusters that grow very slowly. The second technique removes very small clusters toward the end
of the clustering.
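A hedged sketch (illustrative only, with a greedy farthest-point heuristic assumed for "well scattered") of how CURE's representative points can be chosen and then shrunk toward the centroid by the factor α.

```python
import numpy as np

def representatives(points, c=4, alpha=0.3):
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # Greedy farthest-point selection for well-scattered representatives.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)   # alpha = 1 collapses them to the centroid

pts = np.array([[0, 0], [0, 2], [2, 0], [2, 2], [1, 1]], dtype=float)
print(representatives(pts, c=3, alpha=0.3))
```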
CURE Example
A set of clusters with representative points exists at each step in the processing. There are three clusters in figure (b), each with two representative points. These points are chosen to be far from each other as well as from the mean of the cluster. In part (c), two of the clusters are merged and two new representative points are chosen. In part (d) these points are shrunk toward the mean of the cluster.
CURE with Heap and k-D Tree
• Heap and k-D tree data structures are used to ensure 𝑂(𝑛) time complexity.
• One entry in the heap exists for each cluster. Entries in the heap are stored in increasing order of
the distances between clusters.
• We assume that each entry 𝑢 in the heap contains the set of representative points, 𝑢.𝑟𝑒𝑝; the
mean of the points in the cluster, 𝑢.𝑚𝑒𝑎𝑛; and the cluster closest to it, 𝑢.𝑐𝑙𝑜𝑠𝑒𝑠𝑡. We use the
heap operations: heapify to create the heap, min to extract the minimum entry in the heap, insert
to add a new entry, and delete to delete an entry.
• While clustering the entire database an item is placed in the cluster that has the closest
representative point to it. So each of the 𝑛 points must be compared to 𝑐𝑘 representative points.
• A merge procedure is used to merge two clusters. It determines the new representative points for
the new cluster. A k-D tree is used to assist in merging of clusters. It is a balanced binary tree that
can be thought of as a generalization of a binary search tree. It stores the representative points
for each cluster.
• A random sample size of about 2.5% and a number of partitions greater than one or two
times 𝑘 seem to work well.
Insert Operation on the Heap
• The insert operation performs the comparison operation from the bottom up.
Delete Operation on the Heap
• The delete operation removes the root node by moving the last node into the root position, thereby deleting the element that was in the root.
5.7 Clustering with Categorical Attributes
• Traditional algorithms do not always work with categorical data. The problem is that the centroid
tends to weaken the relationship between the associated cluster and other clusters. The ROCK
(Robust Clustering using linKs) algorithm is targeted to both Boolean data and categorical data.
• The number of links between two items is defined as the number of common neighbors they
have. The algorithm uses the number of links as the similarity measure. A pair of items are said to
be neighbors if their similarity exceeds some threshold.
• One proposed similarity measure based on the Jaccard coefficient is defined as

𝑠𝑖𝑚(𝑡𝑖, 𝑡𝑗) = |𝑡𝑖 ∩ 𝑡𝑗| / |𝑡𝑖 ∪ 𝑡𝑗|
• If the tuples are viewed to be sets of items purchased (i.e., market basket data), then we look at
the number of items they have in common divided by the total number in both.
• The objective of the algorithm is to group together points that have more links. The use of links
rather than distance measures provides a more global approach because the similarity between
points is impacted by other points as well.
The ROCK (Robust Clustering using linKs)
The ROCK algorithm is divided into three general parts:
1. Obtaining a random sample of the data.
2. Performing clustering on the data using the link agglomerative approach. A goodness measure is used to
determine which pair of points is merged at each step.
3. Using these clusters the remaining data on disk are assigned to them.
• Both local and global heaps are used. The local heap for 𝑡𝑖, 𝑞(𝑡𝑖), contains every cluster that has a nonzero
link to 𝑡𝑖. The hierarchical clustering portion of the algorithm starts by placing each point 𝑡𝑖 in the sample
in a separate cluster. The global heap contains information about each cluster.
• All information in the heap is ordered based on the goodness measure:
𝑔(𝐾𝑖, 𝐾𝑗) = link(𝐾𝑖, 𝐾𝑗) / ( (𝑛𝑖 + 𝑛𝑗)^(1+2𝑓(Θ)) − 𝑛𝑖^(1+2𝑓(Θ)) − 𝑛𝑗^(1+2𝑓(Θ)) )

• Here link(𝐾𝑖, 𝐾𝑗) is the number of links between the two clusters. Also 𝑛𝑖 and 𝑛𝑗 are the number of points in
each cluster. The function 𝑓(Θ) depends on the data, but it is found to satisfy the property that each item in
𝐾𝑖 has approximately 𝑛𝑖^𝑓(Θ) neighbors in the cluster.
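To make the pieces concrete, here is a toy sketch (made-up baskets and threshold) of the Jaccard similarity, the neighbor sets, the link counts, and the goodness measure; the choice f(Θ) = (1 − Θ)/(1 + Θ) is the one suggested for market basket data and is treated here as an assumption.

```python
baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"x", "y"}, {"x", "y", "z"}]
theta = 0.4

def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

# Neighbors: items whose similarity meets the threshold theta.
neighbors = [{j for j in range(len(baskets))
              if i != j and jaccard(baskets[i], baskets[j]) >= theta}
             for i in range(len(baskets))]

def links(Ki, Kj):
    """Number of common neighbors, summed over cross-cluster pairs."""
    return sum(len(neighbors[i] & neighbors[j]) for i in Ki for j in Kj)

def goodness(Ki, Kj):
    f = (1 - theta) / (1 + theta)      # assumed choice of f(theta)
    e = 1 + 2 * f
    return links(Ki, Kj) / ((len(Ki) + len(Kj)) ** e - len(Ki) ** e - len(Kj) ** e)

print(goodness([0, 1], [2]), goodness([0, 1], [3, 4]))
```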
Clustering Applications
5.8 Comparison
Algorithm        | Type                     | Space | Time        | Notes
Single link      | Hierarchical             | O(n²) | O(kn²)      | Not incremental
Average link     | Hierarchical             | O(n²) | O(kn²)      | Not incremental
Complete link    | Hierarchical             | O(n²) | O(kn²)      | Not incremental
MST              | Hierarchical/partitional | O(n²) | O(n²)       | Not incremental
Squared error    | Partitional              | O(n)  | O(tkn)      | Iterative
K-means          | Partitional              | O(n)  | O(tkn)      | Iterative; no categorical
Nearest neighbor | Partitional              | O(n²) | O(n²)       | Iterative
PAM              | Partitional              | O(n²) | O(tk(n−k)²) | Iterative; adapted agglomerative; outliers
BIRCH            | Partitional              | O(n)  | O(n)        | CF-tree; incremental; outliers
CURE             | Mixed                    | O(n)  | O(n)        | Heap; k-D tree; incremental; outliers; sampling
ROCK             | Agglomerative            | O(n²) | O(n² lg n)  | Sampling; categorical; links
DBSCAN           | Mixed                    | O(n²) | O(n²)       | Sampling; outliers
• CURE uses sampling and partitioning to handle scalability well; multiple points represent each cluster, and it can detect non-spherical clusters.
• The results of K-means are quite sensitive to the presence of outliers, and K-means may not find a global optimum.
• Using the CF-tree, BIRCH is both dynamic and scalable; however, it only detects spherical clusters.
5.9 Notes
• Another approach for partitional clustering is to allow splitting and merging of clusters. A cluster is split if its variance is above a
certain threshold. One proposed algorithm performing this is called ISODATA.
• Recent work has looked at extending the definition of clustering to be more applicable to categorical data. Categorical ClusTering
Using Summaries (CACTUS) generalizes the traditional definition of clustering and distance. The similarity between tuples (of
attributes used for clustering) is based on the support of two attribute values within the database D. Given two categorical
attributes 𝐴𝑖, 𝐴𝑗 with domains 𝐷𝑖, 𝐷𝑗, respectively, the support of the attribute pair (𝑎𝑖, 𝑎𝑗) is defined as (a small sketch appears after this list):

𝜎𝐷(𝑎𝑖, 𝑎𝑗) = |{𝑡 ∈ 𝐷: 𝑡.𝐴𝑖 = 𝑎𝑖 ∧ 𝑡.𝐴𝑗 = 𝑎𝑗}|
• Sampling and data compression are perhaps the most common techniques to scale up clustering to work effectively on large
databases. With sampling, the algorithm is applied to a large enough sample to ideally produce a good clustering result. Both BIRCH
and CACTUS employ compression techniques. A data bubble is a compressed data item that represents a larger set of items in the
database. Given a set of items, a data bubble consists of a representative item, the cardinality of the set, the radius of the set, and the
estimated average k-nearest-neighbor distances within the set.
• It is possible that a border point could belong to two clusters. CHAMELEON is a hierarchical clustering algorithm that bases the
merging of clusters on both their closeness and their interconnections.
• Online (or iterative) clustering algorithms work well for dynamic databases: clustering can be performed dynamically as the data are
generated. Adaptive algorithms allow users to change the number of clusters dynamically. They avoid having to completely
recluster the database if the user's needs change. The user has the ability to change the number of clusters at any time during
processing. The OAK (online adaptive clustering) algorithm is both online and adaptive. OAK can also handle outliers effectively by
adjusting a viewing parameter, which gives the user a broader view of the clustering and allows him or her to choose the desired clusters.
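A tiny sketch of the CACTUS support count σ_D(a_i, a_j) referenced above, using a hypothetical two-attribute relation.

```python
D = [  # tuples with attributes (color, size); hypothetical toy relation
    ("red", "small"), ("red", "large"), ("blue", "small"),
    ("red", "small"), ("blue", "large"),
]

def support(D, ai, aj):
    """Number of tuples where the first attribute is ai and the second is aj."""
    return sum(1 for t in D if t[0] == ai and t[1] == aj)

print(support(D, "red", "small"))   # 2
```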
References:
Dunham, Margaret H. “Data Mining: Introductory and Advanced
Topics”. Pearson Education, Inc., 2003.
More Related Content

What's hot

Clustering paradigms and Partitioning Algorithms
Clustering paradigms and Partitioning AlgorithmsClustering paradigms and Partitioning Algorithms
Clustering paradigms and Partitioning AlgorithmsUmang MIshra
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering Ashek Farabi
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithmVinit Dantkale
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysisKrish_ver2
 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Mustafa Sherazi
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Salah Amean
 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis IntroductionPrasiddhaSarma
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingankur bhalla
 

What's hot (20)

Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Clustering paradigms and Partitioning Algorithms
Clustering paradigms and Partitioning AlgorithmsClustering paradigms and Partitioning Algorithms
Clustering paradigms and Partitioning Algorithms
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Hierarchical clustering
Hierarchical clustering Hierarchical clustering
Hierarchical clustering
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
K-means clustering algorithm
K-means clustering algorithmK-means clustering algorithm
K-means clustering algorithm
 
Clustering
ClusteringClustering
Clustering
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
3.7 outlier analysis
3.7 outlier analysis3.7 outlier analysis
3.7 outlier analysis
 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
 
Kmeans
KmeansKmeans
Kmeans
 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis Introduction
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Similar to 05 Clustering in Data Mining

clustering and distance metrics.pptx
clustering and distance metrics.pptxclustering and distance metrics.pptx
clustering and distance metrics.pptxssuser2e437f
 
Hierarchical clustering algorithm.pptx
Hierarchical clustering algorithm.pptxHierarchical clustering algorithm.pptx
Hierarchical clustering algorithm.pptxMinakshee Patil
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Yan Xu
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3Nandhini S
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfSowmyaJyothi3
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
 
clustering ppt.pptx
clustering ppt.pptxclustering ppt.pptx
clustering ppt.pptxchmeghana1
 
Hierarchical clustering machine learning by arpit_sharma
Hierarchical clustering  machine learning by arpit_sharmaHierarchical clustering  machine learning by arpit_sharma
Hierarchical clustering machine learning by arpit_sharmaEr. Arpit Sharma
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
 
Lecture 11
Lecture 11Lecture 11
Lecture 11Jeet Das
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsIJMER
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataIOSR Journals
 

Similar to 05 Clustering in Data Mining (20)

clustering and distance metrics.pptx
clustering and distance metrics.pptxclustering and distance metrics.pptx
clustering and distance metrics.pptx
 
Hierarchical clustering algorithm.pptx
Hierarchical clustering algorithm.pptxHierarchical clustering algorithm.pptx
Hierarchical clustering algorithm.pptx
 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
 
Clustering on DSS
Clustering on DSSClustering on DSS
Clustering on DSS
 
Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering Mean shift and Hierarchical clustering
Mean shift and Hierarchical clustering
 
CSA 3702 machine learning module 3
CSA 3702 machine learning module 3CSA 3702 machine learning module 3
CSA 3702 machine learning module 3
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
 
clustering ppt.pptx
clustering ppt.pptxclustering ppt.pptx
clustering ppt.pptx
 
Hierarchical clustering machine learning by arpit_sharma
Hierarchical clustering  machine learning by arpit_sharmaHierarchical clustering  machine learning by arpit_sharma
Hierarchical clustering machine learning by arpit_sharma
 
UNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptxUNIT_V_Cluster Analysis.pptx
UNIT_V_Cluster Analysis.pptx
 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
 
PPT s10-machine vision-s2
PPT s10-machine vision-s2PPT s10-machine vision-s2
PPT s10-machine vision-s2
 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Lecture 11
Lecture 11Lecture 11
Lecture 11
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 

More from Valerii Klymchuk

Sample presentation slides template
Sample presentation slides templateSample presentation slides template
Sample presentation slides templateValerii Klymchuk
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data MiningValerii Klymchuk
 
Crime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataCrime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataValerii Klymchuk
 
Artificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectArtificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectValerii Klymchuk
 

More from Valerii Klymchuk (13)

Sample presentation slides template
Sample presentation slides templateSample presentation slides template
Sample presentation slides template
 
Toronto Capstone
Toronto CapstoneToronto Capstone
Toronto Capstone
 
01 Introduction to Data Mining
01 Introduction to Data Mining01 Introduction to Data Mining
01 Introduction to Data Mining
 
02 Related Concepts
02 Related Concepts02 Related Concepts
02 Related Concepts
 
03 Data Mining Techniques
03 Data Mining Techniques03 Data Mining Techniques
03 Data Mining Techniques
 
03 Data Representation
03 Data Representation03 Data Representation
03 Data Representation
 
05 Scalar Visualization
05 Scalar Visualization05 Scalar Visualization
05 Scalar Visualization
 
06 Vector Visualization
06 Vector Visualization06 Vector Visualization
06 Vector Visualization
 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor Visualization
 
Crime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation DataCrime Analysis based on Historical and Transportation Data
Crime Analysis based on Historical and Transportation Data
 
Artificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support ProjectArtificial Intelligence for Automated Decision Support Project
Artificial Intelligence for Automated Decision Support Project
 
Data Warehouse Project
Data Warehouse ProjectData Warehouse Project
Data Warehouse Project
 
Database Project
Database ProjectDatabase Project
Database Project
 

Recently uploaded

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 

Recently uploaded (20)

FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 

05 Clustering in Data Mining

  • 2. 5.1 Introduction Clustering is similar to classification. However, unlike classification, the groups are not predefined. The groups are called clusters. Clustering can be viewed as similar to unsupervised learning. The basic features of clustering as opposed to classification: • The best number of clusters in not known. • There may not be any a priory knowledge concerning the clusters. • Cluster results are dynamic. Definitions of clusters have been proposed: • Set of like elements. Elements from different clusters are not alike. • The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it. A given set of data may be clustered on different attributes.
  • 3. Different Clustering Attributes A given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown.
  • 4. Classification of Clustering Algorithms • Recent clustering algorithms look at categorical data and are targeted to larger, or dynamic, databases. They may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database. • Intrinsic algorithms do not use any a priory category labels, but depend only on the adjacency matrix containing the distance between objects. • Agglomerative implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. • With hierarchical clustering, nested set of clusters is created, the desired number of clusters is not input. • With partitional clustering, the algorithm creates only one set of clusters.
  • 5. 5.2 Similarity and Distance Measures • Definition. Given a database 𝐷 = 𝑡1, 𝑡2, ⋯ , 𝑡 𝑛 of tuples, a similarity measure, 𝑠𝑖𝑚(𝑡𝑖, 𝑡𝑙), defined between any two tuples, 𝑡𝑖, 𝑡𝑙 ∈ 𝐷, and an integer value 𝑘, the clustering problem is to define a mapping 𝑓: 𝐷 → 1, ⋯ , 𝑘 where each 𝑡𝑖 is assigned to one cluster 𝐾𝑗, 1 ≤ 𝑗 ≤ 𝑘. Given a cluster 𝐾𝑗, ∀𝑡𝑗𝑙, 𝑡𝑗𝑚 ∈ 𝐾𝑗, and 𝑡𝑖 ∉ 𝐾𝑗, 𝑠𝑖𝑚 𝑡𝑗𝑙, 𝑡𝑗𝑚 > 𝑠𝑖𝑚(𝑡𝑗𝑙, 𝑡𝑖). • We assume the definition of a similarity measure 𝑠𝑖𝑚(𝑡𝑖, 𝑡𝑙) exists between any two tuples 𝑡𝑖, 𝑡𝑙 ∈ 𝐷. A distance measure, 𝑑𝑖𝑠(𝑡𝑖, 𝑡𝑗), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that given a cluster 𝐾𝑗, ∀𝑡𝑗𝑙, 𝑡𝑗𝑚 ∈ 𝐾𝑗 and 𝑡𝑖 ∉ 𝐾𝑗, 𝑑𝑖𝑠 𝑡𝑗𝑙, 𝑡𝑗𝑚 ≤ 𝑑𝑖𝑠(𝑡𝑗𝑙, 𝑡𝑖). • Given a cluster 𝐾 𝑚 of 𝑁 points 𝑡 𝑚1, 𝑡 𝑚2 ⋯ 𝑡 𝑚𝑁 , the clusters can be described by using characteristic values: centroid = 𝐶 𝑚 = 𝑖=1 𝑁 (𝑡 𝑚𝑖) 𝑁 radius = 𝑅 𝑚 = 𝑖=1 𝑁 (𝑡 𝑚𝑖 − 𝐶 𝑚)2 𝑁 diameter = 𝐷 𝑚 = 𝑖=1 𝑁 𝑗=1 𝑁 (𝑡 𝑚𝑖 − 𝑡 𝑚𝑗)2 𝑁(𝑁 − 1)
  • 6. Alternative Distance Measures There are several standard alternatives to calculate the distance between clusters 𝐾𝑖 and 𝐾𝑗. For ∀𝑡𝑖𝑙 ∈ 𝐾𝑖 ∉ 𝐾𝑗 and ∀𝑡𝑗𝑚 ∈ 𝐾𝑗 ∉ 𝐾𝑖: • Single link: Smallest distance between an element in one cluster and an element in the other: 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = min 𝑑𝑖𝑠 𝑡𝑖𝑙, 𝑡𝑗𝑚 • Complete link: Largest distance between an element in one cluster and an element in the other 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = max(𝑑𝑖𝑠(𝑡𝑖𝑙, 𝑡𝑗𝑚)) • Average: Average distance between an element in one cluster and an element in the other 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = mean(𝑑𝑖𝑠(𝑡𝑖𝑙, 𝑡𝑗𝑚)) • Centroid: If cluster have a representative centroid, then the centroid distance is defined as the distance between the centroids 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = 𝑑𝑖𝑠(𝐶𝑖, 𝐶𝑗) • Medoid: Using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids 𝑑𝑖𝑠 𝐾𝑖, 𝐾𝑗 = 𝑑𝑖𝑠(𝑀𝑖, 𝑀𝑗).
  • 7. 5.3 Outliers • Some clustering techniques do not perform well with the presence of outliers. • Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other algorithms may then chose to remove or treat these values differently. However, care must be taken in actually removing outliers.
  • 8. 5.4 Hierarchical Algorithms • A tree data structure called a dendrogram, can be used to illustrate the hierarchical clustering technique. Each level in the tree is associated with the distance measure that was used to merge the clusters. The root in a dendrogram tree contains one cluster where all elements are together. The leaves consists of a single- element cluster. The clusters at a particular level are combined when the children clusters have a distance between them less than the distance value associated with this level in the tree.
  • 9. 5.4.1 Agglomerative Algorithms • Agglomerative algorithms start with each individual item in its own cluster and iteratively merge clusters until all items belong in one cluster. • It assumes that a set of elements and distances between them is given as input adjacency matrix. The output of the algorithms is dendrogram, which produces a set of clusters rather than just one clustering. The user can determine which of the clusters (based on distance threshold) he or she wishes to use. • Algorithms differ in how they merge clusters from the prior level to create the next level clusters (merge 2-3… or merge all; treat equal distances). • In addition, the technique used to determine distance between clusters may vary: single link, complete link and average link.
  • 10. Single Link Technique • With the single link approach, two clusters are merged if there is at least one edge that connects the two clusters; that is if the minimum distance between any two points is less than or equal to the threshold distance being considered. It is often called the nearest neighbor clustering. • One variation of single link approach is based on the use of a minimum spanning tree (MST). Here we assume that a procedure, MST produces a minimum spanning tree given an adjacency matrix as input. The clusters are merged in increasing order of the distance found in the MST. • The single linkage approach is infamous for its chain effect; that is, two clusters are merged if only two of their points are close to each other. • These algorithms are both of 𝑂 𝑛2 complexity in space and time.
  • 11. Complete Link, Average Link Technique • Here a procedure is used to find the maximum distance between any clusters so that two clusters are merged if the maximum distance is less than or equal to the distance threshold. • Complete link is an 𝑂(𝑛2 ) algorithm. • Clusters found with the complete link method tend to be more compact than those found using the single link technique. • The average link technique merges two clusters if the average distance between any two points in the two target clusters is below the distance threshold.
  • 12. 5.5 Partitional Algorithms • Nonhierarchical or partitional clustering creates the clusters in one step as opposed to several steps. Only one set of clusters is created and user must input the desired number 𝑘 of clusters. • In addition some metric of criterion function is used to determine the goodness of any proposed solution. This measure of quality could be the average distance between clusters or some other metric. The solution with the best value for the criterion function is the clustering solution used. • One common measure is a squared error metric, which measures the squared distance from each point to the centroid for the associated cluster: 𝑚=1 𝑘 𝑡 𝑚𝑖∈𝐾 𝑚 𝑑𝑖𝑠 (𝐶 𝑚, 𝑡 𝑚𝑖)2 • Partitional algorithms suffer from a combinatorial explosion due to the number of possible solutions. Thus, most algorithms look only at a small subset of all clusters.
  • 13. Squared Error Clustering, K-Means, K-Modes • Squared error clustering algorithm minimizes the squared error for a cluster: 𝑠𝑒 𝐾 = 𝑚=1 𝑘 𝑡 𝑚𝑖∈𝐾 𝑚 𝐶 𝑚 − 𝑡 𝑚𝑖 2. During each iteration each tuple is assigned to the cluster with the closest center. This repeats until the difference between successive square errors is below a threshold. • K-Means clustering algorithm requires that some definition of cluster mean exists, e.g., 𝑚𝑖 = 1 𝑚 𝑗=1 𝑚 𝑡𝑖𝑗, but it does not have to be this particular one; it also assumes that desired number of clusters 𝑘 is an input parameter. The convergence criteria could be based on the squared error, but they need not be. For example, it could stop, when so or very small number of tuples are assigned to different clusters. K-means does not work on categorical data because the mean must be defined on the attribute type. Only convex-shaped clusters are found. It does not handle outliers well. K-means finds a local optimum and may actually miss the global optimum. • K-modes uses modes and it does handle categorical data.
  • 14. Nearest Neighbor, PAM, CLARA, CLARANS • Nearest neighbor algorithm is similar to single link technique. Items are iteratively merged into the existing clusters that are closest. In this algorithm a threshold 𝑡 is used to determine if items will be added to existing clusters or if a new cluster is created. • PAM (partitioning around medoids) algorithm represents cluster by a medoid, which handles outliers well. Quality here is measured by the sum of all distances from a non-medoid object to the medoid for the cluster it is in. An item is assigned to the cluster represented by the medoid to which it is closest (minimum distance). PAM does not scale well to large databases. • CLARA (Clustering LARge Applications) applies PAM to a sample of the underlying database and then uses the medoids found as the medoids for the complete clustering. Several samples can be drawn with PAM applied to each. The sample chosen as the final clustering is the one that performs the best. CLARA is more efficient than PAM for large databases. Five samples of size 40 + 2𝑘 seem to give good results. • CLARANS (clustering large applications based on randomized search) uses multiple different samples. It requires 2 additional parameters: maxneighbour and numlocal. Numlocal indicates the number of samples to be taken. Maxneighbour is the number of neighbors of a node to which any specific node can be compared. Studies indicate that 𝑛𝑢𝑚𝑙𝑜𝑐𝑎𝑙 = 2 and 𝑚𝑎𝑥𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟 = max( 0.0125 × 𝑘 𝑛 − 𝑘 , 250) is a good choice. It assumes that all data are in main memory.
  • 15. 5.5.1 Genetic Algorithms • An approach in which a new solution is generated from the previous solution using crossover and mutation operations. • A fitness function must be used and may be defined based on the inverse of the squared error. • Genetic clustering algorithms perform a global, rather than a local, search of the space of potential solutions.
  • 16. 5.5.2 Clustering with Neural Networks • Neural networks that use unsupervised learning attempt to find features in the data that characterize the desired output. There are two basic types of unsupervised learning: noncompetitive and competitive. • With noncompetitive or Hebbian learning, the weight between two nodes is changed to be proportional to both output values: Δw_ji = η y_j y_i. • With competitive learning, nodes are allowed to compete. This approach usually assumes a two-layer NN in which all nodes from one layer are connected to all nodes in the other layer. Each attribute value of an input tuple is fed to a specific input node, so the number of input nodes is the same as the number of attributes. When a tuple is input to the NN, all output nodes produce an output value. The node whose weights are most similar to the input tuple is declared the winner, and its weights are then adjusted. • This process continues with each tuple input from the training set. With a large and varied enough training set, over time each output node should become associated with a set of tuples.
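A minimal sketch of one winner-take-all competitive learning step, under the assumptions that inputs are normalized, similarity is the dot product, and eta is an illustrative learning rate; the function name is hypothetical.

```python
# Sketch: one competitive-learning update (winner-take-all).
import numpy as np

def competitive_step(W, x, eta=0.1):
    """W has one weight row per output node; x is one normalized input tuple."""
    winner = np.argmax(W @ x)            # the output node most similar to x wins
    W[winner] += eta * (x - W[winner])   # only the winner's weights move toward x
    return winner
```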
  • 17. Self-Organizing Feature Map (SOFM) • Perhaps the most common example of a SOFM is the Kohonen self-organizing map. There is one input layer and one special competitive layer, which produces output values that compete. Nodes in this layer are viewed as a two-dimensional grid, and each input node is connected to each node in this grid. • Each arc has an associated weight, and each node in the competitive layer has an activation function. Thus each node in the competitive layer produces an output value, and the node with the best output wins. An attractive feature of Kohonen nets is that the data can be fed into the multiple competitive nodes in parallel. • A common approach is to initialize the weights on the input arcs to the competitive layer with normalized values. Given an input tuple X = (x_1, ⋯, x_h) and weights w_1i, ⋯, w_hi on the arcs into a competitive node i, the similarity between X and i can be calculated by: sim(X, i) = Σ_{j=1}^{h} x_j w_ji • The competitive node with the highest similarity value wins, and the weights coming into i as well as those of its neighbors (the nodes immediately surrounding it in the grid) are changed to be closer to the input tuple. Given a node i, the notation N_i represents the union of i and the nodes near it. The learning process then uses: Δw_kj = c(x_k − w_kj) if j ∈ N_i, and 0 otherwise.
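A sketch of a single Kohonen training step, assuming a g × g competitive grid, weights W of shape (g, g, h) for h input attributes, a square neighborhood of the given radius, and learning rate c; all of these names are illustrative.

```python
# Sketch: one SOFM (Kohonen map) update step.
import numpy as np

def sofm_step(W, x, c=0.1, radius=1):
    g = W.shape[0]
    sim = np.tensordot(W, x, axes=([2], [0]))             # sim(X, i) = sum_j x_j * w_ji
    wi, wj = np.unravel_index(np.argmax(sim), sim.shape)  # winning node in the grid
    # Move the winner and its grid neighbors N_i toward the input tuple.
    for a in range(max(0, wi - radius), min(g, wi + radius + 1)):
        for b in range(max(0, wj - radius), min(g, wj + radius + 1)):
            W[a, b] += c * (x - W[a, b])                  # delta w_kj = c (x_k - w_kj)
    return (wi, wj)
```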
  • 18. 5.6 Clustering Large Databases In order to perform effectively on large databases, a clustering algorithm should: • Require no more than one scan of the database. • Have the ability to provide status and best answer so far during the execution (referred to as ability to be online). • Be suspendable, stoppable, and resumable. • Be able to update the results incrementally as data are added or removed from the database. • Work with limited main memory. • Be capable of performing different techniques for scanning the database. This may include sampling. • Process each tuple only once.
  • 19. 5.6.1 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) BIRCH is designed for clustering a large amount of metric data and achieves linear I/O time, requiring only one database scan. BIRCH applies only to numeric data and uses the notion of a clustering feature. • Definition. A clustering feature (CF) is a triple (N, LS, SS), where N is the number of points in the cluster, LS is the sum of the points in the cluster, and SS is the sum of the squares of the points in the cluster. • As points are added to the clustering problem, the CF tree is built. Each node in the CF tree contains clustering feature information about each of its children subclusters. A point is inserted into the cluster (represented by a leaf node) to which it is closest. The subcluster in a leaf node must have a diameter no greater than a given threshold value T. When a new item is inserted, the CF value for that node is appropriately updated, as is each triple on the path from the root down to the leaf. If the item does not satisfy the threshold value, it is added to that node as a single-item cluster. • Parameters needed for the CF tree construction include its branching factor, the page block size, and the memory size.
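A small sketch of the CF triple itself: it shows why (N, LS, SS) is enough to compute a subcluster's centroid and radius, and why two subclusters can be merged by simply adding their triples. The class name and methods are illustrative, not BIRCH's actual data structures.

```python
# Sketch: a clustering feature (N, LS, SS) and the statistics it supports.
import numpy as np

class CF:
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), p * p   # count, linear sum, squared sum

    def add(self, other):
        """CFs are additive: merging subclusters just adds the triples."""
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        """Average distance of member points from the centroid, from N, LS, SS alone."""
        return np.sqrt(np.sum(self.SS / self.N - self.centroid() ** 2))
```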
  • 20. Clustering Feature Tree • (Figure: a tree that clusters programs based on their similar attributes.)
  • 21. 5.6.2 DBSCAN • Density-based spatial clustering of applications with noise creates clusters with a minimum size and density. Density is defined as a minimum number of points within a certain distance of each other. • One input parameter, MinPts, indicates the minimum number of points in any cluster. In addition, for each point in a cluster there must be another point in the cluster whose distance from it is less than a threshold input value Eps. • The desired number of clusters, k, is not input; rather, it is determined by the algorithm itself. • When the algorithm finishes, there may be points not assigned to any cluster. These are defined as noise.
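A hedged usage sketch, assuming scikit-learn is available: MinPts and Eps map onto the min_samples and eps parameters, and points labeled −1 are the noise points mentioned above. The data array is made up for illustration.

```python
# Sketch: DBSCAN with scikit-learn on a toy dataset.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.2],     # one dense group
              [8.0, 8.0], [8.1, 7.9], [8.2, 8.1],     # another dense group
              [4.0, 15.0]])                           # an isolated point
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # cluster ids for dense points, -1 for noise
```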
  • 22. 5.6.3 CURE (Clustering Using Representatives) • One objective of CURE is to handle outliers well. • First, a constant number of points, c, are chosen from each cluster. These well-scattered points are then shrunk toward the cluster's centroid by applying a shrinkage factor α. When α is 1, all points are shrunk to just one point – the centroid. These points represent a cluster better than a single point (such as a medoid or centroid) could. • At each step, the two clusters with the closest pair of representative points are merged. The distance between two clusters is defined as the minimum distance between any pair of points in their representative sets. • CURE handles limited memory by obtaining a random sample to find the initial clusters. The random sample is partitioned, and each partition is then partially clustered. These resulting clusters are then completely clustered in a second pass. • When the clustering of the sample is complete, the labeling of the data on disk is performed: a data item is assigned to the cluster with the closest representative point. • Outliers are eliminated by two different techniques. The first eliminates clusters that grow very slowly; the second removes very small clusters toward the end of the clustering.
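A sketch of how a cluster's representative points might be produced: pick c well-scattered members and shrink them toward the centroid by α. The farthest-point scattering heuristic below follows the idea in the text but is a simplified assumption, not CURE's exact procedure.

```python
# Sketch: choosing and shrinking representative points for one cluster.
import numpy as np

def cure_representatives(cluster_points, c=4, alpha=0.5):
    centroid = cluster_points.mean(axis=0)
    # Start with the point farthest from the centroid, then repeatedly add the
    # point farthest from all representatives chosen so far (well-scattered points).
    reps = [cluster_points[np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster_points)):
        d = np.min([np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(d)])
    # Shrink toward the centroid; alpha = 1 collapses every representative to the centroid.
    return np.array([r + alpha * (centroid - r) for r in reps])
```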
  • 23. CURE Example A set of clusters with representative points exists at each step in the processing. There are three clusters in figure (b), each with two representative points. These points are chosen to be far from each other as well as from the mean of the cluster. In part (c), two of the clusters are merged and two new representative points are chosen. In part (d) these points are shrunk toward the mean of the cluster.
  • 24. CURE with Heap and k-D Tree • A heap and a k-D tree data structure are used to ensure O(n) time complexity. • One entry in the heap exists for each cluster. Entries in the heap are stored in increasing order of the distances between clusters. • We assume that each entry u in the heap contains the set of representative points, u.rep; the mean of the points in the cluster, u.mean; and the cluster closest to it, u.closest. The heap operations used are: heapify to create the heap, min to extract the minimum entry, insert to add a new entry, and delete to remove an entry. • When clustering the entire database, an item is placed in the cluster that has the representative point closest to it, so each of the n points must be compared to ck representative points. • A merge procedure is used to merge two clusters; it determines the new representative points for the new cluster. A k-D tree is used to assist in merging clusters. It is a balanced binary tree that can be thought of as a generalization of a binary search tree, and it stores the representative points of each cluster. • A random sample size of about 2.5% and a number of partitions greater than one or two times k seem to work well.
  • 25. Insert Operation on the Heap • The insert operation places the new entry at the bottom of the heap and performs its comparisons from the bottom up until the heap ordering is restored.
  • 26. Delete Operation on the Heap • The delete operation removes the root node, moves the last node into the root position, and then restores the heap ordering; the net effect is that the element that was in the root is deleted.
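A brief sketch of these heap operations using Python's standard heapq module; the (distance, cluster id) entries are illustrative, standing in for the richer CURE heap entries described above.

```python
# Sketch: heapify, insert, and delete-min on a distance-ordered heap.
import heapq

heap = []
heapq.heapify(heap)                   # heapify: create the (empty) heap
heapq.heappush(heap, (2.7, 'K1'))     # insert: new entry sifts up from the bottom
heapq.heappush(heap, (0.9, 'K2'))
heapq.heappush(heap, (1.5, 'K3'))

closest = heapq.heappop(heap)         # delete-min: remove the root, move the last
print(closest)                        # entry to the root, and sift it down
```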
  • 27. 5.7 Clustering with Categorical Attributes • Traditional algorithms do not always work well with categorical data. The problem is that the centroid tends to weaken the relationship between the associated cluster and the other clusters. The ROCK (Robust Clustering using linKs) algorithm is targeted at both Boolean and categorical data. • The number of links between two items is defined as the number of common neighbors they have, and the algorithm uses the number of links as the similarity measure. A pair of items are said to be neighbors if their similarity exceeds some threshold. • One proposed similarity measure, based on the Jaccard coefficient, is defined as sim(t_i, t_j) = |t_i ∩ t_j| / |t_i ∪ t_j| • If the tuples are viewed as sets of items purchased (i.e., market basket data), then we look at the number of items they have in common divided by the total number of items in both. • The objective of the algorithm is to group together points that have more links. The use of links rather than distance measures provides a more global approach, because the similarity between points is impacted by other points as well.
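A small sketch of the Jaccard similarity and the neighbor/link counts on market-basket style tuples (sets of items); the threshold theta and the example baskets are assumed inputs.

```python
# Sketch: Jaccard similarity, neighbors, and links for ROCK-style clustering.
def jaccard(ti, tj):
    return len(ti & tj) / len(ti | tj)

def neighbors(tuples, theta):
    """neighbors[i] = indices of tuples whose similarity to tuple i exceeds theta
    (a tuple counts as its own neighbor, since sim(t, t) = 1)."""
    return [{j for j, tj in enumerate(tuples) if jaccard(ti, tj) > theta}
            for i, ti in enumerate(tuples)]

def links(nbrs, i, j):
    """Number of common neighbors of tuples i and j."""
    return len(nbrs[i] & nbrs[j])

baskets = [{'milk', 'bread'}, {'milk', 'bread', 'eggs'}, {'beer', 'chips'}]
nbrs = neighbors(baskets, theta=0.4)
print(links(nbrs, 0, 1))
```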
  • 28. The ROCK (Robust Clustering using linKs) The ROCK algorithm is divided into three general parts: 1. Obtaining a random sample of the data. 2. Performing clustering on the data using the link agglomerative approach. A goodness measure is used to determine which pair of points is merged at each step. 3. Using these clusters, the remaining data on disk are assigned to them. • Both local and global heaps are used. The local heap for t_i, q(t_i), contains every cluster that has a nonzero link to t_i. The hierarchical clustering portion of the algorithm starts by placing each point t_i in the sample in a separate cluster. The global heap contains information about each cluster. • All information in the heaps is ordered based on the goodness measure: g(K_i, K_j) = link(K_i, K_j) / ((n_i + n_j)^(1+2f(Θ)) − n_i^(1+2f(Θ)) − n_j^(1+2f(Θ))) • Here link(K_i, K_j) is the number of links between the two clusters. Also, n_i and n_j are the numbers of points in each cluster. The function f(Θ) depends on the data, but it is found to satisfy the property that each item in K_i has approximately n_i^f(Θ) neighbors in the cluster.
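A direct transcription of the goodness measure as a small function; the inputs (link count, cluster sizes, and f(Θ)) are assumed to be supplied by the surrounding algorithm, and the f(Θ) shown in the usage line is the common choice from the ROCK paper, not a requirement.

```python
# Sketch: the ROCK goodness measure used to order merge candidates.
def goodness(link_ij, n_i, n_j, f_theta):
    """g(K_i, K_j) = link(K_i, K_j) / ((n_i + n_j)^(1+2f) - n_i^(1+2f) - n_j^(1+2f))."""
    expo = 1 + 2 * f_theta
    return link_ij / ((n_i + n_j) ** expo - n_i ** expo - n_j ** expo)

# Example with the frequently used f(theta) = (1 - theta) / (1 + theta), theta = 0.5.
print(goodness(link_ij=12, n_i=5, n_j=4, f_theta=(1 - 0.5) / (1 + 0.5)))
```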
  • 30. 5.8 Comparison
Algorithm | Type | Space | Time | Notes
Single link | Hierarchical | O(n²) | O(kn²) | Not incremental
Average link | Hierarchical | O(n²) | O(kn²) | Not incremental
Complete link | Hierarchical | O(n²) | O(kn²) | Not incremental
MST | Hierarchical/partitional | O(n²) | O(n²) | Not incremental
Squared error | Partitional | O(n) | O(tkn) | Iterative
K-means | Partitional | O(n) | O(tkn) | Iterative; no categorical
Nearest neighbor | Partitional | O(n²) | O(n²) | Iterative
PAM | Partitional | O(n²) | O(tk(n−k)²) | Iterative; adapted agglomerative; outliers
BIRCH | Partitional | O(n) | O(n) | CF-tree; incremental; outliers
CURE | Mixed | O(n) | O(n) | Heap; k-D tree; incremental; outliers; sampling
ROCK | Agglomerative | O(n²) | O(n² lg n) | Sampling; categorical; links
DBSCAN | Mixed | O(n²) | O(n²) | Sampling; outliers
• CURE uses sampling and partitioning to handle scalability well, and multiple points represent each cluster, so it can detect non-spherical clusters. • The results of K-means are quite sensitive to the presence of outliers, and it may not find a global optimum. • With its CF-tree, BIRCH is both dynamic and scalable; however, it detects only spherical clusters.
  • 31. 5.9 Notes • Another approach to partitional clustering is to allow splitting and merging of clusters. A cluster is split if its variance is above a certain threshold. One proposed algorithm that performs this is called ISODATA. • More recent work has looked at extending the definition of clustering to be more applicable to categorical data. Categorical ClusTering Using Summaries (CACTUS) generalizes the traditional definitions of clustering and distance. The similarity between tuples (on the attributes used for clustering) is based on the support of pairs of attribute values within the database D. Given two categorical attributes A_i, A_j with domains D_i, D_j, respectively, the support of the attribute-value pair (a_i, a_j) is defined as: σ_D(a_i, a_j) = |{t ∈ D : t.A_i = a_i ∧ t.A_j = a_j}| • Sampling and data compression are perhaps the most common techniques used to scale clustering up to work effectively on large databases. With sampling, the algorithm is applied to a sample that is large enough to ideally produce a good clustering result. Both BIRCH and CACTUS employ compression techniques. A data bubble is a compressed data item that represents a larger set of items in the database. Given a set of items, a data bubble consists of a representative item, the cardinality of the set, the radius of the set, and the estimated average k-nearest-neighbor distances within the set. • It is possible that a border point could belong to two clusters. CHAMELEON is a hierarchical clustering algorithm that bases the merging of clusters on both their closeness and their interconnectivity. • Online (or iterative) clustering algorithms work well for dynamic databases, since clustering can be performed dynamically as the data are generated. Adaptive algorithms allow the user to change the number of clusters at any time during processing, which avoids having to completely recluster the database if the user's needs change. The OAK (online adaptive clustering) algorithm is both online and adaptive. OAK can also handle outliers effectively by adjusting a viewing parameter, which gives the user a broader view of the clustering and allows the desired clusters to be chosen.
  • 32. References: Dunham, Margaret H. “Data Mining: Introductory and Advanced Topics”. Pearson Education, Inc., 2003.