Alok Kumar Jagadev
Data Mining:
Concepts and Techniques
• The goal of clustering is to
– group data points that are close (or similar) to each other
– identify such groupings (or clusters) in an unsupervised manner
• Unsupervised: no information is provided to the algorithm on which
data points belong to which clusters
• Example
What should the clusters
be for these data points?
What is Clustering?
• Clustering can be considered the most important unsupervised learning
problem; like every other problem of this kind, it deals with finding
structure in a collection of unlabeled data.
• A loose definition of clustering could be “the process of organizing
objects into groups whose members are similar in some way”.
• A cluster is therefore a collection of objects which are “similar”
to one another and “dissimilar” to the objects belonging to other clusters.
Clustering Algorithms
 A clustering algorithm attempts to find natural groups of components (or data)
based on some similarity
 Also, the clustering algorithm finds the centroid of a group of data sets
 To determine cluster membership, most algorithms evaluate the distance
between a point and the cluster centroids
 The output from a clustering algorithm is basically a statistical description of
the cluster centroids with the number of components in each cluster.
• Simple graphical example:
• In this case we easily identify the 4 clusters into which the data can be
divided; the similarity criterion is distance: two or more objects belong to
the same cluster if they are “close” according to a given distance. This is
called distance-based clustering.
• Another kind of clustering is conceptual clustering: two or more objects
belong to the same cluster if it defines a concept common to all of those objects.
• In other words, objects are grouped according to their fit to descriptive
concepts, not according to simple similarity measures.
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
• City planning: Identifying groups of houses according to their house type,
value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults
Quality: What Is Good Clustering?
 A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
 The quality of a clustering result depends on both the similarity measure
used by the method and its implementation
 The quality of a clustering method is also measured by its ability to
discover some or all of the hidden patterns
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-
scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications
and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Major Clustering Approaches
• Partitioning: Construct various partitions and then evaluate them by some criterion
• Hierarchical: Create a hierarchical decomposition of the set of objects using
some criterion
• Model-based: Hypothesize a model for each cluster and find best fit of
models to data
• Density-based: Guided by connectivity and density functions
Typical Alternatives to Calculate the Distance
between Clusters
 Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an element in
the other, i.e., dis(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) =
dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) =
dis(Mi, Mj)
– Medoid: one chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
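For a cluster of N points t_1, ..., t_N:
• Centroid: the “middle” of a cluster
  C = \frac{1}{N} \sum_{i=1}^{N} t_i
• Radius: square root of the average squared distance from any point of the cluster to its centroid
  R = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (t_i - C)^2}
• Diameter: square root of the average squared distance between all pairs
of points in the cluster
  D = \sqrt{\frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} (t_i - t_j)^2}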
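As a rough illustration of these three quantities, the sketch below computes them for four 2-D points in plain Python. The function name and the sample points are assumptions made for the example; radius is taken as the root-mean-square distance to the centroid, and diameter as the root-mean-square distance over all ordered pairs of distinct points.

```python
import math

def centroid_radius_diameter(points):
    """Return (centroid, radius, diameter) for a list of numeric tuples."""
    n = len(points)
    centroid = tuple(sum(x) / n for x in zip(*points))
    # Radius: sqrt of the mean squared distance from each point to the centroid.
    radius = math.sqrt(sum(math.dist(p, centroid) ** 2 for p in points) / n)
    # Diameter: sqrt of the mean squared distance over all ordered pairs i != j.
    pair_sum = sum(math.dist(p, q) ** 2
                   for i, p in enumerate(points)
                   for j, q in enumerate(points) if i != j)
    diameter = math.sqrt(pair_sum / (n * (n - 1)))
    return centroid, radius, diameter

# Hypothetical sample: the corners of a 2 x 2 square.
c, r, d = centroid_radius_diameter([(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)])
print(c, round(r, 3), round(d, 3))
```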
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects into a
set of k clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented by the center of
the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw,
1987): Each cluster is represented by one of the objects in the cluster
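• Partitioning criterion (sum of squared error), where C_m is the center of cluster K_m:
  E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2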
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current
partition (the centroid is the center, i.e., mean point, of the cluster)
– Assign each object to the cluster with the nearest seed point
– Go back to Step 2, stop when no more new assignment
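The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it seeds the centroids with k randomly chosen data points instead of an initial partition, and the function name, `max_iter`, `seed` and the sample data are all hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Batch k-means on a list of numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: seed the centroids with k distinct data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # no new assignment, so stop
            break
        labels = new_labels
        # Recompute each centroid as the mean point of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(sorted(centroids), labels)
```

On data this well separated the result should not depend on the seed; on harder data, different seeds can end in different local optima.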
The K-Means Clustering Method
• Example (figure omitted): arbitrarily choose K objects as initial cluster centers, then assign each object to the most similar center.
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
• Comment: Often terminates at a local optimum. The global optimum may be
found using techniques such as deterministic annealing and genetic algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
– A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
– An object with an extremely large value may substantially distort the
distribution of the data.
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located
object in the cluster.
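A small sketch of why the medoid is less sensitive to outliers than the mean. The helper names and the five sample points are assumptions for illustration; the medoid is taken as the data object with the smallest total distance to all other objects.

```python
import math

def mean_point(points):
    # Coordinate-wise mean; it need not coincide with any actual object.
    return tuple(sum(x) / len(points) for x in zip(*points))

def medoid(points):
    # The actual data object with the smallest total distance to the others.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Four nearby points plus one extreme outlier (hypothetical data).
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
print("mean:  ", mean_point(cluster))   # pulled far towards the outlier
print("medoid:", medoid(cluster))       # stays inside the dense group
```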
• Height and weight information are given. Using these two variables,
we need to group the objects based on height and weight information.
Data Sample
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Step 1: Input
• Dataset, clustering variables and maximum number of clusters (K in
K-Means Clustering)
• In this dataset, only two variables, height and weight, are considered for clustering.
Height Weight
185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
Step 2: Initialize cluster centroid
In this example, value of K is considered as 2. Cluster centroids are
initialized with first 2 observations.
Initial Centroid
Cluster Height Weight
K1 185 72
K2 170 56
Step 3: Calculate Euclidean Distance
Euclidean distance is one of the distance measures used in the K-Means algorithm.
The Euclidean distance between an observation and each of the initial cluster
centroids, 1 and 2, is calculated.
Each observation is then assigned to the cluster whose centroid is at the
minimum Euclidean distance.
Euclidean Distance
First two observations
Height Weight
185 72
170 56
Now initial cluster centroids are :
Updated Centroid
Cluster Height Weight
K1 185 72
K2 170 56
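The Euclidean distance of each observation from each cluster centroid is calculated.
Observation (185, 72):
Euclidean distance from Cluster 1 = sqrt((185-185)^2 + (72-72)^2) = 0
Euclidean distance from Cluster 2 = sqrt((185-170)^2 + (72-56)^2) = 21.93
Assignment: Cluster 1
Observation (170, 56):
Euclidean distance from Cluster 1 = sqrt((170-185)^2 + (56-72)^2) = 21.93
Euclidean distance from Cluster 2 = sqrt((170-170)^2 + (56-56)^2) = 0
Assignment: Cluster 2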
We have considered two observations for assignment only because we knew the
assignment. And there is no change in Centroids as these two observations were
only considered as initial centroids.
Step 4: Move on to next observation and calculate Euclidean Distance
Height Weight
168 60
Euclidean distance from Cluster 1 = sqrt((168-185)^2 + (60-72)^2) = 20.809
Euclidean distance from Cluster 2 = sqrt((168-170)^2 + (60-56)^2) = 4.472
Assignment: Cluster 2
Since the distance from cluster 2 is the minimum, the observation is assigned to
cluster 2.
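The distance calculations of Steps 3 and 4 can be checked with a few lines of Python. This is only a sketch: the helper name `nearest_centroid` is an assumption, and `math.dist` is the standard-library Euclidean distance.

```python
import math

def nearest_centroid(obs, centroids):
    """Return (cluster id, {cluster id: distance}) for one observation."""
    distances = {k: math.dist(obs, c) for k, c in centroids.items()}
    return min(distances, key=distances.get), distances

# Initial centroids are the first two observations, as in Step 2.
centroids = {1: (185, 72), 2: (170, 56)}
for obs in [(185, 72), (170, 56), (168, 60)]:
    cluster, distances = nearest_centroid(obs, centroids)
    print(obs, {k: round(d, 3) for k, d in distances.items()}, "-> cluster", cluster)
```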
Now revise the cluster centroids: the mean values of Height and Weight are the cluster
centroids. The addition is only to cluster 2, so only the centroid of cluster 2 will be updated.
Updated cluster centroids
Updated Centroid
Cluster Height Weight
K=1 185 72
K=2 (170+168)/2 = 169 (56+60)/2 = 58
Step 5: Calculate the Euclidean distance for the next observation, assign it
based on the minimum Euclidean distance and update the cluster centroids.
Next Observation.
Height Weight
179 68
Euclidean Distance Calculation and Assignment
Euclidean Distance Euclidean Distance
from Cluster 1 from Cluster 2 Assignment
7.211103 14.14214 1
Update Cluster Centroid
Updated Centroid
Cluster Height Weight
K=1 (185+179)/2 = 182 (72+68)/2 = 70
K=2 169 58
Continue the steps until all observations are assigned
Cluster Centroids
Cluster Height Weight
K=1 181.4 74.5
K=2 169 58
This is what was expected initially based on the two-dimensional plot.
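The whole walk-through can be replayed with a short script. It is a sketch of the sequential procedure described in Steps 2 to 5 (seed with the first two observations, then assign each further observation and update only the receiving centroid); the function name is an assumption, and the printed centroids are simply what this update rule yields once all twelve observations have been processed.

```python
import math

data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

def sequential_kmeans(observations, k=2):
    # The first k observations seed the clusters, as in Step 2.
    clusters = [[obs] for obs in observations[:k]]
    centroids = [tuple(map(float, obs)) for obs in observations[:k]]
    for obs in observations[k:]:
        # Assign the observation to the cluster with the nearest centroid.
        j = min(range(k), key=lambda c: math.dist(obs, centroids[c]))
        clusters[j].append(obs)
        # Update only the centroid of the cluster that received the observation.
        centroids[j] = tuple(sum(x) / len(clusters[j]) for x in zip(*clusters[j]))
    return centroids, clusters

centroids, clusters = sequential_kmeans(data)
for i, (c, members) in enumerate(zip(centroids, clusters), start=1):
    print(f"K={i}: centroid={c}, size={len(members)}")
```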
A few important considerations in K Means
• The scale of measurement influences Euclidean distance, so variable
standardisation becomes necessary
• Depending on expectations, you may require outlier treatment
• K-Means clustering may be biased by the choice of initial centroids
• The maximum number of clusters is typically an input and may also affect the clusters
that get created
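One common way to standardise variables before computing Euclidean distances is the z-score. The sketch below applies it to the height and weight columns of the sample data; the function name is an assumption, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is a choice made for the example.

```python
import statistics

def standardise(column):
    """Z-score a list of numbers: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in column]

heights = [185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177]
weights = [72, 56, 60, 68, 72, 77, 71, 70, 84, 88, 67, 76]
z_heights, z_weights = standardise(heights), standardise(weights)
# After standardisation both variables have mean 0 and unit spread,
# so neither dominates the distance purely because of its units.
print([round(z, 2) for z in z_heights[:3]], [round(z, 2) for z in z_weights[:3]])
```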
The K-Medoids Clustering Method
• Find representative objects, called medoids, in clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering
– PAM works effectively for small data sets, but does not scale well for
large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
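The PAM swap idea can be sketched as below. This is a simplified greedy variant under stated assumptions, not the published algorithm in full: the first k objects seed the medoids, every (medoid, non-medoid) swap is tried in turn, and a swap is kept as soon as it lowers the total distance. The function names and sample points are hypothetical.

```python
import math
from itertools import product

def total_cost(points, medoids):
    # Sum of distances from every object to its nearest medoid.
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # assumption: the first k objects seed the medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            # Try replacing medoid i with a non-medoid object.
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:  # keep the swap only if it lowers the total distance
                medoids, best, improved = trial, cost, True
    return medoids, best

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
medoids, cost = pam(points, k=2)
print(medoids, cost)
```

Each pass evaluates on the order of k(n-k) swaps, each costing a scan of all objects, which is why this approach does not scale well to large data sets.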
A Typical K-Medoids Algorithm (PAM)
[Figure: a PAM example. Arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (Total Cost = 20). Randomly select a non-medoid object, O_random, and compute the total cost of swapping a medoid O with O_random (Total Cost = 26). Swap O and O_random if quality is improved. Loop until no change.]
Hierarchical Clustering
• Clusters are created in levels, actually creating sets of clusters at each level.
• Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
• Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method does not require
the number of clusters k as an input, but needs a termination condition
 Illustrative Example:
 Agglomerative and divisive clustering on the data set {a, b, c, d ,e }
 Cluster distance
 Termination condition
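[Figure: on {a, b, c, d, e}, agglomerative clustering runs from Step 0 to Step 4, merging a and b into {a, b}, d and e into {d, e}, c with {d, e} into {c, d, e}, and finally everything into {a, b, c, d, e}. Divisive clustering runs the same steps in the opposite direction, from Step 0 (one cluster) to Step 4 (single objects).]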
Hierarchical Agglomerative Clustering
• Starts with each document in a separate cluster
– then repeatedly joins the closest pair of clusters, until there is only one cluster
• The history of merging forms a binary tree or hierarchy.
How do we measure the distance between clusters?
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “farthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
Cluster Distance Measures
• Single link: smallest distance between an
element in one cluster and an element in
the other, i.e., d(Ci, Cj) = min{d(xip, xjq)}
• Complete link: largest distance between
an element in one cluster and an element
in the other, i.e., d(Ci, Cj) = max{d(xip, xjq)}
• Average: average distance between elements
in one cluster and elements in the other, i.e.,
d(Ci, Cj) = avg{d(xip, xjq)}
• The distance from a cluster to itself is zero: d(C, C) = 0
• Dendrogram: a tree data
structure which illustrates
hierarchical clustering
• Each level shows clusters for
that level.
– Leaf – individual clusters
– Root – one cluster
• A cluster at level i is the union of
its children clusters at level i+1.
Cluster Distance Measures
Example: Given a data set of five objects characterized by a single continuous feature,
assume that there are two clusters: C1: {a, b} and C2: {c, d, e}.
1.Calculate the distance matrix.
2.Calculate three cluster distances between C1 and C2.
a b c d e
Feature 1 2 4 5 6
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
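Single link: dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = min{3, 4, 5, 2, 3, 4} = 2
Complete link: dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)} = max{3, 4, 5, 2, 3, 4} = 5
Average: dist(C1, C2) = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21/6 = 3.5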
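The three cluster distances for this example can be verified directly from the feature values. The variable names below are assumptions for the sketch; the distance between two objects is the absolute difference of their single feature.

```python
feature = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
C1, C2 = ["a", "b"], ["c", "d", "e"]

# All pairwise distances between an element of C1 and an element of C2.
pairwise = [abs(feature[p] - feature[q]) for p in C1 for q in C2]

single_link = min(pairwise)                   # closest pair (b, c)
complete_link = max(pairwise)                 # farthest pair (a, e)
average_link = sum(pairwise) / len(pairwise)  # mean over all 6 pairs
print(single_link, complete_link, average_link)  # 2 5 3.5
```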
Agglomerative Algorithm
• The Agglomerative algorithm is carried out in three steps:
1) Convert all object features into a
distance matrix
2) Set each object as a cluster (thus if
we have N objects, we will have N
clusters at the beginning)
3) Repeat until the number of clusters is
one (or a known number of clusters)
 Merge two closest clusters
 Update “distance matrix”
• Problem: clustering analysis with agglomerative algorithm
data matrix
distance matrix
Euclidean distance
• Merge two closest clusters (iteration 1)
• Update distance matrix (iteration 1)
• Merge two closest clusters (iteration 2)
• Update distance matrix (iteration 2)
• Merge two closest clusters/update distance matrix (iteration 3)
• Merge two closest clusters/update distance matrix (iteration 4)
• Final result (meeting termination condition)
• Dendrogram tree representation
1. There are 6 clusters: A, B, C, D, E and F
2. Merge clusters D and F into cluster (D,
F) at distance 0.50
3. Merge cluster A and cluster B into (A,
B) at distance 0.71
4. Merge clusters E and (D, F) into ((D,
F), E) at distance 1.00
5. Merge clusters ((D, F), E) and C into
(((D, F), E), C) at distance 1.41
6. Merge clusters (((D, F), E), C) and (A,
B) into ((((D, F), E), C), (A, B))
at distance 2.50
7. The last cluster contains all the objects,
which concludes the computation
Given a data set of five objects characterised by a single continuous feature:
Apply the agglomerative algorithm with single-link, complete-link and averaging cluster
distance measures to produce three dendrogram trees, respectively.
a b c d e
Feature 1 2 4 5 6
a b c d e
a 0 1 3 4 5
b 1 0 2 3 4
c 3 2 0 1 2
d 4 3 1 0 1
e 5 4 2 1 0
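One way to work through this exercise is the sketch below, which runs the agglomerative algorithm on the five objects and prints the distance at which each merge happens; these are the heights at which the branches of each dendrogram join. For brevity it recomputes cluster distances from the object features at every step instead of updating a distance matrix, and ties are broken by first occurrence. The function names are assumptions.

```python
from itertools import combinations

def agglomerate(feature, linkage):
    """Merge the closest pair of clusters until one remains; return the merge history."""
    clusters = [(name,) for name in feature]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        def dist(pair):
            ci, cj = pair
            return linkage([abs(feature[p] - feature[q]) for p in ci for q in cj])
        ci, cj = min(combinations(clusters, 2), key=dist)  # two closest clusters
        history.append((ci, cj, dist((ci, cj))))
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci + cj]
    return history

def average(values):
    return sum(values) / len(values)

feature = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
for name, linkage in [("single", min), ("complete", max), ("average", average)]:
    print(name, [d for _, _, d in agglomerate(feature, linkage)])
```

All three linkages first join a with b and c with d at distance 1; they differ in the height at which e joins {c, d} and at which the final two clusters merge.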
Density-Based Clustering Algorithms
Density-Based Clustering
• Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly constructed density function
• A cluster is a connected dense component which can grow in any direction that
density leads.
• Density, connectivity and boundary
• Arbitrary shaped clusters and good scalability
• Each cluster has a considerably higher density of points than the area outside
of the cluster
Major Features
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
Two Major Types of Density-Based
Clustering Algorithms
• Connectivity based:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– CLIQUE: Agrawal, et al. (SIGMOD’98)
• Density function based:
- DENCLUE: Hinneburg & D. Keim (KDD’98/2006)
Density Based Clustering: Basic Concept
• Intuition for the formalization of the basic idea
– For any point in a cluster, the local point density around that point has to
exceed some threshold
– The set of points from one cluster is connected
• Local point density at a point p defined by two parameters
– ε – radius for the neighborhood of point p:
Nε(p) := {q in data set D | dist(p, q) ≤ ε}
– MinPts – minimum number of points in the given neighbourhood Nε(p)
• ε-Neighborhood – Objects within a radius of ε from an object.
• “High density” - the ε-Neighborhood of an object contains at least MinPts
objects.
[Figure: the ε-Neighborhood of p and the ε-Neighborhood of q. With MinPts = 4, the density of p is “high” and the density of q is “low”.]
Core, Border & Outlier
Given ε and MinPts, categorize the
objects into three exclusive groups.
A point is a core point if it has at least
a specified number of points
(MinPts) within Eps. These are points
that are at the interior of a cluster.
A border point has fewer than
MinPts within Eps, but is in the
neighborhood of a core point.
A noise point is any point that is
neither a core point nor a border point.
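The three categories can be sketched in a few lines of Python. The assumptions here are that a point counts as core when its Eps-neighbourhood, including the point itself, holds at least MinPts points, and that the function name and sample points are purely illustrative.

```python
import math

def classify(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' for the given eps and min_pts."""
    # The eps-neighbourhood here includes the point itself (an assumption).
    neighbours = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbours[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbours[p]):
            labels[p] = "border"  # not dense itself, but next to a core point
        else:
            labels[p] = "noise"
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2.5, 1), (9, 9)]
for point, label in classify(pts, eps=1.5, min_pts=4).items():
    print(point, label)
```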
• M, P, O, and R are core objects since each is in an Eps neighborhood
containing at least 3 points
MinPts = 3
Eps = radius of the circles
• Directly density-reachable
• An object q is directly density-reachable from object p if p is a
core object and q is in p’s ε-neighborhood.
• q is directly density-reachable from p
• p is not directly density-reachable from q?
• Density-reachability is asymmetric.
[Figure parameters: MinPts = 5, Eps = 1 cm]
• Density-Reachable (directly and indirectly):
– A point p is directly density-reachable from p1;
– p1 is directly density-reachable from q;
– p ← p1 ← q form a chain.
• p is (indirectly) density-reachable from q
• q is not density-reachable from p?
• Density-connected
– A point p is density-connected to a point q wrt.
Eps, MinPts if there is a point o such that both
p and q are density-reachable from o wrt. Eps
and MinPts.
Formal Description of Cluster
• Given a data set D, parameter ε and threshold MinPts.
• A cluster C is a subset of objects satisfying two criteria:
– Connected: ∀ p, q ∈ C: p and q are density-connected.
– Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C.
(avoid redundancy)
P is a core object.
Review of Concepts
Are objects p and q in the same cluster?
– Are p and q density-connected?
– Are p and q density-reachable by some object o?
– Directly density-reachable
– Indirectly density-reachable through a chain
Is an object o in a cluster or an outlier?
– Is o a core object?
– Is o density-reachable by some core object?
DBSCAN Algorithm
Input: The data set D
Parameter: ε, MinPts
For each object p in D
    if p is a core object and not processed then
        C = retrieve all objects density-reachable from p
        mark all objects in C as processed
        report C as a cluster
    else if p is not processed then
        mark p as outlier (provisionally; p may still be reached from a later core object)
    end if
End For
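The loop above can be turned into a short runnable sketch. It is an illustration under stated assumptions rather than a tuned implementation: neighbourhoods are found by brute force, a neighbourhood includes the point itself, label 0 stands for noise, and the function name and sample points are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return a dict point -> cluster id (1, 2, ...) or 0 for noise."""
    def region(p):
        return [q for q in points if math.dist(p, q) <= eps]

    labels = {}
    cluster_id = 0
    for p in points:
        if p in labels:
            continue
        if len(region(p)) < min_pts:
            labels[p] = 0            # provisionally noise; may become a border point later
            continue
        cluster_id += 1
        labels[p] = cluster_id
        frontier = region(p)
        while frontier:
            q = frontier.pop()
            if labels.get(q, 0) != 0:
                continue             # already assigned to a cluster
            labels[q] = cluster_id   # unvisited or noise point joins the cluster
            neighbours = region(q)
            if len(neighbours) >= min_pts:   # q is a core object: keep expanding
                frontier.extend(neighbours)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
       (10, 10), (10, 11), (11, 10), (11, 11)]
print(dbscan(pts, eps=1.5, min_pts=4))
```

The brute-force `region` scan makes this quadratic in the number of points; a spatial index would be the usual way to speed it up.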
DBScan Algorithm
DBSCAN: The Algorithm
– Arbitrarily select a point p
– Retrieve all points density-reachable from p wrt. Eps and MinPts.
– If p is a core point, a cluster is formed.
– If p is a border point, no points are density-reachable from p and
DBSCAN visits the next point of the database.
– Continue the process until all of the points have been processed.
DBSCAN Algorithm: Example
• Parameter
• ε = 2 cm
• MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster.
        else
            assign o to NOISE
DBSCAN Algorithm: Advantages
• DBSCAN does not require one to specify the number of clusters in the data
a priori, as opposed to k-means.
• DBSCAN can find arbitrarily shaped clusters. It can even find a cluster
completely surrounded by (but not connected to) a different cluster. Due to the
MinPts parameter, the so-called single-link effect (different clusters being
connected by a thin line of points) is reduced.
• DBSCAN has a notion of noise, and is robust to outliers.
• DBSCAN requires just two parameters and is mostly insensitive to the
ordering of the points in the database. (However, points sitting on the edge of
two different clusters might swap cluster membership if the ordering of the
points is changed, and the cluster assignment is unique only up to isomorphism.)
• The parameters minPts and ε can be set by a domain expert, if the data is well
understood.
DBSCAN Algorithm: Disadvantages
• DBSCAN is not entirely deterministic: border points that are reachable from
more than one cluster can be part of either cluster, depending on the order the
data is processed. Fortunately, this situation does not arise often, and has little
impact on the clustering result: both on core points and noise points,
DBSCAN is deterministic.
• The quality of DBSCAN depends on the distance measure used in the function
regionQuery (P, ε). The most common distance metric used is Euclidean
distance. Especially for high-dimensional data, this metric can be rendered
almost useless due to the so-called "Curse of dimensionality", making it
difficult to find an appropriate value for ε. This effect, however, is also present
in any other algorithm based on Euclidean distance.
• DBSCAN cannot cluster data sets well with large differences in densities,
since the minPts-ε combination cannot then be chosen appropriately for all clusters.
• If the data and scale are not well understood, choosing a meaningful distance
threshold ε can be difficult.
Steps of Grid-based Clustering
Basic Grid-based Algorithm
1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and compute the density of
each cell.
3. Eliminate cells, whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells
(usually minimizing a given objective function)
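A minimal sketch of the four steps for 2-D points follows. It assumes square cells of side `cell_size`, treats cells that touch at an edge or a corner as adjacent, and groups dense cells by a simple flood fill; the objective function mentioned in step 4 is not modelled. All names and the sample points are hypothetical.

```python
from collections import Counter

def grid_cluster(points, cell_size, threshold):
    """Return a dict mapping each dense cell (ix, iy) to a cluster id."""
    # Steps 1-2: assign objects to cells and count the objects per cell.
    density = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # Step 3: drop cells whose density is below the threshold.
    dense = {cell for cell, n in density.items() if n >= threshold}
    # Step 4: flood-fill contiguous dense cells (8-neighbourhood) into clusters.
    cluster_of, next_id = {}, 0
    for start in sorted(dense):
        if start in cluster_of:
            continue
        next_id += 1
        stack = [start]
        while stack:
            cx, cy = stack.pop()
            if (cx, cy) in cluster_of:
                continue
            cluster_of[(cx, cy)] = next_id
            stack.extend((cx + dx, cy + dy)
                         for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                         if (cx + dx, cy + dy) in dense)
    return cluster_of

pts = [(0.1, 0.2), (0.5, 0.5), (1.2, 0.3), (1.7, 0.8),
       (5.5, 5.5), (5.1, 5.9), (3.5, 3.5)]
print(grid_cluster(pts, cell_size=1.0, threshold=2))
```

Note that after the counting pass the work depends only on the number of populated cells, not on the number of objects.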
Advantages of Grid-based Clustering Algorithms
• fast:
– No distance computations
– Clustering is performed on summaries and not individual objects;
complexity is usually O(#-populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
• Shapes are limited to union of grid-cells
Grid-Based Clustering Methods
• Grid-based methods quantize the object space into a finite number of cells
that form a grid structure (uses a multi-resolution grid data structure).
• All the clustering operations are performed on the grid structure.
• Clustering complexity depends on the number of populated grid cells and
not on the number of objects in the dataset
• Several interesting methods (in addition to the basic grid-based algorithm)
– STING (a STatistical INformation Grid approach) by Wang, Yang and
Muntz (1997)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
STING: A Statistical Information Grid
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of resolution
STING: A Statistical Information Grid
Approach (2)
– Each cell at a high level is partitioned into a number of smaller cells in the
next lower level
– Statistical info of each cell is calculated and stored beforehand and is used
to answer queries
– Parameters of higher level cells can be easily calculated from parameters
of lower level cell
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
– Use a top-down approach to answer spatial data queries
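How the parameters of a higher-level cell follow from its lower-level cells can be sketched as below. The `CellStats` class and `merge_cells` function are hypothetical names; the sketch assumes every child cell is non-empty and that s is the population standard deviation, so the parent's variance can be recovered from the count-weighted second moments.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int
    mean: float
    s: float      # standard deviation of the values in the cell
    min: float
    max: float

def merge_cells(children):
    """Combine the statistics of child cells into those of their parent cell."""
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # E[x^2] of the parent is the count-weighted average of each child's s^2 + mean^2.
    second_moment = sum(c.count * (c.s ** 2 + c.mean ** 2) for c in children) / n
    s = math.sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, s,
                     min(c.min for c in children), max(c.max for c in children))

# Two child cells holding the values [1, 3] and [5, 7, 9] respectively.
left = CellStats(count=2, mean=2.0, s=1.0, min=1.0, max=3.0)
right = CellStats(count=3, mean=7.0, s=math.sqrt(8 / 3), min=5.0, max=9.0)
parent = merge_cells([left, right])
print(parent)
```

No raw data is touched when building the parent, which is what lets the statistics be computed once, bottom-up, and then reused for every query.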
STING: Query Processing(3)
Use a top-down approach to answer spatial data queries
1. Start from a pre-selected layer—typically with a small number of cells
2. From the pre-selected layer until you reach the bottom layer do the following:
• For each cell in the current level compute the confidence interval
indicating a cell’s relevance to a given query;
– If it is relevant, include the cell in a cluster
– If it is irrelevant, remove the cell from further consideration
– Otherwise, look for relevant cells at the next lower layer
3. Combine relevant cells into relevant regions (based on grid-neighborhood)
and return the so obtained clusters as your answers.
STING: A Statistical Information Grid
Approach (3)
– Advantages:
• Query-independent, easy to parallelize, incremental update
• O(K), where K is the number of grid cells at the lowest level
– Disadvantages:
• All the cluster boundaries are either horizontal or vertical, and no
diagonal boundary is detected
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service

26-Clustering MTech-2017.ppt

  • 1. Clustering Alok Kumar Jagadev Data Mining: Concepts and Techniques
  • 2. Introduction • The goal of clustering is to – group data points that are close (or similar) to each other – identify such groupings (or clusters) in an unsupervised manner • Unsupervised: no information is provided to the algorithm on which data points belong to which clusters • Example: What should the clusters be for these data points? [Figure: scatter of unlabeled data points]
  • 3. What is Clustering? • Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. • A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. • A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
  • 4. Clustering Algorithms • A clustering algorithm attempts to find natural groups of components (or data) based on some similarity • The clustering algorithm also finds the centroid of each group of data points • To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids • The output from a clustering algorithm is basically a statistical description of the cluster centroids with the number of components in each cluster.
  • 5. • Simple graphical example: • In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering. • Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of those objects. • In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
  • 6. Examples of Clustering Applications • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Land use: Identification of areas of similar land use in an earth observation database • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost • City planning: Identifying groups of houses according to their house type, value, and geographical location • Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
  • 7. Quality: What Is Good Clustering? • A good clustering method will produce high quality clusters with – high intra-class similarity – low inter-class similarity • The quality of a clustering result depends on both the similarity measure used by the method and its implementation • The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
  • 8. Measure the Quality of Clustering • Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j) • There is a separate “quality” function that measures the “goodness” of a cluster. • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables. • Weights should be associated with different variables based on applications and data semantics. • It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.
  • 9. Requirements of Clustering in Data Mining • Scalability • Ability to deal with different types of attributes • Ability to handle dynamic data • Discovery of clusters with arbitrary shape • Minimal requirements for domain knowledge to determine input parameters • Able to deal with noise and outliers • Insensitive to order of input records • High dimensionality • Incorporation of user-specified constraints • Interpretability and usability
  • 10. Major Clustering Approaches • Partitioning: Construct various partitions and then evaluate them by some criterion • Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion • Model-based: Hypothesize a model for each cluster and find best fit of models to data • Density-based: Guided by connectivity and density functions
  • 11. Typical Alternatives to Calculate the Distance between Clusters • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq) • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq) • Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq) • Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j) • Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j) – Medoid: one chosen, centrally located object in the cluster
  • 12. Centroid, Radius and Diameter of a Cluster (for numerical data sets) • Centroid: the “middle” of a cluster: $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$ • Radius: square root of the average squared distance from any point of the cluster to its centroid: $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}$ • Diameter: square root of the average squared distance between all pairs of points in the cluster: $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
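The three quantities on slide 12 can be computed directly from their definitions. The following is a minimal sketch using only the Python standard library; the function names and the four sample points are illustrative assumptions, not part of the slides.

```python
import math

def centroid(points):
    """Mean point of a cluster given as a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))

def diameter(points):
    """Square root of the summed squared distance over all ordered pairs,
    divided by N(N-1)."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

# Hypothetical cluster: the corners of a square with side 2.
cluster = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```

For this square the centroid is its center, and the radius is the half-diagonal.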
  • 13. Partitioning Algorithms • Partitioning method: Construct a partition of a database D of n objects into a set of k clusters • Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion, e.g. the sum of squared distances $\sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$ – Global optimal: exhaustively enumerate all partitions – Heuristic methods: k-means and k-medoids algorithms – k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster – k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster
  • 14. The K-Means Clustering Method • Given k, the k-means algorithm is implemented in four steps: – Partition objects into k nonempty subsets – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) – Assign each object to the cluster with the nearest seed point – Go back to Step 2; stop when no assignment changes
  • 15. The K-Means Clustering Method • Example (K=2) [Figure: sequence of scatter plots, both axes 0 to 10]: arbitrarily choose K objects as initial cluster centers → assign each object to the most similar center → update the cluster means → reassign → update the cluster means → reassign
  • 16. Comments on the K-Means Method • Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. • Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)) • Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms • Weakness – Applicable only when the mean is defined; what about categorical data? – Need to specify k, the number of clusters, in advance – Unable to handle noisy data and outliers – Not suitable to discover clusters with non-convex shapes
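The four steps on slide 14 can be sketched as a short loop. This is an illustrative version, not the slides' own code: it assumes the caller supplies initial centers (rather than an initial partition), uses Euclidean distance via `math.dist`, and keeps a center unchanged if its cluster becomes empty. The sample points are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate between nearest-center assignment and mean updates
    until the assignment stops changing."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster: stop
        assignment = new_assignment
        # Recompute each center as the mean point of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, [points[0], points[3]])
print(centers, labels)
```

Different initial centers can lead to a different local optimum, which is the sensitivity noted on slide 16.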
  • 17. Variations of the K-Means Method • A few variants of the k-means which differ in – Selection of the initial k means – Dissimilarity calculations – Strategies to calculate cluster means • Handling categorical data: k-modes (Huang’98) – Replacing means of clusters with modes – Using new dissimilarity measures to deal with categorical objects – Using a frequency-based method to update modes of clusters – A mixture of categorical and numerical data: k-prototype method
  • 18. What Is the Problem of the K-Means Method? • The k-means algorithm is sensitive to outliers! – An object with an extremely large value may substantially distort the distribution of the data. • K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. [Figure: two scatter plots, both axes 0 to 10]
  • 19. Example • Height and weight information are given. Using these two variables, we need to group the objects based on height and weight information.
  • 20. Data Sample (Height, Weight): (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77), (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)
  • 21. Step 1: Input • Dataset, clustering variables and maximum number of clusters (K in K-Means clustering) • In this dataset, only two variables – height and weight – are considered for clustering. Data (Height, Weight): (185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77), (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)
  • 22. Step 2: Initialize cluster centroids. In this example, the value of K is taken as 2. Cluster centroids are initialized with the first 2 observations. Initial centroids (Height, Weight): K1 = (185, 72), K2 = (170, 56)
  • 23. Step 3: Calculate Euclidean Distance. Euclidean distance is one of the distance measures used in the K-Means algorithm. The Euclidean distance between an observation and initial cluster centroids 1 and 2 is calculated. Each observation is then assigned to the cluster whose centroid is at the minimum Euclidean distance.
  • 24. First two observations (Height, Weight): (185, 72), (170, 56). The initial cluster centroids are K1 = (185, 72) and K2 = (170, 56). The Euclidean distance from each cluster centroid is calculated. Observation (185, 72): distance from Cluster 1 = √((185-185)² + (72-72)²) = 0; distance from Cluster 2 = √((185-170)² + (72-56)²) = 21.93; assignment: 1. Observation (170, 56): distance from Cluster 1 = √((170-185)² + (56-72)²) = 21.93; distance from Cluster 2 = √((170-170)² + (56-56)²) = 0; assignment: 2. These two observations are assigned first only because their assignment is already known; the centroids do not change, since these observations are the initial centroids.
  • 25. Step 4: Move on to the next observation and calculate Euclidean Distance. Observation (Height, Weight): (168, 60). Distance from Cluster 1 = √((168-185)² + (60-72)²) = 20.809; distance from Cluster 2 = √((168-170)² + (60-56)²) = 4.472; assignment: 2. Since the distance from cluster 2 is the minimum, the observation is assigned to cluster 2. Now revise the cluster centroid, taking the mean Height and Weight as the cluster centroid. The addition is only to cluster 2, so only the centroid of cluster 2 is updated. Updated centroids (Height, Weight): K=1: (185, 72); K=2: ((170+168)/2, (56+60)/2) = (169, 58)
  • 26. Step 5: Calculate the Euclidean distance for the next observation, assign it based on minimum Euclidean distance and update the cluster centroids. Next observation (Height, Weight): (179, 68). Euclidean distance from Cluster 1 = 7.211103; from Cluster 2 = 14.14214; assignment: 1. Updated centroids (Height, Weight): K=1: ((185+179)/2, (72+68)/2) = (182, 70); K=2: (169, 58). Continue the steps until all observations are assigned.
  • 27. Cluster Centroids Updated Centroid Cluster Height Weight K=1 182.8 72 K=2 169 58
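The step-by-step procedure of slides 22 to 26 can be replayed in a few lines. This is a sketch under the assumptions stated on the slides: K = 2, the first two observations seed the clusters, each later observation joins the cluster with the nearest current centroid, and a centroid is the mean of its members. Variable names are illustrative.

```python
import math

# (Height, Weight) observations from the data sample on slide 20.
data = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77),
        (180, 71), (180, 70), (183, 84), (180, 88), (180, 67), (177, 76)]

# The first two observations seed clusters K1 and K2.
clusters = [[data[0]], [data[1]]]

def mean_point(members):
    """Centroid of a cluster: the per-variable mean of its members."""
    return tuple(sum(v) / len(members) for v in zip(*members))

for obs in data[2:]:
    centroids = [mean_point(c) for c in clusters]
    nearest = min((0, 1), key=lambda j: math.dist(obs, centroids[j]))
    clusters[nearest].append(obs)  # the centroid is recomputed on the next pass

final_centroids = [mean_point(c) for c in clusters]
print(final_centroids)
```

Under this replay, K1 has centroid (182.8, 72) once the first seven observations have been processed and (181.4, 74.5) after all twelve, while K2 stays at (169, 58).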
  • 28. This is what was expected initially based on two-dimensional plot.
  • 29. A few important considerations in K Means • The scale of measurement influences Euclidean distance, so variable standardisation becomes necessary • Depending on expectations, you may require outlier treatment • K Means clustering may be biased by the initial centroids, called cluster seeds • The maximum number of clusters is typically an input and may also affect the clusters that are created
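One common way to standardise variables before computing Euclidean distances is the z-score. The sketch below assumes z-score scaling with the population standard deviation; the slides do not say which standardisation is intended, and `standardize` is an illustrative name.

```python
import statistics

def standardize(column):
    """Rescale a variable to zero mean and unit (population) standard deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

# Height and weight columns from the data sample on slide 20.
heights = [185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177]
weights = [72, 56, 60, 68, 72, 77, 71, 70, 84, 88, 67, 76]

z_heights = standardize(heights)
z_weights = standardize(weights)
print(z_heights[:3], z_weights[:3])
```

After scaling, both variables contribute on a comparable scale to the distance, so neither dominates merely because of its units.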
  • 30. The K-Medoids Clustering Method • Find representative objects, called medoids, in clusters • PAM (Partitioning Around Medoids, 1987) – starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering – PAM works effectively for small data sets, but does not scale well for large data sets • CLARA (Kaufmann & Rousseeuw, 1990) • CLARANS (Ng & Han, 1994): Randomized sampling • Focusing + spatial data structure (Ester et al., 1995)
  • 31. A Typical K-Medoids Algorithm (PAM) [Figure: sequence of scatter plots, both axes 0 to 10] K=2: arbitrarily choose k objects as initial medoids → assign each remaining object to the nearest medoid (Total Cost = 20) → randomly select a non-medoid object, O_random → compute the total cost of swapping (Total Cost = 26) → swap O and O_random if quality is improved → loop until no change
  • 32. Hierarchical Clustering • Clusters are created in levels, producing a set of clusters at each level. • Agglomerative – Initially each item in its own cluster – Iteratively clusters are merged together – Bottom Up • Divisive – Initially all items in one cluster – Large clusters are successively divided – Top Down
  • 33. Hierarchical Clustering • Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition • Illustrative Example: • Agglomerative and divisive clustering on the data set {a, b, c, d, e} • Cluster distance • Termination condition [Figure: agglomerative clustering runs from Step 0 to Step 4, merging a, b, c, d, e into {a, b}, {d, e}, {c, d, e} and finally {a, b, c, d, e}; divisive clustering runs through the same groupings in reverse]
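The swap loop of PAM can be sketched as follows. This is a simplified, assumption-laden variant: instead of picking a random non-medoid, it tries every (medoid, non-medoid) swap and accepts the first one that lowers the total cost, repeating until no swap helps. The function names and the one-dimensional sample data are hypothetical.

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from each object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, medoids):
    """Greedy swap search: replace a medoid by a non-medoid whenever
    the swap lowers the total cost; stop when no swap improves it."""
    medoids = list(medoids)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i, candidate in product(range(len(medoids)), points):
            if candidate in medoids:
                continue
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            cost = total_cost(points, trial)
            if cost < best:
                medoids, best, improved = trial, cost, True
    return medoids, best

# Hypothetical one-feature data with two obvious groups.
points = [(1,), (2,), (3,), (10,), (11,), (12,)]
medoids, cost = pam(points, [(1,), (2,)])
print(sorted(medoids), cost)
```

Every pass evaluates on the order of k(n-k) swaps, each costing a scan over the objects, which is why PAM does not scale well to large data sets.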
  • 34. Hierarchical Agglomerative Clustering (HAC) • Starts with each doc in a separate cluster – then repeatedly joins the closest pair of clusters, until there is only one cluster. • The history of merging forms a binary tree or hierarchy. How to measure distance of clusters??
  • 35. Closest pair of clusters Many variants to defining closest pair of clusters • Single-link – Distance of the “closest” points (single-link) • Complete-link – Distance of the “farthest” points • Centroid – Distance of the centroids (centers of gravity) • (Average-link) – Average distance between pairs of elements
  • 36. Cluster Distance Measures • Single link (min): smallest distance between an element in one cluster and an element in the other, i.e., d(C_i, C_j) = min{d(x_ip, x_jq)} • Complete link (max): largest distance between an element in one cluster and an element in the other, i.e., d(C_i, C_j) = max{d(x_ip, x_jq)} • Average: average distance between elements in one cluster and elements in the other, i.e., d(C_i, C_j) = avg{d(x_ip, x_jq)} • d(C, C) = 0
  • 37. Dendrogram • Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. • Each level shows clusters for that level. – Leaf – individual clusters – Root – one cluster • A cluster at level i is the union of its children clusters at level i+1.
  • 38. Cluster Distance Measures Example: Given a data set of five objects characterized by a single continuous feature, assume that there are two clusters: C1: {a, b} and C2: {c, d, e}. 1. Calculate the distance matrix. 2. Calculate three cluster distances between C1 and C2.
      Feature: a = 1, b = 2, c = 4, d = 5, e = 6
      Distance matrix:
           a  b  c  d  e
        a  0  1  3  4  5
        b  1  0  2  3  4
        c  3  2  0  1  2
        d  4  3  1  0  1
        e  5  4  2  1  0
      Single link: $dist(C_1, C_2) = \min\{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)\} = \min\{3, 4, 5, 2, 3, 4\} = 2$
      Complete link: $dist(C_1, C_2) = \max\{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)\} = \max\{3, 4, 5, 2, 3, 4\} = 5$
      Average: $dist(C_1, C_2) = \frac{d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)}{6} = \frac{3 + 4 + 5 + 2 + 3 + 4}{6} = \frac{21}{6} = 3.5$
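The three cluster distances in this example can be checked with a few lines of Python. The sketch below uses the feature values and clusters from the slide; the variable names are illustrative.

```python
# Feature values of the five objects and the two clusters from the example.
feature = {"a": 1, "b": 2, "c": 4, "d": 5, "e": 6}
C1, C2 = ["a", "b"], ["c", "d", "e"]

# All pairwise distances between an element of C1 and an element of C2.
pair_d = [abs(feature[p] - feature[q]) for p in C1 for q in C2]

single_link = min(pair_d)                 # closest pair
complete_link = max(pair_d)               # farthest pair
average_link = sum(pair_d) / len(pair_d)  # mean over all six pairs

print(pair_d, single_link, complete_link, average_link)
```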
  • 39. Agglomerative Algorithm • The Agglomerative algorithm is carried out in three steps: 1) Convert all object features into a distance matrix 2) Set each object as a cluster (thus if we have N objects, we will have N clusters at the beginning) 3) Repeat until the number of clusters is one (or a known # of clusters) • Merge the two closest clusters • Update the “distance matrix”
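The three steps above can be sketched for one-feature data as follows. This is an illustrative simplification: instead of maintaining and updating a distance matrix, it recomputes cluster distances from the raw values on every pass. The `linkage` argument (`min` for single link, `max` for complete link) and the sample values are assumptions made for the example.

```python
def agglomerative(values, linkage=min, target=1):
    """Merge the two closest clusters until `target` clusters remain.
    Returns the merge history as (cluster_x, cluster_y, distance) tuples,
    where clusters are lists of indices into `values`."""
    clusters = [[i] for i in range(len(values))]  # every object starts alone
    merges = []

    def dist(a, b):
        # Cluster distance under the chosen linkage.
        return linkage(abs(values[i] - values[j]) for i in a for j in b)

    while len(clusters) > target:
        pairs = [(dist(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        d, x, y = min(pairs)  # closest pair of clusters
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges

# Hypothetical one-feature data set.
values = [0.0, 1.0, 5.0, 7.0, 20.0]
merges = agglomerative(values, linkage=min)
for left, right, d in merges:
    print(left, right, d)
```

The recorded merge distances are the heights at which branches join in the corresponding dendrogram.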
  • 40. • Problem: clustering analysis with agglomerative algorithm Example data matrix distance matrix Euclidean distance
  • 41. • Merge two closest clusters (iteration 1) Example
  • 42. • Update distance matrix (iteration 1) Example
  • 43. • Merge two closest clusters (iteration 2) Example
  • 44. • Update distance matrix (iteration 2) Example
  • 45. • Merge two closest clusters/update distance matrix (iteration 3) Example
  • 46. • Merge two closest clusters/update distance matrix (iteration 4) Example
  • 47. • Final result (meeting termination condition) Example
  • 48. • Dendrogram tree representation Example 1. There are 6 clusters: A, B, C, D, E and F 2. Merge clusters D and F into cluster (D, F) at distance 0.50 3. Merge cluster A and cluster B into (A, B) at distance 0.71 4. Merge clusters E and (D, F) into ((D, F), E) at distance 1.00 5. Merge clusters ((D, F), E) and C into (((D, F), E), C) at distance 1.41 6. Merge clusters (((D, F), E), C) and (A, B) into ((((D, F), E), C), (A, B)) at distance 2.50 7. The last cluster contains all the objects, which concludes the computation [Figure: dendrogram, axes labeled “object” and “lifetime”]
  • 49. Exercise Given a data set of five objects characterised by a single continuous feature: Apply the agglomerative algorithm with single-link, complete-link and averaging cluster distance measures to produce three dendrogram trees, respectively.
      Feature: a = 1, b = 2, c = 4, d = 5, e = 6
      Distance matrix:
           a  b  c  d  e
        a  0  1  3  4  5
        b  1  0  2  3  4
        c  3  2  0  1  2
        d  4  3  1  0  1
        e  5  4  2  1  0
  • 51. Density-Based Clustering • Clustering based on density (local cluster criterion), such as density-connected points or based on an explicitly constructed density function • A cluster is a connected dense component which can grow in any direction that density leads. • Density, connectivity and boundary • Arbitrary shaped clusters and good scalability • Each cluster has a considerably higher density of points than the area outside of the cluster
  • 52. Major Features • Major features: – Discover clusters of arbitrary shape – Handle noise – One scan – Need density parameters
  • 53. Two Major Types of Density-Based Clustering Algorithms • Connectivity based: – DBSCAN: Ester, et al. (KDD’96) – OPTICS: Ankerst, et al (SIGMOD’99). – CLIQUE: Agrawal, et al. (SIGMOD’98) • Density function based: - DENCLUE: Hinneburg & D. Keim (KDD’98/2006)
  • 54. Density Based Clustering: Basic Concept • Intuition for the formalization of the basic idea – For any point in a cluster, the local point density around that point has to exceed some threshold – The set of points from one cluster is connected • Local point density at a point p defined by two parameters – ε – radius for the neighborhood of point p: Nε(p) := {q in data set D | dist(p, q) ≤ ε} – MinPts – minimum number of points in the given neighbourhood Nε(p)
  • 55. ε-Neighborhood • ε-Neighborhood – Objects within a radius of ε from an object: $N_\varepsilon(p) := \{q \mid d(p, q) \le \varepsilon\}$ • “High density” – the ε-Neighborhood of an object contains at least MinPts objects. [Figure: ε-neighborhoods of p and q; with MinPts = 4, the density of p is “high” and the density of q is “low”]
  • 56. Core, Border & Outlier Given ε and MinPts, categorize the objects into three exclusive groups. A point is a core point if it has at least a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point.
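The three categories can be computed with a brute-force neighborhood search. The sketch below assumes that a point counts as a member of its own ε-neighborhood and that "core" means at least MinPts neighbors; the function name and the sample points are hypothetical.

```python
import math

def classify(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    # Eps-neighborhood of every point (the point itself is included).
    neighbors = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical points: a small square, one point hanging off it, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify(points, eps=1.0, min_pts=3)
print(labels)
```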
  • 57. Example • M, P, O, and R are core objects since each is in an Eps neighborhood containing at least 3 points Minpts = 3 Eps=radius of the circles
  • 58. Density-Reachability • Directly density-reachable • An object q is directly density-reachable from object p if p is a core object and q is in p’s ε-neighborhood. • q is directly density-reachable from p • p is not directly density-reachable from q? • Density-reachability is asymmetric. [Figure: objects p and q with MinPts = 5, Eps = 1 cm]
  • 59. Density-Reachability • Density-Reachable (directly and indirectly): – A point p is directly density-reachable from p1; – p1 is directly density-reachable from q; – p ← p1 ← q form a chain. • p is (indirectly) density-reachable from q • q is not density-reachable from p? • Density-connected – A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts. [Figures: chain q, p1, p; points p and q both reachable from o]
  • 60. Formal Description of Cluster • Given a data set D, parameter ε and threshold MinPts. • A cluster C is a subset of objects satisfying two criteria: – Connected: ∀ p, q ∈ C: p and q are density-connected. – Maximal: ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C. (avoid redundancy) p is a core object.
  • 61. Review of Concepts Are objects p and q in the same cluster? Are p and q density-connected? Are p and q density-reachable by some object o? Directly density-reachable; indirectly density-reachable through a chain. Is an object o in a cluster or an outlier? Is o a core object? Is o density-reachable by some core object?
  • 62. DBSCAN Algorithm
      Input: the data set D
      Parameters: ε, MinPts
      For each object p in D
        if p is a core object and not processed then
          C = retrieve all objects density-reachable from p
          mark all objects in C as processed
          report C as a cluster
        else if p is not processed then
          mark p as outlier
        end if
      End For
  • 63. DBSCAN: The Algorithm – Arbitrarily select a point p – Retrieve all points density-reachable from p wrt Eps and MinPts. – If p is a core point, a cluster is formed. – If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database. – Continue the process until all of the points have been processed.
  • 64. DBSCAN Algorithm: Example • Parameter • ε = 2 cm • MinPts = 3 for each o ∈ D do if o is not yet classified then if o is a core-object then collect all objects density-reachable from o and assign them to a new cluster. else assign o to NOISE
  • 65. DBSCAN Algorithm: Example • Parameter • ε = 2 cm • MinPts = 3 for each o ∈ D do if o is not yet classified then if o is a core-object then collect all objects density-reachable from o and assign them to a new cluster. else assign o to NOISE
  • 66. DBSCAN Algorithm: Example • Parameter • ε = 2 cm • MinPts = 3 for each o ∈ D do if o is not yet classified then if o is a core-object then collect all objects density-reachable from o and assign them to a new cluster. else assign o to NOISE
  • 67. DBSCAN Algorithm: Advantages • DBSCAN does not require the number of clusters in the data to be specified a priori, as opposed to k-means. • DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced. • DBSCAN has a notion of noise, and is robust to outliers. • DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.) • The parameters MinPts and ε can be set by a domain expert, if the data is well understood.
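The procedure described on slides 62 and 63 can be sketched as follows. This is an illustrative brute-force version with assumptions of its own: `region_query` scans all points (no spatial index), a point belongs to its own neighborhood, noise is labeled -1, and a point first marked as noise is relabeled as a border point if a cluster later reaches it.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def region_query(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already classified
        seeds = region_query(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # not a core object (may become border later)
            continue
        labels[i] = cluster_id
        queue = deque(seeds)
        while queue:  # collect everything density-reachable from point i
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nb = region_query(j)
            if len(nb) >= min_pts:  # j is a core object: expand through it
                queue.extend(nb)
        cluster_id += 1
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```

Which cluster a shared border point joins depends on the processing order, which is the non-determinism mentioned among the disadvantages.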
  • 68. DBSCAN Algorithm: Disadvantages • DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data is processed. Fortunately, this situation does not arise often, and has little impact on the clustering result: both on core points and noise points, DBSCAN is deterministic. • The quality of DBSCAN depends on the distance measure used in the function regionQuery (P, ε). The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "Curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance. • DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters. • If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
  • 69. Steps of Grid-based Clustering Algorithms Basic Grid-based Algorithm 1. Define a set of grid-cells 2. Assign objects to the appropriate grid cell and compute the density of each cell. 3. Eliminate cells, whose density is below a certain threshold t. 4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function)
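The four steps of the basic grid-based algorithm can be sketched for two-dimensional points. The sketch makes several assumptions not stated on the slide: square cells of a fixed size, density measured as the number of points in a cell, cells kept when their count is at least the threshold, and adjacency meaning a shared edge. No objective function is optimized.

```python
from collections import Counter

def grid_clusters(points, cell_size, threshold):
    """Return clusters as sets of dense, edge-adjacent grid cells."""
    # Steps 1-2: map each point to its grid cell and count points per cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # Step 3: drop cells whose density is below the threshold.
    dense = {cell for cell, n in counts.items() if n >= threshold}
    # Step 4: group contiguous dense cells with a flood fill.
    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            cx, cy = stack.pop()
            component.add((cx, cy))
            for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nxt in dense and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters

# Hypothetical points: two adjacent dense cells, one separate dense cell,
# and one sparse cell that is eliminated.
points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
          (5.2, 5.1), (5.3, 5.4), (9.0, 9.0)]
clusters = grid_clusters(points, cell_size=1.0, threshold=2)
print(sorted(sorted(c) for c in clusters))
```

After the counting pass, the work depends only on the number of populated cells, not on the number of objects.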
  • 70. Advantages of Grid-based Clustering Algorithms • fast: – No distance computations – Clustering is performed on summaries and not individual objects; complexity is usually O(#-populated-grid-cells) and not O(#objects) – Easy to determine which clusters are neighboring • Shapes are limited to union of grid-cells
  • 71. Grid-Based Clustering Methods • Grid-based methods quantize the object space into a finite number of cells that form a grid structure (uses a multi-resolution grid data structure). • All the clustering operations are performed on the grid structure. • Clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset • Several interesting methods (in addition to the basic grid-based algorithm) – STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) – CLIQUE: Agrawal, et al. (SIGMOD’98)
  • 72. STING: A Statistical Information Grid Approach • Wang, Yang and Muntz (VLDB’97) • The spatial area is divided into rectangular cells • There are several levels of cells corresponding to different levels of resolution
  • 73. STING: A Statistical Information Grid Approach (2) – Each cell at a high level is partitioned into a number of smaller cells in the next lower level – Statistical info of each cell is calculated and stored beforehand and is used to answer queries – Parameters of higher level cells can be easily calculated from parameters of lower level cell • count, mean, s, min, max • type of distribution—normal, uniform, etc. – Use a top-down approach to answer spatial data queries
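The claim that parameters of a higher-level cell follow from those of its lower-level cells can be illustrated for count, mean, s, min and max. This sketch assumes s is the population standard deviation and that each cell is stored as a plain dictionary; both choices are assumptions made for the example, not STING's actual data layout.

```python
import math

def merge_cells(children):
    """Combine the statistics of child cells into those of their parent cell."""
    n = sum(c["count"] for c in children)
    mean = sum(c["count"] * c["mean"] for c in children) / n
    # E[x^2] of each child is s^2 + mean^2; pool it, then subtract mean^2.
    second_moment = sum(c["count"] * (c["s"] ** 2 + c["mean"] ** 2)
                        for c in children) / n
    return {
        "count": n,
        "mean": mean,
        "s": math.sqrt(max(0.0, second_moment - mean ** 2)),
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

# Hypothetical child cells: one holds the values {1, 1}, the other {3, 3}.
children = [
    {"count": 2, "mean": 1.0, "s": 0.0, "min": 1.0, "max": 1.0},
    {"count": 2, "mean": 3.0, "s": 0.0, "min": 3.0, "max": 3.0},
]
parent = merge_cells(children)
print(parent)
```

Because the parent is computed from summaries alone, the raw objects never need to be revisited when building the upper layers.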
  • 74. STING: Query Processing (3) Use a top-down approach to answer spatial data queries 1. Start from a pre-selected layer, typically with a small number of cells 2. From the pre-selected layer until you reach the bottom layer do the following: • For each cell in the current level compute the confidence interval indicating a cell’s relevance to a given query; – If it is relevant, include the cell in a cluster – If it is irrelevant, remove the cell from further consideration – otherwise, look for relevant cells at the next lower layer 3. Combine relevant cells into relevant regions (based on grid-neighborhood) and return the clusters so obtained as your answers.
  • 75. STING: A Statistical Information Grid Approach (3) – Advantages: • Query-independent, easy to parallelize, incremental update • O(K), where K is the number of grid cells at the lowest level – Disadvantages: • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected