Clustering
Ms. Rashmi Bhat
What is Clustering??
β–ͺ Grouping of objects
How will you group these together??
What is Clustering??
Option 1: By Type Option 2: By Color
What is Clustering??
Option 3: By Shape
What is Cluster Analysis??
β–ͺ A cluster is a collection of data objects that are similar to one another
within the same cluster and dissimilar to the objects in other clusters.
β–ͺ Cluster analysis has focused mainly on distance-based methods.
The process of grouping a set of physical or abstract objects into classes of
similar objects is called Clustering.
What is Cluster Analysis??
β–ͺ How clustering differs from classification???
What is Cluster Analysis??
β–ͺ Clustering is also called data segmentation.
β–ͺ Clustering finds the borders between groups; segmentation uses those borders to form groups.
β–ͺ Clustering is thus a method of creating segments.
β–ͺ Clustering can also be used for outlier detection.
What is Cluster Analysis??
β–ͺ Classification: Supervised Learning
β–ͺ Classes are predetermined
β–ͺ Based on training data set
β–ͺ Used to classify future observations
β–ͺ Clustering: Unsupervised Learning
β–ͺ Classes are not known in advance
β–ͺ No prior knowledge
β–ͺ Used to explore (understand) the data
β–ͺ Clustering is a form of learning by observation, rather than learning by
examples.
Applications of Clustering
β–ͺ Marketing:
β–ͺ Segmentation of the customer based on behavior
β–ͺ Banking:
β–ͺ ATM Fraud detection (outlier detection)
β–ͺ Gene analysis:
β–ͺ Identifying genes responsible for a disease
β–ͺ Image processing:
β–ͺ Identifying objects in an image (face detection)
β–ͺ Houses:
β–ͺ Identifying groups of houses according to their house type, value, and geographical location
Requirements of Clustering Analysis
β–ͺ The following are typical requirements of clustering in data mining:
β–ͺ Scalability
β–ͺ Dealing with different types of attributes
β–ͺ Discovering clusters with arbitrary shapes
β–ͺ Ability to deal with noisy data
β–ͺ Minimal requirements for domain knowledge to determine input parameters
β–ͺ Incremental clustering
β–ͺ High dimensionality
β–ͺ Constraint-based clustering
β–ͺ Interpretability and usability
Distance Measures
β–ͺ Cluster analysis has focused mainly on distance-based methods.
β–ͺ Distance is a quantitative measure of how far apart two objects are.
β–ͺ A similarity measure quantifies how much alike two data objects are.
β–ͺ If the distance is small, the objects have a high degree of similarity; a large distance
corresponds to a low degree of similarity.
β–ͺ Generally, similarity is measured in the range [0, 1]:
β–ͺ Similarity = 1 if X = Y (where X and Y are two objects)
β–ͺ Similarity = 0 if X ≠ Y
Distance Measures
β–ͺ Euclidean Distance
β–ͺ Manhattan Distance
β–ͺ Minkowski Distance
β–ͺ Cosine Similarity
β–ͺ Jaccard Similarity
Distance Measures: Euclidean Distance
$D(X, Y) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
β€’ The Euclidean distance between two points is the length of the straight-line path connecting them.
β€’ The Pythagorean theorem gives this distance between two points.
Distance Measures: Manhattan Distance
$D(A, B) = |x_2 - x_1| + |y_2 - y_1|$
β€’ Manhattan distance is a metric in which the distance between two points is the sum of the
absolute differences of their Cartesian coordinates.
β€’ It is the total of the absolute differences between the x-coordinates and the y-coordinates.
Distance Measures: Minkowski Distance
$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$
β€’ It is the generalized form of the Euclidean and Manhattan distance measures
($p = 2$ gives the Euclidean distance, $p = 1$ gives the Manhattan distance).
Distance Measures: Cosine Similarity
β€’ The cosine similarity metric is the normalized dot product of the two attribute vectors:
$\cos(X, Y) = \frac{X \cdot Y}{\|X\|\,\|Y\|}$
β€’ By computing the cosine similarity, we effectively find the cosine of the angle between
the two objects.
β€’ The cosine of 0° is 1, and it is less than 1 for any other angle.
Distance Measures: Jaccard Similarity
β€’ For the Jaccard similarity, the objects being compared are sets.
β€’ Example: $|A \cup B| = 7$ and $|A \cap B| = 2$, so
$Jaccard\ similarity\ J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{7} = 0.286$
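The following Python sketch (an addition, not part of the original slides) illustrates the five measures on small made-up inputs; the Jaccard example reuses the set sizes shown above.

```python
import math

def euclidean(p, q):
    # Length of the straight-line path between two points (Pythagorean theorem).
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute differences of the coordinates.
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def minkowski(p, q, r=3):
    # Generalized form: r = 1 gives Manhattan distance, r = 2 gives Euclidean distance.
    return sum(abs(pi - qi) ** r for pi, qi in zip(p, q)) ** (1 / r)

def cosine_similarity(p, q):
    # Normalized dot product = cosine of the angle between the two vectors.
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (math.sqrt(sum(pi ** 2 for pi in p)) * math.sqrt(sum(qi ** 2 for qi in q)))

def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| for two sets.
    return len(a & b) / len(a | b)

print(euclidean((2, 5), (4, 4)))        # ≈ 2.236
print(manhattan((2, 5), (4, 4)))        # 3
print(minkowski((2, 5), (4, 4)))        # ≈ 2.080
print(cosine_similarity((2, 5), (4, 4)))                     # ≈ 0.919
print(jaccard_similarity({1, 2, 3, 4}, {3, 4, 5, 6, 7}))     # 2/7 ≈ 0.286
```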
Clustering Techniques
β–ͺ Clustering techniques are categorized into the following categories:
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
Partitioning Method
β–ͺ Construct a partition of a database 𝑫 of 𝒏 objects into π’Œ clusters
β–ͺ each cluster contains at least one object
β–ͺ each object belongs to exactly one cluster
β–ͺ Given a π’Œ, find a partition of π’Œ clusters that optimizes the chosen
partitioning criterion (min distance from cluster centers)
β–ͺ Global optimal: exhaustively enumerate all partitions Stirling(n,k)
(S(10,3) = 9.330, S(20,3) = 580.606.446,…)
β–ͺ Heuristic methods: k-means and k-medoids algorithms
β–ͺ k-means: Each cluster is represented by the center of the cluster.
β–ͺ k-medoids or PAM (Partition around medoids): Each cluster is represented by one of
the objects in the cluster.
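As a quick check of how fast the number of possible partitions grows, here is a small Python sketch (an addition, not from the slides) that computes Stirling numbers of the second kind with the standard recurrence.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    # Number of ways to partition n objects into k non-empty clusters:
    # S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1)
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(10, 3))   # 9330
print(stirling2(20, 3))   # 580606446
```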
π‘˜-means Clustering
Input:
π’Œ: the number of clusters; 𝑫: a data set containing 𝒏 objects.
Output:
Set of π’Œ clusters minimizing squared error function
Algorithm:
1. Arbitrarily choose π’Œ objects from 𝑫 as the initial cluster centers;
2. Repeat
1. (Re)assign each object to the cluster to which the object is the most similar, based on
the mean value of the objects in the cluster;
2. Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
3. Until no change;
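A minimal Python sketch of this algorithm, assuming Euclidean distance and random initial centers; the helper names are illustrative and not from the slides.

```python
import math
import random

def kmeans(points, k, max_iter=100):
    # 1. Arbitrarily choose k objects as the initial cluster centers.
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # 2.1 (Re)assign each object to the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # 2.2 Update the cluster means (keep the old center if a cluster is empty).
        new_centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # 3. Stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# The 10 points of the worked example below, clustered into k = 3 groups.
points = [(2, 5), (2, 1), (7, 1), (3, 5), (4, 4),
          (6, 2), (1, 2), (6, 1), (3, 4), (2, 3)]
centers, clusters = kmeans(points, k=3)
print(centers)
print(clusters)
```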
π‘˜-means Clustering
Example: Cluster the following data into 3 clusters using k-means clustering and Euclidean
distance
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
1. Arbitrarily choose 3 points as the initial cluster centers
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
π‘˜-means Clustering
2. Assign each point to its closest cluster center: calculate the distance of the point from each
cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Euclidean distance: $D = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
$D(P_1, C_1) = \sqrt{(2-2)^2 + (1-5)^2} = \sqrt{16} = 4$
$D(P_1, C_2) = \sqrt{(4-2)^2 + (4-5)^2} = \sqrt{5} = 2.236$
$D(P_1, C_3) = \sqrt{(2-2)^2 + (3-5)^2} = \sqrt{4} = 2$
P1 = (2,5) is closest to C3, so:
Cluster1 = { }, Cluster2 = { }, Cluster3 = {(2,5)}
$D(P_2, C_1) = \sqrt{(2-2)^2 + (1-1)^2} = 0$
$D(P_2, C_2) = \sqrt{(4-2)^2 + (4-1)^2} = \sqrt{13} = 3.605$
$D(P_2, C_3) = \sqrt{(2-2)^2 + (3-1)^2} = \sqrt{4} = 2$
P2 = (2,1) is closest to C1, so:
Cluster1 = {(2,1)}, Cluster2 = { }, Cluster3 = {(2,5)}
Similarly, assign the other points to the appropriate clusters.
π‘˜-means Clustering
2. Assign each point to its closest cluster center: calculate the distance of the point from each
cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Assigning the remaining points one by one:
Cluster1 = {(2,1)}, Cluster2 = {(7,1)}, Cluster3 = {(2,5)}
Cluster1 = {(2,1)}, Cluster2 = {(4,4), (7,1)}, Cluster3 = {(2,5)}
Cluster1 = {(2,1)}, Cluster2 = {(4,4), (7,1), (3,5)}, Cluster3 = {(2,5)}
…
Final assignment for this iteration:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(4,4), (7,1), (3,5), (6,2), (6,1), (3,4)}
Cluster3 = {(2,3), (2,5)}
π‘˜-means Clustering
2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster
centers. And choose closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
3. Update the cluster means
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Old Cluster Centers:
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(4,4), (7,1), (3,5), (6,2), (6,1), (3,4)}
Cluster3 = {(2,3), (2,5)}
Calculate the mean of the points in each cluster:
$mean_1 = \left(\frac{2+1}{2}, \frac{1+2}{2}\right) = (1.5, 1.5)$
$mean_2 = \left(\frac{4+7+3+6+6+3}{6}, \frac{4+1+5+2+1+4}{6}\right) = (4.83, 2.83)$
$mean_3 = \left(\frac{2+2}{2}, \frac{3+5}{2}\right) = (2, 4)$
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
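A small Python check of this centroid update, assuming the cluster memberships listed above.

```python
def centroid(cluster):
    # Mean of the x-coordinates and mean of the y-coordinates.
    xs, ys = zip(*cluster)
    return (round(sum(xs) / len(xs), 2), round(sum(ys) / len(ys), 2))

print(centroid([(2, 1), (1, 2)]))                                   # (1.5, 1.5)
print(centroid([(4, 4), (7, 1), (3, 5), (6, 2), (6, 1), (3, 4)]))   # (4.83, 2.83)
print(centroid([(2, 3), (2, 5)]))                                   # (2.0, 4.0)
```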
π‘˜-means Clustering
2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster
centers. And choose closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster
centers. And choose closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center: calculate the distance of the point
from each updated cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Updated Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
$D(P_1, C_1) = \sqrt{(1.5-2)^2 + (1.5-5)^2} = 3.535$
$D(P_1, C_2) = \sqrt{(4.83-2)^2 + (2.83-5)^2} = 3.566$
$D(P_1, C_3) = \sqrt{(2-2)^2 + (4-5)^2} = 1$
P1 = (2,5) stays closest to C3, so:
Cluster1 = { }, Cluster2 = { }, Cluster3 = {(2,5)}
$D(P_2, C_1) = \sqrt{(1.5-2)^2 + (1.5-1)^2} = 0.707$
$D(P_2, C_2) = \sqrt{(4.83-2)^2 + (2.83-1)^2} = 3.370$
$D(P_2, C_3) = \sqrt{(2-2)^2 + (4-1)^2} = 3$
P2 = (2,1) stays closest to C1, so:
Cluster1 = {(2,1)}, Cluster2 = { }, Cluster3 = {(2,5)}
Similarly, assign the other points to the appropriate clusters.
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center: calculate the distance of the point
from each updated cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Updated Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
Updated Clusters (after reassigning all points):
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate distance of each point from each
cluster centers. And choose closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center: calculate the distance of the point
from each updated cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Old Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
3. Update Cluster centers by repeating the process until there is no
change in clusters
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (5.75, 2)
C3 = (2.5, 4.25)
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate distance of each point from each
cluster centers. And choose closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center: calculate the distance of the point
from each updated cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Old Cluster Centers:
C1 = (1.5, 1.5)
C2 = (5.75, 2)
C3 = (2.5, 4.25)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (4,4), (3,4), (2,3)}
3. Update Cluster centers by repeating the process until there is no
change in clusters
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (6.33, 1.33)
C3 = (2.8, 4.2)
π‘˜-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate distance of each point from each
cluster centers. And choose closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
π‘˜-means Clustering
Apply k-means algorithm for the following data set with two
clusters.
D={15, 16, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65}
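A possible way to run this exercise in Python, sketched with a simple one-dimensional k-means; the seeds chosen below (15 and 65) are an arbitrary assumption, and the resulting clusters depend on that choice.

```python
def kmeans_1d(values, centers, max_iter=100):
    # Plain 1-D k-means: assign each value to the nearest center, then recompute the centers.
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        new_centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

D = [15, 16, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
centers, clusters = kmeans_1d(D, centers=[15, 65])   # arbitrary seeds; results depend on them
print(centers)
print(clusters)
```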
π‘˜-means Clustering
β–ͺ Advantages:
β–ͺ Relatively scalable and efficient in processing large data sets
β–ͺ The computational complexity of the algorithm is $O(nkt)$,
β–ͺ where 𝑛 is the total number of objects, π‘˜ is the number of clusters, and 𝑑 is the number of iterations
β–ͺ The method often terminates at a local optimum (not necessarily the global optimum).
β–ͺ Disadvantages:
β–ͺ Can be applied only when the mean of a cluster is defined
β–ͺ Users must specify π‘˜, the number of clusters, in advance
β–ͺ Sensitive to noise and outlier data points
π‘˜-means Clustering
β–ͺ How to cluster categorical data?
β–ͺ Variant of π‘˜-means is used for clustering categorical data: π‘˜-modes Method
β–ͺ Replace mean of cluster with mode of data
β–ͺ A new dissimilarity measures to deal with categorical objects
β–ͺ A frequency-based method to update modes of clusters.
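A minimal, illustrative Python sketch of the two key ingredients (a simple matching dissimilarity and per-attribute modes); the categorical objects are invented for illustration, and this is not the full k-modes algorithm.

```python
from collections import Counter

def matching_dissimilarity(x, y):
    # Number of attributes on which two categorical objects differ.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def cluster_mode(cluster):
    # Mode of the cluster: the most frequent category of each attribute, column by column.
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

cluster = [("red", "small", "metal"),
           ("red", "large", "metal"),
           ("blue", "small", "metal")]
print(cluster_mode(cluster))                          # ('red', 'small', 'metal')
print(matching_dissimilarity(("red", "small", "wood"),
                             cluster_mode(cluster)))  # 1
```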
π‘˜-Medoids Clustering
β–ͺ Picks actual objects to represent the clusters, using one representative object per
cluster
β–ͺ Each remaining object is clustered with the representative object to which it is the
most similar.
β–ͺ Partitioning method is then performed based on the principle of minimizing the
sum of the dissimilarities between each object and its corresponding reference
point
β–ͺ Absolute Error criterion is used
$E = \sum_{j=1}^{k} \sum_{p \in C_j} dist(p, o_j)$   (sum of absolute error)
Where
β€’ $p$ is the point in space representing a given object in cluster $C_j$
β€’ $o_j$ is the representative object (medoid) of cluster $C_j$
π‘˜-Medoids Clustering
β–ͺ The iterative process of replacing representative objects by nonrepresentative objects
continues as long as the quality of the resulting clustering is improved.
β–ͺ Quality is measured by a cost function that measures the average dissimilarity between an
object and the representative object of its cluster.
β–ͺ Four cases are examined for each nonrepresentative object 𝒑 when a current medoid 𝑢𝒋 is
considered for replacement by a random nonrepresentative object π‘Άπ’“π’‚π’π’…π’π’Ž:
β–ͺ Case 1: 𝒑 currently belongs to 𝑢𝒋; after the swap, 𝒑 is closest to another medoid π‘Άπ’Š, so 𝒑 is reassigned to π‘Άπ’Š.
β–ͺ Case 2: 𝒑 currently belongs to 𝑢𝒋; after the swap, 𝒑 is closest to π‘Άπ’“π’‚π’π’…π’π’Ž, so 𝒑 is reassigned to π‘Άπ’“π’‚π’π’…π’π’Ž.
β–ͺ Case 3: 𝒑 currently belongs to some other medoid π‘Άπ’Š; after the swap, 𝒑 is still closest to π‘Άπ’Š, so its assignment does not change.
β–ͺ Case 4: 𝒑 currently belongs to some other medoid π‘Άπ’Š; after the swap, 𝒑 is closest to π‘Άπ’“π’‚π’π’…π’π’Ž, so 𝒑 is reassigned to π‘Άπ’“π’‚π’π’…π’π’Ž.
(Figure: the four cases, shown before and after swapping 𝑢𝒋 with π‘Άπ’“π’‚π’π’…π’π’Ž.)
π‘˜-Medoids Clustering
β–ͺ Each time a reassignment occurs, a difference in absolute error, 𝐸, is
contributed to the cost function.
β–ͺ Therefore, the cost function calculates the difference in absolute-error value if
a current representative object is replaced by a nonrepresentative object.
β–ͺ The total cost of swapping is the sum of costs incurred by all nonrepresentative
objects.
β–ͺ If the total cost is negative, then 𝑂𝑗 is replaced or swapped with π‘‚π‘Ÿπ‘Žπ‘›π‘‘π‘œπ‘š
β–ͺ If the total cost is positive, the current representative object, 𝑂𝑗, is considered acceptable, and
nothing is changed.
β–ͺ PAM(Partitioning Around Medoids) was one of the first k-medoids algorithms
π‘˜-Medoids Clustering
Input: π‘˜ number of clusters, 𝑛 data objects from data set 𝐷
Output: a set of π‘˜ clusters
Algorithm:
1. Arbitrarily select π‘˜ objects as the representative objects or seeds
2. Repeat
1. Assign each remaining objects to the cluster with the nearest representative object
2. Randomly select a nonrepresentative object π‘‚π‘Ÿπ‘Žπ‘›π‘‘π‘œπ‘š
3. Compute the total cost 𝑆 of swapping 𝑂𝑗 with π‘‚π‘Ÿπ‘Žπ‘›π‘‘π‘œπ‘š
4. If 𝑆 < 0, then swap 𝑂𝑗 with π‘‚π‘Ÿπ‘Žπ‘›π‘‘π‘œπ‘š to form the new set of π‘˜ representative objects
3. Until no change
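A rough Python sketch of this PAM loop, assuming Manhattan distance (as in the worked example that follows); the helper names and the fixed budget of random swap attempts are simplifications, not part of the original slides.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_error(objects, medoids):
    # Absolute error E: sum of distances from each object to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in objects)

def pam(objects, k, iterations=200):
    medoids = random.sample(objects, k)              # 1. arbitrary seed medoids
    for _ in range(iterations):
        current = total_error(objects, medoids)
        o_random = random.choice([p for p in objects if p not in medoids])
        j = random.randrange(k)                      # medoid O_j considered for replacement
        candidate = medoids[:j] + [o_random] + medoids[j + 1:]
        cost = total_error(objects, candidate) - current   # total cost S of the swap
        if cost < 0:                                 # swap only if it reduces E
            medoids = candidate
    return medoids

# The 10 objects of the k-medoids example, clustered around k = 2 medoids.
objects = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
           (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(pam(objects, k=2))
```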
π‘˜-Medoids Clustering
X Y
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Data Objects
Aim: Create two Clusters
Step 1:
Choose randomly two medoids
(representative objects)
𝑂3 = (3, 8)
𝑂8 = (7,4)
π‘˜-Medoids Clustering
X Y Cluster
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Data Objects
Aim: Create two Clusters
Step 2:
Assign each object to the closest
representative object
Using Manhattan distance (the measure used throughout this example),
we form the following clusters
π‘˜-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 2:
Assign each object to the closest
representative object
Using Manhattan distance (the measure used throughout this example),
we form the following clusters
C1={O1, O2, O3, O4}
C2={O5, O6, O7, O8, O9, O10}
π‘˜-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 3:
Compute the absolute error (for the set of representative objects 𝑂3 and 𝑂8)
$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - O_j|$
$E = |O_1 - O_3| + |O_2 - O_3| + |O_3 - O_3| + |O_4 - O_3| + |O_5 - O_8| + |O_6 - O_8| + |O_7 - O_8| + |O_8 - O_8| + |O_9 - O_8| + |O_{10} - O_8|$
where $|O_1 - O_3| = |x_1 - x_3| + |y_1 - y_3|$ . . . . Manhattan distance
π‘˜-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 3:
Compute the absolute error (for the set of representative objects 𝑂3 and 𝑂8)
$E = \sum_{j=1}^{k} \sum_{p \in C_j} |p - O_j|$
$E = |O_1 - O_3| + |O_2 - O_3| + |O_3 - O_3| + |O_4 - O_3| + |O_5 - O_8| + |O_6 - O_8| + |O_7 - O_8| + |O_8 - O_8| + |O_9 - O_8| + |O_{10} - O_8|$
$E = 3 + 4 + 0 + 2 + (3 + 1 + 1 + 0 + 2 + 2)$
$E = 18$
π‘˜-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 4:
Randomly choose a nonrepresentative object, say 𝑂9
Consider swapping 𝑂8 with 𝑂9
Compute the absolute error for the candidate set of
representative objects 𝑂3 and 𝑂9
π‘˜-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
𝑬 = π‘ΆπŸ βˆ’ π‘ΆπŸ‘ + π‘ΆπŸ βˆ’ π‘ΆπŸ‘ + π‘ΆπŸ‘ βˆ’ π‘ΆπŸ‘ + π‘ΆπŸ’ βˆ’ π‘ΆπŸ‘
+ π‘ΆπŸ“ βˆ’ π‘ΆπŸ— + π‘ΆπŸ” βˆ’ π‘ΆπŸ— + π‘ΆπŸ• βˆ’ π‘ΆπŸ— + π‘ΆπŸ– βˆ’ π‘ΆπŸ— + π‘ΆπŸ— βˆ’ π‘ΆπŸ— + π‘ΆπŸπŸŽ βˆ’ π‘ΆπŸ—
𝑬 = πŸ‘ + πŸ’ + 𝟎 + 𝟐 + (πŸ“ + πŸ‘ + πŸ‘ + 𝟐 + 𝟎 + 𝟐)
𝑬 = πŸπŸ’
Step 5:
Compute the cost of the swap:
$S = E(O_3, O_9) - E(O_3, O_8) = 24 - 18 = 6$
As $S > 0$, the swap would increase the absolute error, so it is rejected and π‘ΆπŸ– is kept as a
representative object (a small check of these numbers follows below).
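A small Python check of these two error values and the swap cost, assuming Manhattan distance and the object coordinates from the table.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

objects = {"O1": (2, 6), "O2": (3, 4), "O3": (3, 8), "O4": (4, 7), "O5": (6, 2),
           "O6": (6, 4), "O7": (7, 3), "O8": (7, 4), "O9": (8, 5), "O10": (7, 6)}

def absolute_error(medoid_names):
    # E = sum over all objects of the Manhattan distance to the nearest medoid.
    medoids = [objects[name] for name in medoid_names]
    return sum(min(manhattan(p, m) for m in medoids) for p in objects.values())

e_O3_O8 = absolute_error(["O3", "O8"])   # 18
e_O3_O9 = absolute_error(["O3", "O9"])   # 24
print(e_O3_O8, e_O3_O9, e_O3_O9 - e_O3_O8)   # 18 24 6 -> positive cost, so the swap is rejected
```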
π‘˜-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Step 6:
Since the swap would increase the absolute error, it is rejected: the representative objects
remain π‘ΆπŸ‘ and π‘ΆπŸ–, and the clusters are unchanged:
C1 = {O1, O2, O3, O4}
C2 = {O5, O6, O7, O8, O9, O10}
Repeat Steps 4 and 5 with other randomly selected nonrepresentative objects, and stop when no
swap reduces the absolute error.
π‘˜-Medoids Clustering
β–ͺ Which method is more robust π‘˜-Means or π‘˜-Medoids?
β–ͺ The k-medoids method is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a mean.
β–ͺ The processing of π‘˜-Medoids is more costly than the k-means method.
Hierarchical Clustering
β–ͺ Groups data objects into a tree of clusters.
β–ͺ Hierarchical clustering methods fall into two categories: Agglomerative and Divisive.
Hierarchical Clustering
β–ͺ Agglomerative Hierarchical Clustering
β–ͺ Starts by placing each object in its own cluster
β–ͺ Merges these atomic clusters into larger and larger clusters
β–ͺ It will halt when all of the objects are in a single cluster or until certain termination
conditions are satisfied.
β–ͺ Bottom-Up Strategy.
β–ͺ The user can specify the desired number of clusters as a termination condition.
Hierarchical Clustering
Application of Agglomerative NESting (AGNES) hierarchical clustering to objects {A, B, C, D, E, F, G}:
Step 0: each object starts in its own cluster: {A}, {B}, {F}, {C}, {D}, {E}, {G}
Step 1: merge to form {A, B} and {C, D}
Step 2: merge to form {A, B, F} and {C, D, E}
Step 3: merge to form {C, D, E, G}
Step 4: merge to form {A, B, F, C, D, E, G} (all objects in a single cluster)
Hierarchical Clustering
β–ͺ Divisive Hierarchical Clustering Method
β–ͺ Starting with all objects in one cluster.
β–ͺ Subdivides the cluster into smaller and smaller pieces.
β–ͺ It will halt when each object forms a cluster on its own or until it satisfies certain termination
conditions
β–ͺ Top-Down Strategy
β–ͺ The user can specify the desired number of clusters as a termination condition.
Hierarchical Clustering
Application of DIvisive ANAlysis (DIANA) hierarchical clustering to the same objects: the same
tree is traversed top-down, starting from the single cluster {A, B, F, C, D, E, G} at Step 0 and
splitting repeatedly ({A, B, F} and {C, D, E, G}, then {C, D, E}, {A, B}, {C, D}, …) until each
object forms its own cluster at Step 4.
Hierarchical Clustering
β–ͺ A tree structure called a dendrogram is used to represent the process of
hierarchical clustering.
Fig. Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
Hierarchical Clustering
β–ͺ Four widely used measures for the distance between clusters, where
β–ͺ $|p - p'|$ is the distance between two objects $p$ and $p'$,
β–ͺ $m_i$ is the mean of cluster $C_i$, and
β–ͺ $n_i$ is the number of objects in cluster $C_i$:
Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
Mean distance: $d_{mean}(C_i, C_j) = |m_i - m_j|$
Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
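A short Python sketch of these four inter-cluster distance measures, using Euclidean distance between points and two invented clusters.

```python
import math
from itertools import product

def d_min(ci, cj):
    # Minimum (single-link) distance between clusters.
    return min(math.dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    # Maximum (complete-link) distance between clusters.
    return max(math.dist(p, q) for p, q in product(ci, cj))

def d_mean(ci, cj):
    # Distance between the cluster means.
    mi = tuple(sum(col) / len(ci) for col in zip(*ci))
    mj = tuple(sum(col) / len(cj) for col in zip(*cj))
    return math.dist(mi, mj)

def d_avg(ci, cj):
    # Average of all pairwise distances between the two clusters.
    return sum(math.dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

ci = [(2, 1), (1, 2)]
cj = [(6, 2), (6, 1), (7, 1)]
print(d_min(ci, cj), d_max(ci, cj), d_mean(ci, cj), d_avg(ci, cj))
```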
Hierarchical Clustering
β–ͺ If an algorithm uses minimum distance measure, an algorithm is called a
nearest-neighbor clustering algorithm.
β–ͺ If the clustering process is terminated when the minimum distance between
nearest clusters exceeds an arbitrary threshold, it is called a single-linkage
algorithm.
β–ͺ If an algorithm uses maximum distance measure, an algorithm is called a
farthest-neighbor clustering algorithm.
β–ͺ If the clustering process is terminated when the maximum distance between
nearest clusters exceeds an arbitrary threshold, it is called a complete-
linkage algorithm.
β–ͺ An agglomerative hierarchical clustering algorithm that uses the minimum
distance measure is also called a minimal spanning tree algorithm.
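For reference, hierarchical clustering with single, complete, or average linkage can be run with SciPy; this short sketch (an addition, not from the slides) clusters the earlier 10-point example with single linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The 10 points from the k-means example.
points = np.array([[2, 5], [2, 1], [7, 1], [3, 5], [4, 4],
                   [6, 2], [1, 2], [6, 1], [3, 4], [2, 3]])

# 'single' = minimum distance (nearest neighbor); 'complete' and 'average'
# give the maximum-distance and average-distance variants.
Z = linkage(points, method='single', metric='euclidean')

# Cut the dendrogram into 3 flat clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```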

More Related Content

What's hot

Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detectionChathurangi Shyalika
Β 
K means clustering
K means clusteringK means clustering
K means clusteringKuppusamy P
Β 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
Β 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsPrashanth Guntal
Β 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine LearningKuppusamy P
Β 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Mustafa Sherazi
Β 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis IntroductionPrasiddhaSarma
Β 
Lecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptLecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptSyedNahin1
Β 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning ClusteringRupak Roy
Β 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)Pravinkumar Landge
Β 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentationEleni Stamatelou
Β 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
Β 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
Β 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )Mohammad Junaid Khan
Β 
Clustering
ClusteringClustering
ClusteringNLPseminar
Β 
Presentation on unsupervised learning
Presentation on unsupervised learning Presentation on unsupervised learning
Presentation on unsupervised learning ANKUSH PAL
Β 
K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
Β 
Spectral Clustering
Spectral ClusteringSpectral Clustering
Spectral Clusteringssusered887b
Β 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
Β 

What's hot (20)

Data cleaning-outlier-detection
Data cleaning-outlier-detectionData cleaning-outlier-detection
Data cleaning-outlier-detection
Β 
K means clustering
K means clusteringK means clustering
K means clustering
Β 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
Β 
Types of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithmsTypes of clustering and different types of clustering algorithms
Types of clustering and different types of clustering algorithms
Β 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
Β 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)
Β 
Cluster Analysis Introduction
Cluster Analysis IntroductionCluster Analysis Introduction
Cluster Analysis Introduction
Β 
Lecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.pptLecture_3_k-mean-clustering.ppt
Lecture_3_k-mean-clustering.ppt
Β 
Machine Learning Clustering
Machine Learning ClusteringMachine Learning Clustering
Machine Learning Clustering
Β 
Unsupervised learning (clustering)
Unsupervised learning (clustering)Unsupervised learning (clustering)
Unsupervised learning (clustering)
Β 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentation
Β 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
Β 
Decision trees in Machine Learning
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning
Β 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
Β 
K - Nearest neighbor ( KNN )
K - Nearest neighbor  ( KNN )K - Nearest neighbor  ( KNN )
K - Nearest neighbor ( KNN )
Β 
Clustering
ClusteringClustering
Clustering
Β 
Presentation on unsupervised learning
Presentation on unsupervised learning Presentation on unsupervised learning
Presentation on unsupervised learning
Β 
K means clustering
K means clusteringK means clustering
K means clustering
Β 
Spectral Clustering
Spectral ClusteringSpectral Clustering
Spectral Clustering
Β 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
Β 

Similar to Clustering

Lec13 Clustering.pptx
Lec13 Clustering.pptxLec13 Clustering.pptx
Lec13 Clustering.pptxKhalid Rabayah
Β 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxShwetapadmaBabu1
Β 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningPyingkodi Maran
Β 
Clustering
ClusteringClustering
ClusteringLipikaSaha2
Β 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in RSudhakar Chavan
Β 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsNithyananthSengottai
Β 
K-means Clustering || Data Mining
K-means Clustering || Data MiningK-means Clustering || Data Mining
K-means Clustering || Data MiningIffat Firozy
Β 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithmparry prabhu
Β 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clusteringPVP College
Β 
Pattern recognition binoy k means clustering
Pattern recognition binoy  k means clusteringPattern recognition binoy  k means clustering
Pattern recognition binoy k means clustering108kaushik
Β 
k-mean-clustering.ppt
k-mean-clustering.pptk-mean-clustering.ppt
k-mean-clustering.pptRanimeLoutar
Β 
k-mean-Clustering impact on AI using DSS
k-mean-Clustering impact on AI using DSSk-mean-Clustering impact on AI using DSS
k-mean-Clustering impact on AI using DSSMarkNaguibElAbd
Β 
ML basic &amp; clustering
ML basic &amp; clusteringML basic &amp; clustering
ML basic &amp; clusteringmonalisa Das
Β 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.pptvikassingh569137
Β 
Clustering on DSS
Clustering on DSSClustering on DSS
Clustering on DSSEnaam Alotaibi
Β 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetAlaaZ
Β 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfp_manimozhi
Β 

Similar to Clustering (20)

08 clustering
08 clustering08 clustering
08 clustering
Β 
Lec13 Clustering.pptx
Lec13 Clustering.pptxLec13 Clustering.pptx
Lec13 Clustering.pptx
Β 
kmean clustering
kmean clusteringkmean clustering
kmean clustering
Β 
CLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptxCLUSTER ANALYSIS ALGORITHMS.pptx
CLUSTER ANALYSIS ALGORITHMS.pptx
Β 
Unsupervised Learning in Machine Learning
Unsupervised Learning in Machine LearningUnsupervised Learning in Machine Learning
Unsupervised Learning in Machine Learning
Β 
Clustering
ClusteringClustering
Clustering
Β 
machine learning - Clustering in R
machine learning - Clustering in Rmachine learning - Clustering in R
machine learning - Clustering in R
Β 
Hierachical clustering
Hierachical clusteringHierachical clustering
Hierachical clustering
Β 
Advanced database and data mining & clustering concepts
Advanced database and data mining & clustering conceptsAdvanced database and data mining & clustering concepts
Advanced database and data mining & clustering concepts
Β 
K-means Clustering || Data Mining
K-means Clustering || Data MiningK-means Clustering || Data Mining
K-means Clustering || Data Mining
Β 
K mean-clustering algorithm
K mean-clustering algorithmK mean-clustering algorithm
K mean-clustering algorithm
Β 
K mean-clustering
K mean-clusteringK mean-clustering
K mean-clustering
Β 
Pattern recognition binoy k means clustering
Pattern recognition binoy  k means clusteringPattern recognition binoy  k means clustering
Pattern recognition binoy k means clustering
Β 
k-mean-clustering.ppt
k-mean-clustering.pptk-mean-clustering.ppt
k-mean-clustering.ppt
Β 
k-mean-Clustering impact on AI using DSS
k-mean-Clustering impact on AI using DSSk-mean-Clustering impact on AI using DSS
k-mean-Clustering impact on AI using DSS
Β 
ML basic &amp; clustering
ML basic &amp; clusteringML basic &amp; clustering
ML basic &amp; clustering
Β 
26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt26-Clustering MTech-2017.ppt
26-Clustering MTech-2017.ppt
Β 
Clustering on DSS
Clustering on DSSClustering on DSS
Clustering on DSS
Β 
Enhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial DatasetEnhance The K Means Algorithm On Spatial Dataset
Enhance The K Means Algorithm On Spatial Dataset
Β 
clustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdfclustering_hierarchical ckustering notes.pdf
clustering_hierarchical ckustering notes.pdf
Β 

More from Rashmi Bhat

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
Β 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
Β 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
Β 
Process Scheduling in OS
Process Scheduling in OSProcess Scheduling in OS
Process Scheduling in OSRashmi Bhat
Β 
Introduction to Operating System
Introduction to Operating SystemIntroduction to Operating System
Introduction to Operating SystemRashmi Bhat
Β 
The Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfThe Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfRashmi Bhat
Β 
Module 1 VR.pdf
Module 1 VR.pdfModule 1 VR.pdf
Module 1 VR.pdfRashmi Bhat
Β 
Spatial Data Mining
Spatial Data MiningSpatial Data Mining
Spatial Data MiningRashmi Bhat
Β 
Web mining
Web miningWeb mining
Web miningRashmi Bhat
Β 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesRashmi Bhat
Β 
Classification in Data Mining
Classification in Data MiningClassification in Data Mining
Classification in Data MiningRashmi Bhat
Β 
ETL Process
ETL ProcessETL Process
ETL ProcessRashmi Bhat
Β 
Data Warehouse Fundamentals
Data Warehouse FundamentalsData Warehouse Fundamentals
Data Warehouse FundamentalsRashmi Bhat
Β 
Virtual Reality
Virtual Reality Virtual Reality
Virtual Reality Rashmi Bhat
Β 
Introduction To Virtual Reality
Introduction To Virtual RealityIntroduction To Virtual Reality
Introduction To Virtual RealityRashmi Bhat
Β 
Graph Theory
Graph TheoryGraph Theory
Graph TheoryRashmi Bhat
Β 

More from Rashmi Bhat (17)

Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
Β 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
Β 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
Β 
Process Scheduling in OS
Process Scheduling in OSProcess Scheduling in OS
Process Scheduling in OS
Β 
Introduction to Operating System
Introduction to Operating SystemIntroduction to Operating System
Introduction to Operating System
Β 
The Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdfThe Geometry of Virtual Worlds.pdf
The Geometry of Virtual Worlds.pdf
Β 
Module 1 VR.pdf
Module 1 VR.pdfModule 1 VR.pdf
Module 1 VR.pdf
Β 
OLAP
OLAPOLAP
OLAP
Β 
Spatial Data Mining
Spatial Data MiningSpatial Data Mining
Spatial Data Mining
Β 
Web mining
Web miningWeb mining
Web mining
Β 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
Β 
Classification in Data Mining
Classification in Data MiningClassification in Data Mining
Classification in Data Mining
Β 
ETL Process
ETL ProcessETL Process
ETL Process
Β 
Data Warehouse Fundamentals
Data Warehouse FundamentalsData Warehouse Fundamentals
Data Warehouse Fundamentals
Β 
Virtual Reality
Virtual Reality Virtual Reality
Virtual Reality
Β 
Introduction To Virtual Reality
Introduction To Virtual RealityIntroduction To Virtual Reality
Introduction To Virtual Reality
Β 
Graph Theory
Graph TheoryGraph Theory
Graph Theory
Β 

Recently uploaded

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
Β 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
Β 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
Β 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
Β 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
Β 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
Β 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
Β 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
Β 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
Β 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
Β 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
Β 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
Β 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
Β 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
Β 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
Β 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
Β 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
Β 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
Β 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
Β 

Recently uploaded (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
Β 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Β 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Β 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
Β 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
Β 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
Β 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
Β 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Β 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
Β 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Β 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
Β 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
Β 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
Β 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
Β 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
Β 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Β 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Β 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
Β 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Β 

Clustering

  • 2. What is Clustering?? β–ͺ Grouping of objects How will you group these together??
  • 3. What is Clustering?? Option 1: By Type Option 2: By Color
  • 5. What is Cluster Analysis?? β–ͺ A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. β–ͺ Cluster analysis has been extensively focused mainly on distance-based cluster analysis. The process of grouping a set of physical or abstract objects into classes of similar objects is called as Clustering.
  • 6. What is Cluster Analysis?? β–ͺ How clustering differs from classification???
  • 7. What is Cluster Analysis?? β–ͺ Clustering is also called data segmentation β–ͺ Clustering is finding borders between groups, β–ͺ Segmenting is using borders to form groups β–ͺ Clustering is the method of creating segments. β–ͺ Clustering can also be used for outlier detection
  • 8. What is Cluster Analysis?? β–ͺ Classification: Supervised Learning β–ͺ Classes are predetermined β–ͺ Based on training data set β–ͺ Used to classify future observations β–ͺ Clustering : Unsupervised Learning β–ͺ Classes are not known in advance β–ͺ No prior knowledge β–ͺ Used to explore (understand) the data β–ͺ Clustering is a form of learning by observation, rather than learning by examples.
  • 9. Applications of Clustering β–ͺ Marketing: β–ͺ Segmentation of the customer based on behavior β–ͺ Banking: β–ͺ ATM Fraud detection (outlier detection) β–ͺ Gene analysis: β–ͺ Identifying gene responsible for a disease β–ͺ Image processing: β–ͺ Identifying objects on an image (face detection) β–ͺ Houses: β–ͺIdentifying groups of houses according to their house type, value, and geographical location
  • 10. Requirements of Clustering Analysis β–ͺ The following are typical requirements of clustering in data mining: β–ͺ Scalability β–ͺ Dealing with different types of attributes β–ͺ Discovering clusters with arbitrary shapes β–ͺ Ability to deal with noisy data β–ͺ Minimal requirements for domain knowledge to determine input parameters β–ͺ Incremental clustering β–ͺ High dimensionality β–ͺ Constraint-based clustering β–ͺ Interpretability and usability
  • 11. What is Cluster Analysis?? β–ͺ A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. β–ͺ Cluster analysis has been extensively focused mainly on distance-based cluster analysis. The process of grouping a set of physical or abstract objects into classes of similar objects is called as Clustering.
  • 12. Distance Measures β–ͺ Cluster analysis has been extensively focused mainly on distance-based cluster analysis β–ͺ Distance is defined as the quantitative measure of how far apart two objects are. β–ͺ The similarity measure is the measure of how much alike two data objects are. β–ͺ If the distance is small, the features are having a high degree of similarity. β–ͺ Whereas a large distance will be a low degree of similarity. β–ͺ Generally, similarity are measured in the range 0 to 1 [0,1]. β–ͺ Similarity = 1 if X = Y (Where X, Y are two objects) β–ͺ Similarity = 0 if X β‰  Y
  • 13. Distance Measures Euclidean Distance Manhattan Distance Minkowski Distance Cosine Similarity Jaccard Similarity
  • 14. Distance Measures 𝑫 𝑿, 𝒀 = π’™πŸ βˆ’ π’™πŸ 𝟐 + π’šπŸ βˆ’ π’šπŸ 𝟐 β€’ The Euclidean distance between two points is the length of the path connecting them. β€’ The Pythagorean theorem gives this distance between two points.
  • 15. Distance Measures 𝑫 𝑨, 𝑩 = π’™πŸ βˆ’ π’™πŸ + π’šπŸ βˆ’ π’šπŸ β€’ Manhattan distance is a metric in which the distance between two points is calculated as the sum of the absolute differences of their Cartesian coordinates. β€’ It is the total sum of the difference between the x-coordinates and y-coordinates.
  • 16. Distance Measures 𝑫 𝑿, 𝒀 = ෍ π’Š=𝟏 𝒏 |π’™π’Š βˆ’ π’šπ’Š|𝒑 ΰ΅— 𝟏 𝒑 = 𝒑 ෍ π’Š=𝟏 𝒏 |π’™π’Š βˆ’ π’šπ’Š|𝒑 β€’ It is the generalized form of the Euclidean and Manhattan Distance Measure.
  • 17. Distance Measures β€’ The cosine similarity metric finds the normalized dot product of the two attributes. β€’ By determining the cosine similarity, we would effectively try to find the cosine of the angle between the two objects. β€’ The cosine of 0Β° is 1, and it is less than 1 for any other angle.
  • 18. Distance Measures β€’ When we consider Jaccard similarity these objects will be sets. | 𝑨 βˆͺ 𝑩 | = 7 | 𝑨 ∩ 𝑩 | = 2 π½π‘Žπ‘π‘π‘Žπ‘Ÿπ‘‘ π‘†π‘–π‘šπ‘–π‘™π‘Žπ‘Ÿπ‘–π‘‘π‘¦ 𝑱 𝑨, 𝑩 = 𝐴 ∩ 𝐡 𝐴 βˆͺ 𝐡 = 2 7 = 0.286
  • 19. Clustering Techniques β–ͺ Clustering techniques are categorized in following categories Partitioning Methods Hierarchical Methods Density-based Methods Grid-based Methods Model-based Methods
  • 20. Partitioning Method β–ͺ Construct a partition of a database 𝑫 of 𝒏 objects into π’Œ clusters β–ͺ each cluster contains at least one object β–ͺ each object belongs to exactly one cluster β–ͺ Given a π’Œ, find a partition of π’Œ clusters that optimizes the chosen partitioning criterion (min distance from cluster centers) β–ͺ Global optimal: exhaustively enumerate all partitions Stirling(n,k) (S(10,3) = 9.330, S(20,3) = 580.606.446,…) β–ͺ Heuristic methods: k-means and k-medoids algorithms β–ͺ k-means: Each cluster is represented by the center of the cluster. β–ͺ k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster.
  • 21. π‘˜-means Clustering Input: π’Œ clusters, 𝒏 objects of database 𝑫. Output: Set of π’Œ clusters minimizing squared error function Algorithm: 1. Arbitrarily choose π’Œ objects from 𝑫 as the initial cluster centers; 2. Repeat 1. (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster; 2. Update the cluster means, i.e., calculate the mean value of the objects for each cluster; 3. Until no change;
  • 22. π‘˜-means Clustering Example: Cluster the following data example into 3 clusters using k-means clustering and Euclidean distance Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3
  • 23. π‘˜-means Clustering 1. Choose arbitrary 3 points as cluster centers Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3 C1 = (2,1) C2 = (4,4) C3 = (2,3)
  • 24. π‘˜-means Clustering 2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster centers. And choose closest one. Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3 𝐷1 𝑃1, 𝐢1 = 2 βˆ’ 2 2 + 1 βˆ’ 5 2 = 16 = 4 C1 = (2,1) C2 = (4,4) C3 = (2,3) 𝑫 = π’™πŸ βˆ’ π’™πŸ 𝟐 + π’šπŸ βˆ’ π’šπŸ 𝟐 … . π‘¬π’–π’„π’π’Šπ’…π’†π’‚π’ π‘«π’Šπ’”π’•π’‚π’π’„π’† 𝐷1 𝑃1, 𝐢2 = 4 βˆ’ 2 2 + 4 βˆ’ 5 2 = 5 = 2.236 𝐷1 𝑃1, 𝐢3 = 2 βˆ’ 2 2 + 3 βˆ’ 5 2 = 4 = 2 Cluster1 = {(2,1)} Cluster2 = { } Cluster3 = {(2,5)} 𝐷2 𝑃2, 𝐢1 = 2 βˆ’ 2 2 + 1 βˆ’ 1 2 = 0 𝐷2 𝑃2, 𝐢2 = 4 βˆ’ 2 2 + 4 βˆ’ 1 2 = 13 = 3.605 𝐷2 𝑃2, 𝐢3 = 2 βˆ’ 2 2 + 3 βˆ’ 1 2 = 4 = 2 Similarly, assign other points to appropriate cluster. Cluster1 = { } Cluster2 = { } Cluster3 = {(2,5)} Cluster1 = { } Cluster2 = { } Cluster3 = { }
  • 25. π‘˜-means Clustering 2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster centers. And choose closest one. Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3 C1 = (2,1) C2 = (4,4) C3 = (2,3) Cluster1 = {(2,1)} Cluster2 = { } Cluster3 = {(2,5)} Cluster1 = { } Cluster2 = { } Cluster3 = {(2,5)} Cluster1 = {(2,1)} Cluster2 = {(7,1)} Cluster3 = {(2,5)} Cluster1 = {(2,1)} Cluster2 = {(4,4),(7,1)} Cluster3 = {(2,5)} Cluster1 = {(2,1)} Cluster2 = {(4,4),(7,1), (3,5)} Cluster3 = {(2,5)} Cluster1 = {(2,1), (1,2)} Cluster2 = {(4,4),(7,1), (3,5), (6,2), (6,1), (3,4)} Cluster3 = {(2,3),(2,5)}
  • 26. π‘˜-means Clustering 2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster centers. And choose closest one. Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3
  • 27. π‘˜-means Clustering 3. Update the cluster means Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3 Old Cluster Centers: C1 = (2,1) C2 = (4,4) C3 = (2,3) Clusters: Cluster1 = {(2,1), (1,2), } Cluster2 = {(4,4),(7,1), (3,5), (6,2), (6,1), (3,4)} Cluster3 = {(2,3),(2,5)} Calculate the mean of the points in each cluster π‘šπ‘’π‘Žπ‘›1 = 2+1 2 , 1+2 2 π‘šπ‘’π‘Žπ‘›2 = 4+7+3+6+6+3 6 , 4+1+5+2+1+4 6 π‘šπ‘’π‘Žπ‘›3 = 2+2 2 , 3+5 2 New Cluster Centers: C1 = (1.5, 1.5) C2 = (4.83, 2.83) C3 = (2, 4)
  • 28. π‘˜-means Clustering 2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster centers. And choose closest one. Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3
  • 29. π‘˜-means Clustering 2. Assign each point to its closest cluster center. Calculate distance of each point from each cluster centers. And choose closest one. Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3
  • 30. π‘˜-means Clustering 2. Repeat: Assign each point to its closest cluster center. Calculate distance of each point from each cluster centers. And choose closest one. Point X Y P1 2 5 P2 2 1 P3 7 1 P4 3 5 P5 4 4 P6 6 2 P7 1 2 P8 6 1 P9 3 4 P10 2 3 𝐷1 𝑃1, 𝐢1 = 1.5 βˆ’ 2 2 + 1.5 βˆ’ 5 2 = 3.535 Updated Cluster Centers: C1 = (1.5, 1.5) C2 = (4.83, 2.83) C3 = (2, 4) 𝐷1 𝑃1, 𝐢2 = 4.83 βˆ’ 2 2 + 2.83 βˆ’ 5 2 = 3.566 𝐷1 𝑃1, 𝐢3 = 2 βˆ’ 2 2 + 4 βˆ’ 5 2 = 1 Cluster1 = { } Cluster2 = { } Cluster3 = {(2,5)} 𝐷2 𝑃2, 𝐢1 = 1.5 βˆ’ 2 2 + 1.5 βˆ’ 1 2 = 0.707 𝐷2 𝑃2, 𝐢2 = 4.83 βˆ’ 2 2 + 2.83 βˆ’ 1 2 = 13 = 3.3701 𝐷2 𝑃2, 𝐢3 = 2 βˆ’ 2 2 + 4 βˆ’ 1 2 = 3 Cluster1 = {(2,1)} Cluster2 = { } Cluster3 = {(2,5)} Similarly, assign other points to appropriate cluster.
• 31. k-means Clustering
2. Repeat: assign each point to its closest cluster center.
Updated cluster centers: C1 = (1.5, 1.5), C2 = (4.83, 2.83), C3 = (2, 4)
Updated clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
• 32. k-means Clustering
2. Repeat: assign each point to its closest cluster center. (Figure: plot of the updated cluster assignment.)
• 33. k-means Clustering
3. Update the cluster centers; repeat the process until there is no change in the clusters.
Old cluster centers: C1 = (1.5, 1.5), C2 = (4.83, 2.83), C3 = (2, 4)
Updated clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
New cluster centers: C1 = (1.5, 1.5), C2 = (5.75, 2), C3 = (2.5, 4.25)
• 34. k-means Clustering
2. Repeat: assign each point to its closest cluster center. (Figure: plot of the points with the updated cluster centers.)
• 35. k-means Clustering
2. Repeat: assign each point to its closest cluster center.
Cluster centers: C1 = (1.5, 1.5), C2 = (5.75, 2), C3 = (2.5, 4.25)
Updated clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(7,1), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (4,4), (3,4), (2,3)}
3. Update the cluster centers; repeat until there is no change in the clusters.
New cluster centers: C1 = (1.5, 1.5), C2 = (6.33, 1.33), C3 = (2.8, 4.2)
• 36. k-means Clustering
2. Repeat: assign each point to its closest cluster center. (Figure: plot of the resulting clusters.) With the centers C1 = (1.5, 1.5), C2 = (6.33, 1.33), C3 = (2.8, 4.2), the assignments no longer change, so the algorithm terminates.
• 37. k-means Clustering
Exercise: Apply the k-means algorithm to the following data set to form two clusters.
D = {15, 16, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65}
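One way to work through this exercise is to run the complete assign/update loop; the sketch below is illustrative (the function name and the choice of the first two values as initial centers are ours, and the final clusters can depend on that choice):

def kmeans_1d(data, k, centers=None, max_iter=100):
    # Plain 1-D k-means; by default the first k values serve as initial centers.
    centers = list(centers) if centers else list(data[:k])
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in data:
            d = [abs(x - c) for c in centers]
            clusters[d.index(min(d))].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:       # converged: no center moved
            break
        centers = new_centers
    return clusters, centers

D = [15, 16, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65]
print(kmeans_1d(D, 2))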
• 38. k-means Clustering
▪ Advantages:
▪ Relatively scalable and efficient in processing large data sets
▪ The computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations
▪ This method terminates at a local optimum
▪ Disadvantages:
▪ Can be applied only when the mean of a cluster is defined
▪ Users must specify k, the number of clusters, in advance
▪ Sensitive to noise and outlier data points
• 39. k-means Clustering
▪ How do we cluster categorical data?
▪ A variant of k-means, the k-modes method, is used for clustering categorical data:
▪ It replaces the mean of a cluster with the mode of the data
▪ It uses new dissimilarity measures to deal with categorical objects
▪ It uses a frequency-based method to update the modes of clusters
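For intuition, the two k-modes ingredients can be sketched as follows (an illustrative sketch; the simple-matching dissimilarity used here is one common choice for categorical objects, and the helper names are ours):

from collections import Counter

def mode_of(cluster):
    # The mode of a cluster: the most frequent value of each categorical attribute.
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))

def mismatch(a, b):
    # Simple matching dissimilarity: the number of attributes on which two objects differ.
    return sum(x != y for x, y in zip(a, b))

cluster = [('red', 'small'), ('red', 'large'), ('blue', 'small')]
print(mode_of(cluster))                                  # ('red', 'small')
print(mismatch(('red', 'small'), ('blue', 'small')))     # 1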
• 40. k-Medoids Clustering
▪ Picks actual objects to represent the clusters, using one representative object per cluster.
▪ Each remaining object is assigned to the cluster whose representative object it is most similar to.
▪ Partitioning is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point.
▪ The absolute-error criterion (sum of absolute error) is used:
E = Σ (j = 1..k) Σ (p ∈ Cj) dist(p, Oj)
where
• p is the point in space representing a given object in cluster Cj
• Oj is the representative object of cluster Cj
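The absolute-error criterion translates directly into code; a minimal sketch, assuming Manhattan distance as in the worked example that follows (the helper names are ours):

def manhattan(p, o):
    return abs(p[0] - o[0]) + abs(p[1] - o[1])

def absolute_error(clusters, medoids, dist=manhattan):
    # E = sum of dist(p, Oj) over every object p in every cluster Cj.
    return sum(dist(p, m) for cluster, m in zip(clusters, medoids) for p in cluster)

print(absolute_error([[(2, 6), (3, 4)], [(6, 2)]], [(3, 8), (7, 4)]))  # 3 + 4 + 3 = 10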
• 41. k-Medoids Clustering
▪ The iterative process of replacing representative objects by non-representative objects continues as long as the quality of the resulting clustering is improved.
▪ Quality is measured by a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
▪ Four cases are examined for each of the non-representative objects p. Suppose object p is currently assigned to a cluster represented by medoid Oj.
(Figure: the four reassignment cases for p with respect to Oi, Oj, and Orandom, shown before and after swapping.)
• 42. k-Medoids Clustering
▪ Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost function.
▪ The cost function therefore calculates the difference in absolute-error value if a current representative object is replaced by a non-representative object.
▪ The total cost of swapping is the sum of the costs incurred by all non-representative objects.
▪ If the total cost is negative, then Oj is replaced (swapped) with Orandom.
▪ If the total cost is positive, the current representative object Oj is considered acceptable, and nothing is changed.
▪ PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms.
• 43. k-Medoids Clustering
Input: k, the number of clusters; a data set D of n objects
Output: a set of k clusters
Algorithm:
1. Arbitrarily select k objects as the representative objects (seeds)
2. Repeat:
   a. Assign each remaining object to the cluster with the nearest representative object
   b. Randomly select a non-representative object Orandom
   c. Compute the total cost S of swapping Oj with Orandom
   d. If S < 0, swap Oj with Orandom to form the new set of k representative objects
3. Until no change
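A compact sketch of this loop is shown below; it is illustrative rather than a full PAM implementation, and it scans every candidate swap deterministically instead of drawing Orandom at random:

def manhattan(p, o):
    return abs(p[0] - o[0]) + abs(p[1] - o[1])

def total_error(data, medoids, dist=manhattan):
    # Each object contributes its distance to the nearest medoid.
    return sum(min(dist(p, m) for m in medoids) for p in data)

def pam(data, k, dist=manhattan):
    medoids = list(data[:k])                 # arbitrary initial representatives
    best = total_error(data, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for o_rand in data:
                if o_rand in medoids:
                    continue
                candidate = medoids[:i] + [o_rand] + medoids[i + 1:]
                s = total_error(data, candidate, dist) - best   # cost of swapping
                if s < 0:                                       # accept improving swaps only
                    medoids, best = candidate, best + s
                    improved = True
    # Assign every object to its nearest medoid.
    clusters = [[] for _ in medoids]
    for p in data:
        d = [dist(p, m) for m in medoids]
        clusters[d.index(min(d))].append(p)
    return medoids, clusters

data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
print(pam(data, 2))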
• 44. k-Medoids Clustering
Data objects:
       X  Y
O1     2  6
O2     3  4
O3     3  8
O4     4  7
O5     6  2
O6     6  4
O7     7  3
O8     7  4
O9     8  5
O10    7  6
Aim: create two clusters.
Step 1: Randomly choose two medoids (representative objects): O3 = (3,8) and O8 = (7,4).
• 45. k-Medoids Clustering
Step 2: Assign each object to the closest representative object. Using Euclidean distance, we form the clusters shown on the next slide.
• 46. k-Medoids Clustering
Step 2: Assign each object to the closest representative object. Using Euclidean distance, we form the following clusters:
       X  Y  Cluster
O1     2  6  C1
O2     3  4  C1
O3     3  8  C1
O4     4  7  C1
O5     6  2  C2
O6     6  4  C2
O7     7  3  C2
O8     7  4  C2
O9     8  5  C2
O10    7  6  C2
C1 = {O1, O2, O3, O4}
C2 = {O5, O6, O7, O8, O9, O10}
• 47. k-Medoids Clustering
Step 3: Compute the absolute error for the set of representative objects O3 and O8:
E = Σ (j = 1..k) Σ (p ∈ Cj) |p - Oj|
E = (|O1 - O3| + |O2 - O3| + |O3 - O3| + |O4 - O3|) + (|O5 - O8| + |O6 - O8| + |O7 - O8| + |O8 - O8| + |O9 - O8| + |O10 - O8|)
where |O1 - O3| = |x1 - x3| + |y1 - y3|, i.e. the Manhattan distance.
• 48. k-Medoids Clustering
Step 3 (continued): Absolute error for the representative objects O3 and O8:
E = (|O1 - O3| + |O2 - O3| + |O3 - O3| + |O4 - O3|) + (|O5 - O8| + |O6 - O8| + |O7 - O8| + |O8 - O8| + |O9 - O8| + |O10 - O8|)
E = (3 + 4 + 0 + 2) + (3 + 1 + 1 + 0 + 2 + 2)
E = 18
• 49. k-Medoids Clustering
Step 4: Choose a random non-representative object, O9, as a candidate to replace O8, and compute the absolute error for the candidate set of representative objects O3 and O9.
• 50. k-Medoids Clustering
E = (|O1 - O3| + |O2 - O3| + |O3 - O3| + |O4 - O3|) + (|O5 - O9| + |O6 - O9| + |O7 - O9| + |O8 - O9| + |O9 - O9| + |O10 - O9|)
E = (3 + 4 + 0 + 2) + (5 + 3 + 3 + 2 + 0 + 2)
E = 24
Step 5: Compute the cost function (the change in absolute error caused by the swap):
S = Absolute error for (O3, O9) - Absolute error for (O3, O8) = 24 - 18 = 6
As S > 0, replacing O8 with O9 would increase the absolute error, so the swap is rejected and O8 remains a medoid.
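The two error values and the swap cost for this example can be checked with a few lines (a sketch; the dictionary of objects mirrors the table above):

O = {'O1': (2, 6), 'O2': (3, 4), 'O3': (3, 8), 'O4': (4, 7), 'O5': (6, 2),
     'O6': (6, 4), 'O7': (7, 3), 'O8': (7, 4), 'O9': (8, 5), 'O10': (7, 6)}

def manhattan(p, o):
    return abs(p[0] - o[0]) + abs(p[1] - o[1])

def absolute_error(medoid_names):
    # Every object contributes its Manhattan distance to the nearest medoid.
    medoids = [O[name] for name in medoid_names]
    return sum(min(manhattan(p, m) for m in medoids) for p in O.values())

e_before = absolute_error(['O3', 'O8'])        # 18
e_after = absolute_error(['O3', 'O9'])         # 24
print(e_before, e_after, e_after - e_before)   # S = 24 - 18 = 6 > 0, so O8 is kept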
• 51. k-Medoids Clustering
Step 6: Since the swap was rejected, the medoids remain O3 and O8 and the clusters are unchanged:
C1 = {O1, O2, O3, O4}
C2 = {O5, O6, O7, O8, O9, O10}
Repeat Steps 4 and 5 with other randomly selected non-representative objects; the algorithm stops when no swap reduces the absolute error.
• 52. k-Medoids Clustering
▪ Which method is more robust, k-means or k-medoids?
▪ The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
▪ However, the processing of k-medoids is more costly than the k-means method.
• 53. Hierarchical Clustering
▪ Groups data objects into a tree of clusters.
▪ Hierarchical clustering methods: agglomerative and divisive.
• 54. Hierarchical Clustering
▪ Agglomerative Hierarchical Clustering
▪ Starts by placing each object in its own cluster
▪ Merges these atomic clusters into larger and larger clusters
▪ Halts when all of the objects are in a single cluster or certain termination conditions are satisfied
▪ Bottom-up strategy
▪ The user can specify the desired number of clusters as a termination condition
• 55. Hierarchical Clustering
Application of AGNES (Agglomerative NESting). (Figure, Steps 0-4:) starting from the singleton clusters {A}, {B}, {C}, {D}, {E}, {F}, {G}, objects are merged step by step: A and B into AB, C and D into CD, then AB and F into ABF and CD and E into CDE, then CDE and G into CDEG, and finally ABF and CDEG into the single cluster ABFCDEG.
• 56. Hierarchical Clustering
▪ Divisive Hierarchical Clustering
▪ Starts with all objects in one cluster
▪ Subdivides the cluster into smaller and smaller pieces
▪ Halts when each object forms a cluster on its own or certain termination conditions are satisfied
▪ Top-down strategy
▪ The user can specify the desired number of clusters as a termination condition
• 57. Hierarchical Clustering
Application of DIANA (DIvisive ANAlysis). (Figure, Steps 0-4:) the same tree read in the opposite direction: starting from the single cluster ABFCDEG, the data are split into ABF and CDEG, then further into AB, F, CDE, and G, then into AB, CD, and E, and finally into the individual objects A, B, C, D, E, F, G.
• 58. Hierarchical Clustering
▪ A tree structure called a dendrogram is used to represent the process of hierarchical clustering.
(Fig.: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.)
• 59. Hierarchical Clustering
▪ Four widely used measures for the distance between clusters, where |p - p'| is the distance between two objects p and p', mi is the mean of cluster Ci, and ni is the number of objects in cluster Ci:
Minimum distance: dmin(Ci, Cj) = min over p ∈ Ci, p' ∈ Cj of |p - p'|
Maximum distance: dmax(Ci, Cj) = max over p ∈ Ci, p' ∈ Cj of |p - p'|
Mean distance: dmean(Ci, Cj) = |mi - mj|
Average distance: davg(Ci, Cj) = (1 / (ni * nj)) Σ (p ∈ Ci) Σ (p' ∈ Cj) |p - p'|
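These four measures translate almost verbatim into code; the sketch below is illustrative and uses Euclidean distance between individual objects:

import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_min(ci, cj):
    return min(dist(p, q) for p in ci for q in cj)

def d_max(ci, cj):
    return max(dist(p, q) for p in ci for q in cj)

def d_mean(ci, cj):
    mi = [sum(xs) / len(ci) for xs in zip(*ci)]   # mean of cluster Ci
    mj = [sum(xs) / len(cj) for xs in zip(*cj)]   # mean of cluster Cj
    return dist(mi, mj)

def d_avg(ci, cj):
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))

ci = [(2, 1), (1, 2)]
cj = [(6, 1), (6, 2), (7, 1)]
print(d_min(ci, cj), d_max(ci, cj), d_mean(ci, cj), d_avg(ci, cj))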
• 60. Hierarchical Clustering
▪ If an algorithm uses the minimum distance measure, it is called a nearest-neighbor clustering algorithm.
▪ If the clustering process is terminated when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm.
▪ If an algorithm uses the maximum distance measure, it is called a farthest-neighbor clustering algorithm.
▪ If the clustering process is terminated when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
▪ An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm.
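A nearest-neighbor (single-linkage) agglomerative pass can be sketched as repeatedly merging the two closest clusters; this is an illustrative sketch rather than the AGNES algorithm itself, which uses more efficient distance updates:

import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(ci, cj):
    # Minimum distance between any pair of objects, one from each cluster.
    return min(dist(p, q) for p in ci for q in cj)

def agglomerative(points, num_clusters):
    clusters = [[p] for p in points]          # start: each object is its own cluster
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-link distance...
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        # ...and merge them.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(2, 5), (2, 1), (7, 1), (3, 5), (4, 4), (6, 2), (1, 2), (6, 1), (3, 4), (2, 3)]
print(agglomerative(points, 3))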