5. What is Cluster Analysis?
▪ A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
▪ Cluster analysis has focused extensively on distance-based methods.
▪ The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering.
6. What is Cluster Analysis?
▪ How does clustering differ from classification?
7. What is Cluster Analysis?
▪ Clustering is also called data segmentation.
▪ Clustering finds the borders between groups; segmenting uses those borders to form the groups.
▪ Clustering is the method of creating segments.
▪ Clustering can also be used for outlier detection.
8. What is Cluster Analysis?
▪ Classification: supervised learning
▪ Classes are predetermined
▪ Based on a training data set
▪ Used to classify future observations
▪ Clustering: unsupervised learning
▪ Classes are not known in advance
▪ No prior knowledge
▪ Used to explore (understand) the data
▪ Clustering is a form of learning by observation, rather than learning by examples.
9. Applications of Clustering
▪ Marketing:
▪ Segmentation of customers based on behavior
▪ Banking:
▪ ATM fraud detection (outlier detection)
▪ Gene analysis:
▪ Identifying the genes responsible for a disease
▪ Image processing:
▪ Identifying objects in an image (face detection)
▪ Houses:
▪ Identifying groups of houses according to their house type, value, and geographical location
10. Requirements of Clustering Analysis
▪ The following are typical requirements of clustering in data mining:
▪ Scalability
▪ Dealing with different types of attributes
▪ Discovering clusters with arbitrary shapes
▪ Ability to deal with noisy data
▪ Minimal requirements for domain knowledge to determine input parameters
▪ Incremental clustering
▪ High dimensionality
▪ Constraint-based clustering
▪ Interpretability and usability
12. Distance Measures
▪ Cluster analysis has focused extensively on distance-based methods.
▪ Distance is a quantitative measure of how far apart two objects are.
▪ A similarity measure quantifies how much alike two data objects are.
▪ If the distance between two objects is small, they have a high degree of similarity; a large distance implies a low degree of similarity.
▪ Similarity is generally measured in the range [0, 1]:
▪ Similarity = 1 if X = Y (where X and Y are two objects)
▪ Similarity = 0 if X ≠ Y
14. Distance Measures
$D(X, Y) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
• The Euclidean distance between two points is the length of the path connecting them.
• The Pythagorean theorem gives this distance between two points.
16. Distance Measures
$D(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p}$
• This (Minkowski) distance is the generalized form of the Euclidean and Manhattan distance measures.
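To make these measures concrete, here is a minimal Python sketch (the helper names are mine, not from the slides) computing the Minkowski distance and its Euclidean (p = 2) and Manhattan (p = 1) special cases for two numeric vectors:

```python
def minkowski_distance(x, y, p):
    """General Minkowski distance: the p-th root of the sum of |x_i - y_i|^p."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean_distance(x, y):
    """Special case p = 2 (straight-line distance, Pythagorean theorem)."""
    return minkowski_distance(x, y, p=2)

def manhattan_distance(x, y):
    """Special case p = 1 (sum of absolute coordinate differences)."""
    return minkowski_distance(x, y, p=1)

# Example with two points from the k-means example later in the deck:
print(euclidean_distance((2, 5), (4, 4)))   # sqrt(5) ~= 2.236
print(manhattan_distance((2, 5), (4, 4)))   # 3
```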
17. Distance Measures
• The cosine similarity metric finds the normalized dot product of the two attribute vectors.
• By determining the cosine similarity, we effectively find the cosine of the angle between the two objects.
• The cosine of 0° is 1, and it is less than 1 for any other angle.
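A minimal sketch of that computation (my own helper, not part of the slides): the cosine similarity is the dot product of the two vectors divided by the product of their norms.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Normalized dot product of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity((1, 0), (1, 0)))  # 1.0 (angle 0 degrees)
print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (angle 90 degrees)
```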
19. Clustering Techniques
▪ Clustering techniques are categorized into the following groups:
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
20. Partitioning Method
▪ Construct a partition of a database D of n objects into k clusters:
▪ each cluster contains at least one object
▪ each object belongs to exactly one cluster
▪ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion (e.g., minimum distance from cluster centers)
▪ Global optimum: exhaustively enumerate all partitions, Stirling(n, k) of them (S(10, 3) = 9,330; S(20, 3) = 580,606,446; ...)
▪ Heuristic methods: k-means and k-medoids algorithms
▪ k-means: each cluster is represented by the center of the cluster.
▪ k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster.
21. k-means Clustering
Input:
k, the number of clusters, and the n objects of database D.
Output:
A set of k clusters minimizing the squared-error function.
Algorithm (a Python sketch follows the list):
1. Arbitrarily choose k objects from D as the initial cluster centers;
2. Repeat
1. (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
2. Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
3. Until no change.
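The following is a minimal Python sketch of this procedure under simple assumptions (numeric points, Euclidean distance, and an empty cluster keeping its old center); the function and variable names are mine, not from the slides.

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, initial_centers, max_iters=100):
    """Plain k-means: assign points to the nearest center, then recompute means."""
    centers = list(initial_centers)
    for _ in range(max_iters):
        # Assignment step: index of the nearest center for every point.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: new center = coordinate-wise mean of the points in the cluster.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep the old center for an empty cluster
        if new_centers == centers:              # no change -> converged
            break
        centers = new_centers
    return centers, clusters
```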
22. k-means Clustering
Example: Cluster the following data into 3 clusters using k-means clustering and Euclidean distance.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
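Assuming the `k_means` sketch above, the worked example on the following slides can be reproduced with the initial centers shown on slide 25 (P2, P5, and P10):

```python
points = [(2, 5), (2, 1), (7, 1), (3, 5), (4, 4),
          (6, 2), (1, 2), (6, 1), (3, 4), (2, 3)]   # P1 .. P10
initial = [(2, 1), (4, 4), (2, 3)]                   # C1, C2, C3 from the example

centers, clusters = k_means(points, initial)
print(centers)   # converges to (1.5, 1.5), (6.33, 1.33), (2.8, 4.2), matching the slides
print(clusters)  # final cluster memberships
```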
25. k-means Clustering
2. Assign each point to its closest cluster center: calculate the distance of each point from each cluster center and choose the closest one.
Initial cluster centers:
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Assigning the points one by one (P1, P2, P3, ...), the clusters build up to:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(4,4), (7,1), (3,5), (6,2), (6,1), (3,4)}
Cluster3 = {(2,3), (2,5)}
27. k-means Clustering
3. Update the cluster means.
Old Cluster Centers:
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(4,4),(7,1), (3,5), (6,2), (6,1), (3,4)}
Cluster3 = {(2,3),(2,5)}
Calculate the mean of the points in each cluster
mean1 = ( (2 + 1) / 2 , (1 + 2) / 2 )
mean2 = ( (4 + 7 + 3 + 6 + 6 + 3) / 6 , (4 + 1 + 5 + 2 + 1 + 4) / 6 )
mean3 = ( (2 + 2) / 2 , (3 + 5) / 2 )
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
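As a quick check of this update step, a throwaway snippet (not from the slides) that computes each new center as the coordinate-wise mean of its cluster's points:

```python
clusters = {
    "C1": [(2, 1), (1, 2)],
    "C2": [(4, 4), (7, 1), (3, 5), (6, 2), (6, 1), (3, 4)],
    "C3": [(2, 3), (2, 5)],
}
for name, members in clusters.items():
    mean = tuple(round(sum(coord) / len(members), 2) for coord in zip(*members))
    print(name, mean)   # C1 (1.5, 1.5), C2 (4.83, 2.83), C3 (2.0, 4.0)
```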
31. k-means Clustering
2. Repeat: assign each point to its closest cluster center. Calculate the distance of each point from each cluster center and choose the closest one.
Updated Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
33. k-means Clustering
2. Repeat: assign each point to its closest cluster center. Calculate the distance of each point from each cluster center and choose the closest one.
Old Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
3. Update the cluster centers and repeat the process until there is no change in the clusters.
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (5.75, 2)
C3 = (2.5, 4.25)
35. k-means Clustering
2. Repeat: assign each point to its closest cluster center. Calculate the distance of each point from each cluster center and choose the closest one.
Old Cluster Centers:
C1 = (1.5, 1.5)
C2 = (5.75, 2)
C3 = (2.5, 4.25)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (4,4), (3,4), (2,3)}
3. Update the cluster centers and repeat the process until there is no change in the clusters.
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (6.33, 1.33)
C3 = (2.8, 4.2)
36. k-means Clustering
2. Repeat: assign each point to its closest cluster center. With the centers C1 = (1.5, 1.5), C2 = (6.33, 1.33), and C3 = (2.8, 4.2), no point changes its cluster, so the algorithm terminates with:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(7,1), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (4,4), (3,4), (2,3)}
37. k-means Clustering
Apply the k-means algorithm to the following data set with two clusters.
D = {15, 16, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65}
38. k-means Clustering
▪ Advantages:
▪ Relatively scalable and efficient in processing large data sets
▪ The computational complexity of the algorithm is O(nkt),
▪ where n is the total number of objects, k is the number of clusters, and t is the number of iterations
▪ The method terminates at a local optimum.
▪ Disadvantages:
▪ Can be applied only when the mean of a cluster is defined
▪ Users must specify k, the number of clusters, in advance
▪ Sensitive to noise and outlier data points
39. k-means Clustering
▪ How do we cluster categorical data?
▪ A variant of k-means, the k-modes method, is used for clustering categorical data:
▪ Replaces the mean of a cluster with the mode of the data
▪ Uses a new dissimilarity measure to deal with categorical objects
▪ Uses a frequency-based method to update the modes of clusters.
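For the dissimilarity measure, a common choice in k-modes is simple matching, i.e., counting the attributes on which two categorical objects disagree; a tiny illustrative helper (mine, not from the slides):

```python
def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

print(matching_dissimilarity(("red", "small", "round"),
                             ("red", "large", "round")))   # 1
```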
40. k-Medoids Clustering
▪ Picks actual objects to represent the clusters, using one representative object per cluster.
▪ Each remaining object is assigned to the cluster whose representative object it is most similar to.
▪ The partitioning is then performed on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point.
▪ The absolute-error criterion is used:
$E = \sum_{j=1}^{k} \sum_{p \in C_j} \mathit{dist}(p, o_j)$
where
• p is the point in space representing a given object in cluster C_j
• o_j is the representative object of cluster C_j
• E is the sum of the absolute error for all objects in the data set
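A small sketch of this criterion (my own naming, with Euclidean distance standing in for the dissimilarity):

```python
from math import dist

def absolute_error(clusters, medoids):
    """Sum of distances from every object to the medoid of its cluster.

    clusters: list of lists of points; medoids: the representative point of each cluster.
    """
    return sum(dist(p, medoids[j]) for j, members in enumerate(clusters) for p in members)
```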
41. k-Medoids Clustering
▪ The iterative process of replacing representative objects with nonrepresentative objects continues as long as the quality of the resulting clustering improves.
▪ Quality is measured by a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
▪ Four cases are examined for each of the nonrepresentative objects, p.
▪ Suppose object p is currently assigned to a cluster represented by medoid O_j.
Fig. The four cases of the cost function for k-medoids clustering (Case 1, Case 2, Case 3, Case 4; before vs. after swapping): when a current medoid is replaced by O_random, object p is either reassigned to another current medoid O_i, reassigned to O_random, or left in its current cluster.
42. k-Medoids Clustering
▪ Each time a reassignment occurs, a difference in absolute error, E, is contributed to the cost function.
▪ Therefore, the cost function calculates the difference in absolute-error value if a current representative object is replaced by a nonrepresentative object.
▪ The total cost of swapping is the sum of the costs incurred by all nonrepresentative objects.
▪ If the total cost is negative, then O_j is replaced (swapped) with O_random.
▪ If the total cost is positive, the current representative object, O_j, is considered acceptable, and nothing is changed.
▪ PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms.
43. k-Medoids Clustering
Input: k, the number of clusters, and a data set D of n objects
Output: a set of k clusters
Algorithm (see the sketch after this list):
1. Arbitrarily select k objects as the representative objects (seeds)
2. Repeat
1. Assign each remaining object to the cluster with the nearest representative object
2. Randomly select a nonrepresentative object O_random
3. Compute the total cost S of swapping O_j with O_random
4. If S < 0, then swap O_j with O_random to form the new set of k representative objects
3. Until no change
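Below is a minimal Python sketch in the spirit of this loop (names are mine; Euclidean dissimilarity is assumed, the total cost is computed as the change in the absolute error E from the earlier slide, and a fixed number of random swap attempts stands in for the "until no change" test). It is not a full PAM implementation.

```python
import random
from math import dist

def total_error(points, medoids):
    """Absolute error E: each point contributes its distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                     # arbitrary initial representatives
    for _ in range(max_iters):
        # Try swapping one current medoid with one randomly chosen nonrepresentative object.
        o_j = rng.choice(medoids)
        o_random = rng.choice([p for p in points if p not in medoids])
        candidate = [o_random if m == o_j else m for m in medoids]
        # Total cost of the swap = change in absolute error; negative means improvement.
        if total_error(points, candidate) - total_error(points, medoids) < 0:
            medoids = candidate
    # Final assignment: each object joins the cluster of its nearest medoid.
    clusters = [[] for _ in medoids]
    for p in points:
        clusters[min(range(k), key=lambda i: dist(p, medoids[i]))].append(p)
    return medoids, clusters
```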
51. k-Medoids Clustering
Data objects
Aim: create two clusters
X Y Cluster
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Step 6:
With the new medoids, repeat Step 2:
assign each object to the closest representative object.
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
52. k-Medoids Clustering
▪ Which method is more robust, k-means or k-medoids?
▪ The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
▪ However, the processing of k-medoids is more costly than the k-means method.
54. Hierarchical Clustering
▪ Agglomerative hierarchical clustering
▪ Starts by placing each object in its own cluster
▪ Merges these atomic clusters into larger and larger clusters
▪ Halts when all of the objects are in a single cluster or certain termination conditions are satisfied
▪ Bottom-up strategy
▪ The user can specify the desired number of clusters as a termination condition.
55. Hierarchical Clustering
Application of Agglomerative NESting (AGNES) hierarchical clustering to the objects {A, B, C, D, E, F, G}:
Step 0: {A}, {B}, {C}, {D}, {E}, {F}, {G}
Step 1: {A,B}, {C,D}
Step 2: {A,B,F}, {C,D,E}
Step 3: {C,D,E,G}
Step 4: {A,B,F,C,D,E,G} (all objects in a single cluster)
56. Hierarchical Clustering
▪ Divisive hierarchical clustering method
▪ Starts with all objects in one cluster
▪ Subdivides the cluster into smaller and smaller pieces
▪ Halts when each object forms a cluster on its own or certain termination conditions are satisfied
▪ Top-down strategy
▪ The user can specify the desired number of clusters as a termination condition.
57. Hierarchical Clustering
Fig. Application of DIvisive ANAlysis (DIANA) hierarchical clustering: the same tree as AGNES, read top-down from the single cluster {A, B, F, C, D, E, G} (Step 0) down to the individual objects (Step 4).
58. Hierarchical Clustering
▪ A tree structure called a dendrogram is used to represent the process of hierarchical clustering.
Fig. Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
59. Hierarchical Clustering
▪ Four widely used measures for the distance between clusters, where |p − p′| is the distance between two objects p and p′, m_i is the mean of cluster C_i, and n_i is the number of objects in cluster C_i:
Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = |m_i - m_j|$
Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$
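A compact sketch of these four inter-cluster distances for numeric points, using Euclidean distance for |p − p′| (the helper names are mine):

```python
from math import dist
from itertools import product

def d_min(ci, cj):
    return min(dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    return max(dist(p, q) for p, q in product(ci, cj))

def d_mean(ci, cj):
    mi = tuple(sum(c) / len(ci) for c in zip(*ci))  # mean of cluster C_i
    mj = tuple(sum(c) / len(cj) for c in zip(*cj))  # mean of cluster C_j
    return dist(mi, mj)

def d_avg(ci, cj):
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))
```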
60. Hierarchical Clustering
▪ If an algorithm uses the minimum-distance measure, it is called a nearest-neighbor clustering algorithm.
▪ If the clustering process is terminated when the minimum distance between nearest clusters exceeds an arbitrary threshold, it is called a single-linkage algorithm.
▪ If an algorithm uses the maximum-distance measure, it is called a farthest-neighbor clustering algorithm.
▪ If the clustering process is terminated when the maximum distance between nearest clusters exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
▪ An agglomerative hierarchical clustering algorithm that uses the minimum-distance measure is also called a minimal spanning tree algorithm.
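For reference, single- and complete-linkage agglomerative clustering are available in SciPy; a brief sketch (assuming SciPy is installed), applied to the ten points from the earlier k-means example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([(2, 5), (2, 1), (7, 1), (3, 5), (4, 4),
              (6, 2), (1, 2), (6, 1), (3, 4), (2, 3)])

Z_single = linkage(X, method="single")      # nearest-neighbor / single linkage
Z_complete = linkage(X, method="complete")  # farthest-neighbor / complete linkage

# Cut each dendrogram into 3 flat clusters and print the cluster labels.
print(fcluster(Z_single, t=3, criterion="maxclust"))
print(fcluster(Z_complete, t=3, criterion="maxclust"))
```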