Course - Machine Learning
Course code-IT 312
Unit-IV
Topic- Clustering
Sanjivani Rural Education Society’s
Sanjivani College of Engineering, Kopargaon-423603
(An Autonomous Institute Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified
Department of Information Technology
(NBA Accredited)
Dr.R.D.Chintamani
Asst. Prof.
ML- Unit-IV CLUSTERING Department of IT
Unit-IV- CLUSTERING
• Syllabus
• Distance measures: Euclidean, Manhattan, Hamming, Minkowski distance metric
• Different clustering methods (distance, density, hierarchical)
• K-means clustering algorithm with example, k-medoid algorithm with example
• Performance measures: Rand index
• K-Nearest Neighbour algorithm
• Clustering definition
• Attributes of a good clustering method
• Applications
• Challenges
• Hard vs. soft clustering
• Different clustering paradigms
  ◦ Partitioning clustering algorithms
  ◦ Hierarchical algorithms
  ◦ Density-based algorithms
  ◦ Model-based algorithms
• Silhouette score: cluster evaluation metric
Motivation: Clustering
• Grouping similar data points: Cluster analysis groups similar data samples together,
which can help identify patterns and relationships in the data,
e.g., clustering customers based on their buying behavior.
• Identifying outliers: Cluster analysis can help identify outliers in the dataset.
Once outliers are identified, the data distribution can be better understood, and
more accurate predictions can be made,
e.g., the height of Dalip Singh Rana (The Great Khali) among the heights of all
the WWE wrestlers in 2017 [2.16 m vs. 1.8 m*].
Attributes of a Good Clustering Method
A good clustering method should
• Produce clusters with high within-class similarity and low between-class
similarity
• Be able to discover most of the hidden patterns in the data
• Produce meaningful clusters with clear boundaries that are useful for the
intended application
• Be scalable and computationally efficient, with the ability to handle large
datasets
• Be robust to noise and outliers without producing misleading results
• Be flexible enough to handle different types of data (continuous, categorical, or
mixed) and clustering criteria, such as distance or density
• Be easy to interpret and hence trusted by domain experts
Applications
• Customer segment identification: To group customers based on similar buying
patterns, demographics, or psychographics. This helps businesses tailor their
marketing strategies to specific customer segments.
• Image segmentation: To group pixels of an image with similar characteristics, such as
color and texture, into distinct regions. This can be used for object recognition, image
compression, and other applications.
• Recommender systems: To group users based on similar preferences or behaviour in order to
generate personalized recommendations for products or services.
• Social network analysis: To identify communities or groups of individuals with similar
interests or behaviours. This can provide insights into social dynamics and influence
in online communities.
Clustering in Machine Learning
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group."
• It works by finding similar patterns in the unlabeled dataset, such as shape, size, color,
or behavior, and divides the data according to the presence or absence of those patterns.
• After clustering, each cluster or group is given a cluster ID.
An ML system can use this ID to simplify the processing of large and complex datasets.
• Clustering is commonly used for statistical data analysis.
[Figure: Working of the clustering algorithm]
K-Means Clustering Algorithm
• K-Means clustering is an unsupervised learning algorithm that groups an unlabeled
dataset into different clusters.
• Here K is the number of clusters to create, fixed in advance: for K=2 there will be
two clusters, for K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K different clusters in
such a way that each data point belongs to exactly one group of points with similar properties.
• It offers a convenient way to discover the groups in an unlabeled dataset on its own,
without the need for labeled training data.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of squared distances between the data points
and the centroids of their clusters.
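As a rough illustration of this objective, the sketch below computes the within-cluster sum of squared distances for a small made-up dataset. The function name wcss and the toy points are assumptions for illustration and do not come from the slides.

```python
def wcss(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for (x, y), k in zip(points, labels):
        cx, cy = centroids[k]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


# Hypothetical toy data: two tight groups of two points each.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids = [(0, 1), (10, 1)]

# Assigning each point to its own group gives a small objective value;
# mixing the groups gives a much larger one.
good = wcss(points, centroids, labels=[0, 0, 1, 1])
bad = wcss(points, centroids, labels=[0, 1, 0, 1])
print(good, bad)
```

K-means searches for the assignment and centroids that make this quantity small.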
k-means clustering
• The algorithm takes the unlabeled dataset as input, divides it into k clusters, and
repeats the process until the cluster assignments stop changing. The value
of k must be chosen in advance.
• The k-means clustering algorithm mainly performs two tasks:
• Determines the best positions for the K center points (centroids) by an iterative process.
• Assigns each data point to its closest centroid. The data points nearest to
a particular centroid form a cluster.
How does the K-Means Algorithm Work?
• Step-1: Choose the number of clusters, K.
• Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which forms the K clusters.
• Step-4: Recompute the centroid of each cluster as the mean of the points assigned to it.
• Step-5: Repeat step 3, that is, reassign each data point to the new
closest centroid.
• Step-6: If any reassignment occurred, go to step 4; otherwise, go to FINISH.
• Step-7: The model is ready.
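The steps above can be sketched in plain Python as follows. This is a minimal sketch, not a reference implementation: the function and parameter names (kmeans, initial_centroids, max_iter, seed) and the toy data are assumptions for illustration, and an empty cluster simply keeps its previous centroid.

```python
import random


def squared_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans(points, k, distance=squared_euclidean,
           initial_centroids=None, max_iter=100, seed=0):
    # Steps 1-2: K is given; pick the initial centroids (random data points by default).
    rng = random.Random(seed)
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: distance(p, centroids[j]))
                      for p in points]
        # Step 6: stop when no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, labels


data = [(1, 1), (1, 2), (9, 9), (9, 8)]  # hypothetical toy data
centroids, labels = kmeans(data, k=2, initial_centroids=[(1, 1), (9, 9)])
print(centroids, labels)
```

Passing a different distance function (for example a Manhattan distance) changes only the assignment step; the centroid update still uses the mean.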
PRACTICE PROBLEMS BASED ON K-MEANS CLUSTERING ALGORITHM
• Problem-01: Cluster the following eight points (with (x, y) representing locations) into three
clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is defined
as the Manhattan distance:
• ρ(a, b) = |x2 – x1| + |y2 – y1|
• Use the K-Means algorithm to find the three cluster centers after the second iteration.
• Iteration-01:
• We calculate the distance from each point to each of the three cluster centers.
• The distance is calculated using the given distance function.
• The following illustration shows the calculation of the distance between point A1(2, 10)
and each of the three cluster centers.
• Calculating the distance between A1(2, 10) and C1(2, 10):
  ρ(A1, C1)
  = |x2 – x1| + |y2 – y1|
  = |2 – 2| + |10 – 10|
  = 0
• Calculating the distance between A1(2, 10) and C2(5, 8):
  ρ(A1, C2)
  = |x2 – x1| + |y2 – y1|
  = |5 – 2| + |8 – 10|
  = 3 + 2
  = 5
• Calculating the distance between A1(2, 10) and C3(1, 2):
  ρ(A1, C3)
  = |x2 – x1| + |y2 – y1|
  = |1 – 2| + |2 – 10|
  = 1 + 8
  = 9
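The three hand calculations for A1 can be checked with a few lines of Python. This is only a sketch; the helper name manhattan and the dictionary layout are assumptions for illustration.

```python
def manhattan(a, b):
    """rho(a, b) = |x2 - x1| + |y2 - y1| for 2-D points a = (x1, y1), b = (x2, y2)."""
    (x1, y1), (x2, y2) = a, b
    return abs(x2 - x1) + abs(y2 - y1)


A1 = (2, 10)
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}

# Distance from A1 to every initial center, and the name of the nearest one.
distances = {name: manhattan(A1, c) for name, c in centers.items()}
nearest = min(distances, key=distances.get)
print(distances, nearest)
```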
• In the same manner, we calculate the distance from every other point to each of the
three cluster centers.
• Next, we collect all the results in a table.
• Using the table, we decide which point belongs to which cluster.
• Each point belongs to the cluster whose center is nearest to it.

Given       Distance from      Distance from      Distance from      Point
Points      center (2, 10)     center (5, 8)      center (1, 2)      belongs to
            of Cluster-01      of Cluster-02      of Cluster-03      Cluster
A1(2, 10)   0                  5                  9                  C1
A2(2, 5)    5                  6                  4                  C3
A3(8, 4)    12                 7                  9                  C2
A4(5, 8)    5                  0                  10                 C2
A5(7, 5)    10                 5                  9                  C2
A6(6, 4)    10                 5                  7                  C2
A7(1, 2)    9                  10                 0                  C3
A8(4, 9)    3                  2                  10                 C2
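The Iteration-01 table can be reproduced with a short sketch like the one below. The function name assignment_table and the dictionary layout are assumptions for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two 2-D points."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


def assignment_table(points, centers):
    """For each named point, the distance to every center and the nearest center."""
    table = {}
    for name, p in points.items():
        row = {c: manhattan(p, center) for c, center in centers.items()}
        row["cluster"] = min(centers, key=lambda c: row[c])
        table[name] = row
    return table


points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}

table = assignment_table(points, centers)
for name, row in table.items():
    print(name, points[name], row)
```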
• From here, the new clusters are:
• Cluster-01 contains the point A1(2, 10).
• Cluster-02 contains the points A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4) and A8(4, 9).
• Cluster-03 contains the points A2(2, 5) and A7(1, 2).
• Now, we re-compute the cluster centers.
• The new center of each cluster is the mean of all the points contained in that
cluster.
• For Cluster-01:
• We have only one point, A1(2, 10), in Cluster-01.
• So, the cluster center remains the same.
• For Cluster-02:
• Center of Cluster-02
  = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
  = (6, 6)
• For Cluster-03:
• Center of Cluster-03
  = ((2 + 1)/2, (5 + 2)/2)
  = (1.5, 3.5)
• This completes Iteration-01.
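The centroid update for Iteration-01 amounts to a coordinate-wise mean per cluster. A minimal sketch, with the helper name centroid chosen here for illustration:

```python
def centroid(members):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    xs, ys = zip(*members)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Cluster memberships found in Iteration-01.
clusters = {
    "C1": [(2, 10)],
    "C2": [(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)],
    "C3": [(2, 5), (1, 2)],
}
new_centers = {name: centroid(members) for name, members in clusters.items()}
print(new_centers)
```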
• Iteration-02:
• We calculate the distance from each point to each of the three cluster centers.
• The distance is calculated using the given distance function.
• The following illustration shows the calculation of the distance between point A1(2, 10)
and each of the three cluster centers.
• Calculating the distance between A1(2, 10) and C1(2, 10):
  ρ(A1, C1)
  = |x2 – x1| + |y2 – y1|
  = |2 – 2| + |10 – 10|
  = 0
• Calculating the distance between A1(2, 10) and C2(6, 6):
  ρ(A1, C2)
  = |x2 – x1| + |y2 – y1|
  = |6 – 2| + |6 – 10|
  = 4 + 4
  = 8
• Calculating the distance between A1(2, 10) and C3(1.5, 3.5):
  ρ(A1, C3)
  = |x2 – x1| + |y2 – y1|
  = |1.5 – 2| + |3.5 – 10|
  = 0.5 + 6.5
  = 7
• In the same manner, we calculate the distance from every other point to each of the
three cluster centers.
• Next, we collect all the results in a table.
• Using the table, we decide which point belongs to which cluster.
• Each point belongs to the cluster whose center is nearest to it.

Given       Distance from      Distance from      Distance from       Point
Points      center (2, 10)     center (6, 6)      center (1.5, 3.5)   belongs to
            of Cluster-01      of Cluster-02      of Cluster-03       Cluster
A1(2, 10)   0                  8                  7                   C1
A2(2, 5)    5                  5                  2                   C3
A3(8, 4)    12                 4                  7                   C2
A4(5, 8)    5                  3                  8                   C2
A5(7, 5)    10                 2                  7                   C2
A6(6, 4)    10                 2                  5                   C2
A7(1, 2)    9                  9                  2                   C3
A8(4, 9)    3                  5                  8                   C1
• From here, the new clusters are:
• Cluster-01 contains the points A1(2, 10) and A8(4, 9).
• Cluster-02 contains the points A3(8, 4), A4(5, 8), A5(7, 5) and A6(6, 4).
• Cluster-03 contains the points A2(2, 5) and A7(1, 2).
• Now, we re-compute the cluster centers.
• The new center of each cluster is the mean of all the points contained in that
cluster.
• For Cluster-01:
• Center of Cluster-01
  = ((2 + 4)/2, (10 + 9)/2)
  = (3, 9.5)
• For Cluster-02:
• Center of Cluster-02
  = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
  = (6.5, 5.25)
• For Cluster-03:
• Center of Cluster-03
  = ((2 + 1)/2, (5 + 2)/2)
  = (1.5, 3.5)
• This completes Iteration-02.
• After the second iteration, the centers of the three clusters are:
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
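The whole of Problem-01 can be replayed as two passes of assignment followed by a centroid update. The sketch below assumes the Manhattan distance given in the problem and the mean as the center, as in the worked solution; the function name kmeans_iteration is an assumption for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two 2-D points."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


def centroid(members):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    xs, ys = zip(*members)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def kmeans_iteration(points, centers):
    """One pass: assign each point to its nearest center, then recompute the centers."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
        clusters[nearest].append(p)
    # An empty cluster keeps its previous center.
    return [centroid(c) if c else old for c, old in zip(clusters, centers)]


points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]  # A1..A8
centers = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

history = []
for _ in range(2):
    centers = kmeans_iteration(points, centers)
    history.append(centers)
print(history)
```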
• Problem-02:
Use the K-Means algorithm to create two clusters from the following points:
A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5)
Solution-
We follow the K-Means clustering algorithm discussed above.
Assume A(2, 2) and C(1, 1) are the initial centers of the two clusters.
Iteration-01:
• We calculate the distance from each point to each of the two
cluster centers.
• The distance is calculated using the Euclidean distance formula.
The following illustration shows the calculation of the distance between
point A(2, 2) and each of the two cluster centers.
Calculating the distance between A(2, 2) and C1(2, 2):
ρ(A, C1)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (2 – 2)² + (2 – 2)² ]
= sqrt [ 0 + 0 ]
= 0
Calculating the distance between A(2, 2) and C2(1, 1):
ρ(A, C2)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (1 – 2)² + (1 – 2)² ]
= sqrt [ 1 + 1 ]
= sqrt [ 2 ]
= 1.41
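The same two distances can be computed in Python as a quick check. This is a sketch; the helper name euclidean is an assumption for illustration.

```python
import math


def euclidean(a, b):
    """sqrt((x2 - x1)^2 + (y2 - y1)^2) for 2-D points a = (x1, y1), b = (x2, y2)."""
    (x1, y1), (x2, y2) = a, b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


A = (2, 2)
d_c1 = euclidean(A, (2, 2))  # distance to center C1
d_c2 = euclidean(A, (1, 1))  # distance to center C2
print(round(d_c1, 2), round(d_c2, 2))
```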
In the same manner, we calculate the distance from every other point to
each of the two cluster centers.
Next,
• We collect all the results in a table.
• Using the table, we decide which point belongs to which cluster.
• Each point belongs to the cluster whose center is nearest to it.

Given          Distance from     Distance from     Point
Points         center (2, 2)     center (1, 1)     belongs to
               of Cluster-01     of Cluster-02     Cluster
A(2, 2)        0                 1.41              C1
B(3, 2)        1                 2.24              C1
C(1, 1)        1.41              0                 C2
D(3, 1)        1.41              2                 C1
E(1.5, 0.5)    1.58              0.71              C2
From here, the new clusters are:
Cluster-01:
First cluster contains the points
• A(2, 2)
• B(3, 2)
• D(3, 1)
Cluster-02:
Second cluster contains the points
• C(1, 1)
• E(1.5, 0.5)
Now,
• We re-compute the cluster centers.
• The new center of each cluster is the mean of all the points
contained in that cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
= (2.67, 1.67)
For Cluster-02:
Center of Cluster-02
= ((1 + 1.5)/2, (1 + 0.5)/2)
= (1.25, 0.75)
This completes Iteration-01.
Next, we go on to iteration-02, iteration-03 and so on, until the centers do
not change anymore.
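Continuing Problem-02 until nothing changes can be sketched as a loop that stops when the assignments repeat. The function name run_until_stable is an assumption for illustration; the sketch assumes Euclidean distance and mean centers, as in the solution above, and that no cluster becomes empty.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def centroid(members):
    """Coordinate-wise mean of a non-empty list of 2-D points."""
    xs, ys = zip(*members)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def run_until_stable(points, centers):
    """Repeat assignment and update until the assignments stop changing."""
    labels = None
    while True:
        new_labels = [min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            return centers, labels
        labels = new_labels
        centers = [centroid([p for p, lab in zip(points, labels) if lab == j])
                   for j in range(len(centers))]


points = [(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)]  # A, B, C, D, E
final_centers, final_labels = run_until_stable(points, [(2, 2), (1, 1)])
print(final_centers, final_labels)
```

For these five points the assignments found in Iteration-01 do not change in the next pass, so the centers computed above are already the final ones.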
[MSQ – 3 points] Suppose you have a dataset with the data samples in a two-dimensional feature
space. Perform k-means clustering with k=3 and initial cluster centroids at C1 (3, 4), C2 (5, 6) and C3
(5, 1). Select the data samples which will be initially assigned to center C3.
Data Samples: (1,1), (1,2), (2,1), (2,2), (3,1), (6, 1), (7,2), (6.5, 0.5), (4,5), (4, 6), (4.5, 5.5), (5,5)
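One way to check an answer to this question is to run the initial assignment step in code. The sketch below assumes Euclidean distance, which the question does not state; with a different distance (for example Manhattan) some samples may be assigned differently. The names squared_euclidean and assigned_to_c3 are assumptions for illustration.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance; it orders centers the same way as the true distance."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


samples = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (6, 1), (7, 2),
           (6.5, 0.5), (4, 5), (4, 6), (4.5, 5.5), (5, 5)]
centroids = {"C1": (3, 4), "C2": (5, 6), "C3": (5, 1)}

# Initial assignment: each sample goes to its nearest centroid.
assignment = {s: min(centroids, key=lambda c: squared_euclidean(s, centroids[c]))
              for s in samples}
assigned_to_c3 = [s for s in samples if assignment[s] == "C3"]
print(assigned_to_c3)
```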