Data Mining
Cluster Analysis
Prithwis Mukerjee, Ph.D.
If we were using “Classification”
Name Eggs Pouch Flies Feathers Class
Cockatoo Yes No Yes Yes Bird
Dugong No No No No Mammal
Echidna Yes Yes No No Marsupial
Emu Yes No No Yes Bird
Kangaroo No Yes No No Marsupial
Koala No Yes No No Marsupial
Kookaburra Yes No Yes Yes Bird
Owl Yes No Yes Yes Bird
Penguin Yes No No Yes Bird
Platypus Yes No No No Mammal
Possum No Yes No No Marsupial
Wombat No Yes No No Marsupial
We would be looking at data like this ...
But in “Cluster Analysis” we do NOT have previous knowledge or expertise to define these classes !!
Name Eggs Pouch Flies Feathers Class
Cockatoo Yes No Yes Yes Bird
Dugong No No No No Mammal
Echidna Yes Yes No No Marsupial
Emu Yes No No Yes Bird
Kangaroo No Yes No No Marsupial
Koala No Yes No No Marsupial
Kookaburra Yes No Yes Yes Bird
Owl Yes No Yes Yes Bird
Penguin Yes No No Yes Bird
Platypus Yes No No No Mammal
Possum No Yes No No Marsupial
Wombat No Yes No No Marsupial
We have to look at the attributes alone and somehow group the data into clusters.
What is a cluster ?
A cluster contains objects that are “similar”
There is no unique definition of similarity. It depends on the situation
• Elements of the periodic table
  • Can be clustered along physical or chemical properties
• Customers can be clustered as
  • High value, high “pain” or high “maintenance”, high volume, ....
  • Risky, credit worthy, suspicious ....
So similarity will depend on
• Choice of attributes of an object
• A credible definition of “similarity” of these attributes
• The “distance” between two objects based on the values of the respective attributes
What is “distance” between two objects
This depends on the nature of the attribute
• Quantitative attributes are easiest and most common
  • Height, weight, value, price, score ...
  • Distance can be the difference between values
• Binary attributes are also common, but not as easy
  • Gender, Marital Status, Employment status ...
  • Distance can be the RATIO OF the number of attributes with different values TO the total number of attributes compared
• Qualitative nominal attributes are similar to binary attributes, but can take more than two values that are NOT ranked
  • Religion, Complexion, Colour of Hair ..
• Qualitative ordinal attributes can be ranked in some order
  • Size ( S, M, L, XL ), Grade (A, B, C, D)
  • Can be converted to a numerical scale
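As a minimal sketch of converting an ordinal attribute to a numerical scale, the sizes above can be mapped to their rank positions and the distance taken as the gap between ranks. The mapping `SIZE_RANK`, the helper `ordinal_distance`, and the choice to scale the result to the range 0..1 are illustrative assumptions, not something the slides prescribe.

```python
# Hypothetical mapping: rank positions stand in for the ordinal values.
SIZE_RANK = {"S": 0, "M": 1, "L": 2, "XL": 3}


def ordinal_distance(a, b, ranks):
    """Gap between the ranks of two ordinal values, scaled to 0..1."""
    span = max(ranks.values()) - min(ranks.values())
    return abs(ranks[a] - ranks[b]) / span


print(ordinal_distance("S", "XL", SIZE_RANK))  # 1.0 (the two extremes)
print(ordinal_distance("M", "L", SIZE_RANK))   # one rank step out of three
```

Equal spacing between ranks is itself an assumption: it treats the step from S to M as being as large as the step from L to XL.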
“Distance” between two objects
There are many ways to calculate distance but ...
All definitions of distance must have the following properties
• Distance is never negative
• Distance from object X ( or point X ) to itself must be zero
• Distance (X ⇒ Y) ≤ Distance (X ⇒ Z) + Distance (Z ⇒ Y)
• Distance (X ⇒ Y) = Distance (Y ⇒ X)
Care must be taken in choosing
• Attributes : use the most descriptive or discriminatory attributes
• Scale of values : it may make sense to “normalise” all attribute values using the mean and standard deviation
  • To guard against one attribute dominating over the others
Finally : Distance
Euclidean Distance
• $D(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
• The $L_2$ norm of the difference vector
Manhattan Distance
• $D(x, y) = \sum_i |x_i - y_i|$
• The $L_1$ norm of the difference vector; yields similar results
Chebyshev Distance
• $D(x, y) = \max_i |x_i - y_i|$
• Also called the $L_\infty$ norm
Categorical Data Distance
• $D(x, y) = (\text{number of attributes where } x_i \neq y_i) / N$
• Where $N$ is the number of categorical attributes
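The four distances can be sketched in a few lines of plain Python. The function names are illustrative. The categorical version counts mismatches rather than matches, which is an assumption about the intent: it keeps the distance from an object to itself at zero, as the properties listed earlier require. The sample rows are students s1 and s2 and the Emu and Owl rows from the tables in this deck.

```python
import math


def euclidean(x, y):
    """L2 norm of the difference vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 norm of the difference vector."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """L-infinity norm: the largest single-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def categorical(x, y):
    """Fraction of attributes on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


s1 = (18, 73, 75, 57)   # Age, Marks 1, Marks 2, Marks 3
s2 = (18, 79, 85, 75)
print(manhattan(s1, s2))   # 34
print(chebyshev(s1, s2))   # 18
print(euclidean(s1, s2))   # square root of 460

emu = ("Yes", "No", "No", "Yes")   # Eggs, Pouch, Flies, Feathers
owl = ("Yes", "No", "Yes", "Yes")
print(categorical(emu, owl))       # 0.25
```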
Clustering : Partitioning Method
Results in a single level of partitioning
• Clusters are NOT nested inside other clusters
Given n objects, define k ≤ n clusters
• Each cluster has at least one object
• Each object belongs to only one cluster
Objects are assigned to clusters iteratively
• Objects may be reassigned to another cluster during the process of clustering
The number of clusters is defined up front
Aim is to achieve
• LOW variance WITHIN a cluster
• HIGH variance ACROSS different clusters
Partitioning : K-means / K-median method
Set the number of clusters = k
Pick k seeds as 'centroids' of each cluster
• This may be done randomly OR intelligently
• Compute the distance of each object from each centroid
  • Euclidean : for K-means
  • Manhattan : for K-median
• Allocate each object to a cluster depending on its proximity to the centroid
Iteration
• Re-calculate the centroid of each cluster, based on the objects now in it
• Re-compute the distance of each object from each centroid
• Re-allocate objects to clusters based on the new centroids
Stop IF new clusters have same members as old clusters, ELSE continue iteration
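The steps above can be sketched as a short loop. This is a minimal illustration of the K-means variant (Euclidean distance, mean centroids), not a production implementation. The names `k_means` and `mean_point`, the `max_iter` safety cap, the rule of keeping the old centroid when a cluster becomes empty, and the four toy points are all assumptions made for the example.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Attribute-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def k_means(points, seeds, distance=euclidean, max_iter=100):
    """Return (assignment, centroids); assignment[i] is the cluster index of points[i]."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Allocate each object to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Stop if the clusters have the same members as before.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Re-calculate each centroid from its current members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return assignment, centroids


# Toy data, made up for illustration only.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
assignment, centroids = k_means(points, seeds=[(1, 1), (9, 8)])
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [(1.0, 1.5), (8.5, 8.0)]
```

A K-median variant would pass a Manhattan distance as `distance` and replace `mean_point` with an attribute-wise median.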
Let us try to cluster this data ...
Our initial centroids are the first three students
• Though these could have been any other points
Student Age Marks 1 Marks 2 Marks 3
s1 18 73 75 57
s2 18 79 85 75
s3 23 70 70 52
s4 20 55 55 55
s5 22 85 86 87
s6 19 91 90 89
s7 20 70 65 60
s8 21 53 56 59
s9 19 82 82 60
s10 47 75 76 77
Centroid Age Marks 1 Marks 2 Marks 3
C1 18 73 75 57
C2 18 79 85 75
C3 23 70 70 52
We assign each student to a cluster
Based on closest distance from centroid
We note that
• C1 = { s1, s9 }
• C2 = { s2, s5, s6, s10 }
• C3 = { s3, s4, s7, s8 }
Centroid  Age    Marks 1  Marks 2  Marks 3
C1        18.00  73.00    75.00    57.00
C2        18.00  79.00    85.00    75.00
C3        23.00  70.00    70.00    52.00
Student  Age    Marks 1  Marks 2  Marks 3  Dist. from C1  Dist. from C2  Dist. from C3  Assigned to cluster
s1       18.00  73.00    75.00    57.00    0.00           34.00          18.00          C1
s2       18.00  79.00    85.00    75.00    34.00          0.00           52.00          C2
s3       23.00  70.00    70.00    52.00    18.00          52.00          0.00           C3
s4       20.00  55.00    55.00    55.00    42.00          76.00          36.00          C3
s5       22.00  85.00    86.00    87.00    57.00          23.00          67.00          C2
s6       19.00  91.00    90.00    89.00    66.00          32.00          82.00          C2
s7       20.00  70.00    65.00    60.00    18.00          46.00          16.00          C3
s8       21.00  53.00    56.00    59.00    44.00          74.00          40.00          C3
s9       19.00  82.00    82.00    60.00    20.00          22.00          36.00          C1
s10      47.00  75.00    76.00    77.00    52.00          44.00          60.00          C2
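The first assignment can be reproduced with a short script. The tabulated distances work out as Manhattan (sum of absolute differences) distances; for example, s1 to C2 is 0 + 6 + 10 + 18 = 34. The variable and function names below are illustrative assumptions.

```python
# Age, Marks 1, Marks 2, Marks 3 for each student, as given in the data table.
students = {
    "s1": (18, 73, 75, 57), "s2": (18, 79, 85, 75), "s3": (23, 70, 70, 52),
    "s4": (20, 55, 55, 55), "s5": (22, 85, 86, 87), "s6": (19, 91, 90, 89),
    "s7": (20, 70, 65, 60), "s8": (21, 53, 56, 59), "s9": (19, 82, 82, 60),
    "s10": (47, 75, 76, 77),
}

# Initial centroids are the first three students.
centroids = {"C1": students["s1"], "C2": students["s2"], "C3": students["s3"]}


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


clusters = {name: [] for name in centroids}
for sid, row in students.items():
    dists = {name: manhattan(row, centre) for name, centre in centroids.items()}
    nearest = min(dists, key=dists.get)   # centroid with the smallest distance
    clusters[nearest].append(sid)
    print(sid, dists, nearest)

print(clusters)
# {'C1': ['s1', 's9'], 'C2': ['s2', 's5', 's6', 's10'], 'C3': ['s3', 's4', 's7', 's8']}
```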
Now we re-calculate the centroids
• Of each cluster, based on the values of the attributes of the members of the cluster
Centroid  Age    Marks 1  Marks 2  Marks 3
C1        18.00  73.00    75.00    57.00
C2        18.00  79.00    85.00    75.00
C3        23.00  70.00    70.00    52.00
New C1    18.50  77.50    78.50    58.50
New C2    26.50  82.50    84.25    82.00
New C3    21.00  62.00    61.50    56.50
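As a check on the arithmetic, each new centroid is the attribute-wise mean of its members. A minimal sketch, with illustrative names, using the cluster membership from the first assignment: the exact means give 84.25 for Marks 2 of C2, and (70 + 55 + 70 + 53) / 4 = 62.00 for Marks 1 of C3.

```python
# Age, Marks 1, Marks 2, Marks 3 for each student, as given in the data table.
students = {
    "s1": (18, 73, 75, 57), "s2": (18, 79, 85, 75), "s3": (23, 70, 70, 52),
    "s4": (20, 55, 55, 55), "s5": (22, 85, 86, 87), "s6": (19, 91, 90, 89),
    "s7": (20, 70, 65, 60), "s8": (21, 53, 56, 59), "s9": (19, 82, 82, 60),
    "s10": (47, 75, 76, 77),
}

# Membership after the first assignment.
clusters = {
    "C1": ["s1", "s9"],
    "C2": ["s2", "s5", "s6", "s10"],
    "C3": ["s3", "s4", "s7", "s8"],
}


def centroid(member_ids):
    """Attribute-wise mean of the named students."""
    rows = [students[s] for s in member_ids]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


new_centroids = {name: centroid(members) for name, members in clusters.items()}
for name, centre in new_centroids.items():
    print("New", name, centre)
# New C1 (18.5, 77.5, 78.5, 58.5)
# New C2 (26.5, 82.5, 84.25, 82.0)
# New C3 (21.0, 62.0, 61.5, 56.5)
```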
Second Iteration of Assignments
Based on closest distance from new centroids ..
Sets are ... same as the old sets !!
• C1 = { s1, s9 }
• C2 = { s2, s5, s6, s10 }
• C3 = { s3, s4, s7, s8 }
Centroid  Age    Marks 1  Marks 2  Marks 3
C1        18.50  77.50    78.50    58.50
C2        26.50  82.50    84.25    82.00
C3        21.00  62.00    61.50    56.50
Student  Age    Marks 1  Marks 2  Marks 3  Dist. from C1  Dist. from C2  Dist. from C3  Assigned to cluster
s1       18.00  73.00    75.00    57.00    10.00          52.25          28.00          C1
s2       18.00  79.00    85.00    75.00    25.00          19.75          62.00          C2
s3       23.00  70.00    70.00    52.00    27.00          60.25          23.00          C3
s4       20.00  55.00    55.00    55.00    51.00          90.25          16.00          C3
s5       22.00  85.00    86.00    87.00    47.00          13.75          79.00          C2
s6       19.00  91.00    90.00    89.00    56.00          28.75          92.00          C2
s7       20.00  70.00    65.00    60.00    24.00          60.25          16.00          C3
s8       21.00  53.00    56.00    59.00    50.00          86.25          17.00          C3
s9       19.00  82.00    82.00    60.00    10.00          32.25          46.00          C1
s10      47.00  75.00    76.00    77.00    52.00          41.25          74.00          C2
STOP
Some thoughts ....
How good is the clustering ?
• Within cluster variance is low
• Across cluster variances are higher
• Hence the clustering is good.
Can it be improved ?
• Clustering was guided by the Marks, not so much by Age
• We might consider scaling all the attributes
  • $X_i = (x_i - \mu_x) / \sigma_x$
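The scaling formula can be sketched with the standard library. The function name `standardise` is illustrative, and using the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption; the slide does not say which is meant. The input is the Age column from the student table.

```python
import statistics

# Age column from the student table (s1 .. s10).
ages = [18, 18, 23, 20, 22, 19, 20, 21, 19, 47]


def standardise(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]


z_ages = standardise(ages)
print([round(z, 2) for z in z_ages])
```

Applying the same transformation to every attribute puts Age and the three Marks columns on a comparable scale, so no single attribute dominates the distance merely because of its units.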
Is this the only way to create clusters ? NO
• We could start with a different set of seeds and we might end up with another set of clusters
• K-Means is a “hill climbing” algorithm that finds a local optimum, NOT necessarily the global optimum
Average Euclidean distance of each cluster's members (rows) from each final centroid (columns)
      C1    C2    C3
C1    5.9   26.5  23.3
C2    29.5  14.3  42.6
C3    23.9  41.0  10.7
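The figures in this small table are consistent with the average Euclidean distance from the members of each cluster (rows) to each final centroid (columns). A minimal sketch of that calculation follows, with illustrative names; the small diagonal values are what "low within-cluster variance" refers to.

```python
import math

# Age, Marks 1, Marks 2, Marks 3 for each student, as given in the data table.
students = {
    "s1": (18, 73, 75, 57), "s2": (18, 79, 85, 75), "s3": (23, 70, 70, 52),
    "s4": (20, 55, 55, 55), "s5": (22, 85, 86, 87), "s6": (19, 91, 90, 89),
    "s7": (20, 70, 65, 60), "s8": (21, 53, 56, 59), "s9": (19, 82, 82, 60),
    "s10": (47, 75, 76, 77),
}

# Final cluster membership.
clusters = {
    "C1": ["s1", "s9"],
    "C2": ["s2", "s5", "s6", "s10"],
    "C3": ["s3", "s4", "s7", "s8"],
}


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(member_ids):
    rows = [students[s] for s in member_ids]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


centres = {name: centroid(members) for name, members in clusters.items()}

# matrix[row][col]: mean distance of the members of cluster `row` from centroid `col`.
matrix = {
    row: {
        col: sum(euclidean(students[s], centres[col]) for s in members) / len(members)
        for col in centres
    }
    for row, members in clusters.items()
}

for row, cols in matrix.items():
    print(row, {col: round(value, 1) for col, value in cols.items()})
```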
