2. CLUSTERING
The process of partitioning a set of data objects into
subsets (called clusters).
Objects in a cluster are similar to one another and
dissimilar to objects in other clusters.
3. CLUSTER VALIDITY INDICES
Used to evaluate the “goodness” of the resulting clusters.
Different aspects of cluster validation:
Comparing clustering algorithms
Comparing two different sets of clusters
Comparing the results of a cluster analysis to externally
known results
Determining the ‘correct’ number of clusters
Scikit-learn (sklearn) – a library for machine learning
in Python
from sklearn.metrics import ..
4. Types of Validity Indices
Internal Quality Indices
Used to measure the goodness of a clustering structure
without respect to external information.
How well the clusters are separated and how compact the
clusters are.
External Quality Indices
Measure the extent to which cluster labels match the
externally supplied class labels.
5. Internal Quality Indices
Based on the following two criteria:
Compactness/Cohesion: how closely related the objects
in a cluster are
Separation: how distinct or well-separated a cluster is
from other clusters
6. Application
To compare clustering algorithms
Determining the ‘correct’ number of clusters
7. Disadvantages of k-means
Choosing the number of clusters k
In most exploratory applications, the number of clusters k
is unknown
The correct choice of k is often ambiguous
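One common way to deal with an unknown k is to run k-means for several candidate values and keep the one with the best internal validity index. The following is a minimal sketch, assuming scikit-learn's `KMeans`, `make_blobs`, and `silhouette_score`; the synthetic data and the candidate range 2 to 6 are illustrative assumptions, not part of the original slides.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (illustrative assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Score each candidate k with the mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette is better, so pick the k with the maximum score.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real data the maximum is often much less pronounced, so the index is better treated as a guide than as a definitive answer.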
9. >> from sklearn.metrics import davies_bouldin_score
………....
>> davies_bouldin_score(X, labels)
The lower the Davies-Bouldin (DB) index value, the better the clustering
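The call above can be tried end to end as follows. This is a sketch under stated assumptions: the data come from `make_blobs`, the labels come from `KMeans`, and a random labelling is added only as a baseline for comparison. None of these choices come from the original slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data: three well-separated blobs (illustrative assumption).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Labels from k-means: compact, well-separated clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
db_kmeans = davies_bouldin_score(X, labels)

# Baseline: labels assigned at random, ignoring the data.
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 3, size=len(X))
db_random = davies_bouldin_score(X, random_labels)

print("DB (k-means):", db_kmeans)
print("DB (random): ", db_random)
```

The k-means labelling should give the smaller value, in line with the rule that a lower DB index indicates a better clustering.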
10. Dunn Index
It is defined as the ratio of the minimum inter-cluster separation to the
maximum cluster diameter
11. The higher the Dunn index value, the better the clustering.
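As far as I know, scikit-learn does not ship a Dunn index function, so here is a small NumPy sketch. It assumes Euclidean distance, takes separation as the smallest distance between points of different clusters, and takes diameter as the largest distance between points of the same cluster. Other definitions of separation and diameter exist; `dunn_index` is a hypothetical helper name.

```python
import numpy as np

def dunn_index(X, labels):
    """Minimum inter-cluster separation divided by maximum cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    def pairwise(A, B):
        # Euclidean distance between every row of A and every row of B.
        return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

    # Diameter: largest distance between two points of the same cluster.
    max_diameter = max(pairwise(c, c).max() for c in clusters)
    # Separation: smallest distance between points of two different clusters.
    min_separation = min(
        pairwise(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    # Note: max_diameter is 0 if every cluster is a single point.
    return min_separation / max_diameter

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
good = dunn_index(X, [0, 0, 1, 1])  # groups nearby points together
bad = dunn_index(X, [0, 1, 0, 1])   # groups distant points together
print(good, bad)
```

The labelling that groups nearby points scores 5.0 (separation 5, diameter 1), while the one that groups distant points scores 0.2 (separation 1, diameter 5).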
12. Silhouette Index
The Silhouette Coefficient combines the ideas of cohesion
and separation, but for individual points
S(i) = ( b(i) – a(i) ) / max { a(i), b(i) }
where
a(i) is the average dissimilarity of the ith object to all other
objects in the same cluster
b(i) is the average dissimilarity of the ith object to all objects
in the closest other cluster.
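The definition of S(i) can be checked by hand on a tiny data set. The sketch below assumes Euclidean distance as the dissimilarity and sets S(i) = 0 for a point that is alone in its cluster (a common convention). `silhouette_values` is a hypothetical helper name; scikit-learn provides `silhouette_samples` and `silhouette_score` in `sklearn.metrics` for real use.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette S(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    unique = np.unique(labels)
    if len(unique) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    # Full matrix of pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: S(i) stays 0 by convention
        # a(i): mean distance to the other members of the same cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the point itself).
        a = dist[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in unique if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0], [1.0], [10.0], [11.0]])
s = silhouette_values(X, [0, 0, 1, 1])
print(s, s.mean())
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, so S = 9.5 / 10.5, which is close to 1 and indicates a well-placed point.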
14. Other Internal Cluster Validity Indices
Root-mean-square std dev
R-squared
Modified Hubert statistics
Calinski-Harabasz index
I index
SD validity index
S_Dbw validity index, and so on.
15. External Quality Indices
Comparing the results of a cluster analysis to an
externally known result, such as externally
provided class labels
Validate against ground truth
Compare two clusterings
17. Rand Index
Counts the pairs of objects that are in:
a = Same class in both P and G
b = Same class in P but different in G
c = Different class in P but same in G
d = Different class in both P and G
18. Agreement: a, d
Disagreement: b, c
Rand Index: RI = (a + d) / (a + b + c + d)
>> from sklearn.metrics import adjusted_rand_score
………....
>> adjusted_rand_score(labels_true, labels_pred)
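Note that `adjusted_rand_score` returns the chance-corrected (adjusted) Rand index, which generally differs from the plain Rand index built from the pair counts. Recent scikit-learn versions also provide `rand_score` for the unadjusted value. The pure-Python sketch below counts the four kinds of pairs directly. It assumes P is the predicted labelling and G the ground truth, and `pair_counts` is a hypothetical helper name.

```python
from itertools import combinations

def pair_counts(P, G):
    """Count object pairs by whether P and G put them in the same group."""
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p = P[i] == P[j]
        same_g = G[i] == G[j]
        if same_p and same_g:
            a += 1  # same in both P and G
        elif same_p:
            b += 1  # same in P, different in G
        elif same_g:
            c += 1  # different in P, same in G
        else:
            d += 1  # different in both
    return a, b, c, d

G = [0, 0, 0, 1, 1, 1]  # ground truth (illustrative)
P = [0, 0, 1, 1, 2, 2]  # predicted clustering (illustrative)
a, b, c, d = pair_counts(P, G)

# Plain Rand index: fraction of pairs on which P and G agree.
rand_index = (a + d) / (a + b + c + d)

# Adjusted Rand index in its pair-counting form (corrected for chance).
adjusted = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

print(a, b, c, d)
print(rand_index, adjusted)
```

Here the plain index is 10/15 (about 0.67) while the adjusted index is about 0.24, which shows how much of the raw agreement can be attributed to chance.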
19. F-measure
Precision: what percentage of the tuples that the classifier labeled
positive are actually positive
Recall: what percentage of the positive tuples the
classifier labeled as positive
F-measure: the harmonic mean of precision
and recall
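The three quantities can be computed directly from true-positive, false-positive, and false-negative counts. This is a minimal sketch for binary labels where 1 means positive; the example labels and the helper name `precision_recall_f` are illustrative assumptions.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall, and F-measure for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumes at least one predicted positive and one actual positive;
    # otherwise these divisions would fail.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f_measure = precision_recall_f(y_true, y_pred)
print(precision, recall, f_measure)
```

With 2 true positives, 1 false positive, and 2 false negatives, precision is 2/3, recall is 1/2, and the F-measure is 4/7.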
20. Other External Cluster Validity Indices
Normalized Mutual Information(NMI)
Purity
Sorensen-Dice
Braun-Blanquet
Normalized Van Dongen
Pair-Set Index
Centroid Index, and many more.