clustering

Machine Learning: unsupervised classifiers
for a divorce dataset
Paula Robles López
Universidad Politécnica de Madrid (UPM)
7-12-2022

1 Introduction
2 Hierarchical clustering
3 Nonhierarchical clustering
4 Feature importances
5 External validation
6 Conclusions

1 Introduction

Problem overview:
1 Goal: to group the observations into k clusters
according to their similarities.
2 Data for 84 divorced and 86 married people from Turkey.
Balanced classes, with the labels treated as unknown.
3 54 ordinal features and 170 records in total.

PCA
We perform a PCA for better cluster visualization:
→ 170 objects in a 54-dimensional space
→ Dimensionality reduction to a 2-dimensional space
→ The two PCs explain more than 80% of the initial data variance
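
The explained-variance claim can be written as a ratio of eigenvalues. As a sketch, let \lambda_1 \ge \dots \ge \lambda_{54} be the eigenvalues of the data covariance matrix (this notation is introduced here and does not appear on the slides); the fraction of variance retained by the first two PCs is then:

```latex
\[
\frac{\lambda_1 + \lambda_2}{\sum_{j=1}^{54} \lambda_j} > 0.8
\]
```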

PCA
Our first step: the 170 data points projected onto the first two PCs.
We do not know the real labels; we have no prior knowledge
of the groupings.

2 Hierarchical clustering

We will perform agglomerative clustering, but first we need
to find the value of k.
→ We need internal validation tools, such as a dendrogram showing the
tree-like groupings produced by the clustering.
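
To make the dendrogram idea concrete, here is a minimal pure-Python sketch of agglomerative clustering that records the merge heights a dendrogram would draw. The slides do not say which linkage was used, so average linkage is an assumption, and the function name `average_linkage` and the toy points are invented for illustration.

```python
import math

def average_linkage(points):
    """Agglomerative clustering with average linkage (assumed linkage).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average pairwise distance between the members of a and b.
                d = sum(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        # Merge the closest pair into a new cluster id.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

# Toy data: two well-separated groups of three points each.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
merges = average_linkage(toy)
heights = [d for _, _, d in merges]
print(heights)
```

On this toy data the last merge happens at a much larger height than all the earlier ones. That kind of gap is what one looks for in a dendrogram when choosing where to cut the tree.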

→ We also have the silhouette scores.
k = 2, silhouette score = 0.809
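
For reference, the silhouette score averages s = (b - a) / max(a, b) over all points, where a is the mean distance to the point's own cluster and b is the mean distance to the nearest other cluster. The sketch below is a plain-Python illustration on invented toy data, assuming Euclidean distance; it is not the code behind the 0.809 value.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s = 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data: two well-separated groups (made up for illustration).
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = silhouette_score(toy, [0, 0, 0, 1, 1, 1])   # matches the groups
bad = silhouette_score(toy, [0, 1, 0, 1, 0, 1])    # mixes the groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, so the labeling that matches the two groups scores much higher than the mixed one.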

k = 2 is the optimal number of clusters according to the silhouette
scores!
→ The Calinski-Harabasz index agrees.
Higher values → the clusters are dense and well separated
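
The Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)]. A minimal sketch on invented toy data follows; the function name is illustrative only.

```python
def calinski_harabasz(points, labels):
    """CH = [B / (k - 1)] / [W / (n - k)] using squared Euclidean distances."""
    n, dim = len(points), len(points[0])

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    overall = centroid(points)
    between = within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)    # B: centroid spread
        within += sum(sq_dist(p, c) for p in members)    # W: spread inside clusters
    k = len(clusters)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: two well-separated groups (made up for illustration).
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
ch_good = calinski_harabasz(toy, [0, 0, 0, 1, 1, 1])
ch_bad = calinski_harabasz(toy, [0, 1, 0, 1, 0, 1])
print(ch_good, ch_bad)
```

The labeling that matches the two groups gives a far larger value than the mixed labeling, which is the "higher is better" behaviour described above.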

Clustering results!

3 Nonhierarchical clustering
Partitional clustering
Probabilistic clustering

Partitional clustering

We will perform K-Means clustering, but first we need to
find the value of k.
→ We need internal validation measures, such as the elbow method: a plot of
the SSE against the number of clusters.
k = 2 is the elbow point.
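
The elbow method can be sketched as follows: run K-Means for several values of k, record the SSE (sum of squared distances to the nearest centroid), and look for the k after which the SSE stops dropping sharply. This is a plain-Python illustration on invented toy data. The farthest-first initialisation is an assumption chosen to keep the sketch deterministic; it is not necessarily what was used for the slides.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, n_iter=100):
    """Run Lloyd's K-Means and return the final SSE."""
    # Farthest-first initialisation (deterministic; an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data: two well-separated groups (made up for illustration).
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
sse = {k: kmeans_sse(toy, k) for k in range(1, 5)}
print(sse)
```

On this toy data the SSE collapses between k = 1 and k = 2 and barely changes afterwards, so the elbow sits at k = 2.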

k = 2 is the optimal number of clusters according to the silhouette scores!

Clustering results!

Probabilistic clustering

We will perform Gaussian Mixture clustering, but first we
need to find the value of k.
→ We need internal validation measures, such as the BIC score, to
choose the best-fitting model among the candidates.
k = 3 gives the lowest BIC.
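
For reference, the standard definition of the BIC is given below. The symbols are introduced here and are not on the slides: p is the number of free parameters of the fitted mixture, n the number of observations (170 here) and \hat{L} the maximized likelihood. Lower values indicate a better trade-off between fit and model complexity.

```latex
\[
\mathrm{BIC} = p \ln n - 2 \ln \hat{L}
\]
```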

k = 2 is the optimal number of clusters according to the silhouette scores!

Clustering results!

4 Feature importances

We can use a Random Forest classifier to learn more about
the clustering.
→ New categorical variable: the K-Means cluster assignment.
→ We use this variable as the class label.
→ We train a Random Forest classifier and extract the most
important features.
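
The idea behind the steps above can be illustrated with a deliberately tiny stand-in: a forest of one-split trees (stumps), each grown on a bootstrap sample with a random feature subset, with importances accumulated as the Gini impurity decrease of each split. This is not a full Random Forest and not the code used for the slides. All function names, the toy data and the 0 to 4 answer scale are assumptions made for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_stump(X, y, features):
    """Best single split among the given features: (feature, impurity decrease)."""
    parent, n = gini(y), len(y)
    best_feature, best_gain = None, 0.0
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best_gain:
                best_feature, best_gain = f, parent - child
    return best_feature, best_gain

def stump_forest_importances(X, y, n_trees=50, max_features=2, seed=0):
    """Normalized impurity-decrease importances from a forest of stumps."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    importances = [0.0] * n_features
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]              # bootstrap sample
        features = rng.sample(range(n_features), max_features)   # random feature subset
        feature, gain = best_stump([X[i] for i in rows],
                                   [y[i] for i in rows], features)
        if feature is not None:
            importances[feature] += gain
    total = sum(importances)
    return [v / total for v in importances] if total else importances

# Toy data: feature 0 follows the cluster label, features 1 to 3 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]                    # stands in for the cluster labels
X = [[data_rng.choice([0, 1]) if lab == 0 else data_rng.choice([3, 4])]
     + [data_rng.randint(0, 4) for _ in range(3)] for lab in y]
importances = stump_forest_importances(X, y)
print(importances)
```

The feature that tracks the cluster label receives by far the largest importance, which is how the ranking is read: features with high importance are the ones that best separate the clusters.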

→ Most important features: shared values, overall happiness and knowledge
about the partner’s inner and outer world.

5 External validation

External information can be used to validate the clusterings. Here,
we use the class label to measure how similar our cluster results
are to reality.
→ For this we use the Adjusted Rand Index (ARI).
ARI close to 1 → almost perfect match!
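
The ARI counts pairs of observations that the two labelings treat the same way and corrects that count for chance agreement. A minimal sketch using only the standard library follows; the function name and the tiny label vectors are illustrative and are not taken from the dataset.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)

# The ARI ignores how clusters are named: swapping the ids still gives 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A clustering that splits every true class in half scores below zero.
mixed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, mixed)
```

Because the index is invariant to relabeling, cluster ids do not need to be matched to the married/divorced labels before comparing.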

6 Conclusions

Conclusions:
1 The silhouette plots and the Calinski-Harabasz index suggest k = 2 for
all the clusterings.
2 All the clustering methods give the same output → our results
are reasonably reliable.
3 We can use PCA to improve efficiency and get equally good
clustering results!
4 The extracted feature importances suggest that core issues and
fundamental incompatibilities matter more than expected, which
could be related to the data source (Turkey).

Thank you.
