COSC 2670 — Practical Data Science
Assignment 2: Data Modelling and Presentation
Nicholas Davis(s3712731), Luke Daws (s3322003)
RMIT
nicholas.davis@student.rmit.edu.au, luke.daws@student.rmit.edu.au
Saturday 26th of May
Table of Contents
Abstract
Introduction
Methodology
Results
Discussion
Conclusion
References
Abstract
Our aim for this assignment was to use clustering methods to group together wheat
kernels by looking for similar characteristics in their physical properties. We expect
these groupings to reflect the different varieties of the wheat kernels in the dataset.
The dataset was taken from the UCI repository, and we applied three clustering
techniques to it: K-means, DBSCAN and Agglomerative Clustering. Along with a
confusion matrix, each model was evaluated with an Adjusted Rand Score, an
Adjusted Mutual Information Score and a Silhouette Coefficient. We found that
K-means and Agglomerative Clustering performed quite well overall, with each
slightly outperforming the other under different metrics. DBSCAN did not perform
as well. For this dataset and similar ones, we would recommend a clustering
method like K-means or agglomerative clustering over a density-based algorithm
like DBSCAN.
Introduction
This assignment is focused on data modelling, and specifically clustering. We will be
using three different clustering models with the goal of grouping the ‘seeds’ dataset
from the UCI Machine Learning Repository into meaningful clusters. The dataset
consists of the values of 7 attributes (area A, perimeter P, compactness C =
4*pi*A/P^2, length of kernel, width of kernel, asymmetry coefficient length of kernel
groove) for 70 kernels of wheat from each of 3 different varieties. (Kama, Rosa and
Canadian). This dataset was used in (Charytanowicz et al. 2010) to compare the
performance of a proposed new clustering model they name Complete Gradient
Clustering Algorithm to that of the K-Means algorithm.
We will repeat the analysis for K-Means and compare the results to those of the
DBSCAN and Agglomerative Clustering algorithms, as implemented by the Python
machine learning library scikit-learn. The variety of each wheat kernel is given in the
dataset, so we are able to compare each clustering result against these labels and
evaluate each model under the assumption that useful clusters will correspond to the
wheat varieties.
Methodology
Given that we retrieved our data from the UCI repository, we did not anticipate
needing to spend too much time with data-cleaning. There were some extra tab
characters in the supplied text file, so we did check to make sure that they did not
adversely affect data-loading. We also checked to make sure each attribute had
been assigned the correct data type, which was float64 in each instance.
Before beginning the data modelling phase, we carried out an exploration of the
data. The first step was to separate the data itself from the target values. Target
values were stored in a separate DataFrame and given more meaningful names. We
made histograms of all the attributes (Fig.1-7) to provide some visualisation, then
all the attributes were compared against each other using a scatter matrix (Fig.8).
Since there were only 7 attributes, with no missing or obviously incorrect data, we
decided to train our models on all the attributes.
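A minimal sketch of this loading-and-checking step follows. The file name and the usual 1/2/3 coding of the varieties as Kama/Rosa/Canadian are assumptions on our part; the raw UCI file is whitespace/tab separated with no header.

import pandas as pd

cols = ["area", "perimeter", "compactness", "kernel_length",
        "kernel_width", "asymmetry", "groove_length", "variety"]

# sep=r"\s+" absorbs the stray extra tab characters in the supplied text file
seeds = pd.read_csv("seeds_dataset.txt", sep=r"\s+", header=None, names=cols)

# Separate the attributes from the target values and name the targets
X = seeds.drop(columns="variety")
y = seeds["variety"].map({1: "Kama", 2: "Rosa", 3: "Canadian"})

print(X.dtypes)  # each attribute should be float64
pd.plotting.scatter_matrix(X, figsize=(12, 12))  # Fig.8-style overview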
K-means
The K-means model requires as input the value of K, which corresponds to the
desired number of clusters. Clearly, we expect that K=3 will provide the best results,
but its choice can also be justified without reference to the target values. When we
plotted the value of the inertia for each set of results against K, we saw a clear elbow
in the graph at K=3. (The inertia is an inbuilt attribute of the K-Means model in
scikit-learn that records the sum of the squared distances of each sample to its
closest cluster center.)
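A minimal sketch of this elbow check, using X and y from the loading sketch above; the K range and the random_state are our own choices, added for reproducibility:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to closest centre

plt.plot(ks, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("inertia")
plt.show()  # a clear elbow appears at K = 3 (Fig.9)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)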
Agglomerative Clustering
The Agglomerative Clustering model also requires as input the desired number of
clusters; however, it works in a very different way. Each observation begins in its own
cluster, and these clusters are recursively merged based on distance. The
specification of the desired number of clusters is therefore necessary to halt the
algorithm before all clusters are merged. We selected the value 3 based on its
validity for the K-means model.
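A sketch of the corresponding call; scikit-learn's default "ward" linkage (which merges the pair of clusters giving the smallest increase in within-cluster variance) is an assumption, as the choice of linkage is not recorded in the report:

from sklearn.cluster import AgglomerativeClustering

agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # ward by default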
DBSCAN
DBSCAN is an entirely different type of model, being density based. Two parameters
are required, MinPts and eps, but the number of clusters to be found is not
predetermined. Our eventual choice of parameters was MinPts = 11 and Eps = 0.9.
Results
Fig.1 Histogram showing the area of wheat seeds and their frequency
Fig.2 Histogram showing the perimeter of wheat seeds and their frequency
Fig.3 Histogram showing the compactness of wheat seeds and their frequency
Fig.4 Histogram showing the length of wheat seeds and their frequency
Fig.5 Histogram showing the width of wheat seeds and their frequency
Fig.6 Histogram showing the asymmetry coefficient of wheat seeds and their frequency
Fig.7 Histogram showing the length of kernel groove of wheat seeds and their frequency
Fig.8 Scatter matrix comparing all attributes against each other in scatter plots, with the
above histograms (Fig.1-7) along the diagonal.
KMeans
Fig.9 Elbow graph of K vs inertia, indicating the curve flattens around K = 3
Table.1 Clustering results of the K-means algorithm
                   Clusters for K-Means
                      0     1     2
target   Canadian     0    68     2
         Kama         1     9    60
         Rosa        60     0    10
Adjusted Rand Score: 0.7166
Fig.10 Scatterplot of the K-means model showing 3 clusters
For the K-Means model, cluster 0 is clearly associated with the Rosa variety, and
contains only 1 grain from a different variety. However, 10 grains of the Rosa variety
ended up in cluster 2, which is otherwise strongly associated with the Kama variety.
9 grains of the Kama variety ended up in cluster 1, which otherwise contained nearly
all grains of the Canadian variety.
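A Table.1-style cross-tabulation can be produced directly from the label vectors, for example as below (note the cluster numbering can vary between K-Means runs):

import pandas as pd

# Rows: true variety; columns: assigned K-Means cluster (cf. Table.1)
print(pd.crosstab(y, kmeans_labels, rownames=["target"], colnames=["cluster"]))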
DBSCAN
Fig.11 14-distance graph
Table.2 Clustering results of the DBSCAN algorithm
                   Clusters for DBSCAN
                      0     1     2   unclustered
target   Canadian     0    58     0    12
         Kama        47     5     0    18
         Rosa         2     0    37    31
Adjusted Rand Score: 0.4889
Fig.12 Scatterplot of the DBSCAN model and the clusters that it had predicted.
In the DBSCAN model, Cluster 0 could be considered the Kama variety cluster, but it
contains only 47 grains of that type. Cluster 1 captured 58 grains of the Canadian
variety. Cluster 2 contained exclusively grains of the Rosa variety, so precision was
perfect. However, only just over half the Rosa grains were placed in that cluster.
Overall, 61 grains from the total of 210 were not placed in any cluster.
Agglomerative Clustering
Table.3 Clustering results of the Agglomerative clustering algorithm
            Clusters for Agglomerative Clustering
                      0     1     2
target   Canadian    70     0     0
         Kama        16     0    54
         Rosa         0    63     7
Adjusted Rand Score: 0.7132
Fig.13 Scatterplot of the Agglomerative Clustering model showing its associated clustering.
The agglomerative clustering model created a cluster containing exclusively grains of
the Rosa variety, but also misplaced a significant number in the cluster associated
with the Kama variety. All Canadian variety grains were placed in cluster 0; however,
this model also placed 16 Kama grains in that cluster.
Fig.14 Scatterplot showing the true wheat varieties.
Yellow = Canadian, Purple = Kama, Green = Rosa
Discussion
For each model we obtained a ‘confusion matrix’ (Table 1-3) with rows
corresponding to the actual wheat varieties and columns recording which cluster the
observations were assigned to by the model. We also present a scatter plot (Fig.10,
12, 13) to help with visualisation of the result for each model. Following
(Charytanowicz et al. 2010), we projected the data onto the two greatest principal
components to generate these plots, rather than choosing any two of the attributes
for display. Interestingly, our data plots differently to that in Fig. 3 of Charytanowicz
et al. This suggests that the implementation of Principal Component Analysis in
scikit-learn differs from their implementation in some way, but they did not provide
details to investigate.
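A minimal sketch of the projection used for these plots, shown here with the K-Means labels and assuming the unscaled attributes described in the Methodology:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the 7 attributes onto the two greatest principal components and
# colour each point by its cluster label (Fig.10-style plot)
proj = PCA(n_components=2).fit_transform(X)
plt.scatter(proj[:, 0], proj[:, 1], c=kmeans_labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()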
The K-means model performed fairly well. The precision for cluster 0 was very good,
as it contained almost exclusively the Rosa variety, but recall suffered because 10
Rosa grains were placed in cluster 2. The precision for cluster 2 wasn't so great:
although it can be considered the Kama cluster, it was assigned 12 kernels of the
other varieties.
The DBSCAN model had trouble forming clusters on this data set, regardless of the
parameters chosen. For MinPts, we initially followed the suggestion in (Sander et al.
1998) of 2*(number of attributes) = 14. For a given value of MinPts, a good value of
Eps can theoretically be obtained by looking for an elbow in the corresponding k-
distance graph (Fig.11) (Ester et al. 1996). Using these parameters, DBSCAN only
formed a single cluster. Varying eps still did not yield good results as measured by
the adjusted Rand score.
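A sketch of how a k-distance graph like Fig.11 can be computed:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 14  # 2 * (number of attributes), per Sander et al. (1998)
# n_neighbors = k + 1 because each query point is its own 0-distance neighbour
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth = np.sort(dists[:, -1])[::-1]  # distance to k-th neighbour, descending

plt.plot(kth)
plt.xlabel("points sorted by distance")
plt.ylabel("14-distance")
plt.show()  # an elbow in this curve suggests a value for eps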
We also tried the default value MinPts = 4 suggested originally by the authors in
(Ester et al. 1996), but were still unhappy with the results. Eventually, we conducted
a grid search of all values for MinPts from 4 to 14, and all values of eps from 0.3 to
2.0. The values that performed best with respect to both the adjusted Rand score
metric and the adjusted mutual information metric were MinPts = 11, eps = 0.9.
Even with this choice of parameters, the DBSCAN model scored much lower than
the other two models we evaluated. The clusters formed were quite precise, but a
large number of grains remained unclustered. Increasing eps in the hope of
capturing more points led to a merge of all clusters rather than simply making the
existing ones larger.
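A sketch of that grid search; the 0.1 step for eps is an assumption, as the report does not record the step size used:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_rand_score

best = (None, None, -1.0)
for min_pts in range(4, 15):
    for eps in np.arange(0.3, 2.01, 0.1):  # step size assumed
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        score = adjusted_rand_score(y, labels)
        if score > best[2]:
            best = (min_pts, round(eps, 1), score)

print(best)  # MinPts = 11, eps = 0.9 performed best in our runs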
Our third model (Agglomerative Clustering) performed on par with the K-means
model. The recall for the Canadian variety was quite good, with all 70 of them placed
in cluster 0, although that cluster's precision was not as good, since it also contained
16 grains of the Kama variety. The precision of cluster 1 was good, containing only
the Rosa variety. The adjusted Rand score showed that the model performs just as
well as K-means, with a difference of only 0.0034.
Aside from creating a confusion matrix and exploring precision and recall, there are a
number of other metrics available to evaluate the performance of clustering models.
Many suffer from the drawback that it is necessary to know the ground truth classes,
but this does not affect us, as we know the variety of each grain of wheat. This
enabled us to compare our models using the Adjusted Rand Score and the Adjusted
Mutual Information Score. We also calculated the Silhouette Coefficient, which is an
internal evaluation for clustering models (meaning that it does not require the ground
truth classes.)
The Rand score is a measure of similarity for two partitions of a set. It gives the
proportion of all pairs of elements that are either in the same subset in both
partitions, or in different subsets in both. Details of this measure appear in (Hubert &
Arabie 1985).
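In symbols, for two partitions of n elements, with a the number of pairs placed in the
same subset by both partitions and b the number placed in different subsets by both,

RI = (a + b) / (n(n - 1)/2)

and the adjusted score corrects this for chance agreement:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

where E[RI] is the expected Rand score under random cluster assignments with the
same cluster sizes.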
The Mutual Information Score is calculated using the concept of information entropy.
In both cases, the adjusted score is a correction for chance. The Adjusted Rand
Score ranges from -1 to 1, while the Adjusted Mutual Information Score ranges from
0 to 1. In both cases, a value of 1 indicates perfect agreement between the model and
the ground truth classes.
Finally, we evaluated each model using the Silhouette Coefficient. The Silhouette
Coefficient is an internal evaluation which seeks to quantify how well-defined the
clusters are without reference to ground truth classes (which are normally
unknown/non-existent when using clustering models.)
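A sketch of how the three metrics in Table.4 are computed for one model; the same
calls apply to the other two label vectors:

from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score, silhouette_score)

# The two external metrics compare labels to the known varieties; the
# Silhouette Coefficient is internal, using only the data and the labels
print(adjusted_rand_score(y, kmeans_labels))         # reported value: 0.7166
print(adjusted_mutual_info_score(y, kmeans_labels))  # reported value: 0.6907
print(silhouette_score(X, kmeans_labels))            # reported value: 0.4719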
The performance of each model with respect to the three metrics is shown in the
table below:
Table.4 A comparison of the adjusted Rand score, adjusted mutual information score and silhouette
coefficient for K-means, DBSCAN and agglomerative clustering.
                                     K-Means   DBSCAN   Agglomerative Clustering
Adjusted Rand Score                   0.7166   0.4889                     0.7132
Adjusted Mutual Information Score     0.6907   0.4912                     0.7243
Silhouette Coefficient                0.4719   0.2943                     0.4494
The K-Means and Agglomerative Clustering models scored similarly with respect to
each metric, and substantially bettered the results of DBSCAN. It should be noted
that the Silhouette Coefficient does favour convex clusters, and so it is not unusual
for density-based algorithms to score poorly with respect to that metric.
It seems that the region of overlap between the true Canadian and Rosa varieties
(visible in Fig. 14) was a point of difference between the K-Means model and the
Agglomerative Clustering model. The bulk of observations in this region were
allocated to the ‘Kama’ cluster by the K-Means model, and to the ‘Canadian’ cluster
by the Agglomerative Clustering model. All three models produced ‘Rosa’ clusters
with better precision than the clusters for the other two varieties (this was also
reported to be the case for CGCA in (Charytanowicz et al. 2010)), but as reported
earlier, the recall of the Rosa cluster formed by DBSCAN was not high.
Conclusion
Using each of three clustering models, we placed the data into three clusters.
Examination of the corresponding confusion matrices showed that these clusters
were related to the 3 varieties of wheat present in the dataset. In the case of K-
Means and Agglomerative Clustering, the relationships were fairly robust, as
reflected by the Adjusted Rand Score and Adjusted Mutual Information Score.
For a dataset like this, DBSCAN is not the right clustering technique to choose. While
it managed to cluster some kernels, an overwhelming number remained unclustered,
as shown in Table.2 and seen visually in Fig.12. It also received low scores under
the three different evaluation metrics used. The performance of K-
Means and agglomerative clustering was quite even. Neither could be clearly
recommended over the other with regard to this or similar datasets. A decision might
come down to ease of implementation, or run-time.
References
M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak (2010). "A Complete
Gradient Clustering Algorithm for Features Analysis of X-ray Images". In: Information Technologies in
Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, pp. 15-24.
Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han,
Jiawei; Fayyad, Usama M., eds. “A density-based algorithm for discovering clusters in large spatial
databases with noise”. Proceedings of the Second International Conference on Knowledge Discovery
and Data Mining (KDD-96). AAAI Press. pp. 226–231.
Hubert, Lawrence and Arabie, Phipps (1985). "Comparing partitions". Journal of Classification. 2 (1): pp.
193–218.
Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). "Density-Based Clustering in
Spatial Databases: The Algorithm GDBSCAN and Its Applications". Data Mining and Knowledge
Discovery. Berlin: Springer-Verlag. 2 (2): 169–194.
Schubert, Erich; Sander, Jörg; Ester, Martin; Kriegel, Hans Peter; Xu, Xiaowei (July 2017). "DBSCAN
Revisited, Revisited: Why and How You Should (Still) Use DBSCAN". ACM Trans. Database Syst. 42
(3): 19:1–19:21.