Data mining project
The main goal of this study is to group 101 animals into their natural family types using various animal features and a hierarchical clustering algorithm, one of the unsupervised learning algorithms.
Hierarchical Clustering of Multi-Class Data (The Zoo Dataset)
732A31: Data Mining - Clustering and Association Analysis
Linköping University
IDA, Division of Statistics
Raid Mahbouba
raima062@student.liu.se
Introduction
In real life one finds many animals of different family types, and non-zoologists are often not familiar with them; this is one of the main reasons animals are kept in zoos, where they can be identified. The main goal of this study is to use a clustering technique to categorize 101 animals by various features such as hair, feathers, eggs, milk, etc., and to verify whether the resulting groups correspond to the animals' natural family types. Many clustering algorithms are appropriate for such a task, among them hierarchical algorithms and density-based algorithms. In identifying the right clusters of animals, we changed some of the algorithm's default settings, such as the number of iterations, the number of clusters, and the distance computation (the linkage), all of which might influence the final results. The setting we expect to have the most significant influence is the type of linkage computed; the results will demonstrate that the linkage type is a crucial factor in finding the true underlying clusters.
Though the problem is originally a classification problem, as described in the documentation of the zoo database (Forsyth, 1990), my proposition is to use a hierarchical clustering algorithm and see how well it identifies the animal types from their features.
Previous research aimed at analysing this dataset with a clustering methodology was done by Neil Davey, Rod Adams, and Mary J. George, who used neural networks (Forsyth, 1990). In this study I use hierarchical clustering, since it is simple and robust to outliers. It has been noted that "clustering is in the eye of the beholder" (Wikipedia, 2012), which implies that clustering methods do not guarantee ultimate accuracy. I therefore do not expect, though I do hope, to reach 100% accuracy when grouping objects into their natural clusters using the hierarchical approach.
Background
Machine learning is a growing branch of artificial intelligence concerned with the construction and study of systems that can learn from data. Different types of algorithms are used to analyze large datasets in machine learning, examples of which include classification, clustering, boosted additive models (gamboost), random forests, support vector machines, penalized generalized linear models (glmnet), etc. These algorithms are often grouped into two larger classes, namely supervised and unsupervised learning algorithms. In this project, attention is focused on a specific class of unsupervised learning algorithms, namely clustering algorithms. Clustering is a dynamic field of research in data mining.
Initially, clustering algorithms were considered ineffective and inefficient in machine learning because handling large datasets was challenging; in recent years, however, these algorithms have evolved remarkably. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern-based methods), constraint-based methods, etc. (Han & Kamber, 2006).
Methodology
The methodology is based on the hierarchical clustering algorithm, which still requires adjusting a few settings in order to achieve accurate results. We need to know which of these settings are most appropriate for our task, and why.
Dataset
Datasets in clustering analysis may contain variables of several forms: numerical (interval-scaled) variables, binary variables, nominal, ordinal, and ratio-scaled variables, and variables of mixed types. The zoo dataset is of mixed type, consisting of 16 binary variables and one categorical variable.
I downloaded the dataset from the UCI Machine Learning Repository (Forsyth, 1990). Since the data is in a text format that is not recognized by Weka or R-Rattle, I cleaned and processed the dataset and transformed it into CSV format. I then converted the data into a data frame with 18 attributes, 16 of which are boolean, i.e. binary, variables. The boolean variables, which take values {0, 1}, are: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathe, venomous, fins, tail, domestic, and catsize; the categorical variable 'legs' takes values {0, 2, 4, 5, 6, 8}, corresponding to the number of legs of each animal. In the Weka output we will see seven groups, numbered 0 through 6: 0 mammal, 1 fish, 2 bird, 3 invertebrate, 4 insect, 5 amphibian, and 6 reptile. The dataset also has the variable 'type', which holds the seven family types of animals; cleaning the data involves removing the 'type' label so that the algorithm handles the data blindly, in keeping with the unsupervised learning objective. Table 1 presents the count of each animal family type, which will later be compared against the obtained clusters.
Table 1: count of the number of animals in each type
No Type Count
1 Mammal 41
2 Fish 13
3 Bird 20
4 Invertebrate 10
5 Insect 8
6 Amphibian 4
7 Reptile 5
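To make the data representation concrete, the following sketch shows how each animal can be encoded as a fixed-order feature vector once the 'type' label is set aside. The three example animals and their attribute values are illustrative assumptions written for this sketch, not rows copied from the actual CSV:

```python
# Feature order used in this sketch: the boolean attributes listed above plus the
# categorical 'legs' variable (16 features in total, with 'type' excluded).
FEATURES = ["hair", "feathers", "eggs", "milk", "airborne", "aquatic", "predator",
            "toothed", "backbone", "breathe", "venomous", "fins", "legs", "tail",
            "domestic", "catsize"]

# Hypothetical example rows; attributes not listed default to 0.
raw = {
    "aardvark": {"hair": 1, "milk": 1, "predator": 1, "toothed": 1, "backbone": 1,
                 "breathe": 1, "legs": 4, "catsize": 1, "type": "mammal"},
    "bass":     {"eggs": 1, "aquatic": 1, "predator": 1, "toothed": 1, "backbone": 1,
                 "fins": 1, "tail": 1, "type": "fish"},
    "crow":     {"feathers": 1, "eggs": 1, "airborne": 1, "predator": 1, "backbone": 1,
                 "breathe": 1, "legs": 2, "tail": 1, "type": "bird"},
}

def to_vector(animal):
    """Drop the 'type' label and return the feature values in a fixed order."""
    return [animal.get(f, 0) for f in FEATURES]

labels = {name: a["type"] for name, a in raw.items()}    # held out for evaluation only
vectors = {name: to_vector(a) for name, a in raw.items()}  # what the algorithm sees
print(vectors["aardvark"])  # [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1]
```

Keeping the true types aside in `labels` while clustering only `vectors` mirrors the "blind" handling of the data described above.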
Algorithm
The following steps show how the agglomerative hierarchical clustering algorithm works (Zhu, 2010):
1. Initially, each item x1, …, xn is in its own cluster C1, …, Cn.
2. Merge the nearest clusters, say Ci and Cj.
3. Repeat until only one cluster is left, or stop at a given threshold.
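The three steps above can be sketched in a few lines of Python. The distance (a simple mismatch count), the choice of single linkage, and the toy data are illustrative assumptions for this sketch, not the exact configuration used in the study:

```python
def d(x, y):
    """Mismatch distance: the number of attributes on which x and y disagree."""
    return sum(1 for a, b in zip(x, y) if a != b)

def single_link(ci, cj):
    """Distance between the closest pair of members of clusters ci and cj."""
    return min(d(x, y) for x in ci for y in cj)

def agglomerate(items, n_clusters):
    clusters = [[x] for x in items]        # step 1: each item in its own cluster
    while len(clusters) > n_clusters:      # step 3: stop at a given threshold
        # step 2: find and merge the nearest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

data = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
print(agglomerate(data, 2))  # the first two and the last two vectors pair up
```

Stopping the loop at different values of `n_clusters` corresponds to cutting the resulting cluster tree at different levels.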
The result is a cluster tree; one can cut the tree at any level to produce different clusterings. The nearness of clusters or objects is measured using a distance measure, which also indicates the similarity of the clusters. In our case, since we have attributes of mixed types, binary and categorical, and since a categorical variable is a generalization of a binary variable in that it can take on more than two states (Han & Kamber, 2006), the distance is computed from the number of mismatches among the attributes of the objects being compared (Han & Kamber, 2006). For example, if m is the number of matches and p the total number of attributes, then the distance between two objects x and x' is d(x, x') = p - m. This distance can be regarded as the square of the Euclidean distance between the objects, since for binary variables each mismatching attribute contributes 1 and each matching attribute contributes 0.
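To make the p - m computation concrete, here is a small check (on toy binary vectors, not rows from the dataset) that the mismatch distance coincides with the squared Euclidean distance on binary data:

```python
def mismatch_distance(x, y):
    """d(x, y) = p - m: the total number of attributes minus the number of matches."""
    assert len(x) == len(y)
    return sum(1 for a, b in zip(x, y) if a != b)

x = [1, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0]
print(mismatch_distance(x, y))                   # 2: the vectors disagree in two positions
# For {0, 1} data each mismatch contributes (1 - 0)**2 = 1 and each match
# contributes 0, so this equals the squared Euclidean distance:
print(sum((a - b) ** 2 for a, b in zip(x, y)))   # 2 as well
```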
In the second step of the algorithm, objects are merged together to form clusters. The merge is based on a linkage computed from the Euclidean distance, and we consider three different types of linkage:
Single linkage
In Euclidean single linkage, the algorithm takes the closest x and x' from Ci and Cj respectively and merges the clusters at that distance:

d(Ci, Cj) = min { d(x, x') : x ∈ Ci, x' ∈ Cj }

Single linkage is equivalent to the minimum spanning tree algorithm and tends to produce long, skinny clusters.
This single linkage merge criterion is local. We pay attention solely to the area where the two clusters
come closest to each other. Other, more distant parts of the cluster and the clusters' overall structure
are not taken into account (Manning, et al., 2008).
Complete Linkage
Euclidean complete linkage, by contrast, works opposite to single linkage, considering the farthest pair of objects:

d(Ci, Cj) = max { d(x, x') : x ∈ Ci, x' ∈ Cj }

Clusters tend to be compact and roughly equal in diameter.
This complete-link merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but it also causes sensitivity to outliers: a single object far from the center can increase the diameters of candidate merge clusters dramatically and completely change the final clustering (Manning, et al., 2008).
Average Linkage
Average linkage sums the distances between all pairs of objects, one drawn from each cluster, and divides by the product of the cluster sizes:

d(Ci, Cj) = ( Σ_{x ∈ Ci} Σ_{x' ∈ Cj} d(x, x') ) / (|Ci| · |Cj|)

It is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects (Manning, et al., 2008).
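The three linkage criteria can be summarized as small Python functions over the mismatch distance; the two tiny clusters below are illustrative toy data, not taken from the zoo dataset:

```python
from itertools import product

def d(x, y):
    """Mismatch distance between two equal-length feature vectors."""
    return sum(1 for a, b in zip(x, y) if a != b)

def single_linkage(ci, cj):
    """Distance between the closest pair of objects, one from each cluster."""
    return min(d(x, y) for x, y in product(ci, cj))

def complete_linkage(ci, cj):
    """Distance between the farthest pair of objects, one from each cluster."""
    return max(d(x, y) for x, y in product(ci, cj))

def average_linkage(ci, cj):
    """Sum of all between-cluster pair distances over the product of cluster sizes."""
    return sum(d(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))

ci = [[1, 0, 0], [1, 1, 0]]
cj = [[0, 1, 1], [0, 0, 1]]
print(single_linkage(ci, cj), complete_linkage(ci, cj), average_linkage(ci, cj))
# -> 2 3 2.5
```

As expected, the average lies between the single-linkage minimum and the complete-linkage maximum.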
Analysis
The analysis covers the outputs from both Weka and R-Rattle. To ease the analysis, we kept all of the output in the form of tables and later added the type labels to the tables to make the output readable.
Results
Using complete linkage, the misclassification rate is 15.8416%, the highest among the three linkage methods considered, as we will see. From Table 2, five of the 41 animals belonging to the mammals are misclassified as insects, while fish, birds, and invertebrates are almost entirely classified correctly; insects, amphibians, and reptiles are misclassified.
Table 2: Hierarchical Clustering/complete linkage
Actual type      Cluster 0    1    2    3    4    5    6
0 Mammal                36    0    0    0    5    0    0
1 Fish                   0   13    0    0    0    0    0
2 Bird                   0    0   20    0    0    0    0
3 Invertebrate           0    0    0    7    0    2    1
4 Insect                 0    0    0    0    0    8    0
5 Amphibian              0    4    0    0    0    0    0
6 Reptile                0    3    1    0    0    0    1
(rows: actual family type; columns: number of animals assigned to each cluster)
Moving to single linkage, the misclassification rate is 12.8713%. Mammals, fish, birds, and invertebrates are correctly classified, while animals belonging to the insect, amphibian, and reptile family types are misclassified.
Table 3: Hierarchical Clustering/single linkage
Actual type      Cluster 0    1    2    3    4    5    6
0 Mammal                41    0    0    0    0    0    0
1 Fish                   0   13    0    0    0    0    0
2 Bird                   0    0   20    0    0    0    0
3 Invertebrate           0    0    0    9    0    1    0
4 Insect                 0    0    0    8    0    0    0
5 Amphibian              0    0    0    0    4    0    0
6 Reptile                0    0    1    0    3    0    1
(rows: actual family type; columns: number of animals assigned to each cluster)
The last linkage method we used is average linkage, which also gives a misclassification rate of 12.8713%. Mammals, fish, birds, invertebrates, and insects are classified correctly; notice that the insect class is recovered perfectly for the first time among the three linkages.
Table 4: Hierarchical Clustering/average linkage
Actual type      Cluster 0    1    2    3    4    5    6
0 Mammal                40    0    0    0    0    1    0
1 Fish                   0   13    0    0    0    0    0
2 Bird                   0    0   20    0    0    0    0
3 Invertebrate           0    0    0    7    2    0    1
4 Insect                 0    0    0    0    8    0    0
5 Amphibian              0    4    0    0    0    0    0
6 Reptile                0    4    1    0    0    0    0
(rows: actual family type; columns: number of animals assigned to each cluster)
Comparison of variations between complete, single and average linkages
In cluster analysis, the diameter of a cluster is defined as the largest dissimilarity between any two of its observations (Rousseeuw & Kaufman, 1990). The diameters of the clusters produced by single, complete, and average linkage are listed below:
Diameter/ Single linkage
[1] 3.000000 3.464102 2.000000 2.828427 NA NA NA
Diameter/ Complete linkage
[1] 3.000000 2.236068 2.236068 2.000000 2.449490 2.828427 2.449490
Diameter/ Average linkage
[1] 2.828427 2.236068 3.000000 1.414214 2.000000 2.828427 2.449490
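The diameter definition above can be sketched as follows; returning None for a singleton cluster mirrors the NA values in the R output, and the toy points and plain Euclidean distance are illustrative assumptions:

```python
def euclid(x, y):
    """Plain Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def diameter(cluster, dist):
    """Largest pairwise dissimilarity within a cluster; None for clusters with
    fewer than two observations, like the NA diameters reported above."""
    pairs = [(x, y) for i, x in enumerate(cluster) for y in cluster[i + 1:]]
    if not pairs:
        return None
    return max(dist(x, y) for x, y in pairs)

print(diameter([[0, 0], [3, 0], [0, 4]], euclid))  # 5.0: the farthest pair is (3,0)-(0,4)
print(diameter([[1, 1]], euclid))                  # None: a singleton has no pairs
```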
The advantage of the complete linkage method over the single linkage method is that it does not produce the chaining phenomenon: the diameters of clusters 2 and 4 under single linkage are larger than the diameters of the same clusters under complete linkage. By the chaining phenomenon we mean that clusters formed through single-linkage clustering may be forced together because single elements are close to each other, producing one large cluster even though many of the elements in each cluster may be very distant from each other (Wikipedia, 2013). In addition, the diameters of clusters 5, 6, and 7 appear as NA in the single-linkage output. Since, as stated earlier, the diameter is the largest dissimilarity between any two observations of a cluster, and clusters 6 and 7 contain only one observation each, the algorithm fails to measure their diameters. The diameter of cluster 5 appears as NA because all observations of insects were classified wrongly as amphibians and reptiles. Complete linkage, on the other hand, tends to find compact clusters of approximately equal diameters and, as stated earlier, is sensitive to outliers, so it performs poorly as well.
The results in Table 2 show that complete linkage performs poorly. The algorithm succeeded in recovering only two clusters purely and failed to find the other five, because the maximum distance is computed in the same way across all clusters; the tendency toward roughly equal cluster diameters thus contributes greatly to this drawback of complete linkage. On the contrary, the average linkage method produces unequal diameters and better clustering output, since it strikes a compromise between the sensitivity of complete-linkage clustering to outliers and the chaining tendency of single-linkage clustering.
Analysis of the findings
Generally, our data has few observations; however, 41 observations of mammals is a fairly good number compared with the numbers of animals belonging to the reptile or amphibian types.
The complete linkage method succeeded in finding five clusters, namely mammals, fish, birds, invertebrates, and reptiles, but it does not perform well in grouping some of the animals correctly. Complete linkage also recorded the highest misclassification rate (see Table 2).
Mammals are grouped correctly into one cluster by the single linkage method (see Table 3); fish, birds, and invertebrates are also correctly clustered into their natural types. However, the other three types, insect, amphibian, and reptile, are misclassified, and their corresponding diameters appeared as NA in the output. Observations belonging to the insect family are instead misclassified as invertebrates and amphibians, and some observations belonging to reptiles are misclassified as insects. From these findings we can conclude that, according to the single linkage method, invertebrates and insects have more similarities than dissimilarities. The misclassified observations could also be due to the small number of animals in these types compared with the number of mammals. Moreover, single linkage discovered five clusters out of seven: mammals, fish, birds, invertebrates, and a cluster with only one observation from the reptiles. In the same context, the average linkage method also discovered five clusters: mammals, fish, birds, invertebrates, and insects. What is new with average linkage is that it perfectly discovered all the animals belonging to the insect type (Table 4).
It is also important to note that amphibians and reptiles seem to be particularly similar to each other, as they are usually clustered together. Comparing the linkage-based methods, animals belonging to the fish, bird, and invertebrate family types are discovered by all three linkage methods. Moreover, the same number of reptiles was identified by the complete and single linkage methods.
Though the differences between the single and average methods are small, it is up to the researcher to choose between them, especially since the misclassification rate of 12.8713% is exactly the same for both. However, average linkage does better in that it succeeded in grouping five clusters and missed only those clusters whose types have very few observations.
Conclusion
The aim of this study was basically to apply a clustering algorithm, specifically the hierarchical algorithm, to the zoo dataset and to compare the resulting clusters with the corresponding natural family types. Looking at the tables, we see that the complete linkage method produced the highest misclassification rate of the three methods. A substantial impact on accuracy is obtained by changing the linkage type in particular; we have seen that changing only the linkage type changes the accuracy, whereas the number of iterations has no influence, given the size of the dataset.
In the end, a hierarchical clustering algorithm that uses the Euclidean distance and the average linkage method to connect observations in the closest clusters succeeded in recognizing the real clusters of the zoo dataset. The algorithm is robust and can be recommended for the simplicity of its settings (number of clusters, number of iterations, type of distance, etc.).
Finally, further work could try a clustering algorithm designed for categorical data, such as the ROCK algorithm, which may improve the clustering accuracy.
Tables
Table 1: Count of the number of animals in each type
Table 2: Hierarchical Clustering / complete linkage
Table 3: Hierarchical Clustering / single linkage
Table 4: Hierarchical Clustering / average linkage
Bibliography
Forsyth, R., 1990. Machine Learning Repository. [Online]
Available at: http://archive.ics.uci.edu/ml/datasets/Zoo
[Accessed 20 04 2011].
Han, J. & Kamber, M., 2006. Data Mining: Concepts and Techniques. 2nd ed. San Francisco: Morgan Kaufmann.
Manning, C. D., Raghavan, P. & Schutze, H., 2008. Introduction to Information Retrieval. Cambridge: Cambridge
University Press.
Rousseeuw, P. & Kaufman, L., 1990. Finding Groups in Data. New York: Wiley.
Wikipedia, 2012. Clustering Analysis. [Online]
Available at: http://en.wikipedia.org/wiki/Cluster_analysis#Clustering_algorithms
[Accessed 20 April 2013].
Wikipedia, 2013. Complete-linkage clustering. [Online]
Available at: http://en.wikipedia.org/wiki/Complete-linkage_clustering
[Accessed 26 July 2013].
Zhu, X., 2010. Clustering. Advanced Natural Language Processing, 15 Springer, pp. 1-4.