732A31; Data Mining-Clustering and Association Analysis
Hierarchical Clustering of Multi-Class Data (The Zoo Dataset)
Linköping University
IDA; Division of statistics
Raid Mahbouba
raima062@student.liu.se
Introduction
In real life there are many animals of different family types, and non-zoologists are often unfamiliar with them; this is one of the main reasons animals are kept in zoos, where they can be identified. The main goal of this study is to use a clustering technique to categorize 101 animals by features such as hair, feathers, eggs, and milk, and to verify whether the resulting groups correspond to the animals' natural family types. Many clustering algorithms are appropriate for such a task, among them hierarchical algorithms and density-based algorithms. In identifying the right clusters of animals, we changed some of the algorithm's default settings, such as the number of iterations, the number of clusters, and the distance computation (the linkage), all of which might influence the final results. The setting we expect to have the most significant influence is the type of linkage; the results will demonstrate that the linkage type is a crucial factor in finding the true underlying clusters.
Though the problem is originally a classification problem, as described in the literature of the zoo database (Forsyth, 1990), my proposition is to use a hierarchical clustering algorithm and see how well it identifies these animal types from their features.
Previous research analysing this dataset with a clustering methodology was done by Neil Davey, Rod Adams, and Mary J. George, who utilized neural networks (Forsyth, 1990). In this study I use hierarchical clustering, since it is simple and robust to outliers. It has been noted that "clustering is in the eye of the beholder" (Wikipedia, 2012), which implies that clustering methods do not guarantee ultimate accuracy. I therefore do not expect, though I hope, to reach 100% accuracy when grouping objects into their natural clusters using the hierarchical approach.
Background
Machine learning is a growing branch of artificial intelligence concerned with the construction and study of systems that can learn from data. Different types of algorithms are used to analyze large datasets in machine learning, examples of which are classification, clustering, gradient boosting (gamboost), random forests, support vector machines, and penalized generalized linear models (glmnet). These algorithms are often grouped into two larger classes, namely supervised and unsupervised learning algorithms. In this project, attention is on a specific class of unsupervised learning algorithms, the clustering algorithms. Clustering is a dynamic field of research in data mining.
Initially, clustering algorithms were considered ineffective and inefficient in machine learning because handling large datasets was challenging; in recent years, however, these algorithms have evolved remarkably. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern-based methods), constraint-based methods, etc. (Jiawei Han, 2006).
Methodology
The methodology is based on the hierarchical clustering algorithm, which still requires adjusting a few settings in order to achieve accurate results. We need to know which of these settings are most appropriate for our task, and why.
Dataset
Datasets in cluster analysis can contain variables of several kinds: numerical (interval-scaled), binary, nominal, ordinal, and ratio-scaled variables, as well as variables of mixed types. The Zoo dataset is of mixed type: it consists of 15 binary variables and one categorical variable, plus the animal name and the class label.
I downloaded the dataset from the UCI Machine Learning Repository (Forsyth, 1990). Since the data is in a plain-text format not recognized by Weka and R/Rattle, I cleaned and processed it and transformed it into CSV format. I then converted the data into a data frame with 18 attributes: the animal name, 15 boolean (binary) variables, the categorical variable 'legs', and the class label 'type'. The boolean variables, taking values {0, 1}, are: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic, and catsize; the categorical variable 'legs' takes values {0, 2, 4, 5, 6, 8}, the number of legs of each animal. In the Weka output the seven family types are listed from 0 through 6: 0 mammal, 1 fish, 2 bird, 3 invertebrate, 4 insect, 5 amphibian, and 6 reptile. Cleaning the data also involved removing the label 'type' so that the algorithm handles the data blindly, meeting the unsupervised learning objective. Table 1 presents the count of each animal family type, which will later be compared against the obtained clusters.
Table 1: count of the number of animals in each type
No Type Count
1 Mammal 41
2 Fish 13
3 Bird 20
4 Invertebrate 10
5 Insect 8
6 Amphibian 4
7 Reptile 5
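The preparation step above, dropping the 'type' label before clustering, can be sketched with pandas (a toy, hypothetical slice of the data; the study's actual workflow used Weka and R/Rattle):

```python
import pandas as pd

# Toy slice of the Zoo data (hypothetical rows; the real CSV has 101 animals).
zoo = pd.DataFrame({
    "name":     ["aardvark", "bass", "chicken"],
    "hair":     [1, 0, 0],
    "feathers": [0, 0, 1],
    "eggs":     [0, 1, 1],
    "milk":     [1, 0, 0],
    "legs":     [4, 0, 2],
    "type":     [1, 4, 2],  # class label (1=mammal, 4=fish, 2=bird in the UCI coding)
})

# Remove the class label so the algorithm handles the data blindly,
# keeping it aside for the later comparison against the obtained clusters.
labels = zoo.pop("type")
features = zoo.set_index("name")
print(features.shape)  # (3, 5)
```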
Algorithm
The following steps show how the agglomerative hierarchical clustering algorithm works (Zhu, 2010):
1. Initially, each item x1, ..., xn is in its own cluster C1, ..., Cn.
2. Merge the nearest pair of clusters, say Ci and Cj.
3. Repeat until only one cluster is left, or stop at a given threshold.
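These steps can be sketched in plain Python (an illustrative toy, not the Weka/R implementation used in the study), taking the dissimilarity between two objects to be the number of mismatching attributes, as described below, and merging by the closest pair (single linkage):

```python
def mismatch(x, y):
    """Number of attributes on which the two objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def single_link(ci, cj):
    """Single-linkage distance: closest pair across the two clusters."""
    return min(mismatch(x, y) for x in ci for y in cj)

def agglomerate(items, stop_at=1):
    clusters = [[x] for x in items]      # step 1: every item in its own cluster
    while len(clusters) > stop_at:       # step 3: repeat until the threshold
        # step 2: find and merge the nearest pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters

# Four toy binary feature vectors (hypothetical, not real Zoo rows).
animals = [(1, 0, 0, 1), (1, 0, 1, 1), (0, 1, 1, 0), (0, 1, 0, 0)]
print(agglomerate(animals, stop_at=2))
# [[(1, 0, 0, 1), (1, 0, 1, 1)], [(0, 1, 1, 0), (0, 1, 0, 0)]]
```

Stopping at a smaller `stop_at` value corresponds to cutting the cluster tree at a higher level.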
The result is a cluster tree (a dendrogram); one can cut the tree at any level to produce different clusterings. The nearness of clusters or objects is measured with a distance measure, which also indicates the similarity of the clusters. In our case the attributes are of mixed types, binary and categorical, and since a categorical variable is a generalization of a binary variable in that it can take on more than two states (Jiawe & Micheline, 2006), the distance is computed from the number of mismatches among the attributes of the objects being compared. For example, if m is the number of matches and p the total number of attributes, then the distance between two objects x and x′ is d(x, x′) = p − m. This distance can be regarded as the square of the Euclidean distance between the objects, since for binary variables each mismatching attribute contributes 1 and each matching attribute contributes 0.
In the second step of the algorithm, objects are merged together to form clusters. The merge is based on a linkage criterion, and we consider three different types of linkage:
Single linkage
In Euclidean single linkage, the algorithm merges the two clusters Ci and Cj whose closest members x and x′ are nearest:

    d(Ci, Cj) = min { d(x, x′) : x ∈ Ci, x′ ∈ Cj }

Single linkage is equivalent to the minimum spanning tree algorithm and tends to produce long and skinny clusters.
The single-link merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other; other, more distant parts of the clusters, and the clusters' overall structure, are not taken into account (Manning, et al., 2008).
Complete Linkage
Euclidean complete linkage works opposite to single linkage, considering the farthest pair x and x′:

    d(Ci, Cj) = max { d(x, x′) : x ∈ Ci, x′ ∈ Cj }

Clusters tend to be compact and roughly equal in diameter.
The complete-link merge criterion is non-local: the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but it also causes sensitivity to outliers. A single object far from the center can increase the diameters of candidate merge clusters dramatically and completely change the final clustering (Manning, et al., 2008).
Average Linkage
Average linkage sums the distances between the objects of the two clusters and divides by the product of the cluster sizes:

    d(Ci, Cj) = ( Σ_{x ∈ Ci} Σ_{x′ ∈ Cj} d(x, x′) ) / ( |Ci| · |Cj| )

It is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects (Manning, et al., 2008).
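The three criteria differ only in how they aggregate the pairwise distances between members of the two clusters. A plain-Python sketch (toy binary vectors; mismatch counts as distances; helper names are illustrative):

```python
from itertools import product

def mismatch(x, y):
    """Dissimilarity d(x, x') = number of mismatching attributes."""
    return sum(a != b for a, b in zip(x, y))

def single(ci, cj):
    # Closest pair across the two clusters.
    return min(mismatch(x, y) for x, y in product(ci, cj))

def complete(ci, cj):
    # Farthest pair across the two clusters.
    return max(mismatch(x, y) for x, y in product(ci, cj))

def average(ci, cj):
    # Sum of all between-cluster distances over the product of cluster sizes.
    return sum(mismatch(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))

ci = [(1, 0, 0), (1, 1, 0)]
cj = [(0, 1, 1), (0, 0, 1)]
print(single(ci, cj), complete(ci, cj), average(ci, cj))  # 2 3 2.5
```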
Analysis
The analysis covers outputs from both Weka and R/Rattle. To ease the analysis, we kept all the output in the form of tables and later added the type labels to make the output readable.
Results
Using complete linkage, the misclassification rate is 15.8416%, the highest among the three linkage methods considered, as we will see. From Table 2, five of the 41 mammals are misclassified as insects, while fish, birds, and invertebrates are almost entirely classified correctly; insects, amphibians, and reptiles are misclassified.
Table 2: Hierarchical Clustering/complete linkage
Mammal Fish Bird Invertebrate Insect Amphibian Reptile classified as
0 1 2 3 4 5 6
36 0 0 0 5 0 0 0
0 13 0 0 0 0 0 1
0 0 20 0 0 0 0 2
0 0 0 7 0 2 1 3
0 0 0 0 0 8 0 4
0 4 0 0 0 0 0 5
0 3 1 0 0 0 1 6
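One way to reproduce the reported 15.8416% from Table 2 is to read the rows as the true family types (their sums match the counts in Table 1) and the columns as clusters 0 through 6, then assign each cluster a distinct type so that the number of correctly grouped animals is maximized; the one-to-one mapping mirrors how Weka's classes-to-clusters evaluation behaves here. A brute-force sketch:

```python
from itertools import permutations

# Rows: true types (mammal .. reptile); columns: clusters 0-6 (Table 2).
table2 = [
    [36, 0, 0, 0, 5, 0, 0],
    [0, 13, 0, 0, 0, 0, 0],
    [0, 0, 20, 0, 0, 0, 0],
    [0, 0, 0, 7, 0, 2, 1],
    [0, 0, 0, 0, 0, 8, 0],
    [0, 4, 0, 0, 0, 0, 0],
    [0, 3, 1, 0, 0, 0, 1],
]

total = sum(map(sum, table2))
# Try every one-to-one assignment of types to clusters and keep the one
# that maximizes the number of correctly grouped animals (7! = 5040 candidates).
best = max(sum(table2[perm[c]][c] for c in range(7))
           for perm in permutations(range(7)))
print(round(100 * (total - best) / total, 4))  # 15.8416
```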
Moving to single linkage, the misclassification rate is 12.8713%. Mammals, fish, birds, and invertebrates are correctly classified, while animals belonging to the insect, amphibian, and reptile family types are misclassified.
Table 3: Hierarchical Clustering/single linkage
Mammal Fish Bird Invertebrate Insect Amphibian Reptile classified as
0 1 2 3 4 5 6
41 0 0 0 0 0 0 0
0 13 0 0 0 0 0 1
0 0 20 0 0 0 0 2
0 0 0 9 0 1 0 3
0 0 0 8 0 0 0 4
0 0 0 0 4 0 0 5
0 0 1 0 3 0 1 6
The last linkage method used is average linkage, which also gives a misclassification rate of 12.8713%. Mammals, fish, birds, invertebrates, and insects are classified correctly; notice that the insect class is recovered in full here for the first time.
Table 4: Hierarchical Clustering/average linkage
Mammal Fish Bird Invertebrate Insect Amphibian Reptile classified as
0 1 2 3 4 5 6
40 0 0 0 0 1 0 0
0 13 0 0 0 0 0 1
0 0 20 0 0 0 0 2
0 0 0 7 2 0 1 3
0 0 0 0 8 0 0 4
0 4 0 0 0 0 0 5
0 4 1 0 0 0 0 6
Comparison of variations between complete, single and average linkages
In cluster analysis we define the diameter of a cluster to be the largest dissimilarity between any two of its observations (Rousseeuw & Kaufman, 1990). The diameters of the clusters produced by single, complete, and average linkage are shown below:
Diameter/ Single linkage
[1] 3.000000 3.464102 2.000000 2.828427 NA NA NA
Diameter/ Complete linkage
[1] 3.000000 2.236068 2.236068 2.000000 2.449490 2.828427 2.449490
Diameter/ Average linkage
[1] 2.828427 2.236068 3.000000 1.414214 2.000000 2.828427 2.449490
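The values above appear to be on the Euclidean scale, i.e. the square root of the mismatch count (for instance 2.828427 ≈ sqrt(8), 3.464102 ≈ sqrt(12)). The diameter computation can be sketched as follows, returning None for singleton clusters to mirror the NA entries in the R output (toy vectors, illustrative names):

```python
import math

def euclid(x, y):
    # On 0/1 attributes, the Euclidean distance is the square root
    # of the number of mismatching attributes.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def diameter(cluster):
    """Largest dissimilarity between any two members of a cluster;
    None (NA) when the cluster has fewer than two observations."""
    if len(cluster) < 2:
        return None
    return max(euclid(x, y)
               for i, x in enumerate(cluster)
               for y in cluster[i + 1:])

print(diameter([(1, 0, 0, 1), (1, 1, 0, 0), (0, 1, 1, 0)]))  # 2.0
print(diameter([(1, 0, 0, 1)]))                              # None
```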
The advantage of the complete linkage method over the single linkage method is that it does not produce the chaining phenomenon, reflected in the diameters of clusters 1 through 4 under single linkage, which are at least as large as the equivalent diameters under complete linkage. By the chaining phenomenon we mean that clusters formed through single-linkage clustering may be forced together because single elements happen to be close to each other, forming a large cluster even though many of the elements in each cluster may be very distant from one another (Wikipedia, 2013). In addition, the diameters of clusters 5, 6, and 7 appear as NA under single linkage. As stated earlier, the diameter is the largest dissimilarity between any two observations in a cluster; clusters 6 and 7 contain only one observation each, so the algorithm cannot measure their diameters. The diameter of cluster 5 appears as NA because all insect observations were classified wrongly as amphibians and reptiles. However, complete linkage tends to find compact clusters of approximately equal diameter, and as stated earlier it is sensitive to outliers, so it too performs poorly.
The results in Table 2 show that complete linkage performs poorly: the algorithm succeeded in grouping only two clusters purely and failed to find the other five, because the maximum distance is computed equally among all clusters. The characteristic of roughly equal cluster diameters thus contributes greatly to this drawback of complete linkage. On the contrary, the average linkage method yields unequal diameters and better clustering output, since it strikes a compromise between the sensitivity of complete-linkage clustering to outliers and the chaining tendency of single-linkage clustering.
Analysis of the findings
Generally, our data has few observations; however, the 41 mammal observations are a fairly large number compared to the counts of reptiles or amphibians.
The complete linkage method succeeded in finding five clusters, namely mammals, fish, birds, invertebrates, and reptiles. But it does not perform well in grouping some of the animals correctly, and it recorded the highest misclassification rate (see Table 2).
Mammals are grouped correctly into one cluster by the single linkage method (see Table 3); fish, birds, and invertebrates are also correctly clustered into their natural types. However, the other three classes, insects, amphibians, and reptiles, are misclassified, and their equivalent diameters appeared as NA in the output. Observations belonging to the insect family are instead misclassified as invertebrates and amphibians, and some observations belonging to reptiles are misclassified as insects. From these findings we can conclude that, according to the single linkage method, invertebrates and insects have more similarities than dissimilarities. The misclassified observations could also be due to the small number of animals in these classes compared to the number of mammals. Moreover, single linkage discovered five clusters out of seven: mammals, fish, birds, invertebrates, and only one observation from reptiles. In the same context, the average linkage method also discovered five clusters: mammals, fish, birds, invertebrates, and insects. What is new in average linkage is that it perfectly recovers all the animals belonging to the insect type (Table 4).
It is also important to note that amphibians and reptiles seem particularly homogeneous, as they are usually clustered together. Comparing the linkage-based methods, animals belonging to the fish, bird, and invertebrate family types are discovered by all three linkage methods. Moreover, the same number of reptiles is identified by the complete and single linkage methods.
Though the differences between the single and average methods are small, and it is up to the researcher to choose between them, especially since the misclassification rate of 12.8713% is exactly the same in both, the average linkage does better: it succeeded in grouping five clusters and missed only those types with very few observations.
Conclusion
The aim of this study was to apply a clustering algorithm, specifically the hierarchical algorithm, to the Zoo dataset and compare the resulting clusters with the corresponding natural family types. Looking at the tables, the complete linkage method produced the highest misclassification rate of the three methods. A substantial impact on accuracy is obtained by changing the linkage type in particular: changing only the linkage type changes the accuracy, whereas the number of iterations has no influence, given the size of the dataset.
In the end, the hierarchical clustering algorithm using Euclidean distance and average linkage to connect observations in the closest clusters succeeded in recognizing the real clusters of the Zoo dataset. The algorithm is robust and recommended because of the simplicity of its settings (number of clusters, number of iterations, type of distance, etc.).
Finally, further work could explore better clustering algorithms, such as the ROCK algorithm, which may improve the accuracy of clustering.
Tables
Table 1: Count of the number of animals in each type
Table 2: Hierarchical Clustering / complete linkage
Table 3: Hierarchical Clustering / single linkage
Table 4: Hierarchical Clustering / average linkage
Bibliography
Forsyth, R., 1990. Machine Learning Repository. [Online]
Available at: http://archive.ics.uci.edu/ml/datasets/Zoo
[Accessed 20 04 2011].
Jiawe, H. & Micheline, K., 2006. Data Mining: Concepts and Techniques. 2nd ed. Morgan Kaufmann.
Manning, C. D., Raghavan, P. & Schutze, H., 2008. Introduction to Information Retrieval. Cambridge: Cambridge
University Press.
Rousseeuw, P. & Kaufman, L., 1990. Finding Groups in Data. New York: Wiley.
Wikipedia, 2012. Clustering Analysis. [Online]
Available at: http://en.wikipedia.org/wiki/Cluster_analysis#Clustering_algorithms
[Accessed 20 April 2013].
Wikipedia, 2013. Complete-linkage clustering. [Online]
Available at: http://en.wikipedia.org/wiki/Complete-linkage_clustering
[Accessed 26 July 2013].
Zhu, X., 2010. Clustering. Advanced Natural Language Processing, 15 Springer, pp. 1-4.
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...Product School
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsExpeed Software
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2DianaGray10
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIES VE
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Julian Hyde
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyJohn Staveley
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 

Recently uploaded (20)

Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
SOQL 201 for Admins & Developers: Slice & Dice Your Org’s Data With Aggregate...
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
In-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT ProfessionalsIn-Depth Performance Testing Guide for IT Professionals
In-Depth Performance Testing Guide for IT Professionals
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 

It was noted that "Clustering is in the eye of the beholder" (Wikipedia, 2012), which implies that clustering methods do not give ultimate accuracy. I therefore do not expect, though I do hope, to get 100% accuracy when grouping the objects into their natural clusters using the hierarchical approach.

Background

Machine learning is a growing branch of artificial intelligence concerned with the construction and study of systems that can learn from data. Many types of algorithms are used to analyse large datasets in machine learning, for example classification, clustering, gradient boosting (gamboost), random forests, support vector machines, regularized generalized linear models (glmnet), etc. These algorithms are often grouped into two larger classes: supervised and unsupervised learning algorithms. This project concentrates on a specific class of unsupervised learning algorithms, namely the clustering algorithms.

Clustering is a dynamic field of research in data mining. Initially, clustering algorithms were considered ineffective and inefficient in machine learning because handling large datasets was challenging; in recent years, however, these algorithms have evolved remarkably. Furthermore, clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent-pattern-based methods), constraint-based methods, etc. (Jiawei Han, 2006).

Methodology

The methodology is based on the hierarchical clustering algorithm, which still requires adjusting a few settings in order to achieve accurate results. We only need to know which of these settings are most appropriate for our task, and why.

Dataset

Datasets in cluster analysis can take any of the following forms: numerical (interval-scaled) variables, binary variables, nominal, ordinal, and ratio variables, and variables of mixed types. The zoo dataset is of mixed type: it consists of 16 binary variables and one categorical variable. I downloaded the dataset from the UCI Machine Learning Repository (Forsyth, 1990). Since the data comes in a text format that neither Weka nor R/Rattle recognizes, I cleaned and processed the dataset and transformed it into CSV format. I also converted the data into a data frame with 18 attributes, 16 of which are boolean, i.e. binary, variables. The boolean variables, taking values {0, 1}, are: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic and catsize; the categorical variable 'legs' takes values {0, 2, 4, 5, 6, 8}, corresponding to the number of legs of each animal. In the Weka output we will see the seven family types, listed from 0 through 6: 0-mammal, 1-fish, 2-bird, 3-invertebrate, 4-insect, 5-amphibian and 6-reptile. The dataset also has the variable 'type', which holds these seven family types; cleaning the data therefore includes removing the 'type' label so that the algorithm handles the data blindly, in keeping with the unsupervised learning objective.
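The label-removal step described above can be sketched as follows. This is a minimal illustration rather than the actual Weka/R-Rattle pipeline: the helper name and the three in-memory rows are mine (values modelled on the UCI zoo file), and the point is simply that the 'type' column is stripped before clustering.

```python
# Strip the class label 'type' so that only the features reach the
# clustering algorithm, in keeping with the unsupervised setting.

ATTRIBUTES = ["hair", "feathers", "eggs", "milk", "airborne", "aquatic",
              "predator", "toothed", "backbone", "breathes", "venomous",
              "fins", "legs", "tail", "domestic", "catsize"]

# (animal name, 16 feature values, type label) -- illustrative rows only.
raw_rows = [
    ("aardvark", [1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1], 1),
    ("bass",     [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0], 4),
    ("chicken",  [0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 0], 2),
]

def drop_type_label(rows):
    """Return one feature dict per animal, with the 'type' column removed."""
    return [dict(zip(ATTRIBUTES, feats)) for _, feats, _ in rows]

features = drop_type_label(raw_rows)
print(len(features), features[0]["legs"])  # 3 4
```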
Table 1 presents the summary counts for each animal family type, which will later be compared against the obtained clusters.

Table 1: count of the number of animals in each type

No  Type          Count
1   Mammal        41
2   Fish          13
3   Bird          20
4   Invertebrate  10
5   Insect        8
6   Amphibian     4
7   Reptile       5

Algorithm

The following steps show how the agglomerative hierarchical clustering algorithm works (Zhu, 2010):

1. Initially each item x1, ..., xn is in its own cluster C1, ..., Cn.
2. Merge the nearest clusters, say Ci and Cj.
3. Repeat until there is only one cluster left, or stop at a given threshold.

The result is a cluster tree; one can cut the tree at any level to produce different clusterings. The notion of nearness of clusters or objects is defined by a distance measure, which also indicates the similarity
of the clusters. In our case the attributes are of mixed type, binary and categorical, and since a categorical variable is a generalization of a binary variable in that it can take on more than two states (Jiawe & Micheline, 2006), the distance is computed as the number of mismatches among the attributes of the objects being compared (Jiawe & Micheline, 2006). For example, if m is the number of matches and p the total number of attributes, then the distance between two objects x and x' is d(x, x') = p - m. For the binary variables this distance can be regarded as the square of the Euclidean distance between the objects, since each mismatching attribute contributes 1 and each matching attribute 0.

In the second step of the algorithm objects are merged together in order to form clusters. This merge is based on a linkage criterion computed over the distance, and we consider three different types of linkage:

Single linkage

Under single linkage, the algorithm takes the closest x and x' from Ci and Cj respectively when merging clusters:

    d(Ci, Cj) = min { d(x, x') : x in Ci, x' in Cj }

Single linkage is equivalent to the minimum spanning tree algorithm and tends to produce long and skinny clusters. The single-link merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other, while other, more distant parts of the clusters and the clusters' overall structure are not taken into account (Manning, et al., 2008).

Complete linkage

Complete linkage, in contrast, considers the farthest x and x'. Clusters tend to be compact and roughly equal in diameter. The complete-link merge criterion is non-local: the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but it also causes sensitivity to outliers.
A single document far from the center can increase the diameters of candidate merge clusters dramatically and completely change the final clustering (Manning, et al., 2008). The complete-link criterion is

    d(Ci, Cj) = max { d(x, x') : x in Ci, x' in Cj }

Average linkage

Average linkage takes the sum of the distances between all pairs of objects drawn from the two clusters, divided by the product of the numbers of objects in each cluster:

    d(Ci, Cj) = ( sum over x in Ci, x' in Cj of d(x, x') ) / ( |Ci| * |Cj| )

It is a compromise between the sensitivity of complete-linkage clustering to outliers and the tendency of single-linkage clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects (Manning, et al., 2008).
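Putting the pieces together, the three algorithm steps, the mismatch distance d(x, x') = p - m, and the three linkage criteria can be sketched as follows. This is an illustrative transcription only (all function names and toy data are mine), not the Weka implementation used in the study.

```python
# Mismatch distance from the text: with p attributes and m matches,
# d(x, x') = p - m (for 0/1 attributes this equals squared Euclidean distance).
def mismatch(x, y):
    return sum(a != b for a, b in zip(x, y))

# The three merge criteria over the mismatch distance.
def single(ci, cj):    # closest pair of members
    return min(mismatch(x, y) for x in ci for y in cj)

def complete(ci, cj):  # farthest pair of members
    return max(mismatch(x, y) for x in ci for y in cj)

def average(ci, cj):   # sum of pairwise distances / (|Ci| * |Cj|)
    return sum(mismatch(x, y) for x in ci for y in cj) / (len(ci) * len(cj))

# The three algorithm steps: singletons, merge the nearest pair, stop at a threshold.
def agglomerate(points, linkage, threshold):
    clusters = [[p] for p in points]                 # step 1: singleton clusters
    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > threshold:                            # step 3: stop criterion
            break
        clusters[i] += clusters.pop(j)               # step 2: merge nearest pair
    return clusters

# Toy binary profiles: two "mammal-like" and two "bird-like" rows.
toy = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 0, 0), (0, 1, 0, 1)]
print(agglomerate(toy, single, threshold=1))   # two clusters remain
print(single(toy[:2], toy[2:]), complete(toy[:2], toy[2:]), average(toy[:2], toy[2:]))
```

Cutting at a larger threshold merges everything into one cluster, which corresponds to cutting the cluster tree at a higher level.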
Analysis

The analysis covers output from both Weka and R/Rattle. To ease the analysis we kept all the output in the form of tables and later added the type labels to make the output readable.

Results

Using complete linkage, the misclassification rate is 15.8416%, the highest among the three linkage methods considered, as we will see. Table 2 shows that five of the 41 mammals are misclassified as insects, while fish, birds and invertebrates are almost all classified correctly; insects, amphibians and reptiles are misclassified.

Table 2: Hierarchical clustering / complete linkage
(rows: true type; columns: cluster)

Type               0   1   2   3   4   5   6
Mammal (0)        36   0   0   0   5   0   0
Fish (1)           0  13   0   0   0   0   0
Bird (2)           0   0  20   0   0   0   0
Invertebrate (3)   0   0   0   7   0   2   1
Insect (4)         0   0   0   0   0   8   0
Amphibian (5)      0   4   0   0   0   0   0
Reptile (6)        0   3   1   0   0   0   1

Moving on to single linkage, the misclassification rate is 12.8713%. Mammals, fish, birds and invertebrates are correctly classified, while animals belonging to the insect, amphibian and reptile family types are misclassified.

Table 3: Hierarchical clustering / single linkage
(rows: true type; columns: cluster)

Type               0   1   2   3   4   5   6
Mammal (0)        41   0   0   0   0   0   0
Fish (1)           0  13   0   0   0   0   0
Bird (2)           0   0  20   0   0   0   0
Invertebrate (3)   0   0   0   9   0   1   0
Insect (4)         0   0   0   8   0   0   0
Amphibian (5)      0   0   0   0   4   0   0
Reptile (6)        0   0   1   0   3   0   1
The last linkage method we used is average linkage, which also yields a misclassification rate of 12.8713%. Mammals, fish, birds, invertebrates and insects are classified correctly; notice that the insect class is discovered perfectly here for the first time.

Table 4: Hierarchical clustering / average linkage
(rows: true type; columns: cluster)

Type               0   1   2   3   4   5   6
Mammal (0)        40   0   0   0   0   1   0
Fish (1)           0  13   0   0   0   0   0
Bird (2)           0   0  20   0   0   0   0
Invertebrate (3)   0   0   0   7   2   0   1
Insect (4)         0   0   0   0   8   0   0
Amphibian (5)      0   4   0   0   0   0   0
Reptile (6)        0   4   1   0   0   0   0
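The quoted misclassification rates can be reproduced from the class-by-cluster tables. Weka's exact classes-to-clusters mapping procedure is not spelled out in the text; the sketch below assumes a greedy one-to-one assignment of clusters to types (largest cell first), which does reproduce the reported figures for these tables.

```python
# Reproduce the misclassification rates from the class-by-cluster tables,
# assuming a greedy one-to-one assignment of clusters to types.

def misclassification_rate(matrix):
    """matrix[c][k] = number of animals of true type c placed in cluster k."""
    cells = sorted(((matrix[c][k], c, k)
                    for c in range(len(matrix))
                    for k in range(len(matrix[0]))), reverse=True)
    used_types, used_clusters, correct = set(), set(), 0
    for n, c, k in cells:                 # largest cells claim their pairing first
        if c not in used_types and k not in used_clusters:
            used_types.add(c)
            used_clusters.add(k)
            correct += n
    total = sum(map(sum, matrix))
    return 100.0 * (total - correct) / total

# Rows: mammal, fish, bird, invertebrate, insect, amphibian, reptile.
complete_link = [[36, 0, 0, 0, 5, 0, 0],
                 [0, 13, 0, 0, 0, 0, 0],
                 [0, 0, 20, 0, 0, 0, 0],
                 [0, 0, 0, 7, 0, 2, 1],
                 [0, 0, 0, 0, 0, 8, 0],
                 [0, 4, 0, 0, 0, 0, 0],
                 [0, 3, 1, 0, 0, 0, 1]]
single_link = [[41, 0, 0, 0, 0, 0, 0],
               [0, 13, 0, 0, 0, 0, 0],
               [0, 0, 20, 0, 0, 0, 0],
               [0, 0, 0, 9, 0, 1, 0],
               [0, 0, 0, 8, 0, 0, 0],
               [0, 0, 0, 0, 4, 0, 0],
               [0, 0, 1, 0, 3, 0, 1]]

print(round(misclassification_rate(complete_link), 4))  # 15.8416
print(round(misclassification_rate(single_link), 4))    # 12.8713
```

Under the same greedy mapping, the Table 4 matrix likewise gives 12.8713%.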
Comparison of variations between complete, single and average linkage

In cluster analysis we define the diameter of a cluster as the largest dissimilarity between any two of its observations (Rousseeuw & Kaufman, 1990). The diameters of the clusters produced by single, complete and average linkage are shown below:

Diameter / single linkage
[1] 3.000000 3.464102 2.000000 2.828427       NA       NA       NA

Diameter / complete linkage
[1] 3.000000 2.236068 2.236068 2.000000 2.449490 2.828427 2.449490

Diameter / average linkage
[1] 2.828427 2.236068 3.000000 1.414214 2.000000 2.828427 2.449490

The advantage of the complete-linkage method over the single-linkage method is that it does not produce the chaining phenomenon: the diameters of clusters 1-4 under single linkage tend to be larger than the diameters of the corresponding clusters under complete linkage. By the chaining phenomenon we mean that clusters formed through single-linkage clustering may be forced together because single elements happen to be close to each other, producing one large cluster even though many of its elements may be very distant from one another (Wikipedia, 2013). The diameters of clusters 6 and 7 appear as NA in the single-linkage output because, as stated earlier, the diameter is the largest dissimilarity between any two observations, and these clusters contain only one observation each, so the algorithm cannot compute a diameter for them. The diameter of cluster 5 also appears as NA because all observations of insects were classified wrongly as amphibians and reptiles. Complete linkage, however, tends to find compact clusters of approximately equal diameter and, as stated earlier, is sensitive to outliers, so it performs poorly as well; the results in Table 2 confirm this.
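The diameter definition above is easy to reproduce. A small sketch (the function name and toy points are mine) that mirrors the NA behaviour of the R output for clusters with a single observation:

```python
import math

def diameter(cluster, d):
    """Largest dissimilarity between any two observations in the cluster;
    None (the NA in the R output) when there are fewer than two observations."""
    if len(cluster) < 2:
        return None
    return max(d(x, y) for i, x in enumerate(cluster) for y in cluster[i + 1:])

euclid = math.dist  # Euclidean distance between coordinate tuples (Python 3.8+)
print(diameter([(0, 0), (1, 1), (2, 2)], euclid))  # sqrt(8), about 2.828427
print(diameter([(5, 5)], euclid))                  # None: singleton cluster
```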
The algorithm succeeded in grouping only two clusters purely and failed to find the other five, because the maximum distance is computed equally among all clusters; the characteristic of roughly equal diameters has contributed greatly to this drawback of complete linkage. On the contrary, the average-linkage method produces unequal diameters and better clustering output, since it strives for a compromise between the sensitivity of complete-linkage clustering to outliers and the chaining tendency of single-linkage clustering.

Analysis of the findings

Generally, our data has few observations; still, the 41 observations of mammals form a fairly large group compared with the numbers of animals belonging to the reptile or amphibian types. The complete-linkage method succeeded in finding five clusters, namely mammals, fish, birds, invertebrates and reptiles, but it does not perform well in grouping some of the animals correctly, and it recorded the highest misclassification rate (see Table 2). According to the single-linkage method (see Table 3), mammals are grouped correctly into one cluster; fish, birds and invertebrates are also correctly clustered into their natural types. However, the other three types, insects, amphibians and reptiles, are misclassified, and their corresponding diameters appear as NA in the output. Instead, observations belonging to the insect family are
misclassified as invertebrates and amphibians, and some observations belonging to reptiles are misclassified as insects. From these findings we can conclude that, according to the single-linkage method, invertebrates and insects have more similarities than dissimilarities. The misclassified observations could also be due to the small numbers of animals in these types compared with the number of mammals. Moreover, single linkage discovered five clusters out of seven: mammals, fish, birds, invertebrates and only one observation from the reptiles.

In the same context, the average-linkage method also discovered five clusters: mammals, fish, birds, invertebrates and insects. What is new with average linkage is that it discovers, perfectly, all of the animals belonging to the insect type (Table 4). It is also important to note that amphibians and reptiles seem to be particularly homogeneous, as they are usually clustered together.

Comparing the linkage-based methods, animals belonging to the fish, bird and invertebrate family types are discovered by all three linkage methods. Moreover, the same number of reptiles is identified by the complete- and single-linkage methods. Since the differences between the single and average methods are small, it is up to the researcher to choose between them, especially as the misclassification rate of 12.8713% is exactly the same for both. Still, average linkage does better in that it succeeded in grouping five clusters and missed only those clusters whose corresponding types have very few observations.

Conclusion

The aim of this study was to apply a clustering algorithm, specifically the hierarchical algorithm, to the zoo dataset and to compare the resulting clusters with the corresponding natural family types.
Looking at the tables, we see that the complete-linkage method produced the highest misclassification rate of the three methods. A substantial impact on accuracy is obtained by changing the linkage type in particular: changing only the linkage type changes the accuracy, whereas the number of iterations has no influence, owing to the size of the dataset. In the end, a hierarchical clustering algorithm that uses the Euclidean distance together with the average-linkage method to connect observations in the closest clusters succeeded in recognizing the real clusters of the zoo dataset. The algorithm is robust and can be recommended because of the simplicity of its settings (number of clusters, number of iterations, type of distance, etc.). Finally, further work could explore better clustering algorithms, such as the ROCK algorithm, that may improve the clustering accuracy.
Tables

Table 1: Count of the number of animals in each type
Table 2: Hierarchical clustering / complete linkage
Table 3: Hierarchical clustering / single linkage
Table 4: Hierarchical clustering / average linkage

Bibliography

Forsyth, R., 1990. Machine Learning Repository. [Online] Available at: http://archive.ics.uci.edu/ml/datasets/Zoo [Accessed 20 04 2011].
Jiawe, H. & Micheline, K., 2006. Data Mining: Concepts and Techniques. 2nd ed. San Francisco: Morgan Kaufmann.
Manning, C. D., Raghavan, P. & Schütze, H., 2008. Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Rousseeuw, P. & Kaufman, L., 1990. Finding Groups in Data. New York: Wiley.
Wikipedia, 2012. Cluster analysis. [Online] Available at: http://en.wikipedia.org/wiki/Cluster_analysis#Clustering_algorithms [Accessed 20 April 2013].
Wikipedia, 2013. Complete-linkage clustering. [Online] Available at: http://en.wikipedia.org/wiki/Complete-linkage_clustering [Accessed 26 July 2013].
Zhu, X., 2010. Clustering. Advanced Natural Language Processing, pp. 1-4.