# Hierarchical clustering of multi class data (the zoo dataset)

Data mining project
The main goal of this study is to group 101 animals into their natural family types using various features of animals and by utilizing Hierarchical clustering algorithm which is one of the unsupervised learning algorithms.



732A31: Data Mining - Clustering and Association Analysis
Hierarchical Clustering of Multi-Class Data (The Zoo Dataset)
Linköping University, IDA, Division of Statistics
Raid Mahbouba (raima062@student.liu.se)

### Introduction

In real life there are many animals of different family types, and non-zoologists are often unfamiliar with them; this is one of the main reasons animals are kept in zoos, where they can be identified. The main goal of this study is to use a clustering technique to group 101 animals by various features such as hair, feathers, eggs, and milk, and to verify whether the resulting groups correspond to the animals' natural family types. Many clustering algorithms are suited to such a task, among them hierarchical algorithms and density-based algorithms.

Furthermore, to identify the right clusters of animals, we changed some of the algorithm's default settings, such as the number of iterations, the number of clusters, and the distance computation between clusters (the linkage), all of which may influence the final results. The setting we expect to have the most significant influence is the type of linkage used; this study will demonstrate that the choice of linkage is a crucial factor in finding the true existing clusters.

Although the problem is originally a classification problem, as described in the documentation of the zoo database (Forsyth, 1990), my proposition is to use a hierarchical clustering algorithm and see how well it identifies the animals' types from their features. Previous research that analysed this dataset with a clustering methodology was done by Neil Davey, Rod Adams, and Mary J. George, who used neural networks (Forsyth, 1990). In this study I use hierarchical clustering, since it is simple and robust to outliers.
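To illustrate why the linkage matters (this sketch is not from the report; it assumes SciPy is available), the same two groups of points merge at different heights depending on the linkage used:

```python
# Illustrative sketch: two obvious 1-D groups, {0, 1} and {10, 13},
# merge at a different height under each linkage method.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[0.0], [1.0], [10.0], [13.0]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)
    # The last row of Z records the final merge; column 2 is its height.
    print(method, Z[-1, 2])
```

Single linkage reports the closest pair between the two groups (9), complete linkage the farthest pair (13), and average linkage the mean of all between-group pairwise distances (11); cutting the tree at a fixed height can therefore yield different clusters for different linkages.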
It was noted that "clustering is in the eye of the beholder" (Wikipedia, 2012), which implies that clustering methods do not give ultimate accuracy. I therefore do not expect, though I do hope, to achieve 100% accuracy when grouping objects into their natural clusters using the hierarchical approach.

### Background

Machine learning is a growing branch of artificial intelligence concerned with the construction and study of systems that can learn from data. Many types of algorithms are used to analyse large datasets in machine learning, for example classification, clustering, gradient boosting (gamboost), random forests, support vector machines, and penalized generalized linear models (glmnet). These algorithms are often grouped into two larger classes: supervised and unsupervised learning algorithms. In this project the focus is on a specific class of unsupervised learning algorithms, namely clustering algorithms.

Clustering is a dynamic field of research in data mining. Initially, clustering algorithms were considered ineffective and inefficient in machine learning because handling large datasets was challenging; in recent years, however, these algorithms have evolved remarkably. Furthermore, clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern-based methods), constraint-based methods, etc. (Jiawei Han, 2006).

### Methodology

The methodology is based on the hierarchical clustering algorithm, which still requires adjusting a few settings in order to achieve accurate results. We need to know which of these settings are most appropriate for the task, and why.

### Dataset

Datasets in clustering analysis can contain variables of several forms: numerical (interval-scaled) variables, binary variables, nominal, ordinal, and ratio variables, and variables of mixed types. The zoo dataset is of mixed type, consisting of boolean variables and one categorical variable. I downloaded the dataset from the UCI Machine Learning Repository (Forsyth, 1990). Since the data is in a text format that is not recognized by Weka or R/Rattle, I did some cleaning and further processing of the dataset and transformed it into CSV format. I then converted the data into a data frame with 18 attributes, 15 of which are boolean, i.e. binary, variables. The boolean variables, which take values {0, 1}, are: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, tail, domestic, and catsize; the categorical variable 'legs' takes values {0, 2, 4, 5, 6, 8}, corresponding to the number of legs of each animal.

In the Weka output we will see the seven family types listed from 0 through 6: 0 mammal, 1 fish, 2 bird, 3 invertebrate, 4 insect, 5 amphibian, and 6 reptile. The dataset also has the variable 'type', which holds these seven family types of animals; cleaning the data involves removing the 'type' label so that the algorithm handles the data blindly, meeting the unsupervised learning objective.
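The cleaning step described above can be sketched as follows. This is a hedged illustration rather than the report's actual script: the column names follow the UCI zoo attribute list, the file name `zoo.csv` is an assumption, and two made-up rows stand in for the real 101-animal data.

```python
# Sketch of the cleaning step: drop the 'type' label (and the animal
# name) so only unlabeled features reach the clustering algorithm.
import pandas as pd

cols = ["animal", "hair", "feathers", "eggs", "milk", "airborne", "aquatic",
        "predator", "toothed", "backbone", "breathes", "venomous", "fins",
        "legs", "tail", "domestic", "catsize", "type"]

# df = pd.read_csv("zoo.csv", names=cols)  # hypothetical local CSV copy
# Two illustrative rows standing in for the 101-animal dataset:
df = pd.DataFrame(
    [["aardvark", 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 4, 0, 0, 1, 1],
     ["chicken",  0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 2, 1, 1, 0, 2]],
    columns=cols)

features = df.drop(columns=["animal", "type"])
print(features.shape)  # (2, 16): 16 unlabeled attributes per animal
```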
Table 1 presents the number of animals in each family type, which will later be compared against the obtained clusters.

Table 1: count of the number of animals in each type

| No | Type         | Count |
|----|--------------|-------|
| 1  | Mammal       | 41    |
| 2  | Fish         | 13    |
| 3  | Bird         | 20    |
| 4  | Invertebrate | 10    |
| 5  | Insect       | 8     |
| 6  | Amphibian    | 4     |
| 7  | Reptile      | 5     |

### Algorithm

The following steps show how the agglomerative hierarchical clustering algorithm works (Zhu, 2010):

1. Initially, each item x1, ..., xn is in its own cluster C1, ..., Cn.
2. Merge the nearest pair of clusters, say Ci and Cj.
3. Repeat until only one cluster is left, or stop at a given threshold.

The result is a cluster tree. One can cut the tree at any level to produce different clusterings. The nearness of clusters or objects is measured using a distance measure, which also indicates the similarity
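The agglomerative steps above can be sketched with SciPy's implementation. This is an illustration under the assumption that the features have been converted to a numeric matrix; the five toy rows below are hypothetical, not taken from the zoo data.

```python
# Sketch of the agglomerative algorithm: start with singleton clusters,
# repeatedly merge the nearest pair, then cut the resulting tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy binary features: rows = animals, cols = (hair, feathers, eggs, milk).
X = np.array([
    [1, 0, 0, 1],   # mammal-like
    [1, 0, 0, 1],   # mammal-like
    [0, 1, 1, 0],   # bird-like
    [0, 1, 1, 0],   # bird-like
    [0, 0, 1, 0],   # fish-like
], dtype=float)

# Steps 1-2: Z encodes the full merge tree built from singleton clusters.
Z = linkage(X, method="complete", metric="euclidean")

# Step 3: cut the tree, here asking for three clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

With this cut, the two mammal-like rows share one cluster label, the two bird-like rows share another, and the fish-like row is assigned its own cluster.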