Hierarchical clustering of multi class data (the zoo dataset)
Data mining project
The main goal of this study is to group 101 animals into their natural family types from their features, using hierarchical clustering, which is an unsupervised learning algorithm.

Hierarchical clustering of multi class data (the zoo dataset) Document Transcript

  • 1. 732A31; Data Mining-Clustering and Association Analysis
Hierarchical Clustering of Multi-Class Data (The Zoo Dataset)
Linköping University, IDA; Division of Statistics
Raid Mahbouba, raima062@student.liu.se

Introduction
In real life there are many animals of different family types, and non-zoologists are often unfamiliar with these types; this is one of the main reasons why animals are kept in zoos for identification. The main goal of this study is to use a clustering technique to group 101 animals by features such as hair, feathers, eggs and milk, and to verify whether the resulting groups correspond to the animals' natural family types. Several clustering algorithms suit such a task, among them hierarchical algorithms and density-based algorithms. To identify the right clusters of animals, we changed some of the algorithm's default settings, such as the number of iterations, the number of clusters and the way distances between clusters are computed (the linkage), all of which might influence the final results. The setting we expect to matter most is the type of linkage; the results will demonstrate that it is a crucial factor in finding the true clusters. Although the problem is originally a classification problem, as described in the documentation of the zoo database (Forsyth, 1990), my proposal is to use a hierarchical clustering algorithm and see how well it identifies the animal types from their features alone. Previous research that analysed this dataset with clustering methods was done by Neil Davey, Rod Adams and Mary J. George, who used neural networks (Forsyth, 1990). In this study I use hierarchical clustering, since it is simple and is robust to outliers.
It has been noted that "clustering is in the eye of the beholder" (Wikipedia, 2012), which implies that clustering methods do not guarantee perfect accuracy. I therefore do not expect, though I hope, to reach 100% accuracy when grouping objects into their natural clusters with the hierarchical approach.

Background
Machine learning is a growing branch of artificial intelligence concerned with the construction and study of systems that can learn from data. Many methods are used to analyse large datasets in machine learning, for example classification, clustering, gamboost, random forests, support vector machines and generalized linear models (glmnet). These methods are often grouped into two larger classes: supervised and unsupervised learning algorithms. This project focuses on a specific class of unsupervised learning algorithms, namely clustering algorithms. Clustering is a dynamic field of research in data mining. Initially, clustering algorithms were considered ineffective and inefficient in machine learning because handling large datasets was challenging; in recent years, however, these algorithms have evolved remarkably. Clustering algorithms can be categorized into partitioning methods,
  • 2. hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (including frequent pattern-based methods), constraint-based methods, etc. (Jiawei Han, 2006).

Methodology
The methodology is based on the hierarchical clustering algorithm, which requires adjusting a few settings to achieve accurate results. We only need to know which of these settings are most appropriate for our task, and why.

Dataset
Datasets in cluster analysis can contain any of the following: interval-scaled variables, binary variables, nominal, ordinal and ratio variables, and variables of mixed types. The zoo dataset is of mixed type: its features are 15 binary variables and one categorical variable. I downloaded the dataset from the UCI Machine Learning Repository (Forsyth, 1990). Because the raw text format is not read directly by Weka and R-Rattle, I cleaned and processed the dataset and transformed it into CSV format. I also converted the data into a data frame with 18 attributes: the animal name, 15 boolean (binary) variables, the categorical variable 'legs' and the label 'type'. The boolean variables, which take values in {0, 1}, are: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathe, venomous, fins, tail, domestic and cat size. The categorical variable 'legs' takes values in {0, 2, 4, 5, 6, 8}, the number of legs of each animal. In the Weka output the seven family types are numbered 0 through 6: 0 mammal, 1 fish, 2 bird, 3 invertebrate, 4 insect, 5 amphibian and 6 reptile. The variable 'type' holds the family type of each animal; cleaning the data includes removing this label so that the algorithm handles the data blindly, as the unsupervised learning objective requires.
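The preparation step described above, reading the CSV and setting the 'type' label aside, might look roughly like the following Python sketch. It is not the author's Weka or R-Rattle workflow. The column order, the identifiers `breathe` and `cat_size`, the helper `load_features` and the three sample rows are all assumptions made for illustration; the rows are invented and not copied from the dataset.

```python
import csv
import io

# Assumed column layout: a name column, the 15 binary features named in the
# text, the categorical 'legs' feature, and the 'type' label.
COLUMNS = [
    "animal", "hair", "feathers", "eggs", "milk", "airborne", "aquatic",
    "predator", "toothed", "backbone", "breathe", "venomous", "fins",
    "legs", "tail", "domestic", "cat_size", "type",
]

# Hypothetical rows written for this sketch; they are not taken from the file.
SAMPLE_CSV = """\
animal_a,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,0
animal_b,0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,0,2
animal_c,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,1
"""


def load_features(text):
    """Return (names, feature_rows, labels) with the 'type' label split off."""
    names, features, labels = [], [], []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(COLUMNS, row))
        names.append(record.pop("animal"))
        # The label is kept aside for later evaluation, never used for clustering.
        labels.append(int(record.pop("type")))
        features.append([int(record[col]) for col in COLUMNS[1:-1]])
    return names, features, labels


names, features, labels = load_features(SAMPLE_CSV)
print(len(features[0]))  # 16 features per animal: 15 binary plus 'legs'
```

Keeping the labels in a separate list, instead of discarding them, makes it possible to compare the clusters against the family types afterwards without exposing the labels to the clustering step.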
Table 1 summarizes the number of animals of each family type, which will later be compared against the obtained clusters.

Table 1: Number of animals in each type
    No  Type          Count
    1   Mammal        41
    2   Fish          13
    3   Bird          20
    4   Invertebrate  10
    5   Insect         8
    6   Amphibian      4
    7   Reptile        5

Algorithm
The following steps show how the agglomerative hierarchical clustering algorithm works (Zhu, 2010):
    1. Initially each item x1, ..., xn is in its own cluster C1, ..., Cn.
    2. Merge the nearest clusters, say Ci and Cj.
    3. Repeat until there is only one cluster left, or stop at a given threshold.
The result is a cluster tree. One can cut the tree at any level to produce different clusterings. The nearness of clusters or objects is measured with a distance measure, which also indicates the similarity
  • 3. of the clusters. In our case the attributes are of mixed types, binary and categorical. Since a categorical variable is a generalization of a binary variable in that it can take on more than two states (Jiawe & Micheline, 2006), the distance is computed from the number of mismatches among the attributes of the objects being compared (Jiawe & Micheline, 2006). For example, if m is the number of matches and p the total number of attributes, then the distance between two objects x and x' is d(x, x') = p - m. For binary variables this distance equals the squared Euclidean distance between the objects, because each mismatching attribute contributes 1 and each matching attribute contributes 0. In the second step of the algorithm, clusters are merged. The merge is based on a linkage computed from these distances, and we consider three types of linkage:

Single linkage
In single linkage, the algorithm takes the closest pair x and x' from Ci and Cj respectively to decide which clusters to merge:
\[ d(C_i, C_j) = \min_{x \in C_i,\, x' \in C_j} d(x, x') \]
Single linkage is equivalent to the minimum spanning tree algorithm and tends to produce long and skinny clusters. The single linkage merge criterion is local: we pay attention solely to the area where the two clusters come closest to each other. Other, more distant parts of the clusters and the clusters' overall structure are not taken into account (Manning, et al., 2008).

Complete linkage
Complete linkage works in the opposite way to single linkage: it considers the farthest pair x and x'. Clusters tend to be compact and roughly equal in diameter. The complete linkage merge criterion is non-local; the entire structure of the clustering can influence merge decisions. This results in a preference for compact clusters with small diameters over long, straggly clusters, but also causes sensitivity to outliers.
A single observation far from the center can increase the diameters of candidate merge clusters dramatically and completely change the final clustering (Manning, et al., 2008). In symbols, complete linkage uses
\[ d(C_i, C_j) = \max_{x \in C_i,\, x' \in C_j} d(x, x') \]

Average linkage
In average linkage, the distance between two clusters is the sum of the pairwise distances between the objects of one cluster and the objects of the other, divided by the product of the numbers of objects in the two clusters:
\[ d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{x' \in C_j} d(x, x') \]
It is a compromise between the sensitivity of complete linkage clustering to outliers and the tendency of single linkage clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects (Manning, et al., 2008).
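The merge procedure and the three linkage rules described above can be sketched in a few lines of Python. This is a naive illustration, not the Weka or R implementation used in the study. The function names, the toy objects and the tie-breaking rule (the first closest pair wins) are assumptions of the sketch.

```python
from itertools import combinations


def mismatch_distance(x, y):
    """d(x, x') = p - m: the number of attributes on which x and x' differ."""
    return sum(a != b for a, b in zip(x, y))


def cluster_distance(ci, cj, linkage):
    """Distance between two clusters (lists of objects) under a linkage rule."""
    pairwise = [mismatch_distance(x, y) for x in ci for y in cj]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / (len(ci) * len(cj))
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(objects, n_clusters, linkage):
    """Merge the two nearest clusters repeatedly until n_clusters remain."""
    clusters = [[obj] for obj in objects]  # step 1: every object on its own
    while len(clusters) > n_clusters:
        # Step 2: find the closest pair of clusters (ties: first pair found).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(
                clusters[pair[0]], clusters[pair[1]], linkage
            ),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical toy objects with four binary attributes each.
toy = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0), (1, 1, 1, 1)]
a, b = toy[:2], toy[2:]
for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(a, b, rule))  # 3, 4 and 3.5 respectively
print(agglomerate(toy, 2, "average"))
```

On the toy data the three rules already disagree about how far apart the same two clusters are (3, 4 and 3.5), which is why the choice of linkage can change which clusters get merged.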
  • 4. Analysis
The analysis uses output from both Weka and R-Rattle. To ease the analysis, all output is kept in the form of tables, and the type labels were added to the tables afterwards to make the output readable.

Results
With complete linkage the misclassification rate is 15.8416% (16 of 101 animals), the highest among the three linkage methods considered, as we will see. In Table 2, five of the 41 mammals are misclassified as insect, while fish and birds are classified correctly and invertebrates almost correctly (seven of ten); insects, amphibians and reptiles are misclassified.

Table 2: Hierarchical clustering, complete linkage (rows: true type; columns: classified as)
    True type         Mammal  Fish  Bird  Invertebrate  Insect  Amphibian  Reptile
                        0      1     2         3           4        5         6
    0 Mammal           36      0     0         0           5        0         0
    1 Fish              0     13     0         0           0        0         0
    2 Bird              0      0    20         0           0        0         0
    3 Invertebrate      0      0     0         7           0        2         1
    4 Insect            0      0     0         0           0        8         0
    5 Amphibian         0      4     0         0           0        0         0
    6 Reptile           0      3     1         0           0        0         1

With single linkage the misclassification rate is 12.8713% (13 of 101 animals). Mammals, fish and birds are classified correctly and invertebrates almost correctly (nine of ten), while animals belonging to the insect, amphibian and reptile types are misclassified.

Table 3: Hierarchical clustering, single linkage (rows: true type; columns: classified as)
    True type         Mammal  Fish  Bird  Invertebrate  Insect  Amphibian  Reptile
                        0      1     2         3           4        5         6
    0 Mammal           41      0     0         0           0        0         0
    1 Fish              0     13     0         0           0        0         0
    2 Bird              0      0    20         0           0        0         0
    3 Invertebrate      0      0     0         9           0        1         0
    4 Insect            0      0     0         8           0        0         0
    5 Amphibian         0      0     0         0           4        0         0
    6 Reptile           0      0     1         0           3        0         1
  • 5. The last linkage method we used is average linkage, for which the misclassification rate is again 12.8713% (13 of 101 animals). Fish, birds and insects are classified correctly, mammals almost correctly (40 of 41) and invertebrates mostly correctly (seven of ten); notice that the insect class is recovered perfectly for the first time among the three linkage methods.

Table 4: Hierarchical clustering, average linkage (rows: true type; columns: classified as)
    True type         Mammal  Fish  Bird  Invertebrate  Insect  Amphibian  Reptile
                        0      1     2         3           4        5         6
    0 Mammal           40      0     0         0           0        1         0
    1 Fish              0     13     0         0           0        0         0
    2 Bird              0      0    20         0           0        0         0
    3 Invertebrate      0      0     0         7           2        0         1
    4 Insect            0      0     0         0           8        0         0
    5 Amphibian         0      4     0         0           0        0         0
    6 Reptile           0      4     1         0           0        0         0
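The reported rates can be checked against the counts in Tables 2 to 4 with a short Python sketch. The text does not say how Weka or R-Rattle matched clusters to family types, so the rule used here is an assumption: every cluster is paired with a different type so that the number of agreeing animals is as large as possible, and all other animals count as misclassified. Under that assumption the sketch gives the same three percentages as the text.

```python
from itertools import permutations

# Rows: true type 0..6 (mammal, fish, bird, invertebrate, insect, amphibian,
# reptile). Columns: cluster 0..6. Counts transcribed from Tables 2, 3 and 4.
TABLES = {
    "complete": [
        [36, 0, 0, 0, 5, 0, 0],
        [0, 13, 0, 0, 0, 0, 0],
        [0, 0, 20, 0, 0, 0, 0],
        [0, 0, 0, 7, 0, 2, 1],
        [0, 0, 0, 0, 0, 8, 0],
        [0, 4, 0, 0, 0, 0, 0],
        [0, 3, 1, 0, 0, 0, 1],
    ],
    "single": [
        [41, 0, 0, 0, 0, 0, 0],
        [0, 13, 0, 0, 0, 0, 0],
        [0, 0, 20, 0, 0, 0, 0],
        [0, 0, 0, 9, 0, 1, 0],
        [0, 0, 0, 8, 0, 0, 0],
        [0, 0, 0, 0, 4, 0, 0],
        [0, 0, 1, 0, 3, 0, 1],
    ],
    "average": [
        [40, 0, 0, 0, 0, 1, 0],
        [0, 13, 0, 0, 0, 0, 0],
        [0, 0, 20, 0, 0, 0, 0],
        [0, 0, 0, 7, 2, 0, 1],
        [0, 0, 0, 0, 8, 0, 0],
        [0, 4, 0, 0, 0, 0, 0],
        [0, 4, 1, 0, 0, 0, 0],
    ],
}


def misclassification_rate(table):
    """Error rate under the best one-to-one matching of clusters to types."""
    n = len(table)
    total = sum(map(sum, table))
    # Brute force over all 7! = 5040 matchings; perm[row] is the cluster
    # paired with true type `row`.
    best = max(
        sum(table[row][col] for row, col in enumerate(perm))
        for perm in permutations(range(n))
    )
    return (total - best) / total


for name, table in TABLES.items():
    print(f"{name}: {100 * misclassification_rate(table):.4f}%")
```

The brute-force search is only practical because there are seven types; for larger tables an assignment solver would be the usual replacement.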
  • 6. Comparison of variations between complete, single and average linkages
In cluster analysis the diameter of a cluster is defined as the largest dissimilarity between any two of its observations (Rousseeuw & Kaufman, 1990). The diameters of the clusters produced by single, complete and average linkage are listed below:

    Linkage    Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6  Cluster 7
    Single     3.000000   3.464102   2.000000   2.828427   NA         NA         NA
    Complete   3.000000   2.236068   2.236068   2.000000   2.449490   2.828427   2.449490
    Average    2.828427   2.236068   3.000000   1.414214   2.000000   2.828427   2.449490

The advantage of complete linkage over single linkage is that it does not produce the chaining phenomenon. Chaining is visible in the single linkage diameters of clusters 2 and 4, which are larger than the diameters of the same clusters under complete linkage (3.46 versus 2.24 and 2.83 versus 2.00); the diameter of cluster 1 is the same under both methods, and that of cluster 3 is slightly smaller under single linkage. By chaining phenomenon we mean that clusters formed through single linkage may be forced together because single elements are close to each other, producing a large cluster even though many of the elements in it may be very distant from each other (Wikipedia, 2013). The single linkage diameters of clusters 5, 6 and 7 appear as NA. As stated earlier, the diameter is the largest dissimilarity between any two observations of a cluster; clusters 6 and 7 contain only one observation each, so no diameter can be computed for them. The diameter of cluster 5 under single linkage appears as NA because all observations of insects were classified wrongly as amphibian and reptile. Complete linkage, in turn, tends to find compact clusters of approximately equal diameters and, as stated earlier, is sensitive to outliers, so it performs poorly too. The results in Table 2 show that complete linkage performs poorly.
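The diameter definition used above can be illustrated with a small Python sketch. Several of the listed values are square roots of whole numbers (2.236068 is the square root of 5, 3.464102 the square root of 12), which would be consistent with Euclidean distances between the feature vectors; the sketch assumes Euclidean distance on that basis. The function name and the toy cluster are hypothetical, and `None` stands in for the NA entries.

```python
from itertools import combinations
from math import dist


def diameter(cluster):
    """Largest Euclidean distance between any two members of the cluster.

    Returns None when the cluster has fewer than two members, since no
    pair exists to measure; this mirrors the NA entries in the listing.
    """
    if len(cluster) < 2:
        return None
    return max(dist(x, y) for x, y in combinations(cluster, 2))


# Hypothetical cluster of three objects with four binary attributes each.
cluster = [(0, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 1)]
print(diameter(cluster))         # 2.0, from the first and last object
print(diameter([(1, 0, 1, 0)]))  # None: a single observation has no diameter
```

The first and last toy objects differ on all four attributes, so their Euclidean distance is the square root of 4, which is the largest pairwise distance in the cluster.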
The algorithm produced only two pure clusters and failed to find the other five, because the maximum distance is computed in the same way for all clusters. The characteristic of roughly equal cluster diameters therefore contributes greatly to this drawback of complete linkage. In contrast, average linkage yields unequal diameters and better clustering output, since it strives for a compromise between the sensitivity of complete linkage clustering to outliers and the tendency of single linkage clustering to form long chains.

Analysis of the findings
Overall the dataset has few observations, although the 41 mammals are a reasonably large group compared with the number of reptiles or amphibians. Complete linkage found five clusters, namely mammals, fish, birds, invertebrates and reptiles, but it does not group some of the animals correctly, and it recorded the highest misclassification rate (see Table 2). With single linkage, mammals are grouped correctly into one cluster (see Table 3); fish, birds and invertebrates are also clustered into their natural types. However, the other three types, insect, amphibian and reptile, are misclassified, and their diameters appeared as NA in the output. Instead, observations belonging to the insect family are
  • 7. misclassified as invertebrate and amphibian, and some observations that belong to reptile are misclassified as insects. From these findings we can conclude that, according to single linkage, invertebrates and insects have more similarities than dissimilarities. The misclassified observations could also be due to the small number of animals in these types compared with the number of mammals. Single linkage thus discovered five clusters out of seven: mammals, fish, birds, invertebrates and only one observation from reptile. Average linkage also discovered five clusters: mammals, fish, birds, invertebrates and insects. What is new with average linkage is that it recovers every animal of the insect type (Table 4). It is also important to note that amphibians and reptiles seem to be particularly homogeneous, as they are usually clustered together. Comparing the linkage methods, animals of the family types fish, bird and invertebrate are discovered by all three methods, and the same number of reptiles is identified by complete and single linkage. The differences between single and average linkage are small, and since the misclassification rate of 12.8713% is exactly the same for both, the choice between them is up to the researcher. However, average linkage does better in that it recovered five clusters and missed only the types with very few observations.

Conclusion
The aim of this study was to apply a clustering algorithm, specifically the hierarchical algorithm, to the Zoo dataset and to compare the resulting clusters with the corresponding natural family types.
The tables show that complete linkage produced the highest misclassification rate of the three methods. Changing only the linkage type changes the accuracy substantially, whereas the number of iterations has no influence because of the small size of the dataset. In the end, the hierarchical clustering algorithm that uses Euclidean distance and average linkage to connect observations in the closest clusters recognized five of the seven real clusters of the zoo dataset. The algorithm is robust and is recommended because its settings (number of clusters, number of iterations, type of distance, etc.) are easy to adjust. Finally, further work could try other clustering algorithms, such as the ROCK algorithm, that may improve the accuracy of the clustering.
  • 8. Tables
Table 1: Number of animals in each type
Table 2: Hierarchical clustering, complete linkage
Table 3: Hierarchical clustering, single linkage
Table 4: Hierarchical clustering, average linkage

Bibliography
Forsyth, R., 1990. Machine Learning Repository. [Online] Available at: http://archive.ics.uci.edu/ml/datasets/Zoo [Accessed 20 April 2011].
Jiawe, H. & Micheline, K., 2006. Data Mining - Concepts and Techniques. 2nd ed. s.l.: Morgan Kaufmann.
Manning, C. D., Raghavan, P. & Schutze, H., 2008. Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Rousseeuw, P. & Kaufman, L., 1990. Finding Groups in Data. New York: Wiley.
Wikipedia, 2012. Clustering Analysis. [Online] Available at: http://en.wikipedia.org/wiki/Cluster_analysis#Clustering_algorithms [Accessed 20 April 2013].
Wikipedia, 2013. Complete-linkage clustering. [Online] Available at: http://en.wikipedia.org/wiki/Complete-linkage_clustering [Accessed 26 July 2013].
Zhu, X., 2010. Clustering. Advanced Natural Language Processing, 15 Springer, pp. 1-4.