ITB TERM PAPER
Classification and Clustering
Nitin Kumar Rathore 10BM60055
Introduction

Weka stands for Waikato Environment for Knowledge Analysis. It is free machine learning software, written in Java and developed at the University of Waikato, New Zealand. The Weka workbench includes a set of visualization tools and algorithms that support better decision making through data analysis and predictive modeling. It also has a GUI (graphical user interface) for ease of use. Because it is written in Java, it is portable across platforms. Weka has many applications and is widely used for research and education. The data mining functions Weka supports include classification, clustering, feature selection, data preprocessing, regression and visualization.

The Weka startup screen looks like this:

This is the Weka GUI Chooser. It gives you four interfaces to work with:
- Explorer: used for exploring the data with Weka. It provides access to all the facilities through menus and forms.
- Experimenter: allows you to create, analyse, modify and run large-scale experiments. It can be used to answer questions such as which of several schemes is better (if any is).
- Knowledge Flow: has the same functions as the Explorer and also supports incremental learning. It handles data on an incremental basis and uses incremental algorithms to process it.
- Simple CLI: CLI stands for command line interface. It provides all the functionality through a command line.

Data Mining Techniques

Of the data mining techniques provided by Weka (classification, clustering, feature selection, data preprocessing, regression and visualization), this paper demonstrates the use of classification and clustering.
Classification

Classification creates a model with which a new instance can be assigned to one of the existing, predetermined classes. For example, by building a decision tree from past sales we can determine how likely a person is to buy a product, given attributes such as disposable income, family size, state/country, etc.

To start with classification you must use or create a file in ARFF, CSV or another supported format. An ARFF file is a table. To create an ARFF file from Excel, follow these steps:
- Open the Excel file.
- Remove the headings.
- Save it as a CSV (comma delimited) file.
- Open the CSV file in a text editor.
- Write the relation name at the top of the file as: @relation <relation_name>
  The text inside the angle brackets, < and >, stands for text to be entered as required.
- Leave a blank line and enter all the attributes (the column heads) in the format: @attribute <attribute_name> {<attribute_values>}
  For example: @attribute outlook {sunny, overcast, rainy}
- After entering all the attributes, leave a blank line and write: @data
  This last line appears just above the comma-separated data values of the file.
- Save the file as <file_name>.arff

A sample picture of an ARFF file is shown below.
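The manual steps above can also be scripted. The following is a minimal sketch in Python, not part of Weka itself. It assumes the CSV still has its header row, which it uses to name the attributes instead of removing it. The function name csv_to_arff, the parameter nominal_columns and the three weather rows are hypothetical and serve only as illustration. Columns listed as nominal are declared with their distinct values in braces, and every other column is declared numeric.

```python
import csv
import io


def csv_to_arff(csv_text, relation, nominal_columns=()):
    """Convert header-first CSV text to ARFF text.

    Columns named in nominal_columns are declared with their distinct
    values in braces; every other column is declared as numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    for index, name in enumerate(header):
        if name in nominal_columns:
            # Collect the distinct values in first-seen order.
            values = list(dict.fromkeys(row[index] for row in data))
            lines.append(f"@attribute {name} {{{', '.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample data, used only to exercise the function.
sample_csv = """outlook,temperature,play
sunny,85,no
overcast,83,yes
rainy,70,yes
"""

arff_text = csv_to_arff(sample_csv, "weather", nominal_columns=("outlook", "play"))
print(arff_text)
```

The printed text can be saved with an .arff extension and opened in the Weka Explorer like any other ARFF file.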
Classification example:

Our goal is to create a decision tree using Weka so that we can classify new or unknown iris flower samples. There are three kinds of iris: Iris setosa, Iris versicolor and Iris virginica.

Data file: We have a data file containing attribute values for 150 iris samples in ARFF format at this link: http://code.google.com/p/pwr-apw/downloads/detail?name=iris.arff.

The concept behind the classification is that sepal and petal length and width help us to identify an unknown iris. The data file contains all four attributes. The algorithm we are going to use to classify is Weka's J4.8 decision tree learner.

Follow these steps to classify:
- Open Weka and choose Explorer. Then open the downloaded ARFF file.
- Go to the Classify tab.
- Click “Choose” and select the J48 algorithm under the trees section.
- Left click on the chosen J48 algorithm to open the Weka Generic Object Editor.
- Change the option saveInstanceData to true and click OK. This lets you see how each sample was classified after the decision tree has been built.
- Click the “Percentage split” option in the “Test options” section. Weka trains on the percentage entered in the box and tests on the rest of the data. The default value is 66%.
- Click “Start” to start classification. The output box named “Classifier output” shows the output of classification. The output will look like this:
- Now we will see the tree. Right click on the entry in the “Result list”, then click “Visualize tree”.
The decision tree will be visible in a new window.

It gives the decision structure, or flow of decisions, to be followed during classification. For example, if petal width > 0.6, petal width <= 1.7, petal length > 4.9 and petal width <= 1.5, the iris is virginica.

Now look at the classifier output box. The rules describing the decision tree are given there, as shown in the picture.
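A decision tree of this kind is just a chain of nested comparisons, which can be sketched in Python. Only the virginica path quoted above comes from the text. The remaining branches are an assumption, taken from the tree J48 commonly reports on the iris data, because the paper shows its tree only as a picture. The function name classify_iris is hypothetical.

```python
def classify_iris(petal_length, petal_width):
    """Walk the decision tree from the root to a leaf and return the class.

    Only the path ending in the second 'virginica' leaf is quoted in the
    paper; the other branches are assumed for illustration.
    """
    if petal_width <= 0.6:
        return "setosa"          # assumed branch
    if petal_width > 1.7:
        return "virginica"       # assumed branch
    # From here on: 0.6 < petal_width <= 1.7.
    if petal_length <= 4.9:
        return "versicolor"      # assumed branch
    # Path from the text: petal length > 4.9 and petal width <= 1.5.
    if petal_width <= 1.5:
        return "virginica"
    return "versicolor"          # assumed branch


# A sample on the path described in the text.
print(classify_iris(petal_length=5.0, petal_width=1.5))
```

Note that the function takes no sepal measurements at all: the tree never tests them.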
As we can see in the decision tree, we don't need sepal length and width for classification. We need only petal length and width.
- Go to the “Classifier output” box and scroll to the section “Evaluation on test split”. We have split the data in two: 66% for training and 34% for testing the model, or tree. This section will be visible as follows.
- Weka took 51 samples (34% of the 150) for the test. Of these, 49 are classified correctly and 2 are classified incorrectly.
- If you look at the confusion matrix further down in the classifier output box, you will see that all setosa (15) and all versicolor (19) are classified correctly, but 2 out of 17 virginica are classified as versicolor.
- To find more information, or to visualize how the decision tree did on the test samples, right click on the entry in the “Result list” and select “Visualize classifier errors”. A new window will open.
- As our tree has used only petal width and petal length to classify, we select petal length for the X axis and petal width for the Y axis.
- Here “x”, or a cross, represents correctly classified samples and squares represent incorrectly classified samples.
- The results of the decision tree (setosa, versicolor and virginica) are shown in different colors: blue, red and green.
- We can see why the two virginica samples are incorrectly classified as versicolor: they fall into the versicolor group in terms of petal length and width.
The window will appear as follows.

Left clicking on a squared instance (circled in black) gives you information about that instance.
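The evaluation figures reported above can be checked from the confusion matrix with a few lines of Python. The counts are the ones given in the text (15 setosa, 19 versicolor, and 17 virginica of which 2 are predicted as versicolor). The layout, with rows as actual classes and columns as predicted classes, and the variable names are assumptions of this sketch.

```python
# Rows are actual classes, columns are predicted classes,
# both in the order setosa, versicolor, virginica.
classes = ["setosa", "versicolor", "virginica"]
confusion = [
    [15, 0, 0],   # all 15 setosa predicted as setosa
    [0, 19, 0],   # all 19 versicolor predicted as versicolor
    [0, 2, 15],   # 2 of the 17 virginica predicted as versicolor
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

# Recall per class: the share of each actual class predicted correctly.
recall = {
    name: confusion[i][i] / sum(confusion[i])
    for i, name in enumerate(classes)
}

print(f"{correct} of {total} correct, accuracy = {accuracy:.2%}")
for name, value in recall.items():
    print(f"recall({name}) = {value:.2%}")
```

The diagonal of the matrix holds the correctly classified samples, so 49 of 51 gives an accuracy of about 96%, and the only class with recall below 100% is virginica.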
As we can see, 2 of the 50 virginica samples (training plus test) are classified incorrectly. The rest, including all setosa and versicolor samples, are classified correctly. There can be many reasons for this. A few are mentioned below.
- Attribute measurement error: arises from incorrect measurement of petal and sepal lengths and widths.
- Sample class identification error: arises because some samples are labeled with the wrong class, say some versicolor labeled as virginica.
- Outlier samples: some infected or abnormal flowers are sampled.
- Inappropriate classification algorithm: the algorithm we chose is not suitable for this classification.

Clustering

Clustering is the formation of groups of instances on the basis of their attributes, and is used to find patterns in the data. An advantage of clustering over classification is that every attribute is used to define a group. A disadvantage is that the user must know beforehand how many groups to form.

There are 2 types of clustering:

Hierarchical clustering: This approach uses a distance measure (generally squared Euclidean) between objects to form clusters. The process starts with every object as a separate cluster. Then the two closest clusters are joined to form one cluster, which is treated as a new object. The process continues until a single cluster is formed or the required number of clusters is reached.

Non-hierarchical clustering: In this method the observations (say n in number) are partitioned into k clusters. Each observation is assigned to the nearest cluster and the cluster mean is recalculated. In this paper we will study a K-means clustering example.

Applications of clustering include:
- Market segmentation
- Computer vision
- Geostatistics
- Understanding buyer behavior

Data file: The data file describes a BMW dealership.
It contains data about how often customers make a purchase, which cars they look at, and how they walk through the showroom and dealership. It contains 100 rows of data, where every attribute/column represents a step that customers have reached in their buying process. “1” means they have reached this step, whereas “0” means they did not make it to this step. Download the data file from the link: http://www.ibm.com/developerworks/apps/download/index.jsp?contentid=487584&filename=os-weka2-Examples.zip&method=http&locale=

Let us have a sneak peek into the data file.

Now follow these steps to perform clustering:
- Load the file into Weka using the “Open file” option under the Preprocess tab of the Weka Explorer, or by double clicking the file. The Weka Explorer will look like this:
- To create clusters, click on the Cluster tab.
- Click the “Choose” button and select “SimpleKMeans”.
- Click on the text box next to the “Choose” button, which displays the k-means algorithm. It will open the Weka GUI Generic Object Editor.
- Change “numClusters” from 2 to 5. It defines the number of clusters to be formed. Click OK.
- Click “Start” to start clustering.
- An entry will appear in the “Result list” box, and the “Clusterer output” box will display the output of clustering. It will appear as follows.

Cluster Results:

Now we have the clusters defined. You can view the cluster data in a separate window by right clicking the entry in the “Result list” box. There are 5 clusters, named “0” to “4”. If the attribute value for a cluster is “1”, all the instances in the cluster have the value “1” for that attribute. If the attribute value for a cluster is “0”, all the instances in the cluster have the value “0” for that attribute. As a reminder, “0” means the customer has not reached that step of the buying process and “1” means the customer has.
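To make the K-means procedure described earlier concrete (assign each observation to the nearest cluster, then recalculate the cluster mean), here is a minimal sketch in plain Python. It is not Weka's SimpleKMeans implementation. In particular, it takes the starting centroids as an argument instead of choosing them at random, so that the run is repeatable. The six rows and their four 0/1 columns (dealership, showroom, financing, purchase) are hypothetical and are not taken from the BMW data file.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, max_iterations=100):
    """Assign each point to its nearest centroid, then recompute each
    centroid as the mean of its points, until no assignment changes."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(max_iterations):
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(point, centroids[k]))
            for point in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical rows: (dealership, showroom, financing, purchase).
rows = [
    (1, 1, 1, 1),
    (1, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 0, 0),
    (0, 1, 0, 0),
    (1, 1, 0, 0),
]

centroids, assignments = k_means(rows, centroids=[rows[0], rows[3]])
cluster_sizes = Counter(assignments)

print("assignments:", assignments)
print("cluster sizes:", dict(cluster_sizes))
for k, centroid in enumerate(centroids):
    print(f"cluster {k} centroid:", [round(v, 2) for v in centroid])
```

Each centroid value is the mean of a 0/1 column, so it can be read as the share of the cluster's members that reached that step: a value of 1 means every member did, 0 means none did, and a fraction means only some did.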
“Clustered Instances” is the heading in the cluster output that shows how many instances belong to each cluster. For example, cluster “0” has 26 instances, or 26% of the instances (as there are 100 rows, the number of instances equals the percentage).

The values for the clusters, in a separate window, are given in the picture below.

Interpreting the clusters
- Cluster 0: This is the group of non-purchasers. They may look for a dealership and look at cars in a showroom, but when it comes to purchasing a car they do nothing. This group only adds to cost and doesn't bring any revenue.
- Cluster 1: This group is attracted to the M5: they go straight to the M5s, ignoring the 3Series and paying no heed at all to the Z4. They don't even do the computer search. But as we can see, this high footfall does not bring sales accordingly. The reason for the medium sales should be unearthed. If customer service is the problem, we should increase the service quality in the M5 section by training sales executives better. If a shortage of sales personnel to attend to every customer is the problem, we can provide more staff for the M5 section.
- Cluster 2: This group contains just 5 instances out of 100. It can be called an “insignificant group”. It is not statistically important, and we should not draw any conclusion from such a small group. It indicates that we may reduce the number of clusters.
- Cluster 3: This is the group of customers we can call “sure shot buyers”, because they always buy a car. One thing to note is that we should take care of their financing, as they always go for financing. They look around the showroom for available cars and also do a computer search for available dealerships. They generally don't look at the 3Series. This suggests that we should make the computer search for the M5 and Z4 more visible and attractive in search results.
- Cluster 4: This group makes the fewest purchases after the non-purchasers. They are new to the category, because they don't look at expensive cars like the M5 and instead look at the 3Series. They walk into showrooms and don't do a computer search. As we can see, 50 percent of them get to the financing stage but only 32 percent end up buying a car. This means they are buying their first BMW and know exactly their requirement and hence their car (the 3Series entry-level model). They generally go for financing to afford the car. So, to increase sales, we should increase the conversion ratio from the financing stage to the purchasing stage. We should identify the problem there and take the appropriate step, for example making financing easier by collaborating with a bank, or relaxing the terms that repel customers.
REFERENCES
- Ian H. Witten, Eibe Frank and Mark A. Hall, Data Mining, 3rd edition, Morgan Kaufmann Publishers.
- Tutorial for Weka provided by the University of Waikato, www.cs.waikato.ac.nz/ml/weka/
- Weka, Classification Using Decision Trees, based on Dr. Polczynski's lecture, written by Prof. Andrzej Kochanski and Prof. Marcin Perzyk, Faculty of Production Engineering, Warsaw University of Technology, Warsaw, Poland, http://referensi.dosen.narotama.ac.id/files/2011/12/weka-tutorial-2.pdf
- Classification via Decision Trees in WEKA, Computer Science, Telecommunications, and Information Systems, DePaul University, http://maya.cs.depaul.edu/classes/ect584/weka/classify.html
- Michael Abernethy, Data mining with WEKA, Part 2: Classification and clustering, IBM developerWorks, http://www.ibm.com/developerworks/opensource/library/os-weka2/index.html?ca=drs-