ITB WEKA Tutorial

Data Mining Techniques using WEKA for Clustering (K-Means) and Classification (J48 Decision Tree)

VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

In partial fulfillment of the requirements for the degree of MASTER OF BUSINESS ADMINISTRATION

SUBMITTED BY: Prabhat Agarwal, 10BM60059, VGSOM, IIT KHARAGPUR
About WEKA

Weka (Waikato Environment for Knowledge Analysis) is machine learning software written in Java and developed at the University of Waikato, New Zealand. WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (through the WEKA GUI or the command line) or called from one's own Java code. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. WEKA is open source software issued under the GNU General Public License.

WEKA is a powerful tool for business research. It lets managers find trends in past data, consumer surveys, etc., and so helps them take better decisions. It also spares managers the complex mathematical computation that such analysis requires.

The WEKA GUI Chooser provides a starting point for launching WEKA's main GUI (Graphical User Interface) applications and supporting tools. The GUI Chooser consists of four buttons, one for each of the four major WEKA applications:

 1. Explorer: An environment for exploring data with WEKA. It gives access to all the facilities using menu selection.
 2. Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
 3. Knowledge Flow: It supports the same functions as the Explorer but with a drag-and-drop interface. It also supports incremental learning.
 4. Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command-line interface.

In the tutorial we describe two techniques of data mining:

 1. Clustering (K-Means)
 2. Classification using decision trees (J48 tree)
Clustering using WEKA

Cluster analysis, or clustering, means assigning a set of objects to homogeneous groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields.

There are two major types of clustering techniques:

 1. Hierarchical Clustering
 2. Non-Hierarchical Clustering or K-means Clustering

HIERARCHICAL CLUSTERING - Some measure of distance (usually Euclidean or squared Euclidean) is used to find the distances between all pairs of objects to be clustered. We start with every object in its own cluster, so the number of clusters is the same as the number of data points. The two closest objects are joined to form a cluster. The process continues, with points joining existing clusters (because they are closest to an existing cluster) and clusters joining other clusters, based on the shortest distance criterion. In this way a range of possible solutions is formed, from the n-cluster solution at the beginning to a single-cluster solution at the end.

NON-HIERARCHICAL (K-MEANS) CLUSTERING - We have to specify in advance the number of clusters we want the data set to be divided into. In other words, we start from a hypothesis that the objects will group into a certain number of clusters.

In this tutorial I demonstrate K-means clustering. The demonstration uses a primary data set from a survey conducted by a major apparel store to understand buyer behavior. The data was collected from 100 individuals.
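To make the K-means idea above concrete, here is a minimal pure-Python sketch of the usual assign-and-update loop with Euclidean distance. It is an illustration only and is not WEKA's implementation; the function names, the random choice of starting centroids and the toy data are all assumptions made for this example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100, seed=0):
    """Assign-and-update loop; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the solution is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Toy data (hypothetical): two well-separated groups of 2-D points.
toy = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = k_means(toy, k=2)
print(labels)
```

The number of clusters k has to be supplied by the caller, which mirrors the point made above that K-means starts from a hypothesis about how many groups exist.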
Problem Statement:

A major apparel store (name not disclosed) has conducted a survey to understand buyer behavior in purchasing items from the store. The questionnaire was filled in by people visiting the stores, who were selected at random to keep the data free from bias.

The questionnaire was a set of 7 questions on factors that the store feels may alter buyer behavior in making purchases. Respondents had to agree or disagree on a six-point scale (1 = Strongly Agree, 2 = Agree, 3 = Slightly Agree, 4 = Slightly Disagree, 5 = Disagree, 6 = Strongly Disagree).

The questions in the data set are:

 1. Please rate your frequency in making unplanned casual wear purchases for:
    - Own consumption
    - Others' consumption
 2. How strongly do you agree with the following sentences:
    - I shop to change my mood
    - I tend to buy more casual wear unplanned when I feel happy
    - I tend to buy more casual wear unplanned when I feel unhappy
 3. I tend to buy more casual wear unplanned when I see sales promotions such as:
    - Buy 1 Get 1 free
    - Cash rebate
    - Complimentary accessories (ex: belt, bracelet, necklace)
    - Complimentary vouchers
    - Prize draws
    - Joint promotions (ex: specific movie ticket given away with purchase of certain brand of casual wear)
    - Buy 1 Get the next one at 50 % off
 4. I tend to buy more casual wear unplanned when I see sales promotions such as:
    - 50 % discount
    - 20 % discount
    - Member discount period
    - Storewide discount
 5. Gender
 6. Age
 7. Monthly income range:
    - 10000 & below (represented by 1)
    - 10001 to 15000 (represented by 2)
    - 15001 to 20000 (represented by 3)
    - 20001 to 25000 (represented by 4)
    - 25001 to 30000 (represented by 5)
    - Above 30000 (represented by 6)

A snapshot of the questionnaire is also included.
The store wants to cluster the market based on the above attributes. This will help the store cater effectively to the demands of the most lucrative segment. In the tutorial we demonstrate how WEKA can be used to do this.

The data collected in the spreadsheet is converted into .csv format. The attributes are named "Var 1" to "Var 19". The data file contains 100 instances.
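Before loading the .csv file into WEKA it can be useful to confirm that it has the layout described above. The following is a small sketch of such a check using only Python's standard library. The function name and the two sample rows are made up for illustration; the only facts taken from the text are the 19 column names and the expected count of 100 instances.

```python
import csv
import io

EXPECTED_COLUMNS = [f"Var {i}" for i in range(1, 20)]  # "Var 1" .. "Var 19"


def check_survey_csv(handle, expected_rows=None):
    """Validate header names and row widths; return the number of data rows."""
    reader = csv.reader(handle)
    header = next(reader)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header}")
    rows = 0
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            raise ValueError(f"line {line_no}: expected 19 values, got {len(row)}")
        rows += 1
    if expected_rows is not None and rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {rows}")
    return rows


# Stand-in for the real file: a header plus two made-up response rows.
sample = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    + ",".join(["3"] * 19) + "\n"
    + ",".join(["4"] * 19) + "\n"
)
row_count = check_survey_csv(sample)
print(row_count)
```

For the actual survey file, the same function could be called on an opened file with expected_rows=100.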
The WEKA Tutorial Steps:

 1. Click on the "Explorer" button in the WEKA GUI Chooser to open the Explorer.
 2. Then click on "Preprocess" -> "Open file" to select the data file to be opened. Once we click on "Open", the data file will be loaded.

The window will look like this:
The bottom right-hand corner shows the distribution of data values for Variable 1. The small window above it shows the mean and standard deviation of the variable. In this way we can see the distribution of each variable, one at a time.

 3. However, if we want to see the distributions of all variables at one go, we can click on the "Visualize All" button to view the distribution of every variable in the sample population.
 4. The main window also has an "Edit" option, where we can edit the data of the .csv file if there is any error in the data set.
 5. For clustering, we select the "Cluster" tab in the main window and click on the "Choose" button to select K-means clustering (SimpleKMeans). We then click on the text box beside "Choose" to customize the clustering settings. The settings used for this clustering are shown in the snapshot below.
The distance function used is the Euclidean distance and the number of clusters to be made is 5.

 6. Then we click on the "Start" button to run the analysis. The result will be displayed in the right-hand panel.
 7. We can view the result in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.

The result that is displayed is given in the snapshot below:
It shows that 8 iterations were needed to arrive at the result.

There are 5 clusters, which WEKA numbers from 0 to 4. 3 % of the population lies in the first cluster, 22 % in the second cluster, 23 % in the third cluster, 34 % in the fourth cluster and 18 % in the fifth cluster. So cluster 3 (the fourth cluster) has the largest population.

Cluster 3 characteristics:

 - They do not make unplanned casual wear purchases for their own consumption.
 - They sometimes make unplanned casual wear purchases for others' consumption.
 - They shop to change their mood.
 - They slightly agree that they buy more casual wear unplanned when happy.
 - They slightly disagree that they buy more casual wear unplanned when unhappy.
 - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as Buy 1 Get 1 free.
 - They slightly agree that they buy more casual wear unplanned when they see sales promotions such as cash rebate.
 - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as complimentary accessories.
 - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as complimentary vouchers.
 - They slightly disagree that they buy more casual wear unplanned when they see sales promotions such as prize draws.
 - Purchasers are mostly female.
 - Purchasers are 16 to 25 years old.
 - Income is on the higher side of the range 10001 to 15000 (approximately 14000).

In this way we can understand the different kinds of customers lying in different clusters and their behaviour. This will help the store manager take important decisions regarding marketing activities, sales promotions, etc., and target the product offering to a particular segment.

The other kinds of clustering available in WEKA include:

 1. Farthest First
 2. Filtered Clusterer
 3. Hierarchical Clusterer
 4. Make Density Based Clusterer
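As a small worked check of the K-means result reported for the survey data, the sketch below recomputes the cluster shares and picks out the largest segment. The counts are derived from the stated percentages and the stated 100 instances; the variable names are arbitrary.

```python
# Cluster sizes taken from the reported shares (100 instances in total).
# WEKA numbers clusters from 0, so index 3 is the fourth cluster.
cluster_sizes = [3, 22, 23, 34, 18]

total = sum(cluster_sizes)
shares = [100.0 * size / total for size in cluster_sizes]
largest = max(range(len(cluster_sizes)), key=lambda i: cluster_sizes[i])

for index, share in enumerate(shares):
    print(f"cluster {index}: {share:.0f}% of respondents")
print(f"largest segment: cluster {largest}")
```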
Classification using WEKA

Classification with decision trees (also known as classification trees) is a data mining technique that creates a step-by-step guide for determining the output for a data entry. Each node in the tree represents a point where a decision must be made based on the input data. We move from node to node, applying one decision criterion after another, until we reach a leaf that tells us the predicted output.

This model can then be applied to any unknown data instance to predict the class it falls into. That is the advantage of classification trees: it doesn't require a lot of information about the data to create a tree that could be very accurate and very informative.

In the WEKA tutorial we have used the J48 decision tree to form a decision structure.

Problem Statement:

A bank is analyzing the data entries of some individuals to determine whether they can be given a loan or not. (The data set used here is secondary data collected from a free data source.) The following attributes are considered by the bank:

 - Age
 - Education (1 - Middle School, 2 - High School, 3 - Graduation, 4 - Post graduation)
 - Employment (1 - Not employed, 2 - Student, 3 - Business, 4 - Post graduation)
 - Income
 - Credit (1 and 2 - Bad credit rating, 3 and 4 - Good credit rating)
 - Default (Yes or No)

The WEKA Tutorial Steps:

 1. Click on the "Explorer" button in the WEKA GUI Chooser to open the Explorer.
 2. Then click on "Preprocess" -> "Open file" to select the file to be opened.
 3. Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier. We click on the text box beside "Choose" to adjust the settings (here we have kept the default settings). With the defaults, J48 does perform some pruning (using the subtree raising approach), but does not perform reduced-error pruning.
 4. To know more about the settings, we can click on the "More" button in the top right-hand corner, which describes the different options in detail.
 5. Under "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach, as we do not have a separate evaluation data set.
 6. We now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the panel.
 7. We can view this information in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.

The number of leaves is 4 and the size of the tree is 7.

The confusion matrix shows how many instances are correctly categorized and how many are wrongly categorized. Here we see that out of the data set of 50 entries, 37 are correctly categorized, so the accuracy of our model is 74 %.
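The accuracy figure follows directly from the confusion matrix: the correctly classified instances sit on its diagonal. The sketch below shows the arithmetic. Only the totals (37 correct out of 50) come from the result above; the individual cell values are hypothetical and were chosen just so that they add up to those totals.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical cell values: only the totals (37 correct out of 50) come from
# the reported result; the split between the cells is invented for illustration.
# Rows are actual classes, columns are predicted classes.
example = [
    [20, 6],
    [7, 17],
]
print(f"accuracy = {accuracy(example):.0%}")
```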
 8. WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.

The tree shows that the bank will consider an applicant for a loan if the age is less than 30, so that there is a repayment guarantee. It then looks at the credit rating of the individual and gives the loan if the rating is more than 2. If the rating is 2 or lower, the bank looks at the age again: if it is less than 22, the bank will grant the loan.
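The tree described above can be read as a short chain of if-tests, one per decision node. The following sketch writes it out that way. It is an interpretation of the description, not output from WEKA: the function name is hypothetical, the handling of the exact boundary values (30, 2 and 22) is an assumption, and the two refusal leaves are inferred from the description rather than stated in it.

```python
def grant_loan(age, credit_rating):
    """Return True when the tree described above ends in a 'grant' leaf."""
    if age >= 30:
        return False   # leaf 1 (inferred): applicant is 30 or older
    if credit_rating > 2:
        return True    # leaf 2: under 30 with a good rating (3 or 4)
    if age < 22:
        return True    # leaf 3: rating of 2 or lower, but younger than 22
    return False       # leaf 4 (inferred): rating of 2 or lower, aged 22 to 29


# Hypothetical applicants, one per leaf.
for age, rating in [(35, 4), (25, 3), (20, 1), (25, 2)]:
    print(age, rating, grant_loan(age, rating))
```

Three decision nodes and four leaves give seven nodes in all, which agrees with the tree size and leaf count reported by WEKA.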