Classification and Clustering Analysis using Weka

This term paper demonstrates classification and clustering analysis on bank data using Weka. Classification analysis is used to determine whether a particular customer would purchase a Personal Equity Plan, while clustering analysis is used to analyze the behavior of various customer segments.

Published in: Education, Technology

  1. 1. WEKA
     IT For Business Intelligence
     Ishan Awadhesh, 10BM60033 • Term Paper • 19 April 2012
     Vinod Gupta School of Management, IIT Kharagpur
  2. 2. Table of Contents
     • WEKA
     • Data Used
     • Classification Analysis
     • Cluster Analysis
     • Other Applications of Weka
     • References
  3. 3. WEKA
     Waikato Environment for Knowledge Analysis

     DATA MINING TECHNIQUES

     WEKA is a collection of state-of-the-art machine learning algorithms and data preprocessing tools written in Java, developed at the University of Waikato, New Zealand. It is free software that runs on almost any platform and is available under the GNU General Public License. It has a wide range of applications in various data mining techniques. It provides extensive support for the entire process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning. The WEKA workbench includes methods for the main data mining problems: regression, classification, clustering, association rule mining, and attribute selection.

     It can be used through either of the following two interfaces:
     • Command Line Interface (CLI)
     • Graphical User Interface (GUI)

     The WEKA GUI Chooser appears like this: [screenshot of the WEKA GUI Chooser]
  4. 4. The buttons can be used to start the following applications:
     • Explorer: an environment for exploring data with WEKA. It gives access to all the facilities using menu selection and form filling.
     • Experimenter: helps answer the question "Which methods and parameter values work best for the given problem?"
     • KnowledgeFlow: provides the same functions as the Explorer and also supports incremental learning. It allows designing configurations for streamed data processing, so incremental algorithms can be used to process very large datasets.
     • Simple CLI: provides a simple command line interface for directly executing WEKA commands.

     This term paper will demonstrate the following two data mining techniques using WEKA:
     • Classification
     • Clustering (Simple K Means)
  5. 5. Data Used

     The data used in this paper is bank data in comma-separated values (CSV) format. The data contains the following fields:
     • id - a unique identification number
     • age - age of customer in years (numeric)
     • sex - MALE / FEMALE
     • region - inner_city / rural / suburban / town
     • income - income of customer (numeric)
     • married - is the customer married (YES/NO)
     • children - number of children (numeric)
     • car - does the customer own a car (YES/NO)
     • save_acct - does the customer have a savings account (YES/NO)
     • current_acct - does the customer have a current account (YES/NO)
     • mortgage - does the customer have a mortgage (YES/NO)
     • pep - did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
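To make the field list concrete, here is a minimal Python sketch of reading data in this layout outside WEKA. The two sample rows are hypothetical values invented for illustration, not records from the paper's file, and `load_records` is a helper name introduced only for this example.

```python
import csv
import io

# Hypothetical sample rows that follow the field list above; the values are
# made up for illustration and are not taken from the paper's data file.
SAMPLE_CSV = """id,age,sex,region,income,married,children,car,save_acct,current_acct,mortgage,pep
ID00001,48,FEMALE,inner_city,17546.0,NO,1,NO,NO,NO,NO,YES
ID00002,40,MALE,town,30085.1,YES,3,YES,NO,YES,YES,NO
"""

# Fields described as numeric in the field list.
NUMERIC_FIELDS = ("age", "income", "children")


def load_records(text):
    """Parse CSV text into a list of dicts, converting numeric fields to float."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in NUMERIC_FIELDS:
            row[name] = float(row[name])
        records.append(row)
    return records


records = load_records(SAMPLE_CSV)
print(len(records), records[0]["region"], records[0]["income"])
```

All other fields are left as strings, which mirrors how WEKA would treat them as nominal attributes.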
  6. 6. Classification Analysis

     Question: "How likely is person X to buy the new Personal Equity Plan?" By creating a classification tree (a decision tree), the data can be mined to determine the likelihood of this person buying a new PEP. Possible nodes on the tree would be children, income level, and marital status. The attributes of this person can be run against the decision tree to determine the likelihood of them purchasing the Personal Equity Plan.

     Load the data file Bank_Data.CSV into WEKA. This file contains 900 records of the bank's present customers.

     We need to divide up our records so that some data instances are used to create the model and some are used to test the model, to ensure that we didn't overfit it.

     Your screen should look like Figure 1 after loading the data.

     Figure 1. Bank Data Classification in Weka

     We select the Classify tab, then we select the trees node, then the J48 leaf.
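The idea of holding back part of the records for testing can be sketched in a few lines of Python. This is only an illustration: the 30% test fraction, the fixed seed, and the function name `split_records` are assumptions for this example, not settings taken from the paper.

```python
import random


def split_records(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 900 customer records; integers are used as placeholders.
records = list(range(900))
train, test = split_records(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that only contains, say, the most recently added customers.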
  7. 7. Figure 2. Bank Data Classification Algorithm

     At this point, we are ready to create our model in WEKA. Ensure that Use training set is selected so we use the data set we just loaded to create our model. Click Start and let WEKA run. The output from this model should look like the results in Listing 1.
  8. 8. Listing 1. Output from WEKA's classification model

     What do these numbers mean?
     • Correctly Classified Instances - 92.3333%
     • Incorrectly Classified Instances - 7.6667%
     • False Positives - 29
     • False Negatives - 17

     Based on our accuracy rate of 92.3333%, we can say that this is a fairly good model for predicting whether a new customer will buy a Personal Equity Plan or not.
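The reported figures can be related to each other with a little arithmetic. The sketch below assumes that the false positives and false negatives together make up all misclassified instances, which holds for a two-class target such as pep (YES/NO), and it derives the number of evaluated instances that the reported error rate would then imply.

```python
# Figures reported in the text above.
false_positives = 29
false_negatives = 17
error_rate = 7.6667 / 100  # "Incorrectly Classified Instances"

# In a two-class problem every error is either a false positive
# or a false negative.
misclassified = false_positives + false_negatives

# Number of evaluated instances implied by the error count and error rate.
implied_total = misclassified / error_rate

# Accuracy recomputed from the counts, as a percentage.
accuracy = 1 - misclassified / round(implied_total)

print(misclassified, round(implied_total), round(accuracy * 100, 4))
```

Under these assumptions the recomputed accuracy matches the reported 92.3333%, with the error counts corresponding to roughly 600 evaluated instances.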
  9. 9. You can see the tree by right-clicking on the model you just created in the result list. On the pop-up menu, select Visualize tree. You'll see the classification tree we just created, although in this example the visual tree doesn't offer much help.

     Figure 3. Classification Tree Visualization

     There's one final step to validating our classification tree, which is to run our test set through the model and ensure that the accuracy of the model when evaluating the test set isn't too different from the training set. To do this, in Test options, select the Supplied test set radio button and click Set. Choose the file bmw-test.arff, which contains 1,500 records that were not in the training set we used to create the model. When we click Start this time, WEKA will run this test data set through the model we already created and let us know how the model did. Below is the output.
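How a customer's attributes are "run against" a decision tree can be illustrated with a short Python sketch. The tree below is hypothetical: the attribute names come from the bank data, but the split points and leaf labels are invented for illustration and are not the J48 tree that WEKA produced.

```python
# Hypothetical tree: the attributes come from the data set, but the split
# points and leaf labels are invented; this is not the J48 output.
TREE = {
    "attribute": "children",
    "threshold": 1,
    "left": {  # children <= 1
        "attribute": "income",
        "threshold": 30000,
        "left": {"label": "NO"},    # income <= 30000
        "right": {"label": "YES"},  # income > 30000
    },
    "right": {"label": "NO"},  # children > 1
}


def classify(node, customer):
    """Walk from the root to a leaf, following the branch each test selects."""
    while "label" not in node:
        goes_left = customer[node["attribute"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


customer = {"children": 0, "income": 42000}
print(classify(TREE, customer))
```

Each internal node tests one attribute, so classifying a customer costs only as many comparisons as the tree is deep.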
  10. 10. Listing 2. Output from WEKA's classification model of Test Data

     Comparing the "Correctly Classified Instances" from this test set (90.5 percent) with the "Correctly Classified Instances" from the training set (92.3333 percent), we see that the accuracy of the model is pretty close, which indicates that the model will not break down with unknown data, or when future data is applied to it.
  11. 11. Cluster Analysis

     Question: "What age groups are more likely to buy a Personal Equity Plan?" The data can be mined to compare the ages of past PEP purchasers. From this data, it could be found whether certain age groups (22-30 year olds, for example) have a higher propensity to go for a PEP. The data, when mined, will tend to cluster around certain age groups, allowing the user to quickly determine patterns in the data.

     Load the data file Bank_data.CSV into WEKA using the same steps we used to load data into the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the columns, the attribute data, the distribution of the columns, etc. Your screen should look like Figure 4 after loading the data.

     Figure 4. Bank cluster data in Weka

     With this data set, we are looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear (this will be our preferred method of clustering for this paper).
  12. 12. Finally, we want to adjust the attributes of our cluster algorithm by clicking SimpleKMeans. The only attribute of the algorithm we are interested in adjusting here is the numClusters field, which specifies how many clusters we want to create. Let's change the default value of 2 to 5 for now, but keep these steps in mind if you later want to adjust the number of clusters created. Your WEKA Explorer should look like Figure 5 at this point. Click OK to accept these values.

     Figure 5. Cluster Attributes

     At this point, we are ready to run the clustering algorithm. Remember that 100 rows of data with five data clusters would likely take a few hours of computation with a spreadsheet, but WEKA can produce the answer in less than a second. Your output should look like Listing 3.
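To show what the number of clusters controls, here is a minimal k-means sketch in plain Python. It handles only numeric attributes and does no scaling, so it is an illustration of the general technique rather than a reimplementation of WEKA's SimpleKMeans; the sample points, the fixed iteration count, and the function name `kmeans` are assumptions made for this example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on numeric tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical (age, income in thousands) pairs, invented for illustration.
points = [(25, 18), (27, 20), (24, 17), (45, 40), (47, 43), (44, 39)]
centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))
print(assignments)
```

Raising `k` splits the data into more, smaller groups, which is the same trade-off explored in the paper by moving from 5 to 10 clusters.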
  13. 13. Listing 3. Cluster Output with 5 clusters
  14. 14. Listing 4. Cluster Output with 10 Clusters

     Clusters

     One thing that is clear from the clusters is that male customers are clustered in only 2-3 groups, while female customers are distributed among 7 clusters, so preparing an offering for a specific

     Description of Clusters

     Cluster 0 - This group consists of unmarried, mid-income earning females in their early 40s who live in rural areas. On average they have two children; they have no car and no personal equity plan, but they do have savings and current accounts.

     Cluster 1 - This group consists of married, high-income earning females in their late 40s who live in rural areas. On average they have two children; they have no car and no personal equity plan, but they do have savings and current accounts.

     Cluster 2 - This group consists of married, low-income earning females in their early 40s who live in the inner city. On average they have one child; they have no car and no savings account, but they do have a current account and a personal equity plan.
  15. 15. Cluster 3 - This group consists of married, low-income earning females in their early 30s who live in town. On average they have one or two children; they have no car, no savings account and no personal equity plan, but they do have a current account.

     Cluster 4 - This group consists of married, mid-income earning males in their late 30s who live in the inner city. On average they have one child or none; they have no savings account but they do have a personal equity plan, savings and current accounts.

     Cluster 5 - This group consists of unmarried, high-income earning males in their early 40s who live in town. On average they have one child or none; they have a car, a personal equity plan, and savings and current accounts.

     Cluster 6 - This group consists of married, mid-income earning females in their early 40s who live in the inner city. They mostly don't have any children; they have no savings account and no personal equity plan, but they do have a current account.

     Cluster 7 - This group consists of unmarried, high-income earning females in their mid 40s who live in the inner city. On average they have one or two children; they have no car and no personal equity plan, but they do have savings and current accounts.

     Cluster 8 - This group consists of unmarried, high-income earning females in their mid 40s who live in town. On average they have one child or none; they have no personal equity plan, but they do have a car and savings and current accounts.

     Cluster 9 - This group consists of married, mid-income earning males in their early 40s who live in the inner city. On average they have one or two children; they have no car, no personal equity plan and no current account, but they do have a savings account.
  16. 16. One other interesting way to examine the data in these clusters is to inspect it visually. To do this, right-click on the Result List section of the Cluster tab. One of the options from this pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you play with the results and see them visually. For this example, change the X axis to income (Num), the Y axis to children (Num), and the Color to Cluster (Nom). This will show us in a chart how the clusters are grouped in terms of income and number of children. Also, turn up the "Jitter" to about three-fourths of the way to its maximum, which will artificially scatter the plot points to allow us to see them more easily.

     Figure 6. Cluster Visual Inspection
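The effect of jitter can be sketched in Python as adding a small random offset to each plotted value, so that points sharing the same value (for example, many customers with exactly one child) no longer sit on top of each other. The uniform noise, the `amount` parameter, and the sample values are assumptions for illustration; how WEKA scales its Jitter slider internally is not covered here.

```python
import random


def jitter(values, amount, seed=0):
    """Return a copy of values with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical numbers of children; repeated values would overlap in a plot.
children = [0, 0, 1, 1, 1, 2, 2, 3]
jittered = jitter(children, amount=0.3)
print([round(v, 2) for v in jittered])
```

Jitter changes only how the points are drawn; the underlying data and the cluster assignments stay the same.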
  17. 17. Other Applications of Weka
     • Discretization
     • Regression
     • Nearest Neighbor

     References
     • https://www.ibm.com/developerworks/opensource/library/os-weka2/
     • http://maya.cs.depaul.edu/classes/ect584/weka/preprocess.html
     • http://www.cs.waikato.ac.nz/~ml/weka/
