WEKA: A DATA MINING TOOL
Amu Prabhjot Singh 10BM60011
WEKA

Data mining isn't solely the domain of big companies and expensive software. In fact, there is a piece of software that does almost all the same things as these expensive packages: WEKA. WEKA is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is released under the GNU General Public License (GPL). The software is written in the Java™ language and contains a GUI for interacting with data files and producing visual results (think tables and curves). Because it is Java-based, if we don't have a JRE installed on our computer, we should download the WEKA version that also contains the JRE.

To load data into WEKA, we have to put it into a format that WEKA understands. WEKA's preferred method for loading data is the Attribute-Relation File Format (ARFF), where we first define the type of data being loaded and then supply the data itself.

When we start WEKA, the GUI Chooser pops up, as shown in the figure. It lets us choose four ways to work with WEKA and our data: Explorer, Experimenter, Knowledge Flow, and Simple CLI.

REGRESSION

Regression is the easiest technique to use, but is also probably the least powerful. In effect, regression models all fit the same general pattern: a number of independent variables, taken together, produce a result, the dependent variable. The regression model is then used to predict the unknown value of the dependent variable, given the values of the independent variables.

We will perform regression on the price of a house. The price of the house (the dependent variable) is the result of many independent variables: the square footage of the house, the size of the lot, whether there is granite in the kitchen, whether the bathrooms are upgraded, and so on.

To create our regression model, start WEKA, then choose the Explorer. In the Explorer screen, select the Preprocess tab. Click the Open File button and select the ARFF file. After the file is selected, the Explorer window appears as shown in the figure.
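The ARFF file opened in this step first declares the attributes and then lists the data. As a rough sketch of that layout, the Python snippet below builds such a file for the house example. The attribute names follow the regression output in this document (the name `granite` is an assumption), every attribute is assumed to be numeric, and the two data rows are made-up placeholders, not the real data set.

```python
# Sketch: build the text of a small ARFF file for the house-pricing example.
# Attribute names follow this document; "granite" is an assumed name.
ATTRIBUTES = ["houseSize", "lotSize", "bedrooms", "granite", "bathroom", "SellingPrice"]

# Made-up placeholder rows, one value per attribute (NOT the real data).
PLACEHOLDER_ROWS = [
    (2000, 9000, 4, 0, 1, 200000),
    (3000, 12000, 5, 1, 0, 250000),
]


def to_arff(relation, attributes, rows):
    """Return ARFF text: a header that types each attribute, then the data."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines += ["", "@DATA"]
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


arff_text = to_arff("house", ATTRIBUTES, PLACEHOLDER_ROWS)
print(arff_text)
```

Saving the printed text to a file with the `.arff` extension should give something the Open File button can load.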
The left section of the Explorer window outlines all of the columns in the data (Attributes) and the number of rows of data supplied (Instances). When we select a column, the right section of the Explorer window gives information about the data in that column of the data set. For example, selecting the houseSize column in the left section changes the right section to show additional statistical information about that column. Finally, there is a visual way of examining the data, which can be viewed by clicking the Visualize All button.

To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:

1. Click the Choose button, then expand the functions branch.
2. Select the LinearRegression leaf.

This tells WEKA that we want to build a regression model. When we have selected the right model, the WEKA Explorer should appear as shown in the figure.
Though it may seem obvious that we want to use the data we supplied in the ARFF file, the test options actually offer four choices, which control how the model is evaluated. Besides Use training set, the other three are Supplied test set, where we supply a separate set of data on which to test the model; Cross-validation, where WEKA builds models on subsets of the supplied data, tests each on the part left out, and averages the results; and Percentage split, where WEKA builds the model on a percentage of the supplied data and tests it on the remainder. With regression, we can simply choose Use training set.

Finally, the last step in creating our model is to choose the dependent variable (the column we are looking to predict). We know this should be the selling price, since that is what we are trying to determine for the house. Right below the test options, there is a combo box that lets us choose the dependent variable. The column SellingPrice should be selected by default; if it is not, select it. To create our model, click Start. The figure shows the output window.

INTERPRETATION OF THE RESULT:

SellingPrice = (-26.6882 * houseSize) + (7.0551 * lotSize) + (43166.0767 * bedrooms) + (42292.0901 * bathroom) - 21661.1208

Besides giving a strict house value, the model shows patterns from which we can draw conclusions:

Granite doesn't matter: WEKA will only use columns that statistically contribute to the accuracy of the model. It will throw out and ignore columns that don't help in creating a good model. So this regression model is telling us that granite in the kitchen doesn't affect the house's value.
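To see what the equation predicts, it can be evaluated directly. The sketch below copies the coefficients from the output above. The example house is hypothetical, and its `granite` field (an assumed name) is included only to show that a column with no term in the equation cannot change the prediction.

```python
# Coefficients and intercept copied from the regression output above.
COEFFICIENTS = {
    "houseSize": -26.6882,
    "lotSize": 7.0551,
    "bedrooms": 43166.0767,
    "bathroom": 42292.0901,
}
INTERCEPT = -21661.1208


def predict_selling_price(house):
    """Evaluate the linear model; fields without a coefficient are ignored."""
    return INTERCEPT + sum(coef * house[name] for name, coef in COEFFICIENTS.items())


# Hypothetical house, values chosen only for illustration.
house = {"houseSize": 3000, "lotSize": 10000, "bedrooms": 5, "granite": 0, "bathroom": 1}
same_house_with_granite = dict(house, granite=1)

print(round(predict_selling_price(house), 2))
# Granite has no term in the equation, so both predictions are identical.
print(predict_selling_price(house) == predict_selling_price(same_house_with_granite))
```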
Bathrooms do matter: Since we use a simple 0 or 1 value for an upgraded bathroom, we can use the coefficient from the regression model to determine the value that an upgraded bathroom adds to the house.

Bigger houses reduce the value: WEKA is telling us that the bigger our house is, the lower the selling price. This can be seen from the negative coefficient in front of the houseSize variable. The model is telling us that every additional square foot of the house reduces its price by about $26.69.

CLUSTERING

Clustering is the task of assigning a set of objects to groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. WEKA offers clustering capabilities not only as standalone schemes, but also as filters and classifiers.

To begin with clustering, we will use a bank data set. The data set contains the attributes id, age, sex, region, income, married, children, car, save_acct, current_acct, mortgage and pep.

To perform clustering, start WEKA, then choose the Explorer. In the Explorer screen, select the Preprocess tab. Click the Open File button and select the ARFF file. After the file is selected, the Explorer window appears as shown in the figure.

Next, select the "Cluster" tab in the Explorer and click on the "Choose" button. This results in a drop-down list of available clustering algorithms. In this case, select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in the figure, for editing the clustering parameters.

In the pop-up window we enter 6 as the number of clusters (instead of the default value of 2) and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters.

Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window.
We can choose the cluster number and any of the other attributes for each of the three different dimensions available (x-axis, y-axis, and color). Different combinations of choices will result in a visual rendering of different relationships within each cluster. Here we have chosen the cluster number as the x-axis, the instance number as the y-axis, and the "sex" attribute as the color dimension. This will result in a visualization of the distribution of males and females in each cluster. For instance, here clusters 2 and 3 are dominated by males, while clusters 4 and 5 are dominated by females.
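The clustering run and the per-cluster breakdown by sex can be imitated outside WEKA. The following is a minimal k-means sketch in plain Python, not WEKA's SimpleKMeans implementation. It assumes the seed is used to pick the starting centroids at random, it clusters on two numeric attributes only, and the customer records are hypothetical. It uses 3 clusters because the sample is tiny, whereas the run described above uses 6.

```python
import random
from collections import Counter


def k_means(points, k, seed, max_iter=100):
    """Plain k-means on numeric tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed fixes the random starting centroids
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        updated = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if updated == assignments:  # nothing moved, so the run has converged
            break
        assignments = updated
        # Move each centroid to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


def sex_counts_by_cluster(assignments, sexes):
    """Count how many instances of each sex fall into each cluster."""
    counts = {}
    for cluster, sex in zip(assignments, sexes):
        counts.setdefault(cluster, Counter())[sex] += 1
    return counts


# Hypothetical customers: (age, income in thousands, sex).
CUSTOMERS = [
    (23, 18, "FEMALE"), (25, 20, "MALE"), (27, 19, "FEMALE"),
    (40, 30, "MALE"), (42, 33, "MALE"), (44, 31, "FEMALE"),
    (60, 55, "FEMALE"), (62, 58, "FEMALE"), (65, 52, "MALE"),
]
points = [(age, income) for age, income, _ in CUSTOMERS]
sexes = [sex for _, _, sex in CUSTOMERS]

centroids, assignments = k_means(points, k=3, seed=10)
for cluster, counter in sorted(sex_counts_by_cluster(assignments, sexes).items()):
    print(cluster, dict(counter))
```

Running it again with the same seed gives the same clusters, while a different seed may start from different centroids and produce a different grouping.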
INTERPRETING THE RESULT:

Each cluster shows us a type of behavior in our customers, from which we can begin to draw some conclusions:

Cluster 0 - Females with an average age of 37 who live in the inner city and have both a savings account and a current account. They are unmarried and do not have a mortgage or pep. The average monthly income is 23,300.

Cluster 1 - Females with an average age of 44 who live in a rural area and have both a savings account and a current account. They are married and do not have a mortgage or pep. The average monthly income is 27,772.

Cluster 2 - Females with an average age of 48 who live in the inner city and have a current account but no savings account. They are unmarried and do not have a mortgage, but do have pep. The average monthly income is 27,668.

Cluster 3 - Females with an average age of 39 who live in a town and have both a savings account and a current account. They are married and do not have a mortgage or pep. The average monthly income is 24,047.

Cluster 4 - Males with an average age of 39 who live in the inner city and have a current account but no savings account. They are married and have both a mortgage and pep. The average monthly income is 26,359.

Cluster 5 - Males with an average age of 47 who live in the inner city and have both a savings account and a current account. They are unmarried and do not have a mortgage, but do have pep. The average monthly income is 35,419.
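Profiles like these are read off a summary of each cluster's members. As a sketch, assuming the summary takes the mean of numeric attributes and the most frequent value of nominal ones, the helper below (a hypothetical function, not part of WEKA) computes such a profile. The three member rows are made up for illustration.

```python
from collections import Counter
from statistics import mean


def cluster_profile(rows, numeric, nominal):
    """Summarize a cluster: mean of numeric fields, most frequent nominal value."""
    profile = {name: round(mean(row[name] for row in rows), 1) for name in numeric}
    for name in nominal:
        profile[name] = Counter(row[name] for row in rows).most_common(1)[0][0]
    return profile


# Made-up members of a single cluster (illustrative values only).
members = [
    {"age": 35, "income": 22000, "sex": "FEMALE", "region": "INNER_CITY", "married": "NO"},
    {"age": 39, "income": 24000, "sex": "FEMALE", "region": "INNER_CITY", "married": "NO"},
    {"age": 37, "income": 23900, "sex": "MALE", "region": "TOWN", "married": "NO"},
]

profile = cluster_profile(members, ["age", "income"], ["sex", "region", "married"])
print(profile)
```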