WEKA: TOOL FOR DATA MINING
SUBMITTED BY: DIVYA HAMIRWASIA 10BM60025
INTRODUCTION

Waikato Environment for Knowledge Analysis (WEKA) is a free and open source data mining tool. Data mining is the transformation of large amounts of data into meaningful patterns and rules, the results of which can be used to make important business decisions. The ultimate goal of data mining is to create a model that improves the way you read and interpret your existing data and your future data. WEKA is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is distributed under the GNU General Public License (GPL). The software is written in Java and contains a GUI for interacting with data files and producing visual results.

Advantages of WEKA include:
- free availability under the GNU General Public License
- portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform
- a comprehensive collection of data preprocessing and modeling techniques
- ease of use due to its graphical user interfaces

REGRESSION ANALYSIS

Linear regression is an approach to modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.

The data used here relates the number of people employed to:
- the percentage price deflation
- the GNP in millions of dollars
- the number of unemployed in thousands
- the number of people employed by the military
- the number of people over 14
- the year

REGRESSION IN WEKA

Load the data by clicking on the Preprocess tab. Click Open file, choose the target folder, and then select the required .arff file. After selecting the file, your WEKA Explorer should look similar to the screenshot:
To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:
- Click the Choose button, then expand the functions branch.
- Select the LinearRegression leaf.

Now that the desired model has been chosen, we have to tell WEKA which data it should use to test the model. Though it may seem obvious that we want to use the data we supplied in the ARFF file, there are actually different options, some more advanced than what we'll be using. The other three choices are Supplied test set, where you supply a separate set of data on which the model is evaluated; Cross-validation, where WEKA builds models on subsets of the supplied data, tests each one on the part that was held out, and averages the results; and Percentage split, where WEKA builds the model on a percentage of the supplied data and tests it on the remainder. These other choices are more useful with other kinds of models.

With regression, we can simply choose Use training set. This tells WEKA to build our desired model from the data set we supplied in our ARFF file and to evaluate it on that same data.

Select number of people employed as the dependent variable. Click Start. The result is as follows:
The result is:

Number of people employed =
    206.3701 * percentage price deflation
  - 1.2427 * number of people unemployed
  - 0.5971 * number of people employed by the military
  + 0.3079 * number of people over 14
  + 13699.5644

Note that the GNP and the year do not appear in the resulting equation.

CLUSTER ANALYSIS

Clustering allows a user to group data in order to find patterns in it. Clustering is most useful when the data set is well defined and a general pattern needs to be determined from the data. The data set used for the clustering example comes from a BMW dealership. The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases. It hopes to mine this data by finding patterns in it and by using clusters to determine whether certain behaviors emerge among its customers. There are 100 rows of data in this sample.

CLUSTERING IN WEKA

Load the data into WEKA from the bmw.arff file. To do so, click on the Preprocess tab and then click the Open file button. Select the target folder and then the needed file. Once the file is opened, all the attributes will be listed as follows:
Next, click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear.

Finally, we adjust the settings of our clustering algorithm by clicking SimpleKMeans. The only setting that needs to be changed is the numClusters field, which lets us decide how many clusters we want. We set this value to 5 here.
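To make the role of the number of clusters more concrete, the following is a minimal sketch of the general k-means idea in plain Python. It is not WEKA's SimpleKMeans implementation; the function names, the seed, and the six toy rows (hypothetical M5 and Purchase flags, not taken from bmw.arff) are assumptions made only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20, seed=10):
    """Plain k-means sketch: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k rows picked at random (an illustrative choice).
    centroids = [list(r) for r in rng.sample(rows, k)]
    assignments = [0] * len(rows)
    for _ in range(iterations):
        # Step 1: assign every row to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c]))
            for row in rows
        ]
        # Step 2: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [r for r, a in zip(rows, assignments) if a == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has settled
        centroids = new_centroids
    return centroids, assignments


# Hypothetical rows: [looked at M5, made a purchase], each 0 or 1.
rows = [[1, 1], [1, 1], [1, 1], [0, 0], [0, 0], [0, 0]]
centroids, assignments = kmeans(rows, k=2)
print(centroids)
print(assignments)
```

Here k plays the same role as numClusters: it fixes in advance how many groups the rows are divided into, and a different k would give a different grouping of the same rows.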
We click Start and the clustering is done. The result is as follows:
INTERPRETATION OF THE RESULT

Each cluster shows us a type of behavior in our customers, from which we can begin to draw some conclusions:
- Cluster 0: The people in this group appear to wander around the dealership, looking at cars parked outside on the lots, but trail off when it comes to coming into the dealership, and worst of all, they don't purchase anything.
- Cluster 1: The people in this group tend to walk straight to the M5s, ignoring the 3-series cars and the Z4. However, they don't have a high purchase rate: only 52 percent. This is a potential problem and could be a focus for improvement for the dealership, perhaps by sending more salespeople to the M5 section.
- Cluster 2: This group isn't statistically relevant, and we can't draw any good conclusions from its behavior. (This happens sometimes with clusters and may indicate that you should reduce the number of clusters you've created.)
- Cluster 3: The people in this group always end up purchasing a car and always end up financing it. Here the data shows us some interesting things: they appear to walk around the lot looking at cars, then turn to the computer search available at the dealership. Ultimately, they tend to buy M5s or Z4s (but never 3-series). This cluster tells the dealership that it should consider making its search computers more prominent around the lots (outdoor search computers?), and perhaps making the M5 or Z4 much more prominent in the search results. Once a customer in this group has decided to purchase the vehicle, he always qualifies for financing and completes the purchase.
- Cluster 4: The people in this group always look at the 3-series and never look at the much more expensive M5. They walk right into the showroom, choosing not to walk around the lot, and tend to ignore the computer search terminals. While 50 percent get to the financing stage, only 32 percent ultimately finish the transaction. The dealership could conclude that these customers, looking to buy their first BMW, know exactly what kind of car they want (the 3-series entry-level model) and are hoping to qualify for financing to be able to afford it. The dealership could possibly increase sales to this group by relaxing its financing standards or by reducing the 3-series prices.

Another interesting way to examine the data in these clusters is to inspect it visually. To do this, right-click on the Result List section of the Cluster tab and then click Visualize Cluster Assignments. Change the X axis to M5 (Num), the Y axis to Purchase (Num), and the Color to Cluster (Nom). The chart then shows how the clusters are grouped in terms of who looked at the M5 and who made a purchase. At the point X=1, Y=1 (those who looked at M5s and made a purchase), the only clusters represented are 1 and 3. The only clusters at the point X=0, Y=0 are 4 and 0. This matches the result above: clusters 1 and 3 were buying the M5s, while cluster 0 wasn't buying anything, and cluster 4 was only looking at the 3-series. The figure shows the visual cluster layout for our example.
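Reading the chart this way amounts to listing which clusters occur at each (M5, Purchase) point. A minimal sketch of that tally in Python is given below; the rows are hypothetical values chosen to mirror the description above, not the actual bmw.arff data or WEKA's cluster assignments, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical (M5, Purchase, cluster) rows for illustration only.
assigned = [
    (1, 1, 1),
    (1, 1, 3),
    (1, 0, 1),
    (0, 0, 0),
    (0, 0, 4),
    (0, 1, 4),
    (1, 1, 3),
]


def clusters_by_point(rows):
    """Map each (x, y) point to the sorted list of clusters found there."""
    seen = defaultdict(set)
    for x, y, cluster in rows:
        seen[(x, y)].add(cluster)
    return {point: sorted(found) for point, found in seen.items()}


table = clusters_by_point(assigned)
for point in sorted(table):
    print(point, table[point])
```

With real cluster assignments exported from WEKA, the same tally could be used to check the visual reading point by point.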