Data Mining Techniques using WEKA

VINOD GUPTA SCHOOL OF MANAGEMENT, IIT KHARAGPUR

In partial fulfillment of the requirements for the degree of
MASTER OF BUSINESS ADMINISTRATION

SUBMITTED BY:
Prashant Menon
10BM60061
VGSOM, IIT KHARAGPUR
Introduction to WEKA
WEKA is an open-source collection of many data mining and machine learning algorithms, including:
– data pre-processing
– classification
– clustering
– association rule extraction
• Created by researchers at the University of Waikato in New Zealand
• Java based (also open source)

Main features of WEKA
• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 15 attribute/subset evaluators + 10 search algorithms for feature selection
• 3 algorithms for finding association rules
• 3 graphical user interfaces
– “The Explorer” (exploratory data analysis)
– “The Experimenter” (experimental environment)
– “The KnowledgeFlow” (new process-model-inspired interface)

Weka: Download and Installation
• Download Weka (the stable version) from http://www.cs.waikato.ac.nz/ml/weka/
– Choose a self-extracting executable (including Java VM)
– (If one is interested in modifying/extending Weka, there is a developer version that includes the source code)
• After the download is complete, run the self-extracting file to install Weka and use the default set-up.

Weka Application Interfaces
• Explorer
– preprocessing, attribute selection, learning, visualization
• Experimenter
– testing and evaluating machine learning algorithms
• Knowledge Flow
– visual design of the KDD process
• Simple Command-line
– a simple interface for typing commands

Weka Functions and Tools
• Preprocessing filters
• Attribute selection
• Classification/Regression
• Clustering
• Association discovery
• Visualization

Load data file and Preprocessing
• Load data file in formats: ARFF, CSV, C4.5, binary
• Import from URL or SQL database (using JDBC)
• Preprocessing filters
– Adding/removing attributes
– Attribute value substitution
– Discretization
– Time series filters (delta, shift)
– Sampling, randomization
– Missing value management
– Normalization and other numeric transformations

Feature Selection
• Very flexible: arbitrary combination of search and evaluation methods
• Search methods
– best-first
– genetic
– ranking ...
• Evaluation measures
– ReliefF
– information gain
– gain ratio

Classification
• Predicted target must be categorical
• Implemented methods
– decision trees (J48, etc.) and rules
– Naïve Bayes
– neural networks
– instance-based classifiers ...
• Evaluation methods
– test data set
– cross-validation

Clustering
• Implemented methods
– k-Means
– EM
– Cobweb
– X-means
– FarthestFirst ...
• Clusters can be visualized and compared to “true” clusters (if given)

Regression
• Predicted target is continuous
• Methods
– Linear regression
– Simple linear regression
– Neural networks
– Regression trees ...

Weka: Pros and Cons
Pros
– Open source
• Free
• Extensible
• Can be integrated into other Java packages
– GUIs (Graphical User Interfaces)
• Relatively easy to use
– Features
• Run individual experiments, or
• Build KDD phases
Cons
– Lack of proper and adequate documentation
– Systems are updated constantly (“kitchen sink” syndrome)

WEKA Data Formats
• Data can be imported from a file in various formats:
– ARFF (Attribute Relation File Format) has two sections:
• the Header section defines the attribute names, types and the relation.
• the Data section lists the data records.
– CSV: Comma Separated Values (text file)
– C4.5: a format used by the decision tree induction algorithm C4.5; requires two separate files
• Names file: defines the names of the attributes
• Data file: lists the records (samples)
– binary
• Data can also be read from a URL or from an SQL database (using JDBC)
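Since WEKA is Java based, the same loading and preprocessing steps can also be scripted against its Java API. The following is only a minimal sketch: the file name autos.arff and the choice of the ReplaceMissingValues filter are illustrative assumptions, not part of the original slides.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class LoadAndPreprocess {
        public static void main(String[] args) throws Exception {
            // DataSource picks the right loader from the file extension (ARFF, CSV, ...)
            Instances data = DataSource.read("autos.arff");

            // Example preprocessing filter: replace missing values ("?" in the ARFF file)
            // with the mean/mode of each attribute
            ReplaceMissingValues filter = new ReplaceMissingValues();
            filter.setInputFormat(data);
            Instances cleaned = Filter.useFilter(data, filter);

            System.out.println(cleaned.toSummaryString());
        }
    }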
This term paper will demonstrate the following two data mining techniques using WEKA:
• Clustering (Simple K-Means)
• Linear regression

Clustering
Clustering allows a user to make groups of data to determine patterns from the data. Clustering has its advantages when the data set is defined and a general pattern needs to be determined from the data. One can create a specific number of groups, depending on business needs. One defining benefit of clustering over classification is that every attribute in the data set will be used to analyze the data. A major disadvantage of using clustering is that the user is required to know ahead of time how many groups he wants to create. For a user without any real knowledge of his data, this might be difficult. It might take several steps of trial and error to determine the ideal number of groups to create. However, for the average user, clustering can be the most useful data mining method one can use. It can quickly take the entire set of data and turn it into groups, from which one can quickly draw some conclusions.

Data set for WEKA
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe. The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.), and represents the average loss per car per year.

A part of the saved ARFF file:

@relation autos
@attribute normalized-losses real
@attribute make { alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo}
@attribute fuel-type { diesel, gas}
@attribute aspiration { std, turbo}
@attribute num-of-doors { four, two}
@attribute body-style { hardtop, wagon, sedan, hatchback, convertible}
@attribute drive-wheels { 4wd, fwd, rwd}
@attribute engine-location { front, rear}
@attribute wheel-base real
@attribute length real
@attribute width real
@attribute height real
@attribute curb-weight real
@attribute engine-type { dohc, dohcv, l, ohc, ohcf, ohcv, rotor}
@attribute num-of-cylinders { eight, five, four, six, three, twelve, two}
@attribute engine-size real
@attribute fuel-system { 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi}
@attribute bore real
@attribute stroke real
@attribute compression-ratio real
@attribute horsepower real
@attribute peak-rpm real
@attribute city-mpg real
@attribute highway-mpg real
@attribute price real
@attribute symboling { -3, -2, -1, 0, 1, 2, 3}
@data
?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,13495,3
?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27,16500,3
?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,19,26,16500,1
164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10,102,5500,24,30,13950,2
164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8,115,5500,18,22,17450,2
?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250,2
158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710,1
?,audi,gas,std,four,wagon,fwd,front,105.8,192.7,71.4,55.7,2954,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920,1
158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875,1
?,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,67.9,52,3053,ohc,five,131,mpfi,3.13,3.4,7,160,5500,16,22,?,0
192,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16430,2
192,bmw,gas,std,four,sedan,rwd,front,101.2,176.8,64.8,54.3,2395,ohc,four,108,mpfi,3.5,2.8,8.8,101,5800,23,29,16925,0
188,bmw,gas,std,two,sedan,rwd,front,101.2,176.8,64.8,54.3,2710,ohc,six,164,mpfi,3.31,3.19,9,121,4250,21,28,20970,0
188,bmw,gas,std,four,sedan,rwd,front,101.2,176.8,64.8,54.3,2765,ohc,six,164,mpfi,3.31,3.19,9,121,4250,21,28,21105,0
?,bmw,gas,std,four,sedan,rwd,front,103.5,189,66.9,55.7,3055,ohc,six,164,mpfi,3.31,3.19,9,121,4250,20,25,24565,1
?,bmw,gas,std,four,sedan,rwd,front,103.5,189,66.9,55.7,3230,ohc,six,209,mpfi,3.62,3.39,8,182,5400,16,22,30760,0
?,bmw,gas,std,two,sedan,rwd,front,103.5,193.8,67.9,53.7,3380,ohc,six,209,mpfi,3.62,3.39,8,182,5400,16,22,41315,0
?,bmw,gas,std,four,sedan,rwd,front,110,197,70.9,56.3,3505,ohc,six,209,mpfi,3.62,3.39,8,182,5400,15,20,36880,0
121,chevrolet,gas,std,two,hatchback,fwd,front,88.4,141.1,60.3,53.2,1488,l,three,61,2bbl,2.91,3.03,9.5,48,5100,47,53,5151,2
98,chevrolet,gas,std,two,hatchback,fwd,front,94.5,155.9,63.6,52,1874,ohc,four,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6295,1
81,chevrolet,gas,std,four,sedan,fwd,front,94.5,158.8,63.6,52,1909,ohc,four,90,2bbl,3.03,3.11,9.6,70,5400,38,43,6575,0
118,dodge,gas,std,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,1876,ohc,four,90,2bbl,2.97,3.23,9.41,68,5500,37,41,5572,1
118,dodge,gas,std,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,1876,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6377,1
118,dodge,gas,turbo,two,hatchback,fwd,front,93.7,157.3,63.8,50.8,2128,ohc,four,98,mpfi,3.03,3.39,7.6,102,5500,24,30,7957,1
148,dodge,gas,std,four,hatchback,fwd,front,93.7,157.3,63.8,50.6,1967,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6229,1
148,dodge,gas,std,four,sedan,fwd,front,93.7,157.3,63.8,50.6,1989,ohc,four,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6692,1

Clustering in WEKA
Load the data file AUTOS.arff into WEKA using the same steps we used to load data into the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the columns, the attribute data, the distribution of the columns, etc. The screen should look like the figure shown below after loading the data.

With this data set, we are looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear (this will be our preferred method of clustering for this paper). The WEKA Explorer window should look like the following figure at this point.
Finally, we want to adjust the attributes of our cluster algorithm by clicking SimpleKMeans (not the best UI design here, but go with it). The only attribute of the algorithm we are interested in adjusting here is the numClusters field, which tells us how many clusters we want to create. (Remember, one needs to know this before starting.) Let's change the default value of 2 to 4 for now, but keep these steps in mind later if one wants to adjust the number of clusters created. WEKA Explorer should look like the following at this point. Click OK to accept these values.
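For readers who prefer the Java API over the Explorer GUI, roughly the same steps can be scripted as follows. This is only a sketch: the file name autos.arff and the seed value are assumptions for illustration.

    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClusterAutos {
        public static void main(String[] args) throws Exception {
            // Load the same ARFF file that was loaded in the Preprocess tab
            Instances data = DataSource.read("autos.arff");

            // Configure SimpleKMeans just like in the GUI: numClusters = 4
            SimpleKMeans kMeans = new SimpleKMeans();
            kMeans.setNumClusters(4);
            kMeans.setSeed(10);            // random seed (arbitrary example value)
            kMeans.buildClusterer(data);

            // Evaluate on the training data, as in "Use training set" mode
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(kMeans);
            eval.evaluateClusterer(data);
            System.out.println(eval.clusterResultsToString());
        }
    }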
At this point, we are ready to run the clustering algorithm. Remember that this many rows of data with four data clusters would likely take a few hours of computation with a spreadsheet, but WEKA can spit out the answer in less than a second. The output should look like the figure shown below.
Time taken to build model (full training data) : 0.02 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      60 ( 29%)
1      33 ( 16%)
2      55 ( 27%)
3      57 ( 28%)

Based on the values of the cluster centroids as shown in the above figure, we can state the characteristics of each of the clusters. For explanation, we take Cluster 1 and Cluster 2.

Cluster 2
This group will always look for the premium segment car ‘Peugot’. It has the largest wheel base, length, height, curb weight and engine size. As the engine size is inversely proportional to the mileage, it has the lowest city and highway mileage. It has the highest number of cylinders. Compression ratio, horsepower and peak RPM all have the highest values, which makes it the highest-priced car.

Cluster 1
This group will always look for the ‘VALUE FOR MONEY’ car. It belongs to the mass segment. As the engine power is inversely proportional to the mileage, we can see it has
the highest highway and city mileage and low compression ratio, horsepower and RPM. For this segment, price is one of the important criteria before buying the car.

The cluster analysis will help the car company decide which segment it should target before starting new product development / bringing the car to the market.

Visualization of Clustering Results
A more intuitive way to go through the results is to visualize them in graphical form. To do so:
• Right-click the result in the Result list panel
• Select Visualize cluster assignments
• By setting the X-axis variable to Cluster, the Y-axis variable to Instance_number and Color to aspiration, we get the following output:

Here we can see that all the clusters (segments) show a mixed response on aspiration. Similarly, we can change the variables on the X-axis, Y-axis and color to visualize other aspects of the result. Note that WEKA has generated an extra variable named “Cluster” (not present in the original data) which signifies the cluster membership of the various instances. We can save the output as an ARFF file by clicking on the Save button. The output file contains an additional attribute, cluster, for each instance. Thus, besides the values of the twenty-six attributes for any instance, the output also specifies the cluster membership of that instance.
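The Save button does this through the GUI; roughly the same augmented ARFF file can be produced programmatically with WEKA's AddCluster filter. This is a sketch under the same assumptions as before (numClusters = 4, input file autos.arff); the output file name is arbitrary.

    import java.io.File;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddCluster;

    public class SaveClusterAssignments {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("autos.arff");

            // SimpleKMeans configured as in the Explorer (numClusters = 4)
            SimpleKMeans kMeans = new SimpleKMeans();
            kMeans.setNumClusters(4);

            // AddCluster appends a nominal "cluster" attribute holding each
            // instance's cluster membership, mirroring the saved ARFF output
            AddCluster addCluster = new AddCluster();
            addCluster.setClusterer(kMeans);
            addCluster.setInputFormat(data);
            Instances withClusters = Filter.useFilter(data, addCluster);

            // Write the augmented data set to a new ARFF file
            ArffSaver saver = new ArffSaver();
            saver.setInstances(withClusters);
            saver.setFile(new File("autos_clustered.arff"));
            saver.writeBatch();
        }
    }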
Creating the regression model with WEKA
To create the model, click on the Classify tab. The first step is to select the model we want to build, so WEKA knows how to work with the data and how to create the appropriate model:
1. Click the Choose button, then expand the functions branch.
2. Select the LinearRegression leaf.
This tells WEKA that we want to build a regression model. As one can see from the other choices, though, there are lots of possible models to build. This should give a good indication of how we are only touching the surface of this subject. There is another choice called SimpleLinearRegression in the same branch. Do not choose this, because simple regression only looks at one variable, and we have six.

The attributes used are as follows:
• MYCT: machine cycle time in nanoseconds (integer)
• MMIN: minimum main memory in kilobytes (integer)
• MMAX: maximum main memory in kilobytes (integer)
• CACH: cache memory in kilobytes (integer)
• CHMIN: minimum channels in units (integer)
• CHMAX: maximum channels in units (integer)
• PRP: published relative performance (integer) (target variable)

A part of the data file is as follows:

@relation machine_cpu
@attribute MYCT numeric
@attribute MMIN numeric
@attribute MMAX numeric
@attribute CACH numeric
@attribute CHMIN numeric
@attribute CHMAX numeric
@attribute class numeric
@data
125,256,6000,256,16,128,198
29,8000,32000,32,8,32,269
29,8000,32000,32,8,32,220
29,8000,32000,32,8,32,172
29,8000,16000,32,8,16,132
26,8000,32000,64,8,32,318
23,16000,32000,64,16,32,367
23,16000,32000,64,16,32,489
23,16000,64000,64,16,32,636
When we've selected the right model, WEKA Explorer should look like the following figure.

Now that the desired model has been chosen, we have to tell WEKA which data it should use to build the model. Though it may be obvious to us that we want to use the data we supplied in the ARFF file, there are actually different options, some more advanced than what we'll be using. The other three choices are Supplied test set, where one can supply a separate data set on which the model is evaluated; Cross-validation, which lets WEKA evaluate the model by repeatedly building it on subsets of the supplied data and testing it on the remainder; and Percentage split, where WEKA builds the model on a percentage of the supplied data and tests it on the rest. These other choices are useful with different models, but are beyond the scope of this paper. With regression, we can simply choose Use training set. This tells WEKA that to build our desired model, we can simply use the data set we supplied in our ARFF file.

Finally, the last step to creating our model is to choose the dependent variable (the column we are looking to predict). We know this should be class, since that's what we're trying to determine. Right below the test options, there's a combo box that lets us choose the dependent variable. The column class should be selected by default. If it's not, please select it.

Now we are ready to create our model. Click Start. The following figure shows what the output should look like.

=== Run information ===

Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation:     machine_cpu
Instances:    209
Attributes:   7
              MYCT
              MMIN
              MMAX
              CACH
              CHMIN
              CHMAX
              class
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

Linear Regression Model

class =

      0.0491 * MYCT +
      0.0152 * MMIN +
      0.0056 * MMAX +
      0.6298 * CACH +
      1.4599 * CHMAX +
    -56.075

Time taken to build model: 0 seconds
=== Evaluation on training set ===
=== Summary ===

Correlation coefficient          0.93
Mean absolute error             37.9748
Root mean squared error         58.9899
Relative absolute error         39.592  %
Root relative squared error     36.7663 %
Total Number of Instances      209

Here class represents PRP (Published Relative Performance).

Interpreting the regression model
WEKA puts the regression model right there in the output, as shown in the listing below:

Class(PRP) = 0.0491 * MYCT + 0.0152 * MMIN + 0.0056 * MMAX + 0.6298 * CACH + 1.4599 * CHMAX - 56.075

Listing 2 shows the results of plugging in the values for the second machine in the data excerpt above.

Listing 2. PRP value using the regression model
Class(PRP) = 0.0491 * 29 + 0.0152 * 8000 + 0.0056 * 32000 + 0.6298 * 32 + 1.4599 * 32 - 56.075
 PRP = 1.4239 + 121.6 + 179.2 + 20.1536 + 46.7168 - 56.075
 PRP = 313.0193

The actual PRP recorded for that machine in the data file is 269, so the model over-predicts this instance by roughly 44 units.
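The same model can also be built and queried through WEKA's Java API. The following is a minimal sketch; the file name cpu.arff is an assumption for wherever the machine_cpu data is stored.

    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CpuRegression {
        public static void main(String[] args) throws Exception {
            // Load the machine_cpu data; the last attribute ("class", i.e. PRP)
            // is the dependent variable
            Instances data = DataSource.read("cpu.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Build the linear regression model on the full training set
            LinearRegression model = new LinearRegression();
            model.buildClassifier(data);
            System.out.println(model);   // prints the fitted equation

            // Predict PRP for one instance, e.g. the second row of the data
            Instance machine = data.instance(1);
            double predicted = model.classifyInstance(machine);
            System.out.println("Predicted PRP: " + predicted
                    + "  (actual: " + machine.classValue() + ")");
        }
    }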
However, looking back to the beginning of this paper, data mining isn't just about outputting a single number: it's about identifying patterns and rules. It is not strictly used to produce an absolute number, but rather to create a model that lets us detect patterns, predict output, and come up with conclusions backed by the data.

CHMIN (minimum channels) does not appear in the model: WEKA only keeps columns that statistically contribute to the accuracy of the model and throws out columns that don't help in creating a good model. So this regression model is telling us that the minimum number of channels does not help predict PRP.

We can also visualize the classifier errors, i.e. those instances that are poorly predicted by the regression equation, by right-clicking on the result set in the Result list panel and selecting Visualize classifier errors. The X-axis shows the actual PRP and the Y-axis the predicted PRP.
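As a closing note, the evaluation summary shown earlier (correlation coefficient and error measures) can also be reproduced, or replaced by a less optimistic cross-validation estimate, through the API. This is a rough sketch; the choice of 10 folds, the random seed and the file name cpu.arff are arbitrary assumptions.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EvaluateCpuModel {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cpu.arff");
            data.setClassIndex(data.numAttributes() - 1);

            LinearRegression model = new LinearRegression();

            // 10-fold cross-validation instead of "Use training set": the model is
            // repeatedly built on 9 folds and tested on the held-out fold, giving a
            // less optimistic estimate of the error measures
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }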
