Data Mining Techniques Using WEKA

Submitted by: Shashidhar Shenoy N (10BM60083)
MBA, 2nd Year, Vinod Gupta School of Management, IIT Kharagpur
As part of the course "IT for Business Intelligence"
Introduction to Weka

Weka stands for 'Waikato Environment for Knowledge Analysis' and is a free, open source software suite developed at the University of Waikato, New Zealand. It is a very popular set of tools for machine learning, containing a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality. Although not as sophisticated as other statistical packages, Weka's popularity lies in the fact that it is not only free but its code is also open source, which means that new algorithms can be implemented by taking the existing algorithms and modifying them as needed.

Weka can be used to perform a wide variety of operations on data. Some of the important operations that can be carried out using the Weka suite are:
• Classification of data
• Regression analysis and prediction
• Clustering of data
• Association of data

A quick guide on how to carry out some of these operations is given in this document.

Quick note on the data used in the guide

Data is meaningless unless it is meaningfully interpreted. Most machine learning software will accept any data that is in the specified format, without any understanding of why it is being used. Thus, the onus lies on the user of the software to choose proper data and feed it to the software in order to derive meaningful insights from it.

Rather than using the pre-built examples provided with the Weka suite, some attempt has been made to obtain freely available data from the internet, and the best place to get .arff files is the Machine Learning Repository of UCI. The about page on their website says:

"The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited 'papers' in all of computer science."

For the demonstrations, two of the data sets have been used: regression uses the Auto MPG data, while classification uses the Contraceptive Method Choice data. More details on the data and their attributes are given in the subsequent sections.
Regression using Weka

Simple regression involving two variables

Regression involves building a model to predict a dependent variable based on one or more independent variables. A simple example of regression would be to predict the body weight of a mammal given its brain weight. Here, the body weight is the dependent variable and the brain weight is the independent variable.

Figure 1: Brain weight vs. body weight

The data is imported into Weka in its native ARFF (Attribute-Relation File Format); Weka supports import of the ubiquitous .csv format too. This is done by clicking on 'Explorer' in the Weka GUI Chooser and then going to 'Open file...' under the Preprocess tab.

Figure 2: Opening a file in the Weka suite
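The same steps can also be scripted against Weka's Java API instead of the Explorer GUI. The following is a minimal sketch of loading an ARFF file and building a simple linear regression model; the file name mammals.arff and the position of the class attribute are assumptions made for illustration.

```java
import weka.classifiers.functions.SimpleLinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SimpleRegressionDemo {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file (file name is assumed for illustration)
        DataSource source = new DataSource("mammals.arff");
        Instances data = source.getDataSet();

        // Treat the last attribute (body weight) as the dependent variable
        data.setClassIndex(data.numAttributes() - 1);

        // Build the simple linear regression model on the full data set
        SimpleLinearRegression model = new SimpleLinearRegression();
        model.buildClassifier(data);

        // Print the fitted equation, similar to the Explorer output
        System.out.println(model);
    }
}
```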
Once the file is loaded, a variety of pre-processing operations can be performed on the data. The data can also be edited using the 'Edit' option. The left section of the Explorer window lists all of the columns in the data (Attributes) and the number of rows of data supplied (Instances). By selecting each column, the right section of the Explorer window gives information about the data in that column of the data set. There is also a visual way of examining the data, which we can see by clicking the 'Visualize All' button.

The next step is to perform the regression analysis. For this, we go to the 'Classify' tab and click on the 'Choose' button. Since we are running a simple linear regression, we navigate to classifiers > functions > SimpleLinearRegression and click on it. Once this is done, we need to supply the test options for building the regression model. The following options are available:

• Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.
• Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.
• Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
• Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Choose one of these options, make sure that the dependent variable shown in the field below is 'body weight (kg)', and click Start. This is the output we get:

Figure 3: Output of simple regression
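These test options can also be reproduced programmatically through the weka.classifiers.Evaluation class. The sketch below shows the cross-validation option with 10 folds on the same data; the file name is again an assumption.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("mammals.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation, mirroring the 'Cross-validation' test option
        SimpleLinearRegression model = new SimpleLinearRegression();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));

        // Correlation coefficient and error measures, as in the Explorer summary
        System.out.println(eval.toSummaryString());
    }
}
```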
It gives the model summary and the details of the regression. Thus, a simple linear regression model has been built using the Weka suite.

Multiple linear regression with many variables

In multiple regression, there is one dependent variable which depends on many independent variables. Many real-world situations are multiple regression models, where one variable depends on a number of other variables. Here, we use a well-known example data set to demonstrate regression using Weka.

Data used for multiple regression

This data set is taken from the UCI machine learning repository and regresses automobile mileage against certain basic attributes of each model. The data can be downloaded from the URL <http://archive.ics.uci.edu/ml/datasets/Auto+MPG> and a corresponding ARFF file created from it. This sample data file attempts to create a regression model to predict the miles per gallon (MPG) for a car based on several attributes of the car (the data is from 1970 to 1982). The model includes these possible attributes of the car: cylinders, displacement, horsepower, weight, acceleration, model year, origin, and car make. The data set has 398 rows of data.

Data Set Characteristics: Multivariate
Attribute Characteristics: Categorical, Real
Associated Tasks: Regression
Number of Instances: 398
Number of Attributes: 8
Missing Values? Yes (the 8 instances with an unknown value for horsepower are removed)

This data set is loaded into the Weka suite using the 'Open file...' option as explained before. This is how the window looks when the data is imported:

Figure 4: Imported data in Weka
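The eight rows with an unknown horsepower value can also be dropped programmatically before modeling. The sketch below is illustrative only; the file name autompg.arff and the attribute name horsepower are assumptions about how the ARFF file was prepared.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DropMissingDemo {
    public static void main(String[] args) throws Exception {
        // File and attribute names assume an ARFF prepared from the Auto MPG data
        Instances data = new DataSource("autompg.arff").getDataSet();
        System.out.println("Rows before: " + data.numInstances());

        // Remove every instance whose 'horsepower' value is missing
        data.deleteWithMissing(data.attribute("horsepower"));
        System.out.println("Rows after:  " + data.numInstances());
    }
}
```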
The first seven attributes are all independent variables, while the eighth one, i.e., CLASS, is the dependent variable for which we try to build a predictive model. Before doing so, we can use as many visualizations of the data as necessary to see the relevant information in each attribute.

Figure 5: Visualize the data in Weka

The next step is to perform the regression. Go to the Classify tab and, under the Choose button, go to classifiers > functions > LinearRegression. Once this is done, we need to supply the test options for building the regression model, in the same manner as we did for simple linear regression. We initially give a 'Percentage split' of 80%, so that the remaining 20% of the data is held out for testing, and examine the output:

Figure 6: Run information shown by Weka
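The same 80/20 percentage-split evaluation can be sketched with the Java API as follows. The split is performed manually here, and the file name is again an assumption.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("autompg.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // MPG (CLASS) is the last attribute
        data.randomize(new Random(1));

        // 80% of the instances for training, the remaining 20% for testing
        int trainSize = (int) Math.round(data.numInstances() * 0.8);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Build the multiple linear regression model on the training split
        LinearRegression model = new LinearRegression();
        model.buildClassifier(train);
        System.out.println(model); // the fitted coefficients

        // Evaluate on the held-out 20%
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);
        System.out.println(eval.toSummaryString());
    }
}
```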
Figure 7: The regression model output by Weka
Figure 8: Regression model details

This model might appear complex to beginners, but it is not. For example, the first line of the regression model, -2.2744 * cylinders=6,3,5,4, means that if the car has six cylinders, you would place a 1 in this column, and if it has eight cylinders, you would place a 0. We can take a test instance, see the deviation from the expected result, and calculate the error.

Example data:
data = 8, 390, 190, 3850, 8.5, 70, 1, 15

class (aka MPG) = -2.2744 * 0
                + -4.4421 * 0
                +  6.74   * 0
                +  0.012  * 390
                + -0.0359 * 190
                + -0.0056 * 3850
                +  1.6184 * 0
                +  1.8307 * 0
                +  1.8958 * 0
                +  1.7754 * 0
                +  1.167  * 0
                +  1.2522 * 0
                +  2.1363 * 0
                + 37.9165

Expected value = 15 mpg
Regression model output = 14.2 mpg

So we see that the regression model output is quite close to the expected value, and thus we have a simple predictive model. We could continue to refine this model to improve its accuracy. We can also use visualization to plot each of the independent variables against the dependent one and see how the variation occurs. A sample plot of horsepower versus miles per gallon is shown; the relationship is found to be inversely proportional.

Figure 9: Visualizing the regression output

Classification using Weka

In classification, different attributes of an item are analysed in order to classify the item into one of several predefined classes. For example, a cricket player can be classified as a batsman, bowler, wicketkeeper or all-rounder depending on attributes such as 'Can bat?', 'Can bowl?', etc.

Training set: The training set is the data used to train the software. Here, the classification has already been made based on a few attributes. The machine observes the patterns and tries to create a rule which explains how the training set data is classified. If the model built by the machine in the first instance is not reliable, more intelligent algorithms can be used to make the model more robust.

Test set: The test set is the actual data for which the classification has not yet been made. Once the training set has been used to build a satisfactory model, we can feed in the test set and obtain the classification of that data.
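As a programmatic counterpart to feeding a test set into a trained model, the sketch below loops over a test file and prints each predicted class label. The file names train.arff and test.arff are assumptions, and NaiveBayes (used later in this paper) merely stands in for any Weka classifier.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainTestDemo {
    public static void main(String[] args) throws Exception {
        // Training and test files are assumed names; both must share the same attributes
        Instances train = new DataSource("train.arff").getDataSet();
        Instances test = new DataSource("test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Learn a model from the training set
        Classifier model = new NaiveBayes();
        model.buildClassifier(train);

        // Classify every instance of the test set and print the predicted label
        for (int i = 0; i < test.numInstances(); i++) {
            double pred = model.classifyInstance(test.instance(i));
            System.out.println("Instance " + i + " -> " + test.classAttribute().value((int) pred));
        }
    }
}
```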
Data used for classification

The data used is the 'Contraceptive Method Choice' data set, available from the UCI machine learning repository, which can be downloaded from the following URL:
<http://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice>

The samples are married women who were either not pregnant or did not know whether they were pregnant at the time of the interview. The problem is to predict the current contraceptive method choice (no use, long-term methods, or short-term methods) of a woman based on her demographic and socio-economic characteristics. Attributes such as wife's age, wife's education, husband's education, number of children ever born, etc. are used to predict the current contraceptive method choice.

Data Set Characteristics: Multivariate
Attribute Characteristics: Categorical, Integer
Associated Tasks: Classification
Number of Instances: 1473
Number of Attributes: 9
Missing Values? No

Use the 'Open file...' option to import the ARFF file into the Weka suite as instructed before. The tenth attribute, i.e., the contraceptive method used, is the predicted variable, and the data looks like this:

Figure 10: CMC data imported into Weka
Next, go to the Classify tab and use the ZeroR algorithm to run the classification model. ZeroR is the most basic classification model; it does nothing but classify all the instances into a single class. We ask Weka to run the model using the entire training set, without splitting it into training and test sets. This is done by choosing 'Use training set' under 'Test options', as explained in the case of regression before. As expected, the model is inaccurate. This is the output from Weka:

Figure 11: Classification output using the ZeroR algorithm

Of particular importance is the confusion matrix, which shows the correctly and incorrectly classified instances. Here we see that all samples have been classified as 'a': the 333 samples which should have been 'b' and the 511 samples which should have been classified as 'c' are incorrectly classified as 'a'. Thus, the accuracy of the model is only about 42% (629 out of 1473 samples).

We can now move on to more accurate algorithms such as NaiveBayes or NaiveBayesUpdateable to improve the accuracy of the predictions. Here is the output of the NaiveBayes classification scheme:
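This baseline can be reproduced with a short program. The sketch assumes the CMC data has been saved as cmc.arff with the contraceptive method as the last attribute.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaselineDemo {
    public static void main(String[] args) throws Exception {
        // File name is assumed; the contraceptive method class is the last attribute
        Instances data = new DataSource("cmc.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // ZeroR simply predicts the majority class for every instance
        ZeroR baseline = new ZeroR();
        baseline.buildClassifier(data);

        // Evaluate on the training data itself ('Use training set' in the Explorer)
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(baseline, data);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // the confusion matrix
    }
}
```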
Figure 12: Classification output using the Naive Bayes algorithm

Here we see that the accuracy of this model, although only within acceptable limits, has improved over the previous model. Thus, we can start training the software to be more accurate by using better algorithms. Various visualization schemes are also available which help visualize the independent and dependent variables.

Conclusion

In this term paper, two simple techniques which can be used to get started with Weka, regression and classification, have been presented. In regression, we demonstrated how Weka can be used to build a regression model with one dependent variable and many independent variables; the live example used was automobile miles per gallon predicted from many independent attributes of a car. In classification, we demonstrated how Weka can be trained to classify a given data set based on observations in a training set; the live data used was the choice of contraceptive method based on a number of demographic factors.

Though the outputs shown here are not striking, the real power of Weka lies in the fact that the algorithms can be trained to produce better results. Since the source code is open to everyone, anyone can download it, and simple manipulations can be made to the existing algorithms with ease to produce more accurate ones. Hence, Weka is used by many researchers in their work.
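Switching from the baseline to NaiveBayes requires only changing the classifier. The sketch below compares the two training-set accuracies side by side, again assuming a cmc.arff file.

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesComparisonDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("cmc.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate both classifiers on the training set and compare accuracies
        for (Classifier clf : new Classifier[] { new ZeroR(), new NaiveBayes() }) {
            clf.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(clf, data);
            System.out.printf("%-12s accuracy: %.1f%%%n",
                    clf.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```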
References

1. Weka reference manual (PDF), available from the Weka website
2. http://www.cs.waikato.ac.nz/ml/weka/
3. http://archive.ics.uci.edu/ml/datasets.html
4. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html#N100F6
