Data Mining _ Weka

This paper introduces Weka briefly and proceeds to demonstrate application of two data mining techniques – association rules and regression.

Published in: Education, Technology

  1. 1. Data mining - Weka
     Submitted as a part of the course ‘IT for Business Intelligence’
     Ramya Krishna P, 10BM60056
     4/19/2012
     This paper introduces Weka briefly and proceeds to demonstrate the application of two data mining techniques – association rules and regression.
  2. 2. Table of Contents
     Weka – Introduction
       Requirements
     Getting started
     Data sets
     Association rules
       Business application
       Data set
       Preprocess
       Associate
     Regression
       Business applications
       Data set
       Preprocess
       Linear regression
       Non-numeric input variables
     References
  3. 3. Weka – Introduction
     Weka is a rich tool for data mining. It is a collection of machine learning algorithms that supports classification, regression, clustering, association rule mining and visualization. It is open source software.

     Requirements
     The latest versions of Weka, i.e., Weka 3.7.x, need Java 1.6 to be installed on your system. I have used Weka 3.7.5 for this small tutorial. The latest and other editions of Weka can be downloaded from the Weka website.

     Getting started
     You can run Weka from the command prompt or through the GUI. We use the GUI, which starts with a window listing the available applications. For all our purposes, the ‘Explorer’ application is sufficient. Clicking ‘Explorer’ opens the Explorer window.
  4. 4. To load a data set into Weka, choose ‘Open file’ under the ‘Preprocess’ tab. Now a short note about data sets.

     Data sets
     The default format of a Weka data set is .arff (Attribute-Relation File Format). This is an ASCII text file. A snapshot of a .arff file looks like this.
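As a rough illustration of the .arff layout, the sketch below builds a tiny file as text: a `@relation` line, one `@attribute` line per column (either `numeric` or a list of nominal values in braces), and the rows after `@data`. The relation name, attribute names and values are made up for illustration, and the helper `to_arff` is hypothetical, not part of Weka.

```python
# Minimal sketch of the .arff text layout.
# Relation name, attributes and rows are invented for illustration only.

def to_arff(relation, attributes, rows):
    """Return ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of allowed nominal values.
    rows: list of lists with one value per attribute.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attributes list their allowed values in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("every row needs one value per attribute")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "baskets",
    [
        ("bread", ["true", "false"]),
        ("milk", ["true", "false"]),
        ("total", "numeric"),
    ],
    [["true", "false", 12.5], ["false", "true", 3.0]],
)
print(arff_text)
```

Saving the printed text with a .arff extension should give a file that can be opened from the ‘Preprocess’ tab.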
  5. 5. So, you can either prepare your data in this form or, if you have a spreadsheet (.xls or .xlsx), save your data in .csv format. Then, on clicking ‘Open file’, select the .csv file of your data and click ‘Open’. I will proceed with the rest of the tutorial through examples.

     Association rules
     Association rule mining is a method for discovering relations between variables in data sets. From these relations we derive rules that have a certain level of support and confidence. Such rules can sometimes be of great business value. One typical business application of association rules is ‘market basket analysis’.

     Business application
     The market-basket problem assumes we have some large number of items, e.g., bread, milk. Customers fill their market baskets with some subset of the items, and we get to know what items people buy together, even if we don't know who they are. By developing association rules of the form,
  6. 6. {X1, X2, ..., Xn} -> Y
     we learn that a basket containing X1, X2, ..., Xn has a good chance of also containing Y. So, the next time a retailer stocks up on X1, X2, ..., Xn, he might also stock up on Y based on our prediction. Now, without going too much into the theory, let us see our data set.

     Data set
     The format of my data set is like this:

     TID1 ID2 ID5 ID6
     TID2 ID3 ID4 ID6 ID7 ID9
     TID3 ID4 ID5
     TID4 ID1 ID4 ID5 ID7 ID9 ID10
     ...

     where the first column gives the transaction id and the rest of each row lists the products purchased in that transaction. Unfortunately, Weka cannot accept the data set in this form because the rows are of unequal lengths. Both .arff and .csv require each data record to have the same number of fields.

     To change the data format, create one attribute per item and use "true" and "false" field values in each row to mark whether the item was bought. We can't use 0 and 1 because Apriori (the algorithm we will be using) does not work on numeric attributes. It works only on nominal values. The data now looks like this:

     TID, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10
     1, false, true, false, false, true, true, false, false, false, false
     2, false, false, true, true, false, true, true, false, true, false
     3, false, false, false, true, true, false, false, false, false, false
     4, true, false, false, true, true, false, true, false, true, true

     I have a sample data set (the market basket data set listed in the references) which is, thankfully, already in .csv form. This is a huge data set with 300+ products and 1300+ rows. When you try to run it in Weka, you get an error that the heap size is not sufficient. You can change the heap size by changing the value of ‘maxheap’ in Weka's .ini file (the RunWeka configuration file). However, even with a heap size of 1 GB, this data set is too large to run. So, I have cropped the data set to about 20 attributes and 400 rows. A snapshot of the data set looks like this.
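The conversion from variable-length transactions to one true/false column per item can be sketched as follows. This is only an illustration using the four sample transactions listed above; the helper name `to_rows` is hypothetical, and a real data set would be read from a file rather than typed in.

```python
# Minimal sketch: turn variable-length transactions into fixed-width
# true/false rows that Weka can load as .csv.
import csv
import io

# The four sample transactions from the text (transaction id -> items bought).
transactions = {
    1: ["ID2", "ID5", "ID6"],
    2: ["ID3", "ID4", "ID6", "ID7", "ID9"],
    3: ["ID4", "ID5"],
    4: ["ID1", "ID4", "ID5", "ID7", "ID9", "ID10"],
}
items = [f"ID{i}" for i in range(1, 11)]  # one attribute per item


def to_rows(transactions, items):
    """Return a header row followed by one true/false row per transaction."""
    rows = [["TID"] + items]
    for tid, bought in transactions.items():
        present = set(bought)
        flags = ["true" if item in present else "false" for item in items]
        rows.append([str(tid)] + flags)
    return rows


rows = to_rows(transactions, items)

# Write the rows as CSV text; every record now has the same number of fields.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Using the strings "true" and "false" rather than 0 and 1 keeps the columns nominal, which is what Apriori needs.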
  7. 7. Preprocess
     Once you choose this file under ‘Open file’, this is how it looks.
  8. 8. Weka lists all the attributes present in the data set. It also provides visualizations of the data and other statistics. For example, we can see that ‘fat free hamburger’ is true only 41 times out of 400. We can select the attributes we want for our analysis one by one, check ‘All’, or select ‘Pattern’ and type a Perl regular expression to choose the attributes matching a rule. We check ‘All’.

     Associate
     We go to the ‘Associate’ tab and click ‘Choose’. Out of the algorithms listed, we select Apriori. Clicking the text box beside ‘Choose’ (i.e., on Apriori) lists the various parameters used by Apriori.
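Several of the Apriori parameters refer to rule metrics such as support, confidence, lift, leverage and conviction. The sketch below computes the textbook versions of these metrics for one rule on the four sample transactions from the data set section. The function name `rule_metrics` is hypothetical, and the values Weka reports may differ slightly because of how it handles edge cases and rounding.

```python
# Minimal sketch: textbook rule metrics for a rule X -> Y.
# This is not Weka's implementation, only the standard definitions.
import math

# The four sample transactions from the text, as sets of items.
transactions = [
    {"ID2", "ID5", "ID6"},
    {"ID3", "ID4", "ID6", "ID7", "ID9"},
    {"ID4", "ID5"},
    {"ID1", "ID4", "ID5", "ID7", "ID9", "ID10"},
]


def rule_metrics(transactions, antecedent, consequent):
    """Return support, confidence, lift, leverage and conviction of X -> Y."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if antecedent <= t)
    n_y = sum(1 for t in transactions if consequent <= t)
    n_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = n_xy / n                           # P(X and Y)
    confidence = n_xy / n_x if n_x else 0.0      # P(Y | X)
    p_y = n_y / n
    lift = confidence / p_y if p_y else math.inf
    leverage = support - (n_x / n) * p_y         # P(X and Y) - P(X) * P(Y)
    # Conviction is unbounded when the rule never fails (confidence == 1).
    conviction = math.inf if confidence == 1 else (1 - p_y) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


metrics = rule_metrics(transactions, {"ID4"}, {"ID5"})
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A lift below 1 or a negative leverage, as for this toy rule, indicates that the two sides occur together less often than they would if they were independent.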
  9. 9. We can change these parameters as required. To know what each parameter stands for, click on ‘More’. After changing the parameters, click on ‘OK’. Now, click on ‘Start’ to start building the model. Depending on the size of the data set, this takes a while; meanwhile, the bird icon moves from side to side. A part of the output is shown here.

     Since we have given ‘numRules’ as 10, only the 10 best rules are shown. The first rule is

     Plain English Muffins=false 396 ==> 40 Watt Lightbulb=false 396 <conf:(1)> lift:(1.01) lev:(0) [1] conv:(1.98)

     That is, people who do not buy Plain English Muffins do not buy a 40 Watt Lightbulb either. The two counts of 396 mean that all 396 transactions without Plain English Muffins also lack a 40 Watt Lightbulb, which gives a confidence of 1. The output also gives the lift, leverage and conviction of each rule (an explanation of each can be found under ‘More’, shown above). The model can be rerun with different parameters, and each of the results can be seen under the ‘Result list’. The results can also be saved for later.

     Regression
     Regression, as one knows, models the relation between a dependent variable and one or more independent variables. As there is not much need to explain regression, we jump into the process.

     Business applications
     Before we start with the tutorial, here are some areas where regression can be used:
  10. 10. • Trend line analysis: to show the movement of financial or product attributes over time. Stock prices and oil prices can be analyzed using trend lines.
     • Risk analysis for investments: the capital asset pricing model was developed using linear regression analysis.
     • Sales or market forecasts: multivariate regression is a good method to forecast sales volumes or market shares.
     • Total quality control: quality control methods frequently use linear regression to analyze key product specifications and other measurable parameters of a product or organization (for example, customer complaints over time).
     • Human resources: to predict the demographics and types of future work forces for large companies.

     Data set
     I have used a data set provided by the Weka website for this. A number of data sets for different techniques can be found there. The data set I am using is ‘strike.arff’, extracted from ‘numeric.jar’. The data consists of days lost due to industrial disputes per 1000 wage and salary earners in 18 OECD countries from 1951 to 1985. The independent variables are
       1. country code
       2. year
       3. unemployment
       4. inflation
       5. parliamentary representation of social democratic and labor parties
       6. a time-invariant measure of union centralization.
     If your data is not in .csv or .arff, it needs to be preprocessed as explained above.

     Preprocess
     After loading the data into Weka, it looks like this.
  11. 11. For each numeric attribute, Weka gives statistics such as the mean, maximum, minimum and standard deviation. On clicking ‘Visualize all’, the graphs of all variables are shown.
  12. 12. We check ‘All’ to select all variables and then click on ‘Classify’.

     Linear regression
     We click ‘Choose’ under Classifier and select ‘Linear Regression’ as shown. Click on the box beside ‘Choose’ to set the parameters for Linear Regression.
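For readers who want to see what a linear regression computes, here is a minimal sketch of ordinary least squares with a single input variable. It is not Weka's implementation, which handles many attributes at once, and the numbers are invented for illustration; they are not taken from the strike data set.

```python
# Minimal sketch: least-squares fit of y = slope * x + intercept.
# The data points are hypothetical, not values from strike.arff.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical input (e.g. an unemployment rate) and output (e.g. days lost).
unemployment = [1.0, 2.0, 3.0, 4.0, 5.0]
volume = [120.0, 180.0, 260.0, 310.0, 380.0]

slope, intercept = fit_line(unemployment, volume)
print(f"volume = {slope:.2f} * unemployment + {intercept:.2f}")
```

With several input variables the same idea applies, but the coefficients are found by solving a system of equations instead of a single ratio.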
  13. 13. Then, click on ‘OK’. Now, we have to tell Weka which data to use for testing the model. Apart from using the training set we have loaded, we have 3 more choices: Supplied test set, where we supply a separate set of data on which the model is tested; Cross-validation, where Weka repeatedly builds the model on subsets of the supplied data, tests it on the part left out, and averages the results; and Percentage split, where Weka builds the model on a given percentage of the supplied data and tests it on the rest. For this example, we choose ‘Use training set’.

     By default, Weka takes the last attribute as the dependent attribute. If that is not right for the data, we pick the required variable from the drop-down. We choose ‘volume’ as the dependent variable and click on ‘Start’. A part of the output is shown below.
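The test options differ only in which rows are used to build the model and which are used to test it. As a rough sketch of the idea (not of Weka's own code, which by default also shuffles the data first), the hypothetical helpers below produce row indices for a percentage split and for k-fold cross-validation.

```python
# Minimal sketch: which rows go to training and testing under two test options.
# Helper names are hypothetical; no shuffling is done here.

def percentage_split(n_rows, train_percent):
    """First train_percent of the rows train the model, the rest test it."""
    cut = round(n_rows * train_percent / 100)
    indices = list(range(n_rows))
    return indices[:cut], indices[cut:]


def k_folds(n_rows, k):
    """Yield (train, test) index lists; every row is tested exactly once."""
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]  # every k-th row, starting at this fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


train, test = percentage_split(10, 70)
print("percentage split:", train, test)

folds = list(k_folds(10, 5))
for number, (fold_train, fold_test) in enumerate(folds, start=1):
    print(f"fold {number}: test rows {fold_test}")
```

‘Use training set’ corresponds to building and testing on the same rows, which tends to give an optimistic picture of how well the model fits.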
  14. 14. The first line of the model is

     175.7183 * country=5,3,13,17,7,1,18,6,9,4,10

     It means that if the country code is one of the listed values (for example, 5), you put a ‘1’ in this term of the equation, and if it is not (for example, 8), you put a ‘0’.

     By default, Weka employs attribute selection, which means it may not include all of the attributes in the regression equation. Hence, not all the independent variables appear in the above model. To turn off attribute selection, we change the ‘attributeSelectionMethod’ parameter to "No attribute selection" and run the model again. Now the model is as follows.
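The indicator term in the first line of the model can be read as a 0/1 multiplier on the coefficient. A minimal sketch of just that one term, using the coefficient and country codes quoted above (the function name `country_term` is made up for illustration, and the rest of the equation is not reproduced):

```python
# Minimal sketch: contribution of the term
#   175.7183 * country=5,3,13,17,7,1,18,6,9,4,10
# to the predicted value. Only this single term is modelled here.

COEFFICIENT = 175.7183
COUNTRY_CODES = {5, 3, 13, 17, 7, 1, 18, 6, 9, 4, 10}


def country_term(country_code):
    """Return the coefficient if the country is in the listed set, else 0."""
    indicator = 1 if country_code in COUNTRY_CODES else 0
    return COEFFICIENT * indicator


for code in (5, 8):
    print(f"country {code}: term contributes {country_term(code)}")
```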
  15. 15. Non-numeric input variables
     If we have a binary input attribute (yes/no or true/false), we can convert the two values to 0 and 1. However, there are also techniques that handle both numeric and non-numeric (categorical) attributes.
       1. One way is to build a decision tree and have each classification be a numeric value that is the average of the values for the training examples in that subgroup. The result is called a regression tree.
       2. Another option is to have a separate regression equation for each classification in the tree, based on the training examples in that subgroup. This is called a model tree.
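The difference between the two options can be sketched on a toy example. Here the "tree" is a single split on a categorical attribute fixed by hand, whereas a real tree learner would choose its splits from the data; the data values and function names are invented for illustration.

```python
# Minimal sketch: regression tree leaf (subgroup average) versus
# model tree leaf (subgroup regression line) for one hand-picked split.
from statistics import mean

# Hypothetical training rows: (category, x, y).
training = [
    ("A", 1.0, 10.0), ("A", 2.0, 12.0), ("A", 3.0, 14.0),
    ("B", 1.0, 30.0), ("B", 2.0, 35.0), ("B", 3.0, 40.0),
]


def group_rows(rows):
    """Split the rows into subgroups by their categorical value."""
    groups = {}
    for category, x, y in rows:
        groups.setdefault(category, []).append((x, y))
    return groups


def fit_line(points):
    """Least-squares line through (x, y) points; flat if x never varies."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = mean(xs), mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


groups = group_rows(training)
# Regression tree: each subgroup predicts the average of its training values.
regression_tree = {cat: mean(y for _, y in pts) for cat, pts in groups.items()}
# Model tree: each subgroup gets its own regression equation.
model_tree = {cat: fit_line(pts) for cat, pts in groups.items()}


def predict_regression_tree(category, x):
    return regression_tree[category]


def predict_model_tree(category, x):
    slope, intercept = model_tree[category]
    return slope * x + intercept


for category in ("A", "B"):
    print(category, predict_regression_tree(category, 4.0), predict_model_tree(category, 4.0))
```

The regression tree gives the same prediction for every example in a subgroup, while the model tree can still follow a trend inside the subgroup.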
  16. 16. References
     1. http://www.cs.waikato.ac.nz/ml/weka/
     2. http://www.cs.waikato.ac.nz/ml/weka/index_datasets.html
     3. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/prac05.php
     4. http://inf.abdn.ac.uk/~hnguyen/teaching/CS5553/marketbasket.csv
     5. "The WEKA Data Mining Software: An Update" by Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten
     6. http://www.ehow.com/about_6160819_application-regression-analysis-business.html
     7. http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
     8. http://cs-people.bu.edu/dgs/courses/cs105/lectures/data_mining_estimation.pdf
