
# DATA MINING WITH WEKA



Term paper on data mining: How to use Weka for data analysis

Submitted by: Shubham Gupta (10BM60085), Vinod Gupta School of Management
The first technique we apply in Weka is classification. The data below describes the financial situation in Japan and was collected for 1970-2009. The columns represent:

1) BROAD: Broad money supplied in the economy
2) DOMC: Domestic consumption
3) PSC: Payment securities
4) CLAIMS: Claims on the government
5) TOTRES: Total reserves
6) GDP: Gross domestic product
7) LIQLB: Liquid liability

We want a decision tree that tells us which values of the independent variables lead to which outcome. For example, if we knew that DOMC > 140 and PSC > 150.3 always gave a GDP greater than 3 trillion yen, that rule would help us make better decisions. We perform this analysis to generate a decision tree that yields such rules.

| YEAR | BROAD | CLAIMS | DOMC | PSC | TOTRES | LIQLB | GDP |
|------|-------|--------|------|-----|--------|-------|-----|
| 1970 | 83.65 | 61.88 | 134.25 | 111.75 | 4876114550 | 104.73 | 205,995,000,000 |
| 1971 | 106.70 | 21.37 | 147.59 | 123.72 | 15469150615 | 118.21 | 232,681,000,000 |
| 1972 | 116.14 | 23.17 | 160.29 | 133.47 | 18932675966 | 129.03 | 308,137,000,000 |
| 1973 | 116.02 | 19.84 | 157.87 | 132.20 | 13723930639 | 126.07 | 418,640,000,000 |
| 1974 | 113.08 | 13.72 | 154.00 | 126.49 | 16551248298 | 120.50 | 464,705,000,000 |
| 1975 | 118.31 | 13.02 | 164.40 | 129.96 | 14910849997 | 127.56 | 505,317,000,000 |
| 1976 | 122.40 | 12.09 | 169.96 | 130.63 | 18590784646 | 131.20 | 567,926,000,000 |
| 1977 | 125.82 | 8.76 | 172.45 | 128.49 | 25907710023 | 133.90 | 698,968,000,000 |
| 1978 | 130.36 | 8.56 | 178.29 | 127.71 | 37824744320 | 139.12 | 982,078,000,000 |
| 1979 | 135.51 | 8.19 | 183.05 | 129.23 | 31926244737 | 142.67 | 1,022,190,000,000 |
| 1980 | 137.95 | 8.09 | 188.44 | 131.29 | 38918848626 | 144.30 | 1,071,000,000,000 |
| 1981 | 142.13 | 8.04 | 194.09 | 134.10 | 37839039769 | 150.03 | 1,183,790,000,000 |
| 1982 | 149.54 | 7.67 | 203.99 | 139.59 | 34403732201 | 156.18 | 1,100,410,000,000 |
| 1983 | 156.55 | 6.72 | 213.12 | 145.03 | 33844549531 | 162.92 | 1,200,190,000,000 |
| 1984 | 159.31 | 6.69 | 217.77 | 147.43 | 33898638541 | 165.34 | 1,275,560,000,000 |
| 1985 | 160.68 | 7.66 | 220.09 | 149.90 | 34641202378 | 167.41 | 1,364,160,000,000 |
| 1986 | 167.30 | 7.67 | 230.23 | 156.30 | 51727320082 | 174.65 | 2,020,890,000,000 |
| 1987 | 175.85 | 12.27 | 243.85 | 173.48 | 92701641597 | 183.77 | 2,448,670,000,000 |
| 1988 | 178.70 | 10.66 | 251.68 | 182.52 | 1.06668E+11 | 186.47 | 2,971,030,000,000 |
| 1989 | 182.62 | 10.13 | 258.13 | 190.28 | 93672771034 | 192.14 | 2,972,670,000,000 |
| 1990 | 184.06 | 8.46 | 259.15 | 194.81 | 87828362969 | 190.16 | 3,058,040,000,000 |
| 1991 | 184.35 | 5.20 | 257.54 | 195.40 | 80625855126 | 189.32 | 3,484,770,000,000 |
| 1992 | 187.89 | 4.16 | 265.33 | 199.63 | 79696644593 | 190.93 | 3,796,110,000,000 |
| 1993 | 193.97 | 1.33 | 274.00 | 202.14 | 1.07989E+11 | 198.16 | 4,350,010,000,000 |
| 1994 | 200.35 | 1.88 | 281.02 | 204.58 | 1.35146E+11 | 204.45 | 4,778,990,000,000 |
| 1995 | 205.79 | 1.26 | 287.13 | 203.90 | 1.9262E+11 | 209.90 | 5,264,380,000,000 |
| 1996 | 209.72 | 1.81 | 292.42 | 205.21 | 2.25594E+11 | 213.63 | 4,642,540,000,000 |
| 1997 | 215.31 | 6.47 | 276.47 | 217.76 | 2.26679E+11 | 221.38 | 4,261,840,000,000 |
| 1998 | 229.64 | 1.80 | 298.40 | 228.01 | 2.22443E+11 | 233.17 | 3,857,030,000,000 |
| 1999 | 239.91 | -1.20 | 309.92 | 231.08 | 2.93948E+11 | 243.22 | 4,368,730,000,000 |
| 2000 | 242.24 | -1.58 | 308.91 | 222.28 | 3.61639E+11 | 243.84 | 4,667,450,000,000 |
| 2001 | 225.31 | -33.25 | 299.43 | 193.01 | 4.01958E+11 | 187.41 | 4,095,480,000,000 |
| 2002 | 207.79 | -4.32 | 299.16 | 182.40 | 4.69618E+11 | 190.79 | 3,918,340,000,000 |
| 2003 | 209.70 | -1.99 | 307.26 | 180.71 | 6.73554E+11 | 191.84 | 4,229,100,000,000 |
| 2004 | 207.51 | -1.10 | 303.48 | 174.12 | 8.44667E+11 | 189.79 | 4,605,920,000,000 |
| 2005 | 207.24 | 1.79 | 312.85 | 182.87 | 8.46896E+11 | 189.30 | 4,552,200,000,000 |
| 2006 | 204.73 | -0.14 | 304.96 | 179.99 | 8.95321E+11 | 186.06 | 4,362,590,000,000 |
| 2007 | 201.50 | 0.16 | 294.31 | 172.56 | 9.73297E+11 | 184.17 | 4,377,940,000,000 |
| 2008 | 207.14 | 0.76 | 295.42 | 165.48 | 1.03076E+12 | 189.52 | 4,879,860,000,000 |
| 2009 | 223.76 | -1.12 | 320.53 | 171.00 | 1.04899E+12 | 206.13 | 5,032,980,000,000 |

Loading data in Weka is easy: click the Open file option and give the location of the file.

Figure 1: How to load data in Weka

Weka is used to classify the above data to find out how these economic factors could be modified or fixed so as to achieve 11% growth over the previous year's GDP.
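The 11% growth target can be made concrete by computing year-over-year GDP growth from the table. The following is a minimal Python sketch, not part of the Weka workflow; it uses the first six GDP values from the table, and the names `growth_rates` and `TARGET` are illustrative.

```python
# GDP values for 1970-1975, copied from the table above.
gdp = {
    1970: 205_995_000_000,
    1971: 232_681_000_000,
    1972: 308_137_000_000,
    1973: 418_640_000_000,
    1974: 464_705_000_000,
    1975: 505_317_000_000,
}

TARGET = 0.11  # 11% growth over the previous year's GDP


def growth_rates(series):
    """Return {year: fractional growth relative to the previous year}."""
    years = sorted(series)
    return {
        year: series[year] / series[prev] - 1.0
        for prev, year in zip(years, years[1:])
    }


rates = growth_rates(gdp)
years_meeting_target = [year for year, rate in rates.items() if rate >= TARGET]

for year, rate in rates.items():
    print(f"{year}: {rate:.2%}")
print("Years meeting the target:", years_meeting_target)
```

A column like this (growth met or not met) is one way the target could be expressed as a class attribute, although the runs reported in this paper use LIQLB as the class.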
Figure 2: Where to find the tree technique used

The following shows the output of running the above data in Weka. The classifier used to create the required decision tree is M5P. Weka's M5P algorithm is a rational reconstruction of M5 with some enhancements; it builds on M5Base, which implements the base routines for generating M5 model trees and rules. The original M5 algorithm was invented by R. Quinlan, with improvements by Yong Wang. M5P (where the P stands for 'prime') generates M5 model trees using the M5 algorithm, which was introduced in Wang & Witten (1997) and enhances the original M5 algorithm by Quinlan (1992). The output of the analysis is shown below:

    === Run information ===

    Scheme:       weka.classifiers.trees.M5P -M 4.0
    Relation:     Copy of Data_Rudra-weka.filters.unsupervised.attribute.Remove-R1
    Instances:    945
    Attributes:   6
                  BROAD, CLAIMS, DOMC, PSC, TOTRES, LIQLB
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    M5 pruned model tree:
    (Using smoothed linear models)

    BROAD <= 153.045 : LM1 (13/5.644%)
    BROAD >  153.045 :
    |   PSC <= 203.02 :
    |   |   BROAD <= 177.275 : LM2 (5/0.653%)
    |   |   BROAD >  177.275 :
    |   |   |   TOTRES <= 871108500000 : LM3 (11/8.309%)
    |   |   |   TOTRES >  871108500000 : LM4 (4/1.446%)
    |   PSC >  203.02 : LM5 (7/2.741%)

    LM num: 1
    LIQLB = 0.7447 * BROAD + 0.1474 * PSC - 0 * TOTRES + 22.3168

    LM num: 2
    LIQLB = 0.586 * BROAD + 0.2788 * PSC - 0 * TOTRES + 30.3097

    LM num: 3
    LIQLB = 0.4606 * BROAD + 0.2504 * PSC - 0 * TOTRES + 58.87

    LM num: 4
    LIQLB = 0.4996 * BROAD + 0.2504 * PSC - 0 * TOTRES + 50.7563

    LM num: 5
    LIQLB = 0.7016 * BROAD + 0.2497 * PSC - 0 * TOTRES + 15.2517

    Number of Rules: 5

    Time taken to build model: 0.08 seconds
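Read literally, the printed model tree is a nested set of threshold tests with one linear equation per leaf. The sketch below transcribes it into Python. It assumes the printed, rounded coefficients can be used as shown and drops the TOTRES term because its coefficient is printed as 0; the function name `m5p_predict` is illustrative and is not part of Weka.

```python
def m5p_predict(broad, psc, totres):
    """Pick a leaf of the printed M5P tree and evaluate its linear model.

    Returns (leaf name, predicted LIQLB). Coefficients are the rounded
    values from the Weka output; TOTRES only affects which leaf is chosen.
    """
    if broad <= 153.045:
        leaf, (w_broad, w_psc, intercept) = "LM1", (0.7447, 0.1474, 22.3168)
    elif psc > 203.02:
        leaf, (w_broad, w_psc, intercept) = "LM5", (0.7016, 0.2497, 15.2517)
    elif broad <= 177.275:
        leaf, (w_broad, w_psc, intercept) = "LM2", (0.586, 0.2788, 30.3097)
    elif totres <= 871108500000:
        leaf, (w_broad, w_psc, intercept) = "LM3", (0.4606, 0.2504, 58.87)
    else:
        leaf, (w_broad, w_psc, intercept) = "LM4", (0.4996, 0.2504, 50.7563)
    return leaf, w_broad * broad + w_psc * psc + intercept


# Rows taken from the data table: (BROAD, PSC, TOTRES) for 1970, 1995 and 2008.
leaf_1970, liqlb_1970 = m5p_predict(83.65, 111.75, 4876114550)
leaf_1995, liqlb_1995 = m5p_predict(205.79, 203.90, 1.9262e11)
leaf_2008, liqlb_2008 = m5p_predict(207.14, 165.48, 1.03076e12)

print(leaf_1970, round(liqlb_1970, 2))
print(leaf_1995, round(liqlb_1995, 2))
print(leaf_2008, round(liqlb_2008, 2))
```

For 1970, BROAD is below 153.045, so LM1 applies; the table records a LIQLB of 104.73 for that year, which gives a feel for how close the leaf model is on a training row.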
    === Cross-validation ===
    === Summary ===

    Correlation coefficient                  0.9882
    Mean absolute error                      3.412
    Root mean squared error                  5.4145
    Relative absolute error                 11.529  %
    Root relative squared error             15.1993 %
    Total Number of Instances               40
    Ignored Class Unknown Instances         905

Interpretation of the results:

Based on the data above, the M5 algorithm generates a model tree made up of five linear models (LM). The first split is on broad money in the economy: if BROAD is less than or equal to 153.045, we follow linear model 1 (LM1) to estimate liquidity in the economy. If BROAD > 153.045, we check PSC and move down the tree, choosing the corresponding model to get the liquidity and finally the GDP values, as shown in the figure above. Note that although Weka read 945 rows, only 40 instances carry a class value; the remaining 905 are reported as ignored.

### Linear Regression with Weka

The second technique is linear regression, run in Weka on the same data. When the outcome, or class, is numeric and all the attributes are numeric, linear regression is a natural technique to consider. The idea is to express the class as a linear combination of the attributes, with weights calculated from the training data. In the previous technique, M5P built five linear models from the same data, one for each leaf of the tree, whereas linear regression fits a single equation to the whole data set. In the cross-validation results reported here, the single regression does slightly worse than M5P (correlation coefficient 0.9738 versus 0.9882, mean absolute error 4.8731 versus 3.412). From the previous data, we can also find a linear regression equation relating the various parameters that determine GDP. To run the regression, go to the Classify tab in Weka and choose LinearRegression from functions, as shown.

Figure 3: Where to find linear regression in Weka

The following output is generated by the above analysis:

    === Run information ===

    Scheme:       weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
    Relation:     Copy of Data_Rudra
    Instances:    945
    Attributes:   7
                  YEAR
                  BROAD
                  CLAIMS
                  DOMC
                  PSC
                  TOTRES
                  LIQLB
    Test mode:    10-fold cross-validation

    === Classifier model (full training set) ===

    Linear Regression Model

    LIQLB =
          1.2523 * BROAD +
          0.6062 * CLAIMS +
         -0.1407 * DOMC +
          0      * TOTRES +
         -6.9705

    Time taken to build model: 0.2 seconds

    === Cross-validation ===
    === Summary ===

    Correlation coefficient                  0.9738
    Mean absolute error                      4.8731
    Root mean squared error                  8.0404
    Relative absolute error                 16.4661 %
    Root relative squared error             22.5707 %
    Total Number of Instances               40
    Ignored Class Unknown Instances         905

The above analysis gives us a linear mathematical relationship between the variables. The value of the dependent variable (here LIQLB) can be computed once the values of the independent variables are known. The equation also tells us how the variables are related: a negative coefficient indicates an inverse relationship (the dependent variable falls as that variable rises), and a positive coefficient indicates a direct one. To see the same relationship in pictorial form, go to the Visualize tab in the Weka Explorer. The same is shown in the figure below.
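The regression equation and the two absolute error measures in the summary can be reproduced by hand. The sketch below is a minimal Python illustration under stated assumptions: it uses the rounded coefficients printed by Weka, omits TOTRES because its coefficient is printed as 0, and evaluates three rows from the data table. The errors it prints are therefore training-row errors on a small sample, not the 10-fold cross-validated figures above. All function names are illustrative.

```python
import math


def lr_predict(broad, claims, domc):
    """Evaluate the printed regression equation for LIQLB."""
    return 1.2523 * broad + 0.6062 * claims - 0.1407 * domc - 6.9705


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all pairs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# (BROAD, CLAIMS, DOMC, LIQLB) for 1970, 1995 and 2009, from the data table.
rows = [
    (83.65, 61.88, 134.25, 104.73),
    (205.79, 1.26, 287.13, 209.90),
    (223.76, -1.12, 320.53, 206.13),
]

actual = [liqlb for *_, liqlb in rows]
predicted = [lr_predict(broad, claims, domc) for broad, claims, domc, _ in rows]

print("Predictions:", [round(p, 2) for p in predicted])
print("MAE :", round(mean_absolute_error(actual, predicted), 3))
print("RMSE:", round(root_mean_squared_error(actual, predicted), 3))
```

The negative DOMC coefficient shows up directly here: holding BROAD and CLAIMS fixed, each extra unit of DOMC lowers the predicted LIQLB by 0.1407.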
### CLUSTERING IN WEKA

Clustering is a technique used to group similar instances, or rows, in terms of Euclidean distance. We used the SimpleKMeans clustering algorithm to analyze clustering in our initial data. SimpleKMeans clusters data using k-means, with the number of clusters set by a parameter (two in our run). Weka's EM clusterer, by contrast, can either take the number of clusters from the user or decide it using cross-validation, in which case the number of folds is fixed at 10.

The output of SimpleKMeans for the above data follows this description. The result is shown as a table whose rows are attribute names and whose columns correspond to cluster centroids; an additional column at the beginning shows the entire data set. The number of instances in each cluster appears in parentheses at the top of its column. Each table entry is either the mean or the mode of the corresponding attribute for the cluster in that column. The bottom of the output shows the result of applying the learned cluster model. In this case, it assigned each training instance to one of the clusters, showing the same result as the parenthetical numbers at the top of each column. An alternative is to use a separate test set or a percentage split of the training data, in which case the figures would be different. This technique could also be used with data from other countries in addition to the present data, which is taken for Japan.
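The idea behind k-means can be sketched in a few lines of Python. This is a plain illustration of the assign-and-update loop, not Weka's SimpleKMeans implementation: it uses raw (unnormalized) Euclidean distance, two attributes only (BROAD and LIQLB), six rows from the data table, and hand-picked starting centroids. The function name `kmeans` and the iteration cap of 500 (mirroring the `-I 500` option in the run) are choices made for this example.

```python
import math


def kmeans(points, initial_centroids, max_iter=500):
    """Basic k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until nothing changes."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: centroid becomes the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# (BROAD, LIQLB) for 1970-1972 and 1998-2000, from the data table.
points = [
    (83.65, 104.73), (106.70, 118.21), (116.14, 129.03),
    (229.64, 233.17), (239.91, 243.22), (242.24, 243.84),
]

centroids, clusters = kmeans(points, [points[0], points[-1]])
for centroid, members in zip(centroids, clusters):
    print([round(c, 2) for c in centroid], len(members))
```

Each printed line corresponds to one column of a centroid table: the attribute means for the cluster, followed by the number of instances assigned to it.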
    === Run information ===

    Scheme:       weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
    Relation:     Copy of Data_Rudra
    Instances:    945
    Attributes:   7
                  YEAR
                  BROAD
                  CLAIMS
                  DOMC
                  PSC
                  TOTRES
                  LIQLB
    Test mode:    evaluate on training data

    === Model and evaluation on training set ===

    kMeans
    ======

    Number of iterations: 5
    Within cluster sum of squared errors: 12.988387913678944
    Missing values globally replaced with mean/mode

    Cluster centroids:
                                          Cluster#
    Attribute            Full Data                  0                  1
                             (945)              (929)               (16)
    ====================================================================
    YEAR                    1989.5          1989.2933             2001.5
    BROAD                 174.1633           173.4625           214.8525
    CLAIMS                  6.6645             6.8103            -1.7981
    DOMC                  242.2808           241.2956           299.4794
    PSC                   168.2627           167.8077            194.685
    TOTRES       248907476505.9463  243675387834.3592       552695625000
    LIQLB                 175.2342           174.7166           205.2875

    Time taken to build model (full training data) : 0.14 seconds

    === Model and evaluation on training set ===

    Clustered Instances

    0      929 ( 98%)
    1       16 (  2%)

We can also visualize the clusters formed. Right-click the entry in the result list and select Visualize cluster assignments. We get the following output: