# ITB Term Paper - 10BM60066

A term paper that demonstrates the use of two data mining techniques: a linear modelling technique using R and a classification technique using Weka.



### ITB TERM PAPER: DATA MINING TECHNIQUES (LINEAR MODELLING AND CLASSIFICATION)

Rahul Mahajan (10BM60066)
### Table of Contents

- Introduction
  - About Weka
  - About R
- Linear Modelling Technique Using R: Prediction of Future Share Price
  - Data
- Case 1
  - The Code
  - The Result
  - Interpretation of the Result
- Case 2
  - The Code
  - The Result
  - Interpretation of the Result
- Classification
  - The Dataset
  - Classification Procedure
  - Interpreting the Results
### INTRODUCTION

In this term paper I have demonstrated two data mining techniques:

- **Linear modelling technique**, demonstrated using R.
- **Classification**, demonstrated using Weka.

#### ABOUT WEKA

Weka is a Java-based, open-source collection of data mining and machine learning algorithms, including:

- Pre-processing of data
- Classification
- Clustering
- Association rule extraction

#### ABOUT R

R is an open-source programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software and performing data analysis.
### LINEAR MODELLING TECHNIQUE USING R: PREDICTION OF FUTURE SHARE PRICE

Here I will try to use a GARCH model to predict future share prices. GARCH models give us the liberty to define a model using previous share prices and volatility over a defined period. There are many versions of GARCH models that give better estimates in different scenarios.

**Case 1: Using previous days' share prices and standard deviation.** In the example explained in this term paper, tomorrow's price is expressed as a function of yesterday's price and the standard deviation of the last 3 days' prices.

**Case 2: Using the previous day's share price and the previous day's gain.** It is generally known that share prices behave with momentum: for a period of time share prices go up, then comes a period when prices go down. This model takes advantage of this behaviour of stock prices.

Using statistical techniques, I will compare the models developed in case 1 and case 2. It is widely accepted that the model developed in case 2 fits better than the model developed in case 1.

#### DATA

Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research, and I have used a few files from it. In both cases I have used February 2008 share price data of Tata Motors. Except for the traded data, all of this data is available in the public domain.

The file contains the following fields: i) Symbol, ii) Series, iii) Date, iv) Prev Close, v) Open Price, vi) High Price, vii) Low Price, viii) Last Price, ix) Close Price, x) Average Price, xi) Total Traded Quantity, xii) Turnover in Lacs.

The text file is available at this link: http://bit.ly/TM_PVD
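Before turning to the R listings, the way the predictor series for the two cases line up can be sketched in stdlib Python. This is an illustrative sketch only: the tiny `prices` list and the variable names are hypothetical, not the paper's Tata Motors data.

```python
# Illustrative sketch: constructing the Case 1 and Case 2 predictor
# series from a tiny hypothetical list of closing prices.
from statistics import stdev

prices = [100.0, 102.0, 101.0, 105.0, 104.0]  # hypothetical closes

# Case 1: response C (price on day t), predictors A (price on day t-2)
# and D (sample standard deviation of the last 3 days' prices).
A = prices[:-2]
B = prices[1:-1]
C = prices[2:]
D = [stdev(window) for window in zip(A, B, C)]

# Case 2: response is today's price; predictors are yesterday's price
# and yesterday's percentage move. This mirrors D <- (C-B)*100/C in
# the Case 2 R listing, where C is the earlier day's price.
prev = prices[:-1]
curr = prices[1:]
move = [(p - c) * 100 / p for p, c in zip(prev, curr)]
```

Aligning the series by slicing, as above, is the same trick the R code performs by repeatedly dropping the first or last element of each vector.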
### CASE 1

The program first reads the file and extracts the price data. It creates three vectors of prices for the previous 3 days, i.e. A, B and C. Then, using a for loop, it finds the standard deviation of the prices of the past 3 days. Finally, using linear modelling, it tries to fit a model to predict future prices.

Before running the code, we need to change R's working directory to the location where we saved the text file. The packages required to run this code are already installed in base R, so there is no need to add any additional packages.

#### THE CODE

```r
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)

A <- Trade[, 4]           # previous-close prices
n <- length(A)
C <- A[3:n]               # price on day t (response)
B <- A[2:(n - 1)]         # price on day t-1
A <- A[1:(n - 2)]         # price on day t-2
l <- length(A)

D <- numeric(l)           # 3-day rolling standard deviation
for (i in 1:l) {
  D[i] <- sd(c(A[i], B[i], C[i]), na.rm = FALSE)
}

summary(lm(C ~ A + D))
```

#### THE RESULT

The result of the above code is shown in figure 1.
Figure 1: The output of case 1

#### INTERPRETATION OF THE RESULT

The p-values and F-statistic show that the model is not able to predict prices well.

|             | Estimate   | Std. Error | t value | Pr(>\|t\|) |
|-------------|------------|------------|---------|------------|
| (Intercept) | -1.891e+03 | 1.589e+03  | -1.190  | 0.445      |
| A           | 3.694e+00  | 2.257e+00  | 1.637   | 0.349      |
| D           | -9.471e-02 | 1.066e-01  | -0.888  | 0.538      |
### CASE 2

The program first reads the file and extracts the price data into vector A. Then, using vectors B and C, it finds the gains for the first n-1 days (where n is the total number of days available); this data is stored in vector D. Now, using the linear model function, one can find the statistical significance of the model.

We expect correlation to be high in this case, so we set the `correlation` flag of `summary()` to `TRUE`; this prints the correlation matrix of the coefficient estimates, which helps diagnose the autocorrelation problem in the data.

#### THE CODE

```r
TFile <- "tatamotors.txt"
Trade <- read.table(TFile)

A <- Trade[, 4]          # previous-close prices
l <- length(A)
B <- A[-1]               # price on day t
C <- A[-l]               # price on day t-1
D <- (C - B) * 100 / C   # previous day's percentage move

summary(lm(B ~ C + D), correlation = TRUE)
```

#### THE RESULT

The result of the above code is shown in figure 2.
Figure 2: The output of case 2
#### INTERPRETATION OF THE RESULT

|             | Estimate  | Std. Error | t value  | Pr(>\|t\|)  |
|-------------|-----------|------------|----------|-------------|
| (Intercept) | -2.335684 | 2.392506   | -0.976   | 0.333       |
| C           | 1.002842  | 0.003261   | 307.518  | <2e-16 ***  |
| D           | -7.187997 | 0.042478   | -169.215 | <2e-16 ***  |

Here we see that the significance of the model is very high, and the adjusted R-squared is also high. A high adjusted R-squared can, however, also reflect autocorrelation, which is very evident in this case. Still, the F-statistic shows that this model predicts share prices better. So we confirm our assumption that the previous-day gain model (case 2) fits better than the standard deviation model (case 1).
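The comparison above leans on R-squared and adjusted R-squared. As a reminder of what `summary(lm(...))` reports, here is a minimal pure-Python sketch of both quantities; the function names and toy values are mine, not from the paper.

```python
# Minimal sketch of (adjusted) R-squared, the statistics used above
# to compare Case 1 and Case 2. Toy values only.
def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    # p = number of predictors (2 in both cases here)
    n = len(y)
    return 1 - (1 - r_squared(y, y_hat)) * (n - 1) / (n - p - 1)
```

The adjustment penalises extra predictors, which is why it is the fairer statistic when comparing models with different regressors.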
### CLASSIFICATION

Classification is commonly implemented with decision trees: an algorithm that creates a rule to determine the output of a new data instance.

It creates a tree where each node represents an attribute of our dataset. A decision is made at these nodes based on the input, and by moving from one node to the next you reach the end of the tree, which gives a predicted output.

This is illustrated using the following example.

#### THE DATASET

The dataset used in this example was found on the internet and can be downloaded from http://maya.cs.depaul.edu/classes/ect584/weka/data/bank-data.csv.

Let's say there is a bank ABC. It has data on 600 people who have either opted for its product or not, along with the following information about them: age, gender, income, marital status, region and mortgage. The bank can use this information to create a rule to predict whether a new potential customer would opt for its product, based on the known attributes of the customer.

#### CLASSIFICATION PROCEDURE

Load the data in Weka: click on Open file and specify the path. The window shown in figure 3 should appear after loading.

One will note that there are 12 attributes in the dataset, as seen in the attributes pane of the window. For this example we will use only the following attributes: age, sex, region, income, married, mortgage, savings and product.

We will try to predict the response of a new customer using the 7 attributes age, sex, region, income, married, mortgage and savings.

To remove the remaining attributes, click on the checkbox on the left side of each attribute and click Remove. After removing the attributes, one should get the window shown in figure 4.

Now click on the Classify tab at the top. Under the classifier panel, click Choose > trees > J48, as shown in figure 5. J48 is an algorithm for generating decision trees developed by Ross Quinlan; it is an extension of Quinlan's earlier ID3 algorithm. The decision trees it generates use the concept of information entropy.
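Since J48 chooses its splits using information entropy, a small stdlib-Python sketch of the underlying computation may help. The function names and toy yes/no labels below are illustrative, not Weka's API.

```python
# Sketch of the entropy / information-gain computation that
# entropy-based tree learners such as ID3 and J48 use to choose
# which attribute to split on. Toy labels only.
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # groups: the label subsets produced by splitting on an attribute.
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
```

A perfectly separating split on a 50/50 yes/no sample has a gain of 1 bit, while a split that leaves each group 50/50 has a gain of 0; the learner greedily prefers high-gain attributes near the root.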
Now we can create the model in Weka. First ensure that "Use training set" is selected, so that the data we have loaded is the only data used for creating the model. Click Start. The output from this model will look like figure 6.

#### INTERPRETING THE RESULTS

The important results to focus on are:

1. The "Correctly Classified Instances" (75.66 percent) and "Incorrectly Classified Instances" (24.33 percent), which tell us about the accuracy of the model. Our model is neither very good nor very bad; it is OK, and further refinement is needed.
2. The confusion matrix, which shows the number of false positives and negatives. In this case, 117 instances of class a are incorrectly classified as b, and 29 instances of class b are incorrectly classified as a.
3. The ROC area, which measures the discrimination ability of the forecast. Although there is some discrimination whenever the ROC area is > 0.5, in most situations the discrimination ability of the forecast is not really considered useful in practice unless the ROC area is > 0.7. For our model the ROC value is greater than 0.7 (0.787).
4. The decision tree itself, which is the main output: the rule that will help predict the outcome of new data instances. To view the decision tree, right-click on the model and select Visualize tree. You will get the window shown in figure 7.
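As a sanity check, the accuracy quoted above can be recovered from the confusion-matrix counts in the text (117 and 29 misclassifications out of 600 instances); a quick sketch:

```python
# Recovering the quoted accuracy from the confusion-matrix counts:
# 117 a's misclassified as b, 29 b's misclassified as a, 600 rows.
total = 600
misclassified = 117 + 29
correct = total - misclassified            # correctly classified
accuracy = 100 * correct / total           # percent
error_rate = 100 * misclassified / total   # percent
```

This gives 454 correct instances and an accuracy of about 75.67 percent, matching the Weka output up to rounding.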
Figure 3: Window after loading the dataset

Figure 4: Window after removing unwanted attributes

Figure 5: Choosing the J48 tree

Figure 6: Output of the classification process

Figure 7: The decision tree