Data Mining Techniques Using R and WEKA
This term paper explains two techniques:
1) Linear Modelling using R
2) Clustering using WEKA

Published in: Business, Technology

  • 1. Data Mining Techniques Using R and WEKA
    IT for Business Intelligence term paper
    Utsav Mone (10BM60094)
    This term paper explains two techniques:
    1) Linear Modelling using R
    2) Clustering using WEKA
  • 2. Linear Modelling using R
    Here I analyse the relation between the bid-ask spread of a company's stock and the volatility of its price. I test three hypotheses: first a linear relation, then a logarithmic and an exponential relation with volatility.
    We have data from which the bid-ask spread can be calculated at different hours. From the trade data I calculated the daily price volatility of the stock and then looked for a relation between the two.

    Bid Ask Spread: a measure of liquidity
      • The amount by which the ask price exceeds the bid. This is essentially the difference between the highest price that a buyer is willing to pay for an asset and the lowest price for which a seller is willing to sell it.
      • Ask - The price a seller is willing to accept for a security, also known as the offer price. Along with the price, the ask quote will generally also stipulate the amount of the security.
      • Bid - An offer made by an investor, a trader or a dealer to buy a security. The bid stipulates both the price at which the buyer is willing to purchase the security and the quantity of the security.

    Factors affecting Bid Ask Spread
      1) Volatility (with more volatility the spread is higher)
           Standard deviation
           Variance between returns from that same security or market index
      2) Volumes (more volume reduces the spread)
           Absolute number of shares under transaction
           Percentage of free floating shares
           Number of orders
      3) Others
           Tick size
           Price of share
    All the above measures (except Others) can be further classified into two categories: Executed and Requested.

    In this case I look only at the relation between volatility and bid-ask spread.
    I have used the daily volatility of the share prices of Tata Motors for the month of Feb 2008. The daily volatility is calculated from hourly data instead of the usual way of using the closing price of each day. This is required since we want to see changes in daily volatility.
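The two quantities being related can be sketched in a few lines of R. This is only an illustration under stated assumptions: the quotes and hourly prices below are made-up numbers, not values from the NSE files, and the variable names are hypothetical. The spread is expressed relative to the mid-quote, and the volatility is the standard deviation of hour-on-hour returns.

```r
# Hypothetical best quotes at one snapshot (not taken from the NSE files)
best_bid <- 585.00   # highest price a buyer is willing to pay
best_ask <- 586.50   # lowest price a seller is willing to accept

# Bid-ask spread as a fraction of the mid-quote
mid <- (best_bid + best_ask) / 2
prop_spread <- (best_ask - best_bid) / mid

# Hypothetical average prices at 10:00, 11:00, ..., 15:00
hourly_price <- c(708, 712, 705, 710, 716, 709)

# Hour-on-hour returns and their standard deviation (the daily volatility)
hourly_return <- diff(hourly_price) / head(hourly_price, -1)
daily_vol <- sd(hourly_return)

print(prop_spread)
print(daily_vol)
```

Six hourly prices give five returns, so each day's volatility rests on only five observations.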
  • 3. Data
    Dr Devlina Chatterjee of VGSoM has purchased a large amount of data from NSE for her research. I have used a few files from her data.
    There are three types of files:
      1) Snapshots
      2) Trade Data
      3) Price Volume Data

    Price Volume Data
    I have used the February 2008 share data of Tata Motors. Except for the trade data, all the data is available in the public domain.
    The file contains the following items:
      i) Symbol, ii) Series, iii) Date, iv) Prev Close, v) Open Price, vi) High Price, vii) Low Price, viii) Last Price, ix) Close Price, x) Average Price, xi) Total Traded Quantity, xii) Turnover in Lacs
    This text file is available at this link-
    ,EQ,03-Dec-2007,732.45,736,749,733.35,737,736.15,741,481721,3569.5399
    TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995,
    TATAMOTORS,EQ,05-Dec-2007,741.3,744,783.9,744,773,772.4,769.92,1410714,10861.311993,
    TATAMOTORS,EQ,06-Dec-2007,772.4,775.5,782,763.25,778,775.45,774.13,807793,6253.379844,
    TATAMOTORS,EQ,10-Dec-2007,767.3,772,777.7,745.05,775,766.45,757.78,521361,3950.7440285,
    TATAMOTORS,EQ,11-Dec-2007,766.45,770,777.3,761,777.3,775.2,770.04,676097,5206.1990345,
    TATAMOTORS,EQ,12-Dec-2007,775.2,776.9,780,762,769,770.05,768.88,665743,5118.7625105,
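As a rough sketch of how such a price volume file could be loaded in R, the snippet below reads three of the sample rows shown above from an inline string. The column names are my own labels for the listed items (an assumption, since the file itself has no header row), and the close-to-close return is added only as an example of a derived column.

```r
# Column labels chosen here for illustration; the file has no header row
cols <- c("Symbol", "Series", "Date", "PrevClose", "Open", "High", "Low",
          "Last", "Close", "Average", "TotalTradedQty", "TurnoverLacs")

# Three complete sample rows copied from the slide (trailing commas dropped)
rows <- "TATAMOTORS,EQ,04-Dec-2007,736.15,737,746,728.35,746,741.3,738.2,631272,4660.0808995
TATAMOTORS,EQ,05-Dec-2007,741.3,744,783.9,744,773,772.4,769.92,1410714,10861.311993
TATAMOTORS,EQ,06-Dec-2007,772.4,775.5,782,763.25,778,775.45,774.13,807793,6253.379844"

pv <- read.csv(text = rows, header = FALSE, col.names = cols,
               stringsAsFactors = FALSE)

# Close-to-close return for each day, relative to the previous close
pv$DayReturn <- (pv$Close - pv$PrevClose) / pv$PrevClose

print(pv[, c("Date", "PrevClose", "Close", "DayReturn")])
```

For a file on disk, the same call would take a file name in place of `text = rows`.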
  • 4. Snapshots Data
    In this type of data we have a snapshot of the order book at 4 hours in a day: 11 Hr, 12 Hr, 13 Hr and 14 Hr. Here we see snapshot data of Tata Motors for different months and hours of the day.
    Here is a look at the data. Since there are too many files, it is difficult to upload them.
    A look at Snapshot data:
      1) Order Number
      2) Company
      3) Trade Type
      4) No of shares in Order
      5) Quote
      6) Time Stamp
      7) Buy Sell
      8) Flags
    A sample of Snapshot Data of Tata Motors on 1 Feb, 11 Hr -
    2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0
    2008020150716321 TATAMOTORS EQ 10 560.00 10:35:56 B ynnn nnn nnn RL 0
    2008020150034116 TATAMOTORS EQ 100 575.00 09:55:22 B ynnn nnn nnn RL 0
    2008020150067971 TATAMOTORS EQ 824 576.65 09:56:38 B ynnn nny nnn RL 0
    2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0
    2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0
    Detail of Flags can be seen at –
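A minimal sketch of reading snapshot rows in base R, using the six sample rows above as inline text. With no header, `read.table` names the whitespace-separated columns V1 to V12, so the quote is V5, the buy/sell marker is V7 and the third flag group is V10. The sample contains only buy orders, so only the best bid can be computed here; the best ask would be the minimum V5 over rows with V7 equal to "S".

```r
snapshot_text <- "2008020150046719 TATAMOTORS EQ 500 559.60 09:55:48 B ynnn nnn nnn RL 0
2008020150716321 TATAMOTORS EQ 10 560.00 10:35:56 B ynnn nnn nnn RL 0
2008020150034116 TATAMOTORS EQ 100 575.00 09:55:22 B ynnn nnn nnn RL 0
2008020150067971 TATAMOTORS EQ 824 576.65 09:56:38 B ynnn nny nnn RL 0
2008020100283272 TATAMOTORS EQ 100 582.00 10:09:10 B ynnn nnn nnn RL 0
2008020150233325 TATAMOTORS EQ 25000 585.00 10:04:34 B ynnn nny nnn RL 0"

# Keep the 16-digit order number as text; let R infer the other column types
X <- read.table(text = snapshot_text,
                colClasses = c("character", rep(NA, 11)),
                stringsAsFactors = FALSE)

# Buy orders whose third flag group is "nnn"
buys <- subset(X, V7 == "B" & V10 == "nnn")

# Best bid: the highest quote among those buy orders
best_bid <- max(buys$V5)
print(best_bid)
```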
  • 5. Trade Data
    This is daily trade data, which lists all the trades that took place in a day.
    A look at Trade data:
      1) Trade Number
      2) Name of Company
      3) Type of Trade
      4) Time of Trading
      5) Price
      6) Volume of shares traded
    Opening data, Price = 708
    2475593 TATAMOTORS EQ 09:55:16 708 3713
    2475830 TATAMOTORS EQ 09:55:20 708 800
    2475871 TATAMOTORS EQ 09:55:21 708 200
    2475872 TATAMOTORS EQ 09:55:21 708 1
    2475873 TATAMOTORS EQ 09:55:21 708 1
    2475874 TATAMOTORS EQ 09:55:21 708 210
    2475935 TATAMOTORS EQ 09:55:22 708 800
    See the price variation within 3 seconds, from 755 and back to 755:
    3843007 TATAMOTORS EQ 13:33:37 755 5
    3843008 TATAMOTORS EQ 13:33:37 755 453
    3843021 TATAMOTORS EQ 13:33:38 754.9 1
    3843022 TATAMOTORS EQ 13:33:38 754.55 9
    3843037 TATAMOTORS EQ 13:33:38 755 1
    3843050 TATAMOTORS EQ 13:33:38 754.9 1
    3843051 TATAMOTORS EQ 13:33:38 754.9 9
    3843052 TATAMOTORS EQ 13:33:38 754.9 1
    3843069 TATAMOTORS EQ 13:33:39 755 1
    More detail of the data is available at –
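The trade rows can be handled in the same way. The sketch below loads the 13:33 sample shown above and averages the price of all trades whose time stamp starts with a given hour and minute, which is the base R equivalent of an SQL `like 'HH:MM%'` filter. The column meanings follow the list above (V4 is the time, V5 the price, V6 the volume); the variable names are my own.

```r
trade_text <- "3843007 TATAMOTORS EQ 13:33:37 755 5
3843008 TATAMOTORS EQ 13:33:37 755 453
3843021 TATAMOTORS EQ 13:33:38 754.9 1
3843022 TATAMOTORS EQ 13:33:38 754.55 9
3843037 TATAMOTORS EQ 13:33:38 755 1
3843050 TATAMOTORS EQ 13:33:38 754.9 1
3843051 TATAMOTORS EQ 13:33:38 754.9 9
3843052 TATAMOTORS EQ 13:33:38 754.9 1
3843069 TATAMOTORS EQ 13:33:39 755 1"

Trade <- read.table(text = trade_text, stringsAsFactors = FALSE)

# Trades whose time stamp begins with "13:33" (any second of that minute)
in_minute <- substr(Trade$V4, 1, 5) == "13:33"
avg_price <- mean(Trade$V5[in_minute])

# Lowest and highest trade price within the single second 13:33:38
range_38 <- range(Trade$V5[Trade$V4 == "13:33:38"])

print(avg_price)
print(range_38)
```

This is a plain average over trades; weighting by V6 would give a volume-weighted price instead.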
  • 6. R Program
    Data Location
    We need to set the working directory in R. R looks for all the files in the assigned directory.
    Package Requirements
    chron, zoo, fda, MASS, proto, DBI, RSQLite, RSQLite.extfuncs, stats4, sde, tcltk, sqldf
  • 7. Program Understanding
    Reading Files
    In this program I first read the different files using a for loop.
      1) Reading Trade Data: the file name is made of Name of Company, Day, Month, Year, followed by .txt
      2) Reading Snapshot Data: the file name is made of Company Name_Day, Month, Year_Time Hour, followed by .txt
    The data is read using SQL queries, for which many packages are required.
    Since there is a lot of data, we have to decide which data to read. First I found the average price at the start of each hour. E.g. for 10:00, the prices of all trades whose time stamp begins with 10:00 are taken into consideration, whatever the second.
    Then I found the gain or loss every hour to find the volatility of the trade.
    Then for each hour I found the bid-ask spread and averaged it over the day.
    For the linear modelling I used the Feb month data of daily volatility and bid-ask spread.
    The program has comments to help in understanding it.
    Almost the same program was run for the exponential and logarithmic relations, with only a small change in the last 6 lines of code. That code is given in the cases explained later.
    Since I am a new user of R, the program is not very efficient, but the code runs well.
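The file naming scheme described above can be sketched with `paste`. The pieces used here (company, day, month, year, hour) are example values; the exact names of the files on disk are not shown in the source, so treat the resulting strings as an assumption about the layout.

```r
company <- "TATAMOTORS"
day     <- 1        # a numeric day has no leading zero when pasted
month   <- "Feb"
year    <- "08"
hour    <- 11       # snapshot hour: 11, 12, 13 or 14

# Trade file: Company_DayMonthYear.txt
trade_file <- paste(company, "_", day, month, year, ".txt", sep = "")

# Snapshot file: Company_DayMonthYear_Hour.txt
snapshot_file <- paste(company, "_", day, month, year, "_", hour, ".txt", sep = "")

print(trade_file)     # "TATAMOTORS_1Feb08.txt"
print(snapshot_file)  # "TATAMOTORS_1Feb08_11.txt"
```

If the files on disk use zero-padded days, `sprintf("%02d", day)` would produce "01" instead of "1".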
  • 8. Code
    library(sqldf)           # provides sqldf() for SQL queries on data frames
    Name <- "TATAMOTORS_"    # name of company
    MY <- "Feb08"            # month and year
    # Working days of the Feb month; hard-coded as of now.
    # R drops the leading zeros, so the file names use 1, 4, 6, ...
    Day <- c(01,04,06,07,08,11,12,13,14,15,18,19,20,21,22)
    # This is just to define PSpread and Dailystdev as numeric arrays
    PSpread <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
    Dailystdev <- PSpread
    BD <- c(1,2,3,4)
  • 9. # Reading trade data
    for (i in 1:15) {
      TFile <- paste(Name, Day[i], MY, ".txt", sep = "")
      Trade <- read.table(TFile)
      summary(Trade)   # no visible output inside a loop unless wrapped in print()
      # SQL queries to find the average price of all trades in the first minute of each hour
      Hr10Price <- sqldf("select avg(V5) from Trade where V4 like '10:00%'")
      Hr11Price <- sqldf("select avg(V5) from Trade where V4 like '11:00%'")
      Hr12Price <- sqldf("select avg(V5) from Trade where V4 like '12:00%'")
      Hr13Price <- sqldf("select avg(V5) from Trade where V4 like '13:00%'")
      Hr14Price <- sqldf("select avg(V5) from Trade where V4 like '14:00%'")
      Hr15Price <- sqldf("select avg(V5) from Trade where V4 like '15:00%'")
      # Returns at each hour
      R1 <- (Hr10Price[1,1] - Hr11Price[1,1]) / Hr10Price[1,1]
      R2 <- (Hr11Price[1,1] - Hr12Price[1,1]) / Hr11Price[1,1]
      R3 <- (Hr12Price[1,1] - Hr13Price[1,1]) / Hr12Price[1,1]
      R4 <- (Hr13Price[1,1] - Hr14Price[1,1]) / Hr13Price[1,1]
      R5 <- (Hr14Price[1,1] - Hr15Price[1,1]) / Hr14Price[1,1]
      R <- c(R1, R2, R3, R4, R5)
      # Dailystdev holds the standard deviation of the hourly returns of each day
      Dailystdev[i] <- sd(R, na.rm = FALSE)
      # Code below is for reading snapshot data
      Company <- "TATAMOTORS"
      Month <- "Feb"
      Year <- "08"
      Time <- c(11,12,13,14)
      h <- "_"
  • 10.   for (j in 1:4) {
        File <- paste(Company, h, Day[i], Month, Year, h, Time[j], ".txt", sep = "")
        X <- read.table(File)
        # SQL and formulas find the bid and ask values of the hour
        MaxBuyP <- sqldf("select max(V5) from X where V10 = 'nnn' and V7 = 'B'")
        MinSellP <- sqldf("select min(V5) from X where V10 = 'nnn' and V7 = 'S'")
        # Bring the one-cell query results to regular variables
        MinSell <- MinSellP[1,1]
        MaxBuy <- MaxBuyP[1,1]
        BidAsk <- MinSell - MaxBuy
        BD[j] <- BidAsk / ((MaxBuy + MinSell) / 2)
      }
      PSpread[i] <- mean(BD)
    }
    PSpread
    Dailystdev
    # DF is the data frame for modelling
    DF <- data.frame(PSpread, Dailystdev)
    Result <- lm(PSpread ~ Dailystdev, DF)
    Result
    summary(Result)
    # END
  • 11. Analysis
    The analysis shows that the intercept on the Y axis is significant but the coefficient of volatility is not.
    The adjusted R-squared also shows that the model does not fit.
    The F statistic also has a very high p-value, which gives an overall indication that the bid-ask spread does not have any linear relation with the daily volatility of prices.
    So I changed the hypothesis to the following cases:
    Bid-ask spread is exponentially related to volatility
    or
    Bid-ask spread is logarithmically related to volatility
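The quantities mentioned in the analysis (coefficient p-values, adjusted R-squared and the p-value of the F statistic) can all be pulled out of a `summary(lm(...))` object. The sketch below uses simulated numbers, not the Tata Motors data, and the variable names `vol` and `spread` are hypothetical; the spread is generated independently of the volatility.

```r
set.seed(1)
vol    <- runif(15, 0.002, 0.02)             # simulated daily volatility
spread <- 0.0015 + rnorm(15, sd = 0.0003)    # simulated spread, unrelated to vol

fit <- lm(spread ~ vol)
s   <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- s$coefficients
slope_p    <- coef_table["vol", "Pr(>|t|)"]

# Adjusted R-squared of the fit
adj_r2 <- s$adj.r.squared

# p-value of the overall F test, computed from the stored F statistic
f   <- s$fstatistic
f_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(coef_table)
print(adj_r2)
print(unname(f_p))
```

With a single predictor the F test and the t test on the slope are the same test, so `f_p` equals `slope_p`.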
  • 12. Exponential Case
    Dailystdevexp <- exp(Dailystdev)
    DFexp <- data.frame(PSpread, Dailystdevexp)
    Resultexp <- lm(PSpread ~ Dailystdevexp, DFexp)
    summary(Resultexp)

    Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)    -0.04463  0.08037     -0.555   0.588
    Dailystdevexp   0.04582  0.07970      0.575   0.575
  • 13. Log Case
    Dailystdevln <- log(Dailystdev, base = exp(1))
    DFln <- data.frame(PSpread, Dailystdevln)
    Resultln <- lm(PSpread ~ Dailystdevln, DFln)
    summary(Resultln)

    Coefficients:
                   Estimate   Std. Error  t value  Pr(>|t|)
    (Intercept)    0.0021993  0.0027177   0.809    0.433
    Dailystdevln   0.0001246  0.0005400   0.231    0.821

    We see that neither the exponential nor the logarithmic model fits either.
  • 14. Clustering Using WEKA
    Clustering helps one to group data instances. It especially helps marketers to identify patterns in data and segment their customers.

    The Dataset
    The data used here is obtained from the CD of the book on marketing research by Naresh Malhotra. The data can be downloaded from the following link -
    This example illustrates the use of clustering to segment customers based on their attitudes towards shopping. Customers were asked to express their degree of agreement with the following statements on a 7 point scale:
      V1 - Shopping is fun
      V2 - Shopping is bad for your budget
      V3 - I combine shopping with eating out
      V4 - I try to get the best buys when shopping
      V5 - I don’t care about shopping
      V6 - You can save a lot of money by comparing prices

    Clustering Procedure
    Load the data using the Open file option in Weka. You will get the window shown in figure 1.
    Click on the Cluster tab. Then click Choose and select SimpleKMeans. You will get the window shown in figure 2.
    By default the number of clusters created is 2. To change the number of clusters, click on SimpleKMeans. You will get the window shown in figure 3.
    In the numClusters field specify the number of clusters to be created. For this example the number of clusters is 3.
    Click on Start. You will get the output shown in figure 4.
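For readers who prefer a script to the Weka GUI, a comparable segmentation can be sketched with R's built-in `kmeans`. This is a swapped-in technique, not the Weka SimpleKMeans run described above, and the nine survey responses are invented for illustration (they are not the Malhotra dataset), so the clusters will not match the figures.

```r
# Hypothetical responses of nine customers on the 7 point scale (V1..V6)
survey <- data.frame(
  V1 = c(6, 7, 6, 2, 1, 2, 4, 3, 4),   # Shopping is fun
  V2 = c(3, 2, 4, 3, 2, 3, 6, 7, 6),   # Shopping is bad for your budget
  V3 = c(7, 6, 6, 1, 2, 2, 3, 4, 3),   # I combine shopping with eating out
  V4 = c(4, 3, 4, 4, 3, 4, 7, 6, 7),   # I try to get the best buys
  V5 = c(2, 1, 2, 6, 7, 6, 3, 4, 3),   # I don't care about shopping
  V6 = c(3, 4, 3, 3, 4, 3, 7, 6, 6)    # Comparing prices saves money
)

set.seed(42)                                  # k-means starts from random centres
km <- kmeans(survey, centers = 3, nstart = 25)

print(km$cluster)             # cluster assigned to each customer
print(round(km$centers, 2))   # mean response per variable in each cluster
```

The rows of `km$centers` play the role of the cluster centroids in the Weka output: a cluster is labelled by the variables on which its centre is high or low.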
  • 15. Interpreting The Results
    Each cluster describes a type of behaviour among our customers, from which we can begin to draw some conclusions:
      • Cluster 0 - High values on V2, V4 and V6. These can be called economical shoppers.
      • Cluster 1 - High values on V1 and V3 and low values on V5. These could be labelled fun loving and concerned shoppers.
      • Cluster 2 - Opposite of cluster 1. These can be termed apathetic shoppers.
    To visually inspect the clusters, right-click on the Result List section. One of the options in this pop-up menu is Visualize Cluster Assignments. A window will pop up that lets you play with the results and see them visually (see figure 5).
    Figure 1 The window after loading the dataset
  • 16. Figure 2 Window after choosing the SimpleKMeans procedure
  • 17. Figure 3 Changing the number of clusters
  • 18. Figure 4 The result of cluster analysis
  • 19. Figure 5 Visually viewing the cluster