Experiments and Results on Click stream analysis using R

Clustering, classification, and association rules on retail data using the R statistical tool.


DB 297C Data Analytics – Project Report, Term I (2013-14)

Group Information

GROUP NO: 11
TEAM MEMBERS:
- Bisen Vikratsingh Mohansingh - MT2012036
- Kodamasimham Pridhvi - MT2012066
- Vaibhav Singh Rajput - MT2012145

Dataset Description

Blue Martini Software approached several clients of its Customer Interaction System to volunteer their data; one of them was Gazelle.com, a small dot-com legwear and legcare retailer. The data was made available in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, three categories are relevant:

- Customer information: customer ID, registration information, and registration-form questionnaire responses.
- Order information: the order header (date/time, discount, tax, total amount, payment, shipping, status, session ID) and the order lines (quantity, price, product, date/time, assortment, and status).
- Clickstream information: sessions (starting and ending date/time, cookie, browser, referrer, visit count, and user agent) and page views (date/time, sequence number, URL, processing time, product, and assortment).

Pre-processing summary:
- Initial dataset: 25 MB .csv file (14,000 rows x 296 columns).
- Removed all-NULL columns with a SQL query, leaving 208 columns.
- Removed crawler records and records with just one page view (Session_time_elapsed = 0.0), leaving ~5,000 rows.
- Manually removed a few more irrelevant columns (browser/OS, day, date info, etc.), leaving 128 columns.
- Keeping only the most frequently visited pages, removed columns whose visit sum was < 10, leaving 113 columns.
- Final size: 6.5 MB, ~5,000 rows x 113 columns.
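The filtering pipeline above was done with SQL queries and manual column selection; the same logic can be sketched in Python on a toy table (all column names and values below are made up for illustration):

```python
# Toy rows standing in for the clickstream export (hypothetical values).
rows = [
    {"Session_time_elapsed": 0.0, "Num_A_Views": 1, "Num_B_Views": 0, "Junk": None},
    {"Session_time_elapsed": 3.5, "Num_A_Views": 9, "Num_B_Views": 2, "Junk": None},
    {"Session_time_elapsed": 7.1, "Num_A_Views": 8, "Num_B_Views": 1, "Junk": None},
]

# 1. Drop single-page-view sessions (Session_time_elapsed == 0.0).
rows = [r for r in rows if r["Session_time_elapsed"] != 0.0]

# 2. Drop columns that are NULL (None) in every remaining row.
cols = [c for c in rows[0] if any(r[c] is not None for r in rows)]

# 3. Drop view-count columns whose total visits fall below a threshold
#    (the report used sum(visit) < 10).
MIN_VISITS = 10
cols = [c for c in cols
        if not c.startswith("Num_") or sum(r[c] for r in rows) >= MIN_VISITS]

rows = [{c: r[c] for c in cols} for r in rows]
print(sorted(cols))
```

On this toy input the all-NULL column and the rarely visited view column are dropped, along with the zero-elapsed-time session.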
Summary of Top 5 Observations

Rule-Based Classification: Rule-based methods (rule discovery or rule extraction from data) are data mining techniques aimed at understanding data structures, providing a comprehensible description instead of only a black-box prediction. Sets of rules are useful if they are not too numerous, are comprehensible, and have sufficiently high accuracy. The experiment generated 182 rules in total; some sample rules are shown in this documentation. The class each rule predicts is shown at its end, along with the actual number of covered rows / number of misclassifications for that rule.

Association Rules: Association rules were selected on two factors, lift and support: rules with lift greater than 1 and minimum support > 0.5. A total of 377,564 rules were generated; after applying filters we selected a few rules that showed interesting patterns.

Result 1

Rule-based classification generates a set of rules on which classification takes place; sample rules from the generated model are shown below.

q2_part output: PART decision list

Num_EllenTracy_Product_Views <= 1 AND
Num_main_assortment_Template_Views > 1 AND
Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)

Num_main_assortment_Template_Views > 0 AND
Num_main_shopping_cart_Template_Views <= 0 AND
Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)

Num_UniqueBoutiques_Assortment_Views <= 0 AND
Num_main_vendor_Template_Views <= 1 AND
Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)

Num_articles_dpt_about_Template_Views > 0 AND
Num_BrandOrder_Assortment_Views <= 1 AND
Num_articles_Template_Views <= 1: TRUE (73.0/22.0)

These are some of the sample rules generated by the model, which was then used for prediction on the test data. A successful rule-based model was built with an accuracy above 70% for identifying whether a user will continue a session or not. Results can be found at Rule Based Classification.

Result 2

Five clusters of users were plotted against whether their average spend exceeds $12. It is clearly observable from the figure that most users in cluster 0 are high spenders, whereas cluster 2 consists of the lowest spenders.

Result 3

A more surprising observation is that high spenders are the least interested in offers/free gifts. The image below plots each cluster against the number of visits to the offer/free-gift page.
Result 4

The highest-spending users (cluster 0, as concluded in Result 2) were found to be mostly interested in branded products, which is why their average spend exceeds $12. The image below plots each cluster against DonnaKaran (a fashion design brand) product views; the opposite holds for the lowest-spending cluster, cluster 2.

Result 5

After applying association rules, some interesting rules were generated that were very useful for predicting which users are likely to continue and which are not, and which pages were the killer pages in most cases. Because of the large number of rules generated, we were unable to go through all of them. Some of the graphs were:
Scatter plot of the generated rules based on support, confidence, and lift. Some of the interesting patterns were:

    lhs                                           rhs         support confidence lift
2   {Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.110   1          2.352941
17  {Num_Women_Product_Views=Yes,
     Num_Men_Product_Views=No,
     CONTINUE=YES}                             => {LEAVE=NO}  0.100   1          2.352941
18  {Num_MAS_Category_Views=No,
     Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.105   1          2.352941
19  {Num_MDS_Category_Views=No,
     Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.105   1          2.352941

From the rules we can see that women's products were the most viewed, and pages such as the MDS and MAS category pages were the killer pages in which no one was interested. Results can be observed at Association Rule Results.
APPENDIX – Collection of Individual Experiment Reports (not more than 15)

Data Cleaning/Filtering

DATA PRE-PROCESSING

Description of dataset: Blue Martini Software approached several clients of its Customer Interaction System to volunteer their data; one of them was Gazelle.com, a small dot-com legwear and legcare retailer. The data was made available in two formats: original data and aggregated data. Among the data collected by the Blue Martini application server, three categories are relevant:

- Customer information: customer ID, registration information, and registration-form questionnaire responses.
- Order information: the order header (date/time, discount, tax, total amount, payment, shipping, status, session ID) and the order lines (quantity, price, product, date/time, assortment, and status).
- Clickstream information: sessions (starting and ending date/time, cookie, browser, referrer, visit count, and user agent) and page views (date/time, sequence number, URL, processing time, product, and assortment).

Steps for pre-processing:
- Initial dataset: 25 MB .csv file (14,000 rows x 296 columns).
- Loaded the .csv into an RDBMS using the Python scripts discussed below.
- Removed all-NULL columns with a SQL query -> 208 columns.
- Removed crawler records and records with just one page view (Session_time_elapsed = 0.0) -> new row count ~5,000.
- Manually removed a few more irrelevant columns (browser/OS, day, date info, etc.) -> 108 columns.
- Removed columns with sum(visit) < 10 -> 103 columns.
- Final size: 6.5 MB, ~5,000 rows x 103 columns.
read.py (Python 2):

    import sys

    def readfile(fname, oname):
        f = open(fname)
        outfile = open(oname, 'a')
        while True:
            line = f.readline()
            if len(line) == 0:
                break
            stri = line.split(':')
            if not stri[0].startswith("|"):
                mystr = stri[0].replace(" ", "_")
                # Replace special characters with underscores
                for c in ['*', '&', '/']:
                    mystr = mystr.replace(c, "_")
                if not any((c in mystr) for c in '*&|'):
                    mystr = mystr.replace(".", "")
                    outfile.write(mystr.strip())
                    outfile.write("\n")
        outfile.close()
        f.close()

    if len(sys.argv) < 3:
        print 'No file specified'
        sys.exit()
    else:
        readfile(sys.argv[1], sys.argv[2])

Python scripts: The dataset was given in two files, a Data file and a Names file. In the script above we take the Names file from the user and retrieve the column names for our dataset, removing all special characters and writing the names to another file for further processing.

>> python read.py names_file_name output_file_name

The output file will contain the column names, one per line.
Createscript.py (Python 2):

    import sys

    def makeScript(fname, oname):
        with open(fname, 'r') as file_contents:
            contents = file_contents.read()
        my_data = contents.split("\n")
        outfile = open(oname, 'a')
        input_db = raw_input("Enter Database Name: ")
        createDatabase = "CREATE DATABASE " + input_db + ";\n"
        useDatabase = "USE " + input_db + ";\n"
        input_table = raw_input("Enter Table Name: ")
        columns = ""
        count = 0
        for data in my_data:
            count = count + 1
            if data:
                columns += data + " VARCHAR(100) DEFAULT NULL ,\n"
        createTable = "CREATE TABLE " + input_table + "(" + columns + ");\n"
        createTable = createTable.replace("/", "")
        # Drop the trailing comma after the last column definition
        k = createTable.rfind(',')
        createTable = createTable[:k] + " " + createTable[k+1:]
        print createTable
        print count
        outfile.write(createDatabase)
        outfile.write(useDatabase)
        outfile.write(createTable)
        outfile.close()

    if len(sys.argv) < 3:
        print 'No file specified'
        sys.exit()
    else:
        makeScript(sys.argv[1], sys.argv[2])

In this script we use the file created by read.py as input. It generates a SQL script that creates a table with the retrieved columns, each typed VARCHAR, so that the data file can be loaded into the DBMS for further processing.

>> python Createscript.py output_file_name script.sql

This produces a SQL script that creates the table with the retrieved columns.
da.py (Python 2):

    import MySQLdb

    myfile = open("path where required csv is there", 'r')
    db = MySQLdb.connect(host="localhost",  # your host, usually localhost
                         user="root",       # your username
                         passwd="root",     # your password
                         db="da1")          # name of the database
    cur = db.cursor()
    for line in myfile:
        my_line_list = line.split(',')
        string = ""
        for value in my_line_list:
            string = string + "'" + str(value) + "',"
        query_string = string[:-1]
        final_query = "insert into question1 values" + "(" + query_string + ");"
        cur.execute(final_query)

Here we open a connection to the database, read the Data file line by line, and insert each row directly into the table created earlier.

>> python da.py

The path to the .csv Data file must be specified inside the script; it reads the csv and inserts the rows into the database. After loading the data into the DBMS, we remove the all-NULL columns with simple SQL queries as mentioned above, then export the table to a .csv file for analysis in R.
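Note that the string concatenation above breaks on fields that themselves contain commas or quotes, and is open to SQL injection. A safer variant (a sketch in modern Python; the table name `question1` is taken from the script above, the sample row is made up) parses each line with the csv module and builds a parameterized query for a DB-API cursor:

```python
import csv
import io

def build_insert(table, row):
    """Return a parameterized INSERT statement and its values tuple."""
    placeholders = ", ".join(["%s"] * len(row))
    return "INSERT INTO %s VALUES (%s);" % (table, placeholders), tuple(row)

# csv handles quoted fields that contain commas, unlike str.split(',').
sample = io.StringIO('5220,"Bangalore, IN",TRUE\n')
for parsed in csv.reader(sample):
    query, values = build_insert("question1", parsed)
    # cur.execute(query, values)  # with a real DB-API cursor
    print(query)
    print(values)
```

With a real connection, the driver substitutes the `%s` placeholders itself, escaping quotes in the data.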
Classification: Random Forest

Objective: To generate a model for building decision trees and to identify important features using random forest.

Description: Random forests are an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees.

Procedure:
1. After pre-processing, the dataset is loaded into the R environment:
   question1 <- read.csv("dataset.csv")
   dim(question1)  # 5220 103 -- number of rows and columns
2. Divide the dataset into 70% training data and 30% test data:
   div <- sample(2, nrow(question1), replace=T, prob=c(0.7, 0.3))
   This assigns each row to one of two non-overlapping samples of roughly 70% and 30% of the rows: 'nrow' gives the number of rows in the dataset, 'prob' sets the division ratio, and 'sample' assigns 1 or 2 to each row to mark which sample it belongs to.
   trainData <- question1[div==1,]
   dim(trainData)  # 3670 103 -- dimensions of the training set
   This copies every row marked 1 into trainData. Similarly for the test data:
   testData <- question1[div==2,]
   dim(testData)  # 1550 103 -- dimensions of the test set
3. Load the required 'randomForest' package into R:
   library(randomForest)
4. Define the target and independent variables in the formula used to generate the model:
   myformula <- Session_Continues ~ .
   Here 'Session_Continues' is the target variable, with classes 'true' and 'false'; the '~ .' makes all remaining columns the independent variables on which the target is classified.
5. Apply the formula to generate the model from trainData using the 'randomForest' function, storing the model in 'rf':
   rf <- randomForest(myformula, data=trainData, ntree=100, proximity=T)
   'ntree' specifies how many trees the algorithm generates to obtain an accurate model; 'proximity' asks randomForest to also compute the proximity measure between observations (including out-of-bag cases), which is helpful when examining error rates.
6. View the classification result:
   rf
   output:
   Call:
    randomForest(formula = myformula, data = trainData, ntree = 100, proximity = T)
                  Type of random forest: classification
                        Number of trees: 100
   No. of variables tried at each split: 10
           OOB estimate of error rate: 34.17%
   Confusion matrix:
          False True class.error
   False   2280  173  0.07052589
   True    1081  136  0.88824979
   From this result we can see an error rate of about 34%.
7. To inspect a generated classification tree:
   getTree(rf, 1)
   Output:
     left daughter right daughter split var split point status prediction
   1   2   3    3   1   0
   2   4   5   29   1   0
   3   6   7  105   1   0
   4   8   9   98   1   0
   5  10  11   36   1   0
   6  12  13   34   1   0
   7   0   0    0  -1   2
   If status is -1, that node is a leaf of the decision tree, and prediction (1 or 2) gives the class it is classified as. Any tree's information can be obtained with this command by specifying the randomForest object 'rf' and the tree number n, where 1 <= n <= 100 (since ntree = 100).
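As a sanity check, the reported OOB error rate and per-class errors follow directly from the confusion matrix counts above (a small Python calculation; the counts are copied from the randomForest output):

```python
# Confusion matrix from the randomForest output above:
# outer key = actual class, inner key = predicted class.
conf = {
    "False": {"False": 2280, "True": 173},
    "True":  {"False": 1081, "True": 136},
}

total = sum(sum(row.values()) for row in conf.values())
wrong = conf["False"]["True"] + conf["True"]["False"]
oob_error = wrong / total
print(round(oob_error * 100, 2))    # matches the reported 34.17%

class_error_true = conf["True"]["False"] / sum(conf["True"].values())
print(round(class_error_true, 8))   # matches the reported 0.88824979
```

The high class error for True shows the model rarely recognizes continuing sessions, even though overall error looks moderate.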
8. Plot the error rates of the generated trees:
   plot(rf)
   This gives the graph shown in Figure (1) in the observations.
9. Find the features that contribute most to the decision trees:
   importance(rf)
   This lists each feature with its mean Gini index, from which we can decide which features most influence the decision tree.
10. The randomForest object exposes many attributes:
   attributes(rf)
   output:
   $names
    [1] "call"            "type"            "predicted"       "err.rate"
    [5] "confusion"       "votes"           "oob.times"       "classes"
    [9] "importance"      "importanceSD"    "localImportance" "proximity"
   [13] "ntree"           "mtry"            "forest"          "y"
   [17] "test"            "inbag"           "terms"
   $class
   [1] "randomForest.formula" "randomForest"
11. Now, using the model generated from the training data, apply it to the test data for prediction:
   testpredict <- predict(rf, newdata=testData)
   output:
   testpredict False True
   False        952  473
   True          52   73
   Here we use the model 'rf' on the dataset 'testData' and store the result in a variable as above.

Observations: The plot of the model error rates is:
We can see that as the number of trees increases, the error rate decreases and the model classifies more accurately.

Result of importance(rf):

Column                                 MeanGiniIndex
Num_BrandOrder_Assortment_Views        2.226234e+01
Num_UniqueBoutiques_Assortment_Views   3.372542e+01
Num_Brands_Assortment_Views            2.324481e+01
Num_Departments_Assortment_Views       2.224833e+01
Num_LifeStyles_Assortment_Views        1.265545e+01
Num_main_Template_Views                4.466076e+01
Num_products_Template_Views            1.143197e+01
Num_articles_Template_Views            1.943811e+01
Num_main_home_Template_Views           2.347768e+01

From this result we can observe that, of the 103 features, only some contribute substantially to the model, so we can further reduce the dataset and achieve better results.

WEKA OUTPUT: We also ran attribute selection on the same dataset in Weka; the result is:

Num_Hanes_Product_Views
Num_Cotton_Product_Views
Num_Nylon_Product_Views
Num_BrandOrder_Assortment_Views
Num_UniqueBoutiques_Assortment_Views
Num_LifeStyles_Assortment_Views
Num_main_Template_Views
Num_articles_Template_Views
Num_main_home_Template_Views
Num_main_vendor_Template_Views
Num_articles_dpt_about_mgmtteam_Template_Views
Num_main_cust_serv_Template_Views

The two outputs largely matched, so we were able to find the factors that most influence our decision tree.

Conclusion: We will use the features with major influence on the decision tree as the independent variables for the actual construction of a decision tree with the 'party' package, for better results and classification.
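The MeanGiniIndex ranking above is the mean decrease in Gini impurity a feature achieves across trees. The underlying quantity for a single binary split can be sketched as follows (toy class counts, not taken from the dataset):

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Toy parent node: 60 FALSE / 40 TRUE, split into two children.
parent = [60, 40]
left, right = [50, 10], [10, 30]

n = sum(parent)
# Impurity decrease = parent impurity minus weighted child impurities.
decrease = (gini(parent)
            - (sum(left) / n) * gini(left)
            - (sum(right) / n) * gini(right))
print(round(decrease, 4))
```

Features whose splits achieve large decreases, averaged over all trees, rise to the top of the importance(rf) listing.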
Party Decision Tree

Objective: To build a decision tree from the features identified by random forest, using the "party" package.

Description: party is a computational toolbox for recursive partitioning. The core of the package is ctree(), an implementation of conditional inference trees, which embeds tree-structured regression models into a well-defined theory of conditional inference procedures. This non-parametric class of regression trees is applicable to all kinds of regression problems, including nominal, ordinal, numeric, censored, and multivariate response variables, and arbitrary measurement scales of the covariates.

Procedure:
1. Load the dataset into R:
   question1_reduced <- read.csv("q2_reduced.csv")
   dim(question1_reduced)  # 5220 103
2. Divide the dataset into training and test sets:
   div <- sample(2, nrow(question1_reduced), replace=T, prob=c(0.70, 0.30))
3. Store the training and test data in variables for analysis:
   trainData_reduced <- question1_reduced[div==1,]
   dim(trainData_reduced)  # 3675 103
   testData_reduced <- question1_reduced[div==2,]
   dim(testData_reduced)  # 1545 103
4. Define the formula on the target variable using the features identified by random forest:
   myformula_reduced <- Session_Continues ~ Num_Hanes_Product_Views +
     Num_Cotton_Product_Views + Num_Nylon_Product_Views +
     Num_BrandOrder_Assortment_Views + Num_UniqueBoutiques_Assortment_Views +
     Num_LifeStyles_Assortment_Views + Num_main_Template_Views +
     Num_articles_Template_Views + Num_main_home_Template_Views +
     Num_main_vendor_Template_Views + Num_articles_dpt_about_mgmtteam_Template_Views +
     Num_main_cust_serv_Template_Views
   These features come from the importance ranking found during the randomForest run.
5. Load the "party" package for analysis:
   library(party)
6. Apply the recursive partitioning algorithm to the training data with the above formula:
   trainData_ctree <- ctree(myformula_reduced, data=trainData_reduced)
7. View the generated model:
   print(trainData_ctree)
This shows the features used in building the decision tree and how the tree is built:

output:
Conditional inference tree with 3 terminal nodes
Response: Session_Continues
Inputs: Num_Hanes_Product_Views, Num_Cotton_Product_Views, Num_Nylon_Product_Views, Num_BrandOrder_Assortment_Views, Num_UniqueBoutiques_Assortment_Views, Num_LifeStyles_Assortment_Views, Num_main_Template_Views, Num_articles_Template_Views, Num_main_home_Template_Views, Num_main_vendor_Template_Views, Num_articles_dpt_about_mgmtteam_Template_Views, Num_main_cust_serv_Template_Views
Number of observations: 3675

1) Num_main_home_Template_Views <= 1; criterion = 0.999, statistic = 16.455
  2) Num_articles_dpt_about_mgmtteam_Template_Views <= 0; criterion = 0.984, statistic = 10.226
    3)* weights = 2607
  2) Num_articles_dpt_about_mgmtteam_Template_Views > 0
    4)* weights = 245
1) Num_main_home_Template_Views > 1
  5)* weights = 823

8. To visualize the tree graphically:
   plot(trainData_ctree, type="simple")
   This produces the decision tree graph shown in the observation below.
9. Apply the model to the test set:
   testpred_reduced <- predict(trainData_ctree, newdata=testData_reduced)
10. Check the prediction accuracy:
   table(testpred_reduced, testData_reduced$Session_Continues)
   The output shows the prediction rate; since the prediction has many errors, this is not a suitable method for decision tree generation here.
Observation: The graph of the decision tree obtained is:

From the graph we can see that only two factors are considered by the ctree() algorithm.
WEKA: The decision tree generated in Weka from the above factors is:

Conclusion: The results were not satisfactory; there was a large rate of error and misclassification of the data. Around 50% of the data was not correctly classified using the "party" package.
Rule-Based Classification

Objective: To generate a model for building rules and to classify data based on the rules being satisfied.

Description: Rule-based methods (rule discovery or rule extraction from data) are data mining techniques aimed at understanding data structures, providing a comprehensible description instead of only a black-box prediction. Rule-based systems should expose, in a comprehensible way, the knowledge hidden in data, providing logical justification for drawing conclusions, showing possible inconsistencies, and avoiding the unpredictable conclusions that black-box predictors may generate in atypical situations. Sets of rules are useful if they are not too numerous, are comprehensible, and have sufficiently high accuracy.

Procedure:
1. The data has already been loaded into R and divided into training and test datasets, so we can apply rule-based classification directly to the training set.
2. For rule-based classification we install the "RWeka" package, which imports all the algorithms of the Weka tool.
3. We use the "PART" rule-based classifier from Weka to generate rules from the training dataset, on which we will classify the test dataset:
   library(RWeka)
   This loads the "RWeka" package into the R environment.
4. Apply the PART algorithm to the training dataset to obtain the rules:
   q2_part <- PART(Session_Continues ~ ., data = q2_train)
   This takes the training dataset "q2_train" and applies the "PART" algorithm with Session_Continues as the target variable and all remaining columns as independent variables.
5. A model is built by the previous command and used to classify the test dataset:
   q2_pre <- evaluate_Weka_classifier(q2_part, newdata = q2_test)
Here "evaluate_Weka_classifier", a function from the Weka tool, uses the model to classify the test data.
6. To see the result:
   q2_pre
   OUTPUT:
   === Summary ===
   Correctly Classified Instances     880    67.433 %
   Incorrectly Classified Instances   425    32.567 %
   Kappa statistic                      0
   Mean absolute error                  0.4447
   Root mean squared error              0.4689
   Relative absolute error             99.9541 %
   Root relative squared error         99.9958 %
   Coverage of cases (0.95 level)     100 %
   Mean rel. region size (0.95 level) 100 %
   Total Number of Instances         1305

   === Confusion Matrix ===
      a    b   <-- classified as
    734  124 | a = FALSE
    301  136 | b = TRUE

From the result we can see a classification rate of around 68-72%, which is better than the decision tree.

Observation

Rule-based classification generates a set of rules on which classification takes place; we can see the set of rules from the generated model:

q2_part
O/p: PART decision list

Num_EllenTracy_Product_Views <= 1 AND
Num_main_assortment_Template_Views > 1 AND
Num_main_lifestyles_Template_Views > 1: FALSE (38.0/3.0)

Num_main_assortment_Template_Views > 0 AND
Num_main_shopping_cart_Template_Views <= 0 AND
Num_LifeStyles_Assortment_Views <= 1: TRUE (7.0)

Num_UniqueBoutiques_Assortment_Views <= 0 AND
Num_main_vendor_Template_Views <= 1 AND
Num_Brands_Assortment_Views <= 1: FALSE (145.0/71.0)

Num_articles_dpt_about_Template_Views > 0 AND
Num_BrandOrder_Assortment_Views <= 1 AND
Num_articles_Template_Views <= 1: TRUE (73.0/22.0)

These are some sample rules; 182 rules were generated in total. The class each rule predicts is shown at its end, along with the actual number of covered rows / number of misclassifications for that rule.

Conclusion

From the above observations and results we can see that a successful rule-based model was built, with an accuracy above 70%, for identifying whether a user will continue a session or not.
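Each PART rule above ends with "(covered/misclassified)" counts, so a rule's training accuracy follows directly (a short calculation using the counts printed in the listing above; the rule names are just labels):

```python
# (covered, misclassified) pairs as printed after each PART rule above.
# A rule printed with a single count, like (7.0), has zero misclassifications.
rules = {
    "rule_1": (38.0, 3.0),
    "rule_2": (7.0, 0.0),
    "rule_3": (145.0, 71.0),
    "rule_4": (73.0, 22.0),
}

accs = {name: round((covered - wrong) / covered, 3)
        for name, (covered, wrong) in rules.items()}
print(accs)
```

This makes it easy to see that the sample rules vary widely in reliability: the third rule misclassifies nearly half the rows it covers.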
Clustering

Objective: To group website visitors whose page view patterns are similar and to identify their interests.

Approach: Clustering is a data analysis methodology that groups objects based on their similarities. We make use of the WEKA tool for this analysis.

Preprocessing:
1. Remove spam data by deleting records with just one page view.
2. There are about 500+ dimensions, which is not feasible to analyze, so for dimensionality reduction:
   a. Go to Select Attributes in WEKA.
   b. Manual: remove all session data, browser information, and the most common pages.
   c. Automatic: calculate information gain and select the top 25 attributes.

Process: Experiment I

In this experiment we make 5 clusters of the given instances (users) and analyze their purchase habits.

K-means steps:
1. Import the reduced dataset into WEKA.
2. Select SimpleKMeans.
3. Specify k, the number of clusters (here 5).
4. Set the distance function to Euclidean.
5. Click Start to generate the clusters.

Results: (A) Using Euclidean distance

=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-VR7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
City
Customer_ID
US_State
Num_Sheer_Look_Product_Views
Num_CT_Waist_Control_Views
Num_PH_Category_Views
Num_main_shopping_cart_Template_Views
Num_Replenishable_Stock_Views
Num_account_Template_Views
Num_main_login2_Template_Views
Num_Sandal_Foot_Views
Num_HasDressingRoom_True_Views
Num_Legwear_Product_Views
Num_products_productDetailLegwear_Template_Views
Num_DonnaKaran_Product_Views
Num_AmericanEssentials_Product_Views
Num_Basic_Product_Views
Num_WDCS_Category_Views
Num_Oroblu_Product_Views
WhichDoYouWearMostFrequent
Num_products_Template_Views
Home_Market_Value
Num_WAS_Category_Views
Num_main_vendor_Template_Views
Num_main_freegift_Template_Views
Spend_over_$12_per_order_on_average
Test mode: evaluate on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 12
Within cluster sum of squared errors: 4789.607514501406
Missing values globally replaced with mean/mode
Time taken to build model (full training data): 0.14 seconds

=== Model and evaluation on training set ===

Clustered Instances
0   263 (15%)
1   273 (15%)
2   368 (21%)
3   484 (27%)
4   393 (22%)
Observation:
- Cluster 0
  - High income.
  - Average spend > $12 => potential customers (value).
  - Purchase nylon more than cotton (nylon is costlier than cotton).
  - Mostly view brand assortment pages (DonnaKaran, a costly fashion design brand).
  - Fewer visits to the free-gift category.
  - More visits to sandal foot.
- Cluster 1
  - More men's products than the other clusters => the cluster might contain more men.
  - Frequent use of the search bar.
  - Rich visitors; most have above-average home/asset value.
- Cluster 2
  - General visitors.
  - Buy cheap products.
- Cluster 3
  - Interested mostly in offers/free-gift products.
  - Highest visits to the checkout page => potential customers (frequency).
- Cluster 4
  - No special pattern observed.

Experiment II

We can validate whether page view data can be used to identify potential customers using clustering. We have data labeled by average purchase > $12, with 1368 instances labeled false and 413 labeled true.

=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation: q3_added-weka.filters.unsupervised.attribute.Remove-VR7,10,9,102,127,112,66,87,56,70,111,106,26,59,20,18,104,113,17,1,54,14,115,63,73,15
Instances: 1781
Attributes: 26
City
Customer_ID
US_State
Num_Sheer_Look_Product_Views
Num_CT_Waist_Control_Views
Num_PH_Category_Views
Num_main_shopping_cart_Template_Views
Num_Replenishable_Stock_Views
Num_account_Template_Views
Num_main_login2_Template_Views
Num_Sandal_Foot_Views
Num_HasDressingRoom_True_Views
Num_Legwear_Product_Views
Num_products_productDetailLegwear_Template_Views
Num_DonnaKaran_Product_Views
Num_AmericanEssentials_Product_Views
Num_Basic_Product_Views
Num_WDCS_Category_Views
Num_Oroblu_Product_Views
WhichDoYouWearMostFrequent
Num_products_Template_Views
Home_Market_Value
Num_WAS_Category_Views
Num_main_vendor_Template_Views
Num_main_freegift_Template_Views
Ignored: Spend_over_$12_per_order_on_average
Test mode: Classes to clusters evaluation on training data

=== Clustering model (full training set) ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 4913.0928856548035
Missing values globally replaced with mean/mode
Time taken to build model (full training data): 0.05 seconds

=== Model and evaluation on training set ===

Clustered Instances
0    561 (31%)
1   1220 (69%)

Class attribute: Spend_over_$12_per_order_on_average
Classes to Clusters:
   0    1   <-- assigned to cluster
 402  966 | False
 159  254 | True

Cluster 0 <-- True
Cluster 1 <-- False

Incorrectly clustered instances: 656.0 (36.8332 %)

Observation: Only about 63% of the data is correctly clustered, as the data is biased toward the False (< $12 average spend) class. Still, clustering gives us good insight into purchase/page-view patterns.
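The SimpleKMeans runs above follow the standard Lloyd iteration: assign each instance to its nearest centroid by Euclidean distance, then recompute each centroid as the mean of its members. A minimal sketch on toy 2-D points (not the Gazelle features):

```python
import math

def kmeans(points, centroids, iters=10):
    """Plain Lloyd's algorithm with Euclidean distance."""
    labels = []
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, cents = kmeans(points, centroids=[[0, 0], [10, 10]])
print(labels)   # [0, 0, 1, 1]
```

Weka additionally replaces missing values with the column mean/mode before clustering, as noted in its output above.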
Association Rules

Objective: To identify interesting patterns in users' page views, and the killer pages.

Description: Association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules in databases using different measures of interestingness; the measures used in our analysis are lift, confidence, and support.

Procedure:
1. To mine association rules, we converted the dataset into a binary matrix indicating, for each session, whether the user visited each page or not.
2. The "arules" package provides the implementation:
   library(arules)
3. Load the converted dataset into R for rule generation; we used the important columns based on the mean Gini index obtained from the randomForest result.
4. Convert the data to transactions:
   dataTrans <- as(assoc, "transactions")
5. Apply the "apriori" algorithm to generate the rules; a parameter list with support, confidence, and minimum rule length can be passed:
   rules <- apriori(dataTrans)
   This generates all rules with minimum support around 0.1 and minimum confidence around 0.8. It also generates all the subset rules based on the frequent itemsets of attributes.
6. To see how many rules were generated:
   rules
   Around 377,564 rules were generated, of which we were interested only in rules with LEAVE or CONTINUE on the right-hand side, to check whether a person will continue or leave after seeing certain pages.
7. We retrieved a subset of the generated rules that showed interesting patterns.
Observation: We were able to see some interesting patterns in the generated rules. Most of the persons in our dataset were female, so most rules contained "NUM_OF_WOMEN_PRODUCT_VIEWS" in almost every transaction. Some brands were rarely or never visited according to the rules. We were also able to identify some killer pages, based on user preferences: after visiting certain pages, users withdrew at the same page every time.
Results: Some of the rules sorted by "lift" value are shown below:

rulesLeave <- subset(rules, subset = rhs %pin% "LEAVE")  # for getting the rules
inspect(head(sort(rulesLeave, by="lift"), 20))

O/p:
    lhs                                           rhs         support confidence lift
1   {CONTINUE=YES}                             => {LEAVE=NO}  0.425   1          2.352941
2   {Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.110   1          2.352941
17  {Num_Women_Product_Views=Yes,
     Num_Men_Product_Views=No,
     CONTINUE=YES}                             => {LEAVE=NO}  0.100   1          2.352941
18  {Num_MAS_Category_Views=No,
     Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.105   1          2.352941
19  {Num_MDS_Category_Views=No,
     Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.105   1          2.352941
20  {Num_MCS_Category_Views=No,
     Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.110   1          2.352941

Some of the interesting rules are shown above. Some randomly selected rules are:

inspect(head(rulesLeave, 6))

O/p:
    lhs                                           rhs         support confidence lift
3   {Num_Women_Product_Views=Yes,
     CONTINUE=YES}                             => {LEAVE=NO}  0.110   1          2.352941
4   {Num_Women_Product_Views=Yes,
     CONTINUE=NO}                              => {LEAVE=YES} 0.100   1          1.739130
5   {Num_Women_Product_Views=No,
     CONTINUE=YES}                             => {LEAVE=NO}  0.315   1          2.352941
6   {Num_CT_Waist_Control_Views=No,
     CONTINUE=YES}                             => {LEAVE=NO}  0.360   1          2.352941

For CONTINUE, some examples are:

inspect(head(rulesContinue, 4))

O/p:
    lhs                                           rhs            support confidence lift
3   {Num_Women_Product_Views=Yes,
     LEAVE=NO}                                 => {CONTINUE=YES} 0.110   1          2.352941
4   {Num_Women_Product_Views=Yes,
     LEAVE=YES}                                => {CONTINUE=NO}  0.100   1          1.739130
5   {Num_Women_Product_Views=No,
     LEAVE=NO}                                 => {CONTINUE=YES} 0.315   1          2.352941
6   {Num_CT_Waist_Control_Views=No,
     LEAVE=NO}                                 => {CONTINUE=YES} 0.360   1          2.352941

Conclusion: We were able to find some interesting patterns in users' page views and to identify some of the killer pages, such as "Num_CT_Waist_Control_Views" and "Num_MAS_Category_Views".
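The support, confidence, and lift columns above can be reproduced from a set of binary transactions. A sketch on toy sessions chosen to mimic rule 1 of the listing (17 of 40 sessions continue, so support({CONTINUE=YES}) = 0.425, the figure in the output; the transactions themselves are made up):

```python
# 40 toy sessions: 17 continue (and hence have LEAVE=NO), 23 leave.
transactions = [{"CONTINUE=YES", "LEAVE=NO"}] * 17 + [{"LEAVE=YES"}] * 23

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def metrics(lhs, rhs):
    sup = support(lhs | rhs)
    conf = sup / support(lhs)        # P(rhs | lhs)
    lift = conf / support(rhs)       # confidence relative to base rate of rhs
    return sup, conf, lift

sup, conf, lift = metrics({"CONTINUE=YES"}, {"LEAVE=NO"})
print(sup, conf, round(lift, 6))   # 0.425 1.0 2.352941
```

A lift above 1 means the right-hand side occurs more often with the left-hand side than its base rate would suggest; with confidence 1, lift is exactly 1 / support(rhs), which is why so many rules above share the value 2.352941.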
