# Caravan insurance data mining prediction models

DATA MINING EXAMPLE

Published in: Education, Technology

1. Caravan Insurance Data Mining Assignment

    K6225 Knowledge Discovery and Data Mining

    By Sesagiri Raamkumar Aravind (G1101761F) and Thangavelu Muthu Kumaar (G1101765E)
2. Table of Contents

    - 1.0 Objective
    - 2.0 Summary of Final Results
    - 3.0 Exercise Lifecycle
        - 3.1 Understanding the objective of the exercise and its expectations
        - 3.2 Understanding the data dictionary of the data set
        - 3.3 Assigning appropriate measure values (Set/Range) for data fields
        - 3.4 Constructing first level models with Training dataset
            - 3.4.1 Logistic Regression
            - 3.4.2 Decision Trees
            - 3.4.3 Neural Networks
        - 3.5 Running the first level Models with Test data
        - 3.6 Performing bivariate analysis on training dataset
        - 3.7 Creating interaction variables based on results of Step 5
        - 3.8 Balancing the training data
        - 3.9 Constructing second level models with Training dataset
        - 3.10 Running the second level Models with Test dataset
        - 3.11 Constructing third level models by adding new interaction variables
        - 3.12 Running the third level models with Test dataset
        - 3.13 Final Results Interpretation
3. #### 1.0 Objective

    The objective of this data mining exercise is to find the best possible model for predicting whether a customer, as described by a customer signature, will opt for caravan insurance (a mobile home policy) or not. The techniques used are logistic regression, decision trees and neural networks.

    #### 2.0 Summary of Final Results

    The models built using Logistic Regression and Decision Tree achieved the highest accuracy, ahead of the models built using Neural Network. The best model had an accuracy of 94%. The most interesting finding is that the base models, built on the data as originally provided without any interaction variables or balancing, gave the best results. As expected, most models had higher accuracy on the training dataset than on the test dataset. Cross-validation, such as 10-fold cross-validation, was not performed in this exercise; it could have separated the results of the models more clearly.

    #### 3.0 Exercise Lifecycle

    The lifecycle of the complete data mining exercise comprises the following steps:

    1. Understanding the objective of the exercise and its expectations
    2. Understanding the data dictionary of the data set
    3. Assigning appropriate measure values (Set/Range) for data fields
    4. Constructing first level Models with Training dataset
    5. Running the first level Models with Test dataset
    6. Performing bivariate analysis
    7. Balancing the training data
    8. Constructing second level Models with Training dataset
    9. Dataset modification of Training dataset
    10. Running the second level Models with Test dataset
    11. Creating interaction variables based on results of Step 6
    12. Constructing third level Models with Training dataset
    13. Running the third level Models with Test dataset
    14. Final Results Interpretation
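The summary notes that 10-fold cross-validation was not carried out. As a rough illustration of what it would involve, the sketch below splits record indices into ten train/test partitions using only the Python standard library. The function name `k_fold_indices` and the record count are hypothetical; this is not the procedure used in the exercise.

```python
import random

def k_fold_indices(n_records, k=10, seed=0):
    """Split record indices 0..n_records-1 into k (train, test) index pairs."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)        # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    splits = []
    for i, test in enumerate(folds):
        # Every fold except the i-th one forms the training part.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# Hypothetical example: 25 records split into 10 folds.
splits = k_fold_indices(25, k=10)
```

Each model would then be trained on `train` and scored on `test` for every pair, and the ten accuracies averaged, which gives a steadier estimate than a single train/test run.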
4. ##### 3.1 Understanding the objective of the exercise and its expectations

    The first and foremost step in a data mining exercise is to understand the business objective. The business wants to use its existing customer signatures to build a predictive model for the number of mobile home policies. The model and the inferences drawn from it will be a precursor to a potential marketing campaign that targets specific customer groups. The data mining techniques in scope for this exercise are logistic regression, decision trees and neural networks.

    ##### 3.2 Understanding the data dictionary of the data set

    The data dictionary consists of 86 variables with an equal mix of socio-demographic and product ownership data. A few ordinal variables need to be changed to numeric variables for build efficiency. The socio-demographic variables are captured at zip-code level.

    ##### 3.3 Assigning appropriate measure values (Set/Range) for data fields

    Apart from the automatically assigned measures, the measure of the following variables was manually changed to ‘Range’ in Clementine:

    - MAANTHUI: Number of houses
    - MGEMOMV: Avg size household
    - MGEMLEEF: Avg age
    - MGODRK: Roman catholic
    - PWAPART: Contribution private third party insurance

    There is an academic insight that socio-demographic variables should be converted to ‘Range’ variables so that their values can be plotted conveniently on the logistic regression curve. The authors initially retained the variables as ‘Set’ variables in order to test this postulation at a later stage.

    ##### 3.4 Constructing first level models with Training dataset

    The authors planned to arrive at the best model using a three-level approach. The first level models are crude models constructed directly on the data set, without any new interaction variables or data balancing. These models are the first benchmark against which subsequent improvements are gauged. Models were built using the Logistic, C5.0 and Neural Net nodes.

    ###### 3.4.1 Logistic Regression

    No changes were made to the Logistic Regression node, as all of its attributes appeared to be optimal.
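To make the Set/Range distinction in section 3.3 concrete, the sketch below encodes the same ordinal codes in both ways. It is an illustration in plain Python, not Clementine's internal behaviour: the function names and the sample codes are hypothetical. A ‘Range’ field reaches the model as a single numeric input, while a ‘Set’ field is expanded into one indicator per distinct level.

```python
def encode_as_range(values):
    """Treat an ordinal code as a single numeric input ('Range')."""
    return [[float(v)] for v in values]

def encode_as_set(values):
    """Treat each distinct code as its own category ('Set'):
    one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels

# Hypothetical coded values for a field such as MGEMLEEF (Avg age).
codes = [1, 2, 3, 2, 6]
range_rows = encode_as_range(codes)
set_rows, set_levels = encode_as_set(codes)
```

With the ‘Range’ encoding a logistic regression fits one coefficient for the field; with the ‘Set’ encoding it fits roughly one per level, which is more flexible but costs more parameters.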
5. ###### 3.4.2 Decision Trees

    The following changes were made to the C5 node in Expert mode:

    - Pruning severity was set to 5.
    - ‘Minimum records per child branch’ was changed from 2 to 5, as it was found to be the optimal number. A value of 1 impaired the results, and the same could be said for values greater than 2.
    - The ‘Use Boosting’ option was enabled so that more classifiers are created. The value was set to 15 for the first level and changed to 5 for the second and third levels.

    Fig 7: C5 Model Attributes

    ###### 3.4.3 Neural Networks

    For the neural networks, the RBFN method was selected first, but the model did not produce better results. The final method selected was ‘Quick’. The number of hidden layers was set to 3 so that more transformations can take place. The learning rates were initially increased marginally to check for performance improvements, on the assumption that training was converging towards the global minimum of the network's error surface. As the marginal increase of the alpha learning rate did not produce significant results, it was increased dramatically to 0.9 in order to escape a possible local minimum. The final values are available in the screenshot below.
6. Fig 8: Neural Network Attributes

    ##### 3.5 Running the first level Models with Test data

    The trained first level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. Logistic Regression and Decision Tree shared the best accuracy rate of 94%. The Nagelkerke R-square value with the training dataset was 16.7%. These results are maintained as the first level benchmark. Screenshots are provided below.

    Fig 1: First Level Models Analysis Node Results
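For readers unfamiliar with the Nagelkerke R-square reported above, the sketch below shows the standard formula: the Cox and Snell R-square divided by its maximum attainable value, computed from the log-likelihoods of the intercept-only model and the fitted model. The function name and the sample log-likelihoods are hypothetical and are not taken from the exercise.

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R-square from log-likelihoods.

    ll_null  -- log-likelihood of the intercept-only model
    ll_model -- log-likelihood of the fitted model
    n        -- number of records
    """
    # Cox and Snell R-square.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Its upper bound, reached when the fitted log-likelihood is 0.
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell

# Hypothetical values, for illustration only.
example_r2 = nagelkerke_r2(ll_null=-230.0, ll_model=-200.0, n=1000)
```

The rescaling means the statistic runs from 0 (no improvement over the intercept-only model) to 1 (a perfect fit), so a value such as 16.7% indicates modest explanatory power.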
7. Fig 2: First Level Gain and Lift Chart

    ##### 3.6 Performing bivariate analysis on training dataset

    This step marks the start of the second level model building process. Bivariate analysis in Clementine can be done using the Web node, which represents the relationships between the values of variables using thick and thin lines. The authors performed the analysis using both the normal web and the directed web options of the Web node. In the directed web, the CARAVAN variable was the target and all the other variables were placed in the dependent section. This analysis was not helpful: the relationships it showed were spread across many different values of the independent variables and CARAVAN, so no significant inferences could be made. However, the normal web analysis indicated a strong relationship between customer type and customer subtype, which makes them a potential candidate for an interaction variable.

    ##### 3.7 Creating interaction variables based on results of Step 5

    The indication from the last step was implemented in this step by creating two interaction variables. The first interaction variable, Derive1 (aka customer lifestyle reflector), is built from the parent variables Customer Type and Customer Subtype. The second interaction variable, Derive3 (aka Combined Age-Income Factor), is built from the parent variables Avg age and Avg income. This variable was created based on the authors' intuition that it would help build a better model. Screenshots are provided below for reference.
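The report does not state how the Derive nodes combine their parent variables, so the following is only a sketch under stated assumptions: Derive1 is formed by concatenating the two customer codes into one category label, and Derive3 by multiplying the coded age and income values. The field names MOSHOOFD, MOSTYPE and MINKGEM are assumed names for customer type, customer subtype and average income; only MGEMLEEF appears in the text above.

```python
def add_interaction_fields(record):
    """Return a copy of the record with two illustrative interaction fields."""
    out = dict(record)
    # Assumption: type and subtype are joined into one categorical label.
    out["Derive1"] = f'{record["MOSHOOFD"]}_{record["MOSTYPE"]}'
    # Assumption: coded average age and coded average income are multiplied.
    out["Derive3"] = record["MGEMLEEF"] * record["MINKGEM"]
    return out

# Hypothetical customer signature with coded values.
signature = {"MOSHOOFD": 8, "MOSTYPE": 33, "MGEMLEEF": 2, "MINKGEM": 4}
derived = add_interaction_fields(signature)
```

A concatenated label lets a tree or a ‘Set’ input treat each type and subtype pairing as its own category, while a product gives the models a single numeric field that grows with both parents.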
8. Fig 3: Derived Variables

    ##### 3.8 Balancing the training data

    The training dataset contains relatively few positive cases, i.e. CARAVAN=1. Models constructed using this dataset may therefore not be the best predictors of positive cases. Clementine provides a feature called Balancing that creates more signatures based on conditions, which increases the overall share of positive cases in the dataset. The authors chose a factor of 6 to make the dataset slightly better balanced in terms of value share (72%:28%).

    Fig 4: Balancing
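The effect of a balancing factor of 6 can be sketched as simple oversampling. The code below assumes that balancing duplicates every positive record six times and leaves negative records untouched; the 60/940 class counts are illustrative and were chosen because a positive share of about 6% reproduces roughly the 72%:28% split reported above. The function names are hypothetical.

```python
def balance_by_boosting(records, target="CARAVAN", factor=6):
    """Duplicate each positive record `factor` times; keep negatives once."""
    balanced = []
    for rec in records:
        copies = factor if rec[target] == 1 else 1
        balanced.extend([rec] * copies)
    return balanced

def positive_share(records, target="CARAVAN"):
    """Fraction of records with target == 1."""
    return sum(r[target] for r in records) / len(records)

# Illustrative training set: 60 positives, 940 negatives (assumed counts).
training = [{"CARAVAN": 1}] * 60 + [{"CARAVAN": 0}] * 940
balanced = balance_by_boosting(training)
```

Here the positive share rises from 6% to 360 out of 1,300 records, close to 28%. Only the training data is balanced in this way; the test data keeps its original class mix.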
9. ##### 3.9 Constructing second level models with Training dataset

    The second level models were built with the balanced dataset. The node attributes were carried over from the first level, except for the C5 node, in which the boosting value was changed to 5 because the software did not have enough memory to run with a value of 15.

    ##### 3.10 Running the second level Models with Test dataset

    The trained second level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. The Decision Tree model came out with the highest accuracy, 90.48%. These results were maintained as the second level benchmark. Screenshots are provided below.

    Fig 5: Second Level Models Analysis Node Results
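The gain and lift charts used to compare the models can be read with the help of a small sketch. The code below uses the usual definitions of cumulative gain and lift for the top-scoring fraction of records; it does not reproduce the Evaluation node, and the scores and outcomes are made up for illustration.

```python
def cumulative_gain_and_lift(scores, actuals, fraction):
    """Gain and lift when targeting the top `fraction` of records by score.

    gain -- share of all positives captured in the targeted group
    lift -- positive rate in the targeted group / overall positive rate
    """
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    hits_top = sum(actual for _, actual in ranked[:n_top])
    total_hits = sum(actuals)
    gain = hits_top / total_hits
    lift = (hits_top / n_top) / (total_hits / len(ranked))
    return gain, lift

# Made-up model scores and actual CARAVAN outcomes for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
gain, lift = cumulative_gain_and_lift(scores, actuals, fraction=0.2)
```

For a rare target such as CARAVAN, these measures show how well a model ranks likely buyers near the top of the list, which overall accuracy alone does not reveal.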
10. Fig 6: Second Level Models Gain and Lift Charts

    ##### 3.11 Constructing third level models by adding new interaction variables

    The third level model building step differs from the second level in terms of data fields: the two new interaction variables, Derive 1 and Derive 2, were added. No additional balancing was done.

    ##### 3.12 Running the third level models with Test dataset

    The trained third level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. The Neural Network model gave the best accuracy rate, 90.1%.

    Fig 9: Third Level Models Analysis Node Results
11. Fig 10: Third Level Models Gain and Lift Charts

    ##### 3.13 Final Results Interpretation

    The table below compares the output of the Analysis node from all three levels. There is no marked improvement from one level to the next. It has been inferred that building the model after balancing the training dataset does not produce a better model.

    - Level 1 (base dataset): highest accuracy is generated by both Decision Tree and Logistic Regression.
    - Level 2 (model built with balanced dataset): highest accuracy is generated by Decision Tree.
    - Level 3 (model built with balanced dataset and interaction variables): highest accuracy is generated by Neural Network.

    | Technique | Factor | 1st level | 2nd level | 3rd level |
    |---|---|---|---|---|
    | Logistic Regression | Test dataset accuracy | 94.00% | 87.50% | 87.50% |
    | Decision Tree | Test dataset accuracy | 94.00% | 90.48% | 89.75% |
    | Neural Network | Test dataset accuracy | 92.05% | 90.12% | 90.10% |
    | Combined | Agreement with CARAVAN | 94.52% | 95.11% | 94.80% |

    Table 1: Level Comparison
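The per-level winners stated above can be checked directly against the test accuracies in Table 1. The short sketch below does so; the figures are copied from the table, while the helper name `best_per_level` is hypothetical.

```python
# Test dataset accuracy (%) per technique for levels 1, 2 and 3 (from Table 1).
accuracy = {
    "Logistic Regression": (94.00, 87.50, 87.50),
    "Decision Tree": (94.00, 90.48, 89.75),
    "Neural Network": (92.05, 90.12, 90.10),
}

def best_per_level(table):
    """For each level, return (top accuracy, sorted names of the winners)."""
    n_levels = len(next(iter(table.values())))
    result = []
    for level in range(n_levels):
        top = max(values[level] for values in table.values())
        winners = sorted(name for name, values in table.items()
                         if values[level] == top)
        result.append((top, winners))
    return result

winners_by_level = best_per_level(accuracy)
```

The result confirms the tie between Decision Tree and Logistic Regression at level 1, and shows that every technique scores lower at levels 2 and 3 than at level 1.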