Caravan insurance data mining statistical analysis
K6255 – Knowledge Discovery and Data Mining
Statistical Analysis of Caravan Insurance using IBM SPSS
Muthu Kumaar Thangavelu (G1101765E)
Muthu1@e.ntu.edu.sg

1. INTRODUCTION:
The data set contains information on customers of an insurance company, including product usage data and socio-demographic data derived from zip area codes, supplied by the Dutch data mining company Sentient Machine Research. Our aim is to identify the group of customers who are likely to be interested in buying caravan insurance and to build a predictive model from the given 86 variables, which represent the socio-demographic, education, insurance interest and income characteristics of customers.

2. STATISTICAL ANALYSIS
2.1. DATA PREPARATION:
2.1.1. ANALYZING AND CATEGORIZING THE VARIABLES:
We extract and analyze the raw variables with their labels and categorize them based on an understanding of the insurance product and its buyers. We group the broad range of 86 variables into significant predictors as below.

CUST_SUB_LIFESTYLE_REFLECTION:
The customer sub type variable MOSTYPE has 41 value types, which can be categorised under two broad classes relating to age, social class, lifestyle and attitude towards investing or spending, as follows:
- Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors (8, 9, 12, 13, 23, 25, 36, 2, 3, 4, 5, 15, and 27)
- Distributed age and social class, low risk cultured conservative investors (1, 6, 7, 10, 11, 14, 16, 17, 18, 19, 20, 21, 22, 24, 26, 28, 29, 30, 31, 32, 33, 34, 35, 37, 38, 39, 40, 41)

CUST_LEVEL_LIFECYCLE:
The average age variable MGEMLEEF holds 6 types of values, which can be categorised into three groups based on family status and age.
- Young, family starters (1)
- Middle aged family men (2, 3, and 4)
- Senior, family men (5, 6)

CUST_MAIN_SPEND_INVEST_ATTITUDE:
Customer main type MOSHOOFD can be classified into two groups based on the attitude of customers towards buying / spending.
- Liberals (1, 2, 5, 6)
- Conservatives (3, 4, 7, 8, 9, 10)

CUST_MARITAL_STAT:
MRELGE, MRELSA, MRELOV and MFALLEEN describe the relationship status of a person, which can be combined into two categories signifying marital status.
- Married (MRELGE)
- Unmarried (MRELSA, MRELOV, MFALLEEN)

CUST_WORK_CATEGORY_PROFILE:
Variables 19 – 24 describe the work category profile of a person, which can be of 2 types.
- Potential income generating high profile work category (MBERHOOG, MBERZELF, MBERMIDD)
- Relatively less potential income generating low profile work category (MBERBOER, MBERARBG, MBERARBO)

CUST_INCOME_LEVEL:
Variables 37 to 41 represent the income of a person, which can be grouped into three classes.
- Low (MINKM30)
- Middle (MINK3045, MINK4575)
- High (MINK7512, MINK123M)
These can be best represented by a standalone factor depicting the average income (MINKGEM).

CUST_INSURANCE_INTEREST:
Variables 44 to 85, together with 35 and 36, describe the interest of customers in various insurance policies. These range from much needed policies (life, health, disability, family/private accidents), through optimal policies (property; small automobiles of individuals, especially where replacing damaged parts costs as much as getting a new vehicle; company delivery vehicles operated by third party drivers; industrial machines), to the most sophisticated policies offering luxury and high safety, such as private third party insurance, where the insurer pays off the third party even if the insured is at fault. Car, fire and social security insurance also represent forms of luxury or high sophistication.
Hence the classification for both the number and the contribution of policies held by different customers is:
- Individuals opting for sophisticated, high safety insurance policies (WAPART, PERSAUT, BRAND, BYSTAND)
- Firms/individuals opting for much needed and optimal safety insurance policies (all others)
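The groupings in section 2.1.1 amount to simple lookup recodes. The following is a minimal sketch of three of them in Python, not the SPSS recode actually used in this report. The code lists are copied from the text above, and the 1/0 coding of the sub type follows the legend of Fig 1.2; the function names and the string labels are illustrative assumptions.

```python
# Code lists copied from section 2.1.1; function names and labels are hypothetical.
HIGH_RISK_LIBERAL_SUBTYPES = {2, 3, 4, 5, 8, 9, 12, 13, 15, 23, 25, 27, 36}
LIBERAL_MAIN_TYPES = {1, 2, 5, 6}


def recode_sub_lifestyle(mostype: int) -> int:
    """CUST_SUB_LIFESTYLE_REFLECTION: 1 = high risk liberal investors, 0 = all other sub types."""
    if not 1 <= mostype <= 41:
        raise ValueError(f"MOSTYPE out of range: {mostype}")
    return 1 if mostype in HIGH_RISK_LIBERAL_SUBTYPES else 0


def recode_lifecycle(mgemleef: int) -> str:
    """CUST_LEVEL_LIFECYCLE: three groups built from the 6 average age codes."""
    if mgemleef == 1:
        return "young"
    if mgemleef in (2, 3, 4):
        return "middle_aged"
    if mgemleef in (5, 6):
        return "senior"
    raise ValueError(f"MGEMLEEF out of range: {mgemleef}")


def recode_main_attitude(moshoofd: int) -> str:
    """CUST_MAIN_SPEND_INVEST_ATTITUDE: liberal vs conservative main types."""
    if not 1 <= moshoofd <= 10:
        raise ValueError(f"MOSHOOFD out of range: {moshoofd}")
    return "liberal" if moshoofd in LIBERAL_MAIN_TYPES else "conservative"


if __name__ == "__main__":
    # One made-up customer record, for illustration only.
    print(recode_sub_lifestyle(8), recode_lifecycle(3), recode_main_attitude(7))
```

Keeping each recode as a small lookup makes the grouping decisions easy to audit against the lists in the text.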

2.1.2. MAPPING TARGET VARIABLES AS PREDICTORS OF CARAVAN INSURANCE BUYERS:
These predictions have been made from the descriptive statistics of the data set together with real world logical themes (Appendix-1).

FACTOR 1: AGE
Middle aged people are more likely to get caravan insurance.

FACTOR 2: ATTITUDE TOWARDS SPENDING / BUYING
People with a liberal attitude, as indicated by the Customer Main type, are more likely to get caravan insurance.

FACTOR 3: SOCIAL LIFE STYLE REFLECTOR
People who are modern, professional, middle and upper class and liberal investors of their income, as indicated by the Customer Sub type, are likely to get caravan insurance.

FACTOR 4: MARITAL STATUS
Married family men are more likely to buy caravan insurance.

FACTOR 5: WORK CATEGORY PROFILE
People in the potential income generating, high profile work category are more likely to get the insurance.

FACTOR 6: INCOME LEVEL
Average, middle scale income earners are more likely to get caravan insurance. Here the variable MINKGEM acts as a standalone factor representing the average income of a person.

FACTOR 7: INSURANCE INTEREST
Individuals opting for highly sophisticated, high safety insurance policies are more likely to buy caravan insurance.

FACTOR 8: PURCHASING POWER CLASS
Individuals who can afford high cost products are more likely to buy, as caravan insurance is not a need but a luxury aimed at average and high income earners.

FACTOR 9: RENTED HOME RESIDENTS
Residents of rented homes might own a house in their native place, might have settled elsewhere in a rented home for work and family convenience, or might not have enough savings to invest in a home. All these individuals are more likely to be interested in caravan insurance, as they are in need of a local asset.

FACTOR 10: CAR OWNERSHIP
Owning a car signifies buying power, an average income and an interest in cars and driving, so car owners may be interested in buying a caravan and its insurance scheme. People who own more than one car are unusual and are likely to be car enthusiasts who favour the best quality and fashionable new models; caravans are unlikely to suit their needs.

2.2. DATA TRANSFORMATION
2.2.1. INDEPENDENCE OF DEPENDENT VARIABLES WITH RESPECT TO PREDICTION PARAMETERS:
The CUSTOMER SUB TYPE (MOSTYPE) variable represents a combination of the age factor, spending/buying attitude and social lifestyle. Hence it can be used as a standalone factor for predicting the potential buyers.
MARRIED PEOPLE are represented by MRELGE, and the rest of the variables describing relationship status can be ignored.

2.2.2. INTERACTION VARIABLES DEFINITION FOR INDEPENDENT REPRESENTATION OF A COMBINATION:
PURCHASING POWER CLASS * AVERAGE INCOME
Work category, income level and purchasing power class can be combined: average income earners with a high profile work category who belong to the purchasing power class are represented by the interaction of the independent variables Average Income and Purchasing Power Class.

PWAPART * PBRAND * PBYSTAND * PPERSAUT
People who are already interested in buying sophisticated insurance policies are most likely to choose caravan insurance. The interaction (cross product) of the contributions to fire, third party, social security and car insurance represents a high probability of getting caravan insurance.

2.2.3. DIMENSION REDUCTION BY PRINCIPAL COMPONENT ANALYSIS:
Almost all variables used in the final model are largely independent of one another, each predicting a different aspect of caravan insurance buying behaviour.
CUST_SUB_LIFESTYLE_REFLECTION – Social Lifestyle and Attitude towards Spending/Investing
MRELGE – Marital Status
MAUT1 – Single Car Owner
MHHUUR – Rented Home Resident
PBRAND, PPERSAUT, PBYSTAND, PWAPART – Contribution towards different sophisticated and high safety insurance policies.

The two factors with significant correlation are MINKGEM and MKOOPKLA, where a large overlap in the population is logically expected: the potential purchasing class should have a high or middle scaled average income, which makes up most of the MINKGEM variable. So these two dimensions can be reduced to a single component, which keeps the predictors in the model close to independent (orthogonal).
Factor Analysis was carried out, and the extracted component was rotated and coded as a regression variable in the data set.
This new variable, PURCHASING_POWER_CLASS_INK, represents the reduced component of MINKGEM and MKOOPKLA obtained through PCA.
The factor analysis results are attached in Appendix-3.

2.3. DATA ANALYSIS:
2.3.1. APPLYING LOGISTIC REGRESSION: (WITHOUT INCLUDING THE VARIABLE REDUCED BY PCA)
2.3.1.1. CHOSEN VARIABLES REPRESENTING INDEPENDENT FACTORS TO PREDICT THE CARAVAN INSURANCE BUYERS:
The predictor variables are entered in 2 blocks of covariates for the dependent variable, CARAVAN (0 - customer will not buy, 1 - customer will buy).

BLOCK 1:
CUST_SUB_LIFESTYLE_REFLECTION (Social Life Style Reflector)
MRELGE (Marital Status)
MAUT1 (Car Ownership factor – Single Car indicating potential income generation)
MHHUUR (Rented Home Residents – Potential Earning Factor)

BLOCK 2: (INTERACTION VARIABLES)
PBRAND, PBYSTAND, PPERSAUT, PWAPART (Customer Insurance Interest factor on sophisticated and high safety policies)
MKOOPKLA, MINKGEM (Purchasing Power Class with Average Income Level factor)

Method: FORWARD LR
Cut Off Value: 0.5
Probability Entry Criteria: 0.05
Probability Exit Criteria: 0.10

2.3.1.2. CHOOSING THE CATEGORICAL VARIABLES:
Variables that internally represent categories of users are to be marked as categorical in a logistic regression. In our case:
The contributions to various insurance policies (PWAPART, PPERSAUT, PBRAND, and PBYSTAND) represent internal categories such as high, average and low. They are not evenly distributed across their base value types, as seen in Fig 1.3, 1.4, 1.5 and 1.6, and hence they can be indicated as categorical.
The Customer Sub type (CUST_SUB_LIFESTYLE_REFLECTION) represents two main categories: Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors; and Distributed age and social class, low risk cultured conservative investors. These values are not evenly spread, as seen in Fig 1.2, so the variable can be treated as categorical.
All other variables are treated as continuous, since each holds values for the single category it stands for: MAUT1 (Owning a Single Car), MRELGE (Married), MHHUUR (Rented Home Residents), MINKGEM (Average Income), MKOOPKLA (Purchasing Power Class).
The forward selection completed in two steps in block 2, and the prediction model was generated.
The model summary and predictor equation are described in Appendix-2.

2.3.1.3. GENERATED EQUATION BY LOGISTIC REGRESSION FOR PREDICTING POTENTIAL CARAVAN INSURANCE BUYERS:
0.073 (MAUT1) + 0.069 (MRELGE) - 0.018 (MHHUUR) - 0.376 (CUST_SUB_LIFESTYLE_REFLECTION(1)) + 0.016 (MINKGEM by MKOOPKLA) + (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.924
The Nagelkerke R square of the model is 0.193 (19.3%). This is a pseudo R square measure of how much of the variation the model explains, not a classification accuracy.

2.3.2. APPLYING LOGISTIC REGRESSION: (WITH THE VARIABLE REDUCED BY PCA)
With the new component extracted through PCA, PURCHASING_POWER_CLASS_INK, we can apply logistic regression along with the other variables.
The forward selection completed in the first step.
The predictor model is almost the same as the one above without the PCA component and is given by the equation
0.093 (MAUT1) + 0.069 (MRELGE) - 0.024 (MHHUUR) - 0.345 (CUST_SUB_LIFESTYLE_REFLECTION(1)) + 0.237 (PURCHASING_POWER_CLASS_INK) + (PBRAND by PBYSTAND by PPERSAUT by PWAPART) - 2.336
This model has a comparable fit, with a Nagelkerke R square of 0.192 (19.2%).
The model summary and predictor equation are described in Appendix-4.
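
To show how the equation of section 2.3.2 would be used for scoring, here is a minimal Python sketch. It assumes, as is standard for binary logistic regression, that the linear expression is the log-odds of CARAVAN = 1, so the probability is obtained with the logistic function. The coefficients are copied from the equation above. The interaction term has no single coefficient in the text (it depends on the level combination, see Appendix-4), so it is passed in as a parameter. The function names and the example customer are hypothetical.

```python
import math

# Coefficients copied from the section 2.3.2 equation (model with the PCA component).
COEFFS = {
    "MAUT1": 0.093,
    "MRELGE": 0.069,
    "MHHUUR": -0.024,
    "CUST_SUB_LIFESTYLE_REFLECTION_1": -0.345,  # indicator term from the equation
    "PURCHASING_POWER_CLASS_INK": 0.237,
}
CONSTANT = -2.336


def linear_predictor(values, interaction_coef=0.0):
    """Log-odds: constant + sum(coef * value) + coefficient of the interaction level."""
    z = CONSTANT + interaction_coef
    for name, coef in COEFFS.items():
        z += coef * values.get(name, 0.0)  # variables not supplied count as 0
    return z


def buy_probability(values, interaction_coef=0.0):
    """Logistic transform of the log-odds into a probability of buying."""
    z = linear_predictor(values, interaction_coef)
    return 1.0 / (1.0 + math.exp(-z))


def predicted_class(values, interaction_coef=0.0, cutoff=0.5):
    """Apply the cut off value of 0.5 stated in section 2.3.1.1."""
    return 1 if buy_probability(values, interaction_coef) >= cutoff else 0


if __name__ == "__main__":
    # Hypothetical customer; the values are made up for illustration only.
    customer = {"MAUT1": 6, "MRELGE": 7, "MHHUUR": 2,
                "CUST_SUB_LIFESTYLE_REFLECTION_1": 1,
                "PURCHASING_POWER_CLASS_INK": 1.0}
    # -1.467 is the Appendix-4 coefficient for the (1)(1)(1)(1) interaction level.
    p = buy_probability(customer, interaction_coef=-1.467)
    print(f"probability = {p:.3f}, class = {predicted_class(customer, -1.467)}")
```

With all predictors at zero and no interaction term, the probability reduces to the logistic transform of the constant alone, which is a quick sanity check on the sign conventions.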

3. MODEL INSIGHTS AND CONCLUSION:
The initial variables were studied and classified so that they reflect socio-demographic, education, lifestyle, income, car and insurance interest properties relevant to the product type. The variables predicted to be significant on logical grounds were then analyzed using descriptive statistics of the target variable in the data set with IBM SPSS. Dimension reduction, variable recoding and the definition of interaction variables were carried out to obtain accurate and independent predictors. Logistic regression then gives the required predictor model.
The model should be broad in its predictions, with appropriate real world reasons behind the categorizing and recoding of variables, so that it holds good for most possible cases and avoids OVERFITTING.

Appendix -1
DESCRIPTIVE STATISTICS – CROSS TAB RESULTS
Fig 1.0. Rental Home Residents Caravan Insurance Buying Pattern
Fig 1.1. Purchasing Power Class Caravan Insurance Buying Pattern
Fig 1.2. Social Lifestyle based Caravan Insurance Buying Pattern (RECODED VARIABLE)
1 - Middle and Upper Class, middle aged and senior citizens, high risk cultured liberal investors
0 - Distributed age and social class, low risk cultured conservative investors
Fig 1.3. Third Party Insurance Buyers and Caravan Insurance Buyers
Fig 1.4. Car Insurance Buyers and Caravan Insurance Buyers
Fig 1.5. Fire Insurance Contribution and Caravan Insurance Interest
Fig 1.6. Social Security Insurance Vs Caravan Insurance Buyers

Appendix -2: (Logistic Regression Summary and Last Convergence Results without PCA Component)

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2220.272a           .069                   .189
2      2210.325a           .070                   .193
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Converged Predictors and corresponding Coefficients in binary logistic regression (BLOCK 2 - Second Step)
Variables in the Equation
Columns: B, S.E., Wald, df, Sig., Exp(B)
. . .
The Cross Product continuing up to (4x4 combinations)
a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART.
b. Variable(s) entered on step 2: MINKGEM * MKOOPKLA.

Appendix -3: (Logistic Regression with reduced component with PCA)
Initial Components (Average Income and Purchasing Power Class) Vs Principal Component Extracted
PRINCIPAL COMPONENT ANALYSIS:
FACTOR ANALYSIS:

Correlation Matrix
                              MINKGEM   MKOOPKLA
Correlation       MINKGEM     1.000     .452
                  MKOOPKLA    .452      1.000
Sig. (1-tailed)   MINKGEM               .000
                  MKOOPKLA    .000
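
As a consistency check on the PCA figures in this appendix: for two standardized variables with correlation r > 0, the first principal component of the correlation matrix has eigenvalue 1 + r, and both variables load on it with sqrt((1 + r) / 2). The short Python sketch below applies this to the reported correlation of .452. The function name is an illustrative assumption, and this is a hand check, not the SPSS procedure itself.

```python
import math


def first_component_two_vars(r):
    """First principal component of the 2x2 correlation matrix [[1, r], [r, 1]], for r > 0.

    Returns (eigenvalue, loading, communality, residual), where residual is the
    observed correlation minus the correlation reproduced by one component.
    """
    eigenvalue = 1.0 + r
    # Unit eigenvector is (1/sqrt(2), 1/sqrt(2)); loadings scale it by sqrt(eigenvalue).
    loading = math.sqrt(eigenvalue / 2.0)
    communality = loading ** 2          # variance of each variable explained
    reproduced = loading * loading      # reproduced off-diagonal correlation
    residual = r - reproduced
    return eigenvalue, loading, communality, residual


if __name__ == "__main__":
    eigenvalue, loading, communality, residual = first_component_two_vars(0.452)
    print(f"eigenvalue={eigenvalue:.3f} loading={loading:.3f} "
          f"communality={communality:.3f} residual={residual:.3f}")
```

For r = .452 this gives a loading of about .852, a communality of .726 and a residual of -.274, which agrees with the component matrix and reproduced correlations reported in this appendix.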

After Principal Component Analysis

Component Matrix(a)
             Component 1
MINKGEM      .852
MKOOPKLA     .852
Extraction Method: Principal Component Analysis.
a. 1 components extracted.

Reproduced Correlations
                                     MINKGEM   MKOOPKLA
Reproduced Correlation   MINKGEM     .726a     .726
                         MKOOPKLA    .726      .726a
Residual(b)              MINKGEM               -.274
                         MKOOPKLA    -.274
Extraction Method: Principal Component Analysis.
a. Reproduced communalities
b. Residuals are computed between observed and reproduced correlations. There are 1 (100.0%) nonredundant residuals with absolute values greater than 0.05.

APPENDIX -4:
After PCA with the Reduced Component – Binary Logistic Regression with other predictor variables

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2213.728a           .070                   .192
a. Estimation terminated at iteration number 20 because maximum iterations has been reached. Final solution cannot be found.

Variables in the Equation (Step 1a)
Variable                                                  B         S.E.        Wald      df   Sig.    Exp(B)
CUST_SUB_LIFESTYLE_REFLECTION(1)                          -.345     .124        7.778     1    .005    .709
PURCHASING_POWER_CLASS_INK                                .237      .068        12.009    1    .001    1.268
MHHUUR                                                    -.024     .024        1.049     1    .306    .976
MAUT1                                                     .093      .040        5.315     1    .021    1.098
PBRAND * PBYSTAND * PPERSAUT * PWAPART                                          207.422   112  .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)     -1.467    .779        3.549     1    .060    .231
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(2)     -18.885   7541.184    .000      1    .998    .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)     -1.627    .960        2.874     1    .090    .197
PBRAND(1) by PBYSTAND(1) by PPERSAUT(2) by PWAPART(1)     -19.134   40192.970   .000      1    1.000   .000
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(1)     -3.743    1.257       8.862     1    .003    .024
PBRAND(1) by PBYSTAND(1) by PPERSAUT(3) by PWAPART(3)     -.218     1.065       .042      1    .838    .804
. . .
PBRAND(7) by PBYSTAND(4) by PPERSAUT(4) by PWAPART(1)     -19.341   23141.295   .000      1    .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(1)     -18.797   28317.506   .000      1    .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(1) by PWAPART(3)     -19.114   40192.970   .000      1    1.000   .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(1)     -19.252   28290.099   .000      1    .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(4) by PWAPART(3)     -18.921   28301.176   .000      1    .999    .000
PBRAND(8) by PBYSTAND(1) by PPERSAUT(5) by PWAPART(3)     -19.476   40192.970   .000      1    1.000   .000
Constant                                                  -2.336    .812        8.271     1    .004    .097
a. Variable(s) entered on step 1: PBRAND * PBYSTAND * PPERSAUT * PWAPART.
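
The Exp(B) column in the table above is the exponential of the B column and can be read as an odds ratio: the factor by which the odds of buying change for a one unit increase in the predictor. The Python sketch below recomputes Exp(B) from the rounded B values for a few rows copied from the table. The dictionary and function names are illustrative assumptions, and small differences in the third decimal are expected because SPSS works from unrounded coefficients.

```python
import math

# (B, reported Exp(B)) pairs copied from the Appendix-4 table above.
ROWS = {
    "CUST_SUB_LIFESTYLE_REFLECTION(1)": (-0.345, 0.709),
    "PURCHASING_POWER_CLASS_INK": (0.237, 1.268),
    "MHHUUR": (-0.024, 0.976),
    "MAUT1": (0.093, 1.098),
    "Constant": (-2.336, 0.097),
}


def odds_ratio(b):
    """Exp(B): multiplicative change in the odds per unit increase of the predictor."""
    return math.exp(b)


if __name__ == "__main__":
    for name, (b, reported) in ROWS.items():
        print(f"{name:<34} B={b:>7.3f}  exp(B)={odds_ratio(b):.3f}  reported={reported:.3f}")
```

Read this way, a value above 1 (for example PURCHASING_POWER_CLASS_INK) raises the odds of buying, while a value below 1 (for example CUST_SUB_LIFESTYLE_REFLECTION(1)) lowers them.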