Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data Mining Assignment Final

240 views

Published on

  • Be the first to comment

  • Be the first to like this

Data Mining Assignment Final

  1. 1. Data Mining Coursework B064536 Introduction Variables Ranking by Likelihood to Redeem Total Coupon Redeemed Total Participants in Category (Total Coupon Redeemed/Total Participant)*100 Age 35-44 years old 587 120565 0.49% 45-54 years old 761 169188 0.45% 25-34 years old 351 83584 0.42% Marital Status Married 1010 203642 0.50% Unknown 743 203455 0.37% Single 238 71354 0.33% Income 35-99K 1275 273663 0.47% 100K+ 320 79425 0.40% Under 34K 396 125363 0.32% Homeowner Description Unknown 304 5232 5.81% Renter 64 6977 0.92% Homeowner 1529 298685 0.51% Household Composition 2 Adults With Kids 608 108905 0.56% 1 Adult With Kids 141 28276 0.50% 2 Adults No Kids 655 153491 0.43% Ranking by Campaign Effectiveness Coupon Redeemed Total Coupon sent Percentage B 247 25872 0.95% A 1724 444895 0.39% C 20 7684 0.26%  People who redeem coupons are not from your main customer base  Largest campaign, Type A is not effective Data from the demographic, coupon and campaign datasets were merged together. A total of 478451 valid samples were generated. 50% of the samples were used for model training purpose. 50% of the samples were used for testing purpose. The dataset was imbalance as there were only 1991 redeemed coupons out of 478451 valid samples. Balancing node in SPSS Modeler was used to correct imbalance in dataset. The proportion of redeemed coupons was increased by a factor of 245 for model training and testing. The model was build using CHAID and the set of input variables that provide the highest accuracy percentage were included in the decision tree. Different combinations of input variables were tested during the working process and their accuracy readings were recorded and compared. CHAID is a classification method that uses Chi-square statistics to perform optimal splits for decision tree building. CHAID will firstly conduct a detailed analysis of the input variables and dependent variable. Next, it will test for significance using Chi-square test of independence. CHAID will select an input field that has the smallest p-value (most significant). If a variable has more than two categories, these are weighed up, and categories that show least difference in outcome are grouped together. This merging process stops when all remaining categories are significantly different from each other. CHAID was identified as the most accurate classification model for this dataset by auto classifier node. A manual comparison amongst CHAID, CART and C5.0 were also conducted. Result ascertains the analysis of auto classifier node. When deciding who to send coupon to, retailer should consider these factors in following sequence: 1. Homeowner description 2. Income description 3. Household composition 4. Household size 5. Marital status According to CHAID classification, retailer should send coupons to: 1. People who are homeowners, probable owners and probable renters 2. People’s income who are 35-99K 3. Household composition with 2 adults with and without kids 4. A household with 1 to 3 person 5. Married The model developed has an accuracy of 63.14%. This means that the model is able to correctly classify (predict) those who would redeem coupon and those who would not redeem coupon with an accuracy of 63.14%. 5.80% 17.47% 25.20% 35.36% 7.96% 8.21% 19-24 25-34 35-44 45-54 55-64 65+ Age Married 43% Single 15% Unknown 42% Marital Status Under 34K 26% 35-99K 57% 100K+ 17% Income 62.43% 29.32% 5.70% 1.46% 1.09% Homeowner Unknown Renter Probable Owner Probable Renter Homeowner Description 32.08% 22.76% 18.04% 11.32% 5.91% 9.88% 2 Adults No Kids 2 Adults With Kids Single Female Single Male 1 Adult With Kids Unknown Household Composition The aim of this project is to generate some meaningful insights for the retailer using exploratory data analysis (SPSS Modeler). It also aim to build a model that can accurately predict which customers are more likely to redeem coupons and those that would not. Who are your customers?  Age 45-54 Years Old  Married  35-99K  Homeowner  2 adults No Kids Who are redeeming coupons right now? Data Manipulation Chi-squared Automatic Interaction Detection Results Discussion and Recommendations There is one contradiction between data analysed and CHAID prediction. Data analysed revealed that unknown and renter are more likely to redeem coupon. However, CHAID predict that homeowners, probable owners and probable renters are more likely to redeem coupons. The other factors such as marital status, income and household composition are consistent between data analysed and CHAID. Both indicate that people who are married with income between 35-99K and household with 2 adults with and without kids are more likely to redeem coupons. One recommendation is to evaluate and re-design type A campaign as it is not effective. Secondly, campaign should be more targeted to lower cost. Target audience should be its main customer base as CHAID’s prediction coincides with the characteristics of its main customer base.

×