Intelligent Shopping Recommender using Data Mining
Predictive Analysis Major Assignment #3
MBA 590 – Supermarket Organic Product Analysis
Final Report
Prepared by:
ARTHUR DOUCETTE
RYAN SULIER
SUJIT SRIVASTAVA
3. Organic Product Consumption Analysis
TABLE OF CONTENTS
EXECUTIVE SUMMARY
DATA IMPUTATION
DATA MODELING PROCESS
PREDICTIVE ANALYSIS MODELS
JMP GRAPHS FOR ANALYZING PROFITABILITY OF THE CUSTOMERS:
BASIC TREE
LEAFSIZE-50
INTERACTIVE TREE
MODEL COMPARISON
CONCLUSION
APPENDIX
APPENDIX A
APPENDIX B
APPENDIX C (I)
APPENDIX C (II)
APPENDIX C (III)
Executive Summary
The purpose of this report is to explore the given business scenario, determine how likely customers
are to purchase organic products at a supermarket, and build a predictive model that classifies
customers according to their likelihood to purchase these products.
In our analysis, we also compare the profitability of customers who purchase organic products from
the supermarket with that of customers who do not purchase these products.
Data Imputation
To make the predictions more accurate, we used the Data Partition node in SAS Miner to divide the
data into 30% test data and 70% validation data. We used the validation data as the benchmark for
evaluating the results. The supermarket data set for organic products had over 22,000 observations,
with many missing values across 13 variables. We imputed the missing values using the most common
data imputation techniques, such as Tree and Mean. The summary of missing values and how the data
was imputed is given in Appendix A.
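As a rough illustration of the two preparation steps described above, the Python sketch below fills missing values and then splits the records 70/30. It is not the SAS Miner workflow used in this report: the records, the column names, the random seed, and the use of the mode for categorical values are all assumptions made for the example, and the order of the two steps is chosen only to keep the sketch short.

```python
import random
from statistics import mean, mode

def partition(rows, validation_share=0.70, seed=42):
    """Shuffle rows and split them into (validation, test) by the given share."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * validation_share)
    return shuffled[:cut], shuffled[cut:]

def impute(rows, column):
    """Fill missing values (None) in one column: mean for numbers, mode otherwise."""
    present = [r[column] for r in rows if r[column] is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [dict(r, **{column: fill}) if r[column] is None else r for r in rows]

# Hypothetical records; the column names are illustrative only.
customers = [
    {"age": 34, "gender": "F"},
    {"age": None, "gender": "F"},
    {"age": 50, "gender": None},
    {"age": 42, "gender": "M"},
]
imputed = impute(impute(customers, "age"), "gender")

# Repeat the four records to get 20 rows, then split them 70/30.
validation, test = partition(imputed * 5)
print(len(validation), len(test))
```

Mean imputation keeps the column average unchanged but shrinks its variance, which is one reason a tree-based imputation may be preferred for variables with many missing values.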
Data Modeling Process
This report analyzes the behavior of customers regarding the organic products of the supermarket
in question. It helps answer the following questions:
- How can we characterize the “profitability” of the customers who purchased organic products
vs those who didn’t purchase organic products? Do they spend similar amounts, or does
there appear to be a significant difference? Do customers who purchase organic products
spend more at your store in general than customers who don’t purchase organic products (or
vice versa)?
- Continuing along a similar path, are there any noticeable differences in the percentage of
customers who purchase organic products across the different loyalty status groups (for
example, is the percentage of platinum customers who purchase organic products higher than
the percentage of tin customers who purchase organic products)? What about the
profitability of the customers in the different loyalty groups?
- What factors seem to have the most impact on a customer’s likelihood to purchase organic
products? (include any relevant statistical output to support your answer) Based on your
model, how would you describe the “typical” organic products customer?
To better analyze the observations, we used a combination of logistic regression and decision
trees.
The analysis methods we applied are listed below:
- JMP graphs for analyzing profitability of the customers
- SAS Miner for generating the following prediction models
o Forward
o Backward
o Stepwise
o Basic Tree
o LeafSize-50 Tree
o Interactive Tree
Predictive Analysis Models
We used both JMP and SAS Miner in our analysis so that each question could be examined with the
better-suited tool. These analysis methods are explained in detail below:
JMP GRAPHS FOR ANALYZING PROFITABILITY OF THE CUSTOMERS:
We used JMP Graph Builder to analyze the profitability of the customers. This analysis answers the
following question:
How can we characterize the “profitability” of the customers who purchased organic products vs those
who didn’t purchase organic products? Do they spend similar amounts, or does there appear to be a
significant difference? Do customers who purchase organic products spend more at your store in
general than customers who don’t purchase organic products (or vice versa)?
After loading the data set into JMP, we produced the following box plot with Graph Builder:
The box plot shows that the total amount customers spend does not differ a great deal whether or
not they buy organic products. In other words, there is very little difference in profitability
between the two groups.
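A box plot compares groups through their five-number summaries (minimum, first quartile, median, third quartile, maximum). The Python sketch below shows how such a comparison could be computed outside JMP. The spend figures and group labels are hypothetical and are not taken from the supermarket data; JMP may also use a different quantile convention than the "inclusive" method assumed here.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a box plot draws."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median(values), q3, max(values)

# Hypothetical total-spend figures; not taken from the supermarket data.
spend = {
    "organic buyers": [1200, 3500, 5000, 6500, 9000],
    "non-buyers": [1000, 3400, 5100, 6400, 9500],
}
summaries = {group: five_number_summary(v) for group, v in spend.items()}
for group, summary in summaries.items():
    print(group, summary)
```

When the medians and quartiles of the two groups nearly coincide, as in this made-up example, the boxes overlap almost completely, which is the pattern the report describes.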
Using JMP Graph Builder, we can also answer the following question:
Continuing along a similar path, are there any noticeable differences in the percentage of
customers who purchase organic products across the different loyalty status groups (for
example, is the percentage of platinum customers who purchase organic products higher than
the percentage of tin customers who purchase organic products)? What about the profitability
of the customers in the different loyalty groups?
To compare the customers in the different loyalty groups, we used the Fit Y by X tool in JMP.
This tool produced the following result, which breaks down the categorical data by the customer’s
loyalty class:
The graph above shows that Tin-class customers buy organic products at a somewhat higher rate than
Silver, Gold, or Platinum customers, but the difference is not large. Overall, the share of organic
buyers is similar across the loyalty classes, with the Tin class slightly ahead. One possible
explanation is that these customers were taking advantage of coupons that were offered to them.
Other output from this tool is given in Appendix B.
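The comparison across loyalty classes amounts to a contingency table of loyalty class against the organic-purchase flag. The Python sketch below shows one way to compute the purchase rate per class. The records are hypothetical and chosen only to illustrate the calculation; they do not reproduce the percentages in the JMP output.

```python
from collections import Counter

def purchase_rate_by_class(records):
    """Share of customers in each loyalty class who bought organic products."""
    totals, buyers = Counter(), Counter()
    for loyalty_class, bought_organic in records:
        totals[loyalty_class] += 1
        buyers[loyalty_class] += int(bought_organic)
    return {c: buyers[c] / totals[c] for c in totals}

# Hypothetical (loyalty class, bought organic?) pairs for illustration only.
records = [
    ("Tin", True), ("Tin", False), ("Tin", True), ("Tin", False),
    ("Silver", True), ("Silver", False), ("Silver", False), ("Silver", False),
    ("Gold", False), ("Gold", True),
    ("Platinum", False), ("Platinum", False), ("Platinum", True), ("Platinum", False),
]
rates = purchase_rate_by_class(records)
for loyalty_class, rate in rates.items():
    print(f"{loyalty_class}: {rate:.0%}")
```

Comparing rates rather than raw counts matters here because the loyalty classes can differ greatly in size.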
We used several models to analyze the following question:
What factors seem to have the most impact on a customer’s likelihood to purchase organic products?
(include any relevant statistical output to support your answer) Based on your model, how would you
describe the “typical” organic products customer?
These models are explained below:
BASIC TREE
First, we analyzed the data using a simple Basic Tree predictive model. The results are shown
below:
Other details of this basic tree are given in Appendix C (i).
LEAFSIZE-50
After trying the Basic Tree, we increased the leaf size of the tree and reran the analysis with a
leaf size of 50. The purpose of increasing the leaf size is to have a reasonable number of
observations in each terminal node, so that the prediction made at each node rests on more data.
The result is shown below:
For reference, we have included other important details such as Fit Stats, Treemap, Leaf Stats, and
Score overlay details in Appendix C (ii).
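To show what a minimum leaf size does, the Python sketch below searches for the best single split on one numeric input while rejecting any split that would leave fewer than `min_leaf_size` rows on either side. This is a simplified stand-in, not the SAS Miner tree algorithm: it counts misclassified rows rather than using a statistical splitting criterion, and the ages and purchase flags are hypothetical.

```python
def best_split(points, min_leaf_size):
    """Find the threshold on one numeric input that minimises misclassified
    rows, considering only splits that leave min_leaf_size rows on each side.
    Each point is (value, label) with label 0 or 1."""
    points = sorted(points)
    best = None
    for i in range(min_leaf_size, len(points) - min_leaf_size + 1):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical values
        left, right = points[:i], points[i:]
        # Each leaf predicts its majority label, so its errors are the minority count.
        errors = sum(
            min(sum(y for _, y in side), sum(1 - y for _, y in side))
            for side in (left, right)
        )
        threshold = (points[i - 1][0] + points[i][0]) / 2
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best

# Hypothetical (age, bought organic?) pairs for illustration only.
data = [(22, 1), (25, 1), (31, 1), (38, 0), (45, 1), (52, 0), (60, 0), (67, 0)]
small_leaves = best_split(data, min_leaf_size=1)
large_leaves = best_split(data, min_leaf_size=4)
print(small_leaves, large_leaves)
```

With a larger minimum leaf size, fewer candidate splits are allowed. The tree fits the training rows less closely, but each terminal node summarises more observations.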
INTERACTIVE TREE
Finally, we built an Interactive Tree, the result of which is shown below:
Other details of this interactive tree are listed in Appendix C (iii).
Model Comparison
After analyzing the results from all the predictive models, we compared them using the SAS Model
Comparison tool. The resulting SAS Miner diagram is shown below:
The figure above shows the Data Partition node, which assigns 70% of the data to validation and 30%
to test. The Impute node sits just before the predictive model nodes, because missing values must be
imputed before the modeling process starts. The StatExplore node provides summary charts and
descriptive statistics for the data set, which are useful for analysis that does not involve
regression. Finally, the Model Comparison node compares the models based on misclassification rate.
The results of the Model Comparison node show that the best model in terms of fit statistics was the
LeafSize-50 Tree. Details are as follows:
The fit statistics for the Model Comparison are shown below:
The graph above shows that the selected model, LeafSize-50, outperforms the other two models. It has
the lowest misclassification rate, 0.193 (about 19%).
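The misclassification rate is the share of observations whose predicted class differs from the actual class, and the comparison selects the model with the lowest rate. The Python sketch below illustrates that selection rule. The actual labels and predictions are hypothetical and do not reproduce the 0.193 figure reported above; only the model names come from this report.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual one."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)

def compare_models(actual, predictions_by_model):
    """Return (name of the model with the lowest rate, all rates by model name)."""
    rates = {
        name: misclassification_rate(actual, predicted)
        for name, predicted in predictions_by_model.items()
    }
    return min(rates, key=rates.get), rates

# Hypothetical validation labels and predictions (1 = bought organic products).
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predictions = {
    "Basic Tree":       [1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    "LeafSize-50":      [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    "Interactive Tree": [1, 1, 0, 1, 0, 1, 0, 1, 1, 0],
}
winner, rates = compare_models(actual, predictions)
print(winner, rates)
```

A misclassification rate treats both kinds of error equally, so it is worth reading alongside the class balance: if few customers buy organic products, a model that always predicts "no purchase" can still score a low rate.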
Conclusion
Based on the LeafSize-50 model, we concluded that Age, Affluence Grade, and Gender are the factors
with the most impact on a customer’s likelihood to purchase organic products. The tree also shows
that female customers are more likely to purchase the products.
Our main recommendation to the supermarket is to focus on younger customers, for example by
educating them about organic products and their benefits. Digital marketing could be used to reach
this younger audience, together with incentives for buying organic products. The supermarket should
also develop a strategy for male customers of all ages.