Building and Evaluating a Predictive Model: Supermarket Business Case
BUS5PA – Assignment 1
SIDDHANTH CHAURASIYA, Master of Business Analytics, 19139507
Objective:
To determine, using Decision Tree and Regression modelling, which segments of customers are likely to purchase a new line of organic products that is to be introduced by the supermarket.
1. Setting up the project and exploratory analysis.
1.A.1&2: On the SAS Enterprise Miner workstation, a new project named BUS5PA_Assignment1_19139507 is created, followed by a diagram called Organics. Further, a SAS library is created and the given dataset 'Organics' is selected as the data source for the project. On analysing the dataset, SAS Enterprise Miner found 22223 observations and 13 variables.
The roles of the 13 variables have been set as follows:
Figure 1.A.2 - Roles & Measurement Level of Variables
Variables with a Nominal Measurement Level contain categorical data, while variables with an Interval Measurement Level contain numeric data. The target (response) variable TargetBuy has a Binary Measurement Level, with 1 indicating Yes and 0 indicating No.
1.A.3: Distribution of Target variables [Appendix – Figure 1.A.3 (2)]
Figure 1.A.3 - Summary of Distribution of TargetBuy
1.A.4: DemCluster has been Rejected as DemClusterGroup contains the collapsed data of DemCluster and, based on past evidence, DemClusterGroup is sufficient for the modelling.
1.B: TargetBuy subsumes the data contained in TargetAmt. Utilizing TargetAmt as an input could lead to imprecise modelling or leakage: the model would find a strong correlation between the input (TargetAmt) and the target (TargetBuy), since the target variable contains the collapsed data of TargetAmt. Hence, TargetAmt should not be used as an input and should be set as Rejected.
2. Decision tree based modelling and analysis.
2.A: After dragging the Organics dataset onto the Organics diagram, we connect the Data Partition node to the Organics dataset. 50% of the data is utilized for training while the remaining 50% is used for validation (Appendix – Figure 2.A). The training set is used to build a set of models, while the validation set is utilized to select the best model created from the training set.
Figure 2.A (2) – Adding Data Partition to the Organics data source.
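As a point of comparison outside SAS Enterprise Miner, an equivalent stratified 50/50 partition can be sketched in Python; the file name and use of scikit-learn here are assumptions for illustration, not part of the assignment workflow:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the Organics data (the file name is an assumption).
    organics = pd.read_csv("organics.csv")

    X = organics.drop(columns=["TargetBuy"])
    y = organics["TargetBuy"]

    # 50/50 split, stratified on the binary target so both partitions
    # keep roughly the same proportion of purchasers.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.5, stratify=y, random_state=42)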
2.B: A Decision Tree is then connected to the Data Partition node (Appendix – Figure 2.B).
2.C.1: Based on the subtree assessment plot, the optimal tree contains 29 leaves. This Decision Tree has been created using Average Square Error (ASE) as the subtree assessment measure (Appendix – Figure 2.C). The assessment measure specifies the method used to select the best tree; ASE opts for the subtree that produces the smallest average squared error on the validation data.
Figure 2.C.1 – Optimal Tree based on Average Square error as the Subtree Assessment.
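The ASE measure itself is simple to state; a minimal sketch of the calculation (variable names follow the partition sketch above):

    import numpy as np

    def average_square_error(y_true, p_pred):
        # Square, then average, the gap between the actual outcome (0/1)
        # and the predicted probability.
        y_true = np.asarray(y_true, dtype=float)
        p_pred = np.asarray(p_pred, dtype=float)
        return np.mean((y_true - p_pred) ** 2)

    # Usage: the subtree with the lowest validation ASE is kept.
    # ase = average_square_error(y_valid, tree.predict_proba(X_valid)[:, 1])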
2.C.2: The variable DemAge was used for the first split as it is the variable that ensures the best split in terms of 'purity' (Appendix – Figure 2.C.2).
Based on the logworth of each input variable, the competing candidates for the first split (won by DemAge) in the first decision tree are DemAffl and DemGender. Logworth is -log10 of the p-value of the split's significance test; the higher a variable's logworth, the more homogeneous the subgroups it can create.
Figure 2.C.2 – Logworth of Input Variables
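Logworth can be illustrated for a single candidate binary split. The sketch below assumes the split threshold and column names from the tree above, and uses a chi-square test of the split-by-target table as the significance test:

    import numpy as np
    from scipy.stats import chi2_contingency

    def split_logworth(in_left, y):
        # 2x2 table: split membership (left/right branch) vs target (0/1).
        table = np.array([
            [np.sum((in_left == 1) & (y == 0)), np.sum((in_left == 1) & (y == 1))],
            [np.sum((in_left == 0) & (y == 0)), np.sum((in_left == 0) & (y == 1))],
        ])
        _, p_value, _, _ = chi2_contingency(table)
        return -np.log10(p_value)

    # Usage: logworth of splitting on DemAge < 44.5 (threshold taken from
    # the optimal tree; column names from the Organics data source).
    # lw = split_logworth((X_train["DemAge"] < 44.5).astype(int).to_numpy(),
    #                     y_train.to_numpy())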
2.D.1: The maximum number of branches for the second decision tree has been changed to 3. This means the splitting rules can divide a node's subset into up to 3 branches (Appendix – Figure 2.D.1).
2.D.2: The second Decision Tree has been created using Average Square Error (ASE) as the subtree assessment measure (Appendix – Figure 2.D.2). The assessment measure specifies the method used to select the best tree; ASE opts for the subtree that produces the smallest average squared error.
Figure 2.D.2 (2) – Adding the second Decision Tree node.
2.D.3: The optimal tree for Decision Tree 2 using Average Square Error as the model assessment
statistic contains 33 leaves.
Figure 2.D.3 – Leaves on Optimal Tree based on Average Square Error for Decision Tree 2.
The two Decision Tree models differ in their maximum branch splits (2 vs 3). This results in a divergence in the number of leaves in the respective optimal trees: the first decision tree contains 29 leaves, whereas Decision Tree 2 contains 33 leaves.
The set of rules or classifications set by the first Decision Tree can be summarized as –
 Female customers under the age of 44.5 years, having an Affluence grade of more than 9.5 or missing, are likely to purchase organic products (Nodes 36, 37, 38 & 39, Appendix – Figure 2.D.3).
 Female customers under the age of 39.5 years, having an Affluence grade of less than 9.5 but more than 6.5, or missing, are likely to purchase organic products (Node 32, Appendix – Figure 2.D.3).
The set of rules or classifications set by the second Decision Tree can be summarized as –
 Customers under the age of 39.5 years who have an affluence grade of more than 14.5 are very likely to purchase organic products [Node 7, Appendix – Figure 2.D.3 (2)].
 Customers under the age of 39.5 years having an affluence grade of less than 14.5 but more than 9.5 (or missing) are likely to purchase organic products. However, if such a customer is female, she is 22% more likely to buy organic products than a male customer with the same attributes [Nodes 17 & 18, Appendix – Figure 2.D.3 (2)].
2.E: Average square error squares and then averages the difference between the predicted and actual outcomes at the leaf nodes. The lower the average square error, the better the model, as it indicates the model produces fewer errors. For the Organics dataset, Decision Tree 2 (0.132662) appears to be the marginally better model, because its average square error is minutely less than that of the first Decision Tree (0.132773).
Figure 2.E – Model Comparison between the two Decision Trees.
3. Regression based modelling and analysis.
3.A: The StatExplore tool is attached to the Organics data source (Appendix – Figure 3.A). StatExplore provides a statistical summarization as well as a graphical representation of the variables. Through StatExplore, we also learn the number of missing values in each variable.
Figure 3.A (2): Summary of Input variable via StatExplore.
3.B: Yes, the missing values in the given dataset should be imputed, as regression modelling doesn't accommodate missing values but rather ignores or strips off such observations, thereby losing that data for the modelling. This can lead to the creation of a biased training set.
Imputation for the Decision Tree isn't required, as the Decision Tree accommodates missing values in its modelling by treating them as a possible value with its own branch.
3.C: We add an Impute node to the diagram and connect it to the Data Partition node (Appendix – Figure 3.C). The impute function creates a synthetic value for each missing value. We impute the letter 'U' for missing class variable values and use the variable mean to impute missing interval variable values.
Figure 3.C – Results of Impute.
We then create imputation indicators for all imputed inputs. This creates a new variable indicating whether the value in the original variable has been imputed or not.
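The imputation and indicator step can be sketched outside SAS Enterprise Miner as follows; the column lists are assumptions based on the Organics variable roles:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer

    # Column lists are assumptions based on the Organics metadata
    # (class vs interval measurement levels).
    class_cols = ["DemGender", "DemClusterGroup", "DemReg", "DemTVReg", "PromClass"]
    interval_cols = ["DemAffl", "DemAge", "PromSpend", "PromTime"]

    imputer = ColumnTransformer([
        # Missing class values become the letter 'U'; add_indicator=True
        # appends a binary flag column per input that had missing values.
        ("class", SimpleImputer(strategy="constant", fill_value="U",
                                add_indicator=True), class_cols),
        # Missing interval values are replaced by the variable mean.
        ("interval", SimpleImputer(strategy="mean", add_indicator=True),
         interval_cols),
    ])

    X_train_imp = imputer.fit_transform(X_train)  # learn means on training data
    X_valid_imp = imputer.transform(X_valid)      # apply the same rules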
3.D: We add a Variable Clustering node to the Impute node to group similar variables into clusters (Appendix – Figure 3.D). By virtue of the Impute node's function, new input variables have been created containing the imputed values for the missing values, and new indicator variables have also been created. Variable Clustering enables us to reduce redundancy and ensures better regression-based modelling, as this type of modelling is more suited to fewer input variables.
Each cluster is represented by an individual input variable. We opt for Best Variable as the criterion for selecting the cluster representative. Under the Best Variable method, the variable with the lowest 1 − R² ratio (high R² with its own cluster, low R² with the next closest cluster) is selected as the cluster representative.
As a result, the 52 variables (after imputation and indicator variables) are grouped together to form 24 clusters. The cluster representatives of the 24 clusters will then be used as the inputs for the regression model.
Figure 3.D (2) – Representation of Variable Clustering in various forms.
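SAS's Variable Clustering is based on principal components, so the sketch below is only a rough stand-in for the idea: group correlated inputs, then keep one representative per cluster (the representative criterion here is a simplified version of the 1 − R² rule):

    import numpy as np
    import pandas as pd
    from scipy.cluster.hierarchy import fcluster, linkage

    def cluster_representatives(X_num: pd.DataFrame, n_clusters: int = 24):
        # Group correlated inputs, then keep one representative per cluster.
        corr = X_num.corr().to_numpy()
        dist = 1.0 - corr ** 2                    # similar variables -> small distance
        condensed = dist[np.triu_indices_from(dist, k=1)]
        labels = fcluster(linkage(condensed, method="average"),
                          n_clusters, criterion="maxclust")
        reps = []
        for c in np.unique(labels):
            members = np.where(labels == c)[0]
            # Pick the variable best correlated with its own cluster,
            # i.e. the one with the lowest average 1 - R^2 inside it.
            r2_own = (corr[np.ix_(members, members)] ** 2).mean(axis=1)
            reps.append(X_num.columns[members[np.argmax(r2_own)]])
        return reps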
3.E: We add a Regression node to the Variable Clustering node and change the Selection Model to Stepwise and the Selection Criterion to Validation Error. Under the stepwise method, variables are added to the model one at a time; after each addition, any variable already in the model whose significance falls below the Stay Significance level is removed. Selection stops when no further variables can enter or leave the model, and the final model is chosen at the step that optimizes the selection criterion (validation error in this case).
Figure 3.E – Adding Regression node and changing its properties.
The measurement level of the target variable determines the type of regression model that will be applied. For our given case, a Logistic Regression model will be used, as our target variable is binary.
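A simplified, forward-only sketch of selection driven by validation error, using logistic regression for the binary target (the removal step of true stepwise selection is omitted; all names are assumptions from the earlier sketches):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def forward_select(X_tr, y_tr, X_va, y_va, candidates):
        # Greedily add the variable that most reduces validation ASE;
        # stop when no addition improves it.
        chosen, best_ase = [], np.inf
        improved = True
        while improved:
            improved, best_var = False, None
            for var in (v for v in candidates if v not in chosen):
                model = LogisticRegression(max_iter=1000)
                model.fit(X_tr[chosen + [var]], y_tr)
                p = model.predict_proba(X_va[chosen + [var]])[:, 1]
                ase = np.mean((np.asarray(y_va) - p) ** 2)
                if ase < best_ase:
                    best_ase, best_var, improved = ase, var, True
            if improved:
                chosen.append(best_var)
        return chosen, best_ase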
3.F.1: The variables included in the final model of the Regression modelling are:
 IMP_DemAffl – Imputed Affluence Grade.
 IMP_DemAge – Imputed Age.
 M_DemGender0 – Indicator Gender.
 IMP_DemGenderM – Imputed Gender.
 M_DemAffl0 – Indicator Affluence Grade.
 M_DemAge0 – Indicator Age.
The regression modelling indicates that the above-mentioned variables have a significant degree of association with our target variable. The list of variables is in order of their stepwise addition to the model.
Regression models enable us to estimate the relationship between the input variables and the target variable. Logistic regression lets us express the association between the inputs and the target in terms of odds (purchaser vs non-purchaser).
The supermarket's higher management can interpret the above-mentioned associations as follows:
Figure 3.F.1 – Variables in final Logistic Regression model.
With other variables held constant, a one-unit increase in IMP_DemAffl will result in a 0.2530-unit increase in the log odds of being a purchaser (vs non-purchaser). On the other hand, with other variables constant, a one-unit increase in IMP_DemAge will decrease the log odds by 0.0550. Similarly, with other variables constant, a one-unit increase in IMP_DemGenderM, M_DemAffl0 or M_DemAge0 will decrease the log odds of being a purchaser by 0.7277, 0.3657 and 0.2345 units respectively, while a one-unit increase in M_DemGender0 will increase the log odds by 1.41 units.
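Since logistic regression works on the log-odds scale, each coefficient translates into an odds ratio via exponentiation; a quick check using the coefficients reported above:

    import math

    # Coefficients as reported in Figure 3.F.1.
    coefficients = {
        "IMP_DemAffl":     0.2530,
        "IMP_DemAge":     -0.0550,
        "IMP_DemGenderM": -0.7277,
        "M_DemAffl0":     -0.3657,
        "M_DemAge0":      -0.2345,
        "M_DemGender0":    1.41,
    }

    for name, beta in coefficients.items():
        # exp(beta) is the multiplicative change in the odds of purchase
        # for a one-unit increase in the input, other inputs held constant.
        print(f"{name}: odds ratio = {math.exp(beta):.3f}")

For instance, exp(0.2530) ≈ 1.29, so each additional affluence grade multiplies the odds of purchase by roughly 1.29, other inputs held constant.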
3.F.2: The most important variables influencing the target variable are M_DemGender0, IMP_DemGenderM and M_DemAffl0 respectively. These variables are ranked by the magnitude of their absolute coefficients.
The absolute coefficient specifies how strongly a variable is related to the target variable: the larger the magnitude, the stronger the association, with the sign of the underlying coefficient indicating whether the relationship is positive or negative. An absolute coefficient near zero implies the weakest relation between the input variable and the target variable for the given scenario.
Figure 3.F.2 – Bar-chart and tabular form of most important variables for Regression modelling.
Interpreted together, the bar chart suggests that female customers who are relatively young and have a high affluence grade tend to purchase more organic products.
3.F.3: The validation Average Square Error (ASE) for the Regression model is 0.141805. ASE helps gauge which model is erroneous less often than the others; a lower ASE indicates a better model in terms of its predictive capability. ASE is calculated by squaring and averaging the difference between the predicted and actual outcome for each model.
The validation ASE is lowest at model selection step number 6, which implies this is the optimal model, as shown in the chart below.
Figure 3.F.3 – Validation Average Square Error of Logistic Regression model.
4. Open ended discussion (20%)
4.A: A Model Comparison node is added to the workspace, and the decision tree & regression nodes are attached to it. This node enables us to compare the models and their predictive capabilities.
Figure 4.A – Adding Comparison node to the workspace.
We use three metrics to ascertain which model outperformed the other models on the
validation dataset. These metrics are: Cumulative Lift, ROC Curve and Fit statistics (Average
Square Error and Misclassification rate).
Cumulative Lift: The lift curve measures the effectiveness of a model by comparing the results obtained with and without the predictive model. The first Decision Tree (3.42) performs marginally better than the second Decision Tree (3.40) on cumulative lift at the 5th percentile, while the regression model gives a lift of 3.34 at the same depth.
Figure 4.A – Comparing the three models on Cumulative Lift Chart.
This is interpreted as follows: the top 5% of customers picked by the first Decision Tree model are 3.42 times more likely to purchase organic products than 5% of customers picked at random.
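The lift calculation behind this reading can be sketched directly; depth = 0.05 corresponds to the 5th percentile:

    import numpy as np

    def cumulative_lift(y_true, p_pred, depth=0.05):
        # Response rate among the top `depth` fraction of customers,
        # ranked by predicted probability, divided by the overall rate.
        y_true = np.asarray(y_true, dtype=float)
        order = np.argsort(p_pred)[::-1]       # highest scores first
        top_n = max(1, int(len(y_true) * depth))
        return y_true[order[:top_n]].mean() / y_true.mean()

    # Usage: lift at the 5th percentile for a model's validation scores.
    # cumulative_lift(y_valid, tree1.predict_proba(X_valid)[:, 1])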
ROC Curve: The ROC curve is a graph that shows the trade-off between sensitivity (true positive rate) and specificity (true negative rate) at varying decision thresholds. A curve that sits closer to the top-left corner predicts more accurately, while a curve that hugs the diagonal baseline predicts little better than chance. From a business context, the supermarket would naturally want to adopt the model whose curve rises most steeply towards high sensitivity.
Figure 4.A – Comparing the three models on ROCCurve.
The second Decision Tree's curve is the most top-left curve, very closely followed by the first Decision Tree, while the regression curve is closer to the baseline. This indicates the second Decision Tree is the most accurate model, followed by the first Decision Tree and the Regression model respectively.
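The same ranking can be summarized numerically with the area under each ROC curve; the probability arrays here are assumed outputs of the three fitted models:

    from sklearn.metrics import roc_auc_score

    # p_tree1, p_tree2, p_reg: validation-set probabilities of class 1
    # from the three fitted models (names assumed from earlier sketches).
    for name, p in [("Decision Tree 1", p_tree1),
                    ("Decision Tree 2", p_tree2),
                    ("Regression", p_reg)]:
        # AUC condenses the ROC curve into one number: values nearer 1.0
        # mean a curve nearer the top-left corner; 0.5 is the chance baseline.
        print(f"{name}: AUC = {roc_auc_score(y_valid, p):.4f}")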
Fit Statistics: The fit statistics table contains numerous model fit statistics that describe how well each model fits the data. We compare the three models on the basis of their Average Square Error and misclassification rate.
Average Square Error (ASE) measures the difference between the predicted and actual outcomes of a model. The misclassification rate is the proportion of cases classified incorrectly: each customer is assigned to class '1' or '0' according to whether their predicted probability exceeds 50%, and in reality a fraction of those customers will not fall under the assigned class.
Figure 4.A – Fit Statistics of the first Decision Tree, second Decision Tree and Regression model respectively.
By misclassification rate, the first Decision Tree is the best model, as it produces the least misclassification error (0.185), while by ASE the second Decision Tree (0.1326) is the better model of the three, beating the first Decision Tree very narrowly (0.1327). The regression model produces the most errors of the three models on both metrics.
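Both fit statistics are easy to reproduce from validation-set probabilities; ASE was sketched earlier, and the misclassification rate at the 50% cutoff is:

    import numpy as np

    def misclassification_rate(y_true, p_pred, threshold=0.5):
        # Assign class 1 when the predicted probability exceeds the
        # threshold, then count the fraction of wrong assignments.
        predicted = (np.asarray(p_pred) >= threshold).astype(int)
        return float(np.mean(predicted != np.asarray(y_true)))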
Conclusion: On the basis of the above comparison, the decision trees clearly outperform the regression model for the given data. There is, however, little to separate the two Decision Tree models, with both performing at around the same level on the various parameters. Assessing strictly by performance on the metrics, the first Decision Tree trumps the second by the slightest of margins.
4.B: Decision trees enable us to classify, while regression enables us to find the degree of association between the input variables and the target variable. With the help of the Decision Tree, we could classify the Organics dataset into purchasers (i.e. '1') and non-purchasers (i.e. '0') by a set of rules. While the regression could detect the relationship pattern between the input variables and the target variable, it fails to detect local patterns that may exist in sections of the data. Such patterns were unearthed by the decision trees during our modelling.
Moreover, the Decision Trees comprehensively performed better than the Regression model in terms of errors produced and lift achieved. Hence, for our given case, decision tree modelling alone would have been sufficient.
4.C: Advantages of Decision Tree -
Classification: A Decision Tree helps in understanding one's path to a decision. Each leaf creates a segment and states its attributes, which can be interpreted as the set of rules that particular segment follows to the final outcome (either '1' or '0'). The first decision tree model created 29 leaves, with each leaf signifying a segment whose attributes define whether the customer will purchase the organic product or not. Decision Trees produce models that explain how they work and are easy to understand.
Figure 4.C – First Decision Tree.
Detecting patterns: A decision tree can detect local patterns between sections of variables which other modelling techniques like regression may not be able to detect. As evident from the above image, the Decision Tree detects the interplay between Age, Affluence and Gender while modelling the target variable.
Data Exploration: Decision Trees are also very useful for data exploration, as they can pick out, from a large pool of inputs, the important variables that predict the target. Our Decision Tree picked out Age, Affluence Grade and Gender as such important variables out of the nine input variables selected from the data source.
Advantages of Regression –
Relationship: Regression modelling enables us to find the relationship between a combination of input variables and the target variable. This type of modelling can also describe the degree of association between the variables as a mathematical function. The coefficients describe whether an input variable is directly or inversely related to the target variable, and how strongly.
Pattern: Regression can detect patterns among the variables across the entire dataset. Through our analysis, we detected the patterns and interactions between the input variables (Age, Gender, Affluence Grade) and the target variable.
Estimation: Logistic regression can estimate the probability (odds) of being a purchaser as a weighted sum of the input attributes. Similarly, linear or multivariate regression could be used for estimation and prediction if we were answering a different question. For example, if the supermarket wanted to know which customers spend the most money, we could have used PromSpend as the dependent variable and conducted multivariate regression modelling to find the answer.
5. Extending current knowledge with additional reading (15%)
A) Just getting things wrong
Addressing the wrong issue can render the entire data mining process futile, as little or no business value can be derived from the solutions obtained from the modelling.
As such, understanding the business problem and addressing the correct issue becomes imperative. Such a situation could arise at the supermarket if higher management frames the wrong question or attempts to solve the wrong problem. In our case, this could happen if an incorrect target variable is selected. Similarly, wrongly rejecting a potentially important variable could lead to the creation of a skewed model. These issues can be avoided by putting a person in charge of the project with the right domain expertise to complement the technical know-how.
Erroneous interpretation of the binary '0' and '1' by team members could potentially lead to a complete failure of the modelling. Such incidents can be avoided through transparent communication across the supermarket's hierarchy and standardization of processes and rules.
B) Overfitting
Overfitting is phenomena which occurs when a model learns the details, noises or fluctuations
in the training set too well.Such a model memorizes those detailsrather than learning, and then
applies them to a new dataset where it isn’t applicable, leading to poor predictive performance.
A model can be assessed whether it is overfitting or not by evaluating the Average Square Error
and Misclassification rate graphs. If the training performance improves and validation
performance deteriorates, in terms of Average Square Error or Misclassification rate, as the
complexity of the model increases, the model shows signs of overfitting.
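This check can be sketched by growing trees of increasing depth and watching the gap between training and validation ASE; the numeric input matrices are assumed to be the imputed ones from the earlier sketches:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # X_train_num / X_valid_num: numeric, imputed input matrices (assumed).
    for depth in range(2, 15):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train_num, y_train)
        ase_tr = np.mean((np.asarray(y_train) - tree.predict_proba(X_train_num)[:, 1]) ** 2)
        ase_va = np.mean((np.asarray(y_valid) - tree.predict_proba(X_valid_num)[:, 1]) ** 2)
        # Training ASE keeps falling with depth; when validation ASE starts
        # rising again, the extra complexity is fitting noise (overfitting).
        print(depth, round(float(ase_tr), 4), round(float(ase_va), 4))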
Overfitting could arise in our supermarket model if it memorizes spurious patterns that don't apply to other data sources. Overfitting can also appear when the training set allocated for the analysis is not sufficiently large. It leads to an unstable model that may perform well on some occasions and poorly on others.
Such a model can be avoided by including only those variables that have predictive value. Additionally, allocating an optimal proportion of the dataset for training helps combat the problem of overfitting.
C) Sample bias
Sampling bias occurs when a sample does not accurately reflect the parent population. The Organics dataset contains details of customers with loyalty cards who purchased organic products after being incentivised with coupons. Customers with loyalty cards may not be an accurate representation of the total customer base. Further, the effect of coupons on a customer's purchasing likelihood has to be taken into consideration when predicting future responders to the organic products.
This can lead to a biased model, as the model will learn attributes from a biased sample (in this case, customers with loyalty cards) which may not hold for customers as a whole. Additionally, disregarding the effect of coupons on a customer's purchasing probability will only lead to inaccurate predictions by the model.
In order to accurately predict the responders to organic products, a new database should be created comprising details of all types of customers (i.e. with and without loyalty cards) and a variable for coupons. A model should then be built on this database to make more informed classifications and predictions.
D) Future not being like the past
Predictive modelling uses data from the past as the base for predictions, classifications and estimations about the future. However, many factors must be taken into account, such as the time frame of the data, seasonality of the business, changes in market conditions and so on.
As such, the Organics data shouldn't be too far in the past, as it may not reflect today's scenario. Additionally, the relation between the sale of organic products and seasonal behaviour should be explored to ensure the model doesn't overfit.
To make the model more reflective of current conditions, modelling must be conducted on recent data. Continuous efforts must be directed towards making the model more robust by iterating/feeding it with real-time data.
6. Appendix:
1.A.3:
Figure 1.A.3 (2) - Bar-Chart Distribution of TargetBuy
2.A:
Figure 2.A - Data Partition: 50% for Training & 50% for Validation
2.B:
Figure 2.B – Adding Decision Tree node.
2.C.1:
Figure 2.C.1 – Using Average Square Error as Assessment Measure for First Decision Tree.
2.C.2:
Figure 2.C.2 – First Decision Tree.
2.D.1:
Figure 2.D.1 – Changes in Maximum number of Branches for second Decision Tree.
2.D.2:
Figure 2.D.2 – Using Average Square Error as Assessment Measure for Second Decision Tree.
2.D.3:
Figure 2.D.3 – Nodes 36, 37, 38 & 39 of first Decision Tree marked in black while Node 32 marked in white.
Figure 2.D.3 (2) – Node 7 marked in white and Nodes 17 & 18 marked in black in the second Decision Tree.
3.A:
Figure 3.A – Adding StatExplore tool to Organic diagram.
3.C:
Figure 3.C – Adding Impute node and changing functions in the property panel.
3.D:
Figure 3.D – Adding Variable Clustering node and changing its property.

More Related Content

What's hot

Tree pruning
 Tree pruning Tree pruning
Tree pruning
Shivangi Gupta
 
heart final last sem.pptx
heart final last sem.pptxheart final last sem.pptx
heart final last sem.pptx
rakshashadu
 
Crop predction ppt using ANN
Crop predction ppt using ANNCrop predction ppt using ANN
Crop predction ppt using ANN
Astha Jain
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and Clustering
Usha Vijay
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Predictionsriram30691
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
Vasudev pendyala
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail Marketing
Jonathan Sedar
 
Bays theorem of probability
Bays theorem of probabilityBays theorem of probability
Bays theorem of probability
mayank mulchandani
 
Telecom Churn Prediction
Telecom Churn PredictionTelecom Churn Prediction
Telecom Churn Prediction
Anurag Mukhopadhyay
 
Correspondence analysis final
Correspondence analysis finalCorrespondence analysis final
Correspondence analysis finalsaba khan
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
Saikiran Panjala
 
Decision tree
Decision treeDecision tree
Decision tree
Ami_Surati
 
Ml2 train test-splits_validation_linear_regression
Ml2 train test-splits_validation_linear_regressionMl2 train test-splits_validation_linear_regression
Ml2 train test-splits_validation_linear_regression
ankit_ppt
 
Churn prediction
Churn predictionChurn prediction
Churn prediction
Gigi Lino
 
Literature Survey on Educational Dropout Prediction
Literature Survey on Educational Dropout PredictionLiterature Survey on Educational Dropout Prediction
Literature Survey on Educational Dropout Prediction
Lovely Professional University
 
churn prediction in telecom
churn prediction in telecom churn prediction in telecom
churn prediction in telecom
Hong Bui Van
 
Prediction of House Sales Price
Prediction of House Sales PricePrediction of House Sales Price
Prediction of House Sales Price
Anirvan Ghosh
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
Rashid Ansari
 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 Algorithm
Mohd. Noor Abdul Hamid
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR
 

What's hot (20)

Tree pruning
 Tree pruning Tree pruning
Tree pruning
 
heart final last sem.pptx
heart final last sem.pptxheart final last sem.pptx
heart final last sem.pptx
 
Crop predction ppt using ANN
Crop predction ppt using ANNCrop predction ppt using ANN
Crop predction ppt using ANN
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and Clustering
 
House Sale Price Prediction
House Sale Price PredictionHouse Sale Price Prediction
House Sale Price Prediction
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
 
Customer Clustering For Retail Marketing
Customer Clustering For Retail MarketingCustomer Clustering For Retail Marketing
Customer Clustering For Retail Marketing
 
Bays theorem of probability
Bays theorem of probabilityBays theorem of probability
Bays theorem of probability
 
Telecom Churn Prediction
Telecom Churn PredictionTelecom Churn Prediction
Telecom Churn Prediction
 
Correspondence analysis final
Correspondence analysis finalCorrespondence analysis final
Correspondence analysis final
 
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALADATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
DATA WAREHOUSE IMPLEMENTATION BY SAIKIRAN PANJALA
 
Decision tree
Decision treeDecision tree
Decision tree
 
Ml2 train test-splits_validation_linear_regression
Ml2 train test-splits_validation_linear_regressionMl2 train test-splits_validation_linear_regression
Ml2 train test-splits_validation_linear_regression
 
Churn prediction
Churn predictionChurn prediction
Churn prediction
 
Literature Survey on Educational Dropout Prediction
Literature Survey on Educational Dropout PredictionLiterature Survey on Educational Dropout Prediction
Literature Survey on Educational Dropout Prediction
 
churn prediction in telecom
churn prediction in telecom churn prediction in telecom
churn prediction in telecom
 
Prediction of House Sales Price
Prediction of House Sales PricePrediction of House Sales Price
Prediction of House Sales Price
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Decision tree Using c4.5 Algorithm
Decision tree Using c4.5 AlgorithmDecision tree Using c4.5 Algorithm
Decision tree Using c4.5 Algorithm
 
Bias and variance trade off
Bias and variance trade offBias and variance trade off
Bias and variance trade off
 

Similar to Building & Evaluating Predictive model: Supermarket Business Case

Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom Industry
Satyam Barsaiyan
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
Shivaram Prakash
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
IRJET Journal
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
IRJET- Coloring Greyscale Images using Deep Learning
IRJET- Coloring Greyscale Images using Deep LearningIRJET- Coloring Greyscale Images using Deep Learning
IRJET- Coloring Greyscale Images using Deep Learning
IRJET Journal
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET Journal
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET Journal
 
A02610104
A02610104A02610104
A02610104theijes
 
Data Partitioning for Ensemble Model Building
Data Partitioning for Ensemble Model BuildingData Partitioning for Ensemble Model Building
Data Partitioning for Ensemble Model Building
neirew J
 
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDINGDATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
ijccsa
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
Trushita Redij
 
A Dependent Set Based Approach for Large Graph Analysis
A Dependent Set Based Approach for Large Graph AnalysisA Dependent Set Based Approach for Large Graph Analysis
A Dependent Set Based Approach for Large Graph Analysis
Editor IJCATR
 
DEA SolverPro Newsletter19
DEA SolverPro Newsletter19DEA SolverPro Newsletter19
DEA SolverPro Newsletter19
Cheer Chain Enterprise Co., Ltd.
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
gerogepatton
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
ijaia
 
result analysis for deep leakage from gradients
result analysis for deep leakage from gradientsresult analysis for deep leakage from gradients
result analysis for deep leakage from gradients
國騰 丁
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
Muthu Kumaar Thangavelu
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
Muthu Kumaar Thangavelu
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data Mining
IRJET Journal
 

Similar to Building & Evaluating Predictive model: Supermarket Business Case (20)

Churn Analysis in Telecom Industry
Churn Analysis in Telecom IndustryChurn Analysis in Telecom Industry
Churn Analysis in Telecom Industry
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
 
AIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTIONAIRLINE FARE PRICE PREDICTION
AIRLINE FARE PRICE PREDICTION
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
 
IRJET- Coloring Greyscale Images using Deep Learning
IRJET- Coloring Greyscale Images using Deep LearningIRJET- Coloring Greyscale Images using Deep Learning
IRJET- Coloring Greyscale Images using Deep Learning
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
A02610104
A02610104A02610104
A02610104
 
Data Partitioning for Ensemble Model Building
Data Partitioning for Ensemble Model BuildingData Partitioning for Ensemble Model Building
Data Partitioning for Ensemble Model Building
 
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDINGDATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
DATA PARTITIONING FOR ENSEMBLE MODEL BUILDING
 
Machine_Learning_Trushita
Machine_Learning_TrushitaMachine_Learning_Trushita
Machine_Learning_Trushita
 
A Dependent Set Based Approach for Large Graph Analysis
A Dependent Set Based Approach for Large Graph AnalysisA Dependent Set Based Approach for Large Graph Analysis
A Dependent Set Based Approach for Large Graph Analysis
 
DEA SolverPro Newsletter19
DEA SolverPro Newsletter19DEA SolverPro Newsletter19
DEA SolverPro Newsletter19
 
report
reportreport
report
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
 
result analysis for deep leakage from gradients
result analysis for deep leakage from gradientsresult analysis for deep leakage from gradients
result analysis for deep leakage from gradients
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
Caravan insurance data mining prediction models
Caravan insurance data mining prediction modelsCaravan insurance data mining prediction models
Caravan insurance data mining prediction models
 
A Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data MiningA Comparative Study for Anomaly Detection in Data Mining
A Comparative Study for Anomaly Detection in Data Mining
 

More from Siddhanth Chaurasiya

Predictive Modelling & Market-Basket Analysis.
Predictive Modelling & Market-Basket Analysis.Predictive Modelling & Market-Basket Analysis.
Predictive Modelling & Market-Basket Analysis.
Siddhanth Chaurasiya
 
Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.
Siddhanth Chaurasiya
 
Visualization Techniques: Framework, Effective viz & Non-effective viz.
Visualization Techniques: Framework, Effective viz & Non-effective viz.Visualization Techniques: Framework, Effective viz & Non-effective viz.
Visualization Techniques: Framework, Effective viz & Non-effective viz.
Siddhanth Chaurasiya
 
Escape Trave: Analytical solution
Escape Trave: Analytical solutionEscape Trave: Analytical solution
Escape Trave: Analytical solution
Siddhanth Chaurasiya
 
Innovation at International Foods Group
Innovation at International Foods GroupInnovation at International Foods Group
Innovation at International Foods Group
Siddhanth Chaurasiya
 
Sustainable reporting and its effects on financial performance.
Sustainable reporting and its effects on financial performance.Sustainable reporting and its effects on financial performance.
Sustainable reporting and its effects on financial performance.
Siddhanth Chaurasiya
 

More from Siddhanth Chaurasiya (6)

Predictive Modelling & Market-Basket Analysis.
Predictive Modelling & Market-Basket Analysis.Predictive Modelling & Market-Basket Analysis.
Predictive Modelling & Market-Basket Analysis.
 
Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.Machine-Learning: Customer Segmentation and Analysis.
Machine-Learning: Customer Segmentation and Analysis.
 
Visualization Techniques: Framework, Effective viz & Non-effective viz.
Visualization Techniques: Framework, Effective viz & Non-effective viz.Visualization Techniques: Framework, Effective viz & Non-effective viz.
Visualization Techniques: Framework, Effective viz & Non-effective viz.
 
Escape Trave: Analytical solution
Escape Trave: Analytical solutionEscape Trave: Analytical solution
Escape Trave: Analytical solution
 
Innovation at International Foods Group
Innovation at International Foods GroupInnovation at International Foods Group
Innovation at International Foods Group
 
Sustainable reporting and its effects on financial performance.
Sustainable reporting and its effects on financial performance.Sustainable reporting and its effects on financial performance.
Sustainable reporting and its effects on financial performance.
 

Recently uploaded

一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Enterprise Wired
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
eddie19851
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 

Recently uploaded (20)

一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfUnleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdf
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Nanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdfNanandann Nilekani's ppt On India's .pdf
Nanandann Nilekani's ppt On India's .pdf
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 

Building & Evaluating Predictive model: Supermarket Business Case

  • 1. Building andEvaluating Predictivemodel: Supermarket Business Case BUS5PA – Assignment 1 SIDDHANTH CHAURASIYA Master of Business Analytics 19139507
  • 2. Objective: To predict and determine,using Decision Tree and Regression modelling, which segment of customers are likelyto purchase a new line of organic products that is to be introduced by the supermarket. 1. Setting up the project and exploratory analysis. 1.A.1&2: On SAS Enterprise Miner workstation, a new Project named BUS5PA_Assignment1_19139507 is created, followedby creating a diagram called Organics. Further, a SAS library is created and the given dataset ‘Organics’ is selected as the data source for the project. On analysing the dataset, SAS Enterprise Miner found 22223 observations and 13 variables. The roles of the 13 variables have been set as follows: Figure 1.A.2 - Roles & Measurement Level of Variables Variables with Nominal Measurement Level contain Categorical data while variables with Interval Measurement Level contain numeric data. Target/Respond variable TargetBuy has a Binary Measurement, with 1 indicating Yes and 0 indicating No.
  • 3. 1.A.3: Distribution of Target variables [Appendix – Figure 1.A.3 (2)] Figure 1.A.3 - Summary of Distribution of TargetBuy 1.A.4: DemCluster has been Rejected as DemClusterGroup contains collapsed data of DemCluster and based on past evidences,DemClusterGroup is sufficientfor the modelling. 1.B: TargetBuy envelopesthe data contained in TargetAmt. Utilizing TargetAmt as an input could lead to an imprecise modelling or leakage as the model would find strong co-relation between the input (TargetAmt) and Target (TargetBuy), since the target variable contains the collapsed data of TargetAmt. Hence, TargetAmt should not be used as input and should be set as Rejected. 2. Decision tree based modelling and analysis. 2.A: After dragging the Organics dataset to the Organics diagram, we connect the Data Partition node to the Organics dataset. 50% of the data is utilizedfor training while the remaining 50% of
  • 4. the data is used for validation (Appendix – Figure 2.A). Training set is used to build a set of models while Validation set is utilizedto select the best model created from the Training set. Figure 2.A (2) – Adding Data Partition to the Organics data source. 2.B: A DecisionTree isthenconnectedtothe Data Partition node (Appendix –Figure 2.B) 2.C.1: The number of leaves in an Optimal tree is 29 based on Average Square Error as the subtree assessment plot. This Decision Tree has been created using Average Square Error (ASE) as the subtree Assessment Measure (Appendix – Figure 2.C.). The assessment method specifies the type of method used to select the best tree. ASE opts for the tree that produces the smallest average square error. Figure 2.C.1 – Optimal Tree based on Average Square error as the Subtree Assessment.
  • 5. 2.C.2: Variable DemAge was used for the first split as this is the variable which ensures the best split in terms of ‘Purity’ (Appendix – Figure 2.C.2). Based on Logworth of each input variable, the competing splits for the first split (DemAge) for the first decision tree are DemAffl and DemGender. Logworth is measure of Entropy, which indicates which variable can create the most homogenous subgroups. Figure 2.C.2 – Logworth of Input Variables 2.D.1: The maximum branches of the second decision tree has been changed to 3. This means the subsets of the splitting rules are dividedinto 3 branches (Appendix – Figure 2.D.1). 2.D.2: The second Decision Tree has been created using Average Square Error (ASE) as the subtree Assessment Measure (Appendix – Figure 2.D.2). The assessment method specifiesthe type of method used to select the best tree. ASE opts for the tree that produces the smallest average square error. Figure 2.D.2 (2) – Adding the second Decision Tree node. 2.D.3: The optimal tree for Decision Tree 2 using Average Square Error as the model assessment statistic contains 33 leaves.
  • 6. Figure 2.D.3 – Leaves on Optimal Tree based on Average Square Error for Decision Tree 2. The two Decision Tree models differas the maximum branch splits (2 vs 3) is different.This results in the divergence in number of leavesin optimal tree of the respective Decision Tree the first decision tree contains 29 leaves whereas Decision Tree 2 contains 33 leaves. The set of rulesor classificationsetbythe firstDecisionTree canbe summarizedas –  Female customersunderthe Age of 44.5 years,havingAffluence grade more than9.5 or missingare likelytopurchase organicproducts(Node 36,37, 38 & 39, Appendix –Figure 2.D.3).  Female customersunderthe age of 39.5 years,havingAffluence grade of lessthan9.5 but more than 6.5 or missingare likelytopurchase organicproducts(Node 32, Appendix– Figure 2.D.3). The set of rules or classificationsetbythe secondDecisionTree canbe summarizedas –  Customersunderthe age of 39.5 yearswhohave an affluence grade of more than14.5 are verylikelytopurchase organicproducts[Node 7,Appendix –Figure 2.D.3(2)].  Customersunderthe age of 39.5 yearshavingaffluencegrade of lessthan14.5 butmore than 9.5 (ormissing) are likelytopurchase organicproducts.However,if suchacustomeris a Female,thenshe is22% more likelytobuyOrganicproductsthan the customerwhois male withthe same attributes[Node 17& 18, Appendix –Figure 2.D.3 (2)]. 2.E: Average square error computes, squares and then averages the variation betweenthe predicted outcome and the actual outcome of the leaf nodes. Lower the average square error, better the model; as it indicates the model produces the fewer errors. For the Organics dataset,
  • 7. Decision Tree 2 (0.132662) appears to be marginally better model because the said model’s average square error is minutely lessthan that of the first Decision Tree (0.132773). Figure 2.E – Model Comparison between the two DecisionTrees, 3. Regression based modelling and analysis. 3.A: StatExplore tool is attached to the Organics datasource (Appendix – Figure 3.A). StatExplore provides a statistical summarization as well as graphical representation of the variables. Through StatExplore, we also get to know about the number of missing values in each variable. Figure 3.A (2): Summary of Input variable via StatExplore. 3.B: Yes, the missing values in the given dataset should be imputed as Regression modelling doesn’t accommodate missing values in the model but rather ignores or strips-off such values,
  • 8. thereby leading to loss of data to that extent for the modelling. This can lead to a creation of biased training set. Imputation for Decision Tree isn’t required as Decision Tree accommodates the missing values in its modellingby considering them as a possible value with its own branch. 3.C: We add impute node to the diagram and connect it to the Data Partition node (Appendix – Figure 3.C). Impute function creates a synthetic value for the missing value. We impute alphabet ‘U’ for missing class variable value and use the mean of the variable to impute missing interval variable values. Figure 3.C – Results of Impute. We then create imputation indicators for all imputed inputs. This function creates a new variable to indicate whether a value has been imputed or not in the main variable. 3.D: We adda Variable Clustering node toImpute node togrouptogethersimilarvariablesina cluster(Appendix –Figure 3.D).Byvirtue of Impute’sfunction, new inputvariablesare created containingthe imputedvalues forthe missingvalues.Similarly,new indicatorvariableshave also beencreated. VariableClusteringenablesustoreduce redundancyandensure betterregression basedmodellingasthistype of modellingismore suitedwheninputvariablesare fewer. Each clusteris representedbyanindividual inputvariable.We optforBest Variable asthe criteria for the selectionforthe clusterrepresentative.The variablewith the lowestnormalisedvalue of R squaredisselectedasthe clusterrepresentative underthe Bestvariable method. As a result,the 52 variables(afterimputationandindicatorvariables)are groupedtogethertoform 24 clusters.The clusterrepresentativesof the 24 clusterswill be thenusedasthe inputsforthe
  • 9. Regressionmodel. Figure 3.D (2) – Representation of Variable Clustering in various forms. We add Regression node to the Variable clustering node and change the Selectionmodel to Stepwise and Selection criterion to Validation Error. Under the Stepwise model, variables are added one by one to the model.At the same time as the addition, the variable already in the model whose Stay Significance level falls below the threshold is deleted.This type of model stops when the Stay Significance level or Selectioncriterion (Validation Error in this case) is achieved. Figure 3.E – Adding Regression node and changing its properties.
  • 10. The measurement of Target variable determinesthe type of Regression model that will be applied. For our given case, Logistic Regression model will be used as our Target Variable is Binary. 3.F.1: The variables included in the final model of the Regression modellingare:  IMP_DemAffl – Imputed Affluence Grade.  IMP_DemAge – Imputed Age.  M_DemGender0 – Indicator Gender.  IMP_DemGenderM – Imputed Gender.  M_DemAffl0 – Indicator Affluence Grade.  M_DemAge0 – Indicator Age. The regression modellingindicates the above-mentioned variables have a significant degree of association with our target variable. The list of variables is in order of their stepwise addition to the model. Regression models enables us to estimate the relationshipbetween the input variables and target variable. Logistics regression helpsus to expressthe association betweenthe input and target variable in terms of odds (Purchaser vs Non-purchaser) The higher management of supermarket can interpret the abovementioned association as follows: Figure 3.F.1 – Variables in final Logistic Regression model. With other variables constant, change in a single unit in IMP_DemAffl will result in a 0.2530-unit change in the log odds of purchaser (vs non-purchaser). On the other hand, with other variables constant, a change of a single unit in IMP_DemAge will lead to a decrease in log odds by 0.0550. Similarly, with other variables constant increase in a single unit of IMP_DemGenderM, M_DemAffl0 and M_DemAge0 will result in decrease in the log odds of a purchaser by 0.7277, 0.3657 and 0.2345 unit respectivelywhile increase in a single unit of M_DemGender0 will lead to increase in log odds of purchasers by 1.41 units.
3.F.2: The most important variables influencing the target variable are M_DemGender0, IMP_DemGenderM and M_DemAffl0 respectively. Their importance is ranked by the absolute value of their coefficients: the larger a coefficient's absolute value, the more strongly the variable is related to the target variable, with the sign of the coefficient indicating whether the relationship is positive or negative. A coefficient near zero implies the weakest relation between the input variable and the target variable for the given scenario.

Figure 3.F.2 – Bar-chart and tabular form of most important variables for Regression modelling.

Interpreted together, the bar chart suggests that relatively young female customers with a high affluence grade are the most likely to purchase organic products.

3.F.3: The validation Average Square Error (ASE) for the Regression model is 0.141805. ASE helps in gauging which model errs less often than the others; a lower ASE indicates a model with better predictive capability. ASE is calculated by squaring and averaging the differences between the predicted and actual outcomes for each model. The ASE is lowest at model selection step 6, which implies this is the optimal model, as shown in the chart below.
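For reference, validation ASE is simply the mean of the squared differences between the predicted probabilities and the actual 0/1 outcomes. A minimal sketch with made-up toy values:

```python
import numpy as np

def average_square_error(y_actual, p_predicted):
    """ASE: mean squared difference between actual outcomes (0/1)
    and predicted probabilities, computed over the validation set."""
    y_actual = np.asarray(y_actual, dtype=float)
    p_predicted = np.asarray(p_predicted, dtype=float)
    return np.mean((y_actual - p_predicted) ** 2)

# Toy example only; the real score is computed over all validation rows
print(average_square_error([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.1]))  # 0.075
```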
Figure 3.F.3 – Validation Average Square Error of Logistic Regression model.

4. Open ended discussion (20%)

4.A: A Model Comparison node is added to the workspace, and the decision tree and regression nodes are connected to it. This node enables us to compare the models and their predictive capabilities.

Figure 4.A – Adding Comparison node to the workspace.
We use three metrics to ascertain which model outperforms the others on the validation dataset: Cumulative Lift, the ROC Curve and Fit Statistics (Average Square Error and misclassification rate).

Cumulative Lift: The lift curve measures the effectiveness of a model by comparing the results obtained with and without the predictive model. The first Decision Tree (3.42) performs marginally better than the second Decision Tree (3.40) on cumulative lift at the 5% depth, while the regression model gives a lift of 3.34 at the same depth.

Figure 4.A – Comparing the three models on Cumulative Lift Chart.

This is interpreted as follows: the top 5% of customers picked by the first Decision Tree model are 3.42 times more likely to purchase organic products than 5% of customers picked at random.

ROC Curve: The ROC curve is a graph that plots Sensitivity (the true positive rate) against 1 − Specificity (the false positive rate) at varying classification thresholds. A curve that bends closer to the top-left corner indicates a more accurate predictor, while a curve that hugs the diagonal baseline indicates a less accurate one. From a business context, the supermarket would naturally prefer the model whose curve rises most steeply towards high sensitivity.
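Cumulative lift at a 5% depth can be computed directly from scored validation data: take the 5% of customers with the highest predicted probability and divide their response rate by the overall response rate. A minimal sketch, assuming arrays of actual outcomes and predicted probabilities:

```python
import numpy as np

def cumulative_lift(y_actual, p_predicted, depth=0.05):
    """Response rate in the top `depth` fraction of customers (ranked
    by predicted probability) divided by the overall response rate."""
    y = np.asarray(y_actual, dtype=float)
    p = np.asarray(p_predicted, dtype=float)
    n_top = max(1, int(round(len(y) * depth)))
    top = np.argsort(-p)[:n_top]  # indices of the highest-scored customers
    return y[top].mean() / y.mean()

# A lift of ~3.42 means the top 5% respond 3.42x as often as average
```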
Figure 4.A – Comparing the three models on ROC Curve.

The second Decision Tree's curve is the closest to the top-left, very closely followed by the first Decision Tree, while the regression curve sits nearer the baseline. This indicates the second Decision Tree is the most accurate model, followed by the first Decision Tree and the Regression model respectively.

Fit Statistics: The Fit Statistics table contains numerous model fit statistics that describe how well a model fits the data. We compare the three models on their Average Square Error and misclassification rate. Average Square Error (ASE) measures the difference between the predicted and actual outcomes of a model. The misclassification rate refers to the proportion of cases classified incorrectly: when a model assigns a probability above 50% that a customer belongs to a group ('1' or '0'), every customer in that segment is classified accordingly, yet in reality a fraction of those customers may not fall under that classification.
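The misclassification rate described above is straightforward to compute from scored data: apply the 50% cut-off and count the wrong labels. A minimal sketch with toy values:

```python
import numpy as np

def misclassification_rate(y_actual, p_predicted, threshold=0.5):
    """Classify as '1' when the predicted probability exceeds the
    threshold, then report the fraction of incorrect labels."""
    y = np.asarray(y_actual)
    labels = (np.asarray(p_predicted) > threshold).astype(int)
    return np.mean(labels != y)

# Toy check: one wrong label out of four -> rate 0.25
print(misclassification_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1]))
```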
Figure 4.A – Fit Statistics of first Decision Tree, second Decision Tree and Regression model respectively.

By misclassification rate, the first Decision Tree is the best model, producing the least misclassification error (0.185), while by ASE the second Decision Tree (0.1326) edges out the first (0.1327). The Regression model produces the most errors of the three on both metrics.

Conclusion: On the basis of the above comparison, the decision trees clearly outperform the regression model for the given data. There is, however, little to separate the two decision tree models, which perform at roughly the same level on the various parameters; judged strictly on the metrics, the first Decision Tree trumps the second by the slightest of margins.

4.B: Decision trees enable us to classify, while regression enables us to find the degree of association between the input variables and the target variable. With the help of the Decision Tree, we could classify the Organics dataset into purchasers (i.e. '1') and non-purchasers (i.e. '0') by a set of rules. While the regression could detect the relationship pattern between the input variables and the target variable, it fails to detect local patterns that may exist in sections of the data; such patterns were unearthed by the decision trees during our modelling. Moreover, the decision trees comprehensively performed better than the regression model in terms of both error rates and lift. Hence, for our given case, decision tree modelling alone would have been sufficient.

4.C: Advantages of Decision Tree –

Classification: A decision tree helps in understanding the path to a decision. Each leaf defines a segment, and its attributes can be read as the set of rules that segment follows to reach the final outcome (either '1' or '0'). The first decision tree model created 29 leaves, with each leaf signifying a segment whose attributes determine
whether the customer will purchase the organic product or not. Decision trees produce models that explain how they work and are easy to understand.

Figure 4.C – First Decision Tree.

Detecting patterns: A decision tree can detect patterns between sections of variables that other modelling techniques, such as regression, may not be able to detect. As evident from the above image, the decision tree detects the interplay between Age, Affluence and Gender while modelling the target variable.

Data exploration: Decision trees are also very useful for data exploration, as they can pick out, from a large pool of inputs, the important variables that predict the target. Our decision tree picked out Age, Affluence Grade and Gender as such important variables from the nine input variables selected from the data source.

Advantages of Regression –

Relationship: Regression modelling enables us to find the relationship between a combination of input variables and the target variable, and can describe the degree of association as a mathematical function; the coefficients describe whether an input variable is directly or inversely related to the target variable.

Pattern: Regression can detect patterns among the variables across the entire dataset. Through our analysis, we detected the patterns and interactions between the input variables (Age, Gender, Affluence Grade) and the target variable.

Estimation: Logistic regression can estimate the odds of being a purchaser as a weighted sum of the input attributes. Similarly, linear or multivariate regression could be used for estimation and prediction if we were answering a different question. For example, if the supermarket wanted to know which customers spend the most money, we could use PromSpend as the dependent variable and conduct multivariate regression modelling to find the answer, as sketched below.
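To illustrate that estimation use-case, the hedged sketch below fits an ordinary linear regression with PromSpend as the dependent variable using scikit-learn. The predictor set and the `organics.csv` file name are assumptions for illustration, not part of the assignment's actual workflow.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical setup: predict promotional spend from demographics
organics = pd.read_csv("organics.csv")  # assumed file name
features = ["DemAffl", "DemAge"]        # illustrative predictors only
data = organics.dropna(subset=features + ["PromSpend"])

reg = LinearRegression().fit(data[features], data["PromSpend"])
for name, coef in zip(features, reg.coef_):
    print(f"{name}: estimated change in PromSpend per unit = {coef:.2f}")
```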
5. Extending current knowledge with additional reading (15%)

A) Just getting things wrong

Addressing the wrong issue can render the entire data mining process futile, as little or no business value can be derived from the solutions the modelling produces. Understanding the business problem and addressing the correct issue is therefore imperative. Such a situation could arise at the supermarket if senior management frames the wrong question or attempts to solve the wrong problem; in our case, this could happen if an incorrect target variable were selected. Similarly, wrongly rejecting a potentially important variable could lead to a skewed model. These issues can be mitigated by putting the project in the charge of a person with the right domain expertise to complement the technical know-how. Erroneous interpretation of the binary '0' and '1' by team members could likewise lead to a complete failure of the modelling; such incidents can be prevented through transparent communication across the supermarket's hierarchy and the standardisation of processes and rules.

B) Overfitting

Overfitting is a phenomenon that occurs when a model learns the details, noise and fluctuations of the training set too well. Such a model memorises those details rather than learning the underlying pattern, and then applies them to new data where they do not hold, leading to poor predictive performance. Whether a model is overfitting can be assessed by examining the Average Square Error and misclassification rate plots: if training performance keeps improving while validation performance deteriorates as model complexity increases, the model shows signs of overfitting. Overfitting could arise in our supermarket model if it memorises spurious patterns that do not apply to other data sources, and it can also appear when the training set allocated for the analysis is not sufficiently large. Overfitting produces an unstable model that performs well on some occasions and poorly on others. It can be avoided by including only variables with predictive value and by allocating an optimal proportion of the dataset for training.

C) Sample bias

Sampling bias occurs when a sample does not accurately reflect the parent population. The Organics dataset contains details of loyalty-card customers who purchased organic products after being incentivised with coupons. Loyalty-card customers may not be an accurate representation of the total customer base. Further, the effect of coupons on the purchasing
likelihood of a customer has to be taken into consideration when predicting future responders for the organic products. This can lead to a biased model, as it will learn attributes from a biased sample (in this case, customers with loyalty cards) that may not generalise to customers as a whole. Additionally, disregarding the effect of coupons on a customer's purchasing probability will only lead to inaccurate predictions. To accurately predict responders for organic products, a new database should be created comprising details of all types of customers (i.e. with and without loyalty cards) together with a variable for coupon usage, and a model should then be built on this database to make better-informed classifications and predictions.

D) Future not being like the past

Predictive modelling uses data from the past as the basis for predictions, classifications and estimates about the future. However, many factors must be taken into account, such as the timeframe of the data, the seasonality of the business and changes in market conditions. As such, the Organics data should not be too far in the past, or it may not reflect today's conditions. Additionally, the relation between the sale of organic products and seasonal behaviour should be explored to ensure the model does not overfit. To make the model more reflective of current conditions, the modelling must be conducted on recent data, and continuous effort must be directed towards making the model more robust by iterating on it with up-to-date data.

6. Appendix:

1.A.3:
Figure 1.A.3 (2) – Bar-Chart Distribution of TargetBuy

2.A:
Figure 2.A – Data Partition: 50% for Training & 50% for Validation

2.B:
Figure 2.B – Adding Decision Tree node.

2.C.1:
Figure 2.C.1 – Using Average Square Error as Assessment Measure for First Decision Tree.
2.C.2:
Figure 2.C.2 – First Decision Tree.

2.D.1:
Figure 2.D.1 – Changes in Maximum number of Branches for second Decision Tree.
2.D.2:
Figure 2.D.2 – Using Average Square Error as Assessment Measure for Second Decision Tree.

2.D.3:
Figure 2.D.3 – Nodes 36, 37, 38 & 39 of first Decision Tree marked in black while Node 32 marked in white.
Figure 2.D.3 (2) – Node 7 marked in white and Nodes 17 & 18 marked in black in the second Decision Tree.

3.A:
Figure 3.A – Adding StatExplore tool to Organics diagram.
3.C:
Figure 3.C – Adding Impute node and changing functions in the property panel.

3.D:
Figure 3.D – Adding Variable Clustering node and changing its property.