- The document describes building predictive models using decision tree and regression modeling to predict which customers are likely to purchase new organic products being introduced by a supermarket.
- Both decision tree and logistic regression models were created, with the decision tree models performing slightly better based on various evaluation metrics such as cumulative lift, ROC curve, and average square error.
- The top variables influencing the likelihood of a customer purchasing organics according to the models were gender, age, and affluence level.
Part I: Predictive models (Decision Tree and Regression) using SAS Enterprise Miner
Part II: Decision Tree using R.
Part III: Market-Basket Analysis using SAS miner.
Objective:
To predict and determine, using Decision Tree and Regression modelling, which segments of customers are likely to purchase a new line of organic products that is to be introduced by the supermarket.
1. Setting up the project and exploratory analysis.
1.A.1&2: In the SAS Enterprise Miner workstation, a new project named BUS5PA_Assignment1_19139507 is created, followed by a diagram called Organics. Further, a SAS library is created and the given dataset ‘Organics’ is selected as the data source for the project. On analysing the dataset, SAS Enterprise Miner found 22,223 observations and 13 variables.
The roles of the 13 variables have been set as follows:
Figure 1.A.2 - Roles & Measurement Level of Variables
Variables with a Nominal measurement level contain categorical data, while variables with an Interval measurement level contain numeric data. The target/response variable TargetBuy has a Binary measurement level, with 1 indicating Yes and 0 indicating No.
1.A.3: Distribution of the target variable [Appendix – Figure 1.A.3 (2)]
Figure 1.A.3 - Summary of Distribution of TargetBuy
1.A.4: DemCluster has been Rejected because DemClusterGroup contains the collapsed data of DemCluster and, based on past evidence, DemClusterGroup is sufficient for the modelling.
1.B: TargetBuy envelops the data contained in TargetAmt. Using TargetAmt as an input could lead to imprecise modelling or leakage, as the model would find a strong correlation between the input (TargetAmt) and the target (TargetBuy), since the target variable contains the collapsed data of TargetAmt. Hence, TargetAmt should not be used as an input and should be set to Rejected.
2. Decision tree based modelling and analysis.
2.A: After dragging the Organics dataset onto the Organics diagram, we connect the Data Partition node to the Organics dataset. 50% of the data is used for training while the remaining 50% is used for validation (Appendix – Figure 2.A). The training set is used to build a set of models, while the validation set is used to select the best model created from the training set.
Figure 2.A (2) – Adding Data Partition to the Organics data source.
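For readers who prefer code to the Data Partition node, a roughly equivalent 50/50 split can be sketched in Base SAS. This is an illustrative sketch only: the dataset and variable names (Organics, TargetBuy) follow the report, while the seed and output table names are arbitrary choices.

/* Sketch: 50/50 partition stratified on TargetBuy, approximating the Data Partition node. */
proc sort data=organics out=organics_srt;
  by TargetBuy;
run;

proc surveyselect data=organics_srt out=organics_part outall
                  samprate=0.5 seed=12345;
  strata TargetBuy;                     /* keep the purchaser/non-purchaser mix similar in both halves */
run;

data train validate;
  set organics_part;
  if Selected = 1 then output train;    /* the Selected flag is written because of OUTALL */
  else output validate;
run;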
2.B: A Decision Tree node is then connected to the Data Partition node (Appendix – Figure 2.B).
2.C.1: The optimal tree contains 29 leaves, based on the subtree assessment plot using Average Square Error. This Decision Tree has been created using Average Square Error (ASE) as the subtree assessment measure (Appendix – Figure 2.C). The assessment measure specifies the method used to select the best tree; ASE opts for the tree that produces the smallest average square error.
Figure 2.C.1 – Optimal Tree based on Average Square error as the Subtree Assessment.
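As a code-level counterpart (not the Enterprise Miner tree node itself), a similar binary-split tree can be sketched with PROC HPSPLIT. Its growing and pruning criteria differ from the Decision Tree node, so this is only an approximation, and the input list is an assumption based on the variables discussed in the report.

/* Sketch: a binary-split classification tree on the training half.               */
/* Not the Enterprise Miner algorithm; results will not match the report exactly. */
proc hpsplit data=train maxbranch=2 seed=12345;
  class TargetBuy DemGender DemClusterGroup;
  model TargetBuy = DemAffl DemAge DemGender DemClusterGroup PromSpend;
  prune costcomplexity;                 /* cost-complexity pruning stands in for the subtree assessment measure */
run;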
2.C.2: Variable DemAge was used for the first split as this is the variable which ensures the best split in terms of ‘purity’ (Appendix – Figure 2.C.2).
Based on the logworth of each input variable, the competing splits for the first split (DemAge) in the first decision tree are DemAffl and DemGender. Logworth is -log10 of the p-value of the split’s chi-square test; a higher logworth indicates a variable that creates more homogeneous subgroups.
Figure 2.C.2 – Logworth of Input Variables
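To illustrate the logworth idea in code (this is not Enterprise Miner’s internal calculation), the chi-square p-value of a candidate split can be converted into a logworth in a couple of steps. The split variable DemAge_split and the 44.5-year cut are taken from the tree rules below purely for illustration.

/* Sketch: logworth of one candidate split on the training data. */
data splits;
  set train;
  if missing(DemAge) then DemAge_split = .;
  else DemAge_split = (DemAge < 44.5);  /* candidate cut point, taken from the tree rules */
run;

proc freq data=splits;
  tables DemAge_split*TargetBuy / chisq;
  output out=chi_stats pchi;            /* writes the chi-square p-value as P_PCHI */
run;

data logworth;
  set chi_stats;
  logworth = -log10(P_PCHI);            /* higher logworth = stronger, more homogeneous split */
run;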
2.D.1: The maximum number of branches for the second decision tree has been changed to 3. This means a splitting rule can divide a node into up to 3 branches (Appendix – Figure 2.D.1).
2.D.2: The second Decision Tree has also been created using Average Square Error (ASE) as the subtree assessment measure (Appendix – Figure 2.D.2). As before, ASE opts for the tree that produces the smallest average square error.
Figure 2.D.2 (2) – Adding the second Decision Tree node.
2.D.3: The optimal tree for Decision Tree 2, using Average Square Error as the model assessment statistic, contains 33 leaves.
Figure 2.D.3 – Leaves on Optimal Tree based on Average Square Error for Decision Tree 2.
The two Decision Tree models differ because their maximum branch splits (2 vs 3) are different. This results in a divergence in the number of leaves in the respective optimal trees: the first decision tree contains 29 leaves whereas Decision Tree 2 contains 33 leaves.
The set of rules or classifications set by the first Decision Tree can be summarized as –
Female customers under the age of 44.5 years, having an Affluence grade of more than 9.5 or missing, are likely to purchase organic products (Nodes 36, 37, 38 & 39, Appendix – Figure 2.D.3).
Female customers under the age of 39.5 years, having an Affluence grade of less than 9.5 but more than 6.5 or missing, are likely to purchase organic products (Node 32, Appendix – Figure 2.D.3).
The set of rules or classifications set by the second Decision Tree can be summarized as –
Customers under the age of 39.5 years who have an Affluence grade of more than 14.5 are very likely to purchase organic products [Node 7, Appendix – Figure 2.D.3 (2)].
Customers under the age of 39.5 years, having an Affluence grade of less than 14.5 but more than 9.5 (or missing), are likely to purchase organic products. However, if such a customer is female, then she is 22% more likely to buy organic products than a male customer with the same attributes [Nodes 17 & 18, Appendix – Figure 2.D.3 (2)].
2.E: Average square error computes, squares and then averages the difference between the predicted outcome and the actual outcome at the leaf nodes. The lower the average square error, the better the model, as it indicates the model produces fewer errors. For the Organics dataset, Decision Tree 2 (0.132662) appears to be the marginally better model because its average square error is minutely less than that of the first Decision Tree (0.132773).
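In formula terms (one standard formulation, with y_i the actual 0/1 outcome and \hat{p}_i the predicted probability of purchase):

\mathrm{ASE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{p}_i \right)^2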
Figure 2.E – Model Comparison between the two Decision Trees.
3. Regression based modelling and analysis.
3.A: The StatExplore tool is attached to the Organics data source (Appendix – Figure 3.A). StatExplore provides a statistical summarization as well as a graphical representation of the variables. Through StatExplore, we also learn the number of missing values in each variable.
Figure 3.A (2): Summary of Input variable via StatExplore.
3.B: Yes, the missing values in the given dataset should be imputed, as regression modelling doesn’t accommodate missing values but rather ignores or strips off such records, thereby leading to a corresponding loss of data for the modelling. This can lead to the creation of a biased training set.
Imputation for the Decision Tree isn’t required, as a Decision Tree accommodates missing values in its modelling by treating them as a possible value with its own branch.
3.C: We add the Impute node to the diagram and connect it to the Data Partition node (Appendix – Figure 3.C). The Impute function creates a synthetic value for each missing value. We impute the letter ‘U’ for missing class variable values and use the mean of the variable to impute missing interval variable values.
Figure 3.C – Results of Impute.
We then create imputation indicators for all imputed inputs. This creates a new variable that indicates whether a value in the corresponding original variable has been imputed.
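Outside Enterprise Miner, the same treatment can be sketched in Base SAS: indicator flags are derived from the raw values first, interval inputs are then mean-imputed with PROC STDIZE, and missing class values are filled with ‘U’. The variable lists below are assumptions based on the inputs named in the report and are not exhaustive.

/* Sketch: imputation indicators plus mean/constant imputation on the training half. */
data train_flagged;
  set train;
  M_DemAge  = missing(DemAge);          /* indicator: 1 if the original value was missing */
  M_DemAffl = missing(DemAffl);
  IMP_DemAge  = DemAge;                 /* copies that will hold imputed values, mirroring EM's IMP_ prefix */
  IMP_DemAffl = DemAffl;
  length IMP_DemGender $ 8;
  IMP_DemGender = coalescec(DemGender, 'U');   /* constant 'U' for missing class values */
run;

proc stdize data=train_flagged out=train_imputed method=mean reponly;
  var IMP_DemAge IMP_DemAffl;           /* REPONLY replaces only the missing values with the mean */
run;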
3.D: We add a Variable Clustering node to the Impute node to group similar variables together into clusters (Appendix – Figure 3.D). By virtue of the Impute function, new input variables have been created containing the imputed values for the missing values; similarly, new indicator variables have also been created. Variable Clustering enables us to reduce redundancy and ensure better regression-based modelling, as this type of modelling is better suited to a smaller number of input variables.
Each cluster is represented by an individual input variable. We opt for Best Variable as the criterion for selecting the cluster representative: under this method, the variable with the lowest 1−R² ratio within the cluster is selected as the cluster representative.
As a result, the 52 variables (after adding imputed and indicator variables) are grouped together to form 24 clusters. The cluster representatives of the 24 clusters will then be used as the inputs for the Regression model.
Figure 3.D (2) – Representation of Variable Clustering in various forms.
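A code-level analogue of the Variable Clustering node, offered only as a sketch: PROC VARCLUS groups numeric inputs into clusters and reports the 1−R² ratios used to pick the best variable (the lowest ratio) in each cluster. The variable list is a placeholder for the full set of imputed inputs and indicators.

/* Sketch: cluster the numeric imputed inputs and indicator flags. */
proc varclus data=train_imputed maxeigen=1 short;
  var IMP_DemAffl IMP_DemAge M_DemAffl M_DemAge;   /* illustrative subset of the 52 inputs */
run;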
3.E: We add a Regression node to the Variable Clustering node and change the Selection Model to Stepwise and the Selection Criterion to Validation Error. Under the stepwise method, variables are added to the model one by one; at each step, any variable already in the model whose significance falls below the stay threshold is removed. Selection stops when no further variable meets the entry or stay significance levels, and the final model is then chosen by the selection criterion (Validation Error in this case).
Figure 3.E – Adding Regression node and changing its properties.
The measurement level of the target variable determines the type of regression model that will be applied. For our given case, a Logistic Regression model will be used as our target variable is binary.
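The Regression node’s stepwise logistic fit can be approximated in code with PROC LOGISTIC. This is a sketch under assumptions: base PROC LOGISTIC selects on significance levels rather than validation error, and the input list is limited to a few of the cluster representatives named in the report.

/* Sketch: stepwise logistic regression on the imputed training data. */
proc logistic data=train_imputed;
  class IMP_DemGender (param=ref ref='U');
  model TargetBuy(event='1') = IMP_DemAffl IMP_DemAge IMP_DemGender
                               M_DemAffl M_DemAge
        / selection=stepwise slentry=0.05 slstay=0.05;
run;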
3.F.1: The variables included in the final model of the Regression modelling are:
IMP_DemAffl – Imputed Affluence Grade.
IMP_DemAge – Imputed Age.
M_DemGender0 – Indicator Gender.
IMP_DemGenderM – Imputed Gender.
M_DemAffl0 – Indicator Affluence Grade.
M_DemAge0 – Indicator Age.
The regression modelling indicates that the above-mentioned variables have a significant degree of association with our target variable. The list of variables is in order of their stepwise addition to the model.
Regression models enable us to estimate the relationship between the input variables and the target variable. Logistic regression helps us express the association between the inputs and the target variable in terms of odds (purchaser vs non-purchaser).
The higher management of the supermarket can interpret the above-mentioned associations as follows:
Figure 3.F.1 – Variables in final Logistic Regression model.
With other variables held constant, a one-unit increase in IMP_DemAffl will result in a 0.2530-unit increase in the log odds of being a purchaser (vs non-purchaser). On the other hand, with other variables constant, a one-unit increase in IMP_DemAge will decrease the log odds by 0.0550. Similarly, with other variables constant, a one-unit increase in IMP_DemGenderM, M_DemAffl0 and M_DemAge0 will decrease the log odds of being a purchaser by 0.7277, 0.3657 and 0.2345 units respectively, while a one-unit increase in M_DemGender0 will increase the log odds of being a purchaser by 1.41 units.
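These log-odds coefficients are easier to communicate as odds ratios, e^{\beta}. Working the arithmetic (my calculation, not a figure from the output):

e^{0.2530} \approx 1.29, \qquad e^{-0.0550} \approx 0.95, \qquad e^{1.41} \approx 4.10

so, for example, each additional unit of affluence grade multiplies the odds of purchase by roughly 1.29, while each additional year of age multiplies them by roughly 0.95.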
3.F.2: The most important variables influencing the target variable are M_DemGender0, IMP_DemGenderM and M_DemAffl0 respectively. These variables are ranked by the absolute values of their coefficients.
The absolute coefficient indicates how strongly a variable is related to the target variable: the larger the absolute coefficient, the stronger the association, with the sign of the coefficient showing whether the relationship is positive or negative. An absolute coefficient near zero implies the weakest relation between the input variable and the target variable for the given scenario.
Figure 3.F.2 – Bar-chart and tabular form of most important variables for Regression modelling.
The bar chart, interpreted as a whole, suggests that female customers who are relatively young and have a high affluence grade tend to purchase more organic products.
3.F.3: The validation Average Square Error (ASE) for the Regression model is 0.141805. ASE helps gauge which model makes errors less often; a lower ASE indicates a better model in terms of its prediction capabilities. ASE is calculated by squaring and averaging the differences between the predicted and actual outcomes for each model.
The ASE is lowest at model selection step number 6, which implies this is the optimal model, as shown in the chart below.
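If the scored validation data were exported, the reported validation ASE could be recomputed with a short data step. The dataset name scored_validate and the predicted-probability column P_TargetBuy1 are assumed names for this sketch, and this simple per-record average may differ slightly from the node’s exact bookkeeping.

/* Sketch: recompute validation ASE from a scored dataset (assumed names). */
data _null_;
  set scored_validate end=last;
  sse + (TargetBuy - P_TargetBuy1)**2;  /* accumulate squared error per record */
  n   + 1;
  if last then do;
    ase = sse / n;
    put "Validation ASE = " ase 8.6;
  end;
run;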
Figure 3.F.3 – Validation Average Square Error of Logistic Regression model.
4. Open ended discussion (20%)
4.A: A Model Comparison node is added to the workspace and the decision tree & regression nodes are attached to it. This node enables us to compare the models and their predictive capabilities.
Figure 4.A – Adding Comparison node to the workspace.
We use three metrics to ascertain which model outperformed the other models on the validation dataset. These metrics are: Cumulative Lift, ROC Curve and Fit Statistics (Average Square Error and Misclassification Rate).
Cumulative Lift: The lift curve measures the effectiveness of a model by comparing the results obtained with and without the predictive model. The first Decision Tree (3.42) performs marginally better than the second Decision Tree (3.40) on cumulative lift at the 5th percentile, while the regression model gives a lift of 3.34 at the same depth.
Figure 4.A – Comparing the three models on Cumulative Lift Chart.
This is interpreted as follows: the top 5% of customers picked by the first Decision Tree model are 3.42 times more likely to purchase organic products than 5% of customers picked at random.
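In formula terms (the standard definition, not something specific to Enterprise Miner):

\mathrm{Lift}(d) = \frac{\text{response rate among the top } d\% \text{ of customers ranked by predicted probability}}{\text{overall response rate}}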
ROC Curve: The ROC curve is a graph that plots sensitivity (the true positive rate) against 1 − specificity (the false positive rate) at varying cut-off points of a diagnostic test. A model whose curve is closer to the top-left corner predicts more accurately, while a curve that lies close to the diagonal baseline is a less accurate predictor. From a business context, the supermarket would naturally want to adopt the model whose curve rises most steeply towards high sensitivity.
Figure 4.A – Comparing the three models on ROC Curve.
The second Decision Tree’s curve is the closest to the top-left, very closely followed by the first Decision Tree, while the regression curve is closer to the baseline. This indicates the second Decision Tree is the most accurate model, followed by the first Decision Tree and the Regression model respectively.
Fit Statistics: The fit statistics table contains numerous model fit statistics which describe how well a model fits the data. We compare the three models on the basis of their Average Square Error and Misclassification Rate.
Average Square Error (ASE) measures the difference between the predicted and actual outcomes of a model. Misclassification rate refers to the inaccurate classifications that arise when a model assigns a probability of higher than 50% that a segment of customers falls under a group (‘1’ or ‘0’) and all customers of that segment are classified accordingly, but in reality a fraction of those customers may not fall under the said classification.
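With the usual 0.5 cut-off this is simply the share of cases the model labels wrongly:

\text{Misclassification rate} = \frac{FP + FN}{N}

where FP and FN are the counts of false positives and false negatives among the N validation records.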
Figure 4.A – Fit Statistics of the first Decision Tree, second Decision Tree and Regression model respectively.
As per the misclassification rate, the first Decision Tree is the best model as it produces the lowest misclassification error (0.185), while by ASE the second Decision Tree (0.1326) is the best of the three, beating the first Decision Tree very narrowly (0.1327). The Regression model produces the most errors of the three models on both metrics.
Conclusion: On the basis of the above comparison, it is clear that the decision trees outperform the regression model for the given data. However, there is little to separate the two Decision Tree models, with both performing at around the same level on the various parameters. Assessing strictly by performance on the metrics, the first Decision Tree trumps the second by the slightest of margins.
4.B: A Decision Tree enables us to classify, while Regression enables us to find the degree of association between the input variables and the target variable. With the help of the Decision Tree, we could classify the Organics dataset into purchasers (i.e. ‘1’) and non-purchasers (i.e. ‘0’) by a set of rules. While the regression could detect the relationship pattern between the input variables & the target variable, it fails to detect local patterns that may exist in sections of the data. Such patterns were unearthed by the decision tree during our modelling.
Moreover, the Decision Trees comprehensively performed better than the Regression model in terms of errors produced and lift achieved. Hence, for our given case, using only decision tree modelling would have been sufficient.
4.C: Advantages of Decision Tree -
Classification: A Decision Tree helps in understanding the path to a decision. Each leaf creates a segment and states its attributes, which can be interpreted as the set of rules that particular segment follows to the final outcome (either ‘1’ or ‘0’). The first decision tree model created 29 leaves, with each leaf signifying a segment whose attributes define whether the customer will purchase the organic products or not. Decision Trees produce models that explain how they work and are easy to understand.
Figure 4.C – First Decision Tree.
Detecting patterns: A decision tree can detect patterns between sections of variables which other modelling techniques like Regression may not be able to detect. As evident from the above image, the Decision Tree detects the relationship between Age, Affluence and Gender while modelling the target variable.
Data Exploration: Decision Trees are also very useful for data exploration, as they can pick out the important variables which predict the target from a huge pool of input variables. Our Decision Tree picked out Age, Affluence Grade and Gender as such important variables out of the nine input variables selected from the data source.
Advantages of Regression –
Relationship: Regression modelling enables us to find the relationship between a combination of variables and the target variable. This type of modelling can also describe the degree of association between the variables as a mathematical function. The coefficients describe whether an input variable is strongly or inversely related to the target variable.
Pattern: Regression can detect patterns among the variables across the entire dataset. Through our analysis, we detected the patterns and interactions between the input variables (Age, Gender, Affluence Grade) and the target variable.
Estimation: Logistic Regression estimates the probability (odds) of being a purchaser from a weighted sum of the input variables’ attributes. Similarly, linear or multivariate regression could be used for estimation and prediction if we were answering a different question. For example, if the supermarket wanted to know which customers spend the most money, we could have used PromSpend as the dependent variable and conducted a multivariate regression to find the answer.
5. Extending current knowledge with additional reading (15%)
A) Just getting things wrong
Addressing the wrong issue can render the entire data mining process futile, as little or no business value can be derived from the solutions obtained from the modelling.
As such, understanding the business problem and addressing the correct issue becomes imperative. Such a situation could arise at the supermarket if higher management frames the wrong question or attempts to solve the wrong problem. In our case, this could happen if an incorrect target variable is selected. Similarly, wrongly rejecting a potentially important variable could lead to the creation of a skewed model. These issues can be mitigated by putting the project in the charge of a person with the right domain expertise to complement the technical know-how.
Erroneous interpretation of the binary ‘0’ and ‘1’ by team members could potentially lead to complete failure of the modelling. Such incidents can be avoided through transparent communication across the hierarchy of the supermarket and standardization of processes and rules.
B) Overfitting
Overfitting is a phenomenon which occurs when a model learns the details, noise or fluctuations in the training set too well. Such a model memorizes those details rather than learning generalizable patterns, and then applies them to new data where they aren’t applicable, leading to poor predictive performance.
Whether a model is overfitting can be assessed by evaluating the Average Square Error and misclassification rate graphs. If training performance improves while validation performance deteriorates, in terms of Average Square Error or misclassification rate, as the complexity of the model increases, the model shows signs of overfitting.
Overfitting can arise in our model created for the supermarket if the model recognizes or memorizes spurious patterns which aren’t applicable to other data sources. Overfitting can also appear when the training set allocated for our supermarket analysis is not sufficiently large. Overfitting leads to an unstable model that may perform well on some occasions and not on others.
Such a model can be avoided by including only those variables which have predictive value. Additionally, allocating an optimal proportion of the dataset for training can help combat the problem of overfitting.
C) Sample bias
Sampling bias occurs when a sample does not accurately reflect the parent population. The Organics dataset contains details of customers with loyalty cards who purchased organic products when incentivised with coupons. Customers with loyalty cards may not be an accurate representation of the total customer base. Further, the effect of coupons on a customer’s purchasing likelihood has to be taken into consideration when predicting future responders to the organic products.
This can lead to the creation of a biased model, as the model will learn attributes from a biased sample (in this case, customers with loyalty cards) which may not apply in reality (customers as a whole). Additionally, disregarding the effect of coupons on a customer’s purchasing probability will only lead to inaccurate predictions by the model.
In order to accurately predict responders to organic products, a new database should be created comprising details of all types of customers (i.e. with and without loyalty cards) and a variable for coupons. A model should then be created on this database to make more informed classifications and predictions.
D) Future not being like the past
Predictive modelling uses data from the past as the basis for predictions, classifications and estimations about the future. However, many factors must be taken into account, such as the time frame of the data, seasonality of the business, changes in market conditions and so on.
As such, the Organics data shouldn’t be from too far in the past, as it may not reflect today’s scenario. Additionally, the relationship between sales of organic products and seasonal behaviour should be explored so as to ensure the model doesn’t overfit.
To make the model more reflective of the current situation, modelling must be conducted on recent data. Continuous efforts must be directed towards making the model more robust by iterating/feeding it with real-time data.
6. Appendix:
1.A.3:
Figure 1.A.3 (2) – Bar-Chart Distribution of TargetBuy
2.A:
Figure 2.A - Data Partition: 50% for Training & 50% for Validation
2.B:
Figure 2.B – Adding Decision Tree node.
2.C.1:
Figure 2.C.1 – Using Average Square Error as Assessment Measure for First Decision Tree.
2.C.2:
Figure 2.C.2 – First Decision Tree.
2.D.1:
Figure 2.D.1 – Changes in Maximum Number of Branches for Second Decision Tree.
2.D.2:
Figure 2.D.2 – Using Average Square Error as Assessment Measure for Second Decision Tree.
2.D.3:
Figure 2.D.3 – Nodes 36, 37, 38 & 39 of first Decision Tree marked in black while Node 32 marked in white.
Figure 2.D.3 (2) – Node 7 marked in white and Nodes 17 & 18 marked in black in the second Decision Tree.
3.A:
Figure 3.A – Adding StatExplore tool to Organics diagram.
3.C:
Figure 3.C – Adding Impute node and changing functions in the property panel.
3.D:
Figure 3.D – Adding Variable Clustering node and changing its property.