This report presents predictive data analytics on the “Black Friday Sales Dataset”, in which a retail company wants to predict the purchase amount for its products, using the Rapid Miner tool.
1 Introduction
1.1 CRISP-DM Model
Cross-industry standard process for data mining, known as CRISP-DM, is an open standard
process model that describes common approaches used by data mining experts. It is
the most widely used analytics model (Wik19a). Big Data has acquired colossal importance
in the field of analytics. Data-driven organisations are evolving with the upsurge of
the Internet of Things, social media, user clicks and digital transactions.
This assignment aims to solve a real-life business problem and to explore the functionality
of the Rapid Miner tool to gauge the performance of different machine learning approaches.
We have selected the Black Friday Sales dataset of a retail company, which highlights
customer purchase behaviour against various products of different categories. We aim to
use the CRISP-DM methodology and the Rapid Miner tool for implementation and to derive the
best-fit model for our dataset.
Figure 1: CRISP-DM Methodology
1.2 Rapid Miner Tool
Rapid Miner is a data science software platform developed by the company of the same
name that provides an integrated environment for data preparation, machine learning,
deep learning, text mining, and predictive analytics. It is used for business and commercial
applications as well as for research, education, training, rapid prototyping, and
application development and supports all steps of the machine learning process including
data preparation, results visualization, model validation and optimization (Wik19b).
Rapid Miner is a data science tool used for quick analysis of data. We can create processes,
import data and predict the output. Furthermore, we can port the machine learning
models to web, iOS and Android. Rapid Miner simplifies the various scattered tasks of data
mining: we can load data, pre-process and prepare it using various methods,
train models, cluster, prune outliers and visualize outputs.
Figure 2: Courtesy: Rapid Miner Website
2 Business Understanding
In recent years, Black Friday has gained considerable importance. Brands advertise huge
discounts, sales and offers to attract customers and maximise their profit from sales.
2.1 Problem Statement
A retail company wants to evaluate customer purchase behaviour and trends across
its product categories. The dataset highlights the purchase summary of customers for
the products that were sold in the highest volume.
Figure 3: Courtesy: Under30ceo website
The dataset describes customer demographics such as age, gender, marital status,
city, length of stay in the current city, and the total purchase amount for the previous month.
The company wants to predict the purchase amount for its products so that
it can create discounts and offers for customers on different products.
3 Data Understanding
The dataset contains 550069 observations and 12 variables. We use the Turbo Prep feature
of Rapid Miner to view our data and its structure. Turbo Prep is an advanced
functionality of Rapid Miner that provides an environment for data preparation.
• Load the data into Rapid Miner and inspect it.
• View the data and analyse its structure.
Figure 4: Data Understanding
A histogram and summary statistics are shown at the top of each feature, together with
an indicator of its quality. Product category 1 and product category 2 had red
indicators, from which we infer that these columns are not suitable for Machine
Learning.
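The inspection step can be approximated outside Rapid Miner. The following is a minimal sketch, assuming the data is available as CSV text; the sample rows and column names are hypothetical and are not taken from the actual dataset. It reports the share of empty values per column, which is one of the things a quality indicator is likely to reflect.

```python
import csv
import io

# Hypothetical sample rows; column names are assumed for illustration only.
SAMPLE_CSV = """User_ID,Gender,Age,Product_Category_1,Product_Category_2,Purchase
1000001,F,0-17,3,,8370
1000001,F,0-17,1,6,15200
1000002,M,55+,8,,7969
1000003,M,26-35,1,2,15227
"""

def missing_ratio(rows, columns):
    """Return the fraction of empty values for each column."""
    total = len(rows)
    return {
        col: sum(1 for row in rows if row[col] == "") / total
        for col in columns
    }

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
ratios = missing_ratio(rows, reader.fieldnames)

for col, ratio in ratios.items():
    print(f"{col}: {ratio:.0%} missing")
```

Columns with a high missing ratio are candidates for cleansing before modelling.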
4 Data Preparation
Raw data has many discrepancies, inconsistencies, errors and missing values, which need to be
handled before the data is processed by the machine.
4.1 Steps
Figure 5: Data Preprocessing Steps
4.1.1 Steps for Data Preparation
Raw data is often fetched from multiple sources in different formats, so it is
important to structure the data prior to processing. Various factors affect
data quality, such as human error, measuring devices and redundancy in the methods of collecting
data.
In this step we primarily focus on enhancing the quality of the data by fixing the issues
listed below:
1. Missing Value:
In our sales dataset, we considered the relationship of product category with the
purchase amount and replaced the null values with a space instead of dropping the
two columns.
Figure 6: Cleanse Data
2. Convert Numeric to Polynominal:
To standardize our data, we convert numeric values to the polynominal (multi-valued nominal) type.
Figure 7: Convert Data
3. Splitting Data
We split the dataset in a 70:30 ratio into a train set and a test set. The train set is used
to build models with different algorithms, gauge their performance and select
the best fit.
Figure 8: Data Splitting
4. Balancing Data
This step balances the data by keeping an equal number of observations
in each of the three groups. Balancing the dataset helps us obtain a more
satisfactory output.
Figure 9: Data Balancing
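The four preparation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the Rapid Miner process itself: the function names and toy data are hypothetical, balancing is done here by downsampling to the smallest class, and it is applied to the train set only, since the report does not state either choice.

```python
import random
from collections import defaultdict

def fill_missing(rows, column, placeholder=" "):
    """Replace None in `column` with a placeholder instead of dropping the column."""
    return [{**r, column: placeholder if r[column] is None else r[column]} for r in rows]

def to_categorical(rows, column):
    """Turn numeric codes into string labels so they are treated as categories."""
    return [{**r, column: str(r[column])} for r in rows]

def split_70_30(rows, seed=42):
    """Shuffle and split rows into 70% train and 30% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]

def balance(rows, target, seed=42):
    """Downsample every class to the size of the smallest class (assumed strategy)."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[target]].append(r)
    smallest = min(len(g) for g in groups.values())
    rng = random.Random(seed)
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, smallest))
    return balanced

# Hypothetical toy data: 100 rows with an imbalanced three-class target.
labels = ["Low"] * 60 + ["Med"] * 30 + ["High"] * 10
data = [
    {"Product_Category_2": None if i % 4 == 0 else i % 5, "Purchase": label}
    for i, label in enumerate(labels)
]

data = fill_missing(data, "Product_Category_2")
data = to_categorical(data, "Product_Category_2")
train, test = split_70_30(data)
train_balanced = balance(train, "Purchase")
print(len(train), len(test), len(train_balanced))
```

Downsampling discards observations from the larger classes; oversampling the smaller classes is an alternative when the dataset is small.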
5 Modelling
This phase selects and applies the modelling techniques and calibrates their parameters to
optimal values. Often, data problems are encountered during modelling, or ideas arise
for constructing new data. Thus modelling is closely linked to data preparation.
We used Rapid Miner Auto Model to model our data.
5.1 RM: Auto Model
This feature of Rapid Miner helps accelerate data science by automating machine learning.
It transforms the data and generates actionable insights from it.
1. Load Data: We loaded the pre-processed dataset and selected prediction based
on classification.
2. Select Target: The purchase column is the target variable which has three classes
’High’, ’Med’, and ’Low’ according to the purchase range.
3. Select Input: We selected ’Age’, ’Gender’, ’Product Category’, ’Occupation’ and
’Marital Status’ as our input variables.
4. Model Types: We selected Naive Bayes, Logistic Regression, General Linear
Model, Decision Tree and SVM to find out the best fit.
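The three-class target in step 2 implies that the numeric purchase amount was binned into ranges. A minimal sketch of such a binning is shown here; the cut-off values `LOW_MAX` and `MED_MAX` are hypothetical, since the report does not state the purchase ranges that were used.

```python
# Hypothetical cut-offs: the report does not state the purchase ranges used.
LOW_MAX = 5000
MED_MAX = 10000

def purchase_class(amount):
    """Map a purchase amount to 'Low', 'Med' or 'High'."""
    if amount <= LOW_MAX:
        return "Low"
    if amount <= MED_MAX:
        return "Med"
    return "High"

amounts = [1200, 5000, 7500, 10001, 23000]
classes = [purchase_class(a) for a in amounts]
print(classes)
```

The choice of cut-offs determines how balanced the resulting classes are, which in turn affects the balancing step during data preparation.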
Output:
The image below depicts the accuracy of all selected models. The accuracy of Deep
Learning is 68.4 percent and that of the General Linear Model is 67.4 percent, which is higher than
the rest of the models. Since we are predicting the Purchase variable, we selected the General
Linear Model, referred to below as Linear Regression, as the best-fitting model.
Figure 10: Auto Model Result
5.2 Linear Regression
In Linear Regression we model the relationship between a dependent
variable and a set of independent variables.
5.2.1 Results of General Linear Model
Figure 11: Linear Model Result
5.2.2 Performance
The figure below highlights the performance of the linear model, with the ’Purchase’ variable,
classified as ’High’, ’Med’ and ’Low’, as the target and the remaining variables as the
independent variables.
The confusion matrix depicts the class precision and class recall across the three
classes.
The accuracy of the model is 67.4 percent, with a classification error of 32.6 percent.
Figure 12: Linear Model Performance
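The quantities reported in this section follow directly from a confusion matrix. The sketch below shows how accuracy, classification error, class precision and class recall are derived; the counts are hypothetical and are not the results from the report.

```python
CLASSES = ["High", "Med", "Low"]

# Hypothetical counts, not the report's results: matrix[true][predicted].
matrix = {
    "High": {"High": 50, "Med": 30, "Low": 20},
    "Med":  {"High": 10, "Med": 70, "Low": 20},
    "Low":  {"High": 5,  "Med": 15, "Low": 80},
}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[c][c] for c in CLASSES)
accuracy = correct / total
classification_error = 1 - accuracy

def class_precision(c):
    """Share of rows predicted as class c that truly belong to c."""
    predicted_as_c = sum(matrix[t][c] for t in CLASSES)
    return matrix[c][c] / predicted_as_c

def class_recall(c):
    """Share of rows truly in class c that were predicted as c."""
    actually_c = sum(matrix[c].values())
    return matrix[c][c] / actually_c

print(f"accuracy={accuracy:.3f} error={classification_error:.3f}")
for c in CLASSES:
    print(f"{c}: precision={class_precision(c):.3f} recall={class_recall(c):.3f}")
```

Accuracy and classification error always sum to 1, which is consistent with the 67.4 and 32.6 percent figures above.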
6 Evaluation and Testing
We applied Linear Regression to the Test dataset to check the results and performance on
unseen data. The Test dataset was used as the input to derive the results and gauge
the performance.
6.1 Design
Figure 13: Auto Model Design for Testing
The figure below highlights the model performance on ’Test’ dataset. It depicts the
precision and recall based on ’Purchase’ variable.
6.2 Performance
Figure 14: Auto Model performance on Test Dataset
The performance of the model on the Test dataset was highly satisfactory, and thus Linear
Regression can be used for prediction with the ’Purchase’ variable as the target.
7 Visualization
The figure below shows a bar graph of ’Purchase’ vs ’Product Category 2’.
It illustrates the distribution of Product Category 2 across the Purchase groups ’High’,
’Med’ and ’Low’.
Figure 15: Purchase and Product Category 2
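The numbers behind such a bar graph are a cross-tabulation of category against purchase class. A minimal sketch of that count is shown here, using hypothetical records rather than the actual dataset.

```python
from collections import Counter

# Hypothetical records: (Product Category 2 code, Purchase class).
records = [
    ("2", "High"), ("2", "Med"), ("2", "Med"),
    ("8", "Low"), ("8", "Low"), ("8", "High"),
    ("14", "Med"),
]

counts = Counter(records)
# Sort category codes numerically so that "14" comes after "8".
categories = sorted({cat for cat, _ in records}, key=int)

for cat in categories:
    row = {cls: counts[(cat, cls)] for cls in ("High", "Med", "Low")}
    print(cat, row)
```

Each printed row corresponds to one group of bars, with one bar per purchase class.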
7.0.1 Before Balancing
The figure below illustrates the imbalance in the observations for the Purchase variable
and the need for balancing the records.
Figure 16: Before Balancing
8 Conclusion
We explored two features of Rapid Miner, ’Turbo Prep’ and ’Auto Model’.
We used Turbo Prep to load our data, then processed and cleansed it for machine learning
tasks.
We fitted different machine learning models and compared them. We selected
General Linear Regression, considering it the best fit among them.
Finally, we explored the efficiency of the Rapid Miner tool and the advantage of using it for
quick inferences and results.
9 Timeline
Figure 17: Project Timeline
References
[Wik19a] Wikipedia contributors, “Cross-industry standard process for data mining -
Wikipedia, the free encyclopedia,” 2019, [Online; accessed 18-December-2019]. Available:
https://en.wikipedia.org/w/index.php?title=Cross-industry_standard_process_for_data_mining&oldid=930958276
[Wik19b] Wikipedia contributors, “Rapidminer - Wikipedia, the free encyclopedia,” 2019,
[Online; accessed 18-December-2019]. Available:
https://en.wikipedia.org/w/index.php?title=RapidMiner&oldid=921794576