Data Mining Class Assignment 2
Msc Data Analytics
Trushita Redij
Student ID: 10504099
Dublin Business School
Supervisor: Terri Hoare
Dublin Business School
Assignment Submission Sheet
Msc Data Analytics
Student Name: Trushita Redij
Student ID: 10504099
Programme: Msc Data Analytics
Year: 2019
Supervisor: Terri Hoare
Submission Due Date: 18/12/2019
Project Title: Data Mining Class Assignment 2
Word Count: 1573
Page Count: 12
Contents
1 Introduction
1.1 Crisp DM Model
1.2 Rapid Miner Tool
2 Business Understanding
2.1 Problem Statement
3 Data Understanding
4 Data Preparation
4.1 Steps
4.1.1 Steps for Data Preparation
5 Modelling
5.1 RM: Auto Model
5.2 Linear Regression
5.2.1 Results of General Linear Model
5.2.2 Performance
6 Evaluation and Testing
6.1 Design
6.2 Performance
7 Visualization
7.0.1 Before Balancing
8 Conclusion
9 Timeline
1 Introduction
1.1 Crisp DM Model
The Cross-Industry Standard Process for Data Mining, known as CRISP-DM, is an open
standard process model that describes common approaches used by data mining experts.
It is the most widely used analytics model (Wik19a). Big Data has acquired considerable
importance in the field of analytics, and data-driven organisations are evolving with the
upsurge of the Internet of Things, social media, user clicks and digital transactions.
This assignment aims to solve a real-life business problem and to explore the functionality
of the Rapid Miner tool by gauging the performance of different machine learning approaches.
We selected the Black Friday Sales dataset of a retail company, which captures customer
purchase behaviour across products of different categories. We use the CRISP-DM
methodology, with Rapid Miner for implementation, to derive the best-fit model for our
dataset.
Figure 1: CRISP-DM Methodology
1.2 Rapid Miner Tool
Rapid Miner is a data science software platform developed by the company of the same
name. It provides an integrated environment for data preparation, machine learning,
deep learning, text mining and predictive analytics. It is used for business and
commercial applications as well as for research, education, training, rapid prototyping
and application development, and it supports all steps of the machine learning process,
including data preparation, results visualization, model validation and optimization (Wik19b).
Rapid Miner simplifies many otherwise scattered data mining tasks. We can create
processes, load data, pre-process and prepare it using various methods, train models,
cluster, prune outliers, predict outputs and visualize results. Furthermore, the machine
learning models can be ported to web, iOS and Android.
Figure 2: Courtesy: Rapid Miner Website
2 Business Understanding
In recent years, Black Friday has gained considerable importance. Brands advertise large
discounts, sales and offers to attract customers and increase profit by maximising sales.
2.1 Problem Statement
A retail company wants to evaluate customer purchase behaviour and trends across
product categories. The dataset summarises customer purchases for the products that
sold in the highest volumes.
Figure 3: Courtesy: Under30ceo website
The dataset describes customer demographics such as age, gender, marital status,
city, length of stay in the current city, and the total purchase amount for the previous
month. The company wants to predict the purchase amount for its products so that
it can tailor discounts and offers for customers across different products.
3 Data Understanding
The dataset contains 550069 observations and 12 variables. We use the Turbo Prep
feature of Rapid Miner to view the data and its structure. Turbo Prep is a Rapid Miner
feature that provides an interactive environment for data preparation.
• Load the data into RM and inspect it.
• View the data and analyse its structure.
Figure 4: Data Understanding
A histogram and summary statistics are shown at the top of each feature, together with
an indicator of its quality. Product category 1 and product category 2 had red indicators,
from which we infer that these columns are not suitable for machine learning as they
stand.
4 Data Preparation
Raw data contains many discrepancies, inconsistencies, errors and missing values, which
need to be handled before the data is processed by the machine.
4.1 Steps
Figure 5: Data Preprocessing Steps
4.1.1 Steps for Data Preparation
Raw data is often fetched from multiple sources in different formats, so it is
important to structure the data prior to processing. Data quality is affected by factors
such as human error, faulty measuring devices and redundancy in the methods of
collecting data.
In this step we focus primarily on improving the quality of the data by fixing the issues
listed below:
1. Missing Value:
In our sales dataset, we considered the relationship of product category with the
purchase amount and replaced the null values with a space instead of dropping the
two columns.
Figure 6: Cleanse Data
2. Convert Numeric to Polynominal:
In order to standardize our data types, we convert the numeric values to polynominal,
Rapid Miner's type for nominal (categorical) attributes with more than two values.
Figure 7: Convert Data
3. Splitting Data
We split the dataset in a 70:30 ratio into a Train set and a Test set. The Train set is
used to build models with different algorithms, gauge their performance and select
the best fit.
Figure 8: Data Splitting
4. Balancing Data
This step balances the data by keeping an equal number of observations in each of
the three Purchase groups. Balancing the dataset helps produce a more reliable
output.
Figure 9: Data Balancing
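The preparation steps above were carried out with Rapid Miner operators. As a rough illustration only, the same ideas (replacing missing values, a 70:30 split, and balancing by downsampling) can be sketched in plain Python. The column names, the toy rows, the downsampling strategy and the helper functions are all assumptions made for this sketch and are not taken from the Rapid Miner process.

```python
import random
from collections import Counter, defaultdict


def fill_missing(rows, column, placeholder=" "):
    """Replace missing (None) values in `column` instead of dropping the column."""
    return [
        {**row, column: placeholder if row[column] is None else row[column]}
        for row in rows
    ]


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def balance_by_downsampling(rows, label, seed=42):
    """Sample every class down to the size of the smallest class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Hypothetical toy data: an imbalanced target and some missing category values.
toy_labels = ["High"] * 10 + ["Med"] * 30 + ["Low"] * 60
rows = [
    {"Product_Category_2": None if i % 3 == 0 else i % 5, "Purchase": label}
    for i, label in enumerate(toy_labels)
]

rows = fill_missing(rows, "Product_Category_2")
train, test = train_test_split(rows, train_fraction=0.7)
balanced_train = balance_by_downsampling(train, "Purchase")

print(len(train), len(test))
print(Counter(row["Purchase"] for row in balanced_train))
```

Downsampling is only one way to balance classes; it discards observations from the larger groups, which may or may not match what the Rapid Miner process did.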
5 Modelling
This phase selects and applies the modelling techniques and calibrates their parameters
to optimal values. Data problems are often encountered during modelling, or ideas
emerge for constructing new data, so modelling is closely linked to data preparation.
We used Rapid Miner Auto Model to model our data.
5.1 RM: Auto Model
This feature of Rapid Miner accelerates data science by automating machine learning:
it transforms the data, trains and compares candidate models, and surfaces actionable
insights.
1. Load Data: We loaded the pre-processed dataset and selected prediction based
on classification.
2. Select Target: The purchase column is the target variable, which has three classes
'High', 'Med' and 'Low' according to the purchase range.
3. Select Input: We selected 'Age', 'Gender', 'Product Category', 'Occupation' and
'Marital Status' as our input variables.
4. Model Types: We selected Naive Bayes, Logistic Regression, General Linear
Model, Decision Tree and SVM to find the best fit.
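The target in step 2 is a three-class version of the purchase amount. The report does not state the cut points used, so the sketch below assumes equal-frequency (tercile) cut points purely for illustration; the function names and the sample amounts are hypothetical.

```python
def tercile_cut_points(values):
    """Return two cut points that split the sorted values into three equal parts."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def purchase_class(amount, low_cut, high_cut):
    """Map a purchase amount to 'Low', 'Med' or 'High'."""
    if amount < low_cut:
        return "Low"
    if amount < high_cut:
        return "Med"
    return "High"


# Hypothetical purchase amounts, not taken from the dataset.
purchases = [1200, 3400, 5600, 7800, 8100, 9500, 11000, 15000, 19000]
low_cut, high_cut = tercile_cut_points(purchases)
labels = [purchase_class(p, low_cut, high_cut) for p in purchases]
print(low_cut, high_cut)
print(labels)
```

Equal-frequency cut points give classes of similar size by construction; fixed monetary thresholds would be an equally plausible choice and would leave the classes imbalanced.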
Output:
The image below shows the accuracy of all selected models.
Figure 10: Auto Model Result
The accuracy of Deep Learning is 68.4 percent and that of the General Linear Model is
67.4 percent, both higher than the remaining models. Since we are predicting Purchase,
we selected the General Linear Model, referred to as Linear Regression in the rest of
this report, as the best-fitting model.
5.2 Linear Regression
Linear Regression models the relationship between a dependent variable and a set of
independent variables.
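As a minimal illustration of that relationship, the sketch below fits a straight line to a single independent variable by ordinary least squares. It is a generic textbook example with made-up numbers, not the model Rapid Miner fitted, and the function name is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```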
5.2.1 Results of General Linear Model
Figure 11: Linear Model Result
5.2.2 Performance
The figure below shows the performance of the linear model, with the 'Purchase' variable
(classified as 'High', 'Med' or 'Low') as the target and the remaining variables as
independent variables.
The confusion matrix shows the class precision and class recall for each of the three
classes.
The accuracy of the model is 67.4 percent, with a classification error of 32.6 percent.
Figure 12: Linear Model Performance
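To make the relationship between the confusion matrix and the reported metrics concrete, the sketch below computes accuracy, classification error, class precision and class recall from a three-class matrix. The counts are invented for illustration and are not the values shown in Figure 12; the row/column convention is stated in the code and is an assumption of this sketch.

```python
def classification_metrics(matrix, labels):
    """Compute accuracy plus per-class precision and recall.

    Convention assumed here: matrix[i][j] is the number of observations whose
    true class is labels[i] and whose predicted class is labels[j].
    """
    size = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(size))
    accuracy = correct / total
    per_class = {}
    for k, label in enumerate(labels):
        predicted_k = sum(matrix[i][k] for i in range(size))  # column sum
        actual_k = sum(matrix[k])                             # row sum
        per_class[label] = {
            "precision": matrix[k][k] / predicted_k if predicted_k else 0.0,
            "recall": matrix[k][k] / actual_k if actual_k else 0.0,
        }
    return accuracy, per_class


# Invented counts for illustration only.
labels = ["High", "Med", "Low"]
matrix = [
    [70, 20, 10],
    [15, 60, 25],
    [5, 25, 70],
]
accuracy, per_class = classification_metrics(matrix, labels)
classification_error = 1 - accuracy
print(round(accuracy, 3), round(classification_error, 3))
print(per_class)
```

Accuracy and classification error always sum to one, which is consistent with the 67.4 and 32.6 percent figures reported above.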
6 Evaluation and Testing
We applied Linear Regression to the Test dataset to check its results and performance
on unseen data. The Test dataset was used as the input to derive the results and gauge
the performance.
6.1 Design
Figure 13: Auto Model Design for Testing
The figure below shows the model performance on the 'Test' dataset. It depicts the
precision and recall based on the 'Purchase' variable.
6.2 Performance
Figure 14: Auto Model performance on Test Dataset
The performance of the model on the Test dataset was highly satisfactory, and thus Linear
Regression can be used for prediction with the 'Purchase' variable as the target.
7 Visualization
The figure below shows a bar graph of 'Purchase' vs 'Product Category 2'.
It illustrates the distribution of Product Category 2 across the Purchase groups 'High',
'Med' and 'Low'.
Figure 15: Purchase and Product Category 2
7.0.1 Before Balancing
The figure below illustrates the imbalance in the observations for the Purchase variable
and the need for balancing the records.
Figure 16: Before Balancing
8 Conclusion
We explored two features of Rapid Miner, 'Turbo Prep' and 'Auto Model'.
We used Turbo Prep to load our data and to process and cleanse it for machine learning
tasks.
We fitted different machine learning models and compared them, and selected the
General Linear Model (Linear Regression) as the best fit among them.
Finally, we explored the efficiency of the Rapid Miner tool and the advantage of using
it for quick inferences and results.
9 Timeline
Figure 17: Project Timeline
References
[Wik19a] Wikipedia contributors, “Cross-industry standard process for data mining
— Wikipedia, the free encyclopedia,” 2019, [Online; accessed 18-December-2019].
Available: https://en.wikipedia.org/w/index.php?title=Cross-industry_standard_process_for_data_mining&oldid=930958276
[Wik19b] Wikipedia contributors, “Rapidminer — Wikipedia, the free encyclopedia,” 2019,
[Online; accessed 18-December-2019].
Available: https://en.wikipedia.org/w/index.php?title=RapidMiner&oldid=921794576

More Related Content

Similar to Black_Friday_Sales_Trushita

MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaRahul Bhatia
 
thesis_jinxing_lin
thesis_jinxing_linthesis_jinxing_lin
thesis_jinxing_linjinxing lin
 
BI Project report
BI Project reportBI Project report
BI Project reporthlel
 
Loan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersLoan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersIRJET Journal
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Roger Barga
 
Setanta Systems - Supply Chain Report and Analyses Module
Setanta Systems - Supply Chain Report and Analyses ModuleSetanta Systems - Supply Chain Report and Analyses Module
Setanta Systems - Supply Chain Report and Analyses ModuleSeabrook Technology Group
 
Description Marks out of Wtg() Word Count Due d.docx
Description Marks out of Wtg() Word Count Due d.docxDescription Marks out of Wtg() Word Count Due d.docx
Description Marks out of Wtg() Word Count Due d.docxtheodorelove43763
 
Egypt hackathon 2014 analytics & spss session
Egypt hackathon 2014   analytics & spss sessionEgypt hackathon 2014   analytics & spss session
Egypt hackathon 2014 analytics & spss sessionM Baddar
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and EngineeringVijayananda Mohire
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMashfiq Shahriar
 
A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...IRJET Journal
 
Bank Customer Segmentation & Insurance Claim Prediction
Bank Customer Segmentation & Insurance Claim PredictionBank Customer Segmentation & Insurance Claim Prediction
Bank Customer Segmentation & Insurance Claim PredictionIRJET Journal
 
A presentation for Retail Sales Projects
A presentation for Retail Sales ProjectsA presentation for Retail Sales Projects
A presentation for Retail Sales ProjectsAmjad Raza, Ph.D.
 
IS 2 Long Report Pardeep kumar 1271107
IS 2  Long Report Pardeep kumar  1271107IS 2  Long Report Pardeep kumar  1271107
IS 2 Long Report Pardeep kumar 1271107TouchPoint
 
Quant Foundry Labs - Low Probability Defaults
Quant Foundry Labs - Low Probability DefaultsQuant Foundry Labs - Low Probability Defaults
Quant Foundry Labs - Low Probability DefaultsDavidkerrkelly
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptopRising Media, Inc.
 
Data warehouse project on retail store
Data warehouse project on retail storeData warehouse project on retail store
Data warehouse project on retail storeSiddharth Chaudhary
 
Presentation Title
Presentation TitlePresentation Title
Presentation Titlebutest
 

Similar to Black_Friday_Sales_Trushita (20)

MIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_BhatiaMIS637_Final_Project_Rahul_Bhatia
MIS637_Final_Project_Rahul_Bhatia
 
thesis_jinxing_lin
thesis_jinxing_linthesis_jinxing_lin
thesis_jinxing_lin
 
BI Project report
BI Project reportBI Project report
BI Project report
 
Loan Analysis Predicting Defaulters
Loan Analysis Predicting DefaultersLoan Analysis Predicting Defaulters
Loan Analysis Predicting Defaulters
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
Setanta Systems - Supply Chain Report and Analyses Module
Setanta Systems - Supply Chain Report and Analyses ModuleSetanta Systems - Supply Chain Report and Analyses Module
Setanta Systems - Supply Chain Report and Analyses Module
 
Description Marks out of Wtg() Word Count Due d.docx
Description Marks out of Wtg() Word Count Due d.docxDescription Marks out of Wtg() Word Count Due d.docx
Description Marks out of Wtg() Word Count Due d.docx
 
Egypt hackathon 2014 analytics & spss session
Egypt hackathon 2014   analytics & spss sessionEgypt hackathon 2014   analytics & spss session
Egypt hackathon 2014 analytics & spss session
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Key projects Data Science and Engineering
Key projects Data Science and EngineeringKey projects Data Science and Engineering
Key projects Data Science and Engineering
 
Microsoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data ScienceMicrosoft Professional Capstone: Data Science
Microsoft Professional Capstone: Data Science
 
A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...A Machine learning based framework for Verification and Validation of Massive...
A Machine learning based framework for Verification and Validation of Massive...
 
Bank Customer Segmentation & Insurance Claim Prediction
Bank Customer Segmentation & Insurance Claim PredictionBank Customer Segmentation & Insurance Claim Prediction
Bank Customer Segmentation & Insurance Claim Prediction
 
Telecom Churn Analysis
Telecom Churn AnalysisTelecom Churn Analysis
Telecom Churn Analysis
 
A presentation for Retail Sales Projects
A presentation for Retail Sales ProjectsA presentation for Retail Sales Projects
A presentation for Retail Sales Projects
 
IS 2 Long Report Pardeep kumar 1271107
IS 2  Long Report Pardeep kumar  1271107IS 2  Long Report Pardeep kumar  1271107
IS 2 Long Report Pardeep kumar 1271107
 
Quant Foundry Labs - Low Probability Defaults
Quant Foundry Labs - Low Probability DefaultsQuant Foundry Labs - Low Probability Defaults
Quant Foundry Labs - Low Probability Defaults
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
 
Data warehouse project on retail store
Data warehouse project on retail storeData warehouse project on retail store
Data warehouse project on retail store
 
Presentation Title
Presentation TitlePresentation Title
Presentation Title
 

Recently uploaded

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 

Black_Friday_Sales_Trushita

  • 1. Data Mining Class Assignment 2 Msc Data Analytics Trushita Redij Student ID: 10504099 Dublin Business School Supervisor: Terri Hoare
  • 2. Dublin Business School Assignment Submission Sheet Msc Data Analytics Student Name: Trushita Redij Student ID: 10504099 Programme: Msc Data Analytics Year: 2019 Supervisor: Terri Hoare Submission Due Date: 18/12/2019 Project Title: Data Mining Class Assignment 2 Word Count: 1573 Page Count: 12
  • 3. Data Mining Class Assignment 2 Trushita Redij 10504099 Contents 1 Introduction 2 1.1 Crisp DM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Rapid Miner Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Business Understanding 3 2.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 Data Understanding 4 4 Data Preparation 4 4.1 Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4.1.1 Steps for Data Preparation . . . . . . . . . . . . . . . . . . . 5 5 Modelling 7 5.1 RM: Auto Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 5.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 5.2.1 Results of General Linear Model . . . . . . . . . . . . . . . . . . . 8 5.2.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 6 Evaluation and Testing 9 6.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 6.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 7 Visualization 10 7.0.1 Before Balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 8 Conclusion 11 9 Timeline 11 1
  • 4. 1 Introduction 1.1 Crisp DM Model Cross-industry standard process for data mining, known as CRISP-DM, is an open stand- ard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model (Wik19a). Big Data has acquired colossal import- ance in the field of analytics. Data driven organisation are evolving with the upsurge of Internet of things, social media, user clicks, digital transactions. This assignment aims at solving a real life business problem and explore the functionality of Rapid miner Tool to gauge the performance of different machine learning approaches. We have selected Black Friday Sales dataset of a retail company which highlights the customer purchase behaviour against various products of different categories. We aim to use CRISP-DM mothodology and Rapid Miner tool for implementation and derive the best fit model for our dataset. Figure 1: CRISP-DM Methodology 1.2 Rapid Miner Tool Rapid Miner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It is used for business and com- mercial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the machine learning process including data preparation, results visualization, model validation and optimization (Wik19b). Rapid Miner is a Data science Tool used for quick analysis of data. We can create pro- cesses, import data, predict the output. Furthermore, we can port the Machine learning models to web, iOS and Android. Various scattered tasks of data mining are simplified by Rapid Miner. We can load data, pre-process and prepare data using various methods, train models, cluster, prune outliers and visualize outputs. 2
  • 5. Figure 2: Courtesy: Rapid Miner Website 2 Business Understanding At recent, Black Friday has gained considerable importance. There are huge discounts, sales and offers advertised by the brands to attract customers and gain profit with max- imum sales margin. 2.1 Problem Statement A retail company wants to evaluate the customer purchase behaviour and trends against the product categories. The dataset highlights the purchase summary of customers for products which were sold in maximum amount. Figure 3: Courtesy: Under30ceo website The dataset describes customer demographics like age, gender, marital status, city, stay in current city and total purchase amount for previous month. The company wants to predict the purchase amount against the products wherein they can create discounts, offers for customers against different products. 3
  • 6. 3 Data Understanding The dataset contains 550069 observations and 12 variables. We use Turbo prep fea- ture of Rapid Miner to view our data and its structure. Turbo prep is an advanced functionality of rapid miner which provides environment for data preparation. • Load the data into RM and Inspect. • View the data and Analyse the structure. Figure 4: Data Understanding The histogram and stats distributions can be visualized at the top of each feature which indicates the quality. We could see that product category 1 and product category 2 had red indicators wherein we can infer that these columns are not suitable for Machine Learning. 4 Data Preparation Raw data has many discrepancies, inconsistency, errors, missing value which needs to be handled before it is parsed by the machine. 4.1 Steps Figure 5: Data Preprocessing Steps 4
  • 7. 4.1.1 Steps for Data Preparation Raw data is often fetched from multiple sources in different formats thus it becomes important to structurize the data prior to processing. Various factors are responsible for data quality like human error, measuring devices or redundancy in methods of collecting data. In this step we primarily focus on enhancing the quality of data by fixing the below mentioned issues: 1. Missing Value: In our sales Dataset , we considered the relationship of product category with the Figure 6: Cleanse Data purchase amount and replaced the Null values with space instead of dropping the two columns. 2. Convert Numeric to Polynomial: In order to Standardize our data we convert numeric values to polynomial. Figure 7: Convert Data 5
  • 8. 3. Splitting Data: We split the dataset in a 70:30 ratio into a Train set and a Test set. The Train set is used to build models with different algorithms, gauge their performance and select the best fit.
    Figure 8: Data Splitting
    4. Balancing Data: This step balances the data by keeping an equal number of observations in each of the three Purchase groups. Balancing the dataset helps us obtain a more satisfactory output.
    Figure 9: Data Balancing
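The four preparation steps above were carried out in Rapid Miner's Turbo Prep. For readers who want to see the logic outside the tool, here is a minimal standard-library sketch. It makes several assumptions that the report does not confirm: balancing is done by downsampling every class to the size of the smallest one, it is applied to the Train set after the split, and the column names and toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fill_missing(rows, columns, placeholder=" "):
    """Step 1: replace None/empty values with a placeholder instead of dropping columns."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            if new_row.get(col) in (None, ""):
                new_row[col] = placeholder
        cleaned.append(new_row)
    return cleaned


def numeric_to_nominal(rows, columns):
    """Step 2: cast the given columns to string labels (a nominal type)."""
    return [
        {key: (str(value) if key in columns else value) for key, value in row.items()}
        for row in rows
    ]


def split_train_test(rows, train_ratio=0.7, seed=42):
    """Step 3: shuffle reproducibly and split into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def balance_by_downsampling(rows, label_key, seed=42):
    """Step 4 (assumed method): sample every class down to the smallest class size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], smallest))
    return balanced


# Hypothetical toy data: 100 rows with an imbalanced Purchase class.
labels = ["Low"] * 60 + ["Med"] * 30 + ["High"] * 10
rows = [
    {
        "Occupation": i % 5,
        "Product_Category_2": None if i % 4 == 0 else i % 7,
        "Purchase": label,
    }
    for i, label in enumerate(labels)
]

rows = fill_missing(rows, ["Product_Category_2"])
rows = numeric_to_nominal(rows, ["Occupation", "Product_Category_2"])
train, test = split_train_test(rows)
balanced_train = balance_by_downsampling(train, "Purchase")
class_counts = Counter(row["Purchase"] for row in balanced_train)
```

After these calls, `class_counts` holds the same number of observations for every Purchase class present in the Train set, while `test` is left untouched so that it still reflects the original class distribution.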
  • 9. 5 Modelling
    This phase selects and applies modelling techniques and calibrates their parameters to optimal values. Data problems are often encountered while modelling, or ideas arise for constructing new data, so modelling is closely linked to data preparation. We used Rapid Miner Auto Model to model our data.
    5.1 RM: Auto Model
    This feature of Rapid Miner helps accelerate data science by automating Machine Learning. It transforms the data and generates actionable insights from it.
    1. Load Data: We loaded the pre-processed dataset and selected Prediction based on Classification.
    2. Select Target: The Purchase column is the target variable, which has three classes, ’High’, ’Med’ and ’Low’, according to the purchase range.
    3. Select Input: We selected ’Age’, ’Gender’, ’Product Category’, ’Occupation’ and ’Marital Status’ as our input variables.
    4. Model Types: We selected Naive Bayes, Logistic Regression, General Linear Model, Decision Tree and SVM to find the best fit.
    Output: The image below shows the accuracy of all selected models.
    Figure 10: Auto Model Result
    The accuracy of Deep Learning is 68.4 percent and that of the General Linear Model is 67.4 percent, both higher than the remaining models. Since we are predicting Purchase, we selected the General Linear Model, referred to below as Linear Regression, as the best-fitting model.
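The target in step 2 above is a three-class label derived from the purchase range. The report does not give the cut-off values, so the sketch below uses hypothetical thresholds (6000 and 12000) purely to show how a numeric purchase amount could be mapped to ’Low’, ’Med’ and ’High’.

```python
def purchase_class(amount, low_cutoff=6000, high_cutoff=12000):
    """Map a numeric purchase amount to a class label.

    The cut-offs are hypothetical; the report does not state the ranges used.
    """
    if amount < low_cutoff:
        return "Low"
    if amount < high_cutoff:
        return "Med"
    return "High"


# Hypothetical purchase amounts.
amounts = [1500, 5999, 6000, 11999, 12000, 20000]
classes = [purchase_class(a) for a in amounts]
```

Fixed cut-offs like these usually produce classes of unequal size, which is why a balancing step is useful before modelling.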
  • 10. 5.2 Linear Regression
    Linear Regression models the relationship between a dependent variable and a set of independent variables.
    5.2.1 Results of General Linear Model
    Figure 11: Linear Model Result
    5.2.2 Performance
    The figure below shows the performance of the Linear model with ’Purchase’, classified as ’High’, ’Med’ or ’Low’, as the target variable and the other variables as independent variables. The confusion matrix shows the class precision and class recall across the three groups. The accuracy of the model is 67.4 percent, with a classification error of 32.6 percent.
    Figure 12: Linear Model Performance
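The figures reported above are all derived from the confusion matrix: accuracy is the share of correct predictions, classification error is one minus accuracy (which is why 67.4 and 32.6 percent sum to 100), and class precision and recall are computed per column and per row. The sketch below shows these calculations on a hypothetical 3x3 matrix; the counts are invented for illustration and are not the report's results.

```python
def classification_report(matrix, labels):
    """Compute accuracy, error, and per-class precision and recall.

    matrix[i][j] is the number of rows whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    size = len(labels)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    accuracy = correct / total
    report = {"accuracy": accuracy, "classification_error": 1 - accuracy}
    for j, label in enumerate(labels):
        predicted = sum(matrix[i][j] for i in range(size))  # column total
        actual = sum(matrix[j])                             # row total
        report[label] = {
            "precision": matrix[j][j] / predicted if predicted else 0.0,
            "recall": matrix[j][j] / actual if actual else 0.0,
        }
    return report


# Hypothetical counts (rows: true class, columns: predicted class).
labels = ["High", "Med", "Low"]
matrix = [
    [50, 10, 0],
    [5, 30, 5],
    [0, 10, 40],
]
report = classification_report(matrix, labels)
```

In this toy matrix 120 of 150 rows lie on the diagonal, so the accuracy is 0.8 and the classification error is 0.2.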
  • 11. 6 Evaluation and Testing
    We applied the Linear Regression model to the Test dataset to check its results and performance on new data. The Test dataset was used as input to derive the results and gauge the performance.
    6.1 Design
    Figure 13: Auto Model Design for Testing
    The figure below shows the model performance on the ’Test’ dataset, giving precision and recall for the ’Purchase’ variable.
    6.2 Performance
    Figure 14: Auto Model performance on Test Dataset
    The performance of the model on the Test dataset was highly satisfactory, so Linear Regression can be used for prediction with ’Purchase’ as the target variable.
  • 12. 7 Visualization
    The figure below is a bar graph of ’Purchase’ vs ’Product Category 2’. It shows the distribution of Product Category 2 across the classified Purchase groups ’High’, ’Med’ and ’Low’.
    Figure 15: Purchase and Product Category 2
    7.0.1 Before Balancing
    The figure below illustrates the imbalance in the observations for the Purchase variable and the need to balance the records.
    Figure 16: Before Balancing
  • 13. 8 Conclusion
    We explored two Rapid Miner features, ’Turbo Prep’ and ’Auto Model’. We used Turbo Prep to load our data, then processed and cleansed it for Machine Learning tasks. We fitted different Machine Learning models and compared them. We selected General Linear Regression as the best fit among them. Finally, we explored the efficiency of the Rapid Miner tool and the advantage of using it for quick inferences and results.
    9 Timeline
    Figure 17: Project Timeline
  • 14. References
    [Wik19a] Wikipedia contributors, “Cross-industry standard process for data mining — Wikipedia, the free encyclopedia,” 2019, [Online; accessed 18-December-2019]. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Cross-industry_standard_process_for_data_mining&oldid=930958276
    [Wik19b] Wikipedia contributors, “Rapidminer — Wikipedia, the free encyclopedia,” 2019, [Online; accessed 18-December-2019]. [Online]. Available: https://en.wikipedia.org/w/index.php?title=RapidMiner&oldid=921794576