Bigmart Sale Prediction
IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA020, 25-06-2019
Problem Statement
Data Exploration
head
Insights
• Item_Visibility contains 0.000 values, which are meaningless
• Item_Identifier is a string with a specific code
• Outlet_Size contains NaN values
Info
Insights
» 12 features: 5 numeric, 7 categorical
» Total number of entries: 8523
» Memory: ~800 KB
» Outlet_Size has null values (seen in the previous slide's data) even though all fields are expected to be non-null
Describe
Insights
» Data spans 1985 to 2009 (Outlet_Establishment_Year)
» Item_Visibility has a minimum value of 0.00
» Item_Weight has a count of less than 8523, so some values are missing
Duplicates
Insights
» Number of duplicates: 6964.
Possible reason: the same product can exist in multiple stores
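The command behind this slide is not preserved. A minimal sketch, assuming the count refers to repeated Item_Identifier values rather than fully identical rows, and using a toy frame in place of the real data:

```python
import pandas as pd

# Toy stand-in for the training data (values are illustrative only).
df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDA15", "NCD19", "DRC01", "FDA15"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT010", "OUT013", "OUT049", "OUT018"],
})

# Rows whose Item_Identifier already appeared earlier in the frame.
n_dup_items = df.duplicated(subset="Item_Identifier").sum()

# Fully identical rows (same item in the same outlet) would be true duplicates.
n_dup_rows = df.duplicated().sum()

print(n_dup_items, n_dup_rows)
```

If the second count is zero, the repeated identifiers are just the same product stocked in several outlets, which matches the reason given on the slide.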
Missing Values
Insights
» Item_Weight has 1463 missing values.
» Outlet_Size has 2410 missing values.
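A per-column count of missing values can be obtained as sketched below. The frame is a toy example, not the assignment data, so the counts differ from those on the slide:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (illustrative values only).
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Small"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Number of missing entries per column.
missing = df.isnull().sum()

# Show only the columns that actually have gaps.
print(missing[missing > 0])
```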
Univariate Analysis
Item_Type
Insights
» 16 different types.
» The Item_Type values could be reduced to fewer than 16 categories
Item_Fat_Content
Insights
» Regular is written in multiple ways: Regular, reg
» Low Fat is written as Low Fat, low fat and LF
» Replace the 5 labels with 2: Regular and Low Fat
Outlet_Size
Insights
» More Medium and Small size outlets
» Fewer High size outlets
Outlet_Location_Type
Insights
» Bigmart is present more in Tier 2 & Tier 3 than in Tier 1 cities
Outlet_Type
Insights
» Supermarket Type1 is the most common outlet type.
The other 3 types are of similar size
Heatmap
Numerical variables
Insights
» Item_Visibility has the lowest correlation with the target variable
» Item_MRP has a strong positive correlation with the target variable.
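The heatmap itself is not preserved, but the numbers behind such a plot come from a correlation matrix. A minimal sketch on toy values (not the assignment data), ranking the numeric features by their correlation with the target:

```python
import pandas as pd

# Toy numeric columns (illustrative values only).
df = pd.DataFrame({
    "Item_Weight": [9.3, 5.9, 17.5, 19.2, 8.9, 10.4],
    "Item_Visibility": [0.016, 0.019, 0.017, 0.0, 0.0, 0.05],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9, 51.4],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7, 556.6],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr["Item_Outlet_Sales"]
    .drop("Item_Outlet_Sales")
    .sort_values(ascending=False)
)
print(target_corr)
```

A plotting library can render `corr` as a heatmap; the ranking in `target_corr` is what the insights above read off the plot.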
Individual feature vs Target
Item_Weight vs Item_Outlet_Sales
Insights
» Item_Weight has low correlation with the target Item_Outlet_Sales
Item_Visibility vs Item_Outlet_Sales
Insights
» Highly visible items have lower sales (possible reason: daily groceries sell well without needing high visibility, while expensive cosmetics may be kept in a visible position but usually sell less)
» Many products lie on the x-axis, indicating a visibility of zero
» The distribution is skewed towards low-visibility items
Outlet_Establishment_Year vs Item_Outlet_Sales
Insights
» No visible relation between year of establishment and outlet sales.
» Only 1998 shows lower sales (possible reason: fewer stores opened in that year; no data is provided on the number of stores opened each year)
Item_Fat_Content vs Item_Outlet_Sales
Insights
» Low Fat product sales > Regular fat sales.
Outlet_Type vs Outlet_Identifier
Insights
» Out of 10 stores: 2 are Grocery Stores, 6 are Supermarket Type1, 1 is Supermarket Type2 and 1 is Supermarket Type3
Outlet_Type vs Item_Outlet_Sales
Insights
» Medium Supermarket Type3 has more sales than the others
Outlet_Identifier vs Item_Outlet_Sales
Insights
» Grocery stores “OUT010” & “OUT019” have the lowest sales, followed by “OUT018”. Based on the previous 2 slides, this is expected
Outlet_Size vs Item_Outlet_Sales
Insights
» Medium size outlets have more sales than High and Small size outlets
Outlet_Location_Type vs Item_Outlet_Sales
Insights
» Sales of Tier2 > Sales of Tier3 > Sales of Tier1
Insights - Summary
» Sales of Tier2 > Sales of Tier3 > Sales of Tier1, so Tier2 and Tier3 cities have better sales than Tier1 cities.
» Item_Visibility does not have a high positive correlation with sales.
» Item_Visibility has items with the value zero.
» Item_Type does not influence the outlet sales much.
» Item_Weight and Outlet_Size contain NaN values.
» Item_Fat_Content has the value “low fat” written in different ways.
» Outlet_Establishment_Year values vary from 1985 to 2009. Using this value directly does not make sense.
» There are many data cleaning activities, so it is better to combine the train and test datasets.
Data Cleaning
Combine train & test dataset
Avoid re-work of cleaning the test dataset
» Combine the train and test datasets
Reason: the data contains many missing values, null values and categorical values, so cleaning both datasets together reduces duplicate effort
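The code for this step is not preserved. A minimal sketch of one way to combine the two frames, assuming a marker column named `source` (the name used later when the data is separated again) and toy rows in place of the real files:

```python
import pandas as pd

# Toy stand-ins for the train and test files (illustrative values only).
train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_MRP": [249.8, 48.3],
    "Item_Outlet_Sales": [3735.1, 443.4],
})
test = pd.DataFrame({
    "Item_Identifier": ["FDW58"],
    "Item_MRP": [107.9],
})

# Tag each row with its origin so the frames can be separated again later.
train["source"] = "train"
test["source"] = "test"

# Stack the rows; the test rows get NaN for the missing target column.
data = pd.concat([train, test], ignore_index=True, sort=False)
print(data.shape)
```

One consequence of this design: any statistic computed on `data` (a mean or mode used for imputation, for example) also includes the test rows.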
Missing Values
Replace NaN in Item_Weight
Insights
» Missing values in Item_Weight are replaced
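The slide does not say which value replaces the missing weights. The sketch below assumes one common choice: the mean weight of the same Item_Identifier in other outlets, falling back to the overall mean. Both the strategy and the toy data are assumptions:

```python
import numpy as np
import pandas as pd

# Toy data: the same item appears in several outlets, some without a weight.
data = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, np.nan, 5.9, np.nan, np.nan],
})

# Mean weight of each item across the rows that do report it.
item_mean = data.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Use the per-item mean first, then the overall mean for items with no weight at all.
data["Item_Weight"] = data["Item_Weight"].fillna(item_mean)
data["Item_Weight"] = data["Item_Weight"].fillna(data["Item_Weight"].mean())

print(data)
```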
Missing Values
Replace Outlet_Size
Insights
» Missing values are replaced with the mode. There are only a few outlet sizes, so it makes sense to fill the gaps with the most common size
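Mode imputation as described above can be sketched as follows, on a toy column rather than the assignment data:

```python
import pandas as pd

# Toy column with two missing sizes (illustrative values only).
data = pd.DataFrame({
    "Outlet_Size": ["Medium", None, "Small", "Medium", None, "High"],
})

# mode() ignores missing values and returns the most frequent label(s).
most_common = data["Outlet_Size"].mode()[0]

# Fill every gap with that label.
data["Outlet_Size"] = data["Outlet_Size"].fillna(most_common)
print(data["Outlet_Size"].value_counts())
```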
Item_Visibility
Replace zero values: zero makes no sense for this field
Insights
» Item_Visibility can’t be zero, so zeros are replaced with the mean
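A minimal sketch of the zero replacement. The slide only says "mean"; taking the mean over the non-zero values is an assumption made here, as is the toy data:

```python
import pandas as pd

# Toy column where 0.0 stands in for an unrecorded visibility.
data = pd.DataFrame({"Item_Visibility": [0.016, 0.0, 0.020, 0.0, 0.030]})

# Rows recorded as exactly zero.
zero_mask = data["Item_Visibility"] == 0

# Assumption: average only the non-zero values so the zeros do not drag the mean down.
mean_visibility = data.loc[~zero_mask, "Item_Visibility"].mean()

data.loc[zero_mask, "Item_Visibility"] = mean_visibility
print(data)
```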
Feature Engineering
Item_Type
Transform 16 item types to 3
Insights
» Item_Type has 16 categories, which is too fine-grained to be useful; transform them into 3 broad categories.
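The slide does not show how the 3 broad categories are derived. One possible approach, sketched here as an assumption, reads them from the first two characters of Item_Identifier. The prefix meanings (FD, DR, NC), the category labels and the column name `Item_Type_Combined` are all hypothetical:

```python
import pandas as pd

# Toy identifiers (illustrative only).
data = pd.DataFrame({"Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDX07"]})

# Assumed mapping from the two-letter identifier prefix to a broad category.
prefix_to_category = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

# Hypothetical new column holding the broad category.
data["Item_Type_Combined"] = data["Item_Identifier"].str[:2].map(prefix_to_category)
print(data)
```

An explicit dictionary from each of the 16 Item_Type values to a broad category would work equally well and does not rely on the identifier format.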
Item_Type – Transform Non-consumables as Non-edible
Transform 2 fat content types to 3
Insights
» Non-consumable items are still labelled with a fat content; these need to be segregated as non-edible
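A minimal sketch of the relabelling, assuming a broad-category column (here hypothetically named `Item_Type_Combined`) already marks the non-consumable rows. The label "Non-Edible" and the toy data are also assumptions:

```python
import pandas as pd

# Toy data: non-consumables still carry a fat-content label.
data = pd.DataFrame({
    "Item_Type_Combined": ["Food", "Non-Consumable", "Drinks", "Non-Consumable"],
    "Item_Fat_Content": ["Low Fat", "Low Fat", "Regular", "Low Fat"],
})

# Rows that are not food or drink.
non_consumable = data["Item_Type_Combined"] == "Non-Consumable"

# Give them their own fat-content label, turning 2 categories into 3.
data.loc[non_consumable, "Item_Fat_Content"] = "Non-Edible"
print(data["Item_Fat_Content"].value_counts())
```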
Item_Fat_Content
Fix the spelling mistakes: 5 categories to 2
Insights
» Item_Fat_Content has 5 labels that differ only in spelling (Low Fat, low fat, LF, Regular, reg); merge them into 2 categories, Low Fat and Regular.
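The five spellings listed in the univariate analysis can be merged with a replacement dictionary. A minimal sketch on a toy column:

```python
import pandas as pd

# One row per spelling found in the data.
data = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "LF", "low fat", "Regular", "reg"],
})

# Map each variant spelling onto its canonical label.
data["Item_Fat_Content"] = data["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
)
print(data["Item_Fat_Content"].value_counts())
```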
Outlet_Establishment_Year - Years of Operation of a store
Change the year to number of years in existence
Insights
» Comparing stores by raw year of establishment makes little sense. Transforming it into the number of years of existence is easier to relate to outlet sales.
» Using 2013 as the reference year, subtract each establishment year from 2013 to get the number of years of operation.
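The transformation is a single subtraction. In this sketch the column name `Outlet_Years` is hypothetical, and 2013 is the reference year stated on the slide:

```python
import pandas as pd

REFERENCE_YEAR = 2013  # reference year given on the slide

# Toy establishment years within the 1985 to 2009 range seen in the data.
data = pd.DataFrame({"Outlet_Establishment_Year": [1985, 1999, 2009]})

# Years each outlet has been operating as of the reference year.
data["Outlet_Years"] = REFERENCE_YEAR - data["Outlet_Establishment_Year"]
print(data)
```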
Categorical Variables transformation
Transform categorical variables into numericals
Insights
» scikit-learn estimators expect numerical variables, so all categorical fields are converted to numbers.
» Plain integer codes suggest that one category is greater than another. Dummy variables avoid this confusion (data transformed from integer codes as in table 1 to dummies as in table 2).
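The two tables on the slide are not preserved. The sketch below contrasts the two encodings on a toy column, using pandas category codes as a stand-in for whatever integer encoding the slide used:

```python
import pandas as pd

# Toy data with one categorical and one numeric column.
data = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "High", "Medium"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# "Table 1" style: one integer per category, which implies an arbitrary order.
codes = data["Outlet_Size"].astype("category").cat.codes

# "Table 2" style: one 0/1 column per category, with no implied order.
dummies = pd.get_dummies(data, columns=["Outlet_Size"], dtype=int)

print(list(codes))
print(dummies)
```

With integer codes a linear model would treat "Small" as larger than "High" simply because of alphabetical ordering; the dummy columns remove that artefact.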
Drop
Fields already transformed into other variables
Insights
» Item_Type and Outlet_Establishment_Year are dropped: both were transformed into other variables in previous slides, and object-type fields cannot be passed to the model.
Model Building
Separate train & test dataset
Segregate the combined data into train & test
» Segregate the data into train and test datasets for model prediction
» Remove Item_Outlet_Sales and source from the test data.
» Remove source from the train data.
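A minimal sketch of the separation, assuming the combined frame carries a `source` marker column as named in the bullets above. The rows are toy values:

```python
import pandas as pd

# Toy combined frame; the test row has no target value.
data = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 107.9],
    "Item_Outlet_Sales": [3735.1, 443.4, None],
    "source": ["train", "train", "test"],
})

# Train rows keep the target but lose the marker column.
train = data.loc[data["source"] == "train"].drop(columns=["source"])

# Test rows lose both the marker and the (empty) target column.
test = data.loc[data["source"] == "test"].drop(columns=["source", "Item_Outlet_Sales"])

print(train.shape, test.shape)
```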
Linear Regression
Insights
» Accuracy of Linear Regression model: 56.35
Decision Tree
Insights
» Accuracy of Decision Tree model: 61.45
RandomForest
Insights
» Accuracy of RandomForest model: 60.81
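The slides do not state how "accuracy" was computed. The sketch below assumes it is the R² score returned by a scikit-learn regressor's `score` method, expressed as a percentage, and compares the three model types on synthetic data. The data, hyperparameters and hold-out split are all assumptions and will not reproduce the figures on the slides:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the cleaned Bigmart features.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=5, size=200)

# Hold out a quarter of the rows for evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its R^2 on the held-out rows as a percentage.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = 100 * model.score(X_val, y_val)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```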
Recommendations
Further analysis & Recommendations
Insights
» Removed Outlet_Type: accuracy came down from 56.35 to 34.42.
Outlet sales are strongly affected by the type of outlet.
» Removed Item_MRP: accuracy came down from 56.35 to 24.07.
Outlet sales are strongly affected by the MRP of the product.
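The feature-removal check above can be sketched as refitting the same model with and without one column and comparing scores. The synthetic data, the use of in-sample R² as the score, and the helper name `r2_percent` are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data in which the target depends on Item_MRP but not on Item_Weight.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, 300),
    "Item_Weight": rng.uniform(4, 22, 300),
})
y = 15 * X["Item_MRP"] + rng.normal(scale=200, size=300)


def r2_percent(features):
    """Fit a linear model on the given columns and return in-sample R^2 * 100."""
    model = LinearRegression().fit(X[features], y)
    return 100 * model.score(X[features], y)


full = r2_percent(["Item_MRP", "Item_Weight"])
without_mrp = r2_percent(["Item_Weight"])

print(f"all features: {full:.2f}")
print(f"without Item_MRP: {without_mrp:.2f}")
```

A large drop after removing a column suggests the model relies heavily on it, which is the reasoning behind the two bullets above.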
Key Factors
» Outlet_Type and Item_MRP are the key factors affecting the outlet sales.
» Decision Tree is the most accurate of the three models (61.45).
Datascience - bigmart data analysis

  • 1. Bigmart Sale Prediction IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 2. Problem Statement IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA020 25-06-2019
  • 3. Data Exploration IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 4. Commands Output Insights • Item_Visibility contains 0.000 as values – meaningless • Item_Identifier is a string with specific code • Outlet_size contains NaN values head IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 5. Commands Output Insights » 12 features : Numeric – 5 , Categorical - 7 » Total no of entries: 8523 » Memory: ~ 800KB » Outlet_size has null values(from previous slide data) even though all the fields has to be non-null Info IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 6. Commands Output Insights » Data collection : 1985 to 2009 » Item_Visibility has a minimum value of 0.00 » Item_weight has count of less than 8523 Describe IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 7. Commands Output Insights » No of duplicates : 6964. Possible reason: Same product can exist in multiple stores Duplicates IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 8. Commands Output Insights » Item_Identifier has 1463 missing values. » Outlet_size has 2410 missing values Missing Values IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 9. Univariate Analysis IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 10. Commands Output Insights » 16 different types. » Possibility to reduce the item_Types to <16 Item_Type IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA020 25-06-2019
  • 11. Commands Output Insights » Regular is represented as multiple ways – Regular, reg » Low fat is represented as Low Fat, low fat & LF » Replace 5 types with 2 – Regular & Low Fat Item_Fat_Content IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 12. Commands Output Insights » More no of Medium & Small size outlets » Less no of High size outlets Outlet_Size IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 13. Commands Output Insights » Bigmart is present more in Tier 2& Tier 2 than in Tier 1 cities Outlet_Location_Type IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 14. Commands Output Insights » SuperMarket Type1 is prominent. Other 3 types are of same size Outlet_Type IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 15. Commands Output Insights » Item_visibility has lowest correlation with target variable » Item_MRP has strong positive correlation with target variable. Heatmap Numerical variables IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 16. Individual feature vs Target IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 17. Commands Output Insights » Item_Weight has low correlation with the target Item_Outlet_Sales Item_Weight vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA020 25-06-2019
  • 18. Commands Output Insights » Items which are highly visible has less sales (Possible reason: Daily groceries have higher sales and they don’t need high visibility. Also cosmetics with high rate might be kept in visible position but usually its sales are less.) » Many products are lying on x-axis stating that the visibility is zero » Distribution is skewed towards low visible items Item_Visibility vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 19. Commands Output Insights » No visible relation between Year of establishment and output sales. » Only in 1998, the sales are less(Possible reason could be less stores opened in that year – no data provided on no of stores opened each year) Outlet_Establishment_Year vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 20. Commands Output Insights » Low Fat product sales > Regular fat sales. Item_Fat_Content vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 21. Commands Output Insights » Out of 10 stores, 2 – grocery store, 6 – Supermarket Type1, 1 – Supermarket Type 2, 1- Supermarket Type 3 Outlet_Type vs Outlet_Identifier IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 22. Commands Output Insights » Medium SuperMarket Type3 has more sales than others Outlet_Type vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 23. Commands Output Insights » Groceries “OUT010” & “OUT019” have the lowest sales results which is expected followed by the “OUT018” – Based on previous 2 slides, this is expected Outlet_Identifier vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 24. Commands Output Insights » Medium store outlet are having more sales than High and Low size outlets Outlet_Size vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 25. Commands Output Insights » Sales of Tier2 > Sales of Tier 3 > Sales of Tier1 Outlet_Location_Type vs Item_Outlet_Sales IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 26. » Sales of Tier2 > Sales of Tier 3 > Sales of Tier1 » Item_Visibility does not have a high positive correlation. » Item_Visibility has items with the value zero » Item_Type does not influence the outlet_sales much. » Item_Weight and Outlet_Size seem to present NaN values. » Item_Fat_Content has vale “low fat” written in different manners. » Outlet_Establishment_Year values vary from 1985 to 2009. Using this value directly does not make sense. » Tier2 &Tier3 has better sales than Tier1 cities » Too many data cleaning activities, better to combine the train and the test dataset. Insights - Summary IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 27. Data Cleaning IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 28. Code » Combine the train and test dataset Reason : Since the data contains lot of missing values , null values and categorical values - reduce duplicate effort Combine train & test dataset Avoid re-work of cleaning the test dataset IPL Semester3 - Datascience individual Assignment - Venkat 18EMBA02025-06-2019
  • 29. Missing Values: Replace NaN in Item_Weight. Insights: » Missing values in Item_Weight are replaced.
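The transcript does not say which value replaces the missing weights. One possible approach, shown here purely as an assumption, is to use the mean weight of the same Item_Identifier and fall back to the overall mean when an item's weight is never recorded:

```python
import pandas as pd

# Made-up rows: the same item appears in several outlets.
data = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01", "NCD19"],
    "Item_Weight": [9.3, None, 5.9, None, None],
})

# Mean weight of each item, computed from the rows where it is known.
item_mean = data.groupby("Item_Identifier")["Item_Weight"].transform("mean")

# Fill from the item's own mean first, then fall back to the overall mean
# for items whose weight is never recorded (NCD19 in this toy frame).
data["Item_Weight"] = data["Item_Weight"].fillna(item_mean)
data["Item_Weight"] = data["Item_Weight"].fillna(data["Item_Weight"].mean())
print(data)
```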
  • 30. Missing Values: Replace Outlet_Size. Insights: » Missing values are replaced with the mode: there are only a few outlet sizes, so it makes sense to fill the gaps with the most common one.
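A minimal sketch of mode imputation, assuming pandas; the sample values are made up:

```python
import pandas as pd

data = pd.DataFrame(
    {"Outlet_Size": ["Medium", None, "Small", "Medium", None, "High"]}
)

# Series.mode() ignores missing values and returns the most frequent
# value(s); take the first one.
most_common = data["Outlet_Size"].mode()[0]

data["Outlet_Size"] = data["Outlet_Size"].fillna(most_common)
print(data["Outlet_Size"].value_counts())
```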
  • 31. Item_Visibility: Replace zero values (zero makes no sense for this field). Insights: » Item_Visibility can’t be zero; replace zeros with the mean.
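A sketch of the zero replacement, assuming pandas. Whether the original mean included the zero rows is not stated; this version assumes the mean of the non-zero values only:

```python
import pandas as pd

data = pd.DataFrame({"Item_Visibility": [0.0, 0.02, 0.04, 0.0, 0.06]})

# Assumption: average only the rows with a real (non-zero) visibility.
nonzero_mean = data.loc[data["Item_Visibility"] > 0, "Item_Visibility"].mean()

# Treat zero as "missing" and substitute the mean.
data["Item_Visibility"] = data["Item_Visibility"].replace(0, nonzero_mean)
print(data)
```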
  • 32. Feature Engineering
  • 33. Item_Type: Transform 16 item types to 3. Insights: » Item_Type has 16 categories, which is too fine-grained to be useful; transform them into 3 broad categories.
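The transcript does not name the three broad categories or how they were derived. One way to get three groups, shown here as an assumption, is to read the first two characters of Item_Identifier (the data exploration slides note that it is a string with a specific code). The prefix meanings and the column name `Item_Type_Combined` are hypothetical:

```python
import pandas as pd

data = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Type": ["Dairy", "Soft Drinks", "Household"],
})

# Hypothetical mapping from identifier prefix to a broad category.
prefix_to_category = {"FD": "Food", "DR": "Drinks", "NC": "Non-Consumable"}

# Take the first two characters of the code and map them.
data["Item_Type_Combined"] = (
    data["Item_Identifier"].str[:2].map(prefix_to_category)
)
print(data)
```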
  • 34. Item_Type: Transform Non-consumables as Non-edible (transform 2 item types to 3). Insights: » Item_Type includes non-consumable items that are still categorized by fat content; these need to be segregated as non-edible.
  • 35. Item_Fat_Content: Fix the spelling mistakes (5 categories to 2). Insights: » Item_Fat_Content uses 5 spellings for 2 categories (Low Fat, low fat and LF; Regular and reg); replace them with Low Fat and Regular.
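A sketch of the spelling fix, assuming pandas, followed by the non-edible relabelling from the previous slide. Detecting non-consumables through an `NC` identifier prefix and the label `Non-Edible` are assumptions, not taken from the slides:

```python
import pandas as pd

data = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19", "FDX07", "FDN15"],
    "Item_Fat_Content": ["Low Fat", "reg", "LF", "low fat", "Regular"],
})

# Collapse the five spellings into two categories.
data["Item_Fat_Content"] = data["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
)

# Assumption: identifiers starting with "NC" are non-consumable items,
# for which a fat-content label is meaningless.
is_non_consumable = data["Item_Identifier"].str.startswith("NC")
data.loc[is_non_consumable, "Item_Fat_Content"] = "Non-Edible"
print(data)
```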
  • 36. Outlet_Establishment_Year: Years of operation of a store (change the year to the number of years in existence). Insights: » Comparing stores by their year of establishment makes no sense; converting it to the number of years in existence gives a feature that correlates better with outlet sales. » Taking 2013 as the reference year (establishment years in the data run from 1985 to 2009), subtract each establishment year from 2013 to get the number of years of operation.
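The subtraction can be sketched as follows, assuming pandas; the new column name `Outlet_Years` is hypothetical:

```python
import pandas as pd

REFERENCE_YEAR = 2013  # reference year stated on the slide

data = pd.DataFrame({"Outlet_Establishment_Year": [1985, 1999, 2009]})

# Years of operation = reference year minus year of establishment.
data["Outlet_Years"] = REFERENCE_YEAR - data["Outlet_Establishment_Year"]
print(data)
```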
  • 37. Categorical Variables transformation: Transform categorical variables into numericals. Insights: » scikit-learn estimators expect numerical inputs, so convert all categorical fields into numbers. » Plain integer codes imply an ordering (one category appears greater than another), so create dummy variables to avoid that confusion (data transformed from integer codes as in table 1 to dummies as in table 2).
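The two tables on the slide are not preserved. A small sketch of the same contrast, assuming pandas: integer codes first, then one 0/1 column per category via `pd.get_dummies`. The sample rows are made up:

```python
import pandas as pd

data = pd.DataFrame({
    "Outlet_Size": ["Small", "Medium", "High", "Medium"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Integer codes: numeric, but they imply High < Medium < Small
# (codes follow alphabetical order), which is not meaningful.
codes = data["Outlet_Size"].astype("category").cat.codes
print(codes.tolist())

# Dummies: one 0/1 column per category, with no implied ordering.
encoded = pd.get_dummies(data, columns=["Outlet_Size"], dtype=int)
print(encoded)
```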
  • 38. Drop fields with object data type. Insights: » Item_Type and Outlet_Establishment_Year are dropped as they are of object type; they were also already transformed into other variables in previous slides.
  • 39. Model Building
  • 40. Separate train & test dataset (segregate the combined data into train & test). Code: » Segregate the data into train and test datasets for model prediction. » Remove Item_Outlet_Sales and source from the test data. » Remove source from the train data.
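A minimal sketch of the split, assuming pandas and a `source` column that marks each row's origin; the rows are made up:

```python
import pandas as pd

# Stand-in for the cleaned, combined data set.
data = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 107.9, 87.3],
    "Item_Outlet_Sales": [3735.14, 443.42, None, None],
    "source": ["train", "train", "test", "test"],
})

# Train keeps the target; only the helper column is removed.
train = data.loc[data["source"] == "train"].drop(columns=["source"])

# Test has no real target, so drop it along with the helper column.
test = data.loc[data["source"] == "test"].drop(
    columns=["source", "Item_Outlet_Sales"]
)
print(train.shape, test.shape)
```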
  • 41. Linear Regression. Insights: » Accuracy of the Linear Regression model: 56.35
  • 42. Decision Tree. Insights: » Accuracy of the Decision Tree model: 61.45
  • 43. RandomForest. Insights: » Accuracy of the RandomForest model: 60.81
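The slides do not say how "accuracy" was computed. The sketch below assumes it is the R² score times 100 on a held-out split, and fits the three scikit-learn regressors on synthetic data. The data, the split and the hyperparameters are all hypothetical, so the printed numbers will not match the slides:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cleaned feature matrix and sales target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 2000 * X[:, 0] + 500 * X[:, 1] + rng.normal(0, 100, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RandomForest": RandomForestRegressor(
        n_estimators=50, max_depth=5, random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Assumption: "accuracy" means R^2 on held-out data, scaled to 0-100.
    scores[name] = 100 * model.score(X_val, y_val)
    print(f"{name}: {scores[name]:.2f}")
```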
  • 44. Recommendations
  • 45. Further analysis & Recommendations. Insights: » Removed Outlet_Type: accuracy came down from 56.35 to 34.42, so outlet sales are strongly affected by the type of outlet. » Removed Item_MRP: accuracy came down from 56.35 to 24.07, so outlet sales are strongly affected by the MRP of the product. Key Factors: » Outlet_Type and Item_MRP are the key factors affecting outlet sales. » The Decision Tree model is the most accurate of the three models (61.45).
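The feature-removal experiment can be sketched as: refit the model without one column and compare the score with the baseline. The 56.35 baseline matches the Linear Regression slide, so the sketch uses `LinearRegression`. The data is synthetic, the column `Outlet_Type_Supermarket` is a hypothetical dummy, and the score is in-sample R² times 100, which is an assumption about how accuracy was measured:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Type_Supermarket": rng.integers(0, 2, n),
    "Item_Weight": rng.uniform(4, 21, n),
})
# Synthetic target: driven by MRP and outlet type, not by weight.
y = (
    10 * X["Item_MRP"]
    + 1500 * X["Outlet_Type_Supermarket"]
    + rng.normal(0, 300, n)
)

def score_without(dropped=None):
    """Fit on all columns except `dropped`; return in-sample R^2 * 100."""
    cols = [c for c in X.columns if c != dropped]
    model = LinearRegression().fit(X[cols], y)
    return 100 * model.score(X[cols], y)

baseline = score_without()
# Score lost when each feature is removed; a larger loss suggests a
# more important feature.
drops = {col: baseline - score_without(col) for col in X.columns}
for col, loss in drops.items():
    print(f"without {col}: score falls by {loss:.2f}")
```

In this toy setup, removing the uninformative `Item_Weight` costs almost nothing, while removing either driver costs a lot, which is the same reasoning the slide applies to Outlet_Type and Item_MRP.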