MULTIPLE LINEAR REGRESSION
TO PREDICT
ITEM SALES QUANTITY
WALMART DATASET
NIKHIL SHRIVASTAVA
CONTENTS
➢ Introduction
➢ Dataset Description
➢ Exploratory Data Analysis
➢ Inference for Multiple Linear Regression
➢ Model Selection
➢ Prediction
➢ Conclusion
Introduction
▪ The purpose
➢ To explore the attributes that influence 'Item Sales Quantity'.
➢ To establish a regression relationship between 'Item Sales Quantity' and the other attributes.
▪ The Dataset
➢ We chose sales data from the supermarket chain Walmart.
➢ Sourced from Kaggle
▪ Preliminary Analysis
➢ Response variable (dependent): Item Sales Quantity, recorded as Item_Outlet_Sales
➢ Explanatory variables: 11
▪ Modeling
➢ Multiple Linear Regression
Dataset Description
▪ The Walmart dataset consists of the following attributes
➢ Response Variable
• Item Outlet Sales
➢ Explanatory Variables
▪ Item Identifier codes
▪ Item Weight
▪ Item Fat Content
▪ Item Visibility
▪ Item Type
▪ Item MRP
▪ Outlet Identifier
▪ Outlet Establishment year
▪ Outlet Size
▪ Outlet Location type
▪ Outlet Type
Data Pre-Processing
▪ Summary before pre-processing
➢ Item_Identifier has 8000+ unique values, which indicates it is an item code/identifier and is unlikely to be useful for modeling.
➢ Item_Weight is missing for approximately 1463 items. An item cannot have no weight, so these are missing values. To reduce complexity, we deleted these records.
➢ Item_Fat_Content has broadly two categorical values: Low Fat and Regular.
➢ Low Fat appears in several variations: “LF”, “Low Fat”, “low fat”.
➢ During pre-processing we replaced all variations of Low Fat with “LF”.
➢ Similarly, we replaced all variations of “regular” with “reg”.
➢ Item_Visibility ranges from 0 to 0.32, suggesting the values are fractions (0% to 32%).
➢ Item_Type has mostly proper values, but many items fall under “Others”. For this project we treated “Others” as a separate category, so this variable required no pre-processing.
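The two cleaning steps described on this slide (dropping records with no Item_Weight and collapsing the fat-content label variants) could be sketched roughly as follows. This is a minimal Python illustration, not the code used for the deck, whose output appears to come from R. The sample rows, the identifiers `A1` to `A5`, and the helper name `clean` are hypothetical; only the column names and label variants come from the slides.

```python
# Hypothetical sample rows; only the column names and the label variants
# ("LF", "Low Fat", "low fat", "reg", "regular") are taken from the slides.
rows = [
    {"Item_Identifier": "A1", "Item_Weight": 9.3, "Item_Fat_Content": "Low Fat"},
    {"Item_Identifier": "A2", "Item_Weight": None, "Item_Fat_Content": "Regular"},
    {"Item_Identifier": "A3", "Item_Weight": 17.5, "Item_Fat_Content": "low fat"},
    {"Item_Identifier": "A4", "Item_Weight": 8.9, "Item_Fat_Content": "reg"},
    {"Item_Identifier": "A5", "Item_Weight": 12.0, "Item_Fat_Content": "LF"},
]

# Map every lower-cased variant to the two standard labels used in the deck.
FAT_LABELS = {"lf": "LF", "low fat": "LF", "reg": "reg", "regular": "reg"}


def clean(rows):
    """Drop rows with a missing weight and standardize the fat-content label."""
    cleaned = []
    for row in rows:
        if row["Item_Weight"] is None:
            continue  # the deck deletes records without a weight
        label = FAT_LABELS[row["Item_Fat_Content"].strip().lower()]
        cleaned.append({**row, "Item_Fat_Content": label})
    return cleaned


cleaned_rows = clean(rows)
print(len(cleaned_rows), [r["Item_Fat_Content"] for r in cleaned_rows])
```

Looking labels up in a fixed dictionary means an unexpected spelling raises a `KeyError` instead of silently creating a third category.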
Summary After Pre-Processing
Observations remaining : 4650
Item_Identifier   Item_Weight       Item_Fat_Content   Item_Visibility    Item_Type                     Item_MRP
DRD60  :   5      Min.   : 4.555    LF :3004           Min.   :0.00000    Fruits and Vegetables: 670    Min.   : 31.49
DRE49  :   5      1st Qu.: 8.770    reg:1646           1st Qu.:0.02597    Snack Foods          : 656    1st Qu.: 94.41
DRF01  :   5      Median :12.650                       Median :0.04966    Household            : 498    Median :142.98
DRF03  :   5      Mean   :12.899                       Mean   :0.06070    Frozen Foods         : 477    Mean   :141.72
DRF27  :   5      3rd Qu.:17.000                       3rd Qu.:0.08874    Dairy                : 380    3rd Qu.:186.61
DRG23  :   5      Max.   :21.350                       Max.   :0.18832    Canned               : 361    Max.   :266.89
(Other):4620                                                              (Other)              :1608

Outlet_Identifier   Outlet_Establishment_Year   Outlet_Size    Outlet_Location_Type   Outlet_Type
OUT013:932          Min.   :1987                High  : 932    Tier 1:1860            Supermarket Type1:3722
OUT018:928          1st Qu.:1997                Medium:1858    Tier 2: 930            Supermarket Type2: 928
OUT035:930          Median :1999                Small :1860    Tier 3:1860
OUT046:930          Mean   :1999
OUT049:930          3rd Qu.:2004
                    Max.   :2009

Item_Outlet_Sales
Min.   :   69.24
1st Qu.: 1125.20
Median : 1939.81
Mean   : 2272.04
3rd Qu.: 3111.62
Max.   :10256.65
Exploratory Data Analysis (EDA): Response vs Quantitative Variables
Linear Relationships were observed for the following
Sales vs Visibility
Sales vs Item weight
Sales vs Item Price
EDA: Response Variable vs Categorical Variables
Sales vs Fat content: almost the same distribution for Low Fat and Regular
Sales vs Item Type: little variation
EDA: Response Variable vs Categorical Variables
Sales vs Outlet Identifier: slight variation for OUT035
Sales vs Outlet Size: Medium and Small outlets seem to have higher item sales than High-sized outlets
Inference for Multiple Linear Regression
➢ Hypothesis Testing
➢ Before building the model, we checked whether any relationship exists by running a hypothesis test
➢ Null Hypothesis: There is no relationship between the response and the explanatory variables
➢ Alternative Hypothesis: There is some relationship between the response and the explanatory variables
➢ Mathematically: under the null hypothesis, the slopes of all explanatory variables are 0
➢ H0: β1 = β2 = β3 = … = βk = 0
➢ HA: At least one βi (i = 1, …, k) is non-zero
➢ We ran the full regression model to obtain the p-value for the overall model
F-statistic: 180.4 on 23 and 4626 DF, p-value: < 2.2e-16
➢ The p-value (< 2.2e-16) is far below the 5% significance level, hence we rejected the null hypothesis.
➢ This implies that at least one explanatory variable has a slope (βi) that is not 0
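As a rough consistency check, the overall F-statistic can be recovered from the adjusted R-squared of the full model (0.4702, reported for the 1st pass in the deck's model selection table) and the degrees of freedom in the F-statistic line. The Python sketch below assumes the usual textbook relations between R-squared, adjusted R-squared and F. It takes n = 4650 from the "Observations remaining" figure, and the variable names are illustrative.

```python
# Figures taken from the slides.
n = 4650          # observations remaining after pre-processing
df_model = 23     # numerator degrees of freedom (number of slope terms, k)
df_resid = 4626   # denominator degrees of freedom (n - k - 1)
adj_r2 = 0.4702   # adjusted R-squared of the full model (1st pass)

# Invert adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1) to recover plain R2.
r2 = 1 - (1 - adj_r2) * df_resid / (n - 1)

# Overall F-statistic: explained variance per model df over
# unexplained variance per residual df.
f_stat = (r2 / df_model) / ((1 - r2) / df_resid)

print(round(r2, 4), round(f_stat, 1))
```

The result should land very close to the reported 180.4, with small differences expected because the adjusted R-squared on the slide is rounded to four decimals.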
Model Creation and Selection
• Model Assumptions
• Linearity: Linear Relationships between x and y
• Expected Value of Error Term is 0: Nearly Normal Residuals
• Homoscedasticity: Constant variability of Residuals
• No multicollinearity: explanatory variables are not strongly correlated with each other
• Model Selection
• Two Methods
• P-Value
• Adjusted R-SQ
• Our Method
• p-value: variables whose p-value exceeds 5% are not significant
• Our Approach: Backward elimination
• Started with the full model and removed the least significant variable (highest p-value) in each pass
Model Selection
Columns: Pass / Model variables / Adj-R-Sq / Highest p-value variable and comments

1st Pass
Model variables: full model (all variables)
Adj-R-Sq: 0.4702
Highest p-value: Item_Type (0.8686). Since it is a categorical variable, we must either remove all of its levels or keep them all.

2nd Pass
Model variables: -Item_type
Adj-R-Sq: 0.4707
Highest p-value: Item_Fat_Content:reg (0.694)

3rd Pass
Model variables: -Item_type, -Item_Fat_Content:reg
Adj-R-Sq: 0.4709
Highest p-value: Item_visibility (0.6562)

4th Pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility
Adj-R-Sq: 0.4709
Highest p-value: Outlet_IdentifierOUT046 (0.5277). Since this is a categorical variable with other levels that are significant (p < 0.05), we keep it and look at the next highest p-value: Item_Weight (0.2709).

5th Pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility, -Item_Weight
Adj-R-Sq: 0.4709
Highest p-value: Outlet_identifierOUT046 (0.5335). All other variables remaining after the 5th pass are significant (p < 0.05). Outlet_identifier has some levels that are significant and others that are not, so we ran a 6th pass with Outlet_identifier removed.

6th Pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility, -Item_Weight, -Outlet_identifier
Adj-R-Sq: 0.4709
Comments: After removing Outlet_identifier, the adjusted R-squared did not change, hence we decided to drop this variable.

7th Pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility, -Item_Weight, -Outlet_identifier, -Super_Market_Type
Adj-R-Sq: 0.4709
Comments: Finally, we dropped the variables showing ‘NA’ in the regression output, as these represent collinear variables.
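The pass-by-pass procedure can be sketched as a small loop: fit, find the term with the highest p-value above 5%, drop it, and refit. The Python sketch below is only an outline under simplifying assumptions. `backward_eliminate` and `stub_fit` are hypothetical names, each term is given a single p-value (a real fit reports one per factor level), and `stub_fit` stands in for refitting by returning the four p-values reported in the passes above plus a placeholder of 0.001 for every other term.

```python
ALPHA = 0.05  # significance level used in the deck


def backward_eliminate(terms, fit_p_values):
    """Repeatedly drop the term with the highest p-value above ALPHA."""
    terms = list(terms)
    removed = []
    while True:
        p_values = fit_p_values(terms)  # "refit" with the current terms
        candidates = {t: p for t, p in p_values.items() if p > ALPHA}
        if not candidates:
            return terms, removed
        worst = max(candidates, key=candidates.get)
        terms.remove(worst)
        removed.append(worst)


# p-values reported in the passes above; a real refit would change them
# slightly from pass to pass, which this stub ignores.
REPORTED_P = {
    "Item_Type": 0.8686,
    "Item_Fat_Content": 0.694,
    "Item_Visibility": 0.6562,
    "Item_Weight": 0.2709,
}


def stub_fit(terms):
    # 0.001 is a placeholder for "significant", not a value from the deck.
    return {t: REPORTED_P.get(t, 0.001) for t in terms}


all_terms = [
    "Item_Weight", "Item_Fat_Content", "Item_Visibility", "Item_Type",
    "Item_MRP", "Outlet_Establishment_Year", "Outlet_Size", "Outlet_Location_Type",
]
kept, removed = backward_eliminate(all_terms, stub_fit)
print(removed)
print(kept)
```

Passing the fitting step in as a function keeps the loop independent of any particular regression library.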
Parsimonious Model
▪ 7th Pass
                                       Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                           74579.557    10046.019     7.424   1.35e-13 ***
walmart$Item_MRP                         16.306        0.256    63.682    < 2e-16 ***
walmart$Outlet_Establishment_Year       -37.537        5.056    -7.424   1.34e-13 ***
walmart$Outlet_SizeMedium               518.204       96.416     5.375   8.05e-08 ***
walmart$Outlet_SizeSmall                343.926       71.462     4.813   1.54e-06 ***
walmart$Outlet_Location_TypeTier 2      406.392       61.692     6.587   4.97e-11 ***
Item_Sales = 74579.557
+ 16.306 * Item_MRP
- 37.537 * Outlet_Establishment_Year
+ 518.204 * Outlet_SizeMedium
+ 343.926 * Outlet_SizeSmall
+ 406.392 * Outlet_Location_TypeTier 2
▪ Percentage Variability Explained by above model is 47%
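To make the fitted equation concrete, the Python sketch below plugs values into the coefficients from the 7th-pass table. The function name `predict_sales` and its argument names are illustrative. The sketch assumes that High outlet size and any location tier other than Tier 2 form the baseline, since only the Medium, Small and Tier 2 indicators carry coefficients in the final model. The example inputs are the mean Item_MRP and mean establishment year from the summary table.

```python
# Coefficients copied from the 7th-pass regression output.
COEF = {
    "(Intercept)": 74579.557,
    "Item_MRP": 16.306,
    "Outlet_Establishment_Year": -37.537,
    "Outlet_SizeMedium": 518.204,
    "Outlet_SizeSmall": 343.926,
    "Outlet_Location_TypeTier 2": 406.392,
}


def predict_sales(item_mrp, est_year, outlet_size="High", location_tier="Tier 1"):
    """Point prediction from the fitted equation (baseline levels assumed)."""
    y = COEF["(Intercept)"]
    y += COEF["Item_MRP"] * item_mrp
    y += COEF["Outlet_Establishment_Year"] * est_year
    if outlet_size == "Medium":
        y += COEF["Outlet_SizeMedium"]
    elif outlet_size == "Small":
        y += COEF["Outlet_SizeSmall"]
    if location_tier == "Tier 2":
        y += COEF["Outlet_Location_TypeTier 2"]
    return y


# Mean Item_MRP (141.72) and mean establishment year (1999) from the summary.
baseline = predict_sales(141.72, 1999)
medium = predict_sales(141.72, 1999, outlet_size="Medium")
print(round(baseline, 2), round(medium - baseline, 3))
```

Because the intercept and the year term are large and nearly cancel, the prediction is sensitive to rounding of the coefficients, so the sketch uses the full estimates.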
Assumptions Validation
▪ Linearity: Linear relationships between x and y
▪ Expected value of error term is 0: nearly normal residuals
The histogram of residuals is right-skewed and not nearly normal, hence this model does not meet the conditions of multiple linear regression.
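The visual judgement of right skew can be backed by a number. The Python sketch below computes the moment-based sample skewness, where a clearly positive value indicates a longer right tail. The residual values are made up for illustration and are not the model's residuals.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical residuals: many small negatives and one large positive value,
# which is the shape a right-skewed histogram describes.
residuals = [-1.0, -1.0, -1.0, -1.0, 4.0]
skew = skewness(residuals)

# A symmetric set of residuals has zero skewness.
symmetric_skew = skewness([-2.0, -1.0, 0.0, 1.0, 2.0])
print(skew, symmetric_skew)
```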
Assumptions Validation
▪ Homoscedasticity: constant variability of residuals
The plot of residuals vs fitted values shows a fan-shaped pattern rather than constant variance around 0, so the model does not meet the condition of homoscedasticity.
▪ Multicollinearity
As observed in the regression output, the model has some collinear variables, represented by “NA” in the output. We removed these variables from our final model.
Moreover, all t-values in the final model are greater than 2 in absolute value, so we did not investigate further.
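A fan shape in the residuals vs fitted plot means the residual spread grows with the fitted value. One crude way to quantify that, sketched below in Python, is to compare the mean squared residual in the upper half of the fitted values with that in the lower half. This is a simplified illustration rather than a formal test, the helper name `spread_ratio` is hypothetical, and the data are constructed so that the residual size grows with the fitted value.

```python
def spread_ratio(fitted, residuals):
    """Mean squared residual in the upper half of fitted values over the lower half."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[len(pairs) - half:]]

    def mean_sq(rs):
        return sum(r * r for r in rs) / len(rs)

    return mean_sq(high) / mean_sq(low)


# Constructed fan-shaped data: the residual magnitude equals the fitted value
# and the sign alternates. These are not the model's residuals.
fitted = list(range(1, 11))
residuals = [f if i % 2 == 0 else -f for i, f in enumerate(fitted)]
ratio = spread_ratio(fitted, residuals)
print(ratio)
```

A ratio near 1 would be consistent with constant variance, while a ratio well above 1 matches the fan pattern described on the slide.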
Prediction & CI
▪ The MLR model we created did not satisfy the essential assumptions of multiple linear regression, so predictions made from this model might not be reliable.
▪ The plot of residuals vs fitted values showed strong heteroscedasticity, and the residuals were not normal, so predictions and confidence intervals based on this model would not yield insights that can be trusted.
Conclusions
➢ Walmart eCommerce data set
➢ Data Pre-Processing:
➢ Removed the null values
➢ Fixed the values of string data by standardizing
➢ EDA: Linearity Check
➢ MLR model from 4650 observations (after pre-processing) and 11 explanatory variables
➢ Model Selection
➢ Using Backward Pass method
➢ Removal of variables with the highest p-values
➢ Variability Explained by model : 47.09% (Adj-R-Sq)
➢ MLR Assumptions Validations
➢ Residuals are not nearly Normally Distributed
➢ Model has Heteroscedasticity
➢ Predictions: Not reliable
Thank you
18

More Related Content

What's hot

Probability theory good
Probability theory goodProbability theory good
Probability theory goodZahida Pervaiz
 
Exploratory Data Analysis week 4
Exploratory Data Analysis week 4Exploratory Data Analysis week 4
Exploratory Data Analysis week 4Manzur Ashraf
 
Chap06 normal distributions & continous
Chap06 normal distributions & continousChap06 normal distributions & continous
Chap06 normal distributions & continousUni Azza Aunillah
 
Calculating a single sample z test
Calculating a single sample z testCalculating a single sample z test
Calculating a single sample z testKen Plummer
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using PythonShirin Mojarad, Ph.D.
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionDerek Kane
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In RRsquared Academy
 
Reporting pearson correlation in apa
Reporting pearson correlation in apa Reporting pearson correlation in apa
Reporting pearson correlation in apa Amit Sharma
 
Data collection and presentation
Data collection and presentationData collection and presentation
Data collection and presentationferdaus44
 
Business statistics (Basics)
Business statistics (Basics)Business statistics (Basics)
Business statistics (Basics)AhmedToheed3
 

What's hot (19)

Probability theory good
Probability theory goodProbability theory good
Probability theory good
 
Em algorithm
Em algorithmEm algorithm
Em algorithm
 
lfstat3e_ppt_01_rev.ppt
lfstat3e_ppt_01_rev.pptlfstat3e_ppt_01_rev.ppt
lfstat3e_ppt_01_rev.ppt
 
Exploratory Data Analysis week 4
Exploratory Data Analysis week 4Exploratory Data Analysis week 4
Exploratory Data Analysis week 4
 
Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Data Management in R
Data Management in RData Management in R
Data Management in R
 
Chap06 normal distributions & continous
Chap06 normal distributions & continousChap06 normal distributions & continous
Chap06 normal distributions & continous
 
SCATTER PLOTS
SCATTER PLOTSSCATTER PLOTS
SCATTER PLOTS
 
Calculating a single sample z test
Calculating a single sample z testCalculating a single sample z test
Calculating a single sample z test
 
Exploratory Data Analysis using Python
Exploratory Data Analysis using PythonExploratory Data Analysis using Python
Exploratory Data Analysis using Python
 
Data Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model SelectionData Science - Part III - EDA & Model Selection
Data Science - Part III - EDA & Model Selection
 
Box and Whisker Plot in Biostatic
Box and Whisker Plot in BiostaticBox and Whisker Plot in Biostatic
Box and Whisker Plot in Biostatic
 
Google PageRank
Google PageRankGoogle PageRank
Google PageRank
 
R Programming: Importing Data In R
R Programming: Importing Data In RR Programming: Importing Data In R
R Programming: Importing Data In R
 
Boxplot
BoxplotBoxplot
Boxplot
 
Prezentacja klas I - III
Prezentacja klas I - IIIPrezentacja klas I - III
Prezentacja klas I - III
 
Reporting pearson correlation in apa
Reporting pearson correlation in apa Reporting pearson correlation in apa
Reporting pearson correlation in apa
 
Data collection and presentation
Data collection and presentationData collection and presentation
Data collection and presentation
 
Business statistics (Basics)
Business statistics (Basics)Business statistics (Basics)
Business statistics (Basics)
 

Similar to MLR Walmart Dataset

Customer Satisfaction Data - Multiple Linear Regression Model.pdf
Customer Satisfaction Data -  Multiple Linear Regression Model.pdfCustomer Satisfaction Data -  Multiple Linear Regression Model.pdf
Customer Satisfaction Data - Multiple Linear Regression Model.pdfruwanp2000
 
1 Assignment Quantitative Methods 2 The following ass.docx
1 Assignment Quantitative Methods 2 The following ass.docx1 Assignment Quantitative Methods 2 The following ass.docx
1 Assignment Quantitative Methods 2 The following ass.docxteresehearn
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationAsadJaved304231
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...Smarten Augmented Analytics
 
BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxLSURYAPRAKASHREDDY
 
CHAPTER 11 LOGISTIC REGRESSION.pptx
CHAPTER 11 LOGISTIC REGRESSION.pptxCHAPTER 11 LOGISTIC REGRESSION.pptx
CHAPTER 11 LOGISTIC REGRESSION.pptxUmaDeviAnanth
 
Houseprice prediction-priya roy
Houseprice prediction-priya royHouseprice prediction-priya roy
Houseprice prediction-priya royPriya Chatterjee
 
Phase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIPhase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIVikas Virani
 
Market Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.pptMarket Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.pptEdu4Sure
 
7. logistics regression using spss
7. logistics regression using spss7. logistics regression using spss
7. logistics regression using spssDr Nisha Arora
 
MSc Finance_EF_0853352_Kartik Malla
MSc Finance_EF_0853352_Kartik MallaMSc Finance_EF_0853352_Kartik Malla
MSc Finance_EF_0853352_Kartik MallaKartik Malla
 
Wisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost PredictionWisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost PredictionPrasann Prem
 
Stepwise Selection Choosing the Optimal Model .ppt
Stepwise Selection  Choosing the Optimal Model .pptStepwise Selection  Choosing the Optimal Model .ppt
Stepwise Selection Choosing the Optimal Model .pptneelamsanjeevkumar
 
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptxbigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptxHarshavardhan851231
 
Stats ca report_18180485
Stats ca report_18180485Stats ca report_18180485
Stats ca report_18180485sarthakkhare3
 
Loan portfolio manufacturing sme's statistical analysis
Loan portfolio manufacturing sme's   statistical analysisLoan portfolio manufacturing sme's   statistical analysis
Loan portfolio manufacturing sme's statistical analysisManzar Ahmed
 

Similar to MLR Walmart Dataset (20)

Customer Satisfaction Data - Multiple Linear Regression Model.pdf
Customer Satisfaction Data -  Multiple Linear Regression Model.pdfCustomer Satisfaction Data -  Multiple Linear Regression Model.pdf
Customer Satisfaction Data - Multiple Linear Regression Model.pdf
 
1 Assignment Quantitative Methods 2 The following ass.docx
1 Assignment Quantitative Methods 2 The following ass.docx1 Assignment Quantitative Methods 2 The following ass.docx
1 Assignment Quantitative Methods 2 The following ass.docx
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical information
 
Factors affecting customer satisfaction
Factors affecting customer satisfactionFactors affecting customer satisfaction
Factors affecting customer satisfaction
 
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
What Is Generalized Linear Regression with Gaussian Distribution And How Can ...
 
BIG MART SALES.pptx
BIG MART SALES.pptxBIG MART SALES.pptx
BIG MART SALES.pptx
 
BIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptxBIG MART SALES PRIDICTION PROJECT.pptx
BIG MART SALES PRIDICTION PROJECT.pptx
 
APT_&_VaR[1]
APT_&_VaR[1]APT_&_VaR[1]
APT_&_VaR[1]
 
CHAPTER 11 LOGISTIC REGRESSION.pptx
CHAPTER 11 LOGISTIC REGRESSION.pptxCHAPTER 11 LOGISTIC REGRESSION.pptx
CHAPTER 11 LOGISTIC REGRESSION.pptx
 
Houseprice prediction-priya roy
Houseprice prediction-priya royHouseprice prediction-priya roy
Houseprice prediction-priya roy
 
Phase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMIPhase 2 of Predicting Payment default on Vehicle Loan EMI
Phase 2 of Predicting Payment default on Vehicle Loan EMI
 
Market Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.pptMarket Research using SPSS _ Edu4Sure Sept 2023.ppt
Market Research using SPSS _ Edu4Sure Sept 2023.ppt
 
7. logistics regression using spss
7. logistics regression using spss7. logistics regression using spss
7. logistics regression using spss
 
MSc Finance_EF_0853352_Kartik Malla
MSc Finance_EF_0853352_Kartik MallaMSc Finance_EF_0853352_Kartik Malla
MSc Finance_EF_0853352_Kartik Malla
 
Wisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost PredictionWisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost Prediction
 
Stepwise Selection Choosing the Optimal Model .ppt
Stepwise Selection  Choosing the Optimal Model .pptStepwise Selection  Choosing the Optimal Model .ppt
Stepwise Selection Choosing the Optimal Model .ppt
 
Lab manual_statistik
Lab manual_statistikLab manual_statistik
Lab manual_statistik
 
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptxbigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
bigmartsalespridictionproject-220813050638-8e9c4c31 (1).pptx
 
Stats ca report_18180485
Stats ca report_18180485Stats ca report_18180485
Stats ca report_18180485
 
Loan portfolio manufacturing sme's statistical analysis
Loan portfolio manufacturing sme's   statistical analysisLoan portfolio manufacturing sme's   statistical analysis
Loan portfolio manufacturing sme's statistical analysis
 

Recently uploaded

Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 

Recently uploaded (20)

Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 

MLR Walmart Dataset

  • 1. MULTIPLE LINEAR REGRESSION TO PREDICT ITEM SALES QUANTITY WALMART DATASET NIKHIL SHRIVASTAVA 1
  • 2. CONTENTS  Introduction  Dataset Description  Exploratory Data Analyses  Inference for Multiple Linear Regression  Model Selection  Prediction  Conclusion 2
  • 3. Introduction  The purpose  To explore the attributes which influence the 'Item Sales Quantity'.  To establish a regression relationship between 'Item Sales Quantity' and other attributes.  The Dataset  We have chosen Sales data of supermarket Walmart.  Sourced from Kaggle  Preliminary Analysis  Response variable (dependent) : Item Sales Quantity as Item_Outlet_Sales  Explanatory variables: 11  Modeling  Multiple Linear Regression 3
  • 4. Dataset Description  Walmart Dataset consists of the following attributes  Response Variable • Item Outlet Sales  Explanatory Variables ▪ Item Identifier codes ▪ Item Weight ▪ Item Fat Content ▪ Item Visibility ▪ Item Type ▪ Item MRP 4 ▪ Outlet Identifier ▪ Outlet Establishment year ▪ Outlet Size ▪ Outlet Location type ▪ Outlet Type
  • 5. Data Pre-Processing  The summary Before Pre-Processing ➢ Item_Identifier variable has 8000+ unique observations implying that it is the item code/identifier, may not be very useful for our modeling purpose. ➢ Item_Weight is not available for approx. 1463 items, which implies missing value as it doesn’t make sense an item without weight. To reduce the complexity, we deleted these records. ➢ Item_Fat-Content has broadly two categorical values – Low Fat and Regular. ➢ Many variations appearing as “LF”, “Low Fat”, “low fat”. ➢ During preprocessing replaced all the values of all variations of Low fat to “LF”. ➢ Similarly fixed multiple variations of “regular” and set it to “reg” ➢ Item_visibility ranges between 0 to .32 implying values in % ➢ Item_type has mostly proper values, but there are many with Others. For the purpose of this project we assumed “Others” as one separate category, implying this variable would not require preprocessing. 5
  • 6. Summary After Pre-Processing Observations remaining : 4650 6 Item_Identifier Item_Weight Item_Fat_Content Item_Visibility Item_Type Item_MRP DRD60 : 5 Min. : 4.555 LF :3004 Min. :0.00000 Fruits and Vegetables: 670 Min. : 31.49 DRE49 : 5 1st Qu.: 8.770 reg:1646 1st Qu.:0.02597 Snack Foods : 656 1st Qu.: 94.41 DRF01 : 5 Median :12.650 Median :0.04966 Household : 498 Median :142.98 DRF03 : 5 Mean :12.899 Mean :0.06070 Frozen Foods : 477 Mean :141.72 DRF27 : 5 3rd Qu.:17.000 3rd Qu.:0.08874 Dairy : 380 3rd Qu.:186.61 DRG23 : 5 Max. :21.350 Max. :0.18832 Canned : 361 Max. :266.89 (Other):4620 (Other) :1608 Outlet_Identifier Outlet_Establishment_Year Outlet_Size Outlet_Location_Type Outlet_Type OUT013:932 Min. :1987 High : 932 Tier 1:1860 Supermarket Type1:3722 OUT018:928 1st Qu.:1997 Medium:1858 Tier 2: 930 Supermarket Type2: 928 OUT035:930 Median :1999 Small :1860 Tier 3:1860 OUT046:930 Mean :1999 OUT049:930 3rd Qu.:2004 Max. :2009 Item_Outlet_Sales Min. : 69.24 1st Qu.: 1125.20 Median : 1939.81 Mean : 2272.04 3rd Qu.: 3111.62 Max. :10256.65
  • 7. Exploratory Data Analysis (EDA): Response vs Quantitative Variables
    Linear relationships were observed for the following:
    ➢ Sales vs Visibility
    ➢ Sales vs Item Weight
    ➢ Sales vs Item Price
  • 8. EDA: Response Variable vs Categorical Variables
    ➢ Sales vs Fat Content: almost the same distribution for Low Fat and Regular
    ➢ Sales vs Item Type: little variation
  • 9. EDA: Response Variable vs Categorical Variables
    ➢ Sales vs Outlet Identifier: slight variation for OUT035
    ➢ Sales vs Outlet Size: Medium and Small outlets seem to have higher item sales than High-sized outlets
  • 10. Inference for Multiple Linear Regression
    ➢ Hypothesis Testing
        ➢ Before creating the model, we validated our theory that some relationships exist via a hypothesis test.
        ➢ Null Hypothesis: There is no relationship between the response and the explanatory variables.
        ➢ Alternative Hypothesis: There is some relationship between the response and the explanatory variables.
        ➢ Mathematically, under the null hypothesis the slopes of all explanatory variables are 0:
        ➢ H0: β1 = β2 = β3 = … = βk = 0
        ➢ HA: At least one βi is non-zero
    ➢ We ran the full regression model to obtain the p-value for the overall model:
      F-statistic: 180.4 on 23 and 4626 DF, p-value: < 2.2e-16
    ➢ The p-value (< 2.2e-16) is far below the 5% significance level, hence we rejected the null hypothesis.
    ➢ This implies at least one explanatory variable has a slope (βi) that is not 0.
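As a consistency check on the overall F-test above, the F-statistic can be recovered from the adjusted R-squared of the full model (0.4702, as reported on the Model Selection slide), the sample size (4650) and the number of model terms (23). The sketch below uses the standard identities relating adjusted R-squared, R-squared and F; the function name is hypothetical, and the inputs are rounded values from the slides, so the result is only approximate.

```python
def overall_f_statistic(adj_r_squared, n_obs, n_predictors):
    """Recover R-squared and the overall F-statistic from adjusted R-squared."""
    df_resid = n_obs - n_predictors - 1
    # Invert adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
    r_squared = 1 - (1 - adj_r_squared) * df_resid / (n_obs - 1)
    # F = (explained variance per term) / (unexplained variance per residual df)
    f_stat = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)
    return r_squared, f_stat


# Values taken from the slides: adj R^2 = 0.4702, n = 4650, 23 model terms.
r2, f_stat = overall_f_statistic(0.4702, 4650, 23)
print(f"R^2 ~ {r2:.4f}, F ~ {f_stat:.1f} on 23 and {4650 - 23 - 1} DF")
```

The residual degrees of freedom come out as 4650 - 23 - 1 = 4626, matching the regression output quoted on the slide.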
  • 11. Model Creation and Selection
    • Model Assumptions
        • Linearity: linear relationships between x and y
        • Expected value of the error term is 0: nearly normal residuals
        • Homoscedasticity: constant variability of residuals
        • No multicollinearity among the explanatory variables
    • Model Selection
        • Two Methods
            • P-Value
            • Adjusted R-Sq
        • Our Method
            • p-value: variables with a p-value above 5% are not significant; the one with the highest p-value is removed first
        • Our Approach: Backward Pass (backward elimination)
            • Started with the full model and removed the least significant variable in each pass
  • 12. Model Selection
    ➢ 1st Pass. Model: full model, all variables. Adj-R-Sq: 0.4702. Highest p-value: Item_Type (0.8686). Since it is a categorical variable, we must either remove all of its levels or keep them all.
    ➢ 2nd Pass. Removed: Item_Type. Adj-R-Sq: 0.4707. Highest p-value: Item_Fat_Content:reg (0.694).
    ➢ 3rd Pass. Removed: Item_Type, Item_Fat_Content:reg. Adj-R-Sq: 0.4709. Highest p-value: Item_Visibility (0.6562).
    ➢ 4th Pass. Removed: Item_Type, Item_Fat_Content:reg, Item_Visibility. Adj-R-Sq: 0.4709. Highest p-value: Outlet_IdentifierOUT046 (0.5277). Since this categorical variable also has levels that are significant (p < .05), we kept it and moved to the next highest p-value: Item_Weight (0.2709).
    ➢ 5th Pass. Removed: Item_Type, Item_Fat_Content:reg, Item_Visibility, Item_Weight. Adj-R-Sq: 0.4709. Highest p-value: Outlet_IdentifierOUT046 (0.5335). All other variables remaining after the 5th pass are significant, with p-value < 0.05. Outlet_Identifier has some levels that are significant and others that are not, so we ran a 6th pass to remove it.
    ➢ 6th Pass. Removed: Item_Type, Item_Fat_Content:reg, Item_Visibility, Item_Weight, Outlet_Identifier. Adj-R-Sq: 0.4709. Removing Outlet_Identifier did not change the adjusted R-squared, hence we decided to drop this variable.
    ➢ 7th Pass. Removed: Item_Type, Item_Fat_Content:reg, Item_Visibility, Item_Weight, Outlet_Identifier, Super_Market_Type. Adj-R-Sq: 0.4709. Finally, we dropped the variables showing ‘NA’ in the regression output, as these represent collinear variables.
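The pass-by-pass procedure above follows a simple loop: fit the model, find the variable with the highest p-value, remove it if that p-value exceeds 5%, and repeat. The sketch below shows only that control flow. The `fit_pvalues` callback stands in for a real regression fit and is an assumption, as are the function names. The toy lookup reuses a few p-values quoted in the slides and, unlike a real refit, does not change them between passes. It also ignores the judgment call the authors made for Outlet_Identifier.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Repeatedly drop the variable with the highest p-value above alpha.

    fit_pvalues(variables) must return a dict mapping each variable to
    its p-value in a model fitted on exactly those variables.
    """
    remaining = list(variables)
    removed = []
    while remaining:
        pvalues = fit_pvalues(remaining)
        worst = max(remaining, key=lambda v: pvalues[v])
        if pvalues[worst] <= alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
        removed.append(worst)
    return remaining, removed


# Toy stand-in for a regression fit: fixed p-values taken from the slides.
# "< 2e-16" is represented by 2e-16 as a placeholder.
TOY_PVALUES = {
    "Item_Type": 0.8686,
    "Item_Fat_Content": 0.694,
    "Item_Visibility": 0.6562,
    "Item_Weight": 0.2709,
    "Item_MRP": 2e-16,
    "Outlet_Establishment_Year": 1.34e-13,
}


def toy_fit(variables):
    return {v: TOY_PVALUES[v] for v in variables}


kept, dropped = backward_eliminate(list(TOY_PVALUES), toy_fit)
print("kept:", kept)
print("dropped, in order:", dropped)
```

In practice the callback would refit the regression on the remaining variables each time, which is why the p-value of Outlet_IdentifierOUT046 shifts from 0.5277 to 0.5335 between the 4th and 5th passes.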
  • 13. Parsimonious Model
    ▪ 7th Pass
                                             Estimate    Std. Error   t value   Pr(>|t|)
      (Intercept)                            74579.557   10046.019     7.424    1.35e-13 ***
      walmart$Item_MRP                          16.306       0.256    63.682    < 2e-16  ***
      walmart$Outlet_Establishment_Year        -37.537       5.056    -7.424    1.34e-13 ***
      walmart$Outlet_SizeMedium                518.204      96.416     5.375    8.05e-08 ***
      walmart$Outlet_SizeSmall                 343.926      71.462     4.813    1.54e-06 ***
      walmart$Outlet_Location_TypeTier 2       406.392      61.692     6.587    4.97e-11 ***
    ▪ Item_Sales = 74579.557 + 16.306 * Item_MRP - 37.537 * Outlet_Establishment_Year + 518.204 * Outlet_SizeMedium + 343.926 * Outlet_SizeSmall + 406.392 * Outlet_Location_TypeTier 2
    ▪ Percentage of variability explained by the above model is about 47% (Adj-R-Sq 0.4709)
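To make the fitted equation concrete, the sketch below evaluates it for one hypothetical item. The coefficients are the estimates from the 7th-pass output above; the function name and the example inputs (an MRP equal to the sample mean of 141.72, a Small outlet in Tier 2 established in 2004) are assumptions chosen for illustration. The outlet-size and location terms are 0/1 indicators, with High size and the remaining tiers as the baseline.

```python
# Coefficient estimates copied from the 7th-pass regression output.
COEF = {
    "intercept": 74579.557,
    "Item_MRP": 16.306,
    "Outlet_Establishment_Year": -37.537,
    "Outlet_SizeMedium": 518.204,
    "Outlet_SizeSmall": 343.926,
    "Outlet_Location_TypeTier 2": 406.392,
}


def predict_sales(item_mrp, est_year, outlet_size, location_tier):
    """Evaluate the fitted equation; categorical inputs become 0/1 indicators."""
    return (
        COEF["intercept"]
        + COEF["Item_MRP"] * item_mrp
        + COEF["Outlet_Establishment_Year"] * est_year
        + COEF["Outlet_SizeMedium"] * (outlet_size == "Medium")
        + COEF["Outlet_SizeSmall"] * (outlet_size == "Small")
        + COEF["Outlet_Location_TypeTier 2"] * (location_tier == "Tier 2")
    )


# Hypothetical item: mean MRP, Small outlet in Tier 2, established in 2004.
pred = predict_sales(141.72, 2004, "Small", "Tier 2")
print(f"Predicted Item_Outlet_Sales: {pred:.2f}")
```

Note how the large intercept and the year term nearly cancel, so rounding the coefficients (for example, using 16 instead of 16.306 for Item_MRP) shifts the prediction noticeably.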
  • 14. Assumptions Validation
    ➢ Linearity: linear relationships between x and y
    ➢ Expected value of the error term is 0: nearly normal residuals
      The histogram of residuals is right-skewed, so the residuals are not nearly normal. Hence this model does not meet this condition of multiple linear regression.
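Right skew of the kind seen in the residual histogram can also be quantified with the sample skewness (the third standardized moment), which is about 0 for symmetric data and positive when the right tail is longer. The sketch below is a minimal illustration; the function name is hypothetical and the residual values are made up, since the model's actual residuals are not given in the slides.

```python
def sample_skewness(values):
    """Third standardized moment: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up residuals with a long right tail (not the model's residuals).
toy_residuals = [-3, -2, -2, -1, -1, 0, 0, 1, 2, 6]
symmetric = [-2, -1, 0, 1, 2]

skew_toy = sample_skewness(toy_residuals)
skew_sym = sample_skewness(symmetric)
print(f"toy residuals skewness: {skew_toy:.2f}")
print(f"symmetric sample skewness: {skew_sym:.2f}")
```

A clearly positive value supports what the histogram shows visually; it does not replace the plot, which also reveals outliers and multimodality.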
  • 15. Assumptions Validation
    ➢ Homoscedasticity: constant variability of residuals
      The plot of residuals vs fitted values shows a fan-shaped pattern rather than constant variance around 0, so the model does not meet the condition of homoscedasticity.
    ➢ Multicollinearity
      As observed in the regression output, the model has some collinear variables, represented by “NA” in the output. We removed these variables from our final model. Moreover, all t-values in the final model are greater than 2 in absolute value, so we did not investigate further.
  • 16. Prediction & CI
    ➢ The MLR model we created did not satisfy the essential assumptions of a multiple linear regression model, so predictions made from this model might not be reliable.
    ➢ The plot of residuals vs fitted values showed strong heteroscedasticity, and the residuals were not nearly normal, so predictions and confidence intervals based on this model would not yield insights that can be trusted.
  • 17. Conclusions
    ➢ Walmart eCommerce data set
    ➢ Data Pre-Processing:
        ➢ Removed records with missing values
        ➢ Standardized inconsistent string values
    ➢ EDA: Linearity check
    ➢ MLR model from 4650 observations and 11 explanatory variables
    ➢ Model Selection
        ➢ Using the Backward Pass method
        ➢ Removal of variables with the highest p-values
        ➢ Variability explained by the model: 47.09% (Adj-R-Sq)
    ➢ MLR Assumptions Validation
        ➢ Residuals are not nearly normally distributed
        ➢ Model shows heteroscedasticity
    ➢ Predictions: Not reliable