MLR Walmart Dataset

MULTIPLE LINEAR REGRESSION
TO PREDICT
ITEM SALES QUANTITY
WALMART DATASET
NIKHIL SHRIVASTAVA

CONTENTS
 Introduction
 Dataset Description
 Exploratory Data Analyses
 Inference for Multiple Linear Regression
 Model Selection
 Prediction
 Conclusion

Introduction
➢ The purpose
  ➢ To explore the attributes that influence the 'Item Sales Quantity'.
  ➢ To establish a regression relationship between 'Item Sales Quantity' and the other attributes.
➢ The dataset
  ➢ We chose sales data from the supermarket Walmart.
  ➢ Sourced from Kaggle.
➢ Preliminary analysis
  ➢ Response (dependent) variable: Item Sales Quantity, recorded as Item_Outlet_Sales
  ➢ Explanatory variables: 11
➢ Modeling
  ➢ Multiple Linear Regression

Dataset Description
 Walmart Dataset consists of the following attributes
 Response Variable
• Item Outlet Sales
 Explanatory Variables
▪ Item Identifier codes
▪ Item Weight
▪ Item Fat Content
▪ Item Visibility
▪ Item Type
▪ Item MRP
▪ Outlet Identifier
▪ Outlet Establishment year
▪ Outlet Size
▪ Outlet Location type
▪ Outlet Type

Data Pre-Processing
 The summary Before Pre-Processing
➢ Item_Identifier has 8000+ unique observations, implying that it is the item code/identifier; it may not be very useful for our modeling purpose.
➢ Item_Weight is not available for approx. 1463 items. We treated these as missing values, since an item without a weight does not make sense. To reduce complexity, we deleted these records.
➢ Item_Fat_Content has broadly two categorical values: Low Fat and Regular.
➢ Many variations appear, such as “LF”, “Low Fat”, “low fat”.
➢ During preprocessing we replaced all variations of Low Fat with “LF”.
➢ Similarly, we replaced the variations of “regular” with “reg”.
➢ Item_Visibility ranges from 0 to 0.32, implying the values are percentages expressed as fractions.
➢ Item_Type mostly has proper values, but many items are labeled “Others”. For this project we treated “Others” as a separate category, so this variable did not require preprocessing.
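The two cleaning steps described above (dropping records with a missing Item_Weight and standardizing the Item_Fat_Content labels) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the project; the sample rows and the `preprocess` helper are hypothetical.

```python
# Hypothetical sample rows for illustration only; not taken from the dataset.
rows = [
    {"Item_Identifier": "A1", "Item_Weight": 9.3, "Item_Fat_Content": "Low Fat"},
    {"Item_Identifier": "A2", "Item_Weight": None, "Item_Fat_Content": "Regular"},
    {"Item_Identifier": "A3", "Item_Weight": 17.5, "Item_Fat_Content": "low fat"},
    {"Item_Identifier": "A4", "Item_Weight": 8.9, "Item_Fat_Content": "reg"},
    {"Item_Identifier": "A5", "Item_Weight": 5.9, "Item_Fat_Content": "LF"},
]

# Map every known spelling (lower-cased) to one of the two standard labels.
FAT_LABELS = {"lf": "LF", "low fat": "LF", "reg": "reg", "regular": "reg"}


def preprocess(records):
    """Drop records without a weight and standardize the fat-content label."""
    cleaned = []
    for rec in records:
        if rec["Item_Weight"] is None:
            continue  # missing weight: the record is deleted
        fixed = dict(rec)
        key = rec["Item_Fat_Content"].strip().lower()
        # Unknown spellings are left unchanged rather than guessed.
        fixed["Item_Fat_Content"] = FAT_LABELS.get(key, rec["Item_Fat_Content"])
        cleaned.append(fixed)
    return cleaned


cleaned = preprocess(rows)
for rec in cleaned:
    print(rec["Item_Identifier"], rec["Item_Weight"], rec["Item_Fat_Content"])
```

Lower-casing before the lookup keeps the mapping short, since “Low Fat” and “low fat” share one key.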

Summary After Pre-Processing
Observations remaining : 4650
Item_Identifier  Item_Weight      Item_Fat_Content  Item_Visibility   Item_Type                    Item_MRP
DRD60  :   5     Min.   : 4.555   LF :3004          Min.   :0.00000   Fruits and Vegetables: 670   Min.   : 31.49
DRE49  :   5     1st Qu.: 8.770   reg:1646          1st Qu.:0.02597   Snack Foods          : 656   1st Qu.: 94.41
DRF01  :   5     Median :12.650                     Median :0.04966   Household            : 498   Median :142.98
DRF03  :   5     Mean   :12.899                     Mean   :0.06070   Frozen Foods         : 477   Mean   :141.72
DRF27  :   5     3rd Qu.:17.000                     3rd Qu.:0.08874   Dairy                : 380   3rd Qu.:186.61
DRG23  :   5     Max.   :21.350                     Max.   :0.18832   Canned               : 361   Max.   :266.89
(Other):4620                                                          (Other)              :1608

Outlet_Identifier  Outlet_Establishment_Year  Outlet_Size  Outlet_Location_Type  Outlet_Type
OUT013:932         Min.   :1987               High  : 932  Tier 1:1860           Supermarket Type1:3722
OUT018:928         1st Qu.:1997               Medium:1858  Tier 2: 930           Supermarket Type2: 928
OUT035:930         Median :1999               Small :1860  Tier 3:1860
OUT046:930         Mean   :1999
OUT049:930         3rd Qu.:2004
                   Max.   :2009

Item_Outlet_Sales
Min.   :   69.24
1st Qu.: 1125.20
Median : 1939.81
Mean   : 2272.04
3rd Qu.: 3111.62
Max.   :10256.65

Exploratory Data Analysis (EDA): Response vs Quantitative Variables
Linear Relationships were observed for the following
Sales vs Visibility
Sales vs Item weight
Sales vs Item Price

EDA: Response Variable vs Categorical Variables
Sales vs Fat content: almost the same distribution for Low Fat and Regular
Sales vs Item Type: little variation across item types
EDA: Response Variable vs Categorical Variables
Sales vs Outlet Identifier: slight variation with OUT035
Sales vs Outlet Size: Medium and Small outlets seem to have higher item sales than High-sized outlets

Inference for Multiple Linear Regression
➢ Hypothesis Testing
➢ Before creating the model, we validated our theory of some relationships via a hypothesis test
➢ Null Hypothesis: There is no relationship between the response and the explanatory variables
➢ Alternative Hypothesis: There is some relationship between the response and at least one explanatory variable
➢ Mathematically: under the null hypothesis, the slopes of all explanatory variables are 0 (the intercept β0 is not part of the test)
➢ H0: β1 = β2 = β3 = … = βk = 0
➢ HA: At least one βi (i = 1, …, k) is nonzero
➢ We ran the full regression model to obtain the p-value for the overall model
F-statistic: 180.4 on 23 and 4626 DF, p-value: < 2.2e-16
➢ At the 5% significance level the p-value is < 2.2e-16, so we rejected the null hypothesis.
➢ This implies that at least one explanatory variable has a slope (βi) that is not 0
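As a consistency check, the overall F-statistic can be recovered from the full model's adjusted R-Sq (0.4702, from the Model Selection table) and the reported degrees of freedom (23 and 4626, so n = 4650 observations and k = 23 slope terms). The sketch below uses the standard identities relating adjusted R-Sq, R-Sq and F; the function names are hypothetical, and because the adjusted R-Sq is rounded, the result is only approximate.

```python
def r_squared_from_adjusted(adj_r2, n, k):
    """Invert adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - adj_r2) * (n - k - 1) / (n - 1)


def overall_f_statistic(r2, n, k):
    """F = (R2 / k) / ((1 - R2) / (n - k - 1)) for H0: all slopes are 0."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n, k = 4650, 23   # 4626 residual DF = n - k - 1
adj_r2 = 0.4702   # full model (1st pass), rounded as reported

r2 = r_squared_from_adjusted(adj_r2, n, k)
f_stat = overall_f_statistic(r2, n, k)
print(f"R-Sq ~ {r2:.4f}, F ~ {f_stat:.1f}")
```

The value obtained this way should land close to the reported 180.4, which suggests the reported F-statistic, degrees of freedom and adjusted R-Sq are mutually consistent.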

Model Creation and Selection
• Model Assumptions
• Linearity: Linear Relationships between x and y
• Expected Value of Error Term is 0: Nearly Normal Residuals
• Homoscedasticity: Constant variability of Residuals
• No multicollinearity: explanatory variables should not be collinear with each other
• Model Selection
• Two methods
• p-value
• Adjusted R-Sq
• Our method
• p-value: the variable with the highest p-value (more than 5%) is not significant
• Our approach: backward pass (backward elimination)
• Started with the full model and removed one insignificant variable in each pass
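The backward pass described above can be sketched as a simple loop. This is an illustrative outline, not the procedure as actually run: `backward_eliminate` and `fit_pvalues` are hypothetical names, and `stub_fit` is a stand-in that returns fixed p-values instead of refitting a regression on each pass. It also treats each variable as a single unit, so it does not capture the manual judgment applied to multi-level categorical variables.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Repeatedly drop the variable with the highest p-value above alpha.

    fit_pvalues(vars) is a caller-supplied function (hypothetical here) that
    refits the model on `vars` and returns {variable: p-value}.
    """
    remaining = list(variables)
    while remaining:
        pvalues = fit_pvalues(remaining)
        worst = max(remaining, key=lambda v: pvalues[v])
        if pvalues[worst] <= alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining


# Stand-in for a real model fit: fixed placeholder p-values that do not
# change between passes (in a real fit they would).
PLACEHOLDER_P = {
    "Item_Type": 0.8686,
    "Item_Fat_Content": 0.694,
    "Item_Visibility": 0.6562,
    "Item_Weight": 0.2709,
    "Item_MRP": 1e-16,
    "Outlet_Establishment_Year": 1.34e-13,
}


def stub_fit(variables):
    return {v: PLACEHOLDER_P[v] for v in variables}


selected = backward_eliminate(list(PLACEHOLDER_P), stub_fit)
print(selected)
```

Removing only one variable per pass matters because p-values shift once a variable is dropped, so the model should be refit before the next decision.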

Model Selection
1st pass
  Model variables: full model (all variables)
  Adj-R-Sq: 0.4702
  Highest p-value variable/comments: Item_Type (0.8686). Since it is a categorical variable, we must either remove all of its levels or keep them all.
2nd pass
  Model variables: -Item_Type
  Adj-R-Sq: 0.4707
  Highest p-value variable/comments: Item_Fat_Content:reg (0.694)
3rd pass
  Model variables: -Item_Type, -Item_Fat_Content:reg
  Adj-R-Sq: 0.4709
  Highest p-value variable/comments: Item_Visibility (0.6562)
4th pass
  Model variables: -Item_Type, -Item_Visibility
  Adj-R-Sq: 0.4709
  Highest p-value variable/comments: Outlet_IdentifierOUT046 (0.5277). Since this is a categorical variable with some significant levels (p < .05), we kept it and looked for the next highest p-value: Item_Weight (0.2709)
5th pass
  Model variables: -Item_Type, -Item_Visibility, -Item_Weight
  Adj-R-Sq: 0.4709
  Highest p-value variable/comments: Outlet_IdentifierOUT046 (0.5335). All other variables remaining after the 5th pass appear significant, with p-value < 0.05. Outlet_Identifier has one level significant and another not, so we ran a 6th pass to remove Outlet_Identifier.
6th pass
  Model variables: -Item_Type, -Item_Visibility, -Item_Weight, -Outlet_Identifier
  Adj-R-Sq: 0.4709
  Highest p-value variable/comments: After removing Outlet_Identifier, the model accuracy did not change, hence we decided to drop this variable.
7th pass
  Model variables: -Item_Type, -Item_Visibility, -Item_Weight, -Outlet_Identifier, -Super_Market_Type
  Adj-R-Sq: 0.4709
  Highest p-value variable/comments: Finally, we dropped the variables showing ‘NA’ in the regression output, as these represent collinear variables.

Parsimonious Model
▪ 7th Pass
                                     Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                         74579.557   10046.019    7.424  1.35e-13 ***
walmart$Item_MRP                       16.306       0.256   63.682   < 2e-16 ***
walmart$Outlet_Establishment_Year     -37.537       5.056   -7.424  1.34e-13 ***
walmart$Outlet_SizeMedium             518.204      96.416    5.375  8.05e-08 ***
walmart$Outlet_SizeSmall              343.926      71.462    4.813  1.54e-06 ***
walmart$Outlet_Location_TypeTier 2    406.392      61.692    6.587  4.97e-11 ***
Item_Sales = 74579.557
+ 16.306 * Item_MRP
- 37.537 * Outlet_Establishment_Year
+ 518.204 * Outlet_SizeMedium
+ 343.926 * Outlet_SizeSmall
+ 406.392 * Outlet_Location_TypeTier 2
▪ Percentage of variability explained by the above model is 47% (Adj-R-Sq 0.4709)
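To make the fitted equation concrete, the sketch below evaluates it for a single input using the coefficients from the 7th-pass output. It assumes the usual 0/1 dummy coding, with every level that has no coefficient in the table (for example Outlet_Size "High") absorbed into the intercept; `predict_sales` is a hypothetical helper, and the example input uses the summary-table mean MRP and median establishment year purely for illustration.

```python
# Coefficients copied from the 7th-pass regression output.
INTERCEPT = 74579.557
COEF_MRP = 16.306
COEF_YEAR = -37.537
COEF_SIZE = {"Medium": 518.204, "Small": 343.926}   # other sizes: reference level
COEF_LOCATION = {"Tier 2": 406.392}                 # other tiers: reference level


def predict_sales(item_mrp, establishment_year, outlet_size, location_type):
    """Evaluate the fitted equation; unlisted levels contribute 0 (assumption)."""
    return (
        INTERCEPT
        + COEF_MRP * item_mrp
        + COEF_YEAR * establishment_year
        + COEF_SIZE.get(outlet_size, 0.0)
        + COEF_LOCATION.get(location_type, 0.0)
    )


# Illustrative input: mean Item_MRP and median Outlet_Establishment_Year.
baseline = predict_sales(141.72, 1999, "High", "Tier 1")
medium = predict_sales(141.72, 1999, "Medium", "Tier 1")
print(round(baseline, 2), round(medium, 2))
```

The large intercept is offset by the establishment-year term, so small changes in the year coefficient would move predictions substantially.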

Assumptions Validation
➢ Linearity: linear relationships between x and y
➢ Expected value of error term is 0: nearly normal residuals
The histogram of residuals is right-skewed, so the residuals are not nearly normal; hence this model does not meet the conditions of multiple linear regression.
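Besides reading the histogram, right skew can be quantified with the sample skewness of the residuals (the third central moment divided by the cube of the standard deviation); a value clearly above 0 indicates a longer right tail. The sketch below is a minimal standard-library version; `sample_skewness` is a hypothetical helper and the residual values are made up for illustration.

```python
from statistics import fmean


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (requires non-constant values)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up residuals for illustration only.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [-1.0, -1.0, -1.0, 0.0, 0.0, 1.0, 6.0]

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```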

Assumptions Validation
➢ Homoscedasticity: constant variability of residuals
The plot of residuals vs fitted values shows a fan-shaped pattern rather than constant variance around 0, so the model does not meet the condition of homoscedasticity.
➢ Multicollinearity
As observed in the regression output, the model had some collinear variables, represented by “NA” in the output. We removed these variables from our final model.
Moreover, all t-values in the final model are greater than 2 in absolute value, so we did not investigate further.
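The fan shape noted under homoscedasticity can also be checked crudely in numbers: sort the observations by fitted value, split them in half, and compare the spread of the residuals in the two halves. A ratio well above 1 means the residuals spread out as the fitted values grow. This is an informal sketch and not a formal test; `spread_ratio` is a hypothetical helper and the data are made up for illustration.

```python
from statistics import pstdev


def spread_ratio(fitted, residuals):
    """Std. dev. of residuals in the upper half of fitted values / lower half."""
    pairs = sorted(zip(fitted, residuals))  # order observations by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pstdev(high) / pstdev(low)


# Made-up values for illustration: residuals widen as fitted values increase.
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]

ratio = spread_ratio(fitted, residuals)
print(ratio)
```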

Prediction & CI
➢ The MLR model we created did not satisfy the essential assumptions of multiple linear regression, so predictions made from this model might not be reliable.
➢ The plot of residuals vs fitted values showed strong heteroscedasticity, and the residuals were not nearly normal, so predictions and confidence intervals based on this model would not yield insights that can be trusted.

Conclusions
➢ Walmart eCommerce data set
➢ Data Pre-Processing:
➢ Removed the null values
➢ Fixed the values of string data by standardizing
➢ EDA: Linearity Check
➢ MLR model from 4650 observations (after pre-processing) and 11 explanatory variables
➢ Model Selection
➢ Using Backward Pass method
➢ Removal of variables with the highest p-values
➢ Variability Explained by model : 47.09% (Adj-R-Sq)
➢ MLR Assumptions Validations
➢ Residuals are not nearly Normally Distributed
➢ Model has Heteroscedasticity
➢ Predictions: Not reliable