2. CONTENTS
Introduction
Dataset Description
Exploratory Data Analyses
Inference for Multiple Linear Regression
Model Selection
Prediction
Conclusion
3. Introduction
The purpose
To explore the attributes which influence the 'Item Sales Quantity'.
To establish a regression relationship between 'Item Sales Quantity' and other attributes.
The Dataset
We chose sales data for the supermarket chain Walmart.
Sourced from Kaggle
Preliminary Analysis
Response variable (dependent) : Item Sales Quantity as Item_Outlet_Sales
Explanatory variables: 11
Modeling
Multiple Linear Regression
4. Dataset Description
The Walmart dataset consists of the following attributes:
Response Variable
• Item Outlet Sales
Explanatory Variables
▪ Item Identifier codes
▪ Item Weight
▪ Item Fat Content
▪ Item Visibility
▪ Item Type
▪ Item MRP
▪ Outlet Identifier
▪ Outlet Establishment year
▪ Outlet Size
▪ Outlet Location type
▪ Outlet Type
5. Data Pre-Processing
The summary Before Pre-Processing
➢ Item_Identifier has 8000+ unique values, implying that it is the item code/identifier; it may not be
very useful for our modeling purpose.
➢ Item_Weight is not available for approximately 1,463 items. We treated these as missing values, since an item
without a weight does not make sense. To reduce complexity, we deleted these records.
➢ Item_Fat_Content has broadly two categorical values: Low Fat and Regular.
➢ Many variations of the same value appear, such as “LF”, “Low Fat”, and “low fat”.
➢ During pre-processing, we replaced all variations of Low Fat with “LF”.
➢ Similarly, we replaced all variations of “regular” with “reg”.
➢ Item_Visibility ranges from 0 to 0.32, implying the values are proportions (0% to 32%).
➢ Item_Type has mostly proper values, but many items are labeled “Others”. For the purpose of this project we treated
“Others” as one separate category, so this variable did not require pre-processing.
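The two cleaning steps described above (dropping records with a missing weight and standardizing the fat-content labels) can be sketched in a few lines of standard-library Python. This is only an illustration, not the code used for the project: the sample rows are made up, and the exact set of label spellings in `FAT_LABELS` is an assumption based on the variations listed on this slide.

```python
# Hypothetical sample rows; the values are made up for illustration only.
rows = [
    {"Item_Identifier": "A1", "Item_Weight": "9.3", "Item_Fat_Content": "Low Fat"},
    {"Item_Identifier": "A2", "Item_Weight": "", "Item_Fat_Content": "Regular"},
    {"Item_Identifier": "A3", "Item_Weight": "17.5", "Item_Fat_Content": "low fat"},
    {"Item_Identifier": "A4", "Item_Weight": "5.9", "Item_Fat_Content": "reg"},
    {"Item_Identifier": "A5", "Item_Weight": "12.0", "Item_Fat_Content": "LF"},
]

# Assumed spellings (lower-cased) mapped to one canonical label each.
FAT_LABELS = {"lf": "LF", "low fat": "LF", "reg": "reg", "regular": "reg"}


def preprocess(records):
    """Drop records without a weight and standardize the fat-content label."""
    cleaned = []
    for rec in records:
        if not rec["Item_Weight"].strip():
            continue  # missing weight: delete the record
        fixed = dict(rec)
        fixed["Item_Weight"] = float(rec["Item_Weight"])
        key = rec["Item_Fat_Content"].strip().lower()
        fixed["Item_Fat_Content"] = FAT_LABELS[key]
        cleaned.append(fixed)
    return cleaned


cleaned_rows = preprocess(rows)
print(len(cleaned_rows), sorted({r["Item_Fat_Content"] for r in cleaned_rows}))
```

An unknown spelling raises a `KeyError` here, which is a deliberate choice in this sketch: it surfaces label variants that the mapping does not yet cover instead of silently passing them through.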
6. Summary After Pre-Processing
Observations remaining : 4650
Item_Identifier   Item_Weight       Item_Fat_Content   Item_Visibility    Item_Type                     Item_MRP
DRD60  :   5      Min.   : 4.555    LF :3004           Min.   :0.00000    Fruits and Vegetables: 670    Min.   : 31.49
DRE49  :   5      1st Qu.: 8.770    reg:1646           1st Qu.:0.02597    Snack Foods          : 656    1st Qu.: 94.41
DRF01  :   5      Median :12.650                       Median :0.04966    Household            : 498    Median :142.98
DRF03  :   5      Mean   :12.899                       Mean   :0.06070    Frozen Foods         : 477    Mean   :141.72
DRF27  :   5      3rd Qu.:17.000                       3rd Qu.:0.08874    Dairy                : 380    3rd Qu.:186.61
DRG23  :   5      Max.   :21.350                       Max.   :0.18832    Canned               : 361    Max.   :266.89
(Other):4620                                                              (Other)              :1608

Outlet_Identifier   Outlet_Establishment_Year   Outlet_Size    Outlet_Location_Type   Outlet_Type
OUT013:932          Min.   :1987                High  : 932    Tier 1:1860            Supermarket Type1:3722
OUT018:928          1st Qu.:1997                Medium:1858    Tier 2: 930            Supermarket Type2: 928
OUT035:930          Median :1999                Small :1860    Tier 3:1860
OUT046:930          Mean   :1999
OUT049:930          3rd Qu.:2004
                    Max.   :2009

Item_Outlet_Sales
Min.   :   69.24
1st Qu.: 1125.20
Median : 1939.81
Mean   : 2272.04
3rd Qu.: 3111.62
Max.   :10256.65
7. Exploratory Data Analysis (EDA): Response vs Quantitative Variables
Linear Relationships were observed for the following
Sales vs Visibility
Sales vs Item weight
Sales vs Item Price
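A numeric companion to these scatter plots is the Pearson correlation coefficient, which measures the strength of a linear relationship between two quantitative variables. The sketch below implements it from its definition using only the standard library. The `item_mrp` and `sales` values are made-up numbers for illustration, not rows from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative values (not taken from the dataset).
item_mrp = [50.0, 100.0, 150.0, 200.0, 250.0]
sales = [700.0, 1500.0, 2400.0, 3100.0, 4200.0]

r = pearson_r(item_mrp, sales)
print(round(r, 3))
```

A value near +1 or -1 supports a linear relationship, while a value near 0 suggests that a straight line explains little of the variation.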
8. EDA: Response Variable vs Categorical Variables
Sales vs Fat content: shows almost the same distribution for Low Fat and Regular
Sales vs Item Type: little variation across item types
9. EDA: Response Variable vs Categorical Variables
Sales vs Outlet Identifier: slight variation for OUT035
Sales vs Outlet size: Medium and Small outlets seem to have higher item sales than High-sized outlets
10. Inference for Multiple Linear Regression
➢ Hypothesis Testing
➢ Before creating the model, we validated our theory that some relationships exist via a
hypothesis test
➢ Null Hypothesis: There is no relationship between the response and any explanatory variable
➢ Alternative Hypothesis: There is some relationship between the response and at least one explanatory variable
➢ Mathematically: under the null hypothesis, the slopes of all explanatory variables are 0
➢ H0: β1 = β2 = β3 = … = βk = 0
➢ HA: At least one βi (i = 1, …, k) is nonzero
➢ We ran the full regression model to obtain the p-value for the overall model
F-statistic: 180.4 on 23 and 4626 DF, p-value: < 2.2e-16
➢ At the 5% significance level, the p-value (< 2.2e-16) is far below 0.05, hence we rejected the null hypothesis.
➢ This implies that at least one explanatory variable has a slope (βi) that is not 0
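As a consistency check, the overall F-statistic can be recovered from the adjusted R-squared of the full model using two standard identities: adj-R² = 1 - (1 - R²)(n - 1)/(n - k - 1) and F = (R²/k) / ((1 - R²)/(n - k - 1)). The sketch below plugs in the figures reported in this deck (4650 observations, 23 model degrees of freedom, adjusted R-squared of 0.4702). The variable names are illustrative, and the result is only as precise as the rounded adjusted R-squared.

```python
# Figures reported in this deck for the full model.
n = 4650          # observations remaining after pre-processing
k = 23            # model degrees of freedom (numerator DF of the F-test)
adj_r2 = 0.4702   # adjusted R-squared of the full model (rounded)

residual_df = n - k - 1  # 4626, the denominator DF of the F-test

# Invert adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1) to get plain R2.
r2 = 1 - (1 - adj_r2) * residual_df / (n - 1)

# Overall F-statistic for H0: all slopes are zero.
f_stat = (r2 / k) / ((1 - r2) / residual_df)

print(f"R2 = {r2:.4f}, F = {f_stat:.1f} on {k} and {residual_df} DF")
```

The value obtained this way agrees with the reported F-statistic of 180.4 to within rounding, which suggests the reported degrees of freedom, sample size, and adjusted R-squared are mutually consistent.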
11. Model Creation and Selection
• Model Assumptions
• Linearity: Linear Relationships between x and y
• Expected Value of Error Term is 0: Nearly Normal Residuals
• Homoscedasticity: Constant variability of Residuals
• No multicollinearity among the explanatory variables
• Model Selection
• Two Methods
• P-Value
• Adjusted R-SQ
• Our Method
• p-value: the variable with the highest p-value (above 5%) is treated as not significant
• Our Approach: Backward Pass (backward elimination)
• Started with the full model and removed one insignificant variable in each pass
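The backward pass described above can be sketched as a loop that refits the model, finds the variable with the highest p-value, and drops it if that p-value exceeds the significance level. In the sketch below, `fit_p_values` is a hypothetical callable standing in for a real regression fit, and `stub_fit` returns fixed, made-up p-values. In a real run the p-values change after every refit, and all levels of a categorical variable are kept or dropped together.

```python
def backward_eliminate(variables, fit_p_values, alpha=0.05):
    """Drop the least significant variable until all p-values are <= alpha.

    fit_p_values is a hypothetical callable: it takes a list of variable
    names, fits a model on them, and returns {name: p_value}.
    """
    remaining = list(variables)
    while remaining:
        p_values = fit_p_values(remaining)
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining


def stub_fit(names):
    """Stand-in for a regression fit; the p-values are made up."""
    fixed = {
        "Item_MRP": 0.0001,
        "Item_Type": 0.90,
        "Item_Visibility": 0.60,
        "Outlet_Type": 0.001,
    }
    return {name: fixed[name] for name in names}


kept = backward_eliminate(
    ["Item_MRP", "Item_Type", "Item_Visibility", "Outlet_Type"], stub_fit
)
print(kept)
```

Passing the fitting step in as a function keeps the elimination logic independent of any particular regression library.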
12. Model Selection
Each pass lists: model variables (removed variables marked with "-"), Adj-R-Sq, and the highest p-value variable with comments.

1st pass
Model variables: full model, all variables
Adj-R-Sq: 0.4702
Highest p-value variable / comments: Item_Type (0.8686). Since it is a categorical variable, we have to either remove all of its levels or keep them all.

2nd pass
Model variables: -Item_type
Adj-R-Sq: 0.4707
Highest p-value variable / comments: Item_Fat_Content:reg (0.694)

3rd pass
Model variables: -Item_type, -Item_Fat_Content:reg
Adj-R-Sq: 0.4709
Highest p-value variable / comments: Item_visibility (0.6562)

4th pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility
Adj-R-Sq: 0.4709
Highest p-value variable / comments: Outlet_IdentifierOUT046 (0.5277). Since this is a categorical variable with other levels that are significant (p < 0.05), we kept it and looked for the next highest p-value: Item_Weight (0.2709).

5th pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility, -Item_Weight
Adj-R-Sq: 0.4709
Highest p-value variable / comments: Outlet_identifierOUT046 (0.5335). Apart from this level, all variables remaining after the 5th pass appear significant, with p-value < 0.05. Because Outlet_identifier has some levels that are significant and others that are not, we ran a 6th pass to try removing Outlet_identifier.

6th pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility, -Item_Weight, -Outlet_identifier
Adj-R-Sq: 0.4709
Highest p-value variable / comments: After removing Outlet_identifier, the adjusted R-squared did not change, hence we decided to drop this variable.

7th pass
Model variables: -Item_type, -Item_Fat_Content:reg, -Item_visibility, -Item_Weight, -Outlet_identifier, -Super_Market_Type
Adj-R-Sq: 0.4709
Highest p-value variable / comments: Finally, we dropped the variables showing ‘NA’ in the regression output, as these represent collinear variables.
14. Assumptions Validation
Linearity: Linear Relationships between x and y
Expected Value of Error Term is 0: Nearly Normal Residuals
The histogram of residuals is right-skewed, so the residuals are not nearly normal; hence
this model does not meet this condition of multiple linear regression.
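Right skew in a histogram can also be quantified with the sample skewness, the third central moment divided by the cube of the standard deviation; a clearly positive value indicates a long right tail. The sketch below computes it with the standard library only. The `residuals` list is made up for illustration and does not come from the fitted model.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up residuals with one large positive value (a long right tail).
residuals = [-3.0, -2.0, -2.0, -1.0, -1.0, -1.0, 0.0, 1.0, 2.0, 7.0]

skew = sample_skewness(residuals)
print(f"skewness = {skew:.2f}")
```

A skewness near 0 is what nearly normal residuals would show, so a large positive value is consistent with the right-skewed histogram described above.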
15. Assumptions Validation
Homoscedasticity: Constant variability of Residuals
The plot of residuals vs fitted values shows a fan-shaped pattern rather than
constant variance around 0, so the model does not meet the condition of homoscedasticity.
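A fan shape means the spread of the residuals grows with the fitted values. One rough way to put a number on this is to sort the observations by fitted value and compare the residual standard deviation in the upper half with that in the lower half. The sketch below does this with the standard library. It is a crude diagnostic rather than a formal test, and the `fitted` and `residuals` values are made up.

```python
import statistics


def spread_ratio(fitted, residuals):
    """Residual std dev for high fitted values divided by that for low ones."""
    pairs = sorted(zip(fitted, residuals))  # order by fitted value
    half = len(pairs) // 2
    low = [res for _, res in pairs[:half]]
    high = [res for _, res in pairs[-half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)


# Made-up values: residuals fan out as the fitted value increases.
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.5, 2.0, -2.5]

ratio = spread_ratio(fitted, residuals)
print(f"spread ratio = {ratio:.1f}")
```

A ratio close to 1 is what constant variance would produce, while a ratio well above 1 matches the fan-shaped pattern.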
Multicollinearity
As observed in the regression output, the model has some collinear variables, represented by
“NA” in the output. We removed these variables from our final model.
Moreover, all t-values in the final model are greater than 2 in absolute value, so we did not investigate further.
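An “NA” coefficient appears when a dummy variable is an exact linear combination of other columns, for example when one categorical variable is fully determined by another. The sketch below checks whether each value of one column always maps to the same value of a second column. The rows are made up for illustration, and the pairing of outlet identifiers with sizes is an assumption, not something read from the dataset.

```python
def is_function_of(rows, determinant, dependent):
    """True if each value of `determinant` maps to exactly one `dependent` value."""
    seen = {}
    for row in rows:
        key = row[determinant]
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True


# Made-up rows; the identifier-to-size pairing is assumed for illustration.
rows = [
    {"Outlet_Identifier": "OUT013", "Outlet_Size": "High"},
    {"Outlet_Identifier": "OUT035", "Outlet_Size": "Small"},
    {"Outlet_Identifier": "OUT046", "Outlet_Size": "Small"},
    {"Outlet_Identifier": "OUT013", "Outlet_Size": "High"},
    {"Outlet_Identifier": "OUT046", "Outlet_Size": "Small"},
]

size_fixed_by_outlet = is_function_of(rows, "Outlet_Identifier", "Outlet_Size")
outlet_fixed_by_size = is_function_of(rows, "Outlet_Size", "Outlet_Identifier")
print(size_fixed_by_outlet, outlet_fixed_by_size)
```

When the first check is true, the outlet-level variable adds no information beyond the identifier, so keeping both in the model makes some dummy columns redundant.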
16. Prediction & CI
The MLR model we created did not satisfy the essential assumptions of
a multiple linear regression model, so predictions made from this model
might not be reliable.
Because the plot of residuals vs fitted values showed strong heteroscedasticity and the
residuals were not nearly normal, predictions and confidence intervals based on this
model would not yield insights that can be trusted.
17. Conclusions
➢ Walmart eCommerce data set
➢ Data Pre-Processing:
➢ Removed the null values
➢ Fixed the values of string data by standardizing
➢ EDA: Linearity Check
➢ MLR model from 4,650 observations and 11 explanatory variables
➢ Model Selection
➢ Using Backward Pass method
➢ Removal of variables with the highest p-values
➢ Variability Explained by model : 47.09% (Adj-R-Sq)
➢ MLR Assumptions Validations
➢ Residuals are not nearly Normally Distributed
➢ Model has Heteroscedasticity
➢ Predictions: Not reliable