What is Multiple Linear Regression and How Can it be Helpful for Business Analysis?

Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics

What Is Covered
• Terminologies
• Introduction & Example
• Standard input/tuning parameters & Sample UI
• Sample output UI
• Interpretation of Output
• Limitations
• Business use cases

Terminologies
• Predictors and target variable:
• The target variable, usually denoted by Y, is the variable being predicted. It is also
called the dependent, output, response or outcome variable
• A predictor, usually denoted by X and sometimes called an independent or explanatory
variable, is a variable that is used to predict the target variable
• Correlation:
• Correlation is a statistical measure that indicates the extent to which two variables
fluctuate together
• Upper & lower N% confidence intervals:
• A confidence interval is a statistical way of saying, "I am pretty sure the true
value of the number I am approximating lies within this range, with N% confidence"
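To make the idea of correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient for the Temperature and Yield columns of the example used in this deck. The function name `pearson_r` is illustrative and not part of any tool described here; only the Python standard library is used.

```python
from math import sqrt

# Temperature and Yield columns from the example table in this deck.
temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79, 80]
yield_ = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162, 171, 175, 182]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_temp_yield = pearson_r(temperature, yield_)
print(f"Correlation between Temperature and Yield: {r_temp_yield:.2f}")
```

A value close to +1 means the two variables rise together; a value close to -1 means one falls as the other rises.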

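INTRODUCTION
• OBJECTIVE:
• Multiple linear regression is a statistical technique that explores the relationship
between a target variable (Y) and two or more predictor variables (Xi)
• BENEFIT:
• The regression model output helps identify the important factors (Xi) that impact the
dependent variable (Y), as well as the nature of the relationship between each of these
factors and the dependent variable
• MODEL:
• The linear regression model equation takes the form
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$,
where $\beta_0$ is the intercept, $\beta_i$ is the coefficient of predictor $X_i$ and
$\varepsilon$ is the error term (the original slide illustrates this with an image on the right)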

Example: Multiple linear regression
Let’s conduct a multiple linear regression analysis with Temperature & Humidity as the independent
variables (Xi) and Yield as the target variable (Y), as shown below:

Input data
Temperature  Humidity  Yield
50           57        112
53           54        118
54           54        128
55           60        121
56           66        125
59           59        136
62           61        144
65           58        142
67           59        149
71           64        161
72           56        167
74           66        168
75           52        162
76           68        171
79           52        175
80           62        182

Output
Regression Statistics
R Square  0.98

             Coefficients  P-value  Lower 95%  Upper 95%
Intercept    -5.14         0.68     -31.49     21.21
Temperature   2.19         0.00       1.99      2.40
Humidity      0.15         0.44      -0.26      0.57

• The model is a good fit, as R square > 0.7
• The p-value for Temperature is < 0.05; hence Temperature is an important factor for predicting Yield
• The p-value for Humidity is > 0.05, which means Humidity does not have a statistically significant impact on Yield
• With a one unit increase in Temperature, Yield increases by about 2.19 units (with Humidity held constant)
• The coefficient of Temperature lies between 1.99 and 2.40 with 95% confidence (5% chance of error)
Note: The intercept is not an important statistic for checking the relation between X & Y
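The figures in this example can be checked by hand. The following is a minimal sketch, using only the Python standard library, that fits the two-predictor model by solving the normal equations and then derives R square and the 95% confidence intervals for the two slopes. The critical value `T_CRIT = 2.160` (Student's t, 13 degrees of freedom) is hard-coded from a standard t table and is an assumption of this sketch; the tool behind the slides may compute it differently.

```python
from math import sqrt

# Example data from this deck: (Temperature, Humidity, Yield).
rows = [
    (50, 57, 112), (53, 54, 118), (54, 54, 128), (55, 60, 121),
    (56, 66, 125), (59, 59, 136), (62, 61, 144), (65, 58, 142),
    (67, 59, 149), (71, 64, 161), (72, 56, 167), (74, 66, 168),
    (75, 52, 162), (76, 68, 171), (79, 52, 175), (80, 62, 182),
]
n = len(rows)
mean_t = sum(t for t, _, _ in rows) / n
mean_h = sum(h for _, h, _ in rows) / n
mean_y = sum(y for _, _, y in rows) / n

# Centered sums of squares and cross-products.
s_tt = sum((t - mean_t) ** 2 for t, _, _ in rows)
s_hh = sum((h - mean_h) ** 2 for _, h, _ in rows)
s_th = sum((t - mean_t) * (h - mean_h) for t, h, _ in rows)
s_ty = sum((t - mean_t) * (y - mean_y) for t, _, y in rows)
s_hy = sum((h - mean_h) * (y - mean_y) for _, h, y in rows)
s_yy = sum((y - mean_y) ** 2 for _, _, y in rows)

# Solve the 2x2 normal equations for the two slopes, then the intercept.
det = s_tt * s_hh - s_th ** 2
b_temp = (s_ty * s_hh - s_th * s_hy) / det
b_hum = (s_tt * s_hy - s_th * s_ty) / det
intercept = mean_y - b_temp * mean_t - b_hum * mean_h

# R square = 1 - (unexplained variation / total variation).
sse = sum((y - (intercept + b_temp * t + b_hum * h)) ** 2 for t, h, y in rows)
r_square = 1 - sse / s_yy

# 95% confidence intervals for the slopes.
# Assumption: T_CRIT is the two-sided 95% critical value of Student's t
# with n - 3 = 13 degrees of freedom, taken from a standard t table.
T_CRIT = 2.160
resid_var = sse / (n - 3)
se_temp = sqrt(resid_var * s_hh / det)
se_hum = sqrt(resid_var * s_tt / det)
ci_temp = (b_temp - T_CRIT * se_temp, b_temp + T_CRIT * se_temp)
ci_hum = (b_hum - T_CRIT * se_hum, b_hum + T_CRIT * se_hum)

print(f"Intercept   {intercept:7.2f}")
print(f"Temperature {b_temp:7.2f}  95% CI ({ci_temp[0]:.2f}, {ci_temp[1]:.2f})")
print(f"Humidity    {b_hum:7.2f}  95% CI ({ci_hum[0]:.2f}, {ci_hum[1]:.2f})")
print(f"R Square    {r_square:7.2f}")
```

The printed values should agree with the output table, up to rounding. Note that the Humidity interval contains 0, which is consistent with its p-value being above 0.05.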

Standard input/tuning parameters & Sample UI
Step 1: Select the predictors (more than one predictor can be selected)
• Temperature
• Humidity
• Yield
• Pressure range
Step 2: Select the target variable
• Temperature
• Humidity
• Yield
• Pressure range
Step 3: Set the tuning parameters (by default these parameters should be set with the values mentioned)
• Step size = 1
• Number of iterations = 100
Step 4: Display the output window containing the following:
o Model summary
o Line fit plot
o Normal probability plot
o Residual versus Fit plot
Note:
• Categorical predictors should be auto-detected & converted to dummy/binary variables before applying regression
• The selection of predictors depends on business knowledge and on the correlation between the target
variable and each predictor; predictors with a significant positive/negative correlation with Y should be included in the model
• As a rule of thumb, the number of predictors should be at most (total number of observations / 20)
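Two of the notes on this slide can be sketched in a few lines of Python. The helper names `dummy_encode` and `max_predictors` and the `season` column are hypothetical, introduced only for illustration. Dropping the first category as a baseline is a common convention and an assumption of this sketch; the slides do not say how the tool encodes categories.

```python
def dummy_encode(values):
    """Convert a categorical column into 0/1 dummy columns.

    Assumption: the first category (in sorted order) is dropped and acts as
    the baseline, so k categories yield k - 1 dummy columns.
    """
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories[1:]
    }


def max_predictors(n_observations):
    """Rule of thumb from the slide: at most (observations / 20) predictors."""
    return n_observations // 20


# Hypothetical categorical predictor, not part of the deck's example data.
season = ["festive", "regular", "regular", "festive", "regular"]
print(dummy_encode(season))   # {'is_regular': [0, 1, 1, 0, 1]}
print(max_predictors(100))    # 5
```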

Sample output: 1. Model Summary
Regression Statistics
R Square  0.98
P value for ANOVA test: 0.02

             Coefficients  P-value  Lower 95%  Upper 95%
Intercept    -5.14         0.68     -31.49     21.21
Temperature   2.19         0.00       1.99      2.40
Humidity      0.15         0.44      -0.26      0.57

• R square: It shows the goodness of fit of the model. It lies between 0 and 1, and the closer this value is to 1, the better the model
• ANOVA p-value: It indicates whether at least one of the coefficients is significant in the model; further
model interpretation should be made only if this p-value is < 0.05
P-value:
o It is used to evaluate whether the corresponding predictor X has a significant impact on the target
variable Y
o As the p-value for Temperature here is < 0.05 (highlighted in red font in the table above), Temperature has a
significant relation with Yield
o In contrast, the p-value for Humidity is > 0.05, which makes it insignificant for predicting Yield
Coefficient:
o It shows the magnitude as well as the direction of the impact of each predictor (Temperature and Humidity in this
case) on the target variable Y (Yield)
o For example, in this case, a one unit increase in Temperature corresponds to a 2.19 unit increase in Yield
(that is, Yield rises by roughly 2 units for each one unit increase in Temperature)
o The value of the Temperature coefficient lies between 1.99 and 2.40 with 95% confidence
Check the Interpretation section for more details

Sample output: 2. Plots
• Line fit plots are used to check the assumption of linearity between each Xi & Y
• The normal probability plot is used to check the assumption of normality & to detect outliers
• The residual plot is used to check the assumption of equal error variances & to detect outliers
• In case of non-linearity between any Xi and Y, a transformation can be applied to Xi to make it linearly
correlated with Y; otherwise that particular variable has to be dropped from the model inputs
Check the Interpretation section for more details

Interpretation of Important Model Summary Statistics
Multiple R:
• R >= 0.7 represents a strong positive correlation between X and Y
• 0.4 <= R < 0.7 represents a weak positive correlation between X and Y
• -0.4 < R < 0.4 represents a negligible/no correlation between X and Y
• -0.7 < R <= -0.4 represents a weak negative correlation between X and Y
• R <= -0.7 represents a strong negative correlation between X and Y
R Square:
• R square > 0.7 represents a very good model, i.e. the model is able to explain more than 70% of the
variability in Y
• R square between 0 and 0.7 represents a model that does not fit well; the assumptions of
normality and linearity should be checked to obtain a better fit
P value:
• At the 95% confidence threshold, if the p-value for a predictor X is < 0.05 then X
is a significant/important predictor
• At the 95% confidence threshold, if the p-value for a predictor X is > 0.05 then X
is an insignificant/unimportant predictor, i.e. it doesn’t have a significant relation
with the target variable Y
Coefficients:
• A coefficient indicates by how much the output variable will change with a
one unit change in X
• For example, if the coefficient of X is 2 then Y will increase by 2 units with a one
unit increase in X
• If the coefficient of X is -2 then Y will decrease by 2 units with a one unit
increase in X
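As a worked illustration of how coefficients and p-values are read, the sketch below plugs the coefficients from this deck's model summary into the regression equation. The function names `predict_yield` and `is_significant` and the input values 60 and 60 are illustrative choices, not part of the original material.

```python
# Coefficients copied from the model summary table in this deck.
INTERCEPT = -5.14
COEF_TEMPERATURE = 2.19
COEF_HUMIDITY = 0.15


def predict_yield(temperature, humidity):
    """Predicted Yield from the fitted linear equation."""
    return INTERCEPT + COEF_TEMPERATURE * temperature + COEF_HUMIDITY * humidity


def is_significant(p_value, alpha=0.05):
    """Rule used in this deck: a predictor is significant when p-value < 0.05."""
    return p_value < alpha


base = predict_yield(60, 60)      # illustrative inputs inside the observed range
warmer = predict_yield(61, 60)    # same Humidity, Temperature one unit higher
print(f"Predicted Yield at Temperature 60, Humidity 60: {base:.2f}")
print(f"Change in Yield for +1 Temperature: {warmer - base:.2f}")
print("Temperature significant:", is_significant(0.00))
print("Humidity significant:", is_significant(0.44))
```

The change in the prediction equals the Temperature coefficient, which is exactly what "one unit increase in X changes Y by the coefficient" means.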

Interpretation of plots: Line Fit plot
• This plot shows the relationship between each Xi (predictor) & Y (target variable), with Y
on the y axis and each Xi on the x axis
• As shown in figure 1, as Temperature (X) increases, so does Yield (Y);
hence there is a linear relationship between X and Y, and linear regression is applicable to this
data
• If the plot doesn’t display linearity, as in figures 2 & 3, then a transformation can be
applied to that particular variable before proceeding with model building
• If data transformation doesn’t help, then either that variable (Xi) can be dropped from the
analysis or a non-linear model should be chosen, depending on the distribution pattern of the scatter
plot
[Figures 1 to 3: line fit plots]
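The effect of a transformation on linearity can be sketched with synthetic data. In the hypothetical example below, y grows with the logarithm of x, so the raw relationship is curved; taking the log of x makes it a straight line. The data and the choice of a log transformation are assumptions for illustration only; the right transformation depends on the shape seen in the actual line fit plot.

```python
from math import log, sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Synthetic, hypothetical data: y is a linear function of log(x), not of x.
x = list(range(1, 21))
y = [1.0 + 3.0 * log(v) for v in x]

r_raw = pearson_r(x, y)
r_log = pearson_r([log(v) for v in x], y)
print(f"Correlation before transformation:    {r_raw:.3f}")
print(f"Correlation after log transformation: {r_log:.3f}")
```

If no transformation brings the correlation closer to +1 or -1, that is the situation in which the slide suggests dropping the variable or choosing a non-linear model.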

Interpretation of plots: Normal Probability plot
• This plots the percentile vs. variable (Xi or Y) distribution
• It is used to check the assumption of normality and to detect outliers in the data
• It can be helpful to add a trend line to see whether the variable fits a straight line
• In figure 1, the pattern of dots lies close to a straight line;
therefore, the variable is normally distributed and there are no outliers
• Examples of non-normal data are shown in figures 2 & 3, and an example of outliers is
shown in figure 4
[Figures 1 to 4: normal probability plots]
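The points of a normal probability plot can be computed without a plotting library. The sketch below pairs each sorted Yield value from this deck's example with the standard normal quantile at the same rank. The plotting position (i - 0.5) / n is one common convention and is an assumption here; the tool behind the slides may use a different one.

```python
from statistics import NormalDist

# Yield column from the example table in this deck.
yield_ = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168, 162, 171, 175, 182]


def normal_probability_points(values):
    """Pair each sorted value with its theoretical standard normal quantile.

    Assumption: plotting position (i - 0.5) / n for the i-th smallest value.
    """
    n = len(values)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, sorted(values)))


points = normal_probability_points(yield_)
for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed}")
```

Plotting the second column against the first gives the normal probability plot; points lying near a straight line suggest the variable is close to normally distributed.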

Interpretation of plots: Residual versus Fit plot
• It is the scatter plot of standardized residuals on the y axis and predicted (fitted) values on the x axis
• It is used to detect unequal residual variances and outliers in the data
• Here are the characteristics of a well-behaved residual vs. fits plot:
• The residuals should "bounce randomly" around the 0 line and should roughly form a "horizontal
band" around the 0 line, as shown in figure 1. This suggests that the variances of the error
terms are equal
• No single residual should "stand out" from the basic random pattern of residuals. This suggests
that there are no outliers
• For example, the red data point in figure 1 is an outlier; such outliers should be removed from the
data before proceeding with model interpretation
• The plots shown in figures 2 & 3 depict unequal error variances, which is not desirable for linear
regression analysis
[Figures 1 to 3: residual versus fit plots]
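The quantities behind a residual versus fit plot can be computed directly from this deck's example. The sketch below uses the rounded coefficients from the model summary, so the residuals are approximate. Dividing by the residual standard error is a simple form of standardization, and the cutoff of 2 for flagging points is an assumption of this sketch; the slides do not name a threshold.

```python
from math import sqrt

# Example data from this deck: (Temperature, Humidity, Yield).
rows = [
    (50, 57, 112), (53, 54, 118), (54, 54, 128), (55, 60, 121),
    (56, 66, 125), (59, 59, 136), (62, 61, 144), (65, 58, 142),
    (67, 59, 149), (71, 64, 161), (72, 56, 167), (74, 66, 168),
    (75, 52, 162), (76, 68, 171), (79, 52, 175), (80, 62, 182),
]
# Rounded coefficients copied from the model summary table.
INTERCEPT, COEF_TEMPERATURE, COEF_HUMIDITY = -5.14, 2.19, 0.15

fitted = [INTERCEPT + COEF_TEMPERATURE * t + COEF_HUMIDITY * h for t, h, _ in rows]
residuals = [y - f for (_, _, y), f in zip(rows, fitted)]

# Residual standard error: 3 parameters (intercept + 2 slopes) were estimated.
resid_std = sqrt(sum(r ** 2 for r in residuals) / (len(rows) - 3))
standardized = [r / resid_std for r in residuals]

# Assumption: points beyond +/-2 are flagged as candidates to inspect.
CUTOFF = 2.0
flagged = [
    (round(f, 1), round(z, 2))
    for f, z in zip(fitted, standardized)
    if abs(z) > CUTOFF
]

for f, z in zip(fitted, standardized):
    print(f"fitted {f:7.2f}  standardized residual {z:6.2f}")
print("Flagged points:", flagged)
```

Plotting the standardized residuals against the fitted values gives the residual versus fit plot; with this cutoff, no point in the example data is flagged.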

Limitations
• Linear regression is limited to predicting numeric output, i.e. the dependent variable has to
be numeric in nature
• The minimum sample size should be at least 20 cases per independent variable
• Multicollinearity among the predictors should be removed before running the model.
Multicollinearity is the situation in which two or more independent variables are highly
correlated with one another
• This method is applicable only when the assumption of linearity between each Xi and Y is met. This
can be checked through the Line fit plot, which is a scatter plot between each Xi and Y, as
described in the Interpretation section
• Residuals should be time independent, as illustrated in the images below
[Images: time dependent error (decreasing with time); time independent error (fairly constant over time & lying within a certain range)]
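The multicollinearity limitation above can be checked before fitting. The sketch below measures the correlation between the two predictors of this deck's example. Reusing 0.7 (the deck's threshold for a strong correlation) as the warning level and reporting the variance inflation factor (VIF) are assumptions of this sketch; VIF is a standard diagnostic that the slides do not mention.

```python
from math import sqrt

# Predictor columns from the example table in this deck.
temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75, 76, 79, 80]
humidity = [57, 54, 54, 60, 66, 59, 61, 58, 59, 64, 56, 66, 52, 68, 52, 62]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_predictors = pearson_r(temperature, humidity)

# With exactly two predictors, the VIF of each one reduces to 1 / (1 - r^2).
vif = 1 / (1 - r_predictors ** 2)

# Assumption: treat |r| >= 0.7 between predictors as "highly correlated".
highly_correlated = abs(r_predictors) >= 0.7

print(f"Correlation between Temperature and Humidity: {r_predictors:.2f}")
print(f"VIF: {vif:.2f}")
print("Highly correlated predictors:", highly_correlated)
```

For the example data the two predictors are only weakly correlated, so multicollinearity is not a concern there.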

Limitations (continued)
• Target/independent variables should be normally distributed. A normal distribution is an
arrangement of a data set in which most values cluster in the middle of the range and the rest
taper off symmetrically toward either extreme. It looks like a bell curve, as shown in figure 1
• Outliers in the data (target as well as independent variables) can affect the analysis; hence outliers
need to be removed or treated. Outliers are observations lying outside the overall pattern of the
distribution, as shown in figure 2
• These extreme values/outliers can be replaced with the 1st or 99th percentile values to improve
the model accuracy
[Figure 1: normal distribution (bell curve). Figure 2: outliers]
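Replacing extreme values with the 1st or 99th percentile can be sketched with the standard library. The helper name `winsorize` and the sample data are hypothetical, and the "inclusive" percentile method is one of several conventions; the tool behind the slides may compute percentiles differently.

```python
from statistics import quantiles


def winsorize(values):
    """Cap values below the 1st percentile or above the 99th at those percentiles."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[0], cuts[-1]
    return [min(max(v, low), high) for v in values]


# Hypothetical sample: the integers 1..100 plus one extreme value.
sample = list(range(1, 101)) + [1000]
capped = winsorize(sample)
print("Largest value before:", max(sample))
print("Largest value after: ", max(capped))
```

Unlike deleting outliers, this keeps every observation in the data set, which matters when the sample is small.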

Business use case 1
• Business problem:
• An ecommerce company wants to measure the impact of product price, product promotions, the presence of a
festive season etc. on product sales
• Input data:
• Predictor/independent variables:
• Product price data
• Product promotions data such as discounts
• Flag representing the presence/absence of a festive season
• Dependent variable: Product sales data
• Business benefit:
• The product sales manager will get to know which of the predictors included in the analysis have a significant
impact on product sales
• For the impactful predictors, important strategic decisions can be made to meet the targeted product sales
• For instance, if promotions and festive seasons turn out to be significant factors, each with a positive coefficient,
then these factors should be given more focus while devising a marketing strategy to improve sales, as they
directly affect sales in a positive way

Business use case 2
• Business problem:
• An agriculture production firm wants to predict the impact of the amount of rainfall, humidity,
temperature etc. on the yield of a particular crop
• Input data:
• Predictor/independent variables:
• Amount of rainfall during monsoon months
• Humidity levels/measurements
• Temperature measurements
• Dependent variable: Crop production
• Business benefit:
• The agriculture firm can understand the impact of each of these predictors on the target variable
• For instance, if temperature and rainfall have a significant positive impact but humidity levels
have a significant negative impact on crop yield, then the crop can be grown under high
temperature and rainfall levels in conjunction with low humidity levels in order to produce
the desired crop yield

Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018
