Simple Linear Regression is a statistical technique that explores the relationship between one independent variable (X) and one dependent variable (Y). It is not suitable for datasets where more than one variable/predictor exists.
What is Simple Linear Regression and How Can an Enterprise Use this Technique to Analyze Data?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
4. Terminologies
• Predictors and Target variable :
• Target variable usually denoted by Y , is the variable being predicted and is also called
dependent variable, output variable, response variable or outcome variable
• Predictor, usually denoted by X , sometimes called an independent or explanatory
variable, is a variable that is being used to predict the target variable
• Correlation :
• Correlation is a statistical measure that indicates the extent to which two variables
fluctuate together
• Upper & Lower N% confidence intervals:
• A confidence interval is a statistical measure for saying, "I am pretty sure the true value
of the number I am approximating is within this range, with N% confidence"
5. Terminologies
• Intercept / constant term 𝜷0 :
• The intercept is the expected value of Y when all Xi = 0
• In other words, 𝛽0 is the baseline value of Y when every predictor equals zero (not necessarily the minimum value of Y)
• Coefficients 𝜷𝒊 :
• It is interpreted as the expected change in Y corresponding to a one unit change in Xi
• Error term 𝜺𝒊 :
• It represents the margin of error within a model
• It is the difference between the observed value of Yi and the predicted value of Yi
• Standard error of coefficient :
• It is used to measure the precision of the estimate of the coefficient
• In other words, the smaller the standard error, the more precise the estimate
Where Yi is dependent variable
Xi is independent variable
6. Terminologies
• T statistic:
• Dividing the coefficient by its standard error gives the t statistic, which is used in the
calculation of the P value
• Degree of freedom:
• Degree of freedom is N-K where N is number of observations and K is number of
parameters used to calculate the estimate
• Significance level /alpha level:
• It represents the level of confidence at which you want to test the results
• Lower values of alpha mean higher confidence. For example, if 𝛼 = 0.1, confidence =
100 - (𝛼*100) = 90%
• P value :
• If the p-value associated with this t statistic is less than the alpha level, it means that
there exists a relation between the corresponding predictor and the dependent variable
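The way these terms relate can be sketched in a few lines of Python. This is a hypothetical illustration, not part of this series; the coefficient and standard-error inputs below are made-up examples:

```python
# Illustrative sketch of how the terms above relate; the numeric inputs
# below are hypothetical examples, not output from a real regression.

def t_statistic(coefficient, standard_error):
    """T statistic: the coefficient divided by its standard error."""
    return coefficient / standard_error

def degrees_of_freedom(n_observations, n_parameters):
    """Degrees of freedom: N - K."""
    return n_observations - n_parameters

def confidence_from_alpha(alpha):
    """Confidence (%) = 100 - (alpha * 100)."""
    return 100 - alpha * 100

print(t_statistic(2.0, 0.5))        # 4.0
print(degrees_of_freedom(25, 2))    # 23 (25 observations, intercept + slope)
print(confidence_from_alpha(0.1))   # 90% confidence for alpha = 0.1
```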
7. Types of Linear regression analysis
• Depending on the number of independent variables/predictors in analysis, it is classified into two types :
• Simple linear regression:
• When there is only one dependent and one independent variable/predictor: Yi = 𝛽0 + 𝛽1 Xi + 𝜀𝑖
• Multiple linear regression :
• When there is only one dependent variable but multiple independent variables/predictors: Yi = 𝛽0 + 𝛽1 X1i + 𝛽2 X2i + … + 𝛽k Xki + 𝜀𝑖
• Where
• Yi is dependent variable
• Xi is independent variable
• 𝛽0 is intercept
• 𝛽𝑖 is coefficient
• 𝜀𝑖 is the error term
8. Introduction : Simple
linear regression
Objective :
It is a statistical technique that attempts to
explore the relationship between one
independent variable (X) and one dependent
variable (Y )
Benefit :
The regression model output helps identify
whether the independent variable/predictor X
has any relationship with the dependent
variable Y and, if yes, the nature/direction
(i.e. positive/negative) of that relationship
Model :
The simple linear regression model equation takes
the form Yi = 𝛽0 + 𝛽1 Xi + 𝜀𝑖, as shown in the
image on the right
9. Example: Simple linear regression
Input data:
Temperature  Yield
50           112
53           118
54           128
55           121
56           125
59           136
62           144
65           142
67           149
71           161
72           167
74           168
75           162
76           171
79           175
80           182
82           180
85           183
87           188
90           200
93           194
94           206
95           207
97           210
100          219
Let's get the simple linear regression output for independent variable X and
target variable Y as shown below:
Output
Regression Statistics
R Square 0.98

             Coefficients  P-value  Lower 95%  Upper 95%
Intercept    13.33         0.00268  5.13       21.52
Temperature  2.04          0.00138  1.93       2.15

• The model is a good fit, as R square > 0.7
• The P value for Temperature is < 0.05; hence Temperature is an important
factor for predicting Yield and has a significant relation with Yield
• With a one unit increase in Temperature there is a 2.04 unit increase in Yield
• The values of the coefficients will lie within the range given by the lower
and upper 95% bounds
• For example, the coefficient of Temperature will be between 1.93 and 2.15
with 95% confidence (5% chance of error)
Note : The intercept is not an important statistic for checking the relation between X & Y
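As a sanity check (not part of the deck), the output above can be reproduced by fitting the input data with NumPy; `np.polyfit` with degree 1 performs an ordinary least-squares line fit:

```python
# Cross-check of the example output: fit Yield on Temperature with a
# least-squares line and compute R square = 1 - SSE/SST.
import numpy as np

temperature = np.array([50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75,
                        76, 79, 80, 82, 85, 87, 90, 93, 94, 95, 97, 100])
yield_ = np.array([112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168,
                   162, 171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210, 219])

slope, intercept = np.polyfit(temperature, yield_, deg=1)

predicted = intercept + slope * temperature
sse = np.sum((yield_ - predicted) ** 2)                 # residual sum of squares
sst = np.sum((yield_ - yield_.mean()) ** 2)             # total sum of squares
r_square = 1 - sse / sst

# Values land close to the slide's Intercept 13.33, Temperature 2.04, R Square 0.98
print(round(intercept, 2), round(slope, 2), round(r_square, 2))
```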
10. Standard input/tuning parameters & Sample UI
Step 1 : Select the predictor (e.g. Temperature, Yield, Pressure range)
Step 2 : Set the tuning parameters : Step size = 1, Number of Iterations = 100
(by default these parameters should be set with the values mentioned)
Step 3 : Select the dependent variable (e.g. Temperature, Yield, Pressure range)
Step 4 : Display the output window containing the following :
o Model summary
o Line fit plot
o Normal probability plot
o Residual versus Fit plot
Note : Categorical predictors should be auto detected &
converted to binary variables before applying regression
11. Sample output : 1. Model Summary
Regression Statistics
Multiple R 0.99
R Square 0.98

             Coefficients  P-value  Lower 95%  Upper 95%
Intercept    13.33         0.00268  5.13       21.52
Temperature  2.04          0.00138  1.93       2.15

Multiple R : It depicts the correlation between X & Y; the closer this value is
to ±1, the higher the correlation
R square : It shows the goodness of fit of the model. It lies between 0 and 1,
and the closer this value is to 1, the better the model
P-value :
o It is used to evaluate whether the corresponding predictor X has any significant impact on the target
variable Y
o As the p-value for Temperature is < 0.05 (highlighted in yellow in the table above), Temperature has a
significant relation with Yield
Coefficient :
o It shows the magnitude as well as the direction of the impact of predictor X (Temperature in this case) on the
target variable Y
o For example, in this case, with a one unit increase in Temperature there is a 2.04 unit increase in Yield
o The value of the Temperature coefficient lies between 1.93 and 2.15 with 95% confidence
Check the Interpretation section for more details
12. Sample output : 2. Plots
(Line fit plot shown with fitted equation y^ = 17 + 2x and R2 = 0.75)
The line fit plot is used to check the assumption of
linearity between X & Y
The normal probability plot is used to check the
assumption of normality & to detect outliers
The residual plot is used to check the assumption
of equal error variances & outliers
Check the Interpretation section for more details
13. Interpretation of Important Model Summary
Statistics
Multiple R :
•R > 0.7 represents a strong
positive correlation
between X and Y
•0.4 < = R < 0.7 represents a
weak positive correlation
between X and Y
•0 <= R < 0.4 represents a
negligible/no correlation
between X and Y
•-0.7 < R <= -0.4 represents
a weak negative
correlation between X and Y
•R < - 0.7 represents a
strong negative correlation
between X and Y
R Square :
•R square > 0.7 represents a
very good model, i.e. the model
is able to explain at least 70%
of the variability in Y
•R square between 0 and 0.7
represents a model that does
not fit well; the assumptions of
normality and linearity
should be checked for a better
fitting model
P value :
•At 95% confidence threshold
, if p-value for a predictor X
is <0.05 then X is a
significant/important
predictor
•At 95% confidence threshold
, if p-value for a predictor X
is >0.05 then X is an
insignificant/unimportant
predictor i.e. it doesn’t have
significant relation with
target variable Y
Coefficients :
•It indicates the magnitude by
which the output variable
will change with a one unit
change in X
•For example, if the coefficient
of X is 2 then Y will
increase by 2 units with a one
unit increase in X
•If the coefficient of X is -2
then Y will decrease by 2
units with a one unit
increase in X
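The Multiple R bands above can be captured in a small helper. This is a hypothetical sketch, not part of the deck; values between -0.4 and 0 are treated symmetrically as negligible, since the slide only lists the non-negative side of that band:

```python
# Hypothetical helper applying the Multiple R interpretation bands above.
# Values in (-0.4, 0) are treated as negligible by symmetry.

def interpret_multiple_r(r):
    strength = abs(r)
    if strength > 0.7:
        label = "strong"
    elif strength >= 0.4:
        label = "weak"
    else:
        return "negligible/no correlation"
    return label + (" positive" if r > 0 else " negative") + " correlation"

print(interpret_multiple_r(0.99))   # strong positive correlation
print(interpret_multiple_r(-0.5))   # weak negative correlation
print(interpret_multiple_r(0.2))    # negligible/no correlation
```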
14. Interpretation of plots
: Line Fit plot
This plot is used to plot the relationship between
X (predictor) & Y(target variable) with Y on y
axis and X on x axis
As shown in figure 1 on the right, as temperature
increases, so does the Yield; hence there is a
linear relationship between X and Y and simple
linear regression is applicable to this data
The fitted regression line and regression equation are
shown in the plot itself, along with the model R
square value, to describe how well the model fits
the data and whether there is a linear relation
between X and Y or not
If R square is low (<0.7) and the line doesn't display
linearity, as shown in figures 2 & 3 on the right, then a
linear regression model is not applicable and a
different model should be considered to predict Y
(Figure 1: fitted line y^ = 17 + 2x, R2 = 0.75; Figure 2: R2 = 0.5; Figure 3: R2 = 0.4)
15. Interpretation of plots
: Normal Probability
plot
This plot charts the percentiles vs. the target/dependent
variable (Y)
It is used to check the assumptions of
linearity and normality in data and also to
detect the outliers
It can be helpful to add the trend line to see
whether the data fits a straight line
The plot in figure 1 shows that the pattern of
dots in the plot lies close to a straight line;
Therefore, data is normally distributed and
there are no outliers
Examples of non-normal data are shown in
figures 2 & 3 on the right, and an example of
outliers is shown in figure 4
16. Interpretation of plots
: Residual versus Fit
plot
It is the scatter plot of residuals on the Y axis and predicted
(fitted) values on the X axis
It is used to detect unequal error variances and outliers
Here are the characteristics of a well-behaved residual vs.
fits plot :
The residuals should "bounce randomly" around the 0 line
and should roughly form a "horizontal band" around the 0
line, as shown in figure 1. This suggests that the variances of
the error terms are equal
No one residual should "stand out" from the basic random
pattern of residuals. This suggests that there are no outliers
For example, the red data point in figure 1 is an outlier; such
outliers should be removed from the data before proceeding
with model interpretation
The plots shown in figures 2 & 3 depict unequal error
variances, which is not desirable for linear regression
analysis
17. Limitations
Simple linear regression is limited to predicting numeric output i.e.
dependent variable has to be numeric in nature
• Minimum sample size should be > 50+8m where m is number of
predictors.
• Hence in case of simple linear regression, minimum sample size should be
50+8(1) = 58
• It handles only two variables : one predictor and one dependent
variable, but usually there is more than one predictor correlated
with the dependent variable, which can't be analyzed through simple
linear regression
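The sample-size rule above is easy to encode (a sketch, not from the deck):

```python
# Minimum sample size rule from the slide: 50 + 8m, where m is the
# number of predictors.

def minimum_sample_size(num_predictors):
    return 50 + 8 * num_predictors

print(minimum_sample_size(1))  # 58 for simple linear regression (one predictor)
```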
18. Limitations
Target/dependent variable should be normally
distributed
A normal distribution is an arrangement of a
data set in which most values cluster in the
middle of the range and the rest taper off
symmetrically toward either extreme. It will
look like a bell curve as shown in figure 1 in right
Outliers in data can affect the analysis, hence
outliers need to be removed
Outliers are the observations lying outside
overall pattern of distribution as shown in figure
2 in right
These extreme values/outliers can be replaced
with 1st or 99th percentile values
(Figure 1: bell curve of a normal distribution; Figure 2: outliers lying outside the overall pattern)
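The percentile replacement suggested above can be sketched with NumPy. This is a hypothetical illustration, not part of the deck, and the sample data is made up:

```python
# Hypothetical sketch: cap extreme values at the 1st and 99th percentiles
# (winsorizing), as suggested above for handling outliers.
import numpy as np

def clip_outliers(values):
    low, high = np.percentile(values, [1, 99])
    return np.clip(values, low, high)

data = np.array([5.0, 7.0, 6.0, 5.5, 6.2, 100.0])  # 100.0 is an outlier
print(clip_outliers(data))  # the outlier is pulled down toward the 99th percentile
```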
19. Business use case 1
• Business problem :
• An ecommerce company wants to measure the impact of product price on product
sales
• Input data:
• Predictor/independent variable is product price data for last year
• Dependent variable is product sales data for last year
• Business benefit:
• The product sales manager will get to know how much, and in what direction, the
product price impacts the product sales
• Decision on product price alteration can be made with more confidence according to
the sales target for that particular product
20. Business use case 2
• Business problem :
• An agriculture production firm wants to predict the impact of amount of rainfall on yield of
particular crop
• Input data:
• Predictor/independent variable : Amount of rainfall during monsoon months last year
• Dependent variable : Crop production data during monsoon months last year
• Business benefit:
• The agriculture firm can predict the yield of a particular crop based on the amount of rainfall
this year, and can plan for alternative crop arrangements and other contingencies if the
amount of rainfall is not adequate, in order to get the desired/targeted crop production
21. Example : Simple linear regression
Consider the data obtained from a chemical process where the yield (Yi ) of the
process is thought to be related to the reaction temperature ( Xi )(see the table in
right)
STEP 1 : Obtain the estimates 𝜷0 and 𝜷1 in the equation Yi = 𝜷0 + 𝜷1 Xi + 𝜺𝒊 using the following equations :
𝛽1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
𝛽0 = ȳ - 𝛽1 x̄
Where
ȳ is the mean of all the observed values of the dependent variable, calculated as ȳ = (Σ yi) / n
x̄ is the mean of all values of the predictor variable, calculated as x̄ = (Σ xi) / n
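Step 1 can be carried out directly in Python (a sketch, not part of the deck), using the Temperature/Yield data from the example table:

```python
# Sketch of Step 1: compute beta0 and beta1 from the estimator equations,
# using the example's Temperature/Yield data.

temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75,
               76, 79, 80, 82, 85, 87, 90, 93, 94, 95, 97, 100]
yield_ = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168,
          162, 171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210, 219]

n = len(temperature)
x_bar = sum(temperature) / n    # mean of the predictor
y_bar = sum(yield_) / n         # mean of the dependent variable

# beta1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temperature, yield_))
s_xx = sum((x - x_bar) ** 2 for x in temperature)
beta1 = s_xy / s_xx

# beta0 = y_bar - beta1 * x_bar
beta0 = y_bar - beta1 * x_bar

print(round(beta0, 2), round(beta1, 2))
```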
22. Example : Simple linear regression
Calculating 𝜷0 and 𝜷1 :
Once 𝜷0 and 𝜷1 are known, the
fitted regression line can be
written as:
y^ = 𝛽0 + 𝛽1 x
Where y^ is the predicted
value based on the fitted
regression model
23. Example : Simple Linear Regression
STEP 2 : Obtain values of y^ for each observation using the regression line fit equation
obtained in Step 1 : y^ = 𝟏𝟕 + 𝟐𝒙
Also compute the corresponding error terms using the equation 𝜺𝒊 = yi - yi^, as shown below:
Predicted values corresponding to each observation :
y1^ = 17 + 2 x1 = 17 + 2*50 = 117
y2^ = 17 + 2 x2 = 17 + 2*53 = 123
...
y25^ = 17 + 2 x25 = 17 + 2*100 = 217
Error values corresponding to each predicted value :
𝜺1^ = y1 - y1^ = 112 - 117 = -5
𝜺2^ = y2 - y2^ = 118 - 123 = -5
...
𝜺25^ = y25 - y25^ = 219 - 217 = 2
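Step 2 can be reproduced with a few lines (a sketch, not from the deck), using the rounded fitted line y^ = 17 + 2x:

```python
# Sketch of Step 2: predicted values and error terms from the fitted line
# y_hat = 17 + 2x, for the first, second and last observations of the table.

def y_hat(x):
    return 17 + 2 * x

observations = [(50, 112), (53, 118), (100, 219)]  # (temperature, yield) rows

for x, y in observations:
    error = y - y_hat(x)  # epsilon_i = y_i - y_hat_i
    print(x, y_hat(x), error)
# The errors are -5, -5 and 2 respectively
```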
24. Example : Simple Linear Regression
STEP 3 : Obtain the significance value (p value) to understand whether there exists a relation between the
predictor and the dependent variable, i.e. temperature and yield in this case
To get the P value, we need the T statistic, the degrees of freedom and the significance
level (𝛼), which can be obtained as follows:
1. Calculate the standard error for 𝜷1 : SE(𝜷1) = sqrt( Σ𝜺i² / (n-2) ) / sqrt( Σ(xi - x̄)² )
2. Calculate the t statistic : t = 𝜷1 / SE(𝜷1)
3. Calculate the P value : P(T < t0) is obtained from the t table
Assuming that the desired significance level is 0.1 (i.e. 90% confidence threshold), since the P value <
0.1 here, there exists a relation between the Temperature and Yield variables.
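Step 3 can be sketched in Python (not part of the deck); here the t-table lookup is replaced by `scipy.stats.t.sf`:

```python
# Sketch of Step 3 for the example data: standard error of beta1, the
# t statistic, and a two-sided p-value from scipy's t distribution
# instead of a printed t table.
from scipy import stats

temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75,
               76, 79, 80, 82, 85, 87, 90, 93, 94, 95, 97, 100]
yield_ = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168,
          162, 171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210, 219]

n = len(temperature)
x_bar = sum(temperature) / n
y_bar = sum(yield_) / n
s_xx = sum((x - x_bar) ** 2 for x in temperature)
beta1 = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(temperature, yield_)) / s_xx
beta0 = y_bar - beta1 * x_bar

residuals = [y - (beta0 + beta1 * x) for x, y in zip(temperature, yield_)]
sse = sum(e ** 2 for e in residuals)

df = n - 2                                   # degrees of freedom: N - K
se_beta1 = (sse / df) ** 0.5 / s_xx ** 0.5   # standard error of beta1
t_stat = beta1 / se_beta1                    # t statistic
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value

print(p_value < 0.1)  # a clearly significant relation at alpha = 0.1
```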
25. Example : Simple Linear Regression
STEP 4 : Calculate the measure of model
accuracy : Coefficient of Determination (R2)
Before any inferences are undertaken,
model accuracy must be checked
This metric shows what % of the variability in Y (dependent variable : Yield in this case)
can be explained/predicted by the fitted model : R2 = 1 - SSE/SST
The closer the value of R2 is to 1, the better the
fitted model
In this case it is 0.98, indicating that 98% of the
variability in Yield is explained by the
fitted model. Thus, the model is highly accurate
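Step 4 can be verified in Python (a sketch, not part of the deck); note it uses the deck's rounded line y^ = 17 + 2x, so the value can differ slightly from the exact least-squares fit:

```python
# Sketch of Step 4: R^2 = 1 - SSE/SST for the example data, using the
# deck's rounded regression line y_hat = 17 + 2x.

temperature = [50, 53, 54, 55, 56, 59, 62, 65, 67, 71, 72, 74, 75,
               76, 79, 80, 82, 85, 87, 90, 93, 94, 95, 97, 100]
yield_ = [112, 118, 128, 121, 125, 136, 144, 142, 149, 161, 167, 168,
          162, 171, 175, 182, 180, 183, 188, 200, 194, 206, 207, 210, 219]

predicted = [17 + 2 * x for x in temperature]

y_bar = sum(yield_) / len(yield_)
sse = sum((y - p) ** 2 for y, p in zip(yield_, predicted))  # residual sum of squares
sst = sum((y - y_bar) ** 2 for y in yield_)                 # total sum of squares
r_square = 1 - sse / sst

print(round(r_square, 2))
```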
26. Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018