Generalized Linear Regression with Gaussian Distribution is a statistical technique based on the Generalized Linear Model (GLM), a flexible generalization of ordinary linear regression that allows for response variables whose error distribution is not necessarily normal. The GLM generalizes linear regression by relating the linear model to the response variable through a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. In this case the distribution family is Gaussian, whose canonical link is the identity function, so the model is equivalent to ordinary linear regression with normally distributed errors.
4. Terminologies
▪ Target variable, usually denoted by Y, is the variable being predicted. It is also called the dependent variable, output variable, response variable or outcome variable (e.g., the one highlighted in the red box in the table below).
▪ Predictor, sometimes called an independent variable, is a variable that is used to predict the target variable (e.g., the variables highlighted in the green box in the table below).
The predictors highlighted in the green box are the attributes on which the target variable highlighted in the red box (i.e., Loan amount) depends.
Loan amount   Debt to income ratio   Grade   Annual income   Verification status
9000          30                     A       9632            Not Verified
4800          26                     B       5022            Not Verified
20000         25                     B       5149            Not Verified
6000          29                     A       5225            Verified
3000          27                     C       5344            Verified
5. Terminologies (Continued…)
• Coefficients:
  • They show the magnitude as well as the direction of the impact of each predictor on the target variable Y, i.e., whether the relationship between the dependent and independent variables is positive or negative.
  • A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase.
  • A negative coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable tends to decrease.
• P-Value:
  • It is used to evaluate whether the corresponding predictor X has a statistically significant impact on the target variable Y.
  • At a 95% confidence level, if the p-value for a predictor is < 0.05, then it has a significant impact on the target variable.
  • At a 95% confidence level, if the p-value for a predictor is >= 0.05, then it doesn't have a significant impact on the target variable.
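The coefficient and p-value rules above can be applied mechanically to a model summary. The sketch below is illustrative only: the `classify_predictors` helper and its `alpha` parameter are hypothetical names, and the coefficient and p-value pairs are copied from the sample model summary table in this document.

```python
# Illustrative sketch: label each predictor by the sign of its coefficient
# and by whether its p-value falls below the significance level.

# (coefficient, p-value) pairs copied from the sample model summary table.
MODEL_SUMMARY = {
    "Verification status (Not Verified)": (-0.6004, 0.0),
    "Grade(B)": (-0.0043, 0.3950353),
    "Grade(A)": (-0.0173, 0.0016323),
    "Grade(C)": (-0.0108, 0.0556645),
    "Grade(D)": (0.0057, 0.3778658),
    "Grade(E)": (0.027, 0.0010318),
    # The source table shows this coefficient as 0 (presumably rounded).
    "Annual income": (0.0, 0.0177066),
    "Debt to income ratio": (-0.0024, 0.0),
}


def classify_predictors(summary, alpha=0.05):
    """Return {name: (direction, significance)} for each predictor."""
    result = {}
    for name, (coefficient, p_value) in summary.items():
        if coefficient > 0:
            direction = "positive"
        elif coefficient < 0:
            direction = "negative"
        else:
            direction = "neutral"
        significance = "significant" if p_value < alpha else "insignificant"
        result[name] = (direction, significance)
    return result


labels = classify_predictors(MODEL_SUMMARY)
for name, (direction, significance) in labels.items():
    print(f"{name}: {direction}, {significance}")
```

The 0.05 default corresponds to the 95% confidence level used throughout this document; a different level would simply be passed as `alpha`.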
6. Introduction
• Objective:
  – Generalized Linear Model (GLM) regression is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.
• Benefit:
  – It is a low-complexity model in which the response variable can follow any distribution from the exponential family, such as Gaussian or Poisson. It is also easy to interpret and shows how each predictor influences the outcome.
• Model:
  – Here yi follows the Gaussian family: the measured values follow a normal distribution, which is symmetric about the mean, so a measurement is equally likely to fall above or below the mean value. The distribution provides a parameterized mathematical function that can be used to calculate the probability density of any individual observation from the sample.
  – In the exponential-family form of this function, 𝛳 is the canonical parameter, which represents the estimate of location, and φ is the dispersion parameter, which represents the scale.
  – GLM allows the distribution of y to take the shape of many different members of the exponential family.
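As a point of reference, the exponential-family density that a GLM assumes is commonly written in the form below. This is the standard textbook form, assumed here rather than taken from the source; the functions a, b and c belong to that assumed notation, and φ denotes the dispersion parameter. The second line shows how the Gaussian case fits this form.

```latex
f(y_i;\theta,\phi) = \exp\!\left( \frac{y_i\theta - b(\theta)}{a(\phi)} + c(y_i,\phi) \right)

% Gaussian case: \theta = \mu,\quad b(\theta) = \theta^2/2,\quad a(\phi) = \phi = \sigma^2
f(y_i;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i-\mu)^2}{2\sigma^2} \right)
```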
7. Example: Generalized Linear Regression (Gaussian Distribution)
Let’s conduct the Generalized Linear Regression (Gaussian Distribution) analysis on the Loan Eligibility data set, with Grade, Annual Income, Debt to Income Ratio and Verification Status as independent variables and Loan Amount as the target variable, as shown below:
Target variable (Y)   Independent variables (Xi)
Loan Amount           Debt to Income Ratio   Grade   Annual Income   Verification Status
725                   30                     A       9632            Not Verified
1000                  26                     B       5022            Not Verified
1000                  25                     B       5149            Not Verified
1000                  29                     A       5225            Verified
1000                  27                     C       5344            Verified
R-Squared 0.867
Adjusted R-Squared 0.868
The model is an excellent fit when Adjusted R-Squared > 0.7.
R-Squared: A coefficient of determination that represents the proportion of variance in the target variable explained by the regression model.
Adjusted R-Squared: A modified version of R-squared that has been adjusted for the number of predictors in the model. It shows whether adding additional predictors improves a regression model or not, and it is used to judge the goodness of fit of the model.
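The two statistics are related by the usual adjustment formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below is illustrative: the function names are hypothetical, the five actual and predicted Loan amount values are taken from the sample output table in this document, and `n=1000` and `p=4` in the last call are made-up values, since the source does not state the sample size.

```python
def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalise R-squared for using p predictors on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Five rows from the sample output table (actual vs predicted Loan amount).
actual = [9000, 4800, 20000, 6000, 3000]
predicted = [6393.003, 5664.367, 24760.365, 8364.874, 2509.877]
r2_sample = r_squared(actual, predicted)
print(f"R-squared on the five sample rows: {r2_sample:.3f}")

# Hypothetical n and p, used only to show the size of the adjustment.
adj = adjusted_r_squared(0.867, n=1000, p=4)
print(f"Adjusted R-squared for R-squared 0.867, n=1000, p=4: {adj:.4f}")
```

With this formula the adjusted value is never larger than R-squared itself, and the gap grows as predictors are added without a matching gain in explained variance.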
8. Standard Input/Tuning Parameters & Sample UI
Step 1: Select the Target Variable
o Debt to income ratio
o Loan amount
o Grade
o Verification status
o Annual income
Step 2: Select the Predictor variable(s)
o Debt to income ratio
o Loan amount
o Grade
o Verification status
o Annual income
More than one predictor can be selected.
Step 3: Set the tuning parameters
family = gaussian
(This sets the distribution family, which determines the valid link functions; gaussian is the default family.)
By default, these parameters should be set with the values mentioned.
Step 4: Display the output window containing the following:
o Model summary
o Line fit plot
o Residual versus fit plot
Note:
▪ The selection of predictors depends on business knowledge and on the correlation between the target variable and the predictors.
9. Sample Output: 1. Interpretation
The Influencer’s Importance chart shows the impact of each predictor on the target variable.
[Chart: Influencer’s Importance. Target variable: Loan amount. Legend: Positive, Negative, Neutral]
10. Sample Output: 2. Model Summary
R-Squared 0.867
Adjusted R-Squared 0.868
● R-Squared: A coefficient of determination that represents the proportion of variance in the target variable explained by the regression model.
● Adjusted R-Squared: A modified version of R-squared that is adjusted for the number of predictors in the model, so that predictors which do not improve the model do not inflate it.
Root Mean Square Error (RMSE)           2847.658
Mean Absolute Error (MAE)               1979.2681
Mean Absolute Percentage Error (MAPE)   0.2484
Mean Percentage Error (MPE)             -0.0735
11. Sample Output: 2. Model Summary (Continued…)
Variable                             Coefficient   P-Value
Intercept                            0.8272        0.0
Verification status (Not Verified)   -0.6004       0.0
Grade(B)                             -0.0043       0.3950353
Grade(A)                             -0.0173       0.0016323
Grade(C)                             -0.0108       0.0556645
Grade(D)                             0.0057        0.3778658
Grade(E)                             0.027         0.0010318
Annual income                        0             0.0177066
Debt to income ratio                 -0.0024       0.0
[Chart: Variable Significance by P-value. Legend: Insignificant, Significant]
12. Sample Output: 2. Model Summary (Continued…)
● P-Value: At a 95% confidence level, if the p-value for a predictor is < 0.05, then it has a significant impact on the target variable; if the p-value is >= 0.05, then it doesn't have a significant impact on the target variable.
● Root Mean Square Error (RMSE): Square root of the average of the squared differences between the predictions and the actual observations. It is the standard deviation of the residual errors.
● Mean Absolute Error (MAE): Average of the absolute differences between the predictions and the actual observations.
● Mean Absolute Percentage Error (MAPE): Mean of the absolute ratio of the residual to the actual observation.
● Mean Percentage Error (MPE): Mean of the signed ratio of the residual to the actual observation. Its sign conveys whether there are more positive errors than negative errors or vice versa.
● RMSE, MAE, MAPE and MPE are used to measure how far the predicted values are from the actual values.
● For RMSE, MAE and MAPE, lower values represent a better fit of the regression model; for MPE, values closer to zero indicate less bias.
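The four error measures defined above can be computed directly from paired actual and predicted values. The sketch below is a minimal illustration, not the tool's implementation: the `error_metrics` helper is a hypothetical name, errors are assumed to be actual minus predicted (matching the residual definition in this document), MAPE and MPE are assumed to be reported as fractions rather than percentages, and the five data rows come from the sample output table in this document, so the results differ from the full-data summary values.

```python
import math

# Actual and predicted Loan amount values from the sample output table.
actual = [9000, 4800, 20000, 6000, 3000]
predicted = [6393.003, 5664.367, 24760.365, 8364.874, 2509.877]


def error_metrics(actual, predicted):
    """Return RMSE, MAE, MAPE and MPE, with errors taken as actual - predicted."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)
    mae = sum(abs(r) for r in residuals) / n
    # MAPE and MPE are kept as fractions (not multiplied by 100) and are
    # undefined when any actual value is zero.
    mape = sum(abs(r / a) for r, a in zip(residuals, actual)) / n
    mpe = sum(r / a for r, a in zip(residuals, actual)) / n
    return {"rmse": rmse, "mae": mae, "mape": mape, "mpe": mpe}


metrics = error_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name.upper()}: {value:.4f}")
```

RMSE squares the errors before averaging, so it is never smaller than MAE and reacts more strongly to a few large misses; MPE keeps the sign, so positive and negative errors can cancel.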
13. Sample Output: 3. Predicted Class & Residuals
Loan amount   Debt to income ratio   Grade   Annual income   Verification status   Predicted Loan amount   Regression Residuals   Regression Standardized Residuals
9000          25                     B       11585           Verified              6393.003                2606.997               0.92
4800          25                     E       9082            Not Verified          5664.367                -864.367               -0.305
20000         1                      B       17627           Verified              24760.365               -4760.365              -1.681
6000          23                     B       14689           Not Verified          8364.874                -2364.874              -0.835
3000          29                     A       9523            Not Verified          2509.877                490.123                0.173
The output data contains the predicted Loan amount column along with the residuals and standardized residuals.
14. Sample Output: 3. Predicted Class & Residuals (Continued…)
• Residuals: The difference between the observed value of the dependent variable and the predicted value is called the residual.
  ● Residual = Original Value - Predicted Value.
• Standardized Residuals: The ratio of the residual (the difference between the actual and predicted values) to the standard deviation of the residuals. It indicates how large an individual error is relative to the typical error.
  ● Standardized Residual = Residual / Standard Deviation of Residuals.
• Interpretations based on residual values:
  ● A positive residual indicates that the prediction is lower than the actual value.
  ● A negative residual indicates that the prediction is higher than the actual value.
  ● A zero residual indicates that the prediction exactly matches the actual value.
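The two formulas above can be sketched in a few lines. This is an illustration under stated assumptions: the actual and predicted values are the five rows of the sample output table in this document, and the sample standard deviation of those five residuals is used as the scale, whereas the source does not say which estimator its tool uses, so the standardized values here will not match the table.

```python
import statistics

# Actual and predicted Loan amount values from the sample output table.
actual = [9000, 4800, 20000, 6000, 3000]
predicted = [6393.003, 5664.367, 24760.365, 8364.874, 2509.877]

# Residual = original value - predicted value.
residuals = [a - p for a, p in zip(actual, predicted)]

# Assumption: scale by the sample standard deviation of these five residuals.
scale = statistics.stdev(residuals)
standardized = [r / scale for r in residuals]

for a, p, r, z in zip(actual, predicted, residuals, standardized):
    if r > 0:
        note = "prediction lower than actual"
    elif r < 0:
        note = "prediction higher than actual"
    else:
        note = "prediction matches actual"
    print(f"actual={a} predicted={p} residual={r:.3f} standardized={z:.3f} ({note})")
```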
15. Interpretation of Important Model Summary Statistics
R-Squared:
• A coefficient of determination that represents the proportion of variance in the target variable explained by the regression model.
• It lies between 0 and 1; the closer this value is to 1, the better the model.
Adjusted R-Squared:
• A modified version of R-squared that is adjusted for the number of predictors in the model, so that only predictors which improve the model raise its value.
• Adjusted R-squared < 0.5: The model is not a good fit, and predictions are not accurate.
• 0.5 <= Adjusted R-squared < 0.7: The model is a good fit, and predictions are reasonably accurate.
• Adjusted R-squared >= 0.7: The model is a very good fit, and predictions are accurate.
P Value:
• At a 95% confidence level, if the p-value for a predictor is < 0.05, then it has a significant impact on the target variable.
• At a 95% confidence level, if the p-value for a predictor is >= 0.05, then it doesn't have a significant impact on the target variable.
16. Interpretation of Important Model Summary Statistics (Continued…)
RMSE:
• Square root of the average of the squared differences between the predictions and the actual observations. It is the standard deviation of the residual errors.
• Lower values of RMSE indicate a better fit. The value ranges from 0 to ∞.
MAE:
• Average of the absolute differences between the predictions and the actual observations.
• Lower values of MAE indicate a better fit. The value ranges from 0 to ∞.
• Like RMSE, it is a negatively oriented score, i.e., lower is better.
MAPE:
• Mean of the absolute ratio of the residual to the actual observation.
• The lower the MAPE, the better the performance of the model.
MPE:
• Mean Percentage Error conveys, through its sign, whether there are more positive errors than negative errors or vice versa.
• With errors computed as actual minus predicted, more negative errors mean that the model overestimates, and more positive errors mean that it underestimates.
17. Sample Output: 4. Plots
Line fit plot:
• The line fit plot is plotted with Loan amount against Annual income.
• Line fit plots are used to check the assumption of linearity between each Xi & Y.
[Plot: Line fit plot. X axis: Annual income. Y axis: Loan amount]
Residual versus fit plot:
• The residual versus fit plot is plotted with Standardized Residuals against Predicted Loan amount.
• The residual versus fit plot is used to check the assumption of equal error variances and to detect outliers.
[Plot: Residual versus fit plot. X axis: Predicted Loan amount. Y axis: Standardized Residuals]
18. Interpretation of Plots: Line Fit Plot
• This plot shows the relationship between each Xi (predictor) and Y (target variable), with Y on the y axis and each Xi on the x axis.
• As shown in figure 1, as Annual income (X) increases, so does the Loan amount (Y); hence there is a linear relationship between X and Y, and generalized linear regression (Gaussian distribution) is applicable to this data.
• If the plot doesn’t display linearity, as in figures 2 & 3, then a transformation can be applied to that variable before proceeding with model building.
• If data transformation doesn’t help, then either that variable (Xi) can be dropped from the analysis or a nonlinear model should be chosen, depending on the distribution pattern of the scatter plot.
[Figure 1: Line fit plot. X axis: Annual income. Y axis: Loan amount]
[Figure 2]
[Figure 3]
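A numeric companion to the line fit plot is to fit a least-squares line and look at the Pearson correlation between a predictor and the target. The sketch below is a minimal illustration, not the tool's own procedure: the `fit_line` helper is a hypothetical name, and the five Annual income and Loan amount pairs are taken from the sample output table in this document. For a single predictor, a Gaussian GLM with the identity link gives the same line as ordinary least squares.

```python
import math

# Annual income (X) and Loan amount (Y) from the sample output table.
annual_income = [11585, 9082, 17627, 14689, 9523]
loan_amount = [9000, 4800, 20000, 6000, 3000]


def fit_line(x, y):
    """Least-squares slope, intercept and Pearson correlation of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / math.sqrt(sxx * syy)
    return slope, intercept, correlation


slope, intercept, correlation = fit_line(annual_income, loan_amount)
print(f"slope={slope:.3f} intercept={intercept:.1f} correlation={correlation:.3f}")
```

A correlation near +1 or -1 is consistent with a straight-line relationship, but it does not replace the plot: a curved pattern can still produce a fairly high correlation, so the scatter should be inspected as well.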
19. Interpretation of Plots: Residual Versus Fit Plot
• It is the scatter plot of standardized residuals on the y axis and predicted (fitted) values on the x axis.
• It is used to detect unequal residual variances and outliers in the data.
• Here are the characteristics of a well-behaved residual vs. fits plot:
  ⮚ The residuals should "bounce randomly" around the 0 line and should roughly form a "horizontal band" around the 0 line, as shown in figure 1. This suggests that the variances of the error terms are equal.
  ⮚ No single residual should "stand out" from the basic random pattern of residuals. This suggests that there are no outliers.
⮚ The plots shown in figures 2 & 3 depict unequal error variances, which is not desirable for regression analysis.
[Figure 1: Residual versus fit plot. X axis: Predicted Loan amount. Y axis: Standardized Residuals]
[Figure 2]
[Figure 3]
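One way to make the "no residual should stand out" check concrete is to flag standardized residuals beyond a fixed cut-off. The sketch below is illustrative only: the `flag_outliers` helper and the cut-off of 3 are assumptions (a common rule of thumb, not something the source specifies), and the values are the predicted Loan amounts and standardized residuals from the sample output table in this document.

```python
# Predicted values and standardized residuals from the sample output table.
predicted = [6393.003, 5664.367, 24760.365, 8364.874, 2509.877]
standardized_residuals = [0.92, -0.305, -1.681, -0.835, 0.173]


def flag_outliers(fitted, std_residuals, cutoff=3.0):
    """Return (fitted value, standardized residual) pairs with |residual| > cutoff."""
    return [(f, r) for f, r in zip(fitted, std_residuals) if abs(r) > cutoff]


outliers = flag_outliers(predicted, standardized_residuals)
print(f"{len(outliers)} of {len(predicted)} points flagged as possible outliers")
```

This only addresses outliers; unequal error variance still has to be judged from the shape of the band in the plot.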
20. Limitations
● Generalized Linear Regression (Gaussian distribution) is limited to predicting numeric output, i.e., the dependent variable must be numeric in nature.
● The sample size should be at least 20 cases per independent variable.
● Sometimes the data is categorical, or is time series data that is not normally distributed; neither is supported by the Gaussian distribution in the generalized linear regression model.
[Figure: Time independent error (fairly constant over time and lying within a certain range)]
[Figure: Time dependent error (decreasing with time)]
21. Limitations (Continued…)
● A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically towards either extreme. It looks like a bell curve, as shown in figure 1.
● Outliers in the data (in the target as well as the independent variables) can affect the analysis; hence outliers need to be removed.
● Outliers are the observations lying outside the overall pattern of the distribution, as shown in figure 2.
[Figure 1: Normal distribution (bell curve)]
[Figure 2: Outliers]
22. Business Use Case 1
• Business Problem: Product’s Profit Prediction
  • Identifying the profit made by each product based on various factors such as its total revenue, number of units sold, region of sale, etc.
• Input Data:
  • Predictor/independent variables:
    • Total Revenue
    • Units Sold
    • Region
    • Total Cost
  • Target/dependent variable:
    • Total Profit
• Business Benefit:
  • The predictive model will help us identify the profit on different products based on the sales, region and other cost factors.
23. Business Use Case 2
• Business Problem: Student’s Chance Of Admission Prediction
  • To determine a student’s chance of getting admission based on certain educational scores and factors.
• Input Data:
  • Predictor/independent variables:
    • CGPA
    • GRE Score
    • LOR
    • Serial No
    • TOEFL Score
  • Target/dependent variable:
    • Chance of admit
• Business Benefit:
  • Using generalized linear regression, we can determine to what extent a person qualifies for admission based on various educational factors. This eases the entire admission process and allows the most eligible students to be selected.
24. Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
September 2021