Shahid Lecture-8- MKAG1273

MAL1303: STATISTICAL HYDROLOGY
Multiple Regression
Dr. Shamsuddin Shahid
Associate Professor
Department of Hydraulics and Hydrology
Faculty of Civil Engineering
Room No.: M46-332;
Phone: 07-5531624; Mobile: 0182051586
Email: sshahid@utm.my
11/23/2015 Shamsuddin Shahid, FKA, UTM

Simple Linear Regression
Simple Linear Regression (SLR) is a statistical
technique that is used to determine the
functional relationship between two variables.
Regression gives an equation that best describes
the relationship between two variables.

Multiple Linear Regression (MLR)
Multiple linear regression is a statistical technique in which a
dependent variable is predicted from a set of predictors.
It is used to identify the relationship between a dependent
variable and a combination of independent variables.
The relationship is valid when a few assumptions are fulfilled.
Failing to satisfy the assumptions does not mean that the
relationship is incorrect. It means that the relationship may
not be strong enough.

Linear Multiple Regression: Assumptions
• The variables should be measured on an interval/ratio scale.
• The dependent variable, Y, must be normally distributed (no
skewness or outliers).
• The predictors, X's, do not need to be normally distributed, but
if they are, it makes for a stronger interpretation.
• There should be a linear relationship between Y and every X.
• There should be no outliers among the X's predicting Y.
• The variance of Y is the same at all values of X
(homoscedastic).

Linear Multiple Regression: Outliers
• Outliers can distort the results of multiple regression just as they
do in simple linear regression. When an outlier is included in the
analysis, it pulls the regression line towards itself. This can result in a
solution that is more accurate for the outlier, but less accurate for all
of the other cases in the data set.
• It is necessary to check for outliers in the dependent variable and in
the independent variables.
• Removing an outlier may improve the distribution of a variable.
• Transforming a variable may reduce the likelihood that the value for a
case will be characterized as an outlier.
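The pull of an outlier on the fitted line can be seen in a small sketch. The data below are hypothetical (five points lying exactly on y = 1 + 2x plus one added outlier), and `fit_line` is an illustrative helper, not part of the lecture.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: five points exactly on y = 1 + 2x.
x_clean = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [3.0, 5.0, 7.0, 9.0, 11.0]

# The same data plus one outlier far above the line (the line gives 13 at x = 6).
x_all = x_clean + [6.0]
y_all = y_clean + [30.0]

b0_clean, b1_clean = fit_line(x_clean, y_clean)
b0_all, b1_all = fit_line(x_all, y_all)

print("without outlier: intercept, slope =", b0_clean, b1_clean)
print("with outlier:    intercept, slope =", b0_all, b1_all)
```

With the outlier included, the slope rises well above 2, so the line fits the outlier better and the other five cases worse.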

Linear Multiple Regression: Steps
1. Decide the dependent and independent variables.
2. Test for normality, linearity and homoscedasticity.
3. If necessary, remove the outliers.
4. If the data do not satisfy the criteria for normality, a
transformation is required. Decide which transformations should be used.
5. Substitute the transformed variables and run the regression,
entering all independent variables.
6. Do the multiple regression analysis with the variables specified in
the problem.
7. Test the significance of the regression equation.

Simple Linear Regression
In Simple Linear Regression (SLR), the functional relationship
between two variables X and Y is determined.
The regression equation is the equation of the straight line that best
describes the relationship between the two variables.
When the equation is used to calculate Y from an observed X, the
prediction contains an error ε. Therefore, Y equals the
predicted value plus the error.
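Written out in symbols, as a sketch that assumes the conventional names β0 for the intercept and β1 for the slope (the same β notation the lecture uses for coefficients):

```latex
\hat{Y} = \beta_0 + \beta_1 X,
\qquad
Y = \hat{Y} + \varepsilon = \beta_0 + \beta_1 X + \varepsilon
```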

Multiple Linear Regression (MLR): Basics
A multiple linear regression model is called "linear" because the
coefficients {β} enter it only linearly. Transforms of the
regressor variables are permitted in an MLR model, as in SLR.
In Multiple Linear Regression (MLR), the functional relationship of a
dependent variable Y with more than one independent variable is
determined.
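As an illustration of why transformed regressors keep the model linear (the particular transforms are assumed examples, not taken from the lecture), the first model below is still an MLR model because each β enters linearly, while the second is not linear in β1:

```latex
% Linear in the coefficients: a valid MLR model
Y = \beta_0 + \beta_1 \ln X_1 + \beta_2 X_2^{2} + \varepsilon

% Not linear in \beta_1: not an MLR model in this form
Y = \beta_0 \, e^{\beta_1 X_1} + \varepsilon
```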

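$$
\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}}_{\text{data } (4 \times 1)}
=
\underbrace{\begin{bmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \\ x_{14} & x_{24} \end{bmatrix}}_{\text{design matrix } (4 \times 2)}
\underbrace{\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}}_{\text{parameters } (2 \times 1)}
$$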

Multiple Linear Regression: Basics
Create the design matrix X.
Calculate the parameters:
$$ b = (X^T X)^{-1} X^T y $$
where X^T is the transpose of matrix X and
(X^T X)^{-1} is the inverse of the matrix X^T X.
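The parameter calculation b = (XᵀX)⁻¹Xᵀy can be sketched in plain Python. The helper names (`transpose`, `matmul`, `inverse`, `fit_mlr`) and the data are hypothetical, and the leading column of ones in the design matrix is an assumption made here so that the model has an intercept.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Matrix product of a (m x n) and b (n x q)."""
    b_cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]

def inverse(a):
    """Gauss-Jordan inverse of a square matrix, with partial pivoting."""
    n = len(a)
    # Augment a with the identity matrix: [a | I]
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_mlr(X, y):
    """Return b = (X^T X)^-1 X^T y as a flat list."""
    y_col = [[v] for v in y]
    Xt = transpose(X)
    b = matmul(matmul(inverse(matmul(Xt, X)), Xt), y_col)
    return [row[0] for row in b]

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Design matrix: column of ones (intercept, an assumption), then x1 and x2
X = [[1.0, a, b] for a, b in zip(x1, x2)]

coeffs = fit_mlr(X, y)
print([round(c, 6) for c in coeffs])
```

Because the data contain no noise, the estimated coefficients recover the generating values up to floating-point error.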

The Goodness of Fit of the Regression Model
One measure of how well a statistical model explains the observed
data is the coefficient of determination, that is, the square of the
Pearson correlation coefficient, r², between y and x.
When x is replaced by the predicted value ŷ,
it gives the squared correlation between the actual and predicted values, R².
It can also be measured by
$$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$
where SST, SSR and SSE are the total, regression and error sums of squares.

The Goodness of Fit of the Regression Model
The distinctions between r and R are:
• r is a measure of association between two random variables,
whereas R is a measure between a random variable y and its
prediction from a regression model.
• r lies in the interval -1 ≤ r ≤ 1, while the multiple correlation R
cannot be negative; that is, it lies in the interval 0 ≤ R ≤ 1.
• R is always well defined, regardless of whether the independent
variable is assumed to be random or fixed. In contrast, calculating
the correlation between a random variable, Y, and a fixed predictor
variable, X, that is, a variable that is not considered random, makes
no sense.
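The difference in sign between r and R can be checked numerically. This is a minimal sketch on hypothetical, decreasing data; `pearson_r` and `fit_line` are illustrative helpers.

```python
import math

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sv = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (su * sv)

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data in which y decreases as x increases
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [11.7, 10.1, 7.8, 6.2, 3.9, 2.1]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]

mean_y = sum(y) / len(y)
sst = sum((yi - mean_y) ** 2 for yi in y)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r = pearson_r(x, y)        # association between x and y: negative here
R = pearson_r(y, y_hat)    # correlation between y and its prediction: non-negative
r2_from_sums = 1 - sse / sst

print("r =", r)
print("R =", R)
print("R^2 =", R ** 2, " 1 - SSE/SST =", r2_from_sums)
```

For this least-squares fit, r is negative, R equals |r|, and R² agrees with 1 - SSE/SST.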

Multiple Linear Regression: Example
It is well known that groundwater recharge is directly related to
Rainfall and Soil Moisture Holding Capacity (SMHC). Instrumental
data on groundwater recharge, Rainfall and SMHC have been
collected at six sites. Find an empirical equation that relates groundwater
recharge to Rainfall and SMHC.

Multiple Linear Regression: Solution
Create the design matrix X.
Get the solution by:
$$ b = (X^T X)^{-1} X^T y $$

Multiple Linear Regression: Solution
Excel commands:
Matrix Inversion: MINVERSE(array)
Matrix Multiplication: MMULT(array1, array2)
Matrix Transpose: Copy the matrix -> Paste Special with the
Transpose box ticked.

Recharge = 1.38 + 0.12Rainfall – 0.01SMHC

Multiple Linear Regression (MLR): Assumptions
Basic assumptions about the errors:
1. The mean of the errors is zero.
2. The errors are normally distributed.
3. The variances of the errors for all observations are
constant.
4. The errors are independent of each other (uncorrelated).
Gross violations of these basic assumptions will yield a
poor or biased model. However, if the variances of the
errors are unequal and can be estimated, weighted
regression schemes can sometimes be used to obtain a
better model.

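Multiple Linear Regression: Confidence Interval
The parameter values have a range. We can find the range of a
parameter at a certain level of confidence by using the following
formula:
$$ b_j - t_{\alpha/2,\,n-p}\sqrt{s^2 C_{jj}} \;\le\; \beta_j \;\le\; b_j + t_{\alpha/2,\,n-p}\sqrt{s^2 C_{jj}} $$
where s² is the variance of the residuals and
C_jj is the corresponding diagonal value of the matrix (X^T X)^{-1}.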

Multiple Linear Regression: Confidence Interval
n = 6, p = 3
At α = 0.05,
t(0.025, 3) = 4.18
s2 = 0.084
-0.35 ≤ β0 ≤ 3.11
-0.10 ≤ β1 ≤ 0.35
-0.16 ≤ β2 ≤ 0.14
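A minimal sketch of the interval computation, estimate ± t × sqrt(s² × C_jj), is given here. The function name and the input values are hypothetical and are not the values behind the intervals listed for the recharge example; the t value is assumed to be looked up in a table for the chosen α and n - p degrees of freedom.

```python
import math

def confidence_interval(b_j, t_crit, s2, c_jj):
    """Interval b_j +/- t_crit * sqrt(s2 * c_jj).

    b_j    : estimated coefficient
    t_crit : tabulated t value for alpha/2 and n - p degrees of freedom
    s2     : variance of the residuals
    c_jj   : diagonal element of (X^T X)^-1 belonging to b_j
    """
    half_width = t_crit * math.sqrt(s2 * c_jj)
    return b_j - half_width, b_j + half_width

# Hypothetical inputs chosen so the arithmetic is easy to follow
low, high = confidence_interval(b_j=2.0, t_crit=2.0, s2=0.25, c_jj=4.0)
print(low, high)
```

If the resulting interval contains zero, the corresponding regressor cannot be distinguished from having no effect at that confidence level.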

Multiple Linear Regression: Efficient Estimator
• An estimator with lower variance is more efficient, in the
sense that it is likely to be closer to the true value over
samples.
• The "best" estimator is the one with the minimum variance of all
estimators.
-0.35 ≤ β0 ≤ 3.11
-0.10 ≤ β1 ≤ 0.35
-0.16 ≤ β2 ≤ 0.14

Multiple Linear Regression: Strength
SST = SSE + SSR
Sum of Squares Total (SST) = Total variability in the observed responses
Sum of Squares Error (SSE) = Total error of the model, or variability that is not
explained by the model
Sum of Squares Regression (SSR) = Systematic variability that is explained by the
regression model.
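In symbols, using the standard definitions (assumed here, since the slide text gives only the verbal descriptions), with ȳ the mean of the n observations and ŷ_i the model prediction for observation i:

```latex
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
```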

Mean variation in observations, MST = SST / (n - 1)
Mean error, MSE = SSE / (n - p)
Mean regression, MSR = SSR / k, where k = p - 1 is the number of
regressors (k = 1 in simple linear regression)
Higher values of R² indicate a better fit of the model to the sample
observations.
Disadvantage of R²: Adding any regressor variable to an MLR
model, even an irrelevant regressor, yields a smaller SSE and a
greater R². For this reason, R² by itself is not a good measure of
the quality of fit.

To overcome this deficiency in R², an adjusted value can be used.
The adjusted coefficient of multiple determination (R²adj) is defined
as
$$ R^2_{adj} = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \frac{MSE}{MST} $$
Because the number of model coefficients (p) is used in
computing R²adj, the value will not necessarily increase with the
addition of any regressor. Hence, R²adj is a more reliable indicator
of model quality.

Multiple Linear Regression: Strength (Example)
SST = SSE + SSR
SST = 1.27; SSR = 0.85; SSE = 0.42
Mean variation in observations, MST = SST / (n - 1) = 0.26
Mean error, MSE = SSE / (n - p) = 0.14
Mean regression, MSR = SSR / k = 0.85 / 2 = 0.425 (k = 2 regressors)
R² = 0.67
R²adj = 0.45
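The two fit measures for the recharge example can be reproduced from the sums of squares quoted on the slide. This sketch takes n = 6 and p = 3 from the confidence-interval slide for the same example; the function names are illustrative.

```python
def r_squared(sst, sse):
    """Coefficient of determination, R^2 = 1 - SSE/SST."""
    return 1.0 - sse / sst

def adjusted_r_squared(sst, sse, n, p):
    """Adjusted R^2 = 1 - MSE/MST, with MSE = SSE/(n-p) and MST = SST/(n-1)."""
    mse = sse / (n - p)
    mst = sst / (n - 1)
    return 1.0 - mse / mst

# Values quoted in the lecture example
sst, sse = 1.27, 0.42
n, p = 6, 3   # six sites; three parameters (intercept, Rainfall, SMHC)

r2 = r_squared(sst, sse)
r2_adj = adjusted_r_squared(sst, sse, n, p)
print(f"R2 = {r2:.2f}, adjusted R2 = {r2_adj:.2f}")   # R2 = 0.67, adjusted R2 = 0.45
```

The gap between the two values reflects the small sample: with only three residual degrees of freedom, the penalty for two regressors is large.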

Multiple Linear Regression: F-statistics
• The F-test is used to assess the overall ability of a model.
• When testing for the significance of the goodness of fit, our null hypothesis is
that the coefficients of the explanatory variables jointly equal 0.
• If our F-statistic is below the critical value, we fail to reject the null hypothesis and
therefore say the goodness of fit is not significant.

• The F-test is useful for testing a number of hypotheses and is often
used to test for single, global and joint significance of a group of
variables.
• A joint test is often referred to as "testing a restriction".
• The restriction is that the coefficients of a group of explanatory
variables are jointly equal to 0.

The global F-test is used to assess the overall ability of a model to
explain at least some of the observed variability in the sample
responses. The global F-test is performed in the following steps:
Null hypothesis: β1 = β2 = … = βk = 0
The global F-statistic is calculated as
F0 = MSR/MSE
If F(calculated) > F (critical) (α, k, n-p),
where k = number of regressors, n = number of data points and
p = number of parameters to be estimated,
reject the null hypothesis and conclude that at least one βj ≠ 0, that is, at
least one model regressor explains some of the response variation.
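The steps of the global F-test can be sketched as a small function. The sums of squares below are hypothetical, the critical value is assumed to be read from an F table, and the regression mean square is assumed to be SSR divided by its k = p - 1 degrees of freedom, matching the (α, k, n-p) degrees of freedom of the critical value.

```python
def global_f_test(ssr, sse, n, p, f_critical):
    """Return (F0, reject_null) for H0: beta_1 = ... = beta_k = 0.

    ssr, sse   : regression and error sums of squares
    n          : number of data points
    p          : number of estimated parameters, including the intercept
    f_critical : tabulated F(alpha, k, n - p), supplied by the caller
    """
    k = p - 1                 # number of regressors
    msr = ssr / k             # mean square due to regression
    mse = sse / (n - p)       # mean square error
    f0 = msr / mse
    return f0, f0 > f_critical

# Hypothetical example: 13 observations, 2 regressors plus an intercept.
# 4.10 is the tabulated F(0.05, 2, 10).
f0, reject = global_f_test(ssr=90.0, sse=10.0, n=13, p=3, f_critical=4.10)
print(f0, reject)
```

Keeping the table lookup outside the function avoids any dependence on a statistics library; the caller supplies the critical value for the chosen α.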

Recharge = 1.38 + 0.12 Rainfall - 0.01 SMHC
SST = SSE + SSR
SST = 1.27    MST = 0.26
SSR = 0.85    MSR = SSR / k = 0.85 / 2 = 0.425
SSE = 0.42    MSE = 0.14
F0 = MSR/MSE = 0.425 / 0.14 = 3.04
F (critical) (α, k, n-p) = F (critical) (0.05, 2, 3) = 9.55
F(calculated) < F (critical) (α, k, n-p)
The null hypothesis cannot be rejected.
Decision: There is no evidence that any model regressor explains
the response variation.

Discharge = 21.97 – 0.19ET + 1.55BF + 0.94R -1.05GWR

Null hypothesis:
β1 = β2 = β3 = β4 = 0
= 0.9865
F0 = MSR/MSE = 7.68
F (critical) (α, k, n-p) = F (critical) (0.05, 4, 7) = 4.12
F(calculated) > F (critical) (α, k, n-p)
The null hypothesis is rejected.
Decision: At least one βj ≠ 0 and at least one model regressor
explains some of the response variation.

Discharge = 33.50 – 0.28ET + 1.53BF + 0.28R

Null hypothesis: β1 = β2 = β3 = 0
F0 = MSR/MSE = 6.3
F (critical) (α, k, n-p) = F (critical) (0.05, 3, 8) = 4.07
F(calculated) > F (critical) (α, k, n-p)
The null hypothesis is rejected.
Decision: Groundwater recharge has no significant impact on
Discharge.

Discharge = ? + ? ET + ? BF + ? GWR

Multiple Linear Regression: Joint Significance
• To carry out this test you need to run two separate regressions:
one with all the explanatory variables included (the unrestricted equation),
the other with the variables whose joint significance is being
tested removed (the restricted equation).
• Then collect the residual sum of squares (RSS) from both equations.
• Put the values in the formula.
• Find the critical value and compare it with the test statistic. The null
hypothesis is that the coefficients of the tested variables jointly equal 0.

The test for joint significance has its own formula, which takes
the following form:
$$ F = \frac{(RSS_R - RSS_u)/m}{RSS_u/(n-k)} $$
where
RSS_u = RSS of the unrestricted equation
RSS_R = RSS of the restricted equation
m = number of restrictions
k = number of parameters in the unrestricted equation
Obs. No.    Y      X1     X2     X3
1           5.1    2.3    2.5    4.2
2           6.2    1.9    2.8    3.3
3           4.8    2.0    3.1    4.0
.           .      .      .      .
.           .      .      .      .
.           .      .      .      .
60          5.9    2.4    3.8    4.6
$$ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + u_t $$

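Suppose we have a model consisting of three explanatory variables and we wish to
test the joint significance of two of the variables (x2 and x3). We need
to run the following unrestricted and restricted models (u_t denotes the error term):
$$ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + u_t \quad \text{(unrestricted)} $$
$$ y = \alpha_0 + \alpha_1 x_1 + u_t \quad \text{(restricted)} $$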

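Given the following model, we wish to test the joint significance of x2
and x3. Having estimated both equations, we collect their respective RSSs (n = 60).
$$ y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + u_t \quad \text{(unrestricted)}, \qquad RSS_u = 0.75 $$
$$ y = \beta_0 + \beta_1 x_1 + u_t \quad \text{(restricted)}, \qquad RSS_R = 1.5 $$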

$$ F = \frac{(RSS_R - RSS_u)/m}{RSS_u/(n-k)} = \frac{(1.5 - 0.75)/2}{0.75/(60 - 4)} = \frac{0.375}{0.0134} = 28 $$
F (critical) (0.05, 2, 56) = 3.16

Null hypothesis, H0: α2 = α3 = 0
Alternative hypothesis, HA: α2 ≠ 0 and/or α3 ≠ 0
As the F statistic is greater than the critical value (28 > 3.16), we
reject the null hypothesis and conclude that the variables x2 and x3
are jointly significant and should remain in the model.
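The joint-significance calculation can be reproduced with a short function. The function name is illustrative; the inputs are the values quoted in the example (RSS_R = 1.5, RSS_u = 0.75, m = 2 restrictions, n = 60, k = 4 parameters in the unrestricted equation), and the critical value is the tabulated one quoted in the lecture.

```python
def joint_f_statistic(rss_restricted, rss_unrestricted, m, n, k):
    """F = ((RSS_R - RSS_u) / m) / (RSS_u / (n - k))."""
    numerator = (rss_restricted - rss_unrestricted) / m
    denominator = rss_unrestricted / (n - k)
    return numerator / denominator

# Values from the lecture example
f_stat = joint_f_statistic(rss_restricted=1.5, rss_unrestricted=0.75,
                           m=2, n=60, k=4)
f_critical = 3.16   # F(0.05, 2, 56) as quoted in the lecture

print(round(f_stat, 2), f_stat > f_critical)
```

A statistic this far above the critical value means that dropping x2 and x3 doubles the unexplained variation relative to what chance alone would suggest, so both variables stay in the model.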

Choosing the Best MLR Model
• One of the major issues in multiple regression is the appropriate
approach to variable selection.
• To build an appropriate regression model, we need to
successively add variables to or delete variables from the model.
• The benefit of adding variables to a multiple
regression model is to account for or explain more of the
variance of the response variable. The cost of adding
variables is that the degrees of freedom decrease, making it
more difficult to find significance in hypothesis tests and
increasing the width of confidence intervals.
A good model will explain as much of the variance of y as
possible with a small number of explanatory variables.

The choice of whether to add a variable is based on a "cost-benefit
analysis", and variables enter the model only if they make a
significant improvement in the model.
There are at least two types of approaches for evaluating whether
a new variable sufficiently improves the model. The first approach
uses partial F-tests, and when automated is often called a
"stepwise" procedure.
The second approach uses some overall measure of model
quality. The latter has many advantages.
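As one possible instance of the second approach, candidate models can be ranked by an overall quality measure. The sketch below assumes the adjusted R² as that measure (the lecture does not name a specific one here) and uses hypothetical sums of squares for three nested candidate models.

```python
def adjusted_r_squared(sst, sse, n, p):
    """Adjusted R^2 = 1 - (SSE/(n-p)) / (SST/(n-1))."""
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))

# Hypothetical setting: 20 observations, total sum of squares of 100
n, sst = 20, 100.0

# Hypothetical candidates: (regressors, SSE, number of parameters p)
candidates = [
    ("x1",           40.0, 2),
    ("x1 + x2",      25.0, 3),
    ("x1 + x2 + x3", 24.8, 4),   # x3 barely reduces SSE
]

scores = {}
for name, sse, p in candidates:
    r2 = 1.0 - sse / sst
    scores[name] = adjusted_r_squared(sst, sse, n, p)
    print(f"{name:14s} R2 = {r2:.3f}  adjusted R2 = {scores[name]:.3f}")

best = max(scores, key=scores.get)
print("selected model:", best)
```

Plain R² keeps rising as x3 is added, but the adjusted value falls, so the two-regressor model is selected in this hypothetical case.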
