1. Dzung Hoang Nguyen (SensoryLab) Regression and Correlation talk October 1, 2020 1 / 67
Linear Regression & Correlation
HCM University of Technology
Dzung Hoang Nguyen
October 1, 2020
2. Outline
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
3. Learning Objectives
• Use simple linear regression for building empirical models to
engineering and scientific data.
• Understand how the method of least squares is used to estimate
the parameters in a linear regression model.
• Analyze residuals to determine if the regression model is an
adequate fit to the data or to see if any underlying assumptions
are violated.
• Test the statistical hypotheses and construct confidence intervals
on the regression model parameters.
• Use the regression model to make a prediction of a future
observation and construct an appropriate prediction interval on
the future observation.
• Apply the correlation model.
• Use simple transformations to achieve a linear regression model.
4. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
5. Empirical Models
• Many problems in engineering and science involve exploring the
relationships between two or more variables.
• Regression analysis is a statistical technique that is very useful
for these types of problems.
• For example, in a chemical process, suppose that the yield of the
product is related to the process-operating temperature.
• Regression analysis can be used to build a model to predict yield
at a given temperature level.
7. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
8. Simple Linear Regression
Definition
• The simple linear regression considers a single regressor or
predictor x and a dependent or response variable Y.
• For each level of x, the response Y is a random variable whose expected value is a linear function of x:
E(Y |x) = β0 + β1x
• We assume that each observation Y can be described by the model:
Y = β0 + β1x + ε
where ε is a random error term with mean zero and variance σ².
9. Simple Linear Regression
Example
Fig. 11-3: deviations of the data from the estimated regression model.
16. Least Squares Estimation of β0, β1
Minimizing the sum of squared errors L = Σ(yi − β0 − β1xi)² with respect to β0 and β1 gives
∂L/∂β0 = −2 Σ(yi − β̂0 − β̂1xi) = 0
∂L/∂β1 = −2 Σ(yi − β̂0 − β̂1xi)xi = 0
Simplifying these two equations yields the least squares normal equations:
nβ̂0 + β̂1 Σxi = Σyi
β̂0 Σxi + β̂1 Σxi² = Σxiyi
(all sums running over i = 1, ..., n)
17. Least Squares Estimates
The least squares estimates of the intercept and slope in the simple linear regression model are:
β̂0 = ȳ − β̂1x̄
β̂1 = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]
where ȳ = (1/n)Σyi and x̄ = (1/n)Σxi, with all sums running over i = 1, ..., n.
18. Least Squares Estimates
Definition
The fitted or estimated regression line is therefore
ŷ = β̂0 + β̂1x
Note that each pair of observations satisfies the relationship:
yi = β̂0 + β̂1xi + ei , i = 1, 2, ..., n
where ei = yi − ŷi is called the residual; it describes the error in the fit of the model to the ith observation yi.
19. Some Notation
Sxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n
Sxy = Σyi(xi − x̄) = Σxiyi − (Σxi)(Σyi)/n
With this notation, the slope estimate is simply:
β̂1 = Sxy/Sxx
21. Oxygen Purity
We will fit a simple linear regression model to the oxygen purity data
in Table 11 − 1. The following quantities may be computed:
• n = 20; Σxi = 23.92; Σyi = 1843.21
• x̄ = 1.1960; ȳ = 92.1605
• Σyi² = 170044.5321; Σxi² = 29.2892
• Σxiyi = 2214.6566
22. Oxygen Purity-continued
Continuing with the oxygen purity data in Table 11-1:
Sxx = Σxi² − (Σxi)²/20 = 29.2892 − (23.92)²/20 = 0.68088
Sxy = Σxiyi − (Σxi)(Σyi)/20 = 2214.6566 − (23.92)(1843.21)/20 = 10.17744
β̂1 = Sxy/Sxx
23. Oxygen Purity-continued
Therefore, the least squares estimates of the slope and intercept are:
β̂1 = Sxy/Sxx = 10.17744/0.68088 = 14.94748
and
β̂0 = ȳ − β̂1x̄ = 92.1605 − (14.94748)(1.196) = 74.28331
The fitted simple linear regression model (with the coefficients reported to three decimal places) is:
ŷ = 74.283 + 14.947x
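As a quick check, the fit above can be reproduced from the summary statistics alone. The sketch below uses only the sums quoted on the slides; the variable names are mine.

```python
# Summary statistics from the oxygen purity data (Table 11-1)
n = 20
sum_x, sum_y = 23.92, 1843.21
sum_x2, sum_xy = 29.2892, 2214.6566

# S_xx and S_xy via the computational shortcut formulas
Sxx = sum_x2 - sum_x**2 / n          # ~0.68088
Sxy = sum_xy - sum_x * sum_y / n     # ~10.17744

# Least squares estimates of the slope and intercept
b1 = Sxy / Sxx                       # ~14.947
b0 = sum_y / n - b1 * (sum_x / n)    # ~74.283

print(f"yhat = {b0:.3f} + {b1:.3f} x")
```

Running this reproduces the fitted line ŷ = 74.283 + 14.947x reported on the slide.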
24. Estimating σ²
The residuals ei = yi − ŷi are used to obtain an estimate of σ². The sum of squares of the residuals, often called the error sum of squares, is:
SSE = Σei² = Σ(yi − ŷi)²
The expected value of the error sum of squares is:
E(SSE) = (n − 2)σ²
Therefore, an unbiased estimator of σ² is:
σ̂² = SSE/(n − 2)
where SSE can be easily computed using:
SSE = SST − β̂1Sxy
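The shortcut SSE = SST − β̂1Sxy can be verified numerically for the oxygen purity data (a sketch; the inputs below are the slide's own summary values):

```python
# Estimate sigma^2 for the oxygen purity fit using SSE = SST - b1 * Sxy
n = 20
sum_y, sum_y2 = 1843.21, 170044.5321
b1, Sxy = 14.94748, 10.17744

SST = sum_y2 - sum_y**2 / n      # total sum of squares, ~173.38
SSE = SST - b1 * Sxy             # error sum of squares, ~21.25
sigma2_hat = SSE / (n - 2)       # unbiased estimate of sigma^2, ~1.18

print(round(SST, 2), round(SSE, 2), round(sigma2_hat, 2))
```

These match SST = 173.38, SSE = 21.25, and σ̂² = 1.18 used in the later slides.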
25. Estimating σ²
Starting from
SSE = Σei² = Σ(yi − ŷi)²
and substituting ŷi = β̂0 + β̂1xi with β̂0 = ȳ − β̂1x̄, each residual becomes (yi − ȳ) − β̂1(xi − x̄), so
SSE = Σ[(yi − ȳ) − β̂1(xi − x̄)]²
SSE = Σ(yi − ȳ)² + β̂1²Σ(xi − x̄)² − 2β̂1Σ(xi − x̄)(yi − ȳ)
Since Sxx = Σ(xi − x̄)², Sxy = Σ(xi − x̄)(yi − ȳ), and β̂1 = Sxy/Sxx, we have β̂1²Sxx = β̂1Sxy, and therefore
SSE = SST − β̂1Sxy
26. Properties of the Least Squares Estimators
Slope Properties
E(β̂1) = β1;  V(β̂1) = σ²/Sxx
Intercept Properties
E(β̂0) = β0;  V(β̂0) = σ²[1/n + x̄²/Sxx]
27. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
Suppose we wish to test
H0 : β1 = β1,0
H1 : β1 ≠ β1,0
An appropriate test statistic would be:
t0 = (β̂1 − β1,0)/√(σ̂²/Sxx)
28. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
29. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
The test statistic can also be written as:
t0 = (β̂1 − β1,0)/se(β̂1)
We reject the null hypothesis if |t0| > tα/2,n−2.
30. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
Suppose we wish to test
H0 : β0 = β0,0
H1 : β0 ≠ β0,0
An appropriate test statistic would be:
t0 = (β̂0 − β0,0)/√(σ̂²[1/n + x̄²/Sxx]) = (β̂0 − β0,0)/se(β̂0)
31. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
We reject the null hypothesis if |t0| > tα/2,n−2.
32. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
An important special case of these hypotheses is
H0 : β1 = 0
H1 : β1 ≠ 0
These hypotheses relate to the significance of regression.
Failure to reject H0 is equivalent to concluding that there is no linear relationship between x and Y.
33. Hypothesis Tests in Simple Linear Regression
Example
We will test for significance of regression using the model for the oxygen purity data from Example 11-1. The hypotheses are:
H0 : β1 = 0
H1 : β1 ≠ 0
and we will use α = 0.01. From the example and the data table we have:
β̂1 = 14.947, n = 20, Sxx = 0.68088, σ̂² = 1.18
so the t-statistic becomes:
t0 = β̂1/√(σ̂²/Sxx) = β̂1/se(β̂1) = 14.947/√(1.18/0.68088) = 11.35
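The t-statistic above is easy to reproduce (a sketch using the slide's quantities):

```python
import math

# t-statistic for testing H0: beta1 = 0 with the oxygen purity numbers
b1, Sxx, sigma2_hat = 14.947, 0.68088, 1.18

se_b1 = math.sqrt(sigma2_hat / Sxx)   # standard error of the slope
t0 = b1 / se_b1                       # ~11.35

print(round(t0, 2))
```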
34. Hypothesis Tests in Simple Linear Regression
Example
Practical Interpretation: Since the reference value of t is t0.005,18 = 2.88, the value of the test statistic is very far into the critical region, implying that H0 : β1 = 0 should be rejected. There is strong evidence to support this claim. The P-value for this test, obtained with a calculator, is P ≈ 1.23 × 10⁻⁹.
35. Hypothesis Tests in Simple Linear Regression
ANOVA approach to test Significance of Regression
The ANOVA identity is:
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
Symbolically,
SST = SSR + SSE
36. Hypothesis Tests in Simple Linear Regression
ANOVA approach to test Significance of Regression
If the null hypothesis H0 : β1 = 0 is true, the statistic:
F0 = (SSR/1)/(SSE/(n − 2)) = MSR/MSE
follows the F1,n−2 distribution, and we reject H0 if F0 > fα,1,n−2.
The quantities MSR and MSE are called mean squares.
37. Hypothesis Tests in Simple Linear Regression
ANOVA approach to test Significance of Regression
ANOVA table
Source of Variation | Sum of Squares     | Degrees of Freedom | Mean Square | F0
Regression          | SSR = β̂1Sxy        | 1                  | MSR         | MSR/MSE
Error               | SSE = SST − β̂1Sxy  | n − 2              | MSE         |
Total               | SST                | n − 1              |             |
Note that MSE = σ̂².
38. ANOVA example
Example
We will use the analysis of variance approach to test for significance of regression using the oxygen purity data model from Example 11-1. Recall that SST = 173.38, β̂1 = 14.947, Sxy = 10.17744, and n = 20. The regression sum of squares is:
SSR = β̂1Sxy = 14.947 × 10.17744 = 152.13
and the error sum of squares is:
SSE = SST − SSR = 173.38 − 152.13 = 21.25
Constructing the ANOVA table for testing H0 : β1 = 0 is left as an exercise.
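A sketch of the ANOVA computation left as an exercise, using the sums of squares from the example:

```python
# Complete the ANOVA quantities for testing H0: beta1 = 0
n = 20
SST = 173.38
SSR = 152.13                 # b1 * Sxy from the slide
SSE = SST - SSR              # ~21.25

MSR = SSR / 1                # regression mean square (1 df)
MSE = SSE / (n - 2)          # error mean square; equals sigma2_hat
F0 = MSR / MSE               # ~128.9

print(round(F0, 1))
```

The resulting F0 agrees with the value quoted on the next slide.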
39. ANOVA example
Example
The test statistic is:
f0 = MSR/MSE = 152.13/1.18 = 128.86
for which we find that the P-value is P ≈ 1.23 × 10⁻⁹, so we conclude that β1 is not zero.
There are frequently minor differences in terminology among computer packages. For example, sometimes the regression sum of squares is called the "model" sum of squares, and the error sum of squares is called the "residual" sum of squares.
40. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
41. Confidence Intervals on the Slope and Intercept
Definition
Under the assumption that the observations are normally and independently distributed, a 100(1 − α)% confidence interval on the slope β1 in simple linear regression is:
β̂1 − tα/2,n−2√(σ̂²/Sxx) ≤ β1 ≤ β̂1 + tα/2,n−2√(σ̂²/Sxx)
42. Confidence Intervals on the Slope and Intercept
Definition
Similarly, a 100(1 − α)% confidence interval on the intercept β0 is:
β̂0 − tα/2,n−2√(σ̂²[1/n + x̄²/Sxx]) ≤ β0 ≤ β̂0 + tα/2,n−2√(σ̂²[1/n + x̄²/Sxx])
43. Confidence Intervals
Example
We will find a 95% confidence interval on the slope of the regression line using the data in Example 11-1. Recall that β̂1 = 14.947, Sxx = 0.68088, and σ̂² = 1.18. Then we find:
β̂1 − t0.025,18√(σ̂²/Sxx) ≤ β1 ≤ β̂1 + t0.025,18√(σ̂²/Sxx)
or
14.947 − 2.101√(1.18/0.68088) ≤ β1 ≤ 14.947 + 2.101√(1.18/0.68088)
44. Confidence Intervals
Example
14.947 − 2.101√(1.18/0.68088) ≤ β1 ≤ 14.947 + 2.101√(1.18/0.68088)
This simplifies to:
12.181 ≤ β1 ≤ 17.713
Practical Interpretation: This CI does not include zero, so there is strong evidence (at α = 0.05) that the slope is not zero. The CI is reasonably narrow (±2.766) because the error variance is fairly small.
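The interval above can be checked in a few lines (a sketch; the critical value 2.101 is taken from the slide rather than computed):

```python
import math

# 95% CI on the slope: b1 +/- t_{0.025,18} * sqrt(sigma2_hat / Sxx)
b1, Sxx, sigma2_hat = 14.947, 0.68088, 1.18
t_crit = 2.101   # t_{0.025,18}

half_width = t_crit * math.sqrt(sigma2_hat / Sxx)   # ~2.766
lo, hi = b1 - half_width, b1 + half_width

print(round(lo, 3), round(hi, 3))
```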
45. Confidence Intervals on the Mean Response
µ̂Y|x0 = β̂0 + β̂1x0
Definition
A 100(1 − α)% confidence interval on the mean response at the value x = x0, say µY|x0, is given by:
µ̂Y|x0 − tα/2,n−2√(σ̂²[1/n + (x0 − x̄)²/Sxx]) ≤ µY|x0 ≤ µ̂Y|x0 + tα/2,n−2√(σ̂²[1/n + (x0 − x̄)²/Sxx])
where µ̂Y|x0 = β̂0 + β̂1x0 is computed from the fitted regression model.
46. Confidence Intervals
Example
We will construct a 95% confidence interval about the mean response for the data in Example 11-1. The fitted model is µ̂Y|x0 = 74.283 + 14.947x0, and the 95% confidence interval on µY|x0 is found as:
µ̂Y|x0 ± 2.101√(1.18[1/20 + (x0 − 1.1960)²/0.68088])
Suppose that we are interested in predicting the mean oxygen purity when x0 = 1.00%. Then µ̂Y|1.00 = 74.283 + 14.947(1.00) = 89.23 and the 95% confidence interval is:
89.23 ± 2.101√(1.18[1/20 + (1.00 − 1.1960)²/0.68088])
47. Confidence Intervals
Example
89.23 ± 2.101√(1.18[1/20 + (1.00 − 1.1960)²/0.68088])
or
89.23 ± 0.75
Therefore, the 95% CI on µY|1.00 is 88.48 ≤ µY|1.00 ≤ 89.98, a reasonably narrow CI.
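A sketch of the mean-response CI computation (the slide rounds the half-width to 0.75; computed directly it is closer to 0.745):

```python
import math

# 95% CI on the mean response at x0 = 1.00 for the oxygen purity model
b0, b1 = 74.283, 14.947
n, x_bar, Sxx, sigma2_hat = 20, 1.1960, 0.68088, 1.18
t_crit = 2.101   # t_{0.025,18}
x0 = 1.00

mu_hat = b0 + b1 * x0   # fitted mean response, 89.23
half = t_crit * math.sqrt(sigma2_hat * (1/n + (x0 - x_bar)**2 / Sxx))

print(round(mu_hat, 2), round(half, 3))
```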
48. Prediction of New Observations
A 100(1 − α)% prediction interval on a future observation Y0 at the value x0 is given by:
ŷ0 − tα/2,n−2√(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx]) ≤ Y0 ≤ ŷ0 + tα/2,n−2√(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx])
The value ŷ0 is computed from the regression model ŷ0 = β̂0 + β̂1x0.
49. Prediction of New Observations
Example
To illustrate the construction of a prediction interval, suppose we use the data in Example 11-1 and find a 95% prediction interval on the next observation of oxygen purity at x0 = 1.00%. Recalling that ŷ0 = 89.23, we find that the prediction interval is:
89.23 ± 2.101√(1.18[1 + 1/20 + (1.00 − 1.1960)²/0.68088])
which simplifies to:
86.83 ≤ y0 ≤ 91.63
This is a reasonably narrow prediction interval.
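The same computation as the mean-response CI, with the extra "1 +" term that accounts for the variability of a single new observation (a sketch using the slide's quantities):

```python
import math

# 95% prediction interval on a new purity observation at x0 = 1.00
y0_hat = 89.23
n, x_bar, Sxx, sigma2_hat = 20, 1.1960, 0.68088, 1.18
t_crit = 2.101   # t_{0.025,18}
x0 = 1.00

# the leading "1 +" widens the interval relative to the CI on the mean
half = t_crit * math.sqrt(sigma2_hat * (1 + 1/n + (x0 - x_bar)**2 / Sxx))
lo, hi = y0_hat - half, y0_hat + half

print(round(lo, 2), round(hi, 2))   # ~86.83, ~91.63
```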
50. Adequacy of the Regression Model
Fitting a regression model requires several assumptions:
• Errors are uncorrelated random variables with mean zero;
• Errors have constant variance; and
• Errors are normally distributed.
The analyst should always consider the validity of these assumptions
to be doubtful and conduct analyses to examine the adequacy of the
model.
51. Residual Analysis
• The residuals from a regression model are ei = yi − ŷi , where yi
is an actual observation and ŷi is the corresponding fitted value
from the regression model.
• Analysis of the residuals is frequently helpful in checking the
assumption that the errors are approximately normally distributed
with constant variance, and in determining whether additional
terms in the model would be useful.
52. Adequacy of the Regression Model
Example
The regression model for the oxygen purity data is:
ŷ = 74.283 + 14.947x
Table 11-4 presents the observed and predicted values of y at each value of x from this data set, along with the corresponding residuals.
53. Adequacy of the Regression Model
Example
A normal probability plot of the residuals is shown in Fig. 11-10. Since the residuals fall approximately along a straight line in the figure, we conclude that there is no severe departure from normality. The residuals are also plotted against the predicted value ŷi in Fig. 11-11...
54. Adequacy of the Regression Model
Example
...and against the hydrocarbon levels xi in Fig. 11-12. These plots do not indicate any serious model inadequacies.
55. Coefficient of Determination-R2
• The quantity R² = SSR/SST = 1 − SSE/SST is called the coefficient of determination and is often used to judge the adequacy of a regression model.
• 0 ≤ R² ≤ 1
• We often refer (loosely) to R² as the amount of variability in the data explained or accounted for by the regression model.
56. Coefficient of Determination-R2
• For the oxygen purity regression model:
R² = SSR/SST = 152.13/173.38 = 0.877
• Thus, the model accounts for 87.7% of the variability in the data.
57. Correlation Coefficient
Definition
We assume that the joint distribution of Xi and Yi is the bivariate normal distribution, where µY and σY² are the mean and variance of Y, µX and σX² are the mean and variance of X, and ρ is the correlation coefficient between Y and X. Recall that the correlation coefficient is defined as:
ρ = σXY/(σXσY)
where σXY is the covariance between Y and X.
The conditional distribution of Y for a given value of X = x is:
fY|x(y) = (1/(√(2π)σY|x)) exp[−(1/2)((y − β0 − β1x)/σY|x)²]
where
β0 = µY − µXρ(σY/σX);  β1 = (σY/σX)ρ
58. Definition
It is possible to draw inferences about the correlation coefficient ρ in this model. The estimator of ρ is the sample correlation coefficient:
R = ΣYi(Xi − X̄) / [Σ(Xi − X̄)² Σ(Yi − Ȳ)²]^(1/2) = SXY/(SXX SST)^(1/2)
Note that:
β̂1 = (SST/SXX)^(1/2) R
We may also write:
R² = β̂1²(SXX/SST) = β̂1SXY/SST = SSR/SST
59. It is often useful to test the hypothesis:
H0 : ρ = 0
H1 : ρ ≠ 0
The appropriate test statistic for these hypotheses is:
T0 = R√(n − 2)/√(1 − R²)
Reject H0 if |t0| > tα/2,n−2.
60. The test procedure for the hypothesis
H0 : ρ = ρ0
H1 : ρ ≠ ρ0
where ρ0 ≠ 0 is somewhat more complicated. In this case, the appropriate test statistic is:
Z0 = (arctanh(R) − arctanh(ρ0))(n − 3)^(1/2)
Reject H0 if |Z0| > zα/2. The approximate 100(1 − α)% confidence interval is:
tanh(arctanh(r) − zα/2/√(n − 3)) ≤ ρ ≤ tanh(arctanh(r) + zα/2/√(n − 3))
where tanh(u) = (e^u − e^(−u))/(e^u + e^(−u)).
61. Correlation-Example
Example
Wire Bond Pull Strength: an application of regression analysis in which an engineer at a semiconductor assembly plant is investigating the relationship between the pull strength of a wire bond and two factors: wire length and die height. In this example, we consider only one of the factors, the wire length. A random sample of 25 units is selected and tested, and the wire bond pull strength and wire length are observed for each unit. The data are shown in Table 11-2. We assume that pull strength and wire length are jointly normally distributed. Figure 11-13 shows a scatter diagram of wire bond strength versus wire length.
63. Correlation-Example continued
Example
Now SXX = 698.56 and SXY = 2027.7132, and the sample correlation coefficient is:
r = SXY/[SXX SST]^(1/2) = 2027.7132/[(698.560)(6105.9)]^(1/2) = 0.9818
Note that r² = 0.9818² = 0.9640, so approximately 96.40% of the variability in pull strength is explained by the linear relationship to wire length.
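A sketch reproducing the sample correlation from the quoted sums of squares:

```python
import math

# Sample correlation for the wire bond data from the summary quantities
Sxx, Sxy, SST = 698.56, 2027.7132, 6105.9

r = Sxy / math.sqrt(Sxx * SST)   # ~0.9818
r2 = r**2                        # ~0.964

print(round(r, 4))
```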
64. Correlation-Example continued
Now suppose that we wish to test the hypotheses:
H0 : ρ = 0
H1 : ρ ≠ 0
with α = 0.05. We can compute the t-statistic as:
t0 = r√(n − 2)/√(1 − r²) = 0.9818√23/√(1 − 0.9640) = 24.8
Because t0.025,23 = 2.069, we reject H0 and conclude that the correlation coefficient ρ ≠ 0.
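The t-statistic for the correlation test can be checked directly (a sketch with r and n from the example):

```python
import math

# t-statistic for H0: rho = 0 in the wire bond example
r, n = 0.9818, 25

t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # ~24.8

print(round(t0, 1))
```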
65. Correlation-Example continued
Finally, we may construct an approximate 95% confidence interval on ρ. Since arctanh(r) = arctanh(0.9818) = 2.3452:
tanh(2.3452 − 1.96/√22) ≤ ρ ≤ tanh(2.3452 + 1.96/√22)
which reduces to:
0.9585 ≤ ρ ≤ 0.9921
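The Fisher z-transform interval above is a one-liner with the standard library's `math.atanh`/`math.tanh` (a sketch; z0.025 = 1.96):

```python
import math

# Approximate 95% CI on rho via Fisher's z (arctanh) transformation
r, n = 0.9818, 25
z_crit = 1.96

center = math.atanh(r)           # ~2.3452
delta = z_crit / math.sqrt(n - 3)
lo = math.tanh(center - delta)   # ~0.9585
hi = math.tanh(center + delta)   # ~0.9921

print(round(lo, 4), round(hi, 4))
```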
66. Example
A research engineer is investigating the use of a windmill to generate electricity and has collected data on the DC output from this windmill and the corresponding wind velocity. If we fit a straight-line model to the data, the regression model is ŷ = 0.1309 + 0.2411x. We also have the following summary statistics for this model: R² = 0.8745, MSE = σ̂² = 0.0557, F0 = 160.26 (P < 0.0001).
67. Definition
We occasionally find that the straight-line regression model Y = β0 + β1x + ε is inappropriate because the true regression function is nonlinear. Sometimes nonlinearity is visually determined from the scatter diagram, and sometimes, because of prior experience or underlying theory, we know in advance that the model is nonlinear. Occasionally, a scatter diagram will exhibit an apparent nonlinear relationship between Y and x. In some of these situations, a nonlinear function can be expressed as a straight line by using a suitable transformation. Such nonlinear models are called intrinsically linear.
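A minimal sketch of an intrinsically linear fit. The exponential model y = a·e^(bx) is hypothetical (not from the talk's data), and the points below are synthetic, generated with a = 2.0 and b = 0.5 and no noise, purely to show that regressing ln y on x recovers the parameters:

```python
import math

# Intrinsically linear model: y = a * exp(b * x)  =>  ln y = ln a + b x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]   # synthetic, noise-free data

# Ordinary least squares on the transformed pairs (x, ln y)
lys = [math.log(y) for y in ys]
n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(lys) / n
Sxx = sum((x - x_bar) ** 2 for x in xs)
Sxy = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, lys))

b_hat = Sxy / Sxx                          # recovers b = 0.5
a_hat = math.exp(ly_bar - b_hat * x_bar)   # recovers a = 2.0

print(round(a_hat, 3), round(b_hat, 3))
```

With real (noisy) data the same transformation applies, but the least squares assumptions then concern the errors on the log scale.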
68. Transformation and Logistic Regression
70. A plot of the residuals from the transformed model versus ŷi is shown in Figure 11-17. This plot does not reveal any serious problem with inequality
Figure 11-17. This plot does not reveal any serious problem with inequality
of variance. The normal probability plot, shown in Figure 11-18, gives a mild
indication that the errors come from a distribution with heavier tails than
the normal (notice the slight upward and downward curve at the extremes).
This normal probability plot has the z-score value plotted on the horizontal
axis. Since there is no strong signal of model inadequacy, we conclude that
the transformed model is satisfactory.
71. Important Terms and Concepts
• Analysis of variance test in
regression
• Confidence interval on mean
response
• Correlation coefficient
• Empirical model
• Confidence intervals on model
parameters
• Intrinsically linear model
• Least squares estimation of
regression model parameters
• Logistic regression
• Model adequacy checking
• Odds ratio
• Prediction interval on a future
observation
• Regression analysis
• Residual plots
• Residuals
• Scatter diagram
• Simple linear regression model
• Standard error
• Statistical test on model
parameters
• Transformations
73. Last Page
Summary
End of my talk
with a tidy TU Delft lay-out.
Thank you!