1. Dzung Hoang Nguyen (SensoryLab) Regression and Correlation talk October 1, 2020 1 / 67
Linear Regression & Correlation
HCM University of Technology
Dzung Hoang Nguyen
October 1, 2020
2. Outline
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
3. Learning Objectives
• Use simple linear regression for building empirical models to
engineering and scientific data.
• Understand how the method of least squares is used to estimate
the parameters in a linear regression model.
• Analyze residuals to determine if the regression model is an
adequate fit to the data or to see if any underlying assumptions
are violated.
• Test the statistical hypotheses and construct confidence intervals
on the regression model parameters.
• Use the regression model to make a prediction of a future
observation and construct an appropriate prediction interval on
the future observation.
• Apply the correlation model.
• Use simple transformations to achieve a linear regression model.
4. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
5. Empirical Models
• Many problems in engineering and science involve exploring the
relationships between two or more variables.
• Regression analysis is a statistical technique that is very useful
for these types of problems.
• For example, in a chemical process, suppose that the yield of the
product is related to the process-operating temperature.
• Regression analysis can be used to build a model to predict yield
at a given temperature level.
7. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
8. Simple Linear Regression
Definition
• The simple linear regression considers a single regressor or
predictor x and a dependent or response variable Y.
• For each level of x, the response Y is a random variable whose expected value is a linear function of x:
E(Y |x) = β0 + β1x
• We assume that each observation Y can be described by the model:
Y = β0 + β1x + ε
where ε is a random error term with mean zero and variance σ².
9. Simple Linear Regression
Example
Fig. 11-3: deviations of the data from the estimated regression model.
16. Least Squares Estimation of β0, β1
Minimizing the sum of squared errors L = Σ(yi − β0 − β1xi)² with respect to β0 and β1 gives
∂L/∂β0 = −2 Σ(yi − β̂0 − β̂1xi) = 0
∂L/∂β1 = −2 Σ(yi − β̂0 − β̂1xi)xi = 0
Simplifying these two equations yields the least squares normal equations:
nβ̂0 + β̂1 Σxi = Σyi
β̂0 Σxi + β̂1 Σxi² = Σxiyi
(all sums running over i = 1, ..., n)
17. Least Squares Estimates
The least squares estimates of the intercept and slope in the simple linear regression model are:
β̂0 = ȳ − β̂1x̄
β̂1 = [Σxiyi − (Σxi)(Σyi)/n] / [Σxi² − (Σxi)²/n]
where ȳ = (1/n)Σyi and x̄ = (1/n)Σxi, with all sums running over i = 1, ..., n.
18. Least Squares Estimates
Definition
The fitted or estimated regression line is therefore
ŷ = β̂0 + β̂1x
Note that each pair of observations satisfies the relationship:
yi = β̂0 + β̂1xi + ei , i = 1, 2, ..., n
where ei = yi − ŷi is called the residual; it describes the error in the fit of the model to the ith observation yi.
19. Some Notation
Sxx = Σ(xi − x̄)² = Σxi² − (Σxi)²/n
Sxy = Σyi(xi − x̄) = Σxiyi − (Σxi)(Σyi)/n
With this notation, the slope estimate is simply:
β̂1 = Sxy/Sxx
21. Oxygen Purity
We will fit a simple linear regression model to the oxygen purity data
in Table 11 − 1. The following quantities may be computed:
• n = 20; Σxi = 23.92; Σyi = 1843.21
• x̄ = 1.1960; ȳ = 92.1605
• Σyi² = 170044.5321; Σxi² = 29.2892
• Σxiyi = 2214.6566
22. Oxygen Purity-continued
Continuing with the oxygen purity data in Table 11-1:
Sxx = Σxi² − (Σxi)²/20 = 29.2892 − (23.92)²/20 = 0.68088
Sxy = Σxiyi − (Σxi)(Σyi)/20 = 2214.6566 − (23.92)(1843.21)/20 = 10.17744
β̂1 = Sxy/Sxx
23. Oxygen Purity-continued
Therefore, the least squares estimates of the slope and intercept are:
β̂1 = Sxy/Sxx = 10.17744/0.68088 = 14.94748
and
β̂0 = ȳ − β̂1x̄ = 92.1605 − (14.94748)(1.196) = 74.28331
The fitted simple linear regression model (with the coefficients reported to three decimal places) is:
ŷ = 74.283 + 14.947x
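As a quick check, the fit above can be reproduced from the summary statistics alone. The sketch below uses only the sums quoted on the slides; the variable names are mine.

```python
# Summary statistics from the oxygen purity data (Table 11-1)
n = 20
sum_x, sum_y = 23.92, 1843.21
sum_x2, sum_xy = 29.2892, 2214.6566

# S_xx and S_xy via the computational shortcut formulas
Sxx = sum_x2 - sum_x**2 / n          # ~0.68088
Sxy = sum_xy - sum_x * sum_y / n     # ~10.17744

# Least squares estimates of the slope and intercept
b1 = Sxy / Sxx                       # ~14.947
b0 = sum_y / n - b1 * (sum_x / n)    # ~74.283

print(f"yhat = {b0:.3f} + {b1:.3f} x")
```

Running this reproduces the fitted line ŷ = 74.283 + 14.947x reported on the slide.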
24. Estimating σ²
The residuals ei = yi − ŷi are used to obtain an estimate of σ². The sum of squares of the residuals, often called the error sum of squares, is:
SSE = Σei² = Σ(yi − ŷi)²
The expected value of the error sum of squares is:
E(SSE) = (n − 2)σ²
Therefore, an unbiased estimator of σ² is:
σ̂² = SSE/(n − 2)
where SSE can be easily computed using:
SSE = SST − β̂1Sxy
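The shortcut SSE = SST − β̂1Sxy can be verified numerically for the oxygen purity data (a sketch; the inputs below are the slide's own summary values):

```python
# Estimate sigma^2 for the oxygen purity fit using SSE = SST - b1 * Sxy
n = 20
sum_y, sum_y2 = 1843.21, 170044.5321
b1, Sxy = 14.94748, 10.17744

SST = sum_y2 - sum_y**2 / n      # total sum of squares, ~173.38
SSE = SST - b1 * Sxy             # error sum of squares, ~21.25
sigma2_hat = SSE / (n - 2)       # unbiased estimate of sigma^2, ~1.18

print(round(SST, 2), round(SSE, 2), round(sigma2_hat, 2))
```

These match SST = 173.38, SSE = 21.25, and σ̂² = 1.18 used in the later slides.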
25. Estimating σ²
Starting from
SSE = Σei² = Σ(yi − ŷi)²
and substituting ŷi = β̂0 + β̂1xi with β̂0 = ȳ − β̂1x̄, each residual becomes (yi − ȳ) − β̂1(xi − x̄), so
SSE = Σ[(yi − ȳ) − β̂1(xi − x̄)]²
SSE = Σ(yi − ȳ)² + β̂1²Σ(xi − x̄)² − 2β̂1Σ(xi − x̄)(yi − ȳ)
Since Sxx = Σ(xi − x̄)², Sxy = Σ(xi − x̄)(yi − ȳ), and β̂1 = Sxy/Sxx, we have β̂1²Sxx = β̂1Sxy, and therefore
SSE = SST − β̂1Sxy
26. Properties of the Least Squares Estimators
Slope Properties
E(β̂1) = β1;  V(β̂1) = σ²/Sxx
Intercept Properties
E(β̂0) = β0;  V(β̂0) = σ²[1/n + x̄²/Sxx]
27. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
Suppose we wish to test
H0 : β1 = β1,0
H1 : β1 ≠ β1,0
An appropriate test statistic would be:
t0 = (β̂1 − β1,0)/√(σ̂²/Sxx)
28. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
29. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
The test statistic can also be written as:
t0 = (β̂1 − β1,0)/se(β̂1)
We reject the null hypothesis if |t0| > tα/2,n−2.
30. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
Suppose we wish to test
H0 : β0 = β0,0
H1 : β0 ≠ β0,0
An appropriate test statistic would be:
t0 = (β̂0 − β0,0)/√(σ̂²[1/n + x̄²/Sxx]) = (β̂0 − β0,0)/se(β̂0)
31. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
We reject the null hypothesis if |t0| > tα/2,n−2.
32. Hypothesis Tests in Simple Linear Regression
Use of t-Tests
An important special case of these hypotheses is
H0 : β1 = 0
H1 : β1 ≠ 0
These hypotheses relate to the significance of regression.
Failure to reject H0 is equivalent to concluding that there is no linear relationship between x and Y.
33. Hypothesis Tests in Simple Linear Regression
Example
We will test for significance of regression using the model for the oxygen purity data from Example 11-1. The hypotheses are:
H0 : β1 = 0
H1 : β1 ≠ 0
and we will use α = 0.01. From the example and the data table we have:
β̂1 = 14.947, n = 20, Sxx = 0.68088, σ̂² = 1.18
so the t-statistic becomes:
t0 = β̂1/√(σ̂²/Sxx) = β̂1/se(β̂1) = 14.947/√(1.18/0.68088) = 11.35
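The t-statistic above is easy to reproduce (a sketch using the slide's quantities):

```python
import math

# t-statistic for testing H0: beta1 = 0 with the oxygen purity numbers
b1, Sxx, sigma2_hat = 14.947, 0.68088, 1.18

se_b1 = math.sqrt(sigma2_hat / Sxx)   # standard error of the slope
t0 = b1 / se_b1                       # ~11.35

print(round(t0, 2))
```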
34. Hypothesis Tests in Simple Linear Regression
Example
Practical Interpretation: Since the reference value of t is t0.005,18 = 2.88, the value of the test statistic is very far into the critical region, implying that H0 : β1 = 0 should be rejected. There is strong evidence to support this claim. The P-value for this test, obtained with a calculator, is P ≈ 1.23 × 10⁻⁹.
35. Hypothesis Tests in Simple Linear Regression
ANOVA approach to test Significance of Regression
The ANOVA identity is:
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
Symbolically,
SST = SSR + SSE
36. Hypothesis Tests in Simple Linear Regression
ANOVA approach to test Significance of Regression
If the null hypothesis H0 : β1 = 0 is true, the statistic:
F0 = (SSR/1)/(SSE/(n − 2)) = MSR/MSE
follows the F1,n−2 distribution, and we reject H0 if F0 > fα,1,n−2.
The quantities MSR and MSE are called mean squares.
37. Hypothesis Tests in Simple Linear Regression
ANOVA approach to test Significance of Regression
ANOVA table
Source of Variation | Sum of Squares     | Degrees of Freedom | Mean Square | F0
Regression          | SSR = β̂1Sxy        | 1                  | MSR         | MSR/MSE
Error               | SSE = SST − β̂1Sxy  | n − 2              | MSE         |
Total               | SST                | n − 1              |             |
Note that MSE = σ̂².
38. ANOVA example
Example
We will use the analysis of variance approach to test for significance of regression using the oxygen purity data model from Example 11-1. Recall that SST = 173.38, β̂1 = 14.947, Sxy = 10.17744, and n = 20. The regression sum of squares is:
SSR = β̂1Sxy = 14.947 × 10.17744 = 152.13
and the error sum of squares is:
SSE = SST − SSR = 173.38 − 152.13 = 21.25
Constructing the ANOVA table for testing H0 : β1 = 0 is left as an exercise.
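A sketch of the ANOVA computation left as an exercise, using the sums of squares from the example:

```python
# Complete the ANOVA quantities for testing H0: beta1 = 0
n = 20
SST = 173.38
SSR = 152.13                 # b1 * Sxy from the slide
SSE = SST - SSR              # ~21.25

MSR = SSR / 1                # regression mean square (1 df)
MSE = SSE / (n - 2)          # error mean square; equals sigma2_hat
F0 = MSR / MSE               # ~128.9

print(round(F0, 1))
```

The resulting F0 agrees with the value quoted on the next slide.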
39. ANOVA example
Example
The test statistic is:
f0 = MSR/MSE = 152.13/1.18 = 128.86
for which we find that the P-value is P ≈ 1.23 × 10⁻⁹, so we conclude that β1 is not zero.
There are frequently minor differences in terminology among computer packages. For example, sometimes the regression sum of squares is called the "model" sum of squares, and the error sum of squares is called the "residual" sum of squares.
40. Next Subsection
1 Learning Objectives
2 Regression
Empirical Models
Simple Linear Regression
Hypothesis Tests in Simple Linear Regression
Confidence Intervals
3 Correlation
4 Transformation and Logistic Regression
41. Confidence Intervals on the Slope and Intercept
Definition
Under the assumption that the observations are normally and independently distributed, a 100(1 − α)% confidence interval on the slope β1 in simple linear regression is:
β̂1 − tα/2,n−2√(σ̂²/Sxx) ≤ β1 ≤ β̂1 + tα/2,n−2√(σ̂²/Sxx)
42. Confidence Intervals on the Slope and Intercept
Definition
Similarly, a 100(1 − α)% confidence interval on the intercept β0 is:
β̂0 − tα/2,n−2√(σ̂²[1/n + x̄²/Sxx]) ≤ β0 ≤ β̂0 + tα/2,n−2√(σ̂²[1/n + x̄²/Sxx])
43. Confidence Intervals
Example
We will find a 95% confidence interval on the slope of the regression line using the data in Example 11-1. Recall that β̂1 = 14.947, Sxx = 0.68088, and σ̂² = 1.18. Then we find:
β̂1 − t0.025,18√(σ̂²/Sxx) ≤ β1 ≤ β̂1 + t0.025,18√(σ̂²/Sxx)
or
14.947 − 2.101√(1.18/0.68088) ≤ β1 ≤ 14.947 + 2.101√(1.18/0.68088)
44. Confidence Intervals
Example
14.947 − 2.101√(1.18/0.68088) ≤ β1 ≤ 14.947 + 2.101√(1.18/0.68088)
This simplifies to:
12.181 ≤ β1 ≤ 17.713
Practical Interpretation: This CI does not include zero, so there is strong evidence (at α = 0.05) that the slope is not zero. The CI is reasonably narrow (±2.766) because the error variance is fairly small.
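The interval above can be checked in a few lines (a sketch; the critical value 2.101 is taken from the slide rather than computed):

```python
import math

# 95% CI on the slope: b1 +/- t_{0.025,18} * sqrt(sigma2_hat / Sxx)
b1, Sxx, sigma2_hat = 14.947, 0.68088, 1.18
t_crit = 2.101   # t_{0.025,18}

half_width = t_crit * math.sqrt(sigma2_hat / Sxx)   # ~2.766
lo, hi = b1 - half_width, b1 + half_width

print(round(lo, 3), round(hi, 3))
```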
45. Confidence Intervals on the Mean Response
µ̂Y|x0 = β̂0 + β̂1x0
Definition
A 100(1 − α)% confidence interval on the mean response at the value x = x0, say µY|x0, is given by:
µ̂Y|x0 − tα/2,n−2√(σ̂²[1/n + (x0 − x̄)²/Sxx]) ≤ µY|x0 ≤ µ̂Y|x0 + tα/2,n−2√(σ̂²[1/n + (x0 − x̄)²/Sxx])
where µ̂Y|x0 = β̂0 + β̂1x0 is computed from the fitted regression model.
46. Confidence Intervals
Example
We will construct a 95% confidence interval about the mean response for the data in Example 11-1. The fitted model is µ̂Y|x0 = 74.283 + 14.947x0, and the 95% confidence interval on µY|x0 is found as:
µ̂Y|x0 ± 2.101√(1.18[1/20 + (x0 − 1.1960)²/0.68088])
Suppose that we are interested in predicting the mean oxygen purity when x0 = 1.00%. Then µ̂Y|1.00 = 74.283 + 14.947(1.00) = 89.23 and the 95% confidence interval is:
89.23 ± 2.101√(1.18[1/20 + (1.00 − 1.1960)²/0.68088])
47. Confidence Intervals
Example
89.23 ± 2.101√(1.18[1/20 + (1.00 − 1.1960)²/0.68088])
or
89.23 ± 0.75
Therefore, the 95% CI on µY|1.00 is 88.48 ≤ µY|1.00 ≤ 89.98, a reasonably narrow CI.
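A sketch of the mean-response CI computation (the slide rounds the half-width to 0.75; computed directly it is closer to 0.745):

```python
import math

# 95% CI on the mean response at x0 = 1.00 for the oxygen purity model
b0, b1 = 74.283, 14.947
n, x_bar, Sxx, sigma2_hat = 20, 1.1960, 0.68088, 1.18
t_crit = 2.101   # t_{0.025,18}
x0 = 1.00

mu_hat = b0 + b1 * x0   # fitted mean response, 89.23
half = t_crit * math.sqrt(sigma2_hat * (1/n + (x0 - x_bar)**2 / Sxx))

print(round(mu_hat, 2), round(half, 3))
```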
48. Prediction of New Observations
A 100(1 − α)% prediction interval on a future observation Y0 at the value x0 is given by:
ŷ0 − tα/2,n−2√(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx]) ≤ Y0 ≤ ŷ0 + tα/2,n−2√(σ̂²[1 + 1/n + (x0 − x̄)²/Sxx])
The value ŷ0 is computed from the regression model ŷ0 = β̂0 + β̂1x0.
49. Prediction of New Observations
Example
To illustrate the construction of a prediction interval, suppose we use the data in Example 11-1 and find a 95% prediction interval on the next observation of oxygen purity at x0 = 1.00%. Recalling that ŷ0 = 89.23, we find that the prediction interval is:
89.23 ± 2.101√(1.18[1 + 1/20 + (1.00 − 1.1960)²/0.68088])
which simplifies to:
86.83 ≤ y0 ≤ 91.63
This is a reasonably narrow prediction interval.
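The same computation as the mean-response CI, with the extra "1 +" term that accounts for the variability of a single new observation (a sketch using the slide's quantities):

```python
import math

# 95% prediction interval on a new purity observation at x0 = 1.00
y0_hat = 89.23
n, x_bar, Sxx, sigma2_hat = 20, 1.1960, 0.68088, 1.18
t_crit = 2.101   # t_{0.025,18}
x0 = 1.00

# the leading "1 +" widens the interval relative to the CI on the mean
half = t_crit * math.sqrt(sigma2_hat * (1 + 1/n + (x0 - x_bar)**2 / Sxx))
lo, hi = y0_hat - half, y0_hat + half

print(round(lo, 2), round(hi, 2))   # ~86.83, ~91.63
```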
50. Adequacy of the Regression Model
Fitting a regression model requires several assumptions:
• Errors are uncorrelated random variables with mean zero;
• Errors have constant variance; and
• Errors are normally distributed.
The analyst should always consider the validity of these assumptions
to be doubtful and conduct analyses to examine the adequacy of the
model.
51. Residual Analysis
• The residuals from a regression model are ei = yi − ŷi , where yi
is an actual observation and ŷi is the corresponding fitted value
from the regression model.
• Analysis of the residuals is frequently helpful in checking the
assumption that the errors are approximately normally distributed
with constant variance, and in determining whether additional
terms in the model would be useful.
52. Adequacy of the Regression Model
Example
The regression model for the oxygen purity data is:
ŷ = 74.283 + 14.947x
Table 11-4 presents the observed and predicted values of y at each value of x from this data set, along with the corresponding residuals.
53. Adequacy of the Regression Model
Example
A normal probability plot of the residuals is shown in Fig. 11-10. Since the residuals fall approximately along a straight line in the figure, we conclude that there is no severe departure from normality. The residuals are also plotted against the predicted value ŷi in Fig. 11-11...
54. Adequacy of the Regression Model
Example
...and against the hydrocarbon levels xi in Fig. 11-12. These plots do not indicate any serious model inadequacies.
55. Coefficient of Determination-R2
• The quantity R² = SSR/SST = 1 − SSE/SST is called the coefficient of determination and is often used to judge the adequacy of a regression model.
• 0 ≤ R² ≤ 1
• We often refer (loosely) to R² as the amount of variability in the data explained or accounted for by the regression model.
56. Coefficient of Determination-R2
• For the oxygen purity regression model:
R² = SSR/SST = 152.13/173.38 = 0.877
• Thus, the model accounts for 87.7% of the variability in the data.
57. Correlation Coefficient
Definition
We assume that the joint distribution of Xi and Yi is the bivariate normal distribution, where µY and σY² are the mean and variance of Y, µX and σX² are the mean and variance of X, and ρ is the correlation coefficient between Y and X. Recall that the correlation coefficient is defined as:
ρ = σXY/(σXσY)
where σXY is the covariance between Y and X.
The conditional distribution of Y for a given value of X = x is:
fY|x(y) = (1/(√(2π)σY|x)) exp[−(1/2)((y − β0 − β1x)/σY|x)²]
where
β0 = µY − µXρ(σY/σX);  β1 = (σY/σX)ρ
58. Definition
It is possible to draw inferences about the correlation coefficient ρ in this model. The estimator of ρ is the sample correlation coefficient:
R = ΣYi(Xi − X̄) / [Σ(Xi − X̄)² Σ(Yi − Ȳ)²]^(1/2) = SXY/(SXX SST)^(1/2)
Note that:
β̂1 = (SST/SXX)^(1/2) R
We may also write:
R² = β̂1²(SXX/SST) = β̂1SXY/SST = SSR/SST
59. It is often useful to test the hypothesis:
H0 : ρ = 0
H1 : ρ ≠ 0
The appropriate test statistic for these hypotheses is:
T0 = R√(n − 2)/√(1 − R²)
Reject H0 if |t0| > tα/2,n−2.
60. The test procedure for the hypothesis
H0 : ρ = ρ0
H1 : ρ ≠ ρ0
where ρ0 ≠ 0 is somewhat more complicated. In this case, the appropriate test statistic is:
Z0 = (arctanh(R) − arctanh(ρ0))(n − 3)^(1/2)
Reject H0 if |Z0| > zα/2. The approximate 100(1 − α)% confidence interval is:
tanh(arctanh(r) − zα/2/√(n − 3)) ≤ ρ ≤ tanh(arctanh(r) + zα/2/√(n − 3))
where tanh(u) = (e^u − e^(−u))/(e^u + e^(−u)).
61. Correlation-Example
Example
Wire Bond Pull Strength: an application of regression analysis in which an engineer at a semiconductor assembly plant is investigating the relationship between the pull strength of a wire bond and two factors: wire length and die height. In this example, we consider only one of the factors, the wire length. A random sample of 25 units is selected and tested, and the wire bond pull strength and wire length are observed for each unit. The data are shown in Table 11-2. We assume that pull strength and wire length are jointly normally distributed. Figure 11-13 shows a scatter diagram of wire bond strength versus wire length.
63. Correlation-Example continued
Example
Now SXX = 698.56 and SXY = 2027.7132, and the sample correlation coefficient is:
r = SXY/[SXX SST]^(1/2) = 2027.7132/[(698.560)(6105.9)]^(1/2) = 0.9818
Note that r² = 0.9818² = 0.9640, so approximately 96.40% of the variability in pull strength is explained by the linear relationship to wire length.
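A sketch reproducing the sample correlation from the quoted sums of squares:

```python
import math

# Sample correlation for the wire bond data from the summary quantities
Sxx, Sxy, SST = 698.56, 2027.7132, 6105.9

r = Sxy / math.sqrt(Sxx * SST)   # ~0.9818
r2 = r**2                        # ~0.964

print(round(r, 4))
```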
64. Correlation-Example continued
Now suppose that we wish to test the hypotheses:
H0 : ρ = 0
H1 : ρ ≠ 0
with α = 0.05. We can compute the t-statistic as:
t0 = r√(n − 2)/√(1 − r²) = 0.9818√23/√(1 − 0.9640) = 24.8
Because t0.025,23 = 2.069, we reject H0 and conclude that the correlation coefficient ρ ≠ 0.
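The t-statistic for the correlation test can be checked directly (a sketch with r and n from the example):

```python
import math

# t-statistic for H0: rho = 0 in the wire bond example
r, n = 0.9818, 25

t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # ~24.8

print(round(t0, 1))
```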
65. Correlation-Example continued
Finally, we may construct an approximate 95% confidence interval on ρ. Since arctanh(r) = arctanh(0.9818) = 2.3452:
tanh(2.3452 − 1.96/√22) ≤ ρ ≤ tanh(2.3452 + 1.96/√22)
which reduces to:
0.9585 ≤ ρ ≤ 0.9921
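The Fisher z-transform interval above is a one-liner with the standard library's `math.atanh`/`math.tanh` (a sketch; z0.025 = 1.96):

```python
import math

# Approximate 95% CI on rho via Fisher's z (arctanh) transformation
r, n = 0.9818, 25
z_crit = 1.96

center = math.atanh(r)           # ~2.3452
delta = z_crit / math.sqrt(n - 3)
lo = math.tanh(center - delta)   # ~0.9585
hi = math.tanh(center + delta)   # ~0.9921

print(round(lo, 4), round(hi, 4))
```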
66. Example
A research engineer is investigating the use of a windmill to generate electricity and has collected data on the DC output from this windmill and the corresponding wind velocity. If we fit a straight-line model to the data, the regression model is ŷ = 0.1309 + 0.2411x. We also have the following summary statistics for this model: R² = 0.8745, MSE = σ̂² = 0.0557, F0 = 160.26 (P < 0.0001).
67. Definition
We occasionally find that the straight-line regression model Y = β0 + β1x + ε is inappropriate because the true regression function is nonlinear. Sometimes nonlinearity is visually determined from the scatter diagram, and sometimes, because of prior experience or underlying theory, we know in advance that the model is nonlinear. Occasionally, a scatter diagram will exhibit an apparent nonlinear relationship between Y and x. In some of these situations, a nonlinear function can be expressed as a straight line by using a suitable transformation. Such nonlinear models are called intrinsically linear.
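A minimal sketch of an intrinsically linear fit. The exponential model y = a·e^(bx) is hypothetical (not from the talk's data), and the points below are synthetic, generated with a = 2.0 and b = 0.5 and no noise, purely to show that regressing ln y on x recovers the parameters:

```python
import math

# Intrinsically linear model: y = a * exp(b * x)  =>  ln y = ln a + b x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]   # synthetic, noise-free data

# Ordinary least squares on the transformed pairs (x, ln y)
lys = [math.log(y) for y in ys]
n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(lys) / n
Sxx = sum((x - x_bar) ** 2 for x in xs)
Sxy = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, lys))

b_hat = Sxy / Sxx                          # recovers b = 0.5
a_hat = math.exp(ly_bar - b_hat * x_bar)   # recovers a = 2.0

print(round(a_hat, 3), round(b_hat, 3))
```

With real (noisy) data the same transformation applies, but the least squares assumptions then concern the errors on the log scale.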
68. Transformation and Logistic Regression
70. A plot of the residuals from the transformed model versus ŷi is shown in Figure 11-17. This plot does not reveal any serious problem with inequality
Figure 11-17. This plot does not reveal any serious problem with inequality
of variance. The normal probability plot, shown in Figure 11-18, gives a mild
indication that the errors come from a distribution with heavier tails than
the normal (notice the slight upward and downward curve at the extremes).
This normal probability plot has the z-score value plotted on the horizontal
axis. Since there is no strong signal of model inadequacy, we conclude that
the transformed model is satisfactory.
71. Important Terms and Concepts
• Analysis of variance test in
regression
• Confidence interval on mean
response
• Correlation coefficient
• Empirical model
• Confidence intervals on model
parameters
• Intrinsically linear model
• Least squares estimation of
regression model parameters
• Logistic regression
• Model adequacy checking
• Odds ratio
• Prediction interval on a future
observation
• Regression analysis
• Residual plots
• Residuals
• Scatter diagram
• Simple linear regression model
• Standard error
• Statistical test on model
parameters
• Transformations
73. Last Page
Summary
End of my talk
with a tidy TU Delft lay-out.
Thank you!