This document provides an introduction to simple linear regression and correlation. It defines key terms like independent and dependent variables, and explains how to estimate regression coefficients using the least squares method. Graphs like scatter plots are used to visualize the linear relationship between two variables. The correlation coefficient measures the strength of the linear association. Regression seeks to predict a dependent variable from an independent variable, while correlation simply measures the degree of association between two variables.
2. Introduction
• A major contribution to our knowledge of Public Health
comes from understanding:
– trends in disease rates and
– relationships among different predictors of health.
• Biostatisticians accomplish these analyses by fitting
mathematical models to data.
• Often, two or more variables measured on a single
sample are studied together to determine whether there
is some underlying relationship between them and, if
so, what kind of relationship it is.
3. Introduction
• Sometimes, on the other hand, two or more variables
are studied in the hope of being able to use some of
them to predict the other.
– For example, we might have birth weight, gestational
age, mother's age, mother's nutritional status, etc.
Let us start with the case of two variables.
• Blood lead levels in children are known to cause serious
brain and neurologic damage
– at levels as low as ten micrograms per deciliter.
• Since the removal of lead from gasoline, blood levels of
lead in children have been steadily declining,
– but there is still a residual risk from environmental pollution.
4. Introduction
• In a survey, blood lead levels of children were related to
lead levels in soil samples taken near their residences.
• A plot of the blood levels and soil concentrations shows
some curvature,
• so logarithms were used to produce an
approximately linear relationship.
• When plotted, the data show a cloud of points, as in the
following example for 200 children.
• The mathematical model relating the two variables
is: y = 0.29x + 0.01.
• It says that an increase of 1 in log(soil-lead) concentration
will correspond, on average, to an increase of 0.29 in
log(blood-lead).
6. Simple Correlation and Regression
• Correlation seeks to establish whether a relationship exists
between two variables
• Regression seeks to use one variable to predict another
variable
• Both measure the extent of a linear relationship between
two variables
• Statistical tests are used to determine the strength of the
relationship
7. Scatter Diagram
• A two-dimensional scatter plot is the fundamental graphical
tool for looking at regression and correlation data.
• In correlation and regression problems with one predictor
and one response, the scatter plot of the response versus
the predictor is the starting point for correlation and
regression analysis.
9. Correlations:
[Two scatter plots of C1 vs C2, illustrating positive and negative correlation.]
• Positive: large values of X are associated with large values of Y,
and small values of X with small values of Y (e.g., IQ and SAT).
• Negative: large values of X are associated with small values of Y,
and vice versa (e.g., SPEED and ACCURACY).
12. Simple Correlation
• Measures the relative strength of the linear relationship
between two variables
• Estimate a quantity called the correlation coefficient, or “r”
• This “r” must lie between -1 and +1, and is interpreted as a
measure of how close to a straight line the data lie.
• Values near ±1: nearly perfect line,
• Values near 0: no linear relationship, but there may be a
non-linear relationship.
• For the lead data, r = 0.42, which can be used to test the
statistical significance of the regression.
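As a quick illustration, the sketch below computes r with SciPy; the two arrays are hypothetical log(soil-lead) and log(blood-lead) values, not the actual survey data.

```python
# A minimal sketch of computing Pearson's r; the data below are
# hypothetical, not the actual lead survey sample.
from scipy import stats

log_soil = [1.2, 1.8, 2.1, 2.5, 3.0, 3.4]   # hypothetical log(soil-lead)
log_blood = [0.9, 1.1, 1.0, 1.4, 1.3, 1.6]  # hypothetical log(blood-lead)

r, p_value = stats.pearsonr(log_soil, log_blood)
print(f"r = {r:.2f}, p = {p_value:.3f}")  # r near ±1: nearly perfect line
```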
13. Simple Correlation
• Strength of relationship
• Correlations from 0 to 0.25 (or 0 to –0.25) indicate little or no
relationship;
• those from 0.25 to 0.50 (or –0.25 to –0.50) indicate a fair
degree of relationship;
• those from 0.50 to 0.75 (or –0.50 to –0.75) a moderate to
good relationship; and
• those greater than 0.75 (or –0.75 to –1.00) a very good to
excellent relationship.
14. Simple Correlation
• Coefficient of Determination, r²
• To understand the strength of the relationship between
two variables:
– the correlation coefficient, r, is squared;
– r² shows how much of the variation in one measure is
accounted for by knowing the value of the other measure.
15. Correlation does not imply causality
Two variables might be associated because they share a
common cause.
For example, SAT scores and College Grade are highly
associated, but probably not because scoring well on
the SAT causes a student to get high grades in college.
Being a good student, etc., would be the common cause
of the SATs and the grades.
• Correlation measures only linear association, and many
biological systems are better described by curvilinear
plots
• This is one reason why data should always be looked at
first (scatterplot)
16. Intervening and confounding factors
There is a positive correlation between ice cream sales
and drowning.
There is a strong positive association between Number
of Years of Education and Annual Income
In part, getting more education allows people to get
better, higher-paying jobs.
But these variables are confounded with others, such
as socio-economic status
17. Simple Correlation
• Correlation coefficient assumes normally distributed data
• The correlation coefficient is sensitive to extreme values
• Non-normal distributions can be transformed (e.g., a
logarithmic transformation) or converted into ranks so that a
non-parametric correlation test can be used (Spearman's
rank correlation)
18. Pearson's Correlation Coefficient
• With the aid of Pearson's correlation coefficient (r),
we can determine the strength and the direction of
the relationship between the X and Y variables,
• both of which have been measured and must be
quantitative.
• For example, we might be interested in examining
the association between height and weight for the
following sample of eight children:
20. Computational Formula for Pearson's Correlation Coefficient r
• r = SP / √(SSx · SSy)
• where SP (the sum of products), SSx (the sum of
squares for x) and SSy (the sum of squares for y)
can be computed as follows:
SP = Σ(x − x̄)(y − ȳ),  SSx = Σ(x − x̄)²,  SSy = Σ(y − ȳ)²
21. Heights and weights of 8 children
Child   Height (inches) X   Weight (pounds) Y
A 49 81
B 50 88
C 53 87
D 55 99
E 60 91
F 55 89
G 60 95
H 50 90
Mean:   x̄ = 54 inches   ȳ = 90 pounds
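As a check on the formula, here is a minimal Python sketch applying it to the table above; it should reproduce x̄ = 54, ȳ = 90 and give r ≈ 0.61.

```python
# Computing r for the eight children by the computational formula
# r = SP / sqrt(SSx * SSy); data taken from the table above.
import math

x = [49, 50, 53, 55, 60, 55, 60, 50]  # height (inches)
y = [81, 88, 87, 99, 91, 89, 95, 90]  # weight (pounds)
n = len(x)
x_bar = sum(x) / n   # 54 inches
y_bar = sum(y) / n   # 90 pounds

sp  = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum of products
ssx = sum((xi - x_bar) ** 2 for xi in x)                        # sum of squares for x
ssy = sum((yi - y_bar) ** 2 for yi in y)                        # sum of squares for y

r = sp / math.sqrt(ssx * ssy)
print(f"SP = {sp}, SSx = {ssx}, SSy = {ssy}, r = {r:.2f}")  # r ≈ 0.61
```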
23. Table: The Strength of a Correlation
Value of r (positive or negative)   Meaning
0.00 to 0.19                        A very weak correlation
0.20 to 0.39                        A weak correlation
0.40 to 0.69                        A modest correlation
0.70 to 0.89                        A strong correlation
0.90 to 1.00                        A very strong correlation
26. Checking for significance
• There appears to be a strong correlation between chest circumference and
birth weight in babies.
• We need to check that such a correlation is unlikely to have arisen by chance
in a sample of ten babies.
• Tables are available that give the significant values of this correlation
ratio at two probability levels.
• First we need to work out the degrees of freedom: the number of
pairs of observations less two, that is (n – 2) = 8.
• Looking at the table, we find that our calculated value of 0.86 exceeds the
tabulated value at 8 df of 0.765 at p = 0.01. Our correlation is therefore
statistically highly significant.
27. Simple Correlation
• Sampling distribution of the correlation coefficient:
• Note that, like a proportion, the variance of the correlation
coefficient depends on the correlation coefficient itself, so
the estimated r is substituted in.
• The sample correlation coefficient follows a t-distribution
with n – 2 degrees of freedom.
• Sample size requirements for r.
28. Simple Correlation
• Significance test for the Pearson correlation
• H0: ρ = 0, Ha: ρ ≠ 0 (a 1-sided test can also be done)
• Test statistic: t = r√(n − 2) / √(1 − r²)
• with n − 2 degrees of freedom
• P-value: 2P(t ≥ |tobs|)
29. Correlation and Regression
• Example
• Let r = 0.61, n = 18, α = 0.05
• t = 0.61·√(18 − 2) / √(1 − 0.61²) = 3.08
• t(0.025, 16) = 2.12
• Conclusion: reject the null hypothesis, i.e., the
relationship is significant.
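A minimal Python sketch reproducing this example; the critical value and p-value are computed with SciPy rather than read from a table.

```python
# Reproducing the worked example: r = 0.61, n = 18, alpha = 0.05.
import math
from scipy import stats

r, n, alpha = 0.61, 18, 0.05
t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # test statistic
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)        # two-sided critical value
p_val = 2 * stats.t.sf(abs(t_obs), df=n - 2)         # two-sided p-value

print(f"t = {t_obs:.2f}, critical t = {t_crit:.2f}, p = {p_val:.4f}")
# t ≈ 3.08 > 2.12, so H0: rho = 0 is rejected
```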
30. Correlation and Regression
• Assumptions in correlation
• The assumptions needed to make inferences about
the correlation coefficient are that the sample was
randomly selected and that the two variables, X and Y,
vary together in a joint distribution that is normally
distributed (called the bivariate normal
distribution).
33. Simple Correlation
• Testing a hypothesis that the true population
correlation is a specific value other than zero:
• H0: ρ = ρ0
• H1: ρ ≠ ρ0
• Test statistic: Z = (Zr − Zρ0) · √(n − 3), which is approximately
standard normal under H0,
where
– Zr is Fisher's transformed value of r, Zr = ½ ln[(1 + r)/(1 − r)]
– Zρ0 is Fisher's transformed value of ρ0, the
hypothesized population correlation
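A hedged sketch of this test in Python; the values of r, ρ0, and n below are illustrative, not taken from the text.

```python
# A sketch of the Fisher z-test for H0: rho = rho0; the sample
# values here are illustrative only.
import math
from scipy import stats

r, rho0, n = 0.61, 0.40, 18                  # illustrative values
z_r    = math.atanh(r)                       # Fisher's transform of r
z_rho0 = math.atanh(rho0)                    # Fisher's transform of rho0
z_obs  = (z_r - z_rho0) * math.sqrt(n - 3)   # approx. standard normal under H0

p_val = 2 * stats.norm.sf(abs(z_obs))        # two-sided p-value
print(f"z = {z_obs:.2f}, p = {p_val:.3f}")
```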
34. Spearman's Rank Correlation
• The correlation coefficient is markedly influenced by
extreme values and thus does not provide a good
description of the relationship between two variables when
the distributions of the variables are skewed or contain
outlying values.
• A simple method for dealing with extreme observations
in correlation is to transform the data to ranks
and then recalculate the correlation on the ranks, obtaining the
non-parametric correlation called Spearman's rho, or rank
correlation.
35. Spearman's Rank Correlation
• Spearman's rank correlation (rs) is given by:
rs = 1 − 6Σdᵢ² / [n(n² − 1)]
in which dᵢ is the difference of the two ranks associated with
the ith point.
• The significance of the association may be assessed using a t-test
in the same way as described for the Pearson correlation
coefficient,
• with n − 2 degrees of freedom.
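A short sketch using SciPy's spearmanr, applied here to the earlier height/weight sample for illustration; SciPy ranks the observations and computes the correlation on the ranks internally.

```python
# Spearman's rank correlation on the height/weight data from slide 21.
from scipy import stats

x = [49, 50, 53, 55, 60, 55, 60, 50]  # height (inches)
y = [81, 88, 87, 99, 91, 89, 95, 90]  # weight (pounds)

rho, p_val = stats.spearmanr(x, y)    # correlation on ranks + p-value
print(f"rho = {rho:.2f}, p = {p_val:.3f}")
```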
36. Partial Correlation
• It is the correlation between y and x1 when a variable x2 is
not allowed to vary.
Example: in an elementary school, reading ability (y) is highly
correlated with the child's weight (x1).
• But both y and x1 are really driven by something else: the
child's age (call it x2).
• What would the correlation be between weight and
reading ability if age were held constant? (Would it
drop to zero?)
38. Regression and Correlation
• Regression, correlation, and analysis of
covariance are all statistical techniques that
use the idea that one variable may be
related to one or more other variables through an
equation. Here we consider the relationship
of two variables only, in a linear form, which
is called linear regression and linear
correlation, or simple regression and
correlation. Relationships among
more than two variables, called multiple
regression and correlation, will be
considered later.
• Simple regression uses the relationship
between the two variables to obtain
information about one variable from
the values of the other. The equation
expressing this type of relationship is called a
simple linear regression equation. The
related method of correlation is used to
measure how strong the relationship
between the two variables is.
39. Line of Regression
Simple Linear Regression:
• Suppose that we are interested in a variable Y, but we
want to know about its relationship to another variable
X, or we want to use X to predict (or estimate) the value
of Y that might be obtained without actually measuring
it, provided the relationship between the two can be
expressed by a line. X is usually called the independent
variable and Y is called the dependent variable.
• We assume that the values of variable X are either fixed
or random. By fixed, we mean that the values are
chosen by the researcher: either an experimental unit
(patient) is given this value of X (such as the dosage of a
drug), or a unit (patient) is chosen which is known to have
this value of X.
• By random, we mean that units (patients) are chosen at
random from all the possible units (the bivariate random
variable case), and both variables X and Y are measured.
• We also assume that for each value x of X, there is a
whole range or population of possible Y values and that
the mean of the Y population at X = x, denoted by µ(Y|x),
is a linear function of x. That is,
µ(Y|x) = β0 + β1x
40. Estimation
• We select a sample of n observations (xi, yi)
from the population, with the goals:
– Estimate β0 and β1.
– Predict the value of Y at a given
value x of X.
– Make tests to draw conclusions
about the model and its usefulness.
• We estimate the parameters β0 and
β1 by β̂0 and β̂1 respectively, using the
sample regression line:
Ŷ = β̂0 + β̂1x
41. Least Squares Estimation of β0, β1
• β0: mean response when x = 0 (y-intercept)
• β1: change in mean response when x increases
by 1 unit (slope)
• β0, β1 are unknown parameters (like µ)
• β0 + β1x: mean response when the explanatory
variable takes on the value x
• Goal: choose values (estimates) that minimize the
sum of squared errors (SSE) of observed values about
the straight line:
SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − (β̂0 + β̂1xᵢ))²
42. Least Squares Computations
Sxx = Σ(x − x̄)²
Syy = Σ(y − ȳ)²
Sxy = Σ(x − x̄)(y − ȳ)
β̂1 = Sxy / Sxx
β̂0 = ȳ − β̂1x̄
SSE = Σ(y − ŷ)² = Syy − (Sxy)² / Sxx
s² = SSE / (n − 2)
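A minimal Python sketch of these computations, using the height/weight data from the earlier slide as a worked example.

```python
# Least-squares estimates via the S-quantities above, applied to
# the height/weight data from slide 21.
x = [49, 50, 53, 55, 60, 55, 60, 50]
y = [81, 88, 87, 99, 91, 89, 95, 90]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx               # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate
sse = s_yy - s_xy ** 2 / s_xx  # error sum of squares
s2 = sse / (n - 2)             # estimate of the error variance

print(f"b1 = {b1:.3f}, b0 = {b0:.2f}, s^2 = {s2:.2f}")
```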
43. Example
• Investigators at a sports health centre are
interested in the relationship between oxygen
consumption and exercise time in athletes
recovering from injury. Appropriate mechanics
for exercising and measuring oxygen
consumption are set up, and the results are
presented below:
– x variable
46. Inference Concerning the Slope (β1)
• Parameter: slope in the population model (β1)
• Estimator: least squares estimate β̂1 = Sxy / Sxx
• Estimated standard error: SE(β̂1) = s / √Sxx
• Methods of making inference regarding the population:
– Hypothesis tests (2-sided or 1-sided)
– Confidence intervals
47. Hypothesis Test for β1
• 2-sided test
– H0: β1 = 0
– HA: β1 ≠ 0
• 1-sided test
– H0: β1 = 0
– HA+: β1 > 0 or
– HA−: β1 < 0
• Test statistic: tobs = β̂1 / (s / √Sxx)
• 2-sided rejection region: |tobs| ≥ t(α/2, n−2); P-value: 2P(t ≥ |tobs|)
• 1-sided rejection regions: tobs ≥ t(α, n−2) for HA+, or
tobs ≤ −t(α, n−2) for HA−; P-values: P(t ≥ tobs) or P(t ≤ tobs)
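A sketch of this test in Python, carrying over b1, Sxx, and SSE from the previous worked example.

```python
# t-test for H0: beta1 = 0, continuing the height/weight example;
# b1, Sxx, and SSE are the values computed in the previous sketch.
import math
from scipy import stats

b1, s_xx, sse, n = 0.758, 132.0, 126.24, 8
s = math.sqrt(sse / (n - 2))    # residual standard deviation
se_b1 = s / math.sqrt(s_xx)     # standard error of b1

t_obs = b1 / se_b1
p_val = 2 * stats.t.sf(abs(t_obs), df=n - 2)
print(f"t = {t_obs:.2f}, p = {p_val:.3f}")  # t ≈ 1.90, p ≈ 0.11 for these data
```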
48. (1 − α)100% Confidence Interval for β1
β̂1 ± t(α/2, n−2) · s / √Sxx
• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is same as 2-sided hypothesis test
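A short continuation computing the interval for the same worked example.

```python
# (1 - alpha)100% confidence interval for beta1, continuing the sketch;
# b1 and its standard error are carried over from the previous code.
from scipy import stats

b1, se_b1, n, alpha = 0.758, 0.399, 8, 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"95% CI for beta1: ({lo:.2f}, {hi:.2f})")
# interval contains 0, so no association can be concluded for these data
```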
49. Analysis of Variance in Regression
• Goal: Partition the total variation in y into
variation “explained” by x and random variation
Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²
(Total = Model + Error)
• These three sums of squares and degrees of freedom are:
– Total (Syy): dfTotal = n − 1
– Error (SSE): dfError = n − 2
– Model (SSR): dfModel = 1
50. Analysis of Variance in Regression
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
Model                 SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n − 2                MSE = SSE/(n − 2)
Total                 Syy              n − 1
• Analysis of Variance F-test
• H0: β1 = 0, HA: β1 ≠ 0
• Test statistic: Fobs = MSR / MSE
• Rejection region: Fobs ≥ F(α, 1, n−2)
• P-value: P(F ≥ Fobs)
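A sketch assembling the F-test from the same worked example; note that F = t² in simple regression.

```python
# ANOVA F-test for the height/weight example; Syy and SSE are the
# totals computed in the earlier sketch.
from scipy import stats

s_yy, sse, n = 202.0, 126.24, 8
ssr = s_yy - sse            # model (regression) sum of squares
msr = ssr / 1               # mean square for the model
mse = sse / (n - 2)         # mean square error

f_obs = msr / mse
p_val = stats.f.sf(f_obs, 1, n - 2)
print(f"F = {f_obs:.2f}, p = {p_val:.3f}")  # F ≈ 3.60 = t^2 from the t-test
```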
51. Residual Analysis
Determining the goodness of fit: how well does the regression model fit the data?
Q: Is the correlation r significantly different
from 0.0? Yes, p < 0.001
Q: If significant, how much of the variance in Y can be
accounted for by X, i.e., the coefficient of determination?
r² = .682, or 68.2%
Q: How much of the variance in Y cannot be accounted
for by X, i.e., the coefficient of non-determination? 1 – r² = .318, or
31.8%
Q: Are the prediction errors distributed randomly?
52. Residual Analysis
A residual (an error) is the difference between a
prediction (Ŷ) and the actual value of the
dependent variable Y:
Residual (e) = (Y – Ŷ)
If the data fit the assumptions of the regression
model, the residuals will be randomly distributed.
How to test whether the residuals are random:
• Histogram of the residuals (e)
• Normal probability plot of the residuals (e)
• Plot of the residuals (e) against the
predictions (Ŷ)
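A sketch of these three checks in Python, assuming Matplotlib and SciPy are available; the fitted values below come from the earlier height/weight example (ŷ = b0 + b1·x with b0 ≈ 49.09, b1 ≈ 0.758).

```python
# Residual checks: histogram, normal probability plot, and residuals
# versus predictions, for the height/weight worked example.
import matplotlib.pyplot as plt
from scipy import stats

y = [81, 88, 87, 99, 91, 89, 95, 90]
y_hat = [86.2, 87.0, 89.2, 90.8, 94.6, 90.8, 94.6, 87.0]  # b0 + b1*x
e = [yi - yh for yi, yh in zip(y, y_hat)]                 # residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(e)                          # histogram of residuals
stats.probplot(e, plot=axes[1])          # normal probability plot
axes[2].scatter(y_hat, e)                # residuals vs predictions
axes[2].axhline(0, linestyle="--")       # reference line at e = 0
plt.show()
```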
53. Plotting Standardized Residuals and Standardized
Predictions
Standardizing the residuals and the predictions and graphing
them in a scatterplot is helpful in identifying outliers:
cases which may have an undue influence on the estimation of
the regression constant (a) and the regression coefficient (b).
To standardize a residual (e) or a prediction (Ŷ) is to convert it
to a Z score:
Ze = (e – ē) / Se
ZŶ = (Ŷ – mean(Ŷ)) / SŶ
In SPSS, standardized residuals and predictions can be
saved in the regression analysis; they are called zre_1 and
zpr_1.
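A rough Python analogue of what SPSS saves as zre_1 and zpr_1; this is a sketch using sample Z scores, not necessarily SPSS's exact internal formula.

```python
# Standardizing residuals and predictions to Z scores and flagging
# potential outliers; values carried over from the earlier fit.
import statistics

e = [-5.2, 1.0, -2.3, 8.2, -3.6, -1.8, 0.4, 3.0]          # residuals
y_hat = [86.2, 87.0, 89.2, 90.8, 94.6, 90.8, 94.6, 87.0]  # predictions

z_e = [(ei - statistics.mean(e)) / statistics.stdev(e) for ei in e]
z_yhat = [(yh - statistics.mean(y_hat)) / statistics.stdev(y_hat)
          for yh in y_hat]

# Standardized residuals beyond about |2| flag potential outliers.
outliers = [i for i, z in enumerate(z_e) if abs(z) > 2]
print(outliers)
```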