Simple Linear Regression and
Correlation
Introduction
• A major contribution to our knowledge of Public Health
comes from understanding:
– trends in disease rates and
– relationships among different predictors of health.
• Biostatisticians accomplish these analyses by fitting
mathematical models to data.
• Usually, two or more variables, when all variables are
measured from a single sample, are studied together in
the general hope of determining whether there is some
underlying relation between them, and if so, what kind
of relationship it is.
Introduction
• Sometimes, on the other hand, two or more variables
are studied in the hope of being able to use some of
them to predict the other.
– For example, we might have birth weight, gestational age, mother's age, mother's nutritional status, etc.
Let us start when we have two variables
• Blood lead levels in children are known to cause serious
brain and neurologic damage
– at levels as low as ten micrograms per deciliter.
• Since the removal of lead from gasoline, blood levels of
lead in children have been steadily declining,
– but there is still a residual risk from environmental pollution.
Introduction
• In a survey, children's blood lead levels were related to lead levels in soil samples taken near their residences.
• A plot of the blood levels and soil concentrations shows
some curvature.
• So the logarithms were used to produce an
approximately linear relationship.
• When plotted, the data show a cloud of points as in the
following example for 200 children.
• The mathematical model relating the two variables is: y = 0.29x + 0.01.
• It says that an increase of 1 in log(soil-lead) concentration will correspond, on average, to an increase in log(blood-lead) of 0.29.
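A minimal sketch of how this fitted line is used for prediction; the function name is illustrative, not from the source:

```python
def predict_log_blood_lead(log_soil_lead):
    """Predicted log(blood-lead) from log(soil-lead) via the fitted line y = 0.29x + 0.01."""
    return 0.29 * log_soil_lead + 0.01

# An increase of 1 in log(soil-lead) raises the prediction by the slope, 0.29:
delta = predict_log_blood_lead(3.0) - predict_log_blood_lead(2.0)
print(round(delta, 2))  # 0.29
```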
Simple Correlation and Regression
• Correlation seeks to establish whether a relationship exists
between two variables
• Regression seeks to use one variable to predict another
variable
• Both measure the extent of a linear relationship between
two variables
• Statistical tests are used to determine the strength of the
relationship
Scatter Diagram
• A two-dimensional scatter plot is the fundamental graphical
tool for looking at regression and correlation data.
• In correlation and regression problems with one predictor
and one response, the scatter plot of the response versus
the predictor is the starting point for correlation and
regression analysis.
Scatter Diagram
[Two scatter plots of C1 vs C2, illustrating positive and negative correlation.]
Positive: large values of X associated with large values of Y, small values of X associated with small values of Y.
e.g. IQ and SAT
Negative: large values of X associated with small values of Y, and vice versa.
e.g. SPEED and ACCURACY
Simple Correlation
• Measures the relative strength of the linear relationship
between two variables
• Estimate a quantity called the correlation coefficient, or “r”
• This “r” must lie between -1 and +1, and is interpreted as a
measure of how close to a straight line the data lie.
• Values near ±1: nearly perfect line,
• Values near 0: no linear relationship, but there may be a
non-linear relationship.
• For the lead data, r = 0.42. It can be used to test for the statistical significance of the regression.
Simple Correlation
• Strength of relationship
• Correlations from 0 to 0.25 (or 0 to –0.25) indicate little or no relationship;
• those from 0.25 to 0.50 (or –0.25 to –0.50) indicate a fair degree of relationship;
• those from 0.50 to 0.75 (or –0.50 to –0.75) a moderate to good relationship; and
• those greater than 0.75 (or –0.75 to –1.00) a very good to excellent relationship.
Simple Correlation
• Coefficient of Determination, r²
• To understand the strength of the relationship between two variables
– The correlation coefficient, r, is squared
– r² shows how much of the variation in one measure is accounted for by knowing the value of the other measure
Correlation does not imply causality
 Two variables might be associated because they share a
common cause.
 For example, SAT scores and College Grade are highly
associated, but probably not because scoring well on
the SAT causes a student to get high grades in college.
 Being a good student, etc., would be the common cause
of the SATs and the grades.
• Correlation measures only linear association, and many
biological systems are better described by curvilinear
plots
• This is one reason why data should always be looked at
first (scatterplot)
Intervening and confounding factors
There is a positive correlation between ice cream sales
and drowning.
There is a strong positive association between Number
of Years of Education and Annual Income
 In part, getting more education allows people to get
better, higher-paying jobs.
 But these variables are confounded with others, such
as socio-economic status
Simple Correlation
• Correlation coefficient assumes normally distributed data
• The correlation coefficient is sensitive to extreme values
• Non-normal distributions can be transformed (e.g., logarithmic transformation) or converted into ranks, and a non-parametric correlation test can be used (Spearman's rank correlation)
Pearson’s Correlation Coefficient
• With the aid of Pearson’s correlation coefficient (r),
we can determine the strength and the direction of
the relationship between X and Y variables,
• both of which have been measured, and both must be quantitative.
• For example, we might be interested in examining
the association between height and weight for the
following sample of eight children:
Simple Correlation
• After Karl Pearson ( 1857 – 1936)
Computational Formula for Pearson's Correlation Coefficient r
• r = SP / √(SSx · SSy)
• where SP (sum of products), SSx (sum of squares for x) and SSy (sum of squares for y) can be computed as follows:
– SP = ∑XY − (∑X)(∑Y)/n
– SSx = ∑X² − (∑X)²/n
– SSy = ∑Y² − (∑Y)²/n
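A minimal Python sketch of this computational formula (the function name is illustrative), applied to the height/weight data tabulated below:

```python
import math

def pearson_r(x, y):
    """Pearson's r via the computational formula r = SP / sqrt(SSx * SSy)."""
    n = len(x)
    sp  = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ssx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    ssy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
    return sp / math.sqrt(ssx * ssy)

# Heights (inches) and weights (pounds) of the 8 children:
height = [49, 50, 53, 55, 60, 55, 60, 50]
weight = [81, 88, 87, 99, 91, 89, 95, 90]
print(round(pearson_r(height, weight), 2))  # 0.61
```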
Height and weights of 8 children
Child  Height (inches) X  Weight (pounds) Y
A      49                 81
B      50                 88
C      53                 87
D      55                 99
E      60                 91
F      55                 89
G      60                 95
H      50                 90
Average: X̄ = 54 inches, Ȳ = 90 pounds
Scatter plot for the 8 children
[Scatter plot of weight (pounds, vertical axis) against height (inches, horizontal axis) for the 8 children in the table above.]
Table : The Strength of a Correlation
Value of r (positive or negative) Meaning
_______________________________________________________
0.00 to 0.19 A very weak correlation
0.20 to 0.39 A weak correlation
0.40 to 0.69 A modest correlation
0.70 to 0.89 A strong correlation
0.90 to 1.00 A very strong correlation
________________________________________________________
Child  X   Y   X²   Y²    XY
A      12  12  144  144   144
B      10  8   100  64    80
C      6   12  36   144   72
D      16  11  256  121   176
E      8   10  64   100   80
F      9   8   81   64    72
G      12  16  144  256   192
H      11  15  121  225   165
∑      84  92  946  1118  981
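The totals in the last row plug directly into the computational formula; a short worked sketch:

```python
import math

# Totals from the table above (n = 8 children):
n, sum_x, sum_y, sum_x2, sum_y2, sum_xy = 8, 84, 92, 946, 1118, 981

sp  = sum_xy - sum_x * sum_y / n   # 981 - 84*92/8  = 15.0
ssx = sum_x2 - sum_x ** 2 / n      # 946 - 84**2/8  = 64.0
ssy = sum_y2 - sum_y ** 2 / n      # 1118 - 92**2/8 = 60.0
r = sp / math.sqrt(ssx * ssy)
print(round(r, 2))  # 0.24
```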
Table : Chest circumference and Birth
Weight of 10 babies
X(cm) Y(kg) X² Y² XY
___________________________________________________
22.4 2.00 501.76 4.00 44.8
27.5 2.25 756.25 5.06 61.88
28.5 2.10 812.25 4.41 59.85
28.5 2.35 812.25 5.52 66.98
29.4 2.45 864.36 6.00 72.03
29.4 2.50 864.36 6.25 73.5
30.5 2.80 930.25 7.84 85.4
32.0 2.80 1024.0 7.84 89.6
31.4 2.55 985.96 6.50 80.07
32.5 3.00 1056.25 9.00 97.5
TOTAL 292.1 24.8 8607.69 62.42 731.61
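The same computational formula applied to these totals reproduces the correlation used in the significance check below (r ≈ 0.866, reported on the slide as 0.86):

```python
import math

# Totals from the chest-circumference / birth-weight table (n = 10 babies):
n, sum_x, sum_y, sum_x2, sum_y2, sum_xy = 10, 292.1, 24.8, 8607.69, 62.42, 731.61

sp  = sum_xy - sum_x * sum_y / n   # sum of products about the means
ssx = sum_x2 - sum_x ** 2 / n      # sum of squares for x
ssy = sum_y2 - sum_y ** 2 / n      # sum of squares for y
r = sp / math.sqrt(ssx * ssy)
print(round(r, 3))  # 0.866
```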
Checking for significance
• There appears to be a strong correlation between chest circumference and birth weight in babies.
• We need to check that such a correlation is unlikely to have arisen by chance in a sample of ten babies.
• Tables are available that give the significant values of this correlation coefficient at two probability levels.
• First we need to work out the degrees of freedom: the number of pairs of observations less two, that is (n – 2) = 8.
• Looking at the table we find that our calculated value of 0.86 exceeds the tabulated value at 8 df of 0.765 at p = 0.01. Our correlation is therefore statistically highly significant.
Simple Correlation
• Sampling distribution of correlation coefficient:
• Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so we substitute in the estimated r.
• The sample correlation coefficient follows a t-distribution
with n-2 degrees of freedom
• Sample size requirements for r.
Simple Correlation
• Significance Test for Pearson Correlation
• H0: ρ = 0 Ha: ρ ≠ 0 (Can do 1-sided test)
• Test Statistic: t = r√(n − 2) / √(1 − r²)
• with n − 2 degrees of freedom
• P-value: 2P(t ≥ |tobs|)
Correlation and Regression
• Example
• Let r = 0.61, n = 18, α = 0.05
• t = 0.61·√(18 − 2) / √(1 − 0.61²) = 3.08
• t0.025, 16 = 2.12
• Conclusion: Reject the null hypothesis, i.e., the relationship is significant
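The test statistic in this example can be checked with a short sketch (function name illustrative):

```python
import math

def t_for_r(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t = t_for_r(0.61, 18)
print(round(t, 2))  # 3.08, which exceeds t(0.025, 16) = 2.12
```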
Correlation and Regression
• Assumptions in correlation
• The assumptions needed to make inferences about
the correlation coefficient are that the sample was
randomly selected and the two variables, X and Y,
vary together in a joint distribution that is normally
distributed, (called the bivariate normal
distribution).
Simple Correlation
• Fisher's Zr Transformation: Zr = ½ ln[(1 + r)/(1 − r)]
• which follows a normal distribution with
• Mean = ½ ln[(1 + ρ)/(1 − ρ)]
• and Variance = 1/(n − 3)
Simple Correlation
• Transforming Zr back to r: r = (e^(2Zr) − 1) / (e^(2Zr) + 1)
Simple correlation
• Testing a hypothesis that the true population correlation is a specific value other than zero.
• H0: ρ = ρ0
• H1: ρ ≠ ρ0
• Test statistic: Z = (Zr − Zρ0) / √(1/(n − 3))
Where,
– Zr is Fisher's transformed value of r
– Zρ0 is Fisher's transformed value of ρ0, the hypothesized population correlation
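A sketch of the Fisher transformation, its inverse, and the Z test, assuming the standard formulas Zr = ½ ln[(1+r)/(1−r)] and Var(Zr) = 1/(n − 3) (function names are illustrative):

```python
import math

def fisher_z(r):
    """Fisher's Zr = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Transform Zr back to r: (e^{2z} - 1) / (e^{2z} + 1)."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

def z_test_rho(r, rho0, n):
    """Z statistic for H0: rho = rho0, using Var(Zr) = 1 / (n - 3)."""
    return (fisher_z(r) - fisher_z(rho0)) / math.sqrt(1 / (n - 3))

# Round-trip check: transforming and back-transforming recovers r.
print(round(inverse_fisher_z(fisher_z(0.5)), 4))  # 0.5
```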
Spearman’s Rank Correlation
• The correlation coefficient is markedly influenced by extreme values and thus does not provide a good description of the relationship between two variables when the distributions of the variables are skewed or contain outlying values.
• A simple method for dealing with the problem of extreme observations in correlation is to transform the data to ranks and then recalculate the correlation on the ranks, obtaining the non-parametric correlation called Spearman's rho or rank correlation.
Spearman’s Rank Correlation
• The Spearman rank correlation (rs) is given by:
rs = 1 − 6∑di² / [n(n² − 1)]
in which di is the difference of the two ranks associated with the ith point.
• The significance of the association may be assessed using a t-test in the same way as described for the Pearson correlation coefficient,
• with n − 2 degrees of freedom.
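A minimal sketch of ranking the data and applying the formula above (the rank-difference formula is exact when there are no ties; helper names are illustrative):

```python
def ranks(values):
    """Rank values from 1..n, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of equal values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """rs = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); exact when there are no ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rho([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0 (perfectly monotone)
```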
Partial Correlation
• It is the correlation between y and x1 when a third variable, x2, is not allowed to vary.
Example: in an elementary school, reading ability (y) is highly
correlated with the child’s weight (x1).
• But both y and x1 are really caused by something else: the
child’s age (call x2).
• What would the correlation be between weight and
reading ability if the age were held constant? (Would it
drop down to zero?)
Partial Correlation
• A similar set of equations exists for the second predictor.
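The slide's equations did not survive extraction; as an assumption, the standard first-order partial correlation formula is sketched here, with hypothetical correlation values for the reading/weight/age example:

```python
import math

def partial_corr(r_y1, r_y2, r_12):
    """First-order partial correlation of y and x1, controlling for x2:
    r_{y1.2} = (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2^2) * (1 - r_12^2))."""
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))

# Hypothetical values: reading vs weight r = 0.6, reading vs age r = 0.8,
# weight vs age r = 0.7.  Controlling for age nearly removes the association:
print(round(partial_corr(0.6, 0.8, 0.7), 3))  # 0.093
```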
REGRESSION
CORRELATION
• Regression, Correlation and Analysis of Covariance are all statistical techniques that use the idea that one variable, say Y, may be related to one or more variables through an equation. Here we consider the relationship of two variables only, in a linear form, which is called linear regression and linear correlation, or simple regression and correlation. The relationships between more than two variables, called multiple regression and correlation, will be considered later.
• Simple regression uses the relationship between the two variables to obtain information about one variable by knowing the values of the other. The equation showing this type of relationship is called the simple linear regression equation. The related method of correlation is used to measure how strong the relationship between the two variables is.
EQUATION OF REGRESSION
Line of Regression
Simple Linear Regression:
Suppose that we are interested in a variable Y, but we want to know about its relationship to another variable X, or we want to use X to predict (or estimate) the value of Y that might be obtained without actually measuring it, provided the relationship between the two can be expressed by a line. X is usually called the independent variable and Y is called the dependent variable.
• We assume that the values of variable X are either fixed or random. By fixed, we mean that the values are chosen by the researcher: either an experimental unit (patient) is given this value of X (such as the dosage of a drug), or a unit (patient) is chosen which is known to have this value of X.
• By random, we mean that units (patients) are chosen at random from all the possible units, and both variables X and Y are measured.
• We also assume that for each value x of X, there is a whole range or population of possible Y values, and that the mean of the Y population at X = x, denoted by µy|x, is a linear function of x. That is,
µy|x = β0 + β1x
[Diagram: Y is the dependent variable and X the independent variable; when both are random they form a bivariate (two-random-variable) distribution.]
ESTIMATION
We select a sample of n observations (xi, yi) from the population, with the goals:
• Estimate β0 and β1.
• Predict the value of Y at a given value x of X.
• Make tests to draw conclusions about the model and its usefulness.
• We estimate the parameters β0 and β1 by β̂0 and β̂1 (also written a and b) using the sample regression line:
Ŷ = β̂0 + β̂1x
Least Squares Estimation of β0, β1
• β0 ≡ Mean response when x = 0 (y-intercept)
• β1 ≡ Change in mean response when x increases by 1 unit (slope)
• β0, β1 are unknown parameters (like µ)
• β0 + β1x ≡ Mean response when the explanatory variable takes on the value x
• Goal: Choose values (estimates) that minimize the sum of squared errors (SSE) of the observed values from the straight line:
Ŷ = β̂0 + β̂1x
SSE = ∑i=1..n (yi − ŷi)² = ∑i=1..n [yi − (β̂0 + β̂1xi)]²
Least Squares Computations
Sxx = ∑(x − x̄)²
Sxy = ∑(x − x̄)(y − ȳ)
Syy = ∑(y − ȳ)²
β̂1 = Sxy / Sxx
β̂0 = ȳ − β̂1x̄
s² = SSE / (n − 2) = ∑(y − ŷ)² / (n − 2)
EXAMPLE
• Investigators at a sports health centre are interested in the relationship between oxygen consumption and exercise time in athletes recovering from injury. Appropriate mechanics for exercising and measuring oxygen consumption are set up, and the results are presented below:
x: exercise time (min)   y: oxygen consumption
0.5    620
1.0    630
1.5    800
2.0    840
2.5    840
3.0    870
3.5    1010
4.0    940
4.5    950
5.0    1130
Calculations
[The slope and intercept for this example are obtained by applying the least-squares formulas above to the data.]
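The least-squares fit for the exercise data above can be sketched as follows (function name illustrative):

```python
def fit_line(x, y):
    """Least-squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

time_min = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
oxygen = [620, 630, 800, 840, 840, 870, 1010, 940, 950, 1130]
b0, b1 = fit_line(time_min, oxygen)
print(round(b0, 1), round(b1, 1))  # 594.0 97.8
```

So the fitted line is approximately ŷ = 594 + 97.8x: each additional minute of exercise corresponds to about 98 more units of oxygen consumption.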
Inference Concerning the Slope (β1)
• Parameter: Slope in the population model (β1)
• Estimator: Least squares estimate: β̂1
• Estimated standard error: SE(β̂1) = s / √Sxx
• Methods of making inference regarding population:
– Hypothesis tests (2-sided or 1-sided)
– Confidence Intervals
Hypothesis Test for β1
• 2-Sided Test
– H0: β1 = 0
– HA: β1 ≠ 0
– T.S.: tobs = β̂1 / (s/√Sxx)
– R.R.: |tobs| ≥ tα/2, n−2
– P-value: 2P(t ≥ |tobs|)
• 1-Sided Tests
– H0: β1 = 0
– HA+: β1 > 0: R.R.: tobs ≥ tα, n−2; P-value: P(t ≥ tobs)
– HA−: β1 < 0: R.R.: tobs ≤ −tα, n−2; P-value: P(t ≤ tobs)
(1−α)100% Confidence Interval for β1
β̂1 ± tα/2, n−2 · (s/√Sxx)
• Conclude positive association if entire interval above 0
• Conclude negative association if entire interval below 0
• Cannot conclude an association if interval contains 0
• Conclusion based on interval is same as 2-sided hypothesis test
Analysis of Variance in Regression
• Goal: Partition the total variation in y into variation "explained" by x and random variation:
∑(yi − ȳ)² = ∑(ŷi − ȳ)² + ∑(yi − ŷi)²
• These three sums of squares and degrees of freedom are:
• Total (Syy): dfTotal = n − 1
• Error (SSE): dfError = n − 2
• Model (SSR): dfModel = 1
Analysis of Variance in Regression
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
Model                 SSR              1                    MSR = SSR/1         F = MSR/MSE
Error                 SSE              n − 2                MSE = SSE/(n − 2)
Total                 Syy              n − 1
• Analysis of Variance - F-test
• H0: β1 = 0   HA: β1 ≠ 0
• T.S.: Fobs = MSR/MSE
• R.R.: Fobs ≥ Fα, 1, n−2
• P-value: P(F ≥ Fobs)
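The F statistic can be sketched from the sums of squares; the Syy and SSE values below are ones consistent with the earlier exercise-time data (computed here as an assumption, not taken from the slides):

```python
def anova_f(syy, sse, n):
    """Regression ANOVA: SSR = Syy - SSE, MSR = SSR/1, MSE = SSE/(n-2), F = MSR/MSE."""
    ssr = syy - sse
    msr = ssr / 1
    mse = sse / (n - 2)
    return msr / mse

# Values consistent with the exercise-time example (n = 10):
f = anova_f(syy=224810.0, sse=27461.8, n=10)
print(round(f, 1))
```

A large F (here well above F(0.05, 1, 8) ≈ 5.32) leads to rejecting H0: β1 = 0, matching the conclusion of the t-test since F = t² in simple regression.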
Residual Analysis
Determining the Goodness of Fit
How well does the regression model fit the data?
Q Is the correlation r significantly different from 0.0? Yes, p < 0.001
Q If significant, how much of the variance in Y can be accounted for by X, i.e. the coefficient of determination? r² = .682, or 68.2%
Q How much of the variance in Y can not be accounted for by X, i.e. the coefficient of non-determination? 1 – r² = .318, or 31.8%
Q Are the prediction errors distributed randomly?
Residual Analysis
A residual (an error) is the difference between a prediction (Ŷ) and the actual value of the dependent variable Y
Residual (e) = (Y – Ŷ)
If the data fit the assumptions of the regression model, the residuals will be randomly distributed.
How to test whether the residuals are random:
 Histogram of the residuals (e)
 Normal probability plot of the residuals (e)
 Plot the residuals (e) against the predictions (Ŷ)
Plotting Standardized Residuals and Standardized
Predictions
Standardizing the residuals and the predictions and graphing them in a scatterplot is helpful in identifying outliers:
cases which may have an undue influence on the estimation of the regression constant (a) and the regression coefficient (b).
To standardize a residual (e) or a prediction (Ŷ) is to convert it to a Z score:
Ze = (e – ē) / Se
ZŶ = (Ŷ – mean(Ŷ)) / SŶ
In SPSS, standardized residuals and predictions can be saved in the regression analysis. They are called zre_1 and zpr_1.

More Related Content

Similar to Simple Linear Regression and Correlation Analysis

Introduction to measures of relationship: covariance, and Pearson r
Introduction to measures of relationship: covariance, and Pearson rIntroduction to measures of relationship: covariance, and Pearson r
Introduction to measures of relationship: covariance, and Pearson rIvan Jacob Pesigan
 
Correlation Coefficient
Correlation CoefficientCorrelation Coefficient
Correlation CoefficientSaadSaif6
 
Biostatistics Lecture on Correlation.pptx
Biostatistics Lecture on Correlation.pptxBiostatistics Lecture on Correlation.pptx
Biostatistics Lecture on Correlation.pptxFantahun Dugassa
 
Correlation analysis
Correlation analysis Correlation analysis
Correlation analysis Anil Pokhrel
 
Class 9 Covariance & Correlation Concepts.pptx
Class 9 Covariance & Correlation Concepts.pptxClass 9 Covariance & Correlation Concepts.pptx
Class 9 Covariance & Correlation Concepts.pptxCallplanetsDeveloper
 
Topic 5 Covariance & Correlation.pptx
Topic 5  Covariance & Correlation.pptxTopic 5  Covariance & Correlation.pptx
Topic 5 Covariance & Correlation.pptxCallplanetsDeveloper
 
Topic 5 Covariance & Correlation.pptx
Topic 5  Covariance & Correlation.pptxTopic 5  Covariance & Correlation.pptx
Topic 5 Covariance & Correlation.pptxCallplanetsDeveloper
 
Data analysis test for association BY Prof Sachin Udepurkar
Data analysis   test for association BY Prof Sachin UdepurkarData analysis   test for association BY Prof Sachin Udepurkar
Data analysis test for association BY Prof Sachin Udepurkarsachinudepurkar
 
Chapter 08correlation
Chapter 08correlationChapter 08correlation
Chapter 08correlationghalan
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionAbdelaziz Tayoun
 
Correlation and Regression.pptx
Correlation and Regression.pptxCorrelation and Regression.pptx
Correlation and Regression.pptxJayaprakash985685
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression pptSantosh Bhaskar
 
Case study on One way ANOVA
Case study on One way ANOVACase study on One way ANOVA
Case study on One way ANOVANadzirah Hanis
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regressionANCYBS
 
Chapter 03 scatterplots and correlation
Chapter 03 scatterplots and correlationChapter 03 scatterplots and correlation
Chapter 03 scatterplots and correlationHamdy F. F. Mahmoud
 
CORRELATION-CMC.PPTX
CORRELATION-CMC.PPTXCORRELATION-CMC.PPTX
CORRELATION-CMC.PPTXFahmida Swati
 

Similar to Simple Linear Regression and Correlation Analysis (20)

Correlations
CorrelationsCorrelations
Correlations
 
Introduction to measures of relationship: covariance, and Pearson r
Introduction to measures of relationship: covariance, and Pearson rIntroduction to measures of relationship: covariance, and Pearson r
Introduction to measures of relationship: covariance, and Pearson r
 
Correlation Coefficient
Correlation CoefficientCorrelation Coefficient
Correlation Coefficient
 
Biostatistics Lecture on Correlation.pptx
Biostatistics Lecture on Correlation.pptxBiostatistics Lecture on Correlation.pptx
Biostatistics Lecture on Correlation.pptx
 
Correlation analysis
Correlation analysis Correlation analysis
Correlation analysis
 
Class 9 Covariance & Correlation Concepts.pptx
Class 9 Covariance & Correlation Concepts.pptxClass 9 Covariance & Correlation Concepts.pptx
Class 9 Covariance & Correlation Concepts.pptx
 
Topic 5 Covariance & Correlation.pptx
Topic 5  Covariance & Correlation.pptxTopic 5  Covariance & Correlation.pptx
Topic 5 Covariance & Correlation.pptx
 
Topic 5 Covariance & Correlation.pptx
Topic 5  Covariance & Correlation.pptxTopic 5  Covariance & Correlation.pptx
Topic 5 Covariance & Correlation.pptx
 
Simple linear regressionn and Correlation
Simple linear regressionn and CorrelationSimple linear regressionn and Correlation
Simple linear regressionn and Correlation
 
Data analysis test for association BY Prof Sachin Udepurkar
Data analysis   test for association BY Prof Sachin UdepurkarData analysis   test for association BY Prof Sachin Udepurkar
Data analysis test for association BY Prof Sachin Udepurkar
 
Chapter 08correlation
Chapter 08correlationChapter 08correlation
Chapter 08correlation
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Correlation and Regression.pptx
Correlation and Regression.pptxCorrelation and Regression.pptx
Correlation and Regression.pptx
 
Correlation
CorrelationCorrelation
Correlation
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
 
Case study on One way ANOVA
Case study on One way ANOVACase study on One way ANOVA
Case study on One way ANOVA
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Chapter 03 scatterplots and correlation
Chapter 03 scatterplots and correlationChapter 03 scatterplots and correlation
Chapter 03 scatterplots and correlation
 
Correlation and Regression
Correlation and Regression Correlation and Regression
Correlation and Regression
 
CORRELATION-CMC.PPTX
CORRELATION-CMC.PPTXCORRELATION-CMC.PPTX
CORRELATION-CMC.PPTX
 

Recently uploaded

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,Virag Sontakke
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxsocialsciencegdgrohi
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 

Recently uploaded (20)

Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,भारत-रोम व्यापार.pptx, Indo-Roman Trade,
भारत-रोम व्यापार.pptx, Indo-Roman Trade,
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized Group
 
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptxHistory Class XII Ch. 3 Kinship, Caste and Class (1).pptx
History Class XII Ch. 3 Kinship, Caste and Class (1).pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 

Simple Linear Regression and Correlation Analysis

  • 1. 1 Simple Linear Regression and Correlation
  • 2. Introduction • A major contribution to our knowledge of Public Health comes from understanding: – trends in disease rates and – relationships among different predictors of health. • Biostatisticians accomplish these analyses by fitting mathematical models to data. • Usually, two or more variables, when all variables are measured from a single sample, are studied together in the general hope of determining whether there is some underlying relation between them, and if so, what kind of relationship it is.
  • 3. Introduction • Sometimes, on the other hand, two or more variables are studied in the hope of being able to use some of them to predict the other. – For example, we might have birth weight, gestational age, mothers age, mother nutritional status, etc Let us start when we have two variables • Blood lead levels in children are known to cause serious brain and neurologic damage – at levels as low as ten micrograms per deciliter. • Since the removal of lead from gasoline, blood levels of lead in children have been steadily declining, – but there is still a residual risk from environmental pollution.
  • 4. Introduction • In a survey, blood lead levels of children were related to lead levels from a sample of soil near their residences. • A plot of the blood levels and soil concentrations shows some curvature. • So logarithms were used to produce an approximately linear relationship. • When plotted, the data show a cloud of points, as in the following example for 200 children. • The mathematical model relating the two variables is y = 0.29x + 0.01. • It says that an increase of 1 in log(soil-lead) concentration will correspond, on average, to an increase in log(blood-lead) of 0.29.
  • 6. Simple Correlation and Regression • Correlation seeks to establish whether a relationship exists between two variables • Regression seeks to use one variable to predict another variable • Both measure the extent of a linear relationship between two variables • Statistical tests are used to determine the strength of the relationship
  • 7. Scatter Diagram • A two-dimensional scatter plot is the fundamental graphical tool for looking at regression and correlation data. • In correlation and regression problems with one predictor and one response, the scatter plot of the response versus the predictor is the starting point for correlation and regression analysis.
  • 9. Correlations (two example scatter plots of C1 vs C2 omitted): • Positive: large values of X associated with large values of Y, small values of X associated with small values of Y, e.g. IQ and SAT. • Negative: large values of X associated with small values of Y and vice versa, e.g. speed and accuracy.
  • 12. Simple Correlation • Measures the relative strength of the linear relationship between two variables • Estimate a quantity called the correlation coefficient, or “r” • This “r” must lie between -1 and +1, and is interpreted as a measure of how close to a straight line the data lie. • Values near ±1: nearly perfect line • Values near 0: no linear relationship, though there may be a non-linear relationship. • For the lead data, r = 0.42; it can be used to test for the statistical significance of the regression.
  • 13. Simple Correlation • Strength of relationship • Correlations from 0 to 0.25 (or 0 to –0.25) indicate little or no relationship; • those from 0.25 to 0.50 (or –0.25 to –0.50) indicate a fair degree of relationship; • those from 0.50 to 0.75 (or –0.50 to –0.75) a moderate to good relationship; and • those greater than 0.75 (or –0.75 to –1.00) a very good to excellent relationship.
  • 14. Simple Correlation • Coefficient of Determination, r² • To understand the strength of the relationship between two variables – the correlation coefficient, r, is squared – r² shows how much of the variation in one measure is accounted for by knowing the value of the other measure
  • 15. Correlation does not imply causality  Two variables might be associated because they share a common cause.  For example, SAT scores and College Grade are highly associated, but probably not because scoring well on the SAT causes a student to get high grades in college.  Being a good student, etc., would be the common cause of the SATs and the grades. • Correlation measures only linear association, and many biological systems are better described by curvilinear plots • This is one reason why data should always be looked at first (scatterplot)
  • 16. Intervening and confounding factors There is a positive correlation between ice cream sales and drowning. There is a strong positive association between Number of Years of Education and Annual Income  In part, getting more education allows people to get better, higher-paying jobs.  But these variables are confounded with others, such as socio-economic status
  • 17. Simple Correlation • Correlation coefficient assumes normally distributed data • The correlation coefficient is sensitive to extreme values • Non-normal distributions can be transformed (e.g., logarithmic transformation) or converted into ranks and non-parametric correlation test can be used (Spearman’s rank correlation)
  • 18. Pearson’s Correlation Coefficient • With the aid of Pearson’s correlation coefficient (r), we can determine the strength and the direction of the relationship between the X and Y variables, • both of which must be measured on a quantitative scale. • For example, we might be interested in examining the association between height and weight for the following sample of eight children:
  • 19. Simple Correlation • Named after Karl Pearson (1857 – 1936)
  • 20. Computational Formula for Pearson’s Correlation Coefficient r • r = SP / √(SSx · SSy) • Where SP (sum of products), SSx (sum of squares for x) and SSy (sum of squares for y) can be computed as follows: – SP = Σ(x − x̄)(y − ȳ) – SSx = Σ(x − x̄)² – SSy = Σ(y − ȳ)²
  • 21. Heights and weights of 8 children
    Child   Height (inches) X   Weight (pounds) Y
    A       49                  81
    B       50                  88
    C       53                  87
    D       55                  99
    E       60                  91
    F       55                  89
    G       60                  95
    H       50                  90
    Mean    x̄ = 54 inches       ȳ = 90 pounds
  • 22. Scatter plot of weight versus height for the 8 children (figure omitted).
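The computational formula from slide 20 can be applied directly to the height/weight data above. This is an illustrative Python sketch (not part of the original slides):

```python
import math

# Height (inches) and weight (pounds) of the 8 children from the table above
heights = [49, 50, 53, 55, 60, 55, 60, 50]
weights = [81, 88, 87, 99, 91, 89, 95, 90]

def pearson_r(x, y):
    """Pearson correlation via r = SP / sqrt(SSx * SSy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sp  = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # sum of products
    ssx = sum((xi - mx) ** 2 for xi in x)                     # sum of squares for x
    ssy = sum((yi - my) ** 2 for yi in y)                     # sum of squares for y
    return sp / math.sqrt(ssx * ssy)

r = pearson_r(heights, weights)
print(round(r, 2))  # 0.61 -- a modest positive correlation on the slide 23 scale
```

On the strength scale of slide 23, r ≈ 0.61 counts as a modest correlation.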
  • 23. Table: The Strength of a Correlation
    Value of r (positive or negative)   Meaning
    0.00 to 0.19                        A very weak correlation
    0.20 to 0.39                        A weak correlation
    0.40 to 0.69                        A modest correlation
    0.70 to 0.89                        A strong correlation
    0.90 to 1.00                        A very strong correlation
  • 24. Computation table for a second data set
    Child   X    Y    X²    Y²     XY
    A       12   12   144   144    144
    B       10    8   100    64     80
    C        6   12    36   144     72
    D       16   11   256   121    176
    E        8   10    64   100     80
    F        9    8    81    64     72
    G       12   16   144   256    192
    H       11   15   121   225    165
    ∑       84   92   946   1118   981
  • 25. Table: Chest circumference and birth weight of 10 babies
    x (cm)   y (kg)   x²        y²      xy
    22.4     2.00     501.76    4.00    44.8
    27.5     2.25     756.25    5.06    61.88
    28.5     2.10     812.25    4.41    59.85
    28.5     2.35     812.25    5.52    66.98
    29.4     2.45     864.36    6.00    72.03
    29.4     2.50     864.36    6.25    73.5
    30.5     2.80     930.25    7.84    85.4
    32.0     2.80     1024.00   7.84    89.6
    31.4     2.55     985.96    6.50    80.07
    32.5     3.00     1056.25   9.00    97.5
    Total    292.1 / 24.8 / 8607.69 / 62.42 / 731.61
  • 26. Checking for significance • There appears to be a strong correlation between chest circumference and birth weight in babies. • We need to check that such a correlation is unlikely to have arisen by chance in a sample of ten babies. • Tables are available that give the significant values of this correlation coefficient at two probability levels. • First we need to work out the degrees of freedom: the number of pairs of observations less two, that is (n − 2) = 8. • Looking at the table we find that our calculated value of 0.86 exceeds the tabulated value at 8 df of 0.765 at p = 0.01. Our correlation is therefore statistically highly significant.
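The r = 0.86 figure can be reproduced from the raw data of slide 25 with the sum-based form of Pearson's r, and checked against the tabulated critical value. An illustrative Python sketch (not part of the slides):

```python
import math

# Chest circumference (cm) and birth weight (kg) of the 10 babies from slide 25
x = [22.4, 27.5, 28.5, 28.5, 29.4, 29.4, 30.5, 32.0, 31.4, 32.5]
y = [2.00, 2.25, 2.10, 2.35, 2.45, 2.50, 2.80, 2.80, 2.55, 3.00]

n = len(x)
# Computational (sum-based) form of Pearson's r
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))

print(round(r, 2))   # 0.86
print(r > 0.765)     # exceeds the tabulated value at 8 df, p = 0.01 -> True
```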
  • 27. Simple Correlation • Sampling distribution of the correlation coefficient: • Note that, like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself, so we substitute in the estimated r. • The test statistic based on the sample correlation coefficient follows a t-distribution with n − 2 degrees of freedom. • Sample size requirements for r.
  • 28. Simple Correlation • Significance Test for the Pearson Correlation • H0: ρ = 0, Ha: ρ ≠ 0 (a 1-sided test can also be done) • Test statistic: t = r √(n − 2) / √(1 − r²) • With n − 2 degrees of freedom • P-value: 2P(t ≥ |tobs|)
  • 29. Correlation and Regression • Example • Let r = 0.61 and n = 18, α = 0.05 • t = 0.61 √(18 − 2) / √(1 − 0.61²) = 3.08 • t0.025, 16 = 2.12 • Conclusion: reject the null hypothesis, i.e., the relationship is significant.
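The arithmetic in this example can be checked with a short sketch (illustrative Python, not from the slides):

```python
import math

r, n = 0.61, 18  # values from the example above

# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

print(round(t, 2))    # 3.08
print(abs(t) > 2.12)  # exceeds t(0.025, 16) = 2.12 -> reject H0
```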
  • 30. Correlation and Regression • Assumptions in correlation • The assumptions needed to make inferences about the correlation coefficient are that the sample was randomly selected and the two variables, X and Y, vary together in a joint distribution that is normally distributed, (called the bivariate normal distribution).
  • 31. Simple Correlation • Fisher’s Zr Transformation: Zr = ½ ln[(1 + r) / (1 − r)] • which follows a normal distribution with • mean = ½ ln[(1 + ρ) / (1 − ρ)] • and variance = 1 / (n − 3)
  • 33. Simple correlation • Testing for evaluating a hypothesis that the true population correlation is a specific value other than zero. • H0: ρ = ρ0 • H1: ρ ≠ ρ0 • Test statistic: Z = (Zr − Zρ0) √(n − 3), compared with the standard normal distribution. Where, – Zr is Fisher’s transformed value of r – Zρ0 is Fisher’s transformed value of ρ0, the hypothesized population correlation
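A sketch of the Fisher Zr test against a non-zero hypothesized correlation (illustrative Python; the sample values r = 0.61 and n = 18 are reused from the earlier example, while ρ0 = 0.30 is a made-up hypothesized value):

```python
import math

def fisher_z(r):
    """Fisher's transformation: Zr = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))

r, n = 0.61, 18   # sample correlation and sample size (from the earlier example)
rho0 = 0.30       # hypothetical value for H0: rho = rho0 (made up for illustration)

# Z = (Zr - Z_rho0) * sqrt(n - 3), compared with the standard normal distribution
z = (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)
print(round(z, 2))
```

Here z is below 1.96, so at α = 0.05 these particular numbers would not reject H0: ρ = 0.30.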
  • 34. Spearman’s Rank Correlation • The correlation coefficient is markedly influenced by extreme values and thus does not provide a good description of the relationship between two variables when the distributions of the variables are skewed or contain outlying values. • A simple method for dealing with the problem of extreme observations in correlation is to transform the data to ranks and then recalculate the correlation on the ranks, to obtain the non-parametric correlation called Spearman’s rho, or rank correlation.
  • 35. Spearman’s Rank Correlation • The Spearman rank correlation (rs) is given by rs = 1 − 6 Σdi² / [n(n² − 1)], in which di is the difference of the two ranks associated with the ith point. • The significance of the association may be assessed using a t-test in the same way as described for the Pearson correlation coefficient, • with n − 2 degrees of freedom.
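The rank-and-recalculate idea can be sketched as follows (illustrative Python, not from the slides; the data are made up, with one extreme y value to show the robustness to outliers):

```python
def ranks(values):
    """Rank values from 1..n, assigning midranks to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1          # average rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = midrank
        i = j + 1
    return r

def spearman_rho(x, y):
    """rs = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); exact only when there are no ties."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up data: monotone relationship with one extreme y value
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 7, 100]
print(spearman_rho(x, y))   # 1.0 -- the ranks are perfectly concordant
```

Pearson's r would be pulled well below 1 by the outlying value 100, while the rank correlation is unaffected.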
  • 36. Partial Correlation • It is correlation between y and x1, where a variable x2 is not allowed to vary. Example: in an elementary school, reading ability (y) is highly correlated with the child’s weight (x1). • But both y and x1 are really caused by something else: the child’s age (call x2). • What would the correlation be between weight and reading ability if the age were held constant? (Would it drop down to zero?)
  • 37. Partial Correlation • The first-order partial correlation of y and x1, holding x2 constant, is r_y1.2 = (r_y1 − r_y2 r_12) / √[(1 − r_y2²)(1 − r_12²)]. • A similar set of equations exists for the second predictor.
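The standard first-order partial correlation formula can be sketched as below (illustrative Python; the three pairwise correlations are made-up values for the reading-ability example on slide 36):

```python
import math

def partial_corr(r_y1, r_y2, r_12):
    """First-order partial correlation r_{y1.2}:
    correlation of y and x1 with x2 held constant."""
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2**2) * (1 - r_12**2))

# Hypothetical pairwise correlations (not from the slides):
# y = reading ability, x1 = weight, x2 = age
r_y1, r_y2, r_12 = 0.60, 0.70, 0.80
print(round(partial_corr(r_y1, r_y2, r_12), 2))   # 0.09
```

With these made-up values the weight–reading correlation of 0.60 drops to about 0.09 once age is held constant, illustrating the slide's point that the raw association can largely vanish.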
  • 38. REGRESSION AND CORRELATION • Regression, correlation and analysis of covariance are all statistical techniques that use the idea that one variable may be related to one or more other variables through an equation. Here we consider the relationship of two variables only, in a linear form, which is called linear regression and linear correlation, or simple regression and correlation. The relationships between more than two variables, called multiple regression and correlation, will be considered later. • Simple regression uses the relationship between the two variables to obtain information about one variable by knowing the values of the other. The equation showing this type of relationship is called the simple linear regression equation. The related method of correlation is used to measure how strong the relationship between the two variables is.
  • 39. Line of Regression • Simple Linear Regression: Suppose that we are interested in a variable Y, but we want to know about its relationship to another variable X, or we want to use X to predict (or estimate) the value of Y that might be obtained without actually measuring it, provided the relationship between the two can be expressed by a line. X is usually called the independent variable and Y is called the dependent variable. • We assume that the values of variable X are either fixed or random. By fixed, we mean that the values are chosen by the researcher: either an experimental unit (patient) is given this value of X (such as the dosage of a drug), or a unit (patient) is chosen which is known to have this value of X. • By random, we mean that units (patients) are chosen at random from all the possible units, and both variables X and Y are measured. • We also assume that for each value x of X, there is a whole range or population of possible Y values, and that the mean of the Y population at X = x, denoted by µy|x, is a linear function of x. That is, µy|x = β0 + β1x.
  • 40. ESTIMATION • We select a sample of n observations (xi, yi) from the population, with the goals: – Estimate β0 and β1. – Predict the value of Y at a given value x of X. – Make tests to draw conclusions about the model and its usefulness. • We estimate the parameters β0 and β1 by β̂0 and β̂1 respectively, using the sample regression line: Ŷ = β̂0 + β̂1x.
  • 41. Least Squares Estimation of β0, β1 • β0: mean response when x = 0 (y-intercept) • β1: change in mean response when x increases by 1 unit (slope) • β0, β1 are unknown parameters (like µ) • β0 + β1x: mean response when the explanatory variable takes on the value x • Goal: choose values (estimates) that minimize the sum of squared errors (SSE) of the observed values about the straight line: SSE = Σi (yi − ŷi)² = Σi [yi − (β̂0 + β̂1xi)]²
  • 42. Least Squares Computations • Sxx = Σ(x − x̄)², Syy = Σ(y − ȳ)², Sxy = Σ(x − x̄)(y − ȳ) • β̂1 = Sxy / Sxx • β̂0 = ȳ − β̂1 x̄ • SSE = Σ(y − ŷ)² = Syy − (Sxy)² / Sxx • s² = SSE / (n − 2)
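These computations can be applied to the height/weight data from slide 21 (illustrative Python, not from the slides):

```python
# Height (x) and weight (y) of the 8 children from slide 21
x = [49, 50, 53, 55, 60, 55, 60, 50]
y = [81, 88, 87, 99, 91, 89, 95, 90]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)                       # Sxx
syy = sum((yi - my) ** 2 for yi in y)                       # Syy
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))    # Sxy

b1 = sxy / sxx        # slope:     beta1_hat = Sxy / Sxx
b0 = my - b1 * mx     # intercept: beta0_hat = ybar - beta1_hat * xbar
sse = syy - sxy**2 / sxx
s2 = sse / (n - 2)    # MSE, the estimate of the error variance

print(round(b1, 3), round(b0, 2))   # 0.758 49.09
```

So the fitted line for these data is roughly Ŷ = 49.09 + 0.758x.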
  • 43. EXAMPLE • Investigators at a sports health centre are interested in the relationship between oxygen consumption and exercise time in athletes recovering from injury. Appropriate mechanics for exercising and measuring oxygen consumption are set up, and the results are presented below (data table omitted).
  • 46. Inference Concerning the Slope (β1) • Parameter: slope in the population model (β1) • Estimator: least squares estimate β̂1 • Estimated standard error: SE(β̂1) = s / √Sxx • Methods of making inference regarding the population: – Hypothesis tests (2-sided or 1-sided) – Confidence intervals
  • 47. Hypothesis Test for β1 • 2-sided test – H0: β1 = 0 – HA: β1 ≠ 0 – T.S.: tobs = β̂1 / SE(β̂1) – R.R.: |tobs| ≥ tα/2, n−2 – P-value: 2P(t ≥ |tobs|) • 1-sided test – H0: β1 = 0 – HA+: β1 > 0, with R.R. tobs ≥ tα, n−2 and P-value P(t ≥ tobs); or – HA−: β1 < 0, with R.R. tobs ≤ −tα, n−2 and P-value P(t ≤ tobs)
  • 48. (1 − α)100% Confidence Interval for β1: β̂1 ± tα/2, n−2 · s / √Sxx • Conclude a positive association if the entire interval is above 0 • Conclude a negative association if the entire interval is below 0 • Cannot conclude an association if the interval contains 0 • The conclusion based on the interval is the same as the 2-sided hypothesis test
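A sketch of this interval for the height/weight data from slide 21 (illustrative Python; the critical value t(0.025, 6) = 2.447 is taken from a t table):

```python
import math

x = [49, 50, 53, 55, 60, 55, 60, 50]   # heights from slide 21
y = [81, 88, 87, 99, 91, 89, 95, 90]   # weights from slide 21

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)

b1 = sxy / sxx
s = math.sqrt((syy - sxy**2 / sxx) / (n - 2))  # sqrt(MSE)
se_b1 = s / math.sqrt(sxx)                     # SE(beta1_hat) = s / sqrt(Sxx)

t_crit = 2.447                                 # t(0.025, n-2 = 6) from a t table
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(round(lo, 2), round(hi, 2))
print(lo < 0 < hi)   # interval contains 0 -> cannot conclude an association at n = 8
```

With only 8 observations the interval is wide and covers 0, so this small sample alone cannot establish the height–weight association.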
  • 49. Analysis of Variance in Regression • Goal: partition the total variation in y into variation “explained” by x and random variation: (yi − ȳ) = (ŷi − ȳ) + (yi − ŷi), so that Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)² • These three sums of squares and degrees of freedom are: – Total (Syy), dfTotal = n − 1 – Error (SSE), dfError = n − 2 – Model (SSR), dfModel = 1
  • 50. Analysis of Variance in Regression
    Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square         F
    Model                 SSR              1                    MSR = SSR/1         F = MSR/MSE
    Error                 SSE              n − 2                MSE = SSE/(n − 2)
    Total                 Syy              n − 1
    • Analysis of Variance F-test • H0: β1 = 0, HA: β1 ≠ 0 • T.S.: Fobs = MSR/MSE • R.R.: Fobs ≥ Fα, 1, n−2 • P-value: P(F ≥ Fobs)
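The partition can be sketched for the same height/weight data; in simple regression the F statistic equals the square of the slope t statistic (illustrative Python, not from the slides):

```python
import math

x = [49, 50, 53, 55, 60, 55, 60, 50]   # heights from slide 21
y = [81, 88, 87, 99, 91, 89, 95, 90]   # weights from slide 21

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
syy = sum((yi - my) ** 2 for yi in y)   # total sum of squares, df = n - 1

sse = syy - sxy**2 / sxx                # error sum of squares, df = n - 2
ssr = syy - sse                         # model sum of squares,  df = 1

msr = ssr / 1
mse = sse / (n - 2)
f_obs = msr / mse                       # F = MSR / MSE

t_obs = (sxy / sxx) / (math.sqrt(mse) / math.sqrt(sxx))   # slope t statistic
print(round(f_obs, 2))                  # 3.6
print(abs(f_obs - t_obs**2) < 1e-9)     # F = t^2 in simple regression -> True
```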
  • 51. Residual Analysis: Determining the Goodness of Fit • How well does the regression model fit the data? • Is the correlation r significantly different from 0.0? Yes, p < 0.001 • If significant, how much of the variance in Y can be accounted for by X, i.e. the coefficient of determination? r² = 0.682, or 68.2% • How much of the variance in Y cannot be accounted for by X, i.e. the coefficient of non-determination? 1 − r² = 0.318, or 31.8% • Are the prediction errors distributed randomly?
  • 52. Residual Analysis • A residual (an error) is the difference between a prediction (Ŷ) and the actual value of the dependent variable Y: Residual (e) = (Y − Ŷ) • If the data fit the assumptions of the regression model, the residuals will be randomly distributed • How to test whether the residuals are random: – Histogram of the residuals (e) – Normal probability plot of the residuals (e) – Plot the residuals (e) against the predictions (Ŷ)
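Residuals for the height/weight fit can be computed directly; a basic sanity check is that least squares residuals always sum to zero (illustrative Python, not from the slides):

```python
x = [49, 50, 53, 55, 60, 55, 60, 50]   # heights from slide 21
y = [81, 88, 87, 99, 91, 89, 95, 90]   # weights from slide 21

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

y_hat = [b0 + b1 * xi for xi in x]               # predictions Y-hat
resid = [yi - yh for yi, yh in zip(y, y_hat)]    # e = Y - Y-hat

print(abs(sum(resid)) < 1e-9)   # least squares residuals sum to zero -> True
```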
  • 53. Plotting Standardized Residuals and Standardized Predictions • Standardizing the residuals and the predictions and graphing them in a scatterplot is helpful in identifying outliers: cases which may have an undue influence on the estimation of the regression constant (a) and the regression coefficient (b) • To standardize a residual (e) or a prediction (Ŷ) is to convert it to a Z score: Ze = (e − ē) / Se, ZŶ = (Ŷ − mean of Ŷ) / SŶ • In SPSS, standardized residuals and predictions can be saved in the regression analysis. They are called zre_1 and zpr_1.
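A sketch of the standardization step, continuing with the same fit (illustrative Python; SPSS's zre_1 is only mimicked here, and the sample standard deviation is used for Se, which is an assumption about the slide's definition):

```python
import math

x = [49, 50, 53, 55, 60, 55, 60, 50]   # heights from slide 21
y = [81, 88, 87, 99, 91, 89, 95, 90]   # weights from slide 21

n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def z_scores(values):
    """Convert values to Z scores: (v - mean) / sample standard deviation."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]

z_resid = z_scores(resid)                        # analogous to SPSS's zre_1
outliers = [z for z in z_resid if abs(z) > 2]    # flag cases with |Z| > 2
print(len(outliers))                             # 0
```

For these 8 children no standardized residual exceeds |2|, so no case is flagged as a potential outlier.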