2. Correlation and Regression
Correlation and regression are two statistical techniques used
to examine the nature and strength of the
relationship between two variables.
3. Correlation
Correlation analysis is concerned with
measuring the strength of the relationship
between variables.
When we compute measures of correlation
from a set of data, we are interested in the
degree of the correlation between variables.
4. Relationship Between Variables
Examples of pairs of variables:
Blood pressure and age
Height and weight
The concentration of an injected drug and heart rate
The consumption level of some nutrient and weight gain
5. Correlation coefficient
The correlation coefficient of variables X and Y
shows how strongly the values of these
variables are related to one another.
It is denoted by r, and r ∈ [-1, 1].
If the correlation coefficient is positive, the two
variables tend to increase together (or
decrease together).
If the correlation coefficient is negative, one
variable tends to decrease as the other
increases, and vice versa.
6. Coefficient of Correlation Values
[Figure: scale of r running from -1.0 through -0.5, 0 and +0.5 to +1.0. r = -1.0 is perfect negative correlation, r = 0 is no correlation, and r = +1.0 is perfect positive correlation. Moving from 0 toward -1.0 shows an increasing degree of negative correlation; moving from 0 toward +1.0 shows an increasing degree of positive correlation.]
7. Although there is no fixed rule for interpreting
the strength of a correlation, we will say that
the correlation is
Strong if 0.8 ≤ |r| ≤ 1
Moderate if 0.5 ≤ |r| < 0.8
Weak if 0 < |r| < 0.5
We will also add the words positive or negative to
indicate the type of correlation.
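The labelling rule above can be sketched as a small helper. This is only an illustration: the function name `describe_correlation` is hypothetical, and assigning the boundary values 0.5 and 0.8 to the higher category is an assumption, since the slide gives no tie-breaking rule.

```python
def describe_correlation(r: float) -> str:
    """Label r using the cut-offs 0.5 and 0.8 applied to |r|."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    if r == 0:
        return "no correlation"
    size = abs(r)
    if size >= 0.8:          # assumption: 0.8 itself counts as strong
        strength = "strong"
    elif size >= 0.5:        # assumption: 0.5 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.99))
print(describe_correlation(-0.14))
```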
11. Simple Correlation Coefficient (r)
It is also called Pearson's correlation. It
measures the nature and strength of the relationship between two
variables of the quantitative type.
The simple correlation coefficient is obtained
using the following formula:
$$ r = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sqrt{\left(\sum x^2 - \dfrac{(\sum x)^2}{n}\right)\left(\sum y^2 - \dfrac{(\sum y)^2}{n}\right)}} $$
where n is the sample size, x is the independent
variable and y is the dependent variable.
12. Coefficient of Correlation
An alternative formula for computing the coefficient of
correlation, r:
$$ r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} $$
14. Coefficient of Correlation
$$ r = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} $$
$$ r = \frac{9(18.2) - (49)(3.04)}{\sqrt{9(284) - (49)^2}\,\sqrt{9(1.1882) - (3.04)^2}} = 0.9891 $$
r = 0.989
r ≈ 0.99
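The substitution above can be checked with a short script. This is a minimal sketch in plain Python that applies the alternative formula to the pH and optical-density values tabulated in the Example slide of this deck; the function name `pearson_r` is a hypothetical helper, not something from the slides.

```python
from math import sqrt

# Data from the pH / optical density example in this deck.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def pearson_r(x, y):
    """r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(a * b for a, b in zip(x, y))
    numerator = n * sxy - sx * sy
    denominator = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return numerator / denominator


r = pearson_r(ph, od)
print(round(r, 4))  # 0.9891
```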
15. Warning
The correlation coefficient (r) measures
the strength of the relationship between
two variables.
Just because two variables are related
does not imply that there is a cause-and-effect
relationship between them.
16. Spearman’s Correlation Coefficient
It is a non-parametric measure of
correlation used in the case of ordinal
(ranked) qualitative variables.
This procedure makes use of the two sets
of ranks that may be assigned to the
sample values of x and y.
17. Spearman’s Correlation Coefficient
The Spearman rank correlation coefficient
can be computed in the following cases:
1. Both variables are quantitative.
2. Both variables are qualitative ordinal.
3. One variable is quantitative and the other
is qualitative ordinal.
18. Spearman’s Correlation Coefficient
Procedure:
1. Rank the values of X from 1 to n, where n is the number of pairs of values
of X and Y in the sample.
2. Rank the values of Y from 1 to n.
3. Compute the value of $d_i$ for each pair of observations by subtracting the
rank of $y_i$ from the rank of $x_i$.
4. Square each $d_i$ and compute $\sum d_i^2$, which is the sum of the squared values.
5. Apply the following formula:
$$ r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$
The value of $r_s$ denotes the magnitude and nature of the association and is
interpreted in the same way as the simple r.
19. Spearman’s Correlation Coefficient
Example: In a study of the relationship between education level and health
awareness, the following data were obtained. Find the relationship between
them and comment.
No.  Education level (x)  Health awareness (y)
1    preparatory          25
2    primary              10
3    university           8
4    secondary            10
5    secondary            15
6    illiterate           50
7    university           60
20. Spearman’s Correlation Coefficient
Solution:
No.  Education (x)  Awareness (y)  Rank (x)  Rank (y)  d      d²
1    Preparatory    25             5         3         2      4
2    Primary        10             6         5.5       0.5    0.25
3    University     8              1.5       7         -5.5   30.25
4    Secondary      10             3.5       5.5       -2     4
5    Secondary      15             3.5       4         -0.5   0.25
6    Illiterate     50             7         2         5      25
7    University     60             1.5       1         0.5    0.25
Total                                                         64
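Applying the formula with n = 7 and Σd² = 64: $r_s = 1 - \frac{6(64)}{7(7^2 - 1)} = 1 - \frac{384}{336} \approx -0.14$.
Comment: There is a weak inverse (negative) correlation between education level
and health awareness.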
22. Non-Parametric Correlations
Spearman's Correlation Coefficient:
First, it ranks the data and then applies
Pearson's correlation to these ranks.
$$ r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)} $$
where $r_s$ is Spearman's correlation coefficient, d is
the difference between the two ranks of a case and n is the
number of cases.
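One caveat is worth illustrating: the shortcut formula based on the squared rank differences agrees exactly with Pearson's correlation of the ranks only when there are no tied ranks. The sketch below uses the ranks from the education and health-awareness example, which does contain ties, and computes both versions; the variable names are illustrative.

```python
from math import sqrt

# Ranks copied from the education / health awareness solution table.
rank_x = [5, 6, 1.5, 3.5, 3.5, 7, 1.5]
rank_y = [3, 5.5, 7, 5.5, 4, 2, 1]
n = len(rank_x)

# Shortcut formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
rs_shortcut = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Pearson's r applied directly to the two columns of ranks.
mean_x = sum(rank_x) / n
mean_y = sum(rank_y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
sxx = sum((a - mean_x) ** 2 for a in rank_x)
syy = sum((b - mean_y) ** 2 for b in rank_y)
rs_pearson = sxy / sqrt(sxx * syy)

print(round(rs_shortcut, 4))  # -0.1429
print(round(rs_pearson, 4))   # -0.1743
```

Both values lead to the same reading here (a weak negative correlation), but with many ties the gap can be larger, so Pearson's r on the ranks is the safer computation.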
23. Regression analysis
Regression analysis is helpful in ascertaining the
probable form of the relationship between variables.
The ultimate objective when this method of analysis
is employed is usually to predict or estimate the value
of one variable corresponding to a given value of
another variable.
25. Simple linear regression
In simple linear regression we are interested in
two variables, x and y.
The variable x is usually referred to as the
independent variable, since frequently it is
controlled by the investigator; that is, values of
x may be selected by the investigator and,
corresponding to each preselected value of x,
one or more values of y are obtained.
The other variable, y, accordingly, is called the
dependent variable, and we speak of the
regression of y on x.
26. The regression equation
In simple linear regression the object of the
researcher's interest is the regression equation
that describes the true relationship between
the dependent variable y and the independent
variable x.
27. Scatter diagram
A first step that is usually useful in studying the
relationship between two variables is to
prepare a scatter diagram of the data.
The points are plotted by assigning values of
the independent variable x to the horizontal
axis and values of the dependent variable y to
the vertical axis.
The pattern made by the points plotted on the
scatter diagram usually suggests the basic
nature and the strength of the relationship
between the two variables.
28. Example
Relationship between pH and optical density
pH    Optical density
3     0.1
4     0.2
4.5   0.25
5     0.32
5.5   0.33
6     0.35
6.5   0.47
7     0.49
7.5   0.53
30. Notes
The points in the figure
seem to be scattered
around an invisible straight
line.
The scatter diagram also
shows that, in general, a high
pH is accompanied by a high optical
density reading.
[Figure: scatter diagram of optical density (vertical axis, 0 to 0.6) against pH (horizontal axis, 2 to 8).]
31. These impressions
suggest that the
relationship between
the two variables
may be described by a
straight line crossing the y-axis
near the origin and
making approximately a 45-degree
angle with the x-axis.
It looks as if it would be
simple to draw, freehand,
through the data points the
line that describes the
relationship between x and y.
[Figure: scatter diagram of optical density (vertical axis, 0 to 0.6) against pH (horizontal axis, 2 to 8).]
32. Thinking Challenge
It is highly unlikely, however, that the lines
drawn by any two people would be the same.
In other words, for every person drawing such
a line by eye, or freehand, we would expect a
slightly different line.
37. Thinking Challenge
[Figure: scatter diagram of optical density (vertical axis, 0 to 0.6) against pH (horizontal axis, 2 to 8).]
Which line best describes the relationship between the variables?
What is needed to obtain the desired line?
38. Answer
We need to employ a method known as the
method of least squares to obtain the
desired line, and the resulting line is called the
least-squares line.
The reason for calling the method by this name
will be explained in the discussion that follows.
39. Equation for a straight line
Now, recall from algebra that the general
equation for a straight line is given by
y = a + bx
where a is the y-intercept and b is the slope.
40. Linear Equations
[Figure: the line Y = a + bx drawn on X and Y axes, with the Y-intercept a marked.]
a is the point where the line crosses the vertical axis,
and is referred to as the y-intercept.
41. Linear Equations
[Figure: the line Y = a + bx drawn on X and Y axes, with the Y-intercept a marked and the slope b shown as the change in Y divided by the change in X.]
b shows the amount by which y changes for each unit
change in x, and is referred to as the slope of the line.
42. Linear Equations
y = a + bx
To draw a line based on the equation, we need the
numerical values of the constants a and b.
Given these constants, we may substitute various
values of x into the equation to obtain corresponding
values of y.
The resulting points may be plotted.
43. Computation Table
Xi     Yi     Xi²     Yi²     XiYi
X1     Y1     X1²     Y1²     X1Y1
X2     Y2     X2²     Y2²     X2Y2
:      :      :       :       :
Xn     Yn     Xn²     Yn²     XnYn
ΣXi    ΣYi    ΣXi²    ΣYi²    ΣXiYi
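As an illustration, the computation table can be filled in for the pH and optical-density example with a few lines of Python. This is a sketch only; the variable names `rows` and `totals` are arbitrary.

```python
# Data from the pH / optical density example in this deck.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]

# One row per observation: x, y, x^2, y^2, xy
rows = [(x, y, x * x, y * y, x * y) for x, y in zip(ph, od)]

# Column totals: sum x, sum y, sum x^2, sum y^2, sum xy
totals = [sum(column) for column in zip(*rows)]

print(f"{'x':>8}{'y':>8}{'x^2':>10}{'y^2':>10}{'xy':>10}")
for x, y, xx, yy, xy in rows:
    print(f"{x:>8.2f}{y:>8.2f}{xx:>10.4f}{yy:>10.4f}{xy:>10.4f}")
print(f"{totals[0]:>8.2f}{totals[1]:>8.2f}"
      f"{totals[2]:>10.4f}{totals[3]:>10.4f}{totals[4]:>10.4f}")
```

The totals row gives Σx = 49, Σy = 3.04, Σx² = 284, Σy² = 1.1882 and Σxy = 18.2, the values used in the worked calculations of r and b.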
45. Finding the b-value
$$ b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} $$
$$ b = \frac{9(18.2) - (49)(3.04)}{9(284) - (49)^2} = \frac{14.84}{155} = 0.0957 $$
46. Finding the y-intercept
$$ a = \bar{y} - b\bar{x} $$
where $\bar{y}$ is the mean of the y values
and $\bar{x}$ is the mean of the x values.
Alternatively,
$$ a = \frac{\sum y - b\sum x}{n} $$
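47. Finding the y-intercept
$$ \bar{y} = \frac{3.04}{9} = 0.3378 \qquad \bar{x} = \frac{49}{9} = 5.444 $$
Carrying one extra digit to limit rounding error (b = 14.84/155 = 0.09574):
$$ a = \bar{y} - b\bar{x} = 0.33778 - (0.09574)(5.4444) = -0.1835 $$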
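The slope and intercept formulas can be applied in unrounded arithmetic as follows. This is a minimal sketch; `least_squares` is a hypothetical helper name. On the example data it gives b ≈ 0.0957 and a ≈ -0.1835, the coefficients that also appear in the trend-line equation on the coefficient-of-determination chart.

```python
# Data from the pH / optical density example in this deck.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # slope
    a = sy / n - b * sx / n                         # intercept: y_bar - b * x_bar
    return a, b


a, b = least_squares(ph, od)
print(round(b, 4), round(a, 4))  # 0.0957 -0.1835
```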
48. The equation for the least-squares
line is:
$$ \hat{y} = a + bx $$
$$ \hat{y} = -0.1835 + 0.0957x $$
Note that we use the symbol $\hat{y}$ because this value is computed
from the equation and is not an observed value of y.
49. Now, we can substitute various values of x into the
equation to obtain corresponding values of y.
The resulting points may be plotted.
$$ \hat{y} = 0.0957x - 0.1835 $$
50. Using the Regression Equation
Predicting y for a given x:
Choose a value for x (within the range of the observed x
values).
Substitute the selected x into the regression
equation.
Determine the corresponding value of y.
51. The regression equation:
$$ \hat{y} = 0.0957x - 0.1835 $$
Substitute x = 6.8:
$$ \hat{y} = 0.0957 \times 6.8 - 0.1835 \approx 0.47 $$
According to the equation, a pH of 6.8 would
have an optical density of about 0.47.
52. Interpolation
Using the regression equation to
predict y values for x values that
fall between the points in the
scatter diagram.
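A prediction helper that only interpolates might look like the sketch below. The names `fit_line` and `predict` are hypothetical, and raising an error for x values outside the observed range is a design choice made for this illustration; the slides only advise choosing x within the range of the data.

```python
# Data from the pH / optical density example in this deck.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return sy / n - b * sx / n, b


def predict(x_new, x, y):
    """Predict y at x_new, refusing to extrapolate beyond the observed x."""
    low, high = min(x), max(x)
    if not low <= x_new <= high:
        raise ValueError(f"x = {x_new} is outside the observed range [{low}, {high}]")
    a, b = fit_line(x, y)
    return a + b * x_new


print(round(predict(6.8, ph, od), 2))  # 0.47
```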
54. The least-squares line
Since any two such coordinates determine a
straight line, we may
select any two values in the range of x,
compute the two corresponding y values,
locate them on a graph,
and connect them with a straight line to obtain
the line corresponding to the equation.
55. The least-squares line
[Figure: data points scattered about a fitted line on X and Y axes, with the vertical deviation of each observed point $y_i$ from the line marked.]
The line that we have drawn is best in this sense:
The sum of the squared vertical deviations of the
observed data points ($y_i$) from the least-squares line is
smaller than the sum of the squared vertical deviations of
the observed data points from any other line.
56. The least-squares line
In other words, if we square the vertical
distance from each observed point ($y_i$) to the
least-squares line and add these squared
values for all points, the resulting total will be
smaller than the similarly computed total for
any other line that can be drawn through the
points.
For this reason the line we have drawn is
called the least-squares line.
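The property can be checked numerically on the pH and optical-density example. The sketch below compares the sum of squared vertical deviations for the least-squares line with the same total for a few nearby lines; the perturbation sizes are arbitrary choices for illustration, and a handful of trial lines demonstrates the property rather than proving it.

```python
# Data from the pH / optical density example in this deck.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]


def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(p * q for p, q in zip(x, y))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    return sy / n - b * sx / n, b


def sum_squared_deviations(a, b, x, y):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


a, b = fit_line(ph, od)
best = sum_squared_deviations(a, b, ph, od)

# Shift the intercept and/or slope slightly and recompute the total.
shifts = [(0.02, 0.0), (-0.02, 0.0), (0.0, 0.005), (0.0, -0.005), (0.01, -0.002)]
others = [sum_squared_deviations(a + da, b + db, ph, od) for da, db in shifts]

print(best)
print(all(total > best for total in others))  # True
```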
58. The coefficient of determination r²
One way to evaluate the
strength of the regression
equation is to compare the
scatter of the points about
the regression line with the
scatter about $\bar{y}$, the mean
of the values of y.
[Figure: scatter diagram of optical density (vertical axis, 0 to 0.6) against pH (horizontal axis, 2 to 8), with the regression line y = 0.0957x - 0.1835 and the horizontal line $\bar{y}$ = 0.3378.]
59. The coefficient of determination r²
If we draw through the points a
line that intersects the y-axis
at $\bar{y}$ and is parallel to
the x-axis, we may obtain
a visual impression of the
relative magnitudes of the
scatter of the points about
this line and about the regression
line.
[Figure: scatter diagram of optical density (vertical axis, 0 to 0.6) against pH (horizontal axis, 2 to 8), with the regression line y = 0.0957x - 0.1835 and the horizontal line $\bar{y}$ = 0.3378.]
60. Interpretation of r²
Thus, the coefficient of determination
measures the closeness of fit of the
regression equation to the observed values of y.
61. Interpretation of r²
• If r² = 0.978
• Approximately 98 percent of the variation in
optical density (y) is explained by the linear
relationship with x, the pH.
• The remaining 2 percent or so is not explained by the regression and is attributed to other causes.
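The value of r² can be reproduced by comparing the scatter about the regression line with the scatter about the mean of y. This is a sketch using the standard identity r² = 1 - SSE/SST for simple linear regression; the variable names are illustrative.

```python
# Data from the pH / optical density example in this deck.
ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]

n = len(ph)
sx, sy = sum(ph), sum(od)
sxx = sum(v * v for v in ph)
sxy = sum(p * q for p, q in zip(ph, od))

# Least-squares slope and intercept.
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = sy / n - b * sx / n

y_bar = sy / n
# Scatter about the regression line (unexplained variation).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(ph, od))
# Scatter about the mean of y (total variation).
sst = sum((yi - y_bar) ** 2 for yi in od)

r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.978
```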
63. Limitations of Correlation and
Regression
linearity:
– cannot describe non-linear relationships,
e.g., the relation between anxiety and performance
truncation of range:
– the strength of the relationship is underestimated if the
full range of x values is not observed.
no proof of causation:
– third-variable problem:
a third variable could be causing the change in both
variables
directionality: can’t be sure which way causality