2. Correlation
• Correlation is a statistical measure of the
degree (or strength) of
association between two (or more)
variables. If a change in one variable
is accompanied by a change in the other variable, then
these variables are said to be correlated.
3. Correlation
The measure of correlation is called the
correlation coefficient (r).
The correlation coefficient ranges from
−1 to +1 (−1 ≤ r ≤ +1).
The direction of change is indicated by its
sign.
4. Correlation & Causation
Causation means a cause-and-effect
relation.
Causation implies correlation,
but correlation does not necessarily
imply causation.
5. Correlation – basic assumptions
• r does not change when we change the
units of measurement, for example from
kg to pounds for weight. Why? Because
r uses standardized values of the observations.
• r does not measure or describe curved or
non-linear association, no matter how
strong.
• Like the mean and SD, r is not resistant
to outliers:
r is strongly affected by outlying
observations.
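The unit-invariance property can be checked numerically. The sketch below is only an illustration: the weights and heights are made-up values, and `pearson_r` is a helper written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data (not from the slides): weight in kg and height in cm.
weights_kg = [52.0, 61.5, 70.2, 80.1, 95.3]
heights_cm = [158.0, 165.0, 172.0, 180.0, 178.0]

# Same weights expressed in pounds (1 kg is about 2.20462 lb).
weights_lb = [w * 2.20462 for w in weights_kg]

r_kg = pearson_r(weights_kg, heights_cm)
r_lb = pearson_r(weights_lb, heights_cm)
print(round(r_kg, 6), round(r_lb, 6))  # the two values agree
```

Rescaling a variable rescales its deviations and its SD by the same factor, so the factor cancels in r.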
7. Types of Correlation
Positive Correlation: The correlation is
said to be positive if the values of the two
variables change in the same direction.
Ex. Number of hours spent studying and grade
in exam.
Negative Correlation: The correlation is
said to be negative when the values of the
variables change in opposite directions.
Ex. Number of hours spent watching TV and grades in
exam.
8. Direction of the Correlation
• Positive relationship – Variables change in the
same direction.
As X is increasing, Y is increasing
As X is decreasing, Y is decreasing
▫ E.g., As study time increases, grades increase
• Negative relationship – Variables change in
opposite directions.
As X is increasing, Y is decreasing
As X is decreasing, Y is increasing
▫ E.g., As TV time increases, grades decrease
9. More examples
• Positive relationships:
▫ Number of vehicles and air
pollution.
▫ Smoking and cancer.
• Negative relationships:
▫ alcohol consumption and
driving ability.
▫ Cholesterol level and heart
health
16. Types of Correlation
• Simple correlation: Under simple
correlation only two
variables are studied.
• Multiple correlation: Under multiple
correlation three or more variables are
studied simultaneously.
• Partial correlation: The analysis recognizes
more than two variables but considers the relationship between only
two of them, keeping the others constant.
17. Methods of studying correlation
• Pearson product moment correlation
• Rank correlation
• Kendall’s tau
• Biserial correlation
• Point Biserial correlation
• Phi coefficient
• Tetrachoric correlation
18. Correlation Coefficient
Pearson’s Product Moment Correlation
Symbolized by r
Covariance ÷ (product of the 2 SDs)
Correlation is a standardized covariance:
$r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y}$
19. Calculation for Example
• CovXY = 11.12
• sX = 2.33
• sY = 6.69
$r = \frac{\mathrm{Cov}_{XY}}{s_X s_Y} = \frac{11.12}{(2.33)(6.69)} = \frac{11.12}{15.59} = .713$
20. Other formulae
• Z-score method
$r = \frac{\sum z_x z_y}{N - 1}$
• Computational (Raw Score) Method
$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{\left[N \sum X^2 - (\sum X)^2\right]\left[N \sum Y^2 - (\sum Y)^2\right]}}$
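The z-score method and the raw-score method are two routes to the same r. A minimal sketch, assuming sample SDs (divisor N − 1) for the z-scores and using small made-up data; the function names are illustrative only:

```python
from math import sqrt

def r_zscore(xs, ys):
    """r as the sum of z-score products divided by N - 1 (sample SDs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

def r_raw(xs, ys):
    """r from raw sums: N, sum X, sum Y, sum XY, sum X^2, sum Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical scores, chosen only to keep the arithmetic small.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_z = r_zscore(x, y)
r_r = r_raw(x, y)
print(round(r_z, 4), round(r_r, 4))  # both methods give the same value
```

The raw-score form avoids computing means and SDs first, which is why it is the usual choice for hand calculation.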
21. Interpretation of Correlation
Coefficient (r)
• The value of correlation coefficient ‘r’ ranges
from -1 to +1
• If r = +1, then the correlation between the
two variables is said to be perfect and
positive
• If r = -1, then the correlation between the
two variables is said to be perfect and
negative
• If r = 0, then there exists no correlation
between the variables
22. Relation between regression and
correlation
• The coefficient of correlation is the geometric mean
of the two regression coefficients, and it takes the sign they share.
r = ±√(bxy × byx)
23. Limitation of Pearson’s
Coefficient
• Always assumes a linear
relationship
• Interpreting the value of r is
difficult.
• Value of Correlation Coefficient
is affected by the extreme
values.
25. Coefficient of Determination
• It is the square of correlation coefficient (r²)
• It indicates how much of the variability of one factor is
explained by its relationship to another factor.
• The maximum value of r² is 1 because it is possible
to explain all of the variation in y but it is not
possible to explain more than all of it.
• Coefficient of Determination = Explained variation /
Total variation
26. Coefficient of Determination: An
example
r = 0.60
r = 0.30
It does not mean that the first correlation is twice
as strong as the second.
This can be understood by computing the value of
r².
When r = 0.60, r² = 0.36
When r = 0.30, r² = 0.09
This implies that in the first case 36% of the total
variation is explained (shared), whereas in the second
case 9% of the total variation is explained (shared).
32. Spearman’s Rank
Coefficient of Correlation (Rho)
• When the variables under study are arranged in rank (serial)
order, Spearman's rank correlation can be used.
• $\rho = 1 - \frac{6 \sum D^2}{N(N^2 - 1)}$
• ρ (Rho) = Rank correlation coefficient
• D = Difference in rank between paired items in the two series.
• N = Total number of observations.
33. Rank Correlation Coefficient
(Rho)
a) Problems where actual ranks are given.
1) Calculate the difference ‘D’ between the two ranks, i.e.
(R1 – R2).
2) Square each difference and calculate the sum of
the squared differences, i.e. ∑D².
3) Substitute the values obtained in the formula.
34. Example
• To calculate a Spearman rank-order correlation
on data without any ties
• English: 56 75 45 71 62 64 58 80 76 61
• Maths: 66 70 40 60 65 56 59 77 67 63
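The three steps (rank, square the rank differences, substitute) can be sketched for the English and Maths marks above. The helper `ranks_desc` is written for this illustration and assumes there are no ties, with rank 1 going to the highest mark.

```python
def ranks_desc(values):
    """Rank 1 = highest value. Assumes all values are distinct (no ties)."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

english = [56, 75, 45, 71, 62, 64, 58, 80, 76, 61]
maths = [66, 70, 40, 60, 65, 56, 59, 77, 67, 63]

rank_english = ranks_desc(english)
rank_maths = ranks_desc(maths)

# D is the difference between paired ranks; sum the squares.
sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(rank_english, rank_maths))

n = len(english)
rho = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(sum_d2, round(rho, 3))  # 54 0.673
```

Ranking from lowest to highest instead would give the same ρ, since both series would be reversed together.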
37. Rank Correlation Coefficient
(Rho)
• Equal Ranks or tie in Ranks: In such cases
average ranks should be assigned to each individual
• Example (to be worked out)
38. Interpretation of Rank
Correlation Coefficient
• The value of rank correlation coefficient, R ranges
from -1 to +1
• If R = +1, then there is complete agreement in the
order of the ranks and the ranks are in the same
direction
• If R = -1, then there is complete disagreement in the
order of the ranks: the ranks are in exactly
opposite directions
• If R = 0, then there is no correlation
39. Merits of Spearman’s Rank
Correlation
• This method is simpler to understand and
easier to apply compared to Karl Pearson’s
correlation method.
• This method is useful for ordinal data
• But it becomes laborious when the data set is large
40. Kendall's Tau
• Kendall's τ (tau) is a non-parametric
measure of correlation between two
ranked variables. It is similar to
Spearman's Rho and Pearson's Product
Moment Correlation Coefficient
41. Calculation of τ
• τ = (C − D) / (C + D)
• C = number of concordant pairs
• D = number of discordant pairs
• With the cases ordered by their rank on the first variable, a pair is
concordant when the later case also has the higher rank on the
second variable.
• A pair is discordant when the later case has a lower rank on the second
variable (here an equal rank is also counted as discordant).
42. Example
Rank variable 1   Rank variable 2
1                 1
2                 3
3                 6
4                 2
5                 7
6                 4
7                 5
Counting Concordant and Discordant Values
R2
1
2   C
3   C C
4   C D D
5   C C C C
6   C C C D D
7   C C C C D D
    1 2 3 4 5 6 7
C = 15, D = 6
τ = (15 − 6)/(15 + 6) = 9/21 = 0.429
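The pair counting can be checked by brute force over every pair of cases. This is a minimal sketch for untied ranks, not a general tau-b routine; the variable names are illustrative.

```python
from itertools import combinations

rank_var1 = [1, 2, 3, 4, 5, 6, 7]
rank_var2 = [1, 3, 6, 2, 7, 4, 5]

concordant = 0
discordant = 0
for (a1, b1), (a2, b2) in combinations(zip(rank_var1, rank_var2), 2):
    # Same ordering on both variables -> concordant; opposite -> discordant.
    direction = (a1 - a2) * (b1 - b2)
    if direction > 0:
        concordant += 1
    elif direction < 0:
        discordant += 1

tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, round(tau, 3))  # 15 6 0.429
```

With 7 cases there are 7 × 6 / 2 = 21 pairs, so C + D = 21 when there are no ties.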
44. Calculating the Kendall tau-a
Coefficient
Ranked change in display scores (C = concordant, D = discordant)
2   C
3   C C
4   C D D
5   C C D D
6   C C C C C
7   C C C C D D
8   C C C C C C D
    1 2 3 4 5 6 7 8
45. Solution
• Taking the first person, who is ranked 1 for change in
testosterone, how many people are ranked above that
person for display? These are concordant – and the answer
is 7 people, so C = 7. The number of discordant people,
who are ranked above, is zero, so D = 0.
• Take the second person. 6 people are ranked above that
person, and they are concordant, so C = 6, and 1 person
(the person ranked 4th in display) is equal, so they are
discordant, D = 1.
• We keep doing this for each person, but we can make our
lives easier by putting this into a table, which is shown in
Table 2. For each pair of people, we say whether the scores
are concordant, in which case we give them a C, or
discordant, in which case we give them a D.
46. Easier method
• V1: 1 2 3 4 5 6 7 8
• V2: 1 4 5 2 7 3 8 6
• Number of inversions = 7
• $\tau = 1 - \frac{2r}{n(n-1)/2}$
• r = number of inversions
• n = number of cases
• $\tau = 1 - \frac{2 \times 7}{8 \times 7 / 2} = 1 - \frac{14}{28}$
• = 1 − .5 = .5
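The inversion shortcut can be sketched as follows, assuming the first variable is already in rank order 1..n and there are no ties. An inversion is any pair in the second series where a larger rank comes before a smaller one; `count_inversions` is a helper written for this illustration.

```python
def count_inversions(sequence):
    """Count pairs (i, j) with i < j and sequence[i] > sequence[j]."""
    n = len(sequence)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if sequence[i] > sequence[j]
    )

# Ranks on the second variable, listed in order of the first variable.
v2 = [1, 4, 5, 2, 7, 3, 8, 6]

n = len(v2)
inversions = count_inversions(v2)
total_pairs = n * (n - 1) / 2
tau = 1 - (2 * inversions) / total_pairs
print(inversions, tau)  # 7 0.5
```

Each inversion is one discordant pair, so this is the same quantity as (C − D)/(C + D) with C + D = n(n − 1)/2.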
47. An easier method
      A B C D E F G H I J
• V1  1 2 3 4 5 6 7 8 9 10
• V2  2 1 5 3 4 6 10 8 7 9
• $\tau = 1 - \frac{2r}{n(n-1)/2}$
• r = number of inversions = 7
• n = number of cases = 10
• $\tau = 1 - \frac{2 \times 7}{10 \times 9 / 2} = 1 - \frac{14}{45}$
• = 1 − 0.311 = 0.689
49. Example 2
• A B C D E F G H I J
• V1 1 2 3 4 5 6 7 8 9 10
• V2 5 1 2 4 3 10 6 7 9 8
• Do it yourself
50. Where can you use it?
• to understand whether there is an association
between exam grade and time spent revising (i.e.,
where there were six possible exam grades – A, B, C,
D, E and F – and revision time was split into five
categories: less than 5 hours, 5-9 hours, 10-14 hours,
15-19 hours, and 20 hours or more).
• to understand whether there is an association
between customer satisfaction and delivery time (i.e.,
where delivery time had four categories – next day, 2
working days, 3-5 working days, and more than 5
working days – and customer satisfaction was
measured as highly satisfied, very satisfied,
satisfied, dissatisfied or highly dissatisfied).
51. Comparison of tau and rank
correlations
• In most situations, the
interpretations of Kendall’s
tau and Spearman’s rank
correlation coefficient are
very similar and thus
usually lead to the same
inferences.
Kendall’s tau
• Usually gives smaller values than Spearman’s rho.
• Calculations are based on concordant and discordant pairs.
• Insensitive to error.
• P values are more accurate with smaller sample sizes.
• The distribution of Kendall’s tau has better statistical properties.
Spearman’s rank correlation
• Usually has larger values than Kendall’s tau.
• Calculations are based on deviations.
• Much more sensitive to error and discrepancies in the data.
• It is the more widely used rank correlation coefficient.
52. Other Kinds of Correlation
• Point biserial correlation coefficient (rpb)
▫ used with one continuous scale and one
dichotomous (two-category nominal or ordinal) scale.
▫ uses the same Pearson formula
Attractiveness   Date?
3                0
4                0
1                1
2                1
5                1
6                0
rpb = -0.49
53. Point biserial
• Point biserial is used when one variable is
continuous and the other is dichotomous
(like gender)
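• $r_{pb} = \frac{M_1 - M_2}{s_{n-1}} \sqrt{\frac{n_1 n_2}{n(n-1)}}$
• where M1 and M2 are the two group means, n1 and n2 the group sizes, n the total number of cases and s(n−1) the sample SD of the continuous variable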
54. Computation of point biserial
• $r_{pb} = \frac{M_p - M_q}{\mathrm{Std}} \sqrt{pq}$
• where rpb is the point biserial correlation
• Mp is the mean score of students answering correctly
• Mq is the mean score of students answering incorrectly
• Std is the standard deviation of the whole sample
• p is the proportion of students answering correctly
• q is 1 − p
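As a check, the mean-difference formula can be applied to the attractiveness and date data from the earlier point biserial slide. This sketch assumes "Std" means the population SD (divisor n) of the whole sample, and treats "went on a date" (1) as the p group.

```python
from math import sqrt

attractiveness = [3, 4, 1, 2, 5, 6]
date = [0, 0, 1, 1, 1, 0]

n = len(date)
group_p = [x for x, d in zip(attractiveness, date) if d == 1]
group_q = [x for x, d in zip(attractiveness, date) if d == 0]

mean_p = sum(group_p) / len(group_p)
mean_q = sum(group_q) / len(group_q)

# Population SD of the whole sample (assumption: divisor n, not n - 1).
mean_all = sum(attractiveness) / n
sd_all = sqrt(sum((x - mean_all) ** 2 for x in attractiveness) / n)

p = len(group_p) / n
q = 1 - p

r_pb = (mean_p - mean_q) / sd_all * sqrt(p * q)
print(round(r_pb, 2))  # -0.49
```

The result matches what the ordinary Pearson formula gives for the same two columns, which is the sense in which point biserial "uses the same Pearson formula".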
56. Correlation and ‘t’
We can convert r to t and test for significance:
$t = r \sqrt{\frac{N - 2}{1 - r^2}}$
where df = N − 2
57. Tables of Significance
Suppose r = 0.71 and N = 21
Start with H0: r = 0
df = N − 2 = 21 − 2 = 19
$t = r \sqrt{\frac{N - 2}{1 - r^2}} = .71 \sqrt{\frac{19}{1 - .71^2}} = .71 \sqrt{\frac{19}{.4959}} = 4.39$
‘t’-crit (19) = 2.09
Since 4.39 is larger than 2.09, reject the
null hypothesis r = 0.
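The conversion from r to t is easy to script. A minimal sketch for the r = 0.71, N = 21 case; the critical value is simply the one quoted on the slide, not looked up by the code, and `t_from_r` is a name chosen for this example.

```python
from math import sqrt

def t_from_r(r, n):
    """t statistic for testing H0: r = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

r = 0.71
n = 21
t = t_from_r(r, n)

t_crit = 2.09  # two-tailed critical value for df = 19, as quoted on the slide
reject_null = abs(t) > t_crit
print(round(t, 2), reject_null)
```

Because the statistic grows with √(N − 2), the same r can be significant in a large sample and not in a small one.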
58. Other Kinds of Correlation
• Phi coefficient (Φ)
▫ used with two dichotomous scales.
▫ uses the same Pearson formula
Attractiveness   Date?
0                0
1                0
1                1
1                1
0                0
1                1
Φ = 0.71
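Phi can also be computed from the 2 × 2 table of counts. The sketch below uses the cell-count form Φ = (n11·n00 − n10·n01) / √(product of the four marginal totals), which is algebraically the Pearson formula applied to 0/1 data; the cell names are labels chosen for this example.

```python
from math import sqrt

attractiveness = [0, 1, 1, 1, 0, 1]
date = [0, 0, 1, 1, 0, 1]

pairs = list(zip(attractiveness, date))
n11 = pairs.count((1, 1))  # attractive, got a date
n10 = pairs.count((1, 0))  # attractive, no date
n01 = pairs.count((0, 1))  # not attractive, got a date
n00 = pairs.count((0, 0))  # not attractive, no date

numerator = n11 * n00 - n10 * n01
denominator = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
phi = numerator / denominator
print(round(phi, 2))  # 0.71
```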
62. Tetrachoric correlation
• If you have dichotomous data on two variables
but are willing to assume that the underlying
variables are normally distributed, you may use
the tetrachoric correlation to estimate the size of the
correlation between the underlying variables.
• $r_{tet} = \cos\left(\frac{180^\circ}{1 + \sqrt{ad/bc}}\right)$
63. Tetrachoric correlation
• When you have continuous data but want
to split the data into a dichotomous form
(a median split, for example), you use the tetrachoric
correlation. Here you are artificially making
continuous data dichotomous.
64. Example: Attitude towards women
Score on openness   Negative   Positive   Total
Above median        68 (a)     32 (b)     100
Below median        30 (c)     70 (d)     100
Total               98         102        200
$r_{tet} = \cos\left(\frac{180^\circ}{1 + \sqrt{ad/bc}}\right) = \cos\left(\frac{180^\circ}{1 + \sqrt{(68 \times 70)/(32 \times 30)}}\right) = \cos 55.784^\circ = .562$
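The cosine approximation can be evaluated directly from the four cell counts. This is a sketch of the approximation formula only, not an exact tetrachoric estimate, and `tetrachoric_approx` is a name chosen for this example.

```python
from math import cos, radians, sqrt

def tetrachoric_approx(a, b, c, d):
    """Cosine approximation: cos(180 degrees / (1 + sqrt(ad / bc)))."""
    angle_degrees = 180.0 / (1.0 + sqrt((a * d) / (b * c)))
    return cos(radians(angle_degrees))

# Cell counts from the openness by attitude table.
a, b, c, d = 68, 32, 30, 70

angle = 180.0 / (1.0 + sqrt((a * d) / (b * c)))
r_tet = tetrachoric_approx(a, b, c, d)
print(round(angle, 3), round(r_tet, 3))
```

Note that `math.cos` expects radians, so the angle in degrees has to be converted before taking the cosine.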
65. On tetrachoric
• This formula works best only when
▫ N is large
▫ When the splits are as near the median as
possible
It is better to use the phi coefficient than the
tetrachoric correlation
66. Advantages of Correlation
studies
• Show the amount (strength) of relationship
present
• Can be used to make predictions about the
variables under study.
• Can be used in many places, including
natural settings, laboratories , etc.
• Easier to collect correlational data
67. Factors Affecting r
Range restrictions (truncation)
Looking at only a small portion of the total
scatter plot (looking at a smaller portion of the
scores’ variability) decreases r.
Reducing variability reduces r
This especially affects validation against an
external criterion in selection scenarios
Nonlinearity
The Pearson r (and its relatives) measure the
degree of linear relationship between two
variables
If a strong non-linear relationship exists, r will
provide a low, or at least inaccurate measure
of the true relationship.
71. Factors affecting correlation
• Reliability of measurement
▫ If you are not reliably distinguishing individuals on some
measure, you will not adequately capture the covariance that
measure may have with another
• Heterogeneous subsamples
▫ Sub-samples may artificially increase or decrease overall
r.
▫ Solution - calculate r separately for sub-samples &
overall, look for differences
▫ Can be caused by lack of reliability
• Outliers can artificially increase or decrease r
72. Testing Correlations
How to find out if a correlation is big/ large?
In terms of magnitude, how big is big?
Small correlations in large samples are “big.”
Large correlations in small samples aren’t always
“big.”
Depends upon the magnitude of the
correlation coefficient
AND
The size of your sample.
73. Correlation and effect size
Effect size   Pearson r (correlation coefficient)
Small         0.10
Medium        0.30
Large         0.50
75. What is a partial correlation
• Partialling is holding constant a third
variable via residuals
• It estimates what would happen if
everyone had the same score in the third
variable
76. Partial correlation
• Two variables A and B are correlated. But
you feel that this relationship is influenced
by a third variable C. You want to remove
this influence and want to know the true
correlation between A and B.
• In this case you partial out the influence
of C.
77. Example
• You know that exam grades are correlated
with intelligence. You also know that exam
grades are influenced by exam anxiety.
You also know that intelligence scores are
moderated by anxiety. You want to know
the correlation between exam grade and
intelligence when controlled for anxiety.
78. Example
• Suppose you have the following data.
• Correlation between exam (A) grade and
intelligence (B) = .918
• Correlation between exam grade(A) and
anxiety (C) = -.369
• Correlation between anxiety (C) and
intelligence(B) = -.245
79. Solution
• Correlation between exam grade and
intelligence controlled for anxiety is
• $r_{AB.C} = \frac{r_{AB} - r_{AC} r_{BC}}{\sqrt{(1 - r_{AC}^2)(1 - r_{BC}^2)}}$
• $= \frac{.918 - (-.369)(-.245)}{\sqrt{(1 - (-.369)^2)(1 - (-.245)^2)}}$
• $= \frac{.918 - .090}{\sqrt{.864 \times .940}}$
• $= \frac{.828}{.901}$
• = .919
• The correlation between exam score and intelligence with anxiety held constant is
.919. We can see that the correlation changed only marginally
after partialling out the effect of anxiety.
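The first-order partial correlation formula is short enough to wrap in a function. A minimal sketch using the three correlations given for exam grade (A), intelligence (B) and anxiety (C); the function name is illustrative, and the unrounded result can differ from the hand calculation in the third decimal because the slide rounds intermediate values.

```python
from math import sqrt

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation of A and B with C partialled out of both."""
    numerator = r_ab - r_ac * r_bc
    denominator = sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))
    return numerator / denominator

r_ab = 0.918   # exam grade and intelligence
r_ac = -0.369  # exam grade and anxiety
r_bc = -0.245  # intelligence and anxiety

r_ab_c = partial_corr(r_ab, r_ac, r_bc)
print(round(r_ab_c, 4))
```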
81. Solution
• Find the correlation between ATW (attitude towards women) and
Openness
• Find the correlation between ATW and
Edu
• Find the correlation between Openness and
Edu
• Partial out the effect of Edu from the
correlation between ATW and Openness
82. Solution
• r between ATW (A) and Openness (B) = 0.662
• r between ATW (A) and Edu (C) = 0.276
• r between Openness (B) and Edu (C) = 0.250
• Partialling out the effect of Edu:
• $r_{AB.C} = \frac{r_{AB} - r_{AC} r_{BC}}{\sqrt{(1 - r_{AC}^2)(1 - r_{BC}^2)}}$
• $= \frac{.662 - (.276)(.250)}{\sqrt{(1 - .276^2)(1 - .250^2)}} = .637$
83. Partial and semi-partial correlation
• In partial correlation the effect of the third
variable is partialled out from both the
variables rAB.C
• In semi-partial correlation the effect of the
third variable is partialled out from only
one of the two variables: rA(B.C)
84. Order of partialling
• If you partial 1 variable out of a correlation, the resulting
partial is called a first order partial correlation.
• If you partial 2 variables out of a correlation, the resulting
partial is called a second order partial correlation. Can
have 3rd, 4th, etc., order partials.
• Unpartialed (raw) correlations are called zero order
correlations because nothing is partialed out.
• Can use regression to find residuals and compute partial
correlations from the residuals, e.g. for r12.34, regress 1
and 2 on both 3 and 4, then compute correlation between
2 sets of residuals.
85. Solution
• In the above example the relationship
between exam grade and intelligence can be
semipartialled by removing the effect of
anxiety only on intelligence
• $r_{A(B.C)} = \frac{r_{AB} - r_{AC} r_{BC}}{\sqrt{1 - r_{BC}^2}}$
• $= \frac{.918 - (-.369)(-.245)}{\sqrt{1 - (-.245)^2}}$
• $= \frac{.918 - .090}{.970} = \frac{.828}{.970}$
• = 0.854
• The correlation between exam grade and
intelligence after removing the influence of
anxiety on intelligence is 0.854. The effect of
anxiety on exam grade is not removed.
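The semi-partial formula differs from the partial one only in its denominator, which a short sketch makes visible. The function name and variable labels are illustrative; C (anxiety) is removed from B (intelligence) only.

```python
from math import sqrt

def semipartial_corr(r_ab, r_ac, r_bc):
    """Correlation of A with the part of B that is independent of C."""
    return (r_ab - r_ac * r_bc) / sqrt(1 - r_bc ** 2)

r_ab = 0.918   # exam grade and intelligence
r_ac = -0.369  # exam grade and anxiety
r_bc = -0.245  # intelligence and anxiety

sr = semipartial_corr(r_ab, r_ac, r_bc)
print(round(sr, 3))
```

Because only one factor remains under the square root, the semi-partial value can never exceed the corresponding partial correlation in absolute size.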
86. Strength of the Association
The coefficient of determination, r², measures the strength
of the association and is the ratio of explained variation in
y to the total variation in y.
The correlation coefficient of number of times absent and
final grade is r = –0.975. The coefficient of determination is
r² = (–0.975)² = 0.9506.
Interpretation: About 95% of the variation in final grades can be
explained by the number of times a student is absent. The other
5% is unexplained and can be due to sampling error or other
variables such as intelligence, amount of time studied, etc.
88. Regression Analysis
• Regression Analysis is a very
powerful tool in the field of
statistical analysis in predicting
the value of one variable, given
the value of another variable,
when those variables are related
to each other.
89. Regression Analysis
• Regression takes us a step beyond
correlation in that not only are we
concerned with the strength of the
association, but we want to be able to
describe its nature with sufficient precision
to be able to make predictions
• To be able to make predictions, we need to
be able to characterize one of the variables
in the relationship as independent and the
other as dependent
90. Regression Analysis
• For example, in the relationship (male
literacy vs % of people living in the cities),
the causal order seems pretty obvious.
Literacy rates are not likely to produce
urbanization, but urbanization is probably
causally prior to increases in literacy rates
91. Regression and Prediction
• If you say that there is a correlation between no
of vehicles and air pollution it does not convey
causal relationship though you know that
vehicles can increase pollution and pollution
cannot increase vehicles
• In regression analysis you predict for each unit
increase in vehicle population how much
increase in pollution will result.
92. In short
• Regression analysis is a mathematical
measure of the average relationship between
two or more variables.
• Regression analysis is a statistical tool used
to predict the value of an unknown variable
from a known variable.
93. Advantages of Regression
Analysis
• Regression analysis provides estimates of
values of the dependent variables from the
values of independent variables.
• Regression analysis also helps to obtain a
measure of the error involved in using the
regression line as a basis for estimations .
• Regression analysis helps in obtaining a
measure of the degree of association or
correlation that exists between the two
variables.
95. What is regression?
• Fitting a line to the data using an equation in
order to describe and predict data
• Simple Regression
▫ Uses just 2 variables (X and Y)
▫ Other: Multiple Regression (one Y and many X’s)
• Linear Regression
▫ Fits data to a straight line
▫ Other: Curvilinear Regression (curved line)
96. Assumptions in Regression Analysis
• Existence of an actual linear relationship.
• The regression analysis is used to estimate the
values within the range for which it is valid.
• In regression, we have only one dependent
variable in our estimating equation. However,
we can use more than one independent
variable.
97. Assumptions in Regression Analysis
• The dependent variable takes any random
value but the values of the independent
variables are fixed.
98. What is regression
• Regression indicates the degree to which
the variation in one variable, Y, is related
to or can be explained by the variation in
another variable, X
• Once you know there is a significant linear
correlation, you can write an equation
describing the relationship between the x
and y variables. This equation is called the
line of regression or least squares line.
99. Regression Equation
• Regression line of Y on X: gives the best
estimate of the value of Y for any specific given
value of X
• Y = aX + b
• a = Slope of the line
• b = Y-intercept
• Y = Dependent variable
• X = Independent variable
100. Regression Equation:
We can predict a Y score from an X by
plugging a value for X into the equation and
calculating Y
What would we expect a person to get on quiz
#4 if they got a 12.5 on quiz #3?
Y = .823X − 4.239
Y = .823(12.5) − 4.239 = 6.049
101. Interpreting Regression : Basics
• Intercept
▫ Value of Y if X(s) is 0
▫ Often not meaningful, particularly if it’s practically impossible to
have an X of 0 (e.g. weight)
• Slope, the regression coefficient
▫ Amount of change in Y seen with 1 unit change in X
▫ Standardized regression coefficient: amount of change in Y, in standard deviation units, seen with a 1 standard
deviation change in X
▫ In simple regression it is equivalent to the r for the two variables
• Standard error of estimate
▫ Gives a measure of the accuracy of prediction
• R²
▫ Proportion of variance in the outcome explained by the model
▫ Effect size
104. Explanation
• In the above example, y = 5x + 2
• 5 is the slope and 2 is the intercept
• This means that
• The predicted Y = 5 × (value of x) + 2
• Suppose you want to predict the
value of Y corresponding to x = 5,
• Then Y(pre) = (5 × 5) + 2 = 27
• If x = 12 then Y (pre)= (5 x 12) + 2 = 62
105. Explanation
• We should also know that what we
calculate is the estimated value of Y for a
given value of x
• This need not be accurate; there is some
error in prediction (because we assume the
regression line to be a straight line, but
the data points actually cluster around the
line, not exactly on the line)
106. Explanation
• So this predicted (estimated) value of Y is
called Ŷ (Y-hat)
• Y − Ŷ is the error.
• The regression line is fitted in such a way
that the sum of the squared errors is minimum
107. The Explanation of Regression Line
• In case of perfect correlation (positive or
negative) the two lines of regression
coincide.
• If the two regression lines are far from
each other, then the degree of correlation is
low, and vice versa.
• The mean values of X and Y can be obtained
as the point of intersection of the two
regression lines.
• The higher the degree of correlation between
the variables, the smaller the angle between the lines,
and vice versa.
108. Regression Equation / Line
& Method of Least Squares
• Regression Equation of y on x
Y = a x+ b
We have to obtain the values of a, b
• Regression Equation of x on y
X = cy + d
We have to obtain the values of c and d
109. How to calculate
• Regression equation: Ŷ = ax + b
• where ‘a’ is the slope and ‘b’ is the y-intercept
• Slope: $a = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
• Y-intercept: $b = \bar{Y} - a\bar{X}$
110. Regression Equation / Line when
Deviation taken from Arithmetic Mean
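• Regression Equation of y on x:
$Y - \bar{Y} = b_{yx}(X - \bar{X})$
$b_{yx} = \sum xy / \sum x^2$ (x and y taken as deviations from their means)
$b_{yx} = r(\sigma_y / \sigma_x)$
• Regression Equation of x on y:
$X - \bar{X} = b_{xy}(Y - \bar{Y})$
$b_{xy} = \sum xy / \sum y^2$
$b_{xy} = r(\sigma_x / \sigma_y)$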
111. Properties of the Regression Coefficients
• The coefficient of correlation is the geometric mean of the
two regression coefficients: r = ±√(byx × bxy)
• If byx is positive then bxy is also positive, and vice
versa.
• If one regression coefficient is greater than one, the
other must be less than one.
• The coefficient of correlation has the same sign as
the regression coefficients.
• The arithmetic mean of byx and bxy is equal to or greater than the
coefficient of correlation: (byx + bxy)/2 ≥ r
• Regression coefficients are independent of origin but not
of scale.
113. Calculate a and b.
Write the equation of the
line of regression with
x = number of absences
and y = final grade.
      x    y    xy     x²    y²
1     8    78   624    64    6084
2     2    92   184    4     8464
3     5    90   450    25    8100
4     12   58   696    144   3364
5     15   43   645    225   1849
6     9    74   666    81    5476
7     6    81   486    36    6561
Sum   57   516  3751   579   39898
The line of regression is: ŷ = –3.924x + 105.667
114. Solution
• $a = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
• $= \frac{7 \times 3751 - 57 \times 516}{7 \times 579 - 57^2}$
• = -3.924
• $b = \bar{Y} - a\bar{X}$
• = 516/7 – (-3.924 × 57/7) = 73.714 + 31.953 =
105.667
• The line of regression is
• ŷ = -3.924x + 105.667
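The slope and intercept can be recomputed from the absences and final grade data. A minimal sketch using the raw-sum formulas; variable names are illustrative, and because the code keeps full precision the intercept can differ from the hand calculation in the third decimal.

```python
absences = [8, 2, 5, 12, 15, 9, 6]
grades = [78, 92, 90, 58, 43, 74, 81]

n = len(absences)
sum_x = sum(absences)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(absences, grades))
sum_x2 = sum(x * x for x in absences)

# Slope a and intercept b of the least-squares line y-hat = a*x + b.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * (sum_x / n)

def predict(x):
    """Predicted grade for x absences (meaningful only within the data range)."""
    return slope * x + intercept

print(round(slope, 3), round(intercept, 2))
print(round(predict(3), 1), round(predict(12), 1))
```

The fitted line always passes through the point of means, here (57/7, 516/7).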
115. The Line of Regression
[Scatter plot: Absences on the x-axis (0 to 16), Final Grade on the y-axis (40 to 95), with the fitted regression line]
m = –3.924 and b = 105.667
The line of regression is: ŷ = –3.924x + 105.667
Note that the point (x̄, ȳ) = (8.143, 73.714) is on the line.
116. Predicting y Values
The regression line can be used to predict values of y for
values of x falling within the range of the data.
The regression equation for number of times absent
and final grade is: ŷ = –3.924x + 105.667
Use this equation to predict the expected grade for a student with
(a) 3 absences (b) 12 absences
(a) ŷ = –3.924(3) + 105.667 = 93.895
(b) ŷ = –3.924(12) + 105.667 = 58.579
117. Estimating the error
• Here the estimated grade for 12 absences is
58.58
• But from the original data you can see
that for 12 absences the grade obtained is 58.
So the error Y − Ŷ = 58 − 58.58 = −.58
Similarly, for 6 absences we can calculate
118. Estimating the error
• ŷ = –3.924(6) + 105.667 = 82.12
• But obtained value is 81
• So Estimated value and obtained value have
some difference
119. Standard Error of Estimate.
• Standard Error of Estimate is the measure of variation
around the computed regression line.
• Standard error of estimate (SE) of Y measure the
variability of the observed values of Y around the
regression line.
• The standard error of estimate gives us a measure
of the scatter of the observations
about the line of regression.
120. Estimating the error
• We can estimate this error by the following
formula
• $SE_{est}$ of Y $= \sigma_y \sqrt{1 - r^2}$
• σy = the SD of the y distribution
• r² = square of the correlation between x and y
121. Estimating the error
• In the above example the correlation between number
of absences (x) and grades (y) is −.975
• SD of y is 17.61
• Then $SE = \sigma_y \sqrt{1 - r^2}$
• = 17.61 × √(1 − .9506) = 17.61 × √.0494 = 17.61 × .222
• = 3.91
• This means a predicted y typically differs from the observed value by about ±3.91
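The standard error formula multiplies the SD of y by √(1 − r²), which the following sketch evaluates for the absences example. The function name is illustrative, and the inputs are the SD and r quoted on the slides.

```python
from math import sqrt

def standard_error_of_estimate(sd_y, r):
    """SE of estimate: SD of y scaled by sqrt(1 - r^2)."""
    return sd_y * sqrt(1 - r ** 2)

sd_y = 17.61   # SD of final grades, as given
r = -0.975     # correlation between absences and grades

se = standard_error_of_estimate(sd_y, r)
print(round(se, 2))
```

Only r² enters the formula, so the sign of r does not affect the result; a stronger correlation in either direction shrinks the standard error.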
122. Problem in Regression
• The following are the scores obtained on two variables, X and
Y, by 10 individuals. Find the regression of Y on X
and of X on Y.
Individual   X    x²     Y   y²   xy
1            40   1600   2   4    80
2            43   1849   5   25   215
3            45   2025   4   16   180
4            46   2116   7   49   322
5            60   3600   9   81   540
6            63   3969   5   25   315
7            69   4761   2   4    138
8            54   2916   8   64   432
9            70   4900   6   36   420
10           62   3844   9   81   558
125. Solution
• $b = \bar{Y} - a\bar{X}$
• = 5.7 − 0.048 × 55.2
• = 3.05
• Regression equation is
• ŷ = .048x + 3.05
• For x = 60: ŷ = .048 × 60 + 3.05
• = 5.93
• Obtained score is 9
• So error = Ŷ − Y = 5.93 − 9 = −3.07
126. Solution
• For x = 70, what is the estimated Y?
• Estimated Y = .048 × 70 + 3.05
• = 6.41
• Obtained value for 70 is 6
• So error = Ŷ − Y = 6.41 − 6 = .41
127. Calculation of regression of X on Y
• $c = \frac{n \sum xy - \sum x \sum y}{n \sum y^2 - (\sum y)^2}$
• Numerator is the same as for a: 536
• Denominator: 10 × 385 − 57² = 601
• c = 536/601 = 0.892
• $d = \bar{X} - c\bar{Y}$ = 55.2 − 0.892 × 5.7
• = 50.116
128. Calculation of regression of X on Y
• Regression equation is
x̂ = cy + d = .892y + 50.116
• Verify:
• For y = 5, x̂ = .892 × 5 + 50.116 = 54.576
• Obtained value = 43
• For y = 9, x̂ is 58.144
• Obtained value is 62
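Both regression lines for this problem can be obtained from one helper by swapping the roles of X and Y. A minimal sketch; `fit_line` is a name chosen for this example, and since the code does not round the slope, the intercept of Y on X comes out slightly different from the hand calculation.

```python
x = [40, 43, 45, 46, 60, 63, 69, 54, 70, 62]
y = [2, 5, 4, 7, 9, 5, 2, 8, 6, 9]

def fit_line(predictor, response):
    """Least-squares slope and intercept for response regressed on predictor."""
    n = len(predictor)
    sum_p = sum(predictor)
    sum_r = sum(response)
    sum_pr = sum(p * r for p, r in zip(predictor, response))
    sum_p2 = sum(p * p for p in predictor)
    slope = (n * sum_pr - sum_p * sum_r) / (n * sum_p2 - sum_p ** 2)
    intercept = sum_r / n - slope * (sum_p / n)
    return slope, intercept

a, b = fit_line(x, y)  # Y on X: y-hat = a*x + b
c, d = fit_line(y, x)  # X on Y: x-hat = c*y + d

print(round(a, 3), round(b, 2))
print(round(c, 3), round(d, 2))
```

The two slopes share the same numerator and differ only in the denominator, which is why the slides reuse 536 when computing c.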
129. Std error
• SE = σy √(1 − r²)
• SD of y = 2.58414
• r = √(.048 × .892) = .2076
• = 2.58414 × √(1 − .2076²) = 2.58414 × .9782
• = 2.528
• This means that for any given value of x the
estimated value of y typically lies within about ±2.53 of the observed
value
130. Multiple Regression
Y = a + b1X1 + b2X2
Notation
a is the Y intercept, where the regression line crosses the Y
axis
b1 is the partial slope for X1 on Y
b1 indicates the change in Y for one unit change in X1,
controlling for X2
b2 is the partial slope for X2 on Y
b2 indicates the change in Y for one unit change in X2,
controlling for X1
131. Partial Slopes
• The partial slopes = the effect of each independent
variable on Y while controlling for the effect of the
other independent variable(s).
• Show the effects of the X’s in their original units.
• These values can be used to predict scores on Y.
• Partial slopes must be computed before computing a
(the Y intercept).
134. Standardized Partial Slopes
(beta-weights)
• Partial slopes (b1 and b2) are in the original units of
the independent variables.
• To compare the relative effects of the independent
variables, compute beta-weights (b*).
• Beta-weights show the amount of change in the
standardized scores of Y for a one-unit change in
the standardized scores of each independent
variable while controlling for the effects of all other
independent variables.
135. Beta-weights
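• Formula to calculate
the beta-weight for X1: $b_1^* = b_1 (s_1 / s_y)$
• Formula to calculate
the beta-weight for X2: $b_2^* = b_2 (s_2 / s_y)$
• where s1, s2 and sy are the standard deviations of X1, X2 and Y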
136. Multiple Correlation (R²)
• The coefficient of multiple determination (R²), the square of the
multiple correlation coefficient R, shows
the combined effects of all independent variables
on the dependent variable.
137. Limitations
Multiple regression and correlation are among the
most powerful techniques available to researchers.
•These techniques require:
▫ Every variable is measured at the interval-ratio level
▫ Each independent variable has a linear relationship
with the dependent variable
▫ Independent variables do not interact with each
other
▫ Independent variables are uncorrelated with each
other
138. Limitations
When these requirements are violated (as they often
are), these techniques will produce biased and/or
inefficient estimates.
There are more advanced techniques available to
researchers that can correct for violations of these
requirements. Such techniques are beyond the scope
of this course
140. Statement of problem
• A common problem is that there is a large set of
candidate predictor variables.
• (Note: The examples herein are really not that large.)
• The goal is to choose a small subset from the larger
set so that the resulting regression model is
simple, yet has good predictive ability.
141. Example: Selection data
• You are trying to select the best candidates from
a pool of applicants for a job, using a number of
variables (and their tests)
• Cognitive ability
• Adjustment
• Integrity
• Leadership
• Stress tolerance
142. Your problem
• You want to select those variables which
together will predict the criterion (job success)
• You want to select as few variables as possible
• Together their predictive efficiency must be
maximum
143. Two basic methods
of selecting predictors
• Stepwise regression: Enter and remove
predictors, in a stepwise manner, until
there is no justifiable reason to enter or
remove more.
• Best subsets regression: Select the subset of
predictors that do the best at meeting some well-
defined objective criterion.
144. What is stepwise
• First include the test (variable) with maximum
predictive ability (predictive validity)
• Add a new test (the second best)
• See if it adds to the multiple correlation (R)
• If yes, add a third one
• Go on adding tests till R does not increase.
• When R no longer increases, you have reached
your maximum predictive efficiency
Editor's Notes
The value of r that is computed represents the correlation coefficient of the sample. Have students interpret this result. Since r is close to -1, there is a strong negative correlation. As the number of absences increases, grades tend to decrease. Since there are 7 ordered pairs, n = 7.
The proof that the coefficient of determination is equal to the square of the correlation coefficient is beyond the scope of the text.
The value of d can be positive, negative or 0. Discuss the circumstances for each. The sum of the values of d will be 0 for the regression line. Squaring d eliminates negative values. Criteria for the Best Fit Line: The sum of the squares of the distances will be minimized.
The sums are repeated here, but they have already been calculated when determining the value of r. A TI-83 can also be used to compute the equation.
To graph the line of regression, find two points that satisfy the equation. Use any x values within the range of the data. Remember that (x-bar, y-bar) can be used as a point.
For someone absent no times, a predicted grade is 105.667 (about 106). Each time a person is absent, it is expected that their grade will decrease by close to 4 points. (-3.924)
Prediction values are meaningful only for x-values in (or close to) the range of x values in the data. If x = 100 the prediction found by using the equation would be meaningless. A person who has been absent 3 times is predicted to have a final grade of about 94. A person who has been absent 12 times is predicted to have a grade of about 59.