Objectives:
Draw a scatter plot for a set of ordered pairs.
Compute the correlation coefficient.
Test the hypothesis H0: ρ = 0.
Compute the equation of the regression line.
Introduction:
In addition to hypothesis testing and confidence intervals, inferential statistics involves determining whether a relationship between two or more numerical or quantitative variables exists.
Correlation is a statistical method used to determine whether a linear relationship between variables exists.
Regression is a statistical method used to describe the nature of the relationship between variables—that is, positive or negative, linear or nonlinear.
Correlation coefficient, is a numerical measure to determine whether two or more variables are related and to determine the strength of the relationship between or among the variables.
In a simple relationship, there are two variables: an independent variable (predictor variable) and a dependent variable (response variable).
The stronger the relationship is between variables, the more accurate the prediction is.
Scatter Plots and Correlation:
A scatter plot is a graph of the ordered pairs (x, y) of numbers consisting of the independent variable x and the dependent variable y.
Correlation:
The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables.
The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is .
The range of the correlation coefficient is from 1 to 1.
If there is a strong positive linear relationship between the variables, the value of r will be close to 1.
If there is a strong negative linear relationship between the variables, the value of r will be close to 1.
Possible Relationships Between Variables:
When the null hypothesis has been rejected for a specific a value, any of the following five possibilities can exist.
There is a direct cause-and-effect relationship between the variables. That is, x causes y.
There is a reverse cause-and-effect relationship between the variables. That is, y causes x.
The relationship between the variables may be caused by a third variable.
There may be a complexity of interrelationships among many variables.
The relationship may be coincidental.
Regression:
If the value of the correlation coefficient is significant, the next step is to determine the equation of the regression line which is the data’s line of best fit.
Best fit means that the sum of the squares of the vertical distance from each point to the line is at a minimum.
Procedure Table:
Step 1: Make a table with subject, x, y, xy, x2, and y2 columns.
Step 2: Find the values of xy, x2, and y2. Place them in the appropriate columns and sum each column.
Step 3: Substitute in the formula to find the value of r.
Step 4: When r is significant, substitute in the formulas to find the values of a and b for the regression line equation y = a + bx.
2. Correlation and
Regression
Shakir Rahman
BScN, MScN, MSc Applied Psychology, PhD Nursing (Candidate)
University of Minnesota USA.
Principal & Assistant Professor
Ayub International College of Nursing & AHS Peshawar
Visiting Faculty
Swabi College of Nursing & Health Sciences Swabi
Nowshera College of Nursing & Health Sciences Nowshera 2
3. Objectives
1. Draw a scatter plot for a set of ordered pairs.
2. Compute the correlation coefficient.
3. Test the hypothesis H0: ρ = 0.
4. Compute the equation of the regression line.
3
4. Introduction
In addition to hypothesis testing and
confidence intervals, inferential statistics
involves determining whether a
relationship between two or more
numerical or quantitative variables exists.
4
5. Introduction
Correlation is a statistical method used to
determine whether a linear relationship
between variables exists.
Regression is a statistical method used to
describe the nature of the relationship between
variables—that is, positive or negative, linear
or nonlinear.
5
6. Introduction
Correlation coefficient, is a numerical measure to
determine whether two or more variables are related
and to determine the strength of the relationship
between or among the variables.
In a simple relationship, there are two variables: an
independent variable (predictor variable) and a
dependent variable (response variable).
The stronger the relationship is between variables, the
more accurate the prediction is.
6
7. Scatter Plots and Correlation
A scatter plot is a graph of the ordered
pairs (x, y) of numbers consisting of the
independent variable x and the
dependent variable y.
7
9. Correlation
The correlation coefficient computed from
the sample data measures the strength and
direction of a linear relationship between two
variables.
The symbol for the sample correlation
coefficient is r. The symbol for the population
correlation coefficient is .
9
10. Correlation
The range of the correlation coefficient is from
1 to 1.
If there is a strong positive linear
relationship between the variables, the value
of r will be close to 1.
If there is a strong negative linear
relationship between the variables, the value
of r will be close to 1.
10
11. Correlation Coefficient
The formula for the correlation coefficient is
where n is the number of data pairs.
Rounding Rule: Round to three decimal places.
11
2 2
2 2
n xy x y
r
n x x n y y
12. Example: Absences/Final Grades
Compute the correlation coefficient for the data
12
Student
Number of
absences, x
Final Grade
y (pct.) xy x2
y2
A
B
C
D
E
F
6
2
15
9
12
5
82
86
43
74
58
90
492
172
645
666
696
450
36
4
225
81
144
25
6,724
7,396
1,849
5,476
3,364
8,100
Σx =
57
Σy =
511
Σxy =
3745
Σx2
=
579
Σy2
=
38,993
G 8 78 624 64 6,084
13. Example: Absences/Final Grades
13
Σx = 57, Σy = 511, Σxy = 3745, Σx2 = 579,
Σy2 = 38,993, n = 7
2 2
2 2
n xy x y
r
n x x n y y
2 2
7 3745 57 511
7 579 57 7 38,993 511
r
0.944 (strong negative relationship)
r
14. Hypothesis Testing
In hypothesis testing, one of the following is
true:
H0: 0 This null hypothesis means that
there is no correlation between the
x and y variables in the population.
Ha: 0 This alternative hypothesis means
that there is a significant correlation
between the variables in the
population.
14
15. Possible Relationships Between
Variables
When the null hypothesis has been rejected for a specific
a value, any of the following five possibilities can exist.
1. There is a direct cause-and-effect relationship
between the variables. That is, x causes y.
2. There is a reverse cause-and-effect relationship
between the variables. That is, y causes x.
3. The relationship between the variables may be
caused by a third variable.
4. There may be a complexity of interrelationships
among many variables.
5. The relationship may be coincidental.
15
17. Regression
If the value of the correlation coefficient is
significant, the next step is to determine
the equation of the regression line
which is the data’s line of best fit.
17
18. Regression
18
Best fit means that the sum of the
squares of the vertical distance from
each point to the line is at a minimum.
19. Regression Line
19
y a bx
2
2
2
2
2
where
= intercept
= the slope of the line.
y x x xy
a
n x x
n xy x y
b
n x x
a y
b
20. Example: Car Rental Companies
Data for car rental companies in the United States for a
recent year shown in the given table.
20
21. Example: Car Rental Companies
21
Company
Cars x
(in 10,000s)
Income y
(in billions) xy x2
y2
A
B
C
D
E
F
63.0
29.0
20.8
19.1
13.4
8.5
7.0
3.9
2.1
2.8
1.4
1.5
441.00
113.10
43.68
53.48
18.76
2.75
3969.00
841.00
432.64
364.81
179.56
72.25
49.00
15.21
4.41
7.84
1.96
2.25
Σx =
153.8
Σy =
18.7
Σxy =
682.77
Σx2
=
5859.26
Σy2
=
80.67
22. Example: Car Rental Companies
22
Σx = 57, Σy = 511, Σxy = 3745, Σx2 = 579,
Σy2 = 38,993, n = 7
2
2
2
y x x xy
a
n x x
2
2
n xy x y
b
n x x
2
18.7 5859.26 153.8 682.77
6 5859.26 153.8
0.396
2
6 682.77 153.8 18.7
6 5859.26 153.8
0.106
0.396 0.106
y a bx y x
23. Find two points to sketch the graph of the regression line.
Use any x values between 10 and 60. For example, let x
equal 15 and 40. Substitute in the equation and find the
corresponding y value.
Plot (15,1.986) and (40,4.636), and sketch the resulting
line.
Example: Car Rental Companies
23
0.396 0.106
0.396 0.106 15
1.986
y x
0.396 0.106
0.396 0.106 40
4.636
y x
24. Example: Car Rental Companies
24
0.396 0.106
y x
15, 1.986
40, 4.636
25. Use the equation of the regression line to predict the
income of a car rental agency that has 200,000
automobiles.
x = 20 corresponds to 200,000 automobiles.
Hence, when a rental agency has 200,000 automobiles, its
revenue will be approximately $2.516 billion.
Example: Car Rental Companies
25
0.396 0.106
0.396 0.106 20
2.516
y x
26. Procedure Table
Step 1: Make a table with subject, x, y, xy, x2,
and y2 columns.
Step 2: Find the values of xy, x2, and y2. Place
them in the appropriate columns and sum
each column.
Step 3: Substitute in the formula to find the value
of r.
Step 4: When r is significant, substitute in the
formulas to find the values of a and b for
the regression line equation y = a + bx.
26
27. Acknowledgements
Dr Tazeen Saeed Ali
RM, RM, BScN, MSc ( Epidemiology & Biostatistics), Ph.D.
(Medical Sciences), Post Doctorate (Health Policy & Planning)
Associate Dean School of Nursing & Midwifery
The Aga Khan University Karachi.
Kiran Ramzan Ali Lalani
BScN, MSc Epidemiology & Biostatistics (Candidate)
Registered Nurse (NICU)
Aga Khan University Hospital
28. REFERENCES
Kuzma, J.W. (2004). Basic Statistics for the
Health Sciences. (4thed.). California:
Mayfield.
Bluman, G. A. (2008). Elementary
Statistics, A step by step approach (7th ed.)
McGraw Hill.