# Measures of relationship

7,114 views

Measures of relationship (Correlation and Regression)

Published in: Health & Medicine

### Measures of relationship

1. 1. Measures of Relationship By: Mr. Johny Kutty Joseph, Asst. Professor
2. 2. Measures of Relationship • The mean, median, mode, range and standard deviation are univariate statistics: each describes only one variable at a time. • Two variables are described in terms of their relationship. • The most common bivariate descriptive statistics include cross-tab tables, correlation and regression. • A cross-tab table is the same as a contingency table.
3. 3. Concept of Probability • A probability is a number that reflects the chance or likelihood that a particular event will occur. • Probabilities can be expressed as proportions that range from 0 to 1, and they can also be expressed as percentages ranging from 0% to 100%. • A probability of 0 indicates that there is no chance that a particular event will occur, whereas a probability of 1 indicates that an event is certain to occur. • A probability of 0.45 (45%) indicates that there are 45 chances out of 100 of the event occurring.
4. 4. Concept of Probability • The concept of probability can be illustrated in the context of a study of obesity in children 5-10 years of age who are seeking medical care at a particular pediatric practice. • The population (sampling frame) includes all children who were seen in the practice in the past 12 months and is summarized in the table.
5. 5. Concept of Probability • Unconditional probability: when one child is selected at random, every child has the same probability of being selected, 1/N, where N is the population size. Thus, the probability that any particular child is selected is 1/5,290 ≈ 0.0002.

    | Age (years) | 5 | 6 | 7 | 8 | 9 | 10 | Total |
    |---|---|---|---|---|---|---|---|
    | Boys | 432 | 379 | 501 | 410 | 420 | 418 | 2,560 |
    | Girls | 408 | 513 | 412 | 436 | 461 | 500 | 2,730 |
    | Total | 840 | 892 | 913 | 846 | 881 | 918 | 5,290 |
6. 6. Concept of Probability • Conditional probability: the probability of an event within a purposefully selected subset of the population, such as the probability that a child is 9 years old given that the child is a girl. From the table on the previous slide, 461 of the 2,730 girls are 9 years old, so P(9 years old | girl) = 461/2,730 = 0.169 (16.9%).
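The two probabilities above can be reproduced directly from the table counts. The following is a minimal sketch; the variable names are illustrative and not part of the slides.

```python
# Counts of children by age (5 to 10 years), copied from the slide's table.
ages = [5, 6, 7, 8, 9, 10]
boys = dict(zip(ages, [432, 379, 501, 410, 420, 418]))
girls = dict(zip(ages, [408, 513, 412, 436, 461, 500]))

total_children = sum(boys.values()) + sum(girls.values())

# Unconditional: every child is equally likely to be selected.
p_any_child = 1 / total_children

# Conditional: restrict the population to girls, then count the 9-year-olds.
p_nine_given_girl = girls[9] / sum(girls.values())

print(total_children)               # 5290
print(round(p_any_child, 4))        # 0.0002
print(round(p_nine_given_girl, 3))  # 0.169
```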
7. 7. Normal Probability Curve (Z Score) Properties • It is also called the normal distribution. • It is based on the area/distribution of data. • It is a bell-shaped curve. • At its centre point the mean, median and mode are equal: Mean = Median = Mode. (X=M=Z)
8. 8. Normal Probability Curve (Z Score) Properties • The common value of the mean, median and mode at the centre of the curve is denoted by “µ” (mu). • The curve extends to infinity on both the left and the right side. • The total area under the normal curve is taken as 1. • 1 indicates the maximum probability. • Probability is the measure of the likelihood that an event will occur in a random experiment. • Probability is quantified as a number between 0 and 1, where, loosely speaking, 0 indicates impossibility and 1 indicates certainty.
9. 9. Normal Probability Curve (Z Score) Properties • It is also called the Gaussian curve. • The shape of the curve depends on the mean and SD. • If the SD is high, the width increases and the height decreases, and vice versa. • When the mean is 0 and the SD is 1, the curve is said to be the standard normal curve. • Probabilities for a normally distributed variable are calculated from the normal probability model.
10. 10. Normal Probability Curve (Z Score) Properties • Distributions that are normal or Gaussian have the following characteristics: • Approximately 68% (68.27%) of the values fall within one standard deviation of the mean (in either direction) • Approximately 95% (95.45%) of the values fall within two standard deviations of the mean (in either direction) • Approximately 99.7% (99.73%) of the values fall within three standard deviations of the mean (in either direction)
11. 11. Normal Probability Curve (Z Score) Properties • If we have a normally distributed variable and know the population mean (μ) and the standard deviation (σ), then we can compute the probability of particular values based on this equation for the normal probability model.
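For reference, the normal probability model is usually written as the density below. This is assumed to be the equation the slide refers to, since the slide's own formula is not shown in the text.

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
$$

Probabilities are areas under this curve, for example $P(a < X < b) = \int_a^b f(x)\,dx$.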
12. 12. Normal Probability Curve (Z Score) Example • Consider body mass index (BMI) in a population of 60 year old males in whom BMI is normally distributed and has a mean value = 29 and a standard deviation = 6. The standard deviation gives us a measure of how spread out the observations are.
13. 13. Normal Probability Curve (Z Score) Example • The mean (μ = 29) is in the center of the distribution, and the horizontal axis is scaled in increments of the standard deviation (σ = 6) and the distribution essentially ranges from μ - 3 σ to μ + 3σ. • It is possible to have BMI values below 11 or above 47, but extreme values occur very infrequently.
14. 14. Normal Probability Curve (Z Score) Example • To compute probabilities from normal distributions, we compute areas under the curve. • The total area under the curve is 1. • Here the mean is equal to the median, so half (50%) of the area under the curve is above the mean and half is below, so Pr(BMI < 29) = 0.50. • Consequently, if we select a man at random from this population and ask for the probability that his BMI is less than 29, the answer is 0.50 or 50%, since 50% of the area under the curve is below the value BMI = 29.
15. 15. Normal Probability Curve (Z Score) Example • What is the probability that a 60 year old male has BMI less than 35? • The probability is displayed graphically and represented by the area under the curve to the left of the value 35 in the figure below.
16. 16. Normal Probability Curve (Z Score) Example • Note that BMI = 35 is 1 standard deviation above the mean. • For the normal distribution we know that approximately 68% of the area under the curve lies between the mean plus or minus one standard deviation.
17. 17. Normal Probability Curve (Z Score) Example • Therefore, 68% of the area under the curve lies between 23 and 35. • We also know that the normal distribution is symmetric about the mean, therefore P(29 < X < 35) = P(23 < X < 29) = 0.34. • Consequently, P(X < 35) = 0.5 + 0.34 = 0.84 or 84%.
18. 18. Normal Probability Curve (Z Score) Example • This can also be calculated using the formula $Z = \frac{X - \mu}{\sigma}$, where μ is the mean and σ is the standard deviation of the variable X. • To compute P(X < 30), we convert X = 30 to its corresponding Z score: $Z = \frac{30 - 29}{6} = \frac{1}{6} \approx 0.17$. • The Z table gives an area of 0.0675 between Z = 0 and Z = 0.17, so P(X < 30) = 0.5 + 0.0675 = 0.5675 = 56.75%.
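The Z-table lookup can be checked numerically. The sketch below assumes the BMI example's values (mean 29, SD 6) and uses the error function from Python's standard library in place of a printed table; `normal_cdf` is an illustrative helper name.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X < x) for a normal variable with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma  # the Z score
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 29, 6  # BMI example from the slides

p_below_35 = normal_cdf(35, mu, sigma)  # Z = 1
p_below_30 = normal_cdf(30, mu, sigma)  # Z = 1/6

print(round(p_below_35, 2))  # 0.84
print(round(p_below_30, 2))  # 0.57
```

Because the code uses Z = 1/6 exactly while the table lookup rounds Z to 0.17, the third decimal place of P(X < 30) differs slightly from the table-based value.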
19. 19. Normal Probability Curve (Z Score) Example • The mean height of 500 students is 165 cm and the SD is 6 cm. Assuming that heights are normally distributed, find how many students have a height between 155 cm and 175 cm, using $Z = \frac{X - \mu}{\sigma}$. • $Z = \frac{155 - 165}{6} = -\frac{10}{6} = -1.67$ • $Z = \frac{175 - 165}{6} = \frac{10}{6} = 1.67$ • The required area under the standard normal curve lies between Z = -1.67 and Z = 1.67. • = (area between Z = -1.67 and 0) + (area between Z = 0 and 1.67) • = (0.9525 – 0.5) + (0.9525 – 0.5) = 0.4525 + 0.4525 = 0.9050 = 90.5% • 0.9050 × 500 = 452.5, so about 452 students have a height between 155 cm and 175 cm.
20. 20. Importance of Normal Probability Curve • Data obtained from biological measurements approximately follow the normal distribution. • The binomial and Poisson distributions can be approximated by the normal distribution. • The binomial distribution applies to a fixed number of trials, each of which can have only two results (e.g. tossing a coin). • The Poisson distribution applies to counts of events over a very large number of opportunities (e.g. printing mistakes in a book). • For large samples, it can be used to study descriptive statistics such as the mean and SD. • It is used to find confidence limits of population parameters. • It is the basis of tests of significance.
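The height example can be sketched with `statistics.NormalDist` from Python's standard library (available from Python 3.8), which stands in here for the Z-table lookup. The variable names are illustrative.

```python
from statistics import NormalDist

# Heights assumed normal with mean 165 cm and SD 6 cm, as in the example.
heights = NormalDist(mu=165, sigma=6)

# Area under the curve between 155 cm and 175 cm.
share = heights.cdf(175) - heights.cdf(155)

# Expected number of students out of 500 in that range.
expected_students = share * 500

print(round(share, 2))           # 0.9
print(round(expected_students))  # 452
```

The unrounded Z scores give a share very slightly below the table-based 0.9050, but the expected count still rounds to 452.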
21. 21. Correlation • The mean, median, mode, range and standard deviation are univariate statistics: each describes only one variable at a time. • Two variables are described in terms of their relationship. • The most common bivariate descriptive statistics include cross-tab tables, correlation and regression. • A cross-tab table is the same as a contingency table.
22. 22. Correlation Coefficient • The relationship between two quantitative variables is called correlation. • The extent/degree/intensity of the relationship between two variables is expressed in terms of the correlation coefficient, which ranges from -1 to 1. • It shows only the relationship between the variables, not influence or a cause and effect relationship.
23. 23. Types of Correlation Coefficient • Based on the direction of change: a. Perfect positive correlation: X is directly proportional to Y; both rise and fall in the same proportion. r = 1. E.g. designation and salary. b. Perfect negative correlation: X and Y are inversely proportional. r = -1. E.g. insulin and blood sugar. c. Moderately positive correlation: a type of positive correlation. d. Moderately negative correlation: a type of negative correlation. e. No correlation: no relationship. r = 0. E.g. smoking and type of housing.
24. 24. Types of Correlation Coefficient • Based on the number of variables: a. Simple: only two variables. b. Multiple: more than two variables. c. Partial: more than two variables, but the correlation is studied for only two variables while the third variable is kept constant. E.g. x = yield, y = fertilizer, z = amount of rainfall. Simple = r(xy), r(yz), r(xz); Multiple = r(xyz); Partial = r(xy)z
25. 25. Types of Correlation Coefficient • Based on linearity: a. Linear: if a change in one variable brings a constant amount of change, or a steady pattern of change, in the other variable, the correlation is said to be linear.
26. 26. Types of Correlation Coefficient • Based on linearity: b. Non-linear: correlation is said to be non-linear if the ratio of change is not constant. In other words, when all the points on the scatter diagram tend to lie near a smooth curve, the correlation is said to be non-linear (curvilinear).
27. 27. Methods of Correlation Coefficient • Karl Pearson’s method of correlation • Spearman’s rank correlation. • Scatter Plot/graph/scatter diagram method.
28. 28. Karl Pearson’s method of correlation • Karl Pearson’s product-moment correlation coefficient (or simply, Pearson’s correlation coefficient) is a measure of the strength of a linear association between two variables and is denoted by r or rxy (x and y being the two variables involved). • It attempts to draw a line of best fit through the data of the two variables, and the value of the Pearson correlation coefficient, r, indicates how far the data points lie from this line of best fit. • It does not consider whether a variable is dependent or independent. It treats all variables equally.
29. 29. Properties of Pearson’s method • r is unit-less. Thus, we may use it to compare association between totally different bivariate distributions as well. • The value of r always lies between +1 and - 1. Depending on its exact value, we see the following degrees of association between the variables. • A value greater than 0 indicates a positive association i.e. as the value of one variable increases, so does the value of the other variable. • A value less than 0 indicates a negative association i.e. as the value of one variable increases, the value of the other variable decreases.
30. 30. Interpretation of Pearson’s method

    | Strength of association | Negative r | Positive r |
    |---|---|---|
    | Weak | -0.1 to -0.3 | 0.1 to 0.3 |
    | Average | -0.3 to -0.5 | 0.3 to 0.5 |
    | Strong | -0.5 to -1 | 0.5 to 1 |
    | Perfect | -1 | +1 |

    The coefficient of correlation is zero when the variables X and Y are independent.
31. 31. Assumptions of Pearson’s method • The relationship between the variables is linear, which means that when the two variables are plotted, the points fall along a straight line. • The variables are independent of each other. • The coefficient of correlation measures not only the magnitude of the correlation but also its direction. For example, r = -0.67 shows that the correlation is negative, because the sign is “-”, and that its magnitude is 0.67.
32. 32. Karl Pearson’s method of correlation • It can be calculated using the formula $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$ • In the case of grouped data, “x” and “y” can be taken as the mid values of the class intervals.
33. 33. Pearson’s method • Compute the correlation coefficient from the following data. • Create the table. • Find the mean of “x” and “y”.

    | Weight in kg (x) | 60 | 70 | 80 | 90 |
    |---|---|---|---|---|
    | Cholesterol (y) | 120 | 130 | 140 | 150 |
34. 34. Pearson’s method • Here $\bar{x} = 300/4 = 75$ and $\bar{y} = 540/4 = 135$.

    | x | y | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$ |
    |---|---|---|---|---|
    | 60 | 120 | -15 | -15 | 225 |
    | 70 | 130 | -5 | -5 | 25 |
    | 80 | 140 | 5 | 5 | 25 |
    | 90 | 150 | 15 | 15 | 225 |
    | Σx = 300 | Σy = 540 | | | Σ = 500 |
35. 35. Pearson’s method

    | $(x - \bar{x})^2$ | $(y - \bar{y})^2$ |
    |---|---|
    | 225 | 225 |
    | 25 | 25 |
    | 25 | 25 |
    | 225 | 225 |
    | Σ = 500 | Σ = 500 |

    $r = \frac{500}{\sqrt{500 \times 500}} = \frac{500}{\sqrt{250{,}000}} = \frac{500}{500} = 1$. Hence there is perfect positive correlation between the weight and cholesterol level of the patients.
36. 36. Pearson’s method • (Homework) Compute the correlation coefficient from the following data.

    | Age | 30 | 40 | 50 | 60 | 70 |
    |---|---|---|---|---|---|
    | Blood pressure | 120 | 130 | 140 | 150 | 160 |
37. 37. Merits and Demerits of Pearson’s method Merits: • It summarizes the correlation, and if the data are plotted on a graph with a fitted straight line it shows the direction too. Demerits: • The correlation coefficient always assumes a linear relationship, regardless of whether that assumption is correct. • The value of the coefficient is unduly affected by extreme values. • It cannot be used for ordinal data. • It is a time-consuming method.
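The weight and cholesterol calculation can be sketched in a few lines using the deviation-from-the-mean form worked through above. `pearson_r` is an illustrative helper name, not something defined in the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_dxdy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_dx2 = sum((x - mean_x) ** 2 for x in xs)
    sum_dy2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

weight = [60, 70, 80, 90]           # kg
cholesterol = [120, 130, 140, 150]

r = pearson_r(weight, cholesterol)
print(r)  # 1.0
```

The same function can be applied to the homework data (age and blood pressure) to check a hand calculation.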
38. 38. Spearman’s Rank Correlation Coefficient • It is a method of finding correlation between two variables by taking their ranks. • This is used for qualitative data. • It can be used when actual magnitude of characteristics under consideration is not known, but relative position or rank of the magnitude is known. • It is the nonparametric version of the Pearson correlation coefficient. • The data must be ordinal, interval or ratio with ranks.
39. 39. Spearman’s Rank Correlation Coefficient • Spearman’s coefficient returns a value from -1 to 1, where: +1 = a perfect positive correlation between ranks, -1 = a perfect negative correlation between ranks, 0 = no correlation between ranks. • It is denoted by “rho” (ρ). • There are two cases for calculating rank correlation. • A. No tie in the allotted ranks. • B. There is a tie for two or more values/ranks in either “x” or “y” or both.
40. 40. Spearman’s Rank Correlation Coefficient • Case 1: No tie in the allotted ranks: in this case none of the values/ranks of x and y are repeated. • In this case ρ can be calculated using the formula $\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ • d = difference in the ranks of the data sets ‘x’ and ‘y’ (d = Rx - Ry), and n = number of pairs of observations.
41. 41. Spearman’s Rank Correlation Coefficient • Calculate the rank correlation of the following marks obtained by five nursing students in anatomy and FON. • The data themselves should not be rearranged in ascending or descending order; only the ranks are assigned in ascending or descending order. Each pair of values belongs to one student. • Prepare a table to calculate Σd².

    | Anatomy | 85 | 81 | 77 | 68 | 53 |
    |---|---|---|---|---|---|
    | FON | 78 | 70 | 72 | 62 | 67 |
42. 42. Spearman’s Rank Correlation Coefficient

    | x | y | Rx | Ry | d = Rx - Ry | d² |
    |---|---|---|---|---|---|
    | 85 | 78 | 1 | 1 | 0 | 0 |
    | 81 | 70 | 2 | 3 | -1 | 1 |
    | 77 | 72 | 3 | 2 | 1 | 1 |
    | 68 | 62 | 4 | 5 | -1 | 1 |
    | 53 | 67 | 5 | 4 | 1 | 1 |
    | | | | | | Σd² = 4 |

    $\rho = 1 - \frac{6 \times 4}{5(25 - 1)} = 1 - \frac{24}{120} = 0.8$. The marks of the two subjects are positively correlated.
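The no-tie case can be sketched as follows, with rank 1 going to the highest mark as in the table. The helper names are illustrative, and the ranking function assumes no repeated values.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value; assumes no repeated values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d = Rx - Ry."""
    n = len(xs)
    rx = ranks_descending(xs)
    ry = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

anatomy = [85, 81, 77, 68, 53]
fon = [78, 70, 72, 62, 67]

rho = spearman_no_ties(anatomy, fon)
print(round(rho, 2))  # 0.8
```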
43. 43. Spearman’s Rank Correlation Coefficient • Example: Calculate the correlation for the following set of data. Given are the temperatures (degrees Celsius) of Jammu and Katra on different days.

    | Jammu | 20 | 28 | 25 | 23 | 22 | 30 | 31 |
    |---|---|---|---|---|---|---|---|
    | Katra | 15 | 26 | 17 | 19 | 21 | 24 | 27 |
44. 44. Spearman’s Rank Correlation Coefficient • Case 2: There is a tie in the allotted ranks: in this case a value is repeated in either x or y or both, so some ranks are tied. • In this case ρ can be calculated using the formula $\rho = 1 - \frac{6\left(\sum d^2 + \sum CF\right)}{n(n^2 - 1)}$ • CF is the correction factor. It has to be calculated for each group of repeated values and the results added together. It is calculated using the formula $CF = \frac{m(m^2 - 1)}{12}$, where m is the number of times the value is repeated. • d = difference in the ranks of the data sets ‘x’ and ‘y’ (d = Rx - Ry)
45. 45. Spearman’s Rank Correlation Coefficient • Calculate the rank correlation of the following marks obtained by eight nursing students in MSN and OBG. • Here in MSN (x) the value 68 is repeated twice, and in OBG (y) the value 70 is repeated thrice. • In the first series CF = 2 × (4 - 1)/12 = 0.5 • In the second series CF = 3 × (9 - 1)/12 = 2

    | MSN | 60 | 81 | 72 | 68 | 53 | 75 | 85 | 68 |
    |---|---|---|---|---|---|---|---|---|
    | OBG | 78 | 70 | 72 | 62 | 67 | 70 | 70 | 61 |
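46. 46. Spearman’s Rank Correlation Coefficient • Ranks are assigned in ascending order (smallest value = rank 1), and tied values each receive the average of the ranks they occupy: the two 68s in x share ranks 3 and 4 (3.5 each), and the three 70s in y share ranks 4, 5 and 6 (5 each).

    | x | y | Rx | Ry | d = Rx - Ry | d² |
    |---|---|---|---|---|---|
    | 60 | 78 | 2 | 8 | -6 | 36 |
    | 81 | 70 | 7 | 5 | 2 | 4 |
    | 72 | 72 | 5 | 7 | -2 | 4 |
    | 68 | 62 | 3.5 | 2 | 1.5 | 2.25 |
    | 53 | 67 | 1 | 3 | -2 | 4 |
    | 75 | 70 | 6 | 5 | 1 | 1 |
    | 85 | 70 | 8 | 5 | 3 | 9 |
    | 68 | 61 | 3.5 | 1 | 2.5 | 6.25 |
    | | | | | | Σd² = 66.5 |
47. 47. Spearman’s Rank Correlation Coefficient • $\rho = 1 - \frac{6(66.5 + 0.5 + 2)}{8(64 - 1)} = 1 - \frac{414}{504} = 1 - 0.82 = 0.18$. The marks of the two subjects have a weak positive correlation. • Homework: Calculate the correlation for the following set of data.

    | X | 10 | 15 | 14 | 25 | 14 | 14 |
    |---|---|---|---|---|---|---|
    | Y | 6 | 25 | 12 | 18 | 25 | 40 |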
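The tied-rank case can be sketched as below. The sketch assumes the usual convention that tied values share the average of the ranks they occupy, and it adds the correction factor m(m² - 1)/12 for every group of repeated values to Σd². The helper names are illustrative.

```python
def average_ranks(values):
    """Ascending ranks; tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first position held by v
        count = ordered.count(v)                # how many times v occurs
        ranks.append(first + (count - 1) / 2)   # mean of first .. first+count-1
    return ranks

def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every group of m repeated values."""
    counts = [values.count(v) for v in set(values)]
    return sum(m * (m ** 2 - 1) / 12 for m in counts if m > 1)

def spearman_with_ties(xs, ys):
    """rho = 1 - 6 * (sum(d^2) + total correction) / (n * (n^2 - 1))."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    cf = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + cf) / (n * (n ** 2 - 1))

msn = [60, 81, 72, 68, 53, 75, 85, 68]
obg = [78, 70, 72, 62, 67, 70, 70, 61]

rho = spearman_with_ties(msn, obg)
print(tie_correction(msn), tie_correction(obg))
print(round(rho, 2))
```

Giving every tied value the same whole-number rank instead of the average rank changes Σd² and therefore the result, so the ranking convention should be stated alongside any reported value.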
48. 48. Merits and Demerits of Spearman’s method Merits • This method can be used as a measure of the degree of association between qualitative data. • This method is very simple and easily understandable. • It can be used when the actual data are given or when only the ranks of the data are given. Demerits • We cannot calculate the rank coefficient for a frequency distribution, i.e. grouped data. • When a large number of observations are given, the calculation becomes tedious.
49. 49. Scatter Diagram Method • Scatter diagrams are convenient tools to study the correlation between two random variables. • They take the form of a sheet of paper (a graph) on which the data points corresponding to the variables of interest are plotted. • Judging by the shape of the pattern that the data points form, we can determine the association between the two variables and can then apply the most suitable correlation analysis technique.
50. 50. Scatter Diagram Method: Use • Quickly confirm a hypothesis that two variables are correlated. • Provide a graphical representation of the strength of the relationship between two variables. • It also helps in understanding cause and effect relationship to evaluate whether manipulation of independent variable (cause) is actually producing the change in dependent variable (effect.)
51. 51. Steps to make Scatter Diagram • Step 1: On graph paper or plain paper draw an “L”, where the horizontal part of the “L” is the x axis and the vertical part is the y axis. • Step 2: Mark the scale units at even multiples such as 10, 20, 30, 40 etc. so as to have an even scale. • Step 3: Place the independent (cause) variable on the horizontal axis (from left to right) and the dependent (effect) variable on the vertical axis (from bottom to top). • Step 4: Plot each data point where its x value and its y value intersect. • The points on the graph generally look scattered, hence the name scatter plot. • Step 5: Interpret the data and find the relationship.
52. 52. Interpretation of Scatter Diagram • It suggests the degree and the direction of the correlation. • The greater the scatter of the plotted points on the chart, the weaker the relationship. • The more closely the points come to a straight line rising from the lower left corner to the upper right corner, the closer the correlation is to perfectly positive (r = +1). • On the other hand, if all the points lie on a line falling from the upper left corner to the lower right corner, the correlation is said to be perfectly negative (r = -1).
53. 53. Interpretation of Scatter Diagram • If the points are widely distributed/scattered on the graph, it indicates very little relationship (weak positive or weak negative). • If the plotted points lie on the diagram in a disorganized manner, it shows absence of correlation.
54. 54. Merits and Demerits of Scatter Diagram Merits • It is simple and non mathematical method to study correlation. • Easily understood and rough idea can be quickly formed. • It is not influenced by the extreme values of x and y. Demerits • Cannot establish the exact degree of correlation. • It cannot be always referred as a measure of degree of correlation since it is not mathematical and hence less reliable.
55. 55. Regression • Regression was introduced by Francis Galton in the field of biometry. • Regression analysis is a reliable method of identifying which variables have impact on a topic of interest. • Dependent Variable: This is the main factor that you’re trying to understand or predict. • Independent Variables: These are the factors that you hypothesize have an impact on your dependent variable.
56. 56. Regression • Regression is done by deriving a suitable equation on the basis of available bivariate data. • This equation is called Regression equation and its geometrical representation is called Regression curve. • The regression equation requires the Regression coefficient. • The method of calculating regression coefficient (b/b1) is described below.
57. 57. Regression Analysis • Regression analysis attempts to establish the nature of the relationship between the variables, i.e. to study the functional relationship between them and thereby provide a mechanism for prediction or forecasting. • It is a mathematical model which describes the relationship between the dependent variable (y) and the independent variable (x), and which allows unknown values of ‘y’ to be estimated from known values of ‘x’ through the equation y = a + bx.
58. 58. Properties of Regression Coefficient • It is denoted by b. • Between two variables (x and y), two values of the regression coefficient can be obtained. One is obtained when we consider x as independent and y as dependent, and the other when the roles are reversed. • The regression coefficient of y on x is represented as byx and that of x on y as bxy. • The square root of the product of the two regression coefficients (b = byx and b1 = bxy) gives the magnitude of the correlation coefficient: $r = \pm\sqrt{b_{yx} \, b_{xy}}$, with the same sign as the two coefficients.
59. 59. Regression Equations • There will be two lines/two equations of regression. • 1. Regression Equation of y on x. • 2. Regression equation of x on y.
60. 60. Regression Equation of y on x. • It is y = a + bx, where y = dependent variable, x = independent variable, and a and b are constants. • Note that b = byx (regression coefficient of y on x). • $b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$ • $a = \bar{y} - b\bar{x}$
61. 61. Regression Equation of x on y. • It is x = a1 + b1y, where x = dependent variable, y = independent variable, and a1 and b1 are constants. • Note that b1 = bxy (regression coefficient of x on y). • $b_1 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum y^2 - n\bar{y}^2}$ • $a_1 = \bar{x} - b_1\bar{y}$
62. 62. Types of Regression • Simple linear regression: the relationship between a scalar response (dependent variable) and a single explanatory/independent variable. • Multiple linear regression: more than one explanatory variable. • Multivariate linear regression: multiple correlated dependent variables are predicted, rather than a single scalar variable.
63. 63. Types of Regression • Positive regression: A positive sign indicates that as the predictor variable increases, the response variable also increases. • Negative regression: A negative sign indicates that as the predictor variable increases, the response variable decreases. • Linear and nonlinear Regression: A model is linear when each term is either a constant or the product of a parameter and a predictor variable. It is non linear if the equation does not meet the linear criteria.
64. 64. Regression Analysis • Fit a regression equation of B.P on age based on the following data and estimate the probable B.P for a subject aged 55. • n = 5 • $\bar{x}$ = Σx/n = 250/5 = 50 • $\bar{y}$ = Σy/n = 700/5 = 140 • The regression equation to be fitted is y = a + bx, where y is B.P and x is the age.

    | Age | 30 | 40 | 50 | 60 | 70 |
    |---|---|---|---|---|---|
    | B.P | 120 | 130 | 140 | 150 | 160 |
65. 65. Regression Equation of y on x. • Find b and a using the given formulas. • $b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$ • $a = \bar{y} - b\bar{x}$
66. 66. Table calculation

    | x | y | xy | x² |
    |---|---|---|---|
    | 30 | 120 | 3600 | 900 |
    | 40 | 130 | 5200 | 1600 |
    | 50 | 140 | 7000 | 2500 |
    | 60 | 150 | 9000 | 3600 |
    | 70 | 160 | 11200 | 4900 |
    | Σx = 250 | Σy = 700 | Σxy = 36000 | Σx² = 13500 |
67. 67. Regression Equation of y on x. • $b = \frac{36000 - 5 \times 50 \times 140}{13500 - 5 \times 50^2}$ • $b = \frac{36000 - 35000}{13500 - 12500}$ • $b = \frac{1000}{1000} = 1$ • $a = \bar{y} - b\bar{x}$ • a = 140 – 1 × 50 = 90 • So the fitted regression equation is y = 90 + 1x. • For a subject aged 55: B.P = 90 + 1 × 55 = 90 + 55 = 145 mm of Hg.
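The fit of B.P on age can be sketched with the same sum-based formulas for b and a. `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x using b = (Sxy - n*xbar*ybar) / (Sxx - n*xbar^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    a = mean_y - b * mean_x
    return a, b

age = [30, 40, 50, 60, 70]
bp = [120, 130, 140, 150, 160]

a, b = fit_line(age, bp)
predicted_bp_at_55 = a + b * 55

print(a, b)                # 90.0 1.0
print(predicted_bp_at_55)  # 145.0
```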
68. 68. Regression Analysis: Example 2 • Fit the two regression lines for the following data. • n = 5 • $\bar{x}$ = Σx/n = 150/5 = 30 • $\bar{y}$ = Σy/n = 350/5 = 70 • The regression equations to be fitted are y = a + bx and x = a1 + b1y.

    | X | 10 | 20 | 30 | 40 | 50 |
    |---|---|---|---|---|---|
    | Y | 30 | 50 | 70 | 90 | 110 |
69. 69. Regression Equation of y on x. • Find b and a using the given formulas. • $b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$ • $a = \bar{y} - b\bar{x}$
70. 70. Table 2

    | x | y | xy | x² | y² |
    |---|---|---|---|---|
    | 10 | 30 | 300 | 100 | 900 |
    | 20 | 50 | 1000 | 400 | 2500 |
    | 30 | 70 | 2100 | 900 | 4900 |
    | 40 | 90 | 3600 | 1600 | 8100 |
    | 50 | 110 | 5500 | 2500 | 12100 |
    | Σx = 150 | Σy = 350 | Σxy = 12500 | Σx² = 5500 | Σy² = 28500 |
71. 71. Regression Equation of y on x. • $b = \frac{12500 - 5 \times 30 \times 70}{5500 - 5 \times 30^2}$ • $b = \frac{12500 - 10500}{5500 - 4500}$ • $b = \frac{2000}{1000} = 2$ • $a = \bar{y} - b\bar{x}$ • a = 70 – 2 × 30 = 70 – 60 = 10 • So the fitted regression equation is y = 10 + 2x.
72. 72. Regression Equation of x on y. • Find b1 and a1 using the formulas. • $b_1 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum y^2 - n\bar{y}^2}$ • $a_1 = \bar{x} - b_1\bar{y}$
73. 73. Regression Equation of x on y. • $b_1 = \frac{12500 - 5 \times 30 \times 70}{28500 - 5 \times 70^2}$ • $b_1 = \frac{12500 - 10500}{28500 - 24500}$ • $b_1 = \frac{2000}{4000} = 0.5$ • $a_1 = \bar{x} - b_1\bar{y}$ • a1 = 30 – 0.5 × 70 = 30 – 35 = -5 • So the fitted regression equation is x = -5 + 0.5y.
74. 74. Properties • The square root of the product of the two regression coefficients is the correlation coefficient. In the given example: • b = byx = 2 • b1 = bxy = 0.5 • $r = \sqrt{2 \times 0.5} = \sqrt{1} = 1$
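The link between the two regression coefficients and r can be checked on the Example 2 data. In this sketch `slope` is an illustrative helper, and the sign of r is taken from the coefficients, which always share the same sign.

```python
import math

def slope(us, vs):
    """Regression coefficient of v on u: (Suv - n*ubar*vbar) / (Suu - n*ubar^2)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    return (sum_uv - n * mean_u * mean_v) / (sum_u2 - n * mean_u ** 2)

x = [10, 20, 30, 40, 50]
y = [30, 50, 70, 90, 110]

b_yx = slope(x, y)  # regression coefficient of y on x
b_xy = slope(y, x)  # regression coefficient of x on y

# r has the magnitude sqrt(b_yx * b_xy) and the sign of the coefficients.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

print(b_yx, b_xy, r)  # 2.0 0.5 1.0
```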
75. 75. Coefficient of Variation • The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean: CV = (SD / mean) × 100. • To compare the variability of two or more series, we can use the coefficient of variation. • A series with a large coefficient of variation is more variable, i.e. less stable or less uniform. • A series with a small coefficient of variation is less variable, i.e. more stable or more uniform.
76. 76. Coefficient of Variation • Find the CV for the following data. ( 13, 35, 56, 58, 35, 60 ) • Mean = 42.8 • SD = 18.5 • CV = 18.5/42.8 = 0.43 (43%)
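The CV example can be sketched with Python's `statistics` module. The sketch assumes the sample standard deviation (dividing by n - 1), which is the version that reproduces the SD of 18.5 quoted above.

```python
from statistics import mean, stdev

data = [13, 35, 56, 58, 35, 60]

avg = mean(data)
sd = stdev(data)  # sample standard deviation (n - 1 in the denominator)
cv = sd / avg     # multiply by 100 to express as a percentage

print(round(avg, 1))  # 42.8
print(round(sd, 1))   # 18.5
print(round(cv, 2))   # 0.43
```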
77. 77. Coefficient of Variation: Example • To compare their efficacy, 2 sleep producing drugs were tested independently on 5 patients. The following data give the amount of sleep (in hours) the patients had after taking the drugs. • Compare the efficiencies of the two drugs on the basis of the coefficient of variation.

    | Drug A | 6 | 2 | 4 | 5 | 3 | 2 | 1 |
    |---|---|---|---|---|---|---|---|
    | Drug B | 3 | 6 | 7 | 2 | 6 | 3 | 7 |