# Chapter 13


1. Chapter 13: Simple Linear Regression & Correlation (Inferential Methods)
2. Deterministic Models. Consider the two variables x and y. A deterministic relationship is one in which the value of y (the dependent variable) is completely determined by some formula or mathematical expression such as y = f(x), y = 3 + 2x, or y = 5e^(-2x), where x is the independent variable.
3. Probabilistic Models. A description of the relation between two variables x and y that are not deterministically related can be given by specifying a probabilistic model. The general form of an additive probabilistic model allows y to be larger or smaller than f(x) by a random amount, e. The model equation is of the form y = (deterministic function of x) + (random deviation) = f(x) + e.
4. Probabilistic Models. [Figure: deviations from the deterministic part of a probabilistic model; one plotted point lies below the curve with e = -1.5.]
5. Simple Linear Regression Model. The simple linear regression model assumes that there is a line with vertical (y) intercept α and slope β, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made, y = α + βx + e. Without the random deviation e, all observed (x, y) points would fall exactly on the population regression line. The inclusion of e in the model equation allows points to deviate from the line by random amounts.
6. Simple Linear Regression Model. [Figure: the population regression line with vertical intercept α and slope β; an observation when x = x1 with positive deviation e1, and an observation when x = x2 with positive deviation e2.]
7. Basic Assumptions of the Simple Linear Regression Model. (1) The distribution of e at any particular x value has mean value 0 (µe = 0). (2) The standard deviation of e (which describes the spread of its distribution) is the same for any particular value of x; this standard deviation is denoted by σ. (3) The distribution of e at any particular x value is normal. (4) The random deviations e1, e2, …, en associated with different observations are independent of one another.
8. More About the Simple Linear Regression Model. For any fixed x value, y itself has a normal distribution, with (mean y value for fixed x) = (height of the population regression line above x) = α + βx, and (standard deviation of y for fixed x) = σ.
9. Interpretation of Terms. (1) The slope β of the population regression line is the mean (average) change in y associated with a 1-unit increase in x. (2) The vertical intercept α is the height of the population line when x = 0. (3) The size of σ determines the extent to which the (x, y) observations deviate from the population line. [Figure: two scatters about a line, one with small σ, one with large σ.]
10. Illustration of Assumptions. [Figure: normal distributions of y, each with standard deviation σ, centered on the population regression line at several x values.]
11. Estimates for the Regression Line. The point estimates of β, the slope, and α, the y intercept of the population regression line, are the slope and y intercept, respectively, of the least squares line. That is, b = S_xy / S_xx is the point estimate of β and a = ȳ − b·x̄ is the point estimate of α, where S_xy = Σxy − (Σx)(Σy)/n and S_xx = Σx² − (Σx)²/n.
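The formulas on this slide can be sketched in a few lines of Python. The helper name `least_squares` below is hypothetical (not from the slides); the body follows the S_xy and S_xx formulas directly.

```python
# Minimal sketch of the least squares point estimates, assuming the data
# arrive as two equal-length lists of numbers.
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    s_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    b = s_xy / s_xx                 # point estimate of the slope beta
    a = sum_y / n - b * sum_x / n   # point estimate of the intercept alpha
    return a, b

# Exactly collinear data, y = 1 + 2x, so the fit recovers a = 1 and b = 2.
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```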
12. Interpretation of y = a + bx. Let x* denote a specific value of the predictor variable x. Then a + bx* has two interpretations: (1) a + bx* is a point estimate of the mean y value when x = x*; (2) a + bx* is a point prediction of an individual y value to be observed when x = x*.
13. Example. The following data were collected in a study of age and fatness in humans.* One of the questions was, “What is the relationship between age and fatness?” Age: 23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61. % Fat: 9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1, 34.7, 42, 29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5. * Mazess, R.B., Peppler, W.W., and Gibbons, M. (1984) Total body composition by dual-photon (153 Gd) absorptiometry. American Journal of Clinical Nutrition, 40, 834-839.
14. Example. Tabulating (Age x, % Fat y, x², xy) for each observation: (23, 9.5, 529, 218.5), (23, 27.9, 529, 641.7), (27, 7.8, 729, 210.6), (27, 17.8, 729, 480.6), (39, 31.4, 1521, 1224.6), (41, 25.9, 1681, 1061.9), (45, 27.4, 2025, 1233), (49, 25.2, 2401, 1234.8), (50, 31.1, 2500, 1555), (53, 34.7, 2809, 1839.1), (53, 42, 2809, 2226), (54, 29.1, 2916, 1571.4), (56, 32.5, 3136, 1820), (57, 30.3, 3249, 1727.1), (58, 33, 3364, 1914), (58, 33.8, 3364, 1960.4), (60, 41.1, 3600, 2466), (61, 34.5, 3721, 2104.5). The column totals give n = 18, Σx = 834, Σy = 515, Σx² = 41612, Σxy = 25489.2.
15. Example. With n = 18, Σx = 834, Σy = 515, Σx² = 41612, and Σxy = 25489.2: S_xx = Σx² − (Σx)²/n = 41612 − 834²/18 = 2970, and S_xy = Σxy − (Σx)(Σy)/n = 25489.2 − (834)(515)/18 = 1627.53.
16. Example. b = S_xy / S_xx = 1627.53 / 2970 = 0.54799 and a = ȳ − b·x̄ = 515/18 − 0.54799(834/18) = 3.2209, so the least squares line is ŷ = 3.22 + 0.548x.
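The fitted line above can be checked by re-running the summation formulas on the 18 data points from the earlier slide:

```python
# Reproducing the slide's fitted line for the age / %fat data.
age = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42, 29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5]

n = len(age)
s_xy = sum(x * y for x, y in zip(age, fat)) - sum(age) * sum(fat) / n
s_xx = sum(x * x for x in age) - sum(age) ** 2 / n
b = s_xy / s_xx                       # slope: about 0.548
a = sum(fat) / n - b * sum(age) / n   # intercept: about 3.22
```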
17. Example. A point estimate of the %Fat for a human who is 45 years old is a + bx* = 3.22 + 0.548(45) = 27.9%. Substituting 45 for x gives both a predicted %Fat for an individual 45-year-old human and an estimated mean %Fat for 45-year-old humans. The two interpretations are quite different.
18. Example. [Minitab regression plot of the data with the least squares line: % Fat y = 3.22086 + 0.547991 Age (x); S = 5.75361, R-Sq = 62.7%, R-Sq(adj) = 60.4%.]
19. Terminology. The predicted or fitted values result from substituting each sample x value into the equation for the least squares line: ŷ1 = a + bx1 (1st predicted value), ŷ2 = a + bx2 (2nd predicted value), …, ŷn = a + bxn (nth predicted value). The residuals for the least squares line are the values y1 − ŷ1, y2 − ŷ2, …, yn − ŷn.
20. Definition Formulas. The total sum of squares, denoted by SSTo, is defined as SSTo = (y1 − ȳ)² + (y2 − ȳ)² + … + (yn − ȳ)² = Σ(y − ȳ)². The residual sum of squares, denoted by SSResid, is defined as SSResid = (y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)² = Σ(y − ŷ)².
21. Calculation Formulas. SSTo and SSResid are generally found as part of the standard output from most statistical packages, or can be obtained using the following computational formulas: SSTo = Σ(y − ȳ)² = Σy² − (Σy)²/n and SSResid = Σ(y − ŷ)² = Σy² − aΣy − bΣxy.
22. Coefficient of Determination. The coefficient of determination, denoted by r², gives the proportion of variation in y that can be attributed to an approximate linear relationship between x and y. It can be computed as r² = 1 − SSResid/SSTo.
23. Estimated Standard Deviation, s_e. The statistic for estimating the variance σ² is s_e² = SSResid/(n − 2), where SSResid = Σ(y − ŷ)² = Σy² − aΣy − bΣxy. The subscript e in s_e² is a reminder that we are estimating the variance of the “errors,” or residuals.
24. Estimated Standard Deviation, s_e. The estimate of σ is the estimated standard deviation s_e = √(s_e²). The number of degrees of freedom associated with estimating σ² or σ in simple linear regression is n − 2.
25. Example continued. Tabulating (Age x, % Fat y, y², ŷ, y − ŷ, (y − ŷ)²) for each observation: (23, 9.5, 90.3, 15.82, -6.32, 40.00), (23, 27.9, 778.4, 15.82, 12.08, 145.81), (27, 7.8, 60.8, 18.02, -10.22, 104.38), (27, 17.8, 316.8, 18.02, -0.22, 0.05), (39, 31.4, 986.0, 24.59, 6.81, 46.34), (41, 25.9, 670.8, 25.69, 0.21, 0.04), (45, 27.4, 750.8, 27.88, -0.48, 0.23), (49, 25.2, 635.0, 30.07, -4.87, 23.74), (50, 31.1, 967.2, 30.62, 0.48, 0.23), (53, 34.7, 1204.1, 32.26, 2.44, 5.93), (53, 42.0, 1764.0, 32.26, 9.74, 94.78), (54, 29.1, 846.8, 32.81, -3.71, 13.78), (56, 32.5, 1056.3, 33.91, -1.41, 1.98), (57, 30.3, 918.1, 34.46, -4.16, 17.27), (58, 33.0, 1089.0, 35.00, -2.00, 4.02), (58, 33.8, 1142.4, 35.00, -1.20, 1.45), (60, 41.1, 1689.2, 36.10, 5.00, 25.00), (61, 34.5, 1190.3, 36.65, -2.15, 4.62). Totals: Σx = 834, Σy = 515.0, Σy² = 16156.3, SSResid = Σ(y − ŷ)² = 529.66. Then s_e² = SSResid/(n − 2) = 529.66/(18 − 2) = 33.104 and s_e = √(s_e²) = √33.104 = 5.754.
26. Example continued. With n = 18, Σy = 515.0, Σy² = 16156.3, Σxy = 25489.2, a = 3.2209, and b = 0.54799: SSTo = Σ(y − ȳ)² = Σy² − (Σy)²/n = 16156.3 − (515.0)²/18 = 1421.5, and r² = 1 − SSResid/SSTo = 1 − 529.66/1421.5 = 1 − 0.373 = 0.627.
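These sums of squares can also be recomputed directly from the residuals, which makes a useful cross-check on the shortcut formulas (coefficients taken from the earlier slide):

```python
import math

# Recomputing SSTo, SSResid, r-squared, and s_e for the age / %fat data.
age = [23, 23, 27, 27, 39, 41, 45, 49, 50, 53, 53, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 27.9, 7.8, 17.8, 31.4, 25.9, 27.4, 25.2, 31.1,
       34.7, 42, 29.1, 32.5, 30.3, 33, 33.8, 41.1, 34.5]
n = len(age)
a, b = 3.2209, 0.54799   # fitted intercept and slope from the earlier slide

ss_to = sum(y ** 2 for y in fat) - sum(fat) ** 2 / n                 # about 1421.5
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(age, fat))     # about 529.7
r_sq = 1 - ss_resid / ss_to                                          # about 0.627
s_e = math.sqrt(ss_resid / (n - 2))                                  # about 5.75
```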
27. Example continued. With r² = 0.627, or 62.7%, we can say that 62.7% of the observed variation in %Fat can be attributed to the probabilistic linear relationship with human age. The magnitude of a typical sample deviation from the least squares line is about 5.75 (%), which is reasonably large compared to the y values themselves. This suggests that the model is useful only for providing gross “ballpark” estimates of %Fat for humans based on age.
28. Properties of the Sampling Distribution of b. When the four basic assumptions of the simple linear regression model are satisfied, the following conditions are met: (1) The mean value of b is β. Specifically, µ_b = β, and hence b is an unbiased statistic for estimating β. (2) The standard deviation of the statistic b is σ_b = σ/√S_xx. (3) The statistic b has a normal distribution (a consequence of the error e being normally distributed).
29. Estimated Standard Deviation of b. The estimated standard deviation of the statistic b is s_b = s_e/√S_xx. When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable t = (b − β)/s_b is the t distribution with df = n − 2.
30. Confidence Interval for β. When the four basic assumptions of the simple linear regression model are satisfied, a confidence interval for β, the slope of the population regression line, has the form b ± (t critical value)·s_b, where the t critical value is based on df = n − 2.
31. Example continued. Recall n = 18, Σx = 834, Σy = 515, Σx² = 41612, Σxy = 25489.2, Σy² = 16156.3, b = 0.54799, a = 3.2209, and s_e = 5.754. Then s_b = s_e/√S_xx = 5.754/√2970 = 0.1056, and a 95% confidence interval estimate for β is b ± t·s_b = 0.5480 ± (2.12)(0.1056) = 0.5480 ± 0.2238.
32. Example continued. A 95% confidence interval estimate for β is b ± t·s_b = 0.5480 ± 2.12(0.1056) = 0.5480 ± 0.2238 = (0.324, 0.772). Based on the sample data, we are 95% confident that the true mean increase in %Fat associated with a year of age is between 0.324% and 0.772%.
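The interval arithmetic is short enough to verify directly (the t critical value 2.12 for df = 16 comes from the slide):

```python
import math

# 95% confidence interval for the slope, using the slide's summary numbers:
# s_e = 5.754, S_xx = 2970, b = 0.54799, t critical value 2.12 (df = 16).
s_b = 5.754 / math.sqrt(2970)                    # about 0.1056
t_crit = 2.12
lo = 0.54799 - t_crit * s_b                      # about 0.324
hi = 0.54799 + t_crit * s_b                      # about 0.772
```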
33. Example continued. Minitab output (Regression Analysis: % Fat y versus Age (x)) looks like this. The regression equation is % Fat y = 3.22 + 0.548 Age (x). Predictor table (Coef, SE Coef, T, P): Constant 3.221, 5.076, 0.63, 0.535; Age (x) 0.5480, 0.1056, 5.19, 0.000. S = 5.754, R-Sq = 62.7%, R-Sq(adj) = 60.4%. Analysis of Variance (DF, SS, MS, F, P): Regression 1, 891.87, 891.87, 26.94, 0.000; Residual Error 16, 529.66, 33.10; Total 17, 1421.54. Here the Coef column gives the estimated y intercept a and slope b, S is s_e, the Residual Error row gives the residual df = n − 2, SSResid, and s_e², and the Total SS is SSTo.
34. Hypothesis Tests Concerning β. Null hypothesis H0: β = hypothesized value. Test statistic: t = (b − hypothesized value)/s_b. The test is based on df = n − 2.
35. Hypothesis Tests Concerning β. Alternate hypothesis and finding the P-value: (1) Ha: β > hypothesized value; P-value = area under the t curve with n − 2 degrees of freedom to the right of the calculated t. (2) Ha: β < hypothesized value; P-value = area under the t curve with n − 2 degrees of freedom to the left of the calculated t.
36. Hypothesis Tests Concerning β. (3) Ha: β ≠ hypothesized value: (a) if t is positive, P-value = 2·(area under the t curve with n − 2 degrees of freedom to the right of the calculated t); (b) if t is negative, P-value = 2·(area under the t curve with n − 2 degrees of freedom to the left of the calculated t).
37. Hypothesis Tests Concerning β. Assumptions: (1) The distribution of e at any particular x value has mean value 0 (µe = 0). (2) The standard deviation of e is σ, which does not depend on x. (3) The distribution of e at any particular x value is normal. (4) The random deviations e1, e2, …, en associated with different observations are independent of one another.
38. Hypothesis Tests Concerning β. Quite often the test is performed with the hypotheses H0: β = 0 vs. Ha: β ≠ 0. This particular form of the test is called the model utility test for simple linear regression. The test statistic simplifies to t = b/s_b and is called the t ratio. The null hypothesis specifies that there is no useful linear relationship between x and y, whereas the alternative hypothesis specifies that there is a useful linear relationship between x and y.
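The t ratio is easy to compute from the summary quantities already defined (s_e and S_xx). The helper name `t_ratio` is hypothetical, not from the slides:

```python
import math

# Model utility test statistic t = b / s_b, built from the summary
# quantities defined on the earlier slides.
def t_ratio(b, s_e, s_xx):
    s_b = s_e / math.sqrt(s_xx)   # estimated standard deviation of b
    return b / s_b

# Age / %fat example: t = 0.54799 / 0.1056, about 5.19,
# matching the T column of the Minitab output above.
t = t_ratio(0.54799, 5.754, 2970)
```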
39. Example. Consider the following data on percentage unemployment and suicide rates.* City (Percentage Unemployed, Suicide Rate): New York (3.0, 72), Los Angeles (4.7, 224), Chicago (3.0, 82), Philadelphia (3.2, 92), Detroit (3.8, 104), Boston (2.5, 71), San Francisco (4.8, 235), Washington (2.7, 81), Pittsburgh (4.4, 86), St. Louis (3.1, 102), Cleveland (3.5, 104). * Smith, D. (1977) Patterns in Human Geography, Canada: Douglas David and Charles Ltd., 158.
40. Example. [Minitab scatterplot of suicide rate versus percentage unemployed.]
41. Example. Tabulating (x, y, x², xy, y²) for each city: New York (3.0, 72, 9.00, 216.0, 5184), Los Angeles (4.7, 224, 22.09, 1052.8, 50176), Chicago (3.0, 82, 9.00, 246.0, 6724), Philadelphia (3.2, 92, 10.24, 294.4, 8464), Detroit (3.8, 104, 14.44, 395.2, 10816), Boston (2.5, 71, 6.25, 177.5, 5041), San Francisco (4.8, 235, 23.04, 1128.0, 55225), Washington (2.7, 81, 7.29, 218.7, 6561), Pittsburgh (4.4, 86, 19.36, 378.4, 7396), St. Louis (3.1, 102, 9.61, 316.2, 10404), Cleveland (3.5, 104, 12.25, 364.0, 10816). Totals: Σx = 38.7, Σy = 1253, Σx² = 142.57, Σxy = 4787.2, Σy² = 176807.
42. Example. Some basic summary statistics: n = 11, Σx = 38.7, Σx² = 142.57, Σy = 1253, Σy² = 176807, Σxy = 4787.2. Then S_xy = Σxy − (Σx)(Σy)/n = 4787.2 − (38.7)(1253)/11 = 378.92, and S_xx = Σx² − (Σx)²/n = 142.57 − 38.7²/11 = 6.4164.
43. Example. Continuing with the calculations: b = S_xy/S_xx = 378.92/6.4164 = 59.06 and a = ȳ − b·x̄ = 1253/11 − 59.06(38.7/11) = −93.86, so ŷ = −93.86 + 59.06x.
44. Example. Continuing with the calculations: SSResid = Σ(y − ŷ)² = Σy² − aΣy − bΣxy = 176807 − (−93.857)(1253) − 59.055(4787.2) = 11701.9, and SSTo = S_yy = Σ(y − ȳ)² = Σy² − (Σy)²/n = 176807 − 1253²/11 = 34078.9.
45. Example. r² = 1 − SSResid/SSTo = 1 − 11701.9/34078.9 = 1 − 0.343 = 0.657, and s_e = √(SSResid/(n − 2)) = √(11701.9/9) = 36.06.
46. Example: Model Utility Test. (1) β = the true average change in suicide rate associated with an increase in the unemployment rate of 1 percentage point. (2) H0: β = 0. (3) Ha: β ≠ 0. (4) α has not been preselected; we shall interpret the observed level of significance (P-value). (5) Test statistic: t = (b − hypothesized value)/s_b = (b − 0)/s_b = b/s_b.
47. Example: Model Utility Test. (6) Assumptions: the Minitab plot of the data shows a linear pattern, and the variability of the points does not appear to change with x. Assuming that the distribution of errors (residuals) at any given x value is approximately normal, the assumptions of the simple linear regression model are appropriate.
48. Example: Model Utility Test. (7) Calculation: s_b = s_e/√S_xx = 36.06/√6.4164 = 14.24 and t = b/s_b = 59.06/14.24 = 4.15. (8) P-value: the table of tail areas for t distributions only has t values ≤ 4, so we can see that the corresponding tail area is < 0.002. Since this is a two-tailed test, the P-value < 0.004. (Actual calculation gives a P-value of 0.002.)
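A quick check of this calculation from the summary numbers on the preceding slides:

```python
import math

# Suicide-rate example: verifying the test statistic from the slide's
# summary quantities (b = 59.06, s_e = 36.06, S_xx = 6.4164).
s_b = 36.06 / math.sqrt(6.4164)   # estimated standard deviation of b, about 14.24
t = 59.06 / s_b                   # t ratio, about 4.15
```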
49. Example: Model Utility Test. (9) Conclusion: even though no specific significance level was chosen for the test, with the P-value being so small (< 0.004) one would generally reject the null hypothesis that β = 0 and conclude that there is a useful linear relationship between the percentage unemployed and the suicide rate.
50. Example: Minitab Output. Regression Analysis: Suicide Rate (y) versus Percentage Unemployed (x). The regression equation is Suicide Rate (y) = -93.9 + 59.1 Percentage Unemployed (x). Predictor table (Coef, SE Coef, T, P): Constant -93.86, 51.25, -1.83, 0.100; Percenta 59.05, 14.24, 4.15, 0.002. S = 36.06, R-Sq = 65.7%, R-Sq(adj) = 61.8%. The T value of 4.15 and the P-value of 0.002 correspond to the model utility test H0: β = 0 vs. Ha: β ≠ 0.
51. Example: Reality Check! Although the model utility test indicates that the model is useful, we should be a bit reticent to use the model principally as an estimation tool. Notice that s = 36.06, while the actual range of suicide rates is 235 − 71 = 164. This means the typical error in estimating the suicide rate would be approximately 22% of the range. With 9 of the 11 data points having suicide rates at or below 104, this would constitute a very large amount of error in the estimation. The statistics are very clear: we have established a strong positive linear relationship between percentage unemployed and the suicide rate. It would just not be particularly meaningful or useful to provide actual numerical estimates for suicide rates.
52. Residual Analysis. The simple linear regression model equation is y = α + βx + e, where e represents the random deviation of an observed y value from the population regression line α + βx. Key assumptions about e: (1) at any particular x value, the distribution of e is a normal distribution; (2) at any particular x value, the standard deviation of e is σ, which is constant over all values of x.
53. Residual Analysis. To check on these assumptions, one would examine the deviations e1, e2, …, en. Generally, the deviations are not known, so we check on the assumptions by looking at the residuals, which are the deviations from the estimated line a + bx. The residuals are given by y1 − ŷ1 = y1 − (a + bx1), y2 − ŷ2 = y2 − (a + bx2), …, yn − ŷn = yn − (a + bxn).
54. Standardized Residuals. Recall: a quantity is standardized by subtracting its mean value and then dividing by its true (or estimated) standard deviation. For the residuals, the true mean is zero (0) if the assumptions are true. The estimated standard deviation of a residual depends on the x value: the estimated standard deviation of the ith residual, yi − ŷi, is given by s_e·√(1 − 1/n − (xi − x̄)²/S_xx).
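The formula above can be sketched as a short function; `standardized_residuals` is a hypothetical helper name, and the fitted coefficients passed in below are the (rounded) slide values for the unemployment / suicide-rate data:

```python
import math

# Standardized residuals, following the formula above. Assumes the fitted
# intercept a, slope b, and s_e have already been computed for the data.
def standardized_residuals(x, y, a, b, s_e):
    n = len(x)
    xbar = sum(x) / n
    s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    out = []
    for xi, yi in zip(x, y):
        resid = yi - (a + b * xi)
        sd = s_e * math.sqrt(1 - 1 / n - (xi - xbar) ** 2 / s_xx)
        out.append(resid / sd)
    return out

# Unemployment / suicide-rate example from the slides:
unemp = [3.0, 4.7, 3.0, 3.2, 3.8, 2.5, 4.8, 2.7, 4.4, 3.1, 3.5]
rate = [72, 224, 82, 92, 104, 71, 235, 81, 86, 102, 104]
sr = standardized_residuals(unemp, rate, -93.86, 59.06, 36.06)
# sr[8] (Pittsburgh) comes out near -2.50, matching the slide's table.
```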
55. Standardized Residuals. As you can see from the formula for the estimated standard deviation, the calculation of the standardized residuals is a bit of a computational nightmare. Fortunately, most statistical software packages are set up to perform these calculations and do so quite proficiently.
56. Standardized Residuals: Example. Consider the data on percentage unemployment and suicide rates. Tabulating (x, y, ŷ, y − ŷ, standardized residual) for each city: New York (3.0, 72, 83.31, -11.31, -0.34), Los Angeles (4.7, 224, 183.70, 40.30, 1.34), Chicago (3.0, 82, 83.31, -1.31, -0.04), Philadelphia (3.2, 92, 95.12, -3.12, -0.09), Detroit (3.8, 104, 130.55, -26.55, -0.78), Boston (2.5, 71, 53.78, 17.22, 0.55), San Francisco (4.8, 235, 189.61, 45.39, 1.56), Washington (2.7, 81, 65.59, 15.41, 0.48), Pittsburgh (4.4, 86, 165.99, -79.98, -2.50), St. Louis (3.1, 102, 89.21, 12.79, 0.38), Cleveland (3.5, 104, 112.84, -8.84, -0.26). Notice that the standardized residual for Pittsburgh is -2.50, somewhat large for a data set of this size.
57. Example. [Scatterplot with the fitted line; Pittsburgh is flagged: this point has an unusually large residual.]
58. Normal Plots. [Figures: normal probability plot of the residuals and normal probability plot of the standardized residuals (response is Suicide).] Notice that both of the normal plots look similar. If a software package is available to do the calculations and plots, it is preferable to look at the normal plot of the standardized residuals. In both cases, the points look reasonably linear, with the possible exception of Pittsburgh, so the assumption that the errors are normally distributed seems to be supported by the sample data.
59. More Comments. The fact that Pittsburgh has a large standardized residual makes it worthwhile to look at that city carefully to make sure the figures were reported correctly. One might also look to see if there are reasons that Pittsburgh should be considered separately, because some other characteristic distinguishes it from all of the other cities. Pittsburgh does have a large effect on the model.
60. Visual Interpretation of Standardized Residuals. [Plot of standardized residuals versus x: a patternless horizontal band.] This is an example of a satisfactory plot, indicating that the model assumptions are reasonable.
61. Visual Interpretation of Standardized Residuals. [Plot of standardized residuals versus x: a curved pattern.] This plot suggests that a curvilinear regression model is needed.
62. Visual Interpretation of Standardized Residuals. [Plot of standardized residuals versus x: the spread grows with x.] This plot suggests a non-constant variance; the assumptions of the model are not correct.
63. Visual Interpretation of Standardized Residuals. [Plot of standardized residuals versus x: one point far from the horizontal band.] This plot shows a data point with a large standardized residual.
64. Visual Interpretation of Standardized Residuals. [Plot of standardized residuals versus x: one point far to the side in x.] This plot shows a potentially influential observation.
65. Example: % Unemployment vs. Suicide Rate. [Plot of the standardized residuals versus x.] This plot of the residuals (errors) indicates some possible problems with this linear model: there is a generally decreasing pattern to the points; Pittsburgh has an unusually large residual and is clearly an influential point; and the two cities with the highest unemployment are quite influential, since they are far away from the others in terms of the percentage unemployed.
66. Properties of the Sampling Distribution of a + bx for a Fixed x Value. Let x* denote a particular value of the independent variable x. When the four basic assumptions of the simple linear regression model are satisfied, the sampling distribution of the statistic a + bx* has the following properties: (1) The mean value of a + bx* is α + βx*, so a + bx* is an unbiased statistic for estimating the average y value when x = x*.
67. Properties of the Sampling Distribution of a + bx for a Fixed x Value. (2) The standard deviation of the statistic a + bx*, denoted by σ_(a+bx*), is given by σ_(a+bx*) = σ·√(1/n + (x* − x̄)²/S_xx). (3) The distribution of the statistic a + bx* is normal.
68. Additional Information About the Sampling Distribution of a + bx for a Fixed x Value. The estimated standard deviation of the statistic a + bx*, denoted by s_(a+bx*), is given by s_(a+bx*) = s_e·√(1/n + (x* − x̄)²/S_xx). When the four basic assumptions of the simple linear regression model are satisfied, the probability distribution of the standardized variable t = (a + bx* − (α + βx*))/s_(a+bx*) is the t distribution with df = n − 2.
69. Confidence Interval for a Mean y Value. When the four basic assumptions of the simple linear regression model are met, a confidence interval for α + βx*, the average y value when x has the value x*, is a + bx* ± (t critical value)·s_(a+bx*), where the t critical value is based on df = n − 2. Many authors give the following equivalent form for the confidence interval: a + bx* ± (t critical value)·s_e·√(1/n + (x* − x̄)²/S_xx).
70. Prediction Interval for a Single y Value. When the four basic assumptions of the simple linear regression model are met, a prediction interval for y*, a single y observation made when x has the value x*, has the form a + bx* ± (t critical value)·√(s_e² + s_(a+bx*)²), where the t critical value is based on df = n − 2. Many authors give the following equivalent form for the prediction interval: a + bx* ± (t critical value)·s_e·√(1 + 1/n + (x* − x̄)²/S_xx).
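Both equivalent forms can be sketched together. The helper name `mean_ci_and_pi` is hypothetical, and the example call uses summary values taken from the temperature/mortality example on the slides that follow (a = -21.7947, b = 2.35769, s_e = 7.54466, n = 16, x̄ = 713.5/16, S_xx ≈ 467.65, t critical value 2.145 for df = 14):

```python
import math

# Confidence interval for the mean y at x*, and prediction interval for a
# single y at x*, using the equivalent forms above.
def mean_ci_and_pi(x_star, a, b, s_e, n, xbar, s_xx, t_crit):
    fit = a + b * x_star
    se_fit = s_e * math.sqrt(1 / n + (x_star - xbar) ** 2 / s_xx)
    ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)
    se_pred = s_e * math.sqrt(1 + 1 / n + (x_star - xbar) ** 2 / s_xx)
    pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)
    return ci, pi

# Temperature/mortality example at x* = 40; ci comes out close to the
# Minitab 95% CI (67.20, 77.82) and pi close to the 95% PI (55.48, 89.54).
ci, pi = mean_ci_and_pi(40, -21.7947, 2.35769, 7.54466,
                        16, 713.5 / 16, 467.65, 2.145)
```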
71. Example: Mean Annual Temperature vs. Mortality. Data were collected in certain regions of Great Britain, Norway, and Sweden to study the relationship between the mean annual temperature and the mortality rate for a specific type of breast cancer in women.* Mean annual temperature (°F): 51.3, 49.9, 50.0, 49.2, 48.5, 47.8, 47.3, 45.1, 46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34.0. Mortality index: 102.5, 104.5, 100.4, 95.9, 87.0, 95.0, 88.6, 89.2, 78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5. * Lea, A.J. (1965) New observations on distribution of neoplasms of female breast in certain European countries. British Medical Journal, 1, 488-490.
72. Example: Mean Annual Temperature vs. Mortality. Minitab output. Regression Analysis: Mortality index versus Mean annual temperature. The regression equation is Mortality index = -21.8 + 2.36 Mean annual temperature. Predictor table (Coef, SE Coef, T, P): Constant -21.79, 15.67, -1.39, 0.186; Mean ann 2.3577, 0.3489, 6.76, 0.000. S = 7.545, R-Sq = 76.5%, R-Sq(adj) = 74.9%. Analysis of Variance (DF, SS, MS, F, P): Regression 1, 2599.5, 2599.5, 45.67, 0.000; Residual Error 14, 796.9, 56.9; Total 15, 3396.4. Unusual observations: Obs 15 (Mean ann 31.8, Mortality 67.30, Fit 53.18, SE Fit 4.85, Residual 14.12, St Resid 2.44RX), where R denotes an observation with a large standardized residual and X denotes an observation whose x value gives it large influence.
73. Example: Mean Annual Temperature vs. Mortality. [Minitab regression plot: Mortality in = -21.7947 + 2.35769 Mean annual; S = 7.54466, R-Sq = 76.5%, R-Sq(adj) = 74.9%.] The flagged point has a large standardized residual and is influential because of its low mean annual temperature.
74. Example: Mean Annual Temperature vs. Mortality. Predicted values for new observations, as (x*, Fit, SE Fit, 95.0% CI, 95.0% PI): (31.8, 53.18, 4.85, (42.79, 63.57), (33.95, 72.41)) X; (35.0, 60.72, 3.84, (52.48, 68.96), (42.57, 78.88)); (40.0, 72.51, 2.48, (67.20, 77.82), (55.48, 89.54)); (44.6, 83.34, 1.89, (79.30, 87.39), (66.66, 100.02)); (50.0, 96.09, 2.67, (90.37, 101.81), (78.93, 113.25)); (51.3, 99.16, 3.01, (92.71, 105.60), (81.74, 116.57)). X denotes a row with x values away from the center. These are the x* values for which the fits, standard errors of the fits, 95% confidence intervals for mean y values, and prediction intervals for single y values are given.
75. Example: Mean Annual Temperature vs. Mortality. [Regression plot with 95% CI and 95% PI bands about the line Mortality in = -21.7947 + 2.35769 Mean annual; S = 7.54466, R-Sq = 76.5%, R-Sq(adj) = 74.9%.] The 95% prediction interval for a single y value at x = 45 is (67.62, 100.98); the 95% confidence interval for the mean y value at x = 40 is (67.20, 77.82).
76. A Test for Independence in a Bivariate Normal Population. Null hypothesis: H0: ρ = 0. Assumption: r is the correlation coefficient for a random sample from a bivariate normal population. Test statistic: t = r/√((1 − r²)/(n − 2)), where the t critical value is based on df = n − 2.
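The test statistic above is a one-liner; `corr_t` is a hypothetical helper name, and the two calls use the r values quoted on the later example slides:

```python
import math

# t statistic for testing H0: rho = 0 from a sample correlation r,
# compared against the t distribution with df = n - 2.
def corr_t(r, n):
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# The two examples that follow on later slides:
t_fat = corr_t(0.79209, 18)    # %fat vs. age
t_joint = corr_t(0.74908, 17)  # height vs. joint length, about 4.379
```

In simple linear regression this t equals the slope's t ratio, which is why `t_fat` matches the 5.19 in the earlier Minitab output for the age/%fat fit.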
77. A Test for Independence in a Bivariate Normal Population. Alternate hypothesis Ha: ρ > 0 (positive dependence): the P-value is the area under the appropriate t curve to the right of the computed t. Alternate hypothesis Ha: ρ < 0 (negative dependence): the P-value is the area under the appropriate t curve to the left of the computed t. Alternate hypothesis Ha: ρ ≠ 0 (dependence): the P-value is (i) twice the area under the appropriate t curve to the left of the computed t value if t < 0, and (ii) twice the area under the appropriate t curve to the right of the computed t value if t > 0.
78. Example. Recall the data from the study of %Fat vs. age for humans. There are 18 data points, and a quick calculation of the Pearson correlation coefficient gives r = 0.79209. We will test to see if there is a dependence at the 0.05 significance level. [The age and %Fat data table is repeated from the earlier slide.]
79. Example. (1) ρ = the correlation between %fat and age in the population from which the sample was selected. (2) H0: ρ = 0. (3) Ha: ρ ≠ 0. (4) α = 0.05. (5) Test statistic: t = r/√((1 − r²)/(n − 2)), df = n − 2.
80. Example. (6) [Figures: normal probability plots of Age (Anderson-Darling A² = 0.980, P = 0.011, N = 18) and % Fat (A² = 0.796, P = 0.032, N = 18).] Looking at the two normal plots, we can see it is not reasonable to assume that either the distribution of age or the distribution of %fat is normal. (Notice that the data points deviate from a linear pattern quite substantially.) Since neither is normal, we shall not continue with the test.
81. Another Example: Height vs. Joint Length. The professor in an elementary statistics class wanted to explain correlation, so he needed some bivariate data. He asked his class (presumably a random or representative sample of late-adolescent humans) to measure the length of the metacarpal bone on the index finger of the right hand (in cm) and height (in inches). The data are provided on the next slide.
82. Example: Height vs. Joint Length. There are 17 data points, and a quick calculation of the Pearson correlation coefficient gives r = 0.74908. We will test to see if the true population correlation coefficient is positive at the 0.05 level of significance. Joint length: 3.5, 3.4, 3.4, 2.7, 3.5, 3.5, 4.2, 4.0, 3.0, 3.4, 2.9, 3.5, 3.5, 2.8, 4.0, 3.8, 3.3. Height: 64, 68.5, 69, 64, 68, 73, 72, 75, 70, 68.5, 65, 67, 70, 65, 75, 70, 66.
83. Example: Height vs. Joint Length. (1) ρ = the true correlation between height and right index finger metacarpal length in the population from which the sample was selected. (2) H0: ρ = 0. (3) Ha: ρ > 0. (4) α = 0.05. (5) Test statistic: t = r/√((1 − r²)/(n − 2)), df = n − 2.
84. Example: Height vs. Joint Length. (6) [Figures: normal probability plots of Height (Anderson-Darling A² = 0.294, P = 0.557, N = 17) and Joint length (A² = 0.524, P = 0.156, N = 17).] Looking at the two normal plots, we can see it is reasonable to assume that the distribution of height and the distribution of joint length are both normal. (Notice that the data points follow a reasonably linear pattern.) This appears to confirm the assumption that the sample is from a bivariate normal distribution. We will assume that the class was a random sample of young adults.
85. Example: Height vs. Joint Length. (7) Calculation: t = r/√((1 − r²)/(n − 2)) = 0.74908/√((1 − 0.74908²)/(17 − 2)) = 4.379. (8) P-value: looking in the table of tail areas for t curves under 15 degrees of freedom, 4.379 is off the bottom of the table, so the P-value < 0.001. Minitab reports the P-value to be 0.001. (9) Conclusion: the P-value is smaller than α = 0.05, so we reject H0 and conclude that the true population correlation coefficient is greater than 0; i.e., the metacarpal bone is longer for taller people.