Ch8 Regression Revby Rao


Published in: Technology, Economy & Finance

  1. 2. Medical Statistics (full English class) Shaoqi Rao, PhD School of Public Health Sun Yat-Sen University Slides adapted from Dr. Ji-Qian Fang’s
  2. 3. Chapter 8 Linear Regression
  3. 4. How does the value of one variable depend on that of another? <ul><li>How does a son’s height depend on his father’s height? </li></ul><ul><li>How does the death rate of animals depend on the drug dosage? </li></ul><ul><li>How does an infant’s weight depend on age in months? </li></ul><ul><li>How does body surface area depend on height? </li></ul><ul><li>---- To explore quantitatively the linear dependence between two continuous variables. </li></ul>
  4. 5. 8.1 Statistical Description of Linear Regression <ul><li>8.1.1 Linear regression equation </li></ul><ul><li>Original meaning of “regression”: </li></ul><ul><li>Galton noted that if the father is tall, his son will be relatively tall; if the father is short, his son will be relatively short. </li></ul><ul><li>But if the father is very tall, his son will usually not be taller than his father; if the father is very short, his son will usually not be shorter than his father. </li></ul><ul><li>Otherwise, ……?! </li></ul><ul><li>Galton called this phenomenon “regression to the mean”. </li></ul>
  5. 6. <ul><li>Independent variable (explanatory variable), X </li></ul><ul><li>may change randomly </li></ul><ul><li>or be fixed by the researcher </li></ul><ul><li>Dependent variable (response variable), Y </li></ul><ul><li>varies randomly around a linear function of X </li></ul>
  6. 7. What is regression in statistics? To find the track of the means. [Figure: son’s height (cm) plotted against father’s height (cm); both axes run from 100 to 220]
  7. 8. <ul><li>Given the value of X, Y varies around a center ($\mu_{y|x}$) </li></ul><ul><li>All the centers lie on a line -- the regression line. </li></ul><ul><li>The relationship between the center $\mu_{y|x}$ and X is described by a linear equation: $\mu_{y|x} = \alpha + \beta X$ </li></ul>
  8. 9. <ul><li>Linear regression </li></ul><ul><li>Try to estimate $\alpha$ and $\beta$, getting $\hat{Y} = a + bX$ </li></ul><ul><li>Where </li></ul><ul><li>a -- estimate of $\alpha$, intercept </li></ul><ul><li>b -- estimate of $\beta$, slope </li></ul><ul><li>$\hat{Y}$ -- estimate of $\mu_{y|x}$ </li></ul>
  9. 10. 8.1.2 Regression coefficient and its calculation <ul><li>To find a straight line that best fits the points. </li></ul><ul><li>Residual: $e_i = Y_i - \hat{Y}_i$ </li></ul><ul><li>Fit of the regression line: $\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2$ </li></ul><ul><li>Principle of least squares: To find the straight line that minimizes the sum of squared residuals. </li></ul><ul><li>Under this principle, the formulas for $a$ and $b$ follow from calculus: </li></ul><ul><li>$b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$ (8.3) </li></ul><ul><li>$a = \bar{Y} - b\bar{X}$ (8.4) </li></ul><ul><li>Such a line must go through the point $(\bar{X}, \bar{Y})$ and cross the vertical axis at $a$ ---- Why? </li></ul>
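The least-squares calculation of the slope and intercept can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `least_squares` is hypothetical, and the five height pairs are made-up values, not the data of Example 8.1.

```python
def least_squares(x, y):
    """Return (a, b) for the fitted line y_hat = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Hypothetical heights in cm (NOT the data of Example 8.1)
father = [160, 165, 170, 175, 180]
son = [168, 172, 174, 172, 174]

a, b = least_squares(father, son)
x_bar = sum(father) / len(father)
y_bar = sum(son) / len(son)

print(f"y_hat = {a:.1f} + {b:.2f} x")
# The fitted value at the mean of X equals the mean of Y
print(abs((a + b * x_bar) - y_bar) < 1e-9)
```

Because the intercept is computed as the mean of Y minus the slope times the mean of X, the fitted line passes through the point of means by construction, and setting X = 0 gives the intercept as the crossing point on the vertical axis.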
  10. 11. Example 8.1 Calculate the regression equation of the height of son Y on the height of father X .
  11. 13. 8.2 Statistical Inference on Regression 8.2.1 Hypothesis tests <ul><li>The t-test for the regression coefficient </li></ul><ul><li>b is the sample regression coefficient, which changes from sample to sample </li></ul><ul><li>There is a population regression coefficient, denoted by $\beta$ </li></ul><ul><li>Question: Is $\beta = 0$ or not? </li></ul><ul><li>$H_0: \beta = 0$, $H_1: \beta \neq 0$, $\alpha = 0.05$ </li></ul>
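  12. 14. Statistic: $t_b = \frac{b - 0}{s_b}$, $\nu = n - 2$. Standard deviation of regression coefficient: $s_b = \frac{s_{Y \cdot X}}{\sqrt{\sum (X - \bar{X})^2}}$. Standard deviation of residual: $s_{Y \cdot X} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}}$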
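As a rough sketch of how the t statistic for the slope is assembled from the residual standard deviation and the standard deviation of the regression coefficient, the Python below may help. The function name `slope_t_statistic` and the data are hypothetical (not Example 8.1), and no P value is computed; the statistic would be compared with a t table on n - 2 degrees of freedom.

```python
import math


def slope_t_statistic(x, y):
    """Return (t, df, s_b, s_yx) for testing H0: beta = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    # Sum of squared residuals about the fitted line
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(ss_res / (n - 2))   # standard deviation of residual
    s_b = s_yx / math.sqrt(l_xx)         # standard deviation of b
    return b / s_b, n - 2, s_b, s_yx


# Hypothetical heights in cm (NOT the data of Example 8.1)
father = [160, 165, 170, 175, 180]
son = [168, 172, 174, 172, 174]

t, df, s_b, s_yx = slope_t_statistic(father, son)
print(f"t = {t:.3f} on {df} degrees of freedom")
```

With these made-up numbers t is about 2.12 on 3 degrees of freedom, which is below the two-sided 5% critical value of 3.182, so the null hypothesis would not be rejected for this tiny sample.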
  13. 15. For Example 8.1, $H_0: \beta = 0$, $H_1: \beta \neq 0$. P < 0.001. Reject $H_0$ ---- the regression of the son’s height on the father’s height is statistically significant.
  14. 16. Analysis of variance <ul><li>$H_0$: The contribution of the linear regression is 0 </li></ul><ul><li>$H_1$: The contribution of the linear regression is not 0 </li></ul><ul><li>(1) Before regression, we can only use $\bar{Y}$ to estimate $\mu_{y|x}$ </li></ul><ul><li>(2) After regression, we can use $\hat{Y}$ to estimate $\mu_{y|x}$ </li></ul><ul><li>(3) The regression reduces the sum of squared deviations </li></ul><ul><li>(4) To test $H_0$ (the contribution of the regression is 0), the F-statistic is used </li></ul>
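The decline in the sum of squared deviations and the resulting F statistic can be sketched as follows. The function name `regression_anova` and the data are hypothetical (not Example 8.1); the degrees of freedom assumed here are 1 for the regression and n - 2 for the residual, as is usual for a single explanatory variable.

```python
def regression_anova(x, y):
    """Return (ss_total, ss_reg, ss_res, f) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    # Before regression: deviations of Y about its mean
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    # After regression: deviations of Y about the fitted line
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # The decline is the contribution of the regression
    ss_reg = ss_total - ss_res
    f = (ss_reg / 1) / (ss_res / (n - 2))
    return ss_total, ss_reg, ss_res, f


# Hypothetical heights in cm (NOT the data of Example 8.1)
father = [160, 165, 170, 175, 180]
son = [168, 172, 174, 172, 174]

ss_total, ss_reg, ss_res, f = regression_anova(father, son)
print(f"SS total = {ss_total:.1f}, SS regression = {ss_reg:.1f}, "
      f"SS residual = {ss_res:.1f}, F = {f:.2f}")
```

In simple linear regression the F statistic equals the square of the t statistic for the slope, which is why the two tests reach the same two-sided conclusion.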
  15. 17. For Example 8.1 <ul><li>Conclusion: the regression of the son’s height on the father’s height is statistically significant. </li></ul><ul><li>The slight difference between these two approaches: </li></ul><ul><li>The t test can be used for both one-sided and two-sided problems; </li></ul><ul><li>ANOVA is for two-sided problems only. However, the idea of ANOVA can easily be extended to nonlinear regression and multiple regression. </li></ul>
  16. 18. 8.2.2 Determination coefficient For Example 8.1 Determination coefficient: the contribution of the regression as a percentage, $R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}$ <ul><li>It reflects the percentage of the total sum of squared deviations that </li></ul><ul><li>can be explained by the regression. </li></ul><ul><li>If both X and Y are random variables, $R^2 = r^2$, the square of the correlation coefficient. </li></ul>
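A short Python sketch of the determination coefficient as the explained share of the total sum of squared deviations, checked against the squared correlation coefficient. The helper names `determination_coefficient` and `pearson_r` are hypothetical, and the data are made up (not Example 8.1).

```python
import math


def determination_coefficient(x, y):
    """Return SS_regression / SS_total for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return (ss_total - ss_res) / ss_total


def pearson_r(x, y):
    """Return the sample correlation coefficient r."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_yy = sum((yi - y_bar) ** 2 for yi in y)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return l_xy / math.sqrt(l_xx * l_yy)


# Hypothetical heights in cm (NOT the data of Example 8.1)
father = [160, 165, 170, 175, 180]
son = [168, 172, 174, 172, 174]

r2 = determination_coefficient(father, son)
r = pearson_r(father, son)
print(f"R^2 = {r2:.2f}, r^2 = {r ** 2:.2f}")
```

For these made-up values the regression explains 60% of the total sum of squared deviations, and the squared correlation coefficient gives the same figure.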
  17. 19. In practice, it is recommended to report the determination coefficient after a regression analysis to describe how good the regression is. <ul><li>Here is a story: Y: an index of liver function; X: a score for psychological status </li></ul><ul><li>The regression is statistically significant, </li></ul><ul><li>Claimed: “the index for liver function can be improved by psychological consultation” </li></ul><ul><li>Is it wrong? Why? </li></ul>
  18. 20. <ul><li>8.3 The Application of Linear Regression </li></ul><ul><li>8.3.1 Two interval estimations </li></ul><ul><li>Confidence interval for $\mu_{y|x}$ </li></ul><ul><li>Prediction interval for Y </li></ul>
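The slide's own formulas for the two intervals are not preserved here. As a reminder, the standard textbook forms are written out below, under the usual assumptions of simple linear regression. The symbols $X_0$ (the value of X at which the interval is wanted), $\hat{Y}_0$, and $t_{\alpha/2,\,n-2}$ (the two-sided critical value of the t distribution) are notation assumed for this sketch.

```latex
\begin{align*}
\hat{Y}_0 &= a + bX_0 \\
\text{Confidence interval for } \mu_{y|x_0}:\quad
  &\hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; s_{Y \cdot X}
   \sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}} \\
\text{Prediction interval for } Y:\quad
  &\hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; s_{Y \cdot X}
   \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{\sum (X - \bar{X})^2}}
\end{align*}
```

The prediction interval is always wider because it covers an individual value of Y rather than the mean of Y at $X_0$, and both intervals widen as $X_0$ moves away from $\bar{X}$.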
  19. 21. 8.3.3 On the basic assumptions ---- LINE <ul><li>(1) Linear: There is a linear trend between the dependent variable and the independent variable </li></ul><ul><li>(2) Independent: The individual observations are independent of each other </li></ul><ul><li>(3) Normal: Given the value of X, the corresponding Y follows a normal distribution </li></ul><ul><li>(4) Equal variances: The variances of Y for different values of X are all equal, denoted by $\sigma^2$. </li></ul>
  20. 22. <ul><li>In practice, one may use a scatter diagram to check whether the basic assumptions are met. </li></ul><ul><li>The assumption of linearity is essential: using a linear model to describe a curvilinear relationship is obviously inappropriate; </li></ul><ul><li>The assumption of independence is also essential; </li></ul><ul><li>Violation of the assumptions of normality and equal variance might not seriously affect the least squares estimates, though the formulas introduced for statistical inference might no longer be valid. </li></ul><ul><li>When assumptions (1), (3) or (4) are violated, some transformations are worth trying. </li></ul>
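One way a transformation can help is sketched below, using a logarithmic transformation of Y as an example. This is only an illustration: the exponential data are synthetic, the helper `r_squared` is hypothetical, and the log transformation is just one of several candidates that could be tried.

```python
import math


def r_squared(x, y):
    """Return the determination coefficient of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_yy = sum((yi - y_bar) ** 2 for yi in y)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return l_xy ** 2 / (l_xx * l_yy)


# Synthetic curvilinear data: Y grows exponentially with X
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(0.5 * xi) for xi in x]

r2_raw = r_squared(x, y)                          # straight line on raw Y
r2_log = r_squared(x, [math.log(yi) for yi in y])  # straight line on log Y

print(f"R^2 on raw Y: {r2_raw:.3f}")
print(f"R^2 on log Y: {r2_log:.3f}")
```

On the raw scale the straight line misses the curvature, while on the log scale the relationship is exactly linear, so the determination coefficient rises to 1 for this noise-free example.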
  21. 23. Summary Regression and Correlation <ul><li>1. Distinction and connection </li></ul><ul><li>Distinction: </li></ul><ul><li>Correlation: Both X and Y are random </li></ul><ul><li>Regression: </li></ul><ul><li>Y must be random </li></ul><ul><li>X may be random or not random </li></ul>
  22. 24. <ul><li>Connection: When both X and Y are random </li></ul><ul><li>1) The correlation coefficient and the regression coefficient have the same sign </li></ul><ul><li>2) The t tests are equivalent: $t_r = t_b$ </li></ul><ul><li>3) Determination coefficient: $R^2 = r^2$ </li></ul>
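The equivalence of the two t tests can be checked numerically. The sketch below assumes the usual t statistic for a correlation coefficient, r times the square root of (n - 2) divided by the square root of (1 - r squared); the function names `t_for_r` and `t_for_b` and the data are hypothetical.

```python
import math


def _sums(x, y):
    """Return (n, x_bar, y_bar, l_xx, l_yy, l_xy)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_yy = sum((yi - y_bar) ** 2 for yi in y)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return n, x_bar, y_bar, l_xx, l_yy, l_xy


def t_for_r(x, y):
    """t statistic for testing that the population correlation is 0."""
    n, _, _, l_xx, l_yy, l_xy = _sums(x, y)
    r = l_xy / math.sqrt(l_xx * l_yy)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def t_for_b(x, y):
    """t statistic for testing that the population slope is 0."""
    n, x_bar, y_bar, l_xx, _, l_xy = _sums(x, y)
    b = l_xy / l_xx
    a = y_bar - b * x_bar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_b = math.sqrt(ss_res / (n - 2)) / math.sqrt(l_xx)
    return b / s_b


# Hypothetical heights in cm (NOT the data of Example 8.1)
father = [160, 165, 170, 175, 180]
son = [168, 172, 174, 172, 174]

t_r = t_for_r(father, son)
t_b = t_for_b(father, son)
print(f"t_r = {t_r:.4f}, t_b = {t_b:.4f}")
```

The two statistics agree up to floating-point rounding, so testing the correlation coefficient and testing the regression coefficient lead to the same conclusion.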
  23. 25. <ul><li>2. Cautions for regression and correlation </li></ul><ul><li>1) Don’t put just any two variables together for correlation and regression -- they must have some relation in subject matter; </li></ul><ul><li>2) Correlation and regression do not necessarily mean causality ---- sometimes the relation may be indirect, or there may be no real relation at all; </li></ul>
  24. 26. <ul><li>3) A large value of r does not necessarily mean a large regression coefficient b; </li></ul><ul><li>4) Rejecting $H_0: \rho = 0$ does not necessarily mean that the correlation is strong, only that $\rho \neq 0$; </li></ul><ul><li>5) A statistically significant regression equation does not necessarily mean that one can predict Y well from X, only that $\beta \neq 0$; how well it predicts depends on the coefficient of determination; </li></ul><ul><li>6) A scatter diagram is useful before working with linear correlation and linear regression; </li></ul><ul><li>7) The regression equation should not be applied beyond the range of the data set. </li></ul>
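The point that the size of r says little about the size of b can be illustrated by changing the unit of X. In this hypothetical sketch the same made-up fathers' heights are expressed in centimetres and in metres; the helper `slope_and_r` is an assumed name, not part of the original slides.

```python
import math


def slope_and_r(x, y):
    """Return (b, r): least-squares slope and correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    l_xx = sum((xi - x_bar) ** 2 for xi in x)
    l_yy = sum((yi - y_bar) ** 2 for yi in y)
    l_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return l_xy / l_xx, l_xy / math.sqrt(l_xx * l_yy)


# Hypothetical heights (NOT the data of Example 8.1)
father_cm = [160, 165, 170, 175, 180]
father_m = [h / 100 for h in father_cm]   # same heights in metres
son_cm = [168, 172, 174, 172, 174]

b_cm, r_cm = slope_and_r(father_cm, son_cm)
b_m, r_m = slope_and_r(father_m, son_cm)

print(f"X in cm: b = {b_cm:.2f}, r = {r_cm:.3f}")
print(f"X in m:  b = {b_m:.2f}, r = {r_m:.3f}")
```

The slope is 100 times larger when X is measured in metres, while r is unchanged: b depends on the units of X and Y, whereas r is unit-free.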