
# Research method ch09 statistical methods 3 estimation np

Published in: Technology, Economy & Finance


1. Research Methods in Health, Chapter 9: Estimation. Young Moon Chae, Ph.D., Graduate School of Public Health, Yonsei University, Korea (ymchae@yuhs.ac)
2. Correlation
3. Questions
    - Why does the maximum value of r equal 1.0?
    - What does it mean when a correlation is positive? Negative?
    - What is the purpose of the Fisher r to z transformation?
    - What is range restriction? Range enhancement? What do they do to r?
    - Give an example in which data properly analyzed by ANOVA cannot be used to infer causality.
    - Why do we care about the sampling distribution of the correlation coefficient?
    - What is the effect of reliability on r?
4. Basic Ideas
    - Nominal vs. continuous IV
    - Degree (direction) and closeness (magnitude) of linear relations
        - Sign (+ or -) for direction
        - Absolute value for magnitude
    - Pearson product-moment correlation coefficient: $r = \frac{\sum z_X z_Y}{N}$
5. Illustrations: scatterplots of Weight by Height, Errors by Study Time, and SAT-V by Toe Size, showing positive, negative, and zero correlation respectively.
6. Graphic Representation: plot of Weight by Height in raw scores (mean height = 66.8 inches, mean weight = 150.7 lbs.) and the same plot in z-scores, with quadrants marked + and -.
    1. Conversion from raw to z.
    2. Points and quadrants. Positive and negative products.
    3. Correlation is the average of the cross products. The sign and magnitude of r depend on where the points fall.
    4. The product is at its maximum (average = 1) when the points lie on the line where $z_X = z_Y$.
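As a rough sketch of the z-score view on the slide above, the following Python computes r as the average cross product of z-scores. The function name `pearson_r` and the height and weight values are made up for illustration, and the standard deviations divide by N so that $r = \sum z_X z_Y / N$ holds exactly.

```python
import math

def pearson_r(xs, ys):
    """Pearson r as the mean cross product of z-scores (SDs divide by N)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    # Average of the cross products of the paired z-scores.
    return sum(a * b for a, b in zip(z_x, z_y)) / n

# Hypothetical heights (inches) and weights (lbs.), not data from the slides.
heights = [60, 63, 66, 69, 72, 75]
weights = [95, 120, 150, 155, 190, 200]
print(round(pearson_r(heights, weights), 3))
```

When every point lies on the line $z_X = z_Y$, each cross product equals a squared z-score, and the average of squared z-scores is 1, which is why r cannot exceed 1.0.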
7. Correlation Analysis
    - It measures the closeness of the relationship between two or more variables.
    - It gives the degree of association or covariation between variables, not causality.
    - Measures of association by level of measurement
    - Interpretation of correlation
        - t-test
8. Regression
9. Questions
    - What are predictors and criteria?
    - Write an equation for the linear regression. Describe each term.
    - How do changes in the slope and intercept affect (move) the regression line?
    - What does it mean to test the significance of the regression sum of squares? Of R-square?
    - What is R-square?
    - What does it mean to choose a regression line to satisfy the loss function of least squares?
    - How do we find the slope and intercept for the regression line with a single independent variable? (Either formula for the slope is acceptable.)
    - Why does testing the regression sum of squares give the same result as testing R-square?
10. Basic Ideas
    - Jargon
        - IV = X = Predictor (pl. predictors)
        - DV = Y = Criterion (pl. criteria)
        - Regression of Y on X, e.g., GPA on SAT
    - Linear model = relation between IV and DV represented by a straight line.
    - A score on Y has two parts, (1) a linear function of X and (2) error: $Y_i = \alpha + \beta X_i + \varepsilon_i$ (population values)
11. Regression Analysis
    - It refers to the techniques used to derive an equation that relates the criterion variable to one or more predictor variables.
    - Method of least squares
    - Standardized coefficients
    - Goodness of fit
        - F test, t test, coefficient of determination
    - Multicollinearity
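The method of least squares for a single predictor can be sketched as follows. This is a minimal illustration, not the author's code: the function name `fit_line` and the sample numbers are assumptions. It uses the usual closed-form slope (sum of cross products over the sum of squares of X) and reports the coefficient of determination as one minus the ratio of error to total sum of squares.

```python
def fit_line(xs, ys):
    """Least-squares slope, intercept, and R-square for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # R-square: share of the variance in Y accounted for by the line.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_error = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_square = 1 - ss_error / ss_total
    return slope, intercept, r_square

# Hypothetical predictor and criterion scores, for illustration only.
slope, intercept, r_square = fit_line([1, 2, 3], [1, 3, 2])
print(slope, intercept, r_square)
```

The line chosen this way minimizes the sum of squared errors, which is the least-squares loss function the slides refer to.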
12. Multiple linear regression
13. ANOVA as linear regression
14. Results
15. Raw & Standardized Regression Weights
    - Each X has a raw score slope, b.
    - The slope tells the expected change in Y if X changes 1 unit (strictly speaking, holding the other X variables constant).
    - Large b weights should indicate important variables, but b depends on the units (and hence the variance) of X.
    - A b for height in feet would be 12 times larger than the b for height in inches, because a 1-unit change in feet is a 12-unit change in inches.
    - If we standardize X and Y, all units of X are the same.
    - The relative size of b is now meaningful.
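A small sketch can make the unit dependence concrete. Assuming made-up height and weight values and hypothetical helper names (`slope`, `standardize`), it shows that the raw slope changes by a factor of 12 when height is re-expressed in a different unit, while the standardized slope does not change.

```python
import math

def slope(xs, ys):
    """Least-squares slope of Y on a single X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / sum((x - mean_x) ** 2 for x in xs)

def standardize(values):
    """Convert raw scores to z-scores (SD divides by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Hypothetical data: the same heights in inches and in feet.
heights_in = [60, 63, 66, 69, 72, 75]
heights_ft = [h / 12 for h in heights_in]
weights = [95, 120, 150, 155, 190, 200]

b_inches = slope(heights_in, weights)
b_feet = slope(heights_ft, weights)
beta_inches = slope(standardize(heights_in), standardize(weights))
beta_feet = slope(standardize(heights_ft), standardize(weights))

print(b_inches, b_feet)        # raw slopes differ by a factor of 12
print(beta_inches, beta_feet)  # standardized slopes agree
```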
16. Tests of $R^2$ vs. Tests of b
    - Slopes (b) tell about the relation between Y and the unique part of X. $R^2$ tells about the proportion of variance in Y accounted for by the set of predictors all together.
    - Correlations among the X variables increase the standard errors of the b weights but do not affect $R^2$ in the same way.
    - It is possible to get a significant $R^2$ but no or few significant b weights.
    - It is possible but unlikely to have a significant b but a nonsignificant $R^2$. Look to $R^2$ first. If it is not significant, avoid interpreting the b weights.
17. Testing Incremental $R^2$. You can start a regression with a set of one or more variables and then add predictors one or more at a time. When you add predictors, $R^2$ will never go down. It usually goes up, and you can test whether the increment in $R^2$ is significant or likely due to chance:

    $$F = \frac{(R_L^2 - R_S^2)/(k_L - k_S)}{(1 - R_L^2)/(N - k_L - 1)}$$

    where $R_L^2$ = R-square for the larger model, $R_S^2$ = R-square for the smaller model, $k_L$ = number of predictors in the larger model, $k_S$ = number of predictors in the smaller model, and $N$ = number of observations.
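The incremental test can be sketched in a few lines. The function name `incremental_f` and the example values are assumptions for illustration. The statistic is the gain in R-square per added predictor, divided by the unexplained variance of the larger model per residual degree of freedom.

```python
def incremental_f(r2_large, r2_small, k_large, k_small, n):
    """F statistic for the increment in R-square; returns (F, df1, df2).

    F = ((R2_L - R2_S) / (k_L - k_S)) / ((1 - R2_L) / (N - k_L - 1))
    """
    df1 = k_large - k_small   # number of predictors added
    df2 = n - k_large - 1     # residual df of the larger model
    f_stat = ((r2_large - r2_small) / df1) / ((1 - r2_large) / df2)
    return f_stat, df1, df2

# Hypothetical case: adding one predictor raises R-square from .40 to .50, N = 104.
f_stat, df1, df2 = incremental_f(0.50, 0.40, 3, 2, 104)
print(f_stat, df1, df2)
```

In this made-up case F is about 20 on 1 and 100 degrees of freedom, which would then be compared against an F table.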
18. (cont.)
    - In regression problems, the most commonly used indices of importance are the correlation, r, and the increment to R-square when the variable of interest is entered last. The latter is sometimes called the last-in R-square change. The last-in increment corresponds to the Type III sums of squares and is closely related to the b weight.
    - The correlation tells about the importance of the variable ignoring all other predictors.
    - The last-in increment tells about the importance of the variable as a unique contributor to the prediction of Y, above and beyond all other predictors in the model.
    - "Importance" is not well defined statistically when the IVs are correlated. This does not cover mediated models (path analysis).
19. Collinearity Defined
    - The problem of large correlations among the independent variables
    - Within the set of IVs, one or more IVs are (nearly) totally predicted by the other IVs.
    - In such a case, the b or beta weights are poorly estimated.
    - Problem of the "Bouncing Betas."
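One common way to quantify how poorly the weights are estimated is the variance inflation factor (VIF). The slides do not name it, so it is brought in here only as an illustration. For the two-predictor case, the VIF is $1 / (1 - r_{12}^2)$, where $r_{12}$ is the correlation between the predictors. The function names and data below are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson correlation between two lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = correlation(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: x2 nearly duplicates x1, x3 is uncorrelated with x1.
x1 = [1, 2, 3, 4, 5]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
x3 = [1, -1, 0, -1, 1]

print(vif_two_predictors(x1, x2))  # large: sampling variance of b is inflated
print(vif_two_predictors(x1, x3))  # 1: no inflation
```

A VIF near 1 means the predictor carries unique information. Very large values go with the unstable "bouncing" weights described above.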
20. Dealing with Collinearity
    - Lump it. Admit ambiguity; report the SE of the b weights. Refer also to the correlations.
    - Select or combine variables.
    - Factor analyze the set of IVs.
    - Use another type of analysis (e.g., path analysis).
    - Use another type of regression (ridge regression).
    - Use unit weights (no longer regression).
21. Diagnostics: Checking Assumptions and Bad Data
22. Good-Looking Graph: scatterplot of Y by X with no apparent departures from the line.
23. Problem with Linearity: scatterplot of Miles per Gallon by Horsepower (R Sq Linear = 0.595).
24. Outliers: scatterplot of Y by X with an outlier marked. Outlier = pathological point.
25. Non-parametric or Distribution-free Tests
26. Non-parametric or Distribution-free Tests
    - Two kinds of assertions are made in statistical tests:
        1. An assertion directly related to the purpose of the investigation, i.e., the hypothesis to be tested
        2. Assertions needed to make a probability statement. The set of all such assertions is called the model.
    - Testing a hypothesis without a model is a non-parametric test, that is, a test that makes no basic assumptions about, and requires no knowledge of, the distribution of the population or its parameters.
27. Characteristics
    1. Do not depend on any assumptions about the properties or parameters of the parent population, i.e., they do not suppose any particular distribution and its consequent assumptions. (Parametric tests such as the t and F tests assume homogeneity of variances.) Non-parametric tests make no such assumptions, or less restrictive ones.
    2. When measurements are not very accurate, non-parametric tests come in very handy.
    3. Most non-parametric tests assume only nominal or ordinal data, i.e., they are more suitable than parametric tests for nominal and ordinal (or rated) data.
    4. Involve few arithmetic computations.
28. (cont.)
    5. Usually less efficient and less powerful than parametric tests, as they rest on fewer assumptions.
    6. Greater risk of accepting a false hypothesis and committing a Type II error; non-parametric tests require more observations than parametric tests to achieve the same Type I and Type II error rates.
    7. The null hypothesis is somewhat loosely defined, so rejecting it may lead to a less precise conclusion than with parametric tests.
    8. There is a trade-off between the loss of sharpness in interval estimates and the gain in being able to use less information and to calculate faster.
29. Some important applications concern:
    - (I) a single value for the given data
    - (II) differences among two or more sets of data
    - (III) relations between variables
    - (IV) variation in the given data
    - (V) randomness of a sample
    - (VI) association or dependency of categorical data
    - (VII) comparison of a theoretical population with actual data in categories
30. Typical situations
    1. Data not likely to be normally distributed
    2. Nominal data from responses to a questionnaire
    3. Partially completed questionnaires, i.e., incomplete or missing data that require adjustments to extract the maximum information from the available data
    4. Reasonably good results from even a very small sample, although more observations are needed than with parametric tests to achieve the same Type I and Type II error rates
32. McNemar Test
    - Useful for testing nominal data from two related samples, such as before-after measurements of the same subjects, to judge the significance of any observed change after treatment.
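A minimal sketch of the test, assuming the usual form of the statistic without continuity correction: only the subjects who changed matter, and the statistic is $(b - c)^2 / (b + c)$ with 1 degree of freedom, where b and c count changes in each direction. The function name and the counts are hypothetical.

```python
import math

def mcnemar(b, c):
    """McNemar chi-square (1 df, no continuity correction) and its p-value.

    b and c are the two discordant counts: subjects who changed in one
    direction and subjects who changed in the other.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # For 1 df, P(chi-square > x) equals P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical before-after study: 15 subjects changed one way, 5 the other.
chi2, p_value = mcnemar(15, 5)
print(chi2, round(p_value, 4))
```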
33. Chi-Square Test
    - An important non-parametric test for significance of association, as well as for testing hypotheses regarding (i) goodness of fit and (ii) homogeneity or significance of population variance
    - Used when responses are classified into two mutually exclusive classes such as favor/not favor, like/dislike, etc.
    - Finds whether differences exist between observed and expected data
    - $\chi^2$ is not a measure of the degree of relationship
    - Assumes random observations
    - Items in the sample are independent
34. (cont.)
    - Constraints are linear, no cell should contain a frequency below five, and the overall number of items must be reasonably large. (Yates' correction can be applied to a 2x2 table if cell frequencies are smaller than five.) Alternatively, use the Kolmogorov-Smirnov test.
    - Phi coefficient, $\phi = \sqrt{\chi^2 / N}$: a non-parametric measure of correlation that helps estimate the magnitude of association
    - Cramer's V measure: $V = \sqrt{\phi^2 / \min(r-1,\, c-1)}$, where r and c are the numbers of rows and columns
    - Coefficient of contingency, $C = \sqrt{\chi^2 / (\chi^2 + N)}$, also known as the coefficient of mean square contingency: a non-parametric measure of relationship, useful where contingency tables are of higher order than 2x2 and combining classes is not possible for Yule's coefficient of association
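The chi-square statistic and the association measures built on it can be sketched as below, using their standard definitions. The function names and the 2x2 table are illustrative assumptions. Expected counts come from row total times column total over N, and no Yates correction is applied.

```python
import math

def chi_square(table):
    """Pearson chi-square for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def association_measures(table):
    """Phi, Cramer's V, and the contingency coefficient C from chi-square."""
    chi2, n = chi_square(table)
    rows, cols = len(table), len(table[0])
    phi = math.sqrt(chi2 / n)
    cramers_v = math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))
    contingency_c = math.sqrt(chi2 / (chi2 + n))
    return {"chi2": chi2, "phi": phi, "V": cramers_v, "C": contingency_c}

# Hypothetical 2x2 table (e.g., favor / not favor by group), not from the slides.
observed = [[10, 20], [20, 10]]
print(association_measures(observed))
```

For a 2x2 table, phi and V coincide because min(r-1, c-1) is 1.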
35. Wilcoxon-Mann-Whitney U-Test
    - Among the most powerful non-parametric tests for determining whether two independent samples have been drawn from the same population. Used as an alternative to the t-test for both qualitative and quantitative data.
    - The two samples are pooled and the elements arranged in ascending order to find U.
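The pooling and ranking step can be sketched as follows. The helper names and the sample values are assumptions. Tied values receive the average of the ranks they span, and U for a sample is its rank sum minus $n(n+1)/2$.

```python
def average_ranks(values):
    """Ranks in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(sample_a, sample_b):
    """Return (U, U_a, U_b), where U is the smaller of the two U values."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b), u_a, u_b

# Hypothetical scores for two independent groups.
print(mann_whitney_u([1, 2, 2], [2, 3]))
```

The smaller U is then compared with a table of critical values (or a normal approximation for larger samples) to judge significance.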
36. Wilcoxon Matched Pair or Signed Rank Test
    - Used with two related samples where both the direction and the magnitude of the difference can be determined. Examples: wife and husband, subjects studied before and after an experiment, comparing the output of two machines, etc.
    - Because it gives greater weight to a pair that shows a larger difference, it is a more powerful test than the sign test.
    - The null hypothesis ($H_0$) is that there is no difference between the two groups with respect to the characteristic under study.
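A minimal sketch of the signed rank computation, with hypothetical names and data: zero differences are dropped, absolute differences are ranked (ties share an average rank), and the ranks are summed separately for positive and negative differences. The smaller sum is the test statistic.

```python
def average_ranks(values):
    """Ranks in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(before, after):
    """Return (T, W_plus, W_minus) for paired observations."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zeros
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), w_plus, w_minus

# Hypothetical before-after scores for five subjects.
before = [10, 12, 9, 14, 11]
after = [12, 15, 8, 18, 11]
print(wilcoxon_signed_rank(before, after))
```

Larger differences receive larger ranks, which is how the test weights magnitude as well as direction.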
37. K-Sample (i.e., more than two samples) Tests: the Kruskal-Wallis Test or H Test
    - Similar to the U test
    - $H_0$: the K independent random samples come from identical universes. It does not require the approximation of a normal distribution, as H follows the chi-square distribution; use the chi-square table.
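The H statistic can be sketched with the standard formula $H = \frac{12}{N(N+1)} \sum_j \frac{R_j^2}{n_j} - 3(N+1)$, where $R_j$ is the rank sum of group j in the pooled ranking. The names and data are illustrative, and this sketch applies no correction for ties.

```python
def average_ranks(values):
    """Ranks in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return (H, df) for a list of samples; no tie correction is applied."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_sum = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)
    return h, len(groups) - 1

# Hypothetical scores for three independent groups.
h, df = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(h, df)
```

H is referred to the chi-square table with K - 1 degrees of freedom, an approximation that improves with larger samples.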
38. A few points on K-W
    - Calculation of P-values (avoiding Type I errors)
        - F statistic: F distribution (requires normality)
        - K-W statistic: $\chi^2$ distribution (requires large samples)
        - Either statistic: permutation tests
    - Power (avoiding Type II errors)
        - K-W statistic is more resistant to outliers
        - F statistic is more powerful in the case of normality
    - K-W statistic: no need to worry about transformations
40. Multivariate Analysis
    - Discriminant analysis
        - It joins a nominally scaled criterion or dependent variable with one or more independent variables that are interval or ratio scaled.
    - Multivariate ANOVA
        - Assesses the relationship between two or more dependent variables and classificatory variables or factors
    - LISREL (Linear Structural Relationships)
        - Measurement and structural equation model
        - Causality testing
41. Interdependency Techniques
    - Factor analysis
        - A factor is a linear combination of variables
        - Constructs a new set of variables based on the relationships in the correlation matrix
        - Factor loading
        - Orthogonal or oblique rotation
    - Cluster analysis
        - A set of techniques for grouping similar objects or people