- 1. Statistical inference ► Estimation ► Hypothesis testing. 7/5/2023, By Asaye
- 2. Objectives: After completing this session, learners will be able to carry out parameter estimation (point estimates and confidence intervals), hypothesis testing (Z-test and t-test), and tests of association (chi-square test).
- 3. Sampling distribution: A sampling distribution is the distribution of all possible values of a statistic computed from samples of the same size randomly selected from the same population. In other words, it is the probability distribution of a sample statistic, formed when samples of size n are repeatedly taken from the population. Some sample values will be higher than the population parameter and some will be lower.
- 4. Sampling distribution (cont.): We treat sample statistics as random variables. For example, the age of an individual is a random variable; similarly, the mean age of a sample is a random variable. No conclusion about the value of a population parameter should be based on a single individual value; it should be based on a sample statistic computed from an adequate sample size.
- 5. Sampling distribution (cont.): Construction of a sampling distribution: 1. From a population of size N, randomly draw all possible samples of size n. 2. Compute the statistic of interest for each sample. 3. Create a frequency distribution of the statistic.
- 6. A. Sampling distribution of the sample mean
- 7. Example: sampling distribution of sample mean. The population values {18, 20, 22, 24} are put in a box, and two observations are randomly selected with replacement. Find the mean, variance, and standard deviation of the population. Solution: Mean: $\mu = \frac{\sum X_i}{N} = \frac{84}{4} = 21$. Variance: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N} = \frac{20}{4} = 5$. Standard deviation: $\sigma = \sqrt{5} = 2.236$.
- 8. Example: sampling distribution of sample mean. Now consider all possible samples of size n = 2: there are 16 possible samples (with replacement), giving 16 sample means.
- 9. Example: sampling distribution of sample mean. List all the possible samples of size n = 2 and calculate the mean of each sample. Solution:

  | Sample | $\bar{X}$ | Sample | $\bar{X}$ |
  |---|---|---|---|
  | 18, 18 | 18 | 22, 18 | 20 |
  | 18, 20 | 19 | 22, 20 | 21 |
  | 18, 22 | 20 | 22, 22 | 22 |
  | 18, 24 | 21 | 22, 24 | 23 |
  | 20, 18 | 19 | 24, 18 | 21 |
  | 20, 20 | 20 | 24, 20 | 22 |
  | 20, 22 | 21 | 24, 22 | 23 |
  | 20, 24 | 22 | 24, 24 | 24 |

  These means form the sampling distribution of the sample mean.
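- 10. Example: sampling distribution of sample mean. Construct the frequency distribution of the sample means. Counting the 16 sample means gives:

  | $\bar{X}$ | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
  |---|---|---|---|---|---|---|---|
  | Frequency | 1 | 2 | 3 | 4 | 3 | 2 | 1 |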
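- 11. Example: sampling distribution of sample mean. Find the mean, variance and standard deviation of the 16 sample means. Mean: $\mu_{\bar{x}} = \frac{\sum \bar{x}_i}{16} = \frac{18 + 19 + 20 + \cdots + 24}{16} = 21$. Variance: $\sigma_{\bar{x}}^2 = \frac{\sum (\bar{x}_i - \mu_{\bar{x}})^2}{16} = 2.5$, so $\sigma_{\bar{x}} = \sqrt{2.5} = 1.581$. These results satisfy the properties of the sampling distribution of the sample mean: $\mu_{\bar{x}} = \mu = 21$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{5}}{\sqrt{2}} = 1.581$.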
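
The enumeration in this example can be checked with a short script. This is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the original slides.

```python
from itertools import product
from statistics import fmean, pvariance

population = [18, 20, 22, 24]
n = 2  # sample size

# All ordered samples of size n drawn with replacement (4**2 = 16 samples)
samples = list(product(population, repeat=n))
sample_means = [fmean(s) for s in samples]

mu = fmean(population)              # population mean
sigma2 = pvariance(population)      # population variance (divides by N)
mu_xbar = fmean(sample_means)       # mean of the 16 sample means
var_xbar = pvariance(sample_means)  # variance of the 16 sample means

print(len(samples), mu, sigma2)
print(mu_xbar, var_xbar, sigma2 / n)  # var_xbar should equal sigma2 / n
```

The last line compares the variance of the sample means with $\sigma^2 / n$, which is the property stated on the slide.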
- 12. Example: sampling distribution of sample mean. The 16 sample means by 1st and 2nd observation:

  | 1st obs \ 2nd obs | 18 | 20 | 22 | 24 |
  |---|---|---|---|---|
  | 18 | 18 | 19 | 20 | 21 |
  | 20 | 19 | 20 | 21 | 22 |
  | 22 | 20 | 21 | 22 | 23 |
  | 24 | 21 | 22 | 23 | 24 |

  [Figure: distribution of the 16 sample means, $P(\bar{X})$ plotted against $\bar{X}$ = 18, 19, ..., 24.]
- 13. Comparing the population with its sampling distribution. [Figure: two bar charts of P(x). Population: N = 4, values 18, 20, 22, 24, each with P(x) = 1/4, $\mu = 21$, $\sigma = 2.236$. Distribution of sample means: n = 2, values 18 to 24, $\mu_{\bar{x}} = 21$, $\sigma_{\bar{x}} = 1.58$.]
- 14. Properties of the sampling distribution of the mean. A. Sampling from normally distributed populations: a. If a population is normal with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of $\bar{x}$ is also normally distributed, with $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$. The standard deviation of any sample statistic is called its standard error.
- 15. Cont… b. The mean $\mu_{\bar{x}}$ of the distribution of the sample mean is equal to the mean of the population from which the samples were drawn. c. The variance of the distribution of the sample mean is equal to the variance of the population divided by the sample size.
- 16. Sampling from non-normally distributed populations: We can apply the Central Limit Theorem: even if the population is not normal, the means of samples of size n ≥ 30 drawn from any population with mean $\mu$ and standard deviation $\sigma$ will be approximately normally distributed. The sampling distribution of the sample means has $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
- 17. Sampling distribution of a proportion: Suppose we choose a random sample of size n. The sampling distribution of the sample proportion p possesses the following properties. The sample proportion p is an estimate of the population proportion P. The standard deviation of p is equal to $\sqrt{\frac{p(1-p)}{n}}$ (called the standard error of the proportion). Provided n is large enough, the shape of the sampling distribution of p is approximately normal.
- 18. Types of estimation: There are two methods of estimation: 1. Point estimation 2. Interval estimation
- 19. Point estimation involves the calculation of a single value to estimate the population parameter. Interval estimation specifies a range of values assumed to include the population parameter.
- 20. 1. Point estimation. A parameter is a numerical descriptive measure of a population (e.g. μ). A statistic is a numerical descriptive measure of a sample (e.g. $\bar{X}$); it estimates the population parameter. A point estimate of a population parameter is a single value of a sample statistic. To each sample statistic there corresponds a population parameter.
- 21. Sample statistics and corresponding population parameters: sample mean ($\bar{X}$) and population mean (μ); sample variance ($S^2$) and population variance ($\sigma^2$); sample standard deviation (SD) and population standard deviation (σ); sample proportion (p) and population proportion (P or π).
- 22. Point estimation (cont.): If a random sample of 100 drug-related patients has a mean survival time of 46.9 months, what is the point estimate of the population mean? Answer: 46.9 months.
- 23. 2. Interval estimation. A point estimate gives no indication of how far away the parameter may lie. An interval estimate instead gives a range that has a high probability of containing the parameter: it is a statement that a population parameter has a value lying between two specified limits. Such interval estimates are called confidence intervals (CI).
- 24. Confidence interval (CI): A confidence interval defines an interval within which the true population parameter is likely to fall (an interval estimate). It therefore takes into account the sample-to-sample variation of the statistic and gives a measure of precision. Confidence intervals express the inherent uncertainty in any medical study by giving upper and lower bounds for the true underlying population parameter.
- 25. Confidence interval (CI), cont.: Most commonly 95% confidence intervals are calculated; however, 90% and 99% confidence intervals are sometimes used. The probability that the interval contains the true population parameter is (1 − α)100%. If we were to select 100 random samples from the population and calculate a 95% confidence interval for each, approximately 95 of them would include the true population mean μ (and 5 would not).
- 26. Confidence interval (CI), cont.: Components of an interval estimate: Estimator ± margin of error, i.e. Estimator ± (reliability coefficient) × (standard error). The precision of the estimate, or margin of error (d), equals reliability coefficient × standard error, where the reliability coefficient (RC) is the value of the given probability distribution corresponding to the 100(1 − α)% confidence level (for a two-sided interval, its 100(1 − α/2)th percentile), and the standard error (SE) is the standard deviation of the sampling distribution of the sample statistic.
- 27. Reliability coefficient: The standardized z value corresponding to the given level of confidence. Z = 1.645 if the confidence level is 90%; Z = 1.96 if the confidence level is 95%; Z = 2.58 if the confidence level is 99%. A wide interval suggests imprecise estimation. A narrow CI reflects a large sample size, low variability, or a lower confidence level. The confidence coefficient is the confidence level expressed as a proportion; e.g. for a confidence level of 99%, the confidence coefficient is 0.99.
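
The z values listed on this slide can be reproduced from the standard normal distribution. The following is a minimal sketch using `statistics.NormalDist` from the Python standard library; the function name `reliability_coefficient` is illustrative, not from the slides.

```python
from statistics import NormalDist

def reliability_coefficient(confidence):
    """Two-sided z value: the 100(1 - alpha/2)th percentile of N(0, 1)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(level, round(reliability_coefficient(level), 3))
```

To three decimals the 99% value is 2.576, which the slides round to 2.58.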
- 28. Confidence level: The confidence level is the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected and that the estimation process on the same parameter is repeated. It is denoted by 100(1 − α)%. A relative-frequency interpretation: in the long run, 100(1 − α)% of all confidence intervals that can be constructed will contain the unknown parameter. A specific interval will either contain or not contain the unknown parameter.
- 29. Normal or t-distribution? Decision flowchart:
  - Is n ≥ 30?
    - Yes: use the normal distribution (Z). If σ is unknown, use s instead.
    - No: is the population normally, or approximately normally, distributed?
      - No: cannot use the normal or t-distribution.
      - Yes: is the variance ($\sigma^2$) known?
        - Yes: use the normal distribution (Z).
        - No: use the t-distribution with n − 1 degrees of freedom.
- 30. Confidence interval for a single population mean. 1. When the variance is known, whether the sample size is large or small, the C.I. has the form $\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}$. When n ≥ 30 but σ is unknown, use $\bar{X} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$. 2. When the variance is unknown and the sample size is small, the C.I. has the form $\bar{X} - t_{1-\alpha/2,\,n-1}\frac{s}{\sqrt{n}} < \mu < \bar{X} + t_{1-\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$, with d.f. = n − 1.
- 31. Example: In a normally distributed population, the mean reading speed of a random sample of 81 adults is 325 words per minute. Find a 90% C.I. for the mean reading speed of all adults (μ) if it is known that the standard deviation for all adults is 45 words per minute. Given: n = 81, σ = 45, $\bar{x} = 325$, $z_{\alpha/2} = 1.645$. The standard error is $\sigma/\sqrt{n} = 45/\sqrt{81} = 5$, so a 90% C.I. for μ is 325 ± (1.645 × 5) = 325 ± 8.2 = (316.8, 333.2). Therefore, a 90% CI for μ is 316.8 to 333.2 words per minute.
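
The reading-speed interval can be reproduced as follows. This is a minimal sketch assuming a known population standard deviation, as in the example; the helper name `z_interval_mean` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval_mean(xbar, sigma, n, confidence):
    """CI for a mean with known sigma: xbar +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

low, high = z_interval_mean(xbar=325, sigma=45, n=81, confidence=0.90)
print(round(low, 1), round(high, 1))
```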
- 33. CI for the difference of means, independent samples. 1. When the variances are known: $CI = (\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. 2. When the variances are unknown and the sample size is less than 30, use the t-distribution instead of the z-distribution: $CI = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,\,n_1+n_2-2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.
- 34. Example: A random sample of 50 non-smokers has a mean lifetime of 76 years with a standard deviation of 8 years, and a random sample of 65 smokers has a mean lifetime of 68 years with a standard deviation of 9 years. Find a 95% C.I. for the difference in mean lifetime between non-smokers and smokers.
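
The slide leaves this example unsolved. One way to approach it is sketched below, under the assumption that both samples are large enough (n ≥ 30) for the z-based interval, with the sample standard deviations standing in for the population values. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval_diff_means(x1, s1, n1, x2, s2, n2, confidence=0.95):
    """Large-sample CI for mu1 - mu2 using sample SDs in place of sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    diff = x1 - x2
    return diff - z * se, diff + z * se

# Non-smokers (group 1) vs smokers (group 2)
low, high = z_interval_diff_means(76, 8, 50, 68, 9, 65)
print(round(low, 2), round(high, 2))
```

Under these assumptions the interval is roughly 4.9 to 11.1 years, and it does not include 0.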
- 35. Confidence interval for a single population proportion (P): A sample is drawn from the population of interest and the sample proportion is computed as $\hat{p} = \frac{\text{no. of elements in the sample with some characteristic}}{\text{total no. of elements in the sample}} = \frac{x}{n}$. This sample proportion is used as the point estimator of the population proportion, and the confidence interval is $\hat{P} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{P}(1-\hat{P})}{n}}$.
- 36. Single proportion (cont.): 2. In Addis Ababa, a survey of 350 students showed that 28% carried their lunch to school. Find the 95% CI for the true population proportion of students who carried their lunch to school. 3. Suppose that 22 out of 100 people in Debre Tabor were obese. Find the 95% confidence interval for the true population proportion.
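
Both exercises apply the single-proportion interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$. The sketch below is one possible way to work them, assuming the normal approximation holds; the helper name `proportion_interval` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(p_hat, n, confidence=0.95):
    """Normal-approximation CI for a single population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

lunch = proportion_interval(0.28, 350)      # exercise 2
obese = proportion_interval(22 / 100, 100)  # exercise 3
print([round(v, 3) for v in lunch])
print([round(v, 3) for v in obese])
```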
- 37. CI for the difference between two population proportions: Two samples are drawn from two independent populations of interest, and the sample proportion for the characteristic of interest is computed for each sample. An unbiased point estimator of the difference between the two population proportions, $P_1 - P_2$, is $\hat{p}_1 - \hat{p}_2$.
- 38. CI for the difference between two population proportions: A 100(1 − α)% confidence interval for $P_1 - P_2$ is given by $(\hat{P}_1 - \hat{P}_2) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{P}_1(1-\hat{P}_1)}{n_1} + \frac{\hat{P}_2(1-\hat{P}_2)}{n_2}}$.
- 39. Example: A researcher investigated gender differences in sexual abuse in a sample of 323 adults (68 females and 255 males). In the sample, 31 of the females and 53 of the males reported sexual abuse. We wish to construct a 99% C.I. for the difference between the proportions of sexual abuse in the two sampled populations.
- 40. Example (cont.): 1 − α = 0.99 → α = 0.01 → α/2 = 0.005 → 1 − α/2 = 0.995, so $z_{1-\alpha/2} = z_{0.995} = 2.58$; $n_F = 68$, $n_M = 255$. $\hat{p}_F = \frac{a_F}{n_F} = \frac{31}{68} = 0.4559$ and $\hat{p}_M = \frac{a_M}{n_M} = \frac{53}{255} = 0.2078$. The interval is $(\hat{P}_F - \hat{P}_M) \pm z_{1-\alpha/2}\sqrt{\frac{\hat{P}_F(1-\hat{P}_F)}{n_F} + \frac{\hat{P}_M(1-\hat{P}_M)}{n_M}} = (0.4559 - 0.2078) \pm 2.58\sqrt{\frac{0.4559(1-0.4559)}{68} + \frac{0.2078(1-0.2078)}{255}}$.
- 41. Example (cont.): 0.2481 ± 2.58(0.0655) = (0.07914, 0.4171). Interpretation: ?
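
The arithmetic of this example can be verified with the sketch below. It uses z = 2.58 as on the slide rather than the more precise 2.576; the function name is illustrative. Small differences in the last digits come from the slide rounding the proportions before subtracting.

```python
from math import sqrt

def diff_proportions_interval(x1, n1, x2, n2, z):
    """CI for P1 - P2 with the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Females: 31 of 68; males: 53 of 255; z = 2.58 for 99% confidence
low, high = diff_proportions_interval(31, 68, 53, 255, z=2.58)
print(round(low, 3), round(high, 3))
```

Because the whole interval lies above 0, the data are consistent with a higher proportion among the females sampled than among the males.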
- 42. C. Paired samples: Tests the means of two related populations: paired or matched samples, or repeated measures (before/after). Use the difference between paired values, $d = x_1 - x_2$, which eliminates variation among subjects. Assumptions: both populations are normally distributed or, if not normal, large samples are used.
- 43. Examples: Paired data arise when each individual (more specifically, each unit of measurement) in a sample is measured twice, e.g. blood pressure prior to and following treatment. Notice that the two occasions of measurement are linked by virtue of the two measurements being made on the same individual.
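- 44. The confidence interval for the mean of the paired differences is $\bar{d} \pm t_{\alpha/2}\frac{s_d}{\sqrt{n}}$, where $t_{\alpha/2}$ has n − 1 df.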
- 45. Example: Ten hypertensive patients are screened at a neighborhood health clinic and are given methyldopa, a strong antihypertensive medication, for their condition. They are asked to come back 1 week later and have their blood pressures measured again. Suppose the initial and follow-up SBPs (mm Hg) of the patients are given below.
- 47. We have the following data and summary statistics
- 48. Summary: Students sometimes have difficulty deciding whether to use $z_{\alpha/2}$ or $t_{\alpha/2}$ values when finding confidence intervals.
- 49. Hypothesis testing: A statistical hypothesis is a statement about the population under study or about the distribution of a quantity under consideration. Researchers are interested in answering many types of questions. For example, a physician might want to know whether a new medication will lower a person's blood pressure. These types of questions can be addressed through statistical hypothesis testing, which is a decision-making process for evaluating claims about a population.
- 50. Hypothesis testing: A hypothesis is a testable statement that describes the nature of the proposed relationship between two or more variables of interest. In hypothesis testing, the researcher must define the population under study, state the particular hypotheses that will be investigated, give the significance level, select a sample from the population, collect the data, perform the calculations required for the statistical test, and reach a conclusion.
- 51. Types of hypotheses: The null hypothesis (represented by H0) is the statement about the value of the population parameter (the default statement). It postulates that "there is no difference between factor and outcome" or "there is no intervention effect." The alternative hypothesis (represented by HA) is the hypothesis that the researcher wants to test or claim; it states the opposing view that "there is a difference between factor and outcome" or "there is an intervention effect." Level of significance (α): the percentage of sample means that fall outside certain prescribed limits when the null hypothesis is true, i.e. the probability of rejecting a true H0.
- 53. Methods of hypothesis testing: Hypotheses are statements about parameters that may or may not be true. The three methods used to test hypotheses are: the traditional method, the P-value method, and the confidence interval method.
- 54. Steps in hypothesis testing: 1. Identify the null hypothesis H0 and the alternative hypothesis HA. 2. Choose α. The value should be small, usually less than 10%. It is important to consider the consequences of both types of error. 3. Select the test statistic and determine its value from the sample data. 4. Compare the observed value of the statistic to the critical value obtained for the chosen α. 5. Make a decision. 6. State the conclusion.
- 55. Test statistic: A test statistic is a value we can compare with the known distribution of what we expect when the null hypothesis is true. The general formula of the test statistic is: $\text{test statistic} = \frac{\text{observed value} - \text{hypothesized value}}{\text{standard error}}$.
- 56. Critical value: The critical value separates the critical region from the non-critical region for a given level of significance.
- 57. Decision making: Reject or do not reject the null hypothesis. There are two types of error, Type I and Type II. A Type I error is usually regarded as the more serious error; its probability is the level of significance α. Power is the probability of rejecting a false null hypothesis and is given by 1 − β.
- 60. Types of errors: Type I error: we reject the null hypothesis when it is true (H0 is wrongly rejected). E.g. H0: there is no difference between two drugs on average. A Type I error occurs if we conclude that the two drugs produce different effects when actually there is no difference. Prob(Type I error) = α. Type II error: we fail to reject the null hypothesis when it is false. E.g. H0: there is no difference between two drugs on average. A Type II error occurs if we conclude that the two drugs produce the same effect when actually there is a difference. Prob(Type II error) = β.
- 61. Hypothesis testing about a population mean (μ). Two-tailed test (the value of the sample statistic may fall into either tail of the distribution). The large-sample (n ≥ 30) test of a hypothesis about a population mean μ is as follows: 1. $H_0: \mu = \mu_0$ vs $H_A: \mu \neq \mu_0$, with test statistic $Z_{cal} = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
- 62. Hypothesis testing about a population mean (μ): $Z_{tab} = Z_{\alpha/2}$. Decision rule: reject H0 if the Z value falls in the rejection region; do not reject H0 if it falls in the non-rejection region. That is, if $|Z_{cal}| \geq Z_{tab}$, reject H0; if $|Z_{cal}| < Z_{tab}$, do not reject H0. If n < 30 and the variance is unknown, use $t_{cal} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with n − 1 d.f.; the decision rule is the same as for the calculated z.
- 63. One-tailed tests: 2. $H_0: \mu = \mu_0$ vs $H_A: \mu < \mu_0$, with $Z_{tab} = Z_{\alpha}$. Decision: if $Z_{cal} \geq -Z_{tab}$, do not reject H0; if $Z_{cal} < -Z_{tab}$, reject H0. 3. $H_0: \mu = \mu_0$ vs $H_A: \mu > \mu_0$. Decision: if $Z_{cal} \geq Z_{tab}$, reject H0; if $Z_{cal} < Z_{tab}$, do not reject H0.
- 64. The p-value: The p-value is the probability of obtaining a result at least as extreme as the one observed if chance alone were operating, i.e. if the null hypothesis is true. A large p-value means that the observed value could easily occur just by chance when the null hypothesis is true. With a small p-value, chance alone is an unlikely explanation, which suggests that there may be sufficient evidence for rejecting the null hypothesis. Formally, the p-value is the probability of observing the computed test value or a more extreme one if H0 is true; for example, $P(Z \geq Z_{cal} \mid H_0 \text{ true})$.
- 65. The p-value (cont.): A p-value is the probability of getting the observed difference, or one more extreme, in the sample purely by chance from a population where the true difference is zero. If the p-value is greater than 0.05 then, by convention, we conclude that the observed difference could have occurred by chance and there is no statistically significant evidence (at the 5% level of significance) for a difference between the groups in the population.
- 66. P-value and confidence interval: Confidence intervals are preferable because they give information about the size of any difference in the population, and they also indicate the amount of uncertainty remaining about the size of the difference. When the null hypothesis is rejected in a hypothesis-testing situation, the confidence interval for the mean using the same level of significance will not contain the hypothesized mean.
- 67. P-value and confidence interval (cont.): For what p-values should we reject the null hypothesis? By convention, a p-value of 0.05 or smaller is considered sufficient evidence for rejecting the null hypothesis. By using a cut-off of 0.05, we allow a 5% chance of wrongly rejecting the null hypothesis when it is in fact true. When the p-value is less than or equal to 0.05, we often say that the result is statistically significant.
- 68. Example 1: A simple random sample of 10 people from a certain population has a mean age of 27. Can we conclude that the mean age of the population is not 30? The variance is known to be 20. Let α = 0.05.
- 69. Example 1 (cont.): Solution: 1. State the hypotheses: $H_0: \mu = 30$ vs $H_A: \mu \neq 30$. 2. Determine the level of significance: α = 0.05. 3. Calculate the test statistic: $Z_{cal} = \frac{27 - 30}{\sqrt{20}/\sqrt{10}} = -2.12$. 4. Determine the critical value: the Z critical value at α/2 = 0.025 is 1.96. 5. Make a decision: we reject the null hypothesis since $|Z_{cal}| = 2.12 \geq Z_{tab} = 1.96$; that is, $Z_{cal} = -2.12$ is in the rejection region. 6. Conclusion: the mean age of the population is different from 30 at the 5% level of significance. The p-value is 0.034, which also leads us to conclude that μ is not 30.
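
The test statistic and two-sided p-value in this example can be checked with the sketch below, which assumes a known population variance as stated in the problem. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, variance, n, alpha = 27, 30, 20, 10, 0.05

z_cal = (xbar - mu0) / sqrt(variance / n)            # test statistic
z_tab = NormalDist().inv_cdf(1 - alpha / 2)          # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z_cal)))     # two-sided p-value

reject_h0 = abs(z_cal) >= z_tab
print(round(z_cal, 2), round(z_tab, 2), round(p_value, 3), reject_h0)
```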
- 70. Example 2: Suppose that we have a hypothesized population mean of 3.1 and a sample of n = 20 people with $\bar{x} = 4.5$. 1. $H_0: \mu = 3.1$ vs $H_A: \mu \neq 3.1$. 2. α = 0.05 (95% confidence level). 3. Our test statistic is:
- 71. Example 2 (cont.): 4. The observed value of the test statistic falls within the non-rejection region, i.e. $t_{cal} = 1.14 < t_{tab} = 2.09$, so we do not reject H0. 5. We conclude that there is not enough evidence to reject the null hypothesis.
- 72. Hypothesis testing for a single proportion. Example: In a study of childhood abuse in psychiatric patients, Brown found that 166 in a sample of 947 patients reported histories of physical or sexual abuse. Test the hypothesis that the true population proportion is 30%. Solution: to test the hypothesis we follow the steps below.
- 73. Example (cont.): Step 1: State the hypotheses: $H_0: P = P_0 = 0.3$ vs $H_A: P \neq 0.3$. Step 2: Fix the level of significance (α = 0.05). Step 3: Determine the critical value: $Z_{tab} = Z_{\alpha/2} = 1.96$. Step 4: Compute the calculated value of the test statistic: $Z_{cal} = \frac{\hat{p} - P_0}{\sqrt{P_0 q_0 / n}} = \frac{0.175 - 0.3}{\sqrt{0.3(0.7)/947}} = \frac{-0.125}{0.0149} = -8.39$.
- 74. Example (cont.): Step 5: Make a decision: reject H0 since $|Z_{cal}| = 8.39 \geq Z_{tab} = 1.96$. Step 6: Conclusion: there is statistical evidence that the true population proportion is different from 0.3 (30%).
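
The steps above can be sketched in code as follows. Note that the script keeps the unrounded sample proportion 166/947, so its statistic (about −8.37) differs slightly from the slide's −8.39, which rounds the proportion to 0.175 first. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """z statistic for H0: P = p0, using p0 in the standard error."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z_cal = one_proportion_z(166, 947, 0.30)
z_tab = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed, alpha = 0.05
print(round(z_cal, 2), abs(z_cal) >= z_tab)
```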
- 75. Hypothesis testing for two sample means: $H_0: \mu_1 - \mu_2 = 0$ vs $H_A: \mu_1 - \mu_2 \neq 0$, $H_A: \mu_1 - \mu_2 < 0$, or $H_A: \mu_1 - \mu_2 > 0$.
- 76. Example: Researchers wish to know if the data they have collected provide sufficient evidence to indicate a difference in mean serum uric acid levels between normal individuals and individuals with Down's syndrome. The data consist of serum uric acid readings on 12 individuals with Down's syndrome and 15 normal individuals. The means are 4.5 mg/100 ml and 3.4 mg/100 ml, with standard deviations of 2.9 and 3.5 mg/100 ml, respectively, and variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 1.5$, respectively. Is there a difference between the means of the two groups at the 5% level of significance? Hypotheses: $H_0: \mu_1 - \mu_2 = 0$ vs $H_A: \mu_1 - \mu_2 \neq 0$ (equivalently $H_A: \mu_1 \neq \mu_2$).
- 77. Cont… With α = 0.05, the critical values of Z are −1.96 and +1.96. We reject H0 if Z < −1.96 or Z > +1.96. Reject H0 because the calculated Z = 2.57 > 1.96. At the 5% level of significance, there is statistically significant evidence that the population means are not equal.
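
The value Z = 2.57 quoted above is consistent with the known-variance formula $Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$. The sketch below assumes that the variances 1 and 1.5 are the population variances used in the test; that reading is an assumption, since the slide's own calculation is not shown.

```python
from math import sqrt

def two_sample_z(x1, var1, n1, x2, var2, n2, d0=0.0):
    """z statistic for H0: mu1 - mu2 = d0 with known population variances."""
    return ((x1 - x2) - d0) / sqrt(var1 / n1 + var2 / n2)

# Down's syndrome group (n = 12) vs normal group (n = 15)
z_cal = two_sample_z(4.5, 1.0, 12, 3.4, 1.5, 15)
print(round(z_cal, 2), abs(z_cal) > 1.96)
```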
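- 78. Hypothesis testing for two proportions: Suppose that $n_1$ and $n_2$ are large enough so that $n_1 p_1 \geq 5$, $n_1(1 - p_1) \geq 5$, $n_2 p_2 \geq 5$, and $n_2(1 - p_2) \geq 5$. To test the hypothesis $H_0: P_1 - P_2 = 0$ vs $H_A: P_1 - P_2 \neq 0$, the test statistic is $Z_{cal} = \frac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sigma_{\hat{p}_1 - \hat{p}_2}}$, where $D_0$ is the hypothesized value of $P_1 - P_2$ (0 under H0) and $\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$ is the standard error of the difference.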
- 79. Example: Two hundred patients suffering from a certain disease were randomly divided into two equal groups. Of the first group, 78 recovered within three days. Of the other 100, who were treated by a new method, 90 recovered within three days. The physician wishes to know whether the data provide sufficient evidence at 90% level of confidence to indicate that the new treatment is more effective than the standard treatment. Solution: Given $n_1 = n_2 = 100$, $p_1 = 78/100 = 0.78$, $p_2 = 90/100 = 0.90$. 1. State the hypotheses: $H_0: P_1 = P_2$ vs $H_1: P_1 < P_2$. 2. Determine the level of significance.
- 80. Example (cont.): 3. Test statistic: $Z_{cal} = \frac{(0.78 - 0.90) - 0}{\sqrt{\frac{0.78(0.22)}{100} + \frac{0.90(0.10)}{100}}} = \frac{-0.12}{0.051} = -2.35$. 4. Critical value: it is a one-tailed (left-tailed) test, so $Z_{\alpha} = Z_{0.05} = 1.645$ and the rejection region is $Z_{cal} < -1.645$. 5. Decision: since $Z_{cal} < -Z_{\alpha}$, i.e. −2.35 < −1.645, we reject H0. 6. Conclusion: the data suggest that the new treatment is more effective than the standard treatment at the 5% level of significance.
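
The calculation can be checked with the sketch below, which uses the unpooled standard error and the complement 1 − 0.78 = 0.22; with these values the statistic is about −2.35. The one-tailed critical value for α = 0.05 is taken from the standard normal distribution. Names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1, n1, p2, n2, d0=0.0):
    """z statistic for H0: P1 - P2 = d0 with the unpooled standard error."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1 - p2) - d0) / se

z_cal = two_proportion_z(0.78, 100, 0.90, 100)
z_alpha = NormalDist().inv_cdf(1 - 0.05)   # one-tailed critical value, 1.645
p_value = NormalDist().cdf(z_cal)          # left-tail p-value for H1: P1 < P2

print(round(z_cal, 2), round(z_alpha, 3), z_cal < -z_alpha)
```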
- 81. Chi-square test: The chi-square test is used to determine whether there is a significant difference between the observed and expected frequencies of categorical attributes. In recent years, the use of specialized statistical methods for categorical data has increased dramatically, particularly for applications in the biomedical and social sciences. Categorical scales occur frequently in the health sciences for measuring responses.
- 82. Cont… For example: whether a patient survives an operation (yes, no), severity of an injury (none, mild, moderate, severe), and stage of a disease (initial, advanced). Studies often collect data on categorical variables that can be summarized as a series of counts, commonly arranged in a tabular format known as a contingency table.
- 83. Cont… As with the t distribution, there is a different chi-square distribution for each possible value of the degrees of freedom. Chi-square distributions with a small number of degrees of freedom are highly skewed; however, this skewness is attenuated as the number of degrees of freedom increases.
- 84. Cont… The chi-squared distribution is concentrated over non-negative values. Its mean equals its degrees of freedom (d.f.), and its standard deviation equals $\sqrt{2\,\text{d.f.}}$. As d.f. increases, the distribution concentrates around larger values and is more spread out. The distribution is skewed to the right, but it becomes more bell-shaped (normal) as d.f. increases.
- 85. Cont… For a contingency table with r rows and c columns, d.f. is equal to (r − 1) × (c − 1).
- 86. Test of association: The chi-squared ($\chi^2$) test statistic is widely used in the analysis of contingency tables. It compares the actual observed frequency in each group with the expected frequency. The chi-squared test (Pearson's $\chi^2$) allows us to test for association between categorical (nominal) variables. The null hypothesis for this test is that there is no association between the variables. Consequently, a significant p-value implies association.
- 87. Cont… Test statistic: $\chi^2$-test with d.f. = (r − 1) × (c − 1): $\chi^2 = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ = observed frequency and $E_{ij}$ = expected frequency of the cell at the juncture of the i-th row and j-th column, with $E_{ij} = \frac{i\text{th row total} \times j\text{th column total}}{\text{grand total}} = \frac{R_i \times C_j}{n}$.
- 88. Procedures of hypothesis testing: 1. State the hypotheses. 2. Fix the level of significance. 3. Find the critical value $\chi^2_{(\text{d.f.},\,\alpha)}$. 4. Compute the test statistic. 5. Decision rule: reject the null hypothesis if the test statistic > table value. 6. State the conclusion.
- 89. Test of association for 2×2 tables: If we call the frequencies in the four cells of a 2×2 table a, b, c and d, then the table is given by

  | Exposure status | Diseased | Non-diseased | Row total |
  |---|---|---|---|
  | Exposed | a | b | a + b |
  | Non-exposed | c | d | c + d |
  | Column total | a + c | b + d | a + b + c + d |
- 90. Cont… If the contingency table is 2×2, the d.f. is (r − 1) × (c − 1) = 1, and $\chi^2 = \sum_{i,j}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$, where n = a + b + c + d.
- 91. Assumptions of the $\chi^2$-test: The chi-squared test assumes that the data are categorical, that the data are frequency data, and that the numbers in each cell are "not too small": no expected frequency should be less than 1, and no more than 20% of the expected frequencies should be less than 5. If this does not hold, row or column categories can sometimes be combined (re-categorized) to make the expected frequencies larger, or an alternative test can be used.
- 92. Example: Consider a hypothetical example on smoking and symptoms of asthma. The study involved 150 individuals and the result is given in the following table:

  | Symptoms of asthma | Ever smoked: Yes | Ever smoked: No | Total |
  |---|---|---|---|
  | Yes | 20 | 30 | 50 |
  | No | 22 | 78 | 100 |
  | Total | 42 | 108 | 150 |

  Is there an association between smoking cigarettes and symptoms of asthma at the 0.05 level of significance?
- 93. Table C. Right tail areas for the chi-square distribution

  | df \ area | 0.995 | 0.99 | 0.975 | 0.95 | 0.9 | 0.25 | 0.1 | 0.05 | 0.025 | 0.01 | 0.005 |
  |---|---|---|---|---|---|---|---|---|---|---|---|
  | 1 | 0.000 | 0.000 | 0.001 | 0.004 | 0.016 | 1.323 | 2.706 | 3.841 | 5.024 | 6.635 | 7.879 |
  | 2 | 0.010 | 0.020 | 0.051 | 0.103 | 0.211 | 2.773 | 4.605 | 5.991 | 7.378 | 9.210 | 10.597 |
  | 3 | 0.072 | 0.115 | 0.216 | 0.352 | 0.584 | 4.108 | 6.251 | 7.815 | 9.348 | 11.345 | 12.838 |
  | 4 | 0.207 | 0.297 | 0.484 | 0.711 | 1.064 | 5.385 | 7.779 | 9.488 | 11.143 | 13.277 | 14.860 |
  | 5 | 0.412 | 0.554 | 0.831 | 1.145 | 1.610 | 6.626 | 9.236 | 11.071 | 12.833 | 15.086 | 16.750 |
  | 6 | 0.676 | 0.872 | 1.237 | 1.635 | 2.204 | 7.841 | 10.645 | 12.592 | 14.449 | 16.812 | 18.548 |
  | 7 | 0.989 | 1.239 | 1.690 | 2.167 | 2.833 | 9.037 | 12.017 | 14.067 | 16.013 | 18.475 | 20.278 |
  | 8 | 1.344 | 1.647 | 2.180 | 2.733 | 3.490 | 10.219 | 13.362 | 15.507 | 17.535 | 20.090 | 21.955 |
  | 9 | 1.735 | 2.088 | 2.700 | 3.325 | 4.168 | 11.389 | 14.684 | 16.919 | 19.023 | 21.666 | 23.589 |
  | 10 | 2.156 | 2.558 | 3.247 | 3.940 | 4.865 | 12.549 | 15.987 | 18.307 | 20.483 | 23.209 | 25.188 |
  | 11 | 2.603 | 3.053 | 3.816 | 4.575 | 5.578 | 13.701 | 17.275 | 19.675 | 21.920 | 24.725 | 26.757 |
  | 12 | 3.074 | 3.571 | 4.404 | 5.226 | 6.304 | 14.845 | 18.549 | 21.026 | 23.337 | 26.217 | 28.300 |
  | 13 | 3.565 | 4.107 | 5.009 | 5.892 | 7.042 | 15.984 | 19.812 | 22.362 | 24.736 | 27.688 | 29.819 |
  | 14 | 4.075 | 4.660 | 5.629 | 6.571 | 7.790 | 17.117 | 21.064 | 23.685 | 26.119 | 29.141 | 31.319 |
  | 15 | 4.601 | 5.229 | 6.262 | 7.261 | 8.547 | 18.245 | 22.307 | 24.996 | 27.488 | 30.578 | 32.801 |
  | 16 | 5.142 | 5.812 | 6.908 | 7.962 | 9.312 | 19.369 | 23.542 | 26.296 | 28.845 | 32.000 | 34.267 |
  | 17 | 5.697 | 6.408 | 7.564 | 8.672 | 10.085 | 20.489 | 24.769 | 27.587 | 30.191 | 33.409 | 35.718 |
  | 18 | 6.265 | 7.015 | 8.231 | 9.390 | 10.865 | 21.605 | 25.989 | 28.869 | 31.526 | 34.805 | 37.156 |
  | 19 | 6.844 | 7.633 | 8.907 | 10.117 | 11.651 | 22.718 | 27.204 | 30.144 | 32.852 | 36.191 | 38.582 |
  | 20 | 7.434 | 8.260 | 9.591 | 10.851 | 12.443 | 23.828 | 28.412 | 31.410 | 34.170 | 37.566 | 39.997 |
  | 21 | 8.034 | 8.897 | 10.283 | 11.591 | 13.240 | 24.935 | 29.615 | 32.671 | 35.479 | 38.932 | 41.401 |
  | 22 | 8.643 | 9.542 | 10.982 | 12.338 | 14.041 | 26.039 | 30.813 | 33.924 | 36.781 | 40.289 | 42.796 |
  | 23 | 9.260 | 10.196 | 11.689 | 13.091 | 14.848 | 27.141 | 32.007 | 35.172 | 38.076 | 41.638 | 44.181 |
  | 24 | 9.886 | 10.856 | 12.401 | 13.848 | 15.659 | 28.241 | 33.196 | 36.415 | 39.364 | 42.980 | 45.559 |
  | 25 | 10.520 | 11.524 | 13.120 | 14.611 | 16.473 | 29.339 | 34.382 | 37.652 | 40.646 | 44.314 | 46.928 |
  | 26 | 11.160 | 12.198 | 13.844 | 15.379 | 17.292 | 30.435 | 35.563 | 38.885 | 41.923 | 45.642 | 48.290 |
  | 27 | 11.808 | 12.879 | 14.573 | 16.151 | 18.114 | 31.528 | 36.741 | 40.113 | 43.195 | 46.963 | 49.645 |
  | 28 | 12.461 | 13.565 | 15.308 | 16.928 | 18.939 | 32.620 | 37.916 | 41.337 | 44.461 | 48.278 | 50.993 |
  | 29 | 13.121 | 14.256 | 16.047 | 17.708 | 19.768 | 33.711 | 39.087 | 42.557 | 45.722 | 49.588 | 52.336 |
  | 30 | 13.787 | 14.953 | 16.791 | 18.493 | 20.599 | 34.800 | 40.256 | 43.773 | 46.979 | 50.892 | 53.672 |
- 94. Solution: Hypotheses: H0: there is no association between smoking and symptoms of asthma. HA: there is an association between smoking and symptoms of asthma. The critical value is $\chi^2_{(0.05,\,1)} = 3.841$. Test statistic: $\chi^2 = \frac{150(20 \times 78 - 30 \times 22)^2}{42 \times 108 \times 50 \times 100} = 5.36$.
- 95. Cont… The p-value corresponding to 5.36 at 1 degree of freedom is approximately 0.02. Decision: since 5.36 > 3.841, reject the null hypothesis in favor of the alternative hypothesis. Conclusion: there is an association between smoking and symptoms of asthma.
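
The test statistic and p-value for the asthma example can be reproduced with the 2×2 shortcut formula. The sketch below uses only the standard library; for 1 degree of freedom the right-tail area of the chi-square distribution equals `erfc(sqrt(x / 2))`. The function name is illustrative.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table with cells a, b (row 1), c, d (row 2)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

stat = chi_square_2x2(20, 30, 22, 78)
p_value = erfc(sqrt(stat / 2))  # right-tail area, valid for 1 d.f. only

print(round(stat, 2), round(p_value, 3), stat > 3.841)
```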
- 96. Exercise: Consider the data on the assessment of the effectiveness of antidepressants, given below (expected frequencies in parentheses):

  | Treatment | Depression: Yes | Depression: No | Total |
  |---|---|---|---|
  | Desipramine | 14 (8) | 10 (16) | 24 |
  | Lithium | 6 (8) | 18 (16) | 24 |
  | Placebo | 4 (8) | 20 (16) | 24 |
  | Total | 24 | 48 | 72 |

  Is there an association between treatment and depression at the 0.01 level of significance?
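
One possible way to work this exercise is sketched below: expected counts are computed from the row and column totals, and the Pearson statistic is summed over all cells. The table has (3 − 1) × (2 − 1) = 2 degrees of freedom, for which the right-tail area is exactly `exp(-x / 2)`; that shortcut is specific to 2 d.f. Variable names are illustrative.

```python
from math import exp

# Observed counts per treatment: (depression yes, depression no)
observed = [(14, 10), (6, 18), (4, 20)]  # Desipramine, Lithium, Placebo

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: row total * column total / grand total
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(col_totals) - 1)
p_value = exp(-stat / 2)  # right-tail area, valid for 2 d.f. only

print(expected[0], round(stat, 2), df, round(p_value, 4))
```

The statistic can then be compared with the tabulated critical value for 2 d.f. at the 0.01 level, 9.210.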
- 97. 7/5/2023 By Asaye 97 Measure of Association
- 98. Measure of Association 7/5/2023 By Asaye 98 Chi-square test only tells us whether there is association between the two categorical variables or not, however, it did not tell us about the direction and strength of the association. Statistical relationship between exposure and disease. An association is said to exist between two variables when a change in one variable parallels or coincides with a change in another variable. Requires comparing two groups: Exposed Vs Unexposed Cases Vs non cases/controls
- 99. Cont… Variables can be related or unrelated to one another. If they are related, the relationship can be: positive or negative; strong or weak (one variable can have a large or small effect on the other); statistically significant or not. A statistically significant association is one that is unlikely to be due to chance.
- 100. Cont… Commonly, the strength of an association is measured by: the relative risk (RR); the odds ratio (OR).
- 101. Relative Risk (RR) Risk: the probability of an event occurring over a period of time. It measures the probability of disease incidence in a group: $\text{Risk} = \dfrac{\text{number of cases of disease}}{\text{number of people at risk}}$. Risk ratio (relative risk): the ratio of the risk of disease in the exposed group to the risk in the unexposed group. It is used to compare the risk in two different groups of people.
- 102. Cont… RR estimates the magnitude (size) of an association between exposure and outcome. It indicates the chance of developing the disease in the group exposed to a risk factor relative to the group not exposed.
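- 103. Cont… Table 1: A 2 by 2 table indicating findings of a cohort study

  | Exposure | Disease: Yes | Disease: No | Total |
  |---|---|---|---|
  | Exposed | a | b | a+b |
  | Unexposed | c | d | c+d |
  | Total | a+c | b+d | a+b+c+d |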
- 104. Cont… From the above table, the RR is calculated as: $RR = \dfrac{a/(a+b)}{c/(c+d)}$
- 105. Example 1 Table 2: Data from a cohort study of oral contraceptive (OC) use and bacteriuria among women aged 15-49 years.

  | Current OC use | Bacteriuria: Yes | Bacteriuria: No | Total |
  |---|---|---|---|
  | Yes | 27 | 455 | 482 |
  | No | 77 | 1831 | 1908 |
  | Total | 104 | 2286 | 2390 |
- 106. Cont… Calculate the RR. $RR = \dfrac{I_e}{I_o} = \dfrac{\text{incidence among exposed}}{\text{incidence among non-exposed}} = \dfrac{a/(a+b)}{c/(c+d)} = \dfrac{(27/482) \times 1000}{(77/1908) \times 1000} \approx 1.4$ Interpretation: women who used oral contraceptives had 1.4 times the risk of developing bacteriuria compared with non-users.
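The calculation above can be reproduced in a few lines. This is a minimal sketch using the cell labels a, b, c, d from the formula on the slide; the function name `relative_risk` is hypothetical.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Relative risk from a 2x2 cohort table.

    a, b: exposed with / without the disease
    c, d: unexposed with / without the disease
    """
    risk_exposed = a / (a + b)      # incidence among exposed (Ie)
    risk_unexposed = c / (c + d)    # incidence among non-exposed (Io)
    return risk_exposed / risk_unexposed


# Table 2: OC users 27 with bacteriuria, 455 without; non-users 77 and 1831.
rr = relative_risk(27, 455, 77, 1831)
print(f"RR = {rr:.2f}")
```

The factor of 1000 in the slide's formula (incidence per 1000) cancels in the ratio, so it is omitted here.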
- 107. Interpretation RR ranges from 0 to infinity and is never negative. RR = 1: risk in exposed = risk in non-exposed; no association. RR > 1: risk in exposed > risk in non-exposed; exposed individuals are RR times as likely to develop the outcome as non-exposed individuals. This is a positive association: the factor is associated with the disease, and a larger RR means a stronger association.
- 108. Cont… RR < 1: risk in exposed < risk in non-exposed. The risk of acquiring the disease is lower among subjects with the factor than among subjects without it. This is a negative association: the factor is “protective”.
- 109. Interpretation cont’d… RR scale from 0 to ∞: values below 1 are preventive (protective), 1 means no association, and values above 1 indicate risk.
- 110. Guideline for strength of association 1.0 = No association; 1.1-1.3 = Weak; 1.4-1.7 = Mild; 1.8-3.0 = Moderate; 3.0-8.0 = Strong. Q. What if RR is less than 1?
- 111. Cont… For inverse associations (RR less than 1.0), take the reciprocal and look it up in the table above; e.g., the reciprocal of 0.5 is 2.0, which corresponds to a “moderate” association. The further the RR is from 1, the stronger the association between exposure and disease.
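The guideline and the reciprocal rule can be combined into a small lookup. This sketch rests on several assumptions that are not in the slides: the ratio is rounded to one decimal place before lookup, the overlapping boundary 3.0 is assigned to “moderate”, and ratios above 8.0 are reported as outside the guideline. The function name `association_strength` is hypothetical.

```python
def association_strength(rr: float) -> str:
    """Classify the strength of association for a relative risk.

    Assumptions (not from the slides): round to one decimal, treat 3.0 as
    "moderate", and label ratios above 8.0 as beyond the guideline.
    """
    if rr <= 0:
        raise ValueError("rr must be positive to take its reciprocal")

    # Inverse associations: take the reciprocal before looking up the table.
    ratio = rr if rr >= 1 else 1.0 / rr
    ratio = round(ratio, 1)

    if ratio == 1.0:
        return "no association"
    if ratio <= 1.3:
        return "weak"
    if ratio <= 1.7:
        return "mild"
    if ratio <= 3.0:
        return "moderate"
    if ratio <= 8.0:
        return "strong"
    return "beyond the guideline range"


print(association_strength(0.5))   # reciprocal is 2.0
print(association_strength(1.4))
```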
- 112. Odds Ratio (OR) Odds: the ratio of the probability that an event occurs to the probability that it does not occur. $\text{Odds} = \dfrac{p}{1-p}$, where $p$ is the probability of the event and $1-p$ is the probability that the event does not occur. The odds of disease can be computed separately for the exposed and the unexposed. The odds ratio indicates the likelihood of having been exposed among cases relative to controls.
- 113. Cont… Consider the following 2x2 table:

  | Treatment | Outcome X− | Outcome X+ | Total |
  |---|---|---|---|
  | Y− | a | b | a+b |
  | Y+ | c | d | c+d |
  | Total | a+c | b+d | a+b+c+d |
- 114. Cont… Odds: the ratio of the probability of occurrence of an event to the probability of its nonoccurrence. Odds ratio: the ratio of two odds, i.e. the odds of exposure in cases compared with the odds of exposure in controls. $OR = \dfrac{\text{odds of exposure among cases}}{\text{odds of exposure among controls}} = \dfrac{a/c}{b/d} = \dfrac{a \times d}{b \times c}$ We can calculate either the exposure odds ratio or the disease odds ratio; the two are exactly the same.
- 115. Example Table 3: Data from a case-control study of current oral contraceptive (OC) use and myocardial infarction (MI) in pre-menopausal female nurses.

  | Current OC use | MI: Yes | MI: No | Total |
  |---|---|---|---|
  | Yes | 23 | 304 | 327 |
  | No | 133 | 2816 | 2949 |
  | Total | 156 | 3120 | 3276 |
- 116. Cont… Calculate the OR. $OR = \dfrac{a/c}{b/d} = \dfrac{23 \times 2816}{304 \times 133} \approx 1.6$ Interpretation: the odds of having MI are 1.6 times as high among current OC users as among non-users.
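The same arithmetic as a minimal sketch, using the cross-product form of the odds ratio; the function name `odds_ratio` is hypothetical and the cell labels follow the formula on the slide.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio (a/c) / (b/d), written as the cross-product a*d / (b*c)."""
    return (a * d) / (b * c)


# Table 3: OC users 23 with MI, 304 without; non-users 133 with MI, 2816 without.
or_hat = odds_ratio(23, 304, 133, 2816)
print(f"OR = {or_hat:.2f}")
```

Because the cross-product is unchanged when rows and columns are swapped, the exposure odds ratio and the disease odds ratio give the same value.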
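- 117. Interpretation cont’d… OR ranges from 0 to positive infinity. OR = 1: exposure is not related to disease. OR > 1: exposure is positively related to disease. OR < 1: exposure is negatively related to disease. On the scale from 0 to ∞, values below 1.0 indicate a negative association, values at or near 1.0 indicate no (or weak) association, and values above 1.0 indicate a positive association.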
- 118. Interpretation The odds of having the disease in question are OR times as high among those exposed to the suspected risk factor as among those with no such exposure. The standard error of the log odds ratio is given by $SE(\ln OR) = \sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}$ The 95% confidence interval for the log odds ratio is given by $\left(\ln OR - Z_{\alpha/2} \times SE(\ln OR),\; \ln OR + Z_{\alpha/2} \times SE(\ln OR)\right)$
- 119. Cont… To interpret the 95% confidence interval on the odds ratio scale, we transform the limits back to the original scale by exponentiating. The 95% confidence interval for the odds ratio is given by: $\left(e^{\ln OR - Z_{\alpha/2} \times SE(\ln OR)},\; e^{\ln OR + Z_{\alpha/2} \times SE(\ln OR)}\right)$, where OR is the point estimate from the sample.
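The two steps (interval on the log scale, then back-transformation) can be sketched as follows. The example applies them to the case-control data on OC use and MI from Table 3 (23, 304, 133, 2816); the function name `odds_ratio_ci` and its `confidence` parameter are hypothetical, and the interval is the usual large-sample approximation.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, confidence=0.95):
    """Return (OR, lower limit, upper limit) using the log-OR standard error."""
    or_hat = (a * d) / (b * c)

    # Standard error of ln(OR): sqrt(1/a + 1/b + 1/c + 1/d)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    # Two-sided critical value Z_{alpha/2}; about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    log_or = math.log(or_hat)
    lower = math.exp(log_or - z * se_log_or)   # back-transform lower limit
    upper = math.exp(log_or + z * se_log_or)   # back-transform upper limit
    return or_hat, lower, upper


or_hat, lower, upper = odds_ratio_ci(23, 304, 133, 2816)
print(f"OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes 1 suggests an association that is statistically significant at the corresponding level.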
- 120. Exercise The table below gives infant birth weight (BW) and mortality among white infants in region X within one year. Find the 95% confidence interval for the odds ratio of infant mortality (5% level of significance).

  | Birth weight | Dead | Alive | Total |
  |---|---|---|---|
  | Low BW | 618 | 4597 | 5215 |
  | High BW | 422 | 67093 | 67515 |
  | Total | 1040 | 71690 | 72730 |