# The Chi Square Test

## by Max Chipulu, Senior Teaching Fellow (Management Science) at University of Southampton on Sep 21, 2009

Using the chi-square test

## The Chi Square Test: Document Transcript

### Test of Significance: The Chi-square Statistic

© 2009 Max Chipulu, University of Southampton

### Learning Objectives

• To introduce the chi-square statistic as a test of statistical significance
• To apply and interpret the calculated chi-square statistic for a practical problem, using chi-square tables and ‘degrees of freedom’
“When it comes to number of babies, all months are equal but some months are more equal than others.”

### The Research Question = Research Hypothesis

• It is often thought that there are some ‘boom’ months of the year when the number of babies born is higher than in others.
• Can we, using data on babies who were born to hold a master’s degree, show this to be the case or not?
• The research hypothesis is that there is a difference in the number of births from month to month.
### The Null Hypothesis

• If there is nothing to the myth of boom months, then the distribution of the number of births would be uniform across all the months of the year.
• Therefore the null hypothesis is: there is no difference in the number of births from month to month.

| Range of numbers | Actual births (observed frequencies) | What ‘uniform’ births would look like (expected frequencies) | Differences between expected and observed |
|---|---|---|---|
| Jan | | | |
| Feb | | | |
| Mar | | | |
| Apr | | | |
| May | | | |
| Jun | | | |
| Jul | | | |
| Aug | | | |
| Sep | | | |
| Oct | | | |
| Nov | | | |
| Dec | | | |
| Total | | | |
### How well do observed frequencies fit the uniform model?

• There are differences between the expected and observed frequencies. But these differences could be just because of the randomness of the data.
• Intuitively, we know that small differences between the observed and predicted frequencies represent a ‘good’ fit.
• So, overall, if we sum the differences, then a small sum of differences represents a good model.
• But positive and negative differences may cancel out.
• This is not so good.

### How well do observed frequencies fit the uniform model? (cont’d)

• So we square the differences between frequencies.
• Then we add the squared differences up.
• A small sum of squares is good.
• To put the result into context, we divide each squared difference by the respective expected frequency.
• The result is a measure of the goodness of our uniform random model.
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

This is the chi-square statistic, where $f_o$ is an observed frequency and $f_e$ the corresponding expected frequency.

| Range of numbers | Expected frequencies | Differences between expected and observed | Difference squared divided by expected (contribution to the chi-square) |
|---|---|---|---|
| Jan | | | |
| Feb | | | |
| Mar | | | |
| Apr | | | |
| May | | | |
| Jun | | | |
| Jul | | | |
| Aug | | | |
| Sep | | | |
| Oct | | | |
| Nov | | | |
| Dec | | | |
| Total | | | |
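As a minimal sketch, the statistic above could be computed in Python roughly as follows. The function name `chi_square` and the monthly counts are hypothetical, made up only to illustrate the arithmetic; they are not the lecture's birth data.

```python
def chi_square(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# Hypothetical (made-up) birth counts for Jan..Dec, for illustration only.
observed = [8, 12, 10, 10, 9, 11, 10, 10, 13, 7, 10, 10]

# Uniform model: every month receives the same share of the total.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi2 = chi_square(observed, expected)
print(f"chi-square = {chi2:.2f}")
```

Dividing each squared difference by its expected frequency is what puts the result "into context": a difference of 3 matters more when 10 births were expected than when 1,000 were.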
### Key Properties of the Chi-Square

• The chi-square is a non-parametric test: the value of the chi-square statistic is not affected by the underlying statistical model that generates the data.
• The value of the chi-square depends only on the number of degrees of freedom: the higher the number of degrees of freedom, the higher the value of the chi-square should be.
• The number of degrees of freedom is the number of different categories that contribute to the chi-square sum, minus the number of pre-determined (or intermediate) parameters.

### ‘Degrees of freedom’

• In this example, degrees of freedom (d.f.) = k − 1, where k is the number of categories (months) that contribute to the chi-square. So d.f. = 12 − 1 = 11.
• Suppose that instead of using months we use seasons as our categories. Then we would only have four categories contributing to the chi-square. As such, we would expect a SMALLER chi-square because there are fewer contributions to it.
• But why subtract 1? The total number of births for all months is a predetermined value: it depends only on the sample size. If we know the frequencies for 11 of the 12 months, and we know the total number of births, then we can work out the number of births in the 12th month. So, although we have 12 months (12 categories) in total, there are in fact only 11 ways (degrees of freedom) in which the chi-square value can vary.
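The counting rule above can be written as a very small helper. This is only a sketch: the name `degrees_of_freedom` and its default of one predetermined parameter (the fixed total) are assumptions made for illustration.

```python
def degrees_of_freedom(categories, predetermined=1):
    """Categories contributing to the chi-square minus predetermined parameters."""
    df = categories - predetermined
    if df < 1:
        raise ValueError("need more categories than predetermined parameters")
    return df


# Months as categories: only the total number of births is fixed in advance.
df_months = degrees_of_freedom(12)

# Seasons as categories: fewer contributions, so fewer degrees of freedom.
df_seasons = degrees_of_freedom(4)

print(df_months, df_seasons)
```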
### Is the chi-square significant?

Upper-tail percentage points of the chi-square distribution. ν is the number of degrees of freedom; each column heading is the percentage of chi-square values expected to exceed the tabulated value.

| ν | 50% | 40% | 30% | 25% | 20% | 15% | 10% | 5% | 2.5% |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 4.351 | 5.132 | 6.064 | 6.626 | 7.289 | 8.115 | 9.236 | 11.07 | 12.83 |
| 6 | 5.348 | 6.211 | 7.231 | 7.841 | 8.558 | 9.446 | 10.64 | 12.59 | 14.45 |
| 7 | 6.346 | 7.283 | 8.383 | 9.037 | 9.803 | 10.75 | 12.02 | 14.07 | 16.01 |
| 8 | 7.344 | 8.351 | 9.524 | 10.22 | 11.03 | 12.03 | 13.36 | 15.51 | 17.53 |
| 9 | 8.343 | 9.414 | 10.66 | 11.39 | 12.24 | 13.29 | 14.68 | 16.92 | 19.02 |
| 10 | 9.342 | 10.47 | 11.78 | 12.55 | 13.44 | 14.53 | 15.99 | 18.31 | 20.48 |
| 11 | 10.34 | 11.53 | 12.90 | 13.70 | 14.63 | 15.77 | 17.28 | 19.68 | 21.92 |
| 12 | 11.34 | 12.58 | 14.01 | 14.85 | 15.81 | 16.99 | 18.55 | 21.03 | 23.34 |

### Example: Goals in football

• Hypothesis: the total number of goals scored in a game of football in Europe follows a Poisson distribution with mean 2.73.
### In Europe we observed this distribution of football goals

[Figure: bar chart of the number of matches (vertical axis) against goals scored, 0 to 8 (horizontal axis)]

Now that we know about some distributions, it might look vaguely familiar.

### A Poisson Model of Goals in Football

We can think about football like this:

• Let each minute that we watch a game be an experiment.
• The experiment is: is there a goal or not? It is a success if there is a goal; it is a failure if there is not.
• Since only 3 goals are expected after 90 minutes, the probability of ‘success’ is very small.
• In each minute we conduct a Bernoulli trial. There are 90 trials.

It seems reasonable to model goal scoring in football as a Poisson process.
### A Poisson Model of Goals in Football (cont’d)

Alternatively, we can think about football like this:

• A game of football is played over 90 minutes. This is a constrained time interval.
• The number of goals scored in each game is a discrete random variable.
• Suppose we divide the match into very small intervals, e.g. minutes. Then within each small interval it is reasonable to assume that:
  1. There will be at most one goal scored;
  2. The probability of observing a goal is proportional to the length of that interval of time, e.g. the probability of observing a goal in 1 minute is twice that of a goal in 30 seconds.

The above are the key characteristics of a Poisson process.

### Poisson Probabilities of Football Goals

From our data, the expected number of goals per game is µ = 2.73. And so:

• P(zero goals) = $e^{-\mu}$ = $e^{-2.73}$ = 0.0652
• P(1 goal) = 2.73 × 0.0652 = 0.1780
• P(2 goals) = 2.73/2 × 0.1780 = 0.2430
• P(3 goals) = 2.73/3 × 0.2430 = 0.2212
• P(4 goals) = 2.73/4 × 0.2212 = 0.1509
• P(5 goals) = 2.73/5 × 0.1509 = 0.0824
• P(6 goals) = 2.73/6 × 0.0824 = 0.0375
• P(7 goals) = 2.73/7 × 0.0375 = 0.0146
• P(8 goals) = 2.73/8 × 0.0146 = 0.0050
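The step-by-step multiplication above follows the Poisson recurrence P(k) = (µ / k) × P(k − 1). A minimal Python sketch of that recurrence is shown here; the function name `poisson_probabilities` is an assumption made for illustration, and only the mean of 2.73 comes from the slides.

```python
import math


def poisson_probabilities(mean, max_k):
    """Return [P(0), ..., P(max_k)] using P(k) = (mean / k) * P(k - 1)."""
    probs = [math.exp(-mean)]  # P(0) = e^(-mean)
    for k in range(1, max_k + 1):
        probs.append(probs[-1] * mean / k)
    return probs


probs = poisson_probabilities(2.73, 8)
for k, p in enumerate(probs):
    print(f"P({k} goals) = {p:.4f}")
```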
### Expected Frequencies of Matches According to Poisson

• According to the Poisson model, the probability that a football match will end with zero goals is 0.0652.
• If we watch 66 matches in total, how many of them should we expect to end with zero goals?
• Number of games with zero goals in total = 0.0652 × 66 = 4.3
• We can thus work out the expected frequency of matches with i goals by multiplying each Poisson probability by the total number of matches seen.

### Table 11.1 Comparing Goals Predicted with Observed Goals

| Goals at end of match | Poisson probability of seeing this number of goals | Number of games expected to end with this |
|---|---|---|
| 0 | 0.0652 | 4.3 |
| 1 | 0.1780 | 11.8 |
| 2 | 0.2430 | 16.0 |
| 3 | 0.2212 | 14.6 |
| 4 | 0.1509 | 10.0 |
| 5 | 0.0824 | 5.4 |
| 6 | 0.0375 | 2.5 |
| 7 | 0.0146 | 1.0 |
| 8 or more | 0.0070 | 0.5 |
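The expected frequencies in Table 11.1 can be reproduced with a short sketch. The constant names are hypothetical; the mean (2.73) and the number of matches (66) are taken from the slides, and the last category is treated as "8 or more" by giving it whatever probability remains.

```python
import math

MEAN_GOALS = 2.73  # expected goals per game, from the slides
MATCHES = 66       # number of matches watched, from the slides

# Poisson probabilities for exactly 0..7 goals.
probs = [
    math.exp(-MEAN_GOALS) * MEAN_GOALS ** k / math.factorial(k)
    for k in range(8)
]
# The final category, "8 or more", takes the remaining probability.
probs.append(1.0 - sum(probs))

# Expected frequency = probability x number of matches.
expected = [MATCHES * p for p in probs]

for k, e in enumerate(expected):
    label = "8 or more" if k == 8 else str(k)
    print(f"{label:>9}: {e:5.1f}")
```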
### Comparison of Expected and Observed Frequencies of Matches with i goals

• We also have the observed frequencies.
• So if scoring goals in football is really a Poisson process, then there should not be much difference between the expected and observed frequencies.
• Any difference between predicted and actual should be small and due to random variation only.

### Football: Observed vs Poisson Frequencies

[Figure: chart comparing predicted and observed frequencies (vertical axis) for 0 to 8 goals (horizontal axis)]
### Calculate the contribution of each $f_i$ to the χ²; find the sum: χ²

| Goals | Expected, $f_e$ | Observed, $f_i$ | Contribution to the χ² |
|---|---|---|---|
| 0 or 1 | 16.1 | 15 | 0.0694 |
| 2 | 16.0 | 20 | 0.9774 |
| 3 | 14.6 | 11 | 0.8863 |
| 4 | 10.0 | 9 | 0.0930 |
| 5 or more | 9.2 | 11 | 0.3483 |
| Total | 66.0 | 66 | 2.3744 |

Degrees of freedom = k − number of predetermined parameters = 5 − 2 = 3

### To test significance, answer this question

• Is the calculated chi-square value so high that it is unusual to observe such a value, or higher values, with 3 degrees of freedom?
• Alternatively: is the probability of observing a chi-square value of 2.37 or more with three degrees of freedom small (say 5% or less)?
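A rough check of the table above can be sketched in Python. Note the assumption: this uses the rounded expected frequencies exactly as printed, so the total comes out near 2.41 rather than the slide's 2.3744, which appears to have been computed from unrounded values. The 5% critical value of 7.81 for 3 d.f. is the figure given in the lecture's chi-square tables.

```python
# Rounded values copied from the slide's table (categories pooled as shown there).
expected = [16.1, 16.0, 14.6, 10.0, 9.2]
observed = [15, 20, 11, 9, 11]

# Contribution of each category: (observed - expected)^2 / expected.
contributions = [(fo - fe) ** 2 / fe for fo, fe in zip(observed, expected)]
chi2 = sum(contributions)

# Five categories minus two predetermined parameters, as on the slide.
df = len(observed) - 2

CRITICAL_5_PERCENT_3_DF = 7.81  # from the lecture's chi-square table
significant = chi2 > CRITICAL_5_PERCENT_3_DF

print(f"chi-square = {chi2:.3f}, d.f. = {df}, significant at 5%: {significant}")
```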
### To find the highest χ² value observed 95% of the time, under 3 d.f., if H0 is true

Percentage Points of the Chi-Square Distribution

| ν | 50% | 20% | 15% | 10% | 5% | 2.5% | 1% |
|---|---|---|---|---|---|---|---|
| 1 | 0.45 | 1.64 | 2.07 | 2.71 | 3.84 | 5.02 | 6.63 |
| 2 | 1.39 | 3.22 | 3.79 | 4.61 | 5.99 | 7.38 | 9.21 |
| 3 | 2.37 | 4.64 | 5.32 | 6.25 | 7.81 | 9.35 | 11.34 |
| 4 | 3.36 | 5.99 | 6.74 | 7.78 | 9.49 | 11.14 | 13.28 |
| 5 | 4.35 | 7.29 | 8.12 | 9.24 | 11.07 | 12.83 | 15.09 |

For 3 d.f., the 5% column gives 7.81. The calculated value of 2.37 is well below this (it sits at the 50% point), so the hypothesis that goals follow a Poisson distribution is not rejected.
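Instead of reading the table, the upper-tail probability can be computed directly. The sketch below is one possible approach using only Python's standard library; the function name `chi2_sf` is hypothetical, and it relies on the standard recurrence for the regularised upper incomplete gamma function, so it is restricted to whole-number degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """P(chi-square with integer df >= 1 degrees of freedom is at least x)."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    z = x / 2.0
    if df % 2 == 1:
        a = 0.5
        q = math.erfc(math.sqrt(z))  # exact tail for 1 d.f.
    else:
        a = 1.0
        q = math.exp(-z)             # exact tail for 2 d.f.
    # Step up two degrees of freedom at a time:
    # Q(a + 1, z) = Q(a, z) + z^a * e^(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


p_football = chi2_sf(2.37, 3)   # calculated chi-square from the slides
p_critical = chi2_sf(7.81, 3)   # tabulated 5% point for 3 d.f.
print(f"P(chi2 >= 2.37 | 3 d.f.) = {p_football:.3f}")
print(f"P(chi2 >= 7.81 | 3 d.f.) = {p_critical:.3f}")
```

A probability near 0.5 for the football data agrees with the table, where 2.37 appears in the 50% column for 3 d.f.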
### Example: Incidence of Disease Among Adults (Contingency Table Analysis)

A county council is worried about the number of adults who suffer from a particular disease and has collected the following information:

| Age group | Sick | Healthy | Total |
|---|---|---|---|
| 34-39 | 1327 | 15702 | 17029 |
| 40-44 | 2072 | 17454 | 19524 |
| 45-49 | 2456 | 14237 | 16693 |
| 50-54 | 3611 | 11519 | 15130 |
| 55-59 | 4688 | 9174 | 13862 |
| 60-64 | 5490 | 7526 | 13016 |

• Can it be said that all age groups are equally likely to be affected and that the differences may be due to random variation?
• Or are some age groups more susceptible than others to acquiring the disease?
### Incidence of Disease Among Adults (cont’d)

What would the numbers in each of these cells be in a perfect world, i.e. in a world where advancing age did not mean more disease?

| Age group | Sick | Healthy | Total |
|---|---|---|---|
| 34-39 | | | 17029 |
| 40-44 | | | 19524 |
| 45-49 | | | 16693 |
| 50-54 | | | 15130 |
| 55-59 | | | 13862 |
| 60-64 | | | 13016 |

### Step 1: Hypothesize

• Assume that age is NOT related to the incidence of the disease, i.e. the maintained hypothesis, H0, is that the incidence of the disease is independent of age.
• The alternative hypothesis, Ha, is that age IS related to the incidence of the disease.
### Step 2: Create the statistical model such that age is independent of the incidence of the disease

From the rules of probability, the model is as follows:

• Let the event that an adult is aged 34-39 be A.
• Let the event that an adult is sick be S.
• Then, if incidence of the disease is independent of age, the probability that an adult is aged 34-39 AND sick is given by the simplified multiplication rule: P(A and S) = P(A) × P(S)

### Step 3: Apply the simplified multiplication rule to calculate the probability of every combination of age range and sick, and age range and healthy

| Age group | Sick | Healthy | Total |
|---|---|---|---|
| 34-39 | 0.037 | 0.142 | 0.179 |
| 40-44 | 0.042 | 0.163 | 0.205 |
| 45-49 | 0.036 | 0.139 | 0.175 |
| 50-54 | 0.033 | 0.126 | 0.159 |
| 55-59 | 0.030 | 0.116 | 0.146 |
| 60-64 | 0.028 | 0.108 | 0.137 |
| Total | 0.206 | 0.794 | 1.000 |
### Step 4: Given these probabilities, calculate the number of adults in each age group expected to be sick, and to be healthy

| Age group | Sick | Healthy | Total |
|---|---|---|---|
| 34-39 | 3512 | 13518 | 17029 |
| 40-44 | 4026 | 15498 | 19524 |
| 45-49 | 3443 | 13251 | 16693 |
| 50-54 | 3120 | 12010 | 15130 |
| 55-59 | 2859 | 11004 | 13862 |
| 60-64 | 2684 | 10332 | 13016 |
| Total | 19644 | 75612 | 95254 |

### Step 5: Now test the hypothesis

• How well does our independence model predict the numbers of adults of a certain age who will be sick and who will be healthy?
• Use chi-square to compare differences between observed and expected frequencies.
• Proceed by calculating the contribution of each combination of age and sick, and age and healthy, to the chi-square value, and summing them up.
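Steps 2 to 4 can be sketched together in Python: under independence, each expected count is (row total × column total) / grand total. The variable names are hypothetical. One assumption to flag: totals are recomputed from the individual cells, and the 40-44 cells sum to 19526 rather than the 19524 printed on the slides, so results can differ from the slide values in the last digit.

```python
# Observed counts (sick, healthy) per age group, copied from the slides.
observed = {
    "34-39": (1327, 15702),
    "40-44": (2072, 17454),
    "45-49": (2456, 14237),
    "50-54": (3611, 11519),
    "55-59": (4688, 9174),
    "60-64": (5490, 7526),
}

# Row and column totals are recomputed from the cells.
row_totals = {age: sum(cells) for age, cells in observed.items()}
col_totals = [sum(cells[j] for cells in observed.values()) for j in range(2)]
grand_total = sum(col_totals)

# Independence model: P(age and status) = P(age) * P(status),
# so expected count = row total * column total / grand total.
expected = {
    age: tuple(row_totals[age] * col_totals[j] / grand_total for j in range(2))
    for age in observed
}

for age, (sick, healthy) in expected.items():
    print(f"{age}: sick {sick:8.0f}  healthy {healthy:8.0f}")
```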
### Step 5 Cont’d

The chi-square value is the sum of all the contributions. It is 8531. Hmm! What is the probability of observing a χ² value this large or larger when the independence model holds?

Contributions to the chi-square:

| Age group | Sick | Healthy | Total |
|---|---|---|---|
| 34-39 | 1359 | 353 | 1712 |
| 40-44 | 949 | 247 | 1196 |
| 45-49 | 283 | 73 | 356 |
| 50-54 | 77 | 20 | 97 |
| 55-59 | 1171 | 304 | 1475 |
| 60-64 | 2933 | 762 | 3695 |
| Total | 6771 | 1760 | 8531 |

### Step 6: Calculate the number of degrees of freedom of the chi-square value

• When expected values are calculated, the expected values in the last column and last row can be filled in automatically. This is because the total for each row and column, e.g. the total number of adults aged 34-39, is fixed and known already.
• Hence, the values in the last column and last row are not free, and the total number of degrees of freedom is (number of rows minus one) × (number of columns minus one) = (6 − 1) × (2 − 1) = 5.
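Steps 5 and 6 can be checked with a short sketch, assuming the observed counts from the slides and recomputing totals from the cells (so the result may differ from 8531 by a small rounding amount). The 5% critical value of 11.071 for five d.f. is the figure quoted in the lecture's table.

```python
# Observed counts [sick, healthy] for the six age groups, from the slides.
observed = [
    [1327, 15702],
    [2072, 17454],
    [2456, 14237],
    [3611, 11519],
    [4688, 9174],
    [5490, 7526],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum the contribution of every cell: (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / grand_total
        chi2 += (fo - fe) ** 2 / fe

# (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_5_PERCENT_5_DF = 11.071  # from the lecture's chi-square table
reject_independence = chi2 > CRITICAL_5_PERCENT_5_DF

print(f"chi-square = {chi2:.0f}, d.f. = {df}, reject H0: {reject_independence}")
```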
### Conclusion

• For five d.f., the tables we have created do not list values of the χ² as high as 8531. All we can say is that the probability of values of the χ² of 8531 or higher must be very, very small.
• Alternatively, we can look at the maximum value of the χ² that is observed 95% of the time for five d.f. This is 11.071.
• Since 8531 is way beyond this, we must reject the maintained hypothesis. Incidence of the disease is not independent of age.

| d.f. \ area (%) | 25 | 10 | 5 | 2.5 | 1 | 0.5 |
|---|---|---|---|---|---|---|
| 1 | 1.3233 | 2.7055 | 3.8415 | 5.0239 | 6.6349 | 7.8794 |
| 2 | 2.7726 | 4.6052 | 5.9915 | 7.3778 | 9.2103 | 10.597 |
| 3 | 4.1083 | 6.2514 | 7.8147 | 9.3484 | 11.345 | 12.838 |
| 4 | 5.3853 | 7.7794 | 9.4877 | 11.143 | 13.277 | 14.86 |
| 5 | 6.6257 | 9.2364 | 11.071 | 12.833 | 15.086 | 16.75 |
### Example from SPSS Practical: What are the key factors in the value of an MBA program?

THE VARIABLES

• Salary: average salary of MBA graduates
• Fees: program fees at the school
• Age: average age of an MBA candidate
• GMAT: average academic aptitude
• Intake: number of candidates on the program
• Experience: average experience (yrs) of candidates
• Country: whether the country is USA (1) or another (0)

### Example from SPSS Practical: Is ‘salary’ related to ‘country’?

Chi-square test in SPSS:

• Open SPSS 17.
• Import the ‘MBA.xls’ data to SPSS as explained in the SPSS handout.
• We wish to conduct a chi-square cross-tabulation (i.e. contingency table) test on ‘salary’ by ‘country’.
• Null hypothesis: ‘salary’ and ‘country’ are independent.
• Alternative hypothesis: ‘salary’ and ‘country’ are not independent, i.e. they are related.
### Example from SPSS Practical: Re-coding ‘salary’

The salary variable is not categorical, i.e. it is quantitative and not strictly suitable for cross-tabulation. So, first, recode salary into categories:

1. Go to the ‘transform’ menu.
2. Choose ‘visual binning’.
3. Select ‘salary’ as the variable to bin.
4. You should see a histogram of the ‘salary’ variable.
5. In the box labelled ‘binned variable’, type a name for the new variable that will be created after re-coding. I have called my new variable ‘salary_codes’.
6. Select the tab ‘make cutpoints’. There are several options for cutpoints: a good one is to divide the data by ‘equal percentiles’. For example, if you input ‘3’ in this box, the salary data will be re-coded with 3 cutpoints so that there will be four sections of the data: values in the first 25% will be re-coded as ‘1’, values in the next 25% (i.e. 25% to 50%) will be re-coded as ‘2’, and so on.
7. Click ‘ok’.
8. Check that a new ordinal variable representing categories of salary has been formed.

### Example from SPSS Practical: Cross-tabulation of ‘salary_codes’ by ‘country’

Now to conduct a chi-square test:

1. Go to the ‘analyse’ menu.
2. Choose ‘descriptive statistics’, then ‘crosstabs…’.
3. Input ‘salary_codes’ into the ‘row’ box and ‘country’ into the ‘column’ box.
4. Click the ‘statistics’ tab and check the ‘chi-square’ box.
5. Click ‘ok’.
6. What do the results suggest?
### 3-Way Chi-square Test

The results above suggest that ‘salary’ IS related to ‘country’. Suppose we think that this relationship is somewhat affected by the GMAT of the students. We can test this by creating a three-way cross-tabulation:

1. Re-code the GMAT variable into ‘GMAT_codes’, say two categories of ‘low’ and ‘high’, using ‘visual binning’ in the ‘transform’ menu.
2. Repeat the chi-square test of ‘salary_codes’ by ‘country’. However, this time, enter the ‘GMAT_codes’ variable in the ‘layer’ box.
3. Run the model.

### 3-Way Chi-square Test: Result

[Figure: SPSS output for the three-way cross-tabulation]
### 3-Way Chi-square Test: Result (cont’d)

The three-way chi-square test suggests that once we take into account the GMAT average of the students, there is no relationship between ‘salary’ and ‘country’. We can therefore conclude that the observed relationship between ‘salary’ and ‘country’ is in fact indirectly caused by the ‘GMAT’ variation.

### Example: Was the Class Lottery Conducted According to the Rules?

• In order to sample from the distribution of inter-arrival times at a supermarket checkout, we played a lottery.
• The results (shown overleaf) show that the simulated distribution looks very much like the distribution from which we are sampling. But they are not the same.
• So what are we looking at? Are we looking at two data sets generated by the same distribution, so that the differences can be attributed to random variation?
• Or are we, in fact, looking at two data sets that do not come from the same distribution, so that the differences are not random, as would be the case if the lottery were not conducted properly?
### Inter-arrival Distribution from Lottery

| Inter-arrival time | Inter-arrival probability | Observed frequency | Expected frequency |
|---|---|---|---|
| 1 | 0.39 | 76 | 71 |
| 2 | 0.17 | 43 | 31 |
| 3 | 0.13 | 18 | 24 |
| 4 | 0.09 | 13 | 16 |
| 5 | 0.06 | 11 | 11 |
| 6 | 0.05 | 5 | 9 |
| 7 | 0.03 | 6 | 5 |
| 8 | 0.02 | 2 | 4 |
| 9 | 0.06 | 9 | 11 |
| TOTAL | | 183 | 183 |

### Analysis

• H0: the observed data from the lottery was generated by the same process as the original inter-arrival time distribution.
• Using the original probabilities, calculate expected frequencies for each inter-arrival time, out of the total of 183.
• Combine the inter-arrival time categories of 7 and 8 mins, since the expected frequency for 8 mins is small (< 5).
### Analysis Cont’d

• Calculate the χ². This is 9.37.
• The d.f. is k − 1, where k is the number of categories of inter-arrival time, which is 8 after combining. So d.f. = 7.
• For d.f. = 7, the probability of a value of χ² = 9.37 or higher is between 20% and 25%. This is not small.
• Decision: we cannot reject H0.
• The data are consistent with the lottery having been conducted according to the rules.

### Further Reading

• Alan Agresti, 1996. ‘Introduction to Categorical Data Analysis’. John Wiley and Sons, London.
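The lottery analysis can be reproduced with a short sketch, assuming the probabilities and observed frequencies from the inter-arrival table and unrounded expected frequencies. The helper name `merge` is hypothetical; the 25% and 20% points for 7 d.f. (9.037 and 9.803) are taken from the lecture's chi-square table.

```python
# Inter-arrival times 1..9 minutes, values copied from the slides.
probabilities = [0.39, 0.17, 0.13, 0.09, 0.06, 0.05, 0.03, 0.02, 0.06]
observed = [76, 43, 18, 13, 11, 5, 6, 2, 9]

total = sum(observed)
expected = [p * total for p in probabilities]


def merge(values, i, j):
    """Combine adjacent categories i and j (j == i + 1) into one."""
    return values[:i] + [values[i] + values[j]] + values[j + 1:]


# Pool the 7- and 8-minute categories (indices 6 and 7), as on the slide.
observed_pooled = merge(observed, 6, 7)
expected_pooled = merge(expected, 6, 7)

chi2 = sum(
    (fo - fe) ** 2 / fe for fo, fe in zip(observed_pooled, expected_pooled)
)
df = len(observed_pooled) - 1

# Tabulated upper-tail points for 7 d.f. from the lecture's table.
POINT_25_PERCENT = 9.037
POINT_20_PERCENT = 9.803
between_20_and_25_percent = POINT_25_PERCENT < chi2 < POINT_20_PERCENT

print(f"chi-square = {chi2:.2f}, d.f. = {df}")
print(f"tail probability between 20% and 25%: {between_20_and_25_percent}")
```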