Chisquared test.pptx

Chi-squared Test
Krishnakumar D
Biostatistician

Chi-Square (χ²) and Frequency Data
• The chi-square test was proposed in 1900 by Karl Pearson.
• The data that we analyze consists of frequencies; that is, the
number of individuals falling into categories. In other words,
the variables are measured on a nominal scale.
• The test statistic for such frequency data is the Pearson
chi-square statistic.
• The magnitude of the Pearson chi-square statistic reflects the
amount of discrepancy between observed frequencies and expected
frequencies.

Steps in Test of Hypothesis
1. Determine the appropriate test
2. Establish the level of significance: α
3. Formulate the statistical hypothesis
4. Calculate the test statistic
5. Determine the degrees of freedom
6. Compare the calculated test statistic against a
table/critical value

1. Determine Appropriate Test
• Chi-square is used when both variables are
measured on a nominal scale.
• It can be applied to interval or ratio data that have
been categorized into a small number of groups.
• It assumes that the observations are randomly
sampled from the population.
• All observations are independent (an individual
can appear only once in a table and there are no
overlapping categories).
• It does not make any assumptions about the
shape of the distribution nor about the
homogeneity of variances.

2. Establish Level of Significance
• α is a predetermined value
• The conventional values are:
• α = .05
• α = .01
• α = .001

3. Determine The Hypothesis:
Whether There is an Association or
Not
• Ho : The two variables are independent
• Ha : The two variables are associated

4. Calculating Test Statistics
• The test statistic contrasts the observed frequency in each cell
of a contingency table with the expected frequency.
• The expected frequencies represent the number of
cases that would be found in each cell if the null
hypothesis were true (i.e., the nominal variables are
unrelated).
• Put differently, the expected values specify what each
cell of the table would contain if there were no association
between the two variables.
• The expected frequency of a cell under independence is the
product of its row total and column total divided by the
total number of cases:
$$F_e = \frac{F_r \, F_c}{N}$$
where F_r is the row total, F_c is the column total, and N is the
total number of cases.
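The expected-frequency rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `expected_frequencies` and the 2 × 2 table of counts are hypothetical, chosen only to show the arithmetic.

```python
def expected_frequencies(observed):
    """Return Fe = (row total * column total) / N for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 2 table of counts, used only for illustration.
table = [[30, 10],
         [20, 40]]

expected = expected_frequencies(table)
for row in expected:
    print(row)
# Row totals are 40 and 60, column totals are 50 and 50, N = 100,
# so the expected counts are [20.0, 20.0] and [30.0, 30.0].
```

Note that the expected table always reproduces the row and column totals of the observed table.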

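$$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$$
where F_o is the observed frequency and F_e is the expected frequency
of a cell, and the sum runs over all cells of the table.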




5. Determine Degrees of Freedom
df = (R − 1)(C − 1)
where R is the number of rows and C is the number of columns of the
contingency table.

6. Compare computed test statistic
against a tabled/critical value
• The computed value of the Pearson chi-square
statistic is compared with the critical value to
determine whether the computed value is improbable under Ho.
• The tabled critical values are based on the
sampling distribution of the Pearson chi-square
statistic.
• If the calculated χ² is greater than the table value of χ²,
reject Ho.
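Instead of looking up a table, the comparison can be made through a p-value. The sketch below is one possible pure-Python way to get the upper-tail chi-square probability for integer degrees of freedom, using the standard closed-form series; the names `chi2_sf` and `decide` and the statistic 6.0 are hypothetical, and a statistics library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum_{i=0}^{df/2 - 1} h^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return math.exp(-h) * total
    # Odd df: erfc(sqrt(h)) + exp(-h) * sum_{j=1}^{(df-1)/2} h^(j - 1/2) / Gamma(j + 1/2)
    total = 0.0
    term = math.sqrt(h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return math.erfc(math.sqrt(h)) + math.exp(-h) * total


def decide(statistic, df, alpha=0.05):
    """Return (p_value, reject_Ho) for a computed chi-square statistic."""
    p_value = chi2_sf(statistic, df)
    return p_value, p_value < alpha


# Hypothetical computed statistic of 6.0 on 1 df, for illustration only.
p_value, reject = decide(6.0, 1)
print(round(p_value, 4), reject)

# The usual 5% table value for 1 df (3.841) has an upper-tail area of about 0.05.
print(round(chi2_sf(3.841, 1), 3))
```

Rejecting when the statistic exceeds the table value and rejecting when the p-value is below α are the same decision rule.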

Example
• General Social Survey, 1991.
Let X = Income
Y = Job satisfaction

Rows: Income. Columns: Job satisfaction.
Income            Dissatisfied  Little satisfied  Mod. satisfied  Much satisfied  Total
< 5,000                      2                 4              13               3     22
5,000 to 15,000              2                 6              22               4     34
15,000 to 25,000             0                 1              15               8     24
> 25,000                     0                 3              13               8     24
Total                        4                14              63              23    104

Hypothesis:
Ho : X and Y are independent
H1 : X and Y are dependent
The Pearson chi-square statistic is
$$\chi^2 = \sum \frac{(F_o - F_e)^2}{F_e}$$
Degrees of freedom = (4 − 1)(4 − 1) = 3 × 3 = 9


Observed Frequencies
Rows: Income. Columns: Job satisfaction.
Income            Dissatisfied  Little satisfied  Mod. satisfied  Much satisfied  Total
< 5,000                      2                 4              13               3     22
5,000 to 15,000              2                 6              22               4     34
15,000 to 25,000             0                 1              15               8     24
> 25,000                     0                 3              13               8     24
Total                        4                14              63              23    104

Expected Frequencies
Rows: Income. Columns: Job satisfaction.
Income            Dissatisfied  Little satisfied  Mod. satisfied  Much satisfied
< 5,000                    0.8               3.0            13.3             4.9
5,000 to 15,000            1.3               4.6            20.6             7.5
15,000 to 25,000           0.9               3.2            14.5             5.3
> 25,000                   0.9               3.2            14.5             5.3
For example, the top-left cell is (22 × 4)/104 ≈ 0.8 and the
bottom-right cell is (24 × 23)/104 ≈ 5.3.

Cell contributions (Fo − Fe)²/Fe
Rows: Income. Columns: Job satisfaction.
Income            Dissatisfied  Little satisfied  Mod. satisfied  Much satisfied  Total
< 5,000                    1.6               0.4             0.0             0.7
5,000 to 15,000            0.4               0.4             0.1             1.6
15,000 to 25,000           0.9               1.5             0.0             1.4
> 25,000                   0.9               0.0             0.2             1.4
Total                      3.8               2.4             0.3             5.1   11.5
For example, the bottom-right cell is (8 − 5.3)²/5.3 ≈ 1.4.
χ² = 11.5

Table value =
The evidence against Ho is weak.
It is plausible that job satisfaction and income are
independent.
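The worked example can be reproduced with a short Python sketch. The function name `pearson_chi_square` is hypothetical, and the critical value 16.919 is the standard table value for 9 df under an assumed α = 0.05 (the slides do not state which α was used).

```python
def pearson_chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected frequency
            statistic += (fo - fe) ** 2 / fe
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Observed frequencies from the income by job satisfaction table.
job_satisfaction = [
    [2, 4, 13, 3],
    [2, 6, 22, 4],
    [0, 1, 15, 8],
    [0, 3, 13, 8],
]

chi2, df = pearson_chi_square(job_satisfaction)
CRITICAL_05_DF9 = 16.919  # assumed alpha = 0.05, standard table value for 9 df
print(round(chi2, 1), df, chi2 > CRITICAL_05_DF9)  # 11.5 9 False
```

Because 11.5 is below the assumed critical value, Ho is not rejected, which agrees with the conclusion stated above.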

An alternative
• The likelihood ratio test: it compares observed
values with the distribution of expected values
based on the multinomial probability distribution.
The likelihood-ratio statistic GSq (also written G²) is
$$G^2 = 2 \sum_{\text{all cells}} \text{Observed} \times \ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 0.0866$$

Pearson Statistic and Likelihood ratio
statistic
• Like the Pearson statistic, GSq takes its minimum value
of 0 when all observed = expected, and larger values
provide stronger evidence against Ho.
• Although the Pearson χ² and the likelihood-ratio GSq
are separate test statistics, they share many
properties and usually lead to the same conclusions.
• When Ho is true and the expected frequencies are
large, the two statistics have the same chi-squared
distribution, and their numerical values are similar.

Example 2
• Two sample polls of votes for two candidates
A and B for a public office are taken, one from
among the residents of rural areas and the other from
among the residents of urban areas. The results
are given in the adjoining table.
• Examine whether the nature of the area is
related to voting preference in this election.
Hypothesis
• Ho: The nature of the area is independent of
the voting preference in the election
• H1: The nature of the area is dependent on the
voting preference in the election

Area     Votes for A   Votes for B   Total
Rural            620           380    1000
Urban            550           450    1000
Total           1170           830    2000
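Both the Pearson statistic and the likelihood-ratio statistic can be computed for this voting table with the sketch below. The helper names are hypothetical. Because the expected frequencies here are large, the two statistics should come out close to each other.

```python
import math


def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(observed):
    expected = expected_frequencies(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))


def likelihood_ratio(observed):
    expected = expected_frequencies(observed)
    # Cells with an observed count of 0 contribute 0 to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for obs_row, exp_row in zip(observed, expected)
                     for o, e in zip(obs_row, exp_row) if o > 0)


# Rows: Rural, Urban. Columns: votes for A, votes for B.
votes = [[620, 380],
         [550, 450]]

x2 = pearson_chi_square(votes)
g2 = likelihood_ratio(votes)
df = (len(votes) - 1) * (len(votes[0]) - 1)
print(round(x2, 2), round(g2, 2), df)  # both statistics are roughly 10.1 on 1 df
```

Both values are well above 3.841, the 5% table value for 1 d.f.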

Interpretation
1) The table value for 1 d.f. at the 5% level of
significance is 3.841. The calculated value (χ² ≈ 10.09)
is greater than the table value, so we conclude that the
nature of the area is related to voting preference in the
election.
2) The p-value (0.001) is less than 0.05, and
hence the null hypothesis is rejected.
[Figure: chi-square distribution with the critical value 3.841 separating the acceptance region (1 − α) from the rejection region (α)]

Residuals
• Testing independence with the chi-square test tells us,
through the p-value, whether an association between the two
variables exists or not.
• However, it gives no information about the strength of
the association.
• The strength of association is examined using
1) Residual analysis
2) Partitioning the chi-square statistic

Residual analysis
• Compares the Oij (observed) and Eij (expected)
values.
• The difference between the observed value of
the dependent variable (y) and its expected
value is known as the residual. For cell (i, j):
eij = Oij − Eij

Pearson Residuals
• Pearson residuals attempt to adjust for the
fact that larger values of Oij and Eij tend to
have larger differences.
• One approach to adjusting for the variance is to
divide the difference (Oij − Eij) by √Eij.
• Thus, define
$$e_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}$$
as the Pearson residual.
• Note that
$$\sum_{i}\sum_{j} e_{ij}^2 = \chi^2$$

Standardised Pearson Residuals
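• Under Ho, the eij are asymptotically normal with mean 0.
• However, the variance of eij is less than 1.
• To compensate for this, one can use the
STANDARDIZED Pearson residuals.
• Denote by rij the standardized residual, defined as
$$r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}\,(1 - \hat{p}_{i+})(1 - \hat{p}_{+j})}}$$
where \hat{p}_{i+} is the estimated row i marginal
probability and \hat{p}_{+j} is the estimated column j marginal
probability.
• rij is asymptotically distributed as a standard normal.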

Standardised Pearson Residuals
• As a “rule of thumb”, an rij whose absolute value
is greater than 2 or 3 indicates a
lack of fit of H0 in that cell.
• However, as the number of cells increases, the
likelihood that some cell exceeds 2 or 3
increases. For example, if you have 20 cells,
you could expect 1 in the 20 to have a value
greater than 2 just by chance (i.e., α = 0.05).
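The "1 in 20" remark can be made concrete with a short calculation. As a rough sketch, assume each of the 20 standardized residuals independently exceeds 2 in absolute value with probability about 0.05 under H0. Independence between cells is an approximation introduced here, not something the slides state.

```latex
\begin{aligned}
\text{Expected number of cells with } |r_{ij}| > 2 &\approx 20 \times 0.05 = 1, \\
P(\text{at least one cell with } |r_{ij}| > 2) &\approx 1 - 0.95^{20} \approx 0.64 .
\end{aligned}
```

Under that assumption a single large residual in a big table is weak evidence on its own, which is why a cutoff of 3 is often preferred when there are many cells.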

Output
We find large positive residuals for “Rural preferred voting
Candidate A” and “Urban preferred voting Candidate B”, and large
negative residuals for “Rural preferred voting Candidate B” and “Urban
preferred voting Candidate A”. Thus, there were significantly more people
in “Rural preferred voting Candidate A” and “Urban preferred voting
Candidate B”, and fewer people in “Rural preferred voting Candidate B”
and “Urban preferred voting Candidate A”, than the hypothesis of
independence predicts.
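The residuals described above can be recomputed from the voting table with the following sketch. The function name `residual_tables` is hypothetical; the formulas are the Pearson residual and the usual standardized Pearson residual, which divides additionally by the square root of (1 − row proportion)(1 − column proportion).

```python
import math


def residual_tables(observed):
    """Return (Pearson residuals, standardized Pearson residuals) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    pearson, standardized = [], []
    for i, row in enumerate(observed):
        p_row, s_row = [], []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            p_row.append((o - e) / math.sqrt(e))
            variance = e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            s_row.append((o - e) / math.sqrt(variance))
        pearson.append(p_row)
        standardized.append(s_row)
    return pearson, standardized


# Rows: Rural, Urban. Columns: votes for A, votes for B.
votes = [[620, 380],
         [550, 450]]

pearson, standardized = residual_tables(votes)
for label, p_row, s_row in zip(["Rural", "Urban"], pearson, standardized):
    print(label, [round(v, 2) for v in p_row], [round(v, 2) for v in s_row])
```

The standardized residuals are about +3.18 for Rural/A and Urban/B and about −3.18 for Rural/B and Urban/A, all beyond the rule-of-thumb cutoff of 3, and the squared Pearson residuals add up to the χ² statistic of the table.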

Partitioning the Likelihood Ratio Test
• Motivation for this:
1) If you reject Ho and conclude that X and Y are
dependent, the next question could be ‘Are there individual
comparisons more significant than others?’.
2) Partitioning (breaking a general I × J contingency table
into smaller tables) may show that the association is largely
dependent on certain categories or groupings of categories.
• Recall these basic principles about chi-square variables:
1) If X1 and X2 are both (independently) distributed as χ²
with df = 1, then X = X1 + X2 ∼ χ² (df = 1 + 1).
2) In general, the sum of independent χ² random variables is
distributed as χ² (df = ∑ df(Xi)).

General Rules for Partitioning
In order to completely partition an I × J
contingency table, you need to follow these three
rules.
1. The df for the subtables must sum to the df
for the full table
2. Each cell count in the full table must be a cell
count in one and only one subtable
3. Each marginal total of the full table must be a
marginal total for one and only one subtable

Example
Independent random samples of 83, 60, 56, and 62 faculty
members of a state university system from four system universities
were polled to determine which of the three collective bargaining
agents (i.e., unions) are preferred.
Interest centers on whether there is evidence to indicate a
difference in the distribution of preference across the 4 state
universities.

Table 1
                Bargaining agent
University     101   102   103   Total
1               42    29    12      83
2               31    23     6      60
3               26    28     2      56
4                8    17    37      62
Total          107    97    57     261

• Therefore, we see that there is a significant
association between University and Bargaining
Agent.
• Just by looking at the data, we see that
  - University 4 seems to prefer Agent 103
  - Universities 1 and 2 seem to prefer Agent 101
  - University 3 may be undecided, but leans
    towards Agent 102
• Partitioning will help examine these trends

First subtable
The association of University 4 appears the
strongest, so we could consider the following subtable.

Subtable 1
               Bargaining agent
University   101-102   103   Total
1-3              179    20     199
4                 25    37      62
Total            204    57     261

Note: This table was obtained by considering
the {4, 3} cell in comparison to the rest
of the table.
G^2 = 60.5440 on 1 df (p < 0.0001).
We see strong evidence for an association between
universities (grouped accordingly) and agents.

Second Subtable
Now, we could consider just Agents 101 and 102
with Universities 1 - 3.

Subtable 2
             Bargaining agent
University   101   102   Total
1             42    29      71
2             31    23      54
3             26    28      54
Total         99    80     179

G^2 = 1.6368 on 2 df (p=0.4411).
For Universities 1 - 3 and Agents 101 and 102,
preference is homogeneous (universities prefer
agents in similar proportions from one university
to another).

Third Subtable
We could also consider bargaining agents by
dichotomized university.

Subtable 3
             Bargaining agent
University   101   102   Total
1-3           99    80     179
4              8    17      25
Total        107    97     204

G^2 = 4.8441 on 1 df (p=0.0277).
There is an indication that the preference for agents varies
with the introduction of University 4.

Final table
A final table we can construct is

Subtable 4
               Bargaining agent
University   101-102   103   Total
1                 71    12      83
2                 54     6      60
3                 54     2      56
Total            179    20     199

G^2 = 4.966 on 2 df (p=0.0835).
With the addition of Agent 103 back into the
summary, we see that Universities 1 - 3 still have
homogeneous preference.

What have we done?
General Notes:
1. We created 4 subtables with df of 1, 2, 1 and 2 (recall Rule 1: df
must sum to the total. 1 + 2 + 1 + 2 = 6. Rule 1 - Check!)
2. Rule 2 - Each cell count appears in only 1 subtable. (42 was in
subtable 2, 29 in subtable 2, ..., 37 in subtable 1). Rule 2 - Check!
3. Rule 3 - Marginals can only appear once. (83 was in subtable 4, 60 in
subtable 4, 56 in subtable 4, 62 in subtable 1, 107 in subtable 3, 97 in
subtable 3, 57 in subtable 1). Rule 3 - Check!
Since we have partitioned according to the rules, note the sum
of G^2.
G^2 = 60.5440 + 1.6368 + 4.8441 + 4.9660 ≈ 71.991 on 6 df, which is
the same value obtained from the original table.
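The exact additivity of G^2 can be checked numerically. The sketch below recomputes G^2 for the full table and for the four subtables listed above; the function name `g_squared` is hypothetical, and expected counts are taken from each table's own margins.

```python
import math


def g_squared(observed):
    """Likelihood-ratio statistic G^2 = 2 * sum(O * ln(O / E)) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            if o > 0:  # empty cells contribute nothing
                e = row_totals[i] * col_totals[j] / n
                total += o * math.log(o / e)
    return 2.0 * total


full_table = [[42, 29, 12], [31, 23, 6], [26, 28, 2], [8, 17, 37]]
subtables = [
    [[179, 20], [25, 37]],           # subtable 1: universities 1-3 vs 4, agents 101-102 vs 103
    [[42, 29], [31, 23], [26, 28]],  # subtable 2: universities 1-3, agents 101 and 102
    [[99, 80], [8, 17]],             # subtable 3: universities 1-3 vs 4, agents 101 and 102
    [[71, 12], [54, 6], [54, 2]],    # subtable 4: universities 1-3, agents 101-102 vs 103
]

parts = [g_squared(t) for t in subtables]
print([round(p, 4) for p in parts])
print(round(sum(parts), 3), round(g_squared(full_table), 3))
```

The two numbers on the last line agree, which is the double check recommended for a partition.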

Overall Summary of Example
Now that we have verified our partitioning, we can draw
inferences from the subtables.
From the partitioning, we can observe that
1. The preference distribution is homogeneous among
Universities 1 - 3.
2. Preference for a bargaining agent is independent
of the faculty member's university, with the exception that if a
faculty member belongs to University 4, then he or
she is much more likely than would otherwise have
been expected to show preference for bargaining
agent 103 (and vice versa).

Final Comments on Partitioning
• For the likelihood ratio test (G^2), exact partitioning occurs
(meaning you can sum the fully partitioned subtables’ G^2 to arrive
at the original G^2).
• Pearson’s χ² does not have this property.
• Use the summation of G^2 to double check your partitioning.
• You can have as many subtables as you have df. However, as in our
example, you may have tables with df > 1 (which yields fewer
subtables).
• The selection of subtables is not unique. To initiate the process, you
can use your residual analysis to identify the most extreme cell and
begin there (this is why the {4, 3} cell was isolated initially).
• Partitioning is not easy and is an acquired knack. However, the
reward is additional interpretation that is generally desired in the
data summary.

Advantages and Limitations of Chi-squared Tests
• The Pearson chi-square statistic and the likelihood-ratio
statistic do not change value when rows or columns are
reordered; that is, both treat the variables as
nominal. If at least one variable is ordinal, these tests
ignore the ordering and are not well suited to the data.
• G² and χ² require large samples.
• When an expected frequency is small (< 5), the
results from G² and χ² will not be reliable. So,
whenever at least one expected frequency is less
than 5, you can instead use a small-sample
procedure.
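A quick way to apply the "expected frequency less than 5" check is to list the offending cells before trusting χ² or G². The sketch below does this for the income by job satisfaction table from the first example; the function name `small_expected_cells` and the default threshold argument are hypothetical choices for illustration.

```python
def small_expected_cells(observed, threshold=5.0):
    """Return (row index, column index, expected count) for cells with Fe < threshold."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            fe = r * c / n
            if fe < threshold:
                flagged.append((i, j, round(fe, 1)))
    return flagged


# Observed frequencies from the income by job satisfaction example.
job_satisfaction = [
    [2, 4, 13, 3],
    [2, 6, 22, 4],
    [0, 1, 15, 8],
    [0, 3, 13, 8],
]

flagged = small_expected_cells(job_satisfaction)
print(len(flagged), "of 16 cells have an expected frequency below 5")
for cell in flagged:
    print(cell)
```

For that table 9 of the 16 cells are flagged, so by the guideline above its chi-squared results should be read with caution.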
