Chisquare Test of Association.pdf in biostatistics

KEBBI STATE UNIVERSITY OF SCIENCE & TECHNOLOGY,
ALIERO
COLLEGE OF HEALTH SCIENCES
DEPARTMENT OF COMMUNITY HEALTH
COM 201: BIOSTATISTICS
CHI-SQUARE TEST OF ASSOCIATION
By
Prof AU Ka’oje

Test of Association I: Chisquare Test
• A chi-square test (denoted as χ²) is used to assess whether the distribution of a
categorical variable differs significantly between two or more groups.
• It is used to detect the existence of an association (relation) between two
categorical variables,
• that is, between the data in rows and the data in columns,
• but it does not indicate the strength of any association
• In health research, a test of chi-square is frequently used to assess whether disease
(present/absent) is associated with exposure (yes/no)
• Chi-square tests are appropriate for most study designs but the results are
influenced by the sample size.

• The value of χ² depends on the size of the differences between observed and expected
frequencies, the degrees of freedom, and the sample size.
• There are two types of tests:
• 1. The goodness of fit: It is used to determine how well the observed data
represent the population (expected values)
• 2. Test for independence: It is used to determine if there is a relation between two
categorical variables

Test of Association I: Chisquare Test
• χ² cannot have a negative value, as the
distribution curve is always on the
positive side, i.e., skewed to the right.
• More of the area under the curve lies
towards the left of the graph
• It has no symmetry
• It depends on the degrees of freedom (df):
one less than the number of categories (k - 1)
for a goodness-of-fit test, and (r - 1) x (c - 1)
for an r x c contingency table
• The chi-squared test is a non-parametric test.

Presentation of Chisquare Test Data
• Data is presented in an r x c table (row x
column),
• cross-classification or contingency table
• Data are presented in cells, arranged in
rows (horizontal) and columns
(vertical).
• These often appear in the form of a 2 x
2 table
• χ² is also used with more than two rows
and/or columns:
• an m by n contingency table with m rows
and n columns
Example of a 2 x 2 table because each
variable has two levels

• χ² only works when frequencies (counts) are used in the cells.
• Data such as proportions, means or physical measurements are not valid.
• χ² is more accurate when large frequencies are used.
• The test is used to detect an association between data in rows and data in columns, but
it does not indicate the strength of any association.
• In a contingency table, one variable (usually the exposure) forms the rows and the
other variable (usually the disease) forms the columns.
• Column is the y axis and row is the x-axis
• Four internal cells (a – d) show the counts for each of the disease/exposure
groups

Presentation of Chisquare Test Data
• Cell ‘a’ shows the number who satisfy
exposure present (immunized) and
disease present (illness positive).
• Cell ‘b’ - number who satisfy exposure
present (immunized) and disease
absent (illness negative), etc
• As in all analyses, it is important to
identify which variable is the outcome
variable (Column) and which variable
is the explanatory variable (Row).
Example of a 2 x 2 table
                                  OUTCOME
Exposure present   Illness positive   Illness negative   Total
Immunized          a                  b                  a+b
Not immunized      c                  d                  c+d
Total              a+c                b+d                a+b+c+d

• It is important to set up the crosstabulation table so that it displays the percentages
that are appropriate for answering the research question.
• This can be achieved by either:
• entering the independent (explanatory) variable in the rows, the dependent
(outcome) in the columns and using row percentages, or
• entering the independent (explanatory) variable in the columns, the dependent
(outcome) in the rows and using column percentages.
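
The first option above (explanatory variable in the rows, row percentages) can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical and not taken from the lecture, and the function name `row_percentages` is an assumption made for this sketch.

```python
def row_percentages(table):
    """Return each cell as a percentage of its own row total."""
    result = []
    for row in table:
        total = sum(row)
        result.append([100.0 * count / total for count in row])
    return result


# Hypothetical counts: rows = exposure, columns = outcome.
observed = [
    [10, 90],  # immunized: illness positive, illness negative
    [30, 70],  # not immunized: illness positive, illness negative
]

percentages = row_percentages(observed)
for label, row in zip(["Immunized", "Not immunized"], percentages):
    print(f"{label}: {row[0]:.1f}% ill, {row[1]:.1f}% not ill")
```

With the explanatory variable in the rows, each row adds up to 100%, so the percentage with the outcome can be compared directly between the exposure groups.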

Assumptions/Conditions for using a chi-square test
• The assumptions that must be met when using a chi-square test are that:
• Each observation must be independent
• each participant is represented in the table once only (NO REPEAT DATA)
• None of the cells contains zero frequency
• All of the expected frequencies should be more than 1
• No cell should have an expected frequency of less than 5;
• For larger tables, not more than 20% of the cells should have an expected frequency of less than 5
• Samples are randomly drawn and are independent

• If these conditions are not met, the chi-squared test is not valid and therefore
cannot be used.
• If the χ² test is not valid and a 2 x 2 table is being used, Fisher's exact test is utilised.
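
A minimal sketch of how the expected-frequency conditions could be checked before running the test. It assumes the usual reading of the rules above (every expected count at least 1, and no more than 20% of cells with an expected count below 5); the function names and the example counts are hypothetical.

```python
def expected_counts(observed):
    """Expected count per cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_assumptions_met(observed):
    """True if all expected counts are >= 1 and at most 20% are below 5."""
    cells = [e for row in expected_counts(observed) for e in row]
    if min(cells) < 1:
        return False
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20


# Hypothetical tables: the first is large enough, the second is not.
print(chi_square_assumptions_met([[30, 70], [50, 50]]))
print(chi_square_assumptions_met([[3, 2], [4, 15]]))
```

In a 2 x 2 table a single cell is 25% of the cells, so the 20% rule amounts to requiring every expected count to be at least 5.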
Which Chi-square to use?
• Chi-square statistic that is conventionally used depends on both the sample size
and the expected cell counts.
• Pearson’s chi-square
• Continuity correction
• Fisher’s exact test
• Linear-by-linear

Fisher’s exact test
• Fisher’s exact test is regarded as a gold standard and, when available, could be used in all
situations
• Pearson’s chi-square and the continuity correction tests are approximations.
• Fisher’s exact test is generally calculated for 2 × 2 contingency tables and small
sample sizes;
• depending on the program used, it may also be produced for crosstabulations larger than 2 × 2.
• The exact calculation based on the exact distribution of the test statistics provides a
reliable P value irrespective of the sample size or distribution of the data.
• In a 2 × 2 contingency table, the Pearson’s chi-square produces smaller P values than
Fisher’s exact.
• a type I error may occur
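
The exact calculation for a 2 x 2 table can be sketched with the hypergeometric distribution. This is an illustrative implementation, not the routine used by any particular package: it assumes the common two-sided definition (sum the probabilities of all tables with the same margins that are no more probable than the observed one), and the cell counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that cell 'a' equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table where the chi-square approximation would be doubtful.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))
```

Statistical packages may define the two-sided P value slightly differently, so small discrepancies from software output are possible.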

Fisher’s exact test (continued)
• A correction made to the calculation of Pearson’s chi-square (Yates continuity
correction) increases the P value.
• correction tends to overestimate the P value, and
• a type II error may occur
• Yates correction should generally not be applied except if the sample size is small.
• Linear-by-linear test is a trend test,
• Most appropriate in situations in which an ordered exposure variable has three or
more categories and the outcome variable is binary.

Calculating Chi-square value
• Test statistic is calculated by taking the:
• frequencies that are actually observed
(O) and then working out the
frequencies which would be expected
(E) if the null hypothesis was true.
• Null hypothesis (H0) - there is no
association between the variables.
• Alternative hypothesis (H1) - there is an
association between the variables.
• The expected count is the expected
value due by chance alone and is
calculated for each cell as:
                                                        Total
          a                    b                    Row total = a+b
          c                    d                    Row total = c+d
Total     Column total = a+c   Column total = b+d   Grand total = a+b+c+d
For cell ‘a’, expected count = (a+b) x (a+c) / (a+b+c+d)
In general:
$\text{Expected count} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$
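
As a small illustration of the expected-count rule for a 2 x 2 table, the sketch below computes all four expected counts from cells a to d. The function name `expected_2x2` and the counts are assumptions made for the example, not data from the lecture.

```python
def expected_2x2(a, b, c, d):
    """Expected counts for cells a-d: row total x column total / grand total."""
    n = a + b + c + d
    return {
        "a": (a + b) * (a + c) / n,
        "b": (a + b) * (b + d) / n,
        "c": (c + d) * (a + c) / n,
        "d": (c + d) * (b + d) / n,
    }


# Hypothetical counts: a = 20, b = 80, c = 40, d = 60.
expected = expected_2x2(20, 80, 40, 60)
for cell, value in expected.items():
    print(cell, value)
```

The expected counts always add up to the same grand total as the observed counts, which is a quick check on the arithmetic.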

This formula is used to produce the χ² statistic:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
§ Where:
§ O = observed frequencies,
§ E = expected frequencies.
§ Degrees of freedom (d.f.) are calculated using
the formula: d.f. = (r-1) x (c-1), where
§ r = number of rows and
§ c = number of columns.
Expected frequency for each cell (a –
d) in a 2 x 2 table can be calculated
as: (row total x column total) / grand total
Expected frequencies can also be
calculated using probability theory/
distribution
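
The statistic and its degrees of freedom can be computed together for any r x c table. The sketch below is a plain-Python illustration under the definitions given here (sum of (O - E)²/E over all cells, d.f. = (r-1) x (c-1)); the function name and the counts are hypothetical.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical 2 x 2 table.
stat, df = chi_square([[20, 80], [40, 60]])
print(round(stat, 3), df)
```

The same function works for larger tables; for example, a 2 x 3 table gives d.f. = (2-1) x (3-1) = 2.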

• Chi-square statistic compares the observed count in each cell to the count which
would be expected under the assumption of no association between the row and
column classifications.
• The continuity-corrected chi-square (Yates’s continuity correction) is calculated in a similar way,
with a correction for small data: 0.5 is subtracted from the absolute value of each
deviation (O – E) before it is squared.
• It is especially important to use when frequencies are small.
• Yates' correction can only be used for 2 x 2 tables.
• It lowers the value of the chi-square statistic and therefore makes the result slightly
less significant.
• H0 for a chi-square test - there is no significant difference between the observed and
expected frequencies.
• If the observed and expected values are similar, then the χ² value will be close to zero and
therefore will not be significant.

• The larger the difference between the observed and expected frequencies, the larger the
χ² value becomes and the stronger the evidence against the null hypothesis,
• and the more likely the P value will be significant.
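
The effect of Yates's continuity correction described above can be seen by computing the statistic with and without it. This is a sketch for 2 x 2 tables only; the counts are hypothetical, and clipping the corrected deviation at zero is an assumption added as a safeguard, not something stated in the lecture.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for a 2 x 2 table, optionally with Yates's correction."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    correction = 0.5 if yates else 0.0
    total = 0.0
    for o, e in zip(observed, expected):
        # Subtract 0.5 from |O - E| before squaring (never going below zero).
        deviation = max(abs(o - e) - correction, 0.0)
        total += deviation ** 2 / e
    return total


# Hypothetical counts.
uncorrected = chi_square_2x2(20, 80, 40, 60)
corrected = chi_square_2x2(20, 80, 40, 60, yates=True)
print(round(uncorrected, 3), round(corrected, 3))
```

The corrected value is smaller than the uncorrected one, which is why the correction gives a larger P value.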
Steps in Hypothesis Testing
1. State hypothesis: null & alternative
2. Decide level of significance
3. Decide appropriate test statistic for the hypothesis
4. Review/ensure assumptions of sampling distribution
5. Choose the critical region of the test statistic
6. Work out test statistic:
§ i. write out chisquare formula, if not provided
§ ii. Calculate the expected frequency for each cell,
§ iii. Create table of expected frequencies and other variables,
§ iv. Compute the final chisquare value; check statistic against a distribution with known properties
7. Make a decision rule/interpret your finding
8. Conclusion

EXAMPLE: Frequencies for HbA1c testing by ethnic group (Source: adapted from Stewart and
Rao, 2000)
CELLS   O   E   O – E   (O – E)²   (O – E)²/E
a
b
c
d
Total                              ∑ (O – E)²/E =
Create table of expected frequencies

• Using the marginal totals, probabilities can
be calculated.
• P(HbA1c +ve) = 558/774
• P(N) = 198/774
• If H0 is true and the two events are
independent, then P(HbA1c +ve and N) =
P(HbA1c +ve) x P(N)
• The joint probability in the cell for Asian
in HbA1c +ve = 558/774 x 198/774
• The expected frequency in the cell for
Asian in HbA1c +ve = (558/774) x (198/774)
x 774 =
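
The probability route to an expected frequency can be checked numerically against the row total x column total / grand total rule. The sketch below uses only the three marginal figures quoted above (558, 198 and 774); the variable names are assumptions made for the example.

```python
grand_total = 774
hba1c_positive_total = 558  # marginal total for HbA1c +ve
group_total = 198           # marginal total used for the Asian cell above

# Under H0 the two events are independent, so the joint probability
# is the product of the two marginal probabilities.
p_positive = hba1c_positive_total / grand_total
p_group = group_total / grand_total
joint_probability = p_positive * p_group

# Expected frequency = joint probability x grand total.
expected_from_probability = joint_probability * grand_total

# The same value from the marginal totals directly.
expected_from_margins = hba1c_positive_total * group_total / grand_total

print(round(expected_from_probability, 2), round(expected_from_margins, 2))
```

Both routes give the same number because multiplying the two marginal probabilities by the grand total cancels one of the divisions by 774.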

Steps in Hypothesis testing
1. State hypothesis: null & alternative
2. Decide level of significance
3. Decide appropriate test statistic for the hypothesis
4. Review/ensure assumptions of sampling distribution
5. Choose the critical region of the test statistic: the critical limit of χ² with df of 1 and alpha of
0.05 from the table = 3.841
6. Work out test statistic:
§ i. write out chisquare formula, if not provided
§ ii. Calculate the expected frequency for each cell,
§ iii. Create table of expected frequencies and other variables,
§ iv. Compute the final chisquare value; statistic can then be checked against a distribution with known
properties
7. Make a decision rule/interpret your finding: if the observed value is bigger than this critical value,
you would say that there is a significant relationship between the two variables.
8. Conclusion

• The Chi-square test statistic = 7.32
• On Chi-square distribution table, we look along the row for d.f. = 1.
• We look along the row to find the values to the left and right of the χ² statistic - it lies in
between 6.635 and 10.827.
• Reading up the columns for these two values shows that the corresponding P-value is less than
0.01 but greater than 0.001 - we can therefore write the P-value as P < 0.01.
• Thus there is strong evidence to reject the null hypothesis, and we may conclude
that there is an association between being Asian and receiving an HbA1c check.
Asian patients are significantly less likely to receive an HbA1c check, and appear to
receive a poorer quality of care in this respect.
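
The table lookup above can be cross-checked by computing the P value directly. For 1 degree of freedom the upper-tail area of the chi-square distribution has a closed form using the complementary error function, which is what this sketch relies on; the function name `p_value_df1` is hypothetical, and the approach does not extend to other degrees of freedom.

```python
from math import erfc, sqrt


def p_value_df1(chi_sq):
    """Upper-tail P value for a chi-square statistic with 1 d.f."""
    # With 1 d.f., chi-square is the square of a standard normal variable,
    # so P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi_sq / 2))


p = p_value_df1(7.32)
print(f"P = {p:.4f}")

# The tabulated critical values correspond to the familiar alpha levels.
print(f"{p_value_df1(3.841):.3f}", f"{p_value_df1(6.635):.3f}")
```

The computed P value for 7.32 falls between 0.001 and 0.01, in agreement with the table reading.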

Class work
• The table shows the results of the
field trial of 2 whooping cough
vaccines.
• The question that arises is
whether the vaccine B was really
superior to vaccine A,
• OR
• whether the difference was
merely due to chance.

Assignment: In a study to find out whether diabetes mellitus is related to blood group, a group
of 65 patients with DM was compared with 120 normal healthy individuals. The
observations are presented in the table below:
Subject O A B AB TOTAL
NORMAL 58 30 28 4 120
DIABETIC 32 16 15 2 65
TOTAL 90 46 43 6 185
Is there an association between being diabetic and having a particular blood group type?

Chisquare test in SPSS
• When conducting a chi-square test in SPSS, the significance level is calculated using
the ‘asymptotic’ method,
• which means that P values are calculated based on the assumption that the data has
a large enough sample size to conform to a certain distribution
• If the sample size is small or some cells have a low count, the ‘exact’ P values should
be reported since the asymptotic P values will be unreliable.
• Exact calculation based on the exact distribution of the test statistics provides a
reliable P value irrespective of the sample size or distribution of the data

§ Result shows that 40.2% of males in the sample were premature compared with 20.3%
of females, i.e., rate of prematurity in the males is almost twice that in the females.
§ the smallest cell has an observed count of 12.
§ Expected number for the cell: 59 × 45/141, or 18.83 as shown in the footnote of the
Chi-Square Tests table overleaf

§ In the χ² table, the third column’s heading, ‘Asymp. Sig. (two-sided)’, indicates that the significance
level, for a two-sided test, is calculated asymptotically.
§ The sample size is large, so the chi-square distribution approximates the exact distribution of the
Pearson statistic; the Pearson chi-square value should therefore be reported.
§ The continuity correction (Yates) results in a P value of 0.020, which is slightly higher than the P
value of 0.017 for the Fisher’s exact test.
§ Fisher’s exact test would not be reported in this study because the sample size (141) is large

• This test is two-tailed and the corresponding value indicates that the difference in
rates of prematurity between the genders is statistically significant at P = 0.017.
• This result can be reported as ‘Fisher’s exact test indicated that there was a
significant difference in prematurity between males and females (40.2% vs 20.3%,
P = 0.02)’.

Strength of association
• These include
• Phi
• Contingency Coefficient
• Cramer’s V
• Most of these tests are measures of the strength of association.
• Measures are based on modifying the chi-square statistic to take account of sample
size and degrees of freedom and
• they try to restrict the range of the test statistic from 0 to 1
• to make them similar to the correlation coefficient

Strength of association
• Phi: This statistic is accurate for 2 X 2 contingency tables.
• for tables with greater than two dimensions the value of phi may not lie between 0 and 1 because the
chi-square value can exceed the sample size.
• Pearson suggested the use of the coefficient of contingency.
• Contingency Coefficient: This coefficient ensures a value between 0 and 1
• unfortunately, it seldom reaches its upper limit of 1,
• for this reason Cramer devised Cramer’s V.
• Cramer’s V: When both variables have only two categories, phi and Cramer’s V are
identical.
• when variables have more than two categories Cramer’s statistic can attain its maximum of
one – unlike the other two – and
• so it is the most useful.
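
The three measures can be written in terms of the chi-square value, the sample size n and the table dimensions. The sketch below uses the standard textbook formulas (phi = sqrt(χ²/n), contingency coefficient = sqrt(χ²/(χ² + n)), Cramer's V = sqrt(χ²/(n x min(r-1, c-1)))), which are not spelled out in the lecture; the function name and input values are hypothetical.

```python
from math import sqrt


def association_measures(chi_sq, n, n_rows, n_cols):
    """Return (phi, contingency coefficient, Cramer's V)."""
    phi = sqrt(chi_sq / n)
    contingency = sqrt(chi_sq / (chi_sq + n))
    cramers_v = sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))
    return phi, contingency, cramers_v


# Hypothetical 2 x 2 result: chi-square = 9.52 with n = 200.
phi, contingency, cramers_v = association_measures(9.52, 200, 2, 2)
print(round(phi, 3), round(contingency, 3), round(cramers_v, 3))

# Hypothetical 3 x 3 result: V is rescaled by min(r-1, c-1) = 2.
print(association_measures(50, 100, 3, 3))
```

For a 2 x 2 table phi and Cramer's V coincide, as stated above, while for larger tables only V is rescaled by the table dimensions.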

• Cramer’s statistic is 0.36 out of a possible
maximum value of 1.
• This represents a medium association
between the variables.
• like a correlation coefficient then this
represents a medium effect size
• The value is highly significant (p < .001)
indicating that a value of the test statistic
that is this big is unlikely to have happened
by chance, and therefore the strength of
the relationship is significant.
• These results confirm what the chi-square
test already told us but also give us some
idea of the size of effect.
