Categorical Data and Statistical Analysis

University of South Florida
Dr. Shivendu

Agenda
CLASS 6: Mar. 16
Class 6_Stats_Module: Categorical Data and Statistical Analysis
• Categorical input and categorical output
• χ² test
• Categorical inputs with continuous outputs: ANOVA
• Kruskal-Wallis test
Quiz 6: Based on Class 6 Readings
Class 6_SAS_Module: Statistical Analysis I
• Chap 9 (part) of LBS
• SAS Assignment 6 posted: Due before class 7
5/24/2022

Quiz 6
1. What is the key difference between parametric tests and
non-parametric tests?
2. When does one use the chi-square test?
3. When is the Kolmogorov-Smirnov test used?

Parametric and Nonparametric Tests
• The term "non-parametric" refers to the fact that the chi-square tests
do not require assumptions about population parameters nor do they
test hypotheses about population parameters.
• Previous examples of hypothesis tests, such as the t tests and analysis
of variance, are parametric tests and they do include assumptions
about parameters and hypotheses about parameters.

Nonparametric Tests: Chi-square test
• The most obvious difference between the chi-square
tests and the other hypothesis tests we have
considered (t and ANOVA) is the nature of the data.
• For chi-square, the data are frequencies rather than
numerical scores.

Multinomial Experiments
A multinomial experiment is a probability experiment
consisting of a fixed number of trials in which there are
more than two possible outcomes for each independent
trial. (Unlike the binomial experiment in which there were
only two possible outcomes.)
Example:
A researcher claims that the distribution of favorite pizza toppings among teenagers is as
shown below.

Topping      Relative frequency
Cheese       41%
Pepperoni    25%
Sausage      15%
Mushrooms    10%
Onions       9%

Each outcome is classified into categories.
The probability for each possible outcome is fixed.

Chi-Square Goodness-of-Fit Test
A Chi-Square Goodness-of-Fit Test is used to test whether a
frequency distribution fits an expected distribution.
To calculate the test statistic for the chi-square goodness-of-fit test,
the observed frequencies and the expected frequencies are used.
The observed frequency O of a category is the frequency for the
category observed in the sample data.
The expected frequency E of a category is the calculated frequency
for the category. Expected frequencies are obtained assuming the
specified (or hypothesized) distribution. The expected frequency
for the ith category is
$$E_i = n p_i$$
where n is the number of trials (the sample size) and p_i is the
assumed probability of the ith category.

Observed and Expected Frequencies
Example:
200 teenagers are randomly selected and asked what their favorite pizza topping is. The
results are shown below.
Find the observed frequencies and the expected frequencies.
Topping      Observed frequency   Claimed % of   Expected frequency
             (n = 200)            teenagers      E = np
Cheese       78                   41%            200(0.41) = 82
Pepperoni    52                   25%            200(0.25) = 50
Sausage      30                   15%            200(0.15) = 30
Mushrooms    25                   10%            200(0.10) = 20
Onions       15                   9%             200(0.09) = 18
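The expected frequencies in this example follow directly from E = np. A minimal Python sketch, assuming the claimed percentages and the sample size of 200 given in the example (the variable names are illustrative only):

```python
# Expected frequencies E_i = n * p_i for the pizza-topping example.
n = 200  # sample size from the example

# Claimed distribution (as probabilities) from the example.
claimed = {
    "Cheese": 0.41,
    "Pepperoni": 0.25,
    "Sausage": 0.15,
    "Mushrooms": 0.10,
    "Onions": 0.09,
}

# Multiply each assumed probability by the sample size.
expected = {topping: n * p for topping, p in claimed.items()}

for topping, e in expected.items():
    print(f"{topping}: {e:.0f}")
```

Every expected frequency here is at least 5, which matters for the conditions of the goodness-of-fit test.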

The Chi-Square Goodness-of-Fit Test
For the chi-square goodness-of-fit test to be used, the following must
be true.
1. The observed frequencies must be obtained by using a random
sample.
2. Each expected frequency must be greater than or equal to 5.
If the conditions listed above are satisfied, then the sampling
distribution for the goodness-of-fit test is approximated by a
chi-square distribution with k – 1 degrees of freedom, where k is the
number of categories. The test statistic for the chi-square
goodness-of-fit test is
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O represents the observed frequency of each category and E
represents the expected frequency of each category.
The test is always a right-tailed test.
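The test statistic and its degrees of freedom can be sketched in a few lines of Python. The function names and the three-category data below are hypothetical and serve only to illustrate the formula:

```python
def chi_square_statistic(observed, expected):
    """Return the sum over categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def goodness_of_fit_df(k):
    """Degrees of freedom for a goodness-of-fit test with k categories."""
    return k - 1


# Hypothetical three-category example (not taken from the slides).
stat = chi_square_statistic([18, 22, 20], [20, 20, 20])
df = goodness_of_fit_df(3)
print(stat, df)
```

Because every term is a squared deviation divided by a positive expected count, the statistic can never be negative, which is why only large values (the right tail) count as evidence against H0.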

Performing a Chi-Square Goodness-of-Fit Test
In Words                                   In Symbols
1. Identify the claim. State the null      State H0 and Ha.
   and alternative hypotheses.
2. Specify the level of significance.      Identify α.
3. Identify the degrees of freedom.        d.f. = k – 1
4. Determine the critical value.           Use chi-square tables.
5. Determine the rejection region.
6. Calculate the test statistic.           $\chi^2 = \sum \frac{(O - E)^2}{E}$
7. Make a decision to reject or fail       If χ² is in the rejection region, reject H0.
   to reject the null hypothesis.          Otherwise, fail to reject H0.
8. Interpret the decision in the
   context of the original claim.

Example:
A researcher claims that the distribution of favorite pizza toppings among teenagers is as
shown below. 200 randomly selected teenagers are surveyed.

Topping      Claimed relative frequency
Cheese       41%
Pepperoni    25%
Sausage      15%
Mushrooms    10%
Onions       9%

Using α = 0.01, and the observed and expected values previously calculated, test the
surveyor’s claim using a chi-square goodness-of-fit test.

Example continued:
H0: The distribution of pizza toppings is 41% cheese, 25%
pepperoni, 15% sausage, 10% mushrooms, and 9%
onions. (Claim)
Ha: The distribution of pizza toppings differs from the
claimed or expected distribution.
Because there are 5 categories, the chi-square distribution has k – 1 = 5 – 1 = 4 degrees of
freedom.
With d.f. = 4 and α = 0.01, the critical value is $\chi^2_0$ = 13.277.

Example continued:
Topping      Observed frequency   Expected frequency
Cheese       78                   82
Pepperoni    52                   50
Sausage      30                   30
Mushrooms    25                   20
Onions       15                   18

[Figure: chi-square distribution with the rejection region (area 0.01) to the right of $\chi^2_0$ = 13.277]

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(78 - 82)^2}{82} + \frac{(52 - 50)^2}{50} + \frac{(30 - 30)^2}{30} + \frac{(25 - 20)^2}{20} + \frac{(15 - 18)^2}{18} \approx 2.025$$

Fail to reject H0.
There is not enough evidence at the 1% level to reject the surveyor’s claim.
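The decision above can be reproduced with a short Python sketch. The observed and expected counts and the critical value 13.277 are taken from the example; the p-value line is an addition that is not on the slides and relies on a closed form that holds only for 4 degrees of freedom:

```python
from math import exp

observed = [78, 52, 30, 25, 15]   # cheese, pepperoni, sausage, mushrooms, onions
expected = [82, 50, 30, 20, 18]   # n * p for each topping
alpha = 0.01
critical_value = 13.277           # table value for d.f. = 4, alpha = 0.01

df = len(observed) - 1
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Right-tailed test: reject H0 only if the statistic exceeds the critical value.
reject_h0 = chi2 > critical_value

# For d.f. = 4 only, the right-tail area equals exp(-x/2) * (1 + x/2).
p_value = exp(-chi2 / 2) * (1 + chi2 / 2)

print(f"chi2 = {chi2:.3f}, d.f. = {df}, reject H0: {reject_h0}")
```

The statistic comes out near 2.025, far below the critical value, so the sketch agrees with the decision to fail to reject H0.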

Contingency Tables
An r × c contingency table shows the observed frequencies
for two variables. The observed frequencies are arranged
in r rows and c columns. The intersection of a row and a
column is called a cell.
The following contingency table shows a random sample of 321 fatally injured passenger
vehicle drivers by age and gender. (Adapted from Insurance Institute for Highway
Safety)
                              Age
Gender   16–20   21–30   31–40   41–50   51–60   61 and older
Male     32      51      52      43      28      10
Female   13      22      33      21      10      6

Expected Frequency
Assuming the two variables are independent, you can use
the contingency table to find the expected frequency for
each cell.
Finding the Expected Frequency for Contingency Table Cells
The expected frequency for a cell E_{r,c} in a contingency table is
$$E_{r,c} = \frac{(\text{Sum of row } r) \times (\text{Sum of column } c)}{\text{Sample size}}.$$
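A minimal Python sketch of this formula, applied to the counts from the driver contingency table shown earlier. The variable names are illustrative; rows are Male and Female, columns are the six age groups from 16–20 to 61 and older:

```python
# Observed counts from the contingency table (rows: Male, Female).
observed = [
    [32, 51, 52, 43, 28, 10],
    [13, 22, 33, 21, 10, 6],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # sample size

# E[r][c] = (sum of row r) * (sum of column c) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(["Male", "Female"], expected):
    print(label, [round(e, 2) for e in row])
```

Each row of expected counts adds up to the corresponding row total, which is a quick sanity check on the calculation.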


Expected Frequency
Example:
Find the expected frequency for each “Male” cell in the contingency table
for the sample of 321 fatally injured drivers. Assume that the variables,
age and gender, are independent.
                                  Age
Gender   16–20   21–30   31–40   41–50   51–60   61 and older   Total
Male     32      51      52      43      28      10             216
Female   13      22      33      21      10      6              105
Total    45      73      85      64      38      16             321

Expected Frequency
Example continued:
Using
$$E_{r,c} = \frac{(\text{Sum of row } r) \times (\text{Sum of column } c)}{\text{Sample size}}$$
with the "Male" row total of 216 and a sample size of 321:
$$E_{1,1} = \frac{216 \cdot 45}{321} \approx 30.28 \qquad E_{1,2} = \frac{216 \cdot 73}{321} \approx 49.12 \qquad E_{1,3} = \frac{216 \cdot 85}{321} \approx 57.20$$
$$E_{1,4} = \frac{216 \cdot 64}{321} \approx 43.07 \qquad E_{1,5} = \frac{216 \cdot 38}{321} \approx 25.57 \qquad E_{1,6} = \frac{216 \cdot 16}{321} \approx 10.77$$

Chi-Square Independence Test
A chi-square independence test is used to test the independence of
two variables. Using a chi-square test, you can determine whether
the occurrence of one variable affects the probability of the occurrence
of the other variable.
For the chi-square independence test to be used, the following must be
true.
1. The observed frequencies must be obtained by using a random
sample.
2. Each expected frequency must be greater than or equal to 5.

The Chi-Square Independence Test
If the conditions listed are satisfied, then the sampling
distribution for the chi-square independence test is
approximated by a chi-square distribution with
(r – 1)(c – 1)
degrees of freedom, where r and c are the number of rows
and columns, respectively, of a contingency table. The test
statistic for the chi-square independence test is
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O represents the observed frequencies and E
represents the expected frequencies.
The test is always a right-tailed test.

Performing a Chi-Square Independence Test
In Words                                   In Symbols
1. Identify the claim. State the null      State H0 and Ha.
   and alternative hypotheses.
2. Specify the level of significance.      Identify α.
3. Identify the degrees of freedom.        d.f. = (r – 1)(c – 1)
4. Determine the critical value.           Use tables.
5. Determine the rejection region.
6. Calculate the test statistic.           $\chi^2 = \sum \frac{(O - E)^2}{E}$
7. Make a decision to reject or fail       If χ² is in the rejection region, reject H0.
   to reject the null hypothesis.          Otherwise, fail to reject H0.
8. Interpret the decision in the
   context of the original claim.

Example:
The following contingency table shows a random sample of 321 fatally injured passenger
vehicle drivers by age and gender. The expected frequencies are displayed in
parentheses. At α = 0.05, can you conclude that the drivers’ ages are related to gender
in such accidents?
                                          Age
Gender   16–20        21–30        31–40        41–50        51–60        61 and older   Total
Male     32 (30.28)   51 (49.12)   52 (57.20)   43 (43.07)   28 (25.57)   10 (10.77)     216
Female   13 (14.72)   22 (23.88)   33 (27.80)   21 (20.93)   10 (12.43)   6 (5.23)       105
Total    45           73           85           64           38           16             321

Example continued:
H0: The drivers’ ages are independent of gender.
Ha: The drivers’ ages are dependent on gender. (Claim)
Because each expected frequency is at least 5 and the drivers were randomly selected,
the chi-square independence test can be used to test whether the variables are
independent.
d.f. = (r – 1)(c – 1) = (2 – 1)(6 – 1) = (1)(5) = 5
With d.f. = 5 and α = 0.05, the critical value is $\chi^2_0$ = 11.071.

Example continued:
[Figure: chi-square distribution with the rejection region (area 0.05) to the right of $\chi^2_0$ = 11.071]

O     E        O – E    (O – E)²   (O – E)²/E
32    30.28     1.72     2.9584    0.0977
51    49.12     1.88     3.5344    0.072
52    57.20    –5.2     27.04      0.4727
43    43.07    –0.07     0.0049    0.0001
28    25.57     2.43     5.9049    0.2309
10    10.77    –0.77     0.5929    0.0551
13    14.72    –1.72     2.9584    0.201
22    23.88    –1.88     3.5344    0.148
33    27.80     5.2     27.04      0.9727
21    20.93     0.07     0.0049    0.0002
10    12.43    –2.43     5.9049    0.4751
6     5.23      0.77     0.5929    0.1134

$$\chi^2 = \sum \frac{(O - E)^2}{E} \approx 2.84$$
Fail to reject H0.
There is not enough evidence at the 5% level to conclude that age is dependent on gender in
such accidents.
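The whole independence test for this example can be sketched in Python. The counts and the critical value 11.071 come from the example; the variable names are illustrative. Because the sketch keeps the expected frequencies unrounded, the statistic comes out slightly different from the hand calculation (roughly 2.83 rather than 2.84), which does not change the decision:

```python
# Observed counts (rows: Male, Female; columns: six age groups).
observed = [
    [32, 51, 52, 43, 28, 10],
    [13, 22, 33, 21, 10, 6],
]
critical_value = 11.071  # table value for d.f. = 5, alpha = 0.05

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Degrees of freedom: (r - 1)(c - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Sum (O - E)^2 / E over every cell, with E computed under independence.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

reject_h0 = chi2 > critical_value
print(f"chi2 = {chi2:.2f}, d.f. = {df}, reject H0: {reject_h0}")
```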

Nonparametric Methods
• Sign Test
• Wilcoxon Signed-Rank Test
• Mann-Whitney-Wilcoxon Test
• Kruskal-Wallis Test
• Rank Correlation

• Most of the statistical methods referred to as parametric
  require the use of interval- or ratio-scaled data.
• Nonparametric methods are often the only way to
  analyze nominal or ordinal data and draw statistical
  conclusions.
• Nonparametric methods require no assumptions about
  the population probability distributions.
• Nonparametric methods are often called
  distribution-free methods.

• In general, for a statistical method to be classified as
  nonparametric, it must satisfy at least one of the
  following conditions.
  • The method can be used with nominal data.
  • The method can be used with ordinal data.
  • The method can be used with interval or ratio data
    when no assumption can be made about the
    population probability distribution.

Sign Test
• A common application of the sign test involves using
  a sample of n potential customers to identify a
  preference for one of two brands of a product.
• The objective is to determine whether there is a
  difference in preference between the two items being
  compared.
• To record the preference data, we use a plus sign if
  the individual prefers one brand and a minus sign if
  the individual prefers the other brand.
• Because the data are recorded as plus and minus
  signs, this test is called the sign test.

Sign Test: Small-Sample Case
• The small-sample case for the sign test should be
  used whenever n ≤ 20.
• The hypotheses are
  H0: p = .50 (No preference for one brand over the other exists.)
  Ha: p ≠ .50 (A preference for one brand over the other exists.)
• The number of plus signs is our test statistic.
• Assuming H0 is true, the sampling distribution for
  the test statistic is a binomial distribution with p = .5.
• H0 is rejected if the p-value < level of significance, α.
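For the small-sample case, the p-value can be computed exactly from the binomial distribution with p = .5. A minimal sketch, assuming a two-tailed test; the function name and the data (9 plus signs among 12 respondents who stated a preference) are hypothetical:

```python
from math import comb


def sign_test_exact_p_value(n_plus, n):
    """Two-tailed exact p-value for the sign test under H0: p = 0.5."""

    def pmf(k):
        # Probability of exactly k plus signs out of n when p = 0.5.
        return comb(n, k) * 0.5 ** n

    # Use the larger of the plus and minus counts to measure the upper tail.
    k = max(n_plus, n - n_plus)
    upper_tail = sum(pmf(i) for i in range(k, n + 1))

    # The binomial with p = 0.5 is symmetric, so double the tail (capped at 1).
    return min(1.0, 2 * upper_tail)


# Hypothetical data, not from the slides.
p_value = sign_test_exact_p_value(9, 12)
print(round(p_value, 4))
```

With these hypothetical counts the p-value is above .05, so H0 would not be rejected at the .05 level of significance.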

Sign Test: Large-Sample Case
• Using H0: p = .5 and n > 20, the sampling distribution for the number of
  plus signs can be approximated by a normal distribution.
• When no preference is stated (H0: p = .5), the sampling distribution will
  have:
  Mean: $\mu = .50n$
  Standard Deviation: $\sigma = \sqrt{.25n}$
• The test statistic is:
  $$z = \frac{x - \mu}{\sigma}$$
  (x is the number of plus signs)
• H0 is rejected if the p-value < level of significance, α.

• Example: Ketchup Taste Test
As part of a market research study, a
sample of 36 consumers was asked to taste
two brands of ketchup and indicate a
preference. Do the data shown on the next
slide indicate a significant difference in the
consumer preferences for the two brands?

18 preferred Brand A Ketchup (+ sign recorded)
12 preferred Brand B Ketchup (– sign recorded)
6 had no preference
The analysis will be based on
a sample size of 18 + 12 = 30.

• Hypotheses
H0: p = .50 (No preference for one brand over the other exists)
Ha: p ≠ .50 (A preference for one brand over the other exists)

• Sampling Distribution for Number of Plus Signs
$$\mu = .5(30) = 15$$
$$\sigma = \sqrt{.25n} = \sqrt{.25(30)} = 2.74$$

• Rejection Rule
Using .05 level of significance:
Reject H0 if p-value < .05
• Test Statistic
$$z = \frac{x - \mu}{\sigma} = \frac{18 - 15}{2.74} = \frac{3}{2.74} = 1.10$$
• p-Value
p-Value = 2(.5000 – .3643) = .2714

• Conclusion
Because the p-value > α, we cannot reject H0.
There is insufficient evidence in the sample to conclude
that a difference in preference exists for the two brands
of ketchup.
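The large-sample calculation for the ketchup example can be sketched with Python's standard library. The function name is illustrative. The sketch does not round z before looking up the tail area, so its p-value (about .273) differs slightly from the table-based .2714:

```python
from math import sqrt
from statistics import NormalDist


def sign_test_large_sample(n_plus, n):
    """Normal approximation to the sign test; returns (z, two-tailed p-value)."""
    mu = 0.5 * n            # mean number of plus signs under H0
    sigma = sqrt(0.25 * n)  # standard deviation under H0
    z = (n_plus - mu) / sigma
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Ketchup example: 18 plus signs out of 30 consumers who stated a preference.
z, p_value = sign_test_large_sample(18, 30)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```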

Hypothesis Test About a Median
• We can apply the sign test by:
  • Using a plus sign whenever the data in the sample
    are above the hypothesized value of the median
  • Using a minus sign whenever the data in the
    sample are below the hypothesized value of the
    median
  • Discarding any data exactly equal to the
    hypothesized median

• Example: Trim Fitness Center
A hypothesis test is being conducted
about the median age of female members
of the Trim Fitness Center.
H0: Median Age = 34 years
Ha: Median Age ≠ 34 years
In a sample of 40 female members, 25 are older
than 34, 14 are younger than 34, and 1 is 34. Is there
sufficient evidence to reject H0? Assume α = .05.

• Mean and Standard Deviation
(n = 39 after discarding the one member whose age equals the hypothesized median)
$$\mu = .5(39) = 19.5$$
$$\sigma = \sqrt{.25n} = \sqrt{.25(39)} = 3.12$$
• Test Statistic
$$z = \frac{x - \mu}{\sigma} = \frac{25 - 19.5}{3.12} = 1.76$$
• p-Value
p-Value = 2(.5000 – .4608) = .0784

• Rejection Rule
Using .05 level of significance:
Reject H0 if p-value < .05
• Conclusion
Do not reject H0. The p-value for this two-tail
test is .0784. There is insufficient evidence in the
sample to conclude that the median age is not 34 for
female members of Trim Fitness Center.

Wilcoxon Signed-Rank Test
• This test is the nonparametric alternative to the
  parametric matched-sample test.
• The methodology of the parametric matched-sample
  analysis requires:
  • interval data, and
  • the assumption that the population of differences
    between the pairs of observations is normally
    distributed.
• If the assumption of normally distributed differences
  is not appropriate, the Wilcoxon signed-rank test can
  be used.

• Example: Express Deliveries
A firm has decided to select one
of two express delivery services to
provide next-day deliveries to its
district offices.
To test the delivery times of the two services, the
firm sends two reports to a sample of 10 district
offices, with one report carried by one service and the
other report carried by the second service. Do the data
on the next slide indicate a difference in the two
services?

District Office   OverNight   NiteFlite
Seattle           32 hrs.     25 hrs.
Los Angeles       30          24
Boston            19          15
Cleveland         16          15
New York          15          13
Houston           18          15
Atlanta           14          15
St. Louis         10          8
Milwaukee         7           9
Denver            16          11

• Preliminary Steps of the Test
  • Compute the differences between the paired
    observations.
  • Discard any differences of zero.
  • Rank the absolute value of the differences from
    lowest to highest. Tied differences are assigned
    the average ranking of their positions.
  • Give the ranks the sign of the original difference
    in the data.
  • Sum the signed ranks.
. . . next we will determine whether the
sum is significantly different from zero.

District Office   Differ.   |Diff.| Rank   Sign. Rank
Seattle            7        10             +10
Los Angeles        6        9              +9
Boston             4        7              +7
Cleveland          1        1.5            +1.5
New York           2        4              +4
Houston            3        6              +6
Atlanta           –1        1.5            –1.5
St. Louis          2        4              +4
Milwaukee         –2        4              –4
Denver             5        8              +8
Sum of signed ranks                        +44
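The preliminary steps can be sketched in Python using the delivery times from the example. The helper name `average_ranks` is illustrative; it assigns tied values the average of the positions they occupy, as the steps require:

```python
def average_ranks(values):
    """Rank values from lowest to highest; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


overnight = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
niteflite = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

# Step 1 and 2: differences, discarding zeros.
diffs = [a - b for a, b in zip(overnight, niteflite) if a != b]
# Step 3: rank the absolute differences.
ranks = average_ranks([abs(d) for d in diffs])
# Step 4: attach the sign of the original difference.
signed_ranks = [r if d > 0 else -r for d, r in zip(diffs, ranks)]
# Step 5: sum the signed ranks.
T = sum(signed_ranks)
print(signed_ranks, T)
```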

• Hypotheses
H0: The delivery times of the two services are the
same; neither offers faster service than the other.
Ha: Delivery times differ between the two services;
recommend the one with the smaller times.

• Sampling Distribution of T for Identical Populations
$$\mu_T = 0$$
$$\sigma_T = \sqrt{\frac{n(n + 1)(2n + 1)}{6}} = \sqrt{\frac{10(11)(21)}{6}} = 19.62$$

• Rejection Rule
Using .05 level of significance,
Reject H0 if p-value < .05
• Test Statistic
$$z = \frac{T - \mu_T}{\sigma_T} = \frac{44 - 0}{19.62} = 2.24$$
• p-Value
p-Value = 2(.5000 – .4875) = .025

• Conclusion
Reject H0. The p-value for this two-tail test is .025.
There is sufficient evidence in the sample to
conclude that a difference exists in the delivery times
provided by the two services.
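A short Python sketch of the large-sample calculation for the signed-rank sum, using n = 10 and T = 44 from the example. The variable names are illustrative, and the normal tail area comes from the standard library rather than a table:

```python
from math import sqrt
from statistics import NormalDist

n = 10       # number of nonzero paired differences
T = 44       # sum of the signed ranks
alpha = 0.05

mu_T = 0
sigma_T = sqrt(n * (n + 1) * (2 * n + 1) / 6)
z = (T - mu_T) / sigma_T
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"sigma_T = {sigma_T:.2f}, z = {z:.2f}, p-value = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```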

Mann-Whitney-Wilcoxon Test
• This test is another nonparametric method for
  determining whether there is a difference between
  two populations.
• This test, unlike the Wilcoxon signed-rank test, is not
  based on a matched sample.
• This test does not require interval data or the
  assumption that both populations are normally
  distributed.
• The only requirement is that the measurement scale
  for the data is at least ordinal.

• Instead of testing for the difference between the
  means of two populations, this method tests to
  determine whether the two populations are identical.
• The hypotheses are:
  H0: The two populations are identical
  Ha: The two populations are not identical

• Example: Westin Freezers
Manufacturer labels indicate the
annual energy cost associated with
operating home appliances such as
freezers.
The energy costs for a sample of
10 Westin freezers and a sample of 10
Easton freezers are shown on the next slide. Do the
data indicate, using α = .05, that a difference exists in
the annual energy costs for the two brands of freezers?

Westin Freezers   Easton Freezers
$55.10            $56.10
54.50             54.70
53.20             54.40
53.00             55.40
55.50             54.10
54.90             56.00
55.80             55.50
54.00             55.00
54.20             54.30
55.20             57.00

• Hypotheses
H0: Annual energy costs for Westin freezers
and Easton freezers are the same.
Ha: Annual energy costs differ for the two
brands of freezers.

Mann-Whitney-Wilcoxon Test: Large-Sample Case
• First, rank the combined data from the lowest to
  the highest values, with tied values being assigned
  the average of the tied rankings.
• Then, compute T, the sum of the ranks for the first
  sample.
• Then, compare the observed value of T to the
  sampling distribution of T for identical populations.
  The value of the standardized test statistic z will
  provide the basis for deciding whether to reject H0.

Mann-Whitney-Wilcoxon Test: Large-Sample Case
• Mean
$$\mu_T = \tfrac{1}{2} n_1 (n_1 + n_2 + 1)$$
• Standard Deviation
$$\sigma_T = \sqrt{\tfrac{1}{12} n_1 n_2 (n_1 + n_2 + 1)}$$
• Distribution Form
Approximately normal, provided
n1 ≥ 10 and n2 ≥ 10

Westin Freezers   Rank    Easton Freezers   Rank
$55.10            12      $56.10            19
54.50             8       54.70             9
53.20             2       54.40             7
53.00             1       55.40             14
55.50             15.5    54.10             4
54.90             10      56.00             18
55.80             17      55.50             15.5
54.00             3       55.00             11
54.20             5       54.30             6
55.20             13      57.00             20
Sum of Ranks      86.5    Sum of Ranks      123.5
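The ranking of the combined freezer data can be sketched in Python. The helper name `rank_map` is illustrative; it maps each distinct cost to the average of the positions it occupies in the sorted combined sample, so the two tied values of 55.50 both receive rank 15.5:

```python
def rank_map(values):
    """Map each distinct value to the average of its 1-based sorted positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return {v: sum(p) / len(p) for v, p in positions.items()}


westin = [55.10, 54.50, 53.20, 53.00, 55.50, 54.90, 55.80, 54.00, 54.20, 55.20]
easton = [56.10, 54.70, 54.40, 55.40, 54.10, 56.00, 55.50, 55.00, 54.30, 57.00]

# Rank the combined data, then sum the ranks within each sample.
ranks = rank_map(westin + easton)
T_westin = sum(ranks[v] for v in westin)
T_easton = sum(ranks[v] for v in easton)
print(T_westin, T_easton)
```

As a check, the two rank sums together must equal 1 + 2 + ... + 20 = 210.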

$$\mu_T = \tfrac{1}{2}(10)(21) = 105$$
$$\sigma_T = \sqrt{\tfrac{1}{12} n_1 n_2 (n_1 + n_2 + 1)} = \sqrt{\tfrac{1}{12}(10)(10)(21)} = 13.23$$

• Rejection Rule
Using .05 level of significance,
Reject H0 if p-value < .05
• Test Statistic
$$z = \frac{T - \mu_T}{\sigma_T} = \frac{86.5 - 105}{13.23} = -1.40$$
• p-Value
p-Value = 2(.5000 – .4192) = .1616

• Conclusion
Do not reject H0. The p-value > α. There is
insufficient evidence in the sample data to conclude
that there is a difference in the annual energy cost
associated with the two brands of freezers.
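The large-sample Mann-Whitney-Wilcoxon calculation can be sketched as a small Python function. The function name is illustrative; the inputs T = 86.5 and n1 = n2 = 10 are from the freezer example. The p-value is computed from the unrounded z, so it lands near .162 rather than exactly on the table-based .1616:

```python
from math import sqrt
from statistics import NormalDist


def mww_large_sample(T, n1, n2):
    """Return (mu_T, sigma_T, z, two-tailed p-value) for rank sum T of sample 1."""
    mu_T = 0.5 * n1 * (n1 + n2 + 1)
    sigma_T = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (T - mu_T) / sigma_T
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return mu_T, sigma_T, z, p_value


mu_T, sigma_T, z, p_value = mww_large_sample(86.5, 10, 10)
print(f"mu_T = {mu_T}, sigma_T = {sigma_T:.2f}, z = {z:.2f}, p-value = {p_value:.4f}")
```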

Kruskal-Wallis Test
• The Mann-Whitney-Wilcoxon test has been extended
  by Kruskal and Wallis for cases of three or more
  populations.
• The Kruskal-Wallis test can be used with ordinal data
  as well as with interval or ratio data.
• Also, the Kruskal-Wallis test does not require the
  assumption of normally distributed populations.
H0: All populations are identical
Ha: Not all populations are identical

Kruskal-Wallis Test
• Test Statistic
$$W = \left[ \frac{12}{n_T (n_T + 1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} \right] - 3(n_T + 1)$$
where: k = number of populations
n_i = number of items in sample i
n_T = Σ n_i = total number of items in all samples
R_i = sum of the ranks for sample i
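A minimal Python sketch of the W statistic. The function name and the three samples are hypothetical (the slides give no Kruskal-Wallis data set); the data are deliberately simple so the rank sums can be checked by hand:

```python
def kruskal_wallis_w(samples):
    """Kruskal-Wallis W for a list of samples, using average ranks for ties."""
    pooled = sorted(v for sample in samples for v in sample)

    # Average 1-based position of each distinct value in the pooled data.
    positions = {}
    for pos, v in enumerate(pooled, start=1):
        positions.setdefault(v, []).append(pos)
    rank = {v: sum(p) / len(p) for v, p in positions.items()}

    n_T = len(pooled)
    # Sum of R_i^2 / n_i over the k samples.
    total = sum(sum(rank[v] for v in s) ** 2 / len(s) for s in samples)
    return 12 / (n_T * (n_T + 1)) * total - 3 * (n_T + 1)


# Hypothetical data: three samples of six items each, with no overlap.
samples = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
]
W = kruskal_wallis_w(samples)
df = len(samples) - 1
print(W, df)
```

Here the rank sums are 21, 57, and 93, so W works out to 288/19, and the statistic would be compared with a chi-square distribution with k – 1 = 2 degrees of freedom.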

Kruskal-Wallis Test
• When the populations are identical, the sampling distribution of
  the test statistic W can be approximated by a chi-square
  distribution with k – 1 degrees of freedom.
• This approximation is acceptable if each of the sample sizes n_i is ≥ 5.
• The rejection rule is: Reject H0 if p-value < α

Rank Correlation
• The Pearson correlation coefficient, r, is a measure of
  the linear association between two variables for
  which interval or ratio data are available.
• The Spearman rank-correlation coefficient, r_s, is a
  measure of association between two variables when
  only ordinal data are available.
• Values of r_s can range from –1.0 to +1.0, where
  • values near 1.0 indicate a strong positive
    association between the rankings, and
  • values near –1.0 indicate a strong negative
    association between the rankings.

Rank Correlation
• Spearman Rank-Correlation Coefficient, r_s
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where: n = number of items being ranked
x_i = rank of item i with respect to one variable
y_i = rank of item i with respect to a second variable
d_i = x_i – y_i

Test for Significant Rank Correlation
• We may want to use sample results to make an
  inference about the population rank correlation p_s.
• To do so, we must test the hypotheses:
  H0: p_s = 0 (No rank correlation exists)
  Ha: p_s ≠ 0 (Rank correlation exists)

Rank Correlation
• Sampling Distribution of r_s when p_s = 0
  • Mean
    $$\mu_{r_s} = 0$$
  • Standard Deviation
    $$\sigma_{r_s} = \sqrt{\frac{1}{n - 1}}$$
  • Distribution Form
    Approximately normal, provided n ≥ 10

Rank Correlation
• Example: Crennor Investors
Crennor Investors provides
a portfolio management service
for its clients. Two of Crennor’s
analysts ranked ten investments
as shown on the next slide. Use
rank correlation, with α = .10, to
comment on the agreement of the two analysts’
rankings.

Rank Correlation
• Example: Crennor Investors
  • Analysts’ Rankings
Investment   A   B   C   D   E   F   G   H    I   J
Analyst #1   1   4   9   8   6   3   5   7    2   10
Analyst #2   1   5   6   2   9   7   3   10   4   8
  • Hypotheses
H0: p_s = 0 (No rank correlation exists)
Ha: p_s ≠ 0 (Rank correlation exists)

Rank Correlation
Investment   Analyst #1 Ranking   Analyst #2 Ranking   Differ.   (Differ.)²
A            1                    1                     0        0
B            4                    5                    –1        1
C            9                    6                     3        9
D            8                    2                     6        36
E            6                    9                    –3        9
F            3                    7                    –4        16
G            5                    3                     2        4
H            7                    10                   –3        9
I            2                    4                    –2        4
J            10                   8                     2        4
                                                       Sum =     92

Rank Correlation
• Sampling Distribution of r_s Assuming No Rank Correlation
$$\mu_{r_s} = 0$$
$$\sigma_{r_s} = \sqrt{\frac{1}{10 - 1}} = .333$$
• Test Statistic
$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(92)}{10(100 - 1)} = 0.4424$$

Rank Correlation
• Rejection Rule
With .10 level of significance:
Reject H0 if p-value < .10
• Test Statistic
$$z = \frac{r_s - \mu_{r_s}}{\sigma_{r_s}} = \frac{.4424 - 0}{.3333} = 1.33$$
• p-Value
p-Value = 2(.5000 – .4082) = .1836
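The Crennor calculation can be sketched end to end in Python from the two analysts' rankings. The variable names are illustrative. The p-value uses the unrounded z, so it comes out near .184 rather than exactly on the table-based .1836:

```python
from math import sqrt
from statistics import NormalDist

# Rankings of investments A through J by each analyst.
analyst1 = [1, 4, 9, 8, 6, 3, 5, 7, 2, 10]
analyst2 = [1, 5, 6, 2, 9, 7, 3, 10, 4, 8]
alpha = 0.10

n = len(analyst1)
# Sum of squared rank differences d_i = x_i - y_i.
sum_d2 = sum((x - y) ** 2 for x, y in zip(analyst1, analyst2))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Sampling distribution of r_s when there is no rank correlation.
sigma_rs = sqrt(1 / (n - 1))
z = (r_s - 0) / sigma_rs
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"sum d^2 = {sum_d2}, r_s = {r_s:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Do not reject H0")
```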

Rank Correlation
• Conclusion
Do not reject H0. The p-value > α. There is not a
significant rank correlation. The two analysts are
not showing agreement in their ranking of the risk
associated with the different investments.

Takeaways
• Chi-square tests are widely used non-parametric tests
• Non-parametric methods provide an important tool kit for data
analysts to draw inferences about populations

Next Class
CLASS 7: Mar. 23
Class 7_Stats_Module: Mid Term Exam
Class 7_SAS_Module: Statistical Analysis II
• Chap 9 (part) of LBS
• SAS Assignment 7 posted: Due before class 8
