This talk is about a common problem in business contexts: imposing a minimum sample size before making decisions. By using Bayesian approaches, we can drastically increase the speed of decision making.
This quote cited in the talk sums it up well:
"An ironic property about effect estimates with relatively large standard errors is that they are more likely to produce effect estimates that are larger in magnitude than effect estimates with relatively smaller standard errors.... There is a tendency sometimes towards downplaying a large standard error (which might increase the p-value of their estimate) by pointing out that, however, the magnitude of the estimate is quite large. In fact, this 'large effect' is likely a byproduct of this standard error.”
Making Statistics Work For Us: Item Bias, Decision Making, and Data-Driven Simulations
1. Making Statistics Work For Us: Item Bias,
Decision Making, and Data-Driven Simulations
Quinn N Lathrop
July 10, 2018
2. Why we (usually) don’t have to worry about multiple comparisons
(Gelman, Hill, & Yajima, 2008)
3. Standard errors and sample size
"An ironic property about effect estimates with relatively large
standard errors is that they are more likely to produce effect
estimates that are larger in magnitude than effect estimates with
relatively smaller standard errors.... There is a tendency sometimes
towards downplaying a large standard error (which might increase
the p-value of their estimate) by pointing out that, however, the
magnitude of the estimate is quite large. In fact, this 'large effect'
is likely a byproduct of this standard error."
4. What is DIF
Differential Item Functioning (DIF) occurs when the probability of a
correct response is different for students of different
sub-populations (focal vs. reference group), even though their
abilities are the same.
We conduct DIF studies to:
• Flag biased items for review or removal
• Provide evidence that items and assessments are not biased
• Give new grad students manageable projects
8. Statistical Inference
Magnitude: If the estimated DIF is larger in magnitude than a
constant, the item is flagged for DIF.
Significance: If there is evidence to reject the null hypothesis that
there is no DIF, the item is flagged for DIF.
• Type I error rate is 5%
• Large sample sizes and near-null truth
Compound (ABC Labels): DIF must be significant and have a
large magnitude to be flagged.
• Small sample size Type I errors have larger magnitudes
• Imposed minimum sample size
12. A traditional inference flow for DIF
Statistic: Mantel-Haenszel (MH) tests
Decision Rule: Compound (significant and large magnitude)
Data Filtering: Minimum sample size per group of 500
13. Mantel-Haenszel (MH)
Within each ability slice $k$, a $2 \times 2$ table is constructed such that

          Correct   Incorrect
Reference $A_k$     $B_k$
Focal     $C_k$     $D_k$

Then,
$\hat{\alpha}_i = \frac{R}{S} = \frac{\sum_k A_k D_k / n_k}{\sum_k B_k C_k / n_k}$ (1)
then $\widehat{MH}_i = \ln(\hat{\alpha}_i)$.
The variance of $\widehat{MH}_i$ is
$\hat{\sigma}_i^2 = \frac{1}{2R^2} \sum_k n_k^{-2} (A_k D_k + \hat{\alpha}_i B_k C_k)\left[A_k + D_k + \hat{\alpha}_i (B_k + C_k)\right]$ (2)
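Equations (1) and (2) can be sketched in a few lines of Python; the function name and the tuple-per-slice input layout are my assumptions for illustration, not the talk's implementation:

```python
import math

def mantel_haenszel(tables):
    """tables: one (A_k, B_k, C_k, D_k) tuple per ability slice k, where
    A_k/B_k are reference correct/incorrect and C_k/D_k are focal."""
    n = [a + b + c + d for a, b, c, d in tables]
    R = sum(a * d / nk for (a, b, c, d), nk in zip(tables, n))
    S = sum(b * c / nk for (a, b, c, d), nk in zip(tables, n))
    alpha = R / S               # common odds ratio, Eq. (1)
    mh = math.log(alpha)        # DIF estimate on the log scale
    # Variance of the log odds ratio, Eq. (2)
    var = sum(
        (a * d + alpha * b * c) * (a + d + alpha * (b + c)) / nk**2
        for (a, b, c, d), nk in zip(tables, n)
    ) / (2 * R**2)
    return mh, var
```

With identical response patterns in both groups, the estimate is 0 on the log scale, i.e. no DIF.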
14. Multilevel models
Partial pooling, shrinkage, Bayesian.
• Provide a way to model common phenomena while both sharing information and allowing for heterogeneity
• Each estimate is pulled towards a common mean
• Amount of shrinkage is determined by the standard error of the estimate
• If the standard error is large, the estimate is stabilized towards the mean
• If the standard error is small, we trust that estimate and shrinkage is minimal
• Major barrier is awareness and implementation
15. Multilevel MH Extension
We assume
MHi |✓i ⇠ N(✓i , 2
i ) (3)
And specify a prior as
✓i ⇠ N(µ, ⌧2
) (4)
where µ is the population mean and ⌧2 is the population variance.
The prior and data are combined by calculating the weight
Wi =
⌧2
2
i + ⌧2
(5)
And finally posterior distribution for DIF comparison i is
p(✓i | ˆ
MHi ) ⇠ N
⇣
Wi
ˆ
MHi + (1 Wi )µ, Wi ˆ2
i
⌘
(6)
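The multilevel update in Equations (3)–(6) reduces to a two-line computation; a minimal sketch, with the function name as an assumption:

```python
def multilevel_posterior(mh_hat, sigma2, mu, tau2):
    """Posterior mean and variance for one DIF comparison, combining the
    MH estimate (mh_hat, sigma2) with the prior N(mu, tau2)."""
    w = tau2 / (sigma2 + tau2)             # weight W_i, Eq. (5)
    post_mean = w * mh_hat + (1 - w) * mu  # posterior mean, Eq. (6)
    post_var = w * sigma2                  # posterior variance, Eq. (6)
    return post_mean, post_var
```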
16. I’ll pick the population parameters (priors) as
$\mu = 0$ (without data, I assume there is no systematic bias)
$\tau^2 = .05$ (fixed here for ease, but can be estimated)
So compared to traditional MH,
• $\widehat{MH}_i$ shrinks to $W_i \widehat{MH}_i$
• $\hat{\sigma}_i^2$ shrinks to $W_i \hat{\sigma}_i^2$
• where $W_i = \frac{.05}{\hat{\sigma}_i^2 + .05}$
17. Don’t throw priors in a glass house...
Our traditional MH test has the following priors:
If the focal group is less than 500 (complete pooling):
$\mu = 0$
$\tau^2 = 0$
If the focal group is at least 500 (no pooling):
$\mu = 0$
$\tau^2 = \infty$
18. The 500 minimum sample size rule crudely approximates the
impact of the standard error by saying “With 500 responses per
group, the standard error should be small enough...”
But by using a prior that is not at the extreme of complete pooling
or no pooling, we allow the standard error of the estimate to
determine how much shrinkage is appropriate.
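As a numeric illustration (the variance values below are my assumptions chosen for illustration, not from the talk), the weight $W_i = \tau^2 / (\sigma_i^2 + \tau^2)$ moves smoothly between the two extremes that the 500-rule jumps between:

```python
def weight(sigma2, tau2):
    """Shrinkage weight W_i = tau2 / (sigma2 + tau2)."""
    return tau2 / (sigma2 + tau2)

tau2 = 0.05
# A noisy estimate from a small focal group is pooled strongly toward mu.
small_group = weight(sigma2=0.45, tau2=tau2)    # 0.10: keep 10% of the estimate
# A precise estimate from a large focal group is barely shrunk.
large_group = weight(sigma2=0.005, tau2=tau2)   # ~0.91: keep 91% of the estimate
# tau2 = 0 reproduces complete pooling; tau2 -> infinity approaches no pooling.
complete_pooling = weight(sigma2=0.45, tau2=0.0)  # 0.0
```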
19. Example with Reading Assessment
The example data have:
• 839 Reading test items
• 7 self-reported ethnic groups
• a lot of data, but very small sample sizes
Recall also the inference rules:
• Magnitude
• Significance
• Compound
21. Table: Descriptive Statistics of Ethnicity in Reading Assessment

Group                  | N Students | N Responses | % Responses | Avg Score
American Indian        | 16572      | 37181       | 1.36        | 167.00
Asian                  | 42094      | 102062      | 3.73        | 175.24
Black                  | 180098     | 501665      | 18.35       | 163.46
Hispanic               | 149558     | 401692      | 14.70       | 164.65
Native Hawaiian        | 3260       | 8688        | 0.32        | 164.55
White                  | 493910     | 1244608     | 45.54       | 172.90
Multi-Ethnic           | 27945      | 71643       | 2.62        | 169.65
Not Specified or Other | 143928     | 365588      | 13.38       | 170.26
22. Table: Number of Items With Responses by Ethnicity

Group           | Any Responses | At Least 500
American Indian | 839           | 0
Asian           | 839           | 2
Black           | 839           | 469
Hispanic        | 839           | 330
Native Hawaiian | 823           | 0
White           | 839           | 807
Multi-Ethnic    | 839           | 0
27. Results
Traditional MH without minimum sample size:
• 37.9% of items flagged for DIF
For traditional MH with 500 minimum sample size:
• only 6.2% of items tested are flagged, but...
• only 57.3% of items are ever tested
For Multilevel MH:
• all items tested
• only 6.7% of items are flagged
33. Simulations Drive Methodological Advancement
The false positive rates for compound decision rules are not always
available analytically, but can be found through simulation.
The validity of a simulation rests on whether the simulated conditions
generalize to reality.
Alternatively, simulations can directly sample empirical data. Then the
simulations capture complex features and noise in empirical data
that data simulated from parametric models can never attain.
34. Data-Driven False Positive Rate Simulation
• Mimics empirical response curves from field test items (model-free)
• Mimics observed differences in subpopulations (impact)
There is no need to specify item response forms or ability
distributions.
Data-driven simulations generalize directly to the population of
interest.
They provide better evidence to support methodological choices.
35. Data-Driven Simulation Details
Given empirical data and researcher-specified reference and focal
group sample sizes:
• Randomly draw a single item’s data
• Compute the MH contingency table as proportions
• Compute the expected contingency table by combining the empirical contingency table with the researcher-specified sample sizes
• Compute $\hat{\sigma}^2_{\text{null}}$ from the expected contingency table, assuming no DIF
• Draw $\widehat{MH} \sim N(0, \hat{\sigma}^2_{\text{null}})$
• Adjust the expected contingency table so that it can produce $\widehat{MH}$
• Estimate $\hat{\sigma}^2$ from the adjusted-expected contingency table
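The steps above, up to the null draw, can be sketched as follows; the function name, the (A, B, C, D) input layout, and the omission of the final table-adjustment step are my simplifications, not the talk's actual implementation:

```python
import math
import random

def null_mh_draw(item_table, n_ref, n_foc, rng):
    """item_table: per-ability-bin (A, B, C, D) counts from one empirical
    item, where A/B are reference correct/incorrect and C/D are focal."""
    # Rescale each group's empirical proportions to the researcher-specified
    # sample sizes: the "expected" contingency table.
    ref_total = sum(a + b for a, b, _, _ in item_table)
    foc_total = sum(c + d for _, _, c, d in item_table)
    expected = [(n_ref * a / ref_total, n_ref * b / ref_total,
                 n_foc * c / foc_total, n_foc * d / foc_total)
                for a, b, c, d in item_table]
    # Variance of the MH statistic under no DIF (Eq. 2 with alpha = 1).
    n = [a + b + c + d for a, b, c, d in expected]
    R = sum(a * d / nk for (a, _, _, d), nk in zip(expected, n))
    sigma2_null = sum(
        (a * d + b * c) * (a + b + c + d) / nk**2
        for (a, b, c, d), nk in zip(expected, n)
    ) / (2 * R**2)
    # Draw one MH value from the null sampling distribution; the talk then
    # adjusts the expected table so it reproduces this draw.
    return sigma2_null, rng.gauss(0.0, math.sqrt(sigma2_null))
```

Repeating this over many randomly drawn items yields an empirical null distribution against which the decision rule's false positive rate can be measured.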
41. Draw Random Item and Create MH Table
Random item from empirical data:
Bin 1 2 3 4 5 6 7 8 9
Ref 1 40 44 73 92 97 135 153 166 184
Ref 0 169 169 167 149 139 111 97 83 67
Foc 1 26 33 28 34 53 44 49 60 61
Foc 0 108 96 74 67 53 52 43 33 30
Then adjust the margins to the desired sample sizes, and compute
$\hat{\sigma}^2_{\text{null}}$ assuming there is no DIF. For this item, with
researcher-specified sample sizes of 500 and 250, $\hat{\sigma}^2_{\text{null}} = .03$.
42. Draw from the Null MH Sampling Distribution
Draw once from $N(0, .03)$, say $\widehat{MH} = .13$. Then, adjust the
empirical MH table so it produces $\widehat{MH} = .13$ (and obeys the
sample sizes). The result is:
1 2 3 4 5 6 7 8 9
Ref 1 8.4 10.0 14.8 18.5 21.5 26.8 30.7 34.3 37.4
Ref 0 35.1 34.4 35.2 31.7 27.6 24.4 21.3 17.6 14.8
Foc 1 7.0 7.9 8.1 9.9 12.3 13.1 14.0 15.7 16.6
Foc 0 25.9 23.8 16.9 14.9 13.8 10.5 8.5 7.1 5.8
The distribution of ability across bins comes from the empirical data.
The item response probabilities come from the empirical data.
43. Big Picture
Data-driven simulations more closely reflect the population that
methodological inferences will act on.
Data-driven simulations remove often-used simulation assumptions
regarding the form of the ICC and the ability distributions.
They are especially useful when the population of interest is known to
not perfectly follow a model (i.e., field testing, new domains, ELL,
accessibility).
48. Putting it all together
The Multilevel MH tests all items for DIF, and flags at a similar
rate to the traditional MH test.
But Multilevel MH has a near-0% false positive rate, meaning
that the items that are flagged very likely represent actual DIF that
should be addressed.
Without a sample size restriction, Multilevel MH can find DIF
much earlier than traditional methods, and can find the worst
offenders well before the minimum sample size is reached.
50. Bigger Picture
Make statistics work for us:
• Use a method that can test all items
• Flag items that have meaningful bias
• Don’t flag any unbiased items
Minimum sample size rules will become less and less viable.
We need to be able to make decisions based on statistical
evidence, without requiring a certain number of data points.
That evidence should use all available information appropriately in
making the best decision.