This talk is about a common problem in business contexts: imposing a minimum sample size before making decisions. By using Bayesian approaches, we can drastically increase the speed of decision making.
This quote cited in the talk sums it up well:
"An ironic property about effect estimates with relatively large standard errors is that they are more likely to produce effect estimates that are larger in magnitude than effect estimates with relatively smaller standard errors.... There is a tendency sometimes towards downplaying a large standard error (which might increase the p-value of their estimate) by pointing out that, however, the magnitude of the estimate is quite large. In fact, this 'large effect' is likely a byproduct of this standard error.”
Making Statistics Work For Us: Item Bias, Decision Making, and Data-Driven Simulations
1. Making Statistics Work For Us: Item Bias,
Decision Making, and Data-Driven Simulations
Quinn N Lathrop
July 10, 2018
2. Why we (usually) don’t have to worry about multiple comparisons
(Gelman, Hill, & Yajima, 2008)
3. Standard errors and sample size
"An ironic property about effect estimates with relatively large
standard errors is that they are more likely to produce effect
estimates that are larger in magnitude than effect estimates with
relatively smaller standard errors.... There is a tendency sometimes
towards downplaying a large standard error (which might increase
the p-value of their estimate) by pointing out that, however, the
magnitude of the estimate is quite large. In fact, this 'large effect'
is likely a byproduct of this standard error."
4. What is DIF
Differential Item Functioning (DIF) occurs when the probability of a
correct response is different for students of different
sub-populations (focal vs. reference group), even though their
abilities are the same.
We conduct DIF studies to:
• Flag biased items for review or removal
• Provide evidence that items and assessments are not biased
• Give new grad students manageable projects
8. Statistical Inference
Magnitude: If the estimated DIF is larger in magnitude than a
constant, the item is flagged for DIF.
Significance: If there is evidence to reject the null hypothesis that
there is no DIF, the item is flagged for DIF.
• Type I error rate is 5%
• Large sample sizes and near-null truth
Compound (ABC Labels): DIF must be significant and have a
large magnitude to be flagged.
• Small sample size Type I errors have larger magnitudes
• Imposed minimum sample size
12. A traditional inference flow for DIF
Statistic: Mantel-Haenszel (MH) tests
Decision Rule: Compound (significant and large magnitude)
Data Filtering: Minimum sample size per group of 500
13. Mantel-Haenszel (MH)
Within each ability slice $k$, a $2 \times 2$ table is constructed such that

          Correct   Incorrect
Reference $A_k$     $B_k$
Focal     $C_k$     $D_k$

Then,
$\hat{\alpha}_i = \frac{R}{S} = \frac{\sum_k A_k D_k / n_k}{\sum_k B_k C_k / n_k}$ (1)
then $\widehat{MH}_i = \ln(\hat{\alpha}_i)$.
The variance of $\widehat{MH}_i$ is
$\hat{\sigma}_i^2 = \frac{1}{2R^2} \sum_k n_k^{-2} (A_k D_k + \hat{\alpha}_i B_k C_k)\left[A_k + D_k + \hat{\alpha}_i (B_k + C_k)\right]$ (2)
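Equations (1) and (2) can be sketched in a few lines of Python; the function name and the tuple-per-slice input layout are my assumptions for illustration, not the talk's implementation:

```python
import math

def mantel_haenszel(tables):
    """tables: one (A_k, B_k, C_k, D_k) tuple per ability slice k, where
    A_k/B_k are reference correct/incorrect and C_k/D_k are focal."""
    n = [a + b + c + d for a, b, c, d in tables]
    R = sum(a * d / nk for (a, b, c, d), nk in zip(tables, n))
    S = sum(b * c / nk for (a, b, c, d), nk in zip(tables, n))
    alpha = R / S               # common odds ratio, Eq. (1)
    mh = math.log(alpha)        # DIF estimate on the log scale
    # Variance of the log odds ratio, Eq. (2)
    var = sum(
        (a * d + alpha * b * c) * (a + d + alpha * (b + c)) / nk**2
        for (a, b, c, d), nk in zip(tables, n)
    ) / (2 * R**2)
    return mh, var
```

With identical response patterns in both groups, the estimate is 0 on the log scale, i.e. no DIF.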
14. Multilevel models
Partial pooling, shrinkage, Bayesian.
• Provide a way to model common phenomena while both sharing information and allowing for heterogeneity
• Each estimate is pulled towards a common mean
• Amount of shrinkage is determined by the standard error of the estimate
• If the standard error is large, the estimate is stabilized towards the mean
• If the standard error is small, we trust that estimate and shrinkage is minimal
• Major barrier is awareness and implementation
15. Multilevel MH Extension
We assume
MHi |✓i ⇠ N(✓i , 2
i ) (3)
And specify a prior as
✓i ⇠ N(µ, ⌧2
) (4)
where µ is the population mean and ⌧2 is the population variance.
The prior and data are combined by calculating the weight
Wi =
⌧2
2
i + ⌧2
(5)
And finally posterior distribution for DIF comparison i is
p(✓i | ˆ
MHi ) ⇠ N
⇣
Wi
ˆ
MHi + (1 Wi )µ, Wi ˆ2
i
⌘
(6)
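The multilevel update in Equations (3)–(6) reduces to a two-line computation; a minimal sketch, with the function name as an assumption:

```python
def multilevel_posterior(mh_hat, sigma2, mu, tau2):
    """Posterior mean and variance for one DIF comparison, combining the
    MH estimate (mh_hat, sigma2) with the prior N(mu, tau2)."""
    w = tau2 / (sigma2 + tau2)             # weight W_i, Eq. (5)
    post_mean = w * mh_hat + (1 - w) * mu  # posterior mean, Eq. (6)
    post_var = w * sigma2                  # posterior variance, Eq. (6)
    return post_mean, post_var
```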
16. I’ll pick the population parameters (priors) as
$\mu = 0$ (without data, I assume there is no systematic bias)
$\tau^2 = .05$ (fixed here for ease, but can be estimated)
So compared to traditional MH,
• $\widehat{MH}_i$ shrinks to $W_i \widehat{MH}_i$
• $\hat{\sigma}_i^2$ shrinks to $W_i \hat{\sigma}_i^2$
• where $W_i = \frac{.05}{\hat{\sigma}_i^2 + .05}$
17. Don’t throw priors in a glass house...
Our traditional MH test has the following priors:
If the focal group is less than 500 (complete pooling):
$\mu = 0$
$\tau^2 = 0$
If the focal group is at least 500 (no pooling):
$\mu = 0$
$\tau^2 = \infty$
18. The 500 minimum sample size rule crudely approximates the
impact of the standard error by saying “With 500 responses per
group, the standard error should be small enough...”
But by using a prior that is not at the extreme of complete pooling
or no pooling, we allow the standard error of the estimate to
determine how much shrinkage is appropriate.
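As a numeric illustration (the variance values below are my assumptions chosen for illustration, not from the talk), the weight $W_i = \tau^2 / (\sigma_i^2 + \tau^2)$ moves smoothly between the two extremes that the 500-rule jumps between:

```python
def weight(sigma2, tau2):
    """Shrinkage weight W_i = tau2 / (sigma2 + tau2)."""
    return tau2 / (sigma2 + tau2)

tau2 = 0.05
# A noisy estimate from a small focal group is pooled strongly toward mu.
small_group = weight(sigma2=0.45, tau2=tau2)    # 0.10: keep 10% of the estimate
# A precise estimate from a large focal group is barely shrunk.
large_group = weight(sigma2=0.005, tau2=tau2)   # ~0.91: keep 91% of the estimate
# tau2 = 0 reproduces complete pooling; tau2 -> infinity approaches no pooling.
complete_pooling = weight(sigma2=0.45, tau2=0.0)  # 0.0
```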
19. Example with Reading Assessment
The example data have:
• 839 Reading test items
• 7 self-reported ethnic groups
• a lot of data, but very small sample sizes
Recall also the inference rules:
• Magnitude
• Significance
• Compound
21. Table: Descriptive Statistics of Ethnicity in Reading Assessment

Group                  | N Students | N Responses | % Responses | Avg Score
American Indian        | 16572      | 37181       | 1.36        | 167.00
Asian                  | 42094      | 102062      | 3.73        | 175.24
Black                  | 180098     | 501665      | 18.35       | 163.46
Hispanic               | 149558     | 401692      | 14.70       | 164.65
Native Hawaiian        | 3260       | 8688        | 0.32        | 164.55
White                  | 493910     | 1244608     | 45.54       | 172.90
Multi-Ethnic           | 27945      | 71643       | 2.62        | 169.65
Not Specified or Other | 143928     | 365588      | 13.38       | 170.26
22. Table: Number of Items With Responses by Ethnicity

Group           | Any Responses | At Least 500
American Indian | 839           | 0
Asian           | 839           | 2
Black           | 839           | 469
Hispanic        | 839           | 330
Native Hawaiian | 823           | 0
White           | 839           | 807
Multi-Ethnic    | 839           | 0
27. Results
Traditional MH without minimum sample size:
• 37.9% of items flagged for DIF
For traditional MH with 500 minimum sample size:
• only 6.2% of items tested are flagged, but...
• only 57.3% of items are ever tested
For Multilevel MH:
• all items tested
• only 6.7% of items are flagged
33. Simulations Drive Methodological Advancement
The false positive rates for compound decision rules are not always
available analytically, but can be found through simulation.
The validity of a simulation rests on whether the simulated conditions
generalize to reality.
Alternatively, simulations can directly sample empirical data. Then the
simulations capture complex features and noise in empirical data
that data simulated from parametric models can never attain.
34. Data-Driven False Positive Rate Simulation
• Mimics empirical response curves from field test items (model-free)
• Mimics observed differences in subpopulations (impact)
There is no need to specify item response forms or ability
distributions.
Data-driven simulations generalize directly to the population of
interest.
They provide better evidence to support methodological choices.
35. Data-Driven Simulation Details
Given empirical data and researcher-specified reference and focal
group sample sizes:
• Randomly draw a single item’s data
• Compute the MH contingency table as proportions
• Compute the expected contingency table by combining the empirical contingency table with the researcher-specified sample sizes
• Compute $\hat{\sigma}^2_{\text{null}}$ from the expected contingency table, assuming no DIF
• Draw $\widehat{MH} \sim N(0, \hat{\sigma}^2_{\text{null}})$
• Adjust the expected contingency table so that it can produce $\widehat{MH}$
• Estimate $\hat{\sigma}^2$ from the adjusted-expected contingency table
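The steps above, up to the null draw, can be sketched as follows; the function name, the (A, B, C, D) input layout, and the omission of the final table-adjustment step are my simplifications, not the talk's actual implementation:

```python
import math
import random

def null_mh_draw(item_table, n_ref, n_foc, rng):
    """item_table: per-ability-bin (A, B, C, D) counts from one empirical
    item, where A/B are reference correct/incorrect and C/D are focal."""
    # Rescale each group's empirical proportions to the researcher-specified
    # sample sizes: the "expected" contingency table.
    ref_total = sum(a + b for a, b, _, _ in item_table)
    foc_total = sum(c + d for _, _, c, d in item_table)
    expected = [(n_ref * a / ref_total, n_ref * b / ref_total,
                 n_foc * c / foc_total, n_foc * d / foc_total)
                for a, b, c, d in item_table]
    # Variance of the MH statistic under no DIF (Eq. 2 with alpha = 1).
    n = [a + b + c + d for a, b, c, d in expected]
    R = sum(a * d / nk for (a, _, _, d), nk in zip(expected, n))
    sigma2_null = sum(
        (a * d + b * c) * (a + b + c + d) / nk**2
        for (a, b, c, d), nk in zip(expected, n)
    ) / (2 * R**2)
    # Draw one MH value from the null sampling distribution; the talk then
    # adjusts the expected table so it reproduces this draw.
    return sigma2_null, rng.gauss(0.0, math.sqrt(sigma2_null))
```

Repeating this over many randomly drawn items yields an empirical null distribution against which the decision rule's false positive rate can be measured.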
41. Draw Random Item and Create MH Table
Random item from empirical data:
Bin 1 2 3 4 5 6 7 8 9
Ref 1 40 44 73 92 97 135 153 166 184
Ref 0 169 169 167 149 139 111 97 83 67
Foc 1 26 33 28 34 53 44 49 60 61
Foc 0 108 96 74 67 53 52 43 33 30
Then adjust the margins to the desired sample sizes, and compute
$\hat{\sigma}^2_{\text{null}}$ assuming there is no DIF. For this item, with
researcher-specified sample sizes of 500 and 250, $\hat{\sigma}^2_{\text{null}} = .03$.
42. Draw from the Null MH Sampling Distribution
Draw once from $N(0, .03)$, say $\widehat{MH} = .13$. Then, adjust the
empirical MH table so it produces $\widehat{MH} = .13$ (and obeys the
sample sizes). The result is:
1 2 3 4 5 6 7 8 9
Ref 1 8.4 10.0 14.8 18.5 21.5 26.8 30.7 34.3 37.4
Ref 0 35.1 34.4 35.2 31.7 27.6 24.4 21.3 17.6 14.8
Foc 1 7.0 7.9 8.1 9.9 12.3 13.1 14.0 15.7 16.6
Foc 0 25.9 23.8 16.9 14.9 13.8 10.5 8.5 7.1 5.8
The distribution of ability across bins comes from the empirical data.
The item response probabilities come from the empirical data.
43. Big Picture
Data-driven simulations more closely reflect the population that
methodological inferences will act on.
Data-driven simulations remove often-used simulation assumptions
regarding the form of the ICC and the ability distributions.
They are especially useful when the population of interest is known to
not perfectly follow a model (i.e., field testing, new domains, ELL,
accessibility).
48. Putting it all together
The Multilevel MH tests all items for DIF, and flags at a similar
rate to the traditional MH test.
But Multilevel MH has a near-0% false positive rate, meaning
that the items that are flagged very likely represent actual DIF that
should be addressed.
Without a sample size restriction, Multilevel MH can find DIF
much earlier than traditional methods, and can find the worst
offenders well before the minimum sample size is reached.
50. Bigger Picture
Make statistics work for us:
• Use a method that can test all items
• Flag items that have meaningful bias
• Don’t flag any unbiased items
Minimum sample size rules will become less and less viable.
We need to be able to make decisions based on statistical
evidence, without requiring a certain number of data points.
That evidence should use all available information appropriately in
making the best decision.