Making Statistics Work For Us: Item Bias,
Decision Making, and Data-Driven Simulations
Quinn N Lathrop
July 10, 2018
Why we (usually) don’t have to worry about multiple comparisons
(Gelman, Hill, & Yajima, 2008)
Standard errors and sample size
“An ironic property about effect estimates with relatively large
standard errors is that they are more likely to produce effect
estimates that are larger in magnitude than effect estimates with
relatively smaller standard errors.... There is a tendency sometimes
towards downplaying a large standard error (which might increase
the p-value of their estimate) by pointing out that, however, the
magnitude of the estimate is quite large. In fact, this “large effect”
is likely a byproduct of this standard error.”
What is DIF
Differential Item Functioning (DIF) occurs when the probability of a
correct response differs for students of different sub-populations
(focal vs. reference group), even though their abilities are the same.
We conduct DIF studies to:
- Flag biased items for review or removal
- Provide evidence that items and assessments are not biased
- Give new grad students manageable projects
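As a minimal numerical illustration (using a Rasch-type response curve of my own choosing, not a model from these slides): DIF means two students with the same ability have different chances of answering correctly.

```python
import math

def p_correct(theta, b, dif=0.0):
    # Rasch-type response curve; `dif` shifts the effective difficulty
    # for the focal group (illustrative assumption, not the deck's model).
    return 1.0 / (1.0 + math.exp(-(theta - b - dif)))

# Two students with identical ability theta = 0 on an item of difficulty 0:
p_ref = p_correct(0.0, 0.0)            # reference group: 0.5
p_foc = p_correct(0.0, 0.0, dif=0.5)   # focal group: about 0.378 -> DIF
```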
Statistical Inference
Magnitude If the estimated DIF is larger in magnitude than a
constant, the item is flagged for DIF.
Significance If there is evidence to reject the null hypothesis that
there is no DIF, the item is flagged for DIF.
- Type I Error rate is 5%
- Large sample sizes and near-null truth
Compound (ABC Labels) DIF must be significant and have a
large magnitude to be flagged
- With small sample sizes, Type I errors have larger magnitudes
- Imposed minimum sample size
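The three decision rules above can be sketched as simple predicates. The 0.43 magnitude threshold and the 5% significance level are taken from the simulation tables later in the deck; the function names are mine.

```python
def flag_magnitude(lam_hat, threshold=0.43):
    # Flag if the estimated DIF exceeds a fixed constant in magnitude.
    return abs(lam_hat) > threshold

def flag_significance(lam_hat, se, z=1.96):
    # Flag if the 95% confidence interval around the estimate excludes 0.
    return abs(lam_hat) > z * se

def flag_compound(lam_hat, se, threshold=0.43, z=1.96):
    # Flag only if the estimate is both significant and large in magnitude.
    return flag_magnitude(lam_hat, threshold) and flag_significance(lam_hat, se, z)
```

A noisy estimate of 0.50 with standard error 0.40 is flagged by the magnitude rule but not by the compound rule, which is exactly the small-sample failure mode the deck is describing.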
A traditional inference flow for DIF
Statistic Mantel-Haenszel (MH) tests
Decision Rule Compound (significant and large magnitude)
Data Filtering Minimum sample size per group of 500
Mantel-Haenszel (MH)
Within each ability slice k, a 2 × 2 table is constructed such that

           Correct  Incorrect
Reference  A_k      B_k
Focal      C_k      D_k

Then,

\hat{\alpha}_i = \frac{R}{S} = \frac{\sum_k A_k D_k / n_k}{\sum_k B_k C_k / n_k}   (1)

and \hat{\lambda}_{MHi} = \ln(\hat{\alpha}_i). The variance of \hat{\lambda}_{MHi} is

\hat{\sigma}^2_i = \frac{1}{2R^2} \sum_k n_k^{-2} (A_k D_k + \hat{\alpha}_i B_k C_k) [A_k + D_k + \hat{\alpha}_i (B_k + C_k)].   (2)
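Equations (1) and (2) can be sketched compactly, assuming the counts for each ability slice are stored as parallel arrays (the function and argument names are mine):

```python
import numpy as np

def mh_log_odds(A, B, C, D):
    """Mantel-Haenszel common log-odds ratio and its variance.
    A, B, C, D: arrays over ability slices k of reference-correct,
    reference-incorrect, focal-correct, and focal-incorrect counts."""
    A, B, C, D = map(np.asarray, (A, B, C, D))
    n = A + B + C + D
    R = np.sum(A * D / n)
    S = np.sum(B * C / n)
    alpha = R / S                # eq. (1): common odds ratio
    lam = np.log(alpha)          # the MH log-odds statistic
    var = np.sum((A * D + alpha * B * C)
                 * (A + D + alpha * (B + C)) / n**2) / (2.0 * R**2)  # eq. (2)
    return lam, var
```

When both groups answer correctly at the same rate within every slice, alpha = 1 and the log-odds statistic is 0, as expected under no DIF.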
Multilevel models
Partial pooling, shrinkage, Bayesian.
- Provide a way to model a common phenomenon while both sharing information and allowing for heterogeneity
- Each estimate is pulled towards a common mean
- The amount of shrinkage is determined by the standard error of the estimate
- If the standard error is large, the estimate is stabilized towards the mean
- If the standard error is small, we trust that estimate and shrinkage is minimal
- The major barrier is awareness and implementation
Multilevel MH Extension
We assume

\hat{\lambda}_{MHi} \mid \theta_i \sim N(\theta_i, \sigma^2_i)   (3)

and specify a prior as

\theta_i \sim N(\mu, \tau^2)   (4)

where \mu is the population mean and \tau^2 is the population variance.
The prior and data are combined by calculating the weight

W_i = \frac{\tau^2}{\sigma^2_i + \tau^2}   (5)

Finally, the posterior distribution for DIF comparison i is

p(\theta_i \mid \hat{\lambda}_{MHi}) \sim N\left( W_i \hat{\lambda}_{MHi} + (1 - W_i)\mu, \; W_i \hat{\sigma}^2_i \right)   (6)

I’ll pick the population parameters (priors) as

\mu = 0 (without data, I assume there is no systematic bias)
\tau^2 = .05 (fixed here for ease, but can be estimated)

So compared to traditional MH,
- \hat{\lambda}_{MHi} shrinks to W_i \hat{\lambda}_{MHi}
- \hat{\sigma}^2_i shrinks to W_i \hat{\sigma}^2_i
- where W_i = \frac{.05}{\hat{\sigma}^2_i + .05}
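Equations (5) and (6) reduce to a two-line computation. A sketch with the slide’s fixed prior (the function name is mine):

```python
def multilevel_mh(lam_hat, var_hat, mu=0.0, tau2=0.05):
    """Posterior mean and variance for one DIF comparison, eqs. (5)-(6)."""
    W = tau2 / (var_hat + tau2)            # eq. (5): shrinkage weight
    post_mean = W * lam_hat + (1.0 - W) * mu
    post_var = W * var_hat
    return post_mean, post_var
```

A noisy estimate of 0.6 with variance 0.5 gets weight W ≈ .09 and is pulled almost all the way to 0, while the same estimate with variance .005 gets W ≈ .91 and is left nearly untouched.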
Don’t throw priors in a glass house...
Our traditional MH test has the following priors.

If the focal group is less than 500 (complete pooling):
\mu = 0
\tau^2 = 0

If the focal group is at least 500 (no pooling):
\mu = 0
\tau^2 = \infty

The 500 minimum sample size rule crudely approximates the
impact of the standard error by saying “With 500 responses per
group, the standard error should be small enough...”
But by using a prior that is not at the extreme of complete pooling
or no pooling, we allow the standard error of the estimate to
determine how much shrinkage is appropriate.
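The two extremes can be checked numerically with the same weight formula from equation (5): \tau^2 = 0 forces complete pooling (W = 0) and a huge \tau^2 approximates no pooling (W → 1), regardless of the standard error. The specific variance value below is illustrative.

```python
def weight(var_hat, tau2):
    # eq. (5): fraction of the estimate kept after shrinkage
    return tau2 / (var_hat + tau2)

var_hat = 0.03                      # an MH variance of typical size (assumed)
w_pool = weight(var_hat, 0.0)       # tau2 = 0   -> 0.0   (complete pooling)
w_none = weight(var_hat, 1e9)       # tau2 huge  -> ~1.0  (no pooling)
w_mid = weight(var_hat, 0.05)       # the deck's prior -> 0.625
```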
Example with Reading Assessment
Example data have:
- 839 Reading test items
- 7 self-reported ethnic groups
- a lot of data, but very small sample sizes
Recall also the inference rules:
- Magnitude
- Significance
- Compound
[Figure: Score]
Table: Descriptive Statistics of Ethnicity in Reading Assessment

Group                    N Students   N Responses   % Resp   Ave Score
American Indian               16572         37181     1.36      167.00
Asian                         42094        102062     3.73      175.24
Black                        180098        501665    18.35      163.46
Hispanic                     149558        401692    14.70      164.65
Native Hawaiian                3260          8688     0.32      164.55
White                        493910       1244608    45.54      172.90
Multi-Ethnic                  27945         71643     2.62      169.65
Not Specified or Other       143928        365588    13.38      170.26
Table: Number of Items With Responses by Ethnicity

Group              Any Responses   At Least 500
American Indian              839              0
Asian                        839              2
Black                        839            469
Hispanic                     839            330
Native Hawaiian              823              0
White                        839            807
Multi-Ethnic                 839              0
Results
Traditional MH without minimum sample size:
- 37.9% of items flagged for DIF
For traditional MH with 500 minimum sample size:
- only 6.2% of items tested are flagged, but...
- only 57.3% of items are ever tested
For Multilevel MH:
- all items tested
- only 6.7% of items are flagged
Simulations Drive Methodological Advancement
The false positive rates for compound decision rules are not always
available analytically, but can be found through simulation.
The validity of a simulation rests on whether the simulated
conditions generalize to reality.
Alternatively, simulations can directly sample empirical data. Then
the simulation captures complex features and noise in empirical
data that data simulated from parametric models can never attain.
Data-Driven False Positive Rate Simulation
- Mimics empirical response curves from field test items (model-free)
- Mimics observed differences in subpopulations (impact)
There is no need to specify item response forms or ability
distributions.
Data-driven simulations generalize directly to the population of
interest.
They provide better evidence to support methodological choices.
Data-Driven Simulation Details
Given empirical data and researcher-specified reference and focal
group sample sizes:
- Randomly draw a single item’s data
- Compute the MH contingency table as proportions
- Compute the expected contingency table by combining the empirical contingency table with the researcher-specified sample sizes
- Compute \hat{\sigma}^2_{null} from the expected contingency table, assuming no DIF
- Draw \hat{\lambda}_{MH} \sim N(0, \hat{\sigma}^2_{null})
- Adjust the expected contingency table so that it can produce \hat{\lambda}_{MH}
- Estimate \hat{\sigma}^2 from the adjusted-expected contingency table
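The steps above can be sketched as follows. The "adjust" step is underspecified in the slides, so this sketch realizes it by solving each bin's 2 × 2 table for the margin-preserving cell counts whose odds ratio equals exp(λ̂) (a Cornfield-style quadratic); since every bin then has odds ratio exp(λ̂), the MH estimator on the adjusted table reproduces λ̂ exactly. This is one plausible implementation, not necessarily the author's.

```python
import numpy as np

def adjust_cell(r, c, n, alpha):
    """Reference-correct cell a of a 2x2 table with reference-row margin r,
    correct-column margin c, and total n, chosen so the table's odds ratio
    equals alpha (root of a Cornfield-style quadratic)."""
    if abs(alpha - 1.0) < 1e-12:
        return r * c / n
    qa = 1.0 - alpha
    qb = (n - r - c) + alpha * (r + c)
    qc = -alpha * r * c
    disc = (qb * qb - 4.0 * qa * qc) ** 0.5
    a = (-qb + disc) / (2.0 * qa)
    if not (max(0.0, r + c - n) <= a <= min(r, c)):
        a = (-qb - disc) / (2.0 * qa)    # fall back to the other root
    return a

def null_draw(tables, n_ref, n_foc, rng):
    """One draw of the data-driven null simulation for a single item.
    tables: (K, 4) empirical counts [Ref1, Ref0, Foc1, Foc0] by ability bin."""
    T = np.asarray(tables, dtype=float)
    ref = T[:, :2] / T[:, :2].sum()            # reference cell proportions
    foc = T[:, 2:] / T[:, 2:].sum()            # focal cell proportions
    E = np.hstack([ref * n_ref, foc * n_foc])  # expected table at target sizes
    A, B, C, D = E.T
    n = E.sum(axis=1)
    R = np.sum(A * D / n)
    var_null = np.sum((A * D + B * C) / n) / (2.0 * R**2)  # eq. (2) with alpha = 1
    lam = rng.normal(0.0, var_null ** 0.5)     # draw from the null distribution
    alpha = float(np.exp(lam))
    # Adjust every bin to the common odds ratio exp(lam), keeping margins fixed.
    A2 = np.array([adjust_cell(A[k] + B[k], A[k] + C[k], n[k], alpha)
                   for k in range(len(n))])
    B2 = (A + B) - A2
    C2 = (A + C) - A2
    D2 = n - A2 - B2 - C2
    R2 = np.sum(A2 * D2 / n)
    var_hat = np.sum((A2 * D2 + alpha * B2 * C2)
                     * (A2 + D2 + alpha * (B2 + C2)) / n**2) / (2.0 * R2**2)
    return lam, var_hat
```

Repeating `null_draw` many times and applying a decision rule to each (lam, var_hat) pair yields the false positive rates tabulated later in the deck.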
[Figures: “The Data”, “MH Bins”, “Bin Probabilities”, “MH Can Match Parametric...”, and “Parametric can’t always match MH” — five panels plotting Observed Response (0.0 to 1.0) against Ability (−2 to 6)]
Draw Random Item and Create MH Table
Random item from empirical data:

Bin      1    2    3    4    5    6    7    8    9
Ref 1   40   44   73   92   97  135  153  166  184
Ref 0  169  169  167  149  139  111   97   83   67
Foc 1   26   33   28   34   53   44   49   60   61
Foc 0  108   96   74   67   53   52   43   33   30

Then adjust the margins to the desired sample sizes, and compute
\hat{\sigma}^2_{null} assuming there is no DIF. For this item, with
researcher-specified sample sizes of 500 and 250, \hat{\sigma}^2_{null} = .03.
Draw from Null MH Sampling Distribution
Draw once from N(0, .03), say \hat{\lambda}_{MH} = .13. Then, adjust the
empirical MH table so it produces \hat{\lambda}_{MH} = .13 (and obeys the
sample sizes). The result is:

Bin      1     2     3     4     5     6     7     8     9
Ref 1  8.4  10.0  14.8  18.5  21.5  26.8  30.7  34.3  37.4
Ref 0 35.1  34.4  35.2  31.7  27.6  24.4  21.3  17.6  14.8
Foc 1  7.0   7.9   8.1   9.9  12.3  13.1  14.0  15.7  16.6
Foc 0 25.9  23.8  16.9  14.9  13.8  10.5   8.5   7.1   5.8

The distribution of ability across bins comes from empirical data.
The item response probabilities come from empirical data.
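As a check, plugging the adjusted table above into the MH estimator of equation (1) recovers the drawn value in magnitude (the sign depends on which group's cells are placed in the numerator):

```python
import numpy as np

# Adjusted-expected table from the slide (columns are ability bins 1-9)
A = np.array([8.4, 10.0, 14.8, 18.5, 21.5, 26.8, 30.7, 34.3, 37.4])   # Ref correct
B = np.array([35.1, 34.4, 35.2, 31.7, 27.6, 24.4, 21.3, 17.6, 14.8])  # Ref incorrect
C = np.array([7.0, 7.9, 8.1, 9.9, 12.3, 13.1, 14.0, 15.7, 16.6])      # Foc correct
D = np.array([25.9, 23.8, 16.9, 14.9, 13.8, 10.5, 8.5, 7.1, 5.8])     # Foc incorrect

n = A + B + C + D
lam = np.log(np.sum(A * D / n) / np.sum(B * C / n))
# |lam| comes out at about .13, matching the drawn value
```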
Big Picture
Data-driven simulations more closely reflect the population that
methodological inferences will act on.
Data-driven simulations remove assumptions commonly made in
simulations about the form of the ICC and the ability distributions.
They are especially useful when the population of interest is known
to not perfectly follow a model (e.g., field testing, new domains,
ELL, accessibility).
False Positive Rate Simulation Results
The False Positive Rates (%) for Magnitude Testing are:

                       Reference Size
Focal Size      100      500     1000     5000
100          20.772   11.038    9.987    8.709
500           9.786     .914     .474     .227
1000          8.587     .318     .084     .029
5000          7.059     .074     .011        0

Table: False Positive Rate of 0.43 Magnitude Testing
False Positive Rate Simulation
The False Positive Rates (%) for Significance Testing are:

                      Reference Size
Focal Size      100     500    1000    5000
100           4.242   4.718   4.736   4.670
500           4.714   4.821   4.946   4.896
1000          4.819   4.961   4.920   4.914
5000          4.741   4.933   4.934   5.158

Table: False Positive Rate of 95% Confidence Interval Testing
False Positive Rate Simulation
The False Positive Rates (%) for Compound Testing are:

                      Reference Size
Focal Size      100     500    1000    5000
100           4.242   4.631   4.441   4.104
500           4.712   0.782   0.417   0.220
1000          4.728   0.312   0.084   0.029
5000          4.421   0.074   0.011       0

Table: False Positive Rate of Compound Decision Rule
False Positive Rate Simulation
The False Positive Rates (%) for Multilevel Compound Testing are:

                      Reference Size
Focal Size      100     500    1000    5000
100               0    .011    .008    .014
500            .003    .004       0       0
1000           .007       0       0       0
5000           .013    .001       0       0

Table: False Positive Rate of Multilevel MH with Compound Rules
Putting it all together
The Multilevel MH tests all items for DIF, and flags at a similar
rate to the traditional MH test.
But Multilevel MH has a near 0% False Positive Rate, meaning
that the items that are flagged very likely represent actual DIF that
should be addressed.
Without a sample size restriction, Multilevel MH can find DIF
much earlier than traditional methods, and can find the worst
offenders well before the minimum sample size is reached.
Bigger Picture
Make statistics work for us:
- Use a method that can test all items
- Flag items that have meaningful bias
- Don’t flag any unbiased items
Minimum sample size rules will become less and less viable.
We need to be able to make decisions based on statistical
evidence, without requiring a certain number of data points.
That evidence should use all available information appropriately in
making the best decision.