This presentation from the 2016 Conference on Test Security discusses SIFT, a software program for detecting test fraud and related issues, such as student copying, proctor help, brain dump takers, brain dump makers, dishonest testing locations, and more. A free version is available at www.assess.com.
3. SIFT
Most operational focus is on deterrence (rightfully so)
Data forensics is less widespread
Two things prevent more orgs from doing this important work:
1. Lack of summary literature & resources
a. Until recently – Wollack & Maynes (2013) and Kingston & Clark (2014) – you were on your own!
2. Lack of user-friendly software
4. The Need for SIFT
Not the issue here:
- Whether cheating happens
- Choice of forensics to find it
The issue: Lack of user-friendly software for data forensics, so more orgs can do it!
5. SIFT
Options to implement data forensics
Hire consultants
Write your own code
Find old software (SCheck, Scrutiny, Integrity)
New: SIFT
Also new: R packages
6. Options for Data Forensics
Intra-Individual
• Time/RTE (CBT only)
• Response patterns
• Score gains
• Person fit
Inter-Individual
• Collusion indices
• Erasure (paper only; also group level)
Group
• Roll-up of intra and inter
• Other useful stats (pass rate, mean score…)
7. It’s a Hypothesis Test!
If you aim at nothing, that’s exactly what you’ll hit.
8. It’s a Hypothesis Test!
Independent variables
Test centers/locations
Countries
Training programs
Test forms
Individuals
Operational vs. Pretest items
9. It’s a Hypothesis Test!
Dependent variables
Item response or test time
Item statistics
Test statistics (mean/SD, pass rate)
Person statistics (intra-individual)
Collusion indices
10. SIFT Output
Intra-Individual
Response Time Effort (RTE)
Mean item/test time
Response sets
Option proportion > x
Operational vs. Pretest items
Score gains (future)
Person fit (future)
11. SIFT Output
Inter-Individual
Collusion indices
Response time similarity (future)
Started with older/simpler indices and working forward!
12. SIFT Output
Group
Descriptive stats (mean, SD, pass rate…)
Roll-up of collusion indices
Roll-up of intra indices
Time usage
Item P values
Operational vs. Pretest scores
13. How do these look with data?
Data set 1: Real
School district summative assessment
31 items
N=1372
IRT parameters available (can do Omega)
14. How do these look with data?
Data set 2: Modified
Took Set 1 and…
Created fake schools/teachers
Implemented collusion for a few teachers
Shortened item times for one teacher’s students
15. How do these look with data?
Data set 3
English assessment from Indonesia
N = 16,666
6 districts
50 items
16. How do these look with data?
Data set 4
Math assessment from Botswana
N = 2,185
10 districts, 336 schools
72 items
17. Summary
Use case
SIFT allows, for example, a small certification organization to obtain all of this output in only a few hours of work, and quickly investigate locations before a test is further compromised.
18. Summary
Future of SIFT
More indices to be added (post-2000)
Currently in MVP stage
Add secondary analysis of indices (e.g., mean per location)
Group-level Z statistics (Sotaridona)
Graphics!
Free version available for download at assess.com
19. Summary
Future of SIFT
Most important: getting integrated with our online testing platform so that our clients do not need psychometric expertise
Easier to get the output (saves time/money) – many professionals do not have the time to export files and learn SIFT or R
Easier to interpret – will take a lot of good thought and UX design!
Examples:
Time: Flag an examinee for having a very low test time or average item time. Response Time Effort (RTE) is a statistic that quantifies this.
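A minimal sketch of how such a time flag could be computed follows; the function name, the 5-second per-item thresholds, and the 0.90 cutoff are illustrative assumptions, not SIFT's actual values.

```python
# Sketch of a Response Time Effort (RTE) style flag. Thresholds and the
# 0.90 cutoff are illustrative assumptions, not SIFT's implementation.
import numpy as np

def rte_flags(times, item_thresholds, cutoff=0.90):
    """times: (N examinees x K items) response times in seconds.
    item_thresholds: length-K per-item rapid-response thresholds.
    Returns each examinee's RTE (share of items answered slower than the
    rapid-guessing threshold) and a flag for suspiciously low RTE."""
    times = np.asarray(times, dtype=float)
    thresholds = np.asarray(item_thresholds, dtype=float)
    rte = (times >= thresholds).mean(axis=1)  # proportion of "effortful" responses
    return rte, rte < cutoff

# Toy data: 3 examinees, 4 items, 5-second rapid-guessing threshold per item.
times = [[20, 15, 30, 25],   # normal pacing
         [ 2,  3,  2,  4],   # answers every item in a few seconds
         [18,  4, 22, 30]]   # one rapid response
rte, flagged = rte_flags(times, [5, 5, 5, 5])
print(rte, flagged)  # [1. 0. 0.75], [False True True]: examinees 2 and 3 flagged
```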
Response pattern: Flag an examinee for answering one option >50% of the time. In this case, they probably gave up or didn’t care and just answered “C” over and over…
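As a rough illustration only (not SIFT's code; the 0.50 threshold is just the "x" from the slide), a response-set check can be this simple:

```python
# Sketch of the "option proportion > x" response-set check; the 0.50
# threshold and option coding are illustrative assumptions.
from collections import Counter

def option_proportion_flag(responses, threshold=0.50):
    """responses: the options one examinee selected (e.g. 'A'-'D').
    Flags the examinee if a single option accounts for more than
    `threshold` of their answers, e.g. answering 'C' over and over."""
    counts = Counter(responses)
    option, n = counts.most_common(1)[0]
    proportion = n / len(responses)
    return proportion > threshold, option, proportion

print(option_proportion_flag(list("CACCBCCCDCCCCACC")))
# -> (True, 'C', 0.75): 'C' was chosen 75% of the time
```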
Score gains: Your score doubled since the last time you took the test. Not likely!
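Score gains are still a planned SIFT output; purely as a hedged sketch, one possible gain screen compares the retake gain to the test's standard error of measurement (the 3-SEM cutoff below is an assumption, not SIFT's rule).

```python
# Illustrative score-gain flag for a retake; the 3-SEM cutoff is an assumption.
def score_gain_flag(old_score, new_score, sem, cutoff=3.0):
    """Flags a retake whose gain exceeds `cutoff` standard errors of
    measurement (SEM); a doubled score is far outside that range."""
    gain = new_score - old_score
    return gain / sem > cutoff, gain

print(score_gain_flag(old_score=40, new_score=80, sem=4.0))  # -> (True, 40)
```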
Person fit: Why are you getting tough items right but easy items wrong?
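Person fit is also listed as future SIFT output; purely as an illustration of the idea, a simple Guttman-error count captures the "tough items right, easy items wrong" pattern (the difficulty values and scores below are toy data, and operational person-fit indices are more refined).

```python
# Illustrative person-fit idea: count Guttman errors (harder item correct
# while an easier item is wrong). Toy data; not SIFT's planned index.
import numpy as np

def guttman_errors(scores, item_difficulty):
    """scores: 0/1 item scores for one examinee; item_difficulty: higher = harder."""
    scores = np.asarray(scores)
    order = np.argsort(item_difficulty)          # easiest -> hardest
    s = scores[order]
    errors = 0
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            if s[i] == 0 and s[j] == 1:          # easier item wrong, harder item right
                errors += 1
    return errors

# Misses the two easiest items but answers the three hardest correctly.
print(guttman_errors([0, 0, 1, 1, 1], item_difficulty=[0.1, 0.3, 0.5, 0.7, 0.9]))  # -> 6
```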
Collusion: A number of indices that quantify, for any given pair of examinees, whether their responses were unusually similar.
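Published indices (including the IRT-based Omega mentioned with Data set 1) model the agreement expected for a pair; as a sketch of the raw ingredients only, a pairwise count of identical and identically incorrect responses might look like this (names and data are illustrative):

```python
# Raw ingredients behind pairwise collusion indices: identical responses and
# identical *incorrect* responses. Real indices compare these counts to a
# modeled expectation; this sketch does not.
def pair_agreement(resp_a, resp_b, key):
    """resp_a, resp_b: selected options for two examinees; key: correct options."""
    same = sum(a == b for a, b in zip(resp_a, resp_b))
    same_wrong = sum(a == b != k for a, b, k in zip(resp_a, resp_b, key))
    return same, same_wrong

key   = "ABCDABCDAB"
alice = "ABCDABCCAC"
bob   = "ABCDABCCAC"   # identical answers, including the same two errors
print(pair_agreement(alice, bob, key))  # -> (10, 2)
```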
Erasure: Evaluating proportion of changes that are wrong-to-right vs. right-to-wrong.
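A minimal sketch of that erasure summary, assuming erasure records of the form (item index, old option, new option); the record layout is hypothetical:

```python
# Share of answer changes that go wrong-to-right (WTR) vs. right-to-wrong (RTW).
# The (item, old, new) record layout is an assumption for illustration.
def erasure_summary(changes, key):
    """changes: (item_index, old_option, new_option) erasures for one sheet;
    key: correct option per item. Returns (WTR count, RTW count, WTR share)."""
    wtr = sum(old != key[i] and new == key[i] for i, old, new in changes)
    rtw = sum(old == key[i] and new != key[i] for i, old, new in changes)
    total = len(changes)
    return wtr, rtw, (wtr / total if total else 0.0)

key = "ABCDA"
changes = [(0, "C", "A"), (2, "C", "B"), (3, "B", "D"), (4, "C", "A")]
print(erasure_summary(changes, key))  # -> (3, 1, 0.75): mostly wrong-to-right
```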
Roll-up: What percent of examinees at each location/group were flagged for intra/inter issues? For example, 90% of a location gets flagged for collusion.
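A minimal sketch of such a roll-up, assuming one boolean flag per examinee (e.g., from a collusion index) and a location label; the data are toy values:

```python
# Percent of examinees flagged at each location/group (toy data).
from collections import defaultdict

def flag_rate_by_location(locations, flags):
    """locations: location label per examinee; flags: True if that examinee
    was flagged (e.g., by a collusion or intra-individual index)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for loc, flag in zip(locations, flags):
        totals[loc] += 1
        hits[loc] += bool(flag)
    return {loc: 100.0 * hits[loc] / totals[loc] for loc in totals}

locations = ["Site A"] * 10 + ["Site B"] * 10
flags     = [True] * 9 + [False] + [False] * 9 + [True]
print(flag_rate_by_location(locations, flags))
# -> {'Site A': 90.0, 'Site B': 10.0}: Site A warrants investigation
```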
Other stats: Some locations have high average scores but low average test times.
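One hedged way to screen for that pattern is to compare each location's mean score and mean test time against the other locations; the plus/minus 1 z-score cutoffs and the group summaries below are illustrative assumptions.

```python
# Flag groups with unusually high mean scores and unusually low mean times.
# The z-score cutoffs of +/-1 are arbitrary illustrative values.
import statistics

def score_time_screen(groups):
    """groups: {location: (mean_score, mean_test_time_minutes)}."""
    scores = [s for s, _ in groups.values()]
    times = [t for _, t in groups.values()]
    ms, ss = statistics.mean(scores), statistics.stdev(scores)
    mt, st = statistics.mean(times), statistics.stdev(times)
    return [loc for loc, (s, t) in groups.items()
            if (s - ms) / ss > 1 and (t - mt) / st < -1]

groups = {"Site A": (78, 52), "Site B": (75, 55), "Site C": (74, 58),
          "Site D": (92, 31)}  # much higher score, much shorter time
print(score_time_screen(groups))  # -> ['Site D']
```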