So, you've heard about adaptive testing and are wondering what it takes to develop a valid one? This presentation is made for you. It outlines a 5-step process, starting with feasibility studies and business case evaluation. More info at www.assess.com and http://pareonline.net/getvn.asp?v=16&n=1.
2. What is CAT?
CAT is an algorithm
We need to break down and specify all aspects
Choice of major algorithms
Subalgorithms
Input parameters
Item bank needs
3. CAT Components
1. Calibrated item bank
2. Starting rule
3. Item selection rule
4. Scoring rule
5. Stopping rule
We must provide validity documentation on each
Algorithms inside your testing engine
Test development side
5. Background
We have approximately 4 decades of technical research on CAT
Numerous books and other resources (Rudner’s tutorial) on what CAT is and how it works
Discussions of issues (Wise & Kingsbury, 2000)
Very few resources on how to develop a CAT
6. Background
Best existing resource: descriptions of current CAT programs
Sands, Waters, & McBride (1997): ASVAB
Elements of Adaptive Testing: Part 2 = 5 examples
JATT issue on CAT
7. Background
Framework, not complete recipe
Identify choices for your org and best way to investigate/decide
Leads to better quality in the end
Also the foundation for validity arguments
Why did you choose certain things?
8. The 5 step model
Seq. | Stage | Primary work
1 | Feasibility, applicability, and planning studies | Monte Carlo simulation; business case evaluation
2 | Develop item bank content or utilize existing bank | Item writing and review
3 | Pretest and calibrate item bank | Pretesting; item analysis
4 | Determine specifications for final CAT | Post-hoc or hybrid simulations
5 | Publish live CAT | Publishing and distribution; software development
9. 1. Feasibility, applicability, planning
Big question: is CAT worth the investment?
If so, how can we develop a project plan and timeline?
10. 1. Feasibility, applicability, planning
Answer: simulations
Simulate how a CAT would operate under specified conditions
Independent variables (IVs)
Item bank size
Item quality
Desired precision
Dependent variables (DVs)
Average test length
Accuracy: CAT θ vs. true θ (or full-bank θ)
11. 1. Feasibility, applicability, planning
For those newer to CAT…
Three types of simulations
Monte Carlo
Post hoc (real data)
Hybrid
12. 1. Feasibility, applicability, planning
At this point, real data are unlikely to be available, so use Monte Carlo
Generate plausible situations
Item bank: 100, 200, 300…
Item quality: a = 0.7, 0.8…; spread of b
Desired precision: SEM = 0.2, 0.3, 0.4…
Compare results to each other and fixed forms
Base values on reality (e.g., mean a)
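The Monte Carlo approach above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular software package. It assumes a 2PL model, a standard normal prior with EAP scoring, maximum-information item selection, a starting θ of 0, and an SEM-based stopping rule. The function names, the bank generator (lognormal a, uniform b), and all numeric defaults are hypothetical choices for the example.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def simulate_cat(true_theta, a, b, target_sem, rng, max_items=None):
    """Run one simulated CAT; return (theta_hat, sem, n_items)."""
    grid = np.linspace(-4, 4, 81)          # quadrature points for EAP
    log_post = -0.5 * grid**2              # unnormalized log N(0, 1) prior
    available = np.ones(len(a), dtype=bool)
    max_items = max_items or len(a)
    theta_hat, sem, n = 0.0, np.inf, 0     # starting rule: theta = 0
    while n < max_items and sem > target_sem:
        # Item selection rule: maximum Fisher information at the current estimate.
        p = p_correct(theta_hat, a, b)
        info = np.where(available, a**2 * p * (1 - p), -np.inf)
        j = int(np.argmax(info))
        available[j] = False
        # Generate the simulated response from the examinee's true theta.
        u = rng.random() < p_correct(true_theta, a[j], b[j])
        # Scoring rule: EAP update of the posterior on the grid.
        pj = p_correct(grid, a[j], b[j])
        log_post += np.log(pj if u else 1 - pj)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        theta_hat = float(np.sum(grid * post))
        sem = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * post)))
        n += 1
    return theta_hat, sem, n

def run_condition(bank_size, target_sem, n_examinees=100, median_a=0.8, seed=0):
    """Simulate one condition; return (mean length, mean SEM, RMSE vs. true theta)."""
    rng = np.random.default_rng(seed)
    a = rng.lognormal(np.log(median_a), 0.2, bank_size)  # hypothetical item quality
    b = rng.uniform(-3, 3, bank_size)                    # hypothetical spread of b
    thetas = rng.standard_normal(n_examinees)
    results = np.array([simulate_cat(t, a, b, target_sem, rng) for t in thetas])
    rmse = np.sqrt(np.mean((results[:, 0] - thetas) ** 2))
    return results[:, 2].mean(), results[:, 1].mean(), rmse

print("Bank size  Target SEM  Mean length  Mean SEM   RMSE")
for bank_size in (200, 300):
    for target_sem in (0.30, 0.40):
        length, sem, rmse = run_condition(bank_size, target_sem)
        print(f"{bank_size:>9}  {target_sem:>10.2f}  {length:>11.1f}  {sem:>8.3f}  {rmse:>5.3f}")
```

Each printed row corresponds to one combination of bank size and desired precision. The RMSE column is one way to express accuracy of CAT θ against true θ.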
13. 1. Feasibility, applicability, planning
Think of the results table you want to see
Bank size | Target SEM | Mean test length | Mean SEM
(current test) | - | 100 | 0.32
200 | 0.30 | ? | ?
200 | 0.40 | ? | ?
300 | 0.30 | ? | ?
300 | 0.40 | ? | ?
14. 1. Feasibility, applicability, planning
Software will do this for you, allowing you to simulate CATs for thousands of examinees in seconds
CATSim (ASC)
WinGen (Han)
FireStar (Choi)
You can then easily set up an experiment with a wide range of conditions, and run a simulation for each
Workshop by Cito on this
16. 1. Feasibility, applicability, planning
Example takeaway:
CAT with bank of 300 items and SEM=0.25 has average of 53 items
Current fixed test has 100 items, SEM=0.23 in middle and 0.35+ beyond θ of ±1.5
CAT will make test more accurate for extreme examinees, about same accuracy for middle, but with a roughly 50% reduction in test length
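The reduction quoted in this takeaway is simple arithmetic on the two test lengths given on the slide. A short check, using only those two numbers (the variable names are illustrative):

```python
fixed_length = 100   # items on the current fixed test (from the example)
cat_length = 53      # average CAT length at SEM = 0.25 (from the example)

reduction = (fixed_length - cat_length) / fixed_length
print(f"Reduction in test length: {reduction:.0%}")  # prints "Reduction in test length: 47%"
```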
17. 1. Feasibility, applicability, planning
Another question: Business Case Evaluation
Example:
You deliver 100,000 tests per year
You estimate $20/hour seat time
Reducing a test from 2 hours to 1 hour then saves $2 million per year
More difficult to estimate for K-12 – cost is not seat time but time away from instruction
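The seat-time example works out as follows. This is a small sketch using the figures from the slide. The function name is hypothetical, and the calculation assumes the savings scale linearly with seat time.

```python
def annual_seat_time_savings(tests_per_year, cost_per_hour, hours_before, hours_after):
    """Yearly savings from shortening a test, assuming cost scales with seat time."""
    return tests_per_year * cost_per_hour * (hours_before - hours_after)

savings = annual_seat_time_savings(100_000, 20.0, hours_before=2.0, hours_after=1.0)
print(f"${savings:,.0f} per year")  # prints "$2,000,000 per year"
```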
18. 2. Develop item bank
Now that we have an idea what we need, we need to build it
CAT-based considerations:
Difficulty spread
Anticipated exposure/security issues
TIF adequacy
Normal considerations
Content blueprints
Cognitive level
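One way to look at TIF adequacy and difficulty spread is to compute the test information function of the whole bank and convert it to a full-bank SEM, 1/sqrt(information), across the θ range. The sketch below assumes a 2PL model. The bank parameters are randomly generated placeholders, and `bank_information` is a hypothetical helper name.

```python
import numpy as np

def bank_information(theta, a, b):
    """Information function of a 2PL item bank at each theta value."""
    theta = np.atleast_1d(theta)[:, None]          # shape (n_theta, 1)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))     # shape (n_theta, n_items)
    return np.sum(a**2 * p * (1 - p), axis=1)

rng = np.random.default_rng(1)
a = np.full(300, 0.8)              # placeholder discriminations
b = rng.uniform(-3, 3, 300)        # placeholder difficulty spread
theta = np.linspace(-3, 3, 13)

full_bank_sem = 1 / np.sqrt(bank_information(theta, a, b))
for t, s in zip(theta, full_bank_sem):
    print(f"theta = {t:+.1f}   full-bank SEM = {s:.3f}")
```

Where the full-bank SEM is close to or above the desired precision, the bank has too few informative items at that level of θ, which points to where item writing should focus.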
19. 3. Pretesting and analysis
Must pretest items to obtain bank calibration
Two situations
New test, new scale: present large numbers of items to examinees
Existing test, old scale: seed items
Obviously will take longer to pilot
Requires a linking study
20. 3. Pretesting and analysis
Then calibrate, usually IRT
Also perform other due diligence
Dimensionality
DIF
Model fit
Distractor analysis
Remove/revise items based on stats?
Etc.
21. 4. Determine final specifications
To publish a CAT, we need to specify algorithms
Starting point
Item selection
Scoring
Termination criterion
Also subalgorithms, such as item exposure, content, test length constraints
22. 4. Determine final specifications
But we must have a reason for selecting specifications
Validity documentation
Defensibility
Again, we turn to simulation studies
Define competing conditions
Big difference now: we have real data!
Post Hoc or Hybrid simulations
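A post hoc simulation replays each examinee's recorded responses as if they had been given adaptively, so competing specifications can be compared on real data. The sketch below is one possible implementation under assumptions: 2PL item parameters, EAP scoring with a standard normal prior, and maximum-information selection. The response matrix here is generated as a stand-in for real pretest data, and `post_hoc_cat` is a hypothetical name. Missing responses (NaN) are simply skipped. A hybrid simulation would instead fill them in with model-generated responses, which this sketch does not do.

```python
import numpy as np

def post_hoc_cat(responses, a, b, target_sem, max_items):
    """Replay one examinee's recorded 0/1 responses as a CAT."""
    grid = np.linspace(-4, 4, 81)
    log_post = -0.5 * grid**2                  # unnormalized log N(0, 1) prior
    unused = ~np.isnan(responses)              # only items actually answered
    theta_hat, sem, n = 0.0, np.inf, 0
    while n < max_items and sem > target_sem and unused.any():
        p = 1 / (1 + np.exp(-a * (theta_hat - b)))
        j = int(np.argmax(np.where(unused, a**2 * p * (1 - p), -np.inf)))
        unused[j] = False
        # Look up the recorded response instead of generating one.
        pj = 1 / (1 + np.exp(-a[j] * (grid - b[j])))
        log_post += np.log(pj if responses[j] == 1 else 1 - pj)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        theta_hat = float(grid @ post)
        sem = float(np.sqrt(((grid - theta_hat) ** 2) @ post))
        n += 1
    return theta_hat, sem, n

# Stand-in for a real response matrix (examinees x items).
rng = np.random.default_rng(2)
a = rng.uniform(0.6, 1.4, 150)
b = rng.uniform(-3, 3, 150)
theta = rng.standard_normal(100)
prob = 1 / (1 + np.exp(-a * (theta[:, None] - b)))
data = (rng.random(prob.shape) < prob).astype(float)

# Competing conditions: two candidate termination criteria.
for target in (0.30, 0.40):
    lengths = [post_hoc_cat(row, a, b, target, max_items=60)[2] for row in data]
    print(f"target SEM {target:.2f}: mean length {np.mean(lengths):.1f}")
```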
24. 4. Determine final specifications
After determining psychometric specifications, evaluate more practical issues
For example, time limits; can’t really set until you know how many items
CAT-ASVAB approach: set limits for 90-95% of population
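A percentile-based time limit of this kind can be sketched as follows. The total testing times here are placeholders built from hypothetical CAT lengths and lognormal per-item response times. In practice they would come from simulation results combined with observed response times. `time_limit` is an illustrative helper, not a documented CAT-ASVAB procedure.

```python
import numpy as np

def time_limit(total_minutes, coverage=0.95):
    """Time limit at the given quantile of examinees' total testing time."""
    return float(np.quantile(total_minutes, coverage))

rng = np.random.default_rng(3)
n_items = rng.integers(30, 61, size=1000)      # placeholder CAT lengths
minutes = np.array([rng.lognormal(np.log(1.0), 0.4, n).sum() for n in n_items])

for coverage in (0.90, 0.95):
    print(f"{coverage:.0%} of examinees finish within {time_limit(minutes, coverage):.1f} minutes")
```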
25. 5. Publish live CAT
Once you have finalized your item bank and CAT design, time to publish
Need to put everything into item banker and CAT engine
First: obtain the item banker and CAT engine
If developing your own, this can be the biggest step
If purchasing, this is the easiest step
26. Epilogue: Maintaining CAT
Like fixed form testing, maintenance is usually necessary
Check that the CAT is performing as expected
Is termination criterion being satisfied?
Examinees hitting test length or other constraints?
Average test length what you expected?
Exposure or security issues?
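These checks can be run routinely against delivery logs. The sketch below assumes each logged session is reduced to a pair of final SEM and the list of administered item IDs. That log format, the function name `monitor`, and the toy data are all hypothetical.

```python
from collections import Counter

def monitor(sessions, target_sem, bank_size):
    """Summarize live CAT sessions; each session is (final_sem, item_ids)."""
    n = len(sessions)
    met = sum(sem <= target_sem for sem, _ in sessions)
    counts = Counter(item for _, items in sessions for item in items)
    return {
        "pct_meeting_sem": met / n,                        # termination criterion satisfied?
        "mean_length": sum(len(items) for _, items in sessions) / n,
        "max_exposure_rate": max(counts.values()) / n,     # most-used item
        "pct_bank_used": len(counts) / bank_size,
    }

# Toy log: three sessions from a 20-item bank.
sessions = [(0.29, [1, 2, 3]), (0.30, [1, 4, 5, 6]), (0.41, [1, 2, 7, 8, 9])]
report = monitor(sessions, target_sem=0.30, bank_size=20)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Comparing these figures with the values predicted by the earlier simulations shows whether the live CAT is behaving as designed.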