On April 15, 2015, Professor Deb Rooney presented on validity within the current framework used to evaluate evidence, and on how we gather and evaluate validity evidence from a simulator and its associated measures. Her talk was part of the monthly brown bag series hosted by the University of Michigan Medical School Clinical Simulation Center.
For more about the Department of Learning Health Sciences or the Clinical Simulation Center, see http://LHS.medicine.umich.edu/.
Validation Studies in Simulation-based Education - Deb Rooney
1. Validation Studies in Simulation-based
Education
April 15, 2015
Deb Rooney, PhD
Professor of Learning Health Sciences
All Rights Reserved.
2. Objectives
• Validity in the current framework used to evaluate evidence
• How we gather and evaluate validity evidence from a simulator and its associated measures
• Context of academic products (manuscripts)
• Final considerations
3. Validity: What is it?
A few definitions to consider…
1. the degree to which the tool measures what it claims to measure
2. the degree to which evidence and theory support the interpretations of test scores as entailed by proposed uses of tests –Standards (1999)
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
4. Simulator Validation: The Framework
• Evidence relevant to relationships to other variables (e.g., novice versus expert discriminant validity) is over-represented
• Evidence relevant to response processes and consequences of testing is infrequently reported
• Apply the current Standards to ensure rigorous research and reporting
Cook, D. A., Brydges, R., Zendejas, B., Hamstra, S. J., & Hatala, R. (2013). Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Academic Medicine, 88(6), 872-883.
5. Simulator Validation: The Evidence
Current AERA Standards*
o Not new/novel
o Unitary construct (all evidence falls under “construct” validity)
o Test content (face validity, subjective measures, construct alignment)
o Internal structure (reliability, dimensionality, function across groups)
o Response processes (psychometric properties of measures)
o Relationships to other variables (comparison w/ previously-validated measures)
o Consequences of testing (standard setting, rater/ratings quality, fidelity vs. stakes)
*Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014)
6. Validity: What is it NOT?
Validity evidence (about the quality of your measures and their application):
• Does not allow us to make inferences about a curriculum
• Does not allow us to make inferences about different applications, settings, or learners
• Is not a terminal quality determination
• Not “the scale was valid,” but rather “evidence supports the use of a scale to measure X in a particular setting/application”
7. Simulator Validation: How does this evidence apply to us?
We have a much more complex environment to evaluate!
Simulator: test content, relationships to other variables, consequences of testing
Instrument (measures): test content, internal structure, response processes, relationships to other variables, consequences of testing
8. Validity Evidence in Simulation: How/when do we gather evidence?
Across three phases: DESIGN & PLANNING, CREATION, IMPLEMENTATION & EVALUATION
• Test content (measures/simulator)
• Internal structure (m)
• Response processes (m)
• Relationship to other variables (m/s)
• Consequences of testing (m/s)
9. Validity Evidence in Simulation: How do we disseminate findings?
• Paper 1 (before implementation): test content (measures & simulator)
• Paper 2 (before or after implementation): quality of performance measures
• Paper 3 (after full implementation): impact on performance and/or patient outcomes
10. Most Recent Example: Neurosurgery Sim
• Paper 1a: Preliminary evaluation of quality of simulator/measures
• Paper 2b: Evaluation of performance measures from simulator
• Paper 3: Evaluation of impact on performance measures/patient outcomes
a Tai B, Rooney D, Stephenson F, Liao P, Sagher O, Shih A, Savastano LE. Development of 3D-printing built ventriculostomy placement simulator. Journal of Neurosurgery (in press).
b Rooney DM, Tai BL, Sagher O, Shih AJ, Wilkinson DA, Savastano LE. A simulator and two tools: Validation of performance measures from a novel neurosurgery simulator using the current Standards framework. Surgery, submitted 3/15.
13. Paper 1: The preliminary validation process (Sim)*
• Using a Rasch model, analyzed the data:
  • Looked at domain rating differences across the 3 sites
  • Mean ratings by item
  • Looked at Rasch variability indices to identify possible inconsistency in ratings
  • Ensured psychometric quality of the survey
• Using traditional methods, estimated:
  • Inter-item consistency: Cronbach's alpha
  • Inter-rater agreement: ICC(2,k)
• Using a Rasch model, ensured the rating scales' function
*performance checklist is a separate/different process
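The two traditional indices named above, Cronbach's alpha (inter-item consistency) and ICC(2,k) (inter-rater agreement, two-way random effects, average of k raters), can be sketched in a few lines. This is an illustrative implementation of the standard formulas, not the study's actual code; function names and any data are assumptions.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha from a (n_subjects, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)          # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def icc_2k(ratings):
    """ICC(2,k): two-way random effects, average of k raters.
    ratings: (n_subjects, n_raters) matrix."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    # Mean squares from a two-way ANOVA without replication
    ms_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum() / (k - 1)
    resid = Y - Y.mean(axis=1, keepdims=True) - Y.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
```

With perfectly consistent items (or perfectly agreeing raters) both indices reach 1.0, which gives a quick sanity check when applying them to checklist data.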
14. Results: Domain mean averages by site
[Bar chart: domain mean averages by site (UM, HF, WS), y-axis 0-4.5; combined mean averages across domains: 3.4, 3.3, 3.9, 3.3, 2.4]
“This simulator requires minor adjustments before it can be considered for use in ventriculostomy placement training.”
16. Paper 1*: Test Content-Checklist
*shoulda, coulda, woulda
17. Paper 1*: Test Content-Checklist
Rating scale: (1) Definitely do not include this task; (2) Not sure if this task should be included; (3) Pretty sure this task should be included; (4) Definitely include this task
Proposed items:
• Position head and mark midline
• Locate Kocher's point (10.5 cm posterior to the nasion and 3 cm lateral to midline)
• Mark incision (approximately 2 cm long in a parasagittal location)
• Incise, clear tissue off cranium, retract scalp
• …
• Suture wound (staples or a 3-0 running nylon or prolene suture)
18. Paper 1*: Test Content-Checklist
Ask expert instructors about the value of the included steps (items) for measuring X when doing Y.
• Reasonable number of experts: ~3
What else do you ask about?
• Clarity of items
• Appropriateness of qualifiers (use X instrument, at X location)
• Rating scale
• Missing steps
• Objective measures to include (e.g., “time to” measures)
*shoulda, coulda, woulda
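One simple way (not necessarily the author's) to summarize expert ratings on the 4-point "include this task?" scale above is the item-level content validity index (I-CVI): the proportion of experts rating an item 3 ("pretty sure") or 4 ("definitely include"). The items and ratings below are hypothetical.

```python
def item_cvi(ratings, threshold=3):
    """Proportion of experts whose rating meets or exceeds the threshold."""
    return sum(r >= threshold for r in ratings) / len(ratings)

# Hypothetical ratings from 3 experts per proposed checklist item
expert_ratings = {
    "Locate Kocher's point": [4, 4, 4],
    "Mark incision":         [4, 3, 4],
    "Suture wound":          [3, 2, 4],
}
for item, ratings in expert_ratings.items():
    print(f"{item}: I-CVI = {item_cvi(ratings):.2f}")
```

Items with a low I-CVI are candidates for revision or removal before the checklist is piloted.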
19. Next Step: Deeper Evaluation (Paper 2)
• Paper 1 (before implementation): test content (measures & simulator)
• Paper 2 (before or after implementation*): quality of performance measures
• Paper 3 (after full implementation): impact on performance and/or patient outcomes
20. Next Step: Deeper Evaluation (Paper 2)
• Evaluation of all validity evidence of performance measures [à la the Standards]
• Capture a broader (regional/national) sample of performance data via videotaped performances
  • ideal N (+++) and range of experience
• Compare measures from the novel performance checklist and a gold standard (e.g., OSATS) [relationship with other variables*]
• Set/test performance standards (if appropriate)
Rooney DM, Tai BL, Sagher O, Shih A, Wilkinson DA, Savastano L. A simulator and two tools: Validation of performance measures from a novel neurosurgery simulator using the current Standards framework. Surgery, submitted 3/15.
21. Paper 2: study design
• Nationally-recognized training program sponsored by the Society of Neurological Surgeons
• Total n = 14 (11 trainees*, 3 attendings) performed ventriculostomy on the simulator
• All performances were video-captured and scored by 3 raters using the novel checklist and a modified version of the OSATS
*first-year neurosurgery fellows
23. Modified Objective Structured Assessment of Technical Skills: m-OSATS
Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. (1997). Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 84:273-278.
24. Paper 2: evidence examined
Examined evidence from 5 sources, but packaged a bit differently:
Measures adequately reflect ventriculostomy performance “quality”
• Relationships to other variables
  • Trainee vs. expert ratings
  • Correlation of summed V-PAT with summed OSATS scores
Measures are psychometrically sound (adequate “quality control”)
• Psychometric function of V-PAT & OSATS measures:
  • Response processes: Rasch indices → rating scale function
  • Test content: Rasch item point-measure correlations, item fit (variability)
  • Internal structure: inter-item consistency (Cronbach's α), inter-rater agreement (ICC(2,k))
Measures are free from rater bias
• Consequences of testing
  • Evaluated Rasch bias indices to identify potential rating differences at the rater level
25. Results: “Quality control”
Response processes
• Rasch indices (average measures, fit statistics, and Rasch-Andrich thresholds) indicated all rating scales for both V-PAT & OSATS were well-functioning
Test content
• Point-measure correlations: all positive, [.39, .81]
• Rasch item Outfit MS: all < 2.0
Internal structure
• Cronbach's α: both α = 0.95
• Intraclass correlation: V-PAT = [-0.33, 0.93]; OSATS = [0.80, 0.93]
26. Results: Relationship to Other Variables (V-PAT)
Do V-PAT measures adequately differentiate trainee and expert performances?

V-PAT item | Resident Observed Average (SE) | Attending Observed Average (SE) | P-value | ICC(2,k)
1. Position head and mark midline | 3.75 (.20) | 4.00 (.47) | 0.54 | *
2. Locate Kocher's point | 3.84 (.18) | 4.11 (.34) | 0.52 | .86
3. Mark an incision | 3.61 (.16) | 4.00 (.32) | 0.42 | .10
4. Select drain exit site from the scalp | 2.94 (.17) | 3.71 (.50) | 0.22 | *
5. Incise, clear tissue off cranium, retract scalp | 3.56 (.14) | 4.11 (.34) | 0.20 | .83
6. Set drill stop and drill trephine | 2.80 (.16) | 3.78 (.37) | 0.08 | .70
7. Confirm dura and pierce | 2.98 (.17) | 3.67 (.35) | 0.25 | .76
8. Confirm landmarks and place catheter | 2.88 (.17) | 3.33 (.35) | 0.24 | .91
9. Confirm CSF flow | 3.41 (.13) | 3.67 (.30) | 0.39 | .73
10. Remove trocar cover, tunnel trocar to exit site and recap trocar | 2.78 (.14) | 3.00 (.47) | 0.62 | -.33
11. Place purse string suture to anchor the catheter at scalp exit site | 3.00 (.36) | 4.25 (.69) | 0.26 | *
Overall average | 3.30 (.06) | 3.80 (.11) | 0.01 | –
* too few cases to estimate
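From summary data like the table above (group means with standard errors), a rough large-sample contrast can be sketched: the difference in means divided by the standard error of the difference. This is only an illustrative approximation, not the test the study actually used, and the numbers plugged in are the overall V-PAT averages reported above.

```python
import math

def contrast_z(mean_a, se_a, mean_b, se_b):
    """z-style contrast: difference in means over the SE of the difference."""
    return (mean_b - mean_a) / math.sqrt(se_a ** 2 + se_b ** 2)

# Overall V-PAT averages from the slide:
# residents 3.30 (SE .06) vs. attendings 3.80 (SE .11)
z = contrast_z(3.30, 0.06, 3.80, 0.11)
print(f"z = {z:.2f}")
```

A large contrast for the overall average alongside small item-level contrasts mirrors the pattern in the table: the summed score separates groups even though most individual items do not reach significance.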
27. Results: Relationship to Other Variables (m-OSATS)
Do m-OSATS measures adequately differentiate trainee and expert performances?
• Correlation of summed V-PAT scores with summed m-OSATS scores: Pearson's r = 0.72, p = 0.001

m-OSATS item | Resident Observed Average (SE) | Attending Observed Average (SE) | P-value | ICC(2,k)
1. Respect for Tissue | 2.66 (.21) | 4.11 (.34) | 0.004 | .93
2. Time and Motion | 2.42 (.22) | 4.00 (.43) | 0.005 | .85
3. Instrument Handling | 2.51 (.22) | 4.00 (.46) | 0.007 | .86
4. Knowledge of Instruments | 2.36 (.21) | 4.33 (.43) | 0.001 | .84
5. Flow of Operation | 2.36 (.21) | 4.22 (.45) | 0.001 | .85
6. Knowledge of specific procedure | 2.33 (.23) | 4.33 (.46) | 0.001 | .80
Overall average | 2.32 (.08) | 3.73 (.15) | 0.001 | –
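The summed-score correlation reported above (Pearson's r between each performance's total V-PAT and total m-OSATS scores) is straightforward to compute. A pure-Python sketch follows; the score lists are hypothetical, not study data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

vpat_totals  = [36, 41, 33, 44, 38]   # hypothetical summed V-PAT scores
osats_totals = [14, 22, 12, 26, 18]   # hypothetical summed m-OSATS scores
r = pearson_r(vpat_totals, osats_totals)
print(f"r = {r:.2f}")
```

A strong positive r between the novel checklist and the established instrument is the "relationships to other variables" evidence the framework calls for.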
28. Results: Consequences of Testing
Are measures free from rater bias?

V-PAT | Observed Average (SE)
1. Rater 1 (LS) | 3.60 (.11)
2. Rater 2 (OS) | 3.60 (.10)
3. Rater 3 (DW) | 3.60 (.11)
4. Participants (variable)† | 2.90 (.11)
Overall average | 3.60 (–)

OSATS | Observed Average (SE)
1. Rater 1 (LS)* | 2.10 (.17)
2. Rater 2 (OS) | 3.30 (.15)
3. Rater 3 (DW) | 3.00 (.15)
Overall average | 2.80 (–)

† Comparison with 3 expert raters, p = 0.01
* Comparison with 2 expert raters, p = 0.01
30. Summary of Results

Evidence | Inferences | V-PAT | OSATS
Quality Control
Response Processes | Adequate rating scale function | √ | √
Test Content | Items align with construct | √ | √
Internal Structure | Inter-item consistency, inter-rater agreement* | X* | √
Test of Assumptions
Rel. Other Variables | Measures differentiate between novice/expert performances | X | √
Rel. Other Variables | V-PAT/OSATS summed scores correlate | √ | √
Cons. of Testing | V-PAT/OSATS measures are bias-free | √ | X

X = challenges that require resolution
31. Validity: à la 2015
• Evidence is important, but the interpretive argument is critical
• The content of the interpretive argument determines the kinds of evidence that are most relevant (most important) in validation
• Strategy: develop the interpretive argument based on
  • Validity evidence relevant to inferences
  • Assumptions
  • Challenges (alternative interpretations)
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. Available for purchase via http://teststandards.org/
32. Challenges: potential threats to validity (V-PAT)
Problematic inter-rater agreement (ICC) for 5 items should be resolved:
• Item 3 (Mark an incision approximately 2 cm long in a parasagittal location), ICC = .10
• Item 10 (Remove trocar cover, tunnel trocar to exit site and recap trocar), ICC = -.33
• Item 1 (Position head and mark midline)*
• Item 4 (Select drain exit site from the scalp)*
• Item 11 (Place purse string suture at the scalp exit site to anchor the catheter)*
* ICC incalculable
Examine and refine items to ensure they align with simulator capabilities and are mutually exclusive
33. Challenges: potential threats to validity (m-OSATS)
• “Hawkish” OSATS ratings by one expert rater require follow-up
• Refine items; add rater training on the scoring rubric and administration standards
34. Next Step: Evaluation of Impact (Paper 3)
• Paper 1 (before implementation): test content (measures & simulator)
• Paper 2 (before or after implementation*): quality of performance measures
• Paper 3 (after full implementation): impact on performance and/or patient outcomes
35. Next Step: Evaluation of Impact (Paper 3)
• Evaluation of impact on trainees' clinical performance or patient outcomes [relationship with other variables]
• Examine:
  • Change in trainees' clinical performance (checklist ratings; objective measures such as “time to,” LOS, adverse events)
  • Impact on hospital costs
Barsuk JH, Cohen ER, Feinglass J, Kozmic SE, McGaghie WC, Ganger D, Wayne DB. Cost savings of performing paracentesis procedures at the bedside after simulation-based education. Simul Healthc. 2014 Oct;9(5):312-8.
36. Considerations
• The validation process is fluid, reiterative, and ongoing
• It takes a team:
  • Development (clinicians, instructors, engineers, research assistants)
  • Outcomes (+QI, hospital info)
• There is funding:
  • AHRQ
  • PCORI
  • Blue Cross Blue Shield of Michigan