1. Considerations in Developing a Benchmark Scale for Many-Facet Rasch Analysis
Ross M. Brown and Mary E. Lunz
Measurement Research Associates, Inc., Chicago, Illinois
This case study paper describes:
1) a way to create a sufficiently precise benchmark scale for small-sample performance assessment programs,
2) methods for investigating misfitting facet elements that may affect the integrity of the benchmark, and
3) the considerations involved in deciding whether to eliminate examiner ratings that do not meet the requirements of the Rasch model.
2. Background: Performance Assessments and MFRM
• The many-facet Rasch model (MFRM) extends Rasch analysis to testing situations with additional facets, e.g., examiners
• Examples: oral exams, portfolio assessments
• Common exam elements can be used to equate future performance assessments when their severity/difficulty is calibrated on a benchmark scale
3. Pooling Exam Administrations
• Portfolio exam
• 15 cases in seven defined areas of practice
• Each case receives ratings for three skills
• Individual administrations have nine to 24 candidates
• Increasing the number of candidates beyond 40 has been shown to significantly improve facet calibration precision
• Pooled 97 candidates’ ratings from six exam administrations to construct the benchmark
• After pooling the ratings, facet calibration error was reduced relative to the individual administrations (see the sketch below)
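As a rough illustration of why pooling helps (a back-of-the-envelope sketch, not the study’s actual computation, and resting on the assumption that a facet element’s calibration standard error shrinks roughly with the square root of the number of candidates contributing ratings to it):

```python
import math

# Assumption: calibration SE ~ 1 / sqrt(number of candidates contributing ratings);
# the real gain depends on the rating design and on how ratings are distributed.
def relative_se(n_candidates: int) -> float:
    return 1.0 / math.sqrt(n_candidates)

single = relative_se(20)   # a typical single administration (9 to 24 candidates)
pooled = relative_se(97)   # the pooled benchmark sample
print(f"pooled SE is ~{pooled / single:.0%} of a single-administration SE")  # ~45%
```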
5. Fit of the Data to the Model
Before interpretation, the results of a Rasch analysis must be examined for fit: the ratio of the observed error variance to the error variance the model predicts (see the sketch below).
Data with sufficient fit provide evidence that the model’s interval measures are appropriately derived.
Case type and skill calibrations all showed sufficient fit.*
Two examiners had fit above the 1.2 to 1.7 high-end criteria: Examiner 10 had an outfit mean square of 2.45; Examiner 19, 1.8.
*One case type was outside the conservative range of 0.80 to 1.2 (outfit mean square: 1.25).
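For reference, outfit and infit mean squares are ratios of observed to model-expected residual variance. A minimal sketch of the computation, assuming the observed ratings, model-expected ratings, and model variances for one facet element are available (e.g., exported from the MFRM software):

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Outfit and infit mean squares for one facet element (sketch).

    observed, expected, variance: arrays over all ratings involving the element,
    where expected and variance come from the estimated Rasch model.
    """
    observed, expected, variance = map(np.asarray, (observed, expected, variance))
    sq_resid = (observed - expected) ** 2
    outfit = np.mean(sq_resid / variance)      # unweighted mean square (outlier-sensitive)
    infit = sq_resid.sum() / variance.sum()    # information-weighted mean square
    return outfit, infit
```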
6. Examiner Misfit: Rater Effects
• Halo effect: the rater forms a holistic assessment of the candidate and therefore assigns similar ratings to all skills and case types for that candidate
• When skill or case difficulties vary, raters exhibiting halo will have infit and outfit mean square indices significantly greater than 1
Do skill and case difficulties vary?
Skills: fixed (all same) chi-square = 611.3, d.f. = 2, significance = .00. Separation = 14.34: the spread of the skill difficulty measures is roughly 14 times larger than the precision of those measures.
Cases: fixed (all same) chi-square = 56.6, d.f. = 6, significance = .00. Separation = 2.82: the spread of the case difficulty measures is roughly three times larger than the precision of those measures.
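The separation index reported here is the error-corrected spread of the calibrations relative to their average measurement error. A minimal sketch of that computation, assuming the element measures and their standard errors are available from the software output:

```python
import numpy as np

def separation_index(measures, standard_errors):
    """Rasch separation: true spread of calibrations relative to their precision (sketch)."""
    measures = np.asarray(measures)
    se = np.asarray(standard_errors)
    rmse = np.sqrt(np.mean(se ** 2))                            # average measurement error
    true_sd = np.sqrt(max(np.var(measures) - rmse ** 2, 0.0))   # error-corrected spread
    return true_sd / rmse
```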
Skill and case difficulties do vary. How often did the flagged raters assign the same rating to all skills within a case?
– Examiner 10: 17%
– Examiner 19: 35%
– All other examiners combined: 47%
7. Percent of Cases Rated with All Skills Assigned the Same Rating
Examiner  Percent of cases
10 17
11 72
12 46
13 37
14 34
15 67
16 26
17 49
18 57
19 35
20 47
For all examiners except 10 and 19: 47%
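The percentages in this table can be reproduced from the raw ratings with a few lines of pandas. A sketch, in which the column names examiner, skill1, skill2, and skill3 are illustrative rather than the study’s actual data layout:

```python
import pandas as pd

def percent_same_ratings(ratings: pd.DataFrame) -> pd.Series:
    """Percent of cases in which an examiner gave all three skills the identical rating.

    ratings: one row per examiner-candidate-case, with columns
    examiner, skill1, skill2, skill3 (illustrative layout).
    """
    same = ratings[["skill1", "skill2", "skill3"]].nunique(axis=1) == 1
    return same.groupby(ratings["examiner"]).mean().mul(100).round(0)
```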
8. Rater Effects: Randomness
• Randomness: a tendency to assign ratings randomly, without apparent consideration for the merit of the portfolio submission
• Detecting randomness: anchor all ratees at the same level of performance (i.e., at 0) and then rerun the analysis (see the sketch after the table)
• Raters who show the best fit to this model are likely to be exhibiting randomness (i.e., their ratings are not related to the level of performance of the ratee)
Examiner  Outfit mean square
10 2.46
11 .57
12 .89
13 .90
14 .83
15 1.23
16 .76
17 1.63
18 .94
19 1.92
20 .73
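A sketch of the randomness check described above: anchor every candidate at the same measure, rerun the analysis, export the expected ratings and model variances from that anchored run, and compute each examiner’s outfit mean square against them. The column names below are illustrative, not the study’s actual layout:

```python
import pandas as pd

def anchored_outfit_by_examiner(obs: pd.DataFrame) -> pd.Series:
    """Outfit mean square per examiner under the anchored (all-candidates-equal) run.

    obs: one row per rating, with columns examiner, observed, expected, variance
    (illustrative layout); expected and variance come from the analysis in which
    every candidate is anchored at the same measure. Per the approach above,
    examiners who fit this model best are the ones whose ratings are least
    related to candidate performance.
    """
    z2 = (obs["observed"] - obs["expected"]) ** 2 / obs["variance"]
    return z2.groupby(obs["examiner"]).mean()
```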
9. Bias Interaction
• Bias interaction analysis: the degree to which the ratings produced for a particular interaction (e.g., an examiner rating a particular skill) deviate from model expectations
• A standardized Z-score indicates the size of the bias interaction; a value of ±2 or beyond marks statistically significant misfit for an interaction (sketched below)
Examiner 10: significant bias on all three skills.
– For two skills, more lenient than expected.
– For the third skill, the easiest overall of the three, he was more severe.
Examiner 19: significant bias on two skills.
– For skill two, more lenient than expected.
– For the third skill, the easiest overall of the three, he was more severe.
• The misfit for skill three (complexity) is notable because it is different from the other two skills, diagnosis and treatment.
• Examiners are likely less familiar with complexity as a criterion for portfolio assessment, and it is somewhat abstract and ill-defined.
• For these reasons, examiners typically rate portfolios more generously on complexity than they rate the other skills.
• The MFRM therefore expects examiners to rate this skill higher than the others, but Examiners 10 and 19 were more severe for this skill, thereby producing misfit.
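In practice these bias terms and Z-scores come directly from the MFRM software’s bias/interaction report; the sketch below only shows, under assumed column names, where such figures come from in terms of the residuals of the main analysis:

```python
import numpy as np
import pandas as pd

def bias_interactions(obs: pd.DataFrame) -> pd.DataFrame:
    """Approximate examiner-by-skill bias sizes (logits) and standardized Z-scores.

    obs: one row per rating, with columns examiner, skill, observed, expected,
    variance (illustrative layout), taken from the main analysis.
    """
    obs = obs.assign(resid=obs["observed"] - obs["expected"])
    grouped = obs.groupby(["examiner", "skill"])
    bias = grouped["resid"].sum() / grouped["variance"].sum()  # bias size in logits
    se = 1.0 / np.sqrt(grouped["variance"].sum())              # its model standard error
    return pd.DataFrame({"bias": bias, "z": bias / se})        # |z| >= 2 flags significant bias
```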
10. Retain or Discard the Data?
Data that does not fit the model “threatens the validity of the measurement process” (Linacre, 1989/1994, p. 9).
Rating data is typically precious:
• Individual exams: 11 examiners rate the candidates
• Each candidate has 135 ratings from three examiners
• Removing one examiner’s ratings has substantive implications for candidate scores, because the examiner’s ratings are removed from every candidate that examiner assessed
For developing the benchmark scale, however, not removing the misfitting data has different implications.
Benchmark: the measurement frame of reference that is the basis of future pass/fail decisions
• With the misfitting raters left in the data:
– Skill one becomes easier by more than one SEM
– Skill two becomes more difficult by more than one SEM
– Cases one and two become more difficult by one SEM
– Cases five and six become easier by one SEM
11. Change in Examiner Severity Calibrations with Misfitting Examiners Removed
Examiner  Severity with all examiners  Severity with misfitting examiners removed  Model standard error
11 5.62 5.65 .05
12 5.36 5.36 .05
13 5.02 5.00 .07
14 4.66 4.54 .05
15 4.55 4.51 .05
16 5.96 6.02 .04
17 3.48 3.29 .07
18 5.66 5.65 .04
20 5.13 4.98 .22
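A quick way to screen the table above is to flag examiners whose own severity calibration shifted by more than one model standard error when the misfitting examiners were removed. A sketch using the tabled values:

```python
import pandas as pd

# Severity calibrations (logits) and model standard errors from the table above.
calib = pd.DataFrame(
    {
        "examiner": [11, 12, 13, 14, 15, 16, 17, 18, 20],
        "with_all": [5.62, 5.36, 5.02, 4.66, 4.55, 5.96, 3.48, 5.66, 5.13],
        "removed":  [5.65, 5.36, 5.00, 4.54, 4.51, 6.02, 3.29, 5.65, 4.98],
        "se":       [0.05, 0.05, 0.07, 0.05, 0.05, 0.04, 0.07, 0.04, 0.22],
    }
)
calib["shift_in_se"] = (calib["removed"] - calib["with_all"]).abs() / calib["se"]
print(calib[calib["shift_in_se"] > 1])  # examiners whose severity moved by more than one SE
```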
12. Decision: Remove Misfitting Examiners
• Testing practitioners running small-sample performance assessments may in general be loath to discard hard-earned rating data.
• In these circumstances, the decision is made easier by the limited consequences and by the implications for future measurement against the benchmark scale.
• The calibrations for the benchmark scale will be based on data from 97 candidates, 65 of whom received the full 135 ratings (45 from each of three examiners).