Considerations in
Developing a Benchmark
Scale for Many-Facet
Rasch Analysis
Ross M. Brown and Mary E. Lunz
Measurement Research Associates, Inc., Chicago, Illinois
This case study paper describes:
1) a way to create a sufficiently precise benchmark scale for small-sample performance assessment programs,
2) methods to investigate misfitting facet elements that may affect the integrity of the benchmark, and
3) the considerations involved in deciding whether to eliminate examiner ratings that do not meet the requirements of the Rasch model.
Background: Performance Assessments and MFRM
• The many-facet Rasch model (MFRM) is used to analyze testing situations with additional facets, such as examiners
• Examples: oral exams, portfolio assessments
• Common exam elements can be used to equate future performance assessments once their severity/difficulty is calibrated on a benchmark scale
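For context, the many-facet Rasch model behind these analyses can be written in its usual log-odds form. The formulation below is the standard one (after Linacre); the symbols are chosen to match this exam's facets and are illustrative rather than quoted from the paper:

$$\log\left(\frac{P_{ncsjk}}{P_{ncsj(k-1)}}\right) = B_n - D_c - A_s - C_j - F_k$$

where $B_n$ is the ability of candidate $n$, $D_c$ the difficulty of case $c$, $A_s$ the difficulty of skill $s$, $C_j$ the severity of examiner $j$, and $F_k$ the difficulty of being rated in category $k$ rather than $k-1$. Calibrating the case, skill, and examiner parameters on a common benchmark scale is what allows future administrations to be equated.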
Pooling Exam Administrations
• Portfolio exam
• 15 cases in seven defined areas of practice
• Each case receives ratings on three skills
• Individual administrations have nine to 24 candidates
• Increasing the number of candidates beyond 40 has been shown to significantly improve the precision of facet calibrations
• Ratings from 97 candidates across six exam administrations were pooled to construct the benchmark
• After pooling the ratings, facet calibration error was reduced compared to the individual administrations
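As a rough intuition for why pooling tightens the calibrations (a back-of-the-envelope illustration only, not the authors' calibration procedure, which came from the full MFRM run): a calibration's standard error shrinks roughly with the square root of the number of ratings that inform it.

```python
import math

def approx_calibration_se(n_ratings: int, scale: float = 1.0) -> float:
    """Back-of-the-envelope: standard error shrinks with the square root
    of the number of ratings informing a facet element's calibration."""
    return scale / math.sqrt(n_ratings)

# Each candidate contributes roughly 45 ratings to a skill's calibration
# (15 cases x 3 examiners), so pooling 97 candidates instead of ~20
# cuts the standard error by about a factor of sqrt(97 / 20):
single = approx_calibration_se(20 * 45)
pooled = approx_calibration_se(97 * 45)
print(round(single / pooled, 2))  # ~2.2
```

The table on the next slide shows the actual standard errors from the Spring 2006 administration and from the pooled analysis.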
Standard Error

Exam Facet Element    Spring 2006 Administration    Pooled Analysis
Examiners
  12                  .21                           .04
  13                  .15                           .06
  14                  .19                           .05
  15                  .18                           .05
  16                  .15                           .04
  18                  .15                           .04
  19                  .17                           .06
  20                  .25                           .20
Skills
  1                   .11                           .03
  2                   .11                           .03
  3                   .10                           .03
Cases
  1                   .24                           .06
  2                   .24                           .06
  3                   .24                           .06
  4                   .11                           .03
  5                   .17                           .04
  6                   .17                           .04
  7                   .14                           .04
Fit of the Data to the Model
Before interpretation, the results of a Rasch analysis must be examined for fit: the amount of error variance relative to what the model predicts. Data with sufficient fit are evidence that the model's interval measures are appropriately derived.
Case type and skill calibrations all showed sufficient fit.*
Two examiners exceeded the high-end fit criteria of 1.2 to 1.7: Examiner 10 had an outfit mean square of 2.45; Examiner 19, 1.8.
*One case type was outside the conservative range of 0.80 to 1.2 (outfit mean square: 1.25).
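For readers less familiar with Rasch fit statistics: the outfit mean square cited here is the unweighted mean of squared standardized residuals between observed ratings and the model's expected ratings, with values near 1.0 indicating good fit. A minimal sketch of the calculation (names are illustrative, not taken from any particular package):

```python
import numpy as np

def outfit_mean_square(observed, expected, variance):
    """Unweighted (outfit) mean square fit statistic.

    observed: the ratings an examiner actually gave
    expected: the model-expected rating for each observation
    variance: the model variance of each observation

    Values near 1.0 mean the ratings are about as unpredictable as the
    model expects; values well above 1.0 (e.g. 2.45) mean many surprising
    ratings; values well below 1.0 mean overly predictable ratings.
    """
    obs = np.asarray(observed, dtype=float)
    exp = np.asarray(expected, dtype=float)
    var = np.asarray(variance, dtype=float)
    return float(((obs - exp) ** 2 / var).mean())
```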
Examiner Misfit: Rater Effects
• Halo effect: the rater forms a holistic assessment of the candidate and therefore assigns similar ratings to all skills and case types for that candidate
• When skill or case difficulties vary, raters exhibiting halo will have infit and outfit mean square indices significantly greater than 1.
Do skill and case difficulties vary?
Skills: fixed (all same) chi-square = 611.3, d.f. = 2, significance = .00. Separation = 14.34: the spread of the skill difficulty measures is about 14 times larger than the precision of those measures.
Cases: fixed (all same) chi-square = 56.6, d.f. = 6, significance = .00. Separation = 2.82: the spread of the case difficulty measures is about three times larger than the precision of those measures.
Skill and case difficulties do vary. How often did these raters assign the same rating to every skill of a case?
– Examiner 10: 17%
– Examiner 19: 35%
– All other examiners combined: 47%
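The halo check summarized above, and tabulated on the next slide, is straightforward to compute: for each examiner, count the share of rated cases in which all three skills received an identical rating. A hypothetical sketch, assuming the ratings are available as (examiner, candidate, case, list-of-skill-ratings) records:

```python
from collections import defaultdict

def percent_cases_all_skills_same(ratings):
    """For each examiner, the percent of rated cases in which every skill
    received the same rating (a rough behavioral indicator of halo)."""
    same = defaultdict(int)
    total = defaultdict(int)
    for examiner, candidate, case, skill_ratings in ratings:
        total[examiner] += 1
        if len(set(skill_ratings)) == 1:
            same[examiner] += 1
    return {ex: 100.0 * same[ex] / total[ex] for ex in total}
```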
Percent of Cases Rated with All Skills Assigned the Same Rating

Examiner    Percent of Cases
10          17
11          72
12          46
13          37
14          34
15          67
16          26
17          49
18          57
19          35
20          47

For all examiners except 10 and 19: 47%
Rater Effects: Randomness
• Randomness: a tendency to assign ratings randomly, without apparent consideration for the merit of the portfolio submission
• Detecting randomness: anchor all ratees at the same level of performance (i.e., 0) and then run the analysis; a sketch of this check follows the table below
• Raters who show the best fit to this anchored model are likely exhibiting randomness (i.e., their ratings are not related to the level of performance of the ratee)
Examiner    Outfit Mean Square
10          2.46
11           .57
12           .89
13           .90
14           .83
15          1.23
16           .76
17          1.63
18           .94
19          1.92
20           .73
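The anchoring check described above is normally run inside MFRM software, but the idea can be sketched directly: fix every candidate at the same ability, compute the (now constant) model-expected rating and its variance, and recompute each examiner's outfit against that null expectation. Examiners whose ratings fit this "all candidates equal" model well are giving ratings that carry little information about candidate performance. The sketch below assumes a rating scale model with only an examiner severity facet, a simplification of the actual exam design:

```python
import numpy as np

def category_probs(theta, severity, thresholds):
    """Rating scale model probabilities for categories 0..m, given a
    candidate ability (theta), an examiner severity, and the m rating
    scale thresholds. Other facets are omitted for simplicity."""
    steps = theta - severity - np.asarray(thresholds, dtype=float)
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    p = np.exp(logits - logits.max())
    return p / p.sum()

def anchored_outfit(ratings, severity, thresholds, anchored_theta=0.0):
    """Outfit mean square for one examiner's ratings when all candidates
    are anchored at the same ability. Unusually low values mean the
    ratings fit the 'everyone is equally able' model, a signature of
    randomness (ratings unrelated to candidate merit)."""
    probs = category_probs(anchored_theta, severity, thresholds)
    cats = np.arange(len(probs))
    expected = float((cats * probs).sum())
    variance = float((cats ** 2 * probs).sum() - expected ** 2)
    x = np.asarray(ratings, dtype=float)
    return float(((x - expected) ** 2 / variance).mean())
```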
Bias Interaction
• Bias interaction analysis: the degree to which ratings produced for a particular interaction (here, an examiner-by-skill combination) deviate from model expectations.
• The standardized Z-score indicates the size of the bias interaction; a Z beyond ±2 indicates statistically significant misfit for an interaction.
Examiner 10: significant bias on all three skills.
– On two skills, more lenient than expected.
– On the third skill, the easiest of the three overall, he was more severe.
Examiner 19: significant bias on two skills.
– On skill two, more lenient than expected.
– On skill three, the easiest of the three overall, he was more severe.
• Misfit for skill three (complexity) is notable because it is different from the other two skills, diagnosis and treatment.
• Examiners are likely less familiar with complexity as a criterion for portfolio assessment, and it is somewhat abstract and ill-defined.
• For these reasons, examiners typically rate portfolios more generously on complexity than on the other skills.
• The MFRM therefore expects examiners to rate this skill higher than the others, but Examiners 10 and 19 were more severe on this skill, thereby producing misfit.
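The bias interaction statistic can be sketched as follows: collect the residuals for one examiner-by-skill combination, convert their sum into a bias estimate in logits, and divide by its standard error to obtain the standardized Z-score. This follows the usual Facets-style calculation in outline; treat it as an approximation rather than the exact computation behind the reported results:

```python
import numpy as np

def bias_interaction(observed, expected, variance):
    """Bias for one examiner-by-skill cell.

    observed, expected, variance: arrays over all ratings in the cell
    (observed ratings, model-expected ratings, model variances).
    Returns (bias_in_logits, z_score). Positive bias means the examiner
    rated higher (more leniently) than the model expected; |z| > 2 is
    flagged as statistically significant, as on the slide above.
    """
    obs = np.asarray(observed, dtype=float)
    exp = np.asarray(expected, dtype=float)
    var = np.asarray(variance, dtype=float)
    bias = (obs - exp).sum() / var.sum()   # logit-scale bias estimate
    se = 1.0 / np.sqrt(var.sum())          # standard error of that estimate
    return float(bias), float(bias / se)
```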
Retain or Discard the Data?
Data that does not fit the model "threatens the validity of the measurement process" (Linacre, 1989/1994, p. 9).
Rating data is typically precious:
• Individual exams: 11 examiners rate the candidates
• Each candidate has 135 ratings from three examiners
• Removing one examiner's ratings has substantive implications for candidate scores, because that examiner's ratings are removed from every candidate he or she assessed.
For developing the benchmark scale, the consequences of not removing the misfitting data are different.
Benchmark: the measurement frame of reference that is the basis of future pass/fail decisions.
With the misfitting raters retained in the data:
• Skill one becomes easier by more than one SEM
• Skill two becomes more difficult by more than one SEM
• Cases one and two become more difficult by one SEM
• Cases five and six become easier by one SEM
Change in Examiner Severity Calibrations with Misfitting Examiners Removed

            Severity              Severity (Misfitting    Model
Examiner    (All Examiners)       Examiners Removed)      Standard Error
11          5.62                  5.65                    .05
12          5.36                  5.36                    .05
13          5.02                  5.00                    .07
14          4.66                  4.54                    .05
15          4.55                  4.51                    .05
16          5.96                  6.02                    .04
17          3.48                  3.29                    .07
18          5.66                  5.65                    .04
20          5.13                  4.98                    .22
Decision: Remove Misfitting Examiners
• Testing practitioners running small-sample performance assessments may, in general, be loath to discard hard-earned rating data.
• In these circumstances, the decision is made easier by the limited consequences and by the implications for future measurement against the benchmark scale.
• The calibrations for the benchmark scale will be based on data from 97 candidates, 65 of whom received the full 135 ratings (45 from each of three examiners).