The document provides an overview of vertically moderated standard setting (VMSS). It describes VMSS as a process that aligns performance levels and standards across multiple grade levels. This helps to smooth out differences in cut scores that often occur between grades. The document outlines some approaches to VMSS, including setting equal percentages of students at proficiency levels across grades or anchoring the lowest and highest grades and adjusting intermediate grades. It also discusses assumptions about student growth over time that underlie VMSS. Limitations are noted around lack of historical data and growth models. Core components and areas for future research are identified.
2. Ch-13
Scheduling Standard Setting Activities
Chapter Goals:
The authors suggest methods for scheduling standard setting in two types of
assessment programs, drawing primarily on their experience with large-scale
credentialing programs and educational assessments, and provide examples of
each type of standard-setting activity.
1. Scheduling standard setting for educational assessments
2. Scheduling standard setting for credentialing programs
3. Scheduling standard setting for educational
assessment
Table 13-1 (pp. 219-221) provides an overview of the main activities to be
completed, along with a timetable for their completion.
A generic version of the table can also be found at
www.sagepub.com/cizek/schedule
This table shows the planning for standard setting beginning two years
before the actual standard setting session.
4. 1. Overall Plan
Establish performance level labels (PLLs) and performance level descriptions (PLDs)
Drafting a standard setting plan before item writing begins is one way to make sure the
test supports the standard-setting activity that is eventually carried out.
Table 13-1 shows a field test exactly one year prior to the first operational administration
of the test. During the first year, a regular testing window would be reserved for field
testing.
The planning should specify: a) a method, b) an agenda, c) training procedures and d)
analysis procedures.
Technical advisory committee (TAC).
Stakeholder review
5. 2. Participants
Identify and recruit the individuals who will participate in the standard setting activity (i.e., the panelists).
For statewide assessments, it is preferable that the panelists be as representative of the state as possible.
Table 13-1 shows the process of identifying these individuals about nine months before standard setting
begins.
Creation of the standard-setting panels is a three-step process
1. Local superintendents or their designees identify potential panelists in accordance with specifications
provided by the state education agency.
2. Candidates are notified before their names are submitted, via an initial letter sent to all
candidates.
3. State agency staff sort the nominations to create the required number of panels, each with the
approved number of panelists.
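The recruitment timeline described in the chapter (identify panelists about nine months out, invitation letters about five months out, final confirmations six weeks out) can be computed backward from the standard-setting date. A minimal sketch; the function name is illustrative, and months are approximated as 30 days:

```python
from datetime import date, timedelta

def recruitment_milestones(standard_setting_date: date) -> dict:
    """Work backward from the standard-setting date to the approximate
    recruitment milestones described in the chapter. Months are
    approximated as 30-day periods."""
    return {
        "identify potential panelists": standard_setting_date - timedelta(days=9 * 30),
        "send invitation letters": standard_setting_date - timedelta(days=5 * 30),
        "send final confirmation letters": standard_setting_date - timedelta(weeks=6),
    }

# Hypothetical standard-setting date:
for task, due in recruitment_milestones(date(2025, 7, 14)).items():
    print(f"{due.isoformat()}  {task}")
```

This kind of backward schedule is easy to regenerate when the standard-setting date shifts, which the chapter notes is common in practice.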
6. 3. Materials
Training materials, forms and data analysis programs
The timing of preparing these materials is crucial
Some can be prepared in advance and some cannot (refer to Tables 13-2 and 13-3).
Final Preparations: everyone involved needs to be thoroughly prepared; all presentations
should be scripted and rehearsed, all rating forms should be double-checked, and all
participant materials should be produced, duplicated, collated, and assembled into
convenient sets.
As a final part of the preparation, the entire standard-setting staff should conduct a dress
rehearsal, making sure that the timing of presentations is consistent with the agenda, that all
forms are correct and usable, and that the flow of events is logical.
7. 4. At the standard setting site and following up
The lead facilitator attends to matters related to conduct of the sessions
Logistics coordinator attends to everything else
Once panelists complete their tasks and turn in their materials, data entry staff take over;
the next morning, the data analysis staff continue the process.
All data entry should be verified by a second person before data analysis begins.
The state education agency responsible for the standard setting should have arranged
time on the agenda of the state board of education as soon as possible after standard
setting in order to have cut scores approved.
Once cut scores are adopted by the board, it is possible to include them in the score
reporting programs and produce score reports.
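The rule that all data entry be verified by a second person before analysis begins can be sketched as a double-keying check. The data layout below (a dict keyed by panelist and item) is a hypothetical illustration, not a format from the chapter:

```python
def verify_double_entry(first_keying, second_keying):
    """Compare two independently keyed copies of the panelists'
    ratings and return the keys where they disagree. Each argument
    is a dict mapping (panelist_id, item_id) -> rating."""
    all_keys = set(first_keying) | set(second_keying)
    return sorted(
        k for k in all_keys
        if first_keying.get(k) != second_keying.get(k)
    )

# Example: one transcription error at panelist 2, item "B".
entry_a = {(1, "A"): 3, (1, "B"): 2, (2, "A"): 4, (2, "B"): 3}
entry_b = {(1, "A"): 3, (1, "B"): 2, (2, "A"): 4, (2, "B"): 5}
print(verify_double_entry(entry_a, entry_b))  # → [(2, 'B')]
```

Any mismatches would be resolved against the paper rating forms before data analysis begins.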
8. Scheduling standard setting for
credentialing programs
Scheduling standard setting for credentialing programs is different from
scheduling it for educational assessment programs. Educational assessment
programs are bound to specific times of the academic year, and tests are
typically given in the spring or fall.
Credentialing programs are not bound by these constraints and have more
flexibility: computer-adaptive testing (CAT) or computer-based testing (CBT)
may permit test administration on any day of the year.
Table 13-4 provides an overview of the major tasks for a credentialing
testing program.
9. Small group activity
In groups of three, review pages 237-245 and post the key components of
scheduling standard setting for credentialing programs, focusing on how it
differs from scheduling standard setting for educational assessments.
Use this website to post your thoughts
http://padlet.com/wall/4qxyguqgnd
10. Recommendations
Planning for standard setting needs to be made an integral part of planning for
test development.
Plans of the standard setting facilitators should be reviewed by test
development staff, and vice versa.
One person with authority over both item developers and standard setters
should have informed oversight over both activities.
Pay particular attention to scoring, especially with open-ended or
constructed-response items.
Finally, test planning, test development, and standard setting are interlinked
parts of a single enterprise.
11. Ch-14
Vertically-Moderated Standard Setting
Chapter Goals:
Describe:
(1) the general concept of VMSS
(2) specific approaches to conduct VMSS
(3) a specific application of VMSS
Provide:
(1) suggestions for a current assessment system and a need for additional
research
12. Linking Test Scores across grades within the
Norm Referenced Testing (NRT) context
Review from Ch-6 (Ryan & Shepard)
Construct of linking: refers to several types of statistical methods that
establish a relationship between the score scales from two tests, so that
results from one test can be compared with results from the other.
Test Score Equating- Used to measure year to year changes over time for
different students in the same grade
Vertical Equating- linking test scores vertically across grade levels and
schooling levels. The tests that are to be linked need to measure the same
construct.
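As a concrete illustration of test score equating, here is a minimal sketch of classical linear equating, a standard method consistent with the description above (the chapter does not prescribe a specific procedure, and the score samples are invented): Form X scores are mapped onto the Form Y scale by matching means and standard deviations.

```python
from statistics import mean, stdev

def linear_equate(x_scores, y_scores):
    """Return a function mapping a Form X raw score onto the Form Y
    scale by matching means and standard deviations:
        y = mu_y + (sigma_y / sigma_x) * (x - mu_x)
    This is classical linear equating under a random-groups design."""
    mu_x, sd_x = mean(x_scores), stdev(x_scores)
    mu_y, sd_y = mean(y_scores), stdev(y_scores)
    return lambda x: mu_y + (sd_y / sd_x) * (x - mu_x)

# Hypothetical score samples from two forms:
form_x = [40, 45, 50, 55, 60]
form_y = [42, 48, 54, 60, 66]
to_y = linear_equate(form_x, form_y)
print(to_y(50))  # the Form X mean (50) maps to the Form Y mean (54.0)
```

Vertical equating applies the same linking idea across grade levels rather than across parallel forms, which is why it requires overlapping items in adjacent test levels.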
13. Interrelated Challenges within the Standards-
Referenced Testing (SRT) context
NCLB requirements for tracking cohort growth & achievement gaps
These newer assessments apply standards-referenced testing (SRT)
Linking test performance standards from two or more grade levels (adjacent and
not adjacent)
The construct measured may be different
The sheer number of performance levels that NCLB requires
The wide test span and developmental range
The panels of educators who participate in standard setting
14. A New Method that Links Standards Across Tests
To address these challenges, there is a need to develop and implement
standard-setting methods that set performance levels across all affected grade
levels, with some method for smoothing out differences between grades.
Suggested approach—VMSS—Vertically Moderated Standard Setting
15. History of VMSS
Introduced by Lissitz & Huynh (2003b)
AYP is based on the percentage of students who meet Proficient and the
expected percentage increases over time.
The purpose of VMSS: arriving at a set of cross-grade standards that
realistically tracks student growth over time and provides a reasonable
expectation of growth from one grade to the next.
The critical issue was defining reasonable expectations; vertical scaling would
generally not produce a satisfactory set of expectations for grade-to-grade
growth.
As an alternative to vertical scaling or equating, Lissitz and Huynh (2003b)
suggested VMSS.
16. What is VMSS?
A process of vertical articulation of standards: aligning scores, scales or
proficiency levels.
Is a procedure or set of procedures, typically carried out after individual
standards have been set that seeks to smooth out the bumps that inevitably
occur across grades.
Reasonable expectations are stated in terms of percentages of students at
or above a consequential performance level, such as Proficient.
Let's discuss the hypothetical scenario using the table on the next slide
(p. 255 in your book).
17. What is VMSS?
Grade   % of Students At or Above Proficient   Difference
3       37
4       41                                     +4%
5       34                                     -7%
6       43                                     +9%
7       29                                     -14%
8       42                                     +13%
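The sawtooth pattern in this hypothetical example can be flagged programmatically. A minimal Python sketch (the function name and data layout are illustrative, not from the chapter) that computes the grade-to-grade differences in the percentage of students at or above Proficient:

```python
def grade_to_grade_differences(pct_proficient):
    """Given a dict mapping grade -> percent of students at or above
    Proficient, return the difference between each grade and the one
    before it. Sign changes signal the 'bumps' VMSS smooths out."""
    grades = sorted(pct_proficient)
    return {
        g: pct_proficient[g] - pct_proficient[prev]
        for prev, g in zip(grades, grades[1:])
    }

# The hypothetical percentages from the chapter's example:
observed = {3: 37, 4: 41, 5: 34, 6: 43, 7: 29, 8: 42}
print(grade_to_grade_differences(observed))
# → {4: 4, 5: -7, 6: 9, 7: -14, 8: 13}
```

The alternating signs (+4, -7, +9, -14, +13) are exactly the kind of implausible grade-to-grade pattern that motivates vertical moderation.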
18. Approaches to VMSS
Focuses on percentages of students at various proficiency levels
Is based on assumptions about growth in achievement over time
Problem: different percentages of students reach a given performance level,
such as Proficient, at different grades.
Solution:
1. Set all standards, by fiat, such that equal percentages of students would be
classified as Proficient at each grade level.
2. Set standards only for the lowest and highest grades, and then align the
percentages of Proficient students in the intermediate grades accordingly.
19. Approaches to VMSS
Grade   % of Students At or Above Proficient
3       37
4       38
5       39
6       40
7       41
8       42
[Chart: percentage of students at or above Proficient by grade, after a linear trend is imposed]
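The second approach (anchoring the lowest and highest grades, then aligning the intermediate grades) can be sketched in Python. Everything here is illustrative: the function names are invented, and the score distribution used to demonstrate cut-score selection is hypothetical.

```python
def interpolate_targets(low_grade, low_pct, high_grade, high_pct):
    """Linearly interpolate target percent-Proficient values for the
    grades between the two anchor grades (inclusive)."""
    span = high_grade - low_grade
    return {
        g: low_pct + (high_pct - low_pct) * (g - low_grade) / span
        for g in range(low_grade, high_grade + 1)
    }

def cut_score_for_target(scores, target_pct):
    """Pick the cut score whose observed percent of examinees at or
    above it is closest to the target percentage."""
    n = len(scores)
    def pct_at_or_above(cut):
        return 100 * sum(s >= cut for s in scores) / n
    return min(set(scores), key=lambda c: abs(pct_at_or_above(c) - target_pct))

# Anchors from the chapter's example: Grade 3 at 37%, Grade 8 at 42%.
targets = interpolate_targets(3, 37, 8, 42)
print(targets)  # → {3: 37.0, 4: 38.0, 5: 39.0, 6: 40.0, 7: 41.0, 8: 42.0}
```

In practice the intermediate-grade cut scores would be chosen from each grade's actual score distribution so that the resulting percentages fall on (or near) this straight line.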
20. Assumptions re: growth over time
Lewis & Haug (2005): the percentage of students classified as at or above
Proficient would be expected to be:
1. Equal across grades or subjects
2. Approximately equal
3. Smoothly decreasing
4. Smoothly increasing
Ferrara, Johnson & Chen (2005): assumptions for standard setting are based on
the intersection of three growth models:
1. Linear growth
2. Remediation
3. Acceleration
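The three growth models can be illustrated as transformations of a score distribution. The gain schedule below is purely an assumption made for illustration; the chapter describes the models conceptually, not as formulas.

```python
def apply_growth_model(scores, model, base_gain=5.0):
    """Illustrate three growth models as transformations of a score
    distribution (the gain schedule is an illustrative assumption):
      - linear: every examinee gains the same fixed amount;
      - remediation: lower-scoring examinees gain more;
      - acceleration: higher-scoring examinees gain more."""
    ranked = sorted(scores)
    n = len(ranked)
    grown = []
    for i, s in enumerate(ranked):
        frac = i / (n - 1) if n > 1 else 0.0  # 0 = lowest, 1 = highest
        if model == "linear":
            gain = base_gain
        elif model == "remediation":
            gain = base_gain * (2 - 2 * frac)  # largest gain at the bottom
        elif model == "acceleration":
            gain = base_gain * (2 * frac)      # largest gain at the top
        else:
            raise ValueError(f"unknown model: {model}")
        grown.append(s + gain)
    return grown

scores = [10, 20, 30, 40, 50]
print(apply_growth_model(scores, "linear"))       # → [15.0, 25.0, 35.0, 45.0, 55.0]
print(apply_growth_model(scores, "remediation"))  # → [20.0, 27.5, 35.0, 42.5, 50.0]
```

Note how linear growth preserves relative positions, while remediation compresses the distribution from below and acceleration stretches it from above.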
21. Alternative procedures
Due to VMSS being a relatively new procedure, it is difficult to pinpoint
limitations and alternative procedures
There have been few thoroughly documented applications of VMSS
Each application has been slightly different from the others
Authors have suggested a common core of elements to VMSS
However, no fixed set of steps has emerged in applications of VMSS so far
Every aspect of any application might be thought of as an alternative procedure
22. Core components of VMSS future applications
1. Grounding in historical data (Lewis & Haug, 2005; Buckendahl et al., 2005)
2. Establishment of performance models
3. Consideration of historical data
4. Cross-grade examination of test content and student performance
5. Polling of participants
6. Follow up review and adjustment
23. Limitations
If the focus of VMSS is the percentages of students at or above a particular
proficiency level, a lack of historical perspective or context is not only
limiting but debilitating.
Any application of VMSS is hampered if it is not supported by a theoretically
or empirically sound model of achievement growth.
Maintaining the meaning of cut scores and fidelity to the PLDs is one of the
most fundamental areas for future research.
Research and development in this area is a growth industry.
Editor's Notes
The version may be easily adapted. This schedule assumes a new testing program. Part of the planning process is establishing the number and nature of the performance levels to be set. It is necessary to bring some precision to the performance level labels (PLLs) and performance level descriptions (PLDs). If these are established by state law or board action, then some of the work has already been done.
PLLs and PLDs: establishing these at the beginning makes it possible to ensure that there are test items that will support these levels. TAC: many assessment programs employ a panel of nationally recognized assessment experts to advise them on technical issues related to those programs. Stakeholder review: stakeholders are individuals or groups with a particular interest in the testing program, such as community members and elected or appointed officials. It is a good idea to know early in the process who these stakeholders are and to obtain their input as early as possible. One very special stakeholder group is the policy board that will actually make the decision to adopt, modify, or reject the cut scores. For licensure and certification testing programs, the policy entity is usually the professional association or credentialing board. For statewide assessments, the policy board is usually the state board of education.
As the overall standard setting plan is being reviewed by various advisory committees and stakeholder groups, the next phase of the plan begins. Identifying potential panelists for a statewide assessment program usually involves working with local officials, typically the local superintendent. Notification letter: should include a form on which the candidate can indicate interest in and availability for a standard-setting meeting on specific dates in a specific city. After the sorting is done, the agency sends a follow-up letter notifying all candidates of their selection or non-selection. The invitation letter is sent out about five months prior to standard setting in order to allow panelists time to schedule it in their calendars. Six weeks prior to the event, a final letter to all panelists should confirm their participation and provide the location and driving directions, a reminder of the purpose of the meeting, and contact phone numbers in case of emergency. Once rooms are confirmed, the sponsoring agency may send a housing confirmation to each panelist. One person should be designated as the lead facilitator, who is responsible for training and other matters. A different person should be designated as the logistics coordinator, who is responsible for anything related to hotel guest rooms, meeting rooms, catering, copying, etc.
Generic-printed materials, visuals, scripts for training etc…
Ability to see potential problems, conflicts or disconnects.
Ch-6 by Ryan & Shepard discusses applications of test linking, the diversity in state testing programs, and challenges to linking tests in accountability systems. Construct of linking: refers to several types of statistical methods used to establish a relationship between the score scales from two tests, so that results from one test can be compared to results on another test. To ensure comparability of tests from one test administration to the next, large-scale assessment programs use a process of test score equating; without this process it would be impossible to measure changes in achievement over time. Test score equating: statistical procedures by which scores on two different tests are related. This is possible when test forms are built to the same specifications and test content, difficulty, reliability, format, purpose, administration, and population are equivalent. It answers the question, "Have this year's 6th graders performed better in reading than last year's 6th graders?" Vertical equating and linking of test scores is successful when test design and item selection within and across grade levels are managed carefully, so that sufficient overlap of items in adjacent test levels enables stable links, as in norm-referenced tests and individual intelligence and achievement tests.
NCLB-era SRT (standards-referenced testing) requires tests built to content specifications that are narrower and tightly matched to specific within-grade content standards, which often do not have considerable across-grade overlap. Therefore, the content standards upon which SRTs are based can militate against the construction of traditional cross-grade scales; vertically linking SRTs requires strong assumptions about the equivalence of the constructs being assessed at different levels. Interrelated challenges: (a) the constructs measured may be different; a continuous developmental construct across grade levels must be empirically determined or theoretically assumed. (b) The sheer number of performance levels that NCLB requires: two levels representing higher achievement (Proficient and Advanced) plus a lower (Basic) level. These multiple levels are compounded by the requirement of performance standards in reading and math in grades 3-8 plus one secondary grade, and in science at three grade levels. (c) The tests span such a wide grade and developmental range.
Introduced by Lissitz and Huynh (2003b) in a background paper prepared for the Arkansas Department of Education, where they spelled out the problem of determining AYP and proposed a solution: VMSS. First, Lissitz and Huynh (2003b) tried to define reasonable expectations using a vertical scaling/equating method. They concluded that vertical scaling would generally not produce a satisfactory set of expectations for grade-to-grade growth. They recommended that new cut scores for each test be set for all grades such that: each achievement level has the same generic meaning across all grades, and the proportion of students in each achievement level follows a growth-curve trend across these grades.
VMSS may be used when there is a need to establish meaningful progressions of standards across levels or to enable reasonable predictions of student classifications over time when traditional vertical equating is not possible.
Let's look at the 4th or 6th graders. If we believed that the groups of students on whom these results were based were typical, we would expect similar results next year with a new group of students. We need to point out that these currently Proficient students would have only about a 75% chance of scoring at the Proficient level the next year. The standards have been set so that 5th and 7th graders have a lower probability of scoring at the Proficient level than do 4th and 6th graders. This means that many Proficient 4th and 6th graders are going to lose ground in the subsequent grades (17% and 33%). To remedy this situation, VMSS requires a reexamination of the cut scores and percentages in light of historical or other corollary information available at the time of standard setting, making adjustments to the cut scores so that we have a reasonable expectation for what should happen next year.
In our last example, we would take the 37% figure for Grade 3 and the 42% figure for Grade 8 and set cut scores for Grades 4-7 so that the resulting percentages of students at or above Proficient would fall on a straight line between 37% and 42%. A linear trend has been imposed on the intervening grade levels to obtain cut scores for those grades. In all cases, VMSS is based on assumptions about growth in achievement over time.
Linear growth: assumes that the proficiency of all examinees increases by a fixed amount and that examinees retain their positions relative to one another. Remediation: assumes the proficiency of examinees at the lower end of the score distribution increases more than that of examinees at the upper end. Acceleration: assumes the proficiency of examinees in the upper portion of the score distribution increases at a greater rate than that of examinees at the lower end of the score distribution.
From the example illustrating a VMSS process implemented for the English Language Development Assessment (ELDA), suggestions for a common core of elements of VMSS include the following. Grounding in historical data: collect and use historical performance data to prepare for and interpret the results of standard setting; collection of these data and planning for their use may include discussions with stakeholders and content experts in advance of standard setting. Establishment of performance models: these should be based on the historical evidence; if this evidence is not available, models should rely on theories of cognitive development, discussions with content experts and stakeholders, or generalization from other tests. Consideration of historical data: when available, these data should be presented to those involved in setting standards, including the participants who work through the multiple rounds of a standard-setting procedure and cross-grade or cross-subject articulation. Cross-grade examination: include some degree of cross-grade review by standard-setting participants; where possible, an all-grade review should be included in a full-scale VMSS for at least one round, either the final round or at some point just prior to it. Polling of participants: two studies of VMSS included the collection of data from participants at the end of the standard-setting activity; this is important not only as validity evidence for the standard-setting activity but also for future standard-setting activities. Follow-up review and adjustment: these follow-ups are important for two reasons: (1) elected or appointed state officials are responsible for the successful implementation of the performance standards, and (2) even with the best intentions and earnest application of standard-setting techniques, participants may still hold fairly disparate notions with regard to where cut scores should be set.