This document discusses criterion-referenced language testing and compares it to norm-referenced testing. It defines NRTs as tests that compare students' performances to others, while CRTs provide absolute measures of competence without comparing to other students. CRTs were developed in response to problems with NRTs like teaching/testing mismatches, lack of instructional sensitivity, and lack of curricular relevance. While NRTs and CRTs share aspects of test construction, CRTs focus more on teaching/testing matches and instructional sensitivity. The document also discusses issues in defining language proficiency and communicative competence, and challenges in developing and analyzing CRTs.
2. In this chapter
1. Definition of norm-referenced tests and criterion-referenced tests
2. The differences and similarities of norm-referenced and criterion-referenced approaches
3. The place of criterion-referenced tests in language testing theory and research
3. NRT definition
Any test that is primarily designed to disperse the performances of students in a
normal distribution based on their general abilities, or proficiencies, for purposes
of categorizing the students into levels or comparing students’ performances to
the performances of the others who formed the normative group.
The interpretation given an examinee’s score is called a relative decision because
it is understood as that examinee’s position relative to the scores of all of the
other examinees who took the test.
4. Criterion-referenced tests
They yield absolute decisions because each examinee's score is meaningful without reference to the scores of
the other examinees.
Glaser (1963): criterion-referenced measures indicate the content of the behavioral repertory, and the
correspondence between what an individual does and the underlying continuum of achievement. These
measures provide information as to the degree of competence attained by a particular student which is
independent of reference to the performance of others.
5. Variations of CRTs
Domain-referenced tests
Hively, Patterson, and Page (1968): DRTs are based on item forms; item forms: the documents which
delineate a domain of student behaviors and content-area material to which test items are then referenced.
Osburn (1968): universe-defined test; any test constructed and administered in a way that an examinee's score
on the test provides an unbiased estimate of his score on some explicitly defined universe of item content.
6. Variations of CRTs
Mastery-referenced tests
Tests that link different mastery decisions to specific instructional contexts with specific instructional
objectives that are not necessarily related to any external domain of knowledge.
The differences among the CRT applications tend to focus on the type of sampling generalizability involved:
whether scores are generalized to a domain, to an instructional set of objectives, to a mastery decision, etc.
7. Variations of CRTs
Objectives-referenced tests
Are constructed so that subsets of the items measure the specific objectives of a course, program of
study, or other clearly delineated subject-matter area.
They can be CRTs if: a) objectives are written to define a domain; b) items are representative samples
of behavior from this domain.
8. Differences and similarities between NRTs and CRTs in educational settings
Why were criterion-referenced tests developed?
CRTs developed in response to problems, or weaknesses, that were perceived in the pervasive norm-referenced testing of the day. The problems with NRTs:
1. Teaching/testing mismatches.
2. Lack of instructional sensitivity.
3. Lack of curricular relevance.
4. Restriction to the normal distribution.
5. Restriction to items that discriminate.
9. Teaching/testing mismatches
In cases where large-scale, standardized examinations are used, material tested in the
examination may not be directly related to the teaching going on at the particular
institution involved. Such mismatches may arise because of the general nature of the
material that is typically tested on an NRT or because the content of the test is not
directly related to the curriculum at the institution.
10. Lack of instructional sensitivity
Because of their general and abstracted nature and putative global applicability across a
variety of instructional settings, NRTs are not suited to measuring the specific learning
points and skills developed in a particular program. As a result, NRTs cannot be
expected to measure the amount of knowledge or skill that a student has within a
well-defined content area. Furthermore, NRTs cannot be expected to be effective for
diagnosis of deficiencies with reference to particular courses or programs.
11. Lack of curricular relevance
Because of they can cause teaching testing mismatches and generally lack sensitivity to
instruction, some educators feel thar NRTs are nor effective for evaluating the effects
of curriculum change on student achievement. Thus, NRTs are nor felt to be
particularly well-suited to assessing the strengths and weaknesses of a given program,
or for suggesting useful areas of instructional amelioration in a specific program, or for
comparing the relative strengths and weaknesses of different language programs.
12. Restriction to the normal distribution
NRTs are designed, statistically analyzed, and revised with the purpose of creating a
normal distribution of scores. Thus, one of the first issues that must be addressed in
analyzing NRT results is the degree to which the distribution of scores is normal.
However, if all of the students know all of the material, a test that reflects that
knowledge would be desirable. On such a test, a normal distribution of scores could nor
reasonably be expected to appear.
13. Restriction to items that discriminate
To create an NRT, items are selected that, on average, about half of the students cannot
answer correctly. Thus, the focus of the test is on content that the students
(or at least 50 percent of them) do not know rather than on what they have learned (as
on a CRT). As a result, test designers sometimes have a tendency to select items simply
because they discriminate well between high-achieving and low-achieving students
rather than because the items are related to the curriculum or anything that the students
are learning.
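The NRT item-selection statistics described above can be sketched with invented data: item facility (the proportion of examinees answering correctly, targeted near 50 percent on an NRT) and a simple upper/lower discrimination index. The function names, the one-third grouping fraction, and all response data below are illustrative assumptions, not drawn from the text.

```python
# Hypothetical item-analysis sketch for NRT development: item facility
# and an upper/lower discrimination index. All data are invented.

def item_facility(responses):
    """Proportion of examinees answering the item correctly (0..1)."""
    return sum(responses) / len(responses)

def discrimination_index(item_scores, total_scores, fraction=1/3):
    """Upper/lower discrimination: IF(upper group) - IF(lower group).

    Examinees are ranked by total test score; the top and bottom
    `fraction` of them form the comparison groups.
    """
    ranked = [item for _, item in
              sorted(zip(total_scores, item_scores), reverse=True)]
    n = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    return item_facility(upper) - item_facility(lower)

# Invented responses for one item (1 = correct) and total test scores.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
totals = [30, 29, 28, 27, 25, 22, 20, 18, 15, 12, 10, 8]

print(round(item_facility(item), 2))                 # 0.42 -- near the 50% target
print(round(discrimination_index(item, totals), 2))  # 0.75 -- item favors high scorers
```

An NRT developer would keep an item like this one (facility near 0.5, high discrimination) even if, as the slide notes, it bears little relation to the curriculum.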
14. Differences and similarities between NRTs and CRTs
CRTs are different
CRTs can be used to circumvent all of the complaints mentioned previously.
CRTs can be expected to have these characteristics:
Emphasis on teaching/testing matches; Focus on instructional sensitivity; Curricular relevance;
Absence of normal distribution restrictions; No item discrimination restriction.
15. Differences and similarities between NRTs and CRTs
Comparisons between NRTs and CRTs
Hudson and Lynch (1984): relative standing and absolute standing
Norm-referenced measurement (NRM): a broad, less descriptive indication of relative standing.
Criterion-referenced measurement (CRM): gaps in coverage, more descriptive information, absolute standing.
16. Differences and similarities between NRTs and CRTs
Fundamental distinctions between CRTs and NRTs
NRTs and CRTs both:
Require specification of the achievement domain to be measured; Require a relevant and representative
sample of test items; Use the same types of test items; Use the same rules for item writing (except for
item difficulty); Are judged by the same qualities of goodness (validity and reliability); Are useful in
educational measurement.
17. Differences and similarities between NRTs and CRTs
Davies' principles
1. CRTs cannot be constructed in a completely separate way from NRTs without "the usual canons of
item discreteness and discrimination".
Counter claim: CRT items are certainly concerned with discrimination, but between very different features
from those targeted in NRT development.
18. Differences and similarities between NRTs and CRTs
Davies' principles (continued)
2. Because teachers are concerned with very small groups of learners, what they require is a criterion-
referenced use of a norm-referenced test, one that "does not discriminate greatly among their students but
which does establish an adequate dichotomy… between plus success and minus success". Further, for
every CRT there must be a population for whom the test could be norm-referenced.
Counter claim: this principle is on the face of it an inverse statement of what goes on in classrooms.
19. Differences and similarities between NRTs and CRTs
Davies' principles (continued)
3. Criterion-referencing is linked to exercises while norm-referencing is linked to tests.
Counter claim: this proposed definition of "test" and "exercise" is not found elsewhere and is only argued
by Davies himself.
20. The place of CRTs in language testing theory and
research
Four questions are raised
1. What makes language testing special?
a. Language and language acquisition are different in nature from other educational content areas such
as mathematics;
b. Language is interactional;
c. Language is situated.
21. The place of CRTs in language testing theory and
research
2. What is language proficiency?
There are different perspectives on, and definitions of, this complicated construct.
22. 3. What is communicative language ability?
Communicative competence was proposed by Hymes (1972) and Campbell and Wales
(1970) as a broader view of Chomsky’s linguistic competence.
Several models of communicative competence were proposed. These models differ
mostly in terms of what they include. The differing views of communicative
competence and communicative language ability produce differing views of how
language tests can best assess the communicative abilities of examinees.
23. 3. What is communicative language ability? (continued)
Morrow (1979) asserts that language tests must reflect the following features of language use:
1. Language is used in interaction.
2. Interactions are usually unpredictable.
3. Language has a context.
4. Language is used for a purpose.
5. There is a need to examine performance.
6. Language is authentic, not simplified.
7. Language success is behavior based.
To satisfy these requirements, a test should:
1. Be criterion-referenced against the operational performance of a set of language tasks.
2. Be concerned with validating itself against those criteria and be concerned with content, construct and
predictive validity, not concurrent validity.
3. Rely on modes of assessment which are not directly quantitative, but which are instead qualitative.
4. Subordinate reliability to face validity.
24. 3. What is communicative language ability? (continued)
Bachman (1990) refers to the two primary approaches to identifying what is meant by
"authentic" within language assessment:
Real-life approach (represented also by Morrow’s concerns) focuses on the degree to
which a test represents language performance in non-testing situational use.
Interactional ability approach is concerned more with the distinguishing characteristics
of communicative language use.
25. 3. What is communicative language ability? (continued)
Cziko (1984) examines several models of communicative language ability and addresses three problematic issues in
the development and interpretation of the models of communicative competence:
1. Correlational analyses are problematic. If two skills have a low correlation coefficient, they would be
considered relatively independent. However, a low correlation might also be the result of little within-skill
variation on one or both of the skill tests.
2. Variable language skill exposure in a group may lead to misleading interpretations.
3. Heterogeneous groups of subjects may show high within-skill variance in language proficiency while
homogeneous groups may show low within-skill variance.
Given these issues, he concludes that in order to understand models of communicative competence research must
involve the use of CRTs.
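Cziko's first point, that low within-skill variance can depress a correlation coefficient, can be illustrated with simulated data. This is a sketch under invented assumptions: the shared-ability model, noise levels, and the band used to form a homogeneous subgroup are all illustrative, not from the text.

```python
# Illustrative simulation: two skill tests that share an underlying
# ability correlate strongly in a heterogeneous sample, but the
# correlation drops in a homogeneous subgroup (restricted variance).
import random
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation from population moments."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

random.seed(1)

# Two skill tests driven largely by one shared ability factor.
ability = [random.gauss(0, 1) for _ in range(500)]
skill_a = [a + random.gauss(0, 0.5) for a in ability]
skill_b = [a + random.gauss(0, 0.5) for a in ability]

full_r = pearson(skill_a, skill_b)

# A homogeneous subgroup: only examinees in a narrow band of skill_a.
pairs = [(a, b) for a, b in zip(skill_a, skill_b) if -0.3 < a < 0.3]
sub_a = [p[0] for p in pairs]
sub_b = [p[1] for p in pairs]
restricted_r = pearson(sub_a, sub_b)

print(f"full-range r = {full_r:.2f}, restricted-range r = {restricted_r:.2f}")
```

The restricted-range coefficient comes out far lower even though the two skills are, by construction, driven by the same ability, which is exactly the misinterpretation Cziko warns about.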
26. 4.What problems do CRT developers face?
Several practical questions:
1. How can item analysis be performed when: (a) no comparison group is designated as the instructed or
uninstructed group; (b) no externally identified masters and non-masters are defined; or (c) mastery
groups are defined and available?
2. How dependable are the decisions made on the basis of the test? How generalizable are the scores
and analyses to those of other examinees on other forms of the test?
3. How can a standard, or cut-point, be rationally set?
4. What advantages and disadvantages accrue from application of the statistical approaches provided
by NRT or CRT analyses?
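One commonly cited CRT item statistic relevant to question 1 above is the difference index: item facility after instruction minus item facility before instruction, which gauges an item's instructional sensitivity. The sketch below uses invented response data, and the function names are illustrative assumptions.

```python
# Minimal sketch of a CRT item statistic: the difference index (DI),
# computed from the same students' pretest and posttest responses.
# All response data are invented for illustration.

def item_facility(responses):
    """Proportion of examinees answering the item correctly (0..1)."""
    return sum(responses) / len(responses)

def difference_index(pre_responses, post_responses):
    """DI = IF(posttest) - IF(pretest).

    Larger values suggest the item is sensitive to instruction,
    a core CRT concern (unlike NRT discrimination statistics).
    """
    return item_facility(post_responses) - item_facility(pre_responses)

pre  = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]   # before instruction
post = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]   # same students afterwards

print(round(difference_index(pre, post), 2))  # 0.6: instructionally sensitive
```

Note that this statistic answers only part of question 1 (the instructed/uninstructed case); dependability, cut-score setting, and mastery classification need separate analyses.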