Principles of Language Assessment
Prepared By: Ameer Salman Hussein
Principles of Language Assessment
For a test to be effective, dependable, and an accurate measure of what we want it to
measure, it should meet five cardinal criteria for "testing a test". These
criteria are:
1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback
Practicality
The test is practical when it:
1. Is not excessively expensive (it should save both time and money).
2. Stays within appropriate time constraints. (A test that requires
individual monitoring is impractical for a group of a hundred students
and only a few examiners.)
3. Is relatively easy to administer (it shouldn't take hours to evaluate).
4. Has a scoring procedure that is specific and time-efficient (it shouldn't
have to be scored only by computer).
Reliability
Reliability is the degree of consistency of a measure. A test will be
reliable when it gives the same repeated result under the same
conditions.
There are FOUR subdivisions of reliability:
1. Student-Related reliability.
2. Rater-Reliability.
3. Test Administration Reliability.
4. Test Reliability.
1. Student-Related reliability:
The most common learner-related issue in reliability is
caused by temporary illness, fatigue, a "bad day",
anxiety, and other physical or psychological factors,
which may make an "observed" score deviate from
one's "true" score.
2. Rater-Reliability:
Human error, subjectivity, and bias may enter into the scoring process. Here, we
have to differentiate between two types of rater reliability: inter-rater
reliability and intra-rater reliability.
Inter-rater unreliability occurs when two or more scorers yield inconsistent
scores for the same test, possibly because of lack of attention to scoring criteria,
inexperience, inattention, or even preconceived biases.
Intra-rater unreliability is a common occurrence for classroom teachers because of
unclear scoring criteria, fatigue, bias toward particular "good" and "bad" students,
or simple carelessness.
In tests of writing skills, rater reliability is particularly hard to achieve since writing
proficiency involves numerous traits that are difficult to define. The careful specification of an
analytical scoring instrument, however, can increase rater reliability.
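As an illustration of how such a rater-reliability check might look in practice (not part of the original lecture; the raters, the five-band scale, and all scores below are invented for the example), the following Python sketch computes two standard agreement indices for a pair of raters scoring the same set of essays: the proportion of exact agreements and Cohen's kappa, which corrects that proportion for chance agreement.

    # Illustrative only: two hypothetical raters scoring ten essays on a 1-5 band scale.
    from collections import Counter

    rater_a = [4, 3, 5, 2, 4, 3, 4, 5, 2, 3]
    rater_b = [4, 3, 4, 2, 5, 3, 4, 5, 3, 3]

    def percent_agreement(a, b):
        # Proportion of essays on which the two raters gave identical scores.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    def cohens_kappa(a, b):
        # Agreement corrected for chance: kappa = (p_o - p_e) / (1 - p_e).
        n = len(a)
        p_o = percent_agreement(a, b)
        count_a, count_b = Counter(a), Counter(b)
        p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in set(a) | set(b))
        return (p_o - p_e) / (1 - p_e)

    print(f"exact agreement: {percent_agreement(rater_a, rater_b):.2f}")
    print(f"Cohen's kappa:   {cohens_kappa(rater_a, rater_b):.2f}")

A kappa close to 1 suggests the raters are applying the scoring criteria consistently; a low kappa signals the kind of inter-rater unreliability described above.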
3. Test Administration Reliability:
Unreliability may also result from the conditions in which the test is
administered, such as an aural comprehension test played through audio
equipment that some test-takers cannot hear clearly. Other sources of
unreliability are found in photocopying variations, the amount of light in
different parts of the room, variations in temperature, and even the
condition of desks and chairs.
4. Test Reliability:
Sometimes the nature of the test itself can cause measurement
errors. If a test is too long, test-takers may become fatigued by
the time they reach the later items and hastily respond
incorrectly. Timed tests may discriminate against students who
do not perform well on a test with a time limit. Poorly written
test items may also be a further source of test unreliability.
Validity
Validity can be defined as “the extent to which inferences made from
assessment results are appropriate, meaningful, and useful in terms of
the purpose of the assessment”. Simply put, a valid test of reading
measures reading ability and nothing else. For a valid test of writing
ability, we have to pay attention to comprehensibility, rhetorical and
discourse elements, and the organization of ideas, rather than simply
counting the number of words.
How is the validity of a test established?
There are several different kinds of evidence that may be examined to
support the validity of a test. They are:
1. Content-Related Evidence.
2. Criterion-Related Evidence.
3. Construct-Related Evidence.
4. Consequential Validity.
5. Face Validity.
1. Content-Related Evidence.
Content validity refers to the extent to which the items on a test
are fairly representative of the entire subject that the test seeks
to measure.
For example, if you are trying to assess a person’s ability to
speak a second language in a conversational setting, asking the
learner to answer paper-and-pencil multiple-choice questions
requiring grammatical judgments lacks content validity.
There are a few ways of understanding content validity:
- It is possible to contend, for example, that standard language proficiency tests,
with their context-reduced and academically oriented language, lack content
validity since they do not require the full spectrum of communicative
performance on the part of the learner.
- Another way is to consider the difference between direct and indirect testing.
Direct testing involves the test-taker in actually performing the target task, while
in an indirect test, learners do not perform the task itself but rather a task that
is related to it in some way.
- The most important rule for achieving content validity in classroom assessment
is to test performance directly. Consider, for example, a listening/speaking class
that is doing a unit on greetings and exchanges that includes discourse for asking
for personal information (name, address, hobbies, etc.) with some form-focus on
the verb to be, personal pronouns, and question formation.
2. Criterion-Related Evidence.
A second form of evidence of the validity of a test may be
found in what is called criterion-related evidence, also
referred to as criterion-related validity, or the extent to which
the "criterion" of the test has actually been reached.
Most classroom-based assessment with teacher-designed
tests fits the concept of criterion-referenced assessment. In the
case of teacher-made classroom assessments, criterion-related
evidence is best demonstrated through a comparison of results
of an assessment with results of some other measure of the
same criterion.
Criterion-related evidence usually falls into one of two
categories: concurrent and predictive validity.
A test has concurrent validity if its results are supported by
other concurrent performance beyond the assessment itself.
The predictive validity of an assessment becomes important
in the case of placement tests, admissions assessment
batteries, language aptitude tests, and the like.
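As a simple illustration of comparing an assessment's results with another measure of the same criterion (not part of the original lecture; both score lists are invented for the example), the Python sketch below correlates scores from a hypothetical teacher-made test with an independent measure of the same criterion. A strong positive correlation supports concurrent validity; for predictive validity, the second list would instead hold a later outcome, such as subsequent course performance.

    # Illustrative only: hypothetical scores for eight students on a teacher-made
    # test and on an independent measure of the same criterion.
    from math import sqrt

    classroom_test = [62, 75, 88, 54, 91, 70, 66, 83]
    other_measure  = [58, 72, 90, 50, 94, 65, 70, 80]

    def pearson_r(x, y):
        # Pearson correlation between the two sets of scores.
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
        sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
        return cov / (sd_x * sd_y)

    print(f"correlation with the other measure: r = {pearson_r(classroom_test, other_measure):.2f}")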
3. Construct-Related Evidence:
A third kind of evidence that can support validity is construct-
related evidence, commonly referred to as construct validity. A
construct is any theory, hypothesis, or model that attempts to
explain observed phenomena in our universe of perceptions.
Constructs may or may not be directly or empirically measured,
and their verification often requires inferential data.
In the field of assessment, construct validity asks, "Does this
test actually tap into the theoretical construct as it has been
defined?"
Construct validity is a major issue in validating large-scale
standardized tests of proficiency.
4. Consequential Validity:
Consequential validity encompasses all the consequences of a
test, including such considerations as its accuracy in measuring
intended criteria, its impact on the preparation of test-takers, its
effect on the learner, and the (intended and unintended) social
consequences of a test's interpretation and use.
5. Face Validity:
Face validity refers to the degree to which a test looks right,
and appears to measure the knowledge or abilities it claims to
measure, based on the subjective judgment of the examinees
who take it, the administrative personnel who decide on its use,
and other psychometrically unsophisticated observers.
Face validity will likely be high if learners encounter:
• A well-constructed, expected format with familiar tasks.
• A test that is clearly practical within the allotted time limit.
• Items that are clear and uncomplicated.
• Directions that are crystal clear.
• Tasks that relate to their course work (content validity).
• A difficulty level that presents a reasonable challenge.
Authenticity
Bachman and Palmer (1996) define authenticity as "the degree of
correspondence of the characteristics of a given language test task to the
features of a target language task," and then, they suggest an agenda for
identifying those target language tasks and for transforming them into
valid test items.
In a test, authenticity may be present in the following ways:
1. The language in the test is as natural as possible.
2. Items are contextualized rather than isolated.
3. Topics are meaningful (relevant, interesting) for the learner.
4. Some thematic organization to items is provided, such as through a
story line or episode.
5. Tasks represent, or closely approximate, real-world tasks.
Washback
➢ In large-scale assessment, washback generally refers to the effects
the tests have on instruction in terms of how students prepare for
the test; "cram" courses and "teaching to the test" are examples of such
washback.
➢ Another form of washback that occurs more in classroom assessment
is the information that “washes back" to students in the form of
useful diagnoses of strengths and weaknesses.
➢ Washback also includes the effects of an assessment on teaching and
learning prior to the assessment itself, that is, on preparation for the
assessment.
➢ Finally, washback also implies that students have ready access to you
to discuss the feedback and evaluation you have given.
Thank You for Listening and for Your Attention
