Compiled by:
Mila Nentinah Falhan
Nizar Ridaus Syamsi
SEKOLAH TINGGI DAN KEGURUAN ILMU PENDIDIKAN
Assessment is a broad and relatively nonrestrictive descriptor for the kinds of testing and measuring that teachers must do. It is a word that embraces diverse kinds of tests and measurements. Teachers who can test well will be better teachers, because effective testing enhances a teacher's instructional effectiveness. Teachers should never assess students without a clear understanding of the decision that will be informed by the results of the assessment; informing such decisions is why educators carry out assessments in the first place, and it is what makes assessment an important part of education. Educators use the results of assessments to make decisions about students.
2.1. A Quest for Defensible Inferences
Validity refers to the degree to which a test measures what it purports to measure. The more educators know about their students' status with respect to educationally relevant variables, the better the educational decisions they make regarding those students will be. A teacher who discovers, for example, that students already know more than expected is apt to decide that those students should tackle more advanced topics than originally planned. Appropriate educational decisions therefore depend on the accuracy of educational assessment: accurate assessments improve the quality of decisions, whereas inaccurate assessments do the opposite. When we measure students, we try to sample the contents of an assessment domain in a representative manner so that, based on students' performance on the sampled portion, we can infer what their status is with respect to the entire assessment domain. If a test truly measures what it sets out to measure, then the inferences we make about students based on their test performances are likely to be valid, and valid assessments minimize unintended negative consequences. The focus of validity should be on test-based inferences, not on tests themselves.
2.2. Three Varieties of Validity Evidence
There are three kinds of evidence that can be used to help educators
determine whether their score-based inferences are valid. There are:
Content-related evidence of validity
This first form can be used to support the defensibility of score-based
inferences about a student's status with respect to an assessment domain.
Content-related evidence of validity (often referred to simply as content
validity) refers to the adequacy with which the content of a test represents the
content of the assessment domain about which inferences are to be made.
The notion of “content” refers to much more than factual knowledge. The
content of assessment domains in which educators are interested can embrace
knowledge (such as historical facts), skills (such as higher-order thinking
competencies), or attitudes (such as students' disposition toward the study of science). Content, therefore, should be conceived of broadly. When we determine the content representativeness of a test, the content in the assessment domain being sampled can consist of whatever is in that domain.
Let's illustrate the varying degrees to which an assessment domain can be represented by a test. Take a look at Figure 3-2, where you see an illustrative assessment domain (represented by the shaded rectangle) and the items from different tests (represented by the dots). The less adequately the test items coincide with the assessment domain, the weaker the content-related evidence of validity.

Figure 3-2. Varying Degrees to Which a Test's Items Represent the Assessment Domain about Which Score-Based Inferences Are to Be Made
For example, in illustration A of Figure 3-2, we see that the test's items effectively sample the full range of assessment-domain content represented by the shaded rectangle. In illustration B, however, note that some of the test's items don't even coincide with the assessment domain's content, and that those items falling in the assessment domain don't cover it all that well. Even in illustration C, where all the test's items measure content included in the assessment domain, the breadth of coverage for the domain is insufficient.
Trying to put a bit of reality into those rectangles and dots, think about an "Algebra 1" teacher who is trying to measure his students' mastery of a semester's worth of content by creating a truly comprehensive final examination. Based chiefly on students' performances on the final examination, he will assign grades that will influence whether or not his students can advance to Algebra 2. Let's assume the content the teacher addressed instructionally in Algebra 1 (that is, the algebraic skills and knowledge taught during the Algebra 1 course) is truly prerequisite to Algebra 2. Then, if the assessment domain representing the Algebra 1 content is not satisfactorily represented by the teacher's final examination, the teacher's score-based inferences about students' end-of-course algebraic capabilities and his resultant decisions about students' readiness for Algebra 2 are apt to be in error. If teachers' educational decisions hinge on students' status regarding an assessment domain's content, then those decisions are likely to be flawed if inferences about students' mastery of the domain are based on a test that doesn't adequately represent the domain's content.
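To make content representativeness a bit more concrete, here is a minimal sketch, in Python, of how a teacher such as the one above might tally whether a final examination's items actually touch each topic in the assessment domain. The topic labels, the item-to-topic tags, and the helper name domain_coverage are illustrative assumptions rather than anything prescribed by the chapter.

    # Minimal sketch: checking how well a test's items cover an assessment domain.
    # The domain topics and item-topic tags below are purely hypothetical.

    ALGEBRA_1_DOMAIN = {
        "linear equations",
        "inequalities",
        "systems of equations",
        "polynomials",
        "factoring",
        "quadratic equations",
    }

    # Each final-exam item is tagged with the domain topic it is meant to measure.
    final_exam_items = [
        {"item": 1, "topic": "linear equations"},
        {"item": 2, "topic": "linear equations"},
        {"item": 3, "topic": "polynomials"},
        {"item": 4, "topic": "factoring"},
        {"item": 5, "topic": "exponents"},  # lands outside the defined domain
    ]

    def domain_coverage(domain, items):
        """Return the fraction of domain topics measured by at least one item,
        the topics left unmeasured, and any items falling outside the domain."""
        covered = {entry["topic"] for entry in items} & domain
        off_domain = [entry["item"] for entry in items if entry["topic"] not in domain]
        return len(covered) / len(domain), sorted(domain - covered), off_domain

    coverage, missing, off_domain = domain_coverage(ALGEBRA_1_DOMAIN, final_exam_items)
    print(f"Domain coverage: {coverage:.0%}")
    print(f"Topics with no items: {missing}")
    print(f"Items outside the domain: {off_domain}")

A low coverage figure, or items landing outside the domain, corresponds to illustrations B and C of Figure 3-2 and warns the teacher that score-based inferences about the whole domain may be shaky.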
2. Criterion-related evidence of validity
This kind of evidence helps educators decide how much confidence can be
placed in a score-based inference about a student's status with respect to an
assessment domain. Moreover, criterion-related evidence of validity is collected
only in situations where educators are using an assessment procedure to predict
how well students will perform on some subsequent criterion.
The easiest way to understand what this second kind of validity evidence looks like is to describe the most common educational setting in which it is collected, namely, the relationship between students' scores on (1) an aptitude test and (2) the grades those students subsequently earn. An aptitude test is an assessment device that is used in order to predict how well an examinee will perform at some later point.
For example, many students complete a scholastic aptitude test while they are still in high school. The test is supposed to be predictive of how well those students are apt to perform in college. More specifically, students' scores on the aptitude test are employed to predict the students' grade-point averages (GPAs) in college. It is assumed that students who score well on the aptitude test will earn higher GPAs in college than students who score poorly on it.
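As a rough illustration of how criterion-related evidence is usually summarized, the following Python sketch computes the correlation between predictor scores and the criterion those scores are meant to predict. The aptitude scores and GPAs are invented, and a real validity study would involve far more students; the sketch only shows the kind of coefficient that gets reported.

    import math

    # Hypothetical data: aptitude-test scores earned in high school and the
    # college grade-point averages the same students later obtained.
    aptitude_scores = [480, 520, 560, 600, 640, 680, 720]
    college_gpas = [2.4, 2.7, 2.6, 3.0, 3.2, 3.1, 3.6]

    def pearson_r(xs, ys):
        """Pearson correlation between a predictor and a criterion, the usual
        way criterion-related evidence of validity is summarized."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y)

    r = pearson_r(aptitude_scores, college_gpas)
    print(f"Validity coefficient (aptitude scores vs. later GPAs): r = {r:.2f}")

A coefficient near zero would suggest that scores on the aptitude test tell us little about later college performance, whereas a strong positive coefficient supports the predictive use of the test.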
3. Construct-related evidence of validity
The way that construct-related evidence is assembled for a test is, in some
sense, quite straightforward. First, based on our understanding of how the
hypothetical construct that we're measuring works, we make one or more formal hypotheses about students' performances on the test for which we're gathering construct-related evidence of validity. Second, we gather empirical evidence to see whether the hypothesis (or hypotheses) is confirmed. If it is, we have assembled evidence that the test is measuring what it's supposed to be measuring.
As a consequence, we are more apt to be able to draw valid score-based
inferences when students take the test.
There are three types of strategies most commonly used in construct-related evidence studies.
a. Intervention Studies
One kind of investigation that provides construct-related evidence of
validity is an intervention study. In an intervention study, we hypothesize that students will respond differently to an assessment instrument after having received some type of treatment (or intervention), as illustrated in the sketch that follows these three strategies.
b. Differential Population Studies
In this kind of study, based on our knowledge of the construct being measured, we hypothesize that individuals representing distinctly different populations will score differently on the assessment procedure under consideration.
c. Related-Measures Studies
In a related-measures study, we hypothesize that a given kind of relationship will be present between students' scores on the assessment device we're scrutinizing and their scores on a related assessment device.
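To make the intervention-study logic mentioned above concrete, here is a minimal Python sketch under the assumption that we hypothesize higher scores on the test after instruction. All score values are invented, and a real study would apply an appropriate statistical test rather than a simple comparison of means.

    # Minimal sketch of an intervention study: we hypothesize that scores on the
    # test will rise after instruction, then check whether the data agree.
    # The pre- and post-instruction scores below are purely hypothetical.

    pre_scores = [12, 15, 11, 14, 13, 16, 10, 12]    # same students, before instruction
    post_scores = [18, 21, 16, 19, 17, 22, 15, 18]   # same students, after instruction

    def mean(values):
        return sum(values) / len(values)

    gain = mean(post_scores) - mean(pre_scores)
    hypothesis_confirmed = gain > 0

    print(f"Mean pre-instruction score:  {mean(pre_scores):.1f}")
    print(f"Mean post-instruction score: {mean(post_scores):.1f}")
    print(f"Mean gain: {gain:.1f} -> hypothesis confirmed: {hypothesis_confirmed}")

If the hypothesized gain failed to appear, we would have reason to doubt that the test is measuring the construct we think it is measuring.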
2.3. Sanctioned and Unsanctioned Forms of Validity
Validity is the linchpin of educational measurement. However, because validity is such a central notion in educational assessment, some folks have attached specialized meanings to it that, although helpful at some level, may also cause confusion. One of these is face validity. Face validity means that the appearance of a test seems to coincide with the use to which the test is being put. Another, more recently introduced, variant of validity is something known as consequential validity. Consequential validity refers to whether the uses of test results are valid. The notion of consequential validity is a decent way to remind educators of the importance of consequences when tests are used.
2.4. The Relationship between Reliability and Validity
Reliability, or the consistency of measurement, does not by itself guarantee validity. A test, for example, could be measuring with remarkable consistency a construct that the test developer never even contemplated measuring. For instance, although the test developer thought that an assessment procedure was measuring students' punctuation skills, what is actually measured is students' general intellectual ability which, not surprisingly, splashes over into how well students can punctuate. On the other hand, inconsistent results will preclude valid score-based inferences, so evidence of valid score-based inferences almost certainly requires that consistency of measurement is present.
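To show what consistency of measurement looks like in practice, here is a small Python sketch that summarizes test-retest reliability as the correlation between two administrations of the same test. The scores are invented; the point is that a high coefficient demonstrates only consistency, and says nothing by itself about whether the intended construct is being measured.

    import math

    # Hypothetical scores from two administrations of the same test to the same
    # students.  High agreement indicates reliability, not validity.
    first_administration = [35, 42, 28, 50, 46, 31, 39, 44]
    second_administration = [36, 41, 30, 49, 47, 33, 38, 45]

    def pearson_r(xs, ys):
        """Test-retest reliability is commonly reported as the correlation
        between two administrations of the same test."""
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y)

    reliability = pearson_r(first_administration, second_administration)
    print(f"Test-retest reliability: r = {reliability:.2f}")
    # A high value here shows only that the test measures something consistently;
    # it cannot tell us whether that something is punctuation skill or general ability.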
2.5. What Do Classroom Teachers Really Need to Know About Validity?
The author recommended that, for your more important tests, you devote at least some attention to content-related evidence of validity. He suggested that giving serious thought to how well the content of an assessment domain is represented by a test is a good first step. Reviewing your tests' content is also an effective way to help make sure that your classroom tests satisfactorily represent the content you are trying to promote, and that your score-based inferences about your students' status are not miles off the mark.
Validity refers to the degree to which a test measures what it purports to measure. There are three kinds of evidence that can be used to help educators determine whether their score-based inferences are valid: content-related evidence of validity, criterion-related evidence of validity, and construct-related evidence of validity.
Popham, W. James. Classroom Assessment: What Teachers Need to Know. Allyn and Bacon.