Unit–7
Test Development and Qualities of a test
Written by:
Dr. Fayyaz Ahmad Faize
TABLE OF CONTENTS
1. Achievement Test
   1.1 Purposes/uses of achievement tests
2. Attitude Scale
   2.1 Measuring Attitude
3. Steps for Test Development
4. Qualities of a Good Test
5. Reliability
   5.1 Reliability Coefficient
   5.2 Relationship between Validity and Reliability
6. Reliability Types
   6.1 Test-Retest Reliability
   6.2 Equivalence Reliability or Inter-Class Reliability
   6.3 Split-Halves Reliability
7. Factors Affecting Reliability
8. Validity
   8.1 Content-related Validity
   8.2 Criterion-related Validity
       8.2.1 Concurrent Validity
       8.2.2 Predictive Validity
   8.3 Construct-related Validity
Self-Assessment Questions
9. References
OBJECTIVES
After studying this chapter, the students will be able to:
• Describe achievement tests and attitude scales
• Explain the steps involved in test development
• Describe the qualities of a good test
• Define and interpret reliability and validity
• Discuss how to determine the reliability and validity of tests
• Understand the relationship between reliability and validity
• Understand the basic kinds of validity evidence
• Interpret various expressions of validity
• Recognize what factors affect validity
1. ACHIEVEMENT TEST
Achievement tests are designed to measure accomplishment. Usually, such a test is conducted at the end of some learning activity/process to ascertain the degree to which the required task has been accomplished. For example, an achievement test for students of a Nursery class might assess knowledge of the English alphabet, numbers, and key science concepts. Thus, achievement tests help in measuring the degree of learning on tasks that have already been taught. The tasks may be specific and short, or they may be comprehensive and detailed. An achievement test may be standardized, such as a secondary-class Chemistry test on formulae and valences or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement'. This relates to the measurement of learning experiences in one or more academic areas. It would usually involve a number of subtests, each aimed at measuring some specific learning experience/target. These subtests are sometimes called achievement batteries. Such batteries may be individually administered or
group administered. They may consist of a few subtests, as does the Wide Range Achievement
Test-4 (Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and
(new to the fourth edition) reading comprehension.
An achievement battery may be as comprehensive as the STEP Series, which includes subtests in reading,
vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior
inventory; an educational environment questionnaire; and an activities inventory. Some
batteries, such as the SRA California Achievement Tests, span kindergarten through grade 12,
whereas others are grade or course-specific. Some batteries are constructed to provide both
norm-referenced and criterion-referenced analyses. Others are concurrently normed with
scholastic aptitude tests to enable a comparison between achievement and aptitude. Some
batteries are constructed with practice tests that may be administered several days before actual
testing to help students familiarize themselves with test-taking procedures. One popular instrument, appropriate for use with persons aged 4 through adulthood, is the Wechsler Individual Achievement Test-Second Edition, otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not only to gauge achievement but also to develop hypotheses about achievement versus ability. It features nine subtests that sample content in each of the seven areas listed in a past revision of the Individuals with Disabilities Education Act: oral expression, listening comprehension, written expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics reasoning.
For a particular purpose, a battery that focuses on achievement in a few select areas may be
preferable to one that attempts to sample achievement in several areas. On the other hand, a test
that samples many areas may be advantageous when an individual comparison of performance
across subject areas is desirable. If a school or a local school district undertakes to follow the
progress of a group of students as measured by a particular achievement battery, then the battery
of choice will be one that spans the targeted subject areas in all the grades to be tested. If the ability to distinguish individual areas of difficulty is of primary concern, then achievement tests with strong diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas, across grades, and standardized on large, national samples of students have much to recommend them, they also have certain drawbacks. For example, such tests usually take years to develop; in the interim the items, especially in fields such as social studies and science, may become outdated. Further, any nationally standardized instrument is only as good as the extent to which it meets the (local) test user's objectives.
1.1 Purposes/uses of achievement tests
i. To measure students’ mastery of certain essential skills and knowledge, such as
proficiency in recalling facts, understanding concepts, principles and use of skills
ii. To measure students' growth/progress over time for promotion purposes. This is helpful to the school in making decisions about students' placement in a specific program, class, or group, or for promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing performance of an individual
to the norm or average performance of his/her group (norm referenced)
iv. To identify pupils' problems and diagnose them. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be appreciated how achievement tests, as well as intelligence tests, could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of teacher's instructional method
vi. To encourage good study habits in the students and motivate them to work hard.
2. ATTITUDE SCALE
An attitude may be defined formally as a presumably learned disposition to react in
some characteristic manner to a particular stimulus. The stimulus may be an object,
a group, an institution—virtually anything. Although attitudes do not necessarily predict
behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring
the attitudes of employers and employees toward each other and toward numerous
variables in the workplace. As the name implies, an attitude scale tries to measure an individual's beliefs, attitudes, and perceptions towards oneself, others, or some phenomenon, activity, or situation.
2.1 Measuring Attitude
Attitude can be measured using self-report instruments, tests, and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect about their attitudes and in their level of self-awareness. Moreover, some people are reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes that they themselves are not aware of.
An early approach to measuring attitude was described by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes", which relates to designing an instrument that helps in measuring attitude. A Likert scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree, Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude about a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5: for a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree (the scoring is reversed for a negative statement).
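As a minimal sketch of how such scoring might be implemented (the item wording, responses, and the reverse-keying of the third statement are invented for illustration):

```python
# Hypothetical sketch of Likert-scale scoring with reverse-coding
# for negatively worded statements.
SCORES = {"Strongly Agree": 5, "Agree": 4, "Undecided": 3,
          "Disagree": 2, "Strongly Disagree": 1}

def score_item(response: str, positive: bool = True) -> int:
    """Return a 1-5 score; reverse the scale for negative statements."""
    raw = SCORES[response]
    return raw if positive else 6 - raw   # 5 -> 1, 4 -> 2, and so on

# One respondent's answers to a three-statement scale
# (the third statement is negatively worded).
responses = [("Strongly Agree", True), ("Agree", True), ("Disagree", False)]
print(sum(score_item(r, pos) for r, pos in responses))  # 5 + 4 + 4 = 13
```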
Earlier, Thurstone (1928) had argued in his article "Attitudes Can Be Measured" that attitude is indeed measurable. More recently, the research of Banaji (2001) further supported this contention in the article "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate
favorable or unfavorable feeling, thought, or action toward social objects” (Greenwald & Banaji,
1995, p.8). Stated another way, they are nonconscious, automatic associations in memory that
produce dispositions to react in some characteristic manner to a particular stimulus.
Implicit attitudes can be measured using the Implicit Association Test (IAT), a computerized sorting task by which implicit attitudes are gauged with reference to the test taker's reaction times. For example, the individual is shown a particular stimulus and is asked to categorize it or associate another word or phrase with it without taking much time. The attitude of a person toward 'terror', say, can be gauged by presenting the word and asking the person to quickly associate other favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured; they have been studied in relation to racial prejudice, threats, voting behavior, professional ethics, self-esteem, drug use, etc. Measuring implicit attitude is now frequently used in consumer psychology to study consumer preferences. In consumer psychology, attitudes may be found by asking a series of questions about a product or choice and noting the responses, which may reflect the beliefs or thinking of the individual. The responses can be sought through a survey or opinion poll using questionnaires, emails, Google forms, social media posts, etc. Surveys and polls may be conducted face-to-face, online, by telephone interview, or by mail. Face-to-face interaction helps in getting a quicker response and in making sure the questions are well understood. Moreover, the researcher can present or show the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the respondents' choices.
Another type of scale for measuring attitude is the semantic differential technique. In this type of scale, the respondent is given two opposite extremes and asked to place a mark in one of the 7 spaces on the continuum according to his/her level of preference. The two bipolar extremes might be easy-difficult, good-bad, weak-strong, etc.
Strong ____:____:____:____:____:____:____ Weak
Decisive ____:____:____:____:____:____:____ Indecisive
Good ____:____:____:____:____:____:____ Bad
Cheap ____:____:____:____:____:____:____ Expensive
3. STEPS FOR TEST DEVELOPMENT
The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test involves a series of steps; however, these steps are not fixed, as various authors have suggested different steps/stages for developing a test. The following are some of the general steps for test development.
1. Identification of objectives
This is one of the most important steps in developing any test: the test authors need to consider in detail what exactly they aim to measure, i.e. the purpose of the test. It is especially important to define the purpose of the test clearly, because that increases the possibility of achieving high validity; without predefined objectives, a test will be meaningless and purposeless. There are two kinds of objectives: behavioral and non-behavioral. As the names suggest, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177).

2. Deciding about test format
The format/design of the test is another important element in constructing a test. The test developer needs to decide which format/design will be most suitable for achieving the set objectives. The format of the test may be objective type, essay type, or both. The examiner also decides what type of objective items shall be included, whether multiple-choice, fill in the blanks, matching items, short answer, etc. The test author further decides the number of marks assigned to each format and the total amount of time to complete the test.
3. Making a table of specifications
A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content, as well as specifying the type of assessment objectives that the items will be testing. The table ensures that all levels of instructional objectives are used in the test questions.
The table lists the number of items from each content area, the weightage assigned to each content area, and the type of instructional objective the items will be measuring, whether recall, understanding, or application. Last but not least, the examiner shall also decide the weightage of each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can consider the following areas:
• What language skills should be included – will there be a list of grammatical structures and lexis, etc.;
• What sort of tasks are required – objectively assessable, integrative, simulated
“authentic”, etc.;
• How many items are required for each section, and what their relative weight will be – equal weighting or extra weighting for more difficult items;
• What test methods are to be used – multiple choice, gap filling, matching,
transformations, picture descriptions, essay writing, etc.;
• What rubrics are to be used as instructions for students – will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric;
• What assessment criteria will be used – how important is accuracy, spelling, length of
written text, etc.
4. Writing Items
The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. The items should progress from simple to difficult; however, it is debatable whether items should be arranged randomly or from easy to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief, and lucid, and should be checked for grammar, spelling, and punctuation.

5. Preparation of Marking Scheme
The test developer decides the number of marks to be assigned to each item or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay-type questions can be divided into smaller components, with marks defined for each important concept/point.

As regards developing a standardized test, the following steps are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers, and recruiters. The process encompasses five stages:
1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision
The process of test development starts with conceptualizing the idea of the test and the purpose for which the test is to be constructed. A test may be designed around some emerging phenomenon, problem, issue, or need. Test conceptualization also includes the construct or concepts the test should measure. What kind of objectives or behavior should the test measure, given the other such tests that exist? Is there any need for a new test, or can an existing test be used for the set purpose? How can the test be better than the existing tests? Who will be the users of the test: students, teachers, or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral, or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test? If yes, then on whom?
Based on the purpose, needs and the objectives to be achieved, the items for the test are
constructed/selected. The test is then pilot tested on a sample to try out whether the items in the
test are appropriate for achieving the set objectives. Based on the result from the test tryout or
pilot test, the items are subjected to item analysis. This requires the use of statistical procedures to determine the difficulty level of the items, reliability, and validity. This process helps in selecting appropriate items for the test, while inappropriate items may be revised or deleted. The result is a revised draft of the test, better than the initial version. The process may be repeated until a refined and standardized version is available.
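As a rough sketch of the item analysis described above, the following computes two common statistics on hypothetical tryout data: item difficulty (the proportion of examinees answering an item correctly) and a simple discrimination index (the difference in difficulty between high and low scorers); the data and group sizes are invented for illustration:

```python
# Hypothetical item-analysis sketch: rows are examinees, columns are
# items (1 = correct, 0 = incorrect).
import numpy as np

scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])

difficulty = scores.mean(axis=0)        # proportion correct per item

# Simple discrimination index: p(upper group) - p(lower group),
# with groups formed from total scores (top and bottom thirds).
totals = scores.sum(axis=1)
order = np.argsort(totals)
n = len(scores) // 3
lower, upper = scores[order[:n]], scores[order[-n:]]
discrimination = upper.mean(axis=0) - lower.mean(axis=0)

print("difficulty:    ", difficulty)      # near 0 = hard, near 1 = easy
print("discrimination:", discrimination)  # low values flag weak items
```

Items with very extreme difficulty or near-zero discrimination would be the candidates for revision or deletion mentioned above.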
4. QUALITIES OF A GOOD TEST
In constructing a test, the test developer should aim at making a good test. A bad test may spoil
the purpose of testing and thus be useless to administer. According to Mohamed (2011), a good test should have the following properties.

Objectivity
It is very important for a test to be objective. A test with high objectivity eliminates personal biases and influences in scoring and in interpreting the test result. This can be achieved by including more objective-type items in the test: multiple-choice questions, fill in the blanks, true-false, matching items, short question-answers, etc. In contrast, essay questions are subjective; different examiners may arrive at different scores when checking them, depending on the examiner's mood, knowledge level, and personal likes and dislikes. However, essay-type questions can be made more objective through a well-defined marking scheme that allocates marks to the small bits of important and relevant information in the long answers.
Comprehensiveness
A good test should cover the content area that was taught. The items in the test should come from different areas of the course content; if one topic is assigned many items while other areas are neglected, the test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary, etc. At the same time, due importance should be given to the important parts of the content according to their utility and significance.
Validity
Validity means that a test rightly measures what it is supposed to measure: it tests what it ought to test. A good test that measures control of grammar should have no difficult lexical items. Validity is explained in detail in the validity section.
Reliability
Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In this case, the test is said to provide consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.
Discriminating Power
Discriminating power is the test's ability to discriminate between the upper and lower groups who took the test. A good test should not contain only difficult items or only easy items; rather, it should contain items of different difficulty levels to distinguish students of different ability levels. The questions should progressively increase in difficulty to reduce stress and tension in students.
Practicability
The test should be realistic and practicable. It should not measure unrealistic targets or objectives. The test should be easy to administer as well as easy to score, and it should be economical, not wasting too many resources, too much energy, or too much effort. Competitive tests may sometimes be difficult to complete within the stipulated time, since their specific purpose may be to select candidates with higher ability and faster reaction times. Otherwise, classroom tests shall keep in mind the individual differences of students and provide ample opportunity for completion.
Simplicity
Simplicity refers to clarity in language, correctness, adequacy, and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what
the question is asking and how to answer. Sometimes, the students get confused about the
possible answers due to lack of clarity in the questions.
5. RELIABILITY
According to Gay, Mills, and Airasian (2011), "Reliability is the degree to which a test
consistently measures whatever it measures”.
Thorndike (2005) refers to reliability as the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing". Reliability also signifies the repeatability of observations or scores on a measurement. Other terms used to define reliability include dependability, stability, accuracy, and regularity in measurement. For a test, high reliability means that a person gets the same, or nearly the same, score each time the test is administered to him or her. If the person obtains a different score each time the test is administered, then the test's reliability will be questioned.
Reliability can be ascertained by the examiner by administering the same test on two different occasions and comparing the scores obtained on the two occasions to determine the degree of reliability. Another method is to test students on one test and then administer another, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference in the students' scores on the two tests, then the two tests have poor reliability. Essay-type questions may have poor reliability, as students may get different scores each time the answers are marked; multiple-choice questions have higher reliability than essay-type questions. A test may not be reliable in all settings. A test may be reliable in a specific situation, under specific circumstances, and with a specific group of subjects; it may not be reliable in a different situation or with a different group of students under different circumstances.
5.1 Reliability Coefficient
Whether in physical measurement or in testing, it is difficult to achieve 100% consistency in scores; an acceptable value is instead expressed as the degree of closeness or consistency between the measurements. For this purpose, the degree of reliability of a test is measured numerically, in what is termed the reliability coefficient. According to the Merriam-Webster dictionary, a reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures.
The reliability coefficient is a way of confirming how accurate a test or measure is. It essentially
measures consistency in scoring. The reliability coefficient is found by giving the test to the
same subject more than once and determining if there's a correlation between the two scores.
This will also reveal the strength of the relationship and similarity between the two scores. If the
two scores are close enough, the test can be said to be accurate and to have good reliability. The variation in the scores is called error variance, and the source of that variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa.
For example, an individual might be given a measure of self-esteem and then, later, the same measure again. The two scores would be correlated to produce the reliability coefficient. If the scores are very similar to each other, then the measure can be said to be reliable, consistently measuring the same thing, which in this case is self-esteem.
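A minimal sketch of this computation, assuming two administrations of the same measure to the same five people (all scores are invented for illustration):

```python
# Hypothetical sketch: the reliability coefficient as the Pearson
# correlation between two administrations of the same test.
from scipy.stats import pearsonr

first  = [42, 55, 38, 61, 50]   # scores on the first administration
second = [40, 57, 36, 63, 49]   # scores on the second administration

r, _ = pearsonr(first, second)  # returns (correlation, p-value)
print(round(r, 2))              # close to 1.0 here: small error variance
```

The gap between each pair of scores is the error variance mentioned above; the smaller it is, the closer r gets to 1.00.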
The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability. In actual practice, however, it is not possible to have a perfectly reliable test; the coefficient of reliability will be less than 1.00. The reason is the effect of various factors and errors in measurement. These include errors caused by the test itself, such as ambiguous test items that are interpreted differently by students. Differences in the condition of students (emotional, physical, mental, etc.) are also responsible for producing errors in measurement, for example the fatigue factor, the arousal of specific emotions such as anger, fear, or depression, and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording, etc. may also result in measurement error and thus affect the reliability coefficient.
5.2 Relationship between Validity and Reliability
A test which is valid is also reliable. However, a test which is reliable is not necessarily valid. If a test is valid, it means that it is rightly measuring what it is supposed to measure; the score obtained on such a test is also consistent, whether low or high, so the test is reliable as well. In comparison, a test may be reliable, meaning that the students' scores come out consistently the same, yet fail to measure its intended purpose, and thus be invalid. Thus, a test which is reliable may or may not be valid, but a test which is valid must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is it really measuring the set objectives from the given content? If it measures its intended purpose, then the test is also valid; if it does not measure the concepts from the given content, then it is invalid. [For more detail, see Gay, Mills, & Airasian (2011).]
6. RELIABILITY TYPES
Some types of reliability are given below:
• Test-Retest Reliability
• Equivalence Reliability or inter-class reliability
• Split-Halves Reliability
6.1 Test-Retest Reliability
One of the simplest ways to determine the reliability of a test is test-retest. Test-retest reliability is the degree to which scores on a test are consistent over time. The subjects are given the test on two occasions, and the scores obtained are compared to see the consistency between the two administrations. This can be found by measuring the correlation between the two sets of scores: if the correlation coefficient is high, the test has a high degree of reliability. This method is seldom used by subject teachers but is frequently used by test developers and commercial test publishers, such as those behind IELTS, TOEFL, GRE, etc.
One issue that arises here is how much time should elapse between the two administrations. If the time interval is short, say a few hours or days, then the chances of students remembering their previous answers are high; they will score much the same, which will inflate the reliability coefficient. If the interval is long, then the ability to perform well on the test increases due to learning over time, which also affects the reliability coefficient. Thus, in reporting test-retest reliability, the time interval between the tests should be mentioned along with the reliability coefficient. This kind of reliability is ensured for aptitude tests and achievement tests so that they measure their intended purpose each time they are administered.
6.2 Equivalence Reliability or inter-class reliability
Equivalence reliability relates to two tests that are similar in every aspect except the test items. The reliability between the two tests is then measured; if the coefficient of reliability, known in this case as the coefficient of equivalence, is high, then the two tests are highly reliable, and vice versa. It shall be kept in consideration that the two tests must measure the same variables and have the same number of items, structure, and difficulty level. Besides, the directions for administering the two tests shall also be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which version a student takes, his or her score should be the same on both. This is usually used in situations where the number of candidates is very large or a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without fear of test items leaking or repeating. In some circumstances, researchers ensure equivalence of the pre-test and post-test to measure the actual difference in performance, removing the measurement error that would arise from recalling the answers given on the first test.
The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level, etc. One form of the test is administered to an appropriate group. After some time, the second form is administered to the same group. The scores obtained by students on the two forms are then correlated to find the coefficient of reliability; the differences in the scores obtained by students are treated as error.
6.3 Split-Halves Reliability
This type of reliability is used for measuring internal consistency between the items in a single test. It is theoretically the same as finding equivalence reliability; however, here the two parts are taken from the same test. This reliability can be found by administering the test only once, so the effect/error caused by the time interval, the students' condition (physical, mental, emotional, etc.), or the use of two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts. The two parts can be obtained by various methods, e.g. dividing the test items into two halves with an equal number of items in each, or splitting the test items into odd-numbered and even-numbered items.
In case the test is divided into odd- and even-numbered items, the reliability is calculated as follows. First, the test is administered to the subjects and the items are marked. The items are divided into two halves, with all the odd-numbered items in one half and the even-numbered items in the other. The scores obtained on the odd- and even-numbered items are totaled separately, so there are two scores for each student. The two sets of scores are then correlated using the Pearson product-moment correlation coefficient. If the correlation coefficient is high, the two halves of the test are highly reliable, and vice versa.
The reliability coefficient obtained from this correlation needs to be adjusted, because it describes only half of the test; the reliability of the whole test will be higher. The corrected value is computed using the Spearman-Brown prophecy formula. Suppose the reliability coefficient for a 40-item test was .70, obtained by correlating the scores on the 20 odd and 20 even items. The reliability coefficient for the whole test (40 items) is then found using the following formula:
r_total = (2 × r_half) / (1 + r_half)

where r_half is the correlation between the two halves. For the example above:

r_total = (2 × .70) / (1 + .70) = 1.40 / 1.70 ≈ .82

The advantage of split-halves reliability is that the test is administered only once. Thus, it can be used economically and conveniently by classroom teachers and researchers to collect data about a test.
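A short sketch of the whole procedure, the odd/even split, the Pearson correlation, and the Spearman-Brown correction, on invented item scores:

```python
# Hypothetical split-half reliability sketch. Rows are students,
# columns are items (1 = correct, 0 = incorrect); data are invented.
import numpy as np
from scipy.stats import pearsonr

items = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
])

odd_total  = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_total = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half, _ = pearsonr(odd_total, even_total)
r_total = (2 * r_half) / (1 + r_half)     # Spearman-Brown correction
print(round(r_half, 2), round(r_total, 2))
```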
7. FACTORS AFFECTING RELIABILITY
Fatigue: The score obtained on a test may differ with the subjects' condition, and fatigue plays an important role here. Students will generally score lower on a test when fatigued, so fatigue generally decreases the reliability coefficient.
Practice: The reliability of a test can be affected by the amount of practice. It is generally said that practice makes perfect; in the same manner, practice on a test will improve students' scores, and greater practice thus increases the reliability coefficient.
Subject variability: The variation in scores will increase if there is more subject variability in a group. The greater the differences among subjects in terms of gender, age, program, interests, etc., the greater the variation in scores among individuals. In the same way, if a group is more homogeneous, such as a group of students within the same IQ range, the variation in scores will be smaller.
Test Length: The length of a test and the number of items affect its reliability. Usually, a test with a greater number of items gives more reliable scores, due to the cancelling out of random positive and negative errors within the test. Thus, adding more items to a test increases its reliability, and deleting items lowers it. One technique for deleting items from a test without decreasing its reliability is to remove the items that show the lowest reliability values in item analysis.
The Spearman-Brown prophecy formula is also used for estimating the reliability of a test that is made shorter or longer, provided that the original reliability of the test is known. For example, if a test's original reliability is .60 and the number of items is increased or decreased, then the new reliability of the test will be:
r_x = (K × r) / (1 + (K − 1) × r)

where:
r = reliability of the original test
r_x = predicted reliability of the test with items added or deleted
K = ratio of the number of items in the new test to the number of items in the original test
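As a sketch, the general formula applied to the .60 example above, doubling and halving the test length (the numbers are for illustration only):

```python
# Hypothetical sketch of the Spearman-Brown prophecy formula:
# predicted reliability when a test is lengthened or shortened.

def spearman_brown(r: float, k: float) -> float:
    """r = reliability of the original test;
    k = (items in new test) / (items in original test)."""
    return (k * r) / (1 + (k - 1) * r)

r_original = 0.60
print(round(spearman_brown(r_original, 2.0), 2))  # doubled length -> 0.75
print(round(spearman_brown(r_original, 0.5), 2))  # halved length  -> 0.43
```

Note that the split-half correction used earlier is just this formula with k = 2, applied to the half-test correlation.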
8. VALIDITY
Validity refers to the extent to which a test measures what it is supposed to measure. In other words, it refers to the degree to which a test fulfils its objectives. Thus, for a measure or test to be valid, it must consistently measure the particular trait, characteristic, or ability for which it was constructed.
According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be drawn from a test, then the test has greater validity for supporting those specific inferences.
Cohen and Swerdlik (2010) defined validity as “a judgment based on evidence about the
appropriateness of inferences drawn from test scores". An inference is a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic.
A test may be valid for a particular group and for a particular purpose; however, it may not be valid for another group or for a different purpose. A test on English grammar may be valid for a high school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses, and for all users (Cohen & Swerdlik,
2010). Rather, tests may be shown to be valid within what we would characterize as reasonable
boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test
may be called into question (Cohen & Swerdlik, 2010).
The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase that the test developer has to undertake with the test takers for a specific purpose. It is necessary that the test developer mention the validity evidence in the test manual for the users/readers. However, test users may sometimes conduct their own studies to check validity with their own test takers, usually called local validation.
Some types of validity are:
1. Content-related validity
2. Criterion-related validity
3. Construct-related validity
8.1 Content-related validity
Sometimes content validity is also referred to as face validity or logical validity. According to Gay, Mills, & Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear.

Face validity is a quick way of ascertaining whether a test appears to measure what it purports to measure. A primary-class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact measure of content validity and is only used as a quick initial screening.
In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math topics not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the items adequately sample the relevant content area. The proportion of test items from the various units must be kept in consideration according to their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportion of test items is in accordance with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores the other chapters, then such a test will have poor sampling validity.
Content validity can be judged by a content expert, the relevant subject teacher, and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgement about content validity. A good way of ensuring content validity is to make a table of specifications that includes the total number of units/topics to be tested, the number of items from each unit/topic, and the different domains of instructional objectives. The table of specifications helps in spotting the units from which most of the items are drawn, as well as the units that are under-represented or ignored.
Consider a secondary-grade Physics test drawn from five chapters, as given in the table of specifications below. The names of the units are listed along with the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to strictly follow the given proportions: the examiner decides which aspects or instructional objectives shall be given more or less weightage for each unit, while still ensuring that there is not too great a difference in the weightage assigned to each objective. Thus, some units may require more focus on the application side, while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.
Table 2. Table of specifications for a Physics test covering five units

Course content     Knowledge (30%)  Comprehension (40%)  Application (30%)  Total
Forces                    3                 5                    2            10
Energy sources            3                 4                    3            10
Turning Effect            2                 4                    4            10
Kinematics                3                 3                    4            10
Atomic Structure          4                 4                    2            10
Total                    15                20                   15            50
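As a minimal sketch, a blueprint like the one above can be checked programmatically to confirm that no objective level is over- or under-represented; the numbers mirror the table, and the code is purely illustrative:

```python
# Hypothetical sketch: verifying a table of specifications so that each
# objective level contributes its intended share of the 50 items.
blueprint = {
    "Forces":           {"Knowledge": 3, "Comprehension": 5, "Application": 2},
    "Energy sources":   {"Knowledge": 3, "Comprehension": 4, "Application": 3},
    "Turning Effect":   {"Knowledge": 2, "Comprehension": 4, "Application": 4},
    "Kinematics":       {"Knowledge": 3, "Comprehension": 3, "Application": 4},
    "Atomic Structure": {"Knowledge": 4, "Comprehension": 4, "Application": 2},
}

total_items = sum(sum(unit.values()) for unit in blueprint.values())
for level in ("Knowledge", "Comprehension", "Application"):
    count = sum(unit[level] for unit in blueprint.values())
    print(f"{level}: {count} items ({100 * count / total_items:.0f}%)")
# Knowledge: 15 items (30%)
# Comprehension: 20 items (40%)
# Application: 15 items (30%)
```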
8.2 Criterion-related validity
Other terms used for criterion-related validity are statistical validity and correlational validity. It provides evidence that a test measures the specific criterion or trait for which it is designed. To determine the criterion validity of a test, the first step is to establish the criterion to be measured. Then a variety of test items are developed and tried out. The test items are then correlated with the criterion, using the Pearson correlation, to determine how well they measure the set criterion. In case a number of tests are used to measure the criterion, multiple correlational procedures are used instead of the Pearson correlation.
Criterion-related validity can be further subdivided into concurrent validity and predictive
validity.
8.2.1 Concurrent Validity
The main difference between concurrent and predictive validity is the time at which the criterion
is measured. For concurrent validity, the criterion is measured at approximately the same time as
the alternative measure. However, if the criterion being measured relates to some future time,
then it is called predictive validity.
The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, the GRE is an already standardized test measuring specific skills and knowledge. Suppose a new test is developed that claims to measure the same skills and knowledge; it is then necessary to find the concurrent validity of the new test. For this purpose, the new test and the established test are administered to some defined group of individuals at the same time. The scores obtained by the individuals on both tests are correlated to observe similarities or differences, and the coefficient of validity calculated from the correlation provides information about the concurrent validity of the new test. A high validity coefficient indicates good concurrent validity, and vice versa.
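A minimal sketch of this comparison, with invented scores for six people who took both a hypothetical new test and an established one on the same day:

```python
# Hypothetical sketch: concurrent validity as the correlation between
# scores on a new test and an established test taken at the same time.
from scipy.stats import pearsonr

new_test    = [62, 71, 55, 80, 67, 49]   # scores on the new test
established = [60, 75, 52, 83, 70, 46]   # scores on the established test

validity_coefficient, _ = pearsonr(new_test, established)
print(round(validity_coefficient, 2))    # high value -> good concurrent validity
```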
8.2.2 Predictive Validity
Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. For example, the score on an entry test serves as a predictor of an individual's future performance in a specific program: if the marks on the entry test are high, it can be predicted that the candidate will do well in future, and this bears out the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces, and the GRE and SAT tests for university performance. Likewise, medical indicators such as high body fat, high cholesterol, smoking, and hypertension are all predictive of future heart disease. It shall be kept in consideration that the predictive validity of tests like entry tests, the GRE, TOEFL, etc. may vary due to a number of factors, such as differences in the curriculum studied by students, the textbooks used for preparation, geographical location, etc. Thus, there is no such thing as perfect predictive validity, and predictions will sometimes turn out to be false: not all students who pass the GRE or an entry test go on to complete the program in which they enroll. It is therefore not advisable to rely on the score of a single test for predicting future performance; rather, several indicators shall be used, such as marks in preceding exams, interview scores, comments of professors, performance on practical skills, etc.
8.3 Construct-related validity
Construct-related validity concerns the measurement of a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: it cannot be seen, but its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity, and attitude. Tests have been developed for measuring specific constructs, and the researchers/test developers must ensure that the test they construct accurately measures the construct for which it was designed. Thus, a test aimed at measuring the level of anxiety shall not measure creativity or IQ. The test score can then be used to make decisions related to the construct. If a test is unable to measure its construct, its validity is questionable and conclusions based on its score will be meaningless and inaccurate.
The process of determining construct validity is not simple. Measuring a construct requires a strong theory that hypothesizes about the construct under study. For example, psychological theories hypothesize that individuals with higher anxiety will work longer on a problem than persons with a low anxiety level. Suppose a test measures anxiety level, some persons score higher on the test, and the same persons also work for a longer time on the task/problem under consideration; then we have ample evidence to support the theory, and thus the construct validity of the test for measuring that construct.
[Figure: Types of reliability (interclass: ANOVA, alpha, KR20; test-retest; equivalence) and types of validity (content, criterion: concurrent, predictive; construct). Source: James, Allen, James, & Dale (2005)]
Self-Assessment Questions
Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation with reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?
9. REFERENCES
1. Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.
2. Alleydog.com. Reliability coefficient. Retrieved from http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient#ixzz48EmyHlQe
3. Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.
4. Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.
5. James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. USA: Human Kinetics.
6. McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/#
7. Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/
8. Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson Education Inc.
9. Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. Retrieved from http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 

Recently uploaded (20)

GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 

Unit. 7.pdf

  • 1. Unit–7 Test Development and Qualities of a test Written by: Dr. Fayyaz Ahmad Faize
  • 2. TABLE OF CONTENTS 1. Achievement Test .................................................................................................................................3 1.1 Purposes/uses of achievement test................................................................................................5 2. Attitude Scale........................................................................................................................................6 3. Steps for Test Development..................................................................................................................8 4. Reliability............................................................................................................................................15 4.1 Reliability Coefficient.................................................................................................................16 4.2 Relationship between Validity and Reliability ...........................................................................17 5. Reliability Types.................................................................................................................................18 5.1 Test-Retest Reliability ......................................................................................................................18 5.2 Equivalence Reliability or inter-class reliability.........................................................................19 5.3 Split-Halves Reliability...............................................................................................................20 6. Factors Affecting Reliability...............................................................................................................21 7. Validity ...............................................................................................................................................23 7.1 Content-related validity...............................................................................................................24 7.2 Criterion-related validity.............................................................................................................26 7.2.1 Concurrent Validity....................................................................................................................27 7.2.2 Predictive Validity .....................................................................................................................27 7.3 Construct-related validity............................................................................................................28 Self-Assessment Questions.....................................................................................................................29 8. References...........................................................................................................................................30
1. ACHIEVEMENT TEST

Achievement tests are designed to measure accomplishment. An achievement test is usually administered at the end of some learning activity or process to ascertain the degree to which the required task has been accomplished. For example, an achievement test for a nursery-class student might assess knowledge of the English alphabet, numbers, and key science concepts. Achievement tests thus measure the degree of learning on tasks in which students have already been instructed or guided. The tasks may be specific and short, or comprehensive and detailed. An achievement test may be standardized, such as a secondary-class Chemistry test on formulae and valences or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement', which relates to the measurement of learning experiences in one or more academic areas. This usually involves a number of subtests, each aimed at measuring specific learning experiences or targets. These subtests are sometimes called achievement batteries. Such batteries may be individually administered or group administered. They may consist of a few subtests, as does the Wide Range Achievement Test-4 (Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and (new to the fourth edition) reading comprehension. An achievement battery may be as comprehensive as the STEP Series, which includes subtests in reading, vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior inventory; an educational environment questionnaire; and an activities inventory. Some batteries, such as the SRA California Achievement Tests, span kindergarten through grade 12, whereas others are grade- or course-specific. Some batteries are constructed to provide both norm-referenced and criterion-referenced analyses. Others are concurrently normed with scholastic aptitude tests to enable a comparison between achievement and aptitude. Some batteries are constructed with practice tests that may be administered several days before actual testing to help students familiarize themselves with test-taking procedures.

One popular instrument appropriate for use with persons aged 4 through adulthood is the Wechsler Individual Achievement Test-Second Edition, otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not only to gauge achievement but also to develop hypotheses about achievement versus ability. It features nine subtests that sample content in each of the seven areas listed in a past revision of the Individuals with Disabilities Education Act: oral expression, listening comprehension, written expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics reasoning.
For a particular purpose, a battery that focuses on achievement in a few select areas may be preferable to one that attempts to sample achievement in several areas. On the other hand, a test that samples many areas may be advantageous when an individual comparison of performance across subject areas is desirable. If a school or a local school district undertakes to follow the progress of a group of students as measured by a particular achievement battery, then the battery of choice will be one that spans the targeted subject areas in all the grades to be tested. If the ability to distinguish individual areas of difficulty is of primary concern, then achievement tests with strong diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas, across grades, and standardized on large, national samples of students have much to recommend them, they also have certain drawbacks. For example, such tests usually take years to develop; in the interim the items, especially in fields such as social studies and science, may become outdated. Further, any nationally standardized instrument is only as good as the extent to which it meets the (local) test user's objectives.

1.1 Purposes/uses of achievement test

i. To measure students' mastery of certain essential skills and knowledge, such as proficiency in recalling facts, understanding concepts and principles, and using skills.
ii. To measure students' growth or progress over time for promotion purposes. This helps the school in making decisions about students' placement in a specific program, class, or group, or about promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing the performance of an individual to the norm or average performance of his/her group (norm-referenced).
iv. To identify and diagnose pupils' problems. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be appreciated how achievement tests—as well as intelligence tests—could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of the teacher's instructional method.
vi. To encourage good study habits in students and motivate them to work hard.

2. ATTITUDE SCALE

An attitude may be defined formally as a presumably learned disposition to react in some characteristic manner to a particular stimulus. The stimulus may be an object, a group, an institution—virtually anything. Although attitudes do not necessarily predict behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring the attitudes of employers and employees toward each other and toward numerous variables in the workplace. As the name implies, this type of scale tries to measure an individual's beliefs, attitudes, and perceptions toward oneself, others, or some phenomenon, activity, or situation.

2.1 Measuring Attitude

Attitude can be measured using self-report, tests, and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect correctly about their attitudes and in their level of self-awareness. Moreover, some people feel reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes of which they themselves are unaware.
The measurement of attitude was addressed early by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes", which describes the design of an instrument for measuring attitude. A Likert scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree, and Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude toward a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5: for a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree (for a negatively worded statement, the scoring is reversed).
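As a concrete illustration of this scoring convention, the following is a minimal Python sketch that totals a respondent's Likert ratings, reverse-coding the negatively worded statements. The statements and responses are hypothetical.

```python
# Minimal sketch of Likert-scale scoring (hypothetical items and responses).
# Positive statements score Strongly Agree = 5 ... Strongly Disagree = 1;
# negatively worded statements are reverse-coded (6 - raw score).

OPTIONS = {"Strongly Agree": 5, "Agree": 4, "Undecided": 3,
           "Disagree": 2, "Strongly Disagree": 1}

# (statement, is_positive) pairs -- wording is illustrative only.
items = [
    ("I enjoy solving physics problems.", True),
    ("Physics lessons are a waste of time.", False),
    ("I would take another physics course.", True),
]

responses = ["Agree", "Strongly Disagree", "Undecided"]

def score_item(response, is_positive):
    raw = OPTIONS[response]
    return raw if is_positive else 6 - raw   # reverse-code negative items

total = sum(score_item(r, pos) for r, (_, pos) in zip(responses, items))
print(total)  # higher totals indicate a more favorable attitude
```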
Thurstone (1928) had also argued that attitude can be measured, in his article "Attitudes Can Be Measured". More recently, the research of Banaji (2001) further supported this contention in the article "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects" (Greenwald & Banaji, 1995, p. 8). Stated another way, they are nonconscious, automatic associations in memory that produce dispositions to react in some characteristic manner to a particular stimulus. Implicit attitudes can be measured using the Implicit Association Test (IAT), a computerized sorting task by which implicit attitudes are gauged with reference to the test taker's reaction times. For example, the individual is shown a particular stimulus and asked to quickly categorize it or associate another word or phrase with it. A person's attitude toward 'terror', for instance, can be gauged by presenting the word 'terror' and asking the person to quickly associate other favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured; implicit attitudes have been studied in relation to racial prejudice, threat, voting behavior, professional ethics, self-esteem, drug use, etc.

Measuring implicit attitude is now frequently used in consumer psychology and the study of consumer preferences. In consumer psychology, attitude may be probed by asking a series of questions about a product or choice and noting the individual's responses, which may reflect the person's beliefs or thinking. Responses can be sought through a survey or opinion poll using questionnaires, e-mails, Google forms, social media posts, etc., and the surveys and polls may be conducted face-to-face, online, by telephone interview, or by mail. Face-to-face interaction helps in getting a quicker response and in making sure the questions are well understood; moreover, the researcher can present the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the respondents' choices.

Another type of scale used to measure attitude is the semantic differential technique. In this type of scale, the respondent is given two opposite extremes and asked to place a mark in one of the seven spaces on the continuum according to his or her level of preference. The bipolar extremes might be easy-difficult, good-bad, weak-strong, etc.

Strong ____:____:____:____:____:____:____ Weak
Decisive ____:____:____:____:____:____:____ Indecisive
Good ____:____:____:____:____:____:____ Bad
Cheap ____:____:____:____:____:____:____ Expensive
3. STEPS FOR TEST DEVELOPMENT

The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test involves a number of steps. These steps are not fixed, as various authors have suggested different steps/stages for developing a test. The following are some of the general steps for test development.

1. Identification of objectives

This is one of the most important steps in developing any test, in which the test authors consider in detail what exactly they aim to measure, i.e. the purpose of the test. It is especially important to define the purpose of the test clearly, because doing so increases the possibility of achieving high validity; without predefined objectives, a test will be meaningless and purposeless. There are two kinds of objectives: behavioral and non-behavioral. As the names suggest, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177).

2. Deciding about test format

The format/design of the test is another important element in constructing a test. The test developer needs to decide which format will be the most suitable for achieving the set objectives. The format of the test may be objective type, essay type, or both. The examiner also decides what type of objective items shall be included, whether multiple-choice, fill-in-the-blanks, matching items, short answer, etc. The test author further decides the number of marks assigned to each format and the total amount of time allowed to complete the test.

3. Making a table of specifications

A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content as well as specifying the type of assessment objectives the items will be testing, and it ensures that all levels of instructional objectives are used in the test questions. The table lists the number of items from each content area, the weightage assigned to each content area, and the type of instructional objective each item will be measuring, whether recall, understanding, or application. Last but not least, the examiner shall also decide the weightage given to each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can focus on the following areas (a small allocation sketch follows this list):

• What language skills should be included – will there be a list of grammatical structures and lexis, etc.;
• What sort of tasks are required – objectively assessable, integrative, simulated "authentic", etc.;
• How many items are required for each section, and what their relative weight will be – equal weighting or extra weighting for more difficult items;
• What test methods are to be used – multiple choice, gap filling, matching, transformations, picture descriptions, essay writing, etc.;
• What rubrics are to be used as instructions for students – will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric;
• What assessment criteria will be used – how important is accuracy, spelling, length of written text, etc.
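As a rough illustration of how a table of specifications translates weightages into item counts, here is a minimal Python sketch; the content areas, weightages, and total test length are hypothetical.

```python
# Minimal sketch: turning blueprint weightages into item counts
# (hypothetical content areas and weightages).

total_items = 50

# weightage of each content area, as a fraction of the whole test
content_weights = {"Composition": 0.30, "Comprehension": 0.25,
                   "Grammar": 0.25, "Vocabulary": 0.20}

# weightage of each instructional objective within every content area
objective_weights = {"Knowledge": 0.30, "Comprehension": 0.40,
                     "Application": 0.30}

for area, w_area in content_weights.items():
    area_items = round(total_items * w_area)
    row = {obj: round(area_items * w_obj)
           for obj, w_obj in objective_weights.items()}
    print(area, row, "total:", area_items)
# Rounded cell counts may need minor manual adjustment so that
# rows and columns still sum to the intended totals.
```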
4. Writing Items

The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. The items shall progress from simple to difficult; however, it is debatable whether items should be arranged randomly or from easy to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief, and lucid, and should be checked for grammar, spelling, and punctuation.

5. Preparation of Marking Scheme

The test developer decides the number of marks to be assigned to each item, or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay-type questions can be divided into smaller components, with the marks defined for each important concept/point.

As regards developing a standardized test, the following stages are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers, and recruiters. The process encompasses five stages:

1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision

The process of test development starts from conceptualizing the idea of the test and the purpose for which the test has to be constructed. A test may be designed around some emerging phenomenon, problem, issue, or need. Test conceptualization might also include the construct or the concepts that the test should measure.
Several questions arise at this stage: What kind of objectives or behavior should the test measure in the presence of other such tests? Is there any need for making a new test, or can an existing test be used for the set purpose? How can the new test be better than the existing test? Who will be the users of the test: students, teachers, or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral, or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test and, if so, on whom?

Based on the purpose, the needs, and the objectives to be achieved, the items for the test are constructed or selected. The test is then pilot tested on a sample to try out whether the items in the test are appropriate for achieving the set objectives. Based on the results from the test tryout or pilot test, the items in the test are put to item analysis. This requires the use of statistical procedures for determining the difficulty level of the items, reliability, and validity. This process helps in selecting the appropriate items for the test, while the inappropriate items may be revised or deleted. This finally helps in making a revised draft of the test better than the initial version. The process may be repeated until a refined, standardized version is available.
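To make the item-analysis stage more concrete, below is a minimal Python sketch, under assumed tryout data, that computes two classical item statistics: the difficulty index (proportion of examinees answering correctly) and a simple upper-lower discrimination index. The score matrix is hypothetical.

```python
# Minimal item-analysis sketch (hypothetical tryout data).
# rows = examinees, columns = items; 1 = correct, 0 = incorrect

scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

n_items = len(scores[0])

# Difficulty index p: proportion of examinees answering the item correctly.
p = [sum(row[i] for row in scores) / len(scores) for i in range(n_items)]

# Discrimination d: compare the top and bottom halves on total score.
ranked = sorted(scores, key=sum, reverse=True)
half = len(ranked) // 2
upper, lower = ranked[:half], ranked[half:]
d = [sum(r[i] for r in upper) / half - sum(r[i] for r in lower) / half
     for i in range(n_items)]

for i in range(n_items):
    print(f"item {i + 1}: difficulty p = {p[i]:.2f}, discrimination d = {d[i]:.2f}")
# Items with very extreme p or with low/negative d are candidates
# for revision or deletion in the test-revision stage.
```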
4. QUALITIES OF A GOOD TEST

In constructing a test, the test developer should aim at making a good test. A bad test may spoil the purpose of testing and thus be useless to administer. According to Mohamed (2011), a good test should have the following properties.

Objectivity

It is very important for a test to be objective. A test with higher objectivity eliminates personal biases and influences in scoring and in interpreting the test result. This can be achieved by including more objective-type items in the test, such as multiple-choice questions, fill-in-the-blanks, true-false, matching items, short question-answers, etc. In contrast, essay questions are subjective: different examiners may arrive at different marks while checking such questions, depending upon the examiner's mood, knowledge level, and personal likes and dislikes. However, essay-type questions can be made more objective through a well-defined marking scheme that assigns marks to the small bits of important and relevant information in the long answers.

Comprehensiveness

A good test should cover the content area that has been taught. The items in the test should be drawn from the different areas of the course content; if one topic or area is assigned more items and the other areas are neglected, the test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary, etc. Meanwhile, due importance may be given to the important parts of the content according to their utility and significance.
Validity

A valid test rightly measures what it is supposed to measure; it tests what it ought to test. For example, a good test that measures control of grammar should contain no difficult lexical items. Validity is explained in detail in the validity section.

Reliability

Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In that case the test is said to provide consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.

Discriminating Power

The discriminating power of a test is its power to discriminate between the upper and lower groups who took the test. Thus, a good test should not contain only difficult items or only easy items; rather, it should contain items of different difficulty levels so as to distinguish students of different ability levels. The questions should increase progressively in difficulty to reduce stress and tension in students.

Practicability

The test should be realistic and practicable. It should not measure unrealistic targets or objectives. It should be easy to administer as well as easy to score, and it should be economical, not wasting too many resources, energies, and efforts. Tests may be competitive and sometimes difficult to complete within the stipulated time when their specific purpose is to select candidates with a higher IQ level and shorter reaction time. Otherwise, classroom tests shall keep in mind the individual differences of students and provide ample opportunity for completion.

Simplicity

Simplicity refers to clarity of language, correctness, adequacy, and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what each question is asking and how to answer it. Sometimes students get confused about the possible answers due to lack of clarity in the questions.

5. RELIABILITY

According to Gay, Mills, and Airasian (2011), "Reliability is the degree to which a test consistently measures whatever it measures". Thorndike (2005) refers to reliability as the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing". Reliability also signifies the repeatability of observations or scores on a measurement. Other terms used to define reliability include dependability, stability, accuracy, and regularity in measurement. For a test, high reliability means that a person gets the same, or nearly the same, score each time the test is administered to that person. If the person obtains a different score each time the test is administered, the test's reliability will be questioned.
Reliability can be ascertained by the examiner by administering the same test on two different occasions. The scores obtained on the two occasions may be compared to determine the degree of reliability. Another method is to test students on one test and then administer another, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference in the students' scores on the two tests, then the two tests have poor reliability. Essay-type questions may have poor reliability, as students may get a different score each time the answers are marked; multiple-choice questions have comparatively higher reliability than essay-type questions. A test may not be reliable in all settings: a test may be reliable in a specific situation, under specific circumstances, and with a specific group of subjects, yet not be reliable in a different situation or with a different group of students under different circumstances.

5.1 Reliability Coefficient

As with physical measurement, when different tests are used to ascertain reliability it may be difficult to achieve 100% consistency in scores; what matters is the degree of closeness or consistency among the measurements of the different tests. For this purpose, the degree of reliability of a test is expressed numerically as the reliability coefficient. According to the Merriam-Webster dictionary, a reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures. The reliability coefficient is thus a way of confirming how accurate a test or measure is; it essentially measures consistency in scoring. The reliability coefficient is found by giving the test to the same subjects more than once and determining the correlation between the two sets of scores. This also reveals the strength of the relationship and the similarity between the two sets of scores. If the two scores are close enough, the test can be said to be accurate and to have good reliability. The variation in the scores is called error variance, and the source of variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa.

As an example, an individual could be given a measure of self-esteem and then given the same measure again. The two scores would be correlated and the reliability coefficient produced. If the scores are very similar, the two administrations can be said to be reliably and consistently measuring the same thing, in this case self-esteem.

The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability. In an actual situation, however, it is not possible to have a perfectly reliable test; the coefficient of reliability will always be less than 1.00. The reason is the effect of various factors and errors in measurement. These include errors caused by the test itself, such as ambiguous test items that are interpreted differently by students. Differences in students' conditions (emotional, physical, mental, etc.) are also responsible for producing errors in measurement, for instance fatigue, the arousal of specific emotions such as anger, fear, or depression, and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording, etc. may also result in measurement error, thus affecting the reliability coefficient.

5.2 Relationship between Validity and Reliability

A test which is valid is also reliable; however, a test which is reliable is not necessarily valid. If a test is valid, it is rightly measuring the purpose/objectives it is supposed to measure. The score obtained on such a test is also reliable, because the test is rightly measuring its intended purpose and the score will therefore be consistent, whether lower or higher.
In comparison, a test may be reliable, meaning that the students' scores come out consistently the same, yet not be rightly measuring its intended purpose, and thus be invalid. Hence a test which is reliable may or may not be valid, but a test which is valid must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is it really measuring the set objectives from the given content? If it measures its intended purpose, then the test is also valid; if it is not measuring the concepts from the given content, it is invalid. [For more detail see Gay, Mills, & Airasian (2011).]

6. RELIABILITY TYPES

Some types of reliability are given below:

• Test-Retest Reliability
• Equivalence Reliability or inter-class reliability
• Split-Halves Reliability

6.1 Test-Retest Reliability

One of the simplest ways to determine the reliability of a test is test-retest. Test-retest reliability is the degree to which scores on a test are consistent over time. The subjects are given the same test on two occasions, and the scores obtained are compared to see the consistency between the two sets of scores. This can be quantified by measuring the correlation between the two sets of scores: if the correlation coefficient is high, the test has a high degree of test-retest reliability. This method is seldom used by subject teachers but is frequently used by test developers and commercial test publishers, such as those of the IELTS, TOEFL, and GRE.

One issue that arises here is how much time should elapse between the two administrations. If the time interval is short, say a few hours or days, the chances of students remembering their previous answers are high; they will then tend to score the same, which inflates the reliability coefficient. If the duration is long, the ability to perform well on the test increases due to learning over time, again affecting the reliability coefficient. Thus, in reporting test-retest reliability, the time interval between the two administrations should be mentioned along with the reliability coefficient. This kind of reliability is ensured for aptitude and achievement tests so that they measure their intended purpose each time they are administered. A small computational sketch follows.
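The following minimal Python sketch, using hypothetical scores for the same five students on two administrations, computes the Pearson correlation that would serve as the test-retest reliability coefficient.

```python
# Test-retest reliability as a Pearson correlation (hypothetical scores).
import math

first = [42, 37, 45, 30, 39]    # scores on the first administration
second = [40, 35, 46, 28, 41]   # scores on the retest

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"test-retest reliability r = {pearson(first, second):.2f}")
# An r close to 1.00 indicates that scores are consistent over time.
```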
6.2 Equivalence Reliability or Inter-class Reliability

Equivalence reliability relates to two tests that are similar in every aspect except the test items. The correlation between the two tests is measured; if the resulting coefficient, known in this case as the coefficient of equivalence, is high, the two tests are highly reliable, and vice versa. It shall be kept in consideration that the two tests must measure the same variables and have the same number of items, structure, and difficulty level. Besides, the directions for administering the two tests shall also be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which version a student takes, the score should be the same on both. This approach is usually used in situations where the number of candidates is very large or a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without the fear of test items leaking or being repeated. In some circumstances, researchers construct equivalent pre-tests and post-tests to measure the actual difference in performance, removing the measurement error that arises from recalling the answers given on the first test.

The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level, etc. One form of the test is administered to an appropriate group; after some time, the second form is administered to the same group. The scores obtained by the students on both tests are then correlated to find the coefficient of reliability. The difference in the scores obtained by students is treated as error.

6.3 Split-Halves Reliability

This type of reliability is used for measuring internal consistency among the items within a single test. It is theoretically the same as finding equivalence reliability; here, however, the two parts are taken from the same test. This reliability can be found by administering the test only once, so the effect/error caused by the time interval, the students' condition (physical, mental, emotional, etc.), or the use of two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts. The two parts can be obtained by various methods, e.g. dividing the items into two halves with an equal number of items in each, or splitting the items into odd-numbered and even-numbered halves.

If the test is divided into odd- and even-numbered items, the reliability is calculated as follows. First, the test is administered to the subjects and the items are marked. The items are divided into two halves by combining all the odd-numbered items in one half and the even-numbered items in the other. The scores obtained on the odd- and even-numbered items are totaled separately.
There are thus two sets of scores for each student: the score on the odd-numbered items and the score on the even-numbered items. The two sets of scores are then correlated using the Pearson product-moment correlation coefficient. If the value of the correlation coefficient is high, the two halves of the test are highly reliable, and vice versa.

The reliability coefficient obtained from this correlation needs to be adjusted/corrected, as it applies to a test that has been divided into two (split halves); the actual reliability of the whole test is higher. It is computed using the Spearman-Brown prophecy formula:

r_total = (2 × r_half) / (1 + r_half)

Suppose the reliability coefficient for a 40-item test, obtained by correlating the scores on the 20 odd and 20 even items, is .70. The reliability coefficient for the whole test (40 items) is then:

r_total = 2(.70) / (1 + .70) = 1.40 / 1.70 ≈ .82

The advantage of split-halves reliability is that the test is administered only once. Thus, it can be economically and conveniently used by classroom teachers and researchers to collect data about a test. The sketch below puts the whole procedure together.
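Here is a minimal Python sketch of the full split-halves procedure under hypothetical per-item scores: split into odd/even halves, correlate the half scores, then apply the Spearman-Brown correction. The Pearson computation is the same as in the earlier test-retest sketch.

```python
# Split-halves reliability with Spearman-Brown correction (hypothetical data).
import math

# rows = students, columns = items; 1 = correct, 0 = incorrect
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
]

# Total each student's odd-item and even-item scores (with 0-based indexing,
# columns 0, 2, 4, ... hold the odd-numbered items of the test).
odd = [sum(row[0::2]) for row in scores]
even = [sum(row[1::2]) for row in scores]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_half = pearson(odd, even)
r_total = 2 * r_half / (1 + r_half)   # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, whole-test r = {r_total:.2f}")
```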
7. FACTORS AFFECTING RELIABILITY

Fatigue: The scores obtained on a test by subjects under different conditions may differ, and fatigue plays an important role in affecting the test score. A fatigued or tired student will generally score lower, so fatigue generally decreases the reliability coefficient.

Practice: The reliability of a test can be affected by the amount of practice. As the saying goes, practice makes perfect; practice on a test will likewise improve students' scores, so greater practice tends to increase the reliability coefficient.

Subject variability: The variation in scores increases when there is more variability among the subjects in a group. The greater the differences among subjects in gender, age, program, interests, etc., the greater the variation in scores among individuals. Conversely, if a group is more homogeneous, such as a group of students within the same IQ range, the variation in scores will be smaller.

Test length: The length of a test and the number of items affect its reliability. A test with a greater number of items usually gives more reliable scores, owing to the cancelling out of random positive and negative errors within the test. Thus, adding more items to a test increases its reliability, and deleting items lowers it. One technique for deleting items without decreasing reliability is to remove the items with the lowest reliability values in the item analysis. The Spearman-Brown prophecy formula can also be used for estimating the reliability of a test that is made shorter or longer, provided the original reliability is known:

r_new = (K × r) / (1 + (K - 1) × r)

where
r = reliability of the original test,
r_new = predicted reliability of the test with items added or deleted,
K = ratio of the number of items in the new test to the number of items in the original test.

For example, if a test's original reliability is .60 and the test is doubled in length (K = 2), the predicted reliability is r_new = (2 × .60) / (1 + (2 - 1) × .60) = 1.20 / 1.60 = .75.
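The length-adjustment form of the formula is easy to express as a small Python helper; the first call simply reproduces the worked example above, and the second shows a hypothetical shortening case.

```python
# Spearman-Brown prophecy formula for a lengthened or shortened test.

def spearman_brown(r, k):
    """Predicted reliability when the test is made k times as long."""
    return (k * r) / (1 + (k - 1) * r)

print(spearman_brown(0.60, 2.0))   # doubling the test: 0.75
print(spearman_brown(0.60, 0.5))   # halving the test (hypothetical): ~0.43
```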
8. VALIDITY

Validity refers to the extent to which a test measures what it is supposed to measure; in other words, it refers to the degree to which a test pertains to its objectives. Thus, for a measure or test to be valid, it must consistently measure the particular trait, characteristic, or ability for which it was constructed. According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be derived from a test, then that test has greater validity for measuring that specific inference. Cohen and Swerdlik (2010) defined validity as "a judgment based on evidence about the appropriateness of inferences drawn from test scores", an inference being a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic.

A test may be valid for a particular group and for a particular purpose, yet not be valid for another group or for a different purpose. A test of English grammar may be valid for a high-school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses, and for all users (Cohen & Swerdlik, 2010). Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage; if those boundaries are exceeded, the validity of the test may be called into question (Cohen & Swerdlik, 2010).

The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase which the test developer has to undertake with the test takers for a specific purpose. The test developer should report the validity evidence in the test manual for the users/readers.
However, test users may sometimes conduct their own studies to check validation with their own test takers; this is usually called local validation. Some types of validity are:

1. Content-related validity
2. Criterion-related validity
3. Construct-related validity

8.1 Content-related validity

Content validity is sometimes also referred to as face validity or logical validity. According to Gay, Mills, & Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear. Face validity is a quick way of ascertaining whether a test looks/appears to measure what it purports to measure: a primary-class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact way of estimating content validity and is used only as a quick initial screening.

In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math items not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the items adequately sample the relevant content area, with the proportion of items from the various units kept in accordance with their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportion of test items accords with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores the other chapters, it will have poor sampling validity.

Content validity can be judged by a content expert, the relevant subject teacher, and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgement about the test's content validity. A good way of ensuring content validity is to make a table of specifications that includes the total number of units/topics to be tested, the number of items from each unit/topic, and the different domains of instructional objectives. The table of specifications makes it easy to see the units from which most of the items are drawn, as well as the units that are under-represented or ignored.

Consider a secondary-grade Physics test drawn from five chapters, as given in the table of specifications below. The names of the units are listed along with the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to follow the given proportions strictly: the examiner decides which instructional objectives shall be given more or less weightage for each unit, while ensuring that there is no great imbalance in the weightage assigned to each objective. Thus, some units may require more focus on application, while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.

Table 2. Table of specifications for a Physics test covering five units

Course content      Knowledge (30%)   Comprehension (40%)   Application (30%)   Total
Forces                     3                  5                     2             10
Energy sources             3                  4                     3             10
Turning Effect             2                  4                     4             10
Kinematics                 3                  3                     4             10
Atomic Structure           4                  4                     2             10
Total                     15                 20                    15             50
  • 26. comprehension. The objective is to rightly measure the skill that the examiner wants to measure. Table 2. Table of specification of Physics test from five units Course content Knowledge (30%) Comprehension (40%) Application (30) Total Forces 3 5 2 10 Energy sources 3 4 3 10 Turning Effect 2 4 4 10 Kinematics 3 3 4 10 Atomic Structure 4 4 2 10 Total 15 20 15 50 8.2 Criterion-related validity Other terms used for criterion-related validity is statistical validity or correlational validity. It provides evidence that a test items measures a specific criterion or trait for which it is designed. In order to determine criterion validity of a test, the first step is to establish the criterion to be measured. Then a variety of test items are developed and then tested. The test items are then correlated with the criterion to determine how well are these items measuring the set criterion through finding Pearson correlation. In case, a number of test are used to measure the criterion, then multiple correlational procedures are used instead of Pearson correlation.
  • 27. Criterion-related validity can be further subdivided into concurrent validity and predictive validity. 8.2.1 Concurrent Validity The main difference between concurrent and predictive validity is the time at which the criterion is measured. For concurrent validity, the criterion is measured at approximately the same time as the alternative measure. However, if the criterion being measured relates to some future time, then it is called predictive validity. The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, GRE is an already standardized test for measuring some specific skills and knowledge. Suppose a new test is developed that claims to be measuring the same skills and knowledge, then it is necessary to find the concurrent validity of the new test. For this purpose, the new test and the already established test will be administered to some defined group of individuals at the same time. The scores obtained by individuals on both the test is correlated to observe for similarity or differences. The coefficient of validity can be calculated from correlation which will provide information about the concurrent validity of the new test. A high value of validity coefficient indicates a good concurrent validity and vice versa. 8.2.2 Predictive Validity It is the degree to which a test can predict about the future performance of an individual. It is often used for selecting or grouping individuals. The score on entry test serves as predictive validity about future performance of individuals in a specific program. If the marks on the entry test is high, then it can be predicted that the candidate will do well in future thus ascertaining predictive validity of the entry test. Such test may include ISSB test for entrance to armed forces, GRE test and SAT test for university performance. Likewise, medical test reports such as high
8.2.2 Predictive Validity
Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. The score on an entry test serves as a predictor of the future performance of individuals in a specific programme: if the marks on the entry test are high, it can be predicted that the candidate will do well later, thus ascertaining the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces and the GRE and SAT tests for university performance. Likewise, medical indicators such as high body fat, high cholesterol, smoking and hypertension are all predictive of future heart disease.

It shall be kept in consideration that the predictive validity of tests like entry tests, the GRE, the TOEFL etc. may vary due to a number of factors, such as differences in the curriculum studied by students, the textbooks used for preparation, the geographical location etc. Thus, there is no such thing as perfect predictive validity, and predictions will sometimes turn out to be false: not all students who pass the GRE or an entry test will successfully complete the programme in which they enrol. It is therefore not advisable to rely on the score of a single test for predicting future performance; rather, several indicators shall be used, such as marks in preceding examinations, interview scores, comments of professors, performance on practical skills etc.
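The same idea can be sketched for predictive validity by correlating entry-test scores with a later criterion such as first-year GPA. The data below are invented placeholders, and the use of scipy's linregress is an illustrative choice, not a prescribed method.

```python
# Predictive validity sketch: correlate entry-test scores with a later
# criterion (here, first-year GPA). Data below are invented placeholders.
from scipy.stats import linregress

entry_test = [55, 78, 62, 90, 70, 84, 66, 73]
later_gpa  = [2.4, 3.1, 2.6, 3.7, 2.9, 3.3, 2.7, 3.0]

fit = linregress(entry_test, later_gpa)
print(f"predictive validity coefficient r = {fit.rvalue:.2f}")

# The regression line can give a rough forecast, but as noted above a
# single test score should never be the sole basis for such decisions.
predicted = fit.slope * 75 + fit.intercept
print(f"predicted GPA for an entry score of 75: {predicted:.2f}")
```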
8.3 Construct-related validity
Construct-related validity is used when a test measures a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: although it cannot be seen directly, its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity and attitude. Tests have been developed for measuring such specific constructs, and the researchers/test developers must ensure that the test they construct accurately measures the specific construct for which it was designed. Thus, a test aimed at measuring the level of anxiety shall not measure creativity or IQ. The test score can then be used to make decisions related to the construct. If a test is unable to measure its construct, its validity is questionable and any conclusion based on its score will be meaningless and inaccurate.

The process of determining construct validity is not simple. The measurement of a construct requires a strong theory that makes hypotheses about the construct under study. For example, psychological theories hypothesize that individuals with a higher anxiety level will work longer on a problem than persons with a low anxiety level. Suppose a test measures anxiety level, some persons score higher on this test, and the same persons also work for a longer time on the task/problem under consideration; then we have evidence supporting the theory and thus the construct validity of the test for measuring that construct.
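A minimal Python sketch of this kind of evidence-gathering, using fabricated data for the anxiety example above:

```python
# Construct validity sketch: check the theory-driven prediction that higher
# anxiety scores go with longer time spent on a problem. Placeholder data.
from scipy.stats import pearsonr

anxiety_score   = [12, 25, 18, 30, 22, 15, 28, 20]
minutes_on_task = [14, 26, 17, 33, 24, 15, 30, 21]

r, p = pearsonr(anxiety_score, minutes_on_task)
print(f"r = {r:.2f}, p = {p:.3f}")
# A strong positive correlation consistent with the theory is one piece of
# evidence for the construct validity of the anxiety test.
```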
[Figure: Validity and Reliability. A diagram showing types of reliability (interclass: test-retest and equivalence; ANOVA, alpha, KR-20) and types of validity (content, criterion with its concurrent and predictive forms, and construct). Source: James, Allen, James, & Dale (2005)]

Self-Assessment Questions
Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation with reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?

9. REFERENCES
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.
Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.
James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. USA: Human Kinetics.
McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/#
Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/
Reliability Coefficient. (n.d.). Retrieved from http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient#ixzz48EmyHlQe
Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson Education Inc.
Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. Retrieved from http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf