Unit–7
Test Development and Qualities of a test
Written by:
Dr. Fayyaz Ahmad Faize
TABLE OF CONTENTS
1. Achievement Test
   1.1 Purposes/uses of achievement tests
2. Attitude Scale
   2.1 Measuring Attitude
3. Steps for Test Development
4. Qualities of a Good Test
5. Reliability
   5.1 Reliability Coefficient
   5.2 Relationship between Validity and Reliability
6. Reliability Types
   6.1 Test-Retest Reliability
   6.2 Equivalence Reliability or Inter-Class Reliability
   6.3 Split-Halves Reliability
7. Factors Affecting Reliability
8. Validity
   8.1 Content-related Validity
   8.2 Criterion-related Validity
       8.2.1 Concurrent Validity
       8.2.2 Predictive Validity
   8.3 Construct-related Validity
Self-Assessment Questions
9. References
OBJECTIVES
After studying this chapter, the students will be able to:
• Describe achievement tests and attitude scales
• Explain the steps involved in test development
• Describe the qualities of a good test
• Define and interpret reliability and validity
• Discuss how to determine the reliability and validity of tests
• Understand the relationship between reliability and validity
• Understand the basic kinds of validity evidence
• Interpret various expressions of validity
• Recognize what factors affect validity
1. ACHIEVEMENT TEST
Achievement tests are designed to measure accomplishment. Usually, such a test is conducted at the end of some learning activity/process to ascertain the degree to which the required task has been accomplished. For example, an achievement test for students of a Nursery class might assess knowledge of the English alphabet, numbers, and key science concepts. Thus, achievement tests help in measuring the degree of learning on tasks that have already been taught. The tasks may be specific and short, or they may be comprehensive and detailed. An achievement test may be standardized, such as a secondary-class Chemistry test on formulae and valences or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement'. This relates to the measurement of learning experiences in one or more academic areas. It would usually involve a number of subtests, each aimed at measuring some specific learning experience/target. These subtests are sometimes called achievement batteries. Such batteries may be individually administered or
group administered. They may consist of a few subtests, as does the Wide Range Achievement
Test-4 (Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and
(new to the fourth edition) reading comprehension.
An achievement battery may be as comprehensive as the STEP Series, which includes subtests in reading,
vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior
inventory; an educational environment questionnaire; and an activities inventory. Some
batteries, such as the SRA California Achievement Tests, span kindergarten through grade 12,
whereas others are grade or course-specific. Some batteries are constructed to provide both
norm-referenced and criterion-referenced analyses. Others are concurrently normed with
scholastic aptitude tests to enable a comparison between achievement and aptitude. Some
batteries are constructed with practice tests that may be administered several days before actual
testing to help students familiarize themselves with test-taking procedures. One popular instrument, appropriate for use with persons aged 4 through adulthood, is the Wechsler Individual Achievement Test-Second Edition, otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not only to gauge achievement but also to develop hypotheses about achievement versus ability. It features nine subtests that sample content in each of the seven areas listed in a past revision of the Individuals with Disabilities Education Act: oral expression, listening comprehension, written expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics reasoning.
For a particular purpose, a battery that focuses on achievement in a few select areas may be
preferable to one that attempts to sample achievement in several areas. On the other hand, a test
that samples many areas may be advantageous when an individual comparison of performance
across subject areas is desirable. If a school or a local school district undertakes to follow the
progress of a group of students as measured by a particular achievement battery, then the battery
of choice will be one that spans the targeted subject areas in all the grades to be tested. If the ability to distinguish individual areas of difficulty is of primary concern, then achievement tests with strong diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas, across grades, and standardized on large, national samples of students have much to recommend them, they also have certain drawbacks. For example, such tests usually take years to develop; in the interim the items, especially in fields such as social studies and science, may become outdated. Further, any nationally standardized instrument is only as good as the extent to which it meets the (local) test user's objectives.
1.1 Purposes/uses of achievement tests
i. To measure students’ mastery of certain essential skills and knowledge, such as
proficiency in recalling facts, understanding concepts, principles and use of skills
ii. To measure students' growth/progress over time for promotion purposes. This is helpful to the school in making decisions about students' placement in a specific program, class, or group, or for promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing performance of an individual
to the norm or average performance of his/her group (norm referenced)
iv. To identify pupils' problems and diagnose them. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be appreciated how achievement tests, as well as intelligence tests, could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of teacher's instructional method
vi. To encourage good study habits in the students and motivate them to work hard.
2. ATTITUDE SCALE
An attitude may be defined formally as a presumably learned disposition to react in
some characteristic manner to a particular stimulus. The stimulus may be an object,
a group, an institution—virtually anything. Although attitudes do not necessarily predict
behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring
the attitudes of employers and employees toward each other and toward numerous
variables in the workplace. As the name implies, an attitude scale tries to measure an individual's beliefs, attitudes, and perceptions towards oneself, others, or some phenomenon, activity, or situation.
2.1 Measuring Attitude
Attitude can be measured using self-report instruments, tests, and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect about their attitudes and in their level of self-awareness. Moreover, some people are reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes that they themselves are not aware of.
An early approach to measuring attitude was described by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes", which relates to designing an instrument that helps in measuring attitude. A Likert scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree, Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude about a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5: for a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree (the scoring is reversed for a negative statement).
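As a minimal sketch of how such scoring might be implemented (the item wording, responses, and the reverse-keying of the third statement are invented for illustration):

```python
# Hypothetical sketch of Likert-scale scoring with reverse-coding
# for negatively worded statements.
SCORES = {"Strongly Agree": 5, "Agree": 4, "Undecided": 3,
          "Disagree": 2, "Strongly Disagree": 1}

def score_item(response: str, positive: bool = True) -> int:
    """Return a 1-5 score; reverse the scale for negative statements."""
    raw = SCORES[response]
    return raw if positive else 6 - raw   # 5 -> 1, 4 -> 2, and so on

# One respondent's answers to a three-statement scale
# (the third statement is negatively worded).
responses = [("Strongly Agree", True), ("Agree", True), ("Disagree", False)]
print(sum(score_item(r, pos) for r, pos in responses))  # 5 + 4 + 4 = 13
```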
Earlier, Thurstone (1928) had argued in his article "Attitudes Can Be Measured" that attitude is indeed measurable. More recently, the research of Banaji (2001) further supported this contention in the article "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate
favorable or unfavorable feeling, thought, or action toward social objects” (Greenwald & Banaji,
1995, p.8). Stated another way, they are nonconscious, automatic associations in memory that
produce dispositions to react in some characteristic manner to a particular stimulus.
Implicit attitudes can be measured using the Implicit Association Test (IAT), a computerized sorting task by which implicit attitudes are gauged with reference to the test taker's reaction times. For example, the individual is shown a particular stimulus and is asked to categorize it or associate another word or phrase with it without taking much time. The attitude of a person toward 'terror', say, can be gauged by presenting the word and asking the person to quickly associate other favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured; they have been studied in relation to racial prejudice, threats, voting behavior, professional ethics, self-esteem, drug use, etc. Measuring implicit attitude is now frequently used in consumer psychology to study consumer preferences. In consumer psychology, attitudes may be found by asking a series of questions about a product or choice and noting the responses, which may reflect the beliefs or thinking of the individual. The responses can be sought through a survey or opinion poll using questionnaires, emails, Google forms, social media posts, etc. Surveys and polls may be conducted face-to-face, online, by telephone interview, or by mail. Face-to-face interaction helps in getting a quicker response and in making sure the questions are well understood. Moreover, the researcher can present or show the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the respondents' choices.
Another type of scale for measuring attitude is the semantic differential technique. In this type of scale, the respondent is given two opposite extremes and asked to place a mark in one of the 7 spaces on the continuum according to his/her level of preference. The two bipolar extremes might be easy-difficult, good-bad, weak-strong, etc.
Strong ____:____:____:____:____:____:____ Weak
Decisive ____:____:____:____:____:____:____ Indecisive
Good ____:____:____:____:____:____:____ Bad
Cheap ____:____:____:____:____:____:____ Expensive
3. STEPS FOR TEST DEVELOPMENT
The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test involves a series of steps; however, these steps are not fixed, as various authors have suggested different steps/stages for developing a test. The following are some of the general steps for test development.
1. Identification of objectives
This is one of the most important steps in developing any test: the test authors need to consider in detail what exactly they aim to measure, i.e. the purpose of the test. It is especially important to define the purpose of the test clearly, because that increases the possibility of achieving high validity; without predefined objectives, a test will be meaningless and purposeless. There are two kinds of objectives: behavioral and non-behavioral. As the names suggest, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177).

2. Deciding about test format
The format/design of the test is another important element in constructing a test. The test developer needs to decide which format/design will be most suitable for achieving the set objectives. The format of the test may be objective type, essay type, or both. The examiner also decides what type of objective items shall be included, whether multiple-choice, fill in the blanks, matching items, short answer, etc. The test author further decides the number of marks assigned to each format and the total amount of time to complete the test.
3. Making a table of specifications
A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content, as well as specifying the type of assessment objectives that the items will be testing. The table ensures that all levels of instructional objectives are used in the test questions.
The table lists the number of items from each content area, the weightage assigned to each content area, and the type of instructional objective the items will be measuring, whether recall, understanding, or application. Last but not least, the examiner shall also decide the weightage of each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can consider the following areas:
• What language skills should be included – will there be a list of grammatical structures and lexis, etc.;
• What sort of tasks are required – objectively assessable, integrative, simulated
“authentic”, etc.;
• How many items are required for each section, and what their relative weight will be – equal weighting or extra weighting for more difficult items;
• What test methods are to be used – multiple choice, gap filling, matching,
transformations, picture descriptions, essay writing, etc.;
• What rubrics are to be used as instructions for students – will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric;
• What assessment criteria will be used – how important is accuracy, spelling, length of
written text, etc.
4. Writing Items
The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. The items should progress from simple to difficult; however, it is debatable whether items should be arranged randomly or from easy to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief, and lucid, and should be checked for grammar, spelling, and punctuation.

5. Preparation of Marking Scheme
The test developer decides the number of marks to be assigned to each item or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay-type questions can be divided into smaller components, with marks defined for each important concept/point.

As regards developing a standardized test, the following steps are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers, and recruiters. The process encompasses five stages:
1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision
The process of test development starts with conceptualizing the idea of the test and the purpose for which the test is to be constructed. A test may be designed around some emerging phenomenon, problem, issue, or need. Test conceptualization also includes the construct or concepts the test should measure. What kind of objectives or behavior should the test measure, given the other such tests that exist? Is there any need for a new test, or can an existing test be used for the set purpose? How can the test be better than the existing tests? Who will be the users of the test: students, teachers, or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral, or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test? If yes, then on whom?
Based on the purpose, needs and the objectives to be achieved, the items for the test are
constructed/selected. The test is then pilot tested on a sample to try out whether the items in the
test are appropriate for achieving the set objectives. Based on the result from the test tryout or
pilot test, the items are subjected to item analysis. This requires the use of statistical procedures to determine the difficulty level of the items, reliability, and validity. This process helps in selecting appropriate items for the test, while inappropriate items may be revised or deleted. The result is a revised draft of the test, better than the initial version. The process may be repeated until a refined and standardized version is available.
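As a rough sketch of the item analysis described above, the following computes two common statistics on hypothetical tryout data: item difficulty (the proportion of examinees answering an item correctly) and a simple discrimination index (the difference in difficulty between high and low scorers); the data and group sizes are invented for illustration:

```python
# Hypothetical item-analysis sketch: rows are examinees, columns are
# items (1 = correct, 0 = incorrect).
import numpy as np

scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])

difficulty = scores.mean(axis=0)        # proportion correct per item

# Simple discrimination index: p(upper group) - p(lower group),
# with groups formed from total scores (top and bottom thirds).
totals = scores.sum(axis=1)
order = np.argsort(totals)
n = len(scores) // 3
lower, upper = scores[order[:n]], scores[order[-n:]]
discrimination = upper.mean(axis=0) - lower.mean(axis=0)

print("difficulty:    ", difficulty)      # near 0 = hard, near 1 = easy
print("discrimination:", discrimination)  # low values flag weak items
```

Items with very extreme difficulty or near-zero discrimination would be the candidates for revision or deletion mentioned above.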
4. QUALITIES OF A GOOD TEST
In constructing a test, the test developer should aim at making a good test. A bad test may spoil
the purpose of testing and thus be useless to administer. According to Mohamed (2011), a good test should have the following properties.

Objectivity
It is very important for a test to be objective. A test with high objectivity eliminates personal biases and influences in scoring and in interpreting the test result. This can be achieved by including more objective-type items in the test: multiple-choice questions, fill in the blanks, true-false, matching items, short question-answers, etc. In contrast, essay questions are subjective; different examiners may arrive at different scores when checking them, depending on the examiner's mood, knowledge level, and personal likes and dislikes. However, essay-type questions can be made more objective through a well-defined marking scheme that allocates marks to the small bits of important and relevant information in the long answers.
Comprehensiveness
A good test should cover the content area that was taught. The items in the test should come from different areas of the course content; if one topic is assigned many items while other areas are neglected, the test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary, etc. At the same time, due importance should be given to the important parts of the content according to their utility and significance.
Validity
Validity means that a test rightly measures what it is supposed to measure: it tests what it ought to test. A good test that measures control of grammar should have no difficult lexical items. Validity is explained in detail in the validity section.
Reliability
Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In this case, the test is said to provide consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.
Discriminating Power
Discriminating power is the test's ability to discriminate between the upper and lower groups who took the test. A good test should not contain only difficult items or only easy items; rather, it should contain items of different difficulty levels to distinguish students of different ability levels. The questions should progressively increase in difficulty to reduce stress and tension in students.
Practicability
The test should be realistic and practicable. It should not measure unrealistic targets or objectives. The test should be easy to administer as well as easy to score, and it should be economical, not wasting too many resources, too much energy, or too much effort. Competitive tests may sometimes be difficult to complete within the stipulated time, since their specific purpose may be to select candidates with higher ability and faster reaction times. Otherwise, classroom tests shall keep in mind the individual differences of students and provide ample opportunity for completion.
Simplicity
Simplicity refers to clarity in language, correctness, adequacy, and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what
the question is asking and how to answer. Sometimes, the students get confused about the
possible answers due to lack of clarity in the questions.
5. RELIABILITY
According to Gay, Mills, and Airasian (2011), "Reliability is the degree to which a test
consistently measures whatever it measures”.
Thorndike (2005) refers to reliability as the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing". Reliability also signifies the repeatability of observations or scores on a measurement. Other terms used to define reliability include dependability, stability, accuracy, and regularity in measurement. For a test, high reliability means that a person gets the same, or nearly the same, score each time the test is administered to him or her. If the person obtains a different score each time the test is administered, then the test's reliability will be questioned.
Reliability can be ascertained by the examiner by administering the same test on two different occasions and comparing the scores obtained on the two occasions to determine the degree of reliability. Another method is to test students on one test and then administer another, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference in the students' scores on the two tests, then the two tests have poor reliability. Essay-type questions may have poor reliability, as students may get different scores each time the answers are marked; multiple-choice questions have higher reliability than essay-type questions. A test may not be reliable in all settings. A test may be reliable in a specific situation, under specific circumstances, and with a specific group of subjects; it may not be reliable in a different situation or with a different group of students under different circumstances.
5.1 Reliability Coefficient
Whether in physical measurement or in testing, it is difficult to achieve 100% consistency in scores; an acceptable value is instead expressed as the degree of closeness or consistency between the measurements. For this purpose, the degree of reliability of a test is measured numerically, in what is termed the reliability coefficient. According to the Merriam-Webster dictionary, a reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures.
The reliability coefficient is a way of confirming how accurate a test or measure is. It essentially
measures consistency in scoring. The reliability coefficient is found by giving the test to the
same subject more than once and determining if there's a correlation between the two scores.
This will also reveal the strength of the relationship and similarity between the two scores. If the
two scores are close enough, the test can be said to be accurate and to have good reliability. The variation in the scores is called error variance, and the source of that variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa.
For example, an individual might be given a measure of self-esteem and then, later, the same measure again. The two scores would be correlated to produce the reliability coefficient. If the scores are very similar to each other, then the measure can be said to be reliable, consistently measuring the same thing, which in this case is self-esteem.
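A minimal sketch of this computation, assuming two administrations of the same measure to the same five people (all scores are invented for illustration):

```python
# Hypothetical sketch: the reliability coefficient as the Pearson
# correlation between two administrations of the same test.
from scipy.stats import pearsonr

first  = [42, 55, 38, 61, 50]   # scores on the first administration
second = [40, 57, 36, 63, 49]   # scores on the second administration

r, _ = pearsonr(first, second)  # returns (correlation, p-value)
print(round(r, 2))              # close to 1.0 here: small error variance
```

The gap between each pair of scores is the error variance mentioned above; the smaller it is, the closer r gets to 1.00.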
The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability. In actual practice, however, it is not possible to have a perfectly reliable test; the coefficient of reliability will be less than 1.00. The reason is the effect of various factors and errors in measurement. These include errors caused by the test itself, such as ambiguous test items that are interpreted differently by students. Differences in the condition of students (emotional, physical, mental, etc.) are also responsible for producing errors in measurement, for example the fatigue factor, the arousal of specific emotions such as anger, fear, or depression, and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording, etc. may also result in measurement error and thus affect the reliability coefficient.
5.2 Relationship between Validity and Reliability
A test which is valid is also reliable. However, a test which is reliable is not necessarily valid. If a test is valid, it means that it is rightly measuring what it is supposed to measure; the score obtained on such a test is also consistent, whether low or high, so the test is reliable as well. In comparison, a test may be reliable, meaning that the students' scores come out consistently the same, yet fail to measure its intended purpose, and thus be invalid. Thus, a test which is reliable may or may not be valid, but a test which is valid must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is it really measuring the set objectives from the given content? If it measures its intended purpose, then the test is also valid; if it does not measure the concepts from the given content, then it is invalid. [For more detail, see Gay, Mills, & Airasian (2011).]
6. RELIABILITY TYPES
Some types of reliability are given below:
• Test-Retest Reliability
• Equivalence Reliability or inter-class reliability
• Split-Halves Reliability
6.1 Test-Retest Reliability
One of the simplest ways to determine the reliability of a test is test-retest. Test-retest reliability is the degree to which scores on a test are consistent over time. The subjects are given the test on two occasions, and the scores obtained are compared to see the consistency between the two administrations. This can be found by measuring the correlation between the two sets of scores: if the correlation coefficient is high, the test has a high degree of reliability. This method is seldom used by subject teachers but is frequently used by test developers and commercial test publishers, such as those behind IELTS, TOEFL, GRE, etc.
One issue that arises here is how much time should elapse between the two administrations. If the time interval is short, say a few hours or days, then the chances of students remembering their previous answers are high; they will score much the same, which will inflate the reliability coefficient. If the interval is long, then the ability to perform well on the test increases due to learning over time, which also affects the reliability coefficient. Thus, in reporting test-retest reliability, the time interval between the tests should be mentioned along with the reliability coefficient. This kind of reliability is ensured for aptitude tests and achievement tests so that they measure their intended purpose each time they are administered.
6.2 Equivalence Reliability or inter-class reliability
Equivalence reliability relates to two tests that are similar in every aspect except the test items. The reliability between the two tests is then measured; if the coefficient of reliability, known in this case as the coefficient of equivalence, is high, then the two tests are highly reliable, and vice versa. It shall be kept in consideration that the two tests must measure the same variables and have the same number of items, structure, and difficulty level. Besides, the directions for administering the two tests shall also be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which version a student takes, his or her score should be the same on both. This is usually used in situations where the number of candidates is very large or a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without fear of test items leaking or repeating. In some circumstances, researchers ensure equivalence of the pre-test and post-test to measure the actual difference in performance, removing the measurement error that would arise from recalling the answers given on the first test.
The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level, etc. One form of the test is administered to an appropriate group. After some time, the second form is administered to the same group. The scores obtained by students on the two forms are then correlated to find the coefficient of reliability; the differences in the scores obtained by students are treated as error.
6.3 Split-Halves Reliability
This type of reliability is used for measuring internal consistency between the items in a single test. It is theoretically the same as finding equivalence reliability; however, here the two parts are taken from the same test. This reliability can be found by administering the test only once, so the effect/error caused by the time interval, the students' condition (physical, mental, emotional, etc.), or the use of two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts. The two parts can be obtained by various methods, e.g. dividing the test items into two halves with an equal number of items in each, or splitting the test items into odd-numbered and even-numbered items.
In case the test is divided into odd- and even-numbered items, the reliability is calculated as follows. First, the test is administered to the subjects and the items are marked. The items are divided into two halves, with all the odd-numbered items in one half and the even-numbered items in the other. The scores obtained on the odd- and even-numbered items are totaled separately, so there are two scores for each student. The two sets of scores are then correlated using the Pearson product-moment correlation coefficient. If the correlation coefficient is high, the two halves of the test are highly reliable, and vice versa.
The reliability coefficient obtained from this correlation needs to be adjusted, because it describes only half of the test; the reliability of the whole test will be higher. The corrected value is computed using the Spearman-Brown prophecy formula. Suppose the reliability coefficient for a 40-item test was .70, obtained by correlating the scores on the 20 odd and 20 even items. The reliability coefficient for the whole test (40 items) is then found using the following formula:
r_total = (2 × r_half) / (1 + r_half)

where r_half is the correlation between the two halves. For the example above:

r_total = (2 × .70) / (1 + .70) = 1.40 / 1.70 ≈ .82

The advantage of split-halves reliability is that the test is administered only once. Thus, it can be used economically and conveniently by classroom teachers and researchers to collect data about a test.
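A short sketch of the whole procedure, the odd/even split, the Pearson correlation, and the Spearman-Brown correction, on invented item scores:

```python
# Hypothetical split-half reliability sketch. Rows are students,
# columns are items (1 = correct, 0 = incorrect); data are invented.
import numpy as np
from scipy.stats import pearsonr

items = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
])

odd_total  = items[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7
even_total = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half, _ = pearsonr(odd_total, even_total)
r_total = (2 * r_half) / (1 + r_half)     # Spearman-Brown correction
print(round(r_half, 2), round(r_total, 2))
```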
7. FACTORS AFFECTING RELIABILITY
Fatigue: The score obtained on a test may differ with the subjects' condition, and fatigue plays an important role here. Students will generally score lower on a test when fatigued, so fatigue generally decreases the reliability coefficient.
Practice: The reliability of a test can be affected by the amount of practice. It is generally said that practice makes perfect; in the same manner, practice on a test will improve students' scores, and greater practice thus increases the reliability coefficient.
Subject variability: The variation in scores will increase if there is more subject variability in a group. The greater the differences among subjects in terms of gender, age, program, interests, etc., the greater the variation in scores among individuals. In the same way, if a group is more homogeneous, such as a group of students within the same IQ range, the variation in scores will be smaller.
Test Length: The length of a test and the number of items affect its reliability. Usually, a test with a greater number of items gives more reliable scores, due to the cancelling out of random positive and negative errors within the test. Thus, adding more items to a test increases its reliability, and deleting items lowers it. One technique for deleting items from a test without decreasing its reliability is to remove the items that show the lowest reliability values in item analysis.
The Spearman-Brown prophecy formula is also used for estimating the reliability of a test that is made shorter or longer, provided that the original reliability of the test is known. For example, if a test's original reliability is .60 and the number of items is increased or decreased, then the new reliability of the test will be:
r_x = (K × r) / (1 + (K − 1) × r)

where:
r = reliability of the original test
r_x = predicted reliability of the test with items added or deleted
K = ratio of the number of items in the new test to the number of items in the original test
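As a sketch, the general formula applied to the .60 example above, doubling and halving the test length (the numbers are for illustration only):

```python
# Hypothetical sketch of the Spearman-Brown prophecy formula:
# predicted reliability when a test is lengthened or shortened.

def spearman_brown(r: float, k: float) -> float:
    """r = reliability of the original test;
    k = (items in new test) / (items in original test)."""
    return (k * r) / (1 + (k - 1) * r)

r_original = 0.60
print(round(spearman_brown(r_original, 2.0), 2))  # doubled length -> 0.75
print(round(spearman_brown(r_original, 0.5), 2))  # halved length  -> 0.43
```

Note that the split-half correction used earlier is just this formula with k = 2, applied to the half-test correlation.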
8. VALIDITY
Validity refers to the extent to which a test measures what it is supposed to measure. In other words, it refers to the degree to which a test fulfils its objectives. Thus, for a measure or test to be valid, it must consistently measure the particular trait, characteristic, or ability for which it was constructed.
According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be drawn from a test, then the test has greater validity for supporting those specific inferences.
Cohen and Swerdlik (2010) defined validity as “a judgment based on evidence about the
appropriateness of inferences drawn from test scores". An inference is a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic.
A test may be valid for a particular group and for a particular purpose; however, it may not be valid for another group or for a different purpose. A test on English grammar may be valid for a high school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses, and for all users (Cohen & Swerdlik,
2010). Rather, tests may be shown to be valid within what we would characterize as reasonable
boundaries of a contemplated usage. If those boundaries are exceeded, the validity of the test
may be called into question (Cohen & Swerdlik, 2010).
The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase that the test developer has to undertake with the test takers for a specific purpose. It is necessary that the test developer mention the validity evidence in the test manual for the users/readers. However, test users may sometimes conduct their own studies to check validity with their own test takers, usually called local validation.
Some types of validity are:
1. Content-related validity
2. Criterion-related validity
3. Construct-related validity
8.1 Content-related validity
Sometimes content validity is also referred to as face validity or logical validity. According to Gay, Mills, & Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear.

Face validity is a quick way of ascertaining whether a test appears to measure what it purports to measure. A primary-class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact measure of content validity and is only used as a quick initial screening.
In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math topics not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the items adequately sample the relevant content area. The proportion of test items from the various units must be kept in consideration according to their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportion of test items is in accordance with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores the other chapters, then such a test will have poor sampling validity.
Content validity can be judged by a content expert, the relevant subject teacher, and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgement about content validity. A good way of ensuring content validity is to make a table of specifications that includes the total number of units/topics to be tested, the number of items from each unit/topic, and the different domains of instructional objectives. The table of specifications helps in spotting the units from which most of the items are drawn, as well as the units that are under-represented or ignored.
Consider a secondary-grade Physics test drawn from five chapters, as given in the table of specifications below. The names of the units are listed along with the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to strictly follow the given proportions: the examiner decides which aspects or instructional objectives shall be given more or less weightage for each unit, while still ensuring that there is not too great a difference in the weightage assigned to each objective. Thus, some units may require more focus on the application side, while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.
Table 2. Table of specifications for a Physics test covering five units

Course content     Knowledge (30%)  Comprehension (40%)  Application (30%)  Total
Forces                    3                 5                    2            10
Energy sources            3                 4                    3            10
Turning Effect            2                 4                    4            10
Kinematics                3                 3                    4            10
Atomic Structure          4                 4                    2            10
Total                    15                20                   15            50
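As a minimal sketch, a blueprint like the one above can be checked programmatically to confirm that no objective level is over- or under-represented; the numbers mirror the table, and the code is purely illustrative:

```python
# Hypothetical sketch: verifying a table of specifications so that each
# objective level contributes its intended share of the 50 items.
blueprint = {
    "Forces":           {"Knowledge": 3, "Comprehension": 5, "Application": 2},
    "Energy sources":   {"Knowledge": 3, "Comprehension": 4, "Application": 3},
    "Turning Effect":   {"Knowledge": 2, "Comprehension": 4, "Application": 4},
    "Kinematics":       {"Knowledge": 3, "Comprehension": 3, "Application": 4},
    "Atomic Structure": {"Knowledge": 4, "Comprehension": 4, "Application": 2},
}

total_items = sum(sum(unit.values()) for unit in blueprint.values())
for level in ("Knowledge", "Comprehension", "Application"):
    count = sum(unit[level] for unit in blueprint.values())
    print(f"{level}: {count} items ({100 * count / total_items:.0f}%)")
# Knowledge: 15 items (30%)
# Comprehension: 20 items (40%)
# Application: 15 items (30%)
```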
8.2 Criterion-related validity
Other terms used for criterion-related validity are statistical validity and correlational validity. It provides evidence that a test measures the specific criterion or trait for which it is designed. To determine the criterion validity of a test, the first step is to establish the criterion to be measured. Then a variety of test items are developed and tried out. The test items are then correlated with the criterion, using the Pearson correlation, to determine how well they measure the set criterion. In case a number of tests are used to measure the criterion, multiple correlational procedures are used instead of the Pearson correlation.
Criterion-related validity can be further subdivided into concurrent validity and predictive
validity.
8.2.1 Concurrent Validity
The main difference between concurrent and predictive validity is the time at which the criterion
is measured. For concurrent validity, the criterion is measured at approximately the same time as
the alternative measure. However, if the criterion being measured relates to some future time,
then it is called predictive validity.
The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, the GRE is an already standardized test measuring specific skills and knowledge. Suppose a new test is developed that claims to measure the same skills and knowledge; it is then necessary to find the concurrent validity of the new test. For this purpose, the new test and the established test are administered to some defined group of individuals at the same time. The scores obtained by the individuals on both tests are correlated to observe similarities or differences, and the coefficient of validity calculated from the correlation provides information about the concurrent validity of the new test. A high validity coefficient indicates good concurrent validity, and vice versa.
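A minimal sketch of this comparison, with invented scores for six people who took both a hypothetical new test and an established one on the same day:

```python
# Hypothetical sketch: concurrent validity as the correlation between
# scores on a new test and an established test taken at the same time.
from scipy.stats import pearsonr

new_test    = [62, 71, 55, 80, 67, 49]   # scores on the new test
established = [60, 75, 52, 83, 70, 46]   # scores on the established test

validity_coefficient, _ = pearsonr(new_test, established)
print(round(validity_coefficient, 2))    # high value -> good concurrent validity
```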
8.2.2 Predictive Validity
Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. For example, the score on an entry test serves as a predictor of an individual's future performance in a specific program: if the marks on the entry test are high, it can be predicted that the candidate will do well in future, and this bears out the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces, and the GRE and SAT tests for university performance. Likewise, medical indicators such as high body fat, high cholesterol, smoking, and hypertension are all predictive of future heart disease. It shall be kept in consideration that the predictive validity of tests like entry tests, the GRE, TOEFL, etc. may vary due to a number of factors, such as differences in the curriculum studied by students, the textbooks used for preparation, geographical location, etc. Thus, there is no such thing as perfect predictive validity, and predictions will sometimes turn out to be false: not all students who pass the GRE or an entry test go on to complete the program in which they enroll. It is therefore not advisable to rely on the score of a single test for predicting future performance; rather, several indicators shall be used, such as marks in preceding exams, interview scores, comments of professors, performance on practical skills, etc.
8.3 Construct-related validity
Construct-related validity concerns the measurement of a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: it cannot be seen, but its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity, and attitude. Tests have been developed for measuring specific constructs, and the researchers/test developers must ensure that the test they construct accurately measures the construct for which it was designed. Thus, a test aimed at measuring the level of anxiety shall not measure creativity or IQ. The test score can then be used to make decisions related to the construct. If a test is unable to measure its construct, its validity is questionable and conclusions based on its score will be meaningless and inaccurate.
The process of determining construct validity is not simple. Measuring a construct requires a strong theory that hypothesizes about the construct under study. For example, psychological theories hypothesize that individuals with higher anxiety will work longer on a problem than persons with a low anxiety level. Suppose a test measures anxiety level, some persons score higher on the test, and the same persons also work for a longer time on the task/problem under consideration; then we have ample evidence to support the theory, and thus the construct validity of the test for measuring that construct.
[Figure: Types of reliability (interclass: ANOVA, alpha, KR20; test-retest; equivalence) and types of validity (content, criterion: concurrent, predictive; construct). Source: James, Allen, James, & Dale (2005)]
Self-Assessment Questions
Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation with reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?
9. REFERENCES
1. Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.
2. Alleydog.com. Reliability coefficient. Retrieved from http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient#ixzz48EmyHlQe
3. Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.
4. Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.
5. James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. USA: Human Kinetics.
6. McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/#
7. Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/
8. Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson Education Inc.
9. Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. Retrieved from http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for BeginnersSabitha Banu
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 

Recently uploaded (20)

GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 

Unit. 7.pdf

  • 1. Unit–7 Test Development and Qualities of a test Written by: Dr. Fayyaz Ahmad Faize
  • 2. TABLE OF CONTENTS 1. Achievement Test .................................................................................................................................3 1.1 Purposes/uses of achievement test................................................................................................5 2. Attitude Scale........................................................................................................................................6 3. Steps for Test Development..................................................................................................................8 4. Reliability............................................................................................................................................15 4.1 Reliability Coefficient.................................................................................................................16 4.2 Relationship between Validity and Reliability ...........................................................................17 5. Reliability Types.................................................................................................................................18 5.1 Test-Retest Reliability ......................................................................................................................18 5.2 Equivalence Reliability or inter-class reliability.........................................................................19 5.3 Split-Halves Reliability...............................................................................................................20 6. Factors Affecting Reliability...............................................................................................................21 7. Validity ...............................................................................................................................................23 7.1 Content-related validity...............................................................................................................24 7.2 Criterion-related validity.............................................................................................................26 7.2.1 Concurrent Validity....................................................................................................................27 7.2.2 Predictive Validity .....................................................................................................................27 7.3 Construct-related validity............................................................................................................28 Self-Assessment Questions.....................................................................................................................29 8. References...........................................................................................................................................30
1. ACHIEVEMENT TEST

Achievement tests are designed to measure accomplishment. An achievement test is usually administered at the end of some learning activity or process to ascertain the degree to which the required task has been accomplished. For example, an achievement test for a nursery-class student might assess knowledge of the English alphabet, numbers, and key science concepts. Achievement tests thus measure the degree of learning on tasks in which students have already been instructed or guided. The tasks may be specific and short, or comprehensive and detailed. An achievement test may be standardized, such as a secondary-class Chemistry test on formulae and valences or a Physics test on fundamental quantities or kinematics.
Another useful term is 'general achievement', which relates to the measurement of learning experiences in one or more academic areas. This usually involves a number of subtests, each aimed at measuring specific learning experiences or targets. These subtests are sometimes called achievement batteries. Such batteries may be individually administered or group administered. They may consist of a few subtests, as does the Wide Range Achievement Test-4 (Wilkinson & Robertson, 2006) with its measures of reading, spelling, arithmetic, and (new to the fourth edition) reading comprehension. An achievement battery may be as comprehensive as the STEP Series, which includes subtests in reading, vocabulary, mathematics, writing skills, study skills, science, and social studies; a behavior inventory; an educational environment questionnaire; and an activities inventory. Some batteries, such as the SRA California Achievement Tests, span kindergarten through grade 12, whereas others are grade- or course-specific. Some batteries are constructed to provide both norm-referenced and criterion-referenced analyses. Others are concurrently normed with scholastic aptitude tests to enable a comparison between achievement and aptitude. Some batteries are constructed with practice tests that may be administered several days before actual testing to help students familiarize themselves with test-taking procedures.

One popular instrument appropriate for use with persons aged 4 through adulthood is the Wechsler Individual Achievement Test-Second Edition, otherwise known as the WIAT-II (Psychological Corporation, 2001). This instrument is used not only to gauge achievement but also to develop hypotheses about achievement versus ability. It features nine subtests that sample content in each of the seven areas listed in a past revision of the Individuals with Disabilities Education Act: oral expression, listening comprehension, written expression, basic reading skill, reading comprehension, mathematics calculation, and mathematics reasoning.
For a particular purpose, a battery that focuses on achievement in a few select areas may be preferable to one that attempts to sample achievement in several areas. On the other hand, a test that samples many areas may be advantageous when an individual comparison of performance across subject areas is desirable. If a school or a local school district undertakes to follow the progress of a group of students as measured by a particular achievement battery, then the battery of choice will be one that spans the targeted subject areas in all the grades to be tested. If the ability to distinguish individual areas of difficulty is of primary concern, then achievement tests with strong diagnostic features will be chosen. Although achievement batteries sampling a wide range of areas, across grades, and standardized on large, national samples of students have much to recommend them, they also have certain drawbacks. For example, such tests usually take years to develop; in the interim the items, especially in fields such as social studies and science, may become outdated. Further, any nationally standardized instrument is only as good as the extent to which it meets the (local) test user's objectives.

1.1 Purposes/uses of achievement test

i. To measure students' mastery of certain essential skills and knowledge, such as proficiency in recalling facts, understanding concepts and principles, and using skills.
ii. To measure students' growth or progress over time for promotion purposes. This helps the school in making decisions about students' placement in a specific program, class, or group, or about promotion to the next level.
iii. To rank pupils in terms of their achievement by comparing the performance of an individual to the norm or average performance of his/her group (norm-referenced).
iv. To identify and diagnose pupils' problems. Given a federal mandate to identify children with a "severe discrepancy between achievement and intellectual ability" (Procedures for Evaluating Specific Learning Disabilities, 1977, p. 65083), it can readily be appreciated how achievement tests—as well as intelligence tests—could play a role in the diagnosis of a specific learning disability (SLD).
v. To evaluate the effectiveness of the teacher's instructional method.
vi. To encourage good study habits in students and motivate them to work hard.

2. ATTITUDE SCALE

An attitude may be defined formally as a presumably learned disposition to react in some characteristic manner to a particular stimulus. The stimulus may be an object, a group, an institution—virtually anything. Although attitudes do not necessarily predict behavior (Tittle & Hill, 1967; Wicker, 1969), there has been great interest in measuring the attitudes of employers and employees toward each other and toward numerous variables in the workplace. As the name implies, this type of scale tries to measure an individual's beliefs, attitudes, and perceptions toward oneself, others, or some phenomenon, activity, or situation.

2.1 Measuring Attitude

Attitude can be measured using self-report, tests, and/or questionnaires. However, it is not easy to measure attitude accurately, as individuals differ greatly in their ability to introspect correctly about their attitudes and in their level of self-awareness. Moreover, some people feel reluctant to share or report their attitudes to others. It may also happen that people hold or form attitudes of which they themselves are unaware.
The measurement of attitude was addressed early by Likert (1932) in his monograph, "A Technique for the Measurement of Attitudes", which describes the design of an instrument for measuring attitude. A Likert scale seeks an individual's response to a number of statements in terms of his/her level of agreement or disagreement. The options may be Strongly Agree, Agree, Undecided, Disagree, and Strongly Disagree. The degree of agreement or disagreement reflects the individual's attitude toward a certain phenomenon or statement. Each response is assigned a specific score from 1 to 5: for a positive statement, 5 is assigned to Strongly Agree and 1 to Strongly Disagree (for a negatively worded statement, the scoring is reversed).
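As a concrete illustration of this scoring convention, the following is a minimal Python sketch that totals a respondent's Likert ratings, reverse-coding the negatively worded statements. The statements and responses are hypothetical.

```python
# Minimal sketch of Likert-scale scoring (hypothetical items and responses).
# Positive statements score Strongly Agree = 5 ... Strongly Disagree = 1;
# negatively worded statements are reverse-coded (6 - raw score).

OPTIONS = {"Strongly Agree": 5, "Agree": 4, "Undecided": 3,
           "Disagree": 2, "Strongly Disagree": 1}

# (statement, is_positive) pairs -- wording is illustrative only.
items = [
    ("I enjoy solving physics problems.", True),
    ("Physics lessons are a waste of time.", False),
    ("I would take another physics course.", True),
]

responses = ["Agree", "Strongly Disagree", "Undecided"]

def score_item(response, is_positive):
    raw = OPTIONS[response]
    return raw if is_positive else 6 - raw   # reverse-code negative items

total = sum(score_item(r, pos) for r, (_, pos) in zip(responses, items))
print(total)  # higher totals indicate a more favorable attitude
```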
Thurstone (1928) had also argued that attitude can be measured, in his article "Attitudes Can Be Measured". More recently, the research of Banaji (2001) further supported this contention in the article "Implicit Attitudes Can Be Measured". Implicit attitudes are "introspectively unidentified (or inaccurately identified) traces of past experience that mediate favorable or unfavorable feeling, thought, or action toward social objects" (Greenwald & Banaji, 1995, p. 8). Stated another way, they are nonconscious, automatic associations in memory that produce dispositions to react in some characteristic manner to a particular stimulus. Implicit attitudes can be measured using the Implicit Association Test (IAT), a computerized sorting task by which implicit attitudes are gauged with reference to the test taker's reaction times. For example, the individual is shown a particular stimulus and asked to quickly categorize it or associate another word or phrase with it. A person's attitude toward 'terror', for instance, can be gauged by presenting the word 'terror' and asking the person to quickly associate other favorable or unfavorable words with it. Using the IAT or similar protocols, implicit attitudes toward a wide range of stimuli can be measured; implicit attitudes have been studied in relation to racial prejudice, threat, voting behavior, professional ethics, self-esteem, drug use, etc.

Measuring implicit attitude is now frequently used in consumer psychology and the study of consumer preferences. In consumer psychology, attitude may be probed by asking a series of questions about a product or choice and noting the individual's responses, which may reflect the person's beliefs or thinking. Responses can be sought through a survey or opinion poll using questionnaires, e-mails, Google forms, social media posts, etc., and the surveys and polls may be conducted face-to-face, online, by telephone interview, or by mail. Face-to-face interaction helps in getting a quicker response and in making sure the questions are well understood; moreover, the researcher can present the products directly and seek people's responses to them. However, face-to-face interaction also has a drawback: sometimes people react in a way they feel is favorable to the researcher, or the researcher's gestures influence the respondents' choices.

Another type of scale used to measure attitude is the semantic differential technique. In this type of scale, the respondent is given two opposite extremes and asked to place a mark in one of the seven spaces on the continuum according to his or her level of preference. The bipolar extremes might be easy-difficult, good-bad, weak-strong, etc.

Strong ____:____:____:____:____:____:____ Weak
Decisive ____:____:____:____:____:____:____ Indecisive
Good ____:____:____:____:____:____:____ Bad
Cheap ____:____:____:____:____:____:____ Expensive
3. STEPS FOR TEST DEVELOPMENT

The creation of a good test is not a matter of chance; rather, it requires sound knowledge of the principles of test construction (Cohen & Swerdlik, 2010). The development of a good test involves a number of steps. These steps are not fixed, as various authors have suggested different steps/stages for developing a test. The following are some of the general steps for test development.

1. Identification of objectives

This is one of the most important steps in developing any test, in which the test authors consider in detail what exactly they aim to measure, i.e. the purpose of the test. It is especially important to define the purpose of the test clearly, because doing so increases the possibility of achieving high validity; without predefined objectives, a test will be meaningless and purposeless. There are two kinds of objectives: behavioral and non-behavioral. As the names suggest, behavioral objectives deal with "activities that are observable and measurable whereas non-behavioral objectives specify activities that are unobservable and not directly measurable" (Reynolds et al., 2009, p. 177).

2. Deciding about test format

The format/design of the test is another important element in constructing a test. The test developer needs to decide which format will be the most suitable for achieving the set objectives. The format of the test may be objective type, essay type, or both. The examiner also decides what type of objective items shall be included, whether multiple-choice, fill-in-the-blanks, matching items, short answer, etc. The test author further decides the number of marks assigned to each format and the total amount of time allowed to complete the test.

3. Making a table of specifications

A table of specifications serves as a test blueprint. It helps in ensuring a suitable number of items from the whole content as well as specifying the type of assessment objectives the items will be testing, and it ensures that all levels of instructional objectives are used in the test questions. The table lists the number of items from each content area, the weightage assigned to each content area, and the type of instructional objective each item will be measuring, whether recall, understanding, or application. Last but not least, the examiner shall also decide the weightage given to each format (objective and subjective) within the test and the weightage in terms of difficulty level (easy, moderate, difficult). For example, in developing an English test, the teacher can focus on the following areas (a small allocation sketch follows this list):

• What language skills should be included – will there be a list of grammatical structures and lexis, etc.;
• What sort of tasks are required – objectively assessable, integrative, simulated "authentic", etc.;
• How many items are required for each section, and what their relative weight will be – equal weighting or extra weighting for more difficult items;
• What test methods are to be used – multiple choice, gap filling, matching, transformations, picture descriptions, essay writing, etc.;
• What rubrics are to be used as instructions for students – will examples be included to help students know what is expected, and should the assessment criteria be added to the rubric;
• What assessment criteria will be used – how important is accuracy, spelling, length of written text, etc.
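As a rough illustration of how a table of specifications translates weightages into item counts, here is a minimal Python sketch; the content areas, weightages, and total test length are hypothetical.

```python
# Minimal sketch: turning blueprint weightages into item counts
# (hypothetical content areas and weightages).

total_items = 50

# weightage of each content area, as a fraction of the whole test
content_weights = {"Composition": 0.30, "Comprehension": 0.25,
                   "Grammar": 0.25, "Vocabulary": 0.20}

# weightage of each instructional objective within every content area
objective_weights = {"Knowledge": 0.30, "Comprehension": 0.40,
                     "Application": 0.30}

for area, w_area in content_weights.items():
    area_items = round(total_items * w_area)
    row = {obj: round(area_items * w_obj)
           for obj, w_obj in objective_weights.items()}
    print(area, row, "total:", area_items)
# Rounded cell counts may need minor manual adjustment so that
# rows and columns still sum to the intended totals.
```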
4. Writing Items

The examiner writes the items keeping in mind the table of specifications and the difficulty level of the items. The items shall progress from simple to difficult; however, it is debatable whether items should be arranged randomly or from easy to difficult. The examiner should ensure that the test can be completed within the stipulated time. The language of the test items should be simple, brief, and lucid, and should be checked for grammar, spelling, and punctuation.

5. Preparation of Marking Scheme

The test developer decides the number of marks to be assigned to each item, or to the relevant bits of detail in the students' answers. This is necessary to ensure consistency in marking and to make scoring more scientific and systematic. Essay-type questions can be divided into smaller components, with the marks defined for each important concept/point.

As regards developing a standardized test, the following stages are given by Cohen and Swerdlik (2010), though they can also be applied to custom tests made by teachers, researchers, and recruiters. The process encompasses five stages:

1. test conceptualization
2. test construction
3. test tryout
4. item analysis
5. test revision

The process of test development starts from conceptualizing the idea of the test and the purpose for which the test has to be constructed. A test may be designed around some emerging phenomenon, problem, issue, or need. Test conceptualization might also include the construct or the concepts that the test should measure.
Several questions arise at this stage: What kind of objectives or behavior should the test measure in the presence of other such tests? Is there any need for making a new test, or can an existing test be used for the set purpose? How can the new test be better than the existing test? Who will be the users of the test: students, teachers, or employers? What will be the content of the test? How will the test be administered, individually or in groups? Will the test be written, oral, or practical? What will be the format of the test items, and what will be the proportion of objective and subjective items? Who will benefit from the test? Will there be any harmful effect of the test and, if so, on whom?

Based on the purpose, the needs, and the objectives to be achieved, the items for the test are constructed or selected. The test is then pilot tested on a sample to try out whether the items in the test are appropriate for achieving the set objectives. Based on the results from the test tryout or pilot test, the items in the test are put to item analysis. This requires the use of statistical procedures for determining the difficulty level of the items, reliability, and validity. This process helps in selecting the appropriate items for the test, while the inappropriate items may be revised or deleted. This finally helps in making a revised draft of the test better than the initial version. The process may be repeated until a refined, standardized version is available.
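To make the item-analysis stage more concrete, below is a minimal Python sketch, under assumed tryout data, that computes two classical item statistics: the difficulty index (proportion of examinees answering correctly) and a simple upper-lower discrimination index. The score matrix is hypothetical.

```python
# Minimal item-analysis sketch (hypothetical tryout data).
# rows = examinees, columns = items; 1 = correct, 0 = incorrect

scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

n_items = len(scores[0])

# Difficulty index p: proportion of examinees answering the item correctly.
p = [sum(row[i] for row in scores) / len(scores) for i in range(n_items)]

# Discrimination d: compare the top and bottom halves on total score.
ranked = sorted(scores, key=sum, reverse=True)
half = len(ranked) // 2
upper, lower = ranked[:half], ranked[half:]
d = [sum(r[i] for r in upper) / half - sum(r[i] for r in lower) / half
     for i in range(n_items)]

for i in range(n_items):
    print(f"item {i + 1}: difficulty p = {p[i]:.2f}, discrimination d = {d[i]:.2f}")
# Items with very extreme p or with low/negative d are candidates
# for revision or deletion in the test-revision stage.
```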
4. QUALITIES OF A GOOD TEST

In constructing a test, the test developer should aim at making a good test. A bad test may spoil the purpose of testing and thus be useless to administer. According to Mohamed (2011), a good test should have the following properties.

Objectivity

It is very important for a test to be objective. A test with higher objectivity eliminates personal biases and influences in scoring and in interpreting the test result. This can be achieved by including more objective-type items in the test, such as multiple-choice questions, fill-in-the-blanks, true-false, matching items, short question-answers, etc. In contrast, essay questions are subjective: different examiners may arrive at different marks while checking such questions, depending upon the examiner's mood, knowledge level, and personal likes and dislikes. However, essay-type questions can be made more objective through a well-defined marking scheme that assigns marks to the small bits of important and relevant information in the long answers.

Comprehensiveness

A good test should cover the content area that has been taught. The items in the test should be drawn from the different areas of the course content; if one topic or area is assigned more items and the other areas are neglected, the test will not be a good one. A good English test may have items taken from composition, comprehension, dialogue, creative writing, grammar, vocabulary, etc. Meanwhile, due importance may be given to the important parts of the content according to their utility and significance.
Validity

A valid test rightly measures what it is supposed to measure; it tests what it ought to test. For example, a good test that measures control of grammar should contain no difficult lexical items. Validity is explained in detail in the validity section.

Reliability

Reliability of a test refers to the degree of consistency with which it measures what it is intended to measure. If a test is retaken by the same students under the same conditions, the scores will be almost the same, provided that the time between the test and the retest is of reasonable length. In that case the test is said to provide consistency in measuring whatever is being evaluated. Details about reliability are given in the reliability section.

Discriminating Power

The discriminating power of a test is its power to discriminate between the upper and lower groups who took the test. Thus, a good test should not contain only difficult items or only easy items; rather, it should contain items of different difficulty levels so as to distinguish students of different ability levels. The questions should increase progressively in difficulty to reduce stress and tension in students.

Practicability

The test should be realistic and practicable. It should not measure unrealistic targets or objectives. It should be easy to administer as well as easy to score, and it should be economical, not wasting too many resources, energies, and efforts. Tests may be competitive and sometimes difficult to complete within the stipulated time when their specific purpose is to select candidates with a higher IQ level and shorter reaction time. Otherwise, classroom tests shall keep in mind the individual differences of students and provide ample opportunity for completion.

Simplicity

Simplicity refers to clarity of language, correctness, adequacy, and lucidity. Ambiguous questions and items with multiple meanings should be avoided. The students should be very clear about what each question is asking and how to answer it. Sometimes students get confused about the possible answers due to lack of clarity in the questions.

5. RELIABILITY

According to Gay, Mills, and Airasian (2011), "Reliability is the degree to which a test consistently measures whatever it measures". Thorndike (2005) refers to reliability as the "accuracy or precision of a measurement procedure", while Mehrens and Lehmann (1991) defined reliability as "the degree of consistency between two measures of the same thing". Reliability also signifies the repeatability of observations or scores on a measurement. Other terms used to define reliability include dependability, stability, accuracy, and regularity in measurement. For a test, high reliability means that a person gets the same, or nearly the same, score each time the test is administered to that person. If the person obtains a different score each time the test is administered, the test's reliability will be questioned.
Reliability can be ascertained by the examiner by administering the same test on two different occasions. The scores obtained on the two occasions may be compared to determine the degree of reliability. Another method is to test students on one test and then administer another, different test; the scores obtained by the students on the two tests may be compared to find the reliability of the two tests. If there is much difference in the students' scores on the two tests, then the two tests have poor reliability. Essay-type questions may have poor reliability, as students may get a different score each time the answers are marked; multiple-choice questions have comparatively higher reliability than essay-type questions. A test may not be reliable in all settings: a test may be reliable in a specific situation, under specific circumstances, and with a specific group of subjects, yet not be reliable in a different situation or with a different group of students under different circumstances.

5.1 Reliability Coefficient

As with physical measurement, when different tests are used to ascertain reliability it may be difficult to achieve 100% consistency in scores; what matters is the degree of closeness or consistency among the measurements of the different tests. For this purpose, the degree of reliability of a test is expressed numerically as the reliability coefficient. According to the Merriam-Webster dictionary, a reliability coefficient is a measure of the accuracy of a test or measuring instrument obtained by measuring the same individuals twice and computing the correlation of the two sets of measures. The reliability coefficient is thus a way of confirming how accurate a test or measure is; it essentially measures consistency in scoring. The reliability coefficient is found by giving the test to the same subjects more than once and determining the correlation between the two sets of scores. This also reveals the strength of the relationship and the similarity between the two sets of scores. If the two scores are close enough, the test can be said to be accurate and to have good reliability. The variation in the scores is called error variance, and the source of variation is called the source of error. The smaller the error, the more consistent and reliable the score, and vice versa.

As an example, an individual could be given a measure of self-esteem and then given the same measure again. The two scores would be correlated and the reliability coefficient produced. If the scores are very similar, the two administrations can be said to be reliably and consistently measuring the same thing, in this case self-esteem.

The maximum value of the reliability coefficient is 1.00, which means that the test is perfectly reliable, while the minimum value is 0.00, which indicates no reliability. In an actual situation, however, it is not possible to have a perfectly reliable test; the coefficient of reliability will always be less than 1.00. The reason is the effect of various factors and errors in measurement. These include errors caused by the test itself, such as ambiguous test items that are interpreted differently by students. Differences in students' conditions (emotional, physical, mental, etc.) are also responsible for producing errors in measurement, for instance fatigue, the arousal of specific emotions such as anger, fear, or depression, and lack of motivation. Moreover, the selection of test items and their construction, sequence, wording, etc. may also result in measurement error, thus affecting the reliability coefficient.

5.2 Relationship between Validity and Reliability

A test which is valid is also reliable; however, a test which is reliable is not necessarily valid. If a test is valid, it is rightly measuring the purpose/objectives it is supposed to measure. The score obtained on such a test is also reliable, because the test is rightly measuring its intended purpose and the score will therefore be consistent, whether lower or higher.
In comparison, a test may be reliable, meaning that the students' scores come out consistently the same, yet not be rightly measuring its intended purpose, and thus be invalid. Hence a test which is reliable may or may not be valid, but a test which is valid must be reliable. A test with a reliability coefficient of 0.93 is a highly reliable test, but is it really measuring the set objectives from the given content? If it measures its intended purpose, then the test is also valid; if it is not measuring the concepts from the given content, it is invalid. [For more detail see Gay, Mills, & Airasian (2011).]

6. RELIABILITY TYPES

Some types of reliability are given below:

• Test-Retest Reliability
• Equivalence Reliability or inter-class reliability
• Split-Halves Reliability

6.1 Test-Retest Reliability

One of the simplest ways to determine the reliability of a test is test-retest. Test-retest reliability is the degree to which scores on a test are consistent over time. The subjects are given the same test on two occasions, and the scores obtained are compared to see the consistency between the two sets of scores. This can be quantified by measuring the correlation between the two sets of scores: if the correlation coefficient is high, the test has a high degree of test-retest reliability. This method is seldom used by subject teachers but is frequently used by test developers and commercial test publishers, such as those of the IELTS, TOEFL, and GRE.

One issue that arises here is how much time should elapse between the two administrations. If the time interval is short, say a few hours or days, the chances of students remembering their previous answers are high; they will then tend to score the same, which inflates the reliability coefficient. If the duration is long, the ability to perform well on the test increases due to learning over time, again affecting the reliability coefficient. Thus, in reporting test-retest reliability, the time interval between the two administrations should be mentioned along with the reliability coefficient. This kind of reliability is ensured for aptitude and achievement tests so that they measure their intended purpose each time they are administered. A small computational sketch follows.
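The following minimal Python sketch, using hypothetical scores for the same five students on two administrations, computes the Pearson correlation that would serve as the test-retest reliability coefficient.

```python
# Test-retest reliability as a Pearson correlation (hypothetical scores).
import math

first = [42, 37, 45, 30, 39]    # scores on the first administration
second = [40, 35, 46, 28, 41]   # scores on the retest

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"test-retest reliability r = {pearson(first, second):.2f}")
# An r close to 1.00 indicates that scores are consistent over time.
```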
6.2 Equivalence Reliability or Inter-class Reliability

Equivalence reliability relates to two tests that are similar in every aspect except the test items. The correlation between the two tests is measured; if the resulting coefficient, known in this case as the coefficient of equivalence, is high, the two tests are highly reliable, and vice versa. It shall be kept in consideration that the two tests must measure the same variables and have the same number of items, structure, and difficulty level. Besides, the directions for administering the two tests shall also be the same, with a similar scoring style and interpretation. The purpose is to make the scoring on both tests consistent and reliable: no matter which version a student takes, the score should be the same on both. This approach is usually used in situations where the number of candidates is very large or a test is to be taken on two occasions. In such situations, the examiner constructs different versions of the same test so that each group of students can be administered the test at a different time without the fear of test items leaking or being repeated. In some circumstances, researchers construct equivalent pre-tests and post-tests to measure the actual difference in performance, removing the measurement error that arises from recalling the answers given on the first test.

The procedure for establishing equivalence reliability is to construct two versions of the test measuring similar objectives taken from the same content area, with the same number of items, difficulty level, etc. One form of the test is administered to an appropriate group; after some time, the second form is administered to the same group. The scores obtained by the students on both tests are then correlated to find the coefficient of reliability. The difference in the scores obtained by students is treated as error.

6.3 Split-Halves Reliability

This type of reliability is used for measuring internal consistency among the items within a single test. It is theoretically the same as finding equivalence reliability; here, however, the two parts are taken from the same test. This reliability can be found by administering the test only once, so the effect/error caused by the time interval, the students' condition (physical, mental, emotional, etc.), or the use of two groups is minimized. As the name indicates, the items of a single test are divided into two halves to form two equivalent parts. The two parts can be obtained by various methods, e.g. dividing the items into two halves with an equal number of items in each, or splitting the items into odd-numbered and even-numbered halves.

If the test is divided into odd- and even-numbered items, the reliability is calculated as follows. First, the test is administered to the subjects and the items are marked. The items are divided into two halves by combining all the odd-numbered items in one half and the even-numbered items in the other. The scores obtained on the odd- and even-numbered items are totaled separately.
There are thus two sets of scores for each student: the score on the odd-numbered items and the score on the even-numbered items. The two sets of scores are then correlated using the Pearson product-moment correlation coefficient. If the value of the correlation coefficient is high, the two halves of the test are highly reliable, and vice versa.

The reliability coefficient obtained from this correlation needs to be adjusted/corrected, as it applies to a test that has been divided into two (split halves); the actual reliability of the whole test is higher. It is computed using the Spearman-Brown prophecy formula:

r_total = (2 × r_half) / (1 + r_half)

Suppose the reliability coefficient for a 40-item test, obtained by correlating the scores on the 20 odd and 20 even items, is .70. The reliability coefficient for the whole test (40 items) is then:

r_total = 2(.70) / (1 + .70) = 1.40 / 1.70 ≈ .82

The advantage of split-halves reliability is that the test is administered only once. Thus, it can be economically and conveniently used by classroom teachers and researchers to collect data about a test. The sketch below puts the whole procedure together.
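Here is a minimal Python sketch of the full split-halves procedure under hypothetical per-item scores: split into odd/even halves, correlate the half scores, then apply the Spearman-Brown correction. The Pearson computation is the same as in the earlier test-retest sketch.

```python
# Split-halves reliability with Spearman-Brown correction (hypothetical data).
import math

# rows = students, columns = items; 1 = correct, 0 = incorrect
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
]

# Total each student's odd-item and even-item scores (with 0-based indexing,
# columns 0, 2, 4, ... hold the odd-numbered items of the test).
odd = [sum(row[0::2]) for row in scores]
even = [sum(row[1::2]) for row in scores]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_half = pearson(odd, even)
r_total = 2 * r_half / (1 + r_half)   # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, whole-test r = {r_total:.2f}")
```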
7. FACTORS AFFECTING RELIABILITY

Fatigue: The scores obtained on a test by subjects under different conditions may differ, and fatigue plays an important role in affecting the test score. A fatigued or tired student will generally score lower, so fatigue generally decreases the reliability coefficient.

Practice: The reliability of a test can be affected by the amount of practice. As the saying goes, practice makes perfect; practice on a test will likewise improve students' scores, so greater practice tends to increase the reliability coefficient.

Subject variability: The variation in scores increases when there is more variability among the subjects in a group. The greater the differences among subjects in gender, age, program, interests, etc., the greater the variation in scores among individuals. Conversely, if a group is more homogeneous, such as a group of students within the same IQ range, the variation in scores will be smaller.

Test length: The length of a test and the number of items affect its reliability. A test with a greater number of items usually gives more reliable scores, owing to the cancelling out of random positive and negative errors within the test. Thus, adding more items to a test increases its reliability, and deleting items lowers it. One technique for deleting items without decreasing reliability is to remove the items with the lowest reliability values in the item analysis. The Spearman-Brown prophecy formula can also be used for estimating the reliability of a test that is made shorter or longer, provided the original reliability is known:

r_new = (K × r) / (1 + (K - 1) × r)

where
r = reliability of the original test,
r_new = predicted reliability of the test with items added or deleted,
K = ratio of the number of items in the new test to the number of items in the original test.

For example, if a test's original reliability is .60 and the test is doubled in length (K = 2), the predicted reliability is r_new = (2 × .60) / (1 + (2 - 1) × .60) = 1.20 / 1.60 = .75.
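The length-adjustment form of the formula is easy to express as a small Python helper; the first call simply reproduces the worked example above, and the second shows a hypothetical shortening case.

```python
# Spearman-Brown prophecy formula for a lengthened or shortened test.

def spearman_brown(r, k):
    """Predicted reliability when the test is made k times as long."""
    return (k * r) / (1 + (k - 1) * r)

print(spearman_brown(0.60, 2.0))   # doubling the test: 0.75
print(spearman_brown(0.60, 0.5))   # halving the test (hypothetical): ~0.43
```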
8. VALIDITY

Validity refers to the extent to which a test measures what it is supposed to measure; in other words, it refers to the degree to which a test pertains to its objectives. Thus, for a measure or test to be valid, it must consistently measure the particular trait, characteristic, or ability for which it was constructed. According to the Standards for Educational and Psychological Testing (AERA/APA/NCME, 1985), validity "refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from a test". If correct and true inferences can be derived from a test, then that test has greater validity for measuring that specific inference. Cohen and Swerdlik (2010) defined validity as "a judgment based on evidence about the appropriateness of inferences drawn from test scores", an inference being a logical result or deduction. When a test score is used to make an inference about a person's trait or characteristic, the test score is assumed to represent that trait or characteristic.

A test may be valid for a particular group and for a particular purpose, yet not be valid for another group or for a different purpose. A test of English grammar may be valid for a high-school group but not for university students. Moreover, no test or measurement technique is "universally valid" for all time, for all uses, and for all users (Cohen & Swerdlik, 2010). Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage; if those boundaries are exceeded, the validity of the test may be called into question (Cohen & Swerdlik, 2010).

The process of gathering and evaluating evidence about validity is called 'validation'. This is an important phase which the test developer has to undertake with the test takers for a specific purpose. The test developer should report the validity evidence in the test manual for the users/readers.
However, test users may sometimes conduct their own studies to check validation with their own test takers; this is usually called local validation. Some types of validity are:

1. Content-related validity
2. Criterion-related validity
3. Construct-related validity

8.1 Content-related validity

Content validity is sometimes also referred to as face validity or logical validity. According to Gay, Mills, & Airasian (2011), content validity is the degree to which a test measures an intended content area. In order to determine content validity, the content domain of a test shall be well defined and clear. Face validity is a quick way of ascertaining whether a test looks/appears to measure what it purports to measure: a primary-class math test shall contain numbers and figures and shall appear to be a math test rather than a language test. Face validity is not an exact way of estimating content validity and is used only as a quick initial screening.

In order to judge content validity, one must pay attention to 'item validity' and 'sampling validity'. Item validity ensures that the test items represent the relevant content area of a given subject matter. The items in a math test shall include questions from its given content and shall not focus on evaluating language proficiency or on math items not included in the given syllabus. Similarly, an English test shall not contain items related to mathematical formulae or cover the subject matter of a science subject.
Sampling validity is concerned with how well the test items sample the total content area. A test with good sampling validity ensures that the items adequately sample the relevant content area, with the proportion of items from the various units kept in accordance with their importance. Although all the units or concepts cannot be covered in a single test, the examiner must ensure that the proportion of test items accords with the significance of the concepts to be tested. If a Physics test contains items from the Energy chapter only and ignores the other chapters, it will have poor sampling validity.

Content validity can be judged by a content expert, the relevant subject teacher, and/or a textbook writer. According to Gay et al. (2011), content validity cannot be measured quantitatively; rather, the experts carefully examine all the test items for item validity and sampling validity and then make a judgement about the test's content validity. A good way of ensuring content validity is to make a table of specifications that includes the total number of units/topics to be tested, the number of items from each unit/topic, and the different domains of instructional objectives. The table of specifications makes it easy to see the units from which most of the items are drawn, as well as the units that are under-represented or ignored.

Consider a secondary-grade Physics test drawn from five chapters, as given in the table of specifications below. The names of the units are listed along with the number of test items assessing each of the instructional objectives given by Bloom's taxonomy. It is not a hard and fast rule to follow the given proportions strictly: the examiner decides which instructional objectives shall be given more or less weightage for each unit, while ensuring that there is no great imbalance in the weightage assigned to each objective. Thus, some units may require more focus on application, while others may focus on knowledge or comprehension. The objective is to rightly measure the skill that the examiner wants to measure.

Table 2. Table of specifications for a Physics test covering five units

Course content      Knowledge (30%)   Comprehension (40%)   Application (30%)   Total
Forces                     3                  5                     2             10
Energy sources             3                  4                     3             10
Turning Effect             2                  4                     4             10
Kinematics                 3                  3                     4             10
Atomic Structure           4                  4                     2             10
Total                     15                 20                    15             50
  • 26. comprehension. The objective is to rightly measure the skill that the examiner wants to measure. Table 2. Table of specification of Physics test from five units Course content Knowledge (30%) Comprehension (40%) Application (30) Total Forces 3 5 2 10 Energy sources 3 4 3 10 Turning Effect 2 4 4 10 Kinematics 3 3 4 10 Atomic Structure 4 4 2 10 Total 15 20 15 50 8.2 Criterion-related validity Other terms used for criterion-related validity is statistical validity or correlational validity. It provides evidence that a test items measures a specific criterion or trait for which it is designed. In order to determine criterion validity of a test, the first step is to establish the criterion to be measured. Then a variety of test items are developed and then tested. The test items are then correlated with the criterion to determine how well are these items measuring the set criterion through finding Pearson correlation. In case, a number of test are used to measure the criterion, then multiple correlational procedures are used instead of Pearson correlation.
  • 27. Criterion-related validity can be further subdivided into concurrent validity and predictive validity. 8.2.1 Concurrent Validity The main difference between concurrent and predictive validity is the time at which the criterion is measured. For concurrent validity, the criterion is measured at approximately the same time as the alternative measure. However, if the criterion being measured relates to some future time, then it is called predictive validity. The concurrent validity of a test is the degree to which the score on the test is related to the score on an already established test administered at the same time. For example, GRE is an already standardized test for measuring some specific skills and knowledge. Suppose a new test is developed that claims to be measuring the same skills and knowledge, then it is necessary to find the concurrent validity of the new test. For this purpose, the new test and the already established test will be administered to some defined group of individuals at the same time. The scores obtained by individuals on both the test is correlated to observe for similarity or differences. The coefficient of validity can be calculated from correlation which will provide information about the concurrent validity of the new test. A high value of validity coefficient indicates a good concurrent validity and vice versa. 8.2.2 Predictive Validity It is the degree to which a test can predict about the future performance of an individual. It is often used for selecting or grouping individuals. The score on entry test serves as predictive validity about future performance of individuals in a specific program. If the marks on the entry test is high, then it can be predicted that the candidate will do well in future thus ascertaining predictive validity of the entry test. Such test may include ISSB test for entrance to armed forces, GRE test and SAT test for university performance. Likewise, medical test reports such as high
8.2.2 Predictive Validity
Predictive validity is the degree to which a test can predict the future performance of an individual. It is often used for selecting or grouping individuals. The score on an entry test serves as a predictor of the future performance of individuals in a specific programme: if the marks on the entry test are high, it can be predicted that the candidate will do well later, thus ascertaining the predictive validity of the entry test. Such tests include the ISSB test for entrance to the armed forces and the GRE and SAT tests for university performance. Likewise, medical indicators such as high body fat, high cholesterol, smoking and hypertension are all predictive of future heart disease.

It shall be kept in consideration that the predictive validity of tests like entry tests, the GRE, the TOEFL etc. may vary due to a number of factors, such as differences in the curriculum studied by students, the textbooks used for preparation, the geographical location etc. Thus, there is no such thing as perfect predictive validity, and predictions will sometimes turn out to be false: not all students who pass the GRE or an entry test will successfully complete the programme in which they enrol. It is therefore not advisable to rely on the score of a single test for predicting future performance; rather, several indicators shall be used, such as marks in preceding examinations, interview scores, comments of professors, performance on practical skills etc.
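The same idea can be sketched for predictive validity by correlating entry-test scores with a later criterion such as first-year GPA. The data below are invented placeholders, and the use of scipy's linregress is an illustrative choice, not a prescribed method.

```python
# Predictive validity sketch: correlate entry-test scores with a later
# criterion (here, first-year GPA). Data below are invented placeholders.
from scipy.stats import linregress

entry_test = [55, 78, 62, 90, 70, 84, 66, 73]
later_gpa  = [2.4, 3.1, 2.6, 3.7, 2.9, 3.3, 2.7, 3.0]

fit = linregress(entry_test, later_gpa)
print(f"predictive validity coefficient r = {fit.rvalue:.2f}")

# The regression line can give a rough forecast, but as noted above a
# single test score should never be the sole basis for such decisions.
predicted = fit.slope * 75 + fit.intercept
print(f"predicted GPA for an entry score of 75: {predicted:.2f}")
```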
8.3 Construct-related validity
Construct-related validity is used when a test measures a theoretical construct. The construct to be measured is unobservable, yet it exists theoretically: although it cannot be seen directly, its effects can be observed. Examples include intelligence quotient (IQ), anxiety, creativity and attitude. Tests have been developed for measuring such specific constructs, and the researchers/test developers must ensure that the test they construct accurately measures the specific construct for which it was designed. Thus, a test aimed at measuring the level of anxiety shall not measure creativity or IQ. The test score can then be used to make decisions related to the construct. If a test is unable to measure its construct, its validity is questionable and any conclusion based on its score will be meaningless and inaccurate.

The process of determining construct validity is not simple. The measurement of a construct requires a strong theory that makes hypotheses about the construct under study. For example, psychological theories hypothesize that individuals with a higher anxiety level will work longer on a problem than persons with a low anxiety level. Suppose a test measures anxiety level, some persons score higher on this test, and the same persons also work for a longer time on the task/problem under consideration; then we have evidence supporting the theory and thus the construct validity of the test for measuring that construct.
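A minimal Python sketch of this kind of evidence-gathering, using fabricated data for the anxiety example above:

```python
# Construct validity sketch: check the theory-driven prediction that higher
# anxiety scores go with longer time spent on a problem. Placeholder data.
from scipy.stats import pearsonr

anxiety_score   = [12, 25, 18, 30, 22, 15, 28, 20]
minutes_on_task = [14, 26, 17, 33, 24, 15, 30, 21]

r, p = pearsonr(anxiety_score, minutes_on_task)
print(f"r = {r:.2f}, p = {p:.3f}")
# A strong positive correlation consistent with the theory is one piece of
# evidence for the construct validity of the anxiety test.
```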
[Figure: Validity and Reliability. A diagram showing types of reliability (interclass: test-retest and equivalence; ANOVA, alpha, KR-20) and types of validity (content, criterion with its concurrent and predictive forms, and construct). Source: James, Allen, James, & Dale (2005)]

Self-Assessment Questions
Q1. How is an achievement test different from an attitude scale?
Q2. Describe the uses of achievement tests and attitude scales.
Q3. What are the steps for developing a test?
Q4. Define reliability and the reliability coefficient.
Q5. Describe the different types of reliability.
Q6. What are the factors that affect reliability?
Q7. Define the concept of validity in measurement and its relation with reliability.
Q8. Explain the different types of test validity.
Q9. What are the qualities of a good test?

9. REFERENCES
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge University Press.
Cohen, R. J., & Swerdlik, M. E. (2010). Psychological testing and assessment: An introduction to tests and measurement (7th ed.). McGraw-Hill Primis.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2011). Educational research: Competencies for analysis and applications. Pearson Higher Ed.
James, R. M., Allen, W. J., James, G. D., & Dale, P. M. (2005). Measurement and evaluation in human performance. USA: Human Kinetics.
McMillin, E. (2013). Steps to developing a classroom achievement test. Retrieved from https://prezi.com/fhtzfkrreh6p/steps-to-developing-a-classroom-achievement-test/#
Mohamed, R. (2011). 12 characteristics of a good test. Retrieved from https://eltguide.wordpress.com/2011/12/28/12-characteristics-of-a-good-test/
Reliability Coefficient. (n.d.). Retrieved from http://www.alleydog.com/glossary/definition.php?term=Reliability%20Coefficient#ixzz48EmyHlQe
Reynolds, C. R., Livingston, R. L., & Willson, V. L. (2009). Measurement and assessment in education (2nd ed.). Upper Saddle River, NJ: Pearson Education Inc.
Tarrant, M., Ware, J., & Mohammed, A. (2009). An assessment of functioning and non-functioning distractors in multiple-choice questions: A descriptive analysis. Retrieved from http://www.springerlink.com/content/e8k8618552465484/fulltext.pdf