EDUCATIONAL PSYCHOLOGY (HINA JALAL, PHD GCUF)
Chapter 6:
TEST
A test is an instrument or technique that measures students’ knowledge of something in order to determine what they have learned.
According to Robert L. Linn (2008), “A test is a particular type of assessment that typically consists of a set of questions administered during a fixed period of time under reasonably comparable conditions for all students.”
According to Robert L. Ebel and David A. Frisbie (1991), “A test is a set of questions, each of which has a correct answer, that examinees usually answer orally or in writing.”
Characteristics of a Good Test
Validity :- A test is valid if it measures what we want it to measure and nothing else. Validity is a test-dependent concept, whereas reliability is a purely statistical parameter.
Reliability :- A test is reliable if we get the same results repeatedly. With an unreliable test, on the other hand, a student’s score might fluctuate from one administration to the next.
Practicality :- Practicality refers to the ease of administration and scoring of a test.
Administrability :- The test should be administered uniformly to all students so that the scores obtained do not vary due to factors other than differences in the students’ knowledge and skills. There should be clear directions and procedures for both the students and the person who will score the test.
Comprehensiveness :- A test is comprehensive if it covers all aspects of the subject of study.
Objectivity :- Objectivity represents the agreement of two or more raters or test administrators concerning the score of a student; scoring is not influenced by emotion or personal prejudice. Lack of objectivity reduces test validity in the same way that lack of reliability does.
Simplicity :- A test is simple if it, along with its instructions and other details, is easy to understand.
Scorability :- The test should be easy to score: the directions for scoring should be clear, and an answer sheet and an answer key should be provided.
Types of Tests
There are generally two types of tests used to evaluate educational programs:
A. standardized tests (prepared by publishing companies, formal testing agencies, and
universities), and
B. teacher-made tests (prepared by the teacher).
A. Standardized Test
A standardized test is one constructed by experts and includes explicit instructions for uniform administration and scoring. Standardized tests are formal tests that allow you to compare your students with other students in the region or country. These tests are usually valid and reliable because they have been tested on large sample populations and have been revised to eliminate unreliable or invalid questions. They are useful if you want to compare your students with other students or if you want to rank students against the "norm." (Ratings of validity and reliability are published for standardized tests, and you can check the documentation.)
Examples of standardized tests: TOEFL (Test of English as a Foreign Language), TOEIC (Test of English for International Communication), IELTS (International English Language Testing System), GMAT (Graduate Management Admission Test), etc.
B. Teacher-Made Test
A test developed by the teacher in order to assess students’ achievement and performance in a subject is called a teacher-made test. It is useful for:
• Evaluating learning outcomes and content unique to a class or school.
• Evaluating students’ day-to-day progress and their achievement on work units of varying sizes.
• Evaluating knowledge of current developments in rapidly changing content areas such as science and social studies.
Differences between Standardized Tests and Teacher-Made Tests
1. Standardized tests are generally prepared by specialists who know the principles of test construction very well; teacher-made tests are made by teachers who may not know those principles very well.
2. Standardized tests are prepared very carefully, following the principles of test construction; teacher-made tests are often prepared hurriedly and haphazardly to meet the deadline for administration.
3. Standardized tests are given to a large proportion of the population for which they are intended, for the computation of norms; teacher-made tests are usually given only to the class or classes for which they are intended, and usually no norms are computed.
4. Standardized tests are generally correlated with other tests of known validity and reliability, or with measures such as school marks, to determine their validity and reliability; teacher-made tests are not subjected to any statistical procedures to determine their validity and reliability.
5. Standardized tests are generally highly objective; teacher-made tests may be objective or essay type, in which case scoring is subjective.
6. Standardized tests have their norms computed for purposes of comparison and interpretation; teacher-made tests have no norms unless the teacher computes the median, mean, and other measures for comparison and interpretation.
7. Standardized tests measure innate capacities and characteristics as well as achievement; teacher-made tests generally measure subject achievement only.
8. Standardized tests are intended to be used for a long period of time and for all people of the same class in the culture where they are validated; teacher-made tests are intended to be used only once or twice to measure students’ achievement in subject matter studied during a certain period.
9. Standardized tests are accompanied by manuals of instructions on how to administer and score the tests and how to interpret the results; teacher-made tests have no such manuals, only directions for the different types of tests, which may be given orally or in writing.
10. Standardized tests are generally copyrighted; teacher-made tests are not.
ITEM ANALYSIS
An item analysis is a valuable yet simple procedure that educational professionals can use to analyse each item on a test, determining the proportion of students who select each answer as well as evaluating students’ strengths and weaknesses. Item analysis is an important (probably the most important) tool for increasing test effectiveness. Each item’s contribution is analysed and assessed. It is a scientific way of improving the quality of tests and test items in an item bank. An item analysis provides three kinds of important information about the quality of test items.
1- Item difficulty: a measure of whether an item was too easy or too hard. Expressed as a percentage, an acceptable item difficulty typically ranges from 30 to 70. The formula for the difficulty value (D.V) is:
D.V = (R.H + R.L) / (N.H + N.L)
• R.H – number answering correctly in the highest group
• R.L – number answering correctly in the lowest group
• N.H – number of examinees in the highest group
• N.L – number of examinees in the lowest group
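To make the formula concrete, here is a minimal sketch in Python; the function name and the counts are invented for illustration:

```python
def difficulty_value(r_h, r_l, n_h, n_l):
    """Difficulty value: D.V = (R.H + R.L) / (N.H + N.L)."""
    return (r_h + r_l) / (n_h + n_l)

# Hypothetical item: 24 of 30 examinees in the highest group and
# 12 of 30 in the lowest group answered it correctly.
dv = difficulty_value(r_h=24, r_l=12, n_h=30, n_l=30)
print(f"D.V = {dv:.2f} ({dv * 100:.0f}%)")  # 0.60, i.e. 60% -> within the 30-70 band
```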
2- Item discrimination: a measure of whether an item discriminates between students who knew the material well and students who did not. The discriminative power ranges from -1 to +1. The formula for the discrimination index (D.I) is:
D.I = (R.H - R.L) / N
where N is the number of examinees in either criterion group (the groups are equal in size, so N.H = N.L = N).
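A matching sketch for the discrimination index, reusing the hypothetical counts from the difficulty example (again illustrative, not from the text):

```python
def discrimination_index(r_h, r_l, n):
    """Discrimination index: D.I = (R.H - R.L) / N, with N the size of each group."""
    return (r_h - r_l) / n

di = discrimination_index(r_h=24, r_l=12, n=30)
print(f"D.I = {di:.2f}")  # 0.40: positive, so stronger students do better on this item
```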
3- Effectiveness of alternatives: a determination of whether distractors (incorrect but plausible answers) tend to be chosen by the less able students and not by the more able students.
Purposes of item analysis
Item analysis serves the following purposes:
• Improving the test.
• Identifying items which may have a bias.
• Serving as a basis for class discussion.
• Diagnosing the students' strengths and weaknesses.
• Increasing the skill of item construction.
Steps used in item analysis
Step 1 Rank the answer sheets in order from the highest score to the lowest.
Step 2 Create criterion groups from the extremes of the scores. With a large sample, generally the top 27% of the sample forms the Upper Group (U) and the bottom 27% forms the Lower Group (L).
Step 3 Count the number of students in the Upper Group (U) who have selected the correct answer.
Step 4 Count the number of students in the Lower Group (L) who have selected the correct answer.
Step 5 Enter the values in the test item card.
Step 6 Compute the difficulty level using the formula.
Step 7 Compute the discriminating power of each item.
Step 8 Find the effectiveness of the distractors in each item.
Step 9 Interpret the item analysis report.
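As a concrete illustration of Steps 1 to 7, here is a minimal Python sketch for a single item; the data layout (pairs of total score and item correctness) and the rounding of the 27% cut-off are assumptions made for the example:

```python
# Hypothetical answer sheets: (total test score, whether this item was answered correctly).
sheets = [
    (95, True), (91, True), (88, True), (84, True), (80, False),
    (77, True), (72, False), (65, True), (60, False), (55, False),
    (50, False), (44, True), (40, False), (35, False), (30, False),
]

# Step 1: rank the answer sheets from the highest score to the lowest.
sheets.sort(key=lambda s: s[0], reverse=True)

# Step 2: top 27% -> Upper Group (U), bottom 27% -> Lower Group (L).
k = max(1, round(0.27 * len(sheets)))
upper, lower = sheets[:k], sheets[-k:]

# Steps 3-4: count correct answers in each group.
r_h = sum(correct for _, correct in upper)
r_l = sum(correct for _, correct in lower)

# Steps 6-7: difficulty level and discriminating power for this item.
dv = (r_h + r_l) / (2 * k)
di = (r_h - r_l) / k
print(f"Group size = {k}, D.V = {dv:.2f}, D.I = {di:.2f}")
```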
RELIABILITY
Reliability is the degree to which an assessment tool produces stable and consistent results.
The reliability of an assessment tool is the extent to which it measures learning consistently.
Test-retest reliability is a measure of reliability obtained by administering the same test twice over
a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be
correlated in order to evaluate the test for stability over time.
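For instance, the correlation step can be computed with a hand-rolled Pearson coefficient; the scores below are invented for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores of five students at Time 1 and Time 2.
time1 = [78, 85, 62, 90, 71]
time2 = [80, 83, 65, 88, 70]
print(f"Test-retest reliability r = {pearson(time1, time2):.2f}")
```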
1- Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed.
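A simple way to quantify such agreement is the percent agreement between two raters (Cohen's kappa, which corrects for chance agreement, goes beyond the text); the ratings below are invented:

```python
# Hypothetical ratings of ten essay responses by two raters on a 1-4 rubric.
rater_a = [3, 4, 2, 3, 1, 4, 3, 2, 4, 3]
rater_b = [3, 4, 2, 2, 1, 4, 3, 2, 3, 3]

# Count the responses on which the raters assign the same score.
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Percent agreement = {agreements / len(rater_a):.0%}")  # 80%
```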
2- Split-half reliability is a subtype of internal consistency reliability. The process of
obtaining split-half reliability is begun by “splitting in half” all items of a test that are intended to
probe the same area of knowledge (e.g., World War II) in order to form two “sets” of
items. The entire test is administered to a group of individuals, the total score for each “set” is
computed, and finally the split-half reliability is obtained by determining the correlation between
the two total “set” scores.
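A sketch of this procedure, splitting the items into odd- and even-numbered halves (one common way of "splitting in half"); the item scores are invented, and the final Spearman-Brown step is a standard length correction not mentioned in the text above:

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical 0/1 item scores of six students on an eight-item test.
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
]

# Form the two "sets" and total each student's score on each half.
odd_totals = [sum(row[0::2]) for row in scores]
even_totals = [sum(row[1::2]) for row in scores]

r_half = correlation(odd_totals, even_totals)  # split-half reliability
r_full = 2 * r_half / (1 + r_half)             # Spearman-Brown estimate for the full test
print(f"Split-half r = {r_half:.2f}, corrected = {r_full:.2f}")
```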
3- Parallel forms reliability is a measure of reliability obtained by administering different
versions of an assessment tool (both versions must contain items that probe the same construct,
skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions
can then be correlated in order to evaluate the consistency of results across alternate versions.
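The computation mirrors the test-retest case: correlate each student's scores on the two forms (the scores are invented for illustration):

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical scores of the same five students on two parallel forms.
form_a = [72, 85, 60, 91, 78]
form_b = [70, 88, 63, 89, 75]
print(f"Parallel-forms reliability r = {correlation(form_a, form_b):.2f}")
```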
VALIDITY
Validity is a bit more complex because it is more difficult to assess than reliability. There are various ways to assess and demonstrate that an assessment is valid, but in simple terms, validity refers to how well a test measures what it is supposed to measure: the validity of an assessment tool is the extent to which it measures what it was designed to measure.
1. Face Validity ascertains that the measure appears to be assessing the intended construct under study. Stakeholders can easily assess face validity. Although this is not a very “scientific” type of validity, it may be an essential component in enlisting the motivation of stakeholders: if the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged from the task.
2. Construct Validity is used to ensure that the measure actually measures what it is intended to measure (i.e. the construct), and no other variables. Using a panel of “experts” familiar with the construct is one way in which this type of validity can be assessed. The experts can examine the items and
decide what that specific item is intended to measure. Students can be involved in this process to
obtain their feedback.
3. Criterion-Related Validity is used to predict future or current performance - it correlates test
results with another criterion of interest.