Item writing involves 8 key steps: 1) defining what is to be measured, 2) generating an item pool, 3) avoiding long items, 4) considering reading level, 5) avoiding double-barreled items, 6) mixing positively and negatively worded items, 7) considering cultural sensitivity, and 8) realizing items become obsolete. There are several item formats including dichotomous (true/false), polytomous (multiple choice), Likert scales, category scales, checklists, and Q-sorts. Each format has advantages and disadvantages for assessing different traits like knowledge, attitudes, or personalities. Careful item writing following best practices can help ensure accurate assessment of test takers.
Item writing
ITEM WRITING AND ITEM FORMATS
Objectives
1. To outline the steps taken in developing test items.
2. To discuss the different item formats (dichotomous, polychotomous,
Likert, category, checklist and Q-sort) and their advantages and
disadvantages.
Introduction
Items are the specific questions or problems that make up a test (Kaplan
& Saccuzzo, 2009).
An item is a specific stimulus to which a person responds overtly, i.e. the
response can be observed and scored. The response can be evaluated on a
scale or as a grade, e.g. a score of 75% on a 100-item test means the
individual answered 75 items correctly.
A test is a measurement device or technique used to quantify behaviour
or to help in understanding and predicting behaviour. It can also be
described as a collection of items.
Item Writing
Item writing involves a number of steps;
1. Define clearly what you want to measure.
Most often, it will be in one of these areas;
A type of cognitive achievement: this can be either a skill or
knowledge. An example of knowledge is ‘knowledge of Ugandan
history’; an example of a skill is ‘demonstration of an ability to
multiply decimals’.
A type of affective trait, for example interest in psychology.
The items should be made as specific as possible.
2. Generate an item pool
The item developer should take care in selecting and developing items. They
should avoid redundant items.
To end up with the required number of items, one may need to write 3-4
draft items for each item to be kept in the final test. For example, if you
want 20 items in your test, you may generate a pool of 60-80 items.
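The pool-size rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `item_pool_range` and its parameters are hypothetical, and the 3-4 ratio is the one stated in the text.

```python
def item_pool_range(final_items, low_ratio=3, high_ratio=4):
    """Return the (minimum, maximum) number of draft items to write.

    final_items: number of items wanted in the finished test.
    low_ratio, high_ratio: draft items written per final item
    (3 and 4 follow the rule of thumb in the text).
    """
    if final_items <= 0:
        raise ValueError("final_items must be a positive integer")
    return final_items * low_ratio, final_items * high_ratio


# A 20-item test calls for a pool of 60 to 80 draft items.
print(item_pool_range(20))  # (60, 80)
```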
3. Avoid exceptionally long items.
Exceptionally long items tend to be misleading or confusing, so they
should be avoided.
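One rough way to screen a draft pool for overly long items is to count words. The sketch below assumes a cut-off of 20 words, which is an arbitrary illustrative threshold and not a value given in the text; the function name `flag_long_items` is likewise hypothetical.

```python
def flag_long_items(items, max_words=20):
    """Return the items whose word count exceeds max_words.

    max_words is an assumed, adjustable threshold; the text does not
    prescribe a specific limit.
    """
    return [text for text in items if len(text.split()) > max_words]


draft_items = [
    "I feel tired",
    "I often find that when I am asked to work in a team with people "
    "I do not know well I become anxious and unable to contribute",
]

# Only the second item (27 words) is flagged for shortening.
for item in flag_long_items(draft_items):
    print("Too long:", item)
```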
4. Keep the level of reading difficulty appropriate for those who will
complete the scale.
It is important to be mindful of the reading ability of the targeted test
takers. If, for example, the item developer is writing for nursery school
children, the items should be within the capability of those children.
Otherwise they will not understand the items and may fail the test for
that reason alone.
5. Avoid double-barrelled items that convey two or more ideas at
the same time.
Double-barrelled items can confuse test takers, who may be unable to
decide whether to agree or disagree with the statement. This eventually
affects the results of the test. For example:
Indicate whether you agree or disagree with the statement.
“I vote NRM because I support Universal Secondary Education”.
This item combines two different statements: “I vote NRM” and “I support
Universal Secondary Education”. Someone can agree with one but not the
other.
6. Consider mixing positively and negatively worded items.
At times, test takers develop an ‘acquiescence response set’, where they
tend to agree with all items regardless of content. To reduce this bias,
you may include items that are worded in the opposite direction.
For example:
“I feel tired”.
“I feel energised”.
7. When writing test items, you need to be sensitive to the
cultural and ethnic differences.
For example, if you are writing items for a religious population, it may not
be appropriate to write items about behaviours that may be offensive to
them, such as drinking alcohol or eating foods that are taboo to them.
8. It is important to realise that items become obsolete. When
they become obsolete, they lose reliability.
Items that are used over a long period tend to lose reliability. Their
reliability should therefore be rechecked whenever they are to be used.
Other general guidelines for item writing:
“All of the Above” should not be an answer option
“None of the Above” should not be an answer option
All answer options should be credible
Order of answer options should be logical or vary
Items should cover important concepts and objectives
Negative wording should not be used
Answer options should include only one correct answer
Specific determiners (e.g. always, never) should not be used
Answer options should be homogeneous
Correct answer options should not be the longest answer option
Items should be independent of each other
Test copies should be clear, readable and not hand-written
Item Formats
Different item formats are used for different purposes. The format used for
evaluating attitudes may not be the one used for assessing personality.
Each format is chosen by weighing its advantages and disadvantages.
a. Dichotomous Format
This format offers two alternatives for each item. A point is awarded when
the test taker selects the keyed alternative.
A common dichotomous test is the True-False examination. The test taker’s
task is to choose either what is true or what is false, but not both for a
single item.
Other response pairs used in this format include “Yes” and “No”.
An example of dichotomous items:
No.  Item                                              True   False
1    Tough managers produce best performing teams
2    Teamwork encourages social loafing
3    Introverts perform tasks better as individuals
4    I often worry about my reading ability
Advantages of the Dichotomous Format
1. It is easy to construct and administer.
2. It is easily scored. The tester only needs to count the number of
correct items to get the score.
3. The true-false items require absolute judgement. The test taker
cannot choose anything in between.
Disadvantages
1. They encourage students to memorise material, so that students can
pass the test even when they have not really understood the concepts.
2. Dichotomous items tend to be less reliable than other item formats.
This is because every item carries a 50% chance of being answered
correctly by guessing alone. It is easy for a test taker to guess a correct
answer without understanding the content of the item.
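To see how much guessing alone can contribute on a true-false test, the sketch below treats each guess as an independent 50/50 trial (a simplifying assumption) and uses the binomial distribution. The function names are hypothetical and chosen only for illustration.

```python
from math import comb


def expected_by_guessing(n_items, p_correct=0.5):
    """Expected number of items answered correctly by blind guessing."""
    return n_items * p_correct


def prob_at_least(k, n_items, p_correct=0.5):
    """Probability of at least k correct answers out of n_items when every
    answer is an independent guess with success probability p_correct."""
    return sum(
        comb(n_items, j) * p_correct**j * (1 - p_correct) ** (n_items - j)
        for j in range(k, n_items + 1)
    )


# On a 10-item true-false test, a pure guesser expects 5 correct answers
# and reaches 6 or more correct with probability 386/1024.
print(expected_by_guessing(10))         # 5.0
print(round(prob_at_least(6, 10), 3))   # 0.377
```

Under these assumptions, more than a third of pure guessers would score 60% or better on a 10-item test, which illustrates why short dichotomous tests tend to be unreliable.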
b. The Polychotomous Format (Polytomous)
This resembles the dichotomous format except that it has more than two
alternatives.
A point is given for selecting the correct alternative, and no point is given
for selecting any other choice.
For a polychotomous examination, the test taker has to determine which
alternative is correct. Incorrect alternatives are called distractors.
According to psychometric theory, adding more distractors increases the
reliability of the item. It is usually best to have 3 to 4 distractors for this
purpose. However, poorly written distractors may lower the quality of the
test.
Unlike the dichotomous format, where the chance of a correct guess is
50%, in the polychotomous format the chance of success depends on the
number of choices available per item. With four choices, the chance of a
correct guess is one out of four, which is equivalent to 25%. With three
choices, it is one out of three, which is equivalent to 33.3%.
Some test takers can therefore get items correct simply by guessing, even
if they have not studied the subject matter.
Because of guessing, a correction for guessing is applied. The formula to
correct for guessing on a test is:
\[ \text{Corrected score} = R - \frac{W}{n - 1} \]
Where R = the number of right responses
W = the number of wrong responses
n = the number of choices for each item
Take an example of a test with 100 items of 4 choices each, on which the
test taker guesses all through the exercise. The expected score from
guessing is a quarter (25) of the 100 items. R is therefore expected to be
25 of the 100 items, the number of wrong responses is W = (100 - 25) = 75,
and n = 4.
Using the formula above:
\[ \text{Corrected score} = 25 - \frac{75}{4 - 1} = 25 - \frac{75}{3} = 25 - 25 = 0 \]
So when correction for guessing is applied, the corrected score is
actually 0.
An example:
Mukiibi was subjected to a psychological test with 100 items, each item
having four answer choices to choose from. He scored 88 correct answers
and was pronounced to have passed the test. What is Mukiibi’s score
after correction for guessing?
From the formula,
\[ \text{Corrected score} = R - \frac{W}{n - 1} \]
R is observed to be 88 of the 100 items, W = 12 and n = 4.
\[ \text{Corrected score} = 88 - \frac{12}{4 - 1} = 88 - \frac{12}{3} = 88 - 4 = 84 \]
So Mukiibi’s corrected score is 84.
Omitted items are not included in the calculation. They earn neither
credit nor penalty.
The expression W/(n - 1) is an estimate of the number of responses the
test taker is expected to get right by chance.
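The correction for guessing can be written as a small Python function. This is a minimal sketch of the formula given above; the function name `corrected_score` is hypothetical, and omitted items are handled simply by not counting them in either argument.

```python
def corrected_score(right, wrong, n_choices):
    """Corrected score = R - W / (n - 1).

    right: number of right responses (R)
    wrong: number of wrong responses (W); omitted items are not counted
    n_choices: number of choices per item (n)
    """
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Pure guessing on 100 four-choice items: 25 right, 75 wrong.
print(corrected_score(25, 75, 4))  # 0.0

# Mukiibi: 88 right, 12 wrong, four choices per item.
print(corrected_score(88, 12, 4))  # 84.0
```

Both results agree with the worked examples in the text.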
Advantages of use of polychotomous format
It takes little time for the test takers to respond since they do not write
the answers. Hence one can respond to a large number of items in a
short time.
The tests are easy to score. The tester only counts the correct items to
get the score.
Disadvantages
A test taker may select the correct answer purely by guessing.
c. The Likert Format
This format requires the respondent to indicate the degree of agreement
with a particular attitudinal statement.
It is very popular with personality and attitude scales.
This scale is non-comparative and measures only a single trait. The
respondent is asked to indicate their level of agreement with a given
statement by way of an ordinal scale.
It is commonly expressed as a four-, five- or six-point scale. A five-point
scale typically ranges over Strongly agree, Agree, Neutral, Disagree and
Strongly disagree. Scales with an even number of points (four or six) omit
the neutral midpoint, so the respondent cannot remain neutral.
An example of a six-point scale:
No.  Item                          Strongly   Moderately  Mildly     Mildly  Moderately  Strongly
                                   Disagree   Disagree    Disagree   Agree   Agree       Agree
1    I am afraid of caterpillars
2    I love snakes
3    I fear cats
4    I do not like centipedes
Likert scales are sometimes referred to as summative scales because the
response to each item can be summed with the responses to related items
to create a score for a group of statements.
Scoring requires that each negatively worded item be reverse scored
before the responses are summed.
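Reverse scoring and summation can be sketched as follows. The sketch assumes responses are coded 1 (Strongly Disagree) to 6 (Strongly Agree), so a reverse-scored response becomes 7 minus the original value. The function name, the coding, and the choice to treat “I love snakes” as the reverse-keyed item on a hypothetical fear-of-small-animals scale are all illustrative assumptions.

```python
def score_likert(responses, reverse_keyed, n_points=6):
    """Sum Likert responses after reverse scoring the reverse-keyed items.

    responses: dict mapping item number -> response (1..n_points)
    reverse_keyed: set of item numbers worded in the opposite direction
    n_points: number of points on the scale (assumed coding 1..n_points)
    """
    total = 0
    for item, value in responses.items():
        if not 1 <= value <= n_points:
            raise ValueError(f"item {item}: response {value} is off the scale")
        # A reversed response mirrors the scale: 1 <-> n_points, 2 <-> n_points - 1, ...
        total += (n_points + 1 - value) if item in reverse_keyed else value
    return total


# Responses to the four example items; item 2 ("I love snakes") is
# assumed to be keyed in the opposite direction to the fear items.
responses = {1: 5, 2: 2, 3: 4, 4: 6}
print(score_likert(responses, reverse_keyed={2}))  # 5 + (7 - 2) + 4 + 6 = 20
```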
Advantages
It is easy to construct
It produces a highly reliable scale
It is easy for test takers to read and complete.
Weaknesses of this scale include:
Central tendency bias; participants may avoid extreme response
categories
Acquiescence bias; participants may agree with statements as
presented in order to please the tester.
Social desirability bias; Respondents may wish to portray themselves
in a more favorable light rather than being honest.
Validity may be difficult to demonstrate; it may not portray what the
tester intended to measure
d. The Category Format
It is similar to the Likert format but uses an even greater number of
choices, and it relies on a defined point rating system.
Test takers are required to rate a given item or scenario on a scale
within a category range. For example, one may use a scale of 1 to 5 or
1 to 10, where 1 is the lowest score and 5 or 10 is the highest score
respectively.
The numbers assigned when using a rating scale are sometimes
influenced by the context in which the items are rated.
The number of categories used depends on the fineness of the
discrimination that the test takers are willing to make. If they are
willing to make fine discriminations, more categories can be used.
An example:
1. On a scale of 1 to 5, rate Bazalaki’s attitude towards class
assignments. (where 1 is very negative and 5 is very positive)
2. On a scale of 1 to 10, rate the level of academic excellence of Makerere
University. (where 1 is very ordinary and 10 is very competitive)
Advantages
It is very easy to administer
Disadvantages
Ratings depend on the context in which the test subject is being
rated. E.g. in a class of averagely performing students, a student may
be rated 9, which represents a very good performer. Yet if the same
student is placed in another class of only highly performing students,
the same student may be rated 3, which represents a relatively poor
performance.
Also, on this scale test takers tend to spread their responses evenly
across the entire scale of 1 to 10, which may not fairly represent the
actual score.
In order to overcome the problems above, the end points of this scale
have to be clearly defined, by outlining the expected characteristics of
each point (Kaplan & Ernest, 1983).
For example, if one is looking at the performance of students in a given
class, for a student to score 10, they must:
- attend all classes
- contribute to every question asked in class
- solve problems fast
- assist others to complete their class work
- regularly pass class tests with over 80%.
On the other hand, the opposite characteristics describe a student
scoring 1.
e. Checklists
These are used in personality measurement.
The test taker is given a list of adjectives and asked to indicate whether
each is characteristic of him/herself or someone else.
Here, a rating of 9 will mean that the statement on the card is the best
description of the characteristic of the person being studied, while 1 is
the least description of that person’s characteristics.
For example:
Castro…
Is a dependable person.
Is a talkative individual.
Behaves in a sympathetic or considerate manner.
Appears to have a high degree of intellectual capacity.
Is protective of those close to him.
Tends to be self-defensive.
Is thin-skinned; sensitive to anything that can be construed as
criticism.
f. Q-sorts
A test taker is given a list of statements about a person's
characteristics and asked to sort them into a given number of piles,
e.g. 5 or 9 piles.
The statements are sorted into piles that indicate the degree to which
they appear to describe the person accurately.
Piles numbered 1 to 9 are provided to the test taker, who places each
statement card on the pile number that best reflects how well the
statement describes the person being studied.
For example, 100 statements about a person’s characteristics are written
on cards, with each card having one statement, making 100 cards.
The cards are distributed across the 9 piles according to the test taker’s
judgement of the subject being studied.
The number of cards placed on each pile is recorded, and the statements
that best describe the person under study are noted.
The observed results tend to follow a normal distribution. However, items
that lie at the extreme ends of the continuum speak volumes about
the true personal characteristics of the subject.
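Tallying a completed sort can be sketched in Python as below. The sketch assumes 9 piles, with pile 9 holding the most descriptive statements and pile 1 the least descriptive, as stated earlier in the text. The function names and the sample pile placements are hypothetical.

```python
from collections import Counter


def pile_frequencies(sort, n_piles=9):
    """Return the number of cards on each pile, from pile 1 to pile n_piles.

    sort: dict mapping statement -> pile number chosen by the test taker
    """
    counts = Counter(sort.values())
    invalid = [pile for pile in counts if not 1 <= pile <= n_piles]
    if invalid:
        raise ValueError(f"pile numbers outside 1..{n_piles}: {invalid}")
    return [counts.get(pile, 0) for pile in range(1, n_piles + 1)]


def extreme_statements(sort, n_piles=9):
    """Return (least descriptive, most descriptive) statements, i.e. the
    cards placed on pile 1 and on pile n_piles."""
    least = sorted(s for s, pile in sort.items() if pile == 1)
    most = sorted(s for s, pile in sort.items() if pile == n_piles)
    return least, most


# Hypothetical placements for five of the statement cards.
sort = {
    "Is a dependable person": 9,
    "Is a talkative individual": 5,
    "Behaves in a sympathetic or considerate manner": 5,
    "Is protective of those close to him": 7,
    "Tends to be self-defensive": 1,
}

print(pile_frequencies(sort))    # [1, 0, 0, 0, 2, 0, 1, 0, 1]
print(extreme_statements(sort))  # (['Tends to be self-defensive'], ['Is a dependable person'])
```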
Conclusion
Items that are written carefully help in the assessment of the subject and
give accurate results to the tester.
References
Kaplan, R. M. & Saccuzzo, D. P. (2009). Psychological Testing: Principles,
Applications and Issues.
Crocker, L. & Algina, J. (2008). Introduction to Classical and Modern
Test Theory.
Suen, H. K. & McClellan, S. (2003). Test item construction techniques
and principles.