Theoretical Formula Scoring Criterion to Simulate
Correction for Guessing
Abstract
One of the major problems in standardized multiple-choice testing is determining how much
random guessing occurs on a particular question. It is intuitive to think that when a multiple-choice
test is administered, the probability of an individual guessing an item correctly is one out of the number of
answer choices if they truly do not know the correct answer. Even so, guessing potentially
increases error in measuring an individual's true score, let alone their true ability. A measurement
framework known as formula-scoring was proposed decades ago as a
correction for guessing. The purpose of this proposal is to use the general framework of the
formula-scoring method and introduce a new formula-scoring criterion that can be implemented through
simulation. This new criterion is intended to show that, when formula-scoring is applied, an
individual's overall observed score (sum score) may more closely reflect their true score.
I. Purpose
The idea of formula scoring is that an administered multiple-choice test should provide detailed
instructions on how a test-taker should approach each item. The rule generally states that an
examinee should omit an item if they would have to guess (whether they have partial
or no knowledge of the item). If they attempt an item and answer it correctly, they receive a score of 1 (this
presumes they have more than partial knowledge of the item). If they would have to
rely on some type of guessing, they should, per the instructions, omit the item and receive a
score of 0. Answering an item incorrectly earns a score of -1/(k-1), where k is the
number of possible answers to a multiple-choice item. With these rules in place, the complete formula
score for potentially correcting error due to guessing is
$$S = R - \frac{W}{k-1},$$
where S is the individual's corrected score, R is the number of items the individual marked correctly, W is
the number of items the individual marked incorrectly, and k is the number of possible choices for each
item (Diamond and Evans, p. 181). If guidelines such as those listed above are given to individuals
taking the test, observed scores may come closer to true scores.
It would seem that simply implementing this rule in all tests would resolve the issue of guessing,
but guessing behaviors vary greatly. A test-taker may, for instance,
weigh the possible reward of a higher score from attempting all items. Some of these
guessing behaviors include, but are not limited to:
"1. Eliminating one or more answer choices judged to be definitely wrong
2. Making use of unintended semantic or syntactic cues available from the wording of the questions or
the response options
3. Falling into traps set by the ingenious item writer e.g. cliché-like choices that sound plausible but are
wrong
4. Responding on the basis of some element in one of the response choices that attracts him, but at a
relatively low level of confidence
5. Using some essentially random fashion of responding - in the extreme, flipping a coin or marking some
specific response position or pattern of positions" (Thorndike, p. 59)
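As a minimal sketch of the scoring rule described in this section, the corrected score S = R - W/(k-1) can be computed directly. The function names and the example counts below are hypothetical and chosen only for illustration. The second function checks a well-known property of the rule: a purely random guess among k options has an expected item score of zero.

```python
from fractions import Fraction

def formula_score(right, wrong, k):
    """Corrected score S = R - W / (k - 1); omitted items contribute 0."""
    if k < 2:
        raise ValueError("k must be at least 2")
    return Fraction(right) - Fraction(wrong, k - 1)

def expected_blind_guess_gain(k):
    """Expected item score when the answer is a purely random guess."""
    p_correct = Fraction(1, k)
    # +1 with probability 1/k, -1/(k-1) with probability (k-1)/k
    return p_correct * 1 + (1 - p_correct) * Fraction(-1, k - 1)

# Hypothetical examinee on a 32-item, five-option test:
# 20 right, 8 wrong, 4 omitted.
print(float(formula_score(20, 8, 5)))   # 18.0
print(expected_blind_guess_gain(5))     # 0
```

Exact fractions are used so that quarter-point scores (for k = 5) are not affected by floating-point rounding.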
Many testing services implement the formula-scoring rule described above. Testing
companies such as ETS (Educational Testing Service) have employed formula scoring on the SAT
(Scholastic Aptitude Test) and the GRE (Graduate Record Examination) subject exams since 1953
(Budescu and Bar-Hillel, p. 278). However, some researchers have suggested that this formula-score
correction can reduce the reliability of a test. In contrast to these claims, studies have shown that,
when guessing instructions were implemented, the reliabilities of corrected and uncorrected scores were
close to each other (Diamond and Evans, p. 182). With that in mind, the purpose of this paper
is to propose a formula-scoring criterion (FSC) that can be implemented through simulation and that may
bring an individual's observed score closer to their true score.
II. Theoretical Framework
In standardized multiple-choice tests such as a Math or English placement exam, a person's
ability can best be measured using unidimensional Item Response Theory (IRT) models. These models
give the probability of an examinee answering an individual item correctly, assuming we
know their true ability, how difficult the item is, how discriminating the item is, and how much
guessing is potentially involved with the item. This model (known as the three-parameter
logistic or 3PL model) is typically written as
$$P(U_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}},$$
where U_ij is the response to item i by examinee j, θ_j is the examinee's ability, a_i is the discrimination value
of the ith item, b_i is the difficulty value of the ith item, and c_i is the guessing value of the ith item. For
this model to hold, three assumptions must be met: 1) θ is unidimensional, 2) local
independence, and 3) monotonicity. An alternative IRT model that will be presented in this paper is
a 3PL model created by Peter J. Pashley. Pashley designed a 3PL model around what he calls a
four-parameter hyperbolic equation. The purpose of incorporating this model is to explore a model
not widely used in IRT and to compare results under the proposed FSC. Pashley proposed the following model parameters,
where f defines a slope parameter, h defines the difficulty parameter, and k is thought of as the lower
asymptote parameter. With these new parameters, Pashley formulated the four-parameter hyperbolic
model as such,
(Pashley, pp. 9-10).
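To make the standard 3PL response function concrete, the following is a minimal sketch of the item-correct probability. It assumes the common logistic form without the optional scaling constant D = 1.7; whether the original analysis used that constant is not stated. The item values in the usage lines are hypothetical. Pashley's hyperbolic variant is not sketched here.

```python
import math

def p_3pl(theta, a, b, c):
    """P(correct) under the 3PL model: c + (1 - c) * logistic(a * (theta - b))."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Hypothetical item: discrimination 1.2, difficulty 0.25, guessing 0.2.
for theta in (-3.0, 0.25, 3.0):
    print(theta, round(p_3pl(theta, 1.2, 0.25, 0.2), 3))
```

When θ equals the item difficulty b, the probability is halfway between the guessing floor c and 1, and as θ decreases the probability approaches c rather than 0.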
In many IRT studies, simulations are carried out in which computer software, such as R,
simulates random examinees taking a multiple-choice test. These
simulations produce datasets consisting of a very large matrix of 1's (right) and 0's (wrong), where each row
represents a test-taker and each column represents the responses to one multiple-choice item. One
problem with these simulations is that they cannot account for a multiple-choice test
that applies the formula-scoring rule. This paper proposes an FSC, defined as τ, that can be used
when simulating IRT data from multiple-choice tests administered with formula-scoring
rules. The FSC was developed around the assumption of correlations between all estimated c
parameters and various subgroups of estimated θ's. What was discovered was that a vector (denoted as ) of
these correlations followed a pattern very similar to a normal distribution. That is,
where j = the number of items on the test and l = the number of simulated examinees. With this in place, τ
is defined as a random number drawn from this hypothesized normal distribution. Therefore, the following
can be applied to a simulated examination taken by random examinees with formula-scoring applied,
III. Methods & Data
To defend the FSC, two separate simulations were undertaken in R: 1) a simulation with all
parameters assumed to be known and 2) a parametric bootstrap sampling procedure. To carry out
an appropriate simulation, real data were used, consisting of 2,642 examinees attempting a 32-item
multiple-choice Form B College Math Placement test. Parameters were estimated from these data
through BILOG-MG using Expected A Posteriori (EAP), a type of Bayesian estimation method. The
parameters estimated through BILOG-MG followed the distributions given in Table 1.
Table 1. Parameter Distributions
Parameter   Distribution
a           LogNormal(0.290, 0.301)
b           Normal(0.246, 0.962)
c           Beta(2.771, 14.021)
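The following is a rough Python sketch (the study itself used R) of how a standard 0/1 response matrix could be simulated from the distributions in Table 1. Several choices here are assumptions not confirmed by the text: the second LogNormal and Normal arguments are read as standard deviations (log scale for a), abilities are drawn from a standard normal distribution, the 3PL form omits the D = 1.7 constant, and the seed and sample size are arbitrary. The proposed FSC step is not included.

```python
import math
import random

def draw_item_parameters(n_items, rng):
    """Draw (a, b, c) triples from the Table 1 distributions."""
    items = []
    for _ in range(n_items):
        a = rng.lognormvariate(0.290, 0.301)  # assumed (meanlog, sdlog)
        b = rng.gauss(0.246, 0.962)           # assumed (mean, sd)
        c = rng.betavariate(2.771, 14.021)
        items.append((a, b, c))
    return items

def simulate_responses(thetas, items, rng):
    """Return a matrix of 1 (right) / 0 (wrong): rows are examinees, columns are items."""
    matrix = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        matrix.append(row)
    return matrix

rng = random.Random(2014)                             # arbitrary seed for repeatability
items = draw_item_parameters(32, rng)
thetas = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # assumed N(0, 1) abilities
responses = simulate_responses(thetas, items, rng)
print(len(responses), len(responses[0]))              # 1000 32
```

A parametric bootstrap in this spirit would repeat the last three steps many times and recompute the statistics of interest on each simulated matrix.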
IV. Results
In the first simulation, 30,000 simulated examinees took a 32-item test three times, once under each of
three conditions: a standard test (0 wrong and 1 correct), a test implementing the FSC, and a test
implementing the FSC using Pashley's 3PL model. Table 2 presents a sample of 10 examinees
with their observed scores and true scores (calculated under methods of Classical Test Theory).
As the table shows, some examinees' observed scores come fairly close to
their true scores on the formula-scoring test. There were, however, instances under this FSC
where simulated examinees scored well under or above their true score (examinee 7, for example). In the
second simulation, a parametric bootstrap procedure was conducted with the same number of examinees
and items as the real data. Table 3 shows that the reliability (Guttman's
λ2) of the simulated tests (applying the FSC and/or Pashley's model) was lower than the reliability of the
simulated standard test. This is consistent with the claim made by Ruch and Stoddard
in 1925 (Diamond and Evans, p. 182). Notice, however, that the standard deviation of the
biserials is reduced under the FSC method and under Pashley's model with the FSC method.
A negative aspect of the results is the SEM: the test's SEM increased with the FSC
method and with Pashley's model with the FSC method, although only slightly (from 2.3190 to 2.3491
and 2.3727, respectively).
In conclusion, the next 3-5 months will be spent analyzing this new FSC method to assess its validity and
statistical power in simulating an examination administered with formula-scoring rules.
Table 2. Simulated Examinees
                          Observed Score   Observed Score    Observed Score
Examinee   True Ability   Standard         Formula-Scoring   Pashley w/ Formula-Scoring   True Score   True SD
1           0.4564        24.0000          19.2500           17.2500                      20.9410      2.2180
2          -0.0309        17.0000          18.0000           19.7500                      16.5580      2.3520
3           0.2706        22.0000          18.7500           17.0000                      19.3030      2.2960
4          -0.5884        11.0000          11.7500            8.2500                      12.2310      2.3330
5          -1.1745        11.0000           5.7500            9.7500                       8.7980      2.2710
6           0.1921        18.0000          17.0000           13.7500                      18.5840      2.3210
7          -0.5151        14.0000           9.7500           16.7500                      12.7380      2.3380
8          -0.3514        17.0000          15.2500           12.5000                      13.9340      2.3470
9           1.7278        27.0000          30.0000           29.7500                      28.7620      1.4800
10          0.2953        19.0000          19.5000           17.7500                      19.5280      2.2870
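The True Score and True SD columns can be read in more than one way. One plausible reading, sketched below as an assumption rather than as the method actually used, treats the true score as the expected number-correct score given θ (the sum of the item probabilities) and the true SD as the standard deviation of the number-correct score given θ under local independence. The item parameters in the usage lines are hypothetical.

```python
import math

def true_score_and_sd(theta, items):
    """Expected number-correct score and its conditional SD for one ability value.

    items is a list of (a, b, c) triples for a 3PL model without the D = 1.7 constant.
    """
    probs = [c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b))) for a, b, c in items]
    true_score = sum(probs)
    # Under local independence, item variances p * (1 - p) add up.
    true_sd = math.sqrt(sum(p * (1.0 - p) for p in probs))
    return true_score, true_sd

# Hypothetical 4-item test with no guessing and difficulty 0.
score, sd = true_score_and_sd(0.0, [(1.0, 0.0, 0.0)] * 4)
print(score, sd)   # 2.0 1.0
```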
Table 3. Parametric Bootstrap Simulation (500 Iterations)
Statistic      Real Data   Standard 3PL   Standard 3PL w/ Formula-Scoring   Pashley's 3PL w/ Formula-Scoring
Guttman's λ2   0.8396      0.8976         0.8940                            0.8883
SEM            2.4759      2.3190         2.3491                            2.3727
sd(ρbis)       0.1003      0.1412         0.1358                            0.1211
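For readers who want to reproduce statistics like those in Table 3, the sketch below computes Guttman's λ2 from an item score matrix and derives the SEM with the usual classical formula SEM = SD × sqrt(1 − reliability). It is assumed, not confirmed by the text, that the reported SEM was obtained this way. The function names and toy data are hypothetical.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix (n - 1 denominator) of the item columns."""
    n_rows, n_items = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n_rows for i in range(n_items)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n_rows - 1)
             for j in range(n_items)] for i in range(n_items)]

def guttman_lambda2(data):
    """Guttman's lambda-2 lower bound to reliability."""
    cov = covariance_matrix(data)
    n = len(cov)
    total_var = sum(sum(row) for row in cov)  # variance of the sum score
    off = [cov[i][j] for i in range(n) for j in range(n) if i != j]
    return (sum(off) + math.sqrt(n / (n - 1) * sum(x * x for x in off))) / total_var

def standard_error_of_measurement(data, reliability):
    """SEM = SD of sum scores * sqrt(1 - reliability)."""
    totals = [sum(row) for row in data]
    mean = sum(totals) / len(totals)
    sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / (len(totals) - 1))
    # max() guards against a tiny negative value caused by rounding.
    return sd * math.sqrt(max(0.0, 1.0 - reliability))

# Toy data: 6 examinees by 3 items (rows are examinees).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
lam2 = guttman_lambda2(toy)
print(round(lam2, 4), round(standard_error_of_measurement(toy, lam2), 4))
```

The same functions accept formula-scored item matrices (with entries 1, 0, and -1/(k-1)), since they rely only on item covariances.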
V. Educational Significance
When standardized multiple-choice tests, such as a Math/English college
placement test, are administered, various guessing behaviors can result in inaccurate IRT measurement of a
person's true score, which could yield under- or overestimates of their true ability. Based on these
simulations, applying the proposed FSC to multiple-choice
testing practice may help bring a person's overall observed score
closer to their overall true score. Although the suggested method has flaws, we hope its
underlying reasoning will be accepted.
References
Budescu, David, and Maya Bar-Hillel. "To Guess or Not to Guess: A Decision-Theoretic View of Formula
Scoring." Journal of Educational Measurement, Vol. 30, No. 4 (Winter, 1993): 277-291.
Diamond, James, and William Evans. "The Correction for Guessing." Review of Educational Research,
Vol. 43, No. 2 (Spring, 1973): 181-191.
Habing, Brian. STAT 778 - Spring 2014 - R Templates. 2014.
<http://www.stat.sc.edu/~habing/courses/778rS14.html>.
Pashley, Peter J. An Alternative Three-Parameter Logistic Item Response Model. Research. Princeton,
New Jersey: Educational Testing Service, February 1991.
Thorndike, Robert L. Educational Measurement: The Problem of Guessing. Washington, D.C.: American
Council on Education, 1971.
Zimowski, Michele, Eiji Muraki, Robert Mislevy, and Darrell Bock. BILOG-MG 3. Scientific Software
International, Inc. Skokie, IL, 2005-2014.