Theoretical Formula Scoring Criterion to Simulate
Correction for Guessing
Abstract
A persistent problem in standardized multiple-choice testing is determining how much random guessing
occurs on a particular question. Intuitively, when a multiple-choice test is administered, the
probability that an examinee who truly does not know the answer guesses an item correctly is one over
the number of answer choices. Even so, guessing potentially increases the error in measuring an
individual's true score, let alone their true ability. A measurement framework known as formula scoring
was proposed decades ago to correct for guessing. The purpose of this proposal is to use the general
framework of the formula-scoring method and introduce a new formula-scoring criterion that can be
implemented through simulation. The new criterion is intended to show that, if formula scoring is
applied, an individual's overall observed score (sum score) may better reflect their true score.
I. Purpose
The idea of formula scoring is that a multiple-choice test should come with detailed instructions on
how a test-taker should approach each item. The rule generally states that an examinee should omit an
item if they would have to guess (whether they have partial or no knowledge of the item). If they
attempt an item and answer it correctly, they receive a score of 1 (this presumes more than partial
knowledge of the item). If they would have to rely on some type of guessing, they should, per the
instructions, omit the item and receive a score of 0. Answering an item incorrectly yields a score of
-1/(k-1), where k is the number of possible answers to a multiple-choice item. With these rules in
place, the complete formula score for correcting for guessing is

$$S = R - \frac{W}{k-1},$$

where S is the individual's corrected score, R is the number of items the individual answered
correctly, W is the number of items the individual answered incorrectly, and k is the number of
possible choices for each item (Diamond and Evans, p. 181). If guidelines such as these are given to
the individuals taking the test, observed scores may come closer to true scores. It would seem that
simply implementing this rule in all tests would resolve the issue of guessing, but guessing behaviors
vary greatly, and the prospect of a higher score can tempt the test-taker to attempt all items anyway.
These guessing behaviors include, but are not limited to:
"1. Eliminating one or more answer choices judged to be definitely wrong
2. Making use of unintended semantic or syntactic cues available from the wording of the questions or
the response options
3. Falling into traps set by the ingenious item writer, e.g. cliché-like choices that sound plausible
but are wrong
4. Responding on the basis of some element in one of the response choices that attracts him, but at a
relatively low level of confidence
5. Using some essentially random fashion of responding - in the extreme, flipping a coin or marking
some specific response position or pattern of positions" (Thorndike, p. 59)
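The scoring rule at the start of this section (1 for a right answer, 0 for an omitted item, -1/(k-1) for a wrong answer) can be illustrated with a minimal sketch. The function names and the example counts below are hypothetical and are not taken from the study; the second function only checks the usual rationale for the penalty, namely that a blind guess among k choices has an expected item score of zero.

```python
from fractions import Fraction


def formula_score(right, wrong, k):
    """Corrected score S = R - W / (k - 1); omitted items contribute 0."""
    if k < 2:
        raise ValueError("an item needs at least two answer choices")
    return Fraction(right) - Fraction(wrong, k - 1)


def expected_blind_guess(k):
    """Expected item score when one of k choices is picked uniformly at random."""
    p_correct = Fraction(1, k)
    # +1 with probability 1/k, -1/(k-1) with probability (k-1)/k
    return p_correct * 1 + (1 - p_correct) * Fraction(-1, k - 1)


# Hypothetical examinee: 32 five-choice items, 20 right, 8 wrong, 4 omitted.
print(float(formula_score(20, 8, 5)))  # 18.0
print(expected_blind_guess(5))         # 0
```

Exact fractions are used so that the zero expectation is not blurred by floating-point rounding.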
Many testing services implement the formula-scoring rule described above. ETS (Educational Testing
Service), for example, has employed formula scoring on the SAT (Scholastic Aptitude Test) and the GRE
(Graduate Record Examination) subject exams since 1953 (Budescu and Bar-Hillel, p. 278). However, some
researchers have suggested that the formula-score correction can reduce the reliability of a test. In
contrast, other studies have shown that, when guessing instructions were given, the reliabilities of
corrected and uncorrected scores did not differ by much (Diamond and Evans, p. 182). With that in mind,
the purpose of this paper is to propose a formula-scoring criterion (FSC) that can be implemented
through simulation and that may bring an individual's observed score closer to their true score.
II. Theoretical Framework
In standardized multiple-choice tests such as a Math or English placement exam, a person's ability is
commonly measured using unidimensional Item Response Theory (IRT) models. These models give the
probability that an examinee answers an individual item correctly, assuming we know the examinee's true
ability, how difficult the item is, how discriminating the item is, and how much guessing is
potentially involved with the item. This model (known as the three-parameter logistic, or 3PL, model)
is typically written as

$$P(U_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}},$$

where U_ij is the response of examinee j to item i, θ_j is the examinee's ability, a_i is the
discrimination value of the ith item, b_i is the difficulty value of the ith item, and c_i is the
guessing value of the ith item. For this model to hold, three assumptions must be met: 1) θ is
unidimensional, 2) local independence, and 3) monotonicity. An alternative IRT model presented in this
paper is a 3PL model created by Peter J. Pashley, who designed it around what he calls a four-parameter
hyperbolic equation. The purpose of incorporating this model is to evaluate the proposed FSC under a
model not widely used in IRT. Pashley proposed the following model parameters,
[equations not preserved in the source]
where f defines a slope parameter, h defines the difficulty parameter, and k is thought of as the lower
asymptote parameter. With these parameters, Pashley formulated the four-parameter hyperbolic model as
[equation not preserved in the source]
(Pashley, pp. 9-10).
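As a small illustration of the standard 3PL model described in this section (not Pashley's hyperbolic variant), the item response function can be sketched as follows. This assumes the common logistic form c + (1 - c) / (1 + exp(-a(θ - b))) with no additional scaling constant; the function name and the item values are hypothetical.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response for ability theta.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    Assumes the plain logistic metric (no scaling constant).
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item with a = 1.2, b = 0.0, c = 0.2.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_correct_3pl(theta, 1.2, 0.0, 0.2), 4))
```

When θ equals the difficulty b, the probability is c + (1 - c)/2, halfway between the guessing floor and 1; as θ decreases the curve approaches c rather than 0.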
In many IRT studies, simulations are carried out in which software such as R simulates random
examinees taking a multiple-choice test. These simulations produce datasets consisting of a very large
matrix of 1's (right) and 0's (wrong), where each row represents a test-taker and each column holds the
scores on one multiple-choice item. One problem with these simulations is that they cannot account for
a multiple-choice test administered under the formula-scoring rule. This paper proposes an FSC, denoted
τ, for simulating IRT data from multiple-choice tests administered under formula-scoring rules. The FSC
was developed around the assumption of correlations between all estimated c parameters and various
subgroups of estimated θ's. The vector of these correlations was found to follow a pattern very similar
to a normal distribution. That is,
[equation not preserved in the source]
where j = the number of items on the test and l = the number of simulated examinees. With this in
place, τ is defined as a random draw from this hypothesized normal distribution. The following can
therefore be applied to a simulated examination taken by random examinees with formula scoring applied:
[equation not preserved in the source]
III. Methods & Data
To defend the FSC, two separate simulations were undertaken in R: 1) a simulation with all parameters
assumed known and 2) a parametric bootstrap sampling procedure. To ground the simulations, real data
were used, consisting of 2,642 examinees attempting a 32-item multiple-choice Form B College Math
Placement test. Parameters were estimated from these data in BILOG-MG using Expected A Posteriori
(EAP), a Bayesian estimation method. The estimated parameters followed the distributions given in
Table 1.
Table 1. Parameter Distributions
Parameter   Distribution
a           LogNormal(0.290, 0.301)
b           Normal(0.246, 0.962)
c           Beta(2.771, 14.021)
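A standard (0/1-scored) simulation of the kind described in this section could be sketched as below, using only Python's standard library rather than the R code used in the study. Several things here are assumptions: the two LogNormal values are read as the mean and standard deviation on the log scale, the second Normal value is read as a standard deviation, abilities are drawn from N(0, 1), and the seed and function names are arbitrary. The sketch covers only the standard test; it does not implement the FSC.

```python
import math
import random


def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response (plain logistic metric assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def draw_item_parameters(n_items, rng):
    """Draw one (a, b, c) triple per item from the Table 1 distributions."""
    return [
        (
            rng.lognormvariate(0.290, 0.301),  # a: assumed log-scale mean and SD
            rng.gauss(0.246, 0.962),           # b: second value assumed to be an SD
            rng.betavariate(2.771, 14.021),    # c: Beta shape parameters
        )
        for _ in range(n_items)
    ]


def simulate_responses(n_examinees, items, rng):
    """Return abilities and a 0/1 matrix (rows: examinees, columns: items)."""
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]  # assumed N(0, 1)
    matrix = [
        [1 if rng.random() < p_correct_3pl(theta, a, b, c) else 0
         for (a, b, c) in items]
        for theta in thetas
    ]
    return thetas, matrix


rng = random.Random(778)  # arbitrary seed, for reproducibility only
items = draw_item_parameters(32, rng)
thetas, responses = simulate_responses(1000, items, rng)
print(len(responses), len(responses[0]))
```

Each row of `responses` is one simulated examinee's score pattern, matching the matrix layout described in Section II.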
IV. Results
In the first simulation, 30,000 simulated examinees took a 32-item test three times, once under each
of the following conditions: a standard test (0 for wrong and 1 for correct), a test implementing the
FSC, and a test implementing the FSC using Pashley's 3PL model. Table 2 presents a sample of 10
examinees with their observed scores and true scores (calculated under the methods of Classical Test
Theory). For some examinees, the observed score on the formula-scoring test comes fairly close to the
true score. Under this FSC there were, however, instances where simulated examinees scored well below
or above their true score (examinee 7, for example). In the second simulation, a parametric bootstrap
procedure was conducted with the same number of examinees and items as the real data. Table 3 shows
that the reliability (Guttman's λ2) of the simulated tests applying the FSC, with or without Pashley's
model, was lower than that of the simulated standard test. This is consistent with what Ruch and
Stoddard claimed in 1925 (Diamond and Evans, p. 182). However, the standard deviation of the biserial
correlations decreased under the FSC method and under Pashley's model with the FSC method. A negative
aspect was the SEM, which increased under the FSC method and under Pashley's model with the FSC method,
although only slightly (from 2.3190 to 2.3491 and 2.3727). In conclusion, the next 3-5 months will be
spent analyzing this new FSC method to ensure its validity and statistical power in simulating an
examination taken under formula-scoring rules.
Table 2. Simulated Examinees
                        Observed Score
Examinee  True Ability  Standard  Formula-Scoring  Pashley w/ Formula-Scoring  True Score  True SD
1               0.4564   24.0000          19.2500                     17.2500     20.9410   2.2180
2              -0.0309   17.0000          18.0000                     19.7500     16.5580   2.3520
3               0.2706   22.0000          18.7500                     17.0000     19.3030   2.2960
4              -0.5884   11.0000          11.7500                      8.2500     12.2310   2.3330
5              -1.1745   11.0000           5.7500                      9.7500      8.7980   2.2710
6               0.1921   18.0000          17.0000                     13.7500     18.5840   2.3210
7              -0.5151   14.0000           9.7500                     16.7500     12.7380   2.3380
8              -0.3514   17.0000          15.2500                     12.5000     13.9340   2.3470
9               1.7278   27.0000          30.0000                     29.7500     28.7620   1.4800
10              0.2953   19.0000          19.5000                     17.7500     19.5280   2.2870
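The paper does not state how the True Score and True SD columns of Table 2 were computed. One common approach, shown here purely as an assumption, takes the true number-correct score as the sum of the item success probabilities at the examinee's ability and the conditional standard deviation as the square root of the summed Bernoulli variances (which relies on local independence). The item values and names below are hypothetical.

```python
import math


def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response (plain logistic metric assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def true_score_and_sd(theta, items):
    """Expected number-correct score and its conditional SD at ability theta."""
    probs = [p_correct_3pl(theta, a, b, c) for (a, b, c) in items]
    true_score = sum(probs)                                   # sum of P_i
    true_sd = math.sqrt(sum(p * (1.0 - p) for p in probs))    # sqrt(sum P_i(1 - P_i))
    return true_score, true_sd


# Hypothetical 4-item test; the (a, b, c) values are illustrative only.
items = [(1.0, -1.0, 0.2), (1.3, 0.0, 0.2), (0.8, 0.5, 0.25), (1.5, 1.0, 0.15)]
score, sd = true_score_and_sd(0.0, items)
print(round(score, 3), round(sd, 3))
```

Under this reading, the conditional SD shrinks for examinees whose item probabilities are all close to 1, which would be in line with the smaller True SD reported for the highest-ability examinee in Table 2.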
Table 3. Parametric Bootstrap Simulation (500 Iterations)
Statistic     Real Data  Standard 3PL  Standard 3PL w/ Formula-Scoring  Pashley's 3PL w/ Formula-Scoring
Guttman's λ2     0.8396        0.8976                           0.8940                            0.8883
SEM              2.4759        2.3190                           2.3491                            2.3727
sd(ρbis)         0.1003        0.1412                           0.1358                            0.1211
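For readers who want to reproduce statistics like the first two rows of Table 3, the following is a minimal sketch of Guttman's λ2 and a standard error of measurement computed from a matrix of item scores. It assumes the textbook definitions (λ2 from the item covariance matrix, SEM as the total-score SD times the square root of one minus the reliability); whether the study computed SEM from λ2 in this way is not stated, and the data below are made up.

```python
import math
import statistics


def guttman_lambda2(scores):
    """Guttman's lambda-2 for a list of examinee rows of item scores."""
    n_people = len(scores)
    n_items = len(scores[0])
    means = [sum(row[i] for row in scores) / n_people for i in range(n_items)]

    def cov(i, j):
        # Sample covariance between items i and j (n - 1 denominator).
        return sum((row[i] - means[i]) * (row[j] - means[j])
                   for row in scores) / (n_people - 1)

    cov_matrix = [[cov(i, j) for j in range(n_items)] for i in range(n_items)]
    total_var = sum(sum(r) for r in cov_matrix)          # variance of the sum score
    item_var = sum(cov_matrix[i][i] for i in range(n_items))
    off_diag_sq = sum(cov_matrix[i][j] ** 2
                      for i in range(n_items)
                      for j in range(n_items) if i != j)
    lambda1 = 1.0 - item_var / total_var
    return lambda1 + math.sqrt(n_items / (n_items - 1) * off_diag_sq) / total_var


def sem(total_sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return total_sd * math.sqrt(max(0.0, 1.0 - reliability))


# Made-up 0/1 responses: 5 examinees by 3 items.
data = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
lam2 = guttman_lambda2(data)
totals = [sum(row) for row in data]
print(round(lam2, 4), round(sem(statistics.stdev(totals), lam2), 4))
```

The same functions accept formula-scored rows (with values such as -0.25) as well as 0/1 rows, since only covariances of the item scores are used.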
V. Educational Significance
When standardized multiple-choice tests such as a Math/English college placement test are
administered, various kinds of guessing can occur and can produce inaccurate IRT measurements of a
person's true score, which in turn can yield under- or overestimates of their true ability. Based on
these simulations with the proposed FSC, applying this type of formula-scoring rule in future
multiple-choice testing may help bring a person's overall observed score very close to their overall
true score. Although the suggested method has flaws, the hope is that its logical reasoning will be
accepted.
References
Budescu, David, and Maya Bar-Hillel. "To Guess or Not to Guess: A Decision-Theoretic View of Formula
Scoring." Journal of Educational Measurement, Vol. 30, No. 4 (Winter, 1993): 277-291.
Diamond, James, and William Evans. "The Correction for Guessing." Review of Educational Research,
Vol. 43, No. 2 (Spring, 1973): 181-191.
Habing, Brian. STAT 778 - Spring 2014 - R Templates. 2014.
<http://www.stat.sc.edu/~habing/courses/778rS14.html>.
Pashley, Peter J. An Alternative Three-Parameter Logistic Item Response Model. Research. Princeton,
New Jersey: Educational Testing Service, February 1991.
Thorndike, Robert L. Educational Measurement: The Problem of Guessing. Washington, D.C.: American
Council on Education, 1971.
Zimowski, Michele, Eiji Muraki, Robert Mislevy, and Darrell Bock. BILOG-MG 3. Scientific Software
International, Inc. Skokie, IL, 2005-2014.

STAT 778 Project Proposal - Jonathan Poon

  • 1. 1 Theoretical Formula Scoring Criterion to Simulate Correction for Guessing Abstract One of the major phenomenon's in standardized multiple-choice tests is determining how much random guessing occurs at a particular question. It can be intuitive to think that when a multiple-choice test is admitted the probability of an individual guessing an item correct is one out of all possible answer choices if they truly do not know the correct answer. Even with this kind of attempt, guessing potentially increases error in measuring an individual's true score, let alone their true ability. A measurement framework, known to many as formula-scoring, was proposed several years ago to possibly help the correction for guessing. The purpose of this proposal is to use the general framework of the formula- scoring method and introduce a new formula-scoring criterion that can be implemented through simulation. This new criterion will show that if formula-scoring is applied then we may be able to improve an individual's overall observed score (sum score) to reflect what their true score really is. .
  • 2. 2 I. Purpose The idea of formula scoring is that a multiple-choice test being admitted should provide detailed instructions as to how a test-taker should approach attempting an item. The rule generally follows that an examinee should omit an item if they experience a situation of having to guess (whether they have partial or no knowledge of the item). If they attempt an item and get it correct, they will receive a score of 1 (this is based on if they have more than partial knowledge of the item). If they experience a situation where they must rely on some type of guessing they should, per the instructions, omit the item and receive a score of 0. A consequence of answering an item incorrectly is receiving a score of -1/(k-1) where k is the number of possible answers to a multiple-choice item. When these rules are in place the complete formula score for potentially correcting error in guessing is, where S is the individual's corrected score, R is the number of items the individual marked correctly, W is the number of items the individual marked incorrectly and k is the number of possible choices for each item (Diamonds and Evans, p. 181). If guidelines such as the one listed above are given to individuals taking the test, it's possible there may be improvements obtaining true scores through the individuals observed score. It would seem that just implementing this in all tests would resolve the issue of guessing, but the effects of guessing behaviors that occur can vary greatly. Such acts of guessing can cause the test- taker to think about the possible rewards of a higher score if they attempt all items. Some of these guessing behaviors are, but not limited to: " 1.Eliminating one or more answer choices judged to be definitely wrong 2.Making use of unintended semantic or syntactic cues available from the wording of the questions or the response options 3.Falling into traps set by the ingenious item writer e.g. 
cliché-like choices that sound plausible but are wrong 4.Responding on the basis of some element in one of the response choices that attracts him, but at a relatively low level of confidence 5.Using some essentially random fashion of responding-in the extreme, flipping a coin or marking some specific response position or pattern of positions " (Thorndike, p. 59) Many testing services implement the formula-scoring rule mentioned previously. Testing companies such as ETS (Educational Testing Services) have employed formula scoring on the SAT (Scholastic Aptitude Test) and the GRE (Graduate Record Examination) subject exams since 1953 (Budescu and Bar-Hillel, p. 278). However some researchers have suggested that the use of this formula score correction can reduce the reliability of a test. In contrast to these claims, studies have shown that the reliability between corrected and uncorrected scores, if guessing instructions were implemented, were not that far off from each other (Diamonds and Evans, p. 182).With that being said, the purpose of this paper is to propose a formula-scoring criterion (FSC) that can be implemented through simulations that may show signs of an individual's observed score falling closer to their true score.
  • 3. 3 II. Theoretical Framework In standardized multiple-choice tests such as a Math or English placement exam, a person's ability can best be measured using unidimensional Item Response Theory (IRT) models. What is obtained from these models is the probability of an examinee getting one individual item correct; assuming we know their true ability, how difficult the problem is, how discriminating the item is and how much guessing is potentially involved with this particular item. This model (known as a three-parameter logistics or 3PL model) is typically written as, where Uij is item i attempted by examinee j, θj is the examinees given ability, ai is the discrimination value on the ith item, bi is the difficulty value on the ith itemand ci is the guessing value on the ith item. In order for this model to potentially hold true, three assumptions must be met: 1) θ is unidimensional, 2) Local Independence, and 3) Monotonicity. Another alternative IRT model that will be presented in this paper is a 3PL model created by Peter J. Pashley. Pashley elegantly designed a 3PL model around what he calls a four-parameter hyperbolic equation. The purpose of incorporating this model is to explore other models, not widely used in IRT, to compare the proposed FSC. Pashley proposed the following model parameters, where f defines a slope parameter, h defines the difficulty parameter, and k is thought of as the lower asymptote parameter. With these new parameters, Pashley formulized the four-parameter hyperbolic model as such, (Pashley, pgs. 9-10). In many IRT studies, simulations are carried out where computer software, such as R, can simulate random examinees taking a multiple-choice item test. What is produced through these simulations are datasets consisting of a very large matrix of 1's (right) and 0's (wrong) where each row represents a test-taker and each column represents score patterns for each multiple-choice item. 
One problem with these simulations is that they cannot account for a multiple-choice test that applies the formula-scoring rule. This paper proposes an FSC, denoted τ, that can be used when simulating IRT data from multiple-choice tests administered under formula-scoring rules. The FSC was developed around the assumption of correlations between all estimated c parameters and various subgroups of estimated θ's. The vector of these correlations was found to follow a pattern very similar to a normal distribution; here j denotes the number of items on the test and l the number of simulated examinees. τ is then defined as a random draw from this hypothesized normal distribution, and it is applied to a simulated examination taken by random examinees under formula scoring.

III. Methods & Data

To defend the FSC, two separate simulations were undertaken in R: 1) a simulation with all parameters assumed to be known and 2) a parametric bootstrap sampling procedure. To ground the simulations, real data were used, consisting of 2,642 examinees attempting a 32-item multiple-choice Form B College Math Placement test. Parameters were estimated from these data in BILOG-MG using Expected A Posteriori (EAP), a Bayesian estimation method. The estimated parameters followed the distributions given in Table 1.

Table 1. Parameter Distributions

Parameter  Distribution
a          LogNormal(0.290, 0.301)
b          Normal(0.246, 0.962)
c          Beta(2.771, 14.021)

IV. Results

In the first simulation, 30,000 simulated examinees took a 32-item test three times, once under each of the following conditions: a standard test (0 for wrong, 1 for correct), a test implementing the FSC, and a test implementing the FSC using Pashley's 3PL model. Table 2 presents a sample of 10 examinees with their observed scores and true scores (calculated under Classical Test Theory). As the table shows, some examinees' observed scores come fairly close to their true scores on the formula-scoring test. There were, however, instances under the FSC where simulated examinees scored well under or above their true score (examinee 7, for example).
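The ingredients of the first simulation can be sketched in Python as follows (the paper itself used R). The sketch draws item parameters from the Table 1 distributions, computes an expected number-correct score and its standard deviation for a given ability, and applies the conventional formula-scoring rule (1 for a correct answer, 0 for an omit, -1/(k-1) for a wrong answer). Several things here are assumptions rather than facts from the paper: that LogNormal and Normal are parameterized by a mean and a standard deviation, that k = 5 answer choices were used (which would be consistent with the quarter-point scores in Table 2), that the true score and true SD are the sum of the item probabilities and the square root of the summed p(1 - p), and all function names. It does not implement the τ-based criterion itself, whose exact rule is not given here.

```python
import math
import random


def draw_items(n_items, rng):
    """Draw (a, b, c) per item from the Table 1 distributions (parameterization assumed)."""
    return [
        (
            rng.lognormvariate(0.290, 0.301),  # a: assumed log-scale mean and sd
            rng.normalvariate(0.246, 0.962),   # b: assumed mean and sd
            rng.betavariate(2.771, 14.021),    # c: Beta shape parameters
        )
        for _ in range(n_items)
    ]


def p_3pl(theta, a, b, c):
    """Standard 3PL probability of a correct response (no D scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def true_score(theta, items):
    """Expected number-correct score and its SD on a standard 0/1 test."""
    probs = [p_3pl(theta, a, b, c) for (a, b, c) in items]
    return sum(probs), math.sqrt(sum(p * (1.0 - p) for p in probs))


def formula_score(responses, k=5):
    """Formula score; responses use 1 = correct, 0 = incorrect, None = omitted."""
    right = sum(1 for r in responses if r == 1)
    wrong = sum(1 for r in responses if r == 0)
    return right - wrong / (k - 1)


rng = random.Random(1)
items = draw_items(32, rng)           # a 32-item test, as in the real data
expected, sd = true_score(0.0, items)
# Three correct, one wrong, one omitted: 3 - 1/4
print(formula_score([1, 1, 0, None, 1]))  # prints 2.75
```

Comparing `formula_score` on simulated responses with `expected` for the same ability is one way to see how far an observed score falls from the true score.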
In the second simulation, a parametric bootstrap procedure was conducted with the same number of examinees and items as in the real data. Table 3 shows that the reliability (Guttman's λ2) of the simulated tests applying the FSC, with or without Pashley's model, was lower than the reliability of the simulated standard test. This is consistent with the claim Ruch and Stoddard made in 1925 (Diamond and Evans, p. 182). The standard deviation of the biserials, however, decreased under the FSC method and under Pashley's model with the FSC method. A negative result was the SEM, which increased under both FSC conditions, although not by much: from 2.3190 for the standard test to 2.3491 and 2.3727.

In conclusion, the next 3-5 months will be spent analyzing this new FSC method to ensure its validity and statistical power in simulating an examination taken under formula-scoring rules.

Table 2. Simulated Examinees

                        Observed Score
Examinee  True Ability  Standard  Formula-Scoring  Pashley w/ Formula-Scoring  True Score  True SD
1          0.4564       24.0000   19.2500          17.2500                     20.9410     2.2180
2         -0.0309       17.0000   18.0000          19.7500                     16.5580     2.3520
3          0.2706       22.0000   18.7500          17.0000                     19.3030     2.2960
4         -0.5884       11.0000   11.7500           8.2500                     12.2310     2.3330
5         -1.1745       11.0000    5.7500           9.7500                      8.7980     2.2710
6          0.1921       18.0000   17.0000          13.7500                     18.5840     2.3210
7         -0.5151       14.0000    9.7500          16.7500                     12.7380     2.3380
8         -0.3514       17.0000   15.2500          12.5000                     13.9340     2.3470
9          1.7278       27.0000   30.0000          29.7500                     28.7620     1.4800
10         0.2953       19.0000   19.5000          17.7500                     19.5280     2.2870

Table 3. Parametric Bootstrap Simulation (500 Iterations)

Statistic     Real Data  Standard 3PL  Standard 3PL w/ Formula-Scoring  Pashley's 3PL w/ Formula-Scoring
Guttman's λ2  0.8396     0.8976        0.8940                           0.8883
SEM           2.4759     2.3190        2.3491                           2.3727
sd(ρbis)      0.1003     0.1412        0.1358                           0.1211

V. Educational Significance

When standardized multiple-choice tests such as a Math/English college placement test are administered, various forms of guessing can occur that lead to inaccurate IRT measurement of a person's true score, which could in turn yield under- or overestimates of their true ability.
Based on these simulations using the proposed FSC, implementing this type of formula-scoring rule in future multiple-choice testing practice may open doors to bringing a person's overall observed score very close to their overall true score. Although the suggested method has flaws, it is hoped that its logical reasoning will be accepted.
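For readers who want to compute statistics like those in Table 3, the following Python sketch shows one way to obtain Guttman's λ2 and the SEM from a matrix of item scores. It uses the textbook definitions (λ2 from the item covariance matrix, and SEM as the standard deviation of the sum scores times the square root of one minus the reliability). Whether the paper's R code computed them in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def covariance_matrix(scores):
    """Sample covariance matrix of item scores (rows = examinees, columns = items)."""
    n = len(scores)
    n_items = len(scores[0])
    means = [sum(row[i] for row in scores) / n for i in range(n_items)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in scores) / (n - 1)
            for j in range(n_items)
        ]
        for i in range(n_items)
    ]


def guttman_lambda2(scores):
    """Guttman's lambda-2 reliability coefficient from item-level scores."""
    cov = covariance_matrix(scores)
    k = len(cov)
    total_var = sum(sum(row) for row in cov)  # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))
    off_diag_sq = sum(cov[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    lambda1 = 1.0 - item_var / total_var
    return lambda1 + math.sqrt(k / (k - 1) * off_diag_sq) / total_var


def sem(scores, reliability):
    """Standard error of measurement: SD of sum scores times sqrt(1 - reliability)."""
    totals = [sum(row) for row in scores]
    mean = sum(totals) / len(totals)
    sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / (len(totals) - 1))
    return sd * math.sqrt(1.0 - reliability)
```

The same functions work on formula-scored item matrices, since they only require numeric item scores.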
References

Budescu, David, and Maya Bar-Hillel. "To Guess or Not to Guess: A Decision-Theoretic View of Formula Scoring." Journal of Educational Measurement, Vol. 30, No. 4 (Winter, 1993): 277-291.
Diamond, James, and William Evans. "The Correction for Guessing." Review of Educational Research, Vol. 43, No. 2 (Spring, 1973): 181-191.
Habing, Brian. STAT 778 - Spring 2014 - R Templates. 2014. <http://www.stat.sc.edu/~habing/courses/778rS14.html>.
Pashley, Peter J. An Alternative Three-Parameter Logistic Item Response Model. Research. Princeton, New Jersey: Educational Testing Service, February 1991.
Thorndike, Robert L. Educational Measurement: The Problem of Guessing. Washington, D.C.: American Council on Education, 1971.
Zimowski, Michele, Eiji Muraki, Robert Mislevy, and Darrell Bock. BILOG-MG 3. Scientific Software International, Inc. Skokie, IL, 2005-2014.