2. 2 PSYCHOMETRIKA
FIGURE 1.
Concurrent calibration in a common item design.
Two groups of examinees from different populations were each assigned to different test forms.
The test design in which the groups of examinees are not equivalent is called the nonequivalent
groups design.
Two main IRT linking methods are used in a common-item nonequivalent groups design:
separate estimation and concurrent calibration (Wingersky & Lord, 1984). In separate estima-
tion, the two sets of item parameter estimates for the common items are used to estimate a scale
transformation that will put the item parameter estimates of one form on the scale of the item pa-
rameter estimates for the other form (Haebara, 1980; Stocking & Lord, 1983; Kolen & Brennan,
2004). In concurrent calibration, item parameters for all the items on both forms are estimated
simultaneously in one run of the estimation software. Estimating parameters for all items simul-
taneously ensures that all parameter estimates are on the same scale. Bock and Zimowski (1997)
suggested a multiple group IRT that can deal with multiple examinee groups differing in ability in
the concurrent calibration. Von Davier and von Davier (2004) presented all the linking methods
in a new framework (in which the calibration is performed in one step) in terms of restrictions
on the likelihood function.
Numerous studies have researched the accuracy of the estimation by using each method
(Baker & Al-Karni, 1991; Kim & Cohen, 1992). Hanson and Béguin (2002) found that concur-
rent calibration procedures produced more accurate results than did separate estimation. Conse-
quently, to date, concurrent calibration is thought to be the most appropriate estimation method
for the common-item nonequivalent groups design.
When concurrent calibration is applied in the common-item nonequivalent groups design,
the scores on the tests that some examinees did not take are regarded as missing data, whereas
the preceding methods related to the nonequivalent groups design ignored these data. In other
words, these methods implicitly assumed missing data to be data that was “missing at random”
(MAR, see Little & Rubin, 2002). The missing data mechanism is ignorable if (a) missingness
is MAR and (b) the parameters of the grouping variable and the parameters of the item response
variables are distinct (Little & Rubin, 2002). When these two conditions are satisfied, the like-
lihood function is separated into the likelihood of grouping variable and that of item response
variables. If missingness is not MAR, or the likelihood function is not separated into the like-
lihood of grouping variable and that of item response variables, the missing data mechanism is
nonignorable.
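The two ignorability conditions can be written out explicitly. The following is a sketch of the standard factorization argument (Little & Rubin, 2002), using symbols that are defined in Sect. 2: $u^{obs}$ and $u^{mis}$ are the observed and missing item responses, $r$ is the grouping variable, $\psi$ collects the item response parameters, and $\rho$ the parameters of the grouping variable. For brevity, the latent ability is left implicit here.

```latex
\begin{align*}
p(u^{obs}, r \mid \psi, \rho)
  &= \int p(u^{obs}, u^{mis} \mid \psi)\,
          p(r \mid u^{obs}, u^{mis}, \rho)\, du^{mis} \\
  &= p(r \mid u^{obs}, \rho) \int p(u^{obs}, u^{mis} \mid \psi)\, du^{mis}
     && \text{(MAR: $r$ does not depend on $u^{mis}$)} \\
  &= p(r \mid u^{obs}, \rho)\, p(u^{obs} \mid \psi).
\end{align*}
```

If, in addition, $\psi$ and $\rho$ are distinct, the first factor carries no information about $\psi$ and can be dropped when estimating the item parameters. When the second line fails, that is, when $r$ depends on $u^{mis}$, the integral no longer factorizes and the mechanism must be modeled.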
Meanwhile, in many cases, examinees can select their test forms; consequently, they are not
randomly assigned to a test form. We explain this by using the example of real data analysis. In
Sect. 5, we analyze a JLRT (Japanese Listening and Reading Comprehension Test) data set. The
JLRT, which is a part of the Business Japanese Proficiency Test (BJT; Kato & JETRO, 2006),
is a 100-item multiple-choice test that measures the ability to communicate with other persons
in the Japanese language. The target population of the BJT is people whose first language is
3. KEI MIYAZAKI ET AL. 3
FIGURE 2.
The situation of the test design considered in this study.
not Japanese; the test has been administered in 32 cities across 13 countries such as Japan, the
United States, Canada, EU countries, Brazil, China, and other Asian countries. The version of the
JLRT that we analyzed has two equivalent test forms, and the one that the examinees are given
depends on the country in which they take the test; that is, we can determine the country in which
the examinees took the test by asking them which form they took. Because we can assume that
the examinees living in Japan are probably highly motivated to study Japanese and have higher
abilities than the examinees in other countries, just by asking an examinee which form he/she
took, we can form an expectation of whether he/she has higher or lower ability. This expectation
implies that missing data have some information about examinees’ achievement levels. Thus,
missing data are regarded as being nonignorable.
Another example demonstrating the advantage of our model is that examinees can frequently
select test forms by themselves and are consequently not randomly assigned to a test form. In
an AP (Advanced Placement) exam, in one measure at one administration, one student might
choose between two essays that are supposed to measure the same construct while the common
items are fixed (for more details about linking in AP exams, refer to Yang, 2004). In such cases,
since examinees are not randomly assigned to one of the two test forms, a mere application of
existing item parameter linking methods can yield biased results (see Sect. 4, simulation study).
In addition, when the test form selection behavior depends on the examinees’ abilities, the exist-
ing item parameter linking methods—including multiple group IRT—can yield biased estimates
because the likelihood function is not separated into the likelihood of grouping variable and that
of item response variables (a detailed explanation with mathematical expressions is provided in
Model Assumptions, Sect. 2).
To solve this problem, we constructed a model in which test form selection behavior is
dependent on the scores of the tests (Fig. 2). In this model, missing test scores are regarded as
nonignorable (Little & Rubin, 2002). Furthermore, we proposed an estimation method for the
parameters, using the MCEM algorithm (Wei & Tanner, 1990). Consequently, we proposed a
new concurrent calibration method for a common-item nonequivalent groups design.
In Sect. 2, we present the model assumptions and describe the form of likelihood function of
our model. We provide maximum likelihood estimations using the MCEM algorithm and address
certain topics related to parameter estimation, such as the calculation of the asymptotic variance
covariance matrix from the EM algorithm in Sect. 3. In Sect. 4, we present a simulation study to
show that the traditional method provides severely biased estimates and verify that the proposed
model can yield adequate estimates. In Sect. 5, we apply the proposed method to JLRT data and
describe the meaningful results. Finally, in Sect. 6, we provide concluding remarks.
2. Model Assumptions
We considered situations in which each of the two test forms was used separately. Item
parameter linking was performed between these two test forms (Fig. 1). For the examinees in
group 1, Tests A and B were administered, and for the examinees in group 2, Tests B and C were
administered. Test B was common between the two test groups. The examinees had to choose one
of the two groups. Let $K_A$, $K_B$, and $K_C$ be the number of items in Tests A, B, and C, respectively.
Let $r_i$ be a test form selection indicator ($r_i = j$ ($j = 1, 2$) implies that the $i$th examinee selected
the $j$th test form). The item response vector $u_{ij}$ is observed when the $i$th examinee selected the
$j$th test form, and $u_{ij'}$ ($j' \neq j$) is missing. Let $u_i = ((u_i^{obs})', (u_i^{mis})')'$, where $u_i^{obs}$ represents
the observed components of $u_i$, while $u_i^{mis}$ represents the missing entries. Hence, the missing
patterns can be expressed as follows:
\[
\bigl((u_i^{obs})', (u_i^{mis})'\bigr) =
\begin{cases}
(u_{i1}', u_{i2}') & (r_i = 1), \\
(u_{i2}', u_{i1}') & (r_i = 2),
\end{cases}
\tag{1}
\]
and $u_{i1} = (u_{iA}', u_{iB1}')'$, $u_{i2} = (u_{iB2}', u_{iC}')'$. For example, $u_{iA} = (u_{iA1}, \ldots, u_{iAK_A})'$. $u_{iB1}$ represents
the item response vector when the $i$th examinee takes Test B at the first point in time or
at the first place. $u_{iB2}$ represents the item response vector when the $i$th examinee takes Test B
at the second point in time or at the second place. The constraint that $u_{iB1} = u_{iB2}$ should be
considered. However, as described in the Introduction, here we assume that $u_{iB1}$ is not equal to
$u_{iB2}$ because of the difference in examinees’ abilities with regard to each point in time or each
place.
We let $\theta_{ij}$ be a random latent variable that represents the ability of the $j$th group. $\theta_{ij}$ is
distributed as $N(\mu_j, \sigma_j^2)$ (in this paper, we do not assume the multidimensionality of abilities.
For the problem of multidimensionality of abilities, see van der Linden & Luecht, 1998). Under
the three-parameter logistic model, the probability that the $i$th examinee of ability $\theta_{ij}$ correctly
answered item $k$ of test $X$ ($X = A, B, C$) is defined as
\[
p(u_{iXk_X} \mid \theta_{ij}, \psi_{jk})
= c_{Xk_X} + (1 - c_{Xk_X}) \frac{1}{1 + \exp\{-1.7 a_{Xk_X} (\theta_{ij} - b_{Xk_X})\}},
\tag{2}
\]
where $k_X = 1, \ldots, K_X$ and $\psi_{jk}$ is the vector that contains all the item parameters of item $k$ of
the $j$th group.
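As a small illustration of (2), the response probability can be evaluated as in the following Python sketch. The function name and the item values are hypothetical and are not taken from the paper; only the functional form, including the scaling constant 1.7, follows (2).

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Hypothetical item: discrimination 1.2, difficulty 0.5, guessing 0.2.
for theta in (-2.0, 0.5, 3.0):
    print(theta, round(p_correct_3pl(theta, a=1.2, b=0.5, c=0.2), 4))
```

At $\theta = b$ the logistic term equals 1/2, so the probability is $c + (1 - c)/2$; for very low abilities it approaches the guessing parameter $c$.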
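The probability that the $i$th item response vector $u_{ij}$ is obtained is expressed as follows:
\[
p(u_{i1} \mid \theta_{i1}, \psi_1)
= \prod_{k_A=1}^{K_A} \left[ c_{Ak_A} + (1 - c_{Ak_A}) \frac{1}{1 + \exp\{-1.7 a_{Ak_A} (\theta_{i1} - b_{Ak_A})\}} \right]
\times \prod_{k_B=1}^{K_B} \left[ c_{Bk_B} + (1 - c_{Bk_B}) \frac{1}{1 + \exp\{-1.7 a_{Bk_B} (\theta_{i1} - b_{Bk_B})\}} \right],
\tag{3}
\]
\[
p(u_{i2} \mid \theta_{i2}, \psi_2)
= \prod_{k_B=1}^{K_B} \left[ c_{Bk_B} + (1 - c_{Bk_B}) \frac{1}{1 + \exp\{-1.7 a_{Bk_B} (\theta_{i2} - b_{Bk_B})\}} \right]
\times \prod_{k_C=1}^{K_C} \left[ c_{Ck_C} + (1 - c_{Ck_C}) \frac{1}{1 + \exp\{-1.7 a_{Ck_C} (\theta_{i2} - b_{Ck_C})\}} \right].
\tag{4}
\]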
In the preceding models, the assignment mechanism is explained by the observed portion of
the complete item responses (Lord, 1974; Bernaards & Sijtsma, 1999). However, if the indicator
variable of test form selection behavior depends on both observed and missing portions, the as-
signment mechanism is not random, thereby leading to the conclusion that the existing methods
yield biased estimates for ability and item parameters (see the simulation study in Sect. 4). To
solve this problem, we modeled the relation between grouping variables and item response vari-
ables of all the tests containing observed and missing item response variables, using the logistic
regression model.
Our model also seeks to estimate the differences in ability parameters between the two points
in time. In previous models, because the likelihood function is separated into the likelihood of
the grouping variable and that of the item response variables, the existing IRT model provides
consistent estimates for the parameters. In our model, however, the grouping variable depends
on the item response variables of all the test forms and, therefore, the likelihood
function cannot be separated. Consequently, the traditional method cannot be applied.
In this paper, the test form selection mechanism is modeled using the following nominal
logistic regression model. To express the model in a more general manner, we let the explanatory
variables include item response variables $u_i = (u_{i1}', u_{i2}')'$ and ability variables $\theta_i = (\theta_{i1}, \theta_{i2})$.
The equation is as follows:
\[
p(r_i = j \mid u_i, \theta_i, \rho)
= \frac{\exp(\rho_{uj}' u_i + \rho_{\theta j}' \theta_i)}{1 + \exp(\rho_{uj}' u_i + \rho_{\theta j}' \theta_i)}
= \frac{\exp(\rho_j' z_i)}{1 + \exp(\rho_j' z_i)},
\tag{5}
\]
where $z_i = (u_i', \theta_i')'$ and $\rho_j = (\rho_{uj}', \rho_{\theta j}')'$. $\rho_j$ are coefficients multiplied by item response vector
$u_i$ and latent ability variable $\theta_i$. The higher the values of the elements of $\rho_j$, the higher is the
probability that the examinees are assigned to the $j$th group, and
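\[ \rho_{uj} = (\rho_{Aj}', \rho_{B1j}', \rho_{B2j}', \rho_{Cj}')', \tag{6} \]
\[ \rho_{Aj} = (\rho_{A1j}, \ldots, \rho_{AK_Aj})', \tag{7} \]
\[ \rho_{B1j} = (\rho_{B11j}, \ldots, \rho_{B1K_Bj})', \tag{8} \]
\[ \rho_{B2j} = (\rho_{B21j}, \ldots, \rho_{B2K_Bj})', \tag{9} \]
\[ \rho_{Cj} = (\rho_{C1j}, \ldots, \rho_{CK_Cj})', \tag{10} \]
\[ \rho_{\theta j} = (\rho_{\theta 1j}, \rho_{\theta 2j})'. \tag{11} \]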
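The selection model in (5) is an ordinary logistic regression on the stacked vector $z_i = (u_i', \theta_i')'$. A minimal Python sketch is given below; the function name and all numerical values are hypothetical and serve only to show how the linear predictor is formed from the item responses and the abilities.

```python
import math

def selection_prob(rho_u, rho_theta, u, theta):
    """p(r_i = j | u_i, theta_i, rho) from Eq. (5) for one examinee."""
    # Linear predictor rho_j' z_i with z_i = (u_i', theta_i')'.
    eta = sum(c * x for c, x in zip(rho_u, u))
    eta += sum(c * t for c, t in zip(rho_theta, theta))
    # exp(eta) / (1 + exp(eta)), written in an algebraically equivalent form.
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical examinee: four item responses and two abilities.
u_i = [1, 0, 1, 1]
theta_i = [0.3, -0.1]
print(selection_prob([0.2, 0.2, -0.2, -0.2], [0.0, 0.0], u_i, theta_i))
```

Setting the coefficients on $u_i$ to zero gives the case in which selection depends only on ability, and setting those on $\theta_i$ to zero gives the case in which it depends only on the test scores; both special cases are treated in Sect. 3.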
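Furthermore, to ensure notational simplicity, let $\phi_j$ be the vector that contains $\mu_j$ and $\sigma_j^2$.
In this paper, we provide the maximum likelihood estimation of the parameters. Under the
assumption that $\theta_1$ and $\theta_2$ are independent, $\theta_1 \perp\!\!\!\perp \theta_2$, the complete-data log-likelihood function of
the sample of the $u_i, r_i$ observations is a function of $\psi$, $\phi$, and $\rho$, and is written as follows:
\[
\begin{aligned}
L(\psi, \phi, \rho \mid r, u, \theta)
&= \sum_{i=1}^{N} \log p(r_i = j, u_i, \theta_i \mid \psi, \phi, \rho) \\
&= \sum_{i=1}^{N} \Bigl[ \log p(r_i = j \mid u_i, \theta_i, \rho)
   + \sum_{j=1}^{2} \log p(u_{ij} \mid \theta_{ij}, \psi_j)
   + \sum_{j=1}^{2} \log p(\theta_{ij} \mid \phi_j) \Bigr] \\
&= L_R + L_U + L_\Theta,
\end{aligned}
\tag{12}
\]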
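where
\[ L_R = \sum_{i=1}^{N} \log p(r_i = j \mid u_i, \theta_i, \rho), \tag{13} \]
\[ L_U = \sum_{i=1}^{N} \sum_{j=1}^{2} \log p(u_{ij} \mid \theta_{ij}, \psi_j), \tag{14} \]
\[ L_\Theta = \sum_{i=1}^{N} \sum_{j=1}^{2} \log p(\theta_{ij} \mid \phi_j). \tag{15} \]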
In practice, neither $u^{mis}$ nor $\theta$ can be observed. Let $L_{obs}(\psi, \rho \mid r, u^{obs})$ be the observed
log-likelihood, which is given by
\[
L_{obs}(\psi, \rho \mid r, u^{obs})
= \sum_{i=1}^{N} \log p(r_i = j \mid u_i^{obs}, \rho)
+ \sum_{i=1}^{N} \log p(u_i^{obs} \mid \psi).
\tag{16}
\]
The functional form is so complex that we cannot maximize it directly. To solve this problem,
we used the EM algorithm (Dempster, Laird, & Rubin, 1977), which is useful for analyzing data
containing missing values.
The Relationship between Our Method and Existing Linking Methods for the Nonequivalent
Groups Design
Nonignorable missing data models are classified broadly into two categories: pattern mixture
models and selection models (Little & Rubin, 2002). The multiple group IRT model is
categorized as a kind of pattern mixture model. Pattern mixture models are expressed as a joint
distribution where the observed variables depend on missing indicators. Thus, the likelihood
function of the multiple group IRT model is expressed as follows:
\[
\begin{aligned}
l_{obs}(\psi, \phi, \rho \mid r, u, \theta)
&= \prod_{i=1}^{N} \iint p(r_i = j, u_i^{obs}, u_i^{mis}, \theta_i \mid \psi, \phi, \rho)\, du_i^{mis}\, d\theta_i \\
&= \prod_{i=1}^{N} \iint p(u_i^{obs}, u_i^{mis}, \theta_i \mid r_i = j, \psi, \phi)
   \times p(r_i = j \mid \rho)\, du_i^{mis}\, d\theta_i.
\end{aligned}
\tag{17}
\]
As described in Sect. 3, we can test whether the missing data $u_i^{mis}$ have an effect on test form
selection behavior using the Wald statistic. In this way, the proposed method is more practical
and useful than the existing methods for the nonequivalent groups design.
We also considered the assumption that an examinee’s ability can influence test form assignment.
The observed likelihood of this assumption can be expressed as follows:
\[
l_{obs}(\psi, \phi, \rho \mid r, u^{obs})
= \prod_{i=1}^{N} \iint p(r_i = j \mid \theta_i, \rho)
  \times p(u_i^{obs}, u_i^{mis}, \theta_i \mid \psi, \phi)\, du_i^{mis}\, d\theta_i.
\tag{18}
\]
Equation (18) indicates that the likelihood function is not separated into the likelihood of group-
ing variable and that of item response variables. The likelihood function under the MAR assump-
tion, that is, the likelihood function of the existing multiple group IRT, is given by (17) and is
separated into the likelihood of grouping variable and item response variables; this form is ob-
viously different from (18). The data are not MAR and, therefore, existing IRT linking methods
can yield biased ML estimates. Consequently, as long as we assume that test form selection be-
havior depends on examinees’ abilities, all the existing methods related to concurrent calibration
inevitably yield biased estimates. The proposed method can also adjust for these biases.
3. The Estimation Method
Obtaining the ML estimates by directly maximizing $p(r, u^{obs} \mid \psi, \rho)$ is very difficult because
this likelihood function is very complicated due to the presence of missing data
and latent variables. Thus, instead of working with $p(r, u^{obs} \mid \psi, \rho)$ directly, we augment $u^{obs}$
with $(u^{mis}, \theta)$ using the EM algorithm in the ML estimation. Consequently, the ML estimation
based on the complete data set is made easier when the following EM algorithm is used:
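[E-step]: Evaluate the expected value of the complete-data log-likelihood with respect to
$u^{mis}$ and $\theta$
\[
Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})
= \sum_{i=1}^{N} \iint \Bigl[ \log p(r_i \mid u_i, \theta_i, \rho)
  + \sum_{j=1}^{2} \log p(u_{ij}, \theta_{ij} \mid \psi_j, \phi_j) \Bigr]
  \times p(u_i^{mis}, \theta_i \mid u_i^{obs}, r_i, \psi^{(t)}, \phi^{(t)}, \rho^{(t)})\, du_i^{mis}\, d\theta_i
\tag{19}
\]
at the $t$th iteration with a current value $\psi^{(t)}, \phi^{(t)}$.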
Since the E-step cannot be calculated analytically, we use the MCEM algorithm (Wei &
Tanner, 1990), where this E-step is approximated by the Monte Carlo estimate of the
expectation using a sufficiently large number of observations simulated from the conditional
distribution $p(u^{mis}, \theta \mid u^{obs}, \psi^{(t)}, \phi^{(t)})$. This is accomplished by using the Metropolis–
Hastings algorithm, which enables us to draw samples from $p(u_i^{mis} \mid u_i^{obs}, r_i, \theta_i, \psi^{(t)}, \rho^{(t)})$ and
$p(\theta_i \mid u_i^{obs}, u_i^{mis}, r_i, \phi^{(t)})$. (See Ibrahim, Chen, & Lipsitz, 2001, for the ML estimation in
generalized linear models when the missing data mechanism is nonignorable.)
The following Metropolis–Hastings algorithms are used to sample $u^{mis}$ and $\theta$. Let $u^{mis}_{(m)}$
and $\theta_{(m)}$ be the current values at the $m$th iteration and $u^{mis}_{i(m)}$ and $\theta_{i(m)}$ be the values of the $i$th
observation of $u^{mis}_{(m)}$ and $\theta_{(m)}$.
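(i) Generate $u^{mis}_{i(m+1)}$ from $p(u_i^{mis} \mid u_i^{obs}, r_i, \theta_{i(m)}, \psi^{(t)}, \rho^{(t)})$ (Metropolis–Hastings algorithm)
The target and the proposal distribution are as follows:
\[ \text{target distribution: } p(u_i^{mis} \mid u_i^{obs}, r_i, \theta_{i(m)}, \psi^{(t)}, \rho^{(t)}) \tag{20} \]
\[ \text{proposal distribution: } p(u_i^{mis} \mid u_i^{obs}, \theta_{i(m)}, \psi^{(t)}) \tag{21} \]
(i-1) Draw $u_i^* \sim p(u_i^{mis} \mid u_i^{obs}, \theta_{i(m)}, \psi^{(t)})$
(i-2) Accept $u^{mis}_{i(m+1)} = u_i^*$ with probability
\[
\alpha(u_i^* \mid u^{mis}_{i(m)}, u_i^{obs}, \theta_{i(m)}, r_i, \rho^{(t)})
= \min\left\{ \frac{p(r_i \mid u_i^{obs}, u_i^*, \theta_{i(m)}, \rho^{(t)})}{p(r_i \mid u_i^{obs}, u^{mis}_{i(m)}, \theta_{i(m)}, \rho^{(t)})}, 1 \right\}
\tag{22}
\]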
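Steps (i-1) and (i-2) can be sketched in Python as follows. Because the proposal (21) is the item response model itself, it cancels from the Metropolis–Hastings ratio and only the ratio of selection probabilities in (22) remains. The sketch assumes local independence, so that the proposal draws each missing response from the item response function given the current ability; the function names, the item parameters, and the sum-score selection rule are hypothetical and are not the authors' implementation.

```python
import math
import random

def p_correct(theta, a, b, c=0.0, D=1.7):
    """Item response probability of Eq. (2); c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def draw_responses(theta, items, rng):
    """Proposal (21): simulate each missing response given the current ability."""
    return [1 if rng.random() < p_correct(theta, *item) else 0 for item in items]

def mh_update_missing(u_mis, u_obs, theta, items_mis, select_prob, rng):
    """One update of the missing responses following steps (i-1) and (i-2)."""
    proposal = draw_responses(theta, items_mis, rng)             # (i-1)
    ratio = select_prob(u_obs, proposal) / select_prob(u_obs, u_mis)
    if rng.random() < min(ratio, 1.0):                           # (i-2), Eq. (22)
        return proposal
    return u_mis

# Hypothetical setup: the examinee took form 1, and selection follows a
# sum-score rule with coefficient 0.2 (an assumption for illustration).
def select_prob(u_obs, u_mis, pi1=0.2):
    v = sum(u_obs) - sum(u_mis)
    return 1.0 / (1.0 + math.exp(-pi1 * v))

rng = random.Random(1)
items_mis = [(1.0, -0.5), (0.8, 0.0), (1.2, 0.5)]   # made-up (a, b) pairs
u_obs = [1, 1, 0, 1]
u_mis = [0, 0, 0]
for _ in range(5):
    u_mis = mh_update_missing(u_mis, u_obs, 0.0, items_mis, select_prob, rng)
print(u_mis)
```

Here `theta` stands for the ability that governs the unselected form, and `select_prob` returns the probability of the form the examinee actually selected.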
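(ii) Generate $\theta_{i(m+1)}$ from $p(\theta_i \mid u_i^{obs}, u^{mis}_{i(m+1)}, r_i, \phi^{(t)}, \rho^{(t)})$
Further, here, we use the following Metropolis–Hastings algorithm. The target and the
proposal distribution are as follows:
\[ \text{target distribution: } p(\theta_i \mid r_i, u_i^{obs}, u^{mis}_{i(m+1)}, \rho^{(t)}, \phi^{(t)}) \tag{23} \]
\[ \text{proposal distribution: } p(\theta_i \mid \phi^{(t)}) \tag{24} \]
(ii-1) Draw $\theta_i^* \sim p(\theta_i \mid \phi^{(t)})$
(ii-2) Accept $\theta_{i(m+1)} = \theta_i^*$ with probability
\[
\begin{aligned}
\alpha(\theta_i^* \mid \theta_{i(m)}, u_i^{obs}, u^{mis}_{i(m+1)}, \psi^{(t)})
&= \min\left\{ \frac{p(r_i, u_i^{obs}, u^{mis}_{i(m+1)} \mid \theta_i^*, \rho^{(t)}, \psi^{(t)})}{p(r_i, u_i^{obs}, u^{mis}_{i(m+1)} \mid \theta_{i(m)}, \rho^{(t)}, \psi^{(t)})}, 1 \right\} \\
&= \min\left\{ \frac{p(r_i \mid u_i^{obs}, u^{mis}_{i(m+1)}, \theta_i^*, \rho^{(t)})\, p(u_i^{obs}, u^{mis}_{i(m+1)} \mid \theta_i^*, \psi^{(t)})}{p(r_i \mid u_i^{obs}, u^{mis}_{i(m+1)}, \theta_{i(m)}, \rho^{(t)})\, p(u_i^{obs}, u^{mis}_{i(m+1)} \mid \theta_{i(m)}, \psi^{(t)})}, 1 \right\}.
\end{aligned}
\tag{25}
\]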
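Steps (ii-1) and (ii-2) admit a similar sketch. The proposal is the ability distribution $p(\theta_i \mid \phi^{(t)})$, so the acceptance ratio in (25) contains only the selection probability and the item response likelihood. The code below works on the log scale for numerical stability and evaluates the response likelihood in the usual Bernoulli form; the function names, the clipping constant, and the callable `log_select` (the log of the selection probability as a function of $\theta_i$) are assumptions made for illustration.

```python
import math
import random

def p_correct(theta, a, b, c=0.0, D=1.7):
    """Item response probability of Eq. (2); c = 0 gives the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def log_lik(u, theta, items):
    """Bernoulli log-likelihood of a response vector given ability."""
    total = 0.0
    for x, item in zip(u, items):
        # Clip to avoid log(0) for extreme abilities (illustrative safeguard).
        p = min(max(p_correct(theta, *item), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def mh_update_theta(theta, u1, u2, items1, items2, phi, log_select, rng):
    """One update of theta_i = (theta_i1, theta_i2) following (ii-1) and (ii-2)."""
    # (ii-1): propose from p(theta_i | phi); phi holds (mean, variance) pairs.
    proposal = [rng.gauss(mu, math.sqrt(var)) for mu, var in phi]

    def log_target(th):
        # log of p(r_i | u_i, theta, rho) * p(u_i | theta, psi), as in Eq. (25).
        return log_select(th) + log_lik(u1, th[0], items1) + log_lik(u2, th[1], items2)

    log_ratio = log_target(proposal) - log_target(theta)
    # (ii-2): accept with probability min(ratio, 1).
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return theta

# Hypothetical usage: ten identical items per form, selection independent of theta.
rng = random.Random(7)
items = [(1.0, 0.0)] * 10
theta = [0.0, 0.0]
for _ in range(100):
    theta = mh_update_theta(theta, [1] * 10, [0, 1] * 5, items, items,
                            [(0.0, 1.0), (0.0, 1.0)], lambda th: 0.0, rng)
print(theta)
```

Here `u1` and `u2` are the completed response vectors for the two forms, one observed and one imputed in step (i).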
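After drawing samples of $u^{mis}_{(m)}$ and $\theta_{(m)}$, $Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})$ is approximated by the
Monte Carlo integration:
\[
Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})
\approx \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \log p(r_i, u_i^{obs}, u^{mis}_{i(m)}, \theta_{i(m)} \mid \psi, \phi, \rho),
\tag{26}
\]
where $M$ is the number of draws.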
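[M-step]: Maximize $Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})$ and update $\psi^{(t)}, \phi^{(t)}, \rho^{(t)}$ to $\psi^{(t+1)}$,
$\phi^{(t+1)}$, $\rho^{(t+1)}$.
At the maximization (M)-step, we need to maximize $Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})$ with respect
to $\psi$, $\phi$, and $\rho$. Using (12) and (26), $Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})$ can be written as follows:
\[
Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)}) = E_R + E_U + E_\Theta,
\tag{27}
\]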
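where
\[
E_R = \sum_{i=1}^{N} \iint \log p(r_i \mid u_i, \theta_i, \rho)
\times p(u_i^{mis}, \theta_i \mid u_i^{obs}, r_i, \psi^{(t)}, \phi^{(t)}, \rho^{(t)})\, du_i^{mis}\, d\theta_i,
\tag{28}
\]
\[
E_U = \sum_{i=1}^{N} \sum_{j=1}^{2} \iint \log p(u_{ij} \mid \theta_{ij}, \psi)
\times p(u_i^{mis}, \theta_i \mid u_i^{obs}, r_i, \psi^{(t)}, \phi^{(t)}, \rho^{(t)})\, du_i^{mis}\, d\theta_i,
\tag{29}
\]
\[
E_\Theta = \sum_{i=1}^{N} \sum_{j=1}^{2} \iint \log p(\theta_{ij} \mid \phi)
\times p(u_i^{mis}, \theta_i \mid u_i^{obs}, r_i, \psi^{(t)}, \phi^{(t)}, \rho^{(t)})\, du_i^{mis}\, d\theta_i.
\tag{30}
\]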
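Thus, maximizing $Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})$ is equivalent to solving the following equations:
\[
\frac{\partial Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})}{\partial \rho}
= \frac{\partial E_R}{\partial \rho} = 0,
\tag{31}
\]
\[
\frac{\partial Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})}{\partial \psi}
= \frac{\partial E_U}{\partial \psi} = 0,
\tag{32}
\]
\[
\frac{\partial Q(\psi, \phi, \rho \mid \psi^{(t)}, \phi^{(t)}, \rho^{(t)})}{\partial \phi}
= \frac{\partial E_\Theta}{\partial \phi} = 0.
\tag{33}
\]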
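The complete-data likelihood equation for $\rho$ cannot be solved in closed form; therefore, the
Newton–Raphson method is used to obtain the updated parameter values. The first and
second partial derivatives of $L_R$ with respect to $\rho_j$ ($j = 1, 2$) are
\[
\frac{\partial L_R}{\partial \rho_j}
= \sum_{i=1}^{N} \left[ R_{ij} - \frac{\exp(\rho_j' z_i)}{1 + \exp(\rho_j' z_i)} \right] z_i,
\tag{34}
\]
\[
\frac{\partial^2 L_R}{\partial \rho_j \partial \rho_j'}
= - \sum_{i=1}^{N} \frac{\exp(\rho_j' z_i)}{1 + \exp(\rho_j' z_i)}
  \left[ 1 - \frac{\exp(\rho_j' z_i)}{1 + \exp(\rho_j' z_i)} \right] z_i z_i',
\tag{35}
\]
where $R$ is an $N \times 2$ indicator matrix in which the $(i, j)$th element is defined as:
\[
R_{ij} =
\begin{cases}
1 & \text{if } r_i = j, \\
0 & \text{if } r_i \neq j.
\end{cases}
\tag{36}
\]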
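Let $\rho^{(t)}_{j(s)}$ be the value of the $s$th Newton–Raphson step in the $t$th M-step. The following equation
is used for updating $\rho^{(t)}_j$:
\[
\rho^{(t)}_{j(s+1)} = \rho^{(t)}_{j(s)}
- \left. \left[ \frac{\partial^2 E_R}{\partial \rho_j \partial \rho_j'} \right]^{-1} \right|_{\rho_j = \rho^{(t)}_{j(s)}}
\times \left. \frac{\partial E_R}{\partial \rho_j} \right|_{\rho_j = \rho^{(t)}_{j(s)}},
\tag{37}
\]
where
\[
\frac{\partial E_R}{\partial \rho_j} = \frac{1}{M} \sum_{m=1}^{M} \frac{\partial L_R}{\partial \rho_j},
\qquad
\frac{\partial^2 E_R}{\partial \rho_j \partial \rho_j'} = \frac{1}{M} \sum_{m=1}^{M} \frac{\partial^2 L_R}{\partial \rho_j \partial \rho_j'}.
\tag{38}
\]
Updating is repeated through the above equation until the convergence criterion is satisfied.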
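The Newton–Raphson update in (34), (35), (37), and (38) can be sketched for the special case of a single scalar coefficient, which corresponds to the sum-score constraint used later in the simulation study. The function name, the data layout, and the stopping rule on the step size are assumptions made for illustration; the general case replaces the scalar division by a matrix inverse.

```python
import math

def newton_raphson_rho(draws, R, rho0=0.0, tol=1e-3, max_iter=50):
    """Newton-Raphson update for a scalar selection coefficient.

    draws[m][i] is the covariate z_i in the m-th Monte Carlo draw, and
    R[i] is 1 if examinee i selected the form in question and 0 otherwise.
    """
    rho = rho0
    M = len(draws)
    for _ in range(max_iter):
        grad = 0.0
        hess = 0.0
        for z in draws:
            for zi, ri in zip(z, R):
                p = 1.0 / (1.0 + math.exp(-rho * zi))
                grad += (ri - p) * zi             # Eq. (34)
                hess -= p * (1.0 - p) * zi * zi   # Eq. (35)
        grad /= M                                 # Monte Carlo average, Eq. (38)
        hess /= M
        step = grad / hess
        rho -= step                               # Eq. (37)
        if abs(step) < tol:
            break
    return rho

# Hypothetical check: with z_i = 1 for everyone and three of four examinees
# selecting the form, the maximizer is the log-odds log(0.75 / 0.25).
print(newton_raphson_rho([[1, 1, 1, 1]] * 3, [1, 1, 1, 0]))
```

Because the covariates $z_i$ contain imputed responses and abilities, they differ across the $M$ draws, whereas the selection indicators stay fixed.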
Moreover, the likelihood equation for the item parameters cannot be obtained as a closed
form; therefore, again, we use the Newton–Raphson method.
In operational practice, the following two types of constraints can be imposed for test form
selection behavior, and some parts of the Metropolis–Hastings algorithm in the E-step are altered
as described below due to these constraints.
When the Test Form Selection Behavior Depends Only on the Ability Parameters
(i) is altered as follows:
(i’) Generate $u^{mis}_{i(m+1)}$ from $p(u_i^{mis} \mid u_i^{obs}, r_i, \theta_{i(m)}, \psi^{(t)}, \rho^{(t)})$
Because $u_i^{mis}$ is independent of $r_i$, the conditional distribution of $u_i^{mis}$ can be obtained from
$p(u_i^{mis} \mid u_i^{obs}, \theta_{i(m)}, \psi^{(t)}, \rho^{(t)})$.
When the Test Form Selection Behavior Depends Only on the Test Scores
(ii) is altered as follows:
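(ii’) Generate $\theta_{i(m+1)}$ from $p(\theta_i \mid u_i^{obs}, u^{mis}_{i(m+1)}, r_i, \phi^{(t)}, \rho^{(t)})$
We again use the following Metropolis–Hastings algorithm. Further, the target and the
proposal distribution are as follows:
\[ \text{target distribution: } p(\theta_i \mid r_i, u_i^{obs}, u^{mis}_{i(m+1)}, \rho^{(t)}, \phi^{(t)}) \tag{39} \]
\[ \text{proposal distribution: } p(\theta_i \mid \phi^{(t)}) \tag{40} \]
(ii’-1) Draw $\theta_i^* \sim p(\theta_i \mid \phi^{(t)})$
(ii’-2) Accept $\theta_{i(m+1)} = \theta_i^*$ with probability
\[
\alpha(\theta_i^* \mid \theta_{i(m)}, u_i^{obs}, u^{mis}_{i(m+1)}, \psi^{(t)})
= \min\left\{ \frac{p(u_i^{obs}, u^{mis}_{i(m+1)} \mid \theta_i^*, \psi^{(t)})}{p(u_i^{obs}, u^{mis}_{i(m+1)} \mid \theta_{i(m)}, \psi^{(t)})}, 1 \right\}.
\tag{41}
\]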
Assessing the Covariance Matrix of the Estimates Based on the Observed Information Matrix
Let $\xi = (\psi, \phi, \rho)$; that is, $\xi$ contains all the parameters of our model. As a by-product, the
standard error of the parameter vector $\xi$ can be calculated for the MCEM algorithm. Louis (1982)
showed that the observed information matrix of $\hat{\xi}$ from the EM algorithm can be expressed as
\[
I(\hat{\xi} \mid y_{obs})
= E_\xi \bigl[ I_c(\xi \mid y) \mid y_{obs} \bigr] \Big|_{\xi = \hat{\xi}}
- E_\xi \bigl[ S_c(y \mid \xi)\, S_c^T(y \mid \xi) \mid y_{obs} \bigr] \Big|_{\xi = \hat{\xi}},
\tag{42}
\]
where $I_c(\xi \mid y)$ is the matrix of the negative of the second-order partial derivatives of the complete-
data log likelihood function with respect to the elements of $\xi$, and $S_c(y \mid \xi)$ is the gradient vector
of the complete-data log likelihood function, that is,
\[ I_c(\xi \mid y) = - \frac{\partial^2 \log L_c(\xi \mid y)}{\partial \xi \partial \xi'}, \tag{43} \]
\[ S_c(y \mid \xi) = \frac{\partial \log L_c(\xi \mid y)}{\partial \xi}, \tag{44} \]
where $L_c(\xi \mid y)$ is a complete-data log likelihood function in which the missing part is
complemented through each MCE step. In practice, the calculation of expectation values in (42) is
approximated through Monte Carlo integration. Therefore, (42) is translated as follows:
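\[
I(\hat{\xi} \mid y_{obs})
= \frac{1}{M} \sum_{m=1}^{M} I_c(\hat{\xi} \mid y_{(m)})
- \frac{1}{M} \sum_{m=1}^{M} S_c(y_{(m)} \mid \hat{\xi})\, S_c^T(y_{(m)} \mid \hat{\xi}).
\tag{45}
\]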
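The Monte Carlo approximation in (45) amounts to averaging, over the $M$ completed data sets, the complete-data information matrix minus the outer product of the complete-data score. A minimal Python sketch follows; the function name and the toy inputs are hypothetical, and the complete-data information matrices and score vectors are assumed to have been computed elsewhere.

```python
def louis_information(info_draws, score_draws):
    """Monte Carlo version of Eq. (45): mean of I_c minus mean of S_c S_c^T.

    info_draws[m] is the d x d complete-data information matrix and
    score_draws[m] the length-d score vector for the m-th completed data set.
    """
    M = len(info_draws)
    d = len(score_draws[0])
    result = [[0.0] * d for _ in range(d)]
    for I_c, S_c in zip(info_draws, score_draws):
        for a in range(d):
            for b in range(d):
                result[a][b] += (I_c[a][b] - S_c[a] * S_c[b]) / M
    return result

# Toy example with two parameters and two draws (made-up numbers).
info = [[[2.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [0.0, 4.0]]]
score = [[1.0, 0.0], [-1.0, 0.0]]
print(louis_information(info, score))
```

The subtraction shows how the variability of the score across imputations reduces the information relative to the complete-data case.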
Let $\xi_p$ be a part of the parameter vector $\xi$. Using the observed information matrix $I(\hat{\xi} \mid y_{obs})$,
we can test the null hypothesis “$H_0: \xi_p = 0$” using the Wald statistic. The Wald statistic of
the hypothesis $H_0$ can be expressed as follows:
\[
W = \xi_p' \bigl[ I_{\xi_p}(\hat{\xi} \mid y_{obs}) \bigr]^{-1} \xi_p,
\tag{46}
\]
where $I_{\xi_p}(\hat{\xi} \mid y_{obs})$ is the submatrix of the Fisher information $I(\hat{\xi} \mid y_{obs})$ relevant to $\xi_p$.
4. Simulation Study
To show the reliability of the proposed method, we carried out a simulation study. We in-
cluded data for which the assignment of test forms was not random, and for which the test form
selection behavior depended on both the scores on the tests the examinees selected and the scores
on the tests they did not select. We used SAS/IML to evaluate the multiple group item response
theory as well as the estimates from the proposed method. For the simulation study, we gener-
ated 100 data sets and for each data set, we obtained the usual ML estimates for multiple group
IRT (concurrent calibration) and the estimates using the proposed method. We were interested in
assessing the accuracy of the parameter estimation for our model.
A two-parameter logistic model was used as the functional form of the item response. Each
test had 10 items. The item parameters of Test B ($a_B$, $b_B$) were common across test forms, and
$\mu_1$ and $\sigma_1^2$ were fixed as $\mu_1 = 0$ and $\sigma_1^2 = 1$, respectively. To ensure the identifiability of the
model and for the sake of simplicity, we assumed that the sums of the observed and missing
test scores would determine test form selection behavior; that is, we considered the following
constraint:
\[
\begin{aligned}
\rho_{A1j} = \cdots = \rho_{AK_Aj} = \rho_{B11j} = \cdots = \rho_{B1K_Bj} &= \pi_j, \\
\rho_{B21j} = \cdots = \rho_{B2K_Bj} = \rho_{C1j} = \cdots = \rho_{CK_Cj} &= -\pi_j.
\end{aligned}
\tag{47}
\]
Even when this constraint is assumed, the assumption of a nonrandom assignment can be upheld.
With these constraints, the test selection probability can be expressed as follows:
\[
p(r_i = j \mid u_i, \rho) = \frac{\exp(\pi_j v_i)}{1 + \exp(\pi_1 v_i)},
\tag{48}
\]
where $v_i$ is the difference between the scores of the $i$th examinee on the two test forms:
\[
v_i = \sum_{k_A=1}^{K_A} u_{iAk_A} + \sum_{k_B=1}^{K_B} u_{iB1k_B}
- \Bigl( \sum_{k_B=1}^{K_B} u_{iB2k_B} + \sum_{k_C=1}^{K_C} u_{iCk_C} \Bigr).
\tag{49}
\]
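The data-generating mechanism implied by (47) to (49) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' SAS program: the function names and item parameters are hypothetical, the value 1.5 for group 2 is treated as a variance, the 2PL model is assumed to use the scaling constant 1.7, and $\pi_2$ is taken to be zero so that (48) gives the probability of form 1 as $\exp(\pi_1 v_i) / (1 + \exp(\pi_1 v_i))$.

```python
import math
import random

def p_2pl(theta, a, b, D=1.7):
    """Two-parameter logistic item response probability."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def simulate_examinee(rng, items_A, items_B, items_C,
                      pi1=0.2, mu2=0.5, var2=1.5, corr=0.5):
    """Generate complete responses, then let the score difference drive selection."""
    # Abilities with correlation `corr`; theta1 ~ N(0, 1), theta2 ~ N(mu2, var2).
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    theta1 = z1
    theta2 = mu2 + math.sqrt(var2) * (corr * z1 + math.sqrt(1.0 - corr ** 2) * z2)

    def respond(theta, items):
        return [int(rng.random() < p_2pl(theta, a, b)) for a, b in items]

    u_A, u_B1 = respond(theta1, items_A), respond(theta1, items_B)
    u_B2, u_C = respond(theta2, items_B), respond(theta2, items_C)
    v = (sum(u_A) + sum(u_B1)) - (sum(u_B2) + sum(u_C))    # Eq. (49)
    p_form1 = 1.0 / (1.0 + math.exp(-pi1 * v))             # Eq. (48) with j = 1
    r = 1 if rng.random() < p_form1 else 2
    observed = (u_A, u_B1) if r == 1 else (u_B2, u_C)       # the other form is missing
    return r, v, observed

# Hypothetical item parameters: ten identical (a, b) pairs per test.
rng = random.Random(0)
items = [(1.0, 0.0)] * 10
r, v, observed = simulate_examinee(rng, items, items, items)
print(r, v)
```

Because the selection probability depends on $v_i$, which involves responses that are never observed for the selected examinee, the resulting missingness is nonignorable by construction.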
The true values are provided in Tables 1 and 2. In the current model, the total number of
parameters was 63. In this simulation study, the correlation of θ1 and θ2 was set to 0.5.
We generated 100 replications, all of which followed the same population parameters (or
true values). For each replication, we generated 3,000 observations and set M = 30.
The convergence criterion for the Newton–Raphson algorithm in each M-step was 0.001.
The means of the ML estimates were computed based on 100 replications. The root mean squares
(RMSs) between the estimates and true values as well as the total value of mean squared errors
(MSEs) were computed to compare the accuracy of the results of this simulation with those under
a random assignment condition. The results are listed in Tables 1 and 2. We obtained the biases
by subtracting the estimated values from the true values. The proposed method can also
produce ability scores: we calculated Monte Carlo estimates for the ability parameter of each
examinee and created histograms (see Fig. 3).
The sum of the MSEs was 0.298 for the proposed model and 3.18 for the traditional model.
Moreover, the sum of the absolute values of the biases was calculated for each assumed model;
it was 0.686 for the proposed model and 12.5 for the traditional model. Thus, the sum of the
MSEs under the concurrent calibration was approximately 10 times larger than that under the
proposed model, and the sum of the absolute biases was about 20 times larger. These results
indicate that the parameters can be estimated accurately under the proposed model, whereas the
traditional model yields biased estimates.
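The two ratios quoted above follow directly from the reported sums, as the following short check shows. The variable names are arbitrary; the four input values are those given in the text.

```python
mse_proposed, mse_existing = 0.298, 3.18
bias_proposed, bias_existing = 0.686, 12.5

mse_ratio = mse_existing / mse_proposed
bias_ratio = bias_existing / bias_proposed
print(round(mse_ratio, 1), round(bias_ratio, 1))
```

The ratios are close to 11 and 18, in line with the rounded figures given in the text.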
TABLE 1.
The results of simulation study for the IRT model (ρ, φ, μ, aA, aB).
Para          Proposed model               Existing model
              Biases           RMS         Biases      RMS
π1 = 0.2      −3.00 × 10^−4    0.0238      *           *
μ2 = 0.5      −3.33 × 10^−3    0.0575      0.420       0.423
σ2 = 1.5      −2.47 × 10^−2    0.111       −0.181      0.204
aA1 = 0.6     4.68 × 10^−3     0.0477      −0.0460     0.0641
aA2 = 0.8     1.42 × 10^−4     0.0669      −0.0623     0.0885
aA3 = 1.0     −4.76 × 10^−3    0.0825      −0.0808     0.113
aA4 = 1.2     −1.97 × 10^−3    0.0788      −0.0936     0.120
aA5 = 1.5     2.56 × 10^−2     0.107       −0.0944     0.139
aA6 = 0.8     6.35 × 10^−3     0.0630      −0.0571     0.0824
aA7 = 0.6     5.91 × 10^−3     0.0521      −0.0456     0.0674
aA8 = 1.0     3.01 × 10^−3     0.0663      −0.0752     0.0979
aA9 = 1.5     1.32 × 10^−2     0.105       −0.103      0.143
aA10 = 1.2    1.03 × 10^−2     0.0959      −0.0839     0.122
aB1 = 1.8     1.32 × 10^−2     0.0903      −0.116      0.145
aB2 = 1.3     3.82 × 10^−3     0.0548      −0.0859     0.102
aB3 = 1.2     7.17 × 10^−3     0.0570      −0.0779     0.0968
aB4 = 0.7     3.91 × 10^−3     0.0333      −0.0488     0.0589
aB5 = 0.9     8.89 × 10^−3     0.0508      −0.0543     0.0709
aB6 = 0.8     1.95 × 10^−3     0.0366      −0.0563     0.0666
aB7 = 0.7     −1.67 × 10^−3    0.0366      −0.0525     0.0651
aB8 = 1.0     −7.10 × 10^−3    0.0481      −0.0774     0.0929
aB9 = 1.1     9.35 × 10^−3     0.0531      −0.0669     0.0849
aB10 = 1.0    1.63 × 10^−2     0.0521      −0.0551     0.0716
5. Real Data Analysis
We illustrated the applicability of our method, using a small part of the JLRT data set. The
JLRT is a test that measures the ability to communicate with other persons in Japanese in the
business setting, and is administered mainly to students or business people whose first language
is not Japanese. The JLRT has two equivalent test forms. Test form 1 includes Tests A and B,
and Test form 2 includes Tests B and C; Test B is common to the two test forms. The test form
the examinees take depends on the country in which they take the test. For illustrative purposes,
we used a subsample that recently took the JLRT. One thousand one hundred and ninety-nine
examinees took Test form 1 in Japan and 863 examinees took Test form 2 in other countries.
Considering that the examinees living in Japan were probably highly motivated to study Japanese
and had greater abilities than the examinees living in other countries, we can assume that the
examinees were not randomly assigned to the test forms.
For the purpose of simplification, a two-parameter logistic model was used as the functional
form of the item responses. The number of items in Test B was 33 and the number of items in
each of Tests A and C was 67. If we had analyzed all these items, there would have been too
much data in the tables and figures, which would have been inappropriate for the purpose of
illustration. Thus, after an exploratory analysis of 100 items included in the two test forms using
BILOG-MG, we chose 10 items from each of the Tests A, B, and C. μ1 and σ2
1 are fixed as
μ1 = 0, σ2
1 = 1, respectively. As in the simulation study, we assumed that the sums of the test
scores determined the test form selection behavior. The convergence criterion for the EM step
FIGURE 3.
Histograms of scores for each group in simulation.
and existing methods. This result is similar to the data-generating situation in the simulation
study.
As a method for model comparison, the Akaike information criterion (AIC) and the Bayesian
information criterion (BIC) were calculated to compare the proposed method with the existing
method. The results are listed in Table 5. The method with the lower AIC and BIC values was
found to be more appropriate to analyze the current data. Hence, it is concluded that the proposed
method is superior to the existing one. Following (46), we also calculated the Wald statistic of
TABLE 3.
The results of real data analysis (ρ,φ,µ,aA,aB ,aC).
Para Proposed model Existing model (MAR)
Estimates SE Estimates SE
π1 0.443 0.0383 * *
μ2 −0.445 0.0320 −0.639 0.0327
σ2 0.684 0.0316 0.701 0.0222
aA1 1.369 0.238 0.954 0.189
aA2 1.786 0.266 1.388 0.152
aA3 1.521 0.393 1.234 0.280
aA4 1.654 0.382 1.456 0.269
aA5 1.515 0.388 1.165 0.235
aA6 1.298 0.137 0.905 0.081
aA7 0.956 0.141 0.640 0.090
aA8 0.337 0.074 0.255 0.041
aA9 1.712 0.191 1.243 0.111
aA10 1.032 0.099 0.734 0.057
aB1 1.093 0.088 0.931 0.057
aB2 0.540 0.052 0.408 0.033
aB3 0.661 0.051 0.536 0.035
aB4 0.508 0.058 0.383 0.037
aB5 0.932 0.064 0.741 0.039
aB6 1.162 0.107 1.037 0.070
aB7 0.804 0.070 0.649 0.043
aB8 0.861 0.072 0.737 0.047
aB9 1.079 0.077 0.891 0.047
aB10 0.480 0.044 0.363 0.028
aC1 0.724 0.132 0.834 0.101
aC2 0.804 0.359 1.085 0.195
aC3 1.026 0.280 1.487 0.199
aC4 1.114 0.280 1.655 0.217
aC5 0.885 0.150 1.057 0.114
aC6 1.041 0.195 1.346 0.164
aC7 0.448 0.088 0.521 0.065
aC8 1.347 0.159 1.522 0.111
aC9 0.605 0.098 0.623 0.067
aC10 0.439 0.084 0.480 0.060
the hypothesis concerning MAR, $H_0: \pi_1 = 0$. The Wald statistic follows a chi-square distribution
with 1 degree of freedom. The resulting value was $\chi^2(1) = 191.3$, $p < 0.001$. This indicates that
the MAR assumption cannot be upheld and missing data are nonignorable. As another statistical
test of $\pi_1$, the Z-value was also calculated as 11.57 ($p < 0.001$), which is statistically significant.
The AIC, BIC, Wald statistic, and Z-value all suggest that assignment to the test forms was
not random. The examinees could determine which test form they would take by determining
which country they lived in and were also expected to have acquired superior Japanese language
skills if they lived in Japan. Therefore, in our real data analysis, we can assume that test form
selection behavior exists and that assignment to a test form is not random.
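The Z-value can be reproduced from the estimate and standard error of $\pi_1$ reported in Table 3. The sketch below assumes that the Z-value is the ratio of the estimate to its standard error and that the p-value is two-sided under a standard normal reference distribution; the variable names are arbitrary.

```python
import math

estimate, se = 0.443, 0.0383   # pi_1 and its SE under the proposed model (Table 3)

z = estimate / se
# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
print(round(z, 2), p_two_sided < 0.001)
```

The ratio rounds to 11.57, and the corresponding p-value is far below 0.001.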
For reference, we calculated Monte Carlo estimates for the ability parameters of each ex-
aminee and created histograms using our proposed method (see Fig. 4). Figure 4 shows that the
examinees living in countries other than Japan (group 2) had lower abilities than those living in
Japan, which is consistent with the expected results.
TABLE 4.
Results of the real data analysis (bA, bB, bC).
             Proposed model        Existing model (MAR)
Parameter    Estimate    RMS       Estimate    RMS
bA1          −1.776      0.267     −2.970      0.473
bA2          −1.054      0.105     −1.672      0.119
bA3          −1.827      0.317     −2.815      0.442
bA4          −1.603      0.223     −2.354      0.257
bA5          −1.744      0.316     −2.729      0.393
bA6          −0.783      0.089     −1.407      0.119
bA7          −1.547      0.237     −2.697      0.362
bA8          −1.776      0.486     −2.922      0.542
bA9          −0.792      0.076     −1.345      0.093
bA10         −0.279      0.066     −0.723      0.085
bB1          −0.965      0.067     −1.499      0.064
bB2          −0.855      0.101     −1.551      0.126
bB3          1.493       0.094     1.475       0.094
bB4          −1.550      0.182     −2.490      0.223
bB5          −0.249      0.042     −0.679      0.046
bB6          −1.230      0.083     −1.767      0.075
bB7          −0.974      0.083     −1.577      0.089
bB8          −0.989      0.080     −1.542      0.079
bB9          −0.526      0.048     −0.988      0.046
bB10         0.658       0.072     0.412       0.080
bC1          −1.916      0.260     −2.257      0.163
bC2          −2.981      0.961     −2.934      0.304
bC3          −2.265      0.366     −2.335      0.137
bC4          −2.168      0.307     −2.261      0.118
bC5          −1.709      0.189     −2.044      0.115
bC6          −1.984      0.233     −2.241      0.126
bC7          −1.119      0.183     −1.540      0.127
bC8          −0.657      0.053     −1.098      0.038
bC9          0.640       0.153     0.120       0.098
bC10         −0.212      0.101     −0.766      0.088
TABLE 5.
Values of AIC and BIC for the real data analysis.
       Proposed method    Existing method
AIC    2.878 × 10⁴        3.132 × 10⁴
BIC    2.914 × 10⁴        3.169 × 10⁴
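For readers who want to see how the information criteria in Table 5 are compared, the following sketch gives the standard definitions of AIC and BIC and applies the usual "smaller is better" rule to the reported values. The log-likelihoods, parameter counts, and sample size behind Table 5 are not reported here, so the helper functions `aic` and `bic` are illustrative only, and the dictionary layout and the function name `preferred` are assumptions of this example.

```python
import math


def aic(loglik: float, n_params: int) -> float:
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: -2 log L + k log n."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Values as reported in Table 5 (not recomputed from a likelihood).
reported = {
    "proposed": {"AIC": 2.878e4, "BIC": 2.914e4},
    "existing": {"AIC": 3.132e4, "BIC": 3.169e4},
}


def preferred(criterion: str) -> str:
    """Return the method with the smaller value of the given criterion."""
    return min(reported, key=lambda method: reported[method][criterion])


delta_aic = reported["existing"]["AIC"] - reported["proposed"]["AIC"]
delta_bic = reported["existing"]["BIC"] - reported["proposed"]["BIC"]

print("Preferred by AIC:", preferred("AIC"), f"(difference {delta_aic:.0f})")
print("Preferred by BIC:", preferred("BIC"), f"(difference {delta_bic:.0f})")
```

Both criteria are smaller for the proposed method, by roughly 2.5 × 10³ in each case.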
6. Conclusion
In this paper, we proposed a new method of item parameter linking in IRT. Through the
simulation study, we showed that ignoring test form selection behavior results in considerable
bias in the estimates of the item parameters when the assignment to the test forms is not ran-
dom. Furthermore, we showed that this bias can be reduced using the models presented above.
However, although the results in the simulation study seemed to be sufficient to demonstrate the
accuracy of the proposed estimation method, the models and methods for nonignorable missing
data are notoriously sensitive to misspecification. In some cases of misspecification, the bias of
these non-MAR models can be even more serious than the bias under the MAR assumption. Because the true
missing data mechanism is not known in practice, the results of simulation studies can seldom
be generalized widely to real data. The robustness of the proposed method to its model
assumptions is left for future empirical studies.
FIGURE 4.
Histograms of scores for each group in real data.
The proposed model includes, as a special case, the model in which test form selection behavior is
determined at random, without dependence on the scores on all the tests. We can therefore
fit the proposed model first and then use a test statistic such as the Wald statistic to
test whether the missing test scores affect test form selection behavior. In this regard, the model
is advantageous.
A variety of methods for dealing with ignorable and nonignorable missing data in practical
situations have been proposed (Schafer, 1997). Models with nonignorable missing-data mech-
anisms in IRT were also proposed by Holman and Glas (2005). However, since they were in-
terested in modeling the nonignorable missing data mechanism with the item response model,
their model does not consider item parameter linking in the common-item nonequivalent groups
design. While their model considers the missing mechanisms per item, our model deals with the
missing mechanism per test form. Moreover, their model is based on the idea of pattern mixture
models in which the test form selection indicators are set as the explanatory variables and the
item responses are set as the dependent variables. In contrast, our model can be regarded as a kind of
selection model, in which the roles of the explanatory and dependent variables are
reversed relative to those in pattern mixture models.
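The contrast between the two model classes can be written as two factorizations of the same joint distribution, following the standard presentation in the missing-data literature (Little & Rubin, 2002). In this sketch, y denotes the item responses and r the test form selection indicator; the parameter symbols β, π, ω, and ξ are generic placeholders chosen for illustration and are not necessarily the notation used in the model section.

```latex
\begin{align*}
\text{Selection model:} \quad
  & f(y, r \mid \beta, \pi) = f(y \mid \beta)\, f(r \mid y, \pi), \\
\text{Pattern mixture model:} \quad
  & f(y, r \mid \omega, \xi) = f(r \mid \omega)\, f(y \mid r, \xi).
\end{align*}
```

In the selection form, the responses y explain the selection indicator r; in the pattern mixture form, r explains y.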
While the proposed model is a fully parametric model, one can instead conduct the analysis using a
semiparametric model with propensity scores (Hoshino, Kurata, & Shigemasu, 2006; Hoshino,
2007, 2008) under the MAR assumption. We plan to conduct simulations to determine which
of the two approaches is preferable: the proposed method, which requires parametric model
assumptions but not the MAR assumption, or the semiparametric methods above, which avoid
parametric assumptions but require the MAR assumption.
Models for tests having items that differ in terms of levels of measurement (such as dichoto-
mous data and polytomous data) are also topics that can be addressed in future research.
Acknowledgements
The authors are grateful to Ms. Naoko Hojo of the Japan External Trade Organization
(JETRO) for helping us access the JLRT data. This study was partially supported by the Min-
istry of Education, Science, Sports, and Culture, Grant-in-Aid for Scientific Research, 19-8879,
Scientific Research (B), 193-30145 and the Inamori Foundation grant (to Takahiro Hoshino). Fi-
nally, we would like to express our sincere thanks to the associate editor and two reviewers for
their valuable advice and comments.
References
Baker, F.B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal
of Educational Measurement, 28, 147–162.
Bernaards, C.A., & Sijtsma, K. (1999). Factor analysis of multidimensional polytomous item response data suffering
from ignorable item nonresponse. Multivariate Behavioral Research, 34, 277–313.
Bock, R.D., & Zimowski, M.F. (1997). Multiple group IRT. In W.J. van der Linden & R.K. Hambleton (Eds.), Handbook
of modern item response theory (pp. 433–448). Berlin: Springer.
Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39, 1–38.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological
Research, 22, 144–149.
Hanson, B.A., & Béguin, A.A. (2002). Obtaining a common scale for item response theory item parameters using separate
versus concurrent calibration in the common-item equating design. Applied Psychological Measurement, 26, 3–24.
Holman, R., & Glas, C.A.W. (2005). Modelling non-ignorable missing-data mechanisms with item response theory
models. British Journal of Mathematical and Statistical Psychology, 58, 1–17.
Hoshino, T. (2007). Doubly robust type estimation for covariate adjustment in latent variable modeling. Psychometrika,
72, 535–549.
Hoshino, T. (2008). A Bayesian propensity score adjustment for latent variable modeling and MCMC algorithm.
Computational Statistics & Data Analysis, 52, 1413–1429.
Hoshino, T., Kurata, H., & Shigemasu, K. (2006). A propensity score adjustment for multiple group structural equation
modeling. Psychometrika, 71, 691–712.
Ibrahim, J.G., Chen, M.H., & Lipsitz, S.R. (2001). Missing responses in generalised linear mixed models when the
missing data mechanism is nonignorable. Biometrika, 88, 551–564.
Kato, K., & Japan External Trade Organization (JETRO) (2006). BJT business Japanese proficiency test official guide.
Japan External Trade Organization (JETRO), Tokyo, Japan.
Kim, S.H., & Cohen, A.S. (1992). Effects of linking methods on detection of DIF. Journal of Educational Measurement,
29, 51–66.
Kolen, M.J., & Brennan, R.L. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.). New York:
Springer.
Little, R.J.A., & Rubin, D.B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.
Lord, F.M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika,
39, 247–264.
Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal
Statistical Society, Series B, 44, 226–233.
Schafer, J.L. (1997). Analysis of incomplete multivariate data. New York: Chapman & Hall.
Stocking, M.L., & Lord, F.M. (1983). Developing a common metric in item response theory. Applied Psychological
Measurement, 7, 201–210.
van der Linden, W.J., & Luecht, R.M. (1998). Observed-score equating as a test assembly problem. Psychometrika, 63,
401–418.
von Davier, M., & von Davier, A.A. (2004). A unified approach to IRT scale linkage and scale transformations (Research
Report RR-04-09). Princeton, NJ: ETS.
Wei, G.C.G., & Tanner, M.A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data
augmentation algorithm. Journal of the American Statistical Association, 85, 699–704.
Wingersky, M.S., & Lord, F.M. (1984). An investigation of methods for reducing sampling error in certain IRT
procedures. Applied Psychological Measurement, 8, 347–364.
Yang, W.L. (2004). Sensitivity of linkings between AP multiple-choice scores and composite scores to geographical
region: An illustration of checking for population invariance. Journal of Educational Measurement, 41, 33–41.
Manuscript Received: 31 MAR 2007
Final Version Received: 14 JUL 2008
Published Online Date: 9 SEP 2008