Comparison of Asymptotic, Bootstrap
and Posterior Predictive P-values in
Assessing Latent Class Model Fit
Geert van Kollenburg 647091∗
Abstract
Goodness-of-fit testing in latent class analysis can result in unreliable
asymptotic p-values when the reference distributions are unknown or when
the contingency tables become sparse. For instance, it has been shown that
the asymptotic p-value belonging to the likelihood ratio statistic becomes
untrustworthy in sparse data. A number of solutions to this problem have
arisen in the form of resampling techniques. The parametric bootstrap uses
the maximum likelihood estimates as population parameters to sample new
datasets, to see whether the observed statistics are likely to occur under the
proposed model. The posterior predictive check is the Bayesian alternative
to a p-value and is similar to the bootstrap, but it accounts for uncertainty
about the parameter values by drawing samples from the posterior predictive
distribution. The purpose of this thesis is to compare the asymptotic,
bootstrap and posterior predictive p-values in assessing the model fit of
latent class models when sample size is large and when it is small.
Key words: Latent Class Analysis, Goodness-of-Fit, Bayes' Theorem,
Parametric Bootstrap, Posterior Predictive Check.
∗ Department of Methodology and Statistics, Tilburg University, the Netherlands.
1 Introduction
To test the fit of a latent class (LC) model to a dataset, there exist overall
goodness-of-fit tests, which measure the discrepancy between observed
frequencies and those expected under the proposed model for all cells in a
corresponding contingency table (e.g., the likelihood ratio $L^2$; Vermunt,
2010). Also bivariate, or higher-order, measures can be estimated, which
assess the remaining association between two or more items in a dataset
after a LC model has been fitted. For instance, the bivariate residual (BVR;
Vermunt & Magidson, 2005) is an approximation of the score-test for the
association parameter between two items. Its value gives an indication of
the estimated increase in model fit if the association parameter were
included in the model.
Significance testing may become troublesome if the distribution of a
statistic is unknown. For example, the score test asymptotically follows a
chi-squared distribution when the model is true (Bera & Bilias, 2001), but
because the BVR is an approximation, its distribution is at best only
approximated by the chi-squared distribution. This problem grows when even
more complex measures (e.g., the sum of all BVRs) are used: the quality of
the approximation of the reference distribution then depends on the quality
of the approximations to the score-tests. The BVR is an example for which
some approximation is possible, but the asymptotic distributions of other
statistics may not be approximated well by any known distribution, or it may
be very hard to derive the asymptotic distribution of a statistic
analytically.
Even when a statistic follows a known distribution asymptotically (i.e.,
when the sample size goes to infinity), its use in performing significance
tests can become inappropriate when sample sizes are not large and
contingency tables become sparse. As the number of items increases or when
the sample size is small to moderate, the contingency tables may quickly
become sparse (i.e., many cells will have 0 or 1 entries). For instance,
with 10 dichotomous items there are already $2^{10} = 1024$ cells in the
table; in such cases the asymptotic distributions no longer hold and the
associated p-values become untrustworthy (Maydeu-Olivares & Joe, 2006;
Reiser & Lin, 1999; Vermunt, 2010). In the case of unknown, untrustworthy
or incorrect distributions it is necessary to calculate empirical reference
distributions. According to Formann (2003), this holds for overall
goodness-of-fit tests, residuals and other statistics.
In order to determine empirical reference distributions, resampling
techniques, like the parametric bootstrap by Collins et al. (1993; in
Formann, 2003), have been proposed to solve the problem of untrustworthy
asymptotic p-values and unknown distributions. If one assumes that the data
contain information about the true values of the parameters of interest, it
is possible to create a reference distribution to determine how likely an
observation is given the estimated parameters. The parametric bootstrap,
for instance, is implemented in the software package LatentGold (Vermunt &
Magidson, 2005) and uses Monte Carlo simulations to approximate the
empirical distribution of the goodness-of-fit statistics based on the
maximum likelihood (ML) estimates obtained from the data.
Instead of relying on the ML estimates, several authors propose using
Bayesian methods to assess model fit in LC analysis (Berkhof, Van Mechelen
& Gelman, 2003; Garrett & Zeger, 2000; Hoijtink, 1998). The Bayesian method
for obtaining a p-value is the Posterior Predictive Check (PPC), which can
be used in complex models where analytic solutions are tedious to obtain.
Rather than conditioning on a single point estimate, this method uses
random draws for the unknown parameters from the posterior predictive
distribution to determine how likely an observed statistic is (Gelman,
Carlin, Stern & Rubin, 2004).
The purpose of this thesis is to investigate the PPC as an alternative
to asymptotic and bootstrap p-values in assessing the model fit of LC
models. A comparison will also be made between all methods to check whether
they produce comparable results in large samples and whether the resampling
techniques are more adequate than the asymptotic p-value in small samples.
To investigate this, I take a number of commonly used fit statistics and
compare the long-run behavior of the p-values that the different methods
produce for them in a Monte Carlo simulation study. This leads to a direct
comparison of the asymptotic, bootstrap and PPC p-values under different
conditions, such as sample size. Importantly, it is assessed whether the
different p-values are uniformly distributed under the null-hypothesis and
whether nominal Type-I error levels are correct for the given statistics.
I do not intend to discuss the use of cut-off scores in significance
testing, but rather apply the commonly used levels as a reference for the
behavior of the statistics under the different methods.
The outline of this thesis is as follows. Section 2 describes the LC model,
the estimation of a LC model, and the fit statistics used in the study.
Section 3 provides an overview of the methods used for obtaining p-values.
Section 4 describes the simulation studies and gives the results. In
Section 5 an empirical dataset is analyzed to illustrate the techniques
that result in p-values. Finally, in Section 6 I discuss the findings and
issues in need of further research.
2 Latent Class Analysis
2.1 Defining the LC model
In the multivariate setting, let an $N \times J$ matrix $Y$ contain the
responses of $N$ units (i.e., individuals) on $J$ discrete variables with
$R_j$, $j = 1, \ldots, J$, categories. Let $Y_i = (Y_{i1}, \ldots, Y_{iJ})$
be row $i$, $i = 1, \ldots, N$, of $Y$, containing the responses to the $J$
variables. In total there are $S = \prod_{j=1}^{J} R_j$ possible response
patterns for $Y_i$. Therefore, let $Y_s$, $s = 1, \ldots, S$, denote a
specific pattern, and $n_s$ denote the observed pattern count. Finally, let
$y$ (without subscripts) denote an observed dataset.
The LC model assumes that the $N = \sum_s n_s$ units can be partitioned into
$C$ latent classes, which have their own probability density for the
responses. A unit's unobservable class membership is represented by the
latent variable $\theta$, and a particular class is denoted by $c$, with
$c = 1, \ldots, C$. The idea is then to find a LC model with the lowest
number of classes for which the responses conditional on class membership
are independent. This assumption is called local independence and lies at
the basis of LC analysis.

In a LC model, $P(Y_s)$, the probability of observing pattern $Y_s$, is
assumed to be a weighted average of the class-specific probabilities, with
weights $\pi_c$ being the probability that an individual belongs to LC $c$
(Vermunt, 2010).
So for each of the S patterns, the probability density is given by
P(Ys) =
C
c=1
πcP(Ys|θ = c), (1)
Assuming local independence,
P(Ys|θ = c) =
J
j=1
P(Ysj|θ = c). (2)
Using the notation of Vermunt (2010) to indicate the conditional item re-
sponse probability of a person in class c giving response r to item j as πjrc,
the conditional probability P(Ysj|θ = c) is then a multinomial probability
density given by
P(Ysj|θ = c) =
R
r=1
π
y∗
sjr
jrc , (3)
where y∗
sjr is 1 if Ysj = r and 0 otherwise.
Lastly, the probability that a person belongs to LC $c$, conditional on
having response pattern $Y_s$, called the posterior membership probability
(Vermunt, 2010), is obtained using Bayes' rule:

$$\pi_{c|s} = \frac{P(Y_s \mid \theta = c)\,\pi_c}{P(Y_s)}. \qquad (4)$$
2.2 Estimating the LC Model
To obtain ML estimates for the LC model, typically the
Expectation-Maximization (EM) algorithm (Goodman, 1974) is used. The EM
algorithm finds the ML estimates by maximizing the log-likelihood function

$$\log L = \sum_{s=1}^{S} n_s \log P(Y_s). \qquad (5)$$

Because only non-zero response frequencies contribute to the likelihood,
the convention $0 \log(0) = 0$ is used throughout this thesis. The details
of the EM algorithm and the $0 \log(0) = 0$ convention are discussed in
Appendix A.
Using the EM algorithm to obtain the ML estimates requires that starting
values are provided for the parameters in $\psi = (\pi_{jrc}, \pi_c)$,
denoted as $\pi^{(0)}_{jrc}$ and $\pi^{(0)}_c$. Caution is advised: when
the starting values are too similar, the model can become unidentifiable.
To solve this, it should be possible to order the LCs by $\pi^{(0)}_c$ or,
for instance, $\pi^{(0)}_{1rc}$ (Hoijtink, 1998). For further discussion on
the identifiability of LC models, including item/class ratios, see Goodman
(1974). The EM algorithm goes as follows:
Step 0: Choose initial values for $\psi^{(0)}$ and set $t = 1$.

Step 1: Expectation step. Given $\psi^{(t-1)}$, calculate $\pi_{c|s}$ (see
Equation 4). Then multiply this by $n_s$ to obtain $n^{(t)}_{sc}$, the
estimated number of respondents in each class having pattern $s$.

Step 2: Maximization step. Calculate

$$\pi^{(t)}_c = m_c/N = \sum_{s=1}^{S} n^{(t)}_{sc}/N$$

and

$$\pi^{(t)}_{jrc} = \sum_{s=1}^{S} \big(n^{(t)}_{sc}\, y^*_{sjr}\big)/m_c,$$

where $y^*_{sjr}$ is 1 if $Y_{sj} = r$ and 0 otherwise.

Step 3: Set $t = t + 1$ and repeat Steps 1 and 2 until the decrease in the
log-likelihood between two iterations is smaller than a given convergence
criterion (e.g., $10^{-8}$).

A minimal sketch of these steps in R is given below.
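The sketch is illustrative rather than the thesis code: the names (em_lc, patterns, ns) are assumptions, and the model is the unrestricted LC model of Section 2.1.

```r
# Minimal EM sketch for an unrestricted LC model (illustrative, not the
# thesis code). 'patterns' is an S x J matrix of response patterns with
# entries in 1..R; 'ns' holds the observed pattern counts.
em_lc <- function(patterns, ns, C, tol = 1e-8, max_iter = 1000) {
  S <- nrow(patterns); J <- ncol(patterns); R <- max(patterns)
  pi_c   <- rep(1 / C, C) + sort(runif(C, 0, .1))  # deliberately dissimilar starts
  pi_c   <- pi_c / sum(pi_c)
  pi_jrc <- array(runif(J * R * C), dim = c(J, R, C))
  pi_jrc <- sweep(pi_jrc, c(1, 3), apply(pi_jrc, c(1, 3), sum), "/")
  loglik_old <- -Inf
  for (t in seq_len(max_iter)) {
    # E-step: pi_c[c] * P(Y_s | theta = c) for every pattern and class
    dens <- sapply(1:C, function(c)
      pi_c[c] * apply(patterns, 1, function(y) prod(pi_jrc[cbind(1:J, y, c)])))
    p_s  <- rowSums(dens)       # P(Y_s), Equation 1
    n_sc <- (dens / p_s) * ns   # n_sc = pi_{c|s} * n_s
    # M-step: update class sizes and conditional response probabilities
    m_c  <- colSums(n_sc)
    pi_c <- m_c / sum(ns)
    for (c in 1:C) for (j in 1:J) for (r in 1:R)
      pi_jrc[j, r, c] <- sum(n_sc[patterns[, j] == r, c]) / m_c[c]
    loglik <- sum(ns * log(p_s))                   # Equation 5
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(pi_c = pi_c, pi_jrc = pi_jrc, loglik = loglik)
}
```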
Estimation of the model can also be done in a Bayesian context using a
Gibbs sampler (e.g., Hoijtink, 1998). The Gibbs sampler is similar to the
EM procedure, but relies on sampling distributions at each step (Ligtvoet
& Vermunt, 2012) and results in an estimated (posterior) distribution of
the parameters rather than point estimates for $\psi$. The Gibbs sampler
proceeds as follows:
Step 0: Choose initial values for $\psi^{(0)}$ and set $d = 1$.

Step 1: Data augmentation. Given $\psi^{(d-1)}$, calculate $\pi_{c|s}$ (see
Equation 4). Then, every subject with a particular pattern is assigned to a
LC by drawing from a multinomial distribution with probabilities from
$\pi_{c|s}$. This results in both the class sizes $m^{(d)}_c$ and
$n^{(d)}_{jrc}$, the number of respondents from class $c$ with response $r$
to item $j$.

Step 2: Draw a sample from the posteriors

$$\pi^{(d)}_c \sim \text{Dir}\big(m^{(d)}_1 + \alpha_c, \ldots, m^{(d)}_C + \alpha_c\big)$$

and

$$\big(\pi^{(d)}_{j1c}, \ldots, \pi^{(d)}_{jR_jc}\big) \sim \text{Dir}\big(n^{(d)}_{j1c} + \alpha_{jrc}, \ldots, n^{(d)}_{jR_jc} + \alpha_{jrc}\big),$$

where $\alpha_c = 1/C$ and $\alpha_{jrc} = 1/R_j$ (see Appendix B).

Step 3: Set $d = d + 1$ and repeat Steps 1 and 2 until convergence (Section
3.3 describes a method for assessing the convergence of the sampler; for
more Bayesian convergence criteria see, e.g., Brooks and Gelman, 1998).
After convergence, repeat Steps 1 and 2 $L$ times and keep the sampled
values to estimate the posterior distribution of the parameters.

A sketch of one iteration is given below.
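The sketch is again illustrative: rdirichlet1() draws one Dirichlet vector via Gamma variates, and Y is assumed to be the N × J data matrix with entries coded 1..R (unit-level rather than pattern-level, which is equivalent).

```r
# One Gibbs iteration (Steps 1-2), as an illustrative sketch. 'Y' is the
# N x J data matrix with entries in 1..R; alpha_c and alpha_jrc are the
# Dirichlet hyperparameters (1/C and 1/Rj in this thesis).
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }

gibbs_step <- function(Y, pi_c, pi_jrc, alpha_c, alpha_jrc) {
  N <- nrow(Y); J <- ncol(Y); C <- length(pi_c); R <- dim(pi_jrc)[2]
  # Step 1: data augmentation -- draw a class membership for every unit
  post <- sapply(1:C, function(c)
    pi_c[c] * apply(Y, 1, function(y) prod(pi_jrc[cbind(1:J, y, c)])))
  post <- post / rowSums(post)                       # pi_{c|s} per unit
  z <- apply(post, 1, function(p) sample.int(C, 1, prob = p))
  # Step 2: draw new parameters from their Dirichlet posteriors
  pi_c <- rdirichlet1(tabulate(z, nbins = C) + alpha_c)
  for (c in 1:C) for (j in 1:J)
    pi_jrc[j, , c] <- rdirichlet1(tabulate(Y[z == c, j], nbins = R) + alpha_jrc)
  list(pi_c = pi_c, pi_jrc = pi_jrc, z = z)
}
```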
In the simulation study (see Section 4), I use the population parameter
values as starting points and use a burn-in of 100 iterations before I
start sampling. This way the method is likely to start close to the true
parameter values, and the posterior is properly estimated. When the
population values were not useful (e.g., when the analysis only had 1 LC),
I used the ML estimates obtained from the EM algorithm as starting values.
2.3 Model-fit test statistics
Three test statistics are used to assess model fit. These fit statistics
are indicators of the local dependencies given class membership. Let
$e_s = P(Y_s)N$ denote the expected pattern frequency under the fitted LC
model, given the (estimated) values of $\psi$ (from which $P(Y_s)$ is
calculated). The likelihood ratio statistic $L^2$ and the overall Pearson
chi-squared test statistic $X^2$ are then

$$L^2 = 2\sum_{s=1}^{S} n_s \ln\frac{n_s}{e_s}, \qquad
X^2 = \sum_{s=1}^{S} \frac{(n_s - e_s)^2}{e_s}.$$

Thirdly, the bivariate residual (BVR) is used, which measures remaining
local dependencies between two items. The BVRs are $X^2$ values computed
for pairs of variables (Vermunt & Magidson, 2005). So for items $j$ and
$j'$,

$$BVR_{jj'} = \sum_{r=1}^{R_j} \sum_{r'=1}^{R_{j'}}
\frac{(n_{rr'} - e_{rr'})^2}{e_{rr'}}.$$
To investigate the BVR statistic across the random samples, I assume that
all BVRs behave the same and therefore only analyze the BVR of items 1 and
2. A minimal sketch of the three statistics is given below.
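As a minimal illustration in R (the software used later in this thesis), the three statistics can be computed from observed and expected frequencies as follows; the names ns, es, obs2 and exp2 are assumptions standing for the observed/expected pattern counts and the observed/expected two-way table of items 1 and 2.

```r
# Fit statistics from observed (ns) and expected (es) pattern counts;
# obs2/exp2 are the observed/expected two-way table for items 1 and 2.
L2  <- 2 * sum(ifelse(ns > 0, ns * log(ns / es), 0))  # 0 log(0) = 0 convention
X2  <- sum((ns - es)^2 / es)
BVR <- sum((obs2 - exp2)^2 / exp2)
```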
The $L^2$, $X^2$ and BVR are all of the less-is-better type and can be seen
as indicators of badness-of-fit. In the next section I describe how these
statistics can be used to perform significance tests for goodness-of-fit.
The significance tests are based on p-values, which indicate how likely the
value of an observed statistic is, given certain assumptions about the
population parameters and/or the data. The methods differ from each other
in their assumptions about the population parameters and in the estimation
process. First I describe how to obtain a p-value using an asymptotic
reference distribution, then by means of the parametric bootstrap, and
finally by means of two PPCs.
3 Estimating p-values
3.1 Asymptotic reference distribution
In the frequentist framework, the p-value is the theoretical probability of
finding a test statistic that is more extreme than the one actually observed,
under the null-hypothesis H0 (Hogg & Tanis, 2010). In testing a LC model
with C classes, we base the p-value on the assumption that this model is true.
The p-value associated with an observed test statistic Tobs is the probability
that a value for T is at least as extreme as Tobs, given the C-class model is
true.
In testing model fit I am only interested in the probability of worse fit,
which is indicated by larger values of $T$, so the asymptotic p-value can
be defined as

$$p_a = \Pr(T \geq T_{obs} \mid H_0), \qquad (6)$$

where conditioning on $H_0$ means that the posited model is assumed to be
true or that $\psi = \psi_0$, the values postulated in $H_0$ (Gelman et
al., 2004; Meng, 1994). To obtain this p-value one calculates the area
beyond the value of $T_{obs}$ in a reference distribution with a specified
number of degrees of freedom (df).
In an unrestricted LC model the $L^2$ and $X^2$ statistics under $H_0$ are
assumed to asymptotically follow a chi-squared distribution
($\chi^2_{df}$) with df given by

$$df = \prod_{j=1}^{J} R_j - C\Big[1 + \sum_{j=1}^{J}(R_j - 1)\Big]. \qquad (7)$$
As noted before, the BVR does not have a direct reference distribution,
since it is an approximation of the score-test, which does follow a
chi-squared distribution. In the coming simulation only binary variables
are used, so the BVR approximates the score-test for a $2 \times 2$
contingency table. Because the score-test is known to asymptotically follow
a chi-squared distribution with $(R_j - 1) \times (R_{j'} - 1) = 1$ df in
this case, I will assume that the BVR can be approximated by the same
asymptotic distribution and check the validity of this assumption. In R
these asymptotic p-values are upper-tail chi-squared probabilities, as
sketched below.
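The sketch uses the statistics computed earlier; the df values correspond to the simulation design of Section 4 and the variable names are illustrative.

```r
# Asymptotic p-values as upper-tail chi-squared probabilities (Equation 6);
# df = 50 follows from Equation 7 for the 6-item, 2-class design of
# Section 4, and df = 1 applies to the BVR with binary items.
pa_L2  <- pchisq(L2,  df = 50, lower.tail = FALSE)
pa_X2  <- pchisq(X2,  df = 50, lower.tail = FALSE)
pa_BVR <- pchisq(BVR, df = 1,  lower.tail = FALSE)
```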
Issues concerning $p_a$-values

Besides misconceptions and malpractices concerning p-values (see Sterne &
Smith, 2001, for a clear evaluation), statistical problems also arise with
the use of (asymptotic) reference distributions. One problem with the
asymptotic p-value is that if it is unknown what distribution a statistic
follows, the use of an incorrect reference distribution can result in
inaccurate p-values. Another problem is that, by definition, an asymptotic
p-value is not exact because sample sizes are always finite. Although
results might be trustworthy in very large samples, even moderate sample
sizes can lead to inaccurate results.
When the number of items in the data becomes large, the observed pattern
frequencies in the contingency tables quickly become very sparse, and one
needs very large sample sizes to control for this. In sparse tables the
distributions of statistics like the $L^2$ cannot be approximated well, and
even though $p_a$ can still be calculated, its values can no longer be
trusted (Magidson & Vermunt, 2004; Maydeu-Olivares & Joe, 2006; Reiser &
Lin, 1999; Vermunt, 2010). Other methods have to be used in order to get
more reliable and accurate p-values in situations where these issues occur.
Because of these and other problems associated with $p_a$-values, other
methods have been proposed for obtaining p-values, which do not rely on
asymptotic theory but are based on resampling techniques. These techniques
generate a large number of random replicate samples from a set of
(estimated) population parameter values. For each of these datasets
$y^{rep}$ it is possible to calculate the statistics of interest and
determine the probability that a statistic $T_{rep}$ is larger than the one
observed. This is done by estimating the proportion of $T_{rep}$ values
that are more extreme than $T_{obs}$, given the estimated parameters. For
the LC model, I will compare resampling techniques from the frequentist
framework (bootstrap) and from the Bayesian framework (PPC).
3.2 Parametric Bootstrap Method
The parametric bootstrap can be used to estimate the distribution of
statistics whose distribution is unknown, whether due to limited sample
size or because no good approximation exists. If we use the ML estimates
from the observed data as population values, it is possible to estimate the
probability that $T_{rep} \geq T_{obs}$, given that the estimates are true
(Langeheine, Pannekoek & Van de Pol, 1996). The bootstrap p-value is then
given by:

$$p_b = \Pr\big[(T_{rep} \geq T_{obs}) \mid \hat{\psi}, H_0\big]. \qquad (8)$$

The bootstrap method proceeds as follows:

Step 1. Assume that the model ($H_0$) is true.

Step 2. Treat the ML estimates from the observed data under $H_0$ as
population parameters.

Step 3. Draw $B$ random replicate samples $y^{rep,b}$, $b = 1, \ldots, B$,
of size $N$ based on these population parameter estimates.

Step 4. Estimate the LC model for each dataset using the EM algorithm and
calculate $T^b_{rep}$ from the ML estimates $\hat{\psi}^b$.
The proportion

$$B^{-1}\sum_{b=1}^{B} I\big(T^b_{rep} \geq T_{obs}\big),$$

where the indicator function $I$ equals 1 if $T^b_{rep} \geq T_{obs}$ and 0
otherwise, is taken as the estimate of $p_b$. In words, $p_b$ is estimated
by the proportion of samples in which the value $T^b_{rep}$ is greater than
or equal to $T_{obs}$. A compact sketch of this procedure is given below.
3.3 Posterior Predictive Check
The PPC is the Bayesian counterpart of the classical statistical tests
(Meng, 1994). Given that $H_0$ is true and that the observed data came from
the population of interest, the posterior predictive (PP) p-value is given
by:

$$p_p = \Pr\big[(T^l_{rep} \geq T_{obs}) \mid y, H_0\big]. \qquad (9)$$

In the Bayesian framework one is not particularly interested in the
probability that the observed data have come from a population with the
parameters posited in the null-hypothesis (as in the frequentist
framework), but rather in the probability that the parameters have certain
values given that the observed data indeed came from that population
(Gelman et al., 2004).

As a result of this philosophy, the major difference from the bootstrap is
that the PPC is based on the posterior distribution $P(\psi \mid y)$ of the
unknown parameters (rather than on a point estimate like $\hat{\psi}$) and
on the predictive distribution $P(y^{rep} \mid \psi)$ for the replicated
data. In its general form, the probability in Equation 9 is taken over the
joint distribution $P(\psi, y^{rep} \mid y)$, so that

$$p_p = \int\!\!\int I\big(T^l_{rep} \geq T_{obs}\big)\,
P(y^{rep} \mid \psi)\, P(\psi \mid y)\, dy^{rep}\, d\psi, \qquad (10)$$

where the indicator function $I$ equals 1 if $T^l_{rep} \geq T_{obs}$ and 0
otherwise (Gelman et al., 2004). Appendix B shows how the posterior and PP
distributions are obtained.
In practice, the PP distribution $P(y^{rep} \mid \psi)$ is usually
estimated through simulations, and the $p_p$-value is then estimated based
on these draws. In principle the PPC proceeds as follows:

Step 1. Assume that the model is true.

Step 2. Draw $L$ samples from the PP distribution to obtain $\psi^l$ and
$y^{rep,l}$, $l = 1, \ldots, L$.

Step 3. Estimate the LC model under $H_0$ on each dataset $y^{rep,l}$ and
calculate the statistic $T^l_{rep}$.

So $T^l_{rep}$ is obtained by estimating the model under $H_0$ using the EM
algorithm. For each replication the ML estimates $\hat{\psi}^l$ are used to
calculate $T^l_{rep}$, and the proportion

$$L^{-1}\sum_{l=1}^{L} I\big(T^l_{rep} \geq T_{obs}\big),$$

where the indicator function $I$ equals 1 if $T^l_{rep} \geq T_{obs}$ and 0
otherwise, is taken as the estimate of $p_p$.
In more complex models (like the LC model), however, it may not be possible
to obtain the PP distribution in Step 2 analytically. The solution involves
splitting up Step 2 and using an iterative sampling procedure:

Step 2a. Draw a sample from the posterior distribution
$\psi^l \sim P(\psi \mid y)$.

Step 2b. Generate a replicate dataset
$y^{rep,l} \sim P(y^{rep} \mid \psi^l)$.

Step 2c. Repeat Steps 2a and 2b to obtain $L$ replicated datasets.

But, as shown in Appendix B, the posterior distribution for the LC model
again does not have a convenient form to sample from directly. Fortunately
the Gibbs sampler, as discussed in Section 2.2, can be used to obtain the
required posterior draws $\psi^l$ (Rubin & Stern, 1994). At convergence,
the draws in a Gibbs sampler iteration are samples from the posterior
$P(\psi \mid y)$, so the $L$ iterations together approximate the posterior
distribution. Performing Step 2b then yields draws from the predictive
distribution. The joint draws from the posterior distribution and the
predictive distribution can together be seen as a single draw from the PP
distribution. A sketch of the whole procedure is given below.
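The sketch uses the same assumed helpers as the bootstrap sketch, plus a hypothetical gibbs_draw() returning one (thinned) posterior draw of ψ; only the resampling logic is shown.

```r
# PPC sketch (Steps 2a-2c and 3). gibbs_draw(), simulate_lc() and
# fit_stat() are assumed helpers, not the thesis code.
ppc_p <- function(T_obs, y, N, L = 100) {
  T_rep <- replicate(L, {
    psi_l <- gibbs_draw(y)             # Step 2a: psi^l ~ P(psi | y)
    y_rep <- simulate_lc(psi_l, N)     # Step 2b: y_rep ~ P(y_rep | psi^l)
    fit_stat(y_rep)                    # Step 3: refit under H0, compute T
  })
  mean(T_rep >= T_obs)                 # proportion estimating p_p
}
```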
Figure 1 in Appendix C is a graphical representation of the PPC. The upper
plot is a trace plot and depicts the values of the $T_{rep} = X^2_{rep}$
statistic during the $L = 500$ replications for the empirical example
described in Section 5, where $N = 94$ and $C = 2$. If the plot shows any
long-term trends, this is an indication that successive draws are highly
correlated and that the method has not converged. The values should move
freely around the value space, without getting stuck in a local region
(King et al., 2011). The bottom plot shows a smoothed density of the
replicated values. The dashed line indicates the observed value
$X^2_{obs} = 4.223$, and the proportion of values beyond that line (.554)
is the estimate for $p_p$.
PPC using discrepancy variables
The formulation of the PP p-value has been extended by Gelman et al. (2004)
by using, instead of a statistic $T$, a discrepancy variable $D(\psi)$,
which depends on the data as well as the parameters. For each draw from the
posterior, $D_{obs}(\psi^l)$ is calculated as the discrepancy between the
observed data and $\psi^l$, and $D_{rep}(\psi^l)$ as the discrepancy
between the replicated data $y^{rep,l}$ and $\psi^l$.

The p-value for the discrepancy measure is given by:

$$p_d = \Pr\big[D_{rep}(\psi) \geq D_{obs}(\psi) \mid y, H_0\big].$$

Goodness-of-fit measures like the $L^2$ can be used as discrepancy
variables because the predicted pattern frequencies are functions of the
parameters in $\psi$. For instance, the expected frequencies for the $L^2$
are calculated as $e^l_s = P(Y_s \mid \psi^l)N$. The discrepancy p-value is
estimated by taking the $L$ sampled draws, computing the predicted pattern
frequencies $e^l_s$ directly from $\psi^l$, and computing
$D_{obs}(\psi^l)$ and $D_{rep}(\psi^l)$ based on these predicted
frequencies. In this way one obtains $L$ 'observed' discrepancies
$D_{obs}(\psi^l)$ and $L$ replicated discrepancies $D_{rep}(\psi^l)$. The
$p_d$ is estimated by

$$L^{-1}\sum_{l=1}^{L} I\big(D_{rep}(\psi^l) \geq D_{obs}(\psi^l)\big),$$

as sketched below.
The PPC using discrepancy variables was used in LC analysis by Berkhof,
van Mechelen and Gelman (2003) and Meulders et al. (2002), who indicate
that this procedure tends to be conservative. Conservativeness, however, is
not the only issue with the pd-value. Hjort, Dahl & Steinbakk (2006) showed
that the distribution of pd under H0 is far from uniform and have indicated
that its values need to be adjusted in order to make results interpretable.
Hjort et al. investigated the behavior of pd in a number of models, but
not the LC model. In order to test the appropriateness of the method it is
important to investigate the behavior of pd in the current setting as well, and
the method is therefore included in this study.
4 Simulation study

To compare the methods described above, the behavior of the p-values in
different situations needs to be assessed. In situations where $H_0$ is
true, the p-values for the fit statistics described in Section 2.3 should
be uniformly distributed (Sackrowitz & Samuel-Cahn, 1999). Deviations from
uniformity could indicate that the reference distribution or method used is
incorrect. The uniformity of the p-values will therefore be used to assess
the applicability of the methods in different situations.
To investigate the behavior of the proposed p-values I generated data for
$J = 6$ dichotomous items ($R_j = 2$ for all $j$). The population class
sizes and conditional response probabilities used throughout the
simulations can be found in Table 1.

Table 1: Population values for the simulation studies

               c = 1   c = 2
  pi_c          0.5     0.5
  pi_{j1c}      0.8     0.2
  pi_{j2c}      0.2     0.8

To test the behavior of the p-values under $H_0$ in large samples I
generated 500 datasets with $N = 1000$. In large samples the p-values ought
to behave approximately equivalently. Since one of the reasons for using
resampling techniques is their use in small samples and sparse tables, I
generated the same number of datasets with $N = 100$. On all datasets a
2-class LC model was fitted using the EM algorithm. At convergence the
asymptotic p-values were calculated for the $L^2$ and $X^2$ based on the
$\chi^2_{50}$ distribution (by Equation 7, $2^6 - 2[1 + 6] = 50$) and for
the $BVR_{12}$ using the $\chi^2_1$ distribution. To obtain the
$p_b$-value, the bootstrap with $B = 100$ was performed, and similarly the
$p_p$ and $p_d$ were calculated based on $L = 100$ PP samples. In total,
the LC model had to be fitted to 200,000 additional datasets.
To test the behavior of the p-values under a misspecified model and to
perform a power test, again 500 datasets with $N = 1000$ and 500 datasets
with $N = 100$ were generated from a 2-class population, but each of these
datasets was analyzed using a 1-class LC model. I then calculated the
$p_a$-values (with $df = 57$ for the $L^2$ and $X^2$) and obtained the
$p_b$, $p_p$ and $p_d$-values based on $B = L = 100$.
To check whether the p-values are uniformly distributed under $H_0$, I
performed two numerical checks and a graphical check to substantiate the
findings. If a p-value is uniformly distributed, its expected value is
$E(p) = .5$ and $P(p < .05) = .05$ (i.e., in 5% of the cases the p-value is
less than .05). I use the conventional significance level of .05 (Fisher,
1925) as the upper limit for rejecting the null-hypothesis. If there are
considerable deviations from these indicators of uniformity, the method
used might be inappropriate or incorrectly specified. The graphical checks
are shown as the distributions of the p-values, smoothed using splines to
approximate the log-densities (see Stone et al., 1997). These graphical
checks can be used directly to spot deviations from uniformity anywhere in
the distribution. Please note that sharp increases in density at the very
boundaries (at approximately < .02 and > .98) are due to the estimation
procedure rather than indicating practically problematic behavior of the
p-value.
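As a minimal illustration, the two numerical checks amount to a single line of R (p being a vector of simulated p-values; the name is illustrative):

```r
# Uniformity checks: under H0 a uniform p-value has mean .5 and is
# below .05 in 5% of the replications.
check_uniformity <- function(p) c(E_p = mean(p), Pr_lt_05 = mean(p < .05))
```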
Results
Figure 2 in Appendix C and Table 2 provide the results for the p-values
under $H_0$ with sample size $N = 1000$. Figure 3 in Appendix C and Table 3
provide the results under $H_0$ for $N = 100$. The densities of the
$p_a$-values are depicted as solid lines, the $p_b$-values as dashed lines,
the $p_p$-values as dash-dotted lines, and the $p_d$-values as dotted
lines. Also included is a line indicating a truly uniform distribution as a
reference. The tables summarize the figures and include the two checks for
uniformity: the expected p-values $E(p)$ and $\Pr(p < .05)$ for the
different goodness-of-fit statistics. Not only can these proportions be
used as an indication of systematic deviations from uniformity, they may
also be helpful if only Type-I error rates (false rejections of the
null-hypothesis) are the issue of concern.
The results show that with a sample size of $N = 1000$, under $H_0$, the
chi-squared reference distribution used for the $p_a$-values is not an
exact reference for the $L^2$ statistic. Using the $\chi^2_{50}$
distribution resulted in too liberal results, since the Type-I error rate
was .094 (almost twice as high as expected under $H_0$). Also, the expected
value is much lower than .5. From Figure 2 it is clear that the density
becomes larger as $p_a$ comes closer to 0, indicating too many small
p-values. Although this may be due to sampling fluctuations given the
limited number of simulations, it is worth mentioning that within the same
analyses the $p_a$-value for the $X^2$ statistic shows this behavior much
less. To illustrate, there were 81 analyses in which the $p_a$-value for
the $L^2$ was less than .10 (where there should have been only 50). In
those analyses the $X^2$ had p-values less than .10 in only 57 cases.
Inspection of the $p_a$-values for $BVR_{12}$ clearly indicates that the
BVR does not follow a $\chi^2_1$ distribution: the density of the p-values
increases in a roughly linear fashion as the values of $p_a$ increase.
Conversely, from Table 2 it can be seen that in the large sample case the
$p_b$ and $p_p$-values only seem to be somewhat too liberal, having
slightly too many values smaller than .05. Other than that, these p-values
approximate the uniform distribution very well. In the current setting,
with large sample size, $p_b$ and $p_p$ clearly outperform the asymptotic
p-value for both the $L^2$ and the BVR, but this is perhaps more likely due
to the specification of the asymptotic reference distribution than to the
quality of the methods in the large sample case, since the methods behave
very similarly for the $X^2$ statistic.

Table 2: Uniformity measures of p-values

                         E(p)                       Pr(p < .05)
             p_a     p_b     p_p     p_d      p_a    p_b    p_p    p_d
  L^2       .4388   .4945   .4946   .8449    .094   .062   .064   .002
  X^2       .4917   .4918   .4923   .8496    .060   .068   .064   .000
  BVR_12    .6706   .5065   .5072   .7667    .000   .046   .046   .000

N = 1000, MC simulations = 500, bootstrap/PPC replications = 100
As expected, the most 'problematic' results came from the PPC using
discrepancy variables, which is clearly not adequate for testing model fit
using any of the goodness-of-fit statistics. In line with the findings of
Hjort et al. (2006), the $p_d$ is distributed far from uniformly in the LC
goodness-of-fit setting. Figure 2 shows that for the $L^2$ and $X^2$ the
density increases as $p_d$ gets larger and peaks at 1. For the BVR
statistic it peaks at around .78, with a range of [0.54, 0.93]. In only 1
dataset (the value .002 in Table 2) was a $p_d$-value found that was less
than .05.
From Table 3 it can be seen that in the sparser datasets the expected
values of $p_b$ and $p_p$ are somewhat higher than that of $p_a$ for the
$L^2$ statistic (perhaps still due to the asymptotic reference
distribution), about equal for the $X^2$, and lower for the BVR (although
this comparison is rather trivial, since the reference distribution for the
BVR was clearly inadequate). Also in sparser tables the $p_d$ has much
higher values than the other measures, except for $p_a$ of the BVR (again
probably due to the incorrect reference). All methods tend to be
conservative in that too few p-values were less than .05, even where the
expected values are lower than .5. From Figure 3 it can be seen that the
distribution of the $p_a$-value under $H_0$ with a small sample size is far
from uniform for the $L^2$ statistic. Interestingly, this behavior is
mimicked by the $p_b$ and $p_p$. Although the behavior is similar, the
$p_b$ and $p_p$ are distributed more flatly for all statistics, with the
bootstrap method resulting in the least peaked distribution.

Table 3: Uniformity measures of p-values

                         E(p)                       Pr(p < .05)
             p_a     p_b     p_p     p_d      p_a    p_b    p_p    p_d
  L^2       .4019   .4354   .4352   .8854    .016   .040   .034   .000
  X^2       .5224   .5200   .5114   .8535    .028   .024   .018   .000
  BVR_12    .6758   .5088   .5136   .7607    .004   .040   .038   .000

N = 100, MC simulations = 500, bootstrap/PPC replications = 100
Finally, in analyzing the 500 datasets of $N = 1000$ from a 2-class
population with a 1-class model, the probability of correctly rejecting the
null-hypothesis (i.e., the power) was 1 for all of the statistics. That is,
all $p_a$-values were less than $10^{-19}$ for the BVR, less than
$10^{-161}$ for the $L^2$, and less than $10^{-291}$ for the $X^2$
statistic. All other p-values were always equal to 0. In the 500 smaller
samples, all p-values resulted in a power of 1 for the $L^2$ and $X^2$.
Although the power for the BVR was 1 in the previous simulation, the BVR is
not a very good measure of model misfit when analyzed on its own, as it is
based only on the relationship between two items. That is to say, if one
BVR does not yield a small p-value, this does not indicate that the whole
model fits well. This aspect is captured by the p-values in the small
sample case. The expected and maximum p-values, as well as the power
(indicated as $P(p < .05)$, the probability of a value less than .05), for
all methods are provided in Table 4. Here too, the $p_d$ provides very
inadequate results if its values are not post-processed (see Hjort et al.,
2006).

Table 4: Power results for the BVR

                 p_a     p_b     p_p     p_d
  E(p)          .001    .010    .009    .146
  P(p < .05)    .964    .944    .952    .284
  max(p)        .565    .60     .55     .84
5 Empirical example
To illustrate the usage of the proposed methods I have analyzed data
obtained by Galen and Gambino (1975, in Rindskopf, 2002) in a study of 94
patients who suffered chest pains and were admitted to an emergency room.
Four indicators of myocardial infarction (MI) were scored either 1
(present) or 0 (not present): the patients' heart-rhythm Q-waves (Q), high
low-density blood cholesterol levels (L), creatine phosphokinase levels
(C), and their clinical history (H). The response patterns and their
observed frequencies can be found in Table 5. Rindskopf indicated that the
data are consistent with a 2-class LC model, with $df = 6$, $L^2 = 4.29$
and $p_a = .64$.
To obtain the 4 p-values for each statistic, I used the $\chi^2_6$
reference distribution for the $L^2$ and $X^2$, and set $B = L = 500$ to
obtain the resampling p-values. Because the data are quite sparse, given
the results from the simulation study with $N = 100$, I expected to find
that the $p_b$ and $p_p$ would be higher than $p_a$ for the $L^2$
statistic, about equal for the $X^2$, and lower for the BVR (due to the
unknown reference distribution for the BVR). I also expected $p_d$ to be
much higher than the other p-values, though less markedly so than $p_a$ for
the BVR.
Table 5: Response pattern frequencies
Q L C H count Q L C H count
0 0 0 0 33 1 0 0 0 0
0 0 0 1 7 1 0 0 1 0
0 0 1 0 7 1 0 1 0 2
0 0 1 1 5 1 0 1 1 3
0 1 0 0 1 1 1 0 0 0
0 1 0 1 0 1 1 0 1 0
0 1 1 0 3 1 1 1 0 4
0 1 1 1 5 1 1 1 1 24
Table 6 provides the conditional response probabilities and class sizes
resulting from fitting the 2-class LC model to the data (they are identical
to those reported by Rindskopf, 2002). The first class (likely to have had
MI) had high conditional probabilities for all indicators; the other class
had low conditional probabilities.

Table 6: ML parameter estimates of psi for the MI data using a 2-class model

             MI       no MI
  pi_c      0.4578    0.5422
  Q 0       0.2332    1.0000
  Q 1       0.7668    0.0000
  L 0       0.1721    0.9731
  L 1       0.8279    0.0269
  C 0       0.0000    0.8045
  C 1       1.0000    0.1955
  H 0       0.2086    0.8049
  H 1       0.7914    0.1951

In Table 7 the estimated p-values from all methods are shown for the
2-class model and the three statistics. As none of the p-values is small,
all methods indicate that the 2-class model fits the data well. Against
expectation, the bootstrap resulted in much smaller p-values than the other
methods for the $L^2$ and $X^2$. Although no p-value indicated lack of fit,
there are large differences in the actual values of the p-values.

Table 7: Results for the empirical example

                             p_a    p_b    p_p    p_d
  L^2    = 4.292611         .637   .358   .606   .874
  X^2    = 4.22263          .647   .306   .554   .892
  BVR_12 = 0.1545949        .694   .230   .182   .652

df = 6, N = 94, B = L = 500
6 Discussion
In this thesis I compared different p-values in goodness-of-fit testing of
LC models. The classical asymptotic p-value was compared to the p-values
obtained by means of the parametric bootstrap and PPCs in large and small
samples. The methods were discussed and the differences illustrated. Two
problems that occur in using asymptotic p-values were discussed: firstly,
that they cannot be trusted in small samples, and secondly, that they are
not useful when it is unknown what distribution a statistic follows.
The results suggested that the $\chi^2_{df}$ may not be a valid reference
for the $L^2$ statistic in LC analysis, since it produced too liberal
results in large samples under $H_0$. The BVR has also been shown to
clearly not follow a $\chi^2_1$ distribution. The $p_b$ and $p_p$ showed
much better behavior than the asymptotic p-value for both the $L^2$ and the
BVR, although this might have been due to the asymptotic reference
distribution used, since the methods were comparable for the $X^2$, for
which $p_a$ also showed good behavior.
Whether the bootstrap or the PPC is the better method for approximating a
p-value in the current setting is not clear-cut. The data for $N = 100$
were not extremely sparse, since the number of patterns with observed
frequencies of 0 or 1 was not very large. But especially the $L^2$
statistic showed very surprising behavior and needs to be investigated
further.
More research should be done to investigate the distributions of the $L^2$
and BVR statistics, which can be done by looking at the actual values of
the statistics rather than at the p-values under a reference distribution.
Additionally, analysis of the empirical example showed that the p-values
can differ from each other quite severely within one dataset, even though
the expected values did not differ much. To find out more about the
differences between the p-values within datasets, a comparison of the
p-values within each simulation could provide better insight into the
characteristics of the data responsible for these differences. This may
result in a clearer understanding of when each of the methods can be used
optimally.
Since the current research has focused on (overall) goodness-of-fit
statistics, an option for future research is a similar study investigating
the applicability of resampling techniques to issues regarding LC model
selection and comparison. For instance, the PPC could provide a p-value for
the increase in fit when adding LCs or when including local dependencies.
This said, I have only considered rather simple LC models, and future
research on this topic should include, for example, models with more LCs,
local dependencies, or models which include covariates.
Note on computational time

Because for each dataset $B = L = 100$ bootstraps and PPCs are performed to
estimate $p_b$, $p_p$ and $p_d$, a total of 400,000 replicated datasets had
to be generated and analyzed using the EM algorithm, which can become
rather time consuming. For instance, the analysis for $N = 100$ with 2 LCs
took over 20 hours to complete on a 32-bit, 2.61 GHz, 3.43 GB RAM computer
using the software package R (CRAN, 2012).

However, the individual analyses themselves do not take very long (a couple
of minutes per run). The assessment of the empirical data using all
techniques took only about 3 minutes with 500 bootstrap/PPC replications,
indicating the practical usefulness of the methods for obtaining p-values.
Of course the empirical dataset was not very large, but researchers should
not be deterred from using these techniques in empirical research. The
software and hardware used (and the efficiency of the programming) can
greatly diminish the time needed to analyze a problem and, moreover, even
waiting a day to get reliable research results should be considered
worthwhile.
References

Bera, A. K. & Bilias, Y. (2001). Rao's score, Neyman's C(α) and Silvey's
LM tests: An essay on historical developments and some new results.
Journal of Statistical Planning and Inference, 97, 9–44.

Berkhof, J., Van Mechelen, I., & Gelman, A. (2003). A Bayesian approach to
the selection and testing of mixture models. Statistica Sinica, 13,
423–442.

Brooks, S. P. & Gelman, A. (1998). General methods for monitoring
convergence of iterative simulations. Journal of Computational and
Graphical Statistics, 7(4), 434–455.

Fisher, R. A. (1925). Statistical methods for research workers (chapter 3).
Retrieved May 2, 2012, from http://psychclassics.yorku.ca/Fisher/Methods/

Formann, A. K. (2003). Latent class model diagnosis: a review and some
proposals. Computational Statistics & Data Analysis, 41, 548–559.

Galindo-Garre, F., & Vermunt, J. K. (2005). Testing log-linear models with
inequality constraints: a comparison of asymptotic, bootstrap, and
posterior predictive p values. Statistica Neerlandica, 59, 82–94.

Garrett, E. S., & Zeger, S. L. (2000). Latent class model diagnosis.
Biometrics, 56, 1055–1067.

Gelman, A., Carlin, J., Stern, H. & Rubin, D. (2004). Bayesian Data
Analysis (2nd ed.). Boca Raton, FL: Chapman & Hall.

Goodman, L. A. (1974). Exploratory latent structure analysis using both
identifiable and unidentifiable models. Biometrika, 61, 215–231.

Hjort, N. L., Dahl, F. A. & Steinbakk, G. H. (2006). Post-processing
posterior predictive p values. Journal of the American Statistical
Association, 101(475), 1157–1174.

Hogg, R. V. & Tanis, E. A. (2010). Probability and Statistical Inference
(8th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Hoijtink, H. (1998). Constrained latent class analysis using the Gibbs
sampler and posterior predictive p-values: applications to educational
testing. Statistica Sinica, 8, 691–711.

King, M. D., Calamante, F., Clark, C. A. & Gadian, D. G. (2011). Markov
chain Monte Carlo random effects modeling in magnetic resonance image
processing using the BRugs interface to WinBUGS. Journal of Statistical
Software, 44(2). Available from http://www.jstatsoft.org/v44/i02

Langeheine, R., Pannekoek, J. & Van de Pol, F. (1996). Bootstrapping
goodness-of-fit measures in categorical data analysis. Sociological
Methods & Research, 24, 492–516.

Ligtvoet, R. & Vermunt, J. K. (2012). Latent class models for testing
monotonicity and invariant item ordering for polytomous items. British
Journal of Mathematical and Statistical Psychology, 65(2), 237–250.

Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan
(Ed.), The Sage Handbook of Quantitative Methodology for the Social
Sciences (pp. 175–198). Thousand Oaks, CA: Sage Publications, Inc.

Maydeu-Olivares, A. & Joe, H. (2006). Limited goodness-of-fit testing in
multidimensional contingency tables. Psychometrika, 71, 713–732.

Meng, X.-L. (1994). Posterior predictive p-values. The Annals of
Statistics, 22(3), 1142–1160.

Meulders, M., De Boeck, P., Kuppens, P. & Van Mechelen, I. (2002).
Constrained latent class analysis of three-way three-mode data. Journal of
Classification, 19, 277–302.

Nylund, K. L., Asparouhov, T. & Muthén, B. O. (2007). Deciding on the
number of classes in latent class analysis and growth mixture modeling: A
Monte Carlo simulation study. Structural Equation Modeling: A
Multidisciplinary Journal, 14(4), 535–569.

Reiser, M., & Lin, Y. (1999). A goodness-of-fit test for the latent class
model when expected frequencies are small. In M. Sobel & M. Becker (Eds.),
Sociological Methodology (pp. 81–111). Boston: Blackwell Publishers.

Rindskopf, D. (2002). The use of latent class analysis in medical
diagnosis. Proceedings of the Joint Meetings of the American Statistical
Association, 2912–2916.

Rubin, D. B., & Stern, H. S. (1994). Testing in latent class models using a
posterior predictive check distribution. In A. von Eye & C. C. Clogg
(Eds.), Latent variables analysis: Applications for developmental research
(pp. 420–438). Thousand Oaks, CA: Sage Publications, Inc.

Sackrowitz, H. & Samuel-Cahn, E. (1999). P values as random variables:
expected p values. The American Statistician, 53(4), 326–331.

Sterne, J. A. C. & Smith, G. D. (2001). Sifting the evidence: what's wrong
with significance tests? BMJ, 322, 226–231.

Stone, C. J., Hansen, M., Kooperberg, C. & Truong, Y. K. (1997). The use of
polynomial splines and their tensor products in extended linear modeling
(with discussion). Annals of Statistics, 25, 1371–1470.

Tanner, M. A. & Wong, W. H. (1987). The calculation of posterior
distributions by data augmentation. Journal of the American Statistical
Association, 82(398), 528–540.

Vermunt, J. K. (2010). Latent class models. In P. Peterson, E. Baker, & B.
McGaw (Eds.), International Encyclopedia of Education (pp. 238–244).
Oxford: Elsevier.

Vermunt, J. K., & Magidson, J. (2005). Technical Guide for Latent GOLD 4.0:
Basic and Advanced. Belmont, MA: Statistical Innovations Inc.
A EM algorithm

Because the LC membership is unobservable, the (logarithm of the)
likelihood is hard to maximize directly: the summation within the log makes
separation of the product terms unviable. It is possible, however, to use a
sequential algorithm if we provide starting values for the missing data
(i.e., the unobserved class membership).
Combining Equations 1-3 to obtain the likelihood gives:

$$P(Y_s) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R_j}
\pi_{jrc}^{\,y^*_{sjr}} \qquad (11)$$

and taking the log gives the log-likelihood:

$$\log P(Y_s) = \log \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R_j}
\pi_{jrc}^{\,y^*_{sjr}}. \qquad (12)$$

With class membership unobservable, this expression is unsolvable. However,
if we impute values for the missing class membership (also called data
augmentation, e.g., Ligtvoet & Vermunt, 2012), the expression can be
written as:

$$\log P(Y_s) = n_s \sum_{c=1}^{C} \pi_{c|s} \log \pi_c \prod_{j=1}^{J}
\prod_{r=1}^{R_j} \pi_{jrc}^{\,y^*_{sjr}}.$$

Now, the EM algorithm consists of sequentially updating $\pi_{c|s}$
(providing $\pi_c$) and $\pi_{jrc}$ to maximize

$$\log L = \sum_{s=1}^{S} \log P(Y_s).$$
The algorithm continues until the change in the log-likelihood between
iterations $t$ and $t + 1$ is smaller than a given convergence criterion.
The values for which this log-likelihood is maximized are the ML estimates.

When using the EM algorithm it can, however, occur that convergence is
attained at a local maximum. To control for this, multiple sets of starting
values are often used, and the values for $\psi$ resulting in the highest
log-likelihood are taken as the ML estimates.
0 log(0) = 0 convention

In order to let only observed patterns contribute to the likelihood, I used
the convention that $0 \log(0) = 0$. This is needed because $\log(0)$ is
undefined, so multiplying $\log(0)$ by 0 does not technically result in 0.
The justification of the convention is as follows.

If I define the natural logarithm as $\log(x) = \int_1^x \frac{1}{t}\, dt$
and need a reasonable value for $0 \log(0)$, I should take the limit as $x$
approaches 0. Using l'Hôpital's rule one can show that, although $\log(0)$
is undefined, the limit of $x \log(x)$ as $x$ approaches zero is:

$$\lim_{x \to 0} x \log(x) = \lim_{x \to 0} \frac{\log(x)}{x^{-1}}
= \lim_{x \to 0} \frac{x^{-1}}{-x^{-2}} = \lim_{x \to 0} (-x) = 0.$$
B The Gibbs sampler (in LC analysis)

The Gibbs sampler can be used to estimate the LC model, as described in
Section 2.2, but also to perform the PPC (see Section 3.3) as a means of
testing model fit. The Bayesian model-fit approach compares the
goodness-of-fit statistic $T_{obs}$ to a reference distribution which is
obtained by averaging the distribution $P(T \mid \psi)$ over the posterior
$P(\psi \mid y)$. When the posterior distribution cannot (or only
tediously) be calculated analytically, one can use simulations to estimate
it. Here I show in detail how to obtain the posterior and predictive
distributions for $\psi$ and $y^{rep}$ and perform the PPC.
The method goes as follows:

Step 1. Assume that the model is true.

Step 2a. Draw a sample from the posterior distribution
$\psi^l \sim P(\psi \mid y)$.

Step 2b. Generate a replicate dataset
$y^{rep,l} \sim P(y^{rep} \mid \psi^l)$.

Step 2c. Repeat Steps 2a and 2b to obtain $L$ draws from the posterior
predictive distribution.

Step 3. Estimate the LC model under $H_0$ on each dataset and calculate the
statistic $T^l_{rep}$.
Drawing the samples in Step 2 had to be split into three parts because it
involves the posterior distribution of the parameters in $\psi$, from which
it is not straightforward to draw samples of the LC model parameters. The
following text discusses how to specify the posterior distribution and how
to obtain samples from it using the Gibbs sampler; note that this applies
to the Gibbs sampler both in the estimation process and in the PPC. The
posterior distribution of $\psi$ can be obtained using Bayes' rule:

$$P(\psi \mid y) = \frac{P(y \mid \psi)\, P(\psi)}{P(y)} \qquad (13)$$

$$\propto P(y \mid \psi)\, P(\psi). \qquad (14)$$
The term P(y) is called the marginal likelihood or normalizing constant.
To draw samples from the posterior we can simply use Equation 14 because
the shape of the distribution is not influenced by multiplying/dividing by a
constant. However, as can be seen, one does need a prior distribution P(ψ)
for the parameters in ψ, which can be used to include prior knowledge (or
lack thereof) about the parameters of interest.
For each set of multinomial parameters (e.g., πjrc, r = 1, . . . , Rj) I have
used a Dirichlet prior distribution. For dichotomous variables (Rj = 2 for all
j), I could equivalently have used Beta distributions (Gelman et al., 2004),
but for the sake of generality, I show the use of the Dirichlet distribution here.
For example, the prior distribution of the conditional response
probabilities of a person in LC $c = 1, \ldots, C$ on item
$j = 1, \ldots, J$ is given by:

$$P(\pi_{jrc},\ r = 1, \ldots, R_j) =
\frac{\Gamma\big(\sum_{q=1}^{R_j} \alpha_{jqc}\big)}
{\prod_{q=1}^{R_j} \Gamma(\alpha_{jqc})}
\prod_{r=1}^{R_j} \pi_{jrc}^{\,\alpha_{jrc}-1} \qquad (15)$$

$$\propto \prod_{r=1}^{R_j} \pi_{jrc}^{\,\alpha_{jrc}-1}. \qquad (16)$$
It is commonplace to ignore the constant, only indicate the parts of the
distribution which involve the parameters (here, $\pi_{jrc}$), and use the
proportionality property. The prior distribution for the class sizes is
given by:

$$P(\pi_c,\ c = 1, \ldots, C) \propto \prod_{c=1}^{C}
\pi_c^{\,\alpha_c - 1}. \qquad (17)$$
The values of the hyperparameters $\alpha_{jrc}$ indicate, in an absolute
sense, the strength of one's prior belief about the probability of giving
response $r$ to item $j$ in class $c$, and the relative sizes of the
hyperparameters indicate the relative probabilities of the responses
(Rubin & Stern, 1994). $\alpha_c$ is used likewise for the class sizes. To
indicate no prior knowledge about the items or LC sizes, I only use vague
(diffuse) priors in the analysis, where
$\sum_c \alpha_c = \sum_r \alpha_{jrc} = 1$ (see Section 2.2).

The prior distribution of the entire set $\psi$ is the product of the
priors on the elements in it:

$$p(\psi) = \prod_{c=1}^{C} \pi_c^{\,\alpha_c - 1}
\prod_{r=1}^{R_1} \pi_{1rc}^{\,\alpha_{1rc}-1} \times \cdots \times
\prod_{r=1}^{R_J} \pi_{Jrc}^{\,\alpha_{Jrc}-1} \qquad (18)$$
and the posterior is then obtained by combining this prior distribution
with the likelihood (Equation 11) of the LC model (Rubin & Stern, 1994):

$$P(\psi \mid y) \propto \prod_{s=1}^{S}
\Big[\sum_{c=1}^{C} \pi_c\, P(Y_s \mid \theta = c)\Big]^{n_s}
P(\psi). \qquad (19)$$
As indicated earlier, this posterior distribution does not have a
convenient form to sample from. But, as it turns out, augmenting the data
with estimates of the unobserved LC memberships can make the model
estimable. As shown in Section 2.2, the Gibbs sampler can be used to
estimate the LC model in an iterative fashion, but it requires that
unobserved indicators for the LC memberships are used to augment the data.
In this way it is possible to obtain conditional distributions of the
parameters given the LC membership (Tanner & Wong, 1987). To illustrate,
let $Z_{sic} = 1$ if the $i$th observation in the $s$th cell of the
contingency table ($i = 1, \ldots, n_s$, $s = 1, \ldots, S$) belongs to LC
$c$, and 0 otherwise. Then the joint distribution is

$$P(\psi, Z, y) \propto \prod_{s=1}^{S} \prod_{i=1}^{n_s} \prod_{c=1}^{C}
\big[\pi_c\, P(Y_s \mid \theta = c)\big]^{Z_{sic}}\; P(\psi). \qquad (20)$$
The distribution of $\psi$ conditional on $Z$ and $y$ is given by a product
of independent Dirichlet distributions with hyperparameters
$\alpha_{jrc} + n_{jrc}$ and $\alpha_c + m_c$. The conditional probability
$P(Z \mid \psi, y)$ is given by the Bernoulli distribution. Using Bayes'
rule, the probability that $Z_{sic} = 1$ is obtained using Equation 4:

$$P(Z_{sic} = 1 \mid \psi, y) =
\frac{P(Y_s \mid \theta = c)\, \pi_c}{P(Y_s)}. \qquad (21)$$

These conditional distributions are easy to sample from (see Section 2.2).
The Gibbs sampler described in this thesis does this iteratively and, at
convergence, the sampled values for $Z$ and $\psi$ are draws from the joint
posterior distribution $P(Z, \psi \mid y)$ (Rubin & Stern, 1994; Tanner &
Wong, 1987). To avoid correlations between the samples, one is advised not
to use subsequent draws but, for instance, to retain only every 50th draw
or so.
To obtain the replicate data $y^{rep,l}$ in Step 2b as a draw from the
predictive distribution $P(y^{rep} \mid \psi^l)$, we just need to draw $N$
observations from a multinomial distribution with the probabilities
$P(Y_s)$ computed from $\psi^l$, as in the sketch below.
C Figures
[Figure 1: two panels. Top: trace plot of the replicated X^2 values over
the 500 iterations (x-axis: Iteration; y-axis: T_rep(X^2)). Bottom:
smoothed density of the replicated X^2 values (x-axis: replicated X^2
values; y-axis: density).]

Figure 1: Example of trace and density plots for the PPC in the empirical
data. The dashed lines indicate $X^2_{obs} = 4.223$, $p_p = .554$.
[Figure 2: three panels (L^2, X^2, BVR) showing smoothed p-value densities
(x-axis: p-value; y-axis: p-value density) for the asymptotic, bootstrap,
PPC and discrepancy p-values, with a uniform reference line.]

Figure 2: P-value log-densities for the 2-class model with N = 1000.
[Figure 3: same layout as Figure 2.]

Figure 3: P-value log-densities for the 2-class model with N = 100.
Advanced Methods of Statistical Analysis used in Animal Breeding.Advanced Methods of Statistical Analysis used in Animal Breeding.
Advanced Methods of Statistical Analysis used in Animal Breeding.
 

Viewers also liked

Propuesta De Seo - Mktdig
Propuesta De Seo -  MktdigPropuesta De Seo -  Mktdig
Propuesta De Seo - Mktdigdupyval
 
Escuela Profesional De Administracion
Escuela Profesional De AdministracionEscuela Profesional De Administracion
Escuela Profesional De Administraciongueste59d4873
 
fotografos mas sobresalientes
fotografos mas sobresalientesfotografos mas sobresalientes
fotografos mas sobresalientesstefany nataly
 
metodod de estudio
metodod de estudiometodod de estudio
metodod de estudiojose pardo
 
¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...
¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...
¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...Dani Ortega
 
El Mundo
El MundoEl Mundo
El Mundounam
 
Taller casos espirometries
Taller casos espirometriesTaller casos espirometries
Taller casos espirometriesjiasab
 
Mantenimiento De Un Monitor
Mantenimiento De Un MonitorMantenimiento De Un Monitor
Mantenimiento De Un MonitorJPBR
 
Actividad 3
Actividad 3Actividad 3
Actividad 3grisel
 

Viewers also liked (20)

Propuesta De Seo - Mktdig
Propuesta De Seo -  MktdigPropuesta De Seo -  Mktdig
Propuesta De Seo - Mktdig
 
Noticiero
NoticieroNoticiero
Noticiero
 
Quest en Español - Primavera 2010 - Semana1
Quest en Español - Primavera 2010 - Semana1Quest en Español - Primavera 2010 - Semana1
Quest en Español - Primavera 2010 - Semana1
 
Escuela Profesional De Administracion
Escuela Profesional De AdministracionEscuela Profesional De Administracion
Escuela Profesional De Administracion
 
fotografos mas sobresalientes
fotografos mas sobresalientesfotografos mas sobresalientes
fotografos mas sobresalientes
 
metodod de estudio
metodod de estudiometodod de estudio
metodod de estudio
 
Examen Silvia
Examen SilviaExamen Silvia
Examen Silvia
 
¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...
¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...
¿Cómo crear una pestaña que permita a los fans de la marca acceder a nuestra ...
 
El Mundo
El MundoEl Mundo
El Mundo
 
La vega central
La vega centralLa vega central
La vega central
 
Nueva Universida Color De La Esperanza
Nueva Universida Color De La EsperanzaNueva Universida Color De La Esperanza
Nueva Universida Color De La Esperanza
 
Taller casos espirometries
Taller casos espirometriesTaller casos espirometries
Taller casos espirometries
 
Promedu Manejo De Imágenes 1 Gimp
Promedu Manejo De Imágenes 1 GimpPromedu Manejo De Imágenes 1 Gimp
Promedu Manejo De Imágenes 1 Gimp
 
O bárbaro
O bárbaroO bárbaro
O bárbaro
 
El comienzo
El comienzoEl comienzo
El comienzo
 
Esquiada Febrer 2011
Esquiada Febrer 2011Esquiada Febrer 2011
Esquiada Febrer 2011
 
Mantenimiento De Un Monitor
Mantenimiento De Un MonitorMantenimiento De Un Monitor
Mantenimiento De Un Monitor
 
Paolita 22
Paolita 22Paolita 22
Paolita 22
 
Actividad 3
Actividad 3Actividad 3
Actividad 3
 
Proyecto TIC
Proyecto  TICProyecto  TIC
Proyecto TIC
 

Similar to Geert van Kollenburg-masterthesis

chap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptchap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptShayanChowdary
 
1-s2.0-S0047259X16300689-main (1).pdf
1-s2.0-S0047259X16300689-main (1).pdf1-s2.0-S0047259X16300689-main (1).pdf
1-s2.0-S0047259X16300689-main (1).pdfshampy kamboj
 
Multinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfMultinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfAlemAyahu
 
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali TirmiziFinancial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali TirmiziDr. Muhammad Ali Tirmizi., Ph.D.
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1Gautam Kumar
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionDario Panada
 
journal in research
journal in research journal in research
journal in research rikaseorika
 
research journal
research journalresearch journal
research journalrikaseorika
 
published in the journal
published in the journalpublished in the journal
published in the journalrikaseorika
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)Jeff Lail
 
Assessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateAssessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateDaniel Koh
 
Assessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFAssessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFDaniel Koh
 
Probabilistic Error Bounds for Reduced Order Modeling M&C2015
Probabilistic Error Bounds for Reduced Order Modeling M&C2015Probabilistic Error Bounds for Reduced Order Modeling M&C2015
Probabilistic Error Bounds for Reduced Order Modeling M&C2015Mohammad
 
ProbErrorBoundROM_MC2015
ProbErrorBoundROM_MC2015ProbErrorBoundROM_MC2015
ProbErrorBoundROM_MC2015Mohammad Abdo
 
Prob and statistics models for outlier detection
Prob and statistics models for outlier detectionProb and statistics models for outlier detection
Prob and statistics models for outlier detectionTrilochan Panigrahi
 
Recommender system
Recommender systemRecommender system
Recommender systemBhumi Patel
 

Similar to Geert van Kollenburg-masterthesis (20)

chap4_Parametric_Methods.ppt
chap4_Parametric_Methods.pptchap4_Parametric_Methods.ppt
chap4_Parametric_Methods.ppt
 
1607.01152.pdf
1607.01152.pdf1607.01152.pdf
1607.01152.pdf
 
1-s2.0-S0047259X16300689-main (1).pdf
1-s2.0-S0047259X16300689-main (1).pdf1-s2.0-S0047259X16300689-main (1).pdf
1-s2.0-S0047259X16300689-main (1).pdf
 
Multinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfMultinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdf
 
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali TirmiziFinancial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
Financial Risk Mgt - Lec 11 by Dr. Syed Muhammad Ali Tirmizi
 
Data mining Part 1
Data mining Part 1Data mining Part 1
Data mining Part 1
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
 
CH3.pdf
CH3.pdfCH3.pdf
CH3.pdf
 
journals public
journals publicjournals public
journals public
 
journal in research
journal in research journal in research
journal in research
 
research journal
research journalresearch journal
research journal
 
published in the journal
published in the journalpublished in the journal
published in the journal
 
ProjectWriteupforClass (3)
ProjectWriteupforClass (3)ProjectWriteupforClass (3)
ProjectWriteupforClass (3)
 
Assessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generateAssessing relative importance using rsp scoring to generate
Assessing relative importance using rsp scoring to generate
 
Assessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIFAssessing Relative Importance using RSP Scoring to Generate VIF
Assessing Relative Importance using RSP Scoring to Generate VIF
 
3 es timation-of_parameters[1]
3 es timation-of_parameters[1]3 es timation-of_parameters[1]
3 es timation-of_parameters[1]
 
Probabilistic Error Bounds for Reduced Order Modeling M&C2015
Probabilistic Error Bounds for Reduced Order Modeling M&C2015Probabilistic Error Bounds for Reduced Order Modeling M&C2015
Probabilistic Error Bounds for Reduced Order Modeling M&C2015
 
ProbErrorBoundROM_MC2015
ProbErrorBoundROM_MC2015ProbErrorBoundROM_MC2015
ProbErrorBoundROM_MC2015
 
Prob and statistics models for outlier detection
Prob and statistics models for outlier detectionProb and statistics models for outlier detection
Prob and statistics models for outlier detection
 
Recommender system
Recommender systemRecommender system
Recommender system
 

Geert van Kollenburg-masterthesis

asymptotic distributions no longer hold and the associated p-values become untrustworthy (Maydeu-Olivares & Joe, 2006; Reiser & Lin, 1999; Vermunt, 2010). In the case of unknown, untrustworthy or incorrect distributions it is necessary to calculate empirical reference distributions. According to Formann (2003) this holds for overall goodness-of-fit tests, residuals and other statistics.

To determine empirical reference distributions, resampling techniques, like the parametric bootstrap by Collins et al. (1993; in Formann, 2003), have been proposed to solve the problem of untrustworthy asymptotic p-values and unknown distributions. If one assumes that the data contain information about the true values of the parameters of interest, it is possible to create a reference distribution to determine how likely an observation is given the estimated parameters. The parametric bootstrap, for instance, is implemented in the software package LatentGold (Vermunt & Magidson, 2005) and uses Monte Carlo simulations to approximate the empirical distribution of the goodness-of-fit statistics based on the maximum likelihood (ML) estimates obtained from the data.
Instead of relying on the ML estimates, several authors have proposed Bayesian methods to assess model fit in LC analysis (Berkhof, Van Mechelen & Gelman, 2003; Garrett & Zeger, 2000; Hoijtink, 1998). The Bayesian method for obtaining a p-value is the Posterior Predictive Check (PPC), which can be used in complex models where analytic solutions are tedious to obtain. This method uses random draws for the unknown parameters from the posterior predictive distribution to determine how likely an observed statistic is (Gelman, Carlin, Stern & Rubin, 2004).

The purpose of this thesis is to investigate the PPC as an alternative to asymptotic and bootstrap p-values in assessing the model fit of LC models. A comparison is also made between all methods to check whether they produce comparable results in large samples and whether the resampling techniques are more adequate than the asymptotic p-value in small samples. To investigate this, I use a number of commonly used fit statistics, and the long-run behavior of the resulting p-values from the different methods is compared in a Monte Carlo simulation study. This leads to a direct comparison of the asymptotic, bootstrap and PPC p-values under different conditions such as sample size. Importantly, it is assessed whether the different p-values are uniformly distributed under the null hypothesis,
and whether nominal Type-I error levels are correct for the given statistics. I do not intend to discuss the use of cut-off scores in significance testing, but rather apply the commonly used levels as a reference for the behavior of the statistics under the different methods.

The outline of this thesis is as follows. Section 2 describes the LC model, its estimation, and the fit statistics used in the study. Section 3 provides an overview of the methods used for obtaining p-values. Section 4 describes the simulation studies and gives the results. In Section 5 an empirical dataset is analyzed to illustrate the techniques that result in p-values. Finally, in Section 6 I discuss the findings and issues in need of further research.

2 Latent Class Analysis

2.1 Defining the LC model

In the multivariate setting, let an N × J matrix Y contain the responses of N units (i.e. individuals) on J discrete variables with R_j, j = 1, . . . , J, categories. Let Y_i = (Y_{i1}, . . . , Y_{iJ}) be row i, i = 1, . . . , N, of Y, containing the responses to the J variables. In total there are S = \prod_{j=1}^{J} R_j possible response patterns for Y_i. Therefore, let Y_s, s = 1, . . . , S, denote a specific pattern, and let n_s denote the observed count of that pattern. Finally, let y (without subscripts) denote an observed dataset.

The LC model assumes that the N = \sum_{s=1}^{S} n_s units can be partitioned into
C latent classes, each of which has its own probability density for the responses. A unit's unobservable class membership is represented by the latent variable θ and a particular class is denoted by c, with c = 1, . . . , C. The idea is then to find a LC model with the lowest number of classes for which the responses conditional on class membership are independent. This assumption is called local independence and lies at the basis of LC analysis.

In a LC model P(Y_s), the probability of observing pattern Y_s, is assumed to be a weighted average of the class-specific probabilities, with weights π_c being the probability that an individual belongs to LC c (Vermunt, 2010). So for each of the S patterns, the probability density is given by

    P(Y_s) = \sum_{c=1}^{C} π_c P(Y_s | θ = c).    (1)

Assuming local independence,

    P(Y_s | θ = c) = \prod_{j=1}^{J} P(Y_{sj} | θ = c).    (2)

Using the notation of Vermunt (2010) to indicate the conditional item response probability of a person in class c giving response r to item j as π_{jrc}, the conditional probability P(Y_{sj} | θ = c) is then a multinomial probability density given by

    P(Y_{sj} | θ = c) = \prod_{r=1}^{R_j} π_{jrc}^{y*_{sjr}},    (3)

where y*_{sjr} is 1 if Y_{sj} = r and 0 otherwise.
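To make Equations 1-3 concrete, the following R sketch computes the probability of one response pattern. It is an illustration added for this text, not code from the original study; the parameter layout (a class-size vector pi_c and an item × response × class array pi_jrc) is an assumption of the sketch.

    # Probability of one response pattern under a C-class LC model
    pattern_prob <- function(y, pi_c, pi_jrc) {
      C <- length(pi_c)
      total <- 0
      for (cl in seq_len(C)) {
        cond <- 1
        for (j in seq_along(y)) {
          cond <- cond * pi_jrc[j, y[j], cl]  # Equations 2-3: local independence
        }
        total <- total + pi_c[cl] * cond      # Equation 1: mixture over classes
      }
      total
    }

    # Hypothetical example: two equally sized classes, three dichotomous items
    pi_c   <- c(0.5, 0.5)
    pi_jrc <- array(NA, dim = c(3, 2, 2))              # item x response x class
    pi_jrc[, , 1] <- cbind(rep(0.8, 3), rep(0.2, 3))   # class 1: P(r = 1) = .8
    pi_jrc[, , 2] <- cbind(rep(0.2, 3), rep(0.8, 3))   # class 2: P(r = 1) = .2
    pattern_prob(c(1, 1, 2), pi_c, pi_jrc)             # 0.5*.128 + 0.5*.032 = 0.08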
Lastly, the probability that a person belongs to LC c, conditional on having response Y_s, called the posterior membership probability (Vermunt, 2010), is obtained using Bayes' rule:

    π_{c|s} = P(Y_s | θ = c) π_c / P(Y_s).    (4)

2.2 Estimating the LC Model

To obtain ML estimates for the LC model, typically the Expectation-Maximization (EM) algorithm (Goodman, 1974) is used. The EM algorithm finds the ML estimates by maximizing the log-likelihood function

    log L = \sum_{s=1}^{S} n_s log P(Y_s).    (5)

Because only non-zero response frequencies contribute to the likelihood, the convention 0 log(0) = 0 is used throughout this thesis. The details of the EM algorithm and of this convention are discussed in Appendix A.

Using the EM algorithm to obtain the ML estimates requires starting values for the parameters in ψ = (π_{jrc}, π_c), denoted as π^{(0)}_{jrc} and π^{(0)}_c. Caution is advised: when the starting values are too similar, the model can become unidentifiable. To solve this, it should be possible to order the LCs by π^{(0)}_c or, for instance, π^{(0)}_{1rc} (Hoijtink, 1998). For further discussion on the identifiability of LC models, including item/class ratios, see Goodman (1974). The EM algorithm goes as follows:
Step 0: Choose initial values for ψ^{(0)} and set t = 1.

Step 1 (Expectation): Given ψ^{(t−1)}, calculate π_{c|s} (see Equation 4). Then multiply this by n_s to obtain n^{(t)}_{sc}, the estimated number of respondents in each class having pattern s.

Step 2 (Maximization): Calculate π^{(t)}_c = m_c / N = \sum_{s=1}^{S} n^{(t)}_{sc} / N and π^{(t)}_{jrc} = \sum_{s=1}^{S} n^{(t)}_{sc} y*_{sjr} / m_c, where y*_{sjr} is 1 if Y_{sj} = r and 0 otherwise.

Step 3: Set t = t + 1 and repeat Steps 1 and 2 until the increase in the log-likelihood between two iterations is smaller than a given convergence criterion (e.g., 10^{-8}).
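The steps above translate directly into code. The sketch below is a minimal EM implementation for dichotomous items written for this text (it is not the thesis code, and names such as em_lc are invented here); 'patterns' is an S × J 0/1 matrix of observed response patterns and 'ns' the vector of pattern counts.

    # Minimal EM for a C-class LC model on dichotomous items (a sketch)
    em_lc <- function(patterns, ns, C, tol = 1e-8, max_iter = 5000) {
      S <- nrow(patterns); J <- ncol(patterns); N <- sum(ns)
      pi_c <- rep(1 / C, C) + runif(C, 0, 0.01)  # jitter so classes can be ordered
      pi_c <- pi_c / sum(pi_c)
      pi_j1c <- matrix(runif(J * C, 0.2, 0.8), J, C)  # P(item j = 1 | class c)
      loglik_old <- -Inf
      for (t in seq_len(max_iter)) {
        # E-step: posterior class membership per pattern (Equation 4)
        dens <- sapply(seq_len(C), function(cl)
          pi_c[cl] * apply(patterns, 1, function(y)
            prod(ifelse(y == 1, pi_j1c[, cl], 1 - pi_j1c[, cl]))))
        Py <- rowSums(dens)                       # P(Y_s), Equation 1
        post <- dens / Py                         # pi_{c|s}
        # M-step: expected counts, then updated parameters
        nsc <- post * ns                          # expected pattern-by-class counts
        mc <- colSums(nsc)                        # expected class sizes
        pi_c <- mc / N
        pi_j1c <- t((t(nsc) %*% patterns) / mc)   # fraction of '1' responses per class
        loglik <- sum(ns * log(Py))               # Equation 5
        if (loglik - loglik_old < tol) break      # Step 3 stopping rule
        loglik_old <- loglik
      }
      list(pi_c = pi_c, pi_j1c = pi_j1c, loglik = loglik, Py = Py)
    }

In practice one would run this from multiple starting sets and keep the solution with the highest log-likelihood, as discussed in Appendix A.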
Estimation of the model can also be done in a Bayesian context using a Gibbs sampler (e.g., Hoijtink, 1998). The Gibbs sampler is similar to the EM procedure, but relies on sampling distributions at each step (Ligtvoet & Vermunt, 2011) and results in an estimated (posterior) distribution of the parameters rather than stationary estimates for ψ. The Gibbs sampler proceeds as follows:

Step 0: Choose initial values for ψ^{(0)} and set d = 1.

Step 1 (Data augmentation): Given ψ^{(d−1)}, calculate π_{c|s} (see Equation 4). Then, every subject with a particular pattern is assigned to a LC by drawing from a multinomial distribution with probabilities π_{c|s}. This results in both the class sizes m^{(d)}_c and n^{(d)}_{jrc}, the number of respondents from class c with response r to item j.

Step 2: Draw a sample from the posteriors

    π^{(d)}_c ~ Dir(m^{(d)}_1 + α_c, . . . , m^{(d)}_C + α_c)

and

    (π^{(d)}_{j1c}, . . . , π^{(d)}_{jR_jc}) ~ Dir(n^{(d)}_{j1c} + α_{jrc}, . . . , n^{(d)}_{jR_jc} + α_{jrc}),

where α_c = 1/C and α_{jrc} = 1/R_j (see Appendix B).

Step 3: Set d = d + 1 and repeat Steps 1 and 2 until convergence (Section 3.3 describes a method for assessing the convergence of the sampler; for more Bayesian convergence criteria see e.g. Brooks and Gelman, 1998).

After convergence, repeat Steps 1 and 2 another L times and keep the sampled values to estimate the posterior distribution of the parameters.
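A single iteration of this sampler can be sketched in R as follows (again an illustration written for this text, not the thesis code). Base R has no Dirichlet sampler, so a small helper based on gamma draws is used; the data layout matches the em_lc sketch above.

    # Dirichlet draw via normalized gamma variates
    rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) }

    # One Gibbs iteration for the LC model with dichotomous items (sketch)
    gibbs_step <- function(patterns, ns, pi_c, pi_j1c, alpha_c, alpha_jrc) {
      S <- nrow(patterns); J <- ncol(patterns); C <- length(pi_c)
      # Step 1: data augmentation - assign units to classes given pi_{c|s}
      dens <- sapply(seq_len(C), function(cl)
        pi_c[cl] * apply(patterns, 1, function(y)
          prod(ifelse(y == 1, pi_j1c[, cl], 1 - pi_j1c[, cl]))))
      post <- dens / rowSums(dens)
      m_c <- numeric(C); n_j1c <- matrix(0, J, C)
      for (s in seq_len(S)) {
        counts <- as.vector(rmultinom(1, ns[s], post[s, ]))
        m_c <- m_c + counts
        n_j1c <- n_j1c + outer(patterns[s, ], counts)  # '1' responses per item/class
      }
      # Step 2: draw new parameters from their Dirichlet posteriors
      pi_c_new <- rdirichlet1(m_c + alpha_c)
      pi_j1c_new <- matrix(0, J, C)
      for (cl in seq_len(C)) for (j in seq_len(J))
        pi_j1c_new[j, cl] <- rdirichlet1(c(n_j1c[j, cl], m_c[cl] - n_j1c[j, cl]) + alpha_jrc)[1]
      list(pi_c = pi_c_new, pi_j1c = pi_j1c_new)
    }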
In the simulation study (see Section 4), I use the population parameter values as starting points and a burn-in of 100 iterations before I start sampling. This way the method is likely to start close to the parameter values and the posterior is properly estimated. When the population values were not useful (e.g., when the analysis only had 1 LC), I used the ML estimates obtained from the EM algorithm as starting values.

2.3 Model-fit test statistics

Three test statistics are used to assess model fit. These fit statistics are indicators of the local dependencies given class membership. Let e_s = P(Y_s)N denote the expected pattern frequencies under the fitted LC model given the (estimated) values of ψ (from which P(Y_s) is calculated). The likelihood ratio statistic L^2 and the overall Pearson chi-squared test statistic X^2 are then:

    L^2 = 2 \sum_{s=1}^{S} n_s ln(n_s / e_s),    X^2 = \sum_{s=1}^{S} (n_s − e_s)^2 / e_s.

Third, the bivariate residual (BVR) is used, which measures remaining local dependencies between two items. The BVRs are X^2 values computed for pairs of variables (Vermunt & Magidson, 2005). So for items j and j′,

    BVR_{jj′} = \sum_{r=1}^{R_j} \sum_{r′=1}^{R_{j′}} (n_{rr′} − e_{rr′})^2 / e_{rr′}.
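In code these statistics are only a few lines. The sketch below was added for this text, with the 0 log(0) = 0 convention from Section 2.2 built in; note that the X^2 sum runs over all S cells of the table, so e_s must also be supplied for unobserved patterns.

    # Overall fit statistics from observed (ns) and expected (es) frequencies
    fit_stats <- function(ns, es) {
      keep <- ns > 0                          # 0 * log(0) = 0 convention
      L2 <- 2 * sum(ns[keep] * log(ns[keep] / es[keep]))
      X2 <- sum((ns - es)^2 / es)
      c(L2 = L2, X2 = X2)
    }

    # BVR for dichotomous items j and jp: Pearson X2 on their observed vs.
    # expected bivariate table, aggregated from the pattern-level frequencies
    bvr <- function(patterns, ns, es, j, jp) {
      obs <- expd <- matrix(0, 2, 2)
      for (s in seq_along(ns)) {
        r <- patterns[s, j] + 1; rp <- patterns[s, jp] + 1
        obs[r, rp] <- obs[r, rp] + ns[s]
        expd[r, rp] <- expd[r, rp] + es[s]
      }
      sum((obs - expd)^2 / expd)
    }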
To investigate the BVR statistics based on a number of random samples, I assume that all BVRs behave the same and will therefore only analyze the BVR of items 1 and 2.

The L^2, X^2 and BVR all have a less-is-better form and can be seen as indicators of badness of fit. In the next section I describe how these statistics can be used to perform significance tests for goodness of fit. The significance tests are based on p-values, which indicate how likely the value of an observed statistic is, given certain assumptions about the population parameters and/or the data. The methods differ from each other in the assumptions about the population parameters and in the estimation process. First I describe how to obtain a p-value using an asymptotic reference distribution, then by means of the parametric bootstrap, and finally by means of two PPCs.

3 Estimating p-values

3.1 Asymptotic reference distribution

In the frequentist framework, the p-value is the theoretical probability of finding a test statistic that is more extreme than the one actually observed, under the null hypothesis H_0 (Hogg & Tanis, 2010). In testing a LC model with C classes, we base the p-value on the assumption that this model is true. The p-value associated with an observed test statistic T_obs is the probability that a value for T is at least as extreme as T_obs, given that the C-class model is true.
In testing model fit I am only interested in the probability of worse fit. This is indicated by larger values of T, so the asymptotic p-value can be defined as

    p_a = Pr(T ≥ T_obs | H_0),    (6)

where the conditioning upon H_0 means that the posited model is assumed to be true or that ψ = ψ_0, the values postulated in H_0 (Gelman et al., 2004; Meng, 1994). To obtain this p-value one calculates the area beyond the value of T_obs in a reference distribution with a specified number of degrees of freedom (df). In an unrestricted LC model the L^2 and X^2 statistics under H_0 are assumed to asymptotically follow a chi-squared distribution (χ^2_df) with df given by

    df = \prod_{j=1}^{J} R_j − C [1 + \sum_{j=1}^{J} (R_j − 1)].    (7)

As noted before, the BVR does not have a direct reference distribution since it is an approximation of the score test, which follows a chi-squared distribution. In the coming simulation only binary variables are used, and the BVR will then approximate the score test for a 2 × 2 contingency table. Because the score test is known to asymptotically follow a chi-squared distribution with (R_j − 1) × (R_{j′} − 1) = 1 df in this case, I will assume that the BVR can be approximated by the same asymptotic distribution and check the validity of this assumption.
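As a small worked illustration of Equations 6 and 7 (with a hypothetical observed value, not a result from the thesis): for the simulation design of Section 4 with J = 6 dichotomous items and C = 2 classes, df = 2^6 − 2(1 + 6) = 50, and the upper-tail area of the chi-squared distribution gives p_a.

    # Asymptotic p-value from the chi-squared reference (Equations 6-7)
    J <- 6; Rj <- rep(2, J); C <- 2
    df <- prod(Rj) - C * (1 + sum(Rj - 1))           # = 50
    L2_obs <- 55.3                                   # hypothetical observed value
    p_a <- pchisq(L2_obs, df = df, lower.tail = FALSE)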
  • 13. & Smith, 2001 for a clear evaluation), also statistical problems arise with the use of (asymptotic) reference distributions. One problem with the asymptotic p-value is that if it is unknown what distribution a statistic follows, the use of an incorrect reference distribution can result in inaccurate p-values. Another problem is that, by definition, an asymptotic p-value is not exact because sample sizes are always finite. And although results might be trustworthy in very large samples, even moderate sample sizes can lead to inaccurate results. When the number of items in the data becomes large, the observed pat- tern frequencies in the contingency tables quickly become very sparse and one needs very large sample sizes to control for this. In sparse tables statis- tics like the L2 cannot be approximated well. And even though pa can still be calculated, its values can no longer be trusted (Magidson & Vermunt, 2004; Maydue-Olivares & Joe, 2006; Reiser & Lin, 1999; Vermunt, 2010). Other methods have to be used in order to get more reliable and accurate p-values in situations where these issues occur. Because of these and other problems associated with pa-values, other methods have been proposed to obtain p-values, which do not rely on asymp- totic theory, but are based on resampling techniques. These techniques gen- erate a large number of random replicate samples from a set of (estimated) population parameter values. For each of these datasets yrep it is possible to calculate the statistics of interest and determine the probability that a statistic Trepis larger than the one observed. This is done by estimating the proportion of Trepthat were more extreme than Tobs, given the estimation of 13
3.2 Parametric Bootstrap Method

The parametric bootstrap can be used to estimate the distribution of statistics for which the distribution is unknown, either due to the limited sample size or to inapproximability. If we use the ML estimates from the observed data as population values, it is possible to estimate the probability that T_rep ≥ T_obs, given that the estimates are true (Langeheine, Pannekoek & Van de Pol, 1996). The bootstrap p-value is then given by:

    p_b = Pr(T_rep ≥ T_obs | ψ̂, H_0).    (8)

The bootstrap method proceeds as follows:

Step 1. Assume that the model (H_0) is true.

Step 2. Treat the ML estimates from the observed data under H_0 as population parameters.

Step 3. Draw B random replicate samples y^{rep,b}, b = 1, . . . , B, of size N based on these population parameter estimates.

Step 4. Estimate the LC model for each dataset using the EM algorithm and calculate T^b_rep from the ML estimates ψ̂^b.
The proportion

    B^{-1} \sum_{b=1}^{B} I(T^b_rep ≥ T_obs),

where the indicator function I equals 1 if T^b_rep ≥ T_obs and 0 otherwise, is taken to be the estimate of p_b. In words, p_b is (estimated by) the proportion of samples in which the value T^b_rep is greater than or equal to T_obs. A code sketch of this procedure is given below.
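The following R sketch combines Steps 1-4 with the em_lc and fit_stats sketches given earlier (all names are inventions of this text, not the thesis code); sample_lc generates a dataset of size N from given parameter values and aggregates it into pattern counts.

    # Generate N responses from a dichotomous LC model and tabulate patterns
    sample_lc <- function(N, pi_c, pi_j1c) {
      J <- nrow(pi_j1c)
      cls <- sample(length(pi_c), N, replace = TRUE, prob = pi_c)
      y <- matrix(0L, N, J)
      for (i in seq_len(N)) y[i, ] <- rbinom(J, 1, pi_j1c[, cls[i]])
      agg <- aggregate(rep(1, N), by = as.data.frame(y), FUN = sum)
      list(patterns = as.matrix(agg[, 1:J]), ns = agg$x)
    }

    # Parametric bootstrap p-value for the L2 statistic (Equation 8)
    bootstrap_p <- function(patterns, ns, C, B = 100) {
      N <- sum(ns)
      fit <- em_lc(patterns, ns, C)               # Step 2: ML estimates
      T_obs <- fit_stats(ns, fit$Py * N)["L2"]
      T_rep <- numeric(B)
      for (b in seq_len(B)) {                     # Steps 3-4
        rep_data <- sample_lc(N, fit$pi_c, fit$pi_j1c)
        rep_fit <- em_lc(rep_data$patterns, rep_data$ns, C)
        T_rep[b] <- fit_stats(rep_data$ns, rep_fit$Py * N)["L2"]
      }
      mean(T_rep >= T_obs)                        # proportion estimating p_b
    }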
3.3 Posterior Predictive Check

The PPC is the Bayesian counterpart of the classical statistical tests (Meng, 1994). Given that H_0 is true and that the observed data came from the population of interest, the posterior predictive (PP) p-value is given by:

    p_p = Pr(T^l_rep ≥ T_obs | y, H_0).    (9)

In the Bayesian framework one is not particularly interested in the probability that the observed data have come from a population with the parameters posited in the null hypothesis (as in the frequentist framework), but rather in the probability that the parameters have certain values given that the observed data indeed came from that population (Gelman et al., 2004). As a result of this philosophy, the major difference with the bootstrap is that the PPC is based on the posterior distribution P(ψ|y) of the unknown parameters (rather than on a point estimate like ψ̂) and on the predictive distribution P(y^rep|ψ) for the replicated data. In its general form, the probability in Equation 9 is taken over the joint distribution P(ψ, y^rep|y), so that

    p_p = \int \int I(T^l_rep ≥ T_obs) P(y^rep|ψ) P(ψ|y) dy^rep dψ,    (10)

where I equals 1 if T^l_rep ≥ T_obs, for all possible values of T_obs (Gelman et al., 2004). Appendix B shows how the posterior and PP distributions are obtained.

In practice, the PP distribution P(y^rep|ψ) is usually estimated through simulations and the p_p-value is then estimated based on these draws. In principle the PPC is done like this:

Step 1. Assume that the model is true.

Step 2. Draw L samples from the PP distribution to obtain ψ^l and y^{rep,l}, l = 1, . . . , L.

Step 3. Estimate the LC model under H_0 on each dataset y^{rep,l} and calculate the statistic T^l_rep.

So T^l_rep is obtained by estimating the model under H_0 using the EM algorithm. For each replication the ML estimates ψ̂^l are used to calculate T^l_rep, and the proportion

    L^{-1} \sum_{l=1}^{L} I(T^l_rep ≥ T_obs),

where the indicator function I equals 1 if T^l_rep ≥ T_obs and 0 otherwise, is taken to be the estimate of p_p.
In more complex models (like the LC model), however, it may not be possible to obtain the PP distribution in Step 2 analytically. The solution involves splitting up Step 2 and using an iterative sampling procedure:

Step 2a. Draw a sample from the posterior distribution, ψ^l ~ P(ψ|y).

Step 2b. Generate a replicate dataset, y^{rep,l} ~ P(y^rep|ψ^l).

Step 2c. Repeat Steps 2a and 2b to obtain L replicated datasets.

But, as shown in Appendix B, the posterior distribution for the LC model again does not have a convenient form to sample from directly. Fortunately the Gibbs sampler, as discussed in Section 2.2, can be used to obtain the required posterior draws ψ^l (Rubin & Stern, 1994). At convergence, the draws in a Gibbs sampler iteration are actually samples from the posterior P(ψ|y), and as a result the L iterations result in an approximation of the posterior distribution. Performing Step 2b results in draws from the predictive distribution. The joint draws from the posterior distribution and the predictive distribution can together be seen as a single draw from the PP distribution. A code sketch of the whole procedure is given below.
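Combining the earlier sketches (gibbs_step, sample_lc, em_lc and fit_stats, all names invented for this text), the PPC for the L^2 statistic could look as follows; the burn-in of 100 iterations mirrors the choice described in Section 2.2.

    # Posterior predictive p-value for the L2 statistic (Equation 9)
    ppc_p <- function(patterns, ns, C, L = 100, burn = 100) {
      N <- sum(ns)
      fit <- em_lc(patterns, ns, C)
      T_obs <- fit_stats(ns, fit$Py * N)["L2"]
      state <- list(pi_c = fit$pi_c, pi_j1c = fit$pi_j1c)
      for (d in seq_len(burn))                    # burn-in of the Gibbs sampler
        state <- gibbs_step(patterns, ns, state$pi_c, state$pi_j1c, 1 / C, 1 / 2)
      T_rep <- numeric(L)
      for (l in seq_len(L)) {
        state <- gibbs_step(patterns, ns, state$pi_c, state$pi_j1c, 1 / C, 1 / 2)  # Step 2a
        rep_data <- sample_lc(N, state$pi_c, state$pi_j1c)                         # Step 2b
        rep_fit <- em_lc(rep_data$patterns, rep_data$ns, C)                        # Step 3
        T_rep[l] <- fit_stats(rep_data$ns, rep_fit$Py * N)["L2"]
      }
      mean(T_rep >= T_obs)                        # proportion estimating p_p
    }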
Figure 1 in Appendix C is a graphical representation of the PPC. The upper plot is a trace plot and depicts the values of the T_rep = X^2_rep statistic during the L = 500 replications for the empirical example described in Section 5, where N = 94 and C = 2. If the plot shows any long-term trends, this is an indication that successive draws are highly correlated and that the method has not converged. The values should move freely around in the value space, without getting stuck in a local region (King et al., 2011). The bottom plot shows a smoothed density of the replicated values. The horizontal and vertical dashed lines indicate the observed value X^2_obs = 67.993, and the proportion of values beyond that line (.554) is the estimate of p_p.

PPC using discrepancy variables

The formulation of the PP p-value has been extended by Gelman et al. (2004) by using, instead of a statistic T, a discrepancy variable D(ψ) which depends on the data as well as the parameters. For each draw from the posterior, D_obs(ψ^l) is calculated as the discrepancy between ψ̂ and ψ^l, and D_rep(ψ^l) is calculated as the discrepancy between ψ̂^l and ψ^l. The p-value for the discrepancy measure is given by:

    p_d = Pr(D_rep(ψ) ≥ D_obs(ψ) | y, H_0).

Goodness-of-fit measures like the L^2 can be used as discrepancy variables because the predicted pattern frequencies are functions of the parameters in ψ. For instance, the expected frequencies for the L^2 are calculated as e^l_s = P(Y_s|ψ^l)N. The discrepancy p-value is estimated by taking the L sampled draws, computing the predicted pattern frequencies e^l_s directly from ψ^l, and computing D_obs(ψ^l) and D_rep(ψ^l) based on these predicted frequencies. In this method one obtains L 'observed' discrepancies D_obs(ψ^l) and L replicated discrepancies D_rep(ψ^l). The p_d is estimated by

    L^{-1} \sum_{l=1}^{L} I(D_rep(ψ^l) ≥ D_obs(ψ^l)).
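A sketch of this variant, reusing the helpers introduced above (again with names invented for this text): the only change relative to ppc_p is that both discrepancies are evaluated at the sampled parameters ψ^l rather than at refitted ML estimates.

    # Expected pattern frequencies e_s = P(Y_s | psi) * N from given parameters
    expected_freqs <- function(patterns, N, pi_c, pi_j1c) {
      N * apply(patterns, 1, function(y)
        sum(pi_c * sapply(seq_along(pi_c), function(cl)
          prod(ifelse(y == 1, pi_j1c[, cl], 1 - pi_j1c[, cl])))))
    }

    # Discrepancy p-value p_d for the L2 statistic
    discrepancy_p <- function(patterns, ns, C, L = 100, burn = 100) {
      N <- sum(ns)
      fit <- em_lc(patterns, ns, C)
      state <- list(pi_c = fit$pi_c, pi_j1c = fit$pi_j1c)
      for (d in seq_len(burn))
        state <- gibbs_step(patterns, ns, state$pi_c, state$pi_j1c, 1 / C, 1 / 2)
      hits <- 0
      for (l in seq_len(L)) {
        state <- gibbs_step(patterns, ns, state$pi_c, state$pi_j1c, 1 / C, 1 / 2)
        rep_data <- sample_lc(N, state$pi_c, state$pi_j1c)
        D_obs <- fit_stats(ns, expected_freqs(patterns, N, state$pi_c, state$pi_j1c))["L2"]
        D_rep <- fit_stats(rep_data$ns,
                           expected_freqs(rep_data$patterns, N, state$pi_c, state$pi_j1c))["L2"]
        hits <- hits + as.numeric(D_rep >= D_obs)
      }
      hits / L
    }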
The PPC using discrepancy variables was used in LC analysis by Berkhof, Van Mechelen and Gelman (2003) and Meulders et al. (2002), who indicate that this procedure tends to be conservative. Conservativeness, however, is not the only issue with the p_d-value. Hjort, Dahl & Steinbakk (2006) showed that the distribution of p_d under H_0 is far from uniform and indicated that its values need to be adjusted in order to make results interpretable. Hjort et al. investigated the behavior of p_d in a number of models, but not in the LC model. In order to test the appropriateness of the method it is important to investigate the behavior of p_d in the current setting as well, and the method is therefore included in this study.

4 Simulation study

To compare the methods described above, the behavior of the p-values in different situations needs to be assessed. In situations where H_0 is true, the p-values for the fit statistics described in Section 2.3 should be uniformly distributed (Sackrowitz & Samuel-Cahn, 1999). Deviations from uniformity could indicate that the used reference distribution or method is incorrect. The uniformity of the p-values will therefore be used to assess the applicability
of the methods in different situations.

To investigate the behavior of the proposed p-values I generated data for J = 6 dichotomous items (R_j = 2 for all j). The population class sizes and conditional response probabilities used throughout the simulations can be found in Table 1.

Table 1: Population values for the simulation studies

                c = 1   c = 2
    π_c          0.5     0.5
    π_{j1c}      0.8     0.2
    π_{j2c}      0.2     0.8

To test the behavior of the p-values under H_0 in large samples I generated 500 datasets with N = 1000. In large samples the p-values ought to behave approximately equivalently. Since one of the reasons for using resampling techniques is their usage in small samples and sparse tables, I generated the same number of datasets with N = 100. On all datasets a 2-class LC model was fitted using the EM algorithm. At convergence the asymptotic p-values were calculated for the L^2 and X^2 based on the χ^2_50 distribution, and for the BVR_12 using the χ^2_1 distribution. To obtain the p_b-value the bootstrap with B = 100 was performed, and similarly the p_p and p_d were calculated based on L = 100 PP samples. In total, the LC model had to be fitted to 200,000 additional datasets.

To test the behavior of the p-values under a misspecified model and to perform a power test, again 500 datasets with N = 1000 and 500 datasets with N = 100 were generated from a 2-class population, but each of these datasets was analyzed using a 1-class LC model. I then calculated the p_a-values (with df = 57 for the L^2 and X^2) and obtained the p_b, p_p and p_d-values based on B = L = 100.
To check whether the p-values are uniformly distributed under H_0, I performed two numerical checks and a graphical check to substantiate the findings. If a p-value is uniformly distributed, its expected value E(p) = .5 and P(p < .05) = .05 (i.e., in 5% of the cases the p-value is less than .05). I use the conventional significance level of .05 (Fisher, 1954) as the upper limit for rejecting the null hypothesis. If there are considerable deviations from these indicators of uniformity, the used method might be inappropriate or incorrectly specified. The graphical checks are shown as the distributions of the p-values, smoothed using splines to approximate the log-densities (see Stone et al., 1997). These graphical checks can be used directly to see deviations from uniformity anywhere in the distribution. Please note that sharp increases in density at the very boundaries (at approximately < .02 and > .98) are due to the estimation procedure rather than implying practical misbehavior of the p-value.
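The two numerical checks are straightforward to compute; a minimal sketch (with simulated stand-in p-values, not the thesis results):

    # Uniformity checks: under H0, E(p) should be near .5 and Pr(p < .05) near .05
    uniformity_check <- function(p) c(E_p = mean(p), prop_below_05 = mean(p < .05))

    p_sim <- runif(500)          # stand-in for 500 simulated p-values
    uniformity_check(p_sim)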
Results

Figure 2 in Appendix C and Table 2 provide the results for the p-values under H_0 for a sample size of N = 1000. Figure 3 in Appendix C and Table 3 provide the results under H_0 for N = 100. The densities of the p_a-values are depicted as solid lines, the p_b-values as dashed lines, the p_p-values as dash-dotted lines and the p_d-values as dotted lines. A reference line indicating a truly uniform distribution is also included. The tables can be used as a summary of the figures and include the two checks for uniformity: the expected p-value E(p) and P(p < .05) for the different goodness-of-fit statistics. Not only can these proportions be used as an indication of systematic deviations from uniformity, but they may also be helpful if only Type-I error rates (false rejections of the null hypothesis) are the issue of concern.

The results show that with a sample size of N = 1000, under H_0, the chi-squared reference distribution used for the p_a-values is not an exact reference for the L^2 statistic. Using the χ^2_50 distribution resulted in too liberal results, since the Type-I error rate was .094 (almost twice as high as expected under H_0). Also, the expected value is much lower than .5. From Figure 2 it is clear that the density becomes larger as p_a comes closer to 0, indicating too many small p-values. Although this may be due to sampling fluctuations given the limited number of simulations, it is worth mentioning that within the same analyses the p_a-value for the X^2 statistic shows this behavior much less. To illustrate, there were 81 analyses where the p_a-value for the L^2 was less than .10 (where there should be only 50). In those analyses the X^2 had p-values less than .10 in only 57 cases. Inspection of the p_a-values for the BVR_12 clearly indicates that the BVR does not follow a χ^2_1 distribution: the density of the p-values increases in a linear fashion as the values of p_a increase.

Conversely, from Table 2 it can be seen that in the large-sample case the p_b and p_p-values only seem to provide somewhat too liberal results, having slightly too many values smaller than .05. Other than that, these p-values show very good approximations to the uniform distribution.
Table 2: Uniformity measures of p-values (N = 1000, 500 MC simulations, 100 bootstrap/PPC replications)

                         E(p)                       Pr(p < .05)
               p_a    p_b    p_p    p_d      p_a    p_b    p_p    p_d
    L^2       .4388  .4945  .4946  .8449    .094   .062   .064   .002
    X^2       .4917  .4918  .4923  .8496    .060   .068   .064   .000
    BVR_12    .6706  .5065  .5072  .7667    .000   .046   .046   .000

In the current setting, with a large sample size, p_b and p_p clearly outperform the asymptotic p-value for both the L^2 and the BVR, but this is perhaps more likely due to the specification of the asymptotic reference distribution than to the quality of the methods in the large-sample case, since the methods behave very similarly for the X^2 statistic.

As expected, the most 'problematic' results came from the PPC using discrepancy variables, which is very clearly not adequate for testing model fit using any of the goodness-of-fit statistics. In line with the findings of Hjort et al. (2006), the p_d is distributed far from uniformly in the LC goodness-of-fit setting. Figure 2 shows that for the L^2 and X^2 the density increases as p_d gets larger and peaks at 1. For the BVR statistic it peaks at around .78, with a range of [0.54, 0.93]. In only 1 dataset (the value .002 in Table 2) was a p_d-value found that was less than .05.

From Table 3 it can be seen that in sparser datasets the expected values of p_b and p_p are somewhat higher than that of p_a for the L^2 statistic (perhaps still due to the asymptotic reference distribution), about equal for the X^2, and lower for the BVR (although the latter is rather trivial since the reference distribution was clearly inadequate).
Table 3: Uniformity measures of p-values (N = 100, 500 MC simulations, 100 bootstrap/PPC replications)

                         E(p)                       Pr(p < .05)
               p_a    p_b    p_p    p_d      p_a    p_b    p_p    p_d
    L^2       .4019  .4354  .4352  .8854    .016   .040   .034   .000
    X^2       .5224  .5200  .5114  .8535    .028   .024   .018   .000
    BVR_12    .6758  .5088  .5136  .7607    .004   .040   .038   .000

Also, in sparser tables the p_d has much higher values than the other measures, except for the p_a of the BVR (again probably due to the incorrect reference). All methods tend to be conservative in that too few p-values were less than .05, even when the expected values lie below .5. From Figure 3 it can be seen that the distribution of the p_a-value under H_0 with a small sample size is far from uniform for the L^2 statistic. Interestingly, this behavior is mimicked by the p_b and p_p. Although the behavior is similar, the p_b and p_p are distributed more flatly for all statistics, with the bootstrap method resulting in the least peaked distribution.

Finally, in analyzing the 500 datasets of N = 1000 from a 2-class population with a 1-class model, the probability of correctly rejecting the null hypothesis (i.e., the power) was 1 using any of the statistics. That is, all p_a-values were less than 10^{-19} for the BVR, less than 10^{-161} for the L^2 and less than 10^{-291} for the X^2 statistic. All other p-values were always equal to 0. In the 500 smaller samples, all p-values resulted in a power of 1 for the L^2 and X^2.
Although the power for the BVR was 1 in the previous simulation, it is not a very good measure to determine model misfit when analyzed on its own, as it is based only on the two-item relationships. That is to say, if one BVR does not provide a small p-value, this does not indicate that the whole model fits well. This aspect is captured by the p-values in the small-sample case. The expected and maximum p-values, as well as the power (indicated as P(p < .05), the probability of a value less than .05), for all methods are provided in Table 4. Also here, the p_d provides very inadequate results if the values are not post-processed (see Hjort et al., 2006).

Table 4: Power results for the BVR

                   p_a    p_b    p_p    p_d
    E(p)          .001   .010   .009   .146
    P(p < .05)    .964   .944   .952   .284
    max(p)        .565   .60    .55    .84

5 Empirical example

To illustrate the usage of the proposed methods I have analyzed data which were obtained by Galen and Gambino (1975, in Rindskopf, 2002) in a study of 94 patients who suffered chest pains and were admitted to an emergency room. Four indicators of myocardial infarction (MI) were scored either 1 (present) or 0 (not present): the patients' heart-rhythm Q-waves (Q), high low-density blood cholesterol levels (L), creatine phosphokinase levels (C) and their clinical history (H). The response patterns and their observed frequencies can be found in Table 5. Rindskopf indicated that the data are consistent
with a 2-class LC model, with df = 6 and L^2 = 4.29 with p_a = .64.

To obtain the four p-values for the different statistics, I used the χ^2_6 reference distribution for the L^2 and X^2, and set B = L = 500 to obtain the resampling p-values. Because the data are quite sparse, and given the results from the simulation study with N = 100, I expected to find that the p_b and p_p would be higher than p_a for the L^2 statistic, about equal for the X^2, and lower for the BVR (due to the unknown reference distribution for the BVR). I also expected p_d to be much higher than the other p-values, but less so than p_a for the BVR.

Table 5: Response pattern frequencies

    Q L C H   count     Q L C H   count
    0 0 0 0     33      1 0 0 0      0
    0 0 0 1      7      1 0 0 1      0
    0 0 1 0      7      1 0 1 0      2
    0 0 1 1      5      1 0 1 1      3
    0 1 0 0      1      1 1 0 0      0
    0 1 0 1      0      1 1 0 1      0
    0 1 1 0      3      1 1 1 0      4
    0 1 1 1      5      1 1 1 1     24

Table 6 provides the conditional response probabilities and class sizes resulting from fitting the 2-class LC model to the data (which are identical to those reported by Rindskopf, 2002). The first class (likely to have had an MI) had high conditional probabilities for all indicators; the other class had low conditional probabilities.
Table 6: ML parameter estimates of ψ for the MI data using a 2-class model

                MI      no MI
    π_c       0.4578   0.5422
    Q = 0     0.2332   1.0000
    Q = 1     0.7668   0.0000
    L = 0     0.1721   0.9731
    L = 1     0.8279   0.0269
    C = 0     0.0000   0.8045
    C = 1     1.0000   0.1955
    H = 0     0.2086   0.8049
    H = 1     0.7914   0.1951

In Table 7 the estimated p-values from all methods are shown for the 2-class model for the three statistics used. As none of the p-values is small, all p-values indicate that the 2-class model fits the data well. Against expectation, the bootstrap resulted in much smaller p-values than the other methods for the L^2 and X^2. Although no p-value indicated lack of fit, there are large differences in the actual values of the p-values.

Table 7: Results for the empirical example (df = 6, N = 94, B = L = 500)

                           p_a     p_b     p_p     p_d
    L^2 = 4.292611        .637    .358    .606    .874
    X^2 = 4.22263         .647    .306    .554    .892
    BVR_12 = 0.1545949    .694    .230    .182    .652

6 Discussion

In this thesis I compared different p-values in goodness-of-fit testing of LC models. The classical asymptotic p-value was compared to the p-values obtained by means of the parametric bootstrap and PPCs in large and small samples. The methods were discussed and the differences illustrated.
Two problems that occur in using asymptotic p-values were discussed: first, that they cannot be trusted in small samples; and second, that they are not useful when it is unknown what distribution a statistic follows. The results suggested that the χ^2_df may not be a valid reference for the L^2 statistic in LC analysis, since it produced too liberal results in large samples under H_0. The BVR has also been shown to clearly not follow a χ^2_1 distribution. The p_b and p_p showed much better behavior than the asymptotic p-value for both the L^2 and the BVR, although this might have been due to the asymptotic reference distribution used, since the methods were comparable for the X^2, for which p_a also showed good behavior.

Whether the bootstrap or the PPC is the better method for approximating a p-value in the current setting is not clear-cut. The data for N = 100 were not extremely sparse, since the number of patterns with observed frequencies of 0 or 1 was not so large. But especially the L^2 statistic showed very surprising behavior and needs to be investigated further. More research should be done to investigate the distribution of the L^2 and BVR statistics, which can be done by looking at the actual values of the statistics rather than the p-values under the reference distribution.

Additionally, analysis of the empirical example showed that the p-values can differ from each other quite severely within one dataset, even though the expected values did not differ much. To find out more about the difference between the p-values within datasets, a comparison of the p-values within each simulation could provide better insight into the characteristics
of the data responsible for these differences. This may result in a clearer understanding of when each of the methods can be used optimally.

Since the current research has focused on (overall) goodness-of-fit statistics, an option for future research is to do a similar study to investigate the applicability of resampling techniques to issues regarding LC model selection and comparison. For instance, the PPC could provide a p-value for the increase in fit when adding LCs or when including local dependencies. This said, I have only considered rather simple LC models, and future research on this topic should include, for example, models with more LCs, local dependencies, or models which include covariates.

Note on computational time

Because for each dataset B = L = 100 bootstraps and PPCs are performed to estimate p_b, p_p and p_d, a total of 400,000 replicated datasets had to be computed and analyzed using the EM algorithm, which can become rather time consuming. For instance, the analysis for N = 100 with 2 LCs took over 20 hours to complete on a 32-bit, 2.61 GHz, 3.43 GB RAM computer using the software package R (CRAN, 2012). However, the individual analyses themselves do not take very long (a couple of minutes per run).

The assessment of the empirical data using all techniques took only about 3 minutes with 500 bootstrap/PPC replications, indicating the practical usefulness of the methods in obtaining p-values. Of course the empirical dataset was not very large, but researchers should not be inhibited from using these techniques in empirical research. The software
and hardware used (and the efficiency of the programming) can greatly diminish the time needed to analyze a problem and, moreover, even waiting a day to get reliable research results should be considered worthwhile.
References

Bera, A. K. & Bilias, Y. (2001). Rao's score, Neyman's C(α) and Silvey's LM tests: An essay on historical developments and some new results. Journal of Statistical Planning and Inference, 97, 9–44.

Berkhof, J., Van Mechelen, I., & Gelman, A. (2003). A Bayesian approach to the selection and testing of Mixture Models. Statistica Sinica, 13, 423–442.

Brooks, S. P. & Gelman, A. (1998). General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455.

Fisher, R. A. (1925). Statistical methods for research workers (chapter 3). Retrieved May 2, 2012, from http://psychclassics.yorku.ca/Fisher/Methods/

Formann, A. K. (2003). Latent Class Model Diagnosis – a review and some proposals. Computational Statistics & Data Analysis, 41, 548–559.

Galindo-Garre, F., & Vermunt, J. K. (2005). Testing log-linear models with inequality constraints: a comparison of asymptotic, bootstrap, and posterior predictive p values. Statistica Neerlandica, 59, 82–94.

Garrett, S. G., & Zeger, S. L. (2000). Latent Class Model Diagnosis. Biometrics, 56, 1055–1067.

Gelman, A., Carlin, J., Stern, H. & Rubin, D. (2004). Bayesian Data Analysis (2nd ed.). Boca Raton, FL: Chapman & Hall.

Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231.

Hjort, N. L., Dahl, F. A. & Steinbakk, G. H. (2006). Post-Processing Posterior Predictive p Values. Journal of the American Statistical Association, 101(475), 1157–1174.

Hogg, R. V. & Tanis, E. A. (2010). Probability and Statistical Inference (8th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Hoijtink, H. (1998). Constrained Latent Class Analysis using the Gibbs Sampler and Posterior Predictive P-values: applications to educational testing. Statistica Sinica, 8, 691–711.

King, M. D., Calamante, F., Clark, C. A. & Gadian, D. G. (2011). Markov Chain Monte Carlo Random Effects Modeling in Magnetic Resonance Image Processing Using the RBugs Interface to WinBUGS. Journal of Statistical Software, 44(2), available online from http://www.jstatsoft.org/v44/i02

Langeheine, R., Pannekoek, J. & Van de Pol, F. (1996). Bootstrapping Goodness-of-Fit Measures in Categorical Data Analysis. Sociological Methods & Research, 24, 492–516.

Ligtvoet, R. & Vermunt, J. K. (2012). Latent class models for testing monotonicity and invariant item ordering for polytomous items. British Journal of Mathematical and Statistical Psychology, 65(2), 237–250.

Magidson, J., & Vermunt, J. K. (2004). Latent class models. In D. Kaplan (Ed.), The Sage Handbook of Quantitative Methodology for the Social Sciences (pp. 175–198). Thousand Oaks, CA: Sage Publications, Inc.

Maydeu-Olivares, A. & Joe, H. (2006). Limited Goodness-of-Fit testing in Multidimensional Contingency tables. Psychometrika, 71, 713–732.

Meulders, M., De Boeck, P., Kuppens, P. & Van Mechelen, I. (2002). Constrained Latent Class Analysis of Three-Way Three-Mode Data. Journal of Classification, 19, 277–302.

Nylund, K. L., Asparouhov, T. & Muthén, B. O. (2007). Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study. Structural Equation Modeling: A Multidisciplinary Journal, 14(4), 535–569.

Reiser, M., & Lin, Y. (1999). Goodness-of-fit test for the latent class model when expected frequencies are small. In M. Sobel and M. Becker (Eds.), Sociological Methodology (pp. 81–111). Boston: Blackwell Publishers.

Rindskopf, D. (2002). The use of latent class analysis in medical diagnosis. Proceedings of the Joint Meetings of the American Statistical Association, 2912–2916.

Rubin, D. B., & Stern, H. S. (1994). Testing in latent class models using a posterior predictive check distribution. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 420–438). Thousand Oaks, CA: Sage Publications, Inc.

Sackrowitz, H. & Samuel-Cahn, E. (1999). P Values as Random Variables – Expected P Values. The American Statistician, 53(4), 326–331.

Sterne, J. A. C. & Smith, G. D. (2001). Sifting the evidence – what's wrong with significance tests? BMJ, 322, 226–231.

Stone, C. J., Hansen, M., Kooperberg, C. & Truong, Y. K. (1997). The use of polynomial splines and their tensor products in extended linear modeling (with discussion). Annals of Statistics, 25, 1371–1470.

Tanner, M. A. & Wong, W. H. (1987). The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82(398), 528–540.

Vermunt, J. K. (2010). Latent Class Models. In P. Peterson, E. Baker, & B. McGaw (Eds.), International Encyclopedia of Education (pp. 238–244). Oxford: Elsevier.

Vermunt, J. K., & Magidson, J. (2005). Technical Guide for Latent GOLD 4.0: Basic and Advanced. Belmont, Massachusetts: Statistical Innovations Inc.
A  The EM algorithm

Because the LC membership is unobservable, the (logarithm of the) likelihood is hard to maximize: the summation inside the logarithm makes it impossible to separate the product terms. It is possible, however, to use a sequential algorithm if we provide starting values for the missing data (i.e., the unobserved class memberships). Combining Equations 1-3 to obtain the likelihood gives

P(Y_s) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{y^*_{sjr}}    (11)

and taking the logarithm gives the log-likelihood

\log P(Y_s) = \log \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{y^*_{sjr}}.    (12)

With class membership unobservable, this expression cannot be maximized analytically. However, if we impute values for the missing class memberships (also called data augmentation, e.g., Ligtvoet & Vermunt, 2012), the expression can be written as

\log P(Y_s) = n_s \sum_{c=1}^{C} \pi_{c|s} \log \left( \pi_c \prod_{j=1}^{J} \prod_{r=1}^{R} \pi_{jrc}^{y^*_{sjr}} \right).

Now, the EM algorithm consists of sequentially updating π_{c|s} (providing π_c)
and π_{jrc} to maximize

\log L = \sum_{s=1}^{S} \log P(Y_s).

The algorithm continues until the change in the log-likelihood between iterations t and t + 1 falls below a given convergence criterion. The values for which this log-likelihood is maximized are the ML estimates.

When using the EM algorithm it can, however, happen that convergence is attained at a local maximum. To guard against this, one commonly runs the algorithm from multiple sets of starting values and takes the values of ψ yielding the highest log-likelihood as the ML estimates.

The 0 log(0) = 0 convention

In order to let only observed patterns contribute to the likelihood, I used the convention that 0 log(0) = 0. This is needed because log(0) is undefined, so multiplying log(0) by 0 does not automatically yield 0. The justification for the convention is as follows. If I define the natural logarithm as

\log(x) = \int_{1}^{x} \frac{1}{t} \, dt

and need a reasonable value for 0 log(0), I should take the limit as x approaches 0 from above. Using l'Hôpital's rule one can show that although log(0) is undefined, the limit of
x log(x) as x approaches zero is

\lim_{x \downarrow 0} x \log(x) = \lim_{x \downarrow 0} \frac{\log(x)}{x^{-1}} = \lim_{x \downarrow 0} \frac{x^{-1}}{-x^{-2}} = \lim_{x \downarrow 0} (-x) = 0.
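To make the updating scheme concrete, the following is a minimal sketch, in Python/NumPy, of the EM iterations for an LC model with dichotomous items using the notation above. The function name lc_em, the random starting values, and the convergence settings are illustrative assumptions rather than the implementation used for the analyses in this thesis; working with per-respondent log-probabilities means only observed patterns enter the likelihood, in line with the 0 log(0) = 0 convention.

```python
import numpy as np

def lc_em(Y, C, max_iter=1000, tol=1e-8, seed=0):
    """EM for a latent class model with dichotomous items (sketch).

    Y: (N, J) array of 0/1 responses; C: number of latent classes.
    Returns the class sizes pi_c, the conditional response
    probabilities P(y_j = 1 | class c), and the final log-likelihood.
    """
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    pi_c = np.full(C, 1.0 / C)              # class sizes pi_c
    p = rng.uniform(0.2, 0.8, size=(C, J))  # P(y_j = 1 | class c)
    loglik_old = -np.inf
    for _ in range(max_iter):
        # E-step: posterior memberships pi_{c|i} for each respondent i:
        # log P(y_i, c) = log pi_c + sum_j [y_ij log p_cj + (1 - y_ij) log(1 - p_cj)]
        log_joint = (np.log(pi_c)
                     + Y @ np.log(p).T
                     + (1 - Y) @ np.log(1 - p).T)   # shape (N, C)
        log_py = np.logaddexp.reduce(log_joint, axis=1)
        post = np.exp(log_joint - log_py[:, None])  # pi_{c|i}
        # M-step: update pi_c and p given the expected memberships
        pi_c = post.mean(axis=0)
        p = (post.T @ Y) / post.sum(axis=0)[:, None]
        p = np.clip(p, 1e-10, 1 - 1e-10)            # keep the logs finite
        loglik = log_py.sum()
        if loglik - loglik_old < tol:               # convergence criterion
            break
        loglik_old = loglik
    return pi_c, p, loglik
```

In line with the remark about local maxima above, one would call lc_em from several random seeds and keep the solution with the highest log-likelihood.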
B  The Gibbs sampler (in LC analysis)

The Gibbs sampler can be used to estimate the LC model, as described in Section 2.2, but also to perform the PPC (see Section 3.3) as a means of testing model fit. The Bayesian model-fit approach compares the goodness-of-fit statistic T_obs to a reference distribution that is obtained by averaging the distribution P(T|ψ) over the posterior P(ψ|y). When the posterior distribution cannot be calculated analytically (or only tediously so), one can use simulation to estimate it. Here I show in detail how to obtain the posterior (predictive) distribution for ψ and y^{rep} and how to perform the PPC. The method goes as follows:

Step 1. Assume that the model is true.
Step 2a. Draw a sample from the posterior distribution, ψ^l ~ P(ψ|y).
Step 2b. Generate a replicate dataset, y^{rep,l} ~ P(y^{rep}|ψ^l).
Step 2c. Repeat Steps 2a and 2b to obtain L draws from the posterior predictive distribution.
Step 3. Estimate the LC model under H_0 on each replicate dataset and calculate the statistic T^l_{rep}.

Drawing the samples in Step 2 had to be split into several parts because it is not straightforward to draw samples of the LC model parameters directly from their posterior distribution. The following text
discusses how to specify the posterior distribution and how to obtain samples from it using the Gibbs sampler. This discussion concerns obtaining (draws from) the posterior distribution P(ψ|y); note that it applies to the Gibbs sampler both in the estimation process and in the PPC.

The posterior distribution of ψ can be obtained using Bayes' rule:

P(\psi|y) = \frac{P(y|\psi) P(\psi)}{P(y)}    (13)
\propto P(y|\psi) P(\psi).    (14)

The term P(y) is called the marginal likelihood or normalizing constant. To draw samples from the posterior we can simply use Equation 14, because the shape of a distribution is not affected by multiplying or dividing by a constant. As can be seen, however, one does need a prior distribution P(ψ) for the parameters in ψ, which can be used to include prior knowledge (or the lack thereof) about the parameters of interest.

For each set of multinomial parameters (e.g., π_{jrc}, r = 1, ..., R_j) I have used a Dirichlet prior distribution. For dichotomous variables (R_j = 2 for all j), I could equivalently have used Beta distributions (Gelman et al., 2004), but for the sake of generality I show the use of the Dirichlet distribution here. For example, the prior distribution of the conditional response probabilities
of a person in LC c = 1, ..., C on item j = 1, ..., J is given by

P(\pi_{j1c}, \dots, \pi_{jR_j c}) = \frac{\Gamma\left(\sum_{q=1}^{R_j} \alpha_{jqc}\right)}{\prod_{q=1}^{R_j} \Gamma(\alpha_{jqc})} \prod_{r=1}^{R_j} \pi_{jrc}^{\alpha_{jrc}-1}    (15)
\propto \prod_{r=1}^{R_j} \pi_{jrc}^{\alpha_{jrc}-1}.    (16)

It is commonplace to ignore the constant, to indicate only those parts of the distribution that involve the parameters (here, π_{jrc}), and to use the proportionality property. The prior distribution for the class sizes is given by

P(\pi_1, \dots, \pi_C) \propto \prod_{c=1}^{C} \pi_c^{\alpha_c - 1}.    (17)

In an absolute sense, the values of the hyperparameters α_{jrc} indicate the strength of one's prior belief about the probability of giving response r to item j in class c, while their relative sizes indicate the relative probabilities of the responses (Rubin & Stern, 1994); the α_c are used likewise for the class sizes. To indicate no prior knowledge about the items or the LC sizes, I use only vague (diffuse) priors in the analysis, with \sum_c \alpha_c = \sum_r \alpha_{jrc} = 1 (see Section 2.2).

The prior distribution of the entire set ψ is the product of the priors on
the elements in it:

p(\psi) = \prod_{c=1}^{C} \pi_c^{\alpha_c - 1} \left( \prod_{r=1}^{R_1} \pi_{1rc}^{\alpha_{1rc}-1} \times \cdots \times \prod_{r=1}^{R_J} \pi_{Jrc}^{\alpha_{Jrc}-1} \right)    (18)

and the posterior is then obtained by combining this prior distribution with the likelihood (Equation 11) of the LC model (Rubin & Stern, 1994):

P(\psi|y) \propto \prod_{s=1}^{S} P(Y_s)^{n_s} \, P(\psi).    (19)

As indicated earlier, this posterior distribution does not have a convenient form to sample from. But, as it turns out, augmenting the data with estimates of the unobserved LC memberships makes the model estimable. As shown in Section 2.2, the Gibbs sampler can be used to estimate the LC model in an iterative fashion, but it requires that unobserved indicators of the LC memberships be used to augment the data. In this way it is possible to obtain conditional distributions of the parameters given the LC membership (Tanner & Wong, 1987). To illustrate, let Z_{sic} = 1 if the ith observation in the sth cell of the contingency table (i = 1, ..., n_s; s = 1, ..., S) belongs to LC c, and 0 otherwise. Then the joint distribution is

P(\psi, Z, y) \propto \prod_{s=1}^{S} \prod_{i=1}^{n_s} \prod_{c=1}^{C} \left( \pi_c \, P(Y_s|\theta = c) \right)^{Z_{sic}} P(\psi).    (20)

The distribution of ψ conditional on Z and y is given by a product of independent Dirichlet distributions with hyperparameters α_{jrc} + n_{jrc} and α_c + m_c.
The conditional probability P(Z|ψ, y) is given by a Bernoulli distribution for each Z_{sic}. Using Bayes' rule, the probability that Z_{sic} = 1 is obtained from Equation 4:

P(Z_{sic} = 1 | \psi, y) = \frac{P(Y_s|\theta = c) \, \pi_c}{P(Y_s)}.    (21)

These conditional distributions are easy to sample from (see Section 2.2). The Gibbs sampler described in this thesis samples from them iteratively, and at convergence the sampled values of Z and ψ are draws from the joint posterior distribution P(Z, ψ|y) (Rubin & Stern, 1994; Tanner & Wong, 1987). To avoid correlations between the samples, one is advised not to use successive draws but, for instance, to retain only every 50th draw or so.

To obtain the replicate data y^{rep,l} in Step 2b as a draw from the predictive distribution P(y^{rep}|ψ^l), we just need to draw N observations from a multinomial distribution with cell probabilities P(Y_s) computed from ψ^l.
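As a concrete illustration of Steps 2a and 2b, the sketch below implements the data-augmentation Gibbs sampler for dichotomous items and draws replicate datasets from the posterior predictive distribution. It is a minimal sketch under the setup above: the function name lc_gibbs_ppc, the burn-in length, and the symmetric priors with default hyperparameter alpha are illustrative assumptions, while the thinning interval of 50 mirrors the suggestion to retain only every 50th draw. Computing T^l_{rep} (Step 3) would additionally require refitting the H_0 model to each replicate, which is omitted here.

```python
import numpy as np

def lc_gibbs_ppc(Y, C, n_reps=500, burn_in=1000, thin=50, alpha=1.0, seed=0):
    """Data-augmentation Gibbs sampler for an LC model with dichotomous
    items, returning replicate datasets for the PPC (sketch).

    With R_j = 2, the Dirichlet posteriors for the response probabilities
    reduce to Beta posteriors; the class sizes keep a Dirichlet posterior.
    """
    rng = np.random.default_rng(seed)
    N, J = Y.shape
    pi_c = np.full(C, 1.0 / C)              # class sizes
    p = rng.uniform(0.2, 0.8, size=(C, J))  # P(y_j = 1 | class c)
    reps = []
    for t in range(burn_in + n_reps * thin):
        # Draw Z | psi, y: class membership per respondent (Equation 21)
        log_joint = (np.log(pi_c)
                     + Y @ np.log(p).T
                     + (1 - Y) @ np.log(1 - p).T)
        prob = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        z = (rng.random((N, 1)) > prob.cumsum(axis=1)).sum(axis=1)
        # Draw psi | Z, y: independent Dirichlet/Beta posteriors with
        # hyperparameters alpha_c + m_c and alpha_jrc + n_jrc
        m_c = np.bincount(z, minlength=C)
        pi_c = rng.dirichlet(alpha + m_c)
        n1 = np.array([Y[z == c].sum(axis=0) for c in range(C)])  # (C, J)
        p = rng.beta(alpha + n1, alpha + (m_c[:, None] - n1))
        # After burn-in, keep every thin-th draw and generate y^rep
        if t >= burn_in and (t - burn_in) % thin == 0:
            z_rep = rng.choice(C, size=N, p=pi_c)
            y_rep = (rng.random((N, J)) < p[z_rep]).astype(int)
            reps.append(y_rep)
    return reps
```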
C  Figures

[Figure 1: Example of trace plot (x-axis: iteration; y-axis: T_rep(X^2)) and density plot (x-axis: replicated X^2 values; y-axis: density) for the PPC in the empirical data. The dashed lines indicate X^2_obs = 4.223, pp = .554.]
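The pp value reported in Figure 1 is, by definition, the proportion of replicated statistics at least as large as the observed one. A one-line sketch (the helper name pp_value is hypothetical):

```python
import numpy as np

def pp_value(T_rep, T_obs):
    """Posterior predictive p-value: the share of replicated statistics
    that are at least as extreme as the observed statistic."""
    return (np.asarray(T_rep) >= T_obs).mean()

# For the draws underlying Figure 1, pp_value(T_rep, 4.223) gives about .554.
```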
[Figure 2: P-value log-densities for the 2-class model with N = 1000. Panels show the L^2, X^2, and BVR p-values; curves compare asymptotic p-values, bootstrap, PPC, and discrepancy against the uniform density.]
[Figure 3: P-value log-densities for the 2-class model with N = 100. Panels show the L^2, X^2, and BVR p-values; curves compare asymptotic p-values, bootstrap, PPC, and discrepancy against the uniform density.]