A measurement error model approach to survey data integration: combining information from two surveys

Jae Kwang Kim, Iowa State University
Joint work with Seho Park
2017 SAE conference, Paris
July 11th, 2017

Survey data integration
Goal: combine information from multiple surveys
Three situations
1 Multiple samples for one target population
2 One sample each from multiple populations
3 Multiple samples from multiple populations
Small area estimation is a special case of survey data integration, in
which the multiple sub-populations are the domains.

Motivation
The USAID Bureau for Food Security (BFS) sponsors the Food and Nutrition
Technical Assistance III (FANTA) project.
Its key technical areas are food security, maternal and child health,
agriculture, and livelihoods strengthening.

Motivation
FANTA has two projects: the Feed the Future (FTF) and Food for Peace
(FFP) development projects.
The FFP survey was conducted by ICF International, and the FTF survey was
conducted by UNC MEASURE.
The two surveys were conducted in 2013 in selected departments of
Guatemala: San Marcos, Totonicapan, Quiche, Quetzaltenango, and
Huehuetenango.

Map of Guatemala

FFP and FTF Projects in Guatemala
Figure: Selected Departments in Guatemala

Overlap Area
Figure: FTF ZOI and FFP Project Implementation Area for Guatemala

Overlap Area
Table: Overlap Area: Departments and Municipalities
Department        Municipality
San Marcos        Sibinal
                  Tajumulco
Totonicapan       Momostenango
                  Santa Lucia La Reforma
Huehuetenango     Chiantla
                  Concepcion Huista
                  Jacaltenango
                  San Antonio Huista
                  Todos Santos
Quetzaltenango    San Juan Ostuncalco
Quiche            Chichicastenango
                  (Santa Maria) Nebaj
                  Uspantan
                  Cunen
                  San Juan Cotzal

Common Indicators
Each survey has its own indicators; 11 indicators common to both
were chosen for study.
The common items cover women's nutritional status, children's
well-being status, and the prevalence of household poverty.

Common Indicators
Table: Common Indicators
Indicator                                    Description
Daily Per Capita Expenditures (PCE)          Average daily per capita consumption, in constant 2010 USD
Prevalence of Poverty (PP)                   Prevalence of poverty: percentage of people living on less than $1.25 USD per capita per day
Mean Depth Poverty (MDP)                     Average of the differences between total daily
Prevalence of Households with Hunger (HHS)   Prevalence of households with moderate or severe hunger
Prevalence of Underweight Women              Women eligible for BMI measurement (not currently pregnant and not within 2 months of delivery) who have a BMI below 18.5
Women's Dietary Diversity Score (WDDS)       Mean number of food groups consumed by women of reproductive age (15-49 years)

Common Indicators
Table: Common Indicators (Cont'd)
Indicator                                                         Description
Prevalence of Stunted Children                                    Prevalence of stunted children under five years of age (0-59 months)
Prevalence of Wasted Children                                     Prevalence of wasted children under five years of age (0-59 months)
Prevalence of Underweight Children                                Prevalence of underweight children under five years of age (0-59 months)
Prevalence of Children Receiving a Minimum Acceptable Diet (MAD)  Prevalence of children 6-23 months receiving a minimum acceptable diet
Prevalence of Exclusive Breastfeeding (EBF)                       Prevalence of exclusive breastfeeding of children under six months of age

Estimates from two surveys
Table: Daily Per Capita Expenditure
                        FFP/ICF                   FTF/UNC
Department          N     Mean    S.E.        N     Mean    S.E.    T-statistic
San Marcos       1419    0.558   0.014      981    1.166   0.018      -23.376
Totonicapan      1654    0.388   0.015      181    0.896   0.039       -5.505
Huehuetenango     877    0.456   0.023     1535    1.140   0.018      -30.587
Quetzaltenango    628    0.695   0.022       60    1.325   0.112      -26.179
Quiche           1288    0.382   0.015     1350    1.045   0.015      -12.179

Estimates from two surveys
Table: Prevalence of Households with Hunger (%)
                        FFP/ICF                   FTF/UNC
Department          N     Mean    S.E.        N     Mean    S.E.    T-statistic
San Marcos       1419     3.76    0.50      981    15.35    1.08       -9.733
Totonicapan      1654    11.79    0.87      181    15.01    2.72       -1.125
Huehuetenango     877     8.91    0.91     1535    15.58    0.87       -5.323
Quetzaltenango    628     6.84    0.91       60     9.94    3.96       -0.765
Quiche           1288     7.13    0.74     1350     9.73    0.77       -2.430
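
The slides do not state how the T-statistics were computed. A minimal sketch, assuming the two surveys are treated as independent samples so that the variance of the difference is the sum of the squared standard errors, approximately reproduces the values in the hunger table. The function name `two_sample_t` is hypothetical.

```python
import math

def two_sample_t(mean_a, se_a, mean_b, se_b):
    """Difference of two estimates divided by its standard error.

    Assumes the two estimates are independent (an assumption, not stated
    in the source), so Var(diff) = se_a^2 + se_b^2.
    """
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# San Marcos row of the hunger table: FFP/ICF vs. FTF/UNC.
t_san_marcos = two_sample_t(3.76, 0.50, 15.35, 1.08)
print(round(t_san_marcos, 2))  # close to the reported -9.733 (inputs are rounded)
```

Because the tabulated means and standard errors are rounded, the recomputed value matches the reported one only to about two decimal places.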

Data Structure
Table: Data Structure
            X    Ya   Yb
Sample A    o    o
Sample B    o         o

Goal: Synthetic data imputation
Table: Data Structure
            X    Ya   Yb
Sample A    o    o    o
Sample B    o    o    o

Methodology
Steps
1 Specify a measurement error model.
2 Derive the prediction model using Bayes' theorem.
3 Estimate the parameters with the EM algorithm.
4 Generate imputed values from the prediction model.

Step 1: Model specification
Assume that Sample A is the gold standard; that is, Ya = Y.
Structural equation model
Ya ∼ f1(ya | x; θ1).
From the observations in Sample A, we can perform model
diagnostics.
Measurement error model
Yb ∼ f2(yb | ya; θ2).
Assume that the measurement error is nondifferential:
f(yb | x, ya) = f(yb | ya).
For dichotomous y-variables, the measurement error model becomes a
misclassification model.

Step 2: Prediction model
The prediction model is the model for the counterfactual (unobserved)
outcome, conditional on the observed values.
Prediction model for Yb in sample A:
p(yb | x, ya) = f2(yb | ya).
Prediction model for Ya in sample B: Using Bayes formula, we can
derive
\[
p(y_a \mid x, y_b) = \frac{f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)}{\int f_1(y_a \mid x; \theta_1)\, f_2(y_b \mid y_a; \theta_2)\, dy_a}
\]
The prediction model can be used to obtain the best prediction of Yai
for i ∈ Sb.
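
As an illustration of the Bayes formula in this step, the sketch below evaluates the prediction model numerically on a grid and compares its mean with the closed form available when both f1 and f2 are normal. The normal linear forms (the ones the talk later uses for PCE), the parameter values, and all function names are assumptions made for the example.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def predict_ya_grid(x_beta, sd_e, a0, a1, sd_u, yb, grid):
    """E(Ya | x, yb) obtained by normalizing f1 * f2 over an evenly spaced grid.

    f1 is N(x_beta, sd_e^2) for ya; f2 is N(a0 + a1 * ya, sd_u^2) for yb.
    """
    joint = normal_pdf(grid, x_beta, sd_e) * normal_pdf(yb, a0 + a1 * grid, sd_u)
    weights = joint / joint.sum()  # grid spacing cancels in the ratio
    return float(np.sum(grid * weights))

def predict_ya_closed_form(x_beta, sd_e, a0, a1, sd_u, yb):
    """Same conditional mean from the normal-normal closed form."""
    v = 1.0 / (1.0 / sd_e ** 2 + a1 ** 2 / sd_u ** 2)
    return v * (x_beta / sd_e ** 2 + a1 * (yb - a0) / sd_u ** 2)

# Hypothetical parameter values, chosen only for illustration.
grid = np.linspace(-4.0, 6.0, 20001)
print(predict_ya_grid(1.0, 0.5, 0.2, 1.5, 0.4, 2.0, grid))
print(predict_ya_closed_form(1.0, 0.5, 0.2, 1.5, 0.4, 2.0))  # should agree closely
```

The grid version works for any pair of densities; the closed form is specific to the normal case.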

Step 3: Parameter estimation - EM algorithm
E-step: compute
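\[
Q_1(\theta_1 \mid \text{data}; \hat\theta^{(t)}) = \sum_{i \in S_a} w_{i,a} \log f_1(y_{ai} \mid x_i; \theta_1) + \sum_{i \in S_b} w_{i,b}\, E\{\log f_1(Y_a \mid x_i; \theta_1) \mid x_i, y_{bi}; \hat\theta^{(t)}\}
\]
\[
Q_2(\theta_2 \mid \text{data}; \hat\theta^{(t)}) = \sum_{i \in S_a} w_{i,a}\, E\{\log f_2(Y_b \mid y_{ai}; \theta_2) \mid x_i, y_{ai}; \hat\theta^{(t)}\} + \sum_{i \in S_b} w_{i,b}\, E\{\log f_2(y_{bi} \mid Y_a; \theta_2) \mid x_i, y_{bi}; \hat\theta^{(t)}\},
\]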
where the conditional expectations are computed from the prediction
model in Step 2.
M-step: update the parameters by maximizing Q1 and Q2 with respect to θ1 and
θ2, respectively.
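
The following is a minimal sketch of one EM iteration, assuming the normal linear forms the talk later uses for PCE (ya = xβ + e, yb = α0 + α1 ya + u), for which the conditional expectations in the E-step have closed forms. The function names, the parameter tuple `theta`, and the array layout are hypothetical; this is not the authors' implementation.

```python
import numpy as np

def e_step_b(x_b, y_b, theta):
    """Mean and variance of Ya given (x, yb) for Sample B units."""
    beta, s2e, a0, a1, s2u = theta
    v = 1.0 / (1.0 / s2e + a1 ** 2 / s2u)
    m = v * (x_b @ beta / s2e + a1 * (y_b - a0) / s2u)
    return m, v

def em_step(x_a, y_a, w_a, x_b, y_b, w_b, theta):
    """One EM iteration; theta = (beta, s2e, a0, a1, s2u)."""
    beta, s2e, a0, a1, s2u = theta
    m, v = e_step_b(x_b, y_b, theta)
    n_a, n_b = len(y_a), len(y_b)
    X = np.vstack([x_a, x_b])
    w = np.concatenate([w_a, w_b])
    # E-step: conditional moments of Ya (missing in B) and Yb (missing in A).
    e_ya = np.concatenate([y_a, m])
    v_ya = np.concatenate([np.zeros(n_a), np.full(n_b, v)])
    e_yb = np.concatenate([a0 + a1 * y_a, y_b])
    v_yb = np.concatenate([np.full(n_a, s2u), np.zeros(n_b)])
    # M-step for theta1: weighted least squares of E(Ya) on x.
    XtW = X.T * w
    beta_new = np.linalg.solve(XtW @ X, XtW @ e_ya)
    s2e_new = np.sum(w * ((e_ya - X @ beta_new) ** 2 + v_ya)) / w.sum()
    # M-step for theta2: expected normal equations for Yb regressed on Ya.
    lhs = np.array([[w.sum(), np.sum(w * e_ya)],
                    [np.sum(w * e_ya), np.sum(w * (e_ya ** 2 + v_ya))]])
    rhs = np.array([np.sum(w * e_yb), np.sum(w * e_ya * e_yb)])
    a0_new, a1_new = np.linalg.solve(lhs, rhs)
    resid2 = (e_yb - a0_new - a1_new * e_ya) ** 2 + v_yb + a1_new ** 2 * v_ya
    s2u_new = np.sum(w * resid2) / w.sum()
    return beta_new, s2e_new, a0_new, a1_new, s2u_new

def observed_loglik(x_a, y_a, w_a, x_b, y_b, w_b, theta):
    """Weighted observed-data log-likelihood, useful for monitoring convergence."""
    beta, s2e, a0, a1, s2u = theta
    def logpdf(y, mean, var):
        return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
    ll_a = np.sum(w_a * logpdf(y_a, x_a @ beta, s2e))
    ll_b = np.sum(w_b * logpdf(y_b, a0 + a1 * (x_b @ beta), a1 ** 2 * s2e + s2u))
    return ll_a + ll_b
```

Calling `em_step` repeatedly from a starting value and stopping when `observed_loglik` stops increasing gives the full algorithm. Because Ya and Yb are never observed together, convergence can be slow, so a generous iteration limit is advisable.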

Step 4: Best prediction
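Using the measurement error model, we can predict yai by
\[
\hat y_{ai} = E(Y_a \mid x_i, y_{bi}) \quad \text{for } i \in S_b.
\]
A prediction estimator of µ = E(Ya) can be obtained by
\[
\hat\mu^{*} = \frac{\sum_{i \in S_a} w_{i,a}\, y_{ai} + \sum_{i \in S_b} w_{i,b}\, \hat y_{ai}}{\sum_{i \in S_a} w_{i,a} + \sum_{i \in S_b} w_{i,b}}
\]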
Reference: Kim, Berg, and Park (2016). Statistical Matching using
fractional imputation. Survey Methodology, 42, 19–40.
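
The prediction estimator in this step is a weighted mean over the pooled samples, using observed values in Sample A and predicted values in Sample B. A minimal sketch follows; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def prediction_estimator(w_a, y_a, w_b, y_hat_b):
    """Weighted mean of observed Ya (Sample A) and predicted Ya (Sample B)."""
    w_a, y_a = np.asarray(w_a, dtype=float), np.asarray(y_a, dtype=float)
    w_b, y_hat_b = np.asarray(w_b, dtype=float), np.asarray(y_hat_b, dtype=float)
    numerator = np.sum(w_a * y_a) + np.sum(w_b * y_hat_b)
    denominator = np.sum(w_a) + np.sum(w_b)
    return numerator / denominator

# Toy example: two Sample A units with weight 1, one Sample B unit with weight 2.
print(prediction_estimator([1, 1], [1.0, 3.0], [2], [2.0]))  # (1 + 3 + 4) / 4 = 2.0
```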

Application to FANTA project
1 Model for PCE
\[ y_{ai} = x_i \beta + e_i \]
\[ y_{bi} = \alpha_0 + \alpha_1 y_{ai} + u_i \]
where $e_i \sim N(0, \sigma_e^2)$ and $u_i \sim N(0, \sigma_u^2)$.
2 Model for HHS prevalence
yai ∼ Bernoulli(πi )
ybi ∼ Bernoulli{pyai + q(1 − yai )}
where logit(πi ) = xi β and p, q ∈ (0, 1).
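
For the HHS model, applying Bayes' formula to the two Bernoulli components gives the probability that yai = 1 given xi and the observed ybi. The sketch below assumes p = P(yb = 1 | ya = 1) and q = P(yb = 1 | ya = 0), as in the model above; the function name and the example values are hypothetical.

```python
import numpy as np

def predict_ya_binary(x, yb, beta, p, q):
    """P(ya = 1 | x, yb) under the misclassification model.

    pi = expit(x @ beta) is the structural probability that ya = 1.
    """
    pi = 1.0 / (1.0 + np.exp(-(x @ beta)))
    like1 = np.where(yb == 1, p, 1.0 - p)  # P(yb | ya = 1)
    like0 = np.where(yb == 1, q, 1.0 - q)  # P(yb | ya = 0)
    return pi * like1 / (pi * like1 + (1.0 - pi) * like0)

# Hypothetical values: x @ beta = 0 gives pi = 0.5.
x = np.array([[0.0], [0.0]])
yb = np.array([1, 0])
print(predict_ya_binary(x, yb, np.array([0.0]), p=0.8, q=0.1))
```

With pi = 0.5, p = 0.8 and q = 0.1, the two probabilities are 0.4/0.45 and 0.1/0.55, so a reported yb = 1 raises the predicted probability and yb = 0 lowers it.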

Model Diagnostics for PCE model
Figure: Fitted Values vs. Residuals plot (axes: Fitted Values, Residuals) and Normal Q-Q Plot (axes: Theoretical Quantiles, Sample Quantiles)

Results for PCE indicator
Department        FFP              FTF              Combined
San Marcos        0.558 (0.030)    1.165 (0.038)    0.563 (0.026)
Totonicapan       0.388 (0.030)    0.895 (0.085)    0.331 (0.028)
Quiche            0.382 (0.030)    1.045 (0.031)    0.396 (0.026)
Huehuetenango     0.456 (0.044)    1.140 (0.036)    0.479 (0.027)
Quetzaltenango    0.695 (0.044)    1.325 (0.232)    0.795 (0.043)

Results for HHS indicator
Department        FFP              FTF              Combined
San Marcos         3.76 (1.01)     15.35 (2.22)      3.77 (1.00)
Totonicapan       11.79 (1.70)     15.01 (6.00)     12.08 (1.60)
Quiche             7.13 (1.50)      9.73 (1.57)      7.19 (1.42)
Huehuetenango      8.91 (1.90)     15.58 (2.00)      8.75 (1.90)
Quetzaltenango     6.84 (1.80)      9.94 (8.25)      6.85 (1.70)

Concluding remark
Survey data integration using a measurement error model is considered.
Prediction of the counterfactual outcome is obtained by Bayes'
theorem.
Parameter estimation uses the EM algorithm.
A Bayesian approach can be developed (not discussed here).
Extension to a GLMM for the structural equation model is in
progress.
