The demand for various socio-economic and health statistics for small geographical areas is steadily increasing at a time when survey agencies are looking for ways to reduce costs to meet fixed budgetary requirements. In the current survey environment, applying standard sample survey methods to small areas, which requires large samples, is generally not feasible on cost grounds. One of the key factors behind the success of small area methodology, which typically uses implicit or explicit models to combine survey and administrative data sources, is the availability of strong auxiliary variables. The accessibility of big data from different sources is now bringing new opportunities for statisticians to develop innovative small area methods. In this talk, I will first give an overview of the current state of small area research and then discuss the potential for the use of big data in producing reliable local area statistics.
Dr. Partha Lahiri is Professor of Survey Methodology and Statistics at the University of Maryland, College Park. Prior to coming to Maryland, Dr. Lahiri was the Milton Mohr Distinguished Professor of Statistics at the University of Nebraska-Lincoln. His research interests include Bayesian statistics, record linkage and small-area estimation. Dr. Lahiri has served on a number of advisory committees, including the U.S. Census Advisory Committee and a U.S. National Academy panel. Over the years Dr. Lahiri has advised various local and international organizations such as the United Nations Development Program, the World Bank, and the Gallup Organization. He is a Fellow of the American Statistical Association and the Institute of Mathematical Statistics and an elected member of the International Statistical Institute.
1. Can big data help in the production of reliable local area statistics?
Partha Lahiri
Joint Program in Survey Methodology
University of Maryland, College Park, USA
SDAL, Virginia Tech.
January 28, 2015
SDAL January 28, 2015 1 / 27
3. Remote Sensing for Crop Acreage
The NASS-USDA has been publishing county estimates of crop
acreage, crop production, crop yield and livestock inventories since
1917.
Uses: local agricultural decision making, payments to farmers if crop
yields are below certain levels.
Can earth resources satellite data provide a useful ancillary data source
for county estimates of crop acreage?
Satellite information is recorded for pixels (a term for picture
elements). A pixel is about 0.45 hectares.
Based on satellite readings in early fall, it is possible to classify the
crop cover for all pixels. This generates big data.
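To make the scale concrete, here is a minimal Python sketch that tallies classified pixels by crop and converts the tallies to hectares, using the figure of roughly 0.45 hectares per pixel quoted above. The function name and the toy list of labels are hypothetical and serve only as an illustration.

```python
from collections import Counter

HECTARES_PER_PIXEL = 0.45  # approximate pixel size quoted on the slide


def classified_area(pixel_labels):
    """Tally classified pixels by crop and convert each tally to hectares."""
    counts = Counter(pixel_labels)
    return {crop: n * HECTARES_PER_PIXEL for crop, n in counts.items()}


# Toy classification of five pixels (hypothetical labels)
areas = classified_area(["corn", "corn", "soybeans", "corn", "oats"])
```

Because pixels can be misclassified, tallies of this kind enter the models later in the talk as auxiliary variables rather than as acreage estimates in their own right.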
4. [Figure: land cover map, 2011 Hardin County, Iowa; scale bar 0 to 4.24 miles]
Ref: http://www.nass.usda.gov/Statistics-by-State/Iowa/Publications/Cropland-Data-Layer/2011/index.asp
Land cover categories (by decreasing acreage*):
Agriculture: Corn, Soybeans, Grassland Herbaceous, Alfalfa, Other Hay/Non Alfalfa, Oats, Winter Wheat, Rye, Fallow/Idle Cropland, Sod/Grass Seed
Non-agriculture: Developed/Open Space, Deciduous Forest, Developed/Low Intensity, Woody Wetlands, Open Water, Developed/Medium Intensity
Produced by CropScape - http://nassgeodata.gmu.edu/CropScape
* Only the top 6 non-agriculture categories are listed.
5. Remote Sensing for Crop Acreage
Bellow et al.
NASS has been a user of remote sensing products since the
1950’s when it began using midaltitude aerial photography to
construct area sampling frames (ASF’s) for the 48 states of the
continental United States. A new era in remote sensing began in
1972 with the launch of the Landsat I earth-resource monitoring
satellite. Four additional Landsats have been launched since
1972, with Landsat IV and V still in operation in 1993. The
polar-orbiting Landsat satellites contain a multi-spectral scanner
(MSS) that measures reflected energy in four bands of the
electromagnetic spectrum for an area of just under one acre. The
spectral bands were selected to be responsive to vegetation
characteristics.
6. Remote Sensing for Crop Acreage
In addition to the MSS sensor, Landsats IV and V have a
Thematic Mapper (TM) sensor which measures seven energy
bands and has increased spatial resolution. The large area (185
by 170 km) and repeat (16 day per satellite) coverage of these
satellites opened new areas of remote sensing research: large area
crop inventories, crop yields, land cover mapping, area frame
stratification, and small area crop cover estimation.
10. Unit Level Model
$y_{ij}$: value of the study variable for the $j$th unit of the $i$th small area
population ($i = 1, \dots, m$; $j = 1, \dots, N_i$)
We are interested in estimating the finite population means:
\[ \bar{Y}_i = N_i^{-1} \sum_{j=1}^{N_i} y_{ij}. \]
Nested Error Regression Model
\[ y_{ij} = x_{ij}^\top \beta + v_i + e_{ij}, \]
where $x_{ij}$ is a $p \times 1$ column vector of known auxiliary variables; $\{v_i\}$ and
$\{e_{ij}\}$ are all independent with $v_i \overset{iid}{\sim} N(0, \sigma_v^2)$ and $e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$.
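As a rough illustration of what the nested error regression model generates, the following sketch simulates a finite population area by area and computes the finite population means. The function names, the list-of-arrays layout, and the use of NumPy's random generator are assumptions made for this example, not part of the talk.

```python
import numpy as np


def simulate_nested_error(X_by_area, beta, sigma_v, sigma_e, rng):
    """Draw y_ij = x_ij' beta + v_i + e_ij for every unit in every area.

    X_by_area: list of (N_i, p) arrays, one per small area.
    Returns a list of length-N_i arrays of simulated y values.
    """
    y_by_area = []
    for X in X_by_area:
        v_i = rng.normal(0.0, sigma_v)                    # area effect, shared within the area
        e_ij = rng.normal(0.0, sigma_e, size=X.shape[0])  # unit-level errors
        y_by_area.append(X @ beta + v_i + e_ij)
    return y_by_area


def population_means(y_by_area):
    """Finite population mean of y for each area."""
    return np.array([y.mean() for y in y_by_area])
```

The shared draw `v_i` is what makes units in the same area correlated; setting `sigma_v` to zero reduces the model to an ordinary regression.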
11. An Example
Estimation of the number of hectares of corn for 12 Iowa counties
based on the 1978 June Enumerative Survey and satellite data.
$y_{ij}$: the number of hectares of corn in the $j$th segment of the $i$th
county as reported in the June Enumerative Survey.
$x_{ij} = (1, x_{1ij}, x_{2ij})^\top$, where $x_{1ij}$ ($x_{2ij}$) is the number of pixels classified as
corn (soybean) in the $j$th segment of the $i$th county.
$\bar{X}_i = (1, \bar{X}_{1i}, \bar{X}_{2i})^\top$, where $\bar{X}_{1i}$ ($\bar{X}_{2i}$) is the mean number of pixels per
segment classified as corn (soybean) for county $i$.
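12. EBLUP
EBLUP (EB) estimators of $\bar{Y}_i$:
\[ \bar{y}_i^{EB} = f_i \hat{\bar{Y}}_i^{Reg} + (1 - f_i)\{(1 - \hat{B}_i)\hat{\bar{Y}}_i^{Reg} + \hat{B}_i \hat{\bar{Y}}_i^{Syn}\}, \]
where
\[ \hat{B}_i = \frac{\hat{\sigma}_e^2/n_i}{\hat{\sigma}_v^2 + \hat{\sigma}_e^2/n_i}, \]
\[ \hat{\bar{Y}}_i^{Reg} = \bar{y}_i + (\bar{X}_i - \bar{x}_i)^\top \hat{\beta}, \]
\[ \hat{\bar{Y}}_i^{Syn} = \bar{X}_i^\top \hat{\beta}. \]
Any standard variance component estimation method (e.g., REML)
can be used to obtain $\hat{\sigma}_v^2$ and $\hat{\sigma}_e^2$.
$\hat{\beta}$: the weighted least squares estimator with estimated variance
components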
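The EBLUP formula can be sketched in a few lines once the variance components and regression coefficients have been estimated elsewhere. The slide does not define $f_i$, $n_i$, $\bar{y}_i$ or $\bar{x}_i$; the sketch assumes the usual reading, namely that $n_i$ is the area sample size, $f_i = n_i/N_i$ is the sampling fraction, and $\bar{y}_i$, $\bar{x}_i$ are sample means. The function name and argument layout are hypothetical.

```python
import numpy as np


def eblup_unit_level(ybar, xbar, Xbar, n, N, beta_hat, sigma2_v, sigma2_e):
    """EBLUP of the area means under the nested error regression model.

    ybar: (m,) sample means of y; xbar: (m, p) sample means of x;
    Xbar: (m, p) population means of x; n, N: (m,) sample and population sizes.
    beta_hat, sigma2_v, sigma2_e are taken as already estimated (e.g., by REML).
    """
    f = n / N                                       # assumed: sampling fraction f_i = n_i / N_i
    B = (sigma2_e / n) / (sigma2_v + sigma2_e / n)  # shrinkage factor B_i
    reg = ybar + (Xbar - xbar) @ beta_hat           # survey regression estimator
    syn = Xbar @ beta_hat                           # regression synthetic estimator
    return f * reg + (1 - f) * ((1 - B) * reg + B * syn)
```

With a small sample in an area, `B` moves toward 1 and the estimate leans on the synthetic part; with a large sample it stays close to the survey regression estimator.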
13. Plots of Survey-Weighted Poverty Rates and SAE for a Small County
(drawn by Sam Hawala)
14. Plots of Estimated SE Survey-Weighted Poverty Rates and SAE for a
Small County (drawn by Sam Hawala)
15. A Cross-Sectional Model
Ref: Fay and Herriot (JASA 1979)
For $i = 1, \dots, m$,
Level 1 (Sampling Distribution): $y_i = \theta_i + e_i$;
Level 2 (Linking Distribution): $\theta_i = x_i^\top \beta + v_i$,
where
$y_i$: direct survey estimate of the true small area mean $\theta_i$ for area $i$
$x_i$: $p \times 1$ vector of known auxiliary variables coming from big data;
$\{e_i\}$ and $\{v_i\}$ are independent with $e_i \sim N(0, \psi_i)$ and $v_i \sim N(0, \sigma_v^2)$; the $\psi_i$'s
are assumed to be known.
The $p \times 1$ vector of regression coefficients $\beta$ and the model variance $\sigma_v^2$
are unknown.
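A minimal sketch of how the two levels combine into an estimate is given below. The talk does not say how $\sigma_v^2$ is estimated; this example swaps in the Prasad-Rao moment estimator purely because it has a closed form, and then forms the usual shrinkage estimator $(1 - B_i) y_i + B_i x_i^\top \hat{\beta}$ with $B_i = \psi_i / (\psi_i + \sigma_v^2)$. Function names are hypothetical.

```python
import numpy as np


def prasad_rao_sigma2_v(y, X, psi):
    """Moment estimator of the model variance, truncated at zero."""
    m, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)                 # OLS residuals
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # h_ii = x_i' (X'X)^{-1} x_i
    est = (np.sum(resid**2) - np.sum(psi * (1 - leverage))) / (m - p)
    return max(est, 0.0)


def fay_herriot_eblup(y, X, psi, sigma2_v):
    """Shrink each direct estimate y_i toward the regression fit x_i' beta_hat."""
    w = 1.0 / (psi + sigma2_v)                          # weighted least squares weights
    beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    B = psi / (psi + sigma2_v)                          # shrinkage factor per area
    return (1 - B) * y + B * (X @ beta_hat), beta_hat
```

Areas with a large known sampling variance `psi` are pulled more strongly toward the regression fit, which is where strong auxiliary variables pay off.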
16. Auxiliary Variables from big data
The proportion of child exemptions reported by families in poverty on
their tax returns.
The proportion of people under 65 who did not file income tax
returns.
The proportion of people receiving food stamps.
17. A Time Series Cross-Sectional Model
Ref: Datta, Lahiri, Maiti and Lu (1999); Datta, Lahiri and Maiti (2002)
For $i = 1, \dots, m$; $t = 1, \dots, T$,
Level 1: $y_{it} = \theta_{it} + e_{it}$;
Level 2: $\theta_{it} = x_{it}^\top \beta + v_i + u_{it}$;
Level 3: $u_{it} = u_{i,t-1} + \varepsilon_{it}$,
where
$y_{it}$: direct survey estimate of median income of four-person families for
state $i$, year $t$
$e_{it}$: sampling error
$x_{it}$: auxiliary variables coming from big data (previous census and
administrative records)
$v_i$: state-specific random effects
$u_{it}$: state- and year-specific random effects
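The three levels can be illustrated by simulating from them. The slide gives no distributions for this model, so the sketch assumes normal errors throughout, a random walk started at $u_{i0} = 0$, and known sampling variances passed in as `psi`; all names are hypothetical.

```python
import numpy as np


def simulate_tscs(X, beta, sigma_v, sigma_eps, psi, rng):
    """Simulate direct estimates y_it and true values theta_it.

    X: (m, T, p) auxiliary variables; psi: (m, T) sampling variances.
    Assumes normal errors and a random walk starting from u_i0 = 0.
    """
    m, T, _ = X.shape
    v = rng.normal(0.0, sigma_v, size=m)            # state-specific effects v_i
    eps = rng.normal(0.0, sigma_eps, size=(m, T))   # random walk innovations
    u = np.cumsum(eps, axis=1)                      # u_it = u_{i,t-1} + eps_it
    theta = X @ beta + v[:, None] + u               # Level 2
    y = theta + rng.normal(0.0, np.sqrt(psi))       # Level 1: add sampling error
    return y, theta
```

The cumulative sum is what lets each state borrow strength from its own past: the year-specific effect in year t carries forward everything accumulated up to year t - 1.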
18. Estimates of Coefficients of Variation of CPS Direct Estimates of
Median Income of 4-person Families in the US States: Year 1989
[Figure: U.S. state-level CV, CPS; scale values 2.5, 5.0, 7.5, 10.0, 12.5]
19. Estimates of Coefficients of Variation of EB Estimates of
Median Income of 4-person Families in the US States: Year 1989
[Figure: U.S. state-level CV, EB; scale values 2.5, 5.0, 7.5, 10.0, 12.5]
20. A Plot of Absolute Residuals From a Simple Linear Regression
Dep Variable: 1989 Median Income Estimates from 1990 Census
Indep. Variable: CPS or EB Estimates for 1989
[Figure: Plot of absolute residual versus state. x-axis: State (0 to 50); y-axis: Absolute residual (0 to 10000); series: CPS, EB]
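The comparison behind this plot can be sketched as follows: regress the census values on one set of estimates with a simple linear regression and take absolute residuals, once for CPS and once for EB. The function name and the input arrays are hypothetical; no data from the talk are reproduced here.

```python
import numpy as np


def absolute_residuals(census, estimate):
    """Absolute residuals from a simple linear regression of census on estimate."""
    slope, intercept = np.polyfit(estimate, census, deg=1)  # highest degree first
    return np.abs(census - (intercept + slope * estimate))
```

Calling it as `absolute_residuals(census_1989, cps_1989)` and `absolute_residuals(census_1989, eb_1989)`, with arrays of state-level values, would give the two series shown in the plot; smaller residuals indicate estimates that track the census values more closely.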
21. Poverty mapping: the Chilean Case
High poverty rates can work in favor of a Chilean municipality by
securing more funds from the Chilean central government.
Consider the following situation. For a given small municipality,
the poverty rate for the current year turns out to be high under the standard
design-based method.
How do we convince the mayor of that municipality to accept a
statistically efficient SAE method that yields a lower poverty rate?
22. Plots of Survey-Weighted Poverty Rates and SAE for
Selected Comunas (drawn by Carolina Casas-Cordero)
[Figure: Estimates of poverty rates for comunas, Chile. Panels: Concón, Hualpén, Lolol, Santiago. x-axis: Year (2000 to 2012); y-axis: Poverty rate (0 to .4); series: Direct, SAE. Source: Casen Survey 2000 to 2011]
23. Initial set of auxiliary variables
Number and name of the auxiliary variable | Institution responsible for data collection | Frequency of publication of the data
#1. Subsidio Familiar [family subsidy] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | monthly and yearly
#2. Subsidio al Pago del Consumo de Agua Potable y Servicio de Alcantarillado de Aguas Servidas [subsidy for drinking water consumption and sewerage service] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | monthly and yearly
#3. Bono Chile Solidario | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | monthly and yearly
#4. Subsidio de Discapacidad Mental [mental disability subsidy] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | monthly and yearly
#5. Pensión Básica Solidaria (vejez e invalidez) [basic solidarity pension, old age and disability] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | December
#6. Aporte Previsional Solidario (vejez e invalidez) [solidarity pension contribution, old age and disability] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | December
#7. Bonificación al Ingreso Ético Familiar [Ethical Family Income bonus] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | monthly and yearly
#8. Beca de Apoyo a la Retención Escolar, BARE [school retention support scholarship] | Unidad de Prestaciones Monetarias, Ministerio de Desarrollo Social | monthly and yearly
#9. Afiliados Sistema de Capitalización Individual [members of the individual capitalization pension system] | Superintendencia de Pensiones | monthly and yearly
#10. Matrícula [school enrollment] | Ministerio de Educación | yearly
#11. Rendimiento [academic performance] | Ministerio de Educación | yearly
#12. SIMCE | Ministerio de Educación | yearly or every two years
#13. Titulados Educación Superior [higher education graduates] | Ministerio de Educación | yearly
#14. Índice de Vulnerabilidad del Establecimiento (IVE-SINAE) [school vulnerability index] | Junta Nacional Escolar y Becas (Junaeb) | yearly
#15. Situación Nutricional estudiantes básica y media [nutritional status of primary and secondary students] | Junta Nacional Escolar y Becas (Junaeb) | yearly
#16. Población beneficiaria Fonasa [Fonasa beneficiary population] | Ministerio de Salud | yearly
#17. Atenciones sector privado [private sector health care visits] | Ministerio de Salud | yearly
#18. Razón de analfabetos respecto a la población de 10 y más años en la comuna [ratio of illiterate persons to the population aged 10 and over in the comuna] | CENSO, INE | every 10 years
#19. Porcentaje de Población Rural [percentage of rural population] | CENSO, INE | every 10 years
#20. Porcentaje de Asistencia Escolar Comunal [comuna school attendance percentage] | SINIM | monthly
#21. Tamaño promedio del hogar [average household size] | CENSO, INE | every 10 years
#22. Tasa de pobreza histórica [historical poverty rate] | CASEN | every 2 or 3 years
#23. Contribuciones de Vivienda [property tax contributions] | SII (http://www.sii.cl/avaluaciones/estadisticas/estadisticas_bbrr.htm#2) | yearly
#24. Remuneraciones promedio de los trabajadores dependientes [average wages of dependent workers] | | yearly
Source: Ministerio de Desarrollo Social (2013a).
24. Regression Analysis
Independent variable | Regression coefficient estimate (original comuna weights) | (t-statistic)
Average wage of dependent workers (log) | -0.09575646 | (3.52**)
Average of the poverty rate from Casen 2000, 2003 and 2006 (arcsin) | 0.49548266 | (7.92**)
% of population in rural areas (arcsin) | -0.13409847 | (4.96**)
% of illiterate population (arcsin) | 0.40349163 | (2.57*)
% of population attending school (arcsin) | -0.21883535 | (2.23*)
Dummy for region 7 (=1) | 0.03442978 | (2.11*)
Dummy for region 8 (=1) | 0.03882056 | (2.67**)
Dummy for region 9 (=1) | 0.105632 | (6.04**)
Constant | 1.61477028 | (4.24**)
Number of observations: 235
Adjusted $R^2$: 0.67
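The "(arcsin)" labels in the table do not spell out the exact transformation. A common variance-stabilizing choice for proportions is the arcsine of the square root, and the sketch below assumes that form; it is an illustration of the idea, not a statement of what was used for these regressions. Function names are hypothetical.

```python
import numpy as np


def arcsin_sqrt(p):
    """Assumed transformation: arcsine of the square root of a proportion."""
    return np.arcsin(np.sqrt(p))


def inverse_arcsin_sqrt(z):
    """Map a value on the transformed scale back to a proportion."""
    return np.sin(z) ** 2
```

Under this assumption, a fitted value on the transformed scale is converted back to a poverty rate with `inverse_arcsin_sqrt`.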
25. Length of the direct and parametric bootstrap confidence intervals of the comuna-level
poverty rates for comunas sorted by the limited translation empirical Bayes estimates of
the poverty rate.
26. "...D.J. Finney once wrote about the statistician whose
client comes in and says, 'Here is my mountain of trash.
Find the gems that lie therein.' Finney's advice was to
not throw him out of the office but to attempt to find out
what he considers 'gems'. After all, if the trained
statistician does not help, he will find someone who
will...." David Salsburg, ASA Connect Discussion
27. First Latin American ISI Satellite Meeting on Small Area Estimation
August 3-5, 2015, Santiago, Chile
International Statistical Institute (ISI) Satellite Meeting
At Pontificia Universidad Católica de Chile
Invited Talks:
Malay Ghosh
“Small Area Estimation with Health Applications”
Wayne Fuller
“Bootstrap Methods for Small Area Predictions”
Partha Lahiri
“Recent Advances in Poverty Mapping Methodology”
Angela Luna, Nikos Tzavidis and LiChun Zhang
“From start to finish: Specify – Adapt – Evaluate (SAE)”
Danny Pfeffermann and Richard Tiller
“Small Area Labor Force Statistics using Time Series Models”
J.N.K. Rao
“Measuring Uncertainty of Small Area Estimators”
Special Topics, Contributed & Poster Sessions:
Submit abstracts by April 15th of 2015 at sae2015@uc.cl
Abstracts accepted on a first-come basis.
Language of the conference: English
Website: http://www.encuestas.uc.cl/sae2015/
Main Organizer: Centro de Encuestas y Estudios Longitudinales, Universidad Católica de Chile.
Co-organizers: International Statistical Institute (ISI), International Association of Survey Statisticians
(IASS), Sociedad Chilena de Estadística (SOCHE), Instituto Nacional de Estadísticas (INE), Ministerio de
Desarrollo Social (MDS), Departamento de Estadística, Departamento de Salud Pública e Instituto de
Sociología de la Universidad Católica de Chile.
Purpose:
We hope that this meeting will serve as a bridge between mathematical statisticians and practitioners working on small area estimation in academia, private and government agencies. This meeting in Santiago will give researchers an opportunity to learn about state-of-the-art small area estimation techniques from the experts in the field.
Journal of the Royal Statistical Society (JRSS) Series A: Special Issue on SAE!!!