1. The Minnesota Data Harmonization Projects
Bill & Melinda Gates Foundation
Seattle, Washington
May 21, 2014
Elizabeth Boyle, Miriam King, Matthew Sobek
Minnesota Population Center, University of Minnesota
sobek@umn.edu
4. We build data infrastructure for the research
community. We specialize in data harmonization.
World’s largest collection of individual
population and health data, across 9 projects.
50,000 registered users from over 100
countries.
Free
Minnesota Population Center
7. The Problem
1. Combining data from multiple sources is
time consuming
Discovery
Data management
2. It’s error prone
Recoding data
Overlooking documentation
3. Hard to replicate results
4. Discourages comparative research
8. Outline
Harmonization methods
Dissemination system
International projects
Integrated DHS
Terra Populus
IPUMS-International
14. MPC Data Dictionary
Variable  Start  Width  Value    Var/Value Label                        Frequency  Universe
SMOKE100  57     1               Ever smoked 100 cigarettes                        All persons
                        1        Yes                                    54,189
                        2        No                                     59,501
                        7        Don't know/Not sure                    205
                        9        Refused                                39
SMOKENOW  58     1               Smoke cigarettes now                              Persons who ever smoked
                        1        Yes                                    25,644
                        2        No                                     28,535
                        7        Don't know/Not sure                    0
                        9        Refused                                10
                        Blank    [no label]                             59,745
SMOKE30   59     2               Number of days smoked in the last 30              Persons who currently smoke
                        1 to 30  Number of days                         25,290
                        77       Don't know/Not sure                    293
                        88       None                                   49
                        99       Refused                                12
                        Blank    [no label]                             88,290
SMOKENUM  61     2               Number of cigarettes smoked per day               Persons who currently smoke
                        0 to 76  Number of cigarettes                   22,292
                        77       Don't know/Not sure                    248
                        99       Refused                                43
                        Blank    [no label]                             91,351
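The Start and Width columns are what a program needs to pull each variable out of a fixed-width record. The following is a minimal sketch, not the MPC's own software: it assumes Start is 1-based, and the sample record is invented for illustration. Only the positions and widths come from the dictionary above.

```python
# Start/width pairs copied from the data dictionary; Start is assumed to be 1-based.
DICTIONARY = {
    "SMOKE100": (57, 1),
    "SMOKENOW": (58, 1),
    "SMOKE30": (59, 2),
    "SMOKENUM": (61, 2),
}


def parse_record(line, dictionary=DICTIONARY):
    """Slice one fixed-width record into {variable: value}; blank fields become None."""
    values = {}
    for name, (start, width) in dictionary.items():
        raw = line[start - 1:start - 1 + width].strip()
        values[name] = int(raw) if raw else None
    return values


# Hypothetical record: 56 filler characters, then SMOKE100=1, SMOKENOW=1,
# SMOKE30=30, SMOKENUM=20.
record = " " * 56 + "1" + "1" + "30" + "20"
parsed = parse_record(record)
```

A blank field maps to None here, which mirrors the "Blank" rows in the dictionary: persons outside a variable's universe carry no value for it.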
16. 5. Number of Rooms
How many rooms are used for sleeping without counting hallways?
_____ Write the number
Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen
_____ Write the number
6. Access to water
Read all of the options until you get an affirmative answer.
Circle only one answer
1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other
Answers 3, 4, 5, 6 continue with number 8
7. Water supply
How many days of the week is water available?
Circle only one answer
1 Daily
2 Every third day
3 Twice a week
4 Once a week
5 Occasionally
Metadata: Questionnaire Text
46. Foremost source of health information for
the developing world
Funded by USAID
Since the 1980s: over 300 surveys in 90 countries
Topics: fertility, nutrition, HIV, malaria,
maternal and child health, etc.
Demographic and Health Surveys
47. 5-year NIH grant (end of year 2)
Focus on Africa, with India
Partnership with ICF International and
USAID
IDHS Project
48. Motivation: DHS is incredibly valuable, but
it’s hard to capitalize on its full potential.
Problem:
Data discovery
Dispersed documentation
Data management
Variable changes over time
Not unique to DHS: endemic to any survey
that’s persisted over decades.
Why an Integrated DHS?
52. Recode notes
Data dictionary
Just the woman file – for one survey. 61 to go.
Still need the report (377-page PDF)
• Contains questionnaire and sample design information
• Errata file
53. DHS “Recode Variables” make it more harmonized than
most surveys
Consistent variable names
Each DHS phase has a shared model
questionnaire
But:
6 phases over 25+ years
Country control over final wording of surveys
Country-specific variables
The recode variables can be a two-edged sword
At least the DHS variables are already
harmonized, right?
54. 100 Muslim/Islam 4 = Muslim 7 = Moslem 1 = Muslim 2 = Muslim
200 Christian 2 = Christian 3 = Christian
201 Catholic 2 = Catholic 1 = Catholic
202 Protestant 1 = Protestant
203 Anglican 2 = Anglican
204 Methodist 3 = Methodist
205 Presbyterian 4 = Presbyterian
206 Pentecostal 5 = Pentecostal
208 Other Christian 3 = Other Christian 6 = Other Christian
300 Other
301 Hindu 0 = Hindu 1 = Hindu
302 Sikh 3 = Sikh 4 = Sikh
303 Buddhist 5 = Buddhist
304 Jain 6 = Jain
305 Jewish 7 = Jewish
306 Parsi/Zoroastrian 8 = Parsi/Zoroastrian
307 Donyi-Polo 10 = Donyi polo
400 Traditional/spiritual 8 = Trad/spiritualist
401 Traditional 5 = Traditional
402 Spiritual
403 Animist
500 No religion 0 = No religion 9 = No religion 9 = No religion
600 Other 96 = Other 4 = Other 96 = Other
Samples (source variable V130 in each): Ghana 1993, Ghana 2008, India 1992, India 2005
Harmonization: Religion
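The religion table amounts to a crosswalk from each sample's own V130 codes to one shared code list. The sketch below is illustrative rather than the project's actual code. It uses only the Muslim/Islam row, the one row with an entry for all four samples, and it assumes the entries appear in the same order as the samples (Ghana 1993, Ghana 2008, India 1992, India 2005). The function names are hypothetical, and the hundreds-level grouping is an interpretation of how the harmonized codes appear to be organized (200 Christian, 201 Catholic, and so on).

```python
# (sample, source V130 code) -> harmonized code.
# Assumption: the Muslim/Islam row lists its source codes in sample order.
CROSSWALK = {
    ("Ghana 1993", 4): 100,
    ("Ghana 2008", 7): 100,
    ("India 1992", 1): 100,
    ("India 2005", 2): 100,
}


def harmonize(sample, source_code, crosswalk=CROSSWALK):
    """Return the harmonized code, or None when the pair is not in the crosswalk."""
    return crosswalk.get((sample, source_code))


def general_category(harmonized_code):
    """Collapse a detailed code to its hundreds-level category (201 -> 200)."""
    return (harmonized_code // 100) * 100
```

With a composite code like this, a user who only needs broad groups can collapse 201 Catholic and 203 Anglican to 200 Christian, while the detail stays available for samples that recorded it.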
55. Egypt 1995 S802 Ever circumcised
Egypt 2005 S801 Respondent circumcised
Egypt 2008 G102 Respondent circumcised
Ethiopia 2000 FG103 Circumcised
Ethiopia 2005 FG103 Circumcised
Ghana 2003 S821 Circumcised
Kenya 1998 S1002 Respondent circumcised
Kenya 2003 S821 Circumcised
Kenya 2008 G102 Respondent circumcised
Mali 1995 S551 Circumcised
Mali 2001 FG103 Circumcised?
Mali 2006 G102 Respondent circumcised
Nigeria 1999 S521 Type of circumcision
Nigeria 2003 FG103 Circumcised
Nigeria 2008 G102 Respondent circumcised
Harmonization: Female Circumcision
Ever Circumcised
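The same concept sits under a different variable name in almost every sample, so harmonization starts with a lookup from sample to source variable. A minimal sketch using the names listed above follows; the function name and the record layout (a plain dict keyed by source variable name) are assumptions for illustration.

```python
from collections import Counter

# (country, year) -> name of the source variable holding the circumcision item.
SOURCE_VARIABLE = {
    ("Egypt", 1995): "S802",
    ("Egypt", 2005): "S801",
    ("Egypt", 2008): "G102",
    ("Ethiopia", 2000): "FG103",
    ("Ethiopia", 2005): "FG103",
    ("Ghana", 2003): "S821",
    ("Kenya", 1998): "S1002",
    ("Kenya", 2003): "S821",
    ("Kenya", 2008): "G102",
    ("Mali", 1995): "S551",
    ("Mali", 2001): "FG103",
    ("Mali", 2006): "G102",
    ("Nigeria", 1999): "S521",
    ("Nigeria", 2003): "FG103",
    ("Nigeria", 2008): "G102",
}


def source_value(sample, record):
    """Fetch the circumcision item from a record, whatever it is called in that sample."""
    return record.get(SOURCE_VARIABLE[sample])


# How many samples share each source variable name.
name_counts = Counter(SOURCE_VARIABLE.values())
```

Renaming is only the first step. Nigeria 1999 labels S521 "Type of circumcision", so its values would also need recoding before they are comparable with a yes/no item.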
56. Timeline: 2014 (current)
9 countries, 39 samples
Much of woman files
Women of childbearing age as unit of analysis
57. Timeline: 2015
15 countries, 69 samples
Complete the woman files
Children & birth files
61. Lower barriers to conducting research on
population and the environment.
Motivation:
The data from different domains have
incompatible formats, and few researchers have
the skills to combine them
Terra Populus Goal
62. 5-year NSF grant
At midpoint: year 3
TerraPop
63. 6 countries:
Argentina
Brazil
Malawi
Spain
United States
Vietnam
Population Microdata
64. Tabulations of census data for administrative
units
Area-level Data
65. Land cover from
satellite images
(Global Land Cover 2000)
Agricultural use
from satellites and
government records
(Global Landscapes
Initiative)
Climate from weather
stations (WorldClim)
Environmental Data
Rasters (Grid Cells)
66. Microdata
Area-level data
Rasters
Mix and match
variables originating in
any of the data
structures
Obtain output in the
data structure most
useful to you
Location-Based Integration
67. Individuals and households
with their environmental
and social context
Microdata
Area-level data
Rasters
Location-Based Integration
68. Summarized environmental and population characteristics for administrative districts
Microdata
Area-level data
Rasters
County ID
G17003100001
G17003100002
G17003100003
G17003100004
G17003100005
G17003100006
G17003100007
County ID     Mean Ann. Temp.  Max. Ann. Precip.
G17003100001  21.2             768
G17003100002  23.4             589
G17003100003  24.3             867
G17003100004  21.5             943
G17003100005  24.1             867
G17003100006  24.4             697
G17003100007  25.6             701
County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
G17003100001  21.2             768                3129         1063         637         365
G17003100002  23.4             589                2949         1075         1469        717
G17003100003  24.3             867                3418         1589         1108        617
G17003100004  21.5             943                1882         425          202         142
G17003100005  24.1             867                2416         572          426         197
G17003100006  24.4             697                2560         934          950         563
G17003100007  25.6             701                2126         653          321         215
Location-Based Integration
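The area-level output on this slide is, in effect, a join on the administrative unit identifier: environmental summaries and population tabulations meet wherever they share a County ID. A minimal sketch of that join, using the first two rows of the table, is given below. The field names and the function are hypothetical and are not TerraPop's actual interface.

```python
# Environmental summaries per unit (values from the first two table rows).
ENVIRONMENT = {
    "G17003100001": {"mean_ann_temp": 21.2, "max_ann_precip": 768},
    "G17003100002": {"mean_ann_temp": 23.4, "max_ann_precip": 589},
}

# Population tabulations per unit (tenure by rural/urban, same two rows).
POPULATION = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063,
                     "own_rural": 637, "own_urban": 365},
    "G17003100002": {"rent_rural": 2949, "rent_urban": 1075,
                     "own_rural": 1469, "own_urban": 717},
}


def join_by_unit(*tables):
    """Inner-join tables keyed by unit ID, keeping only units present in every table."""
    shared = set(tables[0]).intersection(*tables[1:])
    merged = {}
    for unit in sorted(shared):
        row = {}
        for table in tables:
            row.update(table[unit])
        merged[unit] = row
    return merged


combined = join_by_unit(ENVIRONMENT, POPULATION)
```

The join only works when both sources use the same unit definitions, which is why consistent boundaries matter so much for this kind of integration.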
72. Linkages across data formats rely on
administrative unit boundaries
Particular needs
Lower level
boundaries
Historical
boundaries
Boundaries are Key
85. Top Institutional Users
Rank  Country    Institution
1     USA        University of Minnesota
2     USA        Harvard University
3     USA        University of Michigan, Ann Arbor
4     USA        Columbia University
5     Spain      Autonomous University of Barcelona
6     USA        Arizona State University
7     Singapore  National University of Singapore
8     IADB       Inter-American Development Bank
9     WB         World Bank Group
10    USA        University of California, Berkeley
11    USA        Vanderbilt University
12    USA        University of Chicago
13    Australia  University of Queensland
14    USA        University of California, Los Angeles
15    USA        Dartmouth College
16    Brazil     Universidade Federal de Minas Gerais
17    Mexico     El Colegio de México
18    USA        Yale University
19    China      University of Hong Kong
20    USA        University of Washington
21    UK         London School of Economics
22    UK         University of Stirling
23    France     Université de Bordeaux 4
24    Austria    University of Vienna
25    Malaysia   National University of Malaysia
26    Austria    Vienna Institute of Demography
27    USA        Pew Research Center
28    Colombia   Universidad del Valle
29    USA        University of Delaware
30    USA        Brown University
87. Millennium Development Goals
Source: Cuesta and Lovatón (2014)
Data Source: IPUMS-International, Minnesota Population Center
Census 1993 Census 2005
Colombia: Adolescent Birth Rate
88. Data acquisition
Outreach: developing countries
Virtual data enclave
IPUMSI Future