The Minnesota Data Harmonization Projects
Bill & Melinda Gates Foundation
Seattle, Washington
May 21, 2014
Elizabeth Boyle, Miriam King, Matthew Sobek
Minnesota Population Center, University of Minnesota
sobek@umn.edu
USA
Integrated Public Use Microdata Series
Minnesota Population Center
• We build data infrastructure for the research community. We specialize in data harmonization.
• World’s largest collection of individual population and health data, across 9 projects.
• 50,000 registered users from over 100 countries.
• Free
MPC Data Dissemination, 1993-2012
Gigabytes per week
MPC Data Projects
The Problem
1. Combining data from multiple sources is time-consuming
   • Discovery
   • Data management
2. It’s error-prone
   • Recoding data
   • Overlooking documentation
3. Results are hard to replicate
4. It discourages comparative research
Outline
• Harmonization methods
• Dissemination system
• International projects
   • Integrated DHS
   • Terra Populus
   • IPUMS-International
Terminology
Harmonization (also called “integration”): combining datasets collected at different times or places into a single, consistent data series.
Metadata: data about data. Documentation in the broadest sense.
Microdata
[Figure: individual records with columns for relation to head, marital status, education, and occupation]
Summary Data
Harmonization Methods
• Metadata
• Data
• Dissemination
Systematize Metadata
(record layout file, pdf)
MPC Data Dictionary
Variable  Start  Width  Value     Variable/Value Label                  Frequency  Universe
SMOKE100  57     1                Ever smoked 100 cigarettes                       All persons
                        1         Yes                                   54,189
                        2         No                                    59,501
                        7         Don't know/Not sure                   205
                        9         Refused                               39
SMOKENOW  58     1                Smoke cigarettes now                             Persons who ever smoked
                        1         Yes                                   25,644
                        2         No                                    28,535
                        7         Don't know/Not sure                   0
                        9         Refused                               10
                        Blank     [no label]                            59,745
SMOKE30   59     2                Number of days smoked in the last 30             Persons who currently smoke
                        1 to 30   Number of days                        25,290
                        77        Don't know/Not sure                   293
                        88        None                                  49
                        99        Refused                               12
                        Blank     [no label]                            88,290
SMOKENUM  61     2                Number of cigarettes smoked per day              Persons who currently smoke
                        0 to 76   Number of cigarettes                  22,292
                        77        Don't know/Not sure                   248
                        99        Refused                               43
                        Blank     [no label]                            91,351
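The Start and Width columns of a data dictionary like this one are enough to read a fixed-width data file. The sketch below is illustrative only: it assumes Start is a 1-based column position and that a blank field means the person is outside the variable's universe, and the two records are made up rather than taken from any real file.

```python
# Layout copied from the data dictionary above: name -> (start, width).
# Assumption: "Start" is a 1-based column position in a fixed-width record.
LAYOUT = {
    "SMOKE100": (57, 1),
    "SMOKENOW": (58, 1),
    "SMOKE30": (59, 2),
    "SMOKENUM": (61, 2),
}


def parse_record(line):
    """Slice one fixed-width record into a dict of raw string values.

    Blank fields come back as None (treated here as "not in universe").
    """
    values = {}
    for name, (start, width) in LAYOUT.items():
        raw = line[start - 1:start - 1 + width]
        values[name] = raw if raw.strip() else None
    return values


# Hypothetical record: 56 filler columns, then a current smoker who smoked
# on 30 of the last 30 days, 10 cigarettes per day.
parsed = parse_record(" " * 56 + "1" + "1" + "30" + "10")

# Second hypothetical record: never smoked 100 cigarettes, so the
# follow-up questions are blank.
never_smoker = parse_record(" " * 56 + "2" + " " * 5)
```

Keeping the layout in a table rather than in hand-written slicing code is what lets the same reader work for every file once its dictionary has been entered.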
Convert Questionnaires to Metadata (Mexico 2000)
[Figure: questionnaire page; callout: Water access]
5. Number of Rooms
How many rooms are used for sleeping without counting hallways?
_____ Write the number
Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count
the kitchen
_____Write the number
6. Access to water
Read all of the options until you get an affirmative answer.
Circle only one answer
1 Running water inside the dwelling
2 Running water outside the dwelling but on the land
3 Running water from a public faucet or hydrant
4 Running water that is carried from another dwelling
5 Tanked in by truck
6 Water from a well, river, lake, stream or other
Answers 3, 4, 5, 6 continue with number 8
7. Water supply
How many days of the week is water available?
Circle only one answer
1 Daily
2 Every third day
3 Twice a week
4 Once a week
5 Occasionally
Metadata: Questionnaire Text
XML-Tagged Questionnaire Text
Tagged items: Water access, Bedrooms, Rooms
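Once questionnaire text is tagged, the wording behind any variable can be pulled out by program. The sketch below shows the idea with Python's standard XML parser. The element and attribute names (`questionnaire`, `question`, `text`, `concept`) are invented for illustration and are not the MPC's actual markup; only the question wording comes from the Mexico 2000 form above.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagging scheme; tag and attribute names are assumptions.
TAGGED = """
<questionnaire sample="Mexico 2000">
  <question number="5">
    <text concept="Bedrooms">How many rooms are used for sleeping without counting hallways?</text>
    <text concept="Rooms">Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen</text>
  </question>
  <question number="6">
    <text concept="Water access">Read all of the options until you get an affirmative answer.</text>
  </question>
</questionnaire>
"""


def questions_for(concept, xml_text=TAGGED):
    """Return (sample, question number, wording) for every tagged match."""
    root = ET.fromstring(xml_text)
    hits = []
    for question in root.iter("question"):
        for text in question.findall("text"):
            if text.get("concept") == concept:
                hits.append((root.get("sample"), question.get("number"), text.text))
    return hits


water = questions_for("Water access")
rooms = questions_for("Rooms")
```

With many samples tagged the same way, the same lookup could return the wording for one concept across every census that asked about it.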
Data: Variable Harmonization
Marital Status: IPUMS-International
Bangladesh 2011
1 = Unmarried
2 = Married
3 = Widowed
4 = Divorced/separated
Mexico 1970
1 = Married, civil & relig
2 = Married, civil
3 = Married, religious
4 = Consensual union
5 = Widowed
6 = Divorced
7 = Separated
8 = Single
Kenya 1999
1 = Never married
2 = Monogamous
3 = Polygamous
4 = Widowed
5 = Divorced
6 = Separated
Translation Table
Harmonized                        Input
Code  Label                       Mexico 1970                   Bangladesh 2011            Kenya 1999
100   Single                      8 = Single                    1 = Unmarried              1 = Never married
200   Married or in union                                       2 = Married
210     Married, formally
211       Civil                   2 = Married, civil
212       Religious               3 = Married, religious
213       Civil and religious     1 = Married, civil & relig
214       Monogamous                                                                       2 = Monogamous
215       Polygamous                                                                       3 = Polygamous
220     Consensual union          4 = Consensual union
300   Divorced or separated                                     4 = Divorced or separated
310     Separated                 7 = Separated                                            6 = Separated
320     Divorced                  6 = Divorced                                             5 = Divorced
400   Widowed                     5 = Widowed                   3 = Widowed                4 = Widowed
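A translation table of this kind can be held as a simple lookup from each sample's input code to the harmonized code. The sketch below is an illustration, not the MPC's software: the pairings are read off the slide by matching labels, and the function names are hypothetical. It also shows why the harmonized codes are composite: the first digit is comparable across all samples, while the later digits keep detail that only some samples collected.

```python
# Input code -> harmonized code, per sample, as read from the slide.
TRANSLATION = {
    "Mexico 1970": {1: 213, 2: 211, 3: 212, 4: 220, 5: 400, 6: 320, 7: 310, 8: 100},
    "Bangladesh 2011": {1: 100, 2: 200, 3: 400, 4: 300},
    "Kenya 1999": {1: 100, 2: 214, 3: 215, 4: 400, 5: 320, 6: 310},
}

HARMONIZED_LABELS = {
    100: "Single",
    200: "Married or in union",
    210: "Married, formally",
    211: "Civil",
    212: "Religious",
    213: "Civil and religious",
    214: "Monogamous",
    215: "Polygamous",
    220: "Consensual union",
    300: "Divorced or separated",
    310: "Separated",
    320: "Divorced",
    400: "Widowed",
}


def harmonize(sample, input_code):
    """Translate one sample-specific code into the detailed harmonized code."""
    return TRANSLATION[sample][input_code]


def general(code):
    """Keep only the first digit, the level every sample can support."""
    return code // 100 * 100


detailed = harmonize("Kenya 1999", 3)           # polygamous union
comparable = HARMONIZED_LABELS[general(detailed)]
```

A researcher comparing Bangladesh with Kenya would work at the general level, where "Married" and "Polygamous" land in the same category; a study of Kenya alone can keep the detailed code.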
Data Dissemination System
Variables Page
238 censuses
Sample Filtering
Variables Page – Filtered
Variable Page: Marital Status
Variable Codes
(Marital status)
Variable Page: Marital Status
Variable Comparability Discussion
(Marital status)
Variable Page: Documentation
Questionnaire Text
Questionnaire Text
(Marital status, Cambodia)
Variables Page
Extract Summary
Case Selection
Attached Characteristics
• Age of spouse
• Employment status of father
• Occupation of father
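An attached characteristic copies a value from one household member onto another member's record, for example a spouse's age onto each married person. The sketch below assumes every record carries pointers to the person numbers of the spouse and father in the same household. The field names (`pernum`, `sploc`, `poploc`) and all values are illustrative assumptions, not output from the extract system.

```python
# Hypothetical household; sploc / poploc give the person number of the
# spouse / father within the household, or 0 when there is none.
household = [
    {"pernum": 1, "age": 45, "empstat": "Employed", "occ": "Farmer", "sploc": 2, "poploc": 0},
    {"pernum": 2, "age": 41, "empstat": "Not in labor force", "occ": None, "sploc": 1, "poploc": 0},
    {"pernum": 3, "age": 12, "empstat": None, "occ": None, "sploc": 0, "poploc": 1},
]


def attach(records, pointer, source, target):
    """Copy `source` from the pointed-to person into `target` on each record."""
    by_pernum = {r["pernum"]: r for r in records}
    for r in records:
        other = by_pernum.get(r[pointer])
        r[target] = other[source] if other else None
    return records


attach(household, "sploc", "age", "age_sp")            # age of spouse
attach(household, "poploc", "empstat", "empstat_pop")  # employment status of father
attach(household, "poploc", "occ", "occ_pop")          # occupation of father
```

After the three calls the child's record carries the father's employment status and occupation, so the analysis can stay at the person level without a separate household merge.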
Extract Summary
Download or Revise Extract
On-line Analysis
The International Projects
Integrated DHS
Demographic and Health Surveys
• Foremost source of health information for the developing world
• Funded by USAID
• Since 1980s, over 300 surveys, 90 countries
• Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
IDHS Project
• 5-year NIH grant (end of year 2)
• Focus on Africa, with India
• Partnership with ICF-International and USAID
Why an Integrated DHS?
Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.
Problem:
• Data discovery
• Dispersed documentation
• Data management
• Variable changes over time
Not unique to DHS: endemic to any survey that’s persisted over decades.
DHS Research Process
Example: Find data on female genital cutting
Survey Search Tool
Recode notes
Data dictionary
Just the woman file – for one survey. 61 to go.
Still need Report (377 page pdf)
• Contains questionnaire and sample design information
• Errata file
The recode variables can be a two-edged sword
DHS “Recode Variables” make it more harmonized than most surveys
• Consistent variable names
• Each DHS phase has a shared model questionnaire
But:
• 6 phases over 25+ years
• Country control over final wording of surveys
• Country-specific variables
At least the DHS variables are already
harmonized, right?
Harmonization: Religion
Input variable: V130 in Ghana 1993, Ghana 2008, India 1992, and India 2005
Harmonized code and label, then the input codes assigned to it:
100 Muslim/Islam: 4 = Muslim, 7 = Moslem, 1 = Muslim, 2 = Muslim
200 Christian: 2 = Christian, 3 = Christian
201 Catholic: 2 = Catholic, 1 = Catholic
202 Protestant: 1 = Protestant
203 Anglican: 2 = Anglican
204 Methodist: 3 = Methodist
205 Presbyterian: 4 = Presbyterian
206 Pentecostal: 5 = Pentecostal
208 Other Christian: 3 = Other Christian, 6 = Other Christian
300 Other
301 Hindu: 0 = Hindu, 1 = Hindu
302 Sikh: 3 = Sikh, 4 = Sikh
303 Buddhist: 5 = Buddhist
304 Jain: 6 = Jain
305 Jewish: 7 = Jewish
306 Parsi/Zoroastrian: 8 = Parsi/Zoroastrian
307 Donyi-Polo: 10 = Donyi polo
400 Traditional/spiritual: 8 = Trad/spiritualist
401 Traditional: 5 = Traditional
402 Spiritual
403 Animist
500 No religion: 0 = No religion, 9 = No religion, 9 = No religion
600 Other: 96 = Other, 4 = Other, 96 = Other
Harmonization: Female Circumcision
Sample         Variable  Label
Egypt 1995     S802      Ever circumcised
Egypt 2005     S801      Respondent circumcised
Egypt 2008     G102      Respondent circumcised
Ethiopia 2000  FG103     Circumcised
Ethiopia 2005  FG103     Circumcised
Ghana 2003     S821      Circumcised
Kenya 1998     S1002     Respondent circumcised
Kenya 2003     S821      Circumcised
Kenya 2008     G102      Respondent circumcised
Mali 1995      S551      Circumcised
Mali 2001      FG103     Circumcised?
Mali 2006      G102      Respondent circumcised
Nigeria 1999   S521      Type of circumcision
Nigeria 2003   FG103     Circumcised
Nigeria 2008   G102      Respondent circumcised
Ever Circumcised
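The same question sits under a different variable name in almost every survey, so the first step in harmonizing it is a crosswalk from sample to source variable. The sketch below holds the list above as a lookup; the function names and the one-line example record are hypothetical, and the step that recodes each source's values to a common coding is left out.

```python
# Source variable for the circumcision question in each sample,
# transcribed from the list above.
SOURCE_VARIABLE = {
    ("Egypt", 1995): "S802",
    ("Egypt", 2005): "S801",
    ("Egypt", 2008): "G102",
    ("Ethiopia", 2000): "FG103",
    ("Ethiopia", 2005): "FG103",
    ("Ghana", 2003): "S821",
    ("Kenya", 1998): "S1002",
    ("Kenya", 2003): "S821",
    ("Kenya", 2008): "G102",
    ("Mali", 1995): "S551",
    ("Mali", 2001): "FG103",
    ("Mali", 2006): "G102",
    ("Nigeria", 1999): "S521",   # asks the type of circumcision, so needs recoding
    ("Nigeria", 2003): "FG103",
    ("Nigeria", 2008): "G102",
}


def raw_answer(country, year, record):
    """Fetch the answer from whichever source variable this sample uses."""
    return record[SOURCE_VARIABLE[(country, year)]]


def names_used(country):
    """List the distinct source variable names one country has used."""
    return sorted({var for (c, _), var in SOURCE_VARIABLE.items() if c == country})


kenya_names = names_used("Kenya")
answer = raw_answer("Kenya", 2003, {"S821": 1})  # made-up record
```

Kenya alone used three different names across three surveys, which is the discovery cost a harmonized variable removes.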
Timeline: 2014 (current)
• 9 countries, 39 samples
• Much of the woman files
• Women of childbearing age as unit of analysis
Timeline: 2015
• 15 countries, 69 samples
• Complete the woman files
• Children & birth files
Timeline: 2017
• 21 countries, 94 samples
• Men and couples files
Timeline: Next grant
• 41 African countries, 130+ samples
• 11 Asian countries, 32+ samples
Beta
Terra Populus Goal
Lower barriers to conducting research on population and the environment.
Motivation:
• The data from different domains have incompatible formats, and few researchers have the skills to combine them
TerraPop
• 5-year NSF grant
• At midpoint: year 3
Population Microdata
• 6 countries: Argentina, Brazil, Malawi, Spain, United States, Vietnam
Area-level Data
• Tabulations of census data for administrative units
Environmental Data
• Land cover from satellite images (Global Land Cover 2000)
• Agricultural use from satellites and government records (Global Landscapes Initiative)
• Climate from weather stations (WorldClim)
Rasters (Grid Cells)
Data structures: Microdata, Area-level data, Rasters
Mix and match variables originating in any of the data structures
Obtain output in the data structure most useful to you
Location-Based Integration
Individuals and households with their environmental and social context
Location-Based Integration
Summarized environmental and population characteristics for administrative districts

County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
G17003100001  21.2             768                3129         1063         637         365
G17003100002  23.4             589                2949         1075         1469        717
G17003100003  24.3             867                3418         1589         1108        617
G17003100004  21.5             943                1882         425          202         142
G17003100005  24.1             867                2416         572          426         197
G17003100006  24.4             697                2560         934          950         563
G17003100007  25.6             701                2126         653          321         215
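Once every source has been summarized to the same administrative units, combining them is a join on the unit identifier. The sketch below uses the first three districts from the table above. Splitting the columns into a climate table and a housing-tenure table is an assumption about where each column originated, and the field and function names are hypothetical.

```python
# Values transcribed from the table above (first three districts only).
climate = {
    "G17003100001": {"mean_ann_temp": 21.2, "max_ann_precip": 768},
    "G17003100002": {"mean_ann_temp": 23.4, "max_ann_precip": 589},
    "G17003100003": {"mean_ann_temp": 24.3, "max_ann_precip": 867},
}
tenure = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063, "own_rural": 637, "own_urban": 365},
    "G17003100002": {"rent_rural": 2949, "rent_urban": 1075, "own_rural": 1469, "own_urban": 717},
    "G17003100003": {"rent_rural": 3418, "rent_urban": 1589, "own_rural": 1108, "own_urban": 617},
}


def join_on_id(*tables):
    """Inner-join tables keyed by district ID into one row per district."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {
        district: {field: value for table in tables for field, value in table[district].items()}
        for district in sorted(shared)
    }


merged = join_on_id(climate, tenure)
```

The join only works because both tables use the same identifiers for the same boundaries, which is why boundary files matter so much to this kind of integration.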
Location-Based Integration
Rasters of population and environment data
Location-Based Integration
Rasterization of Area-Level Data
Area-Level Summary of Raster Data
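Both conversions can be illustrated with a toy grid. The sketch below assumes a second raster that records which administrative unit each cell falls in, and every cell belongs wholly to one unit. Real boundaries cut across cells and need proper GIS handling, so this is a simplification; all values and names are invented for illustration.

```python
# Toy 3x4 temperature raster.
temperature = [
    [20.0, 22.0, 24.0, 26.0],
    [20.0, 22.0, 24.0, 26.0],
    [21.0, 23.0, 25.0, 27.0],
]
# Zone raster: the administrative unit each grid cell falls in.
zones = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
]


def summarize_by_area(values, zone_ids):
    """Area-level summary of raster data: mean cell value per unit."""
    totals, counts = {}, {}
    for value_row, zone_row in zip(values, zone_ids):
        for value, zone in zip(value_row, zone_row):
            totals[zone] = totals.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


def rasterize(area_values, zone_ids):
    """Rasterization of area-level data: paint each unit's value onto its cells."""
    return [[area_values[zone] for zone in row] for row in zone_ids]


mean_temp = summarize_by_area(temperature, zones)
# Hypothetical population density per unit, spread onto the grid.
density = rasterize({"A": 12.5, "B": 3.0}, zones)
```

Painting a value onto cells suits rates and densities; a count such as total population would first have to be divided among the cells of its unit.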
Boundaries are Key
• Linkages across data formats rely on administrative unit boundaries
• Particular needs
   • Lower-level boundaries
   • Historical boundaries
Geographic Harmonization
Beta Version
• Web interface will change significantly in fall 2014
• Fast microdata tabulator needed
IPUMS-International
Census microdata from around the world
Funded by NSF and NIH
Motivation:
• Provide data access
• Preservation
Khartoum, CBS-Sudan
Dhaka, Bangladesh Bureau of Statistics
IPUMS-International
Participating
Disseminating
IPUMS Censuses Per Country
IPUMS Censuses Per Country
Variables Included in Extracts
Top Institutional Users
Rank  Country    Institution
1     USA        University of Minnesota
2     USA        Harvard University
3     USA        University of Michigan Ann Arbor
4     USA        Columbia University
5     Spain      Autonomous University Barcelona
6     USA        Arizona State University
7     Singapore  National University of Singapore
8     IADB       Inter American Development Bank
9     WB         World Bank Group
10    USA        University of California Berkeley
11    USA        Vanderbilt University
12    USA        University of Chicago
13    Australia  University of Queensland Australia
14    USA        University of California Los Angeles
15    USA        Dartmouth College
16    Brazil     Universidade Federal de Minas Gerais
17    Mexico     El Colegio de México
18    USA        Yale University
19    China      University of Hong Kong
20    USA        University of Washington
21    UK         London School Economics
22    UK         University of Stirling
23    France     Université de Bordeaux 4
24    Austria    University of Vienna
25    Malaysia   National University of Malaysia
26    Austria    Vienna Institute of Demography
27    USA        Pew Research Center
28    Colombia   Universidad del Valle
29    USA        University of Delaware
30    USA        Brown University
Millennium Development Goals
Ratio of literate women to men, 15-24 years old
1990 Census round
Source: Cuesta and Lovatón (2014)
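An indicator like this can be tabulated directly from census microdata. The sketch below takes one common reading of it, the female literacy rate divided by the male literacy rate among persons aged 15 to 24, using person weights. The records, field names, and weights are invented, and the slide does not say exactly how Cuesta and Lovatón computed their figures.

```python
# Invented person records; field names are hypothetical.
people = [
    {"sex": "F", "age": 17, "literate": True, "weight": 10.0},
    {"sex": "F", "age": 22, "literate": False, "weight": 10.0},
    {"sex": "F", "age": 30, "literate": True, "weight": 10.0},  # outside 15-24
    {"sex": "M", "age": 19, "literate": True, "weight": 10.0},
    {"sex": "M", "age": 24, "literate": True, "weight": 10.0},
]


def literacy_rate(records, sex, min_age=15, max_age=24):
    """Weighted share literate among persons of one sex in the age range."""
    group = [r for r in records if r["sex"] == sex and min_age <= r["age"] <= max_age]
    total = sum(r["weight"] for r in group)
    literate = sum(r["weight"] for r in group if r["literate"])
    return literate / total


ratio = literacy_rate(people, "F") / literacy_rate(people, "M")
```

Because the variables are harmonized, the same few lines could be run unchanged on every country and census round in the collection.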
Millennium Development Goals
Colombia: Adolescent Birth Rate
Census 1993, Census 2005
Source: Cuesta and Lovatón (2014)
Data Source: IPUMS-International, Minnesota Population Center
IPUMSI Future
• Data acquisition
• Outreach: developing countries
• Virtual data enclave
Thank you!
sobek@umn.edu

Minnesota Data Harmonization Projects

  • 1. The Minnesota Data Harmonization Projects Bill & Melinda Gates Foundation Seattle, Washington May 21, 2014 Elizabeth Boyle, Miriam King, Matthew Sobek Minnesota Population Center, University of Minnesota sobek@umn.edu
  • 2.
  • 3. USA Integrated Public Use Microdata Series
  • 4.  We build data infrastructure for research community. Specialize in data harmonization.  World’s largest collection of individual population and health data, across 9 projects.  50,000 registered users from over 100 countries.  Free Minnesota Population Center
  • 5. MPC Data Dissemination, 1993-2012 Gigabytes per week
  • 7. The Problem 1. Combining data from multiple sources is time consuming  Discovery  Data management 2. It’s error prone  Recoding data  Overlook documentation 3. Hard to replicate results 4. Discourages comparative research
  • 8. Outline  Harmonization methods  Dissemination system  International projects  Integrated DHS  Terra Populus  IPUMS-International
  • 9. Terminology Harmonization: Combining datasets collected at different times or places into a single, consistent data series. “Integration” Metadata: Data about data. Documentation in broadest sense.
  • 12. Harmonization Methods  Metadata  Data  Dissemination
  • 14. MPC Data Dictionary Variable Start Width Value Var ValueLabel Frequency Universe SMOKE100 57 1 Ever smoked 100 cigarettes All persons 1 Yes 54,189 2 No 59,501 7 Don't know/Not sure 205 9 Refused 39 SMOKENOW 58 1 Smoke cigarettes now Persons who ever smoked 1 Yes 25,644 2 No 28,535 7 Don't know/Not sure 0 9 Refused 10 Blank [no label] 59,745 SMOKE30 59 2 Number of days smoked in the last 30 Persons who currently smoke 1 to 30 Number of days 25,290 77 Don't know/Not sure 293 88 None 49 99 Refused 12 Blank [no label] 88,290 SMOKENUM 61 2 Number of cigarettes smoked per day Persons who currently smoke 0 to 76 Number of cigarettes 22,292 77 Don't know/Not sure 248 99 Refused 43 Blank [no label] 91,351
  • 15. Water Access Convert Questionnaires to Metadata (Mexico 2000)
  • 16. 5. Number of Rooms How many rooms are used for sleeping without counting hallways? _____ Write the number Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen _____Write the number 6. Access to water Read all of the options until you get an affirmative answer. Circle only one answer 1 Running water inside the dwelling 2 Running water outside the dwelling but on the land 3 Running water from a public faucet or hydrant 4 Running water that is carried from another dwelling 5 Tanked in by truck 6 Water from a well, river, lake, stream or other Answers 3, 4, 5, 6 continue with number 8 7. Water supply How many days of the week is water available? Circle only one answer 1 Daily 2 Every third day 3 Twice a week 4 Once a week 5 Occasionally Metadata: Questionnaire Text
  • 18. Data: Variable Harmonization. Marital Status: IPUMS-International
      Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divorced/separated
      Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
      Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
  • 19. Translation Table: Input
      Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divrc or separated
      Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
      Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
  • 20. Translation Table: Harmonized codes and labels, with input codes
      Code  Harmonized label          Bangladesh 2011         Mexico 1970                 Kenya 1999
      100   Single                    1 = Unmarried           8 = Single                  1 = Never married
      200   Married or in union       2 = Married
      210     Married, formally
      211       Civil                                         2 = Married, civil
      212       Religious                                     3 = Married, religious
      213       Civil and religious                           1 = Married, civil & relig
      214       Monogamous                                                                2 = Monogamous
      215       Polygamous                                                                3 = Polygamous
      220     Consensual union                                4 = Consensual union
      300   Divorced or separated     4 = Divrc or separated
      310     Separated                                       7 = Separated               6 = Separated
      320     Divorced                                        6 = Divorced                5 = Divorced
      400   Widowed                   3 = Widowed             5 = Widowed                 4 = Widowed
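A translation table of this kind can be thought of as a lookup from (sample, input code) to a composite harmonized code, where the first digit carries the general category and the later digits carry detail available only in some samples. The sketch below is illustrative only: the pairing of input codes to harmonized codes is inferred here from matching labels on the slide, and the names `TRANSLATION`, `harmonize`, and `general_category` are hypothetical.

```python
# Harmonized codes and labels as listed on the translation table slide.
HARMONIZED_LABELS = {
    100: "Single",
    200: "Married or in union",
    210: "Married, formally",
    211: "Civil",
    212: "Religious",
    213: "Civil and religious",
    214: "Monogamous",
    215: "Polygamous",
    220: "Consensual union",
    300: "Divorced or separated",
    310: "Separated",
    320: "Divorced",
    400: "Widowed",
}

# sample -> {input code: harmonized code}.
# ASSUMPTION: pairings inferred from the labels, not taken from MPC files.
TRANSLATION = {
    "Bangladesh 2011": {1: 100, 2: 200, 3: 400, 4: 300},
    "Mexico 1970": {1: 213, 2: 211, 3: 212, 4: 220, 5: 400, 6: 320, 7: 310, 8: 100},
    "Kenya 1999": {1: 100, 2: 214, 3: 215, 4: 400, 5: 320, 6: 310},
}


def harmonize(sample: str, input_code: int) -> int:
    """Translate one sample-specific code to its harmonized code."""
    return TRANSLATION[sample][input_code]


def general_category(code: int) -> int:
    """Collapse a detailed code to its first-digit category (e.g. 213 -> 200)."""
    return code // 100 * 100


code = harmonize("Mexico 1970", 4)
print(code, HARMONIZED_LABELS[code])                  # 220 Consensual union
print(HARMONIZED_LABELS[general_category(code)])      # Married or in union
```

The composite code lets a researcher compare all three samples at the first digit while keeping detail such as polygamous unions where a sample records it.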
  • 27. Variables Page – Filtered
  • 40. Age of spouse Employment status of father Occupation of father Attached Characteristics
  • 46. Demographic and Health Surveys: Foremost source of health information for the developing world; Funded by USAID; Since the 1980s, over 300 surveys in 90 countries; Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
  • 47. IDHS Project: 5-year NIH grant (end of year 2); Focus on Africa, with India; Partnership with ICF-International and USAID
  • 48. Why an Integrated DHS? Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential. Problem: Data discovery; Dispersed documentation; Data management; Variable changes over time. Not unique to DHS: endemic to any survey that’s persisted over decades.
  • 49. DHS Research Process. Example: Find data on female genital cutting. Survey Search Tool
  • 52. Recode notes; Data dictionary. Just the woman file, for one survey. 61 to go. Still need: Report (377-page PDF), which contains questionnaire and sample design information; Errata file
  • 53. At least the DHS variables are already harmonized, right? DHS “Recode Variables” make it more harmonized than most surveys: Consistent variable names; Each DHS phase has a shared model questionnaire. But: 6 phases over 25+ years; Country control over final wording of surveys; Country-specific variables. The recode variables can be a two-edged sword.
  • 54. Harmonization: Religion
      Input variables: Ghana 1993 V130; Ghana 2008 V130; India 1992 V130; India 2005 V130
      Harmonized code and label, followed by the input codes mapped to it:
      100 Muslim/Islam: 4 = Muslim; 7 = Moslem; 1 = Muslim; 2 = Muslim
      200 Christian: 2 = Christian; 3 = Christian
      201 Catholic: 2 = Catholic; 1 = Catholic
      202 Protestant: 1 = Protestant
      203 Anglican: 2 = Anglican
      204 Methodist: 3 = Methodist
      205 Presbyterian: 4 = Presbyterian
      206 Pentecostal: 5 = Pentecostal
      208 Other Christian: 3 = Other Christian; 6 = Other Christian
      300 Other
      301 Hindu: 0 = Hindu; 1 = Hindu
      302 Sikh: 3 = Sikh; 4 = Sikh
      303 Buddhist: 5 = Buddhist
      304 Jain: 6 = Jain
      305 Jewish: 7 = Jewish
      306 Parsi/Zoroastrian: 8 = Parsi/Zoroastrian
      307 Donyi-Polo: 10 = Donyi polo
      400 Traditional/spiritual: 8 = Trad/spiritualist
      401 Traditional: 5 = Traditional
      402 Spiritual
      403 Animist
      500 No religion: 0 = No religion; 9 = No religion; 9 = No religion
      600 Other: 96 = Other; 4 = Other; 96 = Other
  • 55. Harmonization: Female Circumcision (harmonized variable: Ever Circumcised)
      Sample         Variable  Label
      Egypt 1995     S802      Ever circumcised
      Egypt 2005     S801      Respondent circumcised
      Egypt 2008     G102      Respondent circumcised
      Ethiopia 2000  FG103     Circumcised
      Ethiopia 2005  FG103     Circumcised
      Ghana 2003     S821      Circumcised
      Kenya 1998     S1002     Respondent circumcised
      Kenya 2003     S821      Circumcised
      Kenya 2008     G102      Respondent circumcised
      Mali 1995      S551      Circumcised
      Mali 2001      FG103     Circumcised?
      Mali 2006      G102      Respondent circumcised
      Nigeria 1999   S521      Type of circumcision
      Nigeria 2003   FG103     Circumcised
      Nigeria 2008   G102      Respondent circumcised
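Because the same concept sits under a different variable name in almost every survey, one plausible first step is a crosswalk from (country, year) to the source variable, so that each record can expose the value under a single harmonized name. The following is a rough sketch under that assumption; `SOURCE_VARIABLE`, `rename_to_harmonized`, and the target name `EVER_CIRCUMCISED` are hypothetical and not taken from the IDHS project. It only renames; any recoding of values (for example, Nigeria 1999 records type of circumcision rather than a yes/no answer) would be a separate step.

```python
from typing import Any, Dict, Tuple

# (country, survey year) -> source variable name, copied from the slide.
SOURCE_VARIABLE: Dict[Tuple[str, int], str] = {
    ("Egypt", 1995): "S802", ("Egypt", 2005): "S801", ("Egypt", 2008): "G102",
    ("Ethiopia", 2000): "FG103", ("Ethiopia", 2005): "FG103",
    ("Ghana", 2003): "S821",
    ("Kenya", 1998): "S1002", ("Kenya", 2003): "S821", ("Kenya", 2008): "G102",
    ("Mali", 1995): "S551", ("Mali", 2001): "FG103", ("Mali", 2006): "G102",
    ("Nigeria", 1999): "S521", ("Nigeria", 2003): "FG103", ("Nigeria", 2008): "G102",
}


def rename_to_harmonized(country: str, year: int, row: Dict[str, Any],
                         target: str = "EVER_CIRCUMCISED") -> Dict[str, Any]:
    """Return a copy of `row` with the survey-specific variable renamed to `target`."""
    source = SOURCE_VARIABLE[(country, year)]
    renamed = {key: value for key, value in row.items() if key != source}
    renamed[target] = row[source]  # value is passed through unchanged
    return renamed


# Made-up record for illustration.
print(rename_to_harmonized("Kenya", 2008, {"CASEID": "0001", "G102": 1}))
```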
  • 56. Timeline: 2014 (current): 9 countries, 39 samples; Much of woman files; Women of childbearing age as unit of analysis
  • 57. Timeline: 2015: 15 countries, 69 samples; Complete the woman files; Children & birth files
  • 58. Timeline: 2017: 21 countries, 94 samples; Men and couples files
  • 59. Timeline: Next grant: 41 African countries, 130+ samples; 11 Asian countries, 32+ samples
  • 60. Beta
  • 61. Terra Populus Goal: Lower barriers to conducting research on population and the environment. Motivation: The data from different domains have incompatible formats, and few researchers have the skills to combine them.
  • 62. TerraPop: 5-year NSF grant; At midpoint: year 3
  • 63. Population Microdata: 6 countries: Argentina, Brazil, Malawi, Spain, United States, Vietnam
  • 64. Area-level Data: Tabulations of census data for administrative units
  • 65. Environmental Data, Rasters (Grid Cells): Land cover from satellite images (Global Land Cover 2000); Agricultural use from satellites and government records (Global Landscapes Initiative); Climate from weather stations (WorldClim)
  • 66. Location-Based Integration (Microdata, Area-level data, Rasters): Mix and match variables originating in any of the data structures. Obtain output in the data structure most useful to you.
  • 67. Location-Based Integration (Microdata, Area-level data, Rasters): Individuals and households with their environmental and social context
  • 68. Location-Based Integration (Microdata, Area-level data, Rasters): Summarized environmental and population characteristics for administrative districts
      County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
      G17003100001  21.2             768                3129         1063         637         365
      G17003100002  23.4             589                2949         1075         1469        717
      G17003100003  24.3             867                3418         1589         1108        617
      G17003100004  21.5             943                1882         425          202         142
      G17003100005  24.1             867                2416         572          426         197
      G17003100006  24.4             697                2560         934          950         563
      G17003100007  25.6             701                2126         653          321         215
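The district table combines environmental summaries and population counts by matching on the shared County ID. A minimal sketch of that kind of key-based join is shown below, using the first two rows of the slide; the function `join_by_area` and the column names are hypothetical, and how TerraPop actually performs the join is not stated in the slides.

```python
from typing import Dict

Table = Dict[str, Dict[str, float]]

# Environmental summaries per district (first two rows from the slide).
climate: Table = {
    "G17003100001": {"mean_ann_temp": 21.2, "max_ann_precip": 768},
    "G17003100002": {"mean_ann_temp": 23.4, "max_ann_precip": 589},
}

# Population counts per district (same two rows).
tenure: Table = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063, "own_rural": 637, "own_urban": 365},
    "G17003100002": {"rent_rural": 2949, "rent_urban": 1075, "own_rural": 1469, "own_urban": 717},
}


def join_by_area(*tables: Table) -> Table:
    """Inner-join tables on area ID, keeping only IDs present in every table."""
    shared_ids = set(tables[0]).intersection(*tables[1:])
    return {
        area_id: {col: val for table in tables for col, val in table[area_id].items()}
        for area_id in sorted(shared_ids)
    }


merged = join_by_area(climate, tenure)
for area_id, row in merged.items():
    print(area_id, row)
```

An inner join is used here so that districts missing from either source are dropped rather than silently filled; whether that is the right choice depends on the analysis.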
  • 69. Location-Based Integration (Microdata, Area-level data, Rasters): Rasters of population and environment data
  • 71. Area-Level Summary of Raster Data
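One common way to summarize raster data for areas is a zonal statistic: each grid cell is assigned to an administrative unit and the cell values are averaged per unit. The sketch below illustrates the idea with plain lists. It assumes a second grid of the same shape holding the unit ID of each cell; that representation, the function `zonal_mean`, and the sample values are all hypothetical, since the slides do not describe how TerraPop computes its summaries.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence


def zonal_mean(values: Sequence[Sequence[Optional[float]]],
               zones: Sequence[Sequence[Optional[Hashable]]]) -> Dict[Hashable, float]:
    """Average cell values per zone; cells with a missing value or zone are skipped."""
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value is None or zone is None:
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Made-up 2x2 temperature grid and matching zone grid.
temperature: List[List[Optional[float]]] = [[20.0, 22.0], [24.0, None]]
district = [["A", "A"], ["B", "B"]]
print(zonal_mean(temperature, district))
```

The result is an area-level table keyed by district, which can then be joined to census tabulations on the same ID.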
  • 72. Boundaries are Key: Linkages across data formats rely on administrative unit boundaries; Particular needs: lower-level boundaries, historical boundaries
  • 76. Beta Version: Web interface will change significantly in fall 2014; Fast microdata tabulator needed
  • 78. IPUMS-International: Census microdata from around the world. Funded by NSF and NIH. Motivation: Provide data access; Preservation
  • 85. Top Institutional Users
      Rank  Country    Institution
      1     USA        University of Minnesota
      2     USA        Harvard University
      3     USA        University of Michigan Ann Arbor
      4     USA        Columbia University
      5     Spain      Autonomous University Barcelona
      6     USA        Arizona State University
      7     Singapore  National University of Singapore
      8     IADB       Inter American Development Bank
      9     WB         World Bank Group
      10    USA        University of California Berkeley
      11    USA        Vanderbilt University
      12    USA        University of Chicago
      13    Australia  University of Queensland Australia
      14    USA        University of California Los Angeles
      15    USA        Dartmouth College
      16    Brazil     Universidade Federal de Minas Gerais
      17    Mexico     El Colegio de México
      18    USA        Yale University
      19    China      University of Hong Kong
      20    USA        University of Washington
      21    UK         London School Economics
      22    UK         University of Stirling
      23    France     Université de Bordeaux 4
      24    Austria    University of Vienna
      25    Malaysia   National University of Malaysia
      26    Austria    Vienna Institute of Demography
      27    USA        Pew Research Center
      28    Colombia   Universidad del Valle
      29    USA        University of Delaware
      30    USA        Brown University
  • 86. Millennium Development Goals: Ratio of literate women to men, 15-24 years old; 1990 Census round. Source: Cuesta and Lovatón (2014)
  • 87. Millennium Development Goals: Colombia: Adolescent Birth Rate; Census 1993, Census 2005. Source: Cuesta and Lovatón (2014). Data Source: IPUMS-International, Minnesota Population Center
  • 88. IPUMSI Future: Data acquisition; Outreach: developing countries; Virtual data enclave