Minnesota Data Harmonization Projects
 

    Minnesota Data Harmonization Projects: Presentation Transcript

    • The Minnesota Data Harmonization Projects
      • Elizabeth Boyle, Miriam King, Matthew Sobek
      • Minnesota Population Center, University of Minnesota
      • sobek@umn.edu
      • Bill & Melinda Gates Foundation, Seattle, Washington, May 21, 2014
    • USA Integrated Public Use Microdata Series
    • Minnesota Population Center
      • We build data infrastructure for the research community and specialize in data harmonization.
      • World’s largest collection of individual population and health data, across 9 projects.
      • 50,000 registered users from over 100 countries.
      • Free.
    • MPC Data Dissemination, 1993-2012 (gigabytes per week)
    • MPC Data Projects
    • The Problem
      1. Combining data from multiple sources is time-consuming
         • Discovery
         • Data management
      2. It’s error-prone
         • Recoding data
         • Overlooking documentation
      3. Hard to replicate results
      4. Discourages comparative research
    • Outline
      • Harmonization methods
      • Dissemination system
      • International projects
        • Integrated DHS
        • Terra Populus
        • IPUMS-International
    • Terminology
      • Harmonization: Combining datasets collected at different times or places into a single, consistent data series. Also called “integration.”
      • Metadata: Data about data. Documentation in the broadest sense.
    • Microdata (example variables: relation to head, marital status, education, occupation)
    • Summary Data
    • Harmonization Methods
      • Metadata
      • Data
      • Dissemination
    • Systematize Metadata (record layout file, pdf)
    • MPC Data Dictionary
      Variable  Start  Width  Value    Variable / value label                Frequency  Universe
      SMOKE100  57     1               Ever smoked 100 cigarettes                       All persons
                              1        Yes                                   54,189
                              2        No                                    59,501
                              7        Don't know/Not sure                   205
                              9        Refused                               39
      SMOKENOW  58     1               Smoke cigarettes now                             Persons who ever smoked
                              1        Yes                                   25,644
                              2        No                                    28,535
                              7        Don't know/Not sure                   0
                              9        Refused                               10
                              Blank    [no label]                            59,745
      SMOKE30   59     2               Number of days smoked in the last 30             Persons who currently smoke
                              1 to 30  Number of days                        25,290
                              77       Don't know/Not sure                   293
                              88       None                                  49
                              99       Refused                               12
                              Blank    [no label]                            88,290
      SMOKENUM  61     2               Number of cigarettes smoked per day              Persons who currently smoke
                              0 to 76  Number of cigarettes                  22,292
                              77       Don't know/Not sure                   248
                              99       Refused                               43
                              Blank    [no label]                            91,351
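The Start and Width columns of a data dictionary like this are what a reader program needs to slice a fixed-width record. A minimal sketch, assuming 1-based start positions and a made-up input record; the function name `parse_record` and the sample record are illustrative and not part of any MPC tool.

```python
# Layout copied from the Start/Width columns of the slide's data dictionary.
# Start positions are assumed to be 1-based.
LAYOUT = {
    "SMOKE100": (57, 1),
    "SMOKENOW": (58, 1),
    "SMOKE30": (59, 2),
    "SMOKENUM": (61, 2),
}

def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into {variable: int or None}.

    Blank fields (the "Blank" rows in the dictionary) become None.
    """
    out = {}
    for name, (start, width) in layout.items():
        raw = line[start - 1 : start - 1 + width]
        out[name] = int(raw) if raw.strip() else None
    return out

# Made-up record: 56 filler characters, then SMOKE100=1, SMOKENOW=1,
# SMOKE30=15, SMOKENUM=10.
record = " " * 56 + "1" + "1" + "15" + "10"
parsed = parse_record(record)

# Made-up record for a person who never smoked: the follow-up fields are blank.
never_smoked = parse_record(" " * 56 + "2" + " " * 5)
```

Keeping the layout in one dictionary means the same metadata can drive both the documentation and the code that reads the file.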
    • Convert Questionnaires to Metadata (Mexico 2000): Water Access
    • Metadata: Questionnaire Text
      5. Number of Rooms
         How many rooms are used for sleeping without counting hallways? _____ Write the number
         Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen _____ Write the number
      6. Access to water
         Read all of the options until you get an affirmative answer. Circle only one answer
         1 Running water inside the dwelling
         2 Running water outside the dwelling but on the land
         3 Running water from a public faucet or hydrant
         4 Running water that is carried from another dwelling
         5 Tanked in by truck
         6 Water from a well, river, lake, stream or other
         Answers 3, 4, 5, 6 continue with number 8
      7. Water supply
         How many days of the week is water available? Circle only one answer
         1 Daily
         2 Every third day
         3 Twice a week
         4 Once a week
         5 Occasionally
    • XML-Tagged Questionnaire Text (tagged variables: water access, bedrooms, rooms)
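A minimal sketch of how tagged questionnaire text could be looked up by variable once it is stored as XML. The element and attribute names (`questionnaire`, `question`, `var`, `option`, `code`) are hypothetical; the slides do not show the actual MPC markup. Only the question wording comes from the Mexico 2000 questionnaire above.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagging scheme, invented for illustration only.
TAGGED = """
<questionnaire sample="Mexico 2000">
  <question number="6" var="water_access">
    <text>Access to water</text>
    <option code="1">Running water inside the dwelling</option>
    <option code="2">Running water outside the dwelling but on the land</option>
    <option code="3">Running water from a public faucet or hydrant</option>
  </question>
  <question number="7" var="water_supply">
    <text>Water supply</text>
    <option code="1">Daily</option>
    <option code="2">Every third day</option>
  </question>
</questionnaire>
"""

def options_for(xml_text, var):
    """Return [(code, label), ...] for the question tagged with `var`."""
    root = ET.fromstring(xml_text)
    question = root.find(f"./question[@var='{var}']")
    if question is None:
        return []
    return [(int(o.get("code")), o.text) for o in question.findall("option")]

water = options_for(TAGGED, "water_access")
missing = options_for(TAGGED, "rooms")  # no question carries this tag here
```

Tagging by variable is what would let a variable's documentation page pull the matching question text from every sample's questionnaire.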
    • Data: Variable Harmonization (Marital Status, IPUMS-International)
      • Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divorced/separated
      • Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
      • Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
    • Translation Table (marital status): harmonized codes and the input codes assigned to them
      Code  Harmonized label         Bangladesh 2011         Mexico 1970                 Kenya 1999
      100   Single                   1 = Unmarried           8 = Single                  1 = Never married
      200   Married or in union      2 = Married
      210     Married, formally
      211       Civil                                        2 = Married, civil
      212       Religious                                    3 = Married, religious
      213       Civil and religious                          1 = Married, civil & relig
      214       Monogamous                                                               2 = Monogamous
      215       Polygamous                                                               3 = Polygamous
      220     Consensual union                               4 = Consensual union
      300   Divorced or separated    4 = Divrc or separated
      310     Separated                                      7 = Separated               6 = Separated
      320     Divorced                                       6 = Divorced                5 = Divorced
      400   Widowed                  3 = Widowed             5 = Widowed                 4 = Widowed
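A translation table of this kind can be sketched as a per-sample lookup from input code to harmonized code. The assignments below are inferred by matching the input labels to the harmonized labels on the slide, so treat them as an illustration rather than the project's authoritative table; the names `TRANSLATION`, `HARMONIZED_LABELS`, and `harmonize` are hypothetical.

```python
# Harmonized codes and labels as listed on the slide.
HARMONIZED_LABELS = {
    100: "Single",
    200: "Married or in union",
    210: "Married, formally",
    211: "Civil",
    212: "Religious",
    213: "Civil and religious",
    214: "Monogamous",
    215: "Polygamous",
    220: "Consensual union",
    300: "Divorced or separated",
    310: "Separated",
    320: "Divorced",
    400: "Widowed",
}

# Input code -> harmonized code, one table per sample (inferred from labels).
TRANSLATION = {
    "Bangladesh 2011": {1: 100, 2: 200, 3: 400, 4: 300},
    "Mexico 1970": {1: 213, 2: 211, 3: 212, 4: 220, 5: 400, 6: 320, 7: 310, 8: 100},
    "Kenya 1999": {1: 100, 2: 214, 3: 215, 4: 400, 5: 320, 6: 310},
}

def harmonize(sample, input_code):
    """Translate one sample-specific marital status code to the harmonized code."""
    return TRANSLATION[sample][input_code]

# "Single" is code 8 in Mexico 1970 but code 1 in the other two samples.
single_codes = {
    harmonize("Mexico 1970", 8),
    harmonize("Kenya 1999", 1),
    harmonize("Bangladesh 2011", 1),
}
kenya_polygamous = HARMONIZED_LABELS[harmonize("Kenya 1999", 3)]
```

Because the table is data rather than hand-written recode logic, adding a sample means adding one more mapping, which is the point of systematizing the harmonization step.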
    • Data Dissemination System
    • Variables Page
    • Variables Page 238 censuses
    • Sample Filtering
    • Variables Page – Filtered
    • Variable Page: Marital Status
    • Variable Codes (Marital status)
    • Variable Page: Marital Status
    • Variable Comparability Discussion (Marital status)
    • Variable Page: Documentation
    • Questionnaire Text
    • Questionnaire Text (Marital status, Cambodia)
    • Variables Page
    • Extract Summary
    • Case Selection
    • Attached Characteristics (examples: age of spouse, employment status of father, occupation of father)
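An attached characteristic copies a variable from a related household member onto a person's own record. A minimal sketch for age of spouse, assuming each person record carries a within-household person number and a pointer to the spouse's person number (0 when no spouse is present). The field names `pernum`, `sploc`, and `age_sp` are assumptions for illustration.

```python
def attach_spouse_age(household):
    """Add an `age_sp` field to every person in one household.

    `household` is a list of dicts with `pernum`, `sploc`, and `age`.
    `sploc` is assumed to hold the spouse's `pernum`, or 0 if none.
    """
    by_pernum = {person["pernum"]: person for person in household}
    for person in household:
        spouse = by_pernum.get(person["sploc"])
        person["age_sp"] = spouse["age"] if spouse else None
    return household

# Made-up household: a couple and their child.
household = attach_spouse_age([
    {"pernum": 1, "sploc": 2, "age": 41},
    {"pernum": 2, "sploc": 1, "age": 38},
    {"pernum": 3, "sploc": 0, "age": 12},
])
spouse_ages = [person["age_sp"] for person in household]
```

The same pattern would apply to father's occupation or employment status with a pointer to the father's record instead of the spouse's.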
    • Extract Summary
    • Download or Revise Extract
    • On-line Analysis
    • The International Projects
    • Integrated DHS
    • Demographic and Health Surveys
      • Foremost source of health information for the developing world
      • Funded by USAID
      • Since the 1980s, over 300 surveys in 90 countries
      • Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
    • IDHS Project
      • 5-year NIH grant (end of year 2)
      • Focus on Africa, with India
      • Partnership with ICF-International and USAID
    • Why an Integrated DHS?
      • Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.
      • Problem:
        • Data discovery
        • Dispersed documentation
        • Data management
        • Variable changes over time
      • Not unique to DHS: endemic to any survey that’s persisted over decades.
    • DHS Research Process
      • Example: Find data on female genital cutting
      • Survey Search Tool
    • Recode notes and data dictionary: just the woman file, for one survey. 61 to go.
      • Still need:
        • Report (377-page PDF), which contains questionnaire and sample design information
        • Errata file
    • At least the DHS variables are already harmonized, right?
      • DHS “Recode Variables” make it more harmonized than most surveys
        • Consistent variable names
        • Each DHS phase has a shared model questionnaire
      • But:
        • 6 phases over 25+ years
        • Country control over final wording of surveys
        • Country-specific variables
      • The recode variables can be a two-edged sword
    • Harmonization: Religion
      Input samples (variable V130 in each): Ghana 1993, Ghana 2008, India 1992, India 2005
      Harmonized code and label, followed by the input codes assigned to it:
      100 Muslim/Islam: 4 = Muslim; 7 = Moslem; 1 = Muslim; 2 = Muslim
      200 Christian: 2 = Christian; 3 = Christian
      201 Catholic: 2 = Catholic; 1 = Catholic
      202 Protestant: 1 = Protestant
      203 Anglican: 2 = Anglican
      204 Methodist: 3 = Methodist
      205 Presbyterian: 4 = Presbyterian
      206 Pentecostal: 5 = Pentecostal
      208 Other Christian: 3 = Other Christian; 6 = Other Christian
      300 Other
      301 Hindu: 0 = Hindu; 1 = Hindu
      302 Sikh: 3 = Sikh; 4 = Sikh
      303 Buddhist: 5 = Buddhist
      304 Jain: 6 = Jain
      305 Jewish: 7 = Jewish
      306 Parsi/Zoroastrian: 8 = Parsi/Zoroastrian
      307 Doni-Polo: 10 = Donyi polo
      400 Traditional/spiritual: 8 = Trad/spiritualist
      401 Traditional: 5 = Traditional
      402 Spiritual
      403 Animist
      500 No religion: 0 = No religion; 9 = No religion; 9 = No religion
      600 Other: 96 = Other; 4 = Other; 96 = Other
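The harmonized religion codes appear to be hierarchical: the first digit gives a broad group (200 Christian) and the remaining digits add detail (201 Catholic, 203 Anglican). Assuming that reading is right, a detailed code can be collapsed to its general group with integer arithmetic. A minimal sketch; `general_code` and the sample values are illustrative.

```python
from collections import Counter

def general_code(detailed):
    """Collapse a 3-digit detailed code to its general group (203 -> 200)."""
    return (detailed // 100) * 100

# Made-up harmonized values for seven respondents.
detailed = [201, 203, 100, 208, 301, 500, 204]

# Tabulate at the general level, where samples with different amounts of
# detail (for example Ghana 1993 vs. Ghana 2008) remain comparable.
by_general = Counter(general_code(code) for code in detailed)
```

This is what lets a researcher keep the detail a sample offers while still comparing across samples that only distinguish the broad groups.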
    • Harmonization: Female Circumcision (harmonized variable: Ever Circumcised)
      Sample         Variable  Original label
      Egypt 1995     S802      Ever circumcised
      Egypt 2005     S801      Respondent circumcised
      Egypt 2008     G102      Respondent circumcised
      Ethiopia 2000  FG103     Circumcised
      Ethiopia 2005  FG103     Circumcised
      Ghana 2003     S821      Circumcised
      Kenya 1998     S1002     Respondent circumcised
      Kenya 2003     S821      Circumcised
      Kenya 2008     G102      Respondent circumcised
      Mali 1995      S551      Circumcised
      Mali 2001      FG103     Circumcised?
      Mali 2006      G102      Respondent circumcised
      Nigeria 1999   S521      Type of circumcision
      Nigeria 2003   FG103     Circumcised
      Nigeria 2008   G102      Respondent circumcised
    • Timeline: 2014 (current)
      • 9 countries, 39 samples
      • Much of woman files
      • Women of childbearing age as unit of analysis
    • Timeline: 2015
      • 15 countries, 69 samples
      • Complete the woman files
      • Children & birth files
    • Timeline: 2017
      • 21 countries, 94 samples
      • Men and couples files
    • Timeline: Next grant
      • 41 African countries, 130+ samples
      • 11 Asian countries, 32+ samples
    • Beta
    • Terra Populus
      • Goal: Lower barriers to conducting research on population and the environment.
      • Motivation: The data from different domains have incompatible formats, and few researchers have the skills to combine them.
    • TerraPop
      • 5-year NSF grant
      • At midpoint: year 3
    • Population Microdata
      • 6 countries: Argentina, Brazil, Malawi, Spain, United States, Vietnam
    • Area-level Data
      • Tabulations of census data for administrative units
    • Environmental Data: Rasters (Grid Cells)
      • Land cover from satellite images (Global Land Cover 2000)
      • Agricultural use from satellites and government records (Global Landscapes Initiative)
      • Climate from weather stations (WorldClim)
    • Location-Based Integration (microdata, area-level data, rasters)
      • Mix and match variables originating in any of the data structures
      • Obtain output in the data structure most useful to you
    • Location-Based Integration (microdata, area-level data, rasters)
      • Individuals and households with their environmental and social context
    • Location-Based Integration (microdata, area-level data, rasters)
      • Summarized environmental and population characteristics for administrative districts
      County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
      G17003100001  21.2             768                3129         1063         637         365
      G17003100002  23.4             589                2949         1075         1469        717
      G17003100003  24.3             867                3418         1589         1108        617
      G17003100004  21.5             943                1882         425          202         142
      G17003100005  24.1             867                2416         572          426         197
      G17003100006  24.4             697                2560         934          950         563
      G17003100007  25.6             701                2126         653          321         215
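The area-level output amounts to joining tables that share a geographic unit identifier. A minimal sketch using the first three rows of the slide's table, split into a climate table and a housing tenure table; the function `join_by_unit` and the field names are hypothetical, and this is not TerraPop's actual implementation.

```python
# Values copied from the slide; field names are made up for the sketch.
climate = {
    "G17003100001": {"mean_ann_temp": 21.2, "max_ann_precip": 768},
    "G17003100002": {"mean_ann_temp": 23.4, "max_ann_precip": 589},
    "G17003100003": {"mean_ann_temp": 24.3, "max_ann_precip": 867},
}
tenure = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063, "own_rural": 637, "own_urban": 365},
    "G17003100002": {"rent_rural": 2949, "rent_urban": 1075, "own_rural": 1469, "own_urban": 717},
    "G17003100003": {"rent_rural": 3418, "rent_urban": 1589, "own_rural": 1108, "own_urban": 617},
}

def join_by_unit(*tables):
    """Inner-join any number of {unit_id: {field: value}} tables on unit_id."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {
        unit: {field: value for table in tables for field, value in table[unit].items()}
        for unit in sorted(shared)
    }

merged = join_by_unit(climate, tenure)
```

The join only works if every source uses the same unit identifiers, which is why consistent boundaries matter so much for this kind of integration.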
    • Location-Based Integration (microdata, area-level data, rasters)
      • Rasters of population and environment data
    • Rasterization of Area-Level Data
    • Area-Level Summary of Raster Data
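Summarizing a raster by area is often called a zonal summary: every grid cell is assigned to an administrative unit, and the cell values are aggregated per unit. A minimal pure-Python sketch computing a mean per unit, assuming the unit boundaries have already been rasterized onto the same grid as the values. The grids and the name `zonal_mean` are made up for illustration; a real system would use a GIS library.

```python
from collections import defaultdict

def zonal_mean(values, zones):
    """Mean raster value per zone.

    `values` and `zones` are equally shaped lists of rows. A cell with
    zone None (outside every unit) or value None (no data) is skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None or value is None:
                continue
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}

# Made-up 2x2 temperature grid and the unit each cell falls in.
temperature = [[20.0, 22.0],
               [24.0, 26.0]]
unit_of_cell = [["A", "A"],
                ["B", None]]

means = zonal_mean(temperature, unit_of_cell)
```

Rasterization of area-level data is the reverse operation: each cell receives the value of the unit it falls in.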
    • Boundaries are Key
      • Linkages across data formats rely on administrative unit boundaries
      • Particular needs
        • Lower level boundaries
        • Historical boundaries
    • Geographic Harmonization
    • Beta Version
      • Web interface will change significantly in fall 2014
      • Fast microdata tabulator needed
    • IPUMS-International
    • IPUMS-International
      • Census microdata from around the world
      • Funded by NSF and NIH
      • Motivation:
        • Provide data access
        • Preservation
    • Khartoum, CBS-Sudan
    • Dhaka, Bangladesh Bureau of Statistics
    • IPUMS-International (map legend: participating, disseminating)
    • IPUMS Censuses Per Country
    • Variables Included in Extracts
    • Top Institutional Users
      Rank  Country    Institution
      1     USA        University of Minnesota
      2     USA        Harvard University
      3     USA        University of Michigan Ann Arbor
      4     USA        Columbia University
      5     Spain      Autonomous University Barcelona
      6     USA        Arizona State University
      7     Singapore  National University of Singapore
      8     IADB       Inter American Development Bank
      9     WB         World Bank Group
      10    USA        University of California Berkeley
      11    USA        Vanderbilt University
      12    USA        University of Chicago
      13    Australia  University of Queensland Australia
      14    USA        University of California Los Angeles
      15    USA        Dartmouth College
      16    Brazil     Universidade Federal de Minas Gerais
      17    Mexico     El Colegio de México
      18    USA        Yale University
      19    China      University of Hong Kong
      20    USA        University of Washington
      21    UK         London School Economics
      22    UK         University of Stirling
      23    France     Université de Bordeaux 4
      24    Austria    University of Vienna
      25    Malaysia   National University of Malaysia
      26    Austria    Vienna Institute of Demography
      27    USA        Pew Research Center
      28    Colombia   Universidad del Valle
      29    USA        University of Delaware
      30    USA        Brown University
    • Millennium Development Goals
      • Ratio of literate women to men, 15-24 years old (1990 census round)
      • Source: Cuesta and Lovatón (2014)
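An indicator like this can be computed directly from harmonized census microdata. A minimal sketch, assuming the indicator is the female literacy rate divided by the male literacy rate among persons aged 15 to 24, with each record weighted by a person weight. The field names (`age`, `sex`, `literate`, `weight`) and the records are made up; this is not the method or data of the cited source.

```python
def literacy_ratio(persons, min_age=15, max_age=24):
    """Weighted female-to-male ratio of literacy rates for an age range."""
    literate = {"F": 0.0, "M": 0.0}
    total = {"F": 0.0, "M": 0.0}
    for person in persons:
        if min_age <= person["age"] <= max_age:
            total[person["sex"]] += person["weight"]
            if person["literate"]:
                literate[person["sex"]] += person["weight"]
    female_rate = literate["F"] / total["F"]
    male_rate = literate["M"] / total["M"]
    return female_rate / male_rate

# Made-up person records; the 40-year-old falls outside the age range.
persons = [
    {"age": 18, "sex": "F", "literate": True, "weight": 2.0},
    {"age": 20, "sex": "F", "literate": False, "weight": 1.0},
    {"age": 40, "sex": "F", "literate": True, "weight": 5.0},
    {"age": 22, "sex": "M", "literate": True, "weight": 1.0},
    {"age": 16, "sex": "M", "literate": True, "weight": 1.0},
    {"age": 24, "sex": "M", "literate": False, "weight": 2.0},
]
ratio = literacy_ratio(persons)
```

With harmonized variables, the same function can run unchanged on every country and census round, which is what makes cross-national indicator tables feasible.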
    • Millennium Development Goals
      • Colombia: Adolescent Birth Rate (Census 1993, Census 2005)
      • Source: Cuesta and Lovatón (2014)
      • Data source: IPUMS-International, Minnesota Population Center
    • IPUMSI Future
      • Data acquisition
      • Outreach: developing countries
      • Virtual data enclave
    • Thank you! sobek@umn.edu