Minnesota Data Harmonization Projects


  1. 1. The Minnesota Data Harmonization Projects. Bill & Melinda Gates Foundation, Seattle, Washington, May 21, 2014. Elizabeth Boyle, Miriam King, Matthew Sobek, Minnesota Population Center, University of Minnesota. sobek@umn.edu
  2. 2. USA Integrated Public Use Microdata Series
  3. 3. Minnesota Population Center
        • We build data infrastructure for the research community and specialize in data harmonization.
        • World’s largest collection of individual population and health data, across 9 projects.
        • 50,000 registered users from over 100 countries.
        • Free.
  4. 4. MPC Data Dissemination, 1993-2012 (gigabytes per week)
  5. 5. MPC Data Projects
  6. 6. The Problem
        1. Combining data from multiple sources is time-consuming
           • Discovery
           • Data management
        2. It’s error-prone
           • Recoding data
           • Overlooked documentation
        3. Results are hard to replicate
        4. It discourages comparative research
  7. 7. Outline
        • Harmonization methods
        • Dissemination system
        • International projects
          • Integrated DHS
          • Terra Populus
          • IPUMS-International
  8. 8. Terminology
        • Harmonization: combining datasets collected at different times or places into a single, consistent data series. Also called “integration.”
        • Metadata: data about data; documentation in the broadest sense.
  9. 9. Microdata (example variables: relation to head, marital status, education, occupation)
  10. 10. Summary Data
  11. 11. Harmonization Methods
        • Metadata
        • Data
        • Dissemination
  12. 12. Systematize Metadata (record layout file, pdf)
  13. 13. MPC Data Dictionary
        Variable  Start  Width  Value    Var/Value label                       Frequency  Universe
        SMOKE100  57     1               Ever smoked 100 cigarettes                       All persons
                                1        Yes                                   54,189
                                2        No                                    59,501
                                7        Don't know/Not sure                   205
                                9        Refused                               39
        SMOKENOW  58     1               Smoke cigarettes now                             Persons who ever smoked
                                1        Yes                                   25,644
                                2        No                                    28,535
                                7        Don't know/Not sure                   0
                                9        Refused                               10
                                Blank    [no label]                            59,745
        SMOKE30   59     2               Number of days smoked in the last 30             Persons who currently smoke
                                1 to 30  Number of days                        25,290
                                77       Don't know/Not sure                   293
                                88       None                                  49
                                99       Refused                               12
                                Blank    [no label]                            88,290
        SMOKENUM  61     2               Number of cigarettes smoked per day              Persons who currently smoke
                                0 to 76  Number of cigarettes                  22,292
                                77       Don't know/Not sure                   248
                                99       Refused                               43
                                Blank    [no label]                            91,351
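As a minimal sketch of how a record layout like the one on slide 13 could drive data extraction, the Python below slices fixed-width records using the Start and Width columns. It assumes Start is a 1-based column position and treats all-blank fields as missing; `LAYOUT`, `parse_record`, and the sample record are illustrative names and made-up data, not MPC code.

```python
# Start/Width pairs copied from the data dictionary slide.
# ASSUMPTION: Start is a 1-based column position in the raw record.
LAYOUT = {
    "SMOKE100": (57, 1),
    "SMOKENOW": (58, 1),
    "SMOKE30": (59, 2),
    "SMOKENUM": (61, 2),
}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into raw string fields."""
    fields = {}
    for name, (start, width) in layout.items():
        raw = line[start - 1 : start - 1 + width]
        # An all-blank (or absent) field is returned as None.
        fields[name] = raw.strip() or None
    return fields


# Made-up record: 56 filler columns, then SMOKE100=1, SMOKENOW=1,
# SMOKE30=15, SMOKENUM=20.
record = " " * 56 + "1" + "1" + "15" + "20"
print(parse_record(record))
# {'SMOKE100': '1', 'SMOKENOW': '1', 'SMOKE30': '15', 'SMOKENUM': '20'}
```

Keeping the layout in a single dictionary means the same metadata can generate both the documentation and the parsing logic, which is the point of systematizing it.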
  14. 14. Convert Questionnaires to Metadata (Mexico 2000): Water Access
  15. 15. Metadata: Questionnaire Text
        5. Number of Rooms
           How many rooms are used for sleeping without counting hallways? _____ Write the number
           Without counting the hallways or bathrooms how many total rooms are in this dwelling? Count the kitchen _____ Write the number
        6. Access to water
           Read all of the options until you get an affirmative answer. Circle only one answer
           1 Running water inside the dwelling
           2 Running water outside the dwelling but on the land
           3 Running water from a public faucet or hydrant
           4 Running water that is carried from another dwelling
           5 Tanked in by truck
           6 Water from a well, river, lake, stream or other
           Answers 3, 4, 5, 6 continue with number 8
        7. Water supply
           How many days of the week is water available? Circle only one answer
           1 Daily
           2 Every third day
           3 Twice a week
           4 Once a week
           5 Occasionally
  16. 16. XML-Tagged Questionnaire Text (tags shown: Water access, Bedrooms, Rooms)
  17. 17. Data: Variable Harmonization. Marital Status: IPUMS-International
        Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divorced/separated
        Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
        Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
  18. 18. Translation Table: Input
        Bangladesh 2011: 1 = Unmarried; 2 = Married; 3 = Widowed; 4 = Divrc or separated
        Mexico 1970: 1 = Married, civil & relig; 2 = Married, civil; 3 = Married, religious; 4 = Consensual union; 5 = Widowed; 6 = Divorced; 7 = Separated; 8 = Single
        Kenya 1999: 1 = Never married; 2 = Monogamous; 3 = Polygamous; 4 = Widowed; 5 = Divorced; 6 = Separated
  19. 19. Translation Table: Harmonized
        Code  Harmonized label          Mexico 1970                 Bangladesh 2011         Kenya 1999
        100   Single                    8 = Single                  1 = Unmarried           1 = Never married
        200   Married or in union       -                           2 = Married             -
        210     Married, formally       -                           -                       -
        211       Civil                 2 = Married, civil          -                       -
        212       Religious             3 = Married, religious      -                       -
        213       Civil and religious   1 = Married, civil & relig  -                       -
        214       Monogamous            -                           -                       2 = Monogamous
        215       Polygamous            -                           -                       3 = Polygamous
        220     Consensual union        4 = Consensual union        -                       -
        300   Divorced or separated     -                           4 = Divrc or separated  -
        310     Separated               7 = Separated               -                       6 = Separated
        320     Divorced                6 = Divorced                -                       5 = Divorced
        400   Widowed                   5 = Widowed                 3 = Widowed             4 = Widowed
  20. 20. Translation Table: Harmonized (same harmonized codes, labels, and inputs as slide 19)
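The translation table on slides 19 and 20 can be sketched as a lookup from each sample's input code to a harmonized code. The codes and labels below are read from the slides, but the row alignment between inputs and harmonized codes is an assumption inferred by matching labels, as is the idea that the first digit marks the broad category. The names `TRANSLATION`, `harmonize`, and `general_category` are hypothetical.

```python
# Harmonized marital status codes and labels, as listed on the slide.
HARMONIZED_LABELS = {
    100: "Single",
    200: "Married or in union",
    210: "Married, formally",
    211: "Civil",
    212: "Religious",
    213: "Civil and religious",
    214: "Monogamous",
    215: "Polygamous",
    220: "Consensual union",
    300: "Divorced or separated",
    310: "Separated",
    320: "Divorced",
    400: "Widowed",
}

# ASSUMPTION: input-to-harmonized alignment inferred from matching labels.
TRANSLATION = {
    "Mexico 1970": {1: 213, 2: 211, 3: 212, 4: 220, 5: 400, 6: 320, 7: 310, 8: 100},
    "Bangladesh 2011": {1: 100, 2: 200, 3: 400, 4: 300},
    "Kenya 1999": {1: 100, 2: 214, 3: 215, 4: 400, 5: 320, 6: 310},
}


def harmonize(sample, input_code):
    """Translate a sample-specific input code to the harmonized code."""
    return TRANSLATION[sample][input_code]


def general_category(code):
    """Keep only the first digit, e.g. 215 -> 200 (assumed broad category)."""
    return code // 100 * 100


for sample, code in [("Mexico 1970", 1), ("Bangladesh 2011", 2), ("Kenya 1999", 3)]:
    detailed = harmonize(sample, code)
    broad = general_category(detailed)
    print(sample, code, "->", detailed, HARMONIZED_LABELS[detailed],
          "| broad:", HARMONIZED_LABELS[broad])
```

Under these assumptions, all three example inputs fall under "Married or in union" at the broad level, while the detailed code keeps whatever distinction each census actually recorded.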
  21. 21. Data Dissemination System
  22. 22. Data Dissemination System
  23. 23. Variables Page
  24. 24. Variables Page (238 censuses)
  25. 25. Sample Filtering
  26. 26. Variables Page – Filtered
  27. 27. Variable Page: Marital Status
  28. 28. Variable Codes (Marital status)
  29. 29. Variable Codes (Marital status)
  30. 30. Variable Codes (Marital status)
  31. 31. Variable Page: Marital Status
  32. 32. Variable Comparability Discussion (Marital status)
  33. 33. Variable Page: Documentation
  34. 34. Questionnaire Text
  35. 35. Questionnaire Text (Marital status, Cambodia)
  36. 36. Variables Page
  37. 37. Extract Summary
  38. 38. Case Selection
  39. 39. Attached Characteristics (examples: age of spouse, employment status of father, occupation of father)
  40. 40. Extract Summary
  41. 41. Download or Revise Extract
  42. 42. On-line Analysis
  43. 43. The International Projects
  44. 44. Integrated DHS
  45. 45. Demographic and Health Surveys
        • Foremost source of health information for the developing world
        • Funded by USAID
        • Since the 1980s: over 300 surveys in 90 countries
        • Topics: fertility, nutrition, HIV, malaria, maternal and child health, etc.
  46. 46. IDHS Project
        • 5-year NIH grant (end of year 2)
        • Focus on Africa, with India
        • Partnership with ICF International and USAID
  47. 47. Why an Integrated DHS?
        Motivation: DHS is incredibly valuable, but it’s hard to capitalize on its full potential.
        Problem:
        • Data discovery
        • Dispersed documentation
        • Data management
        • Variable changes over time
        Not unique to DHS: endemic to any survey that’s persisted over decades.
  48. 48. DHS Research Process. Example: Find data on female genital cutting (Survey Search Tool)
  49. 49. Recode notes; data dictionary. Just the woman file, for one survey. 61 to go.
        Still need:
        Report (377-page PDF)
        • Contains questionnaire and sample design information
        • Errata file
  50. 50. At least the DHS variables are already harmonized, right?
        DHS “Recode Variables” make it more harmonized than most surveys:
        • Consistent variable names
        • Each DHS phase has a shared model questionnaire
        But:
        • 6 phases over 25+ years
        • Country control over final wording of surveys
        • Country-specific variables
        The recode variables can be a two-edged sword.
  51. 51. Harmonization: Religion
        Input columns, in slide order: Ghana 1993 V130, Ghana 2008 V130, India 1992 V130, India 2005 V130. Each row gives the harmonized code and label, then the input codes mapped to it.
        100 Muslim/Islam: 4 = Muslim; 7 = Moslem; 1 = Muslim; 2 = Muslim
        200 Christian: 2 = Christian; 3 = Christian
        201 Catholic: 2 = Catholic; 1 = Catholic
        202 Protestant: 1 = Protestant
        203 Anglican: 2 = Anglican
        204 Methodist: 3 = Methodist
        205 Presbyterian: 4 = Presbyterian
        206 Pentecostal: 5 = Pentecostal
        208 Other Christian: 3 = Other Christian; 6 = Other Christian
        300 Other
        301 Hindu: 0 = Hindu; 1 = Hindu
        302 Sikh: 3 = Sikh; 4 = Sikh
        303 Buddhist: 5 = Buddhist
        304 Jain: 6 = Jain
        305 Jewish: 7 = Jewish
        306 Parsi/Zoroastrian: 8 = Parsi/Zoroastrian
        307 Donyi-Polo: 10 = Donyi polo
        400 Traditional/spiritual: 8 = Trad/spiritualist
        401 Traditional: 5 = Traditional
        402 Spiritual
        403 Animist
        500 No religion: 0 = No religion; 9 = No religion; 9 = No religion
        600 Other: 96 = Other; 4 = Other; 96 = Other
  52. 52. Harmonization: Female Circumcision. Harmonized variable: Ever Circumcised
        Sample          Variable  Source label
        Egypt 1995      S802      Ever circumcised
        Egypt 2005      S801      Respondent circumcised
        Egypt 2008      G102      Respondent circumcised
        Ethiopia 2000   FG103     Circumcised
        Ethiopia 2005   FG103     Circumcised
        Ghana 2003      S821      Circumcised
        Kenya 1998      S1002     Respondent circumcised
        Kenya 2003      S821      Circumcised
        Kenya 2008      G102      Respondent circumcised
        Mali 1995       S551      Circumcised
        Mali 2001       FG103     Circumcised?
        Mali 2006       G102      Respondent circumcised
        Nigeria 1999    S521      Type of circumcision
        Nigeria 2003    FG103     Circumcised
        Nigeria 2008    G102      Respondent circumcised
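The sample-to-variable list on slide 52 amounts to a crosswalk from survey-specific variable names to one harmonized name. A minimal sketch, assuming each record is a plain dictionary: `HARMONIZED_NAME` and `rename_to_harmonized` are hypothetical names, and the example record is made up. Only the renaming step is shown; recoding values (for instance, collapsing Nigeria 1999's "Type of circumcision" to yes/no) would need a separate translation table.

```python
# Source variable carrying the circumcision item in each sample (from the slide).
SOURCE_VARIABLE = {
    ("Egypt", 1995): "S802",
    ("Egypt", 2005): "S801",
    ("Egypt", 2008): "G102",
    ("Ethiopia", 2000): "FG103",
    ("Ethiopia", 2005): "FG103",
    ("Ghana", 2003): "S821",
    ("Kenya", 1998): "S1002",
    ("Kenya", 2003): "S821",
    ("Kenya", 2008): "G102",
    ("Mali", 1995): "S551",
    ("Mali", 2001): "FG103",
    ("Mali", 2006): "G102",
    ("Nigeria", 1999): "S521",
    ("Nigeria", 2003): "FG103",
    ("Nigeria", 2008): "G102",
}

HARMONIZED_NAME = "EVER_CIRCUMCISED"  # hypothetical harmonized variable name


def rename_to_harmonized(sample, record):
    """Return a copy of record with the sample's source variable renamed."""
    source = SOURCE_VARIABLE[sample]
    renamed = {key: value for key, value in record.items() if key != source}
    renamed[HARMONIZED_NAME] = record[source]
    return renamed


# Made-up record for illustration only.
print(rename_to_harmonized(("Kenya", 2003), {"CASEID": "a1", "S821": 1}))
# {'CASEID': 'a1', 'EVER_CIRCUMCISED': 1}
```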
  53. 53. Timeline: 2014 (current)
        • 9 countries, 39 samples
        • Much of the woman files
        • Women of childbearing age as unit of analysis
  54. 54. Timeline: 2015
        • 15 countries, 69 samples
        • Complete the woman files
        • Children & birth files
  55. 55. Timeline: 2017
        • 21 countries, 94 samples
        • Men and couples files
  56. 56. Timeline: Next grant
        • 41 African countries, 130+ samples
        • 11 Asian countries, 32+ samples
  57. 57. Beta
  58. 58. Terra Populus
        Goal: Lower barriers to conducting research on population and the environment.
        Motivation:
        • The data from different domains have incompatible formats, and few researchers have the skills to combine them.
  59. 59. TerraPop
        • 5-year NSF grant
        • At midpoint: year 3
  60. 60. Population Microdata
        • 6 countries: Argentina, Brazil, Malawi, Spain, United States, Vietnam
  61. 61. Area-level Data
        • Tabulations of census data for administrative units
  62. 62. Environmental Data: Rasters (Grid Cells)
        • Land cover from satellite images (Global Land Cover 2000)
        • Agricultural use from satellites and government records (Global Landscapes Initiative)
        • Climate from weather stations (WorldClim)
  63. 63. Location-Based Integration (Microdata, Area-level data, Rasters)
        • Mix and match variables originating in any of the data structures
        • Obtain output in the data structure most useful to you
  64. 64. Location-Based Integration (Microdata, Area-level data, Rasters): Individuals and households with their environmental and social context
  65. 65. Location-Based Integration (Microdata, Area-level data, Rasters): Summarized environmental and population characteristics for administrative districts. The slide shows three tables keyed by County ID: the IDs alone, the IDs with environmental summaries, and the combined table:
        County ID     Mean Ann. Temp.  Max. Ann. Precip.  Rent, Rural  Rent, Urban  Own, Rural  Own, Urban
        G17003100001  21.2             768                3129         1063         637         365
        G17003100002  23.4             589                2949         1075         1469        717
        G17003100003  24.3             867                3418         1589         1108        617
        G17003100004  21.5             943                1882         425          202         142
        G17003100005  24.1             867                2416         572          426         197
        G17003100006  24.4             697                2560         934          950         563
        G17003100007  25.6             701                2126         653          321         215
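The county table on slide 65 is, in effect, a join of environmental summaries and population tabulations on a shared geographic identifier. A minimal sketch of that join in plain Python follows. The values are the first three rows of the slide; splitting them into separate `environment` and `population` tables, the field names, and the helper `join_on_county` are assumptions made for illustration, not TerraPop's implementation.

```python
# Environmental summaries per county (values from the slide).
environment = {
    "G17003100001": {"mean_ann_temp": 21.2, "max_ann_precip": 768},
    "G17003100002": {"mean_ann_temp": 23.4, "max_ann_precip": 589},
    "G17003100003": {"mean_ann_temp": 24.3, "max_ann_precip": 867},
}

# Population tabulations per county (values from the slide).
population = {
    "G17003100001": {"rent_rural": 3129, "rent_urban": 1063, "own_rural": 637, "own_urban": 365},
    "G17003100002": {"rent_rural": 2949, "rent_urban": 1075, "own_rural": 1469, "own_urban": 717},
    "G17003100003": {"rent_rural": 3418, "rent_urban": 1589, "own_rural": 1108, "own_urban": 617},
}


def join_on_county(*tables):
    """Inner-join any number of {county_id: row} tables on county ID."""
    shared_ids = set.intersection(*(set(table) for table in tables))
    merged = {}
    for county_id in sorted(shared_ids):
        row = {}
        for table in tables:
            row.update(table[county_id])
        merged[county_id] = row
    return merged


combined = join_on_county(environment, population)
for county_id, row in combined.items():
    print(county_id, row)
```

The join only works if both sources use the same unit boundaries and identifiers, which is why boundary harmonization matters for linking data formats.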
  66. 66. Location-Based Integration (Microdata, Area-level data, Rasters): Rasters of population and environment data
  67. 67. Rasterization of Area-Level Data
  68. 68. Area-Level Summary of Raster Data
  69. 69. Boundaries are Key
        • Linkages across data formats rely on administrative unit boundaries
        • Particular needs
          • Lower level boundaries
          • Historical boundaries
  70. 70. Geographic Harmonization
  71. 71. Geographic Harmonization
  72. 72. Geographic Harmonization
  73. 73. Beta Version
        • Web interface will change significantly in fall 2014
        • Fast microdata tabulator needed
  74. 74. IPUMS-International
  75. 75. IPUMS-International
        • Census microdata from around the world
        • Funded by NSF and NIH
        Motivation:
        • Provide data access
        • Preservation
  76. 76. Khartoum, CBS-Sudan
  77. 77. Dhaka, Bangladesh Bureau of Statistics
  78. 78. IPUMS-International: Participating, Disseminating
  79. 79. IPUMS Censuses Per Country
  80. 80. IPUMS Censuses Per Country
  81. 81. Variables Included in Extracts
  82. 82. Top Institutional Users
        Rank  Country    Institution
        1     USA        University of Minnesota
        2     USA        Harvard University
        3     USA        University of Michigan Ann Arbor
        4     USA        Columbia University
        5     Spain      Autonomous University Barcelona
        6     USA        Arizona State University
        7     Singapore  National University of Singapore
        8     IADB       Inter American Development Bank
        9     WB         World Bank Group
        10    USA        University of California Berkeley
        11    USA        Vanderbilt University
        12    USA        University of Chicago
        13    Australia  University of Queensland Australia
        14    USA        University of California Los Angeles
        15    USA        Dartmouth College
        16    Brazil     Universidade Federal de Minas Gerais
        17    Mexico     El Colegio de México
        18    USA        Yale University
        19    China      University of Hong Kong
        20    USA        University of Washington
        21    UK         London School Economics
        22    UK         University of Stirling
        23    France     Université de Bordeaux 4
        24    Austria    University of Vienna
        25    Malaysia   National University of Malaysia
        26    Austria    Vienna Institute of Demography
        27    USA        Pew Research Center
        28    Colombia   Universidad del Valle
        29    USA        University of Delaware
        30    USA        Brown University
  83. 83. Millennium Development Goals. Ratio of literate women to men, 15-24 years old, 1990 census round. Source: Cuesta and Lovatón (2014)
  84. 84. Millennium Development Goals. Colombia: Adolescent Birth Rate, Census 1993 and Census 2005. Source: Cuesta and Lovatón (2014). Data source: IPUMS-International, Minnesota Population Center
  85. 85. IPUMSI Future
        • Data acquisition
        • Outreach: developing countries
        • Virtual data enclave
  86. 86. Thank you! sobek@umn.edu