Integrating Cross-Domain Metadata


Slides from a #BigDataMN presentation on metadata integration in the Terra Populus project at the Minnesota Population Center, University of Minnesota.

Published in: Technology


  1. Integrating Cross-Domain Metadata: Terra Populus experience report. Peter Clark (@pclark), Alex Jokela (@alexjokela)
  2. Overview of Terra Populus
     ● National Science Foundation
       ○ Award ID: 0940818
     ● Core goal: Integrate, preserve, and disseminate data describing changes in the human population and environment over time.
     ● Focus for this talk is on integrating and linking interdisciplinary data
       ○ Challenges for both data and metadata
  3. Motivation: Demographics. Data: Population Reference Bureau
  4. Motivation: Environment. Image credit: United Nations Environment Programme (via
  5. Flavors of Data and Metadata
     ● Four types of data in Terra Populus
       ○ Individual demographic
         ■ Brazil 1970-2000 (4 points in time)
         ■ Malawi 1987-2009 (3 points in time)
       ○ Area-based demographic and economic
       ○ Environment and land use
         ■ Global Landscapes Initiative
         ■ Global Land Cover GLC2000
       ○ Climate
         ■ WorldClim: 1950-2000
     ● Three different formats:
       ○ Microdata (survey)
       ○ Area-level (tables)
       ○ Raster (images/bitmaps)
  6. Microdata
     ● Individual-level survey response data
       ○ Usually grouped into households
       ○ Usually has a couple hundred variables
       ○ Usually lacks precise geographic detail, for confidentiality reasons
     ● All censuses tend to ask about similar topics:
       ○ location, dwelling characteristics, age, sex, occupation, marital status, family size, home ownership, nativity and immigration...
     ● Each of these is a Variable
  7. US Census 2000 Microdata
     H000012712724 27600 51205120 22 34473
     P00001270100004501000010600011110000010147031400100005002699901200000 0027010000
     P00001270300004503011010170010110000010147050600215007003299901200000 0027010000
     P00001270200004519000010540010110000010147051600100008003299901200000 0027010000
     H000013412724 27720 51205120 23 2862
     P00001340100008023000010200010110000010147010200216011008203202200000 0027010000
     H000014112724 27500 51205120 22 22517
     P00001410100027001000010540010110000010147010100100009003299901200000 0055010000
     P00001410200022602000220440010110000010147010100100009002202602200000 0027010000
     P00001410300031503000010210010110000010147050600206011003202202200000 0027010000
     P00001410400029303011010170010110000010147050600205007003202202200000 0027010000
     P00001410500024703011010130010110000010147050000204004003202202200000 0027010000
  8. US Census 2000 Microdata (first household only)
     H000012712724 27600 51205120 22 34473
     P00001270100004501000010600011110000010147031400100005002699901200000 0027010000
     P00001270300004503011010170010110000010147050600215007003299901200000 0027010000
     P00001270200004519000010540010110000010147051600100008003299901200000 0027010000
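The microdata records shown on these slides are hierarchical: each H (household) record is followed by the P (person) records of its members. A minimal sketch of grouping them, assuming the record type is the first character and that characters 2-8 hold a serial number shared by a household and its persons (this matches the sample lines, but the layout should be confirmed against the codebook). The function name is hypothetical and the sample records are truncated for brevity.

```python
from typing import Dict, List


def group_households(lines: List[str]) -> List[Dict[str, object]]:
    """Attach each person (P) record to the household (H) record preceding it."""
    households: List[Dict[str, object]] = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: column 1 = record type, columns 2-8 = serial number.
        rectype, serial = line[0], line[1:8]
        if rectype == "H":
            households.append({"serial": serial, "household": line, "persons": []})
        elif rectype == "P":
            if not households or households[-1]["serial"] != serial:
                raise ValueError(f"person record without matching household: {serial}")
            households[-1]["persons"].append(line)
        else:
            raise ValueError(f"unknown record type: {rectype!r}")
    return households


# Records truncated to their leading characters; only the prefix matters here.
sample = [
    "H000012712724 27600 51205120 22 34473",
    "P0000127010000450100",
    "P0000127030000450301",
    "P0000127020000451900",
    "H000013412724 27720 51205120 23 2862",
    "P0000134010000802300",
]
households = group_households(sample)
```

Raising on an orphaned person record is a deliberate choice: with old or recovered files, a silent mismatch is harder to track down than a failed load.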
  9. Area-level aggregate data
     ● Demographic and economic characteristics of a location
       ○ Location can mean nation, state, county, city, metropolitan area
       ○ Might be a political/administrative entity or just a statistical boundary.
     ● Some aggregate characteristics are scalar:
       ○ Total Population, GDP, % literacy, average household size
     ● More often it's a crosstab or table...
       ○ We treat each cell in the table as a variable
  10. Educational Attainment by Sex: Minnesota
     Subject                                   Estimate (Total)  Estimate (Male)  Estimate (Female)
     Population 18 to 24 years                 506,399           258,167          248,232
     Less than high school graduate            12.9%             14.2%            11.6%
     High school graduate (inc. equivalency)   25.8%             28.3%            23.1%
     Some college or associate's degree        50.2%             48.5%            51.9%
     Bachelor's degree or higher               11.2%             9.0%             13.4%
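Treating each cell of a table as a variable can be sketched as a flattening step: every (row, column) pair becomes its own key. This is only an illustration of the idea, not the project's implementation; the function name and the tuple-keyed representation are assumptions, and the values are two rows from the Minnesota table.

```python
from typing import Dict, List, Tuple


def cells_to_variables(
    rows: List[str], columns: List[str], values: List[List[float]]
) -> Dict[Tuple[str, str], float]:
    """Flatten a crosstab so that each cell becomes one (row, column) variable."""
    variables: Dict[Tuple[str, str], float] = {}
    for row_label, row_values in zip(rows, values):
        if len(row_values) != len(columns):
            raise ValueError(f"row {row_label!r} does not match the column count")
        for col_label, value in zip(columns, row_values):
            variables[(row_label, col_label)] = value
    return variables


rows = ["Less than high school graduate", "Bachelor's degree or higher"]
columns = ["Total", "Male", "Female"]
values = [[12.9, 14.2, 11.6], [11.2, 9.0, 13.4]]  # percentages

variables = cells_to_variables(rows, columns, values)
```

Once flattened, a table cell can be described, mapped and selected with the same machinery as a microdata variable.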
  11. Raster data
     ● Basically, just a picture
       ○ But usually heavily processed
     ● Geocoded
     ● Pixel = Measurement
       ○ either direct or interpolated
     ● Type
       ○ Categorical (such as land use)
       ○ Continuous (temperature, elevation)
  12. Global Land Cover 2000
  13. Metadata Integration - same domain
     ● Census/Survey
       ○ Different questions
       ○ Different answers
       ○ Different encodings
     ● Area-level
       ○ Different table structures
       ○ Different geographies
       ○ Different categories
     ● Raster
       ○ Different alignments
       ○ Different resolutions
       ○ Different processing
  14. Metadata integration - cross-domain
     ● Semantics are key
       ○ Terms can mean different things to different disciplines.
       ○ What does "data" mean?
     ● How do you link across domains?
       ○ What questions are you trying to answer?
  15. Microdata integration
     ● Old data usually presents recovery, cleanup and parsing issues
     ● The older the data, the less likely it is to have a machine-readable codebook
     ● If you're lucky, you get the data as an SPSS or SAS file
  16. Historical Data challenges: ETL
  17. Historical Microdata - metadata
  18. Microdata Integration in TerraPop
     ● Data structure can vary on many dimensions
       ○ questions/variables
       ○ categories & topcoding/bottomcoding
       ○ universe statements
     ● We've defined a unified coding scheme
     ● We deal with these using translation tables
  19. Translation Tables for microdata
     CODE  LABEL                     China 1982  Colombia 1973   Kenya 1989     Mexico 1970            USA 1990
     100   Single/Never Married      1=never     4=single        1=single       9=single               6=never
     200   Married/In Union          -           -               -              -                      -
     210   Married (Not Specified)   2=married   2=married       2=monogamous   -                      1=married
     211   Civil                     -           -               -              3=only civil           -
     212   Religious                 -           -               -              4=only religious       -
     213   Civil and Religious       -           -               -              2=civil and religious  -
     214   Polygamous                -           -               3=polygamous   -                      -
     220   Consensual Union          -           1=free union    -              5=free union           -
     300   Separated/Divorced        -           3=sep/divorced  -              -                      -
     310   Separated                 -           -               6=separated    8=separated            3=separated
     330   Divorced                  4=divorced  -               5=divorced     7=divorced             4=divorced
     400   Widowed                   3=widowed   5=widowed       4=widowed      6=widowed              5=widowed
     999   Missing/Unknown           0=missing   6=unknown       B=Blank        1=unknown              -
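A translation table like the one on this slide amounts to one lookup per sample, from the source code to the unified code. A minimal sketch, assuming a nested dictionary keyed by sample name; the entries for China 1982 and USA 1990 are read off the slide and should be checked against the actual codebooks, and the `harmonize` function name is hypothetical.

```python
from typing import Dict

# Source code -> unified marital-status code, per sample (values from the slide).
TRANSLATION: Dict[str, Dict[str, int]] = {
    "China 1982": {"1": 100, "2": 210, "4": 330, "3": 400, "0": 999},
    "USA 1990": {"6": 100, "1": 210, "3": 310, "4": 330, "5": 400},
}


def harmonize(sample: str, source_code: str) -> int:
    """Translate one sample-specific code into the unified coding scheme."""
    try:
        return TRANSLATION[sample][source_code]
    except KeyError:
        # Fail loudly: an unmapped code needs a human decision, not a default.
        raise KeyError(f"no translation for {sample} code {source_code!r}") from None


# The same unified code (400, Widowed) comes from different source codes.
china_widowed = harmonize("China 1982", "3")
usa_widowed = harmonize("USA 1990", "5")
```

Source codes are kept as strings because some samples use non-numeric values, such as the `B=Blank` entry in the table.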
  20. Area-level metadata
     Presents all the same problems as microdata, but with two points of additional complexity
     ● Geography
       ○ boundaries, names, and codes change
     ● Table Structure
       ○ Changes over time in concert with changes in survey structure and binning
  21. Area-level Metadata in TerraPop
     We're cheating!
     ● Generating our own from microdata
       ○ Minimizes structural differences
       ○ But doesn't eliminate semantic differences
     ● But we lose out on fine-grained geography
  22. Raster metadata
     ● Best opportunity for things to be easy
       ○ ISO-19115 standard
       ○ ArcGIS
       ○ GeoTIFF
     ● Need to integrate:
       ○ Spatial: projection, origin, resolution
       ○ Semantic integration is harder
       ○ Across datasets and within a dataset over time
     ● Raster data is often heavily processed, and there's no standard for describing that.
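The spatial side of raster integration (projection, origin, resolution) lends itself to a mechanical check before any cell-by-cell comparison. A rough sketch under stated assumptions: `GridSpec` and `spatial_mismatches` are hypothetical names, the grid values are made up, and cell sizes are given as positive numbers in the units of the coordinate reference system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GridSpec:
    crs: str                         # projection identifier, e.g. "EPSG:4326"
    origin: Tuple[float, float]      # x, y of the upper-left corner
    resolution: Tuple[float, float]  # cell width and height (positive)


def spatial_mismatches(a: GridSpec, b: GridSpec, tol: float = 1e-9) -> List[str]:
    """List which spatial properties prevent two rasters from overlaying directly."""
    problems: List[str] = []
    if a.crs != b.crs:
        problems.append("projection")
    if any(abs(ra - rb) > tol for ra, rb in zip(a.resolution, b.resolution)):
        problems.append("resolution")
    else:
        # Same cell size: origins must differ by a whole number of cells.
        for oa, ob, res in zip(a.origin, b.origin, a.resolution):
            offset = (oa - ob) / res
            if abs(offset - round(offset)) > 1e-6:
                problems.append("origin")
                break
    return problems


base = GridSpec("EPSG:4326", (-180.0, 90.0), (0.5, 0.5))
finer = GridSpec("EPSG:4326", (-180.0, 90.0), (0.25, 0.25))
shifted = GridSpec("EPSG:4326", (-179.9, 90.0), (0.5, 0.5))

finer_problems = spatial_mismatches(base, finer)
shifted_problems = spatial_mismatches(base, shifted)
```

A check like this only covers the spatial metadata; semantic differences and undocumented processing steps still need a person to review them.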
  23. Raster metadata...
     INTERNATIONAL JOURNAL OF CLIMATOLOGY
     Int. J. Climatol. 25: 1965–1978 (2005)
     Published online in Wiley InterScience. DOI: 10.1002/joc.1276
     VERY HIGH RESOLUTION INTERPOLATED CLIMATE SURFACES FOR GLOBAL LAND AREAS
     ROBERT J. HIJMANS (a,*), SUSAN E. CAMERON (a,b), JUAN L. PARRA (a), PETER G. JONES (c) and ANDY JARVIS (c,d)
     (a) Museum of Vertebrate Zoology, University of California, 3101 Valley Life Sciences Building, Berkeley, CA, USA
     (b) Department of Environmental Science and Policy, University of California, Davis, CA, USA; and Rainforest Cooperative Research Centre, University of Queensland, Australia
     (c) International Center for Tropical Agriculture, Cali, Colombia
     (d) International Plant Genetic Resources Institute, Cali, Colombia
     Received 18 November 2004. Revised 25 May 2005. Accepted 6 September 2005.
     ABSTRACT
     We developed interpolated climate surfaces for global land areas (excluding Antarctica) at a spatial resolution of 30 arc s (often referred to as 1-km spatial resolution). The climate elements considered were monthly precipitation and mean, minimum, and maximum temperature. Input data were gathered from a variety of sources and, where possible, were restricted to records from the 1950–2000 period. We used the thin-plate smoothing spline algorithm implemented in the ANUSPLIN package for interpolation, using latitude, longitude, and elevation as independent variables. We quantified uncertainty arising from the input data and the interpolation by mapping weather station density, elevation bias in the weather stations, and elevation variation within grid cells and through data partitioning and cross validation. Elevation bias tended to be negative (stations lower than expected) at high latitudes but positive in the tropics. Uncertainty is highest in mountainous and in poorly sampled areas. Data partitioning showed high uncertainty of the surfaces on isolated islands, e.g. in the Pacific. Aggregating the elevation and climate data to 10 arc min resolution showed an enormous variation within grid cells, illustrating the value of high-resolution surfaces. A comparison with an existing data set at 10 arc min resolution showed overall agreement, but with significant variation in some regions. A comparison with two high-resolution data sets for the United States also identified areas with large local differences, particularly in mountainous areas. Compared to previous global climatologies, ours has the following advantages: the data are at a higher spatial resolution (400 times greater or more); more weather station records were used; improved elevation data were used; and more information about spatial patterns of uncertainty in the data is available. Owing to the overall low density of available climate stations, our surfaces do not capture of all variation that may occur at a resolution of 1 km, particularly of precipitation in mountainous areas. In future work, such variation might be captured through knowledge-based methods and inclusion of additional co-variates, particularly layers obtained through remote sensing. Copyright 2005 Royal Meteorological Society.
  24. Integration in Terra Populus
     ● Everything is a Variable.
     ● Deal with each domain separately
     ● Integration across domains is geospatial
       ○ Tabulate microdata to area-level
       ○ Summarize raster to area-level, based on boundary files
     ● Provide output in all three domains
       ○ Different disciplines have different preferences
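Both geospatial steps on this slide reduce to grouping by area. The sketch below shows one way to picture them, not the project's actual pipeline: microdata are counted per area and category, and a raster is averaged per area using a second grid of area identifiers (standing in for a rasterized boundary file). All function names, field names and data values are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, Hashable, List, Optional


def tabulate_microdata(persons: List[dict], area_key: str, variable: str) -> Dict[str, Dict]:
    """Count person records per area and per category of one variable."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for person in persons:
        counts[person[area_key]][person[variable]] += 1
    return {area: dict(counter) for area, counter in counts.items()}


def summarize_raster(
    values: List[List[float]],
    zones: List[List[Optional[Hashable]]],
    nodata: Optional[float] = None,
) -> Dict[Hashable, float]:
    """Mean cell value per area; `zones` is a grid of area ids aligned with `values`."""
    cells: Dict[Hashable, List[float]] = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is not None and value != nodata:
                cells[zone].append(value)
    return {zone: mean(zone_values) for zone, zone_values in cells.items()}


persons = [
    {"area": "A", "marst": 100},
    {"area": "A", "marst": 210},
    {"area": "B", "marst": 210},
]
temperature = [[10.0, 12.0], [20.0, 22.0]]
zones = [["A", "A"], ["B", None]]  # None marks cells outside every boundary

area_counts = tabulate_microdata(persons, "area", "marst")
area_means = summarize_raster(temperature, zones)
```

A mean only makes sense for continuous rasters such as temperature; a categorical raster such as land cover would be summarized with per-category counts or shares instead.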
  25. Observations and Lessons
     ● Unifying view of data as Variables
     ● Need a human in the loop:
       ○ Integration is messy, hard & difficult to automate
       ○ Many issues of semantics to sort out
     ● Mappings let you automate ~80%.
     ● Support hand-coding of special cases.
     ● Useful to have a sense of what the output should look like for sanity checks.
     ● Simple is good.
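The combination of automated mappings, hand-coded special cases and sanity checks might look like the following sketch. It assumes a plain dictionary mapping plus a separate override dictionary maintained by a person; the names and the sample codes are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def recode_all(
    codes: List[str],
    mapping: Dict[str, int],
    overrides: Optional[Dict[str, int]] = None,
) -> Tuple[List[Optional[int]], List[str], float]:
    """Recode values, preferring hand-coded overrides, and report what is left over."""
    overrides = overrides or {}
    recoded: List[Optional[int]] = []
    unmapped: List[str] = []
    for code in codes:
        if code in overrides:      # hand-coded special case wins
            recoded.append(overrides[code])
        elif code in mapping:      # automated mapping
            recoded.append(mapping[code])
        else:                      # left for a human to resolve
            recoded.append(None)
            unmapped.append(code)
    coverage = (len(codes) - len(unmapped)) / len(codes) if codes else 1.0
    return recoded, sorted(set(unmapped)), coverage


codes = ["1", "2", "2", "7", "3"]
mapping = {"1": 100, "2": 210, "3": 400}

recoded, unmapped, coverage = recode_all(codes, mapping)
fixed, still_unmapped, fixed_coverage = recode_all(codes, mapping, overrides={"7": 999})
```

Reporting the unmapped codes and the coverage gives a simple sanity check: an unexpected drop in coverage for a new sample suggests the mapping was built for a different code list.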
  26. Coming this summer! @terrapopulus http://www.nhgis.org