
Documenting Data Transformations

Slides from webinar: Provenance and social science data. Presented on 15 March 2017. Presenter was Prof George Alter, Research Professor, ICPSR, and visiting Professor, ANU

FULL webinar recording: https://youtu.be/elPcKqWoOPg

3. Prof George Alter, (Research Professor, ICPSR & Visiting Prof, ANU)

The C2Metadata Project is producing new tools that will work with common statistical packages (e.g. R and SPSS) to automate the capture of metadata describing variable transformations. Software-independent data transformation descriptions will be added to metadata in two internationally accepted standards: DDI and Ecological Metadata Language (EML). These tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. This work is of special interest to the social sciences, with their strong metadata standards and heavy reliance on statistical analysis software.

Published in: Data & Analytics

Documenting Data Transformations

  1. 1. “Provenance and Social Science Data”, 15 March 2017. Documenting Data Transformations. George Alter, University of Michigan
  2. 2. • Data are useless without Metadata – “data about data” • Metadata should: – Include all information about data creation – Describe transformations to variables – Be easy to create • Our goal: Automated capture of metadata Why Metadata?
  3. 3. A few words about ICPSR • World’s largest archive of social science data • Consortium established 1962 • 760+ member institutions around the world • Founding member and home office for the DDI Alliance
  4. 4. Powered by DDI Metadata ICPSR is building search tools based upon Data Documentation Initiative (DDI) XML Codebooks (pdf and online) are rendered from the DDI.
  5. 5. Searchable database of 4.5M variables Click here for online codebook
  6. 6. Online codebook shows variable in context of dataset Link to online crosstab tool What question was asked? How was the question coded? Link to online graph tool
  7. 7. Searchable database of 4.5M variables Click here for variable comparison
  8. 8. Variable comparison display Click here for online codebook
  9. 9. Search for datasets with 3 desired variables Check boxes for variable comparison
  10. 10. Crosswalk for American National Election Study (ANES) and General Social Survey (GSS) Columns link to 70 datasets 134 tags in 8 lists Variable comparison display Variables linked to online codebooks
  11. 11. Metadata for the American National Election Study What question was asked? Who answered this question? How was the question coded? Who answered this question?
  12. 12. Metadata for the American National Election Study Who answered this question? Who answered this question? How do we know who answered the question? It’s in the pdf.
  13. 13. When data arrive at the archive… • No question text • No interview flow (question order, skip pattern) • No variable provenance • Data transformations are not documented.
  14. 14. How is research data created? • Most surveys are conducted with computer-assisted interviewing (CAI) software – CATI – Computer-assisted Telephone Interview – CAPI – Computer-assisted Personal Interview – CAWI – Computer-assisted Web Interview • There is no paper questionnaire • The CAI program is the questionnaire – i.e. the program is the metadata
  15. 15. Computer Assisted Interviewing. [Diagram: the CAI system produces the original data; a "CAI to DDI" conversion (Colectica, MQDS, others) produces the original metadata as DDI XML.] We already have tools to convert CAI to machine-readable metadata.
  16. 16. Statistical Packages. [Diagram: the CAI system produces the original data, and a "CAI to DDI" conversion (Colectica, MQDS, others) produces the original metadata as DDI XML; command scripts in SPSS, SAS, Stata, or R then turn the original data into revised data.] What happens when a project modifies the data? The modified data no longer match the metadata.
  17. 17. Stat Package to DDI. [Diagram: the CAI system produces the original data, and a "CAI to DDI" conversion (Colectica, MQDS, others) produces the original metadata as DDI XML; command scripts in SPSS, SAS, Stata, or R turn the original data into revised data; metadata are then extracted from the SPSS/SAS/Stata/R data file into a new DDI XML file.] Metadata are re-created after the data are transformed. Transformations are documented by hand.
  18. 18. Statistics packages have limited metadata • Variable names • Variable labels • Value labels • No provenance
  19. 19. Standard Data Transformation Language. [Diagram: the CAI system produces the original data, and a "CAI to DDI" conversion (Colectica, MQDS, others) produces the original metadata as DDI XML; command scripts in SPSS, SAS, Stata, or R turn the original data into revised data; a Script Parser reads those command scripts and writes SDTL; an XML Updater applies the SDTL to the original DDI XML to produce revised metadata as DDI XML.] Automating the capture of transformation metadata. Missing links that we will build.
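The script-parser step on this slide can be illustrated with a very small sketch. Everything in it is an assumption made for illustration: the regular expression covers only simple Stata `generate` and `replace` commands, and the dictionary keys (`operation`, `target`, `expression`, `condition`) are hypothetical names, not the actual SDTL schema or the project's parser.

```python
import re

# Hypothetical pattern for two simple Stata commands.
# A real script parser would need a full grammar for each package.
STATA_ASSIGN = re.compile(
    r"^(?P<command>generate|replace)\s+(?P<target>\w+)\s*=\s*(?P<expr>.+?)"
    r"(?:\s+if\s+(?P<cond>.+))?$"
)


def describe(line, source="Stata"):
    """Turn one command into a software-independent description (illustrative only)."""
    text = line.strip()
    m = STATA_ASSIGN.match(text)
    if m is None:
        raise ValueError(f"unsupported command: {text!r}")
    return {
        "source_language": source,
        "source_command": text,
        # Hypothetical operation names, not SDTL vocabulary.
        "operation": "create_variable" if m["command"] == "generate" else "modify_variable",
        "target": m["target"],
        "expression": m["expr"].strip(),
        "condition": m["cond"].strip() if m["cond"] else None,
    }


if __name__ == "__main__":
    for cmd in ["replace X=. if X==-1", "generate Y=9 if X>3", "generate Z=8 if X<3"]:
        print(describe(cmd))
```

A record like this could then be attached to the variable it describes in the metadata file, which is the role the slide assigns to the updater.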
  20. 20. What statistics packages should be covered? ICPSR Downloads by Format
      Format         | All downloads | Studies with all formats
      Delimited text | 43%           | 29%
      SPSS           | 22%           | 24%
      SAS            | 10%           | 12%
      Stata          | 19%           | 23%
      R              | 5%            | 12%
      Excel          | 0%            | 1%
      Other          | 0%            | 0%
      Total          | 100%          | 100%
      Number         | 378,007       | 154,663
  21. 21. Why do we need an SDTL? Input data X: 2, 3, 4, -1. Output data: not yet shown.
      SPSS commands:
        MISSING VALUES X(-1).
        IF (X > 3) Y=9.
        IF (X < 3) Z=8.
      Stata commands:
        replace X=. if X==-1
        generate Y=9 if X>3
        generate Z=8 if X<3
      SAS commands:
        if X=-1 then X=.;
        if X>3 then Y=9;
        if X<3 then Z=8;
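  22. 22. Why do we need an SDTL?
      SPSS commands:
        MISSING VALUES X(-1).
        IF (X > 3) Y=9.
        IF (X < 3) Z=8.
      SPSS data:
        Input X | Output X | Y | Z
        2       | 2        |   | 8
        3       | 3        |   |
        4       | 4        | 9 |
        -1      | -1       |   |
      Stata commands:
        replace X=. if X==-1
        generate Y=9 if X>3
        generate Z=8 if X<3
      Stata data:
        Input X | Output X | Y | Z
        2       | 2        |   | 8
        3       | 3        |   |
        4       | 4        | 9 |
        -1      |          | 9 |
      SAS commands:
        if X=-1 then X=.;
        if X>3 then Y=9;
        if X<3 then Z=8;
      SAS data:
        Input X | Output X | Y | Z
        2       | 2        | . | 8
        3       | 3        | . | .
        4       | 4        | 9 | .
        -1      | .        | . | 8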
  23. 23. What happens when a missing value is in a logical comparison? • SPSS – Logical expressions including a missing value are considered “Missing.” Usually, “Missing” is equivalent to “False.” • Stata – Missing values are treated as numbers equal to infinity. So, any number is less than a missing value. • SAS – Missing values are treated as numbers equal to minus infinity. So, any number is greater than a missing value.
  24. 24. Missing Values in Comparisons
      SPSS commands:
        MISSING VALUES X(-1).
        IF (X > 3) Y=9.
        IF (X < 3) Z=8.
      SPSS data:
        Input X | Output X | Y | Z
        2       | 2        |   | 8
        3       | 3        |   |
        4       | 4        | 9 |
        -1      | NULL     |   |
      Stata commands:
        replace X=. if X==-1
        generate Y=9 if X>3
        generate Z=8 if X<3
      Stata data:
        Input X | Output X | Y | Z
        2       | 2        |   | 8
        3       | 3        |   |
        4       | 4        | 9 |
        -1      | ∞        | 9 |
      SAS commands:
        if X=-1 then X=.;
        if X>3 then Y=9;
        if X<3 then Z=8;
      SAS data:
        Input X | Output X | Y | Z
        2       | 2        | . | 8
        3       | 3        | . | .
        4       | 4        | 9 | .
        -1      | -∞       | . | 8
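The three comparison rules can be reproduced in a short Python sketch. It does not run SPSS, Stata, or SAS; it only imitates the rules as the slides state them. The function name `transform` and the `missing_as` parameter are assumptions introduced for this example, and a missing value is shown as `None` for all three packages.

```python
import math


def transform(values, missing_as):
    """Apply the slide's three steps to each input X.

    missing_as controls how a missing X behaves inside a comparison:
      None       -> the comparison is itself missing, treated as false (SPSS-like)
      math.inf   -> missing is larger than every number (Stata-like)
      -math.inf  -> missing is smaller than every number (SAS-like)
    """
    rows = []
    for x in values:
        is_missing = (x == -1)  # step 1: -1 is the missing-value code
        if is_missing and missing_as is None:
            gt3 = lt3 = False
        else:
            key = missing_as if is_missing else x
            gt3, lt3 = key > 3, key < 3
        rows.append({
            "X": None if is_missing else x,
            "Y": 9 if gt3 else None,  # step 2: Y=9 where X > 3
            "Z": 8 if lt3 else None,  # step 3: Z=8 where X < 3
        })
    return rows


if __name__ == "__main__":
    for name, rule in [("SPSS", None), ("Stata", math.inf), ("SAS", -math.inf)]:
        print(name, transform([2, 3, 4, -1], rule)[-1])
```

Only the last row, where X is missing, differs between the three rules, which is the case a software-independent description of the transformation has to record explicitly.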
  25. 25. Benefits of automated metadata capture • Metadata will be better – All the information in the CAI can be included. – Variable transformations can be described • Automation will lower costs – Metadata will not be discarded and re-created • All metadata will be standardized and machine readable – Codebooks with rich information can be rendered at will • If we make it easy and beneficial, researchers will use it.
  26. 26. Continuous Capture of Metadata for Statistical Data (NSF ACI-1640575) Project Partners • Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan • Colectica • Metadata Technology North America • Norwegian Centre for Research Data • General Social Survey, NORC, University of Chicago • American National Election Study, University of Michigan
  27. 27. Questions? George Alter altergc@umich.edu
