
Data Archiving and Processing



  1. 1. <ul><ul><li>Data Archiving and Processing </li></ul></ul><ul><ul><li>Karine Sahakyan, MD, MPH </li></ul></ul><ul><ul><li>American University of Armenia </li></ul></ul><ul><ul><li>June 26-27, 2008 </li></ul></ul>Caucasus Research Resource Centers – Armenia A Program of the Eurasia Foundation
  2. 2. Overview <ul><li>Introduction, general archiving theory and practices </li></ul><ul><li>Data structures and data processing </li></ul><ul><li>Survey documentation </li></ul><ul><li>User guides </li></ul>
  3. 3. Course schedule <ul><li>26th June </li></ul><ul><li>Introduction and orientation </li></ul><ul><li>General archiving theory and practices </li></ul><ul><li>Data processing, data file structures, deriving variables </li></ul><ul><li>Practical exercises and coursework </li></ul>
  4. 4. Course schedule <ul><li>27th June </li></ul><ul><li>Data linkage </li></ul><ul><li>Survey documentation </li></ul><ul><li>Thematic user guides </li></ul>
  5. 5. <ul><li>Introduction </li></ul>
  6. 6. Data
  7. 7. Data Project
  8. 8. Data Project Documentation
  9. 9. Data Project Documentation Theory
  10. 10. Data User Documentation Theory Project
  11. 11. Data User Documentation Theory Project Data storage and preservation (official archives)
  12. 12. Data User Documentation Theory Project Data processing and analysis (personal archives)
  13. 13. Data User Documentation Theory Project Supporting documentation
  14. 14. Data User Documentation Theory Project User guides
  15. 15. Sources <ul><li>Elder, GH et al (1993) Working with Archival Data: Studying Lives, London, Sage. </li></ul><ul><li>Dale, A et al (1988) Doing Secondary Analysis, London, Unwin Hyman. </li></ul><ul><li>UK Data Archive </li></ul><ul><li>Royal Statistical Society </li></ul><ul><li>ESDS </li></ul><ul><li>SPSS to process and analyse data </li></ul><ul><li>DI2007 and INTAS 2007 survey data sets and docs </li></ul>
  16. 16. Intended course outcomes <ul><li>1: Understanding of the need to systematically archive and document data. </li></ul><ul><li>2: Ability to distinguish between different data types and structures </li></ul><ul><li>3: Ability to use SPSS syntax files to perform data derivation, data merging and analysis </li></ul><ul><li>4: Appreciation of the benefits of user guides for prospective data users </li></ul>
  17. 17. Assessment <ul><li>User guide and codebook development for a specific thematic module of the DI2007. Testing ILOs 1, 2 and 4. </li></ul><ul><li>Analysis of DI2007 or INTAS survey using SPSS syntax files to derive variables / merge files / produce tables/statistics. Testing ILO 3. </li></ul>
  18. 18. Assessment <ul><li>Course work sessions are timetabled, so you are expected to start both projects during this course </li></ul><ul><li>Projects should be submitted 2 weeks after the end of the course </li></ul><ul><li>I will be available by email to assist with any questions in this 2-week period </li></ul>
  19. 19. <ul><li>Data Archives </li></ul>
  20. 20. Data preservation- why? <ul><li>Scientific responsibility </li></ul><ul><li>Costs </li></ul><ul><li>Legal requirement </li></ul><ul><li>Future use (secondary analysis) </li></ul>
  21. 21. Data preservation- what? <ul><li>Original digitised data </li></ul><ul><li>Questionnaire forms (?) </li></ul><ul><li>Explanatory documentation (purpose and technical) </li></ul><ul><li>Unique identifiers (for future linkage and follow up) </li></ul><ul><li>Data at risk of being lost </li></ul>
  22. 22. Data preservation- how? <ul><li>Design surveys with preservation in mind (consent forms, anonymisation) </li></ul><ul><li>Use commonly used formats (eg SPSS) </li></ul><ul><li>Collate developmental reports (track changes) </li></ul><ul><li>Recognised archive sites (CRRC!) </li></ul>
  23. 23. Data preservation- threats <ul><li>Initial user needs delay access </li></ul><ul><li>Ownership and copyright </li></ul><ul><li>Confidentiality, disclosure, ethics, data protection </li></ul><ul><li>Physical storage media </li></ul><ul><li>Logical (digital) storage format </li></ul><ul><li>Costs </li></ul><ul><li>Organisational change </li></ul><ul><li>Poor data infrastructure (funding and strategy) </li></ul>
  24. 24. Survey Data: ‘version’ control <ul><li>Early (pre-cleaning) release </li></ul><ul><li>‘Final’ release </li></ul><ul><li>Additional variables (derivations) </li></ul><ul><li>Preserving the original codings through: </li></ul><ul><li>1. using syntax to process the original data </li></ul><ul><li>2. saving processed data with a different file name </li></ul><ul><li>3. creating an archive of derived data sets (possibly thematic) </li></ul>
  25. 25. Exercise <ul><li>1. What factors constitute the major threats to data preservation in the South Caucasus? </li></ul><ul><li>2. Using your list of threats, formulate a ‘best practice’ guide for the preservation of data which aims to safeguard the future of statistical data in the region. </li></ul>
  26. 26. <ul><li>Data file structures </li></ul>
  27. 27. Simple one-off cross-sectional <ul><li>Simplest file structure </li></ul><ul><li>Data arranged in a case/variables matrix </li></ul><ul><li>Each case has a value on each variable </li></ul><ul><li>Each case has a unique identifier </li></ul><ul><li>Processing involves </li></ul><ul><ul><ul><li>Selecting sub-sets of cases </li></ul></ul></ul><ul><ul><ul><li>Selecting sub-sets of variables </li></ul></ul></ul><ul><ul><ul><li>Recoding original variables </li></ul></ul></ul><ul><ul><ul><li>Deriving new variables from existing ones </li></ul></ul></ul>
  28. 28.
      ID  V1  V2  V3  V4  ..  ..
      1   1   3   70  6   ..  ..
      2   1   3   75  3   ..  ..
      3   2   2   73  4   ..  ..
      4   2   1   73  2   ..  ..
      5   2   1   72  6   ..  ..
      6   1   2   74  10  ..  ..
      7   1   3   73  2   ..  ..
      8   ..  ..  ..  ..  ..  ..
      ..  ..  ..  ..  ..  ..  ..
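The processing steps listed for a simple cross-sectional file (selecting cases, selecting variables, recoding, deriving) can be sketched in a few lines of Python. This is an illustration only, not the course's SPSS syntax: the values are read from the example matrix on slide 28, while the selection rule, the cut-off of 5 and the names `V4_band` and `flag` are assumptions made for the example.

```python
# Case/variable matrix: one dict per case, values as read from the slide 28 example.
cases = [
    {"ID": 1, "V1": 1, "V2": 3, "V3": 70, "V4": 6},
    {"ID": 2, "V1": 1, "V2": 3, "V3": 75, "V4": 3},
    {"ID": 3, "V1": 2, "V2": 2, "V3": 73, "V4": 4},
    {"ID": 4, "V1": 2, "V2": 1, "V3": 73, "V4": 2},
    {"ID": 5, "V1": 2, "V2": 1, "V3": 72, "V4": 6},
    {"ID": 6, "V1": 1, "V2": 2, "V3": 74, "V4": 10},
    {"ID": 7, "V1": 1, "V2": 3, "V3": 73, "V4": 2},
]

# 1. Select a sub-set of cases (assumption: V1 == 1 marks the group of interest).
subset = [c for c in cases if c["V1"] == 1]

# 2. Select a sub-set of variables, always keeping the unique identifier.
keep = ("ID", "V2", "V4")
slim = [{k: c[k] for k in keep} for c in subset]

for row in slim:
    # 3. Recode V4 into two bands, stored in a NEW variable (hypothetical cut-off of 5).
    row["V4_band"] = "low" if row["V4"] < 5 else "high"
    # 4. Derive a new variable from two existing ones (hypothetical rule).
    row["flag"] = row["V2"] == 3 and row["V4_band"] == "high"

# The original matrix is untouched because every step built new dicts.
for row in slim:
    print(row)
```

Writing recodes into new variables rather than over the old ones keeps the original codings available for checking.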
  29. 29. Repeated cross-sectional <ul><li>As above, but the questionnaire, or a newer version of it, is administered at different points in time (say annually) </li></ul><ul><li>Respondents are sampled anew </li></ul><ul><li>Data processing as above </li></ul><ul><li>Comparisons over time are macro, not micro, i.e. they represent aggregate change over time and not individual change. </li></ul>
  30. 30. Different respondents, same questions
      T1
      ID  V1  V2  V3  ..
      1   1   3   72  ..
      2   1   3   71  ..
      3   2   1   74  ..
      ..  ..  ..  ..  ..
      T2
      ID  V1  V2  V3  ..
      1   2   2   79  ..
      2   2   1   80  ..
      3   1   3   83  ..
      ..  ..  ..  ..  ..
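A short Python sketch of what a "macro, not micro" comparison means in a repeated cross-section. The V3 values are read from the slide 30 example; treating the first block of values as T1 and the second as T2 is an assumption, as is the choice of the mean as the summary statistic.

```python
from statistics import mean

# V3 values as read from the slide 30 example (assumption: first block is T1).
v3_t1 = [72, 71, 74]
v3_t2 = [79, 80, 83]

# Respondents are sampled anew, so ID 1 at T1 and ID 1 at T2 are different
# people: only the aggregates can be compared, never the individual rows.
mean_t1 = mean(v3_t1)
mean_t2 = mean(v3_t2)
aggregate_change = mean_t2 - mean_t1

print(f"T1 mean {mean_t1:.2f}, T2 mean {mean_t2:.2f}, change {aggregate_change:.2f}")
```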
  31. 31. Hierarchical cross-sectional <ul><li>Similar to the above, but now there is more than one structure in the data, e.g. respondents within households. </li></ul><ul><li>The case/variable matrix is now ‘nested’, i.e. some data is for the HH and some for the individual (this can be in the same data file or in separate files) </li></ul><ul><li>Separate unique code numbers are needed for each level. </li></ul><ul><li>Data processing involves </li></ul><ul><ul><li>Accurate separation of different levels </li></ul></ul><ul><ul><li>Suitable linkage where appropriate </li></ul></ul>
  32. 32. Hierarchical structure #1 (people in households)
      HHID  H1  H2  PNO  P1  P2  ..  ..
      1     4   1   1    2   1   ..  ..
      1     4   1   2    1   2   ..  ..
      2     1   1   1    2   2   ..  ..
      3     2   2   1    2   2   ..  ..
      3     2   2   2    1   1   ..  ..
      3     2   2   3    1   2   ..  ..
      4     1   1   1    1   2   ..  ..
      4     1   1   2    2   1   ..  ..
      ..    ..  ..  ..   ..  ..  ..  ..
  33. 33. Hierarchical structure #2 (episodes of employment)
      PID  EMPNO  P1  P2    P3    ..
      1    1      2   2001  2005  ..
      1    2      1   2005  -8    ..
      2    1      2   2002  -8    ..
      3    1      2   1995  1998  ..
      3    2      1   1998  2001  ..
      3    3      1   2001  -8    ..
      4    1      1   2004  2005  ..
      4    2      2   2005  -8    ..
      ..   ..     ..  ..    ..    ..
  34. 34. Panel <ul><li>Using the same sample of respondents over time </li></ul><ul><li>Questions are often also repeated at different surveys </li></ul><ul><li>Data structure can be a simple case/variable matrix for each phase of data collection </li></ul><ul><li>A unique identifier for each respondent, which remains for the life of the panel, is needed. </li></ul><ul><li>Data processing </li></ul><ul><ul><li>Connecting variables for a single individual over successive waves of the survey (micro data analysis) </li></ul></ul>
  35. 35. Same respondents over time: T1, T2, T3, T4
  36. 36. Cohort <ul><li>Similar to Panel but each case is from a common cohort (where this is taken to be time related) </li></ul><ul><li>Birth cohort studies for example – all babies born in a particular week during a particular year, traced through their lives </li></ul><ul><li>Data structure and processing same as panel </li></ul>
  37. 37. Same respondents over time: T1, T2, T3, T4
  38. 38. Retrospective <ul><li>Not really a survey type but a data collection tool </li></ul><ul><li>Can be included in any of the surveys listed above </li></ul><ul><li>Data is (retrospectively) longitudinal </li></ul><ul><li>Each retrospective element needs to have unique codings for different events or episodes </li></ul><ul><li>Data structure is ‘relational’: each element relates to each respondent as well as to the respondent’s other retrospective elements </li></ul><ul><li>Data processing </li></ul><ul><ul><li>Time-sensitive linkages of different elements </li></ul></ul>
  39. 39. Looking backwards: from T1 back to T1 - X
  40. 40. Exercise <ul><li>What type of survey design can help with the following ideas: </li></ul><ul><ul><li>Young people are taking longer to get married than they used to </li></ul></ul><ul><ul><li>Fear of crime is highest in the urban environment </li></ul></ul><ul><ul><li>Employment and income are generally under-reported </li></ul></ul><ul><ul><li>The Democrats will win the US presidency in 2008 </li></ul></ul>
  41. 41. Structure of DI2007 <ul><li>Cross-sectional </li></ul><ul><li>Individuals within HH </li></ul><ul><li>All HH members </li></ul><ul><li>Absent HH migrants </li></ul>
  42. 42. Household ID links each file: HH and Ind; HH members; Absent Migrants
  43. 43. Structure of INTAS 2007 <ul><li>Linked to DI 2005, so panel and hierarchical (though these properties are not being used) </li></ul><ul><li>Retrospective data collection </li></ul><ul><li>Main file and 8 retrospective modules </li></ul><ul><li>Relational structure </li></ul>
  44. 44. Each data file relates to each other (person ID): Education, Marriage, Leisure, Employment, Job, Housing, Cohabitation, Core, Children
  45. 45. <ul><li>Deriving variables </li></ul>
  46. 46. Coding and recoding <ul><ul><li>Original codings (as in code book) </li></ul></ul><ul><ul><li>Simplifications </li></ul></ul><ul><ul><ul><li>Dealing with DK / NA and Missing codes (tidying up) </li></ul></ul></ul><ul><ul><ul><li>Collapsing categories (substantive and statistical) </li></ul></ul></ul><ul><ul><li>Improves analysis and presentation </li></ul></ul><ul><ul><li>See D1.sps – DI2007 and contributions of absent HH members </li></ul></ul>
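The two simplifications named on this slide (tidying up DK / NA / missing codes, then collapsing categories) can be illustrated with a minimal Python sketch. It is not a translation of D1.sps: the missing codes, the five-category variable and the three-band collapse are all hypothetical.

```python
# Hypothetical code book: -1 = don't know, -2 = not applicable, -9 = missing.
MISSING_CODES = {-1, -2, -9}

# Hypothetical collapse of a 5-category variable into 3 broader bands.
COLLAPSE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}


def recode(value):
    """Return the collapsed category, or None for DK / NA / missing codes."""
    if value in MISSING_CODES:
        return None
    return COLLAPSE[value]  # raises KeyError on a code absent from the code book


original = [1, 5, -1, 3, 2, -9, 4]
recoded = [recode(v) for v in original]  # the original list is left unchanged

print(recoded)
```

Raising an error on an unexpected code, rather than silently passing it through, is a design choice that surfaces data-entry problems early.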
  47. 47. Creating analytic files <ul><ul><li>Protects the original data from being deleted/overwritten </li></ul></ul><ul><ul><li>Small files are processed faster </li></ul></ul><ul><ul><li>Less scrolling through data/variables </li></ul></ul><ul><ul><li>If syntax file used, it is easy to adapt (to include or delete variables) </li></ul></ul><ul><ul><li>See D2.sps </li></ul></ul>
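As a rough illustration of the idea (not the contents of D2.sps), the sketch below builds a small analytic file from a larger one while refusing to write over the original. It assumes the data are held as CSV rather than SPSS files; the function name `make_analytic_file`, the file names and the values are hypothetical.

```python
import csv
import os
import tempfile


def make_analytic_file(source_path, target_path, keep_vars):
    """Copy only keep_vars from source_path into a new file at target_path."""
    if os.path.abspath(source_path) == os.path.abspath(target_path):
        raise ValueError("refusing to overwrite the original data file")
    with open(source_path, newline="") as src, \
            open(target_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every variable not listed in keep_vars.
        writer = csv.DictWriter(dst, fieldnames=keep_vars, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            writer.writerow(row)


# Demonstration in a throwaway directory with made-up values.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "survey_original.csv")
analytic = os.path.join(workdir, "survey_analytic.csv")
with open(original, "w", newline="") as f:
    f.write("ID,V1,V2,V3\n1,1,3,70\n2,1,3,75\n")

make_analytic_file(original, analytic, ["ID", "V3"])

with open(analytic, newline="") as f:
    analytic_rows = list(csv.DictReader(f))
print(analytic_rows)
```

Changing the variable list in the single call is all that is needed to include or drop variables, which mirrors the point about syntax files being easy to adapt.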
  48. 48. Deriving variables <ul><ul><li>Combining variables to produce a hybrid </li></ul></ul><ul><ul><li>Can be scale related to summarise a concept (i.e. where all response codes are of the same type, as in the ‘safety’ example. See D3.sps) </li></ul></ul><ul><ul><li>Can relate to a broad conceptual category (social origins using parental education and employment. See CF1.sps) </li></ul></ul><ul><ul><li>To adjust data where you have reason to suspect that one variable might help to improve another (using reported expenditure to adjust reported income. See CF7.sps) </li></ul></ul>
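The scale-related case can be sketched as follows. This is an assumption-laden illustration, not the logic of D3.sps: it supposes three items answered on the same response scale, uses the mean of the valid answers as the summary, and requires at least two valid answers (`min_valid` is a hypothetical parameter).

```python
def scale_score(items, min_valid=2):
    """Mean of the non-missing items, or None if too few items were answered."""
    valid = [v for v in items if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical respondents: three 'safety' items on the same 1-5 scale,
# with None standing for a DK / NA / missing answer.
respondents = {
    1: [1, 2, 3],
    2: [4, None, 2],
    3: [None, None, 5],
}

safety = {pid: scale_score(items) for pid, items in respondents.items()}
print(safety)
```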
  49. 49. <ul><li>Data linkage </li></ul>
  50. 50. Hierarchical links <ul><ul><li>Data is nested: individuals within households, or episodes belonging to an individual. </li></ul></ul><ul><ul><li>Link 1: attach HH data to the individual (analysing individuals, not needed for DI as already linked) </li></ul></ul><ul><ul><li>Link 2: produce summary data of all individuals in the HH, and attach to the HH data (analysing HHs). See D6.sps, though it is a bit long. </li></ul></ul><ul><ul><li>Link 3: attach episode data to an individual. See CF2.sps. </li></ul></ul><ul><ul><li>See earlier slides on hierarchical data. </li></ul></ul>
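Links 1 and 2 can be sketched in Python using the values from the people-in-households example on slide 32. This is an illustration rather than a rendering of D6.sps; the summary variables `HHSIZE` and `N_P1_1` (number of members with P1 coded 1) are hypothetical names chosen for the example.

```python
from collections import Counter

# Household-level file: one record per HHID (values from the slide 32 example).
households = {
    1: {"H1": 4, "H2": 1},
    2: {"H1": 1, "H2": 1},
    3: {"H1": 2, "H2": 2},
    4: {"H1": 1, "H2": 1},
}

# Person-level file: HHID plus person number (PNO) identifies each individual.
persons = [
    {"HHID": 1, "PNO": 1, "P1": 2, "P2": 1},
    {"HHID": 1, "PNO": 2, "P1": 1, "P2": 2},
    {"HHID": 2, "PNO": 1, "P1": 2, "P2": 2},
    {"HHID": 3, "PNO": 1, "P1": 2, "P2": 2},
    {"HHID": 3, "PNO": 2, "P1": 1, "P2": 1},
    {"HHID": 3, "PNO": 3, "P1": 1, "P2": 2},
    {"HHID": 4, "PNO": 1, "P1": 1, "P2": 2},
    {"HHID": 4, "PNO": 2, "P1": 2, "P2": 1},
]

# Link 1: attach HH data to each individual (unit of analysis: the person).
person_level = [{**p, **households[p["HHID"]]} for p in persons]

# Link 2: summarise individuals within each HH, then attach the summary to
# the HH record (unit of analysis: the household).
hh_size = Counter(p["HHID"] for p in persons)
n_p1_code1 = Counter(p["HHID"] for p in persons if p["P1"] == 1)
household_level = {
    hhid: {**hh, "HHSIZE": hh_size[hhid], "N_P1_1": n_p1_code1[hhid]}
    for hhid, hh in households.items()
}

print(person_level[0])
print(household_level[3])
```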
  51. 51. Longitudinal links <ul><ul><li>The respondents’ data from successive surveys is joined together </li></ul></ul><ul><ul><li>Cross-wave ID number used for both individual and family </li></ul></ul><ul><ul><li>Panel surveys and cohort surveys </li></ul></ul>
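A minimal Python sketch of a longitudinal link, assuming two waves keyed by the same cross-wave ID. The IDs, the `income` variable and its values are hypothetical; the point is only that the join is made on the ID and that respondents missing from either wave drop out of the linked file.

```python
# Hypothetical wave files keyed by the cross-wave person ID.
wave1 = {101: {"income": 200}, 102: {"income": 150}, 103: {"income": 300}}
wave2 = {101: {"income": 220}, 103: {"income": 280}, 104: {"income": 90}}

# Join on the ID: keep respondents present in both waves and store each
# wave's value under its own name, so individual (micro) change can be derived.
panel = {}
for pid in sorted(wave1.keys() & wave2.keys()):
    w1 = wave1[pid]["income"]
    w2 = wave2[pid]["income"]
    panel[pid] = {"income_w1": w1, "income_w2": w2, "change": w2 - w1}

# IDs seen in only one wave (attrition or new entrants) are worth reporting.
unmatched = sorted(wave1.keys() ^ wave2.keys())

print(panel)
print(unmatched)
```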
  52. 52. Relational links <ul><ul><li>Linking an individual’s marital statuses and fertility statuses </li></ul></ul><ul><ul><li>Linking an individual’s education / employment and job statuses </li></ul></ul><ul><ul><li>Linking both of the above </li></ul></ul><ul><ul><li>… adding housing and leisure </li></ul></ul><ul><ul><li>(so-called ‘many to many’ links, i.e. one individual may have had 5 jobs, 4 different addresses, 2 marriages, 4 children and so on, while others might have had far fewer) </li></ul></ul><ul><ul><li>See CF5.sps and CF6.sps </li></ul></ul>
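One way to picture a relational, time-sensitive link is to ask which job a person held in the year of each marriage. The Python sketch below does this under stated assumptions and is not taken from CF5.sps or CF6.sps: the two modules, the variable names (`PID`, `JOBNO`, `MARNO`, `start`, `end`, `year`) and all values are hypothetical, and an `end` of None is taken to mean the job is ongoing.

```python
# Hypothetical retrospective modules, each with many rows per person.
jobs = [
    {"PID": 1, "JOBNO": 1, "start": 1995, "end": 2000},
    {"PID": 1, "JOBNO": 2, "start": 2000, "end": None},
    {"PID": 2, "JOBNO": 1, "start": 2003, "end": None},
]
marriages = [
    {"PID": 1, "MARNO": 1, "year": 1998},
    {"PID": 1, "MARNO": 2, "year": 2006},
    {"PID": 2, "MARNO": 1, "year": 2001},
]


def job_at(pid, year):
    """Return the JOBNO held by person pid in the given year, or None."""
    for job in jobs:
        if job["PID"] != pid:
            continue
        not_yet_ended = job["end"] is None or year < job["end"]
        if job["start"] <= year and not_yet_ended:
            return job["JOBNO"]
    return None


# Each marriage row gains the job that was current at the time of marriage.
linked = [{**m, "JOBNO": job_at(m["PID"], m["year"])} for m in marriages]

for row in linked:
    print(row)
```

Treating the end year as exclusive means a year in which one job ends and the next begins is assigned to the new job; a different convention would be equally defensible.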
  53. 53. <ul><li>Data Processing Coursework </li></ul>
  54. 54. <ul><li>Survey Documentation </li></ul>
  55. 55. Survey Documentation exercise <ul><ul><li>In groups, detail exactly what is needed to effectively begin analysing a survey data set. </li></ul></ul><ul><ul><li>Try to be as precise as possible about the type of documentation, its content and the amount of detail that is required. </li></ul></ul><ul><ul><li>If you were to manage an archive of different social surveys, what would be on your check list of details to catalogue the surveys? </li></ul></ul>
  56. 56. Survey Documentation (ESDS) <ul><ul><li>all variables should be named. Ideally, variable names should not exceed 8 characters, which ensures compatibility between all current dissemination formats used by the ESDS. The absolute maximum is 32 characters, which ensures compatibility with recent versions of all major dissemination formats (SPSS, ver. 12 onwards; Stata, ver. 7 onwards; and SAS, ver. 8 onwards) </li></ul></ul><ul><ul><li>all variables should be labelled. Labels should be brief (preferably fewer than 80 characters) but precise, and should always make explicit the unit of measurement for continuous (interval) variables. Where possible, all variable labels should reference the question number (and if necessary the questionnaire) </li></ul></ul><ul><ul><li>where possible, all data labelling should be created and supplied to the ESDS as part of the data file itself. This is the expectation with data supplied in one of the three major statistical packages - SPSS, Stata or SAS </li></ul></ul><ul><ul><li>if the package being used for data management does not allow such variable and code labelling, it must be provided as part of the documentation - i.e. a comprehensive list of variable names, variable descriptions, code names and variable formatting information </li></ul></ul><ul><ul><li>the code used to create all derived variables (e.g. the SPSS syntax file or Stata do file) should be provided so that interested users can see exactly how these variables have arisen </li></ul></ul><ul><ul><li>See also ESDS ‘Research Management and Documentation’ </li></ul></ul>
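The naming and labelling limits quoted on this slide lend themselves to an automated check before a file is deposited. The Python sketch below is one possible reading of those limits, not an ESDS tool: the function name `check_variable`, the wording of the messages and the example variables are hypothetical.

```python
RECOMMENDED_NAME_LEN = 8   # ideal limit quoted on the slide
MAX_NAME_LEN = 32          # absolute maximum quoted on the slide
LABEL_LEN_LIMIT = 80       # labels should preferably be shorter than this


def check_variable(name, label):
    """Return a list of documentation problems for one variable."""
    problems = []
    if not name:
        problems.append("variable is not named")
    elif len(name) > MAX_NAME_LEN:
        problems.append("name exceeds the 32-character absolute maximum")
    elif len(name) > RECOMMENDED_NAME_LEN:
        problems.append("name exceeds the recommended 8 characters")
    if not label:
        problems.append("variable is not labelled")
    elif len(label) >= LABEL_LEN_LIMIT:
        problems.append("label is not under 80 characters")
    return problems


# Hypothetical variables from a data file's dictionary.
variables = {
    "hhinc": "Q12: Total household income last month (AMD)",
    "household_income_total": "Q12: Total household income last month (AMD)",
    "q7": "",
}

for name, label in variables.items():
    print(name, check_variable(name, label))
```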
  57. 57. Survey Documentation (Metadata) <ul><ul><li>Rationale / purpose / history </li></ul></ul><ul><ul><li>Questionnaires </li></ul></ul><ul><ul><li>Code book </li></ul></ul><ul><ul><li>Technical details (design issues, sampling, weighting, imputation etc) </li></ul></ul><ul><ul><li>Technical details for users (user based examples) </li></ul></ul><ul><ul><li>Publications (working papers / technical reports / academic papers) </li></ul></ul>
  58. 58. Example <ul><ul><li> </li></ul></ul>
  59. 59. <ul><li>User Guides </li></ul>
  60. 60. What is a thematic data user guide? <ul><ul><li>A tool to assist researchers in locating and using data, i.e. it is not meant to provide all answers, only to point to sources which contain more detail </li></ul></ul><ul><ul><li>An up-to-date collation of research/publications on a particular topic </li></ul></ul><ul><ul><li>A brief description of different data sources </li></ul></ul><ul><ul><li>A brief description of different research projects, their aims, and design. </li></ul></ul><ul><ul><li>A collation of different theoretical questions of relevance </li></ul></ul><ul><ul><li>See ESDS Education / Social Capital </li></ul></ul>
  61. 61. Examples <ul><ul><li> </li></ul></ul>
  62. 62. Exercise <ul><ul><li>Select a substantive topic of interest to you </li></ul></ul><ul><ul><li>List current data sources / theories / publications </li></ul></ul><ul><ul><li>Draft an introduction to this topic that would help a researcher to quickly learn the main issues. </li></ul></ul>