RDAP13 Jared Lyle: Domain Repositories and Institutional Repositories Partnering to Curate


Jared Lyle, ICPSR

Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples

Panel: Partnerships between institutional repositories, domain repositories, and publishers
Research Data Access & Preservation Summit 2013
Baltimore, MD April 4, 2013 #rdap13

  • “At the end of 2011 ICPSR had about 9TB of content stored in Archival Storage. This measurement includes everything we have collected over the past 50 years, including content which is not packaged into "studies" for dissemination, such as TIGER/Line files and data packaged for SDA. This content is not compressed, and contains many duplicates[1], and so should be considered an upper bound.” “Long-time ICPSR staff tell the story of how the 2000 Census doubled the size of ICPSR's holdings. (I'll speculate that perhaps ICPSR went from about 3TB of content prior to the 2000 Census, and then grew to 6TB thereafter.) In 2012-2013 ICPSR is likely to quadruple the size of its holdings, growing from about 9TB to nearly 40TB.” http://techaticpsr.blogspot.com/2012/04/nature-of-icpsrs-holdings.html
  • Sharing data = formally archiving the data.
  • (on a 4-point scale, 49 percent “agree completely” and 42 percent “agree somewhat”)
  • Why are data not shared? Preparing data and documentation can be enormously time consuming; there are limited resources for data preparation; the confidentiality of respondents must be protected; there is fear of getting “scooped”; and there is a lack of rewards for sharing. Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
  • 4,883 NIH & NSF PIs were emailed a survey; 1,217 responses (24.9% response rate); 1,003 valid (collected data, not dissertation). We attempted to invite all 4,883 of these PIs. The PI survey consisted of questions about research data collected, various methods for sharing research data, attitudes about data sharing, and demographic information. PIs were also asked about publications tied to the research project, including information about their own publications, research team publications, and publications outside the research team. We received 1,217 responses (24.9% response rate). For the analytic sample we selected PIs and their research data if they confirmed they collected research data (86.6% of the responses), excluding those who collected data for a dissertation award (n=33) and those missing data on the dependent variable.
  • Enhancements by both Investigators & Archives – time, money, training, & tools
  • [Quote is from the National Longitudinal Survey of Youth’s explanation of its documentation (see: http://www.nlsinfo.org/nlsy97/97guide/chap3.htm#threethree).]
  • “A centuries-old fresco of Jesus Christ that was botched by a well-intentioned elderly woman has drawn hundreds of visitors and reporters to a north-eastern Spanish church - a positive push in tourism for the small town. The "ecce homo" (or "behold the man"), painted by famous Spanish artist Elias Garcia Martinez, is now mockingly - if not affectionately - called "ecce mono" ("behold the monkey") after 81-year-old Cecilia Jimenez of Borja tried to fix the deteriorating fresco by applying a paint brush.” http://www.cbsnews.com/8301-503543_162-57501085-503543/ruined-fresco-draws-attention-fans-in-spain/
  • Traditionally, we’ve dealt with quantitative social surveys with properties and structures.
  • Along with project documentation, which is needed so secondary users can independently understand a data collection.
  • In this example, the very first value is “-1”. The variable is an age variable, as indicated by the name above the frequency table, therefore age cannot have a “-1” value, unless it has another valid meaning, such as “Inappropriate”, “Not applicable”, or “Missing”.
  • This variable might be missing descriptive information, which is problematic and could render the variable unusable. This variable’s name is generic, which might be fine as long as the codebook provides description. But without further labeling, this variable could be meaningless since no label is provided and the value labels are generic Yes/No. The end user wouldn’t be able to interpret the variable.
  • Here the top and bottom values for Age seem a bit off. It looks like the PI recoded everything <18 to 18 and everything >40 as 41, although it’s not explicit.
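The checks described in the notes above (undocumented codes like “-1”, and implicit top/bottom recodes) can be automated during curation. A minimal sketch in plain Python; the data values, valid range, and code list are hypothetical:

```python
# Hypothetical extract of an AGE variable: -1 looks like an undocumented
# missing-data code, and values appear clipped at 18 and 41.
ages = [-1, 18, 18, 25, 33, 41, 41, 41]

VALID_RANGE = (18, 40)                  # documented valid range (assumed)
SUSPECT_CODES = {-1, 97, 98, 99, 999}   # common missing-data conventions

# Values outside the documented range deserve a second look.
out_of_range = sorted({a for a in ages
                       if not VALID_RANGE[0] <= a <= VALID_RANGE[1]})

# Values matching common missing-data conventions may need explicit labels.
suspect_codes = sorted(set(ages) & SUSPECT_CODES)

print("Out-of-range values:", out_of_range)
print("Suspected missing-data codes:", suspect_codes)
```

Flagged values are then checked against the codebook: either they get documented labels (e.g., -1 = “Missing”) or they are corrected.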
  • A disclosure risk review asks the question: Do these data contain content that I need to restrict? Major areas to check when assessing risk include: 1) Are there direct identifiers that reveal the identity of respondents that may have been obtained in the process of data collection?
  • 2) Are there indirect identifiers that reveal the identity of respondents when they are used in combination with other data? It can be more challenging to identify indirect identifiers. Careful attention must therefore be paid to interactions among the context of the study, the nature of the sample, and the characteristics of respondents to prevent ordinarily unrevealing information from becoming the pointer to an individual.
  • 3) Are there external linkages that might reveal the identity of respondents? The ability to link data from these files to data available through external sources may present an unacceptable risk of disclosure.
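A first pass at question 1 (direct identifiers) can be scripted by scanning variable names. A sketch with hypothetical variable names and an illustrative, deliberately non-exhaustive pattern list; a human review of values and documentation still follows:

```python
import re

# Hypothetical variable list from a deposited dataset.
variables = ["RESP_NAME", "ZIP_CODE", "Q1_INCOME", "PHONE", "AGE", "SSN"]

# Name patterns suggesting a direct identifier (illustrative only).
DIRECT_ID_PATTERNS = [r"NAME", r"ADDR", r"ZIP", r"PHONE", r"SSN", r"LICENSE"]

# Flag any variable whose name matches one of the patterns.
flagged = [v for v in variables
           if any(re.search(p, v, re.IGNORECASE) for p in DIRECT_ID_PATTERNS)]

print("Variables to review for direct identifiers:", flagged)
```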
  • Data discovery requires variable-level metadata. The Data Documentation Initiative (DDI) is an XML standard for micro-data in the social sciences. It supports federated search tools, new search tools, variable-level searching, question banks, and harmonization tools.
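To make the variable-level metadata concrete, here is a sketch that builds a minimal DDI-Codebook-style variable description. The element names (var, labl, catgry, catValu) follow DDI 2.x conventions, but treat this as an illustration rather than a complete, schema-valid record:

```python
import xml.etree.ElementTree as ET

# Minimal DDI-Codebook-style description of one variable.
var = ET.Element("var", name="AGE")
ET.SubElement(var, "labl").text = "Age of respondent at interview"

# Document the missing-data code so the value is self-explanatory.
cat = ET.SubElement(var, "catgry")
ET.SubElement(cat, "catValu").text = "-1"
ET.SubElement(cat, "labl").text = "Missing"

xml_snippet = ET.tostring(var, encoding="unicode")
print(xml_snippet)
```

Variable-level records like this are what make tools such as ICPSR's “Search/Compare Variables” possible.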
  • At ICPSR, we use a “LEADS” database to actively discover important research data that should be preserved and disseminated. See: Pienta, Gutmann, Hoelter, Lyle, and Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
  • There are increasing sources and types of data produced.
  • One IR was handed 80 distinct pieces of removable media with over 30,000 files and no instructions. One file, as an amusing aside, was named “David’s Favourite Captain Haddock Curses”. How to find the relevant files, and which of those are essential?
  • None of the IRs had the software to read statistical files, let alone the capability to recover or convert older files. This was when ICPSR could lend a direct hand.
  • Survey distributed March & April 2012. 60% completion rate (109/181). Respondents came from 27 U.S. states + D.C., 6 Canadian provinces, and the UK, AU, NL, NO, and SA. 66% of respondents were from a social science repository mailing list; most were from a college or university. Roles: Librarian 54 (57%); Repository Manager 35 (37%). Of those who’d received or were planning to receive data (80%): Social Sciences (69%), Physical Sciences (47%), Humanities (36%), Biomedical (36%), Engineering (24%).
  • As we’re presenting guidelines and tips for creating a well-prepared data collection, keep in mind that a lot of the information we’re conveying in this presentation is found in our “ICPSR Guide to Social Science Data Preparation and Archiving” (a link to it is on the ICPSR web site).
  • Confidentiality review and treatment involves reviewing and modifying the actual data to reduce disclosure risk. Data collections undergo confidentiality review to determine whether the data contain any information that could be used to identify respondents. All direct identifiers should be removed from files. There are a number of actions you can take to protect respondent confidentiality: removing, masking, or collapsing variables within public-use versions of the datasets, or restricting access to the data. Removing variables is a good solution for treating direct identifiers. Blanking masks identifiers by altering original data values to missing data codes; for example, ‘abcd’ to “    ” (all blanks). Recoding alters original data values to missing data codes; for example, value ‘1234’ is changed to ‘9999’. Bracketing/collapsing combines the categories of a variable or merges the concepts embodied in two or more variables by creating a new summary variable; for example, age: 13-29=1, 30-49=2. Top/bottom coding groups the upper or lower limits to eliminate outliers; for instance, a sample with extreme values for income might top-code or round all income >$100,000. Perturbing is a more complex statistical technique that alters the variable by suppressing, adding, or removing records, and by adding random noise to continuous/pseudo-continuous variables; this technique limits the appeal of the data since it alters the original data values. Finally, access itself can be restricted, by requiring users to apply for use or through highly restricted access (e.g., secured enclave-only access).
  • The intent of the ICPSR pipeline process is to curate, “preserve and access information for the Long Term” (see “Reference Model for an Open Archival Information System (OAIS)”, Consultative Committee for Space Data Systems, Page 2-1. http://public.ccsds.org/publications/archive/650x0b1.PDF). Throughout the pipeline, our intent is to ensure that curated data are independently understandable - that is, “the community should be able to understand the information without needing the assistance of the experts who produced the information” (see “Reference Model for an Open Archival Information System (OAIS)”, Consultative Committee for Space Data Systems, Page 3-1. http://public.ccsds.org/publications/archive/650x0b1.PDF).
  • The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.


  • 1. Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples. Jared Lyle. RDAP13
  • 2. About ICPSR • Founded in 1962 as a consortium of 21 universities to share the National Election Survey • Today: 700+ members around the world • Data dissemination for more than 20 federal and non-government sponsors • 600,000+ visitors per year
  • 3. What we do • Acquire and archive social science data • Distribute data to researchers • Preserve data for future generations • Provide training in quantitative methods. Archive size • 8,000 data collections, over 60,000 data sets • Grows by 300+ collections a year • 9 Terabytes, soon to be 40+ Terabytes
  • 4. http://www.icpsr.umich.edu
  • 5. http://www.flickr.com/photos/dwiggs/3983200894/sizes/l/in/photostream/
  • 6. 1. Sharing Data (Archiving)
  • 7. “It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.” Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf
  • 8. “Virtually all geneticists believe that scientists should share their results freely with peers…” Louis, Jones, and Campbell (2002). “Sharing in Science.” http://dx.doi.org/10.1511/2002.4.304
  • 9. “…the era of data sharing has arrived.” Samet (2009). “Data: To Share or Not to Share?” http://dx.doi.org/10.1097/EDE.0b013e3181930df3
  • 10. http://www.data-pass.org/
  • 11. Most PIs indicated that they wanted to be “Good Citizens” and help: “This sounds like an exciting project.” “I hope your project is successful because I think that it is important.”
  • 12. “Good Citizens” = high willingness… but no time, money, or resources to submit data to us.
  • 13. Data Sharing (N=1,544) [bar chart]: Data Are Archived 14.2%; Has Copy of Data 58.7%; Data Are Lost 25.7%. Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009 See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
  • 14. Data Sharing (N=935):
    Federal Agency | Shared Formally, Archived (n=111) | Shared Informally, Not Archived (n=415) | Not Shared (n=409)
    NSF (27.3%)    | 22.4% | 43.7% | 33.9%
    NIH (72.7%)    | 7.4%  | 45.0% | 47.6%
    Total          | 11.5% | 44.6% | 43.9%
    Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data.” http://hdl.handle.net/2027.42/78307
  • 15. 2. Enhancing Data (Curating)
  • 16. A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.
  • 17. A corollary: Do no harm. http://img.gawkerassets.com/img/17xbuy519gga2jpg/ku-xlarge.jpg
  • 18. Data
  • 19. Documentation http://dx.doi.org/10.3886/ICPSR31521.v1
  • 20. 20
  • 21. 21
  • 22. Disclosure Issues • Direct Identifiers? – personal names – addresses (including ZIP codes) – telephone numbers – social security numbers – driver license numbers – patient numbers – certification numbers
  • 23. Disclosure Issues• Indirect Identifiers? – detailed geography (i.e., state, county, or census tract of residence) – exact date of birth – exact occupations held – exact dates of events – detailed income
  • 24. Disclosure Issues• External Linkages? – public patient/medical records – court records – police and correction records – Social Security records – Medicare records – driver’s licenses – military records
  • 25. Opportunity http://www.flickr.com/photos/k3v1nm/3366181223/
  • 26. “It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.” Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf
  • 27. “Search/Compare Variables” examines 2.1 million variables in 4,000 data collections
  • 28. Emerging sources and types of data• Geo-spatial• Video• Administrative data• Online text• Transactions• Clicks• Sensors
  • 29. Partnerships “We propose that domain specific archives partner with institution based repositories to provide expertise, tools, guidelines, and best practices to the research communities they serve.” Green, Ann G., and Myron P. Gutmann. (2007) "Building Partnerships Among Social Science Researchers, Institution-based Repositories, and Domain Specific Data Archives." OCLC Systems and Services: International Digital Library Perspectives. 23: 35-53. http://hdl.handle.net/2027.42/41214
  • 30. Support:
  • 31. http://www.icpsr.umich.edu/icpsrweb/IR/
  • 32. 5 Pilot Data Collections http://www.flickr.com/photos/smithsonian/2551170386/
  • 33. Selection & Appraisal
  • 34. Recovery
  • 35. Finding interested partners http://www.flickr.com/photos/usnationalarchives/4726917373/
  • 36. Time & Willingness http://www.flickr.com/photos/floridamemory/7026619371/
  • 37. Survey of Repositories’ Data Needs. Inter-university Consortium for Political and Social Research. Survey of Data Curation Services for Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2012-09-21. doi:10.3886/ICPSR34302.v1
  • 38. Repository Suggested Solutions:• Media recovery, format migration, data recovery• Cost estimating and policy review• Metadata tools, documentation, and catalog linkages• Support networks and training• Confidential data dissemination and confidentiality review
  • 39. 1. Community Wayfinder
  • 40. http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
  • 41. 2. Confidentiality Review & Treatment
  • 42. • Suppressing unique cases• Grouping values (e.g., 13-29=1, 30-49=2)• Top-coding (e.g., >1,000=1,000)• Aggregating geographic areas• Swapping values• Sampling within a larger data collection• Adding “noise”• Replacing real data with synthetic data
  • 43. http://www.icpsr.umich.edu/icpsrweb/content/DSDR/tools/qualanon.html
  • 44. 3. Access to Processing Tools
  • 45. The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
  • 46. Hermes Outputs• ASCII data files – Column- and tab-delimited• Stat package setup files – SAS, SPSS, Stata (.do and .dct)• “Ready-to-go” data files – SAS transport (CPORT engine) – SPSS system (.sav) – Stata system (.dta) – R (.rda)
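Hermes is ICPSR's internal processing system, so the following only sketches the flavor of one of its outputs: writing a dataset as a tab-delimited ASCII file. The rows and variable names are hypothetical:

```python
import csv
import io

# Hypothetical dataset: a header row plus two cases.
rows = [("CASEID", "AGE", "INCOME"),
        (1, 34, 45000),
        (2, 28, 52000)]

# Write tab-delimited output (one of the ASCII formats listed above).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
tab_delimited = buf.getvalue()

print(tab_delimited)
```

In practice the same dataset would also be written as stat-package setup files and “ready-to-go” system files (SPSS .sav, Stata .dta, R .rda), which is what makes the archived data usable across tools.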
  • 47. Your ideas on partnerships?Useful categories for discussion?• Media recovery, format migration, data recovery• Cost estimating and policy review• Metadata tools, documentation, and catalog linkages• Support networks and training• Confidential data dissemination and confidentiality review
  • 48. Thank you! lyle@umich.edu