Domain Repositories and Institutional
Repositories Partnering to Curate:
Opportunities and Examples



Jared Lyle
RDAP13
About ICPSR
• Founded in 1962 as a consortium of 21
  universities to share the National Election
  Survey
• Today: 700+ members around the world
• Data dissemination for more than 20 federal
  and non-government sponsors
• 600,000+ visitors per year
What we do
• Acquire and archive social science data
• Distribute data to researchers
• Preserve data for future generations
• Provide training in quantitative methods

Archive size
• 8,000 data collections, over 60,000 data sets
• Grows by 300+ collections a year
• 9 Terabytes, soon to be 40+ Terabytes
http://www.icpsr.umich.edu
http://www.flickr.com/photos/dwiggs/3983200894/sizes/l/in/photostream/
1. Sharing Data (Archiving)
“It saves funding and avoids
repeated data collecting efforts,
allows the verification and
replication of research findings,
facilitates scientific openness,
deters scientific misconduct, and
supports communication and
progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.”
http://www.iassistdata.org/downloads/iqvol304niu.pdf
“Virtually all geneticists believe
that scientists should share their
results freely with peers…”




Louis, Jones, and Campbell (2002). “Sharing in Science.”
http://dx.doi.org/10.1511/2002.4.304
“…the era of data sharing has arrived.”




     Samet (2009). “Data: To Share or Not to Share?”
     http://dx.doi.org/10.1097/EDE.0b013e3181930df3
http://www.data-pass.org/
Most PIs indicated that they wanted
to be “Good Citizens” and help:

  “This sounds like an exciting
   project.”

  “I hope your project is successful
    because I think that it is
    important.”
“Good Citizens” = high willingness



…but no time, money, or resources
to submit data to us.
Data Sharing (N=1,544)
• Data Are Archived: 14.2%
• Has Copy of Data: 58.7%
• Data Are Lost: 25.7%


Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Sh
http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPS
Identifying Important ‘At Risk’ Social Science Data.”
http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
Data Sharing (N=935)

Federal Agency   Shared Formally,    Shared Informally,      Not Shared
                 Archived (n=111)    Not Archived (n=415)    (n=409)
NSF (27.3%)      22.4%               43.7%                   33.9%
NIH (72.7%)      7.4%                45.0%                   47.6%
Total            11.5%               44.6%                   43.9%



Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research:
The Use and Reuse of Primary Research Data”.
http://hdl.handle.net/2027.42/78307
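
As a rough consistency check on the table above, the Total row can be reproduced as a weighted average of the NSF and NIH rows, taking the agency shares in parentheses as weights. This is only a sketch built from the figures shown on the slide; the variable and function names are illustrative and do not come from the cited paper.

```python
# Agency shares of the N=935 sample and per-agency sharing rates (percent),
# copied from the table above. All names here are illustrative.
agency_share = {"NSF": 0.273, "NIH": 0.727}
rates = {
    "NSF": {"archived": 22.4, "informal": 43.7, "not_shared": 33.9},
    "NIH": {"archived": 7.4, "informal": 45.0, "not_shared": 47.6},
}


def weighted_total(category):
    """Weighted average of one sharing category across the two agencies."""
    return sum(agency_share[a] * rates[a][category] for a in agency_share)


totals = {c: round(weighted_total(c), 1)
          for c in ("archived", "informal", "not_shared")}
print(totals)
```

Rounded to one decimal place, the three weighted averages match the Total row (11.5%, 44.6%, 43.9%).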
2. Enhancing Data (Curating)
A well-prepared data collection
“contains information intended to
be complete and self-explanatory”
for future users.
A corollary: Do no harm.




           http://img.gawkerassets.com/img/17xbuy519gga2jpg/ku-xlarge.jpg
Data
Documentation




                http://dx.doi.org/10.3886/ICPSR31521.v1
Disclosure Issues
• Direct Identifiers?
   –   personal names
   –   addresses (including ZIP codes)
   –   telephone numbers
   –   social security numbers
   –   driver license numbers
   –   patient numbers
   –   certification numbers
Disclosure Issues
• Indirect Identifiers?
   – detailed geography (e.g., state, county, or
     census tract of residence)
   – exact date of birth
   – exact occupations held
   – exact dates of events
   – detailed income
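
One way to see why indirect identifiers matter is to count the records that are unique on a combination of them. The following is a minimal sketch, not ICPSR's actual review procedure; the records, field names, and the `unique_cases` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"county": "A", "birth_year": 1950, "occupation": "teacher"},
    {"county": "A", "birth_year": 1950, "occupation": "teacher"},
    {"county": "B", "birth_year": 1972, "occupation": "surgeon"},
    {"county": "B", "birth_year": 1980, "occupation": "clerk"},
    {"county": "B", "birth_year": 1980, "occupation": "clerk"},
]


def unique_cases(rows, keys):
    """Return the rows whose combination of `keys` values occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in rows)
    return [r for r in rows if combos[tuple(r[k] for k in keys)] == 1]


risky = unique_cases(records, ["county", "birth_year", "occupation"])
print(len(risky))  # number of records unique on these three fields
```

A record that is unique on several indirect identifiers at once is a candidate for closer review, even though no single field identifies anyone on its own.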
Disclosure Issues
• External Linkages?
   –   public patient/medical records
   –   court records
   –   police and correction records
   –   Social Security records
   –   Medicare records
   –   driver’s licenses
   –   military records
Opportunity




 http://www.flickr.com/photos/k3v1nm/3366181223/
“It saves funding and avoids
repeated data collecting efforts,
allows the verification and
replication of research findings,
facilitates scientific openness,
deters scientific misconduct, and
supports communication and
progress.”
Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.”
http://www.iassistdata.org/downloads/iqvol304niu.pdf
“Search/Compare Variables” examines 2.1 million variables in 4,000 data collections
Emerging sources and types of data

•   Geo-spatial
•   Video
•   Administrative data
•   Online text
•   Transactions
•   Clicks
•   Sensors
Partnerships
 “We propose that domain specific
 archives partner with institution based
 repositories to provide expertise, tools,
 guidelines, and best practices to the
 research communities they serve.”

Green, Ann G., and Myron P. Gutmann (2007). “Building Partnerships Among
Social Science Researchers, Institution-based Repositories, and Domain
Specific Data Archives.” OCLC Systems and Services: International Digital
Library Perspectives 23: 35-53.
http://hdl.handle.net/2027.42/41214
Support:
http://www.icpsr.umich.edu/icpsrweb/IR/
5 Pilot Data Collections




            http://www.flickr.com/photos/smithsonian/2551170386/
Selection & Appraisal
Recovery
Finding interested partners




           http://www.flickr.com/photos/usnationalarchives/4726917373/
Time & Willingness




          http://www.flickr.com/photos/floridamemory/7026619371/
Survey of Repositories’ Data Needs

Inter-university Consortium for Political and Social
Research. Survey of Data Curation Services for
Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI:
Inter-university Consortium for Political and Social
Research [distributor], 2012-09-21.
doi:10.3886/ICPSR34302.v1
Repository Suggested Solutions:

• Media recovery, format migration, data
  recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog
  linkages
• Support networks and training
• Confidential data dissemination and
  confidentiality review
1. Community Wayfinder
http://www.icpsr.umich.edu/files/ICPSR/access/dataprep.pdf
2. Confidentiality Review & Treatment
• Suppressing unique cases
• Grouping values (e.g., 13-29=1, 30-49=2)
• Top-coding (e.g., >1,000=1,000)
• Aggregating geographic areas
• Swapping values
• Sampling within a larger data collection
• Adding “noise”
• Replacing real data with synthetic data
http://www.icpsr.umich.edu/icpsrweb/content/DSDR/tools/qualanon.html
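
Two of the treatments listed above, grouping values and top-coding, can be sketched in a few lines. This assumes the example codes from the slide (13-29=1, 30-49=2, and a cap of 1,000); the function names and sample values are hypothetical, and real confidentiality review involves more judgment than a recode.

```python
def group_age(age):
    """Recode an exact age into the coarse groups from the slide's example."""
    if 13 <= age <= 29:
        return 1
    if 30 <= age <= 49:
        return 2
    return None  # ages outside the example ranges are left uncoded here


def top_code(value, cap=1000):
    """Replace any value above `cap` with `cap` itself."""
    return min(value, cap)


ages = [15, 29, 30, 49]
incomes = [250, 1000, 4800]
print([group_age(a) for a in ages])    # [1, 1, 2, 2]
print([top_code(v) for v in incomes])  # [250, 1000, 1000]
```

Both recodes trade detail for protection: exact values are no longer recoverable from the released file, but the grouped and capped variables remain usable for most analyses.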
3. Access to Processing Tools
The Virtual Data Enclave (VDE) provides remote access
to quantitative data in a secure environment.
Hermes Outputs
• ASCII data files
   – Column- and tab-delimited

• Stat package setup files
   – SAS, SPSS, Stata (.do and .dct)

• “Ready-to-go” data files
   –   SAS transport (CPORT engine)
   –   SPSS system (.sav)
   –   Stata system (.dta)
   –   R (.rda)
Your ideas on partnerships?

Useful categories for discussion?
• Media recovery, format migration, data
   recovery
• Cost estimating and policy review
• Metadata tools, documentation, and catalog
   linkages
• Support networks and training
• Confidential data dissemination and
   confidentiality review
Thank you!

lyle@umich.edu

RDAP 16 Lightning: RDM Discussion Group: How'd that go?
 
RDAP 16 Lightning: Data Practices and Perspectives of Atmospheric and Enginee...
RDAP 16 Lightning: Data Practices and Perspectives of Atmospheric and Enginee...RDAP 16 Lightning: Data Practices and Perspectives of Atmospheric and Enginee...
RDAP 16 Lightning: Data Practices and Perspectives of Atmospheric and Enginee...
 
RDAP 16 Lightning: Working Across Cultures: Data Librarian as Knowledge Broker
RDAP 16 Lightning: Working Across Cultures: Data Librarian as Knowledge BrokerRDAP 16 Lightning: Working Across Cultures: Data Librarian as Knowledge Broker
RDAP 16 Lightning: Working Across Cultures: Data Librarian as Knowledge Broker
 
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
RDAP 16 Lightning: An Open Science Framework for Solving Institutional Challe...
 
RDAP 16 Lightning: Quantifying Needs for a University Research Repository Sys...
RDAP 16 Lightning: Quantifying Needs for a University Research Repository Sys...RDAP 16 Lightning: Quantifying Needs for a University Research Repository Sys...
RDAP 16 Lightning: Quantifying Needs for a University Research Repository Sys...
 
RDAP 16 Lightning: Personas as a Policy Development Tool for Research Data
RDAP 16 Lightning: Personas as a Policy Development Tool for Research DataRDAP 16 Lightning: Personas as a Policy Development Tool for Research Data
RDAP 16 Lightning: Personas as a Policy Development Tool for Research Data
 
RDAP 16 Lightning: Growing Data in Utah: A Model for Statewide Collaboration
RDAP 16 Lightning: Growing Data in Utah: A Model for Statewide CollaborationRDAP 16 Lightning: Growing Data in Utah: A Model for Statewide Collaboration
RDAP 16 Lightning: Growing Data in Utah: A Model for Statewide Collaboration
 
RDAP 16: Building Without a Plan: How do you assess structural strength? (Pan...
RDAP 16: Building Without a Plan: How do you assess structural strength? (Pan...RDAP 16: Building Without a Plan: How do you assess structural strength? (Pan...
RDAP 16: Building Without a Plan: How do you assess structural strength? (Pan...
 
RDAP 16: How do we know where to grow? Assessing Research Data Services at th...
RDAP 16: How do we know where to grow? Assessing Research Data Services at th...RDAP 16: How do we know where to grow? Assessing Research Data Services at th...
RDAP 16: How do we know where to grow? Assessing Research Data Services at th...
 

Domain Repositories and Institutional Repositories Partnering to Curate Data

  • 1. Domain Repositories and Institutional Repositories Partnering to Curate: Opportunities and Examples Jared Lyle RDAP13
  • 2. About ICPSR • Founded in 1962 as a consortium of 21 universities to share the National Election Survey • Today: 700+ members around the world • Data dissemination for more than 20 federal and non-government sponsors • 600,000+ visitors per year
  • 3. What we do • Acquire and archive social science data • Distribute data to researchers • Preserve data for future generations • Provide training in quantitative methods Archive size • 8,000 data collections, over 60,000 data sets • Grows by 300+ collections a year • 9 Terabytes, soon to be 40+ Terabytes
  • 6. 1. Sharing Data (Archiving)
  • 7. “It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.” Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf
  • 8. “Virtually all geneticists believe that scientists should share their results freely with peers…” Louis, Jones, and Campbell (2002). “Sharing in Science.” http://dx.doi.org/10.1511/2002.4.304
  • 9. “…the era of data sharing has arrived.” Samet (2009). “Data: To Share or Not to Share?” http://dx.doi.org/10.1097/EDE.0b013e3181930df3
  • 11. Most PIs indicated that they wanted to be “Good Citizens” and help: “This sounds like an exciting project.” “I hope your project is successful because I think that it is important.”
  • 12. “Good Citizens” = high willingness …but no time, money, or resources to submit data to us.
  • 13. Data Sharing (N=1,544) [Bar chart. Categories: Data Are Archived, Has Copy of Data, Data Are Lost. Bar values shown: 58.7%, 25.7%, 14.2%.] Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009 See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
  • 14. Data Sharing (N=935) Columns: Federal Agency (share of sample) | Shared Formally, Archived (n=111) | Shared Informally, Not Archived (n=415) | Not Shared (n=409). NSF (27.3%): 22.4% | 43.7% | 33.9%. NIH (72.7%): 7.4% | 45.0% | 47.6%. Total: 11.5% | 44.6% | 43.9%. Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data”. http://hdl.handle.net/2027.42/78307
  • 15. 2. Enhancing Data (Curating)
  • 16. A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.
  • 17. A corollary: Do no harm. http://img.gawkerassets.com/img/17xbuy519gga2jpg/ku-xlarge.jpg
  • 18. Data
  • 19. Documentation http://dx.doi.org/10.3886/ICPSR31521.v1
  • 23. Disclosure Issues • Direct Identifiers? – personal names – addresses (including ZIP codes) – telephone numbers – social security numbers – driver license numbers – patient numbers – certification numbers,
  • 24. Disclosure Issues • Indirect Identifiers? – detailed geography (i.e., state, county, or census tract of residence) – exact date of birth – exact occupations held – exact dates of events – detailed income
  • 25. Disclosure Issues • External Linkages? – public patient/medical records – court records – police and correction records – Social Security records – Medicare records – driver’s licenses – military records
  • 27. “It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.” Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.” http://www.iassistdata.org/downloads/iqvol304niu.pdf
  • 28. “Search/Compare Variables” examines 2.1 million variables in 4,000 data collections
  • 30. Emerging sources and types of data • Geo-spatial • Video • Administrative data • Online text • Transactions • Clicks • Sensors
  • 31. Partnerships “We propose that domain specific archives partner with institution based repositories to provide expertise, tools, guidelines, and best practices to the research communities they serve.” Green, Ann G., and Myron P. Gutmann. (2007) "Building Partnerships Among Social Science Researchers, Institution-based Repositories, and Domain Specific Data Archives." OCLC Systems and Services: International Digital Library Perspectives. 23: 35-53. http://hdl.handle.net/2027.42/41214
  • 34. 5 Pilot Data Collections http://www.flickr.com/photos/smithsonian/2551170386/
  • 37. Finding interested partners http://www.flickr.com/photos/usnationalarchives/4726917373/
  • 38. Time & Willingness http://www.flickr.com/photos/floridamemory/7026619371/
  • 39. Survey of Repositories’ Data Needs Inter-university Consortium for Political and Social Research. Survey of Data Curation Services for Repositories, 2012. ICPSR34302-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2012-09-21. doi:10.3886/ICPSR34302.v1
  • 40. Repository Suggested Solutions: • Media recovery, format migration, data recovery • Cost estimating and policy review • Metadata tools, documentation, and catalog linkages • Support networks and training • Confidential data dissemination and confidentiality review
  • 44. • Suppressing unique cases • Grouping values (e.g., 13-29=1, 30-49=2) • Top-coding (e.g., >1,000=1,000) • Aggregating geographic areas • Swapping values • Sampling within a larger data collection • Adding “noise” • Replacing real data with synthetic data
  • 46. 3. Access to Processing Tools
  • 49. The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
  • 51. Hermes Outputs • ASCII data files – Column- and tab-delimited • Stat package setup files – SAS, SPSS, Stata (.do and .dct) • “Ready-to-go” data files – SAS transport (CPORT engine) – SPSS system (.sav) – Stata system (.dta) – R (.rda)
  • 52. Your ideas on partnerships? Useful categories for discussion? • Media recovery, format migration, data recovery • Cost estimating and policy review • Metadata tools, documentation, and catalog linkages • Support networks and training • Confidential data dissemination and confidentiality review
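
Three of the treatments listed on slide 44 (suppressing unique cases, grouping values, and top-coding) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the "out of range" code 9, and the sample values are hypothetical, and the sketch is not ICPSR's actual processing code.

```python
from collections import Counter

# Brackets taken from the slide's example (13-29=1, 30-49=2).
# The code 9 for anything outside those brackets is an assumption.
AGE_BRACKETS = [(13, 29, 1), (30, 49, 2)]
OUT_OF_RANGE = 9


def group_value(value, brackets=AGE_BRACKETS, default=OUT_OF_RANGE):
    """Collapse an exact value into a coarse category code."""
    for low, high, code in brackets:
        if low <= value <= high:
            return code
    return default


def top_code(value, ceiling=1000):
    """Cap values above the ceiling at the ceiling (e.g., >1,000 = 1,000)."""
    return min(value, ceiling)


def unique_cases(records, keys):
    """Return records whose combination of key values occurs exactly once.

    Such records are candidates for suppression or further grouping.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] == 1]


# Hypothetical sample values.
ages = [13, 29, 30, 49, 62]
grouped = [group_value(a) for a in ages]
capped = [top_code(v) for v in [250, 1000, 4800]]
```

Grouping and top-coding reduce detail in a single variable, while the uniqueness check looks across a combination of variables, which is where indirect identifiers tend to become revealing.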

Editor's Notes

  1. “At the end of 2011 ICPSR had about 9TB of content stored in Archival Storage.  This measurement includes everything we have collected over the past 50 years, including content which is not packaged into "studies" for dissemination, such as TIGER/Line files and data packaged for SDA.  This content is not compressed, and contains many duplicates[1], and so should be considered an upper bound.” “Long-time ICPSR staff tell the story of how the 2000 Census doubled the size of ICPSR's holdings.  (I'll speculate that perhaps ICPSR went from about 3TB of content prior to the 2000 Census, and then grew to 6TB thereafter.)  In 2012-2013 ICPSR is likely to quadruple the size of its holdings, growing from about 9TB to nearly 40TB.” http://techaticpsr.blogspot.com/2012/04/nature-of-icpsrs-holdings.html
  2. Sharing data = formally archiving the data.
  3. (on a 4-point scale, 49 percent “agree completely” and 42 percent “agree somewhat”)
  4. Why are data not shared? • Preparing data and documentation can be enormously time consuming • Limited resources for data preparation • Need to protect the confidentiality of respondents • Fear of getting “scooped” • Lack of rewards for sharing. Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” http://ori.hhs.gov/content/research-research-integrity-rri-conference-2009
  5. 4,883 NIH & NSF PIs emailed a survey • 1,217 responses (24.9% response rate) • 1,003 valid (collected data, not dissertation). We attempted to invite all 4,883 of these PIs. The PI survey consisted of questions about research data collected, various methods for sharing research data, attitudes about data sharing, and demographic information. PIs were also asked about publications tied to the research project, including information about their own publications, research team publications, and publications outside the research team. We received 1,217 responses (24.9% response rate). For the analytic sample we selected PIs and their research data if (1) they confirmed that they collected research data (86.6% of the responses), (2) they did not collect data for a dissertation award (n=33), and (3) they were not missing data on the dependent variable.
  6. Enhancements by both Investigators & Archives – time, money, training, & tools
  7. [Quote is from the National Longitudinal Survey of Youth’s explanation of its documentation (see: http://www.nlsinfo.org/nlsy97/97guide/chap3.htm#threethree).]
  8. “A centuries-old fresco of Jesus Christ that was botched by a well-intentioned elderly woman has drawn hundreds of visitors and reporters to a north-eastern Spanish church - a positive push in tourism for the small town. The "ecce homo" (or "behold the man"), painted by famous Spanish artist Elias Garcia Martinez, is now mockingly - if not affectionately - called "ecce mono" ("behold the monkey") after an 81-year-old Cecilia Jimenez of Borja tried to fix the deteriorating fresco by applying a paint brush.” http://www.cbsnews.com/8301-503543_162-57501085-503543/ruined-fresco-draws-attention-fans-in-spain/
  9. Traditionally, we’ve dealt with quantitative social surveys with properties and structures.
  10. Along with project documentation, which is needed so secondary users can independently understand a data collection.
  11. In this example, the very first value is “-1”. The variable is an age variable, as indicated by the name above the frequency table, so “-1” cannot be an actual age. It is acceptable only if it carries another valid meaning, such as “Inappropriate”, “Not applicable”, or “Missing”.
  12. This variable might be missing descriptive information, which is problematic and could render the variable unusable. This variable’s name is generic, which might be fine as long as the codebook provides description. But without further labeling, this variable could be meaningless since no label is provided and the value labels are generic Yes/No. The end user wouldn’t be able to interpret the variable.
  13. Here the top and bottom values for Age seem a bit off. It looks like the PI recoded everything <18 to 18 and everything >40 as 41, although it’s not explicit.
  14. A disclosure risk review asks the question, Do these data contain content that I need to restrict? Major areas to check when assessing risk include: 1) Are there direct identifiers that reveal the identity of respondents that may have been obtained in the process of data collection?
  15. 2) Are there indirect identifiers that reveal the identity of respondents when they are used in combination with other data? It can be more challenging to identify indirect identifiers. Careful attention must therefore be paid to interactions among the context of the study, the nature of the sample, and the characteristics of respondents to prevent ordinarily unrevealing information from becoming the pointer to an individual.
  16. 3) Are there external linkages that might reveal the identity of respondents? The ability to link data from these files to data available through external sources may present an unacceptable risk of disclosure.
  17. Data discovery requires variable-level metadata • Data Documentation Initiative (DDI) is an XML standard for micro-data in the social sciences • Federated search tools • New search tools • Variable level searching • Question banks • Harmonization tools
  18. At ICPSR, we use a “LEADS” database to actively discover important research data that should be preserved and disseminated. See: Pienta, Gutmann, Hoelter, Lyle, and Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” http://www.data-pass.org/sites/default/files/Pienta_et_al_2008.pdf
  19. There are increasing sources and types of data produced.
  20. One IR was handed 80 distinct pieces of removable media with over 30,000 files and no instructions. One file, as an amusing aside, was named “David’s Favourite Captain Haddock Curses”. How to find the relevant files, and which of those are essential?
  21. None of the IRs had the software to read statistical files, let alone the capability to recover or convert older files. This was when ICPSR could lend a direct hand.
  22. Survey distributed March & April 2012. • 60% completion rate (109/181) • 27 U.S. states + D.C. • 6 Canadian provinces • UK, AU, NL, NO, SA • 66% of respondents from social science repository mailing list • Most from college or university • Librarian: 54 (57%) • Repository Manager: 35 (37%) • Of those who’d received or were planning to receive data (80%): Social Sciences (69%), Physical Sciences (47%), Humanities (36%), Biomedical (36%), Engineering (24%)
  23. As we’re presenting guidelines and tips for creating a well-prepared data collection, keep in mind that a lot of the information we’re conveying in this presentation is found in our “ICPSR Guide to Social Science Data Preparation and Archiving” (a link to it is on the ICPSR web site).
  24. Confidentiality review and treatment involve reviewing and modifying the actual data to reduce disclosure risk. Data collections undergo confidentiality review to determine whether the data contain any information that could be used to identify respondents. All direct identifiers should be removed from files. There are a number of actions you can take to protect respondent confidentiality: removing, masking, or collapsing variables within public-use versions of the datasets, or restricting access to the data. • Removing variables is a good solution for treating direct identifiers. • Blanking masks identifiers by replacing original data values with blanks. For example, ‘abcd’ becomes ‘    ’ (all blanks). • Recoding alters original data values to missing data codes. For example, value ‘1234’ is changed to ‘9999’. • Bracketing/collapsing combines the categories of a variable or merges the concepts embodied in two or more variables by creating a new summary variable. For example, age: 13-29=1, 30-49=2. • Top/bottom coding groups the upper or lower limits to eliminate outliers. For instance, a sample with extreme values for income might top-code or round all income >$100,000. • Perturbing is a more complex statistical technique that alters the data, for example by suppressing, adding, or removing records, or by adding random noise to continuous or pseudo-continuous variables. This technique limits the appeal of the data since it alters the original data values. • Restricting access can mean requiring users to apply for use, or highly restricted access (e.g., secured enclave-only access).
  25. The intent of the ICPSR pipeline process is to curate, “preserve and access information for the Long Term” (see “Reference Model for an Open Archival Information System (OAIS)”, Consultative Committee for Space Data Systems, Page 2-1. http://public.ccsds.org/publications/archive/650x0b1.PDF). Throughout the pipeline, our intent is to ensure that curated data are independently understandable – that is, “the community should be able to understand the information without needing the assistance of the experts who produced the information” (see “Reference Model for an Open Archival Information System (OAIS)”, Consultative Committee for Space Data Systems, Page 3-1. http://public.ccsds.org/publications/archive/650x0b1.PDF).
  26. The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
  27. The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
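
The variable checks described in notes 11 to 13 (an unexpected “-1” in an age variable, values with no labels, and pile-ups at the top and bottom of a range) can be sketched in Python. Everything below is hypothetical: the function names, the assumed valid range of 0 to 120, and the sample data are made up for illustration and are not part of ICPSR's tooling.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count) pairs sorted by value, like a simple frequency table."""
    return sorted(Counter(values).items())


def unlabeled_out_of_range(values, valid_min, valid_max, labels):
    """Return distinct values outside the valid range that have no documented label."""
    return sorted({v for v in values
                   if not (valid_min <= v <= valid_max) and v not in labels})


# Hypothetical age variable: a stray -1 and pile-ups at 18 and 41.
ages = [-1, 18, 18, 18, 25, 33, 41, 41, 41]
labels = {}  # assumption: the depositor supplied no value labels

table = frequency_table(ages)
flagged = unlabeled_out_of_range(ages, 0, 120, labels)

for value, count in table:
    print(value, count)
print("needs a label or a fix:", flagged)
```

If the depositor documents the code, for example `labels = {-1: "Not applicable"}`, the same call no longer flags it. Pile-ups at the lowest and highest values in the frequency table are a prompt to ask whether the data were recoded, not proof that they were.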