Managing &
Preserving Data Sets
Vicky Steeves | METRO Webinar | 8/3/2015
Itinerary
❖Science Data: Definition & Explanation
❖Current Trends in DigiPres for Science
❖Benefits to Curating Datasets
❖Existing Problems
❖Research Data Management
❖Upcoming Tools
What is Data @ the Federal Gov’t?
“the recorded factual material commonly accepted in the
scientific community as necessary to validate research
findings.”
-Federal Office of Management & Budget Circular A-110
Why is science data different?
Digital Preservation of Science in the US
❏ North Carolina County Geospatial Data
❏ Caroline Dean Wildflower Collection
❏ FSU Biological Scientist, Dr. A.K.S.K.
Prasad Diatomscapes I and II Collections
Photographs
❏ FSU Department of Oceanography
Technical Reports
NDSA Levels of Digital Preservation
Four levels (1-4), assessed independently for each of five rows:
❏ Storage & Geographic Locations
❏ File Fixity & Data Integrity
❏ Information Security
❏ Metadata
❏ File Formats
USGS Levels of Digital Preservation
Four levels (1-4), assessed independently for each of six rows:
❏ Storage & Geographic Locations
❏ Data Integrity
❏ Information Security
❏ Metadata
❏ File Formats
❏ Physical Media
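File fixity at these levels is typically implemented by recording a checksum for each file at ingest and re-computing it on a schedule. A minimal sketch in Python, assuming a simple path-to-digest manifest (the manifest shape is illustrative, not from any particular tool):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Return the paths whose current digest no longer matches the
    digest recorded at ingest (i.e., files that fail a fixity check)."""
    return [path for path, recorded in manifest.items()
            if sha256_of(path) != recorded]
```

On a healthy collection `verify_manifest` returns an empty list; any path it returns has been altered or corrupted since the manifest was written.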
Benefits
Satisfy your federal grant requirements
more likely to receive future funding if data is made immediately
accessible & you write and follow a well-structured DMP
Saves time & effort, making research process more efficient
reduces duplicated work & cuts costs of storage
more efficient quality control of data produced
better version control of data: identify versions that can be
periodically purged to save money on storage costs!
Adding metadata means that both you & others can go back to it
and easily understand and use it effectively
Benefits
Makes your data scientifically and legally defensible
verifiable & authenticatable, replicable & easily duplicated!
Supports Open Access Movement
advocates for researchers to share data to foster development of body of
knowledge
sustainability of science data in the long-term!
Reinforce open scientific inquiry
could lead to new and unanticipated discoveries!
continually improve the quality of data
data can be used as valuable teaching instrument to train future scientists
EXAMPLE: Benefit to Science
Existing Problems: Management
“My research assistants
manage all my data.”
“I think only the division chair
knows how much server space
we have in total.”
“There are not enough
computer terminals in the
imaging lab.”
“I have no time to
standardize my data
management.”
“I have no way to get my data
off of the computers in the
gene sequencing lab because
the files are too large.”
“I don’t know where exactly to
get support for my database. I
don’t know if it’s IT’s jurisdiction
or job.”
Existing Problems: Scope
Existing Problems: Media
Possible Solution
Research Data Management
a set of practices which affords researchers the ability
to more quickly, efficiently, and accurately find,
access, and understand their own or others’ research
data
*also our way to get science to listen to us! (sorry Scott)
Data Management Plan
A data management plan (DMP) is a document
that describes how you will collect, organize,
manage, store, secure, back up, preserve, and
share your data.
Federal Agencies & DMPs
DMPs Usual Requirements
❏ Format of Data
❏ Research methodology
❏ Roles & Responsibilities for Data
❏ Metadata Standards
❏ Storage & Back Up Procedure
❏ Long-Term Archiving & Preservation Plan
❏ Access Policy
❏ Security Measures
❏ Data
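One low-effort way to keep a plan actionable is to mirror the checklist above in a machine-readable stub that travels with the data. The keys and sample values here only illustrate the idea; they are not a format any agency prescribes:

```python
import json

# Hypothetical DMP skeleton mirroring the checklist above;
# fill each field in as the project's decisions are made.
dmp = {
    "data_format": "CSV (open format)",
    "research_methodology": "",
    "roles_and_responsibilities": {"data_manager": ""},
    "metadata_standard": "",
    "storage_and_backup": {"primary": "", "backup": ""},
    "long_term_preservation": "",
    "access_policy": "",
    "security_measures": "",
}
print(json.dumps(dmp, indent=2))
```

A stub like this can be versioned alongside the data, so the plan is checked and updated rather than written once and forgotten.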
Existing Tools
Research Lifecycle
Data management is
done at all stages of
the research lifecycle.
*each step in the process has its
own best practices & standards
Research Lifecycle: Create
Creating Data
 what format will the data be in?
 where will we store this data?
 how will it be backed up?
 how are we going to share this
data?
 how will we collect this data?
Research Lifecycle: Process
Processing Data
 how will we check, validate,
or clean the data?
a. how will we describe that
process?
b. how will we describe the
data?
 will we store the processed data?
Research Lifecycle: Analyze
Analyzing Data
 how will we interpret data?
 what research outputs will be
produced?
a. what format will they be in?
b. how can we make it
preservation-ready?
 where will we store this data?
Existing Resources
❏ Metadata: Darwin Core, ABCD, Data Documentation Initiative
❏ Open File Formats: Library of Congress File Standards Guide
❏ File Directories & Organization
❏ Fixity Checks & Security
❏ Back Ups
❏ Version Control
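To make one of these metadata standards concrete: Darwin Core is a flat vocabulary of term names, so a minimal occurrence record can be shared as an ordinary CSV. The column names below are standard Darwin Core terms; the record values are invented for illustration:

```python
import csv
import io

# Standard Darwin Core term names; the row values are invented.
fields = ["occurrenceID", "scientificName", "eventDate",
          "decimalLatitude", "decimalLongitude"]
record = {
    "occurrenceID": "urn:example:occurrence:0001",  # hypothetical identifier
    "scientificName": "Quercus alba",
    "eventDate": "2015-08-03",
    "decimalLatitude": "40.7291",
    "decimalLongitude": "-73.9965",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

Because the terms are shared vocabulary rather than a rigid schema, a file like this remains interpretable by anyone who knows Darwin Core, which is exactly the point of adopting a metadata standard.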
Research Lifecycle: Preserve
Preserving Data
 what is the best archival
format for our type of data?
 what is the best type of
archival storage for our data?
 What needs to be preserved
alongside our data to make it
useful to others?
Research Lifecycle: Access
Access to Data
 Interoperable formatting
means many people can
use our data
 distribute data
 share data
 promote data
Existing Resources
Research Lifecycle: ReUse
Re-using Data
 5 years down the road? 10?
a. use open, new archival
formats
b. refresh our storage media
c. scrutinize our findings &
integrate data into new
projects
What’s more...
Short-Term Solution
Long-Term Solution
Upcoming Tools &
Strategies
● ReproZip is a general tool for Linux distributions that simplifies the process
of creating reproducible experiments from command-line executions.
○ automatically creates a package that contains all the dependencies
(e.g., libraries, input data) that are required to run the input command.
● Hydra in a Box is a turnkey, feature-rich, robust, flexible digital repository
that is easy to install, configure, and maintain
○ joint venture between DPLA, Stanford University, and DuraSpace,
funded with $2M from IMLS
○ 30-month period to completion
Ask me questions!
@VickySteeves
victoriaisteeves@gmail.com
Editor's Notes

  1. Hi! Welcome to Managing & Preserving Data Sets. I’m Vicky Steeves, the Research Data Management and Reproducibility Librarian at NYU Libraries. I’m pleased to be here discussing digital preservation for science data with you today!
  2. So I am going to go over some relevant definitions for the management and preservation of science data, as well as some of the reasons why science data is so different from the other bits and bytes you may deal with in your day-to-day. After that we’ll look at some existing problems in the management of science data and a solution--research data management. I’ll end by taking a peek at some exciting new tools on the horizon for RDM for science.
  3. When we talk about data in this webinar, this is what we are talking about. The Federal Government defines research data as “the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.” This definition excludes preliminary analyses, drafts of scientific papers, unpublished data, personally identifiable information for human subjects, plans for future research, peer reviews, commercial information, or communications with colleagues. This is important to consider since government funding covers the majority of scientific research done by US scientists--as information professionals, we need to keep this in mind as we begin to work with these data.
  4. Science is, at its core, the act of collecting, analyzing, refining, re-analyzing, and reusing data. Reuse and re-analysis are important parts of the evolution of our understanding of the world and the universe, so to carry out meaningful preservation, we as the digital preservationists need to equip those future users with the necessary tools to reuse said data. This means that we have to be a part of, or at least provide researchers tools for, making their data preservation-ready. One important note on this cycle is that it is iterative--we will have to manage data at each step hundreds or thousands of times.
  5. In preserving datasets there is a very real need to preserve not only the dataset but the ability to deliver that knowledge to a future user community. Technical obsolescence is a huge problem in the preservation of scientific data, due in large part to the field-specific proprietary software and formats used in research. This software is sometimes even project-specific, and often not backwards compatible, meaning that a new version of the software won’t be able to open a file created in an older version. This is counter-intuitive for access and preservation. Researchers often build their own software, so unless they make the code available (which most do), chances are the results cannot be reproduced after the life of the project or after the life of the researcher themselves.
  6. Science data is a big data problem at its heart. Not only because of the large quantity of data usually generated throughout the course of an experiment or research period, but also the diversity of data created at a high speed. The velocity of data also refers to the speed at which data is transferred and moved around, which is a regular occurrence amongst labs with multiple collaborators. There is also a trust issue here--with so much data being generated at such a high speed, there is a quality and accuracy problem. It’s just less controllable.
  7. Digital data are not only research output, but also input into new hypotheses and research initiatives, enabling future scientific insights and driving innovation. In the case of natural sciences, specimen collections and taxonomic descriptions from the 19th century (and earlier) are still used in modern scientific discourse and research. There is a unique concern in digital preservation of scientific datasets where the phrase “in perpetuity” has real usability and consequence, in that these data have value that only increases with time. 100 years from now, scientific historians will look to these data to document the processes of science and the evolution of research. Scientists themselves will use these data for additional research or even comparative study: “look at the population density of this scorpion species in 2014 versus today, 2114, I wonder what caused the shift.” Some data, particularly older data, aren’t necessarily replicable, and in that case, the value of the material for preservation increases exponentially. So the resulting question is how to develop new methods, management structures, and technologies to manage the diversity, size, and complexity of current and future datasets, ensuring they remain interoperable and accessible over the long term. With this in mind, it is imperative to develop an approach to preserving scientific data that continuously anticipates and adapts to changes in both the popular field-specific technologies and user expectations.
  8. Digital preservation is an extremely new field--well, maybe not new, but recently mainstreamed into the LIS field. I would say overall it's about 5 years old at the very youngest, and 10 years at the very oldest. We are just beginning to develop programs for our archival datasets--science data is definitely a niche. There just aren't that many digital preservationists dealing with science data at the moment. On the flip side--science just wants to do science. There are concerns over longevity of data, but overall scientists want what is going to be easiest and most efficient. We just have to make sure that we are providing that service.
  9. To underscore that point... This map comes from the NDSA website, and lists some preservation partners in the US that are currently preserving at-risk digital content. I just wanted you all to have some context about how little science is in the conversation in terms of digital preservation. Out of the 14,000 collections listed, four are scientific: the North Carolina County Geospatial Data, the Caroline Dean Wildflower Collection, FSU scientist Dr. A.K.S.K. Prasad's Diatomscapes I & II collections photographs, and the FSU Department of Oceanography's Technical Reports. As you can see, science data as a whole is very underrepresented in our digital preservation community.
  10. Again for context--These are the National Digital Stewardship Alliance levels of digital preservation. What's nice about these is that each row is independent of the others--an institution could be at level four for metadata and level one for file formats. These levels are intended as tiered recommendations for how institutions should begin or enhance their digital preservation activities. It's a fairly simple yet robust guide useful for everyone on the spectrum of digital preservation--from those just beginning to think about it to those who exclusively think about it. It's basically a great self-assessment tool.
  11. But it's changing and science is coming into the conversation! The US Geological Survey actually took these levels and adapted them for their own purposes, in language that they believe their scientific data managers will better understand. You can see I've bolded the main differences between the NDSA levels and the USGS levels, though they are minor. The USGS wanted to account for the physical media that stores the digital data and the file formats involved, and replaced "file fixity" with simply "data integrity," though based on my reading of their levels, it means the same thing to them. Perhaps most importantly, this is the first national science institution to explicitly bring attention to digital preservation. Their website also has a very robust set of pages explaining what digital preservation is, what it means for science, and why more people should be paying attention to it. They are really trying to get a conversation started, which is absolutely commendable.
  12. Why should this be a big deal for science? So why does all this matter to information professionals? Why would you want to put all that effort into data management for science? Well, for one, it helps satisfy federal grant requirements, and if you follow the steps to make your data as interoperable and accessible as possible, you become more likely to receive future funding. It also makes the research process more efficient. With standardization, your team runs like a well-oiled machine--everyone knows what is expected and what needs to be done ahead of time. It reduces duplicated work due to misplaced files, and helps researchers periodically purge versions of data they may not need, to save on storage resources.
  13. More benefits:
- Wide dissemination of relevant information helpful to others
- Advance the field by reducing duplicated efforts
- Maximize data's effectiveness by increasing use
- Supports the Open Access Movement, which advocates for researchers to share data to foster development of a body of knowledge
- Sustainability of science data in the long term!
- Reinforce open scientific inquiry--could lead to new and unanticipated discoveries!
- Continually improve the quality of data
- Data can be used as a valuable teaching instrument to train future scientists
  14. By reusing data collected from a variety of sources--the eBird database, land cover data, meteorology, and remotely sensed GIS data from NASA--this project was able to compile and process the data using supercomputing to determine bird migration routes for particular species.
  15. These quotes represent what I've typically come across as reasons why there is no standardized management in a lab group. The bolded one--"I have no time to standardize my management"--is definitely the most common.
  16. Usually there are as many management systems as there are researchers. This is a bit of a problem when combined with all the many devices on which each researcher stores their data. Sometimes, it's a matter of finding one out of a thousand files on one out of a hundred projects on one out of five drives, two USB sticks, two computers, Dropbox, and Amazon's cloud. With no standardized management, the risk of data loss is obvious. This is also a huge knowledge management issue. There are a lot of student workers, postdocs, collaborators, etc., who not only move data around, but create data, database systems, scripts, etc., that are not well documented. This is a huge issue not only for finding data, but for reusing data after the term of the project has ended. Picture hundreds of thousands of lines of uncommented code and the one person who built it long gone--what a headache.
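One concrete first step toward getting control of files scattered across drives and accounts is a simple inventory. Below is an illustrative Python sketch (the function name and CSV layout are my own, not a standard tool) that walks a directory tree and records each file's path, size, and checksum, so you at least know what exists and where:

```python
import csv
import hashlib
import os

def inventory(root, out_csv):
    """Walk a directory tree and record each file's path, size,
    and SHA-256 checksum in a CSV manifest."""
    with open(out_csv, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "sha256"])
        for dirpath, _dirs, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                # Read in chunks so large datasets don't exhaust memory.
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        digest.update(chunk)
                writer.writerow([path, os.path.getsize(path), digest.hexdigest()])
```

Running this over each drive or cloud sync folder produces manifests that can be merged and deduplicated by checksum--a cheap way to discover how many copies of the "same" file are really floating around.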
  17. Science is generally an early adopter, and there are a lot of legacy media concerns because of this. Zip disks, floppy disks, CDs--there are viable datasets still sitting on these media in the bottoms of desks. Getting relevant data off these media will be a great concern of information professionals as we move into the sphere of science more decidedly. And there just aren't enough of us!!
  18. We have all this data from all these data creators on legacy media floating around goodness knows where--how can we possibly manage this??
  19. I know it seems insurmountable at first--but we can manage this through a process called research data management! This is how the federal government is lumping preservation in with their management plans, which we will get to in a second. RDM is an iterative process that allows researchers to more efficiently and accurately find, access, and understand data, their own or others'. It requires standardizing and documenting the way in which data is collected, organized, managed, stored, backed up, preserved, and shared, through planning at the beginning of a research project.
  20. In 2011, the National Science Foundation began to require a data management plan with each of their grant applications. This is essentially a planning document included with all grant proposals.
  21. After the NSF, other federal agencies adopted similar practices, requiring that researchers describe all the facets of research data management I just named, and many more.
  22. Data management plans for funding agencies usually require detail on: the types of data to be produced during the award period; how data will be collected; who is responsible for data management; how data will be described; how data will be stored and backed up, both during the award period and in the long term; access policies around the data; and security measures taken for data and human subjects, if any. That being said, the grants provide little to no resources for making data accessible and don't really enforce the DMPs--however, using this model as a way to enforce data management in your institution can prove extremely useful.
  23. These are some tools that can help you and your researchers write a data management plan. There are LibGuides from many institutions that walk through the process. DMPTool is an online tool to create, review, and share DMPs. It includes templates for each funding agency separately (in fact, they just added an IMLS template!) and walks users through the process of writing a DMP. DMPonline is from the Digital Curation Centre in the UK, but is still relevant for US institutions. This tool also has a number of templates available that represent the requirements of different funders. There are three initial questions that determine the template to display, and guidance is provided to help users answer DMP questions.
  24. Though it is mandatory for those funding agencies, data management planning should be a regular part of the research lifecycle. I will now briefly detail the types of activities typically done at each stage in the research process.
  25. It is our job as information professionals to help our users navigate these questions to provide current and useful curation over their research. These questions can be best expressed in a large planning document, or in the form of checklists, which I personally like because of their readability. There are many LibGuides and other online checklists that can be adapted for use in your research environment. MIT and UMass Amherst in particular have great guides for public consumption. Each of these questions has its own set of best practices to consider--for instance: "How will data be backed up?" Best practices in digital preservation dictate that we have at least three copies of our data: the original, a second copy stored locally but on external media, and a third copy stored remotely (such as in the cloud). These should be distributed geographically--let's say we're in NYC (because hey, I am). I have my original data saved on my desktop at work. I back it up to an external hard drive once a day, which I bring home with me every night. My data is also backed up daily to Rackspace, automatically. Rackspace has servers in Virginia, and my house is about 25 miles away from my workplace. Geographically redundant, triple-backed-up--this data, at least in this one aspect, is safe.
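The fixity side of that three-copy rule can be automated. Here is a minimal, illustrative Python sketch (the function names are my own, not any specific tool) that verifies backup copies against the original by comparing SHA-256 checksums:

```python
import hashlib

def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copies(original, copies):
    """Return the copies whose checksum does NOT match the original."""
    expected = sha256_of(original)
    return [c for c in copies if sha256_of(c) != expected]
```

Run on a schedule against the external-drive copy and the remote copy: an empty result means all copies still match; anything returned has drifted and should be re-copied from a known-good source.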
  26. Processing data directly corresponds to the DMP’s requirement of describing research methodology. Here, we need to describe how we are checking or validating data, and where we will store the process work that gets us to our analysis. This is where information professionals can do a lot of description and quality assurance work with the data initially created in the research process.
  27. This is perhaps the most important step in the process to describe--the analysis. This is the intellectual power behind the research initiative, and it needs to be described in excruciating detail. This is how the individual interprets the data they collected. Here, we need to again note the file format and software version used so the experiment is directly replicable. This is the step where we can begin to make our data preservation-ready. This can take the form of a planning document in many cases, or a LibGuide to ensure compliance and understanding within the lab. We as information professionals can provide the planning tools to allow the research lab to run as efficiently and smoothly as possible throughout this step.
  28. WE HAVE WAY MORE EXPERIENCE WITH THIS THAN THEY DO. We can offer this as a service to science! It is imperative that data be described accurately at the time of its creation to be able to replicate the experiment or research process. This is the process of collecting and describing the data using metadata. This could be as simple as specimen collection event information or as robust as describing the parameters from which astronomical information was collected. Relevant metadata standards include: Access to Biological Collection Data (TDWG), Darwin Core (biology), and the Data Documentation Initiative (social science).
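As a concrete illustration, a minimal specimen-collection record using genuine Darwin Core terms might look like the sketch below (the identifier and collector values are hypothetical, made up for the example):

```python
import csv
import io

# A minimal occurrence record using real Darwin Core terms.
# The occurrenceID and recordedBy values are hypothetical examples.
record = {
    "occurrenceID": "urn:catalog:EXAMPLE:12345",
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Centruroides vittatus",
    "eventDate": "2014-06-12",
    "decimalLatitude": "30.2672",
    "decimalLongitude": "-97.7431",
    "recordedBy": "A. Researcher",
}

# Serialize to CSV, a common exchange format for Darwin Core records.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_text = buf.getvalue()
```

Even this small amount of structured description is enough for aggregators and future researchers to find, georeference, and reuse the record.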
  29. This is the process of compiling everything together that someone would need to replicate the experiment in the future. In addition to simply compiling it all, here we need to ensure that the data are on archival storage media in archival formats so the data remain usable into the future. In this step, research data is no different than the other data that digital archivists work with on a daily basis. We are simply following the same formula to make digital material archival: Is it well described? Is there high-quality technical, descriptive, and administrative metadata? Is it on archival storage with a refreshing schedule?
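One small piece of that step can be automated: flagging files that are not in open, archival-friendly formats. The allow-list in this sketch is purely illustrative--a real policy would be defined per discipline and institution:

```python
import os

# Purely illustrative allow-list of open, archival-friendly formats;
# each discipline and institution would define its own policy.
ARCHIVAL_EXTENSIONS = {".csv", ".txt", ".xml", ".json", ".tif", ".pdf"}

def flag_non_archival(paths):
    """Return the files whose extension is not on the archival allow-list."""
    return [p for p in paths
            if os.path.splitext(p)[1].lower() not in ARCHIVAL_EXTENSIONS]
```

Running this over a deposit before it goes to archival storage surfaces the proprietary formats (e.g., a word-processor file) that should be migrated first.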
  30. This is when we disseminate our data! We want to use the most open and accepted formats there are so as many people as possible can make use of our data. This means making our data as interoperable as possible across software and systems. We have a number of systems in our field that support this.
  31. These are some existing systems for either access, or both access and preservation. The access-and-preservation repositories include Dryad, DataONE, and figshare. These are field-agnostic and contain data from a variety of different disciplines. They are all online repository systems where scientists can post and disseminate their data, and they represent many scientific research areas. The VAO is for astronomical data, whereas NCBI's GenBank is used for genomics. These can be deceptive, as many researchers will put their data up on these aggregators and repositories thinking that it will be available forever, when no such promise is ever made. While these are generally very good for the dissemination of data, a word of caution is well placed.
  32. One of my favourite things about science data is that it is usable long after it's produced. Friends of mine in zoology will often use archival materials dating back to the early 1900s in their current research. That's much easier because paper is pretty easy to archive--we have established practices. For digital data, the process of preservation is ongoing. To reuse digital data, we have to refresh our storage media to make sure our bits don't rot, and update our file formats to newer, open formats. This way, 100 years from now, your data can be used in comparative studies. This is another way in which the process for science data is the same as for other data--except the scale will be much, much bigger.
  33. We need to incentivize science to let us help them. Librarians and archivists are no strangers to advocating for themselves, and this case is no different. By making these services available to science with the assurance that we are making their process more efficient and streamlined, we will attract science to digital preservation and research data management. Think of those benefits we went over earlier in the webinar. By simply saying, "Letting us help you now reduces the time you have to spend thinking about the best ways to manage your data while actually doing the research--which makes the project that much smoother, and more valuable to others and to yourself down the road!" we can motivate science to take an interest in creating and curating data which can be made archival and reproducible--fulfilling the requirements of the scientific method! We can market ourselves effectively by stressing the importance of their research and their time--which is often a precious and overtaxed resource.
  34. The best short-term solution for your institution's RDM practices is simply providing expertise! This could take the form of management trainings, hiring a data manager or revising a job description to include RDM, doing instruction and consultations with lab groups, and being involved in the grant-writing process.
  35. That being said, the ideal next step is the construction of a central, trustworthy institutional repository based on open-source software to house scientific data throughout its whole lifecycle, based out of your Library. I believe that providing a central storage, management, and preservation infrastructure would mitigate the risk of data loss--which only increases as the amount of data we generate does. A repository would answer the needs of the researchers while putting the Library in a position to begin preserving the data so intrinsically important to the institution. Options range from hosted services like figshare to homegrown solutions built on platforms such as Hydra, DSpace, Fedora, or Islandora.
  36. These are two upcoming tools I am particularly excited about, which you may or may not have heard of. ReproZip is a piece of software coming out of NYU's Center for Data Science which allows users to reproduce the conditions under which a scientific experiment took place. It creates a package of all the dependencies required to rerun the experiment in question, all from command-line executions. Keep an eye out for this on their GitHub page. The other tool is a repository system based on Hydra called Hydra in a Box. This is a turnkey system aimed at helping under-resourced institutions with digital preservation and content management, which came out of an IMLS leadership grant. It's a 30-month, $2M project being managed by DPLA, Stanford University, and DuraSpace. They have a wiki where you can track updates.
  37. Thank you so much for your attention! I'd be happy to answer any questions or open the floor for debate at this time. If you think of a question after the session has ended, feel free to email me at the address listed above or tweet @VickySteeves. So... questions?