It's a real world: developing preservation policy for Dryad
Presented at Research Data Access and Preservation (RDAP) 2014 as part of the panel "Developing and implementing institutional policies on research data: ownership, preservation, and compliance."


Usage Rights

CC Attribution License

  • Intro both
  • Sara: For the past year, Ayoung and I have been involved in a working group that has developed preservation policy for Dryad Digital Repository. It’s been a somewhat lengthy process: we’ve benefitted from the input of many different stakeholders, including a panel of preservation experts, Dryad staff members, and the Dryad board of directors. However, as you can imagine, the number of stakeholders involved has also slowed and complicated the policy development process. Throughout this project, we’ve found that there is a tension between what is ideal and what is realistic, and this tension required that Ayoung and I be flexible, making certain concessions in order to create a preservation policy that would actually be able to function in the real world of Dryad. This presentation will provide an overview of the policy development process, including a quick summary of the state of data preservation, an overview of the Dryad repository and its content, and a timeline of the policy development process. We’ll then walk you through the policy itself, talk about some of the lessons we learned, and discuss open questions that arose during the project.
  • Sara: With the advent of “e-science” and data-intensive computing, research data has been produced in massive quantities, a phenomenon often referred to as “the data deluge.” Over the past few years, the scientific community has begun to address how to preserve access to this data, and how to facilitate its continued re-analysis and reuse. Data has begun to be considered a commodity, a thing of value that should be publicly available to the scientific community. To that end, many major scientific journals have now implemented data-availability policies, and funding agencies have begun mandating data archiving, leading to a demand for reliable data repositories. Beyond compliance with mandates, data archiving has many benefits. It facilitates verification of published research, it allows for the preservation of accessibility and discoverability, it provides opportunities for data reuse, and it can lead to increased citations and more research visibility. Data archiving also prevents redundant data collection, inefficient legacy data curation, and the burden of sharing-on-request. However, data archiving has unique challenges. Perhaps most challenging from a preservation perspective is that research data have a wider variety of file formats than most digital archival materials. This means the data may require rare or proprietary programs to interpret them, which makes migration and emulation expensive: a repository would need to purchase special software for each of these edge cases, or risk losing information during data normalization. Second, data is a raw material, rather than a finished product. This means that data creators need to be able to add new versions of datasets to the repository as they continue to conduct their research. Because data is used to create scholarly output, security may also be a consideration: without proper embargoes, researchers who deposit data in a repository could run the risk of having their findings scooped. Lastly, data sets can be very large and consist of multiple files. This makes metadata creation much more important for discoverability, but it also makes large data sets time-consuming to curate.
  • Sara: So first, let’s address why preservation policy is necessary. The definition of digital preservation evolved from the Research Libraries Group and OCLC’s simple definition, “ensuring both the long-term maintenance of a bitstream and continued accessibility of its contents,” to JISC’s expansion that digital preservation “encompasses not just technical activities, but also all of the strategic and organizational considerations that relate to the survival and management of digital material.” Most digital repositories now have a preservation plan in place, including formal policy. It’s important for stakeholders to know that digital content will be preserved into the future, and that there is a plan in place for preservation. However, there are relatively few preservation policies that address data specifically. Many that do exist are for repositories holding specific types of data, like the particle accelerator data at CERN (the European Organization for Nuclear Research) and the data at the National Snow and Ice Data Center; or social science data, like the Odum Institute and the Inter-university Consortium for Political and Social Research. (DataONE’s policy is probably the closest to the kind of general-data policy that Dryad was looking to develop, and one member of Dryad’s preservation working group was also involved in developing DataONE’s preservation policy.)
  • Sara: Dryad is a curated, general-purpose repository that makes the data underlying scientific and medical publications discoverable, freely reusable, and citable (http://datadryad.org/). Curation is part of what sets Dryad apart from other repositories: Dryad only accepts data that is associated with a specific publication, usually peer reviewed. Our journal integration service links the data submission process with journals’ article submission systems, ensuring higher-quality metadata. Dryad curators then provide another level of quality control. Dryad aims to ensure access to its data over the long term in order to facilitate data availability, data sharing, and scholarly communication. Having good metadata is a good first step toward this aim. When Dryad launched in 2008, it partnered with a group of leading journals and scientific societies in the field of evolutionary biology and ecology, providing research data associated with articles in these and other journals in order to support data reuse. Dryad has a very broad collecting policy. Almost any data is accepted as long as it is associated with a publication, as long as it doesn’t contain personally identifiable, sensitive, or illegal material, and as long as there are no technical problems or viruses. The scope of the repository continues to grow: Dryad now holds data from different types of publications like books, conference proceedings, and dissertations, as well as data from a wider range of scientific fields.
  • Sara: Dryad focuses on the “long tail of data,” which comprises a lot of valuable and irreplaceable data. Dryad aims to provide a home for this “orphan data” and enable all the data underlying a publication to be archived and accessible in perpetuity, regardless of format. However (it may be hard to see these numbers), there are only 1000 Excel files, and fewer than 100 JPEG, R script, HTML, and WAV files. The number of files in each format goes lower from there, and the report from which I gleaned these numbers showed almost 600 “unknown filetypes,” which are most likely rare or homegrown formats that can’t be automatically detected. This long-tail data makes preservation activities much more difficult. It was challenging to develop policy that could address hundreds of unique file formats, each with its own dependencies.
  • Sara: Preservation is an important part of Dryad’s mission, particularly with regard to open access and scholarly communication. Some preservation actions were initiated at the beginning of the project, so they have already been implemented in the Dryad curation workflow. MD5 checksums are calculated for every bitstream upon upload. These checksums are validated nightly, and reports are sent to system administrators. Provenance metadata is automatically generated every time a data package progresses through the Dryad workflow. Because this metadata sometimes documents private activities, it is only visible to Dryad administrators. Since its inception, Dryad has informally encouraged preferred formats. If Dryad curators have trouble opening files in proprietary or rare formats, they may contact data creators to try to obtain more preservation-friendly formats. In helpdesk correspondence, the senior curator also encourages non-proprietary, openly documented formats when possible. However, these simple preservation actions aren’t enough. Developing and implementing a formal preservation policy is necessary in order to guide current and future preservation practice, and to facilitate the long-term preservation of the repository’s digital assets.
  • Ayoung: Preservation Working Group in Feb 2013. Two authors researched and reviewed existing preservation practices, models, and policies, and consulted with digital preservation experts on the Working Group. Preservation Task Force formed to address specific implementation issues, including versioning, preferred file formats, and file type migration strategies.
  • Ayoung: Although we tried to integrate some long and short term goals into the document, our aim was to develop a higher-level policy document that could guide future preservation planning.
  • Ayoung: While these standards provide useful insights and guidance for the big picture, implementation should be considered within the boundaries of organizational capability, and adopting these standards also requires embedding local context during implementation. Due to its fundamental significance for developing repositories for long-term preservation, Dryad adopted the OAIS model and concepts at a high level. Dryad preservation policy uses the definition of “long term” as stated in the OAIS model. Because the model is conceptual and high level, however, sometimes there was a need to clarify the terms and localize the use of concepts in the context of Dryad. One example is distinguishing between SIPs and AIPs during day-to-day curation activities. PREMIS has also been one of Dryad’s considerations since the early development of this project. However, PREMIS did not align with the priorities at that time, and as already discussed, Dryad already had provenance metadata requirements for datasets. Thus, full implementation of PREMIS remains a long-term goal for Dryad. Dryad is also aware of other standards and guidelines about audit and certification for building a trusted digital repository, such as Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) and Data Seal of Approval (DSA), and has kept certification as a consideration when developing policies and procedures. However, Dryad’s current efforts should focus more on putting the planned preservation infrastructure in place, such as format migration, before going through a formal self or external auditing process. Implementation cost.
  • Ayoung: In order to follow Dryad’s internal policies, we looked primarily to Dryad’s Terms of Service document (https://datadryad.org/pages/policies), which includes policies on submission, content, payment, usage, and privacy. Terms of service: consulted with stakeholders, developed by Laura Wendell and Dryad staff, consulted with Board of Directors, finally consulted with a lawyer to ensure that the language communicated the policy correctly. We also aimed to comply with Dryad’s unofficial policies, and policies that have yet to be finalized. One example of a policy-in-progress is Dryad’s policy on versioning. An initial policy on versioning was developed in the first quarter of 2009 (http://wiki.datadryad.org/Track_Version_Changes). The versioning feature is currently deployed for curator use only, and in a few rare cases, updated files have been submitted. However, the tool is still in the process of being streamlined and debugged before being deployed publicly.
  • Ayoung: Dryad has only minimum requirements for depositors, which might seem a very different approach from that of some other data repositories that ask more of depositors. For Dryad, making submission as easy as possible for data depositors is necessary to support its mission, which aims to encourage data sharing through open and widespread availability. This might raise a question regarding acquisition of representation information for preservation and reuse, and this is where Dryad needs to decide how to balance “minimum efforts” and having “enough” representation information. Dryad’s decision on “minimum efforts” is compensated by other factors related to data submission in Dryad. Unlike some other data repositories, Dryad only accepts data that are linked to papers or publications, making it possible to acquire metadata mandated by journals.
  • New Task Force (looking at versioning, preferred formats, other topics?), focusing specifically on preservation planning. Eventually this Task Force will develop a strategic plan to complement the policy document.
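
The fixity checks mentioned in the notes above (MD5 checksums calculated for every bitstream upon upload and validated nightly, with reports sent to system administrators) can be sketched roughly as follows. This is only an illustration, not Dryad's actual implementation: the function names, the in-memory `manifest` dictionary, and the report line format are all hypothetical, and a real repository would store checksums in its own database and deliver the report by email or a similar channel.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_on_upload(path: Path, manifest: dict[str, str]) -> None:
    """Store the checksum of a newly uploaded file (hypothetical manifest)."""
    manifest[str(path)] = md5_of(path)


def nightly_check(manifest: dict[str, str]) -> list[str]:
    """Recompute every checksum and return report lines for any problems.

    An empty list means every file is present and matches its stored digest.
    """
    problems = []
    for name, expected in sorted(manifest.items()):
        path = Path(name)
        if not path.is_file():
            problems.append(f"MISSING {name}")
        elif md5_of(path) != expected:
            problems.append(f"CHANGED {name}")
    return problems
```

In this sketch, `record_on_upload` would run once per deposited file and `nightly_check` would run from a scheduled job. MD5 is used here because it is what the notes describe; it detects accidental corruption but is not suitable as a defense against deliberate tampering.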

Presentation Transcript

  • It’s a Real World: Developing Preservation Policy for Dryad Ayoung Yoon (Dryad preservation working group, Doctoral Candidate, UNC-CH) Sara Mannheimer (Former Dryad curator, Data management librarian, MSU) Elena Feinstein, Jane Greenberg, Ryan Scherle, Dryad Digital Repository March 26, 2014 Research Data Access & Preservation Summit (RDAP) 2014
  • Outline • Introduction • What is Dryad Digital Repository? • Preservation policy development process • Dryad preservation policy • Lessons learned and open questions • Conclusion • Acknowledgement
  • Introduction
    • “Data deluge”
    • Journal policies and funding agency mandates
    • Benefits to archiving and preserving research data:
      – Facilitates: verification of research; accessibility and discoverability; opportunities for data reuse; increased citations; research visibility
      – Prevents: redundant data collection; inefficient legacy data curation; burden of sharing-on-request
    • Challenges of data archiving:
      – Wider variety of file formats than most digital archival materials
      – New versions as data sets are added to and updated
      – Security considerations
      – Large amounts of data
    Benefits adapted from Beagrie N, Lavoie BF, Woollard M (2010) Keeping research data safe 2. HEFCE
  • Why preservation policy? • Preservation policy supports strategic planning for implementation • Communicates to stakeholders – trustworthiness and commitment to preservation • Not many data preservation policies. Some examples: – CERN: CMS data – Archaeology Data Service – NSIDC Data Management Policies – Odum Institute Preservation Policy – ICPSR – DataONE
  • Dryad Digital Repository • A curated, general-purpose repository that makes the data underlying scientific and medical publications discoverable, freely reusable, and citable (http://datadryad.org/). • Facilitates data availability, data sharing, and scholarly communication. • Originally partnered with leading journals and scientific societies in evolutionary biology and ecology. • Broad collecting policy – almost any data is accepted, as long as it is associated with a publication.
  • Common filetypes in Dryad [bar chart of file counts per format, axis from 0 to 2500. Formats shown: HTML, WAV, Phylip, R script, JPEG Image, Newick tree file, RTF, XML, GZip archive, MS Word/OpenXML, Nexus, Unknown format, PDF, FASTA, Zip archive, CSV, MS Excel/OpenXML, Plain Text]
  • Dryad and Preservation Needs • Preservation is a major part of Dryad’s mission. • Current preservation actions: – MD5 Checksums – provenance metadata – informal encouragement of preferred formats • Developing and implementing a formal preservation policy will: – guide current and future preservation practice – Facilitate the long-term preservation of the repository’s digital assets
  • Policy Development Process
    • 2012: An initial preservation plan (version 1.0)
    • Feb 2013: Preservation Working Group
    • May 2013: Version 2.0 presented to the Dryad Board of Directors
    • July 2013: Version 2.0 revised in cooperation with Dryad staff
    • Nov 2013: Version 2.4 approved by Dryad Board of Directors; Preservation Working Group dissolved; Preservation Task Force formed
  • Preservation Policy • Purpose • Scope and content coverage • Overview of preservation strategies • Format support and levels of preservation – e.g. Preferred formats and format support levels • Implementing the strategy – e.g. integrations of OAIS functional activities, pre-ingest & ingest, and archival storage, authenticity and integrity, security, versioning, and withdrawal of collections • Sustainability plans – e.g. technical sustainability, institutional and financial sustainability
  • Lessons Learned and Open Questions • A negotiation between what is ideal and what is realistic – Adopting international standards, models, and best practices for long-term preservation • Open Archival Information System (OAIS) reference model (ISO 14721:2003) • PREMIS (PREservation Metadata: Implementation Strategies) – Other standards and guidelines about audit and certification for building a trusted digital repository • Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC) and Data Seal of Approval (DSA)
  • Lessons Learned and Open Questions • Align with other internal and institutional policies – Follow Dryad’s internal policies: we looked primarily to Dryad’s Terms of Service document (https://datadryad.org/pages/policies), which includes policies on submission, content, payment, usage, and privacy – Comply with Dryad’s unofficial policies, which have yet to be finalized • A policy-in-progress: Dryad’s policy on versioning – Comply with policy from partner institutions • Dryad functions as a partnership between the University of North Carolina at Chapel Hill (UNC), Duke University (Duke), and North Carolina State University (NC State)
  • Lessons Learned and Open Questions • Structuring the policy according to Dryad’s specific needs – Meeting specific organizational needs is fundamentally important and should be the first consideration in all work, as each organization has different goals, priorities, and capabilities. – Data depositors’ requirements: minimum requirements • balance “minimum efforts” and having “enough” representation information • compensated by other factors
  • Conclusion • Policy-creation and planning are just first steps -- implementation will require further considerations • Future plan – Potential for implementing TRAC / DSA in the future – Divide policy and implementation into separate documents – New Task Force
  • Acknowledgement • Thank you to the Dryad Preservation Working Group: – Co-chairs: • Jane Greenberg, professor, SILS/UNC-CH, director, MRC • Sara Mannheimer, former Dryad curator, Data Management Librarian, Montana State University – Members: • Alex Ball, Research Officer, UKOLN, University of Bath • Robin Dale, Director of Digital and Preservation Services, LYRASIS • Michael Day, Digital Curation Centre, UKOLN, University of Bath • Ruth Duerr, Principal Investigator/Project Manager, Data Management and Cyberinfrastructure Manager, Data Stewardship, National Snow & Ice Data Center • Elena Feinstein, former Dryad curator, Librarian for Chemistry and Biological Sciences, Duke University • Cal Lee, Associate Professor, SILS/UNC-CH • Robin Rice, Data Librarian, EDINA and Data Library, University of Edinburgh • Ayoung Yoon, SILS doctoral student • This work was supported in part by the National Science Foundation (NSF), Award number: 1147166/ABI Development: Dryad: scalable and sustainable infrastructure for the publication of data.
  • Thank you! Ayoung Yoon Doctoral candidate University of North Carolina at Chapel Hill ayyoon@email.unc.edu Sara Mannheimer Data management librarian Montana State University sara.mannheimer@montana.edu