Data Citation Implementation Guidelines By Tim Clark


This talk presents a set of detailed technical recommendations for operationalizing the Joint Declaration of Data Citation Principles (JDDCP) - the most widely agreed set of principle-based recommendations for direct scholarly data citation.

We will provide initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data.

We hope that these recommendations along with the new NISO JATS document schema revision, developed in parallel, will help accelerate the wide adoption of data citation in scholarly literature. We believe their adoption will enable open data transparency for validation, reuse and extension of scientific results; and will significantly counteract the problem of false positives in the literature.

Published in: Education

  1. 1. Joint Declaration of Data Citation Principles © 2015 Massachusetts General Hospital and Tim Clark, Ph.D. Assistant Professor of Neurology Massachusetts General Hospital & Harvard Medical School June 9, 2015
  2. 2. reproducibility crisis
  3. 3. Non-reproducibility: only 11% of the studies examined could be reproduced. Begley CG and Ellis LM, Nature 2012, 483(7391):531-533
  4. 4. Transparency and Reproducibility • Transparency is the basis of reproducibility • What we are aiming for is robust science • Validation from multiple orthogonal viewpoints • Focus on transparent communication of results
  5. 5. Joint Declaration of Data Citation Principles endorsed by over 90 scholarly organizations
  6. 6. The Brief JDDCP 1. Importance. Data are first-class objects. 2. Credit. Support citing all contributors to the data. 3. Evidence. Assertions must be traceable to evidence. 4. Unique ID. Cited datasets must have resolvable IDs. 5. Access. Data must be robustly archived. 6. Persistence. Metadata must persist even after data is gone. 7. Specificity & Verifiability. Get same dynamic time-slice. 8. Interoperable & flexible. Give cross-community support.
  7. 7. How to implement JDDCP?
  8. 8. [Diagram labels] JDDCP; Archival, id & retrieval; Document model; Archival & retrieval; Identification; Common APIs; Workflows; Metadata
  9. 9. [Diagram labels] Archival & retrieval: repositories, social science, biomedicine, earth science, climatology, scholarly publishing, web standards, scientific data standards, astronomy, physics, academic libraries, data science, software technology
  10. 10. Human and machine accessibility of cited data in scholarly publications © 2015 Massachusetts General Hospital and Tim Clark, Ph.D. Assistant Professor of Neurology Massachusetts General Hospital & Harvard Medical School June 9, 2015
  11. 11. or, how to store and access cited data to radically improve scholarly transparency - and so that BOTH humans and machines are happy.
  12. 12. PeerJ Computer Science 1:e1.
  13. 13. Basic guidelines 1. Cite data as you would cite publications. 2. Deposit data in an archival-quality repository. 3. Use an identifier scheme meeting JDDCP criteria. 4. Identifiers should resolve to a landing page, not directly to the data. 5. Landing pages describe the data in both human and machine readable form.
  14. 14. Basic guidelines (contd.) 6. Landing page & data retention may differ. 7. Repositories should provide specific guarantee of landing page persistence. 8. Landing pages should provide both human and machine interpretable information. 9. Provide web service accessibility. 10. Stakeholder responsibilities for ecosystem.
  15. 15. 1. Cite data as you would cite publications • Strongly preferred: • Use the NISO JATS revision 1.1d2 XML schema • Interim (less good) alternative: • Use own XML schema, but do what JATS does.
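To make "cite data as you would cite publications" concrete, here is a minimal sketch that builds a JATS-style data citation with Python's standard library. The element and attribute names (`element-citation` with `publication-type="data"`, `data-title`, `source`, `pub-id`) follow common JATS data-citation practice as we understand it and should be checked against the JATS tag library. The author, title, repository name, and DOI are hypothetical placeholders, and the function name `data_citation` is our own.

```python
import xml.etree.ElementTree as ET


def data_citation(authors, title, source, year, doi):
    """Build a JATS-style <element-citation> for a dataset (illustrative sketch)."""
    cit = ET.Element("element-citation", {"publication-type": "data"})
    group = ET.SubElement(cit, "person-group", {"person-group-type": "author"})
    for surname, given in authors:
        name = ET.SubElement(group, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
    ET.SubElement(cit, "data-title").text = title   # title of the dataset
    ET.SubElement(cit, "source").text = source      # repository holding the data
    ET.SubElement(cit, "year").text = str(year)
    ET.SubElement(cit, "pub-id", {"pub-id-type": "doi"}).text = doi
    return cit


# All values below are hypothetical placeholders, not a real dataset.
citation = data_citation(
    [("Doe", "J")], "Example survey data", "Example Repository", 2015, "10.1234/example"
)
xml_text = ET.tostring(citation, encoding="unicode")
print(xml_text)
```

The point of the sketch is that a dataset reference carries the same kinds of fields as an article reference (authors, title, source, year, identifier), so it can live in the same reference list.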
  16. 16. 2. Deposit data in archival quality repositories Examples: • NIH and EBI bioscience repositories; • Standard earth/space/physical science repositories; • Dataverse, Dryad, Figshare, Zenodo; etc. Unacceptable: • “Available on my laboratory website”.
  17. 17. 3. Use an ID scheme that meets JDDCP criteria (4-6) Any currently‐available identifier scheme that is: • Machine actionable, • Globally unique, • Widely used by a community, and • Has a long term commitment to persistence Best practice: • use a scheme that is cross-discipline, such as DOI.
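As an illustration of what "machine actionable" can mean for a DOI, the sketch below normalizes the common ways a DOI is written into a single resolvable `https://doi.org/` URL. The regular expression is a heuristic of our own choosing, not an official DOI grammar, and the function name `doi_to_url` and the example DOI are hypothetical.

```python
import re
from urllib.parse import quote

# Heuristic shape check only: "10.", a numeric registrant prefix, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(identifier: str) -> str:
    """Return a resolvable https://doi.org/ URL for a DOI written in any common form."""
    doi = identifier.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {identifier!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


print(doi_to_url("doi:10.1234/example.5"))
```

A citation processor could run every cited identifier through a step like this so that both people and programs follow the same link.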
  18. 18. Machine accessibility Machine accessibility in this context means: “access by well-documented Web services—preferably RESTful Web services—to data and metadata stored in a robust repository, independently of integrated browser access by humans.”
  19. 19. Commitment to persistence If a resolving authority is required, that authority has demonstrated a reasonable chance to be present and functional in the future; Owner of the domain or the resolving authority has made a credible commitment to ensure that its identifiers will always resolve. A useful survey of persistent identifier schemes appears in Hilse & Kothe (2006).
  20. 20. • Digital Object Identifiers (DOIs)
  21. 21. 4. Identifiers should resolve to a landing page, not directly to data Because: • Data may be de-accessioned, like books, but the description of the thing cited should remain; • Data may be restricted (e.g. Protected Health Information; specially-licensed data; etc.); • Data may be VERY large and the user needs to be able to decide whether to download or not. • Content negotiation for machine access!
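Content negotiation lets the same identifier serve a human-readable landing page to a browser and structured metadata to a program. The sketch below only constructs such a request; it does not send it. The media type `application/vnd.citationstyles.csl+json` is one that DOI registration agencies commonly support, but whether a given resolver honors it is an assumption to verify, and the DOI shown is a hypothetical placeholder.

```python
from urllib.request import Request


def metadata_request(
    doi: str, media_type: str = "application/vnd.citationstyles.csl+json"
) -> Request:
    """Prepare a GET request that asks the DOI resolver for machine-readable metadata."""
    return Request(
        "https://doi.org/" + doi,
        headers={"Accept": media_type},  # the Accept header drives content negotiation
        method="GET",
    )


req = metadata_request("10.1234/example")  # hypothetical DOI
print(req.full_url, req.get_header("Accept"))
```

To actually fetch the metadata, the prepared request could be passed to `urllib.request.urlopen(req)`; a browser following the same URL without the `Accept` header would be expected to reach the HTML landing page instead.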
  22. 22. 5. Landing pages describe the data Best practices: • Identifier, title, description, creator, publisher/contact, publication/release date, version. Additional: • Creator identifier (e.g. ORCID), license Content encoding: • HTML; plus… • At least one non-proprietary machine-readable format, e.g. XML, JSON/JSON-LD, RDF, microformats, microdata, RDFa,…
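One way to cover the recommended metadata elements in a machine-readable form is a JSON-LD document. The sketch below maps the listed elements onto schema.org `Dataset` properties; the choice of schema.org as the vocabulary is our assumption (the slide names only the formats), and the input key names, the function name `landing_page_jsonld`, and all sample values are hypothetical.

```python
import json

# The elements listed as best practice on the slide, under our own key names.
REQUIRED = (
    "identifier", "title", "description", "creator",
    "publisher", "date_published", "version",
)


def landing_page_jsonld(meta: dict) -> str:
    """Serialize landing-page metadata as a schema.org Dataset in JSON-LD."""
    missing = [key for key in REQUIRED if not meta.get(key)]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": meta["identifier"],
        "name": meta["title"],
        "description": meta["description"],
        "creator": {"@type": "Person", "name": meta["creator"]},
        "publisher": {"@type": "Organization", "name": meta["publisher"]},
        "datePublished": meta["date_published"],
        "version": meta["version"],
    }
    # Optional "additional" elements: creator identifier and license.
    if meta.get("creator_orcid"):
        doc["creator"]["@id"] = meta["creator_orcid"]
    if meta.get("license"):
        doc["license"] = meta["license"]
    return json.dumps(doc, indent=2)


sample = {  # placeholder values, not a real dataset
    "identifier": "https://doi.org/10.1234/example",
    "title": "Example survey data",
    "description": "Illustrative record only.",
    "creator": "J. Doe",
    "publisher": "Example Repository",
    "date_published": "2015-06-09",
    "version": "1.0",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}
jsonld_text = landing_page_jsonld(sample)
print(jsonld_text)
```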
  23. 23. Serving the landing pages “To enable automated agents to extract the metadata these landing pages should include an HTML <link> element specifying a machine readable form of the page as an alternative.” “For those that are capable of doing so, we recommend also using Web Linking (Nottingham, 2010) to provide this information from all of the alternative formats.”
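A minimal sketch of how an automated agent might discover the machine-readable alternative: it scans landing-page HTML for `<link rel="alternate">` elements and also formats the same information as an HTTP `Link` header value in the Web Linking style. The sample HTML, the paths in it, and the names `AlternateLinkFinder` and `link_header` are hypothetical.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect (media type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.alternates.append((attr.get("type"), attr["href"]))


def link_header(alternates):
    """Format the alternates as a Web Linking style HTTP Link header value."""
    parts = []
    for media_type, href in alternates:
        entry = f'<{href}>; rel="alternate"'
        if media_type:
            entry += f'; type="{media_type}"'
        parts.append(entry)
    return ", ".join(parts)


# Hypothetical landing page fragment.
page = """
<html><head>
  <title>Example survey data</title>
  <link rel="stylesheet" href="/style.css">
  <link rel="alternate" type="application/ld+json" href="/datasets/example.jsonld">
</head><body></body></html>
"""

finder = AlternateLinkFinder()
finder.feed(page)
print(finder.alternates)
print(link_header(finder.alternates))
```

The HTML `<link>` element serves agents that parse the page, while the `Link` header can carry the same pointer on responses in formats that have no place for embedded links.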
  24. 24. 6. Landing page retention may differ from data retention Because: • Repositories cannot commit to keeping arbitrary and possibly very large volumes of data forever! • But when data is de-accessioned, the citation identifier must not give a 404 error. • Retain awareness of what was cited even if it is not currently extant in a particular repository.
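The "no 404 after de-accessioning" rule can be sketched as a resolver that keeps serving metadata once the data itself is gone. This is an illustrative model only: the `Record` type, the `resolve` function, the in-memory catalog, and the identifiers and URL are all hypothetical, and a real repository would sit behind an actual web framework.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Record:
    identifier: str
    title: str
    data_url: Optional[str]  # None once the data has been de-accessioned


def resolve(identifier: str, catalog: Dict[str, Record]) -> Tuple[int, dict]:
    """Return (HTTP status, landing-page metadata) for a cited identifier."""
    record = catalog.get(identifier)
    if record is None:
        # Only identifiers that were never issued are "not found".
        return 404, {"error": "identifier was never registered"}
    body = {
        "identifier": record.identifier,
        "title": record.title,
        "available": record.data_url is not None,
    }
    if record.data_url is not None:
        body["data_url"] = record.data_url
    else:
        body["note"] = "Data de-accessioned; descriptive metadata retained."
    return 200, body


catalog = {  # placeholder entries
    "10.1234/kept": Record("10.1234/kept", "Kept dataset", "https://repository.example/kept.csv"),
    "10.1234/gone": Record("10.1234/gone", "Withdrawn dataset", None),
}

print(resolve("10.1234/gone", catalog))
```

The design choice illustrated is that availability is a property reported on the landing page, not something signaled by the identifier failing to resolve.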
  25. 25. 7. Repositories should provide a specific guarantee of persistence for landing pages Model guarantee language: “[Organization/Institution Name] is committed to maintaining persistent identifiers in [Repository Name] so that they will continue to resolve to a landing page providing metadata describing the data, including elements of stewardship, provenance, and availability. [Organization/Institution Name] has made the following plan for organizational persistence and succession: [plan].”
  26. 26. 8. Landing pages should provide both human and machine interpretable information. Because: • Mash-ups and distributed search. • Apps that you haven’t yet thought of. • Web services. Examples of machine interpretable info: • RDF, RDFa, XML, microformats, JSON-LD, etc.
  27. 27. 9. Provide web service accessibility Because: • Service composition, new apps, etc. Best practice: • RESTful web service, because this is a data-oriented application and required functionality. Much less good practice: • SOAP, because SOAP is process-oriented.
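To illustrate the data-oriented style that makes REST a natural fit, here is a minimal sketch in which each dataset is a resource addressed by a URL path and read with GET. The route layout (`/datasets/{id}` and `/datasets/{id}/data`), the `handle` function, and the in-memory `DATASETS` store are hypothetical choices for illustration, not an interface defined by the guidelines.

```python
# Hypothetical in-memory store standing in for a repository backend.
DATASETS = {
    "example-1": {
        "title": "Example dataset",
        "data_url": "https://repository.example/files/example-1.csv",
    }
}


def handle(method: str, path: str):
    """Dispatch a request to a dataset resource; returns (status, JSON-like body)."""
    parts = [p for p in path.split("/") if p]
    if len(parts) not in (2, 3) or parts[0] != "datasets":
        return 404, {"error": "no such resource"}
    if method != "GET":
        # This sketch exposes read-only resources.
        return 405, {"error": "method not allowed"}
    record = DATASETS.get(parts[1])
    if record is None:
        return 404, {"error": "unknown dataset"}
    if len(parts) == 2:
        # GET /datasets/{id} returns descriptive metadata.
        return 200, {"id": parts[1], "title": record["title"]}
    if parts[2] == "data":
        # GET /datasets/{id}/data returns where the data can be fetched.
        return 200, {"id": parts[1], "data_url": record["data_url"]}
    return 404, {"error": "no such resource"}


print(handle("GET", "/datasets/example-1"))
print(handle("GET", "/datasets/example-1/data"))
```

Because the interface is just resources and a uniform read operation, other services can compose it without knowing anything about the repository's internal processes.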
  28. 28. 10. Stakeholder responsibilities • Archives and repositories: Ensure that IDs, resolution, landing page metadata, dataset description, and data access methods conform to these recommendations. • Registries of repositories: Document conformance. • Researchers: Treat data as first-class objects. • Funders, scholarly societies, academic institutions: Strongly encourage conformance to best practices.
  29. 29. Summary • Use NISO JATS 1.1d2 to publish & archive documents. • Cite datasets as if they were publications and deposit datasets in archival repositories. • Follow human & machine accessibility guidelines as presented above in points 3 through 9. • Adhere to stakeholder responsibilities as in point 10. • Welcome to the future of scholarly publishing!
  30. 30. Acknowledgements • Joan Starr, California Digital Library • other co-authors of the “Achieving Human and Machine Accessibility” publication • FORCE11 Data Citation Implementation Group • Maryann Martone, UCSD & FORCE11 • John Kunze, California Digital Library • Harry Hochheiser, University of Pittsburgh • Phil Bourne, NIH Data Science Directorate
  31. 31. Questions?