This document outlines guidelines for improving the reproducibility of scholarly research through better data citation practices. It recommends depositing cited datasets in archival repositories, using persistent identifiers that meet JDDCP criteria, and having identifiers resolve to landing pages that provide both human- and machine-readable metadata about the dataset. Landing pages should be retained even if the underlying data is removed. Repositories are responsible for maintaining identifier persistence and researchers should treat data as first-class objects in the scholarly process. Following these guidelines would radically improve transparency and enable both humans and machines to access and interpret cited data.
5. Transparency and Reproducibility
• Transparency is the basis of reproducibility
• What we are aiming for is robust science
• Validation from multiple orthogonal viewpoints
• Focus on transparent communication of results
6. Joint Declaration of Data Citation Principles
endorsed by over 90 scholarly organizations
9. The Brief JDDCP
1. Importance. Data are first-class objects.
2. Credit. Support citing all contributors to the data.
3. Evidence. Assertions must be traceable to evidence.
4. Unique ID. Cited datasets must have resolvable IDs.
5. Access. Data must be robustly archived.
6. Persistence. Metadata must persist even after data is gone.
7. Specificity & Verifiability. Retrieve the same time-slice of dynamic data that was cited.
8. Interoperable & flexible. Give cross-community support.
17. Basic guidelines
1. Cite data as you would cite publications.
2. Deposit data in an archival-quality repository.
3. Use an identifier scheme meeting JDDCP
criteria.
4. Identifiers should resolve to a landing page,
not directly to the data.
5. Landing pages describe the data in both
human and machine readable form.
18. Basic guidelines (contd.)
6. Landing page & data retention may differ.
7. Repositories should provide specific
guarantee of landing page persistence.
8. Landing pages should provide both human
and machine interpretable information.
9. Provide web service accessibility.
10. Stakeholder responsibilities for ecosystem.
19. 1. Cite data as you would cite publications
• Strongly preferred:
• Use the NISO JATS revision 1.1d2 XML schema
• Interim (less good) alternative:
• Use own XML schema, but do what JATS does.
20. 2. Deposit data in archival-quality repositories
Examples:
• NIH and EBI bioscience repositories;
• Standard earth/space/physical science repositories;
• Dataverse, Dryad, Figshare, Zenodo; etc.
Unacceptable:
• “Available on my laboratory website”.
21. 3. Use an ID scheme that meets JDDCP criteria (4-6)
Any currently available identifier scheme that is:
• Machine actionable,
• Globally unique,
• Widely used by a community, and
• Backed by a long-term commitment to persistence.
Best practice:
• use a scheme that is cross-discipline, such as
DOI.
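As a rough illustration of what "machine actionable" can mean for a DOI, the sketch below normalizes the common ways a DOI is written into a single HTTPS URL on the doi.org resolver. The function name `doi_to_url` and the DOI `10.1234/example.5678` are hypothetical and used only for illustration.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Common prefixes a DOI may be written with; all are stripped before rebuilding the URL.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return an HTTPS resolver URL for a DOI given in bare, 'doi:', or URL form."""
    doi = doi.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # A DOI name starts with the directory indicator "10." and has a prefix/suffix split.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/" as-is.
    return DOI_RESOLVER + quote(doi, safe="/")


# Hypothetical DOI, for illustration only.
print(doi_to_url("doi:10.1234/example.5678"))
```

Storing the identifier in one canonical form makes it easier for software to compare citations and to follow them to the landing page.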
22. Machine accessibility
Machine accessibility in this context means:
“access by well-documented Web services—preferably
RESTful Web services—to data and metadata stored in
a robust repository, independently of integrated browser
access by humans.”
23. Commitment to persistence
If a resolving authority is required, that authority has
demonstrated a reasonable chance to be present and
functional in the future;
The owner of the domain or the resolving authority has
made a credible commitment to ensure that its
identifiers will always resolve.
A useful survey of persistent identifier schemes
appears in Hilse & Kothe (2006).
25. 4. Identifiers should resolve to a landing page, not directly to data
Because:
• Data may be de-accessioned, like books, but
the description of the thing cited should remain;
• Data may be restricted (e.g. Protected Health
Information; specially-licensed data; etc.);
• Data may be VERY large and user needs to
be able to decide whether to download or not.
• Content negotiation for machine access!
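Content negotiation means the same identifier URL can return an HTML landing page to a browser and a machine-readable description to software, depending on the HTTP `Accept` header. A minimal sketch using Python's standard library follows. The DOI is hypothetical, and whether a given resolver or repository honors a particular media type is an assumption to check against its documentation.

```python
import urllib.request


def metadata_request(identifier_url: str,
                     media_type: str = "application/ld+json") -> urllib.request.Request:
    """Build a GET request that asks for a machine-readable representation."""
    return urllib.request.Request(identifier_url, headers={"Accept": media_type})


# Hypothetical identifier, for illustration only.
req = metadata_request("https://doi.org/10.1234/example.5678")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))

# To actually send it (requires network access and a real identifier):
# with urllib.request.urlopen(req) as response:
#     metadata = response.read().decode("utf-8")
```

A human following the same URL in a browser sends `Accept: text/html` and gets the landing page, so one citation serves both audiences.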
26. 5. Landing pages describe the data
Best practices:
• Identifier, title, description, creator,
publisher/contact, publication/release date,
version.
Additional:
• Creator identifier (e.g. ORCID), license
Content encoding:
• HTML; plus…
• At least one non-proprietary machine-readable
format, e.g. XML, JSON/JSON-LD, RDF,
microformats, microdata, RDFa,…
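One possible machine-readable encoding of the elements listed above is JSON-LD using the schema.org `Dataset` vocabulary. The sketch below is an assumption about how a repository might map the best-practice elements to schema.org property names; the guidelines themselves do not prescribe this mapping, and every value in `EXAMPLE` is a placeholder.

```python
import json

# Best-practice elements from the slide, mapped to schema.org property names
# (title -> name, publication/release date -> datePublished). This mapping is
# an assumption of this sketch.
REQUIRED = ("identifier", "name", "description", "creator",
            "publisher", "datePublished", "version")


def landing_page_jsonld(record: dict) -> str:
    """Serialize landing-page metadata as JSON-LD, rejecting incomplete records."""
    missing = [key for key in REQUIRED if not record.get(key)]
    if missing:
        raise ValueError("missing metadata elements: " + ", ".join(missing))
    document = {"@context": "https://schema.org", "@type": "Dataset", **record}
    return json.dumps(document, indent=2)


# Placeholder values, for illustration only.
EXAMPLE = {
    "identifier": "https://doi.org/10.1234/example.5678",
    "name": "Example dataset",
    "description": "Measurements supporting the claims in an example article.",
    "creator": {
        "@type": "Person",
        "name": "A. Researcher",
        "@id": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID iD
    },
    "publisher": {"@type": "Organization", "name": "Example Repository"},
    "datePublished": "2015-01-01",
    "version": "1.0",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
}

print(landing_page_jsonld(EXAMPLE))
```

Checking for the required elements at serialization time keeps incomplete landing pages from being published silently.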
27. Serving the landing pages
“To enable automated agents to extract the metadata
these landing pages should include an HTML <link>
element specifying a machine readable form of the
page as an alternative.”
“For those that are capable of doing so, we
recommend also using Web Linking (Nottingham,
2010) to provide this information from all of the
alternative formats.”
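The sketch below shows how an automated agent might use the two mechanisms quoted above: it finds `<link rel="alternate">` elements in a landing page, and it formats the same information as an HTTP `Link` header value in the Web Linking style. The page content and the `repository.example.org` URL are made up for illustration.

```python
from html.parser import HTMLParser


class AlternateLinkFinder(HTMLParser):
    """Collect (media type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "alternate" in rels and attributes.get("href"):
            self.alternates.append((attributes.get("type"), attributes["href"]))


def find_alternates(html: str) -> list:
    """Return the alternate representations advertised by a landing page."""
    finder = AlternateLinkFinder()
    finder.feed(html)
    return finder.alternates


def link_header(alternates: list) -> str:
    """Format the alternates as an HTTP Link header value."""
    values = []
    for media_type, href in alternates:
        value = f'<{href}>; rel="alternate"'
        if media_type:
            value += f'; type="{media_type}"'
        values.append(value)
    return ", ".join(values)


# Made-up landing page, for illustration only.
PAGE = """<html><head>
<title>Example dataset</title>
<link rel="alternate" type="application/ld+json"
      href="https://repository.example.org/dataset/42.jsonld">
</head><body><h1>Example dataset</h1></body></html>"""

alternates = find_alternates(PAGE)
print(alternates)
print("Link:", link_header(alternates))
```

The `<link>` element serves agents that have already fetched the HTML, while the `Link` header lets an agent discover the alternatives from any of the formats, including non-HTML ones.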
28. 6. Landing page retention may differ from data retention
Because:
• Repositories cannot commit to keeping
arbitrary and possibly very large volumes of
data forever!
• But when data is de-accessioned, the citation
identifier must not give a 404 error.
• Retain awareness of what was cited even if it
is not currently extant in a particular repository.
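A minimal sketch of this behavior, assuming a repository keeps descriptive metadata separately from the data location: when the data is de-accessioned, the identifier still resolves to a landing page that says so. The catalog structure, the field names, and the choice to answer with status 200 are assumptions of this sketch, not part of the guidelines.

```python
def resolve(identifier: str, catalog: dict) -> tuple:
    """Return (HTTP status, landing-page metadata) for an identifier."""
    record = catalog.get(identifier)
    if record is None:
        # Only identifiers that were never issued are reported as not found.
        return 404, {"error": "identifier was never issued by this repository"}
    page = dict(record["metadata"])
    if record["data_location"] is None:
        # Data is gone, but the description of what was cited remains.
        page["dataAvailable"] = False
        page["availabilityNote"] = "The data has been de-accessioned."
    else:
        page["dataAvailable"] = True
        page["dataLocation"] = record["data_location"]
    return 200, page


# Hypothetical catalog with one retained and one de-accessioned dataset.
CATALOG = {
    "doi:10.1234/example.1": {
        "metadata": {"name": "Retained dataset", "version": "1.0"},
        "data_location": "https://repository.example.org/files/1.zip",
    },
    "doi:10.1234/example.2": {
        "metadata": {"name": "De-accessioned dataset", "version": "2.1"},
        "data_location": None,
    },
}

print(resolve("doi:10.1234/example.2", CATALOG))
```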
29. 7. Repositories should provide a specific guarantee of persistence for landing pages
Model guarantee language:
“[Organization/Institution Name] is committed to maintaining
persistent identifiers in [Repository Name] so that they will
continue to resolve to a landing page providing metadata
describing the data, including elements of stewardship,
provenance, and availability.
[Organization/Institution Name] has made the following plan
for organizational persistence and succession: [plan].”
30. 8. Landing pages should provide both human and machine interpretable information
Because:
• Mash-ups and distributed search.
• Apps that you haven’t yet thought of.
• Web services.
Examples of machine interpretable info:
• RDF, RDFa, XML, microformats, JSON-LD,
etc.
31. 9. Provide web service accessibility
Because:
• Service composition, new apps, etc.
Best practice:
• RESTful web service, because this is a data-oriented
application and REST provides the required functionality.
Much less good practice:
• SOAP, because SOAP is process-oriented.
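To make the data-oriented style concrete, here is a minimal sketch of a read-only RESTful resource written as a plain WSGI application: each dataset is a resource at `/datasets/<id>`, and a GET returns its metadata as JSON. The URL layout, the `DATASETS` store, and its contents are hypothetical.

```python
import json

# Hypothetical in-memory metadata store, for illustration only.
DATASETS = {
    "42": {
        "identifier": "https://doi.org/10.1234/example.5678",
        "name": "Example dataset",
        "version": "1.0",
    },
}


def app(environ, start_response):
    """WSGI application: GET /datasets/<id> returns that dataset's metadata as JSON."""
    if environ.get("REQUEST_METHOD") != "GET":
        start_response("405 Method Not Allowed", [("Allow", "GET")])
        return [b""]
    parts = environ.get("PATH_INFO", "").strip("/").split("/")
    if len(parts) == 2 and parts[0] == "datasets" and parts[1] in DATASETS:
        body = json.dumps(DATASETS[parts[1]]).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"unknown dataset"]
```

For local experimentation the application could be served with the standard library, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`.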
32. 10. Stakeholder responsibilities
• Archives and repositories: IDs, resolution, landing
page metadata, dataset description, and data access
methods should conform to these recommendations.
• Registries of repositories: Document conformance.
• Researchers: Treat data as first-class objects.
• Funders, scholarly societies, academic institutions:
Strongly encourage conformance to best practices.
33. Summary
• Use NISO JATS 1.1d2 to publish & archive documents.
• Cite datasets as if they were publications and deposit
datasets in archival repositories.
• Follow human & machine accessibility guidelines as
presented above in points 3 through 9.
• Adhere to stakeholder responsibilities as in point 10.
• Welcome to the future of scholarly publishing!
34. Acknowledgements
• Joan Starr, California Digital Library
• other co-authors of the “Achieving Human and Machine
Accessibility” publication
• FORCE11 Data Citation Implementation Group
• Maryann Martone, UCSD & FORCE11
• John Kunze, California Digital Library
• Harry Hochheiser, University of Pittsburgh
• Phil Bourne, NIH Data Science Directorate