Citing data in research articles: principles, implementation, challenges - and the benefits of changing our ways.

Jo McEntyre
Europe PMC, EMBL-EBI
www.ebi.ac.uk

Europe PMC literature database
Europe PMC
• Abstracts: 30 million
• Full-text articles: 3 million
• Article citation counts
• Grants
• ORCIDs
• Semantic annotation
• Data citations
• Data integration
Europe PMC is a member of the PMC International Collaboration.
Funded by 28 European funders of life science research

About EMBL-EBI
• Part of the European Molecular Biology Laboratory
• International, non-profit research institute
• Europe’s hub for biological data services and research

Making data discoverable
Labs around the world deposit data and we…
• Archive it
• Classify it
• Share it with other data providers
• Analyse, add value and integrate it
…provide tools to help researchers use it
A collaborative enterprise

Data Citation in Europe PMC full text
Literature*
Added-Value
Submitted
*OMIM, Clinical trials, GO
Submission statements
vs reuse?
260K

Data Citation Principles Engender Two Big Ideas
"sound, reproducible scholarship rests upon a
foundation of robust, accessible data"
"data should be considered legitimate, citable
products of research"
These slides are adapted from:
http://www.slideshare.net/joanstarr/data-citation-a-joint-declaration-

Joint Declaration on Data Citation Principles
1 Importance
2 Credit and Attribution
3 Evidence
4 Unique Identification
5 Access
6 Persistence
7 Specificity and Verifiability
8 Interoperability and flexibility
Full Principles: https://www.force11.org/datacitation

Joint Declaration
1. Importance
Data should be considered legitimate, citable
products of research. Data citations should be
accorded the same importance in the scholarly
record as citations of other research objects, such as
publications.

Joint Declaration
2. Credit and Attribution
Data citations should facilitate giving scholarly credit
and normative and legal attribution to all contributors
to the data, recognizing that a single style or
mechanism of attribution may not be applicable to all
data.

Joint Declaration
3. Evidence
In scholarly literature, whenever and wherever
a claim relies upon data, the corresponding data
should be cited.

Joint Declaration
4. Unique identification
A data citation should include a persistent method
for identification that is machine actionable, globally
unique, and widely used by a community.
etc. !!!

Joint Declaration
5. Access
Data citations should facilitate access to the data
themselves and to such associated metadata,
documentation, code, and other materials, as are
necessary for both humans and machines to make
informed use of the referenced data.

Joint Declaration
6. Persistence
Unique identifiers, and metadata describing the
data, and its disposition, should persist -- even
beyond the lifespan of the data they describe.

Joint Declaration
7. Specificity and Verifiability
Data citations should facilitate identification of,
access to, and verification of the specific data that
support a claim. Citations or citation metadata
should include information about provenance and
fixity sufficient to facilitate verifying that the specific
timeslice, version and/or granular portion of data
retrieved subsequently is the same as was
originally cited.
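
One way to read the "fixity" requirement in Principle 7 is as a checksum recorded alongside the citation. The following is a minimal Python sketch under that assumption: the Joint Declaration does not prescribe a hash algorithm, SHA-256 is chosen here only for illustration, and the function names and sample bytes are hypothetical.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest that could be stored in citation metadata."""
    return hashlib.sha256(data).hexdigest()


def matches_cited_version(retrieved: bytes, recorded_digest: str) -> bool:
    """Check that the bytes retrieved now equal the bytes originally cited."""
    return fixity_digest(retrieved) == recorded_digest


# Illustrative bytes standing in for a deposited dataset file (not real data).
cited_bytes = b"sample,value\nA,1\nB,2\n"
digest_at_citation_time = fixity_digest(cited_bytes)

# A later retrieval of an unchanged file verifies; a modified file does not.
unchanged_ok = matches_cited_version(cited_bytes, digest_at_citation_time)
modified_ok = matches_cited_version(cited_bytes + b"C,3\n", digest_at_citation_time)
```

A digest only identifies one exact byte sequence, so a citation to a time slice or subset would need the digest of that specific portion, not of the whole dataset.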

Joint Declaration
8. Interoperability and flexibility
Data citation methods should be sufficiently flexible
to accommodate the variant practices among
communities, but should not differ so much that they
compromise interoperability of data citation practices
across communities.

Many organizational endorsements

An implementation example
Creators, Year, Dataset Title, DOI, Data Repository, version
• Principle 2: Credit and Attribution
• Principles 4, 5, 6: Unique ID, Access, Persistence (resolves to landing page with access to metadata, docs, and data)
• Principle 7: Specificity and Verifiability
• Principle 8: Interoperability and flexibility
Slide from Mercè Crosas, Ph.D., Harvard University
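
The citation pattern on this slide (Creators, Year, Dataset Title, DOI, Data Repository, version) can be sketched as a small record type. This is an illustration only, not a published formatting standard: the class name, the separators, and every field value below are hypothetical placeholders, and the DOI is made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Fields named on the slide; the version is optional."""
    creators: List[str]
    year: int
    title: str
    doi: str
    repository: str
    version: Optional[str] = None

    def doi_url(self) -> str:
        # The DOI proxy resolves to the dataset's landing page.
        return f"https://doi.org/{self.doi}"

    def format(self) -> str:
        parts = [
            "; ".join(self.creators),
            str(self.year),
            self.title,
            self.doi_url(),
            self.repository,
        ]
        if self.version:
            parts.append(self.version)
        return ", ".join(parts)


# Placeholder values for illustration only (the DOI below is not real).
citation = DataCitation(
    creators=["J. Doe", "A. Roe"],
    year=2014,
    title="Example dataset title",
    doi="10.1234/example",
    repository="Example Repository",
    version="V1",
)
formatted = citation.format()
```

Keeping the fields separate until the last step lets the same record be rendered in different community styles, which is the flexibility Principle 8 asks for.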

Large dataset:
http://europepmc.org/articles/PMC3089613

Fig. 2

!! 2469 references !!

Examples of Implementations of Data Citations
in Reference Lists

<mixed-citation publication-type="other">
Occurrence in reference list:
Occurrence in text:
Tagged in reference list as:

<mixed-citation publication-type="thesis">
Occurrence in text:

<mixed-citation publication-type="webpage">
Also in this reference list: a non-DOI data citation
Occurrence in text:

<mixed-citation publication-type="journal">
Occurrence in text:
Cite data generated in the course of the work described?

JATS support for data citation
<mixed-citation publication-type='data'>
  <name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
  <name><surname>Baase</surname><given-names>W.A.</given-names></name>,
  <etal>et al.</etal>
  <data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
  <source>PDB Europe</source>,
  accession <pub-id pub-id-type='accession' assigning-authority='pdb' xlink:href='http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L'>102l</pub-id>.
  <pub-id pub-id-type='doi' xlink:href='http://dx.doi.org/10.2210/pdb102l/pdb'>10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
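
To show why tagging a reference as a data citation is useful to machines, here is a rough Python sketch that reads an abridged copy of the citation above with the standard library's xml.etree.ElementTree. Two assumptions are made so the fragment parses on its own: an xmlns:xlink declaration is added to the root element (a full JATS article would declare it elsewhere), and the accession link is left out for brevity. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Abridged copy of the slide's example, with the xlink namespace declared
# on the root so that the fragment is parseable in isolation.
JATS_FRAGMENT = """<mixed-citation publication-type="data"
    xmlns:xlink="http://www.w3.org/1999/xlink">
<name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
<name><surname>Baase</surname><given-names>W.A.</given-names></name>,
<etal>et al.</etal>
<data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
<source>PDB Europe</source>,
accession <pub-id pub-id-type="accession" assigning-authority="pdb">102l</pub-id>.
<pub-id pub-id-type="doi"
    xlink:href="http://dx.doi.org/10.2210/pdb102l/pdb">10.2210/pdb102l/pdb</pub-id>
</mixed-citation>"""


def extract_data_citation(xml_text: str) -> dict:
    """Pull the machine-actionable parts out of one mixed-citation element."""
    root = ET.fromstring(xml_text)
    ids = {}
    links = {}
    for pub_id in root.iter("pub-id"):
        id_type = pub_id.get("pub-id-type")
        ids[id_type] = pub_id.text
        # Namespaced attributes are addressed as "{namespace}localname".
        href = pub_id.get(f"{{{XLINK_NS}}}href")
        if href is not None:
            links[id_type] = href
    return {
        "type": root.get("publication-type"),
        "authors": [name.findtext("surname") for name in root.findall("name")],
        "title": root.findtext("data-title"),
        "source": root.findtext("source"),
        "ids": ids,
        "links": links,
    }


info = extract_data_citation(JATS_FRAGMENT)
```

With publication-type set to "data", a text-mining pipeline can separate dataset references from article references without guessing from the citation string.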

Minimal, maximal & extensible citation
[Diagram: citation building blocks, from minimal to extensible]
• Resource name + ID
• Resource name + Resolution ‘template’ + ID
• Author list + Resource name + Resolution ‘template’ + ID + Time
• ? + Author list + Resource name + Resolution ‘template’ + ID + Time + ?
Candidates for the "?" slots:
• For example: new data vs pre-existing data
• For example: version
Thomas Lemberger, EMBO
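
The "Resource name + Resolution 'template' + ID" building block can be sketched as a lookup table of URL patterns. This is only one possible reading of the diagram. The registry and function names are hypothetical, and the PDB pattern uses a placeholder example.org address rather than a real service URL; only the doi.org proxy pattern is a real resolver.

```python
from typing import List, Optional

# Hypothetical registry: resource name -> URL pattern with an {id} slot.
RESOLUTION_TEMPLATES = {
    "doi": "https://doi.org/{id}",
    "pdb": "https://example.org/pdb/{id}",  # placeholder, not a real endpoint
}


def resolve(resource: str, identifier: str) -> str:
    """Fill the resource's resolution template with the identifier."""
    template = RESOLUTION_TEMPLATES[resource.lower()]
    return template.format(id=identifier)


def build_citation(
    resource: str,
    identifier: str,
    authors: Optional[List[str]] = None,
    time: Optional[str] = None,
) -> str:
    """Minimal citation is resource + ID; authors and time are optional extras."""
    parts = []
    if authors:
        parts.append(", ".join(authors))
    parts.append(f"{resource}:{identifier}")
    if time:
        parts.append(time)
    parts.append(resolve(resource, identifier))
    return ". ".join(parts)


minimal = build_citation("PDB", "102l")
fuller = build_citation("PDB", "102l", authors=["Heinz D.W.", "Baase W.A."])
```

Because the template lives in the registry and not in each citation, a repository can change its URLs without invalidating citations that store only the resource name and ID.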

Integrated Research
Reused from: seier+seier,
Flickr
Reused from: Images
Money, Flickr
Articles
Data
People
Institutions
Funders

Joint Declaration
4. Unique identification
A data citation should include a persistent method
for identification that is machine actionable, globally
unique, and widely used by a community.
etc.

1. Discoverability through accessibility
• Deposit in a public/open database
• Where possible, structured archive (e.g. PDB, ENA) >> unstructured archive (e.g. Zenodo, Figshare)
• Uniquely identify it: PID, Accession number, DOI, ROI
• Give it context: metadata (and more)
• All of the above = citable =

2. Discoverability through structured data
• Structured data is one of the true enablers of life science
  - Discovery of homology between genes across species
  - Predicting function based on protein folds
• Structured data can be cross-analysed and compared by algorithm, and it encourages development of new products and tools

Structured data is good value for money
[Chart comparing two costs]
• Annual cost of generating new protein structure data in labs around the world
• Annual cost of maintaining it in a central database

Degrees of Data
[Diagram labels]
Unstructured/semi-structured
Structured
Added Value
Metadata
A picture of a graph
A spreadsheet of my results
A record in a DNA sequence database
A graphical display of a genome
A narrative with citations, pictures and attachments
Article

Metadata – critical to discoverability
Generic: title, submitters, date, file format, version
• citation
• basic search
Example: Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1 gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006 (Rel. 89, Last updated, Version 7). BN000065
Specific: organism, tissue, assay, page number …
• deep search
• analysis
• computation

BioStudyEBI
BioStudy database for unstructured data
Study
Publications
Ontologies
Data files
Other DBs
Metadata
Other DBs

Elixir: An international distributed infrastructure for
• Data
• Standards
• Tools
• Compute
• Training
• Industry
