Citing data in research articles:
principles, implementation, challenges
- and the benefits of changing our ways
Jo McEntyre
Europe PMC, EMBL-EBI
www.ebi.ac.uk
Life Science Data
Familiar Complexity!
Article | ‘Package’ | External Resources
“Recognized” data repos:
file|structured record,
Accession|DOI|API+Accession
Institutional repos:
file|structured record,
URL|DOI|API+Accession
Author database|‘website’:
file|struct record,
URL|DOI|API+Accession
Supp info tables/data:
file, URL|DOI
Cross-reference
Dataset list
Ref to external res
Ref to external res
Reference list
Fig Source data:
file, URL|DOI
Fig (caption + graphic)
Cross-reference
Ref to external
resource
Adapted from Thomas Lemberger, EMBO
Europe PMC literature database
Europe PMC
• Abstracts: 30 million
• Full-text articles: 3 million
• Article citation counts
• Grants
• ORCIDs
• Semantic annotation
• Data citations
• Data integration
Europe PMC is a member of the PMC
International Collaboration.
Funded by 28 European funders of life science research
About EMBL-EBI
• Part of the European
Molecular Biology
Laboratory
• International, non-profit
research institute
• Europe’s hub for
biological data services
and research
Making data discoverable
Labs around the
world deposit
data and we…
Archive it
Classify it
Share it with
other data
providers
Analyse, add
value and
integrate it
…provide
tools to help
researchers
use it
A collaborative
enterprise
Journal Data Publishing
Data Citation in Europe PMC full text
Literature*
Added-Value
Submitted
*OMIM, Clinical trials, GO
Submission statements
vs reuse?
260K
Data Citation Principles Engender Two
Big Ideas
"sound, reproducible scholarship rests upon a
foundation of robust, accessible data"
"data should be considered legitimate, citable
products of research"
These slides are adapted from:
http://www.slideshare.net/joanstarr/data-citation-a-joint-declaration-
1 Importance
2 Credit and Attribution
3 Evidence
4 Unique Identification
5 Access
6 Persistence
7 Specificity and Verifiability
8 Interoperability and flexibility
Full Principles: https://www.force11.org/datacitation
Joint Declaration on Data Citation Principles
Joint Declaration
1. Importance
Data should be considered legitimate, citable
products of research. Data citations should be
accorded the same importance in the scholarly
record as citations of other research objects, such as
publications.
2. Credit and Attribution
Data citations should facilitate giving scholarly credit
and normative and legal attribution to all contributors
to the data, recognizing that a single style or
mechanism of attribution may not be applicable to all
data.
3. Evidence
In scholarly literature, whenever and wherever
a claim relies upon data, the corresponding data
should be cited.
4. Unique identification
A data citation should include a persistent method
for identification that is machine actionable, globally
unique, and widely used by a community.
etc.. !!!
5. Access
Data citations should facilitate access to the data
themselves and to such associated metadata,
documentation, code, and other materials, as are
necessary for both humans and machines to make
informed use of the referenced data.
6. Persistence
Unique identifiers, and metadata describing the
data, and its disposition, should persist -- even
beyond the lifespan of the data they describe.
7. Specificity and Verifiability
Data citations should facilitate identification of,
access to, and verification of the specific data that
support a claim. Citations or citation metadata
should include information about provenance and
fixity sufficient to facilitate verifying that the specific
timeslice, version and/or granular portion of data
retrieved subsequently is the same as was
originally cited.
8. Interoperability and flexibility
Data citation methods should be sufficiently flexible
to accommodate the variant practices among
communities, but should not differ so much that they
compromise interoperability of data citation practices
across communities.
Many organizational endorsements
An implementation example
Principle 2: Credit and Attribution
Principles 4, 5, 6: Unique ID, Access, Persistence
Principle 7: Specificity and Verifiability
Principle 8: Interoperability and flexibility
Creators, Year, Dataset Title, DOI, Data Repository, version
(Resolves to landing page with access to metadata, docs, and data)
Slide from Mercè Crosas, Ph.D., Harvard University
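The citation elements listed on this slide (creators, year, dataset title, DOI, data repository, version) can be sketched as a small record type. This is only an illustration, not the format used by any particular repository: the class name `DataCitation`, the separator style, and the sample values in the usage note are all hypothetical. The one external fact relied on is that a DOI resolves through https://doi.org/.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    """Hypothetical container for the citation elements named on the slide."""
    creators: List[str]
    year: int
    title: str
    doi: str
    repository: str
    version: Optional[str] = None  # optional, supports Principle 7

    def doi_url(self) -> str:
        # A DOI resolves to the dataset landing page via the DOI resolver.
        return f"https://doi.org/{self.doi}"

    def format(self) -> str:
        # Assumed layout: Creators (Year), Title, DOI link, Repository, version
        authors = "; ".join(self.creators)
        parts = [f"{authors} ({self.year})", self.title,
                 self.doi_url(), self.repository]
        if self.version:
            parts.append(self.version)
        return ", ".join(parts)
```

For instance, `DataCitation(["Doe, J."], 2014, "Example dataset", "10.1234/example", "Example Repository", "V1").format()` builds one citation string from made-up values; a real citation would take these fields from the repository's landing page.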
http://europepmc.org/articles/PMC3089613
Large dataset:
http://europepmc.org/articles/PMC3535838
http://europepmc.org/articles/PMC3766260
http://europepmc.org/articles/PMC3704603
http://europepmc.org/articles/PMC3710810
Fig. 2
!! 2469 references !!
http://europepmc.org/articles/PMC2672098
Examples of Implementations of Data Citations
in Reference Lists
http://europepmc.org/articles/PMC3661987
<mixed-citation publication-type="other">
Occurrence in reference list:
Occurrence in text:
Tagged in reference list as:
http://europepmc.org/articles/PMC3646594
<mixed-citation publication-type="thesis">
Occurrence in text:
Occurrence in reference list:
Tagged in reference list as:
http://europepmc.org/articles/PMC3722494
<mixed-citation publication-type="webpage">
Also in this reference list: a non-DOI data citation
Occurrence in text:
Occurrence in reference list:
Tagged in reference list as:
http://europepmc.org/articles/PMC3626513
<mixed-citation publication-type="journal">
Occurrence in text:
Occurrence in reference list:
Tagged in reference list as:
Cite data generated in
the course of the work
described?
JATS support for data citation
<mixed-citation publication-type='data'>
  <name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
  <name><surname>Baase</surname><given-names>W.A.</given-names></name>,
  <etal>et al.</etal>
  <data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
  <source>PDB Europe</source>,
  accession <pub-id pub-id-type='accession' assigning-authority='pdb'
    xlink:href='http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L'>102l</pub-id>.
  <pub-id pub-id-type='doi'
    xlink:href='http://dx.doi.org/10.2210/pdb102l/pdb'>10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
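Tagging a reference as publication-type='data' is what lets software pick data citations out of a reference list. The sketch below shows one way this might be done with Python's standard library. The function name `extract_data_citations` is hypothetical, and the enclosing `<ref>` element with its xlink namespace declaration is an assumption added so that the fragment from the slide parses on its own.

```python
import xml.etree.ElementTree as ET

# The citation from the slide, wrapped in an assumed <ref> element that
# declares the xlink prefix (a bare fragment would not parse without it).
SAMPLE = """<ref xmlns:xlink='http://www.w3.org/1999/xlink'>
<mixed-citation publication-type='data'>
<name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
<name><surname>Baase</surname><given-names>W.A.</given-names></name>,
<etal>et al.</etal>
<data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
<source>PDB Europe</source>,
accession <pub-id pub-id-type='accession' assigning-authority='pdb'
xlink:href='http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L'>102l</pub-id>.
<pub-id pub-id-type='doi'
xlink:href='http://dx.doi.org/10.2210/pdb102l/pdb'>10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
</ref>"""


def extract_data_citations(xml_text):
    """Return title, source and identifiers for each data-typed citation."""
    root = ET.fromstring(xml_text)
    results = []
    for citation in root.iter("mixed-citation"):
        # Skip journal, thesis, webpage and other non-data references.
        if citation.get("publication-type") != "data":
            continue
        title = citation.find("data-title")
        results.append({
            # Collapse whitespace in case the title wraps across lines.
            "title": " ".join("".join(title.itertext()).split())
                     if title is not None else None,
            "source": citation.findtext("source"),
            # Map identifier type (accession, doi) to its value.
            "ids": {pid.get("pub-id-type"): (pid.text or "").strip()
                    for pid in citation.iter("pub-id")},
        })
    return results
```

Calling `extract_data_citations(SAMPLE)` yields a single entry carrying both the PDB accession and the DOI. References tagged as "other", "thesis", "webpage" or "journal", as in the earlier examples, would be skipped by this filter, which is the practical argument for a dedicated data type.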
Minimal, maximal & extensible citation
Resource name | ID
Resource name | Resolution ‘template’ | ID
Author list | Resource name | Resolution ‘template’ | ID | Time
? | Author list | Resource name | Resolution ‘template’ | ID | Time | ?
For example: new data vs pre-existing data
For example: version
Thomas Lemberger, EMBO
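The idea of pairing a resource name and ID with a resolution 'template' can be sketched as a lookup table of URL patterns. This is a minimal illustration under stated assumptions: the table `RESOLUTION_TEMPLATES`, the function `resolve`, and the "example-repo" entry are hypothetical. Only the DOI resolver pattern (https://doi.org/ followed by the DOI) is a real service.

```python
# Hypothetical registry: resource name -> URL pattern with an {id} slot.
RESOLUTION_TEMPLATES = {
    "doi": "https://doi.org/{id}",
    # Made-up repository, shown only to illustrate extensibility.
    "example-repo": "https://repo.example.org/records/{id}",
}


def resolve(resource_name, identifier):
    """Build a resolvable URL from the minimal citation: resource name + ID."""
    try:
        template = RESOLUTION_TEMPLATES[resource_name.lower()]
    except KeyError:
        raise ValueError(f"no resolution template for {resource_name!r}")
    return template.format(id=identifier)
```

With the DOI from the JATS example, `resolve("doi", "10.2210/pdb102l/pdb")` gives a link to the landing page. Optional elements such as an author list, a time, or a version would extend the citation without changing how the minimal pair resolves.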
Integrated Research
Reused from: seier+seier,
Flickr
Reused from: Images
Money, Flickr
Articles
Data
People
Institutions
Funders
4. Unique identification
A data citation should include a persistent method
for identification that is machine actionable, globally
unique, and widely used by a community.
etc..
Joint Declaration
1. Discoverability through accessibility
• Deposit in a public/open database
• Where possible, structured archive (e.g. PDB,
ENA) >> unstructured archive (e.g. Zenodo,
Figshare)
• Uniquely identify it: PID, Accession number, DOI,
ROI
• Give it context: metadata (and more)
• All of the above = citable =
2. Discoverability through structured data
structured data is one of the true
enablers of life science
- Discovery of homology between genes across species
- Predicting function based on protein folds
• Structured data can be cross-analysed, compared by
algorithm, and encourages development of new products
and tools
Structured data is good value for money
Annual cost of generating new protein
structure data in labs around the world
Annual cost of
maintaining it
in a central
database
Degrees of Data
Unstructured/semi-
structured
Structured
Added Value
Metadata
A picture of a graph
A spreadsheet of my results
A record in a DNA
sequence
database
A graphical display of a genome
A narrative with
citations, pictures
and attachments
Article
Metadata – critical to discoverability
Generic: title, submitters, date, file format, version.
citation
basic search
Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1
gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006
(Rel. 89, Last updated, Version 7). BN000065
Specific: organism, tissue, assay, page number …
deep search
analysis
computation
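The split between generic and specific metadata can be illustrated with a toy search over a single record. This is a sketch, not how any EMBL-EBI service works: the record layout and the functions `basic_search` and `deep_search` are hypothetical. The generic values come from the citation shown on the slide, and the organism value is an assumption read off the record title.

```python
RECORDS = [
    {
        "accession": "BN000065",
        # Generic metadata: enough for a citation and a basic search.
        "generic": {
            "title": "TPA: Homo sapiens SMP1 gene, RHD gene and RHCE gene",
            "submitters": ["Wagner F.F."],
            "date": "23-APR-2002",
            "version": "7",
        },
        # Specific metadata (assumed here): enables deep search and analysis.
        "specific": {"organism": "Homo sapiens"},
    },
]


def basic_search(records, text):
    """Case-insensitive substring match on the generic title only."""
    needle = text.lower()
    return [r["accession"] for r in records
            if needle in r["generic"]["title"].lower()]


def deep_search(records, **criteria):
    """Exact match on every requested domain-specific field."""
    return [r["accession"] for r in records
            if all(r["specific"].get(key) == value
                   for key, value in criteria.items())]
```

A basic search can only find what the title happens to mention, whereas a call such as `deep_search(RECORDS, organism="Homo sapiens")` works on structured fields and so supports comparison and computation across records.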
BioStudyEBI
BioStudy database for unstructured data
Study
Publications
Ontologies
Data files
Other DBs
Metadata
Other DBs
Elixir: An international distributed infrastructure
for
• Data
• Standards
• Tools
• Compute
• Training
• Industry
THE END


Editor's Notes

  • #12 Image: https://www.flickr.com/photos/svenwerk/506579282 #1: Importance
  • #13 Image: http://www.flickr.com/photos/ggunson/16900719 #2: Credit and Attribution
  • #14 Image: http://www.flickr.com/photos/8395214@N06/2441779856 #3: Evidence
  • #15 Image: http://www.doi.org/ #4: Unique Identification
  • #16 Image: http://www.flickr.com/photos/mag3737/8755090129 #5: Access
  • #17 Image: http://www.flickr.com/photos/azwegers/6691014193 #6: Persistence
  • #18 Image: by Joan Starr #7: Specificity and Verifiability
  • #19 Image: http://www.flickr.com/photos/29261875@N05/6410305335 #8: Interoperability and flexibility
  • #36 Image: http://www.doi.org/ #4: Unique Identification