Why Data Citation Currently Misses the Point

Mark A. Parsons and Peter A. Fox
19 December 2014
Why Data Citation Currently Misses the Point
References:
1Joint Declaration of Data Citation Principles, https://www.force11.org/datacitation
2Chawla D S. 2014. Could digital badges clarify the roles of co-authors? ScienceInsider
http://news.sciencemag.org/scientific-community/2014/11/could-digital-badges-clarify-roles-
co-authors. See also http://projectcredit.net.
3ESIP Data Stewardship Committee http://wiki.esipfed.org/index.php/
Preservation_and_Stewardship
4Donovan C and S Hanney. 2011. The payback framework explained. Research Evaluation
20 (3): 181-183. http://dx.doi.org/10.3152/095820211X13118583635756
What’s the use case?
In recent years, the data management community has begun to codify data
citation practices and associated technologies, especially persistent
identifiers. Nonetheless, digital data sets are rarely cited formally or
specifically. Moreover, citation has done little to open up data in the
unindexed deep Web. There are many reasons for this, but we believe one
problem is that data managers expect too much from classic bibliographic-
style data citation and assume false parallels between data and literature.
The idea of formalizing data citation emerged in the 1990s, primarily as a
mechanism to encourage and reward data sharing by giving credit to data
“authors”. The approach seemed a logical extension of the publishing
incentive, but it has not provided a strong incentive to share or done much
to expose hidden data. We need to reconsider the use case, but instead
work-around efforts such as “data journals” have emerged to try and make
data publishing more akin to literature publishing. Meanwhile, the
community is recognizing other purposes of citation, notably to help ensure
a scientific result can be verified. After much discussion amongst competing
views, the community converged around a core set of data citation
principles1. These principles are an important step forward, but they are
primarily oriented toward formal, scholarly citation. They hint at, but do not
fully consider, the myriad ways data are used.
In this poster, we take a broader view on what we are trying to accomplish
with data citation by exploring several use cases around attribution,
provenance, and impact. We seek to start a conversation on how we can
robustly address the myriad use cases that begin to uncover the deep Web.
We suggest that we need more sophisticated, diverse, and nuanced
approaches to actually address the many use cases of identifying,
tracking, and enhancing data use.
1. Attribution and Credit
Provide fair and recognized attribution for all
personnel involved in creating a data set.
Some concerns:
• Who is the
“author” of a
data set?
• What is the
appropriate
credit
mechanism for
all involved?
• Who gets credit
for what?
Some ideas:
Project CRediT has defined a taxonomy of contributor roles for
publications and suggests using digital badges that detail what
each author did for the work and link to their profiles elsewhere on
the Web2. Can we do this for data?
Project CRediT (Contributor Roles Taxonomy)
why not change the world? ®
2. Tracking and Provenance
Identify and trace all observations used in
forcing and constraining a model run.
Some concerns:
• How to capture the
purpose of the data in
the model e.g. forcing,
assimilation, boundary
conditions?
• How to reference the
precise version and
subset used.
• How far back does one need to go (see figure below).
• What are references (PIDs) pointing too?
Some ideas:
This is really an issue of provenance not just reference. Full
reproducibility requires being able to trace data, processes, and
tools. While persistent identifiers are crucial they are insufficient. A
fuller semantic description of the provenance is required as well
as richer context description. See provenance work of ESIP3.
3. Impact and Return on Investment
Provide a means to track the use, impact, and
value of a data set.
Some concerns:
• Data are used in many contexts that do not result in a formal
article, e.g. land use planning, disaster response, agricultural
prediction, policy analysis, education, etc.
• How to attribute a particular outcome to a particular person.
• Qualitative impact may
be as important as
quantitative, but it is
hard to measure.
Consider the impact of
this Apollo 8 image on
public consciousness.
Some ideas:
In health and social sciences
researchers have developed
a “Payback Framework” with
a logical model of the complete research process and categories
of payback from research4. Can we extend this and apply it to
data?
If we assign credit badges as suggested in use case 1, can we
aggregate the links to those badges through search engines
rather than relying on constrained citation indices?
“Payback Categories”
1. Knowledge
2. Benefits to future research and research use
3. Benefits from informing policy and product development
4. Environmental and public sector benefits
5. Broader economic benefits
Rensselaer Polytechnic Institute — rpi.edu
Data citation in theory and practice
The Data Citation Principles cover purpose,
function and attributes of citations. These principles
recognize the dual necessity of creating
citation practices that are both human
understandable and machine-
actionable.
1.Importance 
Data should be considered legitimate, citable
products of research. Data citations should be
accorded the same importance in the scholarly
record as citations of other research objects,
such as publications.
2.Credit and Attribution 
Data citations should facilitate giving scholarly
credit and normative and legal attribution to all
contributors to the data, recognizing that a single
style or mechanism of attribution may not be
applicable to all data.
3.Evidence 
In scholarly literature, whenever and wherever
a claim relies upon data, the corresponding data
should be cited.
4.Unique Identification 
A data citation should include a persistent
method for identification that is machine
actionable, globally unique, and widely used by a
community.
5.Access 
Data citations should facilitate access to the data
themselves and to such associated
metadata, documentation, code, and other
materials, as are necessary for both
humans and machines to make informed
use of the referenced data.
6.Persistence 
Unique identifiers, and metadata describing the
data, and its disposition, should persist -- even
beyond the lifespan of the data they describe.
7.Specificity and Verifiability  
Data citations should facilitate identification of,
access to, and verification of the specific data
that support a claim. Citations or citation
metadata should include information about
provenance and fixity sufficient to facilitate
verifying that the specific timeslice, version and/
or granular portion of data retrieved
subsequently is the same as was originally cited.
8.Interoperability and Flexibility 
Data citation methods should be sufficiently
flexible to accommodate the variant practices
among communities, but should not differ so
much that they compromise interoperability of
data citation practices across communities.
Joint Declaration of Data Citation Principles1
Figure courtesy Curt Tilmes, NASA
1. Conceptualization
2. Methodology
3. Software
4. Validation
5. Formal analysis
6. Investigation
7. Resources
8. Data curation
9. Writing – original draft
10.Writing – review &
editing
11. Visualization
12.Supervision
13.Project administration
14.Funding acquisition
Initial Conclusions
• Much data use and production occur outside of the regular scholarly discourse (i.e. the literature).
• The principles of data citation are strong, and the increasing use of persistent identifiers is a
significant advance, but we must think beyond bibliographic-style citation.
• It is important to have a citation approach that can readily be accepted by scholarly publishers, but we
should not assume that that approach addresses other concerns. We must separate the various
concerns around citation by considering multiple use cases.
• Indeed we must consider use cases in the first place! What problem are we truly trying to solve?
• Other disciplines are taking a more nuanced look at these issues outside the realm of publication.
Geosciences should too.
Stage 0
ID Topic
Stage 1
Inputs
Stage 2
Research
process
Stage 3
Primary
outputs
Stage 4
Secondary
outputs:
policy,
products
Stage 5
Adoption
Stage 6
Final
outcomes
Stock or reservoir of knowledge
Interface A
project
specification
Interface B
dissemination
direct impact from processes
and outputs to adoption
The political , professional, and industrial environment and wider society
direct feedback paths
The logical model of the Payback Framework
PROJECTCREDITNET
(CONTRIBUTOR
TAXONOMY): J. SCOTT, L.
ALLEN, A. BRAND ET AL.;
BIOMED CENTRAL DESIGN
(BADGE DESIGNS)/
CREATIVE COMMONS 4.0

Why Data Citation Currently Misses the Point

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to Why Data Citation Currently Misses the Point

Similar to Why Data Citation Currently Misses the Point (20)

Recently uploaded

Recently uploaded (20)

Why Data Citation Currently Misses the Point