The document discusses emerging practices around citing data in scholarly publications. It outlines principles for data citation, including treating data as a first-class scholarly object and facilitating attribution, discovery, access, and provenance of data. Current gaps and infrastructure such as DataCite, FigShare, and the Dataverse Network are described, as well as emerging developments such as integrated data publication workflows.
BROWN BAG TALK WITH MICAH ALTMAN: INTEGRATING OPEN DATA INTO OPEN ACCESS JOURNALS (Micah Altman)
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu).
This talk discusses findings from an analysis of data sharing and citation policies in Open Access journals and describes a set of novel tools for open data publication in open access journal workflows. Bring your lunch and enjoy a discussion fit for scholars, Open Access fans, and students alike.
Dr. Micah Altman is Director of Research and Head/Scientist, Program on Information Science, for the MIT Libraries at the Massachusetts Institute of Technology.
BROWN BAG TALK WITH MICAH ALTMAN: SOURCES OF BIG DATA FOR SOCIAL SCIENCES (Micah Altman)
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu).
This talk reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation.
Dr. Micah Altman is Director of Research and Head/Scientist, Program on Information Science, for the MIT Libraries at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Science.
Dr. Altman conducts research in social science, information science and research methods -- focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.
"Reproducibility from the Informatics Perspective" (Micah Altman)
Dr. Altman will provide expert comment on the need for informatics modeling as part of the National Academies workshop: Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results
This workshop addresses statistical challenges in assessing and fostering the reproducibility of scientific results by examining three issues from a statistical perspective: the extent of reproducibility, the causes of reproducibility failures, and potential remedies.
Lesson 8 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Lesson 7 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Managing Confidential Information – Trends and Approaches (Micah Altman)
Personal information is ubiquitous and it is becoming increasingly easy to link information to individuals. Laws, regulations and policies governing information privacy are complex, but most intervene through either access or anonymization at the time of data publication.
Trends in information collection and management -- cloud storage, "big" data, and debates about the right to limit access to published but personal information -- complicate data management and make traditional approaches to managing confidential data less and less effective.
This session, presented as part of the Program on Information Science seminar series, examines trends in information privacy and discusses emerging approaches and research around managing confidential research information throughout its lifecycle.
Reproducibility from an Informatics Perspective (Micah Altman)
Scientific reproducibility is most often viewed through a methodological or statistical lens and, increasingly, through a computational lens. Over the last several years, I've taken part in collaborations that approach reproducibility from the perspective of informatics: as a flow of information across a lifecycle that spans collection, analysis, publication, and reuse.
These slides sketch this approach; they were presented at a recent workshop on reproducibility at the National Academy of Sciences and at one of our Program on Information Science brown bag talks. See: informatics.mit.edu
February 18, 2015 NISO Virtual Conference, Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Using data management plans as a research tool: an introduction to the DART Project
Amanda L. Whitmire, Ph.D., Assistant Professor, Data Management Specialist, Oregon State University Libraries & Press
Slides describing Force11 work and the background of several of the speakers, used for talks at the University of Lethbridge, at Carnegie Mellon, and internally at Elsevier.
DataONE Education Module 03: Data Management Planning (DataONE)
Lesson 3 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Doing for Data what PubMed did for literature: DATS, a model for dataset description, dataset indexing, and data discovery.
Google Slides [https://goo.gl/cd5KKa] or SlideShare [https://goo.gl/c8DH5N]
DataONE Education Module 01: Why Data Management? (DataONE)
Lesson 1 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Data Publishing at Harvard's Research Data Access Symposium (Mercè Crosas)
Data Publishing: The research community needs reliable, standard ways to make the data produced by scientific research available to the community, while giving credit to data authors. As a result, a new form of scholarly publication is emerging: data publishing. Data publishing - or making data reusable, citable, and accessible for long periods - is more than simply providing a link to a data file or posting the data to the researcher’s web site. We will discuss best practices, including the use of persistent identifiers and full data citations, the importance of metadata, the choice between public data and restricted data with terms of use, the workflows for collaboration and review before data release, and the role of trusted archival repositories. The Harvard Dataverse repository (and the Dataverse open-source software) provides a solution for data publishing, making it easy for researchers to follow these best practices, while satisfying data management requirements and incentivizing the sharing of research data.
Data Citation Implementation Guidelines, by Tim Clark (datascienceiqss)
This talk presents a set of detailed technical recommendations for operationalizing the Joint Declaration of Data Citation Principles (JDDCP) - the most widely agreed set of principle-based recommendations for direct scholarly data citation.
We will provide initial recommendations on identifier schemes, identifier resolution behavior, required metadata elements, and best practices for realizing programmatic machine actionability of cited data.
We hope that these recommendations along with the new NISO JATS document schema revision, developed in parallel, will help accelerate the wide adoption of data citation in scholarly literature. We believe their adoption will enable open data transparency for validation, reuse and extension of scientific results; and will significantly counteract the problem of false positives in the literature.
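As a concrete illustration of the machine actionability the recommendations aim at: a DOI can be resolved through content negotiation at doi.org, where an Accept header asks the resolver for machine-readable metadata or a formatted citation rather than the dataset's landing page. A minimal sketch (not drawn from the talk itself; the DOI below is a hypothetical placeholder):

```python
from urllib.request import Request

def metadata_request(doi: str,
                     media_type: str = "application/vnd.datacite.datacite+json") -> Request:
    """Build a content-negotiation request for machine-readable DOI metadata.

    Resolving https://doi.org/<doi> with an Accept header such as
    'application/vnd.datacite.datacite+json' or 'text/x-bibliography'
    requests metadata or a formatted citation instead of the landing page.
    """
    return Request(f"https://doi.org/{doi}", headers={"Accept": media_type})

# Hypothetical DOI, used purely for illustration.
req = metadata_request("10.1234/example", media_type="text/x-bibliography")
print(req.full_url)              # https://doi.org/10.1234/example
print(req.get_header("Accept"))  # text/x-bibliography
```

Opening the request with `urllib.request.urlopen` would then return the negotiated representation, assuming the registered DOI supports that media type.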
Lesson 2 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Going Full Circle: Research Data Management @ University of Pretoria (Johann van Wyk)
Presentation delivered at the eResearch Africa Conference, held 23-27 November 2014 at the University of Cape Town, Cape Town, South Africa.

Various approaches to Research Data Management at Higher Education Institutions focus on only an aspect or two of the research data cycle. At the University of Pretoria the approach has been to support researchers throughout the research process, covering the whole research data cycle. The idea is to facilitate and capture the research data throughout the research cycle, which gives context to the data and adds provenance. The University of Pretoria uses the UK Data Archive's research data cycle model to align its Research Data Management project development. This model identifies the stages of a research data cycle as: creating data, processing data, analysing data, preserving data, giving access to data, and reusing data.

This paper gives a short overview of the chronological development of research data management at the University of Pretoria, highlighting findings of two surveys done at the University, one in 2009 and one in 2013. This is followed by a discussion of a number of pilot projects at the University, of how the needs of researchers involved in these projects are being addressed in a number of the stages of the research data cycle, and of how the University plans to support the stages not currently being addressed.

The second part of the presentation focuses on the projects and technology (software and hardware) used. The University of Pretoria has adopted an Enterprise Content Management (ECM) approach to manage its research data. ECM is not a singular platform or system but rather a set of strategies, tools, and methodologies that interoperate with each other to create a comprehensive management tool: an all-encompassing process addressing document, web, records, and digital asset management.
At the University of Pretoria we address all these processes with different software suites and tools to create a complete management system. Each process presented its own technical challenges, which had to be addressed while keeping in mind the end objective of supporting researchers throughout the whole research process and data life cycle. Various platforms and standards have been adopted to meet the University of Pretoria's criteria. To date, three processes have been addressed: the capturing of data during the research process, the dissemination of data, and the preservation of data.
This presentation was provided by Tim McGeary of Duke University during the NISO virtual conference, Open Data Projects, held on Wednesday, June 13, 2018.
Amit Sheth with TK Prasad, "Semantic Technologies for Big Science and Astrophysics", Invited Plenary Presentation, at Earthcube Solar-Terrestrial End-User Workshop, NJIT, Newark, NJ, August 13, 2014.
Like many other fields of Big Science, Astrophysics and Solar Physics deal with the challenges of Big Data, including Volume, Variety, Velocity, and Veracity. There is already significant work on handling volume-related challenges, including the use of high-performance computing. In this talk, we will mainly focus on the other challenges, from the perspective of collaborative sharing and reuse of a broad variety of data created by multiple stakeholders, large and small, along with tools that offer semantic variants of search, browsing, integration, and discovery capabilities. We will borrow examples of tools and capabilities from state-of-the-art work in supporting physicists (including astrophysicists) [1], life sciences [2], and material sciences [3], and describe the role of semantics and semantic technologies that make these capabilities possible or easier to realize. This applied and practice-oriented talk will complement more vision-oriented counterparts [4].
[1] Science Web-based Interactive Semantic Environment: http://sciencewise.info/
[2] NCBO Bioportal: http://bioportal.bioontology.org/ , Kno.e.sis’s work on Semantic Web for Healthcare and Life Sciences: http://knoesis.org/amit/hcls
[3] MaterialWays (a Materials Genome Initiative related project): http://wiki.knoesis.org/index.php/MaterialWays
[4] From Big Data to Smart Data: http://wiki.knoesis.org/index.php/Smart_Data
DataONE Education Module 10: Legal and Policy Issues (DataONE)
Lesson 10 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/education-modules. Released under a CC0 license; attribution and citation requested.
Our regular Introduction to Data Management (DM) workshop (90 minutes). Covers very basic DM topics and concepts. The audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
DataTags: Sharing Privacy-Sensitive Data, by Michael Bar-Sinai (datascienceiqss)
The DataTags framework makes it easy for data producers to deposit, data publishers to store and distribute, and data users to access and use datasets containing confidential information, in a standardized and responsible way. The talk will first introduce the concepts and tools behind DataTags, and then focus on the user-facing component of the system, the Tagging Server (available today at datatags.org). We will conclude by describing how future versions of Dataverse will use DataTags to automatically handle sensitive datasets that can only be shared under some restrictions.
Functional and Architectural Requirements for Metadata: Supporting Discovery... (Jian Qin)
The tremendous growth in digital data has led to an increase in metadata initiatives for different types of scientific data, as evident in Ball’s survey (2009). Although individual communities have specific needs, there are shared goals that need to be recognized if systems are to effectively support data sharing within and across all domains. This paper considers this need and explores systems requirements that are essential for metadata supporting the discovery and management of scientific data. The paper begins with an introduction and a review of selected research specific to metadata modeling in the sciences. Next, the paper’s goals are stated, followed by the presentation of valuable systems requirements. The results include a base model with three chief principles: the principle of least effort, infrastructure service, and portability. The principles are intended to support “data user” tasks. Results also include a set of defined user tasks and functions, and application scenarios.
Deliver Perfect Images At Any Size
with Anne Thomas
Out of the Sandbox
Overview
One of the most difficult aspects of developing for different screen sizes is the need to serve high-quality images without slowing down the browsing experience. Websites are becoming more image-heavy every year, and with the popularity of content management systems growing, we don’t always have the luxury of complete control over the image sizes that are uploaded. Anne will share some of the tricks she has learned over the years to achieve the ideal combination for responsive images: fast, good, and cheap.
Objective
Help you build sites that deliver high-quality images regardless of screen size with modern techniques (and even support Internet Explorer!)
Five Things Audience Members Will Learn
Alternatives to JPGs and the pros and cons
How to speed up load times for images
Modern methods to display images beyond the usual img element
How to generate correct image size for every device
Handy comparison of JS libraries to support older browsers
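The "correct image size for every device" point above is usually implemented with the img element's srcset and sizes attributes, which let the browser pick the smallest pre-rendered image that still fills the layout slot. A hedged sketch of generating such an attribute value, assuming a hypothetical naming convention where each rendition lives at <base>-<width>w.jpg:

```python
def build_srcset(base_url: str, widths: list[int]) -> str:
    """Build an HTML srcset attribute value from pre-rendered image widths.

    Assumes a (hypothetical) URL convention of <base>-<width>w.jpg per
    rendition; each candidate is annotated with its intrinsic width so
    the browser can choose based on slot size and device pixel density.
    """
    return ", ".join(f"{base_url}-{w}w.jpg {w}w" for w in sorted(widths))

srcset = build_srcset("/images/hero", [480, 960, 1920])
# A full-bleed image: 'sizes="100vw"' tells the browser the slot spans
# the viewport, so it can match a srcset candidate to the actual device.
print(f'<img src="/images/hero-960w.jpg" srcset="{srcset}" sizes="100vw" alt="...">')
```

The same helper could feed a CMS template, sidestepping the lack of control over uploaded image sizes by always emitting the renditions the pipeline actually produced.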
Metadata and data citation. Session 2.5 of the RDMRose v3 materials.
The JISC-funded RDMRose project (June 2012-May 2013) was a collaboration between the libraries of the Universities of Leeds, Sheffield and York, with the Information School at Sheffield, to provide an Open Educational Resource for information professionals on Research Data Management. The materials were revised between November 2014 and February 2015 for the consortium of North West Academic Libraries (NoWAL).
http://www.sheffield.ac.uk/is/research/projects/rdmrose
[4.1] Data Citation and DOIs - Research Data Management - part of PhD course...3TU.Datacentrum
Training about Data Archive
You will learn:
What data citation is, and what the benefits are.
How to use DOIs for data citation.
How to cite a dataset
How to find publications with DOIs
How to link your publications to your dataset (and vice versa) using DOIs
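The elements of a dataset citation can be sketched as follows (an illustrative example only, not part of the 3TU.Datacentrum materials; the field layout follows the common creator/year/title/version/publisher/DOI pattern, and the example dataset and DOI are hypothetical):

```python
# Assemble a data citation from its recommended elements.
# The dataset, creator, and DOI below are invented for illustration.
def format_data_citation(creator, year, title, publisher, version, doi):
    """Return a citation string: Creator (Year): Title. Version. Publisher. DOI."""
    return f"{creator} ({year}): {title}. Version {version}. {publisher}. https://doi.org/{doi}"

citation = format_data_citation(
    creator="Dijkstra, A.M.",
    year=2013,
    title="North Sea wave-height measurements",
    publisher="3TU.Datacentrum",
    version="1.0",
    doi="10.4121/uuid:0000-example",
)
print(citation)
```

Because the DOI is rendered as a resolvable `https://doi.org/` URL, the citation links the publication back to the dataset's landing page.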
PLoS ONE Piwowar: Sharing Detailed Research Data Is Associated with Increa...Heather Piwowar
Heather A Piwowar, Roger S Day, Douglas B Fridsma (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate PLoS ONE 2: 3. e308
Abstract: Sharing research data provides benefit to the general scientific community, but the benefit is less obvious for the investigator who makes his or her data available. We examined the citation history of 85 cancer microarray clinical trial publications with respect to the availability of their data. The 48% of trials with publicly available microarray data received 85% of the aggregate citations. Publicly available data was significantly (p = 0.006) associated with a 69% increase in citations, independently of journal impact factor, date of publication, and author country of origin using linear regression. This correlation between publicly available data and increased literature impact may further motivate investigators to share their detailed research data.
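The kind of regression setup the abstract describes (citation count modeled on data availability plus controls) can be sketched as follows. This is illustrative only: the data below are synthetic with a built-in effect, not the study's data, and the study's actual model included further covariates.

```python
# Synthetic illustration of regressing citations on data availability
# while controlling for journal impact factor.
import numpy as np

rng = np.random.default_rng(0)
n = 200
shared = rng.integers(0, 2, n)      # 1 if the trial's data were public
impact = rng.uniform(1, 10, n)      # journal impact factor (control)

# Outcome constructed with a known sharing effect of +5 citations:
citations = 10 + 5 * shared + 2 * impact

X = np.column_stack([np.ones(n), shared, impact])  # intercept + predictors
coef, *_ = np.linalg.lstsq(X, citations, rcond=None)
print(coef)  # recovers [10, 5, 2]; the middle term is the sharing effect
```

The coefficient on `shared` isolates the association between public data and citations net of the controls, which is the quantity the study reports.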
DataONE Education Module 09: Analysis and WorkflowsDataONE
Lesson 9 in a set of 10 created by DataONE on Best Practices for Data Management. The full module can be downloaded from the DataONE.org website at: http://www.dataone.org/educaiton-modules. Released under a CC0 license, attribution and citation requested.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
Findable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
FAIR Data Management and FAIR Data SharingMerce Crosas
Presentation at the Critical Perspectives on the Practice of Digital Archaeology symposium: http://archaeology.harvard.edu/critical-perspectives-practice-digital-archaeology
State of the Art Informatics for Research Reproducibility, Reliability, and...Micah Altman
In March, I had the pleasure of being the inaugural speaker in a new lecture series (http://library.wustl.edu/research-data-testing/dss_speaker/dss_altman.html) initiated by the Libraries at Washington University in St. Louis -- dedicated to the topics of data reproducibility, citation, sharing, privacy, and management.
In the presentation embedded below, I provide an overview of the major categories of new initiatives to promote research reproducibility, reliability, and reuse and related state of the art in informatics methods for managing data.
Metadata and Metrics to Support Open AccessMicah Altman
This presentation, invited for a workshop on Open Access and Scholarly Books (sponsored by the Berkman Center and Knowledge Unlatched), provides a very brief overview of metadata design principles, approaches to evaluation metrics, and some relevant standards and exemplars in scholarly publishing. It is intended to provoke discussion on approaches to evaluation of the use, characteristics, and value of OA publications.
Crediting informatics and data folks in life science teamsCarole Goble
Science Europe LEGS Committee: Career Pathways in Multidisciplinary Research: How to Assess the Contributions of Single Authors in Large Teams, 1-2 Dec 2015, Brussels
The People Behind Research Software crediting from the informatics, technical point of view
Data Communities - reusable data in and outside your organization.Paul Groth
Description
Data is critical both to the functioning of an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data-reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I talk about which practices are a good place to start for helping others reuse your data. I put this in the context of the notion of data communities, which organizations can use to foster the use of data both internally and externally.
On November 21st 2014 at the Tufts University Medford campus and November 25th 2014 at the campus of the University of Massachusetts Medical School in Worcester, the BLC and Digital Science hosted a workshop focused on better understanding the research information management landscape.
Jonathan Breeze, CEO of Symplectic, reflected on the emergence of research information management systems and the resulting benefits they can provide.
This was part of a webinar from the Materials Research Society on Machine Learning, AI, and Data-Driven Materials Development and Design. The spoken content (including Q&A) is available through MRS.
Privacy in Research Data Management - Use CasesMicah Altman
From Integrating Approaches to Privacy across the Research Lifecycle http://privacytools.seas.harvard.edu/fall-2013-workshop
This workshop will consider how emerging tools and perspectives from a variety of disciplines, such as computer science, social science, law, and the health sciences, should be integrated in the management of confidential research data. Multidisciplinary discussion groups will grapple with these issues in the context of exemplar research use cases.
This presentation was provided by Lisa Johnston, University of Minnesota, for a NISO Virtual Conference on data curation held on Wednesday, August 31, 2016
Talk at a JISC Repositories conference, intended for repository managers or research managers, on some of the issues involved. The talk originally had to be given unaided because of a technology problem!
Doing research better: The role of meta‐dataGarethKnight
Presentation given by David Leon, Professor of Epidemiology at the London School of Hygiene and Tropical Medicine in January 2012. Subsequently reused at various internal events
Selecting efficient and reliable preservation strategiesMicah Altman
This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, hierarchical modeling, and then use empirically calibrated sensitivity analysis to identify effective strategies.
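A back-of-envelope version of the kind of quantitative question this framework addresses might look as follows. This is an illustrative sketch only, far simpler than the paper's combination of formal modeling, discrete-event simulation, hierarchical modeling, and calibrated sensitivity analysis, and it assumes independent replica failures:

```python
# Toy replication model: if each independent replica of a digital object is
# lost in a given period with probability p, the object is lost only when
# all replicas fail. How many replicas push loss probability below a target?
def replicas_needed(p_loss_per_copy: float, target: float) -> int:
    n, p_all_lost = 1, p_loss_per_copy
    while p_all_lost > target:
        n += 1
        p_all_lost *= p_loss_per_copy  # independence assumption
    return n

# e.g. 5% per-copy loss per period, one-in-a-million acceptable loss:
print(replicas_needed(0.05, 1e-6))  # -> 5
```

Real-world threats are correlated (shared software, organizations, legal regimes), which is precisely why the paper replaces this independence assumption with empirically calibrated models.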
This discussion, convened by the Dubai Future Foundation, focuses on identifying the significance of the concept of well-being for social science and policy, and the opportunities to measure it at scale.
Matching Uses and Protections for Government Data Releases: Presentation at t...Micah Altman
In the work included below, and presented at the Simons Institute, we describe work in progress that aims to align emerging methods of data protection with research uses.
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP2019Micah Altman
Libraries enable patrons to access a wide range of information, but much of the access to this information is now directly managed by publishers. This has led to a significant gap across library values, patrons' perceptions of privacy, and effective privacy protection for access to digital resources.
In the work included below, and presented at NERCOMP 2019, we review privacy principles based on ALA, IFLA, and NISO policies. We then organize and compare the high-level privacy protections required by the ALA checklist, NISO, and the GDPR. This framework of principles and controls is then used to score the privacy policies and practices of major vendors of research library content. We evaluate each element of the vendors' privacy policies, and use instrumented browsers to identify the types of tracking mechanisms used by different vendors. We use this set of privacy scores to support analyses of change over time, and of potential gaps between patron expectations and privacy policies and practices.
Presentation by Philip Cohen on collaborative work with Micah Altman as part of the MIT CREOS research talk series. Presented in fall 2018, in Cambridge, MA.
Contemporary journal peer review is beset by a range of problems. These include (a) long delay times to publication, during which time research is inaccessible; (b) weak incentives to conduct reviews, resulting in high refusal rates as the pace of journal publication increases; (c) quality control problems that produce both errors of commission (accepting erroneous work) and omission (passing over important work, especially null findings); (d) unknown levels of bias, affecting both who is asked to perform peer review and how reviewers treat authors, and; (e) opacity in the process that impedes error correction and more systematic learning, and enables conflicts of interest to pass undetected. Proposed alternative practices attempt to address these concerns -- especially open peer review, and post-publication peer review. However, systemic solutions will require revisiting the functions of peer review in its institutional context.
Presentation by Philip Cohen and Micah Altman on developing an exchange system for peer review in support for open science. Prepared for presentation at the ACRL-SSRC meeting on Open scholarship in the social sciences. Washington DC, Dec 2018
Redistricting in the US -- An OverviewMicah Altman
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
A History of the Internet: Scott Bradner’s Program on Information Science Talk Micah Altman
Scott Bradner is a Berkman Center affiliate who worked for 50 years at Harvard in the areas of computer programming, system management, networking, IT security, and identity management. He was involved in the design, operation, and use of data networks at Harvard University from the early days of the ARPANET and served in many leadership roles in the IETF. He presented the talk recorded below, entitled A History of the Internet, as part of the Program on Information Science Brown Bag Series:
Bradner abstracted his talk as follows:
In a way the Russians caused the Internet. This talk will describe how that happened (hint it was not actually the Bomb) and follow the path that has led to the current Internet of (unpatchable) Things (the IoT) and the Surveillance Economy.
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
The web is now firmly established as the primary communication and publication platform for sharing and accessing social and cultural materials. This networked world has created both opportunities and pitfalls for libraries and archives in their mission to preserve and provide ongoing access to knowledge. How can the affordances of the web be leveraged to drastically extend the plurality of representation in the archive? What challenges are imposed by the intrinsic ephemerality and mutability of online information? What methodological reorientations are demanded by the scale and dynamism of machine-generated cultural artifacts? This talk will explore the interplay of the web, contemporary historical records, and the programs, technologies, and approaches by which libraries and archives are working to extend their mission to preserve and provide access to the evidence of human activity in a world distinguished by the ubiquity of born-digital materials.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and uses of information science and technology to assess and solve institutional, social, and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
Labor And Reward In Science: Commentary on Cassidy Sugimoto’s Program on Info...Micah Altman
Cassidy Sugimoto is Associate Professor in the School of Informatics and Computing, Indiana University Bloomington, who researches within the domain of scholarly communication and scientometrics, examining the formal and informal ways in which knowledge producers consume and disseminate scholarship. She presented this talk, entitled Labor And Reward In Science: Do Women Have An Equal Voice In Scholarly Communication? A Brown Bag With Cassidy Sugimoto, as part of the Program on Information Science Brown Bag Series.
Despite progress, gender disparities in science persist. Women remain underrepresented in the scientific workforce and under-rewarded for their contributions. This talk will examine multiple layers of gender disparities in science, triangulating data from scientometrics, surveys, and social media to provide a broader perspective on the gendered nature of scientific communication. The extent of gender disparities and the ways in which new media are changing these patterns will be discussed. The talk will end with a discussion of interventions, with a particular focus on the roles of libraries, publishers, and other actors in the scholarly ecosystem.
Utilizing VR and AR in the Library Space: Micah Altman
Matt Bernhardt is a web developer in the MIT libraries and a collaborator in our program. He presented this talk, entitled Reality Bytes - Utilizing VR and AR in The Library Space, as part of Program on Information Science Brown Bag Series.
Terms like "virtual reality" and "augmented reality" have existed for a long time. In recent years, thanks to products like Google Cardboard and games like Pokemon Go, an increasing number of people have gained first-hand experience with these once-exotic technologies. The MIT Libraries are no exception to this trend. The Program on Information Science has conducted enough experimentation that we would like to share what we have learned, and solicit ideas for further investigation.
For slides and comments see: http://informatics.mit.edu/blog
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-NotsMicah Altman
Catherine D'Ignazio is an Assistant Professor of Civic Media and Data Visualization at Emerson College, a principal investigator at the Engagement Lab, and a research affiliate at the MIT Media Lab/Center for Civic Media. She presented this talk, entitled, Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots as part of Program on Information Science Brown Bag Series.
Communities, governments, libraries and organizations are swimming in data—demographic data, participation data, government data, social media data—but very few understand what to do with it. Though governments and foundations are creating open data portals and corporations are creating APIs, these rarely focus on use, usability, building community or creating impact. So although there is an explosion of data, there is a significant lag in data literacy at the scale of communities and citizens. This creates a situation of data-haves and have-nots which is troubling for an open data movement that seeks to empower people with data. But there are emerging technocultural practices that combine participation, creativity, and context to connect data to everyday life. These include data journalism, citizen science, emerging forms for documenting and publishing metadata, novel public engagement in government processes, and participatory data art. This talk surveys these practices both lovingly and critically, including their aspirations and the challenges they face in creating citizens that are truly empowered with data.
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA...Micah Altman
Access to high-quality, relevant information is absolutely foundational for a quality education. Yet, so many schools across the developing world lack fundamental resources, like textbooks, libraries, electricity and Internet connectivity. The SolarSPELL (Solar Powered Educational Learning Library) is designed specifically to address these infrastructural challenges, by bringing relevant, digital educational content to offline, off-grid locations. SolarSPELL is a portable, ruggedized, solar-powered digital library that broadcasts a webpage with open-access educational content over an offline WiFi hotspot, content that is curated for a particular audience in a specified locality—in this case, for schoolchildren and teachers in remote locations. It is a hands-on, iteratively developed project that has involved undergraduate students in all facets and at every stage of development. This talk will examine the design, development, and deployment of a for-the-field technology that looks simple but has a quite complex background.
Laura Hosman is Assistant Professor at Arizona State University, holding a joint appointment in the School for the Future of Innovation in Society and in The Polytechnic School. Her work is action-oriented and focuses on the role for information and communications technology (ICT) in developing countries. Presently, she focuses on ICT-in-education projects, and brings her passion for experiential learning to the classroom by leading real-world-focused, project-based courses that have seen student-built technology deployed in schools in Haiti, Vanuatu, Micronesia, Samoa, and Tonga.
Making Decisions in a World Awash in Data: We’re going to need a different bo...Micah Altman
In his abstract, Scriffignano summarizes as follows:
I explore some of the ways in which the massive availability of data is changing and the types of questions we must ask in the context of making business decisions. Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise. At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways. Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best. A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available. Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.” Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!). This session will cover three main themes: the new normal (how the data around us continues to change), how we are reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves). Ultimately, what we learn is governed as much by the data available as by the questions we ask. This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.
The Open Access Network: Rebecca Kennison’s Talk for the MIT Program on Infor...Micah Altman
Rebecca Kennison, who is the Principal of K|N Consultants, the co-founder of the Open Access Network, and was the founding director of the Center for Digital Research and Scholarship, gave this talk on Come Together Right Now: An Introduction To The Open Access Network as part of the Program on Information Science Brown Bag Series.
Gary Price, MIT Program on Information ScienceMicah Altman
Gary Price, who is chief editor of InfoDocket, contributing editor of Search Engine Land, co-founder of Full Text Reports and who has worked with internet search firms and library systems developers alike, gave this talk on Issues in Curating the Open Web at Scale as part of the Program on Information Science Brown Bag Series.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux tools -- libxml2's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security-analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
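The core idea of pruning uninteresting seed bytes can be sketched as follows. This is a hypothetical illustration, not DIAR's actual algorithm: the toy target below stands in for an instrumented program that reports a coverage signature, and a byte position is flagged as uninteresting when no trial mutation at that position ever changes coverage.

```python
# Toy "instrumented target": only the first two bytes influence control flow,
# so every other byte position is dead weight for a coverage-guided fuzzer.
def toy_target(data: bytes) -> frozenset:
    paths = {"entry"}
    if data[0:1] == b"<":
        paths.add("markup")
    if data[1:2].isdigit():
        paths.add("number")
    return frozenset(paths)

def uninteresting_bytes(seed: bytes, trials=(0x00, 0xFF, 0x41)) -> list:
    """Positions where every trial mutation leaves coverage unchanged."""
    baseline = toy_target(seed)
    boring = []
    for i in range(len(seed)):
        mutated_covs = {
            toy_target(seed[:i] + bytes([t]) + seed[i + 1:]) for t in trials
        }
        if mutated_covs == {baseline}:
            boring.append(i)
    return boring

seed = b"<7 padding padding"
print(uninteresting_bytes(seed))  # every position past the first two bytes
```

Dropping (or masking) the flagged positions shrinks the mutation space the fuzzer explores, which is the intuition behind starting campaigns with lean seeds.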
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally testing in DevOps. We also ran a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AIVladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed of release to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a PASSION for technology and making things work, along with a knack for helping others understand how things work. He comes with around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
1. Prepared for
MIT Libraries Informatics Program Brown Bag Talk
August 2013
Emerging Data Citation Infrastructure
Dr. Micah Altman
<escience@mit.edu>
Director of Research, MIT Libraries
2. DISCLAIMER
These opinions are my own; they are not the opinions
of MIT, Brookings, any of the project funders, nor (with
the exception of co-authored, previously published
work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.
Emerging Data Citation Practices
3. Collaborators & Co-Conspirators
• Merce Crosas, IQSS, Harvard U.
• Data-PASS Steering Committee
<data-pass.org>
• CODATA-ICSTI Task Group on Data Citation
Standards and Practices
<www.codata.org/taskgroups/TGdatacitation/>
• Research Support
– Thanks to the National Academies BRDI
Sponsors: Department of Energy (DOE), Institute
of Museum and Library Services (IMLS), the
Library of Congress (LOC), Microsoft Research,
National Institute of Standards and Technology
(NIST), National Institutes of Health
(NIH), National Oceanic and Atmospheric
Administration (NOAA), National Science
Foundation (NSF), U.S. Geological Survey
(USGS) & the Massachusetts Institute of
Technology.
4. Related Work
• CODATA-ICSTI Task Group on Data Citation Standards and Practices, 2013,
“Out of Cite, Out of Mind: The Current State of Practice, Policy, and
Technology for the Citation of Data”, Data Science Journal, forthcoming.
• P. F. Uhlir (Ed.), Developing Data Attribution and Citation Practices and
Standards: Report from an International Workshop (forthcoming),
National Academies Press.
• M. Altman, 2008, "A Fingerprint Method for Verification of Scientific
Data", in Advances in Systems, Computing Sciences and Software
Engineering (Proceedings of the International Conference on Systems,
Computing Sciences and Software Engineering 2007), Springer Verlag.
• Altman, M., & King, G. 2007. A Proposed Standard for the Scholarly
Citation of Quantitative Data. D-Lib Magazine, 13(3/4).
Most reprints available from:
informatics.mit.edu
5. This Talk
• What is data citation? Why Cite?
• Emerging Principles
• On the horizon
6. What’s Wrong with this Picture?
“To test Benet’s (1998) theory of “politically-induced
intelligence” (Benet 1999, pg 8), use a hierarchical
corrected contingency model (see Altman & Smith 2010;
Edgeworth 1863). We apply this model to a snowball
sample (Glass 1973) of eligible voters13, to which the
standard Stanford-Binet (Stanford & Binet 1766) has
been applied. Our results show that adoption of
Pastafarrianism can be expected to increase mean
intelligence by 10.3 points.”
13 We thank Jon Sample, Director of the Pastaffarian
Institute, for supplying this dataset, which is available upon request.
7. “How much slower would scientific progress be if
the near universal standards for scholarly citation of
articles and books had never been
developed? Suppose shortly after publication only
some printed works could be reliably found by
other scholars; or if researchers were only
permitted to read an article if they first committed
not to criticize it, or were required to coauthor with
the original author any work that built on the
original … [If] printed works existed in different
libraries under different titles; if researchers
routinely redistributed modified versions of other
authors' works without changing the title or author
listed; or if publishing new editions of books meant
that earlier editions were destroyed?...” – Altman &
King 2007
8. “Citations to unpublished data and personal
communications cannot be used to support
claims in a published paper”
“All data necessary to understand, assess, and
extend the conclusions of the manuscript must
be available to any reader of Science.”
Ideal
Helping Journals Manage Data
9. Reality
Compliance is low even in
the best examples of journals.
Checking compliance
manually is tedious and hard
to scale.
10. Attribution
• Cite data as first class work
• Identify contributors to data
Discovery
• Associate a persistent id with a
work
• Locate data via identifier
• Locate data integral to article
• Locate works related to data –
articles, derivatives, sources
Persistence
• Reference exists as long as referring object
• Evidence persists as long as assertions
based on evidence?
• Durability of data transparent?
Access
• Citation provides for mediated
access
• Access to surrogate
• On-line access to object
• Machine understandability
• Long-term human
understandability
Provenance
• Associate work with version of
evidence used
• Verify fixity of information
Principles for Data Citation
Theory: Use Cases
Operational Constraints?
- Syntax
- Interoperability
- Technical contexts of use
11. Reference
• Formal syntax used within the text of a publication to denote a relationship
to an external object. May contain additional information about the
portion/subset of external object implicated. Also known as “in-text
reference”, “pin-cite”.
“We applied contingency analysis to the greatest data ever. [Altman 2005]”
Citation
•Formal description of external object, used for location and attribution.
Micah Altman; Karin MacDonald; Michael P. McDonald, 2005, "Computer
Use in Redistricting", hdl:1902.1/AMXGCNKCLU
UNF:3:J0PkMygLPfIyT1E/8xO/EA==
http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU
Citation Metadata
•Metadata that is systematically associated with citation through well-
known public service, catalog, or protocol.
<component_list>
  <component parent_relation="isPartOf">
    <description><b>Figure 1:</b> This is the caption of the first figure...</description>
    <format mime_type="image/jpeg">Web resolution image</format>
  </component>
</component_list>
External Service
•Applications and services that consume, enhance, and aggregate citation
information.
Practice
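The example citation above pairs a persistent identifier with a resolver URL. A minimal sketch of that mapping, assuming a handle-style identifier and the id.thedata.org resolver shown in the example:

```python
from urllib.parse import quote

def resolver_url(persistent_id: str, resolver: str = "http://id.thedata.org/") -> str:
    """Percent-encode a persistent identifier and append it to a
    resolver base URL, yielding a machine-actionable citation link."""
    # safe="" forces ':' and '/' in the identifier to be escaped,
    # so the whole identifier travels as a single path segment.
    return resolver + quote(persistent_id, safe="")

# The identifier from the slide's example citation:
print(resolver_url("hdl:1902.1/AMXGCNKCLU"))
# → http://id.thedata.org/hdl%3A1902.1%2FAMXGCNKCLU
```

This is why the slide's citation and its resolvable URL carry the same identifier: the URL is a deterministic encoding of the citation's identifier, so software can go from one to the other without a lookup.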
12. Analysis Method
2 Workshops
(70+ participants)
+ 1 Literature Review
(400+ resources)
+ 2 Task Groups
NAS & CODATA
(25+ members)
+ 60 Interviews
+ 7 authors
Out of Cite, Out of Mind: The
Current State of Practice, Policy,
and Technology for the Citation of
Data
13. Principles for Data Citation
- Separate
- scientific principles
- use cases
- requirements
- Distinguish
- syntax
- semantics
- presentation
- Design for
- Ecosystem
- Lifecycle
- Stakeholders
- Implement
- Incremental value for incremental effort
- Think globally, act locally
Analysis Approach
14. Principles for Data Citation
1. Status of Data: Data citations should be accorded the same importance in
the scholarly record as the citation of other objects.
2. Attribution: Citations should facilitate giving scholarly credit and legal
attribution to all parties responsible for those data.
3. Persistence: Citations should be as durable as the cited objects.
4. Access: Citations should facilitate access to data by humans and by machines.
5. Discovery: Citations should support the discovery of data and their
documentation.
6. Provenance: Citations should facilitate the establishment of provenance of
data.
7. Granularity: Citations should support the finest grained description
necessary to identify the data.
8. Verifiability: Citations should contain information sufficient to identify the
data unambiguously.
9. Metadata Standards: Citations should employ widely accepted metadata
standards.
10. Flexibility: Citation methods should be sufficiently flexible to accommodate
the variant practices among communities.
Data Citation Principles
15. Principles for Data Citation
• Author.
– The creator of the data set.
• Title.
– As well as the name of the cited resource itself, this may also include the name of a facility and the titles of the top collection and main
parent subcollection (if any) of which the data set is a part.
• Publisher.
– The organization (or repository) either hosting the data or performing quality assurance.
• Publication date.
– Whichever is later: the date the data set was made available, the date all quality assurance procedures were completed, or the date
the embargo period (if applicable) expired. In other standards an “Access Date” field is used to document the date the data set was
successfully accessed.
• Resource type.
– Examples: “database” or “data set.”
• Edition.
– The level or stage of processing of the data, indicating how raw or refined the data set is.
• Version.
– A number increased when the data changes, as the result of adding more data points or rerunning a derivation process, for example.
• Feature name and URI.
– The name of an ISO 19101:2002 “feature” (e.g., GridSeries, ProfileSeries) and the URI identifying its standard definition, used to pick
out a subset of the data.
• Verifier.
– Information, such as a checksum or UNF, used to verify the identity of the content.
• Identifier.
– A resolvable web identifier for the data, according to a persistent scheme. There are several types of persistent identifiers, but the
scheme that is gaining the most traction is the Digital Object Identifier (DOI).
• Location.
– A persistent URL or UNF from which the data set is available. Some identifier schemes provide these via an identifier resolver service.
Citation Metadata Elements
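As a rough illustration of how these elements combine, the sketch below assembles a citation string with a fixity verifier. The publisher value and the data bytes are made-up placeholders, and a plain SHA-256 checksum stands in for a UNF (the real UNF algorithm canonicalizes the data, e.g. by rounding and normalizing encodings, before hashing):

```python
import hashlib

def data_citation(author, title, year, publisher, identifier, data: bytes) -> str:
    """Combine core citation metadata elements into a single string.
    A truncated SHA-256 digest of the data bytes serves as the
    verifier element (a simplified stand-in for a UNF)."""
    verifier = hashlib.sha256(data).hexdigest()[:16]
    return (f'{author}, {year}, "{title}", {publisher}, '
            f'{identifier}, checksum:{verifier}')

citation = data_citation(
    author="Altman, M.; MacDonald, K.; McDonald, M.P.",
    title="Computer Use in Redistricting",
    year=2005,
    publisher="IQSS Dataverse Network",          # hypothetical publisher value
    identifier="hdl:1902.1/AMXGCNKCLU",
    data=b"precinct,computer_use\n1,1\n2,0\n",   # stand-in data file
)
print(citation)
```

The point of the verifier element: anyone who later retrieves the dataset via the identifier can recompute the checksum and confirm they hold the same bytes the citing author used.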
16. Gaps
• Metadata/Structural
– Granularity
– Version Control
– Microattribution
– Contributor ID
– Facilitation of reuse
• Practice
– Author: use of citations to data
– Journals: ad-hoc syntax and location
– Infrastructure: failure to index citations and references to
data, even when associated with DOIs
– Tools: support for datasets in reference managers, etc.
17. Harmonizing Principles & Requirements
DataCite
• DOI
• Creator
• Title
• Publisher
• Publication
Year
Digital Curation Center
1. The citation itself must be able to identify
uniquely the object cited, though
different citations might use
different methods or schemes to do
so.
2. It must be able to identify subsets of
the data as well as the whole
dataset.
3.
a. It must provide the reader with
enough information to access the
dataset;
b. indeed, when expressed digitally
it should provide a mechanism for
accessing the dataset through the
Web infrastructure.
4.
a. It must be usable not only by
humans but also by software tools,
so that additional services may be
built using these citations.
b. In particular, there need to be
services that use the citations in
metrics to support the academic
reward system, and services that can
generate complete citations.
Force 11
• Data should be considered citable
products of research.
• Such data should be held in persistent
public repositories.
• If a publication is based on data not
included with the article, those data
should be cited in the publication.
• A data citation in a publication should
resemble a bibliographic citation and be
located in the publication’s reference list.
• Such a data citation should include a
unique persistent identifier (a DataCite
DOI is recommended, or other persistent
identifiers already in use within the
community).
• The identifier should resolve to a page
that either provides direct access to the
data or information concerning its
accessibility. Ideally, that landing page
should be machine-actionable to
promote interoperability of the data.
• If the data are available in different
versions, the identifier should provide a
method to access the previous or related
versions.
• Data citation should facilitate attribution
of credit to all contributors
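The DataCite fields listed above (DOI, creator, title, publisher, publication year) map onto a minimal metadata record. A sketch of a simplified DataCite-style record built from them; the real DataCite kernel schema adds XML namespaces and richer nesting, and the DOI below uses a test-style prefix, so treat this as illustrative only:

```python
import xml.etree.ElementTree as ET

def datacite_stub(doi, creator, title, publisher, year) -> str:
    """Build a simplified DataCite-style record covering the five
    mandatory fields: Identifier (DOI), Creator, Title, Publisher,
    PublicationYear. Real kernel XML adds namespaces and schema
    locations omitted here."""
    root = ET.Element("resource")
    ET.SubElement(root, "identifier", identifierType="DOI").text = doi
    creators = ET.SubElement(root, "creators")
    ET.SubElement(ET.SubElement(creators, "creator"), "creatorName").text = creator
    titles = ET.SubElement(root, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(root, "publisher").text = publisher
    ET.SubElement(root, "publicationYear").text = str(year)
    return ET.tostring(root, encoding="unicode")

# Hypothetical record for the dataset cited earlier in the deck:
xml_record = datacite_stub("10.5072/FK2EXAMPLE", "Altman, Micah",
                           "Computer Use in Redistricting",
                           "Harvard Dataverse Network", 2005)
print(xml_record)
```

Because the same five elements appear in the DCC and Force 11 recommendations, a record like this is roughly the common denominator the three harmonization efforts converge on.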
18. Current Infrastructure
FigShare
• Closed source
• No charge
• Archives data
• Supports DOIs, ORCIDs
• Preserved in CLOCKSS
Data Citation Index
• Commercial Service
(Thomson Reuters)
• Indexes many large
repositories
(e.g. Data-PASS)
• Beginning to extract
citations from TR
publications
Dataverse Network
• Open Source System
• Hubs run at Harvard and
other universities
• Archives data
• Generates persistent
identifiers (handles; DOIs
forthcoming)
• Generates resolvable
citations
• Versioned
• Harvard Library Dataverse
now part of DataCite and the
Data-PASS preservation
network
DataCite
• DOI registry service
(DOI provider)
• Data DOI metadata
indexing service
(parallel to CrossRef)
• Not-for-profit
membership
organization
• Collaborating with
ORCID-EU to embed
ORCIDs
19. Emerging Developments
Open Journal Data
Publication
• Open source integration
of PKP-OJS and Dataverse
Network
• Uses SWORD
• Integrated data
submission/citation/publication
workflow for OJS
open journals
Journal Developments
• NISO Recommendations on
Supplementary Materials
• Sloan/ICPSR Data Citation Project
• Data-PASS Journal Outreach
• New journal types:
– Registered Replication journals
– Null results journals
– Data journals/data papers
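The OJS-Dataverse integration above deposits data using SWORD, an Atom-based deposit protocol over HTTP. A minimal sketch of constructing (but not sending) such a deposit request; the collection URL, file contents, and account names are hypothetical placeholders:

```python
import urllib.request

def sword_deposit_request(collection_url: str, package: bytes,
                          filename: str, on_behalf_of=None):
    """Build an HTTP POST carrying a packaged dataset to a SWORD
    collection. Headers follow SWORD deposit conventions; the
    request object is returned unsent so it can be inspected."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"filename={filename}",
        # The packaging URI tells the server how to unpack the deposit
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
    }
    if on_behalf_of:
        # Mediated deposit: a journal system deposits for an author
        headers["On-Behalf-Of"] = on_behalf_of
    return urllib.request.Request(collection_url, data=package,
                                  headers=headers, method="POST")

req = sword_deposit_request(
    "https://dataverse.example.edu/sword/collection/demo",  # hypothetical endpoint
    package=b"PK-placeholder-zip-bytes",                    # stand-in for a real zip
    filename="study-data.zip",
    on_behalf_of="author@example.edu")
```

The On-Behalf-Of header is what makes an integrated journal workflow possible: the journal platform performs the deposit, while the dataset is credited to the submitting author.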
22. Brightening the “Dark Matter” of Scholarly
Communications
Researcher Identifiers: Developments, Opportunities &
Challenges
Research & Node Layout: Kevin Boyack and Dick
Klavans (mapofscience.com); Data: Thomson ISI;
Graphics & Typography: W. Bradford Paley
(didi.com/brad); Commissioned by Katy Börner
(scimaps.org)
Seed Magazine, Mar 7, 2007
http://seedmagazine.com/content/article/scientific_m
ethod_relationships_among_scientific_paradigms/
• Bibliometric and network analysis are
the “telescopes” for exploring the
structure of science
• Researcher ID’s allow us to see more
connections, more reliably
• Identifiers for datasets, etc. reveal the
“dark matter” of science
Some potential questions:
• Are fields linked through evidence that are
not linked through publications?
• How is the practice of science changing – are
data scientists, statisticians, etc. making
bigger contributions?
• What would be the results of:
– Catalyzing new research collaborations among individuals,
organizations?
– Strengthening support for specific areas of
interdisciplinary research?
– Growing the evidence base in particular areas?
Questions about how the network of contributors and outputs
evolves over time
23. Additional Bibliography (Selected)
• Starr, J., & Gastl, A. (2011). isCitedBy: A metadata scheme for DataCite. D-Lib
Magazine, 17(1/2). doi:10.1045/january2011-starr
• Piwowar, H., Vision, T.J. (2013). Data reuse and the open data citation
advantage. PeerJ PrePrints. 1:e1v1. doi: 10.7287/peerj.preprints.1
• Cronin, B. (1984). The citation process: The role and significance of citations
in scientific publication. London, United Kingdom: Taylor Graham.
• Van Leunen, M. (1992). A handbook for scholars. New York, NY: Oxford
University Press.
This work by Micah Altman (http://micahaltman.com) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Data citation supports attribution, provenance, discovery, access, and persistence. It is not (and should not be) sufficient for all of these things, but it is an important component. In the last two years, there have been several major efforts to standardize data citation practices, build citation infrastructure, and analyze data citation practices. This session, presented as part of the Program on Information Science seminar series, examines data citation from an information lifecycle approach: what are the use cases, requirements, and research opportunities? The session also discusses emerging infrastructure and standardization efforts around data citation. A number of principles have emerged for citation; the most central is that data citations should be treated consistently with citations to other objects. Data citations should at least provide the minimal core elements expected in other modern citations; they should be included in the references section along with citations to other elements; and they should be indexed in the same way. Adoption of data citation by journals can provide positive and sustainable incentives for more reproducible science and more complete attribution. This would act to brighten the dark matter of science, revealing connections among evidence bases that are not now visible through citations of articles.