Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility -- the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references required. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
DataCite and Campus Data Services
Paul Bracke, Associate Dean for Digital Programs and Information Services, Purdue University
Research libraries are increasingly interested in developing data services for their campuses. There are many perspectives, however, on how to develop services that are responsive to the many needs of scientists; sensitive to the concerns of scientists who are not always accustomed to sharing their data; and that are attractive to campus administrators. This presentation will discuss the development of campus-based data services programs, the centrality of data citation to these efforts, and the ways in which engagement with DataCite can enhance local programs.
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough, and for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed by diverse communities and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- and in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. The talk will describe recent work toward this end being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
These are the slides from Robert H. McDonald's Future Trends Panel presentation at the Inter-institutional Approaches to Supporting Scholarly Communication Symposium, held on August 16, 2012 at the Georgia Institute of Technology.
RDAP13 Mark Leggott: Stewarding research data using the Islandora framework -- ASIS&T
Mark Leggott, University of PEI/DiscoveryGarden
Islandora: Stewarding research data using the Islandora framework
Mark Leggott, Thornton Staples and Kathleen Van Ekris
Panel: Global scientific data infrastructure
Research Data Access & Preservation Summit 2013
Baltimore, MD April 4, 2013 #rdap13
A basic course on Research data management: part 1 - part 4 -- Leon Osinski
Slides belonging to a basic course on research data management. The course consists of 4 parts:
Part 1: what and why
1.1 data management plans
Part 2: protecting and organizing your data
2.1 data safety and data security
2.2 file naming, organizing data (TIER documentation protocol)
Part 3: sharing your data
3.1 via collaboration platforms (during research)
3.2 via data archives (after your research)
Part 4: caring for your data, or making data usable
4.1 tidy data
4.2 documentation/metadata
4.3 licenses
4.4 open data formats
Creating a sustainable business model for a digital repository: the Dryad experience -- ASIS&T
Creating a sustainable business model for a digital repository: the Dryad experience
Peggy Schaeffer
Datadryad.org
Presentation at Research Data Access & Preservation Summit
22 March 2012
Discussion of the role of academic libraries in the curation, preservation, and sharing of research data, particularly with regard to addressing barriers and providing incentives. Four specific tools are presented: EZID, data use agreements (DUAs) in the Merritt/DataShare repository, DataUp, and DMPTool.
Slides from my Metadata Workshop at Content Strategy Applied 2012. The session included several hands on exercises, which is where a lot of the interesting conversation took place.
About the Webinar
In May 2012, the Library of Congress announced a new modeling initiative focused on reflecting the MARC 21 library standard as a Linked Data model for the Web, with an initial model to be proposed by the consulting company Zepheira. The goal of the initiative is to translate the MARC 21 format to a Linked Data model while retaining the richness and benefits of existing data in the historical format.
In this webinar, Eric Miller of Zepheira will report on progress towards this important goal, starting with an analysis of the translation problem and concluding with potential migration scenarios for a broad-based transition from MARC to a new bibliographic framework.
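For readers unfamiliar with what such a translation involves, here is a toy Python sketch that maps a couple of MARC 21 fields from a simplified record into Linked Data style triples. The predicate URIs and the record are made up for illustration; this is not Zepheira's actual model or the eventual bibliographic framework vocabulary:

# A simplified MARC record: tag -> value (real MARC also has indicators and subfields).
marc_record = {
    "001": "ocm12345678",                 # control number
    "100": "Austen, Jane",                # main entry: personal name
    "245": "Pride and prejudice",         # title statement
}

# Hypothetical mapping from MARC tags to predicate URIs (illustrative only).
TAG_TO_PREDICATE = {
    "100": "http://example.org/vocab/creator",
    "245": "http://example.org/vocab/title",
}

def marc_to_triples(record):
    """Yield (subject, predicate, object) triples for the mapped MARC fields."""
    subject = f"http://example.org/work/{record['001']}"
    for tag, predicate in TAG_TO_PREDICATE.items():
        if tag in record:
            yield (subject, predicate, record[tag])

for s, p, o in marc_to_triples(marc_record):
    print(f"<{s}> <{p}> \"{o}\" .")     # N-Triples-style output

A real migration would use an RDF library and a published vocabulary; the point here is only to show the shape of the record-to-graph translation the initiative is working through.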
Supporting Libraries in Leading the Way in Research Data Management -- Marieke Guy
Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK presents on Supporting Libraries in Leading the Way in Research Data Management at Online Information, London, 20th-21st November 2012.
Publishing your research: Research Data Management (Introduction) -- Jamie Bisset
Publishing your research: Research Data Management (Introduction) (November 2013) slides. Delivered as part of the Durham University Researcher Development Programme. Further Training available at https://www.dur.ac.uk/library/research/training/
Presentation given to the High Performance Computing Summer School as part of a hands-on workshop developing software management plans and looking at software as data within the context of research data management best practices.
Spring 2014 Data Management Lab: Session 1 Slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
I shall provide a summary of JISC work in the area of ‘Big Data’. My primary focus will be on how to manage the huge amount of research data produced in UK Universities. I shall cover the history of JISC interventions to improve research data management and look at next steps. I shall touch on some other areas of work like ‘Digging into Data’ and web archiving which also deal with ‘big data’.
Similar to Needs for Data Management & Citation Throughout the Information Lifecycle (20)
Selecting efficient and reliable preservation strategies -- Micah Altman
This article addresses the problem of formulating efficient and reliable operational preservation policies that ensure bit-level information integrity over long periods, and in the presence of a diverse range of real-world technical, legal, organizational, and economic threats. We develop a systematic, quantitative prediction framework that combines formal modeling, discrete-event-based simulation, and hierarchical modeling, and then uses empirically calibrated sensitivity analysis to identify effective strategies.
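As a rough, hypothetical illustration of the kind of quantitative prediction such a framework performs (not the authors' actual model), the following Python sketch uses a simple Monte Carlo simulation to estimate the chance that every replica of an object is lost, under an assumed annual failure rate and a periodic audit-and-repair process; the parameter values are made up for illustration:

import random

def prob_total_loss(copies=3, annual_fail=0.05, years=20, audit_every=1, trials=20000):
    """Estimate the probability that every replica fails before being repaired.

    copies: number of independent replicas kept
    annual_fail: assumed probability that a single replica is lost in a year
    audit_every: years between audits that re-replicate from surviving copies
    """
    losses = 0
    for _ in range(trials):
        alive = copies
        for year in range(1, years + 1):
            # each surviving replica may fail independently this year
            alive = sum(1 for _ in range(alive) if random.random() > annual_fail)
            if alive == 0:           # nothing left to repair from
                losses += 1
                break
            if year % audit_every == 0:
                alive = copies       # audit detects losses and restores replicas
    return losses / trials

# crude sensitivity analysis over replica count and audit frequency
for copies in (2, 3, 4):
    for audit_every in (1, 5):
        p = prob_total_loss(copies=copies, audit_every=audit_every)
        print(f"copies={copies} audit_every={audit_every}y -> P(total loss) ~ {p:.4f}")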
This discussion, convened by the Dubai Future Foundation, focuses on identifying the significance of the concept of well-being for social science and policy, and the opportunities to measure it at scale.
Matching Uses and Protections for Government Data Releases: Presentation at the Simons Institute -- Micah Altman
In the work included below, and presented at the Simons Institute, we describe work in progress that aims to align emerging methods of data protection with research uses.
Privacy Gaps in Mediated Library Services: Presentation at NERCOMP 2019 -- Micah Altman
Libraries enable patrons to access a wide range of information, but much of the access to this information is now directly managed by publishers. This has led to a significant gap between library values, patrons' perceptions of privacy, and effective privacy protection for access to digital resources.
In the work included below, and presented at NERCOMP 2019, we review privacy principles based on ALA, IFLA, and NISO policies. We then organize and compare the high-level privacy protections required by the ALA checklist, NISO, and the GDPR. This framework of principles and controls is then used to score the privacy policies and practices of major vendors of research library content. We evaluate each element of the vendors' privacy policies, and use instrumented browsers to identify the types of tracking mechanisms used by different vendors. We use this set of privacy scores to support analyses of change over time, and of potential gaps between patron expectations and privacy policies and practices.
Presentation by Philip Cohen on collaborative work with Micah Altman as part of the MIT CREOS research talk series. Presented in fall 2018, in Cambridge, MA.
Contemporary journal peer review is beset by a range of problems. These include (a) long delay times to publication, during which time research is inaccessible; (b) weak incentives to conduct reviews, resulting in high refusal rates as the pace of journal publication increases; (c) quality control problems that produce both errors of commission (accepting erroneous work) and omission (passing over important work, especially null findings); (d) unknown levels of bias, affecting both who is asked to perform peer review and how reviewers treat authors; and (e) opacity in the process that impedes error correction and more systematic learning, and enables conflicts of interest to pass undetected. Proposed alternative practices attempt to address these concerns, especially open peer review and post-publication peer review. However, systemic solutions will require revisiting the functions of peer review in its institutional context.
Presentation by Philip Cohen and Micah Altman on developing an exchange system for peer review in support for open science. Prepared for presentation at the ACRL-SSRC meeting on Open scholarship in the social sciences. Washington DC, Dec 2018
Redistricting in the US -- An Overview -- Micah Altman
This presentation was prepared for the International Seminar on Electoral Districting, National Electoral Institute El Colegio de México. http://www.ine.mx/seminario-internacional-distritacion-electoral/
A History of the Internet: Scott Bradner's Program on Information Science Talk -- Micah Altman
Scott Bradner is a Berkman Center affiliate who worked for 50 years at Harvard in the areas of computer programming, system management, networking, IT security, and identity management. He was involved in the design, operation, and use of data networks at Harvard University from the early days of the ARPANET, and served in many leadership roles in the IETF. He presented the talk recorded below, entitled A History of the Internet, as part of the Program on Information Science Brown Bag Series:
Bradner abstracted his talk as follows:
In a way, the Russians caused the Internet. This talk will describe how that happened (hint: it was not actually the Bomb) and follow the path that has led to the current Internet of (unpatchable) Things (the IoT) and the Surveillance Economy.
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS - Program ... -- Micah Altman
The web is now firmly established as the primary communication and publication platform for sharing and accessing social and cultural materials. This networked world has created both opportunities and pitfalls for libraries and archives in their mission to preserve and provide ongoing access to knowledge. How can the affordances of the web be leveraged to drastically extend the plurality of representation in the archive? What challenges are imposed by the intrinsic ephemerality and mutability of online information? What methodological reorientations are demanded by the scale and dynamism of machine-generated cultural artifacts? This talk will explore the interplay of the web, contemporary historical records, and the programs, technologies, and approaches by which libraries and archives are working to extend their mission to preserve and provide access to the evidence of human activity in a world distinguished by the ubiquity of born-digital materials.
Information Science Brown Bag talks, hosted by the Program on Information Science, consist of regular discussions and brainstorming sessions on all aspects of information science and the uses of information science and technology to assess and solve institutional, social, and research problems. These are informal talks. Discussions are often inspired by real-world problems being faced by the lead discussant.
Labor And Reward In Science: Commentary on Cassidy Sugimoto's Program on Info... -- Micah Altman
Cassidy Sugimoto is Associate Professor in the School of Informatics and Computing, Indiana University Bloomington, who researches within the domain of scholarly communication and scientometrics, examining the formal and informal ways in which knowledge producers consume and disseminate scholarship. She presented this talk, entitled Labor And Reward In Science: Do Women Have An Equal Voice In Scholarly Communication? A Brown Bag With Cassidy Sugimoto, as part of the Program on Information Science Brown Bag Series.
Despite progress, gender disparities in science persist. Women remain underrepresented in the scientific workforce and under-rewarded for their contributions. This talk will examine multiple layers of gender disparities in science, triangulating data from scientometrics, surveys, and social media to provide a broader perspective on the gendered nature of scientific communication. The extent of gender disparities and the ways in which new media are changing these patterns will be discussed. The talk will end with a discussion of interventions, with a particular focus on the roles of libraries, publishers, and other actors in the scholarly ecosystem.
Utilizing VR and AR in the Library Space -- Micah Altman
Matt Bernhardt is a web developer in the MIT Libraries and a collaborator in our program. He presented this talk, entitled Reality Bytes - Utilizing VR and AR in The Library Space, as part of the Program on Information Science Brown Bag Series.
Terms like "virtual reality" and "augmented reality" have existed for a long time. In recent years, thanks to products like Google Cardboard and games like Pokemon Go, an increasing number of people have gained first-hand experience with these once-exotic technologies. The MIT Libraries are no exception to this trend. The Program on Information Science has conducted enough experimentation that we would like to share what we have learned, and solicit ideas for further investigation.
For slides and comments see: http://informatics.mit.edu/blog
Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots -- Micah Altman
Catherine D'Ignazio is an Assistant Professor of Civic Media and Data Visualization at Emerson College, a principal investigator at the Engagement Lab, and a research affiliate at the MIT Media Lab/Center for Civic Media. She presented this talk, entitled, Creative Data Literacy: Bridging the Gap Between Data-Haves and Have-Nots as part of Program on Information Science Brown Bag Series.
Communities, governments, libraries and organizations are swimming in data—demographic data, participation data, government data, social media data—but very few understand what to do with it. Though governments and foundations are creating open data portals and corporations are creating APIs, these rarely focus on use, usability, building community or creating impact. So although there is an explosion of data, there is a significant lag in data literacy at the scale of communities and citizens. This creates a situation of data-haves and have-nots which is troubling for an open data movement that seeks to empower people with data. But there are emerging technocultural practices that combine participation, creativity, and context to connect data to everyday life. These include data journalism, citizen science, emerging forms for documenting and publishing metadata, novel public engagement in government processes, and participatory data art. This talk surveys these practices both lovingly and critically, including their aspirations and the challenges they face in creating citizens that are truly empowered with data.
SOLARSPELL: THE SOLAR POWERED EDUCATIONAL LEARNING LIBRARY - EXPERIENTIAL LEA... -- Micah Altman
Access to high-quality, relevant information is absolutely foundational for a quality education. Yet, so many schools across the developing world lack fundamental resources, like textbooks, libraries, electricity and Internet connectivity. The SolarSPELL (Solar Powered Educational Learning Library) is designed specifically to address these infrastructural challenges, by bringing relevant, digital educational content to offline, off-grid locations. SolarSPELL is a portable, ruggedized, solar-powered digital library that broadcasts a webpage with open-access educational content over an offline WiFi hotspot, content that is curated for a particular audience in a specified locality—in this case, for schoolchildren and teachers in remote locations. It is a hands-on, iteratively developed project that has involved undergraduate students in all facets and at every stage of development. This talk will examine the design, development, and deployment of a for-the-field technology that looks simple but has a quite complex background.
Laura Hosman is Assistant Professor at Arizona State University, holding a joint appointment in the School for the Future of Innovation in Society and in The Polytechnic School. Her work is action-oriented and focuses on the role for information and communications technology (ICT) in developing countries. Presently, she focuses on ICT-in-education projects, and brings her passion for experiential learning to the classroom by leading real-world-focused, project-based courses that have seen student-built technology deployed in schools in Haiti, Vanuatu, Micronesia, Samoa, and Tonga.
Making Decisions in a World Awash in Data: We're going to need a different bo... -- Micah Altman
In his abstract, Scriffignano summarizes as follows:
I explore some of the ways in which the massive availability of data is changing, and the types of questions we must ask in the context of making business decisions. Truth be told, nearly all organizations struggle to make sense out of the mounting data already within the enterprise. At the same time, businesses, individuals, and governments continue to try to outpace one another, often in ways that are informed by newly-available data and technology, but just as often using that data and technology in alarmingly inappropriate or incomplete ways. Multiple “solutions” exist to take data that is poorly understood, promising to derive meaning that is often transient at best. A tremendous amount of “dark” innovation continues in the space of fraud and other bad behavior (e.g. cyber crime, cyber terrorism), highlighting that there are very real risks to taking a fast-follower strategy in making sense out of the ever-increasing amount of data available. Tools and technologies can be very helpful or, as Scriffignano puts it, “they can accelerate the speed with which we hit the wall.” Drawing on unstructured, highly dynamic sources of data, fascinating inference can be derived if we ask the right questions (and maybe use a bit of different math!). This session will cover three main themes: the new normal (how the data around us continues to change), how we are reacting (bringing data science into the room), and the path ahead (creating a mindset in the organization that evolves). Ultimately, what we learn is governed as much by the data available as by the questions we ask. This talk, both relevant and occasionally irreverent, will explore some of the new ways data is being used to expose risk and opportunity and the skills we need to take advantage of a world awash in data.
The Open Access Network: Rebecca Kennison's Talk for the MIT Program on Infor... -- Micah Altman
Rebecca Kennison, who is the Principal of K|N Consultants, the co-founder of the Open Access Network, and was the founding director of the Center for Digital Research and Scholarship, gave this talk on Come Together Right Now: An Introduction To The Open Access Network as part of the Program on Information Science Brown Bag Series.
Gary Price, MIT Program on Information Science -- Micah Altman
Gary Price, who is chief editor of InfoDocket, contributing editor of Search Engine Land, co-founder of Full Text Reports and who has worked with internet search firms and library systems developers alike, gave this talk on Issues in Curating the Open Web at Scale as part of the Program on Information Science Brown Bag Series.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
PHP Frameworks: I want to break free (IPC Berlin 2024) -- Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
The Art of the Pitch: WordPress Relationships and Sales -- Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... -- BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf -- Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
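As a small, vendor-neutral illustration of two such foundational concepts (structured logs and a simple counter metric), here is a toy Python sketch using only the standard library; it is an assumption-laden example, not the speaker's stack or any particular observability product's API:

import json, logging, time

logging.basicConfig(level=logging.INFO, format="%(message)s")
REQUEST_COUNT = {"ok": 0, "error": 0}   # a toy in-process counter metric

def handle_request(path):
    start = time.monotonic()
    status = "unknown"
    try:
        result = do_work(path)
        REQUEST_COUNT["ok"] += 1
        status = "ok"
        return result
    except Exception:
        REQUEST_COUNT["error"] += 1
        status = "error"
        raise
    finally:
        # structured (JSON) log line: easy for machines to parse and correlate
        logging.info(json.dumps({
            "event": "request_handled",
            "path": path,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
        }))

def do_work(path):
    return f"served {path}"

handle_request("/hello")
print(REQUEST_COUNT)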
Securing your Kubernetes cluster: a step-by-step guide to success! -- KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Essentials of Automations: The Art of Triggers and Actions in FME -- Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Pushing the limits of ePRTC: 100ns holdover for 100 days -- Adtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf -- 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 5 -- DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GridMate - End to end testing is a critical piece to ensure quality and avoid... -- ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf -- Peter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
UiPath Test Automation using UiPath Test Suite series, part 4 -- DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 -- Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Secstrike: Reverse Engineering & Pwnable tools for CTF.pptx
Needs for Data Management & Citation Throughout the Information Lifecycle
1. Needs for Data Management & Citation Throughout the Information Lifecycle
Micah Altman, Director of Research, MIT Libraries
Prepared for NISO Forum: Tracking it Back to the Source: Managing and Citing Research Data, September 2012
2. Collaborators and Co-Conspirators
• Jonathan Crabtree, Merce Crosas, Gary King, Tom Lipkis, Nancy McGovern, John Willinsky
• Research Support
– Library of Congress (PA#NDP03-1)
– National Science Foundation (DMS-0835500, SES 0112072)
– Institute for Museum and Library Services (LG-05-09-0041-09)
– Sloan Foundation
– Amazon Web Services
– Massachusetts Institute of Technology
3. Related Work
Reprints available from: http://maltman.hmdc.harvard.edu
• Altman, M. 2012. "Data Citation in The Dataverse Network®." In P. F. Uhlir (Ed.), Developing Data Attribution and Citation Practices and Standards: Report from an International Workshop. National Academies Press. Forthcoming.
• Altman, M., & Crabtree, J. 2011. "Using the SafeArchive System: TRAC-Based Auditing of LOCKSS." Archiving 2011 (pp. 165-170). Society for Imaging Science and Technology.
• Altman, M., Adams, M., Crabtree, J., Donakowski, D., Maynard, M., Pienta, A., & Young, C. 2009. "Digital preservation through archival collaboration: The Data Preservation Alliance for the Social Sciences." The American Archivist 72(1): 169-182.
• Altman, M. 2008. "A Fingerprint Method for Verification of Scientific Data." In Advances in Systems, Computing Sciences and Software Engineering (Proceedings of the International Conference on Systems, Computing Sciences and Software Engineering 2007). Springer-Verlag.
• Altman, M. and G. King. 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data." D-Lib Magazine 13(3/4), March/April.
4. Preview
• Principled approach to data management
• Lifecycle data management planning
• Lifecycle data management tracking
• Lifecycle data management infrastructure
• [Exemplar Projects]
6. “Data science is suddenly sexy – does that mean data is the new black?”
7. Valuable Data is Lost
• Researchers lack archiving capability
• Incentives for data sharing are weak
Examples:
– Intentionally Discarded: “Destroyed, in accord with [nonexistent] APA 5-year post-publication rule.”
– Unintentional Hardware Problems: “Some data were collected, but the data file was lost in a technical malfunction.”
– Acts of Nature: “The data from the studies were on punched cards that were destroyed in a flood in the department in the early 80s.”
– Discarded or Lost in a Move: “As I retired ... Unfortunately, I simply didn’t have the room to store these data sets at my house.”
– Obsolescence: “Speech recordings stored on a LISP Machine ..., an experimental computer which is long obsolete.”
– Simply Lost: “For all I know, they are on a [University] server, but it has been literally years and years since the research was done, and my files are long gone.”
8. Unpublished Data Ends up in the “Desk Drawer”
• Null results are less likely to be published
• Outliers are routinely discarded
(Image: Daniel Schectman’s lab notebook providing initial evidence of quasi crystals)
9. Data Behind Publications Unavailable for Review, Reuse, Replication
10. Model Science
“Citations to unpublished data and personal communications cannot be used to support claims in a published paper.”
“All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science.”
11. Compliance with Policies is Low
• Compliance is low even in the best examples of journals
• Checking compliance manually is tedious and doesn’t scale
12. Special Challenges for Long-Term Access to New Forms of Data
• Some examples
– GIS and geospatial trails
– Facebook & social networks
– Text: blogs, tweets
– Cell phone data
• Challenges
– Proprietary – intellectual property
– Size
– Dynamic content
– Fixity
– Format
Source: [Calberese 2008]
14. “The published article is not scientific output – it’s a summary of scientific output.”
-- corollary of Buckheit & Donoho 1995
15. Information Lifecycle
(Diagram: the information lifecycle) Modeling; Creation/Collection; Storage/Ingest; Processing; Internal Sharing; Analysis; External dissemination/publication; Re-use (scientific, educational, scientometric, institutional); Long-term access.
16. Stakeholders
(Diagram: stakeholders arranged around the information lifecycle) Data Consumers; Data Sources/Subjects; Data Archives/Publishers; Researchers; Research Sponsors; Research Organizations; Scholarly Publishers; Service/Infrastructure Providers.
17. Legal Requirements and Rights
(Diagram: the landscape of legal requirements and rights) Contract: click-wrap, terms of use, licenses. Intellectual property: trade secret, patent, copyright, DMCA, trademark, fair use, moral rights, attribution, database rights. Journal replication requirements and funder open access requirements. Privacy and confidentiality: HIPAA, the Common Rule (45 CFR 26), FERPA, the EU Privacy Directive, CIPSEA, FOIA, state FOI laws, state privacy laws, privacy torts (invasion, defamation), rights of publicity. Access rights: classified, sensitive but unclassified, ITAR. Potentially harmful information (archeological sites, animal testing, ...).
18. Stakeholders, Rights and Requirements
(Diagram: the stakeholders from slide 16 overlaid on the legal landscape from slide 17) Scholarly publishers, infrastructure/service providers, primary researchers, research organizations, data archives, research sponsors, sources/subjects, and consumers (secondary research, participative science, public policy uses) each face different combinations of contract, intellectual property, privacy/confidentiality, and access-rights constraints.
19. Stakeholder Drivers per Stage of Information Lifecycle
Stage: Research Proposal, Design, and Data Collection
Actor | Legal constraints | Concerns
Subjects | Consent/contract; privacy | Public benefit; future access to own information
Sources | Intellectual property; contract | Business confidentiality; IP; profit from licenses
Funder | Open access; confidentiality | Public benefit; policy relevance; reproducible research; future access
Primary Researcher | Confidentiality; contract; IP | Publication potential; compliance with institutional/funder requirements
Research Institution | Confidentiality; contract; IP | Compliance with funder requirements; license, IP, and confidentiality compliance
20. Stakeholder Drivers per Stage of Information Lifecycle
Stage: Data Storage and Analysis (Pre-publication)
Actor | Legal constraints | Concerns
Primary Researcher | Confidentiality; contract; IP | Publication potential; compliance with institutional/funder requirements
Research Institution | Confidentiality; contract; IP | License, IP, and confidentiality compliance; records management
Service Providers | Contract; (in selected cases) confidentiality requirements | Contract; service business model; service deployment
21. Stakeholder Drivers per Stage of Information Lifecycle
Stage: Publication
Actor | Legal constraints | Concerns
Primary Researcher | Compliance for: source/subjects, sponsor, host institution, publisher | Scholarly attribution/credit; promote use of research; track use/impact of research
Sponsor | | Track research products; track compliance; track use/impact
Research Institution | Sponsor compliance; records management; intellectual property | Track OA products
Scholarly/Journal Publisher | IP; contract | Impact/use; profit/business model; replicability
Data Publisher | IP | Profit/business model; replicability; connection to publication
22. Stakeholder Drivers per Stage of Information Lifecycle
Stage: Re(use)
Actor | Legal constraints | Concerns
Research Reader | Access rights | Provenance
Secondary Researcher | Access rights; confidentiality; contract | Replicability; data reintegration/reanalysis; linking publications and data; provenance
“Citizen/Community Scientist” | Access rights | Data redissemination/reanalysis; linking publications and data
Public Policy | Access rights | Provenance; replicability; linking publications and data
Education/Teaching | Access rights | “Classroom” use; MOOC use
24. Some Formal “DMP” Requirements
• The Final NIH Statement on Sharing Research Data was published in the NIH Guide on February 26, 2003.
“Starting with the October 1, 2003 receipt date, investigators submitting an NIH application seeking $500,000 or more in direct costs in any single year are expected to include a plan for data sharing or state why data sharing is not possible.”
– Data are to be shared no later than acceptance for publication of the main findings from the final data set
• NSF: all proposals must (as of 1/1/2011) include a data management plan.
– Specific requirements are vague, for the most part: “will be determined by the community of interest through the process of peer review and program management.”
• Wellcome Trust: “will review data management and sharing plans, and any costs involved in delivering them, as an integral part of the funding decision”
25. DMP Goals
• Orchestrate data for current use
• Control disclosure
• Compliance with contracts, regulations, law, and policy
• Maximize value of information assets
• Ensure short-term and long-term dissemination
26. DMP Elements
• Orchestrate data for current use: quality assurance; storage, backup, replication, and versioning; data formats; data organization; budget; metadata and documentation
• Control disclosure: access and sharing; intellectual property rights; legal requirements; security
• Compliance with contracts, regulations, law, and policy: access and sharing; adherence; responsibility; ethics and privacy; security
• Value of information assets: data description; data value; relation to collection; relation to evidence base; budget
• Ensure short-term and long-term dissemination: data description; institutional archiving commitments; audience; access and sharing; data formats; data organization; metadata and documentation; budget
27. DMP Details
• Sharing: plans for depositing in an existing public database; access procedures; embargo periods; access charges; timeframe for access; technical access methods; restrictions on access; restrictions on use
• Long-term access (preservation): requirements for data destruction, if applicable; procedures for long-term preservation; institution responsible for long-term costs of data preservation; succession plans for data should the archiving entity go out of existence
• Formats: generation and dissemination formats and procedural justification; storage format and archival justification; format documentation
• Metadata and documentation: internal and external identifiers and citations; metadata to be provided; metadata standards used; planned documentation and supporting materials; quality assurance procedures for metadata and documentation
• Data organization: file organization; naming conventions
• Storage, backup, replication, and versioning: facilities; methods; procedures; frequency; replication; version management; recovery guarantees
• Security: procedural controls; technical controls; confidentiality concerns; access control rules
• Budget: cost of preparing data and documentation; cost of storage and backup; cost of permanent archiving and access
• Intellectual property rights: entities who hold property rights; types of IP rights in data; protections provided; dispute resolution process
• Legal requirements: provider requirements and plans to meet them; institutional requirements and plans to meet them
• Responsibility: individual or project team role responsible for data management; qualifications, certifications, and licenses of responsible parties
• Ethics and privacy: informed consent; protection of privacy; data use agreements; other ethical issues
• Adherence: when adherence to the data management plan will be checked or demonstrated; who is responsible for managing data in the project; who is responsible for checking adherence to the data management plan; auditing procedures and framework
• Value of information assets: project use value; institutional audience and uses; public audience and uses; relation to institutional collection; relation to disciplinary evidence base; cost of re-creating data
28. Approaching Requirement Overlap
• Sanity-check DMP details with lifecycle questions:
– Who wants it?
– What do they need it for?
– When will it be used?
• Be conscious of elements that serve multiple goals or lifecycle stages:
– Metadata/documentation
– Identifiers
– Budgets
– Formats
– IP rights and confidentiality restrictions
– Responsibilities/adherence
• Use tracking tools and methods throughout the lifecycle (see the sketch after this slide)
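As a hedged illustration of that last bullet, the following Python sketch represents DMP elements as a structured checklist so adherence can be checked programmatically rather than by manual inspection. The element names and evidence strings are hypothetical, not a standard schema:

# A hypothetical, minimal machine-checkable DMP: each element records whether it
# has been addressed and where the supporting documentation lives.
dmp = {
    "access_and_sharing":       {"addressed": True,  "evidence": "deposit agreement with archive"},
    "metadata_standards":       {"addressed": True,  "evidence": "DDI 2.5 codebook"},
    "storage_and_backup":       {"addressed": True,  "evidence": "3 replicas, nightly backup"},
    "identifiers_and_citation": {"addressed": False, "evidence": None},
    "budget":                   {"addressed": False, "evidence": None},
}

def check_adherence(plan):
    """Return the list of DMP elements that are still unaddressed."""
    return [name for name, item in plan.items() if not item["addressed"]]

missing = check_adherence(dmp)
if missing:
    print("DMP elements still to address:", ", ".join(missing))
else:
    print("All tracked DMP elements addressed.")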
30. What do we track?
What tools and methods provide technical leverage or incentives for management across lifecycle stages and among actors?
• Identification: identifiers, references, citations
• Provenance: relationship of delivered data to the history of inputs and modifications, and the actors responsible for them; revision control; versioning
• Authenticity: assertions about the provenance of the records
• Respect des fonds: assertions about the original organization of the records
• Chain of custody: assertions about the ownership of the records
• Integrity: assertions about the management of the records; fixity of bits; fixity of semantics
• Auditing: verification of properties & policy compliance
Source: bulleted list of attributes adapted from Moore 2008
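To make the provenance and chain-of-custody bullets concrete, here is a small, hypothetical Python sketch. It is not any particular provenance standard (such as W3C PROV), though it borrows the same ideas: for each derived file it records the inputs, the transformation applied, the responsible agent, and a content hash so each link in the chain can later be verified:

import hashlib, json
from datetime import datetime, timezone

def content_hash(path):
    """Bit-level fingerprint of a file, used to pin each link in the chain."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def provenance_record(output_path, input_paths, activity, agent):
    """A single link in a provenance chain for a derived data file."""
    return {
        "output": {"path": output_path, "sha256": content_hash(output_path)},
        "inputs": [{"path": p, "sha256": content_hash(p)} for p in input_paths],
        "activity": activity,              # e.g. "recode + merge, script v1.2"
        "agent": agent,                    # person or software responsible
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Example with hypothetical file names (uncomment when the files exist):
# record = provenance_record("analysis.csv", ["raw_survey.csv"], "cleaning script clean.py", "M. Altman")
# print(json.dumps(record, indent=2))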
31. Tracking Across Information Lifecycle
(Diagram: the information lifecycle, Creation/Collection, Storage/Ingest, Processing, Internal Sharing, Analysis, External dissemination/publication, Re-use, and Long-term access, annotated with identifiers and with metadata for integrity, provenance, custody, and citation.)
32. Data Citation: a Point of Leverage
• Services
– Identifiers to specific fixed versions of data are needed to establish unambiguous chains of provenance
– Identifiers that can be globally resolved to machine-understandable metadata and to the identified object are needed to build generalized access and analysis services
– Persistence of identifiers is needed to maintain long-term access
• Incentives
– Scholarly credit (intellectual attribution) is a large motivator for many researchers; citation creates an incentive for researchers to publish data
– Scholars also comply with enforceable journal policies; requiring data citation is a light-weight method to make data access policies auditable
– Impact/usage is a motivator for public research funders; data citation provides a foundation for measures of usage and impact
33. Emerging Practices for Data Citation
• Publishers
– OECD iLibrary
– Thomson Reuters Data Citation Index
• Data archives
– Dataverse Network
– Data-PASS
• Harmonization efforts
– DataCite
– NAS BRDI
– ICSU/CODATA
• Discipline specific
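Since DataCite-registered DOIs (like Crossref DOIs) support HTTP content negotiation, a persistent identifier can be resolved directly to machine-readable metadata for building citations and services. A minimal sketch using only the Python standard library, with a placeholder DOI, and assuming the registrar exposes CSL JSON via content negotiation:

import json
import urllib.request

def fetch_citation_metadata(doi):
    """Resolve a DOI to machine-readable (CSL JSON) metadata via content negotiation."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example with a placeholder DOI (substitute a real dataset DOI before running):
# meta = fetch_citation_metadata("10.1234/example-dataset")
# print(meta.get("title"), meta.get("author"), meta.get("publisher"))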
34. Identifier and Citation Use Cases
• Attribution: provide scholarly attribution; provide legal attribution; identify contributors to data
• Discovery: locate data via identifier; locate data integral to an article; locate works related to data (articles, derivatives, sources)
• Verification: associate a work with the version of evidence used; verify fixity of bits; verify fixity of information; verify “authenticity” of the work
• Access: access to a surrogate; on-line access to the object; machine understandability; long-term understandability
• Persistence: does evidence persist as long as the assertions based on it? Is the durability of the evidence transparent?
35. Emerging Principles for Data Citation
• Data citations should be first-class objects for publication: they should appear with citations to other works and be as easy to cite as other works (an illustrative citation follows this slide)
• Citations should persist, and enable access to the fixed version of the data cited, at least as long as the citing work exists
• Citations should support unambiguous attribution of credit to all contributors, possibly through the citation ecosystem
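As a concrete illustration of these principles, the sketch below assembles a citation string that carries contributors, a fixed version, a persistent resolvable identifier, and a fixity fingerprint. Every value is a placeholder, and the ordering of elements is a stylistic assumption rather than a prescribed format.

    # Sketch: a data citation carrying the elements named in the principles above.
    # All field values are placeholders, not a real dataset.
    citation_fields = {
        "authors": "Doe, Jane; Roe, Richard",
        "year": 2012,
        "title": "Example Survey Dataset",
        "distributor": "Example Data Archive",
        "persistent_id": "doi:10.1234/example-dataset",  # globally resolvable identifier
        "version": "V2",                                  # fixed version of the data
        "fingerprint": "UNF:5:placeholderhash==",         # semantic fixity check
    }

    citation = ("{authors} ({year}). {title}. {distributor}. {persistent_id}, "
                "{version} [{fingerprint}]").format(**citation_fields)
    print(citation)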
36. Fixity
• Are files or bitstreams corrupted?
• Do semantics remain the same over time, across formats, software, and analysis systems?
• Some semantic approaches: Universal Numeric Fingerprint (canonicalization); perceptual signatures (characterization of significant properties)
(A minimal fixity sketch follows this slide.)
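The sketch below contrasts bit-level fixity (a hash of the raw bytes) with a simplified semantic fixity check that hashes a canonicalized form of the values, so that the same data re-exported in a different format yields the same fingerprint. The canonicalization shown is a stand-in for approaches such as the Universal Numeric Fingerprint, not the UNF algorithm itself.

    # Sketch: two complementary fixity checks.
    import hashlib

    def bit_fixity(path):
        # Any byte-level corruption of the file changes this digest.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def semantic_fixity(rows, significant_digits=7):
        # Canonicalize values (round floats to a fixed number of significant
        # digits, stringify everything) before hashing, so the fingerprint
        # survives re-encoding as long as the values themselves are unchanged.
        canonical_rows = []
        for row in rows:
            cells = []
            for value in row:
                if isinstance(value, float):
                    cells.append(format(value, ".%dg" % significant_digits))
                else:
                    cells.append(str(value))
            canonical_rows.append(",".join(cells))
        blob = "\n".join(canonical_rows).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()

    # The same values written with different precision give the same fingerprint:
    print(semantic_fixity([[1.0000000001, "a"], [2.5, "b"]]))
    print(semantic_fixity([[1.0, "a"], [2.50, "b"]]))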
37. Audit [aw-dit]: an independent evaluation of records and activities to assess a system of controls.
Fixity mitigates risk only if used for auditing.
38. Example: Functions of Storage Auditing
• Detect corruption or deletion of content
• Verify compliance with storage/replication policies
• Prompt repair actions
39. Audit Design Choices
• Audit regularity and coverage: on demand (manually); on event; randomized sample; scheduled/comprehensive
• Audit procedure, algorithms, certifying authority
• Auditing scope: integrity of object; integrity of collection; integrity of network; policy compliance; public/transparent auditing
• Trust model
• Threat model
(A randomized-sample audit sketch follows this slide.)
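The sketch below illustrates one of the design choices above: a randomized-sample audit of stored content against a manifest of expected digests. The manifest format, directory layout, and sampling fraction are all assumptions for the sake of the example.

    # Sketch: randomized-sample fixity audit against a manifest of expected
    # SHA-256 digests ({relative_path: digest}); the manifest and layout are
    # hypothetical.
    import hashlib
    import os
    import random

    def sha256_of(path):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def audit_sample(root, manifest, sample_fraction=0.1):
        # Check a random sample of files; report missing or corrupted content
        # so that repair actions can be prompted.
        sample_size = max(1, int(len(manifest) * sample_fraction))
        problems = []
        for rel_path in random.sample(list(manifest), sample_size):
            full_path = os.path.join(root, rel_path)
            if not os.path.exists(full_path):
                problems.append((rel_path, "missing"))
            elif sha256_of(full_path) != manifest[rel_path]:
                problems.append((rel_path, "checksum mismatch"))
        return problems

    # Example use (paths and digests are placeholders):
    # for path, problem in audit_sample("/archive", {"data/file1.csv": "ab12..."}):
    #     print("repair needed:", path, problem)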
41. Many Tools, Few Solutions
“Poor carpenters blame their tools” – Proverb
“If all you have is a hammer, everything looks like a nail” – Another proverb
“Ultimately, some people need holes – but no one needs a drill.” – Yet another proverb
• Many scientific tools are embedded in the needs, perspectives, and practices of specific disciplines
• Identify common requirements
• Identify gaps across lifecycle stages and among actors
42. Core Requirements for Data Sharing Infrastructure
• Stakeholder incentives
– recognition; citation; payment; compliance; services
• Dissemination
– access to metadata; documentation; data
• Access control
– authentication; authorization; rights management
• Provenance
– chain of control; verification of metadata, bits, semantic content
• Persistence
– bits; semantic content; use
• Legal protection
– rights management; consent; record keeping; auditing
• Usability
– discovery; deposit; curation; administration; collaboration
• Business model
Sources: King 2007; ICSU 2004; NSB 2005
43. Mind the Gaps
(Lifecycle stages considered: collection, analysis, storage, dissemination, reuse.)
• Scientific Workflow Software (e.g. Taverna)
– Strengths: close integration across the supported lifecycle stages; perceived as a useful service by researchers; high performance
– Gaps: discipline-centric; doesn’t address most storage requirements (replication, access control)
• Storage Grid/VRE (e.g. iRODS)
– Strengths: integration across the supported lifecycle stages; storage is perceived as a useful service by researchers; high performance
– Gaps: loose integration of analysis, insufficient for reproducibility
• Institutional Repository (e.g. DSpace)
– Strengths: low cost; institutional commitment to long-term access
– Gaps: access and discovery mechanisms usually tailored to publications, not data
• Reproducible Publication Systems (e.g. StatWeave)
– Strengths: close integration of analysis and scientific publication; reduces risk of embarrassment when working with “co-authors”; ensures one form of reproducibility (calibration, mechanical replicability)
– Gaps: addresses replication but not reuse for secondary analysis or integration
• “Data Archive”
– Strengths: richer support for reuse; often supports cross-discipline discovery and long-term access
– Gaps: varied models (curated database, “virtual archive”, disciplinary repository); often discipline-centric
45. Automatic Auditing of Data Replication & Integrity Policies
safearchive.org
46. The Distributed Content Replication Problem
• We hold digital assets we wish to preserve
• Many of these assets are not replicated
• Even when replicated, assets remain vulnerable to single points of failure because replicas are managed by a single institution
A partial solution: LOCKSS
• Self-contained open-source software
• Harvests resources via open interfaces
• Replicates content through a secure peer-to-peer protocol
• Self-repairing
• Zero trust
• Used by hundreds of institutions for collaborative preservation
What we needed:
• Auditing – how many replicas exist, where are they, and are they current?
• Policy – prove that replication is consistent with a policy, such as TRAC
• Collaboration – coordinate with partners to replicate content
47. Resilience of Peer-to-Peer with the Accountability of a Centralized System
Facilitating collaborative replication and preservation with cyberinfrastructure:
• Collaborators declare explicit, non-uniform resource commitments
• Policy records and schematizes commitments and desired TRAC replication properties
• Storage layer provides replication, integrity, freshness, and versioning
• SafeArchive software provides monitoring, auditing, transparency, and provisioning
• Content is harvested through HTTP (LOCKSS) or OAI-PMH (a harvest sketch follows this slide)
• Integration of LOCKSS, institutional repositories, and TRAC
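OAI-PMH harvesting, mentioned above, is a simple HTTP protocol. The sketch below lists record identifiers from an OAI-PMH endpoint and follows resumption tokens across result pages; the base URL is a placeholder, and a real harvester would also handle sets, date ranges, and error responses.

    # Sketch: harvest record identifiers from an OAI-PMH endpoint.
    # The endpoint URL in the example is a placeholder.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

    def list_identifiers(base_url, metadata_prefix="oai_dc"):
        # Yield record identifiers, following resumption tokens across pages.
        params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                tree = ET.parse(resp)
            for header in tree.iter(OAI_NS + "header"):
                identifier = header.find(OAI_NS + "identifier")
                if identifier is not None:
                    yield identifier.text
            token = tree.find(".//" + OAI_NS + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            # Later pages are requested with only the resumption token.
            params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}

    # Example (placeholder endpoint):
    # for oai_id in list_identifiers("https://repository.example.edu/oai"):
    #     print(oai_id)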
48. ORCID is an international, interdisciplinary, open, and not-for-profit organization created for the benefit of all stakeholders, including research institutions, funding organizations, publishers, and researchers, to enhance the scientific discovery process and improve collaboration and the efficiency of research funding.
ORCID aims to solve the name ambiguity problem in scholarly communications by creating a registry of persistent unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID, other ID schemes, and research objects such as publications, grants, and patents.
http://orcid.org
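ORCID exposes a public API through which an identifier can be resolved to the researcher's public record. The sketch below is based on the current public API at pub.orcid.org; the versioned path and response structure are assumptions relative to this talk, and the ORCID iD shown is the example iD used in ORCID documentation, not a specific researcher discussed here.

    # Sketch: fetch the public record for an ORCID iD as JSON.
    # The v3.0 API path and the response structure are assumptions; adjust to the
    # API version actually in use.
    import json
    import urllib.request

    def fetch_orcid_record(orcid_id):
        req = urllib.request.Request(
            "https://pub.orcid.org/v3.0/%s/record" % orcid_id,
            headers={"Accept": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    record = fetch_orcid_record("0000-0002-1825-0097")  # documentation example iD
    name = record["person"]["name"]
    print(name["given-names"]["value"], name["family-name"]["value"])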
49. ORCID Launch to Public in October
The ORCID Launch Partners Program includes research institutions, publishers, research funders, data repositories, and third-party providers, such as: the American Physical Society, Aries Systems, Avedas, Boston University, the California Institute of Technology, CrossRef, Elsevier, Faculty of 1000, figshare, Hindawi Publishing Corporation, KNODE, Nature Publishing Group, SafetyLit, Symplectic, Thomson Reuters, Total-Impact, and Wellcome Trust.
At launch, the ORCID Registry will:
• Allow researchers and scholars to register for an ORCID identifier, create ORCID records, and manage their privacy settings
• Contain ORCID records created by universities on behalf of their researchers and scholars
• Allow researchers and scholars to link their ORCID record to external identifiers, including Scopus and ResearcherID
• Facilitate synchronization of ORCID identifier record data with external systems, including Scopus
• Bi-directionally link to a number of author profile and manuscript submission systems, including the American Physical Society, Aries Systems, Hindawi Publishing Corporation, Nature Publishing Group, and ScholarOne Manuscripts
• Allow researchers and scholars to search and upload publication metadata from CrossRef
• (Soon after launch) link to grant application systems
50. Data Management Workflows for Open Access Journals
http://bit.ly/DVNOJS
51. Embed Real Data Archives in Journals
• Embed a remotely managed data archive in an OJS journal
• Replaces “supplemental materials”
• Adds:
– Online analysis
– Independent storage
– Persistent identifiers and citation
– Data versioning
– Enhanced discoverability and interoperability
– Format normalization
– Fixity and replication
52. Integrated Policies, Workflow, Access
• OJS and DVN
– Support workflows
– Enforce policies
– Disseminate content
• Integrate policies for
– Access and data license
– Embargoes
– Citation
• Coordinate
– Submission
– Review
– Publication
• Link
– Content
– Subscriptions & notifications
– Usage Metrics
54. How will we see the geography of science when we reveal how research connects through data?
Research & node layout: Kevin Boyack and Dick Klavans (mapofscience.com); Data: Thomson ISI; Graphics & typography: W. Bradford Paley (didi.com/brad); Commissioned by Katy Börner (scimaps.org)
Seed Magazine, Mar 7, 2007
http://seedmagazine.com/content/article/scientific_method_relationships_among_scientific_paradigms/
55. Summary
• Principled approach to data management
– Follow information through information lifecycle
– Assess stakeholder requirements
– Track management, use, impact across lifecycle
• Data management planning goals
– Orchestrate data for current use
– Protect against disclosure
– Comply with contracts, regulations, law, and policy
– Maximize value of information assets
– Ensure short term and long term dissemination
• Lifecycle data management tracking
– Identification – identifiers, references, citations
– Provenance – relationship of delivered data to history of inputs and modifications and actors responsible for
these
– Authenticity: assertions about the provenance of the records
– Chain of custody: assertions about the ownership of the records
– Integrity: assertions about the management of the records; fixity of bits; fixity of semantics
– Auditing: verification of properties & policy compliance
• Data citation is a key leverage point
– Services: establish provenance; access; long-term preservation
– Incentives: scholarly credit; reproducible research policies; impact/usage analysis
– Data citations should be first-class objects for publication: they should appear with citations to other works and be as easy to cite as other works
56. Additional References
• Buckheit, J. and Donoho, D.L. 1995. WaveLab and reproducible research. In: Antoniadis, A. (ed.), Wavelets and Statistics. New York, NY: Springer, pp. 55–81.
• International Council for Science (ICSU). 2004. ICSU Report of the CSPR Assessment Panel on Scientific Data and Information.
• King, Gary. 2007. An Introduction to the Dataverse Network as an Infrastructure for Data Sharing. Sociological Methods and Research 36.
• Moore, R. 2008. Towards a Theory of Digital Preservation. International Journal of Digital Curation 3(1).
• National Science Board (NSB). 2005. Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century. NSF (NSB-05-40).
This work by Micah Altman (http://micahaltman.com) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Most of the different stakeholders have stronger relationships with, and stakes in, research at different stages, but researchers and research institutions are in the middle: they have a strong stake in most stages. Researchers are more directly concerned with collection, processing, analysis, and dissemination; organizations have a higher stake in internal sharing, re-use, and long-term access.
This section is a more detailed deep-dive into drivers at major stages of the information lifecycle. It is not intended to be part of the main presentation, but could be used to respond to questions or to focus on a particular stage.