Data Publishing Models
Sünje Dallmeier-Tiessen, PhD
CERN, Harvard University
For the RDA-WDS Data Publishing Workflow Group
June 9th, 2015
Topics
• What is data publishing
• Why do we care about it (today)
• Models in data publishing
• Building blocks
• Information gathered through trusted data publishing
• Relevance and conclusions for today’s workshop
This is work conducted by the RDA-WDS group on data
publishing workflows, chaired in collaboration with Fiona
Murphy and Theo Bloom.
Data Publishing
… describes the process of making research data and
other research objects available on the web so that they
can be discovered and referred to in a unique and
persistent way.
At its best, data publishing takes place through dedicated
data repositories and data journals and ensures that the
published research objects are well documented, curated,
archived for the long term, interoperable, citable and
quality assured.
Thus, they are reusable and discoverable in the long
term.
Examples
Analysis elements
• Discipline, responsible units (i.e. their roles)
• Function of workflow
• PID assignment: DOI, ARK, etc.
• Peer review of data (e.g. by researcher & editorial review)
• Curatorial review of metadata (e.g. by institutional or subject repository?)
• Technical review & checks (e.g. for data integrity at repository upon
ingestion)
• Formats covered
• Persons/Roles involved, e.g. editor, publisher, data repository manager,
etc.
• Links to additional data products (data paper; review documents; other
journal articles) or “stand-alone” product
• Links to grants, usage of author PIDs
• Discoverability: Indexing of the data -- if yes, where?
• Data citation facilitated
• Data life cycle reference
• Standards compliance
Repository’s perspective
Simplified generic repository workflow
[Figure: Producer → Data Deposit → Ingest → Quality Assurance → Data Management → LT Archiving → Dissemination → Access → Consumer/Reuse]
• Researcher with a central role during submission/deposition
• Review/QA mainly internal through dedicated curation personnel
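As a rough illustration of the simplified generic repository workflow, the stages can be modelled as an ordered sequence that a dataset passes through. This is only a sketch: the stage names come from the slide, while the strictly linear order, the `Dataset` class and the `advance` and `run_workflow` helpers are assumptions made for this example and do not describe any particular repository system.

```python
from dataclasses import dataclass, field

# Stage names as shown on the slide; treating them as a strictly
# linear sequence is an assumption made for this sketch.
STAGES = [
    "Data Deposit",
    "Ingest",
    "Quality Assurance",
    "Data Management",
    "LT Archiving",
    "Dissemination",
    "Access",
]


@dataclass
class Dataset:
    """Hypothetical dataset that records the stages it has passed."""
    title: str
    history: list = field(default_factory=list)

    @property
    def current_stage(self):
        # None until the dataset has been deposited.
        return self.history[-1] if self.history else None


def advance(dataset):
    """Move the dataset to the next stage and return that stage's name."""
    next_index = len(dataset.history)
    if next_index >= len(STAGES):
        raise ValueError("workflow already complete")
    dataset.history.append(STAGES[next_index])
    return dataset.current_stage


def run_workflow(dataset):
    """Advance the dataset through all remaining stages."""
    while len(dataset.history) < len(STAGES):
        advance(dataset)
    return dataset.history
```

In practice stages such as quality assurance may loop back to the producer, so a real workflow is unlikely to be this linear.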
Disciplinary repository example
Content provided by M. Stockhause
[Figure: workflow diagram from Producer to Consumer. Stages shown: Data Deposit, Ingest, Quality Assurance (light), Data Management, LT Archiving, Dissemination, Access, ending at Consumer (disciplinary); a second Ingest and Quality Assurance (detailed) step, followed by Dissemination and Access, ends at Consumer (interdisciplinary).]
Project Repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against intl. and project standards
• Tendency that the project data never become stable
• Currently no PIDs assigned or reserved but Handles planned
Long-term Archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data is a “snapshot” of the project data at a certain time
• DOIs assigned to data collections
Lessons learnt and questions
• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific
repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: Versioning
Publisher’s perspective
Simplified generic publisher workflow
[Figure: workflow diagram from Producer to Consumer/Reuse. Stages shown: Article preparation, Data Submission, Article submission, Peer Review Process, Editing, Publishing, with a link to Data repositories.]
Researcher takes over several roles: submitter, reviewer,
editor potentially?
• Article/data container
• Separate article and datasets
Example Workflows in Dataverse:
Connect Data to Journals
A. Journals include Dataverse as a Recommended Repository
B. Authors Contribute Directly to a Journal’s Dataverse
C. Automated Integration of Journal + Dataverse (e.g., OJS)
Slide by Eleni Castro
Example: Dryad repository integrated
with journals
Slide by T. Bloom
Data publishing building blocks
Basic published product:
• Primary data entry with PID
– Repository entry
– Metadata
– Curation
Add-ons: workflows for more documentation, QA, visibility:
• Parallel data description
– Data Paper or link to it
– Link to results paper
• Linked and published quality assurance
– Curation, Editing process
– Peer review
– Any kind of QA process
• Additional visibility
– Push to ORCID, author pages, impact/reputation building tools
– Enable index (Data citation index, crawled by Google)
Trusted data publishing contains:
• Standardized information about the data
– Disciplinary standards
– Basic common metadata sets
• Distinct Roles, Workflows and Responsibilities
– Authorship, Submission
– Curation
– Quality Assurance
– Peer review
• Persistent Identification
– Permanent reference
– Data citation
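To make the link between a basic common metadata set, persistent identification and data citation more concrete, the following minimal sketch builds a citation string from a small record. The field names, the citation layout and the `missing_fields` helper are assumptions chosen for illustration, not a prescribed standard, and the identifier in the usage example is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRecord:
    """Hypothetical minimal metadata set for a published dataset."""
    creators: tuple   # e.g. ("Doe, J.",)
    year: int
    title: str
    publisher: str    # typically the repository
    identifier: str   # persistent identifier such as a DOI

    def citation(self):
        # One possible citation layout; communities differ on the details.
        names = "; ".join(self.creators)
        return f"{names} ({self.year}): {self.title}. {self.publisher}. {self.identifier}"


def missing_fields(record):
    """Return the names of fields that are empty, in declaration order."""
    return [name for name, value in vars(record).items() if not value]


# Usage with placeholder values (not a real dataset or DOI).
record = DataRecord(
    creators=("Doe, J.",),
    year=2015,
    title="Example dataset",
    publisher="Example Repository",
    identifier="doi:10.1234/example",
)
print(record.citation())
```

A record with an empty identifier would be flagged by `missing_fields`, reflecting the point that a permanent reference is a precondition for data citation.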
Challenges
• Interoperability challenges
– Different metadata schemas
– Rich vs. limited metadata
• Discoverability challenges
– E.g. no bi-directional linking
– Usability challenges in aggregators
• Metrics and accreditation
• What information is needed for future
reuse/remix/reproducibility
• How can this information be exposed – human
and machine readable
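The interoperability challenge of different metadata schemas and rich vs. limited metadata can be illustrated with a small crosswalk. This is a sketch under stated assumptions: both sets of field names and the `convert` helper are hypothetical and are not taken from any real metadata standard.

```python
# Hypothetical mapping from a rich source schema to a limited target
# schema; the field names on both sides are invented for this example.
CROSSWALK = {
    "title": "name",
    "creator": "author",
    "identifier": "pid",
}


def convert(rich_record, crosswalk=CROSSWALK):
    """Map a rich record to the limited schema.

    Returns the converted record and a sorted list of source fields
    that have no counterpart in the target schema and are lost.
    """
    converted = {}
    dropped = []
    for key, value in rich_record.items():
        if key in crosswalk:
            converted[crosswalk[key]] = value
        else:
            dropped.append(key)
    return converted, sorted(dropped)
```

Reporting the dropped fields makes the information loss visible when a rich disciplinary record is passed to an aggregator that only supports a limited metadata set.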
Thank you!
Data Publishing Workflows
Activities and processes in a digital environment
that lead to the publication of research data and
other research objects on the Web. These
activities may be performed by humans or in an
automated fashion.
In contrast to the interim or final published
products, workflows are the means to curate,
document, peer review and thus ensure and
enhance the value of the published product.
