On community-standards, FAIR data and
scholarly communication
Susanna-Assunta Sansone, PhD
ORCID: 0000-0001-5306-5690
INSERM Workshop 246 “Management and reuse of health data: methodological issues”, Bordeaux, 14-17 May 2017
Data Consultant,
Founding Academic Editor
Associate Director,
Principal Investigator
www.slideshare.net/SusannaSansone
Source: https://www.dataone.org/best-practices
Simplified research data life cycle
• Available in a public repository
• Findable through some sort of search facility
• Retrievable in a standard format
• Self-describing so that third parties can make sense of it
• The product of careful planning, organization and stewardship
• Intended to outlive the experiment for which they were
collected
To do better science, more efficiently
we need data that are…
Key problem: low findability and understandability
• Not always well cited and stored
o True for data as well as for any other digital asset
• Poorly described for third party reuse
o Different levels of detail and annotation
• Reporting and annotation activities are perceived as time
consuming
o Often rushed and minimally done
We need content or reporting standards
• To harmonize the datasets with respect to the structure
and level of annotation of their:
§ experimental components (e.g., design, conditions, parameters),
§ fundamental biological entities (e.g., samples, genes, cells),
§ complex concepts (such as bioprocesses, tissues, diseases),
§ analytical process and the mathematical models, and
§ their instantiation in computational simulations (from the
molecular level through to whole populations of individuals)
Minimum information reporting
requirements, checklists
o Report the same core, essential
information
o e.g. MIAME guidelines
Controlled vocabularies, taxonomies, thesauri,
ontologies etc.
o Unambiguous identification and definition of
concepts
o e.g. Gene Ontology
Conceptual model, schema,
exchange formats etc
o Define the structure and
interrelation of information, and
the transmission format
o e.g. FASTA
Formats Terminologies Guidelines
Types of content standards
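As a rough illustration of what a minimum information checklist does in practice, the sketch below checks a metadata record against a list of required fields. The field names and the helper `missing_fields` are hypothetical placeholders chosen for this example; they are not taken from MIAME or any other real guideline.

```python
# Hypothetical checklist: these field names are illustrative only and are
# NOT the actual requirements of MIAME or any other guideline.
REQUIRED_FIELDS = [
    "experimental_design",
    "sample_description",
    "protocol",
    "raw_data_location",
]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a metadata record."""
    missing = []
    for field in required:
        value = record.get(field)
        # Treat a missing key, None, or a blank string as "not reported".
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# An invented record with one empty and one absent field.
record = {
    "experimental_design": "time course",
    "sample_description": "mouse liver",
    "protocol": "",
}

print(missing_fields(record))
```

A terminology check (is each value drawn from a controlled vocabulary?) and a format check (does the file parse?) would be separate steps on top of this kind of completeness check.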
de jure: standard organizations
de facto: grass-roots groups
Nanotechnology Working Group
Formats Terminologies Guidelines
Community-driven efforts, just a few examples
Formats Terminologies Guidelines
224
115
500+
Guidelines: MIAME, MIRIAM, MIQAS, MIX, MIGEN, ARRIVE, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…
Formats: SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA, CML, MITAB…
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…
Content standards in numbers
How to discover the ‘right’ standards for your data?
A web-based, curated and searchable portal that monitors the development and
evolution of standards, their use in databases and the adoption of both in data
policies, to inform and educate the user community
Data policies by
funders, journals and
other organizations
Content standards
Formats Terminologies Guidelines
Map this complex and evolving landscape
Databases
All records are manually curated in-house
and verified by the community behind each resource
Using indicators to describe ‘status’
• Ready for use, implementation, or recommendation
• In development
• Status uncertain
• Deprecated as subsumed or superseded
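The four status indicators above lend themselves to a simple machine-readable encoding. The following is a minimal sketch, assuming a registry that stores one status per record; the `Status` enum, the `recommendable` helper and the example records are all invented for illustration and do not reflect the portal's actual data model.

```python
from enum import Enum


class Status(Enum):
    """The four status indicators listed on the slide."""
    READY = "Ready for use, implementation, or recommendation"
    IN_DEVELOPMENT = "In development"
    UNCERTAIN = "Status uncertain"
    DEPRECATED = "Deprecated as subsumed or superseded"


# Hypothetical registry entries: names and statuses are made up.
records = [
    {"name": "Example Format A", "status": Status.READY},
    {"name": "Example Guideline B", "status": Status.IN_DEVELOPMENT},
    {"name": "Example Terminology C", "status": Status.DEPRECATED},
    {"name": "Example Format D", "status": Status.READY},
]


def recommendable(entries):
    """Return the names of the entries that are ready to be recommended."""
    return [e["name"] for e in entries if e["status"] is Status.READY]


print(recommendable(records))
```

A data policy author could use a filter of this kind to avoid recommending standards that are still in development or already superseded.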
Understanding how standards are used
Formats
Guideline
Formats
Terminology
Using indicators to indicate ‘adoption’
Standard developing groups:
Journals, publishers:
Cross-links, data exchange:
Societies and organisations:
Institutional RDM services:
Projects, programmes:
Technologically-delineated
views of the world
Biologically-delineated
views of the world
Generic features (‘common core’)
- description of source biomaterial
- experimental design components
[Figure: technologies (arrays, scanning, columns, gels, MS, FTIR, NMR) set against domains (transcriptomics, proteomics, metabolomics, plant biology, epidemiology, microbiology)]
Duplications & lack of interoperability among standards
Hard to use them in combinations, e.g. to represent:
Proteomics-based gut microbiota profiling
Proteomics and metabolomics based gut
microbiota profiling
Enhancing modularization
Proteomics-based gut microbiota profiling
Proteomics and metabolomics based gut
microbiota profiling
[Diagram: two reporting guidelines, MINSEQE and MIMARKS (type biosharing:ReportingGuideline; identifiers bsg-000174 and bsg-000161), are broken down into elements such as sample information, sample identifier, taxonomy identifier, sequence read and geo location. Three layers are shown:
• High-level information about the metadata standards
• Representations of the standards elements
• Template elements (el-000001, el-000002, el-000003), with provenance: MINSEQE; MINSEQE and MIMARKS; MIMARKS]
Serve machine-readable content metadata standards, providing provenance for
their elements, rendering standards invisible to the researchers
Inform the creation of metadata templates
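One way to picture how machine-readable elements with provenance could inform a metadata template is sketched below. The element identifiers and the standard names come from the slide; the pairing of identifiers with labels and provenance, the `Element` class and the `build_template` function are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    element_id: str
    label: str
    provenance: tuple  # names of the standards that define this element


# ASSUMPTION: the slide does not state which label or provenance belongs to
# which identifier; this pairing is illustrative only.
ELEMENTS = [
    Element("el-000001", "sequence read", ("MINSEQE",)),
    Element("el-000002", "sample identifier", ("MINSEQE", "MIMARKS")),
    Element("el-000003", "geo location", ("MIMARKS",)),
]


def build_template(standards, elements=ELEMENTS):
    """Collect, once each, the elements defined by any of the chosen standards."""
    wanted = set(standards)
    return [e for e in elements if wanted.intersection(e.provenance)]


# A template for a study that must satisfy both guidelines: the shared
# element appears once, and its provenance records both sources.
for element in build_template(["MINSEQE", "MIMARKS"]):
    print(element.element_id, element.label, "from", " and ".join(element.provenance))
```

Because shared elements are stored once with multiple provenance entries, a researcher filling in the template never needs to know which guideline each field came from, which is the sense in which the standards become invisible.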
How to discover the datasets relevant to your work?
OmicsDI: Nature Biotechnology 35, 406–409 (2017) doi:10.1038/nbt.3790
omicsdi.org
datamed.org
DataMed: bioRxiv 094888; https://doi.org/10.1101/094888 Nature Genetics (in press)
DATS: bioRxiv 103143; https://doi.org/10.1101/103143 Scientific Data (in press)
• Discoverability and reusability
o Complementing community
databases
• Incentive, credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis
Growing number of data papers and data journals, e.g.:
nature.com/scientificdata
Honorary Academic Editor
Susanna-Assunta Sansone, PhD
Managing Editor
Andrew L Hufton, PhD
Editorial Curator
Varsha Khodiyar
Publisher
Iain Hrynaszkiewicz
A new open-access, online-only publication for
descriptions of scientifically valuable datasets
Supported by
• A peer reviewed description of data, to maximize usage
• Citable publications that give credit for reusable data
• Requires data deposition in the appropriate repository(ies)
• Complementary to traditional article(s), with which it may or may not be associated
New article type
Research
papers
Data
records
Data
Descriptors
Value added component – complementing
articles and repositories
• Title
• Abstract
• Background & Summary
• Methods
• Data Records
• Technical Validation
• Usage Notes
• Figures & Tables
• References
• Data Citations
• following the Joint Declaration of Data Citation Principles
Detailed description of the methods and
technical analyses supporting the
quality of the measurements;
no scientific hypotheses
Article structure
Focus on data peer review
• Completeness = can others reproduce?
• Consistency = were community standards followed?
• Integrity = are data in the best repository?
• Experimental rigour, technical quality = were the methods sound?
Does not focus on perceived impact, importance, size, complexity of data
Credit for data producers, data managers/curators etc.
Credit to: Varsha Khodiyar
“The Data Descriptor made it easier to use
the data, for me it was critical that everything
was there…all the technical details like voxel
size.”
Professor Daniele Marinazzo
Credit to: Varsha Khodiyar
Data (re)use made easier
Decades
old dataset
Aggregated or
curated data
resources
Computationally
produced data
products
Large
consortium
dataset
Data from a
single
experiment
Data that YOU
find valuable
and that others
might find
useful too
Data associated
with a high impact
analysis article
What makes a good ?
Experimental metadata or
structured component
(in-house curated, machine-
readable formats)
Article or
narrative component
(PDF and HTML)
Data Descriptors have two components
The Data Curation Editor is responsible for creating and
curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository
records
Curation and discoverability
Created with the input of the
authors, includes value-added
semantic annotation of the
experimental metadata
analysis
method
script
Data file or
record in a
database
Data Descriptors: structured component
Complementary roles of ISA and
nanopublications
From Peer-Reviewed to Peer-Reproduced in Scholarly Publishing: The Complementary Roles
of Data Models and Workflows in Bioinformatics. https://doi.org/10.1371/journal.pone.0127612
PloS ONE (2015)
The (long) road to FAIR
Responsibilities lie across several stakeholder groups
Understand the benefits of sharing
FAIR datasets and enact them
Engage and assist researchers to
enable them to share FAIR datasets
Release or endorse practices
and policies, but also incentive
and credit mechanisms for
researchers, curators and
developers
“As Data Science culture grows,
digital research outputs (such as
data, computational analysis and
software) are being established as
first-class citizens.
This cultural shift is required to go
one step further: to recognize
interoperability standards as digital
objects in their own right, with their
associated research, development
and educational activities”.
Sansone, Susanna-Assunta; Rocca-Serra, Philippe (2016).
Interoperability Standards - Digital Objects in Their Own
Right. Wellcome Trust.
https://dx.doi.org/10.6084/m9.figshare.4055496.v1
Philippe
Rocca-Serra, PhD
Senior Research Lecturer
Alejandra
Gonzalez-Beltran, PhD
Research Lecturer
Milo
Thurston, DPhil
Research Software Engineer
Massimiliano
Izzo, PhD
Research Software Engineer
Peter
McQuilton, PhD
Knowledge Engineer
Allyson
Lister, PhD
Knowledge Engineer
Eamonn
Maguire, DPhil
Contractor
David
Johnson, PhD
Research Software Engineer
Melanie
Adekale, PhD
Biocurator Contractor
Delphine
Dauga, PhD
Biocurator Contractor
We work with and for
to make data and other
digital research assets
Susanna-Assunta Sansone, PhD
Principal Investigator, Associate Director
and Data Consultant for Springer Nature
enabling open science,
driving science and discoveries
