Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
On community-standards, data curation and scholarly communication - BITS, Italy, 2016
1. On community-standards, data curation and
scholarly communication
Susanna-Assunta Sansone, PhD
@SusannaASansone
13th Annual Meeting of the Bioinformatics Italian Society, University of Salerno, Italy, 15-17 June 2016.
Data Consultant,
Founding Academic Editor
Associate Director,
Principal Investigator
Member,
Executive Committee
2. • Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
3. Research as a Connected Digital Enterprise aka The Commons
The vision - P. Bourne (NIH Associate Director for Data Science)
• Researcher X is automatically made aware of researcher Y through commonalities
in their respective data located in the Commons.
• Researcher X locates researcher Y’s data sets with their associated usage
statistics, navigates to the associated publications and starts to explore various
ideas to engage with researcher Y and their research network.
• A fruitful collaboration ensues and they generate publications, data sets and
software; their output is captured in PubMed and the Commons, and is indexed by
the data and software catalogs.
• Company Z identifies relevant data and software that, based on the metrics from
the catalogs, have utilization above a threshold indicating that those data and
software are heavily utilized by the community. An open-source version remains, but
the company adds services on top of the software, and revenue flows back to the
labs of researchers X and Y, which is used to develop new innovative software for
open distribution.
• Researchers X and Y provide hands-on advice on the use of their new version and
their course is offered as a MOOC (Massive Open Online Course).
https://datascience.nih.gov/commons
10. A Data Discovery Index prototype that:
• Helps users find and access shared data
• Interoperates in the NIH Commons
“Over 50% of completed studies in biomedicine do not
appear in the published literature… often because
results do not conform to authors’ hypotheses”
“Only half the health-related studies funded by the
European Union between 1998 and 2006 - an
expenditure of €6 billion - led to identifiable reports”
Selective reporting is still an unfortunate practice
• Small independent efforts, yielding a rich variety of specialty data sets
o Most of these data (such as null findings) are unpublished
o These dark data hold a potential wealth of knowledge
17. • Researchers still lack sufficient motivation
• Hypothesis-confirming results get prioritized
• Agreements, disagreements and timing
• Loose requirements and monitoring by journals and
funders
But why?
18. • Most researchers are
sharing data, and using the
data of others
• Direct contact* between
researchers (on request) is
a common way of sharing
data
• Repositories are the
second most common
method of sharing
Kratz JE, Strasser C (2015) Researcher Perspectives on Publication and Peer Review of Data. PLoS ONE 10(2): e0117619.
Current approaches to sharing
* Data associated with published works disappears at a rate of ~17% per year (Vines et al. 2014, doi:10.1016/j.cub.2013.11.014)
Datasets not referenced in a manuscript are essentially invisible, and data producers do not get appropriate credit for their work
19. • Outputs are multi-dimensional, not always well cited or stored
o Software, code and workflows are hard(er) to get hold of
• Poorly described for third-party reuse
o Different levels of detail and annotation
• Curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and
experimental steps is done/rushed at the publication stage
Shared data is not always understandable or reusable
20. S1Sh.cuo
     A          B        C
 1              Group1   Group2
 2   Day 0
 3   Sodium     139      142
 4   Potassium  3.3      4.8
 5   Chloride   100      108
 6   BUN        18       18
 7   Creatine   1.2      1.2
 8   Uric acid  5.5*     6.2*
 9   Day 7
10   Sodium     140      146
11   Potassium  3.4      5.1
12   Chloride   97       108
Sharing starts with good metadata…
Credit to: Iain Hrynaszkiewicz
21. The same spreadsheet (S1Sh.cuo), annotated:
• Meaningless column titles
• Special characters can cause text mining errors
• No units
• Unhelpful document name
• Undefined abbreviation
• Formatting for information that should be in metadata
…but this is not!
22. Table_S1_Shanghai_blood.xls
     A           B     C        D        E      F
 1   Parameter   Day   Control  Treated  Units  P
 2   Sodium      0     139      142      mEq/l  0.82
 3   Sodium      7     140      146      mEq/l  0.70
 4   Sodium      14    140      158      mEq/l  0.03
 5   Sodium      21    143      160      mEq/l  0.02
 6   Potassium   0     3.3      4.8      mEq/l  0.06
 7   Potassium   7     3.4      5.1      mEq/l  0.07
 8   Potassium   14    3.7      4.7      mEq/l  0.10
 9   Potassium   21    3.1      3.6      mEq/l  0.52
10   Chloride    0     100      108      mEq/l  0.56
11   Chloride    7     97       108      mEq/l  0.68
12   Chloride    14    101      106      mEq/l  0.79
…this is much clearer!
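The cleaned-up layout above is essentially tidy, long-format data. As a minimal sketch of the reshaping step, here is a Python fragment using only the values shown in the example; the variable and column names are invented for illustration:

```python
# Hypothetical sketch: reshaping the "bad" wide spreadsheet layout into the
# tidy long format shown above. Values are taken from the slide example.

# Wide layout: group columns per day, no units, ambiguous headers.
wide = {
    ("Sodium", 0): (139, 142),     # (control, treated)
    ("Sodium", 7): (140, 146),
    ("Potassium", 0): (3.3, 4.8),
    ("Potassium", 7): (3.4, 5.1),
}

UNITS = {"Sodium": "mEq/l", "Potassium": "mEq/l"}  # units made explicit

# Long layout: one row per parameter/day, with units spelled out.
rows = [
    {"Parameter": p, "Day": d, "Control": c, "Treated": t, "Units": UNITS[p]}
    for (p, d), (c, t) in sorted(wide.items())
]

for r in rows:
    print(r)
```

The same reshaping is often done with a spreadsheet pivot or a data-frame "melt" operation; the point is that every observation carries its full context on its own row.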
24. The International Conference on Systems Biology (ICSB), 22-28 August 2008, Susanna-Assunta Sansone, www.ebi.ac.uk/net-project
…breadth and depth of the
context are pivotal…
…including capturing
experimental design and
statistical analysis
25. Among these, publishers occupy a
leverage point, because of the importance of
formal publications in the academic
incentive structure
Stakeholder mobilization, old and new driving forces
26. • Incentives and credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis
• Discoverability and reusability
o Complementing community
databases
Growing number of data papers and data journals
27. nature.com/scientificdata
Honorary Academic Editor
Susanna-Assunta Sansone, PhD
Managing Editor
Andrew L Hufton, PhD
Editorial Curator
Varsha Khodiyar
Publisher
Iain Hrynaszkiewicz
A new open-access, online-only publication for
descriptions of scientifically valuable datasets
Supported by
28. A new article type
A new category of publication that provides detailed
descriptors of scientifically valuable datasets
Mandates open data, without unnecessary
restrictions, as a condition of submission
30. Scientific hypotheses:
Synthesis
Analysis
Conclusions
Methods and technical analyses supporting the quality
of the measurements:
What did I do to generate the data?
How was the data processed?
Where is the data?
Who did what, and when
Relation with traditional articles – content
32. Experimental metadata or
structured component
(in-house curated, machine-
readable formats)
Article or
narrative component
(PDF and HTML)
Data Descriptors have two components
33. The Data Curation Editor is responsible for creating and
curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository
records
Curation and discoverability
34. Created with the input of the
authors, includes value-added
semantic annotation of the
experimental metadata
[Diagram: analysis method and script linked to the data file or record in a database]
Data Descriptors: structured component
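As a rough illustration of what a machine-readable structured component captures, the sketch below shows structured experimental metadata as JSON. The field names and identifiers are hypothetical, not the journal's actual schema (Scientific Data's structured component is curated in established formats):

```python
import json

# Illustrative only: field names and the accession below are hypothetical.
# The point is that design, samples, assays and data-file links are captured
# as structured, machine-readable metadata rather than free-text prose.
descriptor = {
    "title": "Example Data Descriptor",
    "study_design": ["case-control"],              # annotated with ontology terms
    "samples": [{"name": "patient-01", "organism": "Homo sapiens"}],
    "assays": [{
        "measurement": "transcription profiling",
        "technology": "RNA-Seq",
        "data_files": ["run01.fastq.gz"],
        "repository_record": "example-accession",  # link to the deposited data
    }],
}

print(json.dumps(descriptor, indent=2))
```

Because the component is structured, it can be indexed, searched, and linked to repository records programmatically, which is what enables the browsing and cross-linking described above.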
39. “The Data Descriptor made it easier to use
the data, for me it was critical that everything
was there…all the technical details like voxel
size.”
Professor Daniele Marinazzo
Why data papers? Data reuse is easier!
Credit to: Varsha Khodiyar
41. • Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
42. [Figure: community content standards arise de facto from grass-roots groups, such as the Nanotechnology Working Group, and de jure from standards organizations]
• To structure, enrich and report the description of the datasets and the
experimental context under which they were produced
• To facilitate discovery, sharing, understanding and reuse of datasets
Community-developed content standards
43.
Content standards as enablers for better-described data
Including minimum information reporting
requirements, or checklists, to report the
same core, essential information
Including controlled vocabularies, taxonomies,
thesauri, ontologies etc. to use the same word
and refer to the same ‘thing’
Including conceptual models and schemas
from which an exchange format is derived,
to allow data to flow from
one system to another
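The checklist idea can be sketched in a few lines of code: a metadata record is tested against the core fields a reporting guideline requires. This is a hypothetical minimal-information check; the field names are illustrative and not drawn from any specific standard:

```python
# Hypothetical minimum-information checklist: the required field names below
# are invented for illustration, not taken from a real reporting guideline.
CHECKLIST = {"organism", "sample_type", "assay_type", "protocol", "units"}

def missing_fields(record: dict) -> set:
    """Return the required fields a metadata record fails to report."""
    return {f for f in CHECKLIST if not record.get(f)}

record = {"organism": "Mus musculus", "assay_type": "metabolite profiling"}
print(missing_fields(record))  # the fields still to be reported
```

Real checklists are richer (conditional fields, ontology-constrained values), but the principle is the same: everyone reports the same core, essential information.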
45. • Is there a database, implementing standards, where I can deposit my
metagenomics dataset?
• My funder’s data-sharing policy recommends the use of established standards,
but which ones are widely endorsed and applicable to my toxicological and
clinical data?
• Am I using the most up-to-date version of this terminology to annotate
cell-based assays?
• I understand this format has been deprecated; what has it been replaced by,
and who is leading the work?
• Are there databases implementing this exchange format, whose development
we have funded?
• What are the mature standards and standards-compliant databases we should
recommend to our authors?
But how do we help users to make informed decisions?
46. A web-based, curated and searchable registry ensuring that
standards and databases are registered, informative and
discoverable; monitoring development and evolution of standards,
their use in databases and adoption of both in data policies
An informative and educational resource
1,400 records and growing
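A toy sketch of the kind of filtered lookup such a registry enables; the records and field names below are invented for this example (MIxS and OBI stand in for real registered standards), and this is not the registry's actual data model or API:

```python
# Hypothetical in-memory registry: a handful of records with invented fields.
RECORDS = [
    {"name": "MIxS", "type": "reporting guideline",
     "domain": "metagenomics", "status": "active"},
    {"name": "OBI", "type": "terminology artifact",
     "domain": "assays", "status": "active"},
    {"name": "Old-Format", "type": "model/format",
     "domain": "proteomics", "status": "deprecated"},
]

def search(records, **filters):
    """Return records whose fields match all of the given filter values."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

print(search(RECORDS, domain="metagenomics", status="active"))
```

This is exactly the decision support the questions on the previous slide call for: filter by domain, record type, and maturity status before recommending a standard or database.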
48. Tracking evolution, e.g. deprecations and substitutions
49. [Figure: a model/format formalizes a reporting guideline; a reporting guideline is used by a model/format]
Cross-linking standards to standards and databases
57. Philippe
Rocca-Serra, PhD
Senior Research Lecturer
Alejandra
Gonzalez-Beltran, PhD
Research Lecturer
Milo
Thurston, DPhil
Research Software Engineer
Massimiliano
Izzo, PhD
Research Software Engineer
Peter
McQuilton, PhD
Knowledge Engineer
Allyson
Lister, PhD
Knowledge Engineer
Eamonn
Maguire, DPhil
Software Engineer contractor
David
Johnson, PhD
Research Software Engineer
Susanna-Assunta Sansone, PhD
Principal Investigator, Associate Director
We also acknowledge our network of collaborators
in the following active projects: H2020 PhenoMeNal,
H2020 ELIXIR-EXCELERATE, H2020 MultiMot,
NIH bioCADDIE, NIH CEDAR and IMI eTRIKS