Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
On community-standards, data curation and scholarly communication - BITS, Italy, 2016
1. On community-standards, data curation and
scholarly communication
Susanna-Assunta Sansone, PhD
@SusannaASansone
13th Annual Meeting of the Bioinformatics Italian Society, University of Salerno, Italy, 15-17 June 2016.
Data Consultant,
Founding Academic Editor
Associate Director,
Principal Investigator
Member,
Executive Committee
2. • Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
3. Research as a Connected Digital Enterprise aka The Commons
The vision - P. Bourne (NIH Associate Director for Data Science)
• Researcher X is automatically made aware of researcher Y through commonalities
in their respective data located in the Commons.
• Researcher X locates researcher Y’s data sets with their associated usage
statistics, navigates to the associated publications and starts to explore various
ideas to engage with researcher Y and their research network.
• A fruitful collaboration ensues and they generate publications, data sets and
software; their output is captured in PubMed and the Commons, and is indexed by
the data and software catalogs.
• Company Z identifies relevant data and software that, based on the metrics from
the catalogs, have utilization above a threshold indicating that those data and
software are heavily utilized by the community. An open-source version remains, but
the company adds services on top of the software, and revenue flows back to the
labs of researchers X and Y, which is used to develop new innovative software for
open distribution.
• Researchers X and Y provide hands-on advice on the use of their new version and
their course is offered as a MOOC (Massive Open Online Course).
https://datascience.nih.gov/commons
10. A Data Discovery Index prototype that:
• Helps users find and access shared data
• Interoperates in the NIH Commons
“Over 50% of completed studies in biomedicine do not
appear in the published literature… often because
results do not conform to authors’ hypotheses”
“Only half the health-related studies funded by the
European Union between 1998 and 2006 - an
expenditure of €6 billion - led to identifiable reports”
Selective reporting is still an unfortunate practice
• Small independent efforts, yielding a rich variety of specialty data sets
o Most of these data (such as null findings) are unpublished
o These dark data hold a potential wealth of knowledge
17. • Researchers still lack sufficient motivation
• Hypothesis-confirming results get prioritized
• Agreements, disagreements and timing
• Loose requirements and monitoring by journals and
funders
But why?
18. • Most researchers are
sharing data, and using the
data of others
• Direct contact* between
researchers (on request) is
a common way of sharing
data
• Repositories are the
second most common
method of sharing
Kratz JE, Strasser C (2015) Researcher Perspectives on Publication and Peer Review of Data. PLoS ONE 10(2): e0117619.
Current approaches to sharing
* Data associated with published works disappears at a rate of ~17% per year (Vines et al. 2014, doi:10.1016/j.cub.2013.11.014)
Datasets not referenced in a manuscript are essentially invisible, and data producers do not get appropriate credit for their work
19. • Outputs are multi-dimensional, not always well cited or stored
o Software, code and workflows are hard(er) to get hold of
• Poorly described for third-party reuse
o Different levels of detail and annotation
• Curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and
experimental steps is done/rushed at the publication stage
Shared data is not always understandable or reusable
20. S1Sh.cuo
     A          B        C
 1              Group1   Group2
 2   Day 0
 3   Sodium     139      142
 4   Potassium  3.3      4.8
 5   Chloride   100      108
 6   BUN        18       18
 7   Creatine   1.2      1.2
 8   Uric acid  5.5*     6.2*
 9   Day 7
10   Sodium     140      146
11   Potassium  3.4      5.1
12   Chloride   97       108
Sharing starts with good metadata…
Credit to: Iain Hrynaszkiewicz
21. The same spreadsheet (S1Sh.cuo), annotated:
• Meaningless column titles
• Special characters can cause text mining errors
• No units
• Unhelpful document name
• Undefined abbreviation
• Formatting for information that should be in metadata
…but this is not!
22. Table_S1_Shanghai_blood.xls
     A           B     C        D        E      F
 1   Parameter   Day   Control  Treated  Units  P
 2   Sodium      0     139      142      mEq/l  0.82
 3   Sodium      7     140      146      mEq/l  0.70
 4   Sodium      14    140      158      mEq/l  0.03
 5   Sodium      21    143      160      mEq/l  0.02
 6   Potassium   0     3.3      4.8      mEq/l  0.06
 7   Potassium   7     3.4      5.1      mEq/l  0.07
 8   Potassium   14    3.7      4.7      mEq/l  0.10
 9   Potassium   21    3.1      3.6      mEq/l  0.52
10   Chloride    0     100      108      mEq/l  0.56
11   Chloride    7     97       108      mEq/l  0.68
12   Chloride    14    101      106      mEq/l  0.79
…this is much clearer!
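The cleaned-up layout above is essentially tidy, long-format data. As a minimal sketch of the reshaping step, here is a Python fragment using only the values shown in the example; the variable and column names are invented for illustration:

```python
# Hypothetical sketch: reshaping the "bad" wide spreadsheet layout into the
# tidy long format shown above. Values are taken from the slide example.

# Wide layout: group columns per day, no units, ambiguous headers.
wide = {
    ("Sodium", 0): (139, 142),     # (control, treated)
    ("Sodium", 7): (140, 146),
    ("Potassium", 0): (3.3, 4.8),
    ("Potassium", 7): (3.4, 5.1),
}

UNITS = {"Sodium": "mEq/l", "Potassium": "mEq/l"}  # units made explicit

# Long layout: one row per parameter/day, with units spelled out.
rows = [
    {"Parameter": p, "Day": d, "Control": c, "Treated": t, "Units": UNITS[p]}
    for (p, d), (c, t) in sorted(wide.items())
]

for r in rows:
    print(r)
```

The same reshaping is often done with a spreadsheet pivot or a data-frame "melt" operation; the point is that every observation carries its full context on its own row.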
24. The International Conference on Systems Biology (ICSB), 22-28 August 2008, Susanna-Assunta Sansone, www.ebi.ac.uk/net-project
…breadth and depth of the
context are pivotal…
…including capturing
experimental design and
statistical analysis
25. Among these, publishers occupy a
leverage point, because of the importance of
formal publications in the academic
incentive structure
Stakeholder mobilization, old and new driving forces
26. • Incentives and credit for sharing
o Big and small data
o Unpublished data
o Long tail of data
o Curated aggregation
• Peer review of data
• Value of data vs. analysis
• Discoverability and reusability
o Complementing community
databases
Growing number of data papers and data journals
27. nature.com/scientificdata
Honorary Academic Editor
Susanna-Assunta Sansone, PhD
Managing Editor
Andrew L Hufton, PhD
Editorial Curator
Varsha Khodiyar
Publisher
Iain Hrynaszkiewicz
A new open-access, online-only publication for
descriptions of scientifically valuable datasets
Supported by
28. A new article type
A new category of publication that provides detailed
descriptors of scientifically valuable datasets
Mandates open data, without unnecessary
restrictions, as a condition of submission
30. Scientific hypotheses:
Synthesis
Analysis
Conclusions
Methods and technical analyses supporting the quality
of the measurements:
What did I do to generate the data?
How was the data processed?
Where is the data?
Who did what, and when
Relation with traditional articles – content
32. Experimental metadata or
structured component
(in-house curated, machine-
readable formats)
Article or
narrative component
(PDF and HTML)
Data Descriptors have two components
33. The Data Curation Editor is responsible for creating and
curating the machine-readable structured component
• Enables browsing and searching the articles
• Facilitates links to related journal articles and repository
records
Curation and discoverability
34. Created with the input of the
authors, includes value-added
semantic annotation of the
experimental metadata
[Diagram: analysis method and script linked to the data file or record in a database]
Data Descriptors: structured component
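As a rough illustration of what a machine-readable structured component captures, the sketch below shows structured experimental metadata as JSON. The field names and identifiers are hypothetical, not the journal's actual schema (Scientific Data's structured component is curated in established formats):

```python
import json

# Illustrative only: field names and the accession below are hypothetical.
# The point is that design, samples, assays and data-file links are captured
# as structured, machine-readable metadata rather than free-text prose.
descriptor = {
    "title": "Example Data Descriptor",
    "study_design": ["case-control"],              # annotated with ontology terms
    "samples": [{"name": "patient-01", "organism": "Homo sapiens"}],
    "assays": [{
        "measurement": "transcription profiling",
        "technology": "RNA-Seq",
        "data_files": ["run01.fastq.gz"],
        "repository_record": "example-accession",  # link to the deposited data
    }],
}

print(json.dumps(descriptor, indent=2))
```

Because the component is structured, it can be indexed, searched, and linked to repository records programmatically, which is what enables the browsing and cross-linking described above.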
39. “The Data Descriptor made it easier to use
the data, for me it was critical that everything
was there…all the technical details like voxel
size.”
Professor Daniele Marinazzo
Why data papers? Data reuse is easier!
Credit to: Varsha Khodiyar
41. • Better data better science – the FAIR meme
• Publication of digital research outputs – why it matters
• Interoperability standards – as enablers
Outline
42. [Figure: community content standards arise de facto from grass-roots groups, such as the Nanotechnology Working Group, and de jure from standards organizations]
• To structure, enrich and report the description of the datasets and the
experimental context under which they were produced
• To facilitate discovery, sharing, understanding and reuse of datasets
Community-developed content standards
43.
Content standards as enablers for better-described data
Including minimum information reporting
requirements, or checklists, to report the
same core, essential information
Including controlled vocabularies, taxonomies,
thesauri, ontologies etc. to use the same word
and refer to the same ‘thing’
Including conceptual models and schemas
from which an exchange format is derived,
to allow data to flow from
one system to another
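The checklist idea can be sketched in a few lines of code: a metadata record is tested against the core fields a reporting guideline requires. This is a hypothetical minimal-information check; the field names are illustrative and not drawn from any specific standard:

```python
# Hypothetical minimum-information checklist: the required field names below
# are invented for illustration, not taken from a real reporting guideline.
CHECKLIST = {"organism", "sample_type", "assay_type", "protocol", "units"}

def missing_fields(record: dict) -> set:
    """Return the required fields a metadata record fails to report."""
    return {f for f in CHECKLIST if not record.get(f)}

record = {"organism": "Mus musculus", "assay_type": "metabolite profiling"}
print(missing_fields(record))  # the fields still to be reported
```

Real checklists are richer (conditional fields, ontology-constrained values), but the principle is the same: everyone reports the same core, essential information.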
45. • Is there a database, implementing standards, where I can deposit my
metagenomics dataset?
• My funder’s data-sharing policy recommends the use of established standards,
but which ones are widely endorsed and applicable to my toxicological and
clinical data?
• Am I using the most up-to-date version of this terminology to annotate
cell-based assays?
• I understand this format has been deprecated; what has it been replaced by,
and who is leading the work?
• Are there databases implementing this exchange format, whose development
we have funded?
• What are the mature standards and standards-compliant databases we should
recommend to our authors?
But how do we help users to make informed decisions?
46. A web-based, curated and searchable registry ensuring that
standards and databases are registered, informative and
discoverable; monitoring development and evolution of standards,
their use in databases and adoption of both in data policies
An informative and educational resource
1,400 records and growing
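A toy sketch of the kind of filtered lookup such a registry enables; the records and field names below are invented for this example (MIxS and OBI stand in for real registered standards), and this is not the registry's actual data model or API:

```python
# Hypothetical in-memory registry: a handful of records with invented fields.
RECORDS = [
    {"name": "MIxS", "type": "reporting guideline",
     "domain": "metagenomics", "status": "active"},
    {"name": "OBI", "type": "terminology artifact",
     "domain": "assays", "status": "active"},
    {"name": "Old-Format", "type": "model/format",
     "domain": "proteomics", "status": "deprecated"},
]

def search(records, **filters):
    """Return records whose fields match all of the given filter values."""
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]

print(search(RECORDS, domain="metagenomics", status="active"))
```

This is exactly the decision support the questions on the previous slide call for: filter by domain, record type, and maturity status before recommending a standard or database.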
48. Tracking evolution, e.g. deprecations and substitutions
49. [Figure: a model/format formalizes a reporting guideline; a reporting guideline is used by a model/format]
Cross-linking standards to standards and databases
57. Philippe
Rocca-Serra, PhD
Senior Research Lecturer
Alejandra
Gonzalez-Beltran, PhD
Research Lecturer
Milo
Thurston, DPhil
Research Software Engineer
Massimiliano
Izzo, PhD
Research Software Engineer
Peter
McQuilton, PhD
Knowledge Engineer
Allyson
Lister, PhD
Knowledge Engineer
Eamonn
Maguire, DPhil
Software Engineer contractor
David
Johnson, PhD
Research Software Engineer
Susanna-Assunta Sansone, PhD
Principal Investigator, Associate Director
We also acknowledge our network of collaborators
in the following active projects: H2020 PhenoMeNal,
H2020 ELIXIR-EXCELERATE, H2020 MultiMot,
NIH bioCADDIE, NIH CEDAR and IMI eTRIKS