3. Plagued by selective reporting of data and methods
• Over 50% of completed studies in biomedicine do not appear in the published literature
• Often because results do not conform to the authors' hypotheses
“Only half the health-related studies funded by the European Union between 1998 and 2006 - an expenditure of €6 billion - led to identifiable reports”
4. Incentivizing individual contributors to share data
• Big science efforts
o Data are often better organized, reported and shared
• Small independent efforts, yielding a rich variety of specialty data sets
o Most of these data (such as null findings) are unpublished
o These dark data hold a potential wealth of knowledge
5. A community mobilization for “openness”
http://discovery.urlibraries.org/ https://okfn.org
image by Greg Emmerich
Open data is a means to do better science more efficiently
http://opendefinition.org/licenses/
http://pantonprinciples.org
https://creativecommons.org
6. Open access is not enough on its own
http://www.theguardian.com/higher-education-network/blog/2014/jun/26
If your research has been funded by
the taxpayer, there's a good chance
you'll be encouraged to publish your
results on an open access basis…
This final article makes publicly
available the hypotheses,
interpretations and conclusions of your
research.
But what about the data that led you
to those results and conclusions?
7. Open data is also not always enough
http://www.theguardian.com/higher-education-network/blog/2014/jun/26
So data that is in theory open and free to access
• may still be hard to get hold of
• it may not have been stored or cited in the appropriate manner
• it may not be interoperable with related data because it is not formatted appropriately; or
• it may not be reusable because it may not contain enough information for others to understand it
8. Movement for FAIR data in life and medical sciences
http://bd2k.nih.gov/workshops.html#ADDS
10. Responsibilities lie across several stakeholder groups
Understand the benefits of sharing FAIR datasets and enact them
Engage and assist researchers to enable them to share FAIR datasets
Release or endorse practices and policies, but also incentive and credit mechanisms for researchers, curators and developers
11. Publishers occupy a leverage point
Because of the importance of formal publications in the academic incentive structure
12. Role of publishers as “agents of change”
Serve as the implementation and/or enforcement arm at the point of publication
13. Publishers and data/reproducibility
• Policies on access (to data, code, reagents etc.)
o Supporting funder & community needs
• Format and amount of content
o Methodological details, supplementary info, data integration and links to repositories
• Licensing for reuse
• Incentives to share
o Data citations
o Data journals and articles
• Quality assurance through peer review
Credit to:
Iain Hrynaszkiewicz
16. Data/reproducibility at NPG
Wang et al, Nature, 2013
doi:10.1038/nature12730
• Figure source data
o putting data behind figures/graphs
17. Data/reproducibility at NPG
• Figure source data
o putting data behind figures/graphs
• Data citation
o tackling both styling and format; monitoring community developments, such as the Data Citation Synthesis Group
• Code reproducibility
o peer review, availability and reuse
• NPG’s Linked Data release – CC0
• A new data journal
19. A new open-access, online-only publication for
descriptions of scientifically valuable datasets
20.
• Get Credit for Sharing Your Data
o Publications will be listed in the major indexes and will be citeable
• Focused on Data Reuse
o All the information others need to reuse the data; no interpretative analysis or hypothesis testing
• Open-access
o Authors select from three Creative Commons licences for the main Data Descriptor. Each publication is supported by curated CC0 metadata
• Peer-reviewed
o Rigorous peer-review managed by our Editorial Board of academic researchers ensures data quality and standards
• Promoting Community Data Repositories
o Data stored in community data repositories
21. Introducing a new content type: the Data Descriptor
• Designed to make data more discoverable, interpretable and reusable
• Concerned with the facts behind the methodology of data generation/collection and processing
• Complements a journal article
[Diagram contrasting a journal article with a Data Descriptor]
Journal article (narrative): Synthesis, Analysis, Interpretation, Conclusions
Data Descriptor (facts): What is the sample? What did I do to generate the data? How was the data processed? Where is the data? Who did what when?
Other diagram label: Summary of Data Descriptor
22. Data Descriptor: narrative and structure
• Experimental metadata or structured component (in-house curated, machine-readable formats)
• Article or narrative component (PDF and HTML)
23. Data Descriptor: narrative
Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses
In traditional publications this information is not provided in a sufficiently detailed manner. However, this information is essential for understanding, reusing, and reproducing datasets.
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
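The section list of a Data Descriptor can be treated as a simple template check. The following is a minimal sketch, assuming a draft is represented as a list of its headings; the function name `missing_sections` and the example draft are hypothetical, and this is not the journal's own submission tooling.

```python
# Section headings are taken from the slide; everything else is illustrative.
REQUIRED_SECTIONS = [
    "Title",
    "Abstract",
    "Background & Summary",
    "Methods",
    "Technical Validation",
    "Data Records",
    "Usage Notes",
    "Figures & Tables",
    "References",
    "Data Citations",
]


def missing_sections(draft_headings):
    """Return the required sections absent from a draft, in template order."""
    # Compare case-insensitively and ignore surrounding whitespace.
    present = {heading.strip().lower() for heading in draft_headings}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in present]


if __name__ == "__main__":
    # Hypothetical draft that lacks several sections.
    draft = ["Title", "Abstract", "Methods", "Data Records", "References"]
    for section in missing_sections(draft):
        print("Missing section:", section)
```

A check like this only confirms that the headings exist; whether each section holds enough detail for reuse remains a matter for curators and reviewers.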
25. Data Descriptor: narrative
Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
Joint Declaration of Data Citation Principles by the
Data Citation Synthesis Group
26. Data Descriptor: structure - content
In-house editorial curator:
• helps users submit the structured content via simple templates and an internal authoring tool
• performs value-added semantic annotation of the experimental metadata
For advanced users/service providers willing to export ISA-Tab for direct submission, we have released a technical specification.
[Diagram labels: data file or record in a database; analysis; method; script]
28. Publish your data early
[Diagram labels: Collect Data; Follow-up experiments; Publish Findings; Publish Data]
Scientific Data’s prior publication policy with other NPG journals
protects your ability to publish the screen data and the hits later
Credit to:
Andrew Hufton
29. Hao et al.: Environmental
8 citations
Data sets from the Global Integrated
Drought Monitoring and Prediction
System (GIDMaPS), which provides
drought information based on multiple
drought indicators
31. Hao et al.: Environmental
8 citations
New Dataset
• Data in figshare
• Code in figshare
• Cited in Science
32. Hanke: Neuroscience
New Dataset
• Code in GitHub
• Data in OpenfMRI
33. Or your data and findings simultaneously
[Diagram labels: Collect Data; Follow-up experiments; Publish Findings; Submit Data; Hold publication]
Scientific Data will hold a Data Descriptor publication that has
been accepted for publication, while your other related research
publications clear peer review
Credit to:
Andrew Hufton
34. Or after the findings, but…
[Diagram labels: Collect Data; Follow-up experiments; Publish Findings; Publish Data]
• A fuller, more in-depth look at the data processing steps,
supported by additional data files and code from each step
• And/or additional tutorial-like information for scientists
interested in reusing or integrating the data with their own
35. Messina et al.: Epidemiology
4 citations
The most comprehensive geographic collection of human dengue virus occurrence data (1960-2012), linked to point or polygon locations, derived from peer-reviewed literature and case reports as well as informal online sources
36. Messina et al.: Epidemiology
4 citations
Associated Nature Article
• Data in figshare
Scientific hypotheses:
• Synthesis
• Analysis
• Conclusions
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what when?
37. Adding value to research articles and data records
[Diagram labels: Research papers; Data Descriptors; Data records]
38. Helping authors find the right place for the data
• We currently recognize over 60 public data repositories, and provide advice on the best place for authors to archive their data
• We have integrated systems with both:
[Chart: recognized basic life-sciences repositories by category]
“Omics” is emphasized among basic life-sciences repositories
• DNA and protein sequence
• Functional genomics
• Genetic association and genome variation
• Metagenomics
• Molecular interactions
• Organism- or disease-specific
• Proteomics
• Taxonomy and species diversity
• Traces and sequencing reads
39. Repositories criteria
3 Big data | CSE 2014
1. Broad support and recognition within their scientific community
2. Ensure long-term persistence and preservation of datasets
3. Provide expert curation
4. Implement relevant, community-endorsed reporting requirements
Progressively monitor this via
5. Provide for confidential review of submitted datasets
6. Provide stable identifiers for submitted datasets
7. Allow public access to data without unnecessary restrictions
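The seven criteria above read naturally as a checklist. The sketch below is one possible way to record them per repository, assuming each criterion is judged as a simple yes/no; the class name `RepositoryProfile`, its field names and the example values are hypothetical and do not describe how the journal actually assesses repositories.

```python
from dataclasses import dataclass, fields


@dataclass
class RepositoryProfile:
    """One boolean per criterion on the slide; field names are hypothetical."""
    community_recognition: bool             # 1. broad support and recognition
    long_term_preservation: bool            # 2. persistence and preservation
    expert_curation: bool                   # 3. expert curation
    community_reporting_requirements: bool  # 4. community-endorsed reporting
    confidential_review: bool               # 5. confidential review of submissions
    stable_identifiers: bool                # 6. stable identifiers
    unrestricted_public_access: bool        # 7. access without unnecessary restrictions


def unmet_criteria(profile):
    """List the names of criteria the repository does not yet satisfy."""
    return [f.name for f in fields(profile) if not getattr(profile, f.name)]


if __name__ == "__main__":
    # Made-up assessment of an imaginary repository.
    example = RepositoryProfile(True, True, False, True, True, True, True)
    print(unmet_criteria(example))
```

In practice several of these criteria call for judgment rather than a boolean, so a record like this is better suited to tracking decisions than to making them.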
41. Peer review process focused on quality and reuse
Evaluation is not based on the perceived impact or novelty of the findings or size of the data
• Experimental rigour and technical data quality
o Methodologically sound
o Technical validation experiments and statistical analyses
o Depth, coverage, size, and/or completeness of data sufficient for the types of applications
• Completeness of the description
o Sufficient details to allow others to reproduce the results, reuse the data, or integrate it with other data
o Compliance with relevant minimum information or reporting standards
• Integrity of the data files and repository record
o Data files match the descriptions in the Data Descriptor
o Deposited in the most appropriate available databases
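Part of the integrity criterion, that data files match their description, can be checked mechanically. The following is a minimal sketch, assuming the description supplies a manifest mapping each file name to a SHA-256 checksum; that manifest format and the function names are assumptions for illustration, not something the slides specify.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest, data_dir):
    """Compare described files against deposited files.

    `manifest` maps file name -> expected SHA-256 hex digest (hypothetical format).
    Returns three sorted lists of file names: (missing, mismatched, undescribed).
    """
    data_dir = Path(data_dir)
    deposited = {p.name for p in data_dir.iterdir() if p.is_file()}
    # Described in the manifest but absent from the deposit.
    missing = sorted(name for name in manifest if name not in deposited)
    # Present, but the content differs from what was described.
    mismatched = sorted(
        name
        for name, expected in manifest.items()
        if name in deposited and sha256_of(data_dir / name) != expected
    )
    # Deposited but never described.
    undescribed = sorted(deposited - set(manifest))
    return missing, mismatched, undescribed
```

A reviewer could call `check_manifest(manifest, "path/to/download")` on a local copy of the deposit; three empty lists mean every described file is present and unchanged. Checksums say nothing about whether the content is scientifically sound, which the other review criteria cover.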
42. Current content is diverse - bimonthly releases
• Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology etc.
• New and previously published individual datasets, curated aggregations and citizen science:
• Datasets in figshare, Dryad and domain-specific databases
• Code deposited in figshare and GitHub
• First collection:
44. Supported by:
Advisory Panel including senior researchers, funders, librarians and curators
• Michael Huerta, National Institutes of Health, USA
• Mark Thorley, Natural Environment Research Council, UK
• Patricia Cruse, University of California, USA
• Susan Gregurick, Office of Biological and Environmental Research, Department of Energy, USA
• Ioannis Xenarios, Swiss Institute of Bioinformatics, Switzerland
• Chris Bowler, IBENS, France
• Mark Forster, Syngenta, UK
• Anthony Rowe, Johnson & Johnson, USA
• Stephen Chanock, National Cancer Institute, USA
• Weida Tong, National Center for Toxicological Research, FDA, USA
• Albert J. R. Heck, Utrecht University, The Netherlands
• Johanna McEntyre, EMBL-EBI, European Bioinformatics Institute, UK
• Simon Hodson, CODATA, France
• Joseph R. Ecker, Howard Hughes Medical Institute & Salk Institute, USA
• Stephen Friend, Sage Bionetworks, USA
• Jessica Tenenbaum, Duke Translational Medicine Institute, USA
• Anne-Claude Gavin, EMBL, Germany
• David Carr, Wellcome Trust, UK
• Wolfram Horstmann, Göttingen State and University Library, Germany
• Piero Carninci, RIKEN Omics Science Center, Japan
• Pascale Gaudet, Swiss Institute of Bioinformatics, Switzerland
• Judith A. Blake, The Jackson Laboratory, USA
• Richard H. Scheuermann, J. Craig Venter Institute, USA
• Caroline Shamu, Harvard Medical School, USA
Susanna-Assunta Sansone, Honorary Academic Editor (University of Oxford, UK)
Andrew L Hufton, Managing Editor
Varsha Khodiyar, Editorial Curator
Iain Hrynaszkiewicz, Publisher
An open access, peer-reviewed publication for descriptions of scientifically valuable datasets
Launched May 2014