SciDataCon2014, 2-5 November, 2014
Data Papers and their applications:
examples from Nature Publishing Group and Ubiquity Press
1. Introduction
2. Anatomy of a data paper - case studies from specific journals
• Nature Publishing Group - Scientific Data, Susanna-Assunta Sansone
• Ubiquity Press - Open Health Data, Brian Hole
3. Feedback and discussion
Introduction
The role of publishers and data papers
Susanna-Assunta Sansone, PhD
Consultant, Honorary Academic Editor; Associate Director, Principal Investigator
@biosharing @isatools @scientificdata
Credit to: 
https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
A community mobilization for “openness” 
http://discovery.urlibraries.org/ https://okfn.org 
image by Greg Emmerich 
• Open data is a means to do better science more efficiently
• Licenses, copyright and IP are legal barriers to data sharing and reuse
o Licenses are for asserting rights; waivers are for giving them up, maximising the potential for data reuse, integration and discovery of new knowledge
• Creative Commons CC0
o interoperability: CC0 is human- and machine-readable
o universality: CC0 is global, universal and widely recognized
o simplicity: no need for humans to make, and respond to, individual data requests
http://opendefinition.org/licenses/ 
http://pantonprinciples.org 
https://www.copyrightsworld.com 
https://creativecommons.org
Open access is not enough on its own 
http://www.theguardian.com/higher-education-network/blog/2014/jun/26 
If your research has been funded by 
the taxpayer, there's a good chance 
you'll be encouraged to publish your 
results on an open access basis….. 
This final article makes publicly 
available the hypotheses, 
interpretations and conclusions of your 
research. 
But what about the data that led you 
to those results and conclusions?
Also open data is not always enough 
http://www.theguardian.com/higher-education-network/blog/2014/jun/26 
So data that is in theory open and free to access
• may still be hard to get hold of
• it may not have been stored or cited in the appropriate manner
• it may not be interoperable with related data because it is not formatted appropriately; or
• it may not be reusable because it may not contain enough information for others to understand it
Movement for FAIR data in life and medical sciences 
http://bd2k.nih.gov/workshops.html#ADDS
Because, in all fairness, not much data is FAIR
Data unavailability and incomplete annotations
Benefits and barriers to data sharing 
Credit to: 
Iain Hrynaszkiewicz 
Benefits
• Reduction of error and fraud
• Increased return on investment in research
• Compliance with funder and journal mandates
• Reduce duplication and bias
• Reproduction/validation of research
• Testing additional hypotheses
• Use for teaching
• Integration with other data sets
• Increased citations
Barriers
• Concerns over inappropriate reuse
• Limited time/resources
• Costs associated with data sharing
• Human privacy concerns
• Unclear ownership of data/authority to release data
• Lack of academic incentives/recognition
• Lack of repositories or lack of awareness of repositories
• Protecting commercially sensitive information
Responsibilities lie across several stakeholder groups
Role of publishers as “agents of change” 
• Data has to become an integral part of scholarly communications
• Publishers occupy a leverage point in this process
The role of data journals/articles 
• Credit
• Unpublished data
• Peer review focus
• Value of data vs. analysis
• Discoverability
• Reusability
• Narrative/context
• “Intelligently open data”
Credit to: 
Iain Hrynaszkiewicz
Publishers and data/reproducibility 
• Policies on access (to data, code, reagents etc.)
o Supporting funder & community needs
• Format and amount of content
o Methodological details, supplementary info, data integration and links to repositories
• Licensing for reuse
• Incentives to share
o Data citations
o Data journals and articles
• Quality assurance through peer review
Credit to: 
Iain Hrynaszkiewicz
Nature Publishing Group: the changing landscape 
Human Genome 2001: 62 pages, 150 authors, 49 figures, 27 tables
ENCODE Project 2012: 30 papers, 3 journals
2013 
Credit to: 
Iain Hrynaszkiewicz
Data/reproducibility at NPG 
Some important recent events 2013-2014 
• Figure source data 
o putting data behind figures/graphs 
o rolled out at Nature and progressively across all other Nature branded 
titles 
Wang et al, Nature, 2013 
doi:10.1038/nature12730
Data/reproducibility at NPG 
Some important recent events 2013-2014 
• Figure source data 
o putting data behind figures/graphs 
o rolled out at Nature and progressively across all other Nature branded 
titles 
• Extended data 
o expandable text and extra figures; rolled out at Nature 
• Data citation
o tackling both styling and format; monitoring community developments, such as the Data Citation Synthesis Group
o to be rolled out across all Nature branded titles and Scientific Data
• Code reproducibility 
o peer review, availability and reuse 
• NPG’s Linked Data release – CC0 
• A new data publication platform:
From made reproducible to born reproducible 
“Reproducing the method took several months of effort, and 
required using new versions and new software that posed 
challenges to reconstructing and validating the results”
Data journals everywhere? 
Credit to: 
Iain Hrynaszkiewicz
Nature Publishing Group - Scientific Data
Susanna-Assunta Sansone, PhD
Consultant, Honorary Academic Editor; Associate Director, Principal Investigator
@scientificdata @biosharing @isatools
A new open-access, online-only publication for descriptions of scientifically valuable datasets
• Get Credit for Sharing Your Data
o Publications will be listed in the major indexes and will be citeable
• Focused on Data Reuse
o All the information others need to reuse the data; no interpretative analysis or hypothesis testing
• Open-access
o Authors select from three Creative Commons licences for the main Data Descriptor. Each publication is supported by curated CC0 metadata
• Peer-reviewed
o Rigorous peer review, managed by our Editorial Board of academic researchers, ensures data quality and standards
• Promoting Community Data Repositories
o Data stored in community data repositories
Introducing a new content type: the Data Descriptor 
• Designed to make data more discoverable, interpretable and reusable
• Concerned with the facts behind the methodology of data generation/collection and processing
• Complements a journal article
[Diagram contrasting a journal article (narrative: synthesis, analysis, interpretation, conclusions) with a Data Descriptor (facts). The Data Descriptor answers:]
• What is the sample?
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what when?
Data Descriptor: narrative and structure
• Experimental metadata or structured component (in-house curated, machine-readable formats)
• Article or narrative component (PDF and HTML)
Data Descriptor: narrative
Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses
In traditional publications this information is not provided in a sufficiently detailed manner. However, this information is essential for understanding, reusing, and reproducing datasets
Sections:
• Title
• Abstract
• Background & Summary
• Methods
• Technical Validation
• Data Records
• Usage Notes
• Figures & Tables
• References
• Data Citations
Joint Declaration of Data Citation Principles by the 
Data Citation Synthesis Group
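As a rough illustration of what a machine-friendly data citation might carry, the sketch below assembles a citation string from a handful of fields. The class name, field names and output layout are assumptions made for this example only; they are not the style used by Scientific Data or prescribed by the Joint Declaration, and the values are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """Hypothetical container for the parts of a data citation."""
    creators: List[str]  # people or organisations credited for the dataset
    year: int            # year the dataset was released
    title: str           # dataset title
    repository: str      # where the dataset is archived
    identifier: str      # persistent identifier, for example a DOI URL


def format_citation(citation: DataCitation) -> str:
    """Render the citation in one illustrative layout (not a journal style)."""
    creators = ", ".join(citation.creators)
    return (
        f"{creators} ({citation.year}). {citation.title}. "
        f"{citation.repository}. {citation.identifier}"
    )


# Placeholder values only; none of these refer to a real dataset.
example = DataCitation(
    creators=["Author, A.", "Author, B."],
    year=2014,
    title="Example dataset",
    repository="Example Repository",
    identifier="https://doi.org/10.0000/placeholder",
)
print(format_citation(example))
```

Keeping the identifier as its own field, rather than embedding it in free text, is what lets the same record be rendered for readers and harvested by indexing services.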
Data Descriptor: structure - content
General-purpose, machine-readable format, designed to support:
• description of the experimental workflow
• explicit and discoverable annotations
• provenance tracking
• use of community-defined minimal reporting guidelines and terminologies
[Diagram: data file or record in a database; analysis; method; script]
Data Descriptor: structure - content
Includes fields describing:
• each study, linking to relevant sections of the Data Descriptor article
• authors’ details, including ORCID
• publications
• funding sources and funders’ names, via FundRef
• experimental factors
• study design
• assays
• protocols
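The field list above could be modelled as a simple record, as in the sketch below. This is only an illustration of the kind of information captured: the class and attribute names are hypothetical, the values are placeholders, and the serialisation shown is plain JSON rather than the ISA-Tab format used for the structured component.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Author:
    name: str
    orcid: str = ""  # ORCID identifier, if the author has one


@dataclass
class Funding:
    funder_name: str   # funder name, as it would be resolved via FundRef
    award_id: str = ""


@dataclass
class StudyRecord:
    """Hypothetical record mirroring the fields listed on the slide."""
    title: str
    descriptor_section: str  # section of the Data Descriptor article it links to
    authors: List[Author] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)
    funding: List[Funding] = field(default_factory=list)
    experimental_factors: List[str] = field(default_factory=list)
    study_design: str = ""
    assays: List[str] = field(default_factory=list)
    protocols: List[str] = field(default_factory=list)


def to_json(record: StudyRecord) -> str:
    """Serialise the record, including nested authors and funding, to JSON."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Placeholder values only.
record = StudyRecord(
    title="Example study",
    descriptor_section="Methods",
    authors=[Author(name="Example Author", orcid="0000-0000-0000-0000")],
    funding=[Funding(funder_name="Example Funder", award_id="EX-0001")],
    experimental_factors=["treatment"],
    study_design="example design",
    assays=["example assay"],
    protocols=["sample collection"],
)
print(to_json(record))
```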
Data Descriptor: structure - content
It relates samples, and their descriptions, to the data files
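To make the sample-to-file relationship concrete, the sketch below reads a small tab-delimited table and groups data files by sample. The table is a simplified, ISA-Tab-like stand-in: the column names, sample names and file names are assumptions for illustration, not taken from a real submission or from the technical specification.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Hypothetical tab-delimited table: one row per (sample, data file) pair.
SAMPLE_TABLE = "\n".join([
    "Sample Name\tOrganism\tData File",
    "sample_1\texample organism\tsample_1_run_a.csv",
    "sample_1\texample organism\tsample_1_run_b.csv",
    "sample_2\texample organism\tsample_2_run_a.csv",
])


def files_by_sample(table_text: str) -> Dict[str, List[str]]:
    """Group the data files listed in the table under their sample name."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Sample Name"]].append(row["Data File"])
    return dict(grouped)


for sample, files in files_by_sample(SAMPLE_TABLE).items():
    print(sample, files)
```

Because each row carries both the sample description and the file reference, a reader of the table can go from a file back to the sample it came from without consulting the narrative article.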
Data Descriptor: structure - content
In-house editorial curator:
• assists users to submit the structured content via simple templates and an internal authoring tool
• performs value-added semantic annotation of the experimental metadata
For advanced users/service providers willing to export ISA-Tab for direct submission, we have released a technical specification
Workflow overview
Green: author; Purple: repository; Blue: SciData; Red: production
Publish your data early
[Diagram: Collect Data; Follow-up experiments; Publish Findings; Publish Data]
Scientific Data’s prior publication policy with other NPG journals protects your ability to publish the screen data and the hits later
Credit to:
Andrew Hufton
Hao et al.: Environmental
8 citations 
Data sets from the Global Integrated 
Drought Monitoring and Prediction 
System (GIDMaPS), which provides 
drought information based on multiple 
drought indicators
Hao et al.: Environmental
8 citations 
New Dataset 
• Data in figshare 
• Code in figshare 
• Cited in Science
Or your data and findings simultaneously/after
[Diagram: Collect Data; Follow-up experiments; Publish Findings; Submit Data; Hold publication]
Scientific Data will hold a Data Descriptor that has been accepted for publication while your other related research publications clear peer review
Credit to:
Andrew Hufton
Messina et al.: Epidemiology
4 citations
The most comprehensive geographic collection of human dengue virus occurrence data (1960-2012), linked to point or polygon locations, derived from peer-reviewed literature and case reports as well as informal online sources
Messina et al.: Epidemiology
4 citations
Associated Nature Article
• Data in figshare
Scientific hypotheses:
• Synthesis
• Analysis
• Conclusions
Methods and technical analyses supporting the quality of the measurements:
• What did I do to generate the data?
• How was the data processed?
• Where is the data?
• Who did what when?
Value added component integrated in a growing ecosystem
[Diagram: Research papers; Data Descriptors; Data records]
Progressively refine the guidance to authors
Over 500 Over 600
A web-based, curated and searchable portal works to ensure that standards and databases are registered, informative, discoverable and accessible. It monitors the development and evolution of standards, their use in databases and the adoption of both in data policies.
Helping authors find the right place for the data
• We currently recognize over 60 public data repositories, and provide advice on the best place for authors to archive their data
• We have integrated systems with both:
[Chart: “Omics” is emphasized among basic life-sciences repositories. Categories shown: DNA and protein sequence; Functional genomics; Genetic association and genome variation; Metagenomics; Molecular interactions; Organism- or disease-specific; Proteomics; Taxonomy and species diversity; Traces and sequencing reads]
Repositories criteria
1. Broad support and recognition within their scientific community
2. Ensure long-term persistence and preservation of datasets
3. Provide expert curation
4. Implement relevant, community-endorsed reporting requirements
Progressively monitor this via
5. Provide for confidential review of submitted datasets
6. Provide stable identifiers for submitted datasets
7. Allow public access to data without unnecessary restrictions
Citations of and links to data files - databases
Peer review process focused on quality and reuse
Evaluation is not based on the perceived impact or novelty of the findings, or on the size of the data
• Experimental rigour and technical data quality
o Methodologically sound
o Technical validation experiments and statistical analyses
o Depth, coverage, size, and/or completeness of data sufficient for the types of applications
• Completeness of the description
o Sufficient details to allow others to reproduce the results, or to reuse the data or integrate it with other data
o Compliance with relevant minimum information or reporting standards
• Integrity of the data files and repository record
o Data files match the descriptions in the Data Descriptor
o Deposited in the most appropriate available data repository
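One way a reviewer or author might check that deposited files match what a description lists is to compare checksums. The sketch below assumes a hypothetical manifest mapping file names to expected SHA-256 digests; the slides do not say that the journal uses checksums or any such manifest, so treat this purely as an illustration of the integrity criterion.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest: Dict[str, str], base_dir: str) -> List[str]:
    """Return a list of problems: files that are missing or whose digest differs.

    `manifest` is a hypothetical mapping of file name to expected SHA-256 digest.
    An empty list means every listed file is present and matches.
    """
    problems = []
    for name, expected in manifest.items():
        path = Path(base_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif sha256_of(path) != expected:
            problems.append(f"{name}: checksum mismatch")
    return problems
```

A call such as `check_manifest({"data.csv": "<expected digest>"}, "downloads")` would return an empty list when the downloaded file is intact; the directory and file names here are placeholders.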
Current content is diverse - bimonthly releases
• Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology etc.
• New and previously published individual datasets, curated aggregations and citizen science:
o a fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
o additional tutorial-like information for scientists interested in reusing the data or integrating it with their own
• Datasets in figshare, Dryad and domain-specific databases
• Code deposited in figshare and GitHub
• First collection:
Open Access – APC supported
Data: the primary datasets reside in public repositories. Partnering with figshare and Dryad, which are both CC0
Data Descriptor - structured component (ISA-Tab): as NPG has already done with its existing Linked Data Portal, the metadata about Data Descriptors in Scientific Data is CC0
Data Descriptor - narrative component: the description of the methodology of data generation/collection and processing is licensed under one of the following, by author choice:
OA article processing charges: $1,000 USD / £650 GBP / €750 for each accepted article
Supported by:
Advisory Panel including senior researchers, funders, librarians and curators
• Michael Huerta, National Institutes of Health, USA
• Mark Thorley, Natural Environment Research Council, UK
• Patricia Cruse, University of California, USA
• Susan Gregurick, Office of Biological and Environmental Research, Department of Energy, USA
• Ioannis Xenarios, Swiss Institute of Bioinformatics, Switzerland
• Chris Bowler, IBENS, France
• Mark Forster, Syngenta, UK
• Anthony Rowe, Johnson & Johnson, USA
• Stephen Chanock, National Cancer Institute, USA
• Weida Tong, National Center for Toxicological Research, FDA, USA
• Albert J. R. Heck, Utrecht University, The Netherlands
• Johanna McEntyre, EMBL-EBI, European Bioinformatics Institute, UK
• Simon Hodson, CODATA, France
• Joseph R. Ecker, Howard Hughes Medical Institute & Salk Institute, USA
• Stephen Friend, Sage Bionetworks, USA
• Jessica Tenenbaum, Duke Translational Medicine Institute, USA
• Anne-Claude Gavin, EMBL, Germany
• David Carr, Wellcome Trust, UK
• Wolfram Horstmann, Göttingen State and University Library, Germany
• Piero Carninci, RIKEN Omics Science Center, Japan
• Pascale Gaudet, Swiss Institute of Bioinformatics, Switzerland
• Judith A. Blake, The Jackson Laboratory, USA
• Richard H. Scheuermann, J. Craig Venter Institute, USA
• Caroline Shamu, Harvard Medical School, USA
Editorial team
• Susanna-Assunta Sansone, Honorary Academic Editor (University of Oxford, UK)
• Andrew L Hufton, Managing Editor
• Varsha Khodiyar, Editorial Curator
• Iain Hrynaszkiewicz, Publisher
An open access, peer-reviewed publication for descriptions of scientifically valuable datasets
Launched May 2014
Feedback and discussion
• Based on what you have heard today, how well do these journals fit with your publication and data management workflow, or that of researchers at your institute?
• What are the benefits of data publication?
• What are the risks/barriers?
• What can publishers/journals do to incentivise data publication?
