Data Consultant, 
Honorary Academic Editor 
Associate Director, 
Principal Investigator 
iDASH meeting, San Diego, Sept 15-16, 2014 
The rise of the data-centric 
research and publication enterprises 
Susanna-Assunta Sansone, PhD 
@biosharing 
@isatools 
@scientificdata 
Board of Directors; Technical Advisory Board; 
Coordinating Editors; Sector Lead
Credit to: 
https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/
Worldwide movement for FAIR data 
Credit: Barend Mons
http://bd2k.nih.gov/workshops.html#ADDS
Doing my fair share of work 
Increase the level of annotation at the source, tracking provenance and using community standards 
Notes and narrative: notes in lab books (information for humans)
Spreadsheets and tables: the compromise
Linked data and nanopublications: facts as RDF statements (information for machines)
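A minimal sketch of the step from table to machine-readable facts, assuming a hypothetical sample sheet: each non-empty cell becomes one subject-predicate-object statement. The `http://example.org/` base URI, the column names and the values are illustrative assumptions; a real conversion would map columns to terms from community ontologies.

```python
# Hypothetical example: one spreadsheet row turned into subject-predicate-object facts.
# BASE, the column names and the values are placeholders, not taken from the talk.
BASE = "http://example.org/"


def row_to_triples(row, id_column="sample_id"):
    """Return one (subject, predicate, object) triple per non-empty cell."""
    subject = f"<{BASE}sample/{row[id_column]}>"
    triples = []
    for column, value in row.items():
        if column == id_column or value == "":
            continue  # the identifier names the subject; empty cells state nothing
        predicate = f"<{BASE}property/{column}>"
        # Values are emitted as plain literals; no escaping is attempted in this sketch.
        triples.append((subject, predicate, f'"{value}"'))
    return triples


def to_ntriples(triples):
    """Serialise the triples in an N-Triples-like form, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


row = {"sample_id": "S1", "organism": "Mus musculus", "tissue": "liver", "dose": ""}
print(to_ntriples(row_to_triples(row)))
```

The empty `dose` cell produces no statement, which is one reason the spreadsheet remains "the compromise": what is missing for a human reader is simply absent for a machine.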
Working with and for:
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta 
Sansone www.ebi.ac.uk/net-project 
To make any dataset ‘FAIR’, one must have standards, tools and best practices to:
• report sufficient details
• capture all salient features of the experimental workflow
• make annotation explicit and discoverable
• structure the descriptions for consistency
• ensure/regulate access
• deposit and publish
• etc.
…breadth and depth
of the experimental context 
…is pivotal
sample characteristic(s) 
experimental design 
experimental variable(s) 
technology(s) 
measurement(s) 
protocol(s)
data file(s) 
......
The role of reporting or content standards 
Community-developed “norms” set to structure and enrich the 
description of datasets, facilitating understanding, sharing and reuse 
• Including minimum information reporting requirements, or checklists, to report the same core, essential information
• Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’
• Including conceptual model, conceptual schema from which an exchange format is derived, to allow data to flow from one system to another
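How a checklist and a controlled vocabulary work together can be sketched as follows. This is an illustration only: the required fields and the allowed organism terms are hypothetical placeholders, not drawn from any published minimum information guideline or ontology.

```python
# Hypothetical checklist and vocabulary; real ones come from community standards.
REQUIRED_FIELDS = ["organism", "sample_type", "protocol", "data_file"]
CONTROLLED_TERMS = {"organism": {"Homo sapiens", "Mus musculus"}}


def check_record(record):
    """Return a list of problems found in one metadata record (empty if none)."""
    problems = []
    # Checklist: the same core, essential information must be reported.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing: {field}")
    # Vocabulary: the same word must be used for the same 'thing'.
    for field, allowed in CONTROLLED_TERMS.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(f"uncontrolled term in {field}: {value}")
    return problems


record = {"organism": "mouse", "sample_type": "liver", "protocol": "", "data_file": "a.csv"}
for problem in check_record(record):
    print(problem)
```

The record above fails on both counts: the protocol is not reported, and "mouse" is not the agreed term for the organism.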
A community mobilization - some examples 
de jure (standard organizations) and de facto (grass-roots groups)
Nanotechnology Working Group
Organizational and operational structures - quite diverse 
Fragmentation, duplications and gaps 
Technologically-delineated views of the world: arrays, scanning, MS, gels, columns, NMR, FTIR
Biologically-delineated views of the world: transcriptomics, proteomics, metabolomics, plant biology, epidemiology, microbiology
Generic features (‘common core’):
- description of source biomaterial
- experimental design components
To compare and integrate data we need interoperable standards
Growing number of reporting standards 
~ 156 
~ 70 
~ 334 
Source: BioPortal 
Databases, annotation, curation tools implementing standards
Checklists: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, REMARK, MIQE, CONSORT, MISFISHIE…
Formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, SBRML, MzML, GELML, SEDML, ISA-Tab, CML, MITAB…
Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, TEDDY, BTO, IDO, XAO, PRO, DO, VO…
Which standards and databases can we use/recommend?
BioSharing works to map the landscape of content standards in the 
life sciences, broadly covering biological, natural and 
biomedical sciences 
The web-based, curated and searchable registry works to ensure the
standards are informative and discoverable, monitoring their
development and evolution, as well as their use in databases
and adoption in data policies.
BioSharing’s goal is to assist stakeholders to make informed decisions: 
• researchers, developers and curators who lack support and guidance on how to 
best navigate and select the various content standards and understand their 
maturity, or find databases that implement them; 
• funders, journals and librarians who do not have enough information to
make informed decisions on which content standards or databases should be
recommended in their policies, or funded or implemented.
Operational Team 
Advisory Board and RDA Working Group
Core functionalities: 
• search and filtering 
• submission forms to add new records
• “claim” functionality for existing records
• person’s profile (as maintainer of
records) linked to their ORCID
profile
• visualization and views of content 
Current content: 
• Over 500 
• Over 600
Registering and cataloging is just step one; the next steps include:
• Develop assessment criteria for usability and popularity of standards (CTSA Omics Data Standards Working Group)
• Associate standards to data policies and databases
• Assemble journal and funder policies re data storage
• Make fully cross-searchable
• Continue to embed it in the ecosystem of complementary registries
General-purpose, configurable format for 
the description of experimental metadata 
Designed to support: 
• provenance tracking 
• use of community minimal reporting 
guidelines and terminologies 
- reference system to link to (CDISC) 
SDTM files; further connections 
explored via 
Designed to be converted to: 
• a growing number of other metadata 
formats, e.g. used by EBI repositories 
• RDF representation with mapping to 
several ontologies, incl. PROV-O to 
deliver 
[Diagram: analysis method, script, data file or record in a database]
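Provenance tracking of the kind this format is designed to support can be sketched as a chain of materials and data files linked by the protocols that produced them. The sketch below is not ISA-Tab itself and not the API of the ISA tools; the `Node` class, its fields and the example names are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A material or data file, with the steps it was derived from (hypothetical model)."""
    name: str
    kind: str  # e.g. "source", "sample", "data file"
    derived_from: list = field(default_factory=list)  # list of (protocol, parent Node)


def provenance(node):
    """Walk back from a node and list, oldest first, every step that led to it."""
    steps = []
    for protocol, parent in node.derived_from:
        steps.extend(provenance(parent))
        steps.append(f"{parent.name} --[{protocol}]--> {node.name}")
    return steps


source = Node("mouse-1", "source")
sample = Node("liver-1", "sample", [("dissection", source)])
data = Node("liver-1.raw", "data file", [("mass spectrometry", sample)])

for step in provenance(data):
    print(step)
```

Because every data file points back to the sample and protocol it came from, the question "who did what, to which material, to produce this file" can be answered by traversal rather than by reading free text.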
ISA powers data collection, curation resources and repositories, e.g.: 
Embedding and in activities 
CEDAR: 
Centre for Extended Data Annotation and Retrieval 
(PI: Musen; pending notification of award) 
The centre will take advantage of the recent growth in 
community-driven metadata standards to develop 
innovative methods to facilitate the annotation, 
cataloguing, and retrieval of dataset collections. 
(pending final decision and notification of award)
Role of publishers as “agents of change” 
• Data has to become an integral part
of scholarly communication
• Responsibilities lie across several 
stakeholder groups: researchers, 
data centers, librarians, funding 
agencies and publishers 
• Publishers occupy a leverage point 
in this process
Launched on May 27th, 2014 
Credit for sharing 
your data 
Focused on reuse 
and reproducibility 
Peer reviewed, 
curated 
Promoting Community 
Data Repositories 
Open Access 
A new online-only publication for descriptions of scientifically valuable datasets 
in the life, environmental and biomedical sciences, but not limited to these 
Supported by:
Data Descriptor: narrative and structure 
Experimental metadata or 
structured component 
(in-house curated, machine-readable 
formats) 
Article or 
narrative component 
(PDF and HTML)
Data Descriptor - focus on reuse 
Detailed descriptions of methods and technical analyses supporting quality 
of the measurements; does not contain tests of new scientific hypotheses 
Sections: 
• Title 
• Abstract 
• Background & Summary 
• Methods 
• Technical Validation 
• Data Records 
• Usage Notes 
• Figures & Tables 
• References 
• Data Citations 
In traditional publications this
information is not provided in
sufficient detail.
However, this information is
essential for understanding,
reusing, and reproducing
datasets.
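The section list above lends itself to a simple completeness check on a draft. A minimal sketch, assuming a draft is held as a hypothetical dictionary from section name to text; this is not a description of the journal's submission system.

```python
# Section names as listed on the slide; the draft structure itself is an assumption.
SECTIONS = [
    "Title", "Abstract", "Background & Summary", "Methods",
    "Technical Validation", "Data Records", "Usage Notes",
    "Figures & Tables", "References", "Data Citations",
]


def missing_sections(draft):
    """Return the sections that are absent or empty, in the slide's order."""
    return [name for name in SECTIONS if not draft.get(name, "").strip()]


draft = {
    "Title": "An example dataset",
    "Abstract": "What was measured and why it may be reused.",
    "Methods": "How the data were generated and processed.",
    "Data Records": "   ",  # present but blank, so still reported as missing
}
print(missing_sections(draft))
```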
Relation with traditional articles - content 
Scientific hypotheses: 
Synthesis 
Analysis 
Conclusions 
Methods and technical analyses supporting the quality 
of the measurements: 
What did I do to generate the data? 
How was the data processed? 
Where is the data? 
Who did what, and when?
Relation with traditional articles - time 
BEFORE: get your data to the community as soon as possible (see NPG pre-publication policy) 
AT THE SAME TIME: publish your Data Descriptor(s) alongside research article(s) 
AFTER: expand on your research articles, adding further information for reuse of the data
Citations of and links to data files - databases 
Joint Declaration of Data Citation Principles by 
the Data Citation Synthesis Group
Value added component integrated in a 
growing ecosystem 
We currently recognize over 
50 public data repositories 
Research papers
Data Descriptors
Data records
Peer review process focused on quality and reuse 
Evaluation is not based on the perceived impact or novelty of the findings 
• Experimental rigour and technical data quality 
o Methodologically sound 
o Technical validation experiments and statistical analyses 
o Depth, coverage, size, and/or completeness of data sufficient for the types 
of applications 
• Completeness of the description 
o Sufficient details to allow others to reproduce the results, reuse the data or 
integrate it with other data 
o Compliance with relevant minimum information or reporting standards 
• Integrity of the data files and repository record 
o Data files match the descriptions in the Data Descriptor 
o Deposited in the most appropriate available data repository
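One way to picture the "data files match the descriptions" criterion is a comparison between what the descriptor lists and what the repository holds. The sketch below uses SHA-256 checksums as the matching rule; that choice, the function names and the file names are assumptions for illustration, not the journal's stated review procedure.

```python
import hashlib


def sha256_of(content: bytes) -> str:
    """Hex digest of a file's content."""
    return hashlib.sha256(content).hexdigest()


def compare_files(described, deposited):
    """Compare described files with deposited ones.

    described: {filename: expected checksum}, as listed in the descriptor (assumed layout)
    deposited: {filename: file content as bytes}, as found in the repository
    """
    report = {}
    for name, expected in described.items():
        if name not in deposited:
            report[name] = "missing from repository"
        elif sha256_of(deposited[name]) != expected:
            report[name] = "checksum mismatch"
        else:
            report[name] = "ok"
    for name in deposited:
        if name not in described:
            report[name] = "not described in the Data Descriptor"
    return report


described = {"counts.csv": sha256_of(b"gene,count\nA,1\n"), "raw.dat": sha256_of(b"raw")}
deposited = {"counts.csv": b"gene,count\nA,1\n", "notes.txt": b"draft"}
for name, status in compare_files(described, deposited).items():
    print(name, "->", status)
```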
Current content is diverse - bimonthly releases
• Neuroscience, ecology, epidemiology, environmental science, functional 
genomics, metabolomics, toxicology etc. 
• New and previously published individual datasets, curated aggregations and 
citizen science: 
o a fuller, more in-depth look at the data processing steps, supported by 
additional data files and code from each step 
o additional tutorial-like information for scientists interested in reusing or 
integrating the data with their own 
• Datasets in figshare, Dryad and domain-specific databases 
• Code deposited in figshare and GitHub 
• First collection: 
Acknowledgements 
Advisory Boards and Collaborators 
Philippe Rocca-Serra, PhD 
Alejandra Gonzalez-Beltran, PhD 
Eamonn Maguire 
Milo Thurston, PhD 
Visit nature.com/scientificdata 
Email scientificdata@nature.com 
Tweet @ScientificData 
Honorary Academic Editor 
Susanna-Assunta Sansone, PhD 
Managing Editor 
Andrew L Hufton, PhD 
Editorial Curator 
Victoria Newman 
Advisory Panel and Editorial Board including 
senior researchers, funders, librarians and curators

NIH iDASH meeting on data sharing - BioSharing, ISA and Scientific Data
