Susanna Sansone presented on enabling reproducible bioscience data through biocuration standards. There is a growing movement for standards to allow data sharing and reuse. However, the many overlapping standards cause interoperability issues. Sansone's team developed the Investigation-Study-Assay (ISA) framework to provide a common format implementing various standards. Their exemplar project involves collaboratively curating life science experiments using ISA to make data comprehensible, interoperable, and reusable.
This document summarizes products and services from Essen BioScience including instrumentation, cell assays, reagents, cells, and discovery services. The instrumentation includes the IncuCyte ZOOM live-content imaging system for automated acquisition of phase-contrast and fluorescence images within a tissue-culture incubator. The CellPlayer assays, reagents and cells allow kinetic analysis of cellular processes such as proliferation, apoptosis, cytotoxicity, migration, invasion and angiogenesis over long periods of time without disturbing the cells. Discovery services include ion channel services, live-cell assay development and partnerships.
With advances in technology, enormous amounts of data have become available for bioscience researchers. While this high volume of information holds tremendous promise for expanding the science knowledge base, it must be organized for meaningful study. Bioinformatics is a discipline that devises methods for storing, distributing, and analyzing biological data used by diverse areas of research. Bioinformatics professionals develop software and tools that assist researchers in the analysis of data related to molecular biology and genome studies.
This document discusses data management and curation in bioinformatics. It describes Susanna-Assunta Sansone as the principal investigator and team leader at the University of Oxford e-Research Centre, where her team works on data management, biocuration, software development, databases, and community standards and ontologies for various domains including toxicology, health, and agriculture. The document promotes the importance of data standards to enable data sharing and reproducibility in bioscience research.
The document discusses reproducible bioscience data. It describes Susanna-Assunta Sansone as a principal investigator and team leader at the University of Oxford e-Research Centre who gives a presentation on policies, communities, and standards around reproducible bioscience data. The presentation covers topics like preserving institutional memory, utilizing public data, and addressing reproducibility and reuse of public data through community standards and structured data annotation.
This document discusses the ISA Commons project, which aims to facilitate sharing of life science experiments using a common structured representation. It does this by [1] using a format that can describe experiments across domains, [2] following community standards and norms, and [3] being implemented in curation and data sharing tools. The presentation outlines challenges around inconsistent reporting and many related standards, and describes how ISA Commons addresses these through its metadata tracking framework and software suite. This enables standardized experimental annotation and data sharing across a growing number of public resources and research groups.
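The Investigation-Study-Assay hierarchy that ISA Commons builds on can be pictured as a simple nested data model. The sketch below is illustrative only: the field names are invented for this example, and the real ISA-Tab/ISA-JSON specifications define a much richer set of attributes.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the Investigation > Study > Assay hierarchy.
# Field names are invented; they are NOT the actual ISA-Tab fields.

@dataclass
class Assay:
    measurement_type: str   # e.g. "transcription profiling"
    technology: str         # e.g. "DNA microarray"

@dataclass
class Study:
    title: str
    assays: list = field(default_factory=list)

@dataclass
class Investigation:
    identifier: str
    studies: list = field(default_factory=list)

inv = Investigation(identifier="INV-1")
study = Study(title="Growth under heat stress")
study.assays.append(Assay("transcription profiling", "DNA microarray"))
inv.studies.append(study)

# One investigation groups several studies; each study groups its assays,
# which is what lets one format describe experiments across domains.
print(len(inv.studies), len(inv.studies[0].assays))
```

The design point is simply that the nesting, not any domain-specific vocabulary, is what stays constant across experiment types.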
This document summarizes three key NISO initiatives aimed at improving discovery of electronic content: KBART (Knowledge Bases and Related Tools), which works to ensure timely and accurate data transfer to knowledge bases; ODI (Open Discovery Initiative), which develops recommendations and standards for open discovery; and PIE-J (Presentation & Identification of E-Journals), a recommended practice for the consistent presentation and identification of e-journal titles. It provides an overview of KBART, including its Phase II update and participating organizations.
1) Big data standards are needed to make data understandable, reusable, and shareable across different databases and domains.
2) Effective standards require reporting sufficient experimental details and context in both human-readable and machine-readable formats.
3) Developing standards is a collaborative process involving different stakeholder groups to define requirements, vocabularies, and data models through both formal standards bodies and grassroots organizations.
1) Traditional ontology research involved developing individual ontologies and annotating data with them, but now thousands of ontologies and huge amounts of data are available online.
2) This allows new opportunities for fundamental and applied research in automatically aligning data and ontologies to make sense of both, and mapping the landscape of semantics on the web.
3) Empirical studies are needed to understand ontology engineering practices by analyzing interconnected online ontologies, and how data and ontologies interact on the web.
Poster: Semantic data integration proof of concept (Nicolas Bertrand)
This document summarizes a proof of concept study that tested the ability to semantically integrate ecological data from different databases using the Socio-Ecological Research and Observation oNTOlogy (SERONTO). The study showed that SERONTO could successfully import database schemas and reference lists, map relations between database tables and SERONTO concepts, and allow complex queries across multiple connected databases from within SERONTO. However, maintaining mappings between reference lists and coupling value sets, units and calculations requires further work. Overall, the study demonstrated the feasibility of using SERONTO and semantic approaches to provide integrated access to distributed ecological data.
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration, a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships and naturally form a graph of biomedical terminology.
The Ontology Lookup Service (OLS) (http://www.ebi.ac.uk/ols) integrates over 160 ontologies and provides a central point for the biomedical community to query and visualise them. OLS also provides a RESTful API over the ontologies that is used in high-throughput data annotation pipelines. OLS is built on top of a Neo4j database that provides efficient indexes for extracting ontological relationships. We have developed generic tools for loading RDF/OWL ontologies into Neo4j, where the indexes are optimised for serving common ontology queries. We are now moving to adopt graph databases more widely in applications relating to ontology mapping prediction and recommendation systems for data annotation.
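The kind of traversal that a graph index over ontology relationships serves efficiently can be illustrated with a minimal in-memory sketch. This is plain Python, not the OLS/Neo4j implementation, and the tiny example ontology below is invented:

```python
# A toy "is_a" graph: each term maps to its direct parents.
# Real ontologies hold millions of such edges; a graph database
# indexes them so ancestor/descendant queries stay fast.
is_a = {
    "drug": ["chemical entity"],
    "antibiotic": ["drug"],
    "penicillin": ["antibiotic"],
}

def ancestors(term):
    """Return all transitive is_a ancestors of a term."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

print(sorted(ancestors("penicillin")))
# → ['antibiotic', 'chemical entity', 'drug']
```

An annotation pipeline uses exactly this query in reverse: data tagged with a specific term can be retrieved by any query on one of its ancestors.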
The document discusses JSTOR's local discovery integration pilot program, which aims to make JSTOR collections more discoverable outside of library systems. It provides statistics on where researchers begin their discovery process and emphasizes the need to meet researchers where they start. The pilot program partners JSTOR with various discovery platforms, including Summon, Primo, and EBSCO, to integrate JSTOR's collections into the search results. The goal is a "virtuous circle of access" by improving discoverability of local resources for both local and non-local user populations. Initial feedback from the pilot sites is also discussed.
The Evolution of e-Research: Machines, Methods and Music (David De Roure)
The document summarizes the evolution of e-research over three generations from 1981 to the present. The first generation saw early adopters using tools within their disciplines with some reuse. The second generation was characterized by increased reuse of tools, data and methods across areas. The third generation is defined by radical sharing of resources globally across any discipline through social networks and reusable research objects. The document also discusses several specific projects and tools that exemplify each generation of e-research including myExperiment, Galaxy, and SALAMI.
The Symbiotic Nature of Provenance and Workflow (Eric Stephan)
This document discusses the symbiotic relationship between provenance and workflows in scientific research. It notes that workflows provide automation and integration capabilities, while provenance provides documentation of what transpired. The document provides examples of workflow and provenance technologies and outlines challenges around interoperability. It concludes that recognizing the interdependent relationship between provenance and workflows can help advance systems science research.
Towards Incidental Collaboratories; Research Data Services (Anita de Waard)
This document discusses enabling "incidental collaboratories" by collecting and connecting biological research data through a centralized framework. It argues that biology research is currently quite isolated due to its small scale and competitive nature. The framework would involve storing experimental data with metadata, allowing analyses across similar experiment types and biological subjects, and preserving data long-term with access controls. This could help move labs from being isolated to being "sensors in a network" and address objections around data ownership and quality.
Specimen-level mining: bringing knowledge back 'home' to the Natural History ... (Ross Mounce)
A talk given at the Geological Society of London, UK on 2016/03/09 as part of the Lyell meeting on Palaeoinformatics. http://www.geolsoc.org.uk/lyell16 #lyell16
ICBO 2018 Poster - Current Development in the Evidence and Conclusion Ontolog... (dolleyj)
The Evidence & Conclusion Ontology (ECO) has been developed to provide standardized descriptions for types of evidence within the biological domain. Best practices in biocuration require that when a biological assertion is made (e.g. linking a Gene Ontology (GO) term for a molecular function to a protein), the type of evidence supporting it is captured. In recent development efforts, we have been working with other ontology groups to ensure that ECO classes exist for the types of curation they support. These include the Ontology for Microbial Phenotypes and GO. In addition, we continue to support user-level class requests through our GitHub issue tracker. To facilitate the addition and maintenance of new classes, we utilize ROBOT (a command-line tool for working with Open Biomedical Ontologies) as part of our standard workflow. ROBOT templates allow us to define classes in a spreadsheet and convert them to Web Ontology Language (OWL) axioms, which can then be merged into ECO. ROBOT is also part of our automated release process. Additionally, we are engaged in ongoing work to map ECO classes to Ontology for Biomedical Investigations classes using logical definitions. ECO is currently in use by dozens of groups engaged in biological curation, and the number of ECO users continues to grow. The ontology, in OWL and Open Biomedical Ontology (OBO) formats, and associated resources can be accessed through our GitHub site (https://github.com/evidenceontology/evidenceontology) as well as the ECO web page (http://evidenceontology.org/).
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION (IJwest)
With the growth of data-oriented research in the humanities, a large number of research datasets have been created and published through web services. However, discovering, integrating and reusing these distributed, heterogeneous research datasets is a challenging task. Ontology is the soul that connects digital humanities resources, providing a good way for people to discover and understand these datasets. With the release of more and more linked open data and knowledge bases, a large number of ontologies have been produced at the same time. These ontologies have different publishing formats, consumption patterns and interaction styles, which are not conducive to users' understanding of the datasets or to the reuse of the ontologies. The Ontology Service Center (OSC) platform consists of an Ontology Query Center and an Ontology Validation Center, mainly using linked data and ontology-based technologies. The Ontology Query Center realizes ontology publishing, querying, data interaction and online browsing, while the Ontology Validation Center can verify the status of certain ontologies as used in linked datasets. The empirical part of the paper uses the Confucius portrait as an example of how OSC can be used in the semantic annotation of images. In short, the purpose of this paper is to build an applied ecology of ontology that promotes the development of knowledge graphs and the spread of ontologies.
The document discusses making experimental data and methods more reproducible and accessible by providing structured metadata alongside narrative descriptions. It recommends using community standards and ontologies to semantically tag key information, and machine-readable formats to structure descriptions in a consistent way. Tools are proposed to help authors report structured information and curate it according to these standards to make data fully FAIR (findable, accessible, interoperable, reusable). The goal is to move from experiments that are difficult to reproduce to those that are "born reproducible".
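The structured-metadata idea can be sketched as a small machine-readable record in which each free-text value is paired with an ontology term identifier. The record schema below is hypothetical, invented for illustration; the term IDs follow the common PREFIX:ID convention of OBO-style ontologies:

```python
import json

# Hypothetical structured-metadata record: each human-readable value
# carries a machine-readable ontology term ID. The schema is invented
# for illustration, not a real community standard.
record = {
    "organism": {"text": "Homo sapiens", "term": "NCBITaxon:9606"},
    "assay":    {"text": "RNA-seq",      "term": "OBI:0001271"},
    "tissue":   {"text": "liver",        "term": "UBERON:0002107"},
}

def is_curated(rec):
    """A toy curation check: every field must carry a PREFIX:ID term."""
    return all(":" in v.get("term", "") for v in rec.values())

print(json.dumps(record, indent=2))
print(is_curated(record))
```

Pairing both representations is what makes the description simultaneously human-readable (the `text` values) and interoperable for machines (the `term` values), which is the core of the "born reproducible" argument.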
Where are we going and how are we going to get there? (David De Roure)
The document discusses the myExperiment virtual research environment for sharing workflows. Some key points:
1. myExperiment is a social network and repository for research workflows and methods. It currently has over 1800 users and hundreds of shared workflows.
2. The site allows fine-grained privacy controls, grouping of related content into "packs", and integration with other systems through federation.
3. Analysis found that most workflows and other content are shared publicly, and some users actively build upon other users' shared workflows. The most viewed workflow has over 1500 views.
4. The principles behind myExperiment's design focus on empowering scientists by enabling new forms of collaboration and sharing without forcing changes to workflows.
Taxonomies and Ontologies – The Yin and Yang of Knowledge Modelling (Semantic Web Company)
See how ontologies and taxonomies can play together to reach the ultimate goal, which is the cost-efficient creation and maintenance of an enterprise knowledge graph. The knowledge modelling methodology is supported by approaches taken from NLP, data science, and machine learning.
Invited talk at the European Research Council, Brussels (Scientific Seminar, 12 April 2013): "Love for Science or 'academic prostitution'". In this talk I present a personal review (at times my own vision) of some issues that I consider key for doing science. It was tailored to the expected audience, mainly Scientific Officers with backgrounds in different fields of science and scholarship, but also Agency staff.
Abstract: In a recent Special Issue of Nature on science metrics it was claimed that "Research reverts to a kind of 'academic prostitution' in which work is done to please editors and referees rather than to further knowledge." If this is true, funding agencies should try to avoid falling into the trap of their own system. By perpetuating this 'prostitution' they risk funding not the best research but the best-sold research.
Given the current epoch of economic crisis, in which the quest for funds forces researchers into a competitive game of pandering to panelists, it seems a good time for deep reflection on the entire scientific system.
With this talk I aim to provoke extra critical thinking among the committees who select evaluators, and among the evaluators, who in turn demand critical thinking from the candidates when selecting excellent science.
I will present some initiatives (e.g. new tracers of impact for the Web era, 'altmetrics') and ongoing projects (e.g. how to move from publishing advertising to publishing knowledge) that might enable us to favor science over marketing.
Here are the key points about Husserl's phenomenology that are relevant to understanding the epistemological assumptions of the alternative theory presented in this dissertation:
- Phenomenology aims to study phenomena (things as they appear in our experience) rather than things as they exist independently of us. It focuses on our subjective experience of the world rather than making claims about an objective reality.
- The natural attitude refers to our normal, taken-for-granted way of perceiving and interacting with the world. We see things as stable objects that exist independently of us. Phenomenology asks us to suspend or "bracket" this natural attitude in order to study phenomena as they appear to consciousness.
- Intentionality refers
This document summarizes a presentation given by Susanna Sansone at the GSC 23rd meeting education day in Bangkok, Thailand on August 7, 2023. The presentation discussed standards across the life sciences, including definitions of different types of standards and over 1,600 identified standards. It covered standards organizations and grassroots groups, as well as the FAIRsharing database, which catalogs over 2,885 standards and databases and aims to promote their use and value across research.
1) Traditional ontology research involved developing individual ontologies and annotating data with them, but now thousands of ontologies and huge amounts of data are available online.
2) This allows new opportunities for fundamental and applied research in automatically aligning data and ontologies to make sense of both, and mapping the landscape of semantics on the web.
3) Empirical studies are needed to understand ontology engineering practices by analyzing interconnected online ontologies, and how data and ontologies interact on the web.
Poster Semantic data integration proof of conceptNicolas Bertrand
This document summarizes a proof of concept study that tested the ability to semantically integrate ecological data from different databases using the Socio-Ecological Research and Observation oNTOlogy (SERONTO). The study showed that SERONTO could successfully import database schemas and reference lists, map relations between database tables and SERONTO concepts, and allow complex queries across multiple connected databases from within SERONTO. However, maintaining mappings between reference lists and coupling value sets, units and calculations requires further work. Overall, the study demonstrated the feasibility of using SERONTO and semantic approaches to provide integrated access to distributed ecological data.
Data integration is intrinsic to how modern research is undertaken in areas such as genomics, drug development and personalised medicine. To better enable this integration a large number of biomedical ontologies have been developed to provide standard semantics for describing metadata. There are now several hundred biomedical ontologies in widespread use that describe concepts such as genes, molecules, drugs and diseases. This amounts to millions of terms that are interconnected via relationships that naturally form a graph of biomedical terminology.
The Ontology Lookup Service (OLS) (http://www.ebi.ac.uk/ols) integrates over 160 ontologies and provide a central point for the biomedical community to query and visualise ontologies. OLS also provide a RESTful API over the ontologies that is used in high-throughput data annotation pipelines. OLS is built on top of a Neo4j database that provides efficient indexes for extracting ontological relationships. We have developed generic tools for loading RDF/OWL ontologies into Neo4j where the indexes are optimised for serving common ontology queries. We are now moving to adopt graph database more widely in applications relating to ontology mapping prediction and recommendation systems for data annotation.
The document discusses JSTOR's local discovery integration pilot program, which aims to make JSTOR collections more discoverable outside of library systems. It provides statistics on where researchers begin their discovery process and emphasizes the need to meet researchers where they start. The pilot program partners JSTOR with various discovery platforms, including Summon, Primo, and EBSCO, to integrate JSTOR's collections into the search results. The goal is a "virtuous circle of access" by improving discoverability of local resources for both local and non-local user populations. Initial feedback from the pilot sites is also discussed.
The Evolution of e-Research: Machines, Methods and MusicDavid De Roure
The document summarizes the evolution of e-research over three generations from 1981 to the present. The first generation saw early adopters using tools within their disciplines with some reuse. The second generation was characterized by increased reuse of tools, data and methods across areas. The third generation is defined by radical sharing of resources globally across any discipline through social networks and reusable research objects. The document also discusses several specific projects and tools that exemplify each generation of e-research including myExperiment, Galaxy, and SALAMI.
The Symbiotic Nature of Provenance and WorkflowEric Stephan
This document discusses the symbiotic relationship between provenance and workflows in scientific research. It notes that workflows provide automation and integration capabilities, while provenance provides documentation of what transpired. The document provides examples of workflow and provenance technologies and outlines challenges around interoperability. It concludes that recognizing the interdependent relationship between provenance and workflows can help advance systems science research.
Towards Incidental Collaboratories; Research Data ServicesAnita de Waard
This document discusses enabling "incidental collaboratories" by collecting and connecting biological research data through a centralized framework. It argues that biology research is currently quite isolated due to its small scale and competitive nature. The framework would involve storing experimental data with metadata, allowing analyses across similar experiment types and biological subjects, and preserving data long-term with access controls. This could help move labs from being isolated to being "sensors in a network" and address objections around data ownership and quality.
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Ross Mounce
A talk given at the Geological Society of London, UK on 2016/03/09 as part of the Lyell meeting on Palaeoinformatics. http://www.geolsoc.org.uk/lyell16 #lyell16
ICBO 2018 Poster - Current Development in the Evidence and Conclusion Ontolog...dolleyj
The Evidence & Conclusion Ontology (ECO) has been developed to provide standardized descriptions for types of evidence within the biological domain. Best
practices in biocuration require that when a biological assertion is made (e.g. linking a Gene Ontology (GO) term for a molecular function to a protein), the type of evidence
supporting it is captured. In recent development efforts, we have been working with other ontology groups to ensure that ECO classes exist for the types of curation they
support. These include the Ontology for Microbial Phenotypes and GO. In addition, we continue to support user-level class requests through our GitHub issue tracker. To
facilitate the addition and maintenance of new classes, we utilize ROBOT (a command line tool for working with Open Biomedical Ontologies) as part of our standard workflow.
ROBOT templates allow us to define classes in a spreadsheet and convert them to Web Ontology Language (OWL) axioms, which can then be merged into ECO. ROBOT is
also part of our automated release process. Additionally, we are engaged in ongoing work to map ECO classes to Ontology for Biomedical Investigation classes using logical
definitions. ECO is currently in use by dozens of groups engaged in biological curation and the number of ECO users continues to grow. The ontology, in OWL and Open
Biomedical Ontology (OBO) formats, and associated resources can be accessed through our GitHub site (https://github.com/evidenceontology/evidenceontology) as well as
the ECO web page (http://evidenceontology.org/).
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATIONIJwest
With the growth of data-oriented research in humanities, a large number of research datasets have been
created and published through web services. However, how to discover, integrate and reuse these distributed
heterogeneous research datasets is a challenging task. Ontology is the soul between series digital humanities
resources, which provides a good way for people to discover and understand these datasets. With the release
of more and more linked open data and knowledge bases, a large number of ontologies have been produced
at the same time. These ontologies have different publishing formats, consumption patterns, and interactions
ways, which are not conductive to the user’s understanding of the datasets and the reuse of the ontologies.
The Ontology Service Center platform consists of Ontology Query Center and Ontology Validation Center,
mainly using linked data and ontology-based technologies. The Ontology Query Center realizes the functions
of ontology publishing, querying, data interaction and online browsing, while the Ontology Validation
Center can verify the status of using certain ontologies in the linked datasets. The empirical part of the paper
uses the Confucius portrait as an example of how OSC can be used in the semantic annotation of images. In
a word, the purpose of this paper is to construct the applied ecology of ontology to promote the development
of knowledge graphs and the spread of ontology.
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION dannyijwest
With the growth of data-oriented research in humanities, a large number of research datasets have been
created and published through web services. However, how to discover, integrate and reuse these distributed
heterogeneous research datasets is a challenging task. Ontology is the soul between series digital humanities
resources, which provides a good way for people to discover and understand these datasets. With the release
of more and more linked open data and knowledge bases, a large number of ontologies have been produced
at the same time
The document discusses making experimental data and methods more reproducible and accessible by providing structured metadata alongside narrative descriptions. It recommends using community standards and ontologies to semantically tag key information, and machine-readable formats to structure descriptions in a consistent way. Tools are proposed to help authors report structured information and curate it according to these standards to make data fully FAIR (findable, accessible, interoperable, reusable). The goal is to move from experiments that are difficult to reproduce to those that are "born reproducible".
Where are we going and how are we going to get there? (David De Roure)
The document discusses the myExperiment virtual research environment for sharing workflows. Some key points:
1. myExperiment is a social network and repository for research workflows and methods. It currently has over 1800 users and hundreds of shared workflows.
2. The site allows fine-grained privacy controls, grouping of related content into "packs", and integration with other systems through federation.
3. Analysis found that most workflows and other content are shared publicly, and some users actively build upon other users' shared workflows. The most viewed workflow has over 1500 views.
4. The principles behind myExperiment's design focus on empowering scientists by enabling new forms of collaboration and sharing without forcing changes to workflows.
Taxonomies and Ontologies – The Yin and Yang of Knowledge Modelling (Semantic Web Company)
See how ontologies and taxonomies can play together to reach the ultimate goal, which is the cost-efficient creation and maintenance of an enterprise knowledge graph. The knowledge modelling methodology is supported by approaches taken from NLP, data science, and machine learning.
Invited talk at the European Research Council, Brussels (Scientific Seminar, 12 April 2013): "Love for Science or 'academic prostitution'". In this talk I present a personal revision (sometimes my own vision) of some issues that I consider key to doing science. It was tailored to the expected audience: mainly Scientific Officers with backgrounds in different fields of science and scholarship, but also Agency staff.
Abstract: In a recent special issue of Nature on science metrics it was claimed that "Research reverts to a kind of 'academic prostitution' in which work is done to please editors and referees rather than to further knowledge." If this is true, funding agencies should try to avoid falling into the trap of their own system. By perpetuating this 'prostitution' they risk funding not the best research but the best-sold research.
In the current epoch of economic crisis, where in the quest for funds researchers are forced into a competitive game of pandering to panelists, it seems a good time for deep reflection on the entire scientific system.
With this talk I aim to provoke extra critical thinking among the committees who select evaluators, and among the evaluators, who in turn should demand critical thinking from candidates when selecting excellent science.
I will present some initiatives (e.g., new tracers of impact for the Web era, 'altmetrics') and ongoing projects (e.g., how to move from publishing advertising to publishing knowledge) that might enable us to favour science over marketing.
Here are the key points about Husserl's phenomenology that are relevant to understanding the epistemological assumptions of the alternative theory presented in this dissertation:
- Phenomenology aims to study phenomena (things as they appear in our experience) rather than things as they exist independently of us. It focuses on our subjective experience of the world rather than making claims about an objective reality.
- The natural attitude refers to our normal, taken-for-granted way of perceiving and interacting with the world. We see things as stable objects that exist independently of us. Phenomenology asks us to suspend or "bracket" this natural attitude in order to study phenomena as they appear to consciousness.
- Intentionality refers to the directedness of consciousness: every conscious act is an experience of or about some object.
Similar to "Sa sansone dccroadshow-nov2012: Delivering reproducible bioscience data by enabling biocuration at the source" (20)
This document summarizes a presentation given by Susanna Sansone at the GSC 23rd meeting education day in Bangkok, Thailand on August 7, 2023. The presentation discussed standards across life sciences, including definitions of different types of standards and over 1,600 identified standards. It covered standard organizations and grassroots groups, as well as the FAIRsharing database which catalogs over 2,885 standards and databases and aims to promote their use and value across research.
The FAIRsharing journey in RDA document discusses:
1) FAIRsharing's growth and involvement with RDA since 2011, including its Working Group established in 2015 to curate standards, databases, and policies to promote FAIR data.
2) FAIRsharing's current activities and impact, such as its registry of over 4,000 records from many disciplines and usage in various tools and services.
3) Opportunities for further engagement with RDA, such as leveraging their expertise for contributions to the FAIR Cookbook, an open resource providing technical recipes for applying FAIR principles to life science data.
Overview of metadata standards, and how FAIRsharing and the FAIR Cookbook help with selecting and using them. Presentation "What is metadata? Common standards and properties" at the EHP Workshop, November 9, 2022: https://ephconference.eu/pre-conference-programme-441
Pharmas and academia are joining forces to make data FAIR (Findable, Accessible, Interoperable, and Reusable) through the development of the FAIR Cookbook. The FAIR Cookbook provides over 70 recipes and growing that give step-by-step guidance on improving the FAIRness of different data types through the use of tools, technologies, and best practices. It aims to provide practical examples and guidelines to support researchers, data managers, and others in managing data according to FAIR principles. The FAIR Cookbook is an open, community-developed resource overseen by an editorial board, with contributions from nearly 100 life sciences professionals.
FAIR, community standards and data FAIRification: components and recipes (Susanna-Assunta Sansone)
Overview of FAIR, FAIRsharing and the FAIR Cookbook at the ATI event on Knowledge Graphs: https://github.com/turing-knowledge-graphs/meet-ups/blob/main/symposium-2022.md
Presentation to the EOSC workshop on policies (https://www.google.com/url?q=https://eoscfuture.eu/eventsfuture/monitoring-eosc-readiness-fair-data-policies) on what FAIRsharing does for policies, including providing registration, discovery, flexible and clearer descriptions, relationships, machine readability and comparability.
The document summarizes how FAIRsharing assists others with promoting FAIR data principles without directly assessing FAIRness compliance. It does this by (1) providing a lookup service for standards and repositories via its API, (2) serving as a registry for FAIRness tests and indicators to make them discoverable, and (3) enabling communities to create profiles declaring which standards and repositories they use. The document also outlines FAIRsharing's operations, advisory boards, and future plans to further support assessment and tracking of FAIRness improvements over time.
ELIXIR is a European infrastructure that brings together life science resources from across Europe. It offers databases, tools, computing capabilities, and training opportunities. ELIXIR nodes provide these services and connect national data infrastructures. ELIXIR communities connect infrastructure experts to drive service developments. ELIXIR is funded through a mixed model including public sources. It works to sustain important biological data resources and make data FAIR through recommended standards and interoperability resources. ELIXIR also aims to develop a sustainable tools ecosystem and provides training through its portal.
Presentation to the EC Workshop on Maximizing investments in health research: FAIR data for a coordinated COVID-19 response. Workshop III, November 8, 2021.
Presentation to the EC Workshop on Maximizing investments in health research: FAIR data for a coordinated COVID-19 response. Workshop I, October 11, 2021.
The FAIR Cookbook poster, as presented at the ELIXIR-UK Node and the UK Conference of Bioinformatics and Computational Biology 2021: https://www.earlham.ac.uk/uk-conference-bioinformatics-and-computational-biology-21
The FAIR Cookbook poster, as presented at the UK Conference of Bioinformatics and Computational Biology 2021: https://www.earlham.ac.uk/uk-conference-bioinformatics-and-computational-biology-21
Sa sansone dccroadshow-nov2012: Delivering reproducible bioscience data by enabling biocuration at the source
1. www.slideshare.net/SusannaSansone
Delivering reproducible bioscience data by enabling
biocuration at the source
Susanna-Assunta Sansone, PhD
Principal Investigator and Team Leader,
University of Oxford e-Research Centre, Oxford, UK
Academic Consultant, Open Access Data Products,
Nature Publishing Group
Digital Curation Centre (DCC)
13th Regional Data Management Roadshow, London, 20 November 2012
3. University of Oxford e-Research Centre
• Providing research computing, high-performance computing
• Integrating with national and international infrastructure
• Supporting leading-edge facilities through education and training
4. University of Oxford e-Research Centre
Collaborating with European and wider international groups in, e.g.:
• energy,
• radio astronomy,
• biological data federation,
• life sciences simulation,
• biodiversity,
• computational chemistry,
• neuroscience,
• digital humanities tools,
• digital music analysis,
• visualization,
• …
5. My team’s activities and the stakeholders we work with
Data management and biocuration; collaborative development of software and databases; standards and ontologies, in domains including:
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
6. Outline
“The buzz around reproducible bioscience data:
the communities and the standards”
“The reality from the buzz:
challenges and exemplar project”
8. COMPREHENSIBLE
[Word-art slide; photo: http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY]
9. COMPREHENSIBLE, INTEROPERABLE
10. COMPREHENSIBLE, INTEROPERABLE, REUSABLE
11. experimental design
sample characteristic(s)
experimental variable(s)
technology(ies)
measurement(s)
protocol(s)
data file(s)
…
11 The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone
www.ebi.ac.uk/net-project
12. § We must strike a balance between
• depth and breadth of information; and
• sufficient information required to reuse the data
§ Capture all salient features of the experimental workflow
§ Make annotation explicit and discoverable
§ Structure the descriptions for consistency and tracking
13. Growing, worldwide movement for reproducible research
• Esoteric formats: comprehensible?
• Lack of sufficient contextual information: interoperable?
• Ad hoc or proprietary terminologies: reusable?
Source: http://ebbailey.wordpress.com
§ Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that community-developed standards are pivotal to structure and enrich the annotation of
• entities of interest (e.g., genes, metabolites, phenotypes) and
• experimental steps (e.g., provenance of study materials, technology and measurement types)
14. Community mobilization to develop standards, e.g.:
• use the same word and refer to the same ‘thing’
• allow data to flow from one system to another
• report the same core, essential information
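As a toy illustration of the first of these goals ("use the same word and refer to the same 'thing'"), a shared terminology collapses different labs' synonyms onto a single identifier. The mapping table below is invented for the example, although NCIT:C3262 is the real NCI Thesaurus identifier for "Neoplasm".

```python
# Minimal sketch of what a shared terminology buys you: synonyms resolve
# to one identifier, so datasets using different spellings can integrate.
# The SYNONYMS table is illustrative, not from any ontology release.

SYNONYMS = {
    "tumour": "NCIT:C3262",   # Neoplasm (NCI Thesaurus)
    "tumor": "NCIT:C3262",
    "neoplasm": "NCIT:C3262",
}

def canonical_id(label):
    """Map a free-text label to its shared ontology identifier, if known."""
    return SYNONYMS.get(label.lower())

# Two datasets using different spellings now agree on one ID.
assert canonical_id("Tumor") == canonical_id("tumour")
```

Exchange formats and reporting checklists play the analogous role for the other two goals: a common container for the data, and an agreed list of what must be reported.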
15. Is this general mobilization good or bad?
• use the same word and refer to the same ‘thing’
• allow data to flow from one system to another
• report the same core, essential information
§ Fragmentation of the standards is a major issue
• Being focused on particular communities’ interests, be they individual technologies or biological/biomedical disciplines, leads to duplication of effort and, more seriously, the development of (largely arbitrarily) different standards
• This severely hinders the interoperability of databases and tools and, ultimately, the integration of datasets
18. But how much do we know about these standards?
• Which tools and databases implement which standards?
• I use high-throughput sequencing technologies; which ones are applicable to me?
• How can I get involved to propose extensions or modifications?
• What are the criteria to evaluate their status and value?
• Which ones are mature enough for me to use or recommend?
• I work on plants; are these just for biomedical applications?
19. A catalogue to map the landscape of standards and the systems implementing them:
Over 400 bio-standards (public and in curation)
Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598
20. • A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
22. Social engineering
23. Ownership of open standards can be problematic in broad, grass-roots collaborations; it requires improved models to encourage maintenance of and contributions to these efforts, supporting their evolution
24. The extensive community liaison needs to be managed and funded; rewards and incentives need to be identified for all contributors
25. The cost of implementing a standards-supported data sharing vision is as large as the number of stakeholders that must operate synchronously
32. [Diagram] Community Standards + Software Tools → Well-annotated & Structured Data → Reproducible & Reusable Bioscience Research
(enabling reasoning, visualization, analysis, browsing, integration, exchange and retrieval)
33. An exemplar approach to the status quo
§ A grass-roots collaboration that works to facilitate the collection, curation and sharing of experiments using a common, structured representation of the experiments that
• transcends individual biological and technological domains and
• can be ‘configured’ to implement (several of) the community standards
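A rough sketch of what such a common, configurable representation looks like in practice: ISA-Tab expresses the experimental graph (source → protocol → sample) as tab-delimited tables with standard column headings. The file content below is a toy example written with Python's csv module, not output from the real ISA tools, and the sample names are invented.

```python
# Toy sketch of an ISA-Tab-style study table. The column headings follow
# the ISA-Tab convention (Source Name, Characteristics[...], Protocol REF,
# Sample Name); the rows and protocol name are invented for illustration.
import csv
import io

header = ["Source Name", "Characteristics[organism]", "Protocol REF", "Sample Name"]
rows = [
    ["animal1", "Mus musculus", "liver extraction", "animal1.liver"],
    ["animal2", "Mus musculus", "liver extraction", "animal2.liver"],
]

# Write the tab-delimited study table to an in-memory buffer.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
study_table = buf.getvalue()

# Because the layout is standard, an ISA-aware tool can recover the
# sample-to-source relationships without custom parsing.
```

The same tabular skeleton can then be "configured" with the terminologies and reporting checklists of a given community, which is what lets one framework serve many domains.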
34. [Diagram: the ISA metadata tracking framework and its user community]
36. A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• also by communities working to build a library of cellular signatures
TOWARDS INTEROPERABLE BIOSCIENCE DATA, Feb 2012
Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios J, Hide W.
40. Implementation at the EBI
41. Extensions of the Nanotechnology Informatics Working Group
42. We must increase the level of annotation
• Notes in lab books (information for humans)
• Spreadsheets and tables (the compromise)
• Facts as RDF statements (information for machines)
§ Invest in curating and managing data at the source, using:
• a common metadata tracking framework, such as ISA
• publicly available and community-developed terminologies
• recording sufficient contextual information about the experimental steps
§ Progressively, datasets will become more comprehensible, interoperable, reproducible and (re)usable, underpinning future investigations
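As a sketch of the spreadsheet-to-RDF end of that spectrum, the same tabular record can be broken into subject-predicate-object statements that a machine can query directly. The URIs, field names and helper below are invented placeholders, not any real vocabulary.

```python
# Sketch of "facts as RDF statements": one spreadsheet-style row expressed
# as subject-predicate-object triples. All URIs here are invented
# placeholders under example.org; real data would use shared vocabularies.

row = {
    "sample": "animal1.liver",
    "organism": "Mus musculus",
    "assay": "metabolite profiling",
}

def row_to_triples(row, base="http://example.org/"):
    """Turn a flat record into (subject, predicate, object) statements."""
    subject = base + row["sample"].replace(".", "/")
    return [
        (subject, base + "organism", row["organism"]),
        (subject, base + "assayType", row["assay"]),
    ]

triples = row_to_triples(row)
for s, p, o in triples:
    print(s, p, o)
```

The spreadsheet remains the practical compromise for humans; the point of the triples is that each fact becomes individually addressable and machine-queryable.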
43. Collaborative approaches are highly valuable but take time
Development timeline, 2007–2012:
Community involvement and uptake:
• 1st, 2nd and 3rd ISA-Tab workshops
• other tools implement ISA-Tab
• user workshops and visits start
• 1st public instance: Harvard Stem Cell Discovery Engine
• a growing number of systems starts to adopt the ISA framework
Core developments:
• strawman ISA-Tab spec, then final ISA-Tab spec
• ISA software v1
• conversions to PRIDE-XML, SRA-XML, MAGE-Tab and more
• database instance at the EBI
• links to analysis tools start
• RDF format starts
Publications:
• ISA-Tab and workshop reports
• ISA software suite (Bioinformatics)
• Omics data sharing (Science)
• Stem Cell Discovery Engine (NAR)
• ISA Commons (Nature Genetics)