This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Data Management Team Lead, Software Engineering Group
1.
What was the plan?
A role for data standards, models and computational
workflows in scholarly data publishing
Alejandra González-Beltrán, PhD
Philippe Rocca-Serra, PhD
Oxford e-Research Centre, University of Oxford
{alejandra.gonzalezbeltran,philippe.rocca-serra}@oerc.ox.ac.uk
ISMB Workshop: What Bioinformaticians need to know about digital publishing beyond the PDF2
July 15th, 2014, Boston, USA
2.
The experimental workflow
[Figure: workflow diagram with the labels Data Scientist, Planning, Data Collection (use existing data or perform new experiment), Data Management, Analysis, Visualization and Publication]
3.
The experimental workflow
[Figure: the workflow diagram (Data Scientist, Planning, Data Collection, Data Management, Analysis, Visualization, Publication), annotated with "metadata"]
5.
The experimental workflow
[Figure: the workflow diagram (Data Scientist, Planning, Data Collection, Data Management, Analysis, Visualization, Publication), annotated with "Data Interoperability", "Reproducibility" and "Data Review"]
6.
The experimental workflow
[Figure: two copies of the workflow diagram (Data Scientist, Planning, Data Collection, Data Management, Analysis, Visualization, Publication), shown with the label "Data Reusability"]
7.
The experimental plan - life sciences case
experimental design
sample characteristic(s)
experimental variable(s)
2-week systemic rat study using male Wistar rats (N=15 per dose group)
14 proprietary drug candidates from participating companies and 2 reference toxic compounds
InnoMed PredTox Project
8.
The experimental plan - life sciences case
experimental design
sample characteristic(s)
experimental variable(s)
technology(s)
measurement(s)
protocol(s)
data file(s)
…
9.
The experimental plan - computational case
• open peer-review
• availability of:
  • data
  • analysis scripts
  • documentation
Evaluation of the SOAPdenovo2 tool for de novo genome assembly from short DNA segment reads produced by next-generation sequencing. SOAPdenovo2 implements improvements over the SOAPdenovo1 assembler.
10.
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type):
• genome assembly algorithm
• genome size
11.
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type):
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALLPATHS-LG
• genome size
12.
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type):
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALLPATHS-LG
• genome size: bacterial genome, insect genome, human genome
13.
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type):
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALLPATHS-LG
• genome size: bacterial genome, insect genome, human genome
3x3 factorial design: each algorithm is run on each genome size, giving 9 study groups
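The 3x3 factorial design above can be enumerated mechanically. The following is a minimal sketch, not the authors' tooling: the `factors` dictionary and the `study_groups` helper are hypothetical names introduced here, and the factor levels are copied from the slide (using ALLPATHS-LG, the assembler's usual spelling).

```python
from itertools import product

# Factor levels taken from the slide; the dictionary layout is an assumption.
factors = {
    "genome assembly algorithm": ["SOAPdenovo2", "SOAPdenovo1", "ALLPATHS-LG"],
    "genome size": ["bacterial genome", "insect genome", "human genome"],
}


def study_groups(factors):
    """Return one dict per combination of factor levels (full factorial)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


groups = study_groups(factors)
for number, group in enumerate(groups, start=1):
    print(number, group)
```

Each printed dictionary is one study group. Adding a third factor to `factors` would multiply the number of groups by that factor's number of levels.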
14.
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type):
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALLPATHS-LG
• genome size:
  • bacterial genome: S. aureus, R. sphaeroides
  • insect genome: B. impatiens
  • human genome: Chinese Han genome (or YH genome)
15.
The experimental plan - computational case
Predictor Variables (Factor Name, Factor Type):
• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALLPATHS-LG
• genome size: bacterial genome, insect genome, human genome
Response Variables:
• genome coverage
• computation run time
• memory consumption
17.
A growing ecosystem of over 30 public and internal resources uses the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics
It is also used by communities working to build a library of cellular signatures.
18.
General-purpose, configurable format designed to support:
• description of the experimental metadata, making the annotation explicit and discoverable
• provenance tracking
• use of community standards, such as minimal reporting guidelines and terminologies
• conversion to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories
19.
[Figure: example experiment, shown in tabular form and as a graph. Values shown: organism H. sapiens; sources H1 (35 years) and H2 (33 years); samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps producing H1.sample1.labeled and H2.sample1.labeled; Scanning steps producing the data files h1-s1.cel, h1-s2.cel and h2-s1.cel]
20.
[Figure: the same example experiment in tabular form and as a graph (organism H. sapiens; sources H1, 35 years, and H2, 33 years; samples H1.sample1, H1.sample2 and H2.sample1; Labeling steps producing H1.sample1.labeled and H2.sample1.labeled; Scanning steps producing h1-s1.cel, h1-s2.cel and h2-s1.cel), with ontology annotations: obi:material entity, obi:material sample, obi:material processing, obi:processed material, obi:planned process, isa:raw data file, bfo:derives from]
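The "derives from" links in this example can be followed from a data file back to its source. The sketch below is illustrative only: the `derives_from` mapping and the `lineage` helper are hypothetical names, and only the two chains whose node names match unambiguously on the slide are encoded (the pairing of files to labeled samples is an assumption based on those names).

```python
# Each node maps to the node it derives from; None marks a source.
# Assumption: chains are inferred from the matching names on the slide.
derives_from = {
    "H1": None,
    "H1.sample1": "H1",
    "H1.sample1.labeled": "H1.sample1",
    "h1-s1.cel": "H1.sample1.labeled",
    "H2": None,
    "H2.sample1": "H2",
    "H2.sample1.labeled": "H2.sample1",
    "h2-s1.cel": "H2.sample1.labeled",
}


def lineage(node, graph):
    """Walk derives-from links from a node back to its source."""
    path = [node]
    while graph[path[-1]] is not None:
        path.append(graph[path[-1]])
    return path


print(" <- ".join(lineage("h1-s1.cel", derives_from)))
# prints: h1-s1.cel <- H1.sample1.labeled <- H1.sample1 <- H1
```

Storing one parent per node keeps the sketch short. A real provenance graph may need several parents per node, for example when samples are pooled.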
23.
• Experimental metadata or structured component (in-house curated, machine-readable formats)
• Article or narrative component (PDF and HTML)
A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these.
• Credit for sharing your data
• Focused on reuse and reproducibility
• Peer reviewed, curated
• Promoting Community Data Repositories
• Open Access
26.
SOAPdenovo2
http://isa-tools.github.io/soapdenovo2
Galaxy workflows to re-enact the data analysis
27.
http://isa-tools.github.io/soapdenovo2
SOAPdenovo2
Nanopub: represents structured data along with its provenance in a single publishable and citable entity
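A nanopublication bundles an assertion with its provenance and publication information. The following is a rough sketch of that idea in plain Python rather than RDF: the `Nanopub` class, the `ex:` identifiers and the predicate names are all hypothetical placeholders, and the example assertion paraphrases the SOAPdenovo2 finding quoted later in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Nanopub:
    """Hypothetical container: an assertion plus provenance and publication info.

    Each part is a list of (subject, predicate, object) string triples.
    """
    uri: str
    assertion: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    pubinfo: list = field(default_factory=list)

    def is_complete(self):
        """True when every one of the three parts holds at least one triple."""
        return all([self.assertion, self.provenance, self.pubinfo])


# All identifiers below are made-up placeholders, not real vocabulary terms.
claim = Nanopub(
    uri="ex:nanopub1",
    assertion=[("ex:SOAPdenovo2", "ex:increasesGenomeCoverageOver", "ex:SOAPdenovo1")],
    provenance=[("ex:nanopub1#assertion", "ex:derivedFrom", "ex:soapdenovo2-study")],
    pubinfo=[("ex:nanopub1", "ex:createdBy", "ex:some-author")],
)
print(claim.is_complete())
```

Keeping the three parts separate is what lets the assertion be cited together with where it came from and who published it.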
28.
http://isa-tools.github.io/soapdenovo2
SOAPdenovo2
ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects
29.
Reproducing SOAPdenovo2 results
Galaxy workflows
S. aureus pipeline
33.
“genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”
Response Variables:
• genome coverage
• memory consumption
34.
OntoMaton: a Bioportal powered ontology widget for Google Spreadsheets
Maguire et al., 2013, Bioinformatics
Widget for ontology annotation and tagging on Google spreadsheets, relying on BioPortal and Linked Open Vocabularies services
35.
NanoMaton https://github.com/ISA-tools/NanoMaton
Ontology for Biomedical Investigations
Semanticscience Integrated Ontology
36.
[Figure: the workflow diagram (Data Scientist, Planning, Data Collection, Data Management, Analysis, Visualization, Publication)]
Findable, Accessible, Interoperable, Reusable: FAIR data
37.
Contributing to MetaboLights and ISA
• BBSRC UK-China Award & BGI funded Hackathon
• Venue: BGI Hong Kong
• Participants: MetaboLights / BGI / ISA / Birmingham / Hong Kong University
• Outcome:
  • ISA-Tab web viewer code
  • Functional Specifications & Code for DoE Wizard API
  • 4 datasets coded in ISA format
  • Conversion of MetaboLights datasets to RDF
38.
funders
acknowledgements
Scott Edmunds, GigaScience
Peter Li, GigaScience
Jun Zhao, Lancaster University
María Susana Avila García, Oxford University
Marco Roos, Leiden University
Mark Thompson, Leiden University
Ruibang Luo, University of Hong Kong
Tin-Lap Lee, Chinese University of Hong Kong
Tak-wah Lam, University of Hong Kong
39.
Questions?
You can email us...
isatools@googlegroups.com
View our blog
http://isatools.wordpress.com
Follow us on Twitter
@isatools
View our websites
View our Git repo & contribute
http://github.com/ISA-tools
Thanks for your attention!