1. You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of;
This presentation. Provided that:
You attribute the work to its author and respect the rights
and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero.
Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at:
http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
3. Disclaimer
• I do not (and will not) profit in any way, shape
or form, from any of the brands, products or
companies I may mention in this
presentation.
4. Data availability and re‐usability in the
transition from microarray to next‐generation
sequencing: can we do better?
B.F. Francis Ouellette
• Senior Scientist & Associate Director, Informatics and
Biocomputing, Ontario Institute for Cancer Research,
Toronto, ON
• Associate Professor, Department of Cell and Systems
Biology, University of Toronto, Toronto, ON.
@bffo on Twitter
5. Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette,
Alvis Brazma and the Functional Genomics Data Society
http://fged.org
• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK - MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN
6. FGED’s mission:
To be a positive agent of
change in the effective
sharing and reproducibility
of functional genomic data
Poster # 142 (Friday)
fged.org
7. I come here wearing many hats!
• Officer of FGED
• Data submitter to a large international cancer
genomics initiative
• Receiving and curating data from that same
initiative from 67 cancer genome projects.
• Editor in an #openaccess journal where we are just
now rewriting the data submission policy to ensure
reproducibility
• Associate Editor of an #OA DATABASE journal
• Also on the SAB of Galaxy and Genomespace
8. What do we do with this?
FGED
(Functional Genomics Data Society)
was
MGED
(Microarray Gene Expression
Data Society)
9. we evaluated the replication of data analyses in 18 articles on
microarray-based gene expression profiling. (…) We reproduced
two analyses in principle and six partially or with some
discrepancies; ten could not be reproduced. The main reason
for failure to reproduce was data unavailability, and discrepancies
were mostly due to incomplete data annotation or specification of
data processing and analysis. Repeatability of published
microarray studies is apparently limited. More strict publication
rules enforcing public data availability and explicit description of
data processing and analysis should be considered.
10. Does it matter?
• Ioannidis et al. (2009) were not saying that
the papers were wrong.
• But there were problems
– missing data (38%)
– missing software, hardware details (50%)
– missing method, processing details (66%)
11. … forensic bioinformatics [was needed] to infer what
was done to obtain the results
- Keith Baggerly
12. Does it matter?
• In both cases the supporting data WERE deposited
in GEO or ArrayExpress
• Forensic bioinformatics was needed and more
often than not failed
• Maybe just depositing is not quite enough?
13.
14. What was in MIAME?
1. The raw data
2. The final processed (normalised) data
3. The essential sample annotation and experimental
variables
4. Sample data relationships
5. Array annotation (e.g., probe oligonucleotide
sequences)
6. The laboratory and data processing protocols
15. Did it work? The glass half empty…
• Where were the hiccups? MIAME was asking too
much!
• However, some now say that MIAME is much too
little to ask! (e.g., publishing fully documented code
with instructions how to run it)
• What does ‘sufficient data processing
protocols’ mean?
• Even when data and protocols were deposited,
would the reviewers check these? Probably not.
• So does it help at all?
16. Did it work? The glass half full …
• ArrayExpress and GEO have data from well
over 6 million high throughput assays from
some 30,000 functional genomics studies
• The MIAME compliance has been increasing
over time
• Many studies have shown the reusability of
these data
• We can have an informed discussion about the
reproducibility rather than forensics
17. Standards for content vs
standards for format
• Developing a usable format is challenging
– If it’s too ‘flexible’, with too much free text, it’s no longer a
standard and no software can reasonably parse it
– If it’s too rigid, too granular, it can’t handle new types of
data, and people end up putting things in fields where they
don’t belong
• Human-readable formats are useful, but machine
readability is essential!
18. A simple human readable format for Functional
genomics experiment metadata
• Sample-Data Relationship File (SDRF)
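To illustrate why a simple tab-delimited layout can be both human readable and machine readable, here is a minimal sketch in Python. The column headings are modelled on MAGE-TAB SDRF conventions; the sample names, factor values, and file names are invented for illustration, and the helper functions `read_sdrf` and `files_by_factor` are hypothetical, not part of any FGED tool.

```python
import csv
import io

# Hypothetical SDRF-style content: tab-delimited, one row per sample-to-data path.
# Sample names, factor values, and file names are invented for illustration.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tArray Data File\n"
    "sample_1\tHomo sapiens\tcontrol\tsample_1.CEL\n"
    "sample_2\tHomo sapiens\tdrug\tsample_2.CEL\n"
)


def read_sdrf(text):
    """Parse tab-delimited SDRF-style text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_factor(rows, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[treatment]", "Array Data File")
print(groups)
```

The same text opens cleanly in a spreadsheet, which is the point of the format: a person can edit it by hand and a script can still link each sample to its data file.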
19. Lessons learned
• Keep it simple, keep it simple, keep it simple!
• Perils of designing standards by a committee vs
advantages of community agreement
• Successful formats are mostly defined by
successful software, e.g., GFF in the UCSC Genome
Browser or Bioconductor’s gene_set
• The attraction and perils of perfection – the last few
steps of full automation cost the most effort
– A human may be a cheap broker between two
pieces of software (again – Bioconductor example)
20. What does it mean for HTS?
• (RNASeq – ChIPSeq)
• The metadata for functional genomics HTS
experiments are not so different from microarray
experiments – replace CEL files with BAM files
21. MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
1. A general description of the aim of the experiment;
2. The submitter contact details;
3. Essential sample annotation and the experimental
factors;
4. An ‘experiment’ or ‘run’ date, which may be
important for identifying batch effects;
5. Sufficient information to correctly identify biological
and technical replicates;
6. Experimental and data processing protocols;
7. Raw sequencing reads location; and processed
data.
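A checklist like the one above lends itself to an automated completeness check at submission time. The sketch below is only an illustration: the key names are hypothetical labels for the seven listed elements (not an official MINSEQE schema), and the example record is invented.

```python
# Hypothetical labels for the seven MINSEQE elements listed above;
# these keys are assumptions for illustration, not an official schema.
MINSEQE_ELEMENTS = [
    "experiment_description",
    "submitter_contact",
    "sample_annotation_and_factors",
    "run_date",
    "replicate_information",
    "protocols",
    "raw_reads_and_processed_data",
]


def missing_elements(submission):
    """Return the checklist elements that are absent or empty in a submission."""
    return [key for key in MINSEQE_ELEMENTS if not submission.get(key)]


# Invented example record: the run date is empty and the data location is absent.
example = {
    "experiment_description": "RNA-Seq of treated vs control cells",
    "submitter_contact": "submitter@example.org",
    "sample_annotation_and_factors": {"factor": "treatment"},
    "run_date": "",
    "replicate_information": "3 biological replicates per group",
    "protocols": "library preparation and alignment described in methods",
}

missing = missing_elements(example)
print(missing)
```

A check of this kind only confirms that a field is filled in; whether the content is sufficient to reproduce the analysis still needs a human reader.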
22. Percentage of publications from 2012
containing new gene expression data
Data type  | Number of PMIDs with new data | % of data in SRA/ArrayExpress/GEO
Microarray | 347                           | 49
RNA-Seq    | 334                           | 61
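To put the percentages in absolute terms, the short calculation below converts them into approximate publication counts. It assumes the percentage column means the share of publications whose data were found in SRA, ArrayExpress, or GEO; because the percentages are rounded, the counts are estimates only. The function name `approx_deposited` is hypothetical.

```python
# Values taken from the slide: (number of PMIDs with new data, % in archives).
publications = {"Microarray": (347, 49), "RNA-Seq": (334, 61)}


def approx_deposited(n_publications, percent_deposited):
    """Estimate how many publications deposited data, from a rounded percentage."""
    return round(n_publications * percent_deposited / 100)


estimates = {}
for data_type, (n_pubs, percent) in publications.items():
    deposited = approx_deposited(n_pubs, percent)
    # Store (approximately deposited, approximately not deposited).
    estimates[data_type] = (deposited, n_pubs - deposited)

print(estimates)
```

Under that reading, roughly 170 of the 347 microarray publications and roughly 204 of the 334 RNA-Seq publications had data in a public archive.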
23. Percentage of RNA-Seq studies
providing metadata (1/2)
Original database        | ArrayExpress | GEO | SRA
Experimental description | 95           | 100 | 100
Contact                  | 100          | 100 | 0
Sample & factor info     | 100          | 100 | 60
Experimental or run date | 0            | 0   | 60
24. Percentage of RNA-Seq studies
providing metadata (2/2)
Original database                | ArrayExpress | GEO       | SRA
Biological and tech replicates   | Yes          | Sometimes | Yes
Exp and data processing protocol | 60           | 100       | 0
Raw reads                        | 100          | 100       | 100
Processed data                   | 35           | 90        | 0
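The numeric rows of the two metadata tables can be compared side by side to see where each database falls short. The sketch below assumes the column order ArrayExpress, GEO, SRA as read from the slides, and leaves out the replicate row because it is reported as Yes/Sometimes rather than a percentage. The function name `gaps_below` is hypothetical.

```python
DATABASES = ["ArrayExpress", "GEO", "SRA"]

# Percentages of RNA-Seq studies providing each item, as read from the slides
# (assumed column order: ArrayExpress, GEO, SRA).
METADATA_PERCENT = {
    "Experimental description": [95, 100, 100],
    "Contact": [100, 100, 0],
    "Sample & factor info": [100, 100, 60],
    "Experimental or run date": [0, 0, 60],
    "Exp and data processing protocol": [60, 100, 0],
    "Raw reads": [100, 100, 100],
    "Processed data": [35, 90, 0],
}


def gaps_below(threshold):
    """List, for each database, the items provided by fewer than `threshold` % of studies."""
    result = {name: [] for name in DATABASES}
    for item, values in METADATA_PERCENT.items():
        for name, value in zip(DATABASES, values):
            if value < threshold:
                result[name].append(item)
    return result


gaps = gaps_below(100)
for name in DATABASES:
    print(name, gaps[name])
```

Read this way, the run date is the one item that no database captures for all studies, and raw reads are the one item every database captures.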
25. Things we still need to do:
• Involve folks from NCBI
• Compare methods and metrics over time (2009-2012)
• Compare methods with ENCODE, ICGC, EGA and
the databases we presented here.
• Look for shared metadata and seek to mate what
is best and core to all.
• Make sure it aligns with large funders’ current
requirements.
• Share and publish this information
26. Take home messages
• Archiving just something is not the same as
making data available and useful – metadata,
analysis code, usable format, …
– Storing metadata doesn’t cost too much; extracting them
from data generators does!
• Minimising the human mediation in moving data
between the LIMS, archives and analysis tools is a
more realistic goal than eliminating it – the need for
brokerage
• The main source of variability in RNA-Seq
interpretation seems to be the alignments – we
don’t know how to do this well yet. Getting the
short reads for RNA-Seq is a beginning.
27. • FGED: The Functional Genomics Data Society is a
very open society, and we welcome feedback and
input!
– http://fged.org
– Twitter: @fged
28. Acknowledgements:
• Gabriella Rustici, Eleanor Williams, Alvis Brazma and
the Functional Genomics Data Society http://fged.org
• Alvis Brazma - EBI
• Roger Bumgarner - U of Washington
• Cesare Furlanello - FBK - MPBA
• Michael Miller - ISB
• Francis Ouellette - OICR
• John Quackenbush - Dana-Farber
• Michael Reich - Broad
• Gabriella Rustici - EBI
• Chris Stoeckert - U Penn
• Ronald Taylor - PNNL
• Steve Chervitz Trutane - Personalis
• Jennifer Weller - UNC
• Brian Wilhelm - IRIC
• Neil Winegarden - UHN