Cshl minseqe 2013_ouellette
2013 Genome Informatics presentation by Francis Ouellette at the Wednesday Oct 30 evening session

Usage Rights

CC Attribution License

Presentation Transcript

  • You are free to: Copy, share, adapt, or re-mix; Photograph, film, or broadcast; Blog, live-blog, or post video of; This presentation. Provided that: You attribute the work to its author and respect the rights and licenses associated with its components. Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at; http://www.thisismyurl.com/free-downloads/15-free-speech-bubble-icons-for-popular-websites
  • Disclaimer • I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention in this presentation.
  • Data availability and re-usability in the transition from microarray to next-generation sequencing: can we do better? B.F. Francis Ouellette • Senior Scientist & Associate Director, Informatics and Biocomputing, Ontario Institute for Cancer Research, Toronto, ON • Associate Professor, Department of Cell and Systems Biology, University of Toronto, Toronto, ON. @bffo
  • Gabriella Rustici, Eleanor Williams, B.F. Francis Ouellette, Alvis Brazma and the Functional Genomics Data Society http://fged.org • Alvis Brazma - EBI • Roger Bumgarner - U of Washington • Cesare Furlanello - FBK - MPBA • Michael Miller - ISB • Francis Ouellette - OICR • John Quackenbush - Dana-Farber • Michael Reich - Broad • Gabriella Rustici - EBI • Chris Stoeckert - U Penn • Ronald Taylor - PNNL • Steve Chervitz Trutane - Personalis • Jennifer Weller - UNC • Brian Wilhelm - IRIC • Neil Winegarden - UHN
  • FGED’s mission: To be a positive agent of change in the effective sharing and reproducibility of functional genomic data Poster # 142 (Friday) fged.org
  • I come here wearing many hats! • Officer of FGED • Data submitter to a large international cancer genomics initiative • Receiving and curating data from that same initiative from 67 cancer genome projects. • Editor in an #openaccess journal where we are just now rewriting the data submission policy to ensure reproducibility • Associate Editor of an #OA DATABASE journal • Also on the SAB of Galaxy and Genomespace
  • What do we do with this? FGED (Functional Genomics Data Society) was MGED (Microarray Gene Expression Data Society)
  • we evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling. (…) We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability, and discrepancies were mostly due to incomplete data annotation or specification of data processing and analysis. Repeatability of published microarray studies is apparently limited. More strict publication rules enforcing public data availability and explicit description of data processing and analysis should be considered.
  • Does it matter? • In Ioannidis et al (2009), they were not saying that the papers were wrong. • But there were problems – missing data (38%) – missing software, hardware details (50%) – missing method, processing details (66%)
  • … forensic bioinformatics [was needed] to infer what was done to obtain the results - Keith Baggerly
  • Does it matter? • In both cases the supporting data WERE deposited in GEO or ArrayExpress • Forensic bioinformatics was needed and, more often than not, failed • Maybe just depositing is not quite enough?
  • What was in MIAME? 1. The raw data 2. The final processed (normalised) data 3. The essential sample annotation and experimental variables 4. Sample data relationships 5. Array annotation (e.g., probe oligonucleotide sequences) 6. The laboratory and data processing protocols
  • Did it work? The glass half empty… • Where were the hiccups? MIAME was asking too much! • However, some now say that MIAME is much too little to ask! (e.g., publishing fully documented code with instructions on how to run it) • What does ‘sufficient data processing protocols’ mean? • Even when data and protocols were deposited, would the reviewers check these? Probably not • So does it help at all?
  • Did it work? The glass half full … • ArrayExpress and GEO have data from well over 6 million high throughput assays from some 30,000 functional genomics studies • The MIAME compliance has been increasing over time • Many studies have shown the reusability of these data • We can have an informed discussion about the reproducibility rather than forensics
  • Standards for content vs standards for format • Developing a usable format is challenging – If it’s too ‘flexible’, with too much free text, it’s no longer a standard and no software can reasonably parse it – If it’s too rigid, too granular, it can’t handle new types of data, and people end up putting things in fields that don’t fit • Human-readable formats are useful, but machine readability is essential!
  • A simple human readable format for Functional genomics experiment metadata • Sample-Data Relationship File (SDRF)
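As a rough illustration of why a tab-delimited file like SDRF is both human readable and machine readable, here is a minimal sketch in Python. The file content, sample names, and the specific column headers ("Source Name", "Characteristics[organism]", "Factor Value[treatment]", "Array Data File") are assumptions chosen for the example, not taken from the slides, and the helper functions are hypothetical.

```python
import csv
import io

# Hypothetical SDRF-like content: one row per sample, tab-delimited.
# The column headers are assumed for illustration only.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[treatment]\tArray Data File\n"
    "sample1\tHomo sapiens\tcontrol\tsample1.CEL\n"
    "sample2\tHomo sapiens\ttreated\tsample2.CEL\n"
)


def read_sdrf(text):
    """Return one dict per sample row, keyed by column header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def files_by_factor(rows, factor_column, file_column):
    """Group data file names by the value of one experimental factor."""
    groups = {}
    for row in rows:
        groups.setdefault(row[factor_column], []).append(row[file_column])
    return groups


rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor(rows, "Factor Value[treatment]", "Array Data File")
print(groups)
```

The same file opens cleanly in a spreadsheet, which is the point of keeping the format simple: a curator can read it and a script can parse it without a custom parser.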
  • Lessons learned • Keep it simple, keep it simple, keep it simple! • Perils of designing standards by a committee vs advantages of community agreement • Successful formats are mostly defined by successful software, e.g., GFF in UCSC GB or Bioconductor's gene_set • The attraction and perils of perfection – the last few steps of full automation cost the most effort – A human may be a cheap broker between two pieces of software (again – Bioconductor example)
  • What does it mean for HTS? • (RNA-Seq – ChIP-Seq) • The metadata for functional genomics HTS experiments are not so different from those for microarray experiments – replace CEL files with BAM files
  • MINSEQE - Minimum Information about a high-throughput Nucleotide SeQuencing Experiment 1. A general description of the aim of the experiment; 2. The submitter contact details; 3. Essential sample annotation and the experimental factors; 4. An ‘experiment’ or ‘run’ date, which may be important for identifying batch effects; 5. Sufficient information to correctly identify biological and technical replicates; 6. Experimental and data processing protocols; 7. Raw sequencing reads location; and processed data.
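The MINSEQE items listed above can be treated as a checklist that a submission tool runs before accepting a record. The sketch below is one possible way to do that, not an official MINSEQE validator: the field names are hypothetical, and item 7 is split into two fields (raw reads and processed data) as an assumption of this example.

```python
# Hypothetical field names, one per MINSEQE item; item 7 is split in two.
MINSEQE_FIELDS = [
    "experiment_description",         # 1. aim of the experiment
    "submitter_contact",              # 2. submitter contact details
    "sample_annotation_and_factors",  # 3. sample annotation, experimental factors
    "run_date",                       # 4. experiment or run date
    "replicate_information",          # 5. biological and technical replicates
    "protocols",                      # 6. experimental and data processing protocols
    "raw_reads_location",             # 7a. raw sequencing reads location
    "processed_data_location",        # 7b. processed data
]


def missing_minseqe_fields(record):
    """Return the checklist fields that are absent or blank in a record."""
    missing = []
    for field in MINSEQE_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Made-up example record with no run date and no processed data.
example_record = {
    "experiment_description": "RNA-Seq of treated vs control cells",
    "submitter_contact": "submitter@example.org",
    "sample_annotation_and_factors": "cell line; treatment",
    "run_date": "",
    "replicate_information": "3 biological replicates per condition",
    "protocols": "library prep and alignment described in README",
    "raw_reads_location": "reads/run1.fastq.gz",
}

missing = missing_minseqe_fields(example_record)
print(missing)
```

A check like this only confirms that something was supplied for each item; whether the protocol text is actually sufficient to reproduce the analysis still needs a human reader.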
  • Percentage of publications from 2012 containing new gene expression data

        Data type     Number of PMIDs with new data    % of data in SRA/ArrayExpress/GEO
        Microarray    347                              49
        RNA-Seq       334                              61
  • Percentage of RNA-Seq studies providing metadata (1/2)

        Original database            ArrayExpress    GEO    SRA
        Experimental description     95              100    100
        Contact                      100             100    0
        Sample & factor info         100             100    60
        Experimental or run date     0               0      60
  • Percentage of RNA-Seq studies providing metadata (2/2)

        Original database                   ArrayExpress    GEO          SRA
        Biological and tech replicates      Yes             Sometimes    Yes
        Exp and data processing protocol    60              100          0
        Raw reads                           100             100          100
        Processed data                      35              90           0
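To compare the three databases at a glance, the numeric rows of the two metadata slides above can be averaged per database. This is only an illustrative summary: the unweighted mean is a choice made for this sketch and does not appear in the presentation, and the replicate row is left out because it is reported as Yes/Sometimes rather than as a percentage.

```python
# Percentages copied from the two metadata slides (numeric rows only).
METADATA_PERCENT = {
    "Experimental description": {"ArrayExpress": 95, "GEO": 100, "SRA": 100},
    "Contact": {"ArrayExpress": 100, "GEO": 100, "SRA": 0},
    "Sample & factor info": {"ArrayExpress": 100, "GEO": 100, "SRA": 60},
    "Experimental or run date": {"ArrayExpress": 0, "GEO": 0, "SRA": 60},
    "Exp and data processing protocol": {"ArrayExpress": 60, "GEO": 100, "SRA": 0},
    "Raw reads": {"ArrayExpress": 100, "GEO": 100, "SRA": 100},
    "Processed data": {"ArrayExpress": 35, "GEO": 90, "SRA": 0},
}


def mean_by_database(table):
    """Unweighted mean percentage per database across all metadata items."""
    values = {}
    for row in table.values():
        for database, percent in row.items():
            values.setdefault(database, []).append(percent)
    return {db: sum(v) / len(v) for db, v in values.items()}


means = mean_by_database(METADATA_PERCENT)
for database, mean in means.items():
    print(f"{database}: {mean:.1f}%")
```

Treating every item as equally important is a simplification; a missing run date and missing raw reads are not equally damaging to reusability.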
  • Things we still need to do: • Involve folks from NCBI • Compare methods and metrics over time (2009-2012) • Compare methods with ENCODE, ICGC, EGA and the databases we presented here. • Look for shared metadata and seek to mate what is best and core to all. • Make sure it aligns with large funders’ current requirements. • Share and publish this information
  • Take home messages • Archiving just something is not the same as making data available and useful – metadata, analysis code, usable format, … – Storing metadata doesn’t cost too much, extracting them from data generators does! • Minimising the human mediation in moving data between the LIMS, archives and analysis tools is a more realistic goal than eliminating it – the need for brokerage • The main source of variability in RNA-Seq interpretation seems to be the alignments – we don’t know how to do this well yet. Getting the short reads for RNA-Seq is a beginning.
  • FGED: The Functional Genomics Data Society is a very open society, and we welcome feedback and input! – http://fged.org – Twitter: @fged
  • Acknowledgements: • Gabriella Rustici, Eleanor Williams, Alvis Brazma and the Functional Genomics Data Society http://fged.org • Alvis Brazma - EBI • Roger Bumgarner - U of Washington • Cesare Furlanello - FBK - MPBA • Michael Miller - ISB • Francis Ouellette - OICR • John Quackenbush - Dana-Farber • Michael Reich - Broad • Gabriella Rustici - EBI • Chris Stoeckert - U Penn • Ronald Taylor - PNNL • Steve Chervitz Trutane - Personalis • Jennifer Weller - UNC • Brian Wilhelm - IRIC • Neil Winegarden - UHN