
Reproducibility - The myths and truths of pipeline bioinformatics


In a talk for the Newcastle Bioinformatics Special Interest Group (http://bsu.ncl.ac.uk/fms-bioinformatics) I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses, as well as some tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the 'executable paper' and whether it represents the future of bioinformatics publishing.



  1. Reproducibility: The Myths and Truths of “Push-Button” Bioinformatics • Simon Cockell • Bioinformatics Special Interest Group • 19th July 2012
  2. Repeatability and Reproducibility • Main principle of the scientific method • Repeatability is ‘within lab’ • Reproducibility is ‘between lab’ • Broader concept • This should be easy in bioinformatics, right? • Same data + same code = same results • Few analyses are stochastic http://xkcd.com/242/
  3. Same data? • Example • Data deposited in SRA • Original data deleted by researchers • .sra files are NOT .fastq • All filtering/QC steps lost • Starting point for subsequent analysis not the same – regardless of whether the same code is used
  4. Same data? • Data files are very large • Hardware failures are surprisingly common • Not all hardware failures are catastrophic • Bit-flipping by faulty RAM • Do you keep an md5sum of your data, to ensure it hasn’t been corrupted by the transfer process?
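The checksum habit described above can be as simple as recording an md5sum when data first arrives and re-verifying it after every transfer; a minimal sketch (the file here is a stand-in for a real FASTQ):

```shell
# Create a small example data file (stand-in for a real FASTQ).
printf '@read1\nACGT\n+\nIIII\n' > sample1.fastq

# Record its checksum when the data first arrives.
md5sum sample1.fastq > checksums.md5

# Later (e.g. after transferring the file), verify it in place.
# md5sum -c exits non-zero if any file changed or went missing.
md5sum -c checksums.md5 && echo "data intact"
```

The same `checksums.md5` file travels with the data, so anyone repeating the analysis can confirm they are starting from identical bytes.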
  5. Same code? • What version of a particular piece of software did you use? • Is it still available? • Did you write it yourself? • Do you use version control? • Did you tag a version? • Is the software closed/proprietary?
  6. Version Control • Good practice for software AND data • DVCS means it doesn’t have to be in a remote repository • Any local folder can be versioned • Doesn’t mean it has to be; it’s a judgment call • Check in regularly • Tag important “releases” https://twitter.com/sjcockell/status/202041359920676864
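The check-in-and-tag habit above looks like this with git (one DVCS option; the folder, script, and tag names are illustrative):

```shell
# Put an analysis folder under local version control -- no remote needed.
git init my-analysis
cd my-analysis
git config user.name "Example Researcher"      # local identity so the
git config user.email "researcher@example.org" # commit works anywhere

# Add a script and commit it; repeat after every meaningful change.
echo 'echo "aligning reads..."' > align.sh
git add align.sh
git commit -m "Add alignment script"

# Tag the exact state used for a result, so it can be recovered later.
git tag -a paper-v1 -m "Version used for the submitted manuscript"
git tag    # lists the tags, including paper-v1
```

Because the repository is entirely local, this works even for data folders that never leave the machine; pushing to a remote is an optional extra.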
  7. Pipelines • Package your analysis • Easily repeatable • Also easy to distribute • Start-to-finish task automation • Process captured by the underlying pipeline architecture http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
  8. Tools for pipelining analyses • Huge numbers • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems • Only a few widely used: • Bash • old school • Taverna • build workflows from public web services • Galaxy • sequencing focus – tools provided in the ‘toolshed’ • Microbase • distributed computing, build workflows from ‘responders’ • e-Science Central • ‘Science as a Service’ – cloud focus • not specifically a bioinformatics tool
  9. Bash • Single-machine (or cluster) command-line workflows • No fancy GUIs • Record provenance & process • Rudimentary parallel processing http://www.gnu.org/software/bash/
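A Bash workflow of the kind this slide describes might be sketched as below: each run is logged for provenance, and independent samples are handled with Bash’s rudimentary parallelism (`&` plus `wait`). The sample names and the `process` step are placeholders; a real pipeline would call an aligner or similar tool:

```shell
#!/bin/bash
set -euo pipefail   # stop at the first error -- vital for pipelines

# Hypothetical per-sample step; stands in for a real analysis tool.
process() {
    echo "processed $1" > "$1.out"
}

# Rudimentary parallelism: one background job per sample, then wait.
for sample in sampleA sampleB sampleC; do
    process "$sample" &
done
wait

# Record what ran and when -- cheap provenance.
echo "$(date -u) finished: sampleA sampleB sampleC" >> pipeline.log
```

The script itself doubles as documentation of the process: checked into version control, it captures the exact sequence of commands a repeat run needs.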
  10. Taverna • Workflows from web services • Lack of relevant services • Relies on providers • Gluing services together increasingly problematic • Sharing workflows through myExperiment • http://www.myexperiment.org/ http://www.taverna.org.uk/
  11. Galaxy • “open, web-based platform for data intensive biomedical research” • Install locally or use the (limited) public server • Can build workflows from tools in the ‘toolshed’ • Command-line tools wrapped with a web interface https://main.g2.bx.psu.edu/
  12. Galaxy Workflow
  13. Microbase • Task management framework • Workflows emerge from interacting ‘responders’ • Notification system passes messages around • ‘Cloud-ready’ system that scales easily • Responders must be written for new tools http://www.microbasecloud.com/
  14. e-Science Central • ‘Blocks’ can be combined into workflows • Blocks need to be written by an expert • Social networking features • Good provenance recording http://www.esciencecentral.co.uk/
  15. The best approach? • Good for an individual analysis • Package & publish • All datasets are different • One size does not fit all • Downstream processes often depend on the results of upstream ones • Note the lack of QC • Requires human interaction – impossible to pipeline • Different every time • Subjective – a major source of variation in results • BUT – important and necessary (GIGO: garbage in, garbage out)
  16. More tools for reproducibility • iPython notebook • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html • Build notebooks with code embedded • Run code arbitrarily • Example: https://pilgrims.ncl.ac.uk:9999/ • Runmycode.org • Allows researchers to create ‘companion websites’ for papers • The website allows readers to run the methodology described in the paper • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
  17. The executable paper • The ultimate in repeatable research • Data and code embedded in the publication • Figures can be generated, in situ, from the actual data • http://ged.msu.edu/papers/2012-diginorm/
  18. Summary • For work to be repeatable: • Data and code must be available • Process must be documented (and preferably shared) • Version information is important • Pipelines are not the great panacea • Though they may help for parts of the process • Bash is as good as many ‘fancier’ tools (for tasks on a single machine or cluster)
  19. Inspirations for this talk • C. Titus Brown’s blog posts on repeatability and the executable paper • http://ivory.idyll.org/blog • Michael Barton’s blog posts about organising bioinformatics projects and pipelines • http://bioinformaticszen.com/
