
Reproducibility - The myths and truths of pipeline bioinformatics


In a talk for the Newcastle Bioinformatics Special Interest Group, I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses and at some of the tools for achieving this. I also considered additional tools for enabling reproducible bioinformatics, and looked at the 'executable paper' and whether it represents the future of bioinformatics publishing.



  1. Reproducibility: The Myths and Truths of "Push-Button" Bioinformatics
     Simon Cockell
     Bioinformatics Special Interest Group, 19th July 2012
  2. Repeatability and Reproducibility
     • A main principle of the scientific method
     • Repeatability is 'within lab'
     • Reproducibility is 'between lab'
       • A broader concept
     • This should be easy in bioinformatics, right?
       • Same data + same code = same results
       • Not many analyses involve stochasticity
  3. Same data?
     • Example:
       • Data deposited in the SRA
       • Original data deleted by the researchers
       • .sra files are NOT .fastq
       • All filtering/QC steps lost
       • The starting point for subsequent analysis is not the same, regardless of whether the same code is used
  4. Same data?
     • Data files are very large
     • Hardware failures are surprisingly common
     • Not all hardware failures are catastrophic
       • e.g. bit-flipping by faulty RAM
     • Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
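The checksum habit the slide asks about takes two commands. A minimal sketch, using a stand-in file (a real case would be a multi-gigabyte FASTQ):

```shell
# Create a stand-in data file in place of a real sequencing output
echo "example read data" > reads.fastq

# Record a checksum when the data is first generated...
md5sum reads.fastq > reads.fastq.md5

# ...and verify it after any transfer or storage period.
# A failed check (non-zero exit status) signals corruption.
md5sum -c reads.fastq.md5
```

Keeping the `.md5` file next to the data means any later copy of both can be verified independently.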
  5. Same code?
     • What version of a particular piece of software did you use?
     • Is it still available?
     • Did you write it yourself?
     • Do you use version control?
     • Did you tag a version?
     • Is the software closed/proprietary?
  6. Version Control
     • Good practice for software AND data
     • A DVCS means it doesn't have to be in a remote repository
     • All local folders can be versioned
       • Doesn't mean they have to be; it's a judgement call
     • Check in regularly
     • Tag important "releases"
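The "local-only DVCS plus tags" workflow above can be sketched with git in a few commands (the directory, file, and tag names are illustrative):

```shell
# Version a local analysis folder with git; no remote repository needed
mkdir -p analysis
git -C analysis init -q

# A stand-in pipeline script to put under version control
echo 'bwa mem ref.fa reads.fastq > aln.sam' > analysis/run.sh
git -C analysis add run.sh
git -C analysis -c user.name=demo -c user.email=demo@example.com \
    commit -qm "initial version of pipeline"

# Tag the exact state used to produce a set of results
git -C analysis tag paper-v1.0

# Listing tags recovers the named versions later
git -C analysis tag
```

Anyone with the repository can then check out `paper-v1.0` and rerun exactly the code that produced the published results.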
  7. Pipelines
     • Package your analysis
     • Easily repeatable
     • Also easy to distribute
     • Start-to-finish task automation
     • Process captured by the underlying pipeline architecture
  8. Tools for pipelining analyses
     • Huge numbers exist; only a few are widely used:
     • Bash
       • old school
     • Taverna
       • build workflows from public web services
     • Galaxy
       • sequencing focus; tools provided in the 'toolshed'
     • Microbase
       • distributed computing; build workflows from 'responders'
     • e-Science Central
       • 'Science as a Service', with a cloud focus
       • not specifically a bioinformatics tool
  9. Bash
     • Single-machine (or cluster) command-line workflows
     • No fancy GUIs
     • Records provenance & process
     • Rudimentary parallel processing
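The "rudimentary parallel processing" point is usually just background jobs plus `wait`. A minimal sketch of a per-sample pipeline, with `tr` standing in for a real trimming/QC tool and invented sample names:

```shell
set -e   # abort the pipeline if any step fails

# Stand-in input files for two samples
for sample in A B; do
  echo "reads for sample $sample" > "$sample.fastq"
done

# Per-sample "trim" step, run in the background for parallelism
for sample in A B; do
  tr 'a-z' 'A-Z' < "$sample.fastq" > "$sample.trimmed" &
done
wait   # block until every background job has completed

# Downstream merge step depends on all upstream outputs
cat A.trimmed B.trimmed > combined.txt
cat combined.txt
```

The script itself is the provenance record: rerunning it repeats the whole analysis start to finish.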
  10. Taverna
      • Workflows built from web services
      • Lack of relevant services
        • Relies on providers
      • Gluing services together is increasingly problematic
      • Sharing workflows through myExperiment
  11. Galaxy
      • An "open, web-based platform for data intensive biomedical research"
      • Install locally or use the (limited) public server
      • Workflows can be built from tools in the 'toolshed'
      • Command-line tools wrapped with a web interface
  12. Galaxy Workflow
  13. Microbase
      • Task management framework
      • Workflows emerge from interacting 'responders'
      • A notification system passes messages around
      • 'Cloud-ready' system that scales easily
      • Responders must be written for new tools
  14. e-Science Central
      • 'Blocks' can be combined into workflows
      • Blocks need to be written by an expert
      • Social networking features
      • Good provenance recording
  15. The best approach?
      • Pipelines are good for an individual analysis
        • Package & publish
      • But all datasets are different
        • One size does not fit all
        • Downstream processes often depend on the results of upstream ones
      • Note the lack of QC in pipelines
        • QC requires human interaction, which is impossible to pipeline
        • Different every time
        • Subjective, and a major source of variation in results
        • BUT important and necessary (garbage in, garbage out)
  16. More tools for reproducibility
      • IPython notebook
        • Build notebooks with code embedded
        • Run code arbitrarily
      • 'Companion websites' for papers
        • Allow researchers to publish the methodology alongside the paper
        • Readers can implement the methodology described in the paper
  17. The executable paper
      • The ultimate in repeatable research
      • Data and code embedded in the publication
      • Figures can be generated, in situ, from the actual data
      • ers/2012-diginorm/
  18. Summary
      • For work to be repeatable:
        • Data and code must be available
        • The process must be documented (and preferably shared)
        • Version information is important
      • Pipelines are not the great panacea
        • Though they may help for parts of the process
      • Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
  19. Inspirations for this talk
      • C. Titus Brown's blog posts on repeatability and the executable paper
      • Michael Barton's blog posts about organising bioinformatics projects and pipelines