Reproducibility - The myths and truths of pipeline bioinformatics

In a talk for the Newcastle Bioinformatics Special Interest Group, I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses, as well as some tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the 'executable paper' and whether it represents the future for bioinformatics publishing.

Published in: Technology

  • Notes
  • Or any other Unix shell (maybe even Windows batch scripts)
  • Expose an often-simplified set of options
  • Can make arbitrary blocks in R, Java or Octave
  • Transcript

    • 1. Reproducibility: The Myths and Truths of "Push-Button" Bioinformatics
      • Simon Cockell
      • Bioinformatics Special Interest Group
      • 19th July 2012
    • 2. Repeatability and Reproducibility
      • Main principle of the scientific method
      • Repeatability is 'within lab'
      • Reproducibility is 'between lab'
        • A broader concept
      • This should be easy in bioinformatics, right?
        • Same data + same code = same results
        • Not many analyses have stochasticity
    • 3. Same data?
      • Example
        • Data deposited in SRA
        • Original data deleted by researchers
        • .sra files are NOT .fastq
        • All filtering/QC steps lost
        • Starting point for subsequent analysis not the same – regardless of whether the same code is used
    • 4. Same data?
      • Data files are very large
      • Hardware failures are surprisingly common
      • Not all hardware failures are catastrophic
        • Bit-flipping by faulty RAM
      • Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
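The checksum question above takes only two commands to answer in practice. A minimal sketch (the .fastq.gz filenames are hypothetical placeholders for your own data files):

```shell
# Record checksums when the data first arrives
# (sample1.fastq.gz / sample2.fastq.gz are placeholder names)
md5sum sample1.fastq.gz sample2.fastq.gz > data.md5

# After any transfer (or just periodically), verify nothing has changed;
# a corrupted file is reported as FAILED rather than OK
md5sum -c data.md5
```

Keeping data.md5 alongside the data means anyone who copies the directory can confirm they are starting from the same bytes you did.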
    • 5. Same code?
      • What version of a particular piece of software did you use?
      • Is it still available?
      • Did you write it yourself?
      • Do you use version control?
      • Did you tag a version?
      • Is the software closed/proprietary?
    • 6. Version Control
      • Good practice for software AND data
      • DVCS means it doesn't have to be in a remote repository
      • All local folders can be versioned
        • Doesn't mean they have to be; it's a judgement call
      • Check in regularly
      • Tag important "releases"
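The check-in and tagging advice above can be sketched with git (any DVCS works; the file name and tag name here are invented for illustration):

```shell
# Put an analysis folder under local version control; no remote needed
git init
git add run_analysis.sh              # hypothetical analysis script
git commit -m "Analysis as presented at the SIG"

# Tag the exact state used for a result, so it can be recovered later
git tag -a paper-v1 -m "Version used for the manuscript figures"

# Months later, return to precisely that state
git checkout paper-v1
```

Because git is distributed, all of this happens in the local folder; pushing to a remote repository is an optional extra, not a prerequisite for versioning.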
    • 7. Pipelines
      • Package your analysis
      • Easily repeatable
      • Also easy to distribute
      • Start-to-finish task automation
      • Process captured by the underlying pipeline architecture
    • 8. Tools for pipelining analyses
      • Huge numbers
        • See:
      • Only a few widely used:
      • Bash
        • old school
      • Taverna
        • build workflows from public web services
      • Galaxy
        • sequencing focus – tools provided in a 'toolshed'
      • Microbase
        • distributed computing; build workflows from 'responders'
      • e-Science Central
        • 'Science as a Service' – cloud focus
        • not specifically a bioinformatics tool
    • 9. Bash
      • Single-machine (or cluster) command-line workflows
      • No fancy GUIs
      • Record provenance & process
      • Rudimentary parallel processing
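A minimal sketch of such a workflow. The step functions and sample names are placeholders standing in for real tools (e.g. a QC program and an aligner), not any particular pipeline:

```shell
#!/bin/bash
# Single-machine pipeline sketch; qc_step and align_step are stand-ins
# for real command-line tools.
set -euo pipefail                  # abort on the first failing step

qc_step()    { echo "QC report for $1"; }
align_step() { echo "alignments for $1"; }

# Record provenance: when the run happened (tool versions could go here too)
echo "Run started: $(date)" > pipeline.log

# Rudimentary parallelism: QC each sample in the background, then wait
for sample in sample1 sample2; do
    qc_step "$sample" > "$sample.qc" &
done
wait                               # block until all background jobs finish

# Downstream steps run once their inputs exist
for sample in sample1 sample2; do
    align_step "$sample" > "$sample.aln"
done
```

The script itself is the process documentation: checked into version control, it records exactly which steps ran, in what order, on which inputs.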
    • 10. Taverna
      • Workflows built from web services
      • Lack of relevant services
        • Relies on providers
      • Gluing services together increasingly problematic
      • Sharing workflows through myExperiment
    • 11. Galaxy
      • "open, web-based platform for data intensive biomedical research"
      • Install locally, or use the (limited) public server
      • Can build workflows from tools in the 'toolshed'
      • Command-line tools wrapped with a web interface
    • 12. Galaxy Workflow
    • 13. Microbase
      • Task management framework
      • Workflows emerge from interacting 'responders'
      • Notification system passes messages around
      • 'Cloud-ready' system that scales easily
      • Responders must be written for new tools
    • 14. e-Science Central
      • 'Blocks' can be combined into workflows
      • Blocks need to be written by an expert
      • Social networking features
      • Good provenance recording
    • 15. The best approach?
      • Good for individual analysis
        • Package & publish
      • All datasets different
        • One size does not fit all
        • Downstream processes often depend on results of upstream ones
      • Note lack of QC
        • Requires human interaction – impossible to pipeline
        • Different every time
        • Subjective – major source of variation in results
        • BUT – important and necessary (GIGO)
    • 16. More tools for reproducibility
      • iPython notebook
        • Build notebooks with code embedded
        • Run code arbitrarily
        • Example:
      • Allows researchers to create 'companion websites' for papers
        • This website allows readers to implement the methodology described in the paper
        • Example:
    • 17. The executable paper
      • The ultimate in repeatable research
      • Data and code embedded in the publication
      • Figures can be generated, in situ, from the actual data
      • ers/2012-diginorm/
    • 18. Summary
      • For work to be repeatable:
        • Data and code must be available
        • Process must be documented (and preferably shared)
        • Version information is important
      • Pipelines are not the great panacea
        • Though they may help for parts of the process
      • Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
    • 19. Inspirations for this talk
      • C. Titus Brown's blog posts on repeatability and the executable paper
      • Michael Barton's blog posts about organising bioinformatics projects and pipelines