Reproducibility - The myths and truths of pipeline bioinformatics

In a talk for the Newcastle Bioinformatics Special Interest Group (http://bsu.ncl.ac.uk/fms-bioinformatics) I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses as well as some tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the 'executable paper' and whether it represents the future for bioinformatics publishing.

Speaker notes:
  • Or any other Unix shell (maybe even Windows batch scripts)
  • Expose an often-simplified set of options
  • Can make arbitrary blocks in R, Java or Octave

Transcript

• 1. Reproducibility: The Myths and Truths of "Push-Button" Bioinformatics
     Simon Cockell
     Bioinformatics Special Interest Group
     19th July 2012
• 2. Repeatability and Reproducibility
     • Main principle of scientific method
     • Repeatability is 'within lab'
     • Reproducibility is 'between lab'
       • Broader concept
     • This should be easy in bioinformatics, right?
       • Same data + same code = same results
       • Not many analyses have stochasticity
     http://xkcd.com/242/
• 3. Same data?
     • Example
       • Data deposited in SRA
       • Original data deleted by researchers
       • .sra files are NOT .fastq
       • All filtering/QC steps lost
       • Starting point for subsequent analysis not the same – regardless of whether same code used
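One practical consequence: pulling a run back out of SRA means converting the archive format, and what comes back are the reads as archived, not the filtered files the original analysis started from. A rough sketch using the SRA Toolkit, with a purely illustrative accession:

     # Convert an archived SRA run back into FASTQ (illustrative accession)
     fastq-dump --split-files SRR000001
     # This recovers the reads as held in the archive; any trimming or QC the
     # original authors applied downstream is not recoverable from these files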
• 4. Same data?
     • Data files are very large
     • Hardware failures are surprisingly common
     • Not all hardware failures are catastrophic
       • Bit-flipping by faulty RAM
     • Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
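A minimal sketch of what such a check might look like (the filenames here are hypothetical):

     # Record checksums for the raw data files (hypothetical filenames)
     md5sum sample1.fastq.gz sample2.fastq.gz > raw_data.md5

     # After transferring or copying the data, verify nothing was corrupted
     md5sum -c raw_data.md5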
• 5. Same code?
     • What version of a given piece of software did you use?
     • Is it still available?
     • Did you write it yourself?
     • Do you use version control?
     • Did you tag a version?
     • Is the software closed/proprietary?
• 6. Version Control
     • Good practice for software AND data
     • DVCS means it doesn't have to be in a remote repository
     • All local folders can be versioned
       • Doesn't mean they have to be; it's a judgment call
     • Check in regularly
     • Tag important "releases"
     https://twitter.com/sjcockell/status/202041359920676864
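A sketch of what this might look like with Git as the DVCS; the folder, files and commit messages are invented for illustration:

     # Turn an existing local analysis folder into a repository (hypothetical path)
     cd ~/projects/rnaseq-analysis
     git init
     git add scripts/ README
     git commit -m "Initial import of analysis scripts"

     # Check in regularly as the analysis evolves
     git commit -am "Add adapter-trimming step"

     # Tag the exact state used for a report or paper
     git tag -a paper-v1 -m "Version used for the submitted manuscript"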
• 7. Pipelines
     • Package your analysis
     • Easily repeatable
     • Also easy to distribute
     • Start-to-finish task automation
     • Process captured by underlying pipeline architecture
     http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
• 8. Tools for pipelining analyses
     • Huge numbers
       • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
     • Only a few widely used:
       • Bash
         • old school
       • Taverna
         • build workflows from public webservices
       • Galaxy
         • sequencing focus – tools provided in 'toolshed'
       • Microbase
         • distributed computing, build workflows from 'responders'
       • e-Science Central
         • 'Science as a Service' – cloud focus
         • not specifically a bioinformatics tool
• 9. Bash
     • Single-machine (or cluster) command-line workflows
     • No fancy GUIs
     • Record provenance & process
     • Rudimentary parallel processing
     http://www.gnu.org/software/bash/
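As a rough sketch (not a prescription), a minimal Bash workflow along these lines might look like the following; the aligner, reference and sample files are purely illustrative, and backgrounded jobs plus wait provide the rudimentary parallelism:

     #!/bin/bash
     set -e    # abort on the first error so failures are not silently ignored

     # Align each sample in the background (illustrative tool and filenames;
     # assumes the reference has already been indexed with 'bwa index')
     for fq in sample1.fastq sample2.fastq; do
         bwa mem reference.fa "$fq" > "${fq%.fastq}.sam" &
     done
     wait    # rudimentary parallelism: block until all alignments finish

     # Record provenance: when this ran and which bwa version was used
     { date; bwa 2>&1 | grep -i '^version'; } >> pipeline.log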
• 10. Taverna
     • Workflows from web services
     • Lack of relevant services
       • Relies on providers
     • Gluing services together increasingly problematic
     • Sharing workflows through myExperiment
       • http://www.myexperiment.org/
     http://www.taverna.org.uk/
• 11. Galaxy
     • "open, web-based platform for data intensive biomedical research"
     • Install or use (limited) public server
     • Can build workflows from tools in 'toolshed'
     • Command-line tools wrapped with web interface
     https://main.g2.bx.psu.edu/
• 12. Galaxy Workflow
• 13. Microbase
     • Task management framework
     • Workflows emerge from interacting 'responders'
     • Notification system passes messages around
     • 'Cloud-ready' system that scales easily
     • Responders must be written for new tools
     http://www.microbasecloud.com/
• 14. e-Science Central
     • 'Blocks' can be combined into workflows
     • Blocks need to be written by an expert
     • Social networking features
     • Good provenance recording
     http://www.esciencecentral.co.uk/
• 15. The best approach?
     • Good for individual analysis
       • Package & publish
     • All datasets different
       • One size does not fit all
       • Downstream processes often depend on results of upstream ones
     • Note lack of QC
       • Requires human interaction – impossible to pipeline
       • Different every time
       • Subjective – major source of variation in results
       • BUT – important and necessary (GIGO: garbage in, garbage out)
• 16. More tools for reproducibility
     • iPython notebook
       • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
       • Build notebooks with code embedded
       • Run code arbitrarily
       • Example: https://pilgrims.ncl.ac.uk:9999/
     • Runmycode.org
       • Allows researchers to create 'companion websites' for papers
       • This website allows readers to implement the methodology described in the paper
       • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
• 17. The executable paper
     • The ultimate in repeatable research
     • Data and code embedded in the publication
     • Figures can be generated, in situ, from the actual data
     • http://ged.msu.edu/papers/2012-diginorm/
• 18. Summary
     • For work to be repeatable:
       • Data and code must be available
       • Process must be documented (and preferably shared)
       • Version information is important
     • Pipelines are not the great panacea
       • Though they may help for parts of the process
     • Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
• 19. Inspirations for this talk
     • C. Titus Brown's blog posts on repeatability and the executable paper
       • http://ivory.idyll.org/blog
     • Michael Barton's blog posts about organising bioinformatics projects and pipelines
       • http://bioinformaticszen.com/