Reproducibility - The myths and truths of pipeline bioinformatics
In a talk for the Newcastle Bioinformatics Special Interest Group (http://bsu.ncl.ac.uk/fms-bioinformatics) I explored the topic of reproducibility, looking at the pros and cons of pipelining analyses and at some of the tools for achieving this. I also considered some additional tools for enabling reproducible bioinformatics, and looked at the 'executable paper' and whether it represents the future of bioinformatics publishing.

Notes for slides:
  • Bash slide: or any other Unix shell (maybe even Windows batch scripts)
  • Galaxy slide: exposes an often-simplified set of options
  • e-Science Central slide: can make arbitrary blocks in R, Java or Octave

    1. Reproducibility: The Myths and Truths of "Push-Button" Bioinformatics (Simon Cockell, Bioinformatics Special Interest Group, 19th July 2012)
    2. Repeatability and Reproducibility
       • Main principle of the scientific method
       • Repeatability is 'within lab'
       • Reproducibility is 'between lab'
         • A broader concept
       • This should be easy in bioinformatics, right?
         • Same data + same code = same results
         • Not many analyses have stochasticity
       http://xkcd.com/242/
    3. Same data?
       • Example:
         • Data deposited in SRA
         • Original data deleted by researchers
         • .sra files are NOT .fastq
         • All filtering/QC steps lost
       • Starting point for subsequent analysis is not the same, regardless of whether the same code is used
    4. Same data?
       • Data files are very large
       • Hardware failures are surprisingly common
       • Not all hardware failures are catastrophic
         • e.g. bit-flipping by faulty RAM
       • Do you keep an md5sum of your data, to ensure it hasn't been corrupted by the transfer process?
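The md5sum habit on this slide can be sketched as two steps: record a checksum manifest when the raw data first arrives, then verify it after every transfer or restore. A minimal sketch; the file name is a hypothetical stand-in for real sequencing data:

```shell
# Record a checksum manifest when the raw data first arrives.
# 'reads.fastq.gz' is a hypothetical stand-in for a real data file.
printf 'ACGTACGT\n' > reads.fastq.gz
md5sum reads.fastq.gz > checksums.md5

# After any transfer (scp, rsync, restore from backup), verify integrity.
# Prints "reads.fastq.gz: OK" and exits 0 if the file is unchanged.
md5sum -c checksums.md5
```

Keeping `checksums.md5` alongside the data means anyone repeating the analysis can confirm they are starting from byte-identical input.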
    5. Same code?
       • Which version of a particular piece of software did you use?
       • Is it still available?
       • Did you write it yourself?
       • Do you use version control?
       • Did you tag a version?
       • Is the software closed/proprietary?
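One low-effort answer to the version questions above is to write tool versions into the results directory at run time. A minimal sketch; the tools listed are stand-ins, and any tool with a `--version` flag can be added to the list:

```shell
# Capture the exact versions of every tool used, alongside the results.
{
  echo "analysis run: $(date -u)"
  bash --version   | head -n 1
  md5sum --version | head -n 1
} > versions.txt

cat versions.txt
```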
    6. Version Control
       • Good practice for software AND data
       • DVCS means it doesn't have to be in a remote repository
       • All local folders can be versioned
         • Doesn't mean they have to be; it's a judgment call
       • Check in regularly
       • Tag important "releases"
       https://twitter.com/sjcockell/status/202041359920676864
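As a sketch of this workflow using git as the DVCS (the repository name and script contents are hypothetical; note that no remote server is involved at any point):

```shell
# Version an analysis folder locally with git; no remote repository required.
git init -q my-analysis
git -C my-analysis config user.email "you@example.com"   # identity needed for commits
git -C my-analysis config user.name  "Your Name"

# A hypothetical analysis script, checked in like any other file.
echo 'echo running hypothetical analysis' > my-analysis/run_analysis.sh
git -C my-analysis add run_analysis.sh
git -C my-analysis commit -q -m "Check in analysis script"

# Tag the exact state used for a paper or report.
git -C my-analysis tag -a v1.0-paper -m "Version used for the manuscript"
git -C my-analysis tag   # lists: v1.0-paper
```

The tag is the piece that makes the slide's point: "the version used for the paper" becomes a name anyone can check out later.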
    7. Pipelines
       • Package your analysis
       • Easily repeatable
       • Also easy to distribute
       • Start-to-finish task automation
       • Process captured by the underlying pipeline architecture
       http://bioinformatics.knowledgeblog.org/2011/06/21/using-standardized-bioinformatics-formats-in-taverna-workflows-for-integrating-biological-data/
    8. Tools for pipelining analyses
       • Huge numbers exist
         • See: http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
         • Only a few are widely used:
       • Bash: old school
       • Taverna: build workflows from public web services
       • Galaxy: sequencing focus; tools provided in a 'toolshed'
       • Microbase: distributed computing; build workflows from 'responders'
       • e-Science Central: 'Science as a Service', cloud focus; not specifically a bioinformatics tool
    9. Bash
       • Single-machine (or cluster) command-line workflows
       • No fancy GUIs
       • Record provenance & process
       • Rudimentary parallel processing
       http://www.gnu.org/software/bash/
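A minimal sketch of what this slide means in practice: provenance recorded in a log file next to the results, and rudimentary parallelism via shell background jobs. Sample names and the per-sample step are hypothetical placeholders:

```shell
# Minimal bash pipeline sketch: provenance via a log file, and
# rudimentary parallelism via background jobs ('&' plus 'wait').
set -eu

process_sample() {
    # Stand-in for a real per-sample step (alignment, QC, etc.)
    echo "processed $1" > "$1.out"
}

for sample in sampleA sampleB sampleC; do
    process_sample "$sample" &    # run each sample in the background
done
wait                              # block until every background job finishes

# Record what ran, and when, next to the results.
ls -1 sample*.out > pipeline.log
echo "finished: $(date -u)" >> pipeline.log
cat pipeline.log
```

Because the whole process is a script under version control, re-running it on the same data is a single command; that is most of what "push-button" means on a single machine.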
    10. Taverna
        • Workflows from web services
        • Lack of relevant services
          • Relies on providers
        • Gluing services together is increasingly problematic
        • Sharing workflows through myExperiment
          • http://www.myexperiment.org/
        http://www.taverna.org.uk/
    11. Galaxy
        • "open, web-based platform for data intensive biomedical research"
        • Install locally or use the (limited) public server
        • Can build workflows from tools in the 'toolshed'
        • Command-line tools wrapped with a web interface
        https://main.g2.bx.psu.edu/
    12. Galaxy Workflow
    13. Microbase
        • Task management framework
        • Workflows emerge from interacting 'responders'
        • Notification system passes messages around
        • 'Cloud-ready' system that scales easily
        • Responders must be written for new tools
        http://www.microbasecloud.com/
    14. e-Science Central
        • 'Blocks' can be combined into workflows
        • Blocks need to be written by an expert
        • Social networking features
        • Good provenance recording
        http://www.esciencecentral.co.uk/
    15. The best approach?
        • Good for individual analyses
          • Package & publish
        • All datasets are different
          • One size does not fit all
          • Downstream processes often depend on the results of upstream ones
        • Note the lack of QC
          • Requires human interaction, so is impossible to pipeline
          • Different every time
          • Subjective: a major source of variation in results
          • BUT important and necessary (GIGO: garbage in, garbage out)
    16. More tools for reproducibility
        • IPython notebook
          • http://ipython.org/ipython-doc/dev/interactive/htmlnotebook.html
          • Build notebooks with code embedded
          • Run code arbitrarily
          • Example: https://pilgrims.ncl.ac.uk:9999/
        • Runmycode.org
          • Allows researchers to create 'companion websites' for papers
          • The website allows readers to run the methodology described in the paper
          • Example: http://www.runmycode.org/CompanionSite/site.do?siteId=92
    17. The executable paper
        • The ultimate in repeatable research
        • Data and code embedded in the publication
        • Figures can be generated, in situ, from the actual data
        • http://ged.msu.edu/papers/2012-diginorm/
    18. Summary
        • For work to be repeatable:
          • Data and code must be available
          • Process must be documented (and preferably shared)
          • Version information is important
        • Pipelines are not the great panacea
          • Though they may help for parts of the process
        • Bash is as good as many 'fancier' tools (for tasks on a single machine or cluster)
    19. Inspirations for this talk
        • C. Titus Brown's blog posts on repeatability and the executable paper
          • http://ivory.idyll.org/blog
        • Michael Barton's blog posts about organising bioinformatics projects and pipelines
          • http://bioinformaticszen.com/