How to write bioinformatics software no one will use

Presented at ASM NGS 2018 at Tysons Corner VA, USA.

Published in: Science

  1. 1. How to write bioinformatics software no one will use A/Prof Torsten Seemann @torstenseemann ASM NGS 2018 - Washington DC, USA - Wed 25 Sep 2018
  2. 2. Feel free to tweet or photograph Slides available from slideshare.net after the conference
  3. 3. Who am I?
  4. 4. Melbourne, Australia
  5. 5. “Immunity and infection” ● Research ● Teaching ● Public health and reference labs ● Diagnostic services ● Clinical care in ID and immunity
  6. 6. Microbiological Diagnostic Unit ● Oldest public health lab in Australia ○ established 1897 in Melbourne ○ historical ~500,000 isolate collection back to 1950s ● National reference laboratory ○ Salmonella, Listeria, EHEC ● W.H.O regional reference lab ○ vaccine preventable invasive bacterial pathogens
  7. 7. Why am I here?
  8. 8. Bioinformatics software and me Installed >1000 packages manually Authored >100 Brew & Conda packages Written and maintain >10 packages
  9. 9. Software tools for bacterial genomics
  10. 10. The Unix command line
  11. 11. How to get a bioinformatics headache 1. See tweet about new published tool 2. Read abstract - sounds awesome! 3. Fail to find link to source code - eventually Google it 4. Attempt to compile and install it 5. Google for 30 min for fixes 6. Finally get it built 7. Run it on tiny data set 8. Get a vague error 9. Delete and never revisit it again
  12. 12. Familiar output
  13. 13. Should I stay for this talk? YES It will help you write good tools YES It will help you identify bad tools
  14. 14. Should you write a tool?
  15. 15. Should you write a new tool? ● NO ○ It already exists ○ You are unable to maintain it ○ You won’t really use it ● YES ○ YOU need the tool ○ YOU will use the tool ○ YOU want others to use the tool ○ Desire to give back to the community
  16. 16. Eating my own dog food
  17. 17. Lessons from the Prokka experience ● Nearly all feedback is positive ● People all over the world are grateful ● Warm fuzzy feeling inside ● Increase your public profile ● But maintenance burden and guilt
  18. 18. Discoverability
  19. 19. Choosing a home base University or lab web site Y
  20. 20. Choosing a name ● Try to be unique ○ Google to check for conflicts ○ Consider how internationals will pronounce it ○ Be creative! ● Avoid dodgy acronyms ○ Try not to win a JABBA Award ○ “Just Another Bogus Bioinformatics Acronym”
  21. 21. Don’t be this person
  22. 22. First impressions count ● “Keep It Simple Stupid” ● First page of documentation ○ What does it do? ○ How do I install it? ○ How do I run it? ● Try to keep in one place ○ Otherwise becomes inconsistent or missed
  23. 23. Usability
  24. 24. A lesson from history
  25. 25. Print something useful if no parameters % biotool Please use --help for instructions
  26. 26. Always have a --help flag % biotool -h % biotool --help Usage: biotool [options] seq.fa --help Show this help --version Print version and exit --top N Keep top N sequences
  27. 27. Always have a --version flag % biotool -v % biotool -V % biotool --version biotool 1.3
  28. 28. Always raise an error when things go wrong % biotool seq.fa ERROR: can not open file ‘seq.fa’
  29. 29. Check that dependencies are installed % biotool seq.fa Checking BLAST... ok Checking SAMtools... NOT FOUND! Please install ‘samtools’ and add it to your PATH.
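
The dependency check on the slide above could be sketched in Python roughly as follows. This is a minimal illustration, not the speaker's code: the tool list and the function names `missing_dependencies` and `check_dependencies` are assumptions made for the example. It relies only on the standard library's `shutil.which`, which looks a program up on the PATH.

```python
import shutil
import sys

# Hypothetical list of external programs the tool shells out to.
REQUIRED_TOOLS = ["blastn", "samtools"]


def missing_dependencies(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def check_dependencies(tools=REQUIRED_TOOLS, which=shutil.which, err=sys.stderr):
    """Report each tool's status on `err`; return True only if all were found."""
    missing = set(missing_dependencies(tools, which))
    for tool in tools:
        status = "NOT FOUND!" if tool in missing else "ok"
        print(f"Checking {tool}... {status}", file=err)
    for tool in sorted(missing):
        print(f"Please install '{tool}' and add it to your PATH.", file=err)
    return not missing
```

Running the check once at start-up, before any data is processed, means the user learns about a missing program immediately rather than halfway through a long job. The `which` parameter exists only so the lookup can be swapped out when testing.
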
  30. 30. Always let users control output filenames % biotool seq.fa Processing ‘seq.fa’ Wrote result to ‘filt.seq.fa.out’ # ARGH! % biotool --out seq.filt.fa
  31. 31. KISS - run with minimum parameters % biotool seq.fa ERROR: missing -x parameter % biotool -x 3 seq.fa ERROR: missing -y parameter % biotool -x 3 -y 7 seq.fa ERROR: need -n name # ARGH!
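
The usability points in this section (a hint when run with no parameters, `--help`, `--version`, a clear error, user-controlled output, and sensible defaults) might look like this in Python with `argparse`. It is a sketch under assumptions: `biotool` and version 1.3 come from the slides, but the default of 10 for `--top`, the exit codes chosen, and the reading of "top N" as "the first N FASTA records" are invented for the example.

```python
import argparse
import sys

VERSION = "1.3"  # version shown on the slide


def build_parser():
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",
    )
    # argparse adds -h/--help automatically.
    parser.add_argument("-v", "-V", "--version", action="version",
                        version=f"%(prog)s {VERSION}")
    # Every option has a default, so the tool runs with just an input file.
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="keep top N sequences (default: %(default)s)")
    parser.add_argument("--out", default="-", metavar="FILE",
                        help="output file (default: stdout)")
    parser.add_argument("fasta", metavar="seq.fa", help="input FASTA file")
    return parser


def keep_first_records(lines, n):
    """Yield the lines of the first n FASTA records (a record starts with '>')."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        if seen:  # ignore anything before the first header
            yield line


def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        print("Please use --help for instructions", file=sys.stderr)
        return 2
    args = build_parser().parse_args(argv)
    try:
        with open(args.fasta) as src:
            kept = list(keep_first_records(src, args.top))
    except OSError:
        print(f"ERROR: can not open file '{args.fasta}'", file=sys.stderr)
        return 1
    out = sys.stdout if args.out == "-" else open(args.out, "w")
    try:
        out.writelines(kept)
    finally:
        if out is not sys.stdout:
            out.close()
    return 0
```

A real script would finish with `sys.exit(main())` so that the return value becomes the process exit code. Writing to stdout by default and only to a file when `--out` is given avoids surprise filenames such as the `filt.seq.fa.out` shown above.
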
  32. 32. Standards
  33. 33. Use the standard getopt interface Short options ( -h ) and long options ( --help ) ● C #include <getopt.h> ● C++ boost::program_options ● Python import argparse ● Perl use Getopt::Long ● R library(argparse) ● BASH getopt Command line interface
  34. 34. Unix exit codes ● A small non-negative integer (0 to 255) ● Loose standards ○ 0 = success ○ 1 = general failure ○ 2 = error with command line ○ 3..127 = user defined specific failures ● Result is in the shell $? variable
  35. 35. Accessing exit codes in the shell % ls /tmp/fake ls: cannot access /tmp/fake % echo $? 2 % ls /proc/cpuinfo /proc/cpuinfo % echo $? 0
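
The same exit codes are what a calling program sees. As a small sketch (the constant names and the helper `run_and_report` are made up for this example), Python's `subprocess.run` exposes the child's exit code as `returncode`, which is the value the shell would place in `$?`:

```python
import subprocess
import sys

# Conventional meanings, following the loose standard on the slide.
EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_USAGE = 2


def run_and_report(command):
    """Run a command quietly and return its exit code (what the shell stores in $?)."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


# Two child Python processes stand in for a tool that succeeds and one that
# fails with a user-defined code.
ok = run_and_report([sys.executable, "-c", "import sys; sys.exit(0)"])
failed = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
print(ok, failed)  # prints: 0 3
```

Workflow managers and shell scripts rely on this value to decide whether to continue, so a tool that prints an error but still exits with 0 can silently corrupt a pipeline.
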
  36. 36. Using stdin, stderr and stdout ● stdin (0) command < input ● stdout (1) command > output ● stderr (2) command 2> errors ● All command < input > output 2> errors ● Allows piping! sort input | command1 1> output 2> errors
  37. 37. This makes your tool useful in streaming % zcat seq.fastq.gz | cutadapt -a adapters.fa | qualtrim -Q 20 | bwa mem -t 8 ref.fa | samtools sort --threads 4 > seq.bam
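
A tool that can sit in the middle of such a pipeline reads from stdin, writes results to stdout, and keeps all messages on stderr. The following is a rough Python sketch of that pattern; the function `filter_fastq` and its minimum-length rule are hypothetical and are not one of the tools named on the slide.

```python
import sys
from itertools import islice


def filter_fastq(src, dst, log, min_len=50):
    """Copy 4-line FASTQ records from src to dst, dropping reads shorter than min_len.

    src must be a stream (such as sys.stdin or an open file), because the
    records are consumed four lines at a time from the same iterator.
    Progress and errors go to `log`, never to `dst`.
    """
    kept = dropped = 0
    while True:
        record = list(islice(src, 4))
        if not record:
            break
        if len(record) < 4:
            print("ERROR: truncated FASTQ record", file=log)
            return 1
        if len(record[1].rstrip("\n")) >= min_len:
            dst.writelines(record)
            kept += 1
        else:
            dropped += 1
    print(f"kept {kept} reads, dropped {dropped}", file=log)
    return 0
```

As a script this would be wired up with `sys.exit(filter_fastq(sys.stdin, sys.stdout, sys.stderr))`. Because the summary line goes to stderr, it cannot end up inside the data stream that the next command in the pipe receives.
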
  38. 38. Use standards compliant files * ● Feature coordinates ○ BED, GFF, VCF ● Columnar data (put headings!) ○ TSV ○ CSV ● Structured data ○ JSON ○ YAML * XML excepted
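
For columnar and structured output, standard-library writers already produce compliant files, so there is little reason to hand-roll a format. A brief sketch, with made-up records and column names, of writing TSV with a heading row and the same data as JSON:

```python
import csv
import io
import json

# Illustrative records; the column names are assumptions for this example.
rows = [
    {"contig": "chr1", "start": 100, "end": 250, "gene": "geneA"},
    {"contig": "chr1", "start": 900, "end": 1200, "gene": "geneB"},
]


def to_tsv(records, fieldnames):
    """Return the records as tab-separated text with a heading row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


tsv_text = to_tsv(rows, ["contig", "start", "end", "gene"])
json_text = json.dumps(rows, indent=2)
```

The heading row lets downstream users load the file by column name in R, pandas, or a spreadsheet without consulting the documentation for the column order.
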
  39. 39. Installation
  40. 40. Keeping your audience “Each equation in a book will halve your audience” “Each difficulty encountered during installation will halve your number of users” — @d_r_powell
  41. 41. Traditional systems level packaging ● Debian / DEB apt-get install blast dpkg -i blast-2.2.5-amd64.deb ● Redhat / RPM yum install blast rpm -i blast-2.2.5-x86_64.rpm ● Various others
  42. 42. Cross platform solutions: Linux, Mac, Windows ● Brew brew install blast ● Conda conda install blast ● Others ○ GUIX, ... ○ Docker, AMI images
  43. 43. Language specific repositories ● Python - PyPI pip install unicycler ● Perl - CPAN cpanm Bio::Roary ● R - CRAN install.packages("edgeseq3")
  44. 44. Marketing
  45. 45. Publish it ● Preprint archive ○ PeerJ, bioRxiv ● Method focussed journal ○ Bioinformatics, BMC Bioinformatics ● Software focussed journal ○ Journal of Open Source Software
  46. 46. Plug it ● Twitter ○ Ask someone popular you know to retweet it ● Blog ○ Start a general blog and post about your tool ● Conferences ○ Tell people about it
  47. 47. Support your users ● Reply to emails ● Monitor your “Issues” web site ● Monitor Biostars and SeqAnswers ● Have a mailing list ● Update your documentation ● Fix bugs
  48. 48. Conclusions
  49. 49. Take home messages ● Make it as painless as possible to install ● Keep documentation clear and simple ● Get people to use it before you publish ● People are not judging your coding skills ● But they will curse you if you waste their time ● Most users are grateful - leads to free beer ● A good tool is worth much more than a paper
  50. 50. What am I working on next?
  51. 51. Update on the TorstyVerse suite ● Ready ○ Snippy 4.x - rapid SNP calling and core SNP alignments ○ Shovill 1.x - wrapper around SPAdes to make it faster + better ○ Nullarbor 2.x - new plugin architecture ● Improvements ○ Abricate - AMR gene calling ➝ support NCBI hierarchy & classes ○ Prokka 1.14 ➝ ISfinder + AMR, better ncRNA anno, ... ● Planned ○ Mokka - metagenome annotation ○ Prokka 2 - genome annotation ➝ GO Terms, plugins, pseudo-genes
  52. 52. T
  53. 53. Acknowledgments ● Jennifer Gardy ● Duncan MacCannell ● Adam Phillippy ● The ASM NGS organising committee ● Anders Goncalves da Silva - The University of Melbourne ● David Powell - Monash University ● And everyone that has supported and encouraged me
  54. 54. Thanks for listening!
  55. 55. The end.