How to write bioinformatics software no one will use

A/Prof Torsten Seemann
@torstenseemann
ASM NGS 2018 - Washington DC, USA - Wed 25 Sep 2018

Feel free to tweet or photograph
Slides available from slideshare.net after the conference

“Immunity and infection”
● Research
● Teaching
● Public health and reference labs
● Diagnostic services
● Clinical care in ID and immunity

Microbiological Diagnostic Unit
● Oldest public health lab in Australia
○ established 1897 in Melbourne
○ historical ~500,000 isolate collection back to 1950s
● National reference laboratory
○ Salmonella, Listeria, EHEC
● WHO regional reference lab
○ vaccine preventable invasive bacterial pathogens

Bioinformatics software and me
Installed >1000 packages manually
Authored >100 Brew & Conda packages
Written and maintain >10 packages

Software tools for bacterial genomics

How to get a bioinformatics headache
1. See tweet about new published tool
2. Read abstract - sounds awesome!
3. Fail to find link to source code - eventually Google it
4. Attempt to compile and install it
5. Google for 30 min for fixes
6. Finally get it built
7. Run it on tiny data set
8. Get a vague error
9. Delete it and never come back

Should I stay for this talk?
YES
It will help you write good tools
YES
It will help you identify bad tools

Should you write a new tool?
● NO
○ It already exists
○ You are unable to maintain it
○ You won’t really use it
● YES
○ YOU need the tool
○ YOU will use the tool
○ YOU want others to use the tool
○ Desire to give back to the community

Lessons from the Prokka experience
● Nearly all feedback is positive
● People all over the world are grateful
● Warm fuzzy feeling inside
● Increase your public profile
● But maintenance burden and guilt

Choosing a home base
University or lab web site
Y

Choosing a name
● Try to be unique
○ Google to check for conflicts
○ Consider how non-native English speakers will pronounce it
○ Be creative!
● Avoid dodgy acronyms
○ Try not to win a JABBA Award
○ “Just Another Bogus Bioinformatics Acronym”

First impressions count
● “Keep It Simple Stupid”
● First page of documentation
○ What does it do?
○ How do I install it?
○ How do I run it?
● Try to keep in one place
○ Otherwise becomes inconsistent or missed

Print something useful if no parameters
% biotool
Please use --help for instructions

Always have a --help flag
% biotool -h
% biotool --help
Usage: biotool [options] seq.fa
--help Show this help
--version Print version and exit
--top N Keep top N sequences

Always have a --version flag
% biotool -v
% biotool -V
% biotool --version
biotool 1.3
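
The three slides above (no parameters, --help, --version) can be sketched in Python with argparse. This is a minimal sketch only: "biotool" is the hypothetical tool from the slides, and the description text and the default of 10 for --top are assumptions made for illustration.

```python
import argparse

VERSION = "1.3"  # hypothetical version, matching the slide's example output


def build_parser():
    """Build a parser for the hypothetical 'biotool' command."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",  # assumed
    )
    # argparse adds -h/--help automatically; the version flag is explicit.
    parser.add_argument("-v", "-V", "--version", action="version",
                        version=f"%(prog)s {VERSION}")
    parser.add_argument("--top", metavar="N", type=int, default=10,
                        help="keep top N sequences (default: %(default)s)")
    parser.add_argument("fasta", metavar="seq.fa", help="input FASTA file")
    return parser


def main(argv):
    parser = build_parser()
    if not argv:
        # No parameters: print something useful and signal a usage error.
        parser.print_usage()
        print("Please use --help for instructions")
        return 2
    args = parser.parse_args(argv)
    print(f"Would keep the top {args.top} sequences from {args.fasta}")
    return 0

# In a real script: sys.exit(main(sys.argv[1:]))
```

Note that argparse exits by itself after printing --help or --version, so main() only needs to handle the empty command line.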

Always raise an error when things go wrong
% biotool seq.fa
ERROR: cannot open file ‘seq.fa’
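
One possible way to produce an error like this in Python is to catch the failure where it happens, name the file, write to stderr, and exit non-zero. The function name read_fasta_text is hypothetical.

```python
import sys


def read_fasta_text(path):
    """Return the file contents, or exit with a clear message on failure."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as err:
        # Name the file, say why it failed, and report on stderr.
        print(f"ERROR: cannot open file '{path}': {err.strerror}",
              file=sys.stderr)
        sys.exit(1)  # non-zero exit so calling scripts can detect the failure
```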

Check that dependencies are installed
% biotool seq.fa
Checking BLAST... ok
Checking SAMtools... NOT FOUND!
Please install ‘samtools’ and add
it to your PATH.
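
A dependency check along these lines could be sketched with shutil.which from the Python standard library, which searches PATH for an executable. The function names are hypothetical, and the tool names passed in would be whatever your tool actually calls.

```python
import shutil
import sys


def missing_tools(tools):
    """Return the names in 'tools' that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def check_dependencies(tools):
    """Report each tool on stderr and exit if any are missing."""
    missing = missing_tools(tools)
    for name in tools:
        status = "NOT FOUND!" if name in missing else "ok"
        print(f"Checking {name}... {status}", file=sys.stderr)
    if missing:
        print(f"Please install {', '.join(missing)} and add to your PATH.",
              file=sys.stderr)
        sys.exit(1)
```

Calling check_dependencies(["blastn", "samtools"]) before any real work starts means the user finds out in the first second, not after an hour of processing.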

Always let users control output filenames
% biotool seq.fa
Processing ‘seq.fa’
Wrote result to ‘filt.seq.fa.out’
# ARGH!
% biotool --out seq.filt.fa seq.fa
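
A minimal sketch of an --out option in Python, assuming that writing to stdout is an acceptable default when no filename is given. The copy loop is only a placeholder for whatever filtering the hypothetical biotool would do.

```python
import argparse
import sys


def open_output(path):
    """Return a writable handle: stdout for '-', otherwise the named file."""
    if path == "-":
        return sys.stdout
    return open(path, "w")


def main(argv):
    parser = argparse.ArgumentParser(prog="biotool")
    parser.add_argument("--out", metavar="FILE", default="-",
                        help="output file (default: stdout)")
    parser.add_argument("fasta", help="input FASTA file")
    args = parser.parse_args(argv)

    out = open_output(args.out)
    try:
        with open(args.fasta) as handle:
            for line in handle:
                out.write(line)  # placeholder for the real processing
    finally:
        if out is not sys.stdout:
            out.close()
    return 0
```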

KISS - run with minimum parameters
% biotool seq.fa
ERROR: missing -x parameter
% biotool -x 3 seq.fa
ERROR: missing -y parameter
% biotool -x 3 -y 7 seq.fa
ERROR: need -n name
# ARGH!
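
The cure is sensible defaults, so that "biotool seq.fa" just works. A sketch with argparse follows; the defaults 3 and 7 are simply the values from the slide, and deriving the name from the input filename is an assumption about what a reasonable default might be.

```python
import argparse
import os


def parse_args(argv):
    """Parse options for the hypothetical biotool; only the input is required."""
    parser = argparse.ArgumentParser(prog="biotool")
    parser.add_argument("-x", type=int, default=3,
                        help="x parameter (default: %(default)s)")
    parser.add_argument("-y", type=int, default=7,
                        help="y parameter (default: %(default)s)")
    parser.add_argument("-n", metavar="NAME", default=None,
                        help="run name (default: input filename without extension)")
    parser.add_argument("fasta", help="input FASTA file")
    args = parser.parse_args(argv)
    if args.n is None:
        # Derive a name rather than forcing the user to supply one.
        args.n = os.path.splitext(os.path.basename(args.fasta))[0]
    return args
```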

Use the standard getopt interface
Short options ( -h ) and long options ( --help )
● C #include <getopt.h>
● C++ boost::program_options
● Python import argparse
● Perl use Getopt::Long
● R library(argparse)
● BASH getopt
Command line interface

Unix exit codes
● A small non-negative integer (0 to 255)
● Loose standards
○ 0 = success
○ 1 = general failure
○ 2 = error with command line
○ 3..127 = user defined specific failures
● Result is stored in the shell variable $?
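
A sketch of how a Python tool might follow this convention. The constant names and the choice of 3 for "input is not FASTA" are hypothetical; only 0, 1 and 2 follow the loose standard listed above.

```python
import sys

EXIT_OK = 0         # success
EXIT_FAILURE = 1    # general failure
EXIT_USAGE = 2      # error with the command line
EXIT_BAD_INPUT = 3  # user-defined: input does not look like FASTA (assumed)


def run(argv):
    """Return an exit code instead of exiting, which keeps this testable."""
    if len(argv) != 1:
        print("Usage: biotool seq.fa", file=sys.stderr)
        return EXIT_USAGE
    try:
        with open(argv[0]) as handle:
            first_line = handle.readline()
    except OSError as err:
        print(f"ERROR: cannot open file '{argv[0]}': {err.strerror}",
              file=sys.stderr)
        return EXIT_FAILURE
    if not first_line.startswith(">"):
        print(f"ERROR: '{argv[0]}' does not look like FASTA", file=sys.stderr)
        return EXIT_BAD_INPUT
    return EXIT_OK

# In a real script: sys.exit(run(sys.argv[1:]))
```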

Accessing exit codes in the shell
% ls /tmp/fake
ls: cannot access /tmp/fake
% echo $?
2
% ls /proc/cpuinfo
/proc/cpuinfo
% echo $?
0
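
The same exit code is what a wrapper script or pipeline manager sees. As an illustration, a Python caller could read it through subprocess.run; the helper name exit_code is hypothetical.

```python
import subprocess
import sys


def exit_code(command):
    """Run 'command' (a list of arguments) and return its exit code."""
    result = subprocess.run(command,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode

# Example: a child Python process that exits with code 3.
# exit_code([sys.executable, "-c", "raise SystemExit(3)"])
```

If your tool always exits 0, callers like this cannot tell a failed run from a good one.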

Using stdin, stderr and stdout
● stdin (0) command < input
● stdout (1) command > output
● stderr (2) command 2> errors
● All command < input > output 2> errors
● Allows piping!
sort input | command1 1> output 2> errors
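
A minimal sketch of a well-behaved filter in Python: data comes in on one stream, results go out on another, and messages go to a third. The function upper_fasta and its task (upper-casing sequence lines) are invented for illustration.

```python
import sys


def upper_fasta(instream, outstream, logstream):
    """Copy FASTA from instream to outstream, upper-casing sequence lines."""
    records = 0
    for line in instream:
        if line.startswith(">"):
            records += 1
            outstream.write(line)          # header lines pass through unchanged
        else:
            outstream.write(line.upper())  # sequence lines are upper-cased
    # Diagnostics go to the log stream so they never pollute piped data.
    print(f"Processed {records} sequences", file=logstream)
    return records

# As a command line filter: upper_fasta(sys.stdin, sys.stdout, sys.stderr)
```

Because nothing but results is written to stdout, the output can be piped straight into the next tool.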

This makes your tool useful in streaming
% zcat seq.fastq.gz |
cutadapt -a adapters.fa |
qualtrim -Q 20 |
bwa mem -t 8 ref.fa |
samtools sort --threads 4 > seq.bam

Use standards compliant files *
● Feature coordinates
○ BED, GFF, VCF
● Columnar data (put headings!)
○ TSV
○ CSV
● Structured data
○ JSON
○ YAML
* XML excepted
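
For columnar and structured output, the Python standard library already writes compliant TSV and JSON, so there is little reason to hand-roll either. A sketch, with hypothetical function names and column names:

```python
import csv
import json


def write_tsv(rows, columns, handle):
    """Write a list of dicts as tab-separated values with a header line."""
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()  # put headings!
    writer.writerows(rows)


def write_json(rows, handle):
    """Write the same records as indented JSON."""
    json.dump(rows, handle, indent=2)
    handle.write("\n")
```

Using the csv module rather than "\t".join() means quoting and embedded delimiters are handled for you.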

Keeping your audience
“Each equation in a book
will halve your audience”
“Each difficulty encountered during installation
will halve your number of users”
— @d_r_powell

Traditional systems level packaging
● Debian / DEB
apt-get install blast
dpkg -i blast-2.2.5-amd64.deb
● Redhat / RPM
yum install blast
rpm -i blast-2.2.5-x86_64.rpm
● Various others

Cross platform solutions: Linux, Mac, Windows
● Brew
brew install blast
● Conda
conda install blast
● Others
○ GUIX, ...
○ Docker, AMI images

Language specific repositories
● Python - PyPI
pip install unicycler
● Perl - CPAN
cpanm Bio::Roary
● R - CRAN
install.packages("edgeseq3")

Publish it
● Preprint archive
○ PeerJ, bioRxiv
● Method focussed journal
○ Bioinformatics, BMC Bioinformatics
● Software focussed journal
○ Journal of Open Source Software

Plug it
● Twitter
○ Ask someone popular you know to retweet it
● Blog
○ Start a general blog and post about your tool
● Conferences
○ Tell people about it

Support your users
● Reply to emails
● Monitor your “Issues” web site
● Monitor Biostars and SeqAnswers
● Have a mailing list
● Update your documentation
● Fix bugs

Take home messages
● Make it as painless as possible to install
● Keep documentation clear and simple
● Get people to use it before you publish
● People are not judging your coding skills
● But they will curse you if you waste their time
● Most users are grateful - leads to free beer
● A good tool is worth much more than a paper

What am I working on next?

Update on the TorstyVerse suite
● Ready
○ Snippy 4.x - rapid SNP calling and core SNP alignments
○ Shovill 1.x - wrapper around SPAdes to make it faster + better
○ Nullarbor 2.x - new plugin architecture
● Improvements
○ Abricate - AMR gene calling ➝ support NCBI hierarchy & classes
○ Prokka 1.14 ➝ ISfinder + AMR, better ncRNA anno, ...
● Planned
○ Mokka - metagenome annotation
○ Prokka 2 - genome annotation ➝ GO Terms, plugins, pseudo-genes

Acknowledgments
● Jennifer Gardy
● Duncan MacCannell
● Adam Phillippy
● The ASM NGS organising committee
● Anders Goncalves da Silva - The University of Melbourne
● David Powell - Monash University
● And everyone that has supported and encouraged me
