How to write bioinformatics software no one will use
A/Prof Torsten Seemann
@torstenseemann
ASM NGS 2018 - Washington DC, USA - Wed 25 Sep 2018
Feel free to tweet or photograph
Slides available from slideshare.net after the conference
Who am I ?
Melbourne, Australia
“Immunity and infection”
● Research
● Teaching
● Public health and reference labs
● Diagnostic services
● Clinical care in ID and immunity
Microbiological Diagnostic Unit
● Oldest public health lab in Australia
○ established 1897 in Melbourne
○ historical ~500,000 isolate collection back to 1950s
● National reference laboratory
○ Salmonella, Listeria, EHEC
● W.H.O regional reference lab
○ vaccine preventable invasive bacterial pathogens
Why am I here?
Bioinformatics software and me
Installed >1000 packages manually
Authored >100 Brew & Conda packages
Written and maintain >10 packages
Software tools for bacterial genomics
The Unix command line
How to get a bioinformatics headache
1. See tweet about new published tool
2. Read abstract - sounds awesome!
3. Fail to find link to source code - eventually Google it
4. Attempt to compile and install it
5. Google for 30 min for fixes
6. Finally get it built
7. Run it on tiny data set
8. Get a vague error
9. Delete and never revisit it again
Familiar output
Should I stay for this talk ?
YES
It will help you write good tools
YES
It will help you identify bad tools
Should you write a tool?
Should you write a new tool?
● NO
○ It already exists
○ You are unable to maintain it
○ You won’t really use it
● YES
○ YOU need the tool
○ YOU will use the tool
○ YOU want others to use the tool
○ Desire to give back to the community
Eating my own dog food
Lessons from the Prokka experience
● Nearly all feedback is positive
● People all over the world are grateful
● Warm fuzzy feeling inside
● Increase your public profile
● But maintenance burden and guilt
Discoverability
Choosing a home base
University or lab web site
Choosing a name
● Try to be unique
○ Google to check for conflicts
○ Consider how internationals will pronounce it
○ Be creative!
● Avoid dodgy acronyms
○ Try not to win a JABBA Award
○ “Just Another Bogus Bioinformatics Acronym”
Don’t be this person
First impressions count
● “Keep It Simple Stupid”
● First page of documentation
○ What does it do?
○ How do I install it?
○ How do I run it?
● Try to keep in one place
○ Otherwise becomes inconsistent or missed
Usability
A lesson from history
Print something useful if no parameters
% biotool
Please use --help for instructions
Always have a --help flag
% biotool -h
% biotool --help
Usage: biotool [options] seq.fa
--help Show this help
--version Print version and exit
--top N Keep top N sequences
Always have a --version flag
% biotool -v
% biotool -V
% biotool --version
biotool 1.3
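A minimal sketch of the three habits above (useful output with no parameters, a --help flag, a --version flag), using Python's standard argparse module. The tool name and --top option come from the example; the description text and the default of 10 are assumptions made for illustration.

```python
import argparse

VERSION = "1.3"  # placeholder matching the example output above


def build_parser():
    """Build the command-line parser for the hypothetical 'biotool'."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",  # assumed
    )
    # argparse adds -h/--help automatically.
    parser.add_argument(
        "-v", "-V", "--version",
        action="version",
        version=f"%(prog)s {VERSION}",
        help="Print version and exit",
    )
    parser.add_argument("--top", type=int, metavar="N", default=10,
                        help="Keep top N sequences (default: %(default)s)")
    parser.add_argument("fasta", metavar="seq.fa", help="Input FASTA file")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--top", "5", "seq.fa"])
print(args.top, args.fasta)
```

Because the input file is a required positional argument, running with no parameters makes argparse print a usage line and exit with code 2 rather than failing silently.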
Always raise an error when things go wrong
% biotool seq.fa
ERROR: can not open file ‘seq.fa’
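One way to do this in Python, as a sketch: catch the failure, say what went wrong on stderr, and stop with a non-zero exit code. The helper names die and read_lines are hypothetical.

```python
import sys


def die(message, code=1):
    """Print an error to stderr and stop with a non-zero exit code."""
    print(f"ERROR: {message}", file=sys.stderr)
    sys.exit(code)


def read_lines(path):
    """Return the lines of a file, or stop with a clear message if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().splitlines()
    except OSError as err:
        # err.strerror gives the reason, e.g. "No such file or directory"
        die(f"cannot open file '{path}': {err.strerror}")
```

Including the reason in the message saves the user from guessing whether the file is missing, unreadable, or a directory.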
Check that dependencies are installed
% biotool seq.fa
Checking BLAST... ok
Checking SAMtools... NOT FOUND!
Please install ‘samtools’ and add
it to your PATH.
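A dependency check like the one above can be sketched with shutil.which from the Python standard library, which looks a program up on PATH. The function names here are hypothetical, and the check only confirms the program exists, not that its version is suitable.

```python
import shutil
import sys


def missing_tools(tools):
    """Return the names in 'tools' that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def check_dependencies(tools):
    """Report each tool's status on stderr; exit non-zero if any are missing."""
    missing = missing_tools(tools)
    for name in tools:
        status = "NOT FOUND!" if name in missing else "ok"
        print(f"Checking {name}... {status}", file=sys.stderr)
    if missing:
        print(f"Please install {', '.join(missing)} and add to your PATH.",
              file=sys.stderr)
        sys.exit(1)
```

A tool would call something like check_dependencies(["blastn", "samtools"]) once at startup, before doing any real work.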
Always let users control output filenames
% biotool seq.fa
Processing ‘seq.fa’
Wrote result to ‘filt.seq.fa.out’
# ARGH!
% biotool --out seq.filt.fa seq.fa
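A sketch of an --out option in Python. Defaulting to standard output when no file is given is an assumption added here, not something the example states, and open_output is a hypothetical helper name.

```python
import argparse
import sys


def open_output(path):
    """Return a writable handle: stdout for '-', otherwise the named file."""
    return sys.stdout if path == "-" else open(path, "w")


parser = argparse.ArgumentParser(prog="biotool")
parser.add_argument("--out", default="-", metavar="FILE",
                    help="Output file (default: stdout)")
parser.add_argument("fasta", metavar="seq.fa")

# Explicit argument list so the sketch runs without a real command line.
args = parser.parse_args(["--out", "seq.filt.fa", "seq.fa"])
print(args.out)
```

The tool never invents a filename: the user either names the output or redirects it.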
KISS - run with minimum parameters
% biotool seq.fa
ERROR: missing -x parameter
% biotool -x 3 seq.fa
ERROR: missing -y parameter
% biotool -x 3 -y 7 seq.fa
ERROR: need -n name
# ARGH!
Standards
Use the standard getopt interface
Short options ( -h ) and long options ( --help )
● C #include <getopt.h>
● C++ boost::program_options
● Python import argparse
● Perl use Getopt::Long
● R library(argparse)
● BASH getopt
Command line interface
Unix exit codes
● A small non-negative integer (0 to 255)
● Loose standards
○ 0 = success
○ 1 = general failure
○ 2 = error with command line
○ 3..127 = user defined specific failures
● Result is in the shell $? variable
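A sketch of how a Python tool might follow these loose conventions. The constant names and the toy task (counting FASTA headers) are assumptions for illustration.

```python
import sys

EXIT_OK = 0        # success
EXIT_FAILURE = 1   # general failure
EXIT_USAGE = 2     # error with the command line


def main(argv):
    """Return an exit code rather than calling sys.exit deep inside the program."""
    if len(argv) != 1:
        print("Usage: biotool seq.fa", file=sys.stderr)
        return EXIT_USAGE
    try:
        with open(argv[0]) as handle:
            # Count FASTA records by their '>' header lines.
            count = sum(1 for line in handle if line.startswith(">"))
    except OSError as err:
        print(f"ERROR: {err}", file=sys.stderr)
        return EXIT_FAILURE
    print(count)
    return EXIT_OK
```

In a real script the last line would be sys.exit(main(sys.argv[1:])), so the returned value becomes the process exit code.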
Accessing exit codes in the shell
% ls /tmp/fake
ls: cannot access /tmp/fake
% echo $?
2
% ls /proc/cpuinfo
/proc/cpuinfo
% echo $?
0
Using stdin, stderr and stdout
● stdin (0) command < input
● stdout (1) command > output
● stderr (2) command 2> errors
● All command < input > output 2> errors
● Allows piping!
sort input | command1 1> output 2> errors
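A sketch of a pipe-friendly filter in Python: data comes in on stdin, results go out on stdout, and messages go to stderr so they never contaminate the output. The length filter and the function names are hypothetical.

```python
import sys


def filter_fasta(lines, min_len):
    """Yield (header, sequence) records whose sequence has at least min_len bases."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None and sum(map(len, chunks)) >= min_len:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record, which has no following header to trigger it.
    if header is not None and sum(map(len, chunks)) >= min_len:
        yield header, "".join(chunks)


def run(instream=sys.stdin, outstream=sys.stdout, min_len=1):
    """Read FASTA from instream, write kept records to outstream, log to stderr."""
    kept = 0
    for header, seq in filter_fasta(instream, min_len):
        outstream.write(f"{header}\n{seq}\n")
        kept += 1
    print(f"Kept {kept} sequences", file=sys.stderr)
    return kept
```

Because the streams are parameters, the same function works in a shell pipeline and in a unit test.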
This makes your tool useful in streaming
% zcat seq.fastq.gz |
cutadapt -a adapters.fa |
qualtrim -Q 20 |
bwa mem -t 8 ref.fa |
samtools sort --threads 4 > seq.bam
Use standards compliant files *
● Feature coordinates
○ BED, GFF, VCF
● Columnar data (put headings!)
○ TSV
○ CSV
● Structured data
○ JSON
○ YAML
* XML excepted
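A sketch of writing columnar and structured output with the Python standard library. The column names and values are made up for illustration.

```python
import csv
import json
import sys

# Illustrative records only; not real data.
rows = [
    {"seq_id": "contig1", "length": 5120, "gc": 0.51},
    {"seq_id": "contig2", "length": 870, "gc": 0.47},
]


def write_tsv(records, handle):
    """Write records as tab-separated values with a heading row."""
    writer = csv.DictWriter(handle, fieldnames=["seq_id", "length", "gc"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


def write_json(records, handle):
    """Write the same records as JSON for programs that want structured data."""
    json.dump(records, handle, indent=2)
    handle.write("\n")


write_tsv(rows, sys.stdout)
```

Letting the csv module handle quoting and delimiters avoids the hand-rolled formats that break downstream parsers.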
Installation
Keeping your audience
“Each equation in a book
will halve your audience”
“Each difficulty encountered during installation
will halve your number of users”
— @d_r_powell
Traditional systems level packaging
● Debian / DEB
apt-get install blast
dpkg -i blast-2.2.5-amd64.deb
● Redhat / RPM
yum install blast
rpm -i blast-2.2.5-x86_64.rpm
● Various others
Cross platform solutions: Linux, Mac, Windows
● Brew
brew install blast
● Conda
conda install blast
● Others
○ GUIX, ...
○ Docker, AMI images
Language specific repositories
● Python - PyPI
pip install unicycler
● Perl - CPAN
cpanm Bio::Roary
● R - CRAN
install.packages("edgeseq3")
Marketing
Publish it
● Preprint archive
○ PeerJ, bioRxiv
● Method focussed journal
○ Bioinformatics, BMC Bioinformatics
● Software focussed journal
○ Journal of Open Source Software
Plug it
● Twitter
○ Ask someone popular you know to retweet it
● Blog
○ Start a general blog and post about your tool
● Conferences
○ Tell people about it
Support your users
● Reply to emails
● Monitor your “Issues” web site
● Monitor Biostars and SeqAnswers
● Have a mailing list
● Update your documentation
● Fix bugs
Conclusions
Take home messages
● Make it as painless as possible to install
● Keep documentation clear and simple
● Get people to use it before you publish
● People are not judging your coding skills
● But they will curse you if you waste their time
● Most users are grateful - leads to free beer
● A good tool is worth much more than a paper
What am I working on next?
Update on the TorstyVerse suite
● Ready
○ Snippy 4.x - rapid SNP calling and core SNP alignments
○ Shovill 1.x - wrapper around SPAdes to make it faster + better
○ Nullarbor 2.x - new plugin architecture
● Improvements
○ Abricate - AMR gene calling ➝ support NCBI hierarchy & classes
○ Prokka 1.14 ➝ ISfinder + AMR, better ncRNA anno, ...
● Planned
○ Mokka - metagenome annotation
○ Prokka 2 - genome annotation ➝ GO Terms, plugins, pseudo-genes
Acknowledgments
● Jennifer Gardy
● Duncan MacCannell
● Adam Phillippy
● The ASM NGS organising committee
● Anders Goncalves da Silva - The University of Melbourne
● David Powell - Monash University
● And everyone that has supported and encouraged me
Thanks for listening!
The end.
