Qualimap
• What: Analyse insert size distribution
• Why: Determine whether sequencing has
been effective, particularly for de novo
assembly, and whether adaptor trimming is needed
• Where: http://qualimap.bioinfo.cipf.es/
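The insert size check above can be illustrated with a minimal sketch. This is not Qualimap's implementation: it assumes you have already extracted signed template lengths (for example the TLEN column of a SAM file) into a list, and the function name, the example values and the 150 bp read length are all hypothetical.

```python
from statistics import median

def insert_size_summary(template_lengths, read_length):
    """Summarise insert sizes from signed template lengths (one value per read)."""
    # Keep positive values only, so each read pair is counted once.
    sizes = sorted(t for t in template_lengths if t > 0)
    if not sizes:
        raise ValueError("no positive template lengths supplied")
    med = median(sizes)
    # Median absolute deviation is robust to a few mis-mapped pairs.
    mad = median(abs(s - med) for s in sizes)
    # Inserts shorter than the read length mean the read runs into adaptor.
    short = sum(1 for s in sizes if s < read_length)
    return {"pairs": len(sizes), "median": med, "mad": mad,
            "fraction_shorter_than_read": short / len(sizes)}

# Hypothetical template lengths for five read pairs (both mates listed).
example_tlens = [310, -310, 295, -295, 120, -120, 4000, -4000, 300, -300]
summary = insert_size_summary(example_tlens, read_length=150)
print(summary)
```

A high fraction of inserts shorter than the read length is one hint that adaptor trimming is worth doing before assembly.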
Trimmomatic
• What: One of several million read trimmers
• Why: To remove sequencing adaptors, which
may influence the results of de novo assembly
• Where:
http://www.usadellab.org/cms/?page=trimmomatic
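To show what adaptor removal means in practice, here is a deliberately simplified sketch. It is not Trimmomatic's algorithm (which tolerates mismatches and has a paired-end mode); it only clips exact matches at the 3' end of a read. The function name, the `min_overlap` parameter and the adaptor prefix used in the example are assumptions for illustration.

```python
def clip_adaptor(read, adaptor, min_overlap=3):
    """Remove an exact adaptor match (full or partial) from the 3' end of a read."""
    # Case 1: the whole adaptor occurs somewhere in the read.
    idx = read.find(adaptor)
    if idx != -1:
        return read[:idx]
    # Case 2: the read ends part-way through the adaptor.
    longest = min(len(adaptor), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    # No adaptor found: return the read unchanged.
    return read

# Example adaptor prefix, assumed here purely for illustration.
ADAPTOR = "AGATCGGAAGAGC"
full_hit = clip_adaptor("ACGTACGTAGATCGGAAGAGCTTT", ADAPTOR)
partial_hit = clip_adaptor("ACGTACGTAGATC", ADAPTOR)
no_hit = clip_adaptor("ACGTACGT", ADAPTOR)
print(full_hit, partial_hit, no_hit)
```

For real data, use a dedicated trimmer; exact matching misses adaptors that contain sequencing errors.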
Species ID: BLAST
• What: Only the most famous bioinformatics
algorithm ever made
• Why: A few random BLAST searches will reveal
much important information about your data
before you start on a pipeline analysis
• Where: http://ncbi.nlm.nih.gov/BLAST
Species ID: Metaphlan
• What: Designed for metagenomics, this
algorithm will find “taxon-defining” genes to
identify what species are in a sample
• Why: Check for extent of sample
contamination, give an accurate species ID for
unknown samples
• Where:
https://bitbucket.org/biobakery/metaphlan2
Species ID: Kraken
• What: Similar to Metaphlan but even faster
and with a more complete database
• Why: Check for extent of sample
contamination, give an accurate species ID for
unknown samples
• Where: https://ccb.jhu.edu/software/kraken/
Species ID: MOCAT
• What: Uses a phylogenetic approach to
identify novel or divergent species by relying
on distances in conserved marker genes
• Why: Sometimes you sequence something
completely novel and want to know more
about its relationships
• Where: http://vm-lux.embl.de/~kultima/MOCAT/
• Alternatives: Phylosift, rMLST
Sample QC: Blobology
• What: A simple method of plotting de novo
assembly contigs by GC, coverage and taxon
• Why: Characterise contamination, plasmids,
lytic phage in a sample
• Where:
https://github.com/blaxterlab/blobology
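The idea behind such a plot can be sketched as a per-contig table of GC fraction, coverage and taxon. This is a minimal illustration, not the Blobology code itself: it assumes contig sequences, mean coverage values and taxon assignments are already held in dictionaries, and all names and values below are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def blob_table(contigs, coverage, taxa):
    """One row per contig: name, length, GC fraction, coverage, taxon."""
    rows = []
    for name, seq in contigs.items():
        rows.append((name, len(seq), gc_fraction(seq),
                     coverage.get(name, 0.0), taxa.get(name, "no-hit")))
    return rows

# Hypothetical inputs: two tiny contigs with coverage and taxon labels.
contigs = {"contig_1": "GGCCGGCC", "contig_2": "ATATATGC"}
coverage = {"contig_1": 35.0, "contig_2": 410.0}
taxa = {"contig_1": "Escherichia"}
rows = blob_table(contigs, coverage, taxa)
for row in rows:
    print(*row, sep="\t")
```

Contigs whose GC or coverage sits well away from the main cluster are the ones worth checking as contamination, plasmids or phage.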
Alignment: BWA
• What: The standard method for aligning
Illumina sequences to a reference; use it in
BWA-MEM mode, which works well with most
read lengths
• Why: Finds the likely location of each
sequence read in a reference genome
• Where: https://github.com/lh3/bwa
• Alternatives: SMALT, Bowtie2 (beware
standard insert size parameters)
Variant calling: samtools & VarScan
• What: A way of calling SNPs against a
reference in one or more samples
• Why: VarScan permits easy filtering of SNPs by
allele frequency and strand, useful for getting
a precise dataset
• Where: http://www.htslib.org/
• http://varscan.sourceforge.net/
• Alternatives: GATK, snippy, Nesoni
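Filtering SNPs by allele frequency and strand can be sketched as follows. This is an illustration of the idea only, not VarScan's code or defaults: the record fields (`ref_fwd`, `alt_rev` and so on) and both thresholds are hypothetical.

```python
def passes_filters(site, min_freq=0.9, min_per_strand=2):
    """Keep a SNP only if the variant allele dominates and is seen on both strands."""
    alt = site["alt_fwd"] + site["alt_rev"]
    depth = alt + site["ref_fwd"] + site["ref_rev"]
    if depth == 0:
        return False
    freq = alt / depth
    # Require variant-supporting reads on both the forward and reverse strand.
    both_strands = (site["alt_fwd"] >= min_per_strand
                    and site["alt_rev"] >= min_per_strand)
    return freq >= min_freq and both_strands

# Hypothetical per-site read counts, split by allele and strand.
sites = [
    {"pos": 101, "ref_fwd": 0, "ref_rev": 1, "alt_fwd": 20, "alt_rev": 18},
    {"pos": 250, "ref_fwd": 15, "ref_rev": 14, "alt_fwd": 6, "alt_rev": 5},
    {"pos": 377, "ref_fwd": 0, "ref_rev": 0, "alt_fwd": 30, "alt_rev": 0},
]
kept = [s["pos"] for s in sites if passes_filters(s)]
print(kept)
```

Here position 250 fails on allele frequency and position 377 fails because every variant read comes from one strand.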
Recombination filtering: Gubbins
• What: Detects regions that have undergone
recombination, which would confound phylogenetic
reconstructions that assume clonality
• Why: Important when attempting phylogenetic
reconstructions from recombining organisms
• Where: http://sanger-pathogens.github.io/gubbins/
• Alternatives: ClonalFrameML, BRATNextGen
Tree building: FastTree
• What: Phylogenetic reconstructions from SNP
data
• Why: Tree reconstructions are an effective way of
examining evolutionary relationships in isolates
and testing if they are from an outbreak, FastTree
• Note: Ensure you don’t hit the double-precision
bug!
(http://darlinglab.org/blog/2015/03/23/not-so-fast-fasttree.html)
• Where:
http://meta.microbesonline.org/fasttree/
• Alternatives: RAxML (more thorough, slower), REALPHY
(http://realphy.unibas.ch/fcgi/realphy)
MLST & Antibiogram: SRST2
• What: Aligns reads against MLST and
antibiotic resistance databases
• Why: Permits MLST typing with genome data
and a rough prediction of antibiotic resistance
• Where: http://katholt.github.io/srst2/
De novo assembly: SPAdes
• What: A reliable de novo assembler which
works well with multiple data types
• Why: Has a built-in error corrector, so no need
for read trimming; can use multiple values of k,
so less need for experimentation; consistently
performs well in comparisons
• Where: http://bioinf.spbau.ru/spades
De novo assembly: Velvet
• What: The original short-read assembler
• Why: Extremely fast for draft assemblies,
particularly if you just want to do MLST or
antibiograms
• Where:
https://www.ebi.ac.uk/~zerbino/velvet/
• Alternatives: MEGAHIT – even faster!
Annotation: Prokka
• What: Takes de novo assembly contig files and
annotates them with coding sequences and non-
coding features such as RNAs
• Why: A very sensible set of tools and reference
databases in a single package, produces usable
output for other software and database
submission
• Where:
http://www.vicbioinformatics.com/software.prokka.shtml
• Alternatives: xBASE annotation interface
Tree building: Harvest
• What: Takes de novo assembly contigs, performs
whole-genome alignment and permits
reconstruction of core genome phylogenies
• Why: Scalable to hundreds of genomes on a
laptop and with an excellent viewer
• Where:
http://harvest.readthedocs.org/en/latest/index.html
• Alternatives: Mauve
Population genomics: BIGSdb
• What: Takes de novo assembly contigs and
applies MLST-like schemes working on
hundreds or thousands of core genes
• Why: Scalable to thousands of genomes for
rapid population-level clustering
• Where:
http://pubmlst.org/software/database/bigsdb/
• Alternatives: Bionumerics
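The MLST-like comparison behind this kind of clustering can be sketched as a count of loci at which two isolates carry different allele numbers. This is a simplified illustration rather than BIGSdb's implementation; the locus names, allele numbers and the choice to skip missing loci are assumptions.

```python
def allele_distance(profile_a, profile_b):
    """Number of shared, called loci at which two allele profiles differ."""
    differences = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        # Skip loci that are missing or uncalled in either isolate.
        if allele_a is None or allele_b is None:
            continue
        if allele_a != allele_b:
            differences += 1
    return differences

# Hypothetical allele profiles; a real scheme would have hundreds
# or thousands of core gene loci rather than four.
isolate_1 = {"adk": 1, "fumC": 4, "gyrB": 2, "icd": 7}
isolate_2 = {"adk": 1, "fumC": 5, "gyrB": 2, "icd": None}
distance = allele_distance(isolate_1, isolate_2)
print(distance)
```

Because each comparison is just integer matching per locus, pairwise distances stay cheap even with very large numbers of genomes.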
Pan/accessory genomes: LS-BSR
• What: Takes de novo assembly contigs or
annotations and compares gene content
• Why: To determine differences in gene
content across one to thousands of strains
• Where: https://github.com/jasonsahl/LS-BSR
• Alternatives: OrthoMCL
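The BLAST score ratio that gives LS-BSR its name is the bit score of a gene against a genome divided by the bit score of that gene against itself. The sketch below applies that ratio to precomputed scores; it is not the LS-BSR code, and the dictionary layout, gene and strain names, and the 0.8 threshold for calling a gene present are assumptions for illustration.

```python
def bsr_matrix(self_scores, genome_scores):
    """BLAST score ratio per gene and genome: genome bit score / self bit score."""
    matrix = {}
    for gene, self_score in self_scores.items():
        matrix[gene] = {
            # A gene with no hit in a genome gets a ratio of 0.
            genome: scores.get(gene, 0.0) / self_score
            for genome, scores in genome_scores.items()
        }
    return matrix

def split_core_accessory(matrix, threshold=0.8):
    """Genes at or above the threshold in every genome are core; others accessory."""
    core, accessory = [], []
    for gene, row in matrix.items():
        if all(ratio >= threshold for ratio in row.values()):
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory

# Hypothetical bit scores for two genes in two strains.
self_scores = {"geneA": 200.0, "geneB": 100.0}
genome_scores = {
    "strain_1": {"geneA": 200.0, "geneB": 95.0},
    "strain_2": {"geneA": 190.0},
}
matrix = bsr_matrix(self_scores, genome_scores)
core, accessory = split_core_accessory(matrix)
print(core, accessory)
```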
MLST/Antibiogram: mlst and Abricate
• What: Works on de novo assemblies to give
MLST and antibiotic resistance
predictions
• Why: A very fast method
• Where: https://github.com/tseemann/mlst
• https://github.com/tseemann/abricate
• Alternatives: SRST2
CLoud Infrastructure for Microbial
Bioinformatics (CLIMB)
• MRC-funded project to develop cloud
infrastructure for microbial bioinformatics
• £4M of hardware, capable of supporting
>1000 individual virtual servers
• An Amazon/Google-style cloud for academics
Acknowledgements
• Twitter comments:
– Tom Connor, Alan McNally, Torsten Seemann, C.
Titus Brown, Heng Li, Christoffer Flensburg, Matt
MacManes, Rachel Glover, Willem van Schaik, Bill
Hanage, Jennifer Gardy, Mick Watson, Esther
Robinson, Nicola Fawcett, Aziz Aboobaker,
Ruth Massey