7. Whole-genome sequencing:
utility in clinical microbiology
• Diagnostics
– Species, subspecies, strain identification
– In silico antibiogram
– In silico virulence profile
• Surveillance
– Typing (including backwards compatibility with MLST and
serotype)
– What strains and resistance elements are lurking in my
hospital/community?
• Forensic epidemiology
– Is there an outbreak?
– Who gave what to whom?
8. Common types of sequencing
• Paired-end Illumina (typically 150 – 300 bases)
• Single-end Ion Torrent (typically 300-400
bases)
– Can be treated more or less the same
• Pacific Biosciences or Oxford Nanopore
– Requires special handling, not covered today
9. Quality Control: Questions to Ask
• Did my sequencing work?
• What are the fragment lengths?
• Is my sample what I think it is?
• Is my sample contaminated?
Workflow steps and tools:
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology
11. What coverage do I have?
• SNP calling: >10x (>15x better)
• De novo assembly: >30x (50x probably better)
• Absolutely no benefit above about 100x for
standard applications: it slows everything
down and takes more disk space
• (BTW, FASTQ files are probably a waste of
space)
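The coverage thresholds above can be checked with simple arithmetic: total sequenced bases divided by genome size. The following is a minimal sketch; the function names and the example run (2,000,000 reads of 150 bases on a 5 Mb genome) are illustrative assumptions, not part of the talk.

```python
def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def coverage_verdict(depth: float) -> str:
    """Map a depth onto the rules of thumb from the slide (>10x, >30x, ~100x)."""
    if depth > 100:
        return "more than needed: consider downsampling to about 100x"
    if depth > 30:
        return "enough for de novo assembly and SNP calling"
    if depth > 10:
        return "enough for SNP calling, thin for de novo assembly"
    return "too low for reliable SNP calling"


if __name__ == "__main__":
    # Hypothetical run: 2,000,000 reads of 150 bases on a 5 Mb genome.
    depth = mean_coverage(n_reads=2_000_000, read_length=150, genome_size=5_000_000)
    print(f"{depth:.0f}x: {coverage_verdict(depth)}")
    # prints "60x: enough for de novo assembly and SNP calling"
```

This ignores reads lost to trimming, contamination or failure to map, so real depth will usually be somewhat lower than the estimate.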
12. What are the fragment lengths?
• Qualimap (or just BWA)
• Bad: fragment length < read length
• OK: fragment length > read length
• Good: fragment length > 2x read length
You are in dangerous territory dealing with
repetitive regions longer than the fragment
length, regardless of read depth coverage
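As a rough illustration of the bad/OK/good rule above, the sketch below classifies a library from a list of observed insert sizes. It assumes the insert sizes have already been obtained elsewhere (for example from a Qualimap report or from alignments); the function names are hypothetical, and treating a fragment exactly equal to the read length as "ok" is an assumption, since the slide does not cover that case.

```python
from statistics import median
from typing import Sequence


def fragment_verdict(fragment_length: float, read_length: int) -> str:
    """Apply the slide's rule of thumb to one fragment length."""
    if fragment_length < read_length:
        return "bad"    # reads run through the fragment into the adaptor
    if fragment_length > 2 * read_length:
        return "good"   # paired reads do not overlap and span more sequence
    return "ok"


def library_verdict(insert_sizes: Sequence[int], read_length: int) -> str:
    """Classify a whole library using its median insert size."""
    if not insert_sizes:
        raise ValueError("no insert sizes supplied")
    return fragment_verdict(median(insert_sizes), read_length)


if __name__ == "__main__":
    # Hypothetical insert sizes for a 150-base paired-end run.
    print(library_verdict([380, 420, 450], read_length=150))  # prints "good"
```

A "good" verdict says nothing about repeats longer than the fragment length, which stay unresolvable whatever the depth.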
13. Repetitive regions
This is important because repeat-containing regions are often
the most interesting parts of the genome! Think:
• Insertion elements
• Transposons
• Plasmids
• Ribosomal RNA
REPEAT: You are in dangerous territory dealing
with repetitive regions longer than the fragment
length, regardless of read depth coverage
14. Do not trust the computer
Bioinformatics software will do its best to look
like it is dealing with repeats in a rational way,
but it is in fact plotting aggressively to ruin your
analysis without telling you.
Computers are just like that!
If repeats are important to your analysis, you need an
alternative sequencing strategy: long mate-pairs, long reads
(Pacific Biosciences or Oxford Nanopore). Don’t drive
yourself mad making short reads do what they can’t.
15. Adaptor trim reads
• With Nextera libraries, failing to adaptor trim
will KILL your assemblies.
• Particularly important when mean fragment
length < read length.
• Many trimmers available: I like to use
Trimmomatic
• Quality trimming is not important with modern
tools (BWA and SPAdes)
For more explanation: http://nickloman.github.io/high-throughput%20sequencing/genomics/bioinformatics/2013/04/17/adaptor-trim-or-die-experiences-with-nextera-libraries/
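To see why short fragments matter, note that when the fragment is shorter than the read, the sequencer reads through the insert and into the adaptor, so adaptor sequence ends up at the 3' end of the read. The sketch below is a deliberately naive exact-match trimmer for illustration only. It is not how Trimmomatic works (real trimmers tolerate mismatches and use paired-end information), and the adaptor constant is the commonly quoted Nextera transposase sequence, which should be checked against your own kit documentation.

```python
# Assumed adaptor: the commonly quoted Nextera transposase sequence.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def trim_adapter(read: str, adapter: str = NEXTERA_ADAPTER, min_overlap: int = 5) -> str:
    """Remove a 3' adaptor by exact matching (no mismatches allowed)."""
    # Case 1: the whole adaptor is present somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way through the adaptor, so look for the
    # longest adaptor prefix (at least min_overlap bases) that ends the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[: len(read) - overlap]
    return read


if __name__ == "__main__":
    insert = "ACGTACGTAGCTAGCTAGGA"  # hypothetical 20-base fragment
    read_through = insert + NEXTERA_ADAPTER + "GGGG"
    print(trim_adapter(read_through) == insert)  # prints True
```

Untrimmed adaptor bases look like real sequence to an assembler, which is why they can wreck assemblies of short-fragment libraries.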
16. Is my sample what I think it is?
• BLASTing a few random reads is usually a very
efficient quality control check, and also
helps identify a reference genome
• Kraken or Metaphlan can give rapid organism
report
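Picking "a few random reads" from a large FASTQ file can be done in one pass with reservoir sampling, without loading the file into memory. This is a minimal sketch under stated assumptions: it expects plain four-line FASTQ records, and the function names are hypothetical. The FASTA output can then be pasted into BLAST.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, str]  # (read name, sequence)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines, e.g. at the end of the file
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line
        next(it, None)  # quality line
        yield header.strip()[1:], seq


def sample_reads(records: Iterable[Record], k: int, seed: Optional[int] = None) -> List[Record]:
    """Reservoir sampling: a uniform random sample of k records in one pass."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir


def to_fasta(records: Iterable[Record]) -> str:
    """Format records as FASTA text for pasting into BLAST."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    # Tiny in-memory FASTQ; with a real file, pass open("reads.fastq") instead.
    fastq = []
    for n in range(10):
        fastq += [f"@read{n}", "ACGT" * 5, "+", "I" * 20]
    print(to_fasta(sample_reads(read_fastq(fastq), k=3, seed=1)))
```

A handful of reads is enough to catch a grossly wrong organism, but it will not reveal low-level contamination.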
17. Species identification
• Methods:
– 16S rDNA extraction (typically following de novo
assembly and annotation) and BLAST
– Taxon-defining genes (e.g. Metaphlan)
– Phylogenetic approach (e.g. MOCAT, Phylosift)
20. Sources of contamination
• Accidental multiple colony picks or mixed liquid
culture
– Same or different organism
– E.g. Achromobacter & Pseudomonas aeruginosa in CF
• Reagent contamination (DNA extractions)
• Sequencer “carry-over” (0.2%?)
• PhiX control sequence <- don’t be this guy
• Barcode “cross-over” (bad pipetting technique or
contaminated reagents)
25. Reference-based or de novo?
• Reference-based
– Implies ALIGNMENT to reference
– Implies you HAVE a reference
– Allows exquisitely sensitive and specific SNP calling
(forensic SNP calling to single mutation precision)
– Important for looking at CHAINS OF TRANSMISSION
– Can only call in parts of the genome COMMON
between your SAMPLES and REFERENCE: the CORE
26. Reference-based or de novo?
• De-novo
– Implies de novo assembly
– Does NOT require a reference
– Gives access to the entire PAN-genome
– E.g.
• Unexpected antibiotic resistance genes
• Virulence factors
– Can give misleading results in REPEAT sequences
– Not suitable for very fine-resolution SNP analysis
27. In practice
• Most people will want to do both.
• And if you have no reference, you can use a
draft de novo assembly AS your reference
– But exercise caution
28. Reference-based approach
Workflow steps and tools:
• Read QC: FastQC, Qualimap, Kraken, BLAST
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: Custom script, snippy, snpEff, BRESEQ
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RaXML
• MLST/Antibiogram: SRST2
29. Analysis choice highly species
dependent: not one size fits all!
• What is the mode and tempo of evolution?
• Monomorphic organisms:
– Characterised by vertical pattern of inheritance
– Isolates differ by few mutations
• Highly recombinogenic organisms
– Mutations dominated by recombination
– May have vast differences in gene content, gene
order
– “Clonal frame” may be obscured or absent
30. Different species require different
analysis strategies
[Figure: species arranged along an axis of increasing variation, from clonal
population structure and branching phylogenies at one end to open pan-genomes,
horizontal gene transfer, high rates of recombination and phylogenetic networks
at the other. Species shown: M. tuberculosis, S. aureus, B. anthracis, E. coli,
P. aeruginosa, N. meningitidis, S. pneumoniae, Salmonella]
31. Tips for picking a reference
• The higher quality the better (aim for pre-NGS
Sanger genomes, e.g. <2001)
• Ideally single contig, no gaps
• Canonical strains have the most portable and
widely referenced gene identifiers, e.g. TB H37Rv,
PAO1, E. coli K-12 etc.
• For SNP calling specificity: more closely
related is better
32. The core genome
• The core genome used to
call SNPs will reduce as
more genomes are added
• Particularly noticeable in
species with highly
plastic genomes: E. coli
• Has significance for
forensic applications
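The shrinking core can be illustrated by treating each sample as the set of reference positions it covers and taking a running intersection. This is a toy sketch with made-up data and a hypothetical function name; real tools such as Harvest work from alignments rather than Python sets.

```python
from typing import Dict, List, Set


def core_sizes(covered: Dict[str, Set[int]]) -> List[int]:
    """Core genome size after adding each sample in turn.

    `covered` maps a sample name to the reference positions it covers.
    The core is the intersection of all samples seen so far, so it can
    only stay the same or shrink as samples are added.
    """
    core = None
    sizes: List[int] = []
    for positions in covered.values():
        core = set(positions) if core is None else core & positions
        sizes.append(len(core))
    return sizes


if __name__ == "__main__":
    # Toy 10-position reference; each sample lacks a different region.
    samples = {
        "sample_A": set(range(0, 10)),
        "sample_B": set(range(0, 8)),
        "sample_C": set(range(2, 10)),
    }
    print(core_sizes(samples))  # prints [10, 8, 6]
```

Because the callable region depends on which samples are included, SNP distances between the same two isolates can change when the dataset changes.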
33. Is my reference good enough?
• Assess core genome size
– Harvest will do this for you
• Or look at samtools flagstat (?)
• Between-sample SNP calling efficiency goes
down with reference divergence
• Luxury option: get a Pacific Biosciences
complete reference done for each “clone” in
your dataset (for some definition of clone)
34. Effect of closer reference on P.
aeruginosa genotyping
Reference          SNPs   Indels   Mapped
PAO1 reference     23     4        77%
PacBio reference   40     5        97%
Quick, Loman et al. BMJ Open 2014
35. SNP filtering
• A specific SNP dataset (few false positives) is vital
for effective phylogenetic reconstructions and outbreak
tracing
• Most SNP calling errors come from
– A) misalignment (sequence present in the sample but not
in the reference still aligns, to the wrong place)
– B) copy number variation (2 copies in sample, 1 copy
in reference)
• NOT from sequencing error (at least with
Illumina: other platforms have systematic errors)
36. SNP filtering (2)
• Allele frequency filter is most effective SNP filter
– AF > 0.9 (90%) works very well empirically
• Strand filter also very useful to prevent SNPs
around structural variations
• Filtering for low coverage not that helpful:
– 1/1000 error rate (Q30) at a minimum coverage of 3:
(1/1000)^3 = .000000001 chance of an error per position = < 1
error per genome
• Avoid SNPs at ends of contigs as these may be
mismapping
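The filters above can be sketched in a few lines. This is an illustration under stated assumptions, not the speaker's script: the `SnpCall` record and its per-strand counts are hypothetical, and "strand filter" is interpreted here as requiring the alternative allele on both strands, which is only one common form. The last function reproduces the slide's error arithmetic, assuming independent errors that all agree at one position.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Hypothetical per-site read counts, split by allele and strand."""
    pos: int
    ref_fwd: int
    ref_rev: int
    alt_fwd: int
    alt_rev: int

    @property
    def depth(self) -> int:
        return self.ref_fwd + self.ref_rev + self.alt_fwd + self.alt_rev

    @property
    def allele_frequency(self) -> float:
        return (self.alt_fwd + self.alt_rev) / self.depth


def passes_filters(call: SnpCall, min_af: float = 0.9, min_alt_per_strand: int = 1) -> bool:
    """Allele frequency filter (AF > 0.9) plus a simple strand filter."""
    if call.depth == 0:
        return False
    return (
        call.allele_frequency > min_af
        and call.alt_fwd >= min_alt_per_strand
        and call.alt_rev >= min_alt_per_strand
    )


def expected_false_snps(per_base_error: float, min_depth: int, genome_size: int) -> float:
    """Expected false SNPs if every read at a site must carry an error."""
    return per_base_error ** min_depth * genome_size


if __name__ == "__main__":
    print(passes_filters(SnpCall(100, 0, 0, 10, 12)))  # True: AF 1.0, both strands
    print(passes_filters(SnpCall(200, 5, 5, 5, 5)))    # False: AF 0.5
    print(passes_filters(SnpCall(300, 0, 0, 20, 0)))   # False: one strand only
    # Q30 (1/1000), depth 3, hypothetical 5 Mb genome: about 0.005 per genome.
    print(expected_false_snps(1e-3, 3, 5_000_000))
```

An allele frequency near 0.5 in a haploid organism usually points to a collapsed repeat or a mixed sample rather than a true SNP, which is why this filter does most of the work.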
37. Detecting recombination
• Simple algorithms rely on SNP density, more
complex ones assess impact on “clonal
frame”
[Figure: SNPs along the genome, contrasting normal SNP density with a recombining region]
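The "simple algorithm" idea can be sketched as counting SNPs in fixed windows and flagging windows far above the genome-wide average. The window size, fold-change factor and minimum SNP count below are arbitrary illustrative values, and the function name is hypothetical. Gubbins and ClonalFrameML use considerably more sophisticated models.

```python
from typing import Dict, List, Sequence, Tuple


def dense_windows(
    snp_positions: Sequence[int],
    genome_length: int,
    window: int = 1000,
    factor: float = 5.0,
    min_snps: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start, end, n_snps) for windows with unusually many SNPs.

    A window is flagged when its SNP count exceeds `factor` times the
    genome-wide expectation per window and is at least `min_snps`.
    Positions are 0-based; windows are non-overlapping.
    """
    if not snp_positions:
        return []
    expected = len(snp_positions) * window / genome_length
    counts: Dict[int, int] = {}
    for pos in snp_positions:
        start = (pos // window) * window
        counts[start] = counts.get(start, 0) + 1
    return [
        (start, min(start + window, genome_length), n)
        for start, n in sorted(counts.items())
        if n > factor * expected and n >= min_snps
    ]


if __name__ == "__main__":
    # Toy data: 10 evenly scattered SNPs plus a cluster of 8 within 1 kb.
    background = list(range(5000, 100000, 10000))
    cluster = list(range(40100, 40900, 100))
    print(dense_windows(background + cluster, genome_length=100_000))
    # prints [(40000, 41000, 8)]
```

Flagged windows would normally be masked before tree building so that imported sequence does not inflate branch lengths.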
39. De novo approach
• Interrogate the accessory genome
– Novel genes
• Some important applications take contigs
rather than reads as primary input
• SNP calling with de novo assembly is
fundamentally less reliable due to lack of
allele frequency information; but fine for
broad-scale clustering
40. Reference-based approach
Shared QC steps and tools:
• Read QC: FastQC, Qualimap
• Adaptor/quality trimming: Trimmomatic
• Species ID: BLAST, Metaphlan, MOCAT
• Sample QC: Blobology, Kraken, BLAST
Reference-based steps and tools:
• Alignment: BWA
• Variant calling: Samtools/VarScan, GATK
• SNP extraction & filter: Custom script, snippy
• Recombination filtering: Gubbins, ClonalFrameML
• Tree building: FastTree, RaXML
• MLST/Antibiogram: SRST2
De novo approach
• Assembly: Velvet, SPADES
• Annotation: Prokka
• Tree building: Harvest
• Population genomics: BigsDB, Phyloviz
• Pan-genome: LS-BSR
• MLST/Antibiogram: mlst, Abricate
41. Concluding thoughts
1. Don’t trust your sequencing data (or others’)
– sense-check and validate each step
2. Make extensive use of visualisation tools to
do this
3. There’s more than one way to do any one
task
42. CLoud Infrastructure for Microbial
Bioinformatics (CLIMB)
• MRC funded project to
develop Cloud
Infrastructure for
microbial bioinformatics
• £4M of hardware, capable
of supporting >1000
individual virtual servers
• Amazon/Google cloud for
Academics
43. Meet-The-Expert
• Meet-The-Expert: Joao Carrico and I
• Tomorrow (Monday)
• 07:45 (really)
• Hall M
• Session ME11 What bioinformatics tools do I use for whole-
genome sequence (WGS)-based bacterial diagnostics and
typing?
44. Acknowledgements
• Twitter comments:
– Tom Connor, Alan McNally, Torsten Seemann, C.
Titus Brown, Heng Li, Christoffer Flensburg, Matt
MacManes, Rachel Glover, Willem van Schaik, Bill
Hanage, Jennifer Gardy, Mick Watson, Esther
Robinson, Nicola Fawcett, Aziz Aboobaker, Ruth
Massey
Editor's Notes
Reminds me of an old joke: A man is travelling and stops an old man on the road and says “How do I get to xyz?”. The man pauses and has a good think about it. He asks “You want to get to xyz?”. He pauses again and concludes: “Well if I wanted to get to xyz, I wouldn’t have started from here.”
Caution with filtering: several important antibiotic resistance mutations may occur in only some copies of a multi-copy gene, e.g. 23S rRNA (linezolid resistance). Filtering will exclude these!