Torsten Seemann discussed bioinformatic tools for diagnostic laboratories using whole genome sequencing (WGS). He explained that WGS generates large amounts of sequencing reads that can be assembled de novo or aligned to references to identify single nucleotide polymorphisms (SNPs) and characterize genomes. Key applications of WGS include diagnostic identification, antimicrobial resistance profiling, virulence factor detection, and high-resolution epidemiological typing through SNP analysis and phylogenetic trees. Seemann emphasized that WGS analysis requires metadata, domain expertise, and open data sharing for maximum public health benefit.
Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobials 2016 - Melb, AU - sat 27 feb 2016
1. Bioinformatic tools for the diagnostic laboratory
A/Prof Torsten Seemann
Victorian Life Sciences Computation Initiative (VLSCI)
Microbiological Diagnostic Unit Public Health Laboratory (MDU PHL)
Doherty Applied Microbial Genomics (DAMG)
The University of Melbourne
ASA 2016 - Melbourne, AU - Sat 27 Feb 2016
4. The currency of genomics
Reads
Reads are stored in FASTQ files
Genome
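As a rough illustration of what a FASTQ file holds, the sketch below reads the common four-line record layout (header, sequence, separator, quality). It assumes single-line sequences and Phred+33 quality encoding; the function names and the two example reads are made up for illustration.

```python
from io import StringIO

def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file (assumes no blank lines between records)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual

def phred33(qual):
    """Convert a quality string to integer scores, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]

# Two invented reads, only to show the record structure.
example = StringIO("@read1\nAGTC\n+\nIIII\n@read2\nTTAC\n+\n!!II\n")
records = list(parse_fastq(example))
for read_id, seq, qual in records:
    print(read_id, seq, phred33(qual))
```

For real data an established parser is the safer choice; this only shows how reads and their per-base qualities sit side by side in the file.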
5. Types of sequence reads
100 - 300 bp (paired)
100 - 400 bp
5,000 - 15,000+ bp
5,000 - 50,000+ bp
6. What data do we really have?
Isolate genome
Sequenced reads
Other isolates in sequencing run
Contamination
Sequencing adaptors
Spike-in controls e.g. phiX
Unsequenced regions
7. Do we have enough data?
∷ Depth
: expressed as fold-coverage of genome e.g. 25x
: means each base sequenced 25 times (on average)
∷ Coverage
: the % of genome sequenced with depth > 0
25x
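The two definitions above can be sketched as simple arithmetic. The read count, read length, genome size and per-base depth values below are invented for illustration, and the function names are hypothetical.

```python
def mean_depth(n_reads, read_length, genome_size):
    """Average fold-coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_size

def coverage_percent(per_base_depth):
    """Percentage of genome positions sequenced with depth > 0."""
    covered = sum(1 for d in per_base_depth if d > 0)
    return 100.0 * covered / len(per_base_depth)

# Assumed example: 500,000 reads of 150 bp from a 3 Mbp genome.
depth = mean_depth(500_000, 150, 3_000_000)
print(f"depth = {depth:.0f}x")          # depth = 25x

# Assumed toy depth profile over a 10 bp "genome"; three positions were never sequenced.
profile = [0, 0, 30, 28, 25, 22, 31, 27, 0, 26]
print(f"coverage = {coverage_percent(profile):.0f}%")   # coverage = 70%
```

The two numbers answer different questions: a sample can reach 25x on average and still leave parts of the genome with no reads at all.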
8. Genome data itself is of limited value.
Needs “extra” information
□ location: Australia 37.8S,145.0E
□ date: 2015 2015-07-20
□ source: human 60yo male faecal swab
□ etc.
Metadata
11. Two options
∷ De novo genome assembly
: reconstruct original sequence from reads alone
: like a giant jigsaw puzzle
: “create”
∷ Align to reference
: identify where each read fits on a related genome
: reads cannot always be uniquely placed
: “compare”
12. De novo genome assembly
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
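The overlap, layout and consensus idea can be illustrated with a toy greedy merger. This is a minimal sketch assuming error-free reads from a single strand; it is not how production assemblers work, and the reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))   # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:                    # nothing left to join
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]

# Invented reads sampled from the toy sequence ATGGCGTACGTTAGC.
reads = ["TACGTTAGC", "ATGGCGT", "GCGTACG"]
contigs = greedy_assemble(reads)
print(contigs)   # ['ATGGCGTACGTTAGC']
```

When no pair of sequences overlaps by at least `min_len` bases, merging stops and several contigs remain, which is the usual outcome for real short-read data.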
13. The effect of read length
250 bp - Illumina - $200
8000 bp - PacBio - $2000
14. The problem with repeats
Repeat copy 1
Repeat copy 2
Collapsed repeat consensus
1 locus
4 contigs
15. Align to reference
Seven short 4bp reads
AGTC TTAC GGGA CTTT
TAGG TTTA ATAG
Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC       TAGG     ATAG  TTAC
    TTTA     GGGA       CTTT
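The slide's example can be reproduced with exact string matching. This is a minimal sketch: real aligners tolerate mismatches and use indexed search, and `find_placements` is a hypothetical helper name. The reference and the seven reads are the ones shown above.

```python
def find_placements(read, reference):
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)   # allow overlapping matches
    return hits

reference = "AGTCTTTATTATAGGGAGCCATAGCTTTACA"
reads = ["AGTC", "TTAC", "GGGA", "CTTT", "TAGG", "TTTA", "ATAG"]

placements = {read: find_placements(read, reference) for read in reads}
for read, hits in placements.items():
    status = "unique" if len(hits) == 1 else "ambiguous"
    print(read, hits, status)
```

Running this shows that CTTT, TTTA and ATAG each match at two positions, which is the "cannot always be uniquely placed" problem in miniature: a 4 bp read carries too little information to pick one location.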
17. Best practice
■ Use both approaches
□ reference-based + de novo
■ Best of both worlds
□ and worst of both worlds - interpretation is non-trivial
■ Still need
□ good epidemiology, metadata and domain knowledge!
19. Applications of WGS
∷ Diagnostics
: species ⇒ subspecies ⇒ strain identification
: in silico antibiogram and virulence profile
∷ Surveillance
: in silico genotyping - MLST, serotyping, VNTR, MLVA
: what’s lurking in our hospital/community?
∷ Forensics
: outbreak detection
: source tracking
20. Isolate identification
∷ Can be done in seconds
∷ Directly from reads (or subset)
∷ Scan against index of unique k-mers (oligos)
∷ Species level accurate (on average)
∷ Great for quality control!
Kraken, MetaPhlAn, One Codex
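The idea of scanning reads against an index of unique k-mers can be sketched as below. This is a toy illustration only and not the algorithm of any of the tools named above; the two "species" sequences, the labels and k = 5 are invented.

```python
from collections import Counter

def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_unique_index(references, k):
    """Map each k-mer that occurs in exactly one reference to that reference's label."""
    owner, shared = {}, set()
    for label, seq in references.items():
        for km in kmers(seq, k):
            if km in shared:
                continue
            if km in owner and owner[km] != label:
                del owner[km]          # seen in two references: no longer unique
                shared.add(km)
            else:
                owner[km] = label
    return owner

def classify(read, index, k):
    """Label a read by majority vote of its unique k-mers; None if it has none."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    return votes.most_common(1)[0][0] if votes else None

# Invented toy references; real indexes cover thousands of genomes.
references = {
    "species_A": "ATGGCGTACGTTAGCCGATA",
    "species_B": "ATGGCGTTCGATAGGCTAAC",
}
K = 5
index = build_unique_index(references, K)
print(classify("CGTACGTTAG", index, K))   # species_A
print(classify("TTCGATAGGC", index, K))   # species_B
print(classify("ATGGCGT", index, K))      # None (only shared k-mers)
```

Because each lookup is a dictionary hit, a subset of reads is enough to get a species call quickly, which is why this style of check suits quality control.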
22. Antibiogram
∷ The “resistome”
∷ Resistance specific genes
: we have good databases of these
: easy to identify to exact allele e.g. blaNDM-9
∷ New alleles conferring resistance
: databases are poor (exceptions include M.tb)
: novel mechanisms arrive de novo
ResFinder, CARD, ARG-ANNOT, SRST2, ABRicate
33. Every SNP is sacred
∷ Chocolate bar tree
: branches were based on phenotypic attributes
: size, colour, filling, texture, ingredients, flavour
∷ Genomic trees
: want to use every part of the genome sequence
: need to find all differences between isolates
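Finding all differences between isolates can be sketched as a pairwise SNP count over aligned sequences. This assumes the isolates have already been aligned to the same coordinates; the isolate names and sequences are invented, and positions with a gap or N are skipped.

```python
from itertools import combinations

def snp_distance(a, b):
    """Count positions where two aligned sequences carry different A/C/G/T bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")

def distance_matrix(aligned):
    """Pairwise SNP distances as a dict keyed by (isolate, isolate) pairs."""
    return {(p, q): snp_distance(aligned[p], aligned[q])
            for p, q in combinations(sorted(aligned), 2)}

# Invented aligned sequences for three hypothetical isolates.
aligned = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",
    "iso3": "ACGAACGTTT",
}
matrix = distance_matrix(aligned)
for pair, dist in matrix.items():
    print(pair, dist)
```

A matrix like this is the raw material for a genomic tree: iso1 and iso2 differ by one SNP, while iso3 sits further from both.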
38. Reference based analysis
∷ Implies you have a “close” reference
: need to be careful with draft genomes
∷ Very sensitive
: single mutation precision
∷ May not be complete
: ignores novel DNA in your isolate
46. Core
∷ Common DNA
∷ Vertical evolution
∷ Genotyping
∷ Phylogenetics
Accessory
∷ Novel DNA
∷ Lateral transfer
∷ Plasmids
∷ Mobile elements
∷ Partly unexploited
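The core versus accessory split can be illustrated with set operations on gene content. The sketch assumes each isolate has already been reduced to a set of gene names; the isolate labels and gene assignments are hypothetical.

```python
def split_pan_genome(gene_sets):
    """Return (core, accessory): genes in every isolate vs genes in only some."""
    all_sets = list(gene_sets.values())
    if not all_sets:
        return set(), set()
    core = set.intersection(*all_sets)
    pan = set.union(*all_sets)
    return core, pan - core

# Hypothetical gene content for three isolates.
gene_sets = {
    "iso1": {"gyrA", "rpoB", "recA", "blaNDM"},
    "iso2": {"gyrA", "rpoB", "recA"},
    "iso3": {"gyrA", "rpoB", "recA", "tetA", "blaNDM"},
}
core, accessory = split_pan_genome(gene_sets)
print("core:", sorted(core))             # core: ['gyrA', 'recA', 'rpoB']
print("accessory:", sorted(accessory))   # accessory: ['blaNDM', 'tetA']
```

In this toy data the shared genes are the ones usable for genotyping and phylogenetics, while the accessory genes stand in for laterally acquired material such as plasmid-borne resistance.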
50. Nullarbor
∷ Software pipeline
: does “reads to report”
: cloud image available (mGVL)
∷ Under active development
: used at MDU-PHL for past year for routine jobs
: also used by USA CDC Enterics, FSS Qld, and research
∷ National access programme underway
null arbor
“no trees”
51. Doherty Applied Microbial Genomics
■ Non-profit service available
□ fixed price per isolate
■ Genome sequencing
□ Illumina NextSeq 500
■ Bioinformatics analysis
□ Nullarbor
■ Report
□ QC, typing, resistome, phylogeny
□ plus your raw data
53. Open science
∷ Crowd-sourcing provably works
: EHEC outbreak 2011
: Ebola, MERS, Zika
∷ But only if people share
: sequencing data
: metadata
: software source code for analysis
54. GenomeTrakr
∷ International cooperation
: Led by FDA + NCBI
: >20 collaborating institutes inc. UK PHE, DK DTU, MX
: Salmonella and Listeria
∷ Public SRA BioProject #183844
: Real-time submission of WGS genome reads
: Nightly updates of phylogenomic trees
: Contains ~25,000 strains of Salmonella
55. “GenomeTrakka”
∷ A shared online system for all Australian labs
: upload samples
: automated standard/specific analyses
: simple reports and visualization
: easy to submit to international archives (SRA)
∷ Access control
: each lab controls their own data
: jurisdictions can share data in national outbreaks
57. Does WGS deliver?
Yes!
Bioinformatics, Epidemiology, Technology, Microbiology
This means scientists, not just software
Domain expertise
Always changing...