Bioinformatics tools for the diagnostic laboratory - T.Seemann - Antimicrobials 2016 - Melb, AU - sat 27 feb 2016

Bioinformatics tools for the
diagnostic laboratory
A/Prof Torsten Seemann
Victorian Life Sciences Computation Initiative (VLSCI)
Microbiological Diagnostic Unit Public Health Laboratory (MDU PHL)
Doherty Applied Microbial Genomics (DAMG)
The University of Melbourne
ASA 2016 - Melbourne, AU - Sat 27 Feb 2016

Doherty Applied
Microbial Genomics
Lead bioinformatician ♥ microbial genomics

The currency of genomics
Reads
Reads are stored in FASTQ files
Genome

Types of sequence reads
100 - 300 bp (paired)
100 - 400 bp
5,000 - 15,000+ bp
5,000 - 50,000+ bp

What data do we really have?
Isolate genome
Sequenced reads
Other isolates in sequencing run
Contamination
Sequencing adaptors
Spike-in controls eg. phiX
Unsequenced regions

Do we have enough data?
∷ Depth
: expressed as fold-coverage of genome eg. 25x
: means each base sequenced 25 times (on average)
∷ Coverage
: the % of genome sequenced with depth > 0
25x
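As a rough sketch of the two definitions above: depth is an average over all positions, coverage is the fraction of positions seen at least once. The per-base depths and the read counts below are invented for illustration.

```python
def depth_and_coverage(per_base_depth):
    """Return (mean depth, % of positions with depth > 0)."""
    n = len(per_base_depth)
    if n == 0:
        raise ValueError("no positions given")
    mean_depth = sum(per_base_depth) / n
    covered = sum(1 for d in per_base_depth if d > 0)
    return mean_depth, 100.0 * covered / n


def expected_depth(n_reads, read_length, genome_size):
    """Back-of-envelope depth before any alignment: total bases / genome size."""
    return n_reads * read_length / genome_size


# Invented 10 bp "genome": two positions were never sequenced.
toy_depths = [30, 30, 40, 0, 25, 25, 50, 0, 25, 25]
mean_depth, pct_covered = depth_and_coverage(toy_depths)
print(f"{mean_depth:.0f}x depth, {pct_covered:.0f}% coverage")

# Hypothetical run: 500,000 reads of 150 bp on a 3 Mbp genome.
print(f"{expected_depth(500_000, 150, 3_000_000):.0f}x expected")
```

The toy genome averages 25x yet only 80% of it is covered, which is why both numbers are worth reporting.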

Genome data itself is of limited value.
Needs “extra” information
□ location: Australia (37.8S, 145.0E)
□ date: 2015 (2015-07-20)
□ source: human (60yo male, faecal swab)
□ etc.
Metadata

Two options
∷ De novo genome assembly
: reconstruct original sequence from reads alone
: like a giant jigsaw puzzle
: “create”
∷ Align to reference
: identify where each read fits on a related genome
: reads cannot always be uniquely placed
: “compare”

De novo genome assembly
Amplified DNA
Shear DNA
Sequenced reads
Overlaps
Layout
Consensus ↠ “Contigs”
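The overlap step can be illustrated with a toy sketch: find the longest suffix of one read that matches a prefix of another, then merge them. The two reads and the `min_len` cutoff are made up for illustration; real assemblers tolerate sequencing errors and use far more efficient data structures.

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = longest_overlap(a, b, min_len)
    return a + b[k:] if k else None


# Two invented 10 bp reads that share 5 bases.
left, right = "AGTCTTTATT", "TTATTATAGG"
print(longest_overlap(left, right))
print(merge(left, right))
```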

The effect of read length
250 bp - Illumina - $200
8000 bp - PacBio - $2000

The problem with repeats
Repeat copy 1 Repeat copy 2
Collapsed repeat consensus
1 locus
4 contigs

Align to reference
Seven short 4bp reads
AGTC TTAC GGGA CTTT
TAGG TTTA ATAG
Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC TAGG ATAG TTAC
TTTA GGGA CTTT

Eight short 4bp reads
AGTC TTAC GGGA CTTT
TAGG TTTA ATAG TTAT
Aligned to 31bp reference
AGTCTTTATTATAGGGAGCCATAGCTTTACA
AGTC TAGG ATAG TTAC
TTTA GGGA CTTT
TTAT
TTAT
Ambiguous alignment
D’oh!
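The ambiguity can be reproduced with a minimal sketch that looks for every exact match of each read in the 31 bp reference from the slide. Exact string matching is a simplification chosen here; real aligners allow mismatches and gaps and report a mapping quality.

```python
REFERENCE = "AGTCTTTATTATAGGGAGCCATAGCTTTACA"
READS = ["AGTC", "TTAC", "GGGA", "CTTT", "TAGG", "TTTA", "ATAG", "TTAT"]


def find_all(reference, read):
    """Return every 0-based start position where `read` matches exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)  # allow overlapping hits
    return hits


placements = {read: find_all(REFERENCE, read) for read in READS}
for read, hits in placements.items():
    label = "unique" if len(hits) == 1 else "ambiguous" if hits else "unplaced"
    print(read, hits, label)
```

TTAT matches at two overlapping positions, which is the ambiguous case on the slide. With 4 bp reads, exact matching also finds a second site for some of the other reads (CTTT, for example), so short reads are more ambiguous than the single placements drawn above suggest.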

Best practice
■ Use both approaches
□ reference-based + de novo
■ Best of both worlds
□ and worst of both worlds - interpretation is non-trivial
■ Still need
□ good epidemiology, metadata and domain knowledge!

Applications of WGS
∷ Diagnostics
: species ⇒ subspecies ⇒ strain identification
: in silico antibiogram and virulence profile
∷ Surveillance
: in silico genotyping - MLST, serotyping, VNTR, MLVA
: what’s lurking in our hospital/community?
∷ Forensics
: outbreak detection
: source tracking

Isolate identification
∷ Can be done in seconds
∷ Directly from reads (or subset)
∷ Scan against index of unique k-mers (oligos)
∷ Accurate to species level (on average)
∷ Great for quality control!
Kraken,
MetaPhlAn,
One Codex
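The idea of scanning reads against an index of unique k-mers can be sketched as follows. This is a much simplified illustration and not how Kraken, MetaPhlAn or One Codex are implemented; the two "genomes", the species labels and `k = 4` are all invented.

```python
from collections import Counter


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_unique_index(genomes, k):
    """Map each k-mer found in exactly one genome to that genome's label."""
    owner, shared = {}, set()
    for label, seq in genomes.items():
        for km in kmers(seq, k):
            if km in shared:
                continue
            if km in owner and owner[km] != label:
                del owner[km]      # seen in two genomes: not diagnostic
                shared.add(km)
            else:
                owner[km] = label
    return owner


def classify(read, index, k):
    """Label a read by majority vote of its indexed k-mers (None if no hits)."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy genomes.
GENOMES = {"species_A": "ACGTACGGTCA", "species_B": "TTGACCGATAG"}
K = 4
index = build_unique_index(GENOMES, K)
for read in ["CGGTC", "GATAG", "AAAAA"]:
    print(read, classify(read, index, K))
```

Because each lookup is a dictionary hit, classification needs no alignment, which is why this kind of screen is fast enough for routine quality control.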

One Codex example metagenome output

Antibiogram
∷ The “resistome”
∷ Resistance-specific genes
: we have good databases of these
: easy to identify to exact allele eg. blaNDM-9
∷ New alleles conferring resistance
: databases are poor (exceptions include M.tb)
: novel mechanisms arrive de novo
ResFinder, CARD,
ARG-ANNOT
SRST2, ABRicate

ABRicate example E.faecium output
START    END      GENE         COVERAGE     COVERAGE_MAP     GAPS  %COVERAGE  %IDENTITY
7140     7902     erm(B)       1-762/762    ========/======  1     100.00     99.08
8627     9421     aph(3')-III  1-795/795    ===============  0     100.00     100.00
11040    11948    ant(6)-Ia    1-345/909    =====..........  0     35.00      100.00
15456    16257    lnu(B)       1-804/804    ========/======  2     99.75      99.63
573128   575046   tet(M)       1-1920/1920  ========/======  1     99.95      99.95
770130   770792   VanR-B       1-663/663    ===============  0     100.00     99.25
770792   772135   VanS-B       1-1344/1344  ===============  0     100.00     99.63
772306   773112   VanY-B       1-807/807    ===============  0     100.00     100.00
773130   773957   VanW-B       1-828/828    ===============  0     100.00     97.58
773954   774925   VanH-B       1-972/972    ===============  0     100.00     99.38
774918   775946   VanA-B       1-1029/1029  ===============  0     100.00     98.93
775952   776560   VanX-B       1-609/609    ===============  0     100.00     96.72
2352083  2352631  aac(6')-Ii   1-549/549    ===============  0     100.00     99.64
2789984  2791462  msr(C)       1-1479/1479  ===============  0     100.00     98.92
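A table like this is easy to post-process. The sketch below parses three of the rows shown above as whitespace-separated text and separates full-length hits from partial ones. The cutoffs (`min_cov`, `min_id`) are hypothetical values picked for illustration, not recommendations from the talk or defaults of any tool.

```python
COLUMNS = ["START", "END", "GENE", "COVERAGE", "COVERAGE_MAP",
           "GAPS", "%COVERAGE", "%IDENTITY"]

# Three rows copied from the example output above.
ROWS = """\
7140 7902 erm(B) 1-762/762 ========/====== 1 100.00 99.08
11040 11948 ant(6)-Ia 1-345/909 =====.......... 0 35.00 100.00
774918 775946 VanA-B 1-1029/1029 =============== 0 100.00 98.93
"""


def parse_rows(text):
    """Turn whitespace-separated rows into dicts keyed by column name."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            raise ValueError(f"unexpected column count: {line!r}")
        hit = dict(zip(COLUMNS, fields))
        hit["%COVERAGE"] = float(hit["%COVERAGE"])
        hit["%IDENTITY"] = float(hit["%IDENTITY"])
        hits.append(hit)
    return hits


def split_hits(hits, min_cov=90.0, min_id=95.0):
    """Separate confident gene hits from partial or divergent ones."""
    full = [h["GENE"] for h in hits
            if h["%COVERAGE"] >= min_cov and h["%IDENTITY"] >= min_id]
    partial = [h["GENE"] for h in hits if h["GENE"] not in full]
    return full, partial


full, partial = split_hits(parse_rows(ROWS))
print("full-length:", full)
print("needs review:", partial)
```

A partial hit such as ant(6)-Ia may be a truncated gene or an assembly break, so it is flagged for review rather than discarded.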

Virulence profile
∷ The “virulome”
∷ Curated databases
: known virulence genes
: pathogenicity islands
∷ Caveats
: variable representation across organisms
VirulenceFinder,
VFDB, MvirDB,
ViPR, PAI DB

Backward compatibility
MLST
Resistome
Virulome
NG-MAST
MLVA
VNTR
Serotyping
Phage typing
PFGE
SRST2, mlst,
ngmaster, lissero,
and many more!

Focus on a small “informative” section

Genotype shows isolates are related

Every SNP is sacred
∷ Chocolate bar tree
: branches were based on phenotypic attributes
: size, colour, filling, texture, ingredients, flavour
∷ Genomic trees
: want to use every part of the genome sequence
: need to find all differences between isolates
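Once isolates are aligned to a common coordinate system, "all differences" reduces to counting mismatched columns between each pair. A minimal sketch, using three invented 8 bp sequences; skipping gaps and ambiguous bases is a design choice made here, not something stated in the talk.

```python
from itertools import combinations


def snp_distance(a, b):
    """Count positions where two aligned sequences differ (A/C/G/T only)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")


# Hypothetical aligned isolates.
ISOLATES = {
    "isoA": "ACGTACGT",
    "isoB": "ACGTACGA",
    "isoC": "ATGTACGA",
}

distances = {
    (p, q): snp_distance(ISOLATES[p], ISOLATES[q])
    for p, q in combinations(ISOLATES, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

A matrix of such distances is the usual starting point for the trees shown on the following slides.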

Finding differences
AGTCTGATTAGCTTAGCTTGTAGCGCTATATTAT
AGTCTGATTAGCTTAGAT
ATTAGCTTAGATTGTAG
CTTAGATTGTAGC-C
TGATTAGCTTAGATTGTAGC-CTATAT
TAGCTTAGATTGTAGC-CTATATT
TAGATTGTAGC-CTATATTA
TAGATTGTAGC-CTATATTAT
SNP Deletion
Reference
Reads
Snippy, VarScan,
SAMtools, GATK
and many more!
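The pileup above can be turned into variant calls with a small sketch: stack the aligned reads column by column and report positions where the most common read allele disagrees with the reference. The read offsets are reconstructed from the slide layout and are an assumption, as is the `min_depth` cutoff; tools such as Snippy or SAMtools also weigh base and mapping quality.

```python
from collections import Counter

REFERENCE = "AGTCTGATTAGCTTAGCTTGTAGCGCTATATTAT"

# (0-based offset on the reference, aligned read); "-" marks a deleted base.
READS = [
    (0, "AGTCTGATTAGCTTAGAT"),
    (6, "ATTAGCTTAGATTGTAG"),
    (11, "CTTAGATTGTAGC-C"),
    (4, "TGATTAGCTTAGATTGTAGC-CTATAT"),
    (8, "TAGCTTAGATTGTAGC-CTATATT"),
    (13, "TAGATTGTAGC-CTATATTA"),
    (13, "TAGATTGTAGC-CTATATTAT"),
]


def call_variants(reference, reads, min_depth=2):
    """Return (1-based position, ref base, read allele) for each difference."""
    columns = [Counter() for _ in reference]
    for offset, read in reads:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    variants = []
    for pos, (ref_base, counts) in enumerate(zip(reference, columns), start=1):
        if sum(counts.values()) < min_depth:
            continue  # too few reads to say anything
        allele = counts.most_common(1)[0][0]
        if allele != ref_base:
            variants.append((pos, ref_base, allele))
    return variants


variants = call_variants(REFERENCE, READS)
for pos, ref_base, allele in variants:
    kind = "deletion" if allele == "-" else "SNP"
    print(pos, ref_base, ">", allele, kind)
```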

Annotated tree
∷ 1 SNP resolution
∷ Distinguishes clades within genotypes
∷ Interpretation is not straightforward
10 SNPs
L. monocytogenes

Same tree!
Dendrogram
Spanning
Radial

Reference based analysis
∷ Implies you have a “close” reference
: need to be careful with draft genomes
∷ Very sensitive
: single mutation precision
∷ May not be complete
: ignores novel DNA in your isolate

Inferring transmission
∷ Identical sequence does not imply transmission
∷ Easier to rule out than in

Align all your isolate genomes

The core genome
Core is common to all & has similar sequence.

Example pan genome
Roary, LS-BSR,
OrthoMCL, Degust
Rows are genomes, columns are genes.

Core
∷ Common DNA
∷ Vertical evolution
∷ Genotyping
∷ Phylogenetics
∷ Novel DNA
∷ Lateral transfer
∷ Plasmids
∷ Mobile elements
∷ Partly unexploited
Accessory
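The core/accessory split can be sketched from a gene presence table (genomes as rows, genes as columns). The isolate names and gene sets below are hypothetical, and "present in every genome" is the strict definition used here; pan-genome tools typically also require sequence similarity when deciding that two genes are the same.

```python
def split_pan_genome(presence):
    """Return (core genes, accessory genes) from {genome: set of genes}."""
    if not presence:
        raise ValueError("no genomes given")
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)          # every gene seen anywhere
    core = set.intersection(*gene_sets)    # genes seen in all genomes
    return sorted(core), sorted(pan - core)


# Hypothetical isolates and gene content.
PRESENCE = {
    "isolate_1": {"gyrA", "rpoB", "vanB"},
    "isolate_2": {"gyrA", "rpoB", "tet(M)"},
    "isolate_3": {"gyrA", "rpoB", "vanB", "erm(B)"},
}

core, accessory = split_pan_genome(PRESENCE)
print("core:", core)
print("accessory:", accessory)
```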

Nullarbor
∷ Software pipeline
: does “reads to report”
: cloud image available (mGVL)
∷ Under active development
: used at MDU-PHL for past year for routine jobs
: also used by USA CDC Enterics, FSS Qld, and research
∷ National access programme underway
null arbor
“no trees”

Doherty Applied Microbial Genomics
■ Non-profit service available
□ fixed price per isolate
■ Genome sequencing
□ Illumina NextSeq 500
■ Bioinformatics analysis
□ Nullarbor
■ Report
□ QC, typing, resistome, phylogeny
□ plus your raw data

Open science
∷ Crowd-sourcing provably works
: EHEC outbreak 2011
: Ebola, MERS, Zika
∷ But only if people share
: sequencing data
: metadata
: software source code for analysis

GenomeTrakr
∷ International cooperation
: Led by FDA + NCBI
: >20 collaborating institutes inc. UK PHE, DK DTU, MX
: Salmonella and Listeria
∷ Public SRA BioProject #183844
: Real-time submission of WGS genome reads
: Nightly updates of phylogenomic trees
: Contains ~25,000 strains of Salmonella

“GenomeTrakka”
∷ A shared online system for all Australian labs
: upload samples
: automated standard/specific analyses
: simple reports and visualization
: easy to submit to international archives (SRA)
∷ Access control
: each lab controls their own data
: jurisdictions can share data in national outbreaks

Does WGS deliver?
Yes!
Bioinformatics
Epidemiology
Technology
Microbiology
This means scientists, not just software
Domain expertise
Always changing...

Acknowledgements
Ben Howden
Tim Stinear
Dieter Bulach
Jason Kwong
Anders G da Silva

Contact
tseemann.github.io
torsten.seemann@gmail.com
@torstenseemann

The End
Thank you for listening.
