2015 12-09 nmdd

WGS data for bacterial typing
Karin Lagesen
@karinlag
NMDD presentation
2015-12-09

Bacterial genomes
Four letters: A, C, T, G
Two strands complementary:
A : T, C : G
Genes: DNA that encodes proteins
Often regarded as the “functional”
regions of the genome
Bacteria: genes approx 90% of the genome
ATCCGGAG GAGGACGG
Mutations: single letter
character changes
TGAGGGACCAAACCGAT
TGAGGGACGAAACCGAT
Bacterial
genomes are
most often
circular
Campylobacter
jejuni genome:
1.68 million
basepairs

Bacterial typing
 Typing: identifying a bacterial isolate at the strain
level
 Goal: discriminate between different bacterial
isolates
● Effectively: a distance measure is often sought
 Traditionally done by distinguishing isolates on
phenotypic characteristics
 Molecular strain typing has taken over
 Goal: figure out how different sequences are

Advances in bacterial genomics
Phylum                        Number of genomes   % of total
Actinobacteria                4059                13
Bacteroidetes/Chlorobi group  932                 3
Cyanobacteria                 340                 1
Firmicutes                    9628                31
Proteobacteria                14268               46
Spirochaetes                  525                 2
Other                         1500                5
Number of sequenced genomes for 6 selected phyla and the percent of all genomes found
in the phyla
Source: GenBank prokaryotes.txt file downloaded 4 February 2015
Land et al., Functional & Integrative Genomics, 2015

2002
Development of sequencing technologies

Genome assembly
http://knowgenetics.org/whole-genome-sequencing/
Sequencing
machine
Reads

Molecular bacterial typing
[Figure: typing methods arranged by how differences are counted (categorical, ordinal, continuous) and by amount of sequence used (one region, some regions, many regions, all). Methods shown: single gene, MLST, MLVA, MLSA]

MLVA – Multi-locus VNTR analysis
 Find loci with known repeats
 Discover copy number of each repeat; this becomes the identifier for that locus
 Strain identified by copy numbers for a defined set of loci
 Similarity is # of identical loci numbers
http://www.applied-maths.com/applications/mlva
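The MLVA similarity described above can be sketched in a few lines. This is a minimal illustration only: the locus names and copy numbers are hypothetical, and the function name `mlva_similarity` is not taken from any MLVA tool.

```python
def mlva_similarity(profile_a, profile_b):
    """Count loci whose repeat copy numbers are identical in both profiles."""
    shared_loci = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared_loci if profile_a[locus] == profile_b[locus])


# Hypothetical copy numbers for four hypothetical VNTR loci.
isolate_1 = {"locus_A": 3, "locus_B": 7, "locus_C": 2, "locus_D": 5}
isolate_2 = {"locus_A": 3, "locus_B": 6, "locus_C": 2, "locus_D": 5}

print(mlva_similarity(isolate_1, isolate_2))  # prints 3
```

Only equality of copy numbers is counted, so a locus differing by one repeat and a locus differing by ten repeats contribute the same: the measure is categorical.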

Multi Locus Sequence Typing
 Set of genes
 Each variant is assigned a categorical number
 Cluster types on # shared variants
 Combination of numbers becomes the sequence type (ST)
 Similarity is # of identical loci numbers
 MLST: 7 genes
 rMLST: ribosomal genes
http://www.applied-maths.com/applications/mlst
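As a rough sketch of the numbering idea above: each distinct gene sequence gets an allele number, and each distinct combination of allele numbers gets an ST. Real schemes take these numbers from a curated database; here they are simply assigned in order of first appearance, which is an assumption of this sketch. The three gene names and the short sequences are hypothetical (a real MLST scheme uses 7 genes).

```python
def assign_allele_numbers(isolates):
    """Number each distinct sequence of each gene in order of first appearance."""
    allele_ids = {}  # gene -> {sequence: allele number}
    profiles = {}
    for name, genes in isolates.items():
        profile = []
        for gene in sorted(genes):
            known = allele_ids.setdefault(gene, {})
            profile.append(known.setdefault(genes[gene], len(known) + 1))
        profiles[name] = tuple(profile)
    return profiles


def assign_sequence_types(profiles):
    """Number each distinct allele profile in order of first appearance."""
    st_ids = {}
    return {name: st_ids.setdefault(profile, len(st_ids) + 1)
            for name, profile in profiles.items()}


# Hypothetical isolates with three hypothetical genes.
isolates = {
    "iso1": {"geneA": "ATGC", "geneB": "GGCA", "geneC": "TTAG"},
    "iso2": {"geneA": "ATGC", "geneB": "GGCA", "geneC": "TTAG"},
    "iso3": {"geneA": "ATGC", "geneB": "GGTA", "geneC": "TTAG"},
}

profiles = assign_allele_numbers(isolates)
sequence_types = assign_sequence_types(profiles)
print(profiles["iso3"])   # prints (1, 2, 1)
print(sequence_types)     # prints {'iso1': 1, 'iso2': 1, 'iso3': 2}
```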

Clustering categorical data
Feil, Nature Rev. Microbiol. 2004

Phylogeny – tracing ancestry
 Many algorithms
● Distance matrix methods (sequence similarity)
● Maximum parsimony methods
● Maximum likelihood methods
 Based on similarity between sequences
 Can become very computationally intensive,
especially for longer sequences (e.g. WGS)
 Examples:
● 16S rRNA phylogenetic trees
● Multi Locus Sequence Analyses – phylogenies of
concatenated MLST genes
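To make the distance matrix idea concrete, the sketch below computes the proportion of mismatching positions (p-distance) between every pair of aligned sequences. It assumes the sequences are already aligned and of equal length; the sequences and names are hypothetical, and building a tree from the matrix is left out.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances, stored as nested dicts."""
    names = sorted(aligned)
    matrix = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = p_distance(aligned[a], aligned[b])
    return matrix


# Hypothetical aligned sequences.
aligned = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACGA"}
matrix = distance_matrix(aligned)
print(matrix["s1"]["s2"], matrix["s1"]["s3"])  # prints 0.125 0.25
```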

Campylobacter 16S tree
Friis et al., PLOS One 2013

Molecular bacterial typing
[Figure: typing methods arranged by how differences are counted (categorical, ordinal, continuous) and by amount of sequence used (one region, some regions, many regions, all). Methods shown: single gene, MLST, MLVA, MLSA, wgMLST, core genome, core SNPs, pairwise SNPs]

Ideal whole genome comparisons
 Bacterial species definition:
● 70% of the two genomes should be able to anneal to each other, i.e. «match»
 Converted to whole genome sequences:
● Based on % identity between conserved regions
● Average Nucleotide Identity (ANI) ~95%
 All-against-all sequence alignment is required
● Time complexity: O(n²)
● Not feasible in most cases
 Alternatives:
● Focus on core regions of the genome (core genes)
● Find just the variations (SNPs), make trees from those
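A quick way to see why all-against-all comparison scales quadratically, assuming n here counts genomes: n genomes give n(n-1)/2 unordered pairs. The function name below is hypothetical; 131 is the number of C. jejuni isolates in the core SNP tree shown later in this deck.

```python
from itertools import combinations


def count_pairwise_comparisons(n_genomes):
    """Number of unordered genome pairs in an all-against-all comparison."""
    return n_genomes * (n_genomes - 1) // 2


# Cross-check the formula against an explicit enumeration for a small case.
assert count_pairwise_comparisons(4) == len(list(combinations(range(4), 2)))

print(count_pairwise_comparisons(131))  # prints 8515
```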

Core genome – # ”shared genes”
 Sequences q and s have a matching region
 Regarded as ”shared” iff k and n are large enough
 Similarity = # ”shared” genes
[Figure: sequences q and s aligned; n = length of match, k = % of matching characters in the matching region]
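The "shared gene" rule can be sketched as a simple threshold test on n and k. The slide does not give cutoff values, so the thresholds, gene names, and match values below are all hypothetical, as is the assumption that n is an absolute length in basepairs.

```python
def count_shared_genes(best_matches, min_length, min_identity):
    """Count genes whose best match is long enough (n) and similar enough (k)."""
    return sum(
        1
        for match_length, percent_identity in best_matches.values()
        if match_length >= min_length and percent_identity >= min_identity
    )


# Hypothetical best matches per gene: (n = match length, k = % identity).
best_matches = {
    "gene1": (900, 98.5),
    "gene2": (300, 99.0),   # too short
    "gene3": (1200, 70.0),  # too dissimilar
    "gene4": (1000, 95.0),
}

# Hypothetical cutoffs: at least 500 bp matched at 90% identity or more.
print(count_shared_genes(best_matches, min_length=500, min_identity=90.0))  # prints 2
```

The resulting count depends on the chosen cutoffs, so similarities are only comparable when the same thresholds are used throughout.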

Core genome tree, Campylobacter
Friis et al., PLOS One 2013

Core SNP trees
 Approach A: External core gene set
● Map each genome’s reads to genes
● Examine reads mapping to the same gene to
find sequence variations (variant calling)
● Create genome/SNP matrix
 Approach B: Intrinsic core set
● Use suffix graphs to get Maximal Unique Matches
● Extend alignments from MUMs to get shared
core set
● Find variants in alignments
● Create genome/SNP matrix
 Similarity: # of SNPs that genomes share
Tools: Snippy, snpTree, Parsnp
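Both approaches end in a genome/SNP matrix. A minimal sketch of that last step, assuming the core regions are already aligned into equal-length sequences: keep only the alignment columns where the genomes disagree. The genome names and sequences are hypothetical, and this is not how Snippy, snpTree, or Parsnp are implemented.

```python
def snp_matrix(core_alignment):
    """Return variable column positions and each genome's bases at those positions."""
    names = sorted(core_alignment)
    length = len(core_alignment[names[0]])
    if any(len(core_alignment[name]) != length for name in names):
        raise ValueError("all core sequences must have the same aligned length")
    variable_positions = [
        i for i in range(length)
        if len({core_alignment[name][i] for name in names}) > 1
    ]
    matrix = {
        name: "".join(core_alignment[name][i] for i in variable_positions)
        for name in names
    }
    return variable_positions, matrix


# Hypothetical core alignment of three genomes.
core_alignment = {"g1": "ATGCCGTA", "g2": "ATGACGTA", "g3": "ATGACGTT"}
positions, matrix = snp_matrix(core_alignment)
print(positions)  # prints [3, 7]
print(matrix)     # prints {'g1': 'CA', 'g2': 'AA', 'g3': 'AT'}
```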

Campylobacter jejuni, core SNP tree
Maximum likelihood phylogeny derived from the core-genome alignment of 131 C. jejuni
isolates. Isolates with a known hyper-invasive phenotype have their taxa identifier names
highlighted in red. The three clades identified as containing hyper-invasive strains have
branches indicated in red
Baig et al. BMC Genomics 2015 16:852 doi:10.1186/s12864-015-2087-y

k-mer based SNP trees
 k-mer: piece of sequence, k nucleotides long
 Split genomes/reads into k-mers
 Find k-mers in different genomes that vary in their
middle character
 Create genome/SNP matrix
● Note: this is not core, but pairwise all-against-all
 Create trees
 Similarity is # shared SNPs
Genome A: TGAGGGACCAAACCGAT
Genome B: TGAGGGACGAAACCGAT
Tool: kSNP
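The k-mer idea can be tried directly on Genome A and Genome B from the slide. The sketch below is an illustration of the principle only, not kSNP's actual implementation: it indexes every k-mer by its two flanks and reports flank pairs that occur in both genomes with a different middle character. The choice k=7 and the function names are assumptions.

```python
def kmer_alleles(sequence, k):
    """Map (left flank, right flank) -> set of middle bases seen with those flanks."""
    half = k // 2
    alleles = {}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        flanks = (kmer[:half], kmer[half + 1:])
        alleles.setdefault(flanks, set()).add(kmer[half])
    return alleles


def pairwise_kmer_snps(seq_a, seq_b, k=7):
    """Find flank pairs shared by both genomes whose middle base differs."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that each k-mer has a middle character")
    alleles_a = kmer_alleles(seq_a, k)
    alleles_b = kmer_alleles(seq_b, k)
    snps = {}
    for flanks in alleles_a.keys() & alleles_b.keys():
        middle_a, middle_b = alleles_a[flanks], alleles_b[flanks]
        # Skip flank pairs that are ambiguous within one genome.
        if len(middle_a) == 1 and len(middle_b) == 1 and middle_a != middle_b:
            snps[flanks] = (next(iter(middle_a)), next(iter(middle_b)))
    return snps


genome_a = "TGAGGGACCAAACCGAT"
genome_b = "TGAGGGACGAAACCGAT"
print(pairwise_kmer_snps(genome_a, genome_b))  # prints {('GAC', 'AAA'): ('C', 'G')}
```

No alignment or reference genome is needed; the flanks alone decide which k-mers are compared.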

Acinetobacter whole genome SNP tree
Sahl et al., PLOS One, 2013

Classification of distance measures
 Categorical
● Loci defined as either equal/different
● Similarity calculated as # shared loci
 Ordinal
● Regions defined as “shared” based on sequence
similarity levels
● Similarity calculated as # shared sequences
 Continuous
● Find all sequence differences (SNPs)
● Similarity calculated as # shared SNPs

(Some) sources of variation
 Small changes
● Nucleotide substitution
● Insertions and deletions
 Recombination
● Shuffling regions of the genome
 “Jumping genes”: insertion sequences and transposons
● Small sequences that jump
● Can move other sequences with them

Gene tree != genome tree
Rose et al., Biology Direct 2007

So… what do we do?
 No real answers (yet)
 Could sequence the lot, but that is expensive
 However: we gain so much more with sequencing
● Very high discriminatory power (resolution)
● Access to virulence genes, ++
 Be aware of possible fragility in MLST data
● One mutation = changed ST
● Should probably double check STs with MLSA
 Compare MLSTs with WGS data, see how stable the
MLSTs are relative to the whole genome
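The fragility point can be illustrated with a small sketch. Counting changed loci (the MLST view) gives the same answer for a single mutation as for many mutations in the same gene, whereas counting nucleotide differences over the concatenated genes (the MLSA view) separates the two cases. The isolates, the two gene names, and the sequences are hypothetical, and equal-length, already aligned genes are assumed.

```python
def allele_differences(genes_a, genes_b):
    """MLST-style view: number of loci whose sequences are not identical."""
    return sum(1 for gene in genes_a if genes_a[gene] != genes_b[gene])


def nucleotide_differences(genes_a, genes_b):
    """MLSA-style view: mismatches across the concatenated genes."""
    order = sorted(genes_a)
    concat_a = "".join(genes_a[gene] for gene in order)
    concat_b = "".join(genes_b[gene] for gene in order)
    return sum(1 for x, y in zip(concat_a, concat_b) if x != y)


# Hypothetical isolates: iso_y has one mutation in geneB, iso_z has five.
iso_x = {"geneA": "ATGCATGC", "geneB": "GGCCTTAA"}
iso_y = {"geneA": "ATGCATGC", "geneB": "GGCCTTAT"}
iso_z = {"geneA": "ATGCATGC", "geneB": "GCCGTATT"}

print(allele_differences(iso_x, iso_y), nucleotide_differences(iso_x, iso_y))  # prints 1 1
print(allele_differences(iso_x, iso_z), nucleotide_differences(iso_x, iso_z))  # prints 1 5
```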
