Whole Genome Sequencing
Suddenly cheap and easy
Huge amounts of data generated in Canada & globally
Can solve many problems
Resolution
Breadth of strains typed
Scale of data brings its own problems
Pangenome definitions
Variable assembly completeness and quality
Existing typing systems don't scale well
Classical MLST
Looks at allelic diversity of ~7 “housekeeping” loci
All loci must be fully present
Each new allele is a type
Recombination and mutation are treated as equivalent events
Each unique combination of types is a Sequence Type
Type definitions are universal
Centralized and curated
e.g. ST-21 in Canada = ST-21 in UK = ST-21 in Denmark
Dingle, et al. 2001. J. Clin. Micro. 39(1) 14-23
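The naming rules above can be sketched in a few lines. This is a minimal illustration only: the locus names, sequences, and in-memory tables are hypothetical stand-ins, and a real scheme assigns numbers from a centralized, curated database rather than on the fly.

```python
# Toy sketch of classical MLST naming (hypothetical loci and sequences).
from typing import Dict, Sequence, Tuple

AlleleDB = Dict[str, Dict[str, int]]      # locus -> {sequence: allele number}
ProfileDB = Dict[Tuple[int, ...], int]    # allele profile -> sequence type


def type_strain(allele_db: AlleleDB, profile_db: ProfileDB,
                loci: Sequence[str], sequences: Dict[str, str]) -> int:
    """Return the sequence type (ST) for one strain."""
    profile = []
    for locus in loci:
        seq = sequences.get(locus)
        if seq is None:
            # All loci must be fully present, otherwise no ST can be assigned.
            raise ValueError(f"locus {locus} is missing or incomplete")
        known = allele_db.setdefault(locus, {})
        # Any new sequence becomes a new allele number, whether it differs
        # by one base (mutation) or by many (recombination).
        if seq not in known:
            known[seq] = len(known) + 1
        profile.append(known[seq])
    key = tuple(profile)
    # Each unique combination of allele numbers is a sequence type.
    if key not in profile_db:
        profile_db[key] = len(profile_db) + 1
    return profile_db[key]
```

Because both tables are shared, two labs that observe the same sequences at every locus obtain the same ST, which is the portability the slide refers to.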
The core genome is shared by all members of the species; mostly SNP-level genetic variation
Accessory genes are not shared by all members of the species and drive a lot of the phenotypic variability between strains
What is a “Core gene”? What about a “Core genome”?
Core Genome MLST
Logical extension of Classical MLST concepts
7 genes → 100s or 1000s of genes
Potential successor “Gold Standard” typing method for surveillance
Big Advantages
High Resolution
Viable way for WGS → Surveillance
Lots of interest in cgMLST
A prototype cgMLST scheme for C. jejuni
2690 Campylobacter jejuni whole genome sequence assemblies
Set of 1,658 ORFs from reference strain NCTC11168 used as queries
85% sequence identity & 50% length coverage
732 ORFs conserved across all genomes → core genome loci
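The selection step above can be sketched as a simple filter. This assumes the two figures are minimum thresholds and that search results have already been reduced to one best hit per query ORF per genome; the `hits` structure and all names are hypothetical, not the pipeline actually used.

```python
# Sketch: call an ORF "core" if every genome has a hit passing both thresholds.
from typing import Dict, List, Tuple

IDENTITY_MIN = 85.0   # percent sequence identity (assumed to be a minimum)
COVERAGE_MIN = 50.0   # percent of query length covered (assumed to be a minimum)

# genome -> {ORF name: (percent identity, percent length coverage)}
Hits = Dict[str, Dict[str, Tuple[float, float]]]


def core_loci(hits: Hits, query_orfs: List[str]) -> List[str]:
    """Return the query ORFs that are conserved in every genome."""
    core = []
    for orf in query_orfs:
        conserved = True
        for genome_hits in hits.values():
            hit = genome_hits.get(orf)
            if hit is None or hit[0] < IDENTITY_MIN or hit[1] < COVERAGE_MIN:
                conserved = False
                break
        if conserved:
            core.append(orf)
    return core
```

Under this definition a single poor assembly can remove an ORF from the core set, which is one reason the core shrinks as more genomes are added.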
cgMLST Trials and Tribulations
2690 Campylobacter jejuni whole genome sequence assemblies
Allele definitions gathered from all genomes
Not so simple!
WGS projects don't usually finish their genomes
“Genome Assemblies”
Target loci are often truncated by chance
Only 1,464 genomes (54%) had complete sequences at all 732 loci
Contig Truncations are a function of genome count
As the number of genomes analyzed increases, the probability that any given locus will have at least one truncation approaches 100%
Average rate of missing/truncated loci ≈ 3.5%
26 per assembly!
Contig Truncations are a function of locus count
As the number of loci analyzed increases, the probability that at least one genome will have a truncation approaches 100%
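Both scaling claims follow from the same arithmetic. The sketch below assumes, purely for illustration, that each locus in each assembly is missing or truncated independently at the average rate of 3.5%; real truncations are unlikely to be independent or uniform, so treat the numbers as a rough guide.

```python
# Back-of-envelope model of truncation risk under an independence assumption.
RATE = 0.035        # average per-locus, per-assembly missing/truncated rate
N_LOCI = 732
N_GENOMES = 2690


def expected_truncated(n_loci: int, rate: float = RATE) -> float:
    """Expected number of missing/truncated loci in one assembly."""
    return n_loci * rate


def p_at_least_one(n_trials: int, rate: float = RATE) -> float:
    """Probability of at least one truncation in n independent trials."""
    return 1.0 - (1.0 - rate) ** n_trials


if __name__ == "__main__":
    print(f"expected truncated loci per assembly: {expected_truncated(N_LOCI):.1f}")
    for n in (10, 50, 100, 500, N_GENOMES):
        print(f"{n:>5} genomes: P(a given locus is truncated somewhere) = "
              f"{p_at_least_one(n):.4f}")
```

The same function covers the locus-count view: pass the number of loci as `n_trials` to get the chance that a single assembly has at least one truncated target.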
The Story So Far...
Advantages of cgMLST
1. Analysis is cheap and speedy
2. Hugely improved resolution
3. Consistent, portable nomenclature
Difficulties Introduced by cgMLST
Missing / Truncated Loci will affect your scheme
As-is, forces you to sacrifice either #1 or #3:
Re-sequence and re-assemble and hope it works
– or –
Abandon all hope for portability
Some options for damage control!
1. Use only highly conserved core genes
2. Use optimized gene fragments
3. Reduce the number of target loci
4. Attempt to impute data
Using Optimized Gene Fragments
• The longer the target sequence, the more opportunities for truncation
• Avoid regions with empirically high contig truncation rates
• Retain the most informative regions (measured by Shannon entropy)
• Result: optimized sub-regions that are informative and truncation-free
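One way to picture the fragment selection described above: score every position of an aligned locus by Shannon entropy, then pick the fixed-width window with the highest total entropy among windows whose observed truncation rate is acceptable. This is a sketch under stated assumptions, not the method used in the study; the fixed window width, the per-position truncation rates, and all names are hypothetical.

```python
# Sketch: choose an informative, truncation-free sub-region of one locus.
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def column_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (bits) of the characters observed at one position."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_fragment(aligned: List[str], trunc_rate: List[float],
                  width: int, max_trunc: float = 0.0
                  ) -> Optional[Tuple[int, float]]:
    """Return (start, total entropy) of the best window, or None if none pass.

    aligned    -- equal-length aligned allele sequences for one locus
    trunc_rate -- empirical truncation rate at each aligned position
    """
    length = len(aligned[0])
    entropy = [column_entropy([s[i] for s in aligned]) for i in range(length)]
    best = None
    for start in range(length - width + 1):
        end = start + width
        # Skip windows that touch a position with too many truncations.
        if max(trunc_rate[start:end]) > max_trunc:
            continue
        score = sum(entropy[start:end])
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

A window that scores well here is short enough to dodge contig breaks while keeping the variable sites that distinguish alleles.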
How many loci do we need for accurate clustering?
Pristine Genome Set
732 cgMLST loci
1,464 aforementioned genomes
A controlled development environment for cgMLST testing
Clustering
Reference set clustered at various similarity thresholds
100% to 20% similarity, in 0.5% steps
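The threshold sweep can be sketched as follows. The slides do not state the similarity measure or linkage method, so this example assumes similarity is the fraction of loci with identical alleles and uses single-linkage grouping; both choices and all names are illustrative.

```python
# Sketch: cluster allele profiles at a series of similarity thresholds.
from typing import Dict, Sequence

# 100% down to 20% similarity in 0.5% steps, expressed as fractions.
THRESHOLDS = [(200 - i) / 200 for i in range(161)]


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of loci at which two profiles carry the same allele."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster(profiles: Dict[str, Sequence[int]], threshold: float) -> Dict[str, str]:
    """Single-linkage clusters; returns strain -> cluster representative."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)
    return {n: find(n) for n in names}


def sweep(profiles: Dict[str, Sequence[int]]) -> Dict[float, Dict[str, str]]:
    """Cluster the same profiles at every threshold in THRESHOLDS."""
    return {t: cluster(profiles, t) for t in THRESHOLDS}
```

The pairwise loop is quadratic in the number of strains, which is fine for a sketch but would need a precomputed distance matrix for 1,464 genomes swept over 161 thresholds.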
Random Gene Selection
N genes randomly selected from the 732
1000 replicates each
Clusters compared vs the full 732
Comparison to “reference tree”
Adjusted Wallace Coefficient
Compares clusters produced by two methods
“How often do two strains clustered together by Method A cluster together by Method B?”
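A compact sketch of the comparison follows. It uses the commonly published form of the Adjusted Wallace coefficient, in which the raw Wallace value is corrected by the agreement expected by chance (one minus Simpson's index of diversity of Method B). The function names and the subset sampler are illustrative, not the study's code.

```python
# Sketch: Adjusted Wallace coefficient (A -> B) and random locus subsets.
import random
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, List


def wallace(a: Dict[str, Hashable], b: Dict[str, Hashable]) -> float:
    """Fraction of strain pairs co-clustered by A that B also co-clusters."""
    together_a = both = 0
    for x, y in combinations(list(a), 2):
        if a[x] == a[y]:
            together_a += 1
            if b[x] == b[y]:
                both += 1
    if together_a == 0:
        raise ValueError("method A places no two strains together")
    return both / together_a


def simpson_diversity(labels: Dict[str, Hashable]) -> float:
    """Simpson's index of diversity for one partition."""
    n = len(labels)
    counts = Counter(labels.values())
    return 1.0 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def adjusted_wallace(a: Dict[str, Hashable], b: Dict[str, Hashable]) -> float:
    """Wallace A -> B corrected for the agreement expected by chance."""
    expected = 1.0 - simpson_diversity(b)
    if expected == 1.0:
        raise ValueError("method B puts every strain in one cluster")
    return (wallace(a, b) - expected) / (1.0 - expected)


def random_locus_subsets(n_loci: int, size: int, replicates: int,
                         seed: int = 0) -> List[List[int]]:
    """Draw locus index subsets without replacement, one per replicate."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_loci), size)) for _ in range(replicates)]
```

For example, `random_locus_subsets(732, 150, 1000)` gives 1000 replicate subsets of 150 loci; clustering each and scoring it against the full 732-locus clusters yields a distribution whose 5th percentile can be read as a worst case.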
Random Subset Clusters – 5th Percentile (i.e. “worst case scenario”)
150-250 genes are nearly as good as 732 genes
Allele Imputation: Another Approach
• Inferring the allele of a missing/partial locus
• Educated guess from the allele proportions of 'centres' known to be associated with particular 'flanks'
• Mean accuracy of 90.5%
• Further refinement with partial sequence data
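The idea can be illustrated with a frequency table. This sketch assumes a 'centre' is the locus being imputed and its 'flanks' are two neighbouring loci whose alleles are known; it simply picks the centre allele most often seen with those flank alleles in a reference set. The profile layout and names are hypothetical, and the 90.5% accuracy quoted above refers to the study's method, not to this toy.

```python
# Sketch: impute a missing centre allele from its flanking alleles.
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

FlankTable = Dict[Tuple[int, int], Counter]


def build_table(reference: Iterable[Dict[str, int]],
                left: str, centre: str, right: str) -> FlankTable:
    """Count centre alleles seen with each (left, right) flank combination."""
    table: FlankTable = defaultdict(Counter)
    for profile in reference:
        table[(profile[left], profile[right])][profile[centre]] += 1
    return table


def impute(table: FlankTable,
           flanks: Tuple[int, int]) -> Tuple[Optional[int], float]:
    """Return (most likely centre allele, its proportion) for these flanks."""
    counts = table.get(flanks)
    if not counts:
        # Flank combination never observed: no educated guess is possible.
        return None, 0.0
    allele, n = counts.most_common(1)[0]
    return allele, n / sum(counts.values())
```

The returned proportion can double as a confidence cut-off, for example by leaving the locus untyped unless the leading allele accounts for most of the reference observations.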
Conclusions
• cgMLST is poised to be the Gold Standard for global surveillance of bacterial pathogens
• Contig truncations and missing data become a blocking problem if the same portability of typing definitions as MLST is desired
• A compromise between typability and robustness is required
• The effect of contig truncations can be mitigated by:
• Excluding the worst fragments of genes (by truncation rate & information content)
• Excluding the genes that contribute the least to discriminatory power
• “Filling the gaps” with advance knowledge about linkage
Acknowledgements
• Supervisors:
• Drs. Ed Taboada & Jim Thomas
• Labmates:
• Steven Mutschall (PHAC)
• Peter Krucziewicz (PHAC)
• Ben Hetman (PHAC/ULeth)
• Cody Buchanan (CFIA/ULeth)
• Funding
• ESCMID Attendance Grant
• University of Lethbridge
• Public Health Agency of Canada
• Government of Canada Genomics Research and Development Initiative