Whole genome sequencing (WGS) is poised to replace current subtyping methods for foodborne pathogens. WGS provides identification, serotyping, virulence and antimicrobial resistance characterization from a single analysis. For public health, WGS data analysis tools must be simple, standardized, allow free sharing of data between labs, and perform both central and local analysis. Multilocus sequence typing (MLST) is currently the preferred approach for public health surveillance due to its stable nomenclature and accessibility for non-bioinformaticians. Hierarchical clustering can be used to assign unique identifiers like "SNP addresses" to WGS profiles. PulseNet International will adopt MLST allele codes as the phylogenetic nomenclature system.
The path to implementation of Whole Genome Sequencing (WGS) in PulseNet
1. The path to implementation of WGS in PulseNet
National Center for Emerging and Zoonotic Infectious Diseases
Division of Foodborne, Waterborne, and Environmental Diseases
Peter Gerner-Smidt, MD, DSc
Enteric Diseases Laboratory Branch
GMI9
Rome, Italy, May 23-25, 2016
2. PulseNet International
The international subtyping network of national and regional networks for foodborne disease surveillance
“Saving Lives Since 2000”
http://www.pulsenetinternational.org/
3. Whole Genome Sequencing (WGS) is a Transforming and REPLACING Technology
Consolidating multiple laboratory workflows into one:
o Identification – serotyping – virulence profiling – antimicrobial resistance characterization – plasmid characterization – subtyping
Replacing, NOT supplementing, current methods
More: Precise – Informative – Cost-efficient
4. WGS in Public Health: The analytical tools must be
• Simple
– Public health microbiologists are NOT bioinformaticians
– Standard desktop software
• Comprehensive
– All characterization incl. analysis in one workflow
• Working in a network of laboratories, i.e. STANDARDIZED
– Free sharing and comparison of data between labs
– Central and local analysis
5. MLST vs SNP
Criterion | SNP | MLST
Epidemiological concordance | High | High
Stable nomenclature | (No) | Yes
Reference characterization: identification, serotyping, virulence & resistance markers | No | Yes
Speed | Slow SNP calling, slow analysis | Slow allele calling, fast analysis
Local computing requirements | Medium-High | Low
Local bioinformatics expertise | Yes | No
Reference used to perform analysis | Sequence of closely related annotated strain | Allele database
Requires curation | No | (Yes)
MLST is the primary approach for public health surveillance; SNP is used if more detail is needed or MLST fails
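The "fast analysis" and low local computing requirements listed for MLST follow from the fact that, once alleles have been called, comparing two isolates reduces to comparing integers locus by locus. A minimal sketch of such an allelic distance is given here, assuming profiles are stored as dictionaries and that loci without an allele call are skipped; the locus names, allele numbers and the function name are hypothetical and not taken from any PulseNet tool.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci called in both profiles whose allele numbers differ."""
    shared = [
        locus for locus in profile_a
        if locus in profile_b
        and profile_a[locus] is not None
        and profile_b[locus] is not None
    ]
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])


# Hypothetical loci and allele numbers, for illustration only.
isolate_1 = {"locus_1": 25, "locus_2": 2, "locus_3": 18, "locus_4": 20}
isolate_2 = {"locus_1": 25, "locus_2": 2, "locus_3": None, "locus_4": 41}

# locus_3 has no call in isolate_2 and is ignored; only locus_4 differs.
print(allelic_distance(isolate_1, isolate_2))  # prints 1
```

Skipping uncalled loci is one possible convention; treating a missing call as a difference would give larger distances.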
6. Listeria 1403MLGX6-1WGS
wgMLST and hqSNP Are Equally Discriminatory and Phylogenetic Trees Are Concordant
[Figure: hqSNP tree and wgMLST tree (<All Characters>) of the same isolates: State 1 isolate; State 2 isolates 1, 2 and 3; State 3 isolate; 2013 isolate – Nearest Neighbor. The wgMLST panel also lists allele numbers for loci LMO_1, LMO_4, LMO_5, LMO_6, LMO_7 and LMO_10.]
7. Trees ~ Tables
Key SourceState Serotype PFGE-XbaI-pattern PFGE-XbaI-status PFGE-BlnI-pattern PFGE-BlnI-status Outbreak SourceCounty SourceCity SourceCountry SourceType SourceSite PatientAge PatientSex IsolatDate ReceivedDate UploadDate
M18340 M Enteritidis JEGX01.0009 Confirmed Unconfirmed 1507MLJEG-3 DeKalb Dawsonville USA Human Stool 54UNKNOWN 6/26/2015 7/15/2015 8/4/2015
X150951 X Enteritidis JEGX01.0009 Confirmed Unconfirmed 1507MLJEG-3 Gwinnett Key West USA Human Stool 33MALE 7/5/2015 7/15/2015 8/4/2015
D108427 D Enteritidis JEGX01.0009 Confirmed Unconfirmed 1507MLJEG-3 Fulton Miami USA Human Blood 50FEMALE 7/7/2015 7/15/2015 8/4/2015
A15054-1 A Enteritidis JEGX01.0009 Confirmed Unconfirmed 1507MLJEG-3 Pickens USA Human Stool 28FEMALE 7/7/2015 7/27/2015 8/7/2015
D508583 D Enteritidis JEGX01.0009 Confirmed Unconfirmed 1507MLJEG-3 Dawson Philadelphia USA Human Stool 24FEMALE 7/21/2015 8/11/2015
M088433 M Enteritidis JEGX01.0009 Confirmed Unconfirmed 1507MLJEG-3 Forsyth USA Human Stool 44FEMALE 7/16/2015 7/24/2015 8/13/2015
P110964-1 P Enteritidis JEGX01.0009 Confirmed Unconfirmed Forsyth USA Human Blood 72MALE 8/3/2015 8/10/2015 8/17/2015
A09461 A Enteritidis JEGX01.0009 Confirmed Unconfirmed Cabbagetown USA Human Blood 43FEMALE 7/30/2015 8/5/2015 8/26/2015
A109320 A Enteritidis JEGX01.0009 Confirmed Unconfirmed Bismarck USA Human Stool 28UNKNOWN 7/25/2015 8/6/2015 8/27/2015
T509961 T Enteritidis JEGX01.0009 Confirmed Unconfirmed Forsyth Decatur USA Human Stool 57UNKNOWN 7/31/2015 8/13/2015 9/10/2015
A110203 A Enteritidis JEGX01.0009 Confirmed Unconfirmed DeKalb Hollywood USA Human Other 14FEMALE 8/11/2015 8/25/2015 9/22/2015
A151664 A Enteritidis JEGX01.0009 Confirmed Unconfirmed Talking Rock USA Human Stool 62MALE 8/26/2015 9/8/2015 9/28/2015
DA159061 K Enteritidis JEGX01.0009 Confirmed Unconfirmed Pickens Pierre USA Human Stool 6FEMALE 8/29/2015 9/9/2015 9/29/2015
M150130-1 P Enteritidis JEGX01.0009 Confirmed Unconfirmed Dawson USA Human Stool 6MALE 9/20/2015 9/28/2015 10/1/2015
C15-0445058 N Enteritidis JEGX01.0009 Confirmed Unconfirmed Charlotte USA Human Stool 5MALE 9/2/2015 9/25/2015 10/9/2015
A122326 L Enteritidis JEGX01.0009 Confirmed Unconfirmed Gwinnett NYC USA Human Blood 88FEMALE 9/30/2015 10/7/2015 10/15/2015
A151248 A Enteritidis JEGX01.0009 Confirmed Unconfirmed Atlanta USA Human Stool 37MALE 10/4/2015 10/13/2015 10/21/2015
A125223 D Enteritidis JEGX01.0009 Confirmed Unconfirmed Hall L..A. USA Human Stool FEMALE 9/26/2015 10/14/2015 10/22/2015
[Figure: phylogenetic tree of isolates labelled FDA00009408 to FDA00009433, 2015K-0960 to 2015K-0964 and PNUSAS000764 to PNUSAS000908 (some PNUSAS labels marked with *), with numeric branch values and a 0.001 scale bar.]
9. Courtesy Tim Dallman, PHE
• Hierarchical clustering based on full pairwise distance between two genomes
• Used to assign a SNP address to a strain based on specified index, e.g. 50:25:10:5:0
• Can be used for surveillance purposes
“SNP address”
PulseNet International will use MLST: “Allele Code”
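As an illustration of how a hierarchical address such as 50:25:10:5:0 can be derived from pairwise distances, the sketch below clusters strains at each threshold and concatenates the cluster numbers from the loosest to the tightest level. Single linkage is assumed here, and the strain names, distances and function names are hypothetical; this is a simplified sketch, not the PHE implementation.

```python
from itertools import combinations


def single_linkage_clusters(names, dist, cutoff):
    """Return {name: cluster number}; strains within `cutoff` are linked."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster representative.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] <= cutoff:
            parent[find(a)] = find(b)

    numbers = {}
    result = {}
    for n in names:
        root = find(n)
        # Number clusters in order of first appearance.
        numbers.setdefault(root, len(numbers) + 1)
        result[n] = numbers[root]
    return result


def snp_addresses(names, dist, thresholds=(50, 25, 10, 5, 0)):
    """Join the cluster numbers at each threshold into one address."""
    levels = [single_linkage_clusters(names, dist, t) for t in thresholds]
    return {n: ".".join(str(level[n]) for level in levels) for n in names}


# Hypothetical pairwise SNP distances between four strains.
names = ["s1", "s2", "s3", "s4"]
dist = {
    frozenset(("s1", "s2")): 0,
    frozenset(("s1", "s3")): 4,
    frozenset(("s2", "s3")): 4,
    frozenset(("s1", "s4")): 30,
    frozenset(("s2", "s4")): 30,
    frozenset(("s3", "s4")): 32,
}
addresses = snp_addresses(names, dist)
for name in names:
    print(name, addresses[name])
```

With these distances, s1 and s2 share the full address 1.1.1.1.1, s3 differs only at the 0-SNP level (1.1.1.1.2), and s4 separates at the 25-SNP level (1.2.2.2.3).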
10. Considerations for a phylogenetically relevant strain nomenclature system
• Must be simple
– Sequence of numbers
• Stability of system
– Fit new sequences into an existing tree?
– Recalculate the clusters with every new entry?
• No matter which method is used, the stability can be controlled
• < 2% risk that you cannot fit a new sequence unambiguously into the nomenclatural system
• Cutoffs between levels
• Clustering algorithm
– Single linkage? UPGMA?
11. WGS Data Workflow
• Sequencer: raw sequences
• LIMS
• Temporary storage, QA/QC, data extraction: trimming, mapping, de novo assembly, SNP detection, allele detection (NO metadata)
• Allele & allele code databases: allele names, allele code (strain names) (NO metadata)
• Public health databases: extensive metadata; database managers and end users
• External storage: NCBI, ENA; limited metadata
• Data exchanged: 7-gene MLST, cgMLST, wgMLST, (SNPs); allelic profile, ST, allele code
12. Acknowledgements
Disclaimers:
“The findings and conclusions in this presentation are those of the author and do not necessarily represent the official position of the Centers for Disease Control and Prevention”
“Use of trade names is for identification only and does not imply endorsement by the Centers for Disease Control and Prevention or by the U.S. Department of Health and Human Services.”
Public Health Agency of Canada
Institut Pasteur, S. Brisse; M. Lecuit
Center for Genomic Epidemiology, DTU
University of Oxford, M. Maiden, K. Jolley
Public Health England, T. Dallman
13. Hierarchical Nomenclature Is Inherently Unstable
• As we use approximate matching to group strains, equality is no longer transitive.
Given strains A, B and C, where B lies at distances 13 and 17 from A and C and the distance between A and C is 28,
then at distance cutoff 21, A, B and C would be in the same cluster.
• However, if B has not been sampled yet, A and C would not be in the same cluster
• How bad is it?
Courtesy: Hannes Pouseele, Applied Maths
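The three-strain example can be checked with a few lines of code. The sketch below assumes single-linkage clustering and one reading of the figure (A to B = 13, B to C = 17, A to C = 28); swapping 13 and 17 would not change the outcome. The function name is hypothetical.

```python
def clusters_at_cutoff(distances, strains, cutoff):
    """Single-linkage clusters: strains joined by chains of distances <= cutoff."""
    clusters = [{s} for s in strains]
    for (a, b), d in distances.items():
        # Ignore pairs above the cutoff or involving an unsampled strain.
        if d > cutoff or a not in strains or b not in strains:
            continue
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        if ca is not cb:
            clusters = [c for c in clusters if c is not cb]
            ca |= cb
    return sorted(sorted(c) for c in clusters)


# Assumed assignment of the distances shown in the figure.
distances = {("A", "B"): 13, ("B", "C"): 17, ("A", "C"): 28}

with_b = clusters_at_cutoff(distances, ["A", "B", "C"], cutoff=21)
without_b = clusters_at_cutoff(distances, ["A", "C"], cutoff=21)

print(with_b)     # [['A', 'B', 'C']]
print(without_b)  # [['A'], ['C']]
```

B acts as a bridge: A and C are 28 apart, above the cutoff, and only end up in one cluster once B has been sampled.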
14. Cutoff determination
(case PulseNet Listeria cgMLST database, N = 3,652)
Test procedure: find points with minimal name changes, starting from nothing and adding strains chronologically
Thresholds: 150:100:63:41:21:11
Courtesy: Hannes Pouseele, Applied Maths
15. Stability Assessment
• Test 1: starting from nothing, add samples chronologically
• Test 2: starting from a random subset (50%), add samples chronologically
• Using a precalculated strain nomenclature structure based on what is known today reduces nomenclature instability beyond what is expected (that is, in this case, beyond a 50% reduction)
• The 21 allelic changes cutoff might not be stable enough
Threshold | % change, Test 1 | % change, Test 2
11 | 1.01% | 0.30%
21 | 2.51% | 1.64%
41 | 2.51% | 0.57%
63 | 1.37% | 0.27%
100 | 2.52% | 0.03%
150 | 0.22% | 0%
Courtesy: Hannes Pouseele, Applied Maths
16. Stability Assessment
Conclusions
MLST-based hierarchical strain nomenclature is feasible
• Stability good
– Without the 21 allelic changes cutoff, less than 1.17% name changes
• Stability can be further increased by defining a broad starting set
– Using a more international collection of strains
– Using biological knowledge about the population structure of L. monocytogenes
• Computational feasibility
– Names can be assigned one sample at a time, no need for complete recalculations
• wgMLST instead of cgMLST yields extremely similar results
Courtesy: Hannes Pouseele, Applied Maths
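To make the "one sample at a time" point concrete, the sketch below assigns a code to a new sample from its nearest already-named neighbour: levels whose cutoff is not exceeded are inherited, and the remaining levels receive the next unused number. The thresholds are those quoted in the slides; the nearest-neighbour rule, the function name and the example distances are assumptions for illustration and not the algorithm used by Applied Maths or PulseNet. The sketch also ignores the case in which a new sample bridges two existing clusters, which is where name changes arise.

```python
THRESHOLDS = (150, 100, 63, 41, 21, 11)  # cutoffs quoted in the slides


def assign_code(distances_to_named, codes, thresholds=THRESHOLDS):
    """Return a code (tuple of ints) for a new sample.

    distances_to_named: {existing sample: allelic distance to the new sample}
    codes: {existing sample: code tuple}, one number per threshold level
    """
    def next_number(prefix):
        # Next unused number among existing codes that share `prefix`.
        depth = len(prefix)
        used = {c[depth] for c in codes.values() if c[:depth] == prefix}
        return max(used, default=0) + 1

    if not codes:
        return tuple(1 for _ in thresholds)

    nearest = min(distances_to_named, key=distances_to_named.get)
    d = distances_to_named[nearest]
    # Thresholds decrease, so the inherited levels form a prefix.
    inherited = sum(1 for cutoff in thresholds if d <= cutoff)
    code = list(codes[nearest][:inherited])
    while len(code) < len(thresholds):
        code.append(next_number(tuple(code)))
    return tuple(code)


# Hypothetical samples added one at a time.
codes = {}
codes["s1"] = assign_code({}, codes)
codes["s2"] = assign_code({"s1": 30}, codes)
codes["s3"] = assign_code({"s1": 200, "s2": 190}, codes)

for name, code in codes.items():
    print(name, ".".join(map(str, code)))
```

Here s2 is 30 allelic changes from s1, so it shares the first four levels (cutoffs 150 to 41) and gets a new number at the 21 level, while s3 is more than 150 changes from both and starts a new top-level group. Existing codes are never recomputed in this scheme.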