2. Campylobacter is a public health challenge
[Figure: reported figures by disease, with post-correction estimates in parentheses]
Campylobacteriosis: 29.3 (*447)
Salmonellosis: 19.67 (*269)
Giardiasis: 11.12 (*24)
Shigellosis: 3.08 (*4)
Verotoxigenic E. coli (VTEC): 1.94 (*39)
Cryptosporidiosis: 1.56 (*7)
Listeriosis: 0.36 (*0.55)
Cyclosporiasis: 0.32 (*7.5)
*Post-correction estimate
Sources: Thomas et al. (2013), doi:10.1089/fpd.2012.1389; FoodNet Canada Short Report 2013
#1 bacterial gastrointestinal disease in Canada and a leading foodborne
pathogen worldwide (300-500 million cases)
Self-limiting illness, highly under-reported, largely sporadic
3. The epidemiology of campylobacteriosis is daunting
Source: Julie Arsenault (PhD Thesis)
papyrus.bib.umontreal.ca/jspui/handle/1866/4625
Widespread in “farm-to-fork” and “source-to-tap”
high prevalence in most major livestock species
found in many wild animal species, insects, surface waters
Difficult to establish sources of exposure and routes of transmission
Crisis = Opportunity: whole-genome sequencing (WGS) to the rescue!
5. Clustering thresholds have been with us forever…
“Those who make many species are the ‘splitters’ and those who
make few are the ‘lumpers’…”
– Charles Darwin (1857)
We need to calibrate our analyses so that our results exploit the high
resolution of WGS data while remaining epidemiologically relevant
6. Building a model for quantifying epidemiological similarity
“Essentially, all models are wrong,
but some are useful.”
George E.P. Box
(1919-2013)
7. How to relate epidemiologic and genomic clustering?
1. Adjusted Wallace Coefficient (AWC): Carriço et al. (Comparing Partitions)
The directional likelihood that two isolates clustered together using
one method will also be grouped together by the second method
[Figure: Strain 1 and Strain 2 grouped in WGS clusters vs. Epi clusters, linked by the AWC]
2. Intra-cluster cohesion (ICC):
A measure of the genomic and/or epidemiologic homogeneity of
the isolates within a cluster
[Figure: example clusters with high ICC vs. low ICC]
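The directional nature of the AWC can be illustrated with a minimal sketch. It assumes the commonly published definition: the Wallace coefficient W(A→B) is the fraction of isolate pairs sharing a cluster in partition A that also share a cluster in partition B, and the adjusted value subtracts the agreement expected by chance (one minus Simpson's index of diversity of B). The function names and the toy cluster labels are illustrative only and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations


def wallace(part_a, part_b):
    """Wallace coefficient W(A->B): of the pairs clustered together in A,
    the fraction that are also clustered together in B."""
    together_a = 0
    together_both = 0
    for i, j in combinations(range(len(part_a)), 2):
        if part_a[i] == part_a[j]:
            together_a += 1
            if part_b[i] == part_b[j]:
                together_both += 1
    if together_a == 0:
        raise ValueError("partition A has no multi-isolate clusters")
    return together_both / together_a


def expected_wallace(part_b):
    """Chance-level W(A->B) under independence: 1 - Simpson's diversity of B."""
    n = len(part_b)
    return sum(c * (c - 1) for c in Counter(part_b).values()) / (n * (n - 1))


def adjusted_wallace(part_a, part_b):
    """Adjusted Wallace coefficient AW(A->B) = (W - Wi) / (1 - Wi)."""
    w = wallace(part_a, part_b)
    wi = expected_wallace(part_b)
    return (w - wi) / (1 - wi)


# Toy labels (hypothetical): six isolates, one cluster label per isolate.
wgs_clusters = [1, 1, 1, 1, 2, 2]
epi_clusters = ["x", "x", "y", "y", "z", "z"]

print(adjusted_wallace(wgs_clusters, epi_clusters))  # WGS -> Epi
print(adjusted_wallace(epi_clusters, wgs_clusters))  # Epi -> WGS
```

In this toy case every epi cluster sits inside a single WGS cluster, so the Epi→WGS direction is perfect while the WGS→Epi direction is much weaker, which is why the coefficient has to be read directionally.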
8. Comparing epidemiology vs. genomics
Need a model to assess strain to strain relationships based on isolate
epidemiology so we can directly compare them against the WGS data
[Workflow diagram]
Isolate Selection
Genomics Workflow: Sequencing → Assembly → Annotation → In-Silico Typing
Epidemiology Workflow: Metadata Curation → Quantify Epi-Similarities
Metadata: Source, Location, Date
Core Analysis: Cluster Analysis & Analysis of concordance
9. The challenge with epidemiological data
Source, Spatial, Temporal
Surveillance data is inherently less comprehensive than outbreak data
Metadata is generally qualitative/categorical, not quantitative
10. Building a model for epidemiological similarity
Source, Spatial, Temporal
Establish a metric that summarizes the relationships between isolates
based on basic epidemiologic metadata
Clustering of isolates based on epidemiological metadata
Our proposed approach:
A model for quantifying epidemiological similarity
between strains based on three primary factors:
source, space, time
EpiSym = σ(source) + γ(geospatial) + τ(temporal)
σ = coefficient for Source
γ = coefficient for Geospatial
τ = coefficient for Temporal
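As a rough illustration of the EpiSym formula, the sketch below combines three per-pair similarity scores using the coefficients σ, γ and τ. Two things are assumptions rather than statements from the slides: each component similarity is taken to lie between 0 and 1, and the coefficients are taken to sum to 1 so that EpiSym stays on the same scale. The class name, function name and numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpiWeights:
    source: float      # sigma, coefficient for Source
    geospatial: float  # gamma, coefficient for Geospatial
    temporal: float    # tau, coefficient for Temporal

    def __post_init__(self):
        # ASSUMPTION: coefficients sum to 1 so EpiSym stays within [0, 1].
        total = self.source + self.geospatial + self.temporal
        if abs(total - 1.0) > 1e-9:
            raise ValueError("coefficients are assumed to sum to 1")


def epi_sym(source_sim, geo_sim, temporal_sim, weights):
    """EpiSym = sigma * source + gamma * geospatial + tau * temporal."""
    for value in (source_sim, geo_sim, temporal_sim):
        # ASSUMPTION: each component similarity is already scaled to [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError("component similarities are assumed to be in [0, 1]")
    return (weights.source * source_sim
            + weights.geospatial * geo_sim
            + weights.temporal * temporal_sim)


# Hypothetical weights and component scores for one pair of isolates.
weights = EpiWeights(source=0.5, geospatial=0.25, temporal=0.25)
score = epi_sym(source_sim=0.8, geo_sim=0.5, temporal_sim=1.0, weights=weights)
print(round(score, 3))
```

Changing the coefficients shifts how much each factor drives the overall similarity, for example weighting source more heavily when location and date metadata are sparse.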
11. Quantifying epi-similarities: Spatial and Temporal
‘Spatial’ and ‘Temporal’ factors required for the EpiSym coefficient are
relatively simple to build into the equation
Spatial = [equation not preserved]
Temporal = [equation not preserved]
Where
• dist_ab is given by the Haversine formula
• x, y = sampling dates
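The two raw quantities named here can be computed with a short sketch: the Haversine great-circle distance between two sampling locations, and the number of days between two sampling dates. The Haversine formula itself is standard. The final rescaling of a distance into a 0 to 1 similarity is an assumption made only for illustration, since the equations on the slide were not preserved, and all function names are hypothetical.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi_a, phi_b = radians(lat_a), radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = radians(lon_b - lon_a)
    h = sin(d_phi / 2) ** 2 + cos(phi_a) * cos(phi_b) * sin(d_lambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, h)))


def days_apart(x, y):
    """Absolute number of days between two sampling dates."""
    return abs((x - y).days)


def to_similarity(distance, max_distance):
    """ASSUMPTION: linear rescaling so 0 distance -> 1.0 and max_distance -> 0.0.
    The slides' actual spatial and temporal equations may differ."""
    return 1.0 - min(distance, max_distance) / max_distance


# Hypothetical example: two samples taken 30 days apart.
gap = days_apart(date(2013, 1, 1), date(2013, 1, 31))
print(gap, to_similarity(gap, max_distance=120))
```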
12. Quantifying Source-Source Similarities
Identify all available sources
Identify core epidemiological attributes
Assess each source independently and completely for each attribute
Score the pairwise similarity between any two sources based
on their shared epidemiological attributes
Source similarity (i, j) = Σ *(i + j) / n
Where
• i, j = two sources being compared
• *(i + j) = number of matching attributes
• n = maximum possible score
13. An example: ‘faecal cow’ vs. ‘retail chicken’
Sources compared: Faecal_Cow, Retail_Chicken
Attributes: Animal, Food Production, Retail, Domestic, Wild, Avian, Ruminant, Porcine, OtherAnimal, Human, Retail_Foods, Food Association, Rural, Urban, Environmental, Water, Soil, Farm?, Food
Similarity = Σ Pairwise Matches / Maximum Possible Score = 12.5 / 19 = 0.658
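The ratio above can be sketched in code. The slide's result of 12.5 implies that partial matches are possible, but the scoring rule is not given, so this sketch assumes each attribute is scored 0, 0.5 or 1 per source and that a pair earns 1 minus the absolute difference for each attribute. The attribute scores below are invented for illustration and are not the values behind the 12.5 / 19 example.

```python
def source_similarity(profile_i, profile_j):
    """Sum of pairwise attribute matches divided by the maximum possible score."""
    if profile_i.keys() != profile_j.keys():
        raise ValueError("both sources must be scored on the same attributes")
    # ASSUMPTION: attribute scores are 0, 0.5 or 1, and the match for one
    # attribute is 1 - |difference|, so identical scores earn a full point.
    matches = sum(1.0 - abs(profile_i[k] - profile_j[k]) for k in profile_i)
    max_score = len(profile_i)  # one point available per attribute
    return matches / max_score


# Hypothetical scores on four of the attributes listed on the slide.
source_a = {"Animal": 1, "Avian": 0, "Ruminant": 1, "Retail": 0}
source_b = {"Animal": 1, "Avian": 1, "Ruminant": 0, "Retail": 0.5}

print(source_similarity(source_a, source_b))

# The slide's own arithmetic: 12.5 matches out of a maximum of 19.
print(round(12.5 / 19, 3))
```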
Once source similarity is quantified, we can compute overall EpiSym
We can systematically compute EpiSym across large datasets to define epi clusters
These can then be compared to genomic clusters using cluster concordance metrics
16. Calibrating WGS typing for epidemiologic investigations
[Figure: dendrogram of isolates ordered by genetic similarity; isolate labels omitted]
We can identify the clusters obtained at varying thresholds and compare
them to epidemiological clusters to look for ‘best-fit’
An advantage of WGS is the flexibility in thresholding that is possible
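Clusters at varying thresholds can be illustrated with a small sketch over cgMLST-style allele profiles. It assumes similarity is the percentage of loci with identical allele calls and that clusters are formed by single linkage (any pair at or above the threshold is joined). The slides do not state the linkage method, and the isolate names and profiles here are made up. Missing allele calls are not handled.

```python
from itertools import combinations


def allele_similarity(profile_p, profile_q):
    """Percentage of loci at which two profiles carry the same allele."""
    shared = sum(1 for a, b in zip(profile_p, profile_q) if a == b)
    return 100.0 * shared / len(profile_p)


def cluster_at_threshold(profiles, threshold):
    """Single-linkage clusters: join isolates whose similarity >= threshold (%)."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if allele_similarity(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical four-locus profiles for four isolates.
profiles = {
    "iso_A": (1, 1, 1, 1),
    "iso_B": (1, 1, 1, 2),
    "iso_C": (1, 1, 3, 2),
    "iso_D": (5, 6, 7, 8),
}

for threshold in (100, 75, 0):
    print(threshold, cluster_at_threshold(profiles, threshold))
```

Sweeping the threshold this way yields one partition per threshold, each of which can then be compared against the epidemiological clusters.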
17. Calibrating WGS typing for epidemiologic investigations
[Figure: Weighted Global Cluster Cohesion (y-axis, 0.25 to 1.00) vs. cgMLST Clustering Threshold (%) (x-axis, 0 to 100); series: WGEC_ns, WGEC_ws, WGGC_ns, WGGC_ws]
Genomic cluster homogeneity vs. Epidemiologic cluster homogeneity
Calculate point of highest genomic cohesion while maintaining:
• Multi-isolate clusters
• High epidemiologic validity
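One plausible reading of a weighted global cohesion score is sketched below. This is not the authors' definition, which the slides do not give. It assumes that a cluster's cohesion is the mean pairwise similarity of its members, that the global score is the size-weighted mean across clusters, and that singletons either count as fully cohesive or are excluded. The similarity values and names are hypothetical.

```python
from itertools import combinations


def cluster_cohesion(members, similarity):
    """Mean pairwise similarity among the members of one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # ASSUMPTION: a singleton is treated as trivially homogeneous
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def weighted_global_cohesion(clusters, similarity, include_singletons=True):
    """Size-weighted mean cohesion over clusters; None if no cluster is kept."""
    kept = [c for c in clusters if include_singletons or len(c) > 1]
    total = sum(len(c) for c in kept)
    if total == 0:
        return None
    return sum(len(c) * cluster_cohesion(c, similarity) for c in kept) / total


# Hypothetical pairwise similarities (genomic or epidemiologic) in [0, 1].
pair_scores = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.6,
    frozenset(("b", "c")): 0.6,
}


def lookup(x, y):
    return pair_scores[frozenset((x, y))]


clusters = [["a", "b", "c"], ["d"]]
print(weighted_global_cohesion(clusters, lookup, include_singletons=True))
print(weighted_global_cohesion(clusters, lookup, include_singletons=False))
```

Computing such a score for both genomic and epidemiologic similarity at each clustering threshold would give curves that can be compared to pick a threshold that keeps multi-isolate clusters while preserving epidemiologic homogeneity.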
18. Epi vs. Genomic clustering: examining the outliers
Strains with similar epidemiology aren’t necessarily similar genomically
(and vice-versa!)
By overlaying the two methods, we can identify clusters that group
together significantly more strongly by genomic or epidemiologic relationships
“Epi-Clustering” vs. “Genomic-Clustering”
19. Epi vs. Genomic clustering: examining the outliers
[Figure: frequency counts by sequence type, left-tail and right-tail (p = 0.05), on a scale from −1.0 to 1.0; two legend entries reading “= stronger similarity via …” (labels not preserved); annotations: “Generalist genotype”, “Generalist source”]
Sequence types shown: ST-1244, ST-137, ST-19, ST-21, ST-2306, ST-2521, ST-262, ST-267, ST-3391, ST-3530, ST-42, ST-45, ST-459, ST-46, ST-48, ST-50, ST-5164, ST-52, ST-5619, ST-61, ST-679, ST-7694, ST-8, ST-922, ST-929, ST-982
‘Generalist’ genotypes persist across many combinations of source,
temporal and spatial parameters
‘Generalist’ reservoirs support the persistence of a broad range of
genotypes
20. Summary
We have developed a model to help guide our analysis of Campylobacter
WGS data for practical public health purposes
Systematic examination of the relationship between the genomic and
epidemiological similarity of sets of isolates allows optimization of
clustering for epidemiologic relevance
Calculate point of highest genomic cohesion while maintaining:
• High epidemiologic cohesion
• Multi-isolate clusters
Interactive web application under development (Check it out!)
https://hetmanb.shinyapps.io/EpiQuant/
21. Acknowledgements
People
• Supervisors:
Ed Taboada + Jim Thomas
• Lab:
Steven Mutschall (PHAC)
Peter Kruczkiewicz (PHAC)
Dillon Barker (PHAC/ULeth)
Funding
• University of Lethbridge
• Public Health Agency of Canada A-base
• Gov’t of Canada: Genomics Research and Development Initiative