1. The document evaluates different methods for inferring relationships between Salmonella samples based on whole genome sequencing data from large databases. It compares k-mer based methods and site-based methods using 18,997 Salmonella isolates from public databases.
2. Site-based methods like NUCmer and MLST produced more accurate results, but require more computing resources when dealing with large databases. K-mer based methods are faster but more sensitive to assembly and contamination issues.
3. While k-mer methods may be useful for initial filtering, site-based methods are superior for accuracy, though challenges remain in applying them to databases containing tens of thousands of samples. Quality control and computing resources are important considerations.
Building bioinformatics resources for global community
1. Building bioinformatics resources
for the global community
James Pettengill
james.pettengill@fda.hhs.gov
Biostatistics and Bioinformatics Staff
Office of Analytics and Outreach
FDA Center for Food Safety and Applied Nutrition
GMI9
May 24, 2016
Rome, Italy
2. CFSAN’s open-access peer reviewed methods for analyzing and differentiating
among samples based on WGS data.
Submitted 16 April 2014
Accepted 23 September 2014
Published 14 October 2014
Corresponding author
Errol Strain,
Errol.Strain@fda.hhs.gov
Academic editor
Keith Crandall
Additional Information and
Declarations can be found on
page 21
DOI 10.7717/peerj.620
An evaluation of alternative methods for
constructing phylogenies from whole
genome sequence data: a case study with
Salmonella
James B. Pettengill, Yan Luo, Steven Davis, Yi Chen,
Narjol Gonzalez-Escalona, Andrea Ottesen, Hugh Rand,
Marc W. Allard and Errol Strain
Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park,
MD, USA
ABSTRACT
Comparative genomics based on whole genome sequencing (WGS) is increasingly
being applied to investigate questions within evolutionary and molecular biology,
as well as questions concerning public health (e.g., pathogen outbreaks). Given the
impact that conclusions derived from such analyses may have, we have evaluated
the robustness of clustering individuals based on WGS data to three key factors: (1)
next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and
SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism)
matrix (reference-based and reference-free), and (3) phylogenetic inference method
(FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole
genome sequences representing 107 unique Salmonella enterica subsp. enterica ser.
Montevideo strains. Reference-based approaches for identifying SNPs produced trees
that were significantly more similar to one another than those produced under the
reference-free approach. Topologies inferred using a core matrix (i.e., no missing
data) were significantly more discordant than those inferred using a non-core matrix
that allows for some missing data. However, allowing for too much missing data likely
results in a high false discovery rate of SNPs. When analyzing the same SNP matrix,
we observed that the more thorough inference methods implemented in GARLI and
RAxML produced more similar topologies than FastTreeMP. Our results also confirm
that reproducibility varies among NGS platforms where the MiSeq had the lowest
number of pairwise differences among replicate runs. Our investigation into the robustness
of clustering patterns illustrates the importance of carefully considering how
data from different platforms are combined and analyzed. We found clear differences
in the topologies inferred, and certain methods performed significantly better than
others for discriminating between the highly clonal organisms investigated here. The
methods supported by our results represent a preliminary set of guidelines and a
step towards developing validated standards for clustering based on whole genome
sequence data.
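The abstract's distinction between a core matrix (no missing data) and a non-core matrix (some missing data allowed) can be illustrated with a minimal Python sketch. The function name `filter_snp_columns`, the `max_missing` parameter, and the toy alignment are hypothetical and are not taken from the paper or from any CFSAN tool; missing calls are assumed to be encoded as `N` or `-`.

```python
def filter_snp_columns(alignment, max_missing=0.0, missing_chars="N-"):
    """Return indices of variable columns whose missing fraction is <= max_missing.

    alignment: dict mapping sample name -> sequence (all of equal length).
    max_missing=0.0 yields a core matrix; larger values yield a non-core matrix.
    """
    seqs = list(alignment.values())
    n_samples = len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    kept = []
    for i in range(length):
        column = [s[i] for s in seqs]
        # Bases that were actually called at this position.
        called = [c for c in column if c not in missing_chars]
        missing_fraction = (n_samples - len(called)) / n_samples
        # A SNP column needs at least two different called bases.
        is_variable = len(set(called)) > 1
        if is_variable and missing_fraction <= max_missing:
            kept.append(i)
    return kept


# Hypothetical toy alignment: column 3 is variable but has one missing call.
toy = {
    "s1": "ACGTA",
    "s2": "ACGGC",
    "s3": "ATGNC",
    "s4": "ATGTA",
}
core = filter_snp_columns(toy, max_missing=0.0)       # no missing data allowed
non_core = filter_snp_columns(toy, max_missing=0.25)  # up to 25% missing allowed
print(core, non_core)
```

Raising `max_missing` admits more columns, which is the trade-off the abstract describes: more sites to work with, but a higher risk of including spurious SNPs.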
3. Real-time pathogen detection in the era of whole-genome
sequencing and big data: K-mer and site-based methods for
inferring the distances among tens of thousands of
Salmonella samples
9. Premise/Background of the project
• The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real time.
• These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years.
• For these databases to be of optimal use, one must be able to interrogate them quickly and accurately determine the genomic distances among a set of samples.
• Doing so is challenging because of both biological (evolutionarily diverse samples) and computational (petabytes of sequence data) issues.
• Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and whole-genome multi-locus sequence typing (wgMLST))
• Empirical data: whole-genome sequence data from 18,997 Salmonella isolates
11. Experimental design: based on a classification scheme, determine how well each distance measure performs
[Figure: two panels, "Efficient method" and "Inefficient method", each showing counts (#) against genetic distances; the "Efficient method" panel labels intra-category and inter-category comparisons]
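One way to score how well a distance measure respects a classification scheme is the area under the ROC curve (AUC) for separating intra-category pairs from inter-category pairs. The sketch below is a minimal illustration in plain Python, not the analysis code behind these slides; the function name `distance_auc` and the toy distances and labels are hypothetical.

```python
from itertools import combinations


def distance_auc(distances, labels):
    """AUC for 'intra-category pairs are closer than inter-category pairs'.

    distances: dict mapping frozenset({a, b}) -> distance between samples a and b.
    labels: dict mapping sample -> category (for example, serovar).
    Returns the probability that a randomly chosen intra-category pair has a
    smaller distance than a randomly chosen inter-category pair (ties count 0.5).
    """
    intra, inter = [], []
    for a, b in combinations(sorted(labels), 2):
        d = distances[frozenset((a, b))]
        (intra if labels[a] == labels[b] else inter).append(d)
    if not intra or not inter:
        raise ValueError("need at least one intra- and one inter-category pair")

    wins = 0.0
    for x in intra:
        for y in inter:
            if x < y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(intra) * len(inter))


# Hypothetical toy data: two categories with two samples each.
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
efficient = {
    frozenset(("a1", "a2")): 0.01, frozenset(("b1", "b2")): 0.02,
    frozenset(("a1", "b1")): 0.30, frozenset(("a1", "b2")): 0.31,
    frozenset(("a2", "b1")): 0.29, frozenset(("a2", "b2")): 0.33,
}
inefficient = dict(efficient)
inefficient[frozenset(("a1", "a2"))] = 0.32  # an intra-category pair that looks distant

auc_efficient = distance_auc(efficient, labels)
auc_inefficient = distance_auc(inefficient, labels)
print(auc_efficient, auc_inefficient)
```

The nested loop is quadratic in the number of pairs, so it only suits small examples; at the scale of tens of thousands of samples a rank-based AUC computation would be the more practical choice.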
13. Experimental design:
Empirical data:
• Analyze different distance methods on de novo assemblies of all Salmonella samples in GenomeTrakr
• Use serovar as the classification scheme
[Figure: "Efficient method" panel showing counts (#) against genetic distances for intra-Enteritidis and inter-Enteritidis comparisons]
18. Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
Assembly workflow:
• Obtain latest metadata file from NCBI pathogen database
• Parse metadata and download raw data
• Quality filter using fastx toolkit
• Taxonomic/contamination filtering using Kraken with custom db
• Assembly using SPAdes
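The last three workflow steps can be sketched as command lines assembled in Python. This is a hedged illustration, not the pipeline used for these slides: the file names, the database directory `custom_kraken_db`, the quality thresholds, and the function name `assembly_commands` are all assumptions, and single-end reads are assumed for simplicity (paired-end data would need pair-aware filtering). The metadata parsing and download steps are omitted, and the Kraken step shown here only classifies reads; removing contaminant reads would be a further step.

```python
import shlex


def assembly_commands(sample, db_dir="custom_kraken_db", min_quality=20, min_percent=80):
    """Build (but do not run) one command per workflow step for a single-end sample.

    All paths and thresholds are hypothetical placeholders.
    """
    raw = f"{sample}.fastq"
    filtered = f"{sample}.filtered.fastq"
    return [
        # FASTX-Toolkit: keep reads where at least min_percent of bases
        # have a quality score of at least min_quality.
        ["fastq_quality_filter", "-q", str(min_quality), "-p", str(min_percent),
         "-i", raw, "-o", filtered],
        # Kraken: taxonomic classification of the filtered reads against a custom database.
        ["kraken", "--db", db_dir, "--fastq-input", "--output", f"{sample}.kraken", filtered],
        # SPAdes: de novo assembly of the filtered single-end reads.
        ["spades.py", "-s", filtered, "-o", f"{sample}_spades"],
    ]


commands = assembly_commands("SAMPLE0001")
for command in commands:
    print(shlex.join(command))
```

Building the commands as lists keeps each step inspectable before anything is executed; they could then be passed to `subprocess.run` on a machine where the three tools are installed.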
20. Experimental design: Distance measures
Site-based:
Sample1: ACCTAGTACC
Sample2: ACGTACTACC
Similarity = 0.8
Requires statements about homology/sequence alignment
Kmer-based (L = 9):
Sample1: ACCTAGTACC
kmer1: ACCTAGTAC
kmer2: CCTAGTACC
Sample2: ACGTACTACC
kmer1: ACGTACTAC
kmer2: CGTACTACC
Similarity = 0
Fast but loss/oversimplification of information
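The two similarity values on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the two sequences are already aligned position by position and that k-mer similarity is the Jaccard index over distinct k-mers; the function names are hypothetical.

```python
def site_similarity(a, b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmer_set(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k):
    """Jaccard index of the two k-mer sets: shared k-mers / all k-mers."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return len(ka & kb) / len(ka | kb)


sample1 = "ACCTAGTACC"
sample2 = "ACGTACTACC"

site_sim = site_similarity(sample1, sample2)       # 8 of 10 positions match
kmer_sim = kmer_similarity(sample1, sample2, k=9)  # no 9-mer is shared
print(site_sim, kmer_sim)
```

Two substitutions leave 80% of sites identical, yet every 9-mer overlaps at least one substitution, so the k-mer similarity drops to zero. That is the loss of information the slide refers to.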
21. Summary of methods used to infer the relationships among samples.
Class | Method | Description | Exec. time (s)
Site-based | NUCmer§ | Pairwise genome alignment using suffix arrays | 11.9
Site-based | wgMLST¶ | Gene-based approach | 46.95
K-mer based | Jaccard Index§ | The intersection divided by the union of all k-mers found between two samples | 9.4
K-mer based | Manhattan Distance§ | Sum of the absolute differences between the abundance of each k-mer present between two samples | 45.1
K-mer based | Euclidean Distance§ | The square root of the sum of squares of all pairwise differences in k-mer abundance | 44.2
K-mer based | Mash Distance | MinHash technique (Broder 1998) that reduces genomes to sketches and estimates a novel evolutionary distance metric among them | 1.2
K-mer based | Mash Jaccard Distance | The Jaccard Distance (as described above) but based on the sketch size (e.g., the number of hashes) | 1.2
§ Performed using de novo assemblies and requires k-mer indexing, which with Jellyfish takes 7.4 s (0.8) per sample (2.1 days for 25,000 samples)
¶ Requires a reference genome
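The k-mer based measures in the table can be sketched on small in-memory k-mer profiles. This is an illustrative Python sketch under stated assumptions, not the code used for the timings above: `kmer_counts` stands in for a real k-mer index such as one built with Jellyfish, the function names and toy sequences are hypothetical, and `mash_distance` applies the published Mash formula D = -(1/k) ln(2j / (1 + j)) to an exact Jaccard index, whereas Mash itself estimates j from MinHash sketches.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer of length k in seq (a stand-in for a k-mer index)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_index(a, b):
    """Distinct k-mers shared by both samples divided by distinct k-mers in either."""
    keys_a, keys_b = set(a), set(b)
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def manhattan_distance(a, b):
    """Sum of absolute differences in abundance over all k-mers in either sample."""
    return sum(abs(a[kmer] - b[kmer]) for kmer in set(a) | set(b))


def euclidean_distance(a, b):
    """Square root of the sum of squared differences in k-mer abundance."""
    return math.sqrt(sum((a[kmer] - b[kmer]) ** 2 for kmer in set(a) | set(b)))


def mash_distance(jaccard, k):
    """Mash distance for a Jaccard value j and k-mer size k."""
    if jaccard == 0:
        return 1.0  # no shared k-mers: distance is capped at 1
    if jaccard == 1:
        return 0.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k


# Hypothetical toy sequences that differ only in their last base.
k = 3
profile_a = kmer_counts("AAACG", k)
profile_b = kmer_counts("AAACC", k)

j = jaccard_index(profile_a, profile_b)
manhattan = manhattan_distance(profile_a, profile_b)
euclidean = euclidean_distance(profile_a, profile_b)
mash = mash_distance(j, k)
print(j, manhattan, euclidean, mash)
```

The Jaccard index uses only presence and absence of k-mers, while the Manhattan and Euclidean distances also use abundance. The abundance-based measures therefore need the full k-mer counts for every sample.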
22. Classification of simulated data:
ROC curves identical across different distance methods
* Simulated data is not complex/noisy enough
27. Summary/Implications:
• Certain features (e.g., genomic, assembly, and contamination) cause k-mer based methods to fail to accurately capture the distance between samples.
• Treating absent data as informative may be problematic.
• Site-based methods, like NUCmer and MLST, tended to be superior in performance.
• Accessing the computing resources necessary to perform site-based methods may be challenging when analyzing large databases.
• If working with k-mer distances, err on the side of false positives and use high-quality assemblies.
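The advice to err on the side of false positives suggests a two-stage screen: a permissive k-mer threshold selects candidates cheaply, and a site-based distance then refines them. The sketch below is one hedged reading of that idea, not a workflow described in the slides; `two_stage_neighbors`, both thresholds, and the toy distance tables are hypothetical.

```python
def two_stage_neighbors(query, samples, kmer_distance, site_distance,
                        kmer_threshold, site_threshold):
    """Return (candidates, confirmed) neighbors of query.

    kmer_distance and site_distance are callables taking (query, sample).
    The k-mer threshold should be loose, so that true neighbors are not lost
    at the cheap first stage; the site-based stage removes the false positives.
    """
    # Stage 1: fast, permissive screen with the k-mer distance.
    candidates = [s for s in samples
                  if s != query and kmer_distance(query, s) <= kmer_threshold]
    # Stage 2: slower, more accurate check on the survivors only.
    confirmed = [s for s in candidates
                 if site_distance(query, s) <= site_threshold]
    return candidates, confirmed


# Hypothetical precomputed distances from a query "q" to three database samples.
kmer_table = {"s1": 0.01, "s2": 0.04, "s3": 0.20}
site_table = {"s1": 3, "s2": 150, "s3": 9000}

candidates, confirmed = two_stage_neighbors(
    "q", ["s1", "s2", "s3"],
    kmer_distance=lambda q, s: kmer_table[s],
    site_distance=lambda q, s: site_table[s],
    kmer_threshold=0.05,
    site_threshold=20,
)
print(candidates, confirmed)
```

In this toy case the k-mer screen keeps `s2` as a false positive, and the site-based stage then discards it. The expensive comparison runs on two samples rather than three, a saving that would grow with database size.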
28. Acknowledgements
FDA
• Center for Food Safety and Applied Nutrition
• Biostats/Bioinformatics staff – J. Baugher, H. Rand,
J. Miller, Y. Luo, S. Davis, E. Strain
• Center for Veterinary Medicine
• Office of Regulatory Affairs
National Institutes of Health
• National Center for Biotechnology Information
State Health and University Labs
• Alaska
• Arizona
• California
• Florida
• Hawaii
• Maryland
• Minnesota
• New Mexico
• New York
• South Dakota
• Texas
• Virginia
• Washington
USDA/FSIS
• Eastern Laboratory
CDC
• Enteric Diseases Laboratory
• INEI-ANLIS “Carlos Malbrán Institute,”
Argentina
• Centre for Food Safety, University College
Dublin, Ireland
• Food Environmental Research Agency, UK
• Public Health England, UK
• WHO
• Illumina
• Pac Bio
• CLC Bio
• Other independent collaborators
31. Validation exercise key findings:
• False negatives are primarily due to failure to meet consensus frequency threshold
• False negatives are not random across the genome
[Figure: number of false negatives (axis range 0 to 5000) by cause (ConsensusFrequency<0.9, Coverage<8) for the 20x and 100x coverage datasets]
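The two failure causes named in the figure labels, a consensus frequency below 0.9 and coverage below 8, can be sketched as a per-site filter. This is an assumption-laden illustration rather than the CFSAN SNP Pipeline's own code: the function name `filter_failures`, its parameters, and the toy sites are hypothetical, and the default thresholds are simply read off the figure labels.

```python
def filter_failures(depth, consensus_count, min_coverage=8, min_consensus_freq=0.9):
    """Return the list of filters a candidate site fails (empty list = site passes).

    depth: number of reads covering the site.
    consensus_count: number of those reads supporting the most common base.
    """
    failures = []
    if depth < min_coverage:
        failures.append(f"Coverage<{min_coverage}")
    consensus_freq = consensus_count / depth if depth else 0.0
    if consensus_freq < min_consensus_freq:
        failures.append(f"ConsensusFrequency<{min_consensus_freq}")
    return failures


# Hypothetical sites: (depth, reads supporting the consensus base).
well_supported = filter_failures(depth=40, consensus_count=39)  # frequency 0.975
mixed_site = filter_failures(depth=40, consensus_count=30)      # frequency 0.75
shallow_site = filter_failures(depth=5, consensus_count=5)      # too few reads
print(well_supported, mixed_site, shallow_site)
```

A true SNP at a site like `mixed_site` would be reported as a false negative even though coverage is ample, which matches the slide's observation that the consensus frequency threshold accounts for most missed SNPs.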
32. Validation exercise of CFSAN SNP Pipeline key findings:
• 100× dataset
  • Recovered 98.9% of the introduced SNPs
  • False positive rate of 1.04 × 10^-6
• 20× dataset
  • Recovered 98.8% of SNPs
  • False positive rate of 8.34 × 10^-7
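For readers unfamiliar with the two summary statistics, the sketch below shows how a recovery rate and a false positive rate are typically computed in a validation with introduced SNPs. The definitions are assumptions, since the slides do not state which denominators were used, and the counts are made-up round numbers chosen for illustration only; they are not the validation data.

```python
def recovery_rate(recovered_snps, introduced_snps):
    """Fraction of introduced SNPs that the pipeline reported."""
    return recovered_snps / introduced_snps


def false_positive_rate(false_positives, non_variant_positions):
    """Fraction of non-variant positions wrongly reported as SNPs."""
    return false_positives / non_variant_positions


# Hypothetical counts, NOT taken from the validation exercise.
example_recovery = recovery_rate(recovered_snps=990, introduced_snps=1000)
example_fpr = false_positive_rate(false_positives=5, non_variant_positions=5_000_000)
print(f"{example_recovery:.1%}", f"{example_fpr:.2e}")
```

Because the false positive rate is taken over millions of genome positions, even a rate on the order of 10^-6 corresponds to a handful of spurious calls per genome.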