Barker immemxi final March 2016

Mitigating the effects of sequence data quality on strain typeability: towards the development of robust core genome MLST (cgMLST) schemes

Published in: Science

  1. 1. IMMEM XI: Navigating Microbial Genomes: Insights from the Next Generation, 9–12 March 2016, Estoril, Portugal
  2. 2. 2 Whole Genome Sequencing • Suddenly cheap and easy • Huge amounts of data generated in Canada & globally • Can solve many problems: resolution; breadth of strains typed • Scale of data brings its own problems: pangenome definitions; variable assembly completeness and quality; existing typing systems don't scale well
  3. 3. 3 Classical MLST • Looks at allelic diversity of ~7 “housekeeping” loci • All loci must be fully present • Each new allele is a type • Recombination and mutation are treated as equivalent: any sequence change defines a new allele • Each unique combination of types is a Sequence Type • Type definitions are universal: centralized and curated, e.g. ST-21 in Canada = ST-21 in UK = ST-21 in Denmark (Dingle et al. 2001. J. Clin. Micro. 39(1): 14-23)
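The typing rules on the Classical MLST slide can be sketched in a few lines. This is a minimal illustration, not the curated database the slide refers to: the class `MlstScheme`, its method names, and the toy sequences are all hypothetical, and allele and ST numbers are simply handed out in order of first appearance.

```python
from typing import Dict, List, Tuple


class MlstScheme:
    """Toy allele/ST registry standing in for a centralized, curated database."""

    def __init__(self, loci: List[str]) -> None:
        self.loci = loci
        # Per locus: sequence -> allele number (numbered in order of discovery).
        self.alleles: Dict[str, Dict[str, int]] = {locus: {} for locus in loci}
        # Allelic profile -> sequence type number.
        self.sequence_types: Dict[Tuple[int, ...], int] = {}

    def allele_number(self, locus: str, sequence: str) -> int:
        """Any sequence not seen before becomes a new allele, whether it
        differs by one SNP or by a long recombined stretch."""
        known = self.alleles[locus]
        if sequence not in known:
            known[sequence] = len(known) + 1
        return known[sequence]

    def sequence_type(self, isolate: Dict[str, str]) -> int:
        """Return the ST for an isolate; every locus must be fully present."""
        missing = [locus for locus in self.loci if not isolate.get(locus)]
        if missing:
            raise ValueError(f"untypeable, missing loci: {missing}")
        profile = tuple(self.allele_number(l, isolate[l]) for l in self.loci)
        if profile not in self.sequence_types:
            self.sequence_types[profile] = len(self.sequence_types) + 1
        return self.sequence_types[profile]


if __name__ == "__main__":
    scheme = MlstScheme(["aspA", "glnA"])  # two loci only, for brevity
    print(scheme.sequence_type({"aspA": "ATGA", "glnA": "ATGC"}))
    print(scheme.sequence_type({"aspA": "ATGA", "glnA": "ATGT"}))
```

The `ValueError` branch is the point the later slides return to: a single missing or truncated locus leaves the isolate without a type.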
  4. 4. 4 What is a “Core gene”? What about a “Core genome”? • The core genome is shared by all members of the species; mostly SNP-level genetic variation • Accessory genes are not shared by all members of the species and drive a lot of the phenotypic variability between strains
  5. 5. 5 Core Genome MLST • Logical extension of Classical MLST concepts • 7 genes → 100s or 1000s of genes • Potential successor “Gold Standard” typing method for surveillance • Big advantages: high resolution; viable way for WGS → surveillance • Lots of interest in cgMLST
  6. 6. cgMLST analysis of 200 isolates “identical” by MLST
  7. 7. 7 Walkerton outbreak 2000 cgMLST analysis of 200 isolates “identical” by MLST
  8. 8. 8 A prototype cgMLST scheme for C. jejuni • 2,690 Campylobacter jejuni whole genome sequence assemblies • Set of 1,658 ORFs from reference strain NCTC11168 used as queries • 85% sequence identity & 50% length coverage • 732 ORFs conserved across all genomes: the core genome loci
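The selection rule on this slide (a query ORF counts as core if every genome has a hit at or above 85% identity and 50% length coverage) can be sketched as a set intersection. The input layout, the function name `core_loci`, and the locus and genome names are assumptions made for illustration; the slides do not say which search tool or data structure was used.

```python
from typing import Dict, Set, Tuple

# (percent identity, percent of the query length covered) for the best hit.
Hit = Tuple[float, float]


def core_loci(
    hits: Dict[str, Dict[str, Hit]],
    min_identity: float = 85.0,
    min_coverage: float = 50.0,
) -> Set[str]:
    """Return the loci that pass both thresholds in every genome.

    `hits` maps genome name -> locus name -> best hit; a locus with no hit
    in a genome is simply absent from that genome's inner dictionary.
    """
    passing_per_genome = []
    for per_locus in hits.values():
        passing = {
            locus
            for locus, (identity, coverage) in per_locus.items()
            if identity >= min_identity and coverage >= min_coverage
        }
        passing_per_genome.append(passing)
    if not passing_per_genome:
        return set()
    return set.intersection(*passing_per_genome)


if __name__ == "__main__":
    example = {
        "genome1": {"locusA": (99.0, 100.0), "locusB": (97.0, 100.0)},
        "genome2": {"locusA": (91.5, 80.0), "locusB": (84.0, 100.0)},
        "genome3": {"locusA": (88.0, 55.0)},
    }
    print(core_loci(example))
```

Note that a 50% coverage threshold admits partial hits, so a locus can be "core" by this rule and still be truncated in some assemblies.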
  9. 9. 9 cgMLST Trials and Tribulations • 2,690 Campylobacter jejuni whole genome sequence assemblies • Allele definitions gathered from all genomes • Not so simple! WGS projects don't usually finish their genomes (“Genome Assemblies”) • Target loci are often truncated by chance • Only 1,464 genomes (54%) had complete sequences at all 732 loci
  10. 10. 10 Contig Truncations are a function of genome count • As the number of genomes analyzed is increased, the probability that any locus will have at least one truncation approaches 100% • Average rate of missing/truncated loci ≈ 3.5% (≈ 26 of 732 per assembly!)
  11. 11. 11 Contig Truncations are a function of locus count • As the number of loci analyzed is increased, the probability that at least one genome will have a truncation approaches 100% • Average rate of missing/truncated loci ≈ 3.5% (≈ 26 of 732 per assembly!)
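The trend on the two truncation slides follows from simple probability. The sketch below assumes every locus in every genome is truncated independently at the same 3.5% rate, which is a simplification introduced here and not a claim from the slides; the function names are likewise hypothetical.

```python
def p_at_least_one_truncation(rate: float, n: int) -> float:
    """P(at least one truncation among n independent observations).

    n can be the number of genomes examined at one locus, or the number
    of loci examined in one genome; the formula is the same.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return 1.0 - (1.0 - rate) ** n


def expected_truncated_loci(rate: float, n_loci: int) -> float:
    """Expected number of missing/truncated loci in a single assembly."""
    return rate * n_loci


if __name__ == "__main__":
    rate = 0.035  # average missing/truncated rate quoted on the slides
    print(expected_truncated_loci(rate, 732))  # about 26 loci per assembly
    for n in (1, 10, 50, 100, 500):
        print(n, round(p_at_least_one_truncation(rate, n), 4))
```

Under strict independence almost no assembly would be complete at all 732 loci, yet the earlier slide reports 54% that are, so real truncations are presumably concentrated in a subset of assemblies. The model is a rough illustration of the trend only.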
  12. 12. 12 The Story So Far... • Advantages of cgMLST: 1. Analysis is cheap and speedy 2. Hugely improved resolution 3. Consistent, portable nomenclature • Difficulties introduced by cgMLST: missing/truncated loci will affect your scheme; as-is, this forces you to sacrifice either #1 or #3: re-sequence and re-assemble and hope it works, or abandon all hope for portability
  13. 13. 13 Some options for damage control! 1. Use only highly conserved core genes 2. Use optimized gene fragments 3. Reduce the number of target loci 4. Attempt to impute data
  14. 14. 14 Some options for damage control! 1. Use only highly conserved core genes 2. Use optimized gene fragments 3. Reduce the number of target loci 4. Attempt to impute data
  15. 15. 15 Using Optimized Gene Fragments • The longer the target sequence, the more opportunities for truncations
  16. 16. 16 Using Optimized Gene Fragments • The longer the target sequence, the more opportunities for truncations • Avoid regions with empirically high contig truncation rates
  17. 17. 17 Using Optimized Gene Fragments • The longer the target sequence, the more opportunities for truncations • Avoid regions with empirically high contig truncation rates • Retain the most informative regions (measured by Shannon entropy)
  18. 18. 18 Using Optimized Gene Fragments • The longer the target sequence, the more opportunities for truncations • Avoid regions with empirically high contig truncation rates • Retain the most informative regions (measured by Shannon entropy) • Optimized sub-regions that are informative and truncation-free
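One way to picture the fragment optimization is a sliding window over an alignment of known alleles, scoring each window by summed per-column Shannon entropy and rejecting windows that touch positions with a high empirical truncation rate. This is a sketch under assumptions: the fixed window width, the per-position truncation rates, and the function names are hypothetical, and the slides do not describe the actual optimization procedure.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def column_entropy(column: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_window(
    alignment: List[str],
    truncation_rate: List[float],
    width: int,
    max_truncation: float,
) -> Optional[Tuple[int, float]]:
    """Return (start, summed entropy) of the most informative window whose
    positions all have a truncation rate <= max_truncation, or None."""
    entropies = [column_entropy(col) for col in zip(*alignment)]
    best: Optional[Tuple[int, float]] = None
    for start in range(len(entropies) - width + 1):
        end = start + width
        if max(truncation_rate[start:end]) > max_truncation:
            continue  # window overlaps a truncation-prone region
        score = sum(entropies[start:end])
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    alleles = ["ACGTAC", "ACGTTC", "ATGTAC", "ATGTTC"]  # toy aligned alleles
    rates = [0.0, 0.0, 0.0, 0.0, 0.2, 0.2]  # made-up truncation rates
    print(best_window(alleles, rates, width=3, max_truncation=0.05))
```

Ties are resolved in favour of the leftmost window, an arbitrary choice made for this example.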
  19. 19. 19 Some options for damage control! 1. Use only highly conserved core genes 2. Use optimized gene fragments 3. Reduce the number of target loci 4. Attempt to impute data
  20. 20. 20 How many loci do we need for accurate clustering? • Pristine Genome Set: 732 cgMLST loci; the 1,464 genomes with complete sequences at all 732 loci; a controlled development environment for cgMLST testing • Clustering: reference set clustered at various similarity thresholds, from 100% down to 20% similarity in 0.5% steps
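Clustering allelic profiles at a series of similarity thresholds can be sketched as follows. Similarity here is the fraction of loci with identical allele numbers, and clusters are formed by single linkage; both choices are assumptions, since the slide does not name the similarity measure or the linkage method. All function names and the toy profiles are hypothetical.

```python
from typing import Dict, List


def similarity(p: List[int], q: List[int]) -> float:
    """Fraction of loci at which two allelic profiles carry the same allele."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def cluster(profiles: Dict[str, List[int]], threshold: float) -> Dict[str, int]:
    """Single-linkage clusters: strains are joined whenever a chain of
    pairwise similarities >= threshold connects them (union-find)."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    labels: Dict[str, int] = {}
    return {name: labels.setdefault(find(name), len(labels)) for name in names}


# 100% down to 20% similarity in 0.5% steps, kept as exact fractions.
THRESHOLDS = [x / 1000 for x in range(1000, 195, -5)]

if __name__ == "__main__":
    toy = {"s1": [1, 1, 1, 1], "s2": [1, 1, 1, 2], "s3": [5, 6, 7, 8]}
    for t in (1.0, 0.75, 0.2):
        print(t, cluster(toy, t))
```

The all-pairs loop is quadratic in the number of strains, which is fine for a toy example; a set of 1,464 genomes would still be tractable, but larger collections would want a distance matrix computed once and reused across thresholds.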
  21. 21. 21 How many loci do we need for accurate clustering? • Random gene selection: N genes randomly selected from the 732; 1000 replicates each • Clusters compared vs the full 732 (comparison to “reference tree”) • Adjusted Wallace Coefficient: compares clusters produced by two methods (“How often do two strains clustered together by Method A cluster together by Method B?”)
  22. 22. 22 Random Subset Clusters – 5th Percentile (i.e. “worst case scenario”) • 150-250 genes are nearly as good as 732 genes [Figure not reproduced in transcript]
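The Adjusted Wallace Coefficient used to compare subset clusterings against the full 732-locus clustering can be computed from pair counts. The sketch below follows the usual definition: the Wallace coefficient from A to B, corrected by its expected value under independence, which equals one minus Simpson's index of diversity of partition B. The function names are hypothetical, and this is not necessarily the implementation behind the slides.

```python
from collections import Counter
from typing import Hashable, Sequence


def _pairs(n: int) -> int:
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def wallace(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Of the strain pairs clustered together by A, the fraction that are
    also clustered together by B."""
    together_a = sum(_pairs(c) for c in Counter(a).values())
    if together_a == 0:
        raise ValueError("partition A groups no pair of strains together")
    together_both = sum(_pairs(c) for c in Counter(zip(a, b)).values())
    return together_both / together_a


def adjusted_wallace(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Wallace coefficient A -> B corrected for chance agreement."""
    if len(a) != len(b):
        raise ValueError("partitions must label the same strains")
    # Chance that a random pair is together in B (1 - Simpson's diversity of B).
    expected = sum(_pairs(c) for c in Counter(b).values()) / _pairs(len(b))
    if expected == 1.0:
        raise ValueError("partition B is a single cluster; AW is undefined")
    return (wallace(a, b) - expected) / (1.0 - expected)


if __name__ == "__main__":
    subset_clusters = [0, 0, 1, 1, 2, 2]      # e.g. from a random gene subset
    reference_clusters = [0, 0, 1, 1, 1, 1]   # e.g. from all 732 loci
    print(adjusted_wallace(subset_clusters, reference_clusters))
```

To mirror the slides, one would repeat this for many random subsets of N loci at each threshold and report the 5th percentile of the resulting coefficients.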
  23. 23. 23 Some options for damage control! 1. Use only highly conserved core genes 2. Use optimized gene fragments 3. Reduce the number of target loci 4. Attempt to impute data
  24. 24. Allele Imputation: Another Approach • Inferring the allele of a missing/partial locus • Educated guess from the allele proportions of 'centres' known to be associated with particular 'flanks' • Mean accuracy of 90.5% • Further refinement with partial sequence data
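The "centres and flanks" idea can be illustrated with a frequency table: for each pair of flanking alleles seen in complete reference genomes, count which allele sat at the locus between them, then guess the most common one. This is a guess at the general shape of the approach and not the author's method; the use of exactly one flank on each side, the function names, and the toy data are all assumptions, and the 90.5% accuracy on the slide is the author's result, not a property of this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Flanks = Tuple[int, int]  # (left flanking allele, right flanking allele)


def build_table(reference: List[Tuple[int, int, int]]) -> Dict[Flanks, Counter]:
    """Count centre alleles per flank pair from (left, centre, right)
    triples taken from genomes where all three loci are complete."""
    table: Dict[Flanks, Counter] = defaultdict(Counter)
    for left, centre, right in reference:
        table[(left, right)][centre] += 1
    return table


def impute(
    table: Dict[Flanks, Counter], left: int, right: int
) -> Optional[Tuple[int, float]]:
    """Return (most common centre allele, its proportion) for the given
    flanks, or None when the flank pair was never seen in the reference."""
    counts = table.get((left, right))
    if not counts:
        return None
    allele, n = counts.most_common(1)[0]
    return allele, n / sum(counts.values())


if __name__ == "__main__":
    reference = [(5, 3, 21), (5, 3, 21), (5, 7, 21), (2, 1, 9)]
    table = build_table(reference)
    print(impute(table, 5, 21))  # best guess and the proportion backing it
    print(impute(table, 9, 9))   # unseen flanks: no guess is made
```

Returning the supporting proportion alongside the guess lets a caller refuse low-confidence imputations, or fall back on partial sequence data as the slide suggests.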
  25. 25. Conclusions • cgMLST is poised to be the Gold Standard for global surveillance of bacterial pathogens • Contig truncations and missing data become a blocking problem if the same portability of typing definitions as MLST is desired • A compromise between typeability and robustness is required • The effect of contig truncations can be mitigated by: • Excluding the worst fragments of genes (by truncation rate and information content) • Excluding the genes that contribute the least to discriminatory power • “Filling the gaps” with advance knowledge about linkage
  26. 26. Acknowledgements • Supervisors: Drs. Ed Taboada & Jim Thomas • Labmates: Steven Mutschall (PHAC), Peter Krucziewicz (PHAC), Ben Hetman (PHAC/ULeth), Cody Buchanan (CFIA/ULeth) • Funding: ESCMID Attendance Grant; University of Lethbridge; Public Health Agency of Canada; Government of Canada Genomics Research and Development Initiative
