EU PathoNGenTrace Consortium: cgMLST Evolvement and Challenges for Harmonization


Presentation from the ECDC expert consultation on Whole Genome Sequencing organised by the European Centre for Disease Prevention and Control - Stockholm, 19 November 2015

Published in: Health & Medicine

  1. 1. Dag Harmsen Univ. Münster, Germany EU PathoNGenTrace Consortium cgMLST Evolvement and Challenges for Harmonization 19th November, 2015
  2. 2. Commercial Disclosure Dag Harmsen is co-founder and partial owner of a bioinformatics company (Ridom GmbH, Münster, Germany) that develops software for DNA sequence analysis. Ridom and Ion Torrent/Thermo Fisher (Waltham, MA) partnered and released SeqSphere+ software to speed and simplify whole genome based bacterial typing.
  3. 3. cgMLST Introduction
  4. 4. Surveillance and Phylogeny from Draft Genomes: ‘Molecular Typing Esperanto’ by Standardized Genome Comparison for Outbreak Investigation & Global Nomenclature • Multiple Genome Alignment (e.g., progressiveMauve): - Difficult to interpret with draft genomes. - Computationally intensive (≥ O(n²), limit ≈ 30-50 genomes). - Not additively expandable, no nomenclature possible. • k-mer (without alignment) and ANI (Average Nucleotide Identity, with alignment): + Works on read, draft & complete genome level, quickly identifies closest matching genome. - Whole genome reduced to a single number of similarity. - Additively expandable [≈ O(n)], but poor mapping to nomenclature possible. • Genome-wide Mapping & SNP Calling: + Works well for monomorphic organisms and ‘ad hoc’ analysis & more discriminatory than cgMLST. - Problematic with rearrangement / recombination events. - Not additively expandable (at least if not always mapped to same reference). • Genome-wide Gene by Gene Allele Calling (cgMLST): + Scalable, working on single gene to whole genome levels. + Both recombination & point mutation accommodated as a single event. + Additively expandable [≈ O(1)] & nomenclature possible. Example alignment: …A C GGGATACATACCTATGCTATAGCT… …ACGTGATACATACCTATGATATAGCT… …ACGTGATACATACCTATGCTATAGCT… SNP, single nucleotide polymorphism; cgMLST, core genome multi locus sequence typing; n, number of isolates in database.
  5. 5. Alleles vs. Sequence/SNPs. Allelic profiles: ST1 = 1,1,1,1,1,1,1; ST2 = 1,2,1,1,1,1,1; ST3 = 1,1,1,2,1,1,1; ST4 = 1,1,1,1,1,1,2. Figure: isolates A (clonal founder), B, C and D shown as allelic profiles and as sequence data; a point mutation and a recombination each count as one genetic event. The use of allelic profiles, rather than (concatenated) sequences or SNPs, results in the loss of information (reductionist), but patterns of descent are more robust to the effects of horizontal genetic transfer. Restricting the analysis to genes (bacteria have a high coding capacity) avoids the frequently repetitive intergenic regions, which are in any case difficult to assemble from 2nd generation NGS data. Modified from: Ed Feil; Univ. Bath, UK.
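The contrast on this slide between counting alleles and counting SNPs can be sketched in a few lines. This is a minimal illustration, not the consortium's software: the four profiles are the ones listed on the slide, while the toy locus sequences, the function names, and the way allele numbers are handed out are assumptions made only for this example.

```python
from itertools import combinations

# Seven-locus allelic profiles exactly as listed on the slide.
profiles = {
    "ST1": (1, 1, 1, 1, 1, 1, 1),
    "ST2": (1, 2, 1, 1, 1, 1, 1),
    "ST3": (1, 1, 1, 2, 1, 1, 1),
    "ST4": (1, 1, 1, 1, 1, 1, 2),
}


def allele_distance(p, q):
    """Number of loci at which two allelic profiles carry different alleles."""
    return sum(a != b for a, b in zip(p, q))


def snp_distance(s, t):
    """Number of differing positions between two equal-length sequences."""
    return sum(a != b for a, b in zip(s, t))


# Pairwise allele distances between all sequence types.
pairwise = {
    (x, y): allele_distance(profiles[x], profiles[y])
    for x, y in combinations(sorted(profiles), 2)
}

# Hypothetical toy locus (not from the slide): one point mutation versus a
# recombination event that replaces a short stretch of the same gene.
founder = "ACGTGATACATACCTATGCT"
point_mutant = "ACGTGATACATACCTATGAT"
recombinant = "ACGTGTTGCTTTCCTATGCT"

# Allele view: every distinct sequence of the locus gets its own number.
alleles = {founder: 1}
for seq in (point_mutant, recombinant):
    alleles.setdefault(seq, len(alleles) + 1)

for name, seq in (("point mutant", point_mutant), ("recombinant", recombinant)):
    print(name,
          "SNP differences:", snp_distance(founder, seq),
          "allele differences:", int(alleles[seq] != alleles[founder]))
```

In this toy case the recombinant differs from the founder at four positions and the point mutant at one, yet both differ by exactly one allele, which is the sense in which each is treated as a single genetic event.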
  6. 6. SNP vs. … • M. tuberculosis outbreak • Reference mapping against MtbC H37Rv and SNP calling. • Outgroup strains: same MIRU-VNTR type with no epi-link. Kohl et al. (2014). JCM 52: 2479 [PubMed].
  7. 7. … cgMLST based Typing • Reference mapping against MtbC H37Rv. • Core genome scheme consists of 3,257 coding genes (76.8% of whole genome). • 3,041 genes shared by all 26 isolates analyzed with SeqSphere+. Kohl et al. (2014). JCM 52: 2479 [PubMed]. Enterococcus: de Been et al. (2015). JCM pii: JCM.01946-15 [PubMed].
  8. 8. Genome-wide Genes vs. Whole Genome Consensus Sequencing and Assembling Mismatches
  9. 9. cgMLST Evolvement
  10. 10. Jolley & Maiden (November, 2010). BMC Bioinformatics. 11: 595 [PubMed]. ClonalFrame trees were generated from 43 streptococcal genome sequences, i.e., from concatenated sequences, using (A) seven MLSA gene fragment loci and (B) 77 complete genes found to be present throughout the genus, identified by BIGSdb.
  11. 11. Mellmann et al. (July, 2011). PLoS One. 6: e22751 [PubMed]. Phylogenetic Analysis of EHEC O104:H4. Method: • First real-time prospective outbreak genomics analysis. Hybrid assembly from reference mapping & de novo assembly with Ion Torrent PGM WGS data and BIGSdb genome-wide gene-by-gene allele calling against a fixed set of loci/targets. • n = 1,144 STEC core genome gene scheme defined before outbreak analysis and SeqSphere minimum-spanning tree (not yet termed so, but the first cgMLST application; internally called at that time ‘super MLST’ and/or ‘MLST on steroids’).
  12. 12. • Grant agreement number: 278864-2 • EC contribution: 5.995.267 € • Duration: 54 months (01/01/2012 - 30/06/2016) • Funding scheme: SME-targeted Collaborative Project • URL: Scientific Advisory Board • Marc J. Struelens, European Centre for Disease Prevention and Control (ECDC), Stockholm, Sweden • Rene S. Hendriksen, Technical University of Denmark - National Food Institute, Lyngby, Denmark • Stephen H. Gillespie, University of St Andrews, St Andrews, Scotland, UK • Gary Van Domselaar, National Microbiology Laboratory, Public Health Agency of Canada
  13. 13. Consortium • Dag Harmsen, Universität Münster • Stefan Niemann (Coordinator), FZ Borstel • Philip Supply, Genoscreen • Martin C.J. Maiden, University of Oxford • Bruno Pot, Applied Maths NV • Jörg Rothgänger, Ridom GmbH • Ronald Burggrave, Piext BV • Claudia Giehl, Eurice GmbH Associated Partners • Alexander Mellmann, Univ. Münster, Germany • Roland Diel, Univ. Kiel, Germany • Joao Carrico, Univ. Lisbon, Portugal
  14. 14. Main Objectives • Develop new, completely integrated bioinformatics microbial genomics tools for fast and easy quality-controlled data extraction and interpretation for general diagnostics and public health applications • Streamline and implement new quality control procedures for the whole genomics process • Test and validate the performance of NGS for early diagnosis and for monitoring the spread of major microbial pathogens
  15. 15. Work-Packages • WP1. Development of easy to use software tools for whole genome comparison (Leader: Applied Maths, Partner: Ridom, Oxford; User: Genoscreen, Münster, Borstel, Oxford) • WP2. Next generation high throughput genome wide analysis – new technologies and optimization (Leader: Genoscreen, Partners: Münster, [Ion Torrent]; User: Borstel, Münster, Oxford) • WP3. Use of whole genome sequencing & ODM for genotyping of MtbC (Leader: Borstel, Partner: Genoscreen, Oxford, PiEXT [OpGen], Münster) • WP4. Use of whole genome sequencing & ODM for genotyping of MRSA (Leader: Münster, Partner: Genoscreen, Oxford, PiEXT [+OpGen]) • WP5. Use of whole genome sequencing & ODM for genotyping of Campylobacter (Leader: Oxford, Partner: Genoscreen, Münster, PiEXT [+OpGen]) • WP6. Innovation related activities (IP, Dissemination, and Exploitation) (Leader: Eurice, Partner: all) • WP7. Management (Leader: Eurice, Partner: all)
  16. 16. Prospective Real-time Studies • Campylobacter: prospective surveillance in Oxfordshire, UK has been ongoing with WGS data since 2010 – moving to more real-time starting from 2015 (700-900 isolates per year). • MtbC: prospective surveillance in Hamburg, DE has been ongoing with WGS data since 2005 – moving to more real-time starting from 2015 (110-130 isolates per year). • MRSA: prospective real-time (TaT 4-5 days) surveillance of all multi-drug resistant bacteria (MDR; including MRSA) of University Hospital Münster, DE since October 2013 (1,200-1,500 isolates per year).
  17. 17. Jolley et al. (April, 2012). Microbiology 158: 1005 [PubMed]. In 2013/2014 also rMLST STs added.
  18. 18. Jünemann et al. (April, 2013). Nature Biotechnology 31: 294 [PubMed]. Evaluation of contiguity and consensus accuracy of draft de novo assemblies from benchtop sequencers. (a) Evolution of genome contiguity for GSJ, MiSeq and PGM. The contiguity of the de novo assembly consensus sequences generated by MIRA was analyzed for 4,671 non-pseudo- or non-paralogous chromosomal coding E. coli Sakai NCBI reference sequence genes. This genome-wide gene-by-gene allele analysis was performed with the Ridom SeqSphere+ software. (b) Venn diagram of consensus sequencing accuracy for PGM 300 bp, MiSeq 2 × 250-bp PE (MIS) and GSJ. Reported consensus errors were analyzed for 4,632 coding Sakai genes that could be retrieved using SeqSphere+ for all three platforms. Numbers of variants confirmed by bidirectional Sanger sequencing are indicated in parentheses. *Avoidance of the term core genome, as core genome genes are here determined from DNA with rather high similarity values!
  19. 19. Maiden et al. (October, 2013). Nature Rev. Microbiol. 11: 728 [PubMed]. PathoNGenTrace Yearly Meeting (May 13th - 14th, 2013). Cambridge, UK. Bruno Pot and Hannes Pouseele, Applied Maths. Kmers are the ways how to compare genomes (work done together with Ilya Chorny, Illumina). IMMEM X (October 2nd - 5th, 2013). Paris, France. Hannes Pouseele, Applied Maths. Seven ways (= one of them wgMLST) how to leave your lover (= PFGE). At that time, the authors did NOT treat cgMLST as a fixed set of loci but as the ‘shared’ loci of the selected isolates under study.
  20. 20. Kohl et al. (April, 2014). JCM 52: 2479 [PubMed]. First original publication using the term cgMLST and using a fixed genome-wide set of genes.
  21. 21. Tools for Microbial Genotypic Surveillance and Phylogeny. Wyres et al. (2014). WGS analysis and interpretation in clinical and public health microbiology laboratories: what are the requirements and how do existing tools compare? Pathogens 3: 437 [doi:10.3390/pathogens3020437]. [Table: tools compared by ENA submission, included database, and nomenclature (Yes/No per tool); tool names and row assignments are not preserved in this transcript.]
  22. 22. Standardized Hierarchical Microbial WGS Typing. Standardized hierarchical microbial WGS typing approach, from bottom to top with increasing discriminatory power. Typing levels named in the figure: rMLST, MLST, cgMLST, SNPs (confirmatory/canonical), wgMLST/‘shared’ genome, and outbreak-/lineage-specific SNP. Purposes named in the figure: Speciation by rMLST, Evolutionary Analysis, Global Nomenclature / Surveillance, and Local Outbreak Investigation. Further figure labels: SNPs*; alleles from accessory reference genome genes or pan-genome based wgMLST. *From de novo assembled [PubMed] and/or mapped genomes. References: Pan-bacterial-specific: Jolley et al. (2012). Microbiology 158: 1005 [PubMed]. Species-specific: STEC: Mellmann et al. (2011). PLoS One. 6: e22751 [PubMed]; S. aureus: Leopold et al. (2014). JCM 52: 2365 [PubMed]; MtbC: Kohl et al. (2014). JCM 52: 2479 [PubMed]; K. pneumo.: Bialek et al. (2014). EID 20: 1812 [PubMed]; Lp: Moran-Gilad et al. (2015). Euro Surveill. 20: pii: 21186 [PubMed]; Listeria: Ruppitsch et al. (2015). JCM 53: 2869 [PubMed]; E. faecium: de Been et al. (2015). JCM 53: [PubMed]. Species-specific: e.g., Van Ert et al. (2007). JCM 45: 47 [PubMed]; Maiden et al. (1998). PNAS 95: 3140 [PubMed] (also needed for backwards compatibility). Outbreak-/lineage-specific SNP: e.g., Köser et al. (2012). NEJM 366: 2267 [PubMed]. wgMLST/‘shared’ genome: N. meng.: Jolley et al. (2012). JCM. 50: 3046 [PubMed]; C. jejuni: Cody et al. (2013). JCM. 51: 2526 [PubMed]. MLST, multi locus sequence typing; rMLST, ribosomal MLST; cgMLST, core genome MLST; wgMLST, whole genome MLST; SNP, single nucleotide polymorphism.
  23. 23. cgMLST Challenges for Harmonization
  24. 24. cgMLST and API/Ontology Workshop. Organization: Martin Maiden and Dag Harmsen. Date: 2nd & 3rd March, 2015. Place: Oxford University, UK. Participants: Oxford Univ., Univ. Münster, FZ Borstel, Univ. Warwick, Inst. Pasteur, Univ. Lisboa, PHE, CDC, Ridom, and Applied Maths. Informal agreement that cgMLST is a fixed, community-agreed set of genome-wide genes that is going to be at least the minimum common denominator for analyzing whole genome shotgun (WGS) sequence data for surveillance purposes!
  25. 25. ECDC (October, 2015). Describes a top-down approach that also includes several tiers of reporting (e.g., national and international). However, in the past the most successful bacterial genotyping initiatives (e.g., MLST, spa-typing, or MIRU-VNTR) followed a bottom-up approach: grass-roots, basic democratic or even anarchic. Only the PulseNet initiative followed a top-down approach, but it never resulted in a public nomenclature and involved ‘heavy’ investment by CDC.
  26. 26. Nomenclature Assignment. Nomenclature is in its essence a technique to reduce the amount of available information by assigning a short, yet still informative human [and machine] readable code to isolates. Where two isolates share the same code, it implies that they have the same properties as defined by the nomenclature scheme that is assumed to be commonly understood and adhered to. An additional step in assigning allele identifiers to a particular set of loci, which also further reduces the information to a degree that it can be used effectively for human communication, is to assign an additional unique identifier to each combination of alleles observed within a single genome. ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA.
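The two-step reduction described on this slide (an identifier per allele, then an identifier per combination of alleles) can be sketched as a small in-memory registry. This is only an illustration under stated assumptions: the class name, the locus names, the toy sequences, and the simple "next free integer" numbering policy are hypothetical and are not taken from the ECDC document or from any nomenclature server.

```python
class NomenclatureRegistry:
    """Toy in-memory registry; names and numbering policy are illustrative."""

    def __init__(self, loci):
        self.loci = list(loci)
        # Per locus: allele sequence -> allele identifier.
        self.alleles = {locus: {} for locus in self.loci}
        # Combination of allele identifiers -> profile identifier.
        self.profiles = {}

    def allele_id(self, locus, sequence):
        """Return the identifier of a sequence, registering it if it is new."""
        known = self.alleles[locus]
        if sequence not in known:
            known[sequence] = len(known) + 1
        return known[sequence]

    def profile_id(self, genome):
        """Return (allelic profile, profile identifier) for one genome."""
        profile = tuple(self.allele_id(locus, genome[locus]) for locus in self.loci)
        if profile not in self.profiles:
            self.profiles[profile] = len(self.profiles) + 1
        return profile, self.profiles[profile]


registry = NomenclatureRegistry(["locusA", "locusB"])

isolate_1 = {"locusA": "ACGT", "locusB": "GGCA"}
isolate_2 = {"locusA": "ACGT", "locusB": "GGCA"}  # identical to isolate_1
isolate_3 = {"locusA": "ACGA", "locusB": "GGCA"}  # new allele at locusA

result_1 = registry.profile_id(isolate_1)
result_2 = registry.profile_id(isolate_2)
result_3 = registry.profile_id(isolate_3)

for name, result in (("isolate_1", result_1), ("isolate_2", result_2), ("isolate_3", result_3)):
    print(name, "profile", result[0], "identifier", result[1])
```

Two isolates with the same sequences at every locus receive the same code, while a single new allele yields a new combination and therefore a new identifier. In this sketch the numbers depend on the order in which isolates are registered.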
  27. 27. wgMLST Nomenclature. wgMLST principle: assignment of unique allele identifiers (infinitely growing*). ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA. *SeqSphere+ only uses the accessory genome of the ‘reference genome’. BIGSdb and Bionumerics use the accessory genome of the pan genome. Furthermore, for detecting loci/targets by similarity and overlap, BIGSdb scans new draft genomes against all alleles of a locus and not only against the allele of the ‘reference genome’, as done by SeqSphere+ and Bionumerics. Thereby different results might be obtained depending on when the search was conducted (‘triangulation problem’). Cluster/outbreak threshold calibration is only possible on cgMLST level!
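One possible reading of the ‘triangulation problem’ can be shown with a toy example. The sketch below is not how SeqSphere+, BIGSdb or Bionumerics detect loci; the identity measure, the 90% threshold, the sequences and the function names are all assumptions chosen to make the time dependence visible.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def found_reference_only(query, reference, threshold):
    """Locus counts as found only if the query resembles the reference allele."""
    return identity(query, reference) >= threshold


def found_any_allele(query, known_alleles, threshold):
    """Locus counts as found if the query resembles any allele known so far."""
    return any(identity(query, allele) >= threshold for allele in known_alleles)


THRESHOLD = 0.9  # hypothetical similarity cut-off

reference = "AAAAAAAAAA"
variant_b = "AAAAAAAAAT"  # 9 of 10 positions match the reference
variant_c = "AAAAAAAATT"  # 8 of 10 match the reference, 9 of 10 match variant_b

# Reference-only mode gives the same answer whenever the search is run.
ref_only = found_reference_only(variant_c, reference, THRESHOLD)

# All-alleles mode depends on which alleles were already in the database.
before_b = found_any_allele(variant_c, [reference], THRESHOLD)
after_b = found_any_allele(variant_c, [reference, variant_b], THRESHOLD)

print("reference only:", ref_only)
print("all alleles, before variant_b was added:", before_b)
print("all alleles, after variant_b was added:", after_b)
```

Here the same draft sequence is missed when it is scanned early and found once an intermediate allele has been registered, so the result depends on when the search was conducted.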
  28. 28. WGS Cluster Type (CT) • The MLST sequence type (ST) and clonal complex (CC) concepts must and will remain (among many other reasons, for backwards compatibility). • For NGS genome-wide gene-by-gene allele typing with hundreds/thousands of genes/targets from a ‘WGS typing scheme’ or with ‘core genome genes’, the allele nomenclature for every target/gene must be controlled. • For communication between humans (e.g., publication) and to make the results comparable on an international scale, the nomenclature of specific combinations of hundreds/thousands of targets/genes must also be controlled. • For these specific combinations of hundreds/thousands of targets/genes the term Cluster Type (CT) is proposed. • A CT will be much more discriminatory than an ST; the definition is mainly needed for outbreak investigation/transmission chain analysis. • The CT concept must be able to cope with: • some missing targets/genes (either not present, not sequenced by chance, or not assembled well), • a few target/gene allele differences due to NGS sequencing errors, intra-host variation and/or micro-evolutionary changes during an outbreak, and • different bacterial population structures (e.g., monomorphic vs. panmictic structure) and infection dynamics (e.g., incubation period and/or transmission mode). Therefore, a CT will be species-specific. • The CT threshold is pragmatically defined as the highest observed number of allele differences in intra-patient, consecutive and/or outbreak isolates plus 25% number of alleles (rounded) to rule out recent transmission for sure. • As the CT will be ‘just’ a number and there will be no biologically meaningful relationship between the CT numbers (otherwise a single expanding nomenclature would be impossible), it is proposed to associate with every CT the date and location (city and country) of isolation (e.g., CT 399; March 2013, Münster, Germany). • As a CT will be specific for a ‘WGS typing scheme’ (cgMLST), it is proposed to use, e.g., the phrase ‘Ridom cgMLST CT’. Problems due to: • additive expansion • missing data • entry order
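The threshold rule and the handling of missing targets can be made concrete with a short sketch. Several points are assumptions rather than statements from the slide: that "plus 25%" means 25% of the highest observed difference, that rounding is half-up, that missing targets are skipped in pairwise comparison, and that isolates at or below the threshold fall into the same CT. The function names and example profiles are hypothetical.

```python
MISSING = None  # hypothetical marker for a target that was not called


def allele_differences(p, q):
    """Count differing alleles, skipping targets missing in either profile."""
    return sum(
        a != b
        for a, b in zip(p, q)
        if a is not MISSING and b is not MISSING
    )


def ct_threshold(max_observed_differences):
    """Highest observed difference plus 25%, rounded half up (assumed reading).

    Integer arithmetic: floor(1.25 * m + 0.5) == (5 * m + 2) // 4.
    """
    return (5 * max_observed_differences + 2) // 4


def same_cluster_type(p, q, threshold):
    """Assumed rule: at most `threshold` allele differences means same CT."""
    return allele_differences(p, q) <= threshold


# Hypothetical calibration value: at most 8 allele differences were seen
# among intra-patient, consecutive and/or outbreak isolates.
threshold = ct_threshold(8)

isolate_a = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
isolate_b = (1, 2, 1, MISSING, 1, 1, 3, 1, 1, 1, 1, 1)

print("threshold:", threshold)
print("differences:", allele_differences(isolate_a, isolate_b))
print("same CT:", same_cluster_type(isolate_a, isolate_b, threshold))
```

Skipping missing targets keeps an incompletely assembled genome comparable, but it also means that a distance can change once the missing target is later called, which is one way the ‘missing data’ problem listed on the slide can arise.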
  29. 29. Taxonomical/Phylogenetic Nomenclature. Taxonomical nomenclature principle based on SNP or wgMLST dendrogram.* *Desirable BUT hardly possible for an additively expandable nomenclature system, as there will always be changes in the tree (was not possible in the past with MLST or canonical SNPs of monomorphic bacteria; would violate stability of nomenclature). Furthermore, if done with ‘SNP addresses’ and not with alleles, it is very compute-intensive to calculate. ECDC (October, 2015). Expert Opinion on the introduction of next-generation typing methods for food- and waterborne diseases in the EU and EEA.
  30. 30. cgMLST Nomenclature Harmonization. Vaz et al. (October, 2014). J Biomed Semantics 5: 43 [PubMed]. The TypON microbial typing ontology directly foresees a REST application programming interface (API) for cgMLST allele nomenclature services that allows software tools to communicate with each other bidirectionally.
  31. 31. cgMLST Nomenclature Server(s). SeqSphere+: Query and authentication API to be released publicly in early 2016. Submission has been possible for SeqSphere+ users since 2013 without any manual curation steps involved. Submission API for other tools foreseen for mid 2016. BIGSdb: Query and authentication API available since mid 2015. Submission API announced October 2015 (evaluation needed, and SeqSphere+ and Bionumerics must ‘emulate’ the BIGSdb mode of allele calling).
  32. 32. Other PathoNGenTrace Bioinformatics Activities • WGS Genotyping Standardization • Visualization of four dimensions and Early Warning • From WGS Geno- to Phenotype Prediction (resistome & virulome analysis) • From WGS to Plain Language Report
  33. 33. SeqSphere+ Visualization of Four Dimensions, released with version 3.0 in early October 2015 (the term MLST+ has also no longer been used since then). Four dimensions: Place, Time, ‘Person’ (by color), and Type. All four dimension views are interactively inter-linked and exportable in publication-quality scalable vector graphics (SVG) format. Ruppitsch et al. J Clin Microbiol. 2015; 53: 2869 [PubMed].
      #Missing values | Sample ID | Good Targets | ST | Collection Date | Country of Isolation | City of Isolation | ZIP of Isolation | Cluster Type
      4 | 12025647 | 99.8 | 398 | unknown | Austria | ? (unknown) | ? (unknown) | 45
      4 | 2010-00770 | 99.8 | 398 | Feb 2, 2010 | Austria | Hartberg | 8230 | 39
      4 | 3230TP3 | 99.8 | 398 | Jan 22, 2010 | Austria | Hartberg | 8230 | 39
      3 | 3230TP5 | 99.8 | 403 | Jan 22, 2010 | Austria | Hartberg | 8230 | 35
      8 | CIP105458 | 99.5 | 2 | 1959 | USA | ? (unknown) | ? (unknown) | 49
      0 | EGD-e | 100.0 | 35 | 1924 | United Kingdom | Cambridge | ? (unknown) | 1
      4 | L10-10 | 99.8 | 398 | Jan 12, 2010 | Austria | Zell am See | 5700 | 39
      4 | L14-10 | 99.8 | 398 | Jan 25, 2010 | Austria | Rohrbach | 4150 | 39
      4 | L16-10 | 99.8 | 398 | Jan 29, 2010 | Austria | Salzburg | 5020 | 39
      4 | L17-10 | 99.8 | 398 | Jan 30, 2010 | Austria | Krems | 3500 | 39
      4 | L30-10 | 99.8 | 398 | Jan 25, 2010 | Austria | Ried im Innkreis | 4910 | 39
      4 | L33-10 | 99.8 | 398 | Feb 22, 2010 | Austria | St. Pölten | 3100 | 39
      5 | L38-11 | 99.7 | 398 | 2010 | Austria | Vienna | 1010 | 41
      4 | L4-10 | 99.8 | 398 | Dec 23, 2009 | Austria | Gänserndorf | 2230 | 39
      2 | L71-09 | 99.9 | 403 | Dec 10, 2009 | Austria | Mattersburg | 7210 | 35
      5 | L75-09 | 99.7 | 398 | Dec 16, 2009 | Austria | Vienna | 1100 | 39
  34. 34. One Disruptive Technology Fits it All – Genomic Surveillance and More. Workflow: pure bacterial culture / single cell → DNA (≈3.5 h) → rapid NGS (≈28-43 h) → de novo or reference-assisted assembly (<1 h) → allele calling (<5 min). Additional input: phenotypic and epidemiologic information from LIMS (e.g., via Excel file or HL7). Targets: MLST/rMLST; cgMLST; SNP / accessory targets; antibiotic resistance targets; toxins & pathogenicity targets. Applications: evolutionary analysis; surveillance & outbreak investigation; resistome / virulome; standardized hierarchical microbial typing and more. Linked resources: EBI ENA (backup raw reads); cgMLST Nomenclature Server.
  35. 35. Dissemination Activities
  36. 36. 2nd Conference Rapid Microbial NGS and Bioinformatics: Translation Into Practice The event will gather experts from all over the world active in applying Next Generation Sequencing (NGS) techniques to discover the epidemiology, anti-microbial resistance, ecology and evolution of microorganisms. The program will be designed to build a bridge between software developers and end-users. At a Glance Date: June 9-11, 2016 Place: Hamburg, Germany Complete program to be announced online soon. Registration: will open mid December 2015 at: Contact: For questions or further information please send an email to The research from the PathoNGen-Trace project has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement N° 278864.
  37. 37. Dag Harmsen Univ. Münster, Germany cgMLST Evolvement and Challenges for Harmonization 19th November, 2015 EU PathoNGenTrace Consortium