Integrating phylogenetic inference and metadata visualization for NGS data
1. João André Carriço, PhD
Microbiology Institute/Institute for Molecular Medicine
Faculty of Medicine, University of Lisbon
Portugal
http://im.fm.ul.pt
http://imm.fm.ul.pt
http://www.joaocarrico.info
Workshop 20:
Typing of Bacterial Pathogens in 2015:
Expanding the scope of NGS
3. Charles Darwin's “tree of life” in
Notebook B, 1837-1838
Darwin and the tree of life
4. Phylogenetic methods aim to infer the
relationships between taxa by identifying
their common ancestors
Assumptions: the characters being compared
are homologous and independent, i.e. they
share a common ancestor and each character
was subject to evolutionary forces individually
Phylogenetic Inference
ATTGGGG ATGGGGG
AT?GGGG
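The two-sequence figure on this slide can be mimicked with a very small sketch. It assumes the sequences are already aligned and uses a deliberately naive rule (keep a site when both descendants agree, otherwise mark it unknown); the function name `naive_ancestor` is hypothetical and this is not a real phylogenetic method.

```python
def naive_ancestor(seq_a: str, seq_b: str) -> str:
    """Naive ancestral sequence for two aligned sequences of equal length.

    Sites where both descendants agree keep that character;
    sites where they disagree are marked '?' (state unknown).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(a if a == b else "?" for a, b in zip(seq_a, seq_b))


print(naive_ancestor("ATTGGGG", "ATGGGGG"))  # AT?GGGG
```

Real methods resolve the ambiguous site using more taxa and an explicit model of character change.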
5. Software for Phylogenetic trees: based
on sequence alignments
• MEGA
• http://www.megasoftware.net/
• Splitstree
• http://www.splitstree.org/
• Geneious
• http://www.geneious.com/
• FastTree
• http://www.microbesonline.org/fasttree
• RAxML
• http://sco.h-its.org/exelixis/web/software/raxml/index.html
• PHYLIP
• http://evolution.genetics.washington.edu/phylip.html
• BEAST
• http://beast.bio.ed.ac.uk/
And many many others…
6. Sequence Alignment methods
Kos, V.N. et al., 2012. Comparative genomics of vancomycin-resistant Staphylococcus aureus
strains and their positions within the clade most commonly associated with Methicillin-resistant S.
aureus hospital-acquired infection in the United States. mBio, 3(3).
Maximum Likelihood tree of concatenated SICOs
7. Sequence Alignment methods
Maximum Likelihood tree of concatenated SICOs
Caveats:
• Computationally intensive: some methods can’t be
applied to hundreds or thousands of strains
• Require specialized knowledge of the method and
software for parameter definition
• Some phenomena violate the assumptions
(recombination, convergent evolution, etc.)
8. Sequence Based Typing Methods
Strain genomic information encoded as a numeric
sequence
Sanger sequencing:
MLST: Gene allele identifier
MLVA: Number of repeats
NGS approaches:
Gene-by-Gene / allele based:
wgMLST: core + pan genome genes are represented
cgMLST: just core genome
SNP typing: polymorphism
9. Each unique gene sequence
(allele)
is assigned an integer ID
by comparison with online databases
Allelic profile:
12 - 9 - 11 - 7 - 11 - 20 - 3
Each allelic profile, i.e. sequence type (ST), is
unambiguously identified by an
integer.
Single locus variant (SLV): 12 - 10 - 11 - 7 - 11 - 20 - 3
Double locus variant (DLV): 12 - 10 - 11 - 11 - 11 - 20 - 3
Triple locus variant (TLV): 10 - 10 - 11 - 11 - 11 - 2 - 3
Bacterial
chromosome
MLST
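The allele and ST numbering described on this slide can be sketched as two lookup tables. This is a toy, in-memory stand-in for an online database: the class `AlleleDatabase`, the locus names and the sequences are all hypothetical, and real schemes assign IDs through a curated central database rather than locally.

```python
from typing import Dict, List, Tuple


class AlleleDatabase:
    """Toy in-memory stand-in for an online allele/ST database (hypothetical)."""

    def __init__(self) -> None:
        self.alleles: Dict[str, Dict[str, int]] = {}  # locus -> sequence -> allele ID
        self.sts: Dict[Tuple[int, ...], int] = {}     # allelic profile -> ST

    def allele_id(self, locus: str, sequence: str) -> int:
        """Return the integer ID of this sequence at this locus, creating it if new."""
        known = self.alleles.setdefault(locus, {})
        if sequence not in known:
            known[sequence] = len(known) + 1
        return known[sequence]

    def sequence_type(self, profile: Tuple[int, ...]) -> int:
        """Return the integer ST of an allelic profile, creating it if new."""
        if profile not in self.sts:
            self.sts[profile] = len(self.sts) + 1
        return self.sts[profile]

    def type_strain(self, genes: Dict[str, str], loci: List[str]):
        """Build the allelic profile of one strain and look up its ST."""
        profile = tuple(self.allele_id(locus, genes[locus]) for locus in loci)
        return profile, self.sequence_type(profile)


db = AlleleDatabase()
loci = ["locusA", "locusB"]  # hypothetical locus names
p1, st1 = db.type_strain({"locusA": "ATGC", "locusB": "GGCA"}, loci)
p2, st2 = db.type_strain({"locusA": "ATGC", "locusB": "GGCT"}, loci)
p3, st3 = db.type_strain({"locusA": "ATGC", "locusB": "GGCA"}, loci)
print(p1, st1, p2, st2, p3, st3)
```

The first and third strains carry identical sequences, so they receive the same profile and the same ST; the second differs at one locus and is therefore a new ST.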
10. SNP NGS Approach
A good approach for monomorphic species.
For non-monomorphic species, SNPs in genome regions where
recombination was detected need to be removed to avoid confounding the
phylogenetic signal.
Workflow (figure): sample -> NGS -> WGS reads -> mapping to reference -> FASTA file with SNPs
File formats along the way: fastq files, BAM files, VCF files
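The removal of SNPs in recombinant regions can be sketched as a simple interval filter. This is a minimal illustration, assuming the recombinant regions have already been detected by some other tool and are given as inclusive (start, end) coordinates on the reference; the function name and the positions are hypothetical.

```python
from typing import List, Tuple


def filter_recombinant_snps(
    snp_positions: List[int],
    recombinant_regions: List[Tuple[int, int]],
) -> List[int]:
    """Keep only SNP positions that fall outside every recombinant region.

    Regions are inclusive (start, end) coordinates on the reference genome.
    """
    return [
        pos
        for pos in snp_positions
        if not any(start <= pos <= end for start, end in recombinant_regions)
    ]


snps = [120, 450, 980, 1500, 2300]   # hypothetical SNP positions
regions = [(400, 1000)]              # hypothetical recombinant region
print(filter_recombinant_snps(snps, regions))  # [120, 1500, 2300]
```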
11. Gene by Gene NGS Approach
Software currently available:
BIGSdb (Jolley, K.A. & Maiden, M.C.J., 2010. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics)
RIDOM™ SEQSPHERE+ (http://www.ridom.com/seqsphere/)
Central nomenclature server:
Schemas, Allele definitions and identifiers
Workflow (figure): sample -> NGS -> WGS reads -> assembly -> contigs
Output: allelic profile
12. Algorithms for Phylogenetic Inference
Based on the distance matrix:
• Hierarchical clustering methods: UPGMA, Single Linkage
and Complete linkage
• Neighbor-joining
• Minimum Spanning Trees
Maximum Parsimony methods
Based on rules (Graphic Matroids)
• goeBURST
Maximum Likelihood methods
Bayesian inference methods
Input data (figure labels): sequence alignments, allelic profiles
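As an illustration of the distance-matrix family applied to allelic profiles, the sketch below computes pairwise Hamming distances (number of differing loci) and builds a minimum spanning tree with Prim's algorithm. The strain names and profiles are hypothetical, and ties between equally short links are broken arbitrarily here, which is exactly the gap that rule-based methods such as goeBURST address.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


def hamming(p: Profile, q: Profile) -> int:
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def minimum_spanning_tree(profiles: Dict[str, Profile]) -> List[Tuple[str, str, int]]:
    """Prim's algorithm on the complete graph weighted by Hamming distance."""
    names = list(profiles)
    in_tree = [names[0]]          # start from the first strain
    edges = []
    while len(in_tree) < len(names):
        # Cheapest link from a strain already in the tree to one outside it
        u, v = min(
            ((u, v) for u in in_tree for v in names if v not in in_tree),
            key=lambda e: hamming(profiles[e[0]], profiles[e[1]]),
        )
        edges.append((u, v, hamming(profiles[u], profiles[v])))
        in_tree.append(v)
    return edges


# Hypothetical 4-locus allelic profiles
example = {"A": (1, 1, 1, 1), "B": (1, 1, 1, 2), "C": (1, 2, 2, 2), "D": (3, 2, 2, 2)}
mst = minimum_spanning_tree(example)
print(mst)
```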
13. Inferring phylogeny from allelic profiles
Assume that you have only 3 genes and each number corresponds to a different
allele for each gene. The minimal assumption is that an SLV link may
correspond to a possible line of phylogenetic descent.
Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3
Six SLV links connect these profiles:
11 possible trees…
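The count of 11 trees can be checked by brute force. The sketch below lists every pair of profiles differing at exactly one locus (the SLV links) and then counts the subsets of four links that connect all five profiles without a cycle. The ST numbers 1 to 5 are assigned here in the order the profiles appear on the slide, which is an assumption.

```python
from itertools import combinations

# The five 3-locus profiles from the slide, numbered in order of appearance
profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}


def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


# SLV links: pairs of profiles that differ at exactly one locus
slv_links = [
    (u, v) for u, v in combinations(profiles, 2)
    if hamming(profiles[u], profiles[v]) == 1
]


def is_spanning_tree(edges, nodes):
    """True if the edges contain no cycle (checked with union-find).

    With len(nodes) - 1 acyclic edges, the result is a spanning tree.
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False
        parent[ru] = rv
    return True


trees = [
    t for t in combinations(slv_links, len(profiles) - 1)
    if is_spanning_tree(t, profiles)
]
print(len(slv_links), len(trees))  # 6 11
```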
14. eBURST model
More similar STs should denote closely related strains
from an evolutionary point of view.
STs with more SLVs can be regarded as a common
ancestor.
Links between STs depict descent relations.
With these assumptions, connected STs should share an
evolutionary path.
Maynard Smith J., et al. 2000. Bioessays 22:1115-
eBURST
Feil E. et al, J Bac 2004
15. goeBURST
STid  Profile  #SLVs  #DLVs  #TLVs  Freq
1     1-1-1    2      2      0      1
2     1-1-2    2      2      0      1
3     1-2-1    3      1      0      1
4     1-2-2    3      1      0      1
5     1-2-3    2      2      0      1
Implementing the eBURST rules as a graphic
matroid problem allows for a globally optimal solution to
the placement of the ST links.
Francisco et al, BMC Bioinf, 2009
Tie-break annotations (figure): more SLVs / lower ID;
connects to ST4 because of #SLVs
Final goeBURST tree:
unique solution
guaranteed
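The tie-break idea can be sketched as Kruskal's algorithm over a total ordering of links. This is a simplified illustration, not the PHYLOViZ implementation: the exact ordering used in `edge_key` (link level, then larger and smaller #SLVs of the two STs, then #DLVs, #TLVs, frequency, and finally lower ST ID) is an assumption modelled on the rules named on the slides.

```python
from itertools import combinations

profiles = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}
frequency = {st: 1 for st in profiles}  # every ST seen once, as on the slide


def hamming(p, q):
    """Number of loci at which two allelic profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def variant_counts(st):
    """[#SLVs, #DLVs, #TLVs] of one ST within the dataset."""
    counts = [0, 0, 0]
    for other in profiles:
        d = hamming(profiles[st], profiles[other])
        if other != st and 1 <= d <= 3:
            counts[d - 1] += 1
    return counts


def edge_key(edge):
    """Sort key: smaller keys are preferred links (assumed ordering)."""
    u, v = edge
    key = [hamming(profiles[u], profiles[v])]          # SLV links before DLV, TLV
    for cu, cv in zip(variant_counts(u), variant_counts(v)):
        key += [-max(cu, cv), -min(cu, cv)]            # more SLVs, then DLVs, TLVs
    key += [-max(frequency[u], frequency[v]), -min(frequency[u], frequency[v])]
    key += [min(u, v), max(u, v)]                      # finally, lower ST ID
    return key


def goeburst_like_tree():
    """Kruskal's algorithm over links sorted by edge_key."""
    parent = {st: st for st in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for u, v in sorted(combinations(profiles, 2), key=edge_key):
        ru, rv = find(u), find(v)
        if ru != rv:          # skip links that would close a cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree


tree = goeburst_like_tree()
print(tree)
```

Because the key is a total order over links, the greedy selection always returns the same tree for the same dataset, which is the sense in which the solution is unique.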
16. Applying goeBURST
Profiles: 1-1-1, 1-1-2, 1-2-1, 1-2-2, 1-2-3
Six SLV links connect these profiles:
11 possible trees…
All these are valid goeBURST solutions. If all
STs have the same frequency in the dataset,
the tie-break falls to the ST ID
18. goeBURST FULL MST
• The goeBURST rules can be expanded to any number of
loci while maintaining the same assumptions of the
underlying evolutionary model
• Adds an evolutionary model to the basic Minimum
Spanning Tree approach
• Advantage: very fast to calculate compared to phylogenetic
analysis algorithms
• Advantage: if the strains are closely related, the
internal nodes are actual strains, unlike in
traditional phylogenetic methods
• Disadvantage: does not create internal nodes as putative
recent common ancestors
21. PHYLOViZ
Can be easily applied to:
-MLST
-MLVA
-SNP data*
-Gene Presence/absence
*Conversion of VCF to PHYLOViZ:
https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py
(Thanks Nick!)
23. Core genome comparison - Workflow
Core genome from all available fully sequenced S. aureus strains in NCBI
Using strain COL genes as reference
1866 target loci found for a cgMLST schema (RIDOM Seqsphere+)
Call alleles for strains under study
Removing loci with missing data in the strains under analysis
1542 target genes kept for whole genome comparison
goeBURST Minimum Spanning Tree of the resulting allelic profiles
(PHYLOViZ software)
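The step that removes loci with missing data can be sketched as a column filter over the allelic profiles. This is a minimal illustration: the strain names, the tiny profiles and the use of `None` for an uncalled allele are assumptions, not the format used by Seqsphere+ or PHYLOViZ.

```python
from typing import Dict, List, Optional, Tuple


def drop_incomplete_loci(
    profiles: Dict[str, Tuple[Optional[int], ...]],
) -> Tuple[List[int], Dict[str, Tuple[int, ...]]]:
    """Keep only the loci that have an allele call in every strain.

    Returns the indices of the kept loci and the reduced profiles.
    A missing call is represented by None (an assumption of this sketch).
    """
    strains = list(profiles)
    n_loci = len(profiles[strains[0]])
    kept = [
        i for i in range(n_loci)
        if all(profiles[s][i] is not None for s in strains)
    ]
    reduced = {s: tuple(profiles[s][i] for i in kept) for s in strains}
    return kept, reduced


# Hypothetical profiles over 4 loci; None marks a locus that could not be called
calls = {
    "strain1": (1, 4, None, 2),
    "strain2": (1, 5, 3, 2),
    "strain3": (2, None, 3, 2),
}
kept, reduced = drop_incomplete_loci(calls)
print(kept, reduced)
```

In the workflow above, the same filtering is what reduces the 1866 target loci to the 1542 genes kept for comparison.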
24. Core genome comparison
Figure: goeBURST Minimum Spanning Tree, MLST+: 1542 genes
(core genome genes found in all strains)
Legend: VRSA; NCBI strains; US VRSA strains (Kos et al); HSM strains; MRSA srp; VRS5
26. PHYLOViZ
PROs:
Handles thousands of profiles
Fast calculation
Easy to annotate and explore metadata
Allows for basic statistics on profiles and metadata
Allows for advanced statistics on MSTs
(PLoS One. 2015 Mar 23;10(3):e0119315)
Exports high quality graphical formats
Allows plugin development
CONs:
goeBURST and goeBURST MST only
(Neighbour Joining and UPGMA coming soon)
Java knowledge required to code new plugins
27. Final Remarks
Phylogenetic inference always has an underlying model. The
choice of method depends on what data is being analyzed and
the underlying question
With the increasing availability of bacterial genomes, the methods
that allow their comparison need to be efficient and scalable
Metadata should always be used to evaluate the algorithm results
PHYLOViZ provides a visualization framework to
analyze inferred patterns of descent based on goeBURST,
including detailed statistics, and allows easy integration of
metadata with algorithm results
Any sequence-based typing method that generates allelic profiles
can be analyzed by this framework, including any NGS derived
schema (e.g. cgMLST, SNPs)
28. Ongoing PHYLOViZ work
Modular plugin architecture
Allows expansion and addition of new
capabilities
Other analysis algorithms/ custom rules
New visualization modules
Allow the analysis of other data types
Complementary statistics modules
Try to address users’ needs…
We need your feedback!
PHYLOViZ is free, open-source software