Integrating phylogenetic inference and metadata visualization for NGS data

João André Carriço, PhD
Microbiology Institute/Institute for Molecular Medicine
Faculty of Medicine, University of Lisbon
Portugal
http://im.fm.ul.pt
http://imm.fm.ul.pt
http://www.joaocarrico.info
Workshop 20:
Typing of Bacterial Pathogens in 2015:
Expanding the scope of NGS

Conflicts of Interest
NOTHING TO DISCLOSE

Charles Darwin’s “tree of life” in Notebook B, 1837-1838
Darwin and the tree of life

Phylogenetic methods aim to infer the relationships between taxa by identifying their common ancestors
Assumptions: the characters being compared are homologous and independent, i.e. they share a common ancestor and each character evolved independently of the others
Phylogenetic Inference
ATTGGGG ATGGGGG
AT?GGGG

Software for Phylogenetic trees: based on sequence alignments
• MEGA
• http://www.megasoftware.net/
• SplitsTree
• http://www.splitstree.org/
• Geneious
• http://www.geneious.com/
• FastTree
• http://www.microbesonline.org/fasttree
• RAxML
• http://sco.h-its.org/exelixis/web/software/raxml/index.html
• PHYLIP
• http://evolution.genetics.washington.edu/phylip.html
• BEAST
• http://beast.bio.ed.ac.uk/
And many, many others…

Sequence Alignment methods
Kos, V.N. et al., 2012. Comparative genomics of vancomycin-resistant Staphylococcus aureus
strains and their positions within the clade most commonly associated with Methicillin-resistant S.
aureus hospital-acquired infection in the United States. mBio, 3(3).
Maximum Likelihood tree of concatenated SICOs

Sequence Alignment methods
Maximum Likelihood tree of concatenated SICOs
Caveats:
• Computationally intensive: some methods can’t be applied to hundreds or thousands of strains
• Require specialized knowledge of the methods and software for parameter definition
• Some phenomena violate the assumptions (recombination, convergent evolution, etc.)

Sequence Based Typing Methods
Strain genomic information encoded as a numeric
sequence
Sanger sequencing:
MLST: Gene allele identifier
MLVA: Number of repeats
NGS approaches:
Gene-by-Gene / allele based:
wgMLST: core + pan genome genes are represented
cgMLST: just core genome
SNP typing: polymorphism

Each unique gene sequence (allele) is assigned an integer ID by comparison with online databases
Allelic profile:
12 - 9 - 11 - 7 - 11 - 20 - 3
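The allele-numbering step can be sketched as below. This is only an illustration under stated assumptions: `allele_db` is a hypothetical in-memory table standing in for an online database, and the locus names and sequences are made up.

```python
# Hypothetical allele table: locus -> {allele sequence -> integer ID}.
# Locus names and sequences are made up for illustration.
allele_db = {
    "locusA": {"ATTGGGG": 1, "ATGGGGG": 2},
    "locusB": {"CCGTA": 1, "CCGTT": 2},
    "locusC": {"GGA": 1, "GGC": 2, "GGT": 3},
}


def allelic_profile(strain_sequences, db, loci):
    """Return the tuple of allele IDs for one strain, one ID per locus.

    Sequences absent from the table get None (a novel allele that a
    curated database would have to number).
    """
    return tuple(db[locus].get(strain_sequences[locus]) for locus in loci)


loci = ["locusA", "locusB", "locusC"]
strain = {"locusA": "ATTGGGG", "locusB": "CCGTT", "locusC": "GGT"}
profile = allelic_profile(strain, allele_db, loci)
print(" - ".join(str(a) for a in profile))
```

The printed string has the same shape as the allelic profile shown on the slide: one integer per locus, in a fixed locus order.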

Each allelic profile, aka sequence type (ST), is unequivocally identified by an integer.
Single locus variant (SLV): 12 - 10 - 11 - 7 - 11 - 20 - 3
Double locus variant (DLV): 12 - 10 - 11 - 11 - 11 - 20 - 3
Triple locus variant (TLV): 10 - 10 - 11 - 11 - 11 - 2 - 3
Bacterial
chromosome
MLST
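The SLV/DLV/TLV terminology amounts to counting the loci at which two profiles differ. A minimal sketch, using the example profile 12 - 9 - 11 - 7 - 11 - 20 - 3 as the reference; the function names are illustrative and not taken from any tool:

```python
def locus_differences(profile_a, profile_b):
    """Count loci at which two allelic profiles carry different alleles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


VARIANT_NAMES = {0: "identical", 1: "SLV", 2: "DLV", 3: "TLV"}


def variant_level(profile_a, profile_b):
    """Name the relationship between two profiles (SLV, DLV, TLV, ...)."""
    n = locus_differences(profile_a, profile_b)
    return VARIANT_NAMES.get(n, f"{n}-locus variant")


reference = (12, 9, 11, 7, 11, 20, 3)
print(variant_level(reference, (12, 10, 11, 7, 11, 20, 3)))   # one locus differs
print(variant_level(reference, (12, 10, 11, 11, 11, 20, 3)))  # two loci differ
```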

SNP NGS Approach
A good approach for monomorphic species.
For non-monomorphic species, SNPs in genome regions where recombination was detected need to be removed to avoid confounding the phylogenetic signal.
Workflow: sample → NGS (WGS) → reads (fastq files) → mapping to reference (BAM files) → VCF files → FASTA file with SNPs
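Removing SNPs that fall in recombinant regions can be sketched as a simple coordinate filter. The positions, region boundaries and base calls below are made up, and detecting the recombinant regions themselves is assumed to have been done by a separate tool:

```python
def filter_recombinant_snps(snp_positions, recombinant_regions):
    """Drop SNP positions that fall inside any recombinant region.

    recombinant_regions is a list of (start, end) pairs, inclusive,
    in the same reference coordinates as snp_positions.
    """
    return [
        pos for pos in snp_positions
        if not any(start <= pos <= end for start, end in recombinant_regions)
    ]


def snp_alignment(calls, positions):
    """Concatenate each strain's base at the kept positions."""
    return {strain: "".join(bases[p] for p in positions)
            for strain, bases in calls.items()}


# Made-up example data: reference positions and per-strain base calls.
snps = [120, 450, 980, 1500, 2210]
regions = [(900, 1600)]            # hypothetical recombination tract
calls = {
    "strainA": {120: "A", 450: "C", 980: "T", 1500: "G", 2210: "A"},
    "strainB": {120: "G", 450: "C", 980: "C", 1500: "A", 2210: "T"},
}
kept = filter_recombinant_snps(snps, regions)
alignment = snp_alignment(calls, kept)
print(kept, alignment)
```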

Gene by Gene NGS Approach
Software currently available:
BIGSdb (Jolley, K.A. & Maiden, M.C.J., 2010. BIGSdb: Scalable analysis of bacterial genome variation at the population level. BMC Bioinformatics)
RIDOM™ SeqSphere+ (http://www.ridom.com/seqsphere/)
Central nomenclature server: schemas, allele definitions and identifiers
Workflow: sample → NGS (WGS) → reads → assembly → contigs → output: allelic profile

Algorithms for Phylogenetic Inference
Based on the distance matrix (input: sequence alignments, allelic profiles):
• Hierarchical clustering methods: UPGMA, Single Linkage and Complete Linkage
• Neighbor-joining
• Minimum Spanning Trees
Maximum Parsimony methods (input: sequence alignments)
Based on rules (Graphic Matroids) (input: allelic profiles):
• goeBURST
Maximum Likelihood methods (input: sequence alignments)
Bayesian inference methods (input: sequence alignments)

Inferring phylogeny from allelic profiles
Assume that you have only 3 genes and that each number identifies a different allele of each gene. The minimal assumption is that an SLV link may correspond to a possible phylogenetic descent.
1-1-1 1-1-2 1-2-1 1-2-2 1-2-3
Six SLV links connect these five profiles, giving 11 possible trees.
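The count of 11 trees can be checked by brute force: list every SLV link between the five profiles, then test each set of four links for being a spanning tree. This is a small verification sketch, not part of any typing tool:

```python
from itertools import combinations

profiles = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3)]


def is_slv(a, b):
    """True when two profiles differ at exactly one locus."""
    return sum(x != y for x, y in zip(a, b)) == 1


slv_edges = [(a, b) for a, b in combinations(profiles, 2) if is_slv(a, b)]


def is_spanning_tree(edges, nodes):
    """Check that the edges connect all nodes without forming a cycle."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:      # this edge would close a cycle
            return False
        parent[root_a] = root_b
    return len(edges) == len(nodes) - 1


trees = [subset for subset in combinations(slv_edges, len(profiles) - 1)
         if is_spanning_tree(subset, profiles)]
print(len(slv_edges), len(trees))
```

With five profiles a tree needs four links, so the search covers all 15 ways of choosing four of the six SLV links.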

eBURST model
More similar STs should denote closely related strains
from an evolutionary point of view.
STs with more SLVs can be regarded as a common ancestor.
Links between STs depict descent relations.
With these assumptions, connected STs should share an
evolutionary path.
Maynard Smith J., et al. 2000. Bioessays 22:1115-
eBURST
Feil E. et al, J Bac 2004

goeBURST
Profile  #SLVs  #DLVs  #TLVs  Freq  STid
1-1-1    2      2      0      1     1
1-1-2    2      2      0      1     2
1-2-1    3      1      0      1     3
1-2-2    3      1      0      1     4
1-2-3    2      2      0      1     5
Implementing the eBURST rules as a graphic matroid problem allows for a globally optimal solution to the placement of the ST links.
Francisco et al, BMC Bioinf, 2009
Figure annotations: "More SLVs / lower ID"; "Connects to ST4 because #SLVs"
Final goeBURST tree: unique solution guaranteed
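The table and the tie-break idea can be sketched with Kruskal's algorithm over links sorted by a tie-break key. This is a simplified illustration, not the PHYLOViZ implementation: the exact form of `edge_key` is an assumption, and frequency is left out because every ST occurs once in this example.

```python
from itertools import combinations

# ST id -> allelic profile, as in the five-ST example (frequency 1 each).
sts = {1: (1, 1, 1), 2: (1, 1, 2), 3: (1, 2, 1), 4: (1, 2, 2), 5: (1, 2, 3)}


def diff(a, b):
    """Number of loci at which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def level_counts(st):
    """(#SLVs, #DLVs, #TLVs) of one ST within the dataset."""
    d = [diff(sts[st], sts[o]) for o in sts if o != st]
    return tuple(d.count(k) for k in (1, 2, 3))


counts = {st: level_counts(st) for st in sts}


def edge_key(edge):
    """Sort key: fewer differing loci first, then endpoints with more
    SLVs/DLVs/TLVs, then lower ST ids (simplified tie-break)."""
    u, v = edge
    key = [diff(sts[u], sts[v])]
    for k in range(3):
        hi, lo = sorted((counts[u][k], counts[v][k]), reverse=True)
        key += [-hi, -lo]
    return key + [u, v]


# Kruskal's algorithm over the ordered links.
parent = {st: st for st in sts}


def find(st):
    while parent[st] != st:
        st = parent[st]
    return st


tree = []
for u, v in sorted(combinations(sts, 2), key=edge_key):
    root_u, root_v = find(u), find(v)
    if root_u != root_v:
        parent[root_u] = root_v
        tree.append((u, v))

print(counts)
print(tree)
```

In this sketch ST2 attaches to ST4 rather than ST1 because ST4 has more SLVs, and ST5 attaches to ST3 rather than ST4 only through the lower ST id.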

Applying goeBURST
1-1-1 1-1-2 1-2-1 1-2-2 1-2-3
Six SLV links connect these five profiles, giving 11 possible trees.
All of these are valid goeBURST solutions. If all STs had the same frequency in the dataset, the tie break would have to be the ST ID.

goeBURST output examples
Largest S. aureus MLST CC: 1067 of 2650 STs total
2nd largest S. aureus CC: 252 STs

goeBURST FULL MST
• The goeBURST rules can be expanded to any number of loci while maintaining the same assumptions of the underlying evolutionary model
• Adds an evolutionary model to the basic Minimum Spanning Tree approach
• Advantage: very fast to calculate compared to phylogenetic analysis algorithms
• Advantage: if the strains are closely related, the internal nodes are defined as strains, unlike in traditional phylogenetic methodologies
• Disadvantage: does not create internal nodes as putative recent common ancestors
Integrating Data Analysis and Visualization
Allelic profiles: analysis (goeBURST)
Accessory data (“metadata”): antibiogram, serotype, origin info (patient), other typing method, …
Present the data in a meaningful way
Using Phyloviz (http://www.phyloviz.net)

PHYLOViZ
Can be easily applied to:
-MLST
-MLVA
-SNP data*
-Gene Presence/absence
*Conversion of VCF to PHYLOViZ:
https://github.com/nickloman/misc-genomics-tools/blob/master/scripts/vcf2phyloviz.py
(Thanks Nick!)
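The idea behind converting SNP calls into profile-like data can be sketched as follows. This is not the linked script: it is a minimal illustration that assumes haploid `GT` calls, treats each SNP site as a "locus" and each allele index as the allele ID, and writes a tab-delimited table with an identifier column (the column name `ST` and the layout are assumptions).

```python
def vcf_to_profiles(vcf_text):
    """Collect the GT call of every sample at every site.

    Returns (site_names, {sample: [allele index string per site]});
    missing calls ('.') are kept as '.' so they can be filtered later.
    """
    sites, samples, profiles = [], [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue                      # meta-information or blank line
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]          # sample columns follow FORMAT
            profiles = {s: [] for s in samples}
            continue
        sites.append(f"{fields[0]}_{fields[1]}")
        for sample, call in zip(samples, fields[9:]):
            profiles[sample].append(call.split(":")[0])  # GT comes first
    return sites, profiles


def to_tab_delimited(sites, profiles, id_column="ST"):
    """Render one header row plus one row per sample."""
    header = "\t".join([id_column] + sites)
    rows = ["\t".join([s] + p) for s, p in profiles.items()]
    return "\n".join([header] + rows)


# Made-up two-site, two-sample VCF.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "strainA", "strainB"]),
    "\t".join(["chr", "120", ".", "A", "G", ".", "PASS", ".", "GT", "0", "1"]),
    "\t".join(["chr", "450", ".", "C", "T", ".", "PASS", ".", "GT", "1", "1"]),
])
sites, profiles = vcf_to_profiles(example_vcf)
table = to_tab_delimited(sites, profiles)
print(table)
```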

PHYLOViZ
Example of visualization with MLST+ (core genome) data of
VRSA and MRSA strains

Core genome comparison - Workflow
Core genome from all available fully sequenced S. aureus strains in NCBI
Using strain COL genes as reference
1866 target loci found for a cgMLST schema (RIDOM Seqsphere+)
Call alleles for strains under study
Removing loci with missing data in the strains under analysis
1542 target genes kept for whole genome comparison
goeBURST Minimum Spanning Tree of the resulting allelic profiles
(PHYLOViZ software)
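The step that removes loci with missing data can be sketched as below, assuming profiles are stored as tuples with `None` marking a locus that could not be called; the gene and strain names are made up:

```python
def drop_incomplete_loci(profiles, loci):
    """Keep only loci with an allele call in every strain."""
    keep = [i for i in range(len(loci))
            if all(profile[i] is not None for profile in profiles.values())]
    kept_loci = [loci[i] for i in keep]
    kept_profiles = {strain: tuple(profile[i] for i in keep)
                     for strain, profile in profiles.items()}
    return kept_loci, kept_profiles


loci = ["gene1", "gene2", "gene3", "gene4"]
profiles = {
    "strainA": (1, 4, None, 2),
    "strainB": (1, 5, 3, 2),
    "strainC": (2, None, 3, 2),
}
kept_loci, kept_profiles = drop_incomplete_loci(profiles, loci)
print(kept_loci, kept_profiles)
```

Applied to the workflow above, this kind of filter is what reduces the 1866 target loci to the 1542 genes kept for comparison.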

Core genome comparison
Figure: goeBURST MST of MLST+ profiles (1542 genes; core genome genes found in all strains)
Legend: NCBI strains, US VRSA strains (Kos et al), HSM strains
Labels in the figure: VRSA, MRSA srp, VRS5, 65

PHYLOViZ
PROs:
Handles thousands of profiles
Fast calculation
Easy to annotate and explore metadata
Allows for basic statistics on profiles and metadata
Allows for advanced statistics on MSTs
(PLoS One. 2015 Mar 23;10(3):e0119315)
Exports high quality graphical formats
Allows plugin development
CONs:
goeBURST and goeBURST MST only
(Neighbour Joining and UPGMA coming soon)
Java knowledge needed to code new plugins

Final Remarks
Phylogenetic inference always has an underlying model. The choice of method depends on the data being analyzed and on the underlying question
With the increasing availability of bacterial genomes, the methods used to compare them need to be efficient and scalable
Metadata should always be used to evaluate the algorithm results
PHYLOViZ provides a visualization framework to analyze inferred patterns of descent based on goeBURST, including detailed statistics, and allows easy integration of metadata with the algorithm results
Any sequence-based typing method that generates allelic profiles can be analyzed with this framework, including any NGS-derived schema (e.g. cgMLST, SNPs)

Ongoing Phyloviz work
Modular plugin architecture
Allows expansion and addition of new
capabilities
Other analysis algorithms/ custom rules

New visualization modules
Allow the analysis of other data types
Complementary statistics modules

Trying to address users’ needs…
We need your feedback!
Phyloviz is free, open-source software

Alexandre Francisco
Cátia Vaz
Pedro Monteiro
Mário Ramirez
José Melo-Cristino
Acknowledgements
Initial funding from Fundação para a Ciência e Tecnologia

Draft Scientific Programme:
Plenaries:
1) Small Scale Microbial Epidemiology
2) Large Scale Microbial Epidemiology
3) Bioinformatics for Genome-based Microbial Epidemiology
4) Population Genetics: Pathogen Emergence
5) Population Dynamics: Transmission networks and surveillance
6) Molecular Epidemiology for Global Health and One Health
Parallel Sessions:
1) Food and Environmental pathogens
2) Microbial Forensics
3) Virus
4) Fungi and Yeasts
5) Novel Diagnostics methodologies
6) Novel Typing approaches
7) Phylogenetic Inference
8) Interactive Illustration Platforms
Save the date!

Phyloviz Visualization Examples

Phyloviz
Burkholderia pseudomallei

Enterococcus faecium
Legend: Clinical, animal, NA, community, Hospital, Surv/Outb

Streptococcus pneumoniae CC90
Coloured by country of origin

Streptococcus pneumoniae
10 largest clonal complexes coloured by
serotype
