1. Types of Phylogenetic Tree Construction Methods
Presented By
Shabir Hussain
MVSc, VMC
M-6504
MICROBIAL BIOINFORMATICS
2. What is Phylogenetic Tree?
A phylogenetic tree is a graphical representation of the relationships among taxonomic units that are assumed to share a common ancestor.
Lamarck (1809) was the first to present an evolutionary tree.
Willi Hennig, a German entomologist, first proposed the method of phylogenetic analysis in “Grundzüge einer Theorie der phylogenetischen Systematik” (in German) in 1950.
Earlier phylogenetic studies (before the 1960s) were based on morphological similarities and differences.
Morphology-based phylogeny has now largely been replaced by molecular phylogeny, which overcomes many of its limitations. The latter uses molecular data (DNA, RNA, or amino acid sequences, enzymatic data, etc.) to construct the phylogenetic tree.
Phylogenetic analysis: the study of relationships among a collection of things (genes, proteins, organisms) derived from a common ancestor.
3. Why Phylogenetic Analysis?
Phylogenetic analysis based on molecular sequence comparison has at least four uses that are central to a range of downstream bioinformatic analyses:
• Classification (taxonomy)
• Grouping of genes, proteins, and other molecular sequences, including noncoding sequences
• Epidemiological investigations
• Analysis of parallel evolution between host and parasite
4. Terminology
• Node: a node represents a speciation event in evolution. Any sequence changes that occur beyond this point are specific to the branch on which they occur. A node represents a taxonomic unit, which can be either an existing species or an ancestor.
• Internodes: branching points that are connected to two or more other branches are referred to as internodes.
• Orthologs are homologs produced by speciation - they represent genes derived from a common ancestor that
diverged because of divergence of the organism. Orthologs tend to have similar function.
• Paralogs are homologs produced by gene duplication and represent genes derived from a common ancestral
gene that duplicated within an organism and then diverged. Paralogs tend to have different functions.
• Clade: (Also known as monophyletic group), is a group of organisms that consists of a common ancestor
and all its lineal descendants, and represents a ‘single branch’
5. Terminology
• Branch: a branch defines the relationship between taxa in terms of descent and ancestry. Branches connect the nodes and can also represent the relative divergence between terminal and internal nodes.
• Branch length: the number of changes that have occurred along the branch.
• Horizontal branch length: represents either the time between speciation events (given a mutation rate) or the amount of mutation along a lineage, depending on the kind of tree drawn. Branch length is proportional to the evolutionary distance between the nodes.
• Leaves/Terminals: the tips of the tree, each associated with only one node, are called leaves or terminal taxa.
• Root: the common ancestor of all taxa in the tree.
• Distance scale: the scale that represents the number of differences between organisms or sequences. It is generally found at the bottom of the tree.
7. Tree Terminology
Rooted vs Unrooted Tree
Unrooted trees: specify only the degree of kinship among taxa, not the evolutionary path.
Rooted trees: have a root that denotes common ancestry.
Midpoint rooting calculates tip-to-tip distances and places the root halfway along the path between the two most distant tips.
Outgroup rooting assumes that one or more taxa (the outgroup) are divergent from the rest of the taxa (the ingroup). The branch linking the ingroup and the outgroup becomes the starting point and defines all subsequent evolutionary events within the tree.
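The midpoint rooting rule can be illustrated with a minimal sketch. It assumes the tip-to-tip distances have already been read off an existing tree; the function name `midpoint_root_position` and the example values are hypothetical and chosen only for illustration.

```python
def midpoint_root_position(tips, dist):
    """Find the two most distant tips and how far the root sits from each.

    tips: list of tip labels
    dist: symmetric matrix of tip-to-tip distances (list of lists)
    """
    best = None
    for i in range(len(tips)):
        for j in range(i + 1, len(tips)):
            if best is None or dist[i][j] > best[2]:
                best = (tips[i], tips[j], dist[i][j])
    tip_a, tip_b, longest = best
    # The root is placed halfway along the path joining the two tips.
    return tip_a, tip_b, longest / 2


# Illustrative distances only (not real data).
example_tips = ["A", "B", "C"]
example_dist = [
    [0, 4, 10],
    [4, 0, 8],
    [10, 8, 0],
]
print(midpoint_root_position(example_tips, example_dist))
```

The sketch only locates the root along the longest path; actually re-rooting a tree object would need a tree data structure, which is left out here.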
8. How to construct A Phylogenetic Tree
• Involves four main steps:
1) MSA and extraction of phylogenetic data
2) Determination of the substitution model
3) Tree construction
4) Tree evaluation
Workflow: Collect Sequence Data → Sequence Alignment → Sequence Similarity → Choose a Phylogenetic Method → Tree Construction → Tree Evaluation → Tree Visualization
9. STEP 1: MSA and extraction of phylogenetic data
Align the sequences (usually RNA, DNA or protein) using MSA algorithms such as ClustalW, MUSCLE, or
MAFFT.
It helps to identify evolutionary relationships and common patterns between genes, which are crucial for tree
construction.
STEP 2: Determination of the substitution model (Sequence Similarity Measurement)
A sequence substitution model describes, in probabilistic terms, the (Markov) process by which a sequence of characters (nucleotides or amino acids) changes into another set of homologous (i.e., aligned) character states over time.
Substitution models influence both alignment and tree building; they are used to calculate evolutionary distances between sequences.
Common measures and models include percent identity, p-distance, Kimura’s two-parameter model, the Jukes-Cantor model, the Poisson model, etc.
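Three of the distance measures named above can be sketched in a few lines. This is a minimal illustration, assuming two already-aligned DNA sequences of equal length; the function names are hypothetical, and columns containing a gap are simply skipped, which is one of several possible conventions.

```python
import math

PURINES = {"A", "G"}


def _aligned_pairs(seq_a, seq_b):
    """Pair up aligned sites, skipping any column that contains a gap."""
    return [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
            if x != "-" and y != "-"]


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which the two sequences differ."""
    pairs = _aligned_pairs(seq_a, seq_b)
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq_a, seq_b):
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    pairs = _aligned_pairs(seq_a, seq_b)
    transitions = transversions = 0
    for x, y in pairs:
        if x == y:
            continue
        # A transition stays within purines (A, G) or within pyrimidines (C, T).
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(pairs)
    Q = transversions / len(pairs)
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)
```

The corrected distances are always at least as large as the p-distance, because the models account for multiple substitutions at the same site.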
10. STEP 3: Tree Construction
Two classes of methods:
Distance-based methods:
Based on the amount of dissimilarity (distance) between pairs of sequences, computed from the sequence alignment.
The fewer the differences, the more closely related the species are. A phylogenetic tree is constructed from the distance scores.
Clustering algorithms – UPGMA, neighbor joining (NJ)
Optimality based – Fitch-Margoliash (FM), minimum evolution (ME)
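As an illustration of a clustering algorithm working from distance scores, the following is a minimal UPGMA sketch. It assumes a symmetric distance matrix given as a list of lists; the function name `upgma`, the nested-tuple tree representation, and the example distances are all illustrative choices, not taken from any particular package.

```python
def upgma(labels, matrix):
    """Cluster taxa by UPGMA.

    Returns the tree as nested tuples, plus the height of each merge
    (half the distance between the two clusters that were joined).
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    heights = []
    while len(clusters) > 1:
        # Join the pair of clusters with the smallest distance.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        (tree_i, size_i), (tree_j, size_j) = clusters.pop(i), clusters.pop(j)
        # Distance to every other cluster is the size-weighted average.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        heights.append(d_ij / 2)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree, heights


# Illustrative distances only (not real data).
example_labels = ["A", "B", "C", "D"]
example_matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
example_tree, example_heights = upgma(example_labels, example_matrix)
```

UPGMA assumes a constant rate of change along all lineages, so the sketch is only appropriate where that assumption is reasonable.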
Character-based methods:
Based on discrete characters, which are molecular sequences from individual taxa.
11. • Character-based methods derive trees that optimize the distribution of the actual data patterns for each character.
• The commonly used character-based methods are Maximum Parsimony, Maximum Likelihood, and the Bayesian approach.
STEP 4: Tree Evaluation (Reliability of a Tree)
• Reliability refers to the probability that the members of a clade will be part of the true tree.
• Bootstrapping is the most common reliability test.
• Bootstrapping is a resampling procedure: the columns of the alignment are sampled at random with replacement to build a pseudo-alignment of the same length, and the phylogenetic analysis is re-run on it. The reported value is the percentage of bootstrap replicates in which the node appeared.
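The resampling idea behind bootstrapping can be sketched as follows. To keep the example self-contained, the "phylogenetic analysis" is replaced by a deliberately simple stand-in, `closest_pair`, which reports only the two taxa with the smallest p-distance; a real analysis would rebuild a full tree for every replicate. All function names and the toy alignment are hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Draw alignment columns at random with replacement (one bootstrap replicate)."""
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so sites stay aligned.
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of taxa with the smallest p-distance."""
    names = sorted(alignment)

    def p_dist(a, b):
        return sum(x != y for x, y in zip(alignment[a], alignment[b])) / len(alignment[a])

    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return frozenset(min(pairs, key=lambda pair: p_dist(*pair)))


def bootstrap_support(alignment, clade, replicates=100, seed=1):
    """Percentage of replicates in which `clade` is recovered as the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == frozenset(clade)
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


# Toy alignment for illustration only.
toy_alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTAC", "C": "TGCATGCATG"}
print(bootstrap_support(toy_alignment, {"A", "B"}))
```

Fixing the random seed makes the replicates reproducible, which is convenient when comparing support values between runs.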
12. Parsimony: a principle in science by which the simplest explanation is preferred.
This was the first phylogenetic method to be used with molecular sequences (Eck and Dayhoff 1966).
• The tree requiring the fewest mutations is preferred
• Finds the optimum tree by minimizing the number of evolutionary changes
• Makes no assumptions about the evolutionary pattern
• May oversimplify evolution
• May produce several equally good trees
16. Parsimony Algorithms
• Parsimony approaches comprise a family of related methods with varying assumptions about how
character-state transformation occurs.
Fitch Parsimony
Wagner Parsimony
Dollo Parsimony
Camin-Sokal Parsimony
The Fitch-Wagner algorithm allows fully reversible changes and can therefore be used with sequences. It is implemented as DNAPARS and PROTPARS in the PHYLIP package.
Dollo parsimony assumes that a derived state arises only once; reversals to the ancestral state are possible, but their number is minimized.
Camin-Sokal parsimony (the strictest) assumes that all evolutionary changes are irreversible and disallows reversion to the ancestral condition.
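Fitch parsimony with fully reversible, equally weighted changes can be sketched as a count of the minimum number of changes on a fixed tree. This is a minimal illustration, assuming a rooted binary tree written as nested tuples of taxon names; it is not the PHYLIP implementation, and the function names and toy alignment are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # The children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Toy alignment for illustration only.
toy = {"A": "AAG", "B": "AAG", "C": "TAC", "D": "TAC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, toy), parsimony_score(tree_2, toy))
```

A parsimony search would evaluate this score over many candidate topologies and keep the tree, or trees, with the lowest value.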
17. Maximum Likelihood
• First proposed by Cavalli-Sforza and Edwards (1967), later used on molecular sequences by J.
Felsenstein and implemented in the PHYLIP package as the DNAML program.
• Main idea behind phylogeny inference with ML is to determine the tree topology, branch lengths, and
parameters of the evolutionary model that maximize the probability of observing the sequences at
hand.
• This method works best when it’s used to test (or improve) an existing tree
• Generally more accurate
• Nucleotide models are currently more advanced than amino acid models
• Well suited for analysis of distantly related sequences, but it’s computationally expensive
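The idea of choosing parameters that maximize the probability of the observed sequences can be shown on the smallest possible case: two aligned sequences separated by a single branch. The sketch below assumes the Jukes-Cantor model and a plain grid search over branch lengths; it is not how DNAML works internally, and the function names are hypothetical.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by branch length d (Jukes-Cantor)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability that a site keeps its base
    p_diff = 0.25 - 0.25 * decay   # probability of changing to one specific other base
    log_lik = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base in the first sequence.
        log_lik += math.log(0.25 * (p_same if x == y else p_diff))
    return log_lik


def ml_branch_length(seq_a, seq_b, step=0.001, max_d=3.0):
    """Grid search for the branch length with the highest log-likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))


# Toy sequences for illustration only: one difference in eight sites.
print(ml_branch_length("ACGTACGT", "ACGTACGA"))
```

For two sequences the maximum coincides with the analytical Jukes-Cantor distance. On real data the same principle is applied to the whole tree at once, which is what makes the method computationally expensive.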
18. Neighbor-Joining
• Neighbor joining (Saitou and Nei 1987) is a distance-matrix method that belongs to the clustering methods originally developed for numerical taxonomy (Sneath and Sokal 1973)
• NJ is a heuristic approach that is not guaranteed to find the optimal tree, but under normal conditions has a very high probability of doing so
• First a distance matrix is calculated based on pairwise comparison of all sequences with each other.
Then the neighbor joining algorithm constructs the tree.
• It has a very good computational efficiency, making it well suited for large datasets
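The first step of neighbor joining, choosing which pair of taxa to join from the distance matrix, can be sketched as follows. The sketch covers one joining step only; the full algorithm repeats it on a reduced matrix until the tree is resolved. The function name `nj_first_join` and the example distances are illustrative assumptions.

```python
def nj_first_join(matrix):
    """Pick the first pair joined by neighbor joining.

    Returns the pair of indices, its Q value, and the branch lengths
    from each member of the pair to the new internal node.
    """
    n = len(matrix)  # the Q criterion needs at least three taxa
    row_sums = [sum(row) for row in matrix]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: the pairwise distance corrected by each taxon's total divergence.
            q = (n - 2) * matrix[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    i, j = best_pair
    length_i = matrix[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    length_j = matrix[i][j] - length_i
    return best_pair, best_q, (length_i, length_j)


# Illustrative five-taxon distance matrix (not real data).
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(example))
```

In this example the smallest raw distance is between the last two taxa, yet the Q criterion joins the first two. Unlike UPGMA, neighbor joining does not assume equal rates along all lineages.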
19. Tools used to construct phylogenetic tree
Some commonly used programs and program packages are
PHYLIP
fastDNAml
PhyML
MEGA7
FastTree
RAxML
MEGA-X
MacClade