Phylogenetic tree construction using bioinformatics tools Zarlish attique 187104
1. Report on Phylogenetic tree
Introduction to bioinformatics
Page 1
Government Postgraduate College Mandian Abbottabad.
Report on Phylogenetic tree
Subject: Introduction to Bioinformatics
SUBMITTED BY:
Name: Zarlish Attique
Registration no: 187104
BS Bioinformatics Semester 04
SUBMITTED TO:
Name: Sir Muhammad Rizwam
Department: Bioinformatics
Date: June 19, 2020
Contents
1. Phylogenetics:-
1. Description:-
2. Phylogenetic inference methods:-
3. About taxonomy:-
4. Brief History:-
4.2. 1866, Ernst Haeckel
5. Evolution:-
2. Evolution of Bioinformatics tools:-
1.1. Bioinformatics experts
1.2. Development and use of an array of computational and bioinformatics tools
3. Phylogenetic tree
1. Computational phylogenetics:-
2. Traditional phylogenetics and recent phylogenetics:-
3. Molecular data such as DNA sequence for genes and amino acid sequence for proteins:-
4. Evolutionary history and relationship:-
5. Phylogenetic tree is a graphical representation
4. Types of Phylogenetic trees:-
5. Method for constructing Phylogenetic tree:-
List of Methods for constructing trees:-
Character State method
Method for validation of phylogenetic tree
Table 1:- Representing methods
Online software available for phylogenetic analysis
Desktop Software
Libraries:-
Unweighted Pair Group Method with Arithmetic Mean
1.1. Description:-
Tree consisting of 6 OTUs
Another Example of UPGMA
The Neighbor-Joining Method
1. Note:-
Advantages and disadvantages of the neighbor-joining method
Maximum parsimony (MP):
Character based Method
Maximum-likelihood (ML):
Bootstrapping:-
Multiple sequence alignment (MSA)
Description:-
Practical Section:-
ClustalW
Access:-
ClustalW for Phylogenetic tree construction:-
6. ClustalW | Result Interpretation
Applications of Phylogenetic tree construction:-
References
Phylogenetic tree
1. Phylogenetics:-
1. Description:-
In biology, phylogenetics (Greek:– phylé, phylon = tribe, clan, race + genetikós = origin,
source, birth) is a part of systematics that addresses the inference of
the evolutionary history and relationships among or within groups
of organisms (e.g. species, or more inclusive taxa).
Figure 1 represents the derivation of phylogenetics.
2. Phylogenetic inference methods:-
These relationships are hypothesized by phylogenetic inference methods that evaluate
observed heritable traits, such as DNA sequences or morphology, often under a specified
model of evolution of these traits.
3. About taxonomy:-
Taxonomy is the identification, naming and classification of organisms. Classifications
are now usually based on phylogenetic data, and many systematists contend that
only monophyletic taxa should be recognized as named groups.
3.1. School of taxonomy:-
The degree to which classification depends on inferred evolutionary history differs
depending on the school of taxonomy: phenetics ignores phylogenetic speculation
altogether, trying to represent the similarity between organisms
instead; cladistics (phylogenetic systematics) tries to reflect phylogeny in its
classifications by only recognizing groups based on shared, derived characters
(synapomorphies); evolutionary taxonomy tries to take into account both the branching
pattern and "degree of difference" to find a compromise between them.
Figure represents the taxonomy of one example species, Homo sapiens.
4. Brief History:-
The term "phylogeny" derives from the German Phylogenie, introduced by Haeckel in
1866, and the Darwinian approach to classification became known as the "phyletic"
approach.
4.1. 1858 Heinrich Georg Bronn
Paleontologist Heinrich Georg Bronn (1800–1862) published a hypothetical tree
illustrating the paleontological "arrival" of new, similar species following the extinction
of an older species. Bronn did not propose a mechanism responsible for such phenomena;
his tree is regarded as a precursor concept.
Branching tree diagram from Heinrich Georg Bronn's work (1858)
4.2. 1866 Ernst Haeckel
In 1866, Ernst Haeckel first published his phylogeny-based evolutionary tree, a precursor
concept.
Figure represents Phylogenetic tree suggested by Haeckel (1866).
5. Evolution:-
Evolution is the change in heritable traits of biological organisms over generations due
to natural selection, mutation, gene flow, and genetic drift. It is also known as descent with
modification. Over time these evolutionary processes lead to the formation of new species
(speciation), changes within lineages (anagenesis), and loss of species (extinction).
Figure A diagram showing the relationships between various groups of organisms and
concept of evolution.
"Evolution" is also another name for evolutionary biology, the subfield
of biology concerned with studying evolutionary processes that produced the diversity of
life on Earth.
2. Evolution of Bioinformatics tools:-
1.1. Bioinformatics experts
Bioinformatics experts have developed a large collection of tools to make sense of the
rapidly growing data related to molecular biology. Biological systems are complex, and
understanding them often requires combining data sets and using more than one tool. Therefore,
bioinformatics experts have experimented with a number of strategies to integrate data
sets and tools.
Studying a complex biological system usually requires gathering a variety of data from a variety of
sources, so multiple tools are needed. There is therefore a clear need for technology that
combines both data and tools to create a workflow that can be easily used by biologists.
1.2. Development and use of an array of computational and bioinformatics tools
The development and use of an array of computational and bioinformatics tools provides the
ability to analyze large data sets in practical computing times, making it possible to obtain
optimal or near-optimal solutions with high probability. In response to this trend,
much of the current research in phyloinformatics (i.e., computational phylogenetics)
concentrates on the development of more efficient heuristic approaches.
Figure represents the data storage to computer with the evolution of Bioinformatics tools.
Figure The root of the tree of life.
3. Phylogenetic tree
----- General Description ----
1. Computational phylogenetics:-
Computational phylogenetics is the application of computational algorithms, methods,
and programs to phylogenetic analyses. The goal is to assemble a phylogenetic
tree representing a hypothesis about the evolutionary ancestry of a set of genes, species,
or other taxa.
1.1. Example:-
For example, these techniques have been used to explore the family tree of the
α-hemoglobin gene and the relationships between specific genes.
Figure The gene tree for the gene α-hemoglobin compared to the species tree. Both
match because the gene evolved from common ancestors.
2. Traditional phylogenetics and recent phylogenetics:-
Traditional phylogenetics relies on morphological data obtained by measuring and
quantifying the phenotypic properties of representative organisms, while the more recent
field of molecular phylogenetics uses nucleotide sequences of genes or amino
acid sequences of proteins as the basis for classification.
Many forms of molecular phylogenetics are closely related to and make extensive use
of sequence alignment in constructing and refining phylogenetic trees, which are used to
classify the evolutionary relationships between homologous genes represented in
the genomes of divergent species. The phylogenetic trees constructed by computational
methods are unlikely to perfectly reproduce the evolutionary tree that represents the
historical relationships between the species being analyzed. The historical species tree
may also differ from the historical tree of an individual homologous gene shared by those
species.
Figure Tree of life focused on the relation between human and apes.
3. Molecular data such as DNA sequence for genes and amino acid sequence for
proteins:-
Phylogenetic analysis using molecular data, such as DNA sequences of genes and amino
acid sequences of proteins, is very common not only in evolutionary biology
but also across molecular biology. The reason is that DNA sequencing
has become very popular and a huge amount of gene and protein sequence data is
available in public online databases. Since the available molecules (genes or proteins)
have various evolutionary rates, it is important to choose a suitable
molecule for the phylogenetic analysis of a given lineage.
3.1.Example:-
For example, when the evolutionary rate of the gene (or protein) is too high for a
given lineage, the nucleotide (or amino acid) substitutions become saturated. In this case, the
accuracy of the phylogenetic analysis decreases. The methods for phylogenetic analysis
are improving along with advances in computer science. Thus, there are many
methods to infer a phylogenetic tree, and many programs are available for each method.
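The effect of saturation described in this example can be illustrated with a distance correction. The sketch below assumes the Jukes-Cantor (JC69) substitution model, which this report does not name and which is chosen here only for illustration; the function name is hypothetical. As the observed fraction of differing sites p approaches 0.75, the corrected distance grows without bound, so small errors in p produce large errors in the estimate.

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from an observed mismatch fraction p (JC69 assumption)."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the sequences look random with respect to each other (saturated).
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

# The gap between observed and corrected distance widens as p grows.
for p in (0.05, 0.30, 0.60, 0.74):
    print(f"observed p = {p:.2f}  corrected d = {jukes_cantor_distance(p):.3f}")
```

Under this assumption, two sequences that differ at 60% of sites are estimated to have undergone about 1.2 substitutions per site, which is why a slowly evolving molecule is preferred for distant lineages.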
4. Evolutionary history and relationship:-
Phylogenetic analysis is a method to elucidate the evolutionary history and relationships
among a group of organisms. In the past, phylogenetic analysis was based on morphological
comparison among fossils, but the information from fossils was limited. Now,
molecular phylogenetic analysis using molecular data such as DNA or protein sequences has become
popular.
4.1. Reasons for popularity:-
There are several reasons for this, including:
(1) the popularity of DNA sequencing methods
(2) the establishment of methods for phylogenetic tree construction using gene or protein
sequences
(3) the ability to treat the results of a phylogenetic analysis quantitatively
(4) the availability of many programs for constructing phylogenetic trees.
The knowledge gained from phylogenetic analysis contributes to basic biology (e.g. the evolutionary
history of species, the evolution of genes, and the identification of sampled species) as well
as applied biology (e.g. investigating the route of infection of pathogenic
microorganisms). Phylogenetic trees are commonly constructed to work out the
evolutionary relationships among species.
Selection of the molecules (genes or proteins)
DNA sequences of genes, RNA sequences of functional RNAs, or amino acid sequences
of proteins are used for phylogenetic analysis. There are two focal points when choosing
the molecule. First, the gene must be shared by all of the species under study.
Second, the gene must have a suitable evolutionary rate, because proteins have
varied evolutionary rates (Miyata et al., 1980). If the species are distantly related, a
molecule with a low evolutionary rate should be chosen, because nucleotide or
amino acid substitutions reach saturation between distant species
when the evolutionary rate is high. In this case,
housekeeping genes, which have low evolutionary rates, are suitable. Note that the nucleotide
sequence of a gene reaches saturation more easily than the amino acid sequence of the
encoded protein.
5. Phylogenetic tree is a graphical representation
A phylogenetic tree is a graphical representation of the evolutionary relationships among
entities that share a common ancestor. Those entities can be species, genes, genomes, or
any other operational taxonomic unit (OTU).
More specifically, a phylogenetic tree, with its pattern of branching, represents the
descent from a common ancestor into distinct lineages. It is critical to understand that the
branching patterns and branch lengths that make up a phylogenetic tree can rarely be
observed directly, but rather they must be inferred from other information. The principle
underlying phylogenetic inference is quite simple: Analysis of the similarities and
differences among biological entities can be used to infer the evolutionary history of
those entities.
Figure The gene tree for the gene Glycosyl Hydrolase compared to the species tree. The
trees do not match because of the horizontal gene transfer (HGT).
4. Types of Phylogenetic trees:-
The branches of a phylogenetic tree may be represented in two different ways:
1.1. Scaled and Unscaled Trees
Scaled branches
Branch lengths differ according to the number of evolutionary changes or the evolutionary distance.
Unscaled branches
All branches in the tree are the same length.
Figure represents the scaled and unscaled branches trees.
Species and Gene Trees
Species Trees
“Species” Trees recover the genealogy of taxa, individuals of a population, etc.
Internal nodes represent speciation or other taxonomic events.
Species trees should contain sequences from only orthologous genes.
Gene Trees
Gene trees represent the evolutionary history of the genes included in the study.
Gene trees can provide evidence for gene duplication events, as well as speciation events. Sequences
from different homologs can be included in a gene tree; the subsequent analyses should cluster
orthologs, thus demonstrating the evolutionary history of the orthologs.
Rooted versus Unrooted Trees
Rooted phylogenetic tree
In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common
ancestor of those descendants, and the edge lengths in some trees may be interpreted as time estimates. In
rooted trees the ancestral state of organisms or genes is shown at the bottom of the tree, and the tree
branches, or bifurcates, until it reaches the terminal branches, tips or leaves at the top of the tree.
Rooted trees show the most basal ancestor of the tree in question.
There are competing techniques for rooting a tree; one of the most common methods is through the use of
an "outgroup" (The Parsimony Methods).
Unrooted phylogenetic tree
An unrooted phylogenetic tree does not show an ancestral root. An unrooted binary tree is an unrooted tree in which
each vertex has either one or three neighbors.
Unrooted trees represent the branching order but do not indicate the root or location of the last common
ancestor; they show the relatedness of organisms without indicating ancestry.
Figure represents the unrooted tree with unscaled branches.
2. Terms used to describe rooted and unrooted tree:-
1.1.Clade
An ancestor (an organism, population, or species) and all of its descendants.
1.1.1. Sister clade
One member of a pair of clades originating when a single lineage splits into two. Sister
clades thus share an exclusive common ancestry and are mutually most closely related to
one another in terms of common ancestry.
1.2.Ancestor
An entity from which another entity is descended
1.3.Node
A point or vertex on a tree (in the sense of graph theory). On a phylogenetic tree, a node
is commonly used to represent (1) the split of one lineage to form two or more lineages
(internal node) or the extinction of a lineage (terminal node) or the lineage at a specified
time, often the present (terminal node), or (2) a taxon, whether ancestral (internal node)
or descendant (internal node or terminal node).
1.4.Root
The root of the tree represents the ancestral lineage, and the tips of the branches
represent the descendants of that ancestor
1.5.Leaf
Each leaf on a phylogenetic tree represents a taxon.
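The terms above can be made concrete with a small data structure. The following is a minimal sketch, assuming a rooted tree stored as nested Python tuples (an illustrative representation, not one used by any tool named in this report); the taxon names and function names are hypothetical.

```python
def leaves(node):
    """Return the names of all leaves (terminal nodes) below a node."""
    if isinstance(node, str):          # a leaf is stored as a taxon name
        return [node]
    found = []
    for child in node:                 # an internal node is a tuple of child nodes
        found.extend(leaves(child))
    return found

def clades(node):
    """Return each internal node's clade (the node plus all its descendants) as a sorted leaf list."""
    if isinstance(node, str):
        return []
    found = [sorted(leaves(node))]
    for child in node:
        found.extend(clades(child))
    return found

# Hypothetical rooted tree: the outermost tuple is the root.
# ("human", "chimp") and "gorilla" are sister clades because they split from one lineage.
tree = (("human", "chimp"), "gorilla")

print(leaves(tree))   # every taxon at the tips
print(clades(tree))   # the root clade and the human-chimp clade
```

Each tuple plays the role of an internal node (an inferred ancestor), and the outermost tuple is the root.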
Figure represents terms used to describe rooted and unrooted tree.
5. Method for constructing Phylogenetic tree:-
Summary:-
(1) The first step in doing phylogenetics is to choose the sequences from which the
tree should be constructed. Very popular sequences for constructing phylogenetic trees
are those of rRNA (the RNA the ribosome is built of) and mitochondrial
genes. This genetic material is present in almost all organisms and carries enough
mutations to construct a tree reliably.
(2) The second step is to construct pairwise and
multiple sequence alignments from these sequences.
(3) The third step is to choose a
method for constructing a phylogenetic tree. There are three categories: distance-based,
maximum parsimony, and maximum likelihood.
Maximum parsimony should be
chosen for strongly similar sequences, because too much variation results in many
possible trees. For the same reason only a few sequences (fewer than 15) should be used.
Distance-based methods (e.g. ClustalW) require less similarity among the sequences
than maximum parsimony methods, but some similarity should be present: some
sequences should be similar to one another and others less similar. Distance-based
methods can be applied to a set of many sequences. Maximum likelihood methods
may be used for very variable sequences, but the computational cost increases with the
number of sequences, as every possible tree must be considered.
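To see why considering every possible tree becomes costly, the number of distinct binary tree topologies can be counted. The sketch below uses the standard double-factorial formulas, (2n - 5)!! for unrooted and (2n - 3)!! for rooted trees on n labelled taxa; these formulas are not stated in this report, and the function names are hypothetical.

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def count_unrooted_trees(n):
    """Number of unrooted binary tree topologies for n >= 3 labelled taxa."""
    return double_factorial(2 * n - 5)

def count_rooted_trees(n):
    """Number of rooted binary tree topologies for n >= 2 labelled taxa."""
    return double_factorial(2 * n - 3)

for n in (4, 5, 10, 15):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

With 15 sequences there are already 7,905,853,580,625 unrooted topologies, which is consistent with the advice to keep the number of sequences small for exhaustive searches.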
Figure represents the method for constructing phylogenetic tree.
This method is used in the practical section.
List of Methods for constructing trees:-
Distance matrix method
1.UPGMA
2.Transformed distance method
3.Neighbor’s Relation method
4.Neighbor joining method
5.Fitch and Margoliash method
Character State method
1.Maximum likelihood approach
Method for validation of phylogenetic tree
1.Bootstrapping
2.Felsenstein’s bootstrap test
Figure represents the methods for constructing phylogenetic tree.
Table 1:- Representing methods
Method | Advantage | Disadvantage | Other information
Maximum parsimony | Appropriate for very similar sequences and a small number of sequences | Very time-consuming as it tests all possible trees; may fail for diverged sequences; suffers from long-branch attraction | Predicts the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences; the tree is built with the fewest changes required to explain the differences observed in the data
Maximum likelihood | Suitable for very dissimilar sequences; hypotheses about evolutionary relationships can be formulated; more accurate phylogenetic trees can be constructed for a small number of taxa in a reasonable time frame | A slow search algorithm leads to slow response; takes a long time for large datasets | Tries to find the tree that has the highest probability of generating the input sequences under a given evolutionary model
Neighbour joining | Faster than the character-based methods; can be used with a variety of models | Conversion from sequence data to distance data leads to loss of information | Provides an unrooted tree and a single resultant tree
UPGMA | Reliable for related sequences | Assumes that the evolution rate is constant in all branches | Provides a rooted tree
Fitch-Margoliash | Less sensitive to variations in evolutionary rate | Dependent on the model used to obtain the distance matrix |
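The "fewest changes" criterion listed for maximum parsimony in Table 1 can be illustrated by scoring a single alignment column on two candidate trees. The sketch below uses Fitch's algorithm, one standard way to count the minimum number of changes on a fixed rooted binary tree; this report does not say which algorithm a given program uses, and the tree encoding (nested tuples), taxon names, and function name are assumptions for illustration.

```python
def fitch(node, states):
    """Return (candidate ancestral states, minimum number of changes) for one site."""
    if isinstance(node, str):                    # leaf: its observed state, no changes yet
        return {states[node]}, 0
    left, right = node                           # internal node of a binary tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                   # children agree: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1   # disagreement costs one change

# Hypothetical site: taxa A and B carry G, taxa C and D carry T.
states = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))

print(fitch(tree1, states)[1])   # changes required on tree1
print(fitch(tree2, states)[1])   # changes required on tree2
```

For this site tree1 needs one change and tree2 needs two, so parsimony prefers tree1. A full analysis would sum the scores over all sites and compare many trees.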
Online software available for phylogenetic analysis
This list of phylogenetic tree viewing software is a compilation of software tools and web
portals used in visualising phylogenetic trees.
Software:-
Name | Description
Aquapony | Javascript tree viewer for Beast
ETE toolkit Tree Viewer | an online tool for phylogenetic tree view (newick format) that allows multiple sequence alignments to be shown together with the trees (fasta format)
EvolView | an online tool for visualizing, annotating and managing phylogenetic trees
IcyTree | Client-side Javascript SVG viewer for annotated rooted trees. Also supports phylogenetic networks
Iroki | Automatic customization and visualization of phylogenetic trees
iTOL - interactive Tree Of Life | annotate trees with various types of data and export to various graphical formats; scriptable through a batch interface
Microreact | Link, visualise and explore sequence and meta-data using phylogenetic trees, maps and timelines
OneZoom | uses IFIG (Interactive Fractal Inspired Graphs) to display phylogenetic trees which can be zoomed in on to increase detail
Phylo.io | View and compare up to 2 trees side by side with interactive HTML5 visualisations
PhyloExplorer | a tool to facilitate assessment and management of phylogenetic tree collections. Given an input collection of rooted trees, PhyloExplorer provides facilities for obtaining statistics describing the collection, correcting invalid taxon names, extracting taxonomically relevant parts of the collection using a dedicated query language, and identifying related trees in the TreeBASE database.
PHYLOViZ Online | Web-based tool for visualization, phylogenetic inference, analysis and sharing of minimum spanning trees
PhyloWidget | view, edit, and publish phylogenetic trees online; interfaces with databases
T-REX (Webserver) | Tree inference and visualization (hierarchical, radial and axial tree views), Horizontal gene transfer detection and HGT network visualization
TidyTree | A client-side HTML5/SVG Phylogenetic Tree Renderer, based on D3.js
TreeVector | scalable, interactive, phylogenetic trees for the web, produces dynamic SVG or PNG output, implemented in Java
Desktop Software
Name | Description | OS
ARB | An integrated software environment for tree visualisation and annotation | LM
Archaeopteryx | Java tree viewer and editor (used to be ATV) |
BioNumerics | Universal platform for the management, storage and analysis of all types of biological data, including tree and network inference of sequence data | W
Bio::Phylo | A collection of Perl modules for manipulating and visualizing phylogenetic data. Bio::Phylo is one part of a comprehensive suite of Perl biology tools | All
Dendroscope | An interactive viewer for large phylogenetic trees and networks | All
DensiTree | A viewer capable of viewing multiple overlaid trees. | All
JEvTrace | A multivalent browser for sequence alignment, phylogeny, and structure. Performs an interactive Evolutionary Trace and other phylogeny-inspired analysis. | All
MEGA | Software for statistical analysis of molecular evolution. It includes different tree visualization features | All
MultiDendrograms | Interactive open-source application to calculate and plot phylogenetic trees | All
PHYLOViZ | Phylogenetic inference and data visualization for allelic/SNP sequences profiles using Minimum Spanning Trees | All
TreeDyn | Open-source software for tree manipulation and annotation allowing incorporation of meta information | All
Treevolution | Open-source tool for circular visualization with section and ring distortion and several other features such as branch clustering and pruning | All
TreeGraph 2 | Open-source tree editor with numerous editing and formatting operations including combining different phylogenetic analyses | All
TreeView | Treeviewing software | All
UGENE | An opensource visual interface for Phylip 3.6 package | All
OS: "All" refers to Microsoft Windows, Apple OSX and Linux; L = Linux, M = Apple Mac,
W = Microsoft Windows
Libraries:-
Name | Language | Description
ggtree | R | An R package for tree visualization and annotation with grammar of graphics supported
jsPhyloSVG | Javascript | open-source javascript library for rendering highly-extensible, customizable phylogenetic trees; used for Elsevier's interactive trees
PhyD3 | Javascript | interactive phylogenetic tree visualization with numerical annotation graphs, with SVG or PNG output, implemented in D3.js
phylotree.js | Javascript | phylotree.js is a library that extends the popular data visualization framework D3.js, and is suitable for building JavaScript applications where users can view and interact with phylogenetic trees
Phytools | R | Phylogenetic Tools for Comparative Biology (and Other Things) based in R
toytree | Python | Toytree: A minimalist tree visualization and manipulation library for Python
Methods for Phylogenetic tree:-
Distance matrix its advantages and disadvantages
Distance-matrix methods of phylogenetic analysis explicitly rely on a measure of "genetic
distance" between the sequences being classified, and therefore they require an MSA (multiple
sequence alignment) as an input. Distance is often defined as the fraction of mismatches at
aligned positions, with gaps either ignored or counted as mismatches. Distance methods attempt
to construct an all-to-all matrix from the sequence query set describing the distance between each
sequence pair. From this is constructed a phylogenetic tree that places closely related sequences
under the same interior node and whose branch lengths closely reproduce the observed distances
between sequences. Distance-matrix methods may produce either rooted or unrooted trees,
depending on the algorithm used to calculate them. They are frequently used as the basis for
progressive and iterative types of multiple sequence alignment. The main disadvantage of
distance-matrix methods is their inability to efficiently use information about local high-variation
regions that appear across multiple subtrees.
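The distance definition above (fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches) can be sketched in a few lines of Python. This is an illustrative example only: the function names, the `count_gaps` switch, and the toy alignment are assumptions, not the procedure of any specific program.

```python
def p_distance(seq_a, seq_b, count_gaps=False):
    """Fraction of mismatches between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = mismatches = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" and y == "-":
            continue                  # both gapped: nothing to compare for this pair
        if "-" in (x, y) and not count_gaps:
            continue                  # gaps ignored
        compared += 1
        if x != y:
            mismatches += 1           # includes gap-vs-residue when count_gaps is True
    return mismatches / compared if compared else 0.0

def distance_matrix(alignment, count_gaps=False):
    """All-to-all distances, stored once per unordered pair of sequence names."""
    names = list(alignment)
    return {(a, b): p_distance(alignment[a], alignment[b], count_gaps)
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical toy alignment.
alignment = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "AC-TTCGA"}
matrix = distance_matrix(alignment)
for pair, d in matrix.items():
    print(pair, round(d, 3))
```

A matrix of this kind is the input that clustering methods such as UPGMA or neighbor joining work from.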
Unweighted Pair Group Method with Arithmetic Mean
1.1.Description:-
UPGMA: Unweighted Pair Group Method with Arithmetic Mean: A simple
clustering method that assumes a constant rate of evolution (molecular clock
hypothesis). It needs a distance matrix of the analysed taxa that can be calculated
from a multiple alignment.
UPGMA stands for :
Unweighted Pair-Group Method with Arithmetic mean
Unweighted – all pairwise distances contribute equally.
Pair-Group – groups are combined in pairs (dichotomies only).
Arithmetic mean – pairwise distances to each group (clade) are mean distances to
all members of that group.
1.2.Construction of a distance tree using clustering with the Unweighted Pair Group
Method with Arithmatic Mean (UPGMA).
The UPGMA is the simplest method of tree construction. It was originally developed
for constructing taxonomic phenograms, i.e. trees that reflect the phenotypic
similarities between operational taxonomic units (OTUs), but it can also be used to
construct phylogenetic trees if the rates of evolution are approximately constant
among the different lineages.
For this purpose the number of observed nucleotide or amino-acid substitutions can
be used. UPGMA employs a sequential clustering algorithm, in which local
topological relationships are identified in order of similarity, and the phylogenetic
tree is built in a stepwise manner.
We first identify from among all the OTUs the two OTUs that are most similar to
each other and then treat these as a new single OTU. Such an OTU is referred to as a
composite OTU. Subsequently, from among the new group of OTUs we identify the
pair with the highest similarity, and so on, until we are left with only two
OTUs. Suppose we have the following tree consisting of 6 OTUs:
Figure represents a tree consisting of 6 OTUs.
The pairwise evolutionary distances are given by the following distance
matrix:
     A    B    C    D    E
B    2
C    4    4
D    6    6    6
E    6    6    6    4
F    8    8    8    8    8
1.1.1. Step 1:-
We now cluster the pair of OTUs with the smallest distance, A and B,
which are separated by a distance of 2. The branching point is positioned at a
distance of 2 / 2 = 1 substitution. We thus construct a subtree as follows:
Following the first clustering, A and B are considered as a single composite
OTU (A,B), and we now calculate the new distance matrix as follows:
dist((A,B),C) = (dist(A,C) + dist(B,C)) / 2 = 4
dist((A,B),D) = (dist(A,D) + dist(B,D)) / 2 = 6
dist((A,B),E) = (dist(A,E) + dist(B,E)) / 2 = 6
dist((A,B),F) = (dist(A,F) + dist(B,F)) / 2 = 8
In other words, the distance between a simple OTU and a composite OTU is
the average of the distances between the simple OTU and the constituent
simple OTUs of the composite OTU. A new distance matrix is then
recalculated using the newly calculated distances, and the whole cycle is
repeated:
1.1.2. Step 2:-
       A,B    C      D      E
C      4
D      6      6
E      6      6      4
F      8      8      8      8
1.1.3. Step 3:-
       A,B    C      D,E
C      4
D,E    6      6
F      8      8      8
1.1.4. Step 4:-
       AB,C   D,E
D,E    6
F      8      8
1.1.5. Step 5:-
The final step consists of clustering the last OTU, F, with the composite OTU.
       ABC,DE
F      8
Although this method leads essentially to an unrooted tree, UPGMA assumes
equal rates of mutation along all the branches, as the model of evolution used.
The theoretical root, therefore, must be equidistant from all OTUs. We can
therefore apply the method of mid-point rooting. The root of the entire tree is
then positioned at dist((ABCDE),F) / 2 = 4.
1.1.6. Final tree:-
The final tree as inferred by using the UPGMA method is shown below.
So now we have reconstructed the phylogenetic tree using the UPGMA method. As you can see
we have obtained the original phylogenetic tree we started with.
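The clustering steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the code of any particular package: the function name `upgma`, the `frozenset`-keyed distance dictionary and the parenthesised cluster labels are assumptions made for the example. Distances to a composite OTU are averaged over all of its members, which is done here by weighting with cluster sizes. The two ties at distance 4 may be resolved in a different order than in the steps above; for this matrix the resulting tree is the same.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the nested cluster label and a list of (left, right, node_height).
    """
    clusters = {name: 1 for name in names}  # label -> number of OTUs inside
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Closest pair of current clusters (first pair found on ties).
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2  # branching point at half the distance
        new = f"({a},{b})"
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # Mean distance over all members of the merged cluster.
            d[frozenset((new, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[new] = size_a + size_b
        merges.append((a, b, height))
    return next(iter(clusters)), merges


# Distance matrix of the worked example (lower triangle, rows B..F).
names = ["A", "B", "C", "D", "E", "F"]
lower_triangle = [
    [2],
    [4, 4],
    [6, 6, 6],
    [6, 6, 6, 4],
    [8, 8, 8, 8, 8],
]
dist = {
    frozenset((names[i + 1], names[j])): value
    for i, row in enumerate(lower_triangle)
    for j, value in enumerate(row)
}

tree, merges = upgma(names, dist)
print(tree)
for left, right, height in merges:
    print(left, right, height)
```

The last merge joins F to the remaining composite OTU at a node height of 4, which agrees with the root position obtained above.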
In bioinformatics, UPGMA is used for the creation of phenetic trees (phenograms). UPGMA was
initially designed for use in protein electrophoresis studies, but is currently most often used to
produce guide trees for more sophisticated algorithms. This algorithm is for example used
in sequence alignment procedures, as it proposes one order in which the sequences will be
aligned. Indeed, the guide tree aims at grouping the most similar sequences, regardless of their
evolutionary rate or phylogenetic affinities, and that is exactly the goal of UPGMA.
Another Example of UPGMA
The Neighbor-Joining Method
1.1. Description:-
Neighbour-joining (NJ): Bottom-up clustering method that also needs a distance matrix.
NJ is a heuristic approach that is not guaranteed to find the optimal tree, but under
normal conditions it has a very high probability of doing so. It is computationally very
efficient, making it well suited for large datasets.
1.2. The Neighbor-Joining Method
Neighbor-joining (Saitou and Nei, 1987) is a method that is related to the cluster method
but does not require the data to be ultrametric. In other words it does not require that all
lineages have diverged by equal amounts. The method is especially suited for datasets
comprising lineages with largely varying rates of evolution. It can be used in combination
with methods that allow correction for superimposed substitutions.
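The text does not say which correction for superimposed substitutions is meant. One widely used choice is the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), which converts an observed proportion of differing sites p into an estimated number of substitutions per site. The sketch below assumes that model; the function name is illustrative.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed mismatch fraction p.

    Assumes equal base frequencies and equal substitution rates; the
    correction is undefined once p reaches 0.75 (saturation).
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The corrected distance grows faster than the observed one,
# because multiple hits at the same site hide earlier changes.
for p in (0.05, 0.1, 0.3, 0.5):
    print(p, round(jukes_cantor(p), 4))
```

Corrected distances of this kind can be placed in the distance matrix before neighbor-joining is run, in place of the raw mismatch fractions.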
1.3. History
Created by Naruya Saitou and Masatoshi Nei in 1987. Usually used for trees based
on DNA or protein sequence data, the algorithm requires knowledge of the distance
between each pair of taxa (e.g., species or sequences) to form the tree.
1.4. Programs
The following programs are available:
Neighbor of the PHYLIP package (Joe Felsenstein, Univ. Washington),
ClustalW (D. Higgins, EMBL),
Distnj in the Protml package (Adachi and Hasegawa, Univ. Tokyo)
1.5. Star decomposition method
The neighbor-joining method is a special case of the star decomposition method. In
contrast to cluster analysis, neighbor-joining keeps track of nodes on a tree rather than
taxa or clusters of taxa. The raw data are provided as a distance matrix and the initial tree
is a star tree. Then a modified distance matrix is constructed in which the separation
between each pair of nodes is adjusted on the basis of their average divergence from all
other nodes. The tree is constructed by linking the least-distant pair of nodes in this
modified matrix. When two nodes are linked, their common ancestral node is added to
the tree and the terminal nodes with their respective branches are removed from the tree.
This pruning process converts the newly added common ancestor into a terminal node on
a tree of reduced size. At each stage in the process two terminal nodes are replaced by
one new node. The process is complete when two nodes remain, separated by a single
branch.
1.6. Note:-
NB: The method's suitability for handling large datasets in particular has led to its
wide use by molecular evolutionists. With the rapid growth of
sequence databases it is still one of the few methods that allows the rapid inclusion
of all homologous sequences present in the database in a single tree. A good
example can be found in the Ribosomal Database Project, which maintains a tree of life
based on all available ribosomal RNA sequences.
Example of the method
Suppose we have the following tree:
Since B and D have accumulated mutations at a higher rate than A, the three-point
criterion is violated and the UPGMA method cannot be used, since this would group
together A and C rather than A and B. In such a case the neighbor-joining method is
one of the recommended methods.
The raw data of the tree are represented by the following distance matrix:
     A    B    C    D    E
B    5
C    4    7
D    7    10   7
E    6    9    6    5
F    8    11   8    9    8
We have in total 6 OTUs (N=6).
Step 1:-
We calculate the net divergence r(i) of each OTU from all other OTUs:
r(A) = 5+4+7+6+8=30
r(B) = 42
r(C) = 32
r(D) = 38
r(E) = 34
r(F) = 44
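The net divergences listed above are simply row sums of the full distance matrix. A short sketch that reproduces them follows; the data layout (a `frozenset`-keyed dictionary built from the lower triangle) and the function name are assumptions made for the example.

```python
# Distance matrix of the neighbor-joining example (lower triangle, rows B..F).
names = ["A", "B", "C", "D", "E", "F"]
lower_triangle = [
    [5],
    [4, 7],
    [7, 10, 7],
    [6, 9, 6, 5],
    [8, 11, 8, 9, 8],
]
d = {
    frozenset((names[i + 1], names[j])): value
    for i, row in enumerate(lower_triangle)
    for j, value in enumerate(row)
}


def net_divergence(names, d):
    """r(i): sum of the distances from OTU i to every other OTU."""
    return {i: sum(d[frozenset((i, j))] for j in names if j != i) for i in names}


r = net_divergence(names, d)
print(r)
```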
Step 2:-
Now we calculate a new distance matrix using for each pair of OTUs the formula:
M(ij) = d(ij) - [r(i) + r(j)] / (N-2), or
in the case of the pair A,B: M(AB) = d(AB) - [r(A) + r(B)] / (N-2) = 5 - (30 + 42) / 4 = -13
     A      B      C      D      E
B    -13
C    -11.5  -11.5
D    -10    -10    -10.5
E    -10    -10    -10.5  -13
F    -10.5  -10.5  -11    -11.5  -11.5
Now we start with a star tree:
        A
    F   |   B
     \  |  /
      \ | /
       \|/
       /|\
      / | \
     /  |  \
    E   |   C
        D
Step 3:-
Now we choose as neighbors those two OTUs for which M(ij) is the smallest. These are A
and B, and D and E (both pairs have M = -13). Let's take A and B as neighbors and form a
new node called U.
Now we calculate the branch lengths from the internal node U to the external OTUs A and
B:
S(AU) = d(AB) / 2 + [r(A) - r(B)] / [2(N-2)] = 1
S(BU) = d(AB) - S(AU) = 4
Step 4:-
Now we define new distances from U to each other terminal node:
d(CU) = [d(AC) + d(BC) - d(AB)] / 2 = 3
d(DU) = [d(AD) + d(BD) - d(AB)] / 2 = 6
d(EU) = [d(AE) + d(BE) - d(AB)] / 2 = 5
d(FU) = [d(AF) + d(BF) - d(AB)] / 2 = 7
and we create a new matrix:
     U    C    D    E
C    3
D    6    7
E    5    6    5
F    7    8    9    8
The resulting tree will be the following:
      C
  D   |          A
   \  |         / 1
    \ |        /
     \|_______U
     /|        \
    / |         \ 4
  E   |          B
      F
N = N - 1 = 5
The entire procedure is repeated starting at step 1
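Steps 1 to 4 of the example can be sketched as a single function that performs one neighbor-joining iteration. This is an illustrative sketch under stated assumptions: the function name `nj_step`, the `label` parameter for the new node and the `frozenset`-keyed dictionaries are choices made for the example, and ties in M are resolved by taking the first pair found.

```python
from itertools import combinations


def nj_step(names, d, label="U"):
    """One neighbor-joining iteration on distances keyed by frozenset({i, j})."""
    n = len(names)
    # Step 1: net divergence of every OTU.
    r = {i: sum(d[frozenset((i, j))] for j in names if j != i) for i in names}
    # Step 2: rate-corrected matrix M(ij) = d(ij) - [r(i) + r(j)] / (N - 2).
    m = {
        frozenset((i, j)): d[frozenset((i, j))] - (r[i] + r[j]) / (n - 2)
        for i, j in combinations(names, 2)
    }
    # Step 3: join the pair with the smallest M and compute its branch lengths.
    i, j = min(combinations(names, 2), key=lambda pair: m[frozenset(pair)])
    d_ij = d[frozenset((i, j))]
    s_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
    s_j = d_ij - s_i
    # Step 4: distances from the new node to every remaining OTU.
    rest = [k for k in names if k not in (i, j)]
    new_d = {frozenset(p): d[frozenset(p)] for p in combinations(rest, 2)}
    for k in rest:
        new_d[frozenset((k, label))] = (
            d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
        ) / 2
    return (i, j), (s_i, s_j), m, rest + [label], new_d


# Distance matrix of the example (lower triangle, rows B..F).
names = ["A", "B", "C", "D", "E", "F"]
lower_triangle = [[5], [4, 7], [7, 10, 7], [6, 9, 6, 5], [8, 11, 8, 9, 8]]
d = {
    frozenset((names[i + 1], names[j])): value
    for i, row in enumerate(lower_triangle)
    for j, value in enumerate(row)
}

pair, branch_lengths, m, new_names, new_d = nj_step(names, d)
print(pair, branch_lengths)
print({k: new_d[frozenset((k, "U"))] for k in "CDEF"})
```

Calling `nj_step` again on `new_names` and `new_d` (with a different `label`) repeats the procedure on the reduced matrix, as described above.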
Advantages and disadvantages of the neighbor-joining method
Advantages
o is fast and thus suited for large datasets and for bootstrap analysis
o permits lineages with largely different branch lengths
o permits correction for multiple substitutions
Disadvantages
o sequence information is reduced
o gives only one possible tree
o strongly dependent on the model of evolution used
Maximum parsimony (MP):
This method tries to create a phylogeny that requires the least evolutionary change. It may suffer
from long branch attraction, a problem that leads to incorrect trees in rapidly evolving lineages
(Felsenstein, 1978).
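To make "least evolutionary change" concrete, the sketch below counts the minimum number of state changes that a given tree requires at a single alignment site. It uses Fitch's small-parsimony algorithm, which the text does not name; that choice, the nested-tuple tree representation and the toy site are assumptions made for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes).

    tree:   a leaf name, or a 2-tuple of subtrees (rooted binary tree).
    states: dict mapping each leaf name to its character at one site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


# One site observed in four hypothetical taxa.
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(fitch(tree_1, site)[1])  # 1 change
print(fitch(tree_2, site)[1])  # 2 changes
```

For this site, maximum parsimony would prefer the first tree, since it explains the data with fewer changes. A full analysis sums such counts over all sites and searches over many candidate trees.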
Character based Method
Maximum-likelihood (ML):
ML uses a statistical approach to infer a phylogenetic tree. ML is well suited for the analysis of
distantly related sequences, but is computationally expensive and thus not that well suited for
larger input data.
Bootstrapping:-
Bootstrapping is any test or metric that uses random sampling with replacement, and falls under
the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias,
variance, confidence intervals, prediction error, etc.) to sample estimates. This technique allows
estimation of the sampling distribution of almost any statistic using random sampling methods.
Bootstrapping and jackknifing are statistical methods to evaluate and distinguish the confidence
of partial hypotheses (“branch support”) that are contained in a phylogenetic tree and have
become a standard in molecular phylogenetic analyses.
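In phylogenetics, the resampling is usually applied to the columns of the alignment. A minimal sketch of that step is shown below; the function names, the fixed seed and the toy alignment are assumptions for illustration. Each replicate would then be passed to a tree-building method, and the support of a branch is the percentage of replicate trees that contain it.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate n_replicates pseudo-alignments from one alignment."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment (made-up sequences, for illustration only).
toy_alignment = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "ACGAACTA",
}
replicates = bootstrap_replicates(toy_alignment, n_replicates=100, seed=1)
print(len(replicates), replicates[0])
```

Because whole columns are sampled, every sequence in a replicate is resampled at the same positions, so the pairing of characters across sequences is preserved.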
Multiple sequence alignment (MSA)
Description:-
A multiple sequence alignment (MSA) is a sequence alignment of three or more biological
sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are
assumed to have an evolutionary relationship by which they share a linkage and are descended
from a common ancestor.
Analysis And uses:-
From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be
conducted to assess the sequences' shared evolutionary origins. Visual depictions of the
alignment illustrate mutation events such as point mutations
(single amino acid or nucleotide changes) that appear as differing characters in a single
alignment column, and insertion or deletion mutations (indels or gaps) that appear as hyphens in
one or more of the sequences in the alignment.
Multiple sequence alignment is often used to assess sequence conservation of protein
domains, tertiary and secondary structures, and even individual amino acids or nucleotides.
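As a rough sketch of how conservation can be assessed from an MSA, the code below scores each alignment column by the share of sequences that carry its most common residue. The scoring rule, the function name and the toy alignment are assumptions for illustration; alignment programs use their own, more elaborate conservation measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column share of sequences carrying the most common residue.

    alignment: list of equal-length aligned sequences; '-' marks a gap.
    Gaps never count as the conserved residue, so they lower the score.
    """
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Toy alignment (made-up sequences, for illustration only).
toy_msa = ["ACGT-A", "ACGTTA", "ACCTTA"]
scores = column_conservation(toy_msa)
fully_conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(scores)
print(fully_conserved)  # columns 0, 1, 3 and 5 are identical in all sequences
```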
Sequence set:-
Multiple sequence alignment also refers to the process of aligning such a sequence set. Because
three or more sequences of biologically relevant length can be difficult and are almost always
time-consuming to align by hand, computational algorithms are used to produce and analyze the
alignments. MSAs require more sophisticated methodologies than pairwise alignment because
they are more computationally complex. Most multiple sequence alignment programs
use heuristic methods rather than global optimization because identifying the optimal alignment
between more than a few sequences of moderate length is prohibitively computationally
expensive.
Figure represents first 90 positions of a protein multiple sequence alignment of instances of the
acidic ribosomal protein P0 (L10E) from several organisms. Generated with ClustalX.
Practical Section:-
ClustalW: Clustal is a series of widely used computer programs used
in Bioinformatics for multiple sequence alignment. The third generation, released in 1994,
greatly improved upon the previous versions. It improved upon the progressive alignment
algorithm in various ways, including allowing individual sequences to be weighted down or up
according to similarity or divergence respectively in a partial alignment. It also included the
ability to run the program in batch mode from the command line.
Access:-
ClustalW can be accessed from both NCBI (National Center for Biotechnology Information) and
EMBL (European Molecular Biology Laboratory).
Figure represents that ClustalW can be accessed from both NCBI and EMBL.
Website link:- URL of the ClustalW homepage
https://www.genome.jp/tools-bin/clustalw
ClustalW for Phylogenetic tree construction:-
1.Access ClustalW
Open ClustalW through the website. When the page opens, two different distributions are
shown, as in the following two figures.
Figure represents the homepage Distribution 1 of ClustalW
Figure represents the homepage Distribution 2 of ClustalW
2. Important information of Homepage:-
The 2nd part of the distribution should be left at its default settings or changed according to need.
3. Retrieval of sequence:-
In the third step, retrieve the sequence of the pqqC gene in FASTA format from the Nucleotide
database. It is needed for the multiple sequence alignment used to construct the phylogenetic tree.
Figure annotations:
o Select the form in which you need the output.
o Choose according to need, but slow and accurate is recommended.
o State whether the sequence of interest is DNA or protein.
o Choose the file or paste the sequences to execute.
o Click directly on Execute.
o Search for the pqqC gene.
Figure represents the NCBI homepage used for sequence retrieval; here the pqqC gene is used.
Pyrroloquinoline Quinone Biosynthesis Gene pqqC.
Figure represents the pqqC gene in FASTA format.
4.Use of BLASTn
Run BLASTn on the retrieved sequence; this takes us to its output page.
Figure represents the results of BLASTn and the selection of sequences according to need, but should be 5-3.
In this section select 18 sequences and download in FASTA format.
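Before the downloaded file is submitted for alignment, it can be useful to check that it contains the expected number of records. The sketch below parses FASTA text in plain Python; the function name and the two example records are hypothetical and are not the pqqC sequences used in this report.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record ID to sequence.

    The ID is the first word after '>' on each header line; sequence
    lines belonging to one record are joined together.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record file content, for illustration only.
example = """>seq_1 hypothetical record
ATGACCGA
TTGA
>seq_2 hypothetical record
ATGACGGA
"""
records = parse_fasta(example)
print(len(records), records)
```

For a real download, the text would be read from the saved file, and `len(records)` should equal the number of sequences selected.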
Figure represents all the sequences present in notepad.
5.Multiple Sequence Alignment from ClustalW
All the selected sequences are now placed in this software, which is executed to provide the MSA.
Choose the sequence file from the computer.
6. ClustalW | Result Interpretation
Alignment results.
1. Sequence number
2. Accession ID
1. Sequence alignment number
2. Next to it is the score
1. After alignment, it forms groups according to the similarity of the sequences
2. Next to it is the score
* (asterisk) represents homology or similarity, i.e. a conserved position.
( ) blank represents a gap or mismatch.
---- represents the stretch of sequence.
Accession ID
Finally, we have the Clustal dendrograms.
7. Clustal dendrograms/Tree Construction:-
8. Bootstrapping:-
Here the bootstrap value ranges from 500 up to 1000. A value of 1000 means that the tool
repeats the analysis 1000 times on resampled data before providing the results.
Figure represents the trees that we can construct through ClustalW.
Here we have six tree options:
1. FastTree
2. FastTree full
3. PhyML
4. PhyML bootstrap
5. RAxML
6. RAxML bootstrap
Choose PhyML bootstrap.
Figure represents waiting.
https://www.genome.jp/tools-bin/ete?id=20061915421388fff0d53d9d4b01b1b05494af7d14ba283099c8
Figure represents the tree that we get. It shows the relation of the sequences to their ancestors.
From the 18 sequences taken, a tree is constructed on the basis of their resemblance. It shows
how the sequences relate to one another and to their ancestors, i.e. their phylogenetic history.
Subtool used in CLUSTALW
Method we use
Figure represents the output page of CLUSTALW.
Figure represents the phylogram (phylogenetic tree) that we get from the sequences with
accession numbers. These are 17 sequences that we aligned in CLUSTALW.
3. How to read this tree, apply filters and save it in PNG form on the computer.
These are the accession numbers.
Bootstrapping re-verifies the tree, and thus the results of the alignment as well.
The base of the tree (root) represents the ancestor from which the branches start.
A branch starts at a node and ends at a leaf or clade.
These are the bootstrap values.
The circle at the end of a branch is called a leaf.
Branch / branch length
Base of tree / root
Represents a clade, with the bootstrap value as a percentage next to the accession number.
PNG form of phylogenetic tree.
https://www.genome.jp/tools-bin/ete?id=20061915421388fff0d53d9d4b01b1b05494af7d14ba283099c8
Applications of Phylogenetic tree construction:-
1. The inference of phylogenies with computational methods has many important
applications in medical and biological research, such as drug discovery and conservation
biology.
Figure represents important applications in medical and biological research.
2. A result published by Korber et al. that times the evolution of the HIV-1 virus,
demonstrates that ML techniques can be effective in solving biological problems.
Figure represents phylogenetics tree in evolution.
3. Phylogenetic trees have already witnessed applications in numerous practical domains,
such as in conservation biology (illegal whale hunting), epidemiology (predictive
evolution), forensics (dental practice HIV transmission), gene function prediction and
drug development.
Figure represents gene function prediction.
4. Other applications of phylogenies include multiple sequence alignment, protein structure
prediction, gene and protein function prediction, and drug design.
Figure represents phylogenies include multiple sequence alignment.
5. A paper by Bader et al. addresses important industrial applications of phylogenetic trees,
e.g. in the area of commercial drug discovery.
Figure represents important industrial applications of phylogenetic trees
6. Due to the rapid growth of available sequence data over recent years and the constant
improvement of multiple alignment methods, it has now become feasible to compute very
large trees which comprise more than 1,000 organisms.
Figure represents Multiple sequence alignment of large no of organisms.
7. The computation of the tree of life containing representatives of all living beings on earth
is considered to be one of the grand challenges in Bioinformatics.
Figure represents the tree of life.
8. Some large multi-institutional/multidisciplinary projects are underway which aim at
building the tree of life: CIPRES (Cyber Infrastructure for Phylogenetic Research
www.phylo.org) and ATOL (Assembling the Tree of Life project, tolweb.org).
9. Cancer research is considered one of the most significant areas in the medical
community. Mutations in genomic sequences are responsible for cancer development and
increased aggressiveness in patients. The combination of all such gene mutations, or
progression pathways, across a population can be summarized in a phylogeny describing
the different evolutionary pathways.
Figure represents Cancer evolutionary tree
10. Phylogenetic trees can be applied to find similarities among breast
cancer subtypes based on gene data.
11. The discovery of genes associated with cancer subtypes helps researchers to map different
pathways and to classify cancer subtypes according to their mutations.
12. Methods of phylogenetic tree inference have proliferated in cancer genome studies, such
as those of breast cancer.
13. Phylogenetics can capture important mutational events among different cancer types; a
network approach can also capture tumour similarities.
Figure represents phylogenetic tree in mutation.
14. It has been observed in the literature that, in cancer, the driver genes change the
cancer progression and even affect the participation of other genes, thus generating a
gene interaction network.
Figure represents phylogenetic tree and gene interaction network.
15. Phylogenetic methods can solve the problem of class prediction by using a classification
tree. Phylogenetic methods give us a deeper understanding of biological heterogeneity
among cancer subtypes.