Bioinformatica t6-phylogenetics

FBW
6-11-2012

Wim Van Criekinge

Course Contents: Bioinformatics

NO CLASS

Phylogenetics
Introduction
Definitions
Species concept
Examples
The Tree-of-life
Phylogenetics Methodologies
Algorithms
Distance Methods
Maximum Likelihood
Maximum Parsimony
Rooting
Statistical Validation
Conclusions
Orthologous genes
Horizontal Gene Transfer
Phylogenomics
Practical Approach: PHYLIP
Weblems

What is phylogenetics ?

Phylogeny (phylo = tribe + genesis = origin)

Phylogenetic trees are about visualising evolutionary
relationships. They reconstruct the pattern of events
that have led to the distribution and diversity of life.

The purpose of a phylogenetic tree is to illustrate how a
group of objects (usually genes or organisms) are
related to one another

Nothing in Biology Makes Sense Except in the Light of
Evolution. Theodosius Dobzhansky (1900-1975)

Trees

• Diagram consisting of branches and nodes
• Species tree (how are my species related?)
– contains only one representative from each
species.
– all nodes indicate speciation events
• Gene tree (how are my genes related?)
– normally contains a number of genes from a
single species
– nodes relate either to speciation or gene
duplication events

Clade: A set of species which includes all of the species
derived from a single common ancestor

Species Concepts from Various Authors
D.A. Baum and K.L. Shaw - Exclusive groups of organisms, where an exclusive group is one whose members are all more closely related to
each other than to any organisms outside the group.
J. Cracraft - An irreducible cluster of organisms, diagnosably distinct from other such clusters, and within which there is a parental pattern of
ancestry and descent.
Charles Darwin - "From these remarks it will be seen that I look at the term species, as one arbitrarily given for the sake of convenience to a set
of individuals closely resembling each other, and that it does not essentially differ from the term variety, which is given to less distinct and
more fluctuating forms. The term variety, again, in comparison with mere individual differences, is also applied arbitrarily, and for mere
convenience sake" (Origin of Species, 1st ed., p. 108).
T. Dobzhansky - The largest and most inclusive reproductive community of sexual and cross-fertilizing individuals which share a common gene
pool. And later... Systems of populations, the gene exchange between which is limited or prevented by reproductive isolating mechanisms.
M. Ghiselin - The most extensive units in the natural economy, such that reproductive competition occurs among their parts.
D.M. Lambert - Groups of individuals that define themselves by a specific mate recognition system.
J. Mallet - Identifiable genotypic clusters recognized by a deficit of intermediates, both at single loci and at multiple loci.
E. Mayr - Groups of actually or potentially interbreeding natural populations which are reproductively isolated from other such groups.
C.D. Michener - A group of organisms not itself divisible by phenetic gaps resulting from concordant differences in character states (except for
morphs - such as sex, age, or caste), but separated by such phenetic gaps from other such units.
H.E.H. Patterson - That most inclusive population of individual biparental organisms which share a common fertilization system.
G.G. Simpson - A lineage of populations evolving with time, separately from others, with its own unique evolutionary role and tendencies.
P.H.A. Sneath and R.R. Sokal - The smallest (most homogeneous) cluster that can be recognized upon some given criterion as being distinct
from other clusters.
A.R. Templeton - The most inclusive population of individuals having the potential for phenotypic cohesion through intrinsic cohesion
mechanisms (genetic and/or demographic - i.e. ecological - exchangeability).
E.O. Wiley - A single lineage of ancestor-descendant populations which maintains its identity from other such lineages and which has its own
evolutionary tendencies and historical fate.
S. Wright - A species in time and space is composed of numerous local populations, each one intercommunicating and intergrading with others.

Species

I. Definitions:

Species = the basic unit of classification

> Three different ways to recognize species:

Plant Species

Definitions:


1) Morphological species = the smallest group that is
consistently and persistently distinct (Clusters in
morphospace)

species are recognized initially on the basis of
appearance; the individuals of one species look
different from the individuals of another

Species

Definitions:


2) Biological species = a set of interbreeding or
potentially interbreeding individuals that are
separated from other species by reproductive
barriers

members of different species are unable to interbreed

Species

Definitions:


3) Phylogenetic species = the boundary between
reticulate (among interbreeding individuals) and
divergent relationships (between lineages with no
gene exchange)

Phylogenetic species
[Figure: lineage diagram marking the boundary between reticulate and divergent relationships]

recognized by the pattern of ancestor-descendant relationships

Species

Definitions:


4) Phylogenomics species = ability to transmit (and
maintain) a (stable) gene pool

Addresses the Anopheles genome topology
variations

Branching Order in a Phylogenetic Tree

• In the tree to the left, A and B share the most recent
common ancestry. Thus, of the species in the
tree, A and B are the most closely related.
• The next most recent common ancestry is C with
the group composed of A and B. Notice that the
relationship of C is with the group containing A
and B. In particular, C is not more closely related to
B than to A. This can be emphasized by the
following two trees, which are equivalent to each
other:

More definitions …

Edge, branch
Leaves, tips, external nodes
Branch node, internal node

• A common simplifying assumption is that the tree is bifurcating,
meaning that each branch node has exactly two descendants.
• The edges, taken together, are sometimes said to define the topology
of the tree

Outgroups, rooted versus unrooted

An unrooted reptilian phylogeny with an avian outgroup and
the corresponding rooted phylogeny. The Ri represent modern
reptiles; the Ai, inferred ancestors and the B a bird.

Examples

Phylogenetic methods may be used to
solve crimes, test purity of products, and
determine whether endangered species
have been smuggled or mislabeled:
– Vogel, G. 1998. HIV strain analysis debuts in
murder trial. Science 282(5390): 851-853.
– Lau, D. T.-W., et al. 2001. Authentication of
medicinal Dendrobium species by the internal
transcribed spacer of ribosomal DNA. Planta
Med 67:456-460.

Examples

– Epidemiologists use phylogenetic methods to
understand the development of
pandemics, patterns of disease transmission, and
development of antimicrobial resistance or
pathogenicity:
• Basler, C.F., et al. 2001. Sequence of the 1918
pandemic influenza virus nonstructural gene (NS)
segment and characterization of recombinant viruses
bearing the 1918 NS genes. PNAS, 98(5):2746-2751.
• Ou, C.-Y., et al. 1992. Molecular epidemiology of HIV
transmission in a dental practice. Science
256(5060):1165-1171.
• Bacillus anthracis:

Examples
• Conservation biologists may use these techniques to
determine which populations are in greatest need of
protection, and other questions of population structure:
– Trepanier, T.L., and R.W. Murphy. 2001. The Coachella Valley
fringe-toed lizard (Uma inornata): genetic diversity and
phylogenetic relationships of an endangered species. Mol
Phylogenet Evol 18(3):327-334.
– Alves, M.J., et al. 2001. Mitochondrial DNA variation in the
highly endangered cyprinid fish Anaecypris hispanica:
importance for conservation. Heredity 87(Pt 4):463-473.
• Pharmaceutical researchers may use phylogenetic
methods to determine which species are most closely
related to other medicinal species, thus perhaps sharing
their medicinal qualities:
– Komatsu, K., et al. 2001. Phylogenetic analysis based on 18S
rRNA gene and matK gene sequences of Panax vietnamensis
and five related species. Planta Med 67:461-465.

Some Important Dates in History

Event                                  Time (billion years ago)
Origin of the Universe                 15
Formation of the Solar System          4.6
First Self-replicating System          3.5
Prokaryotic-Eukaryotic Divergence      2.0
Plant-Animal Divergence                1.0
Invertebrate-Vertebrate Divergence     0.5
Mammalian Radiation Beginning          0.1

What Sequence to Use ?

• To infer relationships that span the
diversity of known life, it is
necessary to look at genes
conserved through the billions of
years of evolutionary divergence.
• The gene must display an
appropriate level of sequence
conservation for the divergences of
interest.


• If there is too much change, then
the sequences become
randomized, and there is a limit to
the depth of the divergences that
can be accurately inferred.
• If there is too little change (if the
gene is too conserved), then there
may be little or no change between
the evolutionary branchings of
interest, and it will not be possible to
infer close (genus or species level)
relationships.

Ribosomal RNA Genes and Their Sequences

Carl Woese recognized the full potential of rRNA
sequences as a measure of phylogenetic
relatedness. He initially used an RNA
sequencing method that determined about
1/4 of the nucleotides in the 16S rRNA (the
best technology available at the time). This
amount of data greatly exceeded anything
else then available. Using newer methods,
it is now routine to determine the
sequence of the entire 16S rRNA
molecule. Today, the accumulated 16S
rRNA sequences (about 10,000) constitute
the largest body of data available for
inferring relationships among organisms.


Examples of genes in this category are
those that encode the ribosomal RNAs
(rRNAs). Most prokaryotes have three
rRNAs, called the 5S, 16S and 23S
rRNA.

Name (a)   Size (nucleotides)   Location
5S         120                  Large subunit of ribosome
16S        1500                 Small subunit of ribosome
23S        2900                 Large subunit of ribosome
(a) The name is based on the rate at which the
molecule sediments (sinks) in water.
Bigger molecules sediment faster than small
ones.

Ribosomal RNA Genes and Their Sequences

The extraordinary conservation of rRNA genes can
be seen in these fragments of the small subunit
rRNA gene sequences from organisms spanning
the known diversity of life:
human ...GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGCTGCAGTTAAAAAG...

yeast ...GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAG...

Corn ...GTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAG...

Escherichia coli ...GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCG...

Anacystis nidulans ...GTGCCAGCAGCCGCGGTAATACGGGAGAGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCG...

Thermotoga maritima ...GTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTACCCGGATTTACTGGGCGTAAAGGG...

Methanococcus vannielii ...GTGCCAGCAGCCGCGGTAATACCGACGGCCCGAGTGGTAGCCACTCTTATTGGGCCTAAAGCG...

Thermococcus celer ...GTGGCAGCCGCCGCGGTAATACCGGCGGCCCGAGTGGTGGCCGCTATTATTGGGCCTAAAGCG...

Sulfolobus solfataricus ...GTGTCAGCCGCCGCGGTAATACCAGCTCCGCGAGTGGTCGGGGTGATTACTGGGCCTAAAGCG...

Molecular Clock (MC)

• Rate of evolution = rate of mutation
• rate of evolution for any macromolecule is
approximately constant over time (Neutral
Theory of evolution)
• For a given protein the rate of sequence
evolution is approximately constant across
lineages. Zuckerkandl and Pauling (1965)
• This would allow speciation and duplication
events to be dated accurately based on
molecular data

• (a) A traditional phylogenetic tree and
• (b) the new phylogenetic tree, each showing the
positions of selected phyla. B, bilateria;
AC, acoelomates; PC, pseudocoelomates;
C, coelomates; P, protostomes; L, lophotrochozoa;
E, ecdysozoa; D, deuterostomes.

Molecular Clock (MC)

• Local and approximate molecular
clocks are more reasonable
– one amino acid substitution per 14.5 My
– 1.3 × 10^-9 substitutions/nucleotide site/year
– Relative rate test (see further)
• ((A,B),C) then measure distance between
(A,C) & (B,C)
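The relative rate test and clock-based dating mentioned above can be sketched in a few lines. This is a minimal illustration, not a statistical test: the function names are hypothetical, the distances are assumed to be already corrected substitutions per site, and the factor 2 in the dating formula assumes a strict clock with the distance accumulating along both lineages.

```python
def relative_rate_difference(d_ac, d_bc):
    """For the tree ((A,B),C): difference between d(A,C) and d(B,C).

    If A and B have evolved at the same rate since their common
    ancestor, the two distances to the outgroup C should be equal,
    so a value near zero is compatible with a local clock.
    """
    return d_ac - d_bc


def divergence_time(distance, rate_per_site_per_year):
    """Years since two lineages split, assuming a strict molecular clock.

    The pairwise distance is the sum of the changes on both lineages,
    hence the division by twice the rate.
    """
    return distance / (2.0 * rate_per_site_per_year)


if __name__ == "__main__":
    # Hypothetical distances, in substitutions per nucleotide site.
    print(relative_rate_difference(0.31, 0.30))
    # Using the rate quoted on the slide (1.3e-9 per site per year).
    print(divergence_time(0.026, 1.3e-9))
```

With the rate from the slide, a distance of 0.026 substitutions per site corresponds to roughly 10 million years under these assumptions.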

Proteins evolve at highly different rates

Protein family                     Rate of Change       Theoretical Lookback Time
                                   (PAMs / 100 myrs)    (myrs)
Pseudogenes                        400                  45
Fibrinopeptides                    90                   200
Lactalbumins                       27                   670
Lysozymes                          24                   850
Ribonucleases                      21                   850
Haemoglobins                       12                   1500
Acid proteases                     8                    2300
Cytochrome c                       4                    5000
Glyceraldehyde-P dehydrogenase     2                    9000
Glutamate dehydrogenase            1                    18000

PAM = number of Accepted Point Mutations per 100 amino acids.

4-steps

• align
• select method (evolutionary
model)
– Distance
– ML
– MP
• generate tree
• validate tree

Distance matrix methods (UPGMA, NJ, Fitch, ...)

• Convert sequence data into a
set of discrete pairwise distance
values (n*(n-1)/2), arranged into
a matrix. Distance methods fit a
tree to this matrix.
• The tree topology
is constructed by using a cluster
analysis method (such as UPGMA or
NJ).
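As an illustration of the first step, the sketch below turns an alignment into the n*(n-1)/2 pairwise distances. It assumes the simplest possible measure, the uncorrected proportion of differing sites (p-distance), and skips columns where either sequence has a gap; the function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix (dict of dicts) of pairwise p-distances."""
    names = list(alignment)
    dist = {name: {name: 0.0} for name in names}
    for a, b in combinations(names, 2):  # n*(n-1)/2 pairs
        dist[a][b] = dist[b][a] = p_distance(alignment[a], alignment[b])
    return dist


if __name__ == "__main__":
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
    for name, row in distance_matrix(toy).items():
        print(name, row)
```

A tree-building method such as UPGMA or NJ would then take this matrix as its only input, which is why information in the individual sites is lost at this point.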


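[Figure: substitution diagram for a site that starts as A, with C, G and T as the alternative states]

Since we start with A, p(A) = 1

D = evolutionary distance ~ time
F = dissimilarity ~ (1 – P_X(t))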

F~1– d
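The slide distinguishes the observed dissimilarity F from the evolutionary distance D but does not name a substitution model. As one possible illustration, the sketch below assumes the Jukes-Cantor model (all substitutions equally likely), under which D = -3/4 ln(1 - 4F/3). The function name is hypothetical.

```python
import math


def jc69_distance(dissimilarity):
    """Jukes-Cantor corrected distance (substitutions per site).

    dissimilarity is the observed fraction of differing sites (F).
    The correction accounts for multiple substitutions at the same
    site and is undefined once F reaches the saturation value 0.75.
    """
    if not 0.0 <= dissimilarity < 0.75:
        raise ValueError("dissimilarity must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * dissimilarity / 3.0)


if __name__ == "__main__":
    for f in (0.05, 0.1, 0.3, 0.6):
        print(f, round(jc69_distance(f), 4))
```

For small F the corrected distance is close to F itself; as F approaches 0.75 the estimate grows without bound, which matches the point that highly diverged sequences become effectively randomized.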

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
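A compact sketch of UPGMA clustering follows. It is an illustration under stated assumptions rather than the PHYLIP implementation: the input format (a dict keyed by frozenset pairs of taxon names), the nested-tuple output and the function name are all choices made here, and at least two taxa are required.

```python
from itertools import combinations


def upgma(dist):
    """Cluster taxa by UPGMA.

    dist maps frozenset({a, b}) -> distance for every pair of taxon names.
    Returns (tree, root_height), where tree is a nested tuple of names.
    """
    taxa = sorted({name for pair in dist for name in pair})
    # Active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(taxa)}
    d = {frozenset((i, j)): dist[frozenset((taxa[i], taxa[j]))]
         for i, j in combinations(range(len(taxa)), 2)}
    next_id = len(taxa)
    height = 0.0
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        i, j = sorted(pair)
        height = d.pop(pair) / 2.0        # new node sits halfway between them
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance to every other cluster is the size-weighted mean,
        # i.e. the arithmetic mean over all original leaf pairs.
        for k in clusters:
            d_ik = d.pop(frozenset((i, k)))
            d_jk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, height


if __name__ == "__main__":
    toy = {frozenset(("A", "B")): 2.0,
           frozenset(("A", "C")): 6.0,
           frozenset(("B", "C")): 6.0}
    print(upgma(toy))
```

Because every merge places the new node at half the cluster distance, UPGMA implicitly assumes a constant molecular clock and always returns a rooted, ultrametric tree.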

Distance matrix methods: Summary

http://www.bioportal.bic.nus.edu.sg/phylip/neighbor.html


• The phylogeny estimates
the distance for each pair as the sum
of branch lengths in the path from one
sequence to another through the tree.
• advantages :
easy to perform ;
quick calculation ;
suited to sequences having high similarity scores ;
• drawbacks :
the sequences are not considered as such (loss
of information) ;
all sites are generally treated equally (differences
in substitution rates are not taken into
account) ;
not applicable to distantly divergent sequences.
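The tree-implied distance between two sequences, the sum of branch lengths along the path joining them, can be computed as sketched below. The tree representation (a dict mapping each node to its parent and the length of the branch leading to it) and the function names are assumptions made for this example.

```python
def path_to_root(node, parent):
    """Map node and each of its ancestors to their distance from node."""
    dist, total = {node: 0.0}, 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        dist[node] = total
    return dist


def tree_distance(a, b, parent):
    """Sum of branch lengths on the path between leaves a and b."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    # Among shared ancestors, the most recent one gives the shortest path.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


if __name__ == "__main__":
    # Hypothetical tree ((A:1,B:1)X:2,C:3)R
    parent = {"A": ("X", 1.0), "B": ("X", 1.0),
              "X": ("R", 2.0), "C": ("R", 3.0)}
    print(tree_distance("A", "B", parent))
    print(tree_distance("A", "C", parent))
```

Distance methods choose the topology and branch lengths so that these path sums agree as closely as possible with the entries of the distance matrix.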

Maximum likelihood

• In this method, the bases
(nucleotides or amino acids) of all
sequences at each site are
considered separately (as
independent), and the log-likelihood
of observing these bases is computed
for a given topology by using a
particular probability model.
• This log-likelihood is summed over all
sites, and the sum of the log-
likelihoods is maximized to estimate
the branch lengths of the tree.

Maximum likelihood

• This procedure is repeated for all
possible topologies, and the topology
that shows the highest likelihood is
chosen as the final tree.
• Notes :
ML estimates the branch lengths of the
final tree ;
ML methods are usually consistent ;
ML can be extended to allow different
rates of transition and
transversion.
• Drawbacks
need long computation time to construct a
tree.
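The core idea, summing per-site log-likelihoods and maximizing over branch lengths, can be shown on the smallest possible case: two sequences joined by a single branch. The sketch assumes the Jukes-Cantor model and a simple grid search; a real ML program evaluates full trees and uses numerical optimization. Function names are hypothetical.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned sequences separated by distance d.

    Sites are treated as independent, so the per-site log-likelihoods
    are summed. d is in expected substitutions per site and must be > 0
    if the sequences differ anywhere.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific change
    total = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the starting base.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


def best_branch_length(seq_a, seq_b, grid):
    """Grid value of d with the highest log-likelihood."""
    return max(grid, key=lambda d: jc69_log_likelihood(seq_a, seq_b, d))


if __name__ == "__main__":
    grid = [i / 1000 for i in range(1, 2001)]
    print(best_branch_length("ACGTACGTAC", "ACGTACGTCA", grid))
```

For this two-sequence case the maximum coincides with the Jukes-Cantor corrected distance; the computational cost of ML comes from repeating this kind of optimization for every branch of every candidate topology.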

Maximum Parsimony

Parsimony criterion
• It consists of determining the minimum
number of changes (substitutions) required to
transform a sequence to its nearest neighbor.
Maximum Parsimony
• The maximum parsimony algorithm searches
for the minimum number of genetic events
(nucleotide substitutions or amino-acid
changes) to infer the most parsimonious tree
from a set of sequences.

Maximum Parsimony

Occam’s Razor
Entia non sunt multiplicanda praeter necessitatem.
William of Occam (1300-1349)

The best tree is the one which requires the smallest number of
substitutions

Maximum Parsimony

• The best tree is the one which needs the
fewest changes.
• Problems :
– If the evolutionary clock is not constant, the
procedure generates results which can be
misleading ;
– within practical computational limits, this
often leads to the generation of tens or more
"equally most parsimonious trees" which
make it difficult to justify the choice of a
particular tree ;
– long computation time to construct a tree.

Maximum Parsimony: Branch Node A or B ?

Maximum Parsimony: A requires 5 mutations

Maximum Parsimony: B (and propagating A->B) requires only 4 mutations
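Counting the minimum number of changes a given tree requires can be done with Fitch's algorithm, sketched below for rooted binary trees written as nested tuples. This scores a single candidate topology; searching over topologies is a separate step. The tree encoding, function names and toy alignment are assumptions for this example.

```python
def fitch(tree, states):
    """Fitch's algorithm for one site.

    tree is a leaf name (str) or a (left, right) tuple;
    states maps each leaf name to its base at this site.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


if __name__ == "__main__":
    aln = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}
    print(parsimony_score((("A", "B"), ("C", "D")), aln))
    print(parsimony_score((("A", "C"), ("B", "D")), aln))
```

In this toy case the tree grouping A with B needs fewer changes than the one grouping A with C, so it would be preferred under the parsimony criterion.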


Comparative evaluation of different methods

There are at present no statistical
methods which allow
comparisons of trees obtained
from different phylogenetic
methods; nevertheless, many
studies have been made to
compare the relative consistency
of the existing methods.

Comparative evaluation of different methods

The consistency depends on many
factors, among these the topology
and branch lengths of the real
tree, the transition/transversion rate
and the variability of the
substitution rates.
One expects that if sequences have
a strong phylogenetic
relationship, different methods will
yield the same phylogenetic tree

Comparison of methods

• Inconsistency
• Neighbour Joining (NJ) is very fast but depends on
accurate estimates of distance. This is more
difficult with very divergent data
• Parsimony suffers from Long Branch Attraction.
This may be a particular problem for very divergent
data
• NJ can suffer from Long Branch Attraction
• Parsimony is also computationally intensive
• Codon usage bias can be a problem for MP and NJ
• Maximum Likelihood is the most reliable but
depends on the choice of model and is very slow
• Methods may be combined

Rooting the Tree

• In an unrooted tree the direction of
evolution is unknown
• The root is the hypothesized ancestor
of the sequences in the tree
• The root can either be placed on a
branch or at a node
• You should start by viewing an
unrooted tree

Automatic rooting

• Many software packages will root
trees automatically (e.g. mid-point
rooting in NJPlot)
• Sometimes two trees may look very
different but, in fact, differ only in the
position of the root
• This normally involves assumptions…
BEWARE!

Rooting Using an Outgroup

1. The outgroup should be a sequence (or set
of sequences) known to be less closely
related to the rest of the sequences than they
are to each other
2. It should ideally be as closely related as
possible to the rest of the sequences while
still satisfying condition 1
The root must be somewhere between the
outgroup and the rest (either on the node or
in a branch)

How confident am I that my tree is correct?

Bootstrap values
Bootstrapping is a statistical
technique that can use random
resampling of data to determine
sampling error for tree topologies

Bootstrapping phylogenies

• Characters are resampled with replacement
to create many bootstrap replicate data sets
• Each bootstrap replicate data set is analysed
(e.g. with parsimony, distance, ML etc.)
• Agreement among the resulting trees is
summarized with a majority-rule consensus
tree
• Frequencies of occurrence of
groups, bootstrap proportions (BPs), are a
measure of support for those groups
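The resampling and counting steps above can be sketched as follows. The tree-building step is deliberately left out: `find_groups` is a hypothetical callback, supplied by the caller, that builds a tree from a replicate alignment with any method and returns the groups (for example frozensets of taxon names) found in it.

```python
import random
from collections import Counter


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def bootstrap_proportions(alignment, find_groups, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each group appears.

    find_groups(alignment) must return an iterable of hashable groups;
    it stands in for tree building plus clade extraction.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(find_groups(bootstrap_replicate(alignment, rng)))
    return {group: n / replicates for group, n in counts.items()}


if __name__ == "__main__":
    aln = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
    print(bootstrap_replicate(aln, random.Random(1)))
```

Groups that appear in more than half of the replicates are the ones that end up in a majority-rule consensus tree.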

Bootstrapping - an example

[Figure: majority-rule consensus tree from a parsimony bootstrap of ciliate SSU rDNA. Taxa: Ochromonas (1), Symbiodinium (2), Prorocentrum (3), Loxodes (4), Tracheloraphis (5), Spirostomum (6), Gruberia (7), Euplotes (8), Tetrahymena (9). Bootstrap values shown on internal branches: 100, 84, 96, 100, 100, 100.]

Bootstrap - interpretation

• Bootstrapping is a very valuable and widely used
technique (it is demanded by some journals)
• BPs give an idea of how likely a given branch
would be to be unaffected if additional data, with
the same distribution, became available
• BPs are not the same as confidence intervals.
There is no simple mapping between bootstrap
values and confidence intervals. There is no
agreement about what constitutes a ‘good’
bootstrap value (> 70%, > 80%, > 85% ????)
• Some theoretical work indicates that BPs can be a
conservative estimate of confidence intervals
• If the estimated tree is inconsistent all the
bootstraps in the world won’t help you…..

Jack-knifing

• Jack-knifing is very similar to
bootstrapping and differs only in the
character resampling strategy
• Jack-knifing is not as widely
available or widely used as
bootstrapping
• Tends to produce broadly similar
results

Statistical evaluation of the obtained phylogenetic trees

At present only sampling techniques allow testing the
topology of a phylogenetic tree
Bootstrapping
» It consists of drawing columns from a sample of
aligned sequences, with replacement, until one gets
a data set of the same size as the original one
(usually some columns are sampled several times
and others are left out).
Half-Jackknife
» This technique resamples half of the sequence sites
considered and eliminates the rest. The final sample
has half the initial number of sites,
without duplication.
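The half-jackknife differs from the bootstrap only in how columns are drawn: half of the sites are kept, each at most once. A minimal sketch, with a hypothetical function name and with the choice to keep the retained columns in their original order:

```python
import random


def half_jackknife(alignment, rng):
    """Keep a random half of the alignment columns, without duplication."""
    length = len(next(iter(alignment.values())))
    # sample() draws without replacement, so no column is repeated.
    keep = sorted(rng.sample(range(length), length // 2))
    return {name: "".join(seq[c] for c in keep)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    aln = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
    print(half_jackknife(aln, random.Random(0)))
```

Each jackknife replicate would then be analysed with the same tree-building method as the original data, and group frequencies summarized in the same way as bootstrap proportions.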

Weblems
W6.1: The growth hormones in most mammals have very similar amino acid
sequences. (The growth hormones of the Alpaca, Dog, Cat, Horse, Rabbit, and
Elephant each differ from that of the Pig at no more than 3 positions out of 191.)
Human growth hormone is very different, differing at 62 positions. The evolution of
growth hormone accelerated sharply in the line leading to humans. By retrieving
and aligning growth hormone sequences from species closely related to humans
and our ancestors, determine where in the evolutionary tree leading to humans the
accelerated evolution of growth hormone took place.
W6.2: Humans are primates, an order that we, apes and monkeys share with lemurs
and tarsiers. On the basis of the Beta-globin gene cluster of human, a
chimpanzee, an old-world monkey, a new-world monkey, a lemur, and a tarsier,
derive a phylogenetic tree of these groups.
W6.3: Primates are mammals, a class we share with marsupials and monotremes;
Extant marsupials live primarily in Australia, except for the opossum, found also in
North and South America. Extant monotremes are limited to two animals from
Australia: the platypus and echidna. Using the complete mitochondrial genome
from human, horse (Equus caballus), wallaroo (Macropus robustus), American
opossum (Didelphis virginiana), and platypus (Ornithorhynchus anatinus), draw
an evolutionary tree, indicating branch lengths. Are monotremes more closely
related to placental mammals or to marsupials?
W6.4: Mammals are vertebrates, a subphylum that we share with fishes, sharks, birds
and reptiles, amphibia, and primitive jawless fishes (example: lampreys). For the
coelacanth (Latimeria chalumnae), the great white shark (Carcharodon
carcharias), skipjack tuna (Katsuwonus pelamis), sea lamprey (Petromyzon
marinus), frog (Rana pipiens), and Nile crocodile (Crocodylus niloticus), using
sequences of cytochromes c and pancreatic ribonucleases, derive evolutionary
trees of these species.
