Phylogenetic prediction - maximum parsimony method

Phylogenetic Prediction
MAXIMUM PARSIMONY METHOD
Afnan Zuiter

DEFINITION
• This method predicts the evolutionary tree(s) that minimize the number of steps required to generate the observed variation in the sequences (minimum number of changes).
• For this reason, the method is also sometimes referred to as the minimum evolution method (a name that is also used for a separate distance-based method).

WHAT IT NEEDS?
• A multiple sequence alignment is required to predict which sequence positions are likely to correspond.

THE PROCEDURE
• Predict which sequence positions are likely to correspond by producing a multiple sequence alignment.
• These positions will appear in vertical columns.
• For each aligned position, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified.
• This analysis is continued for every position in the sequence alignment.
• Finally, those trees that produce the smallest number of changes overall for all sequence positions are identified.
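
The per-column counting step above can be sketched with Fitch's algorithm, a standard way to count the minimum number of changes on a fixed tree. The slides do not name an algorithm, so this choice is an assumption, and the function names `fitch_steps` and `tree_length` and the nested-tuple tree representation are hypothetical. The unrooted tree is written with an arbitrary root, which does not affect the count.

```python
def fitch_steps(tree, column):
    """Minimum number of changes for one alignment column on a fixed tree.

    tree   : nested 2-tuples of taxon names, e.g. (("t1", "t2"), ("t3", "t4"))
    column : dict mapping taxon name -> character at this position
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):              # a tip: its observed character
            return {column[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:                             # children agree: no change needed
            return common
        steps += 1                             # children disagree: one change
        return left | right

    state_set(tree)
    return steps


def tree_length(tree, alignment):
    """Sum the per-column step counts over every column of the alignment."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    return sum(
        fitch_steps(tree, {name: alignment[name][i] for name in names})
        for i in range(n_sites)
    )
```

Calling `tree_length` once per candidate tree and keeping the tree(s) with the smallest total corresponds to the final step of the procedure.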

WHEN IT IS USED?
• This method is best suited for:
1. Sequences that are quite similar.
2. Small numbers of sequences.

DISADVANTAGES
• The algorithm followed is not particularly complicated, and it is guaranteed to find the best tree because all possible trees relating a group of sequences are examined.
• For this reason, the method:
1. Is quite time-consuming.
2. Is not useful for data that include a large number of sequences.
3. Is not useful for data that include sequences with a large amount of variation.
• The method predicts one or more unrooted trees, so other assumptions must be made to root the predicted tree.

PAUP
• PAUP stands for Phylogenetic Analysis Using Parsimony.
• PAUP runs on practically any machine, but its user interface has been most developed for use on the Macintosh computer.
• PAUP offers a number of options and parameter settings for a parsimony analysis in the Macintosh environment.
• Current versions of PAUP and PHYLIP provide the three main methods for phylogenetic analysis (parsimony, distance, and maximum likelihood) and also include many types of evolutionary models for sequence variation.
• PAUP provides methods for finding consensus trees; such trees are also calculated by the CONSENSE program in the PHYLIP package.

The main programs for maximum parsimony analysis in the PHYLIP package:
A. For analysis of nucleic acid sequences, the programs are:
1. DNAPARS, which treats gaps as a fifth nucleotide state.
2. DNAPENNY, which finds the most parsimonious phylogenies by a branch-and-bound search that can analyze more sequences (up to 11 or 12).
3. DNACOMP, which performs phylogenetic analysis using the compatibility criterion. Rather than searching for overall parsimony at all sites in the multiple sequence alignment, this method finds the tree that supports the largest number of sites. It is recommended when the rate of evolution varies among sites.
4. DNAMOVE, which performs parsimony and compatibility analysis interactively.

The main programs for maximum parsimony analysis in the PHYLIP package:
B. For analysis of protein sequences, the program is:
1. PROTPARS, which counts the minimum number of mutations needed to change a codon for the first amino acid into a codon for the second amino acid, but scores only those mutations in the mutational path that actually change the amino acid. Silent mutations, which do not change the amino acid, are not scored on the grounds that they have little evolutionary significance.

Rules for parsimony method
• In a multiple sequence alignment, only certain sequence variations at a given site are useful for a parsimony analysis.
• In the analysis, all of the possible unrooted trees (three trees for four sequences) are considered.
• The sequence variations at each site in the alignment are placed at the tips of the trees, and the tree that requires the smallest number of changes to produce this variation is determined.
• This analysis is repeated for each informative site, and the tree(s) that support the smallest number of changes overall are found.
• The length of this tree, defined as the sum of the number of steps in each branch of the tree, will be a minimum.

EXAMPLE
• Rules for analysis by maximum parsimony in this example are:
1. There are four taxa, giving three possible unrooted trees.
2. Some sites are informative, i.e., they favor one tree over another. To be informative, a site must have at least two different sequence characters, each present in at least two taxa (sites 1, 2, 3, 4, 6, and 8 are not informative; sites 5, 7, and 9 are informative).
3. Only the informative sites need to be analyzed.
4. The optimal tree is obtained by adding the number of changes at each informative site for each tree, and picking the tree requiring the smallest number of changes.
• The three possible trees are shown in Figure 6.6.
• A scoring matrix may be used instead of scoring each change as 1.
• Tree 1 is the correct (most parsimonious) one, and its tree length is 4 (one change at each of positions 5 and 7 and two changes at position 9).
• Because there were only four sequences to consider, it was necessary to consider only three possible unrooted trees.
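
The four rules above can be sketched in a few lines of Python. The alignment below is hypothetical: it was chosen so that sites 5, 7, and 9 are the informative ones, as in the text, and it is not claimed to be the data of Figure 6.6. The names `is_informative`, `quartet_steps`, and `score_trees` are likewise illustrative, and each change is scored as 1.

```python
from collections import Counter

# Hypothetical four-taxon alignment (nine sites), for illustration only.
ALIGNMENT = {
    "taxon1": "AAGAGTGCA",
    "taxon2": "AGCCGTGCG",
    "taxon3": "AGATATCCA",
    "taxon4": "AGAGATCCG",
}

# The three unrooted trees for four taxa, as index orders (a, b | c, d).
TREES = {
    "tree 1: (1,2),(3,4)": (0, 1, 2, 3),
    "tree 2: (1,3),(2,4)": (0, 2, 1, 3),
    "tree 3: (1,4),(2,3)": (0, 3, 1, 2),
}


def is_informative(column):
    """At least two different characters, each present in at least two taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def cherry(x, y):
    """Join two tips: return (possible ancestral states, changes needed)."""
    return ({x}, 0) if x == y else ({x, y}, 1)


def quartet_steps(a, b, c, d):
    """Minimum changes at one site on the unrooted tree (a,b),(c,d)."""
    left, n_left = cherry(a, b)
    right, n_right = cherry(c, d)
    return n_left + n_right + (0 if left & right else 1)


def score_trees(alignment):
    """Return the informative sites (1-based) and the length of each tree."""
    columns = list(zip(*alignment.values()))
    informative = [i + 1 for i, col in enumerate(columns) if is_informative(col)]
    lengths = {
        name: sum(
            quartet_steps(*(columns[i - 1][k] for k in order)) for i in informative
        )
        for name, order in TREES.items()
    }
    return informative, lengths
```

With this alignment, `score_trees(ALIGNMENT)` reports sites 5, 7, and 9 as informative and gives lengths of 4, 5, and 6 for trees 1, 2, and 3, so tree 1 is chosen.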

For a larger number of sequences
• For a larger number of sequences, the number of trees becomes so large that it may not be feasible to examine all possible trees.
• The example of 12 sequences below took only a few seconds on a Macintosh G3.
• The exhaustive and branch-and-bound options of the program PAUP will analyze all possible trees, and if the number is too large, the program can keep running for a very long time.
• For large numbers of sequences, PAUP provides a program option called “heuristic,” which searches among the possible trees without examining every one and keeps representative trees that best fit the data.
• The presence of common branch patterns in these trees reveals some of the broader features of the phylogenetic relationships among the sequences.
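
To see how quickly the search space grows, the number of unrooted bifurcating trees for n taxa can be computed from the standard formula (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5). The function name `count_unrooted_trees` is illustrative.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the leading factor 1 is implicit).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 8, 12, 20):
        print(n, count_unrooted_trees(n))
```

Four taxa give 3 trees, as in the example above, while 12 taxa already give 654,729,075.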

Figure 6.7.
A. Portion of a multiple sequence alignment of the mitochondrial sequences provided in the PAUP distribution package.
• PAUP will import sequences in other multiple sequence alignment formats and convert them into the NEXUS format.
• The program READSEQ will reformat multiple sequence alignments into the NEXUS format.
• This format includes:
(A) information about type of sequence,
(B) coding information,
(C) codon positions,
(D) differential weights for transitions and transversions,
(E) treatment of gaps, and
(F) preferred groupings.

Figure 6.7. (continued)
• Only a portion of the NEXUS file is shown.
• In this analysis, branch-and-bound and otherwise default options were used.
• Gaps are treated as missing information.
• The number of sequences is indicated as ntax, the number of alignment columns as nchar, and the interleave command allows the data to be entered in readable blocks of sequence 60 characters long.
B. One of the two predicted trees. The tree file of PAUP was edited in MacClade and output as a graphics file.

This sequence format is the NEXUS format, which allows additional information to be included about the sequences and species relationships, together with a scoring system for base substitutions referred to as a cost or step matrix.
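
As a rough illustration of the layout described for Figure 6.7, the sketch below writes a minimal interleaved NEXUS data block. The function name `to_nexus` and the toy sequences are hypothetical, taxon names are assumed to contain no spaces, and the output is a bare-bones data block, not a reproduction of the file in the figure. In NEXUS files the keyword for the number of sequences is written `ntax`.

```python
def to_nexus(alignment, datatype="dna", block=60):
    """Format an alignment (dict: taxon name -> aligned sequence) as NEXUS text."""
    names = list(alignment)
    nchar = len(alignment[names[0]])
    if any(len(seq) != nchar for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name in names) + 2   # pad names into one column
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(names)} nchar={nchar};",
        f"  format datatype={datatype} missing=? gap=- interleave;",
        "  matrix",
    ]
    # Interleaved layout: one block of `block` columns for every taxon in turn.
    for start in range(0, nchar, block):
        for name in names:
            chunk = alignment[name][start:start + block]
            lines.append(f"  {name.ljust(width)}{chunk}")
        lines.append("")
    lines += ["  ;", "end;"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_nexus({"seq1": "ACGT-ACGT", "seq2": "ACGTTACGA"}))
```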

Figure 6.8
• Parsimony can give misleading information when rates of sequence change vary in the different branches of a tree that are represented by the sequence data.
• These variations produce a range of branch lengths, long ones representing more extended periods of time and short ones representing shorter times.
• For example, the real tree shown in Figure 6.8A includes two long branches in which G has turned to A independently, probably with a number of intermediate changes that are not observed in the sequence data.
• Because in a parsimony analysis rates of change along all branches of the tree are assumed to be equal, the tree predicted by parsimony and shown in Figure ...
• Although other columns in the sequence alignment that show less variation may provide the correct tree, the columns representing greater variation dominate the analysis.
• Such long branches may be broken down if additional taxa are present that are more closely related to taxa 1 and 4, thereby providing branches that intersect the long branches and give a better resolution of the changes.

Lake’s method
• Lake’s method (of invariants, or evolutionary parsimony, available in PAUP) is another method for identifying such long branches.
• In this method:
1. Four of the sequences are chosen at a time.
2. Only transversions in the aligned positions are scored as changes, on the grounds that transversions are the most significant base changes during evolution.
• Transversions of any base to each possible derivative, e.g., A → C or T, are assumed to occur at the same rate to create a balanced distribution, and the changes in each column of the alignment (each sequence position) are assumed to occur independently of each other.

• Suppose that there are two long branches.
• The correct tree is shown in Figure 6.9A, and one of the sites has changed multiple times but ends up as the same base, A, by chance.
• Traditional parsimony will identify this tree incorrectly.
• If these long branches do indeed exist, then other sites should give the type of transversion events shown in Figure 6.9B.
• The greater the number of B-type sites, the less one can depend on the A-type sites shown in Figure 6.9A.
• The evolutionary parsimony method subtracts the number of type B sites from the number of type A sites.

• If, on the one hand, long branches are not present in the quartet of sequences, there will be very few type B sites, and type A sites will be taken as evidence for the correct tree.
• On the other hand, if many examples of type B are present, the type A sites will carry little weight.
• These calculations are performed for all three possible unrooted trees and all possible types of transversions for the four sequences, and the tree receiving the most support is chosen.
• The PHYLIP program DNAINVAR computes Lake’s and other phylogenetic invariants for nucleic acid sequences. PAUP also includes an option for Lake’s invariant.
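
Because the method scores only transversions, a first step in any implementation is telling transversions (purine to pyrimidine or the reverse) apart from transitions. The sketch below shows only that classification step; it does not compute Lake's invariants. The function names are hypothetical, and gaps or ambiguity codes are simply ignored.

```python
PURINES = set("AG")
PYRIMIDINES = set("CTU")


def base_class(base):
    """Return 'R' for a purine, 'Y' for a pyrimidine, None for anything else."""
    base = base.upper()
    if base in PURINES:
        return "R"
    if base in PYRIMIDINES:
        return "Y"
    return None                      # gap, ambiguity code, or missing data


def is_transversion(x, y):
    """True when x and y are both plain bases of different chemical class."""
    class_x, class_y = base_class(x), base_class(y)
    return class_x is not None and class_y is not None and class_x != class_y


def transversion_columns(sequences):
    """1-based alignment positions that contain both a purine and a pyrimidine."""
    positions = []
    for i, column in enumerate(zip(*sequences), start=1):
        classes = {base_class(b) for b in column} - {None}
        if classes == {"R", "Y"}:
            positions.append(i)
    return positions
```

For a quartet of aligned sequences, `transversion_columns` lists the positions at which at least one transversion difference is visible; columns that differ only by transitions are left out.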

Which is the best method?
• Compared to the above methods, maximum likelihood and distance methods provide more reliable predictions when corrections are made for multiple substitutions.
• Distance methods such as neighbor joining have been shown generally to be better predictors than both standard and evolutionary parsimony methods when branch lengths vary.

• There are options in PAUP and MacClade for selecting among the most parsimonious trees.
• With MacClade it is possible to view the changes in sequence characters in each branch of the tree that lead to the current base in each sequence or taxon.
• As these characters are traced from positions lower in the tree to upper positions, some nodes in the tree may be assigned an unambiguous character (shown in color, Fig. 6.10).
• For other nodes, the assignment may be ambiguous because the node leads to two different characters above (thin black line).

• These ambiguities can optionally be resolved in two ways:
1. One is to delay the changes, placing them as far up the tree, away from the root, as possible (the Deltran option; not shown in figure).
2. A second is to introduce the changes as soon as possible, as close to the root as possible (the Acctran option; not shown in figure).
• The effect of using Deltran is to force parallel changes in the upper branches of the tree; that of Acctran is to force reversals in the upper branches. Using these options is not recommended unless such variations are expected, as in the analysis of more divergent sequences.

Homoplasy
• Homoplasy refers to the occurrence of the same sequence change in more than one branch of the tree.
• If all the sequence character changes support the same tree, there is no homoplasy.
• In reality, homoplasy is usually found for some characters for any tree.
• MacClade allows the tree to be changed to avoid homoplasy at a sequence position, but the new tree length will often increase, thus making the tree a less parsimonious choice than the original.
• Another parameter used is the consistency index (CI) = the minimum possible tree length / the actual tree length. For example, if the minimum possible length is 4 and the actual tree length is 5, CI = 4/5 = 0.8.
• The more homoplasy, the greater the actual tree length, and the smaller the value of CI.

• Parsimony methods can use information on the number of changes, or steps, required to change one residue into another.
• For example, the number of mutations required to change one amino acid into another in one branch of a tree can be taken into account.
• The parsimony method then attempts to minimize the number of such steps.
• This number of steps for interchanging characters can be incorporated into a matrix, called a step or cost matrix, for programs such as PAUP and MacClade to use.
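
One way to see how a step matrix enters the calculation is Sankoff's dynamic programming algorithm, sketched below for a single alignment column. This is an illustration of weighted parsimony in general and is not claimed to be how PAUP or MacClade implement it. The names `make_step_matrix` and `sankoff_cost` are hypothetical, the costs of 1 for a transition and 2 for a transversion are arbitrary example values, and only the plain bases A, C, G, and T are handled.

```python
import math

STATES = "ACGT"
PURINES = {"A", "G"}


def make_step_matrix(transition=1, transversion=2):
    """Build a symmetric cost matrix keyed by (from_base, to_base)."""
    matrix = {}
    for x in STATES:
        for y in STATES:
            if x == y:
                cost = 0
            elif (x in PURINES) == (y in PURINES):
                cost = transition        # purine<->purine or pyrimidine<->pyrimidine
            else:
                cost = transversion      # purine<->pyrimidine
            matrix[x, y] = cost
    return matrix


def sankoff_cost(tree, column, matrix):
    """Minimum total cost of one column on a tree given as nested 2-tuples."""

    def costs(node):
        # For each possible state at this node, the cheapest cost of the subtree.
        if isinstance(node, str):        # a tip: only its observed base is free
            return {s: (0 if s == column[node] else math.inf) for s in STATES}
        child_costs = [costs(child) for child in node]
        return {
            s: sum(min(matrix[s, t] + c[t] for t in STATES) for c in child_costs)
            for s in STATES
        }

    return min(costs(tree).values())
```

With every off-diagonal cost set to 1 the result is the ordinary count of changes; raising the transversion cost makes trees that need transversions proportionally longer.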
