INFERRING SMALL TREES WITH PHYLOGENETIC INVARIANTS AND
INEQUALITIES
A thesis presented to the faculty of
San Francisco State University
In partial fulfillment of
The Requirements for
The Degree
Master of Arts
In
Mathematics
by
Addie Andromeda Evans
San Francisco, California
December 2011
Copyright by
Addie Andromeda Evans
2011
CERTIFICATION OF APPROVAL
I certify that I have read INFERRING SMALL TREES WITH PHYLO-
GENETIC INVARIANTS AND INEQUALITIES by Addie Andromeda
Evans and that in my opinion this work meets the criteria for approving a
thesis submitted in partial fulfillment of the requirements for the degree:
Master of Arts in Mathematics at San Francisco State University.
Serkan Hoşten
Professor of Mathematics
Federico Ardila
Professor of Mathematics
Greg Spicer
Professor of Biology
INFERRING SMALL TREES WITH PHYLOGENETIC INVARIANTS AND
INEQUALITIES
Addie Andromeda Evans
San Francisco State University
2011
Phylogenetic trees are used to explain evolutionary relationships between a collection
of species, subspecies, or individuals. Phylogenetic invariants are constraints placed
on the probability space of DNA site patterns in order to find the true tree out
of combinatorially many possibilities. This research project examines the efficiency
of the method of phylogenetic invariants in choosing the correct evolutionary model
over a broad spectrum of data. Additionally, we are the first to test the effectiveness
of inequality constraints that have recently been developed.
I certify that the Abstract is a correct representation of the content of this thesis.
Serkan Hoşten, Chair, Thesis Committee Date
ACKNOWLEDGMENTS
Serkan Hoşten
Federico Ardila
Greg Spicer
Ronald Evans
Raymond Cavalcante
Tol Lau
TABLE OF CONTENTS
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Evolutionary Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Divergence of Species Over Time . . . . . . . . . . . . . . . . . . . . 4
2.3 The Tree Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Methods of Phylogenetic Inference . . . . . . . . . . . . . . . . . . . . 8
2.5 The Genetic Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3 Evolutionary Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 The General Markov Model . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Symmetric and Group Based Models . . . . . . . . . . . . . . . . . . 12
3.3 Non-homogeneous Models . . . . . . . . . . . . . . . . . . . . . . . . 16
4 The Algebraic Statistics of Phylogenetic Trees . . . . . . . . . . . . . . . . 18
4.1 Site Pattern Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 The Algebraic Variety of a Statistical Model . . . . . . . . . . . . . . 21
4.3 The Hardy-Weinberg Model Invariants . . . . . . . . . . . . . . . . . 22
5 Invariants for Group Based Models . . . . . . . . . . . . . . . . . . . . . . 25
5.1 The Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . 25
5.2 The Sturmfels-Sullivant Mapping . . . . . . . . . . . . . . . . . . . . 30
5.3 Example of the Nucleotide Three Taxa Claw Tree . . . . . . . . . . . 31
5.4 Equivalence Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6 Inequalities on the Discrete Fourier Transform . . . . . . . . . . . . . . . 39
6.1 Motivation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 39
6.2 Pendant and Internal Edge Inequalities . . . . . . . . . . . . . . . . . 42
6.3 Inequalities for Nucleotide Data . . . . . . . . . . . . . . . . . . . . . 44
7 Metrics and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
7.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.1 Inequalities Score . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.1.2 Invariants Score . . . . . . . . . . . . . . . . . . . . . . . . . . 48
7.1.3 The True Tree’s Relative Distance from the Optimal Scoring
Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.2.1 The Program . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.2.2 Winning Tree Algorithm . . . . . . . . . . . . . . . . . . . . . 51
7.2.3 Invariants Scores for Zero and Equivalence Class Reduced In-
variant Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.2.4 Simulating Data . . . . . . . . . . . . . . . . . . . . . . . . . . 53
7.2.5 Tree Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.1 Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
8.1.1 Three Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
8.2 Nucleotide Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8.2.1 Summary of Combinatorial Growth of Models . . . . . . . . . 63
8.2.2 Efficacy of Invariants and Inequalities for Three to Five Taxa . 65
8.2.3 Three Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8.2.4 Four Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2.5 Five Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
8.2.6 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.2.7 Efficacy of Invariants and Inequalities for Real Data . . . . . . 91
9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1 Discussion of Results and Further Directions . . . . . . . . . . . . . . 97
9.2 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . 99
9.3 Open Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
10 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
10.1 Appendix: Mathematica Code for Generating Invariants . . . . . . . 101
10.2 Appendix: Perl Code for Evaluating Invariants . . . . . . . . . . . . . 105
10.3 Appendix: Perl Code for Testing Efficacy of Method . . . . . . . . . . 122
10.4 Appendix: R Code for Simulating Binary Genetic Data . . . . . . . . 134
10.5 Appendix: R Code for Creating Evolver Files . . . . . . . . . . . . . 136
10.6 Appendix: Perl Code for Running Evolver . . . . . . . . . . . . . . . 137
10.7 Appendix: Perl Code for Cleaning Genetic Data from Evolver . . . . 139
LIST OF FIGURES
2.1 Tree Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 The Relationship Between the Unrooted and Rooted Trees for 3 Taxa. 5
2.3 The Complementary Structure of the Nucleotides [11]. . . . . . . . . 9
3.1 The Felsenstein Hierarchy of Evolutionary Models from [9] . . . . . . 15
3.2 Non-homogeneous Model of Transition Probabilities. . . . . . . . . . 16
4.1 The 3-Taxa Claw Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 The Hardy-Weinberg Curve in the Probability Simplex from [9] . . . 23
5.1 The Fourier Transformed Parameterization of the 3 Taxa Giraffe Tree. 30
8.1 Proof By Picture for Lemma 8.1 . . . . . . . . . . . . . . . . . . . . . 61
8.2 Efficacy of Invariants and Inequalities for 3-5 Taxa . . . . . . . . . . 65
8.3 Varying Number of Sites for 3 Taxa . . . . . . . . . . . . . . . . . . . 71
8.4 The Cumulative Frequencies of 3 Taxa with 500 Sites . . . . . . . . . 72
8.5 Varying the Internal Branch Lengths for 3 Taxa . . . . . . . . . . . . 74
8.6 Cumulative Frequencies of Varying Internal Branch Lengths for 3
Taxa Jukes-Cantor with External Branches of 0.17 . . . . . . . . . . . 75
8.7 Cumulative Frequencies of Varying Internal Branch Lengths for 3
Taxa Kimura-2 with External Branches of 0.17 . . . . . . . . . . . . . 76
8.8 The Distributions of the Invariants Scores for 3 Taxa Jukes-Cantor
and Kimura-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.9 The Cumulative Frequencies for 4 Taxa Jukes-Cantor with 1000 sites 80
8.10 The Cumulative Frequencies Subsets of Invariants for 4 Taxa with
500 sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.11 Variation of Branch Lengths for the Giraffe Tree and Jukes-Cantor
Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
8.12 Variation of Branch Lengths for the Balanced Tree and Jukes-Cantor
Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.13 The Effect of Short and Long Internal Branches . . . . . . . . . . . . 84
8.14 The Distributions of the True Trees for 4 Taxa Jukes-Cantor 1000 Sites 85
8.15 The Cumulative Frequencies for the Zero and Equivalence Class Re-
duced Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.16 The Cumulative Frequencies for the Equivalence Class Reduced In-
variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
8.17 The Efficacy of the Invariants and Inequalities for Real Data . . . . . 92
8.18 The Cumulative Efficacy of 3 Taxa Real Data . . . . . . . . . . . . . 93
8.19 The Cumulative Efficacy of 4 Taxa Real Data . . . . . . . . . . . . . 94
8.20 The Invariants Score for 3 Taxa Real Data . . . . . . . . . . . . . . . 95
8.21 The Distribution of Invariants Scores for 3 Taxa Real Data . . . . . . 96
Chapter 1
Introduction
Phylogenetic trees are used to explain evolutionary relationships between a col-
lection of individuals. These individuals may be from different species, the same
species, or different subspecies, and in order to generalize the discussion they are
referred to as taxa. There are many methods that infer these relationships by
comparing the genetic differences of taxa using samples of DNA. Examples include
likelihood methods and non-parametric algorithms such as maximum parsimony.
The method of phylogenetic invariants was proposed in 1987 by Cavender and
Felsenstein [3], and Lake [7] to infer phylogenetic trees. In 1993, Evans and Speed [4]
wrote about using invariants for testing phylogenetic trees, and introduced the Dis-
crete Fourier Transform as a clever change of variables to reduce the computational
expense of finding the invariants. More recent work by Sturmfels and Sullivant [12]
has provided convenient mappings for this transform, which we use in our work.
Only a few years ago, Matsen [8] demonstrated how to generate the complete set
of constraints on the probability space which define popular nucleotide evolution
models in the form of inequalities.
Thanks to these researchers, we have a great deal of elegant theory on the algebraic
statistics of phylogenetic invariants. However, many biologists, mathematicians and
statisticians remain unconvinced of the potential for phylogenetic invariants as a
practical method. Yet very little testing of the method of invariants has been done
to show its efficacy one way or another. Only recently, Casanellas and Sanchez
[1] tested the method of invariants for four taxa. Their results show promise for
the method, especially for non-homogeneous models (different parameters for each
branch). Our work tests up to 5 taxa and uses the complete set of constraints as
described by Matsen. We are the first to test the inequalities, and have done the
most comprehensive testing of the invariants to date.
Chapter 2
Evolutionary Biology
The purpose of this project is to contribute to testing the method of phylogenetic in-
variants for inferring phylogenetic trees. In this section we define phylogenetic trees
and how they describe evolutionary relationships between species or individuals.
2.1 Phylogenetic Trees
Phylogenetic trees describe the evolutionary relationship between taxa. A taxon
(plural: taxa) is a biological classification that places organisms into categories. For
the purposes of this paper, the taxa will be species, subspecies, or individuals
within a population. Mathematically, a tree is an object defined by a set of vertices
as seen in Figure 2.1. Pairs of vertices define edges, or branches. A pendant edge
has one vertex which is a terminal node, meaning there is no other edge incident to
Figure 2.1: Tree Terminology
it. Internal edges are those which are not pendant edges. Terminal nodes are also
referred to as “leaves.” These leaves represent the different taxa whose evolutionary
relationship we are describing. We are interested in how many leaves a tree has
because the leaves represent the modern day taxa, whereas the internal nodes are
ancestral taxa for whom we usually do not have genetic data.
2.2 Divergence of Species Over Time
Rooted trees are the most natural way to think of phylogenetic trees. We imagine at
the root an ancestral taxon where, over time, other taxa evolve from this common
ancestor. Evolution continues until the modern time, where the current taxa are
represented by the leaves of the tree. How many species can evolve from an ancestral
species at a time? Biologically speaking, species diverge two at a time, resulting in a
tree with two edges coming from each internal node. This is called a bifurcating tree.
We could consider evolutionary relationships where more than two taxa are diverging
at a time, and this would be represented by a multi-furcating tree. In practice,
biologists use multi-furcating trees when they don’t have enough information to
resolve the relationship. For this paper, we will be considering only bifurcating trees,
but it should be noted that multi-furcating trees can be created from bifurcating
trees by setting some internal edge lengths equal to zero.
Figure 2.2: The Relationship Between the Unrooted and Rooted Trees for 3 Taxa.
We can infer relationships between modern taxa without inferring their ancestral
history (an unrooted tree). However, rooted trees are more informative. For exam-
ple, if we are only comparing three taxa, then we cannot say anything about their
evolutionary relationship without knowing where the root is on the tree. In Figure
2.2 we see the unique unrooted tree for three taxa, which gives no indication about
who is more closely related to whom. However, there are three possible rooted trees
for three taxa, that describe distinct evolutionary histories.
Rooted trees can always be turned into unrooted trees by omitting the root. In
an unrooted tree, the root could potentially lie on any branch. In Figure 2.2, we see
that placing a root on a different branch results in a different evolutionary relation-
ship between the taxa. In order to determine on which branch the root should be
placed, some evolutionary information is needed. This is often done using an
outgroup: a taxon that is evolutionarily distinct enough that its only common
ancestor with the rest of the taxa under consideration must be the root.
2.3 The Tree Space
As we saw in the previous section, there is one possible unrooted tree and 3 possible
rooted trees for three taxa. For any given number of taxa, there are combinatorially
many phylogenetic trees that could describe the evolutionary relationship between
the taxa. These numbers grow quickly, as can be seen in Table 2.3, and thus we see
that phylogenetic inference is a highly non-trivial problem.
# of taxa n    # of unrooted trees          # of rooted trees
n              (2n−5)!/(2^(n−3)(n−3)!)      (2n−3)!/(2^(n−2)(n−2)!)
3              1                            3
4              3                            15
5              15                           105
6              105                          945
7              945                          10,395
8              10,395                       135,135
50             2.84 × 10^74                 2.75 × 10^76
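As a quick check on the table, both closed-form counts can be computed directly. A small sketch in Python (the function names are ours, not from the thesis code):

```python
from math import factorial

def num_unrooted_trees(n):
    # Unrooted bifurcating trees on n >= 3 leaves:
    # (2n-5)! / (2^(n-3) * (n-3)!), i.e. the double factorial (2n-5)!!
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def num_rooted_trees(n):
    # Rooted bifurcating trees on n >= 2 leaves:
    # (2n-3)! / (2^(n-2) * (n-2)!), i.e. (2n-3)!!
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (3, 4, 5, 6, 7, 8):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Note that the rooted count for n taxa equals the unrooted count for n + 1 taxa, since a root can be placed on any of the 2n − 3 branches of an unrooted tree.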
For only 50 taxa, the number of rooted bifurcating trees is approaching Eddington’s
number, which is the number of electrons in the visible universe [5]. Since biologists
often use more than 50 taxa, this creates a problem in finding the one “true tree”
that describes the exact evolution of these taxa. Our goal is to find this true tree,
or at least a good approximation of it. We need to be able to make our way through
the vast space of all possible trees to this one true tree. Since we cannot consider
every tree, current methods rely on random walks through the tree space and then
compare the trees sampled on the walk. A subset of all possible trees is compared,
and we choose the tree out of these that we think best approximates the true tree.
This is the problem of phylogenetic inference.
2.4 Methods of Phylogenetic Inference
Current methods of phylogenetic inference fall into two categories: parametric and
non-parametric. A popular non-parametric method is called Maximum Parsimony,
which actually seeks to minimize the number of evolutionary changes needed to
result in the modern taxa. This number of changes can be considered a score for
the given tree. Parametric methods assume a model of evolutionary change in order
to make inferences. Popular parametric methods score trees using the evolutionary
models to calculate the likelihood of a given tree. Thus, trees can be compared
by their scores to find the tree that best satisfies the criteria. The most parsimonious
and the highest-likelihood trees are both estimates of “the best tree.” These methods are all
based on comparing genetic data.
2.5 The Genetic Structure
To measure evolutionary change, biologists look at differences in genetic data.
Genetic data comes in different forms. Sometimes the binary alphabet {0, 1} is used,
since some biological traits either exist {1} or do not {0}. Additionally, the
20-letter alphabet of amino acids is sometimes used. The most common alphabet
for this coding is {A, T, C, G}, where these letters represent the four nucleotides. For
this research, we focused on nucleotide data.
Figure 2.3: The Complementary Structure of the Nucleotides [11].
The chemical structure of the nucleotides is important for our algebraic methods.
There are two types of nucleotides, the purines and the pyrimidines as can be seen
in Figure 2.3. The purines are adenine and guanine (A and G), which have two rings
while the pyrimidines, which are cytosine and thymine (C and T), have one ring in
their structure. Additionally, along the double helix, A is always paired with T and
G is always paired with C. We call these base pairs. Thus, we only need to know
the nucleotide sequence for one strand, since these pairings tell us the structure of
the other strand.
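Because A always pairs with T and G with C, one strand determines the other base by base. A minimal illustration (our own helper, not part of the thesis code):

```python
# Watson-Crick base pairing: A-T and G-C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(strand):
    # The paired strand is determined base by base from the pairing rules.
    return "".join(COMPLEMENT[base] for base in strand)

print(complementary_strand("ATCGGA"))  # -> TAGCCT
```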
Chapter 3
Evolutionary Models
The method of this research is a parametric one and so we describe the evolutionary
models here. We focus on a certain class of models, called group based models.
In this section we explain the general Markov model and group based models.
3.1 The General Markov Model
When DNA is copied during the cell cycle, mistakes are made, resulting in changes
of base. This process is modeled by a Markov chain, since the mistakes, or mutations,
are assumed to be random and the probability of transitioning into a new
state is solely determined by the current state. This makes sense biologically, since
a gene does not have a memory of its past states. This fits the definition of a Markov
process, where the probability fij of transitioning from state i to state j is only
dependent on the current state i and not on any previous state. The Markov model
as a model for the mutation process has been studied extensively by Felsenstein[5]
and others.
To summarize the probabilities of transitioning from any nucleotide state {A,C,G,T}
to another we can use a 4 × 4 matrix:
Fi,j =
         A    C    G    T
    A  α11  α12  α13  α14
    C  α21  α22  α23  α24
    G  α31  α32  α33  α34
    T  α41  α42  α43  α44
The general Markov model only places two restrictions on the entries in the proba-
bility transition matrix Fi,j, and those restrictions are simply to satisfy the definition
of a probability model: that the probabilities of transitioning are nonnegative, and
that each row sums to 1.
3.2 Symmetric and Group Based Models
There is a subclass of general Markov models that places the following restrictions:
first, the probability of not changing base is the same for all nucleotides and second,
the probability of T → A is the same as A → T et cetera. In other words, fii = fjj,
fij = fji for all i, j. These restrictions can be seen in the model below:
F∗i,j =
         A    C    G    T
    A   α0   α1   α2   α3
    C   α1   α0   α3   α2
    G   α2   α3   α0   α1
    T   α3   α2   α1   α0
The probability of not transitioning can be assumed to be the same, because the
probability of transitioning is essentially the probability of making a mistake in
copying, and a mistake is equally likely for any state. However, given that a mistake
will be made at a certain site, the probability of what the mistake will be depends
on the current state. Additionally, the condition fij = fji for all i, j, can be thought
of as representing an equal probability of making or breaking molecular bonds. To
transition in one direction, bonds will have to be made that in the other direction
would have to be broken.
Once we make this assumption of symmetry, we can consider our alphabet to
be a finite abelian group G and all our calculations become easier. In this case
{A, C, G, T} ≅ Z2 × Z2. This is the appropriate choice, rather than Z4, because
Z2 × Z2 reflects the complementary relationship between the purines and the
pyrimidines, and the base pairs. The correspondence is as follows: A ≅ (0, 0),
C ≅ (0, 1), G ≅ (1, 0), T ≅ (1, 1). There is an additional property of the structure
of the transition matrix above: the (i, j)th entry is determined by the difference of
the ith and jth group elements.
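This correspondence can be made concrete. A short sketch, using the identification above, checking that i − j in Z2 × Z2 is just componentwise addition mod 2 (since every element is its own inverse):

```python
# The identification A=(0,0), C=(0,1), G=(1,0), T=(1,1) from the text.
ELEM = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def diff(i, j):
    # In Z2 x Z2 every element is its own inverse, so i - j = i + j,
    # computed componentwise mod 2.
    (a, b), (c, d) = ELEM[i], ELEM[j]
    return ((a + c) % 2, (b + d) % 2)

# For a group based model, matrix entries with the same difference share a
# parameter: e.g. the (A,T) and (C,G) entries of F* are both alpha3.
print(diff("A", "T"), diff("C", "G"))  # both (1, 1)
```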
Definition 3.1. Let G be a finite group where each element i ∈ G represents a
discrete value, and F∗i,j the transition probability matrix for G. If fii = fjj and
fij = fji for all i, j ∈ G, and fij = h(i − j) for some function h, we call F∗i,j a
group based model.
Our matrix F∗i,j is called the Kimura-3 (K81) parameter model, where the fourth
parameter can be written in terms of the other three. The Kimura-2 (K80) pa-
rameter model is a special case of the Kimura-3 parameter model, where α1 = α3.
The Kimura-2 model is relevant because a transition from purine to purine (or from
pyrimidine to pyrimidine) has a higher probability than a transversion from one type
to the other, since a transversion would be a more radical change of structure, or
a more extreme mistake. Additionally, the Jukes-Cantor (JC69) model is a special
case where α1 = α2 = α3; in other words, there is a single probability of changing
state and a single probability of not changing. This simplified model is feasible since
the probability of changing state is usually quite low.
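The nesting of these models can be sketched directly from the matrix F∗ above. In the following sketch (function names are ours, not from the thesis code), the diagonal entry α0 is fixed by the row-sum-1 condition, and α2 plays the role of the transition parameter, matching the (A, G) and (C, T) entries of F∗:

```python
import numpy as np

def kimura3(a1, a2, a3):
    # Kimura-3 transition matrix over (A, C, G, T), following F* in the text;
    # the diagonal entry a0 is fixed by requiring each row to sum to 1.
    a0 = 1.0 - (a1 + a2 + a3)
    return np.array([[a0, a1, a2, a3],
                     [a1, a0, a3, a2],
                     [a2, a3, a0, a1],
                     [a3, a2, a1, a0]])

def kimura2(transition, transversion):
    # K80: the two transversion parameters coincide (alpha1 = alpha3).
    return kimura3(transversion, transition, transversion)

def jukes_cantor(a):
    # JC69: alpha1 = alpha2 = alpha3.
    return kimura3(a, a, a)
```

Each specialization simply equates parameters of the more general model, mirroring the hierarchy in Figure 3.1.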
The class of group based models is a subset of all evolutionary models. Figure
3.1 depicts the Felsenstein Hierarchy of Evolutionary Models [9]. This shows the
relationship between models, and how some are special cases of others.
Figure 3.1: The Felsenstein Hierarchy of Evolutionary Models from [9]
3.3 Non-homogeneous Models
Ancestral lineages for taxa may evolve at different rates. While it is simpler
to assume that all taxa have evolved at the same rate along their lineages, a non-
homogeneous model is more realistic. In Figure 3.2 we see a three taxa tree, where
each of the four branches is labeled with its own probability transition matrix
f1, . . . , f4.
Figure 3.2: Non-homogeneous Model of Transition Probabilities.
For a given site in the gene under consideration, suppose that the ancestral taxon r
is in state Xr ∈ G, where G is the state space. What is the probability that the
ith taxon will be in any one of the possible states, given that the ancestral taxon
was in state Xr? This is simply given by the transition probability matrix f1. To
find the probability that the jth or kth taxon is in any of the possible states, we
simply multiply the transition probability matrices along the path from the root to
that taxon. For example, to find the probabilities of transitioning from state to
state for taxon j, we multiply f2 by f3.
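For a binary state space this path computation is just matrix multiplication. A toy sketch with made-up transition matrices (the numerical values are illustrative, not from the thesis):

```python
import numpy as np

# Hypothetical binary-state transition matrices for two consecutive branches:
# f2 on the edge from the root to the internal node, f3 from that node to taxon j.
f2 = np.array([[0.9, 0.1],
               [0.1, 0.9]])
f3 = np.array([[0.8, 0.2],
               [0.2, 0.8]])

# Transition probabilities from the root all the way to taxon j.
root_to_j = f2 @ f3
```

Entry (0, 0) is P(j = 0 | root = 0) = 0.9 · 0.8 + 0.1 · 0.2 = 0.74, summing over the two possible states at the internal node.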
Chapter 4
The Algebraic Statistics of Phylogenetic
Trees
We have now described the class of evolutionary models we will be working with.
These models describe the rates of transitioning from one state to another along
a single lineage. However, since we are interested in inferring the evolutionary
relationship between a set of modern taxa, we want to say something about the
state of each taxon. Thus we consider the probability of observing patterns of states
for all taxa at a given site in the gene. These probabilities are polynomials for
which we can find relations, relying on the rich theory of algebraic geometry, which
we explain below.
4.1 Site Pattern Probabilities
Consider the claw tree for three taxa as seen in Figure 4.1. For a fixed site in the
gene under consideration, we want to consider the probability that i is in a particular
state, and j is in a particular state, and k is in a particular state. In other words, we
want to consider for some state space G, the probability P(i = Xi, j = Xj, k = Xk)
where the X’s are states from G. We will use the shorthand notation
P(i = Xi, j = Xj, k = Xk) = pijk
to indicate the probability of observing the site pattern such that i = Xi, j = Xj, k =
Xk.
Figure 4.1: The 3-Taxa Claw Tree
Definition 4.1. For a state space G and n taxa, a site pattern probability is
the probability of observing a certain assignment of states for each of the n taxa:
P(i1 = Xi1, . . . , in = Xin) = pi1,...,in.
Note that there are |G|^n possible site patterns and that summing over all possible
site pattern probabilities we get

    Σ_{Xi1,...,Xin ∈ G} P(i1 = Xi1, . . . , in = Xin) = 1.
Suppose for the three taxa claw tree we have a state space of {0, 1} at the root and
each leaf. As seen in Figure 4.1, the branch with leaf i has the transition matrix
f1, the branch with leaf j has matrix f2, and the branch with leaf k has f3. In
particular:

    f1 = ( α0  α1 ),   f2 = ( β0  β1 ),   f3 = ( γ0  γ1 ).
         ( α1  α0 )         ( β1  β0 )         ( γ1  γ0 )
We also must define a root distribution, say f0, such that f0(0) = π0, f0(1) = π1,
where π0 + π1 = 1. To find p000 = P(i = j = k = 0), we first consider each possible
state at the root. If the root has value 0, then the transition probability matrices
contribute π0α0β0γ0. Next, consider if our root has value 1; then they contribute
π1α1β1γ1, so p000 = π0α0β0γ0 + π1α1β1γ1. Similar
reasoning gives the parametrization:
p000 = π0α0β0γ0 + π1α1β1γ1, p110 = π0α1β1γ0 + π1α0β0γ1,
p100 = π0α1β0γ0 + π1α0β1γ1, p101 = π0α1β0γ1 + π1α0β1γ0,
p010 = π0α0β1γ0 + π1α1β0γ1, p011 = π0α0β1γ1 + π1α1β0γ0,
p001 = π0α0β0γ1 + π1α1β1γ0, p111 = π0α1β1γ1 + π1α0β0γ0.
Observe that we have 8 binomials in terms of 8 parameters. We comment more on
this later.
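The parametrization above can be reproduced mechanically by marginalizing over the root state. A sketch under the binary model of this section (the helper name is ours):

```python
from itertools import product

def site_pattern_probs(pi, f1, f2, f3):
    # p_{ijk} = sum over root states r of pi[r] * f1[r][i] * f2[r][j] * f3[r][k],
    # exactly the parametrization given for the three-taxon claw tree.
    return {
        (i, j, k): sum(pi[r] * f1[r][i] * f2[r][j] * f3[r][k] for r in (0, 1))
        for i, j, k in product((0, 1), repeat=3)
    }

# Illustrative parameter values (not from the thesis).
pi = (0.6, 0.4)
f1 = ((0.9, 0.1), (0.1, 0.9))  # alpha0, alpha1
f2 = ((0.8, 0.2), (0.2, 0.8))  # beta0, beta1
f3 = ((0.7, 0.3), (0.3, 0.7))  # gamma0, gamma1
p = site_pattern_probs(pi, f1, f2, f3)
```

Here p[(0, 0, 0)] equals π0α0β0γ0 + π1α1β1γ1, and the eight site pattern probabilities sum to 1.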
4.2 The Algebraic Variety of a Statistical Model
The site pattern probabilities seen in the section above are polynomials in terms
of the parameters from the probability transition matrices. In fact, the heart of
algebraic statistics is that a statistical model of a discrete random variable is a map
from the parameter space to the probability space such that the relations on the
image form an algebraic variety. The generators of the ideal defining the variety are
functions of the probabilities such that for any possible values of the parameters,
the function vanishes. We call these model invariants.
For a discrete random variable X with a state space of {1, . . . , n}, a probability
distribution is a point (p1, . . . , pn) ∈ R^n where pi ≥ 0 for all i and Σ_{i=1}^n pi = 1.
While biologists are only interested in real valued results, to use the
tools of algebraic geometry, we consider the ambient space taken over the complex
numbers. Thus we consider our statistical model within the complex numbers and
define it as:
f : C^d → C^n,
which is a polynomial map from the parameter space to the probability space with
coordinate functions f1, . . . , fn. The topological closure of the image of f is an
algebraic variety, since the image of f is a Boolean combination of varieties [9]. This
would not be true over the real numbers, and so we do our work in the complex
space and introduce real constraints on the model later. Now we define If as the
set of all polynomials that vanish on the set f(C^d). This means that If is the ideal
corresponding to the closure of f(C^d). Fortunately, Hilbert’s Basis Theorem tells
us that this ideal can be represented by finitely many generators. This is important
because the elements of If are our invariants with which we will test our model.
Clearly it is desirable to have a finite set of invariants to test.
4.3 The Hardy-Weinberg Model Invariants
Before we dive into the technicalities of invariants for phylogenetic trees, we present
a simple example also drawn from molecular evolution, but not related to trees.
Consider the Hardy-Weinberg model for the inheritance of genes in diploid organisms.
Figure 4.2: The Hardy-Weinberg Curve in the Probability Simplex from [9]
For each gene, a diploid organism has two different varieties of the gene, called
alleles. The offspring inherits one allele from each parent. If there are only two pos-
sible alleles in the population, A and a, and if θA and θa are the frequencies of each
allele in that population, then an arbitrary offspring has a probability θA² of
inheriting AA, θa² of inheriting aa, and 2θAθa of inheriting Aa. Since θA + θa = 1,
we can rewrite θa = 1 − θA. Since we can write our map in terms of one parameter,
let us call it θ such that θ = θA. This describes a map from the parameter space to
the probability space:

    f : R → R³  such that  θ → (θ², 2θ(1 − θ), (1 − θ)²).
The trivial invariant for this map is given by the restriction on a probability model,
x + y + z = 1. The only non-trivial invariant for this model is y² − 4xz, since
(2θ(1 − θ))² − 4θ²(1 − θ)² = 0 for any possible value of θ. In Figure 4.2, we see this
curve within the probability simplex, the polygon in which all possible probability
models can exist.
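The vanishing of both invariants on the whole model is easy to verify numerically. A minimal check (our own sketch, not thesis code):

```python
def hardy_weinberg(theta):
    # The map theta -> (theta^2, 2 theta (1 - theta), (1 - theta)^2).
    return theta**2, 2 * theta * (1 - theta), (1 - theta)**2

# Both the trivial invariant x + y + z - 1 and the model invariant y^2 - 4xz
# vanish at every point of the model, for every parameter value.
for t in (0.0, 0.1, 0.37, 0.5, 0.99, 1.0):
    x, y, z = hardy_weinberg(t)
    assert abs(x + y + z - 1) < 1e-12
    assert abs(y**2 - 4 * x * z) < 1e-12
```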
Chapter 5
Invariants for Group Based Models
We want to find the invariants for our tree and evolutionary model. This model is
a map from the transition probability parameters to the site pattern probabilities,
and the relations on the image are called phylogenetic invariants. For
group based models, there is a clever change of coordinates that simplifies our site
pattern probabilities and results in a toric variety. In this section we describe this
transformation and give examples. Also, under certain assumptions, there are site
pattern probabilities that are equal to each other, forming equivalence classes.
5.1 The Discrete Fourier Transform
We continue with the 3 taxa claw tree. As we mentioned in Section 4.1, the site
pattern probabilities are 8 binomials in terms of 8 parameters. We can make it
easier to find the relations on these probabilities by performing a linear change of
coordinates: the discrete Fourier transform. This simplifies our work by turning the
above polynomial parameterizations into monomials. We define this transformation
below.
Definition 5.1. For a group G and any function f : G → C there exists a function
f̂ : Ĝ → C, where Ĝ = Hom(G, C∗) is the dual group. This is the group of all
homomorphisms χ : G → C∗, where C∗ = C \ {0}. Then the discrete Fourier
transform of f is given by

    f̂(χ) = Σ_{g∈G} χ(g)f(g).
We use this formula to transform our edge parameters fi, thereby transforming our
probabilities as well. The set of invariants for the transformed probabilities turns
out to be a toric ideal, which is well known to behave nicely. We also note that
G ≅ Ĝ. When working with binary sequence data and a group based model, we use
Z2 with the dual group Ẑ2 ≅ Z2. The members of Ẑ2 are homomorphisms, which
we call characters. In particular, Ẑ2 = {χ0, χ1} where χ0, χ1 : Z2 → C∗ such that:

    χ0(0) = 1, χ0(1) = 1, and χ1(0) = 1, χ1(1) = −1,

which must be defined as such in order to satisfy the properties of a group as elements
of Ẑ2 as well as the properties of homomorphisms. This can also be represented
as a character table [9]:
0 1
χ0 1 1
χ1 1 −1
Now we transform the functions fi:

f̂i(χ0) = Σ_{g∈Z2} χ0(g) fi(g) = χ0(0)fi(0) + χ0(1)fi(1) = fi(0) + fi(1)

f̂i(χ1) = Σ_{g∈Z2} χ1(g) fi(g) = χ1(0)fi(0) + χ1(1)fi(1) = fi(0) − fi(1),
which we can find since the fi are given by the transition probability matrices:

f0(0) = π0, f1(0) = α0, f2(0) = β0, f3(0) = γ0,
f0(1) = π1, f1(1) = α1, f2(1) = β1, f3(1) = γ1.
Plugging them in we get:

f̂0(χ0) = π0 + π1 =: r0,   f̂2(χ0) = β0 + β1 =: b0,
f̂0(χ1) = π0 − π1 =: r1,   f̂2(χ1) = β0 − β1 =: b1,
f̂1(χ0) = α0 + α1 =: a0,   f̂3(χ0) = γ0 + γ1 =: c0,
f̂1(χ1) = α0 − α1 =: a1,   f̂3(χ1) = γ0 − γ1 =: c1,

where we rename our parameters.
The monomial qrst is obtained via equation (10) from Sturmfels & Sullivant [12],

q(χ1, χ2, χ3) = π̂(χ1χ2χ3) · Π_{i=1}^{3} f̂i(χi),

where we recall that the characters behave as Z2 under addition:

q000 = π̂(χ0χ0χ0)[f̂1(χ0)f̂2(χ0)f̂3(χ0)] = π̂(χ0)[a0b0c0] = r0a0b0c0,
q100 = π̂(χ1χ0χ0)[f̂1(χ1)f̂2(χ0)f̂3(χ0)] = π̂(χ1)[a1b0c0] = r1a1b0c0.
We find the rest of the qrst similarly,
q010 = r1a0b1c0, q001 = r1a0b0c1,
q110 = r0a1b1c0, q101 = r0a1b0c1,
q011 = r0a0b1c1, q111 = r1a1b1c1.
In order to find the relations among the monomials qrst, we encode this information
in the 8 × 8 binary table,
q000 q001 q010 q011 q100 q101 q110 q111
r0 1 0 0 1 0 1 1 0
r1 0 1 1 0 1 0 0 1
a0 1 1 1 1 0 0 0 0
a1 0 0 0 0 1 1 1 1
b0 1 1 0 0 1 1 0 0
b1 0 0 1 1 0 0 1 1
c0 1 0 1 0 1 0 1 0
c1 0 1 0 1 0 1 0 1
and use 4ti2 [6] to find a minimal generating set of the ideal. In 4ti2, we find the
Markov basis:

{q000q111 − q001q110, q000q111 − q010q101, q100q011 − q000q111}.
As can be seen, this is a toric ideal, since the generators are differences of monomials.
While in the original coordinates the relations formed an irreducible variety,
the fact that the Fourier transform gives a toric ideal makes finding the phylogenetic
invariants computationally feasible.
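The Markov basis can be verified numerically: for any choice of Fourier parameters, the monomials qrst above satisfy each generator. A small Python sketch (the thesis computations themselves used Mathematica and 4ti2; the parameter values below are arbitrary):

```python
from itertools import product

def q_claw(r, a, b, c):
    """Monomials q_ijk = r_{i+j+k} a_i b_j c_k for the 3 taxa claw tree,
    with subscript addition taken in Z_2."""
    return {(i, j, k): r[(i + j + k) % 2] * a[i] * b[j] * c[k]
            for i, j, k in product((0, 1), repeat=3)}

# Hypothetical Fourier parameter values.
q = q_claw(r=(1.0, 0.4), a=(0.9, 0.3), b=(0.8, 0.2), c=(0.7, 0.1))
generators = [
    q[(0, 0, 0)] * q[(1, 1, 1)] - q[(0, 0, 1)] * q[(1, 1, 0)],
    q[(0, 0, 0)] * q[(1, 1, 1)] - q[(0, 1, 0)] * q[(1, 0, 1)],
    q[(1, 0, 0)] * q[(0, 1, 1)] - q[(0, 0, 0)] * q[(1, 1, 1)],
]
assert all(abs(g) < 1e-12 for g in generators)
```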
Finally, we convert the above Gröbner basis back to the pijk via equation (3) in
[12]:

qijk = Σ_{r=0}^{1} Σ_{s=0}^{1} Σ_{t=0}^{1} (−1)^{ir+js+kt} prst,

and obtain [?],
q001q110 − q000q111 = p001p010 − p000p011 + p001p100 − p000p101 − p011p110 − p101p110 + p010p111 + p100p111
q100q011 − q000q111 = p001p100 + p010p100 − p000p101 − p011p101 − p000p110 − p011p110 + p001p111 + p010p111
q010q101 − q000q111 = p001p010 − p000p011 + p010p100 − p011p101 − p000p110 − p101p110 + p001p111 + p100p111.
This is the complete set of invariants for the 3 taxa binary model, giving the
relationship between the Fourier Transformed probabilities and the original proba-
bilities.
5.2 The Sturmfels-Sullivant Mapping
As it happens, there is a less cumbersome way of transforming the pijk to qrst. In
the example above, the parametrization for the 3 taxa claw tree K1,3 is

qijk → ri+j+k ai bj ck.

For example,

q001 → r0+0+1 a0 b0 c1 = r1 a0 b0 c1.
However, this is just the mapping for the 3 taxa claw tree. It turns out that this
mapping works for any tree and is easiest to explain descriptively. There is one
Fourier transformed parameter for each edge on the tree, as well as one for the root.
The subscript of each parameter is simply determined by adding the states of the
taxa below that edge.

Figure 5.1: The Fourier Transformed Parameterization of the 3 Taxa Giraffe Tree.
To see another example, consider the 3 taxa giraffe tree as seen in Figure 5.1. Here
the map is

qijk → ri+j+k ai bj+k cj dk,

where, for example,

q001 → r0+0+1 a0 b0+1 c0 d1 = r1 a0 b1 c0 d1.

The versatility of this approach allows us to use any state space associated with a
group based model. A difference in the state space corresponds to addition (in the
subscripts) within the appropriate group. We will also see that this map works for
the following example.
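The subscript rule can be written down directly. Below is a Python sketch of the mapping for both 3 taxa trees (illustrative only; the parameter names r, a, b, c, d follow the figures above, and with binary states group addition is addition mod 2):

```python
def q_subscripts_claw(i, j, k):
    """Subscripts of r, a, b, c in q_ijk for the 3 taxa claw tree."""
    return {'r': (i + j + k) % 2, 'a': i, 'b': j, 'c': k}

def q_subscripts_giraffe(i, j, k):
    """Same rule for the 3 taxa giraffe tree; the internal edge b sits
    above the leaves with states j and k."""
    return {'r': (i + j + k) % 2, 'a': i, 'b': (j + k) % 2, 'c': j, 'd': k}

# q_001 -> r1 a0 b0 c1 on the claw tree, and r1 a0 b1 c0 d1 on the giraffe tree.
assert q_subscripts_claw(0, 0, 1) == {'r': 1, 'a': 0, 'b': 0, 'c': 1}
assert q_subscripts_giraffe(0, 0, 1) == {'r': 1, 'a': 0, 'b': 1, 'c': 0, 'd': 1}
```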
5.3 Example of the Nucleotide Three Taxa Claw Tree
Suppose we have nucleotide data for three taxa for which we want to test the claw
tree K1,3, with the state space is {A, C, G, T} at the root and each leaf. If we assume
the Jukes-Cantor model then the branch with leaf i has the transition matrix f1
, the
branch with leaf j has matrix f2
and the branch with leaf k has f3
. In particular:
f1 =
  [ α0 α1 α1 α1 ]
  [ α1 α0 α1 α1 ]
  [ α1 α1 α0 α1 ]
  [ α1 α1 α1 α0 ]

f2 =
  [ β0 β1 β1 β1 ]
  [ β1 β0 β1 β1 ]
  [ β1 β1 β0 β1 ]
  [ β1 β1 β1 β0 ]

and f3 =
  [ γ0 γ1 γ1 γ1 ]
  [ γ1 γ0 γ1 γ1 ]
  [ γ1 γ1 γ0 γ1 ]
  [ γ1 γ1 γ1 γ0 ].
We now have 4³ = 64 observable sequences of length three, whereas in the previous
example we only had 2³ = 8. We also assume a root distribution

f0(A) = π0, f0(C) = π1, f0(G) = π2, f0(T) = π3.
Thus for pAAA = P(i = j = k = A), first consider the case when the root has value A;
the transition matrices contribute the term π0α0β0γ0. Next consider the case when
the root has value C, which contributes π1α1β1γ1, and so on for all possible
roots, giving:

pAAA = π0α0β0γ0 + π1α1β1γ1 + π2α1β1γ1 + π3α1β1γ1 = π0α0β0γ0 + (π1 + π2 + π3)α1β1γ1.
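The same computation can be scripted. The Python sketch below (illustrative; not the thesis code) sums over the four possible root states, where each transition matrix contributes its diagonal entry when the root is A and an off-diagonal entry otherwise:

```python
def p_AAA(pi, alpha, beta, gamma):
    """pi = (pi0, ..., pi3); alpha, beta, gamma = (diagonal, off-diagonal) entries."""
    total = 0.0
    for root in range(4):                         # root states A, C, G, T
        a = alpha[0] if root == 0 else alpha[1]   # f1[root][A]
        b = beta[0] if root == 0 else beta[1]     # f2[root][A]
        c = gamma[0] if root == 0 else gamma[1]   # f3[root][A]
        total += pi[root] * a * b * c
    return total

# Hypothetical parameter values; the closed form from the text must agree.
pi, al, be, ga = (0.1, 0.2, 0.3, 0.4), (0.7, 0.1), (0.4, 0.2), (0.55, 0.15)
closed_form = pi[0]*al[0]*be[0]*ga[0] + (pi[1] + pi[2] + pi[3])*al[1]*be[1]*ga[1]
assert abs(p_AAA(pi, al, be, ga) - closed_form) < 1e-12
```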
The remaining 63 probabilities are also four-term polynomials in eight variables.
Again, we'd like to transform our pijk into qijk. This time our state space
{A, C, G, T} is isomorphic to the group Z2 × Z2, and so our dual group is
(Z2 × Z2)^ = {χ0, χ1, χ2, χ3} such that χi : Z2 × Z2 → C*. While these groups are
isomorphic, we use the alphabet for ease of reading, while Z2 × Z2 is more useful
for actual computations. Thus we have the character table
A C G T
χ0 1 1 1 1
χ1 1 −1 1 −1
χ2 1 1 −1 −1
χ3 1 −1 −1 1
Applying the Fourier transform to our transition rate matrices fi, we get

f̂i(χ0) = Σ_{g∈{A,C,G,T}} χ0(g) fi(g) = fi(A) + fi(C) + fi(G) + fi(T)

f̂i(χ1) = Σ_{g∈{A,C,G,T}} χ1(g) fi(g) = fi(A) − fi(C) + fi(G) − fi(T)

f̂i(χ2) = Σ_{g∈{A,C,G,T}} χ2(g) fi(g) = fi(A) + fi(C) − fi(G) − fi(T)

f̂i(χ3) = Σ_{g∈{A,C,G,T}} χ3(g) fi(g) = fi(A) − fi(C) − fi(G) + fi(T)
As given by our probability transition matrices,

f0(A) = π0, f1(A) = α0, f2(A) = β0, f3(A) = γ0,
f0(C) = π1, f1(C) = α1, f2(C) = β1, f3(C) = γ1,
f0(G) = π2, f1(G) = α1, f2(G) = β1, f3(G) = γ1,
f0(T) = π3, f1(T) = α1, f2(T) = β1, f3(T) = γ1.
Plugging these in, we get

f̂0(χ0) = f0(A) + f0(C) + f0(G) + f0(T) = π0 + π1 + π2 + π3 =: r0
f̂1(χ0) = f1(A) + f1(C) + f1(G) + f1(T) = α0 + 3α1 =: a0
f̂2(χ0) = f2(A) + f2(C) + f2(G) + f2(T) = β0 + 3β1 =: b0
f̂3(χ0) = f3(A) + f3(C) + f3(G) + f3(T) = γ0 + 3γ1 =: c0
f̂0(χ1) = f0(A) − f0(C) + f0(G) − f0(T) = π0 − π1 + π2 − π3 =: r1
f̂1(χ1) = f1(A) − f1(C) + f1(G) − f1(T) = α0 − α1 =: a1
f̂2(χ1) = f2(A) − f2(C) + f2(G) − f2(T) = β0 − β1 =: b1
f̂3(χ1) = f3(A) − f3(C) + f3(G) − f3(T) = γ0 − γ1 =: c1
f̂0(χ2) = f0(A) + f0(C) − f0(G) − f0(T) = π0 + π1 − π2 − π3 =: r2
f̂0(χ3) = f0(A) − f0(C) − f0(G) + f0(T) = π0 − π1 − π2 + π3 =: r3
and to summarize the rest, it is easily seen from the character table that

f̂1(χ1) = f̂1(χ2) = f̂1(χ3) = α0 − α1 =: a1
f̂2(χ1) = f̂2(χ2) = f̂2(χ3) = β0 − β1 =: b1
f̂3(χ1) = f̂3(χ2) = f̂3(χ3) = γ0 − γ1 =: c1
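These identities follow mechanically from the character table. The Python sketch below (a stand-in for the thesis Mathematica code, with hypothetical entry values) applies the transform to a Jukes-Cantor edge function and recovers exactly the coefficients derived above:

```python
# Character table of the dual of Z_2 x Z_2, on states ordered (A, C, G, T).
CHARS = [(1, 1, 1, 1), (1, -1, 1, -1), (1, 1, -1, -1), (1, -1, -1, 1)]

def fourier(f):
    """Discrete Fourier transform of f : {A, C, G, T} -> C, given as the
    tuple of values (f(A), f(C), f(G), f(T))."""
    return [sum(chi[g] * f[g] for g in range(4)) for chi in CHARS]

alpha0, alpha1 = 0.7, 0.1                  # hypothetical Jukes-Cantor entries
a = fourier((alpha0, alpha1, alpha1, alpha1))
assert abs(a[0] - (alpha0 + 3 * alpha1)) < 1e-12                  # a_0
assert all(abs(ai - (alpha0 - alpha1)) < 1e-12 for ai in a[1:])   # a_1, three times
```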
To find the monomials qrst we again use equation (10) from Sturmfels & Sullivant [12]:

q(χ1, χ2, χ3) = π̂(χ1χ2χ3) · Π_{i=1}^{3} f̂i(χi).

For example, we get:

qAAA = π̂(χ0χ0χ0)[f̂1(χ0)f̂2(χ0)f̂3(χ0)] = π̂(χ0)[a0b0c0] = r0a0b0c0,
qAAC = π̂(χ0χ0χ1)[f̂1(χ0)f̂2(χ0)f̂3(χ1)] = π̂(χ1)[a0b0c1] = r1a0b0c1.
Similarly, we find the rest of the 64 monomial parameterizations in the Fourier
coordinates for which we want to find invariants.
5.4 Equivalence Classes
Finding the relations on all 64 monomial probabilities is still computationally
expensive. We can make a powerful assumption that reduces the number of probabilities.
Here we suppose that the root has a uniform distribution, or in other words
that π0 = π1 = π2 = π3. As can be seen in our calculations of the Fourier transform
above, this means that r1 = r2 = r3 = 0. Thus, any qijk with one of these parameters
is eliminated, or rather, placed in a "zero" equivalence class.
We also used Mathematica [13] to code the change of coordinates using the following
facts:

1. qijk → ri+j+k ai bj ck, where addition in the subscript of r is addition in Z2 × Z2.

2. If in ≠ i1 + i2 + . . . + in−1 in the group, then qi1...in = 0. This follows from
assuming that the root distribution is the uniform distribution in the same
way as we saw above [12].
A further simplification occurs when we look at the matrix to be consumed by 4ti2:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0
Notice that some columns are identical. This indicates that we can group the qijk
into equivalence classes. We have omitted the column headers owing to space
restrictions, but a glance through the Mathematica code [13] will give the following
list of equivalence classes.
Class 1: qAAA
Class 2: qACC, qAGG, qATT
Class 3: qCAC, qGAG, qTAT
Class 4: qCCA, qGGA, qTTA
Class 5: qCGT , qCTG, qGCT , qGTC, qTCG, qTGC
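These classes can be recovered programmatically from the two facts above. The Python sketch below (a stand-in for our Mathematica code) keeps the 16 site patterns whose states sum to the identity in Z2 × Z2 and groups them by which Fourier parameters appear in their monomials:

```python
from itertools import product

# Z_2 x Z_2 labels for the states; group addition is bitwise XOR.
STATES = {'A': (0, 0), 'C': (0, 1), 'G': (1, 0), 'T': (1, 1)}

def classes():
    """Under a uniform root, q_ijk = 0 unless i + j + k is the identity.
    Survivors are grouped by whether each leaf state is the identity A,
    since under Jukes-Cantor all non-identity Fourier parameters coincide."""
    groups = {}
    for i, j, k in product('ACGT', repeat=3):
        gi, gj, gk = STATES[i], STATES[j], STATES[k]
        if (gi[0] ^ gj[0] ^ gk[0], gi[1] ^ gj[1] ^ gk[1]) != (0, 0):
            continue                                  # the "zero" class
        signature = tuple(s == 'A' for s in (i, j, k))
        groups.setdefault(signature, []).append(i + j + k)
    return groups

sizes = sorted(len(patterns) for patterns in classes().values())
assert sizes == [1, 3, 3, 3, 6]        # Classes 1-5 listed above
```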
The remaining qijk reside in Class 0, when assuming a uniform distribution. Deleting
all duplicate columns we use the following matrix for 4ti2 :
q1 q2 q3 q4 q5
r0 1 1 1 1 1
r1 0 0 0 0 0
a0 1 1 0 0 0
a1 0 0 1 1 1
b0 1 0 1 0 0
b1 0 1 0 1 1
c0 1 0 0 1 0
c1 0 1 1 0 1
Here qi corresponds to class i. The Gröbner basis of the toric ideal of algebraic
relations is:

{q2q3q4 − q1q5²}.
Our assumption of the uniform distribution at the root creates equivalence classes
among the qijk and results in only one equation defining the toric ideal of phylogenetic
invariants. Based on the number of site pattern probabilities in each equivalence
class, this one invariant generates 1 · 3 · 3 · 3 · 6 = 162 invariants, once we
substitute in all the possible q's from each equivalence class.
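The single generator is easy to confirm numerically. Writing the class representatives in the Fourier parameters (uniform root, so only r0 survives), we have q1 = r0a0b0c0, q2 = r0a0b1c1, q3 = r0a1b0c1, q4 = r0a1b1c0, and q5 = r0a1b1c1, and the relation vanishes identically. A Python sketch with hypothetical parameter values:

```python
def class_representatives(r0, a, b, c):
    """Fourier monomials for the five equivalence class representatives."""
    q1 = r0 * a[0] * b[0] * c[0]   # qAAA
    q2 = r0 * a[0] * b[1] * c[1]   # qACC
    q3 = r0 * a[1] * b[0] * c[1]   # qCAC
    q4 = r0 * a[1] * b[1] * c[0]   # qCCA
    q5 = r0 * a[1] * b[1] * c[1]   # qCGT
    return q1, q2, q3, q4, q5

q1, q2, q3, q4, q5 = class_representatives(1.0, (0.9, 0.3), (0.8, 0.2), (0.7, 0.1))
assert abs(q2 * q3 * q4 - q1 * q5**2) < 1e-12   # the single toric generator
```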
We can choose to reduce the number of invariants by removing only the zero class
equivalent variables, by removing the non-zero class equivalent variables, or by
removing both. We explore the efficacy of some of these different options in this
thesis.
Chapter 6
Inequalities on the Discrete Fourier
Transform
6.1 Motivation and Definitions
There are other constraints on the probability space which we use. In fact, in 2008
Matsen [8] proved and described the complete set of such constraints in the form of
inequalities. The observed site pattern frequencies should roughly satisfy our
invariants in order for a tree to be plausible. But there may be trees that sit in the
probability simplex, satisfying the phylogenetic invariants, which are not biologically
meaningful. By this, we mean that they may have negative branch lengths.
First, we observe that for any edge on the tree, our transition probability matrix is
the exponential of the instantaneous rate matrix

F(e) = exp(te Q(e)),

where te is some positive length of time. Since F(e) is a probability transition
matrix, it must have positive entries, and its rows must sum to 1. Equivalently,
this means that Q(e) must have non-negative off diagonal entries, where each diagonal
entry is the negative sum of the rest of the entries in that row. In other words,
Q(e) = [qij] where qij ≥ 0 for i ≠ j and qii = −Σ_{j≠i} qij. Usually te is absorbed
into Q(e), and can only be separated by using a molecular clock. This does not affect
the aforementioned properties.
For binary data, we will have a 2 × 2 probability transition matrix

F(e) =
  [ α0 α1 ]
  [ α1 α0 ].

Applying our Fourier transform, we get

F̂(e) =
  [ α0 + α1      0     ]     [ 1  0 ]
  [     0     α0 − α1  ]  =  [ 0  θ ]

where θ = exp(−2γ) and γ is the branch length. Casanellas and Sanchez [2] observed
that the need for non-negative off diagonal entries in Q(e) was equivalent to θ ≥ 0.
They also observed (2007) that this was not sufficient to guarantee non-negative
branch lengths. It was Matsen who observed that θ must also be at most one,
since γ is a non-negative length of time.
To understand why the inequality 0 ≤ θ ≤ 1 is useful, let's look at the algebraic
variety defined by the map from the Fourier parameter space to the Fourier
probability space. For a bifurcating three taxa tree, the giraffe tree, there are four
branches. Since we are using non-homogeneous models, each branch has its own
probability transition matrix, and hence its own θ. Thus there is a map

φ : C⁴ → C⁸
φ : (θ1, θ2, θ3, θ4) → (q000, q001, . . . , q111)

such that the Zariski closure of this image is an algebraic variety, as mentioned
earlier in this paper. However, not all points in this set correspond to biologically
meaningful trees. It turns out that

Im φ(Cⁿ) ∩ Rⁿ ≠ Im φ(Rⁿ).

In other words, a point in the left hand side might be the image of a complex valued
point. In fact,

Im φ((0,1]ⁿ) ⊂ Im φ(Rⁿ) ⊂ Im φ(Cⁿ) ∩ Rⁿ
where all the containments are strict.

We needed to define our map over the complex numbers in order to use algebraic
geometry and have the image of our Fourier transformed parameterization be a toric
variety, with all its inherently nice properties. Thus, we need some way to reduce
the tree space back down to meaningful numbers. We can cut down our image to
the set of all points that correspond to non-negative branch lengths with the
inequalities.
6.2 Pendant and Internal Edge Inequalities
Matsen gives us the two formulas for generating the full set of inequality constraints.
These inequalities are generated from the Fourier transformed probabilities and are
determined either by conditions on the pendant edges or conditions on the internal
edges. The inequalities are based on unrooted trees, and so they apply to all rooted
trees associated therewith. We give the following propositions without proof, and
refer the reader to Matsen's paper [8] for details. We do, however, give illustrative
examples. For n taxa and a state space G for our genetic structure, we have the
following two formulas:

Proposition 6.1. (Matsen 2008) Given some pendant edge e, let i denote the leaf
on e and let ν be the internal node on e. Pick any leaves j and k distinct from i
such that the path p(j, k) contains ν. Let w(gi, gj, gk) ∈ Gⁿ assign state gx to leaf x
for x ∈ {i, j, k} and the identity to all other leaves. Then

[f(e)(h)]² = ( qw(h,−h,0) · qw(−h,0,h) ) / qw(0,−h,h)
Example 6.1 (Pendant Edge Inequalities for the 3 Taxa Unrooted Tree). For each
pendant edge on the 3 taxa unrooted tree, there are 2 leaves that satisfy the
conditions for j and k. If we were to consider both cases, one where the first leaf is
assigned to j and the second leaf is assigned to k, as well as the opposite case, we
would actually get the same inequality. Thus, for this tree, each pendant edge gives
one inequality, yielding three in total. They are:

(q110 · q101) / q011 = [f(e)(1)]² ≤ 1
(q101 · q011) / q110 = [f(e)(1)]² ≤ 1
(q110 · q011) / q101 = [f(e)(1)]² ≤ 1
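These ratios can be evaluated directly from Fourier transformed values. In the Python sketch below (illustrative; not the thesis Perl code), the q's are built from hypothetical per-edge parameters θi ∈ (0, 1], so each ratio recovers the square of one θ and all three inequalities hold:

```python
def pendant_ratios(q):
    """The three pendant edge ratios for the 3 taxa unrooted tree; each must
    lie in [0, 1] for the tree to admit non-negative branch lengths."""
    r1 = q[(1, 1, 0)] * q[(1, 0, 1)] / q[(0, 1, 1)]
    r2 = q[(1, 0, 1)] * q[(0, 1, 1)] / q[(1, 1, 0)]
    r3 = q[(1, 1, 0)] * q[(0, 1, 1)] / q[(1, 0, 1)]
    return r1, r2, r3

t1, t2, t3 = 0.9, 0.5, 0.4          # hypothetical edge parameters theta_i
q = {(1, 1, 0): t1 * t2, (1, 0, 1): t1 * t3, (0, 1, 1): t2 * t3}
ratios = pendant_ratios(q)
assert all(0 <= r <= 1 for r in ratios)
assert abs(ratios[0] - t1**2) < 1e-12   # first ratio equals theta_1 squared
```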
Proposition 6.2. (Matsen 2008) Pick some internal edge e; say the two nodes on
either side of e are ν and ν′. Choose i, j (respectively i′, j′) such that p(i, j)
(respectively p(i′, j′)) contains ν but not ν′ (respectively ν′ but not ν). Let
z(gi, gj, gi′, gj′) ∈ Gⁿ assign state gx to leaf x for x ∈ {i, j, i′, j′} and the identity
to all other leaves. Then

[f(e)(h)]² = ( qz(h,0,−h,0) · qz(0,−h,0,h) ) / ( qz(h,−h,0,0) · qz(0,0,−h,h) )
Example 6.2 (Internal Edge Inequalities for the 4 Taxa Unrooted Tree). For the
internal edge of the 4 taxa unrooted tree, for each choice of i and j there is exactly
one choice for i′ and j′. Again, this is due to symmetries: switching i and j only
yields a repeated inequality. There are only two ways to choose i and j (without
considering ordering), and since there is only one choice for i′ and j′, there are two
internal edge inequalities for the four taxa unrooted tree. They are:

(q1010 · q0101) / (q1100 · q0011) = [f(e)(1)]² ≤ 1,
(q1001 · q0110) / (q1100 · q0011) = [f(e)(1)]² ≤ 1
6.3 Inequalities for Nucleotide Data
We first observe that for both Z2 and Z2 × Z2, each element is its own inverse.
Therefore, to generate the nucleotide inequalities, we simply swap identity elements,
i.e. wherever there is a 0, we replace it with A. Then, for each non-identity element
g ∈ Z2 × Z2, we can replace 1 with g in each inequality. Since there are three
non-identity elements, we end up with three times as many inequalities for the
nucleotide data as for the binary data. For the pendant edge inequalities for the 3
taxa unrooted tree we have:
(qCCA · qCAC) / qACC = [f(e)(C)]² ≤ 1
(qCAC · qACC) / qCCA = [f(e)(C)]² ≤ 1
(qCCA · qACC) / qCAC = [f(e)(C)]² ≤ 1

(qGGA · qGAG) / qAGG = [f(e)(G)]² ≤ 1
(qGAG · qAGG) / qGGA = [f(e)(G)]² ≤ 1
(qGGA · qAGG) / qGAG = [f(e)(G)]² ≤ 1

(qTTA · qTAT) / qATT = [f(e)(T)]² ≤ 1
(qTAT · qATT) / qTTA = [f(e)(T)]² ≤ 1
(qTTA · qATT) / qTAT = [f(e)(T)]² ≤ 1
Chapter 7
Metrics and Methods
We assume aligned genetic data for a set of taxa for which we want to determine
the evolutionary history. First, our method will take the data and calculate the
observed site pattern frequencies. It then calculates the Fourier transformed site
pattern probabilities. For a fixed transition model and for each tree topology we
have an associated set of inequalities and invariants. We need some way to score
each tree topology in order to compare them and choose a “winning tree” in terms
of these scores. In this chapter, we explain our algorithm for choosing the winning
tree, which takes into account the inequalities and invariants scores. We test each
transition model separately, only comparing all topologies within the framework
of one transition model at a time. We compare the transition models themselves in
the results section. In this chapter we explain the methods for calculating those two
scores, as well as the measure of the distance of the true tree from the tree with the
optimal invariants score. We also give a summary of how the main program works
and explain our methods of simulating data.
7.1 Metrics
There are three main scores, or metrics, that we need to define: the inequalities score
for each tree, the invariants score for each tree, and the relative distance of the true
tree's invariants score from the optimal invariants scoring tree. The latter is a metric
involving the invariants scores over all possible tree topologies. The values of the
inequality and invariants scores are random variables whose distributions are unknown.
Thus we cannot perform significance tests, and are left to define metrics, on both
the invariants and the inequalities, that are merely ad hoc. All three are described
in detail below.
7.1.1 Inequalities Score
The inequalities score is simple: calculate the proportion of inequalities satisfied.
Definition 7.1. For a fixed tree with M inequalities, where Z is the number of
inequalities satisfied by a given data set, we define the "inequality score" to be:

D := Z / M.
This score clearly ranges between 0 and 1, where D represents the percentage of
inequalities satisfied and the ideal score is 1, when all inequalities are satisfied.
There is generally much repetition among inequality scores; sometimes all trees
have an inequality score of 1. This makes sense: since the score is merely the
proportion of inequalities satisfied, even if different inequalities are satisfied,
often the same number of inequalities are satisfied. Also, inequality scores occur in
blocks; since the inequalities are on the unrooted tree, every rooted tree associated
with that unrooted tree has the same inequality score.
7.1.2 Invariants Score
In theory, the invariants for the true tree will evaluate to zero. However, the actual
value of each invariant for a given set of data is generally non-zero. We calculate
the average of the absolute values of the evaluated invariants, and call this the
“invariants score:”
Definition 7.2. For a tree with N invariants, where {Y1, . . . , YN} are the invariants
evaluated at the transformed observed frequencies of a given set of data, we define
the distance of the data from the tree to be

D∗ := ( |Y1| + · · · + |YN| ) / N.
While there is little repetition of invariants scores, the differences between the scores
are often quite small. Sometimes there are duplicate scores, and in one case with real
data, all three trees for 3 taxa had the same score. This, however, is not common.
7.1.3 The True Tree’s Relative Distance from the Optimal Scoring
Tree
Our preliminary research showed that the true tree was not often the winning tree,
nor the tree with the lowest invariants score. Recall that since the invariants for
the true tree should evaluate to zero, the optimal invariants score is the one that
is closest to zero. Since we cannot measure statistical significance, we cannot
determine whether a given tree, though not having the lowest invariants score, is
close enough to being the lowest. What we did instead was to measure the distance
of the true tree's score from the lowest scoring tree relative to the interval of the
range of scores.
Definition 7.3. Let {X1, X2, . . . , Xn} be the ordered invariants scores of the n trees,
from minimum to maximum. Let Xt be the invariants score of the true tree. Then

D# := (Xt − X1) / (Xn − X1)
is the relative distance of the true tree from the minimum scoring tree. We chose
to use a relative measure because, as the data and the models vary, the interval
positions and lengths vary. We count the cumulative frequencies of the relative
distances of the true tree scores away from the minimum scoring tree. We report
the cumulative frequencies of the true tree falling into certain percentages of the
intervals. For example, if we refer to the true tree being 20% into the interval, that
means that its distance from the minimum tree is less than or equal to 20% of the
length of the invariants interval. This is not the same thing as saying that the true
tree is in the lowest 20% of tree scores, since the scores of the different trees may
have any distribution and are not uniformly distributed. We look at the relative
distance from the minimum score to answer the question of how often the true tree
appears among trees scoring close to the minimum scoring tree.
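The three metrics are straightforward to implement. A minimal Python sketch (our actual implementation is the Perl code in the appendices):

```python
def inequality_score(num_satisfied, num_inequalities):
    """D = Z / M, the proportion of inequalities satisfied."""
    return num_satisfied / num_inequalities

def invariants_score(evaluated_invariants):
    """D* = mean absolute value of the evaluated invariants."""
    return sum(abs(y) for y in evaluated_invariants) / len(evaluated_invariants)

def relative_distance(scores, true_index):
    """D# = (X_t - X_1) / (X_n - X_1), the true tree's position in the
    interval spanned by all trees' invariants scores."""
    lo, hi = min(scores), max(scores)
    return (scores[true_index] - lo) / (hi - lo)

assert inequality_score(3, 4) == 0.75
assert abs(invariants_score([0.1, -0.3, 0.2]) - 0.2) < 1e-12
assert abs(relative_distance([0.0, 0.2, 0.1], true_index=2) - 0.5) < 1e-12
```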
7.2 Methods
The first step of this project was to generate the constraints for all bifurcating trees
with 3-5 taxa. To find the invariants, we first used Mathematica [13] to produce
the Fourier transformed probabilities in matrix form via the Sturmfels-Sullivant
mapping (see Appendix 10.1). We then input these matrices into a program called
4ti2 [6], which finds the Markov basis of the toric ideal for the given model. We
chose the Markov basis over the Gröbner basis, since the former produces a minimal
generating set whereas the latter does not. The inequalities were easy enough to
generate by hand, using the formulas given by Matsen [8].
7.2.1 The Program
The main program for our research takes as input genetic data and produces the
scores defined in the previous section. First, our program calculates and transforms
the observed site pattern frequencies. That is, it counts the proportion of the time
that each site pattern is seen in the data. It then performs the Fourier transform
on the observed site pattern frequencies. Next, our program loops through each
unrooted tree and plugs these Fourier transformed site pattern frequencies into the
inequality constraints. For each unrooted tree, it loops through all associated rooted
trees and substitutes the Fourier transformed site pattern frequencies into the
invariants.
The program runs several data sets at a time, and calculates the proportion of
times the tree with the lowest invariants score D∗ is the true tree. See Appendix 10.2
for the code for our main program. See Appendix 10.3 for the code that calculates
the relative distance of the true tree from the minimum tree and finds the "winning
tree."
7.2.2 Winning Tree Algorithm
The algorithm specified below is for choosing the “winning” tree. It also counts the
proportion of times the true tree is the winning tree. The algorithm is:
1. Find the maximum inequality score (often it is 1).
2. Find all trees which have the maximum inequality score.
3. Determine which tree with the maximum inequality score has the minimum
invariants score. This is the winning tree.
4. If the winning tree is the true tree, add 1 to the count of frequency of the true
tree winning.
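The steps above can be sketched in a few lines of Python (illustrative; the thesis code is in Appendix 10.3):

```python
def winning_tree(inequality_scores, invariants_scores):
    """Among the trees attaining the maximum inequality score, return the
    index of the one with the minimum invariants score."""
    best = max(inequality_scores)
    candidates = [t for t, d in enumerate(inequality_scores) if d == best]
    return min(candidates, key=lambda t: invariants_scores[t])

# Tree 0 has the lowest invariants score overall, but it is excluded because
# it does not attain the maximum inequality score; tree 2 wins.
assert winning_tree([0.8, 1.0, 1.0], [0.01, 0.05, 0.03]) == 2
```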
This algorithm favors the trees with higher inequality scores. We chose to do this
because the inequalities describe a region, and so it is much more important to make
sure we are within a region. The invariants describe lower dimensional algebraic
varieties in the probabilities, and the data will never fit to them exactly, thus we
expect some deviation from the invariants and prefer to have minimal deviation
from the inequalities.
7.2.3 Invariants Scores for Zero and Equivalence Class Reduced Invariant Sets
When we reduce the site patterns by placing them into equivalence classes, we choose
one variable to be the representative for each class. We then generate relations
on the smaller set of representative variables, in order to obtain a smaller set of
invariants. However, from our data, we will have Fourier transformed probabilities
for all site patterns, but only the class representative site patterns will show up in
the invariants. We do not want to simply substitute in the values of representative
site patterns, as that would be a loss of information. Rather, for each equivalence
class, we take the average of all Fourier transformed site pattern frequencies in that
class, and substitute that value in for the representative site pattern of that class.
However, for variables that were placed in the zero class, we do lose that information,
as we do not use any of the zero class variables in generating the invariants.
7.2.4 Simulating Data
We want to know which tree model our data comes from, in order to determine
whether our method selects the true tree. Thus, we simulated data under a fixed
evolutionary model and tree. For each rooted 3 to 5 taxa tree, using the Jukes-
Cantor and Kimura-2 evolutionary models, we randomly generated sequences of
data.
The first step in generating data was to randomly select branch lengths for the
trees. Recall from Section 6.1 that our transition probability matrix is the exponential
of the instantaneous rate matrix

P(e) = exp(te Q(e)),

where te is some positive length of time. But since we cannot separate te and Q(e)
without a molecular clock, we use the concept of branch length. For example, if we
are considering the Jukes-Cantor model for nucleotide data, we would have an
instantaneous rate matrix
Q =
  [ −3x    x    x    x  ]
  [   x  −3x    x    x  ]
  [   x    x  −3x    x  ]
  [   x    x    x  −3x  ]

for x ≥ 0, whose corresponding probability transition matrix is

P = (1/4) ·
  [ 1 + 3e^{−4xt}    1 − e^{−4xt}     1 − e^{−4xt}     1 − e^{−4xt}  ]
  [ 1 − e^{−4xt}     1 + 3e^{−4xt}    1 − e^{−4xt}     1 − e^{−4xt}  ]
  [ 1 − e^{−4xt}     1 − e^{−4xt}     1 + 3e^{−4xt}    1 − e^{−4xt}  ]
  [ 1 − e^{−4xt}     1 − e^{−4xt}     1 − e^{−4xt}     1 + 3e^{−4xt} ]

where 3xte is called the branch length [9]. Thus, we do not need to know te, but
can sample branch lengths to plug into probability transition matrices in the form
above.
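For example, given a branch length ℓ = 3xte, we have e^{−4xt} = exp(−4ℓ/3), so the matrix can be built directly. A Python sketch (our simulations used R and Evolver, not this code):

```python
import math

def jc_transition_matrix(branch_length):
    """Jukes-Cantor P for branch length 3*x*t: diagonal (1 + 3e)/4 and
    off-diagonal (1 - e)/4, where e = exp(-4/3 * branch_length)."""
    e = math.exp(-4.0 / 3.0 * branch_length)
    diag, off = (1 + 3 * e) / 4, (1 - e) / 4
    return [[diag if i == j else off for j in range(4)] for i in range(4)]

P = jc_transition_matrix(0.35)
assert all(abs(sum(row) - 1) < 1e-12 for row in P)   # each row sums to 1
assert all(p >= 0 for row in P for p in row)         # valid probabilities
```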
Code for Generating Data
Our original plan was to generate data solely in R [10]. We randomly selected branch
lengths between 0.1 and 0.75, in increments of 0.2, as done by Casanellas and Sanchez
[1]. Using these branch lengths, we calculated the entries of the probability
transition matrices. Using these probabilities, for each data set, we randomly chose a
genetic sequence for the ancestral taxon. This sequence was chosen using the uniform
distribution on the nucleotide state space, and each site was chosen independently.
To generate the sequences for the modern taxa, at each site our code
looked at the state of the ancestral taxon. Depending on that state, the code chose
the appropriate probabilities of transitioning to another state. For nucleotide data,
R has a function that specifies a discrete state space and a probability
of choosing each possible state. We used this function to choose the state at each
site for the modern taxa. See Appendix 10.4 for our data simulation code in R.
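The scheme just described can be sketched in Python (a stand-in for the R code in Appendix 10.4; the transition matrix P below is a toy example, and the claw tree is assumed, so each leaf descends directly from the root):

```python
import random

def simulate_alignment(branch_matrices, root_pi, num_sites, seed=0):
    """Draw each ancestral site i.i.d. from the root distribution, then
    evolve it independently down each pendant branch using that branch's
    transition matrix."""
    rng = random.Random(seed)
    states = 'ACGT'
    leaves = {name: [] for name in branch_matrices}
    for _ in range(num_sites):
        root = rng.choices(range(4), weights=root_pi)[0]
        for name, P in branch_matrices.items():
            child = rng.choices(range(4), weights=P[root])[0]
            leaves[name].append(states[child])
    return {name: ''.join(seq) for name, seq in leaves.items()}

P = [[0.91, 0.03, 0.03, 0.03],
     [0.03, 0.91, 0.03, 0.03],
     [0.03, 0.03, 0.91, 0.03],
     [0.03, 0.03, 0.03, 0.91]]
data = simulate_alignment({'taxon1': P, 'taxon2': P, 'taxon3': P},
                          root_pi=[0.25] * 4, num_sites=100)
assert all(len(seq) == 100 for seq in data.values())
```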
Upon running our simulated data through our Perl program, as well as running real
data through our program, we found that our simulated data was not performing very
well. Our program was approximately 80% in agreement with the previously determined
topologies for the real data. However, for the simulated data, our program was
choosing the correct model at a rate less than random chance (for three taxa, less
than 33.33% of the time). We decided to use PAML's genetic data simulator
Evolver [14] instead. Our original project was to include the Kimura-3 parameter
model; however, since Evolver does not simulate this model, and we were told that
it is generally not used because the Kimura-2 parameter model usually performs just
as well, we decided to remove the Kimura-3 parameter model from our research.
In order to simulate thousands of data sets efficiently, we wrote an algorithm in
R that randomly samples branch lengths and prints them to files to be input into
Evolver. (See Appendix 10.5.) We also wrote code that runs Evolver, using all of
the files, each with a different set of branch lengths. (See Appendix 10.6.)
Additionally, we wrote another algorithm to "clean," or in other words format, the data
in such a way that our main program could work with it easily. (See Appendix 10.7.)
For each transition model, topology, and number of sites, we sampled 10 sets of branch
lengths. For each set of branch lengths, we created 100 data sets. This gives a total of
1000 data sets for each "type of data." For 3 taxa, we fixed the internal branch length
and randomly sampled the external branch lengths. The internal branch length was
fixed at 0.01, 0.19, 0.35, 0.55, and 0.75. For each of those we generated 1000 data
sets, as explained above. Additionally, we generated 1000 data sets where the internal
branch varied over the set {0.01, 0.19, 0.35, 0.55, 0.75} and the external branches
were fixed at 0.17. We followed the same scheme for 4 taxa; however, with two
internal branch lengths, we varied both of them.
7.2.5 Tree Names
Bifurcating trees are often named using the Newick system. This system is quite
simple and effective, indicating the internal nodes of the tree with commas and
subtrees within parentheses. However, to make our programming easier, we named
them differently. The naming system is given in the following table.
3A1: (1,(2,3))            4A1: (1,(2,(3,4)))         4A2: ((1,2),(3,4))
5A1: (1,(2,(3,(4,5))))    5A2: ((1,2),(3,(4,5)))     5A3: (1,((2,3),(4,5)))
The letter corresponds to the associated unrooted tree topology. For 3 to 5 taxa
there is one unrooted tree topology each, indicated by the letter A.
Chapter 8
Results
Our results cover the binary (3 taxa), Jukes-Cantor (3-5 taxa), and Kimura-2
parameter (3-4 taxa) models. We break it up first by data type, binary or nucleotide
data, then by number of taxa. We compare the Jukes-Cantor and Kimura-2 parameter
models side by side within the nucleotide data section.
8.1 Binary Data
Working with binary data provides a simpler context in which to initially explore the
composition and nature of phylogenetic invariants. However, since binary data is no
longer used extensively, we decided not to include it in our simulations, but rather
to focus on the group based models that are used frequently: the Jukes-Cantor and
Kimura-2 parameter models. However, we did some preliminary work on binary
data, and found some interesting results. For three taxa, we observed a striking
pattern, which we prove below.
8.1.1 Three Taxa
First we generated binary data sets for three taxa in R. Running these through our
Perl program, we observed a pattern in the scores for the three possible labelings of
the one topology: the lowest scoring, or winning, tree always had a unique score,
while the other two trees had equal scores. Since this happened for every single
data set, we conjectured that this was an artifact of the symmetries in the equations
and in the permutations of the labelings.
Lemma 8.1. For 3 taxa and binary data, where the scores using the metric D∗ are
{X1, X2, X3}, if Xi = min(X1, X2, X3) then Xj = Xk for j, k ≠ i.
Proof. Recall there are three trees for three taxa. The invariants for the tree with
the identity labeling are:
0 = (q001) ∗ (q110) − (q010) ∗ (q101)
0 = (q000) ∗ (q111) − (q011) ∗ (q100)
Performing the appropriate permutations of the labeling for the other two trees we
get:
0 = (q001) ∗ (q110) − (q100) ∗ (q011)
0 = (q000) ∗ (q111) − (q101) ∗ (q010)
and:
0 = (q100) ∗ (q011) − (q010) ∗ (q101)
0 = (q000) ∗ (q111) − (q110) ∗ (q001).
Notice that there are only four terms that show up in these six equations. Since
this is the case, we will rename them as follows:
w = (q001) ∗ (q110)
x = (q010) ∗ (q101)
y = (q000) ∗ (q111)
z = (q011) ∗ (q100).
We also observe that each newly defined variable occurs once and only once for each
tree, and they occur as follows:
Tree       | 2D∗
(1,(2,3))  | |w − x| + |y − z|
(2,(1,3))  | |w − z| + |y − x|
(3,(2,1))  | |z − x| + |y − w|
First note that the absolute value of a difference is just the distance between
two points on the real line. Since these variables are linearly ordered, without loss
of generality we assume that w < x < y < z. The proof of the lemma can then be
read off from Figure 8.1.
Figure 8.1: Proof By Picture for Lemma 8.1
In the case above, the (1, (2, 3)) tree scores (x − w) + (z − y), while the other two
trees both score (y + z) − (w + x); since y > x, the (1, (2, 3)) tree has the lowest
score and the other two trees tie. If we permute the ordering of our variables w, x,
y, and z, it merely changes which tree has the smallest score.
Thus we see that one tree will always have a unique lowest score. The true tree
had the lowest score about a third of the time, corresponding to uniform random
chance. This corresponds to the random chance of the ordering of the variables
w < x < y < z.
For 3 taxa and binary invariants, each tree has an equal chance of winning.
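The lemma is easy to check numerically. The following sketch (not part of the thesis code) evaluates the three scores from the table above for random positive values of w, x, y, z and confirms a unique minimum with the other two scores tied:

```python
import random

def three_taxa_scores(w, x, y, z):
    """The 2*D scores of the three labeled trees, as in the table above."""
    return [
        abs(w - x) + abs(y - z),  # (1,(2,3))
        abs(w - z) + abs(y - x),  # (2,(1,3))
        abs(z - x) + abs(y - w),  # (3,(2,1))
    ]

def check_lemma(trials=1000, seed=1):
    """Verify: unique minimum score, and the two larger scores coincide."""
    rng = random.Random(seed)
    for _ in range(trials):
        vals = [rng.random() for _ in range(4)]
        scores = sorted(three_taxa_scores(*vals))
        if not (scores[0] < scores[1] and abs(scores[1] - scores[2]) < 1e-12):
            return False
    return True
```

For instance, with (w, x, y, z) = (1, 2, 4, 8) the scores are 5, 9, 9: the first matching pairs the two smallest and two largest values, and the other two matchings both sum to (y + z) − (w + x).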
Recall that binary data has a state space of {0, 1}, where 0 indicates the absence of
a given trait and 1 indicates its presence. We believe that the method of
phylogenetic invariants does not perform well enough for binary data, since having
only two states does not give much information and produces symmetries that make
it difficult to distinguish between topologies. We believe this method is not suited
to binary data, but since binary data is uncommon in current genetics research, this
does not weigh heavily against the method as a whole.
8.2 Nucleotide Data
The focus of our work is on nucleotide data, using the Jukes-Cantor and Kimura-2
parameter models. First, we give tables of the sizes of the sets of inequalities and
invariants. Then, for each number of taxa, we describe the overall efficacy of the
invariants and inequalities. Next we show the results of the cumulative frequencies
of the relative distances of the true tree from the minimum scoring tree, as well as
show the distribution of the invariant scores for the true tree. For 3 and 4 taxa,
we will evaluate the effect of different branch lengths. For 5 taxa we have some
surprising results. Additionally, we will describe the efficacy of the invariants and
inequalities, as well as the cumulative efficacy of the invariants and the distribution
of the invariants scores for real data.
8.2.1 Summary of Combinatorial Growth of Models
This summary of the combinatorial growth of the models gives a sense of some of
the computational hurdles we encountered and why we chose to use certain subsets
of invariants. The number of invariants and inequalities grows combinatorially
depending on the number of taxa and the transition model. The main limitation
of this research was the surprisingly large size of the invariant sets. We give a
table below to show this. We were not able to calculate the full set of invariants
beyond the 4 taxa Jukes-Cantor model. We used equivalence
classes and zero classes to generate subsets of invariants for 4 and 5 taxa.
Full Invariants Set Combinatorial Growth
Taxa, Model Site Patterns Inequalities Invariants
3, JC 64 9 84
3, K2 64 9 825
4, JC 256 30 734
4, K2 256 30 ???
5, JC 1024 72 ???
Equivalence Class Combinatorial Growth
Taxa, Model Site Patterns Equivalence Classes Equivalence Class Invariants
3, JC 64 13 33
3, K2 64 34 795
4, JC 256 34 512
4, K2 256 116 45450
5, JC 1024 89 6379
Zero Class Combinatorial Growth
Taxa, Model Site Patterns Zero Class Variables Zero Class Invariants
3, JC 64 16 12
3, K2 64 16 15
4, JC 256 64 825
Zero and Equivalence Class Combinatorial Growth
Taxa, Model Site Patterns Variables Invariants
3, JC 64 5 1
3, K2 64 10 9
5, JC 1024 34 512
Figure 8.2: Efficacy of Invariants and Inequalities for 3-5 Taxa
8.2.2 Efficacy of Invariants and Inequalities for Three to Five Taxa
Figure 8.2 compares the efficacy of the minimum tree score with our winning tree
algorithm. The data sets are ordered roughly from worst-performing to best-performing.
This graph shows that the true tree is the minimum scoring tree equally often or
more often than it is the winning tree. This means that the inequalities actually
make it less likely that we choose the true tree; the invariants are more efficient
on their own. The inequalities are likely ineffective because they cannot distinguish
between many trees (which share the same score) and because noise in the data may
cause the true tree to have a less than maximal inequalities score. We do not
incorporate inequalities scores that were close to the maximum inequality score into
our analysis, which sometimes eliminates the true tree because its inequality score
is slightly lower than the maximum. Because our use of the inequalities makes
matters worse the vast majority of the time, we focus most of our analysis on the
minimum scoring tree rather than the winning tree.
The inequalities make it less likely to choose the true tree.
The invariants are more efficient on their own.
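For concreteness, the two selection rules compared above can be sketched as follows. The dictionary layout for a tree's scores is an illustrative assumption; only the decision logic reflects the description in the text.

```python
def winning_tree(trees):
    """Our "winning tree" rule: restrict to trees with the maximum
    inequalities score, then minimize the invariants score among them.
    Each tree is a dict with keys 'name', 'ineq', and 'inv'."""
    best_ineq = max(t["ineq"] for t in trees)
    candidates = [t for t in trees if t["ineq"] == best_ineq]
    return min(candidates, key=lambda t: t["inv"])["name"]

def minimum_scoring_tree(trees):
    """The alternative rule: ignore inequalities, minimize invariants."""
    return min(trees, key=lambda t: t["inv"])["name"]
```

The sketch makes the failure mode visible: if the true tree has the smallest invariants score but an inequalities score slightly below the maximum (say 0.96 versus 1.0), the first rule eliminates it while the second rule selects it.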
Before we move on from the inequalities however, we make a few comments on the
nature of the data, then make some general comparisons between the data types.
We must first point out that not all data types have the same number of sites;
some were run with 500 sites and some with 1000 sites. This is due to the computational
expense of the program, which we will not explain here as it is tedious
and not enlightening. We would have preferred to compare all of these with the
same number of sites, but this is what we have to work with. Additionally, because
of the computational expense, each data type for 3 and 4 taxa is tested on 1000
data sets. However, for the 5 taxa equivalence class reduced subsets, only 25 data sets
have completed, and for the 5 taxa zero and equivalence class reduced subsets only 93
data sets have completed thus far (at print). We estimate it will take months more
to complete all 1000 data sets.
Looking at the table of the exact numbers, there are many observations to be made:
Filename % True Tree Wins % True Tree Minimum
JCzeroequiv5A3 500sites 0.086021505 0.053763441
JCzeroequiv5A2 500sites 0 0
JCzeroequiv5A1.500sites 0 0.010752688
JCequiv5A3.500sites 0 0
JCequiv5A2.500sites 0 0
JCequiv5A1.500sites 0 0
JCzero4A1.500sites 0.077 0.145
JCzero4A2.1000sites 0.106 0.218
JCequiv4A1.1000sites 0 0
JCequiv4A2.1000sites 0.474849095 0
JCdiff4A1.500sites 0.043 0.129
JCdiff4A2.500sites 0.093 0.253
JCfull4A1.1000sites 0.047 0.105
JCfull4A2.1000sites 0.121 0.225
JCzero3A1.500sites 0.367 0.367
JCfull3A1.500sites 0.367 0.367
K2zero4A1.500sites 0.095 0.258
K2zero4A2.500sites 0.122 0.232
K2equiv4A1.500sites 0.1 0.9
K2equiv4A2.500sites 0 1
K2zero3A1.500sites 0.416 0.415
K2full3A1.500sites 0.621 0.62
Equivalence Class Reduced Subsets of Invariants
First and most notably, the 4 taxa Kimura-2 parameter equivalence class reduced
subsets of invariants performed remarkably well, with the balanced tree’s true tree
labeling having a minimum score 100% of the time and the giraffe tree’s true tree
labeling having the minimum score 90% of the time. This is far better than the 4
taxa Jukes-Cantor full set of invariants, which were 22.5% effective for the balanced
tree and only 10.5% effective for the giraffe tree. The Jukes-Cantor equivalence class
reduced subsets of invariants were effective 0% of the time, except that for the balanced
tree, the true tree was actually the winning tree ≈ 47.5% of the time. This is
one of the few instances where the winning tree algorithm performs better than
looking at just the minimum invariants scores of the trees. The 5 taxa Jukes-Cantor
equivalence class reduced subsets of invariants would likely perform much better if
the sample size were 1000, rather than 25.
For the 4 taxa, Kimura-2 parameter equivalence class reduced subset of invariants,
the true tree has the minimum score 90-100% of the time.
Zero Class Reduced Subsets of Invariants
In general, whenever the zero class is removed from the situation, the performance
drops, possibly due to the loss of information as the zero class observed site patterns
are not being utilized in these cases.
Differences of Invariants Sets
For a fixed number of taxa and a fixed transition model, the sets of invariants for the
different topologies will have many common invariants. These common invariants
are due to a shared portion of the structure of the topologies. For 4 taxa Jukes-
Cantor, we removed the common invariants, and ran the 4 taxa data using only the
differences between the sets of invariants. This, as expected, performed better than
the full sets of invariants, increasing the effectiveness by about 2%. The increase is
not impressive, but the fact that it does perform better is useful since although no
method yet exists for finding the differences between the sets without first finding the
full set, if it did exist, it would likely be less computationally expensive than finding
the full set of invariants. This could be very helpful for reducing the computational
prohibitiveness of this method. We were not able to calculate the full sets of 4 taxa
Kimura-2 parameter invariants, however, if there were a way to simply calculate the
differences between the sets, this might be computationally feasible.
For 4 taxa, Jukes-Cantor, the differences between the sets of invariants
for each topology perform better than the full sets by about 2%.
Comparing Jukes-Cantor with Kimura-2
Finally, we note that with the same number of taxa and the same types of invariant
(sub)sets, the Kimura-2 parameter invariants always outperform the Jukes-Cantor
invariants.
The Kimura-2 parameter invariants always greatly outperform
the Jukes-Cantor invariants.
8.2.3 Three Taxa
For 3 taxa, we compare a variety of types of data. We vary the number of sites,
branch lengths, and of course compare the different transition models and tree
topologies for the efficacy of the invariants and inequalities. We want to note that
for 3 taxa, there is no difference between the winning and minimum scoring trees.
This is because all 3 trees come from the same unrooted topology and thus they
have the same inequality score. Thus, the inequalities have no effect with 3 taxa.
We will generally refer to the “winning tree,” to keep things simple.
Number of Sites
We begin by demonstrating the effect of the number of sites. While it is natural to
think that an increase in the number of sites would increase the accuracy of the method,
we actually see that for 100, 500, and 1000 sites there is not much of a difference,
nor a strictly increasing trend.
Looking at Figure 8.3, we see that the percentage that the true tree wins is not
remarkably different across the number of sites greater than 100. Perhaps if we
Figure 8.3: Varying Number of Sites for 3 Taxa
tested with very low numbers of sites, say, below 64 (the number of site patterns),
we would see much lower efficacy. Preliminary research also showed that the method
was not greatly improved with 5000 or 10,000 sites. Thus, we conclude that the
number of sites over 100 does not have a strong effect. We place the rest of
our focus on other variables.
The number of sites over 100 does not have a strong effect
on the performance of the method.
Cumulative Frequencies of Invariants Intervals
We now look at data for 3 taxa, 500 sites and any branch length. For Jukes-Cantor,
the true tree is the winning tree 36.7% of the time and for Kimura-2 the true tree
is the winning tree 62% of the time. Though neither result is impressive, for
Kimura-2 at least the true tree is chosen a majority of the time. However, 62% is
not frequent enough to be reliable for model testing. We now look at the frequency
that the true tree occurs within a certain proportion of the invariants interval.
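Our reading of this metric can be sketched as follows; the normalization below is an illustrative assumption of how the relative distance into the invariants interval is computed, not the thesis code.

```python
def relative_distance(true_score, all_scores):
    """Where the true tree's score falls between the minimum and maximum
    invariants scores over all candidate trees (0 = minimum, 1 = maximum)."""
    lo, hi = min(all_scores), max(all_scores)
    if hi == lo:          # degenerate: every tree has the same score
        return 0.0
    return (true_score - lo) / (hi - lo)

def captured_within(true_score, all_scores, fraction):
    """True if the true tree lies within `fraction` of the interval."""
    return relative_distance(true_score, all_scores) <= fraction
```

Under this reading, "the true tree is captured 40% into the interval" means its relative distance is at most 0.4; the cumulative frequency curves below plot the fraction of data sets captured as this cutoff grows.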
Figure 8.4: The Cumulative Frequencies of 3 Taxa with 500 Sites
We see in Figure 8.4 that the Jukes-Cantor invariants only begin to capture the true
tree a majority of the time when considering trees 40% into the interval. The true
tree is only captured 100% of the time when considering all trees. For Kimura-2,
the situation is better: the true tree is captured a majority of the time within any
distance, and gradually improves, but is not captured 100% of the time until all
trees are considered. For Kimura-2, the true tree is captured at least 80% of the
time when 70% into the interval.
For 3 taxa, the Kimura-2 parameter invariants:
• always capture the true tree a majority of the time.
• capture the true tree at least 80% of the time when 70% into the interval.
Variation of the Branch Lengths
The data considered above was for any length of the branches between 0.01 and 0.75.
For three taxa, we now look at the efficacy of the method as the internal branch varies
(recall there is only one internal branch length for the 3 taxa topology). The idea
is that when the internal branch is small, the correct bifurcating tree is similar to a
claw tree, and is hard to distinguish from the other bifurcating topologies. With
a long internal branch, the two taxa off of the internal branch should be genetically
distinct enough from the third taxon that the true tree is easier to determine.
We notice that when the internal branch length is short (0.01), the true tree is chosen
around a third of the time for all models. This is because the internal branch is so
short that the true tree is almost a claw tree, and thus the method cannot
differentiate between the three tree topologies. As the internal branch length
increases, both models perform better, especially when the external branch lengths
Figure 8.5: Varying the Internal Branch Lengths for 3 Taxa
are fixed at the moderate size of 0.17. Kimura-2 performs excellently with the longest
internal branch length (0.75) and fixed external branch lengths of 0.17, where the
true tree has the minimum score 90.6% of the time.
For 3 taxa, the Kimura-2 invariants perform excellently with an internal
branch length of 0.75 and fixed external branch lengths of 0.17, where
the true tree has the minimum score 90.6% of the time.
Now we look at the cumulative frequencies of percentages into the invariants intervals
across the different branch lengths. For the following, we look only at the cases where
the external branch lengths are fixed at 0.17. We choose this focus because
it is more comparable to the work of Casanellas and Fernandez-Sanchez [1], whose
data was limited to one internal branch length and one external branch
length.
Figure 8.6: Cumulative Frequencies of Varying Internal Branch Lengths for 3 Taxa
Jukes-Cantor with External Branches of 0.17
Notice that for the Jukes-Cantor model (see Figure 8.6) it is not until we get 60% into
the interval that the true tree is captured a majority of the time for all internal
branch lengths. Unfortunately, the true tree is not captured at least 80% of the
time at any point before we consider the whole interval, or in other words,
all trees. Kimura-2, as we have seen, fares much better. As can be
seen in Figure 8.7, once the internal branch length is more than twice the external
branch length (0.35 and 0.17 respectively), the true tree wins a majority of the time
Figure 8.7: Cumulative Frequencies of Varying Internal Branch Lengths for 3 Taxa
Kimura-2 with External Branches of 0.17
and the true tree is captured at least 80% of the time when considering trees up to
30% into the interval.
Distribution of Invariant Scores
We would like to have a sense of the range of values for the true tree invariants. The
true tree invariants scores for both transition models are given in Figure 8.8.
Note that in Figure 8.8 the scores are scaled so they can be compared. The scale is
given along the horizontal axis. The Jukes-Cantor scores are found by multiplying
the scale by 0.01 and the Kimura-2 scores are found by multiplying the scale by
Figure 8.8: The Distributions of the Invariants Scores for 3 Taxa Jukes-Cantor and
Kimura-2
0.001. It is interesting in itself that the Kimura-2 true tree invariants scores
are smaller by a factor of 10. Additionally, if we found a typical range of true tree
scores, then when testing all trees we could exclude any tree that falls above that range.
This data gives us a look at what that range might be.
8.2.4 Four Taxa
For 4 taxa, we have one unrooted tree, with two rooted topologies for a total of 15
distinct labelings of rooted trees to test. The number of evolutionary models from
which to choose is larger, making the uniform random chance of choosing the correct
tree less likely. Of course, the matter is not up to uniform random chance, but if
it were, each tree would have a one in fifteen chance, or a 6.67% chance, of being
chosen. While the method often performed better than uniform chance, we note
that it did not always perform well for 4 taxa.
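The counts quoted here and in the 5 taxa section (3 trees for 3 taxa, 15 for 4, 105 for 5) follow the double factorial (2n − 3)!! for rooted binary trees on n taxa, which a short sketch can verify:

```python
def num_rooted_trees(n):
    """Number of rooted binary tree labelings on n taxa: (2n - 3)!!
    (the product of the odd numbers 3, 5, ..., 2n - 3)."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count
```

For n = 4 this gives 15 trees, so the uniform chance of 1/15 rounds to the 6.67% quoted above.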
We also test various types of data for 4 taxa. We vary the branch lengths for each
of the topologies. In the case of 4 taxa, there are two internal branches. We vary
both branches, noting that the best performance should occur when both branches
are longer. When one branch or the other is short, different confusions are likely
to occur, but the end result is the same: the true tree is not identified.
Additionally, we consider some subsets of the full set of invariants. In particular,
we tested the subset of invariants with the zero class removed. Since we were
unable to compute the full set of invariants for the Kimura-2 model, comparing the
efficacy of Jukes-Cantor and Kimura-2 happens in the context of the zero class
reduced subsets. Since we do have the full set of invariants for Jukes-Cantor, we look
at those outcomes in this section as well. The two rooted topologies,
the giraffe and the balanced tree, share many invariants. We also look at how
effective the method is when we remove all the invariants they have in common;
in other words, we use only the differences in the invariants sets. All of these results
are discussed in detail below.
Finally, we observe from Table 8.2.2, that for Jukes-Cantor, both the zero class
reduced subset of invariants and the differences between the full sets of invariants
outperform the full set of invariants.
For 4 taxa, Jukes-Cantor, both the zero class reduced
subset of invariants and the differences between the
full sets of invariants outperform the full set of invariants.
Cumulative Frequencies of Invariants Intervals
First we look at the efficacy of the method with the full set of Jukes-Cantor
invariants, with 1000 sites. In all cases the true tree is neither the winning tree nor the
minimum tree a majority of the time.
For both tree topologies, we see in Figure 8.9, that it is not until 30% into the
interval that they both capture the true tree a majority of the time. Additionally,
it is not until 60% into the interval when they both capture the true tree at least
80% of the time. For Kimura-2, the true tree is captured 85% of the time when
considering all trees in the lower half of the interval.
Now we will compare some of the subset of invariants that we considered, with
500 sites and any branch lengths. We ran data using the differences between the
full set of Jukes-Cantor invariants. We also ran data using the zero class reduced
subsets of invariants, for both Jukes-Cantor and Kimura-2. Recall that we were not
able to generate the full set of invariants for Kimura-2, so we were not able to use
Figure 8.9: The Cumulative Frequencies for 4 Taxa Jukes-Cantor with 1000 sites
the differences in this case. The three cases are compared in Figure 8.10.
The goal of Figure 8.10 is to first compare the performance of differences and zero
class subsets of the Jukes-Cantor invariants and to compare the performance of the
Jukes-Cantor and Kimura-2 zero class invariants. First, note that for lower relative
distances, where it really counts, the differences perform worse than the zero class
invariants. Since the zero class subset of invariants takes less computational expense
to generate, this is good news. However, our expectation was that the differences
should outperform any other method, since they would produce more extreme
invariants scores, without the common invariants weighting the
scores similarly. It is of course obvious by this point that the Kimura-2 zero class
Figure 8.10: The Cumulative Frequencies Subsets of Invariants for 4 Taxa with 500
sites
invariants outperformed the Jukes-Cantor zero class invariants.
Variation of the Branch Lengths
For both trees, even as the branch length increases, the true tree is neither the
winning tree nor the minimum scoring tree a majority of the time. Since, for 4 taxa,
the minimum tree method outperforms our winning tree algorithm, we evaluate the
effects of varying the branch lengths by using the minimum scoring performance.
All analysis on the variation of branch lengths is done on Jukes-Cantor, 1000 site
data run through the full set of Jukes-Cantor invariants.
Both topologies perform best as both internal branches become longer. However,
at the maximum internal branch lengths, for neither topology is the true tree the
Figure 8.11: Variation of Branch Lengths for the Giraffe Tree and Jukes-Cantor
Invariants
minimum scoring tree a majority of the time: for the giraffe tree the true tree is
the minimum scoring tree 23.2% of the time, and for the balanced tree the true tree
is the minimum scoring tree 46.2% of the time. Next, we look at the cumulative
frequencies of the relative distances into the interval.
Clearly, as seen in Figure 8.13, the long (0.75) internal branches outperform
the short (0.01) internal branches. In the latter case, the true tree is not captured a
majority of the time until 50% into the interval, and is not captured at least 80% of
the time until 80% into the interval. However, the balanced tree does quite
well only 10% into the interval, where it captures the true tree 86.3% of the time.
Figure 8.12: Variation of Branch Lengths for the Balanced Tree and Jukes-Cantor
Invariants
Unfortunately, the giraffe tree does not fare as well, capturing the true tree at least
80% of the time only once 30% into the interval.
In general, the balanced tree is more accurately predicted than the giraffe tree.
This is likely because there are four times as many giraffe trees as balanced trees.
Figure 8.13: The Effect of Short and Long Internal Branches
Distribution of Invariants
We will compare the distribution of the full set of invariants for each topology for
the Jukes-Cantor model here.
In Figure 8.14, we see how the scores of the true tree of both 4 taxa topologies are
distributed. Their scores are quite similar, despite the fact that around half of their
invariants are different. We mostly include this for completeness.
Figure 8.14: The Distributions of the True Trees for 4 Taxa Jukes-Cantor 1000 Sites
8.2.5 Five Taxa
For 5 taxa, we have one unrooted topology with three rooted topologies, for a total
of 105 distinct labelings of rooted trees to test. We tested only the zero and
equivalence class reduced subset and the equivalence class reduced subset of the
Jukes-Cantor invariants, but since there are so many rooted trees to test, this still
takes a long time. Recall that for the Jukes-Cantor zero and equivalence class
reduced subset of invariants, only 93 data sets have completed so far, and for the
Jukes-Cantor equivalence class reduced subset of invariants only 25 data sets have
completed so far. Thus, our power is not high. Nonetheless, we make some
observations based on the data we have.
From our comparison of the percentage of the true tree winning versus the true
tree having the minimum score earlier in this section, in Table 8.2.2, things may
have looked bleak. The 5 taxa portion of that table is reproduced here:
Filename % True Tree Wins % True Tree Minimum
JCzeroequiv5A3 500sites 0.086021505 0.053763441
JCzeroequiv5A2 500sites 0 0
JCzeroequiv5A1.500sites 0 0.010752688
JCequiv5A3.500sites 0 0
JCequiv5A2.500sites 0 0
JCequiv5A1.500sites 0 0
While the non-zero numbers may be better than the uniform chance of 1 in 105,
they are still far from a majority. We now look at the relative distance of the
true tree from the minimum tree to see how well the method performs.
Cumulative Frequencies of Invariants Intervals
For the zero and equivalence class reduced subsets of the Jukes-Cantor invariants
we have the following graph:
Figure 8.15: The Cumulative Frequencies for the Zero and Equivalence Class Re-
duced Invariants
In Figure 8.15 we see that only 20% into the interval, we already capture the true
tree a majority of the time, in fact at least 84% of the time. Only 40% into the
interval, we capture the true tree 100% of the time.
For the equivalence class reduced subsets of the Jukes-Cantor invariants we have
the following graph:
Figure 8.16: The Cumulative Frequencies for the Equivalence Class Reduced Invari-
ants
In Figure 8.16 we see that 40% into the interval we have captured the true tree not
only a majority of the time, but at least 80% of the time. It is 60% into the interval
that we capture the true tree 100% of the time.
The decrease in performance of the equivalence class reduced invariants is more
likely due to the smaller sample size than to its actual effectiveness. In general, we
hypothesize that reducing the invariants by removing the zero class will reduce the
efficacy by under-utilizing the information in the data. However, for both cases
we have here, even with the much smaller sample sizes, while the true tree is rarely
the minimum scoring tree, the true tree is found relatively closer to the minimum
tree than with 3 or 4 taxa.
For 5 taxa, the reduced Jukes-Cantor invariants capture
the true tree 100% of the time only a short distance into the interval.
8.2.6 Real Data
It is important to note that simulated data is oversimplified; real data is more
complex. We ran a small set of real data through our program to determine whether
there were any notable performance differences. We started with two sets of 6 taxa,
one containing bird genetic data and the other containing chipmunk genetic data,
with well understood evolutionary histories. Selecting 3, 4, and 5 taxa subsets, we
had very little statistical power and no absolute knowledge of the true tree. However,
we find the results interesting nonetheless. The original topologies¹ are shown
below.
Vermivora celata
Dendroica coronata
Zonotrichia leucophrys
Junco hyemalis
Carpodacus mexicanus
Carduelis tristis
Bird
T. cinericollis
T. minimus borealis
T. townsendii
T. obscurus davisi
Marmota vancouverensis
Sciurus carolinensis
Chipmunk
Note that we cannot use any subset for a given tree topology, only those subsets
that correspond to topology with the same number of taxa. From the Chipmunk
¹These trees are courtesy of Dr. Greg Spicer, San Francisco State University.
data we get 20 3A1 trees, 15 4A1 trees, and 6 5A1 trees. From the bird data we
get 16 3A1 trees, 4 4A1 trees, 3 4A2 trees, 4 5A2 trees, and 2 5A3 trees. As you
can see, there are very few data sets here, and thus the power of our analysis is not
high.
8.2.7 Efficacy of Invariants and Inequalities for Real Data
The following image shows that, unfortunately for our few 5 taxa data sets, the
invariants were not able to determine the true tree, with or without the inequalities.
For 3 and 4 taxa, we see again that the inequalities are reducing the frequency of
choosing the true tree. The graph represents the percentage of times that the true
tree is chosen using our algorithm, which favors the inequality score, versus the
percentage of time the true tree has a minimum invariants score. The datasets
are ordered roughly from worst-performing to best-performing. Generally, the more
trees to test (more taxa), the worse the performance.
Additionally, the Kimura-2 parameter model outperforms the Jukes-Cantor
model a large majority of the time.
For real data, the Kimura-2 parameter model outperforms
the Jukes-Cantor model a large majority of the time.
Figure 8.17: The Efficacy of the Invariants and Inequalities for Real Data
Since the percentage of time that the true trees wins is always less than or equal to
the percentage of the time that the true tree has the minimum invariants score, we
conclude that the inequalities are not useful for evolutionary model testing.
We conclude that the inequalities are not useful for evolutionary model testing.
It is possible that the ineffectiveness has something to do with our method of scoring
the inequalities. Perhaps with a known distribution of the inequalities, we could add
a statistical significance condition. For example, a score of 0.96 or higher may not
be statistically significantly different from a score of 1. In this case, we would want
to include trees with a score of 0.96 or higher in our “maximum inequality” set,
from which we look for a minimum invariants score.
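A tolerance-based variant of our selection rule might look like the following sketch; the tree-score dictionary layout is an illustrative assumption, and the 0.04 default tolerance simply mirrors the hypothetical 0.96 cutoff above, not a calibrated threshold.

```python
def winning_tree_with_tolerance(trees, tolerance=0.04):
    """Treat any inequalities score within `tolerance` of the maximum as
    "effectively maximal," then minimize the invariants score among those
    trees. Each tree is a dict with keys 'name', 'ineq', and 'inv'."""
    best = max(t["ineq"] for t in trees)
    candidates = [t for t in trees if t["ineq"] >= best - tolerance]
    return min(candidates, key=lambda t: t["inv"])["name"]
```

With a zero tolerance this reduces to our original winning tree rule; with a positive tolerance, a true tree whose inequalities score falls just below the maximum is no longer eliminated before the invariants are consulted.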
However, we find our method of scoring the inequalities to be quite natural.
Aside from measuring the statistical significance of deviations from a score of 1, we
believe there is no better way to score the inequalities.
Cumulative Frequencies of Invariants Intervals
Figure 8.18: The Cumulative Efficacy of 3 Taxa Real Data
For 3 taxa, we see that the Kimura-2 parameter model is capturing the true tree at
least 80% of the time. However, we do not capture the true tree 100% of the time
until we get all the way, or almost all the way, into the interval. The Jukes-Cantor
model does not capture the true tree at least 80% of the time until we get 40-50%
into the interval. See Figure 8.18. For 3 taxa, we are using the full set of invariants
for both transition models.
Figure 8.19: The Cumulative Efficacy of 4 Taxa Real Data
For 4 taxa with the Kimura-2 parameter model, the true tree is captured 100% of
the time 20% into the interval. All models capture the true tree 100% of the time 50%
into the interval. See Figure 8.19. Note that with 4 taxa, for the Jukes-Cantor
model we use the full set of invariants but with the Kimura-2 model we are using
the subset of invariants with the zero class variables removed.
For 5 taxa, we used the zero and equivalence class reduced subsets of the Jukes-
Cantor invariants. For all topologies, the true tree is captured 0% of the time until
20% into the interval, after which it is captured 100% of the time. While there were
only a few data sets with 5 taxa, this is still an indication that just looking at the
thesis

  • 1. INFERRING SMALL TREES WITH PHYLOGENTIC INVARIANTS AND INEQUALITIES A thesis presented to the faculty of San Francisco State University In partial fulfilment of The Requirements for The Degree Master of Arts In Mathematics by Addie Andromeda Evans San Francisco, California December 2011
  • 3. CERTIFICATION OF APPROVAL I certify that I have read INFERRING SMALL TREES WITH PHYLO- GENTIC INVARIANTS AND INEQUALITIES by Addie Andromeda Evans and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree: Master of Arts in Mathematics at San Francisco State University. Serkan Ho¸sten Professor of Mathematics Federico Ardila Professor of Mathematics Greg Spicer Professor of Biology
  • 4. INFERRING SMALL TREES WITH PHYLOGENTIC INVARIANTS AND INEQUALITIES Addie Andromeda Evans San Francisco State University 2011 Phylogenetic trees are used to explain evolutionary relationships between a collection of species, subspecies, or individuals. Phylogenetic invariants are constraints placed on the probability space of the DNA site patterns in order to find the true tree out of combinatorially many possibilities. This research project looks at the efficiency of the method of phylogenetic invariants, in choosing the correct evolutionary model, over a broad spectrum of data. Additionally, we are the first to test the effectiveness of inequality constraints that have recently been developed. I certify that the Abstract is a correct representation of the content of this thesis. Serkan Ho¸sten, Chair, Thesis Committee Date
  • 5. ACKNOWLEDGMENTS Serkan Ho¸sten Federico Ardila Greg Spicer Ronald Evans Raymond Cavalcante Tol Lau v
  • 6. TABLE OF CONTENTS 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Evolutionary Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Divergence of Species Over Time . . . . . . . . . . . . . . . . . . . . 4 2.3 The Tree Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4 Methods of Phylogenetic Inference . . . . . . . . . . . . . . . . . . . . 8 2.5 The Genetic Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Evolutionary Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 The General Markov Model . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 Symmetric and Group Based Models . . . . . . . . . . . . . . . . . . 12 3.3 Non-homogenous Models . . . . . . . . . . . . . . . . . . . . . . . . . 16 4 The Algebraic Statistics of Phylogenetic Trees . . . . . . . . . . . . . . . . 18 4.1 Site Pattern Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 The Algebraic Variety of a Statistical Model . . . . . . . . . . . . . . 21 4.3 The Hardy-Weinberg Model Invariants . . . . . . . . . . . . . . . . . 22 5 Invariants for Group Based Models . . . . . . . . . . . . . . . . . . . . . . 25 5.1 The Discrete Fourier Transform . . . . . . . . . . . . . . . . . . . . . 25 5.2 The Sturmfels-Sullivant Mapping . . . . . . . . . . . . . . . . . . . . 30 vi
  • 7. 5.3 Example of the Nucleotide Three Taxa Claw Tree . . . . . . . . . . . 31 5.4 Equivalence Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 6 Inequalities on the Discrete Fourier Transform . . . . . . . . . . . . . . . 39 6.1 Motivation and Definitions . . . . . . . . . . . . . . . . . . . . . . . . 39 6.2 Pendant and Internal Edge Inequalities . . . . . . . . . . . . . . . . . 42 6.3 Inequalities for Nucleotide Data . . . . . . . . . . . . . . . . . . . . . 44 7 Metrics and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 7.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 7.1.1 Inequalities Score . . . . . . . . . . . . . . . . . . . . . . . . . 47 7.1.2 Invariants Score . . . . . . . . . . . . . . . . . . . . . . . . . . 48 7.1.3 The True Tree’s Relative Distance from the Optimal Scoring Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 7.2.1 The Program . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 7.2.2 Winning Tree Algorithm . . . . . . . . . . . . . . . . . . . . . 51 7.2.3 Invariants Scores for Zero and Equivalence Class Reduced In- variant Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 7.2.4 Simulating Data . . . . . . . . . . . . . . . . . . . . . . . . . . 53 7.2.5 Tree Names . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 vii
  • 8. 8.1 Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 8.1.1 Three Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 8.2 Nucleotide Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 8.2.1 Summary of Combinatorial Growth of Models . . . . . . . . . 63 8.2.2 Efficacy of Invariants and Inequalities for Three to Five Taxa . 65 8.2.3 Three Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 8.2.4 Four Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 8.2.5 Five Taxa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 8.2.6 Real Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 8.2.7 Efficacy of Invariants and Inequalities for Real Data . . . . . . 91 9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 9.1 Discussion of Results and Further Directions . . . . . . . . . . . . . . 97 9.2 Advantages and Disadvantages . . . . . . . . . . . . . . . . . . . . . . 99 9.3 Open Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 10 Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 10.1 Appendix: Mathematica Code for Generating Invariants . . . . . . . 101 10.2 Appendix: Perl Code for Evaluating Invariants . . . . . . . . . . . . . 105 10.3 Appendix: Perl Code for Testing Efficacy of Method . . . . . . . . . . 122 10.4 Appendix: R Code for Simulating Binary Genetic Data . . . . . . . . 134 10.5 Appendix: R Code for Creating Evolver Files . . . . . . . . . . . . . 136 viii
  • 9. 10.6 Appendix: Perl Code for Running Evolver . . . . . . . . . . . . . . . 137 10.7 Appendix: Perl Code for Cleaning Genetic Data from Evolver . . . . 139 ix
  • 10. LIST OF FIGURES 2.1 Tree Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 The Relationship Between the Unrooted and Rooted Trees for 3 Taxa. 5 2.3 The Complementary Structure of the Nucleotides [11]. . . . . . . . . 9 3.1 The Felsenstein Hierarchy of Evolutionary Models from [9] . . . . . . 15 3.2 Non-homogenous Model of Transition Probabilities. . . . . . . . . . . 16 4.1 The 3-Taxa Claw Tree . . . . . . . . . . . . . . . . . . . . . . . . . . 19 4.2 The Hardy-Weinburg Curve in the Probability Simplex from [9] . . . 23 5.1 The Fourier Transformed Parameterization of the 3 Taxa Giraffe Tree. 30 8.1 Proof By Picture for Lemma 8.1 . . . . . . . . . . . . . . . . . . . . . 61 8.2 Efficacy of Invariants and Inequalities for 3-5 Taxa . . . . . . . . . . 65 8.3 Varying Number of Sites for 3 Taxa . . . . . . . . . . . . . . . . . . . 71 8.4 The Cumulative Frequencies of 3 Taxa with 500 Sites . . . . . . . . . 72 8.5 Varying the Internal Branch Lengths for 3 Taxa . . . . . . . . . . . . 74 8.6 Cumulative Frequencies of Varying Branch Internal Lengths for 3 Taxa Jukes-Cantor with External Branches of 0.17 . . . . . . . . . . . 75 8.7 Cumulative Frequencies of Varying Branch Internal Lengths for 3 Taxa Kimura-2 with External Branches of 0.17 . . . . . . . . . . . . . 76 8.8 The Distributions of the Invariants Scores for 3 Taxa Jukes-Cantor and Kimura-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 x
  • 11. 8.9 The Cumulative Frequencies for 4 Taxa Jukes-Cantor with 1000 sites 80 8.10 The Cumulative Frequencies Subsets of Invariants for 4 Taxa with 500 sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 8.11 Variation of Branch Lengths for the Giraffe Tree and Jukes-Cantor Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 8.12 Variation of Branch Lengths for the Balanced Tree and Jukes-Cantor Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 8.13 The Effect of Short and Long Internal Branches . . . . . . . . . . . . 84 8.14 The Distributions of the True Trees for 4 Taxa Jukes-Cantor1000 Sites 85 8.15 The Cumulative Frequencies for the Zero and Equivalence Class Re- duced Invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 8.16 The Cumulative Frequencies for the Equivalence Class Reduced In- variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 8.17 The Efficacy of the Invariants and Inequalities for Real Data . . . . . 92 8.18 The Cumulative Efficacy of 3 Taxa Real Data . . . . . . . . . . . . . 93 8.19 The Cumulative Efficacy of 4 Taxa Real Data . . . . . . . . . . . . . 94 8.20 The Invariants Score for 3 Taxa Real Data . . . . . . . . . . . . . . . 95 8.21 The Distribution of Invariants Scores for 3 Taxa Real Data . . . . . . 96 xi
  • 12. Chapter 1 Introduction Phylogenetic trees are used to explain evolutionary relationships between a col- lection of individuals. These individuals may be from different species, the same species, or different subspecies, and in order to generalize the discussion they are referred to as taxa. There are many different methods that use the comparison of genetic differences of taxa by using samples of DNA. Examples of these include like- lihood methods or non-parametric distance algorithms such as maximum parsimony. The method of phylogenetic invariants was proposed in 1987 by Cavender and Felsenstein [3], and Lake [7] to infer phylogenetic trees. In 1993, Evans and Speed [4] wrote about using invariants for testing phylogenetic trees, and introduced the Dis- crete Fourier Transform as a clever change of variables to reduce the computational expense of finding the invariants. More recent work by Sturmfels and Sullivant [12] 1
  • 13. 2 has provided convenient mappings for this transform, which we use in our work. Only a few years ago, Matsen [8] demonstrated how to generate the complete set of constraints on the probability space which define popular nucleotide evolution models in the form of inequalities. Thanks to these researchers, we have a great deal of elegant theory on the algebraic statistics of phylogenetic invariants. However, many biologists, mathematicians and statisticians remain unconvinced of the potential for phylogenetic invariants as a practical method. Yet very little testing of the method of invariants has been done to show its efficacy one way or another. Only recently, Casanellas and Sanchez [1] tested the method of invariants for four taxa. Their results show promise for the method, especially for non-homogenous models (different parameters for each branch). Our work tests up to 5 taxa and uses the complete set of constraints as described by Matsen. We are the first to test the inequalities, and have done the most comprehensive testing of the invariants to date.
  • 14. Chapter 2 Evolutionary Biology The purpose of this project is to contribute to testing the method of phylogenetic in- variants for inferring phylogenetic trees. In this section we define phylogenetic trees and how they describe evolutionary relationships between species or individuals. 2.1 Phylogenetic Trees Phylogenetic trees describe the evolutionary relationship between taxa. A taxon, or taxa for plural, is a biological classification that places organisms into categories. For the purposes of this paper, the taxa will either be species, subspecies or individuals within a population. Mathematically, a tree is an object defined by a set of vertices as seen in Figure 2.1. Pairs of vertices define edges, or branches. A pendant edge has one vertex which is a terminal node, meaning there is no other edge incident to 3
  • 15. 4
  • 16. Figure 2.1: Tree Terminology it. Internal edges are those which are not pendant edges. Terminal nodes are also referred to as “leaves.” These leaves represent the different taxa whose evolutionary relationship we are describing. We are interested in how many leaves a tree has because the leaves represent the modern day taxa, whereas the internal nodes are ancestral taxa for whom we usually do not have genetic data. 2.2 Divergence of Species Over Time Rooted trees are the most natural way to think of phylogenetic trees. We imagine at the root an ancestral taxon where, over time, other taxa evolve from this common ancestor. Evolution continues until the modern time, where the current taxa are represented by the leaves of the tree. How many species can evolve from an ancestral
  • 17. 5 species at a time? Biologically speaking, species diverge two at a time, resulting in a tree with two edges coming from each internal node. This is called a bifurcating tree. We could consider evolutionary relationships where more than two taxa are diverging at a time, and this would be represented by a multi-furcating tree. In practice, biologists use multi-furcating trees when they don’t have enough information to resolve the relationship. For this paper, we will be considering only bifurcating trees, but it should be noted that multi-furcating trees can be created from bifurcating trees by setting some internal edge lengths equal to zero. Figure 2.2: The Relationship Between the Unrooted and Rooted Trees for 3 Taxa. We can infer relationships between modern taxa without inferring their ancestral history (an unrooted tree). However, rooted trees are more informative. For exam- ple, if we are only comparing three taxa, then we cannot say anything about their evolutionary relationship without knowing where the root is on the tree. In Figure 2.2 we see the unique unrooted tree for three taxa, which gives no indication about who is more closely related to whom. However, there are three possible rooted trees for three taxa, that describe distinct evolutionary histories.
  • 18. 6 Rooted trees can always be turned into unrooted trees by omitting the root. In an unrooted tree, the root could potentially lie on any branch. In Figure 2.2, we see that placing the root on a different branch results in a different evolutionary relationship between the taxa. In order to determine on which branch the root should be placed, some evolutionary information is needed. This is often done using an outgroup: a taxon that is evolutionarily distinct enough that its only common ancestor with the rest of the taxa under consideration must be the root.
  • 19. 7 2.3 The Tree Space As we saw in the previous section, there is one possible unrooted tree and 3 possible rooted trees for three taxa. For any given number of taxa, there are combinatorially many phylogenetic trees that could describe the evolutionary relationship between the taxa. These numbers grow quickly, as can be seen in Table 2.3, and thus we see that phylogenetic inference is a highly non-trivial problem.

  # of taxa | # of unrooted trees           | # of rooted trees
  n         | (2n−5)!/(2^(n−3) (n−3)!)      | (2n−3)!/(2^(n−2) (n−2)!)
  3         | 1                             | 3
  4         | 3                             | 15
  5         | 15                            | 105
  6         | 105                           | 945
  7         | 945                           | 10,395
  8         | 10,395                        | 135,135
  50        | 2.84 × 10^74                  | 2.75 × 10^76

  For only 50 taxa, the number of rooted bifurcating trees is approaching Eddington’s number, which is the number of electrons in the visible universe [5]. Since biologists often use more than 50 taxa, this creates a problem in finding the one “true tree” that describes the exact evolution of these taxa. Our goal is to find this true tree, or at least a good approximation of it. We need to be able to make our way through
  • 20. 8 the vast space of all possible trees to this one true tree. Since we cannot consider every tree, current methods rely on random walks through the tree space and then compare the trees sampled on the walk. A subset of all possible trees is compared, and we choose the tree out of these that we think best approximates the true tree. This is the problem of phylogenetic inference. 2.4 Methods of Phylogenetic Inference Current methods of phylogenetic inference fall into two categories: parametric and non-parametric. A popular non-parametric method is called Maximum Parsimony, which seeks to minimize the number of evolutionary changes needed to result in the modern taxa. This number of changes can be considered a score for the given tree. Parametric methods assume a model of evolutionary change in order to make inferences. Popular parametric methods score trees using the evolutionary models to calculate the likelihood of a given tree. Thus, trees can be compared by their scores, to find the tree that best satisfies the criteria. The most parsimonious tree and the highest-likelihood tree are both estimates of “the best tree.” These methods are all based on comparing genetic data.
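The tree counts from Table 2.3 in Section 2.3 can be reproduced directly from the closed-form formulas; a minimal sketch in Python (the function names are our own):

```python
from math import factorial

def num_unrooted(n):
    # (2n-5)!! = (2n-5)! / (2^(n-3) * (n-3)!) bifurcating unrooted trees on n >= 3 leaves
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def num_rooted(n):
    # (2n-3)!! rooted bifurcating trees; equals num_unrooted(n + 1)
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in range(3, 9):
    print(n, num_unrooted(n), num_rooted(n))
```

Rooting effectively adds one extra attachment point, which is why the rooted column of Table 2.3 is the unrooted column shifted by one taxon.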
  • 21. 9 2.5 The Genetic Structure To measure evolutionary change, biologists look at differences in genetic data. Genetic data comes in different forms. Sometimes the binary alphabet {0, 1} is used, since some biological traits are either present {1} or absent {0}. Additionally, the 20 letter alphabet of amino acids is sometimes used. The most common alphabet for this coding is {A, T, C, G}, where these letters represent the four nucleotides. For this research, we focused on nucleotide data. Figure 2.3: The Complementary Structure of the Nucleotides [11]. The chemical structure of the nucleotides is important for our algebraic methods. There are two types of nucleotides, the purines and the pyrimidines, as can be seen in Figure 2.3. The purines, adenine and guanine (A and G), have two rings, while the pyrimidines, cytosine and thymine (C and T), have one ring in their structure. Additionally, along the double helix, A is always paired with T and
  • 22. 10 G is always paired with C. We call these base pairs. Thus, we only need to know the nucleotide sequence for one strand, since these pairings tell us the structure of the other strand.
  • 23. Chapter 3 Evolutionary Models The method of this research is a parametric one, and so we describe the evolutionary models here. We focus on a certain class of models, called group based models. In this section we explain the general Markov model and group based models. 3.1 The General Markov Model When DNA is copied during the cell cycle, mistakes are made, resulting in changes of base. This process is modeled by a Markov chain: the mistakes, or mutations, are assumed to be random, and the probability of transitioning into a new state is solely determined by the current state. This makes sense biologically since a gene does not have a memory of its past states. This fits the definition of a Markov process, where the probability fij of transitioning from state i to state j is only
  • 24. 12 dependent on the current state i and not on any previous state. The Markov model as a model for the mutation process has been studied extensively by Felsenstein [5] and others. To summarize the probabilities of transitioning from any nucleotide state {A, C, G, T} to another we can use a 4 × 4 matrix:

  F =
        A    C    G    T
   A  α11  α12  α13  α14
   C  α21  α22  α23  α24
   G  α31  α32  α33  α34
   T  α41  α42  α43  α44

  The general Markov model places only two restrictions on the entries of the probability transition matrix F, and those restrictions are simply to satisfy the definition of a probability model: the transition probabilities are nonnegative, and each row sums to 1.
  • 25. 13 the probability of T → A is the same as A → T, et cetera. In other words, fii = fjj and fij = fji for all i, j. These restrictions can be seen in the model below:

  F∗ =
       A   C   G   T
   A  α0  α1  α2  α3
   C  α1  α0  α3  α2
   G  α2  α3  α0  α1
   T  α3  α2  α1  α0

  The probability of not transitioning can be assumed to be the same for every state, because the probability of transitioning is essentially the probability of making a mistake in copying, and a mistake is equally likely for any state. However, given that a mistake will be made at a certain site, the probability of what the mistake will be depends on the current state. Additionally, the condition fij = fji for all i, j can be thought of as representing an equal probability of making or breaking molecular bonds: to transition in one direction, bonds must be made that in the other direction would have to be broken. Once we make this assumption of symmetry, we can consider our alphabet to be a finite abelian group G and all our calculations become easier. In this case {A, C, G, T} ∼= Z2 × Z2. This is the appropriate choice, rather than Z4, because Z2 × Z2 reflects the complementary relationship between the purines and the pyrimidines, and the base pairs.
  • 26. 14 The correspondence is as follows: A ∼= (0, 0), C ∼= (0, 1), G ∼= (1, 0), T ∼= (1, 1). There is an additional property of the structure of the transition matrix above: the entry in the ith row and jth column is determined by the difference of the ith and jth group elements. Definition 3.1. Let G be a finite group where each element i ∈ G represents a discrete value, and F∗ the transition probability matrix for G. If fii = fjj, fij = fji for all i, j ∈ G and fij = h(i − j) for some function h, we call F∗ a group based model. Our matrix F∗ is called the Kimura-3 (K81) parameter model, where the fourth parameter can be written in terms of the other three. The Kimura-2 (K80) parameter model is a special case of the Kimura-3 parameter model, where α1 = α3. The Kimura-2 model is relevant because a transition from purine to purine (or from pyrimidine to pyrimidine) has a higher probability than a transversion from one type to the other, since a transversion involves a more radical change of structure, or a more extreme mistake. Additionally, the Jukes-Cantor (JC69) model is a special case where α1 = α2 = α3; in other words, there is one probability of staying in the current state and one probability of changing to any other state. This simplified model is feasible since the probability of changing state is usually quite low. The class of group based models is a subset of all evolutionary models. Figure 3.1 depicts the Felsenstein Hierarchy of Evolutionary Models [9]. This shows the
  • 27. 15 relationship between models, and how some are special cases of others. Figure 3.1: The Felsenstein Hierarchy of Evolutionary Models from [9]
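Definition 3.1 can be checked mechanically: a group based model is determined by a single function h on the group, and the symmetry fij = fji follows because every element of Z2 × Z2 is its own inverse. A minimal Python sketch (the parameter values are hypothetical):

```python
# Nucleotides as elements of Z2 x Z2: A=(0,0), C=(0,1), G=(1,0), T=(1,1)
GROUP = [(0, 0), (0, 1), (1, 0), (1, 1)]

def sub(g, h):
    # Subtraction in Z2 x Z2 (the same as addition: every element is its own inverse)
    return ((g[0] - h[0]) % 2, (g[1] - h[1]) % 2)

def group_based_matrix(h):
    # Entry (i, j) depends only on the group difference i - j, as in Definition 3.1
    return [[h[sub(gi, gj)] for gj in GROUP] for gi in GROUP]

# Kimura-3: one parameter per group element (hypothetical values)
k3 = {(0, 0): 0.91, (0, 1): 0.02, (1, 0): 0.04, (1, 1): 0.03}
F = group_based_matrix(k3)

assert all(F[i][j] == F[j][i] for i in range(4) for j in range(4))  # symmetric
assert all(abs(sum(row) - 1.0) < 1e-12 for row in F)                # stochastic rows
```

Setting the three off-identity parameters equal recovers Jukes-Cantor, and setting h[(0, 1)] = h[(1, 1)] recovers Kimura-2.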
  • 28. 16 3.3 Non-homogeneous Models Ancestral lineages for taxa may evolve at different rates. While it is simpler to assume that all taxa have evolved at the same rate along their lineages, a non-homogeneous model is more realistic. In Figure 3.2 we see a three taxa tree, where each of the four branches is labeled with its own probability transition matrix f1, . . . , f4. Figure 3.2: Non-homogeneous Model of Transition Probabilities. For a given site in the gene under consideration, suppose that the ancestral taxon r is in state Xr ∈ G, where G is the state space. What is the probability that the ith taxon will be in any one of the possible states given that the ancestral taxon was in state Xr? This is simply given by the transition probability matrix f1. To find the probability that the jth or kth taxon is in any of the possible states, we simply multiply the transition probability matrices along the path from the root to that
  • 29. 17 taxon. For example, to find the probabilities of transitioning from state to state for taxon j, we multiply f2 by f3.
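Multiplying transition matrices along the root-to-leaf path can be sketched as follows, using hypothetical 2 × 2 branch matrices and root distribution (a binary-state version of Figure 3.2):

```python
def matmul(A, B):
    # Product of two square matrices given as nested lists
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def vecmat(v, A):
    # Row vector times matrix: the state distribution after one more branch
    return [sum(v[k] * A[k][j] for k in range(len(v))) for j in range(len(v))]

def f(stay):
    # Symmetric binary transition matrix: keep the current state with probability `stay`
    return [[stay, 1 - stay], [1 - stay, stay]]

f1, f2, f3 = f(0.9), f(0.8), f(0.95)   # hypothetical branch matrices
root = [0.7, 0.3]                      # hypothetical (non-uniform) root distribution

p_i = vecmat(root, f1)                 # distribution at leaf i: one branch from the root
p_j = vecmat(root, matmul(f2, f3))     # leaf j: multiply matrices along the path

assert abs(sum(p_i) - 1.0) < 1e-12 and abs(sum(p_j) - 1.0) < 1e-12
```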
  • 30. Chapter 4 The Algebraic Statistics of Phylogenetic Trees We have now described the class of evolutionary models we will be working with. These models describe the rates of transitioning from one state to another along a single lineage. However, since we are interested in inferring the evolutionary relationship between a set of modern taxa, we want to say something about the state of each taxon. Thus we consider the probability of observing patterns of states for all taxa at a given site in the gene. These probabilities are polynomials for which we can find relations, relying on the rich theory of algebraic geometry, which we explain below.
  • 31. 19 4.1 Site Pattern Probabilities Consider the claw tree for three taxa as seen in Figure 4.1. For a fixed site in the gene under consideration, we want to consider the probability that i is in a particular state, and j is in a particular state, and k is in a particular state. In other words, we want to consider for some state space G, the probability P(i = Xi, j = Xj, k = Xk) where the X’s are states from G. We will use the shorthand notation P(i = Xi, j = Xj, k = Xk) = pijk to indicate the probability of observing the site pattern such that i = Xi, j = Xj, k = Xk. Figure 4.1: The 3-Taxa Claw Tree
  • 32. 20 Definition 4.1. For a state space G and n taxa, a site pattern probability is the probability of observing a certain assignment of states for each of the n taxa: P(i1 = Xi1 , . . . , in = Xin ) = pi1,...,in. Note that there are |G|^n possible site patterns and that summing over all possible site pattern probabilities we get Σ_{Xi1,...,Xin ∈ G} P(i1 = Xi1 , . . . , in = Xin ) = 1. Suppose for the three taxa claw tree we have a state space of {0, 1} at the root and each leaf. As seen in Figure 4.1, the branch with leaf i has the transition matrix f1, the branch with leaf j has matrix f2 and the branch with leaf k has f3. In particular:

  f1 =
   α0 α1
   α1 α0
   , f2 =
   β0 β1
   β1 β0
   , and f3 =
   γ0 γ1
   γ1 γ0
   .

  We also must define a root distribution, say f0, such that f0(0) = π0, f0(1) = π1, where π0 + π1 = 1. To find p000 = P(i = j = k = 0), we first consider each possible state at the root. If the root has value 0, then the transition probability matrices give p000 = π0α0β0γ0. Next, consider if our root has value 1; then the transition probability matrices give p000 = π1α1β1γ1, so p000 = π0α0β0γ0 + π1α1β1γ1. Similar
  • 33. 21 reasoning gives the parametrization: p000 = π0α0β0γ0 + π1α1β1γ1, p110 = π0α1β1γ0 + π1α0β0γ1, p100 = π0α1β0γ0 + π1α0β1γ1, p101 = π0α1β0γ1 + π1α0β1γ0, p010 = π0α0β1γ0 + π1α1β0γ1, p011 = π0α0β1γ1 + π1α1β0γ0, p001 = π0α0β0γ1 + π1α1β1γ0, p111 = π0α1β1γ1 + π1α0β0γ0. Observe that we have 8 binomials in terms of 8 parameters. We comment more on this later. 4.2 The Algebraic Variety of a Statistical Model The site pattern probabilities seen in the section above are polynomials in terms of the parameters from the probability transition matrices. In fact, the heart of algebraic statistics is that a statistical model of a discrete random variable is a map from the parameter space to the probability space such that the relations on the image form an algebraic variety. The generators of the ideal defining the variety are functions of the probabilities such that for any possible values of the parameters, the function vanishes. We call these model invariants. For a discrete random variable X with a state space of {1, . . . , n}, a probabil- ity distribution is the set of points (p1, . . . , pn) ∈ Rn where for all i, pi ≥ 0 and
  • 34. 22 p1 + · · · + pn = 1. While biologists are only interested in real valued results, to use the tools of algebraic geometry we consider the ambient space taken over the complex numbers. Thus we consider our statistical model within the complex numbers and define it as f : C^d → C^n, a map from the parameter space to the probability space whose coordinates f1, . . . , fn are polynomials in the parameters. The topological closure of the image of f is an algebraic variety, since the image of f is a Boolean combination of varieties [9]. This would not be true over the real numbers, and so we do our work in the complex space and introduce real constraints on the model later. Now we define If as the set of all polynomials that vanish on the set f(C^d). This means that If is the ideal corresponding to the variety given by the closure of f(C^d). Fortunately, Hilbert’s Basis Theorem tells us that this ideal can be represented by finitely many generators. This is important because the elements of If are our invariants with which we will test our model. Clearly it is desirable to have a finite set of invariants to test. 4.3 The Hardy-Weinberg Model Invariants Before we dive into the technicalities of invariants for phylogenetic trees, we present a simple example also drawn from molecular evolution, but not related to trees. Consider the Hardy-Weinberg model for the inheritance of genes in diploid organisms.
  • 35. 23 Figure 4.2: The Hardy-Weinberg Curve in the Probability Simplex from [9]. For each gene, a diploid organism carries two copies, whose variant forms are called alleles. The offspring inherits one allele from each parent. If there are only two possible alleles in the population, A and a, and if θA and θa are the frequencies of each allele in that population, then an arbitrary offspring has a probability θA² of inheriting AA, θa² of inheriting aa, and 2θAθa of inheriting Aa. Since θA + θa = 1, we can rewrite θa = 1 − θA. Since we can write our map in terms of one parameter, let us call it θ, such that θ = θA. This describes a map from the parameter space to the probability space: f : R → R³ such that θ → (θ², 2θ(1 − θ), (1 − θ)²). The trivial invariant for this map is given by the restriction on a probability model,
  • 36. 24 x + y + z = 1. The only non-trivial invariant for this model is y² − 4xz, since [2θ(1 − θ)]² − 4θ²(1 − θ)² = 0 for any possible value of θ. In Figure 4.2, we see this curve inside the probability simplex, the polygon in which all possible probability models can exist.
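Both the trivial invariant and y² − 4xz can be checked numerically for any θ; a minimal sketch:

```python
# Points on the Hardy-Weinberg curve are (x, y, z) = (theta^2, 2*theta*(1-theta), (1-theta)^2)
for theta in [0.0, 0.25, 0.5, 0.9, 1.0]:
    x, y, z = theta ** 2, 2 * theta * (1 - theta), (1 - theta) ** 2
    assert abs(x + y + z - 1.0) < 1e-12     # trivial invariant: probabilities sum to 1
    assert abs(y ** 2 - 4 * x * z) < 1e-12  # model invariant y^2 - 4xz vanishes
```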
  • 37. Chapter 5 Invariants for Group Based Models We want to find the invariants for our tree and evolutionary model. This model is a map from the transition probability parameters to the site pattern probabilities, for which the relations on the image are called phylogenetic invariants. For group based models, there is a clever change of coordinates that simplifies our site pattern probabilities and results in a toric variety. In this section we describe this transformation and give examples. Also, under certain assumptions, there are site pattern probabilities that are equal to each other, forming equivalence classes. 5.1 The Discrete Fourier Transform We continue with the 3 taxa claw tree. As we mentioned in Section 4.1, the site pattern probabilities are 8 binomials in terms of 8 parameters. We can make it
  • 38. 26 easier to find the relations on these probabilities by performing a linear change of coordinates: the discrete Fourier transform. This simplifies our work by turning the above polynomial parameterizations into monomials. We define this transformation below. Definition 5.1. For a group G and any function f : G → C there exists a function ˆf : ˆG → C, where ˆG = Hom(G, C∗) is the dual group. This is the group of all homomorphisms χ : G → C∗, where C∗ = C \ {0}. Then the discrete Fourier transform of f is given by ˆf(χ) = Σ_{g∈G} χ(g)f(g). We use this formula to transform our edge parameters fi, thereby transforming our probabilities as well. Our set of invariants for the transformed probabilities turns out to generate a toric ideal, which is well known to behave nicely. We also note that G ∼= ˆG. When working with binary sequence data and a group based model, we use Z2 with the dual group ˆZ2 ∼= Z2. The members of ˆZ2 are homomorphisms, which we call characters. In particular, ˆZ2 = {χ0, χ1} where χ0, χ1 : Z2 → C∗ such that: χ0(0) = 1, χ0(1) = 1, and χ1(0) = 1, χ1(1) = −1, which must be defined as such in order to satisfy the properties of a group as elements in ˆZ2 as well as the properties of homomorphisms. This can also be represented
  • 39. 27 as a character table [9]:

        0    1
  χ0    1    1
  χ1    1   −1

  Now we transform the functions fi:

  ˆfi(χ0) = Σ_{g∈Z2} χ0(g)fi(g) = χ0(0)fi(0) + χ0(1)fi(1) = fi(0) + fi(1)
  ˆfi(χ1) = Σ_{g∈Z2} χ1(g)fi(g) = χ1(0)fi(0) + χ1(1)fi(1) = fi(0) − fi(1),

  which we can find since the fi are given by the transition probability matrices: f0(0) = π0, f1(0) = α0, f2(0) = β0, f3(0) = γ0, f0(1) = π1, f1(1) = α1, f2(1) = β1, f3(1) = γ1. Plugging them in we get:

  ˆf0(χ0) = π0 + π1 =: r0, ˆf0(χ1) = π0 − π1 =: r1,
  ˆf1(χ0) = α0 + α1 =: a0, ˆf1(χ1) = α0 − α1 =: a1,
  ˆf2(χ0) = β0 + β1 =: b0, ˆf2(χ1) = β0 − β1 =: b1,
  ˆf3(χ0) = γ0 + γ1 =: c0, ˆf3(χ1) = γ0 − γ1 =: c1,

  where we rename our parameters accordingly.
  • 40. 28 The monomial qrst is obtained via equation (10) from Sturmfels and Sullivant [12], q(χ1, χ2, χ3) = ˆπ(χ1χ2χ3) ∏_{i=1}^{3} ˆfi(χi), where we recall that the characters behave as Z2 under addition:

  q000 = ˆπ(χ0χ0χ0)[ˆf1(χ0)ˆf2(χ0)ˆf3(χ0)] = ˆπ(χ0)[a0b0c0] = r0a0b0c0,
  q100 = ˆπ(χ1χ0χ0)[ˆf1(χ1)ˆf2(χ0)ˆf3(χ0)] = ˆπ(χ1)[a1b0c0] = r1a1b0c0.

  We find the rest of the qrst similarly: q010 = r1a0b1c0, q001 = r1a0b0c1, q110 = r0a1b1c0, q101 = r0a1b0c1, q011 = r0a0b1c1, q111 = r1a1b1c1. In order to find the relations among the monomials qrst, we encode this information in the 8 × 8 binary table

       q000 q001 q010 q011 q100 q101 q110 q111
  r0    1    0    0    1    0    1    1    0
  r1    0    1    1    0    1    0    0    1
  a0    1    1    1    1    0    0    0    0
  a1    0    0    0    0    1    1    1    1
  b0    1    1    0    0    1    1    0    0
  b1    0    0    1    1    0    0    1    1
  c0    1    0    1    0    1    0    1    0
  c1    0    1    0    1    0    1    0    1

  and use 4ti2 [6] to find a minimal generating set of the ideal. In 4ti2, we find the
  • 41. 29 Markov basis: {q000q111 − q001q110, q000q111 − q010q101, q100q011 − q000q111}. As can be seen, this is a toric ideal, since the generators are differences of monomials. While in the original coordinates the relations formed an irreducible variety, the fact that the Fourier transform gives a toric ideal makes finding the phylogenetic invariants computationally feasible. Finally, we convert the above Gröbner basis back to the pijk via equation (3) in [12], qijk = Σ_{r=0}^{1} Σ_{s=0}^{1} Σ_{t=0}^{1} (−1)^(ir+js+kt) prst, and obtain:

  q001q110 − q000q111 = p001p010 − p000p011 + p001p100 − p000p101 − p011p110 − p101p110 + p010p111 + p100p111
  q100q011 − q000q111 = p001p100 + p010p100 − p000p101 − p011p101 − p000p110 − p011p110 + p001p111 + p010p111
  q010q101 − q000q111 = p001p010 − p000p011 + p010p100 − p011p101 − p000p110 − p101p110 + p001p111 + p100p111.

  This is the complete set of invariants for the 3 taxa binary model, giving the relationship between the Fourier transformed probabilities and the original probabilities.
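Each Markov basis element vanishes identically on the monomial parametrization q_ijk = r_{i+j+k} a_i b_j c_k, which is easy to confirm numerically; a minimal sketch with arbitrary parameter values:

```python
import random

random.seed(0)
r = [random.random() for _ in range(2)]  # Fourier root parameters r0, r1
a = [random.random() for _ in range(2)]  # a0, a1
b = [random.random() for _ in range(2)]  # b0, b1
c = [random.random() for _ in range(2)]  # c0, c1

def q(i, j, k):
    # Fourier parametrization of the binary claw tree: q_ijk = r_{i+j+k} a_i b_j c_k (mod 2)
    return r[(i + j + k) % 2] * a[i] * b[j] * c[k]

# The three generators of the toric ideal vanish for every parameter choice
assert abs(q(0, 0, 0) * q(1, 1, 1) - q(0, 0, 1) * q(1, 1, 0)) < 1e-12
assert abs(q(0, 0, 0) * q(1, 1, 1) - q(0, 1, 0) * q(1, 0, 1)) < 1e-12
assert abs(q(1, 0, 0) * q(0, 1, 1) - q(0, 0, 0) * q(1, 1, 1)) < 1e-12
```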
  • 42. 30 5.2 The Sturmfels-Sullivant Mapping As it happens, there is a less cumbersome way of transforming the pijk to qrst. In the example above, the parametrization for the 3 taxa claw tree K1,3 is qijk → r(i+j+k) ai bj ck. For example, q001 → r(0+0+1) a0 b0 c1 = r1a0b0c1. However, this is just the mapping for the 3 taxa claw tree. This mapping works for any tree and is easiest to explain descriptively. Figure 5.1: The Fourier Transformed Parameterization of the 3 Taxa Giraffe Tree. It turns out that there is one Fourier transformed parameter for each edge on the tree, as well as for the root. The subscript of each parameter is simply determined by adding the states of the taxa below that edge.
  • 43. 31 To see another example, consider the 3 taxa giraffe tree as seen in Figure 5.1. Here the map is qijk → r(i+j+k) ai b(j+k) cj dk, where, for example, q001 → r(0+0+1) a0 b(0+1) c0 d1 = r1a0b1c0d1. The versatility of this approach allows us to use any state space associated with a group based model. The difference in the state space corresponds to addition (in the subscripts) within the appropriate group. We will also see that this map works for the following example. 5.3 Example of the Nucleotide Three Taxa Claw Tree Suppose we have nucleotide data for three taxa for which we want to test the claw tree K1,3, with state space {A, C, G, T} at the root and each leaf. If we assume the Jukes-Cantor model, then the branch with leaf i has the transition matrix f1, the branch with leaf j has matrix f2 and the branch with leaf k has f3. In particular:

  f1 =
   α0 α1 α1 α1
   α1 α0 α1 α1
   α1 α1 α0 α1
   α1 α1 α1 α0
   , f2 =
   β0 β1 β1 β1
   β1 β0 β1 β1
   β1 β1 β0 β1
   β1 β1 β1 β0
   , and f3 =
   γ0 γ1 γ1 γ1
   γ1 γ0 γ1 γ1
   γ1 γ1 γ0 γ1
   γ1 γ1 γ1 γ0
   .
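The subscript arithmetic behind the Sturmfels-Sullivant mapping is just componentwise addition mod 2 in Z2 × Z2; a minimal sketch of the nucleotide version:

```python
# Nucleotides as elements of Z2 x Z2: A=(0,0), C=(0,1), G=(1,0), T=(1,1)
TO_GROUP = {'A': (0, 0), 'C': (0, 1), 'G': (1, 0), 'T': (1, 1)}
FROM_GROUP = {v: k for k, v in TO_GROUP.items()}

def add(*nucs):
    # "Adding states" in the subscript rule: componentwise addition mod 2
    x = sum(TO_GROUP[n][0] for n in nucs) % 2
    y = sum(TO_GROUP[n][1] for n in nucs) % 2
    return FROM_GROUP[(x, y)]

assert add('C', 'G') == 'T'  # (0,1) + (1,0) = (1,1)
assert add('A', 'G') == 'G'  # A is the identity element
assert add('C', 'C') == 'A'  # every element is its own inverse
```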
  • 44. 32 We now have 4³ = 64 observable sequences of length three, whereas in the previous example we only had 2³ = 8. We also assume a root distribution f0(A) = π0, f0(C) = π1, f0(G) = π2, f0(T) = π3. Thus for pAAA = P(i = j = k = A), first consider when the root has value A; the transition matrices give pAAA = π0α0β0γ0. Next consider when the root has value C; the transition matrices give pAAA = π1α1β1γ1, and so on for all possible roots, giving: pAAA = π0α0β0γ0 + π1α1β1γ1 + π2α1β1γ1 + π3α1β1γ1 = π0α0β0γ0 + (π1 + π2 + π3)α1β1γ1. The remaining 63 probabilities are also four-termed polynomials in eight variables. Again, we’d like to transform our pijk into qijk. This time our state space {A, C, G, T} is isomorphic to the group Z2 × Z2, and so our dual group is ˆ(Z2 × Z2) = {χ0, χ1, χ2, χ3} such that χi : Z2 × Z2 → C∗. While these groups are isomorphic, we use the alphabet for ease of reading, while Z2 × Z2 is more useful for actual computations. Thus we have the character table

        A    C    G    T
  χ0    1    1    1    1
  χ1    1   −1    1   −1
  χ2    1    1   −1   −1
  χ3    1   −1   −1    1
  • 45. 33 Applying the Fourier transform to our transition rate matrices fi, we get

  ˆfi(χ0) = Σ_{g∈{A,C,G,T}} χ0(g)fi(g) = χ0(A)fi(A) + χ0(C)fi(C) + χ0(G)fi(G) + χ0(T)fi(T) = fi(A) + fi(C) + fi(G) + fi(T)
  ˆfi(χ1) = Σ_{g∈{A,C,G,T}} χ1(g)fi(g) = χ1(A)fi(A) + χ1(C)fi(C) + χ1(G)fi(G) + χ1(T)fi(T) = fi(A) − fi(C) + fi(G) − fi(T)
  ˆfi(χ2) = Σ_{g∈{A,C,G,T}} χ2(g)fi(g) = χ2(A)fi(A) + χ2(C)fi(C) + χ2(G)fi(G) + χ2(T)fi(T) = fi(A) + fi(C) − fi(G) − fi(T)
  ˆfi(χ3) = Σ_{g∈{A,C,G,T}} χ3(g)fi(g) = χ3(A)fi(A) + χ3(C)fi(C) + χ3(G)fi(G) + χ3(T)fi(T) = fi(A) − fi(C) − fi(G) + fi(T)
  • 46. 34 As given by our probability transition matrices, f0(A) = π0, f1(A) = α0, f2(A) = β0, f3(A) = γ0; f0(C) = π1, f1(C) = α1, f2(C) = β1, f3(C) = γ1; f0(G) = π2, f1(G) = α1, f2(G) = β1, f3(G) = γ1; f0(T) = π3, f1(T) = α1, f2(T) = β1, f3(T) = γ1. Plugging these in, we get

  ˆf0(χ0) = f0(A) + f0(C) + f0(G) + f0(T) = π0 + π1 + π2 + π3 =: r0
  ˆf1(χ0) = f1(A) + f1(C) + f1(G) + f1(T) = α0 + 3α1 =: a0
  ˆf2(χ0) = f2(A) + f2(C) + f2(G) + f2(T) = β0 + 3β1 =: b0
  ˆf3(χ0) = f3(A) + f3(C) + f3(G) + f3(T) = γ0 + 3γ1 =: c0
  ˆf0(χ1) = f0(A) − f0(C) + f0(G) − f0(T) = π0 − π1 + π2 − π3 =: r1
  ˆf1(χ1) = f1(A) − f1(C) + f1(G) − f1(T) = α0 − α1 =: a1
  ˆf2(χ1) = f2(A) − f2(C) + f2(G) − f2(T) = β0 − β1 =: b1
  ˆf3(χ1) = f3(A) − f3(C) + f3(G) − f3(T) = γ0 − γ1 =: c1
  ˆf0(χ2) = f0(A) + f0(C) − f0(G) − f0(T) = π0 + π1 − π2 − π3 =: r2
  ˆf0(χ3) = f0(A) − f0(C) − f0(G) + f0(T) = π0 − π1 − π2 + π3 =: r3

  and to summarize the rest, it is easily seen from the character table that

  ˆf1(χ1) = ˆf1(χ2) = ˆf1(χ3) = α0 − α1 =: a1
  ˆf2(χ1) = ˆf2(χ2) = ˆf2(χ3) = β0 − β1 =: b1
  ˆf3(χ1) = ˆf3(χ2) = ˆf3(χ3) = γ0 − γ1 =: c1
  • 47. 35 To find the monomials qrst we again use equation (10) from Sturmfels and Sullivant [12]: q(χ1, χ2, χ3) = ˆπ(χ1χ2χ3) ∏_{i=1}^{3} ˆfi(χi). For example we get:

  qAAA = ˆπ(χ0χ0χ0)[ˆf1(χ0)ˆf2(χ0)ˆf3(χ0)] = ˆπ(χ0)[a0b0c0] = r0a0b0c0,
  qAAC = ˆπ(χ0χ0χ1)[ˆf1(χ0)ˆf2(χ0)ˆf3(χ1)] = ˆπ(χ1)[a0b0c1] = r1a0b0c1.

  Similarly, we find the rest of the 64 monomial parameterizations in the Fourier coordinates for which we want to find invariants. 5.4 Equivalence Classes Finding the relations on all 64 monomial probabilities is still computationally expensive. We can make a powerful assumption that reduces our number of probabilities. Here we suppose that the root has a uniform distribution, or in other words that π0 = π1 = π2 = π3. As can be seen in our calculations of the Fourier transform above, this means that r1 = r2 = r3 = 0. Thus, any qijk with these parameters is eliminated, or rather, placed in a “zero” equivalence class.
  • 48. 36 We also used Mathematica [13] to code the change of coordinates using the following facts: 1. qijk → r(i+j+k) ai bj ck, where addition in the subscript of r is addition in Z2 × Z2. 2. If in ≠ i1 + i2 + · · · + in−1 in the group, then qi1...in = 0. This follows from assuming that the root distribution is the uniform distribution, in the same way as we saw above [12]. A further simplification occurs when we look at the matrix to be consumed by 4ti2:

  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
  1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
  0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
  1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1
  0 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0

  Notice that some columns are identical. This indicates that we can group the qijk into equivalence classes. We have omitted the column headers owing to space restrictions, but a glance through the Mathematica code [13] will give the following
  • 49. 37 list of equivalence classes.

  Class 1: qAAA
  Class 2: qACC, qAGG, qATT
  Class 3: qCAC, qGAG, qTAT
  Class 4: qCCA, qGGA, qTTA
  Class 5: qCGT, qCTG, qGCT, qGTC, qTCG, qTGC

  The remaining qijk reside in Class 0, when assuming a uniform distribution. Deleting all duplicate columns, we use the following matrix for 4ti2:

       q1   q2   q3   q4   q5
  r0    1    1    1    1    1
  r1    0    0    0    0    0
  a0    1    1    0    0    0
  a1    0    0    1    1    1
  b0    1    0    1    0    0
  b1    0    1    0    1    1
  c0    1    0    0    1    0
  c1    0    1    1    0    1
  • 50. 38 Here qi corresponds to class i. The Gröbner basis of the toric ideal of algebraic relations is: {q2q3q4 − q1q5²}. Our assumption of the uniform distribution at the root creates equivalence classes among the qijk and results in only one equation defining the toric ideal of phylogenetic invariants. Based on the number of site pattern probabilities in each equivalence class, this one invariant generates 1 · 3 · 3 · 3 · 6 = 162 invariants, once we substitute in all the possible q’s from each equivalence class. We can choose to reduce the number of invariants by removing only the zero class equivalent variables, or removing the non-zero class equivalent variables, or by removing both. We explore the efficacy of some of these different options in this thesis.
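That the single generator vanishes on the parametrization can be checked with one representative per class; by the subscript rule of Section 5.2 these are q1 = r0a0b0c0, q2 = r0a0b1c1, q3 = r0a1b0c1, q4 = r0a1b1c0 and q5 = r0a1b1c1. A minimal numerical sketch:

```python
import random

random.seed(1)
r0, a0, a1, b0, b1, c0, c1 = (random.random() for _ in range(7))

# Representatives of the five nonzero equivalence classes (uniform root, Jukes-Cantor)
q1 = r0 * a0 * b0 * c0  # class of qAAA
q2 = r0 * a0 * b1 * c1  # class of qACC
q3 = r0 * a1 * b0 * c1  # class of qCAC
q4 = r0 * a1 * b1 * c0  # class of qCCA
q5 = r0 * a1 * b1 * c1  # class of qCGT

# The single Groebner basis element vanishes identically
assert abs(q2 * q3 * q4 - q1 * q5 ** 2) < 1e-12
```

Both sides of q2q3q4 = q1q5² expand to r0³ a0 a1² b0 b1² c0 c1², which is why the relation holds for every parameter choice.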
  • 51. Chapter 6 Inequalities on the Discrete Fourier Transform 6.1 Motivation and Definitions There are other constraints on the probability space which we use. In fact, in 2008 Matsen [8] proved and described the complete set of constraints in the form of inequalities. The observed site pattern frequencies should roughly satisfy our invariants in order for a tree to be plausible. But there may be trees that sit in the probability simplex, satisfying the phylogenetic invariants, which are not biologically meaningful. By this, we mean that they may have negative branch lengths. First, we observe that for any edge on the tree, our transition probability matrix is
  • 52. 40 the exponential of the instantaneous rate matrix, F(e) = exp(teQ(e)), where te is some positive length of time. Since F(e) is a probability transition matrix, it must have nonnegative values, and its rows must sum to 1. Equivalently, this means that Q(e) must have non-negative off-diagonal entries, where each diagonal entry is the negative sum of the rest of the entries in that row. In other words, Q(e) = [qij] where qij ≥ 0 for i ≠ j and qii = −Σ_{j≠i} qij. Usually te is absorbed into Q(e), and can only be separated by using a molecular clock. This does not affect the aforementioned properties. For binary data, we will have a 2 × 2 probability transition matrix

  F(e) =
   α0 α1
   α1 α0
   .

  Applying our Fourier transform, we get

  ˆF(e) =
   α0 + α1      0
   0      α0 − α1
   =
   1 0
   0 θ

  where θ = exp(−2γ) and γ is the branch length. Casanellas and Sanchez [2] observed that the need for non-negative off-diagonal entries in Q(e) was equivalent to θ ≥ 0.
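A quick numerical check of this picture, assuming the standard binary symmetric rate matrix Q = γ[[−1, 1], [1, −1]] (our notation), whose matrix exponential has entries α0 = (1 + e^(−2γt))/2 and α1 = (1 − e^(−2γt))/2:

```python
from math import exp

gamma, t = 0.3, 1.0                  # hypothetical rate and time
theta = exp(-2 * gamma * t)
alpha0 = (1 + theta) / 2             # probability of staying in the current state
alpha1 = (1 - theta) / 2             # probability of changing state

# The Fourier transform diagonalizes F(e): its eigenvalues are
# a0 = alpha0 + alpha1 = 1 and a1 = alpha0 - alpha1 = theta
assert abs((alpha0 + alpha1) - 1.0) < 1e-12
assert abs((alpha0 - alpha1) - theta) < 1e-12
assert 0 < theta <= 1                # Matsen's constraint for a non-negative branch length
```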
  • 53. 41 They also observed (2007) that this was not sufficient to guarantee non-negative branch lengths. It was Matsen who observed that θ must also be at most one, since γ is a non-negative length of time. To understand why the inequality 0 ≤ θ ≤ 1 is useful, let’s look at the algebraic variety defined by the map from the Fourier parameter space to the Fourier probability space. For a bifurcating three taxa tree, the giraffe tree, there are four branches. Since we are using non-homogeneous models, each branch has its own probability transition matrix and its own θ. Thus there is a map φ : C⁴ → C⁸, φ : (θ1, θ2, θ3, θ4) → (q000, q001, . . . , q111), such that the Zariski closure of this image is an algebraic variety, as mentioned earlier in this paper. However, not all points in this set correspond to biologically meaningful trees. It turns out that Imφ(C^n) ∩ R^n ≠ Imφ(R^n). In other words, a point in the left hand side might be the image of a complex valued point. In fact, Imφ((0, 1]^n) ⊂ Imφ(R^n) ⊂ Imφ(C^n) ∩ R^n
  • 54. 42 where all the containments are strict. We needed to define our map over the complex numbers in order to use algebraic geometry and have our Fourier transform yield a toric variety, with all its inherently nice properties. Thus, we need some way to reduce the tree space back down to meaningful numbers. We can cut our image down to the set of all points that correspond to non-negative branch lengths with the inequalities. 6.2 Pendant and Internal Edge Inequalities Matsen gives us the two formulas for generating the full set of inequality constraints. These inequalities are generated from the Fourier transformed probabilities and are determined either by conditions on the pendant edges or by conditions on the internal edges. The inequalities are based on unrooted trees, and so they apply to all rooted trees associated therewith. We give the following propositions without proof, and refer the reader to Matsen's paper [8] for details. We do, however, give illustrative examples. For n taxa and a state space G for our genetic structure, we have the following two formulas: Proposition 6.1. (Matsen 2008) Given some pendant edge e, let i denote the leaf on e and let ν be the internal node on e. Pick any leaves j and k distinct from i such that the path p(j, k) contains ν. Let w(gi, gj, gk) ∈ Gⁿ assign state gx to leaf x
  • 55. 43 for x ∈ {i, j, k} and the identity to all other leaves. Then

[f(e)(h)]² = (q_w(h,−h,0) · q_w(−h,0,h)) / q_w(0,−h,h)

Example 6.1 (Pendant Edge Inequalities for the 3 Taxa Unrooted Tree). For each pendant edge on the 3 taxa unrooted tree, there are 2 leaves that satisfy the conditions for j and k. If we were to consider both cases, one where the first leaf is assigned to j and the second leaf is assigned to k, as well as the opposite case, we would actually get the same inequality. Thus, for this tree, each pendant edge gives one inequality, yielding three in total. They are:

(q110 · q101)/q011 = [f(e)(1)]² ≤ 1
(q101 · q011)/q110 = [f(e)(1)]² ≤ 1
(q110 · q011)/q101 = [f(e)(1)]² ≤ 1

Proposition 6.2. (Matsen 2008) Pick some internal edge e; say the two nodes on either side of e are ν and ν′. Choose i, j (respectively i′, j′) such that p(i, j) (respectively p(i′, j′)) contains ν but not ν′ (respectively ν′ but not ν). Let z(gi, gj, gi′, gj′) ∈ Gⁿ assign state gx to leaf x for x ∈ {i, j, i′, j′} and the identity to all other leaves.
  • 56. 44 Then

[f(e)(h)]² = (q_z(h,0,−h,0) · q_z(0,−h,0,h)) / (q_z(h,−h,0,0) · q_z(0,0,−h,h))

Example 6.2 (Internal Edge Inequalities for the 4 Taxa Unrooted Tree). For the internal edge of the 4 taxa unrooted tree, for each choice of i and j there is exactly one choice for i′ and j′. Again, the symmetry in switching i and j only yields a repeated inequality. There are only two ways to choose i and j (without considering ordering), and since there is only one choice for i′ and j′, there are two internal edge inequalities for the four taxa unrooted tree. They are:

(q1010 · q0101)/(q1100 · q0011) = [f(e)(1)]² ≤ 1,
(q1001 · q0110)/(q1100 · q0011) = [f(e)(1)]² ≤ 1

6.3 Inequalities for Nucleotide Data We first observe that for both Z2 and Z2 × Z2, each element is its own inverse. Therefore, to generate the nucleotide inequalities, we simply swap identity elements, i.e. wherever there is a 0, we replace it with A. Then, for each non-identity element g ∈ Z2 × Z2, we can replace 1 with g in each inequality. Since there are three non-identity elements, we end up with three times as many inequalities for the nucleotide data as for the binary data. For the pendant edge inequalities for the 3
  • 57. 45 taxa unrooted tree, we have:

(qCCA · qCAC)/qACC = [f(e)(C)]² ≤ 1
(qCAC · qACC)/qCCA = [f(e)(C)]² ≤ 1
(qCCA · qACC)/qCAC = [f(e)(C)]² ≤ 1

(qGGA · qGAG)/qAGG = [f(e)(G)]² ≤ 1
(qGAG · qAGG)/qGGA = [f(e)(G)]² ≤ 1
(qGGA · qAGG)/qGAG = [f(e)(G)]² ≤ 1

(qTTA · qTAT)/qATT = [f(e)(T)]² ≤ 1
(qTAT · qATT)/qTTA = [f(e)(T)]² ≤ 1
(qTTA · qATT)/qTAT = [f(e)(T)]² ≤ 1
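To make the mechanics concrete, the following Python sketch (illustrative only; the function names and the scoring helper are our own, and the thesis's actual implementation is in Perl) evaluates the three pendant-edge inequalities for the 3 taxa binary unrooted tree. For a point that factors along the tree with edge parameters θe ∈ (0, 1], each ratio recovers the square of a pendant edge parameter, so all three inequalities are satisfied:

```python
def pendant_ratios_3taxa(q):
    """The three pendant-edge ratios for the 3-taxon unrooted tree (binary data).

    Each ratio equals [f^(e)(1)]^2 for the corresponding pendant edge e and
    must be <= 1 when all branch lengths are non-negative.
    """
    return [
        q["110"] * q["101"] / q["011"],
        q["101"] * q["011"] / q["110"],
        q["110"] * q["011"] / q["101"],
    ]

def inequality_score(ratios):
    """The proportion of inequalities satisfied (the inequality score of chapter 7)."""
    return sum(r <= 1 for r in ratios) / len(ratios)

# A point that factors along the tree with pendant edge parameters t1, t2, t3:
# the Fourier coordinate q_{110} is t1*t2, q_{101} is t1*t3, q_{011} is t2*t3.
t1, t2, t3 = 0.8, 0.6, 0.9
q = {"110": t1 * t2, "101": t1 * t3, "011": t2 * t3}
ratios = pendant_ratios_3taxa(q)
assert abs(ratios[0] - t1 ** 2) < 1e-12   # first ratio recovers theta_1 squared
assert abs(ratios[1] - t3 ** 2) < 1e-12
assert abs(ratios[2] - t2 ** 2) < 1e-12
assert inequality_score(ratios) == 1.0    # all three inequalities hold
```

With noisy observed frequencies in place of exact coordinates, some ratios can exceed 1, which is precisely how the inequality score below 1 arises in practice.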
  • 58. Chapter 7 Metrics and Methods We assume aligned genetic data for a set of taxa for which we want to determine the evolutionary history. First, our method takes the data and calculates the observed site pattern frequencies. It then calculates the Fourier transformed site pattern probabilities. For a fixed transition model and for each tree topology, we have an associated set of inequalities and invariants. We need some way to score each tree topology in order to compare them and choose a "winning tree" in terms of these scores. In this chapter, we explain our algorithm for choosing the winning tree, which takes into account the inequalities and invariants scores. We test each transition model separately, comparing all topologies only within the framework of one transition model at a time. We compare the transition models themselves in the results section. In this chapter we explain the methods for calculating those two scores, as well as the measure of the distance of the true tree from the tree with the 46
  • 59. 47 optimal invariants score. We also give a summary of how the main program works and explain our methods of simulating data. 7.1 Metrics There are three main scores, or metrics, that we need to define: the inequalities score for each tree, the invariants score for each tree, and the relative distance of the true tree invariants score from the optimal invariants scoring tree. The latter is a metric involving the invariants scores over all possible tree topologies. The values for the inequalities and invariants are random variables whose distributions are unknown; thus we cannot perform significance tests. We are left to define metrics, on both the invariants and the inequalities, that are merely ad hoc. All three are described in detail below. 7.1.1 Inequalities Score The inequalities score is simple: calculate the proportion of inequalities satisfied. Definition 7.1. For a fixed tree with M inequalities, where Z is the number of inequalities satisfied by a given data set, we define the "inequality score" to be D := Z/M. This score clearly ranges between 0 and 1, where D represents the percentage of
  • 60. 48 inequalities satisfied, and the ideal score is 1, when all inequalities are satisfied. There is generally much repetition among inequality scores; sometimes all trees have an inequality score of 1. This makes sense: since the inequality score is merely a proportion of inequalities satisfied, even if different inequalities are satisfied, often the same number of inequalities are satisfied. Also, inequality scores occur in blocks; since the inequalities are defined on the unrooted tree, every rooted tree associated with that unrooted tree has the same inequality score. 7.1.2 Invariants Score In theory, the invariants for the true tree will evaluate to zero. However, the actual value of each invariant for a given set of data is generally non-zero. We calculate the average of the absolute values of the evaluated invariants, and call this the "invariants score": Definition 7.2. For a tree with N invariants, where {Y1, . . . , YN} are the invariants evaluated at the transformed observed frequencies of a given set of data, we define the distance of the data from the tree to be D∗ := (|Y1| + · · · + |YN|)/N. While there is little repetition of invariants scores, the differences between the scores are often quite small. Sometimes there are duplicate scores, and in one case with real
  • 61. 49 data, all three trees for 3 taxa had the same score. This, however, is not common. 7.1.3 The True Tree's Relative Distance from the Optimal Scoring Tree Our preliminary research showed that the true tree was often neither the winning tree nor the tree with the lowest invariants score. Recall that since the invariants for the true tree should evaluate to zero, the optimal invariants score is the one that is closest to zero. Since we cannot measure statistical significance, we cannot determine whether a given tree, though not having the lowest invariants score, is close enough to being the lowest. What we did instead was to measure the distance of the true tree's score from the lowest scoring tree relative to the length of the interval spanned by the scores. Definition 7.3. Let {X1, X2, . . . , Xn} be the ordered invariants scores of the n trees, from minimum to maximum. Let Xt be the invariants score of the true tree. Then D# := (Xt − X1)/(Xn − X1) is the relative distance of the true tree from the minimum scoring tree. We chose to use a relative measure because, as the data and the models vary, the interval positions and lengths vary. We count the cumulative frequencies of the relative distances of the true tree scores away from the minimum scoring tree. We report
  • 62. 50 the cumulative frequencies of the true tree falling into certain percentages of the intervals. For example, if we refer to the true tree being 20% into the interval, that means that its distance from the minimum tree is less than or equal to 20% of the length of the invariants interval. This is not the same thing as saying that the true tree is in the lowest 20% of tree scores, since the scores of the different trees may have any distribution and are not uniformly distributed. We look at the relative distance from the minimum score to answer the question of how often the true tree appears among trees scoring close to the minimum scoring tree. 7.2 Methods The first step of this project was to generate the constraints for all bifurcating trees with 3-5 taxa. To find the invariants, we first used Mathematica [13] to produce the Fourier transformed probabilities in matrix form via the Sturmfels-Sullivant mapping (see Appendix 10.1). We then input these matrices into a program called 4ti2 [6], which finds the Markov basis of the toric ideal for the given model. We chose the Markov basis over the Groebner basis, since the former produces a minimal generating set whereas the latter does not. The inequalities were easy enough to generate by hand, using the formulas given by Matsen [8].
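The three metrics of section 7.1 are straightforward to compute. A minimal Python sketch (an illustrative translation with our own function names; the thesis's scoring code is written in Perl, see Appendices 10.2 and 10.3):

```python
def inequality_proportion(n_satisfied, n_total):
    """D (Definition 7.1): the proportion of inequalities satisfied."""
    return n_satisfied / n_total

def invariants_score(values):
    """D* (Definition 7.2): mean absolute value of the evaluated invariants."""
    return sum(abs(v) for v in values) / len(values)

def relative_distance(all_scores, true_tree_score):
    """D# (Definition 7.3): position of the true tree's score within the
    interval spanned by all the trees' scores, as a fraction of its length."""
    lo, hi = min(all_scores), max(all_scores)
    return (true_tree_score - lo) / (hi - lo)

assert inequality_proportion(8, 9) == 8 / 9
assert abs(invariants_score([-0.1, 0.3, 0.2]) - 0.2) < 1e-12
# true tree scored 0.05, trees spanned [0.02, 0.12]: 30% into the interval
assert abs(relative_distance([0.02, 0.05, 0.12], 0.05) - 0.3) < 1e-9
```

Note that D# is a position within the interval of scores, not a rank: a relative distance of 0.3 does not mean the true tree is in the best 30% of trees.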
  • 63. 51 7.2.1 The Program The main program for our research takes genetic data as input and produces the scores defined in the previous section. First, our program calculates and transforms the observed site pattern frequencies. That is, it counts the proportion of the time that each site pattern is seen in the data. It then performs the Fourier transform on the observed site pattern frequencies. Next, our program loops through each unrooted tree and plugs these Fourier transformed site pattern frequencies into the inequality constraints. For each unrooted tree, it loops through all associated rooted trees and substitutes the Fourier transformed site pattern frequencies into the invariants. The program runs several data sets at a time, and calculates the proportion of times the tree with the lowest invariants score D∗ is the true tree. See Appendix 10.2 for the code for our main program. See Appendix 10.3 for the code that calculates the relative distance of the true tree from the minimum tree and finds the "winning tree." 7.2.2 Winning Tree Algorithm The algorithm specified below is for choosing the "winning" tree. It also counts the proportion of times the true tree is the winning tree. The algorithm is: 1. Find the maximum inequality score (often it is 1).
  • 64. 52 2. Find all trees which have the maximum inequality score. 3. Determine which tree with the maximum inequality score has the minimum invariants score. This is the winning tree. 4. If the winning tree is the true tree, add 1 to the count of the frequency of the true tree winning. This algorithm favors the trees with higher inequality scores. We chose to do this because the inequalities describe a region, and so it is much more important to make sure we are within that region. The invariants describe lower dimensional algebraic varieties in the probabilities, and the data will never fit them exactly; thus we expect some deviation from the invariants but prefer minimal deviation from the inequalities. 7.2.3 Invariants Scores for Zero and Equivalence Class Reduced Invariant Sets When we reduce the site patterns by placing them into equivalence classes, we choose one variable to be the representative for each class. We then generate relations on the smaller set of representative variables, in order to obtain a smaller set of invariants. However, from our data, we will have Fourier transformed probabilities for all site patterns, but only the class representative site patterns will show up in the invariants. We do not want to simply substitute in the values of representative
  • 65. 53 site patterns, as that would be a loss of information. Rather, for each equivalence class, we take the average of all Fourier transformed site pattern frequencies in that class, and substitute that value in for the representative site pattern of that class. However, for variables that were placed in the zero class, we do lose that information, as we do not use any of the zero class variables in generating the invariants. 7.2.4 Simulating Data We want to know which tree model our data comes from, in order to determine whether our method selects the true tree. Thus, we simulated data under a fixed evolutionary model and tree. For each rooted 3 to 5 taxa tree, using the Jukes-Cantor and Kimura-2 evolutionary models, we randomly generated sequences of data. The first step in generating data was to randomly select branch lengths for the trees. Recall from section 6.1 that our transition probability matrix is the exponential of the instantaneous rate matrix P(e) = exp(teQ(e)), where te is some positive length of time. But since we cannot separate teQ(e) without a molecular clock, we use the concept of branch length. For example, if we are considering the Jukes-Cantor model for nucleotide data, we would have an instantaneous rate matrix

Q = [ −3x   x    x    x
        x  −3x   x    x
        x    x  −3x   x
        x    x    x  −3x ]

for x ≥ 0, whose corresponding probability transition matrix is

P = (1/4) [ 1 + 3e^(−4xt)   1 − e^(−4xt)    1 − e^(−4xt)    1 − e^(−4xt)
            1 − e^(−4xt)    1 + 3e^(−4xt)   1 − e^(−4xt)    1 − e^(−4xt)
            1 − e^(−4xt)    1 − e^(−4xt)    1 + 3e^(−4xt)   1 − e^(−4xt)
            1 − e^(−4xt)    1 − e^(−4xt)    1 − e^(−4xt)    1 + 3e^(−4xt) ]

where 3xte is called the branch length [9]. (Note that each row sums to 1, as required.) Thus, we do not need to know te but can sample branch lengths to plug into probability transition matrices of the form above. Code for Generating Data Our original plan was to generate data solely in R [10]. We randomly selected branch lengths between 0.1 and 0.75, in increments of 0.2, as done by Casanellas and Fernández-Sánchez [1]. Using these branch lengths, we calculated the entries for the probability transition matrices. Using these probabilities, for each data set, we randomly chose a
  • 67. 55 genetic sequence for the ancestral taxon. This sequence was chosen using the uniform distribution on the nucleotide state space. Also note that each site was chosen independently. To generate the sequences for the modern taxa, at each site, our code looked at the state of the ancestral taxon. Depending on the state, the code chose the appropriate probabilities of transitioning to another state. For nucleotide data, the program R has a function that takes a discrete state space and a probability of choosing each possible state. We used this function to choose the state at each site for the modern taxa. See Appendix 10.4 for our data simulation code in R. Upon running our simulated data through our Perl program, as well as running real data through our program, we found that our data was not performing very well. Our program was approximately 80% in agreement with the previously determined topologies for the real data. However, for the simulated data, our program was choosing the correct model at a rate less than random chance (for three taxa, less than 33.33% of the time). We decided to use the program PAML's genetic data simulator Evolver [14] instead. Our original project was to include the Kimura-3 parameter model, but Evolver does not simulate it, and we were told that it is generally not used since the Kimura-2 parameter model usually performs just as well; we therefore removed the Kimura-3 parameter model from our research. In order to simulate thousands of data sets efficiently, we wrote an algorithm in
  • 68. 56 R that randomly samples branch lengths and prints them to files to be input into Evolver. (See Appendix 10.5.) We also wrote code that runs Evolver using all of the files, each with a different set of branch lengths. (See Appendix 10.6.) Additionally, we wrote another algorithm to "clean," or in other words to format, the data in such a way that our main program could work with it easily. (See Appendix 10.7.) For each transition model, topology, and number of sites, we sampled 10 sets of branch lengths. For each set of branch lengths, we created 100 data sets. This gives a total of 1000 data sets for each "type of data." For 3 taxa, we fixed the internal branch length and randomly sampled the external branch lengths. The internal branch length was fixed at 0.01, 0.19, 0.35, 0.55 and 0.75. For each of those we generated 1000 data sets, as explained above. Additionally, we generated 1000 data sets in which the internal branch varied over the set {0.01, 0.19, 0.35, 0.55, 0.75} and the external branches were fixed at 0.17. We followed the same scheme for 4 taxa; however, with two internal branch lengths, we varied both of them. 7.2.5 Tree Names Bifurcating trees are often named using the Newick system. This system is quite simple and effective, indicating the internal nodes of the tree with commas and subtrees within parentheses. However, to make our programming easier, we named them differently. The naming system is given in the following table.
  • 69. 57

Name   Newick
3A1    (1,(2,3))
4A1    (1,(2,(3,4)))
4A2    ((1,2),(3,4))
5A1    (1,(2,(3,(4,5))))
5A2    ((1,2),(3,(4,5)))
5A3    (1,((2,3),(4,5)))

The letter corresponds to the associated unrooted tree topology. For 3 to 5 taxa there is one unrooted tree topology each, indicated by the letter A.
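As an illustration of the simulation scheme of section 7.2.4, here is a toy Python sketch of sequence generation under the Jukes-Cantor model (the thesis's actual simulations used R and PAML's Evolver; this version and all of its names are our own). For a branch of length b, the probability that a site changes state is (3/4)(1 − e^(−4b/3)), with the three alternative nucleotides equally likely:

```python
import math
import random

NUC = "ACGT"

def jc_p_change(b):
    """JC69: probability that a site differs across a branch of length b."""
    return 0.75 * (1.0 - math.exp(-4.0 * b / 3.0))

def evolve(seq, b, rng):
    """Evolve a sequence along one branch, each site independently."""
    p = jc_p_change(b)
    out = []
    for base in seq:
        if rng.random() < p:
            # switch to one of the three other nucleotides, uniformly
            out.append(rng.choice([n for n in NUC if n != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(1)
# uniform ancestral sequence, each site drawn independently
root = "".join(rng.choice(NUC) for _ in range(500))
leaf = evolve(root, 0.17, rng)
assert len(leaf) == len(root)
assert set(leaf) <= set(NUC)
```

Evolving the root sequence once per pendant edge (and recursively through internal nodes) yields aligned leaf sequences whose site pattern frequencies can then be fed to the scoring program.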
  • 70. Chapter 8 Results Our results cover the binary (3 taxa), Jukes-Cantor (3-5 taxa), and Kimura-2 parameter (3-4 taxa) models. We break the results up first by data type, binary or nucleotide, then by number of taxa. We compare the Jukes-Cantor and Kimura-2 parameter models side by side within the nucleotide data section. 8.1 Binary Data Working with binary data provides a simpler context in which to initially explore the composition and nature of phylogenetic invariants. However, since binary data is no longer used extensively, we decided not to include it in our simulations, but rather to focus on the group based models that are used frequently: the Jukes-Cantor and Kimura-2 parameter models. However, we did some preliminary work on binary 58
  • 71. 59 data, and found some interesting results. For three taxa, we found an interesting pattern, which we prove below. 8.1.1 Three Taxa First we generated binary data sets for three taxa in R. Running these through our Perl program, we observed a pattern in the scores for the three possible labelings of the one topology. The lowest scoring, or winning, tree always had a unique score, while the other two trees had an equal score. Since this happened for every single data set, we conjectured that this was an artifact of the symmetries in the equations and in the permutations of the labelings. Lemma 8.1. For 3 taxa and binary data, where the scores using the metric D∗ are {X1, X2, X3}, if Xi = min(X1, X2, X3) then Xj = Xk for j, k ≠ i. Proof. Recall there are three trees for three taxa. The invariants for the tree with the identity labeling are: 0 = (q001) ∗ (q110) − (q010) ∗ (q101) 0 = (q000) ∗ (q111) − (q011) ∗ (q100) Performing the appropriate permutations of the labeling for the other two trees we
  • 72. 60 get: 0 = (q001) ∗ (q110) − (q100) ∗ (q011) 0 = (q000) ∗ (q111) − (q101) ∗ (q010) and: 0 = (q100) ∗ (q011) − (q010) ∗ (q101) 0 = (q000) ∗ (q111) − (q110) ∗ (q001). Notice that there are only four terms that show up in these six equations. Since this is the case, we will rename them as follows: w = (q001) ∗ (q110) x = (q010) ∗ (q101) y = (q000) ∗ (q111) z = (q011) ∗ (q100). We also observe that each newly defined variable occurs once and only once for each tree, and they occur as follows:
  • 73. 61

Tree        2D∗
(1,(2,3))   |w − x| + |y − z|
(2,(1,3))   |w − z| + |y − x|
(3,(2,1))   |z − x| + |y − w|

First note that the absolute value of a difference is really just the distance between two points. Since these variables are linearly ordered, without loss of generality we assume that w ≤ x ≤ y ≤ z, from which we can see the proof of the lemma in Figure 8.1. Figure 8.1: Proof By Picture for Lemma 8.1 In the case above, we see that the (1, (2, 3)) tree has the lowest score, and the other two trees tie. If we permute the ordering of our variables w, x, y, and z, it merely changes which tree has the smaller score.
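Lemma 8.1 is also easy to confirm numerically. The following Python sketch (illustrative only) draws random values for w, x, y, and z and checks that, of the three scores, the two non-minimal ones always tie:

```python
import random

def three_taxa_scores(w, x, y, z):
    """2*D^* for the three 3-taxon trees, in terms of the products w, x, y, z."""
    return [
        abs(w - x) + abs(y - z),   # (1,(2,3))
        abs(w - z) + abs(y - x),   # (2,(1,3))
        abs(z - x) + abs(y - w),   # (3,(2,1))
    ]

rng = random.Random(0)
for _ in range(1000):
    w, x, y, z = (rng.random() for _ in range(4))
    s = sorted(three_taxa_scores(w, x, y, z))
    # the two larger scores coincide; the minimum is generically unique
    assert abs(s[1] - s[2]) < 1e-12
```

This matches the picture proof: for four points on a line, the "nested" and "crossing" pairings give the same total distance, and only the "adjacent" pairing can be strictly smaller.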
  • 74. 62 Thus we see that one tree will always have a unique lowest score. The true tree had the lowest score about a third of the time, corresponding to uniform random chance. This corresponds to the random chance of the ordering w ≤ x ≤ y ≤ z of the variables. For 3 taxa binary invariants, each tree has an equal chance of winning. Recall that binary data has a state space of {0, 1}, where 0 indicates an absence of a given trait and 1 indicates its presence. We believe that the method of phylogenetic invariants does not perform well enough for binary data, since having only two states does not give much information and produces symmetries that make it difficult to distinguish between topologies. We believe this method is not suited to binary data, but since binary data is uncommon in current genetics research, this does not weigh heavily against the method as a whole. 8.2 Nucleotide Data The focus of our work is on nucleotide data, using the Jukes-Cantor and Kimura-2 parameter models. First, we give tables of the sizes of the sets of inequalities and invariants. Then, for each number of taxa, we describe the overall efficacy of the invariants and inequalities. Next we show the results of the cumulative frequencies of the relative distances of the true tree from the minimum scoring tree, as well as the distribution of the invariants scores for the true tree. For 3 and 4 taxa,
  • 75. 63 we will evaluate the effect of different branch lengths. For 5 taxa we have some surprising results. Additionally, we will describe the efficacy of the invariants and inequalities, as well as the cumulative efficacy of the invariants and the distribution of the invariants scores, for real data. 8.2.1 Summary of Combinatorial Growth of Models This summary of the combinatorial growth of the models gives a sense of some of the computational hurdles we encountered and why we chose to use certain subsets of invariants. The number of invariants and inequalities grows combinatorially depending on the number of taxa and the transition model. The main limitations of this research were based on the surprisingly and unfortunately large size of the invariant sets. We give a table below to show this. We were not able to calculate the full set of invariants beyond the 4 taxa Jukes-Cantor models. We used equivalence classes and zero classes to generate subsets of invariants for 4 and 5 taxa.

Full Invariants Set Combinatorial Growth
Taxa, Model   Site Patterns   Inequalities   Invariants
3, JC         64              9              84
3, K2         64              9              825
4, JC         256             30             734
4, K2         256             30             ???
5, JC         1024            72             ???
  • 76. 64

Equivalence Class Combinatorial Growth
Taxa, Model   Site Patterns   Equivalence Classes   Equivalence Class Invariants
3, JC         64              13                    33
3, K2         64              34                    795
4, JC         256             34                    512
4, K2         256             116                   45450
5, JC         1024            89                    6379

Zero Class Combinatorial Growth
Taxa, Model   Site Patterns   Zero Class Variables   Zero Class Invariants
3, JC         64              16                     12
3, K2         64              16                     15
4, JC         256             64                     825

Zero and Equivalence Class Combinatorial Growth
Taxa, Model   Site Patterns   Variables   Invariants
3, JC         64              5           1
3, K2         64              10          9
5, JC         1024            34          512
  • 77. 65 Figure 8.2: Efficacy of Invariants and Inequalities for 3-5 Taxa 8.2.2 Efficacy of Invariants and Inequalities for Three to Five Taxa Figure 8.2 compares the efficacy of the minimum tree score with our winning tree algorithm. The order of the data sets was chosen by generally ordering them from worst to best performing. This graph shows that the true tree is the minimum scoring tree equally often or more often than it is the winning tree. This means that the inequalities actually make it less likely to choose the true tree; the invariants are more efficient on their own. The inequalities are likely ineffective because they cannot distinguish between many trees (which have the same score) and because noise in the data may cause the true tree to have a less than maximum inequalities score. We do not incorporate inequality scores that were close to the maximum inequality score into our analysis, sometimes eliminating
  • 78. 66 the true tree because its inequality score was slightly lower than the maximum inequality score. Because our use of the inequalities makes matters worse the vast majority of the time, we focus most of our analysis of the data on the minimum scoring tree rather than the winning tree. The inequalities make it less likely to choose the true tree. The invariants are more efficient on their own. Before we move on from the inequalities, however, we make a few comments on the nature of the data, then make some general comparisons between the data types. We must first point out that not all data types have the same number of sites; some were run with 500 sites and some with 1000 sites. This is due to the computational expense of the program, which we will not explain here as it is tedious and not enlightening. We would have preferred to compare all of these with the same number of sites, but this is what we have to work with. Additionally, because of the computational expense, each data type for 3 and 4 taxa is tested on 1000 data sets. However, for the 5 taxa equivalence class reduced subsets, only 25 data sets have completed, and for the 5 taxa zero and equivalence class reduced subsets, only 93 data sets have completed thus far (at print). We estimate it will take months more to complete all 1000 data sets. Looking at the table of the exact numbers, there are many observations to be made:
  • 79. 67

Filename                    % True Tree Wins   % True Tree Minimum
JCzeroequiv5A3.500sites     0.086021505        0.053763441
JCzeroequiv5A2.500sites     0                  0
JCzeroequiv5A1.500sites     0                  0.010752688
JCequiv5A3.500sites         0                  0
JCequiv5A2.500sites         0                  0
JCequiv5A1.500sites         0                  0
JCzero4A1.500sites          0.077              0.145
JCzero4A2.1000sites         0.106              0.218
JCequiv4A1.1000sites        0                  0
JCequiv4A2.1000sites        0.474849095        0
JCdiff4A1.500sites          0.043              0.129
JCdiff4A2.500sites          0.093              0.253
JCfull4A1.1000sites         0.047              0.105
JCfull4A2.1000sites         0.121              0.225
JCzero3A1.500sites          0.367              0.367
JCfull3A1.500sites          0.367              0.367
K2zero4A1.500sites          0.095              0.258
K2zero4A2.500sites          0.122              0.232
K2equiv4A1.500sites         0.1                0.9
K2equiv4A2.500sites         0                  1
K2zero3A1.500sites          0.416              0.415
K2full3A1.500sites          0.621              0.62
  • 80. 68 Equivalence Class Reduced Subsets of Invariants First and most notably, the 4 taxa Kimura-2 parameter equivalence class reduced subsets of invariants performed remarkably well, with the balanced tree's true tree labeling having the minimum score 100% of the time and the giraffe tree's true tree labeling having the minimum score 90% of the time. This is far better than the 4 taxa Jukes-Cantor full set of invariants, which were 22.5% effective for the balanced tree and only 10.5% effective for the giraffe tree. The Jukes-Cantor equivalence class reduced subsets of invariants were effective 0% of the time, except that for the balanced tree, the true tree was actually the winning tree ≈ 47.5% of the time. This is one of the few instances where the winning tree algorithm performs better than looking at just the minimum invariants scores of the trees. The 5 taxa Jukes-Cantor equivalence class reduced subsets of invariants would likely perform much better if the sample size were 1000, rather than 23. For the 4 taxa Kimura-2 parameter equivalence class reduced subset of invariants, the true tree has the minimum score 90-100% of the time. Zero Class Reduced Subsets of Invariants In general, whenever the zero class is removed from the situation, the performance drops, possibly due to the loss of information, as the zero class observed site patterns are not being utilized in these cases.
  • 81. 69 Differences of Invariants Sets For a fixed number of taxa and a fixed transition model, the sets of invariants for the different topologies will have many common invariants. These common invariants are due to a shared portion of the structure of the topologies. For 4 taxa Jukes-Cantor, we removed the common invariants and ran the 4 taxa data using only the differences between the sets of invariants. This, as expected, performed better than the full sets of invariants, increasing the effectiveness by about 2%. The increase is not impressive, but the fact that it performs better at all is useful: although no method yet exists for finding the differences between the sets without first finding the full sets, if one did exist, it would likely be less computationally expensive than finding the full set of invariants. This could be very helpful for reducing the computational prohibitiveness of this method. We were not able to calculate the full sets of 4 taxa Kimura-2 parameter invariants; however, if there were a way to calculate only the differences between the sets, this might be computationally feasible. For 4 taxa Jukes-Cantor, the differences between the sets of invariants for each topology perform better than the full sets by about 2%. Comparing Jukes-Cantor with Kimura-2 Finally, we note that for the same number of taxa and the same types of invariants (sub)sets, the Kimura-2 parameter invariants always outperform the Jukes-Cantor
  • 82. 70 invariants. The Kimura-2 parameter invariants always greatly outperform the Jukes-Cantor invariants. 8.2.3 Three Taxa For 3 taxa, we compare a variety of types of data. We vary the number of sites and branch lengths, and of course compare the different transition models and tree topologies for the efficacy of the invariants and inequalities. We note that for 3 taxa, there is no difference between the winning and minimum scoring trees. This is because all 3 trees come from the same unrooted topology and thus have the same inequality score. Thus, the inequalities have no effect with 3 taxa. We will generally refer to the "winning tree," to keep things simple. Number of Sites We begin by demonstrating the effect of the number of sites. While it is natural to think that an increase in the number of sites would increase the accuracy of the method, we actually see that for 100, 500, and 1000 sites there is not much of a difference, and not a strictly increasing one. Looking at Figure 8.3, we see that the percentage of the time that the true tree wins is not remarkably different across numbers of sites greater than 100. Perhaps if we
  • 83. 71 Figure 8.3: Varying Number of Sites for 3 Taxa tested with very low numbers of sites, say below 64 (the number of site patterns), we would see much lower efficacy. Preliminary research also showed that the method was not greatly improved with 5000 or 10,000 sites. Thus, we conclude that the number of sites, once over 100, does not have a strong effect. We place the rest of our focus on other variables. The number of sites over 100 does not have a strong effect on the performance of the method.
  • 84. 72 Cumulative Frequencies of Invariants Intervals We now look at data for 3 taxa, 500 sites, and any branch length. For Jukes-Cantor, the true tree is the winning tree 36.7% of the time, and for Kimura-2 the true tree is the winning tree 62% of the time. Though neither of these is a great result, at least for Kimura-2 the true tree is chosen a majority of the time. However, 62% is not frequent enough to be reliable for model testing. We now look at the frequency that the true tree occurs within a certain proportion of the invariants interval. Figure 8.4: The Cumulative Frequencies of 3 Taxa with 500 Sites We see in Figure 8.4 that the Jukes-Cantor invariants only begin to capture the true tree a majority of the time when considering trees 40% into the interval. The true tree is only captured 100% of the time when considering all trees. For Kimura-2,
• 85. 73 the situation is better: the true tree is captured a majority of the time within any distance, and the capture rate gradually improves, but the true tree is not captured 100% of the time until all trees are considered. For Kimura-2, the true tree is captured at least 80% of the time when 70% into the interval. For 3 taxa, the Kimura-2 parameter invariants: • always capture the true tree a majority of the time. • capture the true tree at least 80% of the time when 70% into the interval. Variation of the Branch Lengths The data considered above was for any length of the branches between 0.01 and 0.75. For three taxa, we now look at the efficacy of the method as the internal branch varies (recall there is only one internal branch length for the 3 taxa topology). The idea is that when the internal branch is small, the correct bifurcating tree is similar to a claw tree, and is hard to distinguish from the other bifurcating topologies. With a long internal branch, the two taxa off of the internal branch should be genetically distinct enough from the third taxon that the true tree is easier to determine. We notice that when the internal branch length is short (0.01), the true tree is chosen around a third of the time for all models. This is due to the fact that the internal branch is so short that the true tree is almost a claw tree. Thus, the method cannot differentiate between the three tree topologies. As the internal branch length increases, both models perform better, especially when the external branch lengths
• 86. 74 Figure 8.5: Varying the Internal Branch Lengths for 3 Taxa are fixed at the moderate size of 0.17. Kimura-2 performs excellently with a long internal branch length and fixed external branch lengths of 0.17: when the internal branch length is 0.75, the true tree has the minimum score 90.6% of the time. For 3 taxa, the Kimura-2 invariants perform excellently with an internal branch length of 0.75 and fixed external branch lengths of 0.17, where the true tree has the minimum score 90.6% of the time. Now we look at the cumulative frequencies of percentages into the invariants intervals across the different branch lengths. For the following, we look only at the cases where the external branch lengths are fixed at 0.17. We choose to focus on this because it is more comparable to the work of Casanellas and Fernandez-Sanchez [1], whose
• 87. 75 data was completely limited to one internal branch length and one external branch length. Figure 8.6: Cumulative Frequencies of Varying Internal Branch Lengths for 3 Taxa Jukes-Cantor with External Branches of 0.17 Notice that for the Jukes-Cantor model (see Figure 8.6) it is not until we get 60% into the interval that the true tree is captured a majority of the time for all internal branch lengths. Unfortunately, the true tree is not captured at least 80% of the time at any point besides when we consider the whole interval, or in other words, when we consider all trees. Kimura-2, as we have seen, fares much better. As can be seen in Figure 8.7, once the internal branch length is more than twice the external branch length (0.35 and 0.17 respectively), the true tree wins a majority of the time
• 88. 76 Figure 8.7: Cumulative Frequencies of Varying Internal Branch Lengths for 3 Taxa Kimura-2 with External Branches of 0.17 and the true tree is captured at least 80% of the time when considering trees 30% into the interval. Distribution of Invariant Scores We would like to have a sense of the range of values for the true tree invariants scores. The true tree invariants scores for both transition models are given in Figure 8.8. Note that in Figure 8.8 the scores are scaled so they can be compared. The scale is given along the horizontal axis. The Jukes-Cantor scores are found by multiplying the scale by 0.01 and the Kimura-2 scores are found by multiplying the scale by
• 89. 77 Figure 8.8: The Distributions of the Invariants Scores for 3 Taxa Jukes-Cantor and Kimura-2 0.001. This in itself is interesting: the Kimura-2 true tree invariants scores are smaller by a factor of 10. Additionally, if we found a typical range of true tree scores, when testing all trees we could exclude any tree that falls above that range. This data gives us a look at what that range might be. 8.2.4 Four Taxa For 4 taxa, we have one unrooted tree with two rooted topologies, for a total of 15 distinct labelings of rooted trees to test. The number of evolutionary models from which to choose is larger, making the uniform random chance of choosing the correct tree less likely. Of course, the matter is not up to uniform random chance, but if it were, we note that each tree would have a one in fifteen chance, or 6.67% chance
• 90. 78 of being chosen. While the method often performed better than by uniform chance, we note that it did not always perform well for 4 taxa. We also test various types of data for 4 taxa. We vary the branch lengths for each of the topologies. In the case of 4 taxa, there are two internal branches. We vary both branches, noting that the best performance should occur when both branches are longer. When one branch or the other is short, different confusions are likely to occur, but the end result is the same: the true tree is not identified. Additionally, we consider some subsets of the full set of invariants. In particular, we tested the subset of invariants with the zero class removed. Since we were unable to compute the full set of invariants for the Kimura-2 model, comparing the efficacy of Jukes-Cantor and Kimura-2 happens in the context of the zero class reduced subsets. Since we do have the full set of invariants for Jukes-Cantor, we look at those outcomes in this section as well. The two rooted topologies, the giraffe and the balanced tree, share many invariants. We also look at how effective the method is when we remove all invariants they have in common; in other words, we only use the differences of the invariants sets. All of these results are discussed in detail below. Finally, we observe from Table 8.2.2, that for Jukes-Cantor, both the zero class
• 91. 79 reduced subset of invariants and the differences between the full sets of invariants outperform the full set of invariants. For 4 taxa, Jukes-Cantor, both the zero class reduced subset of invariants and the differences between the full sets of invariants outperform the full set of invariants. Cumulative Frequencies of Invariants Intervals First we look at the efficacy of the method with the full set of Jukes-Cantor invariants, with 1000 sites. In all cases the true tree is neither the winning tree nor the minimum tree a majority of the time. For both tree topologies, we see in Figure 8.9 that it is not until 30% into the interval that they both capture the true tree a majority of the time. Additionally, it is not until 60% into the interval that they both capture the true tree at least 80% of the time. For Kimura-2, the true tree is captured 85% of the time when considering all trees in the lower half of the interval. Now we will compare some of the subsets of invariants that we considered, with 500 sites and any branch lengths. We ran data using the differences between the full sets of Jukes-Cantor invariants. We also ran data using the zero class reduced subsets of invariants, for both Jukes-Cantor and Kimura-2. Recall that we were not able to generate the full set of invariants for Kimura-2, so we were not able to use
• 92. 80 Figure 8.9: The Cumulative Frequencies for 4 Taxa Jukes-Cantor with 1000 sites the differences in this case. The three cases are compared in Figure 8.10. The goal of Figure 8.10 is first to compare the performance of the differences and zero class subsets of the Jukes-Cantor invariants, and second to compare the performance of the Jukes-Cantor and Kimura-2 zero class invariants. First, note that for lower relative distances, where it really counts, the differences perform worse than the zero class invariants. Since the zero class subset of invariants takes less computational expense to generate, this is good news. However, it was my thought that the differences should outperform any other method, since they would produce more extreme values of scores for the invariants, without the common invariants weighting the scores similarly. It is of course obvious by this point that the Kimura-2 zero class
• 93. 81 Figure 8.10: The Cumulative Frequencies of Subsets of Invariants for 4 Taxa with 500 sites invariants outperformed the Jukes-Cantor zero class invariants. Variation of the Branch Lengths For both trees, even as the branch length increases, the true tree is neither the winning tree nor the minimum scoring tree a majority of the time. Since, for 4 taxa, the minimum tree method outperforms our winning tree algorithm, we evaluate the effects of varying the branch lengths by using the minimum scoring performance. All analysis on the variation of branch lengths is done on Jukes-Cantor, 1000 site data run through the full set of Jukes-Cantor invariants. Both topologies perform best as both internal branches become longer. However, at the maximum internal branch lengths, for neither topology is the true tree the
• 94. 82 Figure 8.11: Variation of Branch Lengths for the Giraffe Tree and Jukes-Cantor Invariants minimum scoring tree a majority of the time: for the giraffe tree the true tree is the minimum scoring tree 23.2% of the time, and for the balanced tree the true tree is the minimum scoring tree 46.2% of the time. Next, we look at the cumulative frequencies of the relative distances into the interval. Clearly, as seen in Figure 8.13, the long (0.75) internal branches outperform the short (0.01) internal branches. In the latter case, the true tree is not captured a majority of the time until 50% into the interval, and is not captured at least 80% of the time until we are 80% into the interval. However, the balanced tree does quite well only 10% into the interval, where it captures the true tree 86.3% of the time.
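The "percentage into the interval" statistics used throughout these comparisons can be computed as in the sketch below (the function names and example scores are mine, not the thesis program): a tree's relative distance is the position of its invariants score within the interval from the minimum to the maximum score over all candidate trees, and the cumulative frequency is the fraction of data sets whose true tree falls within a given cutoff.

```python
def relative_distance(scores, tree):
    """Position of `tree`'s invariants score within the interval
    [min score, max score] over all candidate trees: 0.0 means the
    tree is the minimum scoring tree, 1.0 the maximum scoring tree."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return 0.0
    return (scores[tree] - lo) / (hi - lo)

def capture_frequency(datasets, cutoff):
    """Fraction of data sets whose true tree lies within `cutoff`
    (e.g. 0.3 = 30% into the interval) of the minimum score.
    Each data set is a (scores dict, true tree name) pair."""
    hits = sum(1 for scores, true in datasets
               if relative_distance(scores, true) <= cutoff)
    return hits / len(datasets)

# Hypothetical invariants scores for the three rooted 3-taxa trees:
scores = {"T1": 0.012, "T2": 0.020, "T3": 0.031}
print(relative_distance(scores, "T1"))  # 0.0 -- T1 is the minimum tree
```

Sweeping `cutoff` from 0 to 1 and plotting `capture_frequency` reproduces the shape of the cumulative-frequency curves in the figures.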
• 95. 83 Figure 8.12: Variation of Branch Lengths for the Balanced Tree and Jukes-Cantor Invariants Unfortunately, the giraffe tree does not fare as well, capturing the true tree at least 80% of the time only when considering the trees 30% into the interval. In general, the balanced tree is more accurately predicted than the giraffe tree. This is likely because there are four times as many giraffe trees as there are balanced trees.
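The 12-versus-3 split between giraffe and balanced labelings can be checked directly. The sketch below (my own enumeration code, not the thesis program) builds all rooted binary labeled trees on 4 leaves by attaching each new leaf to every edge, then classifies each tree's shape: a 4-leaf tree is balanced exactly when both children of the root are cherries.

```python
def add_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` on an edge of
    `tree` (including above the root). Trees are nested 2-tuples;
    a bare string is a leaf."""
    yield (tree, leaf)                      # attach above the root
    if isinstance(tree, tuple):
        left, right = tree
        for t in add_leaf(left, leaf):      # attach inside the left subtree
            yield (t, right)
        for t in add_leaf(right, leaf):     # attach inside the right subtree
            yield (left, t)

def rooted_trees(leaves):
    """All rooted binary trees on the given labeled leaves."""
    trees = [(leaves[0], leaves[1])]
    for leaf in leaves[2:]:
        trees = [t for tree in trees for t in add_leaf(tree, leaf)]
    return trees

def is_balanced(tree):
    """Balanced 4-leaf shape: both children of the root are cherries."""
    left, right = tree
    return isinstance(left, tuple) and isinstance(right, tuple)

trees = rooted_trees(["A", "B", "C", "D"])
balanced = [t for t in trees if is_balanced(t)]
print(len(trees), len(balanced), len(trees) - len(balanced))  # 15 3 12
```

So of the 15 rooted labelings, 3 are balanced and 12 are giraffe trees, which is the four-to-one ratio mentioned above.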
• 96. 84 Figure 8.13: The Effect of Short and Long Internal Branches Distribution of Invariants We will compare the distributions of the scores of the full set of invariants for each topology for the Jukes-Cantor model here. In Figure 8.14, we see how the scores of the true tree of both 4 taxa topologies are distributed. Their scores are quite similar, despite the fact that around half of their invariants are different. We mostly include this for completeness.
• 97. 85 Figure 8.14: The Distributions of the True Trees for 4 Taxa Jukes-Cantor 1000 Sites 8.2.5 Five Taxa For 5 taxa, we have one unrooted topology with three rooted topologies, for a total of 105 distinct labelings of rooted trees to test. We tested only the zero and equivalence class reduced subset and the equivalence class reduced subset of the Jukes-Cantor invariants; since there are so many rooted trees to test, this still takes a long time. Recall that for the Jukes-Cantor zero and equivalence class reduced subset of invariants, only 93 data sets have completed so far, and for the Jukes-Cantor equivalence class reduced subset of invariants, only 25 data sets have completed so far. Thus, our power is not high. Nonetheless, we make some observations based on the data we have.
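The counts of rooted labelings quoted in this chapter (3 for 3 taxa, 15 for 4 taxa, 105 for 5 taxa) follow the double-factorial formula (2n − 3)!! for rooted binary trees on n labeled leaves; a quick check (the function name is mine):

```python
def num_rooted_trees(n):
    """(2n-3)!! = 3 * 5 * ... * (2n-3): the number of rooted binary
    trees on n labeled leaves."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count

print([num_rooted_trees(n) for n in (3, 4, 5)])  # [3, 15, 105]
```

This is also where the uniform-chance baselines come from: 1/15 ≈ 6.67% for 4 taxa, and 1/105 ≈ 0.95% for 5 taxa.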
• 98. 86 From our comparison of the percentage of the true tree winning versus the true tree having the minimum score earlier in this section, in Table 8.2.2, things may have looked bleak. The 5 taxa portion of that table is reproduced here:

Filename                  % True Tree Wins   % True Tree Minimum
JCzeroequiv5A3.500sites   0.086021505        0.053763441
JCzeroequiv5A2.500sites   0                  0
JCzeroequiv5A1.500sites   0                  0.010752688
JCequiv5A3.500sites       0                  0
JCequiv5A2.500sites       0                  0
JCequiv5A1.500sites       0                  0

While the non-zero numbers may be better than the uniform chance of 1 in 105, they are still far from a majority. We now look at the relative distance of the true tree from the minimum tree to see how well the method performs. Cumulative Frequencies of Invariants Intervals For the zero and equivalence class reduced subsets of the Jukes-Cantor invariants we have the following graph:
• 99. 87 Figure 8.15: The Cumulative Frequencies for the Zero and Equivalence Class Reduced Invariants In Figure 8.15 we see that only 20% into the interval, we capture the true tree a majority of the time, in fact at least 84% of the time. Only 40% into the interval, we capture the true tree 100% of the time. For the equivalence class reduced subsets of the Jukes-Cantor invariants we have the following graph:
• 100. 88 Figure 8.16: The Cumulative Frequencies for the Equivalence Class Reduced Invariants In Figure 8.16 we see that 40% into the interval we have captured the true tree not only a majority of the time, but at least 80% of the time. It is 60% into the interval that we capture the true tree 100% of the time. The decrease in performance of the equivalence class reduced invariants is more likely due to the smaller sample size than to its actual effectiveness. In general, we hypothesize that reducing the invariants by removing the zero class will reduce the efficacy by underutilizing the information in the data. However, for both cases we have here, even with the much smaller sample sizes, while the true tree is rarely the minimum scoring tree, the true tree is found to be relatively closer to the minimum
  • 101. 89 tree than with 3 or 4 taxa. For 5 taxa, the reduced Jukes-Cantor invariants capture the true tree 100% of the time only a short distance into the interval.
• 102. 90 8.2.6 Real Data It is important to note that simulated data is oversimplified; real data is more complex. We ran a small set of real data through our program to determine whether there were any notable performance differences. We started with two sets of 6 taxa, one containing bird genetic data and the other containing chipmunk genetic data, with well understood evolutionary histories. Selecting 3, 4 and 5 taxa subsets, we had very little statistical power and no absolute knowledge of the true tree. However, we find the results interesting nonetheless. The original topologies are shown on the next page; these trees are courtesy of Dr. Greg Spicer, San Francisco State University. [Bird tree taxa: Vermivora celata, Dendroica coronata, Zonotrichia leucophrys, Junco hyemalis, Carpodacus mexicanus, Carduelis tristis. Chipmunk tree taxa: T. cinericollis, T. minimus borealis, T. townsendii, T. obscurus davisi, Marmota vancouverensis, Sciurus carolinensis.] Note that we cannot use just any subset for a given tree topology, only those subsets that correspond to a topology with the same number of taxa. From the Chipmunk
• 103. 91 data we get 20 3A1 trees, 15 4A1 trees, and 6 5A1 trees. From the bird data we get 16 3A1 trees, 4 4A1 trees, 3 4A2 trees, 4 5A2 trees, and 2 5A3 trees. As you can see, there are very few data sets here, and thus the power of our analysis is not high. 8.2.7 Efficacy of Invariants and Inequalities for Real Data The following image shows that, unfortunately, for our few 5 taxa data sets the invariants were not able to determine the true tree, with or without the inequalities. For 3 and 4 taxa, we see again that the inequalities are reducing the frequency of choosing the true tree. The graph represents the percentage of times that the true tree is chosen using our algorithm, which favors the inequality score, versus the percentage of the time the true tree has a minimum invariants score. The order of the data sets was chosen by generally considering which data sets had been performing worst to best. Generally, the more trees to test (more taxa), the worse the performance. Additionally, the Kimura-2 parameter model outperforms the Jukes-Cantor model a large majority of the time. For real data, the Kimura-2 parameter model outperforms the Jukes-Cantor model a large majority of the time.
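The comparison above is between two selection rules. As I understand the algorithm (a sketch under my reading of the text; the function names and the tolerance mechanism for the suggested 0.96 cutoff discussed below are mine, not the thesis program), the inequality-favoring rule first restricts to the trees with (near-)maximum inequality score and then takes the minimum invariants score among them:

```python
def pick_by_min_invariants(inv_scores):
    """Baseline rule: the tree with the minimum invariants score."""
    return min(inv_scores, key=inv_scores.get)

def pick_favoring_inequalities(inv_scores, ineq_scores, tol=0.0):
    """Restrict to trees whose inequality score is within `tol` of the
    maximum (tol=0.04 would admit scores of 0.96 when the maximum is
    1.0), then choose the minimum invariants score among them."""
    best = max(ineq_scores.values())
    candidates = [t for t in inv_scores if ineq_scores[t] >= best - tol]
    return min(candidates, key=inv_scores.get)

# Hypothetical scores for three candidate trees:
inv = {"T1": 0.010, "T2": 0.015, "T3": 0.020}
ineq = {"T1": 0.96, "T2": 1.00, "T3": 0.90}
print(pick_by_min_invariants(inv))                      # T1
print(pick_favoring_inequalities(inv, ineq))            # T2
print(pick_favoring_inequalities(inv, ineq, tol=0.04))  # T1
```

The example shows why the winning-tree percentage can only be less than or equal to the minimum-score percentage: a strict inequality filter can exclude the minimum-scoring tree, while a nonzero tolerance relaxes the filter back toward the baseline rule.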
• 104. 92 Figure 8.17: The Efficacy of the Invariants and Inequalities for Real Data Since the percentage of the time that the true tree wins is always less than or equal to the percentage of the time that the true tree has the minimum invariants score, we conclude that the inequalities are not useful for evolutionary model testing. We conclude that the inequalities are not useful for evolutionary model testing. It is possible that the ineffectiveness has something to do with our method of scoring the inequalities. Perhaps with a known distribution of the inequalities, we could add a statistical significance condition. For example, a score of 0.96 or higher may not be significantly different from a score of 1. In this case, we would want to include trees with a score of 0.96 or higher in our “maximum inequality” set, from which
• 105. 93 we look for a minimum invariants score. However, we find our method of scoring the inequalities to be quite natural. Aside from measuring statistical significance in deviation from a score of 1, we believe there is no better way to score the inequalities. Cumulative Frequencies of Invariants Intervals Figure 8.18: The Cumulative Efficacy of 3 Taxa Real Data For 3 taxa, we see that the Kimura-2 parameter model is capturing the true tree at least 80% of the time. However, we don’t capture the true tree 100% of the time until we get all the way or almost all the way into the interval. The Jukes-Cantor model is not capturing the true tree at least 80% of the time until we get 40-50%
• 106. 94 into the interval. See Figure 8.18. For 3 taxa, we are using the full set of invariants for both transition models. Figure 8.19: The Cumulative Efficacy of 4 Taxa Real Data For 4 taxa with the Kimura-2 parameter model, the true tree is captured 100% of the time 20% into the interval. All models are capturing the true tree 100% of the time 50% into the interval. See Figure 8.19. Note that with 4 taxa, for the Jukes-Cantor model we use the full set of invariants, but with the Kimura-2 model we are using the subset of invariants with the zero class variables removed. For 5 taxa, I used the zero and equivalence class reduced subsets of the Jukes-Cantor invariants. For all topologies, the true tree is captured 0% of the time until 20% into the interval, at which point the true tree is captured 100% of the time. While there were only a few data sets with 5 taxa, this still is an indication that just looking at the