This poster was prepared for the undergraduate poster session at the Joint Mathematics Meetings in January 2016. It presents a sample of my summer 2015 research on implementing a new tree reconstruction method at the Winthrop University REU: Bridging Applied & Theoretical Mathematics.
A distance-based method for phylogenetic tree reconstruction using algebraic geometry
Emily Castner*¹, Brent Davis², and Dr. Joseph Rusinko³
Mount Holyoke College¹, Colorado State University², Hobart and William Smith Colleges³
1. Distance to variety
• Variety for probabilistic model of evolution
• Aligned DNA sequence data as data points for each topology
Keep if cutoff > value
• Quartet topology gives taxa pairings
• Data-dependent hypothesis testing and other scores
Phylogenetic Tree
Phylogenetic trees show evolutionary history
• We reconstruct four-species quartet trees, which can be amalgamated into larger phylogenetic trees.
• In Markov models, species evolve because of random, independent nucleotide substitutions in their genome.
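As an illustration of the Markov assumption, the sketch below mutates each site of a sequence independently. It is a minimal example and not the poster's implementation: the function name `evolve`, the per-site substitution probability `p`, and the choice of a uniformly random replacement base (in the spirit of the Jukes-Cantor model) are assumptions made for illustration.

```python
import random

BASES = "ACGT"

def evolve(sequence, p, rng):
    """Return a copy of `sequence` in which each site is substituted
    independently with probability `p` (hypothetical parameter)."""
    out = []
    for base in sequence:
        if rng.random() < p:
            # Substitute with one of the three other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)  # fixed seed so the run is repeatable
ancestor = "ATCGATCGATCG"
descendant = evolve(ancestor, 0.1, rng)
print(ancestor)
print(descendant)
```

Calling `evolve` once per branch of a quartet tree, with a different `p` for each branch, would give four related sequences of the kind the method takes as input.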
Varieties show expected word distributions
Select the topology closest to the variety
Test a tree space of varying branch lengths
• Topology: The tree structure, which specifies pairings between species and shows which are more closely related
• Quartet trees have three possible topologies; one is correct
• Words: Four-letter combinations of DNA bases {A, C, G, T}
• Three possible species orderings give three sets of words
• Variety: Solution set of a system of polynomial equations
• Parameterized by a map from nucleotide substitution probabilities to genome word distributions
• Data fits a topology if the distance to its variety is small
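The selection rule can be sketched in a few lines, assuming the distance from the data point to each topology's variety has already been computed. The function name `select_topology`, the topology labels, the example distances, and the optional `cutoff` acceptance rule are all hypothetical; computing the distances themselves is outside this sketch.

```python
def select_topology(distances, cutoff=None):
    """Pick the topology whose variety is closest to the data point.

    `distances` maps a topology label to a precomputed distance
    (hypothetical inputs). If `cutoff` is given, return None when
    even the smallest distance is not below it (an assumed rule).
    """
    best = min(distances, key=distances.get)
    if cutoff is not None and distances[best] >= cutoff:
        return None
    return best

# Made-up distances for the three quartet topologies on taxa 1-4.
example = {"12|34": 0.004, "13|24": 0.031, "14|23": 0.027}
print(select_topology(example))
print(select_topology(example, cutoff=0.001))
```

With the made-up numbers above, the first call returns the label with the smallest distance and the second returns `None` because no distance falls below the cutoff.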
References
M. Casanellas, L. D. Garcia, and S. Sullivant, “Catalog of small trees,” in Algebraic Statistics for Computational Biology. New York: Cambridge Univ. Press, 2005, pp. 291–304. [Online]. Available: http://dx.doi.org/10.1017/CBO9780511610684.019
N. Eriksson, “Using invariants for phylogenetic tree construction,” in Emerging Applications of Algebraic Geometry, IMA Volumes in Mathematics and its Applications, vol. 149. New York: Springer, 2009, pp. 89–108.
J. Fernández-Sánchez and M. Casanellas, “Invariant versus classical quartet inference when evolution is heterogeneous across sites and lineages,” Systematic Biology, 2015.
T. H. Jukes and C. R. Cantor, “Evolution of protein molecules,” Mammalian Protein Metabolism, vol. 3, pp. 21–132, 1969.
A. M. Kedzierska and M. Casanellas, “GenNon-h: Generating multiple sequence alignments on nonhomogeneous phylogenetic trees,” BMC Bioinformatics, vol. 13, no. 1, p. 216, 2012.
M. Kimura, “A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences,” Journal of Molecular Evolution, vol. 16, no. 2, pp. 111–120, 1980.
M. Kimura, “Estimation of evolutionary distances between homologous nucleotide sequences,” Proceedings of the National Academy of Sciences, vol. 78, no. 1, pp. 454–458, 1981.
M. S. Swenson, R. Suri, C. R. Linder, and T. Warnow, “An experimental study of Quartets MaxCut and other supertree methods,” Algorithms for Molecular Biology, vol. 6, no. 1, p. 7, 2011.
Kimura 3-parameter model is most accurate
Algebraic geometry has phylogenetic promise
Acknowledgments
This material is based upon work supported by the National Science Foundation under Grant No. DMS-1358534 and hosted by the 2015 Winthrop University REU: Bridging Applied & Theoretical Mathematics. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Summarize observed distributions as points
Ordering 1: sequences ATCG, ACGT, ATTG, AGCA give words AAAA, TCTG, CGTC, GTGA
Ordering 2: sequences ATCG, ATTG, AGCA, ACGT give words AAAA, TTGC, CTCG, GGAT
Ordering 3: sequences ATCG, AGCA, ATTG, ACGT give words AAAA, TGTC, CCTG, GAGT
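The three word sets above come from reading each alignment column by column, with the four sequences taken in a different order each time. A minimal sketch of that step follows; the function names `words` and `distribution` and the ordering labels are assumptions for illustration, and the alphabetical ordering of the 256 coordinates is an arbitrary choice.

```python
from collections import Counter
from itertools import product

def words(sequences):
    """Read an alignment column by column: one four-letter word per site."""
    return ["".join(column) for column in zip(*sequences)]

def distribution(word_list):
    """Relative frequency of each of the 4**4 = 256 possible words,
    in a fixed alphabetical order, giving a point in 256 dimensions."""
    counts = Counter(word_list)
    all_words = ["".join(w) for w in product("ACGT", repeat=4)]
    return [counts[w] / len(word_list) for w in all_words]

s1, s2, s3, s4 = "ATCG", "ACGT", "ATTG", "AGCA"
orderings = {
    "first": (s1, s2, s3, s4),
    "second": (s1, s3, s4, s2),
    "third": (s1, s4, s3, s2),
}
for name, seqs in orderings.items():
    print(name, words(seqs))

point = distribution(words(orderings["first"]))
print(len(point))
```

Each ordering therefore yields its own data point, which is what gets compared against the variety for the corresponding topology.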
Tools: Python, Matlab, MaxCut
Jukes-Cantor model is fastest
[Figure: quartet tree with internal branch length 𝑐 and leaf branch lengths 𝑎, 𝑎, 𝑏, 𝑏]
• Left: 𝑎, 𝑏 ∈ [0.01, 1.5]; 𝑐 = 𝑎
• Left: 𝑎 = 0.05; 𝑏 = 0.75; 𝑐 ∈ [0.01, 0.4]
• Right: 𝑎 = 0.05; 𝑏 = 0.75; 𝑐 ∈ [0.01, 0.4]
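A tree space like the one described above could be enumerated as a list of branch-length triples (a, b, c). This is only a sketch: it assumes the ranges are closed intervals sampled at evenly spaced points, and the number of samples per axis is a hypothetical choice, since the poster does not state the step size. The helper names `linspace`, `leaf_grid`, and `internal_sweep` are also assumptions.

```python
def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

def leaf_grid(count):
    """Grid over a, b in [0.01, 1.5] with the internal branch c = a."""
    values = linspace(0.01, 1.5, count)
    return [(a, b, a) for a in values for b in values]

def internal_sweep(count):
    """Sweep c over [0.01, 0.4] with a = 0.05 and b = 0.75 held fixed."""
    return [(0.05, 0.75, c) for c in linspace(0.01, 0.4, count)]

grid = leaf_grid(5)         # 5 x 5 = 25 triples (a, b, c)
sweep = internal_sweep(4)   # 4 triples (a, b, c)
print(len(grid), len(sweep))
```

Each triple would then be used to simulate sequence data, and reconstruction accuracy recorded at that point of the space.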
Jukes-Cantor reconstructs other models’ data
[Figure: quartet tree with internal branch length 𝑐 and leaf branch lengths 𝑎, 𝑏, 𝑎, 𝑏]
• Use Jukes-Cantor implementation on all data
• A faster runtime
• Almost the same accuracy
• At least 95% accurate for 64% of the sample space
• At least 85% accurate for 77% of the sample space
• Most effective when excluding high 𝑏 and extreme 𝑎
• Reconstruction methods are useful for biologists.
Jukes-Cantor
     A   C   G   T
A   x0  x1  x1  x1
C   x1  x0  x1  x1
G   x1  x1  x0  x1
T   x1  x1  x1  x0

Kimura 2-Parameter
     A   C   G   T
A   x0  x1  x2  x1
C   x1  x0  x1  x2
G   x2  x1  x0  x1
T   x1  x2  x1  x0

Kimura 3-Parameter
     A   C   G   T
A   x0  x1  x2  x3
C   x1  x0  x3  x2
G   x2  x3  x0  x1
T   x3  x2  x1  x0
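The three matrices share one pattern: the Kimura 2-parameter model is the 3-parameter model with x3 = x1, and Jukes-Cantor sets all off-diagonal entries equal. The sketch below builds them that way. It assumes the entries are substitution probabilities, so x0 is chosen to make each row sum to 1; the function names and the numeric values are hypothetical.

```python
def substitution_matrix(x1, x2, x3):
    """Kimura 3-parameter matrix in base order A, C, G, T.
    x0 is whatever remains so that each row sums to 1 (assumption)."""
    x0 = 1.0 - (x1 + x2 + x3)
    return [
        [x0, x1, x2, x3],
        [x1, x0, x3, x2],
        [x2, x3, x0, x1],
        [x3, x2, x1, x0],
    ]

def kimura_2p(x1, x2):
    """Kimura 2-parameter: the x3 entries equal x1."""
    return substitution_matrix(x1, x2, x1)

def jukes_cantor(x1):
    """Jukes-Cantor: all off-diagonal entries equal x1."""
    return substitution_matrix(x1, x1, x1)

jc = jukes_cantor(0.05)
k2 = kimura_2p(0.02, 0.1)
for row in k2:
    print(row)
```

Because Jukes-Cantor has a single free parameter per branch, it has the fewest unknowns of the three, which is consistent with it being the fastest model on the poster.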