The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection
Mark T. Holder and Jordan M. Koch
Department of Ecology and Evolutionary Biology, University of Kansas.
David Swofford, Tracy Heath, David Bryant and Paul Lewis are collaborators on the second part of the talk.
Thanks to NSF and KU's IMSD for funding.
iEvoBio - 2013
Ascertainment bias
A bias in parameter estimation or testing caused by non-random sampling of the data.
This talk will focus on analyses of filtered data.
Filtered data → Some types of data will never be sampled.
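Correcting for filtered data in tree estimation
Use $P(\text{Data} \mid \text{Tree}, \text{not excluded})$ as the likelihood instead of $P(\text{Data} \mid \text{Tree})$:
$$P(\text{Data} \mid \text{Tree}, \text{not excluded}) = \frac{P(\text{Data} \mid \text{Tree})}{P(\text{not excluded} \mid \text{Tree})}$$
Felsenstein (1992) and Lewis (2001)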
Conclusions
• Analyzing variable-only data with Lewis's Mkv model is consistent.
• Inferring trees from parsimony-informative-only data:
    • can be consistent if the tree is not tiny.
    • can be feasible for multi-state data using new algorithms.
• Treating gaps as missing data:
    • does not lead to inconsistency if the indel process is independent of the substitution process.
    • can be positively misleading under mild violations of this independence assumption.
• Rules for character encoding need to be linked to the data. ? = − in molecular data.
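Calculating the probability of not being excluded
$$P(\text{not excluded} \mid \text{Tree}) = 1 - P(\text{excluded} \mid \text{Tree})$$
For variable-only, binary data:
$$P(\text{Var. pat.} \mid \text{Tree}) = 1 - P(\text{all 0 pat.} \mid \text{Tree}) - P(\text{all 1 pat.} \mid \text{Tree})$$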
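The variable-only correction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a symmetric two-state model with equal root frequencies, a rooted tree written as nested tuples, and a made-up four-taxon tree. All function names, taxon names and branch lengths are hypothetical.

```python
import math
from itertools import product

def p_diff(t):
    """P(end states differ) on a branch of length t, symmetric 2-state model."""
    return 0.5 * (1.0 - math.exp(-2.0 * t))

def partials(node, pattern):
    """[P(tip data below node | node state = 0), same for node state = 1]."""
    if isinstance(node, str):  # a leaf is just its taxon name
        return [1.0 if pattern[node] == s else 0.0 for s in (0, 1)]
    like = [1.0, 1.0]
    for child, t in node:  # an internal node is a tuple of (child, branch length)
        below = partials(child, pattern)
        d = p_diff(t)
        for s in (0, 1):
            like[s] *= (1.0 - d) * below[s] + d * below[1 - s]
    return like

def pattern_prob(tree, pattern):
    """P(Data | Tree) for one character, with root frequencies of 1/2 each."""
    return 0.5 * sum(partials(tree, pattern))

def prob_variable(tree, taxa):
    """P(not excluded | Tree) = 1 - P(all 0 | Tree) - P(all 1 | Tree)."""
    all0 = pattern_prob(tree, {x: 0 for x in taxa})
    all1 = pattern_prob(tree, {x: 1 for x in taxa})
    return 1.0 - all0 - all1

def conditional_prob(tree, pattern, taxa):
    """P(Data | Tree, not excluded) for a variable pattern."""
    if len(set(pattern[x] for x in taxa)) < 2:
        raise ValueError("constant patterns are excluded by the filter")
    return pattern_prob(tree, pattern) / prob_variable(tree, taxa)

# Hypothetical rooted four-taxon tree; names and branch lengths are made up.
TAXA = ["A", "B", "C", "D"]
AB = (("A", 0.1), ("B", 0.2))
CD = (("C", 0.1), ("D", 0.3))
TREE = ((AB, 0.05), (CD, 0.05))

# The conditional probabilities of all retained (variable) patterns
# should sum to 1, up to rounding.
variable = [dict(zip(TAXA, bits))
            for bits in product((0, 1), repeat=len(TAXA))
            if len(set(bits)) > 1]
total = sum(conditional_prob(TREE, p, TAXA) for p in variable)
print(total)
```

Dividing by `prob_variable` renormalizes the pattern probabilities over the patterns that survive the filter, which is why they sum to 1 again.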
Calculating the probability of a parsimony-uninformative pattern
For multi-state characters ($k > 2$) and parsimony-informative-only data, there are many excluded (parsimony-uninformative) patterns: $O\left(2^{k-1} N^{k-1}\right)$
JMK and MTH have implemented parsimony-uninformative specializations of an algorithm for calculating the prob. of classes of patterns.
Koch and Holder (2012)
https://github.com/mtholder/PhyPatClassProb
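To see how quickly the excluded class grows, a brute-force count is enough for tiny cases. The sketch below is only an illustration and is not the Koch and Holder algorithm or the PhyPatClassProb interface. It assumes the usual definition of parsimony-informative (at least two states each occurring in at least two taxa), ignores missing and ambiguous states, and uses hypothetical function names.

```python
from collections import Counter
from itertools import product

def is_parsimony_informative(pattern):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(pattern)
    return sum(1 for c in counts.values() if c >= 2) >= 2

def count_uninformative(n_taxa, k):
    """Enumerate all k**n_taxa patterns and count the uninformative ones."""
    return sum(1 for p in product(range(k), repeat=n_taxa)
               if not is_parsimony_informative(p))

print(is_parsimony_informative("AACC"))  # two states, each in two taxa
print(is_parsimony_informative("AACG"))  # only one state occurs twice
for k in (2, 3, 4):
    print(k, count_uninformative(6, k))
```

Full enumeration is exponential in the number of taxa, so it only works for toy cases; that cost is the motivation for an algorithm that handles classes of patterns.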
Informatics implications
To correct for ascertainment bias we need to know what form of data filtering was used.
If a character was chosen because it was variable in a related group, it is difficult to correct for the ascertainment bias.
Gaps-as-missing-data ML (SML) on the correct alignment
• Clearly ignores information from indels.
• Warnow (2012) argues that the method is inconsistent, but her proof only works when there are no substitutions on the tree.
Holder, Heath, Lewis, Swofford, and Bryant (in prep):
1. Proof of consistency if the indel process is independent of the substitution process,
2. Example of the method being positively misleading under a +I model for indels and substitutions.
Theorem 1
The tree and parameter pair, $\hat{T}_M$, $\hat{\theta}_M$, estimated via SML will yield a consistent estimator of the tree, T, if:
(a) the time-reversible substitution model, θ, results in consistent estimation of T in the absence of indels;
(b) the indel process, φ, acts independently of the substitution process and the sequence states;
(c) the probability distribution for newly inserted states is identical to the equilibrium state frequency of the substitution process; and
(d) there is non-zero probability of generating a site without gaps under φ.
What if rates of indels are correlated with rates of substitutions?
• Substitution: Jukes-Cantor + a proportion of invariant sites
• Indel: invariant sites will not experience indels. All sites that are free to have substitutions are also free to experience indels.
The result:
If we calculate the expected pattern frequency spectra for extreme Felsenstein-zone trees and mimic infinite character sampling, software using the gaps-as-missing-data approach (ML and Bayesian) prefers the wrong tree.
We still need to verify that this is not an artifact of local optima being found in software.
Positively misleading behavior from treating gaps as missing data
• Long-branch attraction if the rate of the indel process is correlated with the rate of the substitution process.
• Non-random filtering of data → long branches underestimated.
The result could have implications for the (long-standing) debates in systematics about the effect of missing data and inapplicable character states.
Informatics implications
• Gaps aren't missing data; we really should be using models of the indel process.
• Terminal gaps in alignments often are the result of missing data. Software should not use the same symbol for gaps caused by indels and gaps caused by incomplete sequencing.
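One way to keep the two kinds of gap apart is to recode them before analysis. The sketch below assumes, as a deliberate simplification, that every leading or trailing run of `-` in an aligned sequence reflects incomplete sequencing and that every internal `-` reflects an indel. Neither assumption is guaranteed for real data, and the function name and symbols are hypothetical.

```python
def recode_terminal_gaps(seq, gap="-", missing="?"):
    """Replace leading and trailing runs of `gap` with `missing`.

    Internal gaps are left untouched so they can still be read as indels.
    """
    n = len(seq)
    if not seq.strip(gap):  # empty or all-gap sequence: everything is missing
        return missing * n
    lead = n - len(seq.lstrip(gap))    # length of the leading gap run
    trail = n - len(seq.rstrip(gap))   # length of the trailing gap run
    return missing * lead + seq[lead:n - trail] + missing * trail

# Hypothetical aligned sequences.
alignment = {
    "taxon1": "--AC-GT---",
    "taxon2": "GGACTGTAAC",
}
recoded = {name: recode_terminal_gaps(s) for name, s in alignment.items()}
for name, s in recoded.items():
    print(name, s)
```

Recoding like this only helps if downstream software actually treats `?` and `-` differently, which is the point of the second bullet above.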
References I
E Allman and J Rhodes. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Mathematical Biosciences, 211(1):18–33, Jan 2008. doi: 10.1016/j.mbs.2007.09.001. URL http://linkinghub.elsevier.com/retrieve/pii/S0025556407001897.
Elizabeth S. Allman, Mark T. Holder, and John A. Rhodes. Estimating trees from filtered data: Identifiability of models for morphological phylogenetics. Journal of Theoretical Biology, 263(1):108–119, 2010. ISSN 0022-5193. doi: 10.1016/j.jtbi.2009.12.001. URL http://www.sciencedirect.com/science/article/B6WMD-4XX160T-2/2/5adf8b8af77dd551890d7cb5b0e62dba.
References II
Joseph Felsenstein. Phylogenies from restriction sites: a maximum-likelihood approach. Evolution, 46:159–173, Jan 1992.
Jordan M. Koch and Mark T. Holder. An algorithm for calculating the probability of classes of data patterns on a genealogy. PLOS Currents Tree of Life, Dec 14 [last modified: 2012 Dec 14] (1), 2012. doi: 10.1371/4fd1286980c08. URL http://currents.plos.org/treeoflife/article/an-algorithm-for-calculating-the-probability-of-classes-o
P. O. Lewis. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913–925, 2001.
Mike Steel. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Letters, 7(2):19–23, Jan 1994. URL http://www.math.canterbury.ac.nz/~mathmas/research/markov3.pdf.
References III
Mike Steel, Michael D. Hendy, and David Penny. Parsimony can be consistent! Syst Biol, 42(4):581–587, 1993.
Tandy J. Warnow. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLOS Currents Tree of Life, Mar 12 [last modified: 2012 Apr 3] Edition 1, 2012. doi: 10.1371/currents.RRN1308. URL http://currents.plos.org/treeoflife/article/standard-maximum-likelihood-analyses-of-alignments-with-g