Holder and Koch ievobio-2013 ascertainment biases


data filtering and phylogenetic inference and implications for informatics

  1. The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection
     Mark T. Holder and Jordan M. Koch, Department of Ecology and Evolutionary Biology, University of Kansas.
     David Swofford, Tracy Heath, David Bryant and Paul Lewis are collaborators on the second part of the talk.
     Thanks to NSF and KU's IMSD for funding.
     iEvoBio - 2013
  2. Ascertainment bias: a bias in parameter estimation or testing caused by non-random sampling of the data.
     This talk will focus on analyses of filtered data.
     Filtered data → some types of data will never be sampled.
  3. Correcting for filtered data in tree estimation
     Use P(Data | Tree, not excluded) as the likelihood instead of P(Data | Tree), where
     P(Data | Tree, not excluded) = P(Data | Tree) / P(not excluded | Tree)
     Felsenstein (1992) and Lewis (2001)
  4. Conclusions
     • Analyzing variable-only data with Lewis's Mkv model is consistent.
     • Inferring trees from parsimony-informative-only data:
       • can be consistent if the tree is not tiny.
       • can be feasible for multi-state data using new algorithms.
     • Treating gaps as missing data:
       • does not lead to inconsistency if the indel process is independent of the substitution process.
       • can be positively misleading under mild violations of this independence assumption.
       • rules for character encoding need to be linked to the data: ? ≠ − in molecular data.
  5. Filtering data: retain variable patterns
     Original matrix:
     Taxon | Character 1 2 3 4 5 6
     t1    | 0 1 1 0 0 0
     t2    | 0 1 0 0 1 0
     t3    | 0 1 0 1 1 1
     t4    | 0 0 0 1 0 1
     → Filtered matrix:
     Taxon | Character 1 2 3 4 5
     t1    | 1 1 0 0 0
     t2    | 1 0 0 1 0
     t3    | 1 0 1 1 1
     t4    | 0 0 1 0 1
  6. Filtering data: retain parsimony-informative patterns
     Original matrix:
     Taxon | Character 1 2 3 4 5 6
     t1    | 0 1 1 0 0 0
     t2    | 0 1 0 0 1 0
     t3    | 0 1 0 1 1 1
     t4    | 0 0 0 1 0 1
     → Filtered matrix:
     Taxon | Character 1 2 3
     t1    | 0 0 0
     t2    | 0 1 0
     t3    | 1 1 1
     t4    | 1 0 1
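The two filters on slides 5 and 6 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the matrix is the four-taxon example from the slides. A column counts as parsimony-informative here when at least two states each occur in at least two taxa, which is the usual definition.

```python
from collections import Counter

def columns(matrix):
    """Return the characters (columns) of a taxon-by-character matrix."""
    return [tuple(row[i] for row in matrix) for i in range(len(matrix[0]))]

def is_variable(col):
    """A character is variable if more than one state is observed."""
    return len(set(col)) > 1

def is_parsimony_informative(col):
    """At least two states must each be present in at least two taxa."""
    counts = Counter(col)
    return sum(1 for n in counts.values() if n >= 2) >= 2

def filter_columns(matrix, keep):
    """Keep only the characters for which keep(column) is true."""
    kept = [col for col in columns(matrix) if keep(col)]
    return [[col[r] for col in kept] for r in range(len(matrix))]

# Example matrix from the slides (rows are t1..t4, columns are characters 1..6).
matrix = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1],
]

variable_only = filter_columns(matrix, is_variable)
informative_only = filter_columns(matrix, is_parsimony_informative)
```

Applied to the example, the variable-only filter drops character 1 (constant), and the parsimony-informative filter additionally drops characters 2 and 3, in which one taxon differs from all the others.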
  7. Identifiability of the tree
     [Figure: trees on taxa A, B, C, D mapped by arrows to a table of pattern probabilities for the patterns ABCD = 1001, 1010, 1100.]
  8. Partial identifiability of the tree
     [Figure: trees on taxa A, B, C, D mapped by arrows to a table of pattern probabilities for the patterns ABCD = 1001, 1010, 1100.]
  9. A Markov model for character evolution
     Two states, 0 and 1, with rate q01 for 0 → 1 and rate q10 for 1 → 0.
     Mk: q10 = q01
     GMk: q10 ≠ q01
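For a concrete view of the two-state model, the sketch below computes transition probabilities over a branch of length t, assuming a continuous-time Markov chain with rates q01 and q10. That assumption and the function name are illustrative and are not taken from the slides. Setting q01 equal to q10 gives the symmetric Mk case.

```python
import math

def two_state_transition_matrix(q01, q10, t):
    """Return P(t) for a two-state continuous-time Markov chain.

    P[i][j] is the probability of ending in state j after time t,
    given a start in state i. q01 is the 0 -> 1 rate, q10 the 1 -> 0 rate.
    """
    total = q01 + q10
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    pi0 = q10 / total          # stationary frequency of state 0
    pi1 = q01 / total          # stationary frequency of state 1
    decay = math.exp(-total * t)
    return [
        [pi0 + pi1 * decay, pi1 * (1.0 - decay)],
        [pi0 * (1.0 - decay), pi1 + pi0 * decay],
    ]

symmetric = two_state_transition_matrix(1.0, 1.0, 0.3)   # Mk-like: equal rates
asymmetric = two_state_transition_matrix(2.0, 0.5, 0.3)  # unequal rates
```

With equal rates the two off-diagonal entries match. With unequal rates they differ, and the chain converges to the stationary frequencies q10/(q01+q10) and q01/(q01+q10) as t grows.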
  10. [Figure: plot with axes P(1100 | T), P(1010 | T), and P(1001 | T), for T = the tree on taxa A, B, C, D.]
  11. Extending identifiability results
      Filtering | Model | Identifiable?
      None | GMk | Yes (Steel, 1994)
      Variable | GMkv | Yes
      Pars-inf | Mkp−i | Partial (Steel et al., 1993)
      Pars-inf | GMkp−i | N = 4: No; N ≥ 8: Yes; 5 ≤ N ≤ 7: ?
      Results in red: Allman et al. (2010), extending Allman and Rhodes (2008) for GMkv.
  12. Analyzing filtered data
      Analysis | Filtering: Variable | Filtering: Pars-inf.
      Mk | Pos. misleading | Pos. misleading
      Mkv | Consistent | Pos. misleading
      Mkp−i | - | Consistent (N ≥ 8)
  13. Calculating the probability of not being excluded
      P(not excluded | Tree) = 1 − P(excluded | Tree)
      For variable-only, binary data:
      P(Var. pat. | Tree) = 1 − P(all 0 pat. | Tree) − P(all 1 pat. | Tree)
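The correction for variable-only binary data can be illustrated with a small Python sketch. It assumes a symmetric two-state Mk model, equal root frequencies, and an unrooted four-taxon tree ((A,B),(C,D)) with branch lengths in expected changes per character. All names and branch lengths below are hypothetical, and the brute-force sum over internal states is used only because the tree is tiny.

```python
import itertools
import math

def p_change(v):
    """Probability that the two ends of a branch of length v differ (symmetric Mk, 2 states)."""
    return 0.5 * (1.0 - math.exp(-2.0 * v))

def edge_prob(a, b, v):
    """Probability of state b at one end of a branch given state a at the other."""
    pc = p_change(v)
    return pc if a != b else 1.0 - pc

def pattern_prob(pattern, brlens):
    """P(pattern | Tree) on ((A,B),(C,D)), summing over both internal node states."""
    a, b, c, d = pattern
    va, vb, vc, vd, v_internal = brlens
    total = 0.0
    for x, y in itertools.product((0, 1), repeat=2):
        total += (0.5 * edge_prob(x, a, va) * edge_prob(x, b, vb)
                  * edge_prob(x, y, v_internal)
                  * edge_prob(y, c, vc) * edge_prob(y, d, vd))
    return total

def prob_not_excluded_variable(brlens):
    """P(variable pattern | Tree) = 1 - P(all 0) - P(all 1)."""
    return (1.0 - pattern_prob((0, 0, 0, 0), brlens)
            - pattern_prob((1, 1, 1, 1), brlens))

def conditional_log_likelihood(patterns, brlens):
    """Sum of log[P(pattern | Tree) / P(not excluded | Tree)] over characters."""
    log_norm = math.log(prob_not_excluded_variable(brlens))
    return sum(math.log(pattern_prob(p, brlens)) - log_norm for p in patterns)

# Hypothetical branch lengths (A, B, C, D, internal) and the five variable
# characters from the slide 5 example, written as (A, B, C, D) patterns.
example_brlens = (0.1, 0.1, 0.1, 0.1, 0.05)
example_patterns = [(1, 1, 1, 0), (1, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 0), (0, 0, 1, 1)]
corrected = conditional_log_likelihood(example_patterns, example_brlens)
```

Dividing by P(not excluded | Tree) renormalizes the pattern probabilities so that they sum to one over the variable patterns only, which is what makes the conditional likelihood a proper likelihood for filtered data.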
  14. Calculating the probability of a parsimony-uninformative pattern
      For multi-state characters (k > 2) and parsimony-informative-only data, there are lots of patterns: O(2^(k−1) N^(k−1)).
      JMK and MTH have implemented parsimony-uninformative specializations of an algorithm for calculating the prob. of classes of patterns.
      Koch and Holder (2012)
  15. Informatics implications
      To correct for ascertainment bias we need to know what form of data filtering was used.
      If a character was chosen because it was variable in a related group, it is difficult to correct for the ascertainment bias.
  17. Gaps-as-missing-data ML (SML) on the correct alignment
      • Clearly ignores information from indels.
      • Warnow (2012) argues that the method is inconsistent, but her proof only works when there are no substitutions on the tree.
      Holder, Heath, Lewis, Swofford, and Bryant (in prep):
      1. Proof of consistency if the indel process is independent of the substitution process.
      2. Example of the method being positively misleading under a +I model for indels and substitutions.
  18. Theorem 1
      The tree and parameter pair, (T̂_M, θ̂_M), estimated via SML will yield a consistent estimator of the tree, T, if:
      (a) the time-reversible substitution model, θ, results in consistent estimation of T in the absence of indels;
      (b) the indel process, φ, acts independently of the substitution process and the sequence states;
      (c) the probability distribution for newly inserted states is identical to the equilibrium state frequency of the substitution process; and
      (d) there is non-zero probability of generating a site without gaps under φ.
  19. What if rates of indels are correlated with rates of substitutions?
      • Substitution: Jukes-Cantor + a proportion of invariant sites.
      • Indel: invariant sites will not experience indels. All sites that are free to have substitutions are also free to experience indels.
      The result:
      If we calculate the expected pattern frequency spectra for extreme Felsenstein-zone trees, and mimic infinite character sampling, the software using the gaps-as-missing-data approach (ML and Bayesian) prefers the wrong tree.
      We still need to verify that this is not an artifact of local optima being found in software.
  20. Positively misleading behavior from treating gaps as missing data
      • Long-branch attraction if the rate of the indel process is correlated with the rate of the substitution process.
      • Non-random filtering of data → long branches underestimated.
      The result could have implications for the (long-standing) debates in systematics about the effect of missing data and inapplicable character states.
  21. Informatics implications
      • Gaps aren't missing data; we really should be using models of the indel process.
      • Terminal gaps in alignments often are the result of missing data. Software should not use the same symbol for gaps caused by indels and gaps caused by incomplete sequencing.
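One way to keep the two kinds of gap apart in an alignment is sketched below. It is a heuristic only, resting on the assumption that leading and trailing gap runs reflect incomplete sequencing (the slide says this is often, not always, the case). The function name and the choice of "-" and "?" as symbols are illustrative.

```python
def recode_terminal_gaps(seq, gap="-", missing="?"):
    """Recode leading and trailing gap runs as missing data.

    Internal gaps are left as the gap symbol, since they are more
    plausibly the result of indels. This is a heuristic, not a rule.
    """
    without_lead = seq.lstrip(gap)
    n_lead = len(seq) - len(without_lead)
    core = without_lead.rstrip(gap)
    n_trail = len(without_lead) - len(core)
    return missing * n_lead + core + missing * n_trail

# Hypothetical aligned sequence: terminal gaps become "?", the internal gap stays "-".
recoded = recode_terminal_gaps("--AC-GT--")
```

A record of which symbol was used for which cause would still need to travel with the alignment so that downstream software can treat the two cases differently.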
  22. References I
      E. Allman and J. Rhodes. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Mathematical Biosciences, 211(1):18-33, Jan 2008. doi: 10.1016/j.mbs.2007.09.001.
      Elizabeth S. Allman, Mark T. Holder, and John A. Rhodes. Estimating trees from filtered data: Identifiability of models for morphological phylogenetics. Journal of Theoretical Biology, 263(1):108-119, 2010. ISSN 0022-5193. doi: 10.1016/j.jtbi.2009.12.001.
  23. References II
      Joseph Felsenstein. Phylogenies from restriction sites: a maximum-likelihood approach. Evolution, 46:159-173, Jan 1992.
      Jordan M. Koch and Mark T. Holder. An algorithm for calculating the probability of classes of data patterns on a genealogy. PLOS Currents Tree of Life, Dec 14 [last modified: 2012 Dec 14](1), 2012. doi: 10.1371/4fd1286980c08.
      Paul O. Lewis. A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6):913-925, 2001.
      Mike Steel. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Letters, 7(2):19-23, Jan 1994.
  24. References III
      Mike Steel, Michael D. Hendy, and David Penny. Parsimony can be consistent! Syst Biol, 42(4):581-587, 1993.
      Tandy J. Warnow. Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent. PLOS Currents Tree of Life, Mar 12 [last modified: 2012 Apr 3], Edition 1, 2012. doi: 10.1371/currents.RRN1308.