
Perl for Phyloinformatics



Course slides for computational phyloinformatics, an annual course organized by NESCent in collaboration with hosting organizations across the world. I teach the Perl section of the course; these are the slides I presented in 2010 at BGI, Shenzhen, PRC.

Published in: Technology


  1. 1. Perl for PhyloInformatics<br />What did we learn?<br />What did we do?<br />
  2. 2. Tree Concepts<br />What are phylogenetic trees?<br />
  3. 3. Phylogenetic Trees<br />Describe the historical relationships among lineages of organisms or their parts, such as their genes.<br />
  4. 4. Operational taxonomic units (OTU) / Taxa<br />Sisters<br />A<br />Internal nodes<br />B<br />Terminal nodes or tips<br />C<br />D<br />Root<br />E<br />F<br />Branches<br />Tree terminology<br />
  5. 5. Interpreting phylogenies<br />These trees are the same shape<br />
  6. 6. Rooted vs. unrooted trees<br />D<br />A<br />B<br />E<br />A<br />B<br />C<br />D<br />Root<br />E<br />C<br />F<br />F<br />Rooted tree: Has a root that denotes common ancestry<br />Unrooted tree: Only specifies the degree of kinship among taxa but not the evolutionary path<br />Tree terminology<br />
  7. 7. Rooted and unrooted trees<br />The number of rooted and unrooted trees for n species is<br />$N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$<br />$N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$<br />
  8. 8. A simple example<br />
  9. 9. Why more rooted than unrooted?<br />On an unrooted tree, the root can be placed on any of the branches.<br />
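
The counting formulas for rooted and unrooted trees can be checked numerically. The sketch below assumes strictly bifurcating trees with labeled tips; the function names are made up for illustration. It also shows the point of slide 9: an unrooted tree with n tips has 2n - 3 branches, and placing the root on each of them gives a distinct rooted tree.

```python
from math import factorial


def num_rooted_trees(n: int) -> int:
    """Number of rooted, bifurcating trees for n labeled tips (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 tips")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted, bifurcating trees for n labeled tips (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 tips")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(3, 11):
    rooted = num_rooted_trees(n)
    unrooted = num_unrooted_trees(n)
    # Each unrooted tree has 2n - 3 branches on which a root can be placed.
    assert rooted == unrooted * (2 * n - 3)
    print(n, unrooted, rooted)
```

The numbers grow very quickly, which is why exhaustive comparison of all trees is only feasible for a handful of taxa.
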
  10. 10. Trees and classification<br />
  11. 11. Monophyletic<br />A monophyletic group is a group of organisms which forms a clade, meaning that it consists of an ancestor and all its descendants.<br />(Most clades on our Supertree are monophyletic.)<br />
  12. 12. Paraphyletic<br />A paraphyletic group contains an ancestor and some, but not all, of its descendants: it excludes species that share a common ancestor with its members.<br />
  13. 13. Polyphyletic<br />A polyphyletic group is one whose members' most recent common ancestor is not a member of the group.<br />
  14. 14. Example: birds and reptiles<br />Reptiles, without the birds, form a paraphyletic group.<br />
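
Whether a group is monophyletic can be tested mechanically: it is monophyletic exactly when its members are the complete set of tips below some node. The sketch below assumes a tree written as nested tuples with string tip labels. The tree `TOY_TREE` is a deliberately simplified, hypothetical topology used only for illustration, and the function names are made up. The check does not distinguish paraphyletic from polyphyletic groups; it only reports "not monophyletic".

```python
def tips(tree):
    """Return the set of tip labels below a node (a str tip or a tuple of children)."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= tips(child)
    return result


def clades(tree):
    """Yield the tip set of every node in the tree, including the tips themselves."""
    yield tips(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if `group` is exactly the set of tips below some node of `tree`."""
    return frozenset(group) in set(clades(tree))


# Hypothetical, simplified topology for illustration only.
TOY_TREE = ("mammals", ("turtles", ("lizards", ("crocodiles", "birds"))))

reptiles_without_birds = {"turtles", "lizards", "crocodiles"}
reptiles_with_birds = reptiles_without_birds | {"birds"}

print(is_monophyletic(TOY_TREE, reptiles_without_birds))  # False
print(is_monophyletic(TOY_TREE, reptiles_with_birds))     # True
```
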
  15. 15. Change and time<br />
  16. 16. A<br />B<br />C<br />D<br />E<br />F<br />Phylograms: Branch lengths are proportional to the amount of change that occurred on that branch (these are the gene trees before r8s).<br />Cladograms: Branch lengths are not proportional to the amount of change (this is the Supertree from Monday).<br />Cladograms and phylograms<br />
  17. 17. Ultrametric trees<br />If the distance from the root represents time (not change) we can use trees to study how fast new species form.<br />(This is our final tree after we put it all together.)<br />
  18. 18. Types of data<br />What evidence are phylogenetic trees based on?<br />
  19. 19. Distance data<br />Example: DNA-DNA hybridization. The more closely related two species are, the more similar their DNA. The more similar the DNA, the stronger the bond between the two strands, and the shorter the distance.<br />
  20. 20. Morphological characters<br />Example: the shape of spider webs.<br />
  21. 21. Molecular sequence data<br />I am sure you have all heard about DNA sequencing. Amino acid sequences are often used for more distantly related species.<br />
  22. 22. Types of Data<br />Two categories<br />Numerical data<br />Evolutionary distance between two species<br />Usually derived from sequence data<br />Character data<br />Each character has a finite number of states<br />E.g. number of legs = 1, 2, 4<br />DNA = {A, C, T, G}<br />
  23. 23. Tree reconstruction<br />
  24. 24. Distance methods<br />Types of data<br />Distance matrices:<br />DNA-DNA hybridization<br />Computed from sequences<br />Examples<br />UPGMA is the oldest distance matrix method<br />Neighbor-joining is more commonly used<br />
  25. 25. Distance data<br />When using sequences, distance-based methods must first transform the sequence data into a pairwise distance matrix, which is then used during tree inference<br />
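
As a minimal sketch of that transformation: count the proportion of differing sites between each pair of aligned sequences (the p-distance), then correct it for multiple substitutions at the same site. The Jukes-Cantor correction used here is an assumption chosen for simplicity; the slides do not say which correction was used. The alignment and taxon names are invented, and gaps or ambiguity codes are not handled.

```python
from math import log


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance; undefined for p >= 0.75."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor correction")
    return -0.75 * log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment: dict) -> dict:
    """Map every (name_a, name_b) pair to its corrected distance."""
    names = sorted(alignment)
    return {
        (a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
        for a in names
        for b in names
    }


# Invented toy alignment.
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAA",
    "taxon3": "ACGAACGGAA",
}
for pair, distance in distance_matrix(alignment).items():
    print(pair, round(distance, 4))
```
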
  26. 26. Neighbor-Joining Methods<br />1. Maintain a pairwise distance matrix<br />2. Find the closest two taxa<br />3. Collapse them into one row (internal node) and recompute distance from the merged row to every other row<br />4. Loop to 2<br />5. Build tree as you go<br />
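
The loop on this slide can be sketched in a few lines. Note the assumption: this version merges the pair with the smallest raw distance and averages distances weighted by cluster size, which is UPGMA-style clustering. Real neighbor-joining follows the same loop but selects the pair using a rate-corrected criterion instead of the raw distance. The function name and the example matrix are made up, and the output is a Newick string with topology only (no branch lengths).

```python
def cluster(distances: dict) -> str:
    """Cluster taxa from {frozenset({a, b}): distance}; return a Newick topology."""
    # Number of tips in each current cluster, keyed by its Newick label.
    sizes = {}
    for pair in distances:
        for name in pair:
            sizes[name] = 1
    dist = dict(distances)

    while len(sizes) > 1:
        # Find the closest two clusters.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"

        # Collapse them into one row and recompute distances to all other rows.
        new_dist = {}
        for other in sizes:
            if other in pair:
                continue
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            total = sizes[a] + sizes[b]
            new_dist[frozenset((merged, other))] = (sizes[a] * d_a + sizes[b] * d_b) / total
        # Keep the distances that do not involve the merged pair.
        for key, value in dist.items():
            if not (key & pair):
                new_dist[key] = value

        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        dist = new_dist

    (newick,) = sizes  # exactly one cluster remains
    return newick + ";"


# Invented distance matrix for four taxa.
example = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
print(cluster(example))  # (((A,B),C),D);
```
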
  27. 27. Character methods<br />Types of data<br />Any homologized data:<br />Morphological data<br />Molecular sequences<br />Examples<br />Optimality-criterion methods:<br />Maximum parsimony<br />Maximum likelihood<br />Bayesian methods:<br />MCMC<br />
  28. 28. What is homology?<br />Example: forelimbs<br />Definition<br />Homology means any similarity between characters that is due to their shared ancestry.<br />Anatomical structures that evolved from the same structure in some ancestor species are homologous.<br />In genetics, homology can be observed in aligned DNA sequences.<br />
  29. 29. What is an “optimality criterion”?<br />An optimality criterion is simply a way to quantify, using a number, how well a tree fits the data relative to other trees.<br />Examples are parsimony tree length (this is how the Supertree was optimized on the CIPRES cluster) and likelihood score.<br />The posterior probability can also be seen as an optimality criterion.<br />
  30. 30. Parsimony tree length<br />Tree length is the minimum number of reconstructed changes.<br />The most parsimonious tree is the tree that requires the fewest changes.<br />
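
One common way to compute tree length for unordered characters is Fitch's algorithm, sketched below. The slides do not name the algorithm, so this is an assumption. The sketch also assumes a rooted binary tree written as nested 2-tuples and an invented toy alignment. At each internal node it takes the intersection of the children's state sets, and counts one change whenever that intersection is empty.

```python
def fitch(tree, states):
    """Return (possible state set, minimum changes) for one character.

    `tree` is a tip label (str) or a 2-tuple of subtrees;
    `states` maps each tip label to its observed state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: at least one change is needed on this pair of branches.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {tip: seq[i] for tip, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Invented toy alignment.
alignment = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GAT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(tree_length(tree_1, alignment))  # 2
print(tree_length(tree_2, alignment))  # 4
```

Under the parsimony criterion, `tree_1` is preferred here because it needs fewer changes to explain the same data.
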
  31. 31. Finding the optimal tree<br />Under an optimality criterion, trees need to be compared with one another to find the one with the best score (the shortest tree under parsimony, the highest likelihood under ML).<br />For MP and ML trees, this is usually done with hill-climbing algorithms.<br />
  32. 32. …but this is not the whole story!<br />Maximum Parsimony assumes a very simple model for evolutionary change – namely that change is rare.<br />Molecular evolution in particular can be modeled in more realistic ways, using substitution models.<br />There are more complex ways to explore tree space than just hill-climbing (such as the Parsimony Ratchet).<br />We can also sample different areas of tree space to see how optimality is distributed, using MCMC.<br />
  33. 33. Substitution models<br />
  34. 34. Base frequencies and substitution rates<br />
  35. 35. Additional parameters<br />Gamma distribution<br />Invariant sites<br />Perhaps some sites never change.<br />Maybe specify their proportion?<br />
  36. 36. Likelihood and the number of parameters<br />More parameters always lead to a better fit to the data<br />
  37. 37. Likelihood and the number of parameters<br />More parameters always lead to a higher value of the likelihood, whether or not the additional parameters provide a ‘significantly’ better fit to the data<br />
  38. 38. Are the extra parameters justified?<br />Likelihood ratio statistic: $2 \log\left(\frac{\text{Maximum Likelihood} \mid H_1}{\text{Maximum Likelihood} \mid H_0}\right)$<br />Has chi-squared distribution<br />dof = number of additional parameters<br />(We did this with ModelTest)<br />
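
The test on this slide is easy to work through by hand. The sketch below takes two maximized log-likelihoods, forms the likelihood ratio statistic, and converts it to a p-value using the chi-squared survival function for an integer number of degrees of freedom. The log-likelihood values are invented for illustration, the function names are made up, and this is not how ModelTest is implemented.

```python
from math import erfc, exp, gamma, sqrt


def lrt_statistic(lnl_h0: float, lnl_h1: float) -> float:
    """2 * log(L1 / L0), computed from the two log-likelihoods."""
    return 2.0 * (lnl_h1 - lnl_h0)


def chi2_sf(x: float, dof: int) -> float:
    """P(X > x) for a chi-squared variable with integer dof >= 1."""
    if x < 0 or dof < 1:
        raise ValueError("need x >= 0 and dof >= 1")
    half = x / 2.0
    # Closed forms for dof = 1 and dof = 2, then step up two dof at a time.
    if dof % 2:
        a, q = 0.5, erfc(sqrt(half))
    else:
        a, q = 1.0, exp(-half)
    while a < dof / 2.0:
        q += half ** a * exp(-half) / gamma(a + 1.0)
        a += 1.0
    return q


# Invented log-likelihoods: H1 has one more free parameter than H0.
lnl_simple, lnl_complex = -2345.6, -2340.1
statistic = lrt_statistic(lnl_simple, lnl_complex)
p_value = chi2_sf(statistic, dof=1)
print(statistic, p_value)
```

A small p-value suggests the extra parameter is justified. The chi-squared approximation applies when the simpler model is nested within the more complex one.
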
  39. 39. How did we use the substitution models?<br />Each substitution has an associated likelihood given a branch of a certain length and the estimated model parameters.<br />A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters.<br />Optimise the branch lengths to get the maximum likelihood estimate.<br />
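
The smallest possible case shows the idea: two aligned sequences joined by a single branch. The sketch below assumes the Jukes-Cantor model (equal base frequencies and equal rates), which is a simplification chosen for illustration and not necessarily the model used in the course. It writes the log-likelihood as a function of branch length and finds the maximum with a golden-section search. Function names and the site counts are invented.

```python
from math import exp, log, sqrt


def jc69_log_likelihood(branch_length: float, n_sites: int, n_diffs: int) -> float:
    """Log-likelihood of two aligned sequences under Jukes-Cantor."""
    decay = exp(-4.0 * branch_length / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)  # a site showing one identical pair
    p_diff = 0.25 * (0.25 - 0.25 * decay)  # a site showing one specific mismatch
    return (n_sites - n_diffs) * log(p_same) + n_diffs * log(p_diff)


def optimise_branch_length(n_sites, n_diffs, low=1e-6, high=5.0, tol=1e-7):
    """Golden-section search for the branch length that maximizes the likelihood."""
    ratio = (sqrt(5.0) - 1.0) / 2.0
    while high - low > tol:
        left = high - ratio * (high - low)
        right = low + ratio * (high - low)
        if jc69_log_likelihood(left, n_sites, n_diffs) < jc69_log_likelihood(right, n_sites, n_diffs):
            low = left   # the maximum lies to the right of `left`
        else:
            high = right  # the maximum lies to the left of `right`
    return (low + high) / 2.0


# Invented data: 100 aligned sites, 10 of which differ.
estimate = optimise_branch_length(n_sites=100, n_diffs=10)
analytic = -0.75 * log(1.0 - 4.0 * 0.10 / 3.0)
print(estimate, analytic)
```

For this two-sequence case the numerical optimum should agree with the closed-form Jukes-Cantor distance. On a full tree the same optimization is repeated over every branch, which is why it is done numerically.
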
  40. 40. Estimating node ages<br />
  41. 41. Rate smoothing<br />r8s methods attempt to simultaneously estimate unknown divergence times and smooth the rapidity of rate change along lineages.<br />This is done by invoking some function that penalizes rates that change too quickly from branch to neighboring branch.<br />
  42. 42. supertree<br />Given a cladogram, how do we infer the divergence<br />dates of the true tree?<br />A B C D E<br />NOT time<br />A C E<br />A B D E<br />The relative lengths of some branches can be obtained from genes that fit an MLK model.<br />
  43. 43. “true tree”<br />A C E<br />A B D E<br />A B E D C<br />time<br />Simmons<br />Hackman<br />Estimates from multiple molecular sequences can subsequently be combined by calibrating the gene trees on a common node, and applying the resulting node depths to the supertree.<br />
  44. 44. Where did we get the other dates?<br />If there is no extinction and constant speciation (!), the expected waiting time from one speciation event to the next is 1/n, where n=number of lineages.<br />This is a little more complicated if we take multiple labeled histories into account…<br />…but we can come up with expected ages this way.<br />
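
Taking the 1/n expectation on this slide at face value, expected node ages follow by adding up waiting times from the present back toward the root. The sketch below assumes time units in which the per-lineage speciation rate is 1 (the `rate` parameter is hypothetical) and ignores the labeled-history complication the slide mentions. The function name is made up.

```python
from fractions import Fraction


def expected_node_ages(n_tips: int, rate=1) -> dict:
    """Expected age of the split that produced k lineages, for k = n_tips .. 2.

    While k lineages exist, the expected waiting time is 1 / (k * rate).
    """
    ages = {}
    age = Fraction(0)
    for lineages in range(n_tips, 1, -1):
        age += Fraction(1, lineages) / rate
        ages[lineages] = age
    return ages


for lineages, age in expected_node_ages(4).items():
    print(lineages, age)
```

The entry for 2 lineages is the expected age of the root. The intervals get longer toward the root because fewer lineages are available to speciate.
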
  45. 45. PhyloInformatics<br />
  46. 46. What is PhyloInformatics?<br />A made-up word!<br />We’ve seen we have to deal with data of different types (trees, sequences, alignments, metadata).<br />These are part of complex workflows or pipelines.<br />We “do” phyloinformatics when we come up with repeatable ways to automate these pipelines.<br />
  47. 47. The power of UNIX<br />UNIX is very useful for phyloinformatics:<br />Everything is text-based<br />Everything can be scripted and called from other programs<br />Many programs for phylogenetics are available on UNIX platforms<br />Everything can be piped together to create larger workflows<br />
  48. 48. The power of Perl<br />Perl allows us to chain other UNIX tools together<br />Many Perl libraries exist for dealing with biological data<br />Easy to learn, quick to develop<br />
  49. 49. Join us!<br />We do a lot more phyloinformatics:<br />Hackathons<br />Google Summer of Code<br />Ongoing projects<br />Stay in touch, we can help each other!<br />
  50. 50. 谢谢!<br />Thank you!<br />