Bls 303 l1.phylogenetics



  1. 1. BLS 303: Principles of Computational Biology Lecture 1: Molecular Phylogenetics
  2. 2. Topics• i. Molecular Evolution• ii. Calculating Distances• iii. Clustering Algorithms• iv. Cladistic Methods• v. Computer Software
  3. 3. Evolution• The theory of evolution is the foundation upon which all of modern biology is built.• From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time.• It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time
  4. 4. Relationships• Similarity searches and multiple alignments of sequences naturally lead to the question: “How are these sequences related?” and, more generally: “How are the organisms from which these sequences come related?”
  5. 5. Taxonomy• The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology.• Taxonomy is the art of classifying things into groups — a quintessential human behavior — established as a mainstream scientific field by Carolus Linnaeus (1707-1778).
  6. 6. Phylogenetics• Evolutionary theory states that groups of similar organisms are descended from a common ancestor.• Phylogenetic systematics (cladistics) is a method of taxonomic classification that groups organisms according to their evolutionary history.• It was developed by Willi Hennig, a German entomologist, in 1950.
  7. 7. Cladistic Methods• Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.• Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing the best possible tree from among them.• A phylogram is a tree whose branch lengths are proportional to evolutionary distances.
  8. 8. Molecular Evolution• Phylogenetics often makes use of numerical data (numerical taxonomy). The data can be scores for various “character states”, such as the size of a visible structure, or they can be DNA sequences.• Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states.• In an alignment of DNA sequences, each position is a separate character, with four possible character states, the four nucleotides.
  9. 9. DNA is a good tool for taxonomy• DNA sequences have many advantages over classical types of taxonomic characters:– Character states can be scored unambiguously– Large numbers of characters can be scored for each individual– Information on both the extent and the nature of divergence between sequences is available (nucleotide substitutions, insertion/deletions, or genome rearrangements)
  10. 10. Each nucleotide difference is a character
      A  aat tcg ctt cta gga atc tgc cta atc ctg
      B  ... ..a ..g ..a .t. ... ... t.. ... ..a
      C  ... ..a ..c ..c ... ..t ... ... ... t.a
      D  ... ..a ..a ..g ..g ..t ... t.t ..t t..
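The dot notation on this slide (a dot means "same base as sequence A") can be expanded mechanically. The following is a minimal sketch, assuming the rows are already aligned and of equal length; the function names are illustrative, and only the first three codons of sequences A to D are used.

```python
def expand_dots(reference: str, row: str) -> str:
    """Replace each '.' in row with the reference base at the same position."""
    if len(reference) != len(row):
        raise ValueError("row and reference must have the same length")
    return "".join(ref if base == "." else base for ref, base in zip(reference, row))


def variable_sites(sequences: list[str]) -> list[int]:
    """Return the 0-based positions (characters) where not all sequences agree."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


# First three codons of sequences A-D from the slide, written without spaces.
seq_a = "aattcgctt"
rows = {"B": ".....a..g", "C": ".....a..c", "D": ".....a..a"}

expanded = {"A": seq_a}
expanded.update({name: expand_dots(seq_a, row) for name, row in rows.items()})

for name, seq in expanded.items():
    print(name, seq)
print("variable positions:", variable_sites(list(expanded.values())))
```

Each position reported by `variable_sites` is one character in the sense used above, and the bases found there are its character states.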
  11. 11. Sequences Reflect Relationships• After working with sequences for a while, one develops an intuitive understanding that “for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. These differences can be quantified”.• Given a set of gene sequences, it should be possible to reconstruct the evolutionary relationships among genes and among organisms.
  12. 12. What Sequences to Study?• Different sequences accumulate changes at different rates - choose a level of variation that is appropriate to the group of organisms being studied. – Proteins (or protein coding DNAs) are constrained by natural selection - better for very distant relationships – Some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions) – Different regions within a single gene can evolve at different rates (conserved vs. variable domains)
  13. 13. [Figure: an ancestral globin gene (A) undergoes a duplication, giving gene A (hemoglobin) and gene B (myoglobin); a later speciation gives copies A1 and B1 (mouse) and A2 and B2 (human).]
  14. 14. Orthologs vs. Paralogs• When comparing gene sequences, it is important to distinguish between genes that are truly equivalent in different organisms and genes that are merely similar.• Orthologs are homologous genes in different species that diverged through speciation; they usually have the same or analogous functions.• Paralogs are homologous genes that are the result of a gene duplication. – A species phylogeny that mixes orthologs and paralogs is likely to be incorrect. – Sometimes phylogenetic analysis is the best way to determine if a new gene is an ortholog or paralog to other known genes.
  15. 15. Terminologies of phylogeny• Phylogenetic (binary) tree: A tree is a graph composed of nodes and branches, in which any two nodes are connected by a unique path.• Nodes: Nodes in phylogenetic trees are called taxonomic units (TUs). Usually, taxonomic units are represented by sequences (DNA or RNA nucleotides or amino acids).• Branches: Branches in phylogenetic trees indicate descent/ancestry relationships among the TUs.• Terminal (external) nodes: The terminal nodes are also called the external nodes, leaves, or tips of the tree, as well as extant taxonomic units or operational taxonomic units (OTUs).
  16. 16. Terminologies of phylogeny• Internal nodes: The internal nodes are the nodes that are not terminal. They are also called ancestral TUs.• Root: The root is a node from which a unique path leads to any other node, in the direction of evolutionary time. The root is the common ancestor of all TUs under study.• Topology: The topology is the branching pattern of a tree.• Branch length: The lengths of the branches determine the metrics of a tree. In phylogenetic trees, lengths of branches are measured in units of evolutionary time or of evolutionary change (for example, substitutions per site).
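The terms above map directly onto a small data structure. The sketch below is one possible representation, not the only one: a rooted binary tree as nested tuples, where a string is a terminal node (OTU) and a 2-tuple is an internal node. The taxon names are hypothetical, and the parenthesized output follows the widely used Newick text format, which the slides do not mention.

```python
# A string is a terminal node (OTU); a 2-tuple is an internal node (ancestral TU).
# The outermost tuple is the root. Taxon names are hypothetical.
tree = (("human", "mouse"), ("chicken", "frog"))


def terminal_nodes(node):
    """List the leaves (OTUs) of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return terminal_nodes(left) + terminal_nodes(right)


def count_internal_nodes(node):
    """Count the internal nodes, including the root."""
    if isinstance(node, str):
        return 0
    left, right = node
    return 1 + count_internal_nodes(left) + count_internal_nodes(right)


def to_newick(node):
    """Write the topology (branching pattern only) in parenthesized form."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


print(terminal_nodes(tree))
print(count_internal_nodes(tree))
print(to_newick(tree) + ";")
```

This representation stores topology only; branch lengths would need an extra field per node.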
  17. 17. Example of phylogenetic tree: VP1 gene for FMDV
  18. 18. Genes vs. Species• Relationships calculated from sequence data represent the relationships between genes; these are not necessarily the same as the relationships between species.• Your sequence data may not have the same phylogenetic history as the species from which they were isolated.• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).
  19. 19. Cladistic vs. Phenetic• Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees: cladistic and phenetic.– Phenetic methods construct trees (phenograms) by considering the current states of characters without regard to the evolutionary history that brought the species to their current phenotypes.– Cladistic methods rely on assumptions about ancestral relationships as well as on current data.
  20. 20. Phenetic Methods• Computer algorithms based on the phenetic model rely on Distance Methods to build trees from sequence data.• Phenetic methods count each base of sequence difference equally, so a single event that creates a large change in sequence (insertion/deletion or recombination) will move two sequences far apart on the final tree.• Phenetic approaches generally lead to faster algorithms and they often have nicer statistical properties for molecular data.• The phenetic approach is popular with molecular evolutionists because it relies heavily on objective character data (such as sequences) and it requires relatively few assumptions.
  21. 21. Cladistic Methods• For character data about the physical traits of organisms (such as morphology of organs etc.) and for deeper levels of taxonomy, the cladistic approach is almost certainly superior.• Cladistic methods are often difficult to implement with molecular data because all of the assumptions are generally not satisfied.
  22. 22. Distance Measurements• It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals.• The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances.• In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree - thus creating a distance matrix.• Distance methods give a single measurement of the amount of evolutionary change between two sequences since divergence from a common ancestor.
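A distance matrix of the kind described above can be sketched in a few lines. This is a minimal illustration, assuming the sequences are already aligned to equal length and using a plain count of differing positions as the distance; the sequence names and the dictionary layout are choices made for this example.

```python
from itertools import combinations


def count_differences(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Symmetric matrix of pairwise distances, keyed by (name, name)."""
    matrix = {(name, name): 0 for name in seqs}
    for a, b in combinations(seqs, 2):
        matrix[a, b] = matrix[b, a] = count_differences(seqs[a], seqs[b])
    return matrix


# Illustrative sequences (names s1..s4 are hypothetical).
seqs = {"s1": "ATCG", "s2": "TTCG", "s3": "ATCC", "s4": "TCCG"}
matrix = distance_matrix(seqs)

for a in seqs:
    print(a, [matrix[a, b] for b in seqs])
```

For N sequences the matrix holds N(N-1)/2 distinct pairwise values, which is all the information a distance method sees.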
  23. 23. DNA Distances• Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences. – this type of algorithm can only work for pairs of sequences that are similar enough to be aligned• Generally all base changes are considered equal.• Insertion/deletions are generally given a larger weight than replacements (gap penalties).• It is also possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.
  24. 24. Amino Acid Distances• Distances between amino acid sequences are a bit more complicated to calculate.• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be functionally devastating.• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence.• In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, e.g. PAM and BLOSUM.
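As an illustration of correcting for multiple substitutions, the sketch below computes the proportion of differing sites and then applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The choice of Jukes-Cantor is an assumption made here; the slides do not name a specific correction. Gapped positions are simply skipped, which is a simplification compared with the gap weighting mentioned above.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p is too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Illustrative sequences: 2 of 10 sites differ.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
print(f"observed p = {p:.3f}, corrected d = {jukes_cantor(p):.3f}")
```

The corrected distance is always at least as large as the observed proportion, and the gap between the two grows as sequences become more divergent.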
  25. 25. The PAM 250 scoring matrix
          A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
      A   2
      R  -2   6
      N   0   0   2
      D   0  -1   2   4
      C  -2  -4   4  -5   4
      Q   0   1   1   2  -5   4
      E   0  -1   1   3  -5   2   4
      G   1  -3   0   1  -3  -1   0   5
      H  -1   2   2   1  -3   3   1  -2   6
      I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
      L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
      K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
      M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
      F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
      P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
      S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
      T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
      W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
      Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
      V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
      Dayhoff, M., Schwartz, R.M., Orcutt, B.C. (1978) A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345-352. M. Dayhoff, ed., National Biomedical Research Foundation, Silver Spring, MD.
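A scoring matrix like this is used as a lookup table over aligned residue pairs. The sketch below is illustrative only: it copies a handful of entries from the matrix on this slide into a dictionary (a real program would load all 210 values) and sums the scores over an ungapped alignment of two hypothetical peptides.

```python
# A few entries copied from the PAM 250 matrix on the slide.
# Keys are stored with the two residues in alphabetical order.
PAM250_SUBSET = {
    ("L", "L"): 6, ("I", "L"): 2,
    ("D", "D"): 4, ("D", "E"): 3, ("D", "W"): -7,
    ("K", "K"): 5, ("K", "R"): 3,
    ("W", "W"): 17,
}


def pam_score(a: str, b: str, table=PAM250_SUBSET) -> int:
    """Score for replacing residue a with residue b (the matrix is symmetric)."""
    return table[tuple(sorted((a, b)))]


def alignment_score(s1: str, s2: str) -> int:
    """Sum of substitution scores over an ungapped alignment."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(pam_score(a, b) for a, b in zip(s1, s2))


print(alignment_score("LDK", "LDK"))  # identical peptides
print(alignment_score("LDK", "IEK"))  # two conservative replacements
print(alignment_score("LDK", "LWK"))  # one drastic replacement (D to W)
```

Conservative replacements such as L/I or D/E keep the score positive, while a single unlikely replacement such as D/W pulls it down sharply.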
  26. 26. Clustering Algorithms• Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.– Start with a matrix of pairwise distances– Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.
  27. 27. UPGMA• The simplest of the distance methods is the UPGMA (Unweighted Pair Group Method using Arithmetic averages)• The PHYLIP programs DNADIST and PROTDIST calculate absolute pairwise distances between a group of sequences. Then the GCG program GROWTREE uses UPGMA to build a tree.• Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences which is then used to guide the multiple alignment algorithm.
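The clustering procedure behind UPGMA can be sketched as follows. This is a minimal illustration, not the code used by PHYLIP, GCG or PILEUP: it repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster to the average weighted by cluster size. Branch lengths are omitted, and the taxon names and distances are hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the UPGMA topology as nested tuples.

    names: list of taxon names.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = dict(dist)
    while len(clusters) > 1:
        # Link the least distant pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to every remaining cluster: size-weighted arithmetic average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Hypothetical distances among four taxa.
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in pairs.items()}

print(upgma(["A", "B", "C", "D"], dist))
```

With these distances A and B are joined first, then C, then D, which is the "least distant pairs first" order described above.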
  28. 28. Neighbor Joining• The Neighbor Joining method is the most popular way to build trees from distance measurements (Saitou and Nei 1987, Mol. Biol. Evol. 4:406) – Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree. – The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch). – Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989)).
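The rate adjustment can be illustrated with the pair-selection step of Neighbor Joining. The sketch below uses the commonly stated criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and picks the pair with the smallest Q; it covers only this one step, not the full algorithm, and the five taxa and their distances are hypothetical.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of taxa that Neighbor Joining would join first, with its Q value.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    n = len(names)
    # Total distance from each taxon to all others (its overall divergence).
    totals = {i: sum(dist[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - totals[i] - totals[j]

    best = min(combinations(names, 2), key=q)
    return best, q(best)


# Hypothetical distances among five taxa.
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): value for pair, value in pairs.items()}

print(nj_pick_pair(["a", "b", "c", "d", "e"], dist))
```

In this example the smallest raw distance is between d and e, yet the adjusted criterion selects a and b, because both are far from everything else. That is the sense in which the matrix is adjusted for each taxon's rate of evolution.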
  29. 29. Cladistic methods• Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).• These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.
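To see why restricting the comparison to a few candidate trees matters, it helps to count the possible trees. The sketch below uses the standard counts for labeled bifurcating trees, (2n - 3)!! rooted and (2n - 5)!! unrooted topologies for n sequences; the function names are illustrative.

```python
def double_factorial(k: int) -> int:
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n: int) -> int:
    """Number of rooted bifurcating trees for n labeled sequences: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return double_factorial(2 * n - 3)


def unrooted_tree_count(n: int) -> int:
    """Number of unrooted bifurcating trees for n labeled sequences: (2n - 5)!!."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return double_factorial(2 * n - 5)


for n in (4, 6, 10, 20):
    print(n, rooted_tree_count(n), unrooted_tree_count(n))
```

Four sequences allow 15 rooted trees, but ten sequences already allow more than 34 million, so exhaustive evaluation quickly becomes impractical.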
  30. 30. Parsimony• Parsimony is the most popular method for reconstructing ancestral relationships.• Parsimony allows the use of all known evolutionary information in building a tree – In contrast, distance methods compress all of the differences between pairs of sequences into a single number
  31. 31. Building Trees with Parsimony• Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.• The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.
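Scoring one candidate tree by the number of base changes it requires can be sketched with Fitch's counting procedure. Fitch's method is not named in the slides; it is used here as one standard way to obtain the minimum number of changes on a fixed rooted binary tree. The taxa, sequences and the two candidate trees are hypothetical.

```python
def fitch_site(node, seqs, site):
    """Return (possible ancestral states, changes needed) for one alignment column."""
    if isinstance(node, str):  # terminal node: its observed base, no changes
        return {seqs[node][site]}, 0
    left_states, left_cost = fitch_site(node[0], seqs, site)
    right_states, right_cost = fitch_site(node[1], seqs, site)
    common = left_states & right_states
    if common:  # children can agree: no extra change at this node
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Minimum number of base changes the tree needs to explain all sites."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, seqs, site)[1] for site in range(length))


# Hypothetical aligned sequences and two candidate topologies.
seqs = {"a": "AAGT", "b": "AAGC", "c": "GGAC", "d": "GGAT"}
tree_x = (("a", "b"), ("c", "d"))
tree_y = (("a", "c"), ("b", "d"))

print(parsimony_score(tree_x, seqs), parsimony_score(tree_y, seqs))
```

Here the first tree needs five changes and the second needs eight, so parsimony prefers the first. A full analysis would repeat this scoring over every candidate tree.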
  32. 32. Parsimony Example• Consider four sequences: ATCG, TTCG, ATCC, and TCCG• Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.• Then each branch splits, for a total of 3 nodes on the tree (Tree #1)
  33. 33. Compare Tree #1 with one that first divides ATCC on its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree #2). Trees #1 and #2 both have three nodes, but when all of the distances back to the root (# of nodes crossed) are summed, the total is equal to 8 for Tree #1 and 9 for Tree #2. [Figure: Tree #1 and Tree #2]
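The totals of 8 and 9 can be reproduced by writing the two trees as nested tuples and summing, for each sequence, the number of nodes crossed on the way back to the root. This sketch only reproduces the count described on the slide; the tuple encoding of the trees is an assumption about how the figure was drawn.

```python
def depth_sum(node, depth=0):
    """Sum over all leaves of the number of internal nodes between leaf and root (root included)."""
    if isinstance(node, str):
        return depth
    return sum(depth_sum(child, depth + 1) for child in node)


# Tree #1: ATCG and ATCC on one branch, TTCG and TCCG on the other.
tree_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
# Tree #2: ATCC splits off first, then ATCG, then TTCG from TCCG.
tree_2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))

print(depth_sum(tree_1), depth_sum(tree_2))
```

The balanced Tree #1 places every sequence two nodes from the root, while the ladder-shaped Tree #2 places two of them three nodes away, which accounts for the extra unit.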
  34. 34. Maximum Likelihood• The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.• This method works best when it is used to test (or improve) an existing tree.• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.
  35. 35. Assumptions for Maximum Likelihood • The model of evolution must specify the frequencies of DNA transitions (C<->T, A<->G) and transversions (C or T<->A or G). • The assumptions for protein sequence changes are taken from the PAM matrix - and are quite likely to be violated in “real” data. • Each nucleotide site is assumed to evolve independently, so the likelihood is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data.
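Two of these assumptions are easy to show in code: how a base change is classified, and how per-site likelihoods combine into one value. The sketch below is illustrative; the per-site likelihoods are made-up numbers rather than the output of a real evolutionary model, and the product is computed as a sum of logarithms, which is the usual way to avoid numerical underflow over many sites.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x: str, y: str) -> str:
    """Classify a base change as 'identical', 'transition' or 'transversion'."""
    if not {x, y} <= PURINES | PYRIMIDINES:
        raise ValueError("bases must be A, C, G or T")
    if x == y:
        return "identical"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def log_likelihood(site_likelihoods) -> float:
    """Log of the product of independent per-site likelihoods."""
    return sum(math.log(p) for p in site_likelihoods)


print(substitution_type("C", "T"), substitution_type("A", "C"))

# Made-up likelihoods for four sites under some model and tree.
sites = [0.9, 0.8, 0.05, 0.9]
print(log_likelihood(sites), math.exp(log_likelihood(sites)))
```

Because the sites are multiplied together, a single poorly explained site (0.05 here) lowers the overall likelihood far more than the well explained sites raise it.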
  36. 36. Computer Software for Phylogenetics• Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.– PHYLIP is a free package that includes 30 programs that implement various phylogenetic algorithms for different kinds of data.– The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony).– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.– DNAStar– MacClade is a well-designed cladistics program that allows the user to explore possible trees for a data set.
  37. 37. Phylogenetics on the Web• There are several phylogenetics servers available on the Web – some of these will change or disappear in the near future – these programs can be very slow so keep your sample sets small• The Institut Pasteur, Paris has a PHYLIP server at:• Louxin Zhang at the Natl. University of Singapore has a WebPhylip server:• The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server:• The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to GIFs at 72 dpi, which is not publication quality.
  38. 38. Other Web Resources• Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at: /software.html• Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists• University of California, Berkeley Museum of Paleontology (UCMP)
  39. 39. Software Hazards• There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences).• Moving sequences into different programs can be a major hassle due to incompatible file formats.• Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.
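The file format problem is usually handled with a small conversion step. As an illustration, the sketch below reads aligned sequences in FASTA format and writes them in the strict sequential PHYLIP layout (a header line with the number of sequences and sites, then each name padded or truncated to 10 characters followed by its sequence). The function names are illustrative, and real data may need the interleaved or relaxed variants instead.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Read FASTA text into {name: sequence}; the name is the first word after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def to_phylip(records: dict[str, str]) -> str:
    """Write aligned sequences in strict sequential PHYLIP format."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        lines.append(f"{name[:10]:<10}{seq}")  # names occupy exactly 10 columns
    return "\n".join(lines)


fasta = ">alpha first sequence\nACGT\n>beta\nAC\nGA\n"
print(to_phylip(parse_fasta(fasta)))
```

Note that truncating names to 10 characters can make two long names collide, which is one of the typical hazards when moving data between programs.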
  40. 40. Conclusions• Given the huge variety of methods for computing phylogenies, how can the biologist determine what is the best method for analyzing a given data set?– Published papers that address phylogenetic issues generally make use of several different algorithms and data sets in order to support their conclusions.– In some cases different methods of analysis can work synergistically • Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method– Using several alternate methods can give an indication of the robustness of a given conclusion.