Table 4. Examples of Conditions in Which Similarity Methods Produce Inaccurate Predictions of Function

Columns: evolutionary pattern and tree of genes and functions (1); gene with unknown function (2); highest hit method, predicted function (3) and accurate?; phylogenomic method, predicted function (4) and accurate?; comments. (The tree and predicted-function columns are graphical and are not reproduced.)

A. Functional change during evolution.
   Gene with unknown function       1    2    3    4    5    6
   Highest hit method, accurate?    +    +    +    -    ±    ±
   Phylogenomic method, accurate?   +    +    ±    ±    +    +
   Comments:
   • Phylogenomic method cannot predict functions for all genes, but the predictions that are made are accurate.
   • Highest hit method is misleading because function changed among homologs but hierarchies of similarity do not correlate with the function (see Bolker and Raff 1996).

B. Functional change & rate variation.
   Gene with unknown function       1    2    3    4    5    6
   Highest hit method, accurate?    +    +    -    -    -    +
   Phylogenomic method, accurate?   +    +    ±    ±    +    +
   Comments:
   • Similarity-based methods perform particularly poorly when evolutionary rates vary between taxa.
   • Molecular phylogenetic methods can allow for rate variation and reconstruct gene history reasonably accurately.

C. Gene duplication and rate variation.
   Gene with unknown function       1A   2A   3A   1B   2B   3B
   Highest hit method, accurate?    +    +    -    +    +    -
   Phylogenomic method, accurate?   +    +    +    +    +    +
   Comments:
   • Most similarity-based methods are not ideally set up to deal with cases of gene duplication, since orthologous genes do not always have significantly more sequence similarity to each other than to paralogs (Eisen et al. 1995; Zardoya et al. 1996; Tatusov et al. 1997).
   • Similarity-based methods perform particularly poorly when rate variation and gene duplication are combined. This even applies to the COG method (see Table 1), since it works by classifying levels of similarity and not by inferring history. Nevertheless, the COG method is a significant improvement over other similarity-based methods in classifying orthologs.
   • Phylogenetic reconstruction is the most reliable way to infer gene duplication events and thus determine orthology.

(1) The true tree is shown but it is assumed that it is not known. Different colors and symbols represent different functions. Numbers correspond to different species.
(2) The function of all other genes is assumed to be known.
(3) The top hit can be determined from the tree by finding the gene that is the shortest evolutionary distance away (as determined along the branches of the tree).
(4) It is assumed that the tree of the genes can be reproduced accurately by molecular phylogenetic methods (see Fig. 1).
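The highest hit rule in footnote 3 and the rate-variation problem of panel B can be illustrated with a small sketch. Everything in it is hypothetical: the four-gene tree, the branch lengths, and the names `PARENT`, `patristic`, `highest_hit`, and `sister_genes` are assumptions made for illustration, not part of the method described in the text. The sketch measures distance along the branches of a rooted tree and shows that the nearest gene by distance need not be the nearest gene by ancestry when one lineage evolves quickly.

```python
# Hypothetical rooted gene tree, stored as child -> (parent, branch length).
# gene3 sits on an accelerated (long) branch; gene4 is the uncharacterized query.
PARENT = {
    "gene1": ("n12", 1.0),
    "gene2": ("n12", 1.5),
    "gene3": ("n34", 5.0),
    "gene4": ("n34", 1.0),
    "n12": ("root", 1.0),
    "n34": ("root", 1.0),
}
LEAVES = ["gene1", "gene2", "gene3", "gene4"]


def path_to_root(node):
    """Map each ancestor of `node` (and node itself) to its distance from node."""
    distance, seen = 0.0, {node: 0.0}
    while node in PARENT:
        node, length = PARENT[node]
        distance += length
        seen[node] = distance
    return seen


def patristic(a, b):
    """Distance between a and b measured along the branches of the tree."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    # With positive branch lengths the minimum is reached at the
    # most recent common ancestor of a and b.
    return min(path_a[n] + path_b[n] for n in path_a if n in path_b)


def highest_hit(query, candidates):
    """Candidate gene at the shortest evolutionary distance from the query."""
    return min(candidates, key=lambda gene: patristic(query, gene))


def sister_genes(query, leaves):
    """Leaves that share the query's immediate parent node."""
    parent = PARENT[query][0]
    return [g for g in leaves if g != query and parent in path_to_root(g)]


others = [g for g in LEAVES if g != "gene4"]
print(highest_hit("gene4", others))   # gene1: closest by distance
print(sister_genes("gene4", LEAVES))  # ['gene3']: closest by ancestry
```

In this toy tree the query's closest relative (gene3) is 6.0 units away while the more distantly related gene1 is only 4.0 units away, so a highest hit rule would transfer the function of gene1 rather than that of the true sister gene.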
ing method (see Table 1). Although the COG method is clearly a major advance in identifying orthologous groups of genes, it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships (Swofford et al. 1996). Thus, as sequence similarity and clustering are not reliable estimators of evolutionary relatedness, and as the incorporation of such phylogenetic information has been so useful to other areas of biology, evolutionary techniques should be useful for improving the accuracy of predicting function based on sequence similarity.

Phylogenomics
There are many ways in which evolutionary information can be used to improve functional predictions. Below, I present an outline of one such phylogenomic method (see Fig. 1), and I compare this method to nonevolutionary functional prediction methods. This method is based on a relatively simple assumption: because gene functions change as a result of evolution, reconstructing the evolutionary history of genes should help predict the functions of uncharacterized genes. The first step is the generation of a phylogenetic tree representing the evolutionary history of the gene of interest and its homologs. Such trees are distinct from clusters and other means of characterizing sequence similarity because they are inferred by special techniques that help convert patterns of similarity into evolutionary relationships (see Swofford et al. 1996). After the gene tree is inferred, biologically determined functions of the various homologs are overlaid onto the tree. Finally, the structure of the tree and the relative phylogenetic positions of genes of different functions are used to trace the history of functional changes, which is then used to predict functions of uncharacterized genes. More detail of this method is provided below.

Identification of Homologs
The first step in studying the evolution of a particular gene is the identification of homologs.
As with similarity-based functional prediction methods, likely homologs of a particular gene are identified through database searches. Because phylogenetic methods benefit greatly from more data, it is useful to augment this initial list by using identified homologs as queries for further database searches or using automatic iterated search methods such as PSI-BLAST (Altschul et al. 1997). If a gene family is very large (e.g., ABC transporters), it may be necessary to only analyze a subset of homologs. However, this must be done with extreme care, as one might accidentally leave out proteins that would be important for the analysis.

Alignment and Masking
Sequence alignment for phylogenetic analysis has a particular purpose: it is the assignment of positional homology. Each column in a multiple sequence alignment is assumed to include amino acids or nucleotides that have a common evolutionary history, and each column is treated separately in the phylogenetic analysis. Therefore, regions in which the assignment of positional homology is ambiguous should be excluded (Gatesy et al. 1993). The exclusion of certain alignment positions (also known as masking) helps to give phylogenetic methods much of their discriminatory power. Phylogenetic trees generated without masking (as is done in many sequence analysis software packages) are less likely to accurately reflect the evolution of the genes than trees with masking.

Phylogenetic Trees
For extensive information about generating phylogenetic trees from sequence alignments, see Swofford et al. (1996). In summary, there are three methods commonly used: parsimony, distance, and maximum likelihood (Table 3), and each has its advantages and disadvantages. I prefer distance methods because they are the quickest when using large data sets. Before using any particular tree it is important to estimate the robustness and accuracy of the phylogenetic patterns it shows (through techniques such as the comparison of trees generated by different methods and bootstrapping). Finally, in most cases, it is also useful to determine a root for the tree.

Table 1. Methods of Predicting Gene Function When Homologs Have Multiple Functions

Highest Hit
   The uncharacterized gene is assigned the function (or frequently, the annotated function) of the gene that is identified as the highest hit by a similarity search program (e.g., Tomb et al. 1997).
Top Hits
   Identify top 10+ hits for the uncharacterized gene. Depending on the degree of consensus of the functions of the top hits, the query sequence is assigned a specific function, a general activity with unknown specificity, or no function (e.g., Blattner et al. 1997).
Clusters of Orthologous Groups
   Genes are divided into groups of orthologs based on a cluster analysis of pairwise similarity scores between genes from different species. Uncharacterized genes are assigned the function of characterized orthologs (Tatusov et al. 1997).
Phylogenomics
   Known functions are overlaid onto an evolutionary tree of all homologs. Functions of uncharacterized genes are predicted by their phylogenetic position relative to characterized genes (e.g., Eisen et al. 1995, 1997).

Table 2. Types of Molecular Homology

Homolog               Genes that are descended from a common ancestor (e.g., all globins)
Ortholog              Homologous genes that have diverged from each other after speciation events (e.g., human β- and chimp β-globin)
Paralog               Homologous genes that have diverged from each other after gene duplication events (e.g., β- and γ-globin)
Xenolog               Homologous genes that have diverged from each other after lateral gene transfer events (e.g., antibiotic resistance genes in bacteria)
Positional homology   Common ancestry of specific amino acid or nucleotide positions in different genes

Insight/Outlook   164 GENOME RESEARCH

Functional Predictions
To make functional predictions based on the phylogenetic tree, it is necessary to first overlay any known functions onto the tree. There are many ways this “map” can then be used to make functional predictions, but I recommend splitting the task into two steps. First, the tree can be used to identify likely gene duplication events in the past. This allows the division of the genes into groups of orthologs and paralogs (e.g., Eisen et al. 1995). Uncharacterized genes can be assigned a likely function if the function of any ortholog is known (and if all characterized orthologs have the same function). Second, parsimony reconstruction techniques (Maddison and Maddison 1992) can be used to infer the likely functions of uncharacterized genes by identifying the evolutionary scenario that requires the fewest functional changes over time (Fig. 1). The incorporation of more realistic models of functional change (and not just minimizing the total number of changes) may prove to be useful, but the parsimony minimization methods are probably sufficient in most cases.

Is the Phylogenomic Method Worth the Trouble?
Phylogenomic methods require many more steps and usually much more manual labor than similarity-based functional prediction methods. Is the phylogenomic approach worth the trouble? Many specific examples exist in which gene function has been shown to correlate well with gene phylogeny (Eisen et al.
1995; Atchley and Fitch 1997). Although no systematic comparisons of phylogenetic versus similarity-based functional prediction methods have been done, there are a variety of reasons to believe that the phylogenomic method should produce more accurate predictions than similarity-based methods. In particular, there are many conditions in which similarity-based methods are likely to make inaccurate predictions but which can be dealt with well by phylogenetic methods (see Table 4).

Figure 1 Outline of a phylogenomic methodology. In this method, information about the evolutionary relationships among genes is used to predict the functions of uncharacterized genes (see text for details). Two hypothetical scenarios are presented and the path of trying to infer the function of two uncharacterized genes in each case is traced. (A) A gene family has undergone a gene duplication that was accompanied by functional divergence. (B) Gene function has changed in one lineage. The true tree (which is assumed to be unknown) is shown at the bottom. The genes are referred to by numbers (which represent the species from which these genes come) and letters (which in A represent different genes within a species). The thin branches in the evolutionary trees correspond to the gene phylogeny and the thick gray branches in A (bottom) correspond to the phylogeny of the species in which the duplicate genes evolve in parallel (as paralogs). Different colors (and symbols) represent different gene functions; gray (with hatching) represents either unknown or unpredictable functions.

A specific example helps illustrate a potential problem with similarity-based methods. Molecular phylogenetic methods show conclusively that mycoplasmas share a common ancestor with low-GC Gram-positive bacteria (Weisburg et al. 1989). However, examination of the percent similarity between mycoplasmal genes and their homologs in bacteria does not clearly show this relationship. This is because mycoplasmas have undergone an accelerated rate of molecular evolution relative to other bacteria. Thus, a BLAST search with a gene from Bacillus subtilis (a low-GC Gram-positive species) will result in a list in which the mycoplasma homologs (if they exist) score lower than genes from many species of bacteria less closely related to B. subtilis. When amounts or rates of change vary between lineages, phylogenetic methods are better able to infer evolutionary relationships than similarity methods (including clustering) because they allow for evolutionary branches to have different lengths. Thus, in those cases in which gene function correlates with gene phylogeny and in which amounts or rates of change vary between lineages, similarity-based methods will be more likely than phylogenomic methods to make inaccurate functional predictions (see Table 4).

Table 3. Molecular Phylogenetic Methods

Parsimony
   Possible trees are compared and each is given a score that is a reflection of the minimum number of character state changes (e.g., amino acid substitutions) that would be required over evolutionary time to fit the sequences into that tree. The optimal tree is considered to be the one requiring the fewest changes (the most parsimonious tree).
Distance
   The optimal tree is generated by first calculating the estimated evolutionary distance between all pairs of sequences. Then these distances are used to generate a tree in which the branch patterns and lengths best represent the distance matrix.
Maximum likelihood
   Maximum likelihood is similar to parsimony methods in that possible trees are compared and given a score. The score is based on how likely the given sequences are to have evolved in a particular tree given a model of amino acid or nucleotide substitution probabilities. The optimal tree is considered to be the one that has the highest probability.
Bootstrapping
   Alignment positions within the original multiple sequence alignment are resampled and new data sets are made. Each bootstrapped data set is used to generate a separate phylogenetic tree and the trees are compared. Each node of the tree can be given a bootstrap percentage indicating how frequently those species joined by that node group together in different trees. Bootstrap percentage does not correspond directly to a confidence limit.

Another major advantage of phylogenetic methods over most similarity methods comes from the process of masking (see above). For example, a deletion of a large section of a gene in one species will greatly affect similarity measures but may not affect the function of that gene. A phylogenetic analysis including these genes could exclude the region of the deletion from the analysis by masking. In addition, regions of genes that are highly variable between species are more likely to undergo convergence and such regions can be excluded from phylogenetic analysis by masking. Masking thus allows the exclusion of regions of genes in which sequence similarity is likely to be “noisy” or misleading rather than a biologically important signal. The pairwise sequence comparisons used by most similarity-based functional prediction methods do not allow such masking. Phylogenetic methods have been criticized because of their dependence (for most methods) on multiple sequence alignments that are not always reliable and unbiased. However, multiple sequence alignments also allow for masking, a benefit that probably outweighs the cost of depending on alignments.

The conditions described above and highlighted in Table 4 are just some examples of conditions in which evolutionary methods are more likely to make accurate functional predictions than similarity-based methods.
Phylogenetic methods are particularly useful when the history of a gene family includes many of these conditions (e.g., multiple gene duplications plus rate variation) or when the gene family is very large. The principle is simple: the more complicated the history of a gene family, the more useful it is to try to infer that history. Thus although the phylogenomic method is slow and labor intensive, I believe it is worth using if accuracy is the main objective. In addition, information about the evolutionary relationships among gene homologs is useful for summarizing relationships among genes and for putting functional information into a useful context.

Despite the evolution of these methods, and likely continued improvements in functional predictions, it must be remembered that the key word is prediction. All methods are going to make inaccurate predictions of functions. For example, none of the methods described can perform well when gene functions can change with little sequence change, as has been seen in proteins like opsins (Yokoyama 1997). Thus, sequence databases and genome researchers should make clear which functions assigned to genes are based on predictions and which are based on experiments. In addition, all prediction methods should use only experimentally determined functions as their grist for predictions. This will hopefully limit the error propagation that can happen when an inaccurate prediction of function is used to predict the function of a new gene. This is a particular problem for the highest hit methods, as they rely on the function of only one gene at a time to make predictions (Eisen et al. 1997). Despite these and other potential problems, functional predictions are of great value in guiding research and in sorting through huge amounts of data. I believe that the increased use of phylogenetic methods can only serve to improve the accuracy of such functional predictions.

REFERENCES
Altschul, S.F., R.J. Carroll, and D.J. Lipman. 1989. J. Mol. Biol. 207: 647–653.
Altschul, S.F., T.L. Madden, A.A. Schaeffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. 1997. Nucleic Acids Res. 25: 3389–3402.
Atchley, W.R. and W.M. Fitch. 1997. Proc. Natl. Acad. Sci. 94: 5172–5176.
Blattner, F.R., G.I. Plunkett, C.A. Bloch, N.T. Perna, V. Burland, M. Riley, J. Collado-Vides, J.D. Glasner, C.K. Rode, G.F. Mayhew et al. 1997. Science 277: 1453–1462.
Bolker, J.A. and R.A. Raff. 1996. BioEssays 18: 489–494.
Eisen, J.A., D. Kaiser, and R.M. Myers. 1997. Nature (Med.) 3: 1076–1078.
Eisen, J.A., K.S. Sweder, and P.C. Hanawalt. 1995. Nucleic Acids Res. 23: 2715–2723.
Felsenstein, J. 1985. Am. Nat. 125: 1–15.
Fitch, W.M. 1970. Syst. Zool. 19: 99–113.
Gatesy, J., R. Desalle, and W. Wheeler. 1993. Mol. Phylog. Evol. 2: 152–157.
Goldman, N., J.L. Thorne, and D.T. Jones. 1996. J. Mol. Biol. 263: 196–208.
Henikoff, S., E.A. Greene, S. Pietrokovski, P. Bork, T.K. Attwood, and L. Hood. 1997. Science 278: 609–614.
Henikoff, S., S. Pietrokovski, and J.G. Henikoff. 1998. Nucleic Acids Res. 26: 311–315.
Hillis, D.M. 1994. In Homology: The hierarchical basis of comparative biology (ed. B.K. Hall), pp. 339–368. Academic Press, San Diego, CA.
Maddison, W.P. and D.R. Maddison. 1992. MacClade. Sinauer Associates, Sunderland, MA.
Reeck, G.R., C. Haën, D.C. Teller, R.F. Doolittle, W.M. Fitch, R.E. Dickerson, P. Chambon, A.D. McLachlan, E. Margoliash, T.H. Jukes et al. 1987. Cell 50: 667.
Swofford, D.L., G.J. Olsen, P.J. Waddell, and D.M. Hillis. 1996. In Molecular systematics (ed. D.M. Hillis, C. Moritz, and B.K. Mable), pp. 407–514. Sinauer Associates, Sunderland, MA.
Tatusov, R.L., E.V. Koonin, and D.J. Lipman. 1997. Science 278: 631–637.
Tomb, J.F., O. White, A.R. Kerlavage, R.A. Clayton, G.G. Sutton, R.D. Fleischmann, K.A. Ketchum, H.P. Klenk, S. Gill, B.A. Dougherty et al. 1997. Nature 388: 539–547.
Weisburg, W.G., J.G. Tully, D.L. Rose, J.P. Petzel, H. Oyaizu, D. Yang, L. Mandelco, J. Sechrest, T.G. Lawrence, J. Van Etten et al. 1989. J. Bacteriol. 171: 6455–6467.
Yokoyama, S. 1997. Annu. Rev. Genet. 31: 315–336.
Zardoya, R., E. Abouheif, and A. Meyer. 1996. Trends Genet. 2: 496–497.
[Figure 1 graphic, not reproduced. Flowchart titled "Phylogenetic prediction of gene function" with the method steps: choose gene(s) of interest; identify homologs; align sequences; calculate gene tree; overlay known functions onto tree; infer likely function of gene(s) of interest. Two panels, Example A (genes 1A, 2A, 3A, 1B, 2B, 3B from species 1 to 3, with inferred duplication) and Example B (genes 1 to 6), each traced through the steps, with the actual evolution (assumed to be unknown) shown at the bottom. Predictions that cannot be resolved are marked "Ambiguous".]
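The last two steps of the outline, overlaying known functions onto the tree and inferring the likely function of the genes of interest, can be sketched with a simplified Fitch-style parsimony pass over function labels. This is only an illustration under stated assumptions: the nested-tuple tree encoding, the labels `f1` and `f2`, and the function names `fitch_up`, `fitch_down`, and `predict_functions` are hypothetical, the gene tree is taken as already inferred, and the code is not the MacClade procedure cited in the text. An uncharacterized gene is treated as a wildcard and receives whatever states the most parsimonious scenario allows at its position.

```python
def fitch_up(node, known, sets):
    """Bottom-up pass: candidate function set for each node (None = no data)."""
    if isinstance(node, str):  # leaf: a gene name
        function = known.get(node)
        sets[node] = {function} if function is not None else None
        return sets[node]
    child_sets = [fitch_up(child, known, sets) for child in node]
    informative = [s for s in child_sets if s is not None]
    if not informative:
        sets[node] = None
    else:
        shared = set.intersection(*informative)
        # No shared state implies a functional change below this node.
        sets[node] = shared if shared else set.union(*informative)
    return sets[node]


def fitch_down(node, parent_states, sets, out):
    """Top-down pass: prefer states shared with the parent where possible."""
    own = sets[node]
    if own is None:
        states = parent_states
    elif parent_states is not None and own & parent_states:
        states = own & parent_states
    else:
        states = own
    if isinstance(node, str):
        out[node] = states
    else:
        for child in node:
            fitch_down(child, states, sets, out)


def predict_functions(tree, known):
    """Return the inferred state set for every gene lacking a known function."""
    sets, out = {}, {}
    fitch_up(tree, known, sets)
    fitch_down(tree, None, sets, out)
    return {gene: states for gene, states in out.items() if known.get(gene) is None}


# Duplication followed by divergence (in the spirit of Example A); 2A is unknown.
tree_a = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known_a = {"1A": "f1", "3A": "f1", "1B": "f2", "2B": "f2", "3B": "f2"}
print(predict_functions(tree_a, known_a))  # {'2A': {'f1'}}

# An unknown gene branching off below a node where function changed: ambiguous.
tree_b = (("1", "3"), "2")
known_b = {"1": "f1", "3": "f2"}
print(predict_functions(tree_b, known_b))
```

A set with a single label corresponds to a confident prediction; a set with several labels corresponds to the "Ambiguous" outcome in the outline, where the tree does not favor one function over another.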