From Chemistry to Information: Emergence of the Genetic Code



Rob Knight
Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544

Abstract

The genetic code translates codons into amino acids, allowing coded peptide synthesis and forming the core of modern biology. Although the modern system is extremely precise, primitive systems probably encoded fewer amino acids with much less precision. This paper provides an overview of recent progress in understanding the origin and subsequent evolution of the genetic code. It also outlines several of the remaining puzzles, and suggests several approaches that may help to answer some of these questions.

Introduction

Although “information” is widely misused as a metaphor in biology (Oyama 1985), the genetic code is an evolved system that encodes information in the strict, mathematical sense (Shannon 1948). The presence of a codon at a given position in an mRNA allows us to predict, with high precision, the amino acid present at the corresponding site in the resulting protein: the mRNA acts as the source, the protein acts as the receiver, and the translation apparatus (ribosome, aminoacyl-tRNA synthetases, tRNA, release factors) provides the channel conditions. Because the genetic code is constant in most organisms, it is easy to forget that these channel conditions are not immutable, but are themselves products of evolution.

The textbook picture is that the 64 codons are distributed among the 20 amino acids with varying degrees of degeneracy, such that each codon specifies precisely one amino acid (or acts as a stop codon) but most amino acids are specified by more than one codon. However, the actual situation is more complex. It is impossible to transmit information with no error whatsoever, and the pattern of translation error (reviewed by Parker 1989) produces a frequency distribution of amino acids for each codon, rather than a single amino acid for each codon.
Thus there are two fundamental questions concerning the evolution of the genetic code:

1. How did the set of frequency distributions that comprises the genetic code become skewed so heavily to one amino acid per codon?
2. Given the one amino acid per codon rule, how did codons come to have their present identities?
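The idea that each codon yields a frequency distribution of amino acids, and that the modern code is heavily skewed toward one amino acid per codon, can be made concrete with a small sketch. The two-codon channels and their probabilities below are purely illustrative assumptions, not measured misreading rates; the function names are likewise hypothetical. The sketch computes the Shannon mutual information between codon and amino acid, assuming uniform codon usage unless other frequencies are supplied.

```python
import math

# Hypothetical toy channels: P(amino acid | codon). The probabilities are
# illustrative assumptions, not measured translation error rates.
precise_code = {
    "UUU": {"Phe": 0.999, "Leu": 0.001},
    "CUU": {"Leu": 0.999, "Phe": 0.001},
}
primitive_code = {
    "UUU": {"Phe": 0.6, "Leu": 0.4},
    "CUU": {"Leu": 0.6, "Phe": 0.4},
}


def entropy(dist):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def mutual_information(channel, codon_freqs=None):
    """I(codon; amino acid) in bits; codon usage is uniform by default."""
    codons = list(channel)
    if codon_freqs is None:
        codon_freqs = {c: 1.0 / len(codons) for c in codons}
    # Marginal distribution of amino acids in the resulting protein.
    marginal = {}
    for c in codons:
        for aa, p in channel[c].items():
            marginal[aa] = marginal.get(aa, 0.0) + codon_freqs[c] * p
    # Average uncertainty about the amino acid once the codon is known.
    noise = sum(codon_freqs[c] * entropy(channel[c]) for c in codons)
    return entropy(marginal) - noise


print("precise code:  ", mutual_information(precise_code), "bits")
print("primitive code:", mutual_information(primitive_code), "bits")
```

In this toy setting the skewed channel transmits close to the one-bit maximum available from two equally frequent codons, whereas the flatter channel transmits very little, which is one way of phrasing the first question quantitatively.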
There are additional complexities, such as the cotranslational insertion of the “21st” amino acid selenocysteine (which requires a specific stem-loop secondary structure in the mRNA downstream from the insertion site (Heider et al. 1992)), and the effect of the codon context on misreading and alternative translation events at particular locations, but they are peripheral to the above questions.

The Translational Repertoire: Why These Amino Acids?

Although over 400 different amino acids are found in cells, only 21 of these (including selenocysteine) are cotranslationally incorporated into proteins. Interestingly, the set of amino acids consistently produced in prebiotic settings (spark-tube and HCN-condensation experiments, and extraterrestrial sources) only partially overlaps the set found in the genetic code (Weber and Miller 1981). In particular, norleucine, norvaline, alpha-aminobutyric acid, and pipecolic acid are all produced readily and are very similar to amino acids that are actually present in the code. Similarly, homocysteine, ornithine, citrulline, and L-DOPA are all important metabolic intermediates, yet none has been incorporated into the genetic code. Why should such a small subset be chosen, and is there any rationale for the choices?

One possibility is that this set of amino acids allows unusually error-resistant codes to be produced (Knight et al. 1999). The genetic code appears highly adaptive, in that it minimizes error relative to other possible genetic codes that permute the same amino acids among the same codon blocks: depending on the criteria used, the genetic code may be nearly the best of all possible codes (Freeland et al. 2000). Determining the error value (the average effect of a point misreading) of a given code requires a measure of the distance between two amino acids. For protein-coding amino acids, this may be a prior measurement (such as the polar requirement (Woese et al. 1966) of the free amino acids), or a posterior measurement of how often particular substitutions are acceptable in proteins (for example, PAM 74-100 (Benner et al. 1994)). Interestingly, although the code looks very good with both prior and posterior measurements, these measurements are not highly correlated with one another.

We are presently measuring polar requirements for non-protein amino acids using thin-layer chromatography, and plan to test the effect of incorporating these nonstandard amino acids into the code (R.D. Knight, S.J. Freeland, and L.F. Landweber, unpublished data). However, the procedure is very time-consuming: if these values could be predicted by modeling the amino acids in silico, it would be possible to estimate the effects of adding a far greater range of amino acids to the code.

The Earliest Codes: RNA-Amino Acid Interactions?

Although the idea that nucleic acids might directly associate with amino acids is not new (Gamow 1954), only in the last decade has chemical evidence started to indicate that even modern genetic codes might have been influenced by such interactions (Yarus and Christian 1989). The technique of in vitro selection, or SELEX (Tuerk and Gold 1990, Ellington and Szostak 1990), allows amplification of functional nucleic acid molecules from random sequences. Several aptamers (nucleic acids that bind specific targets) to
different amino acids have now been isolated and characterized, giving a preliminary survey of the types of RNA molecules that can bind amino acids. For each nucleotide in an aptamer, it is possible to decide (by inspection) whether it is in a specified codon in some reading frame, and to decide (by chemical mapping) whether it is directly involved in binding the amino acid. If the binding sites for particular amino acids were disproportionately composed of the cognate codons, a standard test for independence (chi-squared or G) should give a highly significant result (Knight and Landweber 1998). In fact, this is the case, both for the arginine aptamers considered by themselves (Knight and Landweber 2000) and for the three amino acids, Arg, Val, and Tyr, for which structural data were available at the start of this year (Yarus 2000). This suggests that intrinsic biases in trinucleotide preferences at binding sites could have influenced the modern genetic code.

Could these initial statistical influences have evolved into the modern coding system? By analogy to the evolution of language (Nowak and Krakauer 1999), it seems intuitive that even a small initial bias linking particular trinucleotides to particular amino acids could be amplified into strong, specific preferences if it were advantageous to identify amino acids. In fact, it is possible that the genetic code predated peptide synthesis, specifying amino acids as cofactors to ribozymes: this coding system would only later be used to join the amino acids together. Alternatively, the genetic code might be an example of a phase transition in which information crystallizes in such a way that variation in only a few elements in the system causes variation in the outcome (Kauffman 1993).
In this case, the set of proteins that a cell produces could be considered its phenotype: in the absence of a specific coding system, an autocatalytic set might produce different peptides depending on the ratios of its constituents and on the surrounding conditions. By concentrating the information about proteins into a few messenger molecules that could vary in defined ways to produce different messages, a lineage of cells might better be able to explore the space of possible phenotypes, especially since such a coding system would be ideally suited to co-option as an unlimited heredity mechanism (Maynard Smith and Szathmary 1995). This topic should be a fruitful area for future modeling studies.

One controversial point is whether the information embodied in RNA/amino acid interactions could have been transmitted to modern translation systems. Modern tRNA molecules, which carry the amino acid to the ribosome and are responsible for specifically pairing it to its cognate codon, are all related to one another, presumably because they duplicated and diverged from an ancestral ur-tRNA. Thus, it has been argued that at most one codon/binding site interaction could have been incorporated into the genetic code, since other codons would probably not function in the same, highly specialized sequence context (Ellington et al. 2000). Interestingly, tRNAs contain the anticodons, which are the trinucleotides that are complementary to those that seem to be associated with binding sites. Consequently, it seems more likely that tRNAs always acted as acceptor molecules, and were originally charged with ribozymes that bound amino acids noncovalently using codon-rich motifs and covalently attached these amino acids to the primitive tRNAs. If the only consistent feature that diverse amino acid-binding tRNAs shared was the presence of an appropriate codon, the primitive tRNA
(already specialized as an adaptor) could duplicate and diverge to give a family of tRNAs recognized by their triplet anticodon motifs (Knight and Landweber 2000). These tRNAs could then maintain their role as adaptors even after protein aminoacyl-tRNA synthetases displaced the ribozymes that previously fulfilled the same role. Interestingly, RNA can catalyze the ligation of amino acids to RNA faster and more specifically than do the actual synthetases used by modern cells (Illangasekare and Yarus 1999). However, the pathway by which primitive interactions between RNA and amino acids evolved into a genetic code remains speculative.

Expansion of the Code: Modern Life and Beyond

Since only about half of the amino acids in the code are plausible prebiotic molecules (Weber and Miller 1981), it seems likely that the code expanded from a simpler form. One particularly important question is when, and in what order, amino acids were added to the code. One proposal is that amino acids entered the code on the timescale of the duplication and divergence of the aminoacyl-tRNA synthetases, although the synthetase phylogeny does not seem to support a division between early and late amino acids on any chemical basis (Nagel and Doolittle 1995). It is also possible that some amino acids were invented relatively late, so that enzymes predating their introduction (in particular, the enzymes that catalyze their formation from a metabolic precursor) should be impoverished in the later amino acids at their active sites (S.J. Freeland, pers. comm.).

Another fascinating question is whether the genetic code is still expanding. Although variant genetic codes have been found in many lineages (Osawa 1995), these are all recent descendants of the code found in the last common ancestor, and all use the standard complement of amino acids.
In particular, selenocysteine may be a recent addition to the code, since it shares a codon with termination and requires a special secondary structure for insertion; however, it may instead be an early entry, displaced by Cys and/or Ser when made unstable by the oxygenation of the atmosphere. Similarly, Asn and Gln may be relatively late additions (Wong 1975), since in some species they are formed by the amidation of Asp or Glu on the tRNA. However, the pathways can coexist in a single organism (Becker and Kern 1998), and so their presence or absence may have a physiological, rather than an evolutionary, basis.

Today, the genetic code can even be changed experimentally. The first step in this direction involved breeding bacteria to use 4-fluorotryptophan constitutively in place of Trp (Wong 1983). More recently, additional amino acids have been incorporated by introducing a 65th codon-anticodon pair built from unnatural bases (Bain et al. 1992), by using chemically charged tRNAs that bind either termination codons (Mendel et al. 1995) or an artificial four-base codon rarely used in the organism (Hohsaka et al. 1999), by selectively degrading specific tRNAs and replacing them with chemically charged alternatives (Kanda et al. 2000), and by using an aminoacyl-tRNA synthetase that recognizes the suppressor tRNA but not the wild-type tRNA and incorporates the novel amino acid in vivo (Furter 1998). This last approach is by far the most desirable, since the charging reaction can be performed in the cell itself instead of on isolated tRNA. This experimental expansion of the genetic code should provide new insight into the evolution of the canonical amino acid set.
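The suppressor-tRNA route to an expanded code can be illustrated with a minimal sketch, assuming an idealized translation system in which every codon is read without error. The standard code is written in the conventional UCAG ordering; the reassignment of the amber codon UAG to a novel amino acid, the one-letter symbol "X", the example message, and the function name are all hypothetical choices made for illustration. Real suppression competes with release factors and is never complete, which this sketch ignores.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional UCAG ordering; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna, code):
    """Translate an mRNA codon by codon until a stop codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        residue = code[mrna[i:i + 3]]
        if residue == "*":
            break
        peptide.append(residue)
    return "".join(peptide)


# Hypothetical expanded code: the amber stop codon UAG is reassigned to a
# novel amino acid, written here as "X" (an assumption for illustration).
EXPANDED_CODE = dict(STANDARD_CODE, UAG="X")

message = "AUGUUUUAGGGCUAA"  # hypothetical message: AUG UUU UAG GGC UAA
print(translate(message, STANDARD_CODE))
print(translate(message, EXPANDED_CODE))
```

Under the standard table the amber codon terminates the peptide, whereas under the expanded table it is read through as the novel residue and translation continues to the next stop codon.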
References

Oyama S (1985). The ontogeny of information: developmental systems and evolution. Cambridge University Press, New York.
Shannon CE (1948). A mathematical theory of communication. Bell System Technical Journal. 27:379-423, 623-656.
Parker J (1989). Errors and alternatives in reading the universal genetic code. Microbiol Rev. 53:273-98.
Heider J, Baron C, Bock A (1992). Coding from a distance: dissection of the mRNA determinants required for the incorporation of selenocysteine into protein. EMBO J. 11:3759-66.
Weber AL, Miller SL (1981). Reasons for the occurrence of the twenty coded protein amino acids. J Mol Evol. 17(5):273-84.
Knight RD, Freeland SJ, Landweber LF (1999). Selection, history and chemistry: the three faces of the genetic code. Trends Biochem Sci. 24(6):241-7.
Freeland SJ, Knight RD, Landweber LF, Hurst LD (2000). Early fixation of an optimal genetic code. Mol Biol Evol. 17(4):511-8.
Woese CR, Dugre DH, Saxinger WC, Dugre SA (1966). The molecular basis for the genetic code. Proc Natl Acad Sci U S A. 55(4):966-74.
Benner SA, Cohen MA, Gonnet GH (1994). Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng. 7(11):1323-32.
Gamow G (1954). Possible relation between deoxyribonucleic acid and protein structures. Nature. 173:318.
Yarus M, Christian EL (1989). Genetic code origins. Nature. 342(6248):349-50.
Tuerk C, Gold L (1990). Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science. 249(4968):505-10.
Ellington AD, Szostak JW (1990). In vitro selection of RNA molecules that bind specific ligands. Nature. 346(6287):818-22.
Knight RD, Landweber LF (1998). Rhyme or reason: RNA-arginine interactions and the genetic code. Chem Biol. 5(9):R215-20.
Knight RD, Landweber LF (2000). Guilt by association: the arginine case revisited. RNA. 6(4):499-510.
Yarus M (2000). RNA-ligand chemistry: a testable source for the genetic code. RNA. 6(4):475-84.
Nowak MA, Krakauer DC (1999). The evolution of language. Proc Natl Acad Sci U S A. 96(14):8028-33.
Kauffman SA (1993). The origins of order: self organization and selection in evolution. Oxford University Press, New York.
Maynard Smith J, Szathmary E (1995). The major transitions in evolution. W.H. Freeman Spektrum, Oxford; New York.
Ellington AD, Khrapov M, Shaw CA (2000). The scene of a frozen accident. RNA. 6(4):485-98.
Illangasekare M, Yarus M (1999). Specific, rapid synthesis of Phe-RNA by RNA. Proc Natl Acad Sci U S A. 96(10):5470-5.
Nagel GM, Doolittle RF (1995). Phylogenetic analysis of the aminoacyl-tRNA synthetases. J Mol Evol. 40(5):487-98.
Osawa S (1995). Evolution of the genetic code. Oxford University Press, Oxford; New York.
Wong JT (1975). A co-evolution theory of the genetic code. Proc Natl Acad Sci U S A. 72(5):1909-12.
Becker HD, Kern D (1998). Thermus thermophilus: a link in evolution of the tRNA-dependent amino acid amidation pathways. Proc Natl Acad Sci U S A. 95(22):12832-7.
Wong JT (1983). Membership mutation of the genetic code: loss of fitness by tryptophan. Proc Natl Acad Sci U S A. 80(20):6303-6.
Bain JD, Switzer C, Chamberlin AR, Benner SA (1992). Ribosome-mediated incorporation of a non-standard amino acid into a peptide through expansion of the genetic code. Nature. 356(6369):537-9.
Mendel D, Cornish VW, Schultz PG (1995). Site-directed mutagenesis with an expanded genetic code. Annu Rev Biophys Biomol Struct. 24:435-62.
Hohsaka T, Ashizuka Y, Sisido M (1999). Incorporation of two nonnatural amino acids into proteins through extension of the genetic code. Nucleic Acids Symp Ser. (42):79-80.
Kanda T, Takai K, Hohsaka T, Sisido M, Takaku H (2000). Sense codon-dependent introduction of unnatural amino acids into multiple sites of a protein. Biochem Biophys Res Commun. 270(3):1136-9.
Furter R (1998). Expansion of the genetic code: site-directed p-fluoro-phenylalanine incorporation in Escherichia coli. Protein Sci. 7(2):419-26.