From Chemistry to Information: Emergence of the Genetic Code
Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ
The genetic code translates codons into amino acids, allowing coded peptide synthesis
and forming the core of modern biology. Although the modern system is extremely
precise, primitive systems probably encoded fewer amino acids with much less precision.
This paper provides an overview of recent progress in understanding the origin and
subsequent evolution of the genetic code. It also outlines several of the remaining
puzzles, and suggests several approaches that may be of use in answering some of them.
Although “information” is widely misused as a metaphor in biology (Oyama 1985), the
genetic code is an evolved system that encodes information in the strict, mathematical
sense (Shannon 1948). The presence of a codon at a given position in an mRNA allows us
to predict, with high precision, the amino acid present at the corresponding site in the
resulting protein: the mRNA acts as the source, the protein acts as the receiver, and the
translation apparatus (ribosome, aminoacyl-tRNA synthetases, tRNA, release factors)
provides the channel conditions. Because the genetic code is constant in most organisms,
it is easy to forget that these channel conditions are not immutable, but are themselves
products of evolution.
The textbook picture is that the 64 codons are distributed among the 20 amino acids with
varying degrees of degeneracy, such that each codon specifies precisely one amino acid
(or acts as a stop codon) but most amino acids are specified by more than one codon.
However, the actual situation is more complex. It is impossible to transmit information
with no error whatsoever, and the pattern of translation error (reviewed by Parker 1989)
produces a frequency distribution of amino acids for each codon, rather than a single
amino acid for each codon. Thus there are two fundamental questions concerning the
evolution of the genetic code:
1. How did the set of frequency distributions that comprises the genetic code
become skewed so heavily to one amino acid per codon?
2. Given the one amino acid per codon rule, how did codons come to have their
There are additional complexities, such as the cotranslational insertion of the “21st”
amino acid selenocysteine (which requires a specific stem-loop secondary structure in the
mRNA upstream from the insertion site (Heider et al. 1992)), and the effect of the codon
context on misreading and alternative translation events at particular locations, but they
are peripheral to the above questions.
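The contrast between the textbook picture and the frequency-distribution view described above can be made concrete with a short sketch. It builds the standard codon table, counts the degeneracy of each amino acid, and measures the uncertainty (in bits) of the amino acid that a single codon yields. The misreading frequencies assigned to UUU are hypothetical and chosen only for illustration; they are not taken from Parker (1989).

```python
from collections import Counter
from itertools import product
from math import log2

BASES = "UCAG"
# Standard genetic code in UCAG order (third position varies fastest); "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Textbook picture: one outcome per codon, several codons per amino acid.
degeneracy = Counter(STANDARD_CODE.values())


def entropy_bits(distribution):
    """Shannon entropy (bits) of a {outcome: probability} distribution."""
    return sum(p * log2(1.0 / p) for p in distribution.values() if p > 0)


# Frequency-distribution picture: each codon yields a distribution of amino acids.
ideal_uuu = {"F": 1.0}                 # error-free translation
noisy_uuu = {"F": 0.999, "L": 0.001}   # HYPOTHETICAL misreading frequencies

if __name__ == "__main__":
    for aa, n in sorted(degeneracy.items(), key=lambda item: -item[1]):
        print(f"{aa}: {n} codon(s)")
    print("entropy of ideal UUU:", entropy_bits(ideal_uuu), "bits")
    print("entropy of noisy UUU:", entropy_bits(noisy_uuu), "bits")
```

In these terms, the first question above asks how the per-codon entropy came to be so close to zero, and the second asks how the particular assignments in `STANDARD_CODE` arose.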
The Translational Repertoire: Why These Amino Acids?
Although over 400 different amino acids are found in cells, only 21 of these (including
selenocysteine) are cotranslationally incorporated into proteins. Interestingly, the set of
amino acids consistently produced in prebiotic settings (spark-tube and HCN-
condensation experiments, and extraterrestrial sources) only partially overlaps the set
found in the genetic code (Weber and Miller 1981). In particular, norleucine, norvaline,
alpha-aminobutyric acid, and pipecolic acid are all produced readily and are very similar
to amino acids that are actually present in the code. Similarly, homocysteine, ornithine,
citrulline, and L-DOPA are all important metabolic intermediates, yet none has been
incorporated into the genetic code. Why should such a small subset be chosen, and is
there any rationale for the choices?
One possibility is that this set of amino acids allows unusually error-resistant codes to be
produced (Knight et al. 1999). The genetic code appears highly adaptive, in that it
minimizes error relative to other possible genetic codes that permute the same amino
acids among the same codon blocks: depending on the criteria used, the genetic code may
be nearly the best of all possible codes (Freeland et al. 2000). Determining the error value
(the average effect of a point misreading) of a given code requires a measure of the
distance between two amino acids. For protein-coding amino acids, this may be a prior
measurement (such as the polar requirement (Woese et al. 1966) of the free amino acids),
or a posterior measurement of how often particular substitutions are acceptable in
proteins (for example, PAM 74-100 (Benner et al. 1994)). Interestingly, although the
code looks very good with both prior and posterior measurements, these measurements
are not highly correlated with one another. We are presently measuring polar
requirements for non-protein amino acids using thin-layer chromatography, and plan to
test the effect of incorporating these nonstandard amino acids into the code (R.D. Knight,
S.J. Freeland, and L.F. Landweber, unpublished data). However, the procedure is very
time-consuming: if it were possible to predict these values by modeling the amino acids
in silico, it would be possible to estimate the effects of adding a far greater range of
amino acids to the code.
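The error-value comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: the distance between two amino acids is taken to be the squared difference in polar requirement, all single-base misreadings are weighted equally, and changes to or from stop codons are ignored. The polar requirement values are given as commonly tabulated and should be checked against Woese et al. (1966) before being relied upon. Alternative codes are generated by permuting the 20 amino acids among the existing codon blocks.

```python
import random
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# ASSUMPTION: polar requirement values as commonly tabulated; verify before use.
POLAR_REQUIREMENT = {
    "A": 7.0, "R": 9.1, "N": 10.0, "D": 13.0, "C": 4.8,
    "Q": 8.6, "E": 12.5, "G": 7.9, "H": 8.4, "I": 4.9,
    "L": 4.9, "K": 10.1, "M": 5.3, "F": 5.0, "P": 6.6,
    "S": 7.5, "T": 6.6, "W": 5.2, "Y": 5.4, "V": 5.6,
}


def error_value(code, measure=POLAR_REQUIREMENT):
    """Mean squared change in `measure` over all single-base misreadings
    that turn one sense codon into another sense codon."""
    total, count = 0.0, 0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                neighbour = codon[:pos] + base + codon[pos + 1:]
                other = code[neighbour]
                if other == "*":
                    continue
                total += (measure[aa] - measure[other]) ** 2
                count += 1
    return total / count


def permuted_code(rng):
    """Permute the 20 amino acids among the standard codon blocks."""
    amino_acids = sorted(POLAR_REQUIREMENT)
    shuffled = amino_acids[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(amino_acids, shuffled))
    relabel["*"] = "*"  # stop codons stay where they are
    return {codon: relabel[aa] for codon, aa in STANDARD_CODE.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
standard_error = error_value(STANDARD_CODE)
random_errors = [error_value(permuted_code(rng)) for _ in range(1000)]
fraction_better = sum(e < standard_error for e in random_errors) / len(random_errors)

if __name__ == "__main__":
    print("error value of the standard code:", round(standard_error, 2))
    print("mean error value of permuted codes:",
          round(sum(random_errors) / len(random_errors), 2))
    print("fraction of permuted codes that do better:", fraction_better)
```

Because `error_value` accepts any mapping from amino acid to a numerical property, the same function could be reused with a different measure, or with an extended table that includes nonstandard amino acids once their values are available.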
The Earliest Codes: RNA-Amino Acid Interactions?
Although the idea that nucleic acids might directly associate with amino acids is not new
(Gamow 1954), only in the last decade has chemical evidence started to indicate that
even modern genetic codes might have been influenced by such interactions (Yarus and
Christian 1989). The technique of in vitro selection, or SELEX (Tuerk and Gold 1990,
Ellington and Szostak 1990), allows amplification of functional nucleic acid molecules
from random sequences. Several aptamers (nucleic acids that bind specific targets) to
different amino acids have now been isolated and characterized, giving a preliminary
survey of the types of RNA molecules that can bind amino acids.
For each nucleotide in an aptamer, it is possible to decide (by inspection) whether it is in
a specified codon in some reading frame, and to decide (by chemical mapping) whether it
is directly involved in binding the amino acid. If the binding sites for particular amino
acids were disproportionately composed of the cognate codons, a standard test for
independence (chi-squared or G) should give a highly significant result (Knight and
Landweber 1998). In fact, this is the case, both for the arginine aptamers considered by
themselves (Knight and Landweber 2000) and for the three amino acids, Arg, Val, and
Tyr, for which structural data were available at the start of this year (Yarus 2000). This
suggests that intrinsic biases in trinucleotide preferences at binding sites could have
influenced the modern genetic code.
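The independence test described above amounts to a 2 x 2 contingency table: each nucleotide is classified by whether it lies in the binding site and by whether it lies in a cognate codon. The sketch below computes the G statistic and its P value for one degree of freedom using only the standard library. The counts in the example are hypothetical and are not taken from any of the cited aptamer studies.

```python
from math import erfc, log, sqrt


def g_test_2x2(table):
    """G-test of independence for a 2 x 2 table of counts.

    table = [[in_site_in_codon, in_site_not_codon],
             [off_site_in_codon, off_site_not_codon]]
    Returns (G, P), where P is the chi-squared tail probability with 1 df.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # a zero cell contributes nothing to G
                g += 2.0 * observed * log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # For 1 degree of freedom, P(chi2 > g) = erfc(sqrt(g / 2)).
    p_value = erfc(sqrt(g / 2.0))
    return g, p_value


if __name__ == "__main__":
    # HYPOTHETICAL counts, for illustration only.
    counts = [[30, 10],   # binding-site nucleotides: in codon, not in codon
              [20, 60]]   # other nucleotides:        in codon, not in codon
    g, p = g_test_2x2(counts)
    print(f"G = {g:.2f}, P = {p:.2e}")
```

A table in which codon membership is distributed identically inside and outside the binding site gives G = 0; the more the cognate codons are concentrated in the binding site, the larger G becomes.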
Could these initial statistical influences have evolved into the modern coding
system? By analogy to the evolution of language (Nowak and Krakauer 1999), it seems
intuitive that even a small initial bias linking particular trinucleotides to particular amino
acids could be amplified into strong, specific preferences if it were advantageous to
identify amino acids. In fact, it is possible that the genetic code predated peptide
synthesis, specifying amino acids as cofactors to ribozymes: this coding system would
only later be used to join the amino acids together. Alternatively, the genetic code might
be an example of a phase transition in which information crystallizes in such a way that
variation in only a few elements in the system causes variation in the outcome (Kauffman
1993). In this case, the set of proteins that a cell produces could be considered its
phenotype: in the absence of a specific coding system, an autocatalytic set might produce
different peptides depending on the ratios of its constituents and on the surrounding
conditions. By concentrating the information about proteins into a few messenger
molecules that could vary in defined ways to produce different messages, a lineage of
cells might better be able to explore the space of possible phenotypes, especially since
such a coding system would be ideally suited to co-option as an unlimited heredity
mechanism (Maynard Smith and Szathmary 1995). This topic should be a fruitful area for
future modeling studies.
One controversial point is whether the information embodied in RNA/amino acid
interactions could have been transmitted to modern translation systems. Modern tRNA
molecules, which carry the amino acid to the ribosome and are responsible for
specifically pairing it to its cognate codon, are all related to one another, presumably
because they duplicated and diverged from an ancestral ur-tRNA. Thus, it has been
argued that at most one codon/binding site interaction could have been incorporated into
the genetic code, since other codons would probably not function in the same, highly
specialized sequence context (Ellington et al. 2000). Interestingly, tRNAs contain the
anticodons, which are the trinucleotides that are complementary to those that seem to be
associated with binding sites. Consequently, it seems more likely that tRNAs always
acted as acceptor molecules, and were originally charged with ribozymes that bound
amino acids noncovalently using codon-rich motifs and covalently attached these amino
acids to the primitive tRNAs. If the only consistent feature that diverse amino acid-
binding tRNAs shared was the presence of an appropriate codon, the primitive tRNA
(already specialized as an adaptor) could duplicate and diverge to give a family of tRNAs
recognized by their triplet anticodon motifs (Knight and Landweber 2000). These tRNAs
could then maintain their role as adaptors even after protein aminoacyl-tRNA synthetases
displaced the ribozymes that previously fulfilled the same role. Interestingly, RNA can
catalyze the ligation of amino acids to RNA faster and more specifically than do the
actual synthetases used by modern cells (Illangasekare and Yarus 1999). However, the
pathway by which primitive interactions between RNA and amino acids evolved into a
genetic code remains speculative.
Expansion of the Code: Modern Life and Beyond
Since only about half of the amino acids in the code are plausible prebiotic molecules
(Weber and Miller 1981), it seems likely that the code expanded from a simpler form.
One particularly important question is when amino acids were added to
the code. One proposal is that amino acids entered the code on the timescale of the
duplication and divergence of the aminoacyl-tRNA synthetases, although the synthetase
phylogeny does not seem to support a division between early and late amino acids on any
chemical basis (Nagel and Doolittle 1995). It is also possible that some amino acids were
invented relatively late, so that enzymes predating their introduction (in particular, the
enzymes that catalyze their formation from a metabolic precursor) should be
impoverished in the later amino acids at their active sites (S.J. Freeland, pers. comm.).
Another fascinating question is whether the genetic code is still expanding. Although
variant genetic codes have been found in many lineages (Osawa 1995), these are all
recent descendants of the code found in the last common ancestor, and all use the
standard complement of amino acids. In particular, selenocysteine may be a recent
addition to the code, since it shares a codon with termination and requires a special
secondary structure for insertion: however, it may instead be an early entry, displaced by
Cys and/or Ser when made unstable by the oxygenation of the atmosphere. Similarly,
Asn and Gln may be relatively late additions (Wong 1975), since in some species they are
formed by the amidation of Asp or Glu on the tRNA. However, the pathways can coexist
in a single organism (Becker and Kern 1998), and so their presence or absence may have
a physiological, rather than an evolutionary basis.
Today, the genetic code can even be changed experimentally. The first step in this
direction involved breeding bacteria to use 4-fluorotryptophan constitutively in place of
Trp (Wong 1983). More recently, additional amino acids have been incorporated by using
a 65th codon-anticodon pair built from unnatural bases (Bain et al. 1992), by using chemically
charged tRNAs that bind either termination codons (Mendel et al. 1995) or an artificial
four-base codon rarely used in the organism (Hohsaka et al. 1999), by selectively
degrading specific tRNAs and replacing them with chemically charged alternatives
(Kanda et al. 2000), and by using an aminoacyl-tRNA synthetase that recognizes the
suppressor tRNA but not the wild-type tRNA and incorporates the novel amino acid in
vivo (Furter 1998). This last approach is by far the most desirable, since the
charging reaction can be performed in the cell itself instead of on isolated tRNA. This
experimental expansion of the genetic code should provide new insight into the evolution
of the canonical amino acid set.
References
Oyama S The ontogeny of information: developmental systems and evolution. New York:
Cambridge University Press, 1985.
Shannon CE A mathematical theory of communication. Bell System Technical Journal.
1948;27:379-423, 623-656.
Parker J Errors and alternatives in reading the universal genetic code. Microbiol Rev.
Heider J, Baron C, Bock A Coding from a distance: dissection of the mRNA
determinants required for the incorporation of selenocysteine into protein. EMBO J. 1992
Weber AL, Miller SL Reasons for the occurrence of the twenty coded protein
amino acids. J Mol Evol. 1981;17(5):273-84.
Knight RD, Freeland SJ, Landweber LF Selection, history and chemistry: the three faces
of the genetic code. Trends Biochem Sci. 1999 Jun;24(6):241-7.
Freeland SJ, Knight RD, Landweber LF, Hurst LD Early fixation of an optimal genetic
code. Mol Biol Evol. 2000 Apr;17(4):511-8.
Woese CR, Dugre DH, Saxinger WC, Dugre SA The molecular basis for the genetic
code. Proc Natl Acad Sci U S A. 1966 Apr;55(4):966-74.
Benner SA, Cohen MA, Gonnet GH Amino acid substitution during functionally
constrained divergent evolution of protein sequences. Protein Eng. 1994
Gamow G Possible relation between deoxyribonucleic acid and protein structures.
Nature. 1954 173:318.
Yarus M, Christian EL Genetic code origins. Nature. 1989 Nov 23;342(6248):349-50.
Tuerk C, Gold L Systematic evolution of ligands by exponential enrichment: RNA
ligands to bacteriophage T4 DNA polymerase. Science. 1990 Aug 3;249(4968):505-10.
Ellington AD, Szostak JW In vitro selection of RNA molecules that bind specific ligands.
Nature. 1990 Aug 30;346(6287):818-22.
Knight RD, Landweber LF Rhyme or reason: RNA-arginine interactions and the genetic
code. Chem Biol. 1998 Sep;5(9):R215-20.
Knight RD, Landweber LF Guilt by association: the arginine case revisited. RNA. 2000
Yarus M RNA-ligand chemistry: a testable source for the genetic code. RNA. 2000
Nowak MA, Krakauer DC The evolution of language. Proc Natl Acad Sci U S A. 1999
Kauffman SA The origins of order: self organization and selection in evolution. New
York: Oxford University Press, 1993.
Maynard Smith J, Szathmary E The major transitions in evolution. Oxford; New York:
W.H. Freeman Spektrum, 1995.
Ellington AD, Khrapov M, Shaw CA The scene of a frozen accident. RNA. 2000
Illangasekare M, Yarus M Specific, rapid synthesis of Phe-RNA by RNA. Proc Natl
Acad Sci U S A. 1999 May 11;96(10):5470-5.
Nagel GM, Doolittle RF Phylogenetic analysis of the aminoacyl-tRNA synthetases. J
Mol Evol. 1995 May;40(5):487-98.
Osawa S Evolution of the genetic code. Oxford; New York: Oxford University Press, 1995.
Wong JT A co-evolution theory of the genetic code. Proc Natl Acad Sci U S A. 1975
Becker HD, Kern D Thermus thermophilus: a link in evolution of the tRNA-dependent
amino acid amidation pathways. Proc Natl Acad Sci U S A. 1998 Oct 27;95(22):12832-7.
Wong JT Membership mutation of the genetic code: loss of fitness by tryptophan. Proc
Natl Acad Sci U S A. 1983 Oct;80(20):6303-6.
Bain JD, Switzer C, Chamberlin AR, Benner SA Ribosome-mediated incorporation of a
non-standard amino acid into a peptide through expansion of the genetic code. Nature.
1992 Apr 9;356(6369):537-9.
Mendel D, Cornish VW, Schultz PG Site-directed mutagenesis with an expanded genetic
code. Annu Rev Biophys Biomol Struct. 1995;24:435-62.
Hohsaka T, Ashizuka Y, Sisido M Incorporation of two nonnatural amino acids into
proteins through extension of the genetic code. Nucleic Acids Symp Ser. 1999;
Kanda T, Takai K, Hohsaka T, Sisido M, Takaku H Sense codon-dependent introduction
of unnatural amino acids into multiple sites of a protein. Biochem Biophys Res Commun.
2000 Apr 21;270(3):1136-9.
Furter R Expansion of the genetic code: site-directed p-fluoro-phenylalanine
incorporation in Escherichia coli. Protein Sci. 1998 Feb;7(2):419-26.