maximum score was 81% for an archaeal sequence and the minimum was 10% for a eucaryal sequence. In general, the high-to-low trend in score and the mean score of each group (in parentheses) was: 16S Archaea (69%) > Eubacteria (55%) > chloroplast (48%) > mitochondria (31%) = Eucarya (30%). These results were very interesting because they showed that, in some cases, the thermodynamics-based method of Zuker and Turner could successfully elucidate the biologically relevant form of a large RNA sequence such as 16S rRNA, which is about 1500 bases in length. (We believe that a score of 46% is meaningful in the context of a 10% minimum score and an 81% maximum score.)

In the present paper, we continue this investigation of the applicability of the Zuker–Turner method by presenting the outcome of the folding of 72 phylogenetically and structurally distinct 23S rRNA sequences (see Table 1) by this method (see Materials and methods). Beyond this, we identify several properties of the comparatively derived secondary structures that serve as semiquantitative predictors of the accuracy of the secondary structure predictions of the Zuker–Turner method. In the Results section, we start by summarizing the vast amount of information obtained from the 72 23S rRNA foldings by working with average scores, and we also take this opportunity to compare and contrast these results with our previous findings on 16S rRNA. The conclusions drawn here represent a set of empirical observations on the performance of the Zuker–Turner method. We then switch from using average scores to looking at individual scores as a function of several properties of the comparatively derived secondary structures; in this way, we are able to identify the predictors. The predictors are the % of noncanonical base pairs, the % of hairpin loops that are stable tetraloops, and the sequence %G+C.
Here we keep our speculation on the causality of the predictive relationship to a minimum; the important point is that we demonstrate that this complex problem does indeed have ‘handles’ with which it can be understood. In the Discussion, we assume that the biologically relevant secondary structure of rRNA is in fact the structure of global minimum free energy. By doing this, we provide a framework within which we can speculate on the biological meaning of the results.

420 Folding & Design Vol 1 No 6

Table 1. Selection of 23S rRNA.

Archaea (8)
  Euryarchaeota: Halobacterium marismortui, Halococcus morrhuae, Methanobacterium thermoautotrophicum, Methanococcus vannielii, Thermococcus celer, Thermoplasma acidophilum
  Crenarchaeota: Sulfolobus solfataricus, Thermoproteus tenax

(Eu)bacteria (13)
  Thermotoga: Thermotoga maritima
  Deinococcus + relatives: Thermus thermophilus
  Spirochaetes + relatives: Borrelia burgdorferi
  Purple bacteria: Campylobacter coli, Escherichia coli, Pseudomonas aeruginosa, Pseudomonas cepacia, Rhodobacter sphaeroides (A)
  Cyanobacteria: Anacystis nidulans (new name = Synechococcus sp. 6301)
  Gram-positive: Bacillus subtilis, Frankia sp., Mycobacterium leprae, Streptomyces ambofaciens

Chloroplast (10)
  Protista: Astasia longa, Chlamydomonas eugametos, Chlamydomonas reinhardtii, Chlorella ellipsoidea, Euglena gracilis, Palmaria palmata
  Plantae: Alnus incana, Marchantia polymorpha, Nicotiana tabacum, Zea mays

Mitochondria (18)
  Protista: Acanthamoeba castellanii, Chondrus crispus, Dictyostelium discoideum, Paramecium tetraurelia, Prototheca wickerhamii, Tetrahymena pyriformis (R)
  Fungi: Penicillium chrysogenum, Saccharomyces cerevisiae, Schizosaccharomyces pombe
  Plantae: Marchantia polymorpha, Zea mays
  Animalia: Caenorhabditis elegans, Crossostoma lacustre, Drosophila melanogaster, Gallus gallus, Homo sapiens, Paracentrotus lividus, Xenopus laevis

Eucarya (23)
  Archezoa: Giardia intestinalis, Giardia muris
  Protista: Chlorella ellipsoidea, Crithidia fasciculata, Euglena gracilis, Physarum polycephalum, Phytophthora megasperma, Prorocentrum micans, Tetrahymena thermophila, Toxoplasma gondii-P, Trypanosoma brucei
  Fungi: Cryptococcus neoformans B, Pneumocystis carinii, Saccharomyces cerevisiae
  Animalia: Aedes albopictus, Caenorhabditis elegans, Drosophila melanogaster, Herdmania momus, Homo sapiens, Mus musculus, Xenopus laevis
  Plantae: Arabidopsis thaliana, Oryza sativa

Results

The scoring scheme

For any one rRNA sequence, the Zuker–Turner method returns an energetically optimal folding and a number of energetically suboptimal foldings, and we obtain a separate score for each of these alternative structures. The score for any particular energetically optimal or suboptimal folding is simply the percentage of the canonical base pairs (i.e. AU-, CG-, and GU-containing base pairs) in the sequence’s comparatively derived secondary structure [12,13] that are also present in the folding. For example, the comparatively derived secondary structure of 23S rRNA of Escherichia coli has 801 canonical base pairs. The energetically optimal folding of the E. coli 23S rRNA sequence correctly predicts 466 of these base pairs, yielding a score of 58%. For each of the 72 23S rRNA sequences studied here, we have mapped the base pairs that were correctly predicted in the energetically optimal foldings onto the comparatively derived secondary structures (Materials and methods; see also http://pundit.colorado.edu:8080/root.html).

The definition of score given here is useful because it straightforwardly reflects how much of the secondary structure that the folding algorithm was theoretically able to predict was in fact correctly predicted. The Zuker–Turner method, unlike the comparative approach, does not predict noncanonical base pairs, so excluding noncanonical base pairs from the score allows it to range from 0 to 100%. This definition does not address the larger issue of what is happening to the single-stranded regions, nor does it examine the energetically optimal and suboptimal foldings in the regions where the true secondary structure was not correctly predicted to see what structure, if any, formed instead.
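As a minimal sketch of the base pair based score defined above, the calculation can be written in a few lines of Python. The function name, the representation of a structure as a set of (i, j) position pairs, and the toy hairpin are assumptions made for illustration; this is not the authors' implementation.

```python
# Canonical pair types counted by the score (AU, CG, and GU in either order).
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}


def base_pair_score(sequence, comparative_pairs, predicted_pairs):
    """Return the % of canonical comparative base pairs found in a folding.

    Pairs are (i, j) tuples of 0-based sequence positions with i < j;
    this representation is an assumption of the sketch.
    """
    canonical = {
        (i, j)
        for (i, j) in comparative_pairs
        if sequence[i] + sequence[j] in CANONICAL
    }
    if not canonical:
        raise ValueError("comparative structure has no canonical base pairs")
    correct = canonical & set(predicted_pairs)
    return 100.0 * len(correct) / len(canonical)


# Hypothetical toy hairpin: three canonical pairs and one noncanonical A-G pair.
toy_sequence = "GGAUGAAAAGCC"
toy_comparative = {(0, 11), (1, 10), (2, 9), (3, 8)}
toy_predicted = {(0, 11), (1, 10)}
toy_score = base_pair_score(toy_sequence, toy_comparative, toy_predicted)

# The E. coli example from the text: 466 of 801 canonical pairs.
e_coli_score = 100.0 * 466 / 801
print(f"toy: {toy_score:.1f}%  E. coli: {e_coli_score:.0f}%")
```

In the toy case the noncanonical A-G pair is dropped from the denominator, so a folding that recovers two of the three remaining pairs scores two thirds, which mirrors how excluding noncanonical pairs lets the score span the full 0 to 100% range.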
These issues must eventually be addressed and a score must be defined to suit this need, but these are not in the scope of this paper.

The prediction of RNA secondary structure with the Zuker–Turner method tacitly assumes that the structure of global minimal free energy is in fact the structure of biological relevance. Hence, the scores given in this paper always refer to the score of the energetically optimal foldings computed in the aforementioned base pair based manner, except when we explicitly note that we are working with the scores of the highest scoring energetically suboptimal foldings (i.e. the best suboptimal foldings) or otherwise. Finally, it is not uncommon for secondary structures to be scored against their respective comparatively derived structures by counting the number of helices that were correctly predicted. According to Zuker et al., a helix is defined as a region of double-stranded RNA of at least three base pairs where interruptions of internal or bulge loops of up to two unpaired bases are allowed. A helix is then said to be correctly predicted in the folding when all the base pairs comprising the helix are correctly predicted (in the aforementioned sense) with the exception of at most two base pairs. We used this helix-based definition of score alongside our own base pair based definition of score only in Figure 1.

Summary of the 23S rRNA foldings

General trends in the scores

The 23S rRNA sequence of highest score was the archaeal sequence of Thermococcus celer at 74%, while the lowest scoring 23S rRNA sequence was that of the chloroplast of Astasia longa at 19%. The mean score ± SD over all 72 23S rRNA sequences was 44 ± 11%. Interestingly, this is essentially the same as the score of 46 ± 17% that was found in our previous work over all 56 16S rRNA sequences.
This strongly suggests for 16S and 23S rRNA that sequence length does not relate to the folding score because 23S and 23S-like rRNAs are about twice the length of 16S and 16S-like rRNAs.

Research Paper Analysis of 23S rRNA Fields and Gutell 421

Figure 1. Scoring of the secondary structure predictions of the method of Zuker and Turner [6–9]. The horizontal axis shows the five phylogenetic groups under study: arc, Archaea; eub, Eubacteria; chl, chloroplast; mit, mitochondria; euc, Eucarya. The vertical axis is the group mean score of the energetically optimal foldings of the 23S rRNA sequences. The group mean scores in terms of the base pair based counting method are denoted as ‘bp’ while the group mean scores in terms of the helix-based counting method are denoted as ‘hel’. The maximum and minimum scores are denoted by closed and open circles respectively. The standard deviations are displayed as vertical lines.
The mean score of each group is presented in Figure 1 where the trend in the group mean scores is (mean score in parentheses): 23S Archaea (59%) = Eubacteria (53%) > chloroplast (39%) = mitochondria (38%) = Eucarya (41%). The 23S and 16S rRNA cases are similar in that the sequences of Archaea and Eubacteria are folded more accurately on average by the Zuker–Turner method than are the sequences of mitochondria and Eucarya. The two cases differ in that the group mean scores for 23S Archaea, Eubacteria, and chloroplast are less than their corresponding values in the 16S rRNA case, while in contrast, the group mean scores of 23S mitochondria and Eucarya are larger than their corresponding values in the 16S rRNA case. Taking into account our previous comments on the overall mean scores, one concludes that 16S and 23S rRNA secondary structures are predicted with a similar pattern of accuracy (on average) by the Zuker–Turner method, except that as one moves from working with 16S to 23S rRNA there is a corresponding shift in accuracy away from sequences of Archaea, Eubacteria, and chloroplast to those of mitochondria and Eucarya.

From Figure 1, one can see that the range of prediction success (i.e. maximum minus minimum score) is ∼22% for Archaea and Eubacteria, and ∼35% for chloroplast, mitochondria, and Eucarya. These are slightly smaller than the analogous ranges found for 16S rRNA, which were 28% and 38% respectively. By using instead the scores of the best suboptimal foldings, the 23S rRNA group mean scores increase only 5–8%, but the aforementioned trend in the 23S rRNA group mean score is preserved. Figure 1 also displays the group mean scores of the 23S rRNA foldings in terms of the helix-based scoring scheme, where the helix-based scheme clearly causes no significant change in the results.

Average scores in terms of base pair distances

A base pair can be categorized by the difference in the numerical positions in the primary sequence of its two constituent bases.
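The distance categorization can be sketched in Python as follows, using the 100 nt bin size of Figure 2. The function names, the labeling of each bin by its upper edge, the choice to place a distance of exactly 100 nt in the first bin, and the example pairs are all assumptions made for illustration.

```python
from collections import Counter


def distance_bin(i, j, bin_size=100):
    """Return the upper edge of the bin that holds the distance |j - i|.

    Distances 1..100 map to 100, 101..200 map to 200, and so on
    (an assumed convention for this sketch).
    """
    distance = abs(j - i)
    return ((distance - 1) // bin_size + 1) * bin_size


def distance_distribution(pairs, bin_size=100):
    """Return {bin upper edge: % of base pairs falling in that bin}."""
    counts = Counter(distance_bin(i, j, bin_size) for i, j in pairs)
    total = sum(counts.values())
    return {edge: 100.0 * n / total for edge, n in sorted(counts.items())}


# Hypothetical base pairs given as (i, j) sequence positions.
example_pairs = [(1, 20), (5, 60), (10, 95), (30, 180), (2, 450)]
distribution = distance_distribution(example_pairs)
print(distribution)
```

A score specific to one bin would then be computed by restricting both the comparatively derived pairs and the correctly predicted pairs to that bin before taking the percentage.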
This type of categorization is useful because it allows one to attempt to understand the scores in terms of local versus global base-to-base pairing. The inset of Figure 2 uses bins of size 100 nt to denote the mean distribution of the base pair distances found in the comparatively derived secondary structures of the 13 23S rRNA eubacterial sequences; the analogous distributions for Archaea, chloroplast, mitochondria, and Eucarya give essentially the same pattern. All five phylogenetic groups do have base pairs in the 23S rRNA comparatively derived secondary structures with distances > 600 nt, the largest of these reaching 3300 nt in the mitochondria. But since all these combined make up only ∼2% of the total in each group, they are not shown.

The inset of Figure 2 reveals that the majority (∼75%) of the base pairs in 23S rRNA comparatively derived secondary structures are separated by < 100 nt; these are called ‘short range’ while base pairs separated by > 100 nt are called ‘long range’. Interestingly, this inset is similar to its counterpart in our previous study of 16S rRNA. So, an important point to note is that even though 23S and 23S-like rRNAs are about twice the size of 16S and 16S-like rRNA molecules, which of course gives 23S rRNA the potential to form many more long-range interactions than could be found in 16S rRNA, both rRNA molecules appear to be using short-range base pairs to constitute the bulk of their secondary structure.

Figure 2. The inset shows the distribution of the base pairs according to distance. The horizontal axis shows the classification of the base pairs in the eubacterial comparatively derived secondary structures according to the distance of separation of each base pair’s constituent bases in the primary sequence; a bin size of 100 nt is used. The vertical axis denotes the average % of the base pairs over the Eubacteria which fall into each of the distance bins of the horizontal axis. The standard deviation (SD) of each average was no larger than 1.4%. In the main figure, the scoring of the secondary structure predictions of the method of Zuker and Turner is shown with respect to base pair distance. The horizontal axis shows the five phylogenetic groups under study: arc, Archaea; eub, Eubacteria; chl, chloroplast; mit, mitochondria; euc, Eucarya. The vertical axis is the group mean score of the energetically optimal foldings of the 72 23S rRNA sequences specific to the classification of the base pairs according to the distance of separation of the base pair’s constituent bases in the primary sequence; a bin size of 100 nt is used. For the 100 nt bins, the SD was about 7% for arc and eub, and about 11% for chl, mit, and euc. For the 200 nt bins the SD was ∼17% for all the groups, except eub was ∼10%. For the 300 to 600 nt bins, the SD was typically one- to two-fold the value of the respective mean score.

Carrying on with the idea of binning each base pair in the comparatively derived structures, for Figure 2 proper we calculate a new set of group mean scores specific to each bin. The most immediate conclusion from Figure 2 is that within any one group the short-range base pairs are predicted better than the long-range base pairs. In support of this, one observes that the mean score of each of the five 100 nt bins exceeds its corresponding group mean score in Figure 1 by ∼7%; no other distance bin has this behavior. This same type of behavior was observed in the case of 16S rRNA. The most notable pattern of difference between the 23S and 16S rRNA cases is found in the 200 nt bin; specifically, the 23S rRNA mean scores were greater than their 16S rRNA counterparts by typically ∼20%.

Average scores with base pairs classified by loop closure

A base pair can be categorized according to the type of loop that it closes. The categories (as exemplified in the inset of Fig. 3) are hairpin loop closing (h), internal loop closing (i), bulge loop closing (b), and multistem loop closing (m). Categorizing base pairs in this way is useful because it allows one to attempt to relate the scores with the local structural features of RNA. Group mean scores specific to each type of loop category were computed and collected in Figure 3. Although the standard deviations in these group means can be rather large (e.g. in Fig.
3, see the m category in the mitochondria group), one can still see that within each of the five phylogenetic groups, hairpin loop closing base pairs are predicted better than bulge and internal loop closers, which in turn score better than the multistem loop closing base pairs.

The group mean scores specific to the hairpin loops all exceed their respective group mean score in Figure 1 by 9–18% and this is not seen with the mean scores specific to the other loop categories (except for the b category in Eucarya in Fig. 3). By the nature of RNA structure, a base pair that closes a hairpin loop is usually a short-range base pair as well. So it is interesting to observe that the group mean scores specific to the hairpin loops all exceed their respective 100 nt bin mean scores in Figure 2 by 3–10%. In the case of 16S rRNA, we observed basically these same two patterns. So overall, one concludes that base pairs classified as being hairpin loop closers in 16S and 23S rRNA are predicted better than base pairs classified as closing the other three loop types.

Figure 3. The inset shows the base pair classification in terms of loop closure. The categories are: h, hairpin loop closing; i, internal loop closing; b, bulge loop closing; m, multistem loop closing. In the main figure, scoring of the secondary structure predictions of the method of Zuker and Turner is shown with respect to the four loop closing classifications. The horizontal axis shows the five phylogenetic groups under study. The vertical axis is the group mean score of the energetically optimal foldings of the 72 23S rRNA sequences specific to the classification of the base pairs according to the type of loop that they close. The maximum and minimum scores that contributed to each of the loop-specific group mean scores are denoted by a closed and open circle, respectively. The standard deviations are displayed as vertical lines.

Predictors of folding success

Score versus % of noncanonical base pairs

The Zuker–Turner method does not predict noncanonical base pairs even though this type of pairing does occur in comparatively derived secondary structures. In particular, for the 72 23S rRNA comparatively derived secondary structures studied here, the % of noncanonical base pairs ranged from 2.9 to 9.9%. (By noncanonical, we mean any pairing other than AU, UA, CG, GC, GU, or UG.) In our earlier work on 16S rRNA, it was noted qualitatively that the % of noncanonical base pairs present in a comparatively derived secondary structure was ‘inversely proportional’ to the folding score of the corresponding primary sequence. To follow up on these two clues, Figure 4 shows the score of the folding of each of the 72 23S rRNA sequences plotted against their corresponding values of % noncanonical base pairs; Figure 5 gives the same information for the 56 16S rRNA sequences. In Figure 4, the score decreases by ∼40% with a 6% increase in % noncanonical base pairs, while in Figure 5, the score decreases by ∼50% with an 8% increase in % noncanonical base pairs. The linear correlation coefficients (see Materials and methods) are –0.55 and –0.74 for Figures 4 and 5, respectively, which are both significant at a level < 0.05%. We therefore conclude that the accuracy of the secondary structure predictions of 16S and 23S rRNA made by the Zuker–Turner method will decrease with increasing % noncanonical base pairs.

Why are noncanonical base pairs detrimental to the score? One reason for this may be that in the comparatively derived secondary structures, noncanonical base pairs are often internal to helices (see the secondary structure diagrams at http://pundit.colorado.edu:8080/root.html). So, the closest the Zuker–Turner method can come to correctly reproducing helices bearing internal noncanonical base pair(s) is to correctly form the canonical base pairs while simultaneously leaving internal loop(s) where the noncanonical base pair(s) should be.
The destabilizing influence of an internal loop negates the stabilizing contribution of about two stacked dinucleotide units, and this clearly presents a way by which noncanonical base pairs can erode the accuracy of a secondary structure prediction.

Figure 4. Score versus % of noncanonical base pairs (%NC) present in the corresponding comparatively derived secondary structure. For each of the 72 23S rRNA sequences folded, the score of the energetically optimal folding is plotted against its %NC. The linear correlation coefficient is –0.55.

Figure 5. Score versus % of noncanonical base pairs (%NC) present in the corresponding comparatively derived secondary structure. For each of the 56 16S rRNA sequences folded in our previous work, the score of the energetically optimal folding is plotted against its %NC. The linear correlation coefficient is –0.74.

Score versus % of hairpin loops that are ‘stable tetraloops’

Examination of the comparatively derived secondary structures of the 72 23S rRNA sequences revealed that the most frequently observed hairpin loops were of size four (tetraloops) and, excepting triloops, the observed frequency of each size of loop tended to increase with decreasing loop size. Specifically, considering the five phylogenetic group averages, tetraloops comprised 26–35% of the total hairpin loops. In our study of 16S rRNA [11,15], tetraloops were also found to be the most frequently observed (30–50%) type of hairpin loop; the nature and function of tetraloops in 16S rRNA [5,15] and 23S rRNA have been discussed. The Zuker–Turner method assigns a free energy contribution of about +4.5 kcal to any tetraloop, but in addition, it adds a further –2.0 kcal to the free energy contribution of tetraloops with the following composition: GHGA, GVAA, and UWCG (H = not G, V = not U, W = A or U). Working with the comparatively derived secondary structures, the distribution of these three subclasses of tetraloop (which have been called ‘stable tetraloops’) with respect to all tetraloops is shown in Table 2. Clearly, the frequency of use of these stable tetraloops is: GVAA > GHGA > UWCG. This same trend was found in our study of 16S rRNA [11,15], although on average the frequencies of use of the stable tetraloops were higher by ∼15% than in the 23S rRNA case.

Given that the Zuker–Turner method recognizes these so-called stable tetraloops, the % of hairpin loops that are in fact stable tetraloops in the comparatively derived secondary structures (% hairpin loops that are stable tetraloops [%HSTL] = number of stable tetraloops / number of all hairpin loops) seems likely to be related to the score. Therefore, the scores of the folding of each of the 72 23S rRNA sequences are plotted against the corresponding values of %HSTL in Figure 6; the kindred plot for the 56 16S rRNA sequences is given in Figure 7. In Figure 6 the score increases by ∼30% with a 20% increase in %HSTL, while in Figure 7, the score increases by ∼50% with a 30% increase in %HSTL.
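The %HSTL definition above can be illustrated with a short Python sketch. It assumes that each hairpin loop is supplied as the string of its unpaired loop bases, and it encodes the three stable tetraloop classes (GHGA, GVAA, UWCG) as a regular expression; the function name and the example loops are hypothetical and are not taken from the paper's data.

```python
import re

# GHGA (H = not G), GVAA (V = not U), UWCG (W = A or U).
STABLE_TETRALOOP = re.compile(r"G[ACU]GA|G[ACG]AA|U[AU]CG")


def percent_hstl(hairpin_loops):
    """%HSTL = 100 * (number of stable tetraloops) / (number of hairpin loops).

    Each loop is given as the sequence of its unpaired bases, which is an
    assumed input format for this sketch.
    """
    if not hairpin_loops:
        raise ValueError("no hairpin loops given")
    stable = sum(
        1 for loop in hairpin_loops if STABLE_TETRALOOP.fullmatch(loop.upper())
    )
    return 100.0 * stable / len(hairpin_loops)


# Hypothetical hairpin loops: GAAA, GCGA, and UUCG are stable tetraloops;
# the other five are not (wrong composition or wrong size).
example_loops = ["GAAA", "GCGA", "UUCG", "GGGA", "UCCG", "AAUUG", "GUAA", "CUUGA"]
example_hstl = percent_hstl(example_loops)
print(f"%HSTL = {example_hstl:.1f}")
```

Using `fullmatch` rather than a substring search keeps loops of other sizes from being counted merely because they contain one of the four-base motifs.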
The linear correlation coefficients for Figures 6 and 7 are +0.51 and +0.66, respectively, which are both significant at a level < 0.05%. We conclude that the accuracy of the secondary structure predictions of 16S and 23S rRNA made by the Zuker–Turner method will increase with increasing % hairpin loops that are stable tetraloops.

Attempting a quick rationalization of this correlation, one might recall that the Zuker–Turner method simply adds a stabilizing contribution to GHGA, GVAA, and UWCG tetraloops. So continuing with the thought, it might then be considered ‘obvious’ that the placement of these stabilizing structures where a hairpin loop is required would improve the score. However, this argument fails to consider the specific pattern of distribution of the tetraloop subsequences in each of the full-length sequences under consideration; i.e. in each full-length sequence, there are a number of subsequences that could form ‘stable tetraloops’ (just by chance) and this presents a potentially sequence-specific type of ‘background’ in which % hairpin loops that are stable tetraloops must be interpreted. So further work is clearly needed to elucidate the nature of this correlation.

Score versus sequence length

Perhaps the most obvious property of a secondary structure that one would expect to have an impact on the accuracy of its folding is the length of the sequence, because the complexity of the folding problem increases with increasing sequence size. In our earlier work with 16S rRNA, we commented that score and length appear not to be related on the basis of a simple survey of the results.
To place this idea on more solid ground for both 16S and 23S rRNA, we prepared plots (data not shown; see Figs A,B at http://pundit.colorado.edu:8080/root.html) of score versus sequence length where only the mitochondrial and eucaryal sequences were considered; we did not use archaeal, eubacterial, or chloroplast sequences because these sequences are very close in length and hence they do not sample a broad length range. The linear correlation coefficients for the two plots are +0.15 and +0.19 for the 41 mitochondrial and eucaryal 23S rRNA sequences and the 22 mitochondrial and eucaryal 16S rRNA sequences, respectively. These coefficients indicate a correlation of doubtful significance, so we conclude that score and length are not related in the case of 16S and 23S rRNA (as suggested earlier).

Score versus %G+C

The % of bases in a sequence that are G or C (%G+C) is known to be useful in characterizing the behavior of nucleic acids; for example, %G+C correlates with the melting temperature of DNA. Therefore, we plotted %G+C against the score for the archaeal, eubacterial, chloroplast, and mitochondrial 23S rRNA data (see Fig. 8), and in a separate plot for the eucaryal 23S rRNA data (see Fig. 9). Analogous plots were produced for 16S rRNA from our earlier work (data not shown; see Figs C,D at http://pundit.colorado.edu:8080/root.html). The most immediate impression from the examination of all of these plots was that the 49 23S and 41 16S rRNA archaeal, eubacterial, chloroplast, and mitochondrial data points correlate positively with the score; i.e. the score increases with increasing %G+C.
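For readers who want to reproduce this kind of analysis, the following Python sketch computes %G+C for a sequence and the standard Pearson linear correlation coefficient between two lists of values. It is assumed here that the Pearson coefficient is what 'linear correlation coefficient' refers to; the function names are hypothetical and the numbers are made up for illustration, not data from this study.

```python
import math


def percent_gc(sequence):
    """Return the % of bases in the sequence that are G or C."""
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def pearson_r(xs, ys):
    """Return the Pearson linear correlation coefficient of xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


# Made-up values: %G+C of three sequences and their folding scores.
gc_values = [40.0, 50.0, 60.0]
scores = [30.0, 45.0, 60.0]
r = pearson_r(gc_values, scores)
print(f"%G+C of GGCCAU = {percent_gc('GGCCAU'):.1f}, r = {r:.2f}")
```

A coefficient near +1 or –1 indicates a strong linear relationship; assessing its significance level additionally depends on the number of data points, which is why the same magnitude of coefficient can be significant for one data set and weak for a smaller one.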
Specifically, linear correlation coefficients are +0.59 and +0.55 for the 23S and 16S rRNA cases, respectively, which are both significant at a level of < 0.1%. In contrast, the eucaryal data points for the 23 23S and 15 16S rRNA sequences appeared to be anti-correlating with the score (see Fig. 9). Specifically, linear correlation coefficients are –0.63 and –0.40 for the 23S and 16S rRNA cases, respectively, which is significant at a level of < 0.5% for the 23S rRNA case but which represents a weak correlation (i.e. significance level of 14%) for the 16S rRNA case. So, regarding the noneucaryal sequences of 16S and 23S rRNA, we conclude that %G+C correlates with the score; for 23S eucaryal rRNA, we conclude that an anti-correlation exists.

In our earlier paper on 16S rRNA, a plot of score versus %G+C was considered. In this plot, the data points belonging to Archaea, Eubacteria, and chloroplast were displayed as a single group, while the data points belonging to mitochondria and Eucarya were taken together. With the results displayed in this way, the correlations and anti-correlation that were noted in the previous paragraph were not realized; the conclusion was instead that score and %G+C did not seem to be related.

Table 2. Distribution of the ‘stable tetraloops’ with respect to all tetraloops in the comparatively derived secondary structures.

Group          %GHGA     %GVAA     %UWCG     Sum of %
Archaea        23 ± 9    26 ± 10   6 ± 4     55
Eubacteria     16 ± 4    27 ± 9    8 ± 3     51
Chloroplast    19 ± 6    22 ± 6    6 ± 4     47
Mitochondria   12 ± 11   20 ± 13   3 ± 4     35
Eucarya        17 ± 5    22 ± 9    10 ± 7    49

H = not G; V = not U; W = A or U.

Discussion

The mechanism by which an rRNA molecule attains its biologically relevant secondary and tertiary structure is not known. Undoubtedly, this folding mechanism incorporates pathways guided by both kinetic and thermodynamic constraints, and these pathways certainly include proteins and other important factors [18,19]. Nevertheless, for the sake of discussion here, we assume that the ‘driving force’ behind the in vivo folding of rRNA is the global minimization of the free energy of the naked rRNA molecule.

The overall average scores for the prediction of 16S and 23S rRNA were found to be 46 ± 17% and 44 ± 11%, respectively. Under the assumption that the global minimization of free energy determines the secondary structure, why do the majority of the scores not approach 100% and why are there significant differences in the folding success of the different phylogenetic groups?
In answer, first recall the overall trend in prediction success displayed in Figure 1 where, on average, the mitochondrial and eucaryal sequences did not fold well relative to the archaeal and eubacterial sequences. This suggests that the free energy elements used as input to the Zuker–Turner method may need to be separately ‘optimized’ for each of the five phylogenetic groups. One must take care when discussing optimization. In an ideal setting at a given temperature, only one set of free energy parameters is expected for all RNA molecules. However, real systems are likely to maintain different conditions of salt, pH, and other factors, so in this sense ‘optimization’ might prove useful.

Figure 6. Score versus % of the hairpin loops that are ‘stable tetraloops’ (%HSTL) in the corresponding comparatively derived secondary structure. For each of the 72 23S rRNA sequences folded, the score of the energetically optimal folding is plotted against its %HSTL. The linear correlation coefficient is +0.51.

Figure 7. Score versus % of the hairpin loops that are ‘stable tetraloops’ (%HSTL) in the corresponding comparatively derived secondary structure. For each of the 56 16S rRNA sequences folded in our previous work, the score of the energetically optimal folding is plotted against its %HSTL. The linear correlation coefficient is +0.66.

Second, there may be structural features (i.e. motifs) still unknown to us that need to be identified and incorporated into the algorithm; these unknown features could be group-specific and could account for the trend in the score in Figure 1. Analogously, the frequency of occurrence of motifs in each of the five phylogenetic groups may be related to the accurate prediction of base pairs and hence the score; this relationship requires further study. The aforementioned ‘optimization’ of the Zuker–Turner method would need to account for the frequency of occurrence of motifs.

Why are short-range base pairs predicted better than long-range base pairs (see Fig. 2)? The Zuker–Turner method does not penalize a base pair simply for being long range, nor does it stabilize a base pair simply for being short range. So, one must look beyond this convenient view in order to answer the question. By definition, the constituent bases that make up a short-range base pair are separated by fewer nucleotides than are the constituent bases of a long-range base pair. Hence, a smaller amount of the overall rRNA structure is ‘closed’ by a short-range base pair, and in contrast, a long-range base pair can close a significant portion of the overall structure. This suggests that the formation of a long-range base pair is a more complex problem in three-dimensional space than is the formation of a short-range base pair.
Since the Zuker–Turner method folds RNA by estimating the free energy of only secondary structural features, the three-dimensional information required to correctly predict long-range base pairs is not inherent in the thermodynamic energy values, and the result that short-range base pairs are predicted better than long-range base pairs is then not unexpected.

Why are base pairs that are classified as being hairpin loop closers predicted better than base pairs classified as closing bulge, internal, and multistem loops? By the construction of RNA, base pairs that close hairpin loops are very likely to be short-range base pairs as well. So this question may be partially answered by the previous comments regarding short-range base pairs. On the other hand, hairpin loops have received much greater experimental attention than have bulge, internal, and multistem loops (i.e. irregularities within helices, which would also include noncanonical base pairs). Given this disparate level of attention, the Zuker–Turner method may simply be accounting more accurately for base pairs in proximity to hairpin loops than for base pairs in proximity to the other three loop types; this could answer the current question. To illustrate the complexity of the study of irregularities in a helix, one needs to consider only the two different arrangements of bulge loops of size one. In this case, the single base may ‘bulge out’ of the helix in which it is situated (as its name implies), or instead the base may be intercalated into its respective helix. Surely these two arrangements should not be assigned the same value of free energy by the Zuker–Turner method, as is currently the case? (One must use caution with arguments that focus on the adjustment of a free energy input element because, by necessity, the adjustment will affect the secondary structure at both the global and the local level.) Also, the free energy values currently used for multistem loops are estimations and a better understanding of multistem loops would undoubtedly lead to an increase in predictive accuracy [6,20].

Figure 8. Score versus % of bases in the corresponding primary sequence that are G or C (%G+C). For each of the 49 23S archaeal, eubacterial, chloroplast, and mitochondrial rRNA sequences folded, the score of the energetically optimal folding is plotted against its %G+C. The linear correlation coefficient is +0.59.

Figure 9. Score versus % of bases in the corresponding primary sequence that are G or C (%G+C). For each of the 23 23S eucaryal rRNA sequences folded, the score of the energetically optimal folding is plotted against its %G+C. The linear correlation coefficient is –0.63.

Any time a prediction of RNA secondary structure is made, one must be concerned with the accuracy of that prediction. Regarding the Zuker–Turner method, Zuker and co-workers have begun to address the issue of accuracy in the case of 16S rRNA. Specifically, they showed that predictions made by their method are more reliable when the helices of the structure of interest are predicted with a low level of competing structures in the suboptimal foldings. (They call this property ‘well determinedness’.)
We approached the issue of accuracy for 16S and 23S rRNA by looking for predictors of score and we semiquantitatively identified three such parameters: % noncanonical base pairs, % hairpin loops that are stable tetraloops, and %G+C.

Might claiming % noncanonical base pairs and % hairpin loops that are stable tetraloops as predictors of score simply be ‘too obvious’ to warrant our elaboration? Not at all! First, this work has not only provided semiquantitative proof that the relationships exist, but has also gauged their magnitude. Second, the fact that several predictors have been found, and that their effect on the score is measurable, shows that a continued investigation of the accuracy of the Zuker–Turner method has the potential of uncovering other secondary structure features that might (anti-)correlate with the score. It was a negative result that length was found not to be a predictor of score for the 16S and 23S mitochondrial and eucaryal sequences. Nevertheless, this fact was worth reporting because it shows that the Zuker–Turner method should not be dismissed simply because one is working with large sequences.

Predictors of score can be used in two ways. First, predictors can point out where time and effort (experimental and theoretical) should be focused to improve the method. All the predictors noted here fall into this category and % noncanonical base pairs is a very straightforward example. Zuker and co-workers have already suggested that incorporation of rules to predict noncanonical base pairs is likely to improve the accuracy of the method and our demonstration of % noncanonical base pairs as a predictor places this suggestion on solid ground. Second, using the Zuker–Turner method as a way to play out the folding of rRNA, predictors directly question the biological nature of rRNA.
We hinted at this above when we suggested “that the free energy elements used as input to the Zuker–Turner method may need to be separately ‘optimized’ for each of the five phylogenetic groups.” The % noncanonical base pairs is a (perhaps trivial) example of this in that it forces one to notice that mitochondrial and eucaryal 23S rRNA comparatively derived secondary structures utilize noncanonical base pairs to a greater extent than do the secondary structures of the other phylogenetic groups. In a contrasting fashion, one is forced to note that % hairpin loops that are stable tetraloops is larger for archaeal and eubacterial comparatively derived secondary structures than it is for structures of mitochondria and Eucarya.

The case of %G+C in 23S rRNA provides a less superficial example of this second usage of a predictor, because the content of G+C in a eucaryal sequence exhibits a very different behavior in the execution of the algorithm than it does in the other phylogenetic groups (see Figs 8,9). This may imply that the folding strategy of eucaryal rRNA differs in some (unknown) manner from the strategies of the other phylogenetic groups. In the future, this difference could be investigated by charting the context of use (e.g. is the base found predominantly in helices or in single-stranded regions, or does it tend to close a particular type of loop?) of the four base types in the comparatively derived secondary structures.
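A charting of this kind could begin with something as simple as the following sketch. It assumes, purely for illustration, that a comparatively derived secondary structure is available in dot-bracket notation ('.' for an unpaired base, any other symbol for a paired base); that input format, the function names, and the toy sequence are hypothetical and do not come from the data used in this paper.

```python
from collections import Counter


def percent_gc(seq):
    """Percentage of bases in the sequence that are G or C."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def base_usage(seq, dot_bracket):
    """Count each base type in paired and unpaired positions.

    Assumes a dot-bracket string of the same length as the sequence,
    where '.' marks an unpaired base (a hypothetical input format).
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    usage = {"paired": Counter(), "unpaired": Counter()}
    for base, symbol in zip(seq.upper(), dot_bracket):
        context = "unpaired" if symbol == "." else "paired"
        usage[context][base] += 1
    return usage


# Toy hairpin, invented for illustration only.
sequence = "GGGAAAUCCC"
structure = "(((....)))"

gc = percent_gc(sequence)
usage = base_usage(sequence, structure)
print(gc, dict(usage["paired"]), dict(usage["unpaired"]))
```

Splitting the counts further by loop type (hairpin, bulge, internal, multistem) would require parsing the structure and is left out of this sketch.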
We speculate that this could provide a basis of explanation for the behavior of %G+C as a predictor of score, and yet at the same time, our biochemical perspective of rRNA would expand in a fundamental way by learning how each of the four base types is ‘used’ in building secondary structure in each of the five phylogenetic groups.

In closing, we have presented in this paper the results of the folding of 72 phylogenetically and structurally distinct 23S rRNA sequences by the Zuker–Turner method. Using averages of scores, we have pointed out trends in the accuracy of the prediction of the method by grouping the sequences into five phylogenetic classes and by subgrouping the classes according to base pair distance and loop closing category. This showed that base pairs that are better understood (i.e. short-range and hairpin loop closing base pairs) tend to be better predicted. To begin to understand how the Zuker–Turner method can be improved and, just as important, to probe the biology of 16S and 23S rRNA, we identified three semiquantitative predictors of score (i.e. % noncanonical base pairs, % hairpin loops that are stable tetraloops, and %G+C). Finally, by assuming that the global minimization of free energy is the ‘driving force’ behind the folding of 16S and 23S rRNA, we have discussed the results and shown that much work remains to be done in this field.

Materials and methods

Diagrams of the 72 comparatively derived secondary structures of 23S rRNA used here are publicly available at: http://pundit.colorado.edu:8080/root.html. Each diagram shows the comparatively inferred base pairs and the base pairs that were correctly predicted by the energetically optimal folding of the Zuker–Turner method. The most immediate conclusion upon review of these structures is that the base pairs that are correctly predicted are ‘short range’ with respect to their positions in the primary sequence (i.e. < 100 bases apart). For each of the 72 sequences, the score of the energetically optimal folding, the score of the best suboptimal folding, the sequence length, and the other values used in the Results section are also available.

Notation and abbreviations

Both 23S and 23S-like rRNA are referred to as 23S rRNA. Likewise, 16S and 16S-like rRNA are both referred to as 16S rRNA. Canonical base pairs are Watson–Crick base pairs or GU or UG base pairs; any other type of base pair is noncanonical. For the purposes of denoting trends in the text we use ‘>’ in its standard role as ‘greater than’, except that ‘=’ will be used whenever the values of two scores come within 90% of each other.
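The ‘>’ and ‘=’ convention can be sketched in a few lines. The sketch assumes that ‘within 90% of each other’ means that the smaller score is at least 90% of the larger one, and that only neighbouring groups in the sorted order are compared; both are our reading of the convention, and the function name is hypothetical. The mean scores are the 16S rRNA values quoted in the introduction.

```python
def trend(means, threshold=0.90):
    """Render group means as a high-to-low trend string.

    Adjacent groups are joined by '=' when the lower mean is at least
    `threshold` times the higher one (assumed reading of 'within 90%'),
    and by '>' otherwise. Means are assumed to be positive.
    """
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    parts = [ranked[0][0]]
    for (_, higher), (name, lower) in zip(ranked, ranked[1:]):
        parts.append("=" if lower / higher >= threshold else ">")
        parts.append(name)
    return " ".join(parts)


# Mean 16S rRNA scores (%) as quoted in the introduction of this paper.
means_16s = {
    "Archaea": 69,
    "Eubacteria": 55,
    "chloroplast": 48,
    "mitochondria": 31,
    "Eucarya": 30,
}

trend_16s = trend(means_16s)
print(trend_16s)
# Archaea > Eubacteria > chloroplast > mitochondria = Eucarya
```

Under this reading, mitochondria (31%) and Eucarya (30%) are joined by ‘=’ because 30/31 is about 0.97, whereas chloroplast (48%) and Eubacteria (55%) are not, because 48/55 is about 0.87.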
We use the word ‘semiquantitative’ to describe our results because we utilize linear correlation coefficients without providing or discussing the associated regression line.

Comparatively derived secondary structures

The comparatively derived secondary structures upon which all the scores were based were secondary structures derived from comparative analysis [3–5]. The 72 23S rRNA sequences used in this paper and their respective comparatively derived secondary structures were selected from recent compilations [12,13] or from data in preparation for publication by MN Schnare, SH Damberger, MW Gray, and RR Gutell. The sequences were selected so that phylogenetically and structurally distinct comparatively derived secondary structures would be represented. The sequences were similar in phylogenetic positioning and degree of structural variation to the 16S rRNA sequences analyzed previously. The secondary structure diagrams were drawn with the program XRNA (Weiser and Noller; ftp://fangio.ucsc.edu/pub/XRNA/) on Sun Sparc workstations.

Zuker–Turner method

The Zuker–Turner method [6–9] used here was version 2.2 of the MFOLD program and the free energy values were those for 37°C. At the time of writing the MFOLD package was available at ftp://snark.wustl.edu/pub/ and the MFOLD manual could be found at http://ibc.wustl.edu/~zuker/seqanal/. We implemented MFOLD on a Sun Sparc 10/51 and we set the user-defined parameters required for the operation of MFOLD to the following values: % for sort (P%) = 10, number of trace-backs = 50, and window size = 20. Each sequence was submitted to MFOLD in its full-length form rather than being broken into its known domains with subsequent submission of these domain subsequences. Although this practice increased the time of computation by several-fold (each folding took approximately 48 to 72 and 120 megabytes of memory), it did emphasize our pretense of having no prior knowledge of the biologically relevant secondary structure. It is noted that the Zuker–Turner method has been used in a way in which coaxial stacking and some noncanonical base pairs were addressed; we did not use this version.

Linear correlation coefficients

Our goal was to find semiquantitative predictors of the accuracy of the Zuker–Turner method and this ultimately required us to interpret scatter plots of score versus the corresponding values of the suspected predictors. To police our qualitative interpretation of the scatter plots, we computed the linear correlation coefficient without regression analysis for each plot under consideration. Naturally, one does not expect that the hypothetical function relating the score to any particular predictor need be linear; nevertheless, from the multitude of standard statistics that can be employed in any study, the extent to which our scatter plots support a linear relation was a frugal choice. Letting x and y represent the two values to be compared and with the bar denoting the mean value, the linear correlation coefficient was:

$$\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \tag{1}$$

The linear correlation coefficients were computed in Fortran 77 on a Sun Sparc workstation and the level of significance for each linear correlation coefficient was determined from Table 3 in the appendix of .

Acknowledgements

This work was supported by grants from the NIH (GM48207) and the Colorado RNA Center. The WH Keck Foundation is thanked for its support of RNA research at the University of Colorado. RR Gutell is an associate in the Program in Evolutionary Biology of the Canadian Institute for Advanced Research. We thank Danielle Konings for a great deal of assistance in all the phases of this project. We also thank the referees for their constructive criticisms.

References

1. Hill, W.E., Dahlberg, A., Garrett, R.A., Moore, P.B., Schlessinger, D. & Warner, J.R. (eds). (1990). The Ribosome: Structure, Function, and Evolution. American Society for Microbiology, Washington, DC.
2. Nierhaus, K.H., Franceschi, F., Subramanian, A.R., Erdmann, V.A. & Wittmann-Liebold, B. (eds). (1993). The Translational Apparatus: Structure, Function, Regulation, Evolution. Plenum Press, New York.
3. Gutell, R.R., Larsen, N. & Woese, C.R. (1994). Lessons from an evolving rRNA: 16S and 23S rRNA structures from a comparative perspective. Microbiol. Rev. 58, 10–26.
4. Woese, C.R. & Pace, N.R. (1993). Probing RNA structure, function and history by comparative analysis. In The RNA World. (Gesteland, R.F., Atkins, J.F., eds), pp. 91–117, Cold Spring Harbor Laboratory Press, Plainview NY.
5. Gutell, R.R. (1996). Comparative sequence analysis and the structure of 16S and 23S rRNA. In Ribosomal RNA: Structure, Evolution, Processing, and Function in Protein Biosynthesis. (Zimmermann, R.A., Dahlberg, A.E., eds), pp. 111–128, CRC Press, Boca Raton FL.
6. Jaeger, J.A., Turner, D.H. & Zuker, M. (1989). Improved predictions of secondary structures for RNA. Proc. Natl. Acad. Sci. U.S.A. 86, 7706–7710.
7. Jaeger, J.A., Turner, D.H. & Zuker, M. (1990). Predicting optimal and suboptimal secondary structure for RNA. In Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences. (Doolittle, R.F., ed.), vol. 183, pp. 281–306, Academic Press, San Diego CA.
8. Zuker, M. (1989). On finding all suboptimal foldings of an RNA molecule. Science 244, 48–52.
9. Zuker, M. (1989). The use of dynamic programming algorithms in RNA secondary structure prediction. In Mathematical Methods for DNA Sequences. (Waterman, M.S., ed.), pp. 159–184, CRC Press, Boca Raton FL.
10. Woese, C.R., Kandler, O. & Wheelis, M.L. (1990). Towards a natural system of organisms: proposal for the domains Archaea, Bacteria and Eucarya. Proc. Natl. Acad. Sci. U.S.A. 87, 4576–4579.
11. Konings, D.A.M. & Gutell, R.R. (1995). A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA 1, 559–574.
12. Gutell, R.R., Gray, M.W. & Schnare, M.N. (1993). A compilation of large subunits (23S and 23S-like) ribosomal RNA structures: 1993. Nucleic Acids Res. 21, 3055–3074.
13. Schnare, M.N., Damberger, S.H., Gray, M.W. & Gutell, R.R. (1996). Comprehensive comparison of structural characteristics in eukaryotic cytoplasmic large subunit (23S-like) ribosomal RNA. J. Mol. Biol. 256, 701–719.
14. Zuker, M., Jaeger, J.A. & Turner, D.H. (1991). Comparison of optimal and suboptimal RNA secondary structures predicted by free energy minimization with structures determined by phylogenetic comparison. Nucleic Acids Res. 19, 2707–2714.
15. Woese, C.R., Winker, S. & Gutell, R.R. (1990). Architecture of ribosomal RNA: constraints on the sequence of “tetra-loops”. Proc. Natl. Acad. Sci. U.S.A. 87, 8467–8471.
16. Cornish-Bowden, A. (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 13, 3021–3030.
17. Saenger, W. (1984). Principles of Nucleic Acid Structure. p. 146, Springer-Verlag, New York.
18. Herschlag, D. (1995). RNA chaperones and the RNA folding problem. J. Biol. Chem. 270, 20871–20874.
19. Higgs, P.G. & Morgan, S.R. (1995). Thermodynamics of RNA folding. When is an RNA molecule in equilibrium? In Advances in Artificial Life: Proceedings of the Third European Conference on Artificial Life. (Moran, F., Moreno, A., Merelo, J.J., Chacon, P., eds), In Lecture Notes in Artificial Intelligence, vol. 929, pp. 852–861, Springer Verlag, New York.
20. Zuker, M. & Jacobson, A.B. (1995). “Well-determined” regions in RNA secondary structure prediction: analysis of small subunit ribosomal RNA. Nucleic Acids Res. 23, 2791–2798.
21. Walter, A.E., et al., & Zuker, M. (1994). Coaxial stacking of helixes enhances binding of oligoribonucleotides and improves predictions of RNA folding. Proc. Natl. Acad. Sci. U.S.A. 91, 9218–9222.
22. Taylor, J.R. (1982). An Introduction to Error Analysis. University Science Books, Mill Valley CA.

Because Folding & Design operates a ‘Continuous Publication System’ for Research Papers, this paper has been published via the internet before being printed. The paper can be accessed from http://biomednet.com/cbiology/fad.htm; for further information, see the explanation on the contents page.