Your SlideShare is downloading. ×
Gutell 062.jmb.1997.267.1104
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Gutell 062.jmb.1997.267.1104

45
views

Published on

Published in: Technology

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
45
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Assessing the Reliability of RNA Folding usingStatistical MechanicsMartijn Huynen1,2*, Robin Gutell3and Danielle Konings2,31Theoretical Biology andBiophysics and Center forNonlinear Studies, Los AlamosNational Laboratory, MS-K710Los Alamos, NM 87545, USA2Department of Santa FeInstitute, 1399 Hyde ParkRoad, Santa Fe, NM87501, USA3Molecular, Cellular andDevelopmental Biology,University of ColoradoCampus Box 347, BoulderCO 80309, USAWe have analyzed the base-pairing probability distributions of 16 S and16 S-like, and 23 S and 23 S-like ribosomal RNAs of Archaea, Bacteria,chloroplasts, mitochondria and Eukarya, as predicted by the partitionfunction approach for RNA folding introduced by McCaskill. A quanti-tative analysis of the reliability of RNA folding is done by comparing thebase-pairing probability distributions with the structures predicted bycomparative sequence analysis (comparative structures). We distinguishtwo factors that show a relationship to the reliability of RNA minimumfree energy structure. The ®rst factor is the dominance of one particularbase-pair or the absence of base-pairing for a given base within the base-pairing probability distribution (BPPD). We characterize the BPPD perbase, including the probability of not base-pairing, by its Shannonentropy (S). The S value indicates the uncertainty about the base-pairingof a base: low S values result from BPPDs that are strongly dominatedby a single base-pair or by the absence of base-pairing. We show thatbases with low S values have a relatively high probability that their mini-mum free energy (MFE) structure corresponds to the comparative struc-ture. The BPPDs of prokaryotes that live at high temperatures(thermophilic Archaea and Bacteria) have, calculated at 37C, lower Svalues than the BPPDs of prokaryotes that live at lower temperatures(mesophilic and psychrophilic Archaea and Bacteria). This re¯ects an ad-aptation of the ribosomal RNAs to the environmental temperature.A second factor that is important to consider with regard to the reliabilityof MFE structure folding is a variable degree of applicability of the ther-modynamic model of RNA folding for different groups of RNAs. Herewe show that among the bases that show low S values, the Archaea andBacteria have similar, high probabilities (0.96 and 0.94 in 16 S and 0.93and 0.91 in 23 S, respectively) that the MFE structure corresponds to thecomparative structure. These probabilities are lower in the chloroplasts(16 S 0.91, 23 S 0.79), mitochondria (16 S-like 0.89, 23 S-like 0.69) andEukarya (18 S 0.81, 28 S 0.86).# 1997 Academic Press LimitedKeywords: RNA secondary structure; RNA folding; base-pairingprobability distribution; ribosomal RNA; thermophiles*Corresponding authorIntroductionThe higher-order structure of RNA is crucial inmany of its functions. This is exempli®ed by itsconservation in evolution and by the in vitro selec-tion from random sequences of ribozymes for thesame function with a similar base-pairing pattern(Ekland et al., 1995). Within RNA structure the sec-ondary structure plays a central role, it covers thedominant energy contributions and provides thePresent addresses: R. Gutell, Department ofChemistry and Biochemistry, Campus Box 215,University of Colorado Boulder, CO 80309-0215, USA:D. Konings, Department of Microbiology, SouthernIllinois University at Carbondale, Carbondale, IL 62901,USA: M. Huynen, Biocomputing, EMBL,Meyerhofstrasse 1, 6900 Heidelberg, GermanyAbbreviations used: MFE, minimum free energy;BPPD, base-pairing probability distribution; S, Shannonentropy.J. Mol. Biol. (1997) 267, 1104±11120022±2836/97/151104±09 $25.00/0/mb970889 # 1997 Academic Press Limited
  • 2. major distance constraints for the formation ofthe tertiary structure. The free energy of RNAsecondary structure formation can be approxi-mated by adding up the experimentally deter-mined free energies of its elements (base-pairs,hairpin loops, bulges, etc.). Predicting RNA sec-ondary structure by ®nding the structures withthe lowest free energies has become a major re-search tool in experimental and theoreticalbiology. Here, we analyze what affects the re-liability of secondary structure prediction by freeenergy minimization.We have shown recently that the reliability ofsecondary structure prediction for ribosomalRNAs varies between phylogenetic classes(Konings Gutell, 1995; Fields Gutell, 1996).The correspondence between the minimum freeenergy (MFE) structure folding and the second-ary structure based on comparative sequenceanalysis (comparative structure) is highest forthe Archaea followed by the (eu)Bacteria. Theorder of the other three classes, i.e. chloroplasts,mitochondria and Eukarya, varies between the16 S-like RNAs and the 23 S-like RNAs. Here,we ask what underlies this variation in predict-ability. Is it the current thermodynamic modelfor calculating RNA secondary structure itselfthat for various reasons applies less to eukary-otic rRNAs than to prokaryotic rRNAs? Forexample, because eukaryotic rRNAs have morenon-standard interactions within the RNA, orbecause they have more interactions with othermolecules? Or can we distinguish other factorsthat play a role? In answering this question wewill focus speci®cally on how the probability of(sub)structures within the Boltzmann distributionof alternative secondary structures is related totheir predictability.Secondary structure prediction by free energyminimization faces the problem that for any se-quence there is an exponentially large number ofpossible structures. Although in thermodynamicequilibrium the structure with the lowest free en-ergy has the highest probability, that probability isvery small for long sequences. For example, in ther-modynamic equilibrium the probability of the MFEstructure within the Boltzmann distribution is, forrandom sequences of the length of a 16 S rRNA(about 1500 nucleotides), generally smaller than10À45. A more interesting quantity than the prob-ability of a speci®c large structure is that of smallsubstructures. This approach has been formalizedby McCaskill (1990), whose algorithm focuses onthe smallest sub-structure, the base-pair. It calcu-lates the comprehensive base-pairing probabilitydistribution based on the free energies and resultingprobabilities of all structures. In earlier work wehave calculated the base-pairing probability distri-bution (BPPD) of an entire HIV-1 genome (Huynenet al., 1996). We have shown that the RNA second-ary structures that are known to be functional inHIV-1 are relatively ``well-de®ned: i.e. their BPPDper base is dominated by a single pair interaction orby the absence of base-pairing (e.g. for bases in hair-pin loops). Here, we analyze the BPPDs of 16 S and16 S-like, and 23 S and 23 S-like rRNAs.We characterize the BPPD per base by its Shan-non entropy (S). Low S values correspond to prob-ability distributions that are dominated by a singleor a few probabilities. We then investigate the re-lationship between the S value of a base and theprediction of its correct base-pairing pattern, asgiven in the comparative structure, by the MFEstructure. We discuss the nature of this relationshipas revealed by the rRNAs of different phylogeneticgroups and with regard to their environmentaltemperatures.Figure 1. Relation between the average S value and thereliability of the MFE structure. Shown are the averageS value per sequence and the degree to which the MFEstructure of that sequence corresponds to the compara-tive structure for A, 16 S and 16 S-like rRNA, and B, for23 S and 23 S-like rRNA. Both the 16 S(like) rRNAs andthe 23 S(like) rRNAs show a negative correlationbetween the average S value and the percentage cor-rectly predicted bases. Correlation coef®cients over allthe sequences are À0.61 and À0.50 for the 16 S(like) andthe 23 S(like) rRNAs, respectively.Assessing the Reliability of RNA Folding 1105
  • 3. ResultsShannon entropy of base-pairingprobability distributionsFor a set of 16 S and 16 S-like, and 23 S and 23S-like rRNA sequences of Archaea, Bacteria, chlor-oplasts, mitochondria and Eukarya that span thephylogenetic tree and represent the major forms ofstructural diversity, we studied how the average Svalue per sequence relates to the correspondencebetween the MFE structure and the comparativestructure. Figure 1 shows that there is a negativecorrelation between the average S value per se-quence and the percentage of correctly predictedbases in the MFE structure. Thus, as the uncer-tainty about the base-pairing pattern of the se-quence decreases, the MFE structure of thesequence becomes more reliable. Figure 1 indicatesthat 16 S rRNAs of Archaea have relatively low Svalues, followed by the Bacteria, the chloroplasts,the Eukarya and the mitochondria, respectively.Also for the 2 S rRNAs, the Archaea have the low-est S values, followed by the Bacteria, but there isno clear distinction between the other classes. Toget a more detailed view of the distribution of Svalues per class we plotted the distribution of S va-lues of all the sequences per class in cumulativehistograms (Figure 2).The Archaea clearly have the most bases withlow S values, followed by the Bacteria. For theother classes the differences are less dramatic, andthe order differs for 16 S (like) and 23 S (like)rRNAs. A number of the Archaea species and twoof the bacterial species in our set are thermophilicor hyper-thermophilic. Since the lowest free energystate within the Boltzmann distribution becomesless dominant with increasing temperature, low Svalues (calculated at 37C) might re¯ect an adap-tation to high temperatures. For rRNA sequencesthat are available over a large range of optimalgrowth temperatures (Archaea and Bacteria 16 SrRNAs and Archaea 23 S rRNAs) we analyzedhow the average S value per sequence correlateswith the optimal growth temperature. A negativecorrelation between the average S value of theBPPD and the temperature at which the specieslive was indeed observed for all three groups(Figure 3).In principle, calculations of the BPPD should bedone for the temperature at which the secondarystructure functions. Although the partition functioncalculation in the Vienna RNA Package allows forcalculation of secondary structure at different tem-peratures, the extrapolation of the free energy par-ameters is inaccurate for the extremely hightemperatures (80C) at which the hyperthermo-philes live. For example, folding the 16 S rRNA ofPyrodictium occultum at its optimal growth tem-perature (105C) leads to a mostly single-strandedstructure. As temperature increases the entropiccontributions to the free energy become moredominant, effectively melting the secondary struc-ture. The melting of the secondary structure in-creases the S values, as bases alternate morebetween single-stranded and double-strandedstates. Folding the same sequence at increasingtemperatures does indeed lead to higher S valuesof the BPPD, and a lower probability of the MFEstructure within the Boltzmann distribution (datanot shown). The low S values we observe in theBPPD of the thermophiles (optimal growth tem-peratures between 45C and 80C) and hyper-ther-mophiles in the Archaea and in the Bacteria canFigure 2. The S values for A, the 16 S(like) rRNAs andB, the 23 S(like) rRNAs per phylogenetic class. The Svalue per class is presented in a cumulative way. Forevery class we score which fraction of the bases has anS value between 0 and 0.025, 0,025 and 0.075, 0.075 and0.125, etc. The fractions are then added. The line showswhich fraction of the nucleotides have an S value belowthe value on the x-axis. For both 16 S rRNA and 23 SrRNA, the Archaea have the most bases with low Svalues, followed by the Bacteria. For 16 S rRNA, thechloroplasts have a lower S value than Eukarya, whichin turn have a lower S value than the mitochondria. Forthe 23 S rRNA, the S values of the chloroplasts, mito-chondria and Eukarya are similar.1106 Assessing the Reliability of RNA Folding
  • 4. therefore be explained as a result of adaptations tohigh temperatures.The G ‡ C level of the 16 S rRNAs in Archaea ispositively correlated with their environmental tem-perature (Dalgaard Garrett, 1993). Althoughhigh and balanced G and C levels are necessary toget low free energies of secondary structure (Huy-nen et al., 1992), and prevent melting at high tem-peratures, they do not necessarily give rise to verywell de®ned structures. The variations in theG ‡ C content, observed in the RNAs studied here,have very little effect on the average of the Svalues of random sequences (data not shown).Randomizing the sequences but not the base com-position of the hyper-thermophiles resulted in arise of the average S value for all of the sequences.The average S value for the hyper-thermophilesrose from 0.45 (SD 0.09) to 0.85 (SD 0.18), which isnot different from that for random sequences inwhich all nucleotide frequencies are 0.25 (seeFigure 5). Note that the 16 S rRNA sequences ofthe mesophiles (optimal growth temperature be-tween 20C and 45C) and the psychrophiles (opti-mal growth temperature below 20C) have, ingeneral, S values that are lower than the averagefor random sequences (Figure 5). In Figure 3 theArchaea have one outlier at the bottom left: Metha-notrix soehngenii, optimal growth temperature37C, S value 0.40. The genus Methanotrix has bothmesophilic and thermophilic species, the relativelylow S value value of M. soehngenii might be the re-sult of a recent adaptation of the species to the me-sophile temperature range that is not yet fullypresent in the 16 S RNA.Relevance of the thermodynamic model ofRNA foldingFrom the S value per base in a sequence we cananalyze whether the correct prediction of a base inthe MFE structure depends on the S value of itsBPPD. This is a more detailed analysis of the pat-tern that was presented in Figure l. Instead of ana-lyzing the correspondence between the MFEstructure and the comparative structure per se-quence, we analyze the correspondence per Svalue of the bases. For all bases within one phylo-genetic class that have an S value between, say,Figure 3. Relation between the optimal growth tempera-ture and the average S value. Shown are the optimalgrowth temperature and the average S value for A, 16 SrRNA in Archaea and Bacteria and B, for 23 S rRNA inArchaea. Only sequences with ®ve or less unknownnucleotides were used. Data on optimal growth tem-peratures are from Holt et al. (1994), unless otherwisenoted. When an optimum growth temperature rangewas speci®ed, the temperature at the center of the rangewas used. The overall correlation between S values andoptimal growth temperature for 16 S rRNA is À0.74, forthe 23 S Archaea it is À0.58, for the 16 S Bacteria it isÀ0.74. The species with their optimal growth tempera-tures (in C) are: Archaea 16 S: Acidianus brierleyi70, Acidianus infernus 88, Archaeoglobus fulgidus 83,Desulfurococcus mobilis 85, Metallosphaera sedula 75,Methanococcoides formicicum 41, Methanobacterium thermo-autotrophicum 68, Methanococcoides burtonii 32.5, Methano-coccus vannielli 37.5, Methanosphaera stadtmanii 37,Methanospirillum hungatei 37.5, Methanothermus fervidus85.5, Methanotrix soehngenii 37, Pyrococcus furiosus 100,Pyrodictium occultum 105, Sulfolobus acidocaldarius 75, Sul-folobus shibatae 80, Sulfolobus solfactaricus 85, Thermococcusceler 90, Thermo®lum pendens 88, Thermoplasma acidophi-lum 60, Thermoproteus tenax 90. Bacteria 16 S: Agrobacter-ium tumefaciens 26, Aquifex pyrophilus 85 (1), Arthrobacterglobiformis 26.5, Bacillus megaterium 40, Bacillus psychro-philus 20, Bacteroides fragilis 37, Borellia burgdorferi 37,Carnobacterium alterfunditum 22 (2), Carnobacterium fundi-tum 22 (2), Desulfurella acetivorans 54.5, Escherichia coli37, Fervidobacterium gondwanalandicum 70, Flavobacteriumsalegens 20, Frankia alni 37, Psychrobacter immobilis20, Renibacterium salmoninarum 16.5, Streptococcus termo-philus 68, Sulfobacillus thermosul®dooxidans 50, Thermoa-naerobacter cellulolyticus 68, Thermoanaerobacter brockii70, Thermoanaerobacter thermohydrosulfuricus 68, Thermoa-naerobium lactoethylicum 70, Thermotoga maritima 75, Ther-mus thermophilius 80. Archaea 23 S, A. fulgidus, D.mobilis, M. thermoautotrophicum, M. hungatei, S. acidocal-darius, S. solfactaricus, T. acidophilum, T. celer, T. pendens,T. tenax. (1) Huber et al. (1992), (2) Franzmann et al.(1991).Assessing the Reliability of RNA Folding 1107
  • 5. 0.075 and 0.125, we can count for which fraction ofthese bases does the MFE structure correspond tothe comparative structure. This is a direct measureof the applicability of the secondary structuremodel. As we have less uncertainty about the base-pairing behavior of a single base, we expect thatour prediction of the MFE structure of that basewill be closer to what is observed experimentally,or in this case, by comparative analysis.The results (Figure 4) show that for all theclasses there is a negative relation between the Svalue of a base and the chance that it is predictedcorrectly. This corresponds to the results averagedper sequence in Figure l. The relation is, however,non-linear. The ®rst part of the slope, between Svalue 0 and 0.3, is steeper than the rest of theslope. For 16 S rRNA we observe that the curvethat describes the relation between the S value andthe fraction of correctly predicted bases is virtuallyidentical for the Archaea and the Bacteria, but islower in the chloroplasts and mitochondria, andlowest in the Eukarya. Note that relative toFigure 2, the Eukarya and the mitochondria haveswitched places: the eukaryotic 18 S rRNA se-quences have lower S values than the mitochon-drial 16 S-like sequences; however, the MFEstructure for a base with a low S value is, on aver-age, more reliable in a mitochondrial sequencethan in a Eukaryotic sequence. For the 23 S(like)rRNAs, we again observe that the curve is highestin Archaea and Bacteria. The differences with theother classes are, however, smaller than for the 16S(like) rRNA sequences.Competition between non-canonical andcanonical base-pairsThe thermodynamic model for secondary struc-ture prediction considers only Watson-Crick andG-U base-pairs. We studied to what extent thenon-canonical (excluding G-U) base-pairs that havebeen derived by the comparative analysis competewith canonical base-pairs. In other words, do theytend to occur at positions that would otherwise bepaired or single-stranded? We divided the basesinto three groups according to their base-pairing inthe comparative structures: those that form a non-canonical base-pair (excluding G-U), those that aresingle-stranded and those that form a canonicalbase-pair (Watson-Crick ‡ G-U). Per group, weFigure 4. Predictive value of the S value for the re-liability of the MFE structure per base for A, 16 S(like)rRNA and B, for 23 S(like) rRNA. For all bases thathave an S value between 0 and 0.025, 0.025 and 0.075,0.075 and 0.125, etc., the fraction of bases for which theMFE structure corresponds to the comparative structureis counted. The Figure hence gives a probability that fora base with the given S value the MFE structure corre-sponds to the comparative structure. For all the classeswe observe a negative relation between the S value of abase and the reliability of the MFE structure for thatbase. In other words, as there is, based on the thermo-dynamic model, less uncertainty of the base-pairingbehavior of a base, the correspondence between theMFE structure and the comparative structure for thatbase increases. In both the 16 S(like) and the 23 S(like)rRNAs the prokaryotes score higher than the other phy-logenetic classes. For 23 S(like) rRNA the differencesbetween the prokaryotes and the other classes are smal-ler than for the 16 S(like) RNAs.Table 1. Probability of being canonically base-paired inthe thermodynamic modelType of base-pair incomparativestrcuture Non-canonical Canonical Single-strandedArchaea 0.325 0.900 0.417Bacteria 0.416 0.856 0.440Chloroplasts 0.496 0.820 0.474Mitochondria 0.566 0.769 0.556Eukarya 0.578 0.790 0.572Average probabilities of canonical base-paring in the thermody-namic model for bases that in the comparative structure formeither non-canonical base-pairs, canonical base-pairs (Watson-Crick ‡ G-U) or single-stranded bases. The standard deviationsare about 0.35 for the non-canonical base-pairs, 0.25 for thecanonical base-pairs and 0.36 for the single-stranded bases, irre-spective of the taxonomic class.1108 Assessing the Reliability of RNA Folding
  • 6. score the average probability that the bases form acanonical base-pair according to the thermodyn-amic model. Table 1 shows that the non-canonicalbase-pairs in the comparative structure occur atpositions that have a relatively low probability offorming a canonical base-pair in the thermodyn-amic model. The probabilities are comparable withthe base-pairing probabilities of the positions thatare single-stranded in the comparative structure. Inthe Archaea, the base-pairing probability is evenlower than for the single-stranded bases. Within ri-bosomal RNAs there appears relatively little com-petition in the thermodynamic model fromcanonical base-pair interaction at the positionswhere we observe non-canonical base-pairs in thecomparative structure.DiscussionWe have shown that the local dominance of asingle structure within the Boltzmann distributionof alternative secondary structures is strongly cor-related with the reliability of the MFE structure.Bases whose BPPD is dominated by a single base-pair or by the absence of base-pairing are betterpredicted than those that have many alternativestates. This pattern is observed in 16 S and 16 S-like rRNA, and 23 S and 23 S-like rRNA in Ar-chaea, Bacteria, chloroplasts, mitochondria and Eu-karya. These results are in accordance with thedata on 16 S rRNA of Escherichia coli reported byZuker Jacobson (1995). They showed that theparts of the sequence for which a relatively fewalternative structures exist within a certain free en-ergy range are those that have a high probabilityof being predicted correctly in the MFE structure.The rRNAs of Archaea and Bacteria that live athigh temperatures have, calculated at 37C, a moreuniquely determined secondary structure than therRNAs of Archaea and Bacteria that live at lowtemperatures. This appears to re¯ect an adaptationto their environmental temperature, as the RNAsecondary structure for any given sequence be-comes less uniquely determined with rising tem-perature. An interesting observation about hyper-thermophilic Archaea and Bacteria is that their ri-bosomal RNAs evolve at a relatively low rate(Woese, 1987). An explanation for this is that thethermodynamic constraints imposed on their struc-ture, as re¯ected in the low S values of their base-pairing probability distributions, reduce the frac-tion of neutral mutations. The fact that the S valuesof the (hyper)thermophiles are signi®cantly lowerthan those of random sequences and more extremethan the S values of the mesophiles and psychro-philes supports this hypothesis. Environmentaltemperature is but one factor that affects RNAfolding: the low pH values and high salt concen-trations at which some of the Archaea live stabilizebase-pairing. They reduce the repulsion betweenthe negatively charged backbones by protonationof the phosphate groups (at pH 2; Saenger, 1984)and by ``shielding the negative charges, respect-ively.The second factor that affects the reliability ofthe MFE structure is the applicability of the modelfor RNA folding itself. The model that is used herefor calculating the RNA secondary structure isbased on the following principles: the RNA struc-ture is in thermodynamic equilibrium, its free en-ergy is calculated by adding up local contributions,a limited set of experimentally determined par-ameters of these contributions are included, andinteractions other than secondary structure onesare not considered. Tertiary interactions can con-tribute a substantial part of the total folding energy(Draper, 1996), although that in itself does notimply that they compete with secondary structureformation. Examples of competition between ter-tiary interactions or non-canonical base-pairs andsecondary structure formation have been observedin the a mRNA S4 binding site (Tang Draper,1989) and in E. coli 16 S RNA (Konings Gutell,1995), respectively. For long RNA molecules, theassumption that the system is in thermodynamicequilibrium probably does not hold. There is evi-dence for the selection of relatively short-rangeinteractions in ribosomal RNAs that form localminima in the energy landscape in which the fold-ing process gets trapped (Morgan Higgs, 1996),and simulations of the kinetics of the RNA foldinghave yielded improved predictions of the second-ary structure of ribosomal RNAs (Gultyaev et al.,1995). Other limitations of the model are that itdoes not include the interaction with other mol-ecules like proteins (Powers Noller, 1995) orother RNA molecules like small nucleolar RNAs(Maxwell Fournier, 1995). Given these limi-tations, it is reassuring that we ®nd correlations be-tween the indeterminacy of the predictedprobability distribution, the reliability of the MFEstructure folding and the optimal growth tempera-ture of the species. The MFE structure folding andits dominance in the partition function are, how-ever, only a small part of the information con-tained within the BPPD, and these results do notnecessarily imply that the probabilities of alterna-tive base-pairs are predicted correctly. The statisti-cal mechanical interpretation of the RNA foldingmodel is as much subject to the limitations men-tioned above as are the classic RNA folding algor-ithms that predict the MFE structure and a set ofalternative structures. More important, however, isthat by using the statistical mechanics interpret-ation we have been able to separate two factorsthat affect reliability of the MFE structure folding.The ®rst of these factors, the indeterminacy of thesecondary structure, is intrinsic to the RNA foldingmodel itself. By taking this factor out of theequation we get a better, quantitative picture of thesecond factor, the applicability of the RNA foldingmodel. Using the two factors we can better explainthe variation that was observed in the reliability ofthe folding of ribosomal RNAs (Konings Gutell,1995; Fields Gutell, 1996). The higher reliabilityAssessing the Reliability of RNA Folding 1109
  • 7. of the MFE structures of Archaea than those ofBacteria is largely due to the better de®ned second-ary structures in Archaea. The lower reliability ofthe MFE structure of rRNAs in chloroplasts, mito-chondria and Eukarya is also due to the fact thatthe thermodynamic model of secondary structureprediction applies less to these RNAs.Non-canonical base-pairs (non-Watson-Crickand non-G-U) are not part of the thermodynamicmodel of secondary structure prediction, the as-sumption is that they are added to the core struc-ture that is formed by the secondary structuresensu stricto. We observed that there is indeed rela-tively little competition at the positions of the non-canonical base-pairs from canonical base-pairing.The effect is strongest in the Archaea and becomesless in the Bacteria, chloroplasts, mitochondria andEukarya, respectively. The variation in this effectpoints to different strengths of selection to preventcanonical base-pair interactions at positions wherenon-canonical base-pairs occur. Since all the riboso-mal RNAs are the product of a selection processon secondary structure, we can, however, fromthese data not conclude to what extent non-canoni-cal base-pairs interfere with secondary structureformation in random sequences.We do fully acknowledge that the BPPD ap-proach represents an entirely different view onRNA structure than does the comparative sequenceanalysis approach. The BPPD approach is based onstatistical mechanics, and does in principle not pre-dict a single structure. The uncertainty it representsin terms of a probability distribution of structuresinstead of a single structure is assumed to be``real in thermodynamic equilibrium. Adaptationto this uncertainty by evolving secondary struc-tures that are relatively dominant in their Boltz-mann distribution of alternative structures hasbeen shown for tRNAs (Marliere, 1983; Higgs,1993) and for the functional secondary structuresin HIV-1 (Huynen et al., 1996). Here we haveshown that this type of adaptation can also com-pensate the increase of uncertainty that is causedby high environmental temperatures.MethodsCalculating the base-pairing probability matrixRNA secondary prediction is essentially non-local.One needs in principle to calculate the whole set ofalternative secondary structures and their probabilities tocalculate the probability of any speci®c base-pair.McCaskills algorithm ®rst calculates the partition func-tion for the ensemble of all possible secondary structuresusing a dynamical programming algorithm analogous tothat used to calculate the MFE structures (Zuker Stiegler, 1981), and then uses a recursive backtrackingscheme to get the probabilities for individual base-pairs;for details, see McCaskill (1990). The complete base-pair-ing probability matrix is a symmetric n  n matrix inwhich the entry (i,j) is the probability that base i ispaired with base j. In practice, we do not include base-pairs that occur with a probability lower than 10À5inour analyses. The algorithm is part of the Vienna RNAPackage (Hofacker et al., 1994; Zuker Stiegler, 1981),which can be down-loaded with anonymous ftp fromftp.itc.univie.ac.at, directory pub/RNA/ViennaRNA-l.lb.The Vienna RNA Package uses free energy parametersfrom Freier et al. (1986), Jaeger et al. (1989) and He et al.(1991). The calculation does include stabilizing energiesfor single stacked based in free ends and multi-stemloops.Analyzing the base-pairing probability matrixWe characterize the BPPD per base (i) by its Shannonentropy (Si):Si ˆ ÀˆjPiYj log PiYjwhere Pi,j is the probability of base-pairing betweenbases i and j, and Pi,j ˆ i is the probability that the base idoes not pair with any other base. The lower the Svalue, the stronger the distribution is dominated by asingle or a few base-pairing probabilities. In otherwords, S values re¯ect the uncertainty we have aboutthe base-pairing. The average S value of a sequence isthe sum of the S values of its bases divided by the se-quence length.In earlier work we characterized the probability distri-bution per base with its maximum (Huynen et al., 1996),using the term ``well-de®nedness of secondary struc-ture. Although the qualitative results for the analysespresented here for both measures are the same, the en-tropy measure shows a higher correlation with the prob-ability that the MFE structure corresponds to thecomparative structure. In a recent paper, Zuker Jacobsen (1995) introduced ``well determinedness ofsecondary structure, where a structure is well deter-mined if there are no alternative structures within a cer-tain range of free energy. Although the concept isqualitatively very similar to our entropy or well de®ned-Figure 5. The relation between the S value per (random)sequence and its length. For sequences of various lengthclasses, at least 25,000 nucleotides were folded (250sequences of length 100, 50 of length 500, etc.). The ver-tical bars denote one standard deviation. The average Svalue increases sub-linear with the logarithm of thelength of the sequence and starts to saturate at thelength of 500.1110 Assessing the Reliability of RNA Folding
  • 8. ness terms, the latter allow for a quantitatively muchmore re®ned analysis, since they include actual probabil-ities.In principle, one should scale the S value with thelength of the sequence (N), given that the maximumvalue of S is the logarithm of N. For long sequences,however (N 500), we observed that the average S valueper base as a function of sequence length increases lessthan linear with the logarithm of N, and saturates to avalue of about 0.9 (Figure 5), also the distribution of Svalues per sequence does not change for N 500 (datanot shown). Apparently the intrinsic properties of thethermodynamic model of RNA secondary structure pre-diction result even for very long sequences in a limitednumber of bases with which a base has a reasonablechance (P ˆ 10À5) of pairing.Comparing minimum free energy structures withcomparative structuresPredicting base-pairing patterns by comparative se-quence analysis has been reviewed by Gutell et al. (1994).The comparative structures of the RNAs used in thisanalysis (Gutell et al., 1993; Gutell, 1994) can be found onhttp://pundit.colorado.edu:8080/. We consider a basepredicted correctly in the MFE structure if it is paired tothe same base as in the comparative structure, if it issingle-stranded in both the MFE structure and the com-parative structure and if it is single-stranded in the MFEstructure and non-canonically base-paired (non-A-U, G-C or G-U) in the comparative structure. We also exam-ined other de®nitions for correctly predicted bases: onlyconsidering bases that are base-paired in the compara-tive structure, or de®ning that bases that are non-stan-dard base-paired in the comparative structure arealways incorrectly predicted in the MFE structure. Wedid not observe any signi®cant differences in the relationbetween the S value and the probability that a base ispredicted correctly in the MFE structure for the differentde®nitions of correctly predicted bases.AcknowledgementsCharacterizing the base-pairing probability distri-bution by its Shannon entropy was ®rst suggested byIvo Hofacker. M.H. thanks Gerhard Hummer for manyuseful discussions and lessons on thermodynamics. D.K.and R.G. thank the W.M. Keck foundation for its gener-ous support of RNA research on the Boulder Campus.Part of the work was supported by a National Institutesof Health grant (GM-48207). Part of this work was doneunder the auspices of the U.S. Department of Energyand supported by the Center for Nonlinear Studies atLos Alamos National Laboratory and by the Santa Fe In-stitute.ReferencesDalgaard, J. Z. Garrett, R. A. (1993). Archaealhyperthermophile genes. In The Biochemistry ofArchaea (Neuberger, A. van Deenen, L., eds),pp. 535±536, Elsevier, Amsterdam.Draper, D. E. (1996). Strategies for RNA folding. TrendsBiochem. Sci. 21, 145±149.Ekland, E. H., Szostak, J. W. Bartel, D. P. (1995).Structurally complex and highly active RNA ligasesderived from random RNA sequences. Science, 269,364±370.Fields, D. S. Gutell, R. R. (1996). An analysis of largerRNA sequences folded by a thermodynamicmethod. Fold. Des. 1, 419±430.Franzmann, P., Hop¯, P., Weiss, N. Tindall, B. J.(1991). Psychrotrophic, lactic acid-producing bac-teria from anoxic waters in ace lake, Antarctica;Carnobacterium funditum sp. nov. and Carnobacteriumalterfunditum sp. nov. Arch. Microbiol. 156, 255±262.Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N.,Caruthers, M. H., Neilson, T. Turner, D. H.(1986). Improved free-energy parameters for predic-tions of RNA duplex stability. Proc. Natl Acad. Sci.USA, 83, 9373±9377.Gultyaev, A. P., van Batenburg, F. H. Pleij, C. W.(1995). The computer simulation of RNA foldingpathways using a genetic algorithm. J. Mol. Biol.250, 37±51.Gutell, R. R. (1994). Collection of small subunit (16 S-and 16 S-like)ribosomal RNA structures: 1994. Nucl.Acids Res. 22, 3502±3507.Gutell, R. R., Gray, M. W. Schnare, M. N. (1993).A compilation of large subunit (23 S and 23 S-like)ribosomal RNA structures. Nucl. Acids Res. 21,3055±3074.Gutell, R. R., Larsen, N. Woese, C. (1994). Lessonsfrom an evolving rRNA: 16 S and 23 S rRNA struc-tures from a comparative perspective. Microbiol.Rev. 58, 10±26.He, L., Kierzek, R., SantaLucia, J., Walter, A. Turner,D. (1991). Nearest-neighbor parameters for GUmismatches. Biochemistry, 30, 11124±11132.Higgs, P. (1993). RNA secondary structure: a compari-son of real and random sequences. J. Phys. France,3, 43±59.Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer,S., Tacker, M. Schuster, P. (1994). Fast foldingand comparison of RNA secondary structures. Mon-atsh. Chem. 125(2), 167±188.Holt, J. G., Krieg, N. R., Sneath, P. H., Staley, J. T. Williams, S. T. (1994). Bergeys Manual of Determina-tive Bacteriology, Williams Wilkins, Baltimore.Huber, R., Wilharm, T., Huber, R., Trincone, A.,Burggraf, S., Konig, H., Rachel, R., Rockinger, I.,Fricke, H. Stetter, K. (1992). Aquifex pyrophilusgen-nov sp-nov represents a novel group of marinehyperthermophilic hydrogen-oxidizing bacteria.Syst. Appl. Microbiol. 15, 340±351.Huynen, M. A., Konings, D. Hogweg, P. (1992). EqualG and C contents in Histone genes indicate selec-tion pressures on mRNA secondary structure. J. Mol.Evol. 34, 280±291.Huynen, M. A., Perelson, A., Vieira, W. Stadler, P.(1996). Base-pairing probabilities in a completeHIV-1 genome. J. Comput. Biol. 3, 253±274.Jaeger, J. A., Turner, D. H. Zuker, M. (1989).Improved predictions of secondary structures forRNA. Proc. Natl Acad. Sci. USA, 86, 7706±7710.Konings, D. Gutell, R. (1995). A comparison of ther-modynamic foldings with comparatively derivedstructures of 16 S and 16 S-like rRNAs. RNA, 1,559±574.Marliere, P. (1983). Computer building and folding of®ctitious transfer-RNA sequences. Biochimie, 65,267±273.Maxwell, E. S. Fournier, M. J. (1995). The smallnucleolar RNAs. Annu. Rev. Biochem. 64, 897±934.Assessing the Reliability of RNA Folding 1111
  • 9. McCaskill, J. S. (1990). The equilibrium partition func-tion and base pair binding probabilities for RNAsecondary structure. Biopolymers, 29, 1105±1119.Morgan, S. R. Higgs, P. G. (1996). Evidence for kineticeffects in the folding of large RNA molecules.J. Chem. Phys. 105, 7152±7157.Powers, T. Noller, H. F. (1995). Hydroxyl radical foot-printing of ribosomal proteins on 16 S rRNA. RNA,1, 194±209.Saenger, W. (1984). Principles of Nucleic Acid Structure,Springer-Verlag, New York.Tang, C. K. Draper, D. E. (1989). Unusual mRNApseudoknot structure is recognized by a proteintranslational repressor. Cell, 57, 531±536.Woese, C. R. (1987). Bacterial evolution. Microbiol. Rev.51, 221±271.Zuker, M. Jacobsen, A. B. (1995). ``Well-determinedregions in RNA secondary structure prediction:analysis of small subunit ribosomal RNA. Nucl.Acids Res. 23, 2791±2798.Zuker, M. Stiegler, P. (1981). Optimal computer fold-ing of larger RNA sequences using thermodynamicsand auxiliary information. Nucl. Acids Res. 9, 133±148.Edited by D. E. Draper(Received 22 July 1996; received in revised form2 January 1997; accepted 2 January 1997)Supplementary material comprising one Table isavailable from JMB Online1112 Assessing the Reliability of RNA Folding