Gutell 114.jmb.2011.413.0473


Gardner D.P., Ren P., Ozer S., and Gutell R.R. (2011).
Statistical Potentials for Hairpin and Internal Loops Improve the Accuracy of the Predicted RNA Structure.
Journal of Molecular Biology, 413(2):473-483.


Statistical Potentials for Hairpin and Internal Loops Improve the Accuracy of the Predicted RNA Structure

David P. Gardner 1, Pengyu Ren 2, Stuart Ozer 3 and Robin R. Gutell 1⁎

1 Center for Computational Biology and Bioinformatics, Section of Integrative Biology in the School of Biological Sciences, and the Institute for Cellular and Molecular Biology, University of Texas at Austin, 2401 Speedway, Austin, TX 78712, USA
2 Department of Biomedical Engineering, University of Texas at Austin, Austin, TX 78712-1062, USA
3 Microsoft Corporation, 1 Microsoft Way, Redmond, WA 98052, USA

Received 16 February 2011; received in revised form 12 August 2011; accepted 16 August 2011
Available online 23 August 2011
Edited by D. E. Draper

Keywords: statistical potentials; RNA folding; comparative analysis; RNA structure; accuracy of the predicted RNA structure

RNA is directly associated with a growing number of functions within the cell. The accurate prediction of different RNA higher-order structures from their nucleic acid sequences will provide insight into their functions and molecular mechanics. We have been determining statistical potentials for a collection of structural elements that is larger than the set of structural elements for which energy values have been determined experimentally. The experimentally derived free energies and the statistical potentials for canonical base-pair stacks are analogous, demonstrating that statistical potentials derived from comparative data can be used as an alternative energetic parameter. A new computational infrastructure, the RNA Comparative Analysis Database (rCAD), which utilizes a relational database, was developed to manipulate and analyze very large sequence alignments and secondary-structure data sets. Using rCAD, we determined a richer set of energetic parameters for RNA fundamental structural elements including hairpin and internal loops. A new version of RNAfold was developed to utilize these statistical potentials.
Overall, these new statistical potentials for hairpin and internal loops integrated into the new version of RNAfold demonstrated significant improvements in the prediction accuracy of RNA secondary structure.

© 2011 Elsevier Ltd. All rights reserved.

*Corresponding author.
Abbreviations used: rCAD, RNA Comparative Analysis Database; CRW site, Comparative RNA Web site; SRP, signal recognition particle; HCV IRES, hepatitis C virus internal ribosome entry site; IRE, iron response element; HIV DIS, human immunodeficiency virus type 1 dimerization initiation site; HDV, hepatitis delta virus; C/P ratio, comparative/potential ratio.
doi:10.1016/j.jmb.2011.08.033 J. Mol. Biol. (2011) 413, 473–483

Introduction

“The comparative approach indicates far more than the mere existence of a secondary structural element; it ultimately provides the detailed rules for constructing the functional form of each helix. Such rules are a transformation of the detailed physical relationships of a helix and perhaps even reflection of its detailed energetics as well. (One might envision a future time when comparative sequencing provides energetic measurements too subtle for physical chemical measurements to determine).” [1]

The RNA sequences and their structures that we observe today are the last record of their biological ancestry. The snapshots of these RNA structures are the result of their evolution from a simpler structure and organization to their more sophisticated and complex state. Traditional experimental manipulation of biological systems expands our understanding of these systems. These laboratory
experiments are designed to test or expand upon a hypothesis, based in part on the underlying principles of RNA structure and a predicted or experimentally determined higher-order structure. In contrast, Mother Nature's experiments during the evolution of RNA are derived from an apparently random collection of mutations and other changes to the biological systems. The molecules and cells that survive these mutations reveal the characteristics of the RNA that maintain the integrity of their structure and function. Thus, the task for comparative analysis is complementary to hypothesis-driven experimentation. Experimentalists prove, disprove, or determine more details for their hypothesis, while comparative analysis attempts to decipher the principles that are the boundary conditions for the collections of biological data that have survived the evolutionary process.

The first stage of comparative analysis is the collection of a phylogenetically diverse set of RNA sequences and structures, followed by the comparative and covariation analysis of these linear strings of the four nucleotides in RNA, adenine (A), guanine (G), cytosine (C), and uracil (U), to identify a secondary structure that is similar for each of the RNA sequences in the same RNA family. For each of these RNA families, such as tRNA and 16S ribosomal (r)RNA, many different sequences fold into the same higher-order structure. Encrypted in these relationships between sequence and higher-order structure models are the fundamental rules that govern the multiple levels of RNA structure, starting with the formation of the smaller structural elements such as the base pair and base stacking, continuing to larger structural elements that are composed of different types and arrangements of these base pairs and base stacks, and culminating in the formation of significantly larger higher-order structures that have the capacity to dynamically catalyze chemical reactions and change their higher-order structure.
To facilitate the RNA's function, these fundamental rules for RNA structure are also directly associated with the folding of an RNA's primary structure into its secondary, tertiary, and quaternary structures.

Comparative analysis is composed of multiple dimensions of information. New technology provides us with significant amounts of data for each of the dimensions of RNA: (1) nucleotide sequences for organisms that span the entire phylogenetic tree of life, (2) the accurate prediction of the secondary structures that are similar for each of the sequences in a single RNA family, (3) the RNA structural motifs and elements, revealed by analysis of the high-resolution crystal structures and the comparative structure models, that are the basic building blocks of a complete RNA structure, and (4) the historical record of these evolving RNAs, which provides insight into their evolutionary dynamics and phylogenetic relationships.

In contrast to comparative analysis, physical biochemists usually use different experimental methods to solve simplified model systems that are less complex than the structure of the entire RNA. In particular, many laboratories have been obtaining free-energy values for different structural elements.
Approximately 66% of many RNA structures are composed of a set of base pairs that form a regular helix [2,3]. The energetic values for consecutive base pairs have been studied for more than 25 years, initially focusing on canonical (i.e., G:C, A:U, and G:U) and, later, noncanonical base pairs [4–7]. The energetic values for other types of structural elements, including helices with dangling ends [8], hairpin [9], internal [10,11], and multi-stem [12] loops, co-axial stacking [13], and other structural motifs, for example, the UAA/GAN motif [14], have also been determined.

The most widely used program (and its derivatives) to predict an RNA secondary structure with the minimal free energy from a single nucleic acid sequence is Mfold [15]. Early studies revealed that the accuracy of the predicted structures is dependent in part on the free-energy values for different structural motifs and the length of the RNA molecule [16]. As more free-energy values were determined for consecutive base pairs and new RNA structural motifs, the prediction accuracies increased. For example, the identification of the GNRA, UUCG, and CUUG hairpin tetraloops [17,18] and the subsequent determination of their extra-stable free-energy value [19,20] resulted in an improvement in the prediction accuracy [16]. Subsequent studies showed that the prediction accuracy is dependent on the phylogenetic group of the RNA molecule and the distance separating the nucleotides that are base paired (i.e., simple distance) [21]. An analysis of a significantly larger data set substantiated these earlier studies [22] while providing a more detailed assessment of the factors that affect prediction accuracy.
For example, base pairs with a smaller simple distance occur significantly more frequently than base pairs with larger simple distances, and the prediction accuracy of individual base pairs decreases exponentially as their simple distance increases [22].

Thus, a larger number of free-energy values for a variety of structural elements are required to accurately and routinely predict the secondary structure for an RNA molecule. Carl Woese's remarkable foresight in 1983 that comparative analysis can be used to determine RNA energetic measurements of higher-order structural elements was not appreciated at that time. However, this approach has been used in the prediction of protein structure [23–29], suggesting that Woese's idea could have the potential to reveal free-energy values for
RNA that are not easily discernible with experimental methods. Within the past few years, statistical potentials determined with comparative analysis [30,31] for a few RNA structural elements were similar to the free-energy values determined with experimental methods. The replacement of base-pair stacking energetic parameters with statistical potentials generated from an analysis of RNA crystal structures showed similar prediction accuracies [30]. These results emphasize that comparative data can be used to create similar energy values for some structural elements.

Previously, we determined statistical potentials for canonical base-pair stacks that occur within a regular helix. While the statistical potentials for canonical base-pair stacks resulted in a very minimal improvement in the accuracy of the predicted secondary structure, a larger improvement was observed when statistical potentials were determined for the nucleotides immediately flanking the ends of the helix and in small internal loops (1×1, 1×2, 2×2) [31] and used in place of the equivalent experimentally determined energetic parameters.

Statistical learning procedures are another form of a knowledge-based approach for improving energetic parameters. Methods using stochastic context-free grammars showed prediction accuracies [32] near those of RNAstructure [33] and Mfold [15]. CONTRAfold [34] is based upon conditional log-linear models, which are an extension of stochastic context-free grammars [34]. The energetic parameters used by CONTRAfold were selected to maximize the conditional likelihood of the structures within the sequences analyzed. Andronescu et al. utilized constraint generation and Boltzmann likelihood methods to estimate the energetic parameters used by the program MultiFold [35].

Our confidence in Woese's 1983 statement influenced the development of our RNA Comparative Analysis Database (rCAD) (Ozer, Doshi, Xu and Gutell, in press).
One objective of this article is to utilize rCAD to determine a richer set of energetic parameters from our comparative analysis of RNA sequences and their structures. We have developed new statistical potentials for hairpin and internal loops but not for base-pair stacks and multi-stem loops. A modified version of RNAfold [36,37] was developed to utilize this new set of statistical potentials. Another objective of this article is to quantify the effect that our new statistical potentials had on the accuracy of the predicted secondary-structure model.

Results and Discussion

Hairpin loop comparative/potential ratio

To determine the likelihood that a structural element will occur in the correct structure, we determined a ratio of the number of occurrences of that element in the comparative structure model divided by the number of potential occurrences of that element in the same RNA molecular class (see Methods). An example of the comparative/potential (C/P) ratio for tetraloop hairpin loops in bacterial 16S rRNA is shown in Figure 1. The following are a few of the highlights: (1) five of the tetraloop hairpin loops with any closing canonical base pair have a C/P value greater than 0.5; (2) the closing base pair of these hairpin loops can alter the C/P values. For example, the C:G closing base pair usually increases the C/P values significantly for the 20 tetraloops shown in Figure 1.

Fig. 1. The ranked order of the 20 tetraloop hairpin loops (with any closing canonical base pair) with the highest C/P ratios (red bars) is shown along the x-axis. The C/P ratio for each of these tetraloop hairpin loops is shown on the y-axis. The ratios for tetraloop hairpin loops flanked by any canonical base pair are shown as red bars, while the tetraloop hairpin loops flanked by a CG base pair are shown as blue bars. The values are for bacterial 16S rRNA.
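As a minimal sketch of how a C/P ratio could be tallied, the Python below enumerates potential hairpin loops using the definition given in the Methods (a run of loop nucleotides flanked by two or more canonical base pairs) and divides comparative counts by potential counts. The function names, the keying of each loop by its closing pair plus loop sequence, and the toy sequence are illustrative assumptions, not the rCAD implementation.

```python
from collections import Counter

# Canonical pairs as defined in the text: G:C, A:U, and G:U (either orientation).
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def potential_hairpins(seq, loop_len=4, min_stem=2):
    """Yield (closing_pair, loop) for every potential hairpin loop in seq.

    A window of loop_len nucleotides is a potential hairpin loop when the
    min_stem nucleotides on each side can form consecutive canonical pairs.
    """
    seq = seq.upper()
    n = len(seq)
    for start in range(min_stem, n - loop_len - min_stem + 1):
        end = start + loop_len  # index of the first nucleotide after the loop
        stem_ok = all(
            (seq[start - 1 - k], seq[end + k]) in CANONICAL for k in range(min_stem)
        )
        if stem_ok:
            yield seq[start - 1] + seq[end], seq[start:end]


def cp_ratios(potential_counts, comparative_counts):
    """Return C/P for every element that has at least one potential occurrence."""
    return {
        key: comparative_counts.get(key, 0) / p
        for key, p in potential_counts.items()
        if p > 0
    }


# Toy usage: one potential GAAA tetraloop closed by a C:G pair.
potential = Counter(potential_hairpins("GGGCGAAAGCCC"))
ratios = cp_ratios(potential, {("CG", "GAAA"): 1})
```

In practice the potential counts would be accumulated over every sequence in a molecular class and the comparative counts taken from the comparative structure models; the dictionary passed as `comparative_counts` here is a hand-written stand-in for that data.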
The effect of the different closing base pairs on the C/P value for tetraloops is available at the Comparative RNA Web (CRW) site†. Also available are the C/P ratios for hairpin loops of lengths 3–5 and for all of the molecular classes used in this study. The other structural statistics at the CRW site (i.e., nucleotide, base pairs, internal and multi-stem loops) all reveal significant biases in the frequencies of the sequences and their lengths. This general concept is used to create the statistical potentials.

Hairpin loop statistical potentials

Hairpin loop statistical potentials were created and tested using Eqs. (2) and (4) (see Methods). The 16 RNA molecular classes (see Methods) included in the creation of our statistical potentials were the bacterial and eukaryotic 5S rRNA, bacterial and eukaryotic 16S rRNA, bacterial 23S rRNA, tRNA [38], bacterial RNase P class A [39], bacterial signal recognition particle (SRP) [40], U1 spliceosomal RNA [41], hepatitis C virus internal ribosome entry site (HCV IRES) [42], Ykok leader [43], TPP [44] and SAM [45] riboswitches, iron response element (IRE) [46], human immunodeficiency virus type 1 dimerization initiation site (HIV DIS) [47], and UnaL2 Line 3′ element [48]. The first flanking (closing) canonical base pair is included when our comparative and potential counts and statistical potentials are generated.

For hairpin loops of length 4, the values of m and b in Eq. (2) (see Methods) with the best accuracy were 2.25 and 0.8, respectively. For the restricted range of 0 to 2 for −ln(C/P) (see Methods), the statistical potentials of hairpin loops of length 4 will vary from 5.3 to 0.8 kcal/mol, with 5.3 kcal/mol set as the default value. Hairpin loops of different sizes will have different m and b values (see Supplemental Data, Excel file HPComparison).
Statistical potentials were generated for 908 hairpin loops plus default values.

The approach used to determine the statistical potentials for hairpin loops is illustrated with a comparison with recent experimentally derived tetraloop free-energy values [49]. For the 1536 possible combinations (256 hairpin loops × 6 base pairs), 1225 (80%) had an absolute difference less than 0.5 kcal/mol and 1243 (81%) had an absolute difference less than 1.0 kcal/mol. A total of 191 (12%) combinations had absolute differences between 1.025 and 2.0 kcal/mol, and 102 (7%) combinations had differences between 2.075 and 3.1 kcal/mol (Supplemental Data, see Excel file HPComparison). The 14 tetraloop closing base-pair combinations with the largest absolute difference all had smaller kcal/mol values and thus are more energetically stable. However, the majority of the combinations (232 out of 311) with absolute difference greater than 0.5 kcal/mol had experimentally derived energetic values smaller (i.e., more stable) than the derived statistical potential.

For triloops, the experimentally derived free-energy values were taken from Thulasi et al. [50] Only 6 out of the 384 (0.2%) triloop combinations had an absolute difference of less than 1.0 kcal/mol between the experimentally derived free energies and statistical potentials. Most of the triloops (369 out of 384) (94%) had absolute differences between 1.0 and 2.0 kcal/mol. The absolute difference for the other 23 combinations ranged from 2.028 to 2.61 kcal/mol (Supplemental Data, see Excel file HPComparison). For the pentaloop comparison, the energetic parameters from TURNER04 [6,51] were used. Of the 6144 possible pentaloop combinations, 3354 (55%) had an absolute difference of 0.5 kcal/mol or less and 4674 (76%) had an absolute difference less than 1.0 kcal/mol.
A total of 1146 (19%) had an absolute difference between 1.02 and 2.0 kcal/mol, 287 (5%) had an absolute difference between 2.068 and 3.0 kcal/mol, and 36 (0.6%) had an absolute difference between 3.1 and 4.0 kcal/mol. The remaining pentaloop has an absolute difference of 4.408 kcal/mol (Supplemental Data, see Excel file HPComparison). Statistical potentials have been created for hairpin loops for all observed lengths in the molecular classes studied with comparative methods.

Internal loop statistical potentials

Internal loop statistical potentials were created using Eqs. (2) and (4). The same 16 RNA molecular classes used in the generation of the hairpin loop statistical potentials were used for the internal loops. Both base pairs flanking an internal loop are included in the generation of statistical potentials for internal loops. For 1×1 internal loops, the values of m and b in Eq. (2) (see Methods) with the best accuracy were 2.5 and −1.0, respectively. For the restricted range of 0 to 2 for −ln(C/P) (see Methods), the statistical potentials of 1×1 internal loops will vary from 4.0 to −1.0 kcal/mol, with 4.0 kcal/mol set as the default value. Internal loops of different sizes will have different m and b values (see Supplemental Data, Excel file ILComparison). Statistical potentials were generated for 1368 internal loops plus default values.

The approach used to determine the statistical potentials for internal loops is illustrated with 1×1 internal loops. For these internal loops, the absolute differences between the statistical potentials and the TURNER04 [6] experimentally derived energetic parameters were usually large. There are 360 possible 1×1 internal loops (6 base pairs × 6 base pairs × 10 internal loops). Only 57 out of the 360 (16%) had an
absolute difference of less than 1.0 kcal/mol and only 10 (3%) had absolute differences between 1.0 and 2.0 kcal/mol. A total of 130 (36%) had absolute differences between 2.0 and 3.0 kcal/mol, and 111 (30%) had absolute differences between 3.0 and 4.0 kcal/mol. The 30 1×1 internal loops with the largest difference between experimentally derived free energies and statistical potentials all had a G–G internal loop. The values for the experimentally derived free energies and statistical potentials for all 360 1×1 and all 9216 2×2 internal loops are in the Supplemental Data (Excel file ILComparison). Statistical potentials have been created for internal loops for any length observed on the 5′ and 3′ sides of the loop in those molecular classes studied with comparative methods.

Evaluation of hairpin loop statistical potentials

The prediction of an RNA structure is evaluated with the statistical potentials for hairpin loops. In previous versions of RNAfold, the only hairpin loops with specific free-energy values were triloops and tetraloops. Free-energy values for longer hairpin loops were calculated using the length of the hairpin loop and the composition of the first and last nucleotides of the hairpin loop and the flanking (closing) base pair. To determine if statistical potentials generated with Eqs. (2) and (4) would improve the accuracy of RNA secondary-structure prediction, we modified the program RNAfold [36,37] to accept detailed statistical potentials for hairpin loops of any length. When testing the hairpin loop statistical potentials, the experimentally derived energetic parameters (TURNER99) for base-pair stacks and internal and multi-stem loops were used.

Similar to previous studies [21,31], sensitivity has been used to gauge prediction accuracy. Sensitivity is defined as the number of canonical base pairs in the predicted minimal free-energy structure present in the comparative model divided by the total number of comparative canonical base pairs.
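The sensitivity measure just defined can be sketched in a few lines of Python. This is an illustration only: it assumes that each structure is supplied as a collection of (i, j) position pairs already restricted to canonical base pairs, and the function name is hypothetical.

```python
def sensitivity(predicted_pairs, comparative_pairs):
    """Fraction of comparative base pairs that also occur in the prediction.

    Both arguments are iterables of (i, j) position pairs; the order of the
    two positions inside a pair does not matter.
    """
    comparative = {tuple(sorted(p)) for p in comparative_pairs}
    if not comparative:
        raise ValueError("comparative model contains no base pairs")
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    return len(predicted & comparative) / len(comparative)


# Toy usage: two of the four comparative pairs are recovered.
score = sensitivity([(1, 10), (2, 9), (3, 7)], [(1, 10), (2, 9), (3, 8), (4, 7)])
```

Note that predicted pairs absent from the comparative model do not lower this score, since the denominator counts only comparative pairs.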
Differences in prediction accuracy are defined as (sensitivity using statistical potentials) − (sensitivity using other energetic parameters and/or folding programs). If a program returns suboptimal structures, only the optimal structure is used in our analysis.

Results in the Supplemental Data (supplemental.pdf, pages 1-4) reveal that the statistical potentials for hairpin loops improved the prediction of the RNA structure.

Evaluation of internal loop statistical potentials

To utilize the new internal loop statistical potentials, the functionality of RNAfold was again extended to accept a wider range of energetic parameters. The original version of RNAfold had specific free-energy values for internal loops of lengths 1×1, 1×2, 2×2, and 2×3. For larger internal loops, the calculation of the experimentally derived free-energy values was based on the number of nucleotides in the internal loop plus the composition of the ends of the internal loop and both flanking base pairs. The modified RNAfold accepts specific free-energy values for internal loops of any size. When testing internal loop statistical potentials, the experimentally derived energetic parameters (TURNER99) for base-pair stacks and hairpin and multi-stem loops are used.

Results in the Supplemental Data (supplemental.pdf, pages 1-4) reveal that the statistical potentials for the internal loops improved the prediction of the RNA structure.

Combining statistical potentials and comparison with other programs

The prediction accuracy using the combination of hairpin and internal loop molecule-independent statistical potentials for all 16 RNA molecular classes was compared with the results from four other RNA folding programs: RNAfold [36] (TURNER99), RNAstructure [33] using just TURNER04 and using TURNER04 plus the newer triloop and tetraloop thermodynamic parameters [49,50], CONTRAfold [34], and MultiFold (BL⁎ parameter set) [35]. RNAfold and RNAstructure utilize experimentally derived energetic parameters while CONTRAfold and MultiFold use parameters derived
with statistical learning. When testing the hairpin and internal loop statistical potentials with RNAfold, the experimentally derived energetic parameters (TURNER99) for base-pair stacks and multi-stem loops are used.

Overall, the combined molecule-independent statistical potentials outperformed the other four programs (Fig. 2a and b). On average, over the 16 RNA molecular classes, our statistical potentials scored 15% higher than RNAfold (TURNER99), 14% higher than RNAstructure (TURNER04), 14% higher than RNAstructure (TURNER04 Plus), 12% higher than CONTRAfold, and 13% higher than MultiFold. Our statistical potentials outperformed all four programs for all 16 RNA molecular classes with the exception of the Ykok leader RNA, where RNAfold (TURNER99) matched our score, and RNase P A, where CONTRAfold scored 3% higher. The difference in accuracy between our statistical potentials and the competing program with the best results for a given molecule ranged from −3% (RNase P A) to 15% (UnaL2 Line 3′ element) (Fig. 2a and b). On average, our statistical potentials outperformed the program with the best results for a given RNA molecule by 7% (Supplemental Data, see Excel file Accuracies.xlsx). Standard deviation results for each program on each molecule are contained in the Supplemental Data (supplemental.pdf, pages 5-6).
Two methods were used to cross-validate the statistical potentials. The first utilized the same method used for MultiFold [35]. The results in the Supplemental Data reveal that the accuracies of the predicted RNA secondary structures are very similar between the training and testing on the full set of sequences and on an 80%/20% split (see Supplemental Data, supplemental.pdf, pages 7-8). The second method tested our statistical potentials and the four other RNA folding programs against nine control RNA molecular classes (see Methods) that were not used in the generation of the statistical potentials. The control molecular classes are RNase P B [39], Hammerhead III ribozyme [52], purine riboswitch [53], hepatitis delta virus (HDV) ribozyme [54], HIV ribosomal frameshift signal [55], GEMM cis-regulatory element [56], R2 RNA element [57], and mitochondrial and archaeal 16S rRNA [38]. On average, over these nine RNA molecular classes, our statistical potentials essentially equaled the performance of the four other RNA folding programs (Supplemental Data, see supplemental.pdf, pages 9-14).

Given that our approach utilizes comparative data for generating the statistical potentials, it is not surprising that they perform only on par with the other RNA folding programs over the control RNA molecular classes. The nine RNA molecular classes in our test set must have some structural elements that are not present in the original 16 classes. This indicates that increasing the number of RNA molecular classes used to generate the statistical potentials is necessary before the statistical potentials will have higher accuracies for a larger number of molecular classes. During the course of these studies, we observed improvements in the accuracies for a larger number of molecular classes as the training set included more RNA families.

Fig. 2. RNA secondary-structure prediction accuracies for four RNA folding programs, RNAfold, RNAstructure (TURNER04 and TURNER04 plus the newer triloop and tetraloop thermodynamic parameters), CONTRAfold, and MultiFold, and for RNAfold using statistical potentials. Results for 16 RNA molecular classes are divided into (A) bacterial 5S rRNA, eukaryotic 5S rRNA, bacterial 16S rRNA, bacterial 23S rRNA, tRNA, eukaryotic 16S rRNA, RNase P A, and bacterial SRP and (B) U1 spliceosomal RNA, HCV IRES, Ykok leader, TPP and SAM riboswitches, IRE, HIV DIS, and UnaL2 Line 3′ element.

RNA folding website

RNA sequences can be folded on our modified RNAfold program that contains our new statistical potentials‡. The C# code and the new statistical potentials will also be made available at this website.

Summary

The focus of this study was to improve the energetic parameters for hairpin and internal loops. Previously, the base-pair stack statistical potentials created with comparative data, on average, only slightly improved the prediction accuracy, demonstrating that statistical potentials can generate analogous energetic parameters [31]. This minor improvement in the accuracy from the base-pair stack statistical potentials was not as much as we anticipated. However, our previous analysis did reveal that statistical potentials for the flanking nucleotides of the hairpin and internal loops produced a more pronounced improvement, suggesting that a richer set of statistical potentials for the loop regions of the secondary structure could further enhance prediction accuracy.

The new comparative analysis system in development in the Gutell laboratory, rCAD (Ozer, Doshi, Xu and Gutell, in press), was used to determine this collection of statistical potentials that represents more of the structural elements present in RNA molecules. This new set of energetic parameters used a new structural statistic, the C/P ratio.
The RNAfold program was modified to utilize our larger set of statistical potentials since it originally had more limited hairpin and internal loop energetic parameters.

Fig. 3. (a) Nucleotides in the tetraloop hairpin loops that occur in the comparative structure for a modified Escherichia coli 16S rRNA secondary structure between positions 118 and 241 are colored blue. For this figure, the E. coli sequence was changed at a few positions to create better examples of potential base pairings that form hairpin loops. Potential tetraloop hairpin loops, defined as four nucleotides that are closed by two or more canonical base pairs, are colored red. The base pairs flanking the tetraloop hairpin loops are circled and connected with a red line. Nucleotides that are base paired in the comparative structure are connected with a thick black line. (c) Nucleotides in the internal loop that occur in our modified Escherichia coli comparative secondary structure between positions 139 and 184 are colored blue. (b and c) Nucleotides in potential internal loops are colored red, and the nucleotides that form a set of base pairs within the potential helix in the internal loop are circled and connected with a red line. Nucleotides that are base paired in the comparative structure are connected with a thick black line.

This modified RNAfold program and our new hairpin and internal loop statistical potentials demonstrated significant increases in the prediction accuracy of RNA secondary structure. Over 16 RNA molecular classes, the statistical potentials always outperformed the four existing RNA folding programs with the exception of two RNA molecules where our accuracies were equal to or slightly worse
than one other program. On average, the improvements ranged from 12% to 15% compared to the competing four programs. Our program predicted the RNA secondary structure more accurately in 78 of the 80 comparisons. When our program was not included in these comparisons, RNAfold (TURNER99) and RNAstructure (TURNER99+) outperformed the other programs in 19 out of 64 comparisons; RNAstructure (TURNER04), MultiFold, and CONTRAfold outperformed the other programs in 20 out of 64 comparisons, 39 out of 64 comparisons, and 45 out of 64 comparisons, respectively. Our statistical potentials also performed approximately the same as the other four programs when tested over the nine additional control RNA molecular classes that were not used in the generation of the statistical potentials.

Our intention with this work was to determine if this generalized approach would improve the prediction of RNA secondary structure beyond current approaches. Given that this approach did significantly increase prediction accuracy in the 16 training RNA molecular classes, we will extend and improve upon it in several ways.

We will add more RNA molecular classes when generating the statistical potentials. We will also aim to identify the most essential structural elements and components that will produce the highest accuracy of the predicted RNA structure. This should help identify general structural families and reduce the number of needed energetic parameters. We will also investigate extending the statistical potentials and folding program to utilize non-nearest-neighbor effects.

Methods

Comparative and potential secondary structural elements

A potential secondary structural element, such as a hairpin loop, an internal loop, or a helix, is defined as the set of nucleotides that forms the motif.
This potential structural element may or may not occur in the comparative secondary structure of the RNA molecule, while every comparative structural element is a potential structural element. Our objective is to generate a statistical potential from the ratio of comparative and potential structural elements.

Potential hairpin loops are a set of consecutive nucleotides of a specific length that are flanked by two or more canonical base pairs in the RNA sequence (Fig. 3). The determination of a potential internal loop initiates with a comparative helix. The nucleotides flanking the 5′ and 3′ ends of this helix that contain at least two potential canonical base pairs are identified (Fig. 3). The nucleotides between the comparative and the potential helices are defined as a potential internal loop.

Creation of statistical potentials

A basic assumption in the creation of the statistical potentials is:

$$-\ln(C/P) \sim \text{Free energy} \tag{1}$$

where C is the frequency of a structural element appearing in the comparative structure and P is the potential frequency of the structural element. Every comparative structure is considered to be a potential structure as well, so C/P will have values in the range between 0 and 1. A typical statistical potential utilizes −ln(C) with C normalized with the frequency of individual nucleotides. The formula proposed here can be considered as normalized by the potential to form a structure element. A statistical potential is determined with the equation:

$$-m \ln(C/P) + b = SP \tag{2}$$

where SP is a statistical potential and m and b are global parameters that will be selected to optimize the overall accuracy of the folding program. For the vast majority of structural elements, the comparative count will be 0 or the C/P ratio too low and the default value will be used. Restricting the range of values for −ln(C/P) between 0 and 2 provides the best prediction accuracies; this restricts C/P values to a minimum of 0.01.
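A short Python sketch of Eq. (2) with the restricted range may help make the mapping concrete. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the restriction is applied as a cap of 2 on the term −ln(C/P) (the text also describes a floor on C/P; the cap on the logarithmic term is the choice that reproduces the endpoints quoted in the Results), and elements with no comparative or potential occurrences are given the value at the cap.

```python
import math


def statistical_potential(comparative, potential, m, b, cap=2.0):
    """Map comparative and potential counts to a statistical potential.

    Implements SP = m * x + b with x = -ln(C/P) restricted to [0, cap].
    Elements never seen in the comparative or potential sets get x = cap.
    """
    if comparative <= 0 or potential <= 0:
        return m * cap + b
    x = -math.log(comparative / potential)
    x = min(max(x, 0.0), cap)  # restrict the log term to the stated range
    return m * x + b


# Tetraloop parameters quoted in the Results: m = 2.25, b = 0.8.
default_tetraloop = statistical_potential(0, 120, m=2.25, b=0.8)    # value at the cap
most_stable_tetraloop = statistical_potential(50, 50, m=2.25, b=0.8)  # C/P = 1
```

With the tetraloop parameters this yields the 5.3 and 0.8 kcal/mol endpoints stated in the Results, and with the 1×1 internal loop parameters (m = 2.5, b = −1.0) it yields 4.0 and −1.0 kcal/mol. The counts 120 and 50 are made-up inputs chosen only to exercise the two endpoints.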
If a structural element has no potential structures or the C/P value is less than 0.01, the C/P value is set to 0.01. The default value for a structural element is set to:

$$ -m \times 2 + b = \text{default} \qquad (3) $$

Molecule-independent statistical potential

Initially, a set of statistical potentials will be generated for each type of RNA molecular class analyzed (e.g., bacterial 16S rRNA). The statistical potentials for each molecule-specific set will not have detailed values for all possible structural elements. Our ultimate goal is to create one set of statistical potentials that is applicable to all types of RNAs. To create a molecule-independent set of statistical potentials, we treated each molecule-dependent set as a member of a Boltzmann distribution. For every secondary structural element, the molecule-independent statistical potential is a Boltzmann-weighted sum of statistical potentials from each molecule i:

$$ SP_{\text{molecule-ind}} = \frac{\sum_{i \in I} \exp(-SP_i / k_b T)\, SP_i}{\sum_{i \in I} \exp(-SP_i / k_b T)} \qquad (4) $$

CRW site

The Gutell laboratory's CRW site§ [38] has a diverse collection of secondary-structure models predicted from comparative analysis for different phylogenetic groups of the 5S, 16S, and 23S rRNAs; tRNAs for different amino acids; and group I and II introns.
  9. 9. The number of secondary-structure diagrams currently available is 1092, while the number of sequences with only base-pair information is 54,525. The accuracy of these secondary-structure models is extremely high; approximately 97% of the base pairs in the ribosomal RNA structures predicted with comparative methods are present in the high-resolution crystal structure [58].

RNA Comparative Analysis Database

All sequence and comparative structure information is stored in rCAD. At the time the manuscript was submitted, rCAD contained 293,039 aligned RNA sequences and their comparative structure information. These data are utilized to determine the number of structural elements in the comparative structures. rCAD also contains structural statistics (comparative and potential counts) on nearly 500,000 different internal loops and almost 2.3 million different hairpin loops.

RNA molecular classes

The RNA sequences and structures initially studied for their comparative and potential counts of structural elements and used in the generation of the statistical potentials were aligned and created by the Gutell laboratory∥. They include sequences from the bacterial and eukaryotic phylogenetic groups and from 5S, 16S, and 23S rRNA and tRNA.

Additional RNA sequences and structures were obtained from the Rfam website [59]. These included bacterial RNase P class A, bacterial SRP, U1 spliceosomal RNA, HCV IRES, Ykok leader, TPP and SAM riboswitches, IRE, HIV DIS, and UnaL2 Line 3′ element. All of these sequences and structures were taken from their respective Rfam full alignments.

For the training and initial testing of the statistical potentials, sequences with a similarity of greater than 97% were removed to minimize the folding of duplicate RNA sequences. Also, only complete or nearly complete sequences were analyzed.
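The redundancy filter described above can be sketched as follows. The source does not state how similarity was computed or which member of a near-identical pair was retained, so this sketch assumes percent identity over aligned columns and a greedy first-come selection; both function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences.
    Assumption: columns where both sequences have a gap ('-') are ignored."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def filter_redundant(aligned: list[str], threshold: float = 0.97) -> list[str]:
    """Greedily keep a sequence only if its identity to every sequence
    already kept does not exceed the threshold."""
    kept: list[str] = []
    for seq in aligned:
        if all(percent_identity(seq, k) <= threshold for k in kept):
            kept.append(seq)
    return kept


# Toy usage: the second sequence duplicates the first and is dropped.
alignment = ["GGGAAACCC-", "GGGAAACCC-", "GGGUUUCCCA"]
unique = filter_redundant(alignment)
```

A greedy pass depends on input order, so a different ordering of the alignment can retain a different representative of each near-identical group.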
The total number of RNA sequences analyzed for testing RNA secondary-structure accuracy for each molecular class is as follows: 1094 bacterial and 258 eukaryotic 16S rRNA, 65 bacterial 23S rRNA, 230 bacterial and 310 eukaryotic 5S rRNA, 2112 tRNA, 274 RNase P class A, 937 U1 spliceosomal RNA, 1049 bacterial SRP, 550 HCV IRES, 188 Ykok leader, 726 TPP and 589 SAM riboswitches, 371 IRE, 136 HIV DIS, and 572 UnaL2 Line 3′ element. The number of sequences and their average length are available in the Supplemental Data (see supplemental.pdf).

For the additional testing of control RNA molecules, seven sets of RNA sequences and structures were obtained from the Rfam website. These are the RNase P B, Hammerhead III ribozyme, purine riboswitch, HDV ribozyme, HIV ribosomal frameshift signal, GEMM cis-regulatory element, and R2 RNA element. All of these sequences were taken from their respective Rfam seed alignments. Two sets of RNA sequences and structures are from the Gutell laboratory: mitochondrial and archaeal 16S rRNA.

The total number of RNA sequences for each of the nine classes is as follows: 366 RNase P B, 84 Hammerhead III ribozymes, 133 purine riboswitches, 33 HDV ribozymes, 145 HIV ribosomal frameshift signal, 162 GEMM cis-regulatory element, and 15 R2 RNA element. There were 128 and 143 RNA sequences tested for mitochondrial and archaeal 16S rRNA, respectively. The number of sequences and their average length are available in the Supplemental Data (see supplemental.pdf).

Acknowledgements

This article is dedicated to Dr. Carl Woese for his intuition that comparative analysis could reveal “energetic measurements too subtle for physical chemical measurements to determine” and to our erstwhile colleague Dr. Jim Gray, whose pioneering work on transaction control enables database systems to be the foundation for Jim's vision of the “Fourth Paradigm”, following experimental, theoretical, and computer science.
Jim appreciated that the overwhelming amount of multiple dimensions of information was not strictly a computer science problem, but instead a collaborative effort between computer scientists and (in this case) molecular biologists. The authors are also most grateful to Yuxing Li, Jamie Cannone, Ame Wongsa, and Yanan Jiang for help establishing the RNA folding website. Grants from the Robert A. Welch Foundation [grant numbers F-1691 (P.R.) and F-1427 (R.G.)], National Institutes of Health [grant numbers R01 GM0796686 (P.R.), R01 GM067317 (R.G.), and GM085337 (R.G.)], and Microsoft Research TCI/ER (R.G.) were essential for this project to come to fruition. The authors appreciated the constructive comments from the reviewers and the editor.

Supplementary Data

Supplementary data to this article can be found online at doi:10.1016/j.jmb.2011.08.033

References

1. Woese, C. R., Gutell, R., Gupta, R. & Noller, H. F. (1983). Detailed analysis of the higher-order structure of 16S-like ribosomal ribonucleic acids. Microbiol. Rev. 47, 621–669.
2. Gutell, R. R., Weiser, B., Woese, C. R. & Noller, H. F. (1985). Comparative anatomy of 16-S-like ribosomal RNA. Prog. Nucleic Acid Res. Mol. Biol. 32, 155–216.
3. Gutell, R. R., Cannone, J. J., Shang, Z., Du, Y. & Serra, M. J. (2000). A story: unpaired adenosine bases in ribosomal RNAs. J. Mol. Biol. 304, 335–354.
∥ Available at
  10. 10. 4. Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N., Caruthers, M. H., Neilson, T. & Turner, D. H. (1986). Improved free-energy parameters for predictions of RNA duplex stability. Proc. Natl Acad. Sci. USA, 83, 9373–9377.
5. Mathews, D. H., Sabina, J., Zuker, M. & Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288, 911–940.
6. Turner, D. H. & Mathews, D. H. (2010). NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 38, D280–D282.
7. Xia, T., SantaLucia, J., Jr, Burkard, M. E., Kierzek, R., Schroeder, S. J., Jiao, X. et al. (1998). Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry, 37, 14719–14735.
8. Liu, J. D., Zhao, L. & Xia, T. (2008). The dynamic structural basis of differential enhancement of conformational stability by 5′- and 3′-dangling ends in RNA. Biochemistry, 47, 5962–5975.
9. Antao, V. P. & Tinoco, I., Jr (1992). Thermodynamic parameters for loop formation in RNA and DNA hairpin tetraloops. Nucleic Acids Res. 20, 819–824.
10. Schroeder, S. J., Burkard, M. E. & Turner, D. H. (1999). The energetics of small internal loops in RNA. Biopolymers, 52, 157–167.
11. Walter, A. E., Wu, M. & Turner, D. H. (1994). The stability and structure of tandem GA mismatches in RNA depend on closing base pairs. Biochemistry, 33, 11349–11354.
12. Diamond, J. M., Turner, D. H. & Mathews, D. H. (2001). Thermodynamics of three-way multibranch loops in RNA. Biochemistry, 40, 6971–6981.
13. Walter, A. E. & Turner, D. H. (1994). Sequence dependence of stability for coaxial stacking of RNA helixes with Watson–Crick base paired interfaces. Biochemistry, 33, 12715–12719.
14. Shankar, N., Kennedy, S. D., Chen, G., Krugh, T. R. & Turner, D. H. (2006). The NMR structure of an internal loop from 23S ribosomal RNA differs from its structure in crystals of 50S ribosomal subunits. Biochemistry, 45, 11776–11789.
15. Zuker, M. (1989). On finding all suboptimal foldings of an RNA molecule. Science, 244, 48–52.
16. Jaeger, J. A., Turner, D. H. & Zuker, M. (1989). Improved predictions of secondary structures for RNA. Proc. Natl Acad. Sci. USA, 86, 7706–7710.
17. Woese, C. R., Winker, S. & Gutell, R. R. (1990). Architecture of ribosomal RNA: constraints on the sequence of “tetra-loops”. Proc. Natl Acad. Sci. USA, 87, 8467–8471.
18. Michel, F. & Westhof, E. (1990). Modelling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis. J. Mol. Biol. 216, 585–610.
19. Tuerk, C., Gauss, P., Thermes, C., Groebe, D. R., Gayle, M., Guild, N. et al. (1988). CUUCGG hairpins: extraordinarily stable RNA secondary structures associated with various biochemical processes. Proc. Natl Acad. Sci. USA, 85, 1364–1368.
20. Antao, V. P., Lai, S. Y. & Tinoco, I., Jr (1991). A thermodynamic study of unusually stable RNA and DNA hairpins. Nucleic Acids Res. 19, 5901–5905.
21. Konings, D. A. & Gutell, R. R. (1995). A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA, 1, 559–574.
22. Doshi, K. J., Cannone, J. J., Cobaugh, C. W. & Gutell, R. R. (2004). Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105.
23. Tanaka, S. & Scheraga, H. A. (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 9, 945–950.
24. Moult, J. (2005). A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. 15, 285–289.
25. Floudas, C. A., Fung, H. K., McAllister, S. R., Monnigmann, M. & Rajgaria, R. (2006). Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci. 61, 966–988.
26. Kryshtafovych, A., Venclovas, C., Fidelis, K. & Moult, J. (2005). Progress over the first decade of CASP experiments. Proteins, 61, 225–236.
27. Shen, M. Y. & Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Sci. 15, 2507–2524.
28. Summa, C. M. & Levitt, M. (2007). Near-native structure refinement using in vacuo energy minimization. Proc. Natl Acad. Sci. USA, 104, 3177–3182.
29. Xu, B. S., Yang, Y. D., Liang, H. J. & Zhou, Y. Q. (2009). An all-atom knowledge-based energy function for protein–DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins: Struct. Funct. Bioinform. 76, 718–730.
30. Dima, R. I., Hyeon, C. & Thirumalai, D. (2005). Extracting stacking interaction parameters for RNA from the data set of native structures. J. Mol. Biol. 347, 53–69.
31. Wu, J. C., Gardner, D. P., Ozer, S., Gutell, R. R. & Ren, P. (2009). Correlation of RNA secondary structure statistics with thermodynamic stability and applications to folding. J. Mol. Biol. 391, 769–783.
32. Dowell, R. D. & Eddy, S. R. (2004). Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5, 71.
33. Reuter, J. S. & Mathews, D. H. (2010). RNAstructure: software for RNA secondary structure prediction and analysis. BMC Bioinformatics, 11, 129.
34. Do, C. B., Woods, D. A. & Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22, e90–e98.
35. Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H. & Murphy, K. P. (2010). Computational approaches for RNA energy parameter estimation. RNA, 16, 2304–2318.
36. Hofacker, I. L. (2003). Vienna RNA secondary structure server. Nucleic Acids Res. 31, 3429–3431.
37. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M. & Schuster, P. (1994). Fast folding and comparison of RNA secondary structures. Monatsh. Chem. 125, 167–188.
  11. 11. 38. Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R., D'Souza, L. M., Du, Y. et al. (2002). The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics, 3, 2.
39. Brown, J. W. (1999). The Ribonuclease P Database. Nucleic Acids Res. 27, 314.
40. Rosenblad, M. A., Gorodkin, J., Knudsen, B., Zwieb, C. & Samuelsson, T. (2003). SRPDB: Signal Recognition Particle Database. Nucleic Acids Res. 31, 363–364.
41. Kretzner, L., Krol, A. & Rosbash, M. (1990). Saccharomyces cerevisiae U1 small nuclear RNA secondary structure contains both universal and yeast-specific domains. Proc. Natl Acad. Sci. USA, 87, 851–855.
42. Gallego, J. & Varani, G. (2002). The hepatitis C virus internal ribosome-entry site: a new target for antiviral research. Biochem. Soc. Trans. 30, 140–145.
43. Barrick, J. E., Corbino, K. A., Winkler, W. C., Nahvi, A., Mandal, M., Collins, J. et al. (2004). New RNA motifs suggest an expanded scope for riboswitches in bacterial genetic control. Proc. Natl Acad. Sci. USA, 101, 6421–6426.
44. Miranda-Rios, J., Navarro, M. & Soberon, M. (2001). A conserved RNA structure (thi box) is involved in regulation of thiamin biosynthetic gene expression in bacteria. Proc. Natl Acad. Sci. USA, 98, 9736–9741.
45. Grundy, F. J. & Henkin, T. M. (1998). The S box regulon: a new global transcription termination control system for methionine and cysteine biosynthesis genes in Gram-positive bacteria. Mol. Microbiol. 30, 737–749.
46. Hentze, M. W. & Kuhn, L. C. (1996). Molecular control of vertebrate iron metabolism: mRNA-based regulatory circuits operated by iron, nitric oxide, and oxidative stress. Proc. Natl Acad. Sci. USA, 93, 8175–8182.
47. McBride, M. S. & Panganiban, A. T. (1996). The human immunodeficiency virus type 1 encapsidation site is a multipartite RNA element composed of functional hairpin structures. J. Virol. 70, 2963–2973.
48. Baba, S., Kajikawa, M., Okada, N. & Kawai, G. (2004). Solution structure of an RNA stem–loop derived from the 3′ conserved region of eel LINE UnaL2. RNA, 10, 1380–1387.
49. Sheehy, J. P., Davis, A. R. & Znosko, B. M. (2010). Thermodynamic characterization of naturally occurring RNA tetraloops. RNA, 16, 417–429.
50. Thulasi, P., Pandya, L. K. & Znosko, B. M. (2010). Thermodynamic characterization of RNA triloops. Biochemistry, 49, 9058–9062.
51. Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M. & Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA, 101, 7287–7292.
52. Murray, J. B., Terwey, D. P., Maloney, L., Karpeisky, A., Usman, N., Beigelman, L. & Scott, W. G. (1998). The structural basis of hammerhead ribozyme self-cleavage. Cell, 92, 665–673.
53. Mandal, M., Boese, B., Barrick, J. E., Winkler, W. C. & Breaker, R. R. (2003). Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell, 113, 577–586.
54. Chen, P. J., Kalpana, G., Goldberg, J., Mason, W., Werner, B., Gerin, J. & Taylor, J. (1986). Structure and replication of the genome of the hepatitis delta-virus. Proc. Natl Acad. Sci. USA, 83, 8774–8778.
55. Biswas, P., Jiang, X., Pacchia, A. L., Dougherty, J. P. & Peltz, S. W. (2004). The human immunodeficiency virus type 1 ribosomal frameshifting site is an invariant sequence determinant and an important target for antiviral therapy. J. Virol. 78, 2082–2087.
56. Sudarsan, N., Lee, E. R., Weinberg, Z., Moy, R. H., Kim, J. N., Link, K. H. & Breaker, R. R. (2008). Riboswitches in eubacteria sense the second messenger cyclic di-GMP. Science, 321, 411–413.
57. Ruschak, A. M., Mathews, D. H., Bibillo, A., Spinelli, S. L., Childs, J. L., Eickbush, T. H. & Turner, D. H. (2004). Secondary structure models of the 3′ untranslated regions of diverse R2 RNAs. RNA, 10, 978–987.
58. Gutell, R. R., Lee, J. C. & Cannone, J. J. (2002). The accuracy of ribosomal RNA comparative structure models. Curr. Opin. Struct. Biol. 12, 301–310.
59. Gardner, P. P., Daub, J., Tate, J. G., Nawrocki, E. P., Kolbe, D. L., Lindgreen, S. et al. (2009). Rfam: updates to the RNA families database. Nucleic Acids Res. 37, D136–D140.