Gutell 116.rpass.bibm11.pp618-622.2011


Published on

Jiang Y., Xu W., Thompson L.P., Gutell R., and Miranker D. (2011).
R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm.
Proceedings of 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, GA. November 12-15, 2011. IEEE Computer Society, Washington, DC, USA. pp. 618-622.

Published in: Technology, Business
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Gutell 116.rpass.bibm11.pp618-622.2011

  1. 1. R-PASS: A Fast Structure-based RNA Sequence Alignment AlgorithmYanan Jianga, Weijia Xub, Lee Parnell Thompsonc, Robin R. Gutella, Daniel P. MirankercaInstitute for Cellular and Molecular Biology bTexas Advanced Computing Center cDepartment of Computer ScienceThe University of Texas at Austin, Austin, USAa{yanan.jiang, robin.gutell} c{parnell, miranker}@cs.utexas.eduAbstract—We present a fast pairwise RNA sequencealignment method using structural information, named R-PASS (RNA Pairwise Alignment of Structure and Sequence),which shows good accuracy on sequences with low sequenceidentity and significantly faster than alternative methods. Themethod begins by representing RNA secondary structure as aset of structure motifs. The motifs from two RNAs are thenused as input into a bipartite graph-matching algorithm, whichdetermines the structure matches. The matches are then usedas constraints in a constrained dynamic programmingsequence alignment procedure. The R-PASS method has anO(nm) complexity. We compare our method with two otherstructure-based alignment methods, LARA and ExpaLoc, andwith a sequence-based alignment method, MAFFT, acrossthree benchmarks and obtain favorable results in accuracy andorders of magnitude faster in speed.Keywords-RNA pairwise structural alignment; structuremotif; bipartite graph matching; constraint sequence alignmentI. INTRODUCTIONA trend in the sequence analysis is to processincreasingly larger scales of sequences in order to detectsequence homologues, predict consensus secondarystructures[1], identify structure motifs[2] and inferphylogenetic relationships[3]. Non-canonical base pairs andstructure motifs are found based on a MSA of 2,240 16SrRNA sequences[2] and a set of statistical free energyvalues are computed from more than 50,000 ncRNAsequences[4]. Furthermore, genome wide sequencealignments identifying ncRNAs have become increasinglyroutine. RNAz[5] predicted over 30,000 structured RNAelements in human genomes from 438,788 alignments ofnon-coding regions. Scalable and fast computation methodis a key to make large scale analysis feasible.Sequence-based RNA sequence alignment programs, e.g.MAFFT[6], generate accurate alignments when the RNAsequences are conserved. However, these programs areunable to produce reliable alignments when sequenceidentity drops below 50–60%[7]. Exploiting thephenomenon of the coevolution of base-pairs and thepreservation of secondary structure are promisingapproaches to improve RNA alignment accuracy[7].Although many structure-based programs exist, most ofthem have high complexity and are not applicable to longRNAs. In this paper, we present a method, R-PASS (RNAPairwise Alignment of Structure and Sequence). Weevaluated our method compared with two state-of-artstructure-based alignment programs, LARA[8] andExpaLoc[9], and a popular sequence-based alignmentprogram MAFFT. Of the programs tested, R-PASS is thefastest. The results also show improved accuracy uponMAFFT and ExpaLoc and comparable accuracy with LARA.II. RELATED WORKMost structure-based alignment programs continue thetradition of the Sankoff’s algorithm[10], where itsimultaneously folds and aligns a set of pseudo knot-freeRNAs using a dynamic programming approach (DP).Although efforts have been made to reduce the time andspace complexity, this approach still requires O(n4) time inthe pairwise alignment case. Thus most structure-basedprograms are not practical for long RNAs[8].R-PASS assumes the structure information is availablefor both sequences. We compare our program to the twomost recent structure based alignment programs that targetthe same problem, LARA[8] and ExpaLoc[9]. LARA adoptsa graph-based representation and models the alignment asan integer linear program. ExpaLoc combines ExpaRNA[9]and LocARNA[11], where ExpaRNA detects the longestexact pattern match of two RNA structures and LocARNAfills in the unaligned space between those patterns.Our program differs with LARA in that the RNAstructures are matched at the structure motif level instead ofat the nucleotide level and an optimal alignment is found bya DP algorithm. Unlike ExpaRNA which finds the exactpattern matches, our matching algorithm is more flexible, somore structure constraints can be used in alignmentconstruction. Also, the alignment building process in ourprogram is still sequence-based, and thus has a much lowercomputation complexity than LocARNA.To evaluate the effectiveness of using additionalstructure information, we also compare our program withMAFFT[6]. Its iterative refinement method L-INS-I isevaluated to be one of the most accurate sequence-basedalignment programs that produce high quality alignmentswith average pairwise sequence identity above 55%[12].III. ALGORITHMGiven two RNA sequences with known secondarystructures, we parse the annotated base pairs into a set ofstructure motifs. The feature vectors of the structure motifsare computed and form the vertices of a bipartite graph.The weight of an edge is based on the similarity of twofeature vectors. A set of edges which represent thecorrespondence between structural motifs are obtained by abipartite graph matching algorithm. These correspondencesare then applied as anchor points to construct an optimalsequence alignment using a DP algorithm.A. Structure Matching1) Infering motifs from base pairing annotation: Thesecondary structure of a RNA sequences consists of a set of2011 IEEE International Conference on Bioinformatics and Biomedicine978-0-7695-4574-5/11 $26.00 © 2011 IEEEDOI 10.1109/BIBM.2011.74618
  2. 2. nested base pairs. The majority of the nucleotides in a RNAstructure form canonical base pairings in a regular helixregion. The remaining unpaired nucleotides form loops,which can be further categorized to bulges, internal, hairpinand multi-stem loops. Thus a RNA secondary structure canbe viewed as a set of those RNA motifs.Stem is the continuous base-paired double strand regionand is composed of a 5’end half and a 3’ end half. Bulge is aloop appearing only on one strand of a stem. Internal loop isthe unpaired regions occurring on both strands of a stem.Hairpin loop is the loop linking the 5’ end and 3’ end of astem. Multi-stem loop is the intersection of three or morestems. It is separated by single strand sequences. Free end isthe unpaired region at the 5’ end or 3’ end of a structure.Given a RNA secondary structure annotated by its basepairings, we first segment it into Stem, Bulge, Internal loopand Hairpin loop motifs. Then we merge the basic motifsinto Compound Stem and Hairpin. A Compound Stem is thediscontinuous base-paired region merged by Stems, Bulgesand Internal loops. A Hairpin is formed by combining aHairpin loop and its flanking Stem/Compound Stem. TheCompound Stem, Stem, Hairpin, Multi-stem loop and Freeend form a complete RNA structure. To reduce computation,we only consider Compound Stem, Stem and Hairpin motifsin the following matching step. The other motifs arespontaneously matched after matching of those three motifs.2) Vector model of motifs: For each Stem/CompoundStem/Hairpin motif we compute a feature vector whichincludes the start (ps) and end position (pe) of the motif andthe number of nucleotides in it (l). Since a Stem/CompoundStem consist of two halves, the end position of the first half(pse) and the start position of the second half (pes) are alsoincluded in its feature vector. The feature vector of aHairpin is the same as the feature vector of its stemcomponent. Therefore f{Stem, Compound Stem, Hairpin} = (ps, pse, pes,pe, l). A RNA secondary structure S can thus be representedas S = {fti}, i ‫ג‬ (1, k), where k is the number of motifs in Sand t ‫ג‬ {Stem, Compound Stem, Hairpin}.The similarity of two structure motifs is measured byhow close their relative positions are in the correspondingglobal sequences and how similar their lengths are. Thesimilarity is only computed between motifs of the same type,i.e. Hairpin with Hairpin and Stem with Stem or CompoundStem. We define the following heuristic similarity functioncomposed of position similarity (pos) and length similarity(len), where sl is the global sequence length, and wl is theweight of length similarity.݀ሺ݂௔ǡ ݂௕ሻൌ ൜Ͳǡ ݂݅ ܽǡ ܾ ܽ‫݁ݎ‬ ݊‫ݐ݋‬ ‫݂݋‬ ‫ݐ‬ ݁ ‫݁݉ܽݏ‬ ‫݁݌ݕݐ‬‫ݏ݋݌‬ሺ݂௔ǡ ݂௕ሻǤ ݈݁݊ሺ݂௔ǡ ݂௕ሻǡ ‫ݐ݋‬ ݁‫݁ݏ݅ݓݎ‬݈݁݊ሺ݂௔ǡ ݂௕ሻ ൌ ͳ െ ‫ݓ‬௟Ǥฬ݈௔ െ ݈௕݈௔ ൅ ݈௕ฬ‫ݏ݋݌‬ሺ݂௔ǡ ݂௕ሻൌ ͳ െ ݉ܽ‫ݔ‬ ൮݉݅݊ ൬ฬ‫݌‬௦௔‫݈ݏ‬௔െ‫݌‬௦௕‫݈ݏ‬௕ฬ ǡ ฬ‫݌‬௦௘௔‫݈ݏ‬௔െ‫݌‬௦௘௕‫݈ݏ‬௕ฬ൰ ǡ݉݅݊ ൬ฬ‫݌‬௘௦௔‫݈ݏ‬௔െ‫݌‬௘௦௕‫݈ݏ‬௕ฬ ǡ ฬ‫݌‬௘௔‫݈ݏ‬௔െ‫݌‬௘௕‫݈ݏ‬௕ฬ ൰൲ ǡ ݂݅ ܽǡ ܾ ‫ג‬ ܵ‫݉݁ݐ‬In the above equation, the score for different types ofthe structural motif is always 0. For the comparison of thesame motif type, the score is the product of the relativeposition similarity and the length similarity. The relativeposition difference for a single strand is defined as theminimum difference as measured by start and end position.For a Stem/Compound Stem structure, the relative positiondifference is first evaluated on each strand individually andthen the maximum score between the two strands is used.For simplicity, only the weight on length difference (wl) isused. The similarity score ranges between 0 and 1inclusively.3) Structure motif matching algorithm: The structuremotifs from two RNA structures are matched by a bipartitegraph matching (BGM). BGM is a powerful matchingtechnique that can achieve global and optimal matchingresults in polynomial time. It has been applied to proteinstructure alignment[13].We denote a bipartite graph G as G = (V = L R, E). Vis a set of vertices which can be divided into two subsets, Land R. Each subset individually represents structure motifsfrom one sequence and contains their feature vectors asvertices, i.e. L = S1, R = S2. E is a set of weighted edges thatlink between vertices in L and vertices in R. The edgeweight is the similarity score between two feature vectors.A threshold for edge weight t is used in a preprocessing stepof graph construction, so only edges with weight no lessthan t are used for matching. The limitation t > 0 ensuresthat the matches are only between motifs of the same type.There is no edge linking vertices from the same subset.Hence, E = {eLi,Rj = d(fLi, fRj) t} , i ‫ג‬ (1, k), j ‫ג‬ (1, h),where k is the number of motifs in S1 and h is the number ofmotifs in S2.Based on this bipartite graph, the correspondence ofstructure motifs are then found by a stable matchingalgorithm[14]. The algorithm generates a set of matchedstructure motifs, more specifically the feature vectors ofstructure motifs, M = {fLi, fRi}, i ‫ג‬ (1, c), where c is thenumber of matches. The stable matching ensures that no twomotifs are better matched together than with the motif theyare currently matched with. The complexity of thisalgorithm is O(kh). Any graph matching algorithm could beused as a substitute for the stable matching, such as amaximum weighted matching algorithm. The stablematching was chosen for its fast runtime.4) Computing matching blocks: The matching structuremotifs are then converted into sequence blocks. Theboundaries of the matched motifs are obtained based ontheir feature vectors. For a Hairpin motif, its sequence isdivided into three parts, 5’ end stem, Hairpin loop and 3’end stem, and the Stem and Compound Stem motif aresegmented into the 5’ end stem and 3’ end stem. Thus thematching segments from two sequences form a match block,of which the boundaries are determined by the start and endpositions of the segments (Fig. 1).The above generated sequence block set may containcrossing blocks caused by mismatches in BGM. In caseswhere the structures are unknown and require prediction,619
  3. 3. overlapping blocks may also appear istructures contain a set of overlapping candsolve this problem, the Dijkstra’s shortest papplied[15]. We construct a graph where thblocks and two pseudo blocks representingend of the sequences. The edge weightbetween two blocks. It is set to be 0 for crothe total number of position gaps betweeboth sequences for non-crossing blocks. Opositive weight are added to the graph. Athen computed between the start block anThe match blocks on the path are used in theB. Constraint sequence alignmentFigure 1. Convert structure alignment to sequenFor sequence A = {ai} and B = {bj}, i ‫ג‬where n is the length of sequence A and msequence B, a constraint alignment is then1). The optimal path is built only throughthe DP matrix using an affine gap penalty ma matrix H, where H(i, j) is the best alig(a1,…, ai, b1,…, bj) with ai aligned to a gwhere V(i, j) is the best alignment score upbj aligned to a gap, and a DP matrix D, whbest alignment score up to ai and bj, the refor the constrained DP is then calculated as:‫ܪ‬ሺ݅ǡ ݆ሻ ൌ ݉ܽ‫ݔ‬ ሺ‫ܦ‬ሺ݅ െ ͳǡ ݆ሻ ൅ ‫݋‬ǡ ‫ܪ‬ሺ݅ െ ͳǡ ݆ሻ ൅ ݁ሻܸሺ݅ǡ ݆ሻ ൌ ݉ܽ‫ݔ‬ ሺ‫ܦ‬ሺ݅ǡ ݆ െ ͳሻ ൅ ‫݋‬ǡ ܸሺ݅ǡ ݆ െ ͳሻ ൅ ݁ሻ‫ܦ‬ሺ݅ǡ ݆ሻൌ ቊ݉ܽ‫ݔ‬ ቀ‫ܦ‬ሺ݅ െ ͳǡ ݆ െ ͳሻ ൅ ܵ൫ܽ௜ǡ ܾ௝൯ǡ ‫ܪ‬ሺ݅ǡ ݆ሻǡ ܸሺ݅ǡ ݆െ ǡ ‫ݐ݋‬ ݁‫݅ݓݎ‬‫ݏ‬In the above equations, o is the gap opengap extension penalty and S(ai, bj) is the sbetween nucleotides ai and bj. The comalignment step is O(nm).IV. RESULTSThe program generated pairwise alignmby comparison to a gold standard referencenucleotide level. A pairwise alignment A ca set of nucleotide correspondence arrangorder, so A = {ai, bi}, i ‫ג‬ (1, n), where n is thof the two sequences, and ai and bi anucleotides from two sequences. Letalignment be A and the testing alignment bif the predicteddidate motifs. Topath algorithm ishe vertices are thethe start and theis the distanceossing blocks anden the blocks onOnly edges withA shortest path isnd the end block.e next step.nce alignment.(1, n), j ‫ג‬ (1, m),m is the length ofn computed (Fig.match blocks inmodel[16]. Givengnment score ofgap; a matrix V,p to ai and bj withhere D(i, j) is thecurrence relation:݆ሻቁ ǡ ݂݅ ሺ݅ǡ ݆ሻ ݅݊ ܾ݈‫݇ܿ݋‬‫ݏ‬݁n penalty, e is thesubstitution scoremplexity of thement is evaluatede alignment at thecan be denoted asged in sequentialhe smaller lengthare the matchedthe referencebe A’= {a’i, b’i},we define a correct correspondenceThe “=” here means two nuclenucleotide index in the the percentage of correct correspoof all correspondence in the referencA. Testing dataWe created the pairwise sequenwith structure information from thre2.1[7], Rfam 10.1[17] and CRWRfam are popular alignment benchevaluations and provide consensusvarious ncRNA families. The CRWhigh quality alignments provides instructure in various formats which sWe took three RNA famispliceosomal RNA (RF00004), nucand Bacterial RNase P class Afamilies from Bralibase data-set 1 aU5 spliceosomal RNA, tRNA andfamilies from CRW Site: 5S rRNABesides the data-set2 from Brapairwise sequence alignments, all thby breaking the multiple sequencesequence alignments. The seconsequence is either inferred froprovided in the original dataset orSite. The pseudoknots are excludedA single test contains two RNcorresponded RNA secondary strubracket format[19]. Based on RNdatasets are grouped into small (1000nt) and large (> 1000nt) RNARNA family are further divided in40, 60 and 80 by sequence identityindicates the pairwise sequence ibetween 40 and 60. Due to the pagesets and all the results are availableB. Comparison with other programWe implemented the algorithmsequence alignment, we use the RI[20] with gap open penalty of -8 anof -1. R-PASS is written in JavaMAFFT are implemented in C. Esmall RNA datasets, while the ottested on all datasets. Each test coinput. Except for MAFFT-L-INSsequence information, the structureto all the other three programs. Forfirst executed and the output constrLocARNA to obtain complete alignexecuted with default setting inprocessor at 3.16G Hz.1) Alignment accuracy: Usindescribed above, we evaluatedgenerated by all four programsreference alignments. As shown incan achieve > 90% accuracy ine as ai = a’i and bi = b’i .eotides have the sameThe alignment accuracyondence over the numberce alignment A.ce alignment testing dataee benchmarks, BralibaseSite[18]. Bralibase andhmarks used by programsecondary structures forW Site well known for itsndividual RNA secondaryserve our purpose well.ilies from Rfam: U2clear RNase P (RF00009)(RF00010); four RNAand data-set 2: g2intron,5S rRNA and two RNAand 16S rRNA.alibase which consists ofhe other tests are createdalignments into pairwisendary structure of eachom consensus structureretrieved from the CRWfrom the structures.NA sequences with theiructures annotated in dotNA sequence length, the(< 200nt), middle (200-A (Table 1). Tests in eachnto at least three subsets:y. For example, subset 40identity of each test ise limit, the details of dataupon request.msin the tool R-PASS. ForIBOSUM scoring matrixnd gap extension penaltya, LARA, ExpaLoc andExpaLoc is tested on thether three programs areontains two sequences asS-I which only use thee annotation is providedr ExpaLoc, ExpaRNA israints are then piped intonments. All programs aren a desktop with Intelng the scoring methodthe alignment qualitywith the correspondingFig. 2, all four programssmall and large RNA620
  4. 4. datasets with sequence identity above 60% and in middleRNA datasets with sequence identity above 80%. In thiszone, using structure information does not improvealignment quality significantly. The additional structureinformation has a remarkable effect in the twilight zone,where the sequence identity is below 60%. While MAFFTperformance drops sharply with decrease of sequenceidentity, R-PASS and LARA can maintain high alignmentquality (Fig. 2).ExpaLoc does not show superiority over MAFFT in thetwilight zone in the small RNA datasets. One possibleexplanation is that as the sequence identity drops, thenumber of exact pattern matches, i.e., the number ofstructure constraints also decreases, thus the degree offreedom expands as LocARNA aligns the sequences,causing potential alignment errors.In the twilight zone, R-PASS is comparable with LARAin terms of alignment quality. R-PASS outperforms all otherprograms in the tRNA dataset from Bralibase and RNase Pdatasets from Rfam (Fig. 2). In the other datasets exceptU5, the largest difference between R-PASS and LARA isless than 3%.Figure 2. Accuracy comparisons in small, middle and large RNAdatasets.2) Running time: The total running time of eachprogram on datasets of the same sized RNA group is used.ExpaLoc costs the most CPU time in the small RNAdatasets (data not shown here).TABLE I. PROGRAM RUNNING TIME IN SECONDS.Program Small Middle LargeR-PASS17.8 17.627LARA 178.24 (10)a1701 (97) 30000 (1111)MAFFT-L-NS-I 491.34 (27) 573 (33) 108 (4)# tests 2806 2893 1046Avg. length 96 314 1467a. The number of times that R-PASS is faster than the programAs shown in Table 1, R-PASS is the fastest among allprograms. The advantage over LARA is more obvious asthe length of the RNA increases (Fig. 3). R-PASS is 10times faster than LARA and 27 times faster than MAFFT insmall RNA datasets; and is 97 times faster than LARA and33 times faster than MAFFT for middle RNA datasets. It is4 times faster than the sequence-based alignment programMAFFT and above 1,100 times faster than LARA in largeRNA datasets while maintaining comparable alignmentquality (Fig. 2). Therefore, our approach is more suitable forlarge-scale RNA sequences.Figure 3. Program running time versus sequence length.3) R-PASS structure motif matching with LARAsubroutine: In R-PASS, individual sequence blocksgenerated after structure motif match can be fed into anyalignment algorithm to produce a global alignment. SinceLARA finds the structure motif correspondence at the basepair level, it performs better than R-PASS in some datasetsin the twilight zone. Thus integrating LARA to align thestem regions may improve R-PASS performance in thosedatasets. We compute the structure match blocks by R-PASS and use LARA to align the blocks generated fromStem motifs and Gotoh algorithm to align the blocks of loopand free end motifs. Each local alignment is then linked insequential order to form a global alignment.We tested the integrated R-PASS and LARA in thesmall RNA datasets. This approach performs better than R-PASS and is comparable with LARA. In some cases, e.g.U5 (Fig. 4), the integrated approach improves the alignmentaccuracy upon LARA.204060801000 20 40 60 80AccuracyPID subsetSmall RNA: tRNAR-PASSLARAExpaLocMAFFT-L-INS-I3040506070809010020 30 40 50 60 70 80 90AccuracyPID subsetMiddle RNA: Nuclear RNase P (RF00009)R-PASSLARAMAFFT-L-INS-I8085909510020 40 60 80AccuracyPID subsetLarge RNA: 16S rRNAR-PASSLARAMAFFT-L-INS-I01234596 314 1467log10(runningtimeinmilisecondpertest)Sequence length (nucleotide)Program running time VS sequence lengthR-PASSLARAMAFFT-L-INS-I621
  5. 5. Figure 4. Accuracy comparison on U5.V. CONCLUSION AND FUTURE WORKWe have developed a new workflow for pairwisealignment of RNA sequences with known structureinformation, named R-PASS. We utilize the RNA structuremotif correspondence found in a bipartite graph frameworkto constrain the sequence alignment problem and performthe final alignment. The complexity of our algorithm isO(nm), yet it can achieve alignment quality comparable withthe current state of the art programs. This is especiallyapparent in the twilight zone. R-PASS is also significantlyfaster than competing methods, often by orders ofmagnitude, which makes our method well suited for use initerative algorithms and high-throughput RNA analysis.We are currently working on various improvements tothe R-PASS framework described in this paper. Thealignment can be refined by matching the basic motifs in aCompound Stem/Hairpin motif. Improvements can also beextended to the nucleotide level where all the base pairingsin each sequence could be used as constraints.The similarity function between structural motifs we useis established by practical experience and the values aredetermined based on preliminary experiments. This functioncan be further enhanced by including additional informationsuch as base pair similarity and nucleotide similarity. Theadditional information can improve the sensitivity and theselectivity of the matching algorithm.Our results show that incorporating structureinformation significantly improves alignment accuracy uponsequence-based alignment methods, especially for lessconserved sequences. While our current algorithm focuseson aligning known structures, it can be adapted to alignunknown structures by structure prediction using RNAfolding algorithms or generating all potential structuremotifs and find correct matches by an advanced similarityfunction.The fast performance of our alignment program alsomakes it promising to compute a multiple sequencealignment over a large set of sequences. While R-PASSfocuses on pairwise alignment, the result can be extended toproduce multiple sequence alignments using a guide tree ora progressive approach. Our method can also targettemplate-based alignment problems, where new sequencesare added into an existing alignment by matching the newsequence with the consensus sequence of the alignment.ACKNOWLEDGMENTThis research is supported by the National Institutes ofHealth grant, GM085337.REFERENCES[1] Gutell, R. R., Weiser, B., Woese, C. R. and Noller, H. F. Comparativeanatomy of 16-S-like ribosomal RNA. Prog Nucleic Acid Res Mol Biol, 32(1985), 155-216.[2] Gutell, R. R., Larsen, N. and Woese, C. R. Lessons from an evolvingrRNA: 16S and 23S rRNA structures from a comparative perspective.Microbiol Rev, 58, 1 (Mar 1994), 10-26.[3] Phillips, A., Janies, D. and Wheeler, W. Multiple sequence alignment inphylogenetic analysis. Mol Phylogenet Evol, 16, 3 (Sep 2000), 317-330.[4] Wu, J. C., Gardner, D. P., Ozer, S., Gutell, R. R. and Ren, P.Correlation of RNA secondary structure statistics with thermodynamicstability and applications to folding. J Mol Biol, 391, 4 (Aug 28 2009), 769-783.[5] Washietl, S., Hofacker, I. L., Lukasser, M., Huttenhofer, A. and Stadler,P. F. Mapping of conserved RNA secondary structures predicts thousandsof functional noncoding RNAs in the human genome. NatureBiotechnology, 23, 11 (Nov 2005), 1383-1390.[6] Katoh, K., Kuma, K., Miyata, T. and Toh, H. Improvement in theaccuracy of multiple sequence alignment program MAFFT. GenomeInform, 16, 1 (2005), 22-33.[7] Gardner, P. P., Wilm, A. and Washietl, S. A benchmark of multiplesequence alignment programs upon structural RNAs. Nucleic Acids Res,33, 8 (2005), 2433-2439.[8] Bauer, M., Klau, G. W. and Reinert, K. Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization.BMC Bioinformatics, 8(2007), 271.[9] Heyne, S., Will, S., Beckstette, M. and Backofen, R. Lightweightcomparison of RNAs based on exact sequence-structure matches.Bioinformatics, 25, 16 (Aug 15 2009), 2095-2102.[10] Sankoff, D. Simultaneous solution of the RNA folding, alignment, andproto-sequence problems. SIAM J Appl Mat, 45(1985), 810-825.[11] Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F. and Backofen, R.Inferring noncoding RNA families and classes by means of genome-scalestructure-based clustering. PLoS Comput Biol, 3, 4 (Apr 13 2007), e65.[12] Wilm, A., Mainz, I. and Steger, G. An enhanced RNA alignmentbenchmark for sequence alignment programs. Algorithms Mol Biol,1(2006), 19.[13] Wang, Y., Makedon, F., Ford, J. and Huang, H. A bipartite graphmatching framework for finding correspondences between structuralelements in two proteins. Conf Proc IEEE Eng Med Biol Soc, 4(2004),2972-2975.[14] Shapley, D. G. a. L. S. College Admissions and the Stability ofMarriage. American Mathematical Monthly, 69(1962), 9-14.[15] Dijkstra, E. W. A note on two problems in connexion with graphs.Numerische Mathematik, 1(1959), 269-271.[16] Gotoh, O. An improved algorithm for matching biological sequences.J Mol Biol, 162, 3 (Dec 15 1982), 705-708.[17] Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-Jones, S., Finn, R. D., Nawrocki, E. P., Kolbe, D. L., Eddy, S. R. andBateman, A. Rfam: Wikipedia, clans and the "decimal" release. NucleicAcids Res, 39, Database issue (Jan 2011), D141-145.[18] Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R.,DSouza, L. M., Du, Y., Feng, B., Lin, N., Madabusi, L. V., Muller, K. M.,Pande, N., Shang, Z., Yu, N. and Gutell, R. R. The comparative RNA web(CRW) site: an online database of comparative sequence and structureinformation for ribosomal, intron, and other RNAs. BMC Bioinformatics,3(2002), 2.[19] Gruber, A. R., Lorenz, R., Bernhart, S. H., Neubock, R. and Hofacker,I. L. The Vienna RNA websuite. Nucleic Acids Res, 36, Web Server issue(Jul 1 2008), W70-74.[20] Klein, R. J. and Eddy, S. R. RSEARCH: finding homologs of singlestructured RNA sequences. BMC Bioinformatics, 4(Sep 22 2003), 44.8085909510020 40 60 80AccuracyPID subsetU5LARAsubroutineR-PASSLARA622