Tech-Report

GTfold: A Scalable Multicore Code for RNA
Secondary Structure Prediction
Neha Jatav
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay, Powai, Mumbai, India 400076
Email: nehajatav@cse.iitb.ac.in
Project Guide:
Dr. David Bader
College of Computing
Georgia Institute of Technology, Atlanta, GA, USA
Email: bader@cc.gatech.edu
Abstract—Accurate prediction of RNA secondary structure
from the RNA base sequence is an unsolved computational
challenge. The accuracy of predictions made by free energy
minimization is limited by the quality of the energy parameters
in the underlying free energy model. The energy model that
GTfold and the de facto standard programs have been using
is Turner99, the set of nearest neighbor parameters for RNA
folding compiled by the Turner group in 1999. However, the
Turner group compiled a newer set of thermodynamic values,
Turner 2004, in 2004. In addition, real sequences could not be
used directly with GTfold and other RNA folding programs
because they contain unspecified bases.
In this project, an option that lets the user switch between
the two energy models has been added to GTfold. GTfold can
also now fold real RNA sequences containing the unidentified
base N.
I. INTRODUCTION
GTfold is a fast, scalable multi-core code for predicting
RNA secondary structure (Mathuriya, Bader, Heitsch, & Harvey,
2009). RNA molecules perform a variety of biological
functions. Small RNAs (with tens to a few hundred
nucleotides) play a role in gene splicing, editing,
and regulation. At the other end of the size spectrum, the
genomes of numerous viruses are lengthy single-stranded
RNA sequences with many thousands of nucleotides. These
single-stranded RNA sequences base pair to form molecular
structures, and the secondary structure of viruses like dengue
[3], ebola [16], and HIV [17] is known to have functional
significance. Thus, disrupting functionally significant base
pairings in RNA viral genomes is one potential method for
treating or preventing the many RNA-related diseases.
According to the thermodynamic hypothesis, the structure
having the minimum free energy (MFE) is predicted as the
secondary structure of the molecule. The free energy of a
secondary structure is the independent sum of the free energies
of distinct substructures called loops. The optimization is
performed using the dynamic programming algorithm given
by Zuker and Stiegler in 1981 [21] which is similar to the
algorithm for sequence alignment but far more complex. The
algorithm considers all possible structures when computing
the MFE structure, so existing folding programs apply
heuristics and approximations to meet the computational
requirements. The free energies of the different loops are
evaluated using a thermodynamic model of free energy.
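The shape of such a dynamic program can be illustrated with a much simpler recurrence. The sketch below is not the Zuker and Stiegler algorithm and uses no energy model: it substitutes the Nussinov-style base-pair maximization recurrence as a stand-in. The names `can_pair` and `max_pairs` and the `min_loop` parameter are assumptions made here for illustration.

```python
# Stand-in for the MFE recursion: maximize the number of nested base pairs.
# This is the Nussinov-style recurrence, NOT the Zuker-Stiegler energy model.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a: str, b: str) -> bool:
    """True if the two bases may form a pair."""
    return (a, b) in PAIRS


def max_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested pairs, with at least `min_loop` unpaired
    bases enclosed by every hairpin (hypothetical default of 3)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = optimum for the subsequence seq[i..j]; unset cells stay 0.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

An MFE code fills its tables in the same order, from short subsequences to long ones, but each cell holds a free energy and each case is scored by loop type rather than by counting pairs.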
II. RELATED WORK
The Vienna RNA package, developed by the Theoretical
Biochemistry Group, has an option to run with a different
thermodynamic model, called the Andronescu
model (Andronescu, Condon, Hoos, Mathews, & Murphy,
2007). That work presents constraint generation (CG), the first
computational approach to RNA free energy parameter estimation
that can be efficiently trained on large sets of structural as well
as thermodynamic data. The CG approach employs a novel
iterative scheme, whereby the energy values are first computed
as the solution to a constrained optimization problem. Then
the newly computed energy parameters are used to update
the constraints on the optimization function, so as to better
optimize the energy parameters in the next iteration. Using
this method on biologically sound data, revised parameters
can be obtained for the Turner99 energy model that provide
significant improvements in prediction accuracy over current
state-of-the-art methods.
In the mfold web server developed by Michael Zuker
(Zuker, 2003), all characters except A-Z and a-z are removed
from a sequence entered into the sequence text area box.
Lower case characters are converted to upper case. For RNA
folding, T or t is converted to U. In addition, the letters W,
X, Y and Z also refer to A, C, G and U/T, respectively. These
nucleotides, if they pair, should do so only at the end of a helix.
Thus, the mfold web server does not support the IUPAC
(International Union of Pure and Applied Chemistry) ambiguous
DNA character convention (Cornish-Bowden, 1985).
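The input clean-up described for the mfold web server can be sketched in a few lines. This is an illustration of the three stated rules only, not mfold's code; the function names `normalize_rna` and `base_identity` are hypothetical, and the restriction that W, X, Y and Z pair only at helix ends is not modeled.

```python
import re

# Letters that stand for an ordinary base but carry a pairing restriction.
# The restriction itself (pair only at the end of a helix) is not modeled here.
RESTRICTED_ALIASES = {"W": "A", "X": "C", "Y": "G", "Z": "U"}


def normalize_rna(raw: str) -> str:
    """Apply the three clean-up rules in order:
    drop everything except A-Z/a-z, upper-case, then convert T to U."""
    letters = re.sub(r"[^A-Za-z]", "", raw)
    return letters.upper().replace("T", "U")


def base_identity(symbol: str) -> str:
    """Map W, X, Y, Z to the base they refer to; other symbols pass through."""
    return RESTRICTED_ALIASES.get(symbol, symbol)
```

Under these rules an IUPAC ambiguity code such as N survives normalization as a plain letter; nothing here gives it a meaning.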

III. RESEARCH CONTRIBUTIONS
A. Toggling of Thermodynamic values
RNA molecules are made up of the nucleotides A, C, G, and U,
which can pair according to the rules (A,U), (U,A),
(G,C), (C,G), (G,U), (U,G). Nested base pairings result in
2D structures called secondary structures. 3D interactions
among the elements of the secondary structures
result in 3D structures called tertiary structures. Pairings
among bases form various kinds of loops, which can be
classified based on the number of branches present in them.
The nearest neighbor thermodynamic model (NNTM) provides
a set of functions and sequence-dependent parameters to
calculate the energy of the various kinds of loops. The free energy
of a secondary structure is calculated by adding up the energies
of all loops and stacks present in the structure. There are
two existing thermodynamic models compiled by the Turner
group in 1999 and 2004, known as the Turner 99 model and
the Turner 2004 model respectively. GTfold can now be switched
between these two models, and the free energy
is calculated according to the selected model.
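The additive energy rule and the model toggle can be sketched as follows. This is a minimal illustration under stated assumptions, not GTfold's implementation: a loop is reduced to a hypothetical `(kind, size)` pair, the names `select_model` and `structure_energy` are invented here, and the two energy functions are dummy placeholders that do not contain any Turner parameters.

```python
from typing import Callable, Dict, Iterable, Tuple

# Simplification: a loop is described only by its kind and size. Real NNTM
# parameters also depend on the bases in and around the loop.
Loop = Tuple[str, int]
EnergyFn = Callable[[str, int], float]


def select_model(name: str, models: Dict[str, EnergyFn]) -> EnergyFn:
    """Return the energy function registered under `name`."""
    try:
        return models[name]
    except KeyError:
        raise ValueError(
            f"unknown energy model {name!r}; choose from {sorted(models)}"
        ) from None


def structure_energy(loops: Iterable[Loop], loop_energy: EnergyFn) -> float:
    """Free energy of a structure as the sum of its loop energies."""
    return sum(loop_energy(kind, size) for kind, size in loops)


# DUMMY placeholders so the sketch runs. These are NOT Turner 99 or
# Turner 2004 values; real models load published parameter tables.
def dummy_model_a(kind: str, size: int) -> float:
    return -1.0 * size


def dummy_model_b(kind: str, size: int) -> float:
    return -0.5 * size


MODELS: Dict[str, EnergyFn] = {"turner99": dummy_model_a, "turner04": dummy_model_b}
```

Because the folding code only ever calls the selected function, switching models changes the numbers but not the algorithm, which is why the same structure can score differently under each model.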
B. Unidentified base N
The letter N denotes an unspecified base. Such bases are
very common in real RNA sequences, and they are not allowed
to pair. Real RNA sequences can be processed by
putting constraints on these bases: each base N is prohibited
from pairing, so the RNA sequence is folded
such that none of the unidentified bases is paired.
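One way to express this restriction is to emit a "prohibit a string of consecutive bases from pairing" constraint of the form P i 0 k for each run of N. The sketch below shows that idea only; the function name `n_run_constraints` is hypothetical, positions are assumed to be 1-based, and whether GTfold handles N through its constraint file or internally is not stated here.

```python
from typing import List


def n_run_constraints(seq: str) -> List[str]:
    """Return one 'P i 0 k' line per maximal run of N in `seq`.

    Assumes 1-based positions: i is the first N of the run, k its length.
    """
    constraints = []
    pos = 0
    n = len(seq)
    while pos < n:
        if seq[pos] in ("N", "n"):
            start = pos
            while pos < n and seq[pos] in ("N", "n"):
                pos += 1
            constraints.append(f"P {start + 1} 0 {pos - start}")
        else:
            pos += 1
    return constraints
```

For example, the sequence ACNNGUNA yields the two lines `P 3 0 2` and `P 7 0 1`.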
C. Constrained folding
GTfold allows the optional incorporation of folding con-
straints. Each constraint consists of a single line in the con-
straint file that must conform to a rigid format. The various
types of constraints are itemized below. Multiple constraints
of any form are allowed in any order.
• Force a specific base pair or helix to form. The command
F i j k (1)
will force the formation of the helix (a single base pair if
k=1). The triple (i, j, k) refers to k consecutive base pairs,
where r[i]r[j] is the exterior closing base pair. If any of these
base pairs cannot exist, then an error will be generated
and the job will fail. The usual result is an output page
that declares "Job aborted! No Structure!".
• Prohibit a specific base pair or helix from forming. The
command
P i j k (2)
will prohibit every base pair of the form r[i+h]r[j-h]
(h varying from 0 to k-1) from occurring.
• Prohibit a string of consecutive bases from pairing. The
command
P i 0 k (3)
(the second to last character is zero) will prevent
nucleotides r[i], r[i+1], r[i+2],..., r[i+k-1] from pairing. This
is a single base when k=1. Forcing too many bases to be
single stranded can generate a fatal error.

TABLE I
MFES IN KCAL/MOL CALCULATED BY UNAFOLD, GTFOLD AND
RNAFOLD
Sequence     Length (nt)  UNAfold  GTfold  RNAfold
16S/X54252           698     -138    -143     -143
16S/X54253           702     -141    -149     -149
16S/X98467          1296     -460    -487     -489
16S/X65063          1433     -572    -584     -584
16S/Z17210          1436     -744    -762     -763
16S/X52949          1453     -795    -804     -805
16S/K00421          1475     -682    -687     -687
16S/Z17224          1551     -553    -568     -569
16S/X00794          1963     -723    -742     -747

TABLE II
MFE SCORES IN KCAL/MOL FOR SAME STRUCTURES ON GTFOLD AND
UNAFOLD
Sequence     Length (nt)  UNAfold  GTfold
16S/K00421          1475   -680.5  -682.4
16S/X00794          1963   -723.1  -726.4
16S/X52949          1453   -794.6  -794.4
16S/X54252           698   -137.5  -138.7
16S/X54253           702   -141.3  -142.7
16S/X65063          1433   -571.6  -575.5
16S/X98467          1296   -460    -461.2
16S/Z17210          1436   -744    -748.3
16S/Z17224          1551   -552.6  -556.1
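The three constraint forms listed in the constrained folding subsection (F i j k, P i j k and P i 0 k) can be read with a small parser. The sketch below is an assumption-laden illustration, not GTfold's parser: the `Constraint` class, the function names, whitespace-separated fields and 1-based positions are all choices made here, and k consecutive pairs are taken to mean h = 0, ..., k-1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Constraint:
    kind: str  # "force_helix", "prohibit_helix" or "single_stranded"
    i: int
    j: int
    k: int


def parse_constraint(line: str) -> Constraint:
    """Parse one constraint line of the form 'F i j k' or 'P i j k'."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {line!r}")
    tag = fields[0]
    i, j, k = (int(f) for f in fields[1:])  # int() raises ValueError on junk
    if i < 1 or j < 0 or k < 1:
        raise ValueError(f"index out of range in {line!r}")
    if tag == "F" and j > 0:
        return Constraint("force_helix", i, j, k)
    if tag == "P":
        return Constraint("prohibit_helix" if j > 0 else "single_stranded", i, j, k)
    raise ValueError(f"unsupported constraint {line!r}")


def helix_pairs(c: Constraint) -> List[Tuple[int, int]]:
    """The k stacked pairs (i+h, j-h) of a force/prohibit helix constraint."""
    if c.kind == "single_stranded":
        raise ValueError("single-stranded constraints have no pairs")
    return [(c.i + h, c.j - h) for h in range(c.k)]


def unpaired_positions(c: Constraint) -> List[int]:
    """Positions i .. i+k-1 kept unpaired by a 'P i 0 k' constraint."""
    if c.kind != "single_stranded":
        raise ValueError("only 'P i 0 k' constraints list unpaired positions")
    return list(range(c.i, c.i + c.k))
```

For example, `F 3 20 4` expands to the pairs (3,20), (4,19), (5,18), (6,17), and `P 5 0 3` keeps positions 5, 6 and 7 unpaired.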
IV. EXPERIMENTAL RESULTS
A. Accuracy
Table I gives the minimum free energies calculated by
GTfold and by other de facto standard programs for predicting
RNA secondary structures. In every case GTfold reports a lower
free energy than UNAfold. It matches the RNAfold value for four
of the nine sequences; for the other five, the RNAfold value is
lower by 1 to 5 kcal/mol.
Table II shows the energy calculated by GTfold and
UNAfold for the same structure predicted by both. UNAfold
uses a different thermodynamic model from the one
used by GTfold. The differences lie in the calculation of the
multiloop energies and the external energies.
Table III shows the MFE values that GTfold computes with the
two sets of thermodynamic parameters, Turner 99 and Turner 04.
TABLE III
MFE IN KCAL/MOL CALCULATED USING THE TURNER 99 AND TURNER
04 MODEL ON GTFOLD
Sequence     Length (nt)  Turner04  Turner99
16S/K00421          1475   -636.57      -687
16S/X00794          1963   -691.37      -747
16S/X52949          1453   -768.26      -805
16S/X54252           698   -121.46      -143
16S/X54253           702   -125.55      -149
16S/X65063          1433   -536.93      -584
16S/X98467          1296   -449.16      -489
16S/Z17210          1436   -724.21      -763
16S/Z17224          1551   -521.58      -569

TABLE IV
RUNNING TIMES IN SECONDS FOR UNAFOLD AND GTFOLD
Sequence     Length (nt)  UNAfold  GTfold
16S/X54252           698       12       1
16S/X54253           702       10       1
16S/X98467          1296       23       4
16S/X65063          1433       25       4
16S/Z17210          1436       28       5
16S/X52949          1453       29       5
16S/K00421          1475       23       5
16S/Z17224          1551       34       6
16S/X00794          1963       72       9

TABLE V
RUNNING TIMES IN SECONDS FOR GTFOLD RUNNING WITH AND
WITHOUT ILSA
Sequence     Length (nt)  GTfold  GTfold with ILSA
16S/X54252           698       1                20
16S/X54253           702       1                21
16S/X98467          1296       4               127
16S/X65063          1433       4               164
16S/Z17210          1436       5               166
16S/X52949          1453       5               177
16S/K00421          1475       5               181
16S/Z17224          1551       6               213
16S/X00794          1963       9               440
B. Performance Timing
Table IV compares the running times of GTfold
and UNAfold. GTfold is faster on every sequence, including
the longest ones.
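The size of the gap can be read off Table IV as a ratio of running times. The short sketch below does that arithmetic on the table's values; the names `TABLE_IV` and `speedups` are introduced here for illustration, and since the times are reported in whole seconds the ratios are coarse.

```python
# Rows of Table IV: (sequence, length in nt, UNAfold seconds, GTfold seconds)
TABLE_IV = [
    ("16S/X54252", 698, 12, 1),
    ("16S/X54253", 702, 10, 1),
    ("16S/X98467", 1296, 23, 4),
    ("16S/X65063", 1433, 25, 4),
    ("16S/Z17210", 1436, 28, 5),
    ("16S/X52949", 1453, 29, 5),
    ("16S/K00421", 1475, 23, 5),
    ("16S/Z17224", 1551, 34, 6),
    ("16S/X00794", 1963, 72, 9),
]


def speedups(rows):
    """Map each sequence name to UNAfold time divided by GTfold time."""
    return {name: unafold / gtfold for name, _length, unafold, gtfold in rows}


if __name__ == "__main__":
    for name, ratio in speedups(TABLE_IV).items():
        print(f"{name}: {ratio:.1f}x")
```

On these nine sequences the ratio ranges from 4.6 (16S/K00421) to 12 (16S/X54252), and is 8 for the longest sequence, 16S/X00794.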
Table V compares the running time of GTfold
with and without the Internal Loop Speed-up Algorithm (ILSA).
V. CONCLUSION
GTfold can be used with both thermodynamic free energy
models, Turner 99 and Turner 04, and users can choose
either one. GTfold also
accepts the unidentified base 'N' and hence can be used directly
with real sequences without any pre-processing or errors.
REFERENCES
Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H.,
& Murphy, K. P. (2007). Efficient parameter estimation
for RNA secondary structure prediction. Bioinformatics.
Cornish-Bowden, A. (1985). Nomenclature for incompletely
specified bases in nucleic acid sequences: recommendations
1984. Nucleic Acids Research.
Mathuriya, A., Bader, D., Heitsch, C., & Harvey, S. (2009).
GTfold: A scalable multicore code for RNA secondary
structure prediction. 24th Annual ACM Symposium
on Applied Computing (SAC), Computational Sciences
Track, Honolulu, HI.
Zuker, M. (2003). Mfold web server for nucleic acid folding
and hybridization prediction. Nucleic Acids Research.
