SlideShare a Scribd company logo
1 of 8
Download to read offline
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                 Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                  Vol. 2, Issue 6, November- December 2012, pp.247-254
         AnG-HPR: Analysis of n-Gram based human Promoter
                            Recognition
                         Chandrashekar.Jatoth1, Rupesh Mahajan2
                    Department of Information Technology, Pad.Dr.DY.P.I.E.T, Pune, India


Abstract
         We describe a promoter recognition
method named An-HPR to locate eukaryotic
promoter regions and predict transcription start
sites (TSSs). We computed n-gram features are
extracted and used in promoter prediction. We
computed n-grams (n=2, 3, 4, 5) as features and
created frequency features to extract informative
and discriminative features for effective
classification. Neural network classifiers with
these n-grams as features are used to identify            Figure 1: A schematic representation of the
promoters in a human genome. Analysis of n-               locations of the promoter region, TFBSs, exons,
Gram based (AnG) is applied to the feature                introns and 3′utr regions.
Extraction and a subset of principal components
(PCs) are selected for classification. Our system                   Available promoter prediction systems use
uses three neural network classifiers to                  two types of features for classification namely,
distinguish promoters versus exons, promoters             context features like nmers, and signal features such
versus introns, and promoters versus 3' un-               as TATA-box, CCAAT-box, and CpG islands.
translated region (3'UTR). We compared AnG-               Among the favorable promoter prediction programs,
HPR with four well-known existing promoter                Eponine [1] builds a PWM to detect TATA-box and
prediction systems such as DragonGSF, Eponine             G+C enrichment regions as promoter-like regions;
and FirstEF, PCA-HPR. Validation shows that               FirstEF [2] uses CpG-related and non-CpG related
AnG-HPR achieves the best performance with                first exons as signal features; Promoter Inspector [3]
three test sets for all the four predictive systems.      uses IUPAC words with wildcards as context
                                                          features. Good experiment results are achieved by
Keywords— promoter recognition; sequence                  integrating these two types of features. DPF [4]
feature; transcription start sites, Biological data       applies a separate module on G+C rich and G+C
sets, neural networks, Binary classification,             poor regions, and selects 256 pentamers to generate
cascaded classifiers.                                     a PWM for prediction. Furthermore, DragonGSF [5,
                                                          6] adds the CpG-island feature to DPF.
1. INTRODUCTION                                                     Most of the promoter prediction programs
     Promoter recognition is a real problem that          try to predict the exact location of the promoter
computationally identifies the transcription start site   region of the known protein-coding genes, while
(TSS) or the 5’end of the gene without time-              some focus on finding the transcription start site
consuming and expensive experimental methods              (TSS). Some research has show that there is often
that align ESTs, cDNAs or mRNAs against to the            no single TSS, but rather a whole transcription start
entire genome. In this article, we focus on humans        region (TSR) containing multiple TSSs that are used
because it is a representative species that has           with different frequencies. Generally two main
attracted much more attention in the past decade. In      approaches are used in promoter prediction [7].
humans, the TSS is surrounded with a core-                1. First approach assigns scores to all single
promoter region within around ±100 base pairs (bp).       nucleotides to Identify TSS.
A proximal promoter region has several hundreds bp        2. Second approach identifies a promoter region
immediately upstream of the core promoter.                without providing scores for all nucleotides.
.                                                                   In this article analyzes the performance of
                                                          17 programs on two tasks: Genome wide
                                                          identification of the start of genes and Genome wide
                                                          identification of TSRs. In Existing methods
                                                          Promoter prediction programs DragonGSF predicts
                                                          the value of precision is between 37-48% and recall
                                                          is 51-70%, DragonPF predicts the value of precision
                                                          is between 61-65% and recall 62-64%, FristEF


                                                                                                247 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                 Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                  Vol. 2, Issue 6, November- December 2012, pp.247-254
predicts the value of precision is 79-81% and recall      identification of promoters and first exons has been
35-40%, Eponine predicts the value of precision is ≈      one of the most difficult problems in gene-finding.
40 and recall ≈ 67%, NNPP2.2 predicts the value of        The FirstEF [2] program identifies a promoter
precision is 69-93% and recall 2.0-4.5%,                  region and first exons in the human genome, which
McPromoter2.0 predicts the value of precision is 26-      may be also be useful for the annotation of other
57%and recall ≈ 4.5, proSom predicts the value of         mammalian genomes. FirstEF consists of different
precision is 0.38% and recall 0.66% [8].                  discriminate functions structured as a decision tree.
We select the 13 representative PPPs that can             The probabilistic models are designed to find
analyze large genome sequences and report strand-         potential first splice donor sites and CpGrelated and
specific TSS predictions. ARTS, Eponine used              non-CpG-related promoter regions based on
SVM as the part of their design, EP3, Promoter            discriminant analysis. For every potential first
scan, Wu-method used Position weight matrix               splice-donor site and upstream promoter region,
(PWM).                                                    FirstEF decides whether the intermediate region
                                                          could be a potential first exon based on a set of
         CpGcluster used distance based algorithm.        quadratic discriminant functions. Training and
CpGProD used linear discriminating analysis               testing using different discriminant functions, the
(LDA). DragonGSF, DragonPF, McPromoter used               first exons and promoter regions from the first-exon
neural networks.NNPP2.2 has used Time Delay               database are used. Ramana et al. have also tried to
neural network. Promoter Explorer has used                identify the promoters as well as first exons of
AbaBoost algorithm. proSOM has used SOM. These            human species by using an algorithm called FirstEF
programs have used as features various aspects of         which is based upon the usage of structural and
promoter and other regions of the DNA [6].                compositional features [3]. They were able to
There are three classification methods for human          predict 86% of the first exons. They have compared
promoter recognition system [9].                          their method with Promoter Inspector and obtained a
• Discriminative model that finds the optimal             sensitivity of 70% compared to Promoter
thresholds or classification boundaries in the signal,    Inspector’s 48%. Bajic et al. termed that the
context and structure features space. Typical             prediction is positive if the predicted transcription
methods include artificial neural networks (ANNs),        start site (TSS) falls within a maximum allowed
discriminate functions and support vector machines        distance from the reference transcription start site
(SVMs).                                                   [5]. They have assessed performance of some of the
• Generative model that describes the generative          prediction algorithms based on the performance
process of signal, context and structure                  measures such as sensitivity and positive predictive
observations. Position weight matrix (PWM),               value. In their later paper they concluded that the
nearest neighborhood and hidden Markov models             promoter prediction combined with gene prediction
(HMMs) belong to generative models.                       yields a better recognition rate [6].
• Ensemble that combines multiple classifiers for
multiple features in order to achieve a consensus         ExonType       Sensitivit    Specificit    correlation
and robust recognition results.                                          y             y             coefficient
                                                          CpG-related    0.92          0.97          0.94
          Promoter Inspector is a program that            NotCpG-        0.74          0.60          0.65
predicts eukaryotic poly II promoter regions with         related
high specificity in mammalian genomic sequences.          all exons     0.86       0.83     0.83
The program Promoter Inspector focuses on the             Table1.Accuracy of FirstEF based on cross-
genomic context of promoters rather than their exact      validation
location. Promoter Inspector is based (refer table 1)
on three classifiers, which specialize in to                        Li et al. is having proposed that locate
differentiating between promoter region and a             eukaryotic      promoter     regions and predict
subset of non-promoter sequences (intron, exon and        transcription start sites (TSSs). Here the authors
3′utr).                                                   have computed codon (3-mer) and pentamer (5-mer)
                                                          frequencies and created codon and pentamer
         In contrast to this, PromnFD and PromFind        frequency feature matrices to extract informative
use only one classifier, i.e the features are extracted   and     discriminative     features  for   effective
from one promoter set and one set of various non-         classification. Principal component analysis (PCA)
promoter sequences. To compare the two                    is applied to the feature matrices and a subset of
approaches, two versions of Promoter Inspector are        principal components (PCs) are selected for
built. Version v1 was based on one st of mixed non-       classification [7].
promoter sequences, while version v2 was built on
the basis of exon, intron and 3′utr. Both versions of
promoter Inspector were applied to exon, intron,
3′utr and promoter evaluation sets [3]. The

                                                                                                248 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                 Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                  Vol. 2, Issue 6, November- December 2012, pp.247-254
Program       True       False     Sensitivit    PP      [16], to expectation and maximization (EM)
              Positiv    Positiv   y (%)         V       algorithm [17]. These techniques are based on
              e          e                       (%)     basically motifs present in the promoter which are
DragonGS      269        69        68.4          79.6    specific regions in the promoter or the global signal
F                                                        that is present in the promoter. To extract the local
FirstEF       331      501         84.2         39.8     or global signals various feature extraction methods
Eponine       199      79          50.6         71.6     are used.
PCA-HPR 301            65          76.6         82.2                Condon usage patterns in coding regions
Table 2. Performance comparison    of four prediction    and hexamer conservation (TATA box, CAAT box)
systems for 22 chromosomes.                              in promoter regions is well known. Techniques that
                                                         use these concepts are available in abundance in
          Promoter prediction systems use two types      literature. Most of the local content-based methods
of features for classification namely, context           are in fact based on conservation of the hexamers
features like n-mers, and signal features such as        [13, 16]. In literature there are a few articles on
TATA box, CCAAT-box and CpG islands. Among               protein     sequence classification       and gene
the favorable promoter prediction programs,              identification using n-grams, but very few on
Eponine builds a PWM to detect TATA-box and              promoter recognition. An n-gram is a selection of n
G+C enrichment regions as promoter-like regions;         contiguous characters from a given character stream
FirstEF uses CpGrelated and non-CpG related first        [18]. Ohler et al. have used interpolated Markov
exons as signal features; Promoter Inspector uses        chains on human and Drosophila promoter
IUPAC words with wild cards as context features.         sequences as positive data set achieving a
Good experiment results are achieved by integrating      performance accuracy of 53% [19]. Ben-gal et al.
these two types’ features. DPF applies a separate        have used a variable-order Bayesian network which
module on G+C rich and G+C poor regions, and             looks at the statistical dependencies between
selects 256 pentamers to generate a PWM for              adjacent base pairs to achieve a true positive
prediction. Furthermore, DragonGSF adds the CpG-         recognition rate of 47.56% [20]. Leu et al. have
island feature to DPF. Jia Zeng et al. in their paper    developed a vertebrate promoter prediction system
selected three promoter systems DragonGSF,               with cluster computing extracting n-grams for n = 6
Eponine and FirstEF to compare the performance on        to 20 [21]. They achieved an accuracy rate of 88%.
test set 1. A promoter region is counted as a true       Ji et al. have used SVM and n-grams (n = 4, 5, 6, 7)
positive (TP) if TSS is located within the region, or    for target gene prediction of Arabidopsis [22].
if a region boundary is within 200bp 5′ of such a        Prediction system with cluster computing extracting
TSS. Otherwise the predicted region is counted as a      n-grams for n = 6 to 20 [21]. They achieved an
false positive (FP). The test results of Eponine and     accuracy rate of 88%. Ji et al. have used SVM and
FirstEF. On test set 2, we adopt the same evaluation     n-grams (n = 4, 5, 6, 7) for target gene prediction of
method as DragonGSF when one or more                     Arabidopsis [22]. There are position specific n-gram
predictions fall in the region of [2000, +2000]          methods by Wang et al. and Li et al. [23, 24]. Wang
relative to a TSS, a TP is counted. All predictions      et al. have proposed a position specific propensity
which fall on the annotated part of the gene in the      analysis model (PSPA) which extracts the
region are counted as FP [10]. Sobha.et.al is termed     propensity of DNA elements at a particular position
that n-gram based promoter recognition methods           and their co-occurrence with respect to TSS in
were tried in promoter prediction and its application    mammals. They have considered a set of top ranking
to whole genome promoter prediction in E.coli and        k-mers (k = 1 to 5) at each position ±100 bp relative
Drosophila [11]. Here, we describe a method named        to TSS and the co-occurrence with other top ranking
AnG-HPR to predict the location of the TSSs with         k-mers at other downstream positions. PSPA score
best performance. We extracting n-grams and using        for a sequence is computed as the product of scores
them to identify promoters in human genome.              for the 200 positions of ±100 bp relative to TSS.
Patterns or features that characterize a                 They found many position-specific promoter
promoter/non-promoter are needed to be extracted         elements that are strongly linked to gene product
from the given set of promoter and non-promoter          function. Li et al. too have considered position-
sequences. Here promoter recognition is addressed        specific weight matrices of hexamers at some ten
by looking at the global signal characterized by their   specific positions for the promoter data of E. coli
frequency of occurrence of n-grams in the promoter       [24].
region.                                                             Here, we extract n-grams to be used as
                                                         features. An investigation of the lower order n-
2. Introduction to N-Grams as features                   grams for promoter recognition was taken up in
         Promoter recognition is tackled using           order to assess their applicability in whole genome
various techniques such as support vector machines       promoter recognition. To this end we have extracted
(SVM) [12], neural networks [13, 14], hidden             the n-grams and fed them to a multi-layer feed-
Markov models [15], position weight matrix (PWM)         forward neural network. Further, by using the best

                                                                                               249 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                 Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                  Vol. 2, Issue 6, November- December 2012, pp.247-254
features, two more neural networks are designed to       Here, the denominator denotes the number of n-
annotate the promoters in a genome segment. In the       grams that are possible in a sequence of length |L|
following sections, we explain the extraction of         and hence Vin denotes the proportional frequency of
features, the classification results using these         occurrence of i-th feature for a particular n value.
features, and a way of finding the promoters in an       Thus each promoter and non-promoter sequence of
unknown segment of the Human genome.                     the data set is represented as a 16-dimensional
                                                         feature
2.1 Feature Extraction                                   Vector (V12, V22, V32, ……… V162) for n=2, as a 64-
    In this section, different data sets that are used   dimensional feature vector (V13, V23, V33, ………
in experimentation and the feature extraction            V643) for n=3, as a 256 dimensional feature vector
method for various n-grams are described.                (V14, V24, V34,……… V2564) for n=4 and a 1024-
                                                         dimensional feature vector (V15, V25, V35,……
2.1.1 Data set                                           V10245) for n=5.
         The training set in this experiment is
divided into several subsets of promoters, introns,                In a binary classification problem, the
exons and 3′UTR sequences. Promoter sequences            training set will be a mixture of both positive and
are extracted from two public databases. One is the      negative data sets. Similarly the test set which
Eukaryotic Promoter Database (EPD), release 86           consists of both positive and negative data is used to
[8], which contains 1871 human promoter                  evaluate the performance of the classifier. A neural
sequences. The other is the Database of                  network classifier is trained using the n-grams of the
Transcription Start Sites (DBTSS), version 5.2.0 [9],    training set as input feature vectors and then the test
which includes 30,964 human promoter sequences           set is evaluated using the same network. Figures 1–4
and 15,531 forward strand promoter sequences. We         depict the average separation between the positive
used forward strand promoter sequences are in our        and negative data for n = 2, n = 3, n = 4 and n = 5,
experiment. Human exon and intron sequences are          respectively. It can be observed that the plots depict
extracted from the EID [10], and the first exons are     the separability of promoter and non-promoter data
not included in the exons training set. Human            sets in different feature spaces.
3’UTR sequences are from the UTR database [11].                    CpG islands, which exist in 60% of
From DBTSS we have extracted promoter                    mammalian promoters [12], are regarded as one of
sequences [-250, +50] bp around the experimental         the most important signal features in promoter
TSS. DBTSS contains 24 chromosomes; each                 recognition. Methods such as CpGProD [13], DPF
chromosome has 1000 nucleotide sequences. From           [4], DragonGSF [5, 6], FirstEF [2] and
EID and 3′UTR we have extracted non-promoter             PromoterExplorer [7] embed this signal feature in
sequences of length 300 bp [27].                         their prediction system. In our method, two features
                                                         are used to identify if a given sequence (>200bp) is
2.1.2 Method                                             CpG islands related: GC percentage (GCp) and
         Patterns or features that characterize a        Observed/expected CpG ratio (o/e). These are
promoter/non-promoter are needed to be extracted         calculated with equations (1-5) (see supplementary
from the given set of promoter and non-promoter          material). If GCp > 0.5, and o/e > 0.6, then the
sequences. Here promoter recognition is addressed        sequence is considered CpG islands related,
by looking at the global signal characterized by the     otherwise it is non-CpG islands related [14]. A DNA
frequency of occurrence of n-grams in the promoter       sequence contains four types of nucleotides: adenine
region. We show in the section neural network            (abbreviated A), cytosine (C), guanine (G) and
architecture and classification performance that         thymine (T). With different combinations, there are
these features perform well for prokaryotic as well      43=64 codons and 45=1024 pentamers in promoter
as eukaryotic promoter recognition. To extract the       and nonpromoter datasets. Pentamers are widely
global signal for a promoter, the frequency of           selected as context features in many promoter
occurrence of n-grams is calculated on the DNA           prediction models [4, 7], as they keep good balance
alphabet {A,T,G,C}. The set of n-grams for n = 2 is      between search efficiency and discriminability. Four
16 possible pairs such as AA, AT, AG, AC, TA, etc.       resulting feature matrices are constructed from
and the set of n-grams for n = 3 are 64 triples such     promoter, exon, intron and 3'UTR training datasets.
as AAA, AAT, AAG, AAC, ATA etc. Similarly n-             Finally, three pairs of matrices: promoter versus
grams for n = 4, 5, 6 are calculated. Let fin denote     exon, promoter versus intron and promoter versus
the frequency of occurrence of the i-th feature of n-    3'UTR are built from the four feature matrices for
gram for a particular n value and let |L| denote the     further processing.
length of the sequence. The feature values Vin are
normalized frequency counts given in Eq. (1).

Vin =        fi n
|L| - (n-1) , 1≤ i ≤ 4n for n=2, 3, 4, 5   (1)

                                                                                                250 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                         Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                          Vol. 2, Issue 6, November- December 2012, pp.247-254
                                                               consisting of both positive and negative data is used
                                                               to evaluate the performance of classifier. A neural
                                                               network classifier is trained using n-grams of
                                                               training set as input feature vectors and then the test
                                                               set is evaluated using by same network. The below
                                                               figures are depict the average separation between
                                                               the positive and negative for n=2, 3, 4, 5
                                                               respectively. It can be observed that the plots depict
                                                               the separability of promoter and non-promoter data
                                                               sets in different feature spaces.
                                                                         A feed forward neural network with three
                                                               layers is used for promoter classification. The nodes
                                                               in the input layer are 16, 64, 256, 1024 features for
                                                               n=2, 3, 4, 5 respectively. The experimentation is
                                                               done with different number of hidden nodes that
                                                               give an optimal classification performance. The
       Fig.1. Average separation between promoters and
                                                               output layer has one node to give a binary decision
       non-promoters for n = 2 for Homo sapiens. Here, 1.
                                                               as to whether the given input sequence is a promoter
       . .16 represent AA, AT, AG, AC etc.
                                                               or non-promoter. 5-fold cross validation is used to
                                                               investigate the effect of various n-grams on
       3. NEURAL              NETWORK                          promoter classification by neural network. Average
          ARCHITECTURE            AND                          performance over these folds is being reported.
          CLASSIFICATION PERFORMANCE                           These simulations are using Stuttgart Neural
                A feed-forward neural network with three       Network Simulator (SNNS). The classification
       layers, namely, an input layer, one hidden and an       results are evaluated using performance measures
       output layer is used for promoter classification. The   such as Precision, Specification, Sensitivity 5.3, 5.4
       number of nodes in the input layer is 16, 64, 256,      given these results.
       1024 features for n = 2, 3, 4, 5 respectively. One                Using these, we would like to find out the
       more experimentation which uses Euclidean               efficacy of these features in identifying promoter in
       distance measure to reduce the number of features       human genome [2].
       of n = 5 is done. Experimentation is done with
       different number of hidden nodes that give an
       optimal classification performance. The output layer
       has one node to give a binary decision as to whether
       the given input sequence is a promoter or non-
       promoter. 5-fold cross-validation [25] is used to
       investigate the effect of various n-grams on
       promoter classification by neural network. Average
       performance over these folds is being reported.
       These simulations are done using Stuttgart Neural
       Network Simulator [26].

S.   Features    Precisio    Specificit   Sensitivit   PPV
No               n           y            y
1    N=2         68.47       84.05        67.36        83.5
     gram                                              1
2    N=3gram     70.86      81.923       63.94         84.8
                                                       9
3    N=4gra      72.42      86.51        84.38         89.2
     m                                                 4           Fig. 2. Average separation between promoter
4    N=5gram 76.56          69.44        80.85         81.5    and non promoter for n=3. Here 0...64 represent the
                                                       4       AAA, AAT, AAG, AAC...etc
       Table 3. Homo sapiens classification results for
       different n-grams (average of 5-fold cross validation
       experiments)

       3.1 Classification Performance
                In a binary classification problem the
       training set will be a mixture of both positive and
       negative data sets. Similarly the test set also

                                                                                                      251 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                          Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                           Vol. 2, Issue 6, November- December 2012, pp.247-254




                                                                  Fig.4. Average separation between promoter and
                                                                  non promoter for n = 5. Here 0...1024 represent the
                                                                  AAAAA, AAAAT, AAAAG, AAAAC...Etc

                                                                            Positive predictive value is defined as the
         Figure.3: Average separation between promoter and
                                                                  proportion of true positives to the sum of true
         non promoter for n=4. Here 0...256 represent the
                                                                  positives and false positives. False positives are
         AAAA, AAAT, AAAG, AAAC...etc
                                                                  negatives which are classified as positives. The
                                                                  classification results for various n-grams for Homo
n-gram     Precision    Specificity    Sensitivity   PPV
                                                                  sapiens presented in Tables 1 and 2 respectively.
3-gram     82.07        82.86          78.06         83.75
4-gram     82.51        84.95          78.64         85.51        4. DISCUSSION AND CONCLUSION
                                                                            In proposed approach human promoters are
         Table 4. Homo sapiens classification results for         classified using a neural network classifier. Since
         different n-grams for reduced data set (average of 5-    the optimal features are not known we started
         fold cross-validation experiments)                       classification model with minimum number of
                                                                  features n=2 and incrementally increased to
         Program       True      False      Sensitivit   PPV      maximum number of features. The features are
                       Positiv   Positiv    y (%)        (%)      given as input to a single layer feed forward neural
                       e         e                                network. We selected four promoter systems,
         DragonGS      269       69         68.4         79.6     DragonGSF, Eponine and FirstEF, PCA-HPR to
         F                                                        compare the performance on test set 1. A promoter
         FirstEF       331       501        84.2         39.8     region is counted as a true positive (TP) if TSS is
         Eponine       199       79         50.6         71.6     located within the region, or if a region boundary is
         PCA-HPR       301       65         76.6         82.2     within 200bp 5' of such a TSS. Otherwise the
         AnG-HPR       431       61         78.64        85.5     predicted region is counted as a false positive (FP).
                                                         1        The test results of Eponine and FirstEF are cited
                                                                  from the reference paper [16]. On test set 2, we
         Table 5. Performance comparison of five prediction       adopt     the    same    evaluation     method     as
         systems for test set 2.                                  DragonGSF[5]: when one or more predictions fall in
                                                                  the region of[−2000, +2000] relative to a TSS, a TP
                   The classification results are evaluated       is counted. All predictions which fall on the
         using performance measures such as precision,            annotated part of the gene in the region [+2001,
         specificity and sensitivity. Specificity is the          EndofTheGene] are counted as FP. Other
         proportion of the negative test sequences that are       predictions are not considered in counting TP and
         correctly classified and sensitivity is the proportion   FP. Experimental results of DragonGSF, FirstEF
         of the positive test sequences that are correctly        and Eponine are obtained from [5]. We adopt the
         classified. Precision is the proportion of the           sensitivity (SN) and the positive predictive value
         correctly classified sequences of the entire test data   (PPV) to evaluate the performance of these systems.
         set.                                                     The results and comparisons based on test 1 and test
                                                                  2 are shown in Table 4 and Table 5.
                                                                            In test 1, with the same number of true
                                                                  positives in comparison with existing methods, our
                                                                  method produces the smallest number of false
                                                                  positives. In test 2, although FirstEF, PCA-HPR

                                                                                                        252 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                     Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                      Vol. 2, Issue 6, November- December 2012, pp.247-254
achieves a higher SN than AnG-HPR, the PPV is       ACKNOWLEDGEMENTS
just half of AnG-HPR. PCA-HPR keeps a good                          We would like to thank Dr. T. Shoba Rani
balance between SN and PPV, while AnG-HPR                  for her valuable comments in our discussions. We
produces better results. On test set 3, we compare         would like to thanks Dr.Bapi raju for his valuable
AnG-HPR with DragonGSF because DragonGSF is                comments in our discussion.
the only online system which can accept relatively
longer sequences among systems compared in the             REFERENCES
analysis. In order to get fair results for these             [1]    T. A. Down et al., Genome Research, 12:
sequences which are longer than 1,000,000bp (the                    458 (2001) [PMID: 11875034].
limitation of a file in the DragonGSF web tool), we          [2]    R. V. Davuluri et al., Nature Genetics, 29:
divided them into segments that are equal or less                   412 (2001) [PMID: 11726928]
than 1,000,000bp each, before sending them to                [3]    M. Scherf et al., Journal of Molecular
PCA-HPR and DragonGSF. Under the same                               Biology, 297: 599 (2000) [PMID:
evaluation criteria as the one in test set 2, AnG-HPR               10731414]
achieved a better result: the SN of AnG-HPR and              [4]    V. B. Bajic et al., Journal of Molecular
DragonGSF are 76.2% and 46.8%, and PPV of the                       Graphics and Modelling, 21: 323 (2003)
two systems are 83.4% and 63.8%, respectively.                      [PMID: 12543131]
DragonGSF reports a good prediction performance              [5]    V. B. Bajic et al., Genome Res., 13: 1923
on the whole human genome sequence, but it uses                     (2003) [PMID: 12869582]
the TRANSFAC [17] database which includes                    [6]    V. B. Bajic et al. Nature Biotechnology,
binding site information only available for known                   22: 1467 (2004) [PMID: 15529174]
promoters. Therefore, our system has the advantage           [7]    Thomas Abeel, Yves van de peer, and
in predicting unknown promoters. Different                          Yvan Saeys. Towards a gold standard for
numbers of feature values are used to arrive at the                 promoter        prediction        evolution.
best performance. Test results are measured using                   Bioinformatics, 25(53):i313–i320, July
measures such as precision, specificity and                         2009.
sensitivity and PPV. Maximum accuracy achieved               [8]    Bajic,V.B. et al. (2006) Performance
using SNNS is 89.2% in the case of DBTSS data                       assessment of promoter predictions on
set.                                                                ENCODE regions in the EGASP
          In this Paper, we focus on extracting the                 experiment. Genome Biol., S3.1–S3.13.
statistical features. There is evidence of a statistical     [9]    Jia Zeng, shanfeng Zhu, and Hong Yan.
preference in terms of codon usage patterns in                      Towards accurate human promoter
protein coding regions. The majority of promoter                    recognition: a review of currently used
prediction methods available now directly extract a                 sequence features and classification
limited number of context features from sequences.                  methods. Briefings in Bioinformatics,
Here we are not doing any feature selection and                     10(05):498–508, 2009.
using the entire set of n-grams. In this paper               [10]   Xiaomeng Li, Jia Zeng, and Hong Yan. A
Classification of both promoter (DBTSS data set)                    principle component analysis model for
and non-promoter is best for n=4. We obtained a                     human      promoter     recognition.     Bio
precision of 72.42%, specificity of 86.5%, and                      information, 2(9):373–378, june 2008.
sensitivity of 84.3% and positive predictive value of        [11]   Sobha Rani T and Raju S.Bapi. Analysis of
89.2% for this set. The result shows that for DBTSS                 n-gram based promoter recognition
n=4 gram features give the better performance than                  methods and application to whole genome
other n-grams. The results here consolidate the                     promoter prediction. In Silico Biology, 9(1-
results obtained for Drosophila Melanogaster in the                 2):S1–S16, March 2009.
work done by Sobha et al. They obtained best                 [12]   Gordon, L., Chervonenkis, A. Y.,
performance results for n=4. Does this then make 4-                 Gammerman, A. J., Shahmurradov, I. A.
grams as a special parameter for the eukaryotes is                  and Solovyev, V. V. (2003). Sequence
needed to be further investigated?                                  alignment kernel for recognition of
          A study of the n-gram (n=2, 3, 4, 5) as                   promoter regions. Bioinformatics 19, 1964-
features for a binary neural network classifier is                  1971.
done. In human genome 4-gram features give them              [13]   Ma, Q., Wang, J. T. L., Shasha, D. and Wu,
an optimal performance with neural network                          C. H. (2001). DNA sequence classification
classifier. The results show that the classification                via an expectation maximization algorithm
result 4-gram is better in identification of the                    and neural networks: a case study. IEEE
promoter than the other n-grams. Human promoter                     Transactions on Systems, Man and
classification gives the better accuracy results of                 Cybernetics, Part C: Applications and
89.2%.                                                              Reviews, Special Issue on Knowledge
                                                                    Management. 31, 468-475.


                                                                                                 253 | P a g e
Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and
                Applications (IJERA) ISSN: 2248-9622 www.ijera.com
                 Vol. 2, Issue 6, November- December 2012, pp.247-254
[14]   Wang, J., Ungar, L. H., Tseng, H. and           [27]   Benchmark                          datasets.
       Hannenhalli, S. (2007). MetaProm: a                    www.fruitifly.org/data/seq-
       neural network based meta-predictor for                tool/datasets.com.
       alternative human promoter prediction.
       BMC Genomics 8, 374.
[15]    Pedersen, A. G., Baldi, P., Chauvin, Y.
       and Brunak, S. (1999). The biology of
       eukaryotic promoter prediction - a review.
       Comput. Chem. 23, 191-207.
[16]   Huerta, A. M. and Collado-Vides, J.
       (2003). Sigma70 promoters in Escherichia
       coli: Specific transcription in dense regions
       of overlapping promoter-like signals. J.
       Mol. Biol.333, 261-278
[17]   Cardon, L. R. and Stormo, G. D. (1992).
       Expectation maximization algorithm for
       identifying protein-binding sites with
       variable lengths from unaligned DNA
       fragments. J. Mol. Biol. 223, 159-170.
[18]   Shannon, C. E. (1948). A mathematical
       theory of communication. The Bell System
       Technical Journal. 27, 379-423.
[19]   Ohler, U., Liao, G. C., Niemann, H. and
       Rubin, G. R. (2002). Computational
       analysis of core promoters in the
       Drosophila genome. Genome Biol. 3,
       research0087.
[20]    Ben-Gal, I., Shani, A., Gohr, A., Grau, J.,
       Arviv, S., Shmilovici, A., Posch, S. and
       Grosse, I. (2005). Identification of
       transcription factor binding sites with
       variable-order       Bayesian      networks.
       Bioinformatics 21, 2657-2666.
[21]   Leu, F., Lo, N. and Yang, L. (2005).
       Predicting Vertebrate Promoters with
       Homogeneous          Cluster      Computing.
       Proceedings of the 1st International
       Conference on Signal-Image Technology
       and Internet-Based Systems, SITIS. 143-
       148.
[22]   Ji, H., Xinbin, D. and Xuechun, Z. (2006).
       A systematic computational approach for
       transcription factor target gene prediction.
       IEEE Symposium on Computational
       Intelligence and Bioinformatics and
       Computational Biology CIBCB ’06, 1-7.
[23]   Wang, J. and Hannenhalli, S. (2006). A
       mammalian promoter model links cis
       elements to genetic networks. Biochem.
       Biophys. Res. Commun. 347, 166-177.
[24]    Li, Q. Z. and Lin, H. (2006). The
       recognition and prediction of σ70
       promoters in Escherichia coli K-12. J.
       Theor. Biol. 242, 135-141.
[25]   Alpaydin, E. (2004). Introduction to
       Machine Learning, MIT Press.
[26]   Stuttgart Neural Network Simulator
       http://www-ra.informatik.uni-
       tuebingen.de/SNNS/



                                                                                            254 | P a g e

More Related Content

What's hot

RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1BITS
 
Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...
Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...
Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...researchinventy
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...VHIR Vall d’Hebron Institut de Recerca
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...EMC
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Gunnar Rätsch
 
Part 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expressionPart 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expressionJoachim Jacob
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programsMugdhaSharma11
 
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...TELKOMNIKA JOURNAL
 
2016 bioinformatics i_wim_vancriekinge_vupload
2016 bioinformatics i_wim_vancriekinge_vupload2016 bioinformatics i_wim_vancriekinge_vupload
2016 bioinformatics i_wim_vancriekinge_vuploadProf. Wim Van Criekinge
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packageLi Shen
 
Data analysis
Data analysisData analysis
Data analysisamlbinder
 
Microarray Data Analysis
Microarray Data AnalysisMicroarray Data Analysis
Microarray Data Analysisyuvraj404
 

What's hot (18)

RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1RNA-seq: general concept, goal and experimental design - part 1
RNA-seq: general concept, goal and experimental design - part 1
 
ChipSeq Data Analysis
ChipSeq Data AnalysisChipSeq Data Analysis
ChipSeq Data Analysis
 
Seminar 20150920.2
Seminar 20150920.2Seminar 20150920.2
Seminar 20150920.2
 
Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...
Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...
Study of Parametric Kernels for LSSVM Models in NIR Determination of Fishmeal...
 
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
Introduction to RNA-seq and RNA-seq Data Analysis (UEB-UAT Bioinformatics Cou...
 
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS...
 
Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)Talk ABRF 2015 (Gunnar Rätsch)
Talk ABRF 2015 (Gunnar Rätsch)
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
Part 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expressionPart 5 of RNA-seq for DE analysis: Detecting differential expression
Part 5 of RNA-seq for DE analysis: Detecting differential expression
 
Gene prediction method
Gene prediction method Gene prediction method
Gene prediction method
 
gene prediction programs
gene prediction programsgene prediction programs
gene prediction programs
 
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...
Feature Extraction Analysis for Hidden Markov Models in Sundanese Speech Reco...
 
2016 bioinformatics i_wim_vancriekinge_vupload
2016 bioinformatics i_wim_vancriekinge_vupload2016 bioinformatics i_wim_vancriekinge_vupload
2016 bioinformatics i_wim_vancriekinge_vupload
 
Et25897899
Et25897899Et25897899
Et25897899
 
diffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis packagediffReps: automated ChIP-seq differential analysis package
diffReps: automated ChIP-seq differential analysis package
 
Data analysis
Data analysisData analysis
Data analysis
 
Microarray Data Analysis
Microarray Data AnalysisMicroarray Data Analysis
Microarray Data Analysis
 
Proposal
ProposalProposal
Proposal
 

Viewers also liked (20)

Ab26180184
Ab26180184Ab26180184
Ab26180184
 
Ah26210213
Ah26210213Ah26210213
Ah26210213
 
Bx26501512
Bx26501512Bx26501512
Bx26501512
 
Ae26192196
Ae26192196Ae26192196
Ae26192196
 
Aq26268271
Aq26268271Aq26268271
Aq26268271
 
Fn2611681170
Fn2611681170Fn2611681170
Fn2611681170
 
Fa25939942
Fa25939942Fa25939942
Fa25939942
 
cv
cvcv
cv
 
Pascoa presentes 2013
Pascoa presentes 2013Pascoa presentes 2013
Pascoa presentes 2013
 
Examen de las tic’ planeacion
Examen de las tic’ planeacion Examen de las tic’ planeacion
Examen de las tic’ planeacion
 
Rodrigo méndez arroyo pensamiento y lenguaje
Rodrigo méndez arroyo pensamiento y lenguajeRodrigo méndez arroyo pensamiento y lenguaje
Rodrigo méndez arroyo pensamiento y lenguaje
 
Trabalho No Google
Trabalho No GoogleTrabalho No Google
Trabalho No Google
 
Poema
PoemaPoema
Poema
 
Machado De Assis
Machado De AssisMachado De Assis
Machado De Assis
 
Fascículo SAP - Gestão em MKT
Fascículo SAP - Gestão em MKTFascículo SAP - Gestão em MKT
Fascículo SAP - Gestão em MKT
 
Estatuto Em Pp 1218897781487042 8
Estatuto Em Pp 1218897781487042 8Estatuto Em Pp 1218897781487042 8
Estatuto Em Pp 1218897781487042 8
 
Porto De Galinhas
Porto De GalinhasPorto De Galinhas
Porto De Galinhas
 
Pesquisa de marketing #2
Pesquisa de marketing #2Pesquisa de marketing #2
Pesquisa de marketing #2
 
Regimenes aduaneros especiales
Regimenes aduaneros especialesRegimenes aduaneros especiales
Regimenes aduaneros especiales
 
Sourav raj
Sourav rajSourav raj
Sourav raj
 

Similar to An26247254

BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxBTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxChijiokeNsofor
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical NotebookNaima Tahsin
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
IRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining AlgorithmIRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining AlgorithmIRJET Journal
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing ResearchTanmay Ghai
 
SAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionSAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionAashish Patel
 
Present status and recent developments on available molecular marker.pptx
Present status and recent developments on available molecular marker.pptxPresent status and recent developments on available molecular marker.pptx
Present status and recent developments on available molecular marker.pptxPrabhatSingh628463
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGLong Pei
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
презентация за варшава
презентация за варшавапрезентация за варшава
презентация за варшаваValeriya Simeonova
 
Genome structure prediction a review over soft computing techniques
Genome structure prediction a review over soft computing techniquesGenome structure prediction a review over soft computing techniques
Genome structure prediction a review over soft computing techniqueseSAT Journals
 
Particle Swarm Optimization for Gene cluster Identification
Particle Swarm Optimization for Gene cluster IdentificationParticle Swarm Optimization for Gene cluster Identification
Particle Swarm Optimization for Gene cluster IdentificationEditor IJCATR
 
RT-PCR and DNA microarray measurement of mRNA cell proliferation
RT-PCR and DNA microarray measurement of mRNA cell proliferationRT-PCR and DNA microarray measurement of mRNA cell proliferation
RT-PCR and DNA microarray measurement of mRNA cell proliferationIJAEMSJORNAL
 

Similar to An26247254 (20)

Gene identification using bioinformatic tools.pptx
Gene identification using bioinformatic tools.pptxGene identification using bioinformatic tools.pptx
Gene identification using bioinformatic tools.pptx
 
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptxBTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
 
Bioinformatics.Practical Notebook
Bioinformatics.Practical NotebookBioinformatics.Practical Notebook
Bioinformatics.Practical Notebook
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
IRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining AlgorithmIRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
IRJET-A Novel Approaches for Motif Discovery using Data Mining Algorithm
 
Genome annotation
Genome annotationGenome annotation
Genome annotation
 
genomeannotation-160822182432.pdf
genomeannotation-160822182432.pdfgenomeannotation-160822182432.pdf
genomeannotation-160822182432.pdf
 
RNA Sequencing Research
RNA Sequencing ResearchRNA Sequencing Research
RNA Sequencing Research
 
SAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene ExpressionSAGE- Serial Analysis of Gene Expression
SAGE- Serial Analysis of Gene Expression
 
Present status and recent developments on available molecular marker.pptx
Present status and recent developments on available molecular marker.pptxPresent status and recent developments on available molecular marker.pptx
Present status and recent developments on available molecular marker.pptx
 
Impact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEGImpact_of_gene_length_on_DEG
Impact_of_gene_length_on_DEG
 
Transcriptomics approaches
Transcriptomics approachesTranscriptomics approaches
Transcriptomics approaches
 
Gene prediction strategies
Gene prediction strategies Gene prediction strategies
Gene prediction strategies
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
презентация за варшава
презентация за варшавапрезентация за варшава
презентация за варшава
 
Genome structure prediction a review over soft computing techniques
Genome structure prediction a review over soft computing techniquesGenome structure prediction a review over soft computing techniques
Genome structure prediction a review over soft computing techniques
 
artificial neural network-gene prediction
artificial neural network-gene predictionartificial neural network-gene prediction
artificial neural network-gene prediction
 
Particle Swarm Optimization for Gene cluster Identification
Particle Swarm Optimization for Gene cluster IdentificationParticle Swarm Optimization for Gene cluster Identification
Particle Swarm Optimization for Gene cluster Identification
 
RT-PCR and DNA microarray measurement of mRNA cell proliferation
RT-PCR and DNA microarray measurement of mRNA cell proliferationRT-PCR and DNA microarray measurement of mRNA cell proliferation
RT-PCR and DNA microarray measurement of mRNA cell proliferation
 
1207.2600
1207.26001207.2600
1207.2600
 

An26247254

  • 1. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 AnG-HPR: Analysis of n-Gram based human Promoter Recognition Chandrashekar.Jatoth1, Rupesh Mahajan2 Department of Information Technology, Pad.Dr.DY.P.I.E.T, Pune, India Abstract We describe a promoter recognition method named An-HPR to locate eukaryotic promoter regions and predict transcription start sites (TSSs). We computed n-gram features are extracted and used in promoter prediction. We computed n-grams (n=2, 3, 4, 5) as features and created frequency features to extract informative and discriminative features for effective classification. Neural network classifiers with these n-grams as features are used to identify Figure 1: A schematic representation of the promoters in a human genome. Analysis of n- locations of the promoter region, TFBSs, exons, Gram based (AnG) is applied to the feature introns and 3′utr regions. Extraction and a subset of principal components (PCs) are selected for classification. Our system Available promoter prediction systems use uses three neural network classifiers to two types of features for classification namely, distinguish promoters versus exons, promoters context features like nmers, and signal features such versus introns, and promoters versus 3' un- as TATA-box, CCAAT-box, and CpG islands. translated region (3'UTR). We compared AnG- Among the favorable promoter prediction programs, HPR with four well-known existing promoter Eponine [1] builds a PWM to detect TATA-box and prediction systems such as DragonGSF, Eponine G+C enrichment regions as promoter-like regions; and FirstEF, PCA-HPR. Validation shows that FirstEF [2] uses CpG-related and non-CpG related AnG-HPR achieves the best performance with first exons as signal features; Promoter Inspector [3] three test sets for all the four predictive systems. uses IUPAC words with wildcards as context features. Good experiment results are achieved by Keywords— promoter recognition; sequence integrating these two types of features. DPF [4] feature; transcription start sites, Biological data applies a separate module on G+C rich and G+C sets, neural networks, Binary classification, poor regions, and selects 256 pentamers to generate cascaded classifiers. a PWM for prediction. Furthermore, DragonGSF [5, 6] adds the CpG-island feature to DPF. 1. INTRODUCTION Most of the promoter prediction programs Promoter recognition is a real problem that try to predict the exact location of the promoter computationally identifies the transcription start site region of the known protein-coding genes, while (TSS) or the 5’end of the gene without time- some focus on finding the transcription start site consuming and expensive experimental methods (TSS). Some research has show that there is often that align ESTs, cDNAs or mRNAs against to the no single TSS, but rather a whole transcription start entire genome. In this article, we focus on humans region (TSR) containing multiple TSSs that are used because it is a representative species that has with different frequencies. Generally two main attracted much more attention in the past decade. In approaches are used in promoter prediction [7]. humans, the TSS is surrounded with a core- 1. First approach assigns scores to all single promoter region within around ±100 base pairs (bp). nucleotides to Identify TSS. A proximal promoter region has several hundreds bp 2. Second approach identifies a promoter region immediately upstream of the core promoter. without providing scores for all nucleotides. . In this article analyzes the performance of 17 programs on two tasks: Genome wide identification of the start of genes and Genome wide identification of TSRs. In Existing methods Promoter prediction programs DragonGSF predicts the value of precision is between 37-48% and recall is 51-70%, DragonPF predicts the value of precision is between 61-65% and recall 62-64%, FristEF 247 | P a g e
  • 2. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 predicts the value of precision is 79-81% and recall identification of promoters and first exons has been 35-40%, Eponine predicts the value of precision is ≈ one of the most difficult problems in gene-finding. 40 and recall ≈ 67%, NNPP2.2 predicts the value of The FirstEF [2] program identifies a promoter precision is 69-93% and recall 2.0-4.5%, region and first exons in the human genome, which McPromoter2.0 predicts the value of precision is 26- may be also be useful for the annotation of other 57%and recall ≈ 4.5, proSom predicts the value of mammalian genomes. FirstEF consists of different precision is 0.38% and recall 0.66% [8]. discriminate functions structured as a decision tree. We select the 13 representative PPPs that can The probabilistic models are designed to find analyze large genome sequences and report strand- potential first splice donor sites and CpGrelated and specific TSS predictions. ARTS, Eponine used non-CpG-related promoter regions based on SVM as the part of their design, EP3, Promoter discriminant analysis. For every potential first scan, Wu-method used Position weight matrix splice-donor site and upstream promoter region, (PWM). FirstEF decides whether the intermediate region could be a potential first exon based on a set of CpGcluster used distance based algorithm. quadratic discriminant functions. Training and CpGProD used linear discriminating analysis testing using different discriminant functions, the (LDA). DragonGSF, DragonPF, McPromoter used first exons and promoter regions from the first-exon neural networks.NNPP2.2 has used Time Delay database are used. Ramana et al. have also tried to neural network. Promoter Explorer has used identify the promoters as well as first exons of AbaBoost algorithm. proSOM has used SOM. These human species by using an algorithm called FirstEF programs have used as features various aspects of which is based upon the usage of structural and promoter and other regions of the DNA [6]. compositional features [3]. They were able to There are three classification methods for human predict 86% of the first exons. They have compared promoter recognition system [9]. their method with Promoter Inspector and obtained a • Discriminative model that finds the optimal sensitivity of 70% compared to Promoter thresholds or classification boundaries in the signal, Inspector’s 48%. Bajic et al. termed that the context and structure features space. Typical prediction is positive if the predicted transcription methods include artificial neural networks (ANNs), start site (TSS) falls within a maximum allowed discriminate functions and support vector machines distance from the reference transcription start site (SVMs). [5]. They have assessed performance of some of the • Generative model that describes the generative prediction algorithms based on the performance process of signal, context and structure measures such as sensitivity and positive predictive observations. Position weight matrix (PWM), value. In their later paper they concluded that the nearest neighborhood and hidden Markov models promoter prediction combined with gene prediction (HMMs) belong to generative models. yields a better recognition rate [6]. • Ensemble that combines multiple classifiers for multiple features in order to achieve a consensus ExonType Sensitivit Specificit correlation and robust recognition results. y y coefficient CpG-related 0.92 0.97 0.94 Promoter Inspector is a program that NotCpG- 0.74 0.60 0.65 predicts eukaryotic poly II promoter regions with related high specificity in mammalian genomic sequences. all exons 0.86 0.83 0.83 The program Promoter Inspector focuses on the Table1.Accuracy of FirstEF based on cross- genomic context of promoters rather than their exact validation location. Promoter Inspector is based (refer table 1) on three classifiers, which specialize in to Li et al. is having proposed that locate differentiating between promoter region and a eukaryotic promoter regions and predict subset of non-promoter sequences (intron, exon and transcription start sites (TSSs). Here the authors 3′utr). have computed codon (3-mer) and pentamer (5-mer) frequencies and created codon and pentamer In contrast to this, PromnFD and PromFind frequency feature matrices to extract informative use only one classifier, i.e the features are extracted and discriminative features for effective from one promoter set and one set of various non- classification. Principal component analysis (PCA) promoter sequences. To compare the two is applied to the feature matrices and a subset of approaches, two versions of Promoter Inspector are principal components (PCs) are selected for built. Version v1 was based on one st of mixed non- classification [7]. promoter sequences, while version v2 was built on the basis of exon, intron and 3′utr. Both versions of promoter Inspector were applied to exon, intron, 3′utr and promoter evaluation sets [3]. The 248 | P a g e
  • 3. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 Program True False Sensitivit PP [16], to expectation and maximization (EM) Positiv Positiv y (%) V algorithm [17]. These techniques are based on e e (%) basically motifs present in the promoter which are DragonGS 269 69 68.4 79.6 specific regions in the promoter or the global signal F that is present in the promoter. To extract the local FirstEF 331 501 84.2 39.8 or global signals various feature extraction methods Eponine 199 79 50.6 71.6 are used. PCA-HPR 301 65 76.6 82.2 Condon usage patterns in coding regions Table 2. Performance comparison of four prediction and hexamer conservation (TATA box, CAAT box) systems for 22 chromosomes. in promoter regions is well known. Techniques that use these concepts are available in abundance in Promoter prediction systems use two types literature. Most of the local content-based methods of features for classification namely, context are in fact based on conservation of the hexamers features like n-mers, and signal features such as [13, 16]. In literature there are a few articles on TATA box, CCAAT-box and CpG islands. Among protein sequence classification and gene the favorable promoter prediction programs, identification using n-grams, but very few on Eponine builds a PWM to detect TATA-box and promoter recognition. An n-gram is a selection of n G+C enrichment regions as promoter-like regions; contiguous characters from a given character stream FirstEF uses CpGrelated and non-CpG related first [18]. Ohler et al. have used interpolated Markov exons as signal features; Promoter Inspector uses chains on human and Drosophila promoter IUPAC words with wild cards as context features. sequences as positive data set achieving a Good experiment results are achieved by integrating performance accuracy of 53% [19]. Ben-gal et al. these two types’ features. DPF applies a separate have used a variable-order Bayesian network which module on G+C rich and G+C poor regions, and looks at the statistical dependencies between selects 256 pentamers to generate a PWM for adjacent base pairs to achieve a true positive prediction. Furthermore, DragonGSF adds the CpG- recognition rate of 47.56% [20]. Leu et al. have island feature to DPF. Jia Zeng et al. in their paper developed a vertebrate promoter prediction system selected three promoter systems DragonGSF, with cluster computing extracting n-grams for n = 6 Eponine and FirstEF to compare the performance on to 20 [21]. They achieved an accuracy rate of 88%. test set 1. A promoter region is counted as a true Ji et al. have used SVM and n-grams (n = 4, 5, 6, 7) positive (TP) if TSS is located within the region, or for target gene prediction of Arabidopsis [22]. if a region boundary is within 200bp 5′ of such a Prediction system with cluster computing extracting TSS. Otherwise the predicted region is counted as a n-grams for n = 6 to 20 [21]. They achieved an false positive (FP). The test results of Eponine and accuracy rate of 88%. Ji et al. have used SVM and FirstEF. On test set 2, we adopt the same evaluation n-grams (n = 4, 5, 6, 7) for target gene prediction of method as DragonGSF when one or more Arabidopsis [22]. There are position specific n-gram predictions fall in the region of [2000, +2000] methods by Wang et al. and Li et al. [23, 24]. Wang relative to a TSS, a TP is counted. All predictions et al. have proposed a position specific propensity which fall on the annotated part of the gene in the analysis model (PSPA) which extracts the region are counted as FP [10]. Sobha.et.al is termed propensity of DNA elements at a particular position that n-gram based promoter recognition methods and their co-occurrence with respect to TSS in were tried in promoter prediction and its application mammals. They have considered a set of top ranking to whole genome promoter prediction in E.coli and k-mers (k = 1 to 5) at each position ±100 bp relative Drosophila [11]. Here, we describe a method named to TSS and the co-occurrence with other top ranking AnG-HPR to predict the location of the TSSs with k-mers at other downstream positions. PSPA score best performance. We extracting n-grams and using for a sequence is computed as the product of scores them to identify promoters in human genome. for the 200 positions of ±100 bp relative to TSS. Patterns or features that characterize a They found many position-specific promoter promoter/non-promoter are needed to be extracted elements that are strongly linked to gene product from the given set of promoter and non-promoter function. Li et al. too have considered position- sequences. Here promoter recognition is addressed specific weight matrices of hexamers at some ten by looking at the global signal characterized by their specific positions for the promoter data of E. coli frequency of occurrence of n-grams in the promoter [24]. region. Here, we extract n-grams to be used as features. An investigation of the lower order n- 2. Introduction to N-Grams as features grams for promoter recognition was taken up in Promoter recognition is tackled using order to assess their applicability in whole genome various techniques such as support vector machines promoter recognition. To this end we have extracted (SVM) [12], neural networks [13, 14], hidden the n-grams and fed them to a multi-layer feed- Markov models [15], position weight matrix (PWM) forward neural network. Further, by using the best 249 | P a g e
  • 4. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 features, two more neural networks are designed to Here, the denominator denotes the number of n- annotate the promoters in a genome segment. In the grams that are possible in a sequence of length |L| following sections, we explain the extraction of and hence Vin denotes the proportional frequency of features, the classification results using these occurrence of i-th feature for a particular n value. features, and a way of finding the promoters in an Thus each promoter and non-promoter sequence of unknown segment of the Human genome. the data set is represented as a 16-dimensional feature 2.1 Feature Extraction Vector (V12, V22, V32, ……… V162) for n=2, as a 64- In this section, different data sets that are used dimensional feature vector (V13, V23, V33, ……… in experimentation and the feature extraction V643) for n=3, as a 256 dimensional feature vector method for various n-grams are described. (V14, V24, V34,……… V2564) for n=4 and a 1024- dimensional feature vector (V15, V25, V35,…… 2.1.1 Data set V10245) for n=5. The training set in this experiment is divided into several subsets of promoters, introns, In a binary classification problem, the exons and 3′UTR sequences. Promoter sequences training set will be a mixture of both positive and are extracted from two public databases. One is the negative data sets. Similarly the test set which Eukaryotic Promoter Database (EPD), release 86 consists of both positive and negative data is used to [8], which contains 1871 human promoter evaluate the performance of the classifier. A neural sequences. The other is the Database of network classifier is trained using the n-grams of the Transcription Start Sites (DBTSS), version 5.2.0 [9], training set as input feature vectors and then the test which includes 30,964 human promoter sequences set is evaluated using the same network. Figures 1–4 and 15,531 forward strand promoter sequences. We depict the average separation between the positive used forward strand promoter sequences are in our and negative data for n = 2, n = 3, n = 4 and n = 5, experiment. Human exon and intron sequences are respectively. It can be observed that the plots depict extracted from the EID [10], and the first exons are the separability of promoter and non-promoter data not included in the exons training set. Human sets in different feature spaces. 3’UTR sequences are from the UTR database [11]. CpG islands, which exist in 60% of From DBTSS we have extracted promoter mammalian promoters [12], are regarded as one of sequences [-250, +50] bp around the experimental the most important signal features in promoter TSS. DBTSS contains 24 chromosomes; each recognition. Methods such as CpGProD [13], DPF chromosome has 1000 nucleotide sequences. From [4], DragonGSF [5, 6], FirstEF [2] and EID and 3′UTR we have extracted non-promoter PromoterExplorer [7] embed this signal feature in sequences of length 300 bp [27]. their prediction system. In our method, two features are used to identify if a given sequence (>200bp) is 2.1.2 Method CpG islands related: GC percentage (GCp) and Patterns or features that characterize a Observed/expected CpG ratio (o/e). These are promoter/non-promoter are needed to be extracted calculated with equations (1-5) (see supplementary from the given set of promoter and non-promoter material). If GCp > 0.5, and o/e > 0.6, then the sequences. Here promoter recognition is addressed sequence is considered CpG islands related, by looking at the global signal characterized by the otherwise it is non-CpG islands related [14]. A DNA frequency of occurrence of n-grams in the promoter sequence contains four types of nucleotides: adenine region. We show in the section neural network (abbreviated A), cytosine (C), guanine (G) and architecture and classification performance that thymine (T). With different combinations, there are these features perform well for prokaryotic as well 43=64 codons and 45=1024 pentamers in promoter as eukaryotic promoter recognition. To extract the and nonpromoter datasets. Pentamers are widely global signal for a promoter, the frequency of selected as context features in many promoter occurrence of n-grams is calculated on the DNA prediction models [4, 7], as they keep good balance alphabet {A,T,G,C}. The set of n-grams for n = 2 is between search efficiency and discriminability. Four 16 possible pairs such as AA, AT, AG, AC, TA, etc. resulting feature matrices are constructed from and the set of n-grams for n = 3 are 64 triples such promoter, exon, intron and 3'UTR training datasets. as AAA, AAT, AAG, AAC, ATA etc. Similarly n- Finally, three pairs of matrices: promoter versus grams for n = 4, 5, 6 are calculated. Let fin denote exon, promoter versus intron and promoter versus the frequency of occurrence of the i-th feature of n- 3'UTR are built from the four feature matrices for gram for a particular n value and let |L| denote the further processing. length of the sequence. The feature values Vin are normalized frequency counts given in Eq. (1). Vin = fi n |L| - (n-1) , 1≤ i ≤ 4n for n=2, 3, 4, 5 (1) 250 | P a g e
  • 5. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 consisting of both positive and negative data is used to evaluate the performance of classifier. A neural network classifier is trained using n-grams of training set as input feature vectors and then the test set is evaluated using by same network. The below figures are depict the average separation between the positive and negative for n=2, 3, 4, 5 respectively. It can be observed that the plots depict the separability of promoter and non-promoter data sets in different feature spaces. A feed forward neural network with three layers is used for promoter classification. The nodes in the input layer are 16, 64, 256, 1024 features for n=2, 3, 4, 5 respectively. The experimentation is done with different number of hidden nodes that give an optimal classification performance. The Fig.1. Average separation between promoters and output layer has one node to give a binary decision non-promoters for n = 2 for Homo sapiens. Here, 1. as to whether the given input sequence is a promoter . .16 represent AA, AT, AG, AC etc. or non-promoter. 5-fold cross validation is used to investigate the effect of various n-grams on 3. NEURAL NETWORK promoter classification by neural network. Average ARCHITECTURE AND performance over these folds is being reported. CLASSIFICATION PERFORMANCE These simulations are using Stuttgart Neural A feed-forward neural network with three Network Simulator (SNNS). The classification layers, namely, an input layer, one hidden and an results are evaluated using performance measures output layer is used for promoter classification. The such as Precision, Specification, Sensitivity 5.3, 5.4 number of nodes in the input layer is 16, 64, 256, given these results. 1024 features for n = 2, 3, 4, 5 respectively. One Using these, we would like to find out the more experimentation which uses Euclidean efficacy of these features in identifying promoter in distance measure to reduce the number of features human genome [2]. of n = 5 is done. Experimentation is done with different number of hidden nodes that give an optimal classification performance. The output layer has one node to give a binary decision as to whether the given input sequence is a promoter or non- promoter. 5-fold cross-validation [25] is used to investigate the effect of various n-grams on promoter classification by neural network. Average performance over these folds is being reported. These simulations are done using Stuttgart Neural Network Simulator [26]. S. Features Precisio Specificit Sensitivit PPV No n y y 1 N=2 68.47 84.05 67.36 83.5 gram 1 2 N=3gram 70.86 81.923 63.94 84.8 9 3 N=4gra 72.42 86.51 84.38 89.2 m 4 Fig. 2. Average separation between promoter 4 N=5gram 76.56 69.44 80.85 81.5 and non promoter for n=3. Here 0...64 represent the 4 AAA, AAT, AAG, AAC...etc Table 3. Homo sapiens classification results for different n-grams (average of 5-fold cross validation experiments) 3.1 Classification Performance In a binary classification problem the training set will be a mixture of both positive and negative data sets. Similarly the test set also 251 | P a g e
  • 6. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 Fig.4. Average separation between promoter and non promoter for n = 5. Here 0...1024 represent the AAAAA, AAAAT, AAAAG, AAAAC...Etc Positive predictive value is defined as the Figure.3: Average separation between promoter and proportion of true positives to the sum of true non promoter for n=4. Here 0...256 represent the positives and false positives. False positives are AAAA, AAAT, AAAG, AAAC...etc negatives which are classified as positives. The classification results for various n-grams for Homo n-gram Precision Specificity Sensitivity PPV sapiens presented in Tables 1 and 2 respectively. 3-gram 82.07 82.86 78.06 83.75 4-gram 82.51 84.95 78.64 85.51 4. DISCUSSION AND CONCLUSION In proposed approach human promoters are Table 4. Homo sapiens classification results for classified using a neural network classifier. Since different n-grams for reduced data set (average of 5- the optimal features are not known we started fold cross-validation experiments) classification model with minimum number of features n=2 and incrementally increased to Program True False Sensitivit PPV maximum number of features. The features are Positiv Positiv y (%) (%) given as input to a single layer feed forward neural e e network. We selected four promoter systems, DragonGS 269 69 68.4 79.6 DragonGSF, Eponine and FirstEF, PCA-HPR to F compare the performance on test set 1. A promoter FirstEF 331 501 84.2 39.8 region is counted as a true positive (TP) if TSS is Eponine 199 79 50.6 71.6 located within the region, or if a region boundary is PCA-HPR 301 65 76.6 82.2 within 200bp 5' of such a TSS. Otherwise the AnG-HPR 431 61 78.64 85.5 predicted region is counted as a false positive (FP). 1 The test results of Eponine and FirstEF are cited from the reference paper [16]. On test set 2, we Table 5. Performance comparison of five prediction adopt the same evaluation method as systems for test set 2. DragonGSF[5]: when one or more predictions fall in the region of[−2000, +2000] relative to a TSS, a TP The classification results are evaluated is counted. All predictions which fall on the using performance measures such as precision, annotated part of the gene in the region [+2001, specificity and sensitivity. Specificity is the EndofTheGene] are counted as FP. Other proportion of the negative test sequences that are predictions are not considered in counting TP and correctly classified and sensitivity is the proportion FP. Experimental results of DragonGSF, FirstEF of the positive test sequences that are correctly and Eponine are obtained from [5]. We adopt the classified. Precision is the proportion of the sensitivity (SN) and the positive predictive value correctly classified sequences of the entire test data (PPV) to evaluate the performance of these systems. set. The results and comparisons based on test 1 and test 2 are shown in Table 4 and Table 5. In test 1, with the same number of true positives in comparison with existing methods, our method produces the smallest number of false positives. In test 2, although FirstEF, PCA-HPR 252 | P a g e
  • 7. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 achieves a higher SN than AnG-HPR, the PPV is ACKNOWLEDGEMENTS just half of AnG-HPR. PCA-HPR keeps a good We would like to thank Dr. T. Shoba Rani balance between SN and PPV, while AnG-HPR for her valuable comments in our discussions. We produces better results. On test set 3, we compare would like to thanks Dr.Bapi raju for his valuable AnG-HPR with DragonGSF because DragonGSF is comments in our discussion. the only online system which can accept relatively longer sequences among systems compared in the REFERENCES analysis. In order to get fair results for these [1] T. A. Down et al., Genome Research, 12: sequences which are longer than 1,000,000bp (the 458 (2001) [PMID: 11875034]. limitation of a file in the DragonGSF web tool), we [2] R. V. Davuluri et al., Nature Genetics, 29: divided them into segments that are equal or less 412 (2001) [PMID: 11726928] than 1,000,000bp each, before sending them to [3] M. Scherf et al., Journal of Molecular PCA-HPR and DragonGSF. Under the same Biology, 297: 599 (2000) [PMID: evaluation criteria as the one in test set 2, AnG-HPR 10731414] achieved a better result: the SN of AnG-HPR and [4] V. B. Bajic et al., Journal of Molecular DragonGSF are 76.2% and 46.8%, and PPV of the Graphics and Modelling, 21: 323 (2003) two systems are 83.4% and 63.8%, respectively. [PMID: 12543131] DragonGSF reports a good prediction performance [5] V. B. Bajic et al., Genome Res., 13: 1923 on the whole human genome sequence, but it uses (2003) [PMID: 12869582] the TRANSFAC [17] database which includes [6] V. B. Bajic et al. Nature Biotechnology, binding site information only available for known 22: 1467 (2004) [PMID: 15529174] promoters. Therefore, our system has the advantage [7] Thomas Abeel, Yves van de peer, and in predicting unknown promoters. Different Yvan Saeys. Towards a gold standard for numbers of feature values are used to arrive at the promoter prediction evolution. best performance. Test results are measured using Bioinformatics, 25(53):i313–i320, July measures such as precision, specificity and 2009. sensitivity and PPV. Maximum accuracy achieved [8] Bajic,V.B. et al. (2006) Performance using SNNS is 89.2% in the case of DBTSS data assessment of promoter predictions on set. ENCODE regions in the EGASP In this Paper, we focus on extracting the experiment. Genome Biol., S3.1–S3.13. statistical features. There is evidence of a statistical [9] Jia Zeng, shanfeng Zhu, and Hong Yan. preference in terms of codon usage patterns in Towards accurate human promoter protein coding regions. The majority of promoter recognition: a review of currently used prediction methods available now directly extract a sequence features and classification limited number of context features from sequences. methods. Briefings in Bioinformatics, Here we are not doing any feature selection and 10(05):498–508, 2009. using the entire set of n-grams. In this paper [10] Xiaomeng Li, Jia Zeng, and Hong Yan. A Classification of both promoter (DBTSS data set) principle component analysis model for and non-promoter is best for n=4. We obtained a human promoter recognition. Bio precision of 72.42%, specificity of 86.5%, and information, 2(9):373–378, june 2008. sensitivity of 84.3% and positive predictive value of [11] Sobha Rani T and Raju S.Bapi. Analysis of 89.2% for this set. The result shows that for DBTSS n-gram based promoter recognition n=4 gram features give the better performance than methods and application to whole genome other n-grams. The results here consolidate the promoter prediction. In Silico Biology, 9(1- results obtained for Drosophila Melanogaster in the 2):S1–S16, March 2009. work done by Sobha et al. They obtained best [12] Gordon, L., Chervonenkis, A. Y., performance results for n=4. Does this then make 4- Gammerman, A. J., Shahmurradov, I. A. grams as a special parameter for the eukaryotes is and Solovyev, V. V. (2003). Sequence needed to be further investigated? alignment kernel for recognition of A study of the n-gram (n=2, 3, 4, 5) as promoter regions. Bioinformatics 19, 1964- features for a binary neural network classifier is 1971. done. In human genome 4-gram features give them [13] Ma, Q., Wang, J. T. L., Shasha, D. and Wu, an optimal performance with neural network C. H. (2001). DNA sequence classification classifier. The results show that the classification via an expectation maximization algorithm result 4-gram is better in identification of the and neural networks: a case study. IEEE promoter than the other n-grams. Human promoter Transactions on Systems, Man and classification gives the better accuracy results of Cybernetics, Part C: Applications and 89.2%. Reviews, Special Issue on Knowledge Management. 31, 468-475. 253 | P a g e
  • 8. Chandrashekar.Jatoth, Rupesh Mahajan / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 2, Issue 6, November- December 2012, pp.247-254 [14] Wang, J., Ungar, L. H., Tseng, H. and [27] Benchmark datasets. Hannenhalli, S. (2007). MetaProm: a www.fruitifly.org/data/seq- neural network based meta-predictor for tool/datasets.com. alternative human promoter prediction. BMC Genomics 8, 374. [15] Pedersen, A. G., Baldi, P., Chauvin, Y. and Brunak, S. (1999). The biology of eukaryotic promoter prediction - a review. Comput. Chem. 23, 191-207. [16] Huerta, A. M. and Collado-Vides, J. (2003). Sigma70 promoters in Escherichia coli: Specific transcription in dense regions of overlapping promoter-like signals. J. Mol. Biol.333, 261-278 [17] Cardon, L. R. and Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 223, 159-170. [18] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal. 27, 379-423. [19] Ohler, U., Liao, G. C., Niemann, H. and Rubin, G. R. (2002). Computational analysis of core promoters in the Drosophila genome. Genome Biol. 3, research0087. [20] Ben-Gal, I., Shani, A., Gohr, A., Grau, J., Arviv, S., Shmilovici, A., Posch, S. and Grosse, I. (2005). Identification of transcription factor binding sites with variable-order Bayesian networks. Bioinformatics 21, 2657-2666. [21] Leu, F., Lo, N. and Yang, L. (2005). Predicting Vertebrate Promoters with Homogeneous Cluster Computing. Proceedings of the 1st International Conference on Signal-Image Technology and Internet-Based Systems, SITIS. 143- 148. [22] Ji, H., Xinbin, D. and Xuechun, Z. (2006). A systematic computational approach for transcription factor target gene prediction. IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology CIBCB ’06, 1-7. [23] Wang, J. and Hannenhalli, S. (2006). A mammalian promoter model links cis elements to genetic networks. Biochem. Biophys. Res. Commun. 347, 166-177. [24] Li, Q. Z. and Lin, H. (2006). The recognition and prediction of σ70 promoters in Escherichia coli K-12. J. Theor. Biol. 242, 135-141. [25] Alpaydin, E. (2004). Introduction to Machine Learning, MIT Press. [26] Stuttgart Neural Network Simulator http://www-ra.informatik.uni- tuebingen.de/SNNS/ 254 | P a g e