PREDICTING THE CELLULAR
LOCALIZATION SITES OF PROTEINS
USING ARTIFICIAL NEURAL
NETWORKS
Vaibhav Dhattarwal
Department of Computer Science and Engineering
Indian Institute of Technology Roorkee
vaibhav.csi.iitr@gmail.com
Abstract - In this paper, I present a brief description
of how a feed-forward artificial neural network was
implemented in C++. In the introduction, I explain
that the reason for implementing this artificial neural
network was to predict the cellular localization sites
of proteins, specifically in a yeast data set. This is
followed by a concise explanation of the design and
implementation of a three-layer feed-forward neural
network trained with the back-propagation algorithm,
together with the attributes of the data set and the
possible output locations of a protein. A step-by-step
breakdown of how I approached the project follows;
the implementation of the network is explained along
with how the algorithm is executed within the code.
Finally, we can see the results as the parameters of the
implemented artificial neural network are varied.
Keywords - Prediction, Localization Sites, Proteins,
Simulation, Neural Networks
I. INTRODUCTION
Let me start with a basic explanation of why this
topic was chosen. The topic, prediction of the
cellular localization of proteins, describes the
information represented by the data set on which
this paper is based. I implement an artificial neural
network based on the back-propagation algorithm,
and to evaluate the performance of the simulated
network I needed a data set with which to train and
test it. Consider the significance of the chosen data
set. If one can deduce the subcellular location of a
protein, one can interpret its function, its part in
healthy processes and in the onset of disease, and
its probable usage as a drug target. Experimental
methods for ascertaining the subcellular location of
a protein have advantages such as reliability and
accuracy, along with disadvantages such as being
slow and labour-intensive. Compared with these
methods, high-throughput computational prediction
tools enable us to deduce information that is
otherwise difficult to attain. For example, for
proteins whose composition is determined from a
genomic sequence, computational methods are
preferable, as such proteins may be hard to isolate,
produce, or locate in an experiment.
The subcellular location of a protein can provide
valuable information about the role it plays in
cellular dynamics. There has been an unprecedented
surge in the amount of sequenced genomic data
available, which makes a computerized,
high-accuracy tool for predicting subcellular
location increasingly important. There have been
many efforts to predict protein subcellular location
accurately. This paper aims to bring together
artificial neural networks and bioinformatics to
predict the location of proteins in the yeast genome.
I introduce a new subcellular prediction method
based on a back-propagation neural network.
The problem statement is: “Prediction of cellular
localization sites of proteins using artificial neural
networks.”
The task of this paper lies first in simulating a
three-layer artificial neural network; the
back-propagation algorithm is used to train it. First
we explain the algorithm, and then, in our
implementation, we show how the algorithm is
realized in the code used to simulate the network.
After this, we examine the observations recorded
by running the yeast data set on the simulated
network to train it. We then use these observations
to identify trends and evaluate performance.
II. PROPOSED METHODOLOGY
A. Simulate an artificial neural network
corresponding to the attributes of the yeast data
set.
To enlarge the function space that the neural
network can represent, we implement a three-layer
feed-forward network, which involves one layer of
hidden nodes. With a large number of nodes in the
middle layer, we can represent almost any
continuous function with acceptable levels of
accuracy.
Figure 1: Structure of a three-layer feed-forward
neural network.
The definitions of the input and output nodes can
be regarded as similar to those of the perceptron
network discussed earlier. The major difference is
that we add a single layer of hidden nodes between
the input and output nodes.
Similarly, we also use the ratio of correctly
classified examples in the training set as the
threshold for the termination condition. The major
difference from the algorithm for training a
two-layer feed-forward network is that, when we
update the weights of the hidden layer, we must
back-propagate the error from the output layer to
the hidden layer.
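As a sketch, the forward computation through one such layer, a weighted sum at each node passed through a sigmoid activation, can be written in C++ (the paper's implementation language). The weight layout, with row 0 holding the bias term as in the pseudocode of the implementation section, and all names used here are illustrative assumptions rather than the original code:

```cpp
#include <cmath>
#include <vector>

// Logistic sigmoid used as the activation function at each node.
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One forward pass through a fully connected layer.
// weights[0][j] holds the bias for node j; weights[i + 1][j] the weight
// from input i to node j (hypothetical layout).
std::vector<double> forwardLayer(const std::vector<double>& input,
                                 const std::vector<std::vector<double>>& weights) {
    std::size_t numOut = weights[0].size();
    std::vector<double> out(numOut);
    for (std::size_t j = 0; j < numOut; ++j) {
        double net = weights[0][j];                  // bias term
        for (std::size_t i = 0; i < input.size(); ++i)
            net += input[i] * weights[i + 1][j];     // weighted sum of inputs
        out[j] = sigmoid(net);                       // activation
    }
    return out;
}
```

Applying this function twice, input to hidden and hidden to output, yields the three-layer forward pass.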
B. Implement the back-propagation algorithm on
the simulated artificial neural network.
The algorithm for our three-layer network:
a. Initialize the weights of the network.
b. Perform the following operations:
   1. For every example in the training set:
      - Compute the output of the neural network
        for this example, denoted O (forward pass).
      - Let T denote the teaching output for this
        example.
      - The error is given by (T - O).
      - Calculate ΔWHO for all weights between the
        hidden and output layers.
      - Move backwards in the network (backward
        pass).
      - Calculate ΔWIH for all weights between the
        input and hidden layers.
      - Update the weights of the network using the
        calculated delta values.
c. Stop when the error criterion is met.
d. Return the trained network.
The learning algorithm that we have chosen for our
network is the back-propagation algorithm. It can
be divided into two stages:
Stage One: Propagation Phase
This phase consists of the following operations:
1. First, we forward-propagate the training
   pattern's input data through the network,
   producing the output activations.
2. Second, we backward-propagate through the
   network the error between these output
   activations and the training pattern's desired
   target data.
Stage Two: Weight-Updating Phase
In this stage, the following operations are carried
out for every connection possessing a weight:
1. First, we multiply the output delta by the input
   to calculate the gradient of the weight.
2. Second, we subtract a ratio of the gradient
   from the weight. This moves the weight in the
   direction opposite to the gradient.
We keep repeating stages one and two until the
network starts performing with an acceptable
success rate.
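The two-step update above can be sketched for a single weight in C++. The learning rate (β) and momentum term (α) follow the pseudocode of the implementation section; the struct and the default coefficient values are assumptions for illustration. Note that because the delta term already carries the sign of (T - O), the computed change is added to the weight, which is equivalent to moving against the gradient of the error:

```cpp
// Stage-two update for a single weight, with learning rate (beta) and a
// momentum term (alpha) carrying over the previous change. The names and
// default values are assumed, not taken from the original code.
struct WeightUpdate {
    double beta  = 0.5;   // learning rate (assumed value)
    double alpha = 0.9;   // momentum coefficient (assumed value)

    // w: weight being trained; dwPrev: previous change (momentum state);
    // delta: back-propagated error term; in: activation feeding the weight.
    void apply(double& w, double& dwPrev, double delta, double in) const {
        double dw = beta * delta * in + alpha * dwPrev; // gradient + momentum
        w += dw;          // delta already encodes (T - O), so we add
        dwPrev = dw;      // remember the change for the next momentum step
    }
};
```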
C. Train the network using the data set.
The yeast data set has eight attributes. These
attributes were calculated from amino acid
sequences.
1. erl: It is representative of the lumen of the
endoplasmic reticulum in the cell. This
attribute tells whether an HDEL pattern, acting
as a signal for retention, is present or not.
2. vac: This attribute gives an indication of
the content of amino acids in vacuolar and
extracellular proteins after performing a
discriminant analysis.
3. mit: This attribute gives the composition of
the twenty-residue N-terminal region of
mitochondrial and non-mitochondrial
proteins, obtained after performing a
discriminant analysis.
4. nuc: This feature tells us about nuclear
localization patterns as to whether they are
present or not. It also holds some
information about the frequency of basic
residues.
5. pox: This attribute provides the
composition of the sequence of protein
after discriminant analysis on them. Not
only this, it also indicates the presence of a
short sequence motif.
6. mcg: This is a parameter used in McGeoch's
method for signal sequence detection. In this
case, however, we are using a modified
version of it.
7. gvh: This attribute represents a weight
matrix based procedure and is used to
detect signal sequences which are
cleavable.
8. alm: This final feature helps us by
performing identification on the entire
sequence for membrane spanning regions.
For the data set, the output classes are summarized
below. Remember that each class in the output
represents a localization site. Here are the various
classes:
1. CYT (cytosolic or cytoskeletal)
2. NUC (nuclear)
3. MIT (mitochondrial)
4. ME3 (membrane protein, no N-terminal signal)
5. ME2 (membrane protein, uncleaved signal)
6. ME1 (membrane protein, cleaved signal)
7. EXC (extracellular)
8. VAC (vacuolar)
9. POX (peroxisomal)
10. ERL (endoplasmic reticulum lumen)
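A minimal C++ sketch of one record of the data set, pairing the eight attributes above with one of the ten output classes; the struct layout and field names here are assumptions for illustration, not the data set's file format:

```cpp
#include <string>

// The ten localization sites listed above, used as the target class.
enum class Site { CYT, NUC, MIT, ME3, ME2, ME1, EXC, VAC, POX, ERL };

// One example from the yeast data set: eight real-valued attributes
// computed from the amino acid sequence, plus the localization class.
struct YeastExample {
    std::string name;                               // sequence name
    double mcg, gvh, alm, mit, erl, pox, vac, nuc;  // the eight attributes
    Site site;                                      // target class (output)
};
```

In the network, the eight attributes feed the input layer and the class determines which of the ten output nodes carries the teaching signal.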
Figure 2: a Yeast Cell.
D. Obtain results and compare performance with
other networks and techniques used for predicting
the cellular localization of proteins.
Results are evaluated after running the data set on
the simulated artificial neural network.
Performance is evaluated by:
● Comparison of accuracies of various algorithms
● Variation of success rate with number of
iterations
● Variation of success rate with number of nodes
in the hidden layer
III. IMPLEMENTATION
Figure 3: design for calculating output activation
Er = 0.0
for all patterns E in the training set do
    // forward pass: input layer to hidden layer
    for all elements j in the hidden layer [NumUnitHidden] do
        InputHidden[E][j] = WtInput/Hidden[0][j]    // bias term
        for all elements i in the input layer [NumUnitInput] do
            add OutputInput[E][i] * WtInput/Hidden[i][j] to InputHidden[E][j]
        end for
        OutputHidden[E][j] = sigmoid(InputHidden[E][j])
    end for
    // forward pass: hidden layer to output layer
    for all elements k in the output layer [NumUnitOutput] do
        InputOutput[E][k] = WtHidden/Output[0][k]    // bias term
        for all elements j in the hidden layer [NumUnitHidden] do
            add OutputHidden[E][j] * WtHidden/Output[j][k] to InputOutput[E][k]
        end for
        Output[E][k] = sigmoid(InputOutput[E][k])
        add (1/2) * (Final[E][k] - Output[E][k]) * (Final[E][k] - Output[E][k]) to Er
        ΔOutput[k] = (Final[E][k] - Output[E][k]) * Output[E][k] * (1 - Output[E][k])
            // derivative of the sigmoid function
    end for
    // back-propagation of the error towards the hidden layer
    for all elements j in the hidden layer [NumUnitHidden] do
        SumΔOutput[j] = 0.0
        for all elements k in the output layer [NumUnitOutput] do
            add WtHidden/Output[j][k] * ΔOutput[k] to SumΔOutput[j]
        end for
        ΔH[j] = SumΔOutput[j] * OutputHidden[E][j] * (1.0 - OutputHidden[E][j])
            // derivative of the sigmoid function
    end for
    // update the input-to-hidden weights
    for all elements j in the hidden layer [NumUnitHidden] do
        ΔWih[0][j] = β * ΔH[j] + α * ΔWih[0][j]
        add ΔWih[0][j] to WtInput/Hidden[0][j]
        for all elements i in the input layer [NumUnitInput] do
            ΔWih[i][j] = β * OutputInput[E][i] * ΔH[j] + α * ΔWih[i][j]
            add ΔWih[i][j] to WtInput/Hidden[i][j]
        end for
    end for
    // update the hidden-to-output weights
    for all elements k in the output layer [NumUnitOutput] do
        ΔWho[0][k] = β * ΔOutput[k] + α * ΔWho[0][k]
        add ΔWho[0][k] to WtHidden/Output[0][k]
        for all elements j in the hidden layer [NumUnitHidden] do
            ΔWho[j][k] = β * OutputHidden[E][j] * ΔOutput[k] + α * ΔWho[j][k]
            add ΔWho[j][k] to WtHidden/Output[j][k]
        end for
    end for
end for
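As a sketch of the backward pass in the pseudocode above, the two delta computations can be written as C++ helpers. The function names and the weight-matrix layout (who[j][k] as the weight from hidden node j to output node k) are assumptions, not the original code:

```cpp
#include <vector>

// Output-layer deltas: (T - O) scaled by the sigmoid derivative O * (1 - O).
std::vector<double> outputDeltas(const std::vector<double>& target,
                                 const std::vector<double>& output) {
    std::vector<double> d(output.size());
    for (std::size_t k = 0; k < output.size(); ++k)
        d[k] = (target[k] - output[k]) * output[k] * (1.0 - output[k]);
    return d;
}

// Hidden-layer deltas: back-propagate the output deltas through the
// hidden-to-output weights, then scale by the sigmoid derivative.
std::vector<double> hiddenDeltas(const std::vector<double>& hidden,
                                 const std::vector<std::vector<double>>& who,
                                 const std::vector<double>& dOut) {
    std::vector<double> d(hidden.size());
    for (std::size_t j = 0; j < hidden.size(); ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < dOut.size(); ++k)
            sum += who[j][k] * dOut[k];              // error flowing back
        d[j] = sum * hidden[j] * (1.0 - hidden[j]);  // sigmoid derivative
    }
    return d;
}
```

These deltas then feed the momentum-based weight updates in the last two loops of the pseudocode.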
IV. RESULTS AND DISCUSSION
A. Comparisons of Accuracies of Different
Algorithms
In this section, we examine the accuracies offered
by different algorithms. We consider four: the
majority algorithm, the decision tree algorithm, the
perceptron learning algorithm, and the three-layer
neural network based on the back-propagation
algorithm. Two data sets studied in detail in earlier
sections are considered: the E.coli data set, for the
E.coli cell, and the one chosen by us, the yeast data
set. As we can see from the chart below, our
algorithm achieves slightly higher accuracy than
the rest of the algorithms. It is also notable that
considerable success is achieved on the yeast data
set we chose to implement, with accuracy reaching
61%.
Figure 4: Plot of Accuracy of various algorithms
for two data sets.
B. Variation of Success Rate with number of
iterations
Let us consider the variation of the success rate in
our implementation. The success rate is simply
defined as the number of successful predictions
divided by the total number of cases handled. The
overall success rate varies with the number of
iterations used to train the neural network. As the
number of iterations increases, the error is reduced,
since the network learns with every training
session. The chart below shows the expected
variation: the success rate rises with the number of
iterations. Note, however, that after about 100
iterations the success rate remains more or less
constant.
Figure 5: Plot of Success Rate with number of
iterations
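The definition of the success rate used above reduces to a one-line helper, shown here as an assumed C++ utility rather than part of the original code:

```cpp
// Success rate as defined above: the number of successful predictions
// divided by the total number of cases handled (0.0 when no cases).
double successRate(int successes, int total) {
    return total == 0 ? 0.0 : static_cast<double>(successes) / total;
}
```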
C. Variation of Success Rate with number of
processing elements in Hidden Layer
Let us now consider varying another important
parameter of our neural network. We shall again
use the success rate defined in the previous section.
The number of processing elements in the hidden
layer is under our control. Since the chosen data set
specifies the number of input attributes and
possible outcomes, the input and output layers have
a fixed number of processing elements; however,
we can observe the variation of the success rate
with the number of elements in the hidden layer.
Note that the success rate reaches a constant value
after about 75 elements in the layer.
Figure 6: Plot of Success Rate with No. of PE in
Hidden Layer
V. CONCLUSIONS AND FUTURE WORK
A. Conclusion
In this paper, I implemented the machine-learning
algorithm of a three-layer feed-forward network
and applied it to the problem of classifying proteins
into their cellular localization sites based on their
amino acid sequences. The accuracy on the Yeast
dataset was compared with that on the E.coli
dataset, and it was tested whether the three-layer
neural network with hidden nodes is able to
separate the datasets. We also explored using a
larger number of hidden nodes in the network, and
implemented a three-layer feed-forward network
that can represent almost any continuous function.
After obtaining results, we compared the
performance with other networks and techniques
used for predicting the cellular localization of
proteins. The most important results can be
summarized as:
● The classes CYT, NUC and MIT have the
largest number of instances.
● The back propagation algorithm is able to
achieve slightly higher accuracy than the
rest of the algorithms.
● Considerable success is achieved on the
yeast data set, which we chose to
implement, with accuracy reaching 61%.
● After about 100 iterations, the success rate
remains more or less constant.
● The success rate reaches a constant value
after about 75 elements in the layer.
● The accuracy rises until we reach the limit
to which the success rate can be pushed.
B. Future Work
Since the prediction of proteins' cellular
localization sites is a typical classification problem,
many other techniques, such as probabilistic
models, Bayesian networks, and K-nearest
neighbours, can be compared with our technique.
Thus, one aspect of future work is to examine the
performance of these techniques on this particular
problem.
ACKNOWLEDGEMENT
I would like to acknowledge the contribution of Dr.
Durga Toshniwal, Associate Professor, Department
of Computer Science and Engineering, IIT
Roorkee, whose guidance was indispensable
throughout the course of this work.