PREDICTING THE CELLULAR
LOCALIZATION SITES OF PROTEINS
USING ARTIFICIAL NEURAL
NETWORKS
Vaibhav Dhattarwal
Department of Computer Science and Engineering
Indian Institute of Technology Roorkee
vaibhav.csi.iitr@gmail.com
Abstract - In this paper, I present a brief description
of how a feed-forward artificial neural network was
implemented in C++. In the introduction, I explain
that the motivation for implementing this network
was to predict the cellular localization sites of
proteins, specifically for a yeast data set. This is
followed by a concise explanation of the design and
implementation of a three-layer feed-forward neural
network trained with the backpropagation algorithm,
along with the attributes of the data set and the
possible output locations in the cell. A step-by-step
breakdown of how I approached the project follows.
The implementation of the network is explained along
with how the algorithm is executed within the code.
Finally, the results are presented as the parameters
of the implemented artificial neural network are
varied.
Keywords - Prediction, Localization Sites, Proteins,
Simulation, Neural Networks
I. INTRODUCTION
I begin with the reason for choosing this topic. The
topic, prediction of the cellular localization sites
of proteins, describes the information represented
by the data set this paper is based on. I implement
an artificial neural network trained with the
backpropagation algorithm; to evaluate the
performance of the simulated network, I needed a
data set on which to train and test the ANN. The
chosen data set is significant because, if one can
deduce the subcellular location of a protein, one
can interpret its function, its part in healthy
processes and in the onset of disease, and its
probable use as a drug target.
Experimental methods for ascertaining the
subcellular location of a protein have advantages
such as reliability and accuracy, along with
disadvantages such as being slow and
labour-intensive. Compared with such methods,
high-throughput computational prediction tools make
it possible to deduce information that is otherwise
difficult to attain. For example, for proteins whose
composition is deduced from a genomic sequence,
computational methods are preferable, as such
proteins may be difficult to isolate, produce, or
locate in an experiment.
The subcellular location of a protein can provide
valuable information about its role in cellular
dynamics. There has been an unprecedented surge in
the amount of sequenced genomic data available,
which makes a computerized, high-accuracy tool for
predicting subcellular location increasingly
important. Many efforts have been made to predict
protein subcellular location accurately. This paper
brings together artificial neural networks and
bioinformatics to predict the location of proteins
in the yeast genome, introducing a new subcellular
prediction method based on a backpropagation neural
network.
The problem statement is: “Prediction of cellular
localization sites of proteins using artificial
neural networks.”
The task of this paper lies first in simulating a
three-layer artificial neural network, trained with
the backpropagation algorithm. We first explain the
algorithm, and then, in our implementation, show how
it is realised in the code used to simulate the
network. After this, we present the observations
recorded by executing the yeast data set on the
simulated network to train it, and use those
observations to identify trends and evaluate
performance.
II. PROPOSED METHODOLOGY
A. Simulate an artificial neural network
corresponding to the attributes of the yeast data
set.
To enlarge the function space that the neural
network can represent, we implement a three-layer
feed-forward network, which includes one layer of
hidden nodes. With a sufficiently large number of
nodes in the middle layer, we can represent almost
any continuous function with an acceptable level of
accuracy.
Figure 1: The structure of a three-layer
feed-forward neural network.
The definitions of the input and output nodes are
similar to those of the perceptron network discussed
earlier. The major difference is that we add a
single layer of hidden nodes between the input and
output nodes.
Similarly, we again use the ratio of correctly
classified examples in the training set as the
threshold for the termination condition. The major
difference from the algorithm for training a
two-layer feed-forward network is that, when we
update the weights of the hidden layer, we must
back-propagate the error from the output layer to
the hidden layer.
B. Implement the back propagation algorithm on
the simulated artificial neural network.
The algorithm for our three-layer network:
a. Initialize the weights of the network.
b. Perform the following operation:
1. for every example in the training set
 Compute the output of the neural
network for this example, denoted by
O (forward pass).
 Obtain the teaching output for this
example, denoted by T.
 The error is given by (T - O).
 Calculate ΔWHO for all weights
between the hidden and output
layers.
 Move backwards through the
network (backward pass).
 Calculate ΔWIH for all weights
between the input and hidden layers.
 Update the weights of the
network using the calculated
delta values.
c. Stop when the error criterion is met.
d. Return the trained network.
The learning algorithm that we have chosen for our
network is the Backpropagation Algorithm. It can
be divided into two stages:
Stage One: Propagation Phase
This phase consists of the following operations:
1. First, we propagate the training pattern's
input data forward through the network.
2. Second, we propagate backward through the
network the error between the output
activations produced in the first step and
the training pattern's desired target data.
Stage Two: Weight Updating Phase
In this stage, for every connection possessing a
weight, the following operations are carried out:
1. First, we multiply the output delta with the
input to calculate the gradient of the
weight.
2. Second, we subtract a ratio of the gradient
from the weight. This moves the weight in
the direction opposite to the gradient.
We keep repeating stages one and two until the
network performs with an acceptable success rate.
C. Train the network using the data set.
The yeast data set has eight attributes. These
attributes were calculated from amino acid
sequences.
1. erl: It is representative of the lumen of the
endoplasmic reticulum in the cell. This
attribute tells whether an HDEL pattern,
acting as a retention signal, is present or not.
2. vac: This attribute gives an indication of
the content of amino acids in vacuolar and
extracellular proteins after performing a
discriminant analysis.
3. mit: This attribute gives the composition
of the N-terminal region (twenty residues)
of mitochondrial and non-mitochondrial
proteins after performing a discriminant
analysis.
4. nuc: This feature tells us about nuclear
localization patterns as to whether they are
present or not. It also holds some
information about the frequency of basic
residues.
5. pox: This attribute provides the
composition of the sequence of protein
after discriminant analysis on them. Not
only this, it also indicates the presence of a
short sequence motif.
6. mcg: This is a parameter used in
McGeoch's signal sequence detection
method; in this case we use a modified
version of it.
7. gvh: This attribute represents a weight
matrix based procedure and is used to
detect signal sequences which are
cleavable.
8. alm: This final feature helps us by
performing identification on the entire
sequence for membrane spanning regions.
The output classes for the data set are summarized
below; each class represents a localization site.
Here are the various classes:
1. CYT (cytosolic or cytoskeletal)
2. NUC (nuclear)
3. MIT (mitochondrial)
4. ME3 (membrane protein, no N-terminal signal)
5. ME2 (membrane protein, uncleaved signal)
6. ME1 (membrane protein, cleaved signal)
7. EXC (extracellular)
8. VAC (vacuolar)
9. POX (peroxisomal)
10. ERL (endoplasmic reticulum lumen)
Figure 2: a Yeast Cell.
D. Obtain results and compare performance with
other networks and techniques used for predicting
the cellular localization of proteins
 Results are evaluated after using the data set on
the simulated artificial neural network.
 Varying the number of nodes in the hidden
layer is used to evaluate performance.
 Comparison of Accuracies of various
algorithms
 Variation of success rate with number of
iterations
 Variation of success rate with number of nodes
in hidden layer
III. IMPLEMENTATION
Figure 3: design for calculating output activation
Er = 0.0
for every pattern E in the training set do  // accumulates the error over all training patterns
    for every element j in the hidden layer [NumUnitHidden] do
        InputHidden[E][j] = WtInput/Hidden[0][j]  // bias weight
        for every element i in the input layer [NumUnitInput] do
            InputHidden[E][j] += OutputInput[E][i] * WtInput/Hidden[i][j]
        end for
        OutputHidden[E][j] = sigmoid(InputHidden[E][j])
    end for
    for every element k in the output layer [NumUnitOutput] do
        InputOutput[E][k] = WtHidden/Output[0][k]
        for every element j in the hidden layer [NumUnitHidden] do
            InputOutput[E][k] += OutputHidden[E][j] * WtHidden/Output[j][k]
        end for
        Output[E][k] = sigmoid(InputOutput[E][k])
        Er += (1/2) * (Final[E][k] - Output[E][k]) * (Final[E][k] - Output[E][k])
        ΔOutput[k] = (Final[E][k] - Output[E][k]) * Output[E][k] * (1 - Output[E][k])
        // derivative of the sigmoid function
    end for
    for every element j in the hidden layer [NumUnitHidden] do
        // back-propagation of the error to the hidden layer
        SumΔOutput[j] = 0.0
        for every element k in the output layer [NumUnitOutput] do
            SumΔOutput[j] += WtHidden/Output[j][k] * ΔOutput[k]
        end for
        ΔH[j] = SumΔOutput[j] * OutputHidden[E][j] * (1.0 - OutputHidden[E][j])
        // derivative of the sigmoid function
    end for
    for every element j in the hidden layer [NumUnitHidden] do
        // update the input-to-hidden weights (β learning rate, α momentum)
        ΔWih[0][j] = β * ΔH[j] + α * ΔWih[0][j]
        WtInput/Hidden[0][j] += ΔWih[0][j]
        for every element i in the input layer [NumUnitInput] do
            ΔWih[i][j] = β * OutputInput[E][i] * ΔH[j] + α * ΔWih[i][j]
            WtInput/Hidden[i][j] += ΔWih[i][j]
        end for
    end for
    for every element k in the output layer [NumUnitOutput] do
        // update the hidden-to-output weights
        ΔWho[0][k] = β * ΔOutput[k] + α * ΔWho[0][k]
        WtHidden/Output[0][k] += ΔWho[0][k]
        for every element j in the hidden layer [NumUnitHidden] do
            ΔWho[j][k] = β * OutputHidden[E][j] * ΔOutput[k] + α * ΔWho[j][k]
            WtHidden/Output[j][k] += ΔWho[j][k]
        end for
    end for
end for
IV. RESULTS AND DISCUSSION
A. Comparison of Accuracies of Different
Algorithms
In this section, we take a look at the accuracies
offered by different algorithms. We consider four:
the majority algorithm, the decision tree algorithm,
the perceptron learning algorithm, and the
three-layer neural network based on the
backpropagation algorithm. Two data sets studied in
detail in earlier sections are considered: the
E.coli data set, for the E.coli cell, and the one
chosen by us, the yeast data set. As the chart below
shows, our algorithm achieves slightly higher
accuracy than the other algorithms. Also of note is
that considerable success is achieved on the yeast
data set we chose to implement, with accuracy
reaching up to 61%.
Figure 4: Plot of Accuracy of various algorithms
for two data sets.
B. Variation of Success Rate with number of
iterations
Let us consider the variation of success rate in our
implementation. Success rate is simply defined as
the number of successful predictions divided by the
total number of cases handled. The overall success
rate varies with the number of iterations used to
train the neural network. As the number of
iterations increases, the error is reduced, since
the network learns with every training session. The
chart below shows the expected variation: the
success rate rises with the number of iterations.
However, after about 100 iterations the success rate
remains more or less constant.
Figure 5: Plot of Success Rate with number of
iterations
C. Variation of Success Rate with number of
processing elements in Hidden Layer
Let us now vary another important parameter of our
neural network, again considering the success rate
defined in the previous section. The number of
processing elements in the hidden layer is under our
control; since the chosen data set fixes the number
of input attributes and possible outcomes, the input
and output layers have a fixed number of processing
elements. We can, however, observe the variation of
success rate with the number of elements in the
hidden layer. Note that the success rate reaches a
constant value after about 75 elements in the layer.
Figure 6: Plot of Success Rate with No. of PE in
Hidden Layer
V. CONCLUSIONS AND FUTURE WORK
A. Conclusion
In this paper, I implemented the machine learning
algorithm of a three-layer feed-forward network and
applied it to the problem of classifying proteins
into their cellular localization sites based on
their amino acid sequences. The accuracy on the
yeast dataset was compared with that on the E.coli
dataset. It was tested whether the three-layer
neural network with hidden nodes is able to separate
the datasets, and we also explored using a larger
number of hidden nodes in the network, which lets
the three-layer feed-forward network represent
functions beyond the reach of a single-layer
network. After obtaining results, we compared the
performance with other networks and techniques used
for predicting the cellular localization of
proteins. The most important results can be
summarized as:
● The classes CYT, NUC and MIT have the
largest number of instances.
● The backpropagation algorithm achieves
slightly higher accuracy than the other
algorithms considered.
● Considerable success is achieved on the
yeast data set we chose to implement, with
accuracy reaching up to 61%.
● After about 100 iterations the success rate
remains more or less constant.
● The success rate reaches a constant value
after about 75 elements in the hidden layer.
● The accuracy rises until we reach the limit
to which we can set the success rate.
B. Future Work
Since the prediction of proteins' cellular
localization sites is a typical classification
problem, many other techniques, such as
probabilistic models, Bayesian networks, and
K-nearest neighbours, can be compared with ours.
An aspect of future work is thus to examine the
performance of these techniques on this particular
problem.
ACKNOWLEDGEMENT
I would like to acknowledge the contribution of Dr.
Durga Toshniwal, Associate Professor, Department
of Computer Science and Engineering, IIT
Roorkee, whose guidance was indispensable
throughout the course of this work.
More Related Content

What's hot

Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.Mohd Faiz
 
Artificial Neural Network Abstract
Artificial Neural Network AbstractArtificial Neural Network Abstract
Artificial Neural Network AbstractAnjali Agrawal
 
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqCrude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqKiogyf
 
Neural network and artificial intelligent
Neural network and artificial intelligentNeural network and artificial intelligent
Neural network and artificial intelligentHapPy SumOn
 
Neural Network Applications In Machining: A Review
Neural Network Applications In Machining: A ReviewNeural Network Applications In Machining: A Review
Neural Network Applications In Machining: A ReviewAshish Khetan
 
Question bank soft computing
Question bank   soft computingQuestion bank   soft computing
Question bank soft computingMohit Singh
 
Artificial neural network by arpit_sharma
Artificial neural network by arpit_sharmaArtificial neural network by arpit_sharma
Artificial neural network by arpit_sharmaEr. Arpit Sharma
 
A survey research summary on neural networks
A survey research summary on neural networksA survey research summary on neural networks
A survey research summary on neural networkseSAT Publishing House
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural NetworkMuhammad Ishaq
 
Neural network
Neural network Neural network
Neural network Faireen
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural NetworkManasa Mona
 
Artificial Neural Network report
Artificial Neural Network reportArtificial Neural Network report
Artificial Neural Network reportAnjali Agrawal
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkmustafa aadel
 
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...cscpconf
 
Pattern recognition system based on support vector machines
Pattern recognition system based on support vector machinesPattern recognition system based on support vector machines
Pattern recognition system based on support vector machinesAlexander Decker
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networksweetysweety8
 
Open CV Implementation of Object Recognition Using Artificial Neural Networks
Open CV Implementation of Object Recognition Using Artificial Neural NetworksOpen CV Implementation of Object Recognition Using Artificial Neural Networks
Open CV Implementation of Object Recognition Using Artificial Neural Networksijceronline
 

What's hot (20)

Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.Artificial Neural Network seminar presentation using ppt.
Artificial Neural Network seminar presentation using ppt.
 
Neural network
Neural networkNeural network
Neural network
 
Artificial Neural Network Abstract
Artificial Neural Network AbstractArtificial Neural Network Abstract
Artificial Neural Network Abstract
 
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of IraqCrude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
Crude Oil Price Prediction Based on Soft Computing Model: Case Study of Iraq
 
Neural network and artificial intelligent
Neural network and artificial intelligentNeural network and artificial intelligent
Neural network and artificial intelligent
 
Neural Network Applications In Machining: A Review
Neural Network Applications In Machining: A ReviewNeural Network Applications In Machining: A Review
Neural Network Applications In Machining: A Review
 
Question bank soft computing
Question bank   soft computingQuestion bank   soft computing
Question bank soft computing
 
Artificial neural network by arpit_sharma
Artificial neural network by arpit_sharmaArtificial neural network by arpit_sharma
Artificial neural network by arpit_sharma
 
A survey research summary on neural networks
A survey research summary on neural networksA survey research summary on neural networks
A survey research summary on neural networks
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Neural network
Neural network Neural network
Neural network
 
Artificial Neural Network Topology
Artificial Neural Network TopologyArtificial Neural Network Topology
Artificial Neural Network Topology
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Artificial Neural Network report
Artificial Neural Network reportArtificial Neural Network report
Artificial Neural Network report
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Neural network
Neural networkNeural network
Neural network
 
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
Employing Neocognitron Neural Network Base Ensemble Classifiers To Enhance Ef...
 
Pattern recognition system based on support vector machines
Pattern recognition system based on support vector machinesPattern recognition system based on support vector machines
Pattern recognition system based on support vector machines
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Open CV Implementation of Object Recognition Using Artificial Neural Networks
Open CV Implementation of Object Recognition Using Artificial Neural NetworksOpen CV Implementation of Object Recognition Using Artificial Neural Networks
Open CV Implementation of Object Recognition Using Artificial Neural Networks
 

Similar to Predict cellular localization

Modeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technologyModeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technologytheijes
 
Fuzzy Logic Final Report
Fuzzy Logic Final ReportFuzzy Logic Final Report
Fuzzy Logic Final ReportShikhar Agarwal
 
Diagnosis Chest Diseases Using Neural Network and Genetic Hybrid Algorithm
Diagnosis Chest Diseases Using Neural Network and Genetic Hybrid AlgorithmDiagnosis Chest Diseases Using Neural Network and Genetic Hybrid Algorithm
Diagnosis Chest Diseases Using Neural Network and Genetic Hybrid AlgorithmIJERA Editor
 
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...IJMER
 
Neural basics
Neural basicsNeural basics
Neural basicscoursesub
 
Neural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlabNeural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlabijcses
 
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATIONA NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATIONcscpconf
 
Extracted pages from Neural Fuzzy Systems.docx
Extracted pages from Neural Fuzzy Systems.docxExtracted pages from Neural Fuzzy Systems.docx
Extracted pages from Neural Fuzzy Systems.docxdannyabe
 
2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx
2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx
2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptxssuser67281d
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...ijistjournal
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...ijistjournal
 
A Parallel Framework For Multilayer Perceptron For Human Face Recognition
A Parallel Framework For Multilayer Perceptron For Human Face RecognitionA Parallel Framework For Multilayer Perceptron For Human Face Recognition
A Parallel Framework For Multilayer Perceptron For Human Face RecognitionCSCJournals
 
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHMPADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHMIRJET Journal
 
IRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET Journal
 
Artificial Neural Network: A brief study
Artificial Neural Network: A brief studyArtificial Neural Network: A brief study
Artificial Neural Network: A brief studyIRJET Journal
 

Similar to Predict cellular localization (20)

O18020393104
O18020393104O18020393104
O18020393104
 
Modeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technologyModeling of neural image compression using gradient decent technology
Modeling of neural image compression using gradient decent technology
 
Fuzzy Logic Final Report
Fuzzy Logic Final ReportFuzzy Logic Final Report
Fuzzy Logic Final Report
 
Diagnosis Chest Diseases Using Neural Network and Genetic Hybrid Algorithm
Diagnosis Chest Diseases Using Neural Network and Genetic Hybrid AlgorithmDiagnosis Chest Diseases Using Neural Network and Genetic Hybrid Algorithm
Diagnosis Chest Diseases Using Neural Network and Genetic Hybrid Algorithm
 
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...
Artificial Neural Networks (ANNS) For Prediction of California Bearing Ratio ...
 
Neural basics
Neural basicsNeural basics
Neural basics
 
Neural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlabNeural network based numerical digits recognization using nnt in matlab
Neural network based numerical digits recognization using nnt in matlab
 
D028018022
D028018022D028018022
D028018022
 
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATIONA NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
A NEW TECHNIQUE INVOLVING DATA MINING IN PROTEIN SEQUENCE CLASSIFICATION
 
Extracted pages from Neural Fuzzy Systems.docx
Extracted pages from Neural Fuzzy Systems.docxExtracted pages from Neural Fuzzy Systems.docx
Extracted pages from Neural Fuzzy Systems.docx
 
bbbPaper
bbbPaperbbbPaper
bbbPaper
 
2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx
2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx
2. NEURAL NETWORKS USING GENETIC ALGORITHMS.pptx
 
tsopze2011
tsopze2011tsopze2011
tsopze2011
 
N ns 1
N ns 1N ns 1
N ns 1
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
 
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
Implementation Of Back-Propagation Neural Network For Isolated Bangla Speech ...
 
A Parallel Framework For Multilayer Perceptron For Human Face Recognition
A Parallel Framework For Multilayer Perceptron For Human Face RecognitionA Parallel Framework For Multilayer Perceptron For Human Face Recognition
A Parallel Framework For Multilayer Perceptron For Human Face Recognition
 
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHMPADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
PADDY CROP DISEASE DETECTION USING SVM AND CNN ALGORITHM
 
IRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural NetworkIRJET-Breast Cancer Detection using Convolution Neural Network
IRJET-Breast Cancer Detection using Convolution Neural Network
 
Artificial Neural Network: A brief study
Artificial Neural Network: A brief studyArtificial Neural Network: A brief study
Artificial Neural Network: A brief study
 

More from Vaibhav Dhattarwal

More from Vaibhav Dhattarwal (6)

Dissertation Prsentation - Vaibhav
Dissertation Prsentation - VaibhavDissertation Prsentation - Vaibhav
Dissertation Prsentation - Vaibhav
 
Project Report -Vaibhav
Project Report -VaibhavProject Report -Vaibhav
Project Report -Vaibhav
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
Internship Project Report - Vaibhav
Internship Project Report - VaibhavInternship Project Report - Vaibhav
Internship Project Report - Vaibhav
 
VaibhavDhattarwal_PGP06.054_CV_v2
VaibhavDhattarwal_PGP06.054_CV_v2VaibhavDhattarwal_PGP06.054_CV_v2
VaibhavDhattarwal_PGP06.054_CV_v2
 

Predict cellular localization

  • 1. PREDICTING THE CELLULAR LOCALIZATION SITES OF PROTEINS USING ARTIFICIAL NEURAL NETWORKS Vaibhav Dhattarwal Department of Computer Science and Engineering Indian Institute of Technology Roorkee vaibhav.csi.iitr@gmail.com Abstract - In this paper, I present a brief description of how a feed-forward artificial neural network was implemented in C++. In the introduction to my paper, I begin by explaining as the reason for implementing this artificial neural network was to predict the cellular localisation sites in proteins, and to be specific a yeast Data Set. This is followed by a concise explanation of the design and implementation of a three-layer feed forward neural network using back propagation algorithm. Also explained along with are the attributes of the data set and the output location possibilities in the protein. This is followed by a step- by-step breakdown of how I approached the project. The implementation of the network is explained along with how the algorithm is executed within the code. Finally we can see the results as we vary the parameters associated with the implemented artificial neural network. Keywords-Prediction, Localization Sites, Proteins, Simulation, NeuralNetworks I. INTRODUCTION Let me start off by the basic explanation about choosing this topic. The topic chosen, Prediction of Cellular localisation of protein, is basically the information represented by the data set I have chosen to do my paper on. I will be implementing an Artificial Neural Network based on the back propagation algorithm. To evaluate the performance of the simulated Artificial Neural Network, I needed to choose a data set to train and test the ANN. Let us take a look at the significance of the data set chosen by me. If one is able to deduce or figure out the sub cellular location of a protein, I can interpret its function, its part in healthy processes and also in commencement of disease, and its probable usage as a drug target. 
Other methods, such as experiments used to ascertain the subcellular location of a protein, have advantages such as reliability and accuracy, along with disadvantages such as being slow and labour-intensive. Compared with such methods, high-throughput computational prediction tools can deduce information that is otherwise difficult to attain. For example, proteins whose composition is inferred from a genomic sequence may be hard to isolate, produce, or locate experimentally, so computational methods are better suited to them. The subcellular location of a protein can provide valuable information about its role in cellular dynamics. Moreover, there has been an unprecedented surge in the amount of sequenced genomic data available, which makes an automated, high-accuracy tool for predicting subcellular location increasingly important. There have been many efforts to predict protein subcellular location accurately. This paper brings together artificial neural networks and bioinformatics to predict the location of proteins in the yeast genome: I introduce a new subcellular prediction method based on a back-propagation neural network.
The problem statement is: “Prediction of cellular localization sites of proteins using artificial neural networks.” The task of this paper lies first in simulating a three-layer artificial neural network; the back-propagation algorithm is used to train it. First we explain the algorithm; then, in the implementation, we show how the algorithm is realised in the code used to simulate the network. After this we record the observations obtained by training the simulated network on the yeast data set, and use those observations to identify trends and evaluate performance.

II. PROPOSED METHODOLOGY

A. Simulate an artificial neural network corresponding to the attributes of the yeast data set.

To enlarge the function space that the neural network can represent, we implement a three-layer feed-forward network, which involves one layer of hidden nodes. If the middle layer has a large number of nodes, the network can represent almost any continuous function with acceptable accuracy.

Figure 1: the structure of a three-layer feed-forward neural network.

The definitions of the input and output nodes are similar to those of the perceptron network discussed earlier. The major difference is that we add a single layer of hidden nodes between the input and output nodes. As before, we use the ratio of correctly classified examples in the training set as the threshold for the termination condition. Compared with the algorithm for training a two-layer feed-forward network, the key difference is that when we update the weights of the hidden layer, we must back-propagate the error from the output layer to the hidden layer.

B. Implement the back-propagation algorithm on the simulated artificial neural network.

The algorithm for our three-layer network:

a. Initialize the weights of the network.
b. Perform the following operation:
for every example in the training set:
 • Compute the output of the neural network for this example, denoted O (forward pass).
 • Let T denote the teaching (target) output for this example.
 • The error is given by (T − O).
 • Calculate ΔWHO for all weights between the hidden and output layers.
 • Move backwards through the network (backward pass).
 • Calculate ΔWIH for all weights between the input and hidden layers.
 • Update the weights of the network using the calculated delta values.
c. Stop when the error criterion is met.
d. Return the trained network.

The learning algorithm chosen for our network is the back-propagation algorithm. It can be divided into two stages:

Stage one: Propagation phase. This phase consists of the following operations:
1. First, the training pattern's input data is propagated forward through the network.
2. Second, the output activations produced by the first step are propagated backward through the network, using the training pattern's desired target data, to obtain the error at each layer.

Stage two: Weight-updating phase. In this stage, for every connection possessing a weight, the following operations are carried out:
1. First, multiply the output delta by the input activation to obtain the gradient of the weight.
2. Second, subtract a fraction of the gradient from the weight. This moves the weight in the direction opposite to the gradient.

We keep repeating stages one and two until the network performs with an acceptable success rate.

C. Train the network using the data set.

The yeast data set has eight attributes, calculated from amino acid sequences:
1. erl: Represents the lumen of the endoplasmic reticulum in the cell. This attribute indicates whether an HDEL pattern, acting as a retention signal, is present.
2. vac: Indicates the amino acid content of vacuolar and extracellular proteins, obtained by discriminant analysis.
3. mit: Gives the composition of the twenty-residue N-terminal region of mitochondrial and non-mitochondrial proteins, obtained by discriminant analysis.
4. nuc: Indicates whether nuclear localization patterns are present; it also carries some information about the frequency of basic residues.
5. pox: Gives the composition of the protein sequence after discriminant analysis; it also indicates the presence of a short sequence motif.
6. mcg: A parameter from McGeoch's signal sequence detection method; in this case we use a modified version of it.
7. gvh: Represents a weight-matrix-based procedure used to detect cleavable signal sequences.
8. alm: Identifies membrane-spanning regions over the entire sequence.

For the data set, the output classes are summarized below. The output class represents the localization site:
1. CYT (cytosolic or cytoskeletal)
2. NUC (nuclear)
3. MIT (mitochondrial)
4.
ME3 (membrane protein, no N-terminal signal)
5. ME2 (membrane protein, uncleaved signal)
6. ME1 (membrane protein, cleaved signal)
7. EXC (extracellular)
8. VAC (vacuolar)
9. POX (peroxisomal)
10. ERL (endoplasmic reticulum lumen)

Figure 2: a yeast cell.

D. Obtain results and compare performance with other networks and techniques used for predicting the cellular localization of proteins.
 • Results are evaluated after running the data set through the simulated artificial neural network.
 • Performance is evaluated by varying the number of nodes in the hidden layer.
 • Comparison of the accuracies of various algorithms.
 • Variation of the success rate with the number of iterations.
 • Variation of the success rate with the number of nodes in the hidden layer.

III. IMPLEMENTATION

Figure 3: design for calculating output activation
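The per-unit computation illustrated in Figure 3, and used throughout the listing that follows, is a bias plus a weighted sum of the previous layer's outputs, passed through a sigmoid. A minimal C++ sketch of that computation; the names `sigmoid` and `activate` are illustrative and not taken from the paper's actual code:

```cpp
#include <cmath>
#include <vector>

// Logistic sigmoid, the activation function used in both
// the hidden and output layers.
double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}

// Activation of one unit: the bias weight w[0] plus the weighted
// sum of the previous layer's outputs, squashed by the sigmoid.
double activate(const std::vector<double>& inputs,
                const std::vector<double>& w) {  // w[0] holds the bias
    double net = w[0];
    for (std::size_t i = 0; i < inputs.size(); ++i)
        net += inputs[i] * w[i + 1];
    return sigmoid(net);
}
```

Every `InputHidden` and `InputOutput` term in the listing below is an instance of this pattern.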
Er = 0.0
for every example E in the training set do                  // forward pass
    for j = 1 .. NumUnitHidden do
        InputHidden[E][j] = WtInputHidden[0][j]             // bias
        for i = 1 .. NumUnitInput do
            InputHidden[E][j] += OutputInput[E][i] * WtInputHidden[i][j]
        end for
        OutputHidden[E][j] = sigmoid(InputHidden[E][j])
    end for
    for k = 1 .. NumUnitOutput do
        InputOutput[E][k] = WtHiddenOutput[0][k]            // bias
        for j = 1 .. NumUnitHidden do
            InputOutput[E][k] += OutputHidden[E][j] * WtHiddenOutput[j][k]
        end for
        Output[E][k] = sigmoid(InputOutput[E][k])
        Er += (1/2) * (Final[E][k] - Output[E][k]) * (Final[E][k] - Output[E][k])
        ΔOutput[k] = (Final[E][k] - Output[E][k]) * Output[E][k] * (1 - Output[E][k])
                                                            // derivative of the sigmoid
    end for

    for j = 1 .. NumUnitHidden do                           // back-propagate error to hidden layer
        Sum[j] = 0.0
        for k = 1 .. NumUnitOutput do
            Sum[j] += WtHiddenOutput[j][k] * ΔOutput[k]
        end for
        ΔH[j] = Sum[j] * OutputHidden[E][j] * (1 - OutputHidden[E][j])
                                                            // derivative of the sigmoid
    end for

    for j = 1 .. NumUnitHidden do                           // update input-to-hidden weights
        ΔWih[0][j] = β * ΔH[j] + α * ΔWih[0][j]
        WtInputHidden[0][j] += ΔWih[0][j]
        for i = 1 .. NumUnitInput do
            ΔWih[i][j] = β * OutputInput[E][i] * ΔH[j] + α * ΔWih[i][j]
            WtInputHidden[i][j] += ΔWih[i][j]
        end for
    end for

    for k = 1 .. NumUnitOutput do                           // update hidden-to-output weights
        ΔWho[0][k] = β * ΔOutput[k] + α * ΔWho[0][k]
        WtHiddenOutput[0][k] += ΔWho[0][k]
        for j = 1 .. NumUnitHidden do
            ΔWho[j][k] = β * OutputHidden[E][j] * ΔOutput[k] + α * ΔWho[j][k]
            WtHiddenOutput[j][k] += ΔWho[j][k]
        end for
    end for
end for
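The listing above can be condensed into a compilable C++ training step. This is a sketch under assumed conventions, not the paper's actual code: row 0 of each weight matrix holds the bias (matching the `[0][j]` and `[0][k]` terms of the listing), `beta` is the learning rate β, `alpha` the momentum term α, and names such as `trainPattern` and `Matrix` are illustrative.

```cpp
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One back-propagation step for a single training pattern.
// wIH: (nInput+1) x nHidden weights, wHO: (nHidden+1) x nOutput weights;
// dWIH/dWHO carry the previous deltas for the momentum term.
// Returns the summed squared error for this pattern.
double trainPattern(const std::vector<double>& in,
                    const std::vector<double>& target,
                    Matrix& wIH, Matrix& wHO,
                    Matrix& dWIH, Matrix& dWHO,
                    double beta, double alpha) {
    const std::size_t nH = wHO.size() - 1, nO = target.size();
    std::vector<double> hidden(nH), out(nO);

    // Forward pass: input -> hidden -> output.
    for (std::size_t j = 0; j < nH; ++j) {
        double net = wIH[0][j];                          // bias
        for (std::size_t i = 0; i < in.size(); ++i)
            net += in[i] * wIH[i + 1][j];
        hidden[j] = sigmoid(net);
    }
    for (std::size_t k = 0; k < nO; ++k) {
        double net = wHO[0][k];                          // bias
        for (std::size_t j = 0; j < nH; ++j)
            net += hidden[j] * wHO[j + 1][k];
        out[k] = sigmoid(net);
    }

    // Output deltas and squared error.
    double err = 0.0;
    std::vector<double> dOut(nO);
    for (std::size_t k = 0; k < nO; ++k) {
        double e = target[k] - out[k];
        err += 0.5 * e * e;
        dOut[k] = e * out[k] * (1.0 - out[k]);           // sigmoid derivative
    }

    // Back-propagate the error to hidden-layer deltas.
    std::vector<double> dHid(nH);
    for (std::size_t j = 0; j < nH; ++j) {
        double sum = 0.0;
        for (std::size_t k = 0; k < nO; ++k)
            sum += wHO[j + 1][k] * dOut[k];
        dHid[j] = sum * hidden[j] * (1.0 - hidden[j]);   // sigmoid derivative
    }

    // Update input-to-hidden weights (gradient step plus momentum).
    for (std::size_t j = 0; j < nH; ++j) {
        dWIH[0][j] = beta * dHid[j] + alpha * dWIH[0][j];
        wIH[0][j] += dWIH[0][j];
        for (std::size_t i = 0; i < in.size(); ++i) {
            dWIH[i + 1][j] = beta * in[i] * dHid[j] + alpha * dWIH[i + 1][j];
            wIH[i + 1][j] += dWIH[i + 1][j];
        }
    }
    // Update hidden-to-output weights.
    for (std::size_t k = 0; k < nO; ++k) {
        dWHO[0][k] = beta * dOut[k] + alpha * dWHO[0][k];
        wHO[0][k] += dWHO[0][k];
        for (std::size_t j = 0; j < nH; ++j) {
            dWHO[j + 1][k] = beta * hidden[j] * dOut[k] + alpha * dWHO[j + 1][k];
            wHO[j + 1][k] += dWHO[j + 1][k];
        }
    }
    return err;
}
```

Calling this function repeatedly over all patterns in the training set, until the error criterion is met, corresponds to one run of the outer training loop described in section II.B.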
IV. RESULTS AND DISCUSSION

A. Comparison of Accuracies of Different Algorithms

In this section we compare the accuracies offered by different algorithms. We consider four: the majority algorithm, the decision tree algorithm, the perceptron learning algorithm, and the three-layer neural network based on the back-propagation algorithm. Two data sets are considered, both studied in detail in earlier sections: the E. coli data set and the one chosen by us, the yeast data set. As the chart below shows, our algorithm achieves slightly higher accuracy than the rest. Notably, considerable success is achieved on the yeast data set we chose to implement, with accuracy of up to 61%.

Figure 4: Plot of accuracy of various algorithms for two data sets.

B. Variation of Success Rate with Number of Iterations

Consider the variation of the success rate in our implementation. The success rate is defined as the number of successful predictions divided by the total number of cases handled. The overall success rate varies with the number of iterations used to train the neural network: as the number of iterations increases, the error is reduced, since the network learns with every training session. The chart below shows the expected variation: the success rate rises with the number of iterations. Note, however, that after about 100 iterations the success rate remains more or less constant.

Figure 5: Plot of success rate against number of iterations.

C. Variation of Success Rate with Number of Processing Elements in the Hidden Layer

We now vary another important parameter of the network, again considering the success rate defined in the previous section. The number of processing elements is under our control in the network.
Because the chosen data set fixes the number of input attributes and possible outcomes, the input and output layers have a fixed number of processing elements. We can, however, observe how the success rate varies with the number of elements in the hidden layer. Note that the success rate reaches a constant value after about 75 elements in the layer.

Figure 6: Plot of success rate against number of processing elements in the hidden layer.

V. CONCLUSIONS AND FUTURE WORK

A. Conclusion
In this paper I implemented the machine-learning algorithm of a three-layer feed-forward network and applied it to the problem of classifying proteins to their cellular localization sites based on their amino acid sequences. The accuracy on the yeast data set was compared with that on the E. coli data set, and we tested whether a three-layer neural network with hidden nodes is able to separate the data sets. We also explored using a larger number of hidden nodes in the network, and noted that the implemented three-layer feed-forward network can represent almost any continuous function. After obtaining results, we compared the performance with other networks and techniques used for predicting the cellular localization of proteins. The most important results can be summarized as:
● The classes CYT, NUC and MIT have the largest number of instances.
● The back-propagation algorithm achieves slightly higher accuracy than the other algorithms.
● Considerable success is achieved on the yeast data set we chose to implement, with accuracy of up to 61%.
● After about 100 iterations the success rate remains more or less constant.
● The success rate reaches a constant value after about 75 elements in the hidden layer.
● The accuracy rises until we reach the limit to which we can set the success rate.

B. Future Work

Since the prediction of proteins' cellular localization sites is a typical classification problem, many other techniques, such as probabilistic models, Bayesian networks, and k-nearest neighbours, can be compared with our technique. An aspect of future work is therefore to examine the performance of these techniques on this particular problem.

ACKNOWLEDGEMENT

I would like to acknowledge the contribution of Dr. Durga Toshniwal, Associate Professor, Department of Computer Science and Engineering, IIT Roorkee, whose guidance was indispensable throughout the course of this work.

REFERENCES

[1].
P. Horton and K. Nakai, "A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins", Intelligent Systems in Molecular Biology, pp. 109-115.
[2]. K. Nakai and M. Kanehisa, "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", PROTEINS: Structure, Function, and Genetics, 11:95-110, 1991.
[3]. K. Nakai and M. Kanehisa, "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Genomics, 14:897-911, 1992.
[4]. P. Cairns, C. Huyck, et al., "A Comparison of Categorization Algorithms for Predicting the Cellular Localization Sites of Proteins", IEEE Engineering in Medicine and Biology, pp. 296-300, 2001.
[5]. P. Donnes and A. Hoglund, "Predicting Protein Subcellular Localization: Past, Present, and Future", Genomics Proteomics Bioinformatics, 2:209-215, 2004.