A new computational approach to protein sequence similarity analysis and functional classification is described that is faster and easier than the conventional method. The technique uses Discrete Wavelet Transform (DWT) decomposition followed by sequence correlation analysis, and it can also be used to identify the functional class of a newly obtained protein sequence. Classification was performed on a sample set of 270 protein sequences obtained from organisms of diverse origins and functional classes, giving a classification accuracy of 94.81%. The accuracy and reliability of the technique are verified by comparing the results with those obtained from NCBI.
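The DWT-plus-correlation idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which numeric encoding, wavelet, or decomposition level was used, so the Kyte-Doolittle hydropathy scale, the Haar wavelet, two decomposition levels, and the function names (`haar_step`, `dwt_similarity`) are all choices made here for the example.

```python
import math

# Assumption: residues are mapped to numbers with the Kyte-Doolittle
# hydropathy scale; the abstract does not name the encoding actually used.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def encode(seq):
    """Turn an amino acid string into a list of hydropathy values."""
    return [KD[residue] for residue in seq]


def haar_step(signal):
    """One level of the Haar DWT: returns (approximation, detail)."""
    if len(signal) % 2:
        signal = signal + [signal[-1]]  # pad odd lengths by repeating the last value
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_approx(signal, levels):
    """Keep only the approximation coefficients after `levels` decompositions."""
    for _ in range(levels):
        if len(signal) < 2:
            break
        signal, _detail = haar_step(signal)
    return signal


def pearson(x, y):
    """Pearson correlation; the longer input is truncated (a simplification)."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def dwt_similarity(seq_a, seq_b, levels=2):
    """Correlate the coarse (low-frequency) wavelet profiles of two sequences."""
    return pearson(haar_approx(encode(seq_a), levels),
                   haar_approx(encode(seq_b), levels))
```

A query sequence could then be assigned the functional class of the reference sequence with the highest `dwt_similarity`, which is one plausible reading of the classification step; the paper itself may do this differently.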
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS ijcseit
HMMs have found application in almost every field, and applying them to biological sequences has its own
advantages. Being more systematic and specific, HMMs yield better results than consensus techniques.
Profile HMMs use position-specific scoring for the matching and substitution of a residue and for the
opening or extension of a gap. HMMs apply a statistical method to estimate the true frequency of a residue
at a given position in the alignment from its observed frequency, whereas standard profiles use the observed
frequency itself to assign the score for that residue. This means that a profile HMM derived from only 10 to
20 aligned sequences can be of equivalent quality to a standard profile created from 40 to 50 aligned
sequences.
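The difference between using the observed frequency directly and estimating the true frequency can be illustrated with a small sketch. It uses simple additive (Laplace) pseudocounts as a stand-in; real profile HMM tools typically use more elaborate priors, and the function name `column_frequencies` and the `pseudocount` parameter are hypothetical names chosen for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Per-column residue probabilities for a gapped alignment.

    With pseudocount=0 this is the raw observed frequency (a standard
    profile); with pseudocount>0 unseen residues keep a small non-zero
    probability, which matters most when few sequences are aligned.
    Gap characters are ignored.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile
```

With only three aligned sequences that all show `A` in a column, the raw estimate says every other residue is impossible there, while the smoothed estimate leaves room for substitutions not yet observed.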
Comparative Protein Structure Modeling and its Applications LynellBull52
Comparative Protein Structure Modeling and its Applications to Drug Discovery
Matthew Jacobson (1) and Andrej Sali (1,2)
(1) Department of Pharmaceutical Chemistry, California Institute for Quantitative Biomedical Research, Mission Bay Genentech Hall, 600 16th Street, University of California, San Francisco, CA 94143-2240, USA
(2) Department of Biopharmaceutical Sciences, California Institute for Quantitative Biomedical Research, Mission Bay Genentech Hall, 600 16th Street, University of California, San Francisco, CA 94143-2240, USA
Contents
1. Introduction
2. Fold assignment and sequence-structure alignment
3. Comparative model building
4. Loop modeling
5. Sidechain modeling
6. Comparative modeling by MODELLER
7. Physics-based approaches to comparative model construction and refinement
8. Accuracy of comparative models
9. Modeling on a genomic scale
10. Applications of comparative modeling to drug discovery
10.1. Comparative models vs experimental structures in virtual screening
10.2. Use of comparative models to obtain novel drug leads
10.3. Comparative models of kinases in virtual screening
10.4. GPCR comparative models for drug development
10.5. Other uses of comparative models in drug development
10.6. Future directions
11. Conclusions
References
1. INTRODUCTION
Homology or comparative protein structure modeling constructs a three-dimensional
model of a given protein sequence based on its similarity to one or more known
structures. In this perspective, we begin by describing the comparative modeling
technique and the accuracy of the models. We then discuss the significant role that
comparative prediction plays in drug discovery. We focus on virtual ligand screening
against comparative models and illustrate the state-of-the-art by a number of specific
examples.
ANNUAL REPORTS IN MEDICINAL CHEMISTRY, VOLUME 39 © 2004 Elsevier Inc.
ISSN: 0065-7743 DOI 10.1016/S0065-7743(04)39020-2 All rights reserved
The genome sequencing efforts are providing us with complete genetic blueprints for
hundreds of organisms, including humans. We are now faced with describing,
controlling, and modifying the functions of proteins encoded by these genomes. This
task is generally facilitated by protein three-dimensional structures [1], which are best
determined by experimental methods such as X-ray crystallography and nuclear
magnetic resonance (NMR) spectroscopy. Despite significant advances in these
techniques, many protein sequences are not easily accessible to structure determination
by experiment. Over the last two years, the number of sequences in the comprehensive
public sequence databases, such as SwissProt/TrEMBL [2] and GenPept [3], increased
by a factor of 2.3 from 522,959 to 1,215,803 on 26 April 2004. In contrast, despite
structural genomics, the number of experimentally determined structures deposited in
the Protein Data Bank (PDB) increas ...
A protein can be represented by an amino acid interaction network. This network is a graph whose vertices are
the protein's amino acids and whose edges are the interactions between them. In this paper we
formalize amino acid interaction network prediction as a multi-objective evolutionary optimization
problem. This formalism is biologically plausible because interactions among amino acids depend not
only on a single factor such as atomic distance but also on other factors such as torsion angle, hydrophobicity and
hydrophilicity. The problem is solved using a multi-objective genetic algorithm,
and the result is subsequently optimized using the ant colony optimization technique. The results show that our algorithm
performs better than recent amino acid interaction network prediction algorithms that are based on a single
factor.
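To make the graph representation concrete, here is a minimal sketch of the single-factor baseline the abstract contrasts itself with: an interaction network built from atomic distance alone. It assumes one 3D coordinate per residue (for example the alpha carbon) is already known; the 7.0 cutoff, the `min_separation` parameter, and the function name are illustrative assumptions, not values taken from the paper.

```python
import math
from itertools import combinations


def interaction_network(coords, cutoff=7.0, min_separation=2):
    """Build an undirected residue graph from per-residue 3D coordinates.

    Vertices are residue indices; an edge joins two residues whose
    coordinates lie within `cutoff` of each other. Residues closer than
    `min_separation` positions along the chain are skipped, since chain
    neighbours are trivially close.
    """
    graph = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if j - i < min_separation:
            continue
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph
```

A multi-objective formulation such as the one described would score candidate edges on several criteria at once (distance, torsion angle, hydrophobicity) instead of applying one threshold.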
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION cscpconf
A protein can be represented by an amino acid interaction network. This network is a graph whose
vertices are the protein's amino acids and whose edges are the interactions between them. Predicting this
interaction network is the first step in predicting a protein's three-dimensional structure. In this
paper we present a multi-objective evolutionary algorithm for interaction prediction, and an ant
colony probabilistic optimization algorithm is used to confirm the interactions.
Amino acid interaction network prediction using multi objective optimization csandit
A protein can be represented by an amino acid interaction network. This network is a graph whose
vertices are the protein's amino acids and whose edges are the interactions between them. Predicting this
interaction network is the first step in predicting a protein's three-dimensional structure. In this
paper we present a multi-objective evolutionary algorithm for interaction prediction, and an ant
colony probabilistic optimization algorithm is used to confirm the interactions.
Criterion based Two Dimensional Protein Folding Using Extended GA IJCSEIT Journal
In the dynamic field of biological and protein research, protein fold recognition for long
protein sequences has been a major challenge for many years. With that in mind, this paper contributes
to the protein folding research field and presents a novel procedure for mapping a protein
structure to its correct 2D fold with a concrete model using swarm intelligence. The model
incorporates an Extended Genetic Algorithm (EGA) with a concealed Markov model (CMM) to effectively
fold protein sequences with long chain lengths. The protein sequences are preprocessed,
classified, and then analyzed against criteria such as fitness, similarity and sequence gaps
for optimal formation of protein structures. Fitness correlation is evaluated to determine the
bonding strength of molecules, which contributes to efficient fold recognition. Experimental results
show that the proposed method is more adept at 2D protein folding and outperforms existing
algorithms.
Classification of Enzymes Using Machine Learning Based Approaches: A Review mlaij
Enzymes play an important role in metabolism by catalyzing biochemical reactions. A
computational method is required to predict the function of enzymes. Many feature selection techniques
from previous research papers are examined in this paper. This paper presents a supervised
machine learning approach to predict the functional classes and subclasses of enzymes based on a set of 857
sequence-derived features. It uses seven sequence-derived properties: amino acid composition,
dipeptide composition, correlation features, composition, transition, distribution and pseudo amino acid
composition. Support vector machine recursive feature elimination (SVM-RFE) is used to select the optimal
number of features. Random Forest is used to construct a three-level model with the optimal
number of features selected by SVM-RFE, where the top level distinguishes a query protein as an enzyme or non-enzyme,
the second level predicts the enzyme functional class and the third level predicts the sub-functional
class. The proposed model reported an overall accuracy of 100%, precision of 100% and MCC value of 1.00
for the first level, an accuracy of 90.1%, precision of 90.5% and MCC value of 0.88 for the second level,
and an accuracy of 88.0%, precision of 88.7% and MCC value of 0.87 for the third level.
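Two of the sequence-derived properties named above, amino acid composition and dipeptide composition, are simple enough to sketch directly. The code below is an illustration only: it covers 420 of the 857 features mentioned (20 + 400), the remaining feature groups are not reproduced, and the function names are chosen for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order so feature positions are stable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in the sequence."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair among all adjacent pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = {}
    for pair in pairs:
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(dp, 0) / len(pairs) for dp in DIPEPTIDES]


def feature_vector(seq):
    """Concatenate both compositions into one 420-dimensional vector."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

Vectors like this would be the input to the feature selection and classification stages; those stages are not shown here.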
Aligning Subunits of Internally Symmetric Proteins with CE-Symm Spencer Bliven
Poster from 3DSIG 2013 on CE-Symm. For a more recent version, see http://www.slideshare.net/sbliven/3dsig-2014-systematic-detection-of-internal-symmetry-in-proteins
International Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
MULISA : A New Strategy for Discovery of Protein Functional Motifs and Residues csandit
Predicting and identifying details of function from protein sequences is an urgent task given the
growing number and diversity of protein sequences. Here, we develop a novel approach for identifying
conserved residues and motifs of ligand-binding proteins. In this method, called MuLiSA (Multiple
Ligand-bound Structure Alignment), we first superimpose the ligands of ligand-binding proteins, and the
residues of the ligand-binding sites are then naturally aligned. We identify important residues and
patterns based on the z-scores of the residue entropy and residue-segment entropy. After identifying
new pattern candidates, profiles of the patterns are generated to predict protein function from protein
sequences alone. We tested our approach on ATP-binding proteins and HEM-binding proteins. The
experiments show that MuLiSA can identify conserved residues and novel patterns that are correlated
with the protein functions of certain ligand-binding proteins. We found that MuLiSA can identify
conservation patterns and performs better than traditional alignments such as CE and CLUSTALW for some
ligand-binding proteins. We believe that MuLiSA is useful for discovering ligand-binding
specificity-determining residues and functionally important patterns of proteins.
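The entropy z-score step can be sketched as below. This is a hedged illustration, not MuLiSA itself: it assumes the binding-site residues have already been aligned into equal-length rows (the ligand superposition step is not shown), it uses Shannon entropy in bits, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # ignore gap characters
    if not residues:
        return 0.0
    n = len(residues)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(residues).values())


def entropy_z_scores(alignment):
    """Z-score of each column's entropy relative to all columns.

    Strongly negative scores mark columns that are more conserved than
    is typical for this alignment.
    """
    entropies = [column_entropy(col) for col in zip(*alignment)]
    mean = sum(entropies) / len(entropies)
    sd = math.sqrt(sum((e - mean) ** 2 for e in entropies) / len(entropies))
    if sd == 0:
        return [0.0] * len(entropies)
    return [(e - mean) / sd for e in entropies]
```

For the toy alignment `["GKS", "GAT", "GKT", "GLS"]` the first column is fully conserved, so it receives the lowest z-score of the three.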
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
The Chaotic Structure of Bacterial Virulence Protein Sequencescsandit
Bacterial virulence proteins, which have been classified by the structure of virulence, cause
several diseases. For instance, Adhesins play an important role in the host cells. They are
inserted DNA sequences for a variety of virulence properties. Several important methods have been
developed for the prediction of bacterial virulence proteins in order to find new drugs or vaccines.
In this study, we propose a feature selection method for the classification of bacterial
virulence proteins. The features are constructed directly from the amino acid sequence of a given
protein. Amino acids form proteins, which are critical to life and have many important
functions in living cells. Their different physicochemical properties are each represented by a vector of
20 numerical values, collected in the AAIndex database of 544 known indices.
The approach has two steps. First, the amino acid sequence of a given protein
is analysed with Lyapunov exponents to determine whether it has a chaotic structure in accordance with
chaos theory. Then, if the results show characterization over the complete distribution in
the phase space from the point of view of a deterministic system, the related protein is taken to show a
chaotic structure.
Empirical results revealed that feature vectors generated from the chaotic
structure of physicochemical features of amino acids give the best performance on the Adhesins and non-Adhesins data sets.
Jiang Y., Xu W., Thompson L.P., Gutell R., and Miranker D. (2011).
R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm.
Proceedings of 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, GA. November 12-15, 2011. IEEE Computer Society, Washington, DC, USA. pp. 618-622.
Gardner D.P., Xu W., Miranker D.P., Ozer S., Cannone J.J., and Gutell R.R. (2012).
An Accurate Scalable Template-based Alignment Algorithm.
Proceedings of 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), Philadelphia, PA. October 4-7, 2012. IEEE Computer Society, Washington, DC, USA. pp. 237-243.
QSAR Studies of the Inhibitory Activity of a Series of Substituted Indole and... inventionjournals
The HF method with the 6-31G(d) basis set was employed to calculate some quantum chemical descriptors of 37 substituted indoles. The best descriptors were selected to establish the quantitative structure-activity relationship (QSAR) of the inhibitory activity against isoprenylcysteine carboxyl methyltransferase (Icmt) by principal components analysis (PCA), multiple regression analysis (MLR), nonlinear regression (RNLM) and an artificial neural network (ANN). We accordingly propose a quantitative model and interpret the activity of the compounds based on the multivariate statistical analysis. This study shows that MLR and RNLM served to predict activity, but when compared with the results given by the ANN model, we concluded that the predictions achieved by the latter are more effective and much better than those of the other models. The statistical results indicate that the model is statistically significant and shows very good stability towards data variation in the validation method. The contribution of each descriptor to the structure-activity relationship is evaluated.
This paper presents a literature survey of research-oriented developments made to date. Its purpose is to provide a thorough understanding of existing approaches to gene sequencing and alignment using the Smith-Waterman algorithm, and of their respective strengths and weaknesses. Before undertaking quality research, it is advisable to conduct a goal-oriented literature survey that gives an in-depth understanding of prior work, so that an objective can be formulated from the gaps between present requirements and existing approaches. A predominant issue in gene sequencing is devising an optimized system model that delivers efficient processing without introducing overheads in memory and time. This research is oriented towards developing such a system, based on the dynamic programming approach known as the Smith-Waterman algorithm in an enhanced form supported by other optimized techniques. The paper gives a brief introduction to the research domain, the research gap and motivations, the objectives formulated, and the systems proposed to accomplish them.
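For readers unfamiliar with the algorithm the survey is built around, the following is a minimal sketch of basic Smith-Waterman local alignment with a linear gap penalty. It is not the enhanced variant the survey refers to, and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary defaults chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The full score matrix costs time and memory proportional to the product of the two sequence lengths, which is the overhead that optimized variants aim to reduce.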
A Review of Various Methods Used in the Analysis of Functional Gene Expressio... ijitcs
Sequencing projects arising from high-throughput technologies, including DNA microarray technologies, have made it possible to measure simultaneously the expression levels of millions of genes in a biological sample, and to annotate and identify the role (function) of those genes. Consequently, bioinformatics approaches have been developed to better manage and organize this significant amount of information. These approaches provide a representation and a more 'relevant' integration of data in order to test and validate researchers’ hypotheses. In this context, this article describes and discusses some techniques used for the functional analysis of gene expression data.
Aligning Subunits of Internally Symmetric Proteins with CE-SymmSpencer Bliven
Poster from 3DSIG 2013 on CE-Symm. For a more recent version, see http://www.slideshare.net/sbliven/3dsig-2014-systematic-detection-of-internal-symmetry-in-proteins
nternational Journal of Engineering Research and Development is an international premier peer reviewed open access engineering and technology journal promoting the discovery, innovation, advancement and dissemination of basic and transitional knowledge in engineering, technology and related disciplines.
MULISA : A New Strategy for Discovery of Protein Functional Motifs and Residuescsandit
To predict and identify details regarding function
from protein sequences is an emergency task
since the growing number and diversity of protein s
equence. Here, we develop a novel approach
for identifying conservation residues and motifs of
ligand-binding proteins. In this method,
called MuLiSA (Multiple Ligand-bound Structure Alig
nment), we first superimpose the ligands
of ligand-binding proteins and then the residues of
ligand-binding sites are naturally aligned.
We identify important residues and patterns based o
n the z-scores of the residue entropy and
residue-segment entropy. After identifying new patt
ern candidates, the profiles of patterns are
generated to predict the protein function from only
protein sequences. We tested our approach
on ATP-binding proteins and HEM-binding proteins. T
he experiments show that MuLiSA can
identify the conservation residues and novel patter
ns which are really correlated with protein
functions of certain ligand-binding proteins. We fo
und that our MuLiSA can identify
conservation patterns and is better than traditiona
l alignments such as CE and CLUSTALW in
some ligand-binding proteins. We believe that our M
uLiSA is useful to discover ligand-binding
specificity-determining residues and functional imp
ortant patterns of proteins.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Bacterial virulence proteins, which have been classified on structure of virulence, causes
several diseases. For instance, Adhesins play an important role in the host cells. They are
inserted DNA sequences for a variety of virulence properties. Several important methods
conducted for the prediction of bacterial virulence proteins for finding new drugs or vaccines.
In this study, we propose a method for feature selection about classification of bacterial
virulence protein. The features are constituted directly from the amino acid sequence of a given
protein. Amino acids form proteins, which are critical to life, and have many important
functions in living cells. They occurring with different physicochemical properties by a vector of
20 numerical values, and collected in AAIndex databases of known 544 indices.
For all that, this approach have two steps. Firstly, the amino acid sequence of a given protein
analysed with Lyapunov Exponents that they have a chaotic structure in accordance with the
chaos theory. After that, if the results show characterization over the complete distribution in
the phase space from the point of deterministic system, it means related protein will show a
chaotic structure.
Empirical results revealed that generated feature vectors give the best performance with chaotic
structure of physicochemical features of amino acids with Adhesins and non-Adhesins data sets.
The Chaotic Structure of Bacterial Virulence Protein Sequencescsandit
Bacterial virulence proteins, which have been classified on structure of virulence, causes
several diseases. For instance, Adhesins play an important role in the host cells. They are
inserted DNA sequences for a variety of virulence properties. Several important methods
conducted for the prediction of bacterial virulence proteins for finding new drugs or vaccines.
In this study, we propose a method for feature selection about classification of bacterial
virulence protein. The features are constituted directly from the amino acid sequence of a given
protein. Amino acids form proteins, which are critical to life, and have many important
functions in living cells. They occurring with different physicochemical properties by a vector of
20 numerical values, and collected in AAIndex databases of known 544 indices.
For all that, this approach have two steps. Firstly, the amino acid sequence of a given protein
analysed with Lyapunov Exponents that they have a chaotic structure in accordance with the
chaos theory. After that, if the results show characterization over the complete distribution in
the phase space from the point of deterministic system, it means related protein will show a
chaotic structure.
Empirical results revealed that generated feature vectors give the best performance with chaotic
structure of physicochemical features of amino acids with Adhesins and non-Adhesins data sets.
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Functional Classification
Signal & Image Processing: An International Journal (SIPIJ) Vol.2, No.1, March 2011
DOI: 10.5121/sipij.2011.2104
A FREQUENCY DOMAIN APPROACH TO PROTEIN
SEQUENCE SIMILARITY ANALYSIS AND
FUNCTIONAL CLASSIFICATION
Anu Sabarish.R (1) and Tessamma Thomas (2)
Department of Electronics, Cochin University of Science and Technology, Kerala, India
(1) anusabarish@cusat.ac.in
(2) tess@cusat.ac.in
ABSTRACT
A new computational approach for protein sequence similarity analysis and functional classification,
which is faster and easier than the conventional method, is described. This technique uses Discrete
Wavelet Transform decomposition followed by sequence correlation analysis. The technique can also be
used for identifying the functional class of a newly obtained protein sequence. The classification was
done using a sample set of 270 protein sequences obtained from organisms of diverse origins and
functional classes, which gave a classification accuracy of 94.81%. The accuracy and reliability of the
technique are verified by comparing the results with those obtained from NCBI.
KEYWORDS
Genomic Signal Processing, Discrete Wavelet Transforms, Protein Sequence Similarity, Electron Ion
Interaction Potential, Protein functional classification.
1. INTRODUCTION
Genomic sequence analysis is a highly cross-disciplinary field which aims at processing and
interpreting the vast information available from biomolecular sequences, for quicker and
better understanding. A wide range of computational methods are being used with the intention
of extracting valuable information from these sequences in real time, where the traditional
methods based on statistical techniques are less suited.
The DNA molecule has a double helix structure [1] consisting of two strands, where each strand
consists of a linked chain of smaller nucleotides or bases. There are 4 types of bases: adenine
(A), thymine (T), cytosine (C) and guanine (G). Three adjacent bases in a DNA sequence form a
triplet called a codon. Each codon represents an amino acid and instructs the cell
machinery to incorporate the corresponding amino acid during the translation phase of protein
synthesis. The coding sequence of a protein therefore begins with the start codon ATG,
which corresponds to the amino acid methionine, continues with the codons of the remaining
amino acids and ends with a stop codon; the protein itself is the resulting linear chain of
amino acids. Among the numerous available amino acids only 20 are generally found
in living beings and they form a linear polypeptide chain by covalent linkages [2]. The amino
acid sequence that makes up a protein is called its primary structure. The physical and chemical
interactions between the amino acids force the chain to take several different secondary
structures such as the alpha-helix and the beta-sheet, as shown in Figure 1. Protein molecules tend to
fold into complex three-dimensional structures, forming weak bonds between their own atoms,
and they are responsible for carrying out nearly all of the essential functions in the living cell by
properly binding to other molecules with a number of chemical bonds connecting neighbouring
atoms. This unique 3-D structure enables the protein to have target specificity, as protein-target
interaction occurs at predefined targets within the 3-D structure of the protein. This
selectivity and structure relate directly to the amino acid sequence or, in other words, to the
primary structure of the protein. The biological function of a protein, its chemical properties and
3D structure are ultimately determined by the DNA character string. Protein sequences
belonging to the same functional class from different organisms have some sort of sequence
similarity that allows them to perform their common function. This similarity in structure and
sequence can be attributed to the fact that they are derived phylogenetically from a common
precursor and the evolution process appears to have exerted a considerable degree of
conservatism towards functionally critical residues [4]. Protein sequence comparison and
alignment is done to identify the similarities and differences between different protein
sequences. This similarity search has various applications, such as identifying the amino acid
residues that are critical for biological function and structure, and phylogenetic analysis.
Figure 1. 3-D structure of a protein showing the structural motifs such as helix and sheet. This structure
is of the human growth hormone taken from PDB [3] using the ID “3hhr”.
Comparative analysis of homologous sequences relies heavily on sequence alignment
techniques and similarity score as a quantitative measure. A number of sequence alignment
techniques have been developed. In [5], Lipman et al. proposed a new algorithm for rapid
sequence similarity search. Instead of comparing each nucleotide or amino acid of one
sequence with all of the residues in the second sequence, the algorithm focused on groups of
identities between the sequences. In [6], Brodzik applied a cross correlation based
technique for sequence alignment. The symmetric phase-only matched filter is used in that work for
alignment of DNA sequences containing repetitive patterns. Katoh et al. used
a fast Fourier transform based algorithm for rapid multiple sequence alignment in [7].
Homologous regions are rapidly identified using FFT and an efficient scoring system is also
discussed. Bolten et al. described a protein clustering approach using transitivity in [8].
Here, pair-wise sequence similarity is found using the Smith-Waterman local alignment algorithm,
followed by a graph based method for clustering. In [9], A. Krause et al. proposed an iterative
procedure which uses set theoretical relationships for clustering protein sequences. E. Giladi et al.
proposed a window based approach for finding near-exact sequence matches in [10]. Yi-Leh
Wu et al., in [11], performed sequence similarity analysis using Fourier transform and
wavelet transform based methods. The results were compared and it was found that both
methods give comparable results. A prototype system for clustering proteins called
SEQOPTICS is described by Y. Chen et al. in [12]. The SEQOPTICS system uses the Smith-Waterman
algorithm for distance measurement and OPTICS for clustering. OPTICS (Ordering Points To
Identify the Clustering Structure) uses a density based clustering approach. In [13], M.G.
Grabherr et al. described a procedure using fast Fourier transform and cross correlation for
sequence alignment. In [14], Kin-pong Chan et al. used the wavelet transform with Haar
wavelets for time series matching. In [15], S.A. Aghili et al. studied the effectiveness of
integrating DFT and DWT techniques for sequence similarity search of biological sequences.
It is proposed as a pre-processing phase for any other sequence alignment technique, as the
method can be used to prune most of the undesired sequences and reduce the real search
problem to only a fraction of the database. In [16], Veljkovic et al. proposed a Resonant
Recognition Model (RRM) in which the presence of a consensus spectrum corresponding to a
biological function is identified. In [17], Trad et al. used wavelet decomposition to extract
characteristic bands from protein sequences. A sequence scale similarity analysis is also
proposed to identify the functional similarity between sequences.
Traditional sequence comparison and alignment methods concentrate on local similarity, and the
alignment is achieved by introducing insertions and deletions into the sequences. In the present method,
sequence comparison and similarity measurement based on multi-resolution analysis of the protein
sequence using the discrete wavelet transform is adopted. This allows comparison of two sequences
at different resolutions. Based on this analysis, a protein classification using simple sequence
correlation is done, which can be used as a pre-screening method before further detailed analysis.
The paper is organised as follows. The method for representing protein sequence data in
numerical form is mentioned in section 2. Section 3 describes the methodology used and in
section 4 the details of implementation and results are discussed.
2. NUMERICAL REPRESENTATION OF AMINO ACID SEQUENCE
Most of the identified protein sequence data is available freely over the web at various online
databases, one of which is the Entrez search and retrieval system of the National Center for
Biotechnology Information (NCBI) [18]. These protein sequences are usually in the form of a
sequence of characters, each representing a distinct amino acid. In order to analyse
these protein sequences using digital signal processing methods, the amino acid character
sequence has to be represented in some numerical form. For a reliable representation, the
numerical value assigned to each amino acid should represent a physical characteristic of that
amino acid and should be relevant to the biological activity of these molecules [16].
The informational capacity of various physicochemical, thermodynamic,
structural and statistical amino acid parameters is compared in [19], where it is shown that the Electron
Ion Interaction Potential (EIIP) is the most suitable known amino acid property for use
in structure-function analyses of proteins. The EIIP values for amino acids and nucleotides are
calculated using the general model pseudo potential described in [20]:
where,
q is the change of momentum of the delocalized electron in the interaction with the potential w,
α is a constant,
Z is the atomic number,
Z0 is the atomic number of the inert element that begins the period, which includes the actual Z
in the standard periodic table,
with q a wave number and K_F the corresponding Fermi momentum, and
with (E_F)_Z the corresponding Fermi energy.
The EIIP values of the 4 nucleotides that constitute the entire genomic sequence are given in
Table 1.
Table 1. EIIP values of Nucleotides.
Nucleotide Alphabet EIIP
Adenine A 0.126
Thymine T 0.1335
Guanine G 0.0806
Cytosine C 0.134
The EIIP values of the 20 amino acids that form the linear polypeptide chain of each protein
sequence are given in Table 2.
Table 2. EIIP values of Amino acids [20].
Amino acids Alphabet EIIP
Alanine (Ala) A 0.0373
Cysteine (Cys) C 0.0829
Aspartic acid (Asp) D 0.1263
Glutamic acid (Glu) E 0.0058
Phenylalanine (Phe) F 0.0946
Glycine (Gly) G 0.0050
Histidine (His) H 0.0242
Isoleucine (Ile) I 0
Lysine (Lys) K 0.0371
Leucine (Leu) L 0
Methionine (Met) M 0.0823
Asparagine (Asn) N 0.0036
Proline (Pro) P 0.0198
Glutamine (Gln) Q 0.0761
Arginine (Arg) R 0.0959
Serine (Ser) S 0.0829
Threonine (Thr) T 0.0941
Valine (Val) V 0.0057
Tryptophan (Trp) W 0.0548
Tyrosine (Tyr) Y 0.0516
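The numerical representation described in this section can be sketched in a few lines of Python. This is only an illustration: the EIIP values are copied from Table 2, while the names `EIIP` and `encode_eiip` and the short test string are hypothetical and not taken from the paper.

```python
# EIIP values copied from Table 2; names below are illustrative only.
EIIP = {
    "A": 0.0373, "C": 0.0829, "D": 0.1263, "E": 0.0058, "F": 0.0946,
    "G": 0.0050, "H": 0.0242, "I": 0.0000, "K": 0.0371, "L": 0.0000,
    "M": 0.0823, "N": 0.0036, "P": 0.0198, "Q": 0.0761, "R": 0.0959,
    "S": 0.0829, "T": 0.0941, "V": 0.0057, "W": 0.0548, "Y": 0.0516,
}


def encode_eiip(sequence):
    """Map a one-letter amino acid string to a list of EIIP values."""
    try:
        return [EIIP[residue] for residue in sequence.upper()]
    except KeyError as err:
        # Letters outside the 20 standard amino acids have no entry in Table 2.
        raise ValueError(f"no EIIP value for residue {err.args[0]!r}") from None


# Arbitrary five-residue string used only to show the mapping.
signal = encode_eiip("MGLSD")
print(signal)
```

Rejecting unknown letters instead of silently mapping them to zero is a design choice of this sketch; zero is itself a valid EIIP value (Ile, Leu), so a silent default would be indistinguishable from real data.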
3. METHODOLOGY
3.1. Discrete Wavelet Transform decomposition of sequences
In Wavelet Transform (WT) analysis, a signal is represented as a linear combination of scaled
and shifted versions of the mother wavelet and scaling functions. Thus it represents a signal as
the sum of wavelets with different locations and scales, with the coefficients indicating the
strength of the contribution of the wavelet at the corresponding locations and scales. In DWT,
any discrete time sequence ƒ(n) of finite energy can be expressed in terms of the discrete time
basis functions as,
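$$f(n) = \sum_{j}\sum_{k} a_{j,k}\,\psi_{j,k}(n),$$
where $a_{j,k}$ represents the coefficient corresponding to scale j and location k, and
$\psi_{j,k}(n)$ is the discrete time basis function at that scale and location.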
DWT is implemented using filter banks where the signal is passed through a series of high pass
and low pass filters followed by sub-sampling, a method known as the Mallat algorithm.
Decomposition of the signal into different frequency bands is obtained by successive high pass
and low pass filtering of the time domain signal. A schematic representation of two-level DWT
decomposition is shown in Figure 2 and the corresponding frequency characteristics are shown in
Figure 3. In the first level the discrete sequence X[n] is passed through a half band high pass
filter G0 and a half band low pass filter H0, and each output is then down sampled by 2 to produce the detail
information D1 and the coarse information A1. The low frequency component A1 is again
passed through the filters G0 and H0 to produce D2 and A2. The filtered sequence at each level
has a frequency span one-half that of the sequence entering that level. Thus the sequence can be
represented by half the number of samples, and decimation by 2 is done. As a result DWT
provides good time resolution at high frequencies and good frequency resolution at low
frequencies.
Figure 2. Schematic of Two-level DWT decomposition tree. X[n] is filtered by G0 and H0 to give the
Level 1 coefficients D1[k] and A1[k]; A1[k] is filtered again by G0 and H0 to give the Level 2
coefficients D2[k] and A2[k].
The filtering and decimation operation done at each level can be mathematically represented as:
$$D[k] = \sum_{n} X[n]\, G_0[2k - n], \qquad A[k] = \sum_{n} X[n]\, H_0[2k - n],$$
where D is the high pass filter output and A is the low pass filter output.
Figure 3. Frequency splitting characteristics of two - level DWT decomposition.
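The filter-and-decimate step of this section can be illustrated with a small pure-Python sketch. It is not the authors' implementation: two-tap Haar filters are swapped in for brevity (the paper uses Bior3.3), the boundary is zero padded, and the odd-indexed samples of the convolution are kept so that each output is built from one complete pair of inputs. The function names and the sample signal are hypothetical.

```python
import math

# Haar analysis filters, chosen only because they are the shortest example.
H0 = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # low pass
G0 = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # high pass


def convolve(x, taps):
    """Full linear convolution of x with taps (zero padding at the ends)."""
    out = [0.0] * (len(x) + len(taps) - 1)
    for n, xn in enumerate(x):
        for m, tm in enumerate(taps):
            out[n + m] += xn * tm
    return out


def dwt_level(x):
    """One analysis level: filter with G0 and H0, then keep every 2nd sample."""
    detail = convolve(x, G0)[1::2]
    approx = convolve(x, H0)[1::2]
    return approx, detail


# Two-level decomposition of an arbitrary 8-sample signal.
x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a1, d1 = dwt_level(x)    # level 1: A1 and D1, 4 samples each
a2, d2 = dwt_level(a1)   # level 2: A2 and D2, 2 samples each
print(len(d1), len(d2), len(a2))
```

Because the Haar filters are orthonormal, the energy of `x` equals the combined energy of `a1` and `d1` for even-length input, which is a convenient sanity check on the filter bank.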
3.2. Correlation Analysis
An L-level DWT decomposition as described above provides L+1 sequences, each of which
corresponds to the protein sequence component belonging to a different frequency band. In order
to measure the sequence similarity between two sequences, a cross correlation analysis is done
between each level of the two sequences. The correlation function represents the similarity of
one set of data with another. Hence the correlation coefficients obtained at each level give a
measure of similarity between the two sequences at each level. The correlation coefficient R(j)
is calculated by shifting one sequence with respect to the other:
$$R(j) = \frac{\sum_{n} s_1(n)\, s_2(n+j)}{\sqrt{\sum_{n} s_1(n)^2 \, \sum_{n} s_2(n)^2}},$$
where s1(n) and s2(n) are the two sequences and j represents the shift.
The maximum value of the correlation coefficient at each level corresponds to the shift that
gives the best matching alignment for the two sequences. This maximum value, say Rmax, is
taken as a measure of similarity between the sequences at that level.
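A minimal sketch of the Rmax computation follows. It assumes that R(j) is normalized by the energies of the two sequences, so that identical sequences score 1; this normalization, the zero treatment of non-overlapping samples and the function names are assumptions of the sketch rather than details confirmed by the paper.

```python
import math


def correlation(s1, s2, shift):
    """Normalized cross-correlation of s1 and s2 at one integer shift."""
    num = 0.0
    for n, a in enumerate(s1):
        m = n + shift
        if 0 <= m < len(s2):          # samples outside s2 contribute zero
            num += a * s2[m]
    norm = math.sqrt(sum(a * a for a in s1) * sum(b * b for b in s2))
    return num / norm if norm else 0.0


def r_max(s1, s2):
    """Maximum of R(j) over every shift with at least one overlapping sample."""
    shifts = range(-(len(s1) - 1), len(s2))
    return max(correlation(s1, s2, j) for j in shifts)


a = [1.0, -2.0, 3.0, -1.0]
delayed = [0.0, 0.0] + a              # same pattern, shifted by two samples
print(r_max(a, a), r_max(a, delayed))
```

Taking the maximum over shifts is what makes the score insensitive to where the common pattern starts in each sequence.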
4. PROTEIN SEQUENCE COMPARISON AND CLASSIFICATION
The protein sequences that are represented in numerical form are subjected to multi resolution
analysis using DWT followed by correlation analysis.
4.1 Database
The protein sequences that are used in this work are obtained from the National Center for
Biotechnology Information (NCBI) website [18]. The Myoglobin sequences mentioned in
section 4.3.1 are taken from 24 different animals: Human, Finback whale, Dolphin, Baboon,
Cattle, Dog, Fox, Gorilla, Grey whale, Horse, Mouse, Killer whale, Marmoset, Minke whale,
Mole rat, Night monkey, Norway rat, Pika, Pilot whale, Rabbit, Red deer, Sheep, Sperm whale
and Zebra, all belonging to the class Mammalia. The 11 Beta Actin sequences used in the
second example are taken from Human, Vervet, Cattle, Chicken, Dog, Horse, Rhesus macaque,
Mouse, Pig, Rabbit and Rat. The 15 Cytochrome C sequences used in the third example are
taken from Cattle, Camel, Chicken, Chimpanzee, Dog, Gorilla, Grey whale, Horse, Mouse,
Human, Rat, Ostrich, Pig, Seal and Sheep. The 270 protein sequences used in section 4.3.2 are
also taken from the NCBI database.
4.2 Implementation
The protein sequences are first represented numerically using EIIP values and normalized to
zero mean. Then a 3-level DWT decomposition using the Bior3.3 wavelet is performed, giving 4 sets
of coefficient sequences: detail D1, D2, D3 and approximation A3. The Bior3.3 decomposition
wavelet function and scaling function are very rugged and have many abrupt changes, and have been
shown to be suitable for the analysis of protein sequences, which are also very rugged in nature [17].
The 4 coefficient sequences thus obtained from the two different proteins are subjected to
correlation analysis, giving 4 correlation values corresponding to Rmax as mentioned above,
representing the measure of sequence similarity. The sequence similarity thus obtained can be
used for functional classification of proteins, for which two conditions are to be satisfied. The
first condition is that protein sequences belonging to the same functional class should show very
strong sequence similarity. The second condition is that sequences from different classes that
belong to the same organism should not show any significant sequence similarity.
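The implementation steps above (zero-mean normalization, 3-level decomposition, one Rmax value per band) can be strung together as in the following sketch. It operates on signals that are already EIIP encoded. Haar filters again stand in for Bior3.3, an odd trailing sample is dropped at each level, and Rmax is assumed to be energy normalized; these simplifications and all function names are assumptions made for illustration.

```python
import math


def zero_mean(x):
    """Subtract the mean so the encoded signal has zero mean."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def haar_step(x):
    """One Haar analysis level; an odd trailing sample is dropped."""
    pairs = list(zip(x[0::2], x[1::2]))
    approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    detail = [(b - a) / math.sqrt(2) for a, b in pairs]
    return approx, detail


def decompose(x, levels=3):
    """Return the bands D1..D<levels> and A<levels> of the zero-mean signal."""
    bands = {}
    approx = zero_mean(x)
    for level in range(1, levels + 1):
        approx, detail = haar_step(approx)
        bands[f"D{level}"] = detail
    bands[f"A{levels}"] = approx
    return bands


def r_max(s1, s2):
    """Largest energy-normalized cross-correlation over all shifts."""
    norm = math.sqrt(sum(a * a for a in s1) * sum(b * b for b in s2))
    if norm == 0:
        return 0.0
    best = -1.0
    for j in range(-(len(s1) - 1), len(s2)):
        num = sum(a * s2[n + j] for n, a in enumerate(s1)
                  if 0 <= n + j < len(s2))
        best = max(best, num / norm)
    return best


def band_similarity(x1, x2, levels=3):
    """Rmax between corresponding bands of two encoded sequences."""
    b1, b2 = decompose(x1, levels), decompose(x2, levels)
    return {name: r_max(b1[name], b2[name]) for name in b1}


# Arbitrary 16-sample signal built from EIIP values, compared with itself.
x = [0.0823, 0.0050, 0.0, 0.0829, 0.1263, 0.0373, 0.0058, 0.0946,
     0.0242, 0.0371, 0.0198, 0.0761, 0.0959, 0.0941, 0.0057, 0.0548]
print(band_similarity(x, x))
```

With `levels=3` the result has exactly four entries (D1, D2, D3, A3), matching the four correlation values described above.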
4.3 Results and Discussion
4.3.1 Correlation analysis of proteins
Initially, Myoglobin sequences obtained from the 24 different mammals mentioned above were
considered. Human Myoglobin was taken as the reference and it was compared with the remaining
23 sequences. The result of the correlation analysis is shown in Figure 4. It can be noted that
there is very strong correlation between every pair of sequences at all decomposition levels.
This shows not only the local pair-wise similarity but also the global sequence similarity, which
cannot be obtained by conventional sequence alignment methods. There is more than 90%
similarity in almost all pairs, which points to the conservation, across organisms, of the amino
acid residues that are critical to the functionality of the proteins.
As a second example, Beta Actin sequences were considered. Here also, the human Beta Actin
sequence is taken as the reference and the Beta Actin sequences of 10 other organisms are taken for
analysis. Nine out of ten sequences showed 100% correlation while the remaining one showed
99% correlation. The result is shown in Figure 5. The third example was the Cytochrome C
sequence. Here the bovine Cytochrome C sequence is taken as the reference and it was compared with
Cyt C sequences from 14 other origins. The result obtained was similar to the previous ones
with minimum pair-wise correlation value greater than 0.8 (80%), showing significant sequence
similarity, as shown in Figure 6.
Figure 4. Correlation coefficients of human myoglobin against myoglobin sequence of 23 different
organisms. The abscissa represents 23 different organisms and the ordinate is the maximum value of
correlation coefficient, Rmax, between human myoglobin and the corresponding organism. D1, D2, D3 and A3
represent the 4 decomposition levels. Direct correlation represents the original sequence correlation.
Organisms: 1-Finback Whale, 2-Dolphin, 3-Baboon, 4-Cattle, 5-Dog, 6-Fox, 7-Gorilla, 8-Grey Whale,
9-Horse, 10-Mouse, 11-Killer Whale, 12-Marmoset, 13-Minke Whale, 14-Mole Rat, 15-Night Monkey,
16-Norway Rat, 17-Pika, 18-Pilot Whale, 19-Rabbit, 20-Red Deer, 21-Sheep, 22-Sperm Whale, 23-Zebra.
Figure 5. Correlation plot of Human Beta Actin against Beta Actin from 10 other organisms. Similarity
scores at all the 4 levels are shown.
Organisms: 1-Vervet, 2-Bovine, 3-Chicken, 4-Dog, 5-Horse, 6-Rhesus, 7-Mouse, 8-Pig, 9-Rabbit, 10-Rat.
Figure 6. Correlation plot of Bovine Cytochrome C against Cyt C from 14 other organisms. Similarity
scores at all the 4 levels are shown.
Organisms: 1-Camel, 2-Chicken, 3-Chimpanzee, 4-Dog, 5-Gorilla, 6-Grey whale, 7-Horse, 8-Mouse,
9-Human, 10-Rat, 11-Ostrich, 12-Pig, 13-Seal, 14-Sheep.
Figure 7. Correlation Plot of human Myoglobin against 14 other human proteins.
Proteins: 1-Kinase, 2-Myosin, 3-FGF, 4-Actin, 5-Amylase, 6-Protease, 7-Cytochrome, 8-Glucagon,
9-Interferon, 10-Lysozyme, 11-Prolactin, 12-Somatotropin, 13-Hemoglobin, 14-ILH.
From the above results it is clear that the first condition is satisfied. To check the second
condition, the Myoglobin sequence itself was taken as an example. Human myoglobin was
compared with 14 other human proteins. The result showed that there is no significant
correlation between any pair at any decomposition level, as shown in Figure 7.
To confirm this, 15 different human protein sequences belonging to the following families:
Kinase, Myosin, FGF, Actin, Amylase, Protease, Cytochrome, Glucagon, Interferon, Lysozyme,
Myoglobin, Prolactin, Somatotropin, Hemoglobin and Insulin like hormone, were compared
among themselves. All the 120 possible pairwise similarities were measured and the results
analysed. Only one protein, alpha hemoglobin, showed good
correlation (0.79) with kinase, which is a functionally different protein. But this similarity is
confined to one frequency band (D3), and all other bands showed negligible correlation. None of the
protein sequences showed similarity in all the bands with a functionally dissimilar protein
sequence. This confirmed that the second condition is also satisfied and that the method can be used for
classifying protein sequences into different functional classes. When an unknown
sequence is obtained, it can be compared with a list of reference proteins, and by the best match
criterion one can identify the functional class of the new protein.
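The best match criterion can be sketched as below. The paper does not state how the four band-wise correlation values are combined into a single score, so averaging them is an assumption of this sketch. The `similarity` argument is a placeholder for any function returning the per-band Rmax values; `toy_similarity`, the class labels and the short strings are hypothetical stand-ins used only to make the example runnable.

```python
def classify(query, references, similarity):
    """Return (class_name, score) of the reference that best matches query.

    references: dict mapping class name to one reference sequence.
    similarity: callable returning a list of per-band similarity values.
    """
    def score(reference):
        bands = similarity(query, reference)
        return sum(bands) / len(bands)   # assumption: average over the bands

    best = max(references, key=lambda name: score(references[name]))
    return best, score(references[best])


def toy_similarity(a, b):
    """Hypothetical stand-in for the four band-wise Rmax values."""
    matches = sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
    return [matches] * 4


# Made-up reference strings and query, for illustration only.
references = {"class_a": "MGLSDGEWQL", "class_b": "MGDVEKGKKI"}
label, score = classify("MGLSDGEWQQ", references, toy_similarity)
print(label, round(score, 2))
```

Other combining rules, such as requiring a high value in every band, would fit the same interface by changing only the inner `score` function.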
4.3.2 Functional classification of proteins
Table 3. Comparison of the success rates of functional protein sequence classification
using the two methods.
Class of Protein    Success rate (Method 1)    Success rate (Method 2)
Actin-α             100%                       100%
Actin-β             100%                       100%
Amylase             100%                       100%
Cytochrome-b        100%                       100%
Cytochrome-c        100%                       100%
FGF                 100%                       100%
Glucagon            100%                       100%
Hemoglobin-α        96.43%                     96.43%
Hemoglobin-β        96.43%                     96.43%
IGF                 83.33%                     100%
Interferon          82.6%                      82.6%
Lysozyme            75%                        83.33%
Myoglobin           98.41%                     98.41%
Prolactin           100%                       100%
Somatotropin        50%                        83.33%
Entire sequences    90.3%                      94.81%
A sequence correlation based classification was done on a sample set of 270 proteins belonging
to 15 functional classes. These protein sequences are taken from vertebrates of diverse origin.
Two methods were considered for the classification. In the first method, human protein
sequences corresponding to the 15 protein classes are taken as reference protein set. Then from
the sample set, each sequence is randomly selected and compared with reference set to find the
class that has the best sequence similarity using the DWT coefficient correlation method
mentioned in section 3.2.
The reference set consists of human protein sequences of the following classes: Alpha actin, Beta
actin, Amylase, Cytochrome-b, Cytochrome-c, FGF, Glucagon, Alpha hemoglobin, Beta
hemoglobin, IGF, Interferon, Lysozyme, Myoglobin, Prolactin and Somatotropin. The detailed
results of the classification are shown in Table 3 as Method 1. Using this method, 244 out of
the 270 protein sequences were classified to the correct family, giving a success rate of 90.3%.
In Method 2, the success rate is improved by changing the selection criterion for the
reference proteins. In Method 1, human sequences were used as the reference set, which was an
arbitrary choice. In Method 2, the reference protein for each class is the one that has the
maximum similarity with the rest of the proteins in the same class. Using this method, 256 of
the 270 protein sequences were correctly classified, a success rate of 94.81%. These results
are also shown in Table 3 as Method 2. They illustrate that the classification accuracy can be
improved by a suitable choice of reference proteins.
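One way to realise the Method 2 selection rule is sketched below. The text does not state how "maximum similarity with the rest of the proteins" is aggregated, so this sketch assumes the mean pairwise similarity to the other members of the class; `select_reference` is a hypothetical helper, the Pearson coefficient stands in for the correlation measure, and the vectors are toy values rather than data from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def select_reference(members, similarity=pearson):
    """Index of the class member with the highest mean similarity to the others.

    Assumption: "maximum similarity with the rest" is taken to mean the largest
    average pairwise similarity; the source does not specify the aggregate.
    """
    if len(members) == 1:
        return 0

    def mean_similarity(i):
        scores = [similarity(members[i], m) for j, m in enumerate(members) if j != i]
        return sum(scores) / len(scores)

    return max(range(len(members)), key=mean_similarity)


# Toy feature vectors for one class (hypothetical values, not data from the paper).
members = [
    [1.0, 2.0, 4.0, 3.0],
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 3.0, 4.0],
]
best = select_reference(members)
print(best, members[best])
```

Choosing the most central member in this way makes the reference representative of its class, whereas a fixed choice such as the human sequence may lie far from some members.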
5. CONCLUSIONS
A simple and successful method for identifying protein similarity using frequency-domain
information is presented. The method uses a 3-level DWT decomposition with Bior3.3 wavelets
followed by correlation analysis. This allows sequence similarity to be measured at four
different scales, from the coarse A3 level to the fine D1 level. Scale-wise sequence similarity
analysis of Myoglobin, Beta Actin and Cytochrome C taken from different organisms was performed.
Protein sequences of the same functional class from different origins were found to be strongly
correlated, indicating high sequence similarity. 15 functionally different human protein
sequences were taken and all the possible 120 combinations of pairwise similarity were analysed,
which clearly showed that protein sequences of different classes taken from the same organism
have no significant sequence similarity. Based on these inferences, classification was performed
on a sample set of 270 protein sequences obtained from organisms of diverse origins and
functional classes using two different methods. Using Method 1, with human sequences as the
reference set, 244 of the 270 samples were correctly classified, an accuracy of 90.3%. Using
Method 2, where the reference set is selected by a maximum-similarity criterion, 256 of the 270
samples were correctly classified, an accuracy of 94.81%. When a new protein sequence is
obtained, this method can also be used as an initial step for identifying the functional class
to which it belongs. All results were compared with the NCBI database to verify their
reliability and accuracy.
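The four-scale comparison summarised above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the method as implemented in the paper: the Haar wavelet is substituted for Bior3.3 so that the example needs no external library, both inputs are assumed to be numeric signals of equal length that is a multiple of 8, and the helper names (`haar_step`, `dwt3`, `multiscale_similarity`) are hypothetical. A wavelet library such as PyWavelets would be the natural route to biorthogonal filters.

```python
from math import sqrt


def haar_step(signal):
    """One Haar DWT level: (approximation, detail) of an even-length signal."""
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / sqrt(2) for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / sqrt(2) for i in pairs]
    return approx, detail


def dwt3(signal):
    """3-level decomposition into the bands A3, D3, D2 and D1.

    Haar is used here as a stand-in for the Bior3.3 wavelet named in the text.
    """
    if len(signal) == 0 or len(signal) % 8:
        raise ValueError("signal length must be a non-zero multiple of 8")
    bands = {}
    approx = list(signal)
    for level in (1, 2, 3):
        approx, bands[f"D{level}"] = haar_step(approx)
    bands["A3"] = approx
    return bands


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def multiscale_similarity(x, y):
    """Correlation of two equal-length signals in each band, coarse to fine."""
    bx, by = dwt3(x), dwt3(y)
    return {band: pearson(bx[band], by[band]) for band in ("A3", "D3", "D2", "D1")}


# Toy numeric signals (hypothetical values, not an amino-acid encoding from the paper).
x = [float((7 * i) % 11) for i in range(32)]
y = [v + (0.5 if i % 3 == 0 else 0.0) for i, v in enumerate(x)]  # slightly perturbed copy
print(multiscale_similarity(x, y))
```

Each band yields its own correlation value, so two sequences can agree at the coarse A3 scale while differing in the fine D1 detail, which is the distinction the multi-scale analysis is meant to expose.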
6. REFERENCES
[1] J.D. Watson & F.H.C. Crick, (1953) “A structure for DNA”, Nature, Vol. 171, pp 737-738.
[2] P. Ramachandran & A. Antoniou, (2008) “Identification of hot-spot locations in proteins using
digital filters”, IEEE Journal of selected topics in signal processing, Vol. 2, No.3, pp 378–389.
[3] Protein Data Bank, Available: http://www.pdb.org/.
[4] E. Margoliash, (1963) “Primary structure and evolution of cytochrome c”, Proceedings of the
National academy of sciences of the USA, Vol. 50, pp 672-679.
13. Signal & Image Processing : An International Journal(SIPIJ) Vol.2, No.1, March 2011
48
[5] Lipman D.J & Pearson.W.R, (1985) “Rapid and sensitive protein similarity searches”, Science,
Vol. 227, pp 1435-1441.
[6] A.K.Brodzik (2005) “A comparative study of cross correlation methods for alignment of DNA
sequences containing repetitive patterns”, European signal processing conference EU-SIPCO
2005.
[7] K.Katoh, K.Misawa, K.Kuma & T.Miyata (2002) “MAFFT: a novel method for rapid multi[ple
sequence alignment based on fast Fourier transform”, Nucleic acids research, Vol. 30, No.14, pp
3059–3066.
[8] E.Bolten. et al. (2001) “Clustering protein sequences –structure prediction by transitive
homology”, Bioinformatics, Vol. 17, No.10, pp 935–941.
[9] A.Krause & M.Vingron (1998) “A set theoretic approach to database searching and clustering”,
Bioinformatics, Vol. 14, No.5, pp 430–438.
[10] E.Giladi et al. (2002) “SST: an algorithm for finding near exact sequence matches in time
proportional to the logarithm of the database size”, Bioinformatics, Vol. 18, No.6, pp 873–879.
[11] Yi-Leh Wu et al. (2000) “A comparison of DFT and DWT based similarity search in Time-
Series Databases”, CIKM, pp 488–495.
[12] Y.Chen et al. (2006) “SEQOPTICS: a protein sequence clustering system”, BMC Bioinformatics,
Vol. 7.
[13] M.G.Grabherr. et al. (2010) “Genome-wide synteny through highly sensitive sequence
alignment: Sastuma”, Bioinformatics, Vol. 26, No.9, pp 1145–1151.
[14] Kin-pong Chan & A.W Fu (1999) “Efficient time series matching by wavelets”, International
conference on data engineering, pp.126.
[15] S.A.Aghili. et al. (2005) “Sequence similarity search using discrete fourier and wavelet
transformation techniques”, International Journal on Artificial Intelligence Tools, Vol. 14, No.5,
pp 733–754.
[16] V.Veljkovic, I.Cosic, B.Dimitrijevic & D.Lalovic, (1985) “Is it possible to analyze DNA and
protein sequences by the methods of digital signal processing?”, IEEE Transactions on
Biomedical Engineering, Vol. BME-32, No. 5, pp 337-341.
[17] C.H.De Trad, Q.Fang & I.Cosic, (2002) “Protein sequence comparison based on the wavelet
transform approach”, Protein Engineering, Vol. 15, No.3, pp 193–203.
[18] National Center for Biotechnology Information, Available: http://www.ncbi.nlm.nih.gov/.
[19] J.Lazovic, (1996) “Selection of amino acid parameters for fourier transform-based analysis of
proteins”, CABIOS Communication, Vol. 12, No. 6, pp 553-562.
[20] V.Veljkovic & I.Slavic, (1972) “Simple General-Model Pseudopotential”, Physical Review
Letters, Vol. 29, No. 2, pp 105-107.
Authors
[1] Anu Sabarish.R graduated from Government Engineering
College, Thrissur, India in Electronics and Communication
Engineering (2005), completed his M.Tech in Digital
Electronics (2007) and is pursuing Ph.D in the area of Genomic
Signal Processing from Cochin University of Science and
Technology, India. His areas of interest include Genomic Signal
Processing and Time Frequency Analysis.
[2] Dr.Tessamma Thomas received her M.Tech. and Ph.D from
Cochin University of Science and Technology, Cochin-22,
India. At present she is working as Professor in the Department
of Electronics, Cochin University of Science and Technology.
She has to her credit more than 80 research papers, in various
research fields, published in International and National journals
and conferences. Her areas of interest include digital signal /
image processing, bio medical image processing, super
resolution, content based image retrieval, genomic signal
processing, etc.