HIDDEN MARKOV MODELS TO PREDICT PROTEIN SECONDARY STRUCTURE
Abhishek Dabral
gtg204v@mail.gatech.edu
MS Bioinformatics, School of Biology
Georgia Institute of Technology
December, 2004
Abstract
Proteins are the building blocks of life. The structure of a protein determines
its function. The structural information of a protein is embedded in its amino
acid sequence. Protein secondary structure prediction is an essential task in
determining the structure and function of proteins. This study addresses
the problem of protein secondary structure prediction by using a Hidden
Markov Model (HMM). A dependency model is built by considering
statistically significant amino acid correlation patterns at segment borders.
The problem of low accuracy in beta strand predictions in most of the present
methods is also addressed by considering significant correlations outside the
segments. The use of evolutionary data for improving the prediction
accuracy is also explored.
1. Introduction
Amino acids are the building blocks of proteins. Peptide bonds connect adjacent amino acids, which come in
twenty different types. It is well known that, for proteins, structure implies function. The structural
information about a protein is embedded in its amino acid sequence. The basic chemical composition
common to all 20 amino acids is shown in Figure 1. The central carbon atom, called Cα, forms four
covalent bonds, one each with NH3+ (amino group), COO− (carboxyl group), H (hydrogen), and R (side
chain). The first three are common to all amino acids; the side chain R is a chemical group that differs
for each of the 20 amino acids. Inspection of three-dimensional structures of proteins has revealed the
presence of repeating elements of regular structure, termed “secondary structure”. These regular
structures are stabilized by molecular interactions between atoms within the protein, the most important
being the hydrogen bond, formed between two electronegative atoms that share one H. There is a
convention on the nomenclature designating the common patterns of H-bonds that give rise to specific
structure elements, the Dictionary of Secondary Structures of Proteins (DSSP). DSSP annotations mark
each residue (amino acid) as belonging to one of seven types of secondary structure: H (α helix),
G (3-helix or 3₁₀ helix), I (5-helix or π helix), B (residue in isolated β bridge), E (β strand), T
(H-bond turn), S (bend), with “_” used where none of the above structures applies.
Typically, the seven secondary structure types are reduced into three groups: helix (includes types “H”,
α helix, and “G”, the 3₁₀ helix), strand (includes “E”, β ladder, and “B”, β bridge) and coil (all
other types). A protein which is color coded based on DSSP annotation is shown in Figure 2.
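The three-group reduction described above can be sketched in a few lines of Python. This is only an illustration: the function name `reduce_dssp` and the example string are hypothetical, and coil is written as `L` to match the state set {H, E, L} used later in the paper.

```python
# Reduction of DSSP codes to three states, following the grouping in the text:
# helix = {H, G}, strand = {E, B}, coil/loop = every other code.
DSSP_TO_3STATE = {"H": "H", "G": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp: str) -> str:
    """Map a per-residue DSSP string to the three-state alphabet H/E/L."""
    return "".join(DSSP_TO_3STATE.get(code, "L") for code in dssp)


# Hypothetical DSSP annotation, used only to exercise the mapping.
reduced = reduce_dssp("_HHHHGGGTT_EEEEBSS_II")
print(reduced)
```

Note that under this grouping the I (π helix) type falls into the coil class, since only H and G are counted as helix.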
Figure 1: Amino acids and peptide bond formation. The basic amino acid structure is shown in the dark green
box. Each amino acid consists of the C alpha carbon atom (yellow) that forms four covalent bonds, one each
with the amino group (blue), the carboxyl group (light green), a hydrogen atom, and a side chain R. In the
polymerization of amino acids, the carboxyl group of one amino acid (light green) reacts with the amino
group of the other amino acid (blue) under cleavage of water.
Figure 2: A protein that is color coded based on the annotation by the DSSP. The protein shows only the main
chain with the following color codes: H: α helix (red), G: 3₁₀ helix, E: extended strand in β ladder (yellow),
B: residue in isolated β bridge (orange), T: hydrogen bond turn (dark blue) and S: bend (light blue). Residues
not conforming to any of the types are shown in green. The protein is a catalytic subunit of cAMP-dependent
kinase (PDB ID 1BKX, available at http://www.rcsb.org/pdb).
(Image courtesy Ganapathiraju et al, Characterization of Protein Secondary Structure Prediction.)
As an intermediate step towards solving the grander problem of determining three-dimensional protein
structures, the prediction of secondary structural elements is more tractable but is in itself not yet a
fully solved problem. Protein secondary structure prediction from amino acid sequence dates back to
the early 1970s, when Chou and Fasman, and others, developed statistical methods to predict
secondary structure from primary sequence [2]. These early methods were based on the patterns of
occurrence of specific amino acids in the three secondary structure types: helix, strand, and coil.
Early attempts to predict secondary structure focused on the development of mappings from a
local window of residues in the sequence to the structural state of the central residue in the window,
and a large number of methods for estimating such mappings were developed. Earlier approaches
scored individual amino acids by frequency of occurrence in each structural state, combining them in
ways corresponding to conditional dependence models [2,5]. Methods considering correlations among
positions within the window improved the accuracy. Further improvements were demonstrated by the
inclusion of evolutionary information via multiple alignments of homologous sequences [7,9].
2. Method
In this work, the authors adopted a model-based approach, formulating secondary structure prediction
as a general Bayesian inference problem. The approach avoided many problems associated with
window-based predictions, such as the need for post-prediction filtering [4, 7]. The work was broadly
divided into three stages. First, a statistical analysis was performed to explore the most
informative correlations for different secondary structures. Then, a semi-Markov HMM was chosen,
similar to the model developed in [10]. Correlations at terminal positions of structural
segments and dependencies on forward residues within the segments were specifically considered.
Finally, an iterative estimation of the HMM parameters was implemented.
The starting point was to choose a representation of sequence/structure relationships in proteins based
on secondary structure segments. The model was parameterized by representing the segment position
and structural types. Segment location was denoted by the last residue of the segment. Because the
segments are required to be contiguous, this parameterization uniquely identified a set of segment
locations for a given sequence.
Let $R = (R_1, R_2, \ldots, R_n)$ be a sequence of n amino acid residues,
$S = \{ i : \mathrm{Struct}(R_i) \neq \mathrm{Struct}(R_{i+1}) \}$ be
a sequence of m positions denoting the end of each individual structural segment (so that $S_m = n$), and
$T = (T_1, T_2, \ldots, T_m)$ be the sequence of secondary structural types for each respective segment (see
Figure 3). Together m, S and T completely determine a secondary structure assignment for a given amino
acid sequence, where m denotes the total number of segments, S represents the segment end positions and T
represents the structural state of each segment. In the case of secondary structure prediction, the
quantities of interest are thus the values of m, $S = (S_1, S_2, \ldots, S_m)$ and $T = (T_1, T_2, \ldots, T_m)$
corresponding to the known amino acid sequence $R = (R_1, R_2, \ldots, R_n)$, i.e., the locations and types of
the secondary structural segments. The problem is to infer the values of (m, S, T) given a residue
sequence R. A Bayesian approach to the assignment of these parameter values is taken, by defining a joint
probability distribution $P(R, m, S, T)$ for an amino acid sequence and its secondary structure assignment.
The conditional or posterior probability distribution over structural assignments given a new sequence,
$P(m, S, T \mid R)$, is then calculated via Bayesian inference. Prediction then involves finding those secondary
structure assignments (m, S, T) which maximize this posterior distribution.
Figure 3: Representation of the secondary structure of a protein sequence in terms of structural segments. The parameters
shown represent the segment types T = (L,E,L,E,L,H,L, . . .) and endpoints S = (4,9,11,15,18,25, . . .). The associated structure
assignment is LLLLEEEEELLEEEELLLHHHHHHHLLL . . . (Figure courtesy Schmidler et al [10]).
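The segment parameterization (m, S, T) can be illustrated with a short Python sketch that converts between a per-residue assignment string and segment endpoints. The function names `to_segments` and `from_segments` are hypothetical; the input string is the truncated example from Figure 3, so the final loop segment here ends where the printed string stops.

```python
from itertools import groupby


def to_segments(assignment: str):
    """Per-residue string -> (m, S, T), with 1-based segment end positions."""
    ends, types, pos = [], [], 0
    for state, run in groupby(assignment):
        pos += len(list(run))      # last residue index of this segment
        ends.append(pos)
        types.append(state)
    return len(ends), ends, types


def from_segments(ends, types) -> str:
    """(S, T) -> per-residue string; segments are contiguous by construction."""
    out, start = [], 0
    for end, state in zip(ends, types):
        out.append(state * (end - start))
        start = end
    return "".join(out)


assignment = "LLLLEEEEELLEEEELLLHHHHHHHLLL"
m, S, T = to_segments(assignment)
print(m, S, T)
```

Because segments are contiguous and each is identified by its last residue, the pair (S, T) is enough to rebuild the full assignment, which is why the parameterization is unique.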
CORRELATION ANALYSIS
Correlation analysis begins with a statistical analysis to explore the dependency structure. A χ²
(chi-square) test is used to identify the most informative correlations between amino acid pairs in different
types of secondary structure segments and positions. The χ² test compares the observed joint distribution of
an amino acid pair with the product of the marginal distributions, which is the distribution expected if the
two positions were independent. In other words, χ² measures the size of the difference between the
observed and expected frequencies of the data. More
specifically, the difference between the observed and expected frequency is calculated, that difference
is squared, and the result is divided by the expected frequency.
The formula for χ² can be expressed as:
\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]
where O is the observed frequency and E is the expected frequency.
Squaring the difference ensures a non-negative number, so that positive and negative differences do not
cancel; without squaring, the differences across an entire
table would always add up to 0. Dividing the squared difference by the expected frequency
normalizes each term, so that the remaining measures of observed/expected
difference are comparable across all data values (cells in Table 1). Using χ², the correlation between
amino acid pairs at various separation distances was considered, and the positions which were highly
correlated were found for the corresponding secondary structure, alpha helix or beta strand. Position
specific correlation is then calculated for terminal positions. This is done in order to find capping regions
in alpha helices, which typically show hydrogen bonding patterns and side chain interactions that are
different from internal positions. The data set consists of 8100 proteins and their secondary structures collected
from the Protein Data Bank (PDB). Table 1 below shows the results of the χ² test for the three secondary
structure types. The correlation is found by using a function built in MATLAB. The analysis shows
that in α-helix segments, a residue at position i is correlated with residues at positions i-2, i-3 and
i-4, where i denotes the position of the amino acid within a segment. Similarly, a β strand residue has its
highest correlations with residues at positions i-1 and i-2, and a loop residue has its most significant
correlations with those at positions i-1, i-2 and i-3.
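The χ² statistic for amino acid pairs at a given separation can be sketched as follows. This is a minimal pure-Python illustration, not the MATLAB function used in the study; the helper names `pairs_at_distance` and `chi_square` and the toy inputs are hypothetical.

```python
from collections import Counter


def pairs_at_distance(segment: str, k: int):
    """Pairs (residue at i, residue at i-k) within one structural segment."""
    return [(segment[i], segment[i - k]) for i in range(k, len(segment))]


def chi_square(pairs) -> float:
    """Sum of (O - E)^2 / E over all cells of the pair contingency table.

    E is the count expected under independence: row total * column total / N.
    """
    pairs = list(pairs)
    n = len(pairs)
    observed = Counter(pairs)
    row = Counter(a for a, _ in pairs)   # marginal counts at position i
    col = Counter(b for _, b in pairs)   # marginal counts at position i-k
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


# Toy tables: independent positions give 0, perfectly coupled positions give N.
independent = chi_square([("A", "A"), ("A", "L"), ("L", "A"), ("L", "L")])
coupled = chi_square([("A", "A"), ("A", "A"), ("L", "L"), ("L", "L")])
print(independent, coupled)
```

In practice the pairs would be pooled over all segments of one structural type before computing the statistic, and the result compared against a χ² critical value for the table's degrees of freedom.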
Table 1: Correlations of amino acids
α-helices are characterized by capping boxes where the hydrogen bonding patterns and side chain
interactions are different from the internal positions. For this reason, position specific correlations have to
be considered. Table 2 gives the correlation analysis for the terminal positions in α-helical segments. The
results show that there are statistically significant correlations between residues in terminal positions and
the residues that are outside the segment. Another observation is that there exist significant correlations
with the forward residues. Also, the degree of correlation for the forward residues might be different from
that for the backward residues, which indicates an asymmetric dependency behavior for forward and backward
residues. Internal positions also show a similar correlation pattern.
Table 2: Position specific Correlations in Helix Terminal Positions
3. The Model
A secondary structure of a protein is defined by a vector given by (m, S, T), where m denotes the total
number of segments, S represents the segment end positions and T represents the structural state of each
segment. In an HMM there are a finite number of distinct states. In the model built, the hidden states are
the structural states {H, E, L}. Each state generates an observation in the form of an amino acid segment.
Starting from the initial state, transitions occur from one state to another, following a transition
probability distribution. Each state generates an amino acid segment according to the observation
frequency distribution. The state prediction can be restated as a posterior maximization problem. That
is, given the observation sequence of amino acids, denoted by R, find the vector (m, S, T) with maximum
posterior probability $P(m, S, T \mid R)$. The posterior probability can be expressed as:
\[ P(m, S, T \mid R) = \frac{P(R \mid m, S, T)\, P(m, S, T)}{P(R)} \]
where $P(R \mid m, S, T)$ denotes the sequence likelihood and $P(m, S, T)$ denotes the a priori distribution.
The a priori distribution $P(m, S, T)$ is modeled as:
\[ P(m, S, T) = P(m) \prod_{j=1}^{m} P(T_j \mid T_{j-1})\, P(S_j \mid S_{j-1}, T_j) \]
where $P(m)$ is the probability of observing m secondary structure segments, and it is assumed to be
independent of the other state variables. $P(T_j \mid T_{j-1})$ represents the state transition probability (among
different secondary structure types) and $P(S_j \mid S_{j-1}, T_j)$ models the length distribution of secondary
structure segments under the following assumption:
\[ P(S_j \mid S_{j-1}, T_j) = P(S_j - S_{j-1} \mid T_j). \]
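The prior factorization above can be evaluated in log space with a short Python sketch. Everything numeric here is a toy assumption made for illustration: the transition table `trans`, the uniform length distribution `length_pmf`, the artificial `START` state standing in for $T_0$, and the flat `log_p_m` are hypothetical, not estimates from the paper.

```python
import math


def log_prior(S, T, log_p_m, trans, length_pmf, start="START"):
    """log P(m, S, T) = log P(m)
                        + sum_j [log P(T_j | T_{j-1}) + log P(S_j - S_{j-1} | T_j)]."""
    total = log_p_m(len(S))
    prev_end, prev_type = 0, start          # S_0 = 0, T_0 = artificial start state
    for end, seg_type in zip(S, T):
        total += math.log(trans[prev_type][seg_type])
        total += math.log(length_pmf[seg_type][end - prev_end])
        prev_end, prev_type = end, seg_type
    return total


# Hypothetical parameters (assumptions for illustration only).
trans = {
    "START": {"H": 0.3, "E": 0.2, "L": 0.5},
    "H": {"E": 0.1, "L": 0.9},
    "E": {"H": 0.1, "L": 0.9},
    "L": {"H": 0.5, "E": 0.5},
}
length_pmf = {t: {d: 0.1 for d in range(1, 11)} for t in "HEL"}  # uniform on 1..10


def flat_log_p_m(m):
    return 0.0                               # P(m) treated as constant here


lp = log_prior([4, 9, 11], ["L", "E", "L"], flat_log_p_m, trans, length_pmf)
print(math.exp(lp))
```

Working with the segment length $S_j - S_{j-1}$ rather than the absolute end position is what lets a single length table per structural type be reused at every position in the sequence.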
The likelihood term $P(R \mid m, S, T)$ is modeled as:
\[ P(R \mid m, S, T) = \prod_{j=1}^{m} P(R_{[S_{j-1}+1 : S_j]} \mid S, T)
                     = \prod_{j=1}^{m} P(R_{[S_{j-1}+1 : S_j]} \mid S_{j-1}, S_j, T_j) \]
It is important to note here that the segment likelihood terms are assumed to be independent. Also,
$R_{[p:q]}$ denotes the sequence of residues with indexes from p to q. $P(R_{[S_{j-1}+1 : S_j]} \mid S, T)$
represents the probability of observing a particular amino acid segment given all state variables. It is equal to
$P(R_{[S_{j-1}+1 : S_j]} \mid S_{j-1}, S_j, T_j)$
because in an HMM the symbol observation probability depends only on its generator state.
Although the observation probabilities of amino acid segments in different secondary structure states are assumed to
be independent, the amino acids within a segment are allowed to depend on neighboring residues. A
dependency model is created for $P(R_{[S_{j-1}+1 : S_j]} \mid S, T)$; for an α-helix segment it has the form:
\[ P(R_{[S_{j-1}+1 : S_j]} \mid S, T) = P(R_{[S_{j-1}+1 : S_j]} \mid S_{j-1}, S_j, T_j = H) \]
= (product over N-terminal positions) × (product over internal positions) × (product over C-terminal positions)
Here the first product term represents the observation probability of amino acids at the N-terminal positions
of length l for α helices. The second term represents the observation probability at the internal positions and
the third product represents the observation probability at the C-terminal residues of length l for α helices.
As the number of sequences in the PDB is not sufficient to reliably estimate the conditional probabilities,
the number of dependency parameters is reduced by grouping the amino acids into three hydrophobicity classes
denoted by $h_i \in \{$hydrophobic, hydrophilic, neutral$\}$. The statistical analysis done using the χ² test finds the
dependency patterns shown in Table 3.
Figure 4: A graphical model (Whittaker, 1990) representing the conditional independence structure for the amino
acids in an example α-helix segment. Ri are the amino acids of the α-helix and Hi are their associated hydrophobicity
classes as assigned by (Klingler and Brutlag, 1994). The model provides for dependence among the hydrophobicity
classes at appropriate periodicity, allowing the amino acid distributions to be modeled as conditionally independent,
thus reducing the dimensionality of the model.
Position   Helix                                        Strand                                       Loop
N1         R_i | h_{i-1}, h_{i+2}                       R_i | h_{i-1}, h_{i-2}                       R_i | h_{i-1}, h_{i-2}
N2         R_i | h_{i-2}, h_{i+1}                       -                                            R_i | h_{i-1}, h_{i-2}
C1         R_i | h_{i-2}, h_{i-4}                       R_i | h_{i-2}, h_{i-3}                       R_i | h_{i-1}, h_{i-3}
C2         R_i | h_{i-2}, h_{i-4}                       -                                            R_i | h_{i-1}, h_{i-3}
Int        R_i | h_{i-2}, h_{i-3}, h_{i-4}, h_{i+2}     R_i | h_{i-1}, h_{i-2}, h_{i+1}, h_{i+2}     R_i | h_{i-1}, h_{i-2}
Table 3: Dependencies with segments
Given an amino acid sequence R, the vector (m, S, T) that maximizes the posterior probability
$P(m, S, T \mid R)$ is taken as the predicted secondary structure. A forward-backward algorithm
generalized for the semi-Markov HMM is used to determine the posterior probability. After prediction of secondary
structure, proteins which have a close secondary structure are used to re-adjust the HMM parameters
iteratively. This is done by removing from the training set those predicted sequences which do not have a
close secondary structure and then re-estimating the HMM parameters.
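The forward half of a forward-backward pass for a semi-Markov HMM can be sketched as below. This is a simplified illustration rather than the model of the paper: residues within a segment are treated as independent instead of following the dependency model, $P(m)$ is treated as flat and omitted, and the three-letter alphabet and all parameter values are hypothetical.

```python
import math

STATES = "HEL"


def segment_likelihood(residues, state, emit):
    """P(residues | state), assuming independent positions (a simplification)."""
    return math.prod(emit[state][r] for r in residues)


def semi_markov_forward(seq, start, trans, length_pmf, emit, max_len):
    """Sum the joint probability of seq over all segmentations (S, T)."""
    n = len(seq)
    # alpha[t][s]: probability of seq[:t] with a segment of type s ending at t
    alpha = [{s: 0.0 for s in STATES} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for s in STATES:
            total = 0.0
            for d in range(1, min(max_len, t) + 1):   # d = length of last segment
                seg = segment_likelihood(seq[t - d:t], s, emit)
                seg *= length_pmf[s].get(d, 0.0)
                if t == d:                             # segment starts the sequence
                    enter = start[s]
                else:                                  # arrive from a previous segment
                    enter = sum(alpha[t - d][p] * trans[p].get(s, 0.0) for p in STATES)
                total += enter * seg
            alpha[t][s] = total
    return sum(alpha[n].values())


# Hypothetical toy parameters over a three-letter alphabet (assumptions only).
start = {"H": 0.3, "E": 0.2, "L": 0.5}
trans = {"H": {"E": 0.2, "L": 0.8}, "E": {"H": 0.2, "L": 0.8}, "L": {"H": 0.5, "E": 0.5}}
length_pmf = {s: {1: 0.5, 2: 0.3, 3: 0.2} for s in STATES}
emit = {
    "H": {"A": 0.6, "G": 0.1, "V": 0.3},
    "E": {"A": 0.2, "G": 0.1, "V": 0.7},
    "L": {"A": 0.2, "G": 0.6, "V": 0.2},
}
likelihood = semi_markov_forward("AGVA", start, trans, length_pmf, emit, max_len=3)
print(likelihood)
```

The extra loop over the segment length d is what distinguishes this recursion from the ordinary HMM forward algorithm. Replacing the inner sums with maxima and keeping back-pointers would give the corresponding maximum a posteriori segmentation.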
4. Results
The accuracy of secondary structure prediction is measured using the $Q_3$ test, where $Q_3$ is given as
\[ Q_3 = \frac{\text{number of correctly predicted residues}}{\text{total number of residues}} \]
The data set used is the latest version of PDB (Protein Data Bank [11]) after filtering out the sequences
which have fewer than 50 or more than 900 residues (as suggested by Schmidler et al). The minimum β strand
length is restricted to 3 and the minimum α helix length to 5 [4]. There is a total 1.5% increase in the overall
3-state prediction accuracy as compared to the Bayesian method used by Schmidler [10]. The dependency
model used increases the α helix and β strand accuracy.
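The $Q_3$ measure is straightforward to compute from two per-residue strings. The following Python sketch uses a hypothetical function name `q3` and a made-up pair of ten-residue assignments purely for illustration.

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H/E/L) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("assignments must have the same length")
    if not observed:
        raise ValueError("assignments must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


# Made-up example: 8 of 10 residues agree.
score = q3("LLHHHHEELL", "LLHHHLEEEL")
print(score)
```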
5. Generalization of the model and possible improvements
Significant evidence exists that inclusion of multiple sequence alignment information, when available, can
improve single sequence prediction methods by as much as 5–7% [3,7,8,9]. I used the NCBI
(www.ncbi.nlm.nih.gov) resource for obtaining the amino acid sequences and the CLUSTALW tool
(www.ebi.ac.uk/clustalw/) to build multiple sequence alignments. The multiple alignment was then used as
a test set to train the models. The results are presented in the next section.
Both works considered here [1,10] concern themselves only with the 3-state problem, where the structural
states are {H, E, L}, which may induce a model error. The model can be generalized by considering more states
a protein can fold into, such as coiled coils, hairpins etc. [7]. However, this takes a lot of computational time, as
the complexity of the dependency model increases with the number of states, and the task is far from trivial.
This generalization is an area for future work and extension of the model to make it more complete.
6. Extensions and my contribution
As discussed in the previous section, I used the multiple alignment of sequences as a test set. The aligned
sequences have a high degree of similarity and are homologous (evolutionarily related). Homology
modeling is based on the notion that new proteins evolve gradually from existing ones by amino acid
substitution, addition, and/or deletion, and that the 3D structures and functions are often strongly conserved
during this process. Many proteins thus share similar functions and structures, and there are usually strong
sequence similarities among structurally similar proteins. Strong sequence similarity often indicates
strong structure similarity, although the opposite is not necessarily true. Homology modeling tries to identify
structures similar to the target protein through sequence comparison. It is important to explain here why
aligned sequences are a good choice for test data, and the reason they can subsequently improve the
prediction accuracy. Theodosius Dobzhansky (1900-1975) famously said, “Nothing in
Biology makes sense except in the light of evolution”. Nature has a tremendous power of propagating
features which are beneficial for a species to develop and survive, and discarding those which are not.
Thus, the proteins which are of vital importance for the functioning of a species are propagated largely
unchanged from generation to generation, while deleterious mutations are not propagated because their
carriers die off. To
conserve the function of a protein, its structure has to be conserved, as the two are closely
related. Alignments of multiple sequences show that the residues which play a critical role in the function of
proteins are conserved, and these residues have similar secondary structure. I therefore used a multiple alignment
as a test set: the multiple aligned sequence 1fxia.msf, available at (www.sanger.ac.uk), with the
CLUSTALX program used for adjusting the alignment (Figure 5). This alignment is then used as an input set for
training the HMM. The training is done using the Baum-Welch algorithm, or the forward-backward algorithm
(see Appendix). The algorithm used prints the log likelihood at each iteration along with the transition matrix
and initial probabilities. The Bioinformatics Toolbox in MATLAB is used for building the profile and showing the
log-odds score and the symbol emissions for the match and insert states. Some of the functions used were hmmprofstruct,
hmmprofmerge, hmmprofestimate, hmmprofgenerate and hmmprofalign. The log-odds best path is shown in
Figure 6.
Figure 5: CLUSTALX multiple sequence alignment for 1fxia.msf
Figure 6: Log-odds best path
Results using the alignment:
Predicted secondary structure composition for the protein came as:
sec str type H E L
% in protein 25.71 25.71 48.57
Residue composition for the protein is as follows:
%A: 2.9 %C: 0.0 %D: 17.1 %E: 11.4 %F: 0.0
%G: 5.7 %H: 0.0 %I: 5.7 %K: 5.7 %L: 11.4
%M: 0.0 %N: 2.9 %P: 8.6 %Q: 0.0 %R: 0.0
%S: 0.0 %T: 8.6 %V: 11.4 %W: 0.0 %Y: 8.6
To determine the accuracy of the prediction, I used the $Q_3$ test as done in Section 4 above.
I used the PDB (Protein Data Bank) to look up the known, empirically determined structures of the
protein. I then compared the predicted secondary structure with the known structures to get the
number of correctly predicted residues. The ratio of correctly predicted residues to the total number of
residues ($Q_3$) gave the prediction accuracy. $Q_3$ came out to be a value near 0.83, which implies a
prediction accuracy of 83%, a significant improvement over the previous methods.
7. Conclusion:
This paper discusses an approach to the prediction of protein secondary structure from sequence using
probabilistic models for protein structural segments and an algorithm for prediction based on Hidden
Markov Models (HMM). Extension of this approach to the use of multiple aligned sequences has shown that
accuracies improve when evolutionary information is taken into account. Extensions to the model
using more secondary structure elements are also discussed.
8. Acknowledgments:
I would like to thank Zafer Aydin and Dr Mark Borodovsky for their cooperation and assistance in
helping me understand the concepts of their paper and their importance. I would also like to
acknowledge Dr Mason Porter for his guidance throughout the project.
Appendix:
1. Baum-Welch Algorithm
The Baum-Welch algorithm, also called the Forward-Backward algorithm, can be derived using simple
“occurrence counting” arguments or using calculus to maximize the auxiliary quantity. A special feature of
the algorithm is its guaranteed convergence to a local maximum of the likelihood. For more discussion on
Baum-Welch see:
(http://jedlik.phy.bme.hu/~gerjanos/HMM/node11.html).
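As a rough illustration of the occurrence-counting view, the following Python sketch performs Baum-Welch re-estimation for an ordinary discrete HMM on a single observation sequence, printing the log likelihood at each iteration. It is not the semi-Markov model of the paper nor the MATLAB toolbox code; the function names, the two-state toy parameters and the observation sequence are all hypothetical.

```python
import math


def forward_backward(obs, pi, A, B):
    """Scaled forward-backward pass; returns (alpha, beta, scales, log likelihood)."""
    n, k = len(obs), len(pi)
    alpha = [[0.0] * k for _ in range(n)]
    beta = [[1.0] * k for _ in range(n)]
    scales = [0.0] * n
    for t in range(n):
        for j in range(k):
            prev = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(k))
            alpha[t][j] = prev * B[j][obs[t]]
        scales[t] = sum(alpha[t])                      # scaling avoids underflow
        alpha[t] = [a / scales[t] for a in alpha[t]]
    for t in range(n - 2, -1, -1):
        for i in range(k):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(k)) / scales[t + 1]
    return alpha, beta, scales, sum(math.log(c) for c in scales)


def baum_welch_step(obs, pi, A, B):
    """One re-estimation step (needs len(obs) >= 2); returns the updated
    (pi, A, B) and the log likelihood under the parameters before the update."""
    n, k, m = len(obs), len(pi), len(B[0])
    alpha, beta, scales, loglik = forward_backward(obs, pi, A, B)
    # gamma[t][i]: posterior probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] for i in range(k)] for t in range(n)]
    new_A = [[0.0] * k for _ in range(k)]
    for t in range(n - 1):                             # expected transition counts
        for i in range(k):
            for j in range(k):
                new_A[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                * beta[t + 1][j] / scales[t + 1])
    new_A = [[x / sum(row) for x in row] for row in new_A]
    new_B = [[0.0] * m for _ in range(k)]
    for t in range(n):                                 # expected emission counts
        for i in range(k):
            new_B[i][obs[t]] += gamma[t][i]
    new_B = [[x / sum(row) for x in row] for row in new_B]
    return gamma[0][:], new_A, new_B, loglik


# Hypothetical toy problem: two hidden states, binary observations.
obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1]
pi, A, B = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.8, 0.2], [0.3, 0.7]]
history = []
for iteration in range(10):
    pi, A, B, loglik = baum_welch_step(obs, pi, A, B)
    history.append(loglik)
    print(iteration, loglik)
```

Each step replaces the parameters with normalized expected counts, and the printed log likelihood should never decrease from one iteration to the next, which is the convergence property mentioned above.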
References:
[1] Aydin, Z., Altunbasak, Y. & Borodovsky, M. (2004). Protein secondary structure prediction with semi-
Markov HMM. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Montreal, CA, May 2004, in
press.
[2] Chou, P. Y. & Fasman, G. D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.
[3] Di Francesco, J. Garnier, and P. J. Munson. Improving protein secondary structure prediction with aligned
homologous sequences.
[4] Frishman, D. & Argos, P. (1996). Incorporation of non-local interactions in protein secondary structure
prediction from amino acid sequence. Protein Engineering 9(2), 133-142.
[5] Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis and implications of simple methods for
predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97-120.
[6] Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition.
[7] Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70 percent accuracy. J.
Mol. Biol. 232, 584-599.
[8] Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction.
J. Mol. Biol. 235, 13-26.
[9] Salamov, A. A. & Solovyev, V. V. (1995). Prediction of protein secondary structure by combining nearest-
neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247, 11-15.
[10] Schmidler, S. C., Liu, J. S. & Brutlag, D. L. Bayesian Segmentation of Protein Secondary
Structure.
[11] EVA, “List of Sequence-unique pdb files”.

Natural disasters and its managment
 
Introduction to probability distributions-Statistics and probability analysis
Introduction to probability distributions-Statistics and probability analysis Introduction to probability distributions-Statistics and probability analysis
Introduction to probability distributions-Statistics and probability analysis
 
Liposomes-Classification, methods of preparation and application
Liposomes-Classification, methods of preparation and application Liposomes-Classification, methods of preparation and application
Liposomes-Classification, methods of preparation and application
 
Protein structure determination
Protein structure determinationProtein structure determination
Protein structure determination
 
Positive Attitude & Goal Setting
Positive Attitude & Goal SettingPositive Attitude & Goal Setting
Positive Attitude & Goal Setting
 
Determination of protein structure by Dr. Anurag Yadav
Determination of protein structure by Dr. Anurag YadavDetermination of protein structure by Dr. Anurag Yadav
Determination of protein structure by Dr. Anurag Yadav
 
Enzyme assays
Enzyme assaysEnzyme assays
Enzyme assays
 
Gene prediction methods vijay
Gene prediction methods  vijayGene prediction methods  vijay
Gene prediction methods vijay
 
The mechanism of protein folding
The mechanism of protein foldingThe mechanism of protein folding
The mechanism of protein folding
 

Similar to Protein Secondary Structure Prediction using HMM

AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONAMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONcscpconf
 
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISHMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISijcseit
 
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...sipij
 
Gutell 061.nar.1997.25.01559
Gutell 061.nar.1997.25.01559Gutell 061.nar.1997.25.01559
Gutell 061.nar.1997.25.01559Robin Gutell
 
ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...
ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...
ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...ijbbjournal
 
Gutell 075.jmb.2001.310.0735
Gutell 075.jmb.2001.310.0735Gutell 075.jmb.2001.310.0735
Gutell 075.jmb.2001.310.0735Robin Gutell
 
Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011Robin Gutell
 
ANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfDrGRevathy
 
CVOS2015IIAV4q
CVOS2015IIAV4qCVOS2015IIAV4q
CVOS2015IIAV4qThu Nguyen
 
IJBB-51-3-188-200
IJBB-51-3-188-200IJBB-51-3-188-200
IJBB-51-3-188-200sankar basu
 
Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Robin Gutell
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Austin Journal of Computational Biology and Bioinformatics
Austin Journal of Computational Biology and BioinformaticsAustin Journal of Computational Biology and Bioinformatics
Austin Journal of Computational Biology and BioinformaticsAustin Publishing Group
 
Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Robin Gutell
 
Comparative Structural Crystallography and Molecular Interaction Analysis of ...
Comparative Structural Crystallography and Molecular Interaction Analysis of ...Comparative Structural Crystallography and Molecular Interaction Analysis of ...
Comparative Structural Crystallography and Molecular Interaction Analysis of ...iosrjce
 
Bioinformatics2015.pdf
Bioinformatics2015.pdfBioinformatics2015.pdf
Bioinformatics2015.pdfAbdetaImi
 

Similar to Protein Secondary Structure Prediction using HMM (20)

AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONAMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATION
 
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISHMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
 
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
 
Gutell 061.nar.1997.25.01559
Gutell 061.nar.1997.25.01559Gutell 061.nar.1997.25.01559
Gutell 061.nar.1997.25.01559
 
ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...
ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...
ENHANCED POPULATION BASED ANT COLONY FOR THE 3D HYDROPHOBIC POLAR PROTEIN STR...
 
Gutell 075.jmb.2001.310.0735
Gutell 075.jmb.2001.310.0735Gutell 075.jmb.2001.310.0735
Gutell 075.jmb.2001.310.0735
 
Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011Gutell 116.rpass.bibm11.pp618-622.2011
Gutell 116.rpass.bibm11.pp618-622.2011
 
Structure alignment methods
Structure alignment methodsStructure alignment methods
Structure alignment methods
 
ANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdfANTIC-2021_paper_95.pdf
ANTIC-2021_paper_95.pdf
 
D1803032632
D1803032632D1803032632
D1803032632
 
CSUPERB2014
CSUPERB2014CSUPERB2014
CSUPERB2014
 
CVOS2015IIAV4q
CVOS2015IIAV4qCVOS2015IIAV4q
CVOS2015IIAV4q
 
IJBB-51-3-188-200
IJBB-51-3-188-200IJBB-51-3-188-200
IJBB-51-3-188-200
 
Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769Gutell 108.jmb.2009.391.769
Gutell 108.jmb.2009.391.769
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
D0261020030
D0261020030D0261020030
D0261020030
 
Austin Journal of Computational Biology and Bioinformatics
Austin Journal of Computational Biology and BioinformaticsAustin Journal of Computational Biology and Bioinformatics
Austin Journal of Computational Biology and Bioinformatics
 
Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105Gutell 090.bmc.bioinformatics.2004.5.105
Gutell 090.bmc.bioinformatics.2004.5.105
 
Comparative Structural Crystallography and Molecular Interaction Analysis of ...
Comparative Structural Crystallography and Molecular Interaction Analysis of ...Comparative Structural Crystallography and Molecular Interaction Analysis of ...
Comparative Structural Crystallography and Molecular Interaction Analysis of ...
 
Bioinformatics2015.pdf
Bioinformatics2015.pdfBioinformatics2015.pdf
Bioinformatics2015.pdf
 

Protein Secondary Structure Prediction using HMM

HIDDEN MARKOV MODELS TO PREDICT PROTEIN SECONDARY STRUCTURE
Abhishek Dabral
gtg204v@mail.gatech.edu
MS Bioinformatics, School of Biology
Georgia Institute of Technology
December, 2004

Abstract
Proteins are the building blocks of life. The structure of a protein determines its function, and this structural information is embedded in its amino acid sequence. Protein secondary structure prediction is an essential task in determining the structure and function of proteins. This study addresses the problem of protein secondary structure prediction by using a Hidden Markov Model (HMM). A dependency model is built by considering statistically significant amino acid correlation patterns at segment borders. The problem of low accuracy in beta strand predictions in most of the present methods is also addressed by considering significant correlations outside the segments. The use of evolutionary data for improving the prediction accuracy is also explored.

1. Introduction
Amino acids are the building blocks of proteins. Peptide bonds connect the adjacent amino acids of twenty different types. It is a well known fact that for proteins, structure implies function. The structural information about a protein is embedded in its amino acid sequence. The basic chemical composition common to all 20 amino acids is shown in Figure 1. The central carbon atom, called Cα, forms four covalent bonds, one each with NH3+ (amino group), COO- (carboxyl group), H (hydrogen), and R (side chain). The first three are common to all amino acids; the side chain R is a chemical group that differs for each of the 20 amino acids. Inspection of three-dimensional structures of proteins has revealed the presence of repeating elements of regular structure, termed "secondary structure". These regular structures are stabilized by molecular interactions between atoms within the protein, the most important being the hydrogen bond, formed between two electronegative atoms that share one H.
There is a convention on the nomenclature designating the common patterns of H-bonds that give rise to specific structure elements, the Dictionary of Protein Secondary Structure (DSSP). DSSP annotations mark each residue (amino acid) as belonging to one of seven types of secondary structure: H (alpha helix), G (3-helix or 3-10 helix), I (5-helix or π helix), B (residue in isolated β bridge), E (β strand), T (H-bond turn), and S (bend), with "_" used where none of these structures is applicable. Typically, the seven secondary structure types are reduced to three groups: helix (types H, the alpha helix, and G, the 3-10 helix), strand (types E, the beta ladder, and B, the beta bridge) and coil (all other types). A protein which is color coded based on DSSP annotation is shown in Figure 2.
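The three-group reduction described above can be sketched in a few lines of Python. This is only an illustration: the function name `reduce_dssp` is hypothetical, and the output letter "L" for the coil group is an assumption borrowed from the {H, E, L} state labels used later in this paper.

```python
# Three-state reduction of DSSP codes, following the grouping in the text:
# helix = {H, G}, strand = {E, B}, coil = everything else.
HELIX_CODES = {"H", "G"}
STRAND_CODES = {"E", "B"}


def reduce_dssp(dssp: str) -> str:
    """Map a per-residue DSSP string to the three states H, E, L.

    The letter "L" for coil is an assumed label, not part of DSSP itself.
    """
    reduced = []
    for code in dssp:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in STRAND_CODES:
            reduced.append("E")
        else:
            # I, T, S, "_" and any unknown symbol fall into the coil group.
            reduced.append("L")
    return "".join(reduced)


if __name__ == "__main__":
    print(reduce_dssp("__HHHHGGGTTEEEEBS_"))
```

Note that under this grouping the π helix (I) is counted as coil, exactly as the text states; other reduction schemes exist and would change only the two sets at the top.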
Figure 1: Amino acids and peptide bond formation. The basic amino acid structure is shown in the dark green box. Each amino acid consists of the C alpha carbon atom (yellow) that forms four covalent bonds, one each with the amino group (blue), the carboxyl group (light green), a hydrogen atom, and a side chain R. In the polymerization of amino acids, the carboxyl group of one amino acid (light green) reacts with the amino group of the other amino acid (blue) under cleavage of water.

Figure 2: A protein that is color coded based on the annotation by the DSSP. The protein shows only the main chain with the following color codes: H: α helix (red), G: 3-10 helix, E: extended strand in β ladder (yellow), B: residue in isolated β bridge (orange), T: hydrogen bond turn (dark blue) and S: bend (light blue). Residues not conforming to any of the types are shown green. The protein is a catalytic subunit of cAMP-dependent kinase (PDB ID 1BKX, available at http://www.rcsb.org/pdb). (Image courtesy Ganapathiraju et al, Characterization of Protein Secondary Structure Prediction.)

As an intermediate step towards solving the grander problem of determining three-dimensional protein structures, the prediction of secondary structural elements is more tractable but is in itself not yet a fully solved problem. Protein secondary structure prediction from amino acid sequence dates back to the early 1970s, when Chou and Fasman [2], and others, developed statistical methods to predict secondary structure from primary sequence. These early methods were based on the patterns of occurrence of specific amino acids in the three secondary structure types: helix, strand, and coil. Early attempts to predict secondary structure focused on mappings from a local window of residues in the sequence to the structural state of the central residue in the window, and a large number of methods estimating such mappings were developed.
Earlier approaches scored individual amino acids by frequency of occurrence in each structural state, combining them in ways corresponding to conditional dependence models [2,5]. Methods considering correlations among positions within the window improved the accuracy. Further improvements were demonstrated by the inclusion of evolutionary information via multiple alignments of homologous sequences [7,9].

2. Method
In this work, the authors adopted a model based approach, formulating secondary structure prediction as a general Bayesian inference problem. The approach avoids many problems associated with window based predictions, such as the need for post-prediction filtering [4,7]. The work was broadly divided into three stages. First, statistical analysis was performed to explore the most informative correlations for different secondary structures. Then a semi-Markov HMM was chosen, similar to the model developed in [10]. Correlations at terminal positions of structural segments and dependencies on forward residues within the segments were specifically considered. Finally, an iterative estimation of the HMM parameters was implemented.
The starting point was to choose a representation of sequence/structure relationships in proteins based on secondary structure segments. The model was parameterized by the segment positions and structural types. Segment location was denoted by the last residue of the segment. Because the segments are required to be contiguous, this parameterization uniquely identifies a set of segment locations for a given sequence.

Let $R = (R_1, R_2, \ldots, R_n)$ be a sequence of $n$ amino acid residues, $S = \{ i : \mathrm{Struct}(R_i) \neq \mathrm{Struct}(R_{i+1}) \}$ be a sequence of $m$ positions denoting the end of each individual structural segment (so that $S_m = n$), and $T = (T_1, T_2, \ldots, T_m)$ be the sequence of secondary structural types for each respective segment (see Figure 3). Together $m$, $S$ and $T$ completely determine a secondary structure assignment for a given amino acid sequence, where $m$ denotes the total number of segments, $S$ the segment end positions and $T$ the structural state of each segment.

In secondary structure prediction, the quantities of interest are thus the values of $m$, $S = (S_1, S_2, \ldots, S_m)$ and $T = (T_1, T_2, \ldots, T_m)$ corresponding to the known amino acid sequence $R = (R_1, R_2, \ldots, R_n)$, i.e., the locations and types of the secondary structural segments. The problem is to infer the values of $(m, S, T)$ given a residue sequence $R$. A Bayesian approach to the assignment of these parameter values is taken, by defining a joint probability distribution $P(R, m, S, T)$ for an amino acid sequence and its secondary structure assignment. The posterior probability distribution over structural assignments given a new sequence, $P(m, S, T \mid R)$, is then calculated via Bayesian inference. Prediction then involves finding the secondary structure assignment $(m, S, T)$ which maximizes this posterior distribution.

Figure 3: Representation of the secondary structure of a protein sequence in terms of structural segments.
The parameters shown represent the segment types T = (L,E,L,E,L,H,L, ...) and endpoints S = (4,9,11,15,18,25, ...). The associated structure assignment is LLLLEEEEELLEEEELLLHHHHHHHLLL... (Figure courtesy Schmidler et al [10]).

CORRELATION ANALYSIS
Correlation analysis begins with a statistical analysis to explore the dependency structure. A χ² (chi-square) test is used to identify the most informative correlations between amino acid pairs in different types of secondary structure segments and positions. The test compares the observed joint distribution of an amino acid pair with the product of the marginal distributions, which is the distribution expected if the two positions were independent. χ² thus measures the size of the difference between the observed and expected frequencies of the data. More specifically, the difference between the observed and expected frequency is calculated, squared, and divided by the expected frequency, and the results are summed over all cells:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $O$ is the observed frequency and $E$ is the expected frequency.
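As a minimal sketch of the statistic just defined, the following Python computes χ² for a list of residue pairs, taking the expected count of each cell as the product of the marginal counts divided by the total. The helper names `chi_square` and `pairs_at_distance` are hypothetical, and this is not the MATLAB function used in the study.

```python
from collections import Counter


def pairs_at_distance(segment: str, d: int):
    """Collect (residue at i, residue at i + d) pairs from one segment."""
    return [(segment[i], segment[i + d]) for i in range(len(segment) - d)]


def chi_square(pairs) -> float:
    """Pearson chi-square statistic for a list of (a, b) observations.

    Expected counts assume independence: E = row_total * col_total / n.
    Only rows and columns that actually occur are included, so E > 0.
    """
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    stat = 0.0
    for a, row_total in first.items():
        for b, col_total in second.items():
            expected = row_total * col_total / n
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Toy data only: two perfectly correlated pair types.
    toy = [("A", "L"), ("A", "L"), ("G", "K"), ("G", "K")]
    print(chi_square(toy))
```

In practice the pairs would be pooled over all segments of one structural type at a fixed separation `d`, and the statistic compared against the χ² distribution with the appropriate degrees of freedom to decide significance.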
Squaring the difference ensures that every term is non-negative. Without it, the positive and negative differences across an entire table would always add up to 0. Dividing the squared difference by the expected frequency normalizes each term, so that the observed/expected differences are comparable across all data values (the cells of a table). Using χ², the correlation between amino acid pairs at various separation distances was considered, and the positions which were highly correlated were found for the corresponding secondary structure, alpha helix or beta strand. Position specific correlation is then calculated for terminal positions. This is done in order to find capping regions in alpha helices, which typically show hydrogen bonding patterns and side chain interactions that are different from internal positions.

The data used is 8100 proteins and their secondary structures collected from the Protein Data Bank (PDB). Table 1 below shows the results of the χ² test for the three secondary structure types. The correlation is found by using a function built in MATLAB. The analysis shows that in α-helix segments, a residue at position i is correlated with residues at positions i-2, i-3 and i-4, where i denotes the position of the amino acid within a segment. Similarly, a β strand residue has its highest correlations with residues at positions i-1 and i-2, and a loop residue has its most significant correlations with those at positions i-1, i-2 and i-3.

Table 1: Correlations of amino acids

α helices are characterized by capping boxes where the hydrogen bonding patterns and side chain interactions are different from the internal positions. For this reason, position specific correlations have to be considered. Table 2 gives the correlation analysis for the terminal positions in α-helical segments.
The results show that there are statistically significant correlations between residues in terminal positions and residues that are outside the segment. Another observation is that there exist significant correlations with the forward residues. Also, the degree of correlation for the forward residues might be different from that for the backward residues, which indicates an asymmetric dependency behavior for forward and backward residues. Internal positions show a similar correlation pattern.
Table 2: Position specific Correlations in Helix Terminal Positions

3. The Model
A secondary structure of a protein is defined by a vector $(m, S, T)$, where $m$ denotes the total number of segments, $S$ the segment end positions and $T$ the structural state of each segment. In an HMM there is a finite number of distinct states. In the model built here, the hidden states are the structural states {H, E, L}. Starting from the initial state, transitions occur from one state to another following a transition probability distribution. Each state generates an observation in the form of an amino acid segment, according to the observation frequency distribution. The state prediction can be restated as a posterior maximization problem. That is, given the observation sequence of amino acids, denoted by $R$, find the vector $(m, S, T)$ with maximum posterior probability $P(m, S, T \mid R)$. The posterior probability can be expressed as:

$$P(m, S, T \mid R) = \frac{P(R \mid m, S, T)\, P(m, S, T)}{P(R)}$$

where $P(R \mid m, S, T)$ denotes the sequence likelihood and $P(m, S, T)$ denotes the a priori distribution. The a priori distribution $P(m, S, T)$ is modeled as:

$$P(m, S, T) = P(m) \prod_{j=1}^{m} P(T_j \mid T_{j-1})\, P(S_j \mid S_{j-1}, T_j)$$

where $P(m)$ is the probability of observing $m$ secondary structure segments, and it is assumed to be independent of the other state variables. $P(T_j \mid T_{j-1})$ represents the state transition probability (among different secondary structure types) and $P(S_j \mid S_{j-1}, T_j)$ models the length distribution of secondary structure segments under the following assumption:

$$P(S_j \mid S_{j-1}, T_j) = P(S_j - S_{j-1} \mid T_j)$$
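To make the segment parameterization and the prior concrete, here is a small Python sketch. It expands segment endpoints and types into a per-residue string, and evaluates the log of the prior above under the length assumption. All names (`expand`, `log_prior`) and every probability table passed in are hypothetical placeholders; in particular, using a separate `initial` distribution for the first segment (the $j = 1$ term, where no $T_0$ exists) is an assumption of this sketch, since the text does not say how that term is handled.

```python
import math


def expand(S, T) -> str:
    """Expand 1-based segment end positions S and types T into a string."""
    out, start = [], 0
    for end, seg_type in zip(S, T):
        out.append(seg_type * (end - start))  # repeat the type letter
        start = end
    return "".join(out)


def log_prior(S, T, p_m, initial, trans, length) -> float:
    """log P(m, S, T) = log P(m)
                        + sum_j [log P(T_j | T_{j-1}) + log P(S_j - S_{j-1} | T_j)].

    p_m[m]          : probability of m segments (placeholder table)
    initial[t]      : assumed distribution of the first segment type
    trans[a][b]     : P(T_j = b | T_{j-1} = a)
    length[t][len]  : P(segment length = len | type t)
    """
    logp = math.log(p_m[len(S)])
    prev_end, prev_type = 0, None
    for end, seg_type in zip(S, T):
        p_type = initial[seg_type] if prev_type is None else trans[prev_type][seg_type]
        logp += math.log(p_type) + math.log(length[seg_type][end - prev_end])
        prev_end, prev_type = end, seg_type
    return logp


if __name__ == "__main__":
    # Segment example from Figure 3 (first six segments only).
    print(expand((4, 9, 11, 15, 18, 25), ("L", "E", "L", "E", "L", "H")))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once real proteins with dozens of segments are scored.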
The likelihood term $P(R \mid m, S, T)$ is modeled as:

$$P(R \mid m, S, T) = \prod_{j=1}^{m} P(R_{[S_{j-1}+1:S_j]} \mid S, T) = \prod_{j=1}^{m} P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j)$$

It is important to note here that the segment likelihood terms are assumed to be independent. Also, $R_{[p:q]}$ denotes the sequence of residues with indexes from $p$ to $q$. $P(R_{[S_{j-1}+1:S_j]} \mid S, T)$ represents the probability of observing a particular amino acid segment given all state variables. It is equal to $P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j)$ because in an HMM the symbol observation probability depends only on its generator state. Although the observation probabilities of amino acids in different secondary structure segments are assumed to be independent, the amino acids within a segment are allowed to depend on neighboring residues. For an α-helix segment, a dependency model for $P(R_{[S_{j-1}+1:S_j]} \mid S, T)$ is written as a product of three terms:

$$P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j = H) = (\text{N-terminal term}) \times (\text{internal term}) \times (\text{C-terminal term})$$

Here the first product term represents the observation probability of amino acids at the N-terminal positions of length $l$ for α helices. The second term represents the observation probability at the internal positions, and the third term represents the observation probability at the C-terminal residues of length $l$ for α helices.

As the number of sequences in the PDB is not sufficient to reliably estimate the conditional probabilities, the dependency parameters are reduced by grouping the amino acids into three hydrophobicity classes, denoted by $h_i \in \{\text{hydrophobic}, \text{hydrophilic}, \text{neutral}\}$. The statistical analysis done using the χ² test finds the dependency patterns shown in Table 3.
Figure 4: A graphical model (Whittaker, 1990) representing the conditional independence structure for the amino acids in an example α-helix segment. $R_i$ are the amino acids of the α-helix and $H_i$ are their associated hydrophobicity classes as assigned by (Klingler and Brutlag, 1994). The model provides for dependence among the hydrophobicity classes at appropriate periodicity, allowing the amino acid distributions to be modeled as conditionally independent, thus reducing the dimensionality of the model.

Position   Helix                                               Strand                                              Loop
N1         $R_i \mid h_{i-1}, h_{i+2}$                         $R_i \mid h_{i-1}, h_{i-2}$                         $R_i \mid h_{i-1}, h_{i-2}$
N2         $R_i \mid h_{i-2}, h_{i+1}$                                                                             $R_i \mid h_{i-1}, h_{i-2}$
Int        $R_i \mid h_{i-2}, h_{i-3}, h_{i-4}, h_{i+2}$       $R_i \mid h_{i-1}, h_{i-2}, h_{i+1}, h_{i+2}$       $R_i \mid h_{i-1}, h_{i-2}$
C1         $R_i \mid h_{i-2}, h_{i-4}$                         $R_i \mid h_{i-2}, h_{i-3}$                         $R_i \mid h_{i-1}, h_{i-3}$
C2         $R_i \mid h_{i-2}, h_{i-4}$                                                                             $R_i \mid h_{i-1}, h_{i-3}$

Table 3: Dependencies with segments

After obtaining an amino acid sequence $R$, the vector $(m, S, T)$ that maximizes the posterior probability $P(m, S, T \mid R)$ is taken as the predicted secondary structure. A forward-backward algorithm generalized for semi-Markov HMMs is used to determine the posterior probability. After prediction of secondary structure, proteins which have close secondary structure are used to re-adjust the HMM parameters iteratively. This is done by removing from the training set those predicted sequences which do not have a close secondary structure. The HMM parameters are then re-estimated.

4. Results
The accuracy of secondary structure prediction is measured using the $Q_3$ test, where $Q_3$ is given as

$$Q_3 = \frac{\text{correctly predicted residues}}{\text{number of residues}}$$

The data set used is the latest version of the PDB (Protein Data Bank [11]) after filtering out the sequences which have fewer than 50 or more than 900 residues (as suggested by Schmidler et al). The minimum β strand length is restricted to 3 and the minimum α helix length to 5 [4]. There is a total 1.5% increase in the overall
3-state prediction accuracy as compared to the Bayesian method used by Schmidler [10]. The dependency model used increases the α helix and β strand accuracy.

5. Generalization of the model and possible improvements
Significant evidence exists that inclusion of multiple sequence alignment information, when available, can improve single sequence prediction methods by as much as 5 to 7% [3,7,8,9]. I used the NCBI (www.ncbi.nlm.nih.gov) resource for obtaining the amino acid sequences and the CLUSTALW tool (www.ebi.ac.uk/clustalw/) to build multiple sequence alignments. The multiple alignment was then used as the input set for training the models. The results are presented in the next section.

In both of the works considered here [1,10], only the 3-state problem is addressed, where the structural states are {H, E, L}, which may induce a model error. The model can be generalized by considering more states a protein can fold into, such as coiled coils, hairpins etc [7]. However, this takes a lot of computational time, as the complexity of the dependency model increases with the number of states, and the task is far from trivial. This generalization is an area for future work and an extension of the model to make it more complete.

6. Extensions and my contribution
As discussed in the previous section, I used the multiple alignment of sequences as a test set. The aligned sequences have a high degree of similarity and are homologous (evolutionarily related). Homology modeling is based on the notion that new proteins evolve gradually from existing ones by amino acid substitution, addition, and/or deletion, and that the 3D structures and functions are often strongly conserved during this process. Many proteins thus share similar functions and structures, and there are usually strong sequence similarities among structurally similar proteins. Strong sequence similarity often indicates strong structure similarity, although the opposite is not necessarily true.
Homology modeling tries to identify structures similar to the target protein through sequence comparison. It is important to explain here why aligned sequences are a good choice for test data, and why they can subsequently improve the prediction accuracy. In the well known words of Theodosius Dobzhansky (1900-1975), "Nothing in Biology makes sense except in the light of evolution". Nature has a tremendous power of propagating features which are beneficial for the species to develop and survive, and discarding those which are not. Thus, the proteins which are of vital importance for the functioning of a species are propagated with few or no mutations from generation to generation, and deleterious mutations are not propagated because their carriers die off. To conserve the function, the structure has to be conserved, as the two are closely related. Alignment of multiple sequences shows that the residues which play a critical role in the function of proteins are conserved, and these residues have similar secondary structure. The multiple alignment is therefore used as a test set.

I used a multiple aligned sequence 1fxia.msf, available at (www.sanger.ac.uk), and used the CLUSTALX program for adjusting the alignment (Figure 5). This alignment is then used as an input set for training the HMM. The training is done using the Baum-Welch algorithm, also called the forward-backward algorithm (see Appendix). The algorithm prints the log likelihood at each iteration along with the transition matrix and initial probabilities. The Bioinformatics Toolbox in MATLAB is used for building the profile and for showing the log-odds score and the symbol emission for the match and insert states. Some of the functions used were hmmprofstruct,
hmmprofmerge, hmmprofestimate, hmmprofgenerate, hmmprofalign. The log odds best path is shown in Figure 6.

Figure 5: CLUSTALX multiple sequence alignment for 1fxia.msf

Figure 6: Log odds best path
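The log likelihood reported at each training iteration is the quantity computed by the forward pass of the forward-backward algorithm. The following Python is a minimal sketch of that pass for an ordinary discrete HMM, with per-step scaling to avoid underflow. It is not the semi-Markov variant used for prediction in this paper, nor the MATLAB toolbox code; the function name `forward_log_likelihood` and all parameter tables are hypothetical.

```python
import math


def forward_log_likelihood(obs, states, start, trans, emit) -> float:
    """Scaled forward recursion; returns log P(obs | model).

    start[s]     : initial state probability
    trans[r][s]  : transition probability from state r to state s
    emit[s][o]   : probability that state s emits symbol o
    """
    log_lik = 0.0
    alpha = {}
    for t, symbol in enumerate(obs):
        if t == 0:
            alpha = {s: start[s] * emit[s][symbol] for s in states}
        else:
            alpha = {
                s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
                for s in states
            }
        # Normalize and accumulate the log of the scaling factor.
        scale = sum(alpha.values())
        log_lik += math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
    return log_lik


if __name__ == "__main__":
    # Toy two-state model over a two-letter alphabet (placeholder numbers).
    states = ("H", "L")
    start = {"H": 0.5, "L": 0.5}
    trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.5, "L": 0.5}}
    emit = {"H": {"A": 0.8, "G": 0.2}, "L": {"A": 0.4, "G": 0.6}}
    print(forward_log_likelihood("AG", states, start, trans, emit))
```

Baum-Welch re-estimation never decreases this value from one iteration to the next, so printing it per iteration, as described above, is a simple check that training is behaving as expected.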
Results using the alignment:

Predicted secondary structure composition for the protein came as:

sec str type    H       E       L
% in protein    25.71   25.71   48.57

Residue composition for the protein is as follows:

%A: 2.9    %C: 0.0    %D: 17.1   %E: 11.4   %F: 0.0
%G: 5.7    %H: 0.0    %I: 5.7    %K: 5.7    %L: 11.4
%M: 0.0    %N: 2.9    %P: 8.6    %Q: 0.0    %R: 0.0
%S: 0.0    %T: 8.6    %V: 11.4   %W: 0.0    %Y: 8.6

To determine the accuracy of the prediction, I used the $Q_3$ test as done in Section 4 above. I used the PDB (Protein Data Bank) to look up the known, empirically determined structures of the protein. I then compared the predicted secondary structure with the known structures to get the number of correctly predicted residues. The ratio of correctly predicted residues to the total number of residues ($Q_3$) gave the prediction accuracy. $Q_3$ came out to be a value near 0.83, which implies a prediction accuracy of 83%, a significant improvement over the previous methods.

7. Conclusion:
This paper discusses an approach to the prediction of protein secondary structure from sequence using probabilistic models for protein structural segments and an algorithm for prediction based on Hidden Markov Models (HMM). Extension of this approach to the use of multiple aligned sequences has shown that accuracies improve when evolutionary information is taken into account. Extensions to the model using more secondary structure elements are also discussed.

8. Acknowledgments:
I would like to thank Zafer Aydin and Dr Mark Borodovsky for their cooperation and assistance in helping me understand the concepts of their paper and their importance. I would also like to acknowledge Dr Mason Porter for his guidance throughout the project.
Appendix:
1. Baum-Welch Algorithm
The Baum-Welch algorithm, also called the forward-backward algorithm, can be derived using simple "occurrence counting" arguments or using calculus to maximize the auxiliary quantity. A special feature of the algorithm is its guaranteed convergence (to a local maximum of the likelihood). For more discussion on Baum-Welch see: (http://jedlik.phy.bme.hu/~gerjanos/HMM/node11.html).

References:
[1] Aydin, Z., Altunbasak, Y. & Borodovsky, M. (2004), Protein secondary structure prediction with semi-Markov HMM. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, in press.
[2] Chou, P. Y. & Fasman, G. D. (1974), Prediction of protein conformation. Biochemistry 13, 222-245.
[3] Di Francesco, Garnier, J. & Munson, P. J., Improving protein secondary structure prediction with aligned homologous sequences.
[4] Frishman, D. & Argos, P. (1996), Incorporation of non-local interactions in protein secondary structure prediction from amino acid sequence. Protein Engineering 9(2), 133-142.
[5] Garnier, J., Osguthorpe, D. J. & Robson, B. (1978), Analysis and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97-120.
[6] Rabiner, L. R., A tutorial on hidden Markov models and selected applications in speech recognition.
[7] Rost, B. & Sander, C. (1993), Prediction of protein secondary structure at better than 70 percent accuracy. J. Mol. Biol. 232, 584-599.
[8] Rost, B., Sander, C. & Schneider, R. (1994), Redefining the goals of protein secondary structure prediction. J. Mol. Biol. 235, 13-26.
[9] Salamov, A. A. & Solovyev, V. V. (1995), Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247, 11-15.
[10] Schmidler, S. C., Liu, J. S. & Brutlag, D. L., Bayesian Segmentation of Protein Secondary Structure.
[11] EVA, "List of Sequence-unique pdb files".