Protein Secondary Structure Prediction using HMM

HIDDEN MARKOV MODELS TO PREDICT PROTEIN SECONDARY STRUCTURE
Abhishek Dabral
gtg204v@mail.gatech.edu
MS Bioinformatics, School of Biology
Georgia Institute of Technology
December, 2004
Abstract
Proteins are the building blocks of life. The structure of a protein determines
its function. This structural information of proteins is embedded in its amino
acid sequence. Protein secondary structure prediction is an essential task in
determining the structure and function of the proteins. This study addresses
the problem of protein secondary structure prediction by using a Hidden
Markov Model (HMM). A dependency model is built by considering
statistically significant amino acid correlation patterns at segment borders.
The problem of low accuracy in beta strand predictions in most of the present
methods is also addressed by considering significant correlations outside the
segments. The use of evolutionary data for improving the prediction
accuracy is also explored.
1. Introduction
Amino acids are the building blocks of proteins. Peptide bonds connect adjacent amino acids of
twenty different types. It is a well known fact that for proteins, structure implies function. The structural
information about a protein is embedded in its amino acid sequence. The basic chemical composition
common to all 20 amino acids is shown in Figure 1. The central carbon atom, called Cα, forms four
covalent bonds, one each with NH₃⁺ (amino group), COO⁻ (carboxyl group), H (hydrogen), and R (side
chain). The first three are common to all amino acids; the side chain R is a chemical group that differs
for each of the 20 amino acids. Inspection of three-dimensional structures of proteins has revealed the
presence of repeating elements of regular structure, termed “secondary structure”. These regular
structures are stabilized by molecular interactions between atoms within the protein, the most important
being the hydrogen bond, formed between two electronegative atoms that share one H. There is a
convention on the nomenclature designating the common patterns of H-bonds that give rise to specific
structure elements, the Dictionary of Secondary Structures of Proteins (DSSP). DSSP annotations mark
each residue (amino acid) as belonging to one of seven types of secondary structure: H (α helix),
G (3-helix or 3₁₀ helix), I (5-helix or π helix), B (residue in isolated β bridge), E (β strand),
T (H-bonded turn) and S (bend), with “_” used where none of the above structures is applicable.
Typically, the seven secondary structure types are reduced to three groups: helix (includes types “H”,
α helix, and “G”, the 3₁₀ helix), strand (includes “E”, β ladder, and “B”, β bridge) and coil (all
other types). A protein which is color coded based on DSSP annotation is shown in Figure 2.

Figure 1: Amino acids and peptide bond formation. The basic amino acid structure is shown in the dark
green box. Each amino acid consists of the C alpha carbon atom (yellow) that forms four covalent bonds,
one each with the amino group (blue), the carboxyl group (light green), a hydrogen atom, and a side
chain R. In the polymerization of amino acids, the carboxyl group of one amino acid (light green) reacts
with the amino group of the other amino acid (blue) under cleavage of water.
Figure 2: A protein that is color coded based on the annotation by the DSSP. The protein shows only the
main chain with the following color codes: H: α helix (red), G: 3₁₀ helix, E: extended strand in
β ladder (yellow), B: residue in isolated β bridge (orange), T: hydrogen bond turn (dark blue) and
S: bend (light blue). Residues not conforming to any of the types are shown in green. The protein is a
catalytic subunit of cAMP-dependent kinase (PDB ID 1BKX, available at http://www.rcsb.org/pdb).
(Image courtesy Ganapathiraju et al, Characterization of Protein Secondary Structure Prediction.)
As an intermediate step towards solving the grander problem of determining three-dimensional protein
structures, the prediction of secondary structural elements is more tractable but is in itself not yet a
fully solved problem. Protein secondary structure prediction from amino acid sequence dates back to
the early 1970s, when Chou and Fasman, and others, developed statistical methods to predict
secondary structure from primary sequence [2]. These early methods were based on the patterns of
occurrence of specific amino acids in the three secondary structure types: helix, strand, and coil.
Early attempts to predict secondary structure focused on the development of mappings from a
local window of residues in the sequence to the structural state of the central residue in the window,
and a large number of methods estimating such mappings were developed. The earliest approaches
scored individual amino acids by frequency of occurrence in each structural state, combining them in
ways corresponding to conditional dependence models [2,5]. Methods considering correlations among
positions within the window improved the accuracy. Further improvements were demonstrated by the
inclusion of evolutionary information via multiple alignments of homologous sequences [7,9].
2. Method
In this work, the authors adopted a model based approach, formulating secondary structure prediction
as a general Bayesian inference problem. The approach eschewed many problems associated with
window based predictions, such as the need for post prediction filtering [4, 7]. The work was broadly
divided into three stages. First of all, statistical analysis was performed, which explored the most
informative correlations for different secondary structures. Then, a semi Markov HMM was chosen,
which was similar to the model developed by [10]. Correlations at terminal positions of structural
segments and dependencies to forward residues within the segments were specifically considered.
Finally, an iterative estimation of the HMM parameters was implemented.

The starting point was to choose a representation of sequence/structure relationships in proteins based
on secondary structure segments. The model was parameterized by representing the segment position
and structural types. Segment location was denoted by the last residue of the segment. Because the
segments are required to be contiguous, this parameterization uniquely identified a set of segment
locations for a given sequence.
Let $R = (R_1, R_2, \ldots, R_n)$ be a sequence of $n$ amino acid residues,
$S = \{ i : \mathrm{Struct}(R_i) \neq \mathrm{Struct}(R_{i+1}) \}$ be a sequence of $m$ positions denoting
the end of each individual structural segment (so that $S_m = n$), and $T = (T_1, T_2, \ldots, T_m)$ be
the sequence of secondary structural types for each respective segment (see Figure 3). Together m, S and T
completely determine a secondary structure assignment for a given amino acid sequence, where m denotes
the total number of segments, S represents the segment end positions and T represents the structural state
of each segment. In the case of secondary structure prediction, the quantities of interest are thus the
values of m, $S = (S_1, S_2, \ldots, S_m)$ and $T = (T_1, T_2, \ldots, T_m)$ corresponding to the known
amino acid sequence $R = (R_1, R_2, \ldots, R_n)$, i.e., the locations and types of the secondary
structural segments. The problem is to infer the values of (m, S, T) given a residue sequence R. A
Bayesian approach to the assignment of these parameter values is taken, by defining a joint probability
distribution $P(R, m, S, T)$ for an amino acid sequence and its secondary structure assignment. Given a
new sequence, the conditional or posterior probability distribution over structural assignments,
$P(m, S, T \mid R)$, is then calculated via Bayesian inference. Prediction then involves finding those
secondary structure assignments (m, S, T) which maximize this posterior distribution.
Figure 3: Representation of the secondary structure of a protein sequence in terms of structural segments. The parameters
shown represent the segment types T = (L,E,L,E,L,H,L, . . .) and endpoints S = (4,9,11,15,18,25, . . .). The associated structure
assignment is LLLLEEEEELLEEEELLLHHHHHHHLLL . . . (Figure courtesy Schmidler et al. [10]).
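The segment parameterization of Figure 3 can be made concrete with a short sketch. The Python code below is illustrative only and is not taken from the original work; the function names `expand_segments` and `collapse_labels` are hypothetical. It expands the end positions S and types T into a per-residue string and recovers them again, using the first six segments of the Figure 3 example.

```python
def expand_segments(ends, types):
    """Expand segment end positions S and segment types T into a per-residue string.

    ends  -- 1-based end position of each segment (the S vector)
    types -- one-letter structural type of each segment (the T vector)
    """
    if len(ends) != len(types):
        raise ValueError("S and T must have the same length m")
    labels = []
    previous_end = 0  # convention assumed here: S_0 = 0
    for end, seg_type in zip(ends, types):
        if end <= previous_end:
            raise ValueError("segment ends must be strictly increasing")
        labels.append(seg_type * (end - previous_end))
        previous_end = end
    return "".join(labels)


def collapse_labels(labels):
    """Inverse operation: recover (S, T) from a per-residue label string."""
    ends, types = [], []
    for position, label in enumerate(labels, start=1):
        if types and types[-1] == label:
            ends[-1] = position      # extend the current segment
        else:
            ends.append(position)    # a new segment starts here
            types.append(label)
    return ends, types


# First six segments of the Figure 3 example.
print(expand_segments([4, 9, 11, 15, 18, 25], "LELELH"))
# prints LLLLEEEEELLEEEELLLHHHHHHH
```

Because segments are contiguous and adjacent segments differ in type, the two representations carry the same information, which is why the end positions alone suffice to locate every segment.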
CORRELATION ANALYSIS
Correlation analysis begins with a statistical analysis to explore the dependency structure. A χ² (chi
square) test is used to identify the most informative correlations between amino acid pairs in different
types of secondary structure segments and positions. The χ² test compares the observed joint distribution
of an amino acid pair with the product of the marginal distributions, which is the distribution expected
if the two positions were independent. In other words, χ² measures the size of the difference between
the observed and expected frequencies of the data. More specifically, the difference between the observed
and expected frequency is calculated, that difference is squared, and the result is divided by the
expected frequency.
The formula for χ² can be expressed as:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where $O$ is the observed frequency and $E$ is the expected frequency.

Squaring the difference ensures that every term is non-negative. Without squaring, the positive and
negative differences across an entire table would always add up to 0. Dividing the squared difference by
the expected frequency normalizes each term, so that the remaining measures of observed/expected
difference are comparable across all data values (cells in Table 1). Using χ², the correlation between
amino acid pairs at various separation distances is considered, and the positions which are highly
correlated are found for the corresponding secondary structure, alpha helix or beta strand. Position
specific correlation is then calculated for terminal positions. This is done in order to find capping
regions in alpha helices, which typically show hydrogen bonding patterns and side chain interactions that
are different from internal positions. The data used is 8100 proteins and their secondary structures
collected from the Protein Data Bank (PDB). Table 1 below shows the results of the χ² test for the three
secondary structure types. The correlation is found by using a function built into MATLAB. It is learnt
from the above analysis that in α-helix segments, a residue at position i is correlated with residues at
positions i-2, i-3 and i-4, where i denotes the position of the amino acid within a segment. Similarly,
a β strand residue has its highest correlations with residues at positions i-1 and i-2, and a loop
residue has its most significant correlations with those at positions i-1, i-2 and i-3.
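As an illustration of the χ² computation described above, the following Python sketch computes the statistic for residue pairs at a fixed separation. It is not the MATLAB function used in the study; the names `chi_square` and `pairs_at_separation` and the toy data are assumptions made for this example.

```python
from collections import Counter


def pairs_at_separation(segment, k):
    """Return (residue at i-k, residue at i) pairs for one structural segment."""
    return [(segment[i - k], segment[i]) for i in range(k, len(segment))]


def chi_square(pairs):
    """Pearson chi-square statistic for independence of the two pair members.

    The expected count E of a pair (a, b) is the product of the marginal
    counts divided by the total, i.e. the count expected under independence.
    """
    observed = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    total = len(pairs)
    statistic = 0.0
    for a in first:
        for b in second:
            expected = first[a] * second[b] / total
            statistic += (observed[(a, b)] - expected) ** 2 / expected
    return statistic


# Toy data: the first residue fully determines the second one.
dependent = [("A", "L")] * 20 + [("G", "V")] * 20
# Toy data: every combination is equally frequent (no correlation).
independent = [(a, b) for a in "AG" for b in "LV"] * 10

print(chi_square(dependent))    # prints 40.0
print(chi_square(independent))  # prints 0.0
```

In a real analysis the pairs would be pooled over all segments of one structural type, and the statistic would be compared against the χ² distribution with the appropriate degrees of freedom.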
Table 1: Correlations of amino acids
α helices are characterized by capping boxes where the hydrogen bonding patterns and side chain
interactions are different from the internal positions. For this reason, position specific correlations
have to be considered. Table 2 gives the correlation analysis for the terminal positions in α-helical
segments. The results show that there are statistically significant correlations between residues in
terminal positions and the residues that are outside the segment. Another observation is that there exist
significant correlations with the forward residues. Also, the degree of correlation for the forward
residues might be different from that of the backward residues, which indicates an asymmetric dependency
behavior for forward and backward residues. Internal positions also show a similar correlation pattern.

Table 2: Position specific Correlations in Helix Terminal Positions
3. The Model
A secondary structure of a protein is defined by the vector (m, S, T), where m denotes the total
number of segments, S represents the segment end positions and T represents the structural state of each
segment. In an HMM there are a finite number of distinct states. In the model built here the hidden states
are the structural states {H, E, L}. Starting from the initial state, transitions occur from one state to
another, following a transition probability distribution. Each state generates an amino acid segment as
its observation, according to the observation frequency distribution. The state prediction can be restated
as a posterior maximization problem: given the observation sequence of amino acids, denoted by R, find
the vector (m, S, T) with maximum posterior probability $P(m, S, T \mid R)$. The posterior probability
can be expressed as:
$$P(m, S, T \mid R) = \frac{P(R \mid m, S, T)\, P(m, S, T)}{P(R)}$$
where $P(R \mid m, S, T)$ denotes the sequence likelihood and $P(m, S, T)$ denotes the a priori
distribution. The a priori distribution $P(m, S, T)$ is modeled as:
$$P(m, S, T) = P(m) \prod_{j=1}^{m} P(T_j \mid T_{j-1})\, P(S_j \mid S_{j-1}, T_j)$$
where $P(m)$ is the probability of observing m secondary structure segments, and it is assumed to be
independent of the other state variables. $P(T_j \mid T_{j-1})$ represents the state transition
probability (among different secondary structure types) and $P(S_j \mid S_{j-1}, T_j)$ models the length
distribution of secondary structure segments under the following assumption:
$$P(S_j \mid S_{j-1}, T_j) = P(S_j - S_{j-1} \mid T_j).$$
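The factorization of the a priori distribution can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `log_prior`, the use of an initial distribution for the first segment type (standing in for $P(T_1 \mid T_0)$), the convention $S_0 = 0$, and all toy parameter values are hypothetical.

```python
import math


def log_prior(ends, types, log_p_m, init, trans, length_prob):
    """log P(m, S, T) = log P(m) + sum_j [log P(T_j | T_{j-1}) + log P(S_j - S_{j-1} | T_j)].

    ends        -- segment end positions S (1-based, strictly increasing)
    types       -- segment types T
    log_p_m     -- function m -> log P(m)
    init        -- assumed distribution of the first segment type
    trans       -- trans[u][t] = P(T_j = t | T_{j-1} = u)
    length_prob -- function (length, type) -> P(length | type)
    """
    total = log_p_m(len(ends))
    prev_end, prev_type = 0, None  # assumed convention: S_0 = 0
    for end, seg_type in zip(ends, types):
        p_type = init[seg_type] if prev_type is None else trans[prev_type][seg_type]
        total += math.log(p_type)
        total += math.log(length_prob(end - prev_end, seg_type))
        prev_end, prev_type = end, seg_type
    return total


# Toy parameters, chosen only so the result is easy to check by hand.
TYPES = "HEL"
INIT = {t: 1 / 3 for t in TYPES}
TRANS = {u: {t: 0.5 for t in TYPES if t != u} for u in TYPES}


def geometric_length(length, seg_type):
    """Toy geometric length distribution, identical for every type."""
    return 0.5 ** length


def flat_log_p_m(m):
    """Toy choice: ignore P(m) by treating it as constant."""
    return 0.0


# Two segments: a helix covering residues 1-2 and a loop covering residues 3-5.
value = log_prior([2, 5], "HL", flat_log_p_m, INIT, TRANS, geometric_length)
print(math.exp(value))  # (1/3) * 0.25 * 0.5 * 0.125
```

Working in log space avoids numerical underflow when the number of segments grows, which matters once the prior is combined with the likelihood of long sequences.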

The likelihood term $P(R \mid m, S, T)$ is modeled as:
$$P(R \mid m, S, T) = \prod_{j=1}^{m} P(R_{[S_{j-1}+1:S_j]} \mid S, T)
= \prod_{j=1}^{m} P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j)$$
It is important to note here that the segment likelihood terms are assumed to be independent. Also,
$R_{[p:q]}$ denotes the sequence of residues with indexes from p to q. $P(R_{[S_{j-1}+1:S_j]} \mid S, T)$
represents the probability of observing a particular amino acid segment given all state variables. It is
equal to $P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j)$ because in an HMM the symbol observation
probability depends only on its generator state.
Although the observation probabilities of amino acids in different secondary structure segments are
assumed to be independent, the amino acids within a segment are allowed to depend on neighboring
residues. A dependency model is created for $P(R_{[S_{j-1}+1:S_j]} \mid S, T)$ as:
$$P(R_{[S_{j-1}+1:S_j]} \mid S, T) = P(R_{[S_{j-1}+1:S_j]} \mid S_{j-1}, S_j, T_j = H)$$
$$= \prod_{i=S_{j-1}+1}^{S_{j-1}+l} P_{N}(R_i \mid \cdot)
\times \prod_{i=S_{j-1}+l+1}^{S_j-l} P_{\mathrm{Int}}(R_i \mid \cdot)
\times \prod_{i=S_j-l+1}^{S_j} P_{C}(R_i \mid \cdot)$$
Here the dot stands for the neighboring residues on which each position depends. The first product term
represents the observation probability of amino acids at the N terminal positions of length l for
α helices. The second term represents the observation probability at the internal positions and the third
product represents the observation probability at the C terminal residues of length l for α helices.
As the number of sequences in the PDB is not sufficient to reliably estimate the conditional probabilities,
the number of dependency parameters is reduced by grouping the amino acids into three hydrophobicity
classes denoted by $h_i \in$ {hydrophobic, hydrophilic, neutral}. The statistical analysis done using the
χ² test finds the dependency patterns shown in Table 3.

Figure 4: A graphical model (Whittaker, 1990) representing the conditional independence structure for the amino
acids in an example α-helix segment. Ri are the amino acids of the α-helix and Hi are their associated hydrophobicity
classes as assigned by (Klingler and Brutlag, 1994). The model provides for dependence among the hydrophobicity
classes at appropriate periodicity allowing the amino acid distributions to be modeled as conditionally independent,
thus reducing the dimensionality of the model.
Position  Helix                                          Strand                                         Loop
N1        R_i | h_{i-1}, h_{i+2}                         R_i | h_{i-1}, h_{i-2}                         R_i | h_{i-1}, h_{i-2}
N2        R_i | h_{i-2}, h_{i+1}                         -                                              R_i | h_{i-1}, h_{i-2}
C1        R_i | h_{i-2}, h_{i-4}                         R_i | h_{i-2}, h_{i-3}                         R_i | h_{i-1}, h_{i-3}
C2        R_i | h_{i-2}, h_{i-4}                         -                                              R_i | h_{i-1}, h_{i-3}
Int       R_i | h_{i-2}, h_{i-3}, h_{i-4}, h_{i+2}       R_i | h_{i-1}, h_{i-2}, h_{i+1}, h_{i+2}       R_i | h_{i-1}, h_{i-2}
Table 3: Dependencies within segments (N1, N2: N terminal positions; C1, C2: C terminal positions; Int: internal positions)
After obtaining an amino acid sequence R, the vector (m, S, T) that maximizes the posterior probability
$P(m, S, T \mid R)$ is determined as the predicted secondary structure. A forward-backward algorithm
generalized for the semi-Markov HMM is used to determine the posterior probability. After prediction of
the secondary structure, proteins whose predicted secondary structure is close to the known one are used
to re-adjust the HMM parameters iteratively. This is done by removing from the training set those
sequences whose predicted secondary structure is not close, and then re-estimating the HMM parameters.
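The forward half of such a semi-Markov recursion can be sketched as follows. This Python example is a simplification and not the model of the paper: residues inside a segment are emitted independently (no dependency model), $P(m)$ is dropped, and all names and toy parameters are hypothetical. A brute-force enumeration is included so the recursion can be checked on short sequences.

```python
from itertools import product

TYPES = "HEL"  # helix, strand, loop


def segment_prob(residues, seg_type, emit, length_prob):
    """P(length | type) times independent per-residue emission probabilities."""
    p = length_prob(len(residues), seg_type)
    for residue in residues:
        p *= emit[seg_type][residue]
    return p


def forward_likelihood(seq, init, trans, emit, length_prob):
    """Sum the joint probability of seq over all segmentations (forward pass only)."""
    n = len(seq)
    # alpha[i][t]: probability of residues 1..i with a segment of type t ending at i
    alpha = [{t: 0.0 for t in TYPES} for _ in range(n + 1)]
    for i in range(1, n + 1):
        for t in TYPES:
            for length in range(1, i + 1):
                seg = segment_prob(seq[i - length:i], t, emit, length_prob)
                if length == i:  # the segment starts at residue 1
                    alpha[i][t] += init[t] * seg
                else:            # a segment of a different type ended at i - length
                    alpha[i][t] += seg * sum(
                        alpha[i - length][u] * trans[u][t] for u in TYPES if u != t)
    return sum(alpha[n].values())


def brute_force_likelihood(seq, init, trans, emit, length_prob):
    """Enumerate every per-residue labeling; each maps to exactly one segmentation."""
    total = 0.0
    for labels in product(TYPES, repeat=len(seq)):
        p, start = init[labels[0]], 0
        for i in range(1, len(seq) + 1):
            if i == len(seq) or labels[i] != labels[start]:  # segment [start, i) ends
                p *= segment_prob(seq[start:i], labels[start], emit, length_prob)
                if i < len(seq):
                    p *= trans[labels[start]][labels[i]]
                start = i
        total += p
    return total


# Toy parameters over a two-letter alphabet, for illustration only.
INIT = {t: 1 / 3 for t in TYPES}
TRANS = {u: {t: 0.5 for t in TYPES if t != u} for u in TYPES}
EMIT = {"H": {"A": 0.7, "G": 0.3}, "E": {"A": 0.4, "G": 0.6}, "L": {"A": 0.5, "G": 0.5}}


def geometric_length(length, seg_type):
    """Toy geometric length distribution, identical for every type."""
    return 0.5 ** length


print(forward_likelihood("AGGA", INIT, TRANS, EMIT, geometric_length))
print(brute_force_likelihood("AGGA", INIT, TRANS, EMIT, geometric_length))
```

The recursion sums over the length of the last segment as well as its type, which is what distinguishes the semi-Markov case from an ordinary HMM. Combining these forward values with an analogous backward pass would give the posterior quantities used for prediction.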
4. Results
The accuracy of secondary structure prediction is measured using the $Q_3$ score, where $Q_3$ is given as
$$Q_3 = \frac{\text{Correctly predicted residues}}{\text{Number of residues}}$$
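The $Q_3$ score is straightforward to compute once both structures are expressed in the three-state alphabet. The Python sketch below is illustrative; the function names `reduce_dssp` and `q3` are hypothetical, and the reduction follows the grouping given in the Introduction (H and G to helix, E and B to strand, everything else to coil, written here as L).

```python
def reduce_dssp(dssp):
    """Map DSSP labels to three states: H, G -> H; E, B -> E; all others -> L."""
    mapping = {"H": "H", "G": "H", "E": "E", "B": "E"}
    return "".join(mapping.get(label, "L") for label in dssp)


def q3(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed structures must have equal length")
    if not observed:
        raise ValueError("structures must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


print(reduce_dssp("HGIEBTS_"))       # prints HHLEELLL
print(q3("HHHLLEEE", "HHLLLEEE"))    # prints 0.875
```

Note that $Q_3$ weights every residue equally, so a predictor can score well on proteins dominated by one state while still performing poorly on the rarer states.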
The data set used is the latest version of the PDB (Protein Data Bank [11]) after filtering out the
sequences which have fewer than 50 or more than 900 residues (as suggested by Schmidler et al). The minimum
β strand length is restricted to 3 and the minimum α helix length to 5 [4]. There is a total 1.5%
increase in the overall 3-state prediction accuracy as compared to the Bayesian method used by
Schmidler [10]. The dependency model used increases the α helix and β strand accuracy.
5. Generalization of the model and possible improvements
Significant evidence exists that inclusion of multiple sequence alignment information, when available, can
improve single sequence prediction methods by as much as 5–7% [3,7,8,9]. I used the NCBI
(www.ncbi.nlm.nih.gov) resource for obtaining the amino acid sequences and the CLUSTALW tool
(www.ebi.ac.uk/clustalw/) to build the multiple sequence alignment. The multiple alignment was then used
as an input set to train the models. The results are presented in the next section.
In both works considered here [1,10], only the 3-state problem is addressed, where the set of structural
states is {H, E, L}, which may induce a model error. The model can be generalized by considering more
states a protein can fold into, such as coiled coils, hairpins etc. [7]. However, it takes a lot of
computational time, as the complexity of the dependency model increases with the number of states, and
the task is far from trivial.
This generalization is an area for future work and extension of the model to make it more complete.
6. Extensions and my contribution
As discussed in the previous section, I used the multiple alignment of sequences as a test set. The aligned
sequences have a high degree of similarity and are homologous (evolutionarily related). Homology
modeling is based on the notion that new proteins evolve gradually from existing ones by amino acid
substitution, addition, and/or deletion and that the 3D structures and functions are often strongly conserved
during this process. Many proteins thus share similar functions and structures and there are usually strong
sequence similarities among the structurally similar proteins. Strong sequence similarity often indicates
strong structure similarity, although the opposite is not necessarily true. Homology modeling tries to
identify structures similar to the target protein through sequence comparison. It is important to explain
here why aligned sequences are a good choice for test data, and the reason they can subsequently improve
the prediction accuracy. It is a very popular saying by Theodosius Dobzhansky (1900-1975), “Nothing in
Biology makes sense except in the light of evolution”. Nature has this tremendous power of propagating
features which are beneficial for the species to develop and survive, and deprecating those which are not.
Thus, the proteins which are of vital importance for the functioning of a species are propagated largely
unchanged from generation to generation, and deleterious mutations are not propagated because their
carriers die off. To conserve the function of the protein, the structure has to be conserved, as we know
that the two are closely related. Alignments of multiple sequences show that the residues which play a
critical role in the function of proteins are conserved, and these residues have similar secondary
structure. For this reason, the multiple alignment was used as a test set. I used the multiple alignment
1fxia.msf, available at www.sanger.ac.uk, and used the CLUSTALX program for adjusting the alignment
(Figure 5). This alignment is then used as an input set for training the HMM. The training is done using
the Baum-Welch algorithm or the forward-backward algorithm (see Appendix). The algorithm used prints the
log likelihood at each iteration along with the transition matrix and initial probabilities. The
Bioinformatics Toolbox in MATLAB is used for building the profile and for showing the log-odds score and
the symbol emission for the match and insert states. Some of the functions used were hmmprofstruct,
hmmprofmerge, hmmprofestimate, hmmprofgenerate and hmmprofalign. The log odds best path is shown in
Figure 6.
Figure 5: CLUSTALX multiple sequence alignment for 1fxia.msf
Figure 6: Log odds best path

Results using the alignment:
Predicted secondary structure composition for the protein came as:
sec str type H E L
% in protein 25.71 25.71 48.57
Residue composition for the protein is as follows:
%A: 2.9 %C: 0.0 %D: 17.1 %E: 11.4 %F: 0.0
%G: 5.7 %H: 0.0 %I: 5.7 %K: 5.7 %L: 11.4
%M: 0.0 %N: 2.9 %P: 8.6 %Q: 0.0 %R: 0.0
%S: 0.0 %T: 8.6 %V: 11.4 %W: 0.0 %Y: 8.6
To determine the accuracy of the prediction, I used the $Q_3$ score as done in Section 4 above.
I used the PDB (Protein Data Bank) to look up the known, empirically determined structures of the protein.
I then compared the predicted secondary structure with the known structures to get the
number of correctly predicted residues. The ratio of correctly predicted residues to the total number of
residues ($Q_3$) gave the prediction accuracy. $Q_3$ came out to be a value near 0.83, which implies a
prediction accuracy of 83%, a significant improvement over the previous methods.
7. Conclusion:
This paper discusses an approach to the prediction of protein secondary structure from sequence using
probabilistic models for protein structural segments and an algorithm for prediction based on Hidden
Markov Models (HMM). Extension of this approach to the use of multiple aligned sequences has shown that
accuracies improve when the evolutionary information is taken into account. Extensions of the model
using more secondary structure elements are also discussed.
8. Acknowledgments:
I would like to thank Zafer Aydin and Dr Mark Borodovsky for their cooperation and assistance in
helping me understand the concepts of their paper and their importance. I would also like to
acknowledge Dr Mason Porter for his guidance throughout the project.

Appendix:
1. Baum-Welch Algorithm
The Baum-Welch algorithm, also called the Forward-Backward algorithm, can be derived using simple
“occurrence counting” arguments or using calculus to maximize the auxiliary quantity. A special feature of
the algorithm is its guaranteed convergence to a local maximum of the likelihood. For more discussion on
Baum-Welch see:
(http://jedlik.phy.bme.hu/~gerjanos/HMM/node11.html).
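The “occurrence counting” view can be illustrated with a compact Python sketch of one Baum-Welch re-estimation step for an ordinary discrete HMM (not the semi-Markov model of Section 3, and not the MATLAB routines used in this work). The function names, the scaling scheme and the toy parameters are assumptions made for this example.

```python
import math


def forward_backward(obs, pi, A, B):
    """Scaled forward and backward passes; also returns log P(obs)."""
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[1.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(n):
            prior = pi[j] if t == 0 else sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prior * B[j][obs[t]]
        scale[t] = sum(alpha[t])                     # normalizer for time t
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    return alpha, beta, scale, sum(math.log(c) for c in scale)


def baum_welch_step(obs, pi, A, B):
    """One EM update; returns new (pi, A, B) and the log likelihood of the old parameters."""
    n, k, T = len(pi), len(B[0]), len(obs)
    alpha, beta, scale, loglik = forward_backward(obs, pi, A, B)
    # gamma[t][i]: posterior probability of being in state i at time t
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        visits = sum(gamma[t][i] for t in range(T - 1))   # expected visits to state i
        for j in range(n):
            # expected number of i -> j transitions
            moves = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                        / scale[t + 1] for t in range(T - 1))
            new_A[i][j] = moves / visits
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == s)
              / sum(gamma[t][j] for t in range(T)) for s in range(k)] for j in range(n)]
    return gamma[0][:], new_A, new_B, loglik


# Toy two-state, two-symbol model and observation sequence.
obs = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0]
pi, A, B = [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.8, 0.2], [0.3, 0.7]]
for iteration in range(5):
    pi, A, B, loglik = baum_welch_step(obs, pi, A, B)
    print(iteration, loglik)
```

Printing the log likelihood at each iteration mirrors the diagnostic output described in Section 6: the values should never decrease, and training can stop once the increase falls below a chosen tolerance.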
References:
[1] Aydin, Z., Altunbasak, Y. & Borodovsky, M. (2004). Protein secondary structure prediction with semi-Markov
HMM. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Montreal, CA, May 2004, in press.
[2] Chou, P. Y. & Fasman, G. D. (1974). Prediction of protein conformation. Biochemistry 13, 222-245.
[3] Di Francesco, Garnier, J. & Munson, P. J. Improving protein secondary structure prediction with aligned
homologous sequences.
[4] Frishman, D. & Argos, P. (1996). Incorporation of non-local interactions in protein secondary structure
prediction from amino acid sequence. Protein Engineering 9(2), 133-142.
[5] Garnier, J., Osguthorpe, D. J. & Robson, B. (1978). Analysis and implications of simple methods for
predicting the secondary structure of globular proteins. J. Mol. Biol. 120, 97-120.
[6] Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition.
[7] Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70 percent accuracy.
J. Mol. Biol. 232, 584-599.
[8] Rost, B., Sander, C. & Schneider, R. (1994). Redefining the goals of protein secondary structure prediction.
J. Mol. Biol. 235, 13-26.
[9] Salamov, A. A. & Solovyev, V. V. (1995). Prediction of protein secondary structure by combining nearest-
neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247, 11-15.
[10] Schmidler, S. C., Liu, J. S. & Brutlag, D. L. Bayesian Segmentation of Protein Secondary Structure.
[11] EVA, “List of Sequence-unique pdb files”.
