Signals of Evolution: Conservation, Specificity Determining Positions and Coevolution
Elin Teppa (1), Diego Zea (2), Morten Nielsen (1, 3) and Cristina Marino Buslje (1)
(1) Structural Bioinformatics Unit, Leloir Institute Foundation
(2) Structural Bioinformatics Group, National University of Quilmes
(3) Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark

INTRODUCTION

Protein sequences evolve under several constraints, and each constraint leads to a specific pattern of conservation and variation in protein sequences. In this study we focused on the analysis of three major evolutionary signals: conservation, specificity determining positions and coevolution between residues. These signals are the results of different evolutionary mechanisms and have been used by different bioinformatics methods to predict functionally important sites.

Fully conserved positions in a Multiple Sequence Alignment (MSA) are interpreted as residues important for the structure and function of the protein. Early computational methods used this information to predict functionally important sites, including catalytic residues. Nowadays, more factors are taken into account to improve the performance of prediction methods.

Other positions show a more subtle pattern of conservation: they are conserved within a group of sequences (sub-family) but may change in another group. Such positions are responsible for protein specificity, e.g. ligand binding or protein-protein interaction, and are named Specificity-Determining Positions (SDPs). The classification of proteins into groups can be defined according to different criteria, e.g. sequence identity, phylogeny or functional similarity, among others. SDPs are suggested to be located in the proximity of the catalytic residues in order to carry out their role of defining the substrate specificity.

Coevolution between residues is another signal that can be extracted from MSAs. Coevolution is the result of compensatory mutations: coevolving residues are those that have undergone concerted changes to overcome a common selection pressure. Owing to the limitations on the amino acid diversity in the proximity of an active site, the catalytic residues carry a particular signature defined by a close proximity network of residues with high mutual information.

In summary, in this study we consider different methods that attempt to capture information from three different evolutionary signals. They have in common the prediction of functionally important sites and are capable of detecting the catalytic residues or of pointing to the residues near the catalytic residues.

Disentangling the function of different positions in an alignment will allow us to create methods that take advantage of the different information contained in an alignment. Such methods could be used for the deeper study of any protein. They would also help to produce better and more accurate annotations of proteins with unknown function.

MATERIALS AND METHODS

The dataset was constructed based on the Catalytic Site Atlas (CSA) database [1] and the Pfam database [2]. A total of 434 protein families, which together contain 1212 catalytic residues, have been studied.

For a given family one reference PDB entry was selected, and the MSAs were prepared by removing redundant sequences at the level of 62% identity and trimming deletions and insertions across the whole alignment so as to preserve the continuity of the reference sequence. In addition, all positions with >50% gaps, as well as sequences covering <50% of the reference sequence length, were removed.

Conservation: The Kullback-Leibler conservation score was used.

Mutual Information: Mutual Information (MI) was calculated as described in [3]. MI gives a value for each pair of residues in an MSA. We calculated a cumulative Mutual Information score (cMI) for each residue as the sum of the MI values above a certain threshold over every amino acid pair in which the particular residue appears.

Evolutionary Tracing: The ET method identifies invariant, group-specific residues by partitioning the phylogenetic tree into subgroups of similar sequences [4]. The ET iv score represents conservation within groups in a qualitative way and predicts SDPs, whereas the ET rv score incorporates entropy as a quantitative measure of conservation, giving a rank of positions by their relative importance.

SDPfox: This method predicts SDPs in a phylogeny-independent manner. It first identifies specificity groups by assigning each protein to a group, iterating until convergence. This classification allows the prediction of SDPs that end up separated on a phylogenetic tree [5].

XDET: This method implements the mutational behaviour algorithm, based on the comparison of the mutational behaviour of a position with the mutational behaviour of the whole alignment. The principle is that positions showing a family-dependent conservation pattern would have a similar mutational behaviour to the whole family [6].

Proximity scores for each method were calculated, for a given amino acid, as the sum of the scores of the residues within a distance ≤ 6 Å of it in the 3D structure. The predictive performance in detecting catalytic residues using the proximity scores was evaluated in terms of the area under the ROC curve (AUC) per family.

RESULTS AND DISCUSSION

We calculated the Spearman rank correlation between methods to find out whether they capture different pieces of information (Fig. 1). The analysis shows, as expected, a strong correlation between ET rv and conservation. This is because the former includes conservation information in its score. Surprisingly, the correlation between ET iv, SDPfox and XDET is lower than expected. For the first two methods, this can be understood by the strong dependence of the results on the sequence clustering method, which is phylogenetically based for ET iv and functionally based for SDPfox. On the other hand, the prediction of SDPs by XDET is based on the comparison of the mutational behavior of a position with respect to the family mutational trend. Such an approach may detect important positions (as they show the same behavior as the evolution of the whole family), but this is not enough evidence to attribute their biological importance to the determination of the specificity of the enzyme.

Figure 1: Heat map of the Spearman rank correlation between methods.

We also analyzed to what extent the best-scoring residues predicted by the different methods overlap. We took into account the best N scores for each method, where N is equal to 10% of the total length of the sequence. Figure 2 shows the average of the overlapping residues for the 434 families. Except for ET rv with conservation and ET iv, the methods differ in which residues they rank as most important for the family.

Figure 2: Average percentage of residues predicted in common between methods, considering the top 10% ranked positions.

As an example, Fig. 3 shows the highest scores for the Phosphofructokinase 1 family mapped onto the 3D structure of the reference protein.

Figure 3: Mapping of the predicted functionally important sites using six different prediction scores. Plotted is the cartoon representation of PDB: 1PFK. The top 10% prediction scores are represented in green. The catalytic residues are shown as red sticks, and the experimentally known SDPs are shown as blue sticks.

Method      Predictive performance
pC          0.83899
pMI         0.80342
pET rv      0.86774
pET iv      0.63360
pSDPfox     0.63602

Table 1: Predictive performance for detecting catalytic residues in terms of AUC value on the 434 Pfam entries.

We demonstrate that the methods capture different information and assign their highest scores to different residue positions. An exception is the ET rv score, which shows a strong correlation with conservation.

The pET rv, pCons and pMI scores have shown good performance in detecting catalytic residues. However, only pMI could be combined with other scores to improve the prediction of catalytic residues, because it has a low correlation with the other measures.

A weakness of the SDP prediction methods is that some conserved positions could mask SDPs, which would be detected if more sequences became available for the family.

There is a lack of publicly available SDP databases, which hinders the direct testing of methods for their prediction.

The SDP prediction methods, even with different approaches, share the use of conserved amino acids as indicators of likely functional significance. In this context, coevolution is less representative of the global evolution of a whole family or subfamily. It thus provides information on specific events that required a common adaptation of two or more residues, and it can be detected even in phylogenetically divergent families.

REFERENCES

1 Porter, C.T., et al., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl. Acids Res., 2004. 32(suppl_1): p. D129-133.
2 Finn, R.D., et al., The Pfam protein families database. Nucl. Acids Res., 2008. 36(suppl_1): p. D281-288.
3 Buslje, C.M., et al., Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information. Bioinformatics, 2009. 25(9): p. 1125-1131.
4 Lichtarge, O., et al., A family of Evolution-Entropy Hybrid Methods for ranking protein residues by importance. J. Mol. Biol., 2004. 336: p. 1265-82.
5 Kalinina, O.V., et al., An automated stochastic approach to the identification of the protein specificity determinants and functional subfamilies. AMB, 2010: p. 5-29.
6 Del Sol, A., et al., Automatic Methods for Predicting Functionally Important Residues. J. Mol. Biol., 2003. 326(4): p. 1289-1302.
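
The cumulative Mutual Information (cMI) and proximity scores described under MATERIALS AND METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy MI matrix, the threshold value and the coordinates are hypothetical. The sketch also assumes one representative 3D coordinate per residue and excludes the residue itself from its own proximity sum, neither of which the poster specifies.

```python
import math


def cumulative_mi(mi, threshold):
    """cMI per residue: sum of the MI values above `threshold` over all
    pairs in which the residue appears. `mi` is a symmetric square matrix
    given as a list of lists."""
    n = len(mi)
    return [
        sum(mi[i][j] for j in range(n) if j != i and mi[i][j] > threshold)
        for i in range(n)
    ]


def proximity_scores(scores, coords, cutoff=6.0):
    """Proximity score per residue: sum of the scores of the residues
    lying within `cutoff` angstroms of it.

    Assumptions (not stated in the poster): each residue is represented
    by a single (x, y, z) coordinate, and a residue does not contribute
    to its own proximity score."""
    n = len(scores)
    prox = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j != i and math.dist(coords[i], coords[j]) <= cutoff:
                total += scores[j]
        prox.append(total)
    return prox


# Toy data: three residues placed on a line (hypothetical values).
toy_mi = [
    [0.0, 7.0, 2.0],
    [7.0, 0.0, 8.0],
    [2.0, 8.0, 0.0],
]
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]

cmi = cumulative_mi(toy_mi, threshold=6.5)   # threshold chosen arbitrarily
p_mi = proximity_scores(cmi, toy_coords)     # proximity version of cMI
print(cmi, p_mi)
```

In this toy case only residues 0 and 1 lie within 6 Å of each other, so each receives the other's cMI as its proximity score and residue 2 receives zero. The same `proximity_scores` function could be applied to any per-residue score, such as a conservation score, before ranking positions for the per-family AUC evaluation.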