Gualberto Asencio Cortés
Supervisor: Jesús S. Aguilar Ruíz
Bioinformatics Group
School of Engineering
Pablo de Olavide Uni...

Protein Distance Map Prediction based on a Nearest Neighbors Approach


Seminar of February 9, 2012 for the ICOS group in the University of Nottingham.

Abstract: The Protein Structure Prediction (PSP) problem is to determine the three-dimensional structure of a protein using only the information contained in its amino acid sequence. The PSP problem is one of the most important open problems in structural bioinformatics, because the 3D structure determines the protein function, and knowing it would be of enormous help for designing new drugs for diseases such as cancer or Alzheimer's. Two data structures are widely used to represent protein structures: contact maps and distance maps. Contact maps represent binary proximities (contact or non-contact) between each pair of amino acids of a protein. Distance maps represent the distances between these amino acid pairs. However, contact and distance maps are very difficult to predict. In fact, the accuracy achieved by protein contact map predictors at Top L/5 in the last Critical Assessment of Techniques for Protein Structure Prediction competition (CASP9) is only up to approximately 22%, and clearly must be improved. In this seminar, the author will present an approach to predict protein structures based on a nearest neighbors scheme. In this approach, protein fragments are assembled according to their physico-chemical similarities, using information extracted from known protein structures. This method produces a distance map, which provides more information about the structure of a protein than a contact map and which can be converted into a contact map with different thresholds. The prediction procedure starts with a feature selection on the 544 amino acid physico-chemical properties of the AAindex repository, resulting in different property sets that were used for the predictions. The author will show some recent results obtained with this approach and, finally, outline some of his current research and future work.
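
The conversion from a distance map to a contact map mentioned in the abstract can be illustrated with a minimal sketch. It assumes that two residues are in contact when their distance is at most the threshold; the function names, the `<=` convention and the toy coordinates are illustrative assumptions, not taken from the seminar.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances between 3D points (one point per residue)."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(dist_map, threshold=8.0):
    """Binary map: 1 where the distance is at most `threshold`, else 0.

    The <= convention is an assumption made for this sketch.
    """
    return [[1 if d <= threshold else 0 for d in row] for row in dist_map]


# Hypothetical toy chain: four residues on a straight line, 3.8 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
toy_distances = distance_map(toy_coords)

contacts_8 = contact_map(toy_distances, threshold=8.0)  # residues 0 and 2 in contact
contacts_4 = contact_map(toy_distances, threshold=4.0)  # only sequence neighbors
```

The same distance map yields both contact maps, which is the sense in which a distance map carries more information: the threshold can be chosen after the prediction rather than before it.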

  • Proteins are macromolecules composed of one or more chains of amino acids. There are 20 natural amino acids.
  • PSP is very useful for knowing protein functions, because protein functions are determined by the protein structure.
    It is
  • CASP is a very important biennial competition among protein structure predictors.
  • Our goal is to compete in CASP, using the same evaluation scheme, and to try to surpass this 22%.
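
A Top L/x precision of the kind quoted for CASP can be sketched as follows. This is an illustrative approximation under stated assumptions, not the official CASP procedure: residue pairs are ranked by smallest predicted distance (the slides do not fix a ranking criterion for this), and the function name, the default sequence separation and the toy data are all hypothetical.

```python
def top_l_over_x_precision(pred_dist, true_dist, x=5, threshold=8.0, min_sep=24):
    """Fraction of the L/x closest predicted pairs that are true contacts.

    pred_dist and true_dist are square distance maps of the same protein.
    Only pairs at least `min_sep` positions apart in the sequence are ranked.
    """
    length = len(pred_dist)
    candidates = [
        (pred_dist[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    candidates.sort()  # smallest predicted distance first (assumed ranking)
    top = candidates[: max(1, length // x)]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if true_dist[i][j] <= threshold)
    return hits / len(top)


# Hypothetical toy protein of 10 residues spaced 4 angstroms apart on a line.
n = 10
true_dist = [[4.0 * abs(i - j) for j in range(n)] for i in range(n)]
pred_dist = [row[:] for row in true_dist]
pred_dist[0][9] = 1.0  # one deliberately wrong prediction
score = top_l_over_x_precision(pred_dist, true_dist, x=5, min_sep=2)
```

With L = 10 and x = 5 only two pairs are scored, so a single confident but wrong prediction halves the result; this is why the ranking criterion matters as much as the distances themselves.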
  • Protein Distance Map Prediction based on a Nearest Neighbors Approach

    1. Protein Distance Map Prediction based on a Nearest Neighbors Approach. Current state of research. Gualberto Asencio Cortés. Supervisor: Jesús S. Aguilar Ruíz. Bioinformatics Group, School of Engineering, Pablo de Olavide University, Seville, Spain. Host: Jaume Bacardit. February 9, 2012.
    2. Outline: 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
    3. Outline: 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
    4. Proteins and amino acids
    5. Protein structure representations: 3D model, distance map and contact map (threshold = 8 Å). Example: 1M3Y (4 chains of 413 amino acids).
    6. Protein Structure Prediction (PSP). Diagram labels: protein structures; training; new sequence.
    7. Why is PSP important? • Knowing protein functions • Drug design for diseases such as cancer or Alzheimer's • Protein docking and virtual screening • Protein engineering • …
    8. Why another contact/distance map predictor? • Currently a hot topic in bioinformatics journals • Current results reach a precision of only up to 22% for contact prediction in the last CASP (CASP9), and clearly must be improved ▫ CASP competition: CASP10 in 2012! http://predictioncenter.org/
    9. Why distance maps? • Why a threshold? Why 8 angstroms? • Distance maps store more information • Conversion to contact maps is very easy
    10. Outline: 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
    11. PDMpred: Prediction process. Diagram labels: training set of protein structures; training data; test data; distance maps; all training fragments; d; training profiles for amino acid pairs (A-A, A-R, ..., V-V). Each residue $s_i \in \{A, R, N, D, B, C, Q, E, Z, G, H, I, L, K, M, F, P, S, T, W, Y, V\}$.
    12. PDMpred: Prediction process (same diagram as slide 11, with the profile layouts added).
        Profile 1: $L, P_1, P_2, \ldots, P_m, D$
        Profile 2: $B_1, E_1, \ldots, B_m, E_m, D$
        For a fragment whose end residues are at positions $b$ and $e$, and for every property $i \in \{1..m\}$:
        $P_i = \frac{1}{e-b-1} \sum_{j=b+1}^{e-1} P_i(s_j)$
        $B_i = P_i(s_b) + \sum_{j=1,\, j \neq b}^{L} \frac{P_i(s_j)}{L\,|b-j|}$
        $E_i = P_i(s_e) + \sum_{j=1,\, j \neq e}^{L} \frac{P_i(s_j)}{L\,|e-j|}$
        Note that prediction vectors represent fragments of different lengths, but these lengths are not included in them. From the point of view of data mining, $B_i$ and $E_i$ are the attributes of the training instances and $D$ is the class to predict.
    13. PDMpred: Prediction process. Diagram labels: training set of protein structures; test set of protein sequences; training data; test data; distance maps (unknown, shown as "?", for the test set); all training fragments; all test fragments; d; training profiles (A-A, A-R, ..., V-V); test profiles (A-A, ..., V-V).
    14. PDMpred: Prediction process (same diagram as slide 13). A neighbor search is run for each test prediction vector $(t_1, \ldots, t_n, ?)$ over the training vectors $(a_1, \ldots, a_n, D_a)$, $(b_1, \ldots, b_n, D_b)$, ...; the step that fills the test distance maps is labeled "average" in the diagram.
    15. PDMpred: Evaluation measures. Five measures are used: precision, recall, accuracy, specificity and the Matthews Correlation Coefficient (MCC). Recall has been used in other protein prediction methods [14]. Accuracy, specificity and MCC may often provide a much more balanced evaluation of the prediction than, for instance, the percentages [15]. The following formulas (2, 3, 4, 5, 6) define these five measures.
        $\mathrm{Precision} = \frac{TP}{TP + FP}$ (2)
        $\mathrm{Recall} = \frac{TP}{TP + FN}$ (3)
        $\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$ (4)
        $\mathrm{Specificity} = \frac{TN}{TN + FP}$ (5)
        $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ (6)
    16. Outline: 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
    17. Recent results
        ① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181. http://www.upo.es/eps/asencio/asppred
        ② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based approach for viral protein structure prediction. In: 9th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBio 2011), Torino, Italy. Lecture Notes in Computer Science 6623, p. 69-76, Springer 2011, ISBN 978-3-642-20388-6.
        ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. In: 10th European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBio 2012), Málaga, Spain (accepted).
    18. ① G Asencio, J S Aguilar-Ruiz (2011) Predicting protein distance maps according to physicochemical properties. Journal of Integrative Bioinformatics 8(3): 181. Selected AAindex properties:
        BUNA790101 alpha-NH chemical shifts (Bundi-Wuthrich, 1979)
        BUNA790103 Spin-spin coupling constants 3JHalpha-NH (Bundi-Wuthrich, 1979)
        CHAM820102 Free energy of solution in water, kcal/mole (Charton-Charton, 1982)
        FAUJ880111 Positive charge (Fauchere et al., 1988)
        FAUJ880112 Negative charge (Fauchere et al., 1988)
        GARJ730101 Partition coefficient (Garel et al., 1973)
        JOND750102 pK (-COOH) (Jones, 1975)
        KARP850103 Flexibility parameter for two rigid neighbors (Karplus-Schulz, 1985)
        KHAG800101 The Kerr-constant increments (Khanarian-Moore, 1980)
        MAXF760103 Normalized frequency of zeta R (Maxfield-Scheraga, 1976)
        PRAM820101 Intercept in regression analysis (Prabhakaran-Ponnuswamy, 1982)
        QIAN880139 Weights for coil at the window position of 6 (Qian-Sejnowski, 1988)
        RICJ880101 Relative preference value at N" (Richardson-Richardson, 1988)
        RICJ880104 Relative preference value at N1 (Richardson-Richardson, 1988)
        RICJ880114 Relative preference value at C1 (Richardson-Richardson, 1988)
        RICJ880117 Relative preference value at C" (Richardson-Richardson, 1988)
        SUEM840102 Zimm-Bragg parameter sigma x 1.0E4 (Sueki et al., 1984)
        TANS770102 Normalized frequency of isolated helix (Tanaka-Scheraga, 1977)
        TANS770108 Normalized frequency of zeta R (Tanaka-Scheraga, 1977)
        VASM830101 Relative population of conformational state A (Vasquez et al., 1983)
        VELV850101 Electron-ion interaction potential (Veljkovic et al., 1985)
        WERD780102 Free energy change of epsilon(i) to epsilon(ex) (Wertz-Scheraga, 1978)
        WERD780103 Free energy change of alpha(Ri) to alpha(Rh) (Wertz-Scheraga, 1978)
        YUTK870103 Activation Gibbs energy of unfolding, pH 7.0 (Yutani et al., 1987)
        AURR980120 Normalized positional residue frequency at helix termini C4' (Aurora-Rose, …
        NADH010107 Hydropathy scale based on self-information values in the two-state model
        MONM990201 Averaged turn propensities in a transmembrane helix (Monne et al., 1999)
        MITS020101 Amphiphilicity index (Mitaku et al., 2002)
        WILM950104 Hydrophobicity coefficient in RP-HPLC, C18 with 0.1%TFA/2-PrOH/MeCN/H2O
        DIGM050101 Hydrostatic pressure asymmetry index, PAI (Di Giulio, 2005)
        Profile: the profile layouts and the $P_i$, $B_i$, $E_i$ equations of slide 12 are repeated on this slide.
        Feature selection (FS): 30 properties from the 544 of AAindex (http://www.genome.jp/aaindex/).
        FS algorithm: Relief evaluation algorithm + Ranker search algorithm.
    19. ① Datasets and configuration
        • 10-fold cross validation
        • Beta-carbon distances
        • No minimum sequence separation
        • Distance threshold (cut-off) of 8 angstroms
        • Five datasets:
          1. 20 random proteins from PDB, identity ≤ 30%
          2. 118 proteins from CullPDB, identity ≤ 10%
          3. 170 proteins from PDBselect, identity ≤ 25%
          4. 221 proteins from CullPDB, identity ≤ 5%
          5. 5130 proteins from PDBselect, identity ≤ 25%
        • PDB (Protein Data Bank): http://www.rcsb.org
        • CullPDB: http://dunbrack.fccc.edu/PISCES.php
        • PDBselect: http://bioinfo.tg.fh-giessen.de/pdbselect/
        http://www.upo.es/eps/asencio/asppred
    20. ① Results (excerpt from Journal of Integrative Bioinformatics 2011, http://journal.imbio.de/)
        Table 2: Efficiency of our method at 4 Å of distance threshold (µ ± σ values). Only the rows for datasets 3 to 5 are visible on the slide; the column header is cropped and is presumably the same as in Table 3.
        Dataset   Recall        Precision     Accuracy      Specificity
        3         0.48 ± 0.04   0.43 ± 0.05   0.99 ± 0.01   0.99 ± 0.01
        4         0.40 ± 0.05   0.41 ± 0.05   0.99 ± 0.01   0.99 ± 0.01
        5         0.14 ± 0.08   0.14 ± 0.08   0.99 ± 0.05   0.99 ± 0.05
        Table 3: Efficiency of our method at 8 Å of distance threshold (µ ± σ values).
        Dataset   Recall        Precision     Accuracy      Specificity
        1         0.39 ± 0.06   0.41 ± 0.08   0.97 ± 0.03   0.98 ± 0.01
        2         0.39 ± 0.07   0.40 ± 0.07   0.95 ± 0.01   0.97 ± 0.02
        3         0.38 ± 0.02   0.38 ± 0.02   0.95 ± 0.02   0.97 ± 0.01
        4         0.40 ± 0.03   0.41 ± 0.03   0.95 ± 0.01   0.97 ± 0.01
        5         0.51 ± 0.11   0.51 ± 0.11   0.92 ± 0.06   0.95 ± 0.07
        For a threshold of 8 Å, recall and precision are basically the same in experiments 1 to 4. We found that our predictor does not need many proteins for training; it seems to find good similar fragments even in small training sets. In experiment 5 we achieved better recall and precision than in the other experiments; however, the standard deviation values are higher. This may be due to the great number of different types of proteins (structural classes or number of domains, for instance) in the whole PDBselect. We included detailed information about each protein in the five experiments as supplemental material at http://www.upo.es/eps/asencio/asppred. We indicate for each protein ...
        Table 4: Study of the number (K) of nearest training profiles.
        Dataset   K    Recall        Precision     Accuracy      Specificity
        4         1    0.40 ± 0.03   0.41 ± 0.03   0.95 ± 0.01   0.97 ± 0.01
                  3    0.40 ± 0.04   0.72 ± 0.01   0.96 ± 0.01   0.97 ± 0.00
                  5    0.39 ± 0.04   0.81 ± 0.01   0.97 ± 0.00   0.98 ± 0.00
                  7    0.39 ± 0.04   0.84 ± 0.00   0.99 ± 0.00   0.99 ± 0.00
                  9    0.38 ± 0.04   0.86 ± 0.00   0.99 ± 0.00   0.99 ± 0.00
                  11   0.38 ± 0.04   0.87 ± 0.00   0.99 ± 0.00   0.99 ± 0.00
                  13   0.37 ± 0.04   0.88 ± 0.00   0.99 ± 0.00   0.99 ± 0.00
                  15   0.37 ± 0.04   0.88 ± 0.00   0.99 ± 0.00   0.99 ± 0.00
    21. ② G Asencio, J S Aguilar-Ruiz, A E Marquez (2011) A nearest neighbour-based approach for viral protein structure prediction (EvoBio 2011). Selected AAindex properties:
        Residue accessible surface area in folded protein (Chothia, 1976)
        Average relative fractional occurrence in ER(i-1) (Rackovsky-Scheraga, 1982)
        RF value in high salt chromatography (Weber-Lacey, 1978)
        Profile: the profile layouts and the $P_i$, $B_i$, $E_i$ equations of slide 12 are repeated on this slide.
        Feature selection (FS): 3 properties from the 544 of AAindex.
        FS algorithm: BIRS algorithm + CFS search algorithm.
        Roberto Ruiz Sánchez, José Cristóbal Riquelme Santos, Jesús S. Aguilar-Ruiz. Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition 39(12): 2383-2392 (2006).
    22. ② Datasets and configuration
        • Leave-one-out cross validation (previously 10-fold CV)
        • Minimum sequence separation of 7 amino acids (previously not used)
        • Distance threshold (cut-off) of 8 angstroms
        • Beta-carbon distances
        • Training/test protein dataset: viral capsid proteins from PDB, identity ≤ 30%, 63 proteins with a maximum length of 1284 amino acids.
    23. ② Results (figure only; no text on the slide)
    24. ③ G Asencio, J S Aguilar-Ruiz, A E Marquez, R Ruiz, C E Santiesteban (2012) Prediction of mitochondrial matrix protein structures based on feature selection and fragment assembly. (EvoBio 2012) (accepted)
        Table 1. The 16 physico-chemical properties of amino acids considered from AAindex
        CHOC760104 Proportion of residues 100% buried
        LEVM760104 Side chain torsion angle phi(AAAR)
        MEIH800103 Average side chain orientation angle
        PALJ810107 Normalized frequency of alpha-helix in all-alpha class
        QIAN880112 Weights for alpha-helix at the window position of 5
        WOLS870101 Principal property value z1
        ONEK900101 Delta G values for the peptides extrapolated to 0 M urea
        BLAM930101 Alpha helix propensity of position 44 in T4 lysozyme
        PARS000101 p-Values of mesophilic proteins based on the distributions of B values
        NADH010102 Hydropathy scale based on self-information values in the two-state model (9% accessibility)
        SUYM030101 Linker propensity index
        WOLR790101 Hydrophobicity index
        JACR890101 Weights from the IFH scale
        MIYS990103 Optimized relative partition energies - method B
        MIYS990104 Optimized relative partition energies - method C
        MIYS990105 Optimized relative partition energies - method D
        Profile: the Profile 2 layout and the $P_i$, $B_i$, $E_i$ equations of slide 12 are repeated on this slide.
        Physico-chemical feature selection: to use the smallest and most effective set of properties, a feature selection was performed over the 544 amino acid properties of AAindex, using BARS.
        Feature selection (FS): 16 properties from the 544 of AAindex.
        FS algorithm: BARS algorithm + CFS search algorithm.
        Roberto Ruiz, José C. Riquelme, and Jesús S. Aguilar-Ruiz. Best agglomerative ranked subset for feature selection. Journal of Machine Learning Research – Proceedings Track, 4:148–162, 2008.
    25. ③ Datasets and configuration
        • Leave-one-out cross validation
        • Minimum sequence separation of 7 amino acids
        • Distance threshold (cut-off) of 8 angstroms
        • Beta-carbon distances
        • Training/test protein dataset: mitochondrial matrix proteins from PDB, identity ≤ 30%, 74 proteins with a maximum length of 1094 amino acids.
    26. ③ Results
        Table 3. Efficiency of our method predicting mitochondrial matrix proteins
        Protein set            Recall   Precision   Accuracy   Specificity   MCC
        All proteins (74)      0.80     0.79        0.97       0.97          0.82
        L ≤ 300 (20)           0.77     0.76        0.98       0.98          0.75
        300 < L ≤ 450 (27)     0.84     0.83        0.99       0.99          0.83
        L > 450 (27)           0.77     0.76        0.95       0.95          0.82
        ... a cross validation, cut-off of 8 angstroms and minimum sequence separation of 7 amino acids, achieved a precision value of 0.11 for proteins of more than 300 amino acids.
        Fig. 2. Predicted distance maps for the mitochondrial matrix proteins 1TG6 (a, 277 amino acids) and 3BLX (b, 349 amino acids) with their color scale (c).
    27. ③ Results
        Table 4. Comparison at 8 Å with RBFNN on the same benchmark
                            RBFNN                    PDMpred
        PDB code (length)   Np     Nd      Ap        Np      Nd      Ap
        1TTF (94)           376    1421    26.46     1307    1421    91.96
        1E88 (160)          1006   3352    30.01     3075    3352    91.73
        1NAR (290)          3346   10524   31.79     1797    10524   17.07
        1BTJ B (337)        3796   14283   26.58     14026   14283   98.20
        1J7E (458)          6589   25026   26.33     23407   25026   93.53
        Average                            27.67                     78.49
        Np: predicted numbers, the count of the contacts predicted by the algorithm; Nd: desired numbers, the total number of contacts; Ap: prediction recall (%). The contact threshold was set at 8 Å.
    28. Outline: 1. Motivation 2. Our proposal 3. Recent results 4. Conclusions and future work
    29. Conclusions • A new protein distance map predictor has been developed using a nearest neighbors-based approach and feature selections of physico-chemical properties. • We predict distance maps, which provide more information than contact maps and whose conversion to contact maps is very easy. • We achieved up to 0.80 of recall and 0.79 of precision with a minimum separation of 7 amino acids on some non-homologous protein sets. • Our results are a large improvement (50.82% better) in recall over the results of a previous study (Zhang et al. 2005). Shown on the slide: A nearest neighbour-based approach for viral protein structure prediction, by Gualberto Asencio Cortés, Jesús S. Aguilar-Ruíz and Alfonso E. Márquez Chamorro.
    30. Current research • Five new measures implemented: ▫ Recursive Convex Hull of amino acids (RCH) ▫ Solvent Accessibility (SA) ▫ Secondary Structure (SS) from PSI-PRED ▫ Coordination Number (CN) ▫ Position-Specific Scoring Matrix (PSSM) from PSI-BLAST • Protein set used in the current experiments (from ICOS, named INF2010): ▫ 3262 proteins from PDB-REPRDB, identity ≤ 30%, using 90% of the set for training and 10% for test.
    31. Current research • Distance map post-processing • New feasibility measures ▫ Based on the geometry of the predicted distance maps ▫ Using triangle inequalities
    32. Future work • Perform feature selections over RCH, SA, SS, CN and PSSM with different statistics and window sizes. • Build a Top L/x ranking (as in CASP) of predicted contacts (from predicted distances) using the standard deviation of the distances of the nearest neighbors (profiles). • Use 24 amino acids as minimum sequence separation and CASP9 target domains in the free modelling category. • Divide training profiles into bags according to the evolutionary information (PSSM) of amino acid pairs. Then predict distances using only the appropriate bag according to the test fragment. • Use protein domains for training instead of whole sequences.
    33. Thank you. Acknowledgements: Bioinformatics group in Pablo de Olavide University; ICOS group in the University of Nottingham. Contact: guaasecor@upo.es
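
The prediction step of slides 11 to 14 and the evaluation measures of slide 15 can be sketched together as follows. This is a simplified illustration under stated assumptions, not the PDMpred implementation: positions are 0-based, the neighbor search uses the Euclidean metric, the K nearest distances are averaged with equal weight, the grouping of profiles by amino acid pair is omitted, and the property table `toy_scale` is hypothetical rather than an AAindex entry.

```python
import math


def end_feature(seq, pos, prop):
    """B_i (pos = b) or E_i (pos = e): the property value at `pos` plus the
    values of all other residues, each divided by L * |pos - j|."""
    length = len(seq)
    total = prop[seq[pos]]
    for j, residue in enumerate(seq):
        if j != pos:
            total += prop[residue] / (length * abs(pos - j))
    return total


def prediction_vector(seq, b, e, props):
    """Return [B_1, E_1, ..., B_m, E_m] for the fragment with ends b and e."""
    vector = []
    for prop in props:
        vector.append(end_feature(seq, b, prop))
        vector.append(end_feature(seq, e, prop))
    return vector


def knn_distance(test_vector, training, k=1):
    """Average the distances D of the k training vectors nearest to test_vector.

    `training` is a list of (vector, D) pairs. The Euclidean metric and the
    unweighted average are assumptions of this sketch.
    """
    if not training:
        raise ValueError("training set is empty")
    ranked = sorted(training, key=lambda item: math.dist(item[0], test_vector))
    nearest = ranked[:k]
    return sum(d for _, d in nearest) / len(nearest)


def _ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty class."""
    return numerator / denominator if denominator else 0.0


def measures(tp, fp, fn, tn):
    """Precision, recall, accuracy, specificity and MCC from contact counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "specificity": _ratio(tn, tn + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical data: one made-up property and three made-up training vectors.
toy_scale = {"A": 1.0, "G": 0.0}
vec = prediction_vector("AGA", 0, 2, [toy_scale])
toy_training = [([0.0, 0.0], 5.0), ([1.0, 1.0], 7.0), ([9.0, 9.0], 20.0)]
predicted = knn_distance(vec, toy_training, k=2)
scores = measures(tp=8, fp=2, fn=2, tn=88)
```

Because the vector holds only the $B_i$ and $E_i$ values, fragments of any length map to vectors of the same size, which is what allows a single neighbor search over all training fragments.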
