The Infobiotics Contact Map predictor at CASP9


Description of the Contact Map prediction method of the Infobiotics team that participated in CASP9

Published in: Technology, Education

  1. The Infobiotics Contact Map Predictor at CASP9
       J. Bacardit (1,2), P. Widera (1) and N. Krasnogor (1)
       (1) School of Computer Science, University of Nottingham
       (2) School of Biosciences, University of Nottingham
       jaume.bacardit
  2. Contact Map
       • Two residues of a chain are said to be in contact if the distance between them is less than a certain threshold
       • The contacts of a protein can be represented by a binary matrix (1 = contact, 0 = non-contact)
       • Plotting this matrix reveals many characteristics of the protein structure
       • CM prediction is used in many 3D protein structure prediction (PSP) methods (e.g. I-Tasser)
       [Figure: contact map plot; labels: contact, helices, sheets]
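The definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the 8 Å threshold, the use of one coordinate per residue, and the choice to leave the diagonal at 0 are all assumptions, since the slide only says "a certain threshold".

```python
import math

def contact_map(coords, threshold=8.0):
    """Return a binary matrix: 1 where two residues are closer than `threshold`.

    `coords` holds one (x, y, z) point per residue. The 8.0 default is an
    assumed value; the slide does not state the threshold used.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                # The matrix is symmetric: contact(i, j) == contact(j, i).
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Toy chain: four residues on a straight line, 5 Å apart.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

In this toy chain only sequence neighbours fall under the assumed threshold, so the ones sit next to the diagonal.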
  3. Steps
       • Prediction of
           • Secondary structure (using PSIPRED)
           • Solvent Accessibility
           • Recursive Convex Hull
           • Coordination number
       • Integration of all these predictions plus other sources of information
       • Final CM prediction (using BioHEL)
       [Slide annotation: Using BioHEL [Bacardit et al., 09]]
  4. Prediction of RCH, SA and CN
       • Dataset of 3262 protein chains created using PDB-REPRDB with:
           • A resolution of less than 2 Å
           • Less than 30% sequence identity
           • No chain breaks or non-standard residues
       • 90% of this set was used for training (~490000 residues)
       • 10% was used for testing
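The selection criteria and the 90/10 split listed on this slide could be expressed roughly as follows. In the slides the filtering was done by PDB-REPRDB itself; the `Chain` record, its field names, and the random chain-level split are hypothetical and serve only to make the criteria concrete.

```python
import random
from dataclasses import dataclass

@dataclass
class Chain:
    # Hypothetical record; field names are illustrative, not from PDB-REPRDB.
    chain_id: str
    resolution: float      # in Å
    max_identity: float    # highest sequence identity to another kept chain, in %
    has_breaks: bool
    has_nonstandard: bool

def passes_filters(chain):
    """Apply the four criteria listed on the slide."""
    return (chain.resolution < 2.0
            and chain.max_identity < 30.0
            and not chain.has_breaks
            and not chain.has_nonstandard)

def split_train_test(chains, train_fraction=0.9, seed=0):
    """Shuffle the chains and cut them 90/10 (a random split is assumed)."""
    rng = random.Random(seed)
    shuffled = chains[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

toy = [Chain(f"c{i}", 1.5, 20.0, False, False) for i in range(10)]
toy.append(Chain("bad", 2.5, 20.0, False, False))  # rejected: resolution too coarse
kept = [c for c in toy if passes_filters(c)]
train, test = split_train_test(kept)
```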
  5. How are these features predicted?
       • Many of these features are due to local interactions of an amino acid and its immediate neighbours
           • We predict them from the closest neighbours in the chain
       [Figure: sliding window over residues R(i-5) ... R(i+5) and their states SS(i-5) ... SS(i+5). Example windows: R(i-1), R(i), R(i+1) → SS(i); R(i), R(i+1), R(i+2) → SS(i+1); R(i+1), R(i+2), R(i+3) → SS(i+2)]
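The sliding-window idea on this slide can be sketched as below: each training instance pairs a window of residues with the state of the central residue. The function name, the `X` padding symbol at the chain ends, and the toy sequence are assumptions made for illustration.

```python
def window_instances(residues, states, half_width=1, pad="X"):
    """Pair each window of residues with the state of its central residue.

    Chain ends are padded with `pad` (an assumed convention) so that every
    residue gets a full-width window.
    """
    if len(residues) != len(states):
        raise ValueError("residues and states must have the same length")
    padded = [pad] * half_width + list(residues) + [pad] * half_width
    instances = []
    for i, state in enumerate(states):
        window = tuple(padded[i : i + 2 * half_width + 1])
        instances.append((window, state))
    return instances

# Toy example with a window of one neighbour on each side.
examples = window_instances("MKVL", "CHHC")
for window, state in examples:
    print(window, "->", state)
```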
  6. Prediction of RCH, SA and CN
       • All three features were predicted based on a window of ±4 residues around the target
           • Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information
           • Each residue is characterised by a vector of 180 values
       • The domain for all three features was partitioned into 5 states
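A window of ±4 residues covers 9 positions, and 9 × 20 = 180, so the vector size on this slide is consistent with 20 PSSM scores per position. The slide does not spell out that breakdown, so treat it as an assumption. The sketch below builds such a vector and maps a continuous value to one of 5 states; the zero padding at chain ends and the cut points are likewise assumed.

```python
import bisect

def pssm_window_features(pssm, center, half_width=4):
    """Concatenate the PSSM rows in a window of ±half_width around `center`.

    Positions beyond the chain ends are filled with zeros (assumed padding).
    """
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0.0] * n_cols)
    return features

def to_state(value, cut_points):
    """Map a continuous value to a state index; 4 cut points give 5 states."""
    return bisect.bisect_right(cut_points, value)

# Toy PSSM: 12 residues, 20 scores each; row r holds the value r + 1.
toy_pssm = [[float(r + 1)] * 20 for r in range(12)]
vec = pssm_window_features(toy_pssm, center=0)
state = to_state(0.5, [0.2, 0.4, 0.6, 0.8])  # hypothetical cut points
```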
  7. Characterisation of the contact map problem
       • Three types of input information were used
           • Detailed information from three different windows of residues centered around
               • The two target residues (2x)
               • The middle point between them
           • Statistics about the connecting segment between the two target residues
           • Global protein information
       [Figure: diagram labelling the three information types 1, 2, 3]
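One way to picture the three kinds of input is as a record assembled for each residue pair. The sketch below is illustrative only: the window size, the particular segment statistics (length and secondary-structure counts), and the single global attribute are assumptions, not the 631 attributes the team actually used.

```python
from collections import Counter

def pair_instance(sequence, predicted_ss, i, j, half_width=2):
    """Assemble the three kinds of input for the residue pair (i, j), i < j."""
    def window(center):
        lo = max(0, center - half_width)
        hi = min(len(sequence), center + half_width + 1)
        return sequence[lo:hi]

    middle = (i + j) // 2
    segment_ss = predicted_ss[i + 1 : j]  # residues strictly between i and j
    return {
        # 1. Windows around both target residues and around the middle point.
        "windows": (window(i), window(j), window(middle)),
        # 2. Statistics about the connecting segment (hypothetical choice).
        "segment": {"length": len(segment_ss), "ss_counts": dict(Counter(segment_ss))},
        # 3. Global protein information (hypothetical choice).
        "global": {"protein_length": len(sequence)},
    }

inst = pair_instance("ACDEFGHIKLMN", "CCHHHHCCEEEC", i=2, j=9)
```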
  8. Contact Map dataset
       • From the set of 3262 proteins we kept all proteins with fewer than 250 amino acids and a randomly selected 20% of the larger proteins
       • The resulting training set contained more than 32 million instances and 631 attributes
       • Less than 2% of those instances are actual contacts
       • It occupies 56 GB of disk space
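The selection rule on this slide might look like the following sketch. It also counts residue pairs per chain, assuming every unordered pair becomes one instance; the slides do not say whether a minimum sequence separation was applied. Function names and the toy lengths are hypothetical.

```python
import random

def select_for_training(lengths, cutoff=250, large_fraction=0.2, seed=0):
    """Keep every protein shorter than `cutoff` plus a random share of the rest."""
    rng = random.Random(seed)
    small = [pid for pid, n in lengths.items() if n < cutoff]
    large = [pid for pid, n in lengths.items() if n >= cutoff]
    n_large = round(len(large) * large_fraction)
    return small + rng.sample(large, n_large)

def n_residue_pairs(length):
    """Number of unordered residue pairs in a chain of the given length."""
    return length * (length - 1) // 2

# Toy protein lengths: three below the cutoff, five at or above it.
toy_lengths = {"p1": 120, "p2": 249, "p3": 80, "p4": 250,
               "p5": 300, "p6": 410, "p7": 520, "p8": 900}
selected = select_for_training(toy_lengths)
```

The pair count grows quadratically with chain length: a 500-residue chain yields roughly four times as many pairs as a 250-residue one. That may be why larger proteins were subsampled, although the slides do not give the reason.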
  9. Samples and ensembles
       • 50 samples of 660K examples are generated from the training set with a 2:1 ratio of non-contacts to contacts
       • BioHEL is run 25 times for each sample
       • Prediction is done by a consensus of 1250 rule sets
       • Confidence of prediction is computed based on the distribution of votes in the ensemble
       • The whole training process takes about 25000 CPU hours
       [Figure: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
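The sampling and voting scheme on this slide can be sketched as follows. With a 2:1 ratio, a sample of 660K examples holds 440K non-contacts and 220K contacts, and 50 samples × 25 runs gives the 1250 rule sets. The sketch assumes sampling without replacement, a simple majority vote, and confidence defined as the fraction of rule sets agreeing with the winning label; the slides only say that confidence is based on the vote distribution.

```python
import random

def draw_sample(contacts, non_contacts, size, ratio=2, seed=0):
    """Draw `size` examples with `ratio` non-contacts for every contact."""
    rng = random.Random(seed)
    n_contacts = size // (ratio + 1)
    n_non_contacts = size - n_contacts
    return rng.sample(contacts, n_contacts) + rng.sample(non_contacts, n_non_contacts)

def consensus(votes):
    """Combine 0/1 votes (one per rule set) into a label and a confidence."""
    positive = sum(votes)
    label = 1 if positive * 2 > len(votes) else 0
    # Assumed definition: share of rule sets that agree with the winning label.
    confidence = max(positive, len(votes) - positive) / len(votes)
    return label, confidence

n_rule_sets = 50 * 25  # 50 samples, 25 BioHEL runs per sample
toy_sample = draw_sample(list(range(100)), list(range(100, 400)), size=30)
label, confidence = consensus([1] * 1000 + [0] * 250)
```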