
The Infobiotics Contact Map predictor at CASP9


Description of the Contact Map prediction method of the Infobiotics team that participated in CASP9

Published in: Technology, Education


  1. The Infobiotics Contact Map Predictor at CASP9
     J. Bacardit (1,2), P. Widera (1) and N. Krasnogor (1)
     (1) School of Computer Science, University of Nottingham; (2) School of Biosciences, University of Nottingham
     jaume.bacardit
  2. Contact Map
       • Two residues of a chain are said to be in contact if the distance between them is less than a certain threshold.
       • The contacts of a protein can be represented by a binary matrix (1 = contact, 0 = non-contact).
       • Plotting this matrix reveals many characteristics of the protein structure.
       • CM prediction is used in many 3D PSP methods (e.g. I-TASSER).
     [Figure: contact map plot, labelled "Contact", "helices", "sheets"]
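The definition on this slide can be sketched in a few lines of Python. This is only an illustration, not the team's code: the function name `contact_map`, the use of one coordinate per residue, and the 8 Å default threshold are assumptions (the slide does not state the threshold used).

```python
import math

def contact_map(coords, threshold=8.0):
    """Build a binary contact matrix from one (x, y, z) point per residue.

    threshold is in the same unit as the coordinates; 8.0 is an assumed
    default, not a value taken from the slides.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Two residues are in contact if their distance is below the threshold.
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # the matrix is symmetric
    return cmap

# Toy chain: three residues spaced 3.8 apart and one far away.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)
```

The diagonal is left at 0 here, since a residue paired with itself carries no information; that is a choice of this sketch.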
  3. Steps
       • Prediction of
           • Secondary structure (using PSIPRED)
           • Solvent Accessibility
           • Recursive Convex Hull
           • Coordination number
       • Integration of all these predictions plus other sources of information
       • Final CM prediction (using BioHEL)
     [Figure annotation: Using BioHEL [Bacardit et al., 09]]
  4. Prediction of RCH, SA and CN
       • Dataset of 3262 protein chains created using PDB-REPRDB with:
           • A resolution better than 2 Å
           • Less than 30% sequence identity
           • No chain breaks or non-standard residues
       • 90% of this set was used for training (~490000 residues)
       • 10% was used for testing
  5. How are these features predicted?
       • Many of these features are due to local interactions of an amino acid and its immediate neighbours
           • We predict them from the closest neighbours in the chain
     [Figure: sliding window over residues R(i-5) ... R(i+5) and their states SS(i-5) ... SS(i+5). Examples: R(i-1), R(i), R(i+1) → SS(i); R(i), R(i+1), R(i+2) → SS(i+1); R(i+1), R(i+2), R(i+3) → SS(i+2)]
  6. Prediction of RCH, SA and CN
       • All three features were predicted based on a window of ±4 residues around the target
           • Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information
           • Each residue characterised by a vector of 180 values
       • The domain for all three features was partitioned into 5 states
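A window of ±4 residues spans 9 positions, and 9 positions × 20 PSSM columns gives 180 values, which matches the vector size on this slide. The sketch below builds such a vector under that assumed layout; the slide does not spell the layout out, and the function name `window_features` and the zero padding at the chain ends are hypothetical choices.

```python
def window_features(pssm, center, half_width=4):
    """Concatenate the PSSM rows for positions center-half_width .. center+half_width.

    pssm is a list of rows, one per residue, each with the same number of columns.
    Positions outside the chain are filled with zeros (an assumption of this sketch).
    """
    n_cols = len(pssm[0])
    padding = [0.0] * n_cols
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        row = pssm[pos] if 0 <= pos < len(pssm) else padding
        features.extend(row)
    return features

# Toy PSSM: 12 residues, 20 columns, row k filled with the value k + 1.
pssm = [[float(k + 1)] * 20 for k in range(12)]
vec = window_features(pssm, center=5)   # window covers residues 1..9
edge = window_features(pssm, center=0)  # first four window positions are padding
```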
  7. Characterisation of the contact map problem
       • Three types of input information were used
           • Detailed information of three different windows of residues centered around
               • The two target residues (2x)
               • The middle point between them
           • Statistics about the connecting segment between the two target residues and
           • Global protein information.
     [Figure labels: 1, 2, 3]
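The window and segment positions described on this slide can be illustrated as index arithmetic on a residue pair (i, j). This is a rough sketch only: the name `pair_layout`, the window half-width, and the integer rounding of the middle point are assumptions, and no actual feature values are computed.

```python
def pair_layout(i, j, half_width=4):
    """Return the residue indices for the three windows and the connecting segment.

    half_width is an assumed parameter; the slides do not give the window
    size used for the contact map windows.
    """
    if i > j:
        i, j = j, i
    middle = (i + j) // 2  # middle point between the two targets, rounded down
    centers = {"first": i, "second": j, "middle": middle}
    windows = {
        name: list(range(c - half_width, c + half_width + 1))
        for name, c in centers.items()
    }
    segment = list(range(i + 1, j))  # residues strictly between the two targets
    return windows, segment

windows, segment = pair_layout(10, 20, half_width=1)
```

Indices that fall outside the chain would still need padding or clipping before any features are looked up.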
  8. Contact Map dataset
       • From the set of 3262 proteins we kept all proteins with fewer than 250 amino acids and a randomly selected 20% of the larger proteins
       • The resulting training set contained more than 32 million instances and 631 attributes
       • Less than 2% of those instances are actual contacts
       • 56 GB of disk space
  9. Samples and ensembles
       • 50 samples of 660K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts
       • BioHEL is run 25 times for each sample
       • Prediction is done by a consensus of 1250 rule sets (50 samples × 25 runs)
       • Confidence of prediction is computed based on the distribution of votes in the ensemble.
       • The whole training process takes about 25000 CPU hours
     [Figure: Training set → Samples (x50) → Rule sets (x25) → Consensus → Predictions]
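The sampling and voting scheme on this slide can be sketched as follows. This is a hedged illustration rather than the team's pipeline: `balanced_sample` and `consensus` are hypothetical names, sampling without replacement is an assumption, and the confidence formula (fraction of votes for the winning class) is one plausible reading of "based on the distribution of votes", not a formula given in the slides.

```python
import random

N_SAMPLES = 50        # samples drawn from the training set (from the slide)
RUNS_PER_SAMPLE = 25  # BioHEL runs per sample (from the slide)
N_RULE_SETS = N_SAMPLES * RUNS_PER_SAMPLE  # 1250 rule sets in the ensemble

def balanced_sample(contacts, non_contacts, size, ratio=2, rng=random):
    """Draw `size` examples with `ratio` non-contacts for every contact."""
    n_contacts = size // (ratio + 1)
    n_non_contacts = size - n_contacts
    # Sampling without replacement is an assumption of this sketch.
    return rng.sample(contacts, n_contacts) + rng.sample(non_contacts, n_non_contacts)

def consensus(votes):
    """Majority vote over 0/1 predictions, one per rule set.

    Returns (label, confidence); confidence is the fraction of votes for
    the winning label (assumed formula). Ties resolve to non-contact.
    """
    n_contact = sum(votes)
    n_total = len(votes)
    label = 1 if 2 * n_contact > n_total else 0
    confidence = max(n_contact, n_total - n_contact) / n_total
    return label, confidence

# Toy data standing in for the real training instances.
rng = random.Random(0)
contacts = [("contact", k) for k in range(100)]
non_contacts = [("non-contact", k) for k in range(1000)]
sample = balanced_sample(contacts, non_contacts, size=30, rng=rng)

label, confidence = consensus([1] * 1000 + [0] * 250)
```

With the slide's numbers, a 660K sample at 2:1 would hold 220K contacts and 440K non-contacts.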