The Infobiotics Contact Map predictor at CASP9
 

Description of the Contact Map prediction method of the Infobiotics team that participated in CASP9

    • The Infobiotics Contact Map Predictor at CASP9
      J. Bacardit 1,2, P. Widera 1 and N. Krasnogor 1
      1 - School of Computer Science, University of Nottingham
      2 - School of Biosciences, University of Nottingham
      jaume.bacardit@nottingham.ac.uk
    • Contact Map
      • Two residues of a chain are said to be in contact if their distance is less than a certain threshold
      • The contacts of a protein can be represented by a binary matrix (1 = contact, 0 = non-contact)
      • Plotting this matrix reveals many characteristics of the protein structure
      • Contact map (CM) prediction is used in many 3D protein structure prediction (PSP) methods (e.g. I-Tasser)
      [Figure: example contact map, with the patterns produced by helices and sheets marked]
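The definition on this slide can be sketched in a few lines of Python. This is only an illustration, not the team's code: it assumes one representative atom per residue and a cut-off of 8 Å, neither of which is stated on the slide, and `contact_map` is a hypothetical name.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Return a binary L x L matrix (1 = contact, 0 = non-contact).

    coords: (L, 3) array with one representative atom per residue (assumption).
    threshold: distance cut-off in angstroms; 8.0 is an assumed value.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all residues.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    cm = (dist < threshold).astype(np.uint8)
    # A residue is not counted as being in contact with itself.
    np.fill_diagonal(cm, 0)
    return cm

# Toy chain: four residues on a straight line, 5 angstroms apart.
toy_coords = [[0, 0, 0], [5, 0, 0], [10, 0, 0], [15, 0, 0]]
print(contact_map(toy_coords))
```

With these toy coordinates only neighbouring residues fall under the cut-off, so the matrix is symmetric with ones next to the diagonal.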
    • Steps
      • Prediction of
        • Secondary structure (using PSIPRED)
        • Solvent Accessibility (SA)
        • Recursive Convex Hull (RCH)
        • Coordination number (CN)
      • Integration of all these predictions plus other sources of information
      • Final CM prediction (using BioHEL)
      Using BioHEL [Bacardit et al., 09]
    • Prediction of RCH, SA and CN
      • Dataset of 3262 protein chains created using PDB-REPRDB with:
        • A resolution less than 2Å
        • Less than 30% sequence identity
        • No chain breaks or non-standard residues
      • 90% of this set was used for training (~490,000 residues)
      • The remaining 10% was used for testing
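A 90/10 split like the one described on this slide could look as follows. The slide does not say how the chains were assigned to each part, so the random, chain-level split and the fixed seed are assumptions, and `split_chains` is a hypothetical helper.

```python
import random

def split_chains(chain_ids, test_fraction=0.1, seed=0):
    """Split chain identifiers into (train, test) at the chain level.

    Splitting whole chains, rather than residues, keeps all residues of
    one protein on the same side of the split (an assumption here).
    """
    ids = sorted(chain_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]

# Placeholder identifiers standing in for the 3262 chains.
train, test = split_chains([f"chain{k}" for k in range(3262)])
print(len(train), len(test))
```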
    • How are these features predicted?
      • Many of these features are due to local interactions of an amino acid and its immediate neighbours
        • We predict them from the closest neighbours in the chain
      [Figure: sliding window over residues R i-5 ... R i+5 and their secondary-structure states SS i-5 ... SS i+5. Examples: (R i-1, R i, R i+1) → SS i; (R i, R i+1, R i+2) → SS i+1; (R i+1, R i+2, R i+3) → SS i+2]
    • Prediction of RCH, SA and CN
      • All three features were predicted based on a window of ±4 residues around the target
        • Evolutionary information, in the form of a Position-Specific Scoring Matrix (PSSM), is the basis of this local information
        • Each residue is characterised by a vector of 180 values
      • The domain for all three features was partitioned into 5 states
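The window encoding and the 5-state partition can be sketched as below. One reading consistent with the figure of 180 values is 9 window positions times 20 PSSM columns; that reading, the zero padding at the chain ends and the equal-frequency cut points are all assumptions, since the slides do not specify them. `window_features` and `discretise` are hypothetical names.

```python
import numpy as np

def window_features(pssm, half_window=4):
    """Turn an (L, 20) PSSM into an (L, 180) matrix of window features.

    Each row concatenates the PSSM rows of the target residue and its
    +/- half_window neighbours. Positions beyond the chain ends are
    zero-padded (an assumption).
    """
    pssm = np.asarray(pssm, dtype=float)
    length = pssm.shape[0]
    width = 2 * half_window + 1
    padded = np.pad(pssm, ((half_window, half_window), (0, 0)), mode="constant")
    return np.stack([padded[i:i + width].ravel() for i in range(length)])

def discretise(values, n_states=5):
    """Map real values to states 0..n_states-1 using equal-frequency
    cut points (an assumption; equal-width bins are another option)."""
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, np.linspace(0, 1, n_states + 1)[1:-1])
    return np.digitize(values, cuts)

rng = np.random.default_rng(0)
toy_pssm = rng.normal(size=(30, 20))   # random stand-in for a real PSSM
print(window_features(toy_pssm).shape)
print(np.bincount(discretise(np.arange(100))))
```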
    • Characterisation of the contact map problem
      • Three types of input information were used
        • Detailed information from three windows of residues, centered on:
          • Each of the two target residues (2 windows)
          • The middle point between them
        • Statistics about the segment connecting the two target residues
        • Global protein information
      [Figure: diagram of the input information, with parts labelled 1, 2 and 3]
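To make the three kinds of input concrete, here is a rough sketch of how one instance for a residue pair might be laid out. The slides do not list the actual attributes, so everything specific here is hypothetical: the window half-width of 4, the secondary-structure fractions used as segment statistics, and chain length as the only global feature.

```python
from collections import Counter

def pair_instance(ss, i, j, half_window=4):
    """Describe the residue pair (i, j), i < j, of a chain whose predicted
    secondary structure is the string ss (H, E or C per residue)."""
    if not 0 <= i < j < len(ss):
        raise ValueError("expected 0 <= i < j < len(ss)")
    length = len(ss)

    def window(centre):
        # Positions outside the chain are reported as None (padding).
        return [p if 0 <= p < length else None
                for p in range(centre - half_window, centre + half_window + 1)]

    segment = ss[i + 1:j]               # residues strictly between i and j
    counts = Counter(segment)
    seg_len = len(segment)
    return {
        # 1. Three windows: both targets and the middle point.
        "window_i": window(i),
        "window_j": window(j),
        "window_middle": window((i + j) // 2),
        # 2. Statistics about the connecting segment (hypothetical choice).
        "segment_length": seg_len,
        "segment_ss_fraction": {s: (counts[s] / seg_len if seg_len else 0.0)
                                for s in "HEC"},
        # 3. Global protein information (hypothetical choice).
        "chain_length": length,
    }

print(pair_instance("CCHHHHHHCCEEEECC", 2, 12))
```

In a real pipeline each window position would carry per-residue predictions and profile values rather than bare indices; the indices are used here only to show which residues each window covers.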
    • Contact Map dataset
      • From the set of 3262 proteins we kept all proteins with fewer than 250 amino acids and a randomly selected 20% of the larger proteins
      • The resulting training set contained more than 32 million instances, each with 631 attributes
      • Less than 2% of those instances are actual contacts
      • The training set takes up 56 GB of disk space
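A quick back-of-the-envelope check shows why the data set gets this large: the number of residue pairs grows quadratically with chain length. The sketch below assumes every distinct pair could become an instance and reads 56 GB as decimal gigabytes; the slides state neither, so the per-value figure is only indicative.

```python
def n_pairs(length):
    """Number of distinct residue pairs in a chain of the given length."""
    return length * (length - 1) // 2

N_INSTANCES = 32_000_000    # "more than 32 million", so a lower bound
N_ATTRIBUTES = 631
DISK_BYTES = 56 * 10**9     # assumption: decimal gigabytes

n_values = N_INSTANCES * N_ATTRIBUTES
bytes_per_value = DISK_BYTES / n_values

print("pairs in a 250-residue chain:", n_pairs(250))
print("attribute values in the training set:", n_values)
print("approx. bytes per stored value:", round(bytes_per_value, 2))
```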
    • Samples and ensembles
      • 50 samples of 660K examples each are generated from the training set, with a 2:1 ratio of non-contacts to contacts
      • BioHEL is run 25 times on each sample
      • Prediction is done by a consensus of the resulting 1250 rule sets (50 samples x 25 runs)
      • The confidence of a prediction is computed from the distribution of votes in the ensemble
      • The whole training process takes about 25,000 CPU hours
      [Figure: training set → samples (x50) → rule sets (x25 per sample) → consensus → predictions]
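The sampling and voting scheme can be illustrated with a small sketch. It is not BioHEL itself: the rule sets are replaced by a plain list of 0/1 votes, sampling without replacement inside each sample is an assumption, and the confidence formula (share of votes for the winning class) is one plausible choice, since the slides only say it is based on the distribution of votes. `draw_sample` and `consensus` are hypothetical names.

```python
import random
from collections import Counter

N_SAMPLES = 50          # samples drawn from the training set
RUNS_PER_SAMPLE = 25    # BioHEL runs per sample
N_RULE_SETS = N_SAMPLES * RUNS_PER_SAMPLE   # 1250 ensemble members

def draw_sample(contacts, non_contacts, sample_size, ratio=2, rng=None):
    """Draw one sample with `ratio` non-contacts for every contact."""
    rng = rng or random.Random(0)
    n_contacts = sample_size // (ratio + 1)
    n_non_contacts = sample_size - n_contacts
    sample = (rng.sample(contacts, n_contacts)
              + rng.sample(non_contacts, n_non_contacts))
    rng.shuffle(sample)
    return sample

def consensus(votes):
    """Majority vote over the ensemble.

    Returns (label, confidence), where confidence is the fraction of
    votes cast for the winning label (an assumed definition).
    """
    label, top = Counter(votes).most_common(1)[0]
    return label, top / len(votes)

# Toy data: 100 contacts and 1000 non-contacts, sample of 66 examples.
toy_contacts = [("contact", k) for k in range(100)]
toy_non_contacts = [("non-contact", k) for k in range(1000)]
sample = draw_sample(toy_contacts, toy_non_contacts, 66)
print(Counter(kind for kind, _ in sample))

# 1000 of the 1250 rule sets vote "contact" for some residue pair.
print(consensus([1] * 1000 + [0] * 250))
```

At the scale quoted on the slide, a 660K sample with a 2:1 ratio holds 440K non-contacts and 220K contacts.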