Protein threading using context specific alignment potential ismb-2013

Template-based modeling, including homology modeling and protein threading, is the most reliable method for protein 3D structure prediction. However, alignment errors and template selection are still the main bottlenecks for current template-based modeling methods, especially when the proteins under consideration are distantly related.

We present a novel context-specific alignment potential for protein threading, covering both alignment and template selection. Our alignment potential measures the log-odds ratio of an alignment being generated from two related proteins versus being generated from two unrelated proteins, by integrating both local and global context-specific information.

Published in: Technology, Business

  • Currently, template-based modeling is the mainstream approach to protein structure prediction. It rests on the observation that although the PDB holds around 50,000 non-redundant structures, SCOP contains only about 1,200 unique structure folds. More importantly, few new unique folds have appeared since 2010, which implies that the number of naturally occurring protein folds is limited. This leads to the fundamental assumption that known structures can be used to predict the structure of an unknown query sequence. More formally, template-based modeling is defined as follows: given a query protein's one-dimensional amino acid sequence and a database of templates with known three-dimensional structures, we align the query to each template, find the best match, and build the query model upon that template.
  • Here we move to the first part: how to define the labels for protein alignment data. In detail, we convert an alignment path into a series of consecutive labels drawn from three states: M, Is and It. There are therefore nine adjacent state transitions in total. Having defined the labels, we can apply DeepAlign to generate the training data from structurally similar proteins.

    1. Protein Threading Using Context-Specific Alignment Potential. Sheng Wang, Toyota Technological Institute at Chicago, http://raptorx.uchicago.edu. Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu. ISMB 2013, Jul 22, ICC Berlin, Germany.
    2. Outline
       • Where we are @ template-based modeling
       • What's our work
       • What's the problem
       • What's our solution
       • Welcome to our server
    3. Template-based Modeling (or, Threading)
       • Observation
         – ~50,000 non-redundant structures in PDB
         – ~1,200 unique structure folds (SCOP)
       • Methodology
         – Use known structures to predict a new one
       Figure (template sequence / query sequence pairs against the database): DDVYILDQAEEG / DE-FIVD-PDEH; DDVYILDQAEEG / SPCKR---ADEG; DDVYILDQAEEG / E--IFVDQADDS; DDVYILDQAEEG / NMCVFGQWERTY
    4. Template-based Modeling Procedures
       • Easy: similar sequences → similar structures
         – Sequence-based methods, e.g., BLAST, FASTA
         – Works only for close homologs (>70% sequence identity)
       • Medium: similar profiles → similar structures
         – A protein profile is a matrix that represents a multiple sequence alignment of similar proteins
         – Profile-based methods, e.g., PSI-BLAST, HMMER, HHpred
         – Works for relatively remote homologs (>40% sequence identity)
       • Challenge: dissimilar profiles → similar structures
         – Add structural or context-specific information to sequence/profile-based methods
         – Threading methods, e.g., MUSTER, RAPTOR, CS-BLAST
         – Works for distant homologs (<40% sequence identity)
    5. Our Work
       • CNFpred: transforms the template-sequence alignment problem into a machine learning problem in order to calculate an alignment's probability.
       • DeepAlign: prepares high-quality training data from structural alignments.
       • CNF model: a combined machine learning model that incorporates a Conditional Random Field (CRF) and a Neural Network (NN).
    6. Protein Alignment Model
       • States: match (M), insertion at sequence (Is), insertion at template (It)
       • Example (template and sequence): the proteins S A L R Q and L P L S E are aligned as the rows L P L S - E and S A - L R Q, giving the state path M M Is M It M
       • The structural alignment generated by DeepAlign is used as training data
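The labeling step on this slide can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names are hypothetical, and it assumes that Is marks a column where the sequence row has a residue and the template row has a gap, with It the reverse. Under that assumption the slide's example rows reproduce the state path M M Is M It M.

```python
def alignment_states(template_row, sequence_row, gap="-"):
    """Convert two aligned rows into a list of M / Is / It labels.

    Assumed convention (not confirmed by the slides):
      M  : both rows have a residue in the column
      Is : sequence residue aligned to a gap in the template row
      It : template residue aligned to a gap in the sequence row
    """
    if len(template_row) != len(sequence_row):
        raise ValueError("aligned rows must have the same length")
    states = []
    for t, s in zip(template_row, sequence_row):
        if t == gap and s == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if t == gap:
            states.append("Is")
        elif s == gap:
            states.append("It")
        else:
            states.append("M")
    return states


def adjacent_transitions(states):
    """Return the consecutive (previous, current) state pairs of a path."""
    return list(zip(states[:-1], states[1:]))


if __name__ == "__main__":
    path = alignment_states("SA-LRQ", "LPLS-E")
    print(path)
    print(adjacent_transitions(path))
```

With three states there are 3 x 3 = 9 possible adjacent transitions, which matches the count given in the speaker notes.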
    7. DeepAlign for Structure Alignment
       • Evolutionary information
       • Local sub-structure similarity
       • Angular similarity for hydrogen bonding
       Score(i,j) = ( max(0, BLOSUM(i,j)) + CLESUM(i,j) ) * v(i,j) * d(i,j)
       where BLOSUM is the local amino acid substitution matrix; CLESUM is the local sub-structure substitution matrix; v(i,j) measures the angular similarity for hydrogen bonding; d(i,j) measures the spatial proximity of two aligned residues. (Figure labels on the formula: local similarity, global similarity.)
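As a small worked example of the scoring formula on this slide, the per-pair score can be written as a one-line function. This is a sketch, not DeepAlign's implementation: the function name is hypothetical, and the input values below are made up for illustration because the slide does not give the ranges of v and d.

```python
def deepalign_pair_score(blosum, clesum, v, d):
    """Score(i,j) = (max(0, BLOSUM(i,j)) + CLESUM(i,j)) * v(i,j) * d(i,j).

    All four inputs are the already-looked-up values for one residue pair;
    how they are computed is outside the scope of this sketch.
    """
    return (max(0.0, blosum) + clesum) * v * d


if __name__ == "__main__":
    # A negative BLOSUM entry is clipped to zero, so only CLESUM contributes.
    print(deepalign_pair_score(blosum=-2, clesum=3, v=0.5, d=0.8))
```

The clipping means an unfavorable amino acid substitution never reduces the score below what the sub-structure term alone would give.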
    8. CNF-based Alignment Model
       Given an alignment $A = \{a_1, a_2, \ldots, a_L\}$ with $a_i \in \{M, I_t, I_s\}$, define a conditional probability between sequence S and template T:
       $P(A|S,T) = \exp\Big(\sum_i E(a_i, a_{i-1}, S, T)\Big) / Z(S,T)$
       where E is a neural network estimating the log-likelihood of a state transition (context-specific) and Z(S,T) is the normalization factor.
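To make the role of the normalization factor Z(S,T) concrete, here is a minimal sketch that computes log Z with a forward recursion over the three states and checks that the resulting alignment probabilities sum to one. It is not CNFpred: `toy_edge_score` is a made-up stand-in for the neural network E, the state steps assume M consumes one residue from each protein while Is and It consume one residue from the sequence or the template respectively, and all nine transitions are allowed.

```python
import math

STATES = ("M", "Is", "It")
# (sequence residues, template residues) consumed by each state (assumed).
STEP = {"M": (1, 1), "Is": (1, 0), "It": (0, 1)}


def logsumexp(values):
    finite = [v for v in values if v != -math.inf]
    if not finite:
        return -math.inf
    top = max(finite)
    return top + math.log(sum(math.exp(v - top) for v in finite))


def toy_edge_score(state, prev, i, j):
    """Hypothetical stand-in for E(a_i, a_{i-1}, S, T); ignores i and j."""
    score = 1.0 if state == "M" else -0.5
    if prev is not None and prev != state:
        score -= 0.2
    return score


def path_score(states, edge_score):
    """Sum of edge scores along one alignment path."""
    i = j = 0
    prev, total = None, 0.0
    for state in states:
        di, dj = STEP[state]
        i, j = i + di, j + dj
        total += edge_score(state, prev, i, j)
        prev = state
    return total


def log_partition(m, n, edge_score):
    """log Z: log-sum of exp(path score) over all alignments of m x n residues."""
    fwd = {}
    for i in range(m + 1):
        for j in range(n + 1):
            for state in STATES:
                di, dj = STEP[state]
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    fwd[(i, j, state)] = -math.inf
                elif pi == 0 and pj == 0:
                    fwd[(i, j, state)] = edge_score(state, None, i, j)
                else:
                    fwd[(i, j, state)] = logsumexp(
                        fwd[(pi, pj, prev)] + edge_score(state, prev, i, j)
                        for prev in STATES
                    )
    return logsumexp(fwd[(m, n, s)] for s in STATES)


def enumerate_alignments(m, n):
    """Brute-force list of all state paths (only feasible for tiny m, n)."""
    if m == 0 and n == 0:
        yield []
        return
    for state in STATES:
        di, dj = STEP[state]
        if m - di >= 0 and n - dj >= 0:
            for prefix in enumerate_alignments(m - di, n - dj):
                yield prefix + [state]


if __name__ == "__main__":
    log_z = log_partition(2, 2, toy_edge_score)
    probs = [math.exp(path_score(a, toy_edge_score) - log_z)
             for a in enumerate_alignments(2, 2)]
    print(len(probs), sum(probs))
```

The brute-force enumeration is only there to validate the recursion on a tiny case; a real threader would rely on the dynamic program alone.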
    9. Comprehensive Features
       Sequence S: MTYKLILN--GKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
       Template T: MTYKLILNSTVRTKSDTVTDAVP---ADKICSFAQQLPWEREWSF--
       • EAA: how similar two residues are
       • Esp, Epp: how similar the query's sequence and profile are to the template's profile
       • Ess3, Ess8: how similar the template's secondary structure is to the sequence's predicted secondary structure (3-class and 8-class)
       • Esa: how similar the query's solvent accessibility is to the template's solvent accessibility
       • Ediso: for disordered regions, no structure information is used
       The total scoring function is a non-linear combination of these features: E( ai, ai-1, EAA, Esp, Epp, Ediso, Ess3, Ess8, Esa )
    10. What's the problem?
        • Only the alignment probability is described, rather than a log-odds potential relative to a background.
        • Only local information is incorporated; global information is lacking.
    11. Our solution
        Propose a protein alignment potential
        • With an elaborately designed reference state.
        • Can be generalized to sequence-sequence, sequence-structure and structure-structure alignment.
        Incorporate both local and global terms
        • For the local term, the CNFpred potential is applied.
        • For the global term, the EPAD potential is employed.
    12. Protein alignment potential
        Given two amino acids a and b, their mutation potential is defined as follows:
        $u(a \leftrightarrow b) = \log \frac{P(a \leftrightarrow b)}{P_{ref}(a \leftrightarrow b)} = \log \frac{P(a \leftrightarrow b)}{P(a)\,P(b)}$
        Similarly, given one alignment A between sequence S and template T, we define the potential of A as follows:
        $u(A|S,T) = \log \frac{P(A|S,T)}{P_{ref}(A)} = \log \frac{P(A|S,T)}{\sqrt[N]{\prod_{i=1}^{N} P(A|x_i,y_i)}}$
        where x and y are two random proteins with the same lengths as S and T, respectively.
        Assumption: the alignment maximizing the potential is the optimal one.
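The two log-odds definitions on this slide can be illustrated numerically. The sketch below is not the authors' code: the function names are hypothetical, the probabilities are invented, and the alignment potential assumes the reference probability is the geometric mean of P(A|x,y) over N random protein pairs, so that its logarithm is the average of the per-pair log-probabilities.

```python
import math


def mutation_potential(p_ab, p_a, p_b):
    """Log-odds of observing a aligned to b versus independent occurrence."""
    return math.log(p_ab / (p_a * p_b))


def alignment_potential(log_p_related, log_p_random_pairs):
    """log P(A|S,T) minus the mean log P(A|x,y) over random pairs.

    The mean of logs equals the log of the geometric mean, which is the
    reference state assumed in this sketch.
    """
    reference = sum(log_p_random_pairs) / len(log_p_random_pairs)
    return log_p_related - reference


if __name__ == "__main__":
    # A pair seen twice as often as chance gives a potential of log 2.
    print(mutation_potential(p_ab=0.02, p_a=0.1, p_b=0.1))
    # An alignment that is more probable for (S, T) than for random pairs
    # gets a positive potential.
    print(alignment_potential(-3.0, [-5.0, -7.0]))
```

A positive value in either function means the observation is more likely under the "related" model than under the background.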
    13. Protein alignment potential
        The alignment probability given sequence S and template T could be modeled as follows:
        $P(A|S,T) = \exp\big(F(A|S,T) + G(A|S,T)\big) / Z(S,T)$
        where F is the local term, G is the global term, and the partition function is $Z(S,T) = \sum_A \exp\big(F(A|S,T) + G(A|S,T)\big)$.
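    14. Protein alignment potential
        $u(A|S,T) = \log \frac{P(A|S,T)}{\sqrt[N]{\prod_{i=1}^{N} P(A|x_i,y_i)}}$
        $= \log \frac{\exp\big(F(A|S,T) + G(A|S,T)\big) / Z(S,T)}{\sqrt[N]{\prod_{i=1}^{N} \exp\big(F(A|x_i,y_i) + G(A|x_i,y_i)\big) / Z(x_i,y_i)}}$
        $= F(A|S,T) - \mathrm{EXP}_{x,y}\, F(A|x,y) + G(A|S,T) - \mathrm{EXP}_{x,y}\, G(A|x,y) + c(S,T)$
        • The $\mathrm{EXP}_{x,y}$ terms are expected scores and can be calculated in advance by sampling.
        • $c(S,T)$ is independent of any specific alignment.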
    15. Model the local potential
        From CNFpred, we use a context-specific linear chain model:
        $F(A|S,T) = \sum_i E(a_i, a_{i-1}, S, T)$
        The local potential is defined as:
        $U_{local}(A|S,T) = F(A|S,T) - \mathrm{EXP}_{x,y}\, F(A|x,y)$
        The expectation term can be calculated by uniformly sampling a few thousand protein pairs, so the local potential is:
        $U_{local}(A|S,T) = \sum_i \big( E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1}) \big)$
        where $E(a_i, a_{i-1})$ denotes the sampled expectation for that state transition.
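The subtraction of a sampled background from each transition score can be sketched as follows. This is an illustration under stated assumptions, not the CNFpred code: the function names and the tuple layout (state, previous state, score) are hypothetical, the scores are invented, and the background is taken to be a plain average of sampled scores per transition type.

```python
from collections import defaultdict


def expected_transition_scores(samples):
    """Average the sampled scores for each (state, previous state) pair.

    `samples` is an iterable of (state, prev_state, score) tuples, assumed
    to come from scoring randomly sampled protein pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for state, prev, score in samples:
        sums[(state, prev)] += score
        counts[(state, prev)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def local_potential(path, background):
    """Sum over the path of (context-specific score - background score)."""
    return sum(score - background[(state, prev)]
               for state, prev, score in path)


if __name__ == "__main__":
    sampled = [("M", "M", 1.0), ("M", "M", 3.0), ("Is", "M", -1.0)]
    background = expected_transition_scores(sampled)
    alignment = [("M", "M", 2.5), ("Is", "M", -0.5)]
    print(background)
    print(local_potential(alignment, background))
```

Because the background depends only on the transition type, it can be tabulated once and reused for every query-template pair.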
    16. What's the difference between maximizing probability and maximizing potential?
        • Maximize on probability: long but less informative and highly false positive. Good for building models.
        • Maximize on potential: short but relevant and highly significant. Good for ranking templates.
        (Figure: template-sequence alignments for each case.)
    17. Model the global potential
        From EPAD, we use a context-specific distance-dependent model:
        $G(A|S,T) = \sum_{i,j} \log P(d^T_{ij} | s_i, s_j)$
        The global potential is defined as:
        $U_{global}(A|S,T) = G(A|S,T) - \mathrm{EXP}_{x,y}\, G(A|x,y)$
        The expectation term can be calculated by uniformly sampling a few thousand residue pairs from templates, so the global potential is:
        $U_{global}(A|S,T) = \sum_{i,j} \big( \log P(d^T_{ij} | s_i, s_j) - \log P(d^T_{ij}) \big)$
        where $P(d^T_{ij})$ is the background distance probability estimated from the sampled residue pairs.
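The global term can be sketched the same way, as a sum of log-ratios between a context-specific distance probability and a background distance probability. This is not EPAD: the function name, the use of discrete distance bins, and the lookup tables are assumptions made for illustration, and the numbers are invented.

```python
import math


def global_potential(aligned_pairs, p_context, p_background):
    """Sum of log P(d | s_i, s_j) - log P(d) over aligned residue pairs.

    aligned_pairs : list of (i, j, distance_bin), where distance_bin is the
                    binned template distance of the residues aligned to
                    sequence positions i and j (hypothetical layout)
    p_context     : dict mapping (i, j) to a list of bin probabilities
    p_background  : list of background bin probabilities
    """
    total = 0.0
    for i, j, dist_bin in aligned_pairs:
        total += math.log(p_context[(i, j)][dist_bin])
        total -= math.log(p_background[dist_bin])
    return total


if __name__ == "__main__":
    background = [0.5, 0.5]
    context = {(0, 1): [0.8, 0.2]}
    # The template places the pair in bin 0, which the context favors.
    print(global_potential([(0, 1, 0)], context, background))
    # The same pair in bin 1 is disfavored and scores below zero.
    print(global_potential([(0, 1, 1)], context, background))
```

A pair contributes positively only when the template distance is more likely under the sequence context than under the background.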
    18. What's global information given an alignment?
        $G(A|S,T) = \sum_{i,j} \log P(d^T_{ij} | s_i, s_j)$
        If the alignment is good, the distance of a sequence residue pair should match well with that of their aligned template residue pair.
        (Figure: residues $s_i$ and $s_j$ of Sequence S aligned to positions i and j of Template T, which are separated by the distance $d^T_{ij}$.)
    19. Result on 1000*6000: CNFpred (local+global potential) compared to HHpred and CNFpred (local potential)
    20. Welcome to our server: http://raptorx.uchicago.edu/ (Binding, Contact)
    21. Thank you. Jinbo Xu, Feng Zhao, Jianzhu Ma. National Institutes of Health (R01GM0897532); National Science Foundation (DBI-0960390); NSF CAREER award CCF-1149811; Alfred P. Sloan Research Fellowship.
