Protein threading using context specific alignment potential ismb-2013

Protein Threading Using Context-Specific Alignment Potential
Sheng Wang
http://raptorx.uchicago.edu
Toyota Technological Institute at Chicago,
Joint work with Jianzhu Ma, Feng Zhao and Jinbo Xu
ISMB 2013
Jul 22, ICC Berlin, Germany

Outline
• Where we are @ template-based modeling
• What’s our work
• What’s the problem
• What’s our solution
• Welcome to our server

Template-based Modeling (or, Threading)
• Observation
– ~50,000 non-redundant structures in PDB
– ~ 1,200 unique structure folds (SCOP)
• Methodology
– Use known structures to predict a new one
[Figure: the query sequence is aligned against each template sequence in the database]
Query sequence:     DDVYILDQAEEG
Template sequences: DE-FIVD-PDEH
                    SPCKR---ADEG
                    E--IFVDQADDS
                    NMCVFGQWERTY

Template-based Modeling Procedures
• Easy: similar sequences → similar structures
– Sequence-based methods, e.g., BLAST, FASTA
– Work only for close homologs (>70% sequence identity)
• Medium: similar profiles → similar structures
– A protein profile is a matrix that represents a multiple sequence alignment of similar proteins
– Profile-based methods, e.g., PSI-BLAST, HMMER, HHpred
– Work for relatively remote homologs (>40% sequence identity)
• Challenge: dissimilar profiles → similar structures
– Add structural or context-specific information to sequence/profile-based methods
– Threading methods, e.g., MUSTER, RAPTOR, CS-BLAST
– Work for distant homologs (<40% sequence identity)
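The difficulty tiers above are stated in terms of sequence identity. A minimal sketch of one common way to compute it from a gapped pairwise alignment follows. The function name and the choice of denominator (columns where neither row has a gap) are assumptions for illustration; other definitions divide by the alignment length or by the shorter sequence length.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only columns where both rows carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)


# Query and one template row taken from the alignment figure in this deck.
query = "DDVYILDQAEEG"
template = "DE-FIVD-PDEH"
print(percent_identity(query, template))
```

Under this definition the pair has 4 identical residues over 10 aligned columns.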

Our Work
• CNFpred: Transforms the template-sequence alignment problem into a machine learning problem that calculates an alignment's probability.
• DeepAlign: Prepares high-quality structure alignments as training data.
• CNF model: A combined machine learning model that incorporates a Conditional Random Field (CRF) and a Neural Network (NN).

Protein Alignment Model
Template: L P L S  -  E
Sequence: S A - L  R  Q
State:    M M Is M It M
• Match state (M)
• Insertion at sequence (Is)
• Insertion at template (It)
The structural alignments generated by DeepAlign are used as training data.
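The state labelling on this slide can be sketched in a few lines. The mapping used here (a gap in the sequence row gives Is, a gap in the template row gives It) is inferred from the slide's own example, and the function name is hypothetical.

```python
def alignment_states(template_row: str, sequence_row: str, gap: str = "-") -> list:
    """Label each alignment column as M, Is, or It."""
    if len(template_row) != len(sequence_row):
        raise ValueError("aligned rows must have the same length")
    states = []
    for t, s in zip(template_row, sequence_row):
        if t == gap and s == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if s == gap:
            states.append("Is")  # template residue faces a gap in the sequence
        elif t == gap:
            states.append("It")  # sequence residue faces a gap in the template
        else:
            states.append("M")   # both rows carry a residue
    return states


print(alignment_states("LPLS-E", "SA-LRQ"))
```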

DeepAlign for Structure Alignment
• evolutionary information
• local sub-structure similarity
• angular similarity for hydrogen bonding
BLOSUM is the local amino acid substitution matrix;
CLESUM is the local sub-structure substitution matrix;
v(i,j) measures the angular similarity for hydrogen bonding;
d(i,j) measures the spatial proximity of two aligned residues.
Score(i,j) = ( max(0, BLOSUM(i,j)) + CLESUM(i,j) ) * v(i,j) * d(i,j)
The bracketed sum is the local similarity; v(i,j) * d(i,j) is the global similarity.
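As a small worked example of the scoring formula, the sketch below evaluates it for one residue pair. The function name and the input numbers are made up for illustration; real BLOSUM and CLESUM entries and the v and d terms would come from DeepAlign itself.

```python
def deepalign_pair_score(blosum: float, clesum: float, v: float, d: float) -> float:
    """Score(i,j) = (max(0, BLOSUM) + CLESUM) * v * d for one residue pair."""
    # Negative BLOSUM entries are clipped to zero before adding CLESUM.
    local_similarity = max(0.0, blosum) + clesum
    return local_similarity * v * d


# Hypothetical inputs: BLOSUM = -2 is clipped to 0, so the score is 3 * 0.5 * 0.8.
print(deepalign_pair_score(blosum=-2, clesum=3, v=0.5, d=0.8))
```

The clipping means an unfavourable substitution score never lowers the local term; only the structural CLESUM term can do that.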

CNF-based Alignment Model
Given an alignment $A = \{a_1, a_2, \ldots, a_L\}$ with $a_i \in \{M, I_t, I_s\}$ between Sequence S and Template T, define a conditional probability
$$P(A \mid S, T) = \exp\Big(\sum_i E(a_i, a_{i-1}, S, T)\Big) / Z(S, T)$$
Where,
E: a neural network estimating the log-likelihood of a state transition
Z(S,T): normalization factor
Context-specific: E takes S and T as input.
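To make the role of the normalization factor concrete, here is a brute-force sketch that enumerates every state path for a tiny template and sequence and normalizes the exponentiated scores. The `toy_energy` function is a hypothetical stand-in for the neural network E and ignores S and T entirely; a real implementation would compute Z(S,T) with a forward-style dynamic program instead of enumeration.

```python
import math


def enumerate_alignments(n_template: int, n_sequence: int):
    """Yield every state path that consumes all template and sequence residues."""
    if n_template == 0 and n_sequence == 0:
        yield []
        return
    if n_template > 0 and n_sequence > 0:
        for rest in enumerate_alignments(n_template - 1, n_sequence - 1):
            yield ["M"] + rest   # match consumes one residue from each
    if n_template > 0:
        for rest in enumerate_alignments(n_template - 1, n_sequence):
            yield ["Is"] + rest  # template residue against a gap in the sequence
    if n_sequence > 0:
        for rest in enumerate_alignments(n_template, n_sequence - 1):
            yield ["It"] + rest  # sequence residue against a gap in the template


def toy_energy(curr, prev):
    """Hypothetical transition score: reward matches, penalize gap opening more than extension."""
    if curr == "M":
        return 1.0
    return -0.5 if curr == prev else -1.0


def alignment_probabilities(n_template: int, n_sequence: int, energy):
    """Return P(A) = exp(sum_i E(a_i, a_{i-1})) / Z for every path A."""
    paths = [tuple(p) for p in enumerate_alignments(n_template, n_sequence)]
    weights = []
    for path in paths:
        score = sum(energy(curr, prev) for prev, curr in zip((None,) + path, path))
        weights.append(math.exp(score))
    z = sum(weights)  # normalization factor over all alignments
    return {path: w / z for path, w in zip(paths, weights)}


probs = alignment_probabilities(2, 2, toy_energy)
best = max(probs, key=probs.get)
print(len(probs), best)
```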

Comprehensive Features
Sequence S: MTYKLILN--GKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template T: MTYKLILNSTVRTKSDTVTDAVP---ADKICSFAQQLPWEREWSF--
• EAA: how similar the two residues are
• Esp, Epp: how similar the query's sequence and profile are to the template's profile
• Ess3, Ess8: how similar the template's secondary structure is to the sequence's predicted secondary structure (3-class and 8-class)
• Esa: how similar the query's solvent accessibility is to the template's solvent accessibility
• Ediso: for disordered regions, no structure information is used
Total scoring function is a non-linear combination of:
E( ai, ai-1, EAA , Esp , Epp , Ediso, Ess3 , Ess8 , Esa )

What’s the problem?
• Only the alignment probability is modeled, rather than a log-odds potential relative to a background.
• Only local information is incorporated; global information is missing.

Our solution
Propose a protein alignment potential
• With an elaborately designed reference state.
• Can be generalized to sequence-sequence, sequence-structure, and structure-structure alignment.
Incorporate both local and global terms
• For the local term, the CNFpred potential is applied.
• For the global term, the EPAD potential is employed.

Protein alignment potential
Given 2 AAs a and b, their mutation potential is defined as follows.
$$u(a, b) = \log \frac{P(a, b)}{P_{ref}(a, b)} = \log \frac{P(a, b)}{P(a)\,P(b)}$$
Similarly, given one alignment A between sequence S and template T, we define the potential of A as follows.
$$u(A \mid S, T) = \log \frac{P(A \mid S, T)}{P_{ref}(A)} = \log \frac{P(A \mid S, T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}}$$
x and y are two random proteins with the same lengths as S and T, respectively; $(x_i, y_i)$ is the i-th of N such pairs.
Assumption: the alignment maximizing the potential is the optimal.
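The amino acid mutation potential on this slide is a log-odds ratio of the observed pair probability against the product of the single-residue background probabilities. A minimal sketch with made-up probabilities (the numbers and the function name are illustrative only):

```python
import math


def mutation_potential(p_joint: float, p_a: float, p_b: float) -> float:
    """Log-odds of observing the aligned pair (a, b) versus independent background."""
    return math.log(p_joint / (p_a * p_b))


# Hypothetical values: the pair is seen twice as often as independence predicts.
print(mutation_potential(p_joint=0.02, p_a=0.1, p_b=0.1))
```

A positive value means the pair occurs more often than chance, a negative value less often; the alignment potential applies the same idea to whole alignments.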

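The alignment probability given sequence S and template T could be modeled as follows,
$$P(A \mid S, T) = \exp\big(F(A \mid S, T) + G(A \mid S, T)\big) / Z(S, T)$$
where F is the local term, G is the global term, and Z is the partition function,
$$Z(S, T) = \sum_{A} \exp\big(F(A \mid S, T) + G(A \mid S, T)\big)$$
Taking the reference state over N random protein pairs $(x_i, y_i)$, the alignment potential becomes
$$u(A \mid S, T) = \log \frac{P(A \mid S, T)}{\sqrt[N]{\prod_{i=1}^{N} P(A \mid x_i, y_i)}}$$
$$= \log \frac{\exp\big(F(A \mid S, T) + G(A \mid S, T)\big) / Z(S, T)}{\sqrt[N]{\prod_{i=1}^{N} \exp\big(F(A \mid x_i, y_i) + G(A \mid x_i, y_i)\big) / Z(x_i, y_i)}}$$
$$= F(A \mid S, T) - \mathrm{EXP}_{x,y} F(A \mid x, y) + G(A \mid S, T) - \mathrm{EXP}_{x,y} G(A \mid x, y) + c(S, T)$$
$\mathrm{EXP}_{x,y}$: expected score, can be calculated in advance by sampling.
$c(S, T)$: independent of any specific alignment.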
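The decomposition can be checked numerically. The sketch below computes the potential once from the definition (log of the probability over the geometric mean of reference probabilities) and once from the decomposed form, where the constant works out to c(S,T) = -log Z(S,T) + mean of log Z(x_i, y_i). All F, G, and Z values are made-up numbers for illustration; `math.prod` requires Python 3.8 or later.

```python
import math


def alignment_probability(f: float, g: float, z: float) -> float:
    """P(A) = exp(F + G) / Z."""
    return math.exp(f + g) / z


def potential_from_definition(target, references) -> float:
    """log of P(A|S,T) over the geometric mean of P(A|x_i,y_i)."""
    p_target = alignment_probability(*target)
    ref_probs = [alignment_probability(*r) for r in references]
    geometric_mean = math.prod(ref_probs) ** (1.0 / len(ref_probs))
    return math.log(p_target / geometric_mean)


def potential_decomposed(target, references) -> float:
    """(F - EXP F) + (G - EXP G) + c, with c built only from partition functions."""
    f, g, z = target
    n = len(references)
    exp_f = sum(r[0] for r in references) / n
    exp_g = sum(r[1] for r in references) / n
    c = -math.log(z) + sum(math.log(r[2]) for r in references) / n
    return (f - exp_f) + (g - exp_g) + c


# Each tuple is (F, G, Z); the values are hypothetical.
target = (3.0, 1.5, 200.0)
references = [(0.5, -0.2, 150.0), (1.0, 0.3, 180.0), (-0.4, 0.1, 220.0)]
print(potential_from_definition(target, references))
print(potential_decomposed(target, references))
```

Because c depends only on the partition functions, it is the same for every alignment of S to T and does not affect which alignment maximizes the potential.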

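Model the local potential
From CNFpred, we use a context-specific linear chain model as,
$$F(A \mid S, T) = \sum_i E(a_i, a_{i-1}, S, T)$$
The local potential is defined as,
$$U_{local}(A \mid S, T) = F(A \mid S, T) - \mathrm{EXP}_{x,y} F(A \mid x, y)$$
The expectation term can be calculated by uniformly sampling a few thousand protein pairs, so the local potential is
$$U_{local}(A \mid S, T) = \sum_i \big(E(a_i, a_{i-1}, S, T) - E(a_i, a_{i-1})\big)$$
where $E(a_i, a_{i-1})$ is the expected score of the state transition over the sampled pairs.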
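A rough sketch of how the local potential could be assembled from sampled background scores follows. It assumes the expectation is tabulated per state transition (current state, previous state), which is one reading of the slide; the function names, the sample values, and the per-position context scores are all hypothetical.

```python
from collections import defaultdict


def expected_transition_scores(samples):
    """Average sampled scores per (current state, previous state) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for curr, prev, score in samples:
        totals[(curr, prev)] += score
        counts[(curr, prev)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def local_potential(path, context_scores, background) -> float:
    """Sum over positions of context-specific score minus background expectation."""
    total = 0.0
    prev = None  # no previous state before the first column
    for curr, score in zip(path, context_scores):
        total += score - background[(curr, prev)]
        prev = curr
    return total


# Hypothetical scores collected from randomly sampled protein pairs.
samples = [
    ("M", None, 0.2), ("M", None, 0.4),
    ("M", "M", 0.5), ("M", "M", 0.1),
    ("Is", "M", -1.0),
]
background = expected_transition_scores(samples)

# Hypothetical context-specific scores E(a_i, a_{i-1}, S, T) along one path.
u_local = local_potential(["M", "M", "Is"], [1.0, 0.8, -0.5], background)
print(u_local)
```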

What's the difference between maximizing probability and maximizing potential?
• Maximize on probability: the alignment is long but less informative and prone to false positives. Good for building models.
• Maximize on potential: the alignment is short but relevant and highly significant. Good for ranking templates.
[Figure: template-sequence alignments produced under each objective]

Model the global potential
From EPAD, we use a context-specific distance-dependent model as,
$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$
The global potential is defined as,
$$U_{global}(A \mid S, T) = G(A \mid S, T) - \mathrm{EXP}_{x,y} G(A \mid x, y)$$
The expectation term can be calculated by uniformly sampling a few thousand residue pairs from templates, so the global potential is
$$U_{global}(A \mid S, T) = \sum_{i,j} \big(\log P(d^{T}_{ij} \mid s_i, s_j) - \log P(d^{T}_{ij})\big)$$
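The global potential can be sketched as a sum of log-odds over aligned residue pairs. Everything below is a toy under stated assumptions: the two probability functions are hypothetical stand-ins for the EPAD distributions, the 8.0 distance cutoff and the hydrophobic set are arbitrary, and each pair is counted once (i < j).

```python
import math


def global_potential(residues, template_dist, cond_prob, ref_prob) -> float:
    """Sum over pairs of log P(d_ij | s_i, s_j) - log P(d_ij)."""
    total = 0.0
    n = len(residues)
    for i in range(n):
        for j in range(i + 1, n):
            # Distance between the template residues aligned to s_i and s_j.
            d = template_dist[i][j]
            total += math.log(cond_prob(d, residues[i], residues[j])) - math.log(ref_prob(d))
    return total


def toy_ref_prob(d: float) -> float:
    """Hypothetical background probability of a distance bin (close vs far)."""
    return 0.2 if d < 8.0 else 0.8


def toy_cond_prob(d: float, a: str, b: str) -> float:
    """Hypothetical context-specific probability: hydrophobic pairs are close more often."""
    hydrophobic = set("AVILMFW")
    close = d < 8.0
    if a in hydrophobic and b in hydrophobic:
        return 0.4 if close else 0.6
    return 0.2 if close else 0.8


# Three aligned sequence residues and the distances of their template partners.
residues = ["L", "V", "K"]
template_dist = [
    [0.0, 5.0, 12.0],
    [5.0, 0.0, 10.0],
    [12.0, 10.0, 0.0],
]
u_global = global_potential(residues, template_dist, toy_cond_prob, toy_ref_prob)
print(u_global)
```

In this toy, only the close L-V pair contributes, because its conditional probability differs from the background; the other two pairs match the background and add zero.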

What’s global information given an alignment?
$$G(A \mid S, T) = \sum_{i,j} \log P(d^{T}_{ij} \mid s_i, s_j)$$
[Figure: residues $s_i$ and $s_j$ of Sequence S are aligned to positions i and j of Template T, which are separated by the distance $d^{T}_{ij}$]
If the alignment is good, the distance of a sequence residue pair shall match well with that of their aligned template residue pair.

Result on 1000*6000
CNFpred (local+global potential) compared to HHpred and to CNFpred (local potential)

Welcome to our server
http://raptorx.uchicago.edu/
Binding
Contact

Thank you 
Jinbo Xu
Feng Zhao
Jianzhu Ma
National Institutes of Health (R01GM0897532)
National Science Foundation (DBI-0960390)
NSF CAREER award CCF-1149811
Alfred P. Sloan Research Fellowship
