Protein threading is a protein structure prediction method that involves "threading" or placing an amino acid sequence into known protein structure templates to find the best matching fold. The key steps are:
1) A query sequence is threaded into structural positions of templates from a structure library to find sequence-structure alignments
2) Alignments are scored and optimized using an objective function accounting for residue interactions and preferences
3) The highest scoring template is selected as the predicted structure, though loop regions are often not accurately predicted
2. Overview
Protein threading definition
Why do we need protein threading
Basic principles
Workflow
Describe each step of the workflow
Assessment step is important
Software/tools
Advantages/applications
Limitations
References
3. “Threading”?
o Threading = placing/aligning
o The amino acid sequence is "threaded" into the template structure by statistical principles, and the aligned regions are stitched together.
o The sequence is placed into the template by force.
o Given a protein sequence and a template library, return the best sequence-structure alignment.
o In threading, a new sequence is mounted on a series of known folds with the goal of finding a fold (a sequence-structure alignment) that provides the best score (lowest energy).
5. Need for threading
Sequence identity < 20%
Fold recognition
No 3D structure similarity
Computational limitations
6. Protein threading v/s Fold recognition
Protein threading:
o Structure prediction method
o Involves the process of "fold recognition"
Fold recognition:
o Identification of "folds"
Each distinct topology of alpha helices and beta sheets makes up a fold.
Fold classes: all-alpha, all-beta, alpha/beta, etc.
7. History
1. Bowie et al. (1991), on "the inverse protein folding problem": the foundation.
Used simple measures of the fitness of different amino acid types for local structural environments, in terms of solvent accessibility and protein secondary structure.
2. The work by Jones, Taylor, and Thornton (1992), "A new approach to protein fold recognition".
8. Principles
A limited number of basic folds is found in nature (~1500).
Amino acid preferences for different structural environments provide sufficient information to choose among folds.
The core secondary structure regions (helices and sheets) can be modelled and the structure predicted.
Given the large number of amino acid sequences in databases like UniProt, one would expect a high number of folds. In reality the number is limited: nature appears to have re-used the same folds again and again to perform new functions.
9. Requirements
1. Query sequence
2. Library of core fold templates
3. Objective function (evaluates any particular placement of a sequence in a core template)
4. Method for searching over space of alignments between sequence and each core template
5. Method for choosing the best template given alignments
10. Workflow
Fold library
Query sequence
Thread the query, in its sequential order (gaps allowed), into the structural positions of a template structure
Optimize by fitness scores
Sequence structure alignments
Statistical assessment
Prediction of backbone atoms of the query protein
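The workflow above can be sketched end to end on toy data. This is a minimal illustration, not the method of any particular threading program: the fold library is reduced to strings of buried ("B") and exposed ("E") positions, threading is gapless, and the energy table, `HYDROPHOBIC` set, and function names are all assumptions made for the example.

```python
# Toy singleton energy: hydrophobic residues prefer buried ("B") positions,
# all other residues prefer exposed ("E") positions. Values are illustrative.
HYDROPHOBIC = set("AVLIMFWC")

def singleton_energy(residue, environment):
    favourable = (residue in HYDROPHOBIC) == (environment == "B")
    return -1.0 if favourable else 1.0

def thread_ungapped(query, template_env):
    """Lowest energy over all gapless placements of the query on one template."""
    best = None
    for offset in range(len(template_env) - len(query) + 1):
        window = template_env[offset:offset + len(query)]
        energy = sum(singleton_energy(r, e) for r, e in zip(query, window))
        if best is None or energy < best[0]:
            best = (energy, offset)
    return best  # None if the template is shorter than the query

def recognize_fold(query, fold_library):
    """Thread the query onto every template and rank templates by energy."""
    results = []
    for name, template_env in fold_library.items():
        placement = thread_ungapped(query, template_env)
        if placement is not None:
            results.append((placement[0], name, placement[1]))
    return sorted(results)  # lowest energy first

# Hypothetical two-template library and query sequence.
fold_library = {"fold_A": "BBEE", "fold_B": "EBEB"}
ranking = recognize_fold("LVKE", fold_library)
print(ranking[0])
```

The first entry of `ranking` is the selected template; a real pipeline would follow this with the statistical assessment and backbone-building steps.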
11. 1. Sequence selection
The query sequence ("target")
Sequence identity < 20%
2. Fold library
FSSP, PDB-Select, PISCES
The result depends on the size and detail of the library.
For better results, the library must be sufficiently large.
Remove homologous structures.
13. 3. Threading
Different proteins can fold into similar 3D shapes, with similar interaction patterns among their residues and between residues and their environment.
These interaction patterns can plausibly be captured using simple statistics-based energy models.
Threading (placing the backbone atoms of) the residues of a query protein into the correct structural positions of the correct structural fold needs:
(a) an energy function whose global minimum corresponds to the correct placement of residues into the correct structural template
(b) an algorithm to find the global minimum of the given energy function.
15. 4. Optimization
Possible sequence-template alignments are scored using a specified objective function.
The objective function scores the sequence-structure compatibility between
1. sequence amino acids
2. their corresponding positions in a core template
The score takes into account:
1. amino acid preferences for solvent accessibility
2. amino acid preferences for particular secondary structures
3. interactions among spatially neighboring amino acids
"The objective function includes interactions between neighboring (in 3D) amino acids"
17. The energy of every alignment is given by the sum of pairwise residue-residue interactions.
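One common way to write such an objective function is as a sum of per-position (singleton) terms and pairwise terms. The notation below is an assumption introduced for illustration: a_i is the amino acid placed at template position s_i, e_single covers the solvent accessibility and secondary structure preferences, e_pair is the residue-residue interaction term, and d_c is a contact distance cutoff.

```latex
E(\text{alignment}) =
  \sum_{i} e_{\mathrm{single}}(a_i, s_i)
  + \sum_{\substack{i < j \\ d(s_i, s_j) \le d_c}} e_{\mathrm{pair}}(a_i, a_j)
```

Gap penalties, where used, would enter as an additional term. The best alignment is the one that minimizes E.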
18. Solvent accessibility
Residue solvent accessibility is defined as the extent of accessible surface area of a given residue.
It is determined by the spatial arrangement and packing of the residues.
It is important in the fold recognition process.
Relative solvent accessibility (RSA) prediction is done by:
1. Two-state (exposed-buried) and three-state (exposed-intermediate-buried) models
2. Models based on relative solvent accessibility values
Algorithms:
1. Neural network
2. Nearest neighbour
3. Support vector regression
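The two-state and three-state models can be illustrated with a short sketch that turns an accessible surface area into an RSA value and then into a state label. The cutoffs used here (0.25 for two states, 0.09 and 0.36 for three states) are illustrative assumptions; published methods use a range of thresholds, and the function names are hypothetical.

```python
def relative_accessibility(asa, max_asa):
    """RSA = observed accessible surface area / maximum area for that residue type."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return asa / max_asa

def two_state(rsa, cutoff=0.25):
    """Exposed/buried label; the cutoff is an assumed, adjustable threshold."""
    return "exposed" if rsa >= cutoff else "buried"

def three_state(rsa, low=0.09, high=0.36):
    """Buried/intermediate/exposed label; both thresholds are assumptions."""
    if rsa < low:
        return "buried"
    if rsa < high:
        return "intermediate"
    return "exposed"

# Hypothetical residue: 30 square angstroms exposed out of a maximum of 200.
rsa = relative_accessibility(30.0, 200.0)
print(rsa, two_state(rsa), three_state(rsa))
```

A predictor (neural network, nearest neighbour, support vector regression) would estimate either the state label or the RSA value directly from sequence.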
19. Search space
If interaction terms between amino acids are not allowed
– dynamic programming finds the optimal alignment efficiently
– deterministic
If interaction terms are allowed
– heuristic methods (fast, might not find the optimal alignment)
– exact methods (optimal, might take exponential time, might fail due to time or space limits), e.g. branch and bound
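For the case without interaction terms, the dynamic programming search can be sketched as a global sequence-to-structure alignment that minimizes energy. This is a minimal sketch under stated assumptions: the template is a string of buried ("B") and exposed ("E") positions, the singleton energies and the linear gap penalty are toy values, and `thread_dp` is a hypothetical name.

```python
HYDROPHOBIC = set("AVLIMFWC")  # illustrative grouping

def singleton_energy(residue, environment):
    """Toy energy: -1 for a favourable residue/environment pair, +1 otherwise."""
    favourable = (residue in HYDROPHOBIC) == (environment == "B")
    return -1.0 if favourable else 1.0

def thread_dp(seq, template_env, gap=1.0):
    """Return (minimum energy, aligned (sequence index, template index) pairs)."""
    n, m = len(seq), len(template_env)
    energy = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        energy[i][0] = i * gap          # leading unaligned residues
    for j in range(1, m + 1):
        energy[0][j] = j * gap          # leading unfilled template positions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = energy[i - 1][j - 1] + singleton_energy(seq[i - 1], template_env[j - 1])
            energy[i][j] = min(match, energy[i - 1][j] + gap, energy[i][j - 1] + gap)
    # Trace back from the corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        match = energy[i - 1][j - 1] + singleton_energy(seq[i - 1], template_env[j - 1])
        if energy[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif energy[i][j] == energy[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return energy[n][m], pairs

print(thread_dp("LKVE", "BEBE"))
```

The table has (n+1) x (m+1) cells and each is filled in constant time, which is why this case is efficient. Once pairwise terms are added, the score of a cell depends on choices made elsewhere in the alignment and this decomposition no longer holds.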
20. Branch and bound
Objective function definition
Lower bound setup
Splitting of threading sets:
• split the segment having the widest interval
• choose as split point the value that yields the lower bound for the set
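The three ingredients listed above can be sketched on a toy problem. This is a simplified illustration, not a reproduction of any published implementation: core segments have fixed lengths and must be placed in order without overlap, a threading set is one interval of allowed start positions per segment, the lower bound simply sums per-term minima over those intervals, and the split point is the interval midpoint rather than the bound-achieving value described above. The score functions `g1` and `g2` are arbitrary placeholders.

```python
import heapq
from itertools import product

def feasible(starts, lengths, n):
    """True if segments are in order, non-overlapping, and inside the sequence."""
    pos = 0
    for start, length in zip(starts, lengths):
        if start < pos:
            return False
        pos = start + length
    return pos <= n

def total_energy(starts, g1, g2):
    m = len(starts)
    single = sum(g1(i, starts[i]) for i in range(m))
    pair = sum(g2(i, j, starts[i], starts[j]) for i in range(m) for j in range(i + 1, m))
    return float(single + pair)

def lower_bound(intervals, g1, g2):
    """Sum of per-term minima; ignores ordering constraints, so it never overestimates."""
    m = len(intervals)
    bound = 0.0
    for i, (lo, hi) in enumerate(intervals):
        bound += min(g1(i, t) for t in range(lo, hi + 1))
    for i in range(m):
        for j in range(i + 1, m):
            bound += min(g2(i, j, a, b)
                         for a in range(intervals[i][0], intervals[i][1] + 1)
                         for b in range(intervals[j][0], intervals[j][1] + 1))
    return bound

def branch_and_bound(n, lengths, g1, g2):
    m = len(lengths)
    root = tuple((0, n - length) for length in lengths)
    heap = [(lower_bound(root, g1, g2), root)]
    while heap:
        bound, intervals = heapq.heappop(heap)   # most promising set first
        widths = [hi - lo for lo, hi in intervals]
        widest = max(range(m), key=widths.__getitem__)
        if widths[widest] == 0:                  # a single threading
            starts = tuple(lo for lo, _ in intervals)
            if feasible(starts, lengths, n):
                return bound, starts             # bound is exact here
            continue
        lo, hi = intervals[widest]
        mid = (lo + hi) // 2                     # assumed midpoint split
        for part in ((lo, mid), (mid + 1, hi)):
            child = intervals[:widest] + (part,) + intervals[widest + 1:]
            heapq.heappush(heap, (lower_bound(child, g1, g2), child))
    return None

def brute_force(n, lengths, g1, g2):
    """Exhaustive search, only for checking the toy example."""
    candidates = product(*(range(n - length + 1) for length in lengths))
    scored = [(total_energy(s, g1, g2), s) for s in candidates if feasible(s, lengths, n)]
    return min(scored) if scored else None

# Arbitrary placeholder score functions and problem size.
def g1(i, t):
    return ((3 * i + 5 * t) % 7) - 3

def g2(i, j, a, b):
    return ((a + 2 * b + i + j) % 5) - 2

n_residues, core_lengths = 12, [3, 2, 3]
bb_result = branch_and_bound(n_residues, core_lengths, g1, g2)
bf_result = brute_force(n_residues, core_lengths, g1, g2)
print(bb_result, bf_result)
```

Because sets are expanded in order of increasing lower bound and the bound of a single threading equals its energy, the first feasible single threading popped from the queue is optimal. The worst case remains exponential.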
22. The score function recognizes correct arrangements of protein residues.
It is usually more coarse-grained than the functions used in a real energy calculation.
The residues are placed on the backbone of the template structure and, from there, one can calculate ideal coordinates for the Cβ atom, since most of the chemical identity of a residue is represented by an interaction site located at the Cβ atom.
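Placing an ideal Cβ from the backbone atoms alone can be sketched as follows. The three coefficients are a widely used approximation for ideal tetrahedral geometry at Cα; they do not come from these slides and are labeled here as an assumption, as are the function names and the example coordinates.

```python
import math

def subtract(u, v):
    return tuple(a - b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def ideal_cb(n, ca, c):
    """Approximate C-beta position from backbone N, C-alpha, and C coordinates."""
    b = subtract(ca, n)        # N -> CA
    c_vec = subtract(c, ca)    # CA -> C
    a = cross(b, c_vec)        # normal to the N-CA-C plane
    # Assumed coefficients for ideal geometry (not taken from the slides).
    return tuple(-0.58273431 * a_k + 0.56802827 * b_k - 0.54067466 * c_k + ca_k
                 for a_k, b_k, c_k, ca_k in zip(a, b, c_vec, ca))

# Hypothetical backbone with roughly ideal bond lengths and N-CA-C angle.
n_atom = (1.458, 0.0, 0.0)
ca_atom = (0.0, 0.0, 0.0)
c_atom = (-0.5465, 1.4237, 0.0)
cb_atom = ideal_cb(n_atom, ca_atom, c_atom)
ca_cb_distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(ca_atom, cb_atom)))
print(cb_atom, ca_cb_distance)
```

The resulting Cα-Cβ distance should be close to a typical carbon-carbon single bond length, which is a quick sanity check on the input geometry.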
How to build a scoring function?
(i) Contact potentials
(ii) Quasi-chemical approximation
If we know the concentrations of two particle types A and B, we can calculate how often they would be observed at a certain distance from each other by chance.
The free energy G is a function of the distance r_AB between particles of types A and B:
$$G(r_{AB}) = -kT \ln \frac{\rho(r_{AB})}{\rho^{0}(r_{AB})}$$
where
k = Boltzmann's constant
T = temperature
ρ(r_AB) is the observed frequency of AB pairs at distance r
ρ⁰(r_AB) is the frequency of AB pairs at distance r expected by chance
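The observed-versus-chance ratio can be turned into energies with a few lines of code. This sketch collapses the distance dependence into a single "in contact" bin and estimates the chance frequency from how often each residue type takes part in any contact; both simplifications, the function name, and the example counts are assumptions for illustration.

```python
import math
from collections import Counter

def contact_potential(contact_counts, kT=1.0):
    """Map unordered residue-type pairs to -kT * ln(observed / expected)."""
    observed = Counter()
    for (a, b), count in contact_counts.items():
        observed[tuple(sorted((a, b)))] += count
    total = sum(observed.values())
    # How often each residue type takes part in a contact (two ends per contact).
    participation = Counter()
    for (a, b), count in observed.items():
        participation[a] += count
        participation[b] += count
    freq = {t: count / (2 * total) for t, count in participation.items()}
    energies = {}
    for (a, b), count in observed.items():
        if count == 0:
            continue  # log(0) is undefined; real methods add pseudocounts
        expected = total * freq[a] * freq[b] * (1 if a == b else 2)
        energies[(a, b)] = -kT * math.log(count / expected)
    return energies

# Hypothetical contact counts for two residue types.
example_counts = {("A", "A"): 20, ("B", "A"): 10, ("B", "B"): 10}
example_energies = contact_potential(example_counts)
print(example_energies)
```

Pairs seen more often than expected by chance receive negative (favourable) energies, and pairs seen less often receive positive ones.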
26. (b) Quasi-chemical
o Approximation method
o Derives pairwise contact potentials from the number of residue-residue contacts found
o Quite successful
o Finds the interaction parameters for amino acids
o By measuring ∆G (experimentally) for mutated proteins, we obtain the differences in contact energies, which can then be used in the potential models
∆G = Hmut - Hwild, where H = potential energy
27. Assessment
1. Based on the energy function (the lower the energy, the better the sequence-structure alignment)
2. Identification of "reliable" versus "unreliable" parts of a threaded structure by quantitative assessment of the structural deviations, in terms of RMSD, for regions of predicted structures.
3. Calculation of z-scores. The aim is to find the score function that gives the greatest z-score.
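A z-score expresses how far the energy of the chosen alignment lies from a background distribution, for example the energies obtained by threading the same sequence onto other templates or shuffled sequences. The sketch below uses the sample standard deviation; the function name and the example numbers are assumptions.

```python
import statistics

def z_score(energy, background_energies):
    """(energy - mean of background) / sample standard deviation of background."""
    mean = statistics.mean(background_energies)
    sd = statistics.stdev(background_energies)  # needs at least two values
    if sd == 0:
        raise ValueError("background energies have zero spread")
    return (energy - mean) / sd

# Hypothetical energies: one candidate alignment against three background scores.
z = z_score(-10.0, [-2.0, 0.0, 2.0])
print(z)
```

Because lower energy is better, a strongly negative z-score indicates a significant hit; "greatest z-score" then refers to the largest magnitude.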
28. Loop modeling
A threading program can provide a reasonably accurate structure for the backbone atoms in the core secondary structures, while predictions for the loop regions are often not accurate.
Secondary structures among homologous proteins are generally well conserved, but loops often are not.
Hence, template-based loop predictions are generally not accurate.
MODELLER runs a protocol of energy minimization and molecular dynamics simulation to refine a structural model.
After a structure model is generated, one can apply structure assessment tools such as WHATIF and PROCHECK.
Based on this assessment, a user can pick the best among the multiple structures derived from an alignment.
30. Variations
1D-3D profile methods
Prepare a profile first for each residue position:
1. How buried it is
2. Environment (polar/non-polar)
3. Local secondary structure (helix/sheet)
Then:
4. Calculate the score for a sequence by dynamic programming (DP)
5. Calculate the significance by z-score
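The profile preparation step can be sketched as assigning each template position an environment class from its burial, polarity, and secondary structure, then scoring a sequence against the resulting profile. The 0.5 thresholds, the class labels, and the score table are illustrative assumptions, and the scoring shown is gapless for brevity where a real method would use dynamic programming.

```python
def environment_class(buried_fraction, polar_fraction, secondary_structure):
    """Combine burial, polarity, and secondary structure into one class label."""
    burial = "buried" if buried_fraction >= 0.5 else "exposed"       # assumed cutoff
    polarity = "polar" if polar_fraction >= 0.5 else "nonpolar"      # assumed cutoff
    return (burial, polarity, secondary_structure)

def build_profile(positions):
    """positions: list of (buried_fraction, polar_fraction, secondary_structure)."""
    return [environment_class(*p) for p in positions]

def profile_score(seq, profile, table, default=0.0):
    """Gapless score: sum of table[(environment, residue)] along the sequence.
    Higher is better here, as for log-odds style scores."""
    return sum(table.get((env, aa), default) for aa, env in zip(seq, profile))

# Hypothetical two-position template: a buried nonpolar helix site, an exposed polar strand site.
profile = build_profile([(0.9, 0.1, "helix"), (0.2, 0.8, "sheet")])
score_table = {
    (("buried", "nonpolar", "helix"), "L"): 1.2,
    (("exposed", "polar", "sheet"), "K"): 0.8,
}
score = profile_score("LK", profile, score_table)
print(profile, score)
```

In practice the table would be derived from residue frequencies observed in each environment class across known structures.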
31. Limitations
1. Threading requires searching over a large set of possible alignments for the one that delivers minimum "energy". When pairwise interaction terms are included, such a search is an NP-complete problem (i.e. there is an apparent "Levinthal" paradox in threading).
2. The search in threading is biased by the energy function, so a related key issue is the precision of the energy function.
3. Fold recognition for structural analogues and some remote homologues is still challenging (modeling techniques such as protein threading can be applied, but the predictions typically have a low confidence level).
4. Even when a correct fold is identified, the accuracy of the threading alignment has been about 60-90% for proteins with less than 30% sequence identity to their template structures.
5. Current energy functions are generally coarse-grained, mainly to achieve fast predictions.
6. There is still significant room for improving the computational efficiency of threading programs.
32. References
1. Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). A new approach to protein fold recognition.
2. Ying Xu, Zhijie Liu, Liming Cai, and Dong Xu. Protein Structure Prediction by Protein Threading.
3. https://web.stanford.edu/class/cs273/refs/torda_chapter_proteomics.pdf
4. http://www.mit.edu/~leonid/publications/Mirny_Shakhnovich_ProtStucPredThread.pdf
5. https://biostat.wisc.edu/bmi776/lectures/threading.pdf
6. Burkhard Rost, Reinhard Schneider, and Chris Sander. Protein Fold Recognition by Prediction-based Threading.
7. COS 597c: Topics in Computational Molecular Biology. Lecturer: Larry Brown. Scribe: Jessica Bessler.