Homology Modeling
Computational methods for Protein Structure Prediction
• Homology or Comparative Modeling
• Fold Recognition or threading Methods
• Ab initio methods
– that utilize knowledge-based information
– without the aid of knowledge-based information
Why do we need computational approaches?
 The goal of research in the area of structural genomics is to characterize
and identify the large number of protein sequences that are being
discovered.
 Knowledge of the three-dimensional structure
• helps in the rational design of site-directed mutations.
• can be of great importance for the design of drugs.
• greatly enhances our understanding of how proteins function and
how they interact with each other; for example, it can explain antigenic
behaviour, DNA-binding specificity, etc.
 Structural information from X-ray crystallography or NMR
• is obtained much more slowly than sequence data.
• requires elaborate technical procedures.
• is unavailable for many proteins, which fail to crystallize at all and/or cannot be obtained or
dissolved in large enough quantities for NMR measurements.
• is also limited by the size of the protein in the case of NMR.
Homology modeling
Based on two major observations (and some
simplifications):
1. The structure of a protein is uniquely defined by
its amino acid sequence.
2. Similar sequences adopt similar structures.
(Distantly related sequences may still fold into
similar structures.)
Homology modeling needs three items of input:
• The sequence of a protein with unknown 3D
structure, the "target sequence."
• A 3D “template” – a structure having the highest
sequence identity with the target sequence ( >30%
sequence identity)
• A sequence alignment between the target sequence
and the template sequence
Homology Modeling: How it works
o Find template
o Align target sequence
with template
o Generate model:
- add loops
- add sidechains
o Refine model
Homology Modeling
• Simplest, reliable approach.
• Basis: proteins with similar sequences tend to fold into similar structures.
• It has been observed that even proteins with 25% sequence identity can fold
into similar structures.
• Generally unreliable for remote homologs (< 25% pairwise identity).
• Predicts the three-dimensional structure of a given protein sequence
(TARGET) based on an alignment to one or more known protein
structures (TEMPLATES).
• If similarity between the TARGET sequence and the TEMPLATE sequence
is detected, structural similarity can be assumed.
• In general, about 30% sequence identity or more is needed to generate useful
models.
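The identity thresholds above are computed from a pairwise alignment. The following is a minimal sketch, assuming identity is defined as the number of identical aligned pairs divided by the number of columns in which neither sequence has a gap; other definitions (for example, dividing by the length of the shorter sequence) are also in use, and the function name is illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Five gap-free columns, four of them identical.
    print(percent_identity("ACDE-F", "ACDQGF"))
```

A target/template pair scoring well below the 25-30% range would be a candidate for fold recognition rather than homology modeling.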
Homology Modeling Process
 Template recognition
 Alignment
 Determining structurally conserved regions
 Backbone generation
 Building loops or variable regions
 Conformational search for side chains
 Refinement of structure
 Validating structures
Structural Prediction by Homology Modeling
[Flowchart. Boxes: Structural Databases, Reference Proteins, Conserved Regions, Protein Sequence, Predicted Conserved Regions, Initial Model, Structure Analysis, Refined Model. Method and tool labels: SeqFold, Profiles-3D, PSI-BLAST, BLAST & FASTA, fold-recognition methods (FUGUE); Cα Matrix Matching; Sequence Alignment; Coordinate Assignment; Loop Searching/generation; Sidechain Rotamers and/or MM/MD; MODELER; WHAT IF, PROCHECK, PROSAII, ...]
Template Recognition
• First, we search a structural database of proteins for sequences (templates) related to the target
sequence.
• The accuracy of the model depends on selecting a proper template.
• FASTA and BLAST from EMBL-EBI and NCBI can be used.
• This gives a probable set of templates, but the final one is not yet decided.
• After initial alignments and finding structurally conserved regions among
the templates, we choose the final template.
Structurally Conserved Regions (SCRs)
SCRs are regions in all proteins of a particular family that are
nearly identical in structure.
They tend to be in the inner cores of the proteins.
They usually contain alpha-helices and beta-sheets.
No SCR can span more than one secondary structure element.
There are generally two main approaches to identifying them:
Constructing a C-alpha distance matrix.
Aligning vectors of secondary structure units.
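The C-alpha distance matrix approach can be illustrated with a short sketch. It assumes the Cα coordinates have already been extracted as (x, y, z) tuples; the function names are illustrative. Because the matrix depends only on internal distances, it is unchanged by rotation or translation, so two structures can be compared without superimposing them first.

```python
import math


def ca_distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise C-alpha distances."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


def max_matrix_difference(m1, m2):
    """Largest absolute difference between two equally sized distance matrices."""
    return max(abs(a - b) for row1, row2 in zip(m1, m2)
               for a, b in zip(row1, row2))


if __name__ == "__main__":
    fragment = [(0, 0, 0), (3, 4, 0)]
    shifted = [(1, 2, 3), (4, 6, 3)]  # same fragment, translated by (1, 2, 3)
    d1 = ca_distance_matrix(fragment)
    d2 = ca_distance_matrix(shifted)
    # A small difference over a segment suggests the segment is structurally conserved.
    print(d1, max_matrix_difference(d1, d2))
```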
Determining Structurally Conserved Regions
• When two or more reference protein structures are
available, establish structural guidelines for the family of
proteins under consideration.
• First step in building a model protein by homology is
determining what regions are structurally conserved or
constant among all the reference proteins.
• The target protein is assumed to adopt the same
conformation in the conserved regions.
Alignment in Homology Modeling
Sequence alignment is a central technique in homology modeling.
 It is used to determine which areas of the reference proteins are
conserved in sequence.
 This suggests where the reference proteins may also be
structurally conserved.
 After SCRs are found, it is used to establish a one-to-one correspondence
between the amino acids of the reference proteins and the target in the SCRs.
 This provides the basis for transferring coordinates from the
reference to the model.
The First Developed Algorithm
 Needleman and Wunsch algorithm for pairwise sequence
alignment.
 It is based on the dynamic programming method.
 It is a global alignment approach.
 Modified version
 The Smith-Waterman algorithm is a modification of Needleman-Wunsch.
 It is for local alignment.
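A minimal sketch of Needleman-Wunsch global alignment follows. The scoring scheme (match +1, mismatch -1, linear gap penalty -1) is an assumption chosen for brevity; practical protein alignment would normally use a substitution matrix and affine gap penalties. The Smith-Waterman variant differs mainly in clamping each cell at zero and tracing back from the highest-scoring cell.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The path recovered by the traceback is the "pathway through the matrix" that defines which residues of the two sequences correspond.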
Alignment of Target Protein with SCR
 After doing alignments and finding SCRs, we align the unknown sequence
with the aligned reference proteins, using the knowledge of the SCRs.
 Since an SCR cannot contain insertions or deletions, an enhanced version of the
standard alignment algorithm is used.
 After finding a suitable template and aligning the unknown sequence
with it, coordinates are assigned within the conserved regions.
Alignment of Model sequence with Reference sequences
having SCRs
Mapping the Pathway Through the Matrix
Assignment of coordinates within conserved region
 Once the correspondence between amino acids in the reference and model
sequences has been made, the coordinates for an SCR can be assigned.
 The reference proteins' coordinates are used as a basis for this
assignment.
 When the side chains of the reference and model proteins are the same at
corresponding locations along the sequence, all the coordinates for the
amino acid are transferred.
 When they differ, the backbone coordinates are transferred, but the side
chain atoms are automatically replaced to preserve the model protein's
residue types.
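The transfer rule above can be sketched as follows. The residue representation (a dict with a "name" and an "atoms" mapping of atom name to coordinates) and the function name are hypothetical, chosen only for illustration; rebuilding the replaced side chain is left to a later step and is not shown.

```python
# Backbone atom names as used in PDB files.
BACKBONE = ("N", "CA", "C", "O")


def transfer_coordinates(template_residue, target_name):
    """Copy coordinates from an aligned template residue to the model residue.

    Same residue type: every atom is copied.
    Different residue type: only backbone atoms are copied, and the side
    chain must be rebuilt afterwards for the target residue type.
    """
    atoms = template_residue["atoms"]
    if template_residue["name"] == target_name:
        copied = dict(atoms)
    else:
        copied = {name: xyz for name, xyz in atoms.items() if name in BACKBONE}
    return {"name": target_name, "atoms": copied}


if __name__ == "__main__":
    ser = {"name": "SER", "atoms": {
        "N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "C": (2.0, 1.4, 0.0),
        "O": (1.3, 2.4, 0.0), "CB": (2.0, -0.8, 1.2), "OG": (3.4, -0.9, 1.2)}}
    print(sorted(transfer_coordinates(ser, "SER")["atoms"]))
    print(sorted(transfer_coordinates(ser, "ALA")["atoms"]))
```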
Assignment of coordinates in loop or variable region
Two main methods
 Finding similar peptide segments in other proteins.
 Advantage: all loops found are guaranteed to have reasonable internal
geometries and conformations.
 Disadvantage: may not fit properly into the given model protein’s
framework.
 Generating a segment de-novo.
Selection Of Loops
 Check the loops on the basis of steric overlaps.
 A specified degree of overlap can be tolerated.
 Check the atoms within the loop against each other.
 Then check loop atoms against rest of the protein’s atoms.
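The two-stage overlap check can be sketched as below. The distance cutoff, the clash tolerance and all function names are assumptions for illustration; a real implementation would use element-specific van der Waals radii and derive the set of covalently bonded pairs from the topology.

```python
import math
from itertools import combinations


def internal_clashes(loop_atoms, min_dist, bonded=frozenset()):
    """Count loop atom pairs closer than min_dist, skipping bonded index pairs (i, j), i < j."""
    count = 0
    for (i, a), (j, b) in combinations(enumerate(loop_atoms), 2):
        if (i, j) in bonded:
            continue
        if math.dist(a, b) < min_dist:
            count += 1
    return count


def external_clashes(loop_atoms, protein_atoms, min_dist):
    """Count loop atoms that come closer than min_dist to atoms of the rest of the protein."""
    return sum(1 for a in loop_atoms for b in protein_atoms
               if math.dist(a, b) < min_dist)


def loop_is_acceptable(loop_atoms, protein_atoms, min_dist=3.0, max_clashes=0,
                       bonded=frozenset()):
    """Accept the loop if the total number of overlaps stays within the tolerance."""
    total = (internal_clashes(loop_atoms, min_dist, bonded)
             + external_clashes(loop_atoms, protein_atoms, min_dist))
    return total <= max_clashes


if __name__ == "__main__":
    loop = [(0, 0, 0), (4, 0, 0)]
    rest = [(0, 0, 1), (10, 10, 10)]
    print(loop_is_acceptable(loop, rest))                  # one overlap, none tolerated
    print(loop_is_acceptable(loop, rest, max_clashes=1))   # one overlap tolerated
```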
Side Chain Conformation Search
 With bond lengths, bond angles and two rotatable backbone bonds per
residue (φ and ψ), it is very difficult to find the best conformation of a side
chain.
 In addition, the side chains of many residues have one or more degrees of
freedom of their own.
 Hence the conformational search in loop regions is carried out first.
 Side chains replaced during coordinate transfer are also
checked.
Selection Of Good Rotamer
 Fortunately, statistical studies show that side chains adopt only a small number of
the many possible conformations.
 The correct rotamer of a particular residue is mainly determined by its local
environment.
 Side chains generally adopt conformations in which they are closely packed.
It is observed that:
 In homologous proteins, corresponding residues virtually retain the same
rotameric state.
 Within a range of χ values, 80% of the identical residues and 75% of the
mutated residues have the same conformations.
 Certain rotamers are almost always associated with certain secondary structures.
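Comparing rotameric states comes down to comparing side chain torsion (χ) angles. The sketch below computes a torsion angle from four atom positions and tests whether two χ values agree within a tolerance. The tolerance is deliberately a required argument, because the range used in the statistics above is not stated here; the function names are illustrative.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond (e.g. N, CA, CB, CG for chi1)."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = tuple(x - _dot(b0, b1) * y for x, y in zip(b0, b1))
    w = tuple(x - _dot(b2, b1) * y for x, y in zip(b2, b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees (wraps at 360)."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def same_rotamer(chi_a, chi_b, tolerance_deg):
    """True if two chi angles differ by no more than tolerance_deg."""
    return angular_difference(chi_a, chi_b) <= tolerance_deg


if __name__ == "__main__":
    chi = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
    print(abs(chi), same_rotamer(170, -170, 30))
```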
Refinement of model using Molecular Mechanics
Many structural artifacts can be introduced while the model
protein is being built.
 Substitution of large side chains for small ones.
 Strained peptide bonds between segments taken from
different reference proteins.
 Non-optimal conformation of loops.
Molecular mechanics (MM) refinement finds the minimum of an empirical potential energy
function.
The process produces a molecule of idealized geometry.
Molecular mechanics describes the energy of a molecule in terms of a
simple function which accounts for distortion from “ideal” bond distances
and angles, as well as for nonbonded van der Waals and Coulombic
interactions.
Molecular mechanics is a mathematical formalism which attempts to
reproduce molecular geometries, energies and other features by adjusting
bond lengths, bond angles and torsion angles to equilibrium values that are
dependent on the hybridization of an atom and its bonding scheme.
Molecular mechanics
• Molecular mechanics breaks down pairwise interactions into
 Bonded interactions (internal coordinates)
- Atoms that are connected via one to three bonds
 Non-bonded interactions
- Electrostatic and van der Waals components
The general form of the force field equation is
$E_P(X) = E_{\mathrm{bonded}} + E_{\mathrm{nonbonded}}$
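The two terms of the force field equation are often expanded along the following lines. This is one common functional form (harmonic bonds and angles, a cosine series for torsions, a 12-6 Lennard-Jones term and a Coulomb term), given here as an illustration; the specific form and the parameter symbols ($k_b$, $r_0$, $k_\theta$, $\theta_0$, $V_n$, $\gamma$, $A_{ij}$, $B_{ij}$, $q_i$, $\varepsilon$) are assumptions and vary between force fields.

```latex
\begin{align}
E_{\mathrm{bonded}} &= \sum_{\mathrm{bonds}} k_b\,(r - r_0)^2
  + \sum_{\mathrm{angles}} k_\theta\,(\theta - \theta_0)^2
  + \sum_{\mathrm{torsions}} \frac{V_n}{2}\,\bigl[1 + \cos(n\phi - \gamma)\bigr] \\
E_{\mathrm{nonbonded}} &= \sum_{i<j} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}
  + \frac{q_i q_j}{\varepsilon\, r_{ij}} \right]
\end{align}
```

The bonded sums cover atoms connected via one bond (bond stretching), two bonds (angle bending) and three bonds (torsions), matching the "one to three bonds" description; the nonbonded sum runs over the remaining atom pairs.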
Model Validation
 Every homology model contains errors. Two main factors determine how many:
 The % sequence identity between reference and model
 The number of errors in the templates
 Hence it is essential to check the correctness of the overall fold/
structure, errors in localized regions, and stereochemical
parameters: bond lengths, angles, geometries.
Model Evaluation
 WHAT IF http://www.cmbi.kun.nl/gv/servers/WIWWWI/
 SOV http://predictioncenter.llnl.gov/local/sov/sov.html
 PROVE http://www.ucmb.ulb.ac.be/UCMB/PROVE/
 ANOLEA http://www.fundp.ac.be/pub/ANOLEA.html
 ERRAT http://www.doe-mbi.ucla.edu/Services/ERRATv2/
 VERIFY3D http://shannon.mbi.ucla.edu/DOE/Services/Verify_3D/
 BIOTECH http://biotech.embl-ebi.ac.uk:8400/
 ProsaII http://www.came.sbg.ac.at
 WHATCHECK http://www.sander.embl-heidelberg.de/whatcheck/
Challenges
 To model proteins with lower similarities (e.g. < 30% sequence
identity)
 To increase the accuracy of models and to make the process fully automated
 Improvements may include simultaneous optimization techniques
in side chain modeling and loop modeling
 Developing better optimizers and potential functions, which can lead
the model structure away from the template towards the correct
structure
 Although comparative modelling needs significant improvement, it is
already a mature technique that can be used to address many
practical problems
Automated Web-Based Homology Modeling
 SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html
 WHAT IF : http://www.cmbi.kun.nl/swift/servers/
 The CPHModels Server :
http://www.cbs.dtu.dk/services/CPHmodels/
 3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/
 SDSC1 : http://cl.sdsc.edu/hm.html
 EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/
Comparative Modeling Server & Program
 COMPOSER http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/matchmaker.html
 MODELER http://salilab.org/modeler
 InsightII http://www.msi.com/
 SYBYL http://www.tripos.com/
