Protein Structure Modeling
Learning Objectives
• At the end of this topic the student should be
able to:
(i) Perform computer-aided protein structure
determination and validation
(ii) Design docking experiments
(iii) Demonstrate understanding of protein
interactions and simulations
Computer-Aided Protein Structure
Prediction
Protein sequence + structure
Recap
Structure prediction aims at building models of the 3D structures of proteins that
could be useful for understanding structure-function relationships.
STAT115 5
Protein Structure Levels
Folded structure:
lowest energy state
Watch video:
http://www.youtube.com/watch?v=fvBO3TqJ6FE&feature=fvw
Protein Structure Determination &
Prediction
• Experimental Techniques
– X-ray Crystallography
– NMR
• Limitations of Current Experimental
Techniques
– Protein Data Bank (PDB): about 30,000 protein structures
– Only 4,000 to 5,000 of these structures are unique
– Non-redundant (NR) database: about 10,000,000 protein sequences
• Importance of Structure Prediction
– Fill the gap between known sequences and known structures
– Protein engineering: to alter the function of a protein
– Rational drug design
Protein Structure Prediction
• Since the amino acid (AA) sequence determines structure, can we
predict protein structure from its AA sequence?
– This means predicting the three backbone dihedral angles of every residue: a huge number of degrees of freedom (DoF)!
• Physical properties that determine fold
– Rigidity of the protein backbone
– Interactions among amino acids, including
• Electrostatic interactions
• van der Waals forces
• Volume constraints
• Hydrogen bonds and disulfide bonds
– Interactions of amino acids with water
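The backbone angles mentioned above are dihedral (torsion) angles, each defined by four consecutive atoms. The following is a minimal sketch of how one such angle can be computed from coordinates. The helper names (`sub`, `dot`, `cross`, `dihedral`) are illustrative and not part of any particular package.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) about the p1-p2 bond."""
    b0 = sub(p1, p0)
    b1 = sub(p2, p1)          # central bond
    b2 = sub(p3, p2)
    n1 = cross(b0, b1)        # normal of the plane p0, p1, p2
    n2 = cross(b1, b2)        # normal of the plane p1, p2, p3
    b1_len = math.sqrt(dot(b1, b1))
    b1_unit = tuple(x / b1_len for x in b1)
    x = dot(n1, n2)
    y = dot(cross(n1, n2), b1_unit)
    return math.degrees(math.atan2(y, x))

# Four points with a quarter turn about the central bond.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

For φ the four atoms would be C(i-1), N(i), Cα(i), C(i); for ψ they would be N(i), Cα(i), C(i), N(i+1).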
Why Protein Modeling
• The goal of research in structural genomics is to
provide the means to characterize and identify the large
number of protein sequences that are being discovered
Knowledge of the three-dimensional structure
• helps in the rational design of site-directed mutations
• can be of great importance for the design of drugs
• greatly enhances our understanding of how proteins function
and how they interact with each other, for example by explaining
antigenic behaviour, DNA-binding specificity, etc.
Computational methods for Protein Structure
Prediction
• Homology or Comparative Modeling
• Fold Recognition or Threading Methods
• Ab initio methods that utilize knowledge-based information
• Ab initio methods without the aid of knowledge-based information
Why do we need computational approaches?
• Structural information from X-ray crystallography or NMR is
obtained much more slowly than new sequences are discovered:
– The techniques involve elaborate technical procedures
– Many proteins fail to crystallize at all and/or cannot be
obtained or dissolved in large enough quantities for NMR
measurements
– The size of the protein is also a limiting factor for NMR
• With a good computational method, a structural model can be produced much faster.
Tertiary Structure Prediction Methods
Flowchart:
• Start with any given protein sequence
• Compare its sequence identity with proteins that have solved structures
– > 35% identity: homology modeling
– < 35% identity: fold recognition
– < 35% identity: ab initio folding
• Structure selection
• Structure refinement
• Final structure
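The decision step in this flowchart can be sketched in a few lines. This is only an illustration: the function names are hypothetical, percent identity is computed here over aligned (non-gap) positions, which is one of several definitions in use, and the 35% threshold is the one shown on the slide.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over aligned, non-gap columns of a pairwise alignment.

    Both strings must come from the same alignment, with '-' marking gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue                    # skip gap columns (an assumption)
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0

def choose_method(identity, threshold=35.0):
    """Pick a prediction route from the identity to the best template."""
    if identity > threshold:
        return "homology modeling"
    return "fold recognition or ab initio folding"

identity = percent_identity("ACDE-G", "ACDQKG")
method = choose_method(identity)
```

In this toy alignment four of the five aligned positions match, so the target would be routed to homology modeling.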
Homology Modeling
• Predicts the three-dimensional structure of a
given protein sequence (TARGET) based on an
alignment to one or more known protein
structures (TEMPLATES)
• If similarity between the TARGET sequence and
the TEMPLATE sequence is detected, structural
similarity can be assumed.
• In general, at least 30% sequence identity is required for
generating useful models.
Homology Modeling Process
• Template recognition
• Alignment
• Determining structurally conserved regions
• Backbone generation
• Building loops or variable regions
• Conformational search for side chains
• Refinement of structure
• Validating structures
Template Recognition
• First, search a structural database of proteins for sequences (templates)
related to the target sequence
• The accuracy of the model depends on the selection of a proper template
• FASTA and BLAST (PSI-BLAST), available from EMBL-EBI and NCBI, can be used
• This gives a probable set of templates, but the final one is not yet decided
• After initial alignments and finding structurally conserved regions among
the templates, choose the final template
How good can homology modeling be?
Sequence identity and what the model can be used for:
• 60-100%: comparable to medium-resolution NMR; substrate specificity
• 30-60%: molecular replacement in crystallography; supports site-directed mutagenesis through visualization
• <30%: serious errors!
Protein Structural Databases
• Templates can be found by using the TARGET sequence as a
query in FASTA or BLAST searches
– PDB (http://www.rcsb.org/pdb)
– MODELLER (http://guitar.rockefeller.edu/modeller/modeller.html)
– ModBase (http://pipe.rockefeller.edu/modbase/general-info.html)
– 3DCrunch (http://www.expasy.ch/swissmod/SM_3DCrunch.html)
Target-Template Alignment
• No current comparative modeling method can
recover from an incorrect alignment
• Use multiple sequence alignments as an initial guide
• Consider slightly different alternative alignments in areas of
uncertainty, and build multiple models
Note: the sequence in a sequence database and the sequence in
the actual PDB entry are not always identical
Target-Multiple Template Alignment
• Alignment is prepared by superimposing all template
structures
• Add target sequence to this alignment
• Compare with multiple sequence alignment and adjust
Gaining confidence in template searching
• Once a suitable template is found, it is a good idea to do a
literature search (PubMed) on the relevant fold to
determine what biological role(s) it plays.
• Does this match the biological/biochemical function that
you expect?
Other factors to consider in selecting
templates
• Template environment
– pH
– Ligands present?
• Resolution of the templates
• Family of proteins
– Phylogenetic tree construction can help find the
subfamily closest to the target sequence
• Multiple templates?
Determining Structurally Conserved Regions
(SCRs)
• Applies when two or more reference protein structures are available
• Establishes structural guidelines for the family of proteins under consideration
• The first step in building a model protein by homology is determining which
regions are structurally conserved (constant) among all the reference
proteins
• The target protein is assumed to adopt the same conformation in the conserved
regions
Structurally Conserved Regions
• SCRs are regions that are nearly identical in structure in
all proteins of a particular family
• They tend to lie in the inner cores of the proteins
• They usually contain alpha-helices and beta-sheets
• No SCR can span more than one secondary structure element
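One simple way to picture SCR detection is to superimpose two reference structures and look for runs of equivalent positions whose Cα atoms stay close together. The sketch below assumes the structures are already superimposed and that the two coordinate lists are in one-to-one correspondence. The function name, the 1.0 Å cutoff and the minimum run length of 3 are illustrative assumptions, not values from the slides.

```python
import math

def find_scrs(coords_a, coords_b, cutoff=1.0, min_length=3):
    """Return (start, end) index ranges (inclusive) of candidate SCRs.

    coords_a, coords_b: Cα coordinates of two superimposed reference
    structures at equivalent positions. A position counts as conserved
    when the two Cα atoms are within `cutoff` angstroms of each other.
    """
    scrs = []
    start = None
    n = min(len(coords_a), len(coords_b))
    for i in range(n):
        close = math.dist(coords_a[i], coords_b[i]) <= cutoff
        if close and start is None:
            start = i                       # a conserved run begins
        elif not close and start is not None:
            if i - start >= min_length:
                scrs.append((start, i - 1))
            start = None                    # the run ended at i - 1
    if start is not None and n - start >= min_length:
        scrs.append((start, n - 1))         # run reaching the last position
    return scrs

# Toy example: positions 3 and 4 diverge between the two structures.
ref_a = [(float(i), 0.0, 0.0) for i in range(8)]
ref_b = [(float(i), 3.0 if i in (3, 4) else 0.0, 0.0) for i in range(8)]
regions = find_scrs(ref_a, ref_b)
```

The divergent stretch in the middle would then be treated as a loop or variable region.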
Alignment of Target Protein with SCR
• After aligning the reference proteins and finding the SCRs, align the
unknown (target) sequence with the aligned reference proteins, using the
knowledge of the SCRs
• Because SCRs cannot contain insertions or deletions, an enhanced version
of the standard alignment algorithm is used
• After a suitable template is found and the unknown sequence is aligned with
it, coordinates are assigned within the conserved regions
Assignment of coordinates within conserved
region
• Once the correspondence between amino acids in the reference and model
sequences has been made, the coordinates for an SCR can be assigned
• The reference proteins' coordinates are used as a basis for this assignment
• Where the side chains of the reference and model proteins are the same at
corresponding locations along the sequence, all the coordinates for the amino
acid are transferred
• Where they differ, the backbone coordinates are transferred, but the side
chain atoms are automatically replaced to preserve the model protein's
residue types
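The transfer rule described above can be sketched as follows. The data layout (a list of `(residue_name, {atom_name: xyz})` pairs) and the function name are assumptions made for illustration. Rebuilding the replaced side chains is left to a later step.

```python
# Backbone atom names as used in PDB files.
BACKBONE = {"N", "CA", "C", "O"}

def assign_scr_coordinates(template_residues, target_residue_names):
    """Copy coordinates from a template SCR onto the target sequence.

    template_residues: list of (residue_name, {atom_name: (x, y, z)}).
    target_residue_names: residue names of the target, aligned one-to-one
    with the template (SCRs contain no insertions or deletions).
    """
    if len(template_residues) != len(target_residue_names):
        raise ValueError("SCR segments must have equal length")
    model = []
    for (t_name, atoms), m_name in zip(template_residues, target_residue_names):
        if t_name == m_name:
            coords = dict(atoms)            # identical residue: copy every atom
        else:
            # different residue: keep the backbone only; the side chain
            # must be rebuilt for the target's residue type afterwards
            coords = {a: xyz for a, xyz in atoms.items() if a in BACKBONE}
        model.append((m_name, coords))
    return model

template = [
    ("ALA", {"N": (0, 0, 0), "CA": (1, 0, 0), "C": (2, 0, 0),
             "O": (2, 1, 0), "CB": (1, 1, 0)}),
    ("SER", {"N": (3, 0, 0), "CA": (4, 0, 0), "C": (5, 0, 0),
             "O": (5, 1, 0), "CB": (4, 1, 0), "OG": (4, 2, 0)}),
]
model = assign_scr_coordinates(template, ["ALA", "THR"])
```

Here the alanine keeps its CB atom, while the serine-to-threonine position keeps only the four backbone atoms.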
Assignment of coordinates in loop or variable
region
Two main methods
• Finding similar peptide segments in other proteins
• Generating a segment de novo
Finding similar peptide segment in other proteins
• Advantage: all loops found are guaranteed to have reasonable
internal geometries and conformations
• Disadvantage: a loop may not fit properly into the given model protein’s
framework; in that case, the de novo method is advisable
Selection Of Loops
• Check the loops on the basis of steric overlaps
• A specified degree of overlap can be tolerated
• Check the atoms within the loop against each other
• Then check the loop atoms against the rest of the protein’s atoms
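The two-stage overlap check can be sketched as below. To keep the example short it assumes a Cα-only representation, so that bonded neighbours (about 3.8 Å apart) are not counted as overlaps. The function names, the 3.0 Å overlap distance and the `max_overlaps` tolerance are illustrative assumptions.

```python
import math
from itertools import combinations

def count_overlaps(pairs, min_dist):
    """Count atom pairs that are closer than min_dist angstroms."""
    return sum(1 for p, q in pairs if math.dist(p, q) < min_dist)

def loop_is_acceptable(loop_atoms, protein_atoms, min_dist=3.0, max_overlaps=0):
    """Accept a candidate loop if its steric overlaps stay within tolerance."""
    # Step 1: check the atoms within the loop against each other.
    internal = count_overlaps(combinations(loop_atoms, 2), min_dist)
    if internal > max_overlaps:
        return False
    # Step 2: check loop atoms against the rest of the protein.
    external = count_overlaps(
        ((a, b) for a in loop_atoms for b in protein_atoms), min_dist)
    return internal + external <= max_overlaps

loop = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
far_protein = [(3.8, 10.0, 0.0)]
near_protein = [(3.8, 1.0, 0.0)]
ok_far = loop_is_acceptable(loop, far_protein)
ok_near = loop_is_acceptable(loop, near_protein)
ok_near_tolerant = loop_is_acceptable(loop, near_protein, max_overlaps=1)
```

Raising `max_overlaps` corresponds to tolerating a specified degree of overlap.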
Sidechains on surface of protein
• Exposed sidechains on the surface can be highly flexible, without a
single dominant conformation
• So ultimately, if these solvent-exposed sidechains do not form
binding interactions with other molecules and are not involved in, say, a
catalytic reaction, then their accuracy may not be critical
Side Chain Conformation Search
• Even with fixed bond lengths and bond angles, each residue has two rotatable
backbone bonds (φ and ψ), so it is very difficult to find the best conformation of a side
chain
• In addition, the side chains of many residues have one or more degrees of
freedom of their own
• Hence a side chain conformational search in loop regions is a must
• Side chains replaced during coordinate assignment should
also be checked
Selection Of Good Rotamer
• Fortunately, statistical studies show that side chains adopt only a small number
of the many possible conformations
• The correct rotamer of a particular residue is mainly determined by its local
environment
• Side chains generally adopt conformations in which they are closely packed
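A toy version of rotamer selection is shown below: each candidate rotamer is scored by how well it packs against its local environment, rewarding close contacts and penalizing clashes. The scoring function, the distance thresholds and the penalty are hypothetical values chosen for illustration. Real rotamer libraries and energy functions are considerably more detailed.

```python
import math

def packing_score(side_chain_atoms, environment_atoms,
                  clash_dist=3.0, contact_dist=4.5, clash_penalty=10.0):
    """Reward close contacts and penalize clashes (illustrative scoring)."""
    score = 0.0
    for atom in side_chain_atoms:
        for other in environment_atoms:
            d = math.dist(atom, other)
            if d < clash_dist:
                score -= clash_penalty      # atoms overlap
            elif d <= contact_dist:
                score += 1.0                # well-packed contact
    return score

def select_rotamer(rotamers, environment_atoms):
    """Pick the rotamer name with the best packing score.

    rotamers: {name: [side chain atom coordinates]} for one residue.
    """
    return max(rotamers,
               key=lambda name: packing_score(rotamers[name], environment_atoms))

environment = [(0.0, 0.0, 0.0)]
candidates = {
    "clashing": [(1.0, 0.0, 0.0)],   # too close to the environment atom
    "packed":   [(4.0, 0.0, 0.0)],   # within contact distance
    "exposed":  [(8.0, 0.0, 0.0)],   # no contacts at all
}
best = select_rotamer(candidates, environment)
```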
Refinement of model using Molecular
Mechanics
Many structural artifacts can be introduced while the model protein is
being built
• Substitution of large side chains for small ones
• Strained peptide bonds between segments taken from different
reference proteins
• Non-optimal conformations of loops
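To give a feel for how molecular mechanics relieves such strain, here is a deliberately tiny sketch: steepest-descent minimization of a single harmonic term on the distances between consecutive points. The energy function, the 3.8 Å target distance, the step size and the function names are assumptions for illustration. A real force field has many more terms (angles, torsions, electrostatics, van der Waals).

```python
import math

def bond_energy(coords, d0=3.8, k=1.0):
    """Sum of harmonic terms k * (d - d0)^2 over consecutive points."""
    return sum(k * (math.dist(coords[i], coords[i + 1]) - d0) ** 2
               for i in range(len(coords) - 1))

def minimize(coords, d0=3.8, k=1.0, step=0.05, n_steps=500):
    """Steepest descent on bond_energy; returns the relaxed coordinates."""
    coords = [list(c) for c in coords]
    for _ in range(n_steps):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for i in range(len(coords) - 1):
            a, b = coords[i], coords[i + 1]
            d = math.dist(a, b)
            if d == 0.0:
                continue                      # direction undefined, skip
            factor = 2.0 * k * (d - d0) / d   # dE/dd divided by d
            for ax in range(3):
                g = factor * (a[ax] - b[ax])
                grads[i][ax] += g
                grads[i + 1][ax] -= g
        for c, g in zip(coords, grads):
            for ax in range(3):
                c[ax] -= step * g[ax]         # move against the gradient
    return [tuple(c) for c in coords]

strained = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
energy_before = bond_energy(strained)
relaxed = minimize(strained)
energy_after = bond_energy(relaxed)
```

The first distance starts stretched and the second compressed, and the minimizer moves the points until both are close to the target value.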
Model Validation
• Every homology model contains errors. How many depends mainly on two factors:
– The % sequence identity between reference and model
– The number of errors in the templates
• Hence it is essential to check the correctness of the overall fold/structure,
errors in localized regions, and stereochemical parameters: bond
lengths, angles, geometries
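As a small example of a stereochemical check, consecutive Cα atoms joined by a trans peptide bond lie about 3.8 Å apart, so distances far from that value point to a chain break or a distorted region. The sketch below flags such pairs. The function name and the 0.2 Å tolerance are illustrative assumptions, and it is not a substitute for dedicated validation programs such as PROCHECK or WHAT IF.

```python
import math

def flag_ca_distance_outliers(ca_coords, ideal=3.8, tolerance=0.2):
    """Return (i, i+1, distance) for consecutive Cα pairs outside tolerance."""
    outliers = []
    for i in range(len(ca_coords) - 1):
        d = math.dist(ca_coords[i], ca_coords[i + 1])
        if abs(d - ideal) > tolerance:
            outliers.append((i, i + 1, round(d, 2)))
    return outliers

# Toy Cα trace: the second pair is 5.0 Å apart, which suggests a break.
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 5.0, 0.0)]
problems = flag_ca_distance_outliers(trace)
```

Note that a cis peptide bond gives a shorter Cα to Cα distance, so a flagged pair should be inspected rather than rejected automatically.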
Structural Prediction by Homology Modeling
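Workflow (from the flowchart; methods in parentheses):
• Structural databases → reference proteins (SeqFold, Profiles-3D, PSI-BLAST, BLAST & FASTA, fold-recognition methods)
• Reference proteins → conserved regions (Cα matrix matching)
• Conserved regions + protein sequence → predicted conserved regions (sequence alignment)
• Predicted conserved regions → initial model (coordinate assignment, loop searching/generation)
• Initial model → structure analysis (WHAT IF, PROCHECK, PROSAII, ...)
• Structure analysis → refined model (sidechain rotamers and/or MM/MD)
• MODELLER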
Errors in Homology Modeling
a) Side chain packing b) Distortions and shifts c) No template
d) Misalignments e) Incorrect template
Marti-Renom et al., Ann. Rev. Biophys. Biomol. Struct., 2000, 29:291-325.
Automated Web-Based Homology Modelling
• SWISS-MODEL: http://www.expasy.org/swissmod/SWISS-MODEL.html
• WHAT IF: http://www.cmbi.kun.nl/swift/servers/
• The CPHModels Server: http://www.cbs.dtu.dk/services/CPHmodels/
• 3D Jigsaw: http://www.bmm.icnet.uk/~3djigsaw/
• SDSC1: http://cl.sdsc.edu/hm.html
• EsyPred3D: http://www.fundp.ac.be/urbm/bioinfo/esypred/
• HHpred server: https://toolkit.tuebingen.mpg.de/tools/hhpred
Tools for Model Evaluation
• WHAT IF: http://www.cmbi.kun.nl/gv/servers/WIWWWI/
• SOV: http://predictioncenter.llnl.gov/local/sov/sov.html
• PROVE: http://www.ucmb.ulb.ac.be/UCMB/PROVE/
• ANOLEA: http://www.fundp.ac.be/pub/ANOLEA.html
• ERRAT: http://www.doe-mbi.ucla.edu/Services/ERRATv2/
• VERIFY3D: http://shannon.mbi.ucla.edu/DOE/Services/Verify_3D/
• BIOTECH: http://biotech.embl-ebi.ac.uk:8400/
• ProsaII: http://www.came.sbg.ac.at
• WHATCHECK: http://www.sander.embl-heidelberg.de/whatcheck/
Challenges
• To model proteins with lower similarities (e.g. < 30% sequence identity)
• To increase the accuracy of models and to make the process fully automated
• Improvements may include simultaneous optimization techniques in
side chain modeling and loop modeling
• Developing better optimizers and potential functions, which can lead
the model structure away from the template towards the correct
structure
• Although comparative modelling needs significant improvement,
it is already a mature technique that can be used to address
many practical problems
Comparative Modeling Server & Program
• COMPOSER: http://www.tripos.com/sciTech/inSilicoDisc/bioInformatics/matchmaker.html
• MODELLER: http://salilab.org/modeler
• InsightII: http://www.msi.com/
• SYBYL: http://www.tripos.com/
Fold Recognition
• Also called protein threading
• Given a new sequence and a library of known
folds, find the best alignment of the sequence to each
fold and return the most favorable one
Protein Folding with
Dynamic Programming
• D. Eisenberg, 1994
• Align the sequence to each fold (represented as a string of
environmental classes)
• Advantages: fast and works fairly well
• Disadvantages: does not consider AA contacts
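The idea of aligning a sequence to a string of environmental classes can be sketched with a standard global dynamic-programming alignment. Everything specific here is a simplifying assumption made for illustration: two environment classes (`B` for buried, `E` for exposed), a hypothetical +1/-1 compatibility score based on hydrophobicity, and a linear gap penalty. It is not the scoring scheme of the published method.

```python
HYDROPHOBIC = set("AVILMFWC")   # illustrative grouping of residues

def env_score(env, aa):
    """+1 when the residue suits the environment class, -1 otherwise."""
    buried = env == "B"
    hydrophobic = aa in HYDROPHOBIC
    return 1 if buried == hydrophobic else -1

def thread_score(sequence, fold_envs, gap=-2):
    """Best global alignment score of a sequence against one fold."""
    n, m = len(sequence), len(fold_envs)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + env_score(fold_envs[j - 1], sequence[i - 1]),
                dp[i - 1][j] + gap,     # residue left unaligned
                dp[i][j - 1] + gap,     # fold position left unaligned
            )
    return dp[n][m]

def best_fold(sequence, library):
    """Return the name of the fold with the most favorable score."""
    return max(library, key=lambda name: thread_score(sequence, library[name]))

library = {"foldA": "BBEE", "foldB": "EEBB"}
winner = best_fold("VLKE", library)
```

Because each position is scored independently of its neighbours in space, this kind of scheme cannot account for residue-residue contacts, which is the disadvantage noted above.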
Fold Recognition
• Each predictor can submit N top hits
• Every predictor does well on something
• Common folds (more examples) are easier to
recognize
Fold Recognition
• Alignment (seq to fold) is a big problem
ab initio
• Predict interresidue contacts and then
compute structure (mild success)
• Simplified energy term + reduced search space
(phi/psi or lattice) (moderate success)
• Creative ways to memorize sequence-to-structure
correlations in short segments from
the PDB, and use these to model new
structures: ROSETTA
Ab initio prediction of protein structure
• Sample: search conformational space such that native-like conformations are found
– Astronomically large number of conformations:
5 states per residue, 100 residues: 5^100 ≈ 10^70
• Select: it is hard to design scoring functions
that are not fooled by
non-native conformations
(“decoys”)
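The size of this search space is easy to check numerically. The short sketch below reproduces the estimate, assuming each residue can take a fixed number of discrete states independently. The function name is illustrative.

```python
import math

def conformation_count(states_per_residue, n_residues):
    """Number of conformations if every residue picks a state independently."""
    return states_per_residue ** n_residues

count = conformation_count(5, 100)
exponent = math.log10(count)
print(f"about 10^{exponent:.0f} conformations")   # prints: about 10^70 conformations
```

Exhaustive enumeration is therefore out of the question, which is why ab initio methods rely on reduced search spaces and fragment libraries.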
Programs for Model Protein Prediction
https://molbiol-tools.ca/Protein_tertiary_structure.htm