PROTEIN STRUCTURE PREDICTION
WITH A FOCUS ON ROSETTA
This presentation was prepared by: Xavier Ambroggio, ambroggiox@niaid.nih.gov
OFFICE OF CYBER INFRASTRUCTURE AND COMPUTATIONAL BIOLOGY
NATIONAL INSTITUTE OF ALLERGY AND INFECTIOUS DISEASES
Fall 2011 Computational Structural Biology Seminar Series
9 – 11 AM, T/Th in 12A/B51 http://training.cit.nih.gov
Week | Day | Date | Course | Instructor | CIT Course #
1 | Tues | Aug. 23 | Fundamentals, Data Sources, and Visualization of Macromolecular Structure | Darrell Hurt | SS260-11001
1 | Thurs | Aug. 25 | Generating Protein Structures from Homology | Darrell Hurt | SS270-11001
2 | Tues | Aug. 30 | Predicting Protein Structures from Amino Acid Sequences | Xavier Ambroggio | SS660-11001
2 | Thurs | Sept. 1 | Predicting Macromolecular Complexes from Uncomplexed Structures | Xavier Ambroggio | SS670-11001
3 | Tues | Sept. 6 | Design and Analysis of Macromolecular Interfaces | Xavier Ambroggio | SS770-11001
3 | Thurs | Sept. 8 | Analysis and Advanced Visualization of Macromolecular Structure | Darrell Hurt | SS330-11001
4 | Tues | Sept. 13 | Computational Drug Design | Mike Dolan | SS340-11001
4 | Thurs | Sept. 15 | Introduction to Molecular Dynamics | Mike Dolan | TBA
5 | Thurs | Sept. 22 | Advanced Molecular Dynamics | Mike Dolan | TBA
Bioinformatics and Computational Biosciences Branch
Scientific Collaboration
Scientific Training
Custom Scientific Software & Infrastructure
•  Structural Biology
•  Phylogenetics
•  Statistics
•  Sequence Analysis
•  Microarray Analysis
•  NGS Analysis
•  Bioinformatics
•  Biological Networks
•  Function Prediction
•  …
Ab Initio Structure Prediction:
Given an amino acid sequence, find the tertiary structure
“Protein folding problem”
CASP: Critical Assessment of protein Structure Prediction
http://predictioncenter.org
•  Double-blind experiment (…competition)
•  World-wide scientific community
•  Unbiased assessment of techniques in structure
prediction
•  Biennial (every even year)
•  “Pulse” of the prediction community
•  What can be predicted?
•  Which servers/algorithms perform best?
CASP Overview
CASP Top Free-Modeling Servers
Why Rosetta focus?
•  Standalone
•  Versatile
  RNA
  design
  dock
  …
•  Open Source
•  Substantial Literature
•  Shared methodology
Use any and all available servers!!!
Rosetta: multipurpose macromolecular modeling suite
[Figure: Das & Baker, Annu. Rev. Biochem. 2008; labels: prediction, design; CIT Course # SS660-11001, SS670-11001, SS770-11001]
Brief Description of Select Rosetta Functions
ab initio: predict the structure from sequence
relax: refine the structure using Rosetta energy functions
idealize: replace bond geometries with ideal values
loop modeling: build and refine local, structurally variable regions in the context of a structural template
design: optimize sequence given a structure with a fixed backbone
docking: structure prediction for a protein-protein complex given subunits
ligand: ligand docking
ddG prediction: ddG calculations for mutations (protein-protein interface and protein stability)
scoring: score input conformations with Rosetta energy functions
RNA: predict RNA structures from sequences and design sequences from fixed structures
clustering: group input structures by RMSD to each other for structure prediction analysis
backrub: generate alternate backbone conformations based on sets of rotations
membrane ab initio: predict the structures of helical membrane proteins
enzyme design: redesign a protein around a ligand
domain assembly: fixed domains connected by variable regions
antibody: automated antibody homology modeling
XML parsing: parse XML scripts into protocols
What types of protein domains can Rosetta fold?
Small, globular, soluble protein domains…
Small, simple membrane protein domains… …but not complex domains or
multi-domain proteins.
T4-lysozyme C-terminal domain
V-type Na+ ATP synthase subunit
rhodopsin
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
A B C
What are the success rates?
High resolution predictions are achievable
•  targets ≤100 residues
•  success rate ~30%
•  success rate with accurate secondary
structure ~50%
•  a hallmark of accuracy: convergence
Slide content courtesy Rhiju Das, Baker Lab
What types of protein domains can no one fold?
CASP9: domains with no good FM predictions
Slide content adapted from talk given by Lisa Kinch of the Grishin lab at CASP9 meeting: http://predictioncenter.org/casp9/
•  Non-globular
•  Trimeric
•  Fe stabilized
•  High contact order
  Many residues close in 3D, far in 1D
•  + elongated sheet?
T0591d1, 3MWT
T0550d2, 3NQK
T0629d2, 2XGF
Basic Ab Initio Rosetta protocol
1.  Select fragments consistent with local sequence preferences
2.  Assemble fragments into models with native-like global properties
3.  Identify the best model from the population of decoys
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Charlie Strauss;
Protein structure prediction using ROSETTA, Rohl et al (2004) Methods in Enzymology, 383:66
Fragment-Based Structure Prediction (Rosetta, Quark, …)
[Figure labels: Fragment, Assembly, Decoy]
Homology modeling:
[Figure labels: Template(s), Alignment, Model]
First atomic-resolution model
Target 0281 CASP6
•  Topology sampled by ab initio trajectory
of homolog sequence (rmsd=2.2Å)
•  Full atom refinement reduces rmsd to
1.5Å
•  Side chain packing accurately
recovered
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Bradley P, Malmström L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D. Free modeling with Rosetta in CASP6. Proteins.
Folding Theory: Sequence-Structure Relationships
•  Secondary structure formation is the earliest part of the folding process
•  Local sequence codes for local structures… i.e. fragments
  helical sequences in a folded protein tend to be helical in isolation
•  Secondary structure prediction algorithms have ~70-80% accuracy
  Partial failure due to tertiary interactions stabilizing secondary structure elements
Rosetta fragments
•  3 and 9 residue fragments matched to
query sequence
•  database created from crystal structures
  < 2.5Å resolution
  < 50% sequence identity
•  low resolution modeling
  centroid representation of side chains
•  ranked by:
  alignment
  secondary structure predictions
    •  PSIPRED
    •  SAM-T02
    •  JUFO
    •  PHD
Sliding fragment windows
query     KVFGRCELAAAMKRHGLDNYRGYSLGNWVC...
3-mers    KVF
           VFG
            FGR
             GRC
9-mers    KVFGRCELA
           VFGRCELAA
            FGRCELAAA
             GRCELAAAM
sec str   EEEE TT S EEEEEEE TT HH...
Slide content courtesy David Hoover, CIT, NIH
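The sliding windows above can be sketched in a few lines of Python. This is only an illustration of how overlapping 3-residue and 9-residue windows are taken from the query; the function name sliding_windows is hypothetical and is not part of Rosetta or its fragment picker.

```python
def sliding_windows(sequence, size):
    """Return every contiguous window of `size` residues, in sequence order."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


# Query sequence as shown on the slide (the trailing "..." is omitted).
query = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVC"

three_mers = sliding_windows(query, 3)
nine_mers = sliding_windows(query, 9)

print(three_mers[:4])  # first four 3-residue windows
print(nine_mers[:4])   # first four 9-residue windows
```

Each window position then gets its own ranked list of candidate fragments, which is why separate 3-mer and 9-mer libraries are generated.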
Example 3-mer fragment library
 #   Rank   G K L M Q E R A
13   1000   G K L
25    821   G R L
46   1000     K L M
21    635     R L M
43    923     K V M
26    523     R V M
15    970         M Q E
26    934             E R A
Separate 3-mer and 9-mer libraries generated
Slide content courtesy David Hoover, CIT, NIH
Making Fragment Libraries with Robetta
http://robetta.bakerlab.org/
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Making Fragment Libraries on Biowulf
Slide content by David Hoover from: http://biowulf.nih.gov/apps/Rosetta23.html#RosettaFragments
Folding Theory: The Folding Landscape
•  Levinthal paradox:
  If each residue adopts one of three conformations (alpha, beta, or loop), a protein of nres residues has 3^nres possible conformations.
  If nres = 100, sampling one conformation every 10^-13 seconds would take ~10^27 years to fold
  The universe is ~10^10 years old.
  Folding is non-random and cooperative.
•  Many different combinations of secondary structure elements have similar stabilities
  Tertiary (side-chain level) interactions drive folding towards the native topology
  Phase transition results in a substantial energy gap between native and non-native structures
•  Cyrus Levinthal, J. Chim. Phys. 65, 44; 1968
•  Hue Sun Chan and Ken A. Dill, Protein Folding in the Landscape Perspective: Chevron Plots and Non-
Arrhenius Kinetics, Proteins: Structure, Function, and Genetics, Volume 30, No. 1, January 1998, pp 2-33.
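The Levinthal estimate can be checked with a short calculation. The sketch below assumes the slide's numbers (three states per residue, one conformation sampled every 10^-13 seconds) and a year of 365.25 days; the function name levinthal_years is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_years(nres, states_per_residue=3, seconds_per_sample=1e-13):
    """Years needed to enumerate every conformation of an nres-residue chain."""
    conformations = states_per_residue ** nres
    return conformations * seconds_per_sample / SECONDS_PER_YEAR


years = levinthal_years(100)
print(f"{years:.1e} years")  # on the order of 10^27 years
```

The result is about 17 orders of magnitude longer than the age of the universe, which is the argument for a non-random, cooperative folding process.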
Implications and requirements for folding algorithm:
•  Fast conformational sampling algorithm
•  Accurate scoring function
•  Full-atom modeling
[Figure: early centroid models; centroid models; final full-atom models]
[Figure stages: Assembly; Coarse funnel to native-like decoys; Fine-grained funnel to near-native decoys]
Major Classes of Energy Functions in Rosetta
Low resolution: reduced atom representation (centroid)
  simplified energy function
  used for aggressive search of state space
High resolution: full-atom representation
  detailed energy function
  local search of state space
  refinement and minimization
General
  weighted sum of linear terms: Energy = w1*term1 + w2*term2 + …
  pairwise decomposable (speed)
  weighted for task, e.g. ligand docking
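A minimal sketch of the weighted-sum form, Energy = w1*term1 + w2*term2 + …, is given below. The term names, values, and weight sets are hypothetical placeholders chosen for illustration; they are not actual Rosetta score terms or weights.

```python
def total_energy(terms, weights):
    """Weighted sum of score terms; a term with no weight contributes nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())


# Hypothetical raw term values for one conformation.
terms = {"vdw": 2.0, "hbond": -3.5, "solvation": 1.25}

# Hypothetical weight sets: the same terms can be reweighted for different tasks.
coarse_weights = {"vdw": 1.0}
detailed_weights = {"vdw": 0.8, "hbond": 1.0, "solvation": 0.5}

print(total_energy(terms, coarse_weights))
print(total_energy(terms, detailed_weights))
```

Because the total is a plain sum, swapping the weight dictionary is all that is needed to retarget the same terms to a different task.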
Low resolution (centroid) folding
  Fragment insertion
  conformation modification occurs in torsion space
  initial insertions result in large changes in dihedrals
  9-mers are inserted first, followed by 3-mers later in the process
  later insertions purposefully result in small changes in dihedrals
[Figure label: random insertion]
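Fragment insertion in torsion space can be sketched as a Metropolis Monte Carlo loop. This is a toy illustration under stated assumptions, not Rosetta's implementation: toy_score is a hypothetical stand-in for the centroid score, the fragment library is random made-up data, and a single fragment length is used (the same loop could be run first with 9-mers and then with 3-mers).

```python
import math
import random


def insert_fragment(torsions, start, fragment):
    """Return a copy of `torsions` with the fragment's (phi, psi) pairs written in at `start`."""
    new = list(torsions)
    new[start:start + len(fragment)] = fragment
    return new


def toy_score(torsions):
    """Hypothetical score: squared deviation of each (phi, psi) from a helical target."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in torsions)


def fold(torsions, library, steps, temperature, rng):
    """Metropolis Monte Carlo over random fragment insertions; returns the best state seen."""
    current, current_score = list(torsions), toy_score(torsions)
    best, best_score = current, current_score
    for _ in range(steps):
        start = rng.choice(sorted(library))                  # random insertion window
        trial = insert_fragment(current, start, rng.choice(library[start]))
        delta = toy_score(trial) - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, current_score + delta
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


rng = random.Random(0)
n_res, frag_len = 12, 3
extended = [(-120.0, 130.0)] * n_res  # start from an extended chain

# Made-up library: five random fragments plus one helical fragment per window.
library = {
    start: [[(rng.uniform(-180, 180), rng.uniform(-180, 180)) for _ in range(frag_len)]
            for _ in range(5)] + [[(-60.0, -45.0)] * frag_len]
    for start in range(n_res - frag_len + 1)
}

best, best_score = fold(extended, library, steps=500, temperature=100.0, rng=rng)
print(round(toy_score(extended), 1), round(best_score, 1))
```

Longer fragments change more torsions per move, which is consistent with using 9-mers early for large changes and 3-mers later for small ones.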
Driving assembly towards native-like decoys
•  Sss + SHS: sheet and helix-sheet geometries
•  Scβ: density/compactness of structure
•  Svdw: no clashes
•  SRgyr: radius of gyration (Rgyr), globular structure
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
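As an illustration of the radius of gyration term, the sketch below computes an unweighted Rgyr from a list of 3D coordinates (for example, one point per residue). It assumes equal mass for every point and is a generic definition, not Rosetta's exact score term; the coordinates are made up.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal masses assumed)."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(sum((c[i] - center[i]) ** 2 for i in range(3)) for c in coords) / n
    return math.sqrt(mean_sq)


# Made-up coordinates: a compact arrangement and a fully extended one.
compact = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

print(radius_of_gyration(compact))
print(radius_of_gyration(stretched))
```

A smaller Rgyr for the same number of residues indicates a more globular arrangement, which is what this term rewards during assembly.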
Low-resolution homolog folding improves prediction
•  Collect homologs
•  Create low-resolution models
  cluster
•  Thread query sequence onto models
•  Proceed to full-atom refinement
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Low resolution (centroid) folding example
Clustering:
Graphical representation
High resolution (full-atom) refinement
Chen Y et al. Nucl. Acids Res. 2004;32:5147-5162
evaluating/optimizing specific atom-atom interactions
e.g. hydrogen bonding:
Comparison of low resolution, relax, and abrelax folding example
Examples from the Rosetta@home archive of top predictions
Note: massively parallel computation
rosetta prediction
crystal structure
Detailed ab initio Rosetta Workflow
INPUT
•  amino acid sequence
•  secondary structure prediction(s)
•  fragment library
•  constraints from experimental data
•  NMR
•  biochemical/biophysical studies
•  ...
LOW RESOLUTION FOLDING
•  fragment insertions
•  scoring
•  filters
CLUSTERING
•  groups of decoys with low RMSD to each other
•  lowest energy decoy of clusters selected for
further refinement or prediction
HIGH RESOLUTION REFINEMENT
•  backbone minimization
•  rotamer optimization
ADDITIONAL MODELING
•  identifying variable regions
•  rebuilding
>10^3-10^6 trajectories
automated
manual
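The CLUSTERING step can be sketched with a simple greedy scheme. This is an assumption made for illustration and not the algorithm used by Rosetta's clustering application: decoys are visited from lowest to highest energy, and each one joins the first cluster whose representative lies within a chosen RMSD radius, so every representative is the lowest energy member of its cluster. The RMSD matrix and energies are made-up numbers.

```python
def cluster_decoys(rmsd, energies, radius):
    """Greedy clustering; returns lists of decoy indices, representative first."""
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    clusters = []
    for i in order:
        for members in clusters:
            if rmsd[i][members[0]] <= radius:  # compare against the representative only
                members.append(i)
                break
        else:
            clusters.append([i])               # no cluster was close enough
    return clusters


# Made-up pairwise RMSD values (in Å) for five decoys, and made-up energies.
rmsd = [
    [0.0, 1.0, 1.2, 6.5, 7.0],
    [1.0, 0.0, 1.5, 7.0, 8.0],
    [1.2, 1.5, 0.0, 6.0, 6.8],
    [6.5, 7.0, 6.0, 0.0, 1.2],
    [7.0, 8.0, 6.8, 1.2, 0.0],
]
energies = [-10.0, -12.0, -9.0, -11.0, -8.0]

clusters = cluster_decoys(rmsd, energies, radius=2.0)
representatives = [members[0] for members in clusters]
print(clusters, representatives)
```

The representatives are the decoys that would be carried forward to high resolution refinement in this sketch.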
Computational Considerations
Protocol | Utility | Caveats
Centroid | fast; widely sample conformational space | possibility of no near-native models after low resolution folding; no discrimination by energy
Full-atom refinement | near-native decoys separated by energy | more computationally demanding; must have near-native in starting decoy pool
Combined | streamlined; for powerful and massively parallel computing | most computationally demanding; improvement only with sufficient sampling
Native (CheY)
A ~1000-fold increase in computational power
Slide content courtesy Rhiju Das, Baker Lab
Architect of Rosetta@home: David Kim
Lowest energy Rosetta structure
“brute force” approach
Computational power vs. accuracy in ab initio structure prediction
[Plot: Cα RMSD of lowest energy model to the native structure vs. sample size; x-axis: sample size; y-axis: RMSD to native]
Category 1: Successful high-resolution predictions
Category 2: Successful high-resolution predictions with additional sampling
Category 3: Unsuccessful predictions (with any amount of sampling)
Kim DE, Blum B, Bradley P, Baker D. Sampling bottlenecks in de novo protein structure prediction. J Mol Biol. 2009 Oct 16;393(1):249-60.
“De novo” phasing: large-scale tests
Tests on 30 data sets
(covering 16 proteins)
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
TF Z-score | Have I solved it?
< 5 | no
5 - 6 | unlikely
6 - 7 | possibly
7 - 8 | probably
> 8 | definitely
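The TF Z-score table can be written as a small lookup function. The thresholds come from the table; how the exact boundary values (5, 6, 7, 8) are assigned is an assumption, since the table does not say, and the function name tfz_verdict is hypothetical.

```python
def tfz_verdict(tfz):
    """Map a TF Z-score to the table's answer to "Have I solved it?"."""
    if tfz < 5:
        return "no"
    if tfz < 6:
        return "unlikely"
    if tfz < 7:
        return "possibly"
    if tfz <= 8:
        return "probably"
    return "definitely"


for score in (4.2, 5.5, 6.5, 7.5, 9.0):
    print(score, tfz_verdict(score))
```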
Success in 14/30 data sets
[Figure legend: Rosetta-refined native (positive controls); Rosetta-refined de novo models; Rosetta-refined de novo models, fragments with correct native 2° structure. Figure labels: 1hz5-sf.cif; Å]
Preparation for folding simulations
•  proper secondary structure assignment
•  constraints
•  limit search space
•  increase sampling efficiency
•  decrease CPU time
Constraints
•  There are constraint types and function types
  Constraint types: AtomPair, Angle, Dihedral, etc.
  Function types: Bounded, Spline, Harmonic, Gaussian, etc.
•  Each constraint is scored individually, and the total constraint score is the sum of all individual scores
•  Each constraint can have its own constraint type and function type.
  In some cases, such as with the Spline function, each constraint can also have its own weight
•  How a constraint is defined and how it is scored depend on its constraint type; the same holds for its function type.
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
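The idea that each constraint is scored by its own function and the total is the sum can be sketched as follows. The two penalty functions are generic, simplified stand-ins named after the Harmonic and Bounded function types; they are assumptions for illustration and not Rosetta's exact functional forms. The measured distances are made up.

```python
from functools import partial


def harmonic(x, x0, sd):
    """Generic harmonic penalty: zero at x0, growing quadratically away from it."""
    return ((x - x0) / sd) ** 2


def bounded(x, lower, upper, sd):
    """Generic flat-bottom penalty: zero inside [lower, upper], quadratic outside."""
    if x < lower:
        return ((x - lower) / sd) ** 2
    if x > upper:
        return ((x - upper) / sd) ** 2
    return 0.0


# Each constraint: (residue pair, scoring function, weight). Values are illustrative.
constraints = [
    ((32, 36), partial(harmonic, x0=16.0, sd=1.0), 1.0),
    ((59, 74), partial(bounded, lower=17.0, upper=21.0, sd=0.5), 1.0),
]

# Made-up distances (in Å) measured in a candidate model.
measured = {(32, 36): 18.0, (59, 74): 20.0}

total = sum(weight * func(measured[pair]) for pair, func, weight in constraints)
print(total)
```

Only the first constraint is violated in this example, so it alone contributes to the total.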
Constraint file example: EPR data
<cst type> <atom1> <res1> <atom2> <res2> <cst_func> <RosettaEPR> <Dcb> <weight> <bin>
AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5
AtomPair CB 59 CB 74 SPLINE EPR_DISTANCE 19.0 1.0 0.5
AtomPair CB 62 CB 71 SPLINE EPR_DISTANCE 19.0 1.0 0.5
AtomPair CB 62 CB 74 SPLINE EPR_DISTANCE 25.0 1.0 0.5
AtomPair CB 63 CB 74 SPLINE EPR_DISTANCE 14.0 1.0 0.5
AtomPair CB 66 CB 74 SPLINE EPR_DISTANCE 23.0 1.0 0.5
AtomPair CB 83 CB 90 SPLINE EPR_DISTANCE 13.0 1.0 0.5
Column groups: Constraint info | Constraint Function info
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
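For inspection or sanity checks, a constraint file in this ten-column layout can be read with a few lines of Python. This is a sketch under assumptions: the class and field names are hypothetical labels for the columns in the header row, and the parser handles only AtomPair lines of exactly this shape; it is not a Rosetta utility.

```python
from dataclasses import dataclass


@dataclass
class AtomPairConstraint:
    atom1: str
    res1: int
    atom2: str
    res2: int
    func: str         # <cst_func>, e.g. SPLINE
    potential: str    # <RosettaEPR> column, e.g. EPR_DISTANCE
    distance: float   # <Dcb>
    weight: float
    bin_size: float


def parse_constraints(text):
    """Parse ten-column AtomPair lines; any other line (such as the header) is skipped."""
    constraints = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or fields[0] != "AtomPair":
            continue
        _, a1, r1, a2, r2, func, potential, dcb, weight, bin_size = fields
        constraints.append(AtomPairConstraint(
            a1, int(r1), a2, int(r2), func, potential,
            float(dcb), float(weight), float(bin_size)))
    return constraints


example = """<cst type> <atom1> <res1> <atom2> <res2> <cst_func> <RosettaEPR> <Dcb> <weight> <bin>
AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5
AtomPair CB 59 CB 74 SPLINE EPR_DISTANCE 19.0 1.0 0.5
"""

parsed = parse_constraints(example)
for cst in parsed:
    print(cst.res1, cst.res2, cst.distance)
```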
Membrane protein ab initio
•  RosettaMembrane divides the protein into:
  hydrophobic
  hydrophilic
  soluble layers
•  Specific scoring function for each layer
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Figure from Yarov-Yarovoy, Schonbrun, and Baker 2006.
Input Files
Spanfile - *.span
  --transmembrane topology prediction file generated using octopus2span.pl script
  --Input OCTOPUS topology file is generated at http://octopus.cbr.su.se using protein sequence as input.
Lipophilicity prediction file - *.lips4
  --Generate using run_lips.pl script
  --Need input FASTA file, spanfile, blastpgp and nr (NCBI) database to run
Fragment generation
  --Advised to use SAM but not JUFO or PSIPRED, which predict TMH regions poorly
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Folding and studying folding with molecular dynamics
Specialized hardware: ANTON, capable of continuous ms-length trajectories
Standard simulations:
1 - 3 µs simulations ~ months of HPC
Approximate Rates of Folding:
1 µs helix
10 µs sheet
100 µs fast folding protein
1+ ms typical protein
D E Shaw et al. Science 2010;330:341-346
simulation of villin at 300 K
2-8 µs folder
simulation of FiP35 at 337 K
20-80 µs folder
Blue: x-ray structures
Red: last frame of MD simulation
Folding proteins at x-ray resolution
Published by AAAS
tip of hairpin 1 (12-18, blue)
hairpin 1 (8-22, green)
hairpin 2 (19-30, orange)
full protein (2-33, red)
D E Shaw et al. Science 2010;330:341-346
Reversible folding simulation of FiP35.
Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455
