Computational Protein Design. 2. Computational Protein Design Techniques

Computational Protein Design
2. Computational Protein Design Techniques

Pablo Carbonell
pablo.carbonell@issb.genopole.fr

iSSB, Institute of Systems and Synthetic Biology
Genopole, University d’Évry-Val d’Essonne, France

mSSB: December 2010


Outline

1 Introduction

2 Computational Protein Descriptors

3 Sequence-based CPD

4 Structure-based CPD

5 Search Algorithms in CPD

6 De Novo Design

7 Challenges in Sequence and Structure-Based CPD


1 Introduction

A Blueprint of CPD Approaches

∗ RS : research studies

2 Computational Protein Descriptors

Molecular Signature Descriptors

A 2D representation of the molecular graph as an undirected colored graph G(V, E, C), with V: atoms, E: bonds, C: atom types
The signature descriptor of height h of atom x in the molecular graph G, or ^hσ(x), is a canonical representation of the subgraph of G containing all atoms that are at distance up to h from x
Atomic signature: the signature of the molecule is the sum of the atomic signatures of all atoms x ∈ V

$$ {}^{h}\sigma(G) = \sum_{x \in V} {}^{h}\sigma(x) \tag{1} $$

The signature is a systematic codification of the molecular graph [Faulon et al., 2004]

σ(methylcyclopropane) =
1 [C]([H][C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H]))
2 [C]([H][H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H]))
1 [C]([H][H][H][C]([H][C]([H][H][C,0])[C,0]([H][H])))
1 [H]([C]([C]([H][H][C,0])[C,0]([H][H])[C]([H][H][H])))
4 [H]([C]([H][C]([H][C,0][C]([H][H][H]))[C,0]([H][H])))
3 [H]([C]([H][H][C]([H][C]([H][H][C,0])[C,0]([H][H]))))
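
The construction of equation (1) can be sketched in a few lines of Python. This is a simplified illustration, not the canonical algorithm of Faulon et al.: it unfolds the neighborhood of each atom as a tree and omits the ring-closure labels (the `,0` marks in the listing above). The function names and the ethane input are assumptions made for the example.

```python
from collections import Counter

def atom_signature(graph, types, x, h, parent=None):
    """Simplified height-h signature of atom x (tree unfolding, no ring labels)."""
    label = f"[{types[x]}]"
    if h == 0:
        return label
    # Recurse into every neighbor except the atom we came from,
    # and sort the branches so the string does not depend on atom order.
    branches = sorted(
        atom_signature(graph, types, y, h - 1, parent=x)
        for y in graph[x] if y != parent
    )
    return label + ("(" + "".join(branches) + ")" if branches else "")

def molecular_signature(graph, types, h):
    """Sum of atomic signatures over all atoms, stored as a multiset."""
    return Counter(atom_signature(graph, types, x, h) for x in graph)

# Hypothetical input: ethane, atoms 0-1 are carbons, 2-7 are hydrogens.
ethane_graph = {0: [1, 2, 3, 4], 1: [0, 5, 6, 7],
                2: [0], 3: [0], 4: [0], 5: [1], 6: [1], 7: [1]}
ethane_types = {0: "C", 1: "C", 2: "H", 3: "H", 4: "H", 5: "H", 6: "H", 7: "H"}

for signature, count in sorted(molecular_signature(ethane_graph, ethane_types, 1).items()):
    print(count, signature)
```

Each line of the output pairs an occurrence count with an atomic signature, the same layout as the methylcyclopropane listing.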


Molecular Signature of Reactions and Proteins

Signature of a reaction. The signature of reaction R

$$ S_1 + S_2 + \ldots + S_n \rightarrow P_1 + P_2 + \ldots + P_m \tag{2} $$

that transforms n substrates into m products is given by the difference between the signature of the products and the signature of the substrates:

$$ {}^{h}\sigma(R) = \sum_{p \in P} {}^{h}\sigma(p) - \sum_{s \in S} {}^{h}\sigma(s) \tag{3} $$

Signature of protein sequences. The protein P is represented by the linear chain given by its collapsed graph at residue level, a reduced molecular graph representation G(V, E, C) known as string signature, where V: residues a ∈ A, E: contiguity in sequence, C: amino acid type

$$ {}^{h}\sigma(P) = \sum_{a \in A} {}^{h}\sigma(a) \tag{4} $$
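
A minimal sketch of equations (3) and (4), assuming that molecular signatures are stored as multisets of signature strings. For the string signature, the code assumes that the height-h signature of a residue is the sequence window reaching h residues on each side (truncated at the chain ends); the function names are hypothetical.

```python
from collections import Counter

def reaction_signature(substrate_sigs, product_sigs):
    """Equation (3): signatures of the products minus signatures of the substrates."""
    total = Counter()
    for sig in product_sigs:
        total.update(sig)
    for sig in substrate_sigs:
        total.subtract(sig)
    # Keep only the terms that actually change in the reaction.
    return {key: value for key, value in total.items() if value != 0}

def string_signature(sequence, h):
    """Equation (4): sum over residues of the height-h sequence window."""
    return Counter(sequence[max(0, i - h): i + h + 1] for i in range(len(sequence)))

# Toy signatures with made-up keys, only to show the bookkeeping.
substrates = [Counter({"a": 2, "b": 1})]
products = [Counter({"a": 1, "c": 1})]
print(reaction_signature(substrates, products))
print(string_signature("AAAA", 1))
```

Negative counts mark atomic signatures consumed by the reaction and positive counts mark those created.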


Protein Contact Maps

The protein contact map is a graph
representation of the 3D interactions
at residue level G(V , E, C) where V :
residues, E : contacts, C : amino acid
type
Two residues are considered to
interact when atoms of the two
residues are at a distance lower than a
predetermined threshold (typically
4.5 to 5 Å)
Contact maps can account for
long-range interactions and
conformational states

Song et al. [2010]
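
The contact criterion can be sketched as follows. The input layout (one list of atom coordinates per residue), the function name, and the choice of comparing every atom pair are assumptions for illustration; other conventions, such as Cα-Cα distances, are also common.

```python
import math
from itertools import combinations

def contact_map(residues, threshold=4.5):
    """Return the set of residue pairs (i, j), i < j, with any atom pair closer than threshold (in Å)."""
    contacts = set()
    for (i, atoms_i), (j, atoms_j) in combinations(enumerate(residues), 2):
        if any(math.dist(a, b) < threshold for a in atoms_i for b in atoms_j):
            contacts.add((i, j))
    return contacts

# Hypothetical coordinates: three single-atom residues on a line.
toy_residues = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
print(contact_map(toy_residues))
```

The resulting edge set E, together with the residue types C, gives the graph G(V, E, C) described above.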


3 Sequence-based CPD

Sequence and Structure-Based CPD

Sequence-based CPD methods are in some cases a good trade-off between
complexity of the model and accuracy of the predictions


Sequence-based Knowledge-based potentials

The simplest way to score a protein and to identify active regions is through amino
acid scales or indexes
AAindex is a database of
544 amino acid indices
94 amino acid matrices
47 amino acid pairwise contact potentials

Examples: hydrophobicity,
accessibility, van der Waals volume,
secondary structure propensity,
flexibility
This approach is widely used when
analyzing conserved motifs and
correlated mutations in protein fold
families through multiple alignments
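
As an illustration of scoring with an amino acid index, the sketch below averages a scale over a sliding window. The Kyte-Doolittle hydropathy values are used as the example index; the function name and window size are assumptions.

```python
# Kyte-Doolittle hydropathy scale, used here as one example of an amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def sliding_profile(sequence, scale, window=3):
    """Average the index over every window of consecutive residues."""
    values = [scale[residue] for residue in sequence]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical peptide: hydrophobic stretch followed by charged residues.
print(sliding_profile("AILVKDE", KYTE_DOOLITTLE, window=3))
```

Peaks in the profile point to hydrophobic stretches; any other index from the database can be swapped in as the `scale` argument.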


Quantitative Structure-Activity Relationship (QSAR) Techniques

QSAR is a statistical method used extensively by the chemical and pharmaceutical industries in small-molecule and peptide optimization
The goal is to model causal relationships between
structures of interacting molecules
measurable properties of scientific or commercial interest, such as ADME/Tox (absorption, distribution, metabolism, excretion, and toxicity) of drugs


QSAR Model Evaluation

Model predictive ability is generally evaluated through the leave-one-out (LOO)
cross-validated correlation coefficient q²
Partial least-squares (PLS) regression is commonly used
Additional nonlinear terms can be added through the use of nonlinear regression
or machine learning techniques (kernel methods, random forests, etc.)
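
A minimal sketch of the LOO q² computation, q² = 1 − PRESS / SS_tot. To stay self-contained it uses a one-descriptor ordinary least-squares fit as a stand-in for PLS; the function names and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x

def loo_q2(xs, ys):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_tot."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the model without sample i, then predict sample i.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot

# Made-up descriptor values and activities.
descriptor = [0.0, 1.0, 2.0, 3.0]
activity = [1.0, 3.0, 4.0, 7.0]
print(loo_q2(descriptor, activity))
```

Because every prediction is made for a sample left out of the fit, q² is lower than the fitted r² and can even become negative for a poor model.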


QSAR Modeling Workﬂow


The ProSAR Algorithm

An extension of SAR-based approaches to CPD
It formalizes the decision-making processes about which mutations to include in
combinatorial libraries

$$ y = \sum_{i=1}^{N} \sum_{j \in A} c_{ij} x_{ij} \tag{5} $$

y: the predicted function (activity) of the protein sequence
c_ij: the regression coefficient corresponding to the mutational effect of having residue j, among the 20 amino acids A, at position i
x_ij: binary variable indicating the presence or absence of residue j at position i
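
Equation (5) can be evaluated directly once the coefficients are known. In the sketch below the coefficients are stored in a dictionary keyed by (position, residue); since x_ij is 1 only for the residue actually present at position i, the double sum reduces to one lookup per position. The coefficient values and names are made up for illustration.

```python
def prosar_activity(sequence, coefficients):
    """Evaluate y = sum_i sum_j c_ij * x_ij for one sequence (positions are 1-based)."""
    return sum(coefficients.get((i, residue), 0.0)
               for i, residue in enumerate(sequence, start=1))

# Hypothetical regression coefficients for a two-position library.
toy_coefficients = {(1, "A"): 0.5, (2, "G"): -0.2, (2, "V"): 1.0}

for variant in ("AG", "AV"):
    print(variant, prosar_activity(variant, toy_coefficients))
```

Mutations with large positive coefficients are candidates to keep in the next combinatorial library, and those with negative coefficients are candidates to discard.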


Improving Catalytic Function by ProSAR-driven Enzyme Evolution

Statistical analysis of protein sequence-activity relationships

Bacterial biocatalysis of atorvastatin (Lipitor), a cholesterol-lowering drug
Codexis Inc.


4 Structure-based CPD

Energy functions and molecular force fields
Local conformational restrictions
Predicting entropic factors
Protein topological properties

From Narasimhan et al. [2010]


Energy Functions and Molecular Force Fields

In structure-based CPD, folds are usually represented by the spatial coordinates of the backbone atoms, or design scaffold
Protein design is done by placing amino acid side chains along the scaffold

Side chains are only permitted to assume a discrete set of statistically preferred conformations: rotamers
Rotamer/backbone and rotamer/rotamer interaction energies are tabulated
These potential energies can then be approximated by using any of the standard force fields: CHARMM, AMBER, GROMOS


Molecular Force Fields

AMBER: a classical force field for energy and MD calculations:

$$ V(r^N) = \sum_{\text{bonds}} \frac{1}{2} k_b (l - l_0)^2 + \sum_{\text{angles}} \frac{1}{2} k_a (\theta - \theta_0)^2 + \sum_{\text{torsions}} \frac{1}{2} V_n \left[ 1 + \cos(n\omega - \gamma) \right] + \sum_{j=1}^{N-1} \sum_{i=j+1}^{N} \left\{ \epsilon_{i,j} \left[ \left( \frac{r_{0ij}}{r_{ij}} \right)^{12} - 2 \left( \frac{r_{0ij}}{r_{ij}} \right)^{6} \right] + \frac{q_i q_j}{4 \pi \epsilon_0 r_{ij}} \right\} \tag{6} $$

1 Σ_bonds (·): energy between covalently bonded atoms.
2 Σ_angles (·): energy due to the geometry of electron orbitals involved in covalent bonding.
3 Σ_torsions (·): energy for twisting a bond due to bond order (e.g. double bonds) and neighboring bonds or lone pairs of electrons.
4 Σ_{j=1}^{N−1} Σ_{i=j+1}^{N} (·): non-bonded energy between all atom pairs:
1 van der Waals energies
2 Electrostatic energies
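
The non-bonded term of equation (6) can be sketched as below. The sketch assumes a single ε and r0 for every atom pair (real force fields assign them per atom-type pair), distances in Å, charges in units of e, and an approximate Coulomb conversion factor of 332.06 kcal·Å/(mol·e²); all names are illustrative and not part of any AMBER tool.

```python
import math

COULOMB_KCAL = 332.06  # assumed conversion factor 1/(4*pi*eps0) in kcal*Å/(mol*e^2)

def lennard_jones(r, epsilon, r0):
    """12-6 van der Waals term: epsilon * [(r0/r)^12 - 2 (r0/r)^6], minimum -epsilon at r = r0."""
    ratio = r0 / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)

def coulomb(r, qi, qj, k=COULOMB_KCAL):
    """Electrostatic term q_i q_j / (4 pi eps0 r)."""
    return k * qi * qj / r

def nonbonded_energy(coords, charges, epsilon, r0):
    """Double sum over all atom pairs j < i, as in the last term of equation (6)."""
    total = 0.0
    for j in range(len(coords) - 1):
        for i in range(j + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            total += lennard_jones(r, epsilon, r0) + coulomb(r, charges[i], charges[j])
    return total

# Two neutral atoms placed at the van der Waals minimum (hypothetical parameters).
print(nonbonded_energy([(0.0, 0.0, 0.0), (3.5, 0.0, 0.0)], [0.0, 0.0], epsilon=0.1, r0=3.5))
```

With the factor 2 in front of the sixth-power term, r0 is the separation at the energy minimum and ε is the well depth.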


Structure-based Knowledge-based Potentials

They are built by performing a large-scale statistical study of structural databases
such as PDB (Protein Data Bank)
Rotamer libraries (∼ 150 rotameric states)
Binary patterning: only some types of amino acids are allowed, based on the
hydrophobic environment
An implicit solvation model
Secondary structure propensity
Frequency of small segments in the PDB
Pairwise potentials
van der Waals interactions
Hydrogen bonding
Electrostatics
Entropy-based penalties for flexible side-chains

From Boas and Harbury [2007]


Energy Functions

Design along the backbone or scaffold
Rotamer/backbone and rotamer/rotamer interaction energies are tabulated
Precomputed from molecular force fields: CHARMM, AMBER, GROMOS

Total energy of the protein

$$ E_{TOT} = \sum_{k} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{7} $$

N: length of the protein
r_k: the rotamer of the kth side chain
E_k(r_k): the self-energy of a particular rotamer r_k
E_kl(r_k, r_l): the pair energy of rotamers r_k, r_l
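
With the energies tabulated, evaluating the total energy of one rotamer assignment is a table lookup. The sketch below assumes a list of self-energy lists and a dictionary of pair-energy tables keyed by position pairs (k, l) with k < l, so that each unordered pair is counted once; the numbers are made up.

```python
def total_energy(rotamers, self_e, pair_e):
    """Total energy: self-energies plus pair energies, each pair (k < l) counted once."""
    n = len(rotamers)
    energy = sum(self_e[k][rotamers[k]] for k in range(n))
    for k in range(n):
        for l in range(k + 1, n):
            energy += pair_e[(k, l)][(rotamers[k], rotamers[l])]
    return energy

# Hypothetical tables: two positions with two rotamers each.
self_e = [[0.0, 5.0], [0.0, 0.1]]
pair_e = {(0, 1): {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 0.5, (1, 1): 0.2}}

for assignment in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(assignment, total_energy(assignment, self_e, pair_e))
```

Since no force-field call is needed at search time, millions of candidate assignments can be scored cheaply.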


The Role of Dynamics

Besides protein structure, protein dynamics can play a direct role in molecular
recognition
Flexible proteins recognize their targets through induced fit or conformational
selection, and are likely to show promiscuity
Binding is commonly enthalpy-driven, but in some cases entropy is important, for
instance:
Proteins with multiple binding sites
Small hydrophobic molecules
Two sources of protein motion:
Protein flexibility: intraconformational dynamics (fast time scale motions)
Conformational heterogeneity: interconformational dynamics

Gibbs free energy:

$$ \Delta G = \Delta H - T \Delta S \tag{8} $$
$$ \Delta S = \Delta S_{solv} + \Delta S_{conf} + \Delta S_{rt} \tag{9} $$

∆S_solv: solvation entropy
∆S_conf: conformational entropy of protein and ligand
∆S_rt: rotational and translational degrees of freedom
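
A short worked example of equations (8) and (9), with made-up values chosen only to show the bookkeeping of units and signs (enthalpy in kJ/mol, entropies in kJ/(mol·K)). The function name is hypothetical.

```python
def binding_free_energy(delta_h, temperature, ds_solv, ds_conf, ds_rt):
    """Combine equations (8) and (9): dG = dH - T * (dS_solv + dS_conf + dS_rt)."""
    delta_s = ds_solv + ds_conf + ds_rt
    return delta_h - temperature * delta_s

# Hypothetical binding event at 298.15 K: favorable solvent entropy,
# unfavorable conformational and rotational/translational entropy.
dg = binding_free_energy(delta_h=-40.0, temperature=298.15,
                         ds_solv=0.05, ds_conf=-0.03, ds_rt=-0.04)
print(f"dG = {dg:.3f} kJ/mol")
```

In this example the net entropy change is negative, so the entropic term offsets part of the favorable enthalpy.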


Predicting Side-chain Dynamics from Structural Descriptors

The Lipari-Szabo model-free approach quantifies motions from
NMR experiments by computing the generalized order parameter S²
Protein backbone dynamics: ¹⁵NH and ¹³CαH NMR relaxation methods
Protein side-chain methyl dynamics: ¹³CαH NMR relaxation methods (side-chain
motions in the picosecond-to-nanosecond time regime)
From the BMRB we compiled S² data for 18 proteins, including 10 proteins in 2 or
more different states: calmodulin, barnase, PDZ, MUP, DHFR, staphylococcal
nuclease, Pin1, SH3 domain, MSG
This technique provides measurements only for the Cα of methyl groups in side
chains: ALA, LEU, ILE, MET, THR, VAL


Structural Descriptors of Methyl Dynamics

We consider the following parameters influencing side-chain dynamics:
Packing density at the methyl site i and its neighboring residues j within a sphere of r = 5 Å

$$ P_i = \sum_{r_{ij} < 5\,\text{Å}} C_j e^{-r_{ij}} = \sum_{r_{ij} < 5\,\text{Å}} \left( \sum_{r_{jk} < 5\,\text{Å}} e^{-r_{jk}} \right) e^{-r_{ij}} \tag{10} $$

Side-chain stiffness: number of dihedral angles separating the backbone from the methyl carbon, weighted by the side-chain packing
Rotameric state: angular distance ∆χ = χ − χ0 to the closest rotameric state χ0 in the library
Elongation: distance from the methyl site to the Cα
Pairwise contact potential: a knowledge-based potential of the frequency of contacts between residues at several distances, computed from the PDB
Solvation effect: DSSP accessibility and residue hydrophobicity
Van der Waals contacts
Hydrogen bonds (in the case of threonine)
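
The packing density of equation (10) can be sketched as a nested sum over neighbors. The sketch assumes one representative coordinate per site, distances in Å, and that a site is never counted as its own neighbor; these choices and the function name are assumptions, not details taken from the original method.

```python
import math

def packing_density(coords, i, cutoff=5.0):
    """P_i = sum over neighbors j of C_j * exp(-r_ij), with C_j = sum over neighbors k of j of exp(-r_jk)."""
    def neighbors(a):
        # Indices of all other sites closer than the cutoff to site a.
        return [b for b in range(len(coords))
                if b != a and math.dist(coords[a], coords[b]) < cutoff]

    def local_density(j):
        return sum(math.exp(-math.dist(coords[j], coords[k])) for k in neighbors(j))

    return sum(local_density(j) * math.exp(-math.dist(coords[i], coords[j]))
               for j in neighbors(i))

# Hypothetical sites: two close neighbors and one site far outside the 5 Å sphere.
toy_sites = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
print(packing_density(toy_sites, 0))
```

The exponential weights make close neighbors dominate, and the inner sum rewards neighbors that are themselves densely packed.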


Predicting Methyl Side-chain Dynamics
Algorithm: neural network
Cross-validation: r = 0.71 ± 0.029 (p-value = 4.6 × 10⁻⁸⁷)
Example: experimental and predicted changes ∆S² of barnase after binding barstar (∆S² > 0: rigidification; ∆S² < 0: flexibilization)

Protein      MD method    r (MD)   r (nnet)

ubiquitin    AMBER99SB    0.81     0.81
TNfn3        CHARMM 22    0.62     0.79
FNfn10       CHARMM 22    0.51     0.64
barnase      OPLS-AA/L    0.55     0.64
calmodulin   FDPB         0.60     0.72

[Carbonell and del Sol, 2009]


5 Search Algorithms in CPD


Search Algorithms

Objective: finding the best design within the space of all possible amino
acid/rotameric states
A vast search space: 20^N or p^N
N: number of positions to mutate
p: number of rotameric states
Strategies
Deterministic algorithms
Dead-end elimination (DEE) algorithm: a pruning method.
Some accelerations of the DEE algorithm: upper-bound estimation; the “magic bullet” metric;
conformational splitting; background optimization
Stochastic algorithms
Monte Carlo
Simulated annealing
Genetic algorithms


The DEE Algorithm

It assumes that the energy of the protein can be written as

$$ E_{TOT} = \sum_{k} E_k(r_k) + \sum_{k \neq l} E_{kl}(r_k, r_l) \tag{11} $$

N: length of the protein
r_k: the rotamer of the kth side chain
E_k(r_k): the self-energy of a particular rotamer r_k
E_kl(r_k, r_l): the pair energy of the rotamers r_k, r_l
Complexity:
Single search scales quadratically with the total number of rotamers: O((p × N)²)
Pair search scales cubically: O((p × N)³)
Brute force enumeration: O(p^N)
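
To give a feeling for these growth rates, the sketch below evaluates the three expressions for a hypothetical design with N = 20 positions and p = 10 rotamers per position. The function name and the numbers are illustrative.

```python
def search_costs(n_positions, n_rotamers):
    """Operation counts suggested by the complexity estimates above."""
    total_rotamers = n_positions * n_rotamers
    return {
        "brute_force": n_rotamers ** n_positions,  # O(p^N)
        "dee_singles": total_rotamers ** 2,        # O((p x N)^2)
        "dee_pairs": total_rotamers ** 3,          # O((p x N)^3)
    }

for name, value in search_costs(n_positions=20, n_rotamers=10).items():
    print(f"{name:12s} {value:.3e}")
```

The gap between the polynomial DEE passes and the exponential enumeration widens with every added position.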


The DEE Algorithm

Single rotamers and rotamer pairs are eliminated during the computational cycles
Single elimination: eliminate rotamer r_k^A if some other rotamer r_k^B at the same position always gives a better energy

$$ E_k(r_k^A) + \sum_{\substack{l=1 \\ l \neq k}}^{N} \min_X E_{kl}(r_k^A, r_l^X) > E_k(r_k^B) + \sum_{\substack{l=1 \\ l \neq k}}^{N} \max_X E_{kl}(r_k^B, r_l^X) \tag{12} $$

Pairs elimination: eliminate the pair of rotamers (r_k^A, r_l^B) at two positions if there exists another pair (r_k^C, r_l^D) that always gives a better energy

$$ U_{kl}^{AB} \overset{\text{def}}{=} E_k(r_k^A) + E_l(r_l^B) + E_{kl}(r_k^A, r_l^B) \tag{13} $$

$$ U_{kl}^{AB} + \sum_{\substack{i=1 \\ i \neq k,l}}^{N} \min_X \left[ E_{ki}(r_k^A, r_i^X) + E_{li}(r_l^B, r_i^X) \right] > U_{kl}^{CD} + \sum_{\substack{i=1 \\ i \neq k,l}}^{N} \max_X \left[ E_{ki}(r_k^C, r_i^X) + E_{li}(r_l^D, r_i^X) \right] \tag{14} $$

Values are precomputed and stored in energy matrices
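
A minimal sketch of the single-elimination criterion of equation (12), iterated until no more rotamers can be removed. The data layout (self-energy lists and a dictionary of pair tables keyed by (k, l) with k < l) and the toy energies are assumptions; pair elimination and the accelerations mentioned earlier are not included.

```python
def dee_single(self_e, pair_e):
    """Iterate the single-rotamer DEE criterion; return the surviving rotamers per position."""
    n = len(self_e)
    alive = [set(range(len(energies))) for energies in self_e]

    def pair(k, a, l, b):
        # Pair tables are stored once per position pair, with the smaller index first.
        return pair_e[(k, l)][(a, b)] if k < l else pair_e[(l, k)][(b, a)]

    changed = True
    while changed:
        changed = False
        for k in range(n):
            for a in list(alive[k]):
                # Best case for rotamer a: minimum pair energy at every other position.
                best_a = self_e[k][a] + sum(
                    min(pair(k, a, l, x) for x in alive[l]) for l in range(n) if l != k)
                for b in alive[k] - {a}:
                    # Worst case for competitor b: maximum pair energy at every other position.
                    worst_b = self_e[k][b] + sum(
                        max(pair(k, b, l, x) for x in alive[l]) for l in range(n) if l != k)
                    if best_a > worst_b:
                        alive[k].discard(a)  # a can never be part of the optimum
                        changed = True
                        break
    return alive

# Hypothetical tables: two positions with two rotamers each.
self_e = [[0.0, 5.0], [0.0, 0.1]]
pair_e = {(0, 1): {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 0.5, (1, 1): 0.2}}
print(dee_single(self_e, pair_e))
```

A rotamer is removed only when its best case is still worse than a competitor's worst case, so the global minimum always survives the pruning.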


Stochastic Algorithms

Search in the space of feasible designs by making a series of combinations of
random and directed moves
Monte Carlo Metropolis: a move consists of exchanging one rotamer for another
at a randomly chosen position; a modification is accepted if it lowers the energy, and otherwise with probability exp(−ΔE/kT)
Simulated annealing: a temperature parameter, lowered gradually, allows the search to accept worse solutions and so explore nearby solutions in the initial cycles of the
search
Genetic algorithms: a population of models is propagated (evolved) throughout
the course of the run and genetic operators, such as recombination, are used to
create new models from existing parents
They are fast and can be scaled up to problems of large complexity
They are not guaranteed to converge to the optimal solution
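
The Monte Carlo and simulated annealing moves can be sketched as follows, on tabulated rotamer energies. The geometric cooling schedule, the parameter values, the energy tables, and the function name are all assumptions chosen for illustration; the temperature is expressed directly in energy units (kT).

```python
import math
import random

def anneal(self_e, pair_e, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Metropolis Monte Carlo over rotamer assignments with a geometric cooling schedule."""
    rng = random.Random(seed)
    n = len(self_e)

    def energy(state):
        total = sum(self_e[k][state[k]] for k in range(n))
        for k in range(n):
            for l in range(k + 1, n):
                total += pair_e[(k, l)][(state[k], state[l])]
        return total

    state = [rng.randrange(len(energies)) for energies in self_e]
    current = energy(state)
    best, best_energy = list(state), current
    for step in range(steps):
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Move: exchange the rotamer at one randomly chosen position.
        k = rng.randrange(n)
        trial = list(state)
        trial[k] = rng.randrange(len(self_e[k]))
        delta = energy(trial) - current
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, current = trial, current + delta
            if current < best_energy:
                best, best_energy = list(state), current
    return best, best_energy

# Hypothetical tables: two positions with two rotamers each.
self_e = [[0.0, 5.0], [0.0, 0.1]]
pair_e = {(0, 1): {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 0.5, (1, 1): 0.2}}
print(anneal(self_e, pair_e))
```

On a problem this small the optimum is found easily; on realistic design problems the result depends on the schedule and the random seed, which is the lack of convergence guarantee noted above.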


The SCHEMA Algorithm

Equivalent to an in silico directed evolution
Consists of scoring libraries of hybrid protein sequences against the parental sequence
Scoring:
Calculate the number of interactions between residues (contacts within 4.5 Å) that are disrupted in the creation of hybrid proteins
Hybrids are scored for stability by counting the number of disruptions
The protein is partitioned into blocks that should not be interrupted by crossovers (analogous to genetic algorithms)
From [Meyer et al., 2006]
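
The disruption count can be sketched as below, assuming aligned parent sequences of equal length and a precomputed contact list. A contact is taken to be preserved when the residue pair found in the hybrid also occurs at the same two positions in at least one parent; the function name and the toy sequences are illustrative.

```python
def schema_disruption(chimera, parents, contacts):
    """Count the contacts (i, j) whose residue pair in the chimera occurs in no parent."""
    return sum(
        1 for i, j in contacts
        if not any(parent[i] == chimera[i] and parent[j] == chimera[j]
                   for parent in parents)
    )

# Hypothetical aligned parents and a hybrid with one crossover after position 2.
parents = ["AAAA", "BBBB"]
hybrid = "AABB"
contacts = [(0, 1), (1, 2), (0, 3)]
print(schema_disruption(hybrid, parents, contacts))
```

Crossover points are then chosen so that the blocks they define keep this count low across the library.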


The OPTCOM and IPRO Algorithms for Library Design

The OPTCOM algorithm:
Balances size and quality of the library
The IPRO algorithm:
Identifies point mutations in the parent sequences using energy-based scoring functions
Residue and rotamer choices are driven by a mixed-integer linear programming (MILP) formulation

From [Saraf et al., 2006]


Some Web Resources

IPRO: Iterative Protein Redesign and Optimization.
http://maranas.che.psu.edu/IPRO.htm
EGAD: A Genetic Algorithm for protein Design.
http://egad.ucsd.edu/software.php
RosettaDesign: A software package.
http://rosettadesign.med.unc.edu/
SCHEMA: A pairwise energy function for scoring protein chimeras made from
homologous proteins.
http://www.che.caltech.edu/groups/fha/schema-tools/schema-overview.html
SHARPEN: Systematic Hierarchical Algorithms for Rotamers and Proteins on
an Extended Network.
http://koko.che.caltech.edu/sharpenabout.html
WHAT IF: Software for protein modelling, design, validation, and
visualisation. http://swift.cmbi.ru.nl/whatif/
FoldX: A force field for energy calculations and protein design.
http://foldx.crg.es/


6 De Novo Design

De Novo-Designed Proteins

In de novo designs, some assumptions are needed in order to make the search
space tractable
Usually we start from some basic motifs or domains as scaffolds for the design
Examples:
βαβ motif resembling a zinc finger
Three- and four-helix bundles
Helical coiled-coils
Helix bundle motifs can be parametrized using a few global variables that
describe the global structure
Applications:
New metal-binding sites
Nonbiological cofactors for novel biomaterials and electromechanical devices
Novel enzymatic activities


Example: De Novo Design of a Metalloprotein

Computational de novo design of a four-helix bundle (108 residues) containing the
non-biological cofactor iron diphenylporphyrin (DPP-Fe) [Bender et al., 2007]
The initial helix bundle was selected as a low-energy structure computed with Monte Carlo simulated annealing (MCSA)
STITCH: a program to select loops connecting helices from PDB Select
CHARMM and PROCHECK for removing overlaps
4 His and 4 Thr residues to support the 6-point coordination of the Fe(III) cations
SCADS: provides site-dependent amino acid probabilities in each round


7 Challenges in Sequence and Structure-Based CPD

Modeling
Greater availability of 3D protein structural information
More accurate energy functions
Improvement of rigid and flexible docking

Design
Improvement in search algorithms
Parametrization for non-natural amino acids

Prediction
Beyond additive models: using machine-learning algorithms
More complete environment descriptors



Bibliography I

Gretchen M. Bender, Andreas Lehmann, Hongling Zou, Hong Cheng, H. Christopher Fry, Don Engel, Michael J. Therien, J. Kent Blasie, Heinrich Roder,
Jeffrey G. Saven, and William F. DeGrado. De Novo Design of a Single-Chain Diphenylporphyrin Metalloprotein. Journal of the American Chemical
Society, 129(35):10732–10740, September 2007. ISSN 0002-7863. doi: 10.1021/ja071199j. URL http://dx.doi.org/10.1021/ja071199j.
F. Edward Boas and Pehr B. Harbury. Potential energy functions for protein design. Current opinion in structural biology, 17(2):199–204, April 2007. ISSN
0959-440X. doi: 10.1016/j.sbi.2007.03.006. URL http://dx.doi.org/10.1016/j.sbi.2007.03.006.
Pablo Carbonell and Antonio del Sol. Methyl side-chain dynamics prediction based on protein structure. Bioinformatics, pages btp463+, July 2009. doi:
10.1093/bioinformatics/btp463. URL http://dx.doi.org/10.1093/bioinformatics/btp463.
Jean-Loup L. Faulon, Michael J. Collins, and Robert D. Carr. The signature molecular descriptor. 4. Canonizing molecules using extended valence
sequences. Journal of chemical information and computer sciences, 44(2):427–436, 2004. ISSN 0095-2338. doi: 10.1021/ci0341823. URL
http://dx.doi.org/10.1021/ci0341823.
Michelle M. Meyer, Lisa Hochrein, and Frances H. Arnold. Structure-guided SCHEMA recombination of distantly related β-lactamases. Protein Engineering
Design and Selection, 19(12):563–570, December 2006. ISSN 1741-0126. doi: 10.1093/protein/gzl045. URL
http://dx.doi.org/10.1093/protein/gzl045.
Diwahar Narasimhan, Mark R. Nance, Daquan Gao, Mei-Chuan Ko, Joanne Macdonald, Patricia Tamburi, Dan Yoon, Donald M. Landry, James H. Woods,
Chang-Guo Zhan, John J. G. Tesmer, and Roger K. Sunahara. Structural analysis of thermostabilizing mutations of cocaine esterase. Protein
Engineering Design and Selection, 23(7):537–547, July 2010. doi: 10.1093/protein/gzq025. URL http://dx.doi.org/10.1093/protein/gzq025.
Manish C. Saraf, Gregory L. Moore, Nina M. Goodey, Vania Y. Cao, Stephen J. Benkovic, and Costas D. Maranas. IPRO: an iterative computational protein
library redesign and optimization procedure. Biophysical journal, 90(11):4167–4180, June 2006. ISSN 0006-3495. doi: 10.1529/biophysj.105.079277. URL
http://dx.doi.org/10.1529/biophysj.105.079277.
Jiangning Song, Kazuhiro Takemoto, Hongbin Shen, Hao Tan, Michael M. Gromiha, and Tatsuya Akutsu. Prediction of Protein Folding Rates from Structural
Topology and Complex Network Properties. IPSJ Transactions on Bioinformatics, 3:40–53, 2010. doi: 10.2197/ipsjtbio.3.40. URL
http://dx.doi.org/10.2197/ipsjtbio.3.40.

