Faculty of Science and Bio-engineering Sciences
Department of Bio-engineering Sciences
Predicting peptide interactions
using protein building blocks
Thesis submitted in partial fulfilment of the requirements for the degree of
Doctor in Bio-engineering Sciences
Peter Vanhee
Promotor: Prof. Dr. Frederic Rousseau
Co-promotor: Prof. Dr. Joost Schymkowitz
March 4th, 2011
Published by the VIB Switch Laboratory
SWIT, Department of Bio-engineering Sciences
Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel
Apart from any fair dealing for the purpose of research, private study, criticism or review, this
publication may not be reproduced, stored in a retrieval system, or transmitted in any form, by any
means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior
permission in writing of the publisher.
Peter Vanhee was funded by a PhD grant from the Institute for the Promotion of Innovation
through Science and Technology in Flanders (IWT-Vlaanderen), Belgium.
Predicting peptide interactions using protein building blocks. Peter Vanhee, PhD dissertation,
Vrije Universiteit Brussel, Brussels, Belgium, March 2011.
Cover: design by Antonio De Marco and Peter Vanhee.
© Vrije Universiteit Brussel, all rights reserved.
Summary
Proteins are by far the most versatile and complex molecules in the cell. It is
commonly accepted that protein function directly relates to three-dimensional
structure, which in turn is dependent on the specific amino acid sequence of the
protein. Peptides are short sequences of amino acids that perform a myriad of
functions and are estimated to be involved in up to 40% of all protein-protein
interactions. The lack of structural evidence for many of these peptide interactions
has, however, hindered the functional annotation of this important class of
molecules and the development of peptides as therapeutics. In this thesis, we
propose the use of small, recurrent polypeptide fragments as one way of compensating
for the lack of protein-peptide structures. We show that protein-peptide binding sites
can be modeled at high resolution using fragment interactions and provide two
methods for the de novo prediction of protein loops and peptide structure. The
developments presented in this work provide a valuable alternative to experimental
high-resolution structure elucidation of target protein-peptide complexes, bringing
closer the possibility of in silico designed peptides for therapeutic applications.
Samenvatting
Proteins are by far the most powerful and complex biological molecules in the
cell. They are essential and present in all forms of life: from viruses and
bacteria to plants and animals. It is generally accepted that the function of
a protein depends on its three-dimensional structure, which in turn can be
directly related to the sequence of amino acids from which the protein is built.
Proteins interact with all kinds of molecules, such as other proteins, DNA, RNA
or peptides.
Peptides are molecules consisting of short sequences of amino acids. It is
estimated that protein-peptide interactions play a role in more than 40% of
all protein-protein interactions inside and outside the cell. The lack of data on the
three-dimensional structure of these peptide interactions has, however, meant
that the function of many of these interactions is still unknown; the lack of
structural data is also an obstacle to the development of this class of
molecules as new and powerful drugs.
In this thesis we propose the use of protein fragments to predict the complex
structure of peptide interactions and thereby circumvent the lack of
high-resolution structures. To this end we use BriX, a database of more than
7 million protein fragments of 4 to 14 amino acids each, in which about 2000
canonical protein fragments can be identified. We show that the binding
interfaces between proteins and peptides strongly resemble the interactions
between protein fragments extracted from different, unrelated protein
structures. This insight allows us to use the enormous amount of structural data
from these protein fragments to predict the interaction between protein and
peptide. We show, for example, that we can accurately predict the structure of
peptides that bind to modular domains, such as PDZ domains or the LBD domain of
the estrogen receptor.
The developments presented in this work offer an alternative to the experimental
determination of high-resolution structures of protein-peptide interactions,
and bring us a step closer to the design of peptides for therapeutic purposes.
Preface
This thesis deals with the topic of protein and peptide structure prediction and
design. Parts of this research have been published or are currently in the process
of publication. It is the objective of this thesis to unite and present all the individual
findings obtained and described in each manuscript. I have, however, taken the
liberty to edit each of these manuscripts to fit the flow of the thesis. Each chapter,
with the exception of the first and the last, contains an introduction, a results section, a
materials and methods section, and a conclusion.
Chapter 1 introduces proteins and peptides and the role of structure. We
attempt to give an objective overview of the field of protein modeling and
design that is relevant to many of the concepts and applications presented here.
We also introduce the field of peptide prediction and design and its relevance for
therapeutics (Vanhee et al., 2011).
Chapter 2 introduces the ‘protein fragment paradigm’ that is key to this work.
Two databases of protein fragments – BriX and Loop BriX (http://brix.crg.es)
– are described that provide a vast resource for fragment-based protein structure
prediction and design (Vanhee et al., 2011).
Chapter 3 describes LoopX (http://loopx.crg.es), a method for de novo pre-
diction of protein loops, the most variable parts of the protein structure and no-
toriously difficult to predict. We describe how LoopX outperforms state-of-the-art
methods, combining a 100-fold speed increase with excellent prediction accuracy
and coverage for loops up to 12 residues. Moreover, we demonstrate that LoopX
can model conformational ensembles adopted by protein loops.
Chapter 4 provides a comprehensive overview of the structural landscape of
protein-peptide interactions, conveniently stored in the PepX database (http:
//pepx.switchlab.org). Protein-peptide complexes are classified based on the
architecture of their binding sites and annotated with both structural and biological
information (Vanhee et al., 2010).
Chapter 5 provides a key insight into the structure of protein-peptide interactions,
relating the architecture of monomeric proteins with the architecture of
protein-peptide complexes. Our analysis, building on both the BriX and PepX
databases, suggests that the wealth of structural data on monomeric proteins can
be harvested to model peptide interactions (Vanhee et al., 2009).
Chapter 6 puts many of the developed insights into practice by describing a peptide
structure prediction algorithm that is able to model peptide interactions without
previous knowledge of the complex structure. We provide two in-depth case studies
on the PDZ domain and the ligand-binding domain of estrogen receptor α,
demonstrating the potential for structure prediction of peptide motifs.
Finally, Chapter 7 provides a discussion on the general topic of this thesis.
Publications
Protein-Peptide Interactions Adopt the Same Structural Motifs as Monomeric Protein Folds.
Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Tom Lenaerts,
Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Structure, August 2009.
PepX: a structural database of non-redundant protein-peptide complexes.
Peter Vanhee, Joke Reumers, Francois Stricher, Lies Baeten, Luis Serrano, Joost Schymkowitz,
Frederic Rousseau. Nucleic Acids Research, January 2010.
Modeling protein-peptide interactions using protein fragments: fitting the pieces?
Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Luis Serrano, Frederic
Rousseau and Joost Schymkowitz. BMC Bioinformatics, December 2010.
BriX: a database of protein building blocks for structural analysis, modeling and design.
Peter Vanhee*, Erik Verschueren*, Lies Baeten, Francois Stricher, Luis Serrano,
Frederic Rousseau and Joost Schymkowitz. Nucleic Acids Research, January 2011.
Computational design of peptide ligands.
Peter Vanhee, Almer van der Sloot, Erik Verschueren, Luis Serrano, Frederic Rousseau
and Joost Schymkowitz. Trends in Biotechnology, May 2011.
Acknowledgements
This thesis is the result of the hard work of many, many different people I have
worked together with during the course of my PhD. Here, I would like to express
my gratitude towards them.
First of all, I would like to thank my supervisors, Joost Schymkowitz and Frederic
Rousseau, who have been tremendously helpful during the course of this work.
They introduced me to the complex maze the field of biology really is, encouraged
me to do research at the forefront of science, and supervised this project from
start to end. They have motivated me, at the SWITCH lab in which I started this
PhD, to develop a deep interest in molecular biology. I also would like to thank
Tom Lenaerts, my master thesis supervisor, who encouraged me to start a PhD and
introduced me to the SWITCH lab. He has always been a source of help and
advice.
I am very grateful to Luis Serrano, who opened the doors of his lab at the
Centre for Genomic Regulation in Barcelona. Despite his hectic agenda, he has
been instrumental in all parts of this work, continuously throwing in new ideas
and providing me with the opportunity to work in one of the leading institutes in
biomedical sciences in Europe.
I have been very fortunate with the people with whom I have been working
side by side in this project. Lies Baeten, who graduated in computer science like
me, initiated the BriX project during her PhD. Sharing the same background,
she has contributed to many of the ideas and tools we developed together during
this work. One year later, Erik Verschueren joined the SWITCH laboratory and
continued his work in the CRG in Barcelona. Since we met each other again in the
CRG, we have been working together on a daily basis, sharing many moments of
frustration and euphoria. Without both Lies and Erik, I believe this work would not
have been the same.
Many more people have been important to this project. For example, Almer
van der Sloot, whom I met in the lab of Luis Serrano, has often shared his broad
knowledge in cellular biology with me; I was very happy to write with him a
review on computational peptide design, which shaped the introductory chapter of
this thesis. François Stricher often helped me understand the nitty-gritty of
protein structure and stability, and his contributions to the FoldX force field have
been essential to this work. Joke Reumers, a former member of the SWITCH lab,
motivated me to work on the database of protein-peptide complexes, and together
we published a paper which has left me hungry for more publications. I also
really enjoyed working together with Joost Van Durme; we pushed the project of
protein loop prediction, which was originally started by Lies Baeten, to the next
level. Programming with Javier Delgado, originally a SWITCH member and now
post-doc at the lab of Luis Serrano, on the FoldX suite has been very pleasant as
well.
I wish to thank all the members in both the SWITCH group and the group of
Luis Serrano at the CRG. Besides being great colleagues, many of you have also
become good friends. I’d like to thank Ivo, a former member of the SWITCH lab,
for giving critical advice and support. Outside the context of this PhD, I have
often worked together with Antonio, Christof and Andrea. I believe what we did
together has helped me make this project successful, and I hope we will be
working together again in the future.
The financial support for performing this study was given by IWT, FWO and
EMBO. It goes without saying that without their funds, this thesis would not have
been possible.
Finally, I would like to thank my family and my friends, and in particular my
parents for giving their unconditional support. I also wish to thank everyone
who welcomed me with open arms in Barcelona and with whom I spent many
unforgettable moments. And thanks to you Camilla, for sharing both the difficult
and the great moments I went through while working on this thesis. Nothing here
would have been possible without all of you.
Contents
1 Introduction
1.1 Proteins and peptides
1.1.1 Protein function
1.1.2 Protein building blocks
1.1.3 Protein folding and stability
1.1.4 Biological function of protein-peptide interactions
1.1.5 Peptides as therapeutics
1.1.6 Protein Structure
1.2 Protein structure prediction and design
1.2.1 Comparative modeling
1.2.2 Ab initio structure prediction
1.2.3 Predicting protein dynamics
1.2.4 Computational protein design
1.2.5 Protein docking
1.3 Computational design of peptide ligands
1.3.1 A better understanding of protein-peptide interactions
1.3.2 Peptide design based on sequence motifs
1.3.3 Protein complexes as a source of active peptides
1.3.4 Protein docking and fragment based docking as tools for peptide design
1.3.5 Peptide design using protein-peptide complexes
1.3.6 Remedying the lack of structural information
References
2 Fragmenting protein space
2.1 Introduction
2.2 Contents of the BriX database
2.2.1 Update of the BriX database
2.2.2 BriX Statistics
2.2.3 Creation of the Loop BriX database
2.2.4 Loop BriX Statistics
2.2.5 Applications of the BriX database
2.3 Database access
2.3.1 Database availability
2.3.2 User Interface
2.3.3 Covering or bridging of protein structures
2.4 Discussion
References
3 Predicting loop structure
3.1 Introduction
3.2 Results
3.2.1 Comparison with the state-of-the-art loop reconstruction algorithm
3.2.2 Loop homology is no prerequisite for loop reconstruction accuracy
3.2.3 Loop ensemble prediction
3.2.4 Comparison with MODELLER, RAPPER, PLOP and FREAD
3.3 Materials and Methods
3.3.1 LoopX Algorithm
3.3.2 Reconstruction accuracy
3.3.3 Benchmark datasets
3.3.4 LoopX Webserver
3.4 Discussion
References
4 The structural landscape of protein-peptide interactions
4.1 Introduction
4.2 Contents of the PepX database
4.2.1 Construction of a non-redundant data set of protein-peptide complexes
4.2.2 Statistics on structural protein-peptide complexes
4.2.3 Ligand annotation with structural variants for peptide design
4.3 Database Access
4.3.1 Database Availability
4.3.2 User interface
4.4 Discussion
References
5 Protein-peptide interactions resemble monomeric protein interactions
5.1 Introduction
5.2 Results
5.2.1 InteraX: a database of interacting protein fragments
5.2.2 Reconstruction of protein-peptide interactions from interacting fragment pairs derived from monomeric proteins
5.2.3 Reconstruction of peptide binding motifs by using multiple fragment pairs observed in monomeric proteins
5.2.4 Statistical analysis of the factors that determine reconstruction accuracy
5.3 Materials and Methods
5.3.1 Construction of a non-redundant data set of protein-peptide complexes
5.3.2 The dataset of protein fragments
5.3.3 InteraX database
5.3.4 Covering algorithm
5.3.5 FoldX force field
5.3.6 Statistical analysis
5.4 Discussion
References
6 Predicting peptide structure and specificity
6.1 Introduction
6.2 Results
6.2.1 Peptide docking using interaction patterns from InteraX
6.2.2 De novo peptide structure prediction using interaction patterns from InteraX
6.2.3 Case study: PDZ peptide design and specificity
6.2.4 Case study: helical peptide design for the estrogen receptor ligand-binding domain
6.3 Materials and Methods
6.3.1 A constraints-based framework for peptide design
6.3.2 Local backbone moves using BriX
6.3.3 Binding site prediction
6.4 Discussion
References
7 Discussion
References
List of Figures
List of Tables
1 Introduction
Parts of this chapter are based on
Computational design of peptide ligands Peter Vanhee, Almer van der Sloot, Erik Verschueren,
Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Trends in Biotechnology, May 2011.
Proteins are the most versatile and complex molecules in the cell, giving rise
to most of life’s extraordinary shapes and processes. Peptides – short sequences
of ∼4-40 amino acids – are key components of protein-protein interaction
networks, regulating many important cellular processes. It is commonly accepted
that protein function directly relates to the three-dimensional structure of these
molecules, yet high-resolution structures are often lacking. For therapeutic usage,
peptides possess several attractive features when compared to small molecule and
protein drugs: they show a high structural compatibility with target proteins, are
able to disrupt protein-protein interfaces and have a small size. The difficulty of
predicting the structure of high-affinity peptide ligands and of designing them by
rational methods has been a major obstacle to the development of this potential
drug class. However, structural insights into the architecture of protein-peptide
interfaces have recently culminated in a number of computational approaches for the
rational design of peptides targeting proteins. These methods provide a valuable
alternative to high-resolution structures of target protein-peptide complexes,
bringing closer the possibility of in silico designed peptides.
1. INTRODUCTION
1.1 Proteins and peptides
1.1.1 Protein function
Figure 1.1: Proteins interacting with antibodies, small molecules and peptides.
Structural models of protein interactions relevant for therapeutics. (A) The monoclonal
antibody (mAb) cetuximab inhibits the extracellular domain of the epidermal growth
factor receptor (EGFR) (PDB 1YY8). This therapeutic mAb is used in the treatment of
colorectal cancer (Citri & Yarden, 2006). (B) The small molecule gefitinib (Iressa,
AstraZeneca) occupies the ATP-binding pocket of the intracellular kinase domain
of EGFR, thus preventing phosphorylation and inhibiting tumor growth (PDB 2ITY)
(Yun et al., 2007). (C) A phosphotyrosine peptide interacting with the SH2 domain of
GRB2 that binds the intracellular domain of EGFR (PDB 1JYR) (Huang et al., 2008).
Proteins are present in all forms of life, from plants, bacteria and viruses to
animals. They are the cell’s workhorses, putting the genetic information (DNA) of
the cell into action. There are many different types of proteins, making up most
of the cell’s dry mass. Proteins are involved in almost all of the processes going
on in the body; they transport nutrients through the blood, break them down to
power muscles and send signals to the brain. Many proteins act as enzymes that
catalyze reactions to form and break covalent bonds, directing the vast majority of
all major chemical processes in the cell.
Here is a small sample of the role of proteins:
• Enzymes facilitate biochemical reactions. For example, alcohol dehydrogenase
transforms alcohol into a non-toxic form that the body uses for food
and lactase breaks down the sugar lactose found in milk.
• Transport proteins move molecules from one place to another. For example,
hemoglobin carries oxygen through the blood and cytochromes operate in
the electron transport chain as electron carrier proteins.
• Structural proteins give structural features to the cell and provide support.
For example, keratin strengthens protective coverings such as hair, and col-
lagen gives structure and support to the skin and the bones.
• Hormonal proteins are messenger proteins that coordinate important pro-
cesses in the cell and facilitate cell-cell communication. For example, insulin
regulates glucose metabolism by controlling the blood-sugar concentrations
and growth hormone helps regulate growth.
• Contractile proteins provide movement to the cell. For example, actin and
myosin are responsible for muscle contraction.
• Antibodies defend the body from foreign invaders by tightly binding to
antigens such as viruses or bacteria. Antigens are bound by the Major
Histocompatibility Complex (MHC) and presented to a T-cell receptor, after
which white blood cells can be recruited to destroy the invaders. The
structural diversity of antibodies (and in particular of the loops that bind the
antigen) is immense: it has been estimated that humans can generate around
10 billion different antibodies, able to recognize virtually any foreign invader.
To perform all these functions, proteins do not act alone. Instead, they can
associate with themselves or with other proteins as dimers or as multi-subunit com-
plexes, creating networks of protein interactions (Figure 1.1). These interactions –
for example between proteins and other proteins, small molecules, peptides, metals,
lipids, DNA or RNA – are fundamental in understanding the relation between
genotype and phenotype at all different levels, from the molecule to the
organism itself. In Saccharomyces cerevisiae (baker’s yeast) – for which most of
the protein-protein interaction studies have been carried out – nearly every protein
is involved in an interaction (Han et al., 2004). High-throughput studies on entire
organisms, elucidating entire protein-protein interaction networks, are now within
reach of our understanding, as was shown recently for the bacterium Mycoplasma
pneumoniae (Kühner et al., 2009).
1.1.2 Protein building blocks
Figure 1.2: Amino acids grouped by properties. A Venn diagram grouping amino
acids according to their properties (aliphatic, tiny, small, polar, charged, positive,
aromatic, hydrophobic). This is just one of the many possible classifications
of amino acids. The figure was adapted from (Taylor, 1986).
Proteins are made of amino acids, small molecules of carbon, oxygen, nitrogen,
sulfur and hydrogen. To make a protein, amino acids are connected by
peptide bonds into a chain that folds into a three-dimensional structure according to the
chemical properties of the amino acids. Each of the amino acids has a small group
of atoms (the ‘sidechain’) branching off the main chain (the ‘backbone’), which
gives its unique properties to the nascent protein. There are 20 naturally occurring
amino acids, each of them with a slightly different chemical structure. Based on
their chemical properties, they can be organized in different categories: basic or
acidic, polar (hydrophilic or ‘water-loving’) or hydrophobic (‘greasy’), charged or
uncharged, aliphatic or aromatic (Figure 1.2).
Figure 1.3: Four levels of protein structure. Primary structure: amino acid sequence.
Secondary structure: regular sub-structures (α-helix, β-sheet). Tertiary structure:
3-dimensional structure (p13 protein). Quaternary structure: complex of protein
molecules (hemoglobin).
As shown in Figure 1.3, proteins have different levels of structure: the primary
structure is the amino acid sequence, the secondary structure the local substruc-
tures (α-helices and β-strands) that are stabilized by an organized network of
hydrogen bonds. The tertiary structure is the entire protein folded into a complete
three-dimensional structure, and the quaternary structure is the structure of the
interaction of multiple units coming together to form a larger unit.
1.1.3 Protein folding and stability
As Anfinsen showed, the structure of a protein is uniquely defined by its sequence
(Anfinsen, 1973). The folding of the polypeptide chain is a complex process
that turns the essentially unstructured and elongated polypeptide chain into a
compact, stable and unique protein fold, held together by mainly non-covalent
interactions (Onuchic & Wolynes, 2004). In the late sixties, Levinthal famously
pointed out that it seemed impossible that a protein could fold spontaneously
following a random process in a reasonable timeframe, suggesting the existence
of folding pathways (Levinthal, 1969). Through a combination of technologies –
most notably, protein recombinant technologies, NMR and X-Ray technologies and
computer simulations – we now commonly accept that the major force of protein
folding is the hydrophobic collapse of the polypeptide chain (Dill, 1990; Chandler,
2005).
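Levinthal's argument can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a chain of 100 residues, three accessible conformations per residue and 10^13 conformations sampled per second; these are commonly quoted illustrative figures rather than values taken from this thesis, and the function name is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def log10_search_time_years(n_residues: int,
                            states_per_residue: int = 3,
                            samples_per_second: float = 1e13) -> float:
    """Return log10 of the years needed to visit every conformation once.

    All three parameters are illustrative assumptions. Working in log10
    avoids overflow when the number of conformations becomes astronomically
    large for long chains.
    """
    log10_conformations = n_residues * math.log10(states_per_residue)
    log10_seconds = log10_conformations - math.log10(samples_per_second)
    return log10_seconds - math.log10(SECONDS_PER_YEAR)


if __name__ == "__main__":
    exponent = log10_search_time_years(100)
    print(f"Exhaustive search would take about 10^{exponent:.1f} years")
```

Under these assumptions an exhaustive random search would take on the order of 10^27 years, far longer than the age of the universe, whereas real proteins fold in milliseconds to seconds. This gap is what motivates the idea of folding pathways.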
The large majority of proteins in their folded state is only marginally stable,
meaning that the energy difference between the unfolded and folded state of
proteins is relatively small. The breaking of a single hydrogen bond caused by
a single amino acid mutation might lead to the collapse of the entire protein.
Different forces contribute to the free energy of the protein, commonly expressed
as a variation of free energy (∆G) between the unfolded and folded states. The
non-covalent forces that contribute to the stability of proteins can be described as
follows:
• Van der Waals interactions are weak, attractive or repulsive interactions
that occur between both charged and polar molecules. They include the
London dispersion forces, dipole-dipole interactions and hydrogen bonding,
and are often calculated using (6-12)-potentials such as the Lennard-Jones
potential.
• Hydrogen bonding occurs when two electronegative atoms compete for the
same hydrogen atom. The proton donor is covalently bound to the hydrogen
atom, while the proton acceptor interacts favorably with the hydrogen atom.
Originally observed in 1936 by Pauling and Mirsky – before the first protein
structures became available – hydrogen bonds ‘hold together’ the
folded polypeptide chain, giving rise to both α-helices and β-sheets.
• Hydrophobic interactions are a phenomenon observed when non-polar
compounds collapse into aggregates when surrounded by water. Almost
half of the amino acids are hydrophobic (Figure 1.2) and tend to cluster
together to form the hydrophobic core of the protein.
• Electrostatic interactions are long-distance cohesive forces that appear between
oppositely charged atoms. Salt bridges are a special kind of hydrogen
bond that occurs between charged functional groups.
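To make two of the quantities above concrete, the following Python sketch evaluates a 12-6 Lennard-Jones energy and the folded-state population of a protein as a function of ∆G. It is illustrative only. The two-state folding model, the sign convention (∆G taken as the free energy of unfolding, positive when the folded state is favoured), the function names and the parameter values are assumptions made for this example and are not taken from the force field used in this work.

```python
import math

# Gas constant in kcal/(mol*K); energies below are assumed to be in kcal/mol.
R_KCAL = 1.987204e-3


def lennard_jones(r: float, epsilon: float, sigma: float) -> float:
    """12-6 Lennard-Jones energy: 4*eps*((sigma/r)**12 - (sigma/r)**6).

    epsilon is the well depth and sigma the distance at which the energy
    crosses zero; both are hypothetical parameters here.
    """
    if r <= 0.0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def fraction_folded(delta_g_unfold: float, temperature: float = 298.15) -> float:
    """Folded-state population of a two-state protein.

    delta_g_unfold is G(unfolded) - G(folded) in kcal/mol, so a positive
    value means the folded state is the more stable one.
    """
    return 1.0 / (1.0 + math.exp(-delta_g_unfold / (R_KCAL * temperature)))


if __name__ == "__main__":
    r_min = 2.0 ** (1.0 / 6.0) * 3.5  # energy minimum for sigma = 3.5
    print(lennard_jones(r_min, epsilon=0.2, sigma=3.5))
    for dg in (0.0, 1.0, 5.0):
        print(dg, fraction_folded(dg))
```

In this model the energy minimum lies at r = 2^(1/6)·sigma with depth -epsilon. A ∆G of only a few kcal/mol, comparable to the strength of a handful of hydrogen bonds, already shifts the population almost entirely to the folded state, which illustrates why losing a single interaction can matter for a marginally stable protein.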
It came somewhat as a surprise to discover that an estimated 40% of all proteins
in the human proteome are intrinsically disordered and become fully or partly
structured only upon binding to partners in the cell (Gianni et al., 2003). Short
motifs or peptides are often recognized within these unstructured proteins and
play important roles in protein regulatory networks and signaling pathways.
1.1.4 Biological function of protein-peptide interactions
Peptides and short peptide stretches in larger proteins (∼4-40 amino acids long)
perform a myriad of functions both in cell-to-cell and intracellular communication:
they are important mediators in many signalling pathways and regulatory networks
(Neduva & Russell, 2006). A great variety of endogenous regulatory peptides of
variable length act as peptide hormones and/or neurotransmitters and are involved
in inter-cellular communication (Brunton et al., 2006). These peptides show a wide
range of physiological activities and are important in maintaining homeostasis.
Examples are the potent blood pressure regulators angiotensin II (8 amino acids,
a.a.) and vasopressin (9 a.a.), the appetite regulators ghrelin (28 a.a.) and obestatin
(23 a.a.), and glucagon (29 a.a.), a regulator of glucose metabolism. They act in an
endocrine, paracrine or autocrine fashion by binding cell surface receptors, such as
G-protein coupled receptors (GPCRs). Typically, these peptides are produced by
differential processing of a precursor protein by endopeptidases to yield biologically
active peptides. Many higher organisms, from amphibians to humans, also rely
on peptides as an integral part of their host-defense mechanism against microbial
assault, and although the use of peptides in antimicrobial therapy is rather limited at
the moment, peptides are increasingly being considered as antibacterials, antivirals
and antifungals in clinical settings (Hancock & Sahl, 2006; Easton et al., 2009).
Short, often unstructured peptide stretches or ‘motifs’ of larger proteins are
also important players in intracellular signaling networks (Russell & Gibson, 2008).
These motifs are recognized by globular protein domains, such as the SH3 domain
that binds short polyproline-rich motifs, the SH2 domain that recognizes peptides
containing phosphorylated tyrosine, or the PDZ domain that binds C-terminal motifs
(Pawson & Nash, 2003). It is believed that up to 40% of all protein interactions in
the cell are either directly or indirectly influenced by peptide-mediated interactions
(Neduva & Russell, 2006; Petsalaki & Russell, 2008). Given the importance of these
protein-peptide interactions in both inter- and intracellular signalling they provide
important targets for therapeutic intervention in a range of diseases. Modulating
these interactions with peptide-like agonists and antagonists therefore constitutes
an attractive therapeutic strategy.
1.1.5 Peptides as therapeutics
The vast majority of therapeutic compounds achieve their effects by binding to and
altering the function of target protein molecules (Figure 1.4). Traditionally, the main
source of successful therapeutics has been small organic molecules, which usually
bind in small cavities of the target protein and inhibit or ‘block’ specific catalytic
centers or the binding sites of natural substrate analogues (Drews, 2000) (Figure
1.4C). The recent focus on protein-protein interaction networks has shifted the goal
of drug targeting increasingly towards disruption of protein-protein interactions, a
feat for which classical small molecules are not always ideally suited (Arkin & Wells,
2004; Wells & McClendon, 2007). The newest additions to the pharmaceutical
arsenal are protein-based therapeutics, which are generally improved recombinant
replacements of endogenous proteins or monoclonal antibodies directed against
a wide variety of targets (Walsh, 2010) (Figure 1.4B). Although the introduction of
protein therapeutics – in particular monoclonal antibodies – has been tremendously
successful, their use is mainly limited to extracellular targets, such as membrane-
bound receptors and secreted proteins, because uptake of these large molecules
into intracellular compartments remains cumbersome (Patel et al., 2007).
Peptides are generally considered ‘poor drugs’ because of cumbersome deliv-
ery, prohibitively short in vivo lifetimes and bad overall bio-availability (Antosova
et al., 2009; Audie & Boyd, 2010). However, recent technological innovations in
formulation, delivery and chemistry have sparked greater interest in peptide ther-
apeutics (Walensky et al., 2004; Timmerman et al., 2005; Tan et al., 2010). Their
chemical structure makes them by definition highly compatible with the proteins
they target and their intermediate size enables them to disrupt protein-protein
interfaces, whilst remaining sufficiently small for intracellular targeting (Patel et al.,
Figure 1.4: Targeting the cell with different molecules. Overview of different
drug strategies targeting protein signaling pathways: (A) normal scenario of a generic
pathway, (B) therapeutic antibodies, (C) small molecules and (D) peptides.
2007) (Figure 1.4D). Presently, more than 50 peptide-based products are approved
for clinical use in the United States and other countries (Table 1.1) (Pechon et al.,
2010), underlining the tremendous market potential of peptidic drugs. This has
spurred a great interest in technologies capable of providing new peptide se-
quences with high affinity and specificity towards therapeutically relevant targets.
In the remainder of this chapter and throughout this thesis, we will discuss recent
technological advances that could lead to rationally designed peptides targeting
proteins.
Peptides and redefining ‘druggability’
Given the current success of recombinant protein-based therapeutics, we are al-
ready witnessing the erosion of the long-standing and relatively narrow definition
of what constitutes a ‘druggable target’ (Hopkins & Groom, 2002) (i.e. a protein
that can be modulated by an orally administered active small molecule, adhering
to the ‘rule of five’ proposed by Lipinski et al. (2001)). The definition of ‘drugga-
bility’ has widened to include targets whose activity can be modulated by larger
molecules, such as proteins and peptides.
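The ‘rule of five’ mentioned above lends itself to a small worked example. The following Python sketch counts violations using the commonly quoted thresholds (molecular weight above 500 Da, logP above 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors). The `Compound` class, the function name and both example molecules with their property values are hypothetical and chosen only for illustration; they are not measured data.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Minimal property record; all example values below are hypothetical."""
    name: str
    mol_weight: float       # molecular weight in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(c):
    """Return the list of rule-of-five criteria that compound c violates."""
    violations = []
    if c.mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if c.log_p > 5:
        violations.append("logP > 5")
    if c.h_bond_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if c.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


# Hypothetical examples: a typical small molecule and a short peptide.
small_molecule = Compound("hypothetical small molecule", 350.0, 2.5, 2, 5)
nonapeptide = Compound("hypothetical 9-residue peptide", 1050.0, -1.0, 12, 14)

for compound in (small_molecule, nonapeptide):
    print(compound.name, "->", rule_of_five_violations(compound) or "no violations")
```

With these illustrative numbers the peptide fails on size and on both hydrogen-bond counts, which is the kind of outcome that places peptides outside the narrow, small-molecule definition of druggability.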
Current small-molecule drugs target only a fraction of all proteins inside and
Name | Approval date (US) | Disease | # A.A. | Origin of mimetics | Company | Global sales (US $ million)
Glatiramer, Copaxone® | 1996 | Multiple Sclerosis | 4 | Myelin protein | Teva | 3200
Leuprolide, Lupron® | 1985 | Prostate and breast cancer (mainly) | 9 | Gonadotropin Releasing Hormone (GnRH) mimetic | Abbott (amongst others) | 1900
Goserelin, Zoladex® | 1989 | Prostate and breast cancer (mainly) | 10 | Luteinising Releasing Hormone (LRH) mimetic | AstraZeneca | 1146
Octreotide, Sandostatin® | 1998 | Acromegaly, carcinoid syndrome | 8 | Somatostatin hormone mimetic | Novartis | 1123
Teriparatide, Forteo® | 2002 | Osteoporosis | 34 | Parathyroid hormone (84 residues, residues 1-34) | Eli Lilly | 779
Exenatide, Byetta® | 2005 | Diabetes Type 2 | 39 | Exendin-4 hormone (incretin mimetic) | Amylin / Eli Lilly | 750
Enfuvirtide, Fuzeon® | 2003 | AIDS (HIV-1 infection) | 36 | Viral glycoprotein (gp41) | Roche | 167
Table 1.1: Leading examples of peptide therapeutics currently on the market.
Data is extracted from the annual Peptide Report issued by the Peptide Therapeutics
Foundation (http://www.peptidetherapeutics.org/annual-report.html) (Pe-
chon et al., 2010).
outside the cell. Typical targets include GPCRs, enzymes, nuclear hormone re-
ceptors and ion channels, all of which have natural small-molecule substrates
(Brunton et al., 2006). Most of these drugs target the binding pocket of the sub-
strate directly, but also other, allosteric cavities can be targeted. On average, the
contact surface between a small-molecule ligand and its protein receptor is between
300 and 1000 Å² (Smith et al., 2006). In contrast, the contact surface between
two interacting proteins is generally much flatter, larger (1200-3000 Å²) (Conte
et al., 1999; Jones & Thornton, 1996), and discontinuous in sequence. Most of
the free energy of binding is contributed by a limited number of amino acids in the
interface (‘hotspot’ residues) (Clackson & Wells, 1995). Interaction networks are
distributed and constructed with a modular architecture, showing tight coopera-
tive interactions within a module and additive interactions between the modules
(Reichmann et al., 2005). Conversely, protein-peptide interfaces display a smaller
contact surface and a more continuous architecture and often target well-outlined,
large hydrophobic pockets on a protein (London et al., 2010). These pockets are
larger than the typical clefts targeted by small molecules, but smaller than large
protein-protein interfaces.
Given the large, shallow and distributed nature and lack of pockets and cavities
in protein-protein interfaces, these interfaces are often considered to be hard
to target with small-molecule drugs. Although progress has been made – e.g.
Thorsen et al. (2010) identified a small molecule that targets PDZ domains with
micromolar affinity similar to the endogenous peptide – disrupting protein-protein
interfaces with classical small molecule compounds remains a difficult task (Wells
& McClendon, 2007). Peptide-like drugs are likely to be more suitable candidates
to act as competitive inhibitors of protein-protein interactions, considering their
similar binding mode.
Alternatively, peptide-like ligands could target protein-protein interactions in
a non-competitive manner by acting as an allosteric modulator. This concept
has received a lot of attention owing to the success of small-molecule allosteric
modulators (Conn et al., 2009; Eglen & Reisine, 2010), but is also well-established
in protein-mediated interactions (Alvarado et al., 2010). One limitation to this
allosteric approach is the need to identify the ‘pressure points’ in a target protein
structure that should be ‘hit’ in order to affect its function. Several methods to map
the dynamics of the amino acid interaction network that constitutes the protein
structure have been developed, and these have, at a minimum, the potential to
reveal sites on the protein surface with allosteric modulatory power (Lee et al.,
2008; Lenaerts et al., 2008, 2009; Haliloglu & Erman, 2009; Haliloglu et al., 2010).
In short, targeting protein-protein interactions with peptide-based competitive
inhibitors or – albeit more challenging – peptide-based allosteric modulators extends
the definition of druggability by expanding the potential classes of druggable
targets.
1.1.6 Protein Structure
Structural biology gathers structural information at different levels of the
biological hierarchy, from atoms to cells, into a common framework. Elucidating
the structure of protein molecules is therefore key. Since the determination of the first three-
dimensional structure of myoglobin – an oxygen-carrying protein found in muscle
tissue – in 1958 (Kendrew et al., 1958), our understanding of biology has radically
changed (Figure 1.5). Experimental structure determination is often carried out
either by X-ray crystallography or by nuclear magnetic resonance (NMR).
[Figure 1.5 timeline. Years marked on the axis: 1958, 1960, 1963, 1965, 1969, 1973, 1977, 1978, 1985, 1987, 2001. Milestones shown:]
• Kendrew reported the first low-resolution structure of myoglobin.
• Perutz published the low-resolution structure of haemoglobin.
• Anfinsen deduced from experiments that the native conformation of a protein is determined by its amino-acid sequence.
• Phillips determined the high-resolution structure of hen egg-white lysozyme.
• Levitt and Lifson introduced the method of energy refinement.
• Cohen, Boyer and colleagues developed recombinant DNA technology.
• Sanger and Maxam and Gilbert published their respective methods for DNA sequencing.
• Karplus published the first molecular dynamics simulations.
• Hutchison and Smith reported an effective method for site-directed mutagenesis.
• Wüthrich determined the first NMR structure of a protein.
• (1987–1989) Fersht introduced Φ-value analysis of binding and folding.
• Preliminary publication of the human genome sequence.
Figure 1.5: History of experimental structure determination from the last fifty
years. Figure adapted from (Fersht, 2008).
Many structures of globular proteins exist in public databases such as the Protein
Data Bank (PDB, Berman et al. (2000)), approximately 69,900 as of December
2010. Although special effort is devoted to resolving non-globular proteins
such as fibrous and membrane proteins, only few of their structures are currently available in public
databases because of the complexities associated with crystallizing these
aggregation-prone proteins.
Protein structures are not ‘static pictures’. Instead, proteins undergo dynamic
excursions from their ‘ground states’ and these fluctuations have increasingly been
associated with important biological processes such as for example protein folding,
enzyme catalysis and signal transduction (Eisenmesser et al., 2002; Lange et al.,
2008). These fluctuations are typically probed experimentally using NMR,
but computational techniques have also been a major aid in
the determination of protein dynamics (Duan & Kollman, 1998; Shaw et al., 2010).
Nowadays, high-throughput structural genomics initiatives strive towards experimentally
solving selected sets of proteins, such that every protein of unknown
structure has at least one neighbor in protein classification systems such as
SCOP and CATH (Chandonia & Brenner, 2006). Since their inception, these
initiatives have resolved thousands of novel protein structures, mainly using X-ray
diffraction. Yet it has been estimated that at least 16,000 carefully selected structures
would need to be solved before comparative modeling can cover
90% of all protein domain families (Chruszcz et al., 2010). Solving this ‘structural
data’ problem is key to the general understanding of proteins in the cell. Since
more than 70% of all known proteins still lack a determined structure, let alone
the structures of protein-protein, protein-DNA and protein-RNA complexes,
the computational modeling of protein structure has become a field of its own.
In the remainder of this introductory chapter, we introduce this field and focus in
particular on the prediction of peptide interactions.
1.2 Protein structure prediction and design
Following Anfinsen’s dogma, the structure of a protein is uniquely defined by
its sequence (Anfinsen, 1973). Since experimental determination of a protein’s
structure is an often expensive and time-consuming task, computational biologists
have embraced the challenge to predict secondary and tertiary structure directly
from sequence. However, the problem turns out to be far from trivial: the
structural variety the polypeptide chain can adopt is virtually unlimited. Creighton
estimated that a protein of 100 amino acids can adopt up to 10^100 alternative
conformations (Creighton, 1984), far more than the estimated number of atoms in
the observable universe (on the order of 10^80).
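The size of this search space can be made concrete with a few lines of arithmetic. The sketch below assumes, purely for illustration, 10 accessible states per residue (which reproduces the 10^100 figure for 100 residues) and a hypothetical sampling rate of 10^13 conformations per second; neither number is taken from Creighton (1984).

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def count_conformations(n_residues, states_per_residue=10):
    """Number of chain conformations if every residue picks a state independently."""
    return states_per_residue ** n_residues


def exhaustive_search_years(n_conformations, rate_per_second=10**13):
    """Whole years needed to visit every conformation once at the assumed rate."""
    seconds = n_conformations // rate_per_second
    return seconds // SECONDS_PER_YEAR


total = count_conformations(100)
years = exhaustive_search_years(total)
print("conformations have", len(str(total)), "digits")
print("years of exhaustive sampling have", len(str(years)), "digits")
```

Even under these generous assumptions, exhaustive enumeration is out of reach by many orders of magnitude, which is why practical methods restrict or bias the sampling.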
Many computational methods have been developed and used to increase structural
coverage (Baker & Sali, 2001). Every design methodology essentially has two
components: a sampling component that explores the space of possible conformations
and a scoring component that ranks the sampled solutions.
The sampling problem is often simplified by making a ‘fixed-backbone’ assumption,
although recent years have seen an increasing number of methods that
introduce protein backbone flexibility, e.g. by using an ensemble of backbone
conformations, robotic-arm inspired moves or iterative backbone and sequence
optimization, amongst others (Mandell & Kortemme, 2009a). Sidechain confor-
mations are subsequently sampled on a given backbone, often using a library of
rotamers that represent all different conformations the sidechain can adopt on the
given backbone template. Complete enumeration of these conformations remains
a herculean task, such that both stochastic methods (e.g. Monte Carlo simulation,
Kuhlman & Baker (2000)) and deterministic methods (e.g. the popular dead-end
elimination technique, Desmet et al. (1992)) are employed. Finally, scoring func-
tions - either statistical or physics-based - are used to rank the final set of solutions,
mainly relying on energetic terms such as Van der Waals packing constraints, hy-
drophobic interactions, hydrogen bonding, solvation and electrostatic interactions
(Section 1.1.3). In our work, we rely on the empirical force field FoldX to weigh
these components and output a final ranking based on the total free energy estimate
of the model (Schymkowitz et al., 2005).
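The split between a sampling and a scoring component can be illustrated with a toy fixed-backbone problem. The Python sketch below assigns one rotamer per position, scores an assignment as a sum of self and pair energies, and samples assignments with a Metropolis Monte Carlo scheme; a complete enumeration is included for comparison, which is only feasible because the example is tiny. All energies are random numbers and every function name is hypothetical: this is not FoldX, nor the protocol of Kuhlman & Baker (2000).

```python
import itertools
import math
import random


def random_toy_energies(n_pos, n_rot, seed=1):
    """Random self and pair energies; purely illustrative, not a force field."""
    rng = random.Random(seed)
    self_e = [[rng.uniform(-1, 1) for _ in range(n_rot)] for _ in range(n_pos)]
    pair_e = [[[[rng.uniform(-1, 1) for _ in range(n_rot)]
                for _ in range(n_rot)]
               for _ in range(n_pos)]
              for _ in range(n_pos)]
    return self_e, pair_e


def total_energy(assignment, self_e, pair_e):
    """Scoring component: self energies plus pair energies over positions i < j."""
    energy = sum(self_e[i][r] for i, r in enumerate(assignment))
    for i, j in itertools.combinations(range(len(assignment)), 2):
        energy += pair_e[i][j][assignment[i]][assignment[j]]
    return energy


def exhaustive_minimum(self_e, pair_e):
    """Complete enumeration; the number of states grows exponentially."""
    states = itertools.product(*(range(len(rots)) for rots in self_e))
    return min((total_energy(s, self_e, pair_e), s) for s in states)


def monte_carlo_rotamers(self_e, pair_e, steps=2000, kT=0.6, seed=0):
    """Sampling component: Metropolis Monte Carlo over single-rotamer changes."""
    rng = random.Random(seed)
    n_pos = len(self_e)
    current = [rng.randrange(len(self_e[i])) for i in range(n_pos)]
    e_cur = total_energy(current, self_e, pair_e)
    best, e_best = list(current), e_cur
    for _ in range(steps):
        i = rng.randrange(n_pos)
        trial = list(current)
        trial[i] = rng.randrange(len(self_e[i]))
        e_trial = total_energy(trial, self_e, pair_e)
        delta = e_trial - e_cur
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_cur = trial, e_trial
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best


self_e, pair_e = random_toy_energies(n_pos=4, n_rot=3)
e_min, _ = exhaustive_minimum(self_e, pair_e)
best, e_best = monte_carlo_rotamers(self_e, pair_e)
print("exhaustive minimum:", round(e_min, 3))
print("Monte Carlo best:  ", round(e_best, 3), best)
```

The stochastic search can never report an energy below the enumerated minimum; how often it reaches that minimum depends on the number of steps and the temperature parameter `kT`, both arbitrary choices here.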
Probably the most important question in modeling is when structure prediction
becomes biologically useful (Zhang, 2009). This depends on the purpose of the
model: highly accurate models (< 1-2 Å root mean square deviation, RMSD, versus
the crystallographic model) can in some cases be used for ligand-binding studies
or even virtual screening, while medium-resolution models (2.5-5 Å) could provide
an idea about functionally important residues, active sites or disease-associated
mutations. Low resolution models (> 5 Å) could be used for topology recognition
or for determining the protein boundaries.
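As a rough illustration of these accuracy classes, the following Python sketch computes the RMSD between two coordinate sets and maps the value onto the three categories above. It assumes that the two structures are already optimally superimposed and list equivalent atoms in the same order; the exact class boundaries (2 Å and 5 Å) are our simplification of the ranges quoted in the text, and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superimposed coordinate lists."""
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


def model_category(rmsd_value):
    """Map an RMSD in Angstrom onto the usage classes discussed in the text."""
    if rmsd_value < 2.0:
        return "high accuracy"       # ligand-binding studies, virtual screening
    if rmsd_value <= 5.0:
        return "medium resolution"   # functional residues, active sites
    return "low resolution"          # topology, protein boundaries


# Made-up coordinates: the model is the reference shifted by 3 Angstrom along x.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
model = [(x + 3.0, y, z) for x, y, z in reference]

value = rmsd(reference, model)
print(value, model_category(value))
```

In practice the superposition step (for example with the Kabsch algorithm) precedes the RMSD calculation; it is omitted here to keep the sketch short.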
Measuring progress in the field of structure prediction is the topic of the biennial
competition CASP (Critical Assessment of techniques for protein Structure
Prediction) (Moult et al., 1995). Research groups are given an amino acid sequence
whose native structure is not yet public but will be determined soon. Since
its inception in the mid-1990s, this community-wide effort has already instigated
many novel design protocols, with over 100 different groups participating in the
competition.
The available range of modeling methods can roughly be divided in two cate-
gories, although an increasing number of hybrid methods is being developed as
well: methods that rely on comparative modeling or threading and methods that
predict structure ab initio. Typically, the first category of methods has provided
the more accurate models but is limited by the amount of structural data available,
while the latter is unconstrained but limited by the huge number of possible confor-
mations, idealized model systems and approximate free energy estimates. In what
follows, we will discuss a selection of recent advances in these methodologies and
also focus on a related but slightly different field in computational biology: that of
computational protein design. We conclude with a short discussion of the docking
problem, in which the structure of two interacting proteins is sought, given their
structures in isolation.
1.2.1 Comparative modeling
Comparative modeling – also called homology modeling – relies on the observation
that sequence similarity suggests structural similarity, often because in the process
of evolution, structure (and thus function) is not radically altered. Homology
modeling thus searches for proteins which share a certain amount of sequence
similarity with the target protein. Highly accurate models are often generated
when more than 50% sequence identity exists between the target protein and the
templates. These models might have a root-mean-square (RMS) error of 1 Å on
the main-chain atoms, which is comparable to the difference between X-ray and
NMR methods (Baker & Sali, 2001). Between 30 and 50% sequence identity, homology modeling
still gives reliable models, but loops and other variable regions in
the protein structure in particular might deviate from the template structures. Finally, when
comparative models are based on less than 30% sequence identity, errors
accumulate rapidly in the model. Frequent sidechain packing errors, distortion of
the protein core, loop modeling errors and other severe problems might render the
model useless. These bottlenecks in comparative modeling, especially at low
sequence identity, might be partly remedied by means of all-atom force
field refinements, multiple template structures or specialized loop reconstruction
methods (see Box).
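The identity thresholds above can be turned into a small helper. The Python sketch below computes the percentage of identical residues over the aligned (gap-free) columns of a pairwise alignment and maps it onto the three regimes described in the text. The function names and the example alignments are hypothetical, and other definitions of percent identity (for instance dividing by the length of the shorter sequence) are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def expected_model_quality(identity):
    """Map percent identity onto the regimes described in the text."""
    if identity > 50.0:
        return "highly accurate model likely"
    if identity >= 30.0:
        return "reliable model, variable regions may deviate"
    return "errors accumulate rapidly"


# Hypothetical alignments, for illustration only.
print(percent_identity("ACDEFGHIKL", "ACDEFGHIKV"))   # one mismatch in ten columns
print(percent_identity("AC-DE", "ACGDQ"))             # the gapped column is ignored
print(expected_model_quality(percent_identity("AC-DE", "ACGDQ")))
```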
To better understand the mechanisms of folding, Rose and Creamer put forward
the challenge to find two proteins with high sequence similarity but a different
fold (Rose & Creamer, 1994). Recently, two small proteins of 56 residues each,
GA88 and GB88, were designed that share 88% sequence identity, differing in only
7 residues. The structures, solved by NMR, revealed two distinct
folds: GA88 adopted a 3-α fold while GB88 adopted an α/β fold, showing that
Protein loop reconstruction with LoopX
In Chapter 3 we present LoopX, a loop reconstruction method that
combines a database backbone template search with sidechain reconstruc-
tion. We demonstrate the competitiveness of the method by comparing it to
various state-of-the-art loop prediction methods, including the robotics-inspired
KIC, which was recently shown to reconstruct loops of up to 12 residues
with sub-angstrom accuracy (Mandell et al., 2009). Additionally, we
demonstrate that LoopX can model the conformational ensemble adopted by
protein loops and induced by ligand binding in the case of the PDZ-peptide
and meganuclease-DNA interactions.
conformational switching between two folds could be effected with just a handful
of mutations (He et al., 2008; Alexander et al., 2009). Such cases pose interesting
problems for homology-based methods, since they challenge our interpretation of
sequence-structure relationships, yet they might not be representative of a large
number of proteins.
Protein threading Threading methods attempt to fit sequences to a known struc-
ture from a library of folds, especially for cases in which no evolutionary relation
is obvious (Bowie et al., 1991). This is accomplished by ‘threading’ the sequence
along the backbone of the template model, followed by a scoring function which
evaluates the placement of the amino acids in the backbone. Threading methods
are less constrained since no homology is required, yet they still rely on correctly
selecting a series of templates in which to fit the sequence. One such successful
(hybrid) approach is I-TASSER (Roy et al., 2010). The algorithm performs a
PSI-BLAST search (Altschul et al., 1997) with the query sequence to identify evolutionary
relatives, followed by secondary structure assignment to construct an initial
scaffold. That scaffold is then used to select a series of models from the PDB
for threading using a series of state-of-the-art threading programs (Wu & Zhang,
2007). In subsequent steps, fragments from the structural models are determined
and combined with ab initio predictions for badly aligned regions (particularly,
loops), to result in a structural model for which functional properties can be de-
rived.
Fragment-based structure prediction Another type of comparative modeling
tool is the fragment-based method. Such methods differ in their definition of the smallest
unit used to infer sequence similarity: instead of the entire fold, they consider
stretches of residues, effectively expanding the number of templates that can be
used for modeling. Many different fragment-based methods have been successfully applied
to protein structure prediction, often in combination with ab initio sidechain
prediction and energy evaluation. A successful fragment-based approach is used
in Rosetta to bootstrap structure prediction (Section 1.2.2). In our own work,
we have used the ‘fragment paradigm’ as an effective way to solve the sampling
problem in protein structure prediction (see Box).
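The idea of working with stretches of residues instead of entire folds can be sketched in a few lines of Python. The example below cuts template sequences into overlapping fragments, indexes them, and then looks up the fragments of a target sequence, so that a target can be covered by pieces from several templates even when no single template matches it as a whole. The matching is by exact sequence only, and the template names and sequences are invented; this is a deliberately simplified stand-in and not the BriX or Rosetta fragment procedure, which operate on backbone structure and sequence profiles.

```python
from collections import defaultdict


def sliding_fragments(sequence, length):
    """Yield (start, fragment) for every overlapping window of the given length."""
    for start in range(len(sequence) - length + 1):
        yield start, sequence[start:start + length]


def build_fragment_index(templates, length):
    """Map each fragment string to the (template name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in templates.items():
        for start, fragment in sliding_fragments(sequence, length):
            index[fragment].append((name, start))
    return index


def cover_target(target, index, length):
    """Return {target offset: template hits} for target fragments found in the index."""
    hits = {}
    for start, fragment in sliding_fragments(target, length):
        if fragment in index:
            hits[start] = index[fragment]
    return hits


# Invented sequences: the target combines pieces of two unrelated templates.
templates = {"tmplA": "MKTAYIAK", "tmplB": "QRQISFVK"}
target = "TAYIAKQRQI"

index = build_fragment_index(templates, length=4)
hits = cover_target(target, index, length=4)
for start in sorted(hits):
    print(start, target[start:start + 4], hits[start])
```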
The BriX fragment paradigm
In our own work, we have used the ‘fragment paradigm’ as an effec-
tive way to solve the sampling problem in protein structure prediction (Baeten
et al., 2008; Vanhee et al., 2011). We used short protein backbone fragments
to reconstruct (parts of) proteins (Chapter 2), deduce structural relationships
between single proteins and protein-peptide interfaces (Chapter 5) and
perform blind reconstruction of peptide interactions (Chapter 6).
Comparative modeling entirely depends on the quality of the templates avail-
able: many structural templates obviously lead to better homology modeling and
thus to better models. In the limit, and with the increasing body of structural
data that is being deposited into public databases, comparative modeling will ulti-
mately cover the entire structural space, since the number of unique folds in nature
is expected to be limited (Grant et al., 2004).
1.2.2 Ab initio structure prediction
Ab initio methods differ from comparative modeling techniques since they remove
the requirement of having at least one related structure. Yet, some of the most
successful ab initio techniques - such as the popular Rosetta framework (Rohl
et al., 2004) - follow a hybrid approach. For example, Rosetta scans the PDB for
small structural fragments with similar sequence signatures to the target sequence
using a Bayesian probability distribution. It then iteratively assembles these frag-
ments using Monte Carlo sampling, optimizing packing between the fragments
and favoring β-sheet formation. In a fine-grain step the sidechains are rebuilt
with backbone-dependent rotamer libraries. As an example, an enzymatic active
site has been designed using a series of minimal active site templates (termed
‘theozymes’) that could accommodate the four different catalytic motifs used to catalyze
the breaking of a carbon-carbon bond, and the model was later confirmed using
X-ray crystallography (Jiang et al., 2008). In other recent work, the methodology
was repeated to provide a design that catalyzes the Diels-Alder reaction, a
carbon-carbon bond-forming reaction for which supposedly
no natural enzyme exists (Siegel et al., 2010). Both cases are milestones for our
current ability to design proteins with desired properties using semi-computational
approaches.
Often, interleaving experimental information in the computational method re-
sults in better designs, by constraining for example the design towards certain
active sites, or allowing more conformational freedom in one part of the protein
than in another. Among many different approaches, the use of NMR chemical shift
data is particularly appealing. In CS-Rosetta, the chemical shift data was used to
select fragments with similar resonance profiles in combination with the traditional
sequence similarity from public databases, thus constraining the fragments used in
the stochastic sampling and improving the final models (Shen et al., 2008). A completely
different but highly innovative approach is to use the human eye to resolve
hard three-dimensional problems. By means of an online computer game, Baker
and co-workers managed to not only explain protein folding to a large community
outside the protein design field, they also showed a significant improvement in
CASP models that could not have been achieved using computational algorithms
alone (Cooper et al., 2010).
1.2.3 Predicting protein dynamics
Methods that sample the conformational space according to first principles are
much slower than the knowledge-based methods above and thus more limited in use.
Probably the best known of these methods
is Molecular Dynamics (MD) simulation (McCammon et al., 1977). In MD simulations,
biophysical forces are described explicitly at the all-atom level – e.g. with CHARMM (Brooks
et al., 1983) or AMBER (Cornell et al., 1995) – as opposed to the statistics-based
force field descriptions often used in approximate or homology-based methods
(Klepeis et al., 2009). The advantage of MD simulations over other structure
prediction methods is that they not only capture the ground state of the folded
protein, but they also hint at the dynamics of the protein and its folding process,
often important for understanding the function of the protein. Recently, David
Shaw and co-workers modelled the folding and unfolding of the small WW domain
and the BPTI protein, the latter on the millisecond time scale (Shaw et al., 2010). Important
biological conformational changes, such as folding, often take place on a time
scale between 10 µs and 1 ms, and this achievement, made possible through the
use of a customized supercomputer, extended the time span accessible to such methods
about 100-fold.
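To give a feeling for why such time scales are costly, the Python sketch below integrates Newton's equations of motion with the velocity Verlet scheme, a standard MD integrator, for a single particle on a harmonic spring, and then counts the integration steps needed to cover one millisecond. The one-dimensional system, the unit constants and the 2 fs time step are illustrative assumptions on our part; the sketch is unrelated to the CHARMM or AMBER force fields and to the simulations of Shaw et al. (2010).

```python
SPRING_K = 1.0   # force constant (arbitrary units, illustrative)
MASS = 1.0       # particle mass (arbitrary units, illustrative)


def harmonic_force(x):
    """Restoring force of a harmonic spring centred at x = 0."""
    return -SPRING_K * x


def total_energy(x, v):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * MASS * v * v + 0.5 * SPRING_K * x * x


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate position and velocity; returns the list of (x, v) states."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass      # half-step velocity update
        x = x + dt * v_half                   # full-step position update
        f = force(x)                          # force at the new position
        v = v_half + 0.5 * dt * f / mass      # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


def steps_required(simulated_time_fs, timestep_fs=2):
    """Number of integration steps for a given simulated time (femtoseconds)."""
    return simulated_time_fs // timestep_fs


trajectory = velocity_verlet(1.0, 0.0, harmonic_force, MASS, dt=0.01, n_steps=1000)
e_start = total_energy(*trajectory[0])
e_end = total_energy(*trajectory[-1])
print("energy drift:", abs(e_end - e_start))

FEMTOSECONDS_PER_MILLISECOND = 10**12
print("steps for 1 ms:", steps_required(FEMTOSECONDS_PER_MILLISECOND))
```

Each of these steps requires a full force evaluation over all atoms, which is what makes millisecond-scale all-atom simulation so demanding.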
1.2.4 Computational protein design
Computational protein design deals with finding a compatible sequence for a given
protein fold, and as such, is often termed the ‘inverse folding problem’. It shares
many of the same challenges posed by protein structure prediction and both
require an understanding of the often complicated relationship between sequence
and structure (Mandell & Kortemme, 2009b).
Traditionally, changing the properties of a protein is accomplished using ‘rational
design’, in which researchers manually tweak proteins, or directed evolution,
an experimental technique that harnesses natural selection at the molecular level
to customize proteins to meet certain specifications (Romero & Arnold, 2009). In the
last twenty years however, computer algorithms have entered this field to produce
in-silico models of optimized proteins that are then subjected to experimental
analysis (van der Sloot et al., 2009; Lutz, 2010). Protein design has an enormous
number of applications both in academia and industry. For example, protein design techniques
have been used to increase the thermostability of an enzyme whilst retaining
enzymatic activity (Korkegian et al., 2005), to alter the affinity and specificity of the
family of leucine zipper transcription factors (Grigoryan et al., 2009), and to
artificially rewire pathways in cellular systems, following a synthetic
biology approach that makes use of the modular architecture of signaling pathways
in eukaryotes (Pryciak, 2009).
1.2.5 Protein docking
Protein docking deals with finding the structure of the interaction rather than the
structure of the individual proteins, that is, predicting the quaternary structure
(Figure 1.3). Since proteins exercise their functions through the way they interact
with other proteins, an atomistic understanding is often required to infer functional
relationships between proteins, decipher signaling pathways and so on. Usually,
docking of two unbound proteins proceeds in two phases: first, a putative binding
mode is detected using geometric complementarity or fast Fourier transforms
(e.g. using PatchDock (Schneidman-Duhovny et al., 2005) or ZDock (Chen
et al., 2003)). Second, a fine-grain refinement protocol fits both structures, some-
times allowing for slight conformational flexibility in the backbone and sidechains
of the proteins (e.g. Haddock (Dominguez et al., 2003) or RosettaDock (Gray
et al., 2003)). Similar to CASP, the recurring competition CAPRI (‘Critical Assessment
of PRedicted Interactions’) measures the progress in the field using a blind
competition (Janin et al., 2003). While CAPRI largely seems to be an academic
exercise at this point, limited by the need for starting structures, an increase in
the prediction accuracy of the methods can be observed. Most of the improvements
are now directed towards introducing backbone flexibility in the search process (Lensink
et al., 2007).
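The first, rigid-body phase of docking can be illustrated on a toy two-dimensional grid. In the Python sketch below, the receptor is discretized into core cells (penalized) and a surrounding surface layer (rewarded), and the ligand is translated exhaustively over the grid; the score of a placement is the sum of the receptor values under the ligand cells. Fast Fourier transform based programs evaluate this kind of shape correlation far more efficiently and also sample rotations, which are left out here. The shapes, weights and function names are hypothetical and do not reproduce ZDock or PatchDock.

```python
def make_receptor_grid(core_cells, core_penalty=-10, surface_reward=1):
    """Core cells get a penalty; empty cells touching the core get a reward."""
    grid = {cell: core_penalty for cell in core_cells}
    for x, y in core_cells:
        for nx in (x - 1, x, x + 1):
            for ny in (y - 1, y, y + 1):
                if (nx, ny) not in core_cells:
                    grid[(nx, ny)] = surface_reward
    return grid


def complementarity_score(receptor_grid, ligand_cells, dx, dy):
    """Sum of receptor grid values under the ligand translated by (dx, dy)."""
    return sum(receptor_grid.get((x + dx, y + dy), 0) for x, y in ligand_cells)


def best_translation(receptor_grid, ligand_cells, shifts):
    """Exhaustive translational scan; returns (best score, (dx, dy))."""
    return max((complementarity_score(receptor_grid, ligand_cells, dx, dy), (dx, dy))
               for dx in shifts for dy in shifts)


# Toy receptor: a 5 x 3 block with a one-cell pocket in its top face.
receptor_core = {(x, y) for x in range(5) for y in range(3)} - {(2, 2)}
receptor_grid = make_receptor_grid(receptor_core)

# Toy ligand: a three-cell bar with a knob pointing down from its middle cell.
ligand_cells = [(0, 0), (1, 0), (2, 0), (1, -1)]

score, shift = best_translation(receptor_grid, ligand_cells, range(-4, 8))
print("best score:", score, "at translation:", shift)
```

The top-scoring translation places the knob of the ligand in the pocket of the receptor, where all four ligand cells lie on the receptor surface layer without overlapping its core.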
1.3 Computational design of peptide ligands
Efficient design of high affinity peptide ligands via rational methods has been
a major obstacle to the development of peptides for therapeutics. However,
structural insights into the architecture of protein-peptide interfaces have recently
culminated in a number of computational approaches for the rational design of
peptides targeting proteins (Vanhee et al., 2009; London et al., 2010). These
methods provide a valuable alternative to experimental high-resolution structures
of target protein-peptide complexes, bringing closer the dream of in silico designed
peptides for therapeutic applications. Here we provide an extensive review of these
methods (Figure 1.7).
1.3.1 A better understanding of protein-peptide interactions
With the increase of high-resolution structures of protein-peptide complexes in
the Protein Data Bank (PDB, http://www.pdb.org) (Berman et al., 2000), and in
complementary databases such as the database of three-dimensional interact-
ing domains (3did, http://3did.irbbarcelona.org) (Stein et al., 2010) and
the non-redundant database of protein-peptide complexes (PepX, http://pepx.
switchlab.org) (Vanhee et al., 2010), large-scale structural studies have at-
tempted to describe the key properties of peptide binding (Vanhee et al., 2009;
London et al., 2010) (Chapter 5). For example, we have identified 505 unique
structural peptide-mediated interactions from a set of 1431 high-resolution struc-
tures, with a high over-representation of well-studied peptide interactions, such
as MHC-peptide complexes, thrombin-bound peptides, or peptides bound to the
ligand-binding domain of the estrogen receptor α (Vanhee et al. (2009), Chapter
4). In a set of 103 peptide complexes, it has been noted that many interfaces ex-
hibit tighter packing and more main chain hydrogen bonds than normally found in
protein-protein interfaces (London et al., 2010). This difference is logical: peptides
in isolation cannot be too hydrophobic, for they would aggregate. Therefore, part
of the binding energy needed to compensate for the loss of entropy upon binding has to be
derived from main-chain/main-chain and main-chain/side-chain hydrogen bonds.
In silico mutagenesis on these interaction interfaces has revealed that peptide
interfaces contain ‘hotspot’ residues, reminiscent of those found in protein-protein
interfaces (Clackson & Wells, 1995). Peptides that are 6-8 residues long typically
contain two hotspot residues, whereas three hotspots are typical for peptides of length
9-11 (London et al., 2010). In general, peptides often exhibit an elongated struc-
ture upon binding (Stein & Aloy, 2010) and do not appear to induce any large
conformational changes in their binding partners in order to reduce the entropic
cost of complex formation (London et al., 2010). In contrast, many of the peptide
motifs are located in structurally disordered regions of proteins and only adopt a
stable fold upon binding to their protein partner (‘fold-on-binding’).
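The in silico mutagenesis described above can be illustrated with a minimal Python sketch that flags hotspot residues from per-residue alanine-scanning energies. The peptide, the ΔΔG values and the 1.0 kcal/mol cutoff are hypothetical placeholders chosen for illustration; they are not taken from London et al. (2010) or from any particular force field.

```python
# Hypothetical alanine-scanning results for an 8-residue peptide:
# (position, residue, ddG in kcal/mol upon mutation to alanine).
# All values are invented for illustration.
ALA_SCAN = [
    (1, "E", 0.1), (2, "P", 0.3), (3, "L", 1.8), (4, "S", 0.2),
    (5, "T", 0.4), (6, "W", 2.4), (7, "K", 0.6), (8, "V", 0.9),
]


def find_hotspots(scan, cutoff=1.0):
    """Return (position, residue) pairs whose mutation to alanine
    costs at least `cutoff` kcal/mol (the cutoff is an assumption)."""
    return [(pos, res) for pos, res, ddg in scan if ddg >= cutoff]


hotspots = find_hotspots(ALA_SCAN)
print(hotspots)  # [(3, 'L'), (6, 'W')]
```

With these invented numbers the 8-residue peptide has two hotspots, in line with the typical count reported for peptides of 6-8 residues.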
[Figure 1.6, panels A-C. Labels: hydrophobic; hydrophilic; hydrogen bonds; β2]
Figure 1.6: PDZ-peptide interactions and peptide specificity. (A) The PDZ domain
of Erbin (PDB 1N7T) binds peptides in an elongated way, with multiple residues
contributing to the interaction. (B) The carboxy-terminus of the PDZ peptide binds
tightly in a hydrophobic pocket of the PDZ domain. (C) Distribution of 74 PDZ domains
in selectivity space, after singular value decomposition of correlated positions in the
peptide. The contributions of different peptide positions allow the PDZ domains to
optimize their specificity while avoiding cross-reactivity, revealing an even distribution
throughout selectivity space (Figure adapted from Stiffler et al. (2007)).
Despite their limited size, peptide interactions can be highly specific. For
example, many C-terminal peptides exhibit high specificity in vivo for particular
PDZ domains, while avoiding cross-reactivity (Stiffler et al., 2007) (Figure 1.6).
Interestingly, peptide specificity across 157 mouse PDZ domains matched with
217 peptide ligands could not be captured in discrete classes but instead showed a
more even distribution in selectivity space. Specificity in peptide interactions can
also be introduced by engineering approaches even when not observed in nature
(Reina et al., 2002; Grigoryan et al., 2009). One such example is the basic-region
leucine zipper (bZIP) family of transcription factors, which share a high degree of
structural and sequence similarity and bind DNA upon homo- and/or heterodimerization
with an identical or related bZIP monomer subunit. By replacing one of the
wild-type monomer subunits with a variant that has the basic-region substituted
by an acidic region, DNA binding is prevented and consequently the activity of the
transcription factor will be inhibited. As these acidic variants inherit the dimeriza-
tion properties from the wild-type, it is difficult to inhibit one specific bZIP family
member due to intrinsic heterodimerization properties. Keating and co-workers
showed recently that it was feasible to design anti-bZIP peptide variants that bind
specifically to only a single member of the human bZIP family using a computational
design approach (Grigoryan et al., 2009). The algorithm explicitly considers both
target and non-target interactions by selecting sequences that minimize the loss of
affinity for the target while maximizing the difference in affinity between the target
and any non-target member. Out of the 20 targeted bZIP families, 10 designed peptides
bound their representative member of the family with considerably higher affinity
than any non-target competitor, demonstrating peptide specificity. This study and
related, albeit smaller-scale, computational design studies (Reina et al., 2002;
van der Sloot et al., 2006) demonstrate that specific
binding partners can be designed even in situations where there is a high degree
of sequence- and structural similarity between target and non-target molecules.
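The trade-off between target affinity and specificity described above can be made concrete with a toy Python sketch. It is not the algorithm of Grigoryan et al. (2009): the peptide names, the predicted energies and the `max_affinity_loss` parameter are all hypothetical, and a real design run would search sequence space rather than rank three precomputed candidates.

```python
# Hypothetical predicted binding energies (kcal/mol, lower = tighter binding)
# of candidate peptides against one target and two non-target family members.
ENERGIES = {
    "pepA": {"target": -10.0, "off1": -9.5, "off2": -6.0},
    "pepB": {"target": -9.0, "off1": -5.0, "off2": -4.5},
    "pepC": {"target": -6.0, "off1": -1.0, "off2": -0.5},
}


def specificity_gap(scores, target="target"):
    """Energy gap between the target and the best-binding non-target.
    A larger positive value means a more specific peptide."""
    best_off = min(e for name, e in scores.items() if name != target)
    return best_off - scores[target]


def select_specific(energies, max_affinity_loss=2.0):
    """Among candidates that lose at most `max_affinity_loss` kcal/mol of
    target affinity relative to the best binder, pick the largest gap."""
    best_target = min(s["target"] for s in energies.values())
    allowed = {p: s for p, s in energies.items()
               if s["target"] - best_target <= max_affinity_loss}
    return max(allowed, key=lambda p: specificity_gap(allowed[p]))


print(select_specific(ENERGIES))  # pepB
```

In this invented example the tightest binder (pepA) is barely specific, whereas pepB gives up 1 kcal/mol of target affinity for a 4 kcal/mol specificity gap; pepC is more specific still but is excluded because it loses too much affinity.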
1.3.2 Peptide design based on sequence motifs
If structural information is present for a drug target – either from the single structure
or from the target in complex with its ligand – this information can be used in the
drug discovery process to speed up lead identification (Murray & Blundell, 2010).
Unfortunately, structural information is available for only an estimated 50% of all
drug targets (Tanrikulu & Schneider, 2008), with a significant underrepresentation
of targets of high therapeutic importance such as membrane proteins (Baker,
2010). As a result, many research groups use existing information compiled in
databases of protein-peptide interactions to derive sequence-binding motifs that
could be used to design peptides.
[Figure 1.7, panel labels. (A) Structure-free peptide design: phage-display library
screening; quantitative peptide assays; sequence motif scanning. (B) Structure-based
peptide design: peptides derived from protein complex structures; de novo peptide
design with structural scaffolds; peptide docking & de novo design; optimizing
binding affinity. Workflows are marked as experimental, experimental/computational
or computational. Further labels: selection; sequencing; direct read-out; consensus
motif; position-specific mutagenesis; PWM heatmap; best binding motif.]
Figure 1.7: Example workflows for peptide design. (A) Structure-free peptide design
and (B) structure-based peptide design.
The most obvious cases are the well-studied SH2, SH3, PDZ and WW domains,
for which peptide templates can be designed using simple sequence-based rules
(e.g. the well-known PxxP motif for SH3 domains, with x any amino acid, or
T/S-x-I/V/L-COOH for class I PDZ domains), even though the discrete classification
in motifs has been disputed (Stiffler et al., 2007). These templates can be randomized
at the non-key positions, and specific peptides can then be found using screening
methods such as yeast two-hybrid (Y2H) or phage display (Tonikian et al., 2008;
Giordano et al., 2010).
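The two sequence rules quoted above translate directly into regular expressions. The following Python sketch is an illustration only: the motif names and the `scan` helper are hypothetical, the first test sequence is the PDZ ligand EETSV mentioned in Figure 1.10, and the second is an invented proline-rich string.

```python
import re

# Sequence rules from the text written as regular expressions.
# "x" (any amino acid) becomes "."; "$" anchors the PDZ motif at the
# C-terminus, standing in for the free carboxylate (-COOH).
MOTIFS = {
    "SH3_PxxP": r"P..P",
    "PDZ_class_I": r"[TS].[IVL]$",
}


def scan(sequence):
    """Return {motif name: list of 0-based start positions} for one sequence."""
    hits = {}
    for name, pattern in MOTIFS.items():
        # Wrapping the pattern in a lookahead also reports overlapping matches.
        lookahead = re.compile("(?=" + pattern + ")")
        hits[name] = [m.start() for m in lookahead.finditer(sequence)]
    return hits


print(scan("EETSV"))       # {'SH3_PxxP': [], 'PDZ_class_I': [2]}
print(scan("AAPPLPPRAA"))  # {'SH3_PxxP': [2, 3], 'PDZ_class_I': []}
```

Such pattern matching only recovers the consensus; as noted above, it cannot capture the continuous distribution of specificities reported by Stiffler et al. (2007).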
For cases in which enough information on peptide binding is available, other
more sophisticated approaches can be used. For example, an artificial neural
network, capable of learning to recognize non-linearity in complex datasets, was
trained on 650 peptides derived from T-cell epitopes and known to bind the Major
Histocompatibility Complex (MHC) class II molecule (Honeyman et al., 1998).
The neural network then was used to speed up epitope screening by reducing the
experimental T-cell assay from 68 to 22 peptides, with only a potential loss of 5
out of 17 epitopes. In more recent work, prediction was combined with genetic
algorithms, hidden Markov models or other motif discovery algorithms (Lin et al.,
2008). Predicting from sequence alone is often difficult because of permissive
binding modes (MHC Class II for example accommodates from 9 to 18 residues,
although longer peptides have been observed too), multiple binding cores and
insufficient high-quality binding data, all leading to noisy and often inaccurate pre-
dictions. Adding structural information to the prediction process – approximately
169 X-ray structures of MHC in complex with an antigenic peptide are available in
PepX (Vanhee et al. (2010) and Chapter 4) – could increase prediction accuracies,
yet structure-based methods are still too slow for genome-wide screening (Lin
et al., 2008).
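The screening figures quoted above for the neural network (Honeyman et al., 1998) can be restated as a worked calculation. The sketch below uses only the numbers given in the text; reading "a potential loss of 5 out of 17 epitopes" as a retained fraction of 12/17 is our interpretation, not a value reported in the source.

```python
# Numbers quoted in the text (Honeyman et al., 1998).
peptides_all, peptides_screened = 68, 22
epitopes_total, epitopes_lost = 17, 5

# Fraction of T-cell assays saved by pre-screening with the network.
assay_reduction = 1 - peptides_screened / peptides_all

# Fraction of known epitopes still recovered (our interpretation).
epitopes_kept = epitopes_total - epitopes_lost
sensitivity = epitopes_kept / epitopes_total

print(f"assays saved: {assay_reduction:.1%}")  # assays saved: 67.6%
print(f"epitopes retained: {epitopes_kept}/{epitopes_total} = {sensitivity:.1%}")
# epitopes retained: 12/17 = 70.6%
```

In other words, roughly two thirds of the experimental workload is saved at the cost of missing up to about 30% of the epitopes.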
While these motif-scanning methodologies can lead to novel peptide discover-
ies, it is unclear whether they could lead to more generalist approaches when little
information is known about the target protein.
1.3.3 Protein complexes as a source of active peptides
Peptide fragments derived from the crystallographic interface of a protein-protein
interaction are a major source for rational drug design (Watt, 2006). In 2003,
the anti-HIV peptide enfuvirtide (Fuzeon®) was the first peptide (36 a.a.) derived
from an extracellular protein interface to receive FDA approval (Table 1.1) and
represented a landmark in the field of peptide therapeutics (Naider & Anglister,
2009). Intracellular targets associated with HIV infection have been targeted with
peptides as well.
Transcription factors (TF), regarded as ‘undruggable’ by classical small molecule
drugs owing to their large protein-protein interfaces (Section 1.1.5), have now
been targeted with peptides too. The original discovery of a 59-mer peptide frag-
ment from the co-activator of the Mastermind-like family (MAML-1) required for
NOTCH signaling marked the start for structure-based inhibitor design (Weng
et al., 2003). The protein-peptide complex bound to DNA has been solved re-
cently by two independent groups, showing that the Mastermind peptide binds
as a twisted helix in the shallow protein-protein groove (Nam et al., 2006) (Figure
1.8A). Using a technique termed ‘peptide stapling’ (Schafmeister et al., 2000), a
16-residue peptide has been designed in which two residues are stapled together
using a hydrocarbon bond; this acts to constrain the helix functionality of the pep-
tide while improving binding affinity. The inhibitory α-helical peptide is able to
penetrate the cell membrane and bind to a shallow groove formed by the intra-
cellular domain of NOTCH and a DNA-bound TF, thereby blocking the interaction
with the co-activator MAML-1, required for recruiting the transcription machinery.
As a consequence, proliferation of T-cell acute lymphoblastic leukemia cells was
stopped.
In an entirely different class of proteins, stapled α-helical peptides have been
shown to be effective as well, inhibiting members of the anti-apoptotic BCL2 family
(Stewart et al., 2010). These anti-apoptotic proteins contain a hydrophobic groove
that engages the death-promoting BH3 helix. Molecular mimicry of that helix with
a stapled peptide led to selective inhibition of the anti-apoptotic protein (Figure 1.8B).
Both examples of successful helical peptide design suggest that nature’s use of
peptides in protein-protein interfaces provides exciting opportunities for peptide
therapeutics. Scanning the entire PDB for interfaces involving helical segments
has revealed many potentially interesting interfaces in which α-helical interactions
play an important role, such as nuclear hormone receptors or other transcription
factor-cofactor interfaces (Jochim & Arora, 2009). The acquisition of Aileron’s
peptide stapling technology by Roche in August 2010 only confirms the potential
of these stabilized α-helix peptides as a new class of powerful peptide therapeutics
(Sheridan, 2010).
[Figure 1.8, panels A and B. Labels: hydrocarbon staple; hydrophobic; positive
charge; negative charge; hydrophilic]
Figure 1.8: Stapled helical peptides as potent therapeutic peptides. (A) Design of
MAML-1 derived peptides by taking different portions of the MAML-1 helix and turning
them into peptides (sliding window; orange, red, pink and orange indicate the different
peptides used for stabilization; PDB 2F8X). The 16 a.a. sequence of MAML-1 targeting
ICN1 and CSL is shown in red and was used to design the stapled peptide. Figure
adapted from Moellering et al. (2009). (B) Crystal structure of the stapled helix
MCL-1 complex (PDB 3MK8). The stapled helix engages in binding in the canonical
binding groove. Hydrophobic interactions at the binding interface are reinforced by a
complementary polar interaction network. The side chains of hydrophobic (yellow),
positively charged (blue), negatively charged (red) and hydrophilic (green) residues
are shown. Figure adapted from Stewart et al. (2010).
Yet, so far successful peptide designs seem to be largely limited to the α-helical
scaffold. One reason for this may be the large entropy cost associated with
structuring a peptide upon binding, a cost that is easier to reduce with α-helical
peptides, whose structure can be fixed in advance. For example, a leucine-zipper
scaffold can be used to fix the helix bundle (Grigoryan et al., 2009), or chemical
stapling of side chains can be used to fix a single helix scaffold (Schafmeister
et al., 2000). For hairpin structures, cyclization has also been employed (Craik
et al., 2007). Yet another way to extend the structural stability of peptides is to
incorporate them in a highly stable mini-protein, such as knottins or other
scaffolds (Gebauer & Skerra, 2009).
In conclusion, protein-complex derived peptides in combination with scaffold
designs are currently the most successful ways for therapeutic peptide design.
1.3.4 Protein docking and fragment-based docking as tools for peptide design
A generalist approach for peptide design uses structures or homology models of the
target in combination with docking algorithms to construct peptides along a chosen
path on the target surface. Several tools can be used to structurally detect putative
binding sites, for example using geometric amino acid-dependent preferences
derived from a set of structural binding modes (Petsalaki et al., 2009). Autodock
– a popular small-molecule docking algorithm – was used in combination with a
genetic algorithm to design tetrapeptides against a selected hydrophobic region
of α-synuclein, a protein associated with aggregation diseases (Abe et al., 2007).
Upon experimental validation, several binding peptides having µM dissociation
constants were identified that could be used as leads for further screening. Another
approach uses a Gaussian Network Model to identify the binding site and Autodock
is used to dock a series of dipeptides in a pairwise fashion on the grid along a flexible
binding path, finally resulting in an optimal peptide sequence for a given surface
(Unal et al., 2010).
While these methods work well for peptides comparable in size to small
molecules (typically no longer than 3-4 a.a.), the design of longer peptides still
poses major combinatorial problems.
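The scale of this combinatorial problem follows from simple arithmetic: with 20 standard amino acids, a peptide of length n has 20^n possible sequences. The short Python sketch below counts sequences only; it ignores backbone and side-chain conformations, which enlarge the search space further, and the chosen lengths are arbitrary examples.

```python
AMINO_ACIDS = 20  # standard proteinogenic amino acids


def sequence_space(length):
    """Number of distinct peptide sequences of the given length."""
    return AMINO_ACIDS ** length


for n in (2, 4, 8, 12):
    print(f"{n:>2} residues: {sequence_space(n):.3e} sequences")
```

A tetrapeptide library (160,000 sequences) can still be enumerated and docked exhaustively, whereas an 8-mer already spans about 2.6 × 10^10 sequences.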
1.3.5 Peptide design using protein-peptide complexes
Structural information on the protein-peptide binding interface can often be used
to model the protein-peptide interaction. Most approaches can be divided into
three main scenarios:
1. Use a structure with a peptide ligand as template and model by homology
a domain-related sequence and then mutate in silico with a protein design
algorithm the amino acid side chains of the peptide in order to change
specificity while keeping peptide backbone coordinates fixed (Reina et al.,
2002; van der Sloot et al., 2006).
2. Use a structure with a peptide ligand to model by homology a domain-related
sequence while allowing peptide backbone flexibility. The crudest approach
uses different domain-peptide complexes of the same family to generate
different ligand backbone structures that can then be superimposed on the
target structure (Fernandez-Ballester et al., 2009). This was probed recently
for the PDZ domain, for which peptide specificity was computationally re-
designed using all available structures from the PDZ domain and compared
with large-scale phage display experiments (Smith & Kortemme, 2010). An-
other approach introduces backbone flexibility in the peptide starting from
a series of perturbed X-ray protein-peptide complexes (Raveh et al., 2010).
This protocol was validated on a set of 89 peptide complexes and it produced
models that showed sub-angstrom deviation from the native structure.
3. Use a structure while only knowing the approximate binding site of the
peptide ligand, for example based on evidence from related domains. The
PepSpec algorithm does not rely on a structural model of the peptide (King &
Bradley, 2010). Instead, it only needs a single anchor residue positioned in
the binding pocket and introduces implicit backbone movements in the re-
ceptor through ensemble modeling. Evaluation was carried out on a series of
peptide-binding domain families, such as PDZ, SH2 and SH3. In the absence
of an experimentally obtained structural model of the domain and relying on
a model based on a homologous domain, the algorithm captured some of
the peptide specificities that were matched with experimental phage display
libraries. However, long simulation times that scale unfavourably with peptide
length were reported (∼100-300 hours per peptide). In this field, we
achieved some progress too: peptides were designed for the PDZ domain,
the α-ligand binding domain and the SH2 domain within sub-angstrom accuracy,
using structural data from BriX interaction patterns in combination with
the FoldX force field for side-chain placement and energy evaluation (Figure
1.10 and Chapter 6).
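The 'sub-angstrom' accuracy mentioned in scenarios 2 and 3 refers to the root-mean-square deviation (RMSD) between modelled and native peptide coordinates. The following Python sketch shows the calculation under two stated assumptions: the structures are already superimposed and the atoms are listed in matching order. The coordinates are invented, and the original studies may have used a different atom selection or superposition protocol.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) atom positions, in the units of the input (e.g. angstrom).
    Assumes the structures are already superimposed and atom-matched."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical backbone atoms of a native and a modelled peptide (angstrom).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.3, 0.0, 0.0), (3.8, 0.4, 0.0), (7.6, 0.0, 0.5)]

print(rmsd(native, model) < 1.0)  # True: this toy model is sub-angstrom
```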
To summarize, fixed backbone peptide design (scenario 1) can be successfully
used in situations when a high degree of sequence and structural similarity exists –
or can be assumed – between template complex-structure and the target complex
structure while minimizing computational cost. When changes in backbone con-
formation are expected to play a greater role (e.g. in cases of decreasing sequence
and structural similarity or when insertions/deletions relative to the template struc-
ture have to be modeled) one of the approaches mentioned under scenario 2 can
be employed. When the exact binding site of the peptide is not known, one of the
approaches mentioned under scenario 3 has to be employed. For now, the computational
cost of these methods would limit the use to selected (design) case studies rather
than proteome-wide screening.
1.3.6 Remedying the lack of structural information
A recently reported method addresses the lack of structural information on mem-
brane proteins by employing a database of helix-helix interaction scaffolds to initiate
de novo peptide design (Yin et al., 2007) targeting integrins. Integrins are important
receptor proteins in mammalian cells, with a flexible domain of transmembrane
(TM) helices in the phospholipid bilayer. Integrins process extracellular signals,
transmitting them to the interior of the cell, thus making them attractive targets for
tumor therapy (Desgrosellier & Cheresh, 2010).
Peptides selectively targeting integrins αIIbβ3 and αVβ3 have been computa-
tionally designed with a new approach for rational peptide design (Yin et al., 2007).
Typical peptide designs are derived from the crystal structure of the target
protein-protein complex (Section 1.3.3), so that the design task mainly consists of
stabilizing the hot-spot interactions with a peptide. However, because in most cases
no crystal structure of the interface is available, the authors of this study relied
on a repertoire of over 400 naturally occurring TM-helix interactions with
recognizable sequence signatures (Walters & DeGrado, 2006).
The computational design was divided into two steps: (1) the helix-helix interaction
motifs served as realistic backbone templates (Figure 1.9A/B), as opposed
to idealized helix pairs often used in protein design, and were selected based
on sequence compatibility with the target TMs; (2) the authors threaded the
sequence of the target TM helix on one helix of the helix pair; they then selected
compatible side chains for the peptide on the other helix, using a
side-chain-repacking algorithm (Figure 1.9C). The computationally designed peptides
were subsequently validated in micelles, bacterial membranes and finally mammalian
cells, where they inhibited the binding between the TMs of the α- and β-subunits,
thus activating the integrin.
[Figure 1.9, panels A-C. Labels: TM helix-pairs cluster; anti-αIIb scaffold
(PDB 1JB0); αIIb peptide threaded on scaffold and repacked]
Figure 1.9: Design of helices targeting Trans-Membrane (TM) proteins. Using (A)
a library of trans-membrane helix pairs from (B) unrelated structures (e.g. PDB 1JB0)
to (C) design novel peptide ligands. Figure adapted from Yin et al. (2007).
Multiple advances reported in this study are noteworthy. First, the authors
showed that peptides possess the capacity to integrate within the lipid bilayer and
selectively interact with and activate α-β-integrins in mammalian cells; this had
previously been difficult to accomplish owing to the lack of a solvent-exposed
binding site. Second, by using a library of naturally constrained helix-helix inter-
action motifs, they circumvented the need to model computationally expensive
inter-helical hydrogen bonding patterns and deviations from idealized helical ge-
ometry. Finally, this study provides exciting opportunities for designing peptide
inhibitors, getting around the need for high-resolution structures of the interface.
We have taken a radically different approach toward remedying the lack of
structural data on protein-peptide interactions (Chapter 5). Peptide binding motifs
often resemble intramolecular packing motifs, suggesting that the wealth of data
on single-chain proteins could be used to model peptide interactions (Figure 1.10
and Chapter 5). Through analysis of a representative set of 301 protein-peptide
binding interfaces, we showed that more than half of all peptide interaction motifs
could be reliably modeled from sets of interacting fragments from the BriX database
of protein fragments (http://brix.crg.es and Chapter 2) (Baeten et al., 2008;
Vanhee et al., 2011). As a result, the number of structural peptide interaction
motifs increased from a few hundred to over 100,000 fragment interactions. The
use of these intramolecular ‘fragment interaction motifs’ that have pre-optimized
packing represents an important conceptual breakthrough because it transforms
the whole database of protein structures into learning data for computer algorithms
that design peptide substrates de novo, as we describe in Chapter 6. In the near
future, we expect that such algorithms will start to appear so that large-scale virtual
peptide screening will become feasible.
[Figure 1.10, panels A and B, each with sub-panels 1-3. Labels: target without
ligand; monomeric interaction motif; designed ligand; helix-helix motif; helix-loop
motif; cation-PI motif; binding site; WT; design]
Figure 1.10: Innovative structural approaches in peptide design using BriX and
InteraX. (A) Examples of monomeric interaction motif mining in structures (red): (1)
a helix-helix interaction motif (PDB 153L); (2) a helix-loop interaction motif (PDB
153L); (3) a cation-PI interaction motif (PDB D1GA). See Chapter 5. (B) Sub-angstrom
design of peptide interactions using monomeric structures. (1) Structure of a PDZ
domain (PDB 2I1N) without its ligand and the helix and β2 strand forming the interface
(blue). (2) Identification of an intra-molecular helix-strand-strand motif (red) from an
unrelated structure (PDB 1GSA). (3) Comparison between the structures of the
wild-type sequence (EETSV) designed on the intra-molecular scaffold (red) and the
original ligand (gold). See Chapter 6.
References
Abe, K., Kobayashi, N., Sode, K. & Ikebukuro, K.
(2007). Peptide ligand screening of alpha-
synuclein aggregation modulators by in sil-
ico panning. BMC Bioinformatics, 8, 451. 28
Alexander, P.A., He, Y., Chen, Y., Orban,
J. & Bryan, P.N. (2009). A minimal se-
quence code for switching protein structure
and function. Proceedings of the National
Academy of Sciences, 106, 21149–54. 16
Altschul, S.F., Madden, T.L., Schäffer, A.A.,
Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J.
(1997). Gapped blast and psi-blast: a new
generation of protein database search pro-
grams. Nucleic Acids Research, 25, 3389–
402. 16
Alvarado, D., Klein, D.E. & Lemmon, M.A.
(2010). Structural basis for negative coop-
erativity in growth factor binding to an egf
receptor. Cell, 142, 568–79. 11
Anfinsen, C. (1973). Principles that govern the
folding of protein chains. Science. 5, 13
Antosova, Z., Mackova, M., Kral, V. & Macek, T.
(2009). Therapeutic application of peptides
and proteins: parenteral forever? Trends in
biotechnology, 27, 628–35. 8
Arkin, M.R. & Wells, J.A. (2004). Small-
molecule inhibitors of protein-protein inter-
actions: progressing towards the dream. Nat
Rev Drug Discov, 3, 301–17. 8
Audie, J. & Boyd, C. (2010). The synergistic
use of computation, chemistry and biology
to discover novel peptide-based drugs: the
time is right. Curr Pharm Des, 16, 567–82. 8
Baeten, L., Reumers, J., Tur, V., Stricher, F.,
Lenaerts, T., Serrano, L., Rousseau, F. &
Schymkowitz, J. (2008). Reconstruction of
protein backbones from the brix collection
of canonical protein fragments. PLoS Com-
put Biol, 4, e1000083. 17, 32
Baker, D. & Sali, A. (2001). Protein structure
prediction and structural genomics. Science,
294, 93–6. 13, 15
Baker, M. (2010). Making membrane proteins
for structures: a trillion tiny tweaks. Nature
Publishing Group, 7, 429–434. 24
Berman, H.M., Westbrook, J., Feng, Z., Gilliland,
G., Bhat, T.N., Weissig, H., Shindyalov, I.N. &
Bourne, P.E. (2000). The protein data bank.
Nucleic Acids Research, 28, 235–42. 12, 21
Bowie, J.U., Lüthy, R. & Eisenberg, D. (1991). A
method to identify protein sequences that
fold into a known three-dimensional struc-
ture. Science, 253, 164–70. 16
Brooks, B., Bruccoleri, R., Olafson, B.D., States,
D.J., Swaminathan, S. & Karplus, M. (1983).
Charmm: A program for macromolecular
energy, minimization, and dynamics calcu-
lations. J Comput Chem. 19
Brunton, L., Lazo, J. & Parker. . . , K. (2006).
Goodman & gilman’s the pharmacological
basis of therapeutics. mcgraw-hill.co.uk. 7,
10
Chandler, D. (2005). Interfaces and the driving
force of hydrophobic assembly. Nature, 437,
640–7. 6
Chandonia, J.M. & Brenner, S.E. (2006). The im-
pact of structural genomics: expectations
and outcomes. Science, 311, 347–51. 13
Chen, R., Li, L. & Weng, Z. (2003). Zdock: an
initial-stage protein-docking algorithm. Pro-
teins, 52, 80–7. 20
Chruszcz, M., Domagalski, M., Osinski, T., Wlo-
dawer, A. & Minor, W. (2010). Unmet chal-
lenges of structural genomics. Curr Opin
Struct Biol. 13
Citri, A. & Yarden, Y. (2006). Egf-erbb signalling:
towards the systems level. Nat Rev Mol Cell
Biol, 7, 505–16. 2
Clackson, T. & Wells, J.A. (1995). A hot spot of
binding energy in a hormone-receptor inter-
face. Science, 267, 383–6. 11, 21
Conn, P.J., Christopoulos, A. & Lindsley, C.W.
(2009). Allosteric modulators of gpcrs: a
novel approach for the treatment of cns dis-
orders. Nat Rev Drug Discov, 8, 41–54. 11
Conte, L.L., Chothia, C. & Janin, J. (1999). The
atomic structure of protein-protein recogni-
tion sites. Journal of Molecular Biology, 285,
2177–98. 10
Cooper, S., Khatib, F., Treuille, A., Barbero, J.,
Lee, J., Beenen, M., Leaver-Fay, A., Baker, D.,
Popović, Z. & Players, F. (2010). Predicting
protein structures with a multiplayer online
game. Nature, 466, 756–60. 18
Cornell, W.D., Cieplak, P., Bayly, C.I., Gould,
I.R., Merz, K.M., Ferguson, D.M., Spellmeyer,
D.C., Fox, T., Caldwell, J.W. & Kollman, P.A.
(1995). A second generation force field for
the simulation of proteins, nucleic acids, and
organic molecules. J. Am. Chem. Soc.. 19
Craik, D.J., Clark, R.J. & Daly, N.L. (2007).
Potential therapeutic applications of the
cyclotides and related cystine knot mini-
proteins. Expert opinion on investigational
drugs, 16, 595–604. 27
Creighton, P. (1984). Structures and molecular
principles. Proteins. 13
Desgrosellier, J.S. & Cheresh, D.A. (2010). Inte-
grins in cancer: biological implications and
therapeutic opportunities. Nat Rev Cancer,
10, 9–22. 30
Desmet, J., Maeyer, M., Hazes, B. & Laster, I.
(1992). The dead-end elimination theorem
and its use in protein side-chain positioning.
Nature. 14
Dill, K.A. (1990). Dominant forces in protein
folding. Biochemistry, 29, 7133–55. 6
Dominguez, C., Boelens, R. & Bonvin, A.M.J.J.
(2003). Haddock: a protein-protein dock-
ing approach based on biochemical or bio-
physical information. Journal of the Ameri-
can Chemical Society, 125, 1731–7. 20
Drews, J. (2000). Drug discovery: a historical
perspective. Science, 287, 1960–4. 8
Duan, Y. & Kollman, P.A. (1998). Pathways to
a protein folding intermediate observed in a
1-microsecond simulation in aqueous solu-
tion. Science, 282, 740–4. 12
Easton, D.M., Nijnik, A., Mayer, M.L. & Hancock,
R.E.W. (2009). Potential of immunomodu-
latory host defense peptides as novel anti-
infectives. Trends in biotechnology, 27, 582–
90. 7
Eglen, R. & Reisine, T. (2010). Human kinome
drug discovery and the emerging importance
of atypical allosteric inhibitors. Expert Opin-
ion on Drug Discovery, 5, 277–290. 11
Eisenmesser, E.Z., Bosco, D.A., Akke, M. & Kern,
D. (2002). Enzyme dynamics during cataly-
sis. Science, 295, 1520–3. 12
Fernandez-Ballester, G., Beltrao, P., Gonzalez,
J.M., Song, Y.H., Wilmanns, M., Valencia, A.
& Serrano, L. (2009). Structure-based pre-
diction of the saccharomyces cerevisiae sh3-
ligand interactions. Journal of Molecular Bi-
ology, 388, 902–16. 29
Fersht, A.R. (2008). From the first protein struc-
tures to our current knowledge of protein
folding: delights and scepticisms. Nat Rev
Mol Cell Biol, 9, 650–654. 12
Gebauer, M. & Skerra, A. (2009). Engineered pro-
tein scaffolds as next-generation antibody
therapeutics. Curr Opin Chem Biol, 13, 245–
55. 28
Gianni, S., Guydosh, N.R., Khan, F., Caldas, T.D.,
Mayor, U., White, G.W.N., DeMarco, M.L.,
Daggett, V. & Fersht, A.R. (2003). Unify-
ing features in protein-folding mechanisms.
Proceedings of the National Academy of Sci-
ences of the United States of America, 100,
13286–91. 7
Giordano, R.J., Cardó-Vila, M., Salameh, A.,
Anobom, C.D., Zeitlin, B.D., Hawke, D.H.,
Valente, A.P., Almeida, F.C.L., Nör, J.E., Sidman,
R.L., Pasqualini, R. & Arap, W. (2010). From
combinatorial peptide selection to drug pro-
totype (i): targeting the vascular endothelial
growth factor receptor pathway. Proceedings
of the National Academy of Sciences of the
United States of America, 107, 5112–7. 25
Grant, A., Lee, D. & Orengo, C. (2004). Progress
towards mapping the universe of protein
folds. Genome Biol, 5, 107. 17
Gray, J.J., Moughon, S., Wang, C., Schueler-
Furman, O., Kuhlman, B., Rohl, C.A. & Baker,
D. (2003). Protein-protein docking with si-
multaneous optimization of rigid-body dis-
placement and side-chain conformations.
Journal of Molecular Biology, 331, 281–99.
20
Grigoryan, G., Reinke, A.W. & Keating, A.E.
(2009). Design of protein-interaction speci-
ficity gives selective bzip-binding peptides.
Nature, 458, 859–64. 19, 23, 27
Haliloglu, T. & Erman, B. (2009). Analysis of cor-
relations between energy and residue fluctu-
ations in native proteins and determination
of specific sites for binding. Phys. Rev. Lett.,
102, 088103. 11
Haliloglu, T., Gul, A. & Erman, B. (2010). Pre-
dicting important residues and interaction
pathways in proteins using gaussian network
model: binding and stability of hla proteins.
PLoS Comput Biol, 6, e1000845. 11
Han, J.D.J., Bertin, N., Hao, T., Goldberg, D.S.,
Berriz, G.F., Zhang, L.V., Dupuy, D., Walhout,
A.J.M., Cusick, M.E., Roth, F.P. & Vidal, M.
(2004). Evidence for dynamically organized
modularity in the yeast protein-protein in-
teraction network. Nature, 430, 88–93. 4
Hancock, R.E.W. & Sahl, H.G. (2006). An-
timicrobial and host-defense peptides as
new anti-infective therapeutic strategies. Nat
Biotechnol, 24, 1551–7. 7
He, Y., Chen, Y., Alexander, P., Bryan, P.N. &
Orban, J. (2008). Nmr structures of two de-
signed proteins with high sequence identity
but different fold and function. Proceedings
of the National Academy of Sciences of the
United States of America, 105, 14412–7. 16
Honeyman, M.C., Brusic, V., Stone, N.L. & Har-
rison, L.C. (1998). Neural network-based
prediction of candidate t-cell epitopes. Nat
Biotechnol, 16, 966–9. 25
Hopkins, A.L. & Groom, C.R. (2002). The drug-
gable genome. Nat Rev Drug Discov, 1, 727–
30. 9
Huang, H., Li, L., Wu, C., Schibli, D., Colwill, K.,
Ma, S., Li, C., Roy, P., Ho, K., Songyang, Z.,
Pawson, T., Gao, Y. & Li, S.S.C. (2008). Defin-
ing the specificity space of the human src
homology 2 domain. Mol Cell Proteomics,
7, 768–84. 2
Janin, J., Henrick, K., Moult, J., Eyck, L.T., Sternberg,
M.J.E., Vajda, S., Vakser, I., Wodak, S.J. &
Critical Assessment of PRedicted Interactions (2003). Capri:
a critical assessment of predicted interactions.
Proteins, 52, 2–9. 20
Jiang, L., Althoff, E.A., Clemente, F.R., Doyle,
L., Rothlisberger, D., Zanghellini, A., Gallaher,
J.L., Betker, J.L., Tanaka, F., Barbas, C.F., Hil-
vert, D., Houk, K.N., Stoddard, B.L. & Baker,
D. (2008). De novo computational design
of retro-aldol enzymes. Science, 319, 1387–
1391. 18
Jochim, A.L. & Arora, P.S. (2009). Assessment
of helical interfaces in protein-protein inter-
actions. Mol Biosyst, 5, 924–6. 27
Jones, S. & Thornton, J.M. (1996). Principles
of protein-protein interactions. Proceedings
of the National Academy of Sciences of the
United States of America, 93, 13–20. 10
Kendrew, J., Bodo, G., Dintzis, H., Parrish, R. &
Wyckoff, H. (1958). A three-dimensional
model of the myoglobin molecule obtained
by x-ray analysis. Nature. 12
King, C.A. & Bradley, P. (2010). Structure-based
prediction of protein-peptide specificity in
rosetta. Proteins, 78, 3437–49. 29
Klepeis, J.L., Lindorff-Larsen, K., Dror, R.O. &
Shaw, D.E. (2009). Long-timescale molecu-
lar dynamics simulations of protein structure
and function. Curr Opin Struct Biol, 19, 120–
7. 19
Korkegian, A., Black, M.E., Baker, D. & Stoddard,
B.L. (2005). Computational thermostabiliza-
tion of an enzyme. Science, 308, 857–60.
19
Kuhlman, B. & Baker, D. (2000). Native pro-
tein sequences are close to optimal for
their structures. Proceedings of the National
Academy of Sciences of the United States of
America, 97, 10383–8. 14
Kühner, S., van Noort, V., Betts, M.J., Leo-Macias,
A., Batisse, C., Rode, M., Yamada,
T., Maier, T., Bader, S., Beltran-Alvarez, P.,
Castaño-Diez, D., Chen, W.H., Devos, D.,
Güell, M., Norambuena, T., Racke, I., Rybin,
V., Schmidt, A., Yus, E., Aebersold, R., Herrmann,
R., Böttcher, B., Frangakis, A.S., Russell,
R.B., Serrano, L., Bork, P. & Gavin, A.C.
(2009). Proteome organization in a genome-
reduced bacterium. Science, 326, 1235–40.
4
Lange, O.F., Lakomek, N.A., Farès, C., Schröder,
G.F., Walter, K.F.A., Becker, S., Meiler, J.,
Grubmüller, H., Griesinger, C. & de Groot,
B.L. (2008). Recognition dynamics up to mi-
croseconds revealed from an rdc-derived
ubiquitin ensemble in solution. Science, 320,
1471–5. 12
Lee, J., Natarajan, M., Nashine, V.C., Socolich,
M., Vo, T., Russ, W.P., Benkovic, S.J. & Ran-
ganathan, R. (2008). Surface sites for engi-
neering allosteric control in proteins. Sci-
ence, 322, 438–42. 11
Lenaerts, T., Ferkinghoff-Borg, J., Stricher, F.,
Serrano, L., Schymkowitz, J.W.H. & Rousseau,
F. (2008). Quantifying information transfer
by protein domains: analysis of the fyn sh2
domain structure. BMC Struct Biol, 8, 43. 11
Lenaerts, T., Schymkowitz, J. & Rousseau, F.
(2009). Protein domains as information pro-
cessing units. Curr Protein Pept Sci, 10, 133–
45. 11
Lensink, M.F., Méndez, R. & Wodak, S.J. (2007).
Docking and scoring protein complexes:
Capri 3rd edition. Proteins, 69, 704–18. 20
Levinthal, C. (1969). How to fold graciously.
Mossbauer spectroscopy in biological sys-
tems. 6
Lin, H.H., Zhang, G.L., Tongchusak, S., Reinherz,
E.L. & Brusic, V. (2008). Evaluation of mhc-ii
peptide binding prediction servers: applica-
tions for vaccine research. BMC Bioinformat-
ics, 9 Suppl 12, S22. 25
Lipinski, C.A., Lombardo, F., Dominy, B.W. &
Feeney, P.J. (2001). Experimental and com-
putational approaches to estimate solubility
and permeability in drug discovery and de-
velopment settings. Adv Drug Deliv Rev, 46,
3–26. 9
London, N., Movshovitz-Attias, D. & Schueler-
Furman, O. (2010). The structural basis
of peptide-protein binding strategies. Struc-
ture, 18, 188–199. 11, 20, 21
Lutz, S. (2010). Beyond directed evolution-
semi-rational protein engineering and de-
sign. Curr Opin Biotechnol. 19
Mandell, D.J. & Kortemme, T. (2009a). Backbone
flexibility in computational protein design.
Curr Opin Biotechnol, 20, 420–8. 13
Mandell, D.J. & Kortemme, T. (2009b).
Computer-aided design of functional protein
interactions. Nat Chem Biol, 5, 797–807. 19
Mandell, D.J., Coutsias, E.A. & Kortemme,
T. (2009). Sub-angstrom accuracy in pro-
tein loop reconstruction by robotics-inspired
conformational sampling. Nat Methods, 6,
551–2. 16
McCammon, J.A., Gelin, B.R. & Karplus, M.
(1977). Dynamics of folded proteins. Na-
ture, 267, 585–90. 18
Moellering, R.E., Cornejo, M., Davis, T.N.,
Bianco, C.D., Aster, J.C., Blacklow, S.C.,
Kung, A.L., Gilliland, D.G., Verdine, G.L. &
Bradner, J.E. (2009). Direct inhibition of the
notch transcription factor complex. Nature,
462, 182–8. 27
Moult, J., Pedersen, J.T., Judson, R. & Fidelis, K.
(1995). A large-scale experiment to assess
protein structure prediction methods. Pro-
teins, 23, ii–v. 14
Murray, C.W. & Blundell, T.L. (2010). Structural
biology in fragment-based drug design. Curr
Opin Struct Biol, 20, 497–507. 23
Naider, F. & Anglister, J. (2009). Peptides in the
treatment of aids. Curr Opin Struct Biol, 19,
473–82. 26
Nam, Y., Sliz, P., Song, L., Aster, J.C. & Blacklow,
S.C. (2006). Structural basis for cooperativity
in recruitment of maml coactivators to notch
transcription complexes. Cell, 124, 973–83.
26
Neduva, V. & Russell, R.B. (2006). Peptides me-
diating interaction networks: new leads at
last. Curr Opin Biotechnol, 17, 465–71. 7, 8
Onuchic, J.N. & Wolynes, P.G. (2004). Theory
of protein folding. Curr Opin Struct Biol, 14,
70–5. 5
Patel, L.N., Zaro, J.L. & Shen, W.C. (2007). Cell
penetrating peptides: intracellular pathways
and pharmaceutical perspectives. Pharm
Res, 24, 1977–92. 8
Pawson, T. & Nash, P. (2003). Assembly of cell
regulatory systems through protein interac-
tion domains. Science, 300, 445–52. 7
Pechon, P., Tartar, A., Dunn, M.K. & Reichert, J.
(2010). Development and trends for peptide
therapeutics. 1–48. 9, 10
Petsalaki, E. & Russell, R.B. (2008). Peptide-
mediated interactions in biological systems:
new discoveries and applications. Curr Opin
Biotechnol, 19, 344–50. 8
Petsalaki, E., Stark, A. & Russell, R.B. (2009).
Accurate prediction of peptide binding sites
on protein surfaces. PLoS Comput Biol, 5,
e1000335. 28
Pryciak, P.M. (2009). Designing new cellular
signaling pathways. Chemistry & Biology, 16,
249–254. 20
Raveh, B., London, N. & Schueler-Furman, O.
(2010). Sub-angstrom modeling of com-
plexes between flexible peptides and glob-
ular proteins. Proteins, 78, 2029–40. 29
Reichmann, D., Rahat, O., Albeck, S., Meged, R.,
Dym, O. & Schreiber, G. (2005). The modular
architecture of protein-protein binding inter-
faces. Proceedings of the National Academy
of Sciences of the United States of America,
102, 57–62. 11
Reina, J., Lacroix, E., Hobson, S.D., Fernandez-
Ballester, G., Rybin, V., Schwab, M.S., Serrano,
L. & Gonzalez, C. (2002). Computer-aided
design of a pdz domain to recognize new
target sequences. Nature Structural Biology,
9, 621–7. 23, 29
Rohl, C.A., Strauss, C.E.M., Misura, K.M.S. &
Baker, D. (2004). Protein structure predic-
tion using rosetta. Meth Enzymol, 383, 66–
93. 17
Romero, P.A. & Arnold, F.H. (2009). Exploring
protein fitness landscapes by directed evo-
lution. Nat Rev Mol Cell Biol, 10, 866–76.
19
Rose, G.D. & Creamer, T.P. (1994). Protein fold-
ing: predicting predicting. Proteins, 19, 1–3.
15
Roy, A., Kucukural, A. & Zhang, Y. (2010). I-
tasser: a unified platform for automated pro-
tein structure and function prediction. Nat
Protoc, 5, 725–38. 16
Russell, R.B. & Gibson, T.J. (2008). A careful dis-
orderliness in the proteome: sites for inter-
action and targets for future therapies. FEBS
Lett, 582, 1271–5. 7
Schafmeister, C.E., Po, J. & Verdine, G.L. (2000).
An all-hydrocarbon cross-linking system for
enhancing the helicity and metabolic sta-
bility of peptides. Journal of the American
Chemical Society, 122, 5891–5892. 26, 27
Schneidman-Duhovny, D., Inbar, Y., Nussinov,
R. & Wolfson, H.J. (2005). Patchdock and
symmdock: servers for rigid and symmetric
docking. Nucleic Acids Research, 33, W363–
7. 20
Schymkowitz, J., Borg, J., Stricher, F., Nys, R.,
Rousseau, F. & Serrano, L. (2005). The foldx
web server: an online force field. Nucleic
Acids Research, 33, W382–8. 14
Shaw, D.E., Maragakis, P., Lindorff-Larsen, K.,
Piana, S., Dror, R.O., Eastwood, M.P., Bank,
J.A., Jumper, J.M., Salmon, J.K., Shan, Y. &
Wriggers, W. (2010). Atomic-level charac-
terization of the structural dynamics of pro-
teins. Science, 330, 341–6. 12, 19
Shen, Y., Lange, O., Delaglio, F., Rossi, P.,
Aramini, J.M., Liu, G., Eletsky, A., Wu, Y.,
Singarapu, K.K., Lemak, A., Ignatchenko, A.,
Arrowsmith, C.H., Szyperski, T., Montelione,
G.T., Baker, D. & Bax, A. (2008). Consis-
tent blind protein structure generation from
nmr chemical shift data. Proceedings of the
National Academy of Sciences of the United
States of America, 105, 4685–90. 18
Sheridan, C. (2010). Roche backs aileron’s sta-
pled peptides. Nat Biotechnol, 28, 992–3.
27
Siegel, J.B., Zanghellini, A., Lovick, H.M., Kiss, G.,
Lambert, A.R., Clair, J.L.S., Gallaher, J.L., Hil-
vert, D., Gelb, M.H., Stoddard, B.L., Houk,
K.N., Michael, F.E. & Baker, D. (2010). Com-
putational design of an enzyme catalyst for
a stereoselective bimolecular diels-alder re-
action. Science, 329, 309–13. 18
Smith, C.A. & Kortemme, T. (2010). Structure-
based prediction of the peptide sequence
space recognized by natural and synthetic
pdz domains. Journal of Molecular Biology,
402, 460–74. 29
Smith, R.D., Hu, L., Falkner, J.A., Benson, M.L.,
Nerothin, J.P. & Carlson, H.A. (2006). Explor-
ing protein-ligand recognition with binding
moad. J Mol Graph Model, 24, 414–25. 10
Stein, A. & Aloy, P. (2010). Novel peptide-
mediated interactions derived from high-
resolution 3-dimensional structures. PLoS
Comput Biol, 6, e1000789. 21
Stein, A., Céol, A. & Aloy, P. (2010). 3did: identification
and classification of domain-based
interactions of known three-dimensional
structure. Nucleic Acids Research. 21
Stewart, M.L., Fire, E., Keating, A.E. & Walensky,
L.D. (2010). The mcl-1 bh3 helix is an exclu-
sive mcl-1 inhibitor and apoptosis sensitizer.
Nat Chem Biol, 6, 595–601. 26, 27
Stiffler, M.A., Chen, J.R., Grantcharova, V.P.,
Lei, Y., Fuchs, D., Allen, J.E., Zaslavskaia, L.A.
& MacBeath, G. (2007). Pdz domain bind-
ing selectivity is optimized across the mouse
proteome. Science, 317, 364–9. 22, 23, 25
Tan, M.L., Choong, P.F.M. & Dass, C.R. (2010).
Recent developments in liposomes, mi-
croparticles and nanoparticles for protein
and peptide drug delivery. Peptides, 31,
184–93. 8
Tanrikulu, Y. & Schneider, G. (2008). Pseu-
doreceptor models in drug design: bridging
ligand- and receptor-based virtual screening.
Nat Rev Drug Discov, 7, 667–77. 24
Taylor, W.R. (1986). The classification of amino
acid conservation. J Theor Biol, 119, 205–18.
4
Thorsen, T.S., Madsen, K.L., Rebola, N., Rathje,
M., Anggono, V., Bach, A., Moreira, I.S.,
Stuhr-Hansen, N., Dyhring, T., Peters, D.,
Beuming, T., Huganir, R., Weinstein, H., Mulle,
C., Strømgaard, K., Rønn, L.C.B. & Gether, U.
(2010). Identification of a small-molecule in-
hibitor of the pick1 pdz domain that inhibits
hippocampal ltp and ltd. Proceedings of the
National Academy of Sciences of the United
States of America, 107, 413–8. 11
Timmerman, P., Beld, J., Puijk, W.C. & Meloen,
R.H. (2005). Rapid and quantitative cycliza-
tion of multiple peptide loops onto synthetic
scaffolds for structural mimicry of protein
surfaces. Chembiochem, 6, 821–4. 8
Tonikian, R., Zhang, Y., Sazinsky, S.L., Currell, B.,
Yeh, J.H., Reva, B., Held, H.A., Appleton, B.A.,
Evangelista, M., Wu, Y., Xin, X., Chan, A.C.,
Seshagiri, S., Lasky, L.A., Sander, C., Boone,
C., Bader, G.D. & Sidhu, S.S. (2008). A speci-
ficity map for the pdz domain family. PLoS
Biol, 6, e239. 25
Unal, E.B., Gursoy, A. & Erman, B. (2010). Vital:
Viterbi algorithm for de novo peptide design.
PLoS ONE, 5, e10926. 28
van der Sloot, A.M., Tur, V., Szegezdi, E., Mul-
lally, M.M., Cool, R.H., Samali, A., Serrano, L.
& Quax, W.J. (2006). Designed tumor necro-
sis factor-related apoptosis-inducing ligand
variants initiating apoptosis exclusively via
the dr5 receptor. Proceedings of the National
Academy of Sciences of the United States of
America, 103, 8634–9. 23, 29
van der Sloot, A.M., Kiel, C., Serrano, L. &
Stricher, F. (2009). Protein design in biolog-
ical networks: from manipulating the input
to modifying the output. Protein Eng Des Sel,
22, 537–42. 19
Vanhee, P., Stricher, F., Baeten, L., Verschueren,
E., Lenaerts, T., Serrano, L., Rousseau, F. &
Schymkowitz, J. (2009). Protein-peptide in-
teractions adopt the same structural motifs
as monomeric protein folds. Structure, 17,
1128–1136. 20, 21
Vanhee, P., Reumers, J., Stricher, F., Baeten, L.,
Serrano, L., Schymkowitz, J. & Rousseau, F.
(2010). Pepx: a structural database of non-
redundant protein-peptide complexes. Nu-
cleic Acids Research, 38, D545–51. 21, 25
Vanhee, P., Verschueren, E., Baeten, L., Stricher,
F., Serrano, L., Rousseau, F. & Schymkowitz, J.
(2011). Brix: a database of protein building
blocks for structural analysis, modeling and
design. Nucleic Acids Research, 39, D435–
42. 17, 32
Walensky, L.D., Kung, A.L., Escher, I., Malia, T.J.,
Barbuto, S., Wright, R.D., Wagner, G., Ver-
dine, G.L. & Korsmeyer, S.J. (2004). Activa-
tion of apoptosis in vivo by a hydrocarbon-
stapled bh3 helix. Science, 305, 1466–70.
8
Walsh, G. (2010). Biopharmaceutical bench-
marks 2010. Nat Biotechnol, 28, 917–24.
8
Walters, R.F.S. & DeGrado, W.F. (2006). Helix-
packing motifs in membrane proteins. Pro-
ceedings of the National Academy of Sci-
ences of the United States of America, 103,
13658–63. 30
Watt, P.M. (2006). Screening for peptide drugs
from the natural repertoire of biodiverse pro-
tein folds. Nature Biotechnology, 24, 177–
83. 26
Wells, J.A. & McClendon, C.L. (2007). Reach-
ing for high-hanging fruit in drug discovery
at protein-protein interfaces. Nature, 450,
1001–9. 8, 11
Weng, A.P., Nam, Y., Wolfe, M.S., Pear, W.S.,
Griffin, J.D., Blacklow, S.C. & Aster, J.C.
(2003). Growth suppression of pre-t acute
lymphoblastic leukemia cells by inhibition of
notch signaling. Mol Cell Biol, 23, 655–64.
26
Wu, S. & Zhang, Y. (2007). Lomets: a lo-
cal meta-threading-server for protein struc-
ture prediction. Nucleic Acids Research, 35,
3375–82. 16
Yin, H., Slusky, J.S., Berger, B.W., Walters, R.S.,
Vilaire, G., Litvinov, R.I., Lear, J.D., Caputo,
G.A., Bennett, J.S. & Degrado, W.F. (2007).
Computational design of peptides that tar-
get transmembrane helices. Science, 315,
1817–1822. 30, 31
Yun, C.H., Boggon, T.J., Li, Y., Woo, M.S.,
Greulich, H., Meyerson, M. & Eck, M.J.
(2007). Structures of lung cancer-derived
egfr mutants and inhibitor complexes:
mechanism of activation and insights into
differential inhibitor sensitivity. Cancer Cell,
11, 217–27. 2
Zhang, Y. (2009). Protein structure prediction:
when is it useful? Curr Opin Struct Biol, 19,
145–55. 14
2 Fragmenting protein space
This chapter is based on
BriX: a database of protein building blocks for structural analysis, modeling and design.
Peter Vanhee*, Erik Verschueren*, Lies Baeten, Francois Stricher, Luis Serrano, Frederic Rousseau
and Joost Schymkowitz [1]. Nucleic Acids Research [2], January 2011.
High-resolution structures of proteins remain the most valuable source for
understanding their function in the cell and provide leads for drug design. Because
too few protein structures are available to tackle complex problems such as
modeling backbone moves or docking, alternative approaches
using small, recurrent protein fragments have been proposed. Here we present
two databases that provide a vast resource for implementing such fragment-based
strategies. The BriX database contains over 7000 non-homologous
proteins from the Astral collection, segmented into fragments of 4 to 14 residues
and clustered according to structural similarity, amounting to 2
million fragments per length. To overcome the lack of loops classified in BriX, we
constructed the Loop BriX database of non-regular structure elements, clustered
according to end-to-end distance between the regular residues flanking the loop.
[1] Peter Vanhee and Erik Verschueren are joint first authors.
[2] This paper was recently chosen by the Editors of Nucleic Acids Research to appear
on a selected Featured Articles page (http://www.oxfordjournals.org/our_journals/nar/featured_articles.html),
representing the top 5% of NAR papers in terms of originality, significance
and scientific excellence.
Both databases are available online (http://brix.crg.es) and can be accessed
through a user-friendly web-interface. For high-throughput queries a web-based
API is provided, as well as full database downloads. In addition, two exciting
applications are provided as online services: (1) user-submitted structures can
be covered on the fly with BriX classes, representing putative structural variation
throughout the protein and (2) gaps or low-confidence regions in these structures
can be bridged with matching fragments.
Both databases provide the source for fragment-based strategies developed in
this thesis, such as the reconstruction of protein loops (Chapter 3) or the modeling
of protein-peptide complexes (Chapters 5 and 6).
2.1 Introduction
Proteins are by far the most versatile and complex molecules in the cell. It is
commonly accepted that the function of a protein directly relates to its three-dimensional
(3D) structure. Yet detailed structural information is available for just over a quarter
of all single-domain protein families (Levitt, 2009), a number that can be
extended through threading and homology modeling (Kopp et al., 2007). Due to
experimental constraints of X-ray crystallography and NMR, the rate at which new
structures are determined is considerably lower than the rate at which new sequence
data are generated by next-generation sequencing methods (Baker &
Sali, 2001; Arnold et al., 2009).
In order to understand the structural protein universe, proteins have been
classified by fold architecture and evolutionary relationships in databases
such as SCOP (Murzin et al., 1995) or CATH (Orengo et al., 1997). However,
proteins often perform their functions using just a limited number of residues,
making it worthwhile to find structural similarities at the level of protein fragments.
In search of a ‘parts list’ of proteins – with α-helices and β-sheets as prime examples
of common parts – fragment libraries have been constructed based on the similarity
of the polypeptide backbone (Fitzkee et al., 2005; Budowski-Tal et al., 2010).
These protein fragment libraries have been widely used for a range of ap-
plications such as structural comparison of protein folds through a simplified
representation with fragments (Le et al., 2009), homology modeling at the level
of fragments (Ananthalakshmi et al., 2005; Berkholz et al., 2010), investigating
sequence-to-structure relationships (Samson & Levitt, 2009), approximating ter-
tiary structure of proteins using fragments (Bystroff & Shao, 2002; Kolodny et al.,
2002; Kolodny & Levitt, 2003; Kifer et al., 2008), loop prediction (Bornot et al.,
2009; Choi & Deane, 2010; Fernandez-Fuentes et al., 2006) or even novel fold
prediction (Simons et al., 1997; Qian et al., 2007).
Unfortunately, many of the available fragment libraries are either limited in
fragment classes or ‘states’ (Budowski-Tal et al., 2010; Pandini et al., 2010) or
not publicly accessible (Bystroff & Shao, 2002). Moreover, existing databases are
often biased towards short stretches of residues, typically 3 to 9 residues long, or
contain an extensive parts list but are not clustered based on backbone similarity,
thereby complicating comparative studies (Fitzkee et al., 2005). Although limited
alphabets have been shown to successfully reconstruct existing proteins to global
fits of 0.5 Å root mean square distance (RMSD) or serve successfully as templates
to efficiently sample the protein space, they are too limited to describe protein
structure at sub-angstrom resolution, especially in the case of loops (Baeten et al.,
2008). To overcome these limitations we have previously constructed BriX, a
database of protein fragments from 4 to 14 residues, hierarchically clustered on
backbone similarities (Baeten et al., 2008).
Here we describe how we updated the BriX database, which previously contained
fragments from 1259 structures, to incorporate over 7000 structures from
the ASTRAL40 set (a curated set of proteins with less than 40% sequence identity)
(Chandonia et al., 2004). Furthermore, we enriched the database with all loops
from over 14,000 structures in the ASTRAL95 set (sharing less than 95% sequence
identity) and clustered these loops separately. We also provide a user-friendly
web-interface to explore both BriX and Loop BriX (http://brix.crg.es).
Finally, to illustrate the potential of our database we allow users to upload their own
PDB structure and ‘cover’ parts or ‘bridge’ gaps with BriX or Loop BriX fragments.
The new release of BriX is expected to be helpful to the scientific community by
facilitating the use of fragments in structural biology, protein modeling and design.
2.2 Contents of the BriX database
2.2.1 Update of the BriX database
The first version of the BriX database (Baeten et al., 2008) was constructed from
the WHAT IF set of 1259 non-redundant proteins (Vriend, 1990). Using a sliding-window
technique, we segmented all proteins into fragments 4 to 14 residues
long and clustered them by backbone similarity with a hierarchical clustering
algorithm. The similarity between two fragments is defined as the average root
mean square distance (RMSD) between the backbone atoms (N, Cα, C, O) of each
corresponding residue.
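The fragmentation and distance measure described above can be sketched as follows. This is a minimal illustration, not the BriX implementation: the function names `sliding_fragments` and `backbone_rmsd` are hypothetical, the RMSD is taken over all backbone atoms of the fragment (one plausible reading of the definition above), and the two fragments are assumed to be already superposed, since the superposition step is not specified here.

```python
import math


def sliding_fragments(residues, min_len=4, max_len=14):
    """Yield (start, length, fragment) for every window of min_len..max_len residues."""
    for length in range(min_len, max_len + 1):
        for start in range(len(residues) - length + 1):
            yield start, length, residues[start:start + length]


def backbone_rmsd(frag_a, frag_b):
    """RMSD over corresponding backbone atoms of two equal-length fragments.

    Each fragment is a list of residues; each residue is a list of (x, y, z)
    tuples for its N, CA, C and O atoms. Assumption: the fragments have
    already been superposed, so no fitting is done here.
    """
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same number of residues")
    squared_sum = 0.0
    n_atoms = 0
    for res_a, res_b in zip(frag_a, frag_b):
        for atom_a, atom_b in zip(res_a, res_b):
            squared_sum += sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
            n_atoms += 1
    return math.sqrt(squared_sum / n_atoms)
```

A chain shorter than the window simply yields no fragments of that length, so the same call works for proteins of any size.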
[Figure 2.1: bar chart. Y axis: % per SCOP class (0 to 30). X axis: All alpha proteins; All beta proteins; Alpha and beta proteins (a/b); Alpha and beta proteins (a+b); Multi-domain proteins; Membrane and cell surface proteins; Small proteins. Series: Structures in Astral40; Families in SCOP.]
Figure 2.1: SCOP representation of ASTRAL40. The distribution of SCOP classes in
ASTRAL40 (the dataset used to construct BriX) is similar to the SCOP distribution.
The updated version of the BriX database is enriched with the much larger
ASTRAL40 set of 7290 proteins sharing less than 40% sequence identity.
The ASTRAL40 set is representative of the variety present in structural
databases such as SCOP (Figure 2.1). Similar to the procedure used by Baeten
et al. (2008), we fragmented all proteins and assigned each fragment to the closest
class represented by its centroid. As it turns out, we were able to fit most of the
ASTRAL40 fragments into existing BriX classes, showing the completeness of our
structural alphabet in the updated version of BriX, while increasing its content 7-fold
(Figure 2.2).
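The assignment step described above, in which each new fragment is placed in the class whose centroid is closest, can be sketched as follows. This is a hedged illustration only: `assign_to_class` and `toy_distance` are hypothetical names, the threshold check for leaving a fragment unclassified is an assumption based on the class thresholds discussed in this chapter, and a plain Euclidean distance stands in for the backbone RMSD.

```python
import math


def toy_distance(a, b):
    """Euclidean distance between two coordinate tuples (stand-in for backbone RMSD)."""
    return math.dist(a, b)


def assign_to_class(fragment, centroids, threshold, distance=toy_distance):
    """Return the id of the closest centroid, or None if it lies beyond the threshold.

    centroids maps a class id to the representative (centroid) of that class.
    """
    best_id, best_d = None, float("inf")
    for class_id, centroid in centroids.items():
        d = distance(fragment, centroid)
        if d < best_d:
            best_id, best_d = class_id, d
    return best_id if best_d <= threshold else None
```

Fragments for which the function returns `None` correspond to the unclassified fragments counted in the statistics of this chapter.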
[Figure 2.2, panel A: number of BriX classes per fragment length]
Length   4     5     6     7     8     9     10    11    12    13    14
Classes  1926  2845  2694  3613  3061  3398  3207  2850  2374  2030  1589
[Figure 2.2, panel B: classified fragments (0 to 2,000,000) per fragment length (4 to 14) for BriX v.1 and BriX v.2]
[Figure 2.2, panel C: RMSD thresholds (Å) of 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0]
Figure 2.2: The BriX database. (A) Number of BriX classes for lowest class thresholds
per length. A peak in the number of classes can be observed at fragment length 7
and class threshold 0.5 Å. (B) Increase in the number of classified fragments from the
first version of BriX (Baeten et al., 2008) to the current version. (C) BriX classes with
class thresholds varying from 0.5 Å to 1.0 Å RMSD for fragments of length 7. The
class threshold indicates the compactness and structural homogeneity of the class,
with lower thresholds causing classes to be more compact than higher thresholds.
2.2.2 BriX Statistics
As expected, the number of classes varies with the length of the clustered fragments:
even for a short fragment length (n = 4) and a strict threshold (≤ 0.4 Å RMSD),
a large number of classes (about 2000) was observed. The largest number of structural
classes is detected when applying a clustering threshold of 0.5 Å to fragments of
length 7: 3613 classes can be distinguished (Figure 2.2A). For longer fragments the number
of classes decreases to about 1500 classes at length 14. The
number of classes per length also decreases with increasing classification thresholds
(Figure 2.3), because more dissimilar fragments are grouped into a single class. In addition,
the percentage of classified fragments decreases steadily with increasing fragment
length. To compensate for this, the covering threshold for a specific
length can be raised, which improves the classification rates (Figure 2.4).
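The effect of the threshold on the number of classes can be illustrated with a small threshold-bounded agglomerative clustering. The sketch below is an assumption-laden toy: the linkage criterion of the hierarchical clustering is not specified in this chapter, so complete linkage is used here as one common choice, and `complete_linkage_classes` is a hypothetical name. Integers with an absolute difference stand in for fragments and their RMSD.

```python
def complete_linkage_classes(items, threshold, distance):
    """Merge the two closest clusters until the closest pair exceeds the threshold.

    The distance between clusters is the largest pairwise distance between
    their members (complete linkage), so every class stays within the threshold.
    """
    clusters = [[item] for item in items]

    def linkage(a, b):
        return max(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def abs_diff(x, y):
    """Toy distance between two numbers."""
    return abs(x - y)


toy_items = [0, 1, 2, 10, 11, 30]
# Number of classes found at three increasingly permissive thresholds.
class_counts = [len(complete_linkage_classes(toy_items, t, abs_diff)) for t in (1.5, 5, 15)]
```

On this toy input the class count drops as the threshold grows, mirroring the trend in Figure 2.3. The pairwise search is cubic in the number of items and is meant for illustration only.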
[Figure 2.3: line plot. X axis: RMSD threshold (Angstrom), 0.4 to 1.4. Y axis: number of BriX classes, 0 to 4000. One series per fragment length, L4 to L14.]
Figure 2.3: Number of BriX classes versus different classification thresholds. The
number of classes in BriX decreases with increasing classification thresholds. Different
lengths (from 4 to 14 residues) are shown in different colors.
Furthermore, we analyzed the secondary structure content in classes derived
[Figure 2.4: line plot. X axis: 0.4 to 1.4. Y axis: 0 to 100. One series per fragment length, L4 to L14.]
Figure 2.4: Percentage of classified fragments versus different classification
thresholds. The percentage of classified fragments decreases steadily with increasing
fragment length. To compensate for this, increasing the covering thresholds (X axis)
for a specific length improves the classification rates.
for different fragment lengths and thresholds. Not surprisingly, α-helical and β-strand
fragments remain well represented in structural classes of longer fragments
(Figure 2.5) while loop fragments are under-represented in classes of all lengths,
indicating that they are harder to classify. Indeed, the majority of unclassified
fragments are composed of loop structures (Figure 2.6). This indicates that a
separate classification scheme, more suited to the particularities of loop structures,
could significantly enrich the BriX database.
2.2.3 Creation of the Loop BriX database
The Loop BriX database was built using 14,525 protein structures derived from the
ASTRAL95 set containing protein structures sharing less than 95 percent sequence
[Figure 2.5: X axis: % of secondary structure in BriX class (0 to 95). Y axis: number of fragments classified (0 to 1,800,000). Series: helix, strand, loop and turn for lengths L5, L7, L10 and L14.]
Figure 2.5: Secondary structure content for classified fragments in BriX. Classified
fragments (Y axis) are shown by secondary structure content (X axis) for length 5, 7,
10 and 14. α-helical and β-strand fragments are well represented in BriX classes, even
for longer fragment lengths. In contrast, turn and loop fragments are generally less
well classified.
identity (Chandonia et al., 2004). A loop fragment starts and ends with a single
residue belonging to a regular secondary structure such as a helix or a strand
and contains any number of irregular residues in between. As shown by different
studies, the structural loop space can be partitioned by four combinations of
flanking regular elements: α-α, α-β, β-α and β-β (Espadaler et al., 2004; Donate
et al., 1996; Burke & Deane, 2001) (Table 2.1).
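The loop definition above can be made concrete with a short sketch that scans a per-residue secondary structure string and reports each loop together with its flanking category. The function name `extract_loops` is hypothetical, and treating only `H` (helix) and `E` (strand) as regular states is a simplifying assumption about the secondary structure alphabet; every other character is treated as irregular.

```python
# Assumption: only 'H' (helix) and 'E' (strand) count as regular structure.
REGULAR = {"H": "α", "E": "β"}


def extract_loops(ss):
    """Return (start, end, category) for each loop in a secondary structure string.

    A loop starts and ends with a single regular residue (indices start and end,
    inclusive) and contains at least one irregular residue in between.
    """
    loops = []
    i, n = 0, len(ss)
    while i < n - 1:
        if ss[i] in REGULAR and ss[i + 1] not in REGULAR:
            j = i + 1
            while j < n and ss[j] not in REGULAR:
                j += 1
            if j < n:  # the loop is closed by a regular residue
                loops.append((i, j, f"{REGULAR[ss[i]]}-{REGULAR[ss[j]]}"))
            i = j
        else:
            i += 1
    return loops
```

Irregular stretches at the chain termini have no regular residue on one side and are therefore skipped.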
We have introduced a novel way to compare the similarity between two loop
fragments based on (1) the distance between their end points (‘end-to-end
distance’) rather than the overall structure similarity used in BriX and (2) the
superposition of two regular anchor residues at each side of the loop with an
RMSD < 1 Å. Firstly, loops in each of the four loop classes described above were
clustered on end-to-end distance using the same hierarchical clustering algorithm.
[Figure 2.6: bar chart. X axis: Helix, Strand, Turn, Loop. Y axis: % of DSSP in unclassified fragments (0 to 80). Series: L4, L7, L10, L13.]
Figure 2.6: Secondary structure content for unclassified fragments in BriX. Sec-
ondary structure distribution of unclassified fragments in BriX (ASTRAL40) for fragment
lengths 4, 7, 10 and 13. Unclassified fragments contain mainly loop elements, show-
ing the need for the separate classification scheme of loop elements employed in Loop
BriX.
These ‘super classes’ contain loops of varying length and thus show a considerable
amount of variation in the part between the end points (Figure 2.7A). Secondly,
super classes were clustered into ‘sub classes’, grouping loops of the same length
and similar structure.
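The two-level scheme can be sketched as follows. This is a simplified stand-in, not the Loop BriX procedure: the hierarchical clustering on end-to-end distance is replaced by a simple grouping of sorted distances with a gap tolerance, the sub-level only splits by loop length and omits the structural superposition check, and all names (`end_to_end`, `superclasses`, `split_by_length`, the `coords` key) are hypothetical.

```python
import math
from collections import defaultdict


def end_to_end(loop):
    """Distance between the first and last anchor coordinates of a loop."""
    return math.dist(loop["coords"][0], loop["coords"][-1])


def superclasses(loops, tolerance):
    """Group loops whose sorted end-to-end distances lie within tolerance of the previous one."""
    groups = []
    for loop in sorted(loops, key=end_to_end):
        if groups and end_to_end(loop) - end_to_end(groups[-1][-1]) <= tolerance:
            groups[-1].append(loop)
        else:
            groups.append([loop])
    return groups


def split_by_length(group):
    """Split one superclass into candidate subclasses of equal loop length."""
    by_length = defaultdict(list)
    for loop in group:
        by_length[len(loop["coords"])].append(loop)
    return dict(by_length)
```

In a fuller version, each length group would be clustered further on backbone similarity after superposing the anchor residues.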
α-α (%)   β-β (%)   α-β (%)   β-α (%)   Ref.
20        33        26        20        Donate et al. (1996)
20        28        24        28        SLoop (Burke & Deane, 2001)
22        31        24        23        ArchDB (Espadaler et al., 2004)
19        35        21        25        Loop BriX
Table 2.1: Distribution of loops across the four main loop categories for four
different loop databases.
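[Figure 2.7, panel A: superclass 27217 with subclasses 2219, 2220 and 2221; β-strand anchor 1 and β-strand anchor 2; end-to-end distance 11.7 Å]
[Figure 2.7, panel B: bar chart. X axis: number of fragments per class, in bins. Y axis: number of classes. Series: subclass, superclass. Bar values in extraction order: 4246, 2732, 1076, 177, 100, 111, 5; 3907, 2017, 645, 99, 73, 77, 2]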
Figure 2.7: The Loop BriX database. (A) Example of a superclass containing 3
subclasses. The superclass contains fragments with an end-to-end distance of around 11.78
Å and two β-strand anchor residues. At the subclass level, fragments with
similar length and backbone are grouped (length 7 for subclass 2219 and length 13 for
subclasses 2220 and 2221, superposition threshold of 1 Å). (B) Number of superclasses
(blue) and subclasses (red) per class size, distributed in bins. In general, classes from
Loop BriX are less populated than classes from BriX.
2.2.4 Loop BriX Statistics
In contrast to the quite limited conformational space of regular structure elements,
loop structures are much more variable. In Loop BriX, loop fragments are between
4 and 117 irregular residues long and classes are generally less populated (Figure
2.7B). Intriguingly, we observe a clear distinction between classes of loops connecting
different secondary structure elements: the number of super-classes having more
than 100 fragments is much lower for α-α (8) than for β-β classes (20), showing less
regularity for α-α classes than for β-β classes (Table 2.2). This is explained by the
fact that α-helices, being cylindrical, show much more variation at their end points,
while β-strands have more regular end-to-end distances.
# Loops    α-α classes     β-β classes     α-β classes     β-α classes     all classes  
            super    sub    super    sub    super    sub    super    sub    super    sub
1            8714  12290     9101  17513     7747  11896     7747  12136    33309  53835
5             395    233      621    438      400    257      507    308     1923   1236
10            135     77      231    171      149    103      190    100      705    451
20             58     30       92     70       79     54       64     39      293    193
50             22      9       34     33       37     23       26     15      119     80
100             8      5       20     20       18     14       13      4       59     43
Table 2.2: Classification of loops within Loop BriX. Number of super- and subclasses
as a function of their minimum class content (# Loops) in Loop BriX.
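Table 2.2 counts, for each minimum class content, how many classes hold at least that many loops. The sketch below shows how such a cumulative count can be computed from a list of class sizes (`classes_with_min_size` is a hypothetical helper, not part of the database), and then checks that the "all classes" columns of the table equal the sum of the four loop categories, using the values printed in the table.

```python
def classes_with_min_size(class_sizes, minimums=(1, 5, 10, 20, 50, 100)):
    """Map each minimum class content to the number of classes reaching it."""
    return {m: sum(1 for size in class_sizes if size >= m) for m in minimums}


# Rows of Table 2.2: minimum content -> (super, sub) pairs for the
# α-α, β-β, α-β and β-α categories, followed by the 'all classes' pair.
TABLE_2_2 = {
    1: [(8714, 12290), (9101, 17513), (7747, 11896), (7747, 12136), (33309, 53835)],
    5: [(395, 233), (621, 438), (400, 257), (507, 308), (1923, 1236)],
    10: [(135, 77), (231, 171), (149, 103), (190, 100), (705, 451)],
    20: [(58, 30), (92, 70), (79, 54), (64, 39), (293, 193)],
    50: [(22, 9), (34, 33), (37, 23), (26, 15), (119, 80)],
    100: [(8, 5), (20, 20), (18, 14), (13, 4), (59, 43)],
}


def totals_consistent(table):
    """Check that the last pair of every row is the sum of the four category pairs."""
    for pairs in table.values():
        *categories, total = pairs
        summed = (sum(p[0] for p in categories), sum(p[1] for p in categories))
        if summed != total:
            return False
    return True
```

Because the counts are cumulative, every column of the table can only decrease from top to bottom.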
We then examined the results of our loop classification scheme, looking at the
percentage of loops we were able to classify. At the super class level our approach
classified almost 90% of 6-residue loops and 45% of 14-residue loops while the
success of sub-clustering in equally sized groups decreased more rapidly. We
found that the sub-classification was successful for fragments up to length 16,
after which no regular loop patterns could be identified.
2.2.5 Applications of the BriX database
The first version of the BriX database already inspired many applications in the
fields of structural biology and protein design. Baeten et al. (2008) showed that
proteins from the widely used Park & Levitt set could be reconstructed using BriX
fragments to a global 0.48 Å RMSD accuracy, improving existing results using more
limited structural alphabets.
Demon et al. (2009) used BriX database fragments in combination with the
FoldX protein design algorithm (Schymkowitz et al., 2005) to construct a model of
murine caspase 3 and 7 in complex with substrate peptides. These models were
subsequently used to explain experimentally observed differences in substrate
specificity between caspase 3 and 7.
In other recent work, we have shown that the structural space of protein-peptide
interactions can be approximated using fragments from the BriX database (Vanhee
et al., 2009) (Chapter 5). The interfaces of over 300 protein-peptide complexes
from the PepX database (Vanhee et al., 2010) (Chapter 4) were reconstructed to
within 1 Å RMSD, using observed fragment interactions to reconstruct the binding
modes. The sheer size of the database allowed us to extract structural knowledge
on protein-peptide interactions.
Until now, all of these applications have relied on internal access to the database.
With the updated version of the BriX and Loop BriX databases, the website and the
addition of the covering and bridging algorithms (Section 2.3.3), we open up the
BriX database to the scientific community at large.
2.3 Database access
2.3.1 Database availability
The BriX and Loop BriX databases are accessible through a web portal at
http://brix.crg.es. The portal is built on the open-source Drupal Content Management
System for full flexibility (http://drupal.org). The entire database with annota-
tions is available for download in the SQL format, describing the relations between
classes and fragments. As an additional service for automated high-throughput
querying, all information contained within the BriX and Loop BriX database can be
downloaded as CSV (comma-separated values) lists. For example, requesting the
URL http://brix.crg.es/classes?Length=10&Structure=HHHHHHHHHH returns
a CSV file containing BriX classes of length 10 with an α-helical structure. Finally,
BriX will be updated automatically when new versions of the ASTRAL sets
become available.
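As an illustration of the automated querying described above, the following minimal sketch builds the query URL from the example and parses a CSV response. Only the endpoint and the Length and Structure parameters come from the text; the helper names and the column names in the sample response are assumptions.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "http://brix.crg.es/classes"  # endpoint quoted in the text


def build_class_query(length, structure):
    """Build a query URL using the Length and Structure parameters from the example."""
    return BASE_URL + "?" + urlencode({"Length": length, "Structure": structure})


def parse_class_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


url = build_class_query(10, "H" * 10)

# Hypothetical response body: the real column names are not given in the text.
sample = "class_id,length,size\n17,10,2045\n42,10,311\n"
rows = parse_class_csv(sample)
print(url)
print(rows[0]["class_id"], rows[0]["size"])
```

In a live setting the response text could be obtained with `urllib.request.urlopen(url)` and passed to `parse_class_csv`, provided the portal is reachable.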
2.3.2 User Interface
A user-friendly browsing interface is available on the website http://brix.crg.es
(Figure 2.8A). BriX contains two levels: the class level and the fragment level
(Figure 2.8C). Classes can be sorted and filtered on (1) class size, (2) fragment
length (from 4 to 14 residues), (3) clustering threshold describing the compactness
of the classes, (4) minimum and maximum percentage of helix, loop, sheet and
turn content and (5) regular expressions of the amino acid sequence and secondary
structure as determined by DSSP (Kabsch & Sander, 1983) (Figure 2.8B). For
each BriX class, we generated images of the superposed fragments using Chimera
(Pettersen et al., 2004) and logos of the sequence and structure distributions using
Weblogo (Crooks et al., 2004). Subsequently, the fragments of each class can be
filtered on PDB ID (Berman et al., 2000), sequence or secondary structure.
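The regular-expression filter on sequence and secondary structure can be pictured with a short sketch. The field names, the example fragments and the choice of full-string matching are assumptions made for illustration; the text does not specify how the website applies the expressions.

```python
import re


def filter_fragments(fragments, sequence_pattern=None, structure_pattern=None):
    """Keep fragments whose sequence and DSSP string fully match the given patterns."""
    kept = []
    for frag in fragments:
        if sequence_pattern and not re.fullmatch(sequence_pattern, frag["sequence"]):
            continue
        if structure_pattern and not re.fullmatch(structure_pattern, frag["dssp"]):
            continue
        kept.append(frag)
    return kept


# Hypothetical fragments; identifiers and field names are invented for this sketch.
fragments = [
    {"pdb": "1abc", "sequence": "AEELLKK", "dssp": "HHHHHHH"},
    {"pdb": "2xyz", "sequence": "GSGDGTV", "dssp": "TTSSEEE"},
    {"pdb": "3def", "sequence": "PEELAKR", "dssp": "HHHHHHT"},
]

# At least six helical residues, optionally followed by one residue of any state.
helical = filter_fragments(fragments, structure_pattern=r"H{6,}.?")
print([frag["pdb"] for frag in helical])
```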
Loop BriX contains three levels: (1) the superclass level with fragments of
similar end-to-end distance and matching end residues, (2) the subclass level with
fragments of similar backbone patterns and length and finally, (3) the fragment
level (Figure 2.8D). The Loop BriX superclasses and subclasses can be queried
with the same parameters as the BriX database plus end-to-end distance.
2.3.3 Covering or bridging of protein structures
To explore the vast size of the database we provide two algorithms to query BriX and
Loop BriX with a user-submitted structure: ‘covering’ and ‘bridging’¹. The covering
algorithm covers backbone coordinates of the input structure with similar BriX
classes. The bridging algorithm spans the distance between any pair of anchoring
residues regardless of the backbone coordinates in between them. This is particularly
useful for deriving plausible loop conformations where backbone coordinates are
absent or poorly defined.
In Figure 2.9A, we show the application of the covering algorithm to a PDZ
domain (PDB ID 2WL7), covering a part of the β-strand with classes from the
BriX database. Residues 112 to 116 are selected for covering. The algorithm
¹ For an intuitive explanation of the algorithms, two videos are provided to demonstrate the
covering and bridging of an example protein. They can be accessed online at
http://brix.crg.es/content/help.
Figure 2.8: The BriX website (http://brix.crg.es). (A) An overview of the class level
with secondary structure content and sequence and structure logos per entry. (B) A
panel on the class level where the user can filter on length, threshold, sequence and
secondary structure content. Similar panels are implemented at every level of the
class hierarchy. (C) BriX contains two levels: the class level and the fragment level.
(D) Loop BriX contains three levels: the superclass, the subclass and the fragment
level.
then matches the selected region to the BriX classes by calculating the distance
to each class centroid. Here, the user can select the class threshold that defines
class compactness (0.6 Å in this example). Fragments are returned for every class
whose centroid is close enough to the query fragment. The user can also select
the maximum number of fragments per class and the total minimum and maximum
number of fragments (between 1 and 1000); superposition thresholds are
adapted accordingly. In the case of the β-strand of the PDZ domain, over 3000 fragments
superposing within 0.6 Å are matched, of which 1000 are returned to the user as a set
of downloadable fragment PDB files. Moreover, the service provides a snapshot of
Figure 2.9: BriX applications: ‘covering’ and ‘bridging’. (A) Covering: an input PDZ
structure (PDB: 2WL7) is shown for which the algorithm finds matching structural
fragments for the β-strand (red). The algorithm returns a set of protein fragment
structures (green) superposed on the β-strand, together with structure and sequence
logos. (B) Bridging: the same PDB structure (PDB: 2WL7), now with a missing
loop. The algorithm finds loop fragments that match the regular anchor residues and
span the loop with the same end-to-end distance (green).
these fragments superposed on the query PDB as well as logos depicting sequence
and structure propensities of the matched fragments, useful to derive sequence or
structure relationships. Finally, the set of matching classes and fragments can be
further inspected online using the previously described search interface.
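The centroid-matching step of the covering algorithm can be sketched as follows. This is not the BriX implementation: it swaps in a distance-matrix RMSD (dRMSD), which needs no superposition, as a stand-in for the superposition RMSD used by the service, and all names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def drmsd(coords_a, coords_b):
    """Distance-matrix RMSD between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or len(coords_a) < 2:
        raise ValueError("need two fragments of equal length with at least two atoms")
    pairs = list(combinations(range(len(coords_a)), 2))
    total = 0.0
    for i, j in pairs:
        dist_a = math.dist(coords_a[i], coords_a[j])
        dist_b = math.dist(coords_b[i], coords_b[j])
        total += (dist_a - dist_b) ** 2
    return math.sqrt(total / len(pairs))


def cover(query, class_centroids, threshold=0.6):
    """Return (class_id, distance) for centroids within `threshold` of the query, closest first."""
    hits = []
    for class_id, centroid in class_centroids.items():
        if len(centroid) != len(query):
            continue  # only compare fragments of the same length
        distance = drmsd(query, centroid)
        if distance <= threshold:
            hits.append((class_id, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical C-alpha traces: an extended 5-residue query and two class centroids.
query = [(3.8 * i, 0.0, 0.0) for i in range(5)]
centroids = {
    "strand": [(10.0 + 3.8 * i, 5.0, 0.0) for i in range(5)],  # same shape, translated
    "compact": [(1.5 * i, 0.0, 0.0) for i in range(5)],        # much shorter spacing
}
hits = cover(query, centroids, threshold=0.6)
print(hits)
```

One design consequence of the swapped-in measure is that dRMSD cannot distinguish a fragment from its mirror image, which a superposition-based RMSD would.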
The bridging algorithm works in a similar fashion. To illustrate this, we removed
a loop of the same PDZ domain from the input structure (Figure 2.9B), which is
involved in binding the peptide ligand of this domain. This loop is anchored by
residue 104 on the left and residue 112 on the right, spanning an end-to-end
distance of 12.7 Å. The algorithm reconstructs a backbone with fragments from
the Loop BriX database between the two anchor residues. As one might expect,
the results contain loops from PDZ domains (for example, PDB ID 1WIF), but also
loops derived from proteins with unrelated SCOP classes.
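The end-to-end matching at the heart of bridging can be sketched in a few lines. The tolerance value, the function names and the fragment coordinates below are assumptions; only the 12.7 Å gap comes from the example above, and the sketch ignores the matching of end residues.

```python
import math


def end_to_end(coords):
    """Distance between the first and last point of a backbone trace."""
    return math.dist(coords[0], coords[-1])


def bridge(anchor_start, anchor_end, loop_fragments, tolerance=0.5):
    """Return ids of loop fragments whose end-to-end distance matches the anchor gap.

    `tolerance` (in Å) is a hypothetical parameter; results are sorted by deviation.
    """
    gap = math.dist(anchor_start, anchor_end)
    matches = []
    for frag_id, coords in loop_fragments.items():
        deviation = abs(end_to_end(coords) - gap)
        if deviation <= tolerance:
            matches.append((frag_id, deviation))
    return [frag_id for frag_id, _ in sorted(matches, key=lambda m: m[1])]


# Anchors 12.7 Å apart, as in the PDZ example; the fragments are invented.
anchor_a = (0.0, 0.0, 0.0)
anchor_b = (12.7, 0.0, 0.0)
loops = {
    "loop_a": [(0.0, 0.0, 0.0), (4.0, 3.0, 0.0), (8.0, 3.0, 0.0), (12.5, 0.0, 0.0)],
    "loop_b": [(0.0, 0.0, 0.0), (3.0, 3.0, 0.0), (6.0, 0.0, 0.0)],
    "loop_c": [(1.0, 1.0, 1.0), (5.0, 4.0, 1.0), (9.0, 4.0, 1.0), (13.8, 1.0, 1.0)],
}
candidates = bridge(anchor_a, anchor_b, loops)
print(candidates)
```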
Given the vastness of our database, calculations can be demanding. We allocated
a dedicated cluster (40 nodes) that runs the algorithms independently of
the web server.
2.4 Discussion
To our knowledge, BriX is the most extensive alphabet of protein fragments publicly
available. As shown by Baeten et al. (2008) and recently confirmed by Le et al.
(2009), BriX reaches an accuracy of 0.48 Å RMSD in reconstructing an extensive
set of proteins. With a 7-fold increase in the number of classified fragments, we
expect this accuracy to improve further (∼0 Å), including for regions that are
typically regarded as ‘unstructured’ and thus difficult to reconstruct with unrelated
fragments, such as protein loops. Protein fragments, however, have limited meaning
in isolation, and in Chapter 5 we introduce the concept of ‘fragment context’ by
mining the database for fragment interactions, stored in the InteraX database.
While most protein fragment libraries are used to improve structure alignment
or deduce sequence-structure relationships through classic homology modeling (Le
et al., 2009), we use BriX for full-atom reconstruction of entire protein structures
(Baeten et al., 2008), loop reconstruction (Chapter 3) or the prediction of peptide
structure (Chapter 6).
Author Contributions
F.R., J.S., and L.S. conceptualized the study. L.B. developed the first version of
the BriX database (Baeten et al., 2008). L.B., P.V. and E.V. developed the second
version of the BriX database. E.V., P.V. and L.B. performed the analysis. P.V. and
E.V. developed the covering and bridging algorithms. P.V. developed the website.
P.V. and E.V. wrote the paper.
References
Ananthalakshmi, P., Kumar, C.K., Jeyasimhan, M., Sumathi, K. & Sekar, K. (2005). Fragment finder: a web-based software to identify similar three-dimensional structural motif. Nucleic Acids Research, 33, W85–8.
Arnold, K., Kiefer, F., Kopp, J., Battey, J.N.D., Podvinec, M., Westbrook, J.D., Berman, H.M., Bordoli, L. & Schwede, T. (2009). The protein model portal. J Struct Funct Genomics, 10, 1–8.
Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the brix collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93–6.
Berkholz, D.S., Krenesky, P.B., Davidson, J.R. & Karplus, P.A. (2010). Protein geometry database: a flexible engine to explore backbone conformations and their relationships to covalent geometry. Nucleic Acids Research, 38, D320–5.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The protein data bank. Nucleic Acids Research, 28, 235–42.
Bornot, A., Etchebest, C. & de Brevern, A.G. (2009). A new prediction strategy for long local protein structures using an original description. Proteins, 76, 570–87.
Budowski-Tal, I., Nov, Y. & Kolodny, R. (2010). Fragbag, an accurate representation of protein structure, retrieves structural neighbors from the entire pdb quickly and accurately. Proceedings of the National Academy of Sciences, 107, 3481–6.
Burke, D.F. & Deane, C.M. (2001). Improved protein loop prediction from sequence alone. Protein Eng, 14, 473–8.
Bystroff, C. & Shao, Y. (2002). Fully automated ab initio protein structure prediction using i-sites, hmmstr and rosetta. Bioinformatics, 18 Suppl 1, S54–61.
Chandonia, J.M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M. & Brenner, S.E. (2004). The astral compendium in 2004. Nucleic Acids Research, 32, D189–92.
Choi, Y. & Deane, C.M. (2010). Fread revisited: Accurate loop structure prediction using a database search algorithm. Proteins, 78, 1431–40.
Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. (2004). Weblogo: a sequence logo generator. Genome Res, 14, 1188–90.
Demon, D., Damme, P.V., Berghe, T.V., Deceuninck, A., Durme, J.V., Verspurten, J., Helsens, K., Impens, F., Wejda, M., Schymkowitz, J., Rousseau, F., Madder, A., Vandekerckhove, J., Declercq, W., Gevaert, K. & Vandenabeele, P. (2009). Proteome-wide substrate analysis indicates substrate exclusion as a mechanism to generate caspase-7 versus caspase-3 specificity. Mol Cell Proteomics, 8, 2700–14.
Donate, L.E., Rufino, S.D., Canard, L.H. & Blundell, T.L. (1996). Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci, 5, 2600–16.
Espadaler, J., Fernandez-Fuentes, N., Hermoso, A., Querol, E., Aviles, F.X., Sternberg, M.J.E. & Oliva, B. (2004). Archdb: automated protein loop classification as a tool for structural genomics. Nucleic Acids Research, 32, D185–8.
Fernandez-Fuentes, N., Zhai, J. & Fiser, A. (2006). Archpred: a template based loop structure prediction server. Nucleic Acids Research, 34, W173–6.
Fitzkee, N.C., Fleming, P.J., Gong, H., Panasik, N., Street, T.O. & Rose, G.D. (2005). Are proteins made from a limited parts list? TRENDS in Biochemical Sciences, 30, 73–80.
Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–637.
Kifer, I., Nussinov, R. & Wolfson, H.J. (2008). Constructing templates for protein structure prediction by simulation of protein folding pathways. Proteins, 73, 380–94.
Kolodny, R. & Levitt, M. (2003). Protein decoy assembly using short fragments under geometric constraints. Biopolymers, 68, 278–85.
Kolodny, R., Koehl, P., Guibas, L. & Levitt, M. (2002). Small libraries of protein fragments model native protein structures accurately. Journal of Molecular Biology, 323, 297–307.
Kopp, J., Bordoli, L., Battey, J.N.D., Kiefer, F. & Schwede, T. (2007). Assessment of casp7 predictions for template-based modeling targets. Proteins, 69 Suppl 8, 38–56.
Le, Q., Pollastri, G. & Koehl, P. (2009). Structural alphabets for protein structure classification: a comparison study. Journal of Molecular Biology, 387, 431–50.
Levitt, M. (2009). Nature of the protein universe. Proceedings of the National Academy of Sciences of the United States of America, 106, 11079–84.
Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995). Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. & Thornton, J.M. (1997). Cath–a hierarchic classification of protein domain structures. Structure, 5, 1093–108.
Pandini, A., Fornili, A. & Kleinjung, J. (2010). Structural alphabets derived from attractors in conformational space. BMC Bioinformatics, 11, 97.
Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C. & Ferrin, T.E. (2004). Ucsf chimera–a visualization system for exploratory research and analysis. J Comput Chem, 25, 1605–12.
Qian, B., Raman, S., Das, R., Bradley, P., McCoy, A.J., Read, R.J. & Baker, D. (2007). High-resolution structure prediction and the crystallographic phase problem. Nature, 450, 259–64.
Samson, A.O. & Levitt, M. (2009). Protein segment finder: an online search engine for segment motifs in the pdb. Nucleic Acids Research, 37, D224–8.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The foldx web server: an online force field. Nucleic Acids Research, 33, W382–8.
Simons, K.T., Kooperberg, C., Huang, E. & Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol, 268, 209–25.
Vanhee, P., Stricher, F., Baeten, L., Verschueren, E., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2009). Protein-peptide interactions adopt the same structural motifs as monomeric protein folds. Structure, 17, 1128–1136.
Vanhee, P., Reumers, J., Stricher, F., Baeten, L., Serrano, L., Schymkowitz, J. & Rousseau, F. (2010). Pepx: a structural database of non-redundant protein-peptide complexes. Nucleic Acids Research, 38, D545–51.
Vriend, G. (1990). What if: a molecular modeling and drug design program. Journal of Molecular Graphics, 8, 52–6, 29.
3 Predicting loop structure
This chapter is based on
Fast and accurate prediction of protein loop structure and dynamics. Peter Vanhee*, Joost
Van Durme*, Lies Baeten*, Erik Verschueren, Frederic Rousseau, Joost Schymkowitz, Francois
Stricher and Luis Serrano.¹ In review, February 2011.
Protein loops play important roles in binding and catalysis but, being the most
variable protein elements, they are notoriously difficult to predict. In accordance
with the idea of BriX – protein structures represented with protein building
blocks (Chapter 2) – we show that even though loops are typically regarded as
‘unstructured regions’, loop structures up to a certain length are recurrent too,
regardless of sequence identity. We report the loop structure prediction algorithm
LoopX (http://loopx.crg.es), which combines a 100-fold speed increase over
state-of-the-art methods (from days to hours) with excellent prediction accuracy
and coverage for loops up to 12 residues. Moreover, we demonstrate that LoopX
can be used to model the conformational ensemble adopted by protein loops upon
ligand binding.
¹ Peter Vanhee, Joost Van Durme and Lies Baeten are joint first authors.
3.1 Introduction
Protein loops often represent functional entities that exploit their intrinsic confor-
mational flexibility to perform a multitude of tasks critical to protein function and
specificity (Fetrow, 1995; Jones et al., 1998; Todd et al., 2001). For example, the
complementarity-determining regions (CDR loops) of antibodies are responsible
for recognition and binding of antigen epitopes (Rini et al., 1992). Loops also play
key roles in binding a variety of ligands such as metal ions (Lu & Valentine,
1997), ATP (Prodromou et al., 1997), calcium (Strynadka & James, 1989) and DNA
(Redondo et al., 2008). Furthermore, loops are able to regulate peptide binding
by governing access to domain pockets, as was recently shown for SH2 domains
which mediate protein-protein interactions through binding of phosphotyrosine-
containing sequences (Kaneko et al., 2010). Loops have also been the subject of
successful design studies to construct novel enzymatic functionality (Jiang et al.,
2008; Siegel et al., 2010) or alter protein stability (van der Sloot et al., 2009).
Loops are typically defined as irregular regions flanked by regular secondary
structure elements. In both X-ray and NMR structural models these regions are
often the least well defined. Because of high sequence and structural variability,
loop structure prediction is one of the most challenging tasks in homology modeling
and protein design. Conventional homology modeling generally produces poor
approximations for loop structures, and crystallographic artefacts such as crystal
contacts and high loop mobility further complicate the prediction of loop regions
(Benner et al., 1993; Burke et al., 2000; Lawson & Wheatley, 2004; Heuser et al.,
2004).
From this perspective, loop modeling can be seen as a ‘mini-folding’ problem as
the correct conformation has to be acquired primarily from sequence information
(Fiser et al., 2000). Existing approaches to tackle this problem can be divided into
two categories: ab initio and knowledge-based methods. Ab initio methods rely
to a great extent on their scoring functions, which are based on molecular dynamics
(MD) and energy minimization to sample a large number of loop conformations
(Rapp & Friesner, 1999; Jacobson et al., 2004; Spassov et al., 2008; Felts et al.,
2008; Xiang et al., 2002). The final prediction is the solution with the lowest
calculated energy.
Alternatively, knowledge-based methods attempt to find a loop segment of a
protein with known three-dimensional structure that fits the stem regions of a
target loop. The basis of this method is mining loop databases built from existing
Protein Data Bank (PDB) structures (Donate et al., 1996; Oliva et al., 1997; Burke
et al., 2000; Espadaler et al., 2004; Fernandez-Fuentes et al., 2006; Vanhee et al.,
2011) (Chapter 2). Typically, the database is searched for templates followed by
evaluation and filtering of possible candidates using empirical scoring potentials.
The remaining loop candidates are then ranked according to sequence identity or
geometric criteria (Wojcik et al., 1999; Heuser et al., 2004).
Figure 3.1: Predicted loop structure with LoopX. Loop reconstructions with a max-
imum of 35% loop sequence identity. The structures depict the best loop reconstruc-
tion ranked first by LoopX for all loop lengths from 4 to 12. The crystallographic loop
(light red) is compared by structural superposition to the reconstructed loop (dark
red) and the in-picture text denotes the loop length, original PDB identifier and RMSD
prediction accuracy.
Predicting loops of 12 amino acids and longer often presents severe problems
for both types of methods. Ab initio methods have been significantly improved to
find loop predictions close to the real solution (Mandell et al., 2009), albeit at the
cost of a significant increase of computation time that grows exponentially with
loop length. Knowledge-based methods allow for fast searches, but they are mainly
limited by the completeness of the loop database (Du et al., 2003) and the inability
to select near-native loops based on sequence homology alone (Fernandez-Fuentes
& Fiser, 2006). However, the rapid expansion of the PDB results in a dense coverage
of loop conformational fragments, and knowledge-based methods can exploit
this expansion for more accurate loop structure prediction (Berman et al., 2000;
Levitt, 2007). In contrast to ab initio methods, knowledge-based methods have
the advantage of using frequently occurring, energetically favorable conformations
of native structures. In this regard, with a high completeness of the database and
for shorter loops (less than 12 residues), their performance further depends on the
reliability of the scoring function to evaluate and rank the loop candidates.
Here we present LoopX, a loop structure prediction algorithm that combines
a fast database search with an all-atom sidechain reconstruction (Figure 3.11).
LoopX follows a novel approach in predicting loop structure: the method com-
bines the power of both knowledge-based and ab initio methods. LoopX makes
use of the Loop BriX database of protein loop fragments (Vanhee et al. (2011)
and Chapter 2) in combination with the FoldX side chain reconstruction algorithm
(Schymkowitz et al., 2005), and applies a series of filters to select feasible loop can-
didates. An additional, computationally more expensive step introduces backbone
variation to optimize loop placement using the BriX database (Vanhee et al., 2011).
We present an exhaustive study of the performance of LoopX (measured by the
RMSD of the predicted loop against the crystallographic loop) on five different datasets, comparing
the method against five state-of-the-art loop reconstruction methods.
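The accuracy measure used throughout this chapter, the RMSD between a predicted and a crystallographic loop, can be written down as a minimal sketch. It assumes both loops are already expressed in the same coordinate frame (for instance because the rest of the protein is fixed) and that atoms are listed in the same order; the coordinates are invented.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) atoms."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(total / len(predicted))


# Hypothetical backbone atoms: the model is shifted by 1 Å along y.
crystal = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
value = rmsd(model, crystal)
print(round(value, 3))
```

Which atoms enter the sum (backbone only or all heavy atoms) changes the value, so comparisons between methods are only meaningful when the same atom selection is used.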
3.2 Results
3.2.1 Comparison with the state-of-the-art loop reconstruction
algorithm
Recently, Mandell et al. (2009) reported sub-angstrom accuracy in loop
reconstruction on a set of 12-residue loops using a method called kinematic closure
(KIC), which takes inspiration from the robotics field. This huge step forward
[Figure 3.2 panels: Dataset 1: 23 12-residue loops; Dataset 2: 20 12-residue loops; Dataset 3: 18 interface loops; Dataset 4: 270 (4-12)-residue loops. Methods: LoopX, KIC, FREAD, MODELLER, RAPPER, PLOP. Y-axis: r.m.s.d. to crystallographic loop.]
Figure 3.2: Accuracy of LoopX versus state-of-the-art loop prediction methods on
datasets 1, 2, 3 and 4. Box plot comparisons of the LoopX algorithm (blue) and the
KIC protocol (red) on datasets 1, 2 and 3; comparison of the LoopX algorithm (blue)
and the algorithms FREAD (green), MODELLER (purple), RAPPER (orange) and PLOP
(cyan) on dataset 4. Boxes span the interquartile range (IQR, 25th-75th percentiles),
black lines represent the median, whiskers extend to furthest values within 0.8 times
the IQR, and open circles are outliers.
in the loop reconstruction field comes at the price of a significant amount of
computation time. As a result, whilst the methodology could be used to pre-
dict the loop structures with high accuracy, it is not practical when performing
protein design where many loops of different lengths and sequences need to be
explored. Here we show that LoopX reaches a similar accuracy but carries out the
reconstruction considerably faster (from days to minutes), creating a wide range
of possibilities for large-scale loop sampling analysis.
We analyzed three 12-residue loop sets used by Mandell et al. (2009) and
compared KIC results to LoopX predictions (Figure 3.2). Prediction coverage of
LoopX – i.e. the number of structures for which LoopX makes a prediction – was
18/23 (Figure 3.3), 17/20 (Figure 3.4), 17/18 (Figure 3.5) and RMSD improvement
ratios relative to KIC results were 12/23, 8/20 and 8/18 for datasets 1, 2 and 3, respectively.
The box plots in Figure 3.2 show that LoopX, with the exception of dataset
2, compares well to the results obtained by KIC. The substantial increase in loop
reconstruction speed of LoopX (∼5-60 minutes per loop) compared to KIC (∼320
hours per loop) enables LoopX to be used for high-accuracy and high-throughput
loop reconstruction and loop sampling projects in a moderate timeframe.
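The approximate timings quoted above translate into a speed-up range that is easy to check. The short calculation below uses only the rounded figures from the text (∼320 hours per loop for KIC, ∼5-60 minutes per loop for LoopX); the variable names are illustrative.

```python
KIC_HOURS_PER_LOOP = 320          # approximate value quoted in the text
LOOPX_MINUTES_PER_LOOP = (5, 60)  # approximate range quoted in the text

kic_minutes = KIC_HOURS_PER_LOOP * 60
speedup_slowest = kic_minutes / max(LOOPX_MINUTES_PER_LOOP)
speedup_fastest = kic_minutes / min(LOOPX_MINUTES_PER_LOOP)
print(f"speed-up between {speedup_slowest:.0f}x and {speedup_fastest:.0f}x")
```

With these figures the ratio lies between roughly 320-fold and 3840-fold, so a 100-fold speed increase is a conservative summary of the same numbers.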
[Figure 3.3 plot: Dataset 1, Mandell et al. Series: LoopX, Rosetta, KIC. Y-axis: RMSD to crystallographic loop; x-axis: PDB structure and loop region.]
Figure 3.3: Dataset 1: comparison of LoopX with Rosetta and KIC. Rosetta (Wang
et al., 2007) was reevaluated by Mandell et al. (2009), see Supp. Mat. KIC values
were taken from Mandell et al. (2009). Values are in Angstrom and represent the
RMSD between predicted loop and crystallographic loop. The X-axis shows the PDB
structures together with the predicted loop region. Sorted by LoopX accuracy from
left to right, we predicted 18/23 loops.
3.2.2 Loop homology is no prerequisite for loop reconstruction
accuracy
The common limitation of database-search based loop reconstruction algorithms
is that a high degree of sequence identity between the query loop and the database
loops is required to achieve both high accuracy and coverage. This is a valid
argument when using small databases or sub-optimal scoring methods, although
it has never been formally demonstrated.
We tested LoopX on dataset 4, which contains 527 loops extracted from 204 un-
related structures (<35% sequence identity) (Section 3.3.3). The sampling power
of the algorithm was evaluated by selecting the best prediction in the top ten for
each loop reconstruction. Averaged per length we achieve accuracy < 1 Å for
lengths 4-6, ∼1 Å for length 7 and < 2 Å for lengths 8-12 (Figure 3.6).
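The evaluation described above, taking the best prediction among the top ten ranked candidates and averaging per loop length, can be sketched as follows. The data layout and the RMSD values are hypothetical, and skipping loops without any prediction is an assumption about how coverage and accuracy are separated.

```python
from collections import defaultdict


def best_of_top_n(ranked_rmsds, n=10):
    """Lowest RMSD among the first n ranked predictions for one loop."""
    return min(ranked_rmsds[:n])


def mean_best_per_length(predictions, n=10):
    """Average the best-of-top-n RMSD over all loops of the same length.

    `predictions` holds (loop_length, ranked_rmsds) tuples, ranked best score first.
    """
    by_length = defaultdict(list)
    for loop_length, ranked_rmsds in predictions:
        if ranked_rmsds:  # loops without a prediction do not enter the accuracy average
            by_length[loop_length].append(best_of_top_n(ranked_rmsds, n))
    return {length: sum(v) / len(v) for length, v in sorted(by_length.items())}


# Hypothetical RMSD values (Å) for four loops of two lengths.
predictions = [
    (4, [0.9, 0.4, 1.2]),
    (4, [0.6, 0.8]),
    (8, [2.5, 1.5, 3.0]),
    (8, []),
]
averages = mean_best_per_length(predictions)
print(averages)
```

Setting `n=1` instead gives the accuracy of the top-ranked prediction alone, which separates the sampling power of the search from the ranking power of the scoring function.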
[Figure 3.4 plot: Dataset 2, Mandell et al. Series: LoopX, Rosetta, KIC. Y-axis: RMSD to crystallographic loop; x-axis: PDB structure and loop region.]
Figure 3.4: Dataset 2: comparison of LoopX with Rosetta and KIC. LoopX predic-
tions on dataset 2 (Wang et al., 2007; Mandell et al., 2009). Rosetta and KIC values
are taken from (Mandell et al., 2009). Values are in Angstrom and represent the
RMSD between predicted loop and crystallographic loop. The X-axis shows the PDB
structures together with the predicted loop region. Sorted by LoopX accuracy from
left to right, we predicted 17/20 loops.
To evaluate the dependence of the algorithm on loop sequence identity, we
ran three additional predictions on the same set but discarding candidate loops
with sequence identity above 85%, 50% and 35%. For lengths 4-5, sub-angstrom
accuracy was attained over all identity cut-offs. Accuracy for lengths 6 and 7 varies
across cut-offs in the lower and upper part of the range between 1 Å and 2 Å, respectively. Lengths
8, 9 and 12 stay below 3 Å, whereas accuracy for lengths 10 and 11 deteriorates to
4 Å and 5 Å, respectively.
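The identity cut-off experiment can be illustrated with a small filter. The sketch assumes that identity is computed position by position between loops of equal length, which the text does not state explicitly; the sequences are invented.

```python
def sequence_identity(a, b):
    """Percentage of identical residues between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def discard_similar(query, candidates, max_identity):
    """Keep candidate loop sequences whose identity to the query does not exceed the cut-off."""
    return [c for c in candidates if sequence_identity(query, c) <= max_identity]


# Hypothetical 7-residue query loop and candidate loops from a database.
query = "GSTDGKA"
candidates = ["GSTDGKA", "GSTDGRA", "GATEGKL", "PLVNQRE"]
for cutoff in (85, 50, 35):
    print(cutoff, discard_similar(query, candidates, cutoff))
```

Lowering the cut-off removes progressively more of the close matches, so any accuracy that survives at 35% has to come from structurally similar loops with unrelated sequences.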
The finding that, even at low sequence identity (<35%), the algorithm maintains
excellent prediction accuracy for loops up to length 7 (Figures 3.1 and 3.6)
can be attributed to the ranking power of the FoldX free energies and the
completeness of the Loop BriX database. From our experiments we hypothesize
that the structural space of loops is saturated for lower loop lengths (≤ 7 residues)
and near saturated for medium loop lengths (≤ 12 residues). Hence, there appears
[Figure 3.5 plot: Dataset 3, Mandell et al. Series: LoopX, KIC. Y-axis: RMSD to crystallographic loop; x-axis: PDB structure and loop region.]
Figure 3.5: Dataset 3: comparison of LoopX with KIC. Dataset 3 contains loops
from 4 different proteins crystallized with 18 different partner proteins. KIC values
are taken from (Mandell et al., 2009). Values are in Angstrom and represent the
RMSD between predicted loop and crystallographic loop. The X-axis shows the PDB
structures together with the predicted loop region. Sorted by LoopX accuracy from
left to right, we predicted 17/18 loops.
to be no need to employ computationally expensive ab initio methods at these
loop lengths, unless combined with database methods.
3.2.3 Loop ensemble prediction
A practical and more challenging application of loop reconstruction algorithms is
their ability to predict the conformational ensemble adopted by loops, for example
when involved in ligand binding. We assess the ability of LoopX to predict the
carboxylate-binding loop movement of PDZ domains, key regulatory domains
in cell signaling pathways (Nourry et al., 2003). The carboxylate-binding loop is
involved in selection of C-terminal versus internal ligand binding (Penkert et al.,
2004), and predicting the conformational ensemble could help in the structural
[Figure 3.6 plot: average RMSD to crystallographic loop (y-axis) versus loop length 4-12 (x-axis). Series by loop homology: all, < 85%, < 50%, < 35%.]
Figure 3.6: Influence of loop homology in LoopX. Loop reconstruction accuracy
measured at different loop lengths and at multiple loop sequence identity thresholds,
for dataset 4 containing 210 loops. Candidate loops were discarded if the percentage
sequence identity was higher than the given threshold, except in the case of ‘all’ where
nothing was discarded.
elucidation of PDZ peptide recognition¹. We selected 10 high-resolution structures
of PDZ domains with carboxylate-binding loop distances between backbone atoms up to
4.5 Å (Section 3.3.3, Figure 3.7 and Figure 3.8). We introduced backbone variation
using the BriX database to optimize loop placement (Section 3.3.1). Starting from a
canonical, closed-loop conformation (PDB 2I1N), LoopX generates a loop ensemble
that covers every crystal structure to within approximately 1 Å (0.35-1.07 Å),
including the open loop conformation (PDB 2WL7, 0.68 Å) (Figure 3.9).
In a similar experiment, LoopX predicts DNA binding-induced loop movements
of meganuclease to ∼0.50 Å accuracy (data not shown). LoopX is thus able to
¹ In Chapter 6 we provide a thorough case study of PDZ-peptide interactions and the application
of the BriX design protocol to model peptide specificity from the structure of the PDZ domain alone.
71
3. PREDICTING LOOP STRUCTURE
Carboxylate-binding loop
PSD93 PDZ-1
(PDB 2WL7)
DLG3 PDZ-1
(PDB 2I1N)
Peptide binding
pocket
A B
0
1
2
3
4
1 2 3 4 5 6 7 8 9
PDZ_consensus-1 (1 peptides)
0
1
2
3
4
1 2 3 4 5 6 7 8 9
PDZ_ensemble-1 (10 peptides)
Loop sequence PDZ ensemble (10 structures)
Loop consensus sequence PDZ ensemble
Figure 3.7: Conformational ensemble adopted by the PDZ carboxylate-binding
loop. (A) 10 X-Ray structures with loop movement (maximum difference between
backbone atoms: 4.5 Å). (B) Sequence diversity of 10 selected PDZ domains (top)
and consensus sequence used to build loops with LoopX (bottom).
model the conformational loop ensemble of the ligand-induced loop movements
as observed from crystal structures in a reasonable timeframe.
3.2.4 Comparison with MODELLER, RAPPER, PLOP and FREAD
LoopX was compared to the ab initio methods MODELLER (Fiser et al., 2000),
RAPPER (de Bakker et al., 2003), PLOP (Jacobson et al., 2004) and to the database
search method FREAD (Choi & Deane, 2010) (for a review of these methods, see
Choi & Deane, 2010). We show that LoopX outperforms these methods
on overall coverage and accuracy on a 510-loop set ranging from 4 to 20 residues
(dataset 5, Section 3.3.3).
When compared to MODELLER, RAPPER, PLOP and FREAD, LoopX produces
the best top global RMSDs for 14 of the 17 loop lengths on dataset 5 (Figure 3.10).
At lengths 12 and 18, FREAD achieves a slightly better RMSD coverage. For length
20, FREAD performs better on average RMSD on this set.
Figure 3.8: Cross-comparison of backbone distances of the conformational ensemble
of the carboxylate-binding loop. All-against-all comparison of the 10 X-ray
loops presented in a heat map. A dendrogram shows the clusters of loops that are
related in RMSD distance.
Pairwise loop RMSD values shown in the heat map of Figure 3.8 (Å):
        2WL7  2Q9V  2OZF  2I04  2HE2  2AWU  2I1N  2F5Y  2AWW  1G9O
1G9O    1.89  1.15  1.17  1.87  1.55  1.78  1.13  0.72  0.89  0
2AWW    1.57  1.08  1.36  1.64  1.28  1.32  0.85  0.88  0     0.89
2F5Y    2.09  1.57  1.59  1.52  1.18  1.26  0.76  0     0.88  0.72
2I1N    2.09  1.7   1.58  0.95  0.99  0.86  0     0.76  0.85  1.13
2AWU    2.45  2.29  2.34  1.15  1.09  0     0.86  1.26  1.32  1.78
2HE2    2.2   1.95  2.14  1.39  0     1.09  0.99  1.18  1.28  1.55
2I04    2.6   2.37  2.15  0     1.39  1.15  0.95  1.52  1.64  1.87
2OZF    2.08  1.29  0     2.15  2.14  2.34  1.58  1.59  1.36  1.17
2Q9V    1.25  0     1.29  2.37  1.95  2.29  1.7   1.57  1.08  1.15
2WL7    0     1.25  2.08  2.6   2.2   2.45  2.09  2.09  1.57  1.89
The selection of suitable loop hits for loop reconstruction and prediction depends
for a large part on the completeness of the database. The coverage of LoopX
on dataset 5 is 77.8% (389/510 predicted structures). For loop lengths smaller than
15, near-complete coverage is achieved (>80%). This drops to about a half for
lengths 15 to 18 and to a third for lengths 19 and 20. This difference in prediction
rates shows that the prediction of large loop lengths using database methods is still
difficult owing to the lack of loop templates. In comparison, for ab initio methods,
the prediction of loops with lengths larger than 12 typically poses combinatorial
problems. As a result, most ab initio methods only report loop prediction results
for loop lengths up to 12 residues (Mandell et al., 2009).
Heat map values of Figure 3.9 (RMSD in Å; columns are the 10 X-ray loops, rows are
BriX fragment identifiers of the loops returned by LoopX; 2I1N is the LoopX start structure):
(A)
           2I1N  2Q9V  2I04  2HE2  2AWU  2AWW  2OZF  2WL7  2F5Y  1G9O
97389      1.84  1.02  2.49  2.07  2.27  1.28  1.73  0.68  1.72  1.45
96874      0.95  1.16  1.78  1.42  1.59  0.7   1.13  1.88  0.59  0.55
93183      0.35  1.7   1.07  0.77  0.76  0.82  1.71  2.05  0.82  1.19
78925      0.7   1.47  1.49  1.21  1.23  0.83  1.48  1.93  0.72  0.93
19429      0.8   1.28  1.57  1.24  1.4   0.73  1.35  1.87  0.48  0.42
101144     0.69  1.47  1.34  1.22  1.2   0.85  1.45  1.81  1.05  1.23
(B) Loop conformations generated from PDB 2I1N
           2I1N  2Q9V  2I04  2HE2  2AWU  2AWW  2OZF  2WL7  2F5Y  1G9O
97389_8    2     1.31  2.62  2.38  2.43  1.46  1.7   0.94  1.96  1.69
97389_23   1.84  1.02  2.49  2.07  2.27  1.28  1.73  0.68  1.72  1.45
97389_12   2.1   1.25  2.69  2.24  2.45  1.58  2.06  0.7   2.06  1.86
96874_8    1.14  1.37  1.89  1.8   1.78  0.9   0.97  1.98  1.03  0.91
96874_28   0.64  1.92  1.31  1.05  1.06  1.1   1.74  2.36  0.5   1.06
96874_19   0.95  1.16  1.78  1.42  1.59  0.7   1.13  1.88  0.59  0.55
96874_11   0.84  2.09  1.38  1.48  1.29  1.25  1.63  2.5   0.93  1.26
93183_81   0.95  1.75  1.66  1.42  1.24  0.9   1.65  2.18  0.95  1.1
93183_68   0.59  1.74  1.2   1.32  1.09  0.87  1.44  2.09  1.07  1.26
93183_46   1.17  1.13  1.85  1.74  1.84  0.85  0.77  1.75  1.33  1.09
93183_45   1.35  2.77  1.46  1.67  0.98  1.8   2.61  2.95  1.62  2.09
93183_22   0.81  1.73  1.3   0.99  0.99  1.02  1.89  1.96  1.26  1.55
93183_21   0.35  1.7   1.07  0.77  0.76  0.82  1.71  2.05  0.82  1.19
19429_9    1.04  1.35  1.74  1.71  1.68  0.85  1.06  1.89  0.97  0.73
19429_34   0.78  2.06  1.26  1.05  1.03  1.26  1.97  2.42  0.75  1.17
19429_32   0.8   1.28  1.57  1.24  1.4   0.73  1.35  1.87  0.48  0.42
19429_24   0.87  2.09  1.33  1.48  1.28  1.28  1.71  2.44  1     1.21
Figure 3.9: Reconstruction of PDZ carboxylate-binding loop ensemble. Heat
maps of the top reconstruction results (in RMSD versus the X-ray loops) for the PDZ
carboxylate-binding loop ensemble. The best results are shown in red, the worst in
yellow. The X axis shows the conformational ensemble as observed in the 10 X-ray
structures, while the Y axis shows the top loops returned by LoopX. (A) shows the
first run of the LoopX algorithm, while (B) shows the results after generating a loop
ensemble using BriX, resulting in slightly better results.
One of the most important aspects of a loop prediction algorithm is the ability
to rank the best prediction within the top solutions. For the shorter loop lengths
from 4 to 7 residues, LoopX ranks the best solution at position 1 in 40 to 57%
of the cases. This might appear low at first sight, but the RMSD
differences between the given solutions for these loop lengths are so small that
the effect on accuracy is minimal. For loop lengths from 8 to 18, the best solution
is ranked first in 63 to 88% of the cases. For loop lengths 19 and 20 too few
predictions were made to evaluate the ranking ability. The high ranking power can
be attributed to the accurate energy calculations by the FoldX force field.
Figure 3.10 plot: average RMSD to the crystallographic loop (y axis, 0 to 14) against
the loop lengths of dataset 5 (x axis, 4 to 20; 510 loops), with one series each for
LoopX, FREAD, MODELLER, RAPPER and PLOP.
Figure 3.10: LoopX prediction accuracy compared with FREAD, MODELLER, RAPPER
and PLOP. Loop prediction accuracies (RMSD) on dataset 5 containing 510 loops
and organized per loop length. The results for FREAD, MODELLER, RAPPER and PLOP
are taken from Choi & Deane (2010).
Choi & Deane (2010) describe a dramatic increase in accuracy of FREAD when
selecting for loops that are close homologs. We have carried out an analysis with
FREAD using this cut-off, and indeed the average top global RMSD of FREAD ranges
from 0.49 Å to 1.86 Å with an exception of 3.34 Å at loop length 12. The drawback
of this accuracy is a coverage as low as 33.5% (171/510 predicted structures).
This is not surprising since the new environment substitution score only allows
highly similar sequences to pass through the filter. The method now becomes
very dependent on the presence of close homologs of the query structure in the
database and therefore more closely resembles a homology modeling technique.
LoopX circumvents this problem by initially selecting for loops using structural
compatibility only; the subsequent construction of sidechains and filtering/ranking
of the loop candidates result in a much higher coverage and an accuracy that is, to
our knowledge, unprecedented for database-driven loop prediction.
3.3 Materials and Methods
3.3.1 LoopX Algorithm
Figure 3.11 diagram labels: β1 anchor and β2 anchor; candidate loop classes flanked by
α1, α2, β1 and β2 elements; RMSD < 1.5 Å; length1, length2, length3; steric clashes;
structure-sequence incompatibility (GLY, PRO); FoldX; min(∆G kcal/mol).
Figure 3.11: Overview of the LoopX algorithm. (A) All loop classes, which have
matching secondary structure anchoring points to the target structure on both the
N- and C-terminal parts, are selected from Loop BriX. Then, only loop classes with a
RMSD lower than 1.5 Å to the anchoring residues and with a length identical to the
query sequence are withheld. (B) Loop candidates having steric (backbone) clashes
with respect to the context and incompatible backbone dihedral angles towards the
target sequence are discarded as well. Finally, all side chains are modeled with the
FoldX force field, loops are ranked by their FoldX stability estimates and local variation
is generated by mending short BriX fragments in between the loops and the anchors.
Loop template selection from Loop BriX
Loop backbone templates are selected from the Loop BriX database containing
14,525 protein structures with < 95% sequence identity (Section 2.2.3). Anchor
groups (2 residues, one loop and one non-loop residue) are chosen and fitted with
all Loop BriX super classes by superposition on the super class centroids. To speed
up this calculation, only loop super classes that have the same secondary structure
residues embracing the loop are considered. A super class is accepted as donor
of candidate loops when the RMSD of the backbone atoms (N, Cα, C, O) between
the four anchor residues of its centroid and the target loop after superposition is
< 1.5 Å.
Because prediction coverage and accuracy depend strongly on the choice
of anchor residues flanking the loop, and DSSP annotations depend heavily on the
structural context, we have added a protocol optimization step in which
slightly alternative loop boundaries are set and evaluated. This is accomplished by
setting a new start residue upstream and a new end residue downstream, totaling
nine combinations of start and end residues per loop. The alternative loop with
the best FoldX energy is then taken as the actual prediction.
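The boundary optimization can be illustrated with a minimal sketch. It assumes that the nine combinations arise from shifting the start residue upstream by 0, 1 or 2 positions and the end residue downstream by 0, 1 or 2 positions; the exact shifts are not stated in the text. The function names and the `energy_of` callback, which stands in for a FoldX energy evaluation, are hypothetical.

```python
from itertools import product

def alternative_boundaries(start, end, max_shift=2):
    """Enumerate (start, end) loop boundaries.

    Assumption: the start moves upstream (lower residue numbers) and the
    end moves downstream (higher residue numbers) by 0..max_shift residues,
    which yields 3 x 3 = 9 combinations for max_shift=2.
    """
    starts = [start - k for k in range(max_shift + 1)]
    ends = [end + k for k in range(max_shift + 1)]
    return list(product(starts, ends))

def best_boundaries(boundaries, energy_of):
    """Pick the boundary pair with the lowest energy.

    energy_of is a placeholder for the energy of the best loop predicted
    with the given boundaries (lower is better).
    """
    return min(boundaries, key=energy_of)
```

In this sketch a loop annotated from residue 10 to residue 21 would be evaluated with starts 10, 9 and 8 and ends 21, 22 and 23.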
Filters
PDB identifier filter PDB structures that have the same PDB four-letter identifier
as the PDB of the target loop are discarded.
Backbone entropy filter Loop candidates originate from different proteins. It
is therefore probable that the amino acid sequence of the candidate loop differs
from the amino acid sequence of the target loop. Consequently, it is required
to determine the propensity of the target amino acid sequence to adopt the φ/ψ
main chain dihedral conformation of the candidate loop. The FoldX force field
is consulted to retrieve the unscaled entropy cost for each φ/ψ pair within the
candidate loop structure for the respective amino acid in the target sequence. This
unscaled entropy cost corresponds with the statistical preference of a residue to
occur in a certain [φ, ψ] region within the Ramachandran plot.
Backbone ω filter The dihedral angle ω measured over the peptide bond is
checked as well. In general, only ω angles with absolute values in the interval
[155°; 180°] are allowed. However, if the target amino acid is one of the structure
breakers – PRO or GLY – a cis conformation can be present. In this case the heavy
atoms N and O are on the same side of the double bond, causing the chain to
bend, and ω angles having an absolute value in the interval [0°; 25°] are allowed as
well.
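The ω rule can be written down as a short sketch. The intervals are those given above; the function names and the convention that each ω angle of the candidate loop is paired with the target residue at the same position are assumptions made for illustration.

```python
def omega_allowed(omega_deg, target_residue):
    """Return True if an omega angle (degrees) is acceptable for a residue.

    Trans peptide bonds (|omega| in [155, 180]) are always accepted;
    cis peptide bonds (|omega| in [0, 25]) only for PRO or GLY.
    """
    magnitude = abs(omega_deg)
    if 155.0 <= magnitude <= 180.0:
        return True
    if target_residue.upper() in {"PRO", "GLY"} and magnitude <= 25.0:
        return True
    return False

def passes_omega_filter(omegas_deg, target_sequence):
    """Check every omega angle of a candidate loop against the target sequence.

    Assumption: omegas_deg[i] is evaluated against target_sequence[i]
    (three-letter residue codes).
    """
    return all(omega_allowed(w, r) for w, r in zip(omegas_deg, target_sequence))
```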
Steric clash filter The conformational fit of the loop is evaluated within the target
environment using a steric clash filter that checks for van der Waals clashes along
the main chain atoms (N, Cα, C, O). Loop candidates causing steric clashes with
neighbouring residues are discarded.
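A distance-based clash check of this kind could look like the following sketch. The 2.5 Å cut-off and the function name are hypothetical and are not taken from LoopX or FoldX; only the idea of testing main chain atoms of the loop against atoms of neighbouring residues comes from the text.

```python
from math import dist

def has_backbone_clash(loop_atoms, environment_atoms, min_distance=2.5):
    """Return True if any loop main chain atom is too close to the environment.

    loop_atoms and environment_atoms are sequences of (x, y, z) coordinates
    in angstrom. min_distance is an illustrative cut-off, not the value used
    by LoopX.
    """
    return any(
        dist(loop_atom, env_atom) < min_distance
        for loop_atom in loop_atoms
        for env_atom in environment_atoms
    )
```

Candidates for which the check returns True would be discarded before side chain modeling.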
Sequence Homology filter Using homolog structures to reconstruct loops might
improve reconstruction results. Loop reconstruction algorithms that rely on
database searches can be at an advantage compared to ab initio methods when the
structure of a sequence homolog is available. On the other hand, when no such
homolog is available the accuracy might decrease rapidly. In order to evaluate
the effect of sequence homology, a sequence homology filter was implemented.
The sequence of the target loop is compared with that of the BriX loop. If the sequence
identity is >= 100%, 85%, 50% or 35% (as indicated), the loop is discarded.
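As an illustration, the filter can be sketched as below. Because candidate loops have the same length as the query, the sketch assumes that identity is computed position by position without alignment; this is an assumption, as are the function names.

```python
def percent_identity(target_seq, candidate_seq):
    """Percentage of identical residues between two equal-length sequences."""
    if len(target_seq) != len(candidate_seq):
        raise ValueError("sequences must have the same length")
    if not target_seq:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(target_seq, candidate_seq))
    return 100.0 * matches / len(target_seq)

def keep_candidate(target_seq, candidate_seq, threshold):
    """Keep a candidate only if its identity to the target is below the threshold."""
    return percent_identity(target_seq, candidate_seq) < threshold
```

With a threshold of 100%, only loops with a sequence identical to the target are discarded; lower thresholds remove increasingly distant homologs.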
Loop placement
The backbone atoms of the N- and C-anchor residues of the side chain-reconstructed
candidate loops are superposed on the equivalent backbone atoms of the N- and
C-anchor residues of the input structure.
Sidechain design using FoldX
As a general rule, filters accept only 1-5 % of the initial loop candidates. From
this set, all side chains are designed with the target loop sequence extracted from
the crystallographic loop. Importantly, no side chain orientations from the BriX
loop nor the crystallographic loop are used. Sidechains are rebuilt using the
FoldX command BuildModel (Schymkowitz et al., 2005). BuildModel uses a
backbone-dependent rotamer library to determine the most probable side chain
conformation. At every mutation step, the algorithm rearranges neighbouring
residues, resulting in a local minimum of the sequence.
Scoring and ranking
The estimated free energy ∆G of the loop is calculated with the FoldX command
Stability, which takes into account the entire protein environment (see also
Section 5.3.5). The final set of solutions is then ranked from low to high ∆G.
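The ranking step, and the way ranking quality is assessed in Section 3.2.4, can be sketched as follows. The `Candidate` record and the numerical values in the usage note are illustrative only; the ∆G values would come from FoldX and the RMSD values from comparison with the crystallographic loop.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    delta_g: float      # estimated free energy in kcal/mol (lower is better)
    global_rmsd: float  # RMSD to the crystallographic loop in angstrom

def rank_by_delta_g(candidates):
    """Order candidates from low to high estimated free energy."""
    return sorted(candidates, key=lambda c: c.delta_g)

def rank_of_best_solution(candidates):
    """Return the 1-based rank (by delta G) of the candidate with the lowest RMSD."""
    ranked = rank_by_delta_g(candidates)
    best = min(ranked, key=lambda c: c.global_rmsd)
    return ranked.index(best) + 1
```

For three toy candidates with ∆G values of -12.0, -15.5 and -9.1 kcal/mol and RMSD values of 0.4, 0.6 and 0.9 Å, the most accurate loop would be ranked second.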
Loop ensemble generation
Even though a correct loop can be retrieved directly from the database in many
cases (Section 3.2), positioning the loop by anchor residue superposition often
poses problems for database methods (Choi & Deane, 2010). To remedy this, the
algorithm samples the space of naturally occurring fragments anchoring the loop.
In this additional step, backbone variation is introduced on the stem residues,
allowing for a smoother fitting of the loop in the query structure (Figure 3.11). The
first phase consists of extracting flanking segments of four residues each on both
sides of the loop region. The BriX fragment database (Chapter 2) is then queried for
protein fragments that are structurally similar to the four-residue segments, using
a superposition threshold of 0.8 Å. In the second phase, alternative orientations
for the flanking residues are generated. A selected set of non-redundant BriX
fragments is superposed on the flanking segment and the backbone coordinates
of the original flanking segment are substituted by the ones of the BriX fragment.
In the last phase, the loop candidates are now superposed on the adapted flanking
segments, effectively generating movements of the loop that fit slightly varied
models of the flanking segments. Finally, the coordinates of the stem residues are
restored to the original ones and backbone clashes caused by the loop movement
are filtered. As described previously, the solutions in the ensemble are scored and
ranked with the FoldX force field.
3.3.2 Reconstruction accuracy
The reconstruction accuracy is given by the RMSD between the backbone atoms
(N, Cα, C, O) of the predicted loop and the crystallographic loop, excluding the
anchor residues. A distinction is made between ‘local’ and ‘global’ RMSD. The
local RMSD is calculated by superposing the predicted loop on the native loop
regardless of the context; as a result, the local RMSD indicates the quality of the
predicted loop outside the context of the protein. The global RMSD is computed
between the predicted loop and the native loop without superposition; the global RMSD
is the measure used to evaluate the prediction accuracy.
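The difference between the two measures can be made concrete with a short NumPy sketch. The optimal superposition is computed here with the Kabsch algorithm, which is a standard choice and an assumption of this example; the text does not state which superposition method LoopX uses. Coordinates are expected as arrays of shape (n, 3) holding the backbone atoms (N, Cα, C, O) of the loop residues in matching order.

```python
import numpy as np

def rmsd(a, b):
    """Root-mean-square deviation between two (n, 3) coordinate arrays."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt((diff ** 2).sum() / len(diff)))

def superpose(mobile, target):
    """Return the mobile coordinates optimally fitted on target (Kabsch)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    target_center = target.mean(axis=0)
    p = mobile - mobile.mean(axis=0)
    q = target - target_center
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return p @ rotation.T + target_center

def global_rmsd(predicted, native):
    """RMSD in the frame of the protein, without superposing the loops."""
    return rmsd(predicted, native)

def local_rmsd(predicted, native):
    """RMSD after superposing the predicted loop on the native loop."""
    return rmsd(superpose(predicted, native), native)
```

A loop with the correct internal conformation but a wrong placement in the protein gives a local RMSD near zero and a large global RMSD, which is why the global value is the stricter measure.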
3.3.3 Benchmark datasets
Three datasets from Mandell et al. (2009)
Dataset 1 was originally compiled by Fiser et al. (2000) and contains 40 12-residue
loops. Mandell et al. (2009) analysed this set in order to exclude structures where
the loop was too close to ligands and ions, reducing the set by 15 12-residue
loops. We eliminated three structures that were incorrectly annotated as loops
(PDB 1tca, 3cla and 154l) and for which LoopX could consequently not produce
any prediction. The final set contains 23 loop structures (Figure 3.3).
Dataset 2 was originally compiled by Zhu et al. (2006) and contains 20 12-residue
loops, selected from high-quality structures (< 2 Å), diverse sequences
(<40% sequence identity) and lack of secondary structure in the loops, amongst
other criteria (Figure 3.4).
Dataset 3 was designed by Mandell et al. (2009) to evaluate loop reconstruction
of four different proteins from the PDB crystallized with 18 different partner pro-
teins (Figure 3.5). The nature of the binding partner influences the conformation
of the loop and as such one can study if the loop reconstruction algorithm can
predict these functional changes in loop conformation.
WHAT IF dataset
Dataset 4 is a set of 223 crystallographic PDB chains with an R-factor below 0.19
and a resolution below 1.1 Å, harvested from the WHAT IF website
(http://swift.cmbi.ru.nl/gv/select/index.html); the atomic coordinates were
downloaded from the PDB (http://www.pdb.org). An additional step was applied
to filter this set to a maximum of 35 percent sequence identity over all structures.
This resulted in the final set of 210 structures. Loop segments with minimum length
4 were then extracted by running the DSSP12 secondary structure assignment
program and keeping sequence stretches other than helix (H), beta-sheet (E), 3/10
helix (G) and beta bridge (B), but with helix or beta-sheet on both flanks. This
resulted in 527 loops.
FREAD dataset
Dataset 5 is a set of 510 loops taken from Choi & Deane (2010) to compare LoopX
results with those of FREAD (Choi & Deane, 2010), RAPPER (de Bakker et al.,
2003), MODELLER (Fiser et al., 2000) and PLOP (Jacobson et al., 2004). The
compilation protocol of this set is explained in detail in Choi & Deane (2010).
PDZ dataset
Dataset 6 is a set of carboxylate-binding loops from 10 high-resolution PDZ struc-
tures (Figure 3.7). We selected these PDZ domains based on (1) structural variety
adopted by the loop that is involved in binding peptide ligands and (2) sequence
diversity.
3.3.4 LoopX Webserver
LoopX is freely available for academic users as a webserver at http://loopx.crg.es.
To illustrate the usage of the website, a video tutorial is viewable online at
http://loopx.crg.es/content/help. In the tutorial, we show the reconstruction of
the PDZ carboxylate-binding loop for PDB 2I1N.
3.4 Discussion
We demonstrate that it is possible to have both fast and accurate modeling of loops
in proteins using a fragment database and an all-atom empirical force field, showing
that for loops up to length 12 (95% of all loops in dataset 4, Supplementary Table
9) computationally expensive ab initio methods are not needed. Moreover, we
demonstrate that this approach can be used to describe the entire conformational
ensemble adopted by protein loops and induced by ligand binding. This new
method that combines speed and accuracy could be used not only for homology
modeling, but also to perform protein design and study conformational ensembles
adopted by protein loops.
As the number of deposited structures in the Protein Data Bank increases,
database search methods for loop structure prediction gain in power, and we are
in the process of using the entire PDB (approximately 68,000 structures as of
September 2010) for fast and accurate loop modeling.
Author Contributions
F.R., J.S., and L.S. conceptualized the study. L.B. developed the first version of the
LoopX algorithm and performed preliminary analysis published in her thesis work
(Baeten, 2010). P.V. developed the current version of the LoopX algorithm. P.V.
developed the web server. J.V.D. and P.V. performed the experiments. J.V.D.,
P.V., F.S., F.R., J.S. and L.S. performed the analysis. J.V.D., P.V. and E.V. wrote the
manuscript.
References
Baeten, L. (2010). Reconstruction of protein
structures from polypeptide fragment li-
braries. PhD Thesis (Free University of Brus-
sels), 1–149. 82
Benner, S.A., Cohen, M.A. & Gonnet, G.H.
(1993). Empirical and structural models for
insertions and deletions in the divergent evo-
lution of proteins. Journal of Molecular Biol-
ogy, 229, 1065–82. 64
Berman, H.M., Westbrook, J., Feng, Z., Gilliland,
G., Bhat, T.N., Weissig, H., Shindyalov, I.N. &
Bourne, P.E. (2000). The protein data bank.
Nucleic Acids Research, 28, 235–42. 66
Burke, D.F., Deane, C.M. & Blundell, T.L. (2000).
Browsing the sloop database of structurally
classified loops connecting elements of pro-
tein secondary structure. Bioinformatics, 16,
513–9. 64, 65
Choi, Y. & Deane, C.M. (2010). Fread revis-
ited: Accurate loop structure prediction us-
ing a database search algorithm. Proteins,
78, 1431–40. 72, 74, 75, 79, 81
de Bakker, P.I.W., DePristo, M.A., Burke, D.F. &
Blundell, T.L. (2003). Ab initio construction
of polypeptide fragments: Accuracy of loop
decoy discrimination by an all-atom statisti-
cal potential and the amber force field with
the generalized born solvation model. Pro-
teins, 51, 21–40. 72, 81
Donate, L.E., Rufino, S.D., Canard, L.H. & Blun-
dell, T.L. (1996). Conformational analysis
and clustering of short and medium size
loops connecting regular secondary struc-
tures: a database for modeling and predic-
tion. Protein Sci, 5, 2600–16. 65
Du, P., Andrec, M. & Levy, R.M. (2003). Have we
seen all structures corresponding to short
protein fragments in the protein data bank?
an update. Protein Eng, 16, 407–14. 66
Espadaler, J., Fernandez-Fuentes, N., Hermoso, A.,
Querol, E., Aviles, F.X., Sternberg, M.J.E. &
Oliva, B. (2004). Archdb: automated pro-
tein loop classification as a tool for struc-
tural genomics. Nucleic Acids Research, 32,
D185–8. 65
Felts, A.K., Gallicchio, E., Chekmarev, D., Paris,
K.A., Friesner, R.A. & Levy, R.M. (2008). Pre-
diction of protein loop conformations using
the agbnp implicit solvent model and torsion
angle sampling. Journal of chemical theory
and computation, 4, 855–868. 64
Fernandez-Fuentes, N. & Fiser, A. (2006). Sat-
urating representation of loop conforma-
tional fragments in structure databanks.
BMC Struct Biol, 6, 15. 66
Fernandez-Fuentes, N., Oliva, B. & Fiser, A.
(2006). A supersecondary structure library
and search algorithm for modeling loops in
protein structures. Nucleic Acids Research,
34, 2085–97. 65
Fetrow, J.S. (1995). Omega loops: nonregular
secondary structures significant in protein
function and stability. FASEB J, 9, 708–17.
64
Fiser, A., Do, R.K. & Sali, A. (2000). Modeling
of loops in protein structures. Protein Sci, 9,
1753–73. 64, 72, 80, 81
Heuser, P., Wohlfahrt, G. & Schomburg, D.
(2004). Efficient methods for filtering and
ranking fragments for the prediction of struc-
turally variable regions in proteins. Proteins,
54, 583–95. 64, 65
Jacobson, M.P., Pincus, D.L., Rapp, C.S., Day,
T.J.F., Honig, B., Shaw, D.E. & Friesner, R.A.
(2004). A hierarchical approach to all-atom
protein loop prediction. Proteins, 55, 351–
67. 64, 72, 81
Jiang, L., Althoff, E.A., Clemente, F.R., Doyle,
L., Rothlisberger, D., Zanghellini, A., Gallaher,
J.L., Betker, J.L., Tanaka, F., Barbas, C.F., Hil-
vert, D., Houk, K.N., Stoddard, B.L. & Baker,
D. (2008). De novo computational design
of retro-aldol enzymes. Science, 319, 1387–
1391. 64
Jones, E.Y., Tormo, J., Reid, S.W. & Stuart, D.I.
(1998). Recognition surfaces of mhc class i.
Immunol Rev, 163, 121–8. 64
Kaneko, T., Huang, H., Zhao, B., Li, L., Liu, H.,
Voss, C.K., Wu, C., Schiller, M.R. & Li, S.S.C.
(2010). Loops govern sh2 domain specificity
by controlling access to binding pockets. Sci-
ence Signaling, 3, ra34. 64
Lawson, Z. & Wheatley, M. (2004). The third ex-
tracellular loop of g-protein-coupled recep-
tors: more than just a linker between two
important transmembrane helices. Biochem
Soc Trans, 32, 1048–50. 64
Levitt, M. (2007). Growth of novel protein
structural data. Proceedings of the National
Academy of Sciences of the United States of
America, 104, 3183–8. 66
Lu, Y. & Valentine, J.S. (1997). Engineering
metal-binding sites in proteins. Curr Opin
Struct Biol, 7, 495–500. 64
Mandell, D.J., Coutsias, E.A. & Kortemme,
T. (2009). Sub-angstrom accuracy in pro-
tein loop reconstruction by robotics-inspired
conformational sampling. Nat Methods, 6,
551–2. 65, 66, 67, 68, 69, 70, 74, 80
Nourry, C., Grant, S.G.N. & Borg, J.P. (2003).
Pdz domain proteins: plug and play! Sci
STKE, 2003, RE7. 70
Oliva, B., Bates, P.A., Querol, E., Avilés, F.X. &
Sternberg, M.J. (1997). An automated clas-
sification of the structure of protein loops.
Journal of Molecular Biology, 266, 814–30.
65
Penkert, R.R., DiVittorio, H.M. & Prehoda, K.E.
(2004). Internal recognition through pdz do-
main plasticity in the par-6-pals1 complex.
Nature structural & molecular biology, 11,
1122–7. 70
Prodromou, C., Roe, S.M., O’Brien, R., Ladbury,
J.E., Piper, P.W. & Pearl, L.H. (1997). Identifi-
cation and structural characterization of the
atp/adp-binding site in the hsp90 molecular
chaperone. Cell, 90, 65–75. 64
Rapp, C.S. & Friesner, R.A. (1999). Prediction
of loop geometries using a generalized born
model of solvation effects. Proteins, 35, 173–
83. 64
Redondo, P., Prieto, J., Muñoz, I.G., Alibés,
A., Stricher, F., Serrano, L., Cabaniols,
J.P., Daboussi, F., Arnould, S., Perez, C.,
Duchateau, P., Paques, F., Blanco, F.J. & Mon-
toya, G. (2008). Molecular basis of xero-
derma pigmentosum group c dna recogni-
tion by engineered meganucleases. Nature,
456, 107–11. 64
Rini, J.M., Schulze-Gahmen, U. & Wilson, I.A.
(1992). Structural evidence for induced fit
as a mechanism for antibody-antigen recog-
nition. Science, 255, 959–65. 64
Schymkowitz, J., Borg, J., Stricher, F., Nys, R.,
Rousseau, F. & Serrano, L. (2005). The foldx
web server: an online force field. Nucleic
Acids Research, 33, W382–8. 66, 78
Siegel, J.B., Zanghellini, A., Lovick, H.M., Kiss, G.,
Lambert, A.R., Clair, J.L.S., Gallaher, J.L., Hil-
vert, D., Gelb, M.H., Stoddard, B.L., Houk,
K.N., Michael, F.E. & Baker, D. (2010). Com-
putational design of an enzyme catalyst for
a stereoselective bimolecular diels-alder re-
action. Science, 329, 309–13. 64
Spassov, V.Z., Flook, P.K. & Yan, L. (2008).
Looper: a molecular mechanics-based algo-
rithm for protein loop prediction. Protein Eng
Des Sel, 21, 91–100. 64
Strynadka, N.C. & James, M.N. (1989). Crystal
structures of the helix-loop-helix calcium-
binding proteins. Annu Rev Biochem, 58,
951–98. 64
Todd, A.E., Orengo, C.A. & Thornton, J.M.
(2001). Evolution of function in protein su-
perfamilies, from a structural perspective. J
Mol Biol, 307, 1113–43. 64
van der Sloot, A.M., Kiel, C., Serrano, L. &
Stricher, F. (2009). Protein design in biolog-
ical networks: from manipulating the input
to modifying the output. Protein Eng Des Sel,
22, 537–42. 64
Vanhee, P., Verschueren, E., Baeten, L., Stricher,
F., Serrano, L., Rousseau, F. & Schymkowitz, J.
(2011). Brix: a database of protein building
blocks for structural analysis, modeling and
design. Nucleic Acids Research, 39, D435–
42. 65, 66
Wang, C., Bradley, P. & Baker, D. (2007).
Protein-protein docking with backbone flex-
ibility. J Mol Biol, 373, 503–19. 68, 69
Wojcik, J., Mornon, J.P. & Chomilier, J.
(1999). New efficient statistical sequence-
dependent structure prediction of short to
medium-sized protein loops based on an
exhaustive loop classification. Journal of
Molecular Biology, 289, 1469–90. 65
Xiang, Z., Soto, C.S. & Honig, B. (2002). Eval-
uating conformational free energies: the
colony energy and its application to the
problem of loop prediction. Proceedings of
the National Academy of Sciences of the
United States of America, 99, 7432–7. 64
Zhu, K., Pincus, D.L., Zhao, S. & Friesner, R.A.
(2006). Long loop prediction using the pro-
tein local optimization program. Proteins,
65, 438–52. 80
4 The structural landscape of protein-peptide
interactions
This chapter is based on
PepX: a structural database of non-redundant protein-peptide complexes. Peter Vanhee,
Joke Reumers, Francois Stricher, Lies Baeten, Luis Serrano, Joost Schymkowitz, Frederic Rousseau.
Nucleic Acids Research, January 2010.
Although protein-peptide interactions are estimated to constitute up to 40% of all
protein interactions, relatively little information is available for the structural
details of these interactions. A reliable data set of non-redundant protein-peptide
complexes is indispensable as a basis for modeling and design, but current data
sets for protein-peptide interactions are often biased towards specific types of
interactions or are limited to interactions with small ligands. We have designed
PepX (http://pepx.switchlab.org), an unbiased and exhaustive data set of all
protein-peptide complexes available in the Protein Data Bank with peptide lengths
up to 35 residues. In addition, these complexes have been clustered based on their
binding interfaces rather than sequence homology, providing a set of structurally
diverse protein-peptide interactions. The final data set contains 505 unique protein-
peptide interface clusters from 1431 complexes. Thorough annotation of each
complex with both biological and structural information facilitates searching for
and browsing through individual complexes and clusters. Moreover, we provide
an additional source of data for peptide design by annotating peptides with naturally
occurring backbone variations using fragment clusters from the BriX database.
4.1 Introduction
A growing number of interactions are known to be mediated by short linear pep-
tides (Neduva & Russell, 2006). It is estimated that 15 to 40 % of all interactions in
the cell are protein-peptide interactions (Neduva et al., 2005; Petsalaki & Russell,
2008), which indicates that a large portion of the proteome is either directly or
indirectly involved in peptide-binding events. Peptide-mediated interactions are
normally short-lived and therefore found most in signaling and regulatory networks
where fast response to stimuli is required (Pawson & Nash, 2003). Many databases
have been implemented that assemble the sequence patterns involved in such in-
teractions, such as the Eukaryotic Linear Motif (ELM) database (Puntervoll, 2003),
PROSITE (Hulo et al., 2006) and SCANSITE (Obenauer et al., 2003).
Unfortunately, the estimated abundance of protein-peptide interactions from
the genome is not reflected in the number of available three dimensional protein-
peptide complexes. While many protein-protein and protein-domain interaction
databases with structural annotations exist (Gong et al., 2005; Ogmen et al., 2005;
Chen et al., 2007; Raghavachari et al., 2008; Jensen et al., 2009), only few of them
explicitly consider protein-peptide interactions (Stein et al., 2009). Moreover, focus
on specific types of peptide interactions (PDZ domains, SH3 domains) has biased
the content of structural databases. Grouping of 3D structures of protein-peptide
complexes into functional modules has been established by several methods, such
as using ELM patterns (e.g. 3did (Stein et al., 2009)) and multiple sequence
alignment of the ligands (e.g. FireDB (Lopez et al., 2007)). Additionally, specialized
databases focusing on a specific functional group have been published, such as
PROCOGNATE for enzyme complexes (Bashton et al., 2008), MPID-T for T-cell
receptors (Tong et al., 2006) and the HMRBase for hormone-receptor data (Rashid
et al., 2009). For a detailed list with related databases see Table 4.1. In contrast,
our objective was to build an unbiased collection of non-redundant peptide binding
sites, where grouping is based solely on three-dimensional similarity and no bias
for functional relations or sequence similarity is introduced.
Figure 4.1 pie chart values: MHC 14%; thrombin 12%; α-ligand binding domain 8%;
protein kinase A 5%; chymotrypsin 2%; streptavidin 2%; trypsin 1%; SH3 1%;
HIV-1 protease 1%; HIV-1 antibody 1%; mdm2 1%; remainder 52%.
Figure 4.1: Contents of PepX. Distribution of 1431 protein-peptide interactions after
clustering on the architecture of the binding site. 48% of the protein-peptide com-
plexes are classified in 10 classes with more than 5 members, while the remaining
52% contain less frequent structural binding modes.
To this end we have mined the Brookhaven Protein Data Bank (PDB) for
protein-peptide complexes using strict quality criteria, and thus obtained 1431
high-resolution 3D structures (see Section 4.2.1 for details on the selection procedure).
These complexes were clustered based on three-dimensional similarity into 505
unique protein-peptide interface clusters, representing the full structural diversity
of protein-peptide complexes available in the PDB. The aforementioned bias for
specific peptide interactions is demonstrated in the further clustering of these
complexes. 47% of all protein-peptide complexes available from the PDB are clustered
within only 10 classes, containing complexes with peptides bound to Major
Histocompatibility Complex (MHC) (14%), thrombins (12%), α-ligand binding domains
(8%), protein kinase A, chymotrypsin, streptavidin, trypsin, SH3 domains, HIV-1
protease, HIV-1 antibody and mdm2 (Figure 4.1 and Figure 4.2).
Database | Description | Ref.
3did | 3D interacting domains: domain-domain and domain-peptide (ELM based) | (Stein et al., 2009)
FireDB | PDB structures and associated ligands, annotated functionally important residues | (Lopez et al., 2007)
HMRBase | Hormones and receptors | (Rashid et al., 2009)
MPID-T | T-cell receptor/peptide/MHC interactions | (Tong et al., 2006)
PDB-Ligand | 3D structure database of small molecular ligands that are bound to larger biomolecules | (Shin & Cho, 2005)
PLD | Biomolecular data, including binding energies, Tanimoto ligand similarity scores and protein sequence similarities of protein-ligand complexes | (Puvanendrampillai & Mitchell, 2003)
PROCOGNATE | Protein cognate ligands for the domains in enzyme structures (from CATH, SCOP and Pfam) | (Bashton et al., 2008)
SuperLigands | Descriptions of PDB ligand structures | (Michalsky et al., 2005)
Voronoia | Calculating packing densities of proteins and ligands | (Rother et al., 2009)
MOAD | 6000+ protein-ligand complexes, annotated with binding data | (Smith et al., 2006)
SCOWLP | Structural protein-peptide complexes clustered on binding interface | (Teyra et al., 2008)
peptiDB | 103 non-redundant protein-peptide complexes | (London et al., 2010)
PepX | 1431 protein-peptide complexes clustered in 505 non-redundant peptide binding modes | (Vanhee et al., 2010)
Table 4.1: Public databases of protein-ligand complexes.
4.2 Contents of the PepX database
4.2.1 Construction of a non-redundant data set of protein-peptide
complexes
[Figure 4.2: four panels, each with superposed cluster structures and a sequence
logo per peptide position. Panel titles: (A) MHC-0 (168 peptides), (B) LBD-0
(111 peptides), (C) PTS-0 (10 peptides), (D) SH3-0 (17 peptides).]
Figure 4.2: Examples of protein-peptide clusters from PepX. The peptides are
shown in red and the peptide cluster centroid in green. WebLogos show the peptide's
sequence diversity as observed in the crystals. For (C) and (D) the surface of the protein
receptor is given with hydrophobic residues colored red and hydrophilic residues blue.
(A) class I MHC bound to peptide (169 structures), without a clear sequence motif. (B)
estrogen receptor α-ligand binding domain bound to peptide (111 structures), with
the LxxLL sequence motif (x for any amino acid). (C) Peroxisomal Targeting Signal
1 (PTS1) binding domain with peptide (10 structures) and (D) SH3 domain-peptide
interaction (17 structures).
We have filtered the Brookhaven Protein Data Bank (PDB) (Kouranov et al., 2006)
for protein-peptide complexes requiring (1) X-ray structures with a resolution
better than 2.5 Å, (2) peptides with a size from 5 to 35 amino acids, (3) peptides
containing natural amino acids only, (4) receptors with a minimum size of 35
amino acids, and (5) the first unit in the PDB in case of crystallographic symmetry.
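The five selection criteria can be read as a single predicate over candidate complexes. The sketch below is only an illustration of that reading: the `Complex` record, its field names and the `passes_filters` function are hypothetical and are not taken from the PepX code base, and the method string is assumed to follow the usual PDB experiment label.

```python
from dataclasses import dataclass

# One-letter codes of the 20 natural amino acids.
NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Complex:
    """Hypothetical record describing one candidate protein-peptide complex."""
    method: str            # experimental method, e.g. "X-RAY DIFFRACTION"
    resolution: float      # resolution in angstrom
    peptide_seq: str       # one-letter peptide sequence; other letters mark non-natural residues
    receptor_length: int   # number of residues in the receptor
    is_first_unit: bool    # True for the first unit in the PDB entry


def passes_filters(c: Complex) -> bool:
    """Apply criteria (1) to (5) from the text to one candidate complex."""
    return (
        c.method == "X-RAY DIFFRACTION"            # (1) X-ray structure ...
        and c.resolution < 2.5                     #     ... below 2.5 angstrom
        and 5 <= len(c.peptide_seq) <= 35          # (2) peptide of 5 to 35 residues
        and set(c.peptide_seq) <= NATURAL_AA       # (3) natural amino acids only
        and c.receptor_length >= 35                # (4) receptor of at least 35 residues
        and c.is_first_unit                        # (5) first unit under crystallographic symmetry
    )
```

Keeping all criteria in one function makes it straightforward to rerun the selection with other cut-offs, for example a different resolution limit.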
1431 complexes were retained and clustered on their binding architecture using an
adaptation of the Hierarchical Agglomeration algorithm used for constructing BriX
(Baeten et al., 2008), a database of protein fragments (see Chapter 2). RMSD be-
tween any two complexes superposed on backbone Cα atoms has been computed
using MUSTANG to allow for structural alignment of unrelated protein structures
(Konagurthu et al., 2006). Any two structures are grouped together if they su-
perpose below 2 Å RMSD for at least 75% of their interfaces. In this way, we
retained 505 unique protein-peptide interface clusters. Furthermore, we clustered
the protein-peptide complexes using RMSD values of 1 Å, 2 Å and 3 Å combined
with structural alignment of 50%, 75% and 95% of the interfaces. The clusters
vary slightly depending on those parameters. The distribution of the number of
elements in the PepX clusters for various thresholds of structural similarity and
structural alignment of the binding site is shown in Figure 4.3. For all settings
most clusters contain only one complex: 64% of all clusters are singletons for
thresholds of 3 Å and 50% alignment (Figure 4.3 A), whereas 87% of all clusters
resulting from 1 Å RMSD and 95% alignment (Figure 4.3 C) contain only one
element.
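The pairwise grouping rule (RMSD below a threshold over a minimum fraction of the interface) can be sketched as follows. This is a deliberately simplified stand-in, not the adapted Hierarchical Agglomeration algorithm used for PepX: it assumes the pairwise RMSD and aligned-fraction values have already been computed (for instance with MUSTANG) and merges structures by single linkage with a union-find structure. The function names and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple


def similar(rmsd: float, coverage: float,
            max_rmsd: float = 2.0, min_coverage: float = 0.75) -> bool:
    """Pairwise criterion: RMSD (angstrom) below max_rmsd over at least
    min_coverage of the interface."""
    return rmsd < max_rmsd and coverage >= min_coverage


def group(ids: List[str],
          pairs: Dict[Tuple[str, str], Tuple[float, float]],
          max_rmsd: float = 2.0,
          min_coverage: float = 0.75) -> List[List[str]]:
    """Single-linkage grouping of structures from precomputed pairwise
    (rmsd, coverage) values. A simplification of the thesis procedure."""
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), (rmsd, cov) in pairs.items():
        if similar(rmsd, cov, max_rmsd, min_coverage):
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)
```

Calling `group` repeatedly with `max_rmsd` in (1.0, 2.0, 3.0) and `min_coverage` in (0.50, 0.75, 0.95) mimics the parameter scan described above, although single linkage may merge more aggressively than the hierarchical procedure.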
[Figure 4.3: three histograms of the number of clusters versus cluster size, for
50% (A), 75% (B) and 95% (C) structural alignment, each at cluster thresholds of
1 Å, 2 Å and 3 Å.]
Figure 4.3: Distribution of number of elements in the PepX clusters. Distribution
is shown for various thresholds of structural similarity (1-2-3 Å) and binding site
alignment (50% (A), 75% (B) and 95% (C)). For all settings the largest number of
clusters contains only one complex, going from 63% of all clusters (50% and 3Å, A) to
87% of all clusters (95% and 1Å, C).
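The quantities plotted in Figure 4.3 are simple to derive from any list of clusters. The following sketch, with hypothetical function names and clusters represented as plain lists of identifiers, computes the cluster-size distribution and the fraction of singleton clusters.

```python
from collections import Counter
from typing import Dict, List


def size_distribution(clusters: List[List[str]]) -> Dict[int, int]:
    """Map each cluster size to the number of clusters of that size."""
    return dict(Counter(len(c) for c in clusters))


def singleton_fraction(clusters: List[List[str]]) -> float:
    """Fraction of clusters that contain exactly one complex."""
    if not clusters:
        return 0.0
    return sum(1 for c in clusters if len(c) == 1) / len(clusters)
```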
4.2.2 Statistics on structural protein-peptide complexes
The upper threshold for the peptide length was set to 35 amino acids, but the
majority of the peptides are between 5 and 15 residues long, with a peak at 9
residues (Figure 4.4A). The size of receptors varies between 67 and 2552 residues,
and the largest fraction lies in the [200-400] range (Figure 4.4B).
[Figure 4.4: two histograms. (A) % of ligands versus ligand length (5 to 35
residues). (B) % of receptors versus receptor length (bins from 100 to 1000
residues, and more).]
Figure 4.4: Distribution of peptide and receptor size. (A) The smallest peptide
considered is 5 amino acids long, the longest consists of 35 residues. About 70% of all
peptides lie within the [5-15] residue range. (B) The largest protein in the complexes
contains 2552 amino acid residues; the shortest considered is 35 residues long. Most
proteins are smaller than 600 residues, with a peak in the [200-400] range.
The receptor sequences in the PepX database were clustered with the cd-hit
algorithm (Li & Godzik, 2006) for various thresholds, resulting in datasets where
sequences with 40-100% sequence identity are removed (Figure 4.5). Although
there is large sequence redundancy within the database (removing sequences
with more than 40% sequence identity results in removing more than 70% of all
complexes in the database), this does not always reflect a redundancy in binding
modes. For instance, Major Histocompatibility Complexes have high sequence
identity but bind a wide range of peptides in different modes (Collins et al., 1994;
Elliott & Neefjes, 2006). Preliminary analysis of the sequence redundancy in
the full complex dataset versus the dataset with cluster centroids revealed that
using geometric properties for clustering removes most sequence identity without
discarding relevant structural binding motifs.
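A redundancy scan of this kind can be scripted around the cd-hit command line. The sketch below only builds the command for each identity threshold; the `-i`, `-o`, `-c` and `-n` options and the word sizes per threshold range follow the cd-hit user guide, whereas the file names and the helper functions are hypothetical, and the exact options used for PepX are not stated in the text.

```python
from typing import List


def word_size(identity: float) -> int:
    """Word length (-n) suggested by the cd-hit user guide for a protein
    sequence identity threshold (-c)."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("protein cd-hit expects an identity threshold in [0.4, 1.0]")
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    return 2


def cdhit_command(fasta: str, out: str, identity: float) -> List[str]:
    """Build the cd-hit argument list for one identity threshold."""
    return ["cd-hit", "-i", fasta, "-o", out,
            "-c", f"{identity:.2f}", "-n", str(word_size(identity))]


if __name__ == "__main__":
    # Hypothetical input file; thresholds as in Figure 4.5.
    for identity in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
        out = f"receptors_nr{int(round(identity * 100))}"
        print(" ".join(cdhit_command("receptors.fasta", out, identity)))
```

Each printed command can be passed to `subprocess.run` once cd-hit is installed; counting the sequences in each output file then gives the fraction of structures retained per threshold.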
[Figure 4.5: % of structures retained versus cd-hit identity threshold (100, 90,
80, 70, 60, 50, 40%). Centroids: 74, 71, 70, 70, 69, 67, 66. All complexes: 35,
32, 30, 29, 29, 28, 26.]
Figure 4.5: Receptor sequence redundancy within the PepX database. The
receptor sequences in the PepX database were clustered with the cd-hit algorithm for
various thresholds of sequence identity, from removing identical sequences up to 40%
sequence identity. Although there is large sequence redundancy within the database,
this does not always reflect a redundancy in binding modes. For instance, removing
only identical sequences (100%) results in a loss of more than 60% of all complexes
and more than 20% of the centroids, showing that some receptors bind in different
structural modes.
All receptors in protein-peptide complexes have been annotated with the
structural classifications SCOP (Andreeva et al., 2008) and CATH (Cuff et al., 2009)
based on the PDB identifier and chain of the receptor (Kouranov et al., 2006) and
with Pfam (Finn et al., 2008) based on the UniProt identifier (Consortium, 2009).
The coverage of PepX is highest for UniProt (82%), followed by structural classifi-
cations by CATH (71%) and SCOP (56%), and finally protein family annotation by
Pfam (50%) (Figure 4.6). Within these annotations, we have analysed in detail the
occurrence of PepX complexes in the various levels of the structural hierarchies
represented in SCOP and CATH. Although most SCOP classes are represented
by receptors in the database, protein-peptide complexes do not represent the full
range of SCOP folds (8%), superfamilies (6%) and families (4%) (Figure 4.7A).
When we look at the distribution of receptors in the different SCOP classes with
respect to the distribution of PDB structures in the full SCOP database (Figure
4.8), we see that in PepX the all-β and α + β classes are clearly overrepresented
(30 versus 24% for the all-β class, 38 versus 25% for the α + β class, respectively).
Similar results are obtained for the CATH classifications: the complexes represent
every CATH class, and architectures are highly represented as well (Figure 4.7B). In
contrast, at lower CATH levels, less than 10% of both topologies and superfamilies
hold at least one protein-peptide complex. In accordance with the SCOP analysis,
classes with mainly β-structures are largely overrepresented in PepX. Alpha and
beta structures are underrepresented (35% in PepX versus 52% full CATH). This is
also seen in SCOP when we merge the classes together (α/β and α+β), although
the difference is smaller (43% PepX versus 49% full SCOP).
[Figure 4.6: bar chart of the % of structures per annotation source, with the
number of annotated receptors above each bar: SCOP 795, CATH 1016, Pfam 715,
UniProt 1168.]
Figure 4.6: PepX Annotations. Percentage of receptors in the PepX database repre-
sented by different annotations: SCOP, CATH, Pfam and UniProt. Coverage is highest
for UniProt, followed by structural classifications by CATH and SCOP, and finally
protein family annotation by Pfam.
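The coverage percentages quoted in the text can be checked against the bar labels of Figure 4.6. The sketch below assumes that those labels (795, 1016, 715 and 1168) are receptor counts out of the 1431 complexes in PepX; that reading is an inference from the figure, and the variable names are illustrative.

```python
TOTAL_COMPLEXES = 1431

# Bar labels of Figure 4.6, read as numbers of annotated receptors (assumption).
ANNOTATED = {"SCOP": 795, "CATH": 1016, "Pfam": 715, "UniProt": 1168}


def coverage(counts, total):
    """Percentage of complexes carrying each annotation."""
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    pct = coverage(ANNOTATED, TOTAL_COMPLEXES)
    for name, value in sorted(pct.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:8s} {value:5.1f}%")
```

Rounded to whole percentages, these ratios reproduce the 82%, 71%, 56% and 50% reported for UniProt, CATH, SCOP and Pfam.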
Figure 4.7: Representation of the SCOP and CATH hierarchies in PepX. (A) Protein-
peptide complexes do not represent the full range of SCOP folds, superfamilies and
families. (B) Similarly, lower CATH levels are not very well represented in PepX.
[Figure 4.8: bar chart of the % of structures (0 to 45%) per SCOP class, for the
full SCOP database and for PepX. Classes: all alpha proteins; all beta proteins;
alpha and beta proteins (a/b); alpha and beta proteins (a+b); multi-domain
proteins (alpha and beta); membrane and cell surface proteins and peptides;
small proteins; coiled coil proteins; low resolution protein structures;
peptides; designed proteins.]
Figure 4.8: Distribution of PepX structures in the different SCOP classes. Whereas
the β, α/β and α+β classes are of similar size in the full SCOP database, in PepX the
all-β and α+β classes are clearly overrepresented.
4.2.3 Ligand annotation with structural variants for peptide design
Given the scarcity of protein-peptide structures and their obvious relevance in
drug design (Ballinger et al., 1999; Reina et al., 2002; Yin et al., 2007; Parthasarathi
et al., 2008; van der Sloot et al., 2009) (see also Section 1.1.5), we provide an
additional service for peptide design. Since it was recently shown that protein-
peptide interactions can be reliably mimicked using interacting fragments from
monomeric proteins (Vanhee et al., 2009) (Chapter 5), it is possible to provide
structural variations of peptide ligands using protein fragments. Each ligand peptide
in the PepX data set is associated with its corresponding structural class from the
database of protein fragment classes, BriX (Baeten et al., 2008; Vanhee et al., 2011)
(Chapter 2). Sets of protein fragments with highly similar backbone structure are
grouped in these fragment classes. Each protein fragment class represents a natural
variation on a typical backbone conformation. Mapped on protein-peptide pairs,
these structural classes can be used to model and design alternative peptides with
slightly adapted backbone conformation that better fit given amino acid sequences.
In Chapter 6 we show the application of BriX and InteraX towards modeling peptide
complexes from the PepX database.
4.3 Database Access
4.3.1 Database Availability
PepX is accessible through a web portal at http://pepx.switchlab.org. We
recorded usage statistics since the database inauguration in July 2009 and statistics
until October 2010 are shown in Figure 4.9. The full database with annotations is
available for download both in SQL format and as flat files. The entire dataset of
1431 PDBs with binding site residues and the equivalent centroid dataset of 505
binding sites can be downloaded. The PepX web server is implemented using the
Drupal Content Management system (http://drupal.org).
[Figure 4.9: Google Analytics overview with monthly visits from July 2009 to
October 2010. 2,789 visits from 51 countries; 9,929 pageviews; 3.56 pages per
visit; 62.14% bounce rate; 00:04:13 average time on site; 32.56% new visits.
Traffic sources: referring sites 1,001 (35.89%), search engines 941 (33.74%),
direct traffic 847 (30.37%).]
Figure 4.9: PepX usage statistics. Statistics were recorded since July 1st, 2009 and
run until October 31st, 2010. Provided by Google Analytics.
All information contained within the PepX database is exposed as XML
(Extensible Markup Language). When certain URLs are visited, an XML file with the
requested data is returned, following the REST interface for data exchange. For
example, calling the URL
http://pepx.switchlab.org/clusters.xml?threshold=2&alignment=75
serves an XML file with a description of the clusters for threshold
2 Å and an alignment of 75%. The XML interface is implemented for clusters, PDBs
and BriX classes providing backbone variations on the peptides.
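A client for the cluster endpoint could look like the sketch below. Only the URL pattern with its `threshold` and `alignment` parameters comes from the text; the function names are hypothetical, the layout of the returned XML is not documented here, and the server may no longer be reachable, so the code simply returns the parsed root element without assuming any tag names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://pepx.switchlab.org"


def clusters_url(threshold: int, alignment: int) -> str:
    """Build the cluster URL for an RMSD threshold (angstrom) and an
    interface alignment percentage."""
    query = urlencode({"threshold": threshold, "alignment": alignment})
    return f"{BASE_URL}/clusters.xml?{query}"


def fetch_clusters(threshold: int, alignment: int, timeout: float = 30.0) -> ET.Element:
    """Download and parse the cluster description; returns the XML root.
    The schema of the document is not assumed."""
    with urlopen(clusters_url(threshold, alignment), timeout=timeout) as response:
        return ET.parse(response).getroot()
```

For instance, `clusters_url(2, 75)` yields the example URL given above, and `fetch_clusters(2, 75)` would return its parsed content if the service responds.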
4.3.2 User interface
Figure 4.10: PepX user flow. Searching for the keywords ‘thrombin’ and ‘inhibitor’
provides a list of hits. For the entry 1BTH the PepX cluster is shown, together
with 3D views of the complex and the binding site. Every complex is annotated
with binding energy calculated with FoldX, hydrogen bond interactions, secondary
structure content, direct links to relevant databases and peptide backbone variations
from BriX.
Extensive search and browse facilities are implemented for the PepX website.
Browsing the database can be performed at two levels: individual complex struc-
tures and clusters of complexes. In the latter case the user can choose the level
of similarity within one cluster by adjusting the root mean square deviation (RMSD)
between structures within one cluster and the percentage of structural alignment
between binding sites. The full PepX database can be searched through a sim-
ple Google-like search box, which uses a full index of all information contained
in the database (Figure 4.11A). The guided search allows searching the database
in specific subgroups, generated from the structural classifications and keywords
(Figure 4.11B). In addition, tag clouds of the structural annotations can be used to
generate specialized listings of protein-peptide complexes (Figure 4.11 C).
Figure 4.11: Search options in the PepX database. (A) A simple, Google-like search
on the contents of the database. The search is non-restrictive and accepts everything
from keywords to PDB identifiers. (B) Guided search through PepX using SCOP. (C)
Tag clouds of the PDB keywords associated with PepX structures.
For each individual complex several types of information are shown (Figure
4.10). Besides general information on the complex (PDB id, chains) and functional and
structural annotation of the protein (UniProt, SCOP, CATH), detailed structural
information about the interaction itself is displayed. The binding affinity for the
protein-peptide complex is calculated using the FoldX force field (Schymkowitz
et al., 2005), and details of the contribution of backbone and side chain hydrogen
bonds as well as the total binding energy are shown. The binding site is structurally
characterized using several metrics such as secondary structure content, and 3D
images of the binding site and the ligand itself were generated to illustrate the
specific parts of the protein contributing to the binding site. Furthermore, all the
clusters the complex takes part in are listed. Clicking on a specific cluster reveals
a detailed page containing information on the centroid complex of the cluster as
well as the list of all complexes belonging to the cluster.
4.4 Discussion
To date, PepX is the only dedicated resource for structural data on protein-peptide
complexes. Since its original publication in November 2009, many parts have
changed. For example, we have added sequence logos (Crooks et al., 2004) and
Position Weight Matrices (PWMs) (Obenauer et al., 2003) for every complex in the
database. A PWM captures the relative contribution of each amino acid at
each position in the peptide, as estimated using FoldX. PWMs intuitively represent
the sequence variety that is allowed within the backbone constraints, and could
therefore be used for peptide sequence optimization, for example. In Chapter 6, we
rely on PWMs to capture the different sequences that are allowed in the peptide
designs, matching peptide specificities obtained with the PWM to experimental
specificities as measured using phage display or peptide arrays.
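One way to turn per-position energies into a matrix of residue preferences is sketched below. The text does not specify how the FoldX energies are converted into PWM entries, so the Boltzmann weighting used here, the value of RT and all function names are assumptions made purely for illustration; the input is taken to be one dictionary per peptide position, mapping each amino acid to an energy difference in kcal/mol.

```python
import math
from typing import Dict, List

RT = 0.593  # kcal/mol at roughly 298 K (assumed temperature)


def column_probabilities(ddg: Dict[str, float], rt: float = RT) -> Dict[str, float]:
    """Boltzmann-weight the energies of one peptide position (assumed
    conversion; lower energy gives higher probability)."""
    lowest = min(ddg.values())
    # Shift by the lowest energy to keep the exponentials well scaled.
    weights = {aa: math.exp(-(e - lowest) / rt) for aa, e in ddg.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def pwm(energies: List[Dict[str, float]], rt: float = RT) -> List[Dict[str, float]]:
    """One probability column per peptide position."""
    return [column_probabilities(column, rt) for column in energies]
```

Each column sums to one, so positions with a single strongly favoured residue give a sharply peaked column while tolerant positions give a flat one.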
Another change in PepX involves the annotation of peptides that are internally
stabilized using one or more disulfide bonds. Since they will likely adopt a stable
fold in isolation, they are better qualified as 'mini-proteins'. In total, 48 out of
1428 peptides contain such a disulfide bridge.
In Chapter 5 we describe how we used PepX as the basis to research structural
properties of peptide interactions, and in particular, their relation to the architecture
of monomeric proteins. Finally, in Chapter 6, we describe the use of motifs from
PepX to benchmark the peptide design algorithm.
Author Contributions
P.V., F.R., J.S., and L.S. conceptualized the study. P.V. developed the PepX
database. P.V., J.R. and F.S. performed the analysis. P.V. developed the web-
site. P.V. and J.R. wrote the paper.
References
Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J.P., Chothia, C. & Murzin, A.G. (2008). Data growth and its impact on the SCOP database: new developments. Nucleic Acids Research, 36, D419–25.
Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the BriX collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
Ballinger, M.D., Shyamala, V., Forrest, L.D., Deuter-Reinhard, M., Doyle, L.V., Wang, J.X., Panganiban-Lustan, L., Stratton, J.R., Apell, G., Winter, J.A., Doyle, M.V., Rosenberg, S. & Kavanaugh, W.M. (1999). Semirational design of a potent, artificial agonist of fibroblast growth factor receptors. Nature Biotechnology, 17, 1199–204.
Bashton, M., Nobeli, I. & Thornton, J.M. (2008). PROCOGNATE: a cognate ligand domain mapping for enzymes. Nucleic Acids Research, 36, D618–22.
Chen, Y.C., Lo, Y.S., Hsu, W.C. & Yang, J.M. (2007). 3D-partner: a web server to infer interacting partners and binding models. Nucleic Acids Research, 35, W561–7.
Collins, E.J., Garboczi, D.N. & Wiley, D.C. (1994). Three-dimensional structure of a peptide extending from one end of a class I MHC binding site. Nature, 371, 626–9.
Consortium, U. (2009). The Universal Protein Resource (UniProt) 2009. Nucleic Acids Research, 37, D169–74.
Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. (2004). WebLogo: a sequence logo generator. Genome Res, 14, 1188–90.
Cuff, A.L., Sillitoe, I., Lewis, T., Redfern, O.C., Garratt, R., Thornton, J. & Orengo, C.A. (2009). The CATH classification revisited–architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Research, 37, D310–4.
Elliott, T. & Neefjes, J. (2006). The complex route to MHC class I-peptide complexes. Cell, 127, 249–51.
Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, S.J., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L.L. & Bateman, A. (2008). The Pfam protein families database. Nucleic Acids Research, 36, D281–8.
Gong, S., Yoon, G., Jang, I., Bolser, D., Dafas, P., Schroeder, M., Choi, H., Cho, Y., Han, K., Lee, S., Choi, H., Lappe, M., Holm, L., Kim, S., Oh, D. & Bhak, J. (2005). PSIbase: a database of protein structural interactome map (PSIMAP). Bioinformatics, 21, 2541–3.
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Castro, E.D., Langendijk-Genevaux, P.S., Pagni, M. & Sigrist, C.J.A. (2006). The PROSITE database. Nucleic Acids Res, 34, D227–30.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M., Bork, P. & von Mering, C. (2009). STRING 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research, 37, D412–6.
Konagurthu, A.S., Whisstock, J.C., Stuckey, P.J. & Lesk, A.M. (2006). MUSTANG: a multiple structural alignment algorithm. Proteins, 64, 559–74.
Kouranov, A., Xie, L., de la Cruz, J., Chen, L., Westbrook, J., Bourne, P.E. & Berman, H.M. (2006). The RCSB PDB information portal for structural genomics. Nucleic Acids Research, 34, D302–5.
Li, W. & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–9.
London, N., Movshovitz-Attias, D. & Schueler-Furman, O. (2010). The structural basis of peptide-protein binding strategies. Structure, 18, 188–199.
Lopez, G., Valencia, A. & Tress, M. (2007). FireDB–a database of functionally important residues from proteins of known structure. Nucleic Acids Research, 35, D219–23.
Michalsky, E., Dunkel, M., Goede, A. & Preissner, R. (2005). SuperLigands - a database of ligand structures derived from the Protein Data Bank. BMC Bioinformatics, 6, 122.
Neduva, V. & Russell, R.B. (2006). Peptides mediating interaction networks: new leads at last. Curr Opin Biotechnol, 17, 465–71.
Neduva, V., Linding, R., Su-Angrand, I., Stark, A., de Masi, F., Gibson, T.J., Lewis, J., Serrano, L. & Russell, R.B. (2005). Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol, 3, e405.
Obenauer, J.C., Cantley, L.C. & Yaffe, M.B. (2003). Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Research, 31, 3635–41.
Ogmen, U., Keskin, O., Aytuna, A.S., Nussinov, R. & Gursoy, A. (2005). PRISM: protein interactions by structural matching. Nucleic Acids Research, 33, W331–6.
Parthasarathi, L., Casey, F., Stein, A., Aloy, P. & Shields, D.C. (2008). Approved drug mimics of short peptide ligands from protein interaction motifs. Journal of Chemical Information and Modeling, 48, 1943–8.
Pawson, T. & Nash, P. (2003). Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445–52.
Petsalaki, E. & Russell, R.B. (2008). Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin Biotechnol, 19, 344–50.
Puntervoll, P. (2003). ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Research, 31, 3625–3630.
Puvanendrampillai, D. & Mitchell, J.B.O. (2003). L/D Protein Ligand Database (PLD): additional understanding of the nature and specificity of protein-ligand complexes. Bioinformatics, 19, 1856–7.
Raghavachari, B., Tasneem, A., Przytycka, T.M. & Jothi, R. (2008). DOMINE: a database of protein domain interactions. Nucleic Acids Research, 36, D656–61.
Rashid, M., Singla, D., Sharma, A., Kumar, M. & Raghava, G.P.S. (2009). Hmrbase: a database of hormones and their receptors. BMC Genomics, 10, 307.
Reina, J., Lacroix, E., Hobson, S.D., Fernandez-Ballester, G., Rybin, V., Schwab, M.S., Serrano, L. & Gonzalez, C. (2002). Computer-aided design of a PDZ domain to recognize new target sequences. Nature Structural Biology, 9, 621–7.
Rother, K., Hildebrand, P.W., Goede, A., Gruening, B. & Preissner, R. (2009). Voronoia: analyzing packing in protein structures. Nucleic Acids Research, 37, D393–5.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The FoldX web server: an online force field. Nucleic Acids Research, 33, W382–8.
Shin, J.M. & Cho, D.H. (2005). PDB-Ligand: a ligand database based on PDB for the automated and customized classification of ligand-binding structures. Nucleic Acids Research, 33, D238–41.
Smith, R.D., Hu, L., Falkner, J.A., Benson, M.L., Nerothin, J.P. & Carlson, H.A. (2006). Exploring protein-ligand recognition with Binding MOAD. J Mol Graph Model, 24, 414–25.
Stein, A., Panjkovich, A. & Aloy, P. (2009). 3did update: domain-domain and peptide-mediated interactions of known 3D structure. Nucleic Acids Research, 37, D300–4.
Teyra, J., Paszkowski-Rogacz, M., Anders, G. & Pisabarro, M.T. (2008). SCOWLP classification: structural comparison and analysis of protein binding regions. BMC Bioinformatics, 9, 9.
Tong, J.C., Kong, L., Tan, T.W. & Ranganathan, S. (2006). MPID-T: database for sequence-structure-function information on T-cell receptor/peptide/MHC interactions. Appl Bioinformatics, 5, 111–4.
van der Sloot, A.M., Kiel, C., Serrano, L. & Stricher, F. (2009). Protein design in biological networks: from manipulating the input to modifying the output. Protein Eng Des Sel, 22, 537–42.
Vanhee, P., Stricher, F., Baeten, L., Verschueren, E., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2009). Protein-peptide interactions adopt the same structural motifs as monomeric protein folds. Structure, 17, 1128–1136.
Vanhee, P., Reumers, J., Stricher, F., Baeten, L., Serrano, L., Schymkowitz, J. & Rousseau, F. (2010). PepX: a structural database of non-redundant protein-peptide complexes. Nucleic Acids Research, 38, D545–51.
Vanhee, P., Verschueren, E., Baeten, L., Stricher, F., Serrano, L., Rousseau, F. & Schymkowitz, J. (2011). BriX: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Research, 39, D435–42.
Yin, H., Slusky, J.S., Berger, B.W., Walters, R.S., Vilaire, G., Litvinov, R.I., Lear, J.D., Caputo, G.A., Bennett, J.S. & Degrado, W.F. (2007). Computational design of peptides that target transmembrane helices. Science, 315, 1817–1822.
5 Protein-peptide interactions resemble monomeric protein interactions
This chapter is based on
Protein-Peptide Interactions Adopt the Same Structural Motifs as Monomeric Protein Folds.
Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Tom Lenaerts, Luis Serrano,
Frederic Rousseau and Joost Schymkowitz. Structure¹, August 2009.
and
Modeling protein-peptide interactions using protein fragments: fitting the pieces?
Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Luis Serrano, Frederic Rousseau and
Joost Schymkowitz. BMC Bioinformatics², December 2010.
¹ This paper was featured on the cover of the Structure issue of August 2009.
² This short paper is part of a series of Highlights from the Sixth International Society for
Computational Biology (ISCB) Student Council Symposium (Klijn et al., 2010).
We compared the modes of interaction at protein-peptide interfaces with
those observed within monomeric proteins and found surprisingly few
differences. Over 65% of 731 protein-peptide interfaces could be reconstructed
within 1 Å RMSD using solely fragment interactions occurring in monomeric
proteins. Interestingly, more than 80% of interacting fragments used in reconstructing
a protein-peptide binding site were obtained from monomeric proteins of an
entirely different structural classification, with an average sequence identity below
15%. Nevertheless, geometric properties perfectly match the interaction patterns
observed within monomeric proteins. These data suggest that the wealth of
structural data on monomeric proteins could be harvested to model protein-peptide
interactions and, more importantly, that sequence homology is no prerequisite.
5.1 Introduction
Recently, Russell and co-workers estimated that 15-40% of all interactions in the
cell are mediated through protein-peptide interactions (Neduva et al., 2005; Pet-
salaki & Russell, 2008), meaning that, at the most extreme, nearly every protein
is affected either directly or indirectly by peptide-binding events. Such interac-
tions are commonly mediated by specialized protein domains (Pawson & Scott,
1997), which are crucially involved in highly diverse biological processes and
occur in a myriad of proteins in ever changing combinations with other func-
tional units. Protein-peptide interactions are for instance of central importance
for motif-dependent interactions in cell signalling, such as the binding of tyrosyl-
phosphorylated peptides to proteins containing the Src homology domain 2 (SH2)
or the phosphotyrosine-binding domain (PTB) (Bradshaw & Waksman, 2002; Yaffe,
2002). Peptides with certain proline motifs constitutively bind to proteins contain-
ing Src homology domain 3 (SH3) at low affinities (Cesareni et al., 2002; Mayer,
2001).
Short peptides are usually devoid of stable secondary structure in isolation.
Thus one might argue that peptide binding is equivalent to the folding
process, in which the peptide is the last element to be added to the growing
structure, albeit not on the same polypeptide chain. This argument is supported
by folding experiments with Barnase (Kippen et al., 1994), for which cleaving the
polypeptide chain in two molecules resulted in an association fold similar to that
of the monomeric protein. In peptide complementation experiments with chy-
motrypsin inhibitor 2 (CI2) Itzhaki et al. (1995) demonstrated that folding does
not require the structural building blocks to be part of the same polypeptide chain.
This folding analogy suggests that protein-peptide interactions should follow similar
structural patterns to those observed in monomeric proteins (Tsai et al., 1998).
In particular cases, such as β-strand extension in PDZ domains, the equivalence
to monomeric structures is obvious (Remaut & Waksman, 2006), but for other
protein-peptide structures, there is no apparent monomeric counterpart that has a
similar arrangement of structural elements on a single chain.
Similarities between single-chain folds and protein-protein interfaces have been
observed too, and Tuncbag et al. (2008) suggested that evolution reuses patterns
of interaction for both folding and association. In an earlier study,
architectural motifs from protein monomers were shown to recur at protein-protein
interfaces, although this similarity is less obvious for structures that fold
separately and associate afterwards (Tsai et al., 1997). The interface between
protein and ligand is richer in hydrophobic residues than the surrounding surface
(Ma et al., 2003), suggesting similarity to the protein core. Cohen et al. (2008)
have shown that the chemistry, geometry and packing density of interactions within
protein cores are similar to those at the interface, while backbone interactions
are preferred in the core as opposed to side chain interactions in the binding site.
Clustering all the protein-protein interfaces available in the PDB, Tuncbag et al.
(2008) found that some of the architectures preferred in the interface exist also
in single chains. These striking similarities between folding and binding offer op-
portunities for protein-protein interface design, recently demonstrated by Potapov
et al. (2008) who redesigned and experimentally verified the interface of TEM1
β-lactamase and its inhibitor protein using a combination of naturally occurring
interaction templates from the Protein Data Bank (PDB) (Berman et al., 2000).
Part of the problem in identifying structural similarities between motifs that
occur in protein-peptide interactions and in monomeric proteins is the apparent
complexity of such interactions when viewed in full atomic detail. By contrast,
it is often relatively simple to divide a protein structure into a small number
of interacting fragments, roughly determined by the elements of secondary
structure. Therefore, instead of considering entire protein-peptide interfaces,
we divide the structure into pairs of interacting protein fragments, and as such
rely on the modularity of the binding site shown for protein-protein complexes
(Reichmann et al., 2005). It has been demonstrated that protein fragments of
variable length allow efficient reconstruction of the architecture of monomeric
proteins (Baeten et al.,
2008; Kolodny & Levitt, 2003). Yet, it remains to be shown whether combinations
of fragments of monomeric proteins are able to reflect the complex architectures
exhibited by the binding interfaces of protein-peptide complexes.
Here we describe an exhaustive study of all natural protein-peptide interfaces
available in the PDB (731 cases, see Section 5.3 and the PepX database described
in Chapter 4). We relate the architecture of the protein-peptide interface to the
arrangement of interacting fragments observed within monomeric proteins. Our
set of building blocks comprises all the recurrent fragments of five amino acids
that are found in the WHAT IF dataset of 1259 structurally non-redundant high
resolution protein structures (Vriend, 1990). The fragments are clustered into an
alphabet of roughly 2000 elements and are publicly available in the BriX database
(Baeten et al., 2008; Vanhee et al., 2011) (Chapter 2). We show here that more
than 65% of protein-peptide interfaces can be reconstructed from pairs of
interacting fragments of five amino acids taken from monomeric structures within
1 Å root mean square deviation (RMSD). In 25% of the cases the entire arrangement
of structural elements as it occurs in the protein-peptide interface can be found
in the monomeric fold of a particular PDB structure. Interestingly, on average
less than 15% sequence similarity exists between the structurally equivalent
building blocks as they occur in monomeric folds and protein-peptide interfaces.
Despite this, the interaction networks of the original protein-peptide interfaces
are preserved in the corresponding building blocks from the monomeric proteins.
Although more than 90% of the protein-peptide interfaces can be reconstructed at
a lower resolution (2 Å RMSD), it is clear that around 35% of protein-peptide
interactions are mediated by irregular structure elements that have no equivalent
in our database of monomeric structures.
Our work demonstrates that the rules that govern protein-peptide interactions
are identical to those that steer the architecture of proteins. This similarity can be
revealed by casting the proteins as a collection of recurrent polypeptide fragments
that interact in an inter- or intramolecular fashion. Our analysis of the known crystal
structures of protein-peptide complexes shows that the configuration of fragments
corresponding to the interactions between a protein domain and a bound peptide
can be found in the structure of a monomeric protein in the vast majority
of the cases. These configurations can be used as design templates for de novo
peptide design, as we show in Chapter 6.
5.2 Results
5.2.1 InteraX: a database of interacting protein fragments
InteraX is a collection of interaction motifs from single-chain protein structures.
We extended our original fragment description to include protein fragments
interacting with other protein fragments (see Section 5.3.3). Two versions of the
InteraX database exist, depending on which version of BriX is used. InteraX derived
from the WHAT IF protein set (Vriend, 1990) contains 7,089,578 protein fragment
interactions between 567,278 fragments of five residues each, from an original set
of 2,597 non-redundant protein monomers. InteraX derived from Astral40
(Chandonia et al., 2004) contains 13,643,407 interactions between 1,157,418
fragments derived from 6,481 proteins with less than 40% sequence homology.
Figure 5.1 shows a series of interaction patterns, for example a helix-helix
interaction with a favorable salt bridge stabilizing the interaction (Figure
5.1A), or a canonical β-β motif, which is the most represented example of an
intramolecular interaction motif from InteraX (Figure 5.1B). Other examples include
loop-loop motifs stabilized by side chain hydrogen bonds (Figure 5.1C) or by a
cation-π interaction (Figure 5.1E). Besides non-covalent interactions we also
describe covalent interactions between non-contiguous fragments, like a disulfide
bridge between a helix and a strand fragment (Figure 5.1D). Finally, a packing
motif between a loop and a helix is shown in Figure 5.1F.
We have annotated each interaction with the number of residue interactions,
hydrogen bonding and estimated free energy of interaction (Table 5.1). Two
criteria are used to describe residue interactions. The first uses a full-atom
description of the motif and considers two residues to be interacting when their
atoms are in contact, that is, within the Van der Waals radius of the atom plus a
cut-off value of 0.5 Å. The second approach is less stringent: we look for
potential interactions by drawing a hypothetical sphere around each residue,
describing its full rotamer-dependent action radius, and test whether the spheres
can overlap. We used the FoldX force field (Schymkowitz et al., 2005a) to make an
estimate
Figure 5.1: Different protein fragment interactions from InteraX. See Table 5.1 for
more details. (A) Helix-helix interaction motif stabilized by a salt bridge between
LYS54 and GLU110. (B) Canonical β-sheet motif formed by two β-strands stabilized
by a series of backbone hydrogen bonds. (C) Loop-loop interaction motif in which a
series of side chain hydrogen bonds contribute to the stability of the loops. (D) α-β
interaction with two covalent disulfide bridges. (E) Loop-loop motif with a cation-π
interaction between the cationic sidechain of lysine and the aromatic ring structures
of tryptophan and tyrosine. The entire cation-π interaction motif contains another
tryptophan and tyrosine at the opposite side and is captured by two InteraX patterns.
(F) Helix-loop interaction pattern.
of the overall interaction energy (∆∆G, kcal/mol) between each two fragments
in isolation in order to distinguish between weak and strong interaction motifs.
Fig.  PDB   Frag 1     Frag 2     Residue   Residue      H-Bonds  H-Bonds  H-Bonds  FoldX   FoldX ∆∆G
5.1                               Contacts  Contacts     MC-MC    SC-SC    MC-SC    ∆∆G     Backbone
                                  Atom      Half-sphere
A     153L  A:51-55    A:110-114  4         20           0        1        0        -0.03    0.21
B     1A44  A:64-68    A:84-88    7         22           5        0        0        -1.61   -1.14
C     1DRY  A:152-156  A:302-306  4         18           0        3        0        -0.19   -0.22
D     1AHO  A:22-26    A:46-50    21        2            0        0        0        -5.74    0.7
E     1GAI  A:49-53    A:106-110  10        22           3        0        0        -0.43   -0.75
F     153L  A:11-15    A:174-178  2         17           0        0        1        -0.6    -0.3
Table 5.1: InteraX protein fragment interactions. A sample of 6 annotated InteraX
motifs corresponding to Figure 5.1.
Additionally, InteraX pairs are annotated with the number of hydrogen bonds
(categorized as main chain bonds, side chain bonds or mixed side chain-main chain
bonds) and their contribution to the total free energy.
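The full-atom contact criterion described above can be illustrated with a minimal sketch. It assumes that two atoms are in contact when their distance does not exceed the sum of their Van der Waals radii plus the 0.5 Å cut-off, and that a residue contact is counted once per residue pair. The radii (Bondi values), the data layout and all function names are assumptions made for illustration; they are not taken from InteraX or FoldX.

```python
import math

# Assumed Van der Waals radii in Å (Bondi values); the radii used by the
# original pipeline are not stated in the text.
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}
CUTOFF = 0.5  # Å, the tolerance quoted in the text


def atoms_in_contact(atom_a, atom_b, cutoff=CUTOFF):
    """An atom is (element, (x, y, z)). Contact if d <= r_a + r_b + cutoff."""
    (elem_a, xyz_a), (elem_b, xyz_b) = atom_a, atom_b
    limit = VDW_RADIUS[elem_a] + VDW_RADIUS[elem_b] + cutoff
    return math.dist(xyz_a, xyz_b) <= limit


def residues_in_contact(res_a, res_b, cutoff=CUTOFF):
    """A residue is a list of atoms; one atomic contact is enough."""
    return any(atoms_in_contact(a, b, cutoff) for a in res_a for b in res_b)


def count_residue_contacts(frag_1, frag_2, cutoff=CUTOFF):
    """Number of residue pairs (one residue from each fragment) in contact."""
    return sum(residues_in_contact(ra, rb, cutoff)
               for ra in frag_1 for rb in frag_2)


# Hypothetical single-atom residues, for demonstration only.
res_near_1 = [("C", (0.0, 0.0, 0.0))]
res_near_2 = [("O", (3.5, 0.0, 0.0))]   # 3.5 Å <= 1.70 + 1.52 + 0.5
res_far = [("N", (10.0, 0.0, 0.0))]
print(count_residue_contacts([res_near_1], [res_near_2, res_far]))
```

The less stringent half-sphere criterion would replace `atoms_in_contact` by an overlap test between residue-level spheres; it is not sketched here because the rotamer-dependent radii are not given in this chapter.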
5.2.2 Reconstruction of protein-peptide interactions from inter-
acting fragment pairs derived from monomeric proteins
We define the protein-peptide interface as the collection of amino acid residues
belonging to either the protein- or peptide-chain whose inter-atomic distance falls
within a given cut-off distance. Starting from these interface residues, we gen-
erate interacting fragments by sliding a window of length 5 over each interface
residue (see Section 5.3 for details). By repeating this procedure for each pair
of interfacing residues, the algorithm thus generates a collection of interacting
fragment pairs from the protein-peptide structure. Next, for each fragment pair in
the protein-peptide interface the corresponding InteraX pairs are determined that
contain protein backbone arrangements similar to the fragment pair. The overlap
between the query fragment pair, taken from the protein-peptide interface, and the
database-derived fragment pair, taken from a monomeric protein, is quantified by
[Figure 5.2, bar chart. X axis: coverage of the binding site (0-25%, 25-50%,
50-75%, 75-100%). Y axis: percentage of the protein-peptide dataset (0-50).
Series: Two-body; Single BriX protein. Bar labels as extracted: 20, 16, 24, 41,
27, 45, 22, 6.]
Figure 5.2: Coverage of protein-peptide interfaces. Protein-peptide interfaces are
covered with fragments found in monomeric proteins. The ‘two-body’ coverage
shown in dark red captures the percentage of the protein-peptide interface that can
be covered with InteraX pairs of protein fragments from different proteins. The ‘single
BriX protein’ coverage shown in light red captures the best coverage of the binding
site with a single monomeric protein. Results are averaged over the entire data set of
301 complexes.
the root mean square deviation (RMSD) after superposition, using a superposition
threshold of 1 Å. The degree of coverage of the binding site is then defined as the
number of residues covered by an InteraX pair, divided by the number of residues
in the entire binding site. This ‘two-body coverage’ (every InteraX pair is a two-
body interaction) is a measure that describes to which extent the binding interface
can be reconstructed from sets of pairs of interacting fragments found in individ-
ual monomeric proteins. Higher coverage indicates an interface that contains a
high degree of architectural patterns adopted by monomeric protein structures,
whereas lower coverage of the interface implies a peptide binding interface that
cannot be related to the intramolecular architecture of monomeric proteins.
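The fragment generation and the coverage measure described in this section can be sketched as follows. The sketch assumes that "sliding a window of length 5 over each interface residue" means enumerating every window of five consecutive residues that contains the residue, and that the set of query pairs with at least one InteraX match within the RMSD threshold is already known. Residue indices, function names and the example data are hypothetical.

```python
def windows_containing(index, chain_length, size=5):
    """Start positions of all windows of `size` residues that contain `index`."""
    first = max(0, index - size + 1)
    last = min(index, chain_length - size)
    return list(range(first, last + 1))


def fragment_pairs(contacts, protein_length, peptide_length, size=5):
    """contacts: (protein_index, peptide_index) pairs of interface residues.

    Returns the set of (protein_window_start, peptide_window_start) pairs.
    """
    pairs = set()
    for i, j in contacts:
        for s1 in windows_containing(i, protein_length, size):
            for s2 in windows_containing(j, peptide_length, size):
                pairs.add((s1, s2))
    return pairs


def two_body_coverage(interface, matched_pairs, size=5):
    """Fraction of interface residues covered by at least one matched pair.

    interface: set of ("protein" | "peptide", index) residues.
    matched_pairs: query pairs for which a database pair superposes within
    the chosen RMSD threshold (the matching itself is assumed to be done).
    """
    covered = set()
    for s1, s2 in matched_pairs:
        covered.update(("protein", k) for k in range(s1, s1 + size))
        covered.update(("peptide", k) for k in range(s2, s2 + size))
    return len(covered & interface) / len(interface)


# Hypothetical example: one contact between protein residue 4 and peptide residue 0.
pairs = fragment_pairs([(4, 0)], protein_length=10, peptide_length=5)
interface = {("protein", 3), ("protein", 4), ("protein", 9), ("peptide", 0)}
print(len(pairs), two_body_coverage(interface, {(0, 0)}))
```

Separating pair generation from matching keeps the coverage definition independent of how the structural search against InteraX is implemented.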
Overall for the 731 protein-peptide interaction interfaces analysed here, we
find that for the majority of the complexes at least 50% of their protein-peptide
interface is covered with two-body interactions at a resolution of 1 Å (Figure
5.2). For 40% of the protein-peptide complexes, the coverage rises to more than
75% of the protein-peptide interface. In comparison, we find that 98% of the
protein-peptide interface can be rebuilt with single protein fragments from the
BriX database at a resolution of 1 Å. Requiring pairs of interacting fragments
instead of single protein fragments thus significantly reduces the coverage of the
protein-peptide interface. Nevertheless, the extent of coverage achieved by a
two-body fragment approach illustrates that the architectural patterns of
backbones found in the intramolecular arrangement of monomeric proteins contain a
significant amount of structural information that is applicable to protein-peptide
interactions.
Figure 5.3 illustrates how different interface topologies including all-α, mixed
α-β and all-β interfaces can be reconstructed by the superposition of two-body
fragments from monomeric proteins. The first example is a PDZ domain bound
to its ligand as an additional strand to an antiparallel β-sheet, tightly covered by
intramolecular interactions with an average of 0.49 Å pairwise RMSD. Figure 5.3B
shows an α-helical ligand binding domain with its ligand, for which fragments cover
the entire interface with 0.34 Å pairwise RMSD, due to the canonical interaction
motifs and the limited structural variation in the single α-helices. Figure 5.3C is
a class I major histocompatibility complex (MHC) bound to a decameric peptide
in the peptide-binding groove formed by two α-helices. MHC has been optimized
to bind many different peptides with different sequences, but most bind in a
similar orientation with both peptide termini bound in conserved pockets while
length variations are accommodated by the peptide bulging or zigzagging in the
middle (Collins et al., 1994). Different variations of a helix-loop motif are used for
binding and surprisingly, those seemingly irregular interaction patterns often recur
in monomeric proteins, covering 90% of the entire interface. Figure 5.3D shows
a polyproline peptide bound to an SH3 domain. Our method covers only 54% of
the interface because of the low frequency of occurrence of the polyproline motif
within single chains.
[Figure 5.3, panels: (A) 1w9qBS, (B) 3erdAC, (C) 2clrAC, (D) 1ywoAP]
Figure 5.3: Protein-peptide interfaces can be described as interactions between
recurrent protein fragments from monomeric proteins. Protein covers from the
BriX database are shown in red for receptor fragments and green for ligand fragments.
The protein-peptide complex is shown in grey. The PDB identifier with the protein
and peptide chains is shown below. Each interaction covering a part of the protein-
peptide interface consists of two fragments of five residues each. (A) PDZ domain
bound to ligand. 7,298 interacting fragment pairs covering 100% of the interface with
0.49 Å average pairwise RMSD. (B) Human estrogen receptor α-ligand binding domain
bound to peptide. 42,092 interacting fragment pairs covering 85% of the interface with
0.28 Å average pairwise RMSD. (C) Class I MHC bound to peptide. 325 interacting
fragment pairs covering 82% of the interface with 0.80 Å average pairwise RMSD. (D)
SH3 domain with polyproline peptide. 14 interacting fragment pairs covering 77% of
the interface with 0.90 Å average pairwise RMSD.
5.2.3 Reconstruction of peptide binding motifs by using multi-
ple fragment pairs observed in monomeric proteins
Can entire binding modes of protein-peptide complexes be reconstructed using
parts of single chain folds? Recently, Tuncbag et al. (2008) observed for
protein-protein complexes that some of the more frequent interface architectures
are the same as those found in single chains. For protein-peptide complexes, we
address this question by combining InteraX pairs from the same monomeric protein,
describing protein-peptide interfaces as sets of interacting fragments.
Figure 5.4 depicts six examples of binding interfaces that are described by a
combination of InteraX pairs originating from a single monomeric protein. The
first example shows a PDZ domain from the human scaffolding protein syntenin,
with a peptide bound in the canonical β-strand extension. An exact match for the
entire binding motif is found in a pseudo enzyme-substrate complex from E. coli,
exhibiting a rudimentary form of a Rossmann-fold domain unrelated to the PDZ
domain fold.
In the second example the human estrogen receptor α-ligand binding domain
is bound to a coactivator peptide in the nucleus, in a hydrophobic groove on
the surface of the ligand binding domain. The entire interface of 35 residues
superposes with an RMSD of 1.94 Å on the unrelated all-α citrate synthase from a
different species.
Figure 5.4C shows the particular binding mode of the MHC antigen-recognition
domain with a peptide, partly reconstructed from an unrelated ferritin-like protein,
superposing 24 residues with an RMSD of 0.94 Å. The ferritin-like fold lacks the
β-sheet typical for the MHC antigen-recognition domain but is composed of a helix
bundle in which the loop regions interact similarly to the peptide bound to MHC.
In Figure 5.4D a peptide inhibiting the serine-like NS3/4A protease from the
hepatitis C virus is bound in an extended backbone conformation, forming an
anti-parallel β-sheet with one β-strand of the enzyme. The entire β-sheet of 34
consecutive residues is found in MurB, a glucosamine reductase involved in
cell-wall biosynthesis in E. coli, but the ligand strand is now an integral part
of the fold. Strikingly, the two proteins belong to different structural classes
according to SCOP: NS3/4A is an all-β protein, whereas MurB is a member of the
α+β class.
In Figure 5.4E, a tetratricopeptide repeat (TPR) motif from the adaptor protein
Hop is shown bound to a heptapeptide from Hsp70. The BriX hit contains exactly
this TPR motif in p67, but now the C-terminus of p67 folds back into a hydrophobic
groove formed by a TPR domain in a single chain. This has already been observed
by Grizot et al. (2001), relating the single chain to the TPR domain in complex with
RacGTP (Lapouge et al., 2000).
The last example shows an SH3 domain complexed with a polyproline peptide.
A similar backbone architecture can be observed in an E. coli protein of unknown
function, but this time both fragments partly fold as β-strands because of the
different structural contexts. Yet, the polyproline motifs are present in both the
complex and the single-chain protein.
For 25% of the 731 complexes a similar structural arrangement covering more
than 50% of the entire interface could be observed in a single monomeric protein
(Figure 5.2). The bulk of the interfaces however can be covered for only 25-50%.
This rather low score is significant because if protein-peptide binding modes could
be entirely described using single-chain folds, we would be able to retrieve them
using SCOP, the structural classification of proteins (Murzin et al., 1995).
We examined whether there is any correspondence between the SCOP classifications
of the protein-peptide complex and the protein from BriX that contains the
collection of interacting fragments covering the interface. All four hierarchical
SCOP levels (class, fold, superfamily and family) were compared whenever SCOP data
was available for the protein-peptide complex (see Section 5.3 for details).
Intriguingly, 74% of the equivalent structural arrangements of fragments are from
unrelated SCOP classifications, 23% are related at the class level and the
remaining 3% are distributed across the fold, superfamily and family levels. These
data clearly illustrate that the fragment interaction approach reveals structural
similarities that are not apparent from structural classifications.
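The level-by-level SCOP comparison can be sketched as below. The sketch assumes that each structure is annotated with a dotted SCOP identifier of the form class.fold.superfamily.family, and that two structures are related at a given level only if all higher levels also agree. The identifiers in the example and the function names are hypothetical.

```python
from collections import Counter

SCOP_LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(sccs_a, sccs_b):
    """Deepest SCOP level shared by two dotted identifiers, or None."""
    shared = None
    for level, a, b in zip(SCOP_LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break  # deeper levels are only meaningful if this one matches
        shared = level
    return shared


def level_distribution(pairs):
    """Fraction of (complex, monomer) pairs related at each level."""
    counts = Counter(deepest_shared_level(a, b) or "unrelated"
                     for a, b in pairs)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


# Hypothetical identifiers, for demonstration only.
example_pairs = [
    ("b.36.1.1", "c.30.1.3"),  # different class
    ("b.36.1.1", "b.47.1.3"),  # same class, different fold
    ("a.1.1.1", "a.1.1.2"),    # same superfamily, different family
    ("a.1.1.1", "d.2.1.1"),    # different class
]
print(level_distribution(example_pairs))
```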
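[Figure 5.4, panels (protein-peptide complex, monomeric protein, superposition):
(A) 1w9qBS, 1gsa, 1.64 Å over 20 residues
(B) 3erdAC, 1csh, 1.94 Å over 35 residues
(C) 1yn7AC, 1jigA, 0.94 Å over 24 residues
(D) 1dy8AC, 2mbr, 1.51 Å over 34 residues
(E) 1elwAC, 1hh8A, 1.17 Å over 47 residues
(F) 2vknAC, 1k7jA, 0.69 Å over 12 residues]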
Figure 5.4: Relation between intermolecular interface architectures and in-
tramolecular protein architectures. The protein-peptide complex is colored blue
for the receptor and green for the ligand (left); the monomeric protein from BriX is
colored red (right). The superposing region is shown in ribbon view, and the number
of superposed residues and the superposition value are shown. The PDB ID with
the protein and peptide chains is shown below the figures. (A) PDZ domain with
peptide and unrelated enzyme-substrate complex (B) α-ligand binding domain with
peptide and unrelated all-α citrate synthase from a different species (C) Class I MHC
complex with peptide and ferritin-like protein (D) Hepatitis C protease with inhibitor
and MurB, a glucosamine reductase protein (E) Repetition of the ligand-bound TPR
motif in complex and single-chain form (F) SH3 domain with polyproline peptide and
protein of unknown function.
5.2.4 Statistical analysis of the factors that determine recon-
struction accuracy
What is the impact of regular secondary structure on the reconstruction accu-
racy of peptide binding sites?
Secondary structure plays an important role in protein-peptide binding. Approxi-
mately one third of all known peptides bind their protein domain through β-strand
addition, while another third folds as α-helical peptides (Petsalaki & Russell, 2008;
Remaut & Waksman, 2006). In our test set, 38% of the peptides adopt some
form of secondary structure, while 42% of all binding site residues are of regular
secondary structure. As expected, more regular interfaces are better covered, with
a correlation of 0.88 between the percentage of secondary structure and the
coverage (Figure 5.5A). Interestingly however, binding interfaces with 50%
regularity are on average still 80% covered at a resolution of 1 Å, illustrating
that even irregular interfaces partially reflect the architecture of
intramolecular interactions.
Are more stable interactions more common?
For every protein-peptide complex, we estimated the change in free energy upon
binding (∆∆G, kcal/mol) with the empirical force field FoldX (Schymkowitz et al.,
2005a,b) (Section 5.3.5). Interestingly, we found a correlation of 0.91 between
the ∆∆G binding energies of the complexes and the coverage of the binding sites
(Figure 5.6A), suggesting that higher affinity binding correlates with better
coverage. This result is not obvious as FoldX energies do not depend on the size
of the binding site. Further, decomposing the energy terms reveals that more
backbone H-bonds in the protein-peptide interface imply a better coverage with
BriX fragments, with a correlation of 0.81 (Figure 5.5B). If we instead correlate
the predicted binding ∆∆G with the percentage of secondary structure in the
binding site, we find that more structured binding sites do have slightly better
binding (correlation of 0.62, Figure 5.5D), although this is probably caused by
the high amount of β-strand binding modes in our dataset.
[Figure 5.5, four scatter plots. Panels A-C share the X axis: protein-peptide
interface coverage, % (0-100). (A) Y: % secondary structure in the interface
(0-100). (B) Y: protein-peptide interface H-bonds, kcal/mol (-20 to 5). (C) Y:
BLOSUM62 score (-1.4 to 0). (D) X: % secondary structure in the interface (0-100);
Y: binding ∆∆G FoldX, kcal/mol (-20 to 5).]
Figure 5.5: Properties of the protein-peptide interface coverage. The coverage data
of 301 complexes is equally distributed in 20 bins and plotted against (A) secondary
structure distribution (α-helix and β-strand) in the binding site (0.88 correlation),
(B) interface H-bonds as measured in kcal/mol with FoldX (0.81 correlation), (C)
BLOSUM62 score for similarity between residues from the protein-peptide complex
and the covering BriX fragment (no correlation). (D) Percentage of secondary
structure in the interface plotted against the predicted binding energy ∆∆G
(0.62 correlation).
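The binning mentioned in the caption can be illustrated with a short sketch. It assumes that "equally distributed in 20 bins" means sorting the complexes by coverage, splitting them into 20 groups of near-equal size, averaging both quantities per group, and computing a Pearson correlation on the group means. This reading, the function names and the synthetic data are assumptions; the chapter does not specify the exact procedure.

```python
import math


def equal_count_bins(points, n_bins=20):
    """Sort (x, y) points by x, split into n_bins near-equal groups,
    and return the (mean x, mean y) of every non-empty group."""
    ordered = sorted(points)
    n = len(ordered)
    binned = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            xs, ys = zip(*chunk)
            binned.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return binned


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


# Synthetic, perfectly linear data for 301 hypothetical complexes.
synthetic = [(i / 3.0, 2.0 * i + 1.0) for i in range(301)]
bins = equal_count_bins(synthetic)
print(len(bins), round(pearson(bins), 3))
```

Note that correlating bin means rather than individual complexes removes within-bin scatter, so the resulting coefficients are generally higher than those computed on the raw data.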
Are buried backbone fragments more suitable for reconstructing peptide bind-
ing sites?
BriX contains fragments from both the protein core and the surface. We analyzed
from which part of the protein the InteraX fragments originate by measuring the
burial of the side chains. A value of 1 stands for a complete burial of the side
chain, while a value of 0 indicates complete exposure to the solvent. On average,
the fragment covers have a side chain burial value of 0.72, demonstrating that the
backbone conformations of well-packed intramolecular interactions are generally
closer to the backbone architectures found at protein-peptide interfaces.
Accordingly, we note a correlation of 0.65 with the coverage: more buried
interactions from BriX cover the binding site slightly better (Figure 5.6B). The
correlation between the FoldX binding energies and the coverage of the
protein-peptide interfaces is given in Figure 5.6A.
[Figure 5.6, two scatter plots against protein-peptide interface coverage, %
(X axis, 0-100). (A) Y: binding ∆∆G FoldX, kcal/mol (-20 to 5). (B) Y: side chain
burial of the covering BriX fragments (0 to 1).]
Figure 5.6: Properties of the protein-peptide interface coverage correlated with
binding energy and burial. The coverage data of 301 complexes is equally distributed
in 20 bins and plotted against (A) predicted binding energy (∆∆G, kcal/mol) with FoldX
(0.91 correlation), (B) side chain burial with FoldX (0.65 correlation).
Is there sequence similarity between protein-peptide interactions and the
covering InteraX pairs?
We also examined whether the sequences of the wild-type protein-peptide
interfaces are similar to their corresponding fragment covers from BriX, which
have a similar backbone but not necessarily the same side chain composition. To
this end, we calculated the number of times a residue from the binding site is
covered with exactly the same residue. Surprisingly, we did not find any
correlation, with
sequence similarities ranging from 0% to only 14%. We repeated the comparison
between any two residues with the BLOSUM62 matrix, which gives a score for the
likelihood of two amino acid residues replacing each other in homologous
sequences (Henikoff & Henikoff, 1991). A negative BLOSUM score is given
to less likely substitutions while a positive score implies a more likely substitution.
This yields an average BLOSUM62 score of -0.67, thus reinforcing the idea of
sequence independence between fragments from the monomeric proteins and the
protein-peptide interface. Furthermore, no correlation exists between coverage
and sequence similarity, as shown in Figure 5.5C.
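Both sequence measures used above can be sketched in a few lines. The sketch assumes that interface and cover residues are given as two aligned one-letter strings of equal length, and that the substitution matrix is supplied as a dictionary keyed by residue pairs. Only a handful of BLOSUM62 entries are included for illustration; a real analysis would load the complete matrix. The function names and example sequences are hypothetical.

```python
def percent_identity(query, cover):
    """Percentage of positions where the cover has exactly the same residue."""
    if len(query) != len(cover):
        raise ValueError("sequences must be aligned and of equal length")
    same = sum(a == b for a, b in zip(query, cover))
    return 100.0 * same / len(query)


def mean_substitution_score(query, cover, matrix):
    """Average substitution score over aligned positions.

    matrix maps (residue, residue) to a score; it is treated as symmetric.
    """
    if len(query) != len(cover):
        raise ValueError("sequences must be aligned and of equal length")
    scores = []
    for a, b in zip(query, cover):
        scores.append(matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)])
    return sum(scores) / len(scores)


# A few BLOSUM62 entries, for demonstration only (not the full matrix).
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("L", "I"): 2,
    ("K", "R"): 2,
    ("A", "R"): -1,
    ("L", "D"): -4,
}

# Hypothetical interface residues and the residues of a covering fragment.
interface_seq = "ALKA"
cover_seq = "AIRR"
print(percent_identity(interface_seq, cover_seq))
print(mean_substitution_score(interface_seq, cover_seq, BLOSUM62_SUBSET))
```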
We also examined whether the hydrophobic properties, the charge and the
β-propensities of the amino acids covering the protein-peptide interfaces are
conserved. We use the hydropathy index to measure hydrophobic properties (Kyte
& Doolittle, 1982), the side chain charge at pH 7 and the β-propensities (Street
& Mayo, 1999). A difference close to 0 represents high conservation. We note that
none of the physical properties are clearly conserved by our coverage algorithm, as
shown in Figure 5.7 for both the single InteraX pairs and for the best recombination
of InteraX pairs from a single BriX protein. The difference in physical properties
between two random sequences of 10,000 amino acids each is plotted with a red
line (hydropathy = 3.269, charge = 0.431, β-propensities = 0.184). For coverage
using single InteraX pairs, the physical properties for most protein-peptide
interfaces are close to the random difference, while for coverage using a
recombination of InteraX pairs from the same protein slightly more variation
exists. These results further stress the side chain independence of the coverage
algorithm.
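The property comparison can be illustrated for hydropathy. The sketch assumes that the "difference" is the mean absolute difference of the Kyte-Doolittle index over aligned residue positions, and that the random baseline is obtained from two sequences drawn uniformly from the 20 amino acids. Both choices are assumptions, since the chapter does not state how the difference or the random sequences were defined; the function names are hypothetical.

```python
import random

# Kyte-Doolittle hydropathy index (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_abs_difference(query, cover, scale=KYTE_DOOLITTLE):
    """Mean absolute property difference over aligned positions."""
    if len(query) != len(cover):
        raise ValueError("sequences must be aligned and of equal length")
    total = sum(abs(scale[a] - scale[b]) for a, b in zip(query, cover))
    return total / len(query)


def random_baseline(scale=KYTE_DOOLITTLE, length=10000, seed=0):
    """Difference between two random sequences (uniform residue
    frequencies are an assumption of this sketch)."""
    rng = random.Random(seed)
    residues = sorted(scale)
    seq_a = [rng.choice(residues) for _ in range(length)]
    seq_b = [rng.choice(residues) for _ in range(length)]
    return mean_abs_difference(seq_a, seq_b, scale)


print(mean_abs_difference("AI", "AI"))
print(mean_abs_difference("A", "R"))
print(random_baseline())
```

The same function applies to charge or β-propensity by passing a different `scale` dictionary.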
Although we do not observe sequence similarity, we went further and compared the
entire interaction network of the protein-peptide interfaces with that of their
matching fragments from our database. We looked at similarities in H-bond
patterns, electrostatics, and volumetric properties and found that 88% of the
electrostatic network, 95% of the H-bond patterns and 91% of the volumetric network
of the original protein-peptide interfaces are retained in the BriX covers. Thus,
while sequence identity is very low, geometric properties are retained, making
the use of fragments an alternative to homology modeling for protein-peptide
interface design. In Chapter 6 we apply the idea developed in this chapter to
peptide docking and design.
[Figure 5.7, six histograms in two rows (A-C and D-F). Y axis: number of
protein-peptide interfaces (0-80). X axes: difference in hydropathy (0-7),
difference in charge (0-1.4), difference in beta propensity (0-0.4).]
Figure 5.7: Physical properties are not conserved in the BriX covers. Histograms
of the physical property conservation are shown for the entire dataset, for hydropathy
(left), charge (middle) and beta propensities (right). On the X axis the difference
between the properties of the amino acids in the protein-peptide interface and those
from the BriX covers is plotted, on the Y axis the number of protein-peptide interfaces
in the bins is shown. A, B, and C show the statistics for the coverage using single
InteraX pairs, while D, E, and F show the statistics for the best recombination of InteraX
pairs from a single protein. The conservation for the physical property between two
random sequences is drawn with a red line.
5.3 Materials and Methods
5.3.1 Construction of a non-redundant data set of protein-peptide
complexes
We constructed the non-redundant data set in the same way as PepX (Section
4.2.1), with the difference that we required (1) peptides with a size from 5 to
14 amino acids (rather than up to 35), and (2) receptors with a minimum size of
25 amino acids (rather than 35). 731 complexes were retained and clustered on
their binding architecture, similar to the construction of PepX. Any
Protein in complex with peptide         PDB ID    Members  FoldX ∆∆G  Two-body     Single BriX
                                                                      coverage, %  coverage, %
Major Histocompatibility Complex (MHC)  2clrAC    172      -24.12     82           24
α-ligand binding domain                 3erdAC    69       -19.02     85           60
Bovine γ-chymotrypsin                   1ab9BCA   30       -14.24     56           30
Thrombin                                1vzqHI    26       -7.21      59           24
Streptavidin                            1sldBP    24       -7.48      35           24
HIV-1 antibody                          1u8hABC   15       -8.31      65           50
HIV-1 protease                          2nxlABP   14       -18.61     81           29
SH3                                     1uj0AB    13       -12.14     54           35
PDZ                                     1w9qBS    8        -10.37     81           74
Table 5.2: Coverage statistics for the most populated classes in the protein-
peptide dataset. The table shows statistics for the top 9 classes in our dataset, which
account for 371 of the 731 protein-peptide complexes.
two structures are grouped together if they superpose below 2 Å RMSD for at least
75% of their interfaces. In this way, we retained 258 unique protein-peptide inter-
face clusters. The centroid of each cluster was selected for the dataset, while for
clusters with more then 10 elements we selected 5 representative interfaces. The
final dataset contains 301 representative protein-peptide interfaces. The interface
size of the protein-peptide complexes varies between 3 and 55 residues, with an
average of 21 residues in the binding site. 70% of all protein-peptide complexes
have been annotated with SCOP Murzin et al. (1995). Table 5.2 shows the cover-
age results for the top 9 clusters in the dataset. This original data set served as the
basis to research the landscape of protein-peptide interactions as available from
the PDB, and results are presented in Chapter 4.
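The grouping criterion described in this section can be illustrated with a short sketch. It is not the code used in this study: it assumes single-linkage grouping (any pair that satisfies the criterion ends up in the same cluster), and the `compare` callback, which returns an RMSD and the fraction of the interface that superposes, is a hypothetical placeholder for the structural comparison.

```python
from itertools import combinations

RMSD_CUTOFF = 2.0    # Å, as stated in the text
MIN_FRACTION = 0.75  # minimum fraction of the interface that must superpose


def cluster_interfaces(ids, compare):
    """Group interfaces with a union-find structure (single linkage, assumed).

    compare(a, b) is a hypothetical callback returning (rmsd, fraction).
    """
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(ids, 2):
        rmsd, fraction = compare(a, b)
        if rmsd < RMSD_CUTOFF and fraction >= MIN_FRACTION:
            parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)
```

With such clusters in hand, the centroid (or five representatives for clusters of more than 10 members) would be picked in a separate step that is not sketched here.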
5.3.2 The dataset of protein fragments
BriX¹ is a database of canonical protein fragments obtained through fragmenting
and clustering a set of 1261 high quality protein structures (Baeten et al., 2008).
Protein structures have been reconstructed using BriX fragments with an average
accuracy of 0.48 Å RMSD, covering 99% of the original structure. The resulting
alphabet of protein fragments varies in length from 4 to 14 amino acids, but in
this study we have limited ourselves to fragments of length 5. Because 94% of all
fragments of length 5 are classified, nearly all of the protein data is used. 258,474
fragments were clustered into 7744 structural classes for six different RMSD
thresholds to allow different levels of structural variety. Moreover, the fragment
recombination used to obtain ‘n-body’ interactions (the recombination of InteraX
pairs that have overlapping regions) gradually increases the fragment lengths. A
complete analysis of the BriX database is presented in Chapter 2.

¹ At the time of this study, we used the original version of this database as
published by Baeten et al. (2008). In Chapter 2 we describe how we updated this
database, increasing its size 7-fold and including a special clustering for loop
structures.
5.3.3 InteraX database
The InteraX database was constructed by mining all fragment interactions in BriX
between fragments of length 5. For every fragment in the BriX database, we looked
for interacting fragments within the same protein. We defined an interaction
between two residues of interacting fragments as follows: if the distance between
any two atoms of the residues is less than the sum of their Van der Waals radii plus
0.5 Å (Keskin et al., 2008), the residues are considered as interacting. Each InteraX
pair has at least three of these interactions. Each pair is annotated using FoldX
version 3.0 Beta 4 (unpublished).
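The residue interaction criterion can be sketched as follows. This is a minimal illustration, not the InteraX implementation: the radii are Bondi values chosen for the example (the radii used in the study are not given in the text), and the data layout (a fragment as a list of residues, a residue as a list of `(element, (x, y, z))` tuples) is an assumption.

```python
import math

# Illustrative van der Waals radii in Å (Bondi values); an assumption of this sketch.
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}
TOLERANCE = 0.5  # Å, added to the sum of the two radii


def residues_interact(res_a, res_b):
    """True if any atom pair is closer than r_a + r_b + 0.5 Å."""
    for elem_a, xyz_a in res_a:
        for elem_b, xyz_b in res_b:
            cutoff = VDW_RADIUS[elem_a] + VDW_RADIUS[elem_b] + TOLERANCE
            if math.dist(xyz_a, xyz_b) < cutoff:
                return True
    return False


def count_interactions(frag_a, frag_b):
    """Number of interacting residue pairs between two fragments."""
    return sum(residues_interact(ra, rb) for ra in frag_a for rb in frag_b)


def is_interax_pair(frag_a, frag_b, min_interactions=3):
    """A fragment pair qualifies when it has at least three residue interactions."""
    return count_interactions(frag_a, frag_b) >= min_interactions
```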
5.3.4 Covering algorithm
The covering algorithm¹ harvests the wealth of data provided in BriX to reconstruct
protein-peptide interfaces. Instead of considering single protein fragments, the
backbone arrangements of the interactions between fragments (stored in the
InteraX database) are the basis for reconstruction. The covering algorithm searches
for similar backbone arrangements of fragment interactions, between the entire
BriX dataset and the protein-peptide interfaces².

¹ A generalized version of this original method that uses constraint satisfaction is
described in Section 6.3.1. This new algorithm does more than simply covering the
interface and can be used for de novo structure prediction and design.
² A step-by-step guide through the covering and reconstruction algorithms is
available for online viewing at http://pepx.switchlab.org/content/related-work.

In the first step, binding site residues are defined by measuring the distances
between any two residues from different polypeptide chains, one from the receptor
protein and the other from the ligand peptide. Residue interaction is defined as in
Section 5.3.3. Fragments are constructed from interacting residues by sliding a
window of five residues over the structure from the N- to the C-terminus. The
window sliding starts 4 residues before the first interacting residue and ends 4
residues after the last interacting residue, such that nearby residues of the binding
site are used to facilitate the interaction search.
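The window construction of this first step can be sketched as below. The sketch assumes contiguous residue numbering and reads "starts 4 residues before" and "ends 4 residues after" as limits on the residues a window may contain; the function name and the clamping to the chain ends are choices made for this illustration.

```python
def binding_site_windows(interacting, chain_start, chain_end, size=5):
    """Return (first, last) residue numbers of every window, N- to C-terminus.

    interacting: residue numbers in contact with the partner chain.
    Windows may reach size - 1 residues beyond the first and last contact,
    but never beyond the chain itself.
    """
    flank = size - 1
    lo = max(chain_start, min(interacting) - flank)
    hi = min(chain_end, max(interacting) + flank)
    return [(start, start + size - 1) for start in range(lo, hi - size + 2)]
```

For contacts at residues 10 and 12 of a chain numbered 1 to 100, this yields seven windows, from (6, 10) to (12, 16).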
In a second step, the binding site fragments are covered with fragments from
the BriX fragment database. Every fragment is compared with all the class centroids
of BriX, using a superposition threshold of 1 Å RMSD. The four atoms N, Cα,
Cβ and O are used in the superposition; including Cβ preserves the directions of the
original side chains in the covering fragments. Structural variation within
the classes is tolerated up to 0.9 Å distance from the class centroid. We applied
a lower threshold for highly redundant classes such as all-α classes, and raised
the threshold for classes with few structural elements. To use all data available
in BriX, all the fragments from the selected classes are loaded on the binding site
fragments.
In a third step, the algorithm looks for architectural matches between InteraX
pairs and fragment pairs from the protein-peptide binding site. Each fragment pair
combines one fragment from the receptor with another from
the ligand. InteraX pairs are filtered on superposition on the BriX pair using a
threshold of 1 Å for tight matches. Applying this procedure to the entire binding
site results in a set of InteraX pairs (also called ‘two-body’ interactions) that cover
the binding site of the protein-peptide complex. In a final step, these ‘two-body’
interactions from the same BriX protein are combined into ‘n-body’ interactions
with a superposition threshold of 2 Å, thus covering a larger part of the binding
site with a single monomeric fold.
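The difference between 'two-body' and 'n-body' coverage can be illustrated with a simplified sketch. It leaves out the superposition tests (1 Å and 2 Å) entirely and only tracks which binding-site residues each matched InteraX pair covers; the tuple layout and function names are assumptions made for this example.

```python
from collections import defaultdict


def coverage(covered, binding_site):
    """Percentage of binding-site residues present in the covered set."""
    return 100.0 * len(covered & binding_site) / len(binding_site)


def best_coverages(pairs, binding_site):
    """Return (best single-pair coverage, best same-protein recombination).

    pairs: iterable of (brix_protein_id, covered_residue_numbers).
    Pairs from the same BriX protein are merged, which mimics the
    recombination into 'n-body' interactions (superposition check omitted).
    """
    by_protein = defaultdict(set)
    best_single = 0.0
    for protein, covered in pairs:
        covered = set(covered)
        best_single = max(best_single, coverage(covered, binding_site))
        by_protein[protein] |= covered
    best_nbody = max(
        (coverage(c, binding_site) for c in by_protein.values()), default=0.0
    )
    return best_single, best_nbody
```

In this toy setting, two overlapping pairs from one protein can cover more of the binding site together than the best single pair from any protein.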
5.3.5 FoldX force field
The FoldX software was used to compute binding energies ∆∆G after local opti-
mization of the side chains and to measure the side chain burial of the residues in
both the protein-peptide dataset and the BriX database.
FoldX is a protein design algorithm that contains a force field used for rapid
evaluation of protein stability, folding and dynamics as affected by mutations
(Schymkowitz et al., 2005b). The force field that is part of the algorithm makes
a quantitative estimation of the interactions contributing to the stability of pro-
teins (as for example used in Chapter 3 to measure protein stability after loop
building) and for the stability of protein-protein, protein-peptide and protein-DNA
complexes (as used in this Chapter and Chapters 4 and 6). The different energy
terms taken into account in FoldX have been weighted using empirical data from
protein engineering experiments, and the predictive power was tested on a large
set of protein mutants covering many different structural environments found in
proteins (Guerois et al., 2002).
Energy estimates calculated by FoldX include terms that have been found to
be important for protein stability. The free energy of unfolding ($\Delta G$) of a
target protein is calculated using:

\[
\Delta G = \Delta G_{vdw} + \Delta G_{solvH} + \Delta G_{solvP} + \Delta G_{wb} + \Delta G_{hbond} + \Delta G_{el} + \Delta S_{mc} + \Delta S_{sc} \tag{5.1}
\]

where $\Delta G_{vdw}$ is the sum of the van der Waals contributions of all atoms with
respect to the same interactions with the solvent; $\Delta G_{solvH}$ and $\Delta G_{solvP}$ are the
differences in solvation energy for apolar and polar groups, respectively, when going
from the unfolded to the folded state; $\Delta G_{hbond}$ is the free energy difference between
the formation of an intramolecular hydrogen bond and that of an intermolecular
hydrogen bond (with solvent); $\Delta G_{wb}$ is the extra stabilizing free energy provided
by a water molecule making more than one hydrogen bond to the protein (water
bridges) that cannot be taken into account with non-explicit solvent approximations;
$\Delta G_{el}$ is the electrostatic contribution of charged groups, including the helix
dipole; $\Delta S_{mc}$ is the entropy cost for fixing the backbone in the folded state (this
term is dependent on the intrinsic tendency of a particular amino acid to adopt
certain dihedral angles); and $\Delta S_{sc}$ is the entropic cost of fixing a side chain in a
particular conformation. The energy values of $\Delta G_{vdw}$, $\Delta G_{solvH}$, $\Delta G_{solvP}$ and
$\Delta G_{hbond}$ attributed to each atom type have been derived from a set of experimental
data, and $\Delta S_{mc}$ and $\Delta S_{sc}$ have been taken from theoretical estimates.
FoldX also provides an estimate of the loss of free energy upon binding, i.e.
$\Delta\Delta G$, which can be calculated for protein-protein, protein-peptide or
protein-DNA complexes:

\[
\Delta\Delta G_{AB} = \Delta G_{AB} - (\Delta G_{A} + \Delta G_{B}) + \Delta G_{kon} + \Delta S_{tr} \tag{5.2}
\]

where $\Delta G_{AB}$ is the stability of the complex; $\Delta G_{A}$ and $\Delta G_{B}$ are the stabilities of
each of the partners in isolation; $\Delta G_{kon}$ reflects the effect of the electrostatic
interactions on the $k_{on}$ term for protein complexes; and $\Delta S_{tr}$ is the loss of
translational and rotational entropy upon making the complex. The $\Delta\Delta G$ of the
protein-peptide complex was computed using the FoldX command AnalyzeComplex
after repairing the positions of the side chains using the FoldX command RepairPDB.
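As a worked illustration of Equation 5.2, the function below combines the individual terms into a binding energy. It is only arithmetic on values that FoldX would report; it does not call FoldX, and the numbers in the usage note are invented for the example.

```python
def binding_ddg(dg_complex, dg_a, dg_b, dg_kon=0.0, ds_tr=0.0):
    """Equation 5.2: stability of the complex minus that of the isolated
    partners, plus the k_on electrostatic term and the translational and
    rotational entropy term (all in kcal/mol)."""
    return dg_complex - (dg_a + dg_b) + dg_kon + ds_tr
```

For example, with hypothetical values of -50.0 kcal/mol for the complex, -30.0 and -12.0 kcal/mol for the partners, -0.5 kcal/mol for the k_on term and 1.5 kcal/mol for the entropy term, `binding_ddg(-50.0, -30.0, -12.0, dg_kon=-0.5, ds_tr=1.5)` evaluates to -7.0 kcal/mol.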
5.3.6 Statistical analysis
731 protein-peptide complexes were clustered in 258 distinct protein-peptide
interface classes. Statistics were computed by distributing the data for the 258
protein-peptide classes over 20 bins and averaging the results within each bin. With
this approach only general trends in the data are observed, as details are leveled
out. For classes with more than 10 elements we took 5 representative elements
and averaged the statistics per class.
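The binning step can be sketched as follows. The sketch assumes equal-width bins over a fixed range (0 to 100 by default, as for a percentage) and that every x value lies within that range; the bin layout actually used for the figures is not specified in the text.

```python
def binned_means(points, n_bins=20, lo=0.0, hi=100.0):
    """Average y per equal-width bin of x; empty bins are reported as None.

    points: iterable of (x, y) with lo <= x <= hi (assumed).
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x, y in points:
        # A value exactly equal to hi is placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```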
5.4 Discussion
We have researched whether interactions seen in protein-peptide complexes are
different from those observed within monomeric proteins. Our study was moti-
vated by the sheer abundance of monomeric protein structures compared to the
lack of complex structures. We analysed all 301 non-redundant protein-peptide
interactions available in the PDB. In this set our reconstruction method shows
an overall reconstruction of 91% of the binding site in 41% of the cases, 62% of
the binding site in 25% of the cases, and less than 19% of the binding site for
the remaining 34%. In general, the reconstruction accuracy depends on the reg-
ularity of the structure related to secondary structure and H-bond patterns, but
irregular structures are still covered to a good extent. Importantly, the reconstruc-
tion accuracy does not depend on side chain similarity but clearly reflects general
architectural rules of polypeptides.
Using protein fragments to model protein-peptide interfaces opens up the way
to incorporate the wealth of data on monomeric protein structures for protein-
peptide binding prediction and design. We demonstrated that most interactions
can be viewed as sets of pairwise interactions between protein fragments, iden-
tical to interactions in monomeric proteins. Not only have we shown that using
fragments is an efficient way to look at interfaces, we also reached a level of de-
tail in studying protein interactions that cannot be reached using fold comparison
through SCOP or other protein classifications.
We strictly limited ourselves to superpositions of maximum 1 Å RMSD, yet
we did not observe any sequence relation between the protein-peptide interfaces
and the BriX proteins, suggesting that the arrangement of the backbone is largely
independent from the side chain in the bound complex. The interaction net-
works however were preserved between intra- and intermolecular interactions.
Through recombination of pairwise fragment interactions we could reconstruct
entire binding sites in some cases, revealing identical binding patterns between
protein-peptide interfaces and parts of single chain folds.
Although most binding interfaces with regular structure can be covered, we
note that loop interactions are often not covered, or only partly covered, because of
the large variety of loop interactions. Special consideration for loop fragments as
we presented in Chapter 2 and applied to loop reconstruction in Chapter 3 could
increase this coverage. Finally, this structural insight paves the way for a peptide
design algorithm that employs the wealth of monomeric structural data to predict
and design protein-binding peptides (Chapter 6).
Author Contributions
P.V., F.R., J.S., and L.S. conceptualized the study. P.V. performed the experiments.
P.V., E.V., F.S., F.R., J.S. and L.S. performed the analysis. P.V. and J.S. wrote the
paper.
References
Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the brix collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The protein data bank. Nucleic Acids Research, 28, 235–42.
Bradshaw, J.M. & Waksman, G. (2002). Molecular recognition by sh2 domains. Adv Protein Chem, 61, 161–210.
Cesareni, G., Panni, S., Nardelli, G. & Castagnoli, L. (2002). Can we infer peptide recognition specificity mediated by sh3 domains? FEBS Lett, 513, 38–44.
Chandonia, J.M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M. & Brenner, S.E. (2004). The astral compendium in 2004. Nucleic Acids Research, 32, D189–92.
Cohen, M., Reichmann, D., Neuvirth, H. & Schreiber, G. (2008). Similar chemistry, but different bond preferences in inter versus intra-protein interactions. Proteins.
Collins, E.J., Garboczi, D.N. & Wiley, D.C. (1994). Three-dimensional structure of a peptide extending from one end of a class i mhc binding site. Nature, 371, 626–9.
Grizot, S., Fieschi, F., Dagher, M.C. & Pebay-Peyroula, E. (2001). The active n-terminal region of p67phox. structure at 1.8 a resolution and biochemical characterizations of the a128v mutant implicated in chronic granulomatous disease. J Biol Chem, 276, 21627–31.
Guerois, R., Nielsen, J.E. & Serrano, L. (2002). Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. Journal of Molecular Biology, 320, 369–87.
Henikoff, S. & Henikoff, J.G. (1991). Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19, 6565–6572.
Itzhaki, L.S., Otzen, D.E. & Fersht, A.R. (1995). The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. Journal of Molecular Biology, 254, 260–88.
Keskin, O., Gursoy, A., Ma, B. & Nussinov, R. (2008). Principles of protein-protein interactions: what are the preferred ways for proteins to interact? Chem Rev, 108, 1225–44.
Kippen, A.D., Sancho, J. & Fersht, A.R. (1994). Folding of barnase in parts. Biochemistry, 33, 3778–3786.
Klijn, C., Michaut, M. & Abeel, T. (2010). Highlights from the 6th international society for computational biology student council symposium at the 18th annual international conference on intelligent systems for molecular biology. BMC Bioinformatics, 11, I1.
Kolodny, R. & Levitt, M. (2003). Protein decoy assembly using short fragments under geometric constraints. Biopolymers, 68, 278–85.
Kyte, J. & Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157, 105–32.
Lapouge, K., Smith, S.J., Walker, P.A., Gamblin, S.J., Smerdon, S.J. & Rittinger, K. (2000). Structure of the tpr domain of p67phox in complex with rac.gtp. Mol Cell, 6, 899–907.
Ma, B., Elkayam, T., Wolfson, H. & Nussinov, R. (2003). Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci USA, 100, 5772–7.
Mayer, B.J. (2001). Sh3 domains: complexity in moderation. J Cell Sci, 114, 1253–63.
Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995). Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Neduva, V., Linding, R., Su-Angrand, I., Stark, A., de Masi, F., Gibson, T.J., Lewis, J., Serrano, L. & Russell, R.B. (2005). Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol, 3, e405.
Pawson, T. & Scott, J.D. (1997). Signaling through scaffold, anchoring, and adaptor proteins. Science, 278, 2075–80.
Petsalaki, E. & Russell, R.B. (2008). Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin Biotechnol, 19, 344–50.
Potapov, V., Reichmann, D., Abramovich, R., Filchtinski, D., Zohar, N., Halevy, D.B., Edelman, M., Sobolev, V. & Schreiber, G. (2008). Computational redesign of a protein-protein interface for high affinity and binding specificity using modular architecture and naturally occurring template fragments. Journal of Molecular Biology, 384, 109–19.
Reichmann, D., Rahat, O., Albeck, S., Meged, R., Dym, O. & Schreiber, G. (2005). The modular architecture of protein-protein binding interfaces. Proceedings of the National Academy of Sciences of the United States of America, 102, 57–62.
Remaut, H. & Waksman, G. (2006). Protein-protein interaction through beta-strand addition. TRENDS in Biochemical Sciences, 31, 436–44.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005a). The foldx web server: an online force field. Nucleic Acids Research, 33, W382–8.
Schymkowitz, J.W.H., Rousseau, F., Martins, I.C., Ferkinghoff-Borg, J., Stricher, F. & Serrano, L. (2005b). Prediction of water and metal binding sites and their affinities by using the fold-x force field. Proceedings of the National Academy of Sciences of the United States of America, 102, 10147–52.
Street, A.G. & Mayo, S.L. (1999). Intrinsic beta-sheet propensities result from van der waals interactions between side chains and the local backbone. Proceedings of the National Academy of Sciences of the United States of America, 96, 9074–6.
Tsai, C.J., Xu, D. & Nussinov, R. (1997). Structural motifs at protein-protein interfaces: protein cores versus two-state and three-state model complexes. Protein Sci, 6, 1793–805.
Tsai, C.J., Xu, D. & Nussinov, R. (1998). Protein folding via binding and vice versa. Folding & design, 3, R71–80.
Tuncbag, N., Gursoy, A., Guney, E., Nussinov, R. & Keskin, O. (2008). Architectures and functional coverage of protein-protein interfaces. J Mol Biol, 381, 785–802.
Vanhee, P., Verschueren, E., Baeten, L., Stricher, F., Serrano, L., Rousseau, F. & Schymkowitz, J. (2011). Brix: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Research, 39, D435–42.
Vriend, G. (1990). What if: a molecular modeling and drug design program. Journal of Molecular Graphics, 8, 52–6, 29.
Yaffe, M.B. (2002). Phosphotyrosine-binding domains in signal transduction. Nat Rev Mol Cell Biol, 3, 177–86.
6 Predicting peptide structure and specificity

Parts of this chapter are prepared for publication in

Peptide structure prediction from the architecture of proteins. Peter Vanhee*, Erik
Verschueren*, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. In
preparation, December 2010.

and

The Multiple Specificity Landscape of Peptide Recognition Modules. David Gfeller,
Frank Butty, Erik Verschueren, Peter Vanhee, Haiming Huang, Andreas Ernst, Nisa
Dar, Igor Stagljar, Luis Serrano, Sachdev S. Sidhu, Gary D. Bader and Philip M. Kim.
Molecular Systems Biology (in review), February 2011.

* Peter Vanhee and Erik Verschueren are joint first authors.

It is believed that around 30% of all interactions in the cell are directly or indirectly
mediated by peptide interactions. However, structural evidence is lacking for many
of these interactions. An in-silico method could therefore guide the search for and
discovery of new peptide interactions, or produce structural models for known
peptide interactions. Predicting not only the interaction between a peptide and a
protein but also the shape of the peptide remains a very challenging problem due
to the large number of rotatable bonds and the flexibility in both the main chains
and the side chains of the peptides. Here we present a novel, knowledge-based
method to overcome some of those limitations, relying on the structural insights
presented in Chapter 5. The method samples the reduced space of fragment
conformations and their local interactions from ∼7000 globular proteins. We show
the usefulness of our approach by redesigning the interaction scaffold of nine
protein-peptide complexes, for which four of the peptides can be modeled to within
1 Å RMSD of the original peptide position. For two different protein domains, the
PDZ domain and the α-Ligand Binding Domain (LBD), we predict peptide structure
with sub-angstrom accuracy, using the sequence of the peptide combined with the
structure of the domain, but without previous knowledge of the peptide structure.
Furthermore, we show how the method correctly identifies the peptide-binding
sites in these domains. Finally, we model domain specificity for the PDZ and the
LBD, and make a structural prediction for a novel PDZ binding mode recently
discovered through phage display.
6.1 Introduction
The process of predicting the structure of a peptide bound to a target protein can
essentially be divided into three steps:
1. Model the target protein structure, either by using a structure from the PDB
or through homology modeling.
2. Predict the potential binding site on the surface of the protein structure.
3. Model a peptide structure in the binding site with high resolution. We
speak about docking when the backbone structure of the peptide is known
beforehand (Section 6.2.1), and we call this process design when the peptide
structure is not known beforehand (from Section 6.2.2 onwards).
For a comprehensive overview of the field of computational peptide design, we
refer to Section 1.3.
6.2 Results
6.2.1 Peptide docking using interaction patterns from InteraX
We investigated whether InteraX motifs (interactions between monomeric
fragments, see Section 5.2.1) have the predictive capacity for the docking of rigid
peptide structures in a selected binding site on the protein interface. For nine
protein-peptide complexes (the centroids of nine representative clusters from
PepX, accounting for approximately half of our protein-peptide data set; Figure
4.1), we rebuilt the interfaces using the original sequence and structure of the
peptide in the binding pocket, but without previous knowledge of the interaction
pattern of the protein-peptide complex. The side chains of the interface residues
were rebuilt with the all-atom force field FoldX, and the resulting models were
ranked based on the ∆∆G binding energy. The algorithm is described in detail in
Section 6.3.
[Figure 6.1; panel labels: 1w9qB + S, 1w9qB + 1k6z]
Figure 6.1: Docking of the PDZ peptide using InteraX patterns. At the left, the
superposition between the BriX protein (PDB 1K6Z, grey) and the PDZ domain (PDB
1W9Q, cyan) on the receptor β-strand is shown. The fragments from the BriX protein
are colored blue, the receptor fragment from the PDZ domain is colored green. At
the right, the peptide ligand according to the interaction motif from the BriX protein is
shown (red), superposed on the original ligand (yellow), with RMSD 0.29 Å.
In four of the nine cases, we are able to position the original peptide ligand to
within 1 Å RMSD of the original position (Table 6.1). For example, the peptide
bound to the PDZ domain can be positioned to within 0.29 Å RMSD of the original
peptide, using the β-β interaction pattern observed in an unrelated secretion
chaperone (Figure 6.1). For the MHC, the algorithm finds the correct peptide position
to within 0.99 Å RMSD using an α-loop InteraX pair of an unrelated BriX protein
(PDB ID 1AJS).

Protein-peptide complex                  PDB ID     RMSD, Å
Major Histocompatibility Complex (MHC)   2clrAC     0.79
α-ligand binding domain                  3erdAC     1.96
Bovine γ-chymotrypsin                    1ab9BCA    2.87
Thrombin                                 1vzqHI     1.88
Streptavidin                             1sldBP     1.38
HIV-1 antibody                           1u8hABC    1.29
HIV-1 protease                           2nxlABP    0.51
SH3                                      1uj0AB     0.73
PDZ                                      1w9qBS     0.14

Table 6.1: Peptide docking using InteraX. Accuracy of rigid peptide docking on
9 representative classes from PepX (Figure 4.1). For coverage statistics of these
interfaces using InteraX patterns and thus with knowledge of the binding interface,
see Table 5.2.

In another four cases, the ligand was placed correctly to within 2 Å, and for the
remaining case, the algorithm was not able to filter out the correct positions of the
ligands, due to the lack of interaction motifs in InteraX that superpose sufficiently
close to the receptor fragments.
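The grouping of the nine docking results into accuracy ranges can be reproduced from the RMSD values listed in Table 6.1. The short sketch below simply tallies those table values against the 1 Å and 2 Å cutoffs used in the text; the dictionary and function names are chosen for this illustration.

```python
# RMSD values (Å) as listed in Table 6.1, keyed by PDB ID and chains.
TABLE_6_1 = {
    "2clrAC": 0.79, "3erdAC": 1.96, "1ab9BCA": 2.87,
    "1vzqHI": 1.88, "1sldBP": 1.38, "1u8hABC": 1.29,
    "2nxlABP": 0.51, "1uj0AB": 0.73, "1w9qBS": 0.14,
}


def tally(rmsds, cutoffs=(1.0, 2.0)):
    """Count values below each cutoff in turn; the last slot holds the rest."""
    counts = [0] * (len(cutoffs) + 1)
    for value in rmsds:
        for i, cutoff in enumerate(cutoffs):
            if value < cutoff:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return counts


print(tally(TABLE_6_1.values()))  # [4, 4, 1]
```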
6.2.2 De novo peptide structure prediction using interaction
patterns from InteraX
For peptide docking, a structure of both the ligand and the domain is required. In
many real-world scenarios however, no structure of the protein-peptide complex
exists, and methods based on homology modeling or first principles (ab initio)
need to be employed. In the next sections, we remove the need for a structure
of the peptide by using backbone templates from the InteraX database and side
chain reconstruction using the FoldX force field. Our design strategy assumes no
previous knowledge of either the peptide structure or orientation and is driven by
the search for favorable binding energy. We also show that no previous knowledge
of the binding site is required to design the peptides. Finally, for the PDZ domain
we show that we can capture the specificity profile of this domain by comparing our
results to phage display peptide binding assays.
6.2.3 Case study: PDZ peptide design and specificity
We evaluated our peptide design method on PDZ domains, flexible domains that
mediate protein-protein interactions mainly by binding the C-termini of target
proteins (Nourry et al., 2003). These C-terminal peptides bind in an elongated surface
groove as an antiparallel β-strand, interacting with an exposed β-strand and α-helix
of the PDZ domain (Figure 6.2A). PDZ domains have been classified in discrete
classes: for example, Class I domains recognize peptides with the consensus motif
Ser/Thr-X-Val/Leu/Ile-COOH. However, such discretizations of the peptide sequence
space have recently been challenged in favor of a more even distribution (Stiffler
et al., 2007).
Only a limited number of PDZ-peptide complex structures is available (e.g. for 17
out of 54 human PDZ domains; Smith & Kortemme, 2010), such that modeling
peptide structure and peptide specificity remains a challenge. Here, we apply
our method to model peptide ligands for the PDZ1 domain of DLG3 (categorized
under Class I, PDB 2I1N). We define possible anchor fragments in the PDZ domain
at residues 140-144, 141-145, 142-146 and 143-147 and search in the space of
InteraX patterns for all possible interacting pairs. Next, we apply a series of filters
that select and sort the candidates by backbone clashes, hydrogen bonding, and
finally the full ∆∆G interaction energy after building the wild type sequence (EETSV)
using FoldX (Section 6.3).
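The anchor definition and the filter cascade can be illustrated with a small sketch. It is a simplified stand-in for the algorithm of Section 6.3: the `Candidate` fields are assumed to be precomputed elsewhere, and the thresholds `max_clashes` and `min_hbonds` are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    backbone_clashes: int  # clashes between peptide template and receptor backbone
    hydrogen_bonds: int    # backbone hydrogen bonds to the receptor
    ddg: float             # interaction energy in kcal/mol (lower is better)


def anchor_fragments(first_start, last_start, size=5):
    """Overlapping anchor windows, e.g. 140-144 up to 143-147."""
    return [(s, s + size - 1) for s in range(first_start, last_start + 1)]


def rank_candidates(candidates, max_clashes=0, min_hbonds=1):
    """Drop templates that clash or lack hydrogen bonds, then sort by energy."""
    kept = [
        c for c in candidates
        if c.backbone_clashes <= max_clashes and c.hydrogen_bonds >= min_hbonds
    ]
    return sorted(kept, key=lambda c: c.ddg)
```

Calling `anchor_fragments(140, 143)` returns the four anchor windows listed above for the PDZ1 domain.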
Our designs compare favorably to the bound peptide of the X-ray model,
with 9 templates less than 1 Å RMSD away from the crystallographic peptide,
counting distance between the backbone atoms (Figure 6.2B, black circles). The
interaction energy and important hydrogen bonding patterns that contribute to
binding are kept intact, such as Thr-His, or the carboxyl hydrogen bonding pattern.
Interestingly, most of the variation in our designed peptides resides in the
N-terminal region (inset in Figure 6.2B), which is the most flexible part of the peptide
and lacks a clear sequence signature. The binding pocket of the PDZ extends only
to the last three C-terminal residues, which are consequently the most constrained
structurally, something that is reflected in the designs with average RMSD values
< 0.5 Å.

[Figure 6.2, panels A and B. Structure labels: His196, Arg136, carboxylate-binding
loop, Glu-4, Glu-3, Thr-2, Ser-1, Val0. Plot title: PDZ1 (closed loop); x axis: RMSD
(design to crystallographic peptide) (Å); y axis: estimated ∆∆G interaction, FoldX
(kcal/mol). Legend entries: Ab initio design, BriX refinement, wildtype, design.
Inset: RMSD per residue for positions -4 to 0.]

Figure 6.2: Peptide design for the canonical PDZ domain. (A) Top design for
the canonical PDZ domain (PDB 2I1N). RMSD against the crystallographic peptide
is 0.79 Å. (B) Ab initio design samples peptides that are < 1 Å separated from the
crystallographic peptide. When applying BriX backbone moves on the top designs,
this region is even more densely sampled. The C-terminal region (residues 0,-1,-2) is
the most constrained part in our designs as shown by the inset box plot, corresponding
to the PDZ Class I binding motif (Ser/Thr-X-Val/Leu/Ile) (Nourry et al., 2003).
In a second round of experiments we introduce more structural variation in
the peptide designs through a series of backbone moves using the BriX database
(Section 6.3.2). In this fine-grained step we optimize binding of the peptide,
reducing the estimated interaction energy by 1.8 kcal/mol, and the number of
peptide designs that are < 1 Å RMSD away from the X-ray peptide increases from 9
to 38 (Figure 6.2B, blue circles).
Multiple specificity of the PDZ1 domain and structural prediction
PDZ domains are highly specific to different peptidic ligands; an example is
the first PDZ domain of DLG1. The family of DLG (Discs Large Homolog) proteins
consists of 4 highly conserved paralogs, which have 3 PDZ domains each, and the
sequences of these PDZ domains are very similar. Gfeller et al. (2010)
recently developed a method that detects correlated positions in peptide-binding
data obtained from experimental phage display. When applied to the first PDZ
domain of DLG1, the method detected two different specificity profiles (Figure 6.3),
suggesting that this PDZ domain can bind two different peptide ligands. However,
structural evidence for this observation does not exist.
In order to interpret this observed multiple specificity structurally, we note that
the two sequence logos representing the multiple specificity (Figure 6.3C) align
well when the second one is shifted by one position, indicating the presence of
an additional residue at the C-terminus. However, a significant displacement of
the carboxylate-binding loop (Figure 6.2A) is required in order to accommodate
this extra residue. As we previously observed (Section 3.7), this loop adopts a
number of conformations, and we selected the conformation that shows the largest
movement away from the original binding site (PDB 2WL7). In this structure,
DLG2 PDZ1 was crystallized as a trimer with the C-terminally extended
peptide RRRPIL (Fiorentini et al., 2009). Ile-1 is found at
the same spatial position as Val0 in PDB 2I1N (Figure 6.2A), and the two domains
share >80% sequence identity. However, a complete structural model of
this novel binding specificity could not be derived from this crystal structure since
only the last three residues (PIL) are in contact with the binding site. Hence, we
applied our method to design a peptide that confirms the experimentally observed
peptide specificity.
Figure 6.4A shows the top five structural models when using the sequence
ETDIW (derived from the last logo in Figure 6.3C). The predicted interaction energy
for the top model (-8.50 kcal/mol) compares favorably to the one computed for
the original DLG3 PDZ1 structure (PDB 2I1N, -7.3 kcal/mol). The C-terminal Trp
can be accommodated by the displaced carboxylate-binding loop and while the
2I1N peptide COOH forms a hydrogen bond mediated by a water molecule to a
6. PREDICTING PEPTIDE STRUCTURE AND SPECIFICITY
[Figure 6.3: panel A, phage peptides aligned on C-terminal positions -6 to 0; panel B, hierarchical clustering of the peptides by sequence similarity (scale 0 to 1); panel C, sequence logos (bits versus C-terminal positions -6 to 0) for the single PWM and the multiple PWMs.]
Figure 6.3: Multiple specificity for DLG1 PDZ1 binding peptides. (A) Phage pep-
tides binding to the first PDZ domain of the human protein DLG1, aligned from the
C-terminus. The last five positions (red box) display positional correlations. Pairs of
significantly correlated positions (MI p-value < 0.001) are connected with a red edge,
others with a black edge. An example of correlation can be found between the two
last columns: W or L at position 0 always appears with I at -1, while V at position
0 is never found with I at -1. (B) Hierarchical clustering of the peptides shown in
A. The two main clusters (orange dashed line) are the ones identified by the method
of Gfeller et al. (2010). Positional correlations are successfully removed within the
clusters (black edges). (C) Sequence logos for the single PWM (left) and the multiple
PWMs (right), together with their respective weights. This figure is taken from the
manuscript prepared by Gfeller et al. (2010).
conserved Arg/Lys (Doyle et al., 1996), the Trp-COOH directly forms a hydrogen
bond with this Arg.
We then studied amino acid preferences at each peptide position by performing
in silico mutagenesis experiments, mutating step-by-step every position to all pos-
sible amino acids and evaluating the interaction energies. These preferences are
represented in a Position Weight Matrix (PWM) and graphically shown as a heat
map. Figure 6.4B shows the PWM for the designed peptide, agreeing well with
experimental phage display data (Tonikian et al., 2008). Interestingly, we predict
a clear preference for the Trp (as well as Phe or Tyr) at the C-terminal position.
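
The mutagenesis scan described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the FoldX-based pipeline used in this work: `toy_energy` is a hypothetical stand-in for an interaction-energy evaluation, and `mutational_scan` is an illustrative name. Each position is mutated to all twenty amino acids and the energy difference to the starting sequence is stored; a matrix of this kind is what the heat map visualizes.

```python
from typing import Callable, Dict, List

AMINO_ACIDS = "GALVIPRTSCMKEQDNWYFH"  # the 20 standard amino acids

def mutational_scan(wild_type: str,
                    energy: Callable[[str], float]) -> List[Dict[str, float]]:
    """Return, per position, the energy change of every single-point mutant.

    Negative values mean the mutant scores better than the starting
    sequence under the supplied energy function.
    """
    reference = energy(wild_type)
    matrix = []
    for pos in range(len(wild_type)):
        row = {}
        for aa in AMINO_ACIDS:
            mutant = wild_type[:pos] + aa + wild_type[pos + 1:]
            row[aa] = energy(mutant) - reference
        matrix.append(row)
    return matrix

def toy_energy(sequence: str) -> float:
    """Hypothetical scoring function: rewards an aromatic C-terminal residue."""
    return -2.0 if sequence[-1] in "WFY" else 0.0

# Scan the extended ligand used in this section.
scan = mutational_scan("ETDIW", toy_energy)
```

In a real setting the energy callback would wrap a structure-based evaluation of the mutated complex; the toy function merely makes the sketch self-contained.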
[Figure 6.4: panel A, structural models with labelled residues Trp0, Ile-1, Asp-2, Thr-3, Glu-4 (peptide) and His165, Arg105, Ser113 (domain); panel B, PWM heat map of amino acids (G, A, L, V, I, P, R, T, S, C, M, K, E, Q, D, N, W, Y, F, H) versus peptide positions -5 to 0, colored from decreasing to increasing ΔG.]
Figure 6.4: Structural prediction for the alternative PDZ specificity. (A) Computational
modeling of DLG1 PDZ1 in complex with an extended ligand ETDIW, corresponding
to the new specificity observed in phage display (Figure 6.3). The top 5 models
predicted by our method are shown, using the template PDB 2WL7 which has a dis-
placed loop that could accommodate multiple specificity (see also Section 3.7 for the
conformations this loop might adopt). (B) PWM showing amino acid preferences in
the model of extended ligand binding to DLG1 PDZ1. The amino acids of the initial
ligand ETDIW are marked with black boxes.
Our results provide a structural explanation of the predicted multiple specificity
of the PDZ1, one following the canonical C-terminal PDZ binding mode (Figure 6.2)
and another unexpected one allowing for an additional residue at the C-terminus
(Figure 6.4). It appears that remodeling of the carboxylate binding loop is often
associated with non-canonical binding modes of PDZ domains as we observed by
analysis of a set of PDZ domains from the PDB (Gfeller et al. (2010) and Section
3.2.3). Through phage display clustering (Gfeller et al., 2010) in combination with
structural modeling, these peptides together with their specificity can be detected
and confirmed.
PDZ peptide binding site prediction
Many methods exist to predict peptide binding sites, for example using geometric
amino acid-dependent preferences derived from protein-peptide binding interfaces
(Petsalaki et al., 2009). Here, we use the information intrinsically contained in the
InteraX pairs to predict ‘hot spot’ residues on the surface of the PDZ domain
(Section 6.3.3).
[Figure 6.5: panel A, scatter plot of the estimated ΔΔG of interaction (FoldX, kcal/mol) versus RMSD of the design to the crystallographic peptide (Å, range 0 to 30), with legend entries "PDZ1 (closed loop)" and "Ab initio design"; panels B to D, surface views of the predicted hot spots.]
Figure 6.5: PDZ peptide-binding site prediction. (A) Sub-angstrom prediction of
the crystallographic peptide when removing information about the binding site. (B-
D) Different views of hot spot prediction, showing a clear preference (red) for the
crystallographic binding site.
Figure 6.5A shows the energetic landscape when designing peptides without
previous knowledge of the binding site, showing a clear energy funnel in the 0-2 Å
range. This provides evidence for a single peptide-binding pocket in the PDZ domain, as
is generally accepted (Nourry et al., 2003). This example illustrates the use of our
method for the prediction of peptide-binding sites.
6.2.4 Case study: helical peptide design for the estrogen receptor ligand-binding domain
Helical peptides are important building blocks of a protein’s structure, but are
also often encountered in protein-protein interfaces (Jochim & Arora, 2009). In
one such case, that of the estrogen receptor (ER), interactions between helical
protein segments play an important role in the estrogen signaling pathway. The
estrogen receptor is a transcription factor that is responsible for estrogen signaling
and regulates a number of important hormone-dependent processes in the cell
(Heldring et al., 2007). ER contains an evolutionarily conserved ligand-binding
domain (LBD) that, after activation by the agonist estrogen, binds leucine-rich
peptides (LxxLL, with x any amino acid) in a hydrophobic pocket formed by the
helices H3, H5 and H12 (Figure 6.6A). In absence of an agonist, LBD binds its
own C-terminal tail (H12), that has a similar leucine-rich signature. Finally, when
LBD binds an antagonist, e.g. Tamoxifen or Fulvestrant (Heldring et al., 2007), the
binding pocket is partially or completely deformed, no peptides can bind anymore
and the estrogen signaling pathway is blocked.
Here, we aim at the design of a peptide that blocks the hydrophobic pocket
formed by H3, H5 and H12. Unlike with PDZ peptide design (Section 6.2.3),
we take a ‘multi-body’ approach. We look for BriX fragments that simultaneously
interact with fragments from H3, H5 and H12, as defined by InteraX (Section
6.3.1). The advantage is that peptides will be optimized for binding the entire
interface and not just a single fragment at a time, as they will be constrained to
interact simultaneously with H3, H5 and H12. We use the sequence signature
of the peptide (LHRLL) and the structure of the unbound LBD from PDB 3ERD
(Shiau et al., 1998). We note that while it is relatively easy to position a β-strand
owing to the constrained backbone hydrogen bonding patterns (as in the case
of PDZ), the α-α interactions are much less constrained because they lack any
[Figure 6.6: panels A and B, with labels Helix 3, Helix 5 and Helix 12.]
Figure 6.6: Estrogen receptor α-LBD binding site. The binding site of the LBD is
formed by three helices (A) that together create a hydrophobic groove accommodating
the peptide (B). The residue stretches that are used as the anchor fragments in the
design are shown in purple, red and green.
backbone hydrogen bonds and most of the free energy is contributed through
hydrophobic packing. Therefore we rely on a combination of (1) multi-body
fragment interaction optimization using InteraX, (2) backbone clash filters and (3)
total interaction energy measured by FoldX after building fully flexible side chains
on the peptide. Furthermore, while the original peptide ligand from the crystal
structure contains 11 residues, we limit our design to a fragment of 5 residues,
which accurately captures the LxxLL pattern, and superpose the helical structure
of the 3ERD peptide on the designed peptide to extend it to 11 residues.
Figure 6.7A shows the designed structure when compared with the crystal
structure (RMSD = 1.05 Å). There is only slight variation on the three conserved
leucine residues that constitute most of the binding energy within the hydrophobic
binding pocket (colored in yellow). The energy landscape of 4052 models reveals a
clear ‘energy funnel’ towards designs close to the crystallographic peptide (Figure
6.7B). The best designs are within [0-1]Å RMSD (1 design), [1-1.5]Å (2 designs)
and [1.5-2.0] Å (13 designs). We also evaluated the energies of different 11-
residue LBD-binding peptides distilled from the protein-peptide database PepX
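[Figure 6.7: panel A, wildtype versus designed peptide with labelled residues Leu690, Leu693 and Leu694; panel B, "Alpha Ligand Binding Domain peptide design (3ERD)", scatter plot of the estimated ΔΔG of interaction (FoldX, kcal/mol) versus RMSD of the design to the crystallographic peptide (Å, range 0 to 8).]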
Figure 6.7: Helical peptide design for the LBD. (A) Crystal structure (3ERD) versus
designed peptide, indicating three leucine residues constituting the binding motif.
The LBD is represented in surface view; hydrophobic residues are colored yellow,
hydrophilic residues are colored green. (B) Energy landscape of the designs, showing
a clear ‘funnel’ near the crystal structure.
(Chapter 4). When comparing the estimated energies of our designs to this cluster
(Figure 6.8A), we find that our best designs (∼15 kcal/mol) fall within the range
of energies observed for other crystal structures (∼15-25 kcal/mol). While
this does not validate our results, it shows that the algorithm samples the range
of energetic values estimated from experimentally resolved structures.
Predicting LBD peptide specificity profiles
We then tested whether the designs have the predictive capacity needed
to explain the LxxLL peptide specificity. We performed in silico mutagenesis
experiments, mutating every residue of the best five designs to all twenty amino
acids and evaluating the binding energies using FoldX.
The method correctly detects a preference for the three conserved leucines
(Figure 6.8B). The Position Weight Matrix (PWM) also shows a preference for the
hydrophobic residues Ile and Met at these positions, but we could not verify these
[Figure 6.8: panel A, "Alpha LBD - Peptide, PepX cluster 2GTK T1 A75, ID=4845", interaction energy (-30 to 0) versus ligand size (8 to 20), with legend entries DDG FoldX, Backbone HBond and Sidechain HBond; panel B, PWM heat maps of amino acids versus peptide residues 687 to 697 and versus positions 690, 693 and 694, colored from decreasing to increasing energy.]
Figure 6.8: Recognizing LBD peptide specificity on designed templates. (A) Com-
parison of the energy landscape on 111 LBD-peptide complexes. The peptides that
have a similar length to the designed peptide are marked with a red box. (B) PWM
of the amino acid preferences for the designed LBD peptide. The amino acids of the
ligand motif LxxLL are marked with black boxes.
observations with experimental data.
Predicting the LBD peptide-binding site
We predicted peptide-binding hot spot residues scanning the entire surface of the
LBD, using the protocol described in Section 6.3.3. Figure 6.9 shows our designs,
which display a clear energy funnel close to the crystallographic peptide. Inter-
estingly, we also observe a second energy funnel 30-40 Å RMSD separated from
the crystallized peptide, with binding energies that are only 2 kcal/mol separated.
Looking into the functional role of this putative alternative binding site – oriented
at the back of the LBD peptide-binding site – we discovered that this hot spot cor-
responds to the dimer interface of the LBD. Even though the second LBD domain
that binds as a dimer has a different sequence signature at these positions, this
second hydrophobic pocket also seems to accommodate the leucine-rich motif
displayed by the peptide. Experimental validation is lacking since these domains
are often crystallized as a dimer. Our findings however highlight the potential of
the method to detect biologically relevant binding sites.
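[Figure 6.9: panel A, scatter plot of the estimated ΔΔG of interaction (FoldX, kcal/mol) versus RMSD of the design to the crystallographic peptide (Å, range 0 to 50), with legend entries "3ERD (full peptide)" and "Ab initio design"; panels B to D, surface views with labels "peptide binding site" and "dimerization binding site (D)".]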
Figure 6.9: LBD peptide-binding site prediction. (A) Energy landscape of peptide
prediction when removing information about the binding site, showing two clear
funnels (0-2 Å and 30-40 Å separated from the actual binding site). (B-D) Different
views of the hot spots on the surface of the LBD with the predicted peptide in green:
(C) shows the actual binding site and (D) shows predicted hot spots corresponding to
the dimerization site of the LBD.
6.3 Materials and Methods
6.3.1 A constraints-based framework for peptide design
The BriX Design method is implemented as a Constraint Satisfaction Problem
(CSP), a mathematical problem defined as a set of objects whose state must
satisfy a number of constraints and for which a solution is sought using constraint
satisfaction methods (Kumar, 1992). While a formal description of this actively
researched class of problems is outside the scope of this thesis, we explain the
problem description when applied to peptide prediction.
Every CSP essentially consists of three elements (Figure 6.10A):
1. V is the set of Variables. This set contains fragments of
five residues, laid out along the chain of the target protein. For example,
Figure 6.10B shows three variables (V1,V2 and V3) which represent three
non-overlapping fragments in the binding site of the PDZ domain. The
special variable V0 represents the fragment to design. These fragments can
be designated fragments in the binding site (when the binding site is known),
or can cover the entire chain of the protein (when doing blind experiments).
2. D is the Domain. The domain contains a set of values for every variable. The
domain of a variable initially consists of all BriX fragments. This domain will be reduced
by the different constraints during the course of the algorithm.
3. C is the set of Constraints. Constraints can be unary, i.e. constrain the
domain of a single variable, binary, i.e. constrain the domain of two vari-
ables simultaneously, or n-ary, i.e. constrain the domains of n variables
simultaneously.
Constraints
Constraints limit the domains of the variables. A list of the three most important
constraints in BriX Design is given:
• BriX fragments: unary constraint that constrains the domain of fragments
from the entire BriX database towards BriX fragments that superpose on
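[Figure 6.10: panel A, cartoon of a CSP with variables V0 to V2, domains D0 to D2 and constraints C1 to C3; panel B, structural model with the designed fragment V0.]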
Figure 6.10: Method: BriX peptide design implemented as a Constraint Satisfac-
tion Problem. (A) Cartoon representation of a CSP consisting of Variables, Domains
and Constraints. (B) The structural model that corresponds to the instance of a CSP:
Variables are fragments in the PDZ peptide-binding site with domains that span
the space of possible backbones matching these fragments (shown in red, blue and
yellow). The CSP looks for solutions that are overlapping in these three variables by
applying different constraints, here resulting in a modelled peptide structure for V0.
the fragment (usually with a backbone RMSD value <1Å) described by the
variable. This is the most limiting constraint.
• InteraX: binary constraint between two variables, usually between one frag-
ment in the binding site (V1−3) and the fragment to design (V0). Two variables
are constrained to follow the same interaction patterns as those present in
the InteraX database (Section 5.2.1). Even though this constraint is binary in
nature, the concatenation of the InteraX constraints can result in constraining
the search to look for interaction patterns between multiple variables, e.g.
to reconstruct the entire peptide-binding site with a single protein (Section
5.2.3).
• FoldX free energy: n-ary constraint that is fired when all variables are as-
signed, such that a structural model can be produced. The interaction en-
ergy is decomposed in different terms: clashes, backbone hydrogen bonds,
sidechain hydrogen bonds, ∆∆G after building a poly-alanine peptide se-
quence, ∆∆G after building a user-given peptide sequence. The particular
order of these energy constraints can significantly increase the speed of the
algorithm, e.g. by first filtering on clashes, backbone hydrogen bonds, and
finally by ∆∆G on an all-atom model.
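
The benefit of ordering the energy terms from cheap to expensive can be illustrated with a small filter cascade. The sketch below is an assumption-laden illustration rather than the BriX Design code: the candidate models, their pre-computed terms and the thresholds are hypothetical, and `run_cascade` is an illustrative name. It counts how often each filter is evaluated, showing that the costly all-atom term is reached by only a fraction of the candidates.

```python
from typing import Callable, Iterable, List, Tuple

# Each filter is (name, predicate). Predicates are listed from cheapest to most
# expensive so that most candidates are rejected before the costly step runs.
Filter = Tuple[str, Callable[[dict], bool]]

def run_cascade(candidates: Iterable[dict], filters: List[Filter]):
    """Return surviving candidates and how often each filter was evaluated."""
    evaluations = {name: 0 for name, _ in filters}
    survivors = []
    for model in candidates:
        for name, passes in filters:
            evaluations[name] += 1
            if not passes(model):
                break  # reject early; later filters are skipped
        else:
            survivors.append(model)
    return survivors, evaluations

# Hypothetical pre-computed terms and thresholds, for illustration only.
models = [
    {"clashes": 0, "bb_hbonds": 4, "ddg": -7.5},
    {"clashes": 3, "bb_hbonds": 5, "ddg": -9.0},
    {"clashes": 0, "bb_hbonds": 1, "ddg": -8.0},
    {"clashes": 0, "bb_hbonds": 3, "ddg": -2.0},
]
filters = [
    ("clashes", lambda m: m["clashes"] == 0),
    ("backbone_hbonds", lambda m: m["bb_hbonds"] >= 3),
    ("ddg_all_atom", lambda m: m["ddg"] <= -5.0),
]
survivors, evaluations = run_cascade(models, filters)
```

In this toy run the clash filter sees all four models, whereas the all-atom term is evaluated for only two of them.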
Heuristics
Heuristics are used to speed up the process of finding an assignment of all variables
that satisfies all the constraints in the CSP. These ‘rules of thumb’ shape the search
tree and thus define the order in which solutions are found by the search algorithm.
BriX Design uses only one heuristic: a value is always assigned from the
smallest domain in the system, which typically belongs to the most constrained variable
and thus quickly leads to a solution to the problem.
Search
A CSP can be resolved using constraint satisfaction methods (Tsang, 1993). Since
our CSP has finite domains (i.e. BriX fragments) as opposed to nearly infinite
domains (i.e. the space of all dihedral angles), we can apply search methods that
traverse the domains of the variables until a solution has been found that satisfies
all constraints.
Initially, all variables are unassigned. The variable with the smallest domain
(typically the most constraining fragment in the protein environment) is chosen and
possible values are assigned in turn. Whenever a variable is assigned, constraints
that are connected to this variable are triggered. For example, the InteraX constraint
will be fired immediately after a BriX fragment is assigned to the β-strand of the PDZ
domain (V1=fragA), limiting the domain of V0 to all fragments that are present in
the InteraX database having fragA as interaction partner.
Constraints are propagated along the constraint network each time the domain
of a variable changes, unless an inconsistency is found (i.e. when a variable
domain becomes empty). In that case, the algorithm applies backtracking until the
last assignment which satisfies the constraints and traverses an alternative path
along the search tree.
The search progresses in a depth-first and not a breadth-first manner, meaning
that priority is given towards solving all of the constraints initially for a single
assignment instead of progressively limiting the domains for each variable. This
has the advantage that very quickly in the search process (typically, in less than 5
minutes) a good candidate satisfying all the constraints is found.
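
The search procedure can be sketched as a small depth-first solver with smallest-domain-first variable ordering and pruning of neighbouring domains after every assignment. This is a toy illustration in Python, not the Gecode-based implementation: the fragment identifiers, the `pairs` lookup (standing in for an InteraX-style table of allowed fragment pairs) and the function names are all hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Domains = Dict[str, Set[str]]
# Allowed value pairs for a pair of variables, mimicking an InteraX-style lookup.
Pairs = Dict[Tuple[str, str], Set[Tuple[str, str]]]

def consistent(var_a: str, val_a: str, var_b: str, val_b: str, pairs: Pairs) -> bool:
    """Check the binary constraint between two variables, if one exists."""
    if (var_a, var_b) in pairs:
        return (val_a, val_b) in pairs[(var_a, var_b)]
    if (var_b, var_a) in pairs:
        return (val_b, val_a) in pairs[(var_b, var_a)]
    return True  # unconstrained pair

def solve(domains: Domains, pairs: Pairs,
          assignment: Optional[Dict[str, str]] = None) -> Optional[Dict[str, str]]:
    """Depth-first search, smallest domain first, pruning after each assignment."""
    assignment = dict(assignment or {})
    unassigned = [v for v in domains if v not in assignment]
    if not unassigned:
        return assignment
    var = min(unassigned, key=lambda v: len(domains[v]))  # heuristic
    for value in sorted(domains[var]):
        # Propagate: prune the domains of the other unassigned variables.
        pruned = {v: set(domains[v]) for v in domains}
        pruned[var] = {value}
        for other in unassigned:
            if other != var:
                pruned[other] = {w for w in domains[other]
                                 if consistent(var, value, other, w, pairs)}
        if any(not pruned[v] for v in unassigned):
            continue  # empty domain: inconsistency, try the next value
        result = solve(pruned, pairs, {**assignment, var: value})
        if result is not None:
            return result
    return None  # no value works: the caller backtracks

domains = {"V0": {"p1", "p2", "p3"}, "V1": {"fragA", "fragB"}, "V2": {"fragC"}}
pairs = {
    ("V1", "V0"): {("fragA", "p1"), ("fragB", "p2"), ("fragB", "p3")},
    ("V2", "V0"): {("fragC", "p2")},
}
solution = solve(domains, pairs)
```

Here V2 has the smallest domain and is assigned first; its constraint immediately reduces the domain of V0 to a single fragment, which in turn fixes V1.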
Translating a set of assigned variables to a structural model
A solution is an assignment of a single value to each variable that satisfies all
constraints. These assignments are then translated to a structural model using
superposition of the BriX fragment on the fragment described by the variable in the
case of binding site fragments (e.g. V1,V2 and V3), or using superposition on one
fragment of the InteraX pair in the case of designed fragments (e.g. V0). The math-
ematical method provided by Kabsch (1976) is used to provide a rotation matrix
and a translation vector that describe the best superposition for the coordinates
of two fragments. Superposition is done using the backbone atoms N, Cα, C, O,
disregarding the sequences of the fragments. When the backbone scaffold is in
place, we use FoldX to build sidechains using a rotamer search (Schymkowitz et al.,
2005). In conclusion, the method translates an assignment of BriX fragments to a
complete model of the protein-peptide complex.
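
For reference, the superposition step can be written out explicitly. The notation below is ours and is introduced only for illustration: p_i and q_i denote the N paired backbone atom coordinates of the mobile and the target fragment (N = 20 for a five-residue fragment superposed on N, Cα, C and O), and the rotation R and translation t minimize the sum of squared distances between R p_i + t and q_i.

```latex
\begin{align}
\bar{p} &= \frac{1}{N}\sum_{i=1}^{N} p_i, \qquad
\bar{q} = \frac{1}{N}\sum_{i=1}^{N} q_i \\
H &= \sum_{i=1}^{N} (p_i - \bar{p})(q_i - \bar{q})^{\mathsf{T}} = U \Sigma V^{\mathsf{T}} \\
R &= V \,\mathrm{diag}\bigl(1,\, 1,\, \operatorname{sign}\det(V U^{\mathsf{T}})\bigr)\, U^{\mathsf{T}}, \qquad
t = \bar{q} - R\,\bar{p}
\end{align}
```

The sign term in the diagonal matrix guards against an improper rotation (a reflection), which would otherwise be possible for nearly planar or degenerate coordinate sets.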
Implementation
The BriX Design method is implemented using the Generic Constraint Develop-
ment Environment Gecode 3 (http://www.gecode.org) (Schulte, 2002), an effi-
cient and clean implementation of constraint satisfaction methods in C++. The
method is integrated with the FoldX and the BriX frameworks.
6.3.2 Local backbone moves using BriX
Backbone moves of fragments can naturally be represented in the CSP, since
the domain of a variable represents the space of compatible backbones with a
particular segment in the protein. These slightly different backbone conformations
are evaluated against the entire protein context by the different constraints. When
applying these backbone moves on a designed ligand, the structure of the ligand
is optimized in terms of interaction energy as we have shown in Section 6.2.3.
Backbone moves could be applied on fragments that are located in continuous
regions, e.g. in the binding site of the receptor protein. A mending strategy thus
needs to be employed to connect the fragments together such that the dihedral
angles are not violated.
6.3.3 Binding site prediction
The method can be turned into a binding site prediction method by considering all
fragments of the polypeptide chain (by sliding a window of length 5 from the N-
to the C-terminus) instead of only fragments in the binding site. Energetic evaluations will
narrow down the solutions towards peptide designs that target the binding site.
To make the hot spot surfaces shown in Figures 6.5 and 6.9, for each residue of
the protein we take the best design (in terms of binding energy) this residue was
part of.
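
The two steps of this protocol, enumerating anchor fragments with a sliding window and projecting design energies onto residues, can be sketched as follows. This is a minimal Python illustration under stated assumptions: residues are indexed from 0, each design is linked to the residues of the anchor window it was built against, and the energies are hypothetical values.

```python
from typing import Dict, List, Tuple

def sliding_fragments(n_residues: int, length: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) residue indices, end exclusive, for every window."""
    return [(start, start + length) for start in range(n_residues - length + 1)]

def hot_spot_profile(designs: List[Tuple[Tuple[int, int], float]]) -> Dict[int, float]:
    """Assign to each residue the best (lowest) energy of any design whose
    anchor window contains it; residues without a design are omitted."""
    best: Dict[int, float] = {}
    for (start, end), energy in designs:
        for residue in range(start, end):
            if residue not in best or energy < best[residue]:
                best[residue] = energy
    return best

# An 8-residue toy chain and three hypothetical designs (window, energy).
windows = sliding_fragments(8)
designs = [((0, 5), -3.0), ((1, 6), -7.5), ((3, 8), -5.0)]
profile = hot_spot_profile(designs)
```

Coloring the protein surface by such a per-residue profile would give a hot spot map of the kind described above.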
6.4 Discussion
In this final chapter, we have proposed a method for the structural discovery of
peptide ligands binding to globular protein domains. For nine protein-peptide com-
plexes, we showed how InteraX contains interaction patterns that are sufficiently
rich to dock the peptides in the binding sites of the proteins with sub-angstrom ac-
curacy. For two selected cases, the PDZ domain and the α-ligand binding domain,
we have shown that even in the absence of the peptide structure or knowledge of
the binding site, we can accurately reconstruct the peptide.
Our results show that a combination of BriX, InteraX and FoldX can reach
optimal designs within a fraction of the time of ab initio methods. As a comparison,
the recently proposed PepSpec algorithm, which relies on the Rosetta molecular
modeling package, was shown to reconstruct peptides in between 100 and 300
CPU hours (King & Bradley, 2010). Our design method typically generates sub-
angstrom models within 5-15 minutes and then continues to enrich the ensemble.
In some cases, the prerequisite of knowing the sequence of the peptide to be
modeled can be circumvented. For example, in the case of the PDZ, most of the
binding is contributed through a network of backbone hydrogen bonding (Nourry
et al., 2003). The selection of optimal and near-optimal templates could thus
be guided without the knowledge of the sequence, allowing a greater diversity
of templates to be used for specificity modeling in a later phase. For this, we
use a minimal representation of the peptide backbone, in which all residues are
mutated to alanine, or glycine for specific positions. Preliminary studies on the
relation between a minimal model (using the ‘poly-alanine’ peptide) and a full
sidechain model have shown a high correlation. These results show that using a
minimal representation of the peptide could lead to high-quality models as well. In
the case of LBD however, no backbone hydrogen bonding can steer the search for
optimal models, and thus the minimal model would not show a determined energy
funnel. Here, a partial representation of the peptide (for example, LxxLL, with x
any amino acid) could potentially identify optimal backbone templates, although
this hypothesis was not verified. This minimal model opens the way for structural
verification of many known peptide-binding motifs with an unknown structure,
as are available for example in the database of eukaryotic linear motifs (ELM,
http://elm.eu.org/) (Gould et al., 2010), or the database of three-dimensional
interacting domains (3did, http://3did.irbbarcelona.org/) (Stein et al.,
2010).
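The minimal and partial peptide representations described above can be sketched in a few lines. The snippet below is only an illustration: the function names, the 0-based `glycine_positions` argument and the example sequences are assumptions made for this sketch and are not part of BriX Design. It operates on sequences only; building the corresponding backbone and side chains would still require a modeling package.

```python
import re


def minimal_peptide(sequence, glycine_positions=()):
    """Return a poly-alanine version of `sequence`.

    Residues at the 0-based indices in `glycine_positions` become glycine;
    every other residue becomes alanine (assumption of this sketch).
    """
    glycines = set(glycine_positions)
    return "".join("G" if i in glycines else "A" for i in range(len(sequence)))


def motif_to_regex(motif):
    """Translate a motif such as 'LxxLL' (x = any amino acid) into a regex."""
    return re.compile("".join("." if c == "x" else re.escape(c) for c in motif))


def partial_peptide(sequence, motif):
    """Keep only the fixed motif residues; mutate all other positions to alanine.

    Returns None when the motif does not occur in the sequence.
    """
    match = motif_to_regex(motif).search(sequence)
    if match is None:
        return None
    residues = ["A"] * len(sequence)
    for offset, symbol in enumerate(motif):
        if symbol != "x":
            residues[match.start() + offset] = symbol
    return "".join(residues)


# Illustrative sequences only, not taken from the thesis data sets.
print(minimal_peptide("KETWV", glycine_positions=[1]))  # AGAAA
print(partial_peptide("HKILHRLLQD", "LxxLL"))           # AAALAALLAA
```

In the partial representation only the motif-defining residues keep their side chains, which is the sequence-level counterpart of the LxxLL hypothesis stated above.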
While many peptide complexes bind their target proteins in a structured fashion
– either through β-strand complementarity or α-helical packing (Chapter 4) – other
peptide interactions such as those observed for SH2 (Bradshaw & Waksman, 2002)
and SH3 domains (Mayer, 2001), do not. In Chapter 5 we have shown that even
these unstructured interactions can be modeled using InteraX patterns, suggesting
that they could be predicted de novo as well. Preliminary analysis has shown that
we can reach a reasonable accuracy (<2Å) in reconstructing these unstructured
interactions when targeting the binding site. However, the positioning of these
unstructured peptides using unstructured anchor fragments is worrisome: while
superpositions using β- or α-fragments typically produce very ‘tight fits’, superpo-
sitions using unstructured fragments do not, increasingly shifting the selection of
binding peptides to the force field and not relying on the prediction capacity stored
within InteraX pairs. We hypothesize that using different superposition mecha-
nisms, for example through iterative superposition using subsets of atoms (Meng
et al., 2006), could improve the accuracy of the method for predicting unstructured
peptide interactions.
Finally, the de novo prediction of peptide structure and specificity, in particular
for therapeutic applications, should not be guided by a single optimization function
– in this case, the estimated binding energy. Instead, α-helical peptide stability
(e.g. using Agadir, Muñoz et al. (1995)) or low peptide aggregation propensities
(e.g. using Tango, Fernandez-Escamilla et al. (2004)) should be considered as
well, amongst other parameters that define a ‘successful’ peptide design. The
proposed peptide design method provides a framework for the implementation of
these additional optimization functions.
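One possible way to combine several optimization functions is to retain only the designs that are not dominated in any objective (a Pareto front). The sketch below is an assumption about how such a selection could look, not the method implemented in BriX Design. The `Design` fields stand in for a binding energy estimate, a helical stability score and an aggregation score, and the numbers are made up for illustration.

```python
from typing import NamedTuple


class Design(NamedTuple):
    name: str
    binding_energy: float    # lower is better (hypothetical units)
    helix_stability: float   # higher is better (hypothetical score)
    aggregation: float       # lower is better (hypothetical score)


def objectives(design):
    """Express every objective so that smaller values are better."""
    return (design.binding_energy, -design.helix_stability, design.aggregation)


def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better once."""
    oa, ob = objectives(a), objectives(b)
    return all(x <= y for x, y in zip(oa, ob)) and any(x < y for x, y in zip(oa, ob))


def pareto_front(designs):
    """Return the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(other, d) for other in designs)]


# Made-up example values.
candidates = [
    Design("pep1", -10.0, 0.5, 5.0),
    Design("pep2", -8.0, 0.9, 2.0),
    Design("pep3", -7.0, 0.4, 6.0),  # worse than pep1 in every objective
]
print([d.name for d in pareto_front(candidates)])  # ['pep1', 'pep2']
```

A Pareto selection avoids choosing arbitrary weights between objectives, at the cost of returning a set of candidates rather than a single best design.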
Author Contributions
P.V., E.V., F.R., J.S., and L.S. conceptualized the study. P.V. developed the first
version of the BriX Design algorithm (as used in Chapter 5). P.V. and E.V. developed
the second version of the BriX Design algorithm (based on Gecode, the Generic
Constraint Development Environment). P.V. performed and analyzed experiments
on the PDZ and the LBD. E.V. performed and analyzed experiments on the SH2
(results not discussed). Gfeller et al. (2010) discovered multiple specificity for the
PDZ1 DLG1 and D.G., E.V., P.V. and L.S. created the structural models and the
specificity profiles to confirm this multiple specificity observed in phage display
experiments.
References
Bradshaw, J.M. & Waksman, G. (2002). Molecu-
lar recognition by sh2 domains. Adv Protein
Chem, 61, 161–210. 153
Doyle, D.A., Lee, A., Lewis, J., Kim, E., Sheng, M.
& MacKinnon, R. (1996). Crystal structures
of a complexed and peptide-free membrane
protein-binding domain: molecular basis of
peptide recognition by pdz. Cell, 85, 1067–
76. 140
Fernandez-Escamilla, A.M., Rousseau, F.,
Schymkowitz, J. & Serrano, L. (2004). Predic-
tion of sequence-dependent and mutational
effects on the aggregation of peptides
and proteins. Nature Biotechnology, 22,
1302–1306. 154
Fiorentini, M., Nielsen, A.K., Kristensen, O., Kas-
trup, J.S. & Gajhede, M. (2009). Structure of
the first pdz domain of human psd-93. Acta
Crystallogr Sect F Struct Biol Cryst Commun,
65, 1254–7. 139
Gfeller, D., Butty, F., Verschueren, E., Vanhee, P.,
Huang, H., Ernst, A., Dar, N., Stagljar, I., Ser-
rano, L., Sidhu, S.S., Bader, G.D. & Kim, P.M.
(2010). The multiple specificity landscape of
peptide recognition modules. Molecular Sys-
tems Biology (in review), 1–35. 139, 140,
141, 154
Gould, C.M., Diella, F., Via, A., Puntervoll, P.,
Gemünd, C., Chabanis-Davidson, S., Michael,
S., Sayadi, A., Bryne, J.C., Chica, C., Seiler,
M., Davey, N.E., Haslam, N., Weatheritt, R.J.,
Budd, A., Hughes, T., Pas, J., Rychlewski, L.,
Travé, G., Aasland, R., Helmer-Citterich, M.,
Linding, R. & Gibson, T.J. (2010). Elm: the
status of the 2010 eukaryotic linear motif re-
source. Nucleic Acids Research, 38, D167–
80. 153
Heldring, N., Pike, A., Andersson, S., Matthews,
J., Cheng, G., Hartman, J., Tujague, M., Ström,
A., Treuter, E., Warner, M. & Gustafsson, J.A.
(2007). Estrogen receptors: how do they sig-
nal and what are their targets. Physiol Rev,
87, 905–31. 143
Jochim, A.L. & Arora, P.S. (2009). Assessment
of helical interfaces in protein-protein inter-
actions. Mol Biosyst, 5, 924–6. 143
Kabsch, W. (1976). A solution for the best rota-
tion to relate two sets of vectors. Acta Cryst.,
922. 151
King, C.A. & Bradley, P. (2010). Structure-based
prediction of protein-peptide specificity in
rosetta. Proteins, 78, 3437–49. 152
Kumar, V. (1992). Algorithms for constraint-
satisfaction problems: A survey. AI maga-
zine. 148
Mayer, B.J. (2001). Sh3 domains: complexity in
moderation. J Cell Sci, 114, 1253–63. 153
Meng, E.C., Pettersen, E.F., Couch, G.S., Huang,
C.C. & Ferrin, T.E. (2006). Tools for inte-
grated sequence-structure analysis with ucsf
chimera. BMC Bioinformatics, 7, 339. 154
Muñoz, V., Blanco, F.J. & Serrano, L. (1995). The
hydrophobic-staple motif and a role for loop-
residues in alpha-helix stability and protein
folding. Nature Structural Biology, 2, 380–5.
154
Nourry, C., Grant, S.G.N. & Borg, J.P. (2003).
Pdz domain proteins: plug and play! Sci
STKE, 2003, RE7. 137, 138, 143, 153
Petsalaki, E., Stark, A. & Russell, R.B. (2009).
Accurate prediction of peptide binding sites
on protein surfaces. PLoS Comput Biol, 5,
e1000335. 142
Schulte, C. (2002). Programming constraint
services: high-level programming of stan-
dard and new constraint services. por-
tal.acm.org. 151
Schymkowitz, J., Borg, J., Stricher, F., Nys, R.,
Rousseau, F. & Serrano, L. (2005). The foldx
web server: an online force field. Nucleic
Acids Research, 33, W382–8. 151
Shiau, A.K., Barstad, D., Loria, P.M., Cheng,
L., Kushner, P.J., Agard, D.A. & Greene, G.L.
(1998). The structural basis of estrogen re-
ceptor/coactivator recognition and the an-
tagonism of this interaction by tamoxifen.
Cell, 95, 927–37. 143
Smith, C.A. & Kortemme, T. (2010). Structure-
based prediction of the peptide sequence
space recognized by natural and synthetic
pdz domains. Journal of Molecular Biology,
402, 460–74. 137
Stein, A., Céol, A. & Aloy, P. (2010). 3did: identification
and classification of domain-based
interactions of known three-dimensional
structure. Nucleic Acids Research. 153
Stiffler, M.A., Chen, J.R., Grantcharova, V.P.,
Lei, Y., Fuchs, D., Allen, J.E., Zaslavskaia, L.A.
& MacBeath, G. (2007). Pdz domain bind-
ing selectivity is optimized across the mouse
proteome. Science, 317, 364–9. 137
Tonikian, R., Zhang, Y., Sazinsky, S.L., Currell, B.,
Yeh, J.H., Reva, B., Held, H.A., Appleton, B.A.,
Evangelista, M., Wu, Y., Xin, X., Chan, A.C.,
Seshagiri, S., Lasky, L.A., Sander, C., Boone,
C., Bader, G.D. & Sidhu, S.S. (2008). A speci-
ficity map for the pdz domain family. PLoS
Biol, 6, e239. 141
Tsang, E. (1993). Foundations of constraint sat-
isfaction. en.scientificcommons.org. 150
7 Discussion
Predicting the structure of proteins is a hard task. Even though nature only
uses twenty different building blocks to construct proteins, the number of
theoretically possible conformations the folded chain of amino acids can adopt is
immense. Yet, knowledge of protein structure is key to understanding many of
the intricacies of biological systems, for example to elucidate protein interaction
pathways or for developing new protein-targeting therapeutics.
In this thesis, we focused on understanding and predicting the structure of
peptides – small molecules of typically no more than 10 amino acids, that play
important roles predominantly in cell signaling and cell regulatory networks. An
estimated 15-40% of all interactions in the cell are directly or indirectly influ-
enced by peptide-binding events. Peptides are also increasingly considered as a
new promising class of therapeutics, because of their small size and interaction
properties that are able to disrupt protein-protein interfaces. The design of these
molecules, however, has been hindered by the lack of structural information. For
example, while a considerable number of protein structures exist in public databases,
as few as 3% describe protein-peptide structure. We have proposed several
methodologies that use polypeptide fragments to extend the structural coverage
for peptide interactions, including loop prediction and peptide specificity design.
In this final chapter, we reflect upon our work and provide thoughts on further
applications.
Fragmenting protein space
The fragmentation of protein structures is an appealing strategy to reduce the com-
binatorial problems typically associated with protein structure prediction: instead
of sampling conformation space of protein structure, the space of naturally occur-
ring fragment templates is considered. The idea is in essence similar to that of
using rotamer libraries for predicting sidechain conformations adopted by protein
structures and derived from sets of experimentally derived high-resolution struc-
tures (Dunbrack, 2002). In such reduced models, search can often proceed in a
deterministic way, since the total number of combinations is significantly more
limited when compared to sampling based on physical principles. In this work,
the use of protein fragments stands central.
The typical limitation of database-driven methods is the lack of data to accu-
rately describe new protein structures in full atomic detail. Nearly twenty years ago,
Chothia (1992) estimated that nature uses around 1000 different protein folds, and
ten years later, Aloy & Russell (2004) estimated the total number of protein-protein
interactions at roughly 10,000. In our work we have taken a more pragmatic
approach in an attempt to describe the structural protein space: we fragmented and
clustered a high number of non-homologous protein structures (>7000), resulting
in a rich alphabet of >1000 classes per fragment length (Chapter 2). Using BriX,
Baeten et al. (2008) showed that proteins could be globally reconstructed with an
average accuracy of 0.48 Å when compared to x-ray models. From these results,
one could conclude that – at least at the fragment level – our structural alphabet is
complete.
Loops are often the hardest parts of the protein structure to classify because
of their high variability. As a consequence, they are often poorly represented
or ignored in existing structural alphabets (Le et al., 2009). In BriX, we noticed
that many of these irregular structures are not classified, making the database
less useful for high-resolution structure prediction and refinement. To overcome
this limitation, we presented a novel way to group these ‘irregular’ elements by
grouping on their regular end points, i.e. the α-helices and β-strands that flank
the irregular loop residues (Chapter 2). We validated this new alphabet on a
set of 527 loops and found that loops up to a length of 12 (∼90% of all loops
in the data set) could be described with high accuracy (< 2 Å RMSD), largely
independent from sequence homology between the original loop and the BriX
template (Chapter 3). Thus, contrary to the perceived irregularity of loops, the
‘space’ of experimentally observed loop structures seems saturated, at least for
small and medium-sized loops. As a result, there appears to be no need to employ
computationally expensive ab initio methods for the prediction of protein loop
structure, unless combined with database methods.
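The idea of grouping loops by their regular end points can be illustrated on a per-residue secondary-structure string. This is a simplified sketch under stated assumptions: the three-letter alphabet (`H` helix, `E` strand, `L` loop), the function name and the grouping key are invented here, and the grouping uses only the type of the flanking elements and the loop length, whereas a structural alphabet would also compare the geometry of the flanks.

```python
import re
from collections import defaultdict


def loops_by_flanks(secondary_structure, max_length=12):
    """Group loop segments by (N-terminal flank, C-terminal flank, loop length).

    `secondary_structure` uses 'H' (helix), 'E' (strand) and 'L' (loop).
    Terminal tails, which have only one flank, and loops longer than
    `max_length` are skipped. Values are (start, end) index pairs, end exclusive.
    """
    groups = defaultdict(list)
    for match in re.finditer(r"L+", secondary_structure):
        start, end = match.span()
        if start == 0 or end == len(secondary_structure):
            continue  # no regular element on one side
        length = end - start
        if length > max_length:
            continue
        key = (secondary_structure[start - 1], secondary_structure[end], length)
        groups[key].append((start, end))
    return dict(groups)


# Toy assignment: helix, 3-residue loop, strand, 2-residue loop, helix, tail.
print(loops_by_flanks("HHHHLLLEEEELLHHHLL"))
```

The default `max_length=12` mirrors the loop length up to which the text reports accurate descriptions; it is a parameter of this sketch, not a constant of the database.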
Protein fragments in isolation, i.e. outside of their network of local interactions,
have limited meaning. For example, protein fragments with similar backbone di-
hedral angles might or might not form hydrogen bonds depending on the protein
context. Therefore, we introduced the concept of ‘fragment context’ by mining
the BriX database for fragment interactions. These fragment interactions from
monomeric proteins, stored in the InteraX database, accurately capture backbone
and sidechain interactions, hydrogen bonding networks, packing constraints, or
even electrostatic interactions (Chapter 5). While we did not make an attempt
to improve the protein reconstruction accuracy when using this database, we
extensively used these ‘interaction motifs’ for the description of protein-peptide
structure. We found that peptide interactions are structurally comparable to frag-
ment interactions from InteraX. On a large set of available protein-peptide struc-
tures (stored in the PepX database, Chapter 4), we show that in nearly half of the
cases the architecture of the peptide binding site can be entirely reconstructed
using interacting protein fragments. In some cases, the entire architecture of the
protein-peptide binding site was recovered in unrelated monomeric proteins,
potentially providing clues to the evolutionary origins of these peptide motifs or their
function in the cell.
Our findings, both for protein loops and protein-peptide binding modes, illus-
trate that even in the absence of homology, structural relations using fragments
can be used for modeling with high accuracy. Moreover, the relation between
intramolecular and intermolecular interaction patterns effectively turns the entire
database of protein fragments into learning data for describing protein-peptide
complexes.
On the relation between sequence and structure
The sequence of a protein uniquely determines its structure and thus its function.
Even though this correlation is strikingly simple, its consequences are not: many
mutations in the sequence of a protein do not affect the structure, while others
might lead to the collapse of the protein chain (Alexander et al., 2009). Using
BriX, we have made various attempts to link structure with sequence, yet we
could not observe general patterns. For example, the sequence conservation in
BriX classes is very low, with only few exceptions such as sequences containing
proline and glycine that have a pronounced effect on the structure of the protein
backbone (Baeten et al., 2008). Sequence-to-structure relationships, however, are
common: fragments with a similar sequence often adopt a similar
structure, an insight that has been at the basis of the Rosetta software (Rohl
et al., 2004). We did not observe any sequence similarity between protein-peptide
complexes and interacting fragments (with sequence similarities as low as 0-14%),
even when the entire binding site could be reconstructed from a single protein or
the entire interaction network was preserved.
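For reference, a position-wise sequence identity between two aligned fragments of equal length can be computed as below. This is a minimal sketch, assuming an ungapped one-to-one correspondence between residues; the exact similarity measure used in the analysis is not specified here, and the example sequences are invented.

```python
def percent_identity(fragment_a, fragment_b):
    """Percentage of positions with identical residues in two equal-length fragments."""
    if len(fragment_a) != len(fragment_b):
        raise ValueError("fragments must have the same length")
    if not fragment_a:
        raise ValueError("fragments must not be empty")
    matches = sum(a == b for a, b in zip(fragment_a, fragment_b))
    return 100.0 * matches / len(fragment_a)


# Invented example: one identical position out of seven.
print(round(percent_identity("ACDEFGH", "AKLMNPQ"), 1))  # 14.3
```

For short fragments a single shared residue already produces a double-digit percentage, which is worth keeping in mind when reading identity values for peptides of fewer than ten residues.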
These findings suggest that the architectural framework of proteins is largely
independent of the specific amino acid sequence, providing opportunities for the
design of proteins and peptides using BriX and InteraX.
Fragment-based prediction of protein loop and peptide structure
We evaluated the predictive capacity of our fragment-based approach in com-
bination with the full-atom force field FoldX (Schymkowitz et al., 2005). We
proposed two methodologies to predict protein structure: LoopX predicts protein
loop structure given the amino acid sequence of the loop (Chapter 3), and BriX
Design (Chapter 6) predicts peptide structure given the amino acid sequence of
the peptide and a structure of the target protein.
We demonstrated that LoopX outcompetes state-of-the-art methods, both in accuracy
and in speed. We compared LoopX to Kinematic Closure (Mandell et al., 2009),
a robotics-inspired method that samples loop conformations with sub-angstrom
accuracy. On three data sets of 12-residue loops, LoopX reached comparable ac-
curacy but in a fraction of the time (∼5-60 minutes vs. ∼320 hours). On another
data set, LoopX was compared to one database method and three other ab initio
methods. With a prediction coverage of nearly 80%, LoopX performed better than
each of these methods. The performance dramatically decreases with larger loop
lengths (> 12 residues), suggesting that the diversity of available loop templates of
larger lengths is insufficient. Our results demonstrate that using a combination of
loop templates, rotamer-dependent side chain construction and all-atom energy
evaluation, both high-accuracy and high-throughput loop reconstruction for loops
≤ 12 residues is now within reach.
The number of experimentally resolved protein-peptide complexes is rather low
(∼1400 complexes), and is even reduced to ∼500 binding modes when discarding
structural similarities (Chapter 4). Therefore, we relied on the database of interact-
ing fragments to increase the structural space of peptide binding modes. For nine
representative protein-peptide complexes, we have shown the predictive power
of the InteraX database by docking 4 out of 9 peptides with sub-angstrom accu-
racy. This docking exercise is somewhat limited since the structure of the peptide
needs to be known beforehand. In a second step, we lifted this requirement and
showed for two cases, the PDZ domain and the LBD, how we used multiple frag-
ment interactions to reconstruct the entire binding mode with high accuracy. The
combination BriX-InteraX-FoldX was also able to describe the energy landscape of
the peptides in close contact with the protein surfaces, with clear ‘energy funnels’
towards the crystallized binding pocket.
In addition to predicting the canonical binding motif for PDZ1, the method was
able to detect small differences in peptide specificity. Gfeller et al. (2010) recently
discovered multiple specificity for the PDZ1 domain through the clustering of phage
display data (Chapter 6). However, since structural evidence for this observation
was lacking, we made a structural validation for this alternative peptide motif. We
predicted a peptide structure that differed from the canonical peptide structure
with one residue extending at the C-terminus. Through in silico mutagenesis
experiments with FoldX, we detected a specificity profile similar to the profile
detected using peptide phage display.
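An in silico mutagenesis scan of this kind can be outlined as follows. The sketch is hedged in several ways: `energy_fn` is a hypothetical stand-in for an interaction energy evaluation (FoldX is not called here), the conversion of energies into per-position frequencies by Boltzmann weighting is an assumption of this example rather than the procedure used in the thesis, and the toy energy function exists only to make the code runnable.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
RT = 0.593  # kcal/mol at roughly 298 K


def specificity_profile(peptide, energy_fn):
    """Mutate each position to all 20 amino acids and weight them by energy.

    `energy_fn` maps a peptide sequence to an estimated interaction energy
    (lower is better). Returns one {amino acid: probability} dict per position.
    """
    profile = []
    for position in range(len(peptide)):
        energies = {}
        for residue in AMINO_ACIDS:
            mutant = peptide[:position] + residue + peptide[position + 1:]
            energies[residue] = energy_fn(mutant)
        best = min(energies.values())
        # Subtract the best energy before exponentiating for numerical stability.
        weights = {aa: math.exp(-(e - best) / RT) for aa, e in energies.items()}
        total = sum(weights.values())
        profile.append({aa: w / total for aa, w in weights.items()})
    return profile


def toy_energy(sequence):
    """Hypothetical energy: rewards a C-terminal valine, nothing else."""
    return -2.0 if sequence.endswith("V") else 0.0


profile = specificity_profile("KETWV", toy_energy)
print(max(profile[-1], key=profile[-1].get))  # V
```

Each position is scanned independently while the rest of the peptide keeps its wild-type residues, so couplings between positions are not captured by this simple profile.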
These results suggest that the method can produce reliable, high-resolution
models to verify peptide motifs for which no three-dimensional structure exists
(Gould et al., 2010; Stein et al., 2010). Moreover, small differences in the PDZ
domain such as the displacement of the carboxylate-binding loop are detected,
producing structural models with different specificities. This level of detail opens
the possibility for family-wide prediction of e.g. PDZ-peptide binding, for which
many high-resolution structures are still missing (Smith & Kortemme, 2010).
Towards modeling of conformational ensembles
Proteins are not rigid structures that occupy a single low-energy conformation.
Instead, their dynamic nature is important for protein function and evolvability
(Tokuriki & Tawfik, 2009). Modeling of the conformational ensembles proteins
can adopt is a computationally expensive task (Shaw et al., 2010). Because of the
organization of the BriX and PepX databases, fragments are grouped into classes
of similar structure. We believe that these ‘natural’ ensembles can, to a certain
extent, represent the structural variability of proteins. As we demonstrate in the
case of loops, the conformational ensemble adopted by a peptide-binding loop and
observed from different crystal structures, could be modeled with sub-angstrom
resolution (Chapter 3). Further work could employ BriX ensembles to model
flexibility as observed in NMR structures, or reproduce the diversity from structural
ensembles for which a large number of high-resolution models in different contexts
are available, e.g. for ubiquitin (Lange et al., 2008) or lysozyme (Baase et al., 2010).
In contrast to other methods sampling conformational and sequence ensembles
(Friedland & Kortemme, 2010), BriX fragments provide slight backbone variations
observed from experimentally determined structures. Using naturally occurring
backbone ensembles has the potential to lift the ‘fixed backbone’ assumption so
often employed in computational modeling (Mandell & Kortemme, 2009). In
Chapter 6 we showed how these structural ensembles can be used to introduce
backbone flexibility in the peptide that improves the overall binding affinity. The
introduction of ‘mending’ methods to connect fragments along the polypeptide
chain could make these fragment-based methods applicable to the modeling of
backbone movements and conformational ensembles in protein structures.
Combining protein loop prediction with peptide prediction provides challenges
too. For example, it was recently demonstrated that the loops of the SH2 domain
have a major role in the selectivity of this domain for peptide binding (Kaneko
et al., 2010). Structural variations of that loop either blocked or opened access to
one of the three binding subsites of the SH2 domain, thus orchestrating peptide
binding. Coordinated prediction of loop movement at the side of the domain while
keeping a flexible model of the peptide in the binding pocket could lead to both
the understanding and the design of selective peptides.
Dynamic approaches towards structure prediction will become essential for a
further development of the modeling and prediction fields, and fragment-based
approaches such as the ones introduced in this work might play a pivotal role.
High-throughput docking and design
Recent years have seen the emergence of high-throughput proteomics,
attempting to characterize every protein-protein interaction in entire organisms (Li
et al., 2004; Kühner et al., 2009). While around 10,000 interactions have been
experimentally observed among roughly 6000 proteins in Saccharomyces cerevisiae
(Uetz et al., 2000), only for an estimated 1% structural information is available,
complicating the verification and understanding of many of these interactions.
Methodologies to increase this structural coverage are thus expected to contribute
to our general understanding of the ‘interactome’ (Mosca et al., 2009), yet too of-
ten the quality of the structural predictions is in the mid- and low-range as reported
by the CAPRI docking competition (Lensink et al., 2007).
The idea of using interacting fragments with preoptimized packing and back-
bone hydrogen bonding could be applied to modeling protein-protein interac-
tions too. However, protein-protein interfaces are typically less hydrophobic than
protein-peptide interfaces, with less hydrogen bonding because of their large in-
terfaces (London et al., 2010). Hence, the structural insight presented in Chapter
5 might not hold for the prediction of protein-protein interactions. Alternatively,
structural similarity to other protein-protein interfaces has already been used to
improve upon docking experiments (Sinha et al., 2010). Extending the space of
interacting fragments to include fragments from available protein-protein inter-
faces (e.g. by fragmenting the ∼8000 protein interface architectures described by
Tuncbag et al. (2008)), could provide the data needed to cover the space of most
protein interactions, and integration within the current design protocols would be
a straightforward task.
For high-throughput purposes, the right balance between speed and accuracy is
of crucial importance. On the one hand, homology-based methods are extremely fast
but rely on the presence of a related model. On the other hand, ab initio methods
such as molecular dynamics simulations are notoriously slow, making their use
limited to a handful of cases only, even with the advent of massive parallelization
and cloud computing (Snow et al., 2002; Shaw et al., 2010). Using fragments
from large databases such as BriX and Loop BriX has enormous advantages: natural
backbone templates can be used for modeling, avoiding the high cost of
explicit backbone flexibility, such that computer time can be focused on optimizing
side chain packing. The algorithms developed in this thesis – LoopX (Chapter 3) and
BriX Design (Chapter 6) – both follow this philosophy. As we have demonstrated,
loops can be modeled accurately (< 2Å RMSD) in as few as ∼5-60 minutes, while
peptides can be designed with sub-angstrom accuracy in 5 minutes to a couple of
hours only. These results show the potential for applying our methodologies for
large-scale analysis and prediction.
Virtual peptide design for therapeutics
The main advantage of developing peptide or peptide-like drugs is that they can
help expand the ‘druggable genome’ (Hopkins & Groom, 2002) by specifically
targeting protein-protein interactions, which are less suitable for small-molecule-based
therapies. Peptide drugs can offer an increase in target selectivity and have the
potential to act on and show reduced toxicity in comparison with small molecules.
Unfortunately, rational peptide designs that have shown potent inhibition in vitro or
in vivo are still rare and often are directly derived from a crystallized protein-protein
interface (Chapter 1).
From a design perspective, there are clear advantages to developing peptide
therapeutics: a large body of existing structural and other experimental data on
protein-protein and protein-peptide interactions is already available. Moreover,
the closely related protein structure prediction and design field is already relatively
well-developed. Because peptides are constructed from amino acids, the large
body of energy potentials developed in the fields of protein folding, docking and
dynamics can be applied to peptide structure prediction (Kaufmann et al., 2010;
Schymkowitz et al., 2005). To harness the full potential of such approaches would
require, for example, the introduction of non-natural amino acids to further extend
the chemical repertoire (Link et al., 2003).
Methods for the prediction and design of peptides, including the one presented
in this work (Chapter 6), are currently being developed. Unfortunately, their
evaluation is often based on the calculated root mean square deviation (RMSD) – or
similar measures – from existing structural data, or using calculated binding scores;
only in a number of cases other experimental data is used, such as specificity assays
from phage display or peptide array experiments. When designing peptides, the
optimization process is often carried out using the interaction energy estimates
as the scoring function, disregarding other relevant factors of ‘successful’ peptide
designs, such as metabolic stability or specificity. Multi-objective peptide design
algorithms are thus expected to improve over current designs.
For the algorithm developer and user, few ready-to-use benchmark sets exist,
making algorithm comparison time-consuming and often impossible. The organization
of a dedicated peptide prediction competition along the lines of the structure
prediction competition CASP (Moult, 2005) or the protein docking competition
CAPRI (Janin, 2005) could help advance the peptide design field.
In conclusion, although several roadblocks still exist in the development of
peptide or peptide-like drugs, the future for this class of molecules is bright.
References
Alexander, P.A., He, Y., Chen, Y., Orban,
J. & Bryan, P.N. (2009). A minimal se-
quence code for switching protein structure
and function. Proceedings of the National
Academy of Sciences, 106, 21149–54. 160
Aloy, P. & Russell, R.B. (2004). Ten thousand
interactions for the molecular biologist. Na-
ture Biotechnology, 22, 1317–21. 158
Baase, W.A., Liu, L., Tronrud, D.E. & Matthews,
B.W. (2010). Lessons from the lysozyme of
phage t4. Protein Sci, 19, 631–41. 162
Baeten, L., Reumers, J., Tur, V., Stricher, F.,
Lenaerts, T., Serrano, L., Rousseau, F. &
Schymkowitz, J. (2008). Reconstruction of
protein backbones from the brix collection
of canonical protein fragments. PLoS Com-
put Biol, 4, e1000083. 158, 160
Chothia, C. (1992). One thousand families for
the molecular biologist. Nature, 357, 543–4.
158
Dunbrack, R.L. (2002). Rotamer libraries in the
21st century. Curr Opin Struct Biol, 12, 431–
40. 158
Friedland, G.D. & Kortemme, T. (2010). Design-
ing ensembles in conformational and se-
quence space to characterize and engineer
proteins. Curr Opin Struct Biol, 20, 377–84.
162
Gfeller, D., Butty, F., Verschueren, E., Vanhee, P.,
Huang, H., Ernst, A., Dar, N., Stagljar, I., Ser-
rano, L., Sidhu, S.S., Bader, G.D. & Kim, P.M.
(2010). The multiple specificity landscape of
peptide recognition modules. Molecular Sys-
tems Biology (in review), 1–35. 161
Gould, C.M., Diella, F., Via, A., Puntervoll, P.,
Gemünd, C., Chabanis-Davidson, S., Michael,
S., Sayadi, A., Bryne, J.C., Chica, C., Seiler,
M., Davey, N.E., Haslam, N., Weatheritt, R.J.,
Budd, A., Hughes, T., Pas, J., Rychlewski, L.,
Travé, G., Aasland, R., Helmer-Citterich, M.,
Linding, R. & Gibson, T.J. (2010). Elm: the
status of the 2010 eukaryotic linear motif re-
source. Nucleic Acids Research, 38, D167–
80. 161
Hopkins, A.L. & Groom, C.R. (2002). The drug-
gable genome. Nat Rev Drug Discov, 1, 727–
30. 164
Janin, J. (2005). Assessing predictions of
protein-protein interaction: the capri exper-
iment. Protein Sci, 14, 278–83. 165
Kaneko, T., Huang, H., Zhao, B., Li, L., Liu, H.,
Voss, C.K., Wu, C., Schiller, M.R. & Li, S.S.C.
(2010). Loops govern sh2 domain specificity
by controlling access to binding pockets. Sci-
ence Signaling, 3, ra34. 162
Kaufmann, K.W., Lemmon, G.H., Deluca, S.L.,
Sheehan, J.H. & Meiler, J. (2010). Practically
useful: What the rosetta protein modeling
suite can do for you. Biochemistry, 49,
2987–2998. 165
Kühner, S., van Noort, V., Betts, M.J., Leo-Macias,
A., Batisse, C., Rode, M., Yamada,
T., Maier, T., Bader, S., Beltran-Alvarez, P.,
Castaño-Diez, D., Chen, W.H., Devos, D.,
Güell, M., Norambuena, T., Racke, I., Rybin,
V., Schmidt, A., Yus, E., Aebersold, R., Herrmann,
R., Böttcher, B., Frangakis, A.S., Russell,
R.B., Serrano, L., Bork, P. & Gavin, A.C.
(2009). Proteome organization in a genome-
reduced bacterium. Science, 326, 1235–40.
163
Lange, O.F., Lakomek, N.A., Farès, C., Schröder,
G.F., Walter, K.F.A., Becker, S., Meiler, J.,
Grubmüller, H., Griesinger, C. & de Groot,
B.L. (2008). Recognition dynamics up to mi-
croseconds revealed from an rdc-derived
ubiquitin ensemble in solution. Science, 320,
1471–5. 162
Le, Q., Pollastri, G. & Koehl, P. (2009). Struc-
tural alphabets for protein structure clas-
sification: a comparison study. Journal of
Molecular Biology, 387, 431–50. 158
Lensink, M.F., Méndez, R. & Wodak, S.J. (2007).
Docking and scoring protein complexes:
Capri 3rd edition. Proteins, 69, 704–18. 163
Li, S., Armstrong, C.M., Bertin, N., Ge, H.,
Milstein, S., Boxem, M., Vidalain, P.O., Han,
J.D.J., Chesneau, A., Hao, T., Goldberg, D.S.,
Li, N., Martinez, M., Rual, J.F., Lamesch,
P., Xu, L., Tewari, M., Wong, S.L., Zhang,
L.V., Berriz, G.F., Jacotot, L., Vaglio, P.,
Reboul, J., Hirozane-Kishikawa, T., Li, Q.,
Gabel, H.W., Elewa, A., Baumgartner, B., Rose,
D.J., Yu, H., Bosak, S., Sequerra, R., Fraser,
A., Mango, S.E., Saxton, W.M., Strome, S.,
Heuvel, S.V.D., Piano, F., Vandenhaute, J.,
Sardet, C., Gerstein, M., Doucette-Stamm, L.,
Gunsalus, K.C., Harper, J.W., Cusick, M.E.,
Roth, F.P., Hill, D.E. & Vidal, M. (2004). A
map of the interactome network of the meta-
zoan c. elegans. Science, 303, 540–3. 163
Link, A.J., Mock, M.L. & Tirrell, D.A. (2003).
Non-canonical amino acids in protein engi-
neering. Curr Opin Biotechnol, 14, 603–9.
165
London, N., Movshovitz-Attias, D. & Schueler-
Furman, O. (2010). The structural basis
of peptide-protein binding strategies. Struc-
ture, 18, 188–199. 163
Mandell, D.J. & Kortemme, T. (2009). Backbone flexibility in computational protein design. Curr Opin Biotechnol, 20, 420–8. 162
Mandell, D.J., Coutsias, E.A. & Kortemme, T. (2009). Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat Methods, 6, 551–2. 160
Mosca, R., Pons, C., Fernández-Recio, J. & Aloy, P. (2009). Pushing structural information into the yeast interactome by high-throughput protein docking experiments. PLoS Comput Biol, 5, e1000490. 163
Moult, J. (2005). A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol, 15, 285–9. 165
Rohl, C.A., Strauss, C.E.M., Misura, K.M.S. & Baker, D. (2004). Protein structure prediction using Rosetta. Meth Enzymol, 383, 66–93. 160
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The FoldX web server: an online force field. Nucleic Acids Research, 33, W382–8. 160, 165
Shaw, D.E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R.O., Eastwood, M.P., Bank, J.A., Jumper, J.M., Salmon, J.K., Shan, Y. & Wriggers, W. (2010). Atomic-level characterization of the structural dynamics of proteins. Science, 330, 341–6. 162, 164
Sinha, R., Kundrotas, P.J. & Vakser, I.A. (2010). Docking by structural similarity at protein-protein interfaces. Proteins, 78, 3235–41. 163
Smith, C.A. & Kortemme, T. (2010). Structure-based prediction of the peptide sequence space recognized by natural and synthetic PDZ domains. Journal of Molecular Biology, 402, 460–74. 162
Snow, C.D., Nguyen, H., Pande, V.S. & Gruebele, M. (2002). Absolute comparison of simulated and experimental protein-folding dynamics. Nature, 420, 102–6. 164
Stein, A., Céol, A. & Aloy, P. (2010). 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research. 161
Tokuriki, N. & Tawfik, D.S. (2009). Protein dynamism and evolvability. Science, 324, 203–7. 162
Tuncbag, N., Gursoy, A., Guney, E., Nussinov, R. & Keskin, O. (2008). Architectures and functional coverage of protein-protein interfaces. J Mol Biol, 381, 785–802. 163
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, M., Johnston, M., Fields, S. & Rothberg, J.M. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–7. 163
List of Figures
1.1 Proteins interacting with antibodies, small molecules and peptides.
1.2 Amino acids grouped by properties.
1.3 Four levels of protein structure.
1.4 Targeting the cell with different molecules.
1.5 History of experimental structure determination from the last fifty years.
1.6 PDZ-peptide interactions and peptide specificity.
1.7 Example workflows for peptide design.
1.8 Stapled helical peptides as potent therapeutic peptides.
1.9 Design of helices targeting Trans-Membrane (TM) proteins.
1.10 Innovative structural approaches in peptide design using BriX and InteraX.
2.1 SCOP representation of ASTRAL40.
2.2 The BriX database.
2.3 Number of BriX classes versus different classification thresholds.
2.4 Percentage of classified fragments versus different classification thresholds.
2.5 Secondary structure content for classified fragments in BriX.
2.6 Secondary structure content for unclassified fragments in BriX.
2.7 The Loop BriX database.
2.8 The BriX website (http://brix.crg.es).
2.9 BriX applications: ‘covering’ and ‘bridging’.
3.1 Predicted loop structure with LoopX.
3.2 Accuracy of LoopX versus state-of-the-art loop prediction methods on datasets 1, 2, 3 and 4.
3.3 Dataset 1: comparison of LoopX with Rosetta and KIC.
3.4 Dataset 2: comparison of LoopX with Rosetta and KIC.
3.5 Dataset 3: comparison of LoopX with KIC.
3.6 Influence of loop homology in LoopX.
3.7 Conformational ensemble adopted by the PDZ carboxylate-binding loop.
3.8 Cross-comparison of backbone distances of the conformational ensemble of the carboxylate-binding loop.
3.9 Reconstruction of PDZ carboxylate-binding loop ensemble.
3.10 LoopX prediction accuracy compared with FREAD, MODELLER, RAPPER and PLOP.
3.11 Overview of the LoopX algorithm.
4.1 Contents of PepX.
4.2 Examples of protein-peptide clusters from PepX.
4.3 Distribution of number of elements in the PepX clusters.
4.4 Distribution of peptide and receptor size.
4.5 Receptor sequence redundancy within the PepX database.
4.6 PepX Annotations.
4.7 Representation of the SCOP and CATH hierarchies in PepX.
4.8 Distribution of PepX structures in the different SCOP classes.
4.9 PepX usage statistics.
4.10 PepX user flow.
4.11 Search options in the PepX database.
5.1 Different protein fragment interactions from InteraX.
5.2 Coverage of protein-peptide interfaces.
5.3 Protein-peptide interfaces can be described as interactions between recurrent protein fragments from monomeric proteins.
5.4 Relation between intermolecular interface architectures and intramolecular protein architectures.
5.5 Properties of the protein-peptide interface coverage.
5.6 Properties of the protein-peptide interface coverage correlated with binding energy and burial.
5.7 Physical properties are not conserved in the BriX covers.
6.1 Docking of the PDZ peptide using InteraX patterns.
6.2 Peptide design for the canonical PDZ domain.
6.3 Multiple specificity for DLG1 PDZ1 binding peptides.
6.4 Structural prediction for the alternative PDZ specificity.
6.5 PDZ peptide-binding site prediction.
6.6 Estrogen receptor α-LBD binding site.
6.7 Helical peptide design for the LBD.
6.8 Recognizing LBD peptide specificity on designed templates.
6.9 LBD peptide-binding site prediction.
6.10 Method: BriX peptide design implemented as a Constraint Satisfaction Problem.
List of Tables
1.1 Leading examples of peptide therapeutics currently on the market.
2.1 Distribution of loops across the four main loop categories for four different loop databases.
2.2 Classification of loops within Loop BriX.
4.1 Public databases of protein-ligand complexes.
5.1 InteraX protein fragment interactions.
5.2 Coverage statistics for the most populated classes in the protein-peptide dataset.
6.1 Peptide docking using InteraX.

Predicting peptide interactions using protein building blocks

  • 1.
    Faculty of Scienceand Bio-engineering Sciences Department of Bio-engineering Sciences Predicting peptide interactions using protein building blocks Thesis submitted in partial fulfilment of the requirements for the degree of Doctor in Bio-engineering Sciences Peter Vanhee Promotor: Prof. Dr. Frederic Rousseau Co-promoter: Prof. Dr. Joost Schymkowitz March 4th, 2011
  • 2.
    Published by theVIB Switch Laboratory SWIT, Department of Bio-engineering Sciences Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel Apart from any fair dealing for the purpose of research, private study, criticism or review, this publication may not be reproduced, stored in a retrieval system, or transmitted in any form, by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior permission in writing of the publisher. Peter Vanhee was funded by a PhD grant from the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen), Belgium. Predicting peptide interactions using protein building blocks. Peter Vanhee PhD disser- tation Vrije Universiteit Brussel, Brussels, Belgium, March 2011. Cover: design by Antonio De Marco and Peter Vanhee. © Vrije Universiteit Brussel, all rights reserved.
  • 3.
    Summary P roteins are byfar the most versatile and complex molecules in the cell. It is commonly accepted that protein function directly relates to three-dimensional structure, which in turn is dependent on the specific amino acid sequence of the protein. Peptides are short sequences of amino acids that perform a myriad of functions and are estimated to be involved in up to 40% of all protein-protein interactions. The lack of structural evidence for many of these peptide interac- tions however has hindered the functional annotation of this important class of molecules and the development of peptides as therapeutics. In this thesis, we propose the use of small, recurrent polypeptide fragments as one way of solving the lack of protein-peptide structures. We show that protein-peptide binding sites can be modeled at high resolutions using fragment interactions and provide two methods for the de novo prediction of protein loops and peptide structure. The developments presented in this work provide a valuable alternative to experimental high-resolution structure elucidation of target protein-peptide complexes, bringing closer the possibility of in silico designed peptides for therapeutic applications.
  • 5.
    Samenvatting E iwitten zijn veruitde meest krachtige en complexe biologische moleculen in de cel. Ze zijn essentieel en aanwezig in alle vormen van het leven: van virussen en bacteri¨en tot planten en dieren. Het is algemeen aanvaard dat de functie van een eiwit afhankelijk is van de driedimensionale structuur van het eiwit, die op haar beurt meteen gerelateerd kan worden aan de opeenvolging van aminozuren waaruit het eiwit is opgebouwd. Eiwitten interageren met allerhande moleculen zoals andere eiwitten, DNA, RNA of peptiden. Peptiden zijn moleculen die bestaan uit korte sequenties van aminozuren. Er wordt geschat dat eiwit-peptide interacties een rol spelen in meer dan 40% van alle eiwit-eiwit interacties in en buiten de cel. Het gebrek aan data omtrent de driedimensionale structuur van deze peptide-interacties heeft er evenwel voor gezorgd dat de functie van veel van deze interacties tot nog toe onbekend is; het gebrek aan structurele data is ook een hinderpaal in de ontwikkeling van deze klas van moleculen als nieuwe en krachtige geneesmiddelen. In deze thesis stellen we het gebruik van eiwitfragmenten voor om de complexe structuur van peptide-interacties te voorspellen en zodoende het gebrek aan hoge- resolutie structuren te omzeilen. Hiervoor maken we gebruik van BriX, een data- bank met meer dan 7 miljoen eiwitfragmenten bestaande uit 4 tot 14 aminozuren elk, waarin ongeveer 2000 canonieke eiwitfragmenten kunnen ge¨ıdentificeerd wor- den. We tonen aan dat de bindingsoppervlakken tussen eiwitten en peptiden sterke v
  • 6.
    gelijkenissen vertonen metde interacties tussen eiwitfragmenten die uit verschil- lende, niet-gerelateerde eiwitstructuren worden ge¨extraheerd. Dit inzicht laat ons toe de enorme hoeveelheid aan structurele data uit deze eiwitfragmenten te ge- bruiken voor het voorspellen van de interactie tussen eiwit en peptide. We tonen bijvoorbeeld aan dat we accuraat de structuur kunnen voorspellen van peptiden die binden aan modulaire domeinen, zoals PDZ domeinen of het LBD domein van de oestrogeen receptor. De ontwikkelingen gepresenteerd in dit werk bieden een alternatief voor het ex- perimenteel oplossen van hoge-resolutie structuren van eiwit-peptide interacties, en brengen ons een stap dichter bij het ontwerpen van peptiden voor therapeutis- che doeleinden. vi
  • 7.
    Preface This thesis dealswith the topic of protein and peptide structure prediction and design. Parts of this research have been published or are currently in the process of publication. It is the objective of this thesis to unite and present all the individual findings obtained and described in each manuscript. I have, however, taken the liberty to edit each of these manuscripts to fit the flow of the thesis. Each chapter, with the exception of first and last, contains an introduction, a results section, a materials and methods section, and a conclusion. Chapter 1 introduces proteins and peptides and the role of structure. We have made an attempt to bring an objective overview of the field of protein modeling and design that is relevant to many of the concepts and applications presented here. We also introduce the field of peptide prediction and design and its relevance for therapeutics (Vanhee et al., 2011). Chapter 2 introduces the ‘protein fragment paradigm’ that is key to this work. Two databases of protein fragments – BriX and Loop BriX (http://brix.crg.es) – are described that provide a vast resource for fragment-based protein structure prediction and design (Vanhee et al., 2011). Chapter 3 describes LoopX (http://loopx.crg.es), a method for de novo pre- diction of protein loops, the most variable parts of the protein structure and no-
  • 8.
    toriously difficult topredict. We describe how LoopX outperforms state-of-the-art methods, combining a 100-fold speed increase with excellent prediction accuracy and coverage for loops up to 12 residues. Moreover, we demonstrate that LoopX can model conformational ensembles adopted by protein loops. Chapter 4 provides a comprehensive overview of the structural landscape of protein-peptide interactions, conveniently stored in the PepX database (http: //pepx.switchlab.org). Protein-peptide complexes are classified based on the architecture of their binding sites and annotated with both structural and biological information (Vanhee et al., 2010). Chapter 5 provides an key insight in the structure of the protein-peptide inter- actions, relating the architecture of monomeric proteins with the architecture of protein-peptide complexes. Our analysis, building on both the BriX and PepX databases, suggests that the wealth of structural data on monomeric proteins can be harvested to model peptide interactions (Vanhee et al., 2009). Chapter 6 puts many of the developed insights into practice by describing a peptide structure prediction algorithm that is able to model peptide interactions without previous knowledge of the complex structure. We provide two in-depth case stud- ies on the PDZ domain and the α-ligand binding domain of the estrogen receptor, demonstrating the potential for structure prediction of peptide motifs. Finally, Chapter 7 provides a discussion on the general topic of this thesis. viii
  • 9.
    Publications Protein-Peptide Interactions Adoptthe Same Structural Motifs as Monomeric Pro- tein Folds. Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Tom Lenaerts, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Structure, August 2009. PepX: a structural database of non-redundant protein-peptide complexes. Peter Vanhee, Joke Reumers, Francois Stricher, Lies Baeten, Luis Serrano, Joost Schymkowitz, Frederic Rousseau. Nucleic Acids Research, January 2010. Modeling protein-peptide interactions using protein fragments: fitting the pieces? Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. BMC Bioinformatics, December 2010. BriX: a database of protein building blocks for structural analysis, modeling and design. Peter Vanhee*, Erik Verschueren*, Lies Baeten, Francois Stricher, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Nucleic Acids Research, January 2011. Computational design of peptide ligands. Peter Vanhee, Almer van der Sloot, Erik Ver- schueren, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Trends in Biotech- nology, May 2011. ix
  • 11.
    Acknowledgements This thesis isthe result of the hard work of many, many different people I have worked together with during the course of my PhD. Here, I would like to express my gratitude towards them. First of all, I would like to thank my supervisors, Joost Schymkowitz and Frederic Rousseau, who have been tremendously helpful during the course of this work. They introduced me to the complex maze the field of biology really is, encouraged me to do research at the forefront of science, and supervised this project from start to end. They have motivated me, at the SWITCH lab in which I started this PhD, to develop a deep interest in molecular biology. I also would like to thank Tom Lenaerts, my master thesis supervisor, who proposed me to start a PhD and introduced me to the SWITCH lab. He has always been a source of help and advice. I am very grateful to Luis Serrano, who opened the doors of his lab at the Center of Genomic Regulation in Barcelona. Despite his hectic agenda, he has been instrumental in all parts of his work, continuously throwing in new ideas and providing me with the opportunity to work in one of the leading institutes in biomedical sciences in Europe. I have been very fortunate with the people with whom I have been working side by side in this project. Lies Baeten who graduated from computer science like me, has initiated the BriX project during her PhD. Sharing the same background, she has contributed to many of the ideas and tools we developed together during xi
  • 12.
    this work. Oneyear later, Erik Verschueren joined the SWITCH laboratory and continued his work in the CRG in Barcelona. Since we met each other again in the CRG, we have been working together on a daily basis, sharing many moments of frustration and euphoria. Without both Lies and Erik, I believe this work would not have been the same. Many more people have been important to this project. For example, Almer van der Sloot, whom I met in the lab of Luis Serrano, has often shared his broad knowledge in cellular biology with me; I was very happy to write with him a review on computational peptide design, that shaped the introductory chapter of this thesis. Fran¸cois Stricher often helped me understanding the nitty-gritty of protein structure and stability, and his contributions to the FoldX force field have been essential to this work. Joke Reumers, a former member of the SWITCH lab, motivated me to work on the database of protein-peptide complexes, and together we published a paper which has left me hungry for more publications. I also really enjoyed working together with Joost Van Durme; we pushed the project of protein loop prediction, which was originally started by Lies Baeten, to the next level. Programming with Javier Delgado, originally a SWITCH member and now post-doc at the lab of Luis Serrano, on the FoldX suite has been very pleasant as well. I wish to thank all the members in both the SWITCH group and the group of Luis Serrano at the CRG. Besides being great colleagues, many of you have also become good friends. I’d like to thank Ivo, a former member of the SWITCH lab, for giving critical advice and support. Outside the context of this PhD, I have often worked together with Antonio, Christof and Andrea. I believe what we did together has helped me making this project successfull, and I hope we will be working together again in the future. The financial support for performing this study was given by IWT, FWO and EMBO. 
It goes without saying that without their funds, this thesis would not have been possible. Finally, I would like to thank my family and my friends, and in particular my parents for giving their unconditional support. I also wish to thank everyone who welcomed me with open arms in Barcelona and with whom I spent many unforgettable moments. And thanks to you Camilla, for sharing both the difficult xii
  • 13.
    and the greatmoments I went through while working on this thesis. Nothing here would have been possible without all of you. xiii
  • 15.
    Contents 1 Introduction 1 1.1Proteins and peptides . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Protein function . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Protein building blocks . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Protein folding and stability . . . . . . . . . . . . . . . . . 5 1.1.4 Biological function of protein-peptide interactions . . . . . 7 1.1.5 Peptides as therapeutics . . . . . . . . . . . . . . . . . . . 8 1.1.6 Protein Structure . . . . . . . . . . . . . . . . . . . . . . . 12 1.2 Protein structure prediction and design . . . . . . . . . . . . . . . 13 1.2.1 Comparative modeling . . . . . . . . . . . . . . . . . . . . 15 1.2.2 Ab initio structure prediction . . . . . . . . . . . . . . . . . 17 1.2.3 Predicting protein dynamics . . . . . . . . . . . . . . . . . 18 1.2.4 Computational protein design . . . . . . . . . . . . . . . . 19 1.2.5 Protein docking . . . . . . . . . . . . . . . . . . . . . . . . 20 1.3 Computational design of peptide ligands . . . . . . . . . . . . . . 20 1.3.1 A better understanding of protein-peptide interactions . . . 21 1.3.2 Peptide design based on sequence motifs . . . . . . . . . . 23 1.3.3 Protein complexes as a source of active peptides . . . . . . 26 1.3.4 Protein docking and fragment based docking as tools for peptide design . . . . . . . . . . . . . . . . . . . . . . . . 28 xv
  • 16.
    1.3.5 Peptide designusing protein-peptide complexes . . . . . . 28 1.3.6 Remedying the lack of structural information . . . . . . . . 30 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2 Fragmenting protein space 43 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.2 Contents of the BriX database . . . . . . . . . . . . . . . . . . . . 46 2.2.1 Update of the BriX database . . . . . . . . . . . . . . . . . 46 2.2.2 BriX Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 48 2.2.3 Creation of the Loop BriX database . . . . . . . . . . . . . 49 2.2.4 Loop BriX Statistics . . . . . . . . . . . . . . . . . . . . . . 53 2.2.5 Applications of the BriX database . . . . . . . . . . . . . . 53 2.3 Database access . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.3.1 Database availability . . . . . . . . . . . . . . . . . . . . . 54 2.3.2 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.3.3 Covering or bridging of protein structures . . . . . . . . . . 55 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3 Predicting loop structure 63 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.2.1 Comparison with the state-of-the-art loop reconstruction algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.2.2 Loop homology is no prerequisite for loop reconstruction accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3.2.3 Loop ensemble prediction . . . . . . . . . . . . . . . . . . 70 3.2.4 Comparison with MODELLER, RAPPER, PLOP and FREAD . 72 3.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . 76 3.3.1 LoopX Algorithm . . . . . . . . . . . . . . . . . . . . . . . 76 3.3.2 Reconstruction accuracy . . . 
. . . . . . . . . . . . . . . . 79 3.3.3 Benchmark datasets . . . . . . . . . . . . . . . . . . . . . 80 3.3.4 LoopX Webserver . . . . . . . . . . . . . . . . . . . . . . . 81 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 xvi
  • 17.
    References . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4 The structural landscape of protein-peptide interactions 87 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.2 Contents of the PepX database . . . . . . . . . . . . . . . . . . . . 90 4.2.1 Construction of a non-redundant data set of protein-peptide complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.2.2 Statistics on structural protein-peptide complexes . . . . . 93 4.2.3 Ligand annotation with structural variants for peptide design 97 4.3 Database Access . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.3.1 Database Availability . . . . . . . . . . . . . . . . . . . . . 97 4.3.2 User interface . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5 Protein-peptide interactions resemble monomeric protein interactions 107 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.2.1 InteraX: a database of interacting protein fragments . . . . 111 5.2.2 Reconstruction of protein-peptide interactions from inter- acting fragment pairs derived from monomeric proteins . . 113 5.2.3 Reconstruction of peptide binding motifs by using multiple fragment pairs observed in monomeric proteins . . . . . . 117 5.2.4 Statistical analysis of the factors that determine reconstruc- tion accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . 124 5.3.1 Construction of a non-redundant data set of protein-peptide complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.3.2 The dataset of protein fragments . . . . . . . . . . . . . . . 125 5.3.3 InteraX database . . . . . . . . . . . . . 
. . . . . . . . . . 126 5.3.4 Covering algorithm . . . . . . . . . . . . . . . . . . . . . . 126 5.3.5 FoldX force field . . . . . . . . . . . . . . . . . . . . . . . . 127 5.3.6 Statistical analysis . . . . . . . . . . . . . . . . . . . . . . . 129 xvii
  • 18.
    5.4 Discussion .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6 Predicting peptide structure and specificity 133 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 6.2.1 Peptide docking using interaction patterns from InteraX . . 135 6.2.2 De novo peptide structure prediction using interaction pat- terns from InteraX . . . . . . . . . . . . . . . . . . . . . . 136 6.2.3 Case study: PDZ peptide design and specificity . . . . . . 137 6.2.4 Case study: helical peptide design for the estrogen receptor ligand-binding domain . . . . . . . . . . . . . . . . . . . . 143 6.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . 148 6.3.1 A constraints-based framework for peptide design . . . . . 148 6.3.2 Local backbone moves using BriX . . . . . . . . . . . . . . 151 6.3.3 Binding site prediction . . . . . . . . . . . . . . . . . . . . 152 6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 7 Discussion 157 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 List of Figures 169 List of Tables 173 xviii
  • 19.
    1Introduction Parts of thischapter are based on Computational design of peptide ligands Peter Vanhee, Almer van der Sloot, Erik Verschueren, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Trends in Biotechnology, May 2011. P roteins are the most versatile and complex molecules in the cell, giving rise to most of life’s extraordinary shapes and processes. Peptides – short se- quences of ∼4-40 amino acids – are key components of protein-protein interaction networks, regulating many important cellular processes. It is commonly accepted that protein function directly relates to the three-dimensional structure of these molecules, yet high-resolution structures are often lacking. For therapeutic usage, peptides possess several attractive features when compared to small molecule and protein drugs: they show a high structural compatibility with target proteins, con- tain the ability to disrupt protein-protein interfaces and have a small size. Efficient structure prediction and design of high affinity peptide ligands via rational methods has been a major obstacle to the development of this potential drug class. How- ever, structural insights into the architecture of protein-peptide interfaces have recently culminated in a number of computational approaches for the rational de- sign of peptides targeting proteins. These methods provide a valuable alternative to high-resolution structures of target protein-peptide complexes, bringing closer the possibility of in silico designed peptides. 1
  • 20.
    1. INTRODUCTION 1.1 Proteinsand peptides 1.1.1 Protein function A B C Figure 1.1: Proteins interacting with antibodies, small molecules and peptides. Structural models of protein interactions relevant for therapeutics. (A) The monoclonal antibody (mAb) cetuximab inhibits the extracellular domain of the epidermal growth factor receptor (EGFR) (PDB 1YY8). This therapeutic mAb is used in the treatment of colorectal cancer. (Citri & Yarden, 2006). (B) The small molecule gefitinib (Iressa, AstraZeneca) occupies the ATP-reserved binding pocket of intracellular kinase domain of EGFR, thus preventing phosporylation (PDB 2ITY) and inhibiting tumor growth. (Yun et al., 2007) (C) A phosphotyrosine peptide interacting with the SH2 domain of GRB2 that binds the intracellular domain of EGFR (PDB 1JYR) (Huang et al., 2008) Proteins are present in all forms of life, from plants, bacteria and viruses to animals. They are the cell’s workhorses, putting the genetic information (DNA) of the cell into action. There are many different types of proteins, making up most of the cell’s dry mass. Proteins are involved in almost all of the processes going on in the body; they transport nutrients through the blood, break them down to power muscles and send signals to the brain. Many proteins act as enzymes that catalyze reactions to form and break covalent bonds, directing the vast majority of 2
  • 21.
    1.1 Proteins andpeptides all major chemical processes in the cell. Here is a small sample of the role of proteins: • Enzymes facilitate biochemical reactions. For example, Alcohol dehydroge- nase transforms alcohol into a non-toxic form that the body uses for food and lactase breaks down sugar lactose found in milk. • Transport proteins move molecules from one place to another. For example, hemoglobin carries oxygen through the blood and cytochromes operate in the electron transport chain as electron carrier proteins. • Structural proteins give structural features to the cell and provide support. For example, keratin strengthens protective coverings such as hair, and col- lagen gives structure and support to the skin and the bones. • Hormonal proteins are messenger proteins that coordinate important pro- cesses in the cell and facilitate cell-cell communication. For example, insulin regulates glucose metabolism by controlling the blood-sugar concentrations and growth hormone helps regulate growth. • Contractile proteins provide movement to the cell. For example, actin and myosin are responsible for muscle contraction. • Antibodies defend the body from foreign invaders by tightly binding to antigens such as viruses or bacteria. Antigens are bound by the Major Histocompatibility Complex (MHC) and presented to a T-cell receptor, after which white bloods cells can be recruited to destroy the invaders. The structural diversity of antibodies (and in particular of the loops that bind the antigen) is immense: it has been estimated that humans can generate around 10 billion different antibodies, able to recognize virtually any foreign invader. To perform all these functions, proteins do not act alone. Instead, they can associate with themselves or with other proteins as dimers or as multi-subunit com- plexes, creating networks of protein interactions (Figure 1.1). 
These interactions (for example between proteins and other proteins, small molecules, peptides, metals, lipids, DNA or RNA) are fundamental to understanding the relation between genotype and phenotype at all levels, from the molecule to the organism itself. In Saccharomyces cerevisiae (baker’s yeast), for which most of the protein-protein interaction studies have been carried out, nearly every protein is involved in an interaction (Han et al., 2004). High-throughput studies that elucidate the entire protein-protein interaction network of an organism are now within reach, as was shown recently for the bacterium Mycoplasma pneumoniae (Kühner et al., 2009).

1.1.2 Protein building blocks

Figure 1.2: Amino acids grouped by properties. A Venn diagram grouping amino acids according to their properties. This is just one of the many possible classifications of amino acids. The figure was adapted from (Taylor, 1986).

Proteins are made of amino acids, small molecules composed of carbon, oxygen, nitrogen, sulfur and hydrogen. To make a protein, amino acids are connected by peptide bonds into a chain that folds into a three-dimensional structure according to the chemical properties of the amino acids. Each amino acid has a small group of atoms (the ‘sidechain’) branching off the main chain (the ‘backbone’), and this sidechain lends its unique properties to the nascent protein. There are 20 naturally occurring amino acids, each of them with a slightly different chemical structure. Based on their chemical properties, they can be organized in different categories: basic or acidic, polar (hydrophilic or ‘water-loving’) or hydrophobic (‘greasy’), charged or uncharged, aliphatic or aromatic (Figure 1.2).

Figure 1.3: Four levels of protein structure.

As shown in Figure 1.3, proteins have different levels of structure. The primary structure is the amino acid sequence. The secondary structure consists of the local substructures (α-helices and β-strands) that are stabilized by an organized network of hydrogen bonds. The tertiary structure is the entire protein folded into a complete three-dimensional structure, and the quaternary structure is the arrangement of multiple units coming together to form a larger unit.

1.1.3 Protein folding and stability

As Anfinsen showed, the structure of a protein is uniquely defined by its sequence (Anfinsen, 1973). The folding of the polypeptide chain is a complex process that turns the essentially unstructured and elongated polypeptide chain into a compact, stable and unique protein fold, held together mainly by non-covalent interactions (Onuchic & Wolynes, 2004). In the late sixties, Levinthal famously pointed out that a protein could not plausibly fold within a reasonable timeframe by a purely random search, which suggests the existence of folding pathways (Levinthal, 1969). Through a combination of technologies (most notably recombinant protein technology, NMR, X-ray crystallography and computer simulations) it is now commonly accepted that the major driving force of protein folding is the hydrophobic collapse of the polypeptide chain (Dill, 1990; Chandler, 2005).

The large majority of proteins are only marginally stable in their folded state, meaning that the energy difference between the unfolded and folded state is relatively small. The breaking of a single hydrogen bond caused by a single amino acid mutation might lead to the collapse of the entire protein. Different forces contribute to the free energy of the protein, commonly expressed as the difference in free energy (∆G) between the unfolded and folded states. The non-covalent forces that contribute to the stability of proteins can be described as follows:

• Van der Waals interactions are weak, attractive or repulsive interactions that occur between both charged and polar molecules. They include the London dispersion forces, dipole-dipole interactions and hydrogen bonding, and are often calculated using (6-12)-potentials such as the Lennard-Jones potential.
• Hydrogen bonding occurs when two electronegative atoms compete for the same hydrogen atom. The proton donor is covalently bound to the hydrogen atom, while the proton acceptor interacts favorably with the hydrogen atom. Originally observed in 1936 by Pauling and Mirsky, before the first protein structures became available, hydrogen bonds ‘hold together’ the folded polypeptide chain, giving rise to both α-helices and β-sheets.
• Hydrophobic interactions are a phenomenon observed when non-polar compounds collapse into aggregates when surrounded by water.
Almost half of the amino acids are hydrophobic (Figure 1.2) and tend to cluster together to form the hydrophobic core of the protein.
• Electrostatic interactions are long-distance cohesive forces that appear between oppositely charged atoms. Salt bridges are a special kind of hydrogen bond that occurs between charged functional groups.

It came somewhat as a surprise to discover that an estimated 40% of all proteins in the human proteome are intrinsically disordered and become fully or partly structured only upon binding to their partners in the cell (Gianni et al., 2003). Short motifs or peptides are often recognized within these unstructured proteins and play important roles in protein regulatory networks and signaling pathways.

1.1.4 Biological function of protein-peptide interactions

Peptides and short peptide stretches in larger proteins (∼4-40 amino acids long) perform a myriad of functions in both cell-to-cell and intracellular communication: they are important mediators in many signalling pathways and regulatory networks (Neduva & Russell, 2006). A great variety of endogenous regulatory peptides of variable length act as peptide hormones and/or neurotransmitters and are involved in inter-cellular communication (Brunton et al., 2006). These peptides show a wide range of physiological activities and are important in maintaining homeostasis. Examples are the potent blood pressure regulators angiotensin II (8 amino acids, a.a.) and vasopressin (9 a.a.), the appetite regulators ghrelin (28 a.a.) and obestatin (23 a.a.), and glucagon (29 a.a.), a regulator of glucose metabolism. They act in an endocrine, paracrine or autocrine fashion by binding cell surface receptors, such as G-protein coupled receptors (GPCRs). Typically, these peptides are produced by differential processing of a precursor protein by endopeptidases to yield biologically active peptides.
Many higher organisms, from amphibians to humans, also rely on peptides as an integral part of their host-defense mechanism against microbial assault. Although the use of peptides in antimicrobial therapy is rather limited at the moment, peptides are increasingly being considered as antibacterials, antivirals and antifungals in clinical settings (Hancock & Sahl, 2006; Easton et al., 2009).

Short, often unstructured peptide stretches or ‘motifs’ of larger proteins are also important players in intracellular signaling networks (Russell & Gibson, 2008). These motifs are recognized by globular protein domains, such as the SH3 domain that binds short proline-rich motifs, the SH2 domain that recognizes peptides containing phosphorylated tyrosine, or the PDZ domain that binds C-terminal motifs (Pawson & Nash, 2003). It is believed that up to 40% of all protein interactions in the cell are either directly or indirectly influenced by peptide-mediated interactions (Neduva & Russell, 2006; Petsalaki & Russell, 2008).

Given the importance of these protein-peptide interactions in both inter- and intracellular signalling, they provide important targets for therapeutic intervention in a range of diseases. Modulating these interactions with peptide-like agonists and antagonists therefore constitutes an attractive therapeutic strategy.

1.1.5 Peptides as therapeutics

The vast majority of therapeutic compounds achieve their effects by binding to and altering the function of target protein molecules (Figure 1.4). Traditionally, the main source of successful therapeutics has been small organic molecules, which usually bind in small cavities of the target protein and inhibit or ‘block’ specific catalytic centers or the binding sites of natural substrate analogues (Drews, 2000) (Figure 1.4C). The recent focus on protein-protein interaction networks has increasingly shifted the goal of drug targeting towards disruption of protein-protein interactions, a feat for which classical small molecules are not always ideally suited (Arkin & Wells, 2004; Wells & McClendon, 2007). The newest additions to the pharmaceutical arsenal are protein-based therapeutics, which are generally improved recombinant replacements of endogenous proteins or monoclonal antibodies directed against a wide variety of targets (Walsh, 2010) (Figure 1.4B). Although the introduction of protein therapeutics, in particular monoclonal antibodies, has been tremendously successful, their use is mainly limited to extracellular targets, such as membrane-bound receptors and secreted proteins, because uptake of these large molecules into intracellular compartments remains cumbersome (Patel et al., 2007).
Peptides are generally considered ‘poor drugs’ because of cumbersome delivery, prohibitively short in vivo lifetimes and poor overall bioavailability (Antosova et al., 2009; Audie & Boyd, 2010). However, recent technological innovations in formulation, delivery and chemistry have sparked greater interest in peptide therapeutics (Walensky et al., 2004; Timmerman et al., 2005; Tan et al., 2010). Their chemical structure makes them by definition highly compatible with the proteins they target, and their intermediate size enables them to disrupt protein-protein interfaces whilst remaining sufficiently small for intracellular targeting (Patel et al., 2007) (Figure 1.4D). Presently, more than 50 peptide-based products are approved for clinical use in the United States and other countries (Table 1.1) (Pechon et al., 2010), underlining the tremendous market potential of peptidic drugs. This has spurred great interest in technologies capable of providing new peptide sequences with high affinity and specificity towards therapeutically relevant targets. In the remainder of this chapter and throughout this thesis, we will discuss recent technological advances that could lead to rationally designed peptides targeting proteins.

Figure 1.4: Targeting the cell with different molecules. Overview of different drug strategies targeting protein signaling pathways: (A) normal scenario of a generic pathway, (B) therapeutic antibodies, (C) small molecules and (D) peptides.

Peptides and redefining ‘druggability’

Given the current success of recombinant protein-based therapeutics, we are already witnessing the erosion of the long-standing and relatively narrow definition of what constitutes a ‘druggable target’ (Hopkins & Groom, 2002), i.e. a protein that can be modulated by an orally administered active small molecule adhering to the ‘rule of five’ proposed by Lipinski et al. (2001). The definition of ‘druggability’ has widened to include targets whose activity can be modulated by larger molecules, such as proteins and peptides.

Name | Approval date (US) | Disease | # A.A. | Origin of mimetics | Company | Global sales (US $ million)
Glatiramer, Copaxone® | 1996 | Multiple Sclerosis | 4 | Myelin protein | Teva | 3200
Leuprolide, Lupron® | 1985 | Prostate and breast cancer (mainly) | 9 | Gonadotropin Releasing Hormone (GnRH) mimetic | Abbott (amongst others) | 1900
Goserelin, Zoladex® | 1989 | Prostate and breast cancer (mainly) | 10 | Luteinising Releasing Hormone (LRH) mimetic | AstraZeneca | 1146
Octreotide, Sandostatin® | 1998 | Acromegaly, carcinoid syndrome | 8 | Somatostatin hormone mimetic | Novartis | 1123
Teriparatide, Forteo® | 2002 | Osteoporosis | 34 | Parathyroid hormone (84 residues, residues 1-34) | Eli Lilly | 779
Exenatide, Byetta® | 2005 | Diabetes Type 2 | 39 | Exendin-4 hormone (incretin mimetic) | Amylin / Eli Lilly | 750
Enfuvirtide, Fuzeon® | 2003 | AIDS (HIV-1 infection) | 36 | Viral glycoprotein (gp41) | Roche | 167

Table 1.1: Leading examples of peptide therapeutics currently on the market. Data is extracted from the annual Peptide Report issued by the Peptide Therapeutics Foundation (http://www.peptidetherapeutics.org/annual-report.html) (Pechon et al., 2010).

Current small-molecule drugs target only a fraction of all proteins inside and outside the cell. Typical targets include GPCRs, enzymes, nuclear hormone receptors and ion channels, all of which have natural small-molecule substrates (Brunton et al., 2006). Most of these drugs target the binding pocket of the substrate directly, but other, allosteric cavities can also be targeted. On average, the contact surface between a small-molecule ligand and its protein receptor is between 300 and 1000 Å² (Smith et al., 2006). In contrast, the contact surface between two interacting proteins is generally much flatter, larger (1200-3000 Å²) (Conte et al., 1999; Jones & Thornton, 1996) and discontinuous in sequence. Most of the free energy of binding is contributed by a limited number of amino acids in the interface (‘hotspot’ residues) (Clackson & Wells, 1995). Interaction networks are distributed and constructed with a modular architecture, showing tight cooperative interactions within a module and additive interactions between the modules (Reichmann et al., 2005). Conversely, protein-peptide interfaces display a smaller contact surface and a more continuous architecture, and often target well-outlined, large hydrophobic pockets on a protein (London et al., 2010). These pockets are larger than the typical clefts targeted by small molecules, but smaller than large protein-protein interfaces.

Because protein-protein interfaces are large, shallow and distributed, and lack pockets and cavities, they are often considered hard to target with small-molecule drugs. Although progress has been made (e.g. Thorsen et al. (2010) identified a small molecule that targets PDZ domains with micromolar affinity similar to that of the endogenous peptide), disrupting protein-protein interfaces with classical small-molecule compounds remains a difficult task (Wells & McClendon, 2007). Peptide-like drugs are likely to be more suitable candidates to act as competitive inhibitors of protein-protein interactions, considering their similar binding mode.

Alternatively, peptide-like ligands could target protein-protein interactions in a non-competitive manner by acting as allosteric modulators. This concept has received a lot of attention owing to the success of small-molecule allosteric modulators (Conn et al., 2009; Eglen & Reisine, 2010), but is also well established in protein-mediated interactions (Alvarado et al., 2010). One limitation of this allosteric approach is the need to identify the ‘pressure points’ in a target protein structure that should be ‘hit’ in order to affect its function.
Several methods have been developed to map the dynamics of the amino acid interaction network that constitutes the protein structure, and these have, at a minimum, the potential to reveal sites on the protein surface with allosteric modulatory power (Lee et al., 2008; Lenaerts et al., 2008, 2009; Haliloglu & Erman, 2009; Haliloglu et al., 2010).

In short, targeting protein-protein interactions with peptide-based competitive inhibitors or, albeit more challenging, peptide-based allosteric modulators extends the definition of druggability by expanding the potential classes of druggable targets.
1.1.6 Protein structure

Structural biology gathers structural information from atoms to cells, at different levels of the biological hierarchy, into a common framework. Elucidating the structure of protein molecules is therefore key. Since the determination of the first three-dimensional structure of myoglobin, an oxygen-carrying protein found in muscle tissue, in 1958 (Kendrew et al., 1958), our understanding of biology has radically changed (Figure 1.5). Experimental structure determination is most often carried out either by X-ray crystallography or by nuclear magnetic resonance (NMR).

Figure 1.5: History of experimental structure determination from the last fifty years. Figure adapted from (Fersht, 2008).

Many structures of globular proteins exist in public databases such as the Protein Data Bank (PDB; Berman et al., 2000), approximately 69,900 as of December 2010.
Even though special effort is devoted to resolving fibrous and membrane proteins, only a few such structures are currently available in public databases because of the complexities associated with crystallizing these aggregating proteins.

Protein structures are not ‘static pictures’. Instead, proteins undergo dynamic excursions from their ‘ground states’, and these fluctuations have increasingly been associated with important biological processes such as protein folding, enzyme catalysis and signal transduction (Eisenmesser et al., 2002; Lange et al., 2008). These fluctuations are typically characterized using Nuclear Magnetic Resonance (NMR) experiments, but computational techniques have also been a major aid in the determination of protein dynamics (Duan & Kollman, 1998; Shaw et al., 2010).

Nowadays, high-throughput structural genomics initiatives strive towards experimentally solving selected sets of proteins, such that all proteins of unknown structure have at least one neighbor in protein classification systems such as SCOP and CATH (Chandonia & Brenner, 2006). Since its inception, structural genomics has resolved thousands of novel protein structures, mainly using X-ray diffraction. Yet it has been estimated that at least 16,000 carefully selected structures would need to be solved before comparative modeling can predict 90% of all protein domain families (Chruszcz et al., 2010).

Solving this ‘structural data’ problem is key to the general understanding of proteins in the cell. Since more than 70% of all known proteins still lack a determined structure (let alone complexes of protein-protein, protein-DNA and protein-RNA interactions), the computational modeling of protein structure has become a field of its own. In the remainder of this introductory chapter, we introduce this field and focus in particular on the prediction of peptide interactions.

1.2 Protein structure prediction and design

Following Anfinsen’s dogma, the structure of a protein is uniquely defined by its sequence (Anfinsen, 1973). Since experimental determination of a protein’s structure is often an expensive and time-consuming task, computational biologists have embraced the challenge of predicting secondary and tertiary structure directly from sequence. However, the problem turns out to be far from trivial: the structural variety the polypeptide chain can adopt is virtually unlimited. Creighton estimated that a protein of 100 amino acids can adopt up to 10^100 alternative conformations (Creighton, 1984), a number that exceeds common estimates of the number of atoms in the universe.

Many computational methods have been developed and used to increase structural coverage (Baker & Sali, 2001). Every design methodology essentially has two components: a sampling component that explores the space of possible conformations, and a scoring component that ranks the sampled solutions.
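The two components named above can be illustrated with a toy example. The sketch below is not a method from this thesis: the torsion-angle representation, the `toy_score` function and the choice of plain random sampling are all assumptions made purely for illustration. It draws random conformations (the sampling component) and ranks them with a scoring function (the scoring component).

```python
import math
import random


def toy_score(angles):
    """Hypothetical scoring function (lower is better).

    Penalizes the deviation of each torsion angle (in degrees) from -60,
    as a stand-in for a real free energy estimate.
    """
    return sum(1.0 - math.cos(math.radians(a + 60.0)) for a in angles)


def sample_conformation(n_residues, rng):
    """Sampling component: draw one random torsion angle per residue."""
    return [rng.uniform(-180.0, 180.0) for _ in range(n_residues)]


def sample_and_rank(n_residues, n_samples, seed=0):
    """Sample conformations, score each one, and return them best-first."""
    rng = random.Random(seed)
    samples = [sample_conformation(n_residues, rng) for _ in range(n_samples)]
    return sorted(samples, key=toy_score)


ranked = sample_and_rank(n_residues=5, n_samples=1000)
best, worst = ranked[0], ranked[-1]
```

Plain random sampling scales poorly, because the number of conformations grows exponentially with chain length. Practical methods therefore restrict the search space and rely on smarter search strategies.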
The sampling problem is often simplified by making a ‘fixed-backbone’ assumption, although recent years have seen an increasing number of methods that introduce protein backbone flexibility, e.g. by using an ensemble of backbone conformations, robotic-arm inspired moves or iterative backbone and sequence optimization, amongst others (Mandell & Kortemme, 2009a). Sidechain conformations are subsequently sampled on a given backbone, often using a library of rotamers that represents the different conformations the sidechain can adopt on the given backbone template. Complete enumeration of these conformations remains a herculean task, so both stochastic methods (e.g. Monte Carlo simulation; Kuhlman & Baker, 2000) and deterministic methods (e.g. the popular dead-end elimination technique; Desmet et al., 1992) are employed. Finally, scoring functions, either statistical or physics-based, are used to rank the final set of solutions, mainly relying on energetic terms such as Van der Waals packing constraints, hydrophobic interactions, hydrogen bonding, solvation and electrostatic interactions (Section 1.1.3). In our work, we rely on the empirical force field FoldX to weigh these components and output a final ranking based on the total free energy estimate of the model (Schymkowitz et al., 2005).

Probably the most important question in modeling is when structure prediction becomes biologically useful (Zhang, 2009). This depends on the purpose of the model: highly accurate models (< 1-2 Å root mean square deviation, RMSD, versus the crystallographic model) can in some cases be used for ligand-binding studies or even virtual screening, while medium-resolution models (2.5-5 Å) could provide an idea about functionally important residues, active sites or disease-associated mutations. Low-resolution models (> 5 Å) could be used for topology recognition or for determining the protein boundaries.

Measuring progress in the field of structure prediction is the aim of the biennial competition CASP (Critical Assessment of techniques for protein Structure Prediction) (Moult et al., 1995). Research groups are given an amino acid sequence for which no native structure is known, but that will be determined soon.
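To make the accuracy bands mentioned above concrete, the following sketch computes the RMSD between a model and a reference structure and maps it to the three categories. It assumes that both structures are given as equally long lists of (x, y, z) coordinates that have already been optimally superposed. The function names and the handling of the gap between 2 and 2.5 Å (treated here as medium resolution) are illustrative choices, not part of the original text.

```python
import math


def rmsd(model, reference):
    """Root mean square deviation (in Å) between two superposed coordinate sets."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(model, reference)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(model))


def accuracy_band(value):
    """Map an RMSD value to the rough usefulness bands described in the text."""
    if value <= 2.0:
        return "high"    # ligand-binding studies, virtual screening
    if value <= 5.0:
        return "medium"  # functional residues, active sites
    return "low"         # topology recognition, protein boundaries


# Toy data: every atom of the model is shifted by 1 Å along y.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
value = rmsd(model, reference)
```

In practice the two structures first have to be superposed (for example with a least-squares fit), and the RMSD is usually restricted to a defined atom subset such as the Cα atoms. Both steps are omitted here.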
Since its inception in the mid-90’s, this community-wide effort has already instigated many novel design protocols, with over 100 different groups participating in the competition.

The available range of modeling methods can roughly be divided into two categories, although an increasing number of hybrid methods is being developed as well: methods that rely on comparative modeling or threading, and methods that predict structure ab initio. Typically, the first category has provided the more accurate models but is limited by the amount of structural data available, while the latter is unconstrained but limited by the huge number of possible conformations, idealized model systems and approximate free energy estimates. In what follows, we will discuss a selection of recent advances in these methodologies and also focus on a related but slightly different field in computational biology: that of computational protein design. We conclude with a short discussion of the docking problem, in which the structure of two interacting proteins is sought, given their structures in isolation.

1.2.1 Comparative modeling

Comparative modeling, also called homology modeling, relies on the observation that sequence similarity suggests structural similarity, often because structure (and thus function) is not radically altered in the process of evolution. Homology modeling thus searches for proteins that share a certain amount of sequence similarity with the target protein. Highly accurate models are often generated when more than 50% sequence identity exists between the target protein and the templates. These models might have a root-mean-square (RMS) error of 1 Å on the main-chain atoms, which is comparable to the difference between X-ray and NMR methods (Baker & Sali, 2001). Between 30 and 50% sequence identity, homology modeling will still give reliable models, but loops and other variable regions in the protein structure in particular might deviate from the template structures. Finally, when comparative models are based on less than 30% sequence identity, the errors accumulate rapidly in the model. Frequent sidechain packing errors, distortion of the protein core, loop modeling errors and other severe problems might render the model useless. These bottlenecks in comparative modeling, especially at low sequence identity, might be partly remedied by means of all-atom force field refinements, multiple template structures or specialized loop reconstruction methods (see Box).
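The sequence-identity thresholds above can be sketched as a simple check. The example assumes a precomputed pairwise alignment given as two equal-length strings with '-' marking gaps. The function names and the convention of dividing by the number of aligned (gap-free) columns are assumptions for illustration, and other conventions for the denominator exist.

```python
def percent_identity(aligned_target, aligned_template):
    """Percentage of identical residues over aligned, gap-free columns."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def expected_model_quality(identity):
    """Rough expectation for a comparative model, following the thresholds in the text."""
    if identity > 50.0:
        return "highly accurate"
    if identity >= 30.0:
        return "reliable core, uncertain loops"
    return "error-prone"


# Hypothetical alignment: 10 gap-free columns, 8 of them identical.
identity = percent_identity("MKT-AYIAKQR", "MKTLAYLAKHR")
quality = expected_model_quality(identity)
```

The thresholds are rules of thumb rather than sharp boundaries. Alignment length and the quality of the alignment itself also influence how trustworthy a model is.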
Box: Protein loop reconstruction with LoopX
In Chapter 3 we present LoopX, a loop reconstruction method that combines a database backbone template search with sidechain reconstruction. We demonstrate the competitiveness of the method by comparing it to various state-of-the-art loop prediction methods, including the robotics-inspired KIC, which was recently shown to reconstruct loops up to length 12 with sub-angstrom accuracy (Mandell et al., 2009). Additionally, we demonstrate that LoopX can model the conformational ensemble adopted by protein loops and induced by ligand binding in the case of the PDZ-peptide and meganuclease-DNA interactions.

To better understand the mechanisms of folding, Rose and Creamer put forward the challenge of finding two proteins with high sequence similarity but a different fold (Rose & Creamer, 1994). Recently, two small proteins of 56 residues each, GA88 and GB88, were designed that share 88% sequence identity and differ in only 7 residues. The structures, solved by NMR, revealed two distinct folds: GA88 adopted a 3-α fold while GB88 adopted an α/β fold, showing that conformational switching between two folds could be effected with just a handful of mutations (He et al., 2008; Alexander et al., 2009). These results pose interesting problems for homology-based methods, since they challenge our interpretation of sequence-structure relationships, yet they might not be representative of a large number of cases.

Protein threading

Threading methods attempt to fit sequences to a known structure from a library of folds, especially for cases in which no evolutionary relation is obvious (Bowie et al., 1991). This is accomplished by ‘threading’ the sequence along the backbone of the template model, followed by a scoring function that evaluates the placement of the amino acids on the backbone. Threading methods are less constrained since no homology is required, yet they still rely on correctly selecting a series of templates in which to fit the sequence. One such successful (hybrid) approach is I-TASSER (Roy et al., 2010). The algorithm performs a PSI-BLAST search (Altschul et al., 1997) with the query sequence to identify evolutionary relatives, followed by secondary structure assignment to construct an initial scaffold. That scaffold is then used to select a series of models from the PDB for threading, using a series of state-of-the-art threading programs (Wu & Zhang, 2007).
In subsequent steps, fragments from the structural models are determined and combined with ab initio predictions for badly aligned regions (particularly loops), resulting in a structural model from which functional properties can be derived.
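The threading idea can be illustrated with a deliberately small sketch. It is not I-TASSER or any published threading program. Each template is reduced to a string of environment labels ('b' for buried, 'e' for exposed), the compatibility scores are made-up values that merely favor hydrophobic residues in buried positions, and the query is placed on the template without gaps. All names and numbers are hypothetical.

```python
HYDROPHOBIC = set("AVILMFWC")


def position_score(residue, environment):
    """Toy compatibility score of one residue with one template position."""
    is_hydrophobic = residue in HYDROPHOBIC
    if environment == "b":  # buried position
        return 1.0 if is_hydrophobic else -1.0
    return -0.5 if is_hydrophobic else 0.5  # exposed position


def thread(sequence, template_env):
    """Slide the sequence along the template; return (best score, best offset)."""
    if len(sequence) > len(template_env):
        raise ValueError("sequence is longer than the template")
    best = None
    for offset in range(len(template_env) - len(sequence) + 1):
        score = sum(
            position_score(res, template_env[offset + i])
            for i, res in enumerate(sequence)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def best_template(sequence, library):
    """Pick the template name whose best threading score is highest."""
    return max(library, key=lambda name: thread(sequence, library[name])[0])


# Hypothetical fold library: environment strings instead of real structures.
library = {"fold_1": "eebbbee", "fold_2": "bebebeb"}
choice = best_template("KVILD", library)
```

Real threading programs use much richer structural environments, allow gaps in the sequence-structure alignment and calibrate their scores statistically. The sketch only shows the basic pattern of placing a sequence on each template and keeping the best-scoring placement.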
Fragment-based structure prediction

Another type of comparative modeling tool is the fragment-based method. These methods differ in their definition of the smallest unit used to infer sequence similarity: instead of the entire fold, they consider stretches of residues, effectively expanding the number of templates that can be used for modeling. Many different fragment-based methods have been successfully applied to protein structure prediction, often in combination with ab initio sidechain prediction and energy evaluation. A successful fragment-based approach is used in Rosetta to bootstrap structure prediction (Section 1.2.2). In our own work, we have used the ‘fragment paradigm’ as an effective way to solve the sampling problem in protein structure prediction (see Box).

Box: The BriX fragment paradigm
In our own work, we have used the ‘fragment paradigm’ as an effective way to solve the sampling problem in protein structure prediction (Baeten et al., 2008; Vanhee et al., 2011). We used short protein backbone fragments to reconstruct (parts of) proteins (Chapter 2), deduce structural relationships between single proteins and protein-peptide interfaces (Chapter 5) and perform blind reconstruction of peptide interactions (Chapter 6).

Comparative modeling depends entirely on the quality of the templates available: more structural templates lead to better homology modeling and thus to better models. In the limit, and with the increasing body of structural data being deposited into public databases, comparative modeling will ultimately cover the entire structural space, since the number of unique folds in nature is expected to be limited (Grant et al., 2004).

1.2.2 Ab initio structure prediction

Ab initio methods differ from comparative modeling techniques in that they remove the requirement of having at least one related structure.
Yet some of the most successful ab initio techniques, such as the popular Rosetta framework (Rohl et al., 2004), follow a hybrid approach. For example, Rosetta scans the PDB for small structural fragments with sequence signatures similar to the target sequence, using a Bayesian probability distribution. It then iteratively assembles these fragments using Monte Carlo sampling, optimizing packing between the fragments and favoring β-sheet formation. In a fine-grained step the sidechains are rebuilt with backbone-dependent rotamer libraries. As an example, an enzymatic active site has been designed using a series of minimal active site templates (termed ‘theozymes’) that could accommodate the four different catalytic motifs used to catalyze the breaking of a carbon-carbon bond, and the model was later confirmed using X-ray crystallography (Jiang et al., 2008). In other recent work, the methodology was repeated to provide a design that catalyzes the Diels-Alder reaction, a reaction that forms a special type of organic bond and for which supposedly no natural enzyme exists (Siegel et al., 2010). Both cases are milestones for our current ability to design proteins with desired properties using semi-computational approaches.

Often, interleaving experimental information with the computational method results in better designs, for example by constraining the design towards certain active sites, or by allowing more conformational freedom in one part of the protein than in another. Among many different approaches, the use of NMR chemical shift data is particularly appealing. In CS-Rosetta, chemical shift data was used to select fragments with similar resonance profiles, in combination with the traditional sequence similarity from public databases, thus constraining the fragments used in the stochastic sampling and improving the final models (Shen et al., 2008). A completely different but highly innovative approach is to use the human eye to resolve hard three-dimensional problems.
By means of an online computer game, Baker and co-workers managed not only to explain protein folding to a large community outside the protein design field, but also to show a significant improvement in CASP models that could not have been achieved using computational algorithms alone (Cooper et al., 2010).

1.2.3 Predicting protein dynamics

Methods that sample the conformational space according to first principles are much slower and thus limited in use. Probably the best known of these methods is Molecular Dynamics (MD) simulation (McCammon et al., 1977). In MD simulations, biophysical forces are described explicitly at the all-atom level, e.g. with CHARMM (Brooks et al., 1983) or AMBER (Cornell et al., 1995), as opposed to the statistics-based force fields often used in approximate or homology-based methods (Klepeis et al., 2009). The advantage of MD simulations over other structure prediction methods is that they not only capture the ground state of the folded protein, but also hint at the dynamics of the protein and its folding process, which are often important for understanding the function of the protein. Recently, David Shaw and co-workers modelled the folding and unfolding of the small WW domain and the dynamics of the BPTI protein, the latter on the millisecond scale (Shaw et al., 2010). Important biological conformational changes, such as folding, often take place on a time scale between 10 µs and 1 ms, and this achievement, made possible through the use of a customized supercomputer, extended the simulation time accessible to such methods about 100-fold.

1.2.4 Computational protein design

Computational protein design deals with finding a compatible sequence for a given protein fold and, as such, is often termed the ‘inverse folding problem’. It shares many of the challenges posed by protein structure prediction, and both require an understanding of the often complicated relationship between sequence and structure (Mandell & Kortemme, 2009b).

Traditionally, changing the properties of a protein is accomplished using ‘rational design’, in which researchers modify proteins by hand, or directed evolution, an experimental technique that harnesses natural selection at the molecular level to customize proteins to meet specific requirements (Romero & Arnold, 2009). In the last twenty years, however, computer algorithms have entered this field to produce in silico models of optimized proteins that are then subjected to experimental analysis (van der Sloot et al., 2009; Lutz, 2010).
Protein design has an enormous number of applications both in academia and industry. For example, protein design techniques have been used to increase the thermostability of an enzyme whilst retaining enzymatic activity (Korkegian et al., 2005); the affinity and specificity of the family of leucine zipper transcription factors have been altered (Grigoryan et al., 2009); and pathways in cellular systems have been artificially rewired following a synthetic biology approach, making use of the modular architecture of signaling pathways in eukaryotes (Pryciak, 2009).

1.2.5 Protein docking

Protein docking deals with finding the structure of the interaction rather than the structure of the individual proteins, that is, predicting the quaternary structure (Figure 1.3). Since proteins exercise their functions through the way they interact with other proteins, an atomistic understanding is often required to infer functional relationships between proteins, decipher signaling pathways and so on. Usually, docking of two unbound proteins proceeds in two phases. First, a putative binding mode is detected using geometric complementarity or Fast Fourier Transforms (e.g. using PatchDock (Schneidman-Duhovny et al., 2005) or ZDock (Chen et al., 2003)). Second, a fine-grained refinement protocol fits both structures, sometimes allowing for slight conformational flexibility in the backbone and side chains of the proteins (e.g. Haddock (Dominguez et al., 2003) or RosettaDock (Gray et al., 2003)). Similar to CASP, the recurring competition CAPRI (‘Critical Assessment of PRedicted Interactions’) measures the progress in the field using a blind competition (Janin et al., 2003). While CAPRI largely seems to be an academic exercise at this point, and is limited by the need for starting structures, an increase in the prediction accuracy of the methods can be observed. Most of the improvements now concern the introduction of backbone flexibility in the search process (Lensink et al., 2007).

1.3 Computational design of peptide ligands

Efficient design of high-affinity peptide ligands via rational methods has been a major obstacle to the development of peptides as therapeutics. However, structural insights into the architecture of protein-peptide interfaces have recently culminated in a number of computational approaches for the rational design of peptides targeting proteins (Vanhee et al., 2009; London et al., 2010).
These methods provide a valuable alternative to experimental high-resolution structures of target protein-peptide complexes, bringing closer the dream of in silico designed peptides for therapeutic applications. Here we provide an extensive review of these methods (Figure 1.7).

1.3.1 A better understanding of protein-peptide interactions

With the increase of high-resolution structures of protein-peptide complexes in the Protein Data Bank (PDB, http://www.pdb.org) (Berman et al., 2000), and in complementary databases such as the database of three-dimensional interacting domains (3did, http://3did.irbbarcelona.org) (Stein et al., 2010) and the non-redundant database of protein-peptide complexes (PepX, http://pepx.switchlab.org) (Vanhee et al., 2010), large-scale structural studies have attempted to describe the key properties of peptide binding (Vanhee et al., 2009; London et al., 2010) (Chapter 5). For example, we have identified 505 unique structural peptide-mediated interactions from a set of 1431 high-resolution structures, with a high over-representation of well-studied peptide interactions, such as MHC-peptide complexes, thrombin-bound peptides, or peptides bound to the α-ligand binding domain of the estrogen receptor (Vanhee et al. (2009), Chapter 4). In a set of 103 peptide complexes, it has been noted that many interfaces exhibit tighter packing and more main-chain hydrogen bonds than normally found in protein-protein interfaces (London et al., 2010). This difference is logical: peptides in isolation cannot be too hydrophobic, for they would aggregate. Therefore, part of the binding energy needed to compensate for the loss of entropy upon binding has to be derived from main-chain/main-chain and main-chain/side-chain hydrogen bonds. In silico mutagenesis on these interaction interfaces has revealed that peptide interfaces contain ‘hotspot’ residues, reminiscent of those found in protein-protein interfaces (Clackson & Wells, 1995). Peptides that are 6-8 residues long typically contain two hotspot residues, whereas three hotspots are typical for peptides of length 9-11 (London et al., 2010).
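To make the notion of a hotspot concrete, the following minimal sketch flags peptide positions from the results of an in silico alanine scan. It is an illustration only: the function name, the example ΔΔG values and the 1.0 kcal/mol cut-off are assumptions made here and are not taken from the cited studies.

```python
def find_hotspots(ddg_by_position, threshold=1.0):
    """Return the peptide positions whose mutation to alanine is predicted
    to weaken binding by at least `threshold` kcal/mol.

    ddg_by_position maps a 1-based peptide position to the predicted change
    in binding free energy (positive = destabilising). The default
    threshold is an illustrative assumption.
    """
    return [pos for pos, ddg in sorted(ddg_by_position.items()) if ddg >= threshold]


# Hypothetical alanine-scan results (kcal/mol) for an 8-residue peptide.
example_ddg = {1: 0.2, 2: 1.8, 3: 0.1, 4: 0.4, 5: 2.5, 6: 0.3, 7: 0.0, 8: 0.6}
hotspots = find_hotspots(example_ddg)
print(hotspots)
```

With these made-up values the 8-residue peptide has two hotspot positions, which mirrors the typical count reported for peptides of this length.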
In general, peptides often adopt an elongated structure upon binding (Stein & Aloy, 2010) and do not appear to induce large conformational changes in their binding partners, which limits the entropic cost of complex formation (London et al., 2010). The peptides themselves, in contrast, are often located in structurally disordered regions of proteins and only adopt a stable fold upon binding to their protein partner (‘fold-on-binding’).
Figure 1.6: PDZ-peptide interactions and peptide specificity. (A) The PDZ domain of Erbin (PDB 1N7T) binds peptides in an elongated way, with multiple residues contributing to the interaction. (B) The carboxy-terminus of the PDZ peptide binds tightly in a hydrophobic pocket of the PDZ domain. (C) Distribution of 74 PDZ domains in selectivity space, after singular value decomposition of correlated positions in the peptide. The contributions of different peptide positions allow the PDZ domains to optimize their specificity while avoiding cross-reactivity, revealing an even distribution throughout selectivity space (Figure adapted from Stiffler et al. (2007)).

Despite their limited size, peptide interactions can be highly specific. For example, many C-terminal peptides exhibit high specificity in vivo for particular PDZ domains, while avoiding cross-reactivity (Stiffler et al., 2007) (Figure 1.6). Interestingly, peptide specificity across 157 mouse PDZ domains matched with 217 peptide ligands could not be captured in discrete classes but instead showed a more even distribution in selectivity space. Specificity in peptide interactions can also be introduced by engineering approaches even when it is not observed in nature (Reina et al., 2002; Grigoryan et al., 2009). One such example is the basic-region leucine zipper (bZIP) family of transcription factors, whose members share a high degree of structural and sequence similarity and bind DNA upon homo- and/or heterodimerization with an identical or related bZIP monomer subunit. By replacing one of the wild-type monomer subunits with a variant in which the basic region is substituted by an acidic region, DNA binding is prevented and consequently the activity of the transcription factor is inhibited. As these acidic variants inherit the dimerization properties of the wild type, it is difficult to inhibit one specific bZIP family member because of the intrinsic heterodimerization properties. Keating and co-workers recently showed that it is feasible to design anti-bZIP peptide variants that bind specifically to only a single member of the human bZIP family using a computational design approach (Grigoryan et al., 2009). An algorithm was employed that explicitly considers both target and non-target interactions by selecting sequences that minimize the loss of affinity for the target while maximizing the differences in affinity with any non-target members. Out of the 20 targeted bZIP families, 10 designed peptides bound their representative member of the family with considerably higher affinity than any non-target competitor, demonstrating peptide specificity.
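The trade-off between target affinity and specificity described above can be illustrated with a small brute-force selection over candidate sequences. This is a sketch under stated assumptions, not the algorithm of Grigoryan et al. (2009): the function name, the candidate names, the energy values and the `max_affinity_loss` parameter are all hypothetical, and predicted binding energies are assumed to be given (lower means tighter binding).

```python
def pick_specific_design(energies, target, max_affinity_loss=1.0):
    """Pick the candidate with the largest specificity gap.

    energies: {candidate: {partner: predicted binding energy}}, where lower
    energies mean tighter binding. A candidate is only considered if its
    target energy is within `max_affinity_loss` of the best target binder.
    The specificity gap is the energy of the best-binding non-target
    partner minus the target energy (larger = more specific).
    """
    best_target_energy = min(e[target] for e in energies.values())
    best_name, best_gap = None, float("-inf")
    for name, e in energies.items():
        if e[target] - best_target_energy > max_affinity_loss:
            continue  # too much affinity for the target was sacrificed
        gap = min(v for partner, v in e.items() if partner != target) - e[target]
        if gap > best_gap:
            best_name, best_gap = name, gap
    return best_name, best_gap


# Hypothetical predicted energies (arbitrary units) for three candidates.
example_energies = {
    "design_A": {"target": -10.0, "off1": -9.5, "off2": -8.0},
    "design_B": {"target": -9.4, "off1": -6.0, "off2": -7.0},
    "design_C": {"target": -7.0, "off1": -2.0, "off2": -1.0},
}
chosen, gap = pick_specific_design(example_energies, "target")
print(chosen, round(gap, 2))
```

In this toy data set the tightest target binder (design_A) is nearly as good at binding an off-target partner, so the slightly weaker but far more selective design_B is chosen; design_C is discarded because it gives up too much target affinity.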
This study and related, albeit smaller-scale, computational design studies (Reina et al., 2002; van der Sloot et al., 2006) demonstrate that specific binding partners can be designed even in situations where there is a high degree of sequence and structural similarity between target and non-target molecules.

1.3.2 Peptide design based on sequence motifs

If structural information is present for a drug target, either from the single structure or from the target in complex with its ligand, this information can be used in the drug discovery process to speed up lead identification (Murray & Blundell, 2010). Unfortunately, structural information is available for only an estimated 50% of all drug targets (Tanrikulu & Schneider, 2008), with a significant underrepresentation of targets of high therapeutic importance such as membrane proteins (Baker, 2010). As a result, many research groups use existing information compiled in databases of protein-peptide interactions to derive sequence-binding motifs that could be used to design peptides.

Figure 1.7: Example workflows for peptide design. (A) Structure-free peptide design (panel labels: phage-display library screening, quantitative peptide assays, sequence motif scanning) and (B) structure-based peptide design (panel labels: peptides derived from protein complex structures, de novo peptide design with structural scaffolds, optimizing binding affinity, peptide docking and de novo design).

The most obvious cases are the well-studied SH2, SH3, PDZ and WW domains, where peptide templates can be designed using simple sequence-based rules (e.g. the well-known PxxP motif for SH3 domains, with x any amino acid, or T/S-x-I/V/L-COOH for PDZ class I), even though the discrete classification in motifs has been disputed (Stiffler et al., 2007). These templates can be randomized at the non-key positions, and specific peptides can then be found using screening methods such as yeast two-hybrid (Y2H) or phage display (Tonikian et al., 2008; Giordano et al., 2010).

For cases in which enough information on peptide binding is available, more sophisticated approaches can be used. For example, an artificial neural network, capable of learning to recognize non-linearity in complex datasets, was trained on 650 peptides derived from T-cell epitopes and known to bind the Major Histocompatibility Complex (MHC) class II molecule (Honeyman et al., 1998). The neural network was then used to speed up epitope screening by reducing the experimental T-cell assay from 68 to 22 peptides, with a potential loss of only 5 out of 17 epitopes. In more recent work, prediction was combined with genetic algorithms, hidden Markov models or other motif discovery algorithms (Lin et al., 2008). Predicting from sequence alone is often difficult because of permissive binding modes (MHC class II, for example, accommodates from 9 to 18 residues, although longer peptides have been observed too), multiple binding cores and insufficient high-quality binding data, all leading to noisy and often inaccurate predictions.
Adding structural information to the prediction process (approximately 169 X-ray structures of MHC in complex with an antigenic peptide are available in PepX; Vanhee et al. (2010) and Chapter 4) could increase prediction accuracies, yet structure-based methods are still too slow for genome-wide screening (Lin et al., 2008).

While these motif-scanning methodologies can lead to novel peptide discoveries, it is unclear whether they can be generalized to cases in which little is known about the target protein.
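As an illustration of the simple sequence-based rules mentioned in this section, the sketch below scans a sequence for the SH3 PxxP core motif and checks a peptide for the PDZ class I C-terminal motif (T/S-x-I/V/L-COOH) using regular expressions. The function names and the example sequences are assumptions chosen for illustration; real motif scanners typically add flanking-residue preferences and scoring.

```python
import re

# PxxP core motif for SH3 domains; the lookahead reports overlapping hits.
SH3_PXXP = re.compile(r"(?=(P..P))")
# PDZ class I motif: T/S, any residue, I/V/L at the free C-terminus.
PDZ_CLASS_I = re.compile(r"[ST].[IVL]$")


def find_pxxp(sequence):
    """Return (1-based start position, matched segment) for every PxxP hit."""
    return [(m.start() + 1, m.group(1)) for m in SH3_PXXP.finditer(sequence)]


def is_pdz_class_i_ligand(sequence):
    """True if the last three residues match T/S-x-I/V/L."""
    return PDZ_CLASS_I.search(sequence) is not None


print(find_pxxp("AAPPLPPRAA"))
print(is_pdz_class_i_ligand("EETSV"))
print(is_pdz_class_i_ligand("EETSA"))
```

The overlapping search matters for proline-rich stretches, where several PxxP windows can start at neighbouring positions.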
1.3.3 Protein complexes as a source of active peptides

Peptide fragments derived from the crystallographic interface of a protein-protein interaction are a major source for rational drug design (Watt, 2006). In 2003, the anti-HIV peptide enfuvirtide (Fuzeon®) was the first peptide (36 a.a.) derived from an extracellular protein interface to receive FDA approval (Table 1.1), a landmark in the field of peptide therapeutics (Naider & Anglister, 2009). Intracellular targets associated with HIV infection have been targeted with peptides as well.

Transcription factors (TF), regarded as ‘undruggable’ by classical small-molecule drugs owing to their large protein-protein interfaces (Section 1.1.5), have now been targeted with peptides too. The original discovery of a 59-mer peptide fragment from the co-activator of the Mastermind-like family (MAML-1), required for NOTCH signaling, marked the start of structure-based inhibitor design (Weng et al., 2003). The structure of the protein-peptide complex bound to DNA has recently been solved by two independent groups, showing that the Mastermind peptide binds as a twisted helix in the shallow protein-protein groove (Nam et al., 2006) (Figure 1.8A). Using a technique termed ‘peptide stapling’ (Schafmeister et al., 2000), a 16-residue peptide has been designed in which two residues are stapled together using a hydrocarbon cross-link; this constrains the helical conformation of the peptide while improving binding affinity. The inhibitory α-helical peptide is able to penetrate the cell membrane and bind to a shallow groove formed by the intracellular domain of NOTCH and a DNA-bound TF, thereby blocking the interaction with the co-activator MAML-1, which is required for recruiting the transcription machinery. As a consequence, proliferation of T-cell acute lymphoblastic leukemia cells was stopped.
In an entirely different class of proteins, stapled α-helical peptides have been shown to be effective as well, inhibiting members of the anti-apoptotic BCL2 family (Stewart et al., 2010). These anti-apoptotic proteins contain a hydrophobic groove that engages the death-promoting BH3 helix. Molecular mimicry of that helix with a stapled peptide led to selective inhibition of the anti-apoptotic protein (Figure 1.8B).

Figure 1.8: Stapled helical peptides as potent therapeutic peptides. (A) Design of MAML-1 derived peptides by taking different portions of the MAML-1 helix and turning them into peptides (sliding window: orange, red, pink, orange indicate different peptides used for stabilization, PDB 2F8X). The 16 a.a. sequence of MAML-1 targeting ICN1 and CSL is shown in red and was used to design the stapled peptide. Figure adapted from (Moellering et al., 2009). (B) Crystal structure of the stapled helix MCL-1 complex (PDB 3MK8). The stapled helix binds in the canonical binding groove. Hydrophobic interactions at the binding interface are reinforced by a complementary polar interaction network. The side chains of hydrophobic (yellow), positively charged (blue), negatively charged (red) and hydrophilic (green) residues are shown. Figure adapted from (Stewart et al., 2010).

Both examples of successful helical peptide design suggest that nature’s use of peptides in protein-protein interfaces provides exciting opportunities for peptide therapeutics. Scanning the entire PDB for interfaces involving helical segments has revealed many potentially interesting interfaces in which α-helical interactions play an important role, such as nuclear hormone receptors or other transcription factor-cofactor interfaces (Jochim & Arora, 2009). The acquisition of Aileron’s peptide stapling technology by Roche in August 2010 only confirms the potential of these stabilized α-helical peptides as a new class of powerful peptide therapeutics (Sheridan, 2010).

Yet, so far, successful peptide designs seem to be largely limited to the α-helical scaffold. One reason for this may be the large entropic cost associated with structuring a peptide upon binding, which is easier to overcome using α-helical peptides. For example, a leucine-zipper scaffold can be used to fix the helix bundle (Grigoryan et al., 2009), or chemical stapling of side chains can be used to fix a single helix scaffold (Schafmeister et al., 2000).
For hairpin structures, cyclization has also been employed (Craik et al., 2007). Yet another way to extend the structural stability of peptides is to incorporate them in a highly stable mini-protein, such as knottins or other scaffolds (Gebauer & Skerra, 2009).

In conclusion, protein-complex derived peptides in combination with scaffold designs are currently the most successful route to therapeutic peptide design.

1.3.4 Protein docking and fragment based docking as tools for peptide design

A generalist approach to peptide design uses structures or homology models of the target in combination with docking algorithms to construct peptides along a chosen path on the target surface. Several tools can be used to detect putative binding sites structurally, for example using geometric amino acid-dependent preferences derived from a set of structural binding modes (Petsalaki et al., 2009). AutoDock, a popular small-molecule docking algorithm, was used in combination with a genetic algorithm to design tetrapeptides against a selected hydrophobic region of α-synuclein, a protein associated with aggregation diseases (Abe et al., 2007). Upon experimental validation, several binding peptides with µM dissociation constants were identified that could be used as leads for further screening. Another approach uses a Gaussian Network Model to identify the binding site, after which AutoDock is used to dock a series of dipeptides in a pairwise fashion on a grid along a flexible binding path, finally resulting in an optimal peptide sequence for a given surface (Unal et al., 2010). While these methods work well for peptides comparable in size to small molecules (typically no longer than 3-4 a.a.), the design of longer peptides still poses major combinatorial problems.

1.3.5 Peptide design using protein-peptide complexes

Structural information on the protein-peptide binding interface can often be used to advantage when modeling the protein-peptide interaction. Most approaches can be divided into three main scenarios:

1. Use a structure with a peptide ligand as a template to model a domain-related sequence by homology, and then mutate the amino acid side chains of the peptide in silico with a protein design algorithm in order to change specificity, while keeping the peptide backbone coordinates fixed (Reina et al., 2002; van der Sloot et al., 2006).

2. Use a structure with a peptide ligand to model a domain-related sequence by homology while allowing peptide backbone flexibility. The crudest approach uses different domain-peptide complexes of the same family to generate different ligand backbone structures that can then be superimposed on the target structure (Fernandez-Ballester et al., 2009). This was recently tested on the PDZ domain, for which peptide specificity was computationally redesigned using all available PDZ domain structures and compared with large-scale phage display experiments (Smith & Kortemme, 2010). Another approach introduces backbone flexibility in the peptide starting from a series of perturbed X-ray protein-peptide complexes (Raveh et al., 2010). This protocol was validated on a set of 89 peptide complexes and produced models that showed sub-angstrom deviation from the native structure.

3. Use a structure while only knowing the approximate binding site of the peptide ligand, for example based on evidence from related domains. The PepSpec algorithm does not rely on a structural model of the peptide (King & Bradley, 2010). Instead, it only needs a single anchor residue positioned in the binding pocket and introduces implicit backbone movements in the receptor through ensemble modeling. Evaluation was carried out on a series of peptide-binding domain families, such as PDZ, SH2 and SH3. In the absence of an experimentally obtained structural model of the domain, and relying on a model based on a homologous domain, the algorithm captured some of the peptide specificities observed in experimental phage display libraries.
However, long simulation times that scale unfavourably with peptide length were reported (∼100-300 hours per peptide).

In this field, we achieved some progress too: peptides were designed for the PDZ domain, the α-ligand binding domain and the SH2 domain with sub-angstrom accuracy, using structural data from BriX interaction patterns in combination with the FoldX force field for side chain placement and energy evaluation (Figure 1.10 and Chapter 6).
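The ‘sub-angstrom’ accuracy quoted for several of these protocols refers to the root-mean-square deviation (RMSD) between modeled and native peptide coordinates. The sketch below computes this quantity for two coordinate sets, assuming they contain the same atoms in the same order and are already in the same reference frame (for instance after superimposing the receptors); no optimal superposition is performed. The function name and the coordinates are made up for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in the units of the input, e.g. angstrom)
    between two equally long lists of (x, y, z) coordinates.

    Assumes matching atom order and a shared reference frame; the
    structures are not superimposed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical backbone atom positions (angstrom) of a native and a modeled peptide.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = [(0.3, 0.0, 0.0), (1.5, 0.4, 0.0), (3.0, 0.0, 0.0)]
deviation = rmsd(native, model)
print(f"RMSD = {deviation:.3f} A, sub-angstrom: {deviation < 1.0}")
```

Whether the RMSD is computed over backbone atoms only or over all heavy atoms changes the value considerably, so the atom selection should always be stated alongside the number.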
To summarize, fixed-backbone peptide design (scenario 1) can be used successfully, and at minimal computational cost, when a high degree of sequence and structural similarity exists, or can be assumed, between the template complex structure and the target complex structure. When changes in backbone conformation are expected to play a greater role (e.g. in cases of decreasing sequence and structural similarity, or when insertions or deletions relative to the template structure have to be modeled), one of the approaches mentioned under scenario 2 can be employed. When the exact binding site of the peptide is not known, one of the approaches mentioned under scenario 3 has to be employed. For now, the computational cost of these methods limits their use to selected (design) case studies rather than proteome-wide screening.

1.3.6 Remedying the lack of structural information

A recently reported method addresses the lack of structural information on membrane proteins by employing a database of helix-helix interaction scaffolds to initiate de novo design of peptides targeting integrins (Yin et al., 2007). Integrins are important receptor proteins in mammalian cells, with a flexible domain of transmembrane (TM) helices in the phospholipid bilayer. Integrins process extracellular signals and transmit them to the interior of the cell, which makes them attractive targets for tumor therapy (Desgrosellier & Cheresh, 2010).

Peptides selectively targeting integrins αIIbβ3 and αVβ3 have been computationally designed with a new approach for rational peptide design (Yin et al., 2007). Typical peptide designs are derived from the crystal structure of the target protein-protein complex (Section 1.3.3), so that the design task mainly consists in stabilizing the hot-spot interactions with a peptide.
However, because in most cases no crystal structure of the interface is available, in this study the authors relied on a repertoire of over 400 naturally occurring TM-helix interactions with recognizable sequence signatures (Walters & DeGrado, 2006).

The computational design was divided into two steps: (1) the helix-helix interaction motifs served as realistic backbone templates (Figure 1.9A/B), as opposed to the idealized helix pairs often used in protein design, and were selected based on sequence compatibility with the target TMs; (2) the authors threaded the sequence of the target TM helix onto one helix of the helix pair and then selected compatible side chains for the peptide on the other helix, using a side-chain repacking algorithm (Figure 1.9C). The computationally designed peptides were subsequently validated in micelles, bacterial membranes and finally mammalian cells, where they inhibited the binding between the TMs of the α- and β-subunits, thus activating the integrin.

Figure 1.9: Design of helices targeting Trans-Membrane (TM) proteins. Using (A) a library of trans-membrane helix pairs from (B) unrelated structures (e.g. PDB 1JB0) to (C) design novel peptide ligands. Figure adapted from (Yin et al., 2007).

Multiple advances reported in this study are noteworthy. First, the authors showed that peptides possess the capacity to integrate within the lipid bilayer and to selectively interact with and activate α-β-integrins in mammalian cells; this had previously been difficult to accomplish owing to the lack of a solvent-exposed binding site. Second, by using a library of naturally constrained helix-helix interaction motifs, they circumvented the need to model computationally expensive inter-helical hydrogen bonding patterns and deviations from idealized helical geometry. Finally, this study provides exciting opportunities for designing peptide inhibitors without the need for high-resolution structures of the interface.

We have taken a radically different approach toward remedying the lack of structural data on protein-peptide interactions (Chapter 5). Peptide binding motifs often resemble intramolecular packing motifs, suggesting that the wealth of data on single-chain proteins could be used to model peptide interactions (Figure 1.10 and Chapter 5). Through analysis of a representative set of 301 protein-peptide binding interfaces, we showed that more than half of all peptide interaction motifs could be reliably modeled from sets of interacting fragments from the BriX database of protein fragments (http://brix.crg.es and Chapter 2) (Baeten et al., 2008; Vanhee et al., 2011). As a result, the number of structural peptide interaction motifs increased from a few hundred to over 100,000 fragment interactions. The use of these intramolecular ‘fragment interaction motifs’ that have pre-optimized packing represents an important conceptual breakthrough, because it transforms the whole database of protein structures into learning data for computer algorithms that design peptide substrates de novo, as we describe in Chapter 6. In the near future, we expect that such algorithms will start to appear, so that large-scale virtual peptide screening will become a realistic option.

Figure 1.10: Innovative structural approaches in peptide design using BriX and InteraX. (A) Examples of monomeric interaction motif mining in structures (red): (1) a helix-helix interaction motif (PDB 153L); (2) a helix-loop interaction motif (PDB 153L); (3) a cation-π interaction motif (PDB D1GA). See Chapter 5. (B) Sub-angstrom design of peptide interactions using monomeric structures. (1) Structure of a PDZ domain (PDB 2I1N) without its ligand and the helix and β2 strand forming the interface (blue). (2) Identification of an intra-molecular helix-strand-strand motif (red) from an unrelated structure (PDB 1GSA). (3) Comparison between the structures of the wild-type sequence (EETSV) designed on the intra-molecular scaffold (red) and the original ligand (gold). See Chapter 6.
References

Abe, K., Kobayashi, N., Sode, K. & Ikebukuro, K. (2007). Peptide ligand screening of alpha-synuclein aggregation modulators by in silico panning. BMC Bioinformatics, 8, 451.

Alexander, P.A., He, Y., Chen, Y., Orban, J. & Bryan, P.N. (2009). A minimal sequence code for switching protein structure and function. Proceedings of the National Academy of Sciences, 106, 21149–54.

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, 3389–402.

Alvarado, D., Klein, D.E. & Lemmon, M.A. (2010). Structural basis for negative cooperativity in growth factor binding to an EGF receptor. Cell, 142, 568–79.

Anfinsen, C. (1973). Principles that govern the folding of protein chains. Science.

Antosova, Z., Mackova, M., Kral, V. & Macek, T. (2009). Therapeutic application of peptides and proteins: parenteral forever? Trends in biotechnology, 27, 628–35.

Arkin, M.R. & Wells, J.A. (2004). Small-molecule inhibitors of protein-protein interactions: progressing towards the dream. Nat Rev Drug Discov, 3, 301–17.

Audie, J. & Boyd, C. (2010). The synergistic use of computation, chemistry and biology to discover novel peptide-based drugs: the time is right. Curr Pharm Des, 16, 567–82.

Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the BriX collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.

Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93–6.

Baker, M. (2010). Making membrane proteins for structures: a trillion tiny tweaks. Nature Publishing Group, 7, 429–434.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235–42.

Bowie, J.U., Lüthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–70.

Brooks, B., Bruccoleri, R., Olafson, B.D., States, D.J., Swaminathan, S. & Karplus, M. (1983). CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J Comput Chem.

Brunton, L., Lazo, J. & Parker. . . , K. (2006). Goodman & Gilman’s the pharmacological basis of therapeutics. mcgraw-hill.co.uk.

Chandler, D. (2005). Interfaces and the driving force of hydrophobic assembly. Nature, 437, 640–7.

Chandonia, J.M. & Brenner, S.E. (2006). The impact of structural genomics: expectations and outcomes. Science, 311, 347–51.

Chen, R., Li, L. & Weng, Z. (2003). ZDOCK: an initial-stage protein-docking algorithm. Proteins, 52, 80–7.

Chruszcz, M., Domagalski, M., Osinski, T., Wlodawer, A. & Minor, W. (2010). Unmet challenges of structural genomics. Curr Opin Struct Biol.

Citri, A. & Yarden, Y. (2006). EGF-ErbB signalling: towards the systems level. Nat Rev Mol Cell Biol, 7, 505–16.

Clackson, T. & Wells, J.A. (1995). A hot spot of binding energy in a hormone-receptor interface. Science, 267, 383–6.
    Conn, P.J., Christopoulos, A. & Lindsley, C.W. (2009). Allosteric modulators of gpcrs: a novel approach for the treatment of cns disorders. Nat Rev Drug Discov, 8, 41–54. 11
    Conte, L.L., Chothia, C. & Janin, J. (1999). The atomic structure of protein-protein recognition sites. Journal of Molecular Biology, 285, 2177–98. 10
    Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popović, Z. & Players, F. (2010). Predicting protein structures with a multiplayer online game. Nature, 466, 756–60. 18
    Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W. & Kollman, P.A. (1995). A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 19
    Craik, D.J., Clark, R.J. & Daly, N.L. (2007). Potential therapeutic applications of the cyclotides and related cystine knot mini-proteins. Expert opinion on investigational drugs, 16, 595–604. 27
    Creighton, P. (1984). Structures and molecular principles. Proteins. 13
    Desgrosellier, J.S. & Cheresh, D.A. (2010). Integrins in cancer: biological implications and therapeutic opportunities. Nat Rev Cancer, 10, 9–22. 30
    Desmet, J., Maeyer, M., Hazes, B. & Laster, I. (1992). The dead-end elimination theorem and its use in protein side-chain positioning. Nature. 14
    Dill, K.A. (1990). Dominant forces in protein folding. Biochemistry, 29, 7133–55. 6
    Dominguez, C., Boelens, R. & Bonvin, A.M.J.J. (2003). Haddock: a protein-protein docking approach based on biochemical or biophysical information. Journal of the American Chemical Society, 125, 1731–7. 20
    Drews, J. (2000). Drug discovery: a historical perspective. Science, 287, 1960–4. 8
    Duan, Y. & Kollman, P.A. (1998). Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science, 282, 740–4. 12
    Easton, D.M., Nijnik, A., Mayer, M.L. & Hancock, R.E.W. (2009). Potential of immunomodulatory host defense peptides as novel anti-infectives. Trends in biotechnology, 27, 582–90. 7
    Eglen, R. & Reisine, T. (2010). Human kinome drug discovery and the emerging importance of atypical allosteric inhibitors. Expert Opinion on Drug Discovery, 5, 277–290. 11
    Eisenmesser, E.Z., Bosco, D.A., Akke, M. & Kern, D. (2002). Enzyme dynamics during catalysis. Science, 295, 1520–3. 12
    Fernandez-Ballester, G., Beltrao, P., Gonzalez, J.M., Song, Y.H., Wilmanns, M., Valencia, A. & Serrano, L. (2009). Structure-based prediction of the saccharomyces cerevisiae sh3-ligand interactions. Journal of Molecular Biology, 388, 902–16. 29
    Fersht, A.R. (2008). From the first protein structures to our current knowledge of protein folding: delights and scepticisms. Nat Rev Mol Cell Biol, 9, 650–654. 12
    Gebauer, M. & Skerra, A. (2009). Engineered protein scaffolds as next-generation antibody therapeutics. Curr Opin Chem Biol, 13, 245–55. 28
    Gianni, S., Guydosh, N.R., Khan, F., Caldas, T.D., Mayor, U., White, G.W.N., DeMarco, M.L., Daggett, V. & Fersht, A.R. (2003). Unifying features in protein-folding mechanisms. Proceedings of the National Academy of Sciences of the United States of America, 100, 13286–91. 7
    Giordano, R.J., Cardó-Vila, M., Salameh, A., Anobom, C.D., Zeitlin, B.D., Hawke, D.H., Valente, A.P., Almeida, F.C.L., Nör, J.E., Sidman, R.L., Pasqualini, R. & Arap, W. (2010). From combinatorial peptide selection to drug prototype (i): targeting the vascular endothelial growth factor receptor pathway. Proceedings of the National Academy of Sciences of the United States of America, 107, 5112–7. 25
    Grant, A., Lee, D. & Orengo, C. (2004). Progress towards mapping the universe of protein folds. Genome Biol, 5, 107. 17
    Gray, J.J., Moughon, S., Wang, C., Schueler-Furman, O., Kuhlman, B., Rohl, C.A. & Baker, D. (2003). Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, 331, 281–99. 20
    Grigoryan, G., Reinke, A.W. & Keating, A.E. (2009). Design of protein-interaction specificity gives selective bzip-binding peptides. Nature, 458, 859–64. 19, 23, 27
    Haliloglu, T. & Erman, B. (2009). Analysis of correlations between energy and residue fluctuations in native proteins and determination of specific sites for binding. Phys. Rev. Lett., 102, 088103. 11
    Haliloglu, T., Gul, A. & Erman, B. (2010). Predicting important residues and interaction pathways in proteins using gaussian network model: binding and stability of hla proteins. PLoS Comput Biol, 6, e1000845. 11
    Han, J.D.J., Bertin, N., Hao, T., Goldberg, D.S., Berriz, G.F., Zhang, L.V., Dupuy, D., Walhout, A.J.M., Cusick, M.E., Roth, F.P. & Vidal, M. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430, 88–93. 4
    Hancock, R.E.W. & Sahl, H.G. (2006). Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nat Biotechnol, 24, 1551–7. 7
    He, Y., Chen, Y., Alexander, P., Bryan, P.N. & Orban, J. (2008). Nmr structures of two designed proteins with high sequence identity but different fold and function. Proceedings of the National Academy of Sciences of the United States of America, 105, 14412–7. 16
    Honeyman, M.C., Brusic, V., Stone, N.L. & Harrison, L.C. (1998). Neural network-based prediction of candidate t-cell epitopes. Nat Biotechnol, 16, 966–9. 25
    Hopkins, A.L. & Groom, C.R. (2002). The druggable genome. Nat Rev Drug Discov, 1, 727–30. 9
    Huang, H., Li, L., Wu, C., Schibli, D., Colwill, K., Ma, S., Li, C., Roy, P., Ho, K., Songyang, Z., Pawson, T., Gao, Y. & Li, S.S.C. (2008). Defining the specificity space of the human src homology 2 domain. Mol Cell Proteomics, 7, 768–84. 2
    Janin, J., Henrick, K., Moult, J., Eyck, L.T., Sternberg, M.J.E., Vajda, S., Vakser, I., Wodak, S.J. & of PRedicted Interactions, C.A. (2003). Capri: a critical assessment of predicted interactions. Proteins, 52, 2–9. 20
    Jiang, L., Althoff, E.A., Clemente, F.R., Doyle, L., Rothlisberger, D., Zanghellini, A., Gallaher, J.L., Betker, J.L., Tanaka, F., Barbas, C.F., Hilvert, D., Houk, K.N., Stoddard, B.L. & Baker, D. (2008). De novo computational design of retro-aldol enzymes. Science, 319, 1387–1391. 18
    Jochim, A.L. & Arora, P.S. (2009). Assessment of helical interfaces in protein-protein interactions. Mol Biosyst, 5, 924–6. 27
    Jones, S. & Thornton, J.M. (1996). Principles of protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America, 93, 13–20. 10
    Kendrew, J., Bodo, G., Dintzis, H., Parrish, R. & Wyckoff, H. (1958). A three-dimensional model of the myoglobin molecule obtained by x-ray analysis. Nature. 12
    King, C.A. & Bradley, P. (2010). Structure-based prediction of protein-peptide specificity in rosetta. Proteins, 78, 3437–49. 29
    Klepeis, J.L., Lindorff-Larsen, K., Dror, R.O. & Shaw, D.E. (2009). Long-timescale molecular dynamics simulations of protein structure and function. Curr Opin Struct Biol, 19, 120–7. 19
    Korkegian, A., Black, M.E., Baker, D. & Stoddard, B.L. (2005). Computational thermostabilization of an enzyme. Science, 308, 857–60. 19
    Kuhlman, B. & Baker, D. (2000). Native protein sequences are close to optimal for their structures. Proceedings of the National Academy of Sciences of the United States of America, 97, 10383–8. 14
    Kühner, S., van Noort, V., Betts, M.J., Leo-Macias, A., Batisse, C., Rode, M., Yamada, T., Maier, T., Bader, S., Beltran-Alvarez, P., Castaño-Diez, D., Chen, W.H., Devos, D., Güell, M., Norambuena, T., Racke, I., Rybin, V., Schmidt, A., Yus, E., Aebersold, R., Herrmann, R., Böttcher, B., Frangakis, A.S., Russell, R.B., Serrano, L., Bork, P. & Gavin, A.C. (2009). Proteome organization in a genome-reduced bacterium. Science, 326, 1235–40. 4
    Lange, O.F., Lakomek, N.A., Farès, C., Schröder, G.F., Walter, K.F.A., Becker, S., Meiler, J., Grubmüller, H., Griesinger, C. & de Groot, B.L. (2008). Recognition dynamics up to microseconds revealed from an rdc-derived ubiquitin ensemble in solution. Science, 320, 1471–5. 12
    Lee, J., Natarajan, M., Nashine, V.C., Socolich, M., Vo, T., Russ, W.P., Benkovic, S.J. & Ranganathan, R. (2008). Surface sites for engineering allosteric control in proteins. Science, 322, 438–42. 11
    Lenaerts, T., Ferkinghoff-Borg, J., Stricher, F., Serrano, L., Schymkowitz, J.W.H. & Rousseau, F. (2008). Quantifying information transfer by protein domains: analysis of the fyn sh2 domain structure. BMC Struct Biol, 8, 43. 11
    Lenaerts, T., Schymkowitz, J. & Rousseau, F. (2009). Protein domains as information processing units. Curr Protein Pept Sci, 10, 133–45. 11
    Lensink, M.F., Méndez, R. & Wodak, S.J. (2007). Docking and scoring protein complexes: Capri 3rd edition. Proteins, 69, 704–18. 20
    Levinthal, C. (1969). How to fold graciously. Mossbauer spectroscopy in biological systems. 6
    Lin, H.H., Zhang, G.L., Tongchusak, S., Reinherz, E.L. & Brusic, V. (2008). Evaluation of mhc-ii peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics, 9 Suppl 12, S22. 25
    Lipinski, C.A., Lombardo, F., Dominy, B.W. & Feeney, P.J. (2001). Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev, 46, 3–26. 9
    London, N., Movshovitz-Attias, D. & Schueler-Furman, O. (2010). The structural basis of peptide-protein binding strategies. Structure, 18, 188–199. 11, 20, 21
    Lutz, S. (2010). Beyond directed evolution-semi-rational protein engineering and design. Curr Opin Biotechnol. 19
    Mandell, D.J. & Kortemme, T. (2009a). Backbone flexibility in computational protein design. Curr Opin Biotechnol, 20, 420–8. 13
    Mandell, D.J. & Kortemme, T. (2009b). Computer-aided design of functional protein interactions. Nat Chem Biol, 5, 797–807. 19
    Mandell, D.J., Coutsias, E.A. & Kortemme, T. (2009). Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat Methods, 6, 551–2. 16
    McCammon, J.A., Gelin, B.R. & Karplus, M. (1977). Dynamics of folded proteins. Nature, 267, 585–90. 18
    Moellering, R.E., Cornejo, M., Davis, T.N., Bianco, C.D., Aster, J.C., Blacklow, S.C., Kung, A.L., Gilliland, D.G., Verdine, G.L. & Bradner, J.E. (2009). Direct inhibition of the notch transcription factor complex. Nature, 462, 182–8. 27
    Moult, J., Pedersen, J.T., Judson, R. & Fidelis, K. (1995). A large-scale experiment to assess protein structure prediction methods. Proteins, 23, ii–v. 14
    Murray, C.W. & Blundell, T.L. (2010). Structural biology in fragment-based drug design. Curr Opin Struct Biol, 20, 497–507. 23
    Naider, F. & Anglister, J. (2009). Peptides in the treatment of aids. Curr Opin Struct Biol, 19, 473–82. 26
    Nam, Y., Sliz, P., Song, L., Aster, J.C. & Blacklow, S.C. (2006). Structural basis for cooperativity in recruitment of maml coactivators to notch transcription complexes. Cell, 124, 973–83. 26
    Neduva, V. & Russell, R.B. (2006). Peptides mediating interaction networks: new leads at last. Curr Opin Biotechnol, 17, 465–71. 7, 8
    Onuchic, J.N. & Wolynes, P.G. (2004). Theory of protein folding. Curr Opin Struct Biol, 14, 70–5. 5
    Patel, L.N., Zaro, J.L. & Shen, W.C. (2007). Cell penetrating peptides: intracellular pathways and pharmaceutical perspectives. Pharm Res, 24, 1977–92. 8
    Pawson, T. & Nash, P. (2003). Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445–52. 7
    Pechon, P., Tartar, A., Dunn, M.K. & Reichert, J. (2010). Development and trends for peptide therapeutics. 1–48. 9, 10
    Petsalaki, E. & Russell, R.B. (2008). Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin Biotechnol, 19, 344–50. 8
    Petsalaki, E., Stark, A. & Russell, R.B. (2009). Accurate prediction of peptide binding sites on protein surfaces. PLoS Comput Biol, 5, e1000335. 28
    Pryciak, P.M. (2009). Designing new cellular signaling pathways. Chemistry & Biology, 16, 249–254. 20
    Raveh, B., London, N. & Schueler-Furman, O. (2010). Sub-angstrom modeling of complexes between flexible peptides and globular proteins. Proteins, 78, 2029–40. 29
    Reichmann, D., Rahat, O., Albeck, S., Meged, R., Dym, O. & Schreiber, G. (2005). The modular architecture of protein-protein binding interfaces. Proceedings of the National Academy of Sciences of the United States of America, 102, 57–62. 11
    Reina, J., Lacroix, E., Hobson, S.D., Fernandez-Ballester, G., Rybin, V., Schwab, M.S., Serrano, L. & Gonzalez, C. (2002). Computer-aided design of a pdz domain to recognize new target sequences. Nature Structural Biology, 9, 621–7. 23, 29
    Rohl, C.A., Strauss, C.E.M., Misura, K.M.S. & Baker, D. (2004). Protein structure prediction using rosetta. Meth Enzymol, 383, 66–93. 17
    Romero, P.A. & Arnold, F.H. (2009). Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol, 10, 866–76. 19
    Rose, G.D. & Creamer, T.P. (1994). Protein folding: predicting predicting. Proteins, 19, 1–3. 15
    Roy, A., Kucukural, A. & Zhang, Y. (2010). I-tasser: a unified platform for automated protein structure and function prediction. Nat Protoc, 5, 725–38. 16
    Russell, R.B. & Gibson, T.J. (2008). A careful disorderliness in the proteome: sites for interaction and targets for future therapies. FEBS Lett, 582, 1271–5. 7
    Schafmeister, C.E., Po, J. & Verdine, G.L. (2000). An all-hydrocarbon cross-linking system for enhancing the helicity and metabolic stability of peptides. Journal of the American Chemical Society, 122, 5891–5892. 26, 27
    Schneidman-Duhovny, D., Inbar, Y., Nussinov, R. & Wolfson, H.J. (2005). Patchdock and symmdock: servers for rigid and symmetric docking. Nucleic Acids Research, 33, W363–7. 20
    Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The foldx web server: an online force field. Nucleic Acids Research, 33, W382–8. 14
    Shaw, D.E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R.O., Eastwood, M.P., Bank, J.A., Jumper, J.M., Salmon, J.K., Shan, Y. & Wriggers, W. (2010). Atomic-level characterization of the structural dynamics of proteins. Science, 330, 341–6. 12, 19
    Shen, Y., Lange, O., Delaglio, F., Rossi, P., Aramini, J.M., Liu, G., Eletsky, A., Wu, Y., Singarapu, K.K., Lemak, A., Ignatchenko, A., Arrowsmith, C.H., Szyperski, T., Montelione, G.T., Baker, D. & Bax, A. (2008). Consistent blind protein structure generation from nmr chemical shift data. Proceedings of the National Academy of Sciences of the United States of America, 105, 4685–90. 18
    Sheridan, C. (2010). Roche backs aileron’s stapled peptides. Nat Biotechnol, 28, 992–3. 27
    Siegel, J.B., Zanghellini, A., Lovick, H.M., Kiss, G., Lambert, A.R., Clair, J.L.S., Gallaher, J.L., Hilvert, D., Gelb, M.H., Stoddard, B.L., Houk, K.N., Michael, F.E. & Baker, D. (2010). Computational design of an enzyme catalyst for a stereoselective bimolecular diels-alder reaction. Science, 329, 309–13. 18
    Smith, C.A. & Kortemme, T. (2010). Structure-based prediction of the peptide sequence space recognized by natural and synthetic pdz domains. Journal of Molecular Biology, 402, 460–74. 29
    Smith, R.D., Hu, L., Falkner, J.A., Benson, M.L., Nerothin, J.P. & Carlson, H.A. (2006). Exploring protein-ligand recognition with binding moad. J Mol Graph Model, 24, 414–25. 10
    Stein, A. & Aloy, P. (2010). Novel peptide-mediated interactions derived from high-resolution 3-dimensional structures. PLoS Comput Biol, 6, e1000789. 21
    Stein, A., Céol, A. & Aloy, P. (2010). 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research. 21
    Stewart, M.L., Fire, E., Keating, A.E. & Walensky, L.D. (2010). The mcl-1 bh3 helix is an exclusive mcl-1 inhibitor and apoptosis sensitizer. Nat Chem Biol, 6, 595–601. 26, 27
    Stiffler, M.A., Chen, J.R., Grantcharova, V.P., Lei, Y., Fuchs, D., Allen, J.E., Zaslavskaia, L.A. & MacBeath, G. (2007). Pdz domain binding selectivity is optimized across the mouse proteome. Science, 317, 364–9. 22, 23, 25
    Tan, M.L., Choong, P.F.M. & Dass, C.R. (2010). Recent developments in liposomes, microparticles and nanoparticles for protein and peptide drug delivery. Peptides, 31, 184–93. 8
    Tanrikulu, Y. & Schneider, G. (2008). Pseudoreceptor models in drug design: bridging ligand- and receptor-based virtual screening. Nat Rev Drug Discov, 7, 667–77. 24
    Taylor, W.R. (1986). The classification of amino acid conservation. J Theor Biol, 119, 205–18. 4
    Thorsen, T.S., Madsen, K.L., Rebola, N., Rathje, M., Anggono, V., Bach, A., Moreira, I.S., Stuhr-Hansen, N., Dyhring, T., Peters, D., Beuming, T., Huganir, R., Weinstein, H., Mulle, C., Strømgaard, K., Rønn, L.C.B. & Gether, U. (2010). Identification of a small-molecule inhibitor of the pick1 pdz domain that inhibits hippocampal ltp and ltd. Proceedings of the National Academy of Sciences of the United States of America, 107, 413–8. 11
    Timmerman, P., Beld, J., Puijk, W.C. & Meloen, R.H. (2005). Rapid and quantitative cyclization of multiple peptide loops onto synthetic scaffolds for structural mimicry of protein surfaces. Chembiochem, 6, 821–4. 8
    Tonikian, R., Zhang, Y., Sazinsky, S.L., Currell, B., Yeh, J.H., Reva, B., Held, H.A., Appleton, B.A., Evangelista, M., Wu, Y., Xin, X., Chan, A.C., Seshagiri, S., Lasky, L.A., Sander, C., Boone, C., Bader, G.D. & Sidhu, S.S. (2008). A specificity map for the pdz domain family. PLoS Biol, 6, e239. 25
    Unal, E.B., Gursoy, A. & Erman, B. (2010). Vital: Viterbi algorithm for de novo peptide design. PLoS ONE, 5, e10926. 28
    van der Sloot, A.M., Tur, V., Szegezdi, E., Mullally, M.M., Cool, R.H., Samali, A., Serrano, L. & Quax, W.J. (2006). Designed tumor necrosis factor-related apoptosis-inducing ligand variants initiating apoptosis exclusively via the dr5 receptor. Proceedings of the National Academy of Sciences of the United States of America, 103, 8634–9. 23, 29
    van der Sloot, A.M., Kiel, C., Serrano, L. & Stricher, F. (2009). Protein design in biological networks: from manipulating the input to modifying the output. Protein Eng Des Sel, 22, 537–42. 19
    Vanhee, P., Stricher, F., Baeten, L., Verschueren, E., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2009). Protein-peptide interactions adopt the same structural motifs as monomeric protein folds. Structure, 17, 1128–1136. 20, 21
    Vanhee, P., Reumers, J., Stricher, F., Baeten, L., Serrano, L., Schymkowitz, J. & Rousseau, F. (2010). Pepx: a structural database of non-redundant protein-peptide complexes. Nucleic Acids Research, 38, D545–51. 21, 25
    Vanhee, P., Verschueren, E., Baeten, L., Stricher, F., Serrano, L., Rousseau, F. & Schymkowitz, J. (2011). Brix: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Research, 39, D435–42. 17, 32
    Walensky, L.D., Kung, A.L., Escher, I., Malia, T.J., Barbuto, S., Wright, R.D., Wagner, G., Verdine, G.L. & Korsmeyer, S.J. (2004). Activation of apoptosis in vivo by a hydrocarbon-stapled bh3 helix. Science, 305, 1466–70. 8
    Walsh, G. (2010). Biopharmaceutical benchmarks 2010. Nat Biotechnol, 28, 917–24. 8
    Walters, R.F.S. & DeGrado, W.F. (2006). Helix-packing motifs in membrane proteins. Proceedings of the National Academy of Sciences of the United States of America, 103, 13658–63. 30
    Watt, P.M. (2006). Screening for peptide drugs from the natural repertoire of biodiverse protein folds. Nature Biotechnology, 24, 177–83. 26
    Wells, J.A. & McClendon, C.L. (2007). Reaching for high-hanging fruit in drug discovery at protein-protein interfaces. Nature, 450, 1001–9. 8, 11
    Weng, A.P., Nam, Y., Wolfe, M.S., Pear, W.S., Griffin, J.D., Blacklow, S.C. & Aster, J.C. (2003). Growth suppression of pre-t acute lymphoblastic leukemia cells by inhibition of notch signaling. Mol Cell Biol, 23, 655–64. 26
    Wu, S. & Zhang, Y. (2007). Lomets: a local meta-threading-server for protein structure prediction. Nucleic Acids Research, 35, 3375–82. 16
    Yin, H., Slusky, J.S., Berger, B.W., Walters, R.S., Vilaire, G., Litvinov, R.I., Lear, J.D., Caputo, G.A., Bennett, J.S. & Degrado, W.F. (2007). Computational design of peptides that target transmembrane helices. Science, 315, 1817–1822. 30, 31
    Yun, C.H., Boggon, T.J., Li, Y., Woo, M.S., Greulich, H., Meyerson, M. & Eck, M.J. (2007). Structures of lung cancer-derived egfr mutants and inhibitor complexes: mechanism of activation and insights into differential inhibitor sensitivity. Cancer Cell, 11, 217–27. 2
    Zhang, Y. (2009). Protein structure prediction: when is it useful? Curr Opin Struct Biol, 19, 145–55. 14
    2 Fragmenting protein space

    This chapter is based on: BriX: a database of protein building blocks for structural analysis, modeling and design. Peter Vanhee*, Erik Verschueren*, Lies Baeten, Francois Stricher, Luis Serrano, Frederic Rousseau and Joost Schymkowitz [1]. Nucleic Acids Research [2], January 2011.

    [1] Peter Vanhee and Erik Verschueren are joint first authors.
    [2] This paper was recently chosen by the Editors of Nucleic Acids Research to appear on a selected Featured Articles page (http://www.oxfordjournals.org/our_journals/nar/featured_articles.html), representing the top 5% of NAR papers in terms of originality, significance and scientific excellence.

    High-resolution structures of proteins remain the most valuable source for understanding their function in the cell and provide leads for drug design. Because too few protein structures are available to tackle complex problems such as modeling backbone moves or docking, alternative approaches using small, recurrent protein fragments have been proposed. Here we present two databases that provide a vast resource for implementing such fragment-based strategies. The BriX database contains fragments from over 7000 non-homologous proteins from the Astral collection, segmented into lengths from 4 to 14 residues and clustered according to structural similarity, amounting to 2 million fragments per length. To overcome the lack of loops classified in BriX, we constructed the Loop BriX database of non-regular structure elements, clustered according to the end-to-end distance between the regular residues flanking the loop.
    Both databases are available online (http://brix.crg.es) and can be accessed through a user-friendly web interface. For high-throughput queries a web-based API is provided, as well as full database downloads. In addition, two applications are provided as online services: (1) user-submitted structures can be covered on the fly with BriX classes, representing putative structural variation throughout the protein, and (2) gaps or low-confidence regions in these structures can be bridged with matching fragments. Both databases are the source for the fragment-based strategies developed in this thesis, such as the reconstruction of protein loops (Chapter 3) or the modeling of protein-peptide complexes (Chapters 5 and 6).

    2.1 Introduction

    Proteins are by far the most versatile and complex molecules in the cell. It is commonly accepted that protein function directly relates to three-dimensional (3D) structure. Yet detailed structural information is available for just over a quarter of all single-domain protein families (Levitt, 2009), a number that can be extended through threading and homology modeling (Kopp et al., 2007). Due to the experimental constraints of X-ray crystallography and NMR, the rate at which new structures are determined is considerably lower than the rate at which new sequence data are produced by next-generation sequencing methods (Baker & Sali, 2001; Arnold et al., 2009). In order to understand the structural protein universe, proteins have been classified by fold architecture and evolutionary relationships in databases such as SCOP (Murzin et al., 1995) or CATH (Orengo et al., 1997). However, proteins often perform their functions using just a limited number of residues, making it worthwhile to find structural similarities at the level of protein fragments.
    In search of a ‘parts list’ of proteins, with α-helices and β-sheets as prime examples of common parts, fragment libraries have been constructed based on the similarity of the polypeptide backbone (Fitzkee et al., 2005; Budowski-Tal et al., 2010). These protein fragment libraries have been widely used for a range of applications, such as structural comparison of protein folds through a simplified representation with fragments (Le et al., 2009), homology modeling at the level of fragments (Ananthalakshmi et al., 2005; Berkholz et al., 2010), investigating sequence-to-structure relationships (Samson & Levitt, 2009), approximating the tertiary structure of proteins using fragments (Bystroff & Shao, 2002; Kolodny et al., 2002; Kolodny & Levitt, 2003; Kifer et al., 2008), loop prediction (Bornot et al., 2009; Choi & Deane, 2010; Fernandez-Fuentes et al., 2006) or even novel fold prediction (Simons et al., 1997; Qian et al., 2007).

    Unfortunately, many of the available fragment libraries are either limited in fragment classes or ‘states’ (Budowski-Tal et al., 2010; Pandini et al., 2010) or not publicly accessible (Bystroff & Shao, 2002). Moreover, existing databases are often biased towards short stretches of residues, typically 3 to 9 residues long, or contain an extensive parts list but are not clustered based on backbone similarity, thereby complicating comparative studies (Fitzkee et al., 2005). Although limited alphabets have been shown to reconstruct existing proteins to global fits of 0.5 Å root mean square distance (RMSD) and to serve as templates for efficiently sampling protein space, they are too limited to describe protein structure at sub-angstrom resolution, especially in the case of loops (Baeten et al., 2008). To overcome these limitations we previously constructed BriX, a database of protein fragments from 4 to 14 residues, hierarchically clustered on backbone similarity (Baeten et al., 2008).

    Here we describe how we updated the BriX database, which previously contained fragments from 1259 structures, to incorporate over 7000 structures from the ASTRAL40 set (a curated set of proteins with less than 40% sequence identity) (Chandonia et al., 2004). Furthermore, we enriched the database with all loops from over 14,000 structures in the ASTRAL95 set (sharing less than 95% sequence identity) and clustered these loops in their own right.
    We also provide a user-friendly web interface to explore both BriX and Loop BriX (http://brix.crg.es). Finally, to illustrate the potential of our database, we allow users to upload their own PDB structure and ‘cover’ parts or ‘bridge’ gaps with BriX or Loop BriX fragments. We expect the new release of BriX to help the scientific community by facilitating the use of fragments in structural biology, protein modeling and design.
    2.2 Contents of the BriX database

    2.2.1 Update of the BriX database

    The first version of the BriX database (Baeten et al., 2008) was constructed from the WHAT IF set of 1259 non-redundant proteins (Vriend, 1990). Using a sliding-window technique, we segmented all proteins into fragments of 4 to 14 residues and clustered them on backbone similarity with a hierarchical clustering algorithm. The similarity between two fragments is defined as the average root mean square distance (RMSD) between the backbone atoms (N, Cα, C, O) of each corresponding residue.

    [Figure 2.1: bar chart of the percentage per SCOP class (all alpha, all beta, alpha and beta (a/b), alpha and beta (a+b), multi-domain, membrane and cell surface, and small proteins) for structures in ASTRAL40 and families in SCOP.]
    Figure 2.1: SCOP representation of ASTRAL40. The distribution of SCOP classes in ASTRAL40 (the dataset used to construct BriX) is similar to the SCOP distribution.

    The updated version of the BriX database is enriched with the much larger ASTRAL40 set of 7290 proteins sharing less than 40% sequence identity. The ASTRAL40 set covers the variety present in structural databases such as SCOP (Figure 2.1). Following the procedure used by Baeten et al. (2008), we fragmented all proteins and assigned each fragment to the closest class, as represented by its centroid. We were able to fit most of the ASTRAL40 fragments into existing BriX classes, which shows the completeness of our structural alphabet in the updated version of BriX, while increasing its content 7-fold (Figure 2.2).

    [Figure 2.2, panels A to C. Panel A, number of classes per fragment length (4 to 14): 1926, 2845, 2694, 3613, 3061, 3398, 3207, 2850, 2374, 2030, 1589. Panel B, classified fragments (0 to 2,000,000) per fragment length for BriX v.1 and BriX v.2. Panel C, RMSD thresholds from 0.5 to 1.0 Å.]
    Figure 2.2: The BriX database. (A) Number of BriX classes for the lowest class thresholds per length. A peak in the number of classes can be observed at fragment length 7 and class threshold 0.5 Å. (B) Increase in the number of classified fragments from the first version of BriX (Baeten et al., 2008) to the current version. (C) BriX classes with class thresholds varying from 0.5 Å to 1.0 Å RMSD for fragments of length 7. The class threshold indicates the compactness and structural homogeneity of the class, with lower thresholds yielding more compact classes than higher thresholds.
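    As an illustration only, the fragmentation and class-assignment steps described above can be sketched in a few lines of Python. This is not the BriX implementation. It assumes that each residue is a dictionary mapping the backbone atom names to (x, y, z) coordinates, that the two fragments being compared are already superposed, and that a fragment is left unclassified when no centroid lies within the class threshold. The function names and the default threshold of 0.5 Å are hypothetical choices for this sketch.

```python
import math

# Backbone atoms used for the comparison, as named in the text (CA = C-alpha).
BACKBONE_ATOMS = ("N", "CA", "C", "O")


def sliding_fragments(residues, min_len=4, max_len=14):
    """Yield (start, length, fragment) for every window of min_len..max_len residues."""
    for length in range(min_len, max_len + 1):
        for start in range(len(residues) - length + 1):
            yield start, length, residues[start:start + length]


def backbone_rmsd(frag_a, frag_b):
    """RMSD over the backbone atoms of corresponding residues.

    Assumption of this sketch: the coordinates are already superposed,
    so no optimal rotation or translation is computed here.
    """
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same length")
    squared_sum = 0.0
    atom_count = 0
    for res_a, res_b in zip(frag_a, frag_b):
        for atom in BACKBONE_ATOMS:
            squared_sum += sum((p - q) ** 2 for p, q in zip(res_a[atom], res_b[atom]))
            atom_count += 1
    return math.sqrt(squared_sum / atom_count)


def assign_to_class(fragment, centroids, threshold=0.5):
    """Return the index of the closest centroid within threshold, else None."""
    best_index, best_rmsd = None, float("inf")
    for index, centroid in enumerate(centroids):
        if len(centroid) != len(fragment):
            continue  # classes are defined per fragment length
        rmsd = backbone_rmsd(fragment, centroid)
        if rmsd < best_rmsd:
            best_index, best_rmsd = index, rmsd
    return best_index if best_rmsd <= threshold else None
```

    In this sketch, raising the threshold passed to assign_to_class lets more fragments fall into an existing class, which mirrors the trade-off between class compactness and classification rate discussed in this section.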
    2. FRAGMENTING PROTEINSPACE 2.2.2 BriX Statistics As expected, the number of classes varies with the length of the clustered frag- ments: even for short fragment length (n = 4) and strict threshold (≤ 0.4 Å RMSD) a large number of classes (2000) were observed. The largest amount of structural classes is detected when applying a clustering threshold of 0.5 Å to fragments of length 7: 3613 classes can be distinguished (Figure 2.2A). Hereafter the number of classes steadily decreases until 1500 classes at length 14. As expected, the number of classes per length decreases with increasing classification thresholds (Figure 2.3) as more different fragments are classified into a single class. Also, the percentage of classified fragments decreases steadily with increasing fragment length. To compensate for this, increasing the covering thresholds for a specific length improves the classification rates (Figure 2.4). !" #!!" $!!!" $#!!" %!!!" %#!!" &!!!" &#!!" '!!!" !('" !(#" !()" !(*" !(+" !(," $" $($" $(%" $(&" $('" -./012"34"5267"89:;;1;" <=>?"@A21;A39B"CDEF;G23/H" I'" I#" I)" I*" I+" I," I$!" I$$" I$%" I$&" I$'" Figure 2.3: Number of BriX classes versus different classification thresholds. The number of classes in BriX decreases with increasing classification thresholds. Different lengths (from 4 to 14 residues) are shown in different colors. Furthermore, we analyzed the secondary structure content in classes derived 48
  • 67.
    2.2 Contents ofthe BriX database ! "! #! $! %! &! '! (! )! *! "!! !+% !+& !+' !+( !+) !+* " "+" "+# "+$ "+% ,% ,& ,' ,( ,) ,* ,"! ,"" ,"# ,"$ ,"% Figure 2.4: Percentage of classified fragments versus different classification thresholds. The percentage of classified fragments decreases steadily with increasing fragment length. To compensate for this, increasing the covering thresholds (X axis) for a specific length improves the classification rates. for different fragment lengths and thresholds. Not surprisingly, α-helical and β- strand fragments remain well represented in structural classes of higher length (Figure 2.5) while loop fragments are under-represented in classes of all lengths, indicating that they are harder to classify. Clearly the majority of unclassified fragments are composed of loop structures (Figure 2.6). This indicates that a separate classification scheme, more suited to the particularities of loop structures, could significantly enrich the BriX database. 2.2.3 Creation of the Loop BriX database The Loop BriX database was built using 14.525 protein structures derived from the ASTRAL95 set containing protein structures sharing less than 95 percent sequence 49
  • 68.
    identity (Chandonia et al., 2004). A loop fragment starts and ends with a single residue belonging to a regular secondary structure element, such as a helix or a strand, and contains any number of irregular residues in between. As shown by different studies, the structural loop space can be partitioned by the four combinations of flanking regular elements: α-α, α-β, β-α and β-β (Espadaler et al., 2004; Donate et al., 1996; Burke & Deane, 2001) (Table 2.1). We have introduced a novel way to compare two loop fragments, based on (1) the distance between their end points (‘end-to-end distance’), rather than the overall structural similarity used in BriX, and (2) the superposition of two regular anchor residues at each side of the loop with an RMSD below 1 Å. Firstly, loops in each of the four loop classes described above were clustered on end-to-end distance using the same hierarchical clustering algorithm.

    [Figure 2.5 plot: number of fragments classified, 0 to 1,800,000 (Y axis), versus percentage of secondary structure in the BriX class, 0 to 95 (X axis); series for helix, strand, loop and turn at fragment lengths 5, 7, 10 and 14.]

    Figure 2.5: Secondary structure content for classified fragments in BriX. Classified fragments (Y axis) are shown by secondary structure content (X axis) for lengths 5, 7, 10 and 14. α-helical and β-strand fragments are well represented in BriX classes, even for longer fragment lengths. In contrast, turn and loop fragments are generally less well classified.
  • 69.
    [Figure 2.6 plot: percentage of DSSP secondary structure in unclassified fragments, 0 to 80 (Y axis), for helix, strand, turn and loop (X axis); series for fragment lengths 4, 7, 10 and 13.]

    Figure 2.6: Secondary structure content for unclassified fragments in BriX. Secondary structure distribution of unclassified fragments in BriX (ASTRAL40) for fragment lengths 4, 7, 10 and 13. Unclassified fragments contain mainly loop elements, showing the need for the separate classification scheme of loop elements employed in Loop BriX.

    These ‘super classes’ contain loops of varying sizes and thus show a considerable amount of variation in the part between the end points (Figure 2.7A). Secondly, super classes were clustered into ‘sub classes’, grouping loops of the same length and similar structure.
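The two-step loop classification described above can be illustrated with a small sketch. It is not the Loop BriX implementation: it swaps in a greedy "leader" clustering for the hierarchical clustering used in the thesis, and a distance-matrix RMSD (dRMSD) for the superposition RMSD, so that it runs with the Python standard library only. The function names, the tolerances and the representation of a loop as a list of C-alpha coordinates are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def end_to_end(loop):
    """Distance between the first and last coordinate of a loop."""
    return math.dist(loop[0], loop[-1])


def drmsd(a, b):
    """Distance-matrix RMSD of two equal-length loops (needs no superposition)."""
    n = len(a)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


def leader_cluster(items, distance, threshold):
    """Greedy clustering: join the first cluster whose leader is close enough."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if distance(cluster[0], item) <= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def classify_loops(loops, e2e_tol=0.5, shape_tol=1.0):
    """Return a list of super classes, each a list of sub classes (lists of loops)."""
    # Step 1: super classes group loops with a similar end-to-end distance.
    supers = leader_cluster(
        loops, lambda p, q: abs(end_to_end(p) - end_to_end(q)), e2e_tol
    )
    result = []
    for members in supers:
        # Step 2: sub classes group loops of equal length and similar shape.
        by_length = defaultdict(list)
        for loop in members:
            by_length[len(loop)].append(loop)
        subs = []
        for length in sorted(by_length):
            subs.extend(leader_cluster(by_length[length], drmsd, shape_tol))
        result.append(subs)
    return result
```

In this sketch, loops of different lengths can share a super class when their end points are equally far apart, but they always end up in different sub classes, which mirrors the variation within super classes noted in the text.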
  • 70.
    α-α (%)   β-β (%)   α-β (%)   β-α (%)   Ref.
    20        33        26        20        Donate et al. (1996)
    20        28        24        28        SLoop (Burke & Deane, 2001)
    22        31        24        23        ArchDB (Espadaler et al., 2004)
    19        35        21        25        Loop BriX

    Table 2.1: Distribution of loops across the four main loop categories for four different loop databases.

    [Figure 2.7 image: (A) superclass 27217 with subclasses 2219, 2220 and 2221, two β-strand anchors and an end-to-end distance of 11.7 Å. (B) Bar chart of the number of classes (Y axis) versus the number of fragments per class (X axis), for subclasses and superclasses.]

    Figure 2.7: The Loop BriX database. (A) Example of a superclass containing 3 subclasses. The superclass contains fragments with end-to-end distance around 11.78 Å RMSD and two β-strand anchor residues. At the subclass level, fragments with similar length and backbone are grouped (length 7 for subclass 2219 and length 13 for subclasses 2220 and 2221, superposition threshold of 1 Å). (B) Number of superclasses (blue) and subclasses (red) per class size, distributed in bins. In general, classes from Loop BriX are less populated than classes from BriX.
  • 71.
    2.2.4 Loop BriX Statistics

    In contrast to the quite limited conformational space of regular structure elements, loop structures are much more variable. In Loop BriX, loop fragments are between 4 and 117 irregular residues long and classes are generally less populated (Figure 2.7B). Intriguingly, we observe a clear distinction between classes of loops connecting different secondary structure elements: the number of super classes having more than 100 fragments is much lower for α-α classes (8) than for β-β classes (20), showing less regularity for α-α classes than for β-β classes (Table 2.2). This is explained by the fact that α-helices, being cylindrical, show much more variation at their end points, while β-strands have more regular end-to-end distances.

    # Loops   α-α classes     β-β classes     α-β classes     β-α classes     all classes
              super   sub     super   sub     super   sub     super   sub     super   sub
    1         8714    12290   9101    17513   7747    11896   7747    12136   33309   53835
    5         395     233     621     438     400     257     507     308     1923    1236
    10        135     77      231     171     149     103     190     100     705     451
    20        58      30      92      70      79      54      64      39      293     193
    50        22      9       34      33      37      23      26      15      119     80
    100       8       5       20      20      18      14      13      4       59      43

    Table 2.2: Classification of loops within Loop BriX. Number of super- and subclasses as a function of their minimum class content (# Loops) in Loop BriX.

    We then examined the results of our loop classification scheme, looking at the percentage of loops we were able to classify. At the super class level our approach classified almost 90% of 6-residue loops and 45% of 14-residue loops, while the success of sub-clustering into equally sized groups decreased more rapidly. We found that the sub-classification was successful for fragments up to length 16, after which no regular loop patterns could be identified.

    2.2.5 Applications of the BriX database

    The first version of the BriX database has already inspired many applications in the fields of structural biology and protein design. Baeten et al. (2008) showed that proteins from the widely used Park & Levitt set could be reconstructed using BriX fragments to a global accuracy of 0.48 Å RMSD, improving on existing results obtained with more
  • 72.
    limited structural alphabets. Demon et al. (2009) used BriX database fragments in combination with the FoldX protein design algorithm (Schymkowitz et al., 2005) to construct a model of murine caspase 3 and 7 in complex with substrate peptides. These models were subsequently used to explain experimentally observed differences in substrate specificity between caspase 3 and 7. In other recent work, we have shown that the structural space of protein-peptide interactions can be approximated using fragments from the BriX database (Vanhee et al., 2009) (Chapter 5). The interfaces of over 300 protein-peptide complexes from the PepX database (Vanhee et al., 2010) (Chapter 4) were reconstructed to within 1 Å RMSD, using observed fragment interactions to reconstruct the binding modes. The sheer size of the database allowed us to extract structural knowledge on protein-peptide interactions.

    Until now, all of these services have been limited to internal use of the database. With the updated version of the BriX and Loop BriX databases, the website and the addition of the covering and bridging algorithms (Section 2.3.3), we open up the BriX database to the scientific community at large.

    2.3 Database access

    2.3.1 Database availability

    The BriX and Loop BriX databases are accessible through a web portal at http://brix.crg.es. The portal is built on the open-source Drupal Content Management System for full flexibility (http://drupal.org). The entire database with annotations is available for download in SQL format, describing the relations between classes and fragments. As an additional service for automated high-throughput querying, all information contained within the BriX and Loop BriX databases can be downloaded as CSV (comma-separated values) lists. For example, requesting the URL http://brix.crg.es/classes?Length=10&Structure=HHHHHHHHHH returns a CSV file containing BriX classes of length 10 with an α-helical structure. Finally, BriX will be updated automatically when new versions of the ASTRAL sets become available.
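For automated querying, the CSV interface could be driven along the following lines. This is a minimal sketch that only builds the query URL quoted in the text and parses CSV text that has already been downloaded; no request is sent. The helper names are illustrative, and because the column layout of the returned file is not specified here, the columns used in the usage note are purely hypothetical.

```python
import csv
import io
from urllib.parse import urlencode

# Endpoint and parameter names as quoted in the text above.
BRIX_CLASSES_URL = "http://brix.crg.es/classes"


def build_class_query(length, structure):
    """Build the query URL for BriX classes of a given length and DSSP string."""
    query = urlencode({"Length": length, "Structure": structure})
    return BRIX_CLASSES_URL + "?" + query


def parse_classes(csv_text):
    """Parse downloaded CSV text into one dictionary per row, keyed by the header."""
    return list(csv.DictReader(io.StringIO(csv_text)))
```

For example, `build_class_query(10, "H" * 10)` reproduces the URL given above, and `parse_classes("class_id,size\n1,250\n")` (with made-up column names) yields a single row that can be filtered further in a script.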
  • 73.
    2.3.2 User Interface

    A user-friendly browsing interface is available on the website http://brix.crg.es (Figure 2.8A). BriX contains two levels: the class level and the fragment level (Figure 2.8C). Classes can be sorted and filtered on (1) class size, (2) fragment length (from 4 to 14 residues), (3) clustering threshold describing the compactness of the classes, (4) minimum and maximum percentage of helix, loop, sheet and turn content and (5) regular expressions of the amino acid sequence and secondary structure as determined by DSSP (Kabsch & Sander, 1983) (Figure 2.8B). For each BriX class, we generated images of the superposed fragments using Chimera (Pettersen et al., 2004) and logos of the sequence and structure distributions using Weblogo (Crooks et al., 2004). Subsequently, the fragments of each class can be filtered on PDB ID (Berman et al., 2000), sequence or secondary structure. Loop BriX contains three levels: (1) the superclass level with fragments of similar end-to-end distance and matching end residues, (2) the subclass level with fragments of similar backbone patterns and length and, finally, (3) the fragment level (Figure 2.8D). The Loop BriX superclasses and subclasses can be queried with the same parameters as the BriX database, plus end-to-end distance.

    2.3.3 Covering or bridging of protein structures

    To explore the vast size of the database we provide two algorithms to query BriX and Loop BriX with a user-submitted structure: ‘covering’ and ‘bridging’¹. The covering algorithm covers backbone coordinates of the input structure with similar BriX classes. The bridging algorithm spans the distance between any pair of anchoring residues, regardless of the backbone coordinates in between them. This is extremely useful to derive plausible loop conformations where backbone coordinates are not present or poorly defined.
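The difference between the two queries can be made concrete with a small sketch. It is not the code behind the web service: it assumes that each class is represented by one representative fragment, that fragments are lists of C-alpha coordinates, and it uses a distance-matrix RMSD (dRMSD) in place of a superposition RMSD so that only the standard library is needed. All names and thresholds are illustrative.

```python
import math


def drmsd(a, b):
    """Distance-matrix RMSD of two equal-length fragments (needs no superposition)."""
    n = len(a)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


def cover(query, classes, threshold):
    """Covering: return ids of classes whose representative resembles the query.

    `classes` maps a class id to a representative fragment; only representatives
    with the same number of residues as the query are compared.
    """
    return [
        class_id
        for class_id, representative in classes.items()
        if len(representative) == len(query)
        and drmsd(query, representative) <= threshold
    ]


def bridge(anchor_start, anchor_end, loops, tolerance):
    """Bridging: return loops whose end-to-end distance matches the anchor gap.

    The coordinates between the anchors play no role, so the query region may
    be missing from the input structure altogether.
    """
    gap = math.dist(anchor_start, anchor_end)
    return [
        loop for loop in loops
        if abs(math.dist(loop[0], loop[-1]) - gap) <= tolerance
    ]
```

Covering therefore needs the full backbone of the query region, whereas bridging needs only the two anchor positions.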
    ¹ For an intuitive explanation of the algorithms, two videos are provided to demonstrate the covering and bridging of an example protein. They can be accessed online at http://brix.crg.es/content/help.

    In Figure 2.9A, we show the application of the covering algorithm to a PDZ domain (PDB ID 2WL7), covering a part of the β-strand with classes from the BriX database. Residues 112 to 116 are selected for covering. The algorithm
  • 74.
    then matches the selected region to the BriX classes by calculating the distance to each class centroid. Here, the user can select the class threshold, which defines the compactness of the classes (0.6 Å in this example).

    [Figure 2.8 image: website screenshots, panels A to D; panel C shows the levels Class and Fragment, panel D the levels Superclass, Subclass and Fragment.]

    Figure 2.8: The BriX website (http://brix.crg.es). (A) An overview of the class level with secondary structure content and sequence and structure logos per entry. (B) A panel on the class level where the user can filter on length, threshold, sequence and secondary structure content. Similar panels are implemented at every level of the class hierarchy. (C) BriX contains two levels: the class level and the fragment level. (D) Loop BriX contains three levels: the superclass, the subclass and the fragment level.

    Fragments are returned for every class having a centroid close enough to the query fragment. The user can also select the maximum number of fragments per class and the total minimum and maximum number of fragments (between 1 and 1000); superposition thresholds are adapted accordingly. In the case of the β-strand of the PDZ domain, over 3000 fragments superposing within 0.6 Å are matched, of which 1000 are returned to the user as a set of downloadable fragment PDB files. Moreover, the service provides a snapshot of
  • 75.
    these fragments superposed on the query PDB as well as logos depicting sequence and structure propensities of the matched fragments, useful to derive sequence or structure relationships.

    [Figure 2.9 image: (A) PDB 2WL7 with covering fragments of length 5 starting at residue A112, with sequence and secondary structure logos. (B) PDB 2WL7 without loop, anchors A104 and A112 at 12.7 Å, with sequence and secondary structure logos.]

    Figure 2.9: BriX applications: ‘covering’ and ‘bridging’. (A) Covering: an input PDZ structure (PDB: 2WL7) is shown for which the algorithm finds matching structural fragments for the β-strand (red). The algorithm returns a set of protein fragment structures (green) superposed on the β-strand, together with structure and sequence logos. (B) Bridging: the same PDB structure (PDB: 2WL7), now with a missing loop. The algorithm finds loop fragments that match the regular anchor residues and span the loop with the same end-to-end distance (green).

    Finally, the set of matching classes and fragments can be
  • 76.
    further inspected online using the previously described search interface.

    The bridging algorithm works in a similar fashion. To illustrate this, we removed a loop of the same PDZ domain from the input structure (Figure 2.9B); this loop is involved in binding the peptide ligand of the domain. The loop is anchored by residue 104 on the left and residue 112 on the right, spanning a gap with an end-to-end distance of 12.7 Å. The algorithm reconstructs a backbone between the two anchor residues with fragments from the Loop BriX database. As one might expect, the results contain loops from PDZ domains (for example, PDB ID 1WIF), but also loops derived from proteins with unrelated SCOP classes.

    Given the vastness of our database, calculations can be demanding. We allocated a dedicated cluster (40 nodes) that runs the algorithms independently of the web server.

    2.4 Discussion

    To our knowledge, BriX is the most extensive alphabet of protein fragments publicly available. As shown by Baeten et al. (2008) and recently confirmed by Le et al. (2009), BriX reaches an accuracy of 0.48 Å RMSD in reconstructing an extensive set of proteins. With a 7-fold increase in the number of classified fragments, we expect that this accuracy will improve even further (∼0 Å), including for regions that are typically regarded as ‘unstructured’ and thus difficult to reconstruct with unrelated fragments, such as protein loops. Protein fragments, however, have limited meaning in isolation, and in Chapter 5 we introduce the concept of ‘fragment context’ by mining the database for fragment interactions, stored in the InteraX database. While most protein fragment libraries are used to improve structure alignment or deduce sequence-structure relationships through classic homology modeling (Le et al., 2009), we use BriX for full-atom reconstruction of entire protein structures (Baeten et al., 2008), loop reconstruction (Chapter 3) or the prediction of peptide structure (Chapter 6).
  • 77.
    Author Contributions

    F.R., J.S. and L.S. conceptualized the study. L.B. developed the first version of the BriX database (Baeten et al., 2008). L.B., P.V. and E.V. developed the second version of the BriX database. E.V., P.V. and L.B. performed the analysis. P.V. and E.V. developed the covering and bridging algorithms. P.V. developed the website. P.V. and E.V. wrote the paper.
  • 78.
    References

    Ananthalakshmi, P., Kumar, C.K., Jeyasimhan, M., Sumathi, K. & Sekar, K. (2005). Fragment finder: a web-based software to identify similar three-dimensional structural motif. Nucleic Acids Research, 33, W85–8.
    Arnold, K., Kiefer, F., Kopp, J., Battey, J.N.D., Podvinec, M., Westbrook, J.D., Berman, H.M., Bordoli, L. & Schwede, T. (2009). The protein model portal. J Struct Funct Genomics, 10, 1–8.
    Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the brix collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
    Baker, D. & Sali, A. (2001). Protein structure prediction and structural genomics. Science, 294, 93–6.
    Berkholz, D.S., Krenesky, P.B., Davidson, J.R. & Karplus, P.A. (2010). Protein geometry database: a flexible engine to explore backbone conformations and their relationships to covalent geometry. Nucleic Acids Research, 38, D320–5.
    Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The protein data bank. Nucleic Acids Research, 28, 235–42.
    Bornot, A., Etchebest, C. & de Brevern, A.G. (2009). A new prediction strategy for long local protein structures using an original description. Proteins, 76, 570–87.
    Budowski-Tal, I., Nov, Y. & Kolodny, R. (2010). Fragbag, an accurate representation of protein structure, retrieves structural neighbors from the entire pdb quickly and accurately. Proceedings of the National Academy of Sciences, 107, 3481–6.
    Burke, D.F. & Deane, C.M. (2001). Improved protein loop prediction from sequence alone. Protein Eng, 14, 473–8.
    Bystroff, C. & Shao, Y. (2002). Fully automated ab initio protein structure prediction using i-sites, hmmstr and rosetta. Bioinformatics, 18 Suppl 1, S54–61.
    Chandonia, J.M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M. & Brenner, S.E. (2004). The astral compendium in 2004. Nucleic Acids Research, 32, D189–92.
    Choi, Y. & Deane, C.M. (2010). Fread revisited: Accurate loop structure prediction using a database search algorithm. Proteins, 78, 1431–40.
    Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. (2004). Weblogo: a sequence logo generator. Genome Res, 14, 1188–90.
    Demon, D., Damme, P.V., Berghe, T.V., Deceuninck, A., Durme, J.V., Verspurten, J., Helsens, K., Impens, F., Wejda, M., Schymkowitz, J., Rousseau, F., Madder, A., Vandekerckhove, J., Declercq, W., Gevaert, K. & Vandenabeele, P. (2009). Proteome-wide substrate analysis indicates substrate exclusion as a mechanism to generate caspase-7 versus caspase-3 specificity. Mol Cell Proteomics, 8, 2700–14.
    Donate, L.E., Rufino, S.D., Canard, L.H. & Blundell, T.L. (1996). Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci, 5, 2600–16.
    Espadaler, J., Fernandez-Fuentes, N., Hermoso, A., Querol, E., Aviles, F.X., Sternberg, M.J.E. & Oliva, B. (2004). Archdb: automated protein loop classification as a tool for structural genomics. Nucleic Acids Research, 32, D185–8.
    Fernandez-Fuentes, N., Zhai, J. & Fiser, A. (2006). Archpred: a template based loop structure prediction server. Nucleic Acids Research, 34, W173–6.
  • 79.
    Fitzkee, N.C., Fleming, P.J., Gong, H., Panasik, N., Street, T.O. & Rose, G.D. (2005). Are proteins made from a limited parts list? TRENDS in Biochemical Sciences, 30, 73–80.
    Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577–637.
    Kifer, I., Nussinov, R. & Wolfson, H.J. (2008). Constructing templates for protein structure prediction by simulation of protein folding pathways. Proteins, 73, 380–94.
    Kolodny, R. & Levitt, M. (2003). Protein decoy assembly using short fragments under geometric constraints. Biopolymers, 68, 278–85.
    Kolodny, R., Koehl, P., Guibas, L. & Levitt, M. (2002). Small libraries of protein fragments model native protein structures accurately. Journal of Molecular Biology, 323, 297–307.
    Kopp, J., Bordoli, L., Battey, J.N.D., Kiefer, F. & Schwede, T. (2007). Assessment of casp7 predictions for template-based modeling targets. Proteins, 69 Suppl 8, 38–56.
    Le, Q., Pollastri, G. & Koehl, P. (2009). Structural alphabets for protein structure classification: a comparison study. Journal of Molecular Biology, 387, 431–50.
    Levitt, M. (2009). Nature of the protein universe. Proceedings of the National Academy of Sciences of the United States of America, 106, 11079–84.
    Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995). Scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
    Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B. & Thornton, J.M. (1997). Cath–a hierarchic classification of protein domain structures. Structure, 5, 1093–108.
    Pandini, A., Fornili, A. & Kleinjung, J. (2010). Structural alphabets derived from attractors in conformational space. BMC Bioinformatics, 11, 97.
    Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C. & Ferrin, T.E. (2004). Ucsf chimera–a visualization system for exploratory research and analysis. J Comput Chem, 25, 1605–12.
    Qian, B., Raman, S., Das, R., Bradley, P., McCoy, A.J., Read, R.J. & Baker, D. (2007). High-resolution structure prediction and the crystallographic phase problem. Nature, 450, 259–64.
    Samson, A.O. & Levitt, M. (2009). Protein segment finder: an online search engine for segment motifs in the pdb. Nucleic Acids Research, 37, D224–8.
    Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The foldx web server: an online force field. Nucleic Acids Research, 33, W382–8.
    Simons, K.T., Kooperberg, C., Huang, E. & Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol, 268, 209–25.
    Vanhee, P., Stricher, F., Baeten, L., Verschueren, E., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2009). Protein-peptide interactions adopt the same structural motifs as monomeric protein folds. Structure, 17, 1128–1136.
    Vanhee, P., Reumers, J., Stricher, F., Baeten, L., Serrano, L., Schymkowitz, J. & Rousseau, F. (2010). Pepx: a structural database of non-redundant protein-peptide complexes. Nucleic Acids Research, 38, D545–51.
    Vriend, G. (1990). What if: a molecular modeling and drug design program. Journal of Molecular Graphics, 8, 52–6, 29.
  • 81.
    3 Predicting loop structure

    This chapter is based on: Fast and accurate prediction of protein loop structure and dynamics. Peter Vanhee*, Joost Van Durme*, Lies Baeten*, Erik Verschueren, Frederic Rousseau, Joost Schymkowitz, Francois Stricher and Luis Serrano.¹ In review, February 2011.

    Protein loops play important roles in binding and catalysis but, being the most variable protein elements, they are notoriously difficult to predict. In accordance with the idea behind BriX, that protein structures can be represented with protein building blocks (Chapter 2), we show that even though loops are typically regarded as ‘unstructured regions’, loop structures up to a certain length are recurrent too, regardless of sequence identity. We report the loop structure prediction algorithm LoopX (http://loopx.crg.es), which combines a 100-fold speed increase over state-of-the-art methods (from days to hours) with excellent prediction accuracy and coverage for loops up to 12 residues. Moreover, we demonstrate that LoopX can be used to model the conformational ensemble adopted by protein loops upon ligand binding.

    ¹ Peter Vanhee, Joost Van Durme and Lies Baeten are joint first authors.
  • 82.
    3. PREDICTING LOOP STRUCTURE

    3.1 Introduction

    Protein loops often represent functional entities that exploit their intrinsic conformational flexibility to perform a multitude of tasks critical to protein function and specificity (Fetrow, 1995; Jones et al., 1998; Todd et al., 2001). For example, the complementarity-determining regions (CDR loops) of antibodies are responsible for recognition and binding of antigen epitopes (Rini et al., 1992). Loops also play key roles in binding a variety of ligands such as metal ions (Lu & Valentine, 1997), ATP (Prodromou et al., 1997), calcium (Strynadka & James, 1989) and DNA (Redondo et al., 2008). Furthermore, loops are able to regulate peptide binding by governing access to domain pockets, as was recently shown for SH2 domains, which mediate protein-protein interactions through binding of phosphotyrosine-containing sequences (Kaneko et al., 2010). Loops have also been the subject of successful design studies to construct novel enzymatic functionality (Jiang et al., 2008; Siegel et al., 2010) or alter protein stability (van der Sloot et al., 2009).

    Loops are typically defined as irregular regions flanked by regular secondary structure elements. In both X-ray and NMR structural models these regions are often the least well defined. Because of high sequence and structural variability, loop structure prediction is one of the most challenging tasks in homology modeling and protein design. Conventional homology modeling generally produces poor approximations of loop structures, and crystallographic artefacts such as crystal contacts, together with high loop mobility, further complicate the prediction of loop regions (Benner et al., 1993; Burke et al., 2000; Lawson & Wheatley, 2004; Heuser et al., 2004). From this perspective, loop modeling can be seen as a ‘mini-folding’ problem, as the correct conformation has to be derived primarily from sequence information (Fiser et al., 2000).

    Existing approaches to tackle this problem can be divided into two categories: ab initio and knowledge-based methods. Ab initio methods rely to a great extent on their scoring functions, which are based on molecular dynamics (MD) and energy minimization, to sample a large number of loop conformations (Rapp & Friesner, 1999; Jacobson et al., 2004; Spassov et al., 2008; Felts et al., 2008; Xiang et al., 2002). The final prediction is the solution with the lowest calculated energy.
  • 83.
    Alternatively, knowledge-based methods attempt to find a loop segment of a protein with known three-dimensional structure that fits the stem regions of a target loop. The basis of this method is the mining of loop databases built from existing Protein Data Bank (PDB) structures (Donate et al., 1996; Oliva et al., 1997; Burke et al., 2000; Espadaler et al., 2004; Fernandez-Fuentes et al., 2006; Vanhee et al., 2011) (Chapter 2). Typically, the database is searched for templates, followed by evaluation and filtering of possible candidates using empirical scoring potentials. The remaining loop candidates are then ranked according to sequence identity or geometric criteria (Wojcik et al., 1999; Heuser et al., 2004).

    Figure 3.1: Predicted loop structure with LoopX. Loop reconstructions with a maximum of 35% loop sequence identity. The structures depict the best loop reconstruction ranked first by LoopX for all loop lengths from 4 to 12. The crystallographic loop (light red) is compared by structural superposition to the reconstructed loop (dark red) and the in-picture text denotes the loop length, original PDB identifier and RMSD prediction accuracy.

    Predicting loops of 12 amino acids and longer often presents severe problems for both types of methods. Ab initio methods have been significantly improved to find loop predictions close to the real solution (Mandell et al., 2009), albeit at the
  • 84.
    cost of a significant increase in computation time, which grows exponentially with loop length. Knowledge-based methods allow for fast searches, but they are mainly limited by the completeness of the loop database (Du et al., 2003) and the inability to select near-native loops based on sequence homology alone (Fernandez-Fuentes & Fiser, 2006). However, the rapid expansion of the PDB results in a dense coverage of loop conformational fragments, and knowledge-based methods can exploit this expansion for more accurate loop structure prediction (Berman et al., 2000; Levitt, 2007). In contrast to ab initio methods, knowledge-based methods have the advantage of using frequently occurring, energetically favorable conformations of native structures. In this regard, given a highly complete database and shorter loops (fewer than 12 residues), their performance further depends on the reliability of the scoring function used to evaluate and rank the loop candidates.

    Here we present LoopX, a loop structure prediction algorithm that combines a fast database search with an all-atom side chain reconstruction (Figure 3.11). LoopX follows a novel approach in predicting loop structure: the method combines the power of both knowledge-based and ab initio methods. LoopX makes use of the Loop BriX database of protein loop fragments (Vanhee et al. (2011) and Chapter 2) in combination with the FoldX side chain reconstruction algorithm (Schymkowitz et al., 2005), and applies a series of filters to select feasible loop candidates. An additional, computationally more expensive step introduces backbone variation to optimize loop placement using the BriX database (Vanhee et al., 2011). We perform an exhaustive study of loop prediction performance (measured by the RMSD against the crystallographic loop) on five different datasets, comparing the method against five state-of-the-art loop reconstruction methods.

    3.2 Results

    3.2.1 Comparison with the state-of-the-art loop reconstruction algorithm

    Recently, Mandell et al. (2009) reached sub-angstrom accuracy in loop reconstruction on a set of 12-residue loops using a method called kinematic closure (KIC), which takes inspiration from the robotics field. This huge step forward
  • 85.
    in the loop reconstruction field comes at the price of a significant amount of computation time. As a result, whilst the methodology can be used to predict loop structures with high accuracy, it is not practical for protein design, where many loops of different lengths and sequences need to be explored. Here we show that LoopX reaches a similar accuracy but carries out the reconstruction considerably faster (from days to minutes), creating a wide range of possibilities for large-scale loop sampling analysis.

    [Figure 3.2 plot: box plots of r.m.s.d. to the crystallographic loop, 0 to 7 (Y axis), in four panels: Dataset 1: 23 12-residue loops; Dataset 2: 20 12-residue loops; Dataset 3: 18 interface loops; Dataset 4: 270 (4-12)-residue loops. Methods shown: LoopX, KIC, FREAD, MODELLER, RAPPER, PLOP.]

    Figure 3.2: Accuracy of LoopX versus state-of-the-art loop prediction methods on datasets 1, 2, 3 and 4. Box plot comparisons of the LoopX algorithm (blue) and the KIC protocol (red) on datasets 1, 2 and 3; comparison of the LoopX algorithm (blue) and the algorithms FREAD (green), MODELLER (purple), RAPPER (orange) and PLOP (cyan) on dataset 4. Boxes span the interquartile range (IQR, 25th-75th percentiles), black lines represent the median, whiskers extend to the furthest values within 0.8 times the IQR, and open circles are outliers.

    We analyzed three 12-residue loop sets used by Mandell et al. (2009) and compared KIC results to LoopX predictions (Figure 3.2). Prediction coverage of LoopX, i.e. the number of structures for which LoopX makes a prediction, was 18/23 (Figure 3.3), 17/20 (Figure 3.4) and 17/18 (Figure 3.5), and the fractions of loops for which LoopX improved on the KIC RMSD were 12/25, 8/20 and 8/18 for datasets 1, 2 and 3, respectively. The box plots in Figure 3.2 show that LoopX, with the exception of dataset 2, compares well to the results obtained by KIC. The substantial increase in loop reconstruction speed of LoopX (∼5-60 minutes per loop) compared to KIC (∼320 hours per loop) enables LoopX to be used for high-accuracy and high-throughput loop reconstruction and loop sampling projects in a moderate timeframe.
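To put these figures into proportion, the short calculation below converts the coverage counts into percentages and the quoted run times into a speed-up range. It uses only the numbers stated in the paragraph above; the function and variable names are illustrative.

```python
# Loops predicted by LoopX out of the loops in each dataset, as quoted above.
PREDICTED = {"dataset 1": (18, 23), "dataset 2": (17, 20), "dataset 3": (17, 18)}


def coverage_percent(counts):
    """Percentage of loops per dataset for which a prediction was made."""
    return {name: 100.0 * done / total for name, (done, total) in counts.items()}


def speedup_range(slow_hours, fast_minutes_min, fast_minutes_max):
    """Lowest and highest speed-up factor for a run-time range given in minutes."""
    slow_minutes = slow_hours * 60
    return slow_minutes / fast_minutes_max, slow_minutes / fast_minutes_min


coverage = coverage_percent(PREDICTED)
low, high = speedup_range(320, 5, 60)
```

With the quoted values, coverage is roughly 78%, 85% and 94% for datasets 1, 2 and 3, and the per-loop speed-up over KIC lies between 320-fold (60 minutes per loop) and 3840-fold (5 minutes per loop).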
  • 86.
    [Figure 3.3 plot: RMSD to the crystallographic loop, 0 to 7 (Y axis), per PDB structure and loop region of dataset 1 of Mandell et al. (X axis); series for LoopX, Rosetta and KIC.]

    Figure 3.3: Dataset 1: comparison of LoopX with Rosetta and KIC. Rosetta (Wang et al., 2007) was reevaluated by Mandell et al. (2009), see Supp. Mat. KIC values were taken from Mandell et al. (2009). Values are in Angstrom and represent the RMSD between predicted loop and crystallographic loop. The X-axis shows the PDB structures together with the predicted loop region. Loops are sorted by LoopX accuracy from left to right; we predicted 18/23 loops.

    3.2.2 Loop homology is no prerequisite for loop reconstruction accuracy

    The commonly cited limitation of database-search based loop reconstruction algorithms is that a high degree of sequence identity between the query loop and the database loops is required to achieve both high accuracy and coverage. This is a valid argument when using small databases or sub-optimal scoring methods, although it has never been formally demonstrated.

    We tested LoopX on dataset 4, which contains 527 loops extracted from 204 unrelated structures (less than 35% sequence identity) (Section 3.3.3). The sampling power of the algorithm was evaluated by selecting the best prediction in the top ten for each loop reconstruction. Averaged per length, we achieve an accuracy below 1 Å for lengths 4-6, of ∼1 Å for length 7 and below 2 Å for lengths 8-12 (Figure 3.6).
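The evaluation protocol used here (best prediction within the top ten, averaged per loop length) can be written down in a few lines. The sketch below assumes that, for every target loop, the RMSD values of the candidates are available in ranking order; the data layout and the function names are illustrative and are not taken from LoopX.

```python
from collections import defaultdict
from statistics import mean


def best_in_top(ranked_rmsds, n=10):
    """Lowest RMSD among the n highest-ranked candidates of one loop."""
    return min(ranked_rmsds[:n])


def mean_accuracy_per_length(predictions, n=10):
    """Average the best-in-top-n RMSD over all loops of the same length.

    `predictions` is an iterable of (loop_length, ranked_rmsds) pairs; loops
    without any candidate are skipped, since they count against coverage
    rather than accuracy.
    """
    per_length = defaultdict(list)
    for length, ranked_rmsds in predictions:
        if ranked_rmsds:
            per_length[length].append(best_in_top(ranked_rmsds, n))
    return {length: mean(values) for length, values in sorted(per_length.items())}
```

Note that this measures sampling power rather than ranking power: a near-native candidate ranked tenth counts as much as one ranked first, while a near-native candidate ranked eleventh is ignored.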
[Figure 3.4: bar chart "Dataset 2, Mandell et al."; x-axis: PDB structure and predicted loop region; y-axis: RMSD to the crystallographic loop (0-9 Å); series: LoopX, Rosetta, KIC.]

Figure 3.4: Dataset 2: comparison of LoopX with Rosetta and KIC. LoopX predictions on dataset 2 (Wang et al., 2007; Mandell et al., 2009). Rosetta and KIC values are taken from Mandell et al. (2009). Values are in Angstrom and represent the RMSD between predicted loop and crystallographic loop. The X-axis shows the PDB structures together with the predicted loop region. Sorted by LoopX accuracy from left to right, we predicted 17/20 loops.

To evaluate the dependence of the algorithm on loop sequence identity, we ran three additional predictions on the same set, discarding candidate loops with sequence identity above 85%, 50% and 35%. For lengths 4-5, sub-angstrom accuracy was attained over all identity cut-offs. Accuracy for lengths 6 and 7 varies over all cut-offs between 1 Å in the lower range and 2 Å in the upper range. Lengths 8, 9 and 12 stay below 3 Å, whereas accuracy for lengths 10 and 11 deteriorates to 4 Å and 5 Å, respectively.

The finding that the algorithm maintains an excellent prediction accuracy for loops up to length 7 even at low sequence identity (<35%) (Figure 3.6 and 3.1) can be attributed to the ranking power of the FoldX free energies and the completeness of the Loop BriX database. From our experiments we hypothesize that the structural space of loops is saturated for lower loop lengths (≤ 7 residues) and near saturated for medium loop lengths (≤ 12 residues). Hence, there appears
to be no need to employ computationally expensive ab initio methods at these loop lengths, unless combined with database methods.

[Figure 3.5: bar chart "Dataset 3, Mandell et al."; x-axis: PDB structure and predicted loop region; y-axis: RMSD to the crystallographic loop (0-12 Å); series: LoopX, KIC.]

Figure 3.5: Dataset 3: comparison of LoopX with KIC. Dataset 3 contains loops from 4 different proteins crystallized with 18 different partner proteins. KIC values are taken from Mandell et al. (2009). Values are in Angstrom and represent the RMSD between predicted loop and crystallographic loop. The X-axis shows the PDB structures together with the predicted loop region. Sorted by LoopX accuracy from left to right, we predicted 17/18 loops.

3.2.3 Loop ensemble prediction

A practical and more challenging application of loop reconstruction algorithms is the prediction of the conformational ensemble adopted by loops, for example when they are involved in ligand binding. We assess the ability of LoopX to predict the movement of the carboxylate-binding loop of the PDZ domain, a key regulatory domain in cell signaling pathways (Nourry et al., 2003). The carboxylate-binding loop is involved in the selection of C-terminal versus internal ligand binding (Penkert et al., 2004), and predicting the conformational ensemble could help in the structural
elucidation of PDZ peptide recognition (see footnote 1). We selected 10 high-resolution PDZ structures whose carboxylate-binding loops differ by up to 4.5 Å between backbone atoms (Section 3.3.3, Figure 3.7 and Figure 3.8). We introduced backbone variation using the BriX database to optimize loop placement (Section 3.3.1). Starting from a canonical, closed-loop conformation (PDB 2I1N), LoopX generates a loop ensemble that covers every crystal structure with sub-angstrom or near-angstrom accuracy (0.35-1.07 Å), including the open loop conformation (PDB 2WL7, 0.68 Å) (Figure 3.9).

[Figure 3.6: plot titled "Loop Homology"; x-axis: loop length (4-12); y-axis: average RMSD to the crystallographic loop (0-6 Å); series: All, < 85%, < 50%, < 35%.]

Figure 3.6: Influence of loop homology in LoopX. Loop reconstruction accuracy measured at different loop lengths and at multiple loop sequence identity thresholds, for dataset 4 containing 210 loops. Candidate loops were discarded if the percentage sequence identity was higher than the given threshold, except in the case of ‘all’ where nothing was discarded.

Footnote 1: In Chapter 6 we provide a thorough case study of PDZ-peptide interactions and the application of the BriX design protocol to model peptide specificity from the structure of the PDZ domain alone.

In a similar experiment, LoopX predicts DNA binding-induced loop movements of meganuclease to ∼0.50 Å accuracy (data not shown). LoopX is thus able to
model the conformational loop ensemble of the ligand-induced loop movements as observed from crystal structures in a reasonable timeframe.

[Figure 3.7: (A) structure overlay with labels: carboxylate-binding loop, PSD93 PDZ-1 (PDB 2WL7), DLG3 PDZ-1 (PDB 2I1N), peptide binding pocket. (B) two sequence logos: loop sequence of the PDZ ensemble (10 structures) and loop consensus sequence of the PDZ ensemble.]

Figure 3.7: Conformational ensemble adopted by the PDZ carboxylate-binding loop. (A) 10 X-ray structures with loop movement (maximum difference between backbone atoms: 4.5 Å). (B) Sequence diversity of 10 selected PDZ domains (top) and consensus sequence used to build loops with LoopX (bottom).

3.2.4 Comparison with MODELLER, RAPPER, PLOP and FREAD

LoopX was compared to the ab initio methods MODELLER (Fiser et al., 2000), RAPPER (de Bakker et al., 2003) and PLOP (Jacobson et al., 2004), and to the database search method FREAD (Choi & Deane, 2010); for a review of these methods, we refer to Choi & Deane (2010). We show that LoopX outperforms these methods on overall coverage and accuracy on a 510-loop set ranging from 4 to 20 residues (dataset 5, Section 3.3.3).

When compared to MODELLER, RAPPER, PLOP and FREAD, LoopX produces the best top global RMSDs for 14 of the 17 loop lengths on dataset 5 (Figure 3.10). At lengths 12 and 18, FREAD achieves a slightly better RMSD coverage. For length 20, FREAD performs better on average RMSD on this set.

[Figure 3.8: heat map with dendrogram and colour key; rows and columns: the 10 PDZ structures 1G9O, 2AWW, 2F5Y, 2I1N, 2AWU, 2HE2, 2I04, 2OZF, 2Q9V and 2WL7; cell values: pairwise loop RMSD (Å).]

Figure 3.8: Cross-comparison of backbone distances of the conformational ensemble of the carboxylate-binding loop. All-against-all comparison of the 10 X-ray loops presented in a heat map. A dendrogram shows the clusters of loops that are related in RMSD distance.

The selection of suitable loop hits for loop reconstruction and prediction depends for a large part on the completeness of the database. The coverage of LoopX on dataset 5 is 77.8% (389/510 predicted structures). For loop lengths smaller than 15, near-complete coverage is achieved (>80%). This drops to about a half for lengths 15 to 18 and to a third for lengths 19 and 20. This difference in prediction rates shows that the prediction of long loops using database methods is still difficult owing to the lack of loop templates. In comparison, for ab initio methods,
the prediction of loops with lengths larger than 12 typically poses combinatorial problems. As a result, most ab initio methods only report loop prediction results for loop lengths up to 12 residues (Mandell et al., 2009).

[Figure 3.9: two heat maps (A, B) of RMSD values (Å); x-axis: the 10 X-ray structures (2I1N, 2Q9V, 2I04, 2HE2, 2AWU, 2AWW, 2OZF, 2WL7, 2F5Y, 1G9O), with the LoopX start structure 2I1N marked; y-axis: loop conformations generated from PDB 2I1N, labelled by BriX fragment identifiers.]

Figure 3.9: Reconstruction of PDZ carboxylate-binding loop ensemble. Heat maps of the top reconstruction results (in RMSD versus the X-ray loops) for the PDZ carboxylate-binding loop ensemble. The best results are shown in red, the worst in yellow. The X axis shows the conformational ensemble as observed in the 10 X-ray structures, while the Y axis shows the top loops returned by LoopX. (A) shows the first run of the LoopX algorithm, while (B) shows the results after generating a loop ensemble using BriX, resulting in slightly better results.

One of the most important aspects of a loop prediction algorithm is the ability to rank the best prediction within the top solutions. For the shorter loop lengths from 4 to 7 residues, LoopX ranks the best solution at position 1 in 40 to 57% of the cases. This might appear low at first sight, but the RMSD differences between the given solutions for these loop lengths are so small that the effect on accuracy is minimal. For loop lengths from 8 to 18, the best solution is ranked first in 63 to 88% of the cases. For loop lengths 19 and 20, too few predictions were made to evaluate the ranking ability. The high ranking power can be attributed to the accurate energy calculations by the FoldX force field.

Choi & Deane (2010) describe a dramatic increase in accuracy of FREAD when selecting for loops that are close homologs. We have carried out an analysis with
FREAD using this cut-off and indeed the average top global RMSD of FREAD ranges from 0.49 Å to 1.86 Å, with the exception of 3.34 Å at loop length 12. The drawback of this accuracy is a coverage as low as 33.5% (171/510 predicted structures). This is not surprising, since the new environment substitution score only allows highly similar sequences to pass through the filter. The method thus becomes very dependent on the presence of close homologs of the query structure in the database and therefore more closely resembles a homology modeling technique. LoopX circumvents this problem by initially selecting loops using structural compatibility only; the subsequent construction of side chains and the filtering and ranking of the loop candidates result in a much higher coverage and an accuracy that is, to our knowledge, unprecedented for database-driven loop prediction.

[Figure 3.10: chart; x-axis: loop lengths of dataset 5 (4-20; 510 loops); y-axis: average RMSD to the crystallographic loop (0-14 Å); series: LoopX, FREAD, MODELLER, RAPPER, PLOP.]

Figure 3.10: LoopX prediction accuracy compared with FREAD, MODELLER, RAPPER and PLOP. Loop prediction accuracies (RMSD) on dataset 5 containing 510 loops and organized per loop length. The results for FREAD, MODELLER, RAPPER and PLOP are taken from Choi & Deane (2010).
3.3 Materials and Methods

3.3.1 LoopX Algorithm

[Figure 3.11: schematic with panels A and B; labels: β1 anchor, β2 anchor, anchor pairs (β1/β2, β1/α2, α1/β2, α1/α2), RMSD < 1.5 Å, length 1-3, steric clashes, structure-sequence incompatibility (GLY, PRO), FoldX, min(∆G kcal/mol).]

Figure 3.11: Overview of the LoopX algorithm. (A) All loop classes that have secondary structure anchoring points matching the target structure on both the N- and C-terminal sides are selected from Loop BriX. Then, only loop classes with an RMSD lower than 1.5 Å to the anchoring residues and with a length identical to the query sequence are retained. (B) Loop candidates that have steric (backbone) clashes with the context, or backbone dihedral angles that are incompatible with the target sequence, are discarded as well. Finally, all side chains are modeled with the FoldX force field, loops are ranked by their FoldX stability estimates and local variation is generated by mending short BriX fragments in between the loops and the anchors.

Loop template selection from Loop BriX

Loop backbone templates are selected from the Loop BriX database containing 14,525 protein structures with < 95% sequence identity (Section 2.2.3). Anchor
groups (2 residues: one loop and one non-loop residue) are chosen and fitted to all Loop BriX super classes by superposition on the super class centroids. To speed up this calculation, only loop super classes that have the same secondary structure elements flanking the loop are considered. A super class is accepted as donor of candidate loops when the RMSD of the backbone atoms (N, Cα, C, O) between the four anchor residues of its centroid and the target loop after superposition is < 1.5 Å.

Because prediction coverage and accuracy depend strongly on the choice of anchor residues flanking the loop, and DSSP annotations depend strongly on the structural context, we have added a protocol optimization step in which slightly alternative loop boundaries are set and evaluated. This is accomplished by setting a new start residue upstream and a new end residue downstream, for a total of nine combinations of start and end residues per loop. The alternative loop with the best FoldX energy is then taken as the actual prediction.

Filters

PDB identifier filter

PDB structures that have the same four-letter PDB identifier as the structure of the target loop are discarded.

Backbone entropy filter

Loop candidates originate from different proteins. It is therefore probable that the amino acid sequence of the candidate loop differs from the amino acid sequence of the target loop. Consequently, it is necessary to determine the propensity of the target amino acid sequence to adopt the φ/ψ main chain dihedral conformation of the candidate loop. The FoldX force field is consulted to retrieve the unscaled entropy cost for each φ/ψ pair within the candidate loop structure for the respective amino acid in the target sequence. This unscaled entropy cost corresponds to the statistical preference of a residue to occur in a certain [φ, ψ] region of the Ramachandran plot.
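The boundary optimization step described above (nine start/end combinations, of which the one with the best energy is kept) can be sketched as follows. The text only states that there are nine combinations; the choice of shifting each boundary by 0, 1 or 2 residues is an assumption made for this example, and `energy_of` is a hypothetical caller-supplied function standing in for a full LoopX prediction scored with FoldX.

```python
from itertools import product


def alternative_boundaries(start, end, max_shift=2):
    """Enumerate loop boundaries with the start moved upstream and the end
    moved downstream by 0..max_shift residues (3 x 3 = 9 pairs by default).

    The shift range is an assumption; the source only gives the total of nine.
    """
    shifts = range(max_shift + 1)
    return [(start - up, end + down) for up, down in product(shifts, shifts)]


def best_boundary(start, end, energy_of):
    """Return the (boundary, energy) pair with the lowest energy, or None.

    energy_of is a hypothetical function mapping a (start, end) pair to the
    energy of the best loop predicted for those boundaries, or to None when
    no prediction could be made.
    """
    scored = []
    for boundary in alternative_boundaries(start, end):
        energy = energy_of(boundary)
        if energy is not None:
            scored.append((boundary, energy))
    if not scored:
        return None
    return min(scored, key=lambda item: item[1])
```

Treating unpredicted boundaries as missing rather than as errors mirrors the role of this step in the text, where alternative boundaries are a way to recover coverage as well as accuracy.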
Backbone ω filter

The dihedral angle ω measured over the peptide bond is checked as well. In general, only ω angles with absolute values in the interval [155;180] are allowed. However, if the target amino acid is one of the structure breakers (PRO or GLY), a cis conformation can be present. In this case the heavy
atoms N and O are on the same side of the double bond, causing the chain to bend, and ω angles with an absolute value in the interval [0;25] are allowed as well.

Steric clash filter

The conformational fit of the loop is evaluated within the target environment using a steric clash filter that checks for van der Waals clashes along the main chain atoms (N, Cα, C, O). Loop candidates causing steric clashes with neighbouring residues are discarded.

Sequence homology filter

Using homologous structures to reconstruct loops might improve reconstruction results. Loop reconstruction algorithms that rely on database searches can be at an advantage compared to ab initio methods when the structure of a sequence homolog is available. On the other hand, when no such homolog is available the accuracy might decrease rapidly. In order to evaluate the effect of sequence homology, a sequence homology filter was implemented. The sequence of the target loop is compared with that of the BriX loop. If the sequence homology is ≥ 100%, 85%, 50% or 35% (as indicated), the loop is discarded.

Loop placement

The backbone atoms of the N- and C-anchor residues of the side chain-reconstructed candidate loops are superposed on the equivalent backbone atoms of the N- and C-anchor residues of the input structure.

Sidechain design using FoldX

As a general rule, the filters accept only 1-5% of the initial loop candidates. For this set, all side chains are designed with the target loop sequence extracted from the crystallographic loop. Importantly, no side chain orientations from the BriX loop or the crystallographic loop are used. Side chains are rebuilt using the FoldX command BuildModel (Schymkowitz et al., 2005). BuildModel uses a backbone-dependent rotamer library to determine the most probable side chain conformation. At every mutation step, the algorithm rearranges neighbouring residues, resulting in a local minimum for the sequence.
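Two of the filters above, the backbone ω filter and the sequence homology filter, are simple enough to sketch directly. The following Python fragment is an illustration under stated assumptions, not the LoopX code: angles are assumed to be given in degrees within [-180, 180], the caller decides which target residue an ω angle belongs to, and sequence homology is taken to mean the percentage of identical positions between two loops of equal length (candidate loops have the same length as the query). All function names are invented for the example.

```python
CIS_ALLOWED = {"PRO", "GLY"}  # residues for which the text permits a cis bond


def omega_ok(omega_deg, target_residue):
    """Backbone omega filter: trans (|omega| in [155, 180]) is always allowed,
    cis (|omega| in [0, 25]) only when the target residue is PRO or GLY."""
    magnitude = abs(omega_deg)
    if 155.0 <= magnitude <= 180.0:
        return True
    return magnitude <= 25.0 and target_residue in CIS_ALLOWED


def percent_identity(seq_a, seq_b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def passes_homology_filter(target_seq, candidate_seq, threshold=None):
    """Keep a candidate unless its identity to the target reaches the
    threshold (in percent); threshold=None disables the filter ('all')."""
    if threshold is None:
        return True
    return percent_identity(target_seq, candidate_seq) < threshold
```

For instance, `passes_homology_filter("GSGA", "GAGS", 85)` keeps the candidate (50% identity), while a threshold of 50 discards it, following the ≥ convention stated in the filter description.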
Scoring and ranking

The estimated free energy ∆G of the loop is calculated with the FoldX command Stability, which takes into account the entire protein environment (see also Section 5.3.5). The final set of solutions is then ranked from low to high ∆G.

Loop ensemble generation

Even though a correct loop can be retrieved directly from the database in many cases (Section 3.2), positioning the loop by anchor residue superposition often poses problems for database methods (Choi & Deane, 2010). To remedy this, the algorithm samples the space of naturally occurring fragments anchoring the loop. In this additional step, backbone variation is introduced on the stem residues, allowing for a smoother fitting of the loop in the query structure (Figure 3.11).

The first phase consists of extracting flanking segments of four residues each on both sides of the loop region. The BriX fragment database (Chapter 2) is then queried for protein fragments that are structurally similar to the four-residue segments, using a superposition threshold of 0.8 Å. In the second phase, alternative orientations for the flanking residues are generated. A selected set of non-redundant BriX fragments is superposed on the flanking segment and the backbone coordinates of the original flanking segment are substituted by those of the BriX fragment. In the last phase, the loop candidates are superposed on the adapted flanking segments, effectively generating movements of the loop that fit slightly varied models of the flanking segments. Finally, the coordinates of the stem residues are restored to the original ones and backbone clashes caused by the loop movement are filtered out. As described previously, the solutions in the ensemble are scored and ranked with the FoldX force field.
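The ranking step, which is applied both to the initial solutions and to the generated ensemble, amounts to sorting by estimated free energy. A minimal sketch in Python is given below; the `LoopSolution` record and its field names are assumptions made for illustration, and the ∆G values are expected to come from an external FoldX Stability run that is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopSolution:
    template_id: str  # identifier of the database loop used as template (illustrative)
    delta_g: float    # estimated free energy in kcal/mol (lower is better)


def rank_solutions(solutions, top_n=None):
    """Sort solutions from low to high estimated free energy.

    Ties keep their input order because Python's sort is stable.
    """
    ranked = sorted(solutions, key=lambda s: s.delta_g)
    return ranked if top_n is None else ranked[:top_n]
```

Calling `rank_solutions(solutions, top_n=10)` would yield the kind of top-ten list that the benchmark evaluation in Section 3.2 works from.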
3.3.2 Reconstruction accuracy

The reconstruction accuracy is given by the RMSD between the backbone atoms (N, Cα, C, O) of the predicted loop and the crystallographic loop, excluding the anchor residues. A distinction is made between ‘local’ and ‘global’ RMSD. The local RMSD is calculated by superposing the predicted loop on the native loop regardless of the context; as a result, the local RMSD indicates the quality of the
predicted loop outside the context of the protein. The global RMSD is computed between the predicted loop and the native loop without superposition; the global RMSD is the measure used to evaluate the prediction accuracy.

3.3.3 Benchmark datasets

Three datasets from Mandell et al. (2009)

Dataset 1 was originally compiled by Fiser et al. (2000) and contains 40 12-residue loops. Mandell et al. (2009) analysed this set in order to exclude structures where the loop was too close to ligands and ions, reducing the set with 15 12-residue loops. We eliminated three structures that were incorrectly annotated as loops (PDB 1tca, 3cla and 1541) and for which LoopX could consequently not produce any prediction. The final set contains 23 loop structures (Figure 3.3).

Dataset 2 was originally compiled by Zhu et al. (2006) and contains 20 12-residue loops, selected for high-quality structures (<2 Å), diverse sequences (<40% sequence identity) and lack of secondary structure in the loops, amongst other criteria (Figure 3.4).

Dataset 3 was designed by Mandell et al. (2009) to evaluate loop reconstruction of four different proteins from the PDB crystallized with 18 different partner proteins (Figure 3.5). The nature of the binding partner influences the conformation of the loop, and as such one can study whether the loop reconstruction algorithm can predict these functional changes in loop conformation.

WHAT IF dataset

Dataset 4 is a set of 223 crystallographic PDB chains with an R-factor below 0.19 and a resolution below 1.1 Å, harvested from the WHAT IF website (http://swift.cmbi.ru.nl/gv/select/index.html); the atomic coordinates were downloaded from the PDB (http://www.pdb.org). An additional step was applied to filter this set to a maximum of 35 percent sequence identity over all structures. This resulted in the final set of 210 structures.

Loop segments with minimum length 4 were then extracted by running the DSSP12 secondary structure assignment program and keeping sequence stretches other than helix (H), beta-sheet (E), 3/10
helix (G) and beta bridge (B), but with helix or beta-sheet on both flanks. This resulted in 527 loops.

FREAD dataset

Dataset 5 is a set of 510 loops taken from Choi & Deane (2010) to compare LoopX results with results from FREAD (Choi & Deane, 2010), RAPPER (de Bakker et al., 2003), MODELLER (Fiser et al., 2000) and PLOP (Jacobson et al., 2004). The compilation protocol of this set is explained in detail in Choi & Deane (2010).

PDZ dataset

Dataset 6 is a set of carboxylate-binding loops from 10 high-resolution PDZ structures (Figure 3.7). We selected these PDZ domains based on (1) the structural variety adopted by the loop that is involved in binding peptide ligands and (2) sequence diversity.

3.3.4 LoopX Webserver

LoopX is freely available for academic users as a webserver at http://loopx.crg.es. To illustrate the usage of the website, a video tutorial is viewable online at http://loopx.crg.es/content/help. In the tutorial, we show the reconstruction of the PDZ carboxylate-binding loop for PDB 2I1N.

3.4 Discussion

We demonstrate that it is possible to have both fast and accurate modeling of loops in proteins using a fragment database and an all-atom empirical force field, showing that for loops up to length 12 (95% of all loops in dataset 4, Supplementary Table 9) computationally expensive ab initio methods are not needed. Moreover, we demonstrate that this approach can be used to describe the entire conformational ensemble adopted by protein loops and induced by ligand binding. This new method, which combines speed and accuracy, could be used not only for homology modeling, but also to perform protein design and to study conformational ensembles adopted by protein loops.
As the number of deposited structures in the Protein Data Bank increases, database search methods for loop structure prediction gain in power, and we are in the process of using the entire PDB (approximately 68,000 structures as of September 2010) for fast and accurate loop modeling.

Author Contributions

F.R., J.S., and L.S. conceptualized the study. L.B. developed the first version of the LoopX algorithm and performed preliminary analysis published in her thesis work (Baeten, 2010). P.V. developed the current version of the LoopX algorithm. P.V. developed the web server. J.V.D. and P.V. performed the experiments. J.V.D., P.V., F.S., F.R., J.S. and L.S. performed the analysis. J.V.D., P.V. and E.V. wrote the manuscript.
References

Baeten, L. (2010). Reconstruction of protein structures from polypeptide fragment libraries. PhD Thesis (Free University of Brussels), 1–149.
Benner, S.A., Cohen, M.A. & Gonnet, G.H. (1993). Empirical and structural models for insertions and deletions in the divergent evolution of proteins. Journal of Molecular Biology, 229, 1065–82.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235–42.
Burke, D.F., Deane, C.M. & Blundell, T.L. (2000). Browsing the SLoop database of structurally classified loops connecting elements of protein secondary structure. Bioinformatics, 16, 513–9.
Choi, Y. & Deane, C.M. (2010). FREAD revisited: Accurate loop structure prediction using a database search algorithm. Proteins, 78, 1431–40.
de Bakker, P.I.W., DePristo, M.A., Burke, D.F. & Blundell, T.L. (2003). Ab initio construction of polypeptide fragments: Accuracy of loop decoy discrimination by an all-atom statistical potential and the AMBER force field with the generalized Born solvation model. Proteins, 51, 21–40.
Donate, L.E., Rufino, S.D., Canard, L.H. & Blundell, T.L. (1996). Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci, 5, 2600–16.
Du, P., Andrec, M. & Levy, R.M. (2003). Have we seen all structures corresponding to short protein fragments in the Protein Data Bank? An update. Protein Eng, 16, 407–14.
Espadaler, J., Fernandez-Fuentes, N., Hermoso, A., Querol, E., Aviles, F.X., Sternberg, M.J.E. & Oliva, B. (2004). ArchDB: automated protein loop classification as a tool for structural genomics. Nucleic Acids Research, 32, D185–8.
Felts, A.K., Gallicchio, E., Chekmarev, D., Paris, K.A., Friesner, R.A. & Levy, R.M. (2008). Prediction of protein loop conformations using the AGBNP implicit solvent model and torsion angle sampling. Journal of Chemical Theory and Computation, 4, 855–868.
Fernandez-Fuentes, N. & Fiser, A. (2006). Saturating representation of loop conformational fragments in structure databanks. BMC Struct Biol, 6, 15.
Fernandez-Fuentes, N., Oliva, B. & Fiser, A. (2006). A supersecondary structure library and search algorithm for modeling loops in protein structures. Nucleic Acids Research, 34, 2085–97.
Fetrow, J.S. (1995). Omega loops: nonregular secondary structures significant in protein function and stability. FASEB J, 9, 708–17.
Fiser, A., Do, R.K. & Sali, A. (2000). Modeling of loops in protein structures. Protein Sci, 9, 1753–73.
Heuser, P., Wohlfahrt, G. & Schomburg, D. (2004). Efficient methods for filtering and ranking fragments for the prediction of structurally variable regions in proteins. Proteins, 54, 583–95.
Jacobson, M.P., Pincus, D.L., Rapp, C.S., Day, T.J.F., Honig, B., Shaw, D.E. & Friesner, R.A. (2004). A hierarchical approach to all-atom protein loop prediction. Proteins, 55, 351–67.
Jiang, L., Althoff, E.A., Clemente, F.R., Doyle, L., Rothlisberger, D., Zanghellini, A., Gallaher, J.L., Betker, J.L., Tanaka, F., Barbas, C.F., Hilvert, D., Houk, K.N., Stoddard, B.L. & Baker, D. (2008). De novo computational design of retro-aldol enzymes. Science, 319, 1387–1391.
Jones, E.Y., Tormo, J., Reid, S.W. & Stuart, D.I. (1998). Recognition surfaces of MHC class I. Immunol Rev, 163, 121–8.
Kaneko, T., Huang, H., Zhao, B., Li, L., Liu, H., Voss, C.K., Wu, C., Schiller, M.R. & Li, S.S.C. (2010). Loops govern SH2 domain specificity by controlling access to binding pockets. Science Signaling, 3, ra34.
Lawson, Z. & Wheatley, M. (2004). The third extracellular loop of G-protein-coupled receptors: more than just a linker between two important transmembrane helices. Biochem Soc Trans, 32, 1048–50.
Levitt, M. (2007). Growth of novel protein structural data. Proceedings of the National Academy of Sciences of the United States of America, 104, 3183–8.
Lu, Y. & Valentine, J.S. (1997). Engineering metal-binding sites in proteins. Curr Opin Struct Biol, 7, 495–500.
Mandell, D.J., Coutsias, E.A. & Kortemme, T. (2009). Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nat Methods, 6, 551–2.
Nourry, C., Grant, S.G.N. & Borg, J.P. (2003). PDZ domain proteins: plug and play! Sci STKE, 2003, RE7.
Oliva, B., Bates, P.A., Querol, E., Avilés, F.X. & Sternberg, M.J. (1997). An automated classification of the structure of protein loops. Journal of Molecular Biology, 266, 814–30.
Penkert, R.R., DiVittorio, H.M. & Prehoda, K.E. (2004). Internal recognition through PDZ domain plasticity in the Par-6-Pals1 complex. Nature Structural & Molecular Biology, 11, 1122–7.
Prodromou, C., Roe, S.M., O’Brien, R., Ladbury, J.E., Piper, P.W. & Pearl, L.H. (1997). Identification and structural characterization of the ATP/ADP-binding site in the Hsp90 molecular chaperone. Cell, 90, 65–75.
Rapp, C.S. & Friesner, R.A. (1999). Prediction of loop geometries using a generalized Born model of solvation effects. Proteins, 35, 173–83.
Redondo, P., Prieto, J., Muñoz, I.G., Alibés, A., Stricher, F., Serrano, L., Cabaniols, J.P., Daboussi, F., Arnould, S., Perez, C., Duchateau, P., Paques, F., Blanco, F.J. & Montoya, G. (2008). Molecular basis of xeroderma pigmentosum group C DNA recognition by engineered meganucleases. Nature, 456, 107–11.
Rini, J.M., Schulze-Gahmen, U. & Wilson, I.A. (1992). Structural evidence for induced fit as a mechanism for antibody-antigen recognition. Science, 255, 959–65.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The FoldX web server: an online force field. Nucleic Acids Research, 33, W382–8.
Siegel, J.B., Zanghellini, A., Lovick, H.M., Kiss, G., Lambert, A.R., Clair, J.L.S., Gallaher, J.L., Hilvert, D., Gelb, M.H., Stoddard, B.L., Houk, K.N., Michael, F.E. & Baker, D. (2010). Computational design of an enzyme catalyst for a stereoselective bimolecular Diels-Alder reaction. Science, 329, 309–13.
Spassov, V.Z., Flook, P.K. & Yan, L. (2008). Looper: a molecular mechanics-based algorithm for protein loop prediction. Protein Eng Des Sel, 21, 91–100.
Strynadka, N.C. & James, M.N. (1989). Crystal structures of the helix-loop-helix calcium-binding proteins. Annu Rev Biochem, 58, 951–98.
Todd, A.E., Orengo, C.A. & Thornton, J.M. (2001). Evolution of function in protein superfamilies, from a structural perspective. J Mol Biol, 307, 1113–43.
van der Sloot, A.M., Kiel, C., Serrano, L. & Stricher, F. (2009). Protein design in biological networks: from manipulating the input to modifying the output. Protein Eng Des Sel, 22, 537–42.
Vanhee, P., Verschueren, E., Baeten, L., Stricher, F., Serrano, L., Rousseau, F. & Schymkowitz, J. (2011). BriX: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Research, 39, D435–42.
Wang, C., Bradley, P. & Baker, D. (2007). Protein-protein docking with backbone flexibility. J Mol Biol, 373, 503–19.
Wojcik, J., Mornon, J.P. & Chomilier, J. (1999). New efficient statistical sequence-dependent structure prediction of short to medium-sized protein loops based on an exhaustive loop classification. Journal of Molecular Biology, 289, 1469–90.
Xiang, Z., Soto, C.S. & Honig, B. (2002). Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proceedings of the National Academy of Sciences of the United States of America, 99, 7432–7.
Zhu, K., Pincus, D.L., Zhao, S. & Friesner, R.A. (2006). Long loop prediction using the protein local optimization program. Proteins, 65, 438–52.
4 The structural landscape of protein-peptide interactions

This chapter is based on PepX: a structural database of non-redundant protein-peptide complexes. Peter Vanhee, Joke Reumers, Francois Stricher, Lies Baeten, Luis Serrano, Joost Schymkowitz, Frederic Rousseau. Nucleic Acids Research, January 2010.

Although protein-peptide interactions are estimated to constitute up to 40% of all protein interactions, relatively little information is available on the structural details of these interactions. A reliable data set of non-redundant protein-peptide complexes is indispensable as a basis for modeling and design, but current data sets for protein-peptide interactions are often biased towards specific types of interactions or are limited to interactions with small ligands. We have designed PepX (http://pepx.switchlab.org), an unbiased and exhaustive data set of all protein-peptide complexes available in the Protein Data Bank with peptide lengths up to 35 residues. In addition, these complexes have been clustered based on their binding interfaces rather than sequence homology, providing a set of structurally diverse protein-peptide interactions. The final data set contains 505 unique protein-peptide interface clusters from 1431 complexes. Thorough annotation of each complex with both biological and structural information facilitates searching for and browsing through individual complexes and clusters. Moreover, we provide an additional source of data for peptide design by annotating peptides with naturally occurring backbone variations using fragment clusters from the BriX database.

4.1 Introduction

A growing number of interactions are known to be mediated by short linear peptides (Neduva & Russell, 2006). It is estimated that 15 to 40% of all interactions in the cell are protein-peptide interactions (Neduva et al., 2005; Petsalaki & Russell, 2008), which indicates that a large portion of the proteome is either directly or indirectly involved in peptide-binding events. Peptide-mediated interactions are normally short-lived and are therefore found mostly in signaling and regulatory networks, where a fast response to stimuli is required (Pawson & Nash, 2003). Many databases have been implemented that assemble the sequence patterns involved in such interactions, such as the Eukaryotic Linear Motif (ELM) database (Puntervoll, 2003), PROSITE (Hulo et al., 2006) and SCANSITE (Obenauer et al., 2003).

Unfortunately, the estimated abundance of protein-peptide interactions from the genome is not reflected in the number of available three-dimensional protein-peptide complexes. While many protein-protein and protein-domain interaction databases with structural annotations exist (Gong et al., 2005; Ogmen et al., 2005; Chen et al., 2007; Raghavachari et al., 2008; Jensen et al., 2009), only a few of them explicitly consider protein-peptide interactions (Stein et al., 2009). Moreover, the focus on specific types of peptide interactions (PDZ domains, SH3 domains) has biased the content of structural databases. Grouping of 3D structures of protein-peptide complexes into functional modules has been established by several methods, such as using ELM patterns (e.g. 3did (Stein et al., 2009)) and multiple sequence alignment of the ligands (e.g. FireDB (Lopez et al., 2007)).
Additionally, specialized databases focusing on a specific functional group have been published, such as PROCOGNATE for enzyme complexes (Bashton et al., 2008), MPID-T for T-cell receptors (Tong et al., 2006) and HMRBase for hormone-receptor data (Rashid et al., 2009). For a detailed list of related databases, see Table 4.1. In contrast, our objective was to build an unbiased collection of non-redundant peptide binding sites, where grouping is based solely on three-dimensional similarity and no bias for functional relations or sequence similarity is introduced.
To this end we have mined the Brookhaven Protein Data Bank (PDB) for protein-peptide complexes using strict quality criteria, and thus obtained 1431 high-resolution 3D structures (see Section 4.2.1 for details on the selection procedure). These complexes were clustered based on three-dimensional similarity into 505 unique protein-peptide interface clusters, representing the full structural diversity of protein-peptide complexes available in the PDB. The aforementioned bias for specific peptide interactions is demonstrated in the further clustering of these complexes. 47% of all protein-peptide complexes available from the PDB are clustered within only 10 classes, containing complexes with peptides bound to Major Histocompatibility Complex (MHC) (14%), thrombins (12%), α-ligand binding domains (8%), protein kinase A, chymotrypsin, streptavidin, trypsin, SH3 domains, HIV-1 protease, HIV-1 antibody and mdm2 (Figure 4.1 and Figure 4.2).

Figure 4.1: Contents of PepX. Distribution of 1431 protein-peptide interactions after clustering on the architecture of the binding site. 48% of the protein-peptide complexes are classified in 10 classes with more than 5 members, while the remaining 52% contain less frequent structural binding modes.
[Pie chart segments: MHC 14%, thrombin 12%, α-ligand binding domain 8%, protein kinase A 5%, chymotrypsin 2%, streptavidin 2%, trypsin 1%, SH3 1%, HIV-1 protease 1%, HIV-1 antibody 1%, mdm2 1%, remainder 52%.]

Table 4.1: Public databases of protein-ligand complexes.

Database | Description | Ref.
--- | --- | ---
3did | 3D interacting domains: domain-domain and domain-peptide (ELM based) | (Stein et al., 2009)
fireDB | PDB structures and associated ligands, annotated functionally important residues | (Lopez et al., 2007)
HMRBase | Hormones and receptors | (Rashid et al., 2009)
MPID-T | T-cell receptor/peptide/MHC interactions | (Tong et al., 2006)
PDB-Ligand | 3D structure database of small molecular ligands that are bound to larger biomolecules | (Shin & Cho, 2005)
PLD | Biomolecular data, including binding energies, Tanimoto ligand similarity scores and protein sequence similarities of protein-ligand complexes | (Puvanendrampillai & Mitchell, 2003)
PROCOGNATE | Protein cognate ligands for the domains in enzyme structures (from CATH, SCOP and Pfam) | (Bashton et al., 2008)
SuperLigands | Descriptions of PDB ligand structures | (Michalsky et al., 2005)
Voronoia | Calculating packing densities of proteins and ligands | (Rother et al., 2009)
MOAD | 6000+ protein-ligand complexes, annotated with binding data | (Smith et al., 2006)
SCOWLP | Structural protein-peptide complexes clustered on binding interface | (Teyra et al., 2008)
peptiDB | 103 non-redundant protein-peptide complexes | (London et al., 2010)
PepX | 1431 protein-peptide complexes clustered in 505 non-redundant peptide binding modes | (Vanhee et al., 2010)
4.2 Contents of the PepX database

4.2.1 Construction of a non-redundant data set of protein-peptide complexes

We have filtered the Brookhaven Protein Data Bank (PDB) (Kouranov et al., 2006) for protein-peptide complexes requiring (1) X-ray structures with a resolution better than 2.5 Å (i.e. a numerically lower value), (2) peptides with a size from 5 to 35 amino acids, (3) peptides containing natural amino acids only, (4) receptors with a minimum size of 35 amino acids, and (5) the first unit in the PDB in case of crystallographic symmetry. 1431 complexes were retained and clustered on their binding architecture using an adaptation of the Hierarchical Agglomeration algorithm used for constructing BriX (Baeten et al., 2008), a database of protein fragments (see Chapter 2). The RMSD between any two complexes superposed on backbone Cα atoms was computed using MUSTANG, which allows for structural alignment of unrelated protein structures (Konagurthu et al., 2006). Any two structures are grouped together if they superpose below 2 Å RMSD for at least 75% of their interfaces. In this way, we retained 505 unique protein-peptide interface clusters.

Furthermore, we clustered the protein-peptide complexes using RMSD values of 1 Å, 2 Å and 3 Å combined with structural alignment of 50%, 75% and 95% of the interfaces. The clusters vary slightly depending on those parameters. The distribution of the number of elements in the PepX clusters for various thresholds of structural similarity and structural alignment of the binding site is shown in Figure 4.3. For all settings most clusters contain only one complex: 64% of all clusters are singletons for thresholds of 3 Å and 50% alignment (Figure 4.3 A), whereas 87% of all clusters resulting from 1 Å RMSD and 95% alignment (Figure 4.3 C) contain only one element.

Figure 4.2: Examples of protein-peptide clusters from PepX. The peptides are shown in red and the peptide cluster centroid in green. WebLogos show the peptide sequence diversity as observed in the crystals. For (C) and (D) the surface of the protein receptor is given with hydrophobic residues colored red and hydrophilic residues blue. (A) Class I MHC bound to peptide (169 structures), without a clear sequence motif. (B) Estrogen receptor α-ligand binding domain bound to peptide (111 structures), with the LxxLL sequence motif (x for any amino acid). (C) Peroxisomal Targeting Signal 1 (PTS1) binding domain with peptide (10 structures) and (D) SH3 domain-peptide interaction (17 structures).
[Panel titles: (A) MHC-0 (168 peptides); (B) LBD-0 (111 peptides); (C) PTS-0 (10 peptides); (D) SH3-0 (17 peptides).]

Figure 4.3: Distribution of number of elements in the PepX clusters. The distribution is shown for various thresholds of structural similarity (1-2-3 Å) and binding site alignment (50% (A), 75% (B) and 95% (C)). For all settings the largest number of clusters contains only one complex, going from 63% of all clusters (50% and 3 Å, A) to 87% of all clusters (95% and 1 Å, C).
[Axes: cluster size versus number of clusters; series: cluster thresholds of 1 Å, 2 Å and 3 Å.]
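The selection criteria and the grouping rule of Section 4.2.1 can be illustrated with a minimal Python sketch. It is not the PepX implementation: the `Complex` record, the function names and the `pairwise` input format are hypothetical, criterion (5) on crystallographic symmetry is not modelled, and the transitive (single-linkage) merging via union-find is a simplification of the hierarchical agglomeration adapted from BriX. The pairwise RMSD and aligned interface fraction are assumed to have been computed beforehand, for example with a structural aligner such as MUSTANG.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Complex:
    """Hypothetical record describing one protein-peptide complex."""
    name: str
    is_xray: bool
    resolution: float      # in Å, lower is better
    peptide_length: int    # number of residues in the peptide
    receptor_length: int   # number of residues in the receptor
    natural_only: bool     # peptide contains natural amino acids only


def passes_selection(c: Complex) -> bool:
    """Criteria (1)-(4) of Section 4.2.1; criterion (5) is not modelled."""
    return (
        c.is_xray
        and c.resolution < 2.5
        and 5 <= c.peptide_length <= 35
        and c.natural_only
        and c.receptor_length >= 35
    )


def group_by_interface(names, pairwise, max_rmsd=2.0, min_aligned=0.75):
    """Group complexes whose interfaces superpose closely.

    pairwise maps frozenset({a, b}) to (rmsd, aligned_fraction);
    pairs that are missing are treated as not superposable.
    Two complexes are linked when rmsd < max_rmsd over at least
    min_aligned of the interface; links are merged transitively.
    """
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        rmsd, aligned = pairwise.get(frozenset((a, b)), (float("inf"), 0.0))
        if rmsd < max_rmsd and aligned >= min_aligned:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())
```

Varying `max_rmsd` over 1, 2 and 3 Å and `min_aligned` over 0.50, 0.75 and 0.95 corresponds to the parameter grid explored in Figure 4.3, although the resulting cluster counts would only match PepX if the same linkage scheme and alignments were used.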
4.2.2 Statistics on structural protein-peptide complexes

The upper threshold for the peptide length was set to 35 amino acids, but the majority of the peptides are between 5 and 15 residues long, with a peak at 9 residues (Figure 4.4A). The size of receptors varies between 67 and 2552 residues, and the largest fraction lies in the [200-400] range (Figure 4.4B).

Figure 4.4: Distribution of peptide and receptor size. (A) The smallest peptide considered is 5 amino acids long, the longest consists of 35 residues. Circa 70% of all peptides lie within the [5-15] residue range. (B) The largest protein in the complexes contains 2552 amino acid residues; the shortest considered is 35 residues long. Most proteins are smaller than 600 residues, with a peak in the [200-400] range.
[Axes: (A) ligand length versus % of ligands; (B) receptor length versus % of receptors.]

The receptor sequences in the PepX database were clustered with the cd-hit algorithm (Li & Godzik, 2006) for various thresholds, resulting in datasets where sequences with 40-100% sequence identity are removed (Figure 4.5). Although there is large sequence redundancy within the database (removing sequences with more than 40% sequence identity removes more than 70% of all complexes in the database), this does not always reflect a redundancy in binding modes. For instance, Major Histocompatibility Complexes have high sequence identity but bind a wide range of peptides in different modes (Collins et al., 1994; Elliott & Neefjes, 2006). Preliminary analysis of the sequence redundancy in the full complex dataset versus the dataset with cluster centroids revealed that using geometric properties for clustering removes most sequence identity without discarding relevant structural binding motifs.
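The idea of removing receptor sequences above an identity threshold can be sketched as a greedy, longest-first filter in the spirit of cd-hit. This is only an illustration under stated assumptions: it does not call cd-hit, the function names are hypothetical, and the identity measure is a crude stand-in based on `difflib.SequenceMatcher` from the Python standard library rather than cd-hit's alignment-based identity.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for a real sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_representatives(sequences, threshold):
    """Keep a sequence only if it is below `threshold` to every kept one.

    Sequences are visited from longest to shortest, so each group of
    similar sequences is represented by its longest member.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        if all(identity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


def fraction_retained(sequences, threshold):
    """Share of the input that survives redundancy removal."""
    return len(greedy_representatives(sequences, threshold)) / len(sequences)
```

Running `fraction_retained` for thresholds from 1.0 down to 0.4, once on all receptors and once on the cluster centroids only, would produce the two kinds of curves compared in Figure 4.5; the actual values depend on the identity measure used.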
All receptors in protein-peptide complexes have been annotated with the structural classifications SCOP (Andreeva et al., 2008) and CATH (Cuff et al., 2009) based on the PDB identifier and chain of the receptor (Kouranov et al., 2006) and with Pfam (Finn et al., 2008) based on the UniProt identifier (Consortium, 2009). The coverage of PepX is highest for UniProt (82%), followed by structural classifications by CATH (71%) and SCOP (56%), and finally protein family annotation by Pfam (50%) (Figure 4.6). Within these annotations, we have analysed in detail the occurrence of PepX complexes in the various levels of the structural hierarchies represented in SCOP and CATH. Although most SCOP classes are represented by receptors in the database, protein-peptide complexes do not represent the full range of SCOP folds (8%), superfamilies (6%) and families (4%) (Figure 4.7A). When we look at the distribution of receptors in the different SCOP classes with respect to the distribution of PDB structures in the full SCOP database (Figure 4.8), we see that in PepX the all-β and α + β classes are clearly overrepresented (30 versus 24% for the all-β class, 38 versus 25% for the α + β class, respectively).

Similar results are obtained for the CATH classifications: the complexes represent every CATH class, and architectures are highly represented as well (Figure 4.7B). In contrast, at lower CATH levels, less than 10% of both topologies and superfamilies hold at least one protein-peptide complex. In accordance with the SCOP analysis, classes with mainly β-structures are largely overrepresented in PepX. Alpha and beta structures are underrepresented (35% in PepX versus 52% in full CATH). This is also seen in SCOP when we merge the α/β and α+β classes together, although the difference is smaller (43% in PepX versus 49% in full SCOP).

Figure 4.5: Receptor sequence redundancy within the PepX database. The receptor sequences in the PepX database were clustered with the cd-hit algorithm for various thresholds of sequence identity, from removing identical sequences up to 40% sequence identity. Although there is large sequence redundancy within the database, this does not always reflect a redundancy in binding modes. For instance, removing only identical sequences (100%) results in a loss of more than 60% of all complexes and more than 20% of the centroids, showing that some receptors bind in different structural modes.
[Axes: cd-hit identity threshold (%) versus % of structures; series: centroids, all complexes.]

Figure 4.6: PepX Annotations. Percentage of receptors in the PepX database represented by different annotations: SCOP, CATH, Pfam and UniProt. Coverage is highest for UniProt, followed by structural classifications by CATH and SCOP, and finally protein family annotation by Pfam.
[Axis: % of structures; bar labels: SCOP 795, CATH 1016, PFAM 715, UniProt 1168.]
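The coverage percentages quoted for Figure 4.6 can be checked with a few lines of arithmetic. The sketch below assumes that the bar labels of Figure 4.6 (795, 1016, 715 and 1168) are the numbers of annotated receptors and that the denominator is the full set of 1431 complexes; both are interpretations of the figure rather than statements from the text.

```python
TOTAL_COMPLEXES = 1431  # size of the PepX data set

# Assumed meaning of the bar labels in Figure 4.6: annotated receptors.
annotated = {"SCOP": 795, "CATH": 1016, "Pfam": 715, "UniProt": 1168}


def coverage_percent(count: int, total: int = TOTAL_COMPLEXES) -> int:
    """Coverage as a whole-number percentage of the data set."""
    return round(100 * count / total)


coverage = {name: coverage_percent(n) for name, n in annotated.items()}

# List the annotations from highest to lowest coverage.
for name, percent in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {percent}%")
```

Under these assumptions the rounded values agree with the percentages given in the text (82% for UniProt, 71% for CATH, 56% for SCOP and 50% for Pfam).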
Figure 4.7: Representation of the SCOP and CATH hierarchies in PepX. (A) Protein-peptide complexes do not represent the full range of SCOP folds, superfamilies and families. (B) Similarly, lower CATH levels are not very well represented in PepX.

Figure 4.8: Distribution of PepX structures in the different SCOP classes. Whereas the β, α/β and α+β classes are of similar size in the full SCOP database, in PepX the all-β and α+β classes are clearly overrepresented.
[Axes: SCOP class (all alpha proteins; all beta proteins; alpha and beta proteins (a/b); alpha and beta proteins (a+b); multi-domain proteins (alpha and beta); membrane and cell surface proteins and peptides; small proteins; coiled coil proteins; low resolution protein structures; peptides; designed proteins) versus % of structures; series: SCOP, PepX.]
4.2.3 Ligand annotation with structural variants for peptide design

Given the scarcity of protein-peptide structures and their obvious relevance in drug design (Ballinger et al., 1999; Reina et al., 2002; Yin et al., 2007; Parthasarathi et al., 2008; van der Sloot et al., 2009) (see also Section 1.1.5), we provide an additional service for peptide design. Since it was recently shown that protein-peptide interactions can be reliably mimicked using interacting fragments from monomeric proteins (Vanhee et al., 2009) (Chapter 5), it is possible to provide structural variations of peptide ligands using protein fragments. Each ligand peptide in the PepX data set is associated with its corresponding structural class from the database of protein fragment classes, BriX (Baeten et al., 2008; Vanhee et al., 2011) (Chapter 2). Sets of protein fragments with highly similar backbone structure are grouped in these fragment classes. Each protein fragment class represents a natural variation on a typical backbone conformation. Mapped on protein-peptide pairs, these structural classes can be used to model and design alternative peptides with slightly adapted backbone conformations that better fit given amino acid sequences. In Chapter 6 we show the application of BriX and InteraX towards modeling peptide complexes from the PepX database.

4.3 Database Access

4.3.1 Database Availability

PepX is accessible through a web portal at http://pepx.switchlab.org. We have recorded usage statistics since the database was launched in July 2009; statistics up to October 2010 are shown in Figure 4.9. The full database with annotations is available for download both in SQL format and as flat files. The entire dataset of 1431 PDBs with binding site residues and the equivalent centroid dataset of 505 binding sites can be downloaded. The PepX web server is implemented using the Drupal Content Management system (http://drupal.org).
Figure 4.9: PepX usage statistics. Statistics were recorded from July 1st, 2009 until October 31st, 2010. Provided by Google Analytics.
[Figure content: 2,789 visits from 51 countries; 9,929 pageviews; 3.56 pages per visit; average time on site 00:04:13; bounce rate 62.14%; 32.56% new visits. Traffic sources: referring sites 1,001 visits (35.89%), search engines 941 (33.74%), direct traffic 847 (30.37%).]

All information contained within the PepX database is exposed as XML (Extensible Markup Language). When certain URLs are visited, an XML file with the requested data is returned, following the REST interface for data exchange. For example, calling the URL http://pepx.switchlab.org/clusters.xml?threshold=2&alignment=75 serves an XML file with a description of the clusters for a threshold of 2 Å and an alignment of 75%. The XML interface is implemented for clusters, PDBs and BriX classes providing backbone variations on the peptides.
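A client for this XML interface could look roughly like the following Python sketch, which uses only the standard library. The `clusters.xml` path and the `threshold` and `alignment` parameters are taken from the example URL above; everything else is an assumption. In particular, the layout of the returned XML is not described here, so the parser only counts the direct children of the root element instead of relying on specific tag names, and whether the server still responds at this address is not guaranteed.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

BASE_URL = "http://pepx.switchlab.org"


def clusters_url(threshold: int, alignment: int) -> str:
    """Build the cluster query URL (threshold in Å, alignment in %)."""
    query = urlencode({"threshold": threshold, "alignment": alignment})
    return f"{BASE_URL}/clusters.xml?{query}"


def summarize_xml(xml_text: str):
    """Return the root tag and a count of each direct child tag.

    The schema is unknown, so no element names are assumed.
    """
    root = ET.fromstring(xml_text)
    counts = {}
    for child in root:
        counts[child.tag] = counts.get(child.tag, 0) + 1
    return root.tag, counts


def fetch_clusters(threshold: int, alignment: int, timeout: float = 30.0):
    """Download and summarize one cluster listing (requires network access)."""
    with urlopen(clusters_url(threshold, alignment), timeout=timeout) as reply:
        return summarize_xml(reply.read().decode("utf-8"))
```

Calling `clusters_url(2, 75)` reproduces the example URL from the text, and `fetch_clusters(2, 75)` would request the clustering at 2 Å and 75% alignment.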
4.3.2 User interface

Figure 4.10: PepX user flow. Searching for the keywords ‘thrombin’ and ‘inhibitor’ provides a list of hits. For the entry 1BTH the PepX cluster is shown, together with 3D views of the complex and the binding site. Every complex is annotated with binding energy calculated with FoldX, hydrogen bond interactions, secondary structure content, direct links to relevant databases and peptide backbone variations from BriX.
Extensive search and browse facilities are implemented for the PepX website. Browsing the database can be performed at two levels: individual complex structures and clusters of complexes. In the latter case the user can choose the level of similarity within one cluster by adjusting the root mean square deviation (RMSD) between structures within one cluster and the percentage of structural alignment between binding sites. The full PepX database can be searched through a simple Google-like search box, which uses a full index of all information contained in the database (Figure 4.11A). The guided search allows searching the database in specific subgroups, generated from the structural classifications and keywords (Figure 4.11B). In addition, tag clouds of the structural annotations can be used to generate specialized listings of protein-peptide complexes (Figure 4.11C).

Figure 4.11: Search options in the PepX database. (A) A simple, Google-like search on the contents of the database. The search is non-restrictive and accepts everything from keywords to PDB identifiers. (B) Guided search through PepX using SCOP. (C) Tag clouds of the PDB keywords associated with PepX structures.
For each individual complex several types of information are shown (Figure 4.10). Besides general information on the complex (PDB id, chains) and functional and structural annotation of the protein (UniProt, SCOP, CATH), detailed structural information about the interaction itself is also displayed. The binding affinity for the protein-peptide complex is calculated using the FoldX force field (Schymkowitz et al., 2005), and details of the contribution of backbone and side chain hydrogen bonds as well as the total binding energy are shown. The binding site is structurally characterized using several metrics such as secondary structure content, and 3D images of the binding site and the ligand itself were generated to illustrate the specific parts of the protein contributing to the binding site. Furthermore, all the clusters the complex takes part in are listed. Clicking on a specific cluster reveals a detailed page containing information on the centroid complex of the cluster as well as the list of all complexes belonging to the cluster.

4.4 Discussion

To date, PepX is the only dedicated resource for structural data on protein-peptide complexes. Since its original publication in November 2009, many parts have changed. For example, we have added sequence logos (Crooks et al., 2004) and Position Weight Matrices (PWMs) (Obenauer et al., 2003) for every complex in the database. A PWM measures the relative contribution of each amino acid at each position in the peptide, as measured using FoldX. PWMs intuitively represent the sequence variety that is allowed within the backbone constraints, and could therefore be used for peptide sequence optimization, for example. In Chapter 6, we rely on PWMs to capture the different sequences that are allowed in the peptide designs, matching peptide specificities obtained with the PWM to experimental specificities as measured using phage display or peptide arrays.
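One way to turn per-position energies into a PWM is Boltzmann weighting, sketched below in Python. This is an assumption made for illustration: the text does not state how the FoldX energies are converted into matrix entries, and the input format, function names and the value of RT (about 0.593 kcal/mol near 298 K) are choices of this sketch, not of PepX.

```python
import math

RT = 0.593  # kcal/mol at roughly 298 K (assumed temperature)


def pwm_column(energies, rt=RT):
    """Convert energies for one peptide position into probabilities.

    energies maps an amino acid (one-letter code) to an energy change in
    kcal/mol, where lower values are more favourable. Each residue gets a
    Boltzmann weight exp(-E / RT), normalized to sum to 1 per position.
    """
    weights = {aa: math.exp(-e / rt) for aa, e in energies.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def build_pwm(energies_by_position, rt=RT):
    """One probability column per peptide position, in sequence order."""
    return [pwm_column(column, rt) for column in energies_by_position]


def best_sequence(pwm):
    """Most favourable residue at each position according to the PWM."""
    return "".join(max(col, key=col.get) for col in pwm)
```

With such a matrix, residues that are energetically tolerated at a position receive comparable weights, while strongly destabilizing substitutions contribute almost nothing, which is the kind of sequence variety a sequence logo displays.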
Another change in PepX involves the annotation of peptides that are internally stabilized by one or more disulfide bonds. Since they will likely adopt a stable fold in isolation, they are better described as ‘mini-proteins’. In total, 48 out of 1428 peptides contain such a disulfide bridge.

In Chapter 5 we describe how we used PepX as the basis to research structural properties of peptide interactions, and in particular, their relation to the architecture of monomeric proteins. Finally, in Chapter 6, we describe the use of motifs from PepX to benchmark the peptide design algorithm.

Author Contributions

P.V., F.R., J.S., and L.S. conceptualized the study. P.V. developed the PepX database. P.V., J.R. and F.S. performed the analysis. P.V. developed the website. P.V. and J.R. wrote the paper.
References

Andreeva, A., Howorth, D., Chandonia, J.M., Brenner, S.E., Hubbard, T.J.P., Chothia, C. & Murzin, A.G. (2008). Data growth and its impact on the scop database: new developments. Nucleic Acids Research, 36, D419–25.
Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the brix collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
Ballinger, M.D., Shyamala, V., Forrest, L.D., Deuter-Reinhard, M., Doyle, L.V., Wang, J.X., Panganiban-Lustan, L., Stratton, J.R., Apell, G., Winter, J.A., Doyle, M.V., Rosenberg, S. & Kavanaugh, W.M. (1999). Semirational design of a potent, artificial agonist of fibroblast growth factor receptors. Nature Biotechnology, 17, 1199–204.
Bashton, M., Nobeli, I. & Thornton, J.M. (2008). Procognate: a cognate ligand domain mapping for enzymes. Nucleic Acids Research, 36, D618–22.
Chen, Y.C., Lo, Y.S., Hsu, W.C. & Yang, J.M. (2007). 3d-partner: a web server to infer interacting partners and binding models. Nucleic Acids Research, 35, W561–7.
Collins, E.J., Garboczi, D.N. & Wiley, D.C. (1994). Three-dimensional structure of a peptide extending from one end of a class i mhc binding site. Nature, 371, 626–9.
Consortium, U. (2009). The universal protein resource (uniprot) 2009. Nucleic Acids Research, 37, D169–74.
Crooks, G.E., Hon, G., Chandonia, J.M. & Brenner, S.E. (2004). Weblogo: a sequence logo generator. Genome Res, 14, 1188–90.
Cuff, A.L., Sillitoe, I., Lewis, T., Redfern, O.C., Garratt, R., Thornton, J. & Orengo, C.A. (2009). The cath classification revisited–architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Research, 37, D310–4.
Elliott, T. & Neefjes, J. (2006). The complex route to mhc class i-peptide complexes. Cell, 127, 249–51.
Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, S.J., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L.L. & Bateman, A. (2008). The pfam protein families database. Nucleic Acids Research, 36, D281–8.
Gong, S., Yoon, G., Jang, I., Bolser, D., Dafas, P., Schroeder, M., Choi, H., Cho, Y., Han, K., Lee, S., Choi, H., Lappe, M., Holm, L., Kim, S., Oh, D. & Bhak, J. (2005). Psibase: a database of protein structural interactome map (psimap). Bioinformatics, 21, 2541–3.
Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., Castro, E.D., Langendijk-Genevaux, P.S., Pagni, M. & Sigrist, C.J.A. (2006). The prosite database. Nucleic Acids Res, 34, D227–30.
Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Julien, P., Roth, A., Simonovic, M., Bork, P. & von Mering, C. (2009). String 8–a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Research, 37, D412–6.
Konagurthu, A.S., Whisstock, J.C., Stuckey, P.J. & Lesk, A.M. (2006). Mustang: a multiple structural alignment algorithm. Proteins, 64, 559–74.
Kouranov, A., Xie, L., de la Cruz, J., Chen, L., Westbrook, J., Bourne, P.E. & Berman, H.M. (2006). The rcsb pdb information portal for structural genomics. Nucleic Acids Research, 34, D302–5.
Li, W. & Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–9.
London, N., Movshovitz-Attias, D. & Schueler-Furman, O. (2010). The structural basis of peptide-protein binding strategies. Structure, 18, 188–199.
Lopez, G., Valencia, A. & Tress, M. (2007). Firedb–a database of functionally important residues from proteins of known structure. Nucleic Acids Research, 35, D219–23.
Michalsky, E., Dunkel, M., Goede, A. & Preissner, R. (2005). Superligands - a database of ligand structures derived from the protein data bank. BMC Bioinformatics, 6, 122.
Neduva, V. & Russell, R.B. (2006). Peptides mediating interaction networks: new leads at last. Curr Opin Biotechnol, 17, 465–71.
Neduva, V., Linding, R., Su-Angrand, I., Stark, A., de Masi, F., Gibson, T.J., Lewis, J., Serrano, L. & Russell, R.B. (2005). Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol, 3, e405.
Obenauer, J.C., Cantley, L.C. & Yaffe, M.B. (2003). Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Research, 31, 3635–41.
Ogmen, U., Keskin, O., Aytuna, A.S., Nussinov, R. & Gursoy, A. (2005). Prism: protein interactions by structural matching. Nucleic Acids Research, 33, W331–6.
Parthasarathi, L., Casey, F., Stein, A., Aloy, P. & Shields, D.C. (2008). Approved drug mimics of short peptide ligands from protein interaction motifs. Journal of chemical information and modeling, 48, 1943–8.
Pawson, T. & Nash, P. (2003). Assembly of cell regulatory systems through protein interaction domains. Science, 300, 445–52.
Petsalaki, E. & Russell, R.B. (2008). Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin Biotechnol, 19, 344–50.
Puntervoll, P. (2003). Elm server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Research, 31, 3625–3630.
Puvanendrampillai, D. & Mitchell, J.B.O. (2003). L/d protein ligand database (pld): additional understanding of the nature and specificity of protein-ligand complexes. Bioinformatics, 19, 1856–7.
Raghavachari, B., Tasneem, A., Przytycka, T.M. & Jothi, R. (2008). Domine: a database of protein domain interactions. Nucleic Acids Research, 36, D656–61.
Rashid, M., Singla, D., Sharma, A., Kumar, M. & Raghava, G.P.S. (2009). Hmrbase: a database of hormones and their receptors. BMC Genomics, 10, 307.
Reina, J., Lacroix, E., Hobson, S.D., Fernandez-Ballester, G., Rybin, V., Schwab, M.S., Serrano, L. & Gonzalez, C. (2002). Computer-aided design of a pdz domain to recognize new target sequences. Nature Structural Biology, 9, 621–7.
Rother, K., Hildebrand, P.W., Goede, A., Gruening, B. & Preissner, R. (2009). Voronoia: analyzing packing in protein structures. Nucleic Acids Research, 37, D393–5.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The foldx web server: an online force field. Nucleic Acids Research, 33, W382–8.
Shin, J.M. & Cho, D.H. (2005). Pdb-ligand: a ligand database based on pdb for the automated and customized classification of ligand-binding structures. Nucleic Acids Research, 33, D238–41.
Smith, R.D., Hu, L., Falkner, J.A., Benson, M.L., Nerothin, J.P. & Carlson, H.A. (2006). Exploring protein-ligand recognition with binding moad. J Mol Graph Model, 24, 414–25.
Stein, A., Panjkovich, A. & Aloy, P. (2009). 3did update: domain-domain and peptide-mediated interactions of known 3d structure. Nucleic Acids Research, 37, D300–4.
Teyra, J., Paszkowski-Rogacz, M., Anders, G. & Pisabarro, M.T. (2008). Scowlp classification: structural comparison and analysis of protein binding regions. BMC Bioinformatics, 9, 9.
Tong, J.C., Kong, L., Tan, T.W. & Ranganathan, S. (2006). Mpid-t: database for sequence-structure-function information on t-cell receptor/peptide/mhc interactions. Appl Bioinformatics, 5, 111–4.
van der Sloot, A.M., Kiel, C., Serrano, L. & Stricher, F. (2009). Protein design in biological networks: from manipulating the input to modifying the output. Protein Eng Des Sel, 22, 537–42.
Vanhee, P., Stricher, F., Baeten, L., Verschueren, E., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2009). Protein-peptide interactions adopt the same structural motifs as monomeric protein folds. Structure, 17, 1128–1136.
Vanhee, P., Reumers, J., Stricher, F., Baeten, L., Serrano, L., Schymkowitz, J. & Rousseau, F. (2010). Pepx: a structural database of non-redundant protein-peptide complexes. Nucleic Acids Research, 38, D545–51.
Vanhee, P., Verschueren, E., Baeten, L., Stricher, F., Serrano, L., Rousseau, F. & Schymkowitz, J. (2011). Brix: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Research, 39, D435–42.
Yin, H., Slusky, J.S., Berger, B.W., Walters, R.S., Vilaire, G., Litvinov, R.I., Lear, J.D., Caputo, G.A., Bennett, J.S. & Degrado, W.F. (2007). Computational design of peptides that target transmembrane helices. Science, 315, 1817–1822.
5 Protein-peptide interactions resemble monomeric protein interactions

This chapter is based on

Protein-Peptide Interactions Adopt the Same Structural Motifs as Monomeric Protein Folds. Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Tom Lenaerts, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. Structure¹, August 2009.

and

Modeling protein-peptide interactions using protein fragments: fitting the pieces? Peter Vanhee, Francois Stricher, Lies Baeten, Erik Verschueren, Luis Serrano, Frederic Rousseau and Joost Schymkowitz. BMC Bioinformatics², December 2010.

¹ This paper was featured on the cover of the Structure issue of August 2009.
² This short paper is part of a series of Highlights from the Sixth International Society for Computational Biology (ISCB) Student Council Symposium (Klijn et al., 2010).

We compared the modes of interaction between protein-peptide interfaces and those observed within monomeric proteins and found surprisingly few differences. Over 65% of 731 protein-peptide interfaces could be reconstructed within 1 Å RMSD using solely fragment interactions occurring in monomeric proteins. Interestingly, more than 80% of interacting fragments used in reconstructing a protein-peptide binding site were obtained from monomeric proteins of an entirely different structural classification, with an average sequence identity below 15%. Nevertheless, geometric properties perfectly match the interaction patterns observed within monomeric proteins. These data suggest that the wealth of structural data on monomeric proteins could be harvested to model protein-peptide interactions and, more importantly, that sequence homology is no prerequisite.

5.1 Introduction

Recently, Russell and co-workers estimated that 15-40% of all interactions in the cell are mediated through protein-peptide interactions (Neduva et al., 2005; Petsalaki & Russell, 2008), meaning that, at the most extreme, nearly every protein is affected either directly or indirectly by peptide-binding events. Such interactions are commonly mediated by specialized protein domains (Pawson & Scott, 1997), which are crucially involved in highly diverse biological processes and occur in a myriad of proteins in ever-changing combinations with other functional units. Protein-peptide interactions are for instance of central importance for motif-dependent interactions in cell signalling, such as the binding of tyrosyl-phosphorylated peptides to proteins containing the Src homology domain 2 (SH2) or the phosphotyrosine-binding domain (PTB) (Bradshaw & Waksman, 2002; Yaffe, 2002). Peptides with certain proline motifs constitutively bind to proteins containing Src homology domain 3 (SH3) at low affinities (Cesareni et al., 2002; Mayer, 2001).

Short peptides are usually devoid of stable secondary structure in isolation. Thus one might argue that peptide binding is equivalent to the folding process, in which the peptide is the last element to be added to the growing structure, albeit not on the same polypeptide chain.
This argument is supported by folding experiments with Barnase (Kippen et al., 1994), in which cleaving the polypeptide chain into two molecules resulted in an associated complex with a fold similar to that of the monomeric protein. In peptide complementation experiments with chymotrypsin inhibitor 2 (CI2), Itzhaki et al. (1995) demonstrated that folding does not require the structural building blocks to be part of the same polypeptide chain. This folding analogy suggests that protein-peptide interactions should follow similar structural patterns to those observed in monomeric proteins (Tsai et al., 1998). In particular cases, such as β-strand extension in PDZ domains, the equivalence to monomeric structures is obvious (Remaut & Waksman, 2006), but for other protein-peptide structures, there is no apparent monomeric counterpart that has a similar arrangement of structural elements on a single chain.

Similarities between singular folds and protein-protein interfaces have been observed too, and Tuncbag et al. (2008) ventured to suggest that evolution reuses patterns of interaction for both folding and association. In an earlier study, architectural motifs from protein monomers were shown to recur at protein-protein interfaces, although this similarity is less obvious for structures that fold separately and associate afterwards (Tsai et al., 1997). The interface between protein and ligand is richer in hydrophobic residues than the surrounding surface (Ma et al., 2003), suggesting similarity to the protein core. Cohen et al. (2008) have shown that the chemistry, geometry and packing density of interactions within protein cores are similar to those at the interface, while backbone interactions are preferred in the core as opposed to side chain interactions in the binding site. Clustering all the protein-protein interfaces available in the PDB, Tuncbag et al. (2008) found that some of the architectures preferred in the interface also exist in single chains. These striking similarities between folding and binding offer opportunities for protein-protein interface design, as recently demonstrated by Potapov et al. (2008), who redesigned and experimentally verified the interface of TEM1 β-lactamase and its inhibitor protein using a combination of naturally occurring interaction templates from the Protein Data Bank (PDB) (Berman et al., 2000).
Part of the problem in identifying similarities between structural motifs that occur in protein-peptide interactions and in monomeric proteins is the apparent complexity of such interactions when viewed in all their atomic detail. By contrast, it is often relatively simple to divide a protein structure into a small number of interacting fragments, roughly determined by the elements of secondary structure. Therefore, instead of considering entire protein-peptide interfaces, we divide the structure into pairs of interacting protein fragments, and as such rely on the modularity of the binding site shown for protein-protein complexes (Reichmann et al., 2005). It has been demonstrated that protein fragments of variable length allow efficient reconstruction of the architecture of monomeric proteins (Baeten et al., 2008; Kolodny & Levitt, 2003). Yet, it remains to be shown whether combinations of fragments of monomeric proteins are able to reflect the complex architectures exhibited by the binding interfaces of protein-peptide complexes.

Here we describe an exhaustive study of all natural protein-peptide interfaces available in the PDB (731 cases, see Section 5.3 and the PepX database described in Chapter 4). We relate the architecture of the protein-peptide interface to the arrangement of interacting fragments observed within monomeric proteins. Our set of building blocks consists of all the recurrent fragments of five amino acids that are found in the WHAT IF dataset of 1259 structurally non-redundant high resolution protein structures (Vriend, 1990). The fragments are clustered into an alphabet of roughly 2000 elements and are publicly available in the BriX database (Baeten et al., 2008; Vanhee et al., 2011) (Chapter 2). We show here that more than 65% of protein-peptide interfaces can be reconstructed within 1 Å root mean square deviation (RMSD) from pairs of interacting fragments of five amino acids taken from monomeric structures. In 25% of the cases the entire arrangement of structural elements as it occurs in the protein-peptide interface can be found in the monomeric fold of a particular PDB structure. Interestingly, on average less than 15% sequence similarity exists between the structurally equivalent building blocks as they occur in monomeric folds and protein-peptide interfaces. Despite this, the interaction networks of the original protein-peptide interfaces are preserved in the corresponding building blocks from the monomeric proteins.
Although more than 90% of the protein-peptide interfaces can be reconstructed at a lower resolution (2 Å RMSD), it is clear that around 35% of protein-peptide interactions are mediated by irregular structure elements that have no equivalent in our database of monomeric structures.

Our work demonstrates that the rules that govern protein-peptide interactions are identical to those that steer the architecture of proteins. This similarity can be revealed by casting the proteins as a collection of recurrent polypeptide fragments that interact in an inter- or intramolecular fashion. Our analysis of the known crystal structures of protein-peptide complexes shows that the configuration of fragments corresponding to the interactions between a protein domain and a bound peptide can be found in the structure of a monomeric protein in the vast majority of the cases. These configurations can be used as design templates for de novo peptide design, as we show in Chapter 6.

5.2 Results

5.2.1 InteraX: a database of interacting protein fragments

InteraX is a collection of interaction motifs from single-chain protein structures. We extended our original fragment description to include protein fragments interacting with other protein fragments (see Section 5.3.3). Two versions of the InteraX database exist, depending on which version of BriX is used. InteraX derived from the WHAT IF protein set (Vriend, 1990) contains 7,089,578 protein fragment interactions between 567,278 fragments of five residues each, from an original set of 2,597 non-redundant protein monomers. InteraX derived from Astral40 (Chandonia et al., 2004) contains 13,643,407 interactions between 1,157,418 fragments derived from 6,481 proteins with less than 40% sequence homology.

In Figure 5.1, a series of interaction patterns are shown, for example a helix-helix interaction with a favorable salt bridge stabilizing the interaction (Figure 5.1A), or a canonical β-β motif, which is the most represented example of an intramolecular interaction motif from InteraX (Figure 5.1B). Other examples include loop-loop motifs stabilized by sidechain hydrogen bonds (Figure 5.1C) or by cation-π interaction (Figure 5.1E). Besides non-covalent interactions we also describe covalent interactions between non-continuous fragments, like a disulfide bridge between a helix and a strand fragment (Figure 5.1D). Finally, a packing motif between a loop and a helix is shown in Figure 5.1F.

We have annotated each interaction with the number of residue interactions, hydrogen bonding and estimated free energy of interaction (Table 5.1). Two criteria are used to describe residue interactions. The first uses a full-atom description of the motif and considers two residues to be interacting when their atoms are in contact, as measured by the Van der Waals radii of the atoms plus a cut-off value of 0.5 Å. The second approach is less stringent.
We look for potential interactions by drawing a hypothetical sphere around all residues, describing their full rotamer-dependent action radius, and measure whether the spheres can overlap. We used the FoldX force field (Schymkowitz et al., 2005a) to make an estimate of the overall interaction energy (∆∆G, kcal/mol) between each two fragments in isolation in order to distinguish between weak and strong interaction motifs.

Figure 5.1: Different protein fragment interactions from InteraX. See Table 5.1 for more details. (A) Helix-helix interaction motif stabilized by a salt bridge between LYS54 and GLU110. (B) Canonical β-sheet motif formed by two β-strands stabilized by a series of backbone hydrogen bonds. (C) Loop-loop interaction motif in which a series of side chain hydrogen bonds contribute to the stability of the loops. (D) α-β interaction with two covalent disulfide bridges. (E) Loop-loop motif with a cation-π interaction between the cationic sidechain of lysine and the aromatic ring structures of tryptophan and tyrosine. The entire cation-π interaction motif contains another tryptophan and tyrosine at the opposite side and is captured by two InteraX patterns. (F) Helix-loop interaction pattern.
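The full-atom contact criterion of Section 5.2.1 can be illustrated with a short sketch. This is not the InteraX implementation: the Bondi radii, the reading of the criterion as "distance no larger than the sum of both Van der Waals radii plus 0.5 Å", and all function names are assumptions made for illustration only.

```python
import math

# Bondi van der Waals radii in Å. ASSUMPTION: the thesis does not state
# which radii were used; these are common textbook values.
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}
CUTOFF = 0.5  # Å, the tolerance quoted in the text


def atoms_in_contact(atom_a, atom_b, cutoff=CUTOFF):
    """Atoms are (element, (x, y, z)) tuples.

    ASSUMPTION: two atoms are in contact when their distance does not
    exceed the sum of their van der Waals radii plus the cut-off.
    """
    (elem_a, xyz_a), (elem_b, xyz_b) = atom_a, atom_b
    limit = VDW_RADIUS[elem_a] + VDW_RADIUS[elem_b] + cutoff
    return math.dist(xyz_a, xyz_b) <= limit


def residues_in_contact(res_a, res_b, cutoff=CUTOFF):
    """Residues are lists of atoms; one atomic contact is sufficient."""
    return any(atoms_in_contact(a, b, cutoff) for a in res_a for b in res_b)


def count_residue_contacts(frag_a, frag_b, cutoff=CUTOFF):
    """Number of interacting residue pairs between two fragments."""
    return sum(
        residues_in_contact(ra, rb, cutoff) for ra in frag_a for rb in frag_b
    )
```

A fragment is represented here as a list of residues and a residue as a list of atoms, so two five-residue fragments yield a count between 0 and 25.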
Fig. 5.1  PDB   Frag 1     Frag 2     Contacts  Contacts       H-bonds  H-bonds  H-bonds  FoldX ∆∆G   FoldX ∆∆G backbone
                                      (atom)    (half-sphere)  MC-MC    SC-SC    MC-SC    (kcal/mol)  (kcal/mol)
A         153L  A:51-55    A:110-114  4         20             0        1        0        -0.03       0.21
B         1A44  A:64-68    A:84-88    7         22             5        0        0        -1.61       -1.14
C         1DRY  A:152-156  A:302-306  4         18             0        3        0        -0.19       -0.22
D         1AHO  A:22-26    A:46-50    21        2              0        0        0        -5.74       0.7
E         1GAI  A:49-53    A:106-110  10        22             3        0        0        -0.43       -0.75
F         153L  A:11-15    A:174-178  2         17             0        0        1        -0.6        -0.3

Table 5.1: InteraX protein fragment interactions. A sample of 6 annotated InteraX motifs corresponding to Figure 5.1. Additionally, InteraX pairs are annotated with the number of hydrogen bonds (categorized into main chain (MC-MC), side chain (SC-SC) or mixed main chain-side chain (MC-SC) bonds) and their contribution to the total free energy.

5.2.2 Reconstruction of protein-peptide interactions from interacting fragment pairs derived from monomeric proteins

We define the protein-peptide interface as the collection of amino acid residues belonging to either the protein or the peptide chain whose inter-atomic distance falls within a given cut-off distance. Starting from these interface residues, we generate interacting fragments by sliding a window of length 5 over each interface residue (see Section 5.3 for details). By repeating this procedure for each pair of interfacing residues, the algorithm generates a collection of interacting fragment pairs from the protein-peptide structure. Next, for each fragment pair in the protein-peptide interface, we determine the corresponding InteraX pairs that contain protein backbone arrangements similar to the fragment pair. The overlap between the query fragment pair, taken from the protein-peptide interface, and the database-derived fragment pair, taken from a monomeric protein, is quantified by the root-mean-square deviation (RMSD) after superposition, using a superposition threshold of 1 Å. The degree of coverage of the binding site is then defined as the number of residues covered by an InteraX pair, divided by the number of residues in the entire binding site. This 'two-body coverage' (every InteraX pair is a two-body interaction) is a measure that describes to which extent the binding interface can be reconstructed from sets of pairs of interacting fragments found in individual monomeric proteins. Higher coverage indicates an interface that contains a high degree of architectural patterns adopted by monomeric protein structures, whereas lower coverage of the interface implies a peptide binding interface that cannot be related to the intramolecular architecture of monomeric proteins.

Figure 5.2: Coverage of protein-peptide interfaces. [Bar chart. X axis: coverage of the binding site (0-25%, 25-50%, 50-75%, 75-100%); Y axis: percentage of the protein-peptide dataset; series: two-body and single BriX protein.] Protein-peptide interfaces are covered with fragments found in monomeric proteins. The 'two-body' coverage shown in dark red captures the percentage of the protein-peptide interface that can be covered with InteraX pairs of protein fragments from different proteins. The 'single BriX protein' coverage shown in light red captures the best coverage of the binding site with a single monomeric protein. Results are averaged over the entire data set of 301 complexes.

Overall, for the 731 protein-peptide interaction interfaces analysed here, we find that for the majority of the complexes at least 50% of their protein-peptide interface is covered with two-body interactions at a resolution of 1 Å (Figure 5.2). For 40% of the protein-peptide complexes, the coverage rises to more than 75% of the protein-peptide interface. In comparison, we find that 98% of the protein-peptide interface can be rebuilt with single protein fragments from the BriX database at a resolution of 1 Å. Therefore, using protein fragment interactions instead of single protein fragments significantly reduces the coverage of the protein-peptide interface. In addition, the extent of coverage achieved by a two-body fragment approach illustrates that the architectural patterns of backbones found in the intramolecular arrangement of monomeric proteins contain a significant amount of structural information that is applicable to protein-peptide interactions.

Figure 5.3 illustrates how different interface topologies including all-α, mixed α-β and all-β interfaces can be reconstructed by the superposition of two-body fragments from monomeric proteins. The first example is a PDZ domain bound to its ligand as an additional strand to an antiparallel β-sheet, tightly covered by intramolecular interactions with an average of 0.49 Å pairwise RMSD. Figure 5.3B shows an α-helix ligand binding domain with its ligand, for which fragments cover the entire interface with 0.34 Å pairwise RMSD, due to the canonical interaction motifs and the limited structural variation in the single α-helices. Figure 5.3C is a class I major histocompatibility complex (MHC) bound to a decameric peptide, in the peptide-binding groove formed by two α-helices. MHC has been optimized to bind many different peptides with different sequences, but most bind in a similar orientation with both peptide termini bound in conserved pockets while length variations are accommodated by the peptide bulging or zigzagging in the middle (Collins et al., 1994).
Different variations of a helix-loop motif are used for binding and surprisingly, those seemingly irregular interaction patterns often recur in monomeric proteins, covering 90% of the entire interface. Figure 5.3D shows a polyproline peptide bound to an SH3 domain. Our method covers only 54% of the interface because of the low frequency of occurrence of the polyproline motif within single chains.
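The fragment generation and the two-body coverage defined in Section 5.2.2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the analysis: residues are 0-based integer indices, the best RMSD of each query pair against the database is assumed to be precomputed and supplied as a dictionary, and the names windows_over, fragment_pairs and two_body_coverage are hypothetical.

```python
WINDOW = 5            # fragment length used in the text
RMSD_THRESHOLD = 1.0  # Å, superposition threshold used in the text


def windows_over(residue, chain_length, window=WINDOW):
    """All (start, end) windows, end inclusive, that contain `residue`
    and fit inside a chain indexed 0 .. chain_length - 1."""
    first = max(0, residue - window + 1)
    last = min(residue, chain_length - window)
    return [(s, s + window - 1) for s in range(first, last + 1)]


def fragment_pairs(contacts, protein_length, peptide_length):
    """Combine every protein window with every peptide window for each
    contacting (protein_residue, peptide_residue) pair."""
    pairs = set()
    for prot_res, pep_res in contacts:
        for wp in windows_over(prot_res, protein_length):
            for wq in windows_over(pep_res, peptide_length):
                pairs.add((wp, wq))
    return pairs


def two_body_coverage(pair_rmsd, interface_protein, interface_peptide,
                      threshold=RMSD_THRESHOLD):
    """Fraction of interface residues lying in at least one fragment pair
    whose best database match superposes within `threshold`.

    pair_rmsd maps (protein_window, peptide_window) to the best RMSD found
    in the fragment-pair database (ASSUMED to be computed elsewhere).
    """
    covered_prot, covered_pep = set(), set()
    for (wp, wq), rmsd in pair_rmsd.items():
        if rmsd <= threshold:
            covered_prot.update(range(wp[0], wp[1] + 1))
            covered_pep.update(range(wq[0], wq[1] + 1))
    covered = (len(covered_prot & interface_protein)
               + len(covered_pep & interface_peptide))
    return covered / (len(interface_protein) + len(interface_peptide))
```

Keeping the RMSD search outside these functions separates the bookkeeping of coverage from the structural superposition, which would in practice dominate the run time.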
Figure 5.3: Protein-peptide interfaces can be described as interactions between recurrent protein fragments from monomeric proteins. [Panels A-D; PDB identifiers shown below the panels: 1w9qBS, 3erdAC, 1ywoAP, 2clrAC.] Protein covers from the BriX database are shown in red for receptor fragments and green for ligand fragments. The protein-peptide complex is shown in grey. The PDB identifier with the protein and peptide chains is shown below. Each interaction covering a part of the protein-peptide interface consists of two fragments of five residues each. (A) PDZ domain bound to ligand. 7,298 interacting fragment pairs covering 100% of the interface with 0.49 Å average pairwise RMSD. (B) Human estrogen receptor α-ligand binding domain bound to peptide. 42,092 interacting fragment pairs covering 85% of the interface with 0.28 Å average pairwise RMSD. (C) Class I MHC bound to peptide. 325 interacting fragment pairs covering 82% of the interface with 0.80 Å average pairwise RMSD. (D) SH3 domain with polyproline peptide. 14 interacting fragment pairs covering 77% of the interface with 0.90 Å average pairwise RMSD.
5.2.3 Reconstruction of peptide binding motifs by using multiple fragment pairs observed in monomeric proteins

Can entire binding modes of protein-peptide complexes be reconstructed using parts of single chain folds? Recently, Tuncbag et al. (2008) observed for protein-protein complexes that some of the more frequent interface architectures are the same as those of single chains. For protein-peptide complexes, we address this question by combining InteraX pairs from the same monomeric protein, describing protein-peptide interfaces as sets of interacting fragments.

Figure 5.4 depicts six examples of binding interfaces that are described by a combination of InteraX pairs originating from a single monomeric protein. The first example shows a PDZ domain with a peptide bound in the canonical β-strand extension, from the scaffolding protein human syntenin. An exact match for the entire binding motif is found in a pseudo enzyme-substrate complex from E. coli, exhibiting a rudimentary form of a Rossmann-fold domain unrelated to the PDZ domain fold. In the second example the human estrogen receptor α-ligand binding domain is bound to a coactivator peptide in the nucleus, in a hydrophobic groove on the surface of the ligand binding domain. The entire interface of 35 residues superposes with an RMSD of 1.94 Å on the unrelated all-α citrate synthase from a different species. Figure 5.4C shows the particular binding mode of the MHC antigen-recognition domain with a peptide, partly reconstructed from an unrelated ferritin-like protein, superposing 24 residues with an RMSD of 0.94 Å. The ferritin-like fold lacks the β-sheet typical for the MHC antigen-recognition domain but is composed of a helix bundle in which the loop regions interact in a way similar to the peptide bound to MHC.
In Figure 5.4D a peptide inhibiting the serine-like NS3/4A protease from the hepatitis C virus is bound in an extended backbone conformation, forming an anti-parallel β-sheet with one β-strand of the enzyme. The entire β-sheet of 34 consecutive residues is found in murB, a glucosamine reductase involved in cell-wall biosynthesis in E. coli, but the ligand strand is now an integral part of the fold. Strikingly, both proteins occur in different structural classifications according to SCOP: NS3/4A is an all-β protein, whereas murB is a member of the α+β class.
In Figure 5.4E, a tetratricopeptide repeat (TPR) motif from the adaptor protein Hop is shown bound to a heptapeptide from Hsp70. The BriX hit contains exactly this TPR motif in p67, but now the C-terminus of p67 folds back into a hydrophobic groove formed by a TPR domain in a single chain. This has already been observed by Grizot et al. (2001), relating the single chain to the TPR domain in complex with RacGTP (Lapouge et al., 2000). The last example shows an SH3 domain complexed with a polyproline peptide. Similar backbone architecture can be observed in an E. coli protein of unknown function, but this time both fragments partly fold as β-strands because of the different structural contexts. Yet, the polyproline motifs are present in both the complex and the single-chain protein.

For 25% of the 731 complexes a similar structural arrangement covering more than 50% of the entire interface could be observed in a single monomeric protein (Figure 5.2). The bulk of the interfaces however can be covered for only 25-50%. This rather low score is significant because if protein-peptide binding modes could be entirely described using single-chain folds, we would be able to retrieve them using SCOP, the structural classification of proteins (Murzin et al., 1995). We examined whether there is any correspondence between the SCOP classifications of the protein-peptide complex and the protein from BriX that contains the collection of interacting fragments covering the interface. All four hierarchical SCOP levels (class, fold, superfamily and family) are compared if SCOP data was available for the protein-peptide complex (see Section 5.3 for details). Intriguingly, 74% of the equivalent structural arrangements of fragments are from unrelated SCOP classifications, 23% are related on the class level and the remaining 3% are distributed across the fold, superfamily and family levels.
These data clearly illustrate that the fragment interaction approach reveals structural similarities that are not apparent from structural classifications.
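The hierarchical SCOP comparison described above can be illustrated by a small sketch. It assumes that each structure carries a SCOP concise classification string (sccs) of the form class.fold.superfamily.family, for example "b.36.1.1"; the function names and the example strings in the usage note are hypothetical and are not the classifications of the proteins discussed in this chapter.

```python
from collections import Counter

LEVELS = ("class", "fold", "superfamily", "family")


def deepest_shared_level(sccs_a, sccs_b):
    """Deepest SCOP level at which two sccs strings still agree.

    Returns "unrelated" when the classes already differ. Deeper levels are
    only compared while all higher levels are identical, because fold
    numbers are only meaningful within one class, and so on.
    """
    shared = "unrelated"
    for level, a, b in zip(LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if a != b:
            break
        shared = level
    return shared


def level_distribution(sccs_pairs):
    """Fraction of (complex, monomer) pairs per deepest shared level.

    Expects at least one pair.
    """
    counts = Counter(deepest_shared_level(a, b) for a, b in sccs_pairs)
    total = sum(counts.values())
    return {key: counts[key] / total for key in ("unrelated",) + LEVELS}
```

For instance, deepest_shared_level("b.47.1.3", "b.47.1.2") yields "superfamily", whereas two strings that differ in their first field are reported as "unrelated" even if later fields happen to coincide.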
Figure 5.4: Relation between intermolecular interface architectures and intramolecular protein architectures. The protein-peptide complex is colored blue for the receptor and green for the ligand (left); the monomeric protein from BriX is colored red (right). The superposing region is shown in ribbon view, and the number of superposed residues and the superposition value are shown. The PDB ID with the protein and peptide chains is shown below the figures. (A) PDZ domain with peptide and unrelated enzyme-substrate complex (1w9qBS and 1gsa; 1.64 Å, 20 residues). (B) α-ligand binding domain with peptide and unrelated all-α citrate synthase from a different species (3erdAC and 1csh; 1.94 Å, 35 residues). (C) Class I MHC complex with peptide and ferritin-like protein (1yn7AC and 1jigA; 0.94 Å, 24 residues). (D) Hepatitis C protease with inhibitor and MurB, a glucosamine reductase protein (1dy8AC and 2mbr; 1.51 Å, 34 residues). (E) Repetition of the ligand-bound TPR motif in complex and single-chain form (1elwAC and 1hh8A; 1.17 Å, 47 residues). (F) SH3 domain with polyproline peptide and protein of unknown function (2vknAC and 1k7jA; 0.69 Å, 12 residues).
5.2.4 Statistical analysis of the factors that determine reconstruction accuracy

What is the impact of regular secondary structure on the reconstruction accuracy of peptide binding sites?

Secondary structure plays an important role in protein-peptide binding. Approximately one third of all known peptides bind their protein domain through β-strand addition, while another third folds as α-helical peptides (Petsalaki & Russell, 2008; Remaut & Waksman, 2006). In our test set, 38% of the peptides adopt some form of secondary structure, while 42% of all binding site residues are of regular secondary structure. As expected, more regular interfaces are better covered, with a correlation of 0.88 between the percentage of secondary structure and the coverage (Figure 5.5A). Interestingly however, binding interfaces with 50% regularity are on average still 80% covered at a resolution of 1 Å, illustrating that even irregular interfaces partially reflect the architecture of intramolecular interactions.

Are more stable interactions more common?

For every protein-peptide complex, we estimated the change in free energy upon binding (∆∆G, kcal/mol) with the empirical force field FoldX (Schymkowitz et al., 2005b,a) (Section 5.3.5). Interestingly, we found a correlation of 0.91 between the ∆∆G binding energies of the complexes and the coverage of the binding sites, suggesting that higher affinity binding correlates with better coverage. This result is not obvious as FoldX energies do not depend on the size of the binding site. Further, decomposing the energy terms reveals that more backbone H-bonds in the protein-peptide interface imply a better coverage with BriX fragments, with a correlation of 0.81 (Figure 5.5B).
Alternatively, if we correlate the predicted binding ∆∆G with the percentage of secondary structure in the binding site, we find that more structured binding sites do have slightly better binding (correlation of 0.62), although this is probably caused by the high amount of β-strand binding modes in our dataset.
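The correlations reported in this section are computed on binned data (see the captions of Figures 5.5 and 5.6). The sketch below shows one way such a number could be obtained. It rests on two assumptions that the text does not confirm: that "equally distributed in 20 bins" means bins holding equal numbers of complexes after sorting by coverage, and that the reported value is a Pearson correlation coefficient of the bin averages. The function names are hypothetical.

```python
import math


def equal_count_bins(points, n_bins):
    """Sort (x, y) points by x, split them into n_bins groups of (nearly)
    equal size and return the (mean x, mean y) of every non-empty group."""
    ordered = sorted(points)
    n = len(ordered)
    binned = []
    for k in range(n_bins):
        chunk = ordered[k * n // n_bins:(k + 1) * n // n_bins]
        if chunk:
            xs, ys = zip(*chunk)
            binned.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return binned


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Both sequences must vary; a constant sequence gives a zero division.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def binned_correlation(points, n_bins=20):
    """Correlation between the bin averages of x and y."""
    xs, ys = zip(*equal_count_bins(points, n_bins))
    return pearson(xs, ys)
```

Averaging within bins removes much of the scatter between individual complexes, so a correlation of bin averages is expected to be higher than the correlation of the raw values; this should be kept in mind when interpreting the coefficients.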
Figure 5.5: Properties of the protein-peptide interface coverage. The coverage data of 301 complexes is equally distributed in 20 bins and plotted against (A) secondary structure distribution (α-helix and β-strand) in the binding site (0.88 correlation), (B) interface H-bonds as measured in kcal/mol with FoldX (0.81 correlation), (C) BLOSUM62 score for similarity between residues from the protein-peptide complex and the covering BriX fragment (no correlation). In (D) the percentage of secondary structure in the interface is plotted against the predicted binding energy ∆∆G (0.62 correlation).

Are buried backbone fragments more suitable for reconstructing peptide binding sites?

BriX contains fragments from both the protein core and the surface. We analyzed from which part of the protein the InteraX fragments originate by measuring the burial of the side chains. A value of 1 stands for complete burial of the side chain, while a value of 0 indicates complete exposure to the solvent. On average, the fragment covers have a side chain burial value of 0.72, demonstrating that the backbone conformations of well-packed intramolecular interactions are generally closer to the backbone architectures found at protein-peptide interfaces. As a result we note a correlation of 0.65 with the coverage: more buried interactions from BriX cover the binding site slightly better (Figure 5.6B). The correlation between the FoldX binding energies and the coverage of the protein-peptide interfaces is given in Figure 5.6A.

Figure 5.6: Properties of the protein-peptide interface coverage correlated with binding energy and burial. The coverage data of 301 complexes is equally distributed in 20 bins and plotted against (A) predicted binding energy (∆∆G, kcal/mol) with FoldX (0.91 correlation), (B) side chain burial with FoldX (0.65 correlation).

Is there sequence similarity between protein-peptide interactions and the covering InteraX pairs?

We also examined whether the sequences of the wild-type protein-peptide interfaces are similar to their corresponding fragment covers from BriX, which have a similar backbone but not necessarily the same side chain composition. Therefore, we calculated the number of times a residue from the binding site is covered with exactly the same residue. Surprisingly, we did not find any correlation, with sequence similarities ranging from 0% to only 14%. We repeated the distance measurement between any two residues with the BLOSUM62 matrix, which gives a score for the likelihood of two amino acid residues replacing each other in homologous sequences (Henikoff & Henikoff, 1991). A negative BLOSUM score is given to less likely substitutions while a positive score implies a more likely substitution. This yields an average BLOSUM62 score of -0.67, thus reinforcing the idea of sequence independence between fragments from the monomeric proteins and the protein-peptide interface. Furthermore, no correlation exists between coverage and sequence similarity, as shown in Figure 5.5C.

We investigated whether the hydrophobic properties, the charge and the beta propensities of the amino acids covering the protein-peptide interfaces are conserved. We use the hydropathy index to measure hydrophobic properties (Kyte & Doolittle, 1982), the sidechain charge at pH 7 and the β-propensities (Street & Mayo, 1999). A value close to 0 represents high conservation. We note that none of the physical properties are clearly conserved by our coverage algorithm, as shown in Figure 5.7 for both the single InteraX pairs and for the best recombination of InteraX pairs from a single BriX protein. The difference in physical properties between two random sequences of 10,000 amino acids each is plotted with a red line (hydropathy = 3.269, charge = 0.431, β-propensities = 0.184). For coverage using single InteraX pairs, the physical properties for most protein-peptide interfaces are close to the random difference, while for coverage using a recombination of InteraX pairs from the same proteins slightly more variation exists. These results further stress the sidechain independence of the coverage algorithm.
While we do not observe sequence similarity, we went further and looked at the entire interaction network of the protein-peptide interfaces, compared with their matching fragments from our database. We looked at similarities in H-bond patterns, electrostatics, and volumetric properties and found that 88% of the electrostatic network, 95% of the H-bond patterns and 91% of the volumetric network of the original protein-peptide interfaces are retained in the BriX covers. Thus, while sequence identity is very low, geometric properties are retained, making fragments an alternative to homology modeling for protein-peptide interface design. In Chapter 6 we apply the idea developed in this chapter to peptide docking and design.
[Figure 5.7, panels A to F: histograms. x axes: difference in hydropathy, difference in charge, difference in beta propensity; y axes: number of protein-peptide interfaces.]

Figure 5.7: Physical properties are not conserved in the BriX covers. Histograms of the physical property conservation are shown for the entire dataset, for hydropathy (left), charge (middle) and beta propensities (right). On the X axis the difference between the properties of the amino acids in the protein-peptide interface and those from the BriX covers is plotted; on the Y axis the number of protein-peptide interfaces in the bins is shown. A, B, and C show the statistics for the coverage using single InteraX pairs, while D, E, and F show the statistics for the best recombination of InteraX pairs from a single protein. The conservation for the physical property between two random sequences is drawn with a red line.

5.3 Materials and Methods

5.3.1 Construction of a non-redundant data set of protein-peptide complexes

We constructed the non-redundant data set in a similar way as we constructed PepX (Section 4.2.1), with the difference that we required (1) peptides with a size from 5 to 14 amino acids (rather than up to 35), and (2) receptors with a minimum size of 25 amino acids (rather than 35). In total, 731 complexes were retained and clustered on their binding architecture, similar to the construction of PepX. Any two structures are grouped together if they superpose below 2 Å RMSD for at least 75% of their interfaces. In this way, we retained 258 unique protein-peptide interface clusters. The centroid of each cluster was selected for the dataset, while for clusters with more than 10 elements we selected 5 representative interfaces. The final dataset contains 301 representative protein-peptide interfaces. The interface size of the protein-peptide complexes varies between 3 and 55 residues, with an average of 21 residues in the binding site. 70% of all protein-peptide complexes have been annotated with SCOP (Murzin et al., 1995). Table 5.2 shows the coverage results for the top 9 clusters in the dataset. This original data set served as the basis to research the landscape of protein-peptide interactions as available from the PDB, and results are presented in Chapter 4.

Protein in complex with peptide           PDB ID     Members   FoldX ∆∆G (kcal/mol)   Two-body coverage, %   Single BriX coverage, %
Major Histocompatibility Complex (MHC)    2clrAC     172       -24.12                 82                     24
α-ligand binding domain                   3erdAC     69        -19.02                 85                     60
Bovine γ-chymotrypsin                     1ab9BCA    30        -14.24                 56                     30
Thrombin                                  1vzqHI     26        -7.21                  59                     24
Streptavidin                              1sldBP     24        -7.48                  35                     24
HIV-1 antibody                            1u8hABC    15        -8.31                  65                     50
HIV-1 protease                            2nxlABP    14        -18.61                 81                     29
SH3                                       1uj0AB     13        -12.14                 54                     35
PDZ                                       1w9qBS     8         -10.37                 81                     74

Table 5.2: Coverage statistics for the most populated classes in the protein-peptide dataset. The table shows statistics for the top 9 classes in our dataset, which account for 371 of the 731 protein-peptide complexes.

5.3.2 The dataset of protein fragments

BriX¹ is a database of canonical protein fragments obtained through fragmenting and clustering a set of 1261 high quality protein structures (Baeten et al., 2008).

¹ At the time of this study, we used the original version of this database as published by Baeten et al. (2008). In Chapter 2 we describe how we updated this database, increasing its size 7-fold and including a special clustering for loop structures.
Protein structures have been reconstructed using BriX fragments with an average accuracy of 0.48 Å RMSD, covering 99% of the original structure. The resulting alphabet of protein fragments varies in length from 4 to 14 amino acids, but in this study we have limited ourselves to fragments of length 5. Because 94% of all fragments of length 5 are classified, nearly all the protein data is used. In total, 258,474 fragments were clustered into 7744 structural classes for six different RMSD thresholds to allow different levels of structural variety. Moreover, fragment recombination, used to obtain ‘n-body’ interactions through recombination of InteraX pairs that have overlapping regions, gradually increases the fragment lengths. A complete analysis of the BriX database is presented in Chapter 2.

5.3.3 InteraX database

The InteraX database was constructed by mining all interactions in BriX between fragments of length 5. For every fragment in the BriX database, we looked for interacting fragments within the same protein. We defined the interaction between two residues of interacting fragments as follows: if the distance between any two atoms of the residues is less than the sum of their van der Waals radii plus 0.5 Å (Keskin et al., 2008), the residues are considered as interacting. Each InteraX pair has at least three of these interactions. Each pair is annotated using FoldX version 3.0 Beta 4 (unpublished).

5.3.4 Covering algorithm

The covering algorithm¹ harvests the wealth of data provided in BriX to reconstruct protein-peptide interfaces. Instead of considering single protein fragments, the backbone arrangements of the interactions between fragments (stored in the InteraX database) are the basis for reconstruction. The covering algorithm searches for similar backbone arrangements of fragment interactions between the entire BriX dataset and the protein-peptide interfaces².
¹ A generalized version of this original method that uses constraint satisfaction is described in Section 6.3.1. This new algorithm does more than simply covering the interface and can be used for de novo structure prediction and design.
² A step-by-step guide through the covering and reconstruction algorithms is available for online viewing at http://pepx.switchlab.org/content/related-work.
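The residue interaction criterion of Section 5.3.3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the InteraX implementation: residues are represented as lists of (element, coordinate) tuples, the van der Waals radii are commonly used Bondi-style values rather than the ones used in the thesis, and "at least three interactions" is read as at least three interacting residue pairs. All names are hypothetical.

```python
import math

# Assumption: illustrative van der Waals radii in Å (Bondi-style values);
# the radii actually used for InteraX are not given in the text.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def atoms_in_contact(atom_a, atom_b, margin=0.5):
    """True if two atoms are closer than the sum of their radii plus margin."""
    (element_a, xyz_a), (element_b, xyz_b) = atom_a, atom_b
    cutoff = VDW_RADII[element_a] + VDW_RADII[element_b] + margin
    return math.dist(xyz_a, xyz_b) < cutoff


def residues_interact(residue_a, residue_b, margin=0.5):
    """True if any atom pair between the two residues is in contact."""
    return any(
        atoms_in_contact(a, b, margin) for a in residue_a for b in residue_b
    )


def is_interax_pair(fragment_a, fragment_b, min_contacts=3):
    """True if the fragments share at least `min_contacts` interacting residue pairs."""
    contacts = sum(
        residues_interact(ra, rb) for ra in fragment_a for rb in fragment_b
    )
    return contacts >= min_contacts
```

The same predicate can be applied to residues from different chains, which is how the binding site of a protein-peptide complex is defined in the covering algorithm.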
In the first step, binding site residues are defined by measuring the distances between any two residues from different polypeptide chains, one from the receptor protein and the other from the ligand peptide. Residue interaction is defined as in Section 5.3.3. Fragments were constructed from interacting residues by sliding a window of five residues over the structure from the N- to the C-terminus. The window sliding starts 4 residues before the first interacting residue and ends 4 residues after the last interacting residue, such that residues near the binding site are used to facilitate the interaction search.

In a second step, the binding site fragments are covered with fragments from the BriX fragment database. Every fragment is compared with all the class centroids of BriX, using a superposition threshold of 1 Å. The four backbone atoms N, Cα, Cβ and O are used in the superposition, such that the directions of the original side chains are preserved in the covering fragments. Structural variation within the classes is tolerated up to 0.9 Å distance from the class centroid. We applied a lower threshold for highly redundant classes such as all-α classes, and raised the threshold for classes with few structural elements. To use all data available in BriX, all the fragments from the selected classes are loaded on the binding site fragments.

In a third step, the algorithm looks for architectural matches between InteraX pairs and fragment pairs from the protein-peptide binding site. Each fragment pair is created with one fragment from the receptor and another from the ligand. InteraX pairs are filtered on superposition on the BriX pair using a threshold of 1 Å for tight matches. Applying this procedure to the entire binding site results in a set of InteraX pairs (also called ‘two-body’ interactions) that cover the binding site of the protein-peptide complex.
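The window construction of the first step can be sketched as follows. This is a minimal illustration that assumes contiguous integer residue numbering within one chain and clamps the flanks at the chain termini; the function name and its parameters are hypothetical and not taken from the original implementation.

```python
def binding_site_windows(interacting, chain_start, chain_end, size=5):
    """Return (first, last) residue numbers of every sliding window.

    interacting : iterable of residue numbers that contact the partner chain
    chain_start, chain_end : first and last residue number of the chain
    size : window length (5 in this study)

    The scan starts size - 1 residues before the first interacting residue
    and ends size - 1 residues after the last one, so that every window
    contains at least one residue of the interacting stretch.
    """
    first, last = min(interacting), max(interacting)
    flank = size - 1
    lo = max(chain_start, first - flank)   # clamp at the N-terminus
    hi = min(chain_end, last + flank)      # clamp at the C-terminus
    # A window starting at s spans residues s .. s + size - 1.
    return [(s, s + size - 1) for s in range(lo, hi - size + 2)]
```

For interacting residues 10 and 12 in a long chain, this yields the windows 6-10 up to 12-16. Each window would then be compared with the BriX class centroids in the second step.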
In a final step, these ‘two-body’ interactions from the same BriX protein are combined into ‘n-body’ interactions with a superposition threshold of 2 Å, thus covering a larger part of the binding site with a single monomeric fold.

5.3.5 FoldX force field

The FoldX software was used to compute binding energies ∆∆G after local optimization of the side chains and to measure the side chain burial of the residues in both the protein-peptide dataset and the BriX database.

FoldX is a protein design algorithm that contains a force field used for rapid evaluation of protein stability, folding and dynamics as affected by mutations (Schymkowitz et al., 2005b). The force field makes a quantitative estimation of the interactions contributing to the stability of proteins (as used, for example, in Chapter 3 to measure protein stability after loop building) and to the stability of protein-protein, protein-peptide and protein-DNA complexes (as used in this chapter and in Chapters 4 and 6). The different energy terms taken into account in FoldX have been weighted using empirical data from protein engineering experiments, and the predictive power was tested on a large set of protein mutants covering many different structural environments found in proteins (Guerois et al., 2002). Energy estimates calculated by FoldX include terms that have been found to be important for protein stability.
The free energy of unfolding (∆G) of a target protein is calculated using:

\[ \Delta G = \Delta G_{vdw} + \Delta G_{solvH} + \Delta G_{solvP} + \Delta G_{wb} + \Delta G_{hbond} + \Delta G_{el} + \Delta S_{mc} + \Delta S_{sc} \tag{5.1} \]

where $\Delta G_{vdw}$ is the sum of the van der Waals contributions of all atoms with respect to the same interactions with the solvent; $\Delta G_{solvH}$ and $\Delta G_{solvP}$ are the differences in solvation energy for apolar and polar groups, respectively, when going from the unfolded to the folded state; $\Delta G_{hbond}$ is the free energy difference between the formation of an intramolecular hydrogen bond and that of an intermolecular hydrogen bond (with solvent); $\Delta G_{wb}$ is the extra stabilizing free energy provided by a water molecule making more than one hydrogen bond to the protein (water bridges), which cannot be taken into account with non-explicit solvent approximations; $\Delta G_{el}$ is the electrostatic contribution of charged groups, including the helix dipole; $\Delta S_{mc}$ is the entropy cost for fixing the backbone in the folded state (this term depends on the intrinsic tendency of a particular amino acid to adopt certain dihedral angles); and $\Delta S_{sc}$ is the entropic cost of fixing a side chain in a particular conformation. The energy values of $\Delta G_{vdw}$, $\Delta G_{solvH}$, $\Delta G_{solvP}$ and $\Delta G_{hbond}$ attributed to each atom type have been derived from a set of experimental data, and $\Delta S_{mc}$ and $\Delta S_{sc}$ have been taken from theoretical estimates.
FoldX also provides an estimate of the loss of free energy upon binding, i.e. ∆∆G, which can be calculated for protein-protein, protein-peptide or protein-DNA complexes:

\[ \Delta\Delta G_{AB} = \Delta G_{AB} - (\Delta G_{A} + \Delta G_{B}) + \Delta G_{kon} + \Delta S_{tr} \tag{5.2} \]

where $\Delta G_{AB}$ is the stability of the complex; $\Delta G_{A}$ and $\Delta G_{B}$ are the stabilities of each of the partners in isolation; $\Delta G_{kon}$ reflects the effect of the electrostatic interactions on the $k_{on}$ term for protein complexes; and $\Delta S_{tr}$ is the loss of translational and rotational entropy upon making the complex. The ∆∆G of the protein-peptide complex was computed using the FoldX command AnalyzeComplex after repairing the positions of the side chains using the FoldX command RepairPDB.

5.3.6 Statistical analysis

731 protein-peptide complexes were clustered in 258 distinct protein-peptide interface classes. Statistics were performed by distributing the data for the 258 protein-peptide classes in 20 bins and averaging the results within each bin. With this approach only general trends in the data are observed, as details are leveled out. For classes with more than 10 elements we took 5 representative elements and averaged the statistics per class.

5.4 Discussion

We have investigated whether interactions seen in protein-peptide complexes are different from those observed within monomeric proteins. Our study was motivated by the sheer abundance of monomeric protein structures compared to the lack of complex structures. We analysed all 301 non-redundant protein-peptide interactions available in the PDB. In this set our reconstruction method shows an overall reconstruction of 91% of the binding site in 41% of the cases, 62% of the binding site in 25% of the cases, and less than 19% of the binding site for the remaining 34%. In general, the reconstruction accuracy depends on the regularity of the structure related to secondary structure and H-bond patterns, but irregular structures are still covered to a good extent. Importantly, the reconstruction accuracy does not depend on side chain similarity but clearly reflects general architectural rules of polypeptides.

Using protein fragments to model protein-peptide interfaces opens up the way to incorporate the wealth of data on monomeric protein structures for protein-peptide binding prediction and design. We demonstrated that most interactions can be viewed as sets of pairwise interactions between protein fragments, identical to interactions in monomeric proteins. Not only have we shown that using fragments is an efficient way to look at interfaces, we also reached a level of detail in studying protein interactions that cannot be reached using fold comparison through SCOP or other protein classifications.

We strictly limited ourselves to superpositions of at most 1 Å RMSD, yet we did not observe any sequence relation between the protein-peptide interfaces and the BriX proteins, suggesting that the arrangement of the backbone is largely independent of the side chains in the bound complex. The interaction networks, however, were preserved between intra- and intermolecular interactions. Through recombination of pairwise fragment interactions we could reconstruct entire binding sites in some cases, revealing identical binding patterns between protein-peptide interfaces and parts of single chain folds.

Although most binding interfaces with regular structure can be covered, we note that loop interactions are often not covered or only partly covered, due to the large variety of loop interactions. Special consideration for loop fragments, as we presented in Chapter 2 and applied to loop reconstruction in Chapter 3, could increase this coverage.
Finally, this structural insight paves the way for a peptide design algorithm that employs the wealth of monomeric structural data to predict and design protein-binding peptides (Chapter 6).

Author Contributions

P.V., F.R., J.S., and L.S. conceptualized the study. P.V. performed the experiments. P.V., E.V., F.S., F.R., J.S. and L.S. performed the analysis. P.V. and J.S. wrote the paper.
References

Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the BriX collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. & Bourne, P.E. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235–42.
Bradshaw, J.M. & Waksman, G. (2002). Molecular recognition by SH2 domains. Adv Protein Chem, 61, 161–210.
Cesareni, G., Panni, S., Nardelli, G. & Castagnoli, L. (2002). Can we infer peptide recognition specificity mediated by SH3 domains? FEBS Lett, 513, 38–44.
Chandonia, J.M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M. & Brenner, S.E. (2004). The ASTRAL compendium in 2004. Nucleic Acids Research, 32, D189–92.
Cohen, M., Reichmann, D., Neuvirth, H. & Schreiber, G. (2008). Similar chemistry, but different bond preferences in inter versus intra-protein interactions. Proteins.
Collins, E.J., Garboczi, D.N. & Wiley, D.C. (1994). Three-dimensional structure of a peptide extending from one end of a class I MHC binding site. Nature, 371, 626–9.
Grizot, S., Fieschi, F., Dagher, M.C. & Pebay-Peyroula, E. (2001). The active N-terminal region of p67phox. Structure at 1.8 Å resolution and biochemical characterizations of the A128V mutant implicated in chronic granulomatous disease. J Biol Chem, 276, 21627–31.
Guerois, R., Nielsen, J.E. & Serrano, L. (2002). Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. Journal of Molecular Biology, 320, 369–87.
Henikoff, S. & Henikoff, J.G. (1991). Automated assembly of protein blocks for database searching. Nucleic Acids Research, 19, 6565–6572.
Itzhaki, L.S., Otzen, D.E. & Fersht, A.R. (1995). The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleation-condensation mechanism for protein folding. Journal of Molecular Biology, 254, 260–88.
Keskin, O., Gursoy, A., Ma, B. & Nussinov, R. (2008). Principles of protein-protein interactions: what are the preferred ways for proteins to interact? Chem Rev, 108, 1225–44.
Kippen, A.D., Sancho, J. & Fersht, A.R. (1994). Folding of barnase in parts. Biochemistry, 33, 3778–3786.
Klijn, C., Michaut, M. & Abeel, T. (2010). Highlights from the 6th International Society for Computational Biology Student Council Symposium at the 18th Annual International Conference on Intelligent Systems for Molecular Biology. BMC Bioinformatics, 11, I1.
Kolodny, R. & Levitt, M. (2003). Protein decoy assembly using short fragments under geometric constraints. Biopolymers, 68, 278–85.
Kyte, J. & Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157, 105–32.
Lapouge, K., Smith, S.J., Walker, P.A., Gamblin, S.J., Smerdon, S.J. & Rittinger, K. (2000). Structure of the TPR domain of p67phox in complex with Rac.GTP. Mol Cell, 6, 899–907.
Ma, B., Elkayam, T., Wolfson, H. & Nussinov, R. (2003). Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci USA, 100, 5772–7.
Mayer, B.J. (2001). SH3 domains: complexity in moderation. J Cell Sci, 114, 1253–63.
Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Neduva, V., Linding, R., Su-Angrand, I., Stark, A., de Masi, F., Gibson, T.J., Lewis, J., Serrano, L. & Russell, R.B. (2005). Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol, 3, e405.
Pawson, T. & Scott, J.D. (1997). Signaling through scaffold, anchoring, and adaptor proteins. Science, 278, 2075–80.
Petsalaki, E. & Russell, R.B. (2008). Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin Biotechnol, 19, 344–50.
Potapov, V., Reichmann, D., Abramovich, R., Filchtinski, D., Zohar, N., Halevy, D.B., Edelman, M., Sobolev, V. & Schreiber, G. (2008). Computational redesign of a protein-protein interface for high affinity and binding specificity using modular architecture and naturally occurring template fragments. Journal of Molecular Biology, 384, 109–19.
Reichmann, D., Rahat, O., Albeck, S., Meged, R., Dym, O. & Schreiber, G. (2005). The modular architecture of protein-protein binding interfaces. Proceedings of the National Academy of Sciences of the United States of America, 102, 57–62.
Remaut, H. & Waksman, G. (2006). Protein-protein interaction through beta-strand addition. TRENDS in Biochemical Sciences, 31, 436–44.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005a). The FoldX web server: an online force field. Nucleic Acids Research, 33, W382–8.
Schymkowitz, J.W.H., Rousseau, F., Martins, I.C., Ferkinghoff-Borg, J., Stricher, F. & Serrano, L. (2005b). Prediction of water and metal binding sites and their affinities by using the Fold-X force field. Proceedings of the National Academy of Sciences of the United States of America, 102, 10147–52.
Street, A.G. & Mayo, S.L. (1999). Intrinsic beta-sheet propensities result from van der Waals interactions between side chains and the local backbone. Proceedings of the National Academy of Sciences of the United States of America, 96, 9074–6.
Tsai, C.J., Xu, D. & Nussinov, R. (1997). Structural motifs at protein-protein interfaces: protein cores versus two-state and three-state model complexes. Protein Sci, 6, 1793–805.
Tsai, C.J., Xu, D. & Nussinov, R. (1998). Protein folding via binding and vice versa. Folding & Design, 3, R71–80.
Tuncbag, N., Gursoy, A., Guney, E., Nussinov, R. & Keskin, O. (2008). Architectures and functional coverage of protein-protein interfaces. J Mol Biol, 381, 785–802.
Vanhee, P., Verschueren, E., Baeten, L., Stricher, F., Serrano, L., Rousseau, F. & Schymkowitz, J. (2011). BriX: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Research, 39, D435–42.
Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. Journal of Molecular Graphics, 8, 52–6, 29.
Yaffe, M.B. (2002). Phosphotyrosine-binding domains in signal transduction. Nat Rev Mol Cell Biol, 3, 177–86.
6 Predicting peptide structure and specificity

Parts of this chapter are prepared for publication in

Peptide structure prediction from the architecture of proteins
Peter Vanhee*, Erik Verschueren*, Luis Serrano, Frederic Rousseau and Joost Schymkowitz¹
In preparation, December 2010.

and

The Multiple Specificity Landscape of Peptide Recognition Modules
David Gfeller, Frank Butty, Erik Verschueren, Peter Vanhee, Haiming Huang, Andreas Ernst, Nisa Dar, Igor Stagljar, Luis Serrano, Sachdev S. Sidhu, Gary D. Bader and Philip M. Kim
Molecular Systems Biology (in review), February 2011.

¹ Peter Vanhee and Erik Verschueren are joint first authors.

It is believed that around 30% of all interactions in the cell are directly or indirectly mediated by peptide interactions. However, structural evidence is lacking for many of these interactions. An in-silico method could therefore guide the search for and discovery of new peptide interactions, or produce structural models for known peptide interactions. Predicting not only the interaction between a peptide and a protein but also the shape of the peptide remains a very challenging problem, due to the large number of rotatable bonds and the flexibility in both the main chains and the side chains of the peptides. Here we present a novel, knowledge-based method to overcome some of those limitations, relying on the structural insights presented in Chapter 5. The method samples the reduced space of fragment conformations and their local interactions from ∼7000 globular proteins. We show the usefulness of our approach by redesigning the interaction scaffold of nine protein-peptide complexes, for which four of the peptides can be modeled to within 1 Å RMSD of the original peptide position. For two different protein domains, the PDZ domain and the α-Ligand Binding Domain (LBD), we predict peptide structure with sub-angstrom accuracy, using the sequence of the peptide combined with the structure of the domain, but without previous knowledge of the peptide structure. Furthermore, we show how the method correctly identifies the peptide-binding sites in these domains. Finally, we model domain specificity for the PDZ and the LBD, and make a structural prediction for a novel PDZ binding mode recently discovered through phage display.

6.1 Introduction

The process of predicting peptide structure when bound to a target protein can essentially be divided into three steps:

1. Model the target protein structure, either by using a structure from the PDB or through homology modeling.
2. Predict the potential binding site on the surface of the protein structure.
3. Model a peptide structure in the binding site with high resolution.

We speak of docking when the backbone structure of the peptide is known beforehand (Section 6.2.1), and we call this process design when the peptide structure is not known beforehand (from Section 6.2.2 onwards). For a comprehensive overview of the field of computational peptide design, we refer to Section 1.3.
6.2 Results

6.2.1 Peptide docking using interaction patterns from InteraX

We investigated whether InteraX motifs (interactions between monomeric fragments, see Section 5.2.1) have the predictive capacity to dock rigid peptide structures in a selected binding site on the protein interface. For nine protein-peptide complexes (the centroids of nine representative clusters from PepX, accounting for approximately half of our protein-peptide data set, Figure 4.1) we rebuilt the interfaces using the original sequence and structure of the peptide in the binding pocket, but without previous knowledge of the interaction pattern of the protein-peptide complex. The side chains of the interface residues are rebuilt with the all-atom force field FoldX and ranked based on the ∆∆G binding energy. The algorithm is described in detail in Section 6.3.

[Figure 6.1, panel labels: 1w9qB + S; 1w9qB + 1k6z.]

Figure 6.1: Docking of the PDZ peptide using InteraX patterns. At the left, the superposition between the BriX protein (PDB 1K6Z, grey) and the PDZ domain (PDB 1W9Q, cyan) on the receptor β-strand is shown. The fragments from the BriX protein are colored blue, the receptor fragment from the PDZ domain is colored green. At the right, the peptide ligand according to the interaction motif from the BriX protein is shown (red), superposed on the original ligand (yellow), with RMSD 0.29 Å.

Protein-peptide complex                   PDB ID     RMSD, Å
Major Histocompatibility Complex (MHC)    2clrAC     0.79
α-ligand binding domain                   3erdAC     1.96
Bovine γ-chymotrypsin                     1ab9BCA    2.87
Thrombin                                  1vzqHI     1.88
Streptavidin                              1sldBP     1.38
HIV-1 antibody                            1u8hABC    1.29
HIV-1 protease                            2nxlABP    0.51
SH3                                       1uj0AB     0.73
PDZ                                       1w9qBS     0.29

Table 6.1: Peptide docking using InteraX. Accuracy of rigid peptide docking on 9 representative classes from PepX (Figure 4.1). For coverage statistics of these interfaces using InteraX patterns, and thus with knowledge of the binding interface, see Table 5.2.

In four of the nine cases, we are able to position the peptide ligand to within 1 Å RMSD of the original position (Table 6.1). For example, the peptide bound to the PDZ domain can be positioned to within 0.29 Å RMSD of the original peptide, using the β-β interaction pattern observed in an unrelated secretion chaperone (Figure 6.1). For the MHC, the algorithm finds the correct peptide position to within 0.79 Å RMSD using an α-loop InteraX pair of an unrelated BriX protein (PDB ID 1AJS). In another four cases, the ligand was placed correctly to within 2 Å, and for the remaining case, the algorithm was not able to filter out the correct positions of the ligand, due to the lack of interaction motifs in InteraX that superpose sufficiently closely on the receptor fragments.

6.2.2 De novo peptide structure prediction using interaction patterns from InteraX

For peptide docking, a structure of both the ligand and the domain is required. In many real-world scenarios, however, no structure of the protein-peptide complex exists, and methods based on homology modeling or first principles (ab initio) need to be employed. In the next sections, we remove the need for a structure of the peptide by using backbone templates from the InteraX database and side chain reconstruction with the FoldX force field. Our design strategy assumes no previous knowledge of either the peptide structure or its orientation and is driven by the search for favorable binding energy. We also show that no previous knowledge of the binding site is required to design the peptides. Finally, for the PDZ domain we show that we can capture the specificity profile of this domain by comparing our results to phage display peptide binding assays.

6.2.3 Case study: PDZ peptide design and specificity

We evaluated our peptide design method on PDZ domains, flexible domains that mediate protein-protein interactions mainly by binding the C-termini of target proteins (Nourry et al., 2003). These C-terminal peptides bind in an elongated surface groove as an antiparallel β-strand, interacting with an exposed β-strand and α-helix of the PDZ domain (Figure 6.2A). PDZ domains have been classified in discrete classes: for example, Class I domains recognize peptides with the consensus motif Ser/Thr-X-Val/Leu/Ile-COOH. However, such discretizations of the peptide sequence space have recently been challenged in favor of a more even distribution (Stiffler et al., 2007). Only a limited number of PDZ-peptide complex structures is available, e.g. 17 out of 54 human PDZ domains (Smith & Kortemme, 2010), such that modeling peptide structure and peptide specificity remains a challenge.

Here, we apply our method to model peptide ligands for the PDZ1 domain of DLG3 (categorized under Class I, PDB 2I1N). We define possible anchor fragments in the PDZ domain at residues 140-144, 141-145, 142-146 and 143-147 and search the space of InteraX patterns for all possible interacting pairs. Next, we apply a series of filters that screen and rank the candidates by backbone clashes, hydrogen bonding and, finally, the full ∆∆G interaction energy after building the wild type sequence (EETSV) using FoldX (Section 6.3).
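The anchor definition and the filter cascade described above can be sketched as follows. This is an illustrative Python sketch, not the algorithm of Section 6.3: the `Candidate` record, the function names and the thresholds `max_clashes` and `min_hbonds` are hypothetical, and the clash count, hydrogen bond count and ∆∆G are assumed to have been computed beforehand (for example with FoldX).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A peptide backbone template placed on an anchor fragment (hypothetical record)."""
    name: str
    backbone_clashes: int   # number of backbone clashes with the receptor
    hydrogen_bonds: int     # number of peptide-receptor hydrogen bonds
    ddg: float              # estimated interaction energy, kcal/mol


def anchor_fragments(first_start, last_start, size=5):
    """Overlapping anchor windows, e.g. 140-144 ... 143-147 for starts 140..143."""
    return [(s, s + size - 1) for s in range(first_start, last_start + 1)]


def rank_candidates(candidates, max_clashes=0, min_hbonds=2):
    """Filter on clashes and hydrogen bonding, then sort by interaction energy.

    Assumption: both thresholds are placeholders chosen for illustration.
    A more negative ddg means more favorable binding, so it ranks first.
    """
    kept = [
        c for c in candidates
        if c.backbone_clashes <= max_clashes and c.hydrogen_bonds >= min_hbonds
    ]
    return sorted(kept, key=lambda c: c.ddg)
```

Applying the cheap geometric filters before the energy evaluation keeps the number of expensive side chain reconstructions small, which is the usual motivation for ordering the steps this way.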
Our designs compare favorably to the bound peptide of the X-ray model, with 9 templates less than 1 Å RMSD away from the crystallographic peptide, counting distance between the backbone atoms (Figure 6.2B, black circles). The interaction energy and important hydrogen bonding patterns that contribute to binding are kept intact, such as Thr-His, or the carboxyl hydrogen bonding pattern. Interestingly, most of the variation in our designed peptides resides in the N- 137
  • 156.
terminal region (inset in Figure 6.2B), which is the most flexible part of the peptide and lacks a clear sequence signature. The binding pocket of the PDZ domain extends only to the last three C-terminal residues, which are consequently the most structurally constrained. This is reflected in the designs by average RMSD values < 0.5 Å for these residues.

In a second round of experiments, we introduce more structural variation in the peptide designs through a series of backbone moves using the BriX database (Section 6.3.2). This fine-grained step optimizes peptide binding: the estimated interaction energy is lowered by 1.8 kcal/mol, and the number of peptide designs within 1 Å RMSD of the X-ray peptide increases from 9 to 38 (Figure 6.2B, blue circles).

[Figure 6.2, image not reproduced. Panel A labels: His196, Arg136, carboxylate-binding loop; peptide residues Glu-4, Glu-3, Thr-2, Ser-1, Val0; wildtype versus design. Panel B: title "PDZ1 (closed loop)"; x-axis "RMSD (design to crystallographic peptide) (Å)"; y-axis "Estimated ∆∆G interaction, FoldX (kcal/mol)"; series "Ab initio design" and "BriX refinement"; inset box plot "RMSD per residue" for positions -4 to 0.]

Figure 6.2: Peptide design for the canonical PDZ domain. (A) Top design for the canonical PDZ domain (PDB 2I1N). RMSD against the crystallographic peptide is 0.79 Å. (B) Ab initio design samples peptides that are < 1 Å separated from the crystallographic peptide. When applying BriX backbone moves on the top designs, this region is even more densely sampled. The C-terminal region (residues 0, -1, -2) is the most constrained part in our designs, as shown by the inset box plot, corresponding to the PDZ Class I binding motif (Ser/Thr-X-Val/Leu/Ile) (Nourry et al., 2003).
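The RMSD figures quoted above (per-design RMSD to the crystallographic peptide, and the number of designs below 1 Å) can be illustrated with a minimal sketch. It assumes that design and reference coordinates are already expressed in the same receptor frame, so no re-superposition is applied; that assumption, the function names and the toy coordinates are illustrative and are not taken from the thesis.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def count_within(designs, reference, cutoff=1.0):
    """Number of designs whose RMSD to the reference is below the cutoff (in Å)."""
    return sum(1 for design in designs if rmsd(design, reference) < cutoff)


# Toy coordinates (invented for illustration, not taken from any PDB entry).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
close_design = [(0.0, 0.5, 0.0), (3.8, 0.5, 0.0), (7.6, 0.5, 0.0)]
far_design = [(0.0, 3.0, 0.0), (3.8, 3.0, 0.0), (7.6, 3.0, 0.0)]

n_close = count_within([close_design, far_design], reference)
```

Restricting the coordinate lists to the atoms of a single residue gives a per-residue RMSD of the kind summarized in the inset box plot.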
Multiple specificity of the PDZ1 domain and structural prediction

PDZ domains are highly specific to different peptidic ligands; an example is the first PDZ domain of DLG1. The family of DLG (Discs Large Homolog) proteins consists of 4 highly conserved paralogs, which have 3 PDZ domains each, and the sequences of these PDZ domains are very similar. Gfeller et al. (2010) recently developed a method that detects correlated positions in peptide-binding data obtained from phage display experiments. When applied to the first PDZ domain of DLG1, the method detected two different specificity profiles (Figure 6.3), suggesting that this PDZ domain can bind two different peptide ligands. However, structural evidence for this observation does not exist.

In order to interpret this observed multiple specificity structurally, we note that the two sequence logos representing the multiple specificity (Figure 6.3C) align well when the second one is shifted by one position, indicating the presence of an additional residue at the C-terminus. However, a significant displacement of the carboxylate-binding loop (Figure 6.2A) is required in order to accommodate this extra residue. As we previously observed (Section 3.7), this loop adopts a number of conformations, and we selected the conformation that shows the largest movement away from the original binding site (PDB 2WL7). This structure of DLG2 PDZ1 was recently crystallized as a trimer with the C-terminally extended peptide RRRPIL (Fiorentini et al., 2009). In this crystal structure, Ile-1 is found at the same spatial position as Val0 in PDB 2I1N (Figure 6.2A), and the two domains share >80% sequence identity. However, a complete structural model of this novel binding specificity could not be derived from this crystal structure, since only the last three residues (PIL) are in contact with the binding site.
Hence, we applied our method to design a peptide that confirms the experimentally observed peptide specificity. Figure 6.4A shows the top five structural models obtained with the sequence ETDIW (derived from the last logo in Figure 6.3C). The predicted interaction energy for the top model (-8.50 kcal/mol) compares favorably to the one computed for the original DLG3 PDZ1 structure (PDB 2I1N, -7.3 kcal/mol). The C-terminal Trp can be accommodated by the displaced carboxylate-binding loop, and while the 2I1N peptide COOH forms a hydrogen bond mediated by a water molecule to a
conserved Arg/Lys (Doyle et al., 1996), the Trp-COOH directly forms a hydrogen bond with this Arg.

[Figure 6.3, image not reproduced. Panel A: phage peptides aligned on C-terminal positions -6 to 0. Panel B: hierarchical clustering of the peptides by sequence similarity. Panel C: sequence logos (bits against C-terminal positions -6 to 0) for the single PWM and for the multiple PWMs.]

Figure 6.3: Multiple specificity for DLG1 PDZ1 binding peptides. (A) Phage peptides binding to the first PDZ domain of the human protein DLG1, aligned from the C-terminus. The last five positions (red box) display positional correlations. Pairs of significantly correlated positions (MI p-value < 0.001) are connected with a red edge, others with a black edge. An example of correlation can be found between the two last columns: W or L at position 0 always appears with I at -1, while V at position 0 is never found with I at -1. (B) Hierarchical clustering of the peptides shown in A. The two main clusters (orange dashed line) are the ones identified by the method of Gfeller et al. (2010). Positional correlations are successfully removed within the clusters (black edges). (C) Sequence logos for the single PWM (left) and the multiple PWMs (right), together with their respective weights. This figure is taken from the manuscript prepared by Gfeller et al. (2010).
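The positional correlations in panel A are scored with mutual information (MI). As a rough illustration of that quantity only, the sketch below computes the MI between two alignment columns. The significance test behind the reported p-values is not reproduced, and the peptide list is a hypothetical toy set that merely mimics the W/L-with-I pattern described in the caption; it is not the phage display data.

```python
import math
from collections import Counter


def column(peptides, position):
    """Residues at a C-terminal position (0 = last residue, -1 = the one before)."""
    return [peptide[len(peptide) - 1 + position] for peptide in peptides]


def mutual_information(column_i, column_j):
    """Mutual information, in bits, between two aligned columns of residues."""
    n = len(column_i)
    if n == 0 or n != len(column_j):
        raise ValueError("columns must be non-empty and of equal length")
    counts_i = Counter(column_i)
    counts_j = Counter(column_j)
    counts_ij = Counter(zip(column_i, column_j))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written in terms of counts
        mi += (count / n) * math.log2(count * n / (counts_i[a] * counts_j[b]))
    return mi


# Hypothetical toy peptides, aligned from the C-terminus.
peptides = ["TDIW", "TDIW", "TLIL", "ETLV", "ETWV", "ETDV"]

mi_last_two = mutual_information(column(peptides, -1), column(peptides, 0))
mi_constant = mutual_information(["T"] * len(peptides), column(peptides, 0))
```

A column that never varies carries no information about any other column, so `mi_constant` is zero, whereas the coupled last two columns of the toy set give a positive value.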
We then studied amino acid preferences at each peptide position by performing in silico mutagenesis experiments, mutating every position step by step to all possible amino acids and evaluating the interaction energies. These preferences are represented in a Position Weight Matrix (PWM) and graphically shown as a heat
map. Figure 6.4B shows the PWM for the designed peptide, which agrees well with experimental phage display data (Tonikian et al., 2008). Interestingly, we predict a clear preference for Trp (as well as Phe or Tyr) at the C-terminal position.

Our results provide a structural explanation of the predicted multiple specificity of PDZ1: one specificity follows the canonical C-terminal PDZ binding mode (Figure 6.2), and another, unexpected one allows for an additional residue at the C-terminus (Figure 6.4). Remodeling of the carboxylate-binding loop appears to be often associated with non-canonical binding modes of PDZ domains, as we observed by analysis of a set of PDZ domains from the PDB (Gfeller et al. (2010) and Section 3.2.3). Through phage display clustering (Gfeller et al., 2010) in combination with structural modeling, these peptides together with their specificity can be detected and confirmed.

[Figure 6.4, image not reproduced. Panel A labels: peptide residues Glu-4, Thr-3, Asp-2, Ile-1, Trp0; domain residues His165, Arg105, Ser113. Panel B: heat map of amino acids (G A L V I P R T S C M K E Q D N W Y F H) against peptide positions, with a color scale running from decreasing to increasing energy.]

Figure 6.4: Structural prediction for the alternative PDZ specificity. (A) Computational modeling of DLG1 PDZ1 in complex with an extended ligand ETDIW corresponding to the new specificity observed in phage display (Figure 6.3). The top 5 models predicted by our method are shown, using the template PDB 2WL7, which has a displaced loop that could accommodate multiple specificity (see also Section 3.7 for the conformations this loop might adopt). (B) PWM showing amino acid preferences in the model of extended ligand binding to DLG1 PDZ1. The amino acids of the initial ligand ETDIW are marked with black boxes.
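The in silico mutagenesis scan behind a PWM such as the one in panel B can be sketched as follows. This is only an outline of the bookkeeping: `toy_energy` is a deliberately arbitrary, hypothetical stand-in for the FoldX interaction energy, and storing energy differences relative to the starting peptide is an assumption about how the matrix is filled, not a description of the thesis code.

```python
AMINO_ACIDS = "GALVIPRTSCMKEQDNWYFH"  # same order as the heat map axis


def mutagenesis_matrix(sequence, energy_fn):
    """For each position, the energy change of every single-residue substitution.

    Returns a list (one entry per position) of dicts mapping amino acid to
    energy_fn(mutant) - energy_fn(sequence).
    """
    reference = energy_fn(sequence)
    matrix = []
    for position in range(len(sequence)):
        row = {}
        for amino_acid in AMINO_ACIDS:
            mutant = sequence[:position] + amino_acid + sequence[position + 1:]
            row[amino_acid] = energy_fn(mutant) - reference
        matrix.append(row)
    return matrix


STARTING_PEPTIDE = "ETDIW"


def toy_energy(sequence):
    """Hypothetical placeholder score: -1 per residue identical to the starting
    peptide. A real scan would evaluate a structural model with FoldX here."""
    return -sum(1.0 for a, b in zip(sequence, STARTING_PEPTIDE) if a == b)


matrix = mutagenesis_matrix(STARTING_PEPTIDE, toy_energy)
```

Each row of `matrix` corresponds to one column of the heat map; negative entries would mark substitutions predicted to bind better than the starting residue.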
PDZ peptide binding site prediction

Many methods exist to predict peptide binding sites, for example using geometric amino acid-dependent preferences derived from protein-peptide binding interfaces (Petsalaki et al., 2009). Here, we use the information intrinsically contained in the InteraX pairs to predict ‘hot spot’ residues on the surface of the PDZ domain (Section 6.3.3).

[Figure 6.5, image not reproduced. Panel A: title "PDZ1 (closed loop)"; x-axis "RMSD (design to crystallographic peptide) (Å)", range 0 to 30; y-axis "Estimated ∆∆G interaction, FoldX (kcal/mol)"; series "Ab initio design". Panels B-D: surface views of the PDZ domain.]

Figure 6.5: PDZ peptide-binding site prediction. (A) Sub-angstrom prediction of the crystallographic peptide when removing information about the binding site. (B-D) Different views of hot spot prediction, showing a clear preference (red) for the crystallographic binding site.

Figure 6.5A shows the energetic landscape when designing peptides without
previous knowledge of the binding site. It shows a clear energy funnel in the [0-2] Å range, providing evidence for a single peptide binding pocket in the PDZ domain, as is generally accepted (Nourry et al., 2003). This example illustrates the use of our method for the prediction of peptide-binding sites.

6.2.4 Case study: helical peptide design for the estrogen receptor ligand-binding domain

Helical peptides are important building blocks of a protein’s structure, but are also often encountered in protein-protein interfaces (Jochim & Arora, 2009). In one such case, that of the estrogen receptor (ER), interactions between helical protein segments play an important role in the estrogen signaling pathway. The estrogen receptor is a transcription factor that is responsible for estrogen signaling and regulates a number of important hormone-dependent processes in the cell (Heldring et al., 2007). ER contains an evolutionarily conserved ligand-binding domain (LBD) that, after agonist activity induced by estrogen, binds leucine-rich peptides (LxxLL, with x any amino acid) in a hydrophobic pocket formed by the helices H3, H5 and H12 (Figure 6.6A). In the absence of an agonist, the LBD binds its own C-terminal tail (H12), which has a similar leucine-rich signature. Finally, when the LBD binds an antagonist, e.g. Tamoxifen or Fulvestrant (Heldring et al., 2007), the binding pocket is partially or completely deformed, peptides can no longer bind, and the estrogen signaling pathway is blocked.

Here, we aim to design a peptide that blocks the hydrophobic pocket formed by H3, H5 and H12. Unlike with PDZ peptide design (Section 6.2.3), we take a ‘multi-body’ approach: we look for BriX fragments that simultaneously interact with fragments from H3, H5 and H12, as defined by InteraX (Section 6.3.1).
The advantage is that peptides will be optimized for binding the entire interface and not just a single fragment at a time, as they will be constrained to interact simultaneously with H3, H5 and H12. We use the sequence signature of the peptide (LHRLL) and the structure of the unbound LBD from PDB 3ERD (Shiau et al., 1998). We note that while it is relatively easy to position a β-strand owing to the constrained backbone hydrogen bonding patterns (as in the case of PDZ), the α-α interactions are much less constrained because they lack any
backbone hydrogen bonds and most of the free energy is contributed through hydrophobic packing. Therefore we rely on a combination of (1) multi-body fragment interaction optimization using InteraX, (2) backbone clash filters and (3) total interaction energy measured by FoldX after building fully flexible side chains on the peptide. Furthermore, while the original peptide ligand from the crystal structure contains 11 residues, we limit our design to a fragment of 5 residues, which accurately captures the LxxLL pattern, and superpose the helical structure of the 3ERD peptide on the designed peptide to extend it to 11 residues.

[Figure 6.6, image not reproduced. Panel A labels: Helix 3, Helix 5, Helix 12.]

Figure 6.6: Estrogen receptor α-LBD binding site. The binding site of the LBD is formed by three helices (A) that together create a hydrophobic groove accommodating the peptide (B). The residue stretches that are used as the anchor fragments in the design are shown in purple, red and green.

Figure 6.7A shows the designed structure compared with the crystal structure (RMSD = 1.05 Å). There is only slight variation in the three conserved leucine residues that contribute most of the binding energy within the hydrophobic binding pocket (colored in yellow). The energy landscape of 4052 models reveals a clear ‘energy funnel’ towards designs close to the crystallographic peptide (Figure 6.7B). The best designs are within [0-1] Å RMSD (1 design), [1-1.5] Å (2 designs) and [1.5-2.0] Å (13 designs). We also evaluated the energies of different 11-residue LBD-binding peptides distilled from the protein-peptide database PepX
(Chapter 4). When comparing the estimated energies of our designs to this cluster (Figure 6.8A), we find that our best designs (∼15 kcal/mol) are within the range of energies observed for other crystal structures (∼15-25 kcal/mol). While this does not validate our results, it shows that the algorithm samples the range of energetic values estimated from experimentally resolved structures.

[Figure 6.7, image not reproduced. Panel A labels: Leu690, Leu693, Leu694; wildtype versus design. Panel B: title "Alpha Ligand Binding Domain peptide design (3ERD)"; x-axis "RMSD (design to crystallographic peptide) (Å)"; y-axis "Estimated ∆∆G interaction, FoldX (kcal/mol)".]

Figure 6.7: Helical peptide design for the LBD. (A) Crystal structure (3ERD) versus designed peptide, indicating three leucine residues constituting the binding motif. The LBD is represented in surface view; hydrophobic residues are colored yellow, hydrophilic residues are colored green. (B) Energy landscape of the designs, showing a clear ‘funnel’ near the crystal structure.

Predicting LBD peptide specificity profiles

We tested whether the designs have the predictive capacity needed to explain the LxxLL peptide specificity. We performed in silico mutagenesis experiments, mutating every residue of the best five designs to all twenty amino acids and evaluating the binding energies using FoldX. The method correctly detects a preference for the three conserved leucines (Figure 6.8B). The Position Weight Matrix (PWM) also shows a preference for the hydrophobic residues Ile and Met at these positions, but we could not verify these
observations with experimental data.

[Figure 6.8, image not reproduced. Panel A: title "Alpha LBD - Peptide, PepX cluster 2GTK T1 A75, ID=4845"; x-axis ligand size (8 to 20); y-axis interaction energy (-30 to 0); series "DDG FoldX", "Backbone HBond" and "Sidechain HBond". Panel B: heat map of amino acids against peptide positions 687 to 697, with positions 690, 693 and 694 highlighted and a color scale running from decreasing to increasing energy.]

Figure 6.8: Recognizing LBD peptide specificity on designed templates. (A) Comparison of the energy landscape on 111 LBD-peptide complexes. The peptides that have a similar length to the designed peptide are marked with a red box. (B) PWM of the amino acid preferences for the designed LBD peptide. The amino acids of the ligand motif LxxLL are marked with black boxes.

Predicting the LBD peptide-binding site

We predicted peptide-binding hot spot residues by scanning the entire surface of the LBD, using the protocol described in Section 6.3.3. Figure 6.9 shows our designs, which display a clear energy funnel close to the crystallographic peptide. Interestingly, we also observe a second energy funnel at 30-40 Å RMSD from the crystallized peptide, with binding energies that differ by only 2 kcal/mol. Looking into the functional role of this putative alternative binding site (oriented at the back of the LBD peptide-binding site), we discovered that this hot spot corresponds to the dimer interface of the LBD. Even though the second LBD domain that binds as a dimer has a different sequence signature at these positions, this second hydrophobic pocket also seems to accommodate the leucine-rich motif displayed by the peptide. Experimental validation is lacking since these domains are often crystallized as a dimer. Our findings nevertheless highlight the potential of the method to detect biologically relevant binding sites.
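One simple way to look for several energy funnels in an RMSD-versus-energy landscape like the one described above is to bin the designs by RMSD and compare the best energy per bin. The sketch below is an assumed post-processing step written for illustration; the bin width, the tolerance and the toy landscape values are invented, and the thesis does not state how funnels were identified beyond inspection of the plot.

```python
def lowest_energy_per_bin(designs, bin_width=2.0):
    """designs: iterable of (rmsd, energy) pairs.

    Returns a dict mapping the lower edge of each RMSD bin to the lowest
    (most favorable) energy found in that bin, ordered by RMSD.
    """
    best = {}
    for rmsd_value, energy in designs:
        bin_start = int(rmsd_value // bin_width) * bin_width
        if bin_start not in best or energy < best[bin_start]:
            best[bin_start] = energy
    return dict(sorted(best.items()))


def candidate_funnels(best_per_bin, tolerance):
    """Bins whose best energy lies within `tolerance` of the global minimum."""
    global_minimum = min(best_per_bin.values())
    return [b for b, e in best_per_bin.items() if e - global_minimum <= tolerance]


# Invented toy landscape: (RMSD in Å, energy in kcal/mol).
toy_designs = [(0.8, -15.0), (1.5, -14.0), (12.0, -6.0),
               (34.0, -13.0), (36.5, -12.5), (50.0, -4.0)]

best_per_bin = lowest_energy_per_bin(toy_designs)
funnel_bins = candidate_funnels(best_per_bin, tolerance=2.0)
```

Bins far from the reference peptide that still come close to the global minimum are the ones worth inspecting structurally, as was done here for the dimer interface.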
[Figure 6.9, image not reproduced. Panel A: title "3ERD (full peptide)"; x-axis "RMSD (design to crystallographic peptide) (Å)", range 0 to 50; y-axis "Estimated ∆∆G interaction, FoldX (kcal/mol)"; series "Ab initio design". Panels B-D: surface views, with the peptide binding site and the dimerization binding site (D) labeled.]

Figure 6.9: LBD peptide-binding site prediction. (A) Energy landscape of peptide prediction when removing information about the binding site, showing two clear funnels (0-2 Å and 30-40 Å separated from the actual binding site). (B-D) Different views of the hot spots on the surface of the LBD with the predicted peptide in green: (C) shows the actual binding site and (D) shows predicted hot spots corresponding to the dimerization site of the LBD.
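The hot spot surfaces in panels B-D follow the protocol of Section 6.3.3: fragments are taken with a sliding window of length 5 along the chain, and each residue is assigned the best binding energy of any design it was part of. A minimal sketch of that bookkeeping is given below. It assumes that "part of" means the residue belongs to the receptor fragment a design was anchored on; the data layout and the toy energies are hypothetical.

```python
def sliding_windows(n_residues, window=5):
    """All windows of `window` consecutive residue indices, N- to C-terminus."""
    return [list(range(start, start + window))
            for start in range(n_residues - window + 1)]


def hot_spot_scores(n_residues, designs):
    """designs: iterable of (residue_indices, binding_energy) pairs.

    Returns, for every residue, the lowest binding energy among the designs
    the residue was part of, or None if no design involved it.
    """
    scores = [None] * n_residues
    for residues, energy in designs:
        for residue in residues:
            if scores[residue] is None or energy < scores[residue]:
                scores[residue] = energy
    return scores


# Toy example: an 8-residue chain with one invented best energy per window.
windows = sliding_windows(8)
toy_energies = [-1.0, -5.0, -2.0, -0.5]
scores = hot_spot_scores(8, zip(windows, toy_energies))
```

Coloring the surface by `scores` then highlights residues that take part in at least one favorable design, which is how a binding pocket shows up as a contiguous patch.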
6.3 Materials and Methods

6.3.1 A constraints-based framework for peptide design

The BriX Design method is implemented as a Constraint Satisfaction Problem (CSP), a mathematical problem defined as a set of objects whose state must satisfy a number of constraints and for which a solution is sought using constraint satisfaction methods (Kumar, 1992). While a formal description of this actively researched class of problems is outside the scope of this thesis, we explain how the problem is formulated for peptide prediction. Every CSP essentially consists of three elements (Figure 6.10A):

1. V is the set of Variables. Each variable is a fragment of five residues, laid out along the chain of the target protein. For example, Figure 6.10B shows three variables (V1, V2 and V3) which represent three non-overlapping fragments in the binding site of the PDZ domain. These fragments can be designated fragments in the binding site (when the binding site is known), or can cover the entire chain of the protein (when doing blind experiments). The special variable V0 represents the fragment to design.
2. D is the Domain. The domain contains a set of values for every variable. Initially, the domain of a variable consists of all the BriX fragments. This domain is reduced by the different constraints during the course of the algorithm.
3. C is the set of Constraints. Constraints can be unary, i.e. constrain the domain of a single variable, binary, i.e. constrain the domains of two variables simultaneously, or n-ary, i.e. constrain the domains of n variables simultaneously.

Constraints

Constraints limit the domains of the variables. The three most important constraints in BriX Design are:

• BriX fragments: unary constraint that constrains the domain of fragments from the entire BriX database towards BriX fragments that superpose on
the fragment (usually with a backbone RMSD value < 1 Å) described by the variable. This is the most limiting constraint.
• InteraX: binary constraint between two variables, usually between one fragment in the binding site (V1−3) and the fragment to design (V0). Two variables are constrained to follow the same interaction patterns as those present in the InteraX database (Section 5.2.1). Even though this constraint is binary in nature, chaining several InteraX constraints makes the search look for interaction patterns between multiple variables, e.g. to reconstruct the entire peptide-binding site with a single protein (Section 5.2.3).
• FoldX free energy: n-ary constraint that is fired when all variables are assigned, such that a structural model can be produced. The interaction energy is decomposed in different terms: clashes, backbone hydrogen bonds, sidechain hydrogen bonds, ∆∆G after building a poly-alanine peptide sequence, and ∆∆G after building a user-given peptide sequence. The order in which these energy constraints are applied can significantly increase the speed of the algorithm, e.g. by first filtering on clashes, then on backbone hydrogen bonds, and finally on ∆∆G of an all-atom model.

[Figure 6.10, image not reproduced. Panel A: cartoon of a CSP with variables V0, V1, V2, domains D0, D1, D2 and constraints C1, C2, C3. Panel B: structural model with the designed fragment V0.]

Figure 6.10: Method: BriX peptide design implemented as a Constraint Satisfaction Problem. (A) Cartoon representation of a CSP consisting of Variables, Domains and Constraints. (B) The structural model that corresponds to the instance of a CSP: Variables are fragments in the PDZ peptide-binding site with domains that span the space of possible backbones matching these fragments (shown in red, blue and yellow). The CSP looks for solutions that are overlapping in these three variables by applying different constraints, here resulting in a modelled peptide structure for V0.

Heuristics

Heuristics are used to speed up the process of finding an assignment of all variables that satisfies all the constraints in the CSP. These ‘rules of thumb’ shape the search tree and thus define the order in which solutions are found by the search algorithm. BriX Design uses only one type of heuristic: a value is always assigned from the smallest domain in the system, which typically belongs to the most constrained variable and thus quickly leads to a solution of the problem.

Search

A CSP can be resolved using constraint satisfaction methods (Tsang, 1993). Since our CSP has finite domains (i.e. BriX fragments) as opposed to nearly infinite domains (i.e. the space of all dihedral angles), we can apply search methods that traverse the domains of the variables until a solution has been found that satisfies all constraints. Initially, all variables are unassigned. The variable with the smallest domain (typically the most constraining fragment in the protein environment) is chosen and possible values are assigned in turn. Whenever a variable is assigned, constraints that are connected to this variable are triggered. For example, the InteraX constraint will be fired immediately after a BriX fragment is assigned to the β-strand of the PDZ domain (V1=fragA), limiting the domain of V0 to all fragments that are present in the InteraX database having fragA as interaction partner.
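The search procedure of this section (smallest-domain variable selection, constraint triggering on assignment, a final n-ary check, and backtracking on an empty domain) can be sketched in a few lines. The sketch is a simplified forward-checking solver written for illustration; it is neither Gecode nor the BriX Design implementation, and the fragment names, the allowed pairs and the energies are hypothetical placeholders for BriX, InteraX and FoldX data.

```python
def solve(domains, binary, final_check):
    """Depth-first CSP search with smallest-domain-first variable ordering.

    domains:     dict variable -> set of candidate values
    binary:      dict (var_a, var_b) -> set of allowed (value_a, value_b) pairs
    final_check: callable(assignment) -> bool, run once all variables are set
    Yields every complete assignment that passes all constraints.
    """
    def propagate(var, value, doms):
        # Copy the domains, fix `var`, and filter the partners of `var`.
        new = {v: set(d) for v, d in doms.items()}
        new[var] = {value}
        for (a, b), allowed in binary.items():
            if a == var:
                new[b] = {y for y in new[b] if (value, y) in allowed}
            elif b == var:
                new[a] = {x for x in new[a] if (x, value) in allowed}
        return new

    def search(assignment, doms):
        if len(assignment) == len(doms):
            if final_check(assignment):
                yield dict(assignment)
            return
        unassigned = [v for v in doms if v not in assignment]
        var = min(unassigned, key=lambda v: (len(doms[v]), v))
        for value in sorted(doms[var]):
            new = propagate(var, value, doms)
            if any(not d for d in new.values()):
                continue  # a domain became empty: try the next value
            assignment[var] = value
            yield from search(assignment, new)
            del assignment[var]  # backtrack

    yield from search({}, domains)


# Hypothetical toy instance: V1 and V2 are binding-site fragments, V0 is the
# fragment to design. Pairs play the role of InteraX interaction patterns.
domains = {"V0": {"pep1", "pep2", "pep3"}, "V1": {"fragA", "fragB"}, "V2": {"fragC"}}
binary = {
    ("V1", "V0"): {("fragA", "pep1"), ("fragA", "pep2"), ("fragB", "pep3")},
    ("V2", "V0"): {("fragC", "pep2"), ("fragC", "pep3")},
}
toy_energy = {"pep1": -1.0, "pep2": -3.0, "pep3": 0.5}  # stand-in for FoldX
solutions = list(solve(domains, binary, lambda a: toy_energy[a["V0"]] < 0.0))
```

In this toy instance V2 has the smallest domain and is assigned first, which immediately removes pep1 from the domain of V0; the energy check then rejects the remaining candidate pep3. A production solver such as Gecode propagates constraints to a fixpoint rather than performing the single filtering step used here.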
Constraints are propagated along the constraint network each time the domain of a variable changes, unless an inconsistency is found (i.e. when a variable domain becomes empty). In that case, the algorithm backtracks to the last assignment that satisfies the constraints and traverses an alternative path
along the search tree. The search progresses in a depth-first rather than a breadth-first manner, meaning that priority is given to satisfying all of the constraints for a single assignment instead of progressively limiting the domains of every variable. This has the advantage that a good candidate satisfying all the constraints is found very early in the search process (typically in less than 5 minutes).

Translating a set of assigned variables to a structural model

A solution is an assignment of a single value to each variable that satisfies all constraints. These assignments are then translated to a structural model by superposing the BriX fragment on the fragment described by the variable in the case of binding site fragments (e.g. V1, V2 and V3), or by superposing on one fragment of the InteraX pair in the case of designed fragments (e.g. V0). The method of Kabsch (1976) is used to obtain the rotation matrix and translation vector that describe the best superposition of the coordinates of two fragments. Superposition is done using the backbone atoms N, Cα, C and O, disregarding the sequences of the fragments. When the backbone scaffold is in place, we use FoldX to build sidechains using a rotamer search (Schymkowitz et al., 2005). In conclusion, the method translates an assignment of BriX fragments to a complete model of the protein-peptide complex.

Implementation

The BriX Design method is implemented using the Generic Constraint Development Environment Gecode 3 (http://www.gecode.org) (Schulte, 2002), an efficient and clean implementation of constraint satisfaction methods in C++. The method is integrated with the FoldX and the BriX frameworks.

6.3.2 Local backbone moves using BriX

Backbone moves of fragments can naturally be represented in the CSP, since the domain of a variable represents the space of backbones compatible with a particular segment in the protein.
These slightly different backbone conformations are evaluated against the entire protein context by the different constraints. When
applying these backbone moves on a designed ligand, the structure of the ligand is optimized in terms of interaction energy, as we have shown in Section 6.2.3. Backbone moves could also be applied to fragments located in contiguous regions, e.g. in the binding site of the receptor protein. A mending strategy is then needed to connect the fragments such that the dihedral angles are not violated.

6.3.3 Binding site prediction

The method can be turned into a binding site prediction method by considering all fragments of the polypeptide chain (by sliding a window of length 5 from the N- to the C-terminus) instead of only fragments in the binding site. Energetic evaluations narrow down the solutions towards peptide designs that target the binding site. To make the hot spot surfaces shown in Figures 6.5 and 6.9, for each residue of the protein we take the best design (in terms of binding energy) that this residue was part of.

6.4 Discussion

In this final chapter, we have proposed a method for the structural discovery of peptide ligands binding to globular protein domains. For nine protein-peptide complexes, we showed that InteraX contains interaction patterns that are sufficiently rich to dock the peptides in the binding sites of the proteins with sub-angstrom accuracy. For two selected cases, the PDZ domain and the α-ligand binding domain, we have shown that even in the absence of the peptide structure or knowledge of the binding site, we can accurately reconstruct the peptide.

Our results show that a combination of BriX, InteraX and FoldX can reach optimal designs within a fraction of the time of ab initio methods. As a comparison, the recently proposed PepSpec algorithm, which relies on the Rosetta molecular modeling package, was shown to reconstruct peptides in between 100 and 300 CPU hours (King & Bradley, 2010).
Our design method typically generates sub-angstrom models within 5-15 minutes and then continues to enrich the ensemble.

In some cases, the prerequisite of knowing the sequence of the peptide to be modeled can be circumvented. For example, in the case of the PDZ domain, most of the binding is contributed by a network of backbone hydrogen bonds (Nourry et al., 2003). The selection of optimal and near-optimal templates could thus be guided without knowledge of the sequence, allowing a greater diversity of templates to be used for specificity modeling in a later phase. For this, we use a minimal representation of the peptide backbone, in which all residues are mutated to alanine, or to glycine at specific positions. Preliminary studies on the relation between a minimal model (using the ‘poly-alanine’ peptide) and a full sidechain model have shown a high correlation. These results show that a minimal representation of the peptide could lead to high-quality models as well. In the case of the LBD, however, no backbone hydrogen bonding can steer the search for optimal models, and thus the minimal model would not show a well-defined energy funnel. Here, a partial representation of the peptide (for example, LxxLL, with x any amino acid) could potentially identify optimal backbone templates, although this hypothesis was not verified. This minimal model opens the way for structural verification of many known peptide-binding motifs with an unknown structure, as are available for example in the database of eukaryotic linear motifs (ELM, http://elm.eu.org/) (Gould et al., 2010), or the database of three-dimensional interacting domains (3did, http://3did.irbbarcelona.org/) (Stein et al., 2010).

While many peptides bind their target proteins in a structured fashion, either through β-strand complementarity or α-helical packing (Chapter 4), other peptide interactions, such as those observed for SH2 (Bradshaw & Waksman, 2002) and SH3 domains (Mayer, 2001), do not. In Chapter 5 we have shown that even these unstructured interactions can be modeled using InteraX patterns, suggesting that they could be predicted de novo as well.
Preliminary analysis has shown that we can reach a reasonable accuracy (< 2 Å) in reconstructing these unstructured interactions when targeting the binding site. However, the positioning of these unstructured peptides using unstructured anchor fragments is worrisome: while superpositions using β- or α-fragments typically produce very ‘tight fits’, superpositions using unstructured fragments do not. This increasingly shifts the selection of binding peptides to the force field, rather than relying on the predictive capacity stored within InteraX pairs. We hypothesize that using different superposition mechanisms, for example iterative superposition using subsets of atoms (Meng et al., 2006), could improve the accuracy of the method for predicting unstructured peptide interactions.

Finally, the de novo prediction of peptide structure and specificity, in particular for therapeutic applications, should not be guided by a single optimization function, in this case the estimated binding energy. Instead, α-helical peptide stability (e.g. using Agadir, Muñoz et al. (1995)) or low peptide aggregation propensity (e.g. using Tango, Fernandez-Escamilla et al. (2004)) should be considered as well, amongst other parameters that define a ‘successful’ peptide design. The proposed peptide design method provides a framework for the implementation of these additional optimization functions.

Author Contributions

P.V., E.V., F.R., J.S., and L.S. conceptualized the study. P.V. developed the first version of the BriX Design algorithm (as used in Chapter 5). P.V. and E.V. developed the second version of the BriX Design algorithm (based on Gecode, the Generic Constraint Development Environment). P.V. performed and analyzed experiments on the PDZ domain and the LBD. E.V. performed and analyzed experiments on the SH2 domain (results not discussed). Gfeller et al. (2010) discovered multiple specificity for the PDZ1 domain of DLG1, and D.G., E.V., P.V. and L.S. created the structural models and the specificity profiles to confirm this multiple specificity observed in phage display experiments.
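The sliding-window scan of Section 6.3.3 and the per-residue "best design" rule used for the hot spot surfaces can be illustrated with a small sketch. This is not the BriX Design implementation: the function names, the representation of a design as an anchor fragment plus a single energy, and the toy energies are all assumptions made for illustration, and lower energy is assumed to mean better binding.

```python
from typing import Dict, List, Tuple

WINDOW = 5  # fragment length quoted in Section 6.3.3


def sliding_fragments(n_residues: int, window: int = WINDOW) -> List[Tuple[int, int]]:
    """Enumerate (start, end) residue index pairs, end exclusive, from N- to C-terminus."""
    return [(start, start + window) for start in range(n_residues - window + 1)]


def hot_spot_profile(designs: List[Tuple[Tuple[int, int], float]]) -> Dict[int, float]:
    """Map each receptor residue to the lowest energy of any design
    whose anchor fragment contains that residue (hypothetical data model)."""
    best: Dict[int, float] = {}
    for (start, end), energy in designs:
        for residue in range(start, end):
            if residue not in best or energy < best[residue]:
                best[residue] = energy
    return best


# Toy receptor of 7 residues with one made-up energy per anchor fragment.
fragments = sliding_fragments(7)
toy_energies = [-3.0, -5.5, -1.0]  # hypothetical values, not FoldX output
profile = hot_spot_profile(list(zip(fragments, toy_energies)))
```

In this toy case residues covered by the middle fragment inherit its energy, which is how a single strong design can light up a contiguous patch on the surface.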
References

Bradshaw, J.M. & Waksman, G. (2002). Molecular recognition by SH2 domains. Adv Protein Chem, 61, 161–210.
Doyle, D.A., Lee, A., Lewis, J., Kim, E., Sheng, M. & MacKinnon, R. (1996). Crystal structures of a complexed and peptide-free membrane protein-binding domain: molecular basis of peptide recognition by PDZ. Cell, 85, 1067–76.
Fernandez-Escamilla, A.M., Rousseau, F., Schymkowitz, J. & Serrano, L. (2004). Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nature Biotechnology, 22, 1302–1306.
Fiorentini, M., Nielsen, A.K., Kristensen, O., Kastrup, J.S. & Gajhede, M. (2009). Structure of the first PDZ domain of human PSD-93. Acta Crystallogr Sect F Struct Biol Cryst Commun, 65, 1254–7.
Gfeller, D., Butty, F., Verschueren, E., Vanhee, P., Huang, H., Ernst, A., Dar, N., Stagljar, I., Serrano, L., Sidhu, S.S., Bader, G.D. & Kim, P.M. (2010). The multiple specificity landscape of peptide recognition modules. Molecular Systems Biology (in review), 1–35.
Gould, C.M., Diella, F., Via, A., Puntervoll, P., Gemünd, C., Chabanis-Davidson, S., Michael, S., Sayadi, A., Bryne, J.C., Chica, C., Seiler, M., Davey, N.E., Haslam, N., Weatheritt, R.J., Budd, A., Hughes, T., Pas, J., Rychlewski, L., Travé, G., Aasland, R., Helmer-Citterich, M., Linding, R. & Gibson, T.J. (2010). ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Research, 38, D167–80.
Heldring, N., Pike, A., Andersson, S., Matthews, J., Cheng, G., Hartman, J., Tujague, M., Ström, A., Treuter, E., Warner, M. & Gustafsson, J.A. (2007). Estrogen receptors: how do they signal and what are their targets. Physiol Rev, 87, 905–31.
Jochim, A.L. & Arora, P.S. (2009). Assessment of helical interfaces in protein-protein interactions. Mol Biosyst, 5, 924–6.
Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Cryst., 922.
King, C.A. & Bradley, P. (2010). Structure-based prediction of protein-peptide specificity in Rosetta. Proteins, 78, 3437–49.
Kumar, V. (1992). Algorithms for constraint-satisfaction problems: A survey. AI magazine.
Mayer, B.J. (2001). SH3 domains: complexity in moderation. J Cell Sci, 114, 1253–63.
Meng, E.C., Pettersen, E.F., Couch, G.S., Huang, C.C. & Ferrin, T.E. (2006). Tools for integrated sequence-structure analysis with UCSF Chimera. BMC Bioinformatics, 7, 339.
Muñoz, V., Blanco, F.J. & Serrano, L. (1995). The hydrophobic-staple motif and a role for loop-residues in alpha-helix stability and protein folding. Nature Structural Biology, 2, 380–5.
Nourry, C., Grant, S.G.N. & Borg, J.P. (2003). PDZ domain proteins: plug and play! Sci STKE, 2003, RE7.
Petsalaki, E., Stark, A. & Russell, R.B. (2009). Accurate prediction of peptide binding sites on protein surfaces. PLoS Comput Biol, 5, e1000335.
Schulte, C. (2002). Programming constraint services: high-level programming of standard and new constraint services. portal.acm.org.
Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The FoldX web server: an online force field. Nucleic Acids Research, 33, W382–8.
Shiau, A.K., Barstad, D., Loria, P.M., Cheng, L., Kushner, P.J., Agard, D.A. & Greene, G.L. (1998). The structural basis of estrogen receptor/coactivator recognition and the antagonism of this interaction by tamoxifen. Cell, 95, 927–37.
Smith, C.A. & Kortemme, T. (2010). Structure-based prediction of the peptide sequence space recognized by natural and synthetic PDZ domains. Journal of Molecular Biology, 402, 460–74.
Stein, A., Céol, A. & Aloy, P. (2010). 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research.
Stiffler, M.A., Chen, J.R., Grantcharova, V.P., Lei, Y., Fuchs, D., Allen, J.E., Zaslavskaia, L.A. & MacBeath, G. (2007). PDZ domain binding selectivity is optimized across the mouse proteome. Science, 317, 364–9.
Tonikian, R., Zhang, Y., Sazinsky, S.L., Currell, B., Yeh, J.H., Reva, B., Held, H.A., Appleton, B.A., Evangelista, M., Wu, Y., Xin, X., Chan, A.C., Seshagiri, S., Lasky, L.A., Sander, C., Boone, C., Bader, G.D. & Sidhu, S.S. (2008). A specificity map for the PDZ domain family. PLoS Biol, 6, e239.
Tsang, E. (1993). Foundations of constraint satisfaction. en.scientificcommons.org.
7 Discussion

Predicting the structure of proteins is a hard task. Even though nature only uses twenty different building blocks to construct proteins, the number of theoretically possible combinations the folded chain of amino acids can adopt is immense. Yet, knowledge of protein structure is key to understanding many of the intricacies of biological systems, for example to elucidate protein interaction pathways or to develop new protein-targeting therapeutics.

In this thesis, we focused on understanding and predicting the structure of peptides: small molecules of typically no more than 10 amino acids that play important roles, predominantly in cell signaling and cell regulatory networks. An estimated 15-40% of all interactions in the cell are directly or indirectly influenced by peptide-binding events. Peptides are also increasingly considered a promising new class of therapeutics, because of their small size and interaction properties that enable them to disrupt protein-protein interfaces. The design of these molecules, however, has been hindered by the lack of structural information. For example, while a considerable number of protein structures exist in public databases, as few as 3% describe protein-peptide structure. We have proposed several methodologies that use polypeptide fragments to extend the structural coverage for peptide interactions, including loop prediction and peptide specificity design. In this final chapter, we reflect upon our work and provide thoughts on further applications.
Fragmenting protein space

The fragmentation of protein structures is an appealing strategy to reduce the combinatorial problems typically associated with protein structure prediction: instead of sampling the conformation space of protein structure, the space of naturally occurring fragment templates is considered. The idea is in essence similar to that of using rotamer libraries, derived from sets of experimentally determined high-resolution structures, for predicting the sidechain conformations adopted by protein structures (Dunbrack, 2002). In such reduced models, search can often proceed in a deterministic way, since the total number of combinations is far more limited than when sampling based on physical principles. The use of protein fragments is central to this work.

The typical limitation of database-driven methods is the lack of data to accurately describe new protein structures in full atomic detail. Nearly twenty years ago, Chothia (1992) estimated that nature uses around 1000 different protein folds, and roughly ten years later, Aloy & Russell (2004) estimated the total number of protein-protein interactions at roughly 10,000. In our work we have taken a more pragmatic approach in an attempt to describe the structural protein space: we fragmented and clustered a large number of non-homologous protein structures (>7000), resulting in a rich alphabet of >1000 classes per fragment length (Chapter 2). Using BriX, Baeten et al. (2008) showed that proteins could be globally reconstructed with an average accuracy of 0.48 Å when compared to X-ray models. From these results, one could conclude that, at least at the fragment level, our structural alphabet is complete.

Loops are often the hardest parts of the protein structure to classify because of their high variability. As a consequence, they are often poorly represented or ignored in existing structural alphabets (Le et al., 2009).
In BriX, we noticed that many of these irregular structures are not classified, making the database less useful for high-resolution structure prediction and refinement. To overcome this limitation, we presented a novel way to group these ‘irregular’ elements by their regular end points, i.e. the α-helices and β-strands that flank the irregular loop residues (Chapter 2). We validated this new alphabet on a set of 527 loops and found that loops up to a length of 12 (∼90% of all loops in the dataset) could be described with high accuracy (< 2 Å RMSD), largely independent of sequence homology between the original loop and the BriX template (Chapter 3). Thus, contrary to the perceived irregularity of loops, the ‘space’ of experimentally observed loop structures seems saturated, at least for small and medium-sized loops. As a result, there appears to be no need to employ computationally expensive ab initio methods for the prediction of protein loop structure, unless combined with database methods.

Protein fragments in isolation, i.e. outside of their network of local interactions, have limited meaning. For example, protein fragments with similar backbone dihedral angles might or might not form hydrogen bonds depending on the protein context. Therefore, we introduced the concept of ‘fragment context’ by mining the BriX database for fragment interactions. These fragment interactions from monomeric proteins, stored in the InteraX database, accurately capture backbone and sidechain interactions, hydrogen bonding networks, packing constraints, and even electrostatic interactions (Chapter 5). While we did not attempt to improve the protein reconstruction accuracy using this database, we extensively used these ‘interaction motifs’ for the description of protein-peptide structure. We found that peptide interactions are structurally comparable to fragment interactions from InteraX. On a large set of available protein-peptide structures (stored in the PepX database, Chapter 4), we showed that in nearly half of the cases the architecture of the peptide binding site can be entirely reconstructed using interacting protein fragments. In some cases, the entire architecture of the protein-peptide binding site was recovered in unrelated monomeric proteins, potentially providing clues to the evolutionary origins of these peptide motifs or their function in the cell.
Our findings, both for protein loops and protein-peptide binding modes, illustrate that even in the absence of homology, structural relations between fragments can be used for modeling with high accuracy. Moreover, the relation between intramolecular and intermolecular interaction patterns effectively turns the entire database of protein fragments into learning data for describing protein-peptide complexes.
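The fragment-and-cluster idea behind a structural alphabet can be sketched in a few lines. The sketch below is only an illustration and makes several assumptions that are not taken from BriX: fragments are plain lists of Cα coordinates, similarity is measured with the superposition-free distance RMSD (dRMSD, which cannot distinguish mirror images), and classes are built with a simple greedy leader clustering. The function names and the 0.5 Å threshold are hypothetical.

```python
import math
from itertools import combinations


def fragment_backbone(coords, length):
    """Cut a chain of (x, y, z) points into overlapping fragments of fixed length."""
    return [coords[i:i + length] for i in range(len(coords) - length + 1)]


def drmsd(a, b):
    """Distance RMSD between two equal-length fragments (length >= 2):
    root mean square difference of all intra-fragment pairwise distances."""
    pairs = list(combinations(range(len(a)), 2))
    squared = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2 for i, j in pairs)
    return math.sqrt(squared / len(pairs))


def leader_cluster(frags, threshold):
    """Greedy clustering: join the first class whose representative is within
    the threshold, otherwise start a new class with this fragment as representative."""
    classes = []  # list of (representative, members)
    for frag in frags:
        for representative, members in classes:
            if drmsd(representative, frag) <= threshold:
                members.append(frag)
                break
        else:
            classes.append((frag, [frag]))
    return classes


# Toy data: an extended chain (3.8 A spacing) plus one sharply bent fragment.
straight = [(3.8 * i, 0.0, 0.0) for i in range(8)]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0), (0.0, 3.8, 3.8)]
classes = leader_cluster(fragment_backbone(straight, 5) + [bent], threshold=0.5)
```

Here the four extended fragments collapse into one class and the bent fragment forms its own. Greedy leader clustering depends on input order, so a production alphabet would likely use a more careful clustering scheme.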
On the relation between sequence and structure

The sequence of a protein uniquely determines its structure and thus its function. Even though this correlation is strikingly simple, its consequences are not: many mutations in the sequence of a protein do not affect the structure, while others might lead to the collapse of the protein chain (Alexander et al., 2009). Using BriX, we have made various attempts to link structure with sequence, yet we could not observe general patterns. For example, the sequence conservation in BriX classes is very low, with only a few exceptions such as sequences containing proline and glycine, which have a pronounced effect on the structure of the protein backbone (Baeten et al., 2008). Sequence-to-structure relationships, however, occur often: fragments with a similar sequence frequently adopt a similar structure, an insight that is at the basis of the Rosetta software (Rohl et al., 2004). We did not observe meaningful sequence similarity between protein-peptide complexes and interacting fragments (similarities were as low as 0-14%), even when the entire binding site could be reconstructed from a single protein or the entire interaction network was preserved. These findings suggest that the architectural framework of proteins is largely independent of the specific amino acid sequence, providing opportunities for the design of proteins and peptides using BriX and InteraX.

Fragment-based prediction of protein loop and peptide structure

We evaluated the predictive capacity of our fragment-based approach in combination with the full-atom force field FoldX (Schymkowitz et al., 2005). We proposed two methodologies to predict protein structure: LoopX predicts protein loop structure given the amino acid sequence of the loop (Chapter 3), and BriX Design (Chapter 6) predicts peptide structure given the amino acid sequence of the peptide and a structure of the target protein.
We demonstrated that LoopX outcompetes state-of-the-art methods, both in accuracy and in speed. We compared LoopX to Kinematic Closure (Mandell et al., 2009), a robotics-inspired method that samples loop conformations with sub-angstrom accuracy. On three data sets of 12-residue loops, LoopX reached comparable accuracy in a fraction of the time (∼5-60 minutes vs. ∼320 hours). On another data set, LoopX was compared to one database method and three other ab initio methods. With a prediction coverage of nearly 80%, LoopX performed better than each of these methods. The performance decreases dramatically for larger loop lengths (> 12 residues), suggesting that the diversity of available loop templates of larger lengths is insufficient. Our results demonstrate that, using a combination of loop templates, rotamer-dependent side chain construction and all-atom energy evaluation, both high-accuracy and high-throughput loop reconstruction for loops ≤ 12 residues is now within reach.

The number of experimentally resolved protein-peptide complexes is rather low (∼1400 complexes), and is further reduced to ∼500 binding modes when discarding structural similarities (Chapter 4). Therefore, we relied on the database of interacting fragments to increase the structural space of peptide binding modes. For nine representative protein-peptide complexes, we have shown the predictive power of the InteraX database by docking 4 out of 9 peptides with sub-angstrom accuracy. This docking exercise is somewhat limited, since the structure of the peptide needs to be known beforehand. In a second step, we lifted this requirement and showed for two cases, the PDZ domain and the LBD, how multiple fragment interactions can be used to reconstruct the entire binding mode with high accuracy. The combination BriX-InteraX-FoldX was also able to describe the energy landscape of the peptides in close contact with the protein surfaces, with clear ‘energy funnels’ towards the crystallized binding pocket.

In addition to predicting the canonical binding motif for PDZ1, the method was able to detect small differences in peptide specificity. Gfeller et al. (2010) recently discovered multiple specificity for the PDZ1 domain through the clustering of phage display data (Chapter 6).
However, since structural evidence for this observation was lacking, we performed a structural validation of this alternative peptide motif. We predicted a peptide structure that differed from the canonical peptide structure by one residue extending at the C-terminus. Through in silico mutagenesis experiments with FoldX, we obtained a specificity profile similar to the profile detected using peptide phage display. These results suggest that the method can produce reliable, high-resolution models to verify peptide motifs for which no three-dimensional structure exists (Gould et al., 2010; Stein et al., 2010). Moreover, small differences in the PDZ domain, such as the displacement of the carboxylate-binding loop, are detected, producing structural models with different specificities. This level of detail opens the possibility of family-wide prediction of, e.g., PDZ-peptide binding, for which many high-resolution structures are still missing (Smith & Kortemme, 2010).

Towards modeling of conformational ensembles

Proteins are not rigid structures that occupy a single low-energy conformation. Instead, their dynamic nature is important for protein function and evolvability (Tokuriki & Tawfik, 2009). Modeling the conformational ensembles that proteins can adopt is a computationally expensive task (Shaw et al., 2010). Because of the organization of the BriX and PepX databases, fragments are grouped into classes of similar structure. We believe that these ‘natural’ ensembles can, to a certain extent, represent the structural variability of proteins. As we demonstrated in the case of loops, the conformational ensemble adopted by a peptide-binding loop, as observed in different crystal structures, could be modeled with sub-angstrom resolution (Chapter 3). Further work could employ BriX ensembles to model flexibility as observed in NMR structures, or to reproduce the diversity of structural ensembles for which a large number of high-resolution models in different contexts are available, e.g. for ubiquitin (Lange et al., 2008) or lysozyme (Baase et al., 2010).

In contrast to other methods that sample conformational and sequence ensembles (Friedland & Kortemme, 2010), BriX fragments provide slight backbone variations observed in experimentally determined structures. Using naturally occurring backbone ensembles has the potential to lift the ‘fixed backbone’ assumption so often employed in computational modeling (Mandell & Kortemme, 2009). In Chapter 6 we showed how these structural ensembles can be used to introduce backbone flexibility in the peptide, which improves the overall binding affinity.
The introduction of ‘mending’ methods to connect fragments along the polypeptide chain could make these fragment-based methods applicable to the modeling of backbone movements and conformational ensembles in protein structures.

Combining protein loop prediction with peptide prediction poses challenges too. For example, it was recently demonstrated that the loops of the SH2 domain have a major role in the selectivity of this domain for peptide binding (Kaneko et al., 2010). Structural variations of these loops either blocked or opened access to one of the three binding subsites of the SH2 domain, thus orchestrating peptide binding. Coordinated prediction of loop movement on the side of the domain, while keeping a flexible model of the peptide in the binding pocket, could lead to both the understanding and the design of selective peptides. Dynamic approaches to structure prediction will become essential for the further development of the modeling and prediction fields, and fragment-based approaches such as the ones introduced in this work might play a pivotal role.

High-throughput docking and design

Recent years have seen the emergence of high-throughput proteomics, which attempts to characterize every protein-protein interaction in entire organisms (Li et al., 2004; Kühner et al., 2009). While around 10,000 interactions have been experimentally observed among roughly 6000 proteins in Saccharomyces cerevisiae (Uetz et al., 2000), structural information is available for only an estimated 1%, complicating the verification and understanding of many of these interactions. Methodologies to increase this structural coverage are thus expected to contribute to our general understanding of the ‘interactome’ (Mosca et al., 2009), yet too often the quality of the structural predictions is in the mid and low range, as reported by the CAPRI docking competition (Lensink et al., 2007).

The idea of using interacting fragments with preoptimized packing and backbone hydrogen bonding could be applied to modeling protein-protein interactions too. However, protein-protein interfaces are typically less hydrophobic than protein-peptide interfaces, with less hydrogen bonding because of their large interfaces (London et al., 2010). Hence, the structural insight presented in Chapter 5 might not hold for the prediction of protein-protein interactions.
Alternatively, structural similarity to other protein-protein interfaces has already been used to improve docking experiments (Sinha et al., 2010). Extending the space of interacting fragments to include fragments from available protein-protein interfaces (e.g. by fragmenting the ∼8000 protein interface architectures described by Tuncbag et al. (2008)) could provide the data needed to cover the space of most protein interactions, and integration within the current design protocols would be a straightforward task.

For high-throughput purposes, the right balance between speed and accuracy is of crucial importance. On the one hand, homology-based methods are extremely fast but rely on the presence of a related model. On the other hand, ab initio methods such as molecular dynamics simulations are notoriously slow, limiting their use to a handful of cases, even with the advent of massive parallelization and cloud computing (Snow et al., 2002; Shaw et al., 2010). Using fragments from large databases such as BriX and Loop BriX has enormous advantages: natural backbone templates can be used for modeling, avoiding the expensive cost of explicit backbone flexibility, such that computer time can be focused on optimizing side chain packing. The algorithms developed in this thesis, LoopX (Chapter 3) and BriX Design (Chapter 6), both follow this philosophy. As we have demonstrated, loops can be modeled accurately (< 2 Å RMSD) in as little as ∼5-60 minutes, while peptides can be designed with sub-angstrom accuracy in 5 minutes to a couple of hours. These results show the potential of our methodologies for large-scale analysis and prediction.

Virtual peptide design for therapeutics

The main advantage of developing peptide or peptide-like drugs is that they can help expand the ‘druggable genome’ (Hopkins & Groom, 2002) by specifically targeting protein-protein interactions, which are less suitable for small-molecule-based therapies. Peptide drugs can offer increased target selectivity and have the potential to show reduced toxicity in comparison with small molecules. Unfortunately, rational peptide designs that have shown potent inhibition in vitro or in vivo are still rare and are often directly derived from a crystallized protein-protein interface (Chapter 1).
From a design perspective, there are clear advantages to developing peptide therapeutics: a large body of existing structural and other experimental data on protein-protein and protein-peptide interactions is already available. Moreover, the closely related field of protein structure prediction and design is already relatively well developed. Because peptides are constructed from amino acids, the large body of energy potentials developed in the fields of protein folding, docking and dynamics can be applied to peptide structure prediction (Kaufmann et al., 2010; Schymkowitz et al., 2005). Harnessing the full potential of such approaches would require, for example, the introduction of non-natural amino acids to further extend the chemical repertoire (Link et al., 2003).

Methods for the prediction and design of peptides, including the one presented in this work (Chapter 6), are currently being developed. Unfortunately, their evaluation is often based on the calculated root mean square deviation (RMSD), or similar measures, from existing structural data, or on calculated binding scores; only in a limited number of cases are other experimental data used, such as specificity assays from phage display or peptide array experiments. When designing peptides, the optimization process is often carried out using interaction energy estimates as the scoring function, disregarding other relevant factors of ‘successful’ peptide designs, such as metabolic stability or specificity. Multi-objective peptide design algorithms are thus expected to improve on current designs.

For the algorithm developer and user, few ready-to-use benchmark sets exist, making algorithm comparison time-consuming and often impossible. The organization of a dedicated peptide prediction competition along the lines of the structure prediction competition CASP (Moult, 2005) or the protein docking competition CAPRI (Janin, 2005) could help advance the peptide design field.

In conclusion, although several roadblocks still exist in the development of peptide or peptide-like drugs, the future for this class of molecules is bright.
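One common way to make the multi-objective idea above concrete is to keep only the Pareto-optimal designs, i.e. those that no other design beats on every criterion at once. The sketch below is an illustration under stated assumptions, not part of BriX Design: the `Design` type, the three objectives (binding energy, negated stability score, negated specificity score) and all numbers are hypothetical, and every objective is expressed so that lower is better.

```python
from typing import List, NamedTuple, Tuple


class Design(NamedTuple):
    name: str
    objectives: Tuple[float, ...]  # every entry is "lower is better"


def dominates(a: Design, b: Design) -> bool:
    """True if a is at least as good as b on all objectives and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a.objectives, b.objectives))
    strictly_better = any(x < y for x, y in zip(a.objectives, b.objectives))
    return no_worse and strictly_better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep the designs that are not dominated by any other design."""
    return [d for d in designs if not any(dominates(other, d) for other in designs)]


# Hypothetical candidates: (binding energy, -stability score, -specificity score).
candidates = [
    Design("A", (-8.0, -0.4, -0.7)),
    Design("B", (-7.0, -0.6, -0.9)),
    Design("C", (-6.0, -0.3, -0.5)),  # worse than A on every objective
    Design("D", (-9.0, -0.1, -0.2)),  # best binder, weakest on the other two
]
front = pareto_front(candidates)
```

Design C drops out because A is better on all three criteria, while A, B and D remain as alternative trade-offs. Choosing among them still requires a judgment about which property matters most for the application.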
References

Alexander, P.A., He, Y., Chen, Y., Orban, J. & Bryan, P.N. (2009). A minimal sequence code for switching protein structure and function. Proceedings of the National Academy of Sciences, 106, 21149–54.
Aloy, P. & Russell, R.B. (2004). Ten thousand interactions for the molecular biologist. Nature Biotechnology, 22, 1317–21.
Baase, W.A., Liu, L., Tronrud, D.E. & Matthews, B.W. (2010). Lessons from the lysozyme of phage T4. Protein Sci, 19, 631–41.
Baeten, L., Reumers, J., Tur, V., Stricher, F., Lenaerts, T., Serrano, L., Rousseau, F. & Schymkowitz, J. (2008). Reconstruction of protein backbones from the BriX collection of canonical protein fragments. PLoS Comput Biol, 4, e1000083.
Chothia, C. (1992). One thousand families for the molecular biologist. Nature, 357, 543–4.
Dunbrack, R.L. (2002). Rotamer libraries in the 21st century. Curr Opin Struct Biol, 12, 431–40.
Friedland, G.D. & Kortemme, T. (2010). Designing ensembles in conformational and sequence space to characterize and engineer proteins. Curr Opin Struct Biol, 20, 377–84.
Gfeller, D., Butty, F., Verschueren, E., Vanhee, P., Huang, H., Ernst, A., Dar, N., Stagljar, I., Serrano, L., Sidhu, S.S., Bader, G.D. & Kim, P.M. (2010). The multiple specificity landscape of peptide recognition modules. Molecular Systems Biology (in review), 1–35.
Gould, C.M., Diella, F., Via, A., Puntervoll, P., Gemünd, C., Chabanis-Davidson, S., Michael, S., Sayadi, A., Bryne, J.C., Chica, C., Seiler, M., Davey, N.E., Haslam, N., Weatheritt, R.J., Budd, A., Hughes, T., Pas, J., Rychlewski, L., Travé, G., Aasland, R., Helmer-Citterich, M., Linding, R. & Gibson, T.J. (2010). ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Research, 38, D167–80.
Hopkins, A.L. & Groom, C.R. (2002). The druggable genome. Nat Rev Drug Discov, 1, 727–30.
Janin, J. (2005). Assessing predictions of protein-protein interaction: the CAPRI experiment. Protein Sci, 14, 278–83.
Kaneko, T., Huang, H., Zhao, B., Li, L., Liu, H., Voss, C.K., Wu, C., Schiller, M.R. & Li, S.S.C. (2010). Loops govern SH2 domain specificity by controlling access to binding pockets. Science Signaling, 3, ra34.
Kaufmann, K.W., Lemmon, G.H., Deluca, S.L., Sheehan, J.H. & Meiler, J. (2010). Practically useful: What the Rosetta protein modeling suite can do for you. Biochemistry, 49, 2987–2998.
Kühner, S., van Noort, V., Betts, M.J., Leo-Macias, A., Batisse, C., Rode, M., Yamada, T., Maier, T., Bader, S., Beltran-Alvarez, P., Castaño-Diez, D., Chen, W.H., Devos, D., Güell, M., Norambuena, T., Racke, I., Rybin, V., Schmidt, A., Yus, E., Aebersold, R., Herrmann, R., Böttcher, B., Frangakis, A.S., Russell, R.B., Serrano, L., Bork, P. & Gavin, A.C. (2009). Proteome organization in a genome-reduced bacterium. Science, 326, 1235–40.
Lange, O.F., Lakomek, N.A., Farès, C., Schröder, G.F., Walter, K.F.A., Becker, S., Meiler, J., Grubmüller, H., Griesinger, C. & de Groot, B.L. (2008). Recognition dynamics up to microseconds revealed from an RDC-derived ubiquitin ensemble in solution. Science, 320, 1471–5.
Le, Q., Pollastri, G. & Koehl, P. (2009). Structural alphabets for protein structure classification: a comparison study. Journal of Molecular Biology, 387, 431–50.
  • 185.
    REFERENCES Lensink, M.F., M´endez,R. & Wodak, S.J. (2007). Docking and scoring protein complexes: Capri 3rd edition. Proteins, 69, 704–18. 163 Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.O., Han, J.D.J., Chesneau, A., Hao, T., Goldberg, D.S., Li, N., Martinez, M., Rual, J.F., Lamesch, P., Xu, L., Tewari, M., Wong, S.L., Zhang, L.V., Berriz, G.F., Jacotot, L., Vaglio, P., Reboul, J., Hirozane-Kishikawa, T., Li, Q., Gabel, H.W., Elewa, A., Baumgartner, B., Rose, D.J., Yu, H., Bosak, S., Sequerra, R., Fraser, A., Mango, S.E., Saxton, W.M., Strome, S., Heuvel, S.V.D., Piano, F., Vandenhaute, J., Sardet, C., Gerstein, M., Doucette-Stamm, L., Gunsalus, K.C., Harper, J.W., Cusick, M.E., Roth, F.P., Hill, D.E. & Vidal, M. (2004). A map of the interactome network of the meta- zoan c. elegans. Science, 303, 540–3. 163 Link, A.J., Mock, M.L. & Tirrell, D.A. (2003). Non-canonical amino acids in protein engi- neering. Curr Opin Biotechnol, 14, 603–9. 165 London, N., Movshovitz-Attias, D. & Schueler- Furman, O. (2010). The structural basis of peptide-protein binding strategies. Struc- ture, 18, 188–199. 163 Mandell, D.J. & Kortemme, T. (2009). Backbone flexibility in computational protein design. Curr Opin Biotechnol, 20, 420–8. 162 Mandell, D.J., Coutsias, E.A. & Kortemme, T. (2009). Sub-angstrom accuracy in pro- tein loop reconstruction by robotics-inspired conformational sampling. Nat Methods, 6, 551–2. 160 Mosca, R., Pons, C., Fern´andez-Recio, J. & Aloy, P. (2009). Pushing structural infor- mation into the yeast interactome by high- throughput protein docking experiments. PLoS Comput Biol, 5, e1000490. 163 Moult, J. (2005). A decade of casp: progress, bottlenecks and prognosis in protein struc- ture prediction. Curr Opin Struct Biol, 15, 285–9. 165 Rohl, C.A., Strauss, C.E.M., Misura, K.M.S. & Baker, D. (2004). Protein structure predic- tion using rosetta. Meth Enzymol, 383, 66– 93. 
160 Schymkowitz, J., Borg, J., Stricher, F., Nys, R., Rousseau, F. & Serrano, L. (2005). The foldx web server: an online force field. Nucleic Acids Research, 33, W382–8. 160, 165 Shaw, D.E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R.O., Eastwood, M.P., Bank, J.A., Jumper, J.M., Salmon, J.K., Shan, Y. & Wriggers, W. (2010). Atomic-level charac- terization of the structural dynamics of pro- teins. Science, 330, 341–6. 162, 164 Sinha, R., Kundrotas, P.J. & Vakser, I.A. (2010). Docking by structural similarity at protein- protein interfaces. Proteins, 78, 3235–41. 163 Smith, C.A. & Kortemme, T. (2010). Structure- based prediction of the peptide sequence space recognized by natural and synthetic pdz domains. Journal of Molecular Biology, 402, 460–74. 162 Snow, C.D., Nguyen, H., Pande, V.S. & Grue- bele, M. (2002). Absolute comparison of sim- ulated and experimental protein-folding dy- namics. Nature, 420, 102–6. 164 Stein, A., C´eol, A. & Aloy, P. (2010). 3did: iden- tification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Research. 161 Tokuriki, N. & Tawfik, D.S. (2009). Protein dy- namism and evolvability. Science, 324, 203– 7. 162 Tuncbag, N., Gursoy, A., Guney, E., Nussinov, R. & Keskin, O. (2008). Architectures and func- tional coverage of protein-protein interfaces. J Mol Biol, 381, 785–802. 163 Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., Qureshi-Emili, A., Li, Y., Godwin, B., Conover, D., Kalbfleisch, T., Vijayadamodar, G., Yang, 167
  • 186.
    REFERENCES M., Johnston, M.,Fields, S. & Rothberg, J.M. (2000). A comprehensive analysis of protein- protein interactions in saccharomyces cere- visiae. Nature, 403, 623–7. 163 168
    List of Figures
    1.1 Proteins interacting with antibodies, small molecules and peptides.
    1.2 Amino acids grouped by properties.
    1.3 Four levels of protein structure.
    1.4 Targeting the cell with different molecules.
    1.5 History of experimental structure determination from the last fifty years.
    1.6 PDZ-peptide interactions and peptide specificity.
    1.7 Example workflows for peptide design.
    1.8 Stapled helical peptides as potent therapeutic peptides.
    1.9 Design of helices targeting Trans-Membrane (TM) proteins.
    1.10 Innovative structural approaches in peptide design using BriX and InteraX.
    2.1 SCOP representation of ASTRAL40.
    2.2 The BriX database.
    2.3 Number of BriX classes versus different classification thresholds.
    2.4 Percentage of classified fragments versus different classification thresholds.
    2.5 Secondary structure content for classified fragments in BriX.
    2.6 Secondary structure content for unclassified fragments in BriX.
    2.7 The Loop BriX database.
    2.8 The BriX website (http://brix.crg.es).
    2.9 BriX applications: ‘covering’ and ‘bridging’.
    3.1 Predicted loop structure with LoopX.
    3.2 Accuracy of LoopX versus state-of-the-art loop prediction methods on datasets 1, 2, 3 and 4.
    3.3 Dataset 1: comparison of LoopX with Rosetta and KIC.
    3.4 Dataset 2: comparison of LoopX with Rosetta and KIC.
    3.5 Dataset 3: comparison of LoopX with KIC.
    3.6 Influence of loop homology in LoopX.
    3.7 Conformational ensemble adopted by the PDZ carboxylate-binding loop.
    3.8 Cross-comparison of backbone distances of the conformational ensemble of the carboxylate-binding loop.
    3.9 Reconstruction of PDZ carboxylate-binding loop ensemble.
    3.10 LoopX prediction accuracy compared with FREAD, MODELLER, RAPPER and PLOP.
    3.11 Overview of the LoopX algorithm.
    4.1 Contents of PepX.
    4.2 Examples of protein-peptide clusters from PepX.
    4.3 Distribution of number of elements in the PepX clusters.
    4.4 Distribution of peptide and receptor size.
    4.5 Receptor sequence redundancy within the PepX database.
    4.6 PepX Annotations.
    4.7 Representation of the SCOP and CATH hierarchies in PepX.
    4.8 Distribution of PepX structures in the different SCOP classes.
    4.9 PepX usage statistics.
    4.10 PepX user flow.
    4.11 Search options in the PepX database.
    5.1 Different protein fragment interactions from InteraX.
    5.2 Coverage of protein-peptide interfaces.
    5.3 Protein-peptide interfaces can be described as interactions between recurrent protein fragments from monomeric proteins.
    5.4 Relation between intermolecular interface architectures and intramolecular protein architectures.
    5.5 Properties of the protein-peptide interface coverage.
    5.6 Properties of the protein-peptide interface coverage correlated with binding energy and burial.
    5.7 Physical properties are not conserved in the BriX covers.
    6.1 Docking of the PDZ peptide using InteraX patterns.
    6.2 Peptide design for the canonical PDZ domain.
    6.3 Multiple specificity for DLG1 PDZ1 binding peptides.
    6.4 Structural prediction for the alternative PDZ specificity.
    6.5 PDZ peptide-binding site prediction.
    6.6 Estrogen receptor α-LBD binding site.
    6.7 Helical peptide design for the LBD.
    6.8 Recognizing LBD peptide specificity on designed templates.
    6.9 LBD peptide-binding site prediction.
    6.10 Method: BriX peptide design implemented as a Constraint Satisfaction Problem.
    List of Tables
    1.1 Leading examples of peptide therapeutics currently on the market.
    2.1 Distribution of loops across the four main loop categories for four different loop databases.
    2.2 Classification of loops within Loop BriX.
    4.1 Public databases of protein-ligand complexes.
    5.1 InteraX protein fragment interactions.
    5.2 Coverage statistics for the most populated classes in the protein-peptide dataset.
    6.1 Peptide docking using InteraX.