Protein structure prediction methods for drug design

Thomas Lengauer and Ralf Zimmer
Date received (in revised form): 4th July 2000
Briefings in Bioinformatics. Vol 1. No 3. 275–288. September 2000. © Henry Stewart Publications 1467-5463

Thomas Lengauer was a professor of Computer Science at the University of Paderborn, before he joined GMD, the German National Research Centre for Information Technology, in 1992 as Director of the Institute for Algorithms and Scientific Computing. Jointly, he is Professor of Computer Science at the University of Bonn. His research interests include computational biology and bioinformatics, computational chemistry and combinatorial optimisation problems in technological applications.

Ralf Zimmer is a research scientist at GMD. He directs the research group on algorithmic structural genomics. His research interests include algorithms and statistical methods for genomics, proteomics, protein sequence and structure analysis, and target finding, as well as the connections between molecular biology and computing (DNA computing).

Correspondence: Thomas Lengauer, Institute for Algorithms and Scientific Computing (SCAI), GMD – National Research Center for Information Technology, Sankt Augustin, Germany D53754. Tel: +49 2241 14 2776/2777. Fax: +49 2241 14 2656.

Keywords: protein structure prediction, protein target, protein–ligand docking

Abstract
Along the long path from genomic data to a new drug, the knowledge of three-dimensional protein structure can be of significant help in several places. This paper points out such places, discusses the virtues of protein structure knowledge and reviews bioinformatics methods for gaining such knowledge on the protein structure.

INTRODUCTION
The long path from genomic data to a new drug can conceptually be divided into two parts (see left side of Figure 1). The first task is to select a target protein whose molecular function is to be moderated, in many cases blocked, by a drug molecule binding to it. Given the target protein, the second task is to select a suitable drug that binds to the protein tightly, is easy to synthesise, is bio-accessible and has no adverse effects such as toxicity. The knowledge of the three-dimensional structure of a protein can be of significant help in both phases. The steric and physicochemical complementarity of the binding site of the protein and the drug molecule is an important, if not the dominating, feature of strong binding. Thus, in many cases, the knowledge of the protein structure affords well-founded hypotheses of the function of the protein. If the structure of the relevant binding site of the protein is known in detail, we can even start to employ structure-based methods in order to develop a drug binding tightly to the protein.

In this paper bioinformatics methods for predicting aspects of the protein structure are described and their use towards the goal of drug design is discussed. The possibilities and limitations of using protein structure knowledge towards the goal of developing new drug therapies are also discussed.

NOTIONS OF PROTEIN FUNCTION
The increased accessibility of genomic data and, especially, that of large-scale expression data has opened new possibilities for the search for target proteins. This development has prompted large-scale investments into the new technology by many pharmaceutical companies. The respective screening experiments rely critically on appropriate bioinformatics support for interpreting the generated data. Specifically, methods are required to identify interesting differentially expressed genes and to predict the function and structure of putative target proteins from differential expression data generated in an appropriate screening experiment.

Protein function is a colourful notion whose meaning can range over several levels:

- a very general classification (globular, enzyme, hormone, structural protein, viral capsid protein, transmembrane protein, etc.);

- biochemical function (biochemical reaction, enzyme specificity, binding partners, cofactors);

- classification via broad cellular function (interaction with DNA and other proteins, cellular localisation);

- broad phenotypic function (changes observed for organisms with deleted or mutated genes);

- identification of detailed physiological function such as the localisation in a metabolic or regulatory pathway and the associated cellular role of the protein;

- identification of molecular binding partners and their mode of interaction with the protein.

[Figure 1: flow diagram of the path from genome, organism or disease to a drug. Stages: SEARCH (target protein search via structure families, evolutionary information, expression, phenotype/genotype, SNPs, linkage and mutations), IDENTIFY (target protein; structure, sequence, fusion, co-evolution, co-expression, motifs), MODEL (target protein function and structure; assay/screening) and DESIGN (drug lead; rational drug design by ligand search, computer docking and combinatorial design, versus trial and error by HTS of libraries), leading to the target lead structure / drug.]

The derivation of protein function from protein sequence by theoretical means is commonly performed by transferring functional information from related proteins (eg from other organisms). Usually the transfer is from proteins whose function has been established with experimental evidence. The establishment of the relevant protein relationship based on sequence is complicated by some subtleties of evolutionary processes. Though it is often true that organisms share related proteins with similar sequence, similar structure and the same function simply because they originate from a common ancestor and they still fulfil their role within the cellular processes, mutations occur independently after speciation events. Depending on the extent of the evolutionary changes, the recognition of homology or orthology among proteins can be difficult, but even in these cases consistent evidence for relatedness should be expected on the sequence, structure and function levels. Sometimes, the situation is complicated because of gene duplications within a species leading to paralogous copies of the same gene. These paralogous copies are subject to evolutionary changes, and the evolutionary pressure on structure or function is much relaxed for all but one copy, which still serves the original purpose, such that greater deviations in sequence, structure and function occur for these copies. Because considerable (ie significantly more than random) sequence similarity can still be observed among paralogous proteins, this confuses the picture, leading to erroneous transfer of functions to proteins that are already functionally disabled or functionally completely different.
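
The sequence similarity referred to above is commonly reported as a percentage of identical residues over an alignment. The following is only a minimal sketch of that calculation, not a method taken from this paper: the function name `percent_identity`, the gap symbol and the choice to count only columns where neither sequence has a gap are illustrative assumptions, and other conventions for the denominator are in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where neither sequence has a gap.

    Both inputs are assumed to be rows of one pairwise alignment, so they must
    have the same length. The denominator convention is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Toy alignment: five gap-free columns, four of them identical.
    print(percent_identity("ACDE-F", "ACDQGF"))
```

A figure of this kind only quantifies similarity; whether two proteins are orthologous or paralogous cannot be read off it.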

Therefore, in the following, we have to distinguish between three notions. Similarity is a quantitative measure on the sequence, structure or function level. Homology is used when there is a clear established or potential (assumed, predicted) evolutionary relationship between proteins. The term orthology, in addition, indicates homologous proteins with (established or potential) the same or at least similar function. The notion of paralogy, in contrast, is used when homologous proteins are expected to have evolved enough to expect changes in function (with or without a change in 3D structure).

For drug design, we need to know more of the function of the protein than follows from just a general classification. It would be best both to know natural binding partners and to have a detailed structural model of the binding sites of the protein.

METHODS OF PREDICTING PROTEIN FUNCTION
There are a number of ways to predict protein function from sequence. Most of them are based on sequence similarity. A large database of protein sequences is screened for 'model sequences' that exhibit a high level of similarity to the query protein sequence. Sequence alignment tools such as BLAST [1] and PSI-BLAST [2] are the work-horses of such analyses. If one or more model sequences are found that exhibit a sufficiently high level of similarity to the query sequence and about whose function we have some knowledge, then conclusions may be possible on the function of the query sequence. If the similarity is above, say, 40 per cent and functionally important motifs are conserved then we can hypothesise that the query sequence has a function that is quite similar to that of the model sequence. As the level of similarity decreases, the conclusions on function that can be drawn from sequence similarity become less and less reliable. Classifications of proteins into families that form clusters of structurally or functionally related proteins are helpful in the prediction of protein function in these cases. There are several protein classifications available on the internet that can serve for this purpose:

- COGs [3,4]

- ProDom [5]

- Pfam [6]

- SMART [7,8]

- PRINTS [9]

- Blocks+ [10]

- ProtoMap [11,12]

A number of these databases (Pfam, PROSITE, PRINTS, ProDom, SWISSPROT+TREMBL) are currently being united in the InterPro [13] database. Since protein function is basically tied to protein domains, protein domain analysis is an integral part of the methodology that leads to protein family databases [14–22].

Since only 20–40 per cent of the protein sequences in genomes such as those of Mycoplasma genitalium, Methanococcus jannaschii and Mycobacterium tuberculosis have significant sequence similarity to proteins of known function [23,24], we need to be able to make conclusions on the function of proteins that exhibit no significant sequence similarity to suitable model proteins. As the similarity between query sequence and model sequence decreases below a threshold of, say, 25 per cent, safe conclusions on a common evolutionary origin of the query sequence and the model sequence can no longer be made. However, it turns out that, in many cases, the protein fold can still be reliably predicted, and in several cases even detailed structural models of protein binding sites can be generated. Thus, especially in this similarity range, protein structure prediction – again together with the identification of conserved sequence
or spatial motifs – can help to ascertain aspects of protein function.

Other sources of information beside sequence similarity have been explored in order to gain insight into protein function. These methods are represented by five arrows pointing downwards in the top right part of Figure 1. The following comments on these methods apply in the order from left to right:

- Sequence alignment has long been used for ascertaining protein function. This is the standard method and we commented on it above. This approach is only reliable if there is high sequence similarity such that we can argue about orthologous proteins, since we know the function of one of the proteins.

- Recently, the Rosetta stone method has been introduced. This method uses over 20 completely sequenced genomes and analyses evolutionary correlations of two domains being fused into one protein in one species and occurring in separate proteins in another species. From these classifications the method establishes pairwise links between functionally related proteins [25] and elicits putative protein–protein interactions [26].

- For the same purpose, the phylogenetic profile method analyses the co-occurrence of genes in the genomes of different organisms [27].

- The analysis of change of phenotype based on mutated genes (eg by knock-out experiments) yields important information on aspects of protein function [28–30].

- In the future, the analysis of genetic variations [31] among individuals, eg single nucleotide polymorphisms (SNPs) [32–34], will be helpful in ascertaining protein function beyond mere disease linkage or association (right arrow in Figure 1) [35–38].

None of these methods looks at protein structures, and thus we do not discuss them in more detail here. While these methods are reported to generate significant insight into protein function on a higher level and to point to putative target proteins [39], in the end, drug design can be expected to necessitate structural knowledge of either the target protein or its binding partners.

METHODS FOR PREDICTING PROTEIN STRUCTURE
In the authors' view, computational methods for predicting protein structure from sequence alone are still well out of range, although there are recent methodical advances, sometimes called mini-threading, that are based on the assembly of fragments (see eg ROSETTA [40]). In contrast, modelling protein structures after folds that have been seen before has become quite a powerful method for protein structure prediction. Here, the query sequence is aligned (threaded) to a model sequence whose three-dimensional structure is known (the template protein). All proteins in a given protein structure database (usually, an appropriate representative set of structures) are tried, and each template is ranked using heuristic scoring functions. The score reflects the likelihood that the query sequence assumes the template structure. The approach of modelling a protein structure after a known template is called homology-based modelling and the selection of a suitable template protein is often done via protein threading.

Protein threading has three major objectives: first, to provide orthogonal evidence of possible homology for distantly related protein sequences; second, to detect possible homology in cases where sequence methods fail; and third, to improve structural models for the query sequence via structurally more accurate alignments.

There are several successful protein threading methods, including:

- methods based on hidden Markov models [41–48];

- dynamic programming methods based on profiles [49–51];

- environment compatibility (ie contact capacity potentials as used in the protein threader 123D) [52].

These programs are very fast. A mid-size protein sequence can be threaded against a database of about 1,500 protein structures in a few minutes on a PC or workstation. However, the underlying methods assume that the assignment of chemical properties to spatial regions in the protein is the same in the query protein and the template protein. This is not the case, in practice, especially if one compares proteins with partly different folds or different functions. Extensions of the homology-based modelling approach to proteins with very similar protein structures but different chemical make-up require the solution of algorithmically provably hard problems and thus necessitate much more computing time. There are:

- heuristic approaches based on distance-based pair potentials of mean force [53–56];

- optimal or approximate combinatorial tree search techniques [57–59].

Such approaches need hours to thread a protein through a database of 1,500 templates. However, they can yield more accurate alignments and models of binding pockets of proteins.

The process of protein threading selects a suitable template protein for a protein query sequence and computes an alignment of the backbone of the two proteins that is the starting point for generating a structural model for the query protein based on the structure of the template protein. What is left is to place the side chains of the query protein and to model the loops of the query protein that are not modelled by the template structure. These two tasks are performed by homology-based modelling tools such as:

- Modeller [60–64] and ModBase [65];

- Swiss-Model [66,67];

- or commercial versions included in Quanta (MSI) or Sybyl (Tripos, Inc.).

For protein side-chain modelling there are two contrasting approaches based on knowledge deduced from structural databases and methods such as energy minimisation and molecular dynamics [68], respectively. Methods based on side-chain rotamer libraries that have been created via the analysis of the protein structure database are usually employed to get a first model. Energy minimisation or molecular dynamics [69] is often used to refine the model. Such methods have been in use for crystallography/nuclear magnetic resonance (NMR) for many years and are available in several program packages and tools (Charmm [70], GROMOS/GROMACS [71,72] and many others [73,74]). In general these methods are quite computer-intensive and can only be exercised on one or a few proteins. Generally, the backbone alignment is an input to homology-based modelling tools and the quality of the derived models is highly sensitive to the accuracy of the provided alignments.

Loops are modelled by a related host of methods. Loops that involve more than about five residues are still hard to model [75–78].

The evaluation of the accuracy of assigning a protein fold (general protein architecture) to a query sequence is commonly based on generally accepted fold classifications such as SCOP [79] or CATH [80]. The quality of backbone alignments is much harder to rate, and no generally accepted scheme is available, as of today [81–84]. Rating the quality of protein structure models is generally based on the root mean square (rms) deviation of the model and the actual structure on a selected set of residues. The problem here is that the model must be superposed with the actual structure. There are several tools that perform this
task – DALI/FSSP [85,86], SSAP [87], VAST [88], PROSUP [89] or SARF [90] – and they can yield different results. Thus, there is no accepted gold standard for protein structure superposition. However, for the purpose of rating the structures of target proteins, the available superposition methods are sufficient.

PERFORMANCE OF PROTEIN STRUCTURE PREDICTION METHODS
There are strong efforts to render the quality of protein structure prediction methods more transparent and easier to evaluate. The centre of these efforts is the biennial CASP experiment, which rates protein structure prediction methods on blind predictions and aims at developing standardised and generally agreed upon assessment procedures both for fold identification and the evaluation of alignment accuracy as well as homology models. A blind prediction is a prediction of the three-dimensional structure for a protein sequence at a time at which the actual structure of the protein is not (yet) known. After the structure has been resolved, the prediction is compared with the actual structure. There have been three issues of the CASP experiment [91]; the fourth one follows this year. The CASP experiment has been a significant help in providing a more solid basis for assessing the power of different protein structure prediction methods.

For fold recognition, detectable progress has been observed from CASP1 to CASP2. In CASP3, similar performance as in CASP2 was achieved on more difficult targets. There appears to be a certain limit of current fold recognition methods, which is still well below the limit of detectable structural similarity (via structural comparisons). In addition, in CASP3 several groups produced reasonable models of up to 60 residues for ab initio target fragments.

In CASP3, of 43 protein targets, 15 could be classified as comparative homology modelling targets, ie related folds and accompanying alignments could be derived beyond doubt. For more than half of the 21 more difficult cases reasonable models could be predicted by at least one of the participating prediction teams. In addition, the CAFASP subsection of the assessment has demonstrated that 10 out of 19 folds could be solved via completely automatic application of the best threading methods without any manual intervention. Methods for refining rough structural models towards the true native structure of the query protein are also not straightforward. This is an active area of research [92].

A combination of protein threading followed by homology-based modelling cannot create genuinely novel protein structures. But it turns out to be quite sensitive in creating structure models based on known folds. Models that have been reasonably accurate (eg down to 1.4 Å for some 60 amino acids of the active site of herpes virus thymidine kinase [93]) have been reported in blind studies of proteins with a sequence identity to the template protein of as low as 10 per cent. Correct folds can be assigned in many cases, even if the query sequence and the suitable template exhibit a very low level of sequence similarity (down to 5 per cent, ie far below the level of random sequence similarity of 17–18 per cent in optimal alignments).

STRUCTURAL GENOMICS
The goal of structural genomics projects is to solve experimental structures of all major classes of protein folds systematically, independent of any functional interest in the proteins [94,95]. The aim is to chart the protein structure space efficiently; functional annotations and/or assignment are made afterwards. This calls for a thoroughly thought-out strategy of mixing experimental protein structure determination, eg via X-ray, with computer-based protein structure prediction. The experiments have to yield novel protein structures. The proteins to be resolved experimentally are again
selected by computer. The computer part deduces the remaining structures based on homology-based modelling and protein threading. One goal of the overall structural genomics endeavour is to have an experimentally resolved protein structure within a certain structural distance to any possible protein sequence, which allows for computing reliable models for all protein sequences.

Once a map of the protein structure space is available, this knowledge should provide additional insights on what the function of the protein in the cell is and with what other partners it might interact. Such information should add to the information gained from high-throughput screening and biological assays. So far, glimpses of what will be possible could be obtained by analysing complete genomes or large sets of proteins from expression experiments with the structural knowledge available today, ie more or less complete representative sets and a quite coarse coverage of structure space [63,96,97].

METHODS FOR PREDICTING PROTEIN FUNCTION FROM PROTEIN STRUCTURE
Aspects of protein structure that are useful for drug design studies typically have to involve three-dimensional structure. Predicting the secondary structure of the protein is not sufficient. Even the similarity of the three-dimensional structures of two proteins cannot be taken as an indication for a similar function of these proteins. The reason is that protein structure is conserved much more than protein function. Indeed, protein folds such as the TIM barrel (triose-phosphate isomerase) are quite ubiquitous and can be considered as general scaffolds that lend molecular stability to the protein and are not directly tied to its function. In contrast, the molecular function of the protein is tied to local structural characteristics pertaining to binding pockets on the protein surface. These characteristics are imprinted onto the protein structure by specific patterns of amino acid side chains that make up the binding pocket. The conservation of these amino acids is what makes two proteins have the same function. Since nature varies sequence quite flexibly, this level of conservation is only maintained among orthologous proteins that exhibit a high level of sequence similarity.

Thus, if the template protein from which we predict protein structure is not orthologous to the query protein, other methods of function prediction have to come to bear. It is quite natural to consider conservation patterns in the protein sequence here, such as exhibited in databases containing functional sequence motifs such as PROSITE. An alternative that has been investigated more recently is to analyse conservation in 3D space [98]. Experience shows that such 'structural' motifs provide more information than motifs derived purely from sequence, even if the sequence motifs are distributed over several regions (BLOCKS+, PRINTS). Recently, the notion of an approximate structural motif has been introduced, sometimes called fuzzy functional form (FFF) [99]. Using a library of approximate structural motifs enhances the range of applicability of motif search at the price of reduced sensitivity and specificity. Such approaches are supported by the fact that, often, binding sites of proteins are much more conserved than the overall protein structure (eg bacterial and eukaryotic serine proteases), such that an inexact model can have an accurately modelled part responsible for function. As the structural genomics projects produce a more and more complete picture of the protein structure space, comprehensive libraries of highly discriminative structural motifs can be expected.

The relationship between structure and function is a true many-to-many relation. Recent studies have shown that particular functions could be mounted onto several different protein folds [100] and, conversely, several protein fold classes can
perform a wide range of functions [101]. This limits our potential of deducing function from structure. But knowledge on which folds support a given function and which functions are based on a given fold can still help in predicting function from structure. In addition, local structural templates such as FFFs indicative for a particular function can identify similar sites and the associated function despite a globally different fold. Such 3D patterns can also discriminate among globally similar folds with respect to containing particular conserved 3D functional motifs in order to classify them into different functional categories.

Though it is not easy to derive functions from resolved protein structures, the availability of structural information improves the chances compared with relying on sequence methods alone.

METHODS FOR DEVELOPING DRUGS BASED ON PROTEIN STRUCTURE
The object of drug design is to find or develop a, mostly small, drug molecule that tightly binds to the target protein, moderating (often blocking) its function or competing with natural substrates of the protein. Such a drug can be best found on the basis of knowledge of the protein structure. If the spatial shape of the site of the protein is known, to which the drug is supposed to bind, then docking methods can be applied to select suitable lead compounds that have the potential of being refined to drugs. The speed of a docking method determines whether the method can be employed for screening compound databases in the search for drug leads. A docking method that takes a minute per instance can be used to screen up to thousands of compounds on a PC or hundreds of thousands of drugs on a suitable parallel computer. Docking methods that take the better part of an hour cannot be suitably employed for such large-scale screening purposes. In order to screen really large drug databases with several hundred thousand compounds, docking methods that can handle single protein/drug pairs within seconds are needed.

The high conformational flexibility of small molecules as well as the subtle structural changes in the protein binding pocket upon docking (induced fit) are major complications in docking. Furthermore, docking necessitates careful analysis of the binding energy. The energy model is cast into the so-called scoring function that rates the protein–ligand complex energetically. Challenges in the energy model include the handling of entropic contributions and solvation effects, and the computation of long-range forces in fast docking methods.

The state of the art in docking can be summarised as follows (see also Table 1). Handling the structural flexibility of the drug molecule can be done within the regime up to about a minute per molecular complex on a PC (see, eg, Kramer et al. [102]). A suitable analysis of the structural changes in the protein still necessitates more computing time. Today, tools that are able to dock a molecule to a protein within seconds are still based on rigid-body docking (both the protein and ligand conformational flexibility is omitted).

Table 1: Taxonomy of docking methods

Runtime on a PC                          | Fraction of a second | About a minute | An hour or longer
Flexibility of the drug molecule         |                      | X              | X
Flexibility of the protein binding site  |                      |                | X
Energy model                             | None                 | Short-range    | Force field

Recently, fast docking tools have been adapted to screening combinatorial drug
libraries (see, eg, Rarey and Lengauer [103]). Such libraries provide a carefully selected set of molecular building blocks together with a small set of chemical reactions that link the modules. In this way, a combinatorial library can theoretically provide a diversity of up to billions of molecules from a small set of reactants.

The accuracy of docking predictions lies within 50–80 per cent 'correct' predictions depending on the evaluation measure and the method. That means that docking methods are far from perfectly accurate. Nevertheless, they are very useful in pharmaceutical practice. The major benefit of docking is that a large drug library can be ranked with respect to the potential that its molecules have for being a useful lead compound for the target protein in question. The quality of a method in this context can be measured by an enrichment factor. Roughly, this is the ratio of the number of active compounds (drugs that bind tightly to the protein) in a top fraction (say the top 1 per cent) of the ranked drug database to the same figure in the randomly arranged drug database. State-of-the-art docking methods in the middle regime (minutes per molecular pair), eg FlexX [104], achieve enrichment factors of up to about 15. Fast methods (seconds per pair), eg FeatureTrees [105], achieve similar enrichment factors, but deliver molecules similar to known binding ligands and do not detect as diverse a range of binding molecules.

Even if the structure of the protein binding site is not known, computer-based methods can be used to select promising lead compounds. Such methods compare the structure of a molecule with that of a ligand that is known to bind to the protein, for instance, its natural substrate.

Alternatives to docking for lead finding include high-throughput screening (HTS). This laboratory method allows for testing the binding affinity of several thousand compounds or more to the same target protein in a day. In comparison this method has the advantage that it does not have to deal with insufficiently powerful computer models, at the expense of high laboratory cost and the absence of structural knowledge on 'why' a compound binds to the protein.

CONCLUSION
In summary, the field is still in an early stage of development. Ab initio protein structure prediction continues to be a grand challenge for which no comprehensive solution is in sight. The quality of fold prediction tools based on homology is rising and has reached the stage where one can generate confident predictions for soluble proteins, which in a substantial fraction (about half) of the cases provide significant threading hits in the structure database. Protein threading and homology-based prediction become especially helpful in an environment where the methods can be used in concert with experimental techniques for structure and function determination. Here, the prediction methods can exercise their strengths, which lie in being used interactively by experts and making suggestions that can be followed up by succeeding experimentation, rather than being required to provide proven fact. The process of going from structure to function is far from being automated. In a scenario that combines structure prediction methods with experimentation, the step from structure to function can be performed in a customised manner.

Protein structure prediction by homology is definitely not yet a turn-key technology. But we can expect it to enter the 'production' stage through the activities in structural genomics. Still, the field of protein structure prediction is very busy generating the tools and processes for raising the number of confident structure predictions and the accompanying estimates of significance. Problems for applying these results in drug design are not only that the models may not be sufficiently accurate but also that the structures of many interesting
target proteins will not be accessible by homology-based modelling at all for some time to come. This includes the therapeutically particularly interesting class of membrane proteins, for which essentially no structures have been resolved.

Docking is used frequently in structure-based drug design. To the authors’ knowledge, the first drug developed with structure-based techniques was the carbonic anhydrase inhibitor Dorzolamide. In the past few years structural considerations have begun to pervade the design of new drugs. A case in point is that of the neuraminidase inhibitors for influenza. Such studies mostly involve experimentally resolved protein structures. However, even models can serve to guide drug development. Based on the experimentally resolved structure of the membrane protein bacteriorhodopsin, several groups are attempting to model binding sites of G-protein coupled receptors that are believed to be structurally similar.

Nevertheless, the authors are not aware of any instance where the whole process line from the protein sequence to the lead structure has been exercised in an integrated manner and with significant help of computer predictions. The field has not reached this level of maturity yet. While structural aspects, even as predicted by the computer, can be expected to invade the search for target proteins and the development of new drugs, experimental data, where they are accessible, will always be highly welcome and often be indispensable in this process.

Acknowledgements
We thank Matthias Rarey for helpful comments on this paper and Gerhard Barnickel and Gerhard Klebe for information on the state of drugs developed by structure-based techniques.

References
1. Altschul, S. F., Gish, W., Miller, W. et al. (1990), ‘Basic local alignment search tool’, J. Mol. Biol., Vol. 215(3), pp. 403–410. http://ncbi.nlm.nih.gov/BLAST/
2. Altschul, S. F., Madden, T. L., Schaffer, A. A. et al. (1997), ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25(17), pp. 3389–3402. http://
3. Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. (2000), ‘The COG database: a tool for genome-scale analysis of protein functions and evolution’, Nucleic Acids Res., Vol. 28(1), pp. 33–36.
4. Tatusov, R. L., Koonin, E. V. and Lipman, D. J. (1997), ‘A genomic perspective on protein families’, Science, Vol. 278(5338), pp. 631–637.
5. Corpet, F., Servant, F., Gouzy, J. and Kahn, D. (2000), ‘ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons’, Nucleic Acids Res., Vol. 28(1), pp. 267–269.
6. Bateman, A., Birney, E., Durbin, R. et al. (2000), ‘The Pfam protein families database’, Nucleic Acids Res., Vol. 28(1), pp. 263–266.
7. Schultz, J., Milpetz, F., Bork, P. and Ponting, C. P. (1998), ‘SMART, a simple modular architecture research tool: identification of signaling domains’, Proc. Natl Acad. Sci. USA, Vol. 95(11), pp. 5857–5864.
8. Schultz, J., Copley, R. R., Doerks, T. et al. (2000), ‘SMART: a web-based tool for the study of genetically mobile domains’, Nucleic Acids Res., Vol. 28(1), pp. 231–234.
9. Attwood, T. K., Croning, M. D., Flower, D. R. et al. (2000), ‘PRINTS-S: the database formerly known as PRINTS’, Nucleic Acids Res., Vol. 28(1), pp. 225–227.
10. Henikoff, S., Henikoff, J. G. and Pietrokovski, S. (1999), ‘Blocks+: a non-redundant database of protein alignment blocks derived from multiple compilations’, Bioinformatics, Vol. 15(6), pp. 471–479.
11. Yona, G., Linial, N. and Linial, M. (2000), ‘ProtoMap: automatic classification of protein sequences and hierarchy of protein families’, Nucleic Acids Res., Vol. 28(1), pp. 49–55.
12. Yona, G., Linial, N. and Linial, M. (1999), ‘ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space’, Proteins, Vol. 37(3), pp. 360–378.
13.
14. Rose, G. D. (1979), ‘Hierarchic organization of domains in globular proteins’, J. Mol. Biol., Vol. 134(3), pp. 447–470.
15. Nichols, W. L., Rose, G. D., Ten Eyck, L. F. and Zimm, B. H. (1995), ‘Rigid domains in proteins: an algorithmic approach to
their identification’, Proteins, Vol. 23(1), pp. 38–48.
16. Gracy, J. and Argos, P. (1998), ‘Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities’, Bioinformatics, Vol. 14(2), pp. 174–187.
17. Gracy, J. and Argos, P. (1998), ‘DOMO: a new database of aligned protein domains’, Trends Biochem. Sci., Vol. 23(12), pp. 495–497.
18. Sowdhamini, R., Rufino, S. D. and Blundell, T. L. (1996), ‘A database of globular protein structural domains: clustering of representative family members into similar folds’, Fold Des., Vol. 1(3), pp. 209–220.
19. Jones, S., Stewart, M., Michie, A. et al. (1998), ‘Domain assignment for protein structures using a consensus approach: characterization and analysis’, Protein Sci., Vol. 7(2), pp. 233–242.
20. Orengo, C. A., Martin, A. M., Hutchinson, G. et al. (1998), ‘Classifying a protein in the CATH database of domain structures’, Acta Crystallogr. D Biol. Crystallogr., Vol. 54(1(Pt 6)), pp. 1155–1167.
21. Murzin, A. G. (1996), ‘Structural classification of proteins: new superfamilies’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 386–394.
22. Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995), ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J. Mol. Biol., Vol. 247(4), pp. 536–540.
23. Fischer, D. and Eisenberg, D. (1999), ‘Predicting structures for genome proteins’, Curr. Opin. Struct. Biol., Vol. 9(2), pp. 208–211.
24. Huynen, M., Doerks, T., Eisenhaber, F. et al. (1998), ‘Homology-based fold predictions for Mycoplasma genitalium proteins’, J. Mol. Biol., Vol. 280(3), pp. 323–326.
25. Marcotte, E. M., Pellegrini, M., Thompson, M. J. et al. (1999), ‘A combined algorithm for genome-wide prediction of protein function’, Nature, Vol. 402(6757), pp. 83–86.
26. Marcotte, E. M., Pellegrini, M., Ng, H. L., Rice, D. W. et al. (1999), ‘Detecting protein function and protein-protein interactions from genome sequences’, Science, Vol. 285(5428), pp. 751–753.
27. Pellegrini, M., Marcotte, E. M., Thompson, M. J. et al. (1999), ‘Assigning protein functions by comparative genome analysis: protein phylogenetic profiles’, Proc. Natl Acad. Sci. USA, Vol. 96(8), pp. 4285–4288.
28. Bork, P., Dandekar, T., Diaz-Lazcoz, Y. et al. (1998), ‘Predicting function: from genes to genomes and back’, J. Mol. Biol., Vol. 283(4), pp. 707–725.
29. Roemer, K., Johnson, P. A. and Friedmann, T. (1991), ‘Knock-in and knock-out: Transgenes, Development and Disease: A Keystone Symposium sponsored by Genentech and Immunex, Tamarron, CO, USA, January 12–18 1991’, New Biol., Vol. 3(4), pp. 331–335.
30. Sato, T. N. (1999), ‘Gene trap, gene knockout, gene knock-in, and transgenics in vascular development’, Thromb. Haemost., Vol. 82(2), pp. 865–869.
31. Collins, F. S., Guyer, M. S. and Chakravarti, A. (1997), ‘Variations on a theme: cataloging human DNA sequence variation’, Science, Vol. 278(5343), pp. 1580–1581.
32. Brookes, A. J. (1999), ‘The essence of SNPs’, Gene, Vol. 234(2), pp. 177–186.
33. Kuska, B. (1999), ‘Snipping “SNPs”: a new tool for mining gene variations’, J. Natl Cancer Inst., Vol. 91(13), p. 1110.
34. Vilain, E. (1998), ‘CYPs, SNPs, and molecular diagnosis in the postgenomic era’, Clin. Chem., Vol. 44(12), pp. 2403–2404.
35. Collins, F. S. (1999), ‘Shattuck lecture – medical and societal consequences of the Human Genome Project’, N. Engl. J. Med., Vol. 341(1), pp. 28–37.
36. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research II. Issues in study design and gene mapping’, Ann. Epidemiol., Vol. 9(2), pp. 75–90.
37. Ellsworth, D. L. and Manolio, T. A. (1999), ‘The emerging importance of genetics in epidemiologic research III. Bioinformatics and statistical genetic methods’, Ann. Epidemiol., Vol. 9(4), pp. 207–224.
38. Terwilliger, J. D. and Ott, J. (1994), ‘Handbook of Human Genetic Linkage’, Johns Hopkins University Press, Baltimore.
39. Drews, J. (1996), ‘Genomic sciences and the medicine of tomorrow’, Nat. Biotechnol., Vol. 14(11), pp. 1516–1518.
40. Simons, K. T., Bonneau, R., Ruczinski, I. and Baker, D. (1999), ‘Ab initio protein structure prediction of CASP III targets using ROSETTA’, Proteins, Vol. 37(S3), pp. 171–176.
41. Karchin, R. and Hughey, R. (1998), ‘Weighting hidden Markov models for maximum discrimination’, Bioinformatics, Vol. 14(9), pp. 772–782.
42. Bateman, A., Birney, E., Durbin, R. et al. (1999), ‘Pfam 3.1: 1313 multiple alignments and profile HMMs match the majority of proteins’, Nucleic Acids Res., Vol. 27(1), pp. 260–262.
43. Park, J., Karplus, K., Barrett, C. et al. (1998), ‘Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods’, J. Mol. Biol., Vol. 284(4), pp. 1201–1210.
44. Barrett, C., Hughey, R. and Karplus, K. (1997), ‘Scoring hidden Markov models’, Comput. Appl. Biosci., Vol. 13(2), pp. 191–199.
45. McClure, M. A., Smith, C. and Elton, P. (1996), ‘Parameterization studies for the SAM and HMMER methods of hidden Markov model generation’, ‘Proc. 4th International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 155–164.
46. Eddy, S. R. (1998), ‘Profile hidden Markov models’, Bioinformatics, Vol. 14(9), pp. 755–763.
47. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. and Durbin, R. (1998), ‘Pfam: multiple sequence alignments and HMM-profiles of protein domains’, Nucleic Acids Res., Vol. 26(1), pp. 320–322.
48. Eddy, S. R. (1996), ‘Hidden Markov models’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 361–365. (1995), ‘Proc. 3rd International Conference on Intelligent Systems for Molecular Biology’, AAAI Press, Menlo Park, CA, pp. 114–120.
49. Bowie, J. U., Luthy, R. and Eisenberg, D. (1991), ‘A method to identify protein sequences that fold into a known three-dimensional structure’, Science, Vol. 253(5016), pp. 164–170.
50. Luthy, R., Bowie, J. U. and Eisenberg, D. (1992), ‘Assessment of protein models with three-dimensional profiles’, Nature, Vol. 356(6364), pp. 83–85.
51. Luthy, R., Xenarios, I. and Bucher, P. (1994), ‘Improving the sensitivity of the sequence profile method’, Protein Sci., Vol. 3(1), pp. 139–146.
52. Alexandrov, N. N., Nussinov, R. and Zimmer, R. M. (1996), ‘Fast protein fold recognition via sequence to structure alignment and contact capacity potentials’, Pacific Symposium on Biocomputing, pp. 53–72.
53. Sippl, M. J. (1990), ‘Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins’, J. Mol. Biol., Vol. 213(4), pp. 859–883.
54. Hendlich, M., Lackner, P., Weitckus, S. et al. (1990), ‘Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force’, J. Mol. Biol., Vol. 216(1), pp. 167–180.
55. Sippl, M. J. (1995), ‘Knowledge-based potentials for proteins’, Curr. Opin. Struct. Biol., Vol. 5(2), pp. 229–235.
56. Sippl, M. J. and Flockner, H. (1996), ‘Threading thrills and threats’, Structure, Vol. 4(1), pp. 15–19.
57. Lathrop, R. H. and Smith, T. F. (1996), ‘Global optimum protein threading with gapped alignment and empirical pair score functions’, J. Mol. Biol., Vol. 255(4), pp. 641–665.
58. Thiele, R., Zimmer, R. and Lengauer, T. (1999), ‘Protein threading by recursive dynamic programming’, J. Mol. Biol., Vol. 290(3), pp. 757–779.
59. Xu, Y., Xu, D. and Uberbacher, E. C. (1998), ‘An efficient computational method for globally optimal threading’, J. Comput. Biol., Vol. 5(3), pp. 597–614.
60. Sali, A. (1995), ‘Modeling mutations and homologous proteins’, Curr. Opin. Biotechnol., Vol. 6(4), pp. 437–451.
61. Sali, A., Potterton, L., Yuan, F. et al. (1995), ‘Evaluation of comparative protein modeling by MODELLER’, Proteins, Vol. 23(3), pp. 318–326.
62. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.
63. Sanchez, R. and Sali, A. (1998), ‘Large-scale protein structure modeling of the Saccharomyces cerevisiae genome’, Proc. Natl Acad. Sci. USA, Vol. 95(23), pp. 13597–13602.
64. Sanchez, R. and Sali, A. (1997), ‘Evaluation of comparative protein structure modeling by MODELLER-3’, Proteins, Suppl 1, pp. 50–58.
65. Sanchez, R., Pieper, U., Mirkovic, N. et al. (2000), ‘MODBASE, a database of annotated comparative protein structure models’, Nucleic Acids Res., Vol. 28(1), pp. 250–253.
66. Guex, N., Diemand, A. and Peitsch, M. C. (1999), ‘Protein modelling for all’, Trends Biochem. Sci., Vol. 24(9), pp. 364–367.
67. Guex, N. and Peitsch, M. C. (1997), ‘SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling’, Electrophoresis, Vol. 18(15), pp. 2714–2723.
68. Petrella, R. J., Lazaridis, T. and Karplus, M. (1998), ‘Protein sidechain conformer
prediction: a test of the energy function’, Fold Des., Vol. 3(5), pp. 353–377.
69. Karplus, M. and Petsko, G. A. (1990), ‘Molecular dynamics simulations in biology’, Nature, Vol. 347(6294), pp. 631–639.
70. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D. et al. (1983), ‘CHARMM: A program for macromolecular energy, minimization, and dynamics calculation’, J. Comp. Chem., Vol. 4, pp. 187–213.
71. Van Gunsteren, W. F. and Berendsen, H. J. (1982), ‘Molecular dynamics: perspective for complex systems’, Biochem. Soc. Trans., Vol. 10(5), pp. 301–305.
72. Van Gunsteren, W. F. and Berendsen, H. J. (1990), ‘Moleküldynamik-Computersimulationen: Methodik, Anwendungen und Perspektiven in der Chemie’, Angew. Chem., Vol. 102, pp. 1020–1055.
73. Levitt, M. (1983), ‘Protein folding by restrained energy minimization and molecular dynamics’, J. Mol. Biol., Vol. 170(3), pp. 723–764.
74. Novotny, J., Bruccoleri, R. and Karplus, M. (1984), ‘An analysis of incorrectly folded protein models. Implications for structure predictions’, J. Mol. Biol., Vol. 177(4), pp. 787–818.
75. van Vlijmen, H. W. and Karplus, M. (1997), ‘PDB-based protein loop prediction: parameters for selection and methods for optimization’, J. Mol. Biol., Vol. 267(4), pp. 975–1001.
76. Lessel, U. and Schomburg, D. (1997), ‘Creation and characterization of a new, non-redundant fragment data bank’, Protein Eng., Vol. 10(6), pp. 659–664.
77. Lessel, U. and Schomburg, D. (1999), ‘Importance of anchor group positioning in protein loop prediction’, Proteins, Vol. 37(1), pp. 56–64.
78. Fechteler, T., Dengler, U. and Schomburg, D. (1995), ‘Prediction of protein three-dimensional structures in insertion and deletion regions: a procedure for searching data bases of representative protein fragments using geometric scoring criteria’, J. Mol. Biol., Vol. 253(1), pp. 114–131.
79. Lo Conte, L., Ailey, B., Hubbard, T. J. et al. (2000), ‘SCOP: a structural classification of proteins database’, Nucleic Acids Res., Vol. 28(1), pp. 257–259.
80. Orengo, C. A., Michie, A. D., Jones, S. et al. (1997), ‘CATH – a hierarchic classification of protein domain structures’, Structure, Vol. 5(8), pp. 1093–1108.
81. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘Measures of threading specificity and accuracy’, Proteins, Suppl 1, pp. 74–82.
82. Marchler-Bauer, A., Levitt, M. and Bryant, S. H. (1997), ‘A retrospective analysis of CASP2 threading predictions’, Proteins, Suppl 1, pp. 83–91.
83. Marchler-Bauer, A. and Bryant, S. H. (1997), ‘A measure of success in fold recognition’, Trends Biochem. Sci., Vol. 22(7), pp. 236–240.
84. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
85. Holm, L. and Sander, C. (1998), ‘Dictionary of recurrent domains in protein structures’, Proteins, Vol. 33(1), pp. 88–96.
86. Holm, L. and Sander, C. (1998), ‘Touring protein fold space with Dali/FSSP’, Nucleic Acids Res., Vol. 26(1), pp. 316–319.
87. Orengo, C. A. and Taylor, W. R. (1996), ‘SSAP: sequential structure alignment program for protein structure comparison’, Methods Enzymol., Vol. 266, pp. 617–635.
88. Gibrat, J. F., Madej, T. and Bryant, S. H. (1996), ‘Surprising similarities in structure comparison’, Curr. Opin. Struct. Biol., Vol. 6(3), pp. 377–385.
89. Lackner, P., Koppensteiner, W. A., Domingues, F. S. and Sippl, M. J. (1999), ‘Automated large scale evaluation of protein structure predictions’, Proteins, Vol. 37(S3), pp. 7–14.
90. Alexandrov, N. N. (1996), ‘SARFing the PDB’, Protein Eng., Vol. 9(9), pp. 727–732.
91. Lattman, E. E. (ed.) (1999), ‘Third Meeting on the Critical Assessment of Techniques for Protein Structure Prediction’, Proteins, Vol. 37, Suppl. 3.
92. Kolinski, A., Rotkiewicz, P., Ilkowski, B. and Skolnick, J. (1999), ‘A method for the improvement of threading-based protein models’, Proteins, Vol. 37(4), pp. 592–610.
93. Zimmer, R. and Thiele, R. (1997), ‘Fast protein fold recognition and accurate sequence–structure alignment’, in ‘German Conference on Bioinformatics, GCB ’96’, Hofestädt, R., Lengauer, T., Löffler, M. and Schomburg, D. Eds, Springer, Berlin, pp. 137–148.
94. Kim, S. H. (1998), ‘Shining a light on structural genomics’, Nat. Struct. Biol., Vol. 5 Suppl., pp. 643–645.
95. Montelione, G. T. and Anderson, S. (1999), ‘Structural genomics: keystone for a Human Proteome Project’, Nat. Struct. Biol., Vol. 6(1), pp. 11–12.
96. Sali, A. (1998), ‘100,000 protein structures for the biologist’, Nat. Struct. Biol., Vol. 5(12), pp. 1029–1032.