Following the Evolution of New Protein Folds via Protodomains [Report]



Protein evolution proceeds through genetic mechanisms, but selection acts on biological assemblies. I define a protodomain as a minimal independently evolving unit with conserved structure. Protodomain rearrangements have minimal impact on biological assemblies, so they represent a valid evolutionary path through fold space.

This report is the written portion of my Candidacy Exam at the University of California, San Diego. It discusses my current research in Philip Bourne's lab and proposes research for my thesis over the next two years. Slides for the oral presentation are available at

Published in: Health & Medicine

  1. 1. Examination for the Advancement to Candidacy: Following the Evolution of New Protein Folds via Protodomains. Spencer E. Bliven, January 28, 2013. Bioinformatics & Systems Biology, University of California, San Diego
  3. 3. Committee Members: Philip E. Bourne (Chair), Milton H. Saier (Co-Chair), Russell F. Doolittle, Michael K. Gilson, Adam Godzik
  4. 4. Abstract

     The rate at which novel protein folds are discovered has declined rapidly, leading to some hope that currently known protein structures cover the majority of protein fold space utilized by nature, at least within well-sampled classes of proteins [1]. This presents an opportunity to globally analyze fold space. In my proposed thesis I will look for answers to the following ambitious questions:

       • How do new folds evolve from existing ones? What are the relative frequencies of known mechanisms for forming new folds?
       • Is fold space continuous or discrete? How can it display properties of both?
       • How can an understanding of protein fold space translate into insights about specific protein families?

     Although these questions are quite broad, I think that the key elements are now in place to make answers accessible in a PhD thesis. I first redefine these questions as clear computational problems. I then propose some algorithms that could be used to solve the problems, and summarize some steps that have already been taken towards understanding fold space. Finally, I describe an evolutionary model that places the computational results in a biological framework.

     Publications:

     [2] Andreas Prlić, Spencer Bliven, Peter W Rose, Wolfgang F Bluhm, Chris Bizon, Adam Godzik, and Philip E Bourne. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics, 26(23):2983–2985, December 2010.

     [3] Spencer Bliven and Andreas Prlić. Circular permutation in proteins. PLoS Comput Biol, 8(3):e1002445, March 2012.

     [4] Andreas Prlić, Andrew Yates, Spencer E Bliven, Peter W Rose, Julius Jacobsen, Peter V Troshin, Mark Chapman, Jianjiong Gao, Chuan Hock Koh, Sylvain Foisy, Richard Holland, Gediminas Rimša, Michael L Heuer, H Brandstätter-Müller, Philip E Bourne, and Scooter Willis. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20):2693–2695, October 2012.
  5. 5. Contents

     1 Introduction (p. 6)
       1.1 Fold Space (p. 6)
       1.2 Protodomains (p. 6)
       1.3 Protodomain Rearrangements in Evolution (p. 7)
     2 Specific Aims (p. 11)
     3 Preliminary Research (p. 13)
       3.1 Detecting Circular Permutations (CE-CP) (p. 13)
       3.2 Detecting Protein Symmetry (p. 14)
       3.3 Classifying Quaternary Structure Based on Symmetry (p. 14)
       3.4 Domain-based All-vs-all Structural Comparison (p. 16)
     4 Research Design and Methodology (p. 16)
       4.1 Evolutionary Model (p. 20)
     5 Impact (p. 21)
     6 Conclusion (p. 22)
  6. 6. 1 Introduction

     1.1 Fold Space

     The question of the nature of protein fold space has captured the attention of numerous structural and computational biologists. The number of possible protein sequences is so large as to be practically infinite, yet proteins with completely different primary sequences may fold into nearly identical structures [5]. Since the structure of a protein is essential to performing its function, understanding the nature of what structures are possible and how evolution has sampled the space of possible structures can have far-reaching consequences.

     A number of questions regarding the nature of protein fold space remain open. One of the more controversial questions is whether fold space is composed primarily of discrete protein folds, or whether folds are connected by a continuum of possible but unobserved folds [6, 7, 8, 9]. This question has practical implications for the designability of proteins, the utility of structural genomics initiatives, and the design of structure classification methods.

     With the proliferation of protein structures, several schemes for classifying proteins into discrete categories emerged. Some notable examples include SCOP [10] and CATH [9]. Such classifications are undeniably useful as a description of the possible folds observed in nature. However, numerous examples exist of proteins with clear structural similarity that are nevertheless classified as discrete folds by these methods. For example, Grishin [11] describes a sequence of structurally similar proteins leading from an all-β to an all-α protein. Such observations led to the view of fold space as a continuum. In this view, protein classifications are more like clusters of closely related structures. Some structures may lie near the edge of multiple clusters, incorporating structural features of each. Efforts have been made to formalize the notion of continuous fold space by defining rigorous distance functions for pairwise comparisons and computing all-against-all pairwise comparisons of known proteins [12, 8, 13]. Multidimensional scaling can then be used to embed protein folds in a Euclidean space [14, 15, 16]. The problem with such approaches is that they are generally able to distinguish protein classes, but cannot capture the finer classifications of fold and superfamily. Thus, they have limited utility in predicting evolutionary relationships or functional characteristics.

     It now seems that neither the discrete nor the continuous view of protein fold space can fully explain the relationships between protein folds [6]. Instead of focusing on the geometric similarities between folds, perhaps the secret to understanding protein fold space lies in the evolutionary history of proteins. In this proposal I suggest a model for the long-term evolution of protein folds, and discuss the algorithms that need to be developed in order to apply this model to unraveling the evolutionary relationships among the plethora of folds observed today. By examining the evolutionary history of folds, individual cases of folds that developed by incremental changes ("continuous fold space") can be distinguished from the rapid introduction of new folds either de novo or through the recombination of existing folds ("discrete fold space").

     1.2 Protodomains

     One difficulty with clearly defining the nature of fold space is the multi-scale nature of proteins. Various subdivisions of proteins along genetic, geometric, structural, and functional conditions are broadly used in the literature, often inconsistently. To avoid any confusion, I will briefly explain which definitions will be used in this proposal, before introducing a new term, the protodomain.

     The largest and most biologically relevant unit of proteins is the biological assembly, which consists of the protein as it exists in the cell [17]. This may be as a monomer, a protein complex, or an aggregate. A single polypeptide may exist in several types of biological assemblies at the same time due to dynamics or different cellular conditions. The prevalent biological assembly for
  7. 7. a given cell and protein can be determined experimentally, or can be predicted from a crystal structure using methods such as PISA [17].

     The most unambiguous decomposition of a biological assembly is into a set of chains, based on polypeptide connectivity. Chains typically correlate one-to-one with genes (although alternative splicing and post-translational cleavage and ligation lead to notable exceptions). Thus, chains are a good approximation of a decomposition along genetic criteria. Since a chain must be translated as a unit, all transcriptional and translational regulation occurs at the chain level.

     Decomposing a biological assembly based on structural criteria leads to a set of domains. Several different specific criteria are used to define domains (e.g. SCOP [10], CATH [9], or PDP [18]), but domains are generally defined as compact, independent structural units capable of independently folding [19, p. 27]. In SCOP 1.75, 60% of protein structures contain multiple domains. Because domains are defined based on structural criteria, domains may consist of several non-contiguous portions of a chain or even portions from several chains. Although most domains are formed by a single contiguous portion of a chain, about 4% of SCOP domains contain two or more non-contiguous blocks, and 1% span multiple chains.

     Several domain classification schemes exist, including PFam [20], SCOP, and CATH. Each attempts to cluster domains based on structural similarity and evolutionary relationships. Using the SCOP nomenclature, a fold is a group of domains that contain the same major secondary structural elements in the same mutual orientation and with the same connectivity [10, 6]. A continuing challenge is to demonstrate whether domains with a common fold derive from a common ancestor or whether they represent convergent evolution.

     The problem with comparing domains is twofold. First, while domains may be able to fold independently, proteins evolve in the context of their biological assembly. Changes in the structure of the biological assembly may not correspond to significant changes in the structure of component domains. Second, domains often consist of combinations of subdomains, which could plausibly be evolutionarily related. A prime example is symmetric domains, which consist of multiple copies of a subdomain following a gene duplication event (see Protodomain Rearrangements in Evolution).

     To address the issue of evolution, we introduce the concept of a protodomain. A protodomain is a minimal, independently evolving protein unit with a conserved structure. In many cases this may include a whole domain, or even the whole chain. Other domains may consist of several protodomains that have independent evolutionary histories. As a part of a domain, a protodomain is not required to fold independently. Rather, it is a syntenic block that maintains the same fold throughout evolution. The fact that members of a protodomain are related by homology distinguishes them from structural motifs, which are small, common structures that can evolve independently (e.g. zinc finger motifs).

     1.3 Protodomain Rearrangements in Evolution

     To provide motivation for this definition of protodomains, it is useful to consider several evolutionary processes that conserve the structure of the biological assembly while dramatically changing the structure of component chains. First, circular permutation can be viewed as an alteration in the sequence order in which two protodomains occur. Second, the evolution of internal pseudosymmetry from quaternary symmetry maintains the structure of the biological assembly while fusing the participating protodomains into a single chain. Arbitrary recombinations of structural motifs could also represent protodomain rearrangements, but unless they show sequence conservation, the evidence for homology between motifs is too tenuous to consider them protodomains.

     Two protein chains are related by a circular permutation if they contain two regions which are
  8. 8. Figure 1: Several proteins that contain multiple copies of a hypothetical protodomain. (a) Glyoxalase I from Clostridium acetobutylicum [3HDP] contains two symmetrical copies of the protodomain that bind a nickel ion near the axis of rotation. The authors list the structure as monomeric, but PISA suggests a dimer similar to (c). (b) GTP binding regulator from Thermotoga maritima [1VR8] also contains two copies of the protodomain, but at a different relative orientation. Here it is shown superimposed on one protodomain from glyoxalase I, in yellow. (c) Dimer form of glyoxalase I in E. coli [1F9Z]. While each chain individually has a very different structure from the C. acetobutylicum homologue, the protodomains are oriented identically between chains and conserve the metal-binding site between protodomains. (d) One chain from Pseudomonas 1,2-dihydroxynaphthalene dioxygenase [2EHZ] contains four copies of the protodomain, but it assembles into an octamer.
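The idea from section 1.3 that a circular permutation changes only the order in which protodomains occur can be illustrated with a toy example. The Python sketch below is a simplification under stated assumptions: each chain is reduced to a list of hypothetical protodomain labels (a, b, c), and the second chain is searched for inside a doubled copy of the first. It is a sequence-level analogy only, not the structural CE-CP algorithm, and `permutation_offset` is a name invented here for illustration.

```python
def permutation_offset(first, second):
    """Return k such that second == first[k:] + first[:k], or None.

    Both arguments are sequences of hypothetical protodomain labels.
    A result of 0 means the two orders are identical; any other value
    means the chains are related by a circular permutation.
    """
    first, second = list(first), list(second)
    if len(first) != len(second):
        return None
    if not first:
        return 0
    # Doubling the first chain makes every rotation of it appear as a
    # contiguous window, so a permuted order can be found by a plain scan.
    doubled = first + first
    n = len(first)
    for k in range(n):
        if doubled[k:k + n] == second:
            return k
    return None
```

With these labels, `permutation_offset("abc", "cab")` reports that the second chain starts at position 2 of the first, while a reordering that is not a rotation, such as a-c-b, yields `None`.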
  9. 9. homologous but which occur in a permuted order (see figure 2). A large number of proteins related by a circular permutation are known [21]. Despite their permuted sequence, the structures of pairs of circularly permuted proteins are usually extremely similar.

     Figure 2: Schematic representation of a circular permutation in two proteins [3]. The first protein (outer circle) has the sequence a-b-c. After the permutation the second protein (inner circle) has the sequence c-a-b. The letters N and C indicate the location of the amino- and carboxy-termini of the protein sequences and how their positions change relative to each other.

     Figure 3: Schematic of the genetic modifications that can lead to circular permutation, with protodomains represented as arrows. (a) Fission & Fusion mechanism. (b) Permutation by duplication mechanism.

     Andreas Prlić and I described the two major mechanisms by which circularly permuted forms of proteins evolve in our 2012 review of the subject [3]:

         There are two main models that are currently being used to explain the evolution of circularly permuted proteins: permutation by duplication and fission and fusion. The two models have compelling examples supporting them, but the relative contribution of each model in evolution is still under debate [22]. Other, less common, mechanisms have been proposed, such as “cut and paste” [23] or “exon shuffling.”

         Permutation by Duplication

         The earliest model proposed for the evolution of circular permutations is the permutation by duplication mechanism [24]. In this model, a precursor gene
  10. 10. first undergoes a duplication and fusion to form a large tandem repeat. Next, start and stop codons are introduced at corresponding locations in the duplicated gene, removing redundant sections of the protein (see figure 3b).

          One surprising prediction of the permutation by duplication mechanism is that intermediate permutations can occur. For instance, the duplicated version of the protein should still be functional, since otherwise evolution would quickly select against such proteins. Likewise, partially duplicated intermediates where only one terminus was truncated should be functional. Such intermediates have been extensively documented in protein families such as DNA methyltransferases [25].

          Saposin and swaposin. An example for permutation by duplication is the relationship between saposin and swaposin. Saposins are highly conserved glycoproteins that consist of an approximately 80 amino acid residue long protein forming a four alpha helical structure. They have a nearly identical placement of cysteine residues and glycosylation sites. The cDNA sequence that codes for saposin is called prosaposin. It is a precursor for four cleavage products, the saposins A, B, C, and D. The four saposin domains most likely arose from two tandem duplications of an ancestral gene [27]. This repeat suggests a mechanism for the evolution of the relationship with the plant-specific insert (PSI) (see figure 4). The PSI is a domain exclusively found in plants, consisting of approximately 100 residues and found in plant aspartic proteases [28]. It belongs to the saposin-like protein family (SAPLIP) and has the N- and C-termini “swapped”, such that the order of helices is 3-4-1-2 compared with saposin, thus leading to the name “swaposin” [29]. For a review on functional and structural features of saposin-like proteins, see [30].

      Figure 4: Suggested relationship between saposin and swaposin. They could have evolved from a similar gene [26]. Both consist of four alpha helices with the order of helices being permuted relative to each other.

          Fission and Fusion

          Another model for the evolution of circular permutations is the fission and fusion model. The process starts with two partial proteins. These may represent two independent polypeptides (such as two parts of a heterodimer), or may have originally been halves of a single protein that underwent a fission event to become two polypeptides (see figure 3a).

          The two proteins can later fuse together to form a single polypeptide. Regardless of which protein comes first, this fusion protein may show similar function. Thus, if a fusion between two proteins occurs twice in evolution (either between paralogues within the same species or between orthologues in different species) but in a different order, the resulting fusion proteins will be related by a circular permutation.

          Evidence for a particular protein having evolved by a fission and fusion mechanism can be provided by observing the halves of the permutation
  11. 11. as independent polypeptides in related species, or by demonstrating experimentally that the two halves can function as separate polypeptides [31].

          Transhydrogenases. An example for the fission and fusion mechanism can be found in nicotinamide nucleotide transhydrogenases [32]. These are membrane-bound enzymes that catalyze the transfer of a hydride ion between NAD(H) and NADP(H) in a reaction that is coupled to transmembrane proton translocation. They consist of three major functional units (I, II, and III) that can be found in different arrangement in bacteria, protozoa, and higher eukaryotes (see figure 5). Phylogenetic analysis suggests that the three groups of domain arrangements were acquired and fused independently [22].

      Figure 5: Transhydrogenases in various organisms can be found in three different domain arrangements. In cattle, the three domains are arranged sequentially. In the bacteria E. coli, Rb. capsulatus, and R. rubrum, the transhydrogenase consists of two or three subunits. Finally, transhydrogenase from the protist E. tenella consists of a single subunit that is circularly permuted relative to cattle transhydrogenase.

      Both mechanisms of circular permutation require major rearrangements to the underlying genes that code for the protein, but result in minimal changes to the protein structure. The prevailing theory for the evolution of internal symmetry shares this pattern of modifying gene structure while conserving the functional assembly.

      Symmetry has been known to be important in proteins since Perutz’s 1968 structure of hemoglobin [33, 34]. It plays a fundamental role in protein allostery [35], and contributes to function and folding [36]. Among current structures in the PDB, 43% have symmetric quaternary structure (see footnote 1). In addition to this, 19% of SCOP superfamilies contain domains with internal pseudosymmetry (see 3.2 below).

      Footnote 1: The Protein Data Bank., accessed 1/18/2013

      Symmetric molecules are thought to evolve via a duplication mechanism from multimers [37]. For instance, a monomer with three-fold rotational symmetry could evolve from a symmetric trimer via fusion with two duplicate copies of the gene (see figure 6) [31]. The biological assembly in both the trimeric and monomeric forms consists of three copies of the protodomain, but with a different genetic composition. One might expect that monomers with a number of protodomains that is not a power of two would be disfavored, since they would require several sequential duplications whose chains would not form complete assemblies. However, intermediates consisting of two biological assemblies with one strand-swapped chain can stably fold (figure 6b). Intermediates may also be stabilized by single-protodomain paralogues.

      While symmetry and circular permutation are not the only ways in which protodomains can recombine, they are readily identifiable and are associated with established evolutionary processes. Thus, these form a basis from which to start decomposing biological assemblies into protodomains. Additional protodomains may then be identified based on structural similarity, sequence similarity, or other criteria.

      2 Specific Aims

      1. Improve algorithms to identify conserved protodomains globally across the PDB. The first step to understanding the evolution of protein architecture
  12. 12. Figure 6: Hypothetical precursors to fibroblast growth factor 1 (FGF-1) synthesized in [31]. (a) Trimer, with one protodomain per chain [3OL4]. (b) Trimer, with two protodomains per chain. Two barrels are formed, each consisting of three protodomains [3OGF]. (c) Fully symmetric monomer consisting of three protodomains [3O4D].

      is defining the basic repeating units, or protodomains in our structural terminology. A crude approximation of this is simply the domain architecture of various proteins, for instance SCOP superfamilies. However, this breaks down for symmetric proteins, proteins that have undergone circular permutation, and other complex cases such as strand swapping. While we have made progress on algorithms for discovering such cases (see Preliminary Research), additional algorithmic advances are needed to accurately assign individual residues to protodomains. These algorithms will be made freely available to the community under an open source license.

      2. Identify structurally similar and potentially homologous protodomains across fold space. After protodomains are identified, an all-vs-all comparison of representative protodomains will be performed using structural comparison algorithms. These pairwise similarities can be used to construct a network of structurally similar protodomains, allowing clustering analysis to identify potential homologues. Shared ancestry between protodomains can then be further established using sequence comparison and consistency with evolutionary trees. An all-vs-all comparison has already been performed at the domain level utilizing the Open Science Grid supercomputer, but extending this to protodomains could identify links between seemingly dissimilar folds.

      3. Integrate protodomain arrangements with domain and quaternary structure information to create a parsimonious model of fold evolution across the tree of life. A model of protein fold evolution that incorporates protodomain architecture is suggested here and will be further refined based on future data. This model will form the basis for identifying key events in the evolution of protein folds, with the goal of eventually building a parsimonious 'tree of proteins' to document the evolution of existing protodomains.

      4. Apply protodomain principles to understanding the evolution of specific protein families. The protodomain architecture data will be applied to specific protein families in more detail. A deep knowledge of evolutionary relationships within a specific family will bring external corroboration to any insights discovered through the new algorithmic developments. Additionally, knowledge of symmetry could be used to suggest useful protein engineering tasks relevant to the specific system. Several candidate systems are presented, including ion transporters and beta propellers.
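The all-vs-all comparison in Aim 2 grows quadratically: n representatives give n(n-1)/2 unordered pairs. As a rough Python sketch of how such a workload might be split into independent batches for distributed execution, the code below enumerates pairs lazily and groups them. The function name, identifiers, and batch size are hypothetical, and this is not a description of how the Open Science Grid jobs mentioned above were organized.

```python
from itertools import combinations, islice


def pair_batches(ids, batch_size):
    """Yield lists of unordered (id_a, id_b) pairs, batch_size pairs at a time.

    ids: identifiers of the representative protodomains (hypothetical names).
    Each yielded batch could be handed to a separate worker that runs the
    pairwise structural alignments for the pairs it receives.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    pairs = combinations(ids, 2)  # lazy: never materializes all n(n-1)/2 pairs
    while True:
        batch = list(islice(pairs, batch_size))
        if not batch:
            return
        yield batch
```

For five identifiers and a batch size of four, this produces three batches covering all ten pairs, with the last batch holding the remaining two.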
  13. 13. 3 Preliminary Research

      3.1 Detecting Circular Permutations (CE-CP)

      The Combinatorial Extension (CE) algorithm is able to accurately identify alignments between distantly related proteins based on structural similarity [38]. It is one of the most widely used structural alignment algorithms, and an open-source implementation is currently available through the BioJava project [39]. However, the algorithm is limited to comparing proteins that have the same order of residues. Relationships between proteins that are related by rearrangement events cannot be detected by CE, and require more advanced topology-independent structural alignment algorithms.

      The most basic case of rearrangement is that of circular permutation, which requires just a single change in sequence topology to align the pairs of proteins. A version of CE adapted for circular permutations was implemented, called Combinatorial Extension with Circular Permutations (CE-CP) [4]. To get around the requirement that the topology of both proteins being aligned be identical, we use an algorithm analogous to that proposed in [40] for detecting circular permutations by sequence similarity. To compare two proteins, A and B, which are suspected of being related by a circular permutation, first a duplicate of B is created by concatenating the sequence of B to itself. Then the full CE algorithm is run to compare A and BB. If no circular permutation is found, this will result in two identical alignment paths aligning A with each copy of B. However, if a circular permutation has occurred, the optimal alignment path will span the boundary between copies of B, as shown in figure 7.

      Figure 7: (a) CE-CP alignment of periplasmic molybdate-binding protein [1ATG] (orange & yellow) on OpuAC [2B4L] (cyan and blue). (b) Dotplot of duplicated search matrix. CE-CP finds the red alignment, which crosses the duplication boundary at the position of the circular permutation. This information is then mapped back to equivalent positions (grey alignment) to form the final alignment.

      One difficulty with the duplication algorithm is that for difficult cases, the optimal AFP (aligned fragment pair) path may contain a large enough insertion in BB that the same residue in different copies of B is assigned to different portions of A, one before and one after the permutation site. This introduces ambiguity as to where the circular permutation
  14. 14. occurred. To solve this problem we choose the permutation site that results in the longest alignment, and discard portions of the optimal path that are inconsistent with the topology of the chosen permutation site.

      The CE-CP algorithm is available as part of the BioJava open source library. It is also provided as an alignment algorithm on the RCSB PDB website ( pdb/workbench/

      3.2 Detecting Protein Symmetry

      The CE algorithm has also been adapted to detect pseudosymmetry within protein chains. Unlike in crystallographic symmetry, where symmetric chains are known to have identical sequences and structures, pseudosymmetric subunits within a domain can have wildly divergent sequences. Thus, detecting pseudosymmetry requires a structural alignment algorithm to find regions of self-similarity within the protein.

      The CE-Symm algorithm is able to find pseudosymmetry and other types of internal repeats through a few modifications of CE. A protein is compared to itself in a sequential manner using dynamic programming. Like CE-CP, the protein is searched against a duplicated copy of itself to allow a single discontinuity in sequence. The alignment is also restricted to lie a minimum distance from the diagonal, in order to avoid the trivial alignment. After identifying optimal regions of self-similarity, the alignment is post-processed to identify whether rotational symmetry is present and to determine the symmetry order.

      CE-Symm was run on a non-redundant set of proteins representing one protein from each SCOP superfamily ( jfatcatserver/scopResults.jsp). Alignments with a z-score greater than 3.5 were considered to be symmetric molecules. This search found that 20% of SCOP superfamilies are symmetric, which is slightly higher than results found by other methods (see table 1) [41].

      Table 1: Percentage of SCOP superfamilies found to contain symmetry.

      SCOP class                                            Percent symmetric
      (a) All alpha proteins                                23%
      (b) All beta proteins                                 26%
      (c) Alpha and beta proteins (a/b)                     16%
      (d) Alpha and beta proteins (a+b)                     14%
      (e) Multi-domain proteins (alpha and beta)            3%
      (f) Membrane and cell surface proteins and peptides   24%
      (a-f) All classes                                     19%

      A manuscript describing the current version of CE-Symm is currently being prepared. However, the program is limited in that while it can accurately determine whether a domain contains pseudosymmetry, it is much less reliable at determining the minimal protodomains that comprise the protein. Under Aim 1 of this proposal, I will improve the accuracy of CE-Symm for identifying protodomains.

      3.3 Classifying Quaternary Structure Based on Symmetry

      Peter Rose recently developed an algorithm for quickly determining the symmetry of a protein at the level of quaternary structure. The algorithm first finds the stoichiometry of each component within the biological assembly of the input structure. Identical chains are identified within a protein using a sequence identity threshold. For instance, human hemoglobin consists of two alpha and two beta subunits. Thus for high sequence thresholds it would be classified as having α2β2 stoichiometry. However, the subunits can be aligned with 43% identity. Thus, if chains are clustered at 40% identity, hemoglobin will be classified as having 4 corresponding components (α4). Next, a number of rotations of the assembly are performed. Rotations that result in corresponding subunits being superimposed are stored as valid operations.
  15. 15. Figure 8: Three-fold dihedral symmetry in 5-enol-pyruvyl shikimate-3-phosphate (EPSP) synthase [1G6S]. (a) Top and (b) side views of EPSP, projected along the 3-fold axis and one of the three 2-fold axes respectively. (c) Dot-plot showing the six possible alignments consistent with the D3 point group. The dotted lines represent the hinge regions between the two halves of the structure. (d) CE-Symm finds an alignment around one 2-fold axis (corresponding to the pink alignment in the dot plot, which requires only one circular permutation). (e) After manually resolving the domain swapping between the 2-fold symmetric domains (residues 20-241, middle quadrant between the dotted lines), CE-Symm is able to find the 3-fold axis as well.
  16. 16. Finally, the point group of the assembly is determined based on which rotations are valid. Hemoglobin is classified as having 2-fold rotational symmetry at strict thresholds, while at relaxed thresholds where all subunits are considered equivalent, the algorithm detects the 2-fold dihedral pseudosymmetry. The algorithm is also able to determine all axes of rotation for the proteins and display them using a user-friendly Java applet.

      Using this algorithm, a census of quaternary structure symmetry in the PDB was performed. This found that about 80% of all biological assemblies of two or more chains contain symmetry. These results will soon be incorporated into the PDB to enable searching for proteins based on quaternary symmetry.

      The algorithm currently uses sequence comparisons to identify corresponding chains. Thus it is only able to align very closely related chains. Future work will focus on extending this algorithm to use structurally similar protodomains as well. This should reveal additional symmetry from protodomains that have fused into a single chain.

      3.4 Domain-based All-vs-all Structural Comparison

      In addition to developing new methods for identifying protodomains, we have also made preliminary progress on characterizing fold space. Two all-vs-all structural comparisons of the entire PDB have been completed. Initially, a comparison of all chains in the PDB was calculated. This was later extended to include structural comparisons of all domains. The results are available through the PDB website and are updated weekly.

      Detailed methods for the structural comparison are reported in [2]. In brief, a non-redundant set of protein chains was selected by clustering sequences at 40% identity. These representatives were then decomposed into domains using either SCOP domains (where available) or Protein Domain Parser (PDP) assignments. The rigid FATCAT algorithm was used to align each pair of domains. This was made possible by running the approximately 300 million alignments on the Open Science Grid distributed supercomputer.

      After calculating alignments for all pairs, insignificant alignments were removed and the remaining significant alignments were used to form a network of structural similarity (see figure 9). Domains with similar folds tend to cluster together. Mapping additional information onto the network, such as SCOP classification or EC number, allows the correlation between structure and function to be probed. However, it is difficult to draw conclusions about evolutionary relationships between proteins due to the imperfect correlation between structural similarity and evolutionary distance.

      4 Research Design and Methodology

      Aim 1: Improve algorithms to identify conserved protodomains globally across the PDB

      Significant progress has already been made towards this aim with the implementation of CE-CP and CE-Symm. These allow the decomposition of some domains into subdomains, which can then be used to seed searches for other related protodomains with different architectures. However, CE-Symm requires additional developments to be able to identify hypothetical protodomains in symmetric proteins.

      Since CE identifies structurally similar motifs within a protein, the alignments it returns do not always define a one-to-one correspondence between protodomains in the protein. Additionally, the alignment returned is not guaranteed to represent the minimal symmetric subunit. For instance, a four-fold symmetric molecule may be aligned along its 180° rotation rather than identifying the more fundamental 90° rotation which
  17. 17. Figure 9: Network showing structural similarity between protein domains that are annotated in the TCDB as belonging to a membrane protein. Domains with sequence identity above 40% are clustered together into a single node. Domains which can be aligned with TM-Scores greater than 0.5 are connected by an edge.
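The network construction described in the Figure 9 caption can be illustrated with a minimal Python sketch. It assumes that domains have already been clustered at 40% sequence identity and that one TM-score per pair of cluster representatives is available. The node names and scores below are invented for illustration, and the function names are hypothetical rather than part of any existing tool.

```python
def build_similarity_network(scores, threshold=0.5):
    """Build an undirected graph from pairwise scores.

    scores maps (node_a, node_b) pairs to a similarity score.
    An edge is added only when the score exceeds the threshold.
    """
    graph = {}
    for (a, b), score in scores.items():
        # Every node appears in the graph, even if it ends up isolated.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components as sorted lists of node names."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical cluster representatives with made-up TM-scores.
example_scores = {
    ("dom_A", "dom_B"): 0.72,
    ("dom_A", "dom_C"): 0.55,
    ("dom_B", "dom_C"): 0.48,
    ("dom_C", "dom_D"): 0.31,
    ("dom_D", "dom_E"): 0.64,
}

network = build_similarity_network(example_scores)
components = connected_components(network)
for component in components:
    print(component)
```

In this toy input the weak dom_C/dom_D score leaves two separate components, which is the kind of disconnection that becomes interesting when asking whether fold space is continuous at a given similarity threshold.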
  18. 18. would lead to the correct protodomain assignment (see figure 10).

Figure 10: CE-Symm alignment of chicken Triose Phosphate Isomerase, a TIM barrel composed of 8 αβ repeating units. CE-Symm chooses the highest scoring of the 7 possible alignments, which happens to correspond to a 180° rotation. Rather than finding the maximal 8-fold symmetry, CE-Symm finds only 2-fold.

The problem of non-one-to-one alignments can be solved by suitably post-processing alignments to restore this property. Some attempts at this have been made, but a suitable solution is yet to be implemented. The problem of non-minimal protodomain assignment could be solved by modifying CE to find multiple distinct, high-scoring alignments rather than simply returning the top hit. An alternative would be to use the SymD algorithm to identify protein symmetry [41]. SymD is significantly slower than CE-Symm, but is able to find some alignments corresponding to multiple rotations.

After protodomain detection algorithms become sufficiently accurate, they will be run across all domains in the PDB's nonredundant set. This will result in the set of all known protodomains, which could be easily kept up-to-date with the PDB's weekly release schedule. This list of protodomains will be made available to the community as a deliverable from Aim 1.

Aim 2. Identify structurally similar and potentially homologous protodomains across fold space

Given a set of protodomains from Aim 1, an all-vs-all structural comparison will be computed. Since many domains consist of only a single protodomain, many of these comparisons will overlap with the existing domain-based all-vs-all calculation and will not need to be recomputed. Nonetheless, this will require significant computational resources and will be performed on the Open Science Grid (OSG).

Aim 1 is likely to consist of several iterations of improving the assignment of protodomains, as the algorithms become progressively better. The all-vs-all comparison can also be iteratively improved, with only the new or changed protodomain definitions being recomputed.

After the comparison is complete, the network of structurally similar protodomains will be analyzed. Identifying clusters within this network will suggest closely related protodomains. Network analysis can be used to associate clusters with interesting properties such as ligand binding, symmetry order, enzymatic activity, and distribution across [...]

Aim 3. Integrate protodomain arrangements with domain and quaternary structure information to create a parsimonious model of fold evolution across the tree of life.

Aim 3 seeks to place protodomains within the context of the biological assembly, and thence into the broader context of the evolution of protein folds. Much as the preliminary work by Dr. Rose characterized biological assemblies by the symmetry and composition of the chains, so can biological assemblies be classified according to the symmetry and composition of their protodomains. This will be a noisier process due to the much greater divergence of protodomains compared to identical chains in a crystal, but the sensitivity and specificity of the process can be
  19. 19. controlled by the clustering parameters used to establish the set of protodomains in each assembly.

Special attention will be focused on cases where very similar protodomains are found in biological assemblies with different components. These represent major changes in the overall protein, and could lead to different selective pressures on the assembly. Likewise, cases where the protodomain content of the assembly is conserved but the chains composing that assembly differ signal genomic changes without large structural changes. These orthogonal processes of genomic rearrangement and structural rearrangement form the core of a general model for the evolution of proteins.

By combining structural similarity information, the protodomain composition, and external information about evolution such as sequence conservation, distribution among organisms, and evolutionary phylogenies, it should be possible to reconstruct major changes in biological assembly for key protein families.

One interesting question is how novel proteins evolved. For those who view fold space as continuous, new proteins come about through many small structural changes that lead from an ancestral fold to the child fold through countless intermediates. However, expanding a structural similarity network such as that in figure 9 to include all known proteins does not result in a single connected graph for reasonable similarity thresholds. This indicates that there remain folds which cannot be related to one another through continuous structural changes. While some such relationships may require additional sampling of fold space to reveal, for others a search for protodomain rearrangements may detect intermediates connecting these orphan folds with plausible evolutionary paths.

Aim 4. Apply protodomain principles to understanding the evolution of specific protein families

A protein family will be identified which is suitable for more detailed study. The purpose of this is twofold. First, the family will act as a benchmark to test the algorithms. When dealing with global surveys and large datasets, finding a test set where the evolution of protodomains is well understood can make naive assumptions and algorithmic limitations clear. Second, by focusing on a subset of proteins it becomes easier to make testable predictions about practical problems. Knowledge of the protodomain architecture of a family will foster solutions to applications such as protein engineering and structure prediction.

To be a functional test set, the protein family should
• Have good structural coverage
• Contain multiple members with symmetry at either the domain or quaternary structure level
• Contain circularly permuted members
• Span a diverse set of folds

Furthermore, to have practical applications it should contain proteins with connections to human health and disease.

The beta-propeller family could make a good benchmark. Propellers composed of between four and eight protodomains are known, as well as diverse quaternary assemblies. They are widespread throughout the tree of life, and a reasonable amount of evidence suggests that they evolved from a common ancestor [42]. However, the evolution of beta propellers has been well studied and it is not clear whether other all-beta protein families are closely related.

Ion channels are also an exciting prospect for further investigation. Symmetry is known to be important to the function of several ion channels [43], and it appears at both the quaternary and domain levels. Some channels show signs of having repeat regions, but these have now become
  20. 20. asymmetric to the point where they are difficult to align [44]. Our preliminary network analysis also shows that the SCOP membrane proteins class is the most likely to be structurally similar to other SCOP classes. This diversity could uncover interesting structural relationships between membrane proteins and cytosolic families.

Figure 11: Structure of the E. coli ammonia channel AmtB [3C1G]. (a) Three-fold quaternary symmetry. (b) CE-Symm alignment of one chain, showing 2-fold symmetry around the two ion channels.

Aim 4 will consist of a detailed evolutionary analysis of the benchmark family, for comparison with the computational predictions of Aim 3.

Figure 12: Diagram showing evolutionary changes under the model. The two-color shapes represent heteromer formation, disintegration, fusion, and fission. The path on the right represents the same processes for homomers. Although local mutations slowly change the structure of all states, only after a homomer has monomerized can local mutations break its [...]

4.1 Evolutionary Model

Traditional evolutionary analysis relies on sequence comparisons to establish relationships between proteins. For instance, Pfam uses sequence profiles to create families of homologous proteins. Sophisticated sequence-based methods are able to detect homology down to the "twilight zone" of about 20% identity [45, 46, 47]. More distantly related proteins can be discerned if structures are available for both proteins. Because structure changes much less rapidly than sequence, classifications that include structural information, such as SUPERFAMILY, are able to merge much older protein families [48].

Here we present a model of protein fold evolution with an emphasis on structure. The model contains six general operations. These operations separate the underlying genetic events,
  21. 21. which form the mechanism of evolution (changes in DNA such as insertions/deletions, mutations, etc.) from the effect of those changes on the 3D protein complex that provides function to the cell.

1. Local mutation. Any change to protein structure that does not involve changes to the protodomain architecture of the functional biological complex.
2. Protodomain fusion. Two chains of a protein complex fuse to become a single chimeric gene. This can be from a gene fusion (heteromer) or a gene duplication (homomer).
3. Protodomain fission. One protein chain is split into two independently translated genes.
4. Gain of protein-protein interface. Two previously unassociated proteins form a complex. This can be either a heteromer or a homomer.
5. Loss of protein-protein interface. Proteins that previously formed a complex lose their interaction.
6. Development of new protodomains. Dramatic evolutionary events may lead to the creation of an entirely novel fold where no precursor can be found. For instance, the evolution of a folded protein from a disordered one could result in a new protodomain with no structural ancestors. Additionally, primordial folds found in the last universal common ancestor (LUCA) would be considered new protodomains, since their evolutionary history is lost to us.

Distinguishing protodomain rearrangements at a genetic level from structural changes is especially useful for explaining the evolution of circular permutations and pseudosymmetric domains. The circularly permuted proteins identified by CE-CP and other methods often contain a high degree of conservation. Thus the two halves of a circularly permuted protein can be considered protodomains. The fission & fusion mechanism (see figure 3a) can be easily encapsulated in the proposed model as two fusions, both of which preserve the protodomain architecture of the biological assembly. The permutation by duplication mechanism is slightly more complicated to fit into the proposed model, since it appears that the intermediate contains twice as many copies of the protodomains as the original domain. However, the intermediate phase can be considered two biological assemblies on one chain, each containing the same protodomain architecture as the original.

Protodomains are also useful for explaining the evolution of internal pseudosymmetry within protein chains. Symmetric proteins are thought to evolve from homomeric complexes via gene duplication. A rotationally symmetric domain consists of multiple copies of a single protodomain arranged in a symmetric fashion.

5 Impact

The analysis of fold space is motivated both by basic science questions about the evolution of proteins and by practical revelations which can come from understanding the relationships between proteins. In particular, the following areas can benefit from analyzing fold space:

1. Protein design. Designing proteins with novel functions requires mutating one or more scaffold proteins such that they attain the desired structure [49]. Candidate scaffolds could be selected based on the association of the desired function with portions of fold space. To reduce the computational complexity, pseudosymmetric scaffolds could be simplified to fully symmetric multimeric scaffolds with many fewer degrees of freedom.
2. Function prediction. A better understanding of fold space could be used to improve function prediction. For instance, identifying proteins with identical protodomain architectures could be used to propagate functional predictions, even when the order and connectivity of those protodomains has changed.
  22. 22. 3. Protein classification. The graph of fold space can be used to extend existing classification schemes to newly solved structures.

Although significant prior work has gone into analyzing fold space, there are several reasons why now is a good time to revisit this topic. The number of structures available for analysis has greatly increased. Previous attempts to characterize fold space relied on all-vs-all comparisons of hundreds of proteins [16]. Thanks to a recent effort by the Protein Data Bank (PDB) to prepare all-vs-all comparisons of all known proteins, the Bourne lab now has unique access to pairwise alignments of 18,000 non-redundant proteins representing over 70,000 individual structures. This massive increase in the number of structures considered provides a much more thorough sample of fold space.

Estimates of the rate of discovery of novel folds have led to the conclusion that we are nearing the point of sampling all naturally occurring protein folds. The number of novel folds deposited in the PDB has declined, even while the number of depositions rises exponentially. This has led to several estimates of the total number of protein folds, generally in the thousands [50, 51, 52]. The most recent version of SCOP (1.75) contains 1137 non-transmembrane folds, indicating that we are approaching saturation for protein folds. The express goal of structural genomics initiatives is to determine structures for all remaining novel folds. Although attaining this goal is still far in the future, it is likely that the majority of distinct protein folds are present in the current PDB [1]. Since fold space is fairly well sampled by protein structures (at least among soluble proteins), additional insights into the nature of fold space and its relationship to function may be accessible.

6 Conclusion

A doctorate in Bioinformatics and Systems Biology should show my proficiency at both solving computational challenges and making original contributions to our knowledge of biology. I believe that if I fulfill the plan set forth in this proposal I will have achieved both those goals.

The project requires some significant algorithmic developments. In addition to my existing work on the CE-CP and CE-Symm algorithms, I will solve the difficult problem of decomposing biological assemblies into their constituent protodomains. I will also develop algorithms to integrate structural similarity data with evolutionary histories so that the evolutionary history of each protodomain can be calculated.

Running these algorithms globally across the PDB will require dealing with large quantities of data. Computation will be run in a scalable, parallel manner on the Open Science Grid supercomputer. The output will then be analyzed both in aggregate and as specific case studies, and the results made available to the public where appropriate.

I believe that this thesis will also make a significant contribution to biology. The nature of fold space has been hotly debated. By answering questions about fold space in the context of an evolutionary model, the answers will be much more biologically relevant and descriptive than previous characterizations of fold space. Furthermore, incorporating information on the functional biological assemblies into our model rather than focusing on isolated components better mimics the selective pressures at work in cells. Understanding protein symmetry is directly applicable to studies of allostery, protein folding, and evolution. Studies of evolution can give insight into the next drug target or protein design scaffold. And the tools to detect relationships between distant proteins will pave the way for future advances in [...]

Acknowledgements

Andreas Prlić has been a strong mentor throughout my time at UCSD, and has contributed advice and code to most of the projects discussed here. Peter Rose did all the quaternary symmetry studies, and is gracious enough to let
  23. 23. me adapt it for protodomains. Douglas Myers-Turnbull has helped with numerous structural comparison searches of protodomains and identified some great examples. Almost all of the algorithms were either a part of or built on top of the BioJava library, to which many great bioinformaticians have contributed. My wife, Christine, has been very supportive of my long hours. Phil Bourne has been the best advisor a grad student could hope for, and he always makes time to discuss science and life.

References

[1] Lukasz Jaroszewski, Zhanwen Li, S Sri Krishna, Constantina Bakolitsa, John Wooley, Ashley M Deacon, Ian A Wilson, and Adam Godzik. Exploration of uncharted regions of the protein universe. PLoS Biol, 7(9):e1000205, September 2009.
[2] Andreas Prlić, Spencer Bliven, Peter W Rose, Wolfgang F Bluhm, Chris Bizon, Adam Godzik, and Philip E Bourne. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics, 26(23):2983–2985, December 2010.
[3] Spencer Bliven and Andreas Prlić. Circular permutation in proteins. PLoS Comput Biol, 8(3):e1002445, March 2012.
[4] Andreas Prlić, Andrew Yates, Spencer E Bliven, Peter W Rose, Julius Jacobsen, Peter V Troshin, Mark Chapman, Jianjiong Gao, Chuan Hock Koh, Sylvain Foisy, Richard Holland, Gediminas Rimša, Michael L Heuer, H Brandstätter-Müller, Philip E Bourne, and Scooter Willis. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20):2693–2695, October 2012.
[5] Manfred J Sippl. Fold space unlimited. Curr Opin Struct Biol, 19(3):312–320, June 2009.
[6] Ruslan I Sadreyev, Bong-Hyun Kim, and Nick V Grishin. Discrete-continuous duality of protein structure space. Curr Opin Struct Biol, 19(3):321–328, June 2009.
[7] Jeffrey Skolnick, Adrian K Arakaki, Seung Yup Lee, and Michal Brylinski. The continuity of protein structure space is an intrinsic property of proteins. PNAS, 106(37):15690–15695, September 2009.
[8] I N Shindyalov and Philip E Bourne. An alternative view of protein fold space. Proteins, 38(3):247–260, February 2000.
[9] Christine A Orengo, A D Michie, S Jones, D T Jones, M B Swindells, and Janet M Thornton. CATH–a hierarchic classification of protein domain structures. Structure, 5(8):1093–1108, August 1997.
[10] Alexey G Murzin, Steven E Brenner, T Hubbard, and C Chothia. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 247(4):536–540, April 1995.
[11] N V Grishin. Fold change in evolution of protein structures. J Struct Biol, 134(2-3):167–185, April 2001.
[12] Manfred J Sippl. On distance and similarity in fold space. Bioinformatics, 24(6):872–873, March 2008.
[13] Brian Marsden and Ruben Abagyan. SAD–a normalized structural alignment database: improving sequence-structure alignments. Bioinformatics, 20(15):2333–2344, October 2004.
[14] C A Orengo, T P Flores, W R Taylor, and J M Thornton. Identification and classification of protein fold families. Protein Eng, 6(5):485–500, July 1993.
[15] L Holm and C Sander. Mapping the protein universe. Science, 273(5275):595–603, August 1996.
[16] Jingtong Hou, Gregory E Sims, Chao Zhang, and Sung-Hou Kim. A global representation of the protein fold space. PNAS, 100(5):2386–2390, March 2003.
[17] Evgeny Krissinel and Kim Henrick. Inference of macromolecular assemblies from crystalline state. J Mol Biol, 372(3):774–797, September 2007.
  24. 24. [18] Nickolai Alexandrov and Ilya Shindyalov. PDP: protein domain parser. Bioinformatics, 19(3):429–430, February 2003.
[19] Jenny Gu and Philip E Bourne, editors. Structural Bioinformatics. John Wiley & Sons, 2nd edition, January 2009.
[20] Marco Punta, Penny C Coggill, Ruth Y Eberhardt, Jaina Mistry, John Tate, Chris Boursnell, Ningze Pang, Kristoffer Forslund, Goran Ceric, Jody Clements, Andreas Heger, Liisa Holm, Erik L L Sonnhammer, Sean R Eddy, Alex Bateman, and Robert D Finn. The Pfam protein families database. Nucleic Acids Res, 40(Database issue):D290–301, January 2012.
[21] Wei-Cheng Lo, Chi-Ching Lee, Che-Yu Lee, and Ping-Chiang Lyu. CPDB: a database of circular permutation in proteins. Nucleic Acids Res, 37(Database issue):D328–32, 2009.
[22] January Weiner and Erich Bornberg-Bauer. Evolution of circular permutations in multidomain proteins. Mol. Biol. Evol., 23(4):734–743, April 2006.
[23] Janusz M Bujnicki. Sequence permutations in the molecular evolution of DNA methyltransferases. BMC Evol. Biol., 2:3, March 2002.
[24] Bruce A Cunningham, John J Hemperly, Thomas P Hopp, and Gerald M Edelman. Favin versus concanavalin A: Circularly permuted amino acid sequences. PNAS, 76(7):3218–3222, July 1979.
[25] A Jeltsch. Circular permutations in the molecular evolution of DNA methyltransferases. J Mol Evol, 49(1):161–164, July 1999.
[26] H Ponstingl, K Henrick, and Janet M Thornton. Discriminating between homodimeric and monomeric proteins in the crystalline state. Proteins, 41(1):47–57, October 2000.
[27] Einat Hazkani-Covo, Neta Altman, Mia Horowitz, and Dan Graur. The evolutionary history of prosaposin: two successive tandem-duplication events gave rise to the four saposin domains in vertebrates. J Mol Evol, 54(1):30–34, January 2002.
[28] K Guruprasad, K Törmäkangas, J Kervinen, and T L Blundell. Comparative modelling of barley-grain aspartic proteinase: a structural rationale for observed hydrolytic specificity. FEBS Lett, 352(2):131–136, September 1994.
[29] C P Ponting and R B Russell. Swaposins: circular permutations within genes encoding saposin homologues. Trends Biochem Sci, 20(5):179–180, May 1995.
[30] Heike Bruhn. A short guided tour through functional and structural features of saposin-like proteins. Biochem. J., 389(Pt 2):249–257, July 2005.
[31] Jihun Lee and Michael Blaber. Experimental support for the evolution of symmetric protein architecture from a simple peptide motif. PNAS, 108(1):126–130, January 2011.
[32] Y Hatefi and M Yamaguchi. Nicotinamide nucleotide transhydrogenase: a model for utilization of substrate binding energy for proton translocation. FASEB J, 10(4):444–452, March 1996.
[33] M F Perutz, H Muirhead, J M Cox, L C Goaman, F S Mathews, E L McGandy, and L E Webb. Three-dimensional Fourier synthesis of horse oxyhaemoglobin at 2.8 Å resolution: (1) x-ray analysis. Nature, 219(5149):29–32, July 1968.
[34] M F Perutz, H Muirhead, J M Cox, and L C Goaman. Three-dimensional Fourier synthesis of horse oxyhaemoglobin at 2.8 Å resolution: the atomic model. Nature, 219(5150):131–139, July 1968.
[35] Jacques Monod, Jeffries Wyman, and Jean-Pierre Changeux. On the Nature of Allosteric Transitions: A Plausible Model. J Mol Biol, 12:88–118, May 1965.
[36] K Kinoshita, A Kidera, and N Go. Diversity of functions of proteins with internal symmetry in spatial arrangement of secondary
  25. 25. structural elements. Protein Sci, 8(6):1210–1217, June 1999.
[37] Anne-Laure Abraham, Joël Pothier, and Eduardo P C Rocha. Alternative to homo-oligomerisation: the creation of local symmetry in proteins by internal amplification. J Mol Biol, 394(3):522–534, December 2009.
[38] I N Shindyalov and Philip E Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, 11(9):739–747, August 1998.
[39] R C G Holland, T A Down, M Pocock, Andreas Prlić, D Huen, K James, S Foisy, A Dräger, A Yates, M Heuer, and M J Schreiber. BioJava: an open-source framework for bioinformatics. Bioinformatics, 24(18):2096–2097, September 2008.
[40] S Uliel, A Fliess, A Amir, and R Unger. A simple algorithm for detecting circular permutations in proteins. Bioinformatics, 15(11):930–936, November 1999.
[41] Changhoon Kim, Jodi Basner, and Byungkook Lee. Detecting internally symmetric protein structures. BMC Bioinformatics, 11:303, 2010.
[42] Indronil Chaudhuri, Johannes Söding, and Andrei N Lupas. Evolution of the beta-propeller fold. Proteins: Structure, Function, and Bioinformatics, 71(2):795–803, May 2008.
[43] Lucy R Forrest and Gary Rudnick. The rocking bundle: a mechanism for ion-coupled solute flux by symmetrical transporters. Physiology (Bethesda), 24:377–386, December 2009.
[44] Lucy R Forrest, Reinhard Krämer, and Christine Ziegler. The structural basis of secondary active transport mechanisms. Biochim Biophys Acta, 1807(2):167–188, February 2011.
[45] R F Doolittle. Similar amino acid sequences: chance or common ancestry? Science, 214(4517):149–159, October 1981.
[46] L Jaroszewski, L Rychlewski, and A Godzik. Improving the quality of twilight-zone alignments. Protein Sci, 9(8):1487–1496, August 2000.
[47] J Soding. Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7):951–960, March 2005.
[48] J Gough, K Karplus, R Hughey, and C Chothia. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol, 313(4):903–919, November 2001.
[49] Brian Kuhlman, Gautam Dantas, Gregory C Ireton, Gabriele Varani, Barry L Stoddard, and David Baker. Design of a novel globular protein fold with atomic-level accuracy. Science, 302(5649):1364–1368, November 2003.
[50] C Zhang and Charles DeLisi. Estimating the number of protein folds. J Mol Biol, 284(5):1301–1305, December 1998.
[51] S Govindarajan, R Recabarren, and R A Goldstein. Estimating the total number of protein folds. Proteins, 35(4):408–414, June 1999.
[52] Y I Wolf, Nick V Grishin, and E V Koonin. Estimating the number of protein folds and families from complete genome data. J Mol Biol, 299(4):897–905, June 2000.
  26. 26. Figure 13: Timeline. Gantt chart with a quarterly time axis (July, October, January, April) labeled 2012, 2013, and 2014. Each entry below gives the task title followed by its effort.

1) Aim 1: 36w 1d
    1.1) Refine CE-Symm alignments: 4w 3d
    1.2) Return multiple CE-Symm alignments: 6w
    1.3) Run algorithms on PDB: 3w 1d
    1.4) Build NR protodomain set: 3w 4d
    1.5) Investigate additional protodomain detection algorithms: 18w 3d
2) Aim 2: 38w 2d
    2.1) Protodomain all-vs-all on OSG: 10w
    2.2) Analyze protodomain similarity network: 6w
    2.3) Annotate network with phylogenetic history: 9w 4d
    2.4) Optimize protodomain clustering: 12w 3d
3) Aim 3: 30w 3d
    3.1) Determine protodomain composition of BAs: 8w
    3.2) Identify interesting examples of BA/protodomain coevolution: 14w 3d
    3.3) Apply evolutionary model to network data: 8w
4) Aim 4: 48w 4d
    4.1) Review literature on ion channel & beta propeller evolution: 21w 1d
    4.2) Computationally determine evolutionary history of target family w.r.t. model: 10w 3d
    4.3) Validate model based on literature: 4w
    4.4) Apply new knowledge to protein-specific problem: 13w
5) Thesis: 13w
    5.1) Compile thesis: 13w