Discovery of Functional Protein Linear Motifs Using a Greedy Algorithm and Information Theory


LEANDRO G. RADUSKY§, JULIANA GLAVINA§, MARIA FATIMA LADELFA¶, MARTIN MONTE¶ AND IGNACIO E. SANCHEZ§

§PROTEIN PHYSIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA
¶MOLECULAR AND CELL BIOLOGY LABORATORY, DEPARTAMENTO DE QUIMICA BIOLOGICA, FACULTAD DE CIENCIAS EXACTAS Y NATURALES-UNIVERSIDAD DE BUENOS AIRES, ARGENTINA

INTRODUCTION

The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called "linear motifs" [1]. It is likely that recognition of yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear motifs from protein-protein interaction datasets.

ALGORITHM

1. DATASET

The algorithm takes as input the sequences of all the protein targets bound by the protein under study. The hypothesis is that any linear motif mediating the interaction will be overrepresented in the sequences of these proteins. The user also determines the length of the putative linear motif to be looked for, e.g., ten residues.

[Diagram: the protein under study physically interacts with several protein targets.]

2. INPUT FILTERS

  1. The presence of homologous proteins in the dataset would lead to spurious motif overrepresentation. We use the CD-HIT algorithm [2] to identify this kind of redundancy and remove it from the input.
  2. Most functional linear motifs are located within disordered protein domains [1]. Disordered regions are identified using the VSL software [3] and kept for analysis.

3. MOTIF SEARCH

Our software is an adaptation of a method used for motif search in DNA sequences [4], implemented in Python. It first calculates all possible alignments of two k-words in the dataset. Next, we offer all possible k-words to each growing alignment and incorporate the one resulting in the highest score. We repeat this procedure until incorporation of new k-words does not increase the score of any alignment. Last, we sort the alignments by their scores. The sorted list is the output of the search.

    input
        Matrix M: sequences to be analyzed
        Integer L: motif length
    output
        Matrix Res: all k-word alignments
    Algorithm
    {
        M' = ObtainAllKWords(M)
        Res = CreateAlignmentsOfTwoKWords(M')
        While (Res) has changed
        {
            CurrentKWords = ObtainAllKWords(M)
            For all alignments A in Res
            {
                AddBestKWord(A, CurrentKWords)
            }
        }
        SortByScore(Res)
        Print Res
    }

4. MOTIF SCORING

We use the information content [5] of each alignment to quantify the overrepresentation of the motif contained in each sequence alignment. Here f(aa,l) is the frequency of amino acid aa at position l of the alignment and n is the number of aligned sequences.

The uncertainty at a position of the alignment is:

    $H(l) = -\sum_{aa} f(aa,l)\,\log_2 f(aa,l)$ (bits)

The information content at a position is the decrease in uncertainty between a random sequence and the observed sequences, with a correction e(n) for the sampling of a finite number of sequences:

    $R_{sequence}(l) = \log_2 20 + \sum_{aa} f(aa,l)\,\log_2 f(aa,l) - e(n)$ (bits)

The information content of an alignment is the sum over all positions:

    $R_{sequence} = \sum_{l} R_{sequence}(l)$ (bits)

5. OUTPUT

We measure the similarity between two motifs as the Pearson correlation coefficient R between the corresponding amino acid frequencies. We group alignments whose similarity is above the desired value of R. Finally, we use sequence logos [5] to picture the motifs in the highest scoring alignments.

RESULTS

VALIDATION: SEARCH FOR KNOWN MOTIFS

We have tested the ability of our algorithm to identify known functional linear motifs in sequence sets taken from the ELM database [6]. Our algorithm captures the known motif in six cases (top), suggesting significant sequence specificity in positions marked as "x" in the consensus. There is a partial match with the known consensus in two cases (bottom left) and no match in three cases (bottom right). The performance is comparable to that of Dilimot [1], a similar software that describes motifs as consensus sequences.

Known motif captured (top):

    Motif            ELM                        Dilimot    Our method
    14-3-3 type 1    R(SFYW)xSxP                RSxSxP     [sequence logo]
    Gamma-adaptin    (DE)(DES)xFx(DE)(LVIMFD)   DDxFxxF    [sequence logo]
    Clathrin box     L(ILM)x(ILMF)(DE)          LIxLD      [sequence logo]
    Mannosylation    WxxW                       DGxW       [sequence logo]
    CtBP             Px(DEN)L(VAST)             DxPxDL     [sequence logo]
    Dynein           (QR)xTQT                   KxTQT      [sequence logo]

Partial match (bottom left):

    Motif            ELM                        Dilimot    Our method
    Integrin         RGD                        RxDV       [sequence logo]
    TRAF6            PxE                        PQE        [sequence logo]

No match (bottom right):

    Motif            ELM                        Dilimot    Our method
    NR box           LxLL                       Not found  Not found
    EH1              Fx(IV)xx(IL)(ILM)          FxIxNI     Not found
    HP1              PxVx(LM)                   KVPxVxL    Not found

CASE STUDY: NUCLEOLAR LOCALIZATION OF MAGE PROTEINS

The MAGE (melanoma-associated antigen) family of proteins are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 protein is observed in both the nucleus and the nucleolus.

Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not of MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.

[Figure panels: GFP-MAGE-A2, GFP-MAGE-B2, Truncated MAGE-B2-GFP. Transfected U2OS cells. Green: GFP tag, blue: DAPI. Magnification 100x.]

CONCLUDING REMARKS

  •  We have implemented an algorithm for the discovery of novel protein functional motifs within sets of unaligned sequences.
  •  The algorithm shows good performance in the recovery of known motifs.
  •  We propose a putative motif responsible for localization of MAGE proteins in the nucleolus.

REFERENCES

[1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biology 2005, 3:e405.
[2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.
[3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.
[4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 1989, 86:1183-1187.
[5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18:6097-6100.
[6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010, 38:D167-D180.
[7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5:615-625.
[8] Emmott E, Hiscox JA. Nucleolar targeting: the hub of the matter. EMBO Rep. 2009, 10:231-238.
[9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010, 38:7388-7399.
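The motif scoring step can be illustrated with a minimal Python sketch. It is not the authors' code. The poster does not give the form of the correction e(n); the sketch assumes the common approximation e(n) = (s - 1) / (2 ln 2 · n) with s = 20 amino acids. The function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def small_sample_correction(n, alphabet_size=len(AMINO_ACIDS)):
    """Approximate e(n) in bits for n aligned sequences.

    ASSUMPTION: the poster only names e(n); this uses the common
    approximation (s - 1) / (2 * ln(2) * n).
    """
    return (alphabet_size - 1) / (2.0 * math.log(2) * n)


def column_information(column):
    """Information content (bits) of one alignment position.

    Implements log2(20) + sum_aa f(aa,l) * log2 f(aa,l) - e(n).
    Residues with zero frequency contribute nothing to the sum.
    """
    n = len(column)
    counts = Counter(column)
    sum_f_log_f = sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(len(AMINO_ACIDS)) + sum_f_log_f - small_sample_correction(n)


def alignment_information(kwords):
    """Information content (bits) of an alignment: the sum over positions."""
    if len({len(w) for w in kwords}) != 1:
        raise ValueError("all k-words in an alignment must have the same length")
    # zip(*kwords) yields the alignment column by column
    return sum(column_information(col) for col in zip(*kwords))


if __name__ == "__main__":
    conserved = ["RGD", "RGD", "RGD", "RGD"]
    diverse = ["RGD", "AKL", "WFY", "CHM"]
    print(alignment_information(conserved), alignment_information(diverse))
```

With this assumed correction, an alignment of few sequences can score below zero even when fully conserved, so the score rewards both conservation and the number of supporting sequences.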

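The greedy motif search step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes that each alignment takes at most one k-word per input sequence, which the poster does not state. It also uses the same assumed small-sample correction e(n) = (s - 1) / (2 ln 2 · n). The function names are hypothetical and loosely follow the poster's pseudocode.

```python
import math
from collections import Counter
from itertools import combinations


def information(words, alphabet_size=20):
    """Information content (bits) of equal-length words, summed over positions.

    ASSUMPTION: e(n) = (s - 1) / (2 * ln(2) * n); the poster gives no formula.
    """
    n = len(words)
    e_n = (alphabet_size - 1) / (2.0 * math.log(2) * n)
    total = 0.0
    for column in zip(*words):
        counts = Counter(column)
        total += math.log2(alphabet_size) - e_n + sum(
            (c / n) * math.log2(c / n) for c in counts.values())
    return total


def obtain_all_kwords(sequences, k):
    """Every k-word occurrence as (sequence index, offset, word)."""
    return [(i, j, seq[j:j + k])
            for i, seq in enumerate(sequences)
            for j in range(len(seq) - k + 1)]


def greedy_motif_search(sequences, k):
    """Grow all two-word alignments greedily and return them sorted by score."""
    kwords = obtain_all_kwords(sequences, k)
    # Seed with every pair of k-words taken from two different sequences
    # (ASSUMPTION: one k-word per sequence in each alignment).
    alignments = [[a, b] for a, b in combinations(kwords, 2) if a[0] != b[0]]
    changed = True
    while changed:
        changed = False
        for alignment in alignments:
            used = {occ[0] for occ in alignment}
            words = [occ[2] for occ in alignment]
            best, best_score = None, information(words)
            for occ in kwords:
                if occ[0] in used:
                    continue
                score = information(words + [occ[2]])
                if score > best_score:  # keep only strict improvements
                    best, best_score = occ, score
            if best is not None:
                alignment.append(best)
                changed = True
    return sorted(alignments,
                  key=lambda a: information([occ[2] for occ in a]),
                  reverse=True)


if __name__ == "__main__":
    seqs = ["AAKRGDLL", "CCRGDWWF", "MMTRGDPQ", "HHYRGDVV"]
    top = greedy_motif_search(seqs, 3)[0]
    print([word for _, _, word in top])
```

Seeding with all pairs is quadratic in the number of k-words, so this sketch suits only small inputs. Several seeds can converge on the same set of k-words, so the sorted output contains near-duplicate alignments that a later grouping step would merge.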