Generic approach for predicting unannotated protein pair function using protein
Transcript

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 4, Issue 2, March – April (2013), pp. 142-157
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

GENERIC APPROACH FOR PREDICTING UNANNOTATED PROTEIN PAIR FUNCTION USING PROTEIN

Anjan Kumar Payra1, Sovan Saha1
1 Dept. of Computer Science & Engg, Dr. Sudhir Chandra Sur Degree Engineering College, DumDum, Kolkata, India

ABSTRACT

Proteins are the most versatile macromolecules in living systems and serve crucial functions in essentially all biological processes. With the successful sequencing of several genomes, the challenging problem in the post-genomic era is to determine the functions of proteins. Determining protein functions experimentally is a laborious and time-consuming task involving many resources. Research is therefore under way to predict protein functions using computational methods. The need is practical: for many diseases the drugs are still unknown, and the drug discovery process starts with protein identification, because proteins are responsible for many functions required for the maintenance of life. Protein identification in turn requires determination of protein function. The computational methods are based on sequence and structure, gene neighborhood, gene fusions, cellular localization, protein-protein interactions etc. In this work, we present an approach to predict the functions of an unannotated protein pair based on its protein interaction network. The success rate obtained in our work is 94.4 %.

Keywords: Protein interaction network, Unannotated protein pair function prediction, Functional groups, Success rate.

I. INTRODUCTION

Proteins are the building blocks of life. The human body needs protein to repair and maintain itself, so proteins have versatile functions to perform. However, the concept of protein function is highly context-sensitive and not very well-defined. In fact, this concept typically acts as an umbrella term for all types of activities that a protein is involved in, be it cellular, molecular or physiological. One such categorization of the types of functions a protein can perform has been suggested by Bork et al. [1998]:
o Molecular function: The biochemical functions performed by a protein, such as ligand binding, catalysis of biochemical reactions and conformational changes.
o Cellular function: Many proteins come together to perform complex physiological functions, such as operation of metabolic pathways and signal transduction, to keep the various components of the organism working well.
o Phenotypic function: The integration of the physiological subsystems, consisting of various proteins performing their cellular functions, and the interaction of this integrated system with environmental stimuli determines the phenotypic properties and behavior of the organism.

In order to predict protein function we have to study the existing data types, which can be broadly classified under 8 sections:

o Amino acid sequences
o Protein structure
o Genome sequences
o Phylogenetic data
o Micro array expression data
o Protein interaction networks and protein complexes
o Biomedical literature
o Combination of multiple data types

Amino acid sequences: An amino acid sequence is the order in which amino acids join together to form peptide chains, or polypeptides. If the peptide chain is a protein, this sequence is often called the primary structure of the protein. Due to the structure of amino acids and how they bond together, the order of the amino acids is only read in one direction and is specific for the peptide being formed. The sequence can be used to identify a protein or homologous proteins through searches in databases and also to obtain information about post-translational cleavage points. In addition, the sequence results provide information about the purity of a preparation. The limits of detectable contamination depend on the sequences of the analyzed proteins.

The central dogma of molecular biology is the conversion of a gene to protein via the transcription and translation phases, as shown in Fig. 1. The result of this process is a sequence constructed from twenty amino acids, known as the protein's primary structure. This sequence is the most fundamental form of information available about the protein, since it determines different characteristics of the protein such as its sub-cellular localization, structure and function.

Fig. 1 Central dogma of molecular biology
The most popular experimental method for the identification of protein sequences is mass spectrometry [Sickmann et al. 2003], which, in combination with algorithms such as ProFound [Zhang and Chait 2000], comes in various flavors, such as peptide mass fingerprinting, peptide fragmentation and other comparative methods. However, these methods are low-throughput, and thus, with the exponential generation of genome sequences, the focus has shifted to computational approaches that can identify genes from these genomes. Specifically, techniques that predict protein function from sequence can be categorized into three classes, namely, sequence homology-based approaches, subsequence-based approaches and feature-based approaches, which are explained below:

Homology-based approaches: Homologous traits of organisms are those that are due to descent from a common ancestor. Approaches in this category make the homology-based search process more sensitive by multiple means, such as making the search probabilistic and adding evidence from other sources of data, to obtain more accurate and confident annotations for the query proteins.

Subsequence-based approaches: Several studies have shown that often not the whole sequence, but only some segments of it, are important for determining the function of a given protein. Consequently, the approaches in this category treat these segments or subsequences as features of a protein sequence and construct models for the mapping of these features to protein function. These models are then used to predict the function of a query protein.

Feature-based approaches: The final category of approaches attempts to exploit the perspective that the amino acid sequence is a unique characterization of a protein, and determines several of its physical and functional features. These features are used to construct a predictive model which can map the feature-value vector of a query protein to its function.

Protein structure: A protein is an organic biopolymer that is comprised of a set of amino acids, and assumes a configuration in three-dimensional space due to interactions between these constituents, as shown in Fig. 2. Protein structures may be specified at multiple levels. Usually, a structure is specified at three levels, with a fourth level being specified for some cases [Schulz and Schirmer 1996]. Following is a brief description of these levels:

Primary structure: The primary structure of a protein is simply a sequence of amino acids.

Secondary structure: The sequence of a protein influences its conformation in three-dimensional space via the formation of bonds between spatially close amino acids in the sequence. This process is popularly known as protein folding, and leads to the creation of substructures such as α-helices, β-sheets, turns and random coils, of which the first two are the most common. The collection of these substructures forms the secondary structure of a protein.

Tertiary structure: The attractive and repulsive forces among the substructures caused by the folding balance each other and provide the protein with a relatively stable, though complicated, three-dimensional structure. This structure is known as the tertiary structure of the protein.

Quaternary structure: Some proteins, such as the spectrin protein [Fuller et al. 1974], consist of multiple amino acid sequences, also known as protein subunits. Each of these sequences folds to form its own tertiary structure, and these come together to produce the quaternary structure of the protein.
The existing approaches for predicting protein function from protein structure are:

Similarity-based approaches: Given the structure of a protein, these approaches identify the protein with the most similar structure using structural alignment techniques, and transfer its functional annotations to the query protein.

Fig. 2 Structure of protein

Motif-based approaches: The approaches in this category attempt to identify three-dimensional motifs, that is, substructures conserved in a set of functionally related proteins, and estimate a mapping between the function of a protein and the structural motifs it contains. This mapping is then used to predict the functions of unannotated proteins.

Surface-based approaches: It is sometimes necessary to analyze the structure of a protein at a higher resolution than that of distances between consecutive amino acids. This corresponds to the modeling of a continuous surface for the structure and identifying features such as voids or holes in these surfaces. The approaches in this category utilize these features to infer a protein's function.

Learning-based approaches: This category of recent approaches employs effective classification methods, such as SVM and k-nearest neighbor, to identify the most appropriate functional class for a protein from its most relevant structural features.

Genomic sequences: Genome sequencing is a laboratory process that determines the complete DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast. Almost any biological sample containing a full copy of the DNA (even a very small amount of DNA, or ancient DNA) can provide the genetic material necessary for full genome sequencing. DNA itself is typically a double-stranded molecule, where one of the strands is constituted of four characters, namely A, T, C and G, which denote the four nucleotides adenine, thymine, cytosine and guanine, and the other strand is complementary to the first, owing to the complementarity of the A−T and C−G nucleotide pairs, as shown in Fig. 3. Several approaches have been proposed to accomplish the target of deriving functional associations from genomic data, and possible function prediction subsequently.
Fig. 3 DNA molecules

These approaches largely fall into one of the following three categories [Marcotte 2000]:

Genome-wide homology-based annotation transfer: This category consists simply of the use of larger databases for searching proteins homologous to the query proteins, and the transfer of functional annotation from the closest results.

Gene neighborhood- or gene order-based approaches: These approaches are based on the hypothesis that proteins whose corresponding genes are located "close" to each other in multiple genomes are expected to interact functionally. This hypothesis is supported by the concept of an operon, and its relevance to protein function [Salgado et al. 2000].

Gene fusion-based approaches: These approaches attempt to discover pairs or sets of genes in one genome that are merged to form a single gene in another genome. The underlying hypothesis here is that these sets of genes are functionally related, and it is supported by biochemical and structural evidence [Marcotte et al. 1999].

Phylogenetic data: A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical and/or genetic characteristics. Organisms that are joined together in the tree are implied to have descended from a common ancestor. In a rooted phylogenetic tree, each node with descendants represents the inferred most recent common ancestor of the descendants, and the edge lengths in some trees may be interpreted as time estimates. Each node is called a taxonomic unit. Internal nodes are generally called hypothetical taxonomic units, as they cannot be directly observed.
Phylogenetic profiling is a bioinformatics technique in which the joint presence or joint absence of two traits across large numbers of species is used to infer a meaningful biological connection, such as involvement of two different proteins in the same biological pathway. It is essential to include the evolutionary perspective in any complete understanding of protein function. As a result, several approaches for predicting protein function using evolution-based data have recently been proposed. The field of biology that deals with the evolutionary relationships among living organisms is also known as phylogenetics [Bittar and Sonderegger 2004]. The phylogenetic profile of a protein is (generally) a binary vector whose length is the number of available genomes. The vector contains a 1 in the ith position if the ith genome contains a homologue of the corresponding gene, else a 0. In several other studies, a more extensive representation of evolutionary knowledge is used [Bittar and Sonderegger 2004]. This representation is known as a phylogenetic tree [Baldauf 2003], which is a standard tree with respect to the graph theoretical definition, but whose nodes and branches carry special meaning, as shown in Fig. 4.

Fig. 4 Constructing a simple phylogenetic tree

Micro array expression data: Protein synthesis from genes occurs in prokaryotic organisms in two phases [Weaver 2002]. In the transcription phase, an mRNA is created from the original gene by converting the latter to the corresponding RNA code. The protein is then synthesized from mRNA by translating the RNA code to the corresponding amino acid sequence according to the codon translation rules. Gene expression experiments are a method to quantitatively measure the transcription phase of protein synthesis [Nguyen et al. 2002]. The most common category of these experiments uses square-shaped glass chips measuring as little as 1 inch on either side, also known as cDNA micro arrays. An experiment using a micro array is shown in Fig. 5. The experiment is carried out in the following stages.

In the first stage, the chip is laid out with a matrix of dots of cDNAs, usually several thousands in number, one corresponding to each of the genes being measured. In parallel, mRNA is extracted both from normal cells and from cells of the organism that have been exposed to the condition being studied. These mRNAs are reverse transcribed to cDNA and colored with green and red colors respectively.
These colored cDNAs are then spread on the micro array chip, leading to a hybridization of the cDNA already on the chip with those produced by the genes in the two types of cells. This generates a spot of a certain color on the chip for each gene, which denotes its expression level. In the final stage of the experiment, the intensity of this region is measured by a laser scanner connected to a computer, which generates a real-valued measurement of the expression of each gene as the logarithm of the ratio of the red and green intensities in the region. The result of the experiment thus is a measurement of the transcription activity of the genes under the specified condition.
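As a small illustration of the final measurement step, the sketch below computes a per-gene log ratio from the two channel intensities. It assumes the common base-2 convention and that the red channel carries the condition sample and the green channel the normal sample, as in the description above; the gene names and intensity values are made up for the example.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one spot on the chip.

    Assumption: red = cells exposed to the condition, green = normal cells,
    so a positive value means higher expression under the condition.
    """
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


# Hypothetical (red, green) intensities for three spots.
spots = {
    "geneA": (1200.0, 300.0),   # four times brighter in red
    "geneB": (250.0, 1000.0),   # four times brighter in green
    "geneC": (500.0, 500.0),    # unchanged
}

expression = {gene: log_ratio(r, g) for gene, (r, g) in spots.items()}
```

Using the logarithm makes up- and down-regulation symmetric around zero, which is why a fourfold increase and a fourfold decrease give values of equal magnitude and opposite sign.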
Fig. 5 Micro array procedure

Existing approaches for gene expression data are:

Clustering-based approaches: An underlying hypothesis of gene expression analysis is that functionally similar genes have similar expression profiles, since they are expected to be activated and repressed under the same conditions. Because clustering is a natural approach for grouping similar data points, approaches in this category cluster genes on the basis of their gene expression profiles, and assign functions to the unannotated proteins using the most dominant function for the respective clusters containing them.

Classification-based approaches: A more direct solution to the problem of predicting protein function from gene expression profiles is the data mining approach of classification. Thus, approaches in this category build various types of models for the expression-function mapping using classifiers, such as neural networks, SVMs and the naive Bayes classifier, and use these models to annotate novel proteins.

Temporal analysis-based approaches: Temporal gene expression experiments measure the activity of genes at different instances of time, for instance, during a disease. This behavior can also be used to predict protein function. Thus, approaches in this category derive features from this temporal data and use classification.

Protein interaction networks and protein complexes: A protein almost never performs its function in isolation. Rather, it usually interacts with other proteins in order to accomplish a certain function. However, in keeping with the complexity of the biological machinery, these interactions are of various kinds. At the highest level, they can be categorized into genetic and physical interactions. Genetic interactions occur when the mutations in one gene cause modifications in the behavior of another gene, which implies that these interactions are only conceptual and do not occur physically in a genome. In our project we consider the physical interactions between proteins, since they are more directly related to the process through which a protein accomplishes its functions. Since a protein generally interacts with more than one other protein, these interactions can be structured to form a network, hence the name protein interaction network, which is shown in Fig. 6 and Fig. 7.
Fig. 6 Organic View (Cytoscape) of our data set

Existing approaches that attempt to predict the function of proteins from a protein interaction network can be broadly categorized into the following four categories:

Neighborhood-based approaches: These approaches utilize the neighborhood of the query protein in the interaction network and the most "dominant" annotations among these neighbors to predict its function.

Fig. 7 Circle View (Cytoscape) of our data set

Global optimization-based approaches: In many cases, the neighborhood of the query protein may not contain enough information, such as annotated proteins, for determining the function of the query protein robustly. Under these conditions, it may be advantageous to consider the structure of the entire network and use the annotations of the proteins indirectly connected to the query protein also. The approaches in this category are based on this idea, and in most cases, are based on the optimization of an objective function based on the annotations of the proteins in the network.
Clustering-based approaches: The approaches in this category are based on the hypothesis that dense regions in the interaction network represent functional modules, which are natural units in which proteins perform their function. Thus, these approaches apply graph clustering algorithms to these networks and then determine the functions of unannotated proteins in the extracted modules using measures such as majority.

Association-based approaches: Recently, several computationally efficient algorithms have been proposed for finding frequently occurring patterns in data, in the field of association analysis in data mining [Tan et al. 2005]. The approaches in this category use these algorithms to detect frequently occurring sets of interactions in interaction networks of protein complexes, and hypothesize that these sub graphs denote function modules. Function prediction from these modules is performed as in the clustering-based approaches.

Biomedical literature: As in all other research communities, researchers in the fields of biology and medicine publish the results of their research in various journals and conferences. As a result, a huge repository of knowledge has been created over time in the form of papers, books, reports, theses and other such texts. Clearly, these repositories contain a huge amount of information about important biological concepts such as protein structure and function, cancer-causing genes and several others. Thus, there is great utility in the mining of these repositories and retrieval of useful information, as shown in Fig. 8.

Fig. 8 Biomedical literature

Multiple data types: With a plethora of data being generated by a wide spectrum of proteomics experiments, it may be hypothesized that sometimes what can't be discovered from one source of information may become obvious when multiple sources are analyzed simultaneously. This intuition has been concretized by Kemmeren and Holstege [2003], who have suggested the following distinct advantages achieved by integrating functional genomics data:
o Usually, individual biological data sets provide information about complementary biological processes, such as gene expression and protein interaction networks. Thus, combining them provides a global picture of the biological phenomena a set of genes is involved in.
o Often, data quality varies between different types of data, as well as within different sources of data of the same type. For instance, studies have shown significant variations between the qualities of different protein interaction data sets [Deng et al. 2003]. Thus, the combination of several data sources/types improves the quality of the overall data set, since the errors in one data set may be corrected in another.
o The most important advantage of the integrative approach is that since only conclusions valid over a set of data types are accepted, the predictions made by this approach are usually more confident than those made on the basis of individual data sets.

This gives a clear picture of the different existing data types. We now turn to our own work. Our objective is to assign an un-annotated "protein pair" to one of several functional groups, so we focus on the existing computational techniques that use protein-protein interaction data to predict protein function.

Protein functionality can be predicted by the neighborhood property, which suggests that in the PPI network the neighbors of a particular protein have similar function. In the work of Schwikowski [1] a neighborhood-counting method is proposed to assign k functions to a protein by identifying the k most frequent functional labels among its interacting partners. It is simple and effective, but the full topology is not considered and no confidence scores are assigned for the annotations. In the chi-square method, Hishigaki et al. [2] assign k functions to a protein with the k largest chi-square scores. For a protein P, each function f is assigned a score $(n_f - e_f)^2 / e_f$, where $n_f$ is the number of proteins in the n-neighborhood of P that have the function f, and $e_f$ is the expectation of this number based on the frequency of f among all proteins in the network. Chen et al. [3] extend this neighborhood property to higher levels in the network. They speculate on the functional similarity between a protein and its neighbors from level-1 and level-2. Their algorithm assigns a weight to each of the level-1 and level-2 neighbors of a protein by estimating its functional similarity.

Many graph algorithms have also been applied to functional analysis. Vazquez et al. [4] assign proteins to a function so as to maximize the connectivity of proteins assigned with the same function. They map this problem into an optimization problem using simulated annealing, where they maximize the number of edges that connect proteins (un-annotated or previously annotated) assigned with the same function. Karaoz et al. [5] apply a similar approach to a collection of PPI data and gene expression data. They construct a distinct network for each function in GO. For a particular function f, the state of each annotated protein v equals +1 if v has function f and -1 if v has a different function. Nabieva et al. [6] propose a flow-based approach to predict protein function from the protein interaction network. Considering both the local and global properties of the graph, this approach assigns function to an un-annotated protein based on the amount of flow it receives during simulation, where each annotated protein is the source of functional flow. Deng et al. [7] propose an approach employing the theory of Markov random fields, where they estimate the posterior probability of a protein of interest.
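The two simplest scores reviewed above, neighborhood counting and the chi-square score $(n_f - e_f)^2 / e_f$, can be sketched in a few lines. This is only an illustration under stated assumptions, not the cited authors' code: it uses the level-1 neighborhood (n = 1), takes $e_f$ as the neighborhood size times the fraction of annotated proteins carrying f, and runs on a made-up toy network (proteins P, A to E and their labels are hypothetical).

```python
from collections import Counter


def neighborhood_counting(protein, graph, annotations, k=1):
    """Return the k most frequent function labels among level-1 neighbors."""
    counts = Counter(
        f for nb in graph.get(protein, ()) for f in annotations.get(nb, ())
    )
    return [f for f, _ in counts.most_common(k)]


def chi_square_scores(protein, graph, annotations):
    """Return {function: (n_f - e_f)**2 / e_f} for the level-1 neighborhood.

    Assumption: e_f = (number of neighbors) * (fraction of annotated
    proteins in the whole network that carry function f).
    """
    neighbors = graph.get(protein, set())
    annotated = [p for p, funcs in annotations.items() if funcs]
    freq = Counter(f for p in annotated for f in annotations[p])
    n = Counter(f for nb in neighbors for f in annotations.get(nb, ()))
    scores = {}
    for f, count in freq.items():
        e_f = len(neighbors) * count / len(annotated)
        scores[f] = (n[f] - e_f) ** 2 / e_f
    return scores


# Hypothetical toy network: P is the un-annotated query protein.
graph = {
    "P": {"A", "B", "C"},
    "A": {"P"}, "B": {"P"}, "C": {"P"},
    "D": {"E"}, "E": {"D"},
}
annotations = {
    "A": {"DNA repair"},
    "B": {"DNA repair"},
    "C": {"cell polarity"},
    "D": {"cell polarity"},
    "E": {"cell polarity"},
}

top = neighborhood_counting("P", graph, annotations, k=1)
scores = chi_square_scores("P", graph, annotations)
```

Note that the chi-square score is large whenever the observed count deviates from the expected one in either direction, so in practice it is usually read together with the sign of $n_f - e_f$.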
Letovsky and Kasif [8] use loopy belief propagation with the assumption of a binomial model for local neighbors of proteins annotated with a given term. Similarly, Wu et al. [9] propose a related probabilistic model to annotate functions of unknown proteins and PPI networks based on the structure of the PPI network. Joshi et al. [10] develop a new integrated probabilistic method for cellular function by combining information from protein-protein interactions, protein complexes, micro array gene expression profiles and annotations of known proteins through an integrative statistical model. In the work of Samanta et al. [11], a network-based statistical algorithm is proposed, which assumes that if two proteins share a significantly larger number of common interacting partners, they share a common functionality. Another application is UVCLUSTER, proposed by Arnau et al. [12], which is based on bi-clustering and iteratively explores distance datasets.

Apart from these, among the early graph clustering methods, Bader and Hogue [13] propose Molecular Complex Detection (MCODE), where dense regions are detected according to some parameters. Altaf-ul-Amin et al. [14] also use a clustering approach. It starts from a single node in a graph, and clusters are gradually grown until the similarity of every added node within a cluster and the density of clusters reach a certain limit. Spirin and Mirny [15] use a graph clustering approach where they detect modules that are densely connected within themselves and sparsely connected with the rest of the network, based on super paramagnetic clustering and a Monte Carlo algorithm. Przulj et al. [16] use a graph theoretic approach where clusters are identified using Leda's routine components and those clusters are analyzed by the Highly Connected Sub graphs (HCS) algorithm. Later, King et al. [17] partition networks into clusters using a cost function, applying the Restricted Neighborhood Search Clustering algorithm (RNSC). Clusters are filtered according to their size, density and functional homogeneity. Krogan et al. [18] use the Markov clustering algorithm to predict protein function.

II. PRESENT WORK

o Motivation: Many approaches over the protein-protein interaction (PPI) network have been discussed in the previous section. A study of the literature shows that very few assessments have been pursued on PPI that consider protein pairs and the interconnections within their PPI networks. This observation has encouraged us to work on the PPI network and to predict the function of an unannotated protein pair using a generic approach, which is discussed in the following sections.

o Dataset: In this work, the protein-protein interaction data of yeast (Saccharomyces cerevisiae) is collected from ftp://ftpmips.gsf.de/yeast/PPI/, which contains 15613 genetic and physical interactions. Self-interactions are discarded. A set of 12487 unique binary interactions involving 4648 proteins is taken as data. In our proposed method 15 functional groups are considered. They are cell cycle control (O1), cell polarity (O2), cell wall organization and biogenesis (O3), chromatin chromosome structure (O4), co-immuno-precipitation (O5), co-purification (O6), DNA Repair (O7), lipid metabolism (O8), nuclear-cytoplasmic transport (O9), pol II transcription (O10), protein folding (O11), protein modification (O12), protein synthesis (O13), small molecule transport (O14) and vesicular transport (O15). For each functional group, 90% protein pairs are taken as training samples and rest (2-8%) among them are considered as test samples.

o Basic terminologies:

Protein interaction network: Protein-protein interactions occur when two or more proteins bind together, often to carry out their biological function. Many of the most important molecular processes in the cell, such as DNA replication, are carried out by large molecular machines that are built from a large number of protein components organized by their protein-protein interactions. These protein interactions form a network-like structure which is known as a protein interaction network. Here the protein interaction network is represented as a graph G_P which consists of a set of vertices (nodes) V connected by edges (links) E. Thus G_P = (V, E). Each protein is represented as a node and the interconnections are represented by edges.

Sub graph: A graph G′_P is a sub graph of a graph G_P if the vertex set of G′_P is a subset of the vertex set of G_P and the edge set of G′_P is a subset of the edge set of G_P. That is, if G′_P = (V′, E′) and G_P = (V, E), then G′_P is called a sub graph of G_P if V′ ⊆ V and E′ ⊆ E. G′_P may be defined as a set K ∪ U, where K represents the set of annotated protein pairs while U represents the set of un-annotated protein pairs.

Level-1 neighbors: In G′_P, the directly connected neighbors of a particular vertex are called level-1 neighbors.

o Proposed Work: The proposed work is to deduce the PPI network of each individual protein belonging to an unannotated protein pair chosen from the original data set mentioned earlier, then to identify the common interactions between those deduced PPI networks, and thereby to estimate the success rate of a Generic Approach for predicting the function of an unannotated protein pair.

o Method: Given G′_P, a sub graph of the protein interaction network consisting of protein pairs as nodes, each associated with an element of the set O = {O1, O2, O3, …, O15}, where Oi represents a particular functional group, this method maps the elements of the set of un-annotated protein pairs U to elements of the set O.
Steps associated with this method are described as follows:
Step 1: Take any protein pair as an element from set U.
Step 2: Deduce the PPI network for each protein belonging to the protein pair selected in Step 1.
Step 3: Find the common interacting pairs between the PPI networks deduced in Step 2.
Step 4: Count the number of occurrences Si (i = 1, ..., 15) of each element of set O = {O1, O2, O3, ..., O15} among the common interacting pairs found in Step 3.
Step 5: Assign the Oi of set O = {O1, O2, O3, ..., O15} corresponding to Max(Si) (i = 1, ..., 15) to the unannotated protein pair considered in Step 1.

o Illustration of the method with an example:
An unannotated protein pair YAL011w-YDL181w is taken from our test dataset U; it is shown in yellow in Fig 9. From GP, the sub graph G′YAL011w is taken, whose level-1 neighbors are YDR146c, YCR033w, YDR181c, YDL080c and YDR269w. Similarly, the level-1 neighbors taken for G′YDL181w are YPL078c, YPL240c, YBR118w and YER148w. Two functional groups (DNA repair and cell polarity) are involved in level-1, as shown in Fig 9.
Fig. 9 Sub-graph G′P of protein pair YAL011w-YDL181w and its level-1 neighbors

Then the common interacting pairs between G′YAL011w and G′YDL181w are considered. In Fig 9, there exists only one common interacting pair, YDL080c-YPL078c, which is marked in green. From our dataset, the protein pair YDL080c-YPL078c belongs to the functional group DNA Repair (O7). The number of occurrences of each functional group among the common interacting pairs is then counted, and the functional group with the highest number of occurrences is assigned as the functional group of the unannotated protein pair. Since, in Fig 9, the only common interacting pair belongs to O7, we assign O7 to the unannotated protein pair YAL011w-YDL181w.

Fig. 10 Sub-graph G′P of protein pair YMR236w-YHR099w and its level-1 neighbors.
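The five steps and the YAL011w-YDL181w example can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The function names `build_graph` and `predict_pair_function` and the `annotations` mapping are hypothetical. A "common interacting pair" is read here as an interaction between a level-1 neighbor of one protein and a level-1 neighbor of the other, which is what the worked example suggests. The tie-breaking rule (alphabetical order of group labels) is also an assumption, since the text does not specify one. Only the neighbors and the single annotated pair named in the example are used as data.

```python
from collections import Counter, defaultdict


def build_graph(interactions):
    """Undirected adjacency map; self-interactions and duplicate edges are dropped."""
    graph = defaultdict(set)
    for a, b in interactions:
        if a == b:  # self-interactions are discarded, as in the Dataset paragraph
            continue
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def predict_pair_function(pair, graph, annotations):
    """Return the functional group assigned to an unannotated pair, or None."""
    a, b = pair  # Step 1: the selected unannotated protein pair
    # Step 2: level-1 neighbors of each protein (the partner itself is left out)
    neighbors_a = graph.get(a, set()) - {b}
    neighbors_b = graph.get(b, set()) - {a}
    # Step 3: interactions linking a neighbor of `a` to a neighbor of `b`
    common = {
        frozenset((x, y))
        for x in neighbors_a
        for y in neighbors_b
        if y in graph.get(x, set())
    }
    # Step 4: count occurrences of each functional group among those pairs
    counts = Counter(annotations[p] for p in common if p in annotations)
    if not counts:
        return None  # no annotated common interacting pair was found
    # Step 5: most frequent group; ties broken alphabetically (assumption)
    return max(sorted(counts), key=lambda group: counts[group])


# Interactions taken from the worked example (level-1 neighbors in Fig 9)
interactions = [
    ("YAL011w", "YDR146c"), ("YAL011w", "YCR033w"), ("YAL011w", "YDR181c"),
    ("YAL011w", "YDL080c"), ("YAL011w", "YDR269w"),
    ("YDL181w", "YPL078c"), ("YDL181w", "YPL240c"),
    ("YDL181w", "YBR118w"), ("YDL181w", "YER148w"),
    ("YDL080c", "YPL078c"),  # the common interacting pair
]
# Only the annotation stated in the text: YDL080c-YPL078c is DNA Repair (O7)
annotations = {frozenset(("YDL080c", "YPL078c")): "O7"}

graph = build_graph(interactions)
prediction = predict_pair_function(("YAL011w", "YDL181w"), graph, annotations)
print(prediction)  # O7
```

Storing each pair as a `frozenset` makes the lookup independent of the order in which the two proteins are written, which suits undirected interaction data.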
Another example of a sub graph obtained in our work is highlighted above in Fig. 10; the method for predicting the function of YMR236w-YHR099w is the same as described earlier. In our work, we select unannotated protein pairs and predict their functional group using the Generic approach, as shown in TABLE-I. By counting the matched and unmatched predictions for these protein pairs, we obtain the success rate, or probability of success, as shown in TABLE-II.

TABLE - I

No.  Unannotated protein pair  Original function              Predicted function
1    YNL250w|YKL101w           Cell cycle control             Cell cycle control
2    YBR023c|YER111c           Cell cycle control             Cell cycle control
3    YPL174c|YLR210w           Mitosis                        Mitosis
4    YLR229c|YPL161c           Two hybrid                     Two hybrid
5    YBR023c|YLR370c           Cell polarity                  Cell polarity
6    YNL233w|YCR009c           Cell polarity                  Cell wall organization and biogenesis ˟
7    YBL061c|YLR342w           Cell polarity                  Cell polarity
8    YFR036w|YLR127c           Coimmunoprecipitation          Coimmunoprecipitation
9    YDR108w|YML077w           Coimmunoprecipitation          Coimmunoprecipitation
10   YFR002w|YGR119c           two hybrid                     two hybrid
11   YBL014c|YML043c           Coimmunoprecipitation          affinity purification ˟
12   YBR193c|YOL135c           Coimmunoprecipitation          Coimmunoprecipitation
13   YBL084c|YDR118w           Coimmunoprecipitation          Coimmunoprecipitation
14   YDR145w|YGR252w           copurification                 copurification
15   YHR099w|YOL148c           copurification                 copurification
16   YHR099w|YMR236w           copurification                 copurification
17   YGL112c|YHR099w           copurification                 copurification
18   YBR081c|YDR392w           copurification                 copurification
19   YGL097w|YIL063c           copurification                 copurification
20   YGL097w|YIL063c           synthetic lethal               synthetic lethal
21   YDR145w|YDR176w           copurification                 copurification
22   YDR145w|YLR055c           copurification                 copurification
23   YNL273w|YGL163c           DNA repair                     DNA repair
24   YCL061c|YMR190c           DNA repair                     DNA repair
25   YKL113c|YDR369c           DNA repair                     DNA repair
26   YGR078c|YFR019w           Lipid metabolism               Lipid metabolism
27   YBR023c|YFR019w           Lipid metabolism               Lipid metabolism
28   YCL061c|YAR002w           Nuclear-cytoplasmic transport  Nuclear-cytoplasmic transport
29   YLR418c|YLR384c           Pol II transcription           Pol II transcription
30   YLR418c|YJR140c           Pol II transcription           Pol II transcription
31   YPR135w|YGL244w           Pol II transcription           Pol II transcription
32   YPR135w|YHR200w           Pol II transcription           Pol II transcription
33   YOR070c|YJR032w           Protein folding                Protein folding
34   YDR420w|YDR245w           Protein modification           Protein modification
35   YLR418c|YDR363w-a         Vesicular transport            Vesicular transport
36   YLR039c|YLR360w           Vesicular transport            Vesicular transport

TABLE - II

Total no. of unannotated protein pairs  Matched  Unmatched  Success rate (%)
36                                      34       2          94.4
III. RESULTS & DISCUSSION

The above method is evaluated by the success rate, which is defined as

$$\text{Success rate} = \frac{\text{number of protein pairs whose function is predicted correctly}}{\text{total number of unannotated protein pairs}}$$

In our work, we predict the functions of protein pairs using the Generic Approach algorithm and estimate the success rate for the 15 functional groups considered. The probability of success for six of these functional groups (co-purification (O6), co-immuno-precipitation (O5), pol II transcription (O10), vesicular transport (O15), DNA Repair (O7) and cell polarity (O2)) is shown in tabular and pictorial form in TABLE-III and Fig. 12 respectively.

TABLE - III

Functional group  Number of unannotated protein pairs  Number of matched protein pairs  Probability of success
O6                8                                    8                                1
O5                5                                    4                                0.8
O10               4                                    4                                1
O15               2                                    2                                1
O2                3                                    2                                0.66
O7                3                                    3                                1

Fig. 12 Pictorial representation of success rate for six functional groups (plotted series: number of unannotated protein pairs, number of matched protein pairs, probability of success).

Our proposed work adds an extra dimension to existing graph-theoretic methods, as it computes the functions of unannotated protein pairs, instead of single proteins, considering level-1 neighbors. We expect the performance of the generic approach to improve if a larger interaction network and level-2 neighbors are considered. In future, our aim is to work with more functional groups and with different organisms as well.
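As a quick check of the success-rate definition, the short Python sketch below recomputes the per-group probabilities of TABLE-III and the overall figure of TABLE-II from the reported counts. The names `success_rate`, `table_iii`, `per_group` and `overall` are illustrative only and do not come from the paper.

```python
def success_rate(matched, total):
    """Fraction of unannotated protein pairs whose function was predicted correctly."""
    if total <= 0:
        raise ValueError("total number of protein pairs must be positive")
    return matched / total


# (functional group, unannotated pairs, matched pairs) as reported in TABLE-III
table_iii = [
    ("O6", 8, 8),
    ("O5", 5, 4),
    ("O10", 4, 4),
    ("O15", 2, 2),
    ("O2", 3, 2),
    ("O7", 3, 3),
]

per_group = {group: success_rate(matched, total) for group, total, matched in table_iii}

# Overall counts as reported in TABLE-II: 34 matched out of 36 pairs
overall = success_rate(34, 36)

for group, rate in per_group.items():
    print(f"{group}: {rate:.3f}")
print(f"overall: {100 * overall:.1f} %")
```

For O2 the exact ratio is 2/3, about 0.667; TABLE-III lists the value truncated to 0.66.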