Protein Functional Site Prediction Using the Shortest-Path Graph Kernel Method



Sanjaka, Malinda, M.S., Department of Computer Science, College of Science and Mathematics, North Dakota State University, April 2013. Protein Functional Site Prediction Using the Shortest-Path Graph Kernel Method. Major Professor: Dr. Changhui Yan.
Over the past decade, structural genomics projects have accumulated structural data for over 75,000 proteins, but the functions of most of them are unknown or uncertain because of the limitations of laboratory approaches for discovering protein function. Computational methods play a key role in narrowing this gap. Graphs are often used to describe and analyze the geometry and physicochemical composition of biomolecular structures such as chemical compounds and protein active sites (phosphorylation and enzyme catalytic sites). A key problem in graph-based structure analysis is to define a measure of similarity that enables a meaningful comparison of such structures. In this regard, kernel functions have attracted a lot of attention, especially since they allow the application of a rich repertoire of methods from kernel-based machine learning. In this study, we developed a graph method that represents the protein surface based on how amino acid residues contact each other. We then implemented a shortest-path graph kernel function to calculate similarities between the graphs, and three variants of the nearest-neighbor method that predict functional sites on proteins using the similarity measure given by the kernel. The prediction methods were evaluated on two datasets using the leave-one-out approach. The best method achieved accuracy as high as 78%. We also sorted all examples in order of decreasing prediction score. The results revealed that the positive examples (functional sites) were associated with high prediction scores and that the functional sites were enriched in the top 10th percentile. This project showed that the proposed method was able to capture the similarity between protein functional sites and could provide a useful tool for functional site prediction.

Published in: Technology
  • Hi all, good morning. My research title is protein functional site prediction using the shortest-path graph kernel.
  • This is my presentation outline... (read list)
  • The problem our research addresses is determining the functional sites in a protein structure. What are functional sites? A functional site is a residue or group in a protein that actively participates in a biochemical reaction with another element, such as magnesium, zinc, or a phosphate group. The picture shows an example of a functional site (a phosphorylation site), in purple. I will give more details about functional sites in the rest of my presentation.
  • Functional site prediction is important for several reasons: 1. It helps us understand protein functions. 2. Drug design companies can use protein functional information to design new structure-based drugs. 3. Protein engineers use functional site information to design new proteins based on well-identified functions.
  • The next section of my presentation is the introduction.
  • Proteins are very important molecules in biological cells. They are involved in virtually all cell functions, and each protein in the body has a specific role. A protein is built from the 20 amino acids shown in the following table; in other words, a protein is a sequence of amino acids. Each amino acid in a protein sequence is commonly called a residue, so each residue represents one amino acid. A protein also has a three-dimensional structure, which is used to identify the functionally important groups, in other words, the active sites.
  • As I mentioned in the problem statement, a functional site is a part of a protein that is involved in biochemical reactions. A few examples of functional sites are shown here (read list), but in our approach we considered only the first two: the phosphorylation site and the catalytic active site. This picture shows an example of a reaction at a phosphorylation site: the addition or removal of a phosphate on a protein structure. Other functional sites are similarly involved in various biochemical reactions.
  • Now we have a brief idea of why functional site prediction matters, so we need to know how to determine these functionally important groups in a protein. One popular way is to run laboratory experiments such as X-ray crystallography or NMR, but laboratory methods come with some challenges, shown in this list: they can be time-consuming, they need expensive equipment, and some proteins are not suited to them. For example, NMR requires the protein to be in solution, but not every protein can be brought into solution. On the next slide I will explain the available alternatives.
  • A large number of structural genomics projects are working on finding protein structures, and the protein data banks already hold a large number of structures. The problem is the lack of knowledge about the functions of those structures. In brief, we already have many protein structures without knowing what they do, and the gap between known structures and known functions grows day by day. An alternative is needed to narrow this gap, so computing professionals try to provide high-accuracy methods as a solution. Computational methods have several advantages over laboratory methods (read list). They also have a disadvantage: accuracy. Most of the time laboratory methods are more accurate than computational methods, but what a computational method discovers about the functionally important groups in a protein can serve as a good guideline for laboratory researchers.
  • Nowadays a few kinds of computational methods are in use: template-based, microenvironment-based, and macroenvironment-based methods. Briefly, a template-based method finds a similar, experimentally verified template in a protein database and then uses an alignment method to determine the function of the target protein structure. A microenvironment-based method basically uses the nearest-neighbor method to determine the unknown function of a target by comparing the structural and physicochemical properties of its neighbors. I will explain more about the nearest-neighbor method in the next few slides. A macroenvironment-based method follows the same process as the microenvironment-based method; the only difference is that the number of neighboring residues is comparatively higher. In our approach we used the macroenvironment-based method to predict functional sites from a protein structure.
  • We proposed a graph-kernel-based computational approach. We generate one graph for each residue; each residue is either positive or negative, and its function has been experimentally verified. The number of nodes in a graph depends on the number of residues in contact with the graph's central node, and the number of graphs equals the number of residues in the protein sequence. This is only an overview of our approach; I will explain the process in detail in the Materials and Methods section.
  • As mentioned on the previous slides, we generated a set of graphs, one per residue, and each residue is either positive or negative. The function of each residue has been verified experimentally, so the set of graphs in the training set can be considered a knowledge base consisting of functional-site graphs and non-functional-site graphs. We used two knowledge bases, one for catalytic site prediction and the other for phosphorylation site prediction. The knowledge base is used to calculate the similarity between each residue in the training set and a target residue, and we used the nearest-neighbor method as the predictor. I will explain more about the similarity calculation and the prediction process in the Materials and Methods section.
  • The next section is Materials and Methods.
  • Materials and methods. We used the following databases to retrieve protein structures and sequences. 1. We used this link to download the PDB file of each protein. The PDB file describes the protein structure: it provides the geometric coordinates of every atom of every residue. We used this information to check whether any two residues in a protein are in contact with each other. 2. This link is used to download all protein sequences. It provides the sequences in FASTA format, so it is easy to map them to the PDB IDs in each dataset and retrieve the relevant protein sequences.
  • In our research we used two datasets, one of catalytic sites and the other of phosphorylation sites. This link is used to download the PDB IDs of the catalytic site proteins, which we mapped to the PDB database to get the relevant protein sequences, as mentioned on the previous slide. We selected a dataset of 73 protein sequences, each containing at least one catalytic site according to the CSA.DAT database. CSA.DAT provides literature information on catalytic active-site residues that were discovered experimentally. We then mapped each residue in the protein sequences, through its residue index number, to CSA.DAT to identify the catalytic active and non-active residues. We found 201 active residues and 20,398 non-active residues. A dataset this imbalanced does not give reasonable predictions, so we randomly selected non-active residues to match the number of active residues; our balanced dataset contains 201 active (functional) residues and 201 non-active residues. We used this link to download the phosphorylation site data, which itself provides information on the active phosphorylation residues. This dataset contains 679 protein sequences. We followed the same process as for the catalytic site dataset: we mapped the phosphorylation PDB IDs to the PDB database and then used the active residue list to find the active phosphorylation sites in each sequence. We found 2,062 active residues and 139,795 non-active residues. This dataset is also too imbalanced for reasonable predictions, so we again selected a balanced dataset at random, containing 2,062 active residues and 2,062 non-active residues.
  • A graph is defined by its vertices and edges. In our research we used undirected, labeled, weighted graphs. A simple graph can also be represented by an adjacency matrix, and we used an adjacency matrix to record which residues in a graph are in contact.
  • In our proposed method we generate a set of graphs based on the contacts between residues. Each node in a graph represents a residue of the protein structure, which may be positive or negative, in other words a functional (active) site or a non-functional site. The nodes are labeled with PSSM values, which indicate the biological conservation of each amino acid. Edges are defined by contacts between residues. Finding the contacts is a little complicated because each residue consists of a number of atoms, so we have to compare every atom in one residue with every atom in the other. If at least one atom in one residue is in contact with an atom in the other residue, the two residues are considered to be in contact, and we create an edge between the two nodes. Each edge is weighted by the length between the two nodes; in our approach we always assume a length of 1. The information needed to calculate the distance between two residues is provided by the PDB files and the VDW file.
  • Put simply, a kernel is a matrix in which each element is the result of a vector product. A graph kernel is also a matrix, but it uses graphs instead of vectors: each element is the similarity between two graphs, calculated by comparing the pairs of nodes of both graphs. The shortest-path graph kernel is a graph kernel that calculates the similarity between two graphs based on the shortest paths between each pair of nodes in each graph.
  • We used the Floyd-Warshall algorithm to convert each original graph into a shortest-path graph. The shortest-path graph kernel is then used to calculate the similarity between two graphs. Graph similarity is calculated by comparing pairs of nodes in both graphs, based on the label values of the nodes and the weights of the edges between each pair of nodes. This function is used to calculate the similarity between two graphs; e1 and e2 are two edges, one from each graph, each connecting a pair of nodes.
  • The nearest-neighbor method classifies a target dataset using a training set, based on some property such as the distance to its neighbors. In our approach we used three nearest-neighbor variants to classify the test set based on the graph similarity between the training set and the test set. The training dataset contains a set of functional-site graphs and a set of non-functional-site graphs, all verified by laboratory experiments. The test set always consists of a single graph, which is either functional or non-functional and is also verified by laboratory experiment.
  • There are two types of cross-validation. In k-fold cross-validation the whole dataset is divided into k parts; one part is used as the test set and the rest as the training set. When k equals the number of instances in the dataset, it becomes leave-one-out cross-validation. In our approach we used leave-one-out cross-validation for a better evaluation of the predictors: each time, one graph is used as the test set and all the remaining graphs as the training set. However, when running the predictors we exclude graphs of the same type from the same protein.
  • The next section is Results and Discussion.
  • Results and discussion. We used two balanced datasets, one for enzyme catalytic sites and the other for phosphorylation sites. Both consist of functional and non-functional sites that are laboratory verified. The catalytic site dataset consists of 201 functional sites and the same number of non-active sites. We used the nearest-neighbor method for classification, in three similarity-based variants: max, average, and top-10 average. From the resulting classification we calculated the accuracy of our prediction method. The best accuracy was 77.1% on the catalytic site dataset and 63.8% on the phosphorylation site dataset. In summary, our method performed better on the catalytic site dataset than on the phosphorylation dataset.
  • To calculate the percentile data, we first sort the list by score in descending order and then divide the position of each element by the total number of elements in the list. In our approach we used the full dataset and calculated the percentiles for all the nearest-neighbor variants, in other words the max, average, and top-10 average values. The results are shown on the next slide.
  • The results clearly show that most of the active sites fall in the group below the 10% percentile.
  • Oppositely, non-active s
  • BLAST: Basic Local Alignment Search Tool. The basic BLAST algorithm can be used for DNA and protein sequence database searches, motif searches, gene identification searches, and the analysis of multiple regions of similarity in long DNA sequences.
  • 1. For template-based modeling (TBM) and fold recognition methods, a prediction model can be built based on the coordinates of the appropriate template(s) [1]. These approaches generally involve four steps: 1) a representative protein structure database is searched to identify a template that is structurally similar to the protein target; 2) an alignment between the target and the template is generated that should align equivalent residues together, as in the case of a structural alignment; 3) a prediction structure of the target is built based on the alignment and the selected template structure; and 4) model quality evaluation. The first two steps significantly affect the quality of the final model prediction in TBM methods. 2. The main signature of residue microenvironment-based methods is the focus on a single residue or position in the structure and its surrounding environment. Usually, a set of structural, physicochemical, and evolutionary properties is collected and encoded into a fixed-length vector. Sets of functional (positive) and non-functional (negative) residues are then incorporated into supervised machine learning approaches. 3. Most methods discussed in this paper focus on the prediction of enzyme active sites, co-factor binding sites, or post-translational modification sites, where a relatively compact local structural region is involved. However, a large group of algorithms and tools have been developed to identify particular classes of larger structural neighborhoods, e.g. surface patches, pockets, cavities, or clefts, which provide interfaces to ligands or macromolecular partners. These methods are highly valuable because protein-protein interactions lie at the center of almost every cellular process and protein-DNA binding is essential for genetic activities. Similarly, accurate identification of ligand-binding sites is valuable in the context of structure-based drug design. Residue macroenvironment-based methods have been reviewed recently, thus we provide only a brief summary and refer authors to relevant publications where appropriate. 4. Based on the types of structural patterns they search for, graph-theoretic approaches can be used in any of the three main methodological groups (template, residue microenvironment, residue macroenvironment). However, these approaches represent a special category based on the distinct problem formulations and algorithmic approaches. Instead of using atomic coordinates directly, graph-theoretic methods start with transforming protein structures into graphs and then exploit various motif finders and graph similarity measures, combined with machine learning, to discover functional sites. Representative graph similarity measures involve subgraph enumeration, subgraph isomorphism, or identification of frequent subgraphs, although other measures, e.g. random walk-based scoring, can be applied as well.
  • NR database: non-redundant protein sequence database.

    1. Protein Functional Site Prediction Using the Shortest-Path Graph Kernel Method. Presented by: Malinda Sanjaka. Major Advisor: Dr. Changhui Yan. Graduate Committee Members: Dr. Juan (Jen) Li, Dr. Jun Kong, Dr. Nan Yu. Date: 04/22/2013
    2. Outline: Problem Statement; Introduction; Materials and Methods; Results and Discussion; Conclusion; Future Work
    3. Problem Statement. Problem: prediction of functional sites on protein structures. What are functional sites? Functional sites are the small portions of a protein where substrate molecules bind and undergo a chemical reaction. Example [figure]: a phosphorylation site on a protein 3D structure.
    4. Problem Statement (2). Importance of functional site prediction: to understand protein functions; to support structure-based drug design; to design new proteins.
    5. 5. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work5
    6. Introduction. [Figures: the 20 amino acids; a protein]
    7. Introduction (2). Protein functional sites: catalytic active site (Catalytic Site Atlas); phosphorylation site; DNA-binding site; zinc-binding site. [Figure: addition of a phosphate to an amino acid] Functional sites are the small portions of a protein where substrate molecules bind and undergo a chemical reaction.
    8. Introduction (3). Laboratory methods for functional site determination: X-ray crystallography; nuclear magnetic resonance (NMR). Challenges: time-consuming; high cost; lack of support for some proteins; need for skilled professionals.
    9. Introduction (4). The need for computational methods: structural genomics (SG) projects reveal a large number of protein structures but little understanding of protein function. Advantages: low cost; less execution time; less environmental impact; results can be optimized by repetition; reusable; can run as a simulation; fewer human mistakes. Disadvantage: accuracy is lower than that of laboratory experimental results. Computational methods provide a helpful guideline for the experimental approach.
    10. Introduction (5). Computational methods for functional site prediction. Template-based: identify a structurally similar template; align the target and the template; predict functional groups. Microenvironment-based: focus on a single residue or position; use structural and physicochemical properties; supervised machine learning approaches. Macroenvironment-based: a local structural region is involved; protein-protein interaction; structure-based drug design; DNA-binding sites and ligand-binding sites.
    11. Introduction (6). Overview of our approach: we used graphs to represent each residue together with its contacting neighbors in a protein structure. [Figure labels: central residue (+, functional) with contacting residues; residue (-, non-functional) with contacting residues; one residue consists of a number of atoms]
    12. Introduction (7). Overview of our approach, prediction. [Diagram: a knowledge database (experimentally verified) of positive (functional/active) and negative (non-functional/non-active) graphs; a target graph (functional or non-functional); similarity computed with the shortest-path graph kernel; prediction with the nearest-neighbor method]
    13. 13. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work13
    14. Materials and Methods. Datasets. How to get the protein structure: download []. How to get the protein sequence: PDB database []. PDB ID and chain ID: 101m_A. FASTA format: >101m_A mol:protein length:154 MYOGLOBIN MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKH
    15. Materials and Methods (2). Catalytic Binding Site (CSA) []: 73 protein chains; 201 active catalytic sites; 20398 non-active residues. Balanced dataset: 201 active catalytic sites; 201 non-active residues. Phosphorylation Site (Section 3.3.4 of this paper []): 679 protein chains; 2062 active phosphorylation site residues; 139795 non-active residues. Balanced dataset: 2062 active phosphorylation site residues; 2062 non-active residues.
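The balancing step on this slide (randomly keeping as many non-active residues as there are active ones) can be sketched as follows. This is a minimal illustration, not the author's program: the function name `balance_dataset`, the fixed seed, and the placeholder residue tuples are all assumptions made for the example.

```python
import random

def balance_dataset(active, non_active, seed=0):
    """Randomly undersample non_active so both classes have the same size.

    `seed` is a hypothetical parameter added here only to make the
    illustration reproducible.
    """
    rng = random.Random(seed)
    # sample() draws without replacement, so no residue is picked twice
    sampled = rng.sample(non_active, k=len(active))
    return list(active), sampled

# Placeholder residues sized like the catalytic site dataset on the slide
active = [("active", i) for i in range(201)]
non_active = [("non_active", i) for i in range(20398)]
pos, neg = balance_dataset(active, non_active)
```

Undersampling keeps every active residue and discards most non-active ones, which matches the 201/201 and 2062/2062 counts reported above.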
    16. Materials and Methods (3). Graph Representation. Definition: a graph G = <V, E>, with V the vertices (nodes) and E the edges (arcs). A path in G is a sequence of vertices <v0, v1, v2, ..., vn>. Directed graph; undirected graph; adjacency matrix. [Figure labels: node (label), edge (weight)]
    17. Materials and Methods (4). Graph Representation, contd. Node: a residue, labeled with its PSSM (position-specific scoring matrix) values, which describe the biological conservation of the amino acid; PSSMs are computed from the protein sequences with blast-2.2.25+ and the NR database. Edge (arc): joins two contacting residues; weight = 1. Contact calculation: for atom 1 of residue 1 at <x1, y1, z1> and atom 1 of residue 2 at <x2, y2, z2> (coordinates from the PDB file), the distance is $d_1 = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$. With R1 and R2 the van der Waals radii of the two atoms (from the VDW.radii file), the atoms are in contact if $d_1 \le R_1 + R_2 + 0.5$.
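The contact rule on this slide, combined with the "at least one pair of atoms" rule from the speaker notes, can be sketched in a few lines. The data layout (a residue as a list of `(coordinates, vdw_radius)` pairs) and the function names are assumptions for illustration; the coordinates and radii in the example are made up, not taken from a PDB or VDW.radii file.

```python
import math

def atoms_in_contact(xyz1, xyz2, r1, r2, tolerance=0.5):
    """Contact test from the slide: d1 <= R1 + R2 + 0.5."""
    d1 = math.dist(xyz1, xyz2)  # Euclidean distance between the two atoms
    return d1 <= r1 + r2 + tolerance

def residues_in_contact(residue1, residue2):
    """Two residues are in contact if at least one atom pair is in contact.

    Each residue is a list of ((x, y, z), vdw_radius) tuples (hypothetical layout).
    """
    return any(
        atoms_in_contact(xyz1, xyz2, r1, r2)
        for xyz1, r1 in residue1
        for xyz2, r2 in residue2
    )

# Made-up atoms: the first pair is 3.5 apart with radii 1.7 and 1.5
res_a = [((0.0, 0.0, 0.0), 1.7)]
res_b = [((3.5, 0.0, 0.0), 1.5)]
res_c = [((10.0, 0.0, 0.0), 1.5)]
```

Here `residues_in_contact(res_a, res_b)` holds because 3.5 is within 1.7 + 1.5 + 0.5, while `res_a` and `res_c` are too far apart. Each contacting pair of residues becomes an edge of weight 1 in the residue graph.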
    18. Materials and Methods (6). Shortest-Path Graph Kernel. What is a kernel? Put simply, a kernel is a matrix whose elements are vector products: A x A = <v1, ..., vn> x <v1, ..., vn>. What is a graph kernel? It uses graphs (g1, g2, ..., gn) instead of vectors. What is the shortest-path graph kernel? It compares each pair of nodes using the shortest path between the nodes.
    19. Materials and Methods (7). Shortest-Path Graph Kernel, contd. The original graphs G1 and G2 are converted into shortest-path graphs S1(V1, E1) and S2(V2, E2) with the Floyd-Warshall algorithm. The kernel function calculates the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2: $K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{edge}(e_1, e_2)$, where $k_{edge}(\cdot)$ is a kernel function for comparing two edges. [Figure: edge e1 between nodes v1 and w1; edge e2 between nodes v2 and w2]
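The conversion of a contact graph into a shortest-path graph can be sketched with a textbook Floyd-Warshall implementation. This is a generic version of the algorithm under the assumption that the graph is given as a 0/1 adjacency matrix with unit edge weights, as described in the slides; it is not the author's code.

```python
import math

def floyd_warshall(adjacency):
    """All-pairs shortest-path lengths for an unweighted adjacency matrix.

    adjacency[i][j] is 1 if residues i and j are in contact, else 0.
    Unreachable pairs keep the distance math.inf.
    """
    n = len(adjacency)
    dist = [
        [0 if i == j else (1 if adjacency[i][j] else math.inf) for j in range(n)]
        for i in range(n)
    ]
    for k in range(n):          # allow node k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Chain of three residues: 0 - 1 - 2
chain = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
shortest = floyd_warshall(chain)
```

In the shortest-path graph every reachable pair of nodes is joined by an edge whose weight is the path length, so nodes 0 and 2 of the chain get an edge of weight 2.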
    20. Materials and Methods (8). Shortest-Path Graph Kernel, contd. Let e1 be the edge between nodes v1 and w1, and e2 be the edge between nodes v2 and w2. Then $k_{edge}(e_1, e_2) = k_{node}(v_1, v_2) \cdot k_{weight}(e_1, e_2) \cdot k_{node}(w_1, w_2)$, where $k_{node}(\cdot)$ is a kernel function for comparing the labels of two nodes and $k_{weight}(\cdot)$ is a kernel function for comparing the weights of two edges. These two functions are defined as in Borgwardt et al. (2005). Node kernel: $k_{node}(v, w) = \exp\left(-\frac{\|labels(v) - labels(w)\|^2}{2\sigma^2}\right)$, where labels(v) returns the vector of attributes associated with node v. Note that $k_{node}(\cdot)$ is a Gaussian kernel function. Its width parameter was set to 72 by trying different values between 32 and 128 in increments of 2. Weight kernel: $k_{weight}(e_1, e_2) = \max(0, c - |weight(e_1) - weight(e_2)|)$, where weight(e) returns the weight of edge e. $k_{weight}(\cdot)$ is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. The constant c was set to 2, as in Borgwardt et al. (2005). [Figure: edge e1 = 1 between v1 and w1 and edge e2 = 1 between v2 and w2, with the nodes labeled by PSSM vectors <Pssm1> to <Pssm4>]
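The three kernel functions on this slide, and the sum over edge pairs from the previous slide, can be sketched as below. Several things here are assumptions for illustration: an edge is stored as a `(label_v, label_w, weight)` tuple, the function names are hypothetical, and the value 72 is treated as the whole denominator of the Gaussian exponent, since the transcript is garbled on exactly which quantity was set to 72.

```python
import math

def k_node(label_v, label_w, width=72.0):
    """Gaussian kernel on node label vectors.

    Assumption: `width` stands in for the denominator of the exponent.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(label_v, label_w))
    return math.exp(-squared_distance / width)

def k_weight(weight1, weight2, c=2.0):
    """Brownian bridge kernel: largest when the two edge lengths are equal."""
    return max(0.0, c - abs(weight1 - weight2))

def k_edge(e1, e2):
    """Compare two edges; each edge is (label_v, label_w, weight)."""
    v1, w1, weight1 = e1
    v2, w2, weight2 = e2
    return k_node(v1, v2) * k_weight(weight1, weight2) * k_node(w1, w2)

def shortest_path_kernel(edges1, edges2):
    """K(G1, G2): sum of k_edge over all pairs of shortest-path-graph edges."""
    return sum(k_edge(e1, e2) for e1 in edges1 for e2 in edges2)

# Toy edges with two-dimensional labels (real labels would be PSSM rows)
edge_a = ((1.0, 2.0), (0.0, 1.0), 1)
edge_b = ((1.0, 2.0), (0.0, 1.0), 4)
```

Two identical edges score 1 * 2 * 1 = 2, while edges whose lengths differ by 2 or more contribute nothing, so only paths of similar length between similarly labeled residues add to the similarity.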
    21. Materials and Methods (9). Prediction Methods. Nearest-neighbor algorithm: classify a new example x by finding the training example <xi, yi> that is nearest to x according to Euclidean distance. Variants: NNM_Max; NNM_AVE; NNM_TOP10AVE. [Diagram: a test-set graph of unknown class (?) is compared by similarity with the training set (experimentally verified) of positive (functional/active) and negative (non-functional/non-active) graphs]
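The slides name the three nearest-neighbor variants but do not spell out their decision rules, so the sketch below is one plausible reading and should be treated as an assumption throughout: the test graph's kernel similarities to the positive and to the negative training graphs are each summarized by their maximum, their average, or the average of the ten largest values, and the class with the larger summary wins. The function names are hypothetical.

```python
def class_score(similarities, variant):
    """Summarize a test graph's similarities to one class of training graphs."""
    ranked = sorted(similarities, reverse=True)
    if variant == "max":          # NNM_Max: single most similar graph
        return ranked[0]
    if variant == "ave":          # NNM_Ave: mean over all graphs in the class
        return sum(ranked) / len(ranked)
    if variant == "top10ave":     # NNM_Top10Ave: mean over the 10 most similar
        top = ranked[:10]
        return sum(top) / len(top)
    raise ValueError(f"unknown variant: {variant}")

def predict_functional(sim_to_positive, sim_to_negative, variant="ave"):
    """Return True if the test graph looks more like the positive class."""
    return class_score(sim_to_positive, variant) > class_score(sim_to_negative, variant)

# Toy similarities of one test graph to the two training classes
sim_pos = [0.9, 0.1]
sim_neg = [0.5, 0.6]
```

With these toy values the variants disagree: the max rule predicts functional (0.9 against 0.6) while the average rule does not (0.5 against 0.55), which illustrates why the three variants can give different accuracy figures.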
    22. Materials and Methods (10). Evaluation of Predictors: k-fold cross-validation; leave-one-out cross-validation.
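Leave-one-out cross-validation with the exclusion rule mentioned in the speaker notes (graphs of the same type from the same protein are left out) can be sketched as follows. The reading of that rule, the dictionary layout of an example, and the function name are assumptions for illustration.

```python
def leave_one_out_splits(examples):
    """Yield (test_index, train_indices) for leave-one-out cross-validation.

    Each example is a dict with hypothetical keys "protein" and "label".
    Training graphs that share both the protein and the label of the test
    graph are excluded, following one reading of the speaker notes.
    """
    for test_index, test in enumerate(examples):
        train_indices = [
            i
            for i, example in enumerate(examples)
            if i != test_index
            and not (
                example["protein"] == test["protein"]
                and example["label"] == test["label"]
            )
        ]
        yield test_index, train_indices

examples = [
    {"protein": "A", "label": 1},
    {"protein": "A", "label": 1},
    {"protein": "A", "label": 0},
    {"protein": "B", "label": 1},
]
splits = dict(leave_one_out_splits(examples))
```

For the first example the second graph is dropped from training because it is a positive graph from the same protein, so `splits[0]` is `[2, 3]`.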
    23. Materials and Methods (11). Measurements for Evaluation: true positives / false positives; sensitivity; specificity; accuracy.
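The slide lists the measures by name only. Under the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / all), they can be computed as in this short sketch; the function name is hypothetical, and the counts in the example are the NNM_Ave row of the catalytic site results.

```python
def evaluate(tp, fn, fp, tn):
    """Standard confusion-matrix measures."""
    return {
        "sensitivity": tp / (tp + fn),   # fraction of functional sites found
        "specificity": tn / (tn + fp),   # fraction of non-functional sites rejected
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# NNM_Ave on the catalytic site dataset: TP=155, FN=46, FP=46, TN=155
metrics = evaluate(tp=155, fn=46, fp=46, tn=155)
```

All three values come out at about 0.771, in line with the 77.1% reported for this variant.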
    24. 24. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work24
    25. Results and Discussion.

        Enzyme catalytic site
        Method        TP    TP%    FN    FN%    FP   FP%    TN    TN%    Contact  Not Contact  Accuracy  Sensitivity  Specificity
        NNM_Max       150   74.5%  51    25.3%  64   31.8%  137   68.1%  5        59           71.3%     74.5%        68.1%
        NNM_Ave       155   77.1%  46    22.8%  46   22.8%  155   77.1%  5        41           77.1%     77.1%        77.1%
        NNM_Top10Ave  156   77.6%  45    22.3%  51   25.3%  150   74.6%  5        46           76.1%     77.6%        74.6%

        Phosphorylation site
        Method        TP    TP%    FN    FN%    FP   FP%    TN    TN%    Contact  Not Contact  Accuracy  Sensitivity  Specificity
        NNM_Max       1104  53.5%  958   46.4%  758  36.7%  1304  50.1%  73       685          58.3%     53.5%        50.1%
        NNM_Ave       1054  51.1%  1008  48.8%  482  23.3%  1580  76.6%  54       428          63.8%     51.1%        76.6%
        NNM_Top10Ave  1085  52.6%  977   47.3%  667  32.3%  1395  67.6%  60       607          60.1%     52.6%        67.6%
    26. Results and Discussion (2). Percentile Ranking: used the full dataset; ordered list; position ranking; the majority of functional sites fall below the 10% percentile. Variants: NNM_MAX; NNM_AVE; NNM_TOP10AVE.
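The percentile ranking can be sketched as: sort all examples by decreasing prediction score (as the abstract describes), take each example's position divided by the list length as its percentile, and count the active residues in each percentile bin. The function name is hypothetical, and integer arithmetic is used for the binning to avoid floating-point edge cases; the scores below are toy values.

```python
def active_counts_per_bin(scores, is_active, n_bins=10):
    """Count active examples per percentile bin after sorting by score.

    Position p (1-based) in the sorted list has percentile p / n; with
    n_bins=10 the bins are 0.0-0.1, 0.1-0.2, ..., 0.9-1.0.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    counts = [0] * n_bins
    for position, index in enumerate(order, start=1):
        if is_active[index]:
            counts[(position - 1) * n_bins // n] += 1
    return counts

# Five toy examples and five bins
scores = [0.9, 0.8, 0.1, 0.7, 0.2]
is_active = [True, True, False, False, True]
counts = active_counts_per_bin(scores, is_active, n_bins=5)
```

If functional sites tend to receive high scores, the counts pile up in the first bins, which is the enrichment pattern the slides report for the top 10%.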
    27. Results and Discussion (3). Percentile result (CSA), active (functional) residues. [Figures: number of active residues vs. percentile bin (0.0-0.1 through 0.9-1.0) for Max, Ave, and Top10 Ave]
    28. Results and Discussion (4). Percentile result (CSA), non-active (non-functional) residues. [Figures: number of non-active residues vs. percentile bin (0.0-0.1 through 0.9-1.0) for Max, Ave, and Top 10 Ave]
    29. 29. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work29
    30. 30. Conclusions We developed an innovative graph method to represent proteinsurface based on how amino acid residues contact with each other. We implemented a shortest-path graph kernel method and used itto compute the similarity between graphs. We developed three nearest neighbor variants to predict bothdataset based on the similarity matrix that the graph kernel methodproduced. The predictors were able to predict catalytic sites with accuracy upto 77.1%. This work showed that the proposed methods were able to capturethe similarity between enzyme catalytic sites and would provide auseful tool for catalytic site prediction.30
    31. 31. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work31
    32. 32. Future WorkAdd more parameters into labels(graphs, nodes)Improve the program as web serviceWorking with other kernel methods suchas, Minimum Spring Tree and etc.Optimize algorithm for large datasets32
    33. 33. AcknowledgementsI would like to express my deep gratitude to my adviser Dr.Changhui Yan for his continuousencouragements, guidance, and supports to complete thispaper successfully.My sincere thanks also go to my committee members, Dr. Juan(Jen) Li, Dr. Jun Kong, and Dr. Nan Yu for their willingness toserve as committee members.33
    34. 34. Thank you.?34
    35. 35. Introduction ….vdw.PDBNRDatabaseBlast35
    36. 36. Protein…-CUA-AAA-GAA-GGU-GUU-AGC-AAG…-L-K-E-G-V-S-K-D-…DNAprotein sequence36
    37. 37. Important of Functional SitePredictionUnderstanding Protein FunctionalitiesReveal the Structural ProteinDrug DesignDesign New Protein37
    38. 38. Rationale for Understanding Protein Structure andFunctionProtein sequence-large numbers ofsequences, includingwhole genomesProtein function- rational drug design and treatment of disease- protein and genetic engineering- build networks to model cellular pathways- study organismal function and evolution?structure determinationstructure predictionhomologyrational mutagenesisbiochemical analysismodel studiesProtein structure- three dimensional- complicated38
    39. 39. Existing Applications for ProteinActive Sites Prediction39
    40. 40. Our Approach Shortest-path Distance Theory Graph with Adjacent Matrix and Graph kernel Nearest Neighbor Variant (Max, Ave, Top10 Ave) Leave-one-out Cross-Validation True Positive & False Positive Increment percentile40
    41. 41. Literature Review Graph Adjacency Matrix Shortest Distance Path Algorithm Cross Validation True Positive vs. False Positive Percentile Ranking 41
    42. 42. Graph A graph G=<V, E> V vertices (nodes) and E edges (arcs) A path in G is a sequence of vertices <v0, v1, v2, ..., vn> Directed Graph Undirected Graph 42
    43. 43. Adjacency Matrix A simple graph is a matrix with rows and columnslabeled by graph vertices1 = Adjacent0 = Not Adjacent0s on the diagonal43
    44. 44. Shortest Distance Path Algorithm Used in communications, transportation, electronics, andbioinformatics problems. The all-pairs shortest-path problem involves finding theshortest path between all pairs of vertices in a graph.A i j=1 if there is an edge (Vi,Vj) ; otherwise, A i j =044
    45. 45. Percentile Ranking There is no proper definition for percentilecalculation Ordered List Position Ranking Max, Ave, Top1045
    46. 46. Method And Material Data Gathering Identify the Active Residues Balance Dataset Generating a Map File Generate Set of Graphs Development of Graph Kernel46
    47. 47. Data Gathering, Catalytic Binding Site (CSA): enzyme classes EC1, EC2, …, EC6; HTML and regular expressions; finding a large single group (EC 3.4 selected); 73 protein chains; 201 active catalytic site residues; 20398 non-active residues.
    48. 48. Data Gathering (continued), Phosphorylation Site: Section 3.3.4 of this paper []; 679 protein chains; 2062 active phosphorylation site residues; 139795 non-active residues.
    49. 49. Identify the Active Residues. Catalytic Binding Site (CSA): CSA annotation database (CSA_2_2_12.dat) [], 251777 records, list of active residues (201). Phosphorylation Site []: list of active residues (2062).
    50. 50. Balance Dataset: considerations are computation time and leave-one-out cross-validation; non-active residues are chosen by random selection. Catalytic Binding Site (CSA): 201 active, 201 non-active. Phosphorylation Site: 2062 active, 2062 non-active.
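The balancing step can be sketched as random undersampling of the non-active residues down to the number of active ones. This is a minimal sketch under assumptions: the function name `balance`, the fixed seed, and the integer stand-ins for residues are illustrative, and the slides do not say how the random selection was implemented.

```python
import random


def balance(active, non_active, seed=0):
    """Keep all active examples and draw an equally sized random sample of non-active ones."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = rng.sample(non_active, len(active))  # sampling without replacement
    return list(active), sampled


# Stand-in identifiers using the CSA counts from the slides: 201 active, 20398 non-active
active = list(range(201))
non_active = list(range(1000, 1000 + 20398))
balanced_active, balanced_non_active = balance(active, non_active)
print(len(balanced_active), len(balanced_non_active))
```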
    51. 51. Generating a Map File: map each protein PDB ID to its protein sequence; atomic solvent accessible area calculations (RASA); position-specific scoring matrix calculations (PSSM); active residues.
    53. 53. Atomic Solvent Accessible Area Calculations (RASA)
        - Calculate the solvent accessible area (RASA) of each protein
        - Naccess V2.11 program: Linux/Unix systems or Cygwin []
        - ./naccess 1a91.pdb & ./naccess 1afo.pdb & ./naccess 1aig.pdb
        - PDB Data Bank: PDB file []
        - ncbi-blast-2.2.24+
        - RASA > 0
    54. 54. Position-Specific Scoring Matrix Calculations (PSSM)
        - Download PDB files
        - blast-2.2.25+ program (Microsoft Windows)
        - NR database (non-redundant protein sequences)
        // Requires: using System.Diagnostics;
        // Run PSI-BLAST for two iterations and write the ASCII PSSM to FileNameOUT.
        Process p = new Process();
        p.StartInfo.UseShellExecute = false;
        p.StartInfo.RedirectStandardOutput = true;
        p.StartInfo.FileName = @"C:\blast-2.2.25+\bin\psiblast.exe";
        p.StartInfo.Arguments = "-query " + FileNameIN + @" -db C:\blast-2.2.25+\db\nr -num_iterations 2 -out_ascii_pssm " + FileNameOUT;
        p.Start();
        - Example: sample record of a .PSSM file
        1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J
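The sample .PSSM record can be split into its parts as sketched below. The field layout assumed here (position, residue, 20 log-odds scores, 20 percentages, then two trailing values kept as text) is inferred from the sample record itself, and the function name `parse_pssm_row` is hypothetical.

```python
def parse_pssm_row(line):
    """Split one ASCII PSSM row into position, residue, scores, percentages, and trailing fields."""
    fields = line.split()
    position = int(fields[0])
    residue = fields[1]
    scores = [int(v) for v in fields[2:22]]     # 20 position-specific scores
    percents = [int(v) for v in fields[22:42]]  # 20 weighted observed percentages
    trailing = fields[42:]                      # kept as text: values such as "1.#J" are not valid floats
    return position, residue, scores, percents, trailing


row = ("1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 "
       "77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J")
position, residue, scores, percents, trailing = parse_pssm_row(row)
print(position, residue, scores[:3], sum(percents), trailing)
```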
    55. 55. Sample Mapping File
        >1neg_A
        Seq :KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLAAAWSHPQF
        SUR :11101011111111111111111111111111111111011111111011110111111111111
        Site :00000000000000000000000000000000000000000000000000000000000010000
        rASA:115.47,81.22,64.82,.00,20.59,.00,41.60,111.13,56.32,14.17,124.18,35.41,127.39,43.03,111.84,160.37,10.00,.71,33.57,1.82,120.20,91.83,15.89,41.40,69.81,.77,20.31,2.22,49.44,65.40,30.56,97.39,80.11,152.72,75.17,80.10,47.20,64.49,.00,57.09,16.33,101.38,111.31,104.16,71.57,2.73,60.84,.00,18.67,8.04,64.07,71.08,.00,125.10,66.68,24.97,32.49,79.86,65.19,179.94,87.62,51.01,109.35,145.21,71.53,
        entropy:0.80,0.85,0.25,0.92,0.44,1.48,1.02,2.42,1.57,2.01,0.44,0.93,0.49,0.73,0.73,0.83,1.72,1.46,0.59,2.15,0.72,0.98,1.99,1.65,0.60,1.20,0.35,0.94,0.66,0.65,0.51,0.23,1.04,0.45,1.09,4.74,3.91,0.67,1.38,0.61,0.45,0.75,1.43,0.49,0.36,2.32,0.72,1.63,3.17,0.46,1.53,2.78,1.61,0.38,0.45,0.26,0.15,0.51,0.17,0.38,0.47,0.46,0.93,2.04,1.73,
        pdbindex:6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
    56. 56. Generate Set of Graphs: shortest-path distance (Dijkstra's algorithm); adjacency matrix; nodes are contacting neighbor residues; graphs are labeled and weighted, with varying numbers of nodes and edges; graph normalization by linear normalization, $X_1 = (X - \mathrm{Min}) / (\mathrm{Max} - \mathrm{Min})$.
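The linear normalization formula maps each value into the range 0 to 1. A minimal sketch follows; the function name `min_max_normalize` and the handling of the case Max = Min (returning zeros) are assumptions that the slides do not specify.

```python
def min_max_normalize(values):
    """Apply X1 = (X - Min) / (Max - Min) to every value in the list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: with no spread, every normalized value is 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([2.0, 4.0, 6.0, 10.0])
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```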
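    57. 57. Calculate the Distance between Atoms and Check for Contact: the distance between two atoms is $D_1 = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$, with coordinates taken from the PDB file and the radii $R_1$, $R_2$ taken from the van der Waals radii file (VDW.radii). Two atoms are in contact when $D_1 \le R_1 + R_2 + 0.5$. Example of a contact residue: 2 A _ 3 A! : 1.33441. Example of a non-contact residue: 4 A _ 2 A : 4.14432.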
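The contact test can be sketched as follows. The atom-level rule is the one on the slide (distance at most R1 + R2 + 0.5); treating two residues as contacting when any pair of their atoms is in contact is an assumption, as are the function names and the made-up coordinates.

```python
import math

CONTACT_MARGIN = 0.5  # tolerance added to the sum of the two van der Waals radii


def atoms_in_contact(coord1, radius1, coord2, radius2, margin=CONTACT_MARGIN):
    """True when the Euclidean distance is at most radius1 + radius2 + margin."""
    return math.dist(coord1, coord2) <= radius1 + radius2 + margin


def residues_in_contact(atoms_a, atoms_b):
    """Assumed residue-level rule: at least one atom pair is in contact."""
    return any(
        atoms_in_contact(coord_a, radius_a, coord_b, radius_b)
        for coord_a, radius_a in atoms_a
        for coord_b, radius_b in atoms_b
    )


# Hypothetical residues as lists of (coordinates, van der Waals radius)
res_a = [((0.0, 0.0, 0.0), 1.65), ((1.5, 0.0, 0.0), 1.87)]
res_b = [((4.5, 0.0, 0.0), 1.76)]
res_c = [((9.0, 0.0, 0.0), 1.40)]
print(residues_in_contact(res_a, res_b))  # True: 3.0 <= 1.87 + 1.76 + 0.5
print(residues_in_contact(res_a, res_c))  # False: the closest pair is 7.5 apart
```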
    58. 58. Structure of a Graph
    59. 59. Development of Graph Kernel: the original graphs G1 and G2 are converted into shortest-path graphs S1 (V1, E1) and S2 (V2, E2) using the Floyd-Warshall algorithm. The kernel function then calculates the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2.
    60. 60. The Floyd-Warshall Algorithm: finds the shortest paths between all pairs of vertices $v \in V$ of a weighted graph $G = (V, E)$. $D^{(k)}_{ij}$ is the weight of the shortest path from vertex $i$ to vertex $j$ for which all intermediate vertices are in the set $\{1, 2, \ldots, k\}$.
        for i = 1 to N
            for j = 1 to N
                if i = j
                    dist[0][i][j] = 0
                else if there is an edge from i to j
                    dist[0][i][j] = the length of the edge from i to j
                else
                    dist[0][i][j] = INFINITY
        for k = 1 to N
            for i = 1 to N
                for j = 1 to N
                    dist[k][i][j] = min(dist[k-1][i][j], dist[k-1][i][k] + dist[k-1][k][j])
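The Floyd-Warshall recurrence can be run on a small graph as sketched below. This version updates a single distance matrix in place rather than keeping one matrix per k, a common space-saving variant; the function name and the four-vertex example graph are illustrative assumptions.

```python
import math

INF = math.inf


def floyd_warshall(weights):
    """All-pairs shortest-path distances from a weight matrix (INF where there is no edge)."""
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is left unchanged
    for i in range(n):
        dist[i][i] = 0  # the distance from a vertex to itself is zero
    for k in range(n):          # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


# Hypothetical weighted path graph: 0 -(1)- 1 -(2)- 2 -(1)- 3
W = [
    [0, 1, INF, INF],
    [1, 0, 2, INF],
    [INF, 2, 0, 1],
    [INF, INF, 1, 0],
]
D = floyd_warshall(W)
for row in D:
    print(row)
```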
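    61. 61. Implementation
        /* Squared Euclidean distance between the PSSM rows of two residues. */
        double Pssm(int ResidueA, int ResidueB)
        {
            int i;
            double sum = 0;
            for (i = 0; i < 20; i++) {
                sum += pow((double)(seq_a_pssm[ResidueA][i] - seq_b_pssm[ResidueB][i]), 2);
            }
            return sum;
        }

        /* Node similarity for the residue pair (i, j). */
        dis += Pssm(i, j);
        attr_dis[i][j] = exp((-1) * parm_gamma * dis);

        /* Kernel value: compare every node pair (i, k) of graph A with every node pair (j, r) of graph B. */
        sum = 0;
        for (i = 0; i < seq_a_len; i++)
            for (j = 0; j < seq_b_len; j++)
                for (k = i + 1; k < seq_a_len; k++)
                    for (r = j + 1; r < seq_b_len; r++) {
                        xx1 = seq_a_dist[i][k] - seq_b_dist[j][r];
                        Klen = MaxValue(0, CC - fabs(xx1));
                        product1 = attr_dis[i][j] * attr_dis[k][r];
                        product2 = attr_dis[k][j] * attr_dis[i][r];
                        value = MaxValue(product1, product2);
                        sum += value * Klen;
                    }
        return sum;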
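The kernel computation on this slide can be restated in Python as a sketch. It mirrors the loops shown, but it is an approximation under assumptions: node similarity here uses only the PSSM distance (the slide accumulates `dis` with `+=`, so other attributes may contribute), and the names `node_similarity`, `shortest_path_kernel`, `gamma`, and `c` stand in for `attr_dis`, the kernel loop, `parm_gamma`, and `CC`.

```python
import math


def node_similarity(pssm_a, pssm_b, gamma):
    """attr[i][j] = exp(-gamma * squared distance between PSSM rows i and j)."""
    return [
        [math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(row_a, row_b)))
         for row_b in pssm_b]
        for row_a in pssm_a
    ]


def shortest_path_kernel(dist_a, dist_b, attr, c):
    """Sum over node pairs (i, k) in A and (j, r) in B of node similarity times length similarity."""
    total = 0.0
    n_a, n_b = len(dist_a), len(dist_b)
    for i in range(n_a):
        for k in range(i + 1, n_a):
            for j in range(n_b):
                for r in range(j + 1, n_b):
                    # Length term: positive only when the two path lengths differ by less than c.
                    k_len = max(0.0, c - abs(dist_a[i][k] - dist_b[j][r]))
                    # Node term: best of the two ways to match the endpoints.
                    k_node = max(attr[i][j] * attr[k][r], attr[k][j] * attr[i][r])
                    total += k_node * k_len
    return total


# Hypothetical two-node graphs with identical structure and attributes
dist = [[0.0, 1.0], [1.0, 0.0]]
pssm = [[1.0, 0.0], [0.0, 1.0]]
attr = node_similarity(pssm, pssm, gamma=1.0)
kernel_value = shortest_path_kernel(dist, dist, attr, c=2.0)
print(kernel_value)
```

With identical inputs, the single edge pair matches exactly, so the length term equals `c` and the node term equals 1.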
    62. 62. Compare Similarity: Max; Ave; Top 10 Ave.
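The three variants can be read as different ways of aggregating the kernel similarities between a query and a set of reference examples. The sketch below shows only that aggregation; which reference set is used and how the aggregate becomes a prediction are not stated on the slide, so the function names and the sample values are assumptions.

```python
def score_max(similarities):
    """Max: the single highest similarity."""
    return max(similarities)


def score_ave(similarities):
    """Ave: the mean of all similarities."""
    return sum(similarities) / len(similarities)


def score_top10_ave(similarities, k=10):
    """Top 10 Ave: the mean of the k highest similarities (all of them if fewer than k)."""
    top = sorted(similarities, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical similarities between one query graph and twelve reference graphs
sims = [float(v) for v in range(1, 13)]
print(score_max(sims), score_ave(sims), score_top10_ave(sims))
```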
    63. 63. Results and Discussion: comparison of similarity (TP/FP) for Max, Ave, and Top 10 Ave; percentile ranking calculation; RASA value.
    64. 64. Percentile Result (CSA)
    65. 65. rASA vs. Active Residues
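    67. 67. Code sample
        // Requires: using System.Collections.Generic; using System.Linq;
        static IEnumerable<string> SortByLength(IEnumerable<string> e)
        {
            // Order the strings from longest to shortest.
            var sorted = from s in e
                         orderby s.Length descending
                         select s;
            return sorted;
        }
        Section 3.4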
    68. 68. Protein Chain (CSA)
    69. 69. List of Phosphorylation Sites
    70. 70. Catalytic Binding Site (CSA): Active Residues
    71. 71. Phosphorylation Site: Active Residues
    72. 72. van der Waals radii file (VDW.radii)
        RESIDUE ATOM ALA 5
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        RESIDUE ATOM ARG 11
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.87 0
        ATOM CD 1.87 0
        ATOM NE 1.65 1
        ATOM CZ 1.76 0
        ATOM NH1 1.65 1
        ATOM NH2 1.65 1
        RESIDUE ATOM ASP 8
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.76 0
        ATOM OD1 1.40 1
        ATOM OD2 1.40 1
        RESIDUE ATOM ASN 8
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.76 0
        ATOM OD1 1.40 1
        ATOM ND2 1.65 1
        RESIDUE ATOM CYS 6
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM SG 1.85 0
        RESIDUE ATOM GLU 9
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.87 0
        ATOM CD 1.76 0
        ATOM OE1 1.40 1
        ATOM OE2 1.40 1
        RESIDUE ATOM GLN 9
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.87 0
        ATOM CD 1.76 0
        ATOM OE1 1.40 1
        ATOM NE2 1.65 1
        RESIDUE ATOM GLY 4
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        RESIDUE ATOM HIS 10
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.76 0
        ATOM ND1 1.65 1
        ATOM CD2 1.76 0
        ATOM CE1 1.76 0
        ATOM NE2 1.65 1
        RESIDUE ATOM ILE 8
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG1 1.87 0
        ATOM CG2 1.87 0
        ATOM CD1 1.87 0
        RESIDUE ATOM LEU 8
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.87 0
        ATOM CD1 1.87 0
        ATOM CD2 1.87 0
        RESIDUE ATOM LYS 9
        ATOM N 1.65 1
        ATOM CA 1.87 0
        ATOM C 1.76 0
        ATOM O 1.40 1
        ATOM CB 1.87 0
        ATOM CG 1.87 0
        ATOM CD 1.87 0
        ATOM CE 1.87 0
        ATOM NZ 1.50 1
    73. 73. PDB File Sample
    74. 74. Distance File Example