Protein functional site prediction using the shortest-path graph kernel method

ABSTRACT
Sanjaka, Malinda, M.S., Department of Computer Science, College of Science and Mathematics, North Dakota State University, April 2013. Protein Functional Site Prediction Using the Shortest-Path Graph Kernel Method. Major Professor: Dr. Changhui Yan.
Over the past decade, Structural Genomics projects have accumulated structural data for over 75,000 proteins, but the functions of most of them are unknown or uncertain because of the limitations of laboratory approaches for discovering protein function. Computational methods play a key role in narrowing this gap. Graphs are often used to describe and analyze the geometry and physicochemical composition of biomolecular structures such as chemical compounds and protein active sites (phosphorylation and enzyme catalytic sites). A key problem in graph-based structure analysis is to define a measure of similarity that enables a meaningful comparison of such structures. In this regard, kernel functions have attracted a lot of attention, especially since they allow for the application of a rich repertoire of methods from the field of kernel-based machine learning. In this study, we developed an innovative graph method to represent the protein surface based on how amino acid residues contact each other. Then, we implemented a shortest-path graph kernel function to calculate similarities between the graphs. We implemented three variants of the nearest-neighbor method to predict functional sites on proteins using the similarity measure given by the shortest-path graph kernel. The prediction methods were evaluated on two datasets using the leave-one-out approach. The best method achieved accuracy as high as 78%. We sorted all examples in order of decreasing prediction score. The results revealed that the positive examples (functional sites) were associated with high prediction scores and that the functional sites were enriched in the top 10 percentile. This project showed that the proposed method was able to capture the similarity between protein functional sites and would provide a useful tool for functional site prediction.


  1. PROTEIN FUNCTIONAL SITE PREDICTION USING THE SHORTEST-PATH GRAPH KERNEL METHOD
     Presented by: Malinda Sanjaka
     Major Advisor: Dr. Changhui Yan
     Graduate Committee Members: Dr. Juan (Jen) Li, Dr. Jun Kong, Dr. Nan Yu
     Date: 04/22/2013
  2. Outline
     • Problem Statement
     • Introduction
     • Materials and Methods
     • Results and Discussion
     • Conclusion
     • Future Work
  3. Problem Statement
     • Problem: prediction of functional sites on protein structures
     • What are the functional sites?
       – The functional sites are the small portions of a protein where substrate molecules bind and undergo a chemical reaction.
     • Example: [Figure: phosphorylation site on a protein 3D structure]
  4. Problem Statement (2): Importance of Functional Site Prediction
     • To understand protein functionalities
     • To support structure-based drug design
     • To design new proteins
  5. Outline: Problem Statement, Introduction, Materials and Methods, Results and Discussion, Conclusion, Future Work
  6. Introduction
     [Figure: the 20 amino acids; a protein]
  7. Introduction (2): Protein Functional Sites
     • Catalytic active site (Catalytic active site atlas)
     • Phosphorylation site: addition of a phosphate to an amino acid
     • DNA-binding site
     • Zinc-binding site
     The functional sites are the small portions of a protein where substrate molecules bind and undergo a chemical reaction.
  8. Introduction (3): Laboratory Methods for Functional Site Determination
     • X-ray crystallography
     • Nuclear magnetic resonance (NMR)
     • Challenges
       – Time-consuming
       – High cost
       – Lack of support for some proteins
       – Need for skilled professionals
  9. Introduction (4): The Need for Computational Methods
     Structural Genomics (SG) projects reveal a large number of protein structures but provide little understanding of protein function.
     • Advantages
       – Low cost
       – Less execution time
       – Less environmental impact
       – Results can be optimized by repetition
       – Reusable
       – Can run as a simulation
       – Reduces human mistakes
     • Disadvantage
       – Accuracy is lower than that of laboratory experimental results
     • Computational methods provide helpful guidelines for the experimental approach
  10. Introduction (5): Computational Methods for Functional Site Prediction
      • Template-based
        – Identify a structurally similar template
        – Align the target and the template
        – Predict functional groups
      • Microenvironment-based
        – Focus on a single residue or position
        – Use structural and physicochemical properties
        – Supervised machine learning approaches
      • Macroenvironment-based
        – A local structural region is involved
        – Protein-to-protein interaction
        – Structure-based drug design
        – DNA-binding sites and ligand-binding sites
  11. Introduction (6): Overview of Our Approach
      We used graphs to represent each residue with its contacting neighbors in a protein structure.
      [Figure: a central residue (+/functional) with its contacting residues, and a residue (-/non-functional) with its contacting residues. One residue consists of a number of atoms.]
  12. Introduction (7): Overview of Our Approach – Prediction
      [Diagram: database knowledge (experimentally verified) of positive (functional/active) and negative (non-functional/non-active) examples; target graph (functional or non-functional); similarity from the shortest-path graph kernel; prediction by the nearest neighbor method]
  13. Outline: Problem Statement, Introduction, Materials and Methods, Results and Discussion, Conclusion, Future Work
  14. Materials and Methods: Datasets
      • How to get the protein structure
        – Download: [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]
      • How to get the protein sequence
        – PDB Database: [ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt]
        – PDB ID and Chain ID: 101m_A
        – FASTA format:
            >101m_A mol:protein length:154 MYOGLOBIN
            MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKH
  15. Materials and Methods (2)
      Catalytic Binding Site (CSA) [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl]
      • 73 protein chains
      • 201 active catalytic sites
      • 20398 non-active residues
      • Balanced dataset
        – 201 active catalytic sites
        – 201 non-active residues
      Phosphorylation Site: Section 3.3.4 of this paper [http://www.informatics.indiana.edu/predrag/publications.htm]
      • 679 protein chains
      • 2062 active phosphorylation site residues
      • 139795 non-active residues
      • Balanced dataset
        – 2062 active phosphorylation site residues
        – 2062 non-active residues
  16. Materials and Methods (3): Graph Representation
      • Definition
        – A graph G=<V, E> has V vertices (nodes) and E edges (arcs)
        – A path in G is a sequence of vertices <v0, v1, v2, ..., vn>
      • Directed graph
      • Undirected graph
      • Adjacency matrix
      [Figure: node (label), edge (weight)]
  17. Materials and Methods (4): Graph Representation Contd.
      • Node
      • Edge
      • Weight
      • Labels: PSSM (position-specific scoring matrix; biological conservation of amino acids), computed from protein sequences
        – blast-2.2.25+
        – NR Database
      • Distance
      • Contacting residue: node labeled with PSSM; edge (arc) with weight 1
      • Calculation
        – Distance: $d_1 = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$, with atom coordinates <x, y, z> taken from the PDB file
        – VDW radius of each atom (van der Waals VDW.radii file)
        – Contact condition: $d_1 \le R_1 + R_2 + 0.5$
      [Figure: Residue1.Atom1 with radius R1 and Residue2.Atom1 with radius R2, separated by d1]
  18. Materials and Methods (6): Shortest-Path Graph Kernel
      • What is a kernel?
        – Simply, a kernel is a matrix A×A = <v1, …, vn> × <v1, …, vn> of matrix elements
      • What is a graph kernel?
        – It uses graphs instead of vectors
      • What is the shortest-path graph kernel?
        – It compares each pair of nodes by using the shortest path between them
      [Figure: kernel matrix over vectors v1 … vn and kernel matrix over graphs g1 … gn]
  19. Materials and Methods (7): Shortest-Path Graph Kernel Contd.
      • The original graphs G1 and G2 are converted into shortest-path graphs S1(V1, E1) and S2(V2, E2)
        – The Floyd-Warshall algorithm
      • The kernel function is used to calculate the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2.
      • Calculation:
        $K(G_1, G_2) = \sum_{e_1 \in E_1} \sum_{e_2 \in E_2} k_{edge}(e_1, e_2)$
        where $k_{edge}(\cdot)$ is a kernel function for comparing two edges.
      [Figure: edge e1 between nodes v1 and w1; edge e2 between nodes v2 and w2]
  20. Materials and Methods (8): Shortest-Path Graph Kernel Contd.
      Let e1 be the edge between nodes v1 and w1, and e2 be the edge between nodes v2 and w2. Then,
        $k_{edge}(e_1, e_2) = k_{node}(v_1, v_2) \cdot k_{weight}(e_1, e_2) \cdot k_{node}(w_1, w_2)$
      where $k_{node}(\cdot)$ is a kernel function for comparing the labels of two nodes, and $k_{weight}(\cdot)$ is a kernel function for comparing the weights of two edges. These two functions are defined as in Borgwardt et al. (2005):
        $k_{node}(v, w) = \exp\left(-\frac{\|labels(v) - labels(w)\|^2}{2\sigma^2}\right)$
      where labels(v) returns the vector of attributes associated with node v. Note that $k_{node}(\cdot)$ is a Gaussian kernel function. $\frac{1}{2\sigma^2}$ was set to 72 by trying different values between 32 and 128 with increments of 2.
        $k_{weight}(e_1, e_2) = \max(0,\; c - |weight(e_1) - weight(e_2)|)$
      where weight(e) returns the weight of edge e. $k_{weight}(\cdot)$ is a Brownian bridge kernel that assigns the highest value to edges that are identical in length. The constant c was set to 2 as in Borgwardt et al. (2005).
      [Figure: edge e1 = 1 between v1 <Pssm1> and w1 <Pssm2>; edge e2 = 1 between v2 <Pssm3> and w2 <Pssm4>]
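The three kernel functions above can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's implementation: the function names, the list-of-lists inputs, and the decision to skip unreachable node pairs are assumptions. The `gamma` parameter stands in for the Gaussian width term (the slides report the value 72), and taking the larger of the two endpoint matchings follows the Implementation slide later in the deck.

```python
import math

def k_node(a, b, gamma):
    """Gaussian kernel on two label vectors (e.g. PSSM rows)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq)

def k_weight(d1, d2, c=2.0):
    """Brownian bridge kernel: largest when the two path lengths are equal."""
    return max(0.0, c - abs(d1 - d2))

def sp_kernel(labels1, dist1, labels2, dist2, gamma=72.0, c=2.0):
    """Sum k_edge over all pairs of shortest-path edges of two graphs.

    labels*: one label vector per node.
    dist*:   square matrix of shortest-path lengths (math.inf if unreachable).
    """
    n1, n2 = len(labels1), len(labels2)
    # Node similarities are reused many times, so compute them once.
    node = [[k_node(labels1[i], labels2[j], gamma) for j in range(n2)]
            for i in range(n1)]
    total = 0.0
    for v1 in range(n1):
        for w1 in range(v1 + 1, n1):
            if math.isinf(dist1[v1][w1]):
                continue  # assumption: unreachable pairs contribute nothing
            for v2 in range(n2):
                for w2 in range(v2 + 1, n2):
                    if math.isinf(dist2[v2][w2]):
                        continue
                    kw = k_weight(dist1[v1][w1], dist2[v2][w2], c)
                    # Edges are undirected, so try both endpoint matchings.
                    straight = node[v1][v2] * node[w1][w2]
                    crossed = node[w1][v2] * node[v1][w2]
                    total += max(straight, crossed) * kw
    return total
```

Because the larger of the two endpoint matchings is used, the result does not depend on the order in which the endpoints of an edge are listed.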
  21. Materials and Methods (9): Prediction Methods
      • Nearest Neighbor Algorithm
        – Classify a new example x by finding the training example <Xi-Yj> that is nearest to x according to Euclidean distance
      • Variants
        – NNM_Max
        – NNM_AVE
        – NNM_TOP10AVE
      [Diagram: a test-set example (?) is compared by similarity with the train set (experimentally verified) of positive (functional/active) and negative (non-functional/non-active) examples]
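The slides name the three nearest-neighbor variants without defining them. One plausible reading, sketched below, scores a query by comparing its similarities to the positive and to the negative training examples. All definitions here are assumptions made for illustration: Max uses the single most similar example of each class, Ave uses the class mean, Top10Ave uses the mean of the ten most similar examples of each class, and a positive score means "predict functional".

```python
def nn_scores(sims, labels, top=10):
    """Score one query against a training set.

    sims:   similarity of the query to each training example
    labels: +1 (functional) or -1 (non-functional) per training example
    Both classes must be present in the training set.
    """
    pos = sorted((s for s, y in zip(sims, labels) if y > 0), reverse=True)
    neg = sorted((s for s, y in zip(sims, labels) if y <= 0), reverse=True)

    def mean(xs):
        return sum(xs) / len(xs)

    # Hypothetical definitions: positive-class statistic minus negative-class one.
    return {
        "max": pos[0] - neg[0],
        "ave": mean(pos) - mean(neg),
        "top10ave": mean(pos[:top]) - mean(neg[:top]),
    }

def predict(score):
    """Assumed decision rule: a positive score means functional."""
    return 1 if score > 0 else -1
```

The signed score doubles as a ranking value, which is what a percentile analysis of the predictions needs.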
  22. Materials and Methods (10): Evaluation of Predictors
      • K-fold cross-validation
      • Leave-one-out cross-validation
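Leave-one-out cross-validation is easy to run on top of a precomputed kernel (similarity) matrix: each example is held out in turn and classified from the remaining ones. The sketch below assumes a symmetric matrix `K`, labels of +1/-1, and a hypothetical `classify` callback; the single-nearest-neighbor rule shown is only a stand-in for the predictors in the deck.

```python
def leave_one_out(K, labels, classify):
    """Hold out each example once and predict it from all the others.

    K:        square similarity matrix, K[i][j] = similarity of i and j
    labels:   true label per example
    classify: function(sims, train_labels) -> predicted label
    """
    n = len(labels)
    predictions = []
    for i in range(n):
        train = [j for j in range(n) if j != i]  # everything except example i
        sims = [K[i][j] for j in train]
        train_labels = [labels[j] for j in train]
        predictions.append(classify(sims, train_labels))
    return predictions

def nearest_label(sims, train_labels):
    """Stand-in classifier: label of the single most similar training example."""
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_labels[best]
```

Computing the kernel matrix once and reusing it for every held-out example keeps the cost of the n rounds down to simple lookups.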
  23. Materials and Methods (11): Measurements for Evaluation
      • True Positive / False Positive
      • Sensitivity
      • Specificity
      • Accuracy
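For reference, the measures listed above can be computed from the four confusion-matrix counts. The sketch below uses the standard textbook definitions, which the slides do not spell out; the function name is illustrative. The example counts are the NNM_Ave row reported for the enzyme catalytic site dataset.

```python
def evaluate(tp, fn, fp, tn):
    """Standard measures computed from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # fraction of true sites recovered
        "specificity": tn / (tn + fp),               # fraction of non-sites rejected
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# NNM_Ave on the balanced catalytic site dataset (201 active, 201 non-active).
metrics = evaluate(tp=155, fn=46, fp=46, tn=155)
print({name: round(100 * value, 1) for name, value in metrics.items()})
```

On a balanced dataset such as this one, accuracy is simply the mean of sensitivity and specificity.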
  24. 24. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work24
  25. 25. Results and DiscussionEnzyme Catalytic SiteEnzyme catalytic siteTP TP % FN FN% FP FP% TN TN% Contact Not Contact Accuracy Sensitivity SpecificityNNM_Max 150 74.5% 51 25.3% 64 31.8% 137 68.1% 5 59 71.3% 74.5% 68.1%NNM_Ave 155 77.1% 46 22.8% 46 22.8% 155 77.1% 5 41 77.1% 77.1% 77.1%NNM_Top10Ave 156 77.6% 45 22.3% 51 25.3% 150 74.6% 5 46 76.1% 77.6% 74.6%Phosphorylation SitePhosphorylationTP TP% FN FN% FP FP% TN TN% Contact Not Contact Accuracy Sensitivity SpecificityNNM_Max 1104 53.5% 958 46.4% 758 36.7% 1304 50.1% 73 685 58.3% 53.5% 50.1%NNM_Ave 1054 51.1% 1008 48.8% 482 23.3% 1580 76.6% 54 428 63.8% 51.1% 76.6%NNM_Top10Ave1085 52.6% 977 47.3% 667 32.3 1395 67.6% 60 607 60.1% 52.6% 67.6%25
  26. Results and Discussion (2): Percentile Ranking
      • Used the full dataset
      • Ordered list
      • Position ranking
      • The majority of functional sites fall within the top 10% of the ranking
      • NNM_MAX
      • NNM_AVE
      • NNM_TOP10AVE
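The percentile analysis can be illustrated as follows: sort all examples by decreasing prediction score, convert each position in the ordered list to a percentile bin, and count how many functional sites land in each bin. The slides do not give the exact percentile formula, so the binning below (position times 10, integer-divided by the list length) is an assumption, as are the names.

```python
def decile_histogram(scores, labels):
    """Count positive examples per decile of the score-ordered list.

    Bin 0 holds the top 10% of scores, bin 9 the bottom 10%.
    """
    n = len(scores)
    # Indices of examples, best score first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    counts = [0] * 10
    for position, i in enumerate(order):
        if labels[i] > 0:
            counts[position * 10 // n] += 1  # integer arithmetic avoids float edge cases
    return counts
```

If the predictor works, most of the positive examples should be counted in the first bins.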
  27. Results and Discussion (3): Percentile Result (CSA), Active (Functional)
      [Figure: three histograms of the number of active residues vs. percentile bin (0.0-0.1 through 0.9-1.0), one each for Max, Ave, and Top10 Ave; y-axis from 0 to 80]
  28. Results and Discussion (4): Percentile Result (CSA), Non-Active (Non-Functional)
      [Figure: three histograms of the number of non-active residues vs. percentile bin (0.0-0.1 through 0.9-1.0), one each for Max, Ave, and Top 10 Ave; y-axis from 18.5 to 21]
  29. 29. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work29
  30. 30. Conclusions We developed an innovative graph method to represent proteinsurface based on how amino acid residues contact with each other. We implemented a shortest-path graph kernel method and used itto compute the similarity between graphs. We developed three nearest neighbor variants to predict bothdataset based on the similarity matrix that the graph kernel methodproduced. The predictors were able to predict catalytic sites with accuracy upto 77.1%. This work showed that the proposed methods were able to capturethe similarity between enzyme catalytic sites and would provide auseful tool for catalytic site prediction.30
  31. 31. OutlineProblem StatementIntroductionMaterials and MethodsResults and DiscussionConclusionFuture Work31
  32. 32. Future WorkAdd more parameters into labels(graphs, nodes)Improve the program as web serviceWorking with other kernel methods suchas, Minimum Spring Tree and etc.Optimize algorithm for large datasets32
  33. 33. AcknowledgementsI would like to express my deep gratitude to my adviser Dr.Changhui Yan for his continuousencouragements, guidance, and supports to complete thispaper successfully.My sincere thanks also go to my committee members, Dr. Juan(Jen) Li, Dr. Jun Kong, and Dr. Nan Yu for their willingness toserve as committee members.33
  34. 34. Thank you.?34
  35. Introduction ….
      [Diagram: vdw, PDB, NR Database, Blast]
  36. Protein
      [Figure: DNA …-CUA-AAA-GAA-GGU-GUU-AGC-AAG-… and the corresponding protein sequence …-L-K-E-G-V-S-K-D-…]
  37. Importance of Functional Site Prediction
      • Understanding protein functionalities
      • Reveal the structural protein
      • Drug design
      • Design new proteins
  38. Rationale for Understanding Protein Structure and Function
      • Protein sequence
        – Large numbers of sequences, including whole genomes
      • Protein structure
        – Three dimensional
        – Complicated
      • Protein function
        – Rational drug design and treatment of disease
        – Protein and genetic engineering
        – Build networks to model cellular pathways
        – Study organismal function and evolution
      • Connecting steps: structure determination, structure prediction, homology, rational mutagenesis, biochemical analysis, model studies
  39. Existing Applications for Protein Active Site Prediction
  40. Our Approach
      • Shortest-path distance theory
      • Graph with adjacency matrix and graph kernel
      • Nearest neighbor variants (Max, Ave, Top10 Ave)
      • Leave-one-out cross-validation
      • True positive & false positive
      • Increment percentile
  41. Literature Review
      • Graph
      • Adjacency matrix
      • Shortest distance path algorithm
      • Cross validation
      • True positive vs. false positive
      • Percentile ranking
  42. Graph
      • A graph G=<V, E> has V vertices (nodes) and E edges (arcs)
      • A path in G is a sequence of vertices <v0, v1, v2, ..., vn>
      • Directed graph
      • Undirected graph
  43. Adjacency Matrix
      • The adjacency matrix of a simple graph is a matrix with rows and columns labeled by graph vertices
      • 1 = adjacent
      • 0 = not adjacent
      • 0s on the diagonal
  44. Shortest Distance Path Algorithm
      • Used in communications, transportation, electronics, and bioinformatics problems.
      • The all-pairs shortest-path problem involves finding the shortest path between all pairs of vertices in a graph.
      • $A_{ij} = 1$ if there is an edge $(V_i, V_j)$; otherwise, $A_{ij} = 0$
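As a small illustration of the adjacency rule above, the following sketch builds the 0/1 matrix of an undirected simple graph from an edge list. The function name and the zero-based vertex numbering are choices made for this example.

```python
def adjacency_matrix(n, edges):
    """Return the n x n adjacency matrix of an undirected simple graph."""
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i != j:  # a simple graph has no self-loops, so the diagonal stays 0
            A[i][j] = 1
            A[j][i] = 1  # undirected: the matrix is symmetric
    return A

# A three-vertex path 0 - 1 - 2.
for row in adjacency_matrix(3, [(0, 1), (1, 2)]):
    print(row)
```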
  45. Percentile Ranking
      • There is no single standard definition for percentile calculation
      • Ordered list
      • Position ranking
      • Max, Ave, Top10
  46. Method and Material
      • Data gathering
      • Identify the active residues
      • Balance dataset
      • Generating a map file
      • Generate set of graphs
      • Development of graph kernel
  47. Data Gathering
      Catalytic Binding Site (CSA) http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Show_EC_List.pl
      • EC1, EC2 … EC6
      • HTML
      • Regular expression
      • Finding a large single group
      • Selected EC 3.4
      • 73 protein chains
      • 201 active catalytic sites
      • 20398 non-active residues
  48. Data Gathering (continued)
      Phosphorylation Site: Section 3.3.4 of this paper [http://www.informatics.indiana.edu/predrag/publications.htm]
      • 679 protein chains
      • 2062 active phosphorylation site residues
      • 139795 non-active residues
  49. Identify the Active Residues
      Catalytic Binding Site (CSA)
      • CSA annotation database (CSA_2_2_12.dat) [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Download.pl]
      • 251777 records
      • List of active residues (201)
      Phosphorylation Site [http://www.informatics.indiana.edu/predrag/publications.htm]
      • List of active residues (2062)
  50. Balance Dataset
      • Computation time
      • Leave-one-out cross-validation
      • Random selection
      • Catalytic Binding Site (CSA): Active 201, Non-Active 201
      • Phosphorylation Site: Active 2062, Non-Active 2062
  51. Generating a Map File
      • Map protein PDB IDs to protein sequences
      • Atomic solvent accessible area calculations (RASA)
      • Position-specific scoring matrix calculations (PSSM)
      • Active residues
  52. Map Protein PDB IDs to Protein Sequences
      • PDB ID and Chain ID: 101m_A
      • PDB Database [ftp://ftp.wwpdb.org/pub/pdb/derived_data/pdb_seqres.txt]
      • FASTA format:
          >101m_A mol:protein length:154 MYOGLOBIN
          MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
  53. Atomic Solvent Accessible Area Calculations (RASA)
      • Calculate the solvent accessible area (RASA) of each protein
      • Naccess V2.11 program
        – Linux/Unix systems / Cygwin
        – [http://www.bioinf.manchester.ac.uk/naccess/]
        – ./naccess 1a91.pdb & ./naccess 1afo.pdb & ./naccess 1aig.pdb
      • PDB Data Bank: PDB file
        – [http://ftp.wwpdb.org/pub/pdb/data/biounit/coordinates/all/]
      • ncbi-blast-2.2.24+
      • RASA > 0
  54. Position-Specific Scoring Matrix Calculations (PSSM)
      • Download PDB files
      • blast-2.2.25+ program
        – Microsoft Windows
      • NR Database (non-redundant protein sequences)
          using System.Diagnostics;

          // Run PSI-BLAST against the NR database and write an ASCII PSSM file.
          Process p = new Process();
          p.StartInfo.UseShellExecute = false;
          p.StartInfo.RedirectStandardOutput = true;
          p.StartInfo.FileName = @"C:\blast-2.2.25+\bin\psiblast.exe";
          p.StartInfo.Arguments = string.Format("{0}", "-query " + FileNameIN + @" -db C:\blast-2.2.25+\db\nr -num_iterations 2 -out_ascii_pssm " + FileNameOUT);
          p.Start();
      • Example: sample record of a .PSSM file
          1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J
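A record like the sample above can be split into its position, residue letter, and the 20 substitution scores that serve as the node label. The sketch below assumes the whitespace-separated layout of the sample record (position, residue, 20 integer scores, then further columns that are ignored here); the function name is illustrative.

```python
def parse_pssm_row(line):
    """Split one ASCII PSSM record into (position, residue, 20 scores)."""
    fields = line.split()
    position = int(fields[0])
    residue = fields[1]
    scores = [int(x) for x in fields[2:22]]  # the 20 scores follow the residue
    return position, residue, scores

sample = ("1 A 5 -2 -2 -2 -1 -1 -2 1 -2 -2 -3 -1 -2 -3 -2 2 -1 -3 -3 -1 "
          "77 0 0 0 0 0 0 10 0 0 0 0 0 0 0 13 0 0 0 0 0.59 1.#J")
position, residue, scores = parse_pssm_row(sample)
print(position, residue, scores)
```

The trailing columns (percentages and the two summary values) are skipped because only the 20 scores are used as node labels in this sketch.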
  55. Sample Mapping File
      >1neg_A
      Seq :KELVLALYDYQEKSPREVTMKKGDILTLLNSTNKDWWKVEVNDRQGFVPAAYVKKLAAAWSHPQF
      SUR :11101011111111111111111111111111111111011111111011110111111111111
      Site :00000000000000000000000000000000000000000000000000000000000010000
      rASA:115.47,81.22,64.82,.00,20.59,.00,41.60,111.13,56.32,14.17,124.18,35.41,127.39,43.03,111.84,160.37,10.00,.71,33.57,1.82,120.20,91.83,15.89,41.40,69.81,.77,20.31,2.22,49.44,65.40,30.56,97.39,80.11,152.72,75.17,80.10,47.20,64.49,.00,57.09,16.33,101.38,111.31,104.16,71.57,2.73,60.84,.00,18.67,8.04,64.07,71.08,.00,125.10,66.68,24.97,32.49,79.86,65.19,179.94,87.62,51.01,109.35,145.21,71.53,
      entropy:0.80,0.85,0.25,0.92,0.44,1.48,1.02,2.42,1.57,2.01,0.44,0.93,0.49,0.73,0.73,0.83,1.72,1.46,0.59,2.15,0.72,0.98,1.99,1.65,0.60,1.20,0.35,0.94,0.66,0.65,0.51,0.23,1.04,0.45,1.09,4.74,3.91,0.67,1.38,0.61,0.45,0.75,1.43,0.49,0.36,2.32,0.72,1.63,3.17,0.46,1.53,2.78,1.61,0.38,0.45,0.26,0.15,0.51,0.17,0.38,0.47,0.46,0.93,2.04,1.73,
      pdbindex:6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,
  56. Generate Set of Graphs
      • Shortest-path distance (Dijkstra)
      • Adjacency matrix theory
      • Contacting neighbor residues
      • Labeled
      • Weighted
      • Various numbers of nodes and edges
      • Graph normalization
        – Linear normalization: X1 = (X - Min) / (Max - Min)
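The linear normalization formula above maps a list of values onto the range 0 to 1. A minimal sketch follows; the handling of the degenerate case where all values are equal is an assumption, since the slide does not cover it.

```python
def min_max(values):
    """Linear normalization: (X - Min) / (Max - Min) for every X."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant list is mapped to zeros to avoid dividing by zero.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([-3, 5, 1]))
```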
  57. Calculate Distance between Atoms and Check the Contact
      • Distance: $d_1 = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}$
      • PDB file
      • VDW (van der Waals VDW.radii file)
      • Contact condition: $d_1 \le R_1 + R_2 + 0.5$
      • Example of a contact residue:
          2 A _ 3 A! : 1.33441
      • Example of a non-contact residue:
          4 A _ 2 A : 4.14432
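The contact test above can be sketched directly from the distance formula and the van der Waals condition. Treating two residues as contacting when any pair of their atoms satisfies the condition is an assumption made here; the slides do not state how atom-level contacts are combined. The names and the (coordinate, radius) input layout are likewise illustrative.

```python
import math

def atoms_in_contact(p, q, radius_p, radius_q, tolerance=0.5):
    """True if the distance between two atoms is at most R1 + R2 + 0.5."""
    return math.dist(p, q) <= radius_p + radius_q + tolerance

def residues_in_contact(residue_a, residue_b):
    """Each residue is a list of ((x, y, z), vdw_radius) atoms.

    Assumption: one contacting atom pair is enough to link two residues.
    """
    return any(
        atoms_in_contact(p, q, rp, rq)
        for p, rp in residue_a
        for q, rq in residue_b
    )

# Radii for N (1.65) and O (1.40) as listed in the VDW.radii file.
print(atoms_in_contact((0, 0, 0), (0, 0, 3.0), 1.65, 1.40))
```

With these radii the threshold is 1.65 + 1.40 + 0.5 = 3.55, so atoms 3.0 apart are in contact while atoms 4.14432 apart are not.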
  58. Structure of a Graph
  59. Development of Graph Kernel
      • The original graphs G1 and G2 are converted into shortest-path graphs S1(V1, E1) and S2(V2, E2)
      • The Floyd-Warshall algorithm
      • The kernel function is used to calculate the similarity between G1 and G2 by comparing all pairs of edges between S1 and S2.
  60. The Floyd-Warshall Algorithm
          for i = 1 to N
              for j = 1 to N
                  if there is an edge from i to j
                      dist[0][i][j] = the length of the edge from i to j
                  else
                      dist[0][i][j] = INFINITY
          for k = 1 to N
              for i = 1 to N
                  for j = 1 to N
                      dist[k][i][j] = min(dist[k-1][i][j], dist[k-1][i][k] + dist[k-1][k][j])
      Finds the shortest paths between all pairs of vertices v ∈ V of a weighted graph G = (V, E).
      D(k)ij = the weight of the shortest path from vertex i to vertex j for which all intermediate vertices are in the set {1, 2, …, k}.
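The pseudocode above translates almost line for line into Python. This sketch uses the common space-saving variant that updates a single distance matrix in place instead of keeping one matrix per value of k, which gives the same final result. The input convention (`math.inf` where there is no edge) and the function name are choices made for this example.

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest path lengths.

    weights[i][j] is the edge length from i to j, or math.inf if there is none.
    """
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is left untouched
    for i in range(n):
        dist[i][i] = 0.0  # a vertex is at distance zero from itself
    for k in range(n):  # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist

INF = math.inf
path = [[0, 1, INF],
        [1, 0, 1],
        [INF, 1, 0]]
print(floyd_warshall(path))
```

The resulting matrix is what a shortest-path graph stores as edge weights: one entry for every pair of connected vertices.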
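  61. Implementation
          /* Squared Euclidean distance between the PSSM rows of two residues. */
          double Pssm(int ResidueA, int ResidueB)
          {
              int i;
              double sum = 0;
              for (i = 0; i < 20; i++) {
                  sum += pow((double)(seq_a_pssm[ResidueA][i] - seq_b_pssm[ResidueB][i]), 2);
              }
              return sum;
          }

          /* Gaussian node kernel for residue i of graph A and residue j of graph B. */
          dis += Pssm(i, j);
          attr_dis[i][j] = exp((-1) * parm_gamma * dis);

          /* Sum the edge kernel over all pairs of shortest-path edges. */
          sum = 0;
          for (i = 0; i < seq_a_len; i++)
              for (j = 0; j < seq_b_len; j++)
                  for (k = i + 1; k < seq_a_len; k++)
                      for (r = j + 1; r < seq_b_len; r++) {
                          /* Brownian bridge kernel on the two path lengths. */
                          xx1 = seq_a_dist[i][k] - seq_b_dist[j][r];
                          Klen = MaxValue(0, CC - fabs(xx1));
                          /* Edges are undirected: keep the better endpoint matching. */
                          product1 = attr_dis[i][j] * attr_dis[k][r];
                          product2 = attr_dis[k][j] * attr_dis[i][r];
                          value = MaxValue(product1, product2);
                          sum += value * Klen;
                      }
          return sum;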
  62. Compare Similarity
      • Max
      • Ave
      • Top 10 Ave
  63. Result and Discussion
      • Comparison of similarity (TP/FP)
        – Max
        – Ave
        – Top 10 Ave
      • Percentile ranking calculation
      • RASA value
  64. Percentile Result (CSA)
  65. rASA vs. Active Residues
  67. Section 3.4
          static IEnumerable<string> SortByLength(IEnumerable<string> e)
          {
              var sorted = from s in e
                           orderby s.Length descending
                           select s;
              return sorted;
          }
  68. Protein Chain (CSA)
  69. List of Phosphorylation Sites
  70. Catalytic Binding Site (CSA): Active Residues
  71. Phosphorylation Site: Active Residues
  72. van der Waals VDW.radii file
          RESIDUE ATOM ALA 5
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          RESIDUE ATOM ARG 11
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.87 0
          ATOM CD 1.87 0
          ATOM NE 1.65 1
          ATOM CZ 1.76 0
          ATOM NH1 1.65 1
          ATOM NH2 1.65 1
          RESIDUE ATOM ASP 8
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.76 0
          ATOM OD1 1.40 1
          ATOM OD2 1.40 1
          RESIDUE ATOM ASN 8
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.76 0
          ATOM OD1 1.40 1
          ATOM ND2 1.65 1
          RESIDUE ATOM CYS 6
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM SG 1.85 0
          RESIDUE ATOM GLU 9
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.87 0
          ATOM CD 1.76 0
          ATOM OE1 1.40 1
          ATOM OE2 1.40 1
          RESIDUE ATOM GLN 9
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.87 0
          ATOM CD 1.76 0
          ATOM OE1 1.40 1
          ATOM NE2 1.65 1
          RESIDUE ATOM GLY 4
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          RESIDUE ATOM HIS 10
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.76 0
          ATOM ND1 1.65 1
          ATOM CD2 1.76 0
          ATOM CE1 1.76 0
          ATOM NE2 1.65 1
          RESIDUE ATOM ILE 8
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG1 1.87 0
          ATOM CG2 1.87 0
          ATOM CD1 1.87 0
          RESIDUE ATOM LEU 8
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.87 0
          ATOM CD1 1.87 0
          ATOM CD2 1.87 0
          RESIDUE ATOM LYS 9
          ATOM N 1.65 1
          ATOM CA 1.87 0
          ATOM C 1.76 0
          ATOM O 1.40 1
          ATOM CB 1.87 0
          ATOM CG 1.87 0
          ATOM CD 1.87 0
          ATOM CE 1.87 0
          ATOM NZ 1.50 1
  73. PDB File Sample
  74. Distance File Example
