Clustering Algorithms Hierarchical Algorithms . Partitioning Algorithms . Density-Based Algorithms .
Hierarchical Algorithms Single Linkage
Partitioning Algorithms K-Medoids
Density-Based Algorithms DBScan
Graph-Theoretic Algorithms Zahn Algorithm
Similarity Measurements Swanson Algorithm . Document Occurrences .
Swanson Algorithm Search PubMed for gene A and extract set A ( the  most related keywords - MeSH or GO terms - ) . Searc...
Document Occurrences Search PubMed for gene A and extract set A  (documents Ids of gene A) . Search PubMed for gene B an...
Agenda Introduction to Biomedical Text Mining System Overview   Problem Description   Motivation   Challenges System...
Extended Work: PPI System with SVM Classifier (1)  Equation: $u = w \cdot x - b$. Objective: $\min \tfrac{1}{2}\lVert w \rVert^2$ subject to yi ...
Extended Work: PPI System with SVM Classifier (2)  $\min_\alpha \Psi(\alpha) = \min_\alpha \tfrac{1}{2} \sum_i \sum_j y_i y_j (x_i \cdot x_j)\,\alpha_i \alpha_j - \sum_i \alpha_i$;  α is called mu...
Agenda Introduction to Biomedical Text Mining System Overview   Problem Description   Motivation   Challenges System...
Conclusion Problem 1: Algorithms for concept recognition in  documents abstracts and titles    We introduced an algorith...
Conclusion Problem 4: Using Swanson’s algorithm to assess the similarity between  different biological terms    We intro...
Future work There are hot research areas and open problems in the biological text mining   The content Provider for Docu...
Future work There are some features that may be added to the System   Biomedical Ontology based Search Engine     Provi...
Future work• There are some features that may be added to  the System   Gene Clustering     Using more sophisticated clu...
Ibn Sina
Presentation to describe Ibn Sina graduation project. Presentation given in July 2010.


  1. 1. Please turn off your mobiles or put them on silent mode
  2. 2. Biological Relation Extraction Tools Using Biomedical Ontologies and Text Mining
  3. 3. Agenda
        • Introduction to Biomedical Text Mining
        • System Overview: Problem Description, Motivation, Challenges
        • System Framework
        • Application upon System Framework: Swanson’s Algorithm, Protein to Protein Interactions (PPI), Gene Clustering based on Text Mining
        • Extended Work
        • Conclusion and Future Work
  5. 5. Introduction to Biomedical Text Mining
        • Text Mining = processing unstructured (textual) information, extracting meaningful data, and making the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms.
        • Biomedical Text Mining = text mining applied to biomedical documents.
  7. 7. System Overview: Problem Description
        • A huge amount of information is stored in millions of documents.
        • This information can be used effectively to solve many problems:
        • Retrieving knowledge without much effort
        • Discovering relationships between different entities
        • Assessing relationship strength between different entities
        • Grouping entities into different clusters
  8. 8. System Overview: Motivation
        • Build a semantic structure of documents which facilitates navigation through thousands of documents.
        • Extract relationships between biomedical terms using text mining techniques with the aid of biomedical ontologies.
        • Use text mining to group genes into different clusters.
  9. 9. System Overview: Challenges
        • Concept recognition
        • Building a semantic structure of annotated documents using ontologies
        • Relationship recognition
        • Similarity (distance) between different entities
  10. 10. Overall System Components
        • Framework
        • Searching and Browsing
        • Swanson’s Algorithm
        • PPI
        • Gene Clustering
  11. 11. Overall System Architecture (diagram): Searching & Browsing, Gene Clustering, Swanson’s Algorithm and PPI, drawn above the Framework.
  14. 14. System Framework Agenda
        • Objective
        • Framework Concept Issues
        • Framework Design Issues
        • Framework Sequence Diagram
        • Framework Database
        • Framework GUI
        • Comparison
        • Framework Demo
  15. 15. System Framework Objective:  Use ontologies to markup biomedical text documents.  Based on established semantic links between documents and ontology concepts, the goal is build semantic representation of information.  Provide services to other applications and users.
  16. 16. System Framework Framework Concept Issues Design Issues
  18. 18. Framework Concept Issues (flow diagram; labels only): User; Query; Query Expansion; Expanded Query; Search PubMed; Fetching Documents; Documents; Gene Ontology; Extract GO terms; Annotate PubMed documents; Annotated Documents; Structure Representation of documents.
  19. 19. System Framework PubMed:  Largest documents source in the biomedical field  Contains over 18 million documents  Maintained by the United States National Library of Medicine (NLM)  Indexes all documents by MeSH terms to facilitate searching and retrieval
  20. 20. System Framework Gene Ontology:  The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases  Includes a controlled vocabulary of terms for describing gene product characteristics.  Consists of three main categories  Cellular component  Biological process  Molecular function
  21. 21. System Framework MeSH database:  Comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching [Wikipedia] MeSH main heading:  Anatomy  Organisms  Diseases  Chemicals and Drugs  Analytical, Diagnostic and Therapeutic Techniques and Equipment  Psychiatry and Psychology  Phenomena and Processes  Disciplines and Occupations  Anthropology, Education, Sociology and Social Phenomena  Technology, Industry, Agriculture  Humanities  Information Science  Named Groups  Health Care  Publication Characteristics  Geographical liocations
  22. 22. System Framework Query Expansion (QE):is the process of reformulating a seed query to improve retrieval performance in information retrieval operations [Wikipedia] How ? Example
  23. 23. Query Expansion Example (diagram; labels only): Pigmentation; Ocellus pigmentation; Pigment metabolic process; Pigment accumulation; Cellular pigmentation.
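A minimal sketch of query expansion, assuming the seed term is simply OR-ed with related ontology terms. The `RELATED_TERMS` table and the `expand_query` helper are hypothetical; the slide does not show how the related terms are linked or how the expanded query is sent to PubMed.

```python
# Hypothetical mini-ontology: seed term -> related ontology terms.
# The term names come from the slide; the mapping itself is an assumption.
RELATED_TERMS = {
    "pigmentation": [
        "ocellus pigmentation",
        "pigment metabolic process",
        "pigment accumulation",
        "cellular pigmentation",
    ],
}


def expand_query(seed, related=RELATED_TERMS):
    """Return a boolean OR query made of the seed and its related terms."""
    terms = [seed] + related.get(seed.lower(), [])
    # Quote every term so multi-word phrases are searched as phrases.
    return " OR ".join('"{}"'.format(t) for t in terms)


if __name__ == "__main__":
    print(expand_query("pigmentation"))
```

A seed with no entry in the table is returned unchanged (only quoted), so the expansion never narrows the original query.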
  24. 24. System Framework Documents Annotating  Annotate documents with Gene Ontology Terms, Genes and proteins.  Represent each documents by set of terms. (How ?)
  25. 25. GO extractor
        • GO’s vocabulary consists of 7,841 words. The majority of the GO words occur only once in the whole ontology, and more than 90% occur no more than 10 times. On the other hand, 51 of the GO words occur at least 100 times in the ontology.
        • Words with a very high frequency do not give much information, as they are part of many labels in the ontology. Extracting a word with a low frequency, however, gives a much better hint about a mentioned concept (Zipf’s law).
        • By the nature of GO terms, the words at the end are very general (e.g. activity, transport).
        • In addition, many GO terms are substrings of their descendant GO terms.
        • The algorithm is taken from GoPubMed (2008), “GoPubMed: Ontology-based literature search for the life sciences”.
  26. 26. GO extractor algorithm (flowchart; box labels only): Get last word; Compare with root; Set main root as a root; Do BFS and take each one as a root; The same word occurred at any sibling? (Y/N); Reaches leaf? (Y/N); Get next word; Get next word & do BFS and consider each one as a root.
  27. 27. GO Extractor Example
        Abstract: “... and it is affected by the Kinase activity”.
        • Start from the last word of the paragraph, “activity”.
        • Starting from the root of the GO tree, search for a GO term ending with “activity”.
        • When we reach it, fetch the next word and continue from the new root.
        • Now we are looking in the subtree for a GO term that ends with “Kinase activity”.
        • If the search reaches a leaf, we have found a GO term. Restart by taking the next word and searching from the root.
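The backward matching idea can be sketched as follows. This is a simplified illustration, not the GoPubMed algorithm itself: it indexes term labels in a trie keyed by their words in reverse order, instead of walking the GO hierarchy with BFS, and it ignores word frequencies and gaps. The names `build_reverse_trie` and `extract_terms` and the three sample terms are assumptions.

```python
def build_reverse_trie(terms):
    """Index each term by its words in reverse order (last word first)."""
    root = {}
    for term in terms:
        node = root
        for word in reversed(term.lower().split()):
            node = node.setdefault(word, {})
        node["$term"] = term  # marks the end of a complete term label
    return root


def extract_terms(text, trie):
    """Scan the text from its last word backwards, keeping the longest match."""
    words = [w.strip('.,;:()"').lower() for w in text.split()]
    found = []
    i = len(words) - 1
    while i >= 0:
        node, j, match, match_start = trie, i, None, i
        # Extend the match to the left while the trie has a branch for the word.
        while j >= 0 and words[j] in node:
            node = node[words[j]]
            if "$term" in node:
                match, match_start = node["$term"], j
            j -= 1
        if match:
            found.append(match)
            i = match_start - 1  # restart to the left of the matched term
        else:
            i -= 1
    return found


if __name__ == "__main__":
    trie = build_reverse_trie(["kinase activity", "protein kinase activity", "transport"])
    print(extract_terms("and it is affected by the Kinase activity.", trie))
```

Matching from the general last word ("activity") towards the more specific words to its left mirrors the slide's observation that the final words of GO terms carry the least information.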
  29. 29. Framework Design Issues: the top-level architecture of the system can be divided into:
        • Data Handling Components
        • Information Handling Components
        • Information Extraction
        • Information Representation
        • Information Retrieval
  30. 30. Class Diagram
  31. 31. System Framework Framework main components:  Document Sources  Extractor  Document Annotators  Ontology Manager  System Engine  Database Manager  Cache Manager  Document
  32. 32. System Framework Document Sources  Fetching of singles or collections of documents from remote stores. Extractor  Implements Information Extraction algorithms to extract ontology terms from the documents Document Annotators  establish semantic link between documents and ontology concepts.  For example linking documents with its GO terms, MeSH terms . . . etc.
  33. 33. System Framework Ontology Manager  Provide interface to around ontologies  Composed by sub-managers to merge ontologies such as Gene ontology System Engine  Main component of the system.  Responsible for maintaining all the operations and communications between various components of the system
  34. 34. System Framework Database Manager  implemented as a pool object (connections pool)  handles and maintains queries to the database such insert, update and delete documents Cache Manager  Implemented as client side of MemCached (open source caching project).  Handles operations to the system cache
  36. 36. Framework Sequence Diagram
  38. 38. Framework Database
  40. 40. Framework GUI GUI Goals  User friendly  Consistency  Model View Control (MVC)  Human-Computer Interaction concepts  Usability  Specific Application services satisfaction  Standard Data Exchange  Internationalization
  41. 41. Framework GUI
  43. 43. Comparison: Our system vs. Textpresso, XplorMed and Vivismo
        Ontology used
        • Our system: full Gene Ontology
        • Textpresso: only 30 categories of the ontology
        • XplorMed: top hierarchy of the MeSH ontology
        • Vivismo: derived from the search result
        Output
        • Our system: uses the deep ontology to navigate through a large result set in a non-sequential order
        • Textpresso: returns a list of relevant abstracts
        • XplorMed: for each MeSH category, there is an associated list
        • Vivismo: returns a list of relevant abstracts
  44. 44. IBN-SINA vs. Others
        Works on
        • IBN-SINA: works on all the PubMed abstracts
        • Textpresso: designed for full papers, which are not available most of the time
        • XplorMed: works on all the PubMed abstracts
        • Vivismo: works on all the PubMed abstracts
        Term extraction
        • IBN-SINA: allows gaps within matches and considers the information content of the words, which leads to more refined term extraction
        • Textpresso: tries to find the category terms directly in the text, only allowing for some variations in lower/uppercase letters and plural forms
        • XplorMed: extracts terms based on term frequency in the collected documents
        • Vivismo: extracts terms based on term frequency in the collected documents
  45. 45. Our System Vs. GoPubMed
  47. 47. Framework Demo DEMO
  50. 50. Swanson Algorithm (1986): Swanson’s method is a way of finding indirect relations between objects. Diagram: objects A and B, each with its related terms (A1, A2 and B1, B2). 1986: “Undiscovered public knowledge”
  51. 51. Cosine Similarity: Cosine similarity is a measure of similarity between two vectors of n dimensions, found by computing the cosine of the angle between them; it is often used to compare documents in text mining [Wikipedia].
        Terms related to the first term (A’s related terms):  A B C D E F G H
        Terms related to the second term (B’s related terms): A X Y B Z D E F
        As binary vectors over all terms:
                 A B C D E F G H X Y Z
        First:   1 1 1 1 1 1 1 1 0 0 0
        Second:  1 1 0 1 1 1 0 0 1 1 1
  52. 52. Cosine Similarity (Cont.): Finally, applying the cosine similarity function:
                 A B C D E F G H X Y Z
        First:   1 1 1 1 1 1 1 1 0 0 0
        Second:  1 1 0 1 1 1 0 0 1 1 1
        Similarity = (1+1+0+1+1+1+0+0+0+0+0) / (√8 × √8) = 5/8 = 0.625
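The worked example can be checked with a short sketch. It assumes, as on the slide, that each object is represented by a binary vector over its related terms, so the dot product is just the number of shared terms. The function name `cosine_similarity` and the use of Python sets are illustrative choices.

```python
import math


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity of two binary term vectors given as term collections."""
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0  # an empty vector has no direction; treat as no similarity
    # For 0/1 vectors the dot product is the number of shared terms,
    # and each vector norm is the square root of its number of terms.
    return len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))


if __name__ == "__main__":
    related_to_first = ["A", "B", "C", "D", "E", "F", "G", "H"]
    related_to_second = ["A", "X", "Y", "B", "Z", "D", "E", "F"]
    print(round(cosine_similarity(related_to_first, related_to_second), 3))
```

The two sets share five terms (A, B, D, E, F) and each holds eight, which gives the 5/8 shown on the slide.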
  53. 53. Swanson example Relation between P53 and P51 1986: “Fish oil, Raynaud’s syndrome, and undiscovered public knowledge”
  56. 56. PPI Agenda
        • Problem Description
        • Motivation
        • PPI System Overview
        • PPI System Main Components: Dependency Parse Tree, Similarity Metrics, K-Nearest Neighbor Classifier
        • Evaluation of PPI: Evaluation Metrics
        • Results and Comparison
  58. 58. Problem Description: Due to the ever-growing number of publications about protein-protein interactions, information extraction from text is increasingly recognized as one of the crucial technologies in bioinformatics. Reference: Gunes Erkan, Arzucan Ozgur, Dragomir R. Radev. Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 228-237, Prague, June 2007.
  60. 60. Motivation
        • The interactions between proteins are important for numerous, if not all, biological functions.
        • The function of a protein can be characterized more precisely through knowledge of PPI.
        • Information about these interactions improves our understanding of diseases and can provide the basis for new therapeutic approaches.
        • Validate experimental results and test benches.
  62. 62. System Overview: We worked at the sentence level. Why?
        • It increases the semantics understood from the sentence.
        • Synthesis of the sentence increases the knowledge obtained from it.
        • A specific relation between proteins can be deduced from it.
  63. 63. System Overview
  64. 64. System Overview: Our approach depends on the following observation: the shortest path between the entities in the dependency tree of a sentence usually captures the information needed to identify their relationship.
  66. 66. Dependency Parse Tree
  67. 67. Dependency Parse Tree
        • Unlike a syntactic parse, it captures the semantic predicate-argument relationships among the words of a sentence.
        • We use the Stanford Parser API for the natural language processing task.
        • The shortest path is found using Breadth-First Search (BFS): since every edge has equal weight, the first path discovered is the shortest one.
  68. 68. Dependency Parse Tree (Example): the dependency tree of the sentence “The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”
  69. 69. Example (Cont.): Then, we select the shortest paths between the protein pairs:
        • KaiC - nsubj - interacts - prep_with - SasA
        • KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiA
        • KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiB
        • SasA - conj_and - KaiA
        • SasA - conj_and - KaiB
        • KaiA - conj_and - SasA - conj_and - KaiB
  70. 70. Example (Cont.): Then, we rename the proteins in the pair as PROTX1 and PROTX2, and all the other proteins in the sentence as PROTX0:
        • PROTX1 - nsubj - interacts - prep_with - PROTX2
        • PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
        • PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
        • PROTX1 - conj_and - PROTX2
        • PROTX1 - conj_and - PROTX2
        • PROTX1 - conj_and - PROTX0 - conj_and - PROTX2
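The path selection and renaming steps can be sketched as below. The edge list is hand-written from the paths shown on the slides rather than produced by the Stanford Parser, and the helper names (`build_graph`, `shortest_path`, `anonymize`) are hypothetical.

```python
from collections import deque

# Dependency edges (head, label, dependent) read off the example paths.
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "prep_with", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]
PROTEINS = {"KaiC", "SasA", "KaiA", "KaiB"}


def build_graph(edges):
    """Treat the dependency tree as an undirected, labelled graph."""
    graph = {}
    for head, label, dep in edges:
        graph.setdefault(head, []).append((dep, label))
        graph.setdefault(dep, []).append((head, label))
    return graph


def shortest_path(graph, start, goal):
    """BFS; returns [node, label, node, ...] or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour, label in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [label, neighbour])
    return None


def anonymize(path, proteins, first, second):
    """Rename the pair to PROTX1/PROTX2 and any other protein to PROTX0."""
    out = []
    for token in path:
        if token == first:
            out.append("PROTX1")
        elif token == second:
            out.append("PROTX2")
        elif token in proteins:
            out.append("PROTX0")
        else:
            out.append(token)
    return " - ".join(out)


if __name__ == "__main__":
    graph = build_graph(EDGES)
    path = shortest_path(graph, "KaiC", "KaiA")
    print(anonymize(path, PROTEINS, "KaiC", "KaiA"))
```

Because every edge counts as one step, BFS returns a shortest path as soon as the goal node is dequeued; in a tree that path is also the only one.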
  72. 72. Similarity Metrics
  73. 73. Similarity Metrics: The main idea of using similarity metrics is to find a function that maps input patterns into a target space such that a simple distance in the target space approximates the “semantic” distance in the input space.
  74. 74. Similarity Metrics
        • We implemented Levenshtein distance (edit distance): the number of insertions, deletions and substitutions needed to transform one string into another.
        • We also used an open-source library called “SimMetrics”, a Java library of 23 string similarity metrics, developed at the University of Sheffield (Chapman, 2004).
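A minimal sketch of the edit distance, using the standard dynamic-programming formulation over insertions, deletions and substitutions. It is written in Python for illustration and is not the project's own implementation; the `similarity` normalization is an assumption about how a distance could be turned into a score.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical normalization of the distance into the range [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    p1 = ["PROTX1", "nsubj", "interacts", "prep_with", "PROTX2"]
    p2 = ["PROTX1", "nsubj", "interacts", "prep_with", "PROTX0", "conj_and", "PROTX2"]
    print(levenshtein(p1, p2))
```

Since the function only compares items for equality, it works on lists of path tokens as well as on plain strings.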
  75. 75. Similarity Metrics: We used only 10 string similarities from SimMetrics:
        • Cosine Similarity
        • Block Distance
        • Dice Similarity
        • Euclidean Distance
        • Jaccard Similarity
        • Jaro Similarity
        • Jaro Winkler Similarity
        • Matching Coefficient
        • Monge Elkan Similarity
  77. 77. K-Nearest Neighbor Classifier
  78. 78. K-Nearest Neighbor Classifier: assign a label according to the majority label of the k nearest-neighbor training patterns.
  79. 79. KNN Example
        • If k = 3, the query point is classified as a triangle.
        • If k = 5, it is classified as a square.
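A small sketch of majority-vote k-NN with a pluggable distance function. The one-dimensional toy data is invented purely to reproduce the triangle/square outcome described above, and `knn_classify` is a hypothetical helper, not the project's classifier.

```python
from collections import Counter


def knn_classify(query, training, k, distance):
    """Label the query by majority vote among its k nearest training examples.

    `training` is a list of (example, label) pairs and `distance` is any
    function of two examples, for instance an edit distance between paths.
    """
    neighbours = sorted(training, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    # On a tie, most_common keeps the label that was counted first,
    # which here is the label of the nearer neighbour.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented 1-D points, chosen only to mimic the slide's example.
    training = [
        (1.0, "triangle"), (1.5, "square"), (2.0, "triangle"),
        (3.0, "square"), (3.5, "square"), (10.0, "triangle"),
    ]

    def absolute_difference(x, y):
        return abs(x - y)

    print(knn_classify(0.0, training, 3, absolute_difference))
    print(knn_classify(0.0, training, 5, absolute_difference))
```

Passing the distance in as a parameter matches the evaluation setup described later, where both k and the similarity metric are varied.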
  80. 80. KNN Strengths and Weaknesses: Strengths
        • Simple to implement and use
        • Comprehensible: predictions are easy to explain
        • Robust to noisy data by averaging over the k nearest neighbors
  81. 81. KNN Strengths and Weaknesses: Weaknesses
        • Needs a lot of space to store all examples.
        • Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).
  83. 83. Evaluation of PPI
  84. 84. Evaluation of PPI • We used five different datasets: • BioInfer dataset. • AIMed dataset. • LLL dataset. • IEPA dataset. • HPRD50 dataset. • We used the KNN classifier, varying K and the similarity metric as parameters.
  85. 85. Confusion Matrix
  86. 86. Evaluation Metrics• Precision:• Recall:• F-measure:
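The formulas on this slide did not survive in the transcript. The standard definitions in terms of confusion-matrix counts (true positives TP, false positives FP, false negatives FN) are given below; taking the F-measure to be the balanced F1 score is an assumption, since the slide does not say which weighting was used.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```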
  87. 87. PPI Agenda Problem Description Motivation PPI System Overview PPI System Main Components  Dependency Parse Tree  Similarity Metrics  K-Nearest Neighbor Classifier Evaluation of PPI  Evaluation Metrics Results and Comparison
  88. 88. Results
  89. 89. Results
  90. 90. Results
  91. 91. Results
  92. 92. Results
  93. 93. Results and Comparison
      Dataset     Min. Result   Max. Result
      BioInfer    32            56.9
      AIMed       5             48.9
      LLL         48.8          73
      IEPA        36.6          72
      HPRD50      12.9          63.49
  94. 94. Our PPI System Vs. Graph Kernel Approach
      Dataset     Our System (%)   Graph Kernel Approach (%)
      BioInfer    56.9             52.9
      AIMed       48.9             56.4
      LLL         73               76.8
      IEPA        72               75.1
      HPRD50      67               63.4
  95. 95. Agenda Introduction to Biomedical Text Mining System Overview  Problem Description  Motivation  Challenges System Framework Application upon System Framework  Swanson’s Algorithm  Protein to Protein Interactions (PPI)  Gene Clustering based on Text Mining Extended Work Conclusion and Future Work.
  96. 96. Overall System Architecture Searching Gene Swanson’s PPI & Clustering Algorithm Browsing Framework
  97. 97. Motivation Goal :  Grouping genes according some features . Challenges :  Large number of genes .  The complexity of biological networks .
  98. 98. Motivation The solution is :  Gene Clustering
  99. 99. Gene Clustering Techniques Based on Gene Expression :  Advantages :  High Accuracy .  Disadvantages :  High cost .  Time Consuming .  Noise .
  100. 100. Gene Clustering Techniques Based on Text Mining :  Advantages :  Low Cost .  Low Time Consuming .  Disadvantages :  Low accuracy .
  101. 101. Gene Clustering Based on TextMining To perform Gene Clustering we need :  Clustering Algorithms .  Similarity Measurements .
  102. 102. Clustering Algorithms Hierarchical Algorithms . Partitioning Algorithms . Density-Based Algorithms .
  103. 103. Hierarchical Algorithms Single Linkage
  104. 104. Partitioning Algorithms K-Medoids
  105. 105. Density-Based Algorithms DBScan
  106. 106. Graph-Theoretic Algorithms Zahn Algorithm
  107. 107. Similarity Measurements • Swanson Algorithm. • Document Occurrences.
  108. 108. Swanson Algorithm • Search PubMed for gene A and extract set A (the most related keywords: MeSH or GO terms). • Search PubMed for gene B and extract set B (the most related keywords: MeSH or GO terms). • Based on the intersection between set A and set B, we apply the cosine similarity.
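One way to read the last step is to treat each keyword set as a binary vector, in which case the cosine similarity reduces to the size of the intersection divided by the geometric mean of the set sizes. The sketch below assumes that unweighted reading (the slides do not say whether keywords carry weights); the function name and the example keyword sets are invented for illustration.

```python
import math


def cosine_similarity_sets(set_a, set_b):
    """Cosine similarity of two sets viewed as binary (0/1) vectors:
    |A ∩ B| / sqrt(|A| * |B|). Returns 0.0 if either set is empty."""
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


# Invented keyword sets standing in for the terms extracted for two genes.
keywords_a = {"apoptosis", "cell cycle", "DNA repair", "transcription"}
keywords_b = {"apoptosis", "cell cycle", "kinase activity", "cell growth"}

score = cosine_similarity_sets(keywords_a, keywords_b)
```

Here two of the four keywords are shared on each side, so the score is 2 / sqrt(4 * 4) = 0.5.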
  109. 109. Document Occurrences • Search PubMed for gene A and extract set A (document IDs of gene A). • Search PubMed for gene B and extract set B (document IDs of gene B). • Based on the intersection between set A and set B, we apply the Jaccard similarity coefficient.
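The Jaccard coefficient is the size of the intersection divided by the size of the union. A minimal sketch, with invented document IDs standing in for PubMed search results and a hypothetical function name:

```python
def jaccard_similarity(set_a, set_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Invented document IDs for two genes; two documents mention both genes.
docs_a = {101, 102, 103, 104}
docs_b = {103, 104, 105, 106}

doc_score = jaccard_similarity(docs_a, docs_b)  # 2 shared / 6 distinct
```

Two genes that are always mentioned in the same documents score 1.0, and genes that never co-occur score 0.0.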
  110. 110. Agenda Introduction to Biomedical Text Mining System Overview  Problem Description  Motivation  Challenges System Framework Application upon System Framework  Swanson’s Algorithm  Protein to Protein Interactions (PPI)  Gene Clustering based on Text Mining Extended Work Conclusion and Future Work.
  111. 111. Extended Work: PPI System with SVM Classifier (1) Equation: $u = w \cdot x - b$. Objective: $\min \frac{1}{2} \|w\|^2$ subject to $y_i (w \cdot x_i - b) \ge 1, \ \forall i$
  112. 112. Extended Work: PPI System with SVM Classifier (2) $\min \Psi(\alpha) = \min \frac{1}{2} \sum_i \sum_j y_i y_j (x_i \cdot x_j) \alpha_i \alpha_j - \sum_i \alpha_i$ • The $\alpha_i$ are the Lagrange multipliers; once $\alpha$ is known, $(w, b)$ can be recovered: $w = \sum_i y_i \alpha_i x_i$, $b = w \cdot x_k - y_k$ for some $\alpha_k > 0$
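The step from the multipliers to $(w, b)$ can be illustrated on a toy problem. The sketch below only applies the two recovery formulas from the slide and the decision function $u = w \cdot x - b$; it does not solve the optimisation itself. The two training points and their multipliers are hand-picked for illustration (for this pair, $\alpha = (0.5, 0.5)$ minimises $\Psi$), and all names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def recover_w_b(alphas, xs, ys):
    """Apply w = sum_i y_i * alpha_i * x_i and b = w . x_k - y_k,
    where k is the first index with alpha_k > 0 (a support vector)."""
    dim = len(xs[0])
    w = [sum(a * y * x[d] for a, x, y in zip(alphas, xs, ys)) for d in range(dim)]
    k = next(i for i, a in enumerate(alphas) if a > 0)
    b = dot(w, xs[k]) - ys[k]
    return w, b


def decision(w, b, x):
    """SVM output u = w . x - b; its sign gives the predicted class."""
    return dot(w, x) - b


# Toy problem: one positive and one negative example on the x-axis.
XS = [(1.0, 0.0), (-1.0, 0.0)]
YS = [1, -1]
ALPHAS = [0.5, 0.5]  # hand-picked multipliers, assumed already optimised

w, b = recover_w_b(ALPHAS, XS, YS)
```

For this toy data the recovered hyperplane is the vertical axis, and both training points satisfy the margin constraint $y_i (w \cdot x_i - b) \ge 1$ with equality.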
  113. 113. Agenda Introduction to Biomedical Text Mining System Overview  Problem Description  Motivation  Challenges System Framework Application upon System Framework  Swanson’s Algorithm  Protein to Protein Interactions (PPI)  Gene Clustering based on Text Mining Extended Work Conclusion and Future Work.
  114. 114. Conclusion Problem 1: Algorithms for concept recognition in documents abstracts and titles  We introduced an algorithm to annotate the Gene Ontology terms in the documents. Problem 2: Use the annotated documents to build a structured representation of documents  We introduced how framework uses Gene Ontology to build a semantic representation of the obtained documents Problem 3: Design a system for ontology based search engines for biological researchers  We introduced design of the framework and how it is flexible for future modifications and scalable with respect to number of documents and number of users.
  115. 115. Conclusion Problem 4: Using Swanson’s algorithm to assess the similarity between different biological terms  We introduced how can Swansons algorithm be used to estimate the similarity between two instances (P53 and P21) Problem 5: Supervised machine learning algorithms for prediction of Protein to Protein interactions  We introduced how we used supervised machine learning algorithms such as KNN and a new technique to estimate the distance between sentence in order to predict the possible interactions between proteins mentioned in the documents. Problem 6: Unsupervised machine learning algorithms to identify different clusters of Genes  We introduced how we used unsupervised machine learning algorithms such as DBScan and the similarity based on Swanson Algorithms and Cosine similarity in order to group genes mentioned in the documents in different clusters.
  116. 116. Future work There are active research areas and open problems in biological text mining • The content provider for documents: Google Scholar; using Semantic Web 3.0 (online journals) • Ontology generation: ability to edit the ontologies and add knowledge • Other ontologies: using Wikipedia as an ontology
  117. 117. Future work There are some features that may be added to the system • Biomedical ontology-based search engine: provide a document summary for each group of documents; allow the user to save and print the results obtained by the system. • Protein-Protein Interaction (PPI): use more sophisticated classifiers and machine learning techniques such as AdaBoost to enhance the classification process; use background knowledge of verbs, since many verbs carry the same meaning. This would help the system produce more accurate results, because a fuzzy distance could be introduced for differences in verb meaning. It would also make it possible to discover the type of relation between terms, giving more semantic relation identification.
  118. 118. Future work • There are some features that may be added to the system • Gene Clustering: use more sophisticated clustering algorithms that were originally designed for gene clustering. More applications: • Based on the services provided by the ontology-based engine, we can construct applications such as extracting relations between drugs and diseases, or grouping diseases into clusters, which helps to identify the characteristics of a newly discovered disease, as well as other applications that rely on text mining of biomedical documents.