
Semantic annotation of biomedical data

Presentation about semantic annotation of biomedical data. Presented at LIRMM, INRIA and other venues between 2008 and 2010.

Published in: Education, Technology

  1. 1. Semantic annotation of biomedical data<br />Clement Jonquet<br />jonquet@stanford.edu<br />INRIA - EXMO seminar - March 24th, 2010<br />
  2. 2. Speech overview<br /><ul><li>Introduction: semantic annotation, semantic web, biomedical context, the challenge
  3. 3. Ontology-based annotation workflow: concept recognition, semantic expansion, why it’s hard?
  4. 4. Annotation services: the NCBO Annotator web service, the NCBO biomedical resources index
  5. 5. Users & use cases
  6. 6. Conclusion and future work</li></ul>
  7. 7. Annotation & semantic web<br /><ul><li>Part of the vision for the semantic web
  8. 8. Web content must be semantically described using ontologies
  9. 9. Semantic annotations help to structure the web
  10. 10. Annotation is not an easy task
  11. 11. Automatic vs. manual
  12. 12. Lack of annotation tools (convenient, simple to use and easily integrated into automatic processes)
  13. 13. Today’s web content (and public data available through the web) is mainly composed of unstructured text</li></ul>
  14. 14. Annotation is not a common practice<br /><ul><li>High number of ontologies
  15. 15. Getting access to all is hard: formats, locations, APIs
  16. 16. Lack of tools that easily access all ontologies (domain)
  17. 17. Users do not always know the structure of an ontology’s content or how to use it in order to do the annotations themselves
  18. 18. Lack of tools to do the annotations automatically
  19. 19. Boring additional task without immediate reward for the user</li></ul>
  20. 20. Biomedical context<br /><ul><li>Explosion of publicly available biomedical data
  21. 21. Very diverse, growing very fast
  22. 22. Most of the data are unstructured and rarely described with ontology concepts available in the domains
  23. 23. Hard for biomedical researchers to find the data they need
  24. 24. Data integration problem
  25. 25. Translational discoveries are prevented
  26. 26. Good example of use of ontologies and terminologies for annotations
  27. 27. Gene Ontology annotations
  28. 28. PubMed (biomedical literature) indexed with MeSH headings
  29. 29. Limitations
  30. 30. UMLS only, almost nothing for OBO & OWL ontologies
  31. 31. Manual approaches, curators (scalability?)
  32. 32. Automatic approaches (usability & accuracy?)</li></ul>
  33. 33. The challenge<br /><ul><li>Automatically process a piece of raw text to annotate it with relevant ontologies
  34. 34. Large scale – to scale up for many resources and ontologies
  35. 35. Automatic – to keep precision and accuracy
  36. 36. Easy to use and to access – to prevent the biomedical community from getting lost
  37. 37. Customizable – to fit very specific needs
  38. 38. Smart – to leverage the knowledge contained in ontologies</li></ul>
  39. 39. Vocabulary<br /><ul><li>Element = a collection of observations resulting from a biomedical experiment or study
  40. 40. a dataset, clinical-trial description, research article, imaging study
  41. 41. Text metadata = the free text that describes or ‘annotates’ an element
  42. 42. Resource = a collection of elements
  43. 43. GEO, PubMed, ClinicalTrials.gov, Guideline.gov, ArrayExpress
  44. 44. Concept = a unique entity (class) in a specific ontology (has a URI)
  45. 45. UMLS CUI or NCBO URI e.g., C0025202, DOID:1909
  46. 46. Term = a string that identifies a given concept (name, synonyms)
  47. 47. Melanoma, Melanomas, Malignant melanoma
  48. 48. Annotation = meta-information on a data element: this element deals with this concept
  49. 49. PMID17984116 deals with C0025202</li></ul>
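The vocabulary above can be sketched as a few small data types. This is only an illustrative model, not the data model used by the NCBO services: the class and field names are assumptions, and the `"UMLS"` ontology label is chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A unique entity (class) in one ontology."""
    ontology: str     # illustrative label, e.g. "UMLS"
    concept_id: str   # e.g. "C0025202" or "DOID:1909"


@dataclass(frozen=True)
class Term:
    """A string (name or synonym) that identifies a concept."""
    text: str
    concept: Concept


@dataclass(frozen=True)
class Annotation:
    """Meta-information: 'this element deals with this concept'."""
    element_id: str   # e.g. a PubMed ID
    concept: Concept


# Values taken from the slide; the structure around them is hypothetical.
melanoma = Concept(ontology="UMLS", concept_id="C0025202")
terms = [Term(t, melanoma) for t in ("Melanoma", "Melanomas", "Malignant melanoma")]
annotation = Annotation(element_id="PMID17984116", concept=melanoma)

print(f"{annotation.element_id} deals with {annotation.concept.concept_id}")
```

Several terms point to the same concept, which is why an annotation stores the concept rather than the matched string.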
  50. 50. Why use ontologies?<br />They structure the knowledge from a domain<br />They specify terms that can be used by natural language processing algorithms to process text<br />They uniquely identify concepts (URI)<br />They specify relations between concepts that can be used for computing concept similarity<br />They define hierarchies allowing abstraction of type<br />They play the role of common denominator for various data from a domain<br />
  51. 51. Why use ontologies?<br />
  52. 52. Why is it a hard problem? (1/2)<br /><ul><li>Identifying concepts from text is a hard task
  53. 53. May involve NLP, stemming, spell-checking, or recognition of morphological variants
  54. 54. Concept disambiguation
  55. 55. Scalability issues
  56. 56. We want to deal with millions of concepts (~4M)
  57. 57. 200+ ontologies in several formats, spread out
  58. 58. Huge biomedical resources e.g., PubMed 17M citations
  59. 59. What to do with annotations when the ontologies and the resources evolve over time
  60. 60. e.g., elements in resources are added
  61. 61. e.g., concepts in ontologies are removed</li></ul>
  62. 62. Why is it a hard problem? (2/2)<br />How to leverage the knowledge contained in ontologies?<br />Process the transitive closure for relations (not trivial for ontologies with 300k concepts)<br />Execute semantic distance algorithms to determine similarity<br />Compute mappings between ontologies to connect ontologies to one another<br />Keep all of this up to date when ontologies evolve<br />e.g., new GO version every day<br />
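To make the transitive-closure step concrete, here is a minimal sketch that walks is_a edges upward with a breadth-first search. The toy graph (`A`, `B`, `C`, `D`) and the function names are assumptions made for illustration; a production system over hundreds of thousands of concepts would more likely pre-compute and store the result in a database.

```python
from collections import deque


def ancestors_with_levels(parents, concept):
    """Return {ancestor: level} for one concept (level 1 = direct parent)."""
    levels = {}
    queue = deque([(concept, 0)])
    while queue:
        node, depth = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in levels:  # first visit is the shortest path
                levels[parent] = depth + 1
                queue.append((parent, depth + 1))
    return levels


def transitive_closure(parents):
    """Compute the ancestor map of every concept that appears in the graph."""
    all_concepts = set(parents) | {p for ps in parents.values() for p in ps}
    return {c: ancestors_with_levels(parents, c) for c in all_concepts}


# Toy is_a graph: A is_a B, A is_a D, B is_a C.
toy_parents = {"A": ["B", "D"], "B": ["C"]}
closure = transitive_closure(toy_parents)
print(closure["A"])
```

Keeping the level alongside each ancestor is a design choice: it lets later steps treat a direct parent differently from a distant ancestor.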
  63. 63. Ontology-based annotation workflow<br />First, direct annotations are created by recognizing concepts in raw text.<br />Second, annotations are semantically expanded using knowledge of the ontologies.<br />Third, all annotations are scored according to the context in which they have been created.<br />
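The third step, scoring by context, could look roughly like the sketch below. The slides do not give the scoring formula, so the weights and the halving rule for is_a levels are purely hypothetical; only the idea that the score depends on how an annotation was created comes from the talk.

```python
from collections import defaultdict

# Hypothetical weights, chosen only for illustration.
CONTEXT_WEIGHTS = {"direct": 10.0, "mapping": 7.0}


def context_weight(context, level=0):
    """Weight of one annotation given the context that produced it."""
    if context == "is_a":
        # Assumed decay: level 1 -> 5.0, level 2 -> 2.5, and so on.
        return 10.0 / (2 ** level)
    return CONTEXT_WEIGHTS[context]


def score(annotations):
    """Sum the weights per concept and rank concepts by total score."""
    totals = defaultdict(float)
    for concept, context, level in annotations:
        totals[concept] += context_weight(context, level)
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# (concept, context, is_a level) triples for one element.
annotations = [
    ("DOID:1909", "direct", 0),
    ("DOID:1909", "direct", 0),
    ("DOID:191", "is_a", 1),
    ("DOID:0000818", "is_a", 2),
    ("FMA/C0025201", "mapping", 0),
]
ranked = score(annotations)
print(ranked)
```

With these assumed weights, a concept found twice directly in the text outranks concepts reached only through expansion.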
  64. 64. Concept recognition (step 1)<br /><ul><li>Uses a dictionary: a list of strings that identifies ontology concepts
  65. 65. 220 ontologies, ~4.2M concepts & ~7.9M terms</li></ul>Uses NCIBI Mgrep, a syntactic concept recognizer<br />High degree of accuracy<br />Fast, scalable<br />Domain independent<br />
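Dictionary-based recognition can be illustrated with a small longest-match lookup. This is not Mgrep and makes no claim about how Mgrep works internally; it is a simplified stand-in, and the tiny dictionary (including the plural "Melanocytes" as a synonym) is an assumption for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionary(terms):
    """Map each term, as a tuple of tokens, to its concept ID."""
    return {tuple(tokenize(text)): concept for text, concept in terms}


def recognize(text, dictionary):
    """Scan left to right, preferring the longest dictionary match."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in dictionary), default=0)
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                hits.append((" ".join(key), dictionary[key]))
                i += n
                break
        else:
            i += 1  # no term starts at this token
    return hits


# Illustrative dictionary; real dictionaries hold millions of terms.
dictionary = build_dictionary([
    ("Melanoma", "DOID:1909"),
    ("Malignant melanoma", "DOID:1909"),
    ("Melanocyte", "NCI/C0025201"),
    ("Melanocytes", "NCI/C0025201"),
])
text = "Melanoma is a malignant tumor of melanocytes which are found predominantly in skin"
hits = recognize(text, dictionary)
print(hits)
```

A purely syntactic matcher like this one does no disambiguation, which is one reason the later steps score annotations by context.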
  66. 66. Semantic expansion (step 2)<br /><ul><li>Uses is_a hierarchies defined by original ontologies
  67. 67. Uses mappings in the UMLS Metathesaurus and NCBO BioPortal
  68. 68. Uses semantic-similarity algorithms based on the is_a graph (ongoing work)
  69. 69. Components available as web services</li></ul>
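The two expansion components listed above (is_a hierarchy and mappings) can be sketched as follows. The function name, the tuple layout, and the `max_level` cut-off are assumptions; the concept identifiers are the ones used in the talk's melanoma example.

```python
def expand(direct, parents, mappings, max_level=2):
    """Derive extra annotations from directly recognized concepts.

    Returns (concept, context, level, source_concept) tuples.
    """
    expanded = []
    for concept in direct:
        # is_a expansion: climb the hierarchy one level at a time.
        seen = {concept}
        frontier = [concept]
        for level in range(1, max_level + 1):
            next_frontier = []
            for child in frontier:
                for parent in parents.get(child, []):
                    if parent not in seen:
                        seen.add(parent)
                        next_frontier.append(parent)
                        expanded.append((parent, "is_a", level, concept))
            frontier = next_frontier
        # Mapping expansion: follow concept-to-concept mappings.
        for mapped in mappings.get(concept, []):
            expanded.append((mapped, "mapping", 0, concept))
    return expanded


# is_a edges and mappings taken from the melanoma example.
parents = {"DOID:1909": ["DOID:191"], "DOID:191": ["DOID:0000818"]}
mappings = {"NCI/C0025201": ["FMA/C0025201"]}
direct = ["NCI/C0025201", "DOID:1909"]

expanded = expand(direct, parents, mappings)
for row in expanded:
    print(row)
```

Recording the source concept and the level with each expanded annotation keeps its provenance, which a scoring step can then use.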
  70. 70. An example<br /><ul><li>“Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye”.
  71. 71. NCI/C0025201, Melanocyte in NCI Thesaurus
  72. 72. 39228/DOID:1909, Melanoma in Human Disease
  73. 73. Is_a closure expansion
  74. 74. 39228/DOID:191, Melanocytic neoplasm, direct parent of Melanoma in Human Disease
  75. 75. 39228/DOID:0000818, cell proliferation disease, grand parent of Melanoma in Human Disease
  76. 76. Mapping expansion
  77. 77. FMA/C0025201, Melanocyte in Foundational Model of Anatomy, concept mapped to NCI/C0025201 in UMLS.</li></ul>Annotation services<br /><ul><li>NCBO Annotator web service
  78. 78. The annotation workflow available as a service
  79. 79. Automatically process a piece of raw text to annotate it with relevant ontology concepts and return the annotations to the user
  80. 80. NCBO Biomedical resources index
  81. 81. We have used the annotation workflow to annotate some common resources (gene expression data, clinical trials, articles) and index them by concepts
  82. 82. Semantic expansion components
  83. 83. http://bioportal.bioontology.org/</li></ul>
  84. 84. NCBO Annotator web service<br />http://bioportal.bioontology.org/annotator<br /><ul><li>Semantic web vision
  85. 85. Ontology-based service
  86. 86. Semantically described results
  87. 87. OWL ontology that formalizes the service model
  88. 88. Higher expressivity (e.g., SWRL rules)
  89. 89. Annotations returned to the users in OWL as instances
  90. 90. Software agents and semantic web technology stack (e.g., SPARQL)
  91. 91. What distinguishes the service?
  92. 92. Can be integrated in current workflow (service-oriented approach)
  93. 93. Uses semantic expansion
  94. 94. Creates annotations for both UMLS and NCBO ontologies</li></ul>[AMIA STB 09]<br />[BMC BioInformatics 09]<br />
  95. 95. NCBO Annotator in BioPortal<br />
  96. 96. NCBO Biomedical Resources index<br />http://bioportal.bioontology.org/resources<br /><ul><li>We have used the workflow to index several important biomedical resources with ontology concepts (22+)
  97. 97. The index can be used to enhance search & data integration</li></ul>[DILS 08]<br />[BMC BioInformatics 09]<br />[IC 10]<br />
  98. 98. Ex: annotation of a GEO element<br />
  99. 99. Ex: search of a GEO element<br />
  100. 100. OBR results available in NCBO BioPortal<br />Example of resource available (name and description)<br />Number of annotations in the OBR index<br />Ontology concept/term browsed<br />Title and URL link to the original element<br />Context in which an element has been annotated<br />ID of an element<br />
  101. 101. Good use of the semantics (1/2)<br /><ul><li>Simple keyword-based search misses results</li></ul>
  102. 102. Good use of the semantics (2/2)<br />
  103. 103. First evaluation<br /><ul><li>Mgrep vs. MetaMap evaluation
  104. 104. Higher precision & faster
  105. 105. Not limited to UMLS terminologies
  106. 106. 22 resources in the annotation index
  107. 107. Between 99% and 100% of the processed elements are annotated
  108. 108. The number of annotating concepts ranges from 359 to 769
  109. 109. Functional comparative evaluation with online queries
  110. 110. Superior average number of results (i.e., resource elements)
  111. 111. Higher number of terms for which we return results. Significant improvement in the case of AE or GEO.</li></ul>[BMC BioInformatics 09]<br />
  112. 112. Technical details<br />Workflow and pre-computation of the data<br />Java, JDBC & MySQL (prototypes)<br />Java, Spring/Hibernate & MySQL (production)<br />Services deployed as REST web services<br />Tomcat & RestLet<br />
  113. 113. Users…<br />Ontology-based services (OBS)<br />NCBO Biomedical Resources index service<br />NCBO Annotator web service<br />BioPortal services<br />UMLS services<br />UCSF<br />Laboratree<br />CollabRx<br />UCHSC<br />PharmGKB, JAX<br />HGMD<br />BioPortal UI<br />PDB/PLoS<br />I2B2<br />NextBio<br />IO informatics<br />“Resources” tab<br />Knewco<br />CaNanoLab<br />
  114. 114. …and use cases<br /><ul><li>For concept recognition from text
  115. 115. Decide which clinical trials are relevant for a particular patient. Use the annotator service to map clinical-trial eligibility criteria to concepts from UMLS
  116. 116. For accelerating curation
  117. 117. Use concepts recognized in the abstracts of publications to triage papers for curation
  118. 118. For structuring Web data
  119. 119. Ensure that any textual annotation created in Laboratree also has corresponding ontology-based annotations
  120. 120. For mining gene expression data
  121. 121. http://gminer.mcw.edu/</li></ul>
  122. 122. Conclusion<br /><ul><li>Enabling data integration and translational discoveries requires scalable data annotation using ontologies
  123. 123. Semantic annotations can bridge the gap between resources and ontologies
  124. 124. Ontology-based annotations are essential for new semantic services on the Web
  125. 125. Our annotation workflow offers high-throughput, semantic annotation
  126. 126. Offers an automated, service-oriented vision
  127. 127. Leverages the knowledge contained in ontologies
  128. 128. Please try it and join us!</li></ul>
  129. 129. Future work<br /><ul><li>Enhanced concept recognition
  130. 130. NLP and text mining techniques
  131. 131. to enhance the recognition of concepts
  132. 132. to avoid noise
  133. 133. Recognizing relations
  134. 134. Enhanced semantic expansion
  135. 135. Extracting a similarity measure from the annotation index
  136. 136. Composing semantic expansion components on demand
  137. 137. Methods for annotation evolution
  138. 138. Allowing users to update the annotations previously done by the service
  139. 139. Updating the annotations in the index as metadata change
  140. 140. Parameterization of the scoring algorithm on demand</li></ul>
  141. 141. Credits and collaborators<br /><ul><li>@ NCBO, Stanford University
  142. 142. Dr. Nigam H. Shah M.D., PhD initiator of the project & biomedical expert
  143. 143. Pr. Mark A. Musen NCBO PI
  144. 144. Dr. Adrien Coulet postdoc, new resources within resource index & extraction
  145. 145. Cherie Youn Software developer & production releases
  146. 146. Nipun Bhatia CS student Mgrep evaluation
  147. 147. @ Univ. of Victoria
  148. 148. Chris Callendar Software developer (UI)
  149. 149. @ NCIBI, Univ. of Michigan
  150. 150. Manhong Dai Mgrep designers & developers
  151. 151. Dr. Fan Meng</li></ul>
  152. 152. Thank you<br />National Center for BioMedical Ontology<br />http://www.bioontology.org<br />BioPortal, biomedical ontology repository<br />http://bioportal.bioontology.org<br />Contact me<br />jonquet@stanford.edu<br />
