Semantic annotation of biomedical data<br />Clement Jonquet<br />jonquet@stanford.edu<br />INRIA - EXMO seminar - March 24th, 2010<br />
Presentation about semantic annotation of biomedical data. Presented at LIRMM, INRIA and other venues between 2008 and 2010.

  • Very interesting, this looks similar to what I'm attempting for my MSc thesis, however in my case it's on a MUCH smaller scale :). I am currently looking at implementing an ontology-based semantic similarity measure for 'expansion' of semantic annotations. See my blog for more (unstructured) information: http://graus.nu/category/thesis/

    Can I find more information/publications about this particular project?
  • Let’s try to understand the context of this work and what we mean by semantic annotation.
  • Ontology-based annotation is not widespread, possibly because of: the lack of a one-stop shop for bio-ontologies; the lack of tools to annotate datasets (manual annotation will not scale; can automatic annotation be ‘good enough’?); the lack of a sustainable mechanism to create ontology-based annotations.
  • They structure the knowledge from a domain. They specify terms that can be used by natural language processing algorithms to process text. They uniquely identify concepts (URI). They specify relations between concepts that can be used for computing concept similarity. They define hierarchies allowing abstraction of type. They play the role of common denominator for various data from a domain.
  • Uses a dictionary (or lexicon): a list of strings that identifies ontology concepts. It is constructed by accessing ontologies and pooling all concept names or other string forms (synonyms, labels) that syntactically identify concepts. We use Mgrep, a syntactic concept recognizer developed by University of Michigan – NCIBI. It has a very high degree of accuracy (over 95% in recognizing disease names) and is fast, scalable and domain independent. Another AMIA STB 2009 presentation (tomorrow, 1:50pm) covers the Mgrep vs. MetaMap evaluation: higher precision & faster, and not limited to UMLS terminologies.
  • Performing a search of GEO using OBR. A user searching for “melanoma” in BioPortal is able to view the set of online data resources that have been annotated with the ontology terms related to this query. The GEO element “melanoma progression” is returned as a pertinent element for this search. (Note: In the current version, we have dealt only with element titles and descriptions to validate the notion of context awareness. Later, we will process more of the metadata structure to enable a finer-grained level of detail.) The display within BioPortal allows the user to view the original data set with a single click.
  • Specific evaluation with external users in progress: Center for Clinical and Translational Informatics; Jackson Lab; Univ. of Indiana (research management system).
  • Semantic annotation of biomedical data

    1. 1. Semantic annotation of biomedical data<br />Clement Jonquet<br />jonquet@stanford.edu<br />INRIA - EXMO seminar - March 24th, 2010<br />
    2. 2. Speech overview<br /><ul><li>Introduction: semantic annotation, semantic web, biomedical context, the challenge
    3. 3. Ontology-based annotation workflow: concept recognition, semantic expansion, why it’s hard?
    4. 4. Annotation services: the NCBO Annotator web service, the NCBO biomedical resources index
    5. 5. Users & use cases
    6. 6. Conclusion and future work</li></ul>
    7. 7. Annotation & semantic web<br /><ul><li>Part of the vision for the semantic web
    8. 8. Web content must be semantically described using ontologies
    9. 9. Semantic annotations help to structure the web
    10. 10. Annotation is not an easy task
    11. 11. Automatic vs. manual
    12. 12. Lack of annotation tools (convenient, simple to use and easily integrated into automatic processes)
    13. 13. Today’s web content (& public data available through the web) is mainly composed of unstructured text</li></ul>
    14. 14. Annotation is not a common practice<br /><ul><li>High number of ontologies
    15. 15. Getting access to all is hard: formats, locations, APIs
    16. 16. Lack of tools that easily access all ontologies (domain)
    17. 17. Users do not always know the structure of an ontology’s content or how to use it in order to do the annotations themselves
    18. 18. Lack of tools to do the annotations automatically
    19. 19. Boring additional task without immediate reward for the user</li></ul>
    20. 20. Biomedical context<br /><ul><li>Explosion of publicly available biomedical data
    21. 21. Very diverse and growing very fast
    22. 22. Most of the data are unstructured and rarely described with ontology concepts available in the domains
    23. 23. Hard for biomedical researchers to find the data they need
    24. 24. Data integration problem
    25. 25. Translational discoveries are prevented
    26. 26. Good examples of the use of ontologies and terminologies for annotation
    27. 27. Gene Ontology annotations
    28. 28. PubMed (biomedical literature) indexed with MeSH headings
    29. 29. Limitations
    30. 30. UMLS only, almost nothing for OBO & OWL ontologies
    31. 31. Manual approaches, curators (scalability?)
    32. 32. Automatic approaches (usability & accuracy?)</li></ul>
    33. 33. The challenge<br /><ul><li>Automatically process a piece of raw text to annotate it with relevant ontologies
    34. 34. Large scale – to scale up for many resources and ontologies
    35. 35. Automatic – to keep precision and accuracy
    36. 36. Easy to use and to access – to prevent the biomedical community from getting lost
    37. 37. Customizable – to fit very specific needs
    38. 38. Smart – to leverage the knowledge contained in ontologies</li></ul>
    39. 39. Vocabulary<br /><ul><li>Element = a collection of observations resulting from a biomedical experiment or study
    40. 40. a dataset, clinical-trial description, research article, imaging study
    41. 41. Text metadata = the free text that describes or ‘annotates’ an element
    42. 42. Resource = a collection of elements
    43. 43. GEO, PubMed, ClinicalTrials.gov, Guideline.gov, ArrayExpress
    44. 44. Concept = a unique entity (class) in a specific ontology (has a URI)
    45. 45. UMLS CUI or NCBO URI e.g., C0025202, DOID:1909
    46. 46. Term = a string that identifies a given concept (name, synonyms)
    47. 47. Melanoma, Melanomas, Malignant melanoma
    48. 48. Annotation = meta-information about an element: this element deals with this concept
    49. 49. PMID17984116 deals with C0025202</li></ul>
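The vocabulary above can be read as a small data model. The following is a minimal sketch, not the project's actual schema: the class and field names are assumptions chosen to mirror the slide's definitions, and the text metadata is a placeholder string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A unique entity (class) in a specific ontology."""
    concept_id: str   # e.g. the UMLS CUI "C0025202"
    terms: tuple      # strings that identify the concept (name, synonyms)


@dataclass(frozen=True)
class Element:
    """One item of a resource, described by free-text metadata."""
    resource: str       # the collection the element belongs to, e.g. "PubMed"
    element_id: str
    text_metadata: str  # free text that describes the element


@dataclass(frozen=True)
class Annotation:
    """States that an element deals with a concept."""
    element_id: str
    concept_id: str


# Values taken from the slide; the metadata text is a placeholder.
melanoma = Concept("C0025202", ("Melanoma", "Melanomas", "Malignant melanoma"))
article = Element("PubMed", "PMID17984116", "placeholder for title and abstract")
annotation = Annotation(article.element_id, melanoma.concept_id)

print(f"{annotation.element_id} deals with {annotation.concept_id}")
```

Keeping terms separate from the concept identifier reflects the distinction the slide draws: several strings may point to one concept, and an annotation refers to the concept, never to a string.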
    50. 50. Why use ontologies?<br />They structure the knowledge from a domain<br />They specify terms that can be used by natural language processing algorithms to process text<br />They uniquely identify concepts (URI)<br />They specify relations between concepts that can be used for computing concept similarity<br />They define hierarchies allowing abstraction of type<br />They play the role of common denominator for various data from a domain<br />
    51. 51. Why use ontologies?<br />
    52. 52. Why is it a hard problem? (1/2)<br /><ul><li>Identifying concepts in text is a hard task
    53. 53. May involve NLP, stemming, spell-checking, or recognition of morphological variants
    54. 54. Concept disambiguation
    55. 55. Scalability issues
    56. 56. We want to deal with millions of concepts (~4M)
    57. 57. 200+ ontologies in several formats, spread across many locations
    58. 58. Huge biomedical resources e.g., PubMed 17M citations
    59. 59. What to do with annotations when the ontologies and the resources evolve over time?
    60. 60. e.g., elements in resources are added
    61. 61. e.g., concepts in ontologies are removed</li></ul>
    62. 62. Why is it a hard problem? (2/2)<br />How to leverage the knowledge contained in ontologies?<br />Process the transitive closure for relations (not trivial for ontologies with 300k concepts)<br />Execute semantic distance algorithms to determine similarity<br />Compute mappings between ontologies to connect ontologies to one another<br />Keep all of this up to date when ontologies evolve<br />e.g., a new GO version every day<br />
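To illustrate what processing the transitive closure involves, here is a minimal sketch that maps every concept to the set of all its ancestors. It assumes the hierarchy fits in memory as a dictionary of direct parents; the concept ids are made up, and this is not the pre-computation the project actually used.

```python
def transitive_closure(parents):
    """Map every concept to the set of all its is_a ancestors.

    `parents` maps a concept id to the ids of its direct parents.
    Results are memoized so each concept is expanded only once.
    """
    closure = {}

    def ancestors(concept):
        if concept in closure:
            return closure[concept]
        closure[concept] = set()  # placeholder that stops recursion on cycles
        result = set()
        for parent in parents.get(concept, ()):
            result.add(parent)
            result |= ancestors(parent)
        closure[concept] = result
        return result

    for concept in parents:
        ancestors(concept)
    return closure


# Hypothetical toy hierarchy: D has two parents that share the root A.
toy_parents = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
closure = transitive_closure(toy_parents)
print(sorted(closure["D"]))
```

Even with memoization the stored closure can be far larger than the ontology itself, because every concept keeps its full ancestor set. That growth is one reason the slide calls the step non-trivial for ontologies with hundreds of thousands of concepts.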
    63. 63. Ontology-based annotation workflow<br />First, direct annotations are created by recognizing concepts in raw text.<br />Second, annotations are semantically expanded using knowledge of the ontologies.<br />Third, all annotations are scored according to the context in which they have been created.<br />
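The third step, scoring by context, might look roughly like the sketch below. The context names and every weight are hypothetical values chosen for illustration; the talk does not give the actual scoring scheme. The only idea taken from the slide is that the score depends on how an annotation was created.

```python
from collections import defaultdict

# Hypothetical weights: direct matches count more than expanded ones.
CONTEXT_WEIGHTS = {
    "direct_preferred_name": 10,
    "direct_synonym": 8,
    "mapping": 7,
}


def weight(context, level=0):
    """Weight of one annotation; is_a expansions decay with the level."""
    if context == "is_a":
        return max(1, 5 - level)
    return CONTEXT_WEIGHTS[context]


def score(annotations):
    """Sum the weights per concept and rank concepts by total score."""
    totals = defaultdict(int)
    for concept_id, context, level in annotations:
        totals[concept_id] += weight(context, level)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Concept ids come from the deck's melanoma example; the contexts are invented.
example = [
    ("DOID:1909", "direct_preferred_name", 0),
    ("DOID:1909", "direct_synonym", 0),
    ("DOID:191", "is_a", 1),
    ("DOID:0000818", "is_a", 2),
]
ranking = score(example)
print(ranking)
```

Summing per concept means a concept recognized several times in the same text ranks above one that was only reached through expansion.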
    64. 64. Concept recognition (step 1)<br /><ul><li>Uses a dictionary: a list of strings that identifies ontology concepts
    65. 65. 220 ontologies, ~4.2M concepts & ~7.9M terms</li></ul>Uses NCIBI Mgrep, a syntactic concept recognizer<br />High degree of accuracy<br />Fast, scalable<br />Domain independent<br />
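A dictionary-based recognizer can be sketched in a few lines. This is a naive stand-in for illustration only, not Mgrep: it matches exact lowercase word sequences, with no stemming or spell-checking. The function names are assumptions, and the plural "Melanocytes" is added as a term by assumption so that the exact match succeeds.

```python
import re


def build_dictionary(concepts):
    """Pool all terms into a lowercase lookup: term -> set of concept ids."""
    dictionary = {}
    for concept_id, terms in concepts.items():
        for term in terms:
            dictionary.setdefault(term.lower(), set()).add(concept_id)
    return dictionary


def recognize(text, dictionary, max_words=3):
    """Return (start, end, term, concept_id) for each dictionary term in text."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    hits = []
    for i in range(len(tokens)):
        for n in range(1, max_words + 1):
            window = tokens[i:i + n]
            if len(window) < n:
                break  # ran past the end of the text
            candidate = " ".join(token for token, _, _ in window)
            for concept_id in sorted(dictionary.get(candidate, ())):
                hits.append((window[0][1], window[-1][2], candidate, concept_id))
    return hits


# Terms and ids follow the deck's example; "Melanocytes" is an assumed synonym.
concepts = {
    "DOID:1909": ["Melanoma", "Melanomas", "Malignant melanoma"],
    "NCI/C0025201": ["Melanocyte", "Melanocytes"],
}
text = "Melanoma is a malignant tumor of melanocytes which are found predominantly in skin"
hits = recognize(text, build_dictionary(concepts))
for start, end, term, concept_id in hits:
    print(start, end, term, concept_id)
```

Each hit keeps its character offsets so that the context of a direct annotation, that is, where in the text it was found, stays available to later steps.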
    66. 66. Semantic expansion (step 2)<br /><ul><li>Uses is_a hierarchies defined by original ontologies
    67. 67. Uses mappings in the UMLS Metathesaurus and NCBO BioPortal
    68. 68. Uses semantic-similarity algorithms based on the is_a graph (ongoing work)
    69. 69. Components available as web services</li></ul>
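The first two expansion components can be sketched as follows. This is an illustrative reading of the slide, not the service's implementation: function names and the tuple layout are assumptions, and the small hierarchy and mapping reuse the identifiers from the deck's melanoma example, assuming DOID:0000818 is the direct parent of DOID:191.

```python
from collections import deque


def expand_is_a(concept_id, parents, max_level=None):
    """Walk the is_a hierarchy upward; return (ancestor, level) pairs."""
    seen = {concept_id}
    queue = deque([(concept_id, 0)])
    result = []
    while queue:
        current, level = queue.popleft()
        if max_level is not None and level >= max_level:
            continue  # do not climb above the requested level
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                result.append((parent, level + 1))
                queue.append((parent, level + 1))
    return result


def expand(direct_concepts, parents, mappings, max_level=None):
    """Return (concept_id, context, level) for direct and expanded annotations."""
    annotations = [(concept, "direct", 0) for concept in direct_concepts]
    for concept in direct_concepts:
        for ancestor, level in expand_is_a(concept, parents, max_level):
            annotations.append((ancestor, "is_a", level))
        for mapped in sorted(mappings.get(concept, ())):
            annotations.append((mapped, "mapping", 0))
    return annotations


# Identifiers from the deck's example slide; the dictionaries are a toy extract.
parents = {"DOID:1909": ["DOID:191"], "DOID:191": ["DOID:0000818"]}
mappings = {"NCI/C0025201": ["FMA/C0025201"]}
expanded = expand(["DOID:1909", "NCI/C0025201"], parents, mappings)
for row in expanded:
    print(row)
```

Recording the context and level with every expanded annotation keeps enough provenance for a later scoring step to treat a grandparent differently from a direct match.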
    70. 70. An example<br /><ul><li>“Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye”.
    71. 71. NCI/C0025201, Melanocyte in NCI Thesaurus
    72. 72. 39228/DOID:1909, Melanoma in Human Disease
    73. 73. Is_a closure expansion
    74. 74. 39228/DOID:191, Melanocytic neoplasm, direct parent of Melanoma in Human Disease
    75. 75. 39228/DOID:0000818, cell proliferation disease, grandparent of Melanoma in Human Disease
    76. 76. Mapping expansion
    77. 77. FMA/C0025201, Melanocyte in Foundational Model of Anatomy, concept mapped to NCI/C0025201 in UMLS.</li></ul>
    Annotation services<br /><ul><li>NCBO Annotator web service
    78. 78. The annotation workflow available as a service
    79. 79. Automatically process a piece of raw text to annotate it with relevant ontology concepts and return the annotations to the user
    80. 80. NCBO Biomedical resources index
    81. 81. We have used the annotation workflow to annotate some common resources (gene expression data, clinical trials, articles) and index them by concept
    82. 82. Semantic expansion components
    83. 83. http://bioportal.bioontology.org/</li></ul>
    84. 84. NCBO Annotator web service<br />http://bioportal.bioontology.org/annotator<br /><ul><li>Semantic web vision
    85. 85. Ontology-based service
    86. 86. Semantically described results
    87. 87. OWL ontology that formalizes the service model
    88. 88. Higher expressivity (e.g., SWRL rules)
    89. 89. Annotations returned to the users in OWL as instances
    90. 90. Software agents and semantic web technology stack (e.g., SPARQL)
    91. 91. What distinguishes the service?
    92. 92. Can be integrated in existing workflows (service-oriented approach)
    93. 93. Uses semantic expansion
    94. 94. Creates annotations for both UMLS and NCBO ontologies</li></ul>[AMIA STB 09]<br />[BMC BioInformatics 09]<br />
    95. 95. NCBO Annotator in BioPortal<br />
    96. 96. NCBO Biomedical Resources index<br />http://bioportal.bioontology.org/resources<br /><ul><li>We have used the workflow to index several important biomedical resources with ontology concepts (22+)
    97. 97. The index can be used to enhance search & data integration</li></ul>[DILS 08]<br />[BMC BioInformatics 09]<br />[IC 10]<br />
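How such an index can enhance search is easiest to see on a toy case. The sketch below is an assumption-laden illustration, not the OBR implementation: it indexes each element under its directly annotated concepts and under all their is_a ancestors, so a query on a parent concept also returns elements annotated with a child concept. Element ids are hypothetical; concept ids reuse the deck's melanoma example.

```python
from collections import defaultdict


def build_index(element_annotations, parents):
    """Index elements by each annotated concept and all its is_a ancestors."""
    index = defaultdict(set)
    for element_id, concepts in element_annotations.items():
        for concept in concepts:
            stack, seen = [concept], set()
            while stack:
                current = stack.pop()
                if current in seen:
                    continue
                seen.add(current)
                index[current].add(element_id)
                stack.extend(parents.get(current, ()))
    return index


# Hypothetical element ids; concept ids from the deck's example.
parents = {"DOID:1909": ["DOID:191"], "DOID:191": ["DOID:0000818"]}
element_annotations = {
    "element-A": ["DOID:1909"],  # annotated with Melanoma
    "element-B": ["DOID:191"],   # annotated with Melanocytic neoplasm
}
index = build_index(element_annotations, parents)
print(sorted(index["DOID:191"]))
```

A query for Melanocytic neoplasm (DOID:191) returns both elements, although the text of element-A may only ever mention melanoma. A plain keyword search on the parent's name would miss that element.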
    98. 98. Ex: annotation of a GEO element<br />
    99. 99. Ex: search of a GEO element<br />
    100. 100. OBR results available in NCBO BioPortal<br />Example of resource available (name and description)<br />Number of annotations in the OBR index<br />Ontology concept/term browsed<br />Title and URL link to the original element<br />Context in which an element has been annotated<br />ID of an element<br />
    101. 101. Good use of the semantics (1/2)<br /><ul><li>Simple keyword-based search misses results</li></ul>
    102. 102. Good use of the semantics (2/2)<br />
    103. 103. First evaluation<br /><ul><li>Mgrep vs. MetaMap evaluation
    104. 104. Higher precision & faster
    105. 105. Not limited to UMLS terminologies
    106. 106. 22 resources in the annotation index
    107. 107. Between 99% and 100% of the processed elements are annotated
    108. 108. The number of annotating concepts ranges from 359 to 769
    109. 109. Functional comparative evaluation with online queries
    110. 110. Superior average number of results (i.e., resource elements)
    111. 111. Higher number of terms for which we return results. Significant improvement in the case of AE or GEO.</li></ul>[BMC BioInformatics 09]<br />
    112. 112. Technical details<br />Workflow and pre-computation of the data<br />Java, JDBC & MySQL (prototypes)<br />Java, Spring/Hibernate & MySQL (production)<br />Services deployed as REST web services<br />Tomcat & RestLet<br />
    113. 113. Users…<br />Ontology-based services (OBS)<br />NCBO Biomedical Resources index service<br />NCBO Annotator web service<br />BioPortal services<br />UMLS services<br />UCSF<br />Laboratree<br />CollabRx<br />UCHSC<br />PharmGKB, JAX<br />HGMD<br />BioPortal UI<br />PDB/PLoS<br />I2B2<br />NextBio<br />IO informatics<br />“Resources” tab<br />Knewco<br />IO informatics<br />CaNanoLab<br />
    114. 114. …and use cases<br /><ul><li>For concept recognition from text
    115. 115. Decide which clinical trials are relevant for a particular patient. Use the annotator service to map clinical-trial eligibility criteria to concepts from UMLS
    116. 116. For accelerating curation
    117. 117. Use concepts recognized in the abstracts of publications to triage papers for curation
    118. 118. For structuring Web data
    119. 119. Ensure that any textual annotation created in Laboratree also has corresponding ontology-based annotations
    120. 120. For mining gene expression data
    121. 121. http://gminer.mcw.edu/</li></ul>
    122. 122. Conclusion<br /><ul><li>Enabling data integration and translational discoveries requires scalable data annotation using ontologies
    123. 123. Semantic annotations can bridge the gap between resources and ontologies
    124. 124. Ontology-based annotations are essential for new semantic services on the Web
    125. 125. Our annotation workflow offers high-throughput, semantic annotation
    126. 126. Offers an automated, service-oriented vision
    127. 127. Leverages the knowledge contained in ontologies
    128. 128. Please try it and join us!</li></ul>
    129. 129. Future work<br /><ul><li>Enhanced concept recognition
    130. 130. NLP and text mining techniques
    131. 131. to enhance the recognition of concepts
    132. 132. to avoid noise
    133. 133. Recognizing relations
    134. 134. Enhanced semantic expansion
    135. 135. Extracting a similarity measure from the annotation index
    136. 136. Composing semantic expansion components on demand
    137. 137. Methods for annotation evolution
    138. 138. Allowing users to update the annotations previously done by the service
    139. 139. Updating the annotations in the index as metadata change
    140. 140. Parameterization of the scoring algorithm on demand</li></ul>
    141. 141. Credits and collaborators<br /><ul><li>@ NCBO, Stanford University
    142. 142. Dr. Nigam H. Shah, M.D., PhD: initiator of the project & biomedical expert
    143. 143. Prof. Mark A. Musen: NCBO PI
    144. 144. Dr. Adrien Coulet: postdoc, new resources within resource index & extraction
    145. 145. Cherie Youn: software developer & production releases
    146. 146. Nipun Bhatia: CS student, Mgrep evaluation
    147. 147. @ Univ. of Victoria
    148. 148. Chris Callendar: software developer (UI)
    149. 149. @ NCIBI, Univ. of Michigan
    150. 150. Manhong Dai: Mgrep designer & developer
    151. 151. Dr. Fan Meng: Mgrep designer & developer</li></ul>
    152. 152. Thank you<br />National Center for BioMedical Ontology<br />http://www.bioontology.org<br />BioPortal, biomedical ontology repository<br />http://bioportal.bioontology.org<br />Contact me<br />jonquet@stanford.edu<br />