Semantic annotation of biomedical data
Presentation about semantic annotation of biomedical data. Presented at LIRMM, INRIA and others between 2008 and 2010.

Speaker notes
  • Let’s try to understand the context of this work and what we mean by semantic annotation.
  • Ontology-based annotation is not widespread, possibly because of:
    • Lack of a one-stop shop for bio-ontologies
    • Lack of tools to annotate datasets
    • Manual annotation will not scale
    • Automatic annotation: can it be ‘good enough’?
    • Lack of a sustainable mechanism to create ontology-based annotations
  • Uses a dictionary (or lexicon): a list of strings that identifies ontology concepts
    • Constructed by accessing ontologies and pooling all concept names or other string forms (synonyms, labels) that syntactically identify concepts
    • We use Mgrep, a syntactic concept recognizer developed by the University of Michigan (NCIBI)
    • Very high degree of accuracy (over 95% in recognizing disease names); fast, scalable, domain independent
    • Another AMIA STB 2009 presentation (tomorrow, 1:50pm): Mgrep vs. MetaMap evaluation; higher precision and faster; not limited to UMLS terminologies
  • Performing a search of GEO using OBR. A user searching for “melanoma” in BioPortal is able to view the set of online data resources that have been annotated with the ontology terms related to this query. The GEO element “melanoma progression” is returned as a pertinent element for this search. (Note: in the current version, we have dealt only with element titles and descriptions to validate the notion of context awareness. Later, we will process more of the metadata structure to enable a finer-grained level of detail.) The display within BioPortal allows the user to view the original dataset with a single click.
  • Specific evaluation with external users in progress: Center for Clinical and Translational Informatics, Jackson Lab, Univ. of Indiana (research management system)
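The dictionary-based concept recognition described in these notes can be sketched as a longest-match lookup over pooled term strings. This toy version is ours, not Mgrep: the dictionary entries, the whitespace tokenization, and the 3-word matching window are illustrative.

```python
# Toy dictionary-based concept recognizer in the spirit described above:
# the dictionary maps term strings (names, synonyms) to concept IDs
# pooled from ontologies. Entries here are illustrative.
DICTIONARY = {
    "melanoma": "DOID:1909",
    "malignant melanoma": "DOID:1909",
    "melanocyte": "NCI/C0025201",
}

def recognize(text):
    """Return (term, concept_id, word_offset) triples, longest match first."""
    words = text.lower().split()  # naive tokenization; no stemming or NLP
    annotations = []
    i = 0
    while i < len(words):
        match = None
        # Try the longest candidate span first (up to 3 words here).
        for length in range(min(3, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + length])
            if candidate in DICTIONARY:
                match = (candidate, DICTIONARY[candidate], i)
                i += length
                break
        if match:
            annotations.append(match)
        else:
            i += 1
    return annotations

print(recognize("Malignant melanoma is a tumor of melanocytes"))
```

Note that the plural “melanocytes” is not matched: an exact-string lookup only finds forms pooled into the dictionary, which is why synonyms and labels matter.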

Transcript

  • 1. Semantic annotation of biomedical data
    Clement Jonquet
    jonquet@stanford.edu
    INRIA - EXMO seminar - March 24th, 2010
  • 2. Talk overview
    • Introduction: semantic annotation, semantic web, biomedical context, the challenge
    • 3. Ontology-based annotation workflow: concept recognition, semantic expansion, why is it hard?
    • 4. Annotation services: the NCBO Annotator web service, the NCBO biomedical resources index
    • 5. Users & use cases
    • 6. Conclusion and future work
  • 7. Annotation & semantic web
    • Part of the vision for the semantic web
    • 8. Web content must be semantically described using ontologies
    • 9. Semantic annotations help to structure the web
    • 10. Annotation is not an easy task
    • 11. Automatic vs. manual
    • 12. Lack of annotation tools (convenient, simple to use and easily integrated into automatic processes)
    • 13. Today’s web content (& public data available through the web) mainly composed of unstructured text
  • 14. Annotation is not a common practice
    • High number of ontologies
    • 15. Getting access to all is hard: formats, locations, APIs
    • 16. Lack of tools that easily access all ontologies (domain)
    • 17. Users do not always know the structure of an ontology’s content or how to use it in order to do the annotations themselves
    • 18. Lack of tools to do the annotations automatically
    • 19. Boring additional task without immediate reward for the user
  • 20. Biomedical context
    • Explosion of publicly available biomedical data
    • 21. Very diverse, growing very fast
    • 22. Most of the data are unstructured and rarely described with ontology concepts available in the domains
    • 23. Hard for biomedical researchers to find the data they need
    • 24. Data integration problem
    • 25. Translational discoveries are prevented
    • 26. Good example of use of ontologies and terminologies for annotations
    • 27. Gene Ontology annotations
    • 28. PubMed (biomedical literature) indexed with Mesh headings
    • 29. Limitations
    • 30. UMLS only, almost nothing for OBO & OWL ontologies
    • 31. Manual approaches, curators (scalability?)
    • 32. Automatic approaches (usability & accuracy?)
  • 33. The challenge
    • Automatically process a piece of raw text to annotate it with relevant ontologies
    • 34. Large scale – to scale up for many resources and ontologies
    • 35. Automatic – to keep precision and accuracy
    • 36. Easy to use and to access – to prevent the biomedical community from getting lost
    • 37. Customizable – to fit very specific needs
    • 38. Smart – to leverage the knowledge contained in ontologies
  • 39. Vocabulary
    • Element = a collection of observations resulting from a biomedical experiment or study
    • 40. a dataset, clinical-trial description, research article, imaging study
    • 41. Text metadata = the set of free text that describe or ‘annotate’ an element
    • 42. Resource = a collection of elements
    • 43. GEO, PubMed, ClinicalTrial.gov, Guideline.gov, ArrayExpress
    • 44. Concept = a unique entity (class) in a specific ontology (has a URI)
    • 45. UMLS CUI or NCBO URI e.g., C0025202, DOID:1909
    • 46. Term = a string that identifies a given concept (name, synonyms)
    • 47. Melanoma, Melanomas, Malignant melanoma
    • 48. Annotation = meta-information about a data element: this element deals with this concept
    • 49. PMID17984116 deals with C0025202
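The vocabulary above maps naturally onto a small data model. A sketch with field names of our choosing, not the NCBO schema:

```python
from dataclasses import dataclass

# Minimal data model for the vocabulary slide; field names are ours.
@dataclass
class Concept:
    uri: str          # e.g. "DOID:1909"
    terms: list       # name + synonyms, e.g. ["Melanoma", "Melanomas"]

@dataclass
class Element:
    id: str           # e.g. a PubMed ID
    text_metadata: str  # free text describing the element

@dataclass
class Annotation:
    element_id: str
    concept_uri: str  # "this element deals with this concept"

melanoma = Concept("DOID:1909", ["Melanoma", "Melanomas", "Malignant melanoma"])
article = Element("PMID17984116", "A study of melanoma progression")
ann = Annotation(article.id, melanoma.uri)
print(ann)
```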
  • 50. Why use ontologies?
    They structure the knowledge from a domain
    They specify terms that can be used by natural language processing algorithms to process text
    They uniquely identify concept (URI)
    They specify relations between concepts that can be used for computing concept similarity
    They define hierarchies allowing abstraction of type
    They play the role of common denominator for various data from a domain
  • 51. Why use ontologies?
  • 52. Why is it a hard problem? (1/2)
    • Identifying concepts in text is a hard task
    • 53. May involve NLP, stemming, spell-checking, or recognition of morphological variants
    • 54. Concept disambiguation
    • 55. Scalability issues
    • 56. We want to deal with millions of concepts (~4M)
    • 57. 200+ ontologies in several formats, spread out across locations
    • 58. Huge biomedical resources, e.g., PubMed’s 17M citations
    • 59. What to do with annotations when the ontologies and the resources evolve over time?
    • 60. e.g., elements in resources are added
    • 61. e.g., concepts in ontologies are removed
  • 62. Why is it a hard problem? (2/2)
    How to leverage the knowledge contained in ontologies?
    Process the transitive closure for relations (not trivial for ontologies with 300k concepts)
    Execute semantic distance algorithms to determine similarity
    Compute mappings between ontologies to connect ontologies to one another
    Keep all of this up to date when ontologies evolve
    e.g., a new GO version every day
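Computing the transitive closure of is_a relations, flagged above as non-trivial at 300k-concept scale, is at heart a graph traversal. A minimal sketch over a toy hierarchy (the edges are ours, borrowed from the melanoma example later in the talk):

```python
from collections import deque

# Toy is_a graph: child -> list of direct parents. IDs follow the Human
# Disease ontology style; the fragment is illustrative.
IS_A = {
    "DOID:1909": ["DOID:191"],     # Melanoma is_a Melanocytic neoplasm
    "DOID:191": ["DOID:0000818"],  # ... is_a cell proliferation disease
}

def ancestors(concept):
    """Transitive closure upward: every concept reachable via is_a edges."""
    seen, queue = set(), deque(IS_A.get(concept, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(IS_A.get(parent, []))
    return seen

print(ancestors("DOID:1909"))
```

At real scale the closure is precomputed and stored rather than recomputed per query, which is why ontology evolution (a new GO version every day) makes keeping it fresh a genuine engineering problem.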
  • 63. Ontology-based annotation workflow
    First, direct annotations are created by recognizing concepts in raw text,
    Second, annotations are semantically expanded using knowledge of the ontologies,
    Third, all annotations are scored according to the context in which they have been created.
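The three steps above can be strung together as a small pipeline. Everything here (the substring matching, the one-level expansion, and the scoring weights) is a simplified stand-in for the real components:

```python
def recognize_concepts(text, dictionary):
    """Step 1: direct annotations via term lookup (simplified to substrings)."""
    return [cid for term, cid in dictionary.items() if term in text.lower()]

def expand(concepts, parents):
    """Step 2: semantic expansion using is_a knowledge from the ontologies."""
    return set(concepts) | {p for c in concepts for p in parents.get(c, [])}

def score(direct, expanded):
    """Step 3: score annotations by the context in which they were created.
    The weights (10 vs. 5) are ours, purely for illustration."""
    return {c: (10 if c in direct else 5) for c in expanded}

# Toy inputs for the three stages.
dictionary = {"melanoma": "DOID:1909"}
parents = {"DOID:1909": ["DOID:191"]}

direct = recognize_concepts("melanoma progression", dictionary)
all_annotations = expand(direct, parents)
print(score(direct, all_annotations))  # direct hits outrank expanded ones
```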
  • 64. Concept recognition (step 1)
    • Uses a dictionary: a list of strings that identifies ontology concepts
    • 65. 220 ontologies, ~4.2M concepts & ~7.9M terms
    Uses NCIBI Mgrep, a syntactic concept recognizer
    High degree of accuracy
    Fast, scalable, domain independent
  • 66. Semantic expansion (step 2)
    • Uses is_a hierarchies defined by original ontologies
    • 67. Uses mapping in UMLS Metathesaurus and NCBO BioPortal
    • 68. Uses semantic-similarity algorithms based on the is_a graph (ongoing work)
    • 69. Components available as web services
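The semantic-similarity algorithms over the is_a graph are described above as ongoing work; one simple member of that family scores concept pairs by shortest-path distance. A sketch under that assumption (the actual NCBO algorithms are not specified here, and the graph fragment is illustrative):

```python
from collections import deque

# Undirected view of a toy is_a graph, as adjacency sets.
EDGES = {
    "DOID:1909": {"DOID:191"},
    "DOID:191": {"DOID:1909", "DOID:0000818"},
    "DOID:0000818": {"DOID:191"},
}

def path_length(a, b):
    """Shortest path in the is_a graph via BFS, or None if disconnected."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in EDGES.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def similarity(a, b):
    """Similarity decays with path length: 1 / (1 + distance)."""
    d = path_length(a, b)
    return None if d is None else 1.0 / (1 + d)

print(similarity("DOID:1909", "DOID:0000818"))
```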
  • 70. An example
    • “Melanoma is a malignant tumor of melanocytes which are found predominantly in skin but also in the bowel and the eye”.
    • 71. NCI/C0025201, Melanocyte in NCI Thesaurus
    • 72. 39228/DOID:1909, Melanoma in Human Disease
    • 73. Is_a closure expansion
    • 74. 39228/DOID:191, Melanocytic neoplasm, direct parent of Melanoma in Human Disease
    • 75. 39228/DOID:0000818, cell proliferation disease, grandparent of Melanoma in Human Disease
    • 76. Mapping expansion
    • 77. FMA/C0025201, Melanocyte in Foundational Model of Anatomy, concept mapped to NCI/C0025201 in UMLS.
  • Annotation services
    • NCBO Annotator web service
    • 78. The annotation workflow available as a service
    • 79. Automatically process a piece of raw text to annotate it with relevant ontology concepts and return the annotations to the user
    • 80. NCBO Biomedical resources index
    • 81. We have used the annotation workflow to annotate some common resources (gene expression data, clinical trials, articles) and index them by concepts
    • 82. Semantic expansion components
    • 83. http://bioportal.bioontology.org/
  • 84. NCBO Annotator web service (http://bioportal.bioontology.org/annotator)
    • Semantic web vision
    • 85. Ontology-based service
    • 86. Semantically described results
    • 87. OWL ontology that formalizes the service model
    • 88. Higher expressivity (e.g., SWRL rules)
    • 89. Annotations returned to the users in OWL as instances
    • 90. Software agents and semantic web technology stack (e.g., SPARQL)
    • 91. What distinguishes the service?
    • 92. Can be integrated in current workflow (service-oriented approach)
    • 93. Uses semantic expansion
    • 94. Creates annotations with both UMLS and NCBO ontologies
    [AMIA STB 09]
    [BMC BioInformatics 09]
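Because the Annotator is exposed as a REST service, integrating it into a workflow comes down to building an HTTP request. A sketch of constructing one: the endpoint URL is taken from the slide, but the parameter names (`text`, `ontologies`, `expand`) are placeholders of ours, not the documented API, so consult the BioPortal documentation before use.

```python
from urllib.parse import urlencode

# Endpoint from the slide; parameter names below are illustrative only.
ENDPOINT = "http://bioportal.bioontology.org/annotator"

def build_request(text, ontologies=None, expand_hierarchy=False):
    """Build a GET URL for an annotation request (hypothetical parameters)."""
    params = {"text": text}
    if ontologies:
        params["ontologies"] = ",".join(ontologies)
    if expand_hierarchy:
        params["expand"] = "true"
    return ENDPOINT + "?" + urlencode(params)

url = build_request("melanoma of the skin", ontologies=["DOID"])
print(url)
```

The service-oriented design means any client that can issue HTTP requests, from a curation pipeline to a web UI, can consume the same workflow.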
  • 95. NCBO Annotator in BioPortal
  • 96. NCBO Biomedical Resources index (http://bioportal.bioontology.org/resources)
    • We have used the workflow to index several important biomedical resources (22+) with ontology concepts
    • 97. The index can be used to enhance search & data integration
    [DILS 08]
    [BMC BioInfo09]
    [IC 10]
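The resources index can be pictured as an inverted index from concepts to elements, built by running the annotation workflow over each element's text metadata so that a concept search hits it directly. A toy sketch (the element IDs and dictionary are illustrative):

```python
# Term -> concept dictionary and element -> metadata corpus, both toy data.
DICTIONARY = {"melanoma": "DOID:1909"}
ELEMENTS = {
    "GEO:GSE4845": "melanoma progression",   # illustrative IDs, not real GEO entries
    "GEO:GSE100": "yeast cell cycle",
}

def build_index(elements, dictionary):
    """Annotate each element's metadata, then invert: concept -> element IDs."""
    index = {}
    for element_id, metadata in elements.items():
        for term, concept in dictionary.items():
            if term in metadata.lower():
                index.setdefault(concept, set()).add(element_id)
    return index

index = build_index(ELEMENTS, DICTIONARY)
print(index.get("DOID:1909"))  # elements annotated with the Melanoma concept
```

Searching by concept rather than by keyword is what lets expanded annotations (parents, mapped concepts) surface elements a plain string search would miss.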
  • 98. Ex: annotation of a GEO element
  • 99. Ex: search of a GEO element
  • 100. OBR results available in NCBO BioPortal
    Example of resource available (name and description)
    Number of annotations in the OBR index
    Ontology concept/term browsed
    Title and URL link to the original element
    Context in which an element has been annotated
    ID of an element
  • 101. Good use of the semantics (1/2)
    • Simple keyword-based search misses results
  • 102. Good use of the semantics (2/2)
  • 103. First evaluation
    • Mgrep vs. MetaMap evaluation
    • 104. Higher precision & faster
    • 105. Not limited to UMLS terminologies
    • 106. 22 resources in the annotation index
    • 107. Between 99% and 100% of the processed elements are annotated
    • 108. The number of annotating concepts ranges from 359 to 769
    • 109. Functional comparative evaluation with online queries
    • 110. Superior average number of results (i.e., resource elements)
    • 111. Higher number of terms for which we return results; significant improvement in the case of AE or GEO.
    [BMC BioInformatics 09]
  • 112. Technical details
    Workflow and pre-computation of the data
    Java, JDBC & MySQL (prototypes)
    Java, Spring/Hibernate & MySQL (production)
    Services deployed as REST web services
    Tomcat & RestLet
  • 113. Users…
    Ontology-based services (OBS)
    NCBO Biomedical Resources index service
    NCBO Annotator web service
    BioPortal services
    UMLS services
    UCSF
    Laboratree
    CollabRx
    UCHSC
    PharmGKB, JAX
    HGMD
    BioPortal UI
    PDB/PLoS
    I2B2
    NextBio
    IO informatics
    “Resources” tab
    Knewco
    IO informatics
    CaNanoLab
  • 114. …and use cases
    • For concept recognition from text
    • 115. Decide which clinical trials are relevant for a particular patient. Use the annotator service to map clinical-trial eligibility criteria to concepts from UMLS
    • 116. For accelerating curation
    • 117. Use concepts recognized in the abstracts of publications to triage papers for curation
    • 118. For structuring Web data
    • 119. Ensure that any textual annotation created in Laboratree also has corresponding ontology-based annotations
    • 120. For mining gene expression data
    • 121. http://gminer.mcw.edu/
  • 122. Conclusion
    • Enabling data integration and translational discoveries requires scalable data annotation using ontologies
    • 123. Semantic annotations can bridge the gap between resources and ontologies
    • 124. Ontology-based annotations are essential for new semantic services on the Web
    • 125. Our annotation workflow offers high-throughput, semantic annotation
    • 126. Offers an automated, service-oriented vision
    • 127. Leverages the knowledge contained in ontologies
    • 128. Please try it and join us!
  • 129. Future work
    • Enhanced concept recognition
    • 130. NLP and text mining techniques
    • 131. to enhance the recognition of concepts
    • 132. to avoid noise
    • 133. Recognizing relations
    • 134. Enhanced semantic expansion
    • 135. Extracting a similarity measure from the annotation index
    • 136. Composing semantic expansion components on demand
    • 137. Methods for annotation evolution
    • 138. Allowing users to update the annotations previously done by the service
    • 139. Updating the annotations in the index as metadata change
    • 140. Parameterization of the scoring algorithm on demand
  • 141. Credits and collaborators
    • @ NCBO, Stanford University
    • 142. Dr. Nigam H. Shah M.D., PhD initiator of the project & biomedical expert
    • 143. Pr. Mark A. Musen NCBO PI
    • 144. Dr. Adrien Coulet postdoc, new resources within resource index & extraction
    • 145. Cherie Youn Software developer & production releases
    • 146. Nipun Bhatia CS student Mgrep evaluation
    • 147. @ Univ. of Victoria
    • 148. Chris Callendar Software developer (UI)
    • 149. @ NCIBI, Univ. of Michigan
    • 150. Manhong Dai Mgrep designers & developers
    • 151. Dr. Fan Meng
  • 152. Thank you
    National Center for Biomedical Ontology: http://www.bioontology.org
    BioPortal, biomedical ontology repository: http://bioportal.bioontology.org
    Contact me: jonquet@stanford.edu