Enabling semantic search in a bio-specimen repository
July 9th, 2013
ICBO 2013
Shahim Essaid, Carlo Torniai, and Melissa Haendel
Paper presented at the International Conference on Biomedical Ontology (ICBO) 2013.

  • I changed the count to “four” because the next diagram shows four sources. Our goal was narrowed down to enhancing information retrieval (IR) over the large free text content. We limited the NLP to surgical pathology because other types of reports have text that is even less natural.
  • KCI has tissue-level information with sparse morphology and histology coding; CR has patient-level information.
  • The key point, other than the free text (and the poor/limited use of terminology shown on the next slide), is that the data from the two systems is filtered for each top-level UI. I think the UIs were implemented for their corresponding clients, who perhaps did not want to see each other’s data. This slide shows the form for the free text search, with the other forms and the top two sections on the left.
  • Numbers in parentheses are counts. Even though there are 600K records, perhaps only a few records were mapped from CR back to pathology data. Basically, all structured data is cancer data. Users have to select every entry that has “adrenal gland” in its label; they cannot select its parts, synonyms, etc. The wizards on the left allow filtering and querying on some data fields.
  • This is one report. The table shows the existing coded data (the site and diagnosis are from the cancer registry but the ICD9 code is from pathology). This is the only coded data for the content of the report. There are additional data fields (names, dates, treatments, etc.) but pathology is limited to this.
  • This shows that the UMLS imports biomedical vocabularies into the three tables shown. Many more tables hold additional information, but these are the main ones. The labels of external vocabularies are analyzed with the UMLS lexical tools and related resources to do an automated import (matching to existing content/CUIs in the UMLS), which is then manually curated. The expanding content of the UMLS is in turn used to enhance the lexical tools, because each new terminology brings in new lexical information, and the cycle continues. MetaMap uses the lexical tools to do its work and generates output that is coded/mapped with UMLS CUIs, with indications of which vocabulary and terms supported the mappings. MetaMap has many components, which are listed in the diagram. Most components can take customized training content, but by default MetaMap is based on the full UMLS release.
  • Each mapping lists the term, the start location by character, the end location, and a score (lower is better: -1000 is better than -800). The table shows the mappings for the displayed text.
  • Same as before. Note the odd “blood group antigen D” mapping caused by the “D:” bullet in the report. This is MetaMap attempting to recognize acronyms.
  • E: is considered an acronym for “elementary charge”
  • The small diagram tries to show the last bullet. The next slide has two examples.
  • Two disorders/diseases. The definitions have groups that relate specific morphology to specific sites. Legend: F = fully qualified name, P = preferred name, D = defining relationship, Q = qualifier, C = current concept.
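The notes describe each MetaMap mapping as a term with a start offset, an end offset, and a score where lower is better. As a minimal sketch of how such rows might be filtered, the Python below keeps only the best-scored candidate for each text span. The `Mapping` type, the `best_per_span` helper, and all sample terms, offsets, and scores are hypothetical illustrations, not actual MetaMap output or the authors' code.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    term: str
    start: int  # start offset, in characters
    end: int    # end offset, in characters
    score: int  # more negative is better (-1000 beats -800)


def best_per_span(mappings):
    """Keep the best-scored mapping for each (start, end) span."""
    best = {}
    for m in mappings:
        key = (m.start, m.end)
        if key not in best or m.score < best[key].score:
            best[key] = m
    return sorted(best.values(), key=lambda m: m.start)


# Invented candidates for illustration only.
candidates = [
    Mapping("Gallbladder", 3, 14, -1000),
    Mapping("Gallbladder structure", 3, 14, -800),
    Mapping("Cholecystectomy", 16, 31, -1000),
]
selected = best_per_span(candidates)
```

Here `selected` holds one mapping per span, ordered by position in the text.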

    1. Enabling semantic search in a bio-specimen repository
       July 9th, 2013, ICBO 2013
       Shahim Essaid, Carlo Torniai, and Melissa Haendel
    2. OHSU’s Biolibrary Search Engine
       • Data aggregated from four repositories with plans for additional repositories
       • A web-based search engine over de-identified data
       • Our goal was to develop a controlled application ontology to support search capabilities
    3. OHSU Biolibrary system
    4. Search application
       Two search interfaces (with no data integration)
       Limited free text search
    5. Search application
       Search through anatomy and histology lists
       Multiple wizard-like forms
    6. Example coded data vs. pathology report
       (Available structured data from one case)
       However, the pathology report also includes:
       • Low grade pancreatic intraepithelial neoplasia
       • Extensive perineural invasion
       • Acute and chronic cholecystitis
       • Bile duct tissue with chronic inflammation
       • Chronic pancreatitis
       • Acute gastric serositis
    7. Entity recognition with MetaMap
    8. Selected mapping examples (the same report from earlier)
       Final Pathologic Diagnosis:
       A: Gallbladder, cholecystectomy:
       - Acute and chronic cholecystitis
       - Negative for malignancy
       B: Bile ductular tissue, biopsy:
       - Bile duct tissue with chronic inflammation
       - Negative for malignancy
    9. Selected mapping examples (the same report from earlier)
       C: Superior mesenteric vein margin, biopsy:
       - Vascular tissue with no diagnostic abnormality
       - Negative for malignancy
       D: Portal vein margin, biopsy:
       - Fibroconnective tissue with no diagnostic abnormality
       - Negative for malignancy
    10. Selected mapping examples (the same report from earlier)
        E: Pancreas, stomach, duodenum, pancreaticogastroduodenectomy:
        - Pancreatic ductal adenocarcinoma, grade 2/3, invading peripancreatic fat
        - Size: 3 cm in greatest dimension
        - Pancreatic neck margin positive for invasive carcinoma (please see comment)
        - Superior mesenteric artery margin negative at 0.2 cm from invasive tumor, deep pancreatic margin negative at 0.6 cm from invasive tumor
        - Extensive perineural invasion present
        - No angiolymphatic invasion identified
        - Metastatic pancreatic ductal adenocarcinoma present in two of ten peripancreatic lymph nodes (2/10)
    11. Deriving an OWL ontology for DL queries
    12. Adding relationships (developing an application ontology to support search)
        • “subclass of” axioms were generated based on the UMLS hierarchy table
        • Mapped entities were augmented with the transitive closure of their parents
        • “part of” axioms were generated by aggregating many mereological relationships from the UMLS relationship table
        • Anatomy, pathology, and disease entities were related using SNOMED-CT disorder/disease definitions
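The transitive-closure step on this slide can be sketched as follows. This is only an illustration: the `parents` dictionary is a hypothetical stand-in for rows of the UMLS hierarchy table, the concept labels are invented, and the actual pipeline used by the authors is not shown in the slides.

```python
def ancestors(concept, parents):
    """Return every ancestor of `concept` (transitive closure of its parents)."""
    seen = set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:  # the `seen` check also guards against cycles
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


# Hypothetical child -> parents table (labels invented for illustration).
parents = {
    "gallbladder": ["biliary tract structure"],
    "biliary tract structure": ["digestive organ"],
    "digestive organ": ["organ"],
}
closure = ancestors("gallbladder", parents)
```

Each mapped entity would then be asserted as a subclass of every concept in its closure, which is what lets a query on a general concept reach reports annotated with a specific one.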
    13. Adding relationships (developing an application ontology to support search)
        • Problematic multiple and cyclic inheritance resolved manually
        • Resulted in an OWL ontology that supports useful DL queries along the “subclass of” and “part of/has part” axes. Examples:
          - Retrieve all pathologies (limited to a type if needed) that affect an anatomical site (± all parts)
          - Retrieve all anatomical sites with a specific type of pathology
          - List all pathologies/sites for a disease
          - Etc.
        • The MetaMap mappings were saved in a database table. After relevant concepts are identified with a DL query, a database query can find actual reports.
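The second step of this two-step retrieval, going from the concepts returned by a DL query to actual reports, might look like the sketch below. It assumes a hypothetical `mapping` table with `report_id`, `cui`, `start_pos`, and `end_pos` columns held in SQLite; the real table layout and database are not given in the slides, and the CUIs are placeholders.

```python
import sqlite3


def reports_for_concepts(conn, cuis):
    """Return ids of reports that have a mapping to any of the given CUIs."""
    cuis = sorted(cuis)
    if not cuis:
        return []
    placeholders = ",".join("?" for _ in cuis)
    rows = conn.execute(
        f"SELECT DISTINCT report_id FROM mapping "
        f"WHERE cui IN ({placeholders}) ORDER BY report_id",
        cuis,
    )
    return [row[0] for row in rows]


# Hypothetical mapping table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping "
    "(report_id INTEGER, cui TEXT, start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?, ?)",
    [(1, "C0000001", 0, 11), (2, "C0000002", 5, 20), (3, "C0000003", 0, 8)],
)

# Stand-in for the set of concepts a DL query over the OWL file would return.
dl_result = {"C0000001", "C0000002"}
report_ids = reports_for_concepts(conn, dl_result)
```

Keeping the reasoning in the ontology and the lookup in the database means the report table never needs to know about the hierarchy.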
    14. SNOMED-CT examples of disorder definitions (used to relate anatomy to pathology in the application ontology)
    15. Application integration
        • Integration with the existing application was limited to appending the annotations to the text of pathology reports:
          | C1521733 C0332144 0:26 | C0016976 32:44 | C0205178 63:70 | …
        • Annotations (CUIs and location) are then indexed in Solr and can be searched with the existing free text search form (after a DL query on the OWL file).
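For readers who want to work with the appended annotation string, here is a minimal parsing sketch. It assumes each segment is one or more CUIs followed by a `start:end` character range; reading `0:26` as start and end offsets is an assumption based on the slide's “CUIs and location” wording, and `parse_annotations` is a hypothetical helper, not part of the described system.

```python
import re

# One or more CUIs (C + 7 digits) followed by a "start:end" location.
SEGMENT = re.compile(r"((?:C\d{7}\s+)+)(\d+):(\d+)")


def parse_annotations(text):
    """Split an annotation string into (cuis, start, end) tuples."""
    return [
        (cuis.split(), int(start), int(end))
        for cuis, start, end in SEGMENT.findall(text)
    ]


# The sample string shown on the slide (without the trailing ellipsis).
sample = "| C1521733 C0332144 0:26 | C0016976 32:44 | C0205178 63:70 |"
parsed = parse_annotations(sample)
```

Because the annotations are plain tokens in the indexed text, a CUI can be searched through the existing free text form like any other word.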
    16. A simple DL query for anatomy (linked to actual report in the mapping table)
    17. Difficulties and limitations
        • “Structured” text in pathology reports is not in natural language, so MetaMap performs less well on it
        • Named entity recognition helps with document retrieval, but extraction of structured data is more valuable
        • Negation detection is poor but very important
        • Significant multiple inheritance and subsumption cycles (inappropriate equivalences) arise when several UMLS vocabularies are used to derive an OWL representation
        • Short project, no access to full reports, limited computational resources
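Subsumption cycles of the kind mentioned on this slide can at least be located automatically before manual resolution. The sketch below flags every concept that can reach itself by following parent links; the `parents` table and the single-letter concept names are invented for illustration, and this is not the procedure the authors report using.

```python
def concepts_in_cycles(parents):
    """Return the concepts that can reach themselves via parent links."""

    def reachable(start):
        seen, stack = set(), list(parents.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return seen

    return {c for c in parents if c in reachable(c)}


# Hypothetical child -> parents table merged from several vocabularies.
parents = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],  # closes the cycle A -> B -> C -> A
    "D": ["A"],  # D points into the cycle but is not part of it
}
cyclic = concepts_in_cycles(parents)
```

In OWL, concepts on such a cycle are inferred to be equivalent, which is why the slide calls these cycles inappropriate equivalences.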
    18. Conclusions
        • OHSU Biolibrary is adding many other specimen collections, so the need for better search will increase
        • Can use NER to enhance the data with SNOMED-CT
        • Interest in identifying references in pathology reports to specimen blocks and slides to annotate these resources as well
        • Still limited resources for supporting sophisticated terminology and semantic efforts…
    19. Thanks
        • Dr. Chris Corless
        • Rob Schuff
        • Medical Research Foundation of Oregon
