
Semantic decomposition of ontologies for creation of flexible biomedical concept recognisers


The need to recognise biomedical and clinical concepts in free text has been driven by demand for semantic information retrieval and decision support. Comprehensive, large-scale ontologies, such as the Foundational Model of Anatomy (FMA) and the Disease Ontology (DO), form the building blocks of the Unified Medical Language System (UMLS) and are the basis of dictionary-based biomedical concept recognisers such as MetaMap. However, these tools typically require substantial disk space, memory and processing time. Recently, regular-expression (regex) based concept recognisers such as mGrep have begun to address this shortcoming, but a method that allows researchers to create their own concept recogniser from a given ontology has not yet been described.

In this presentation, I describe a method for semantic decomposition of biomedical ontologies, applied to the FMA and DO to create a high-performance tool for identifying anatomical and disease concepts in free text. The method involves 1) tokenizing each ontology into distinct words, 2) extracting free and bound morphemes from the word list, 3) classifying each morpheme according to semantic type or grammatical role, 4) generating regexes over each morpheme set, and 5) applying simple grammatical rules over the regexes to identify potential concepts. Precision and recall are evaluated against manually annotated clinical and biomedical corpora, and the results are compared with the performance of 1) direct ontology lookup and 2) MetaMap against the same corpora.

As measured by the Mann-Whitney rank sum test, the method demonstrates a significant (p < 0.01) improvement in accuracy over direct ontology lookup. Against MetaMap, it shows a measurable improvement in accuracy that is not statistically significant (p > 0.05), while reducing processing time by roughly two orders of magnitude (for example, 19s versus 2239s on anatomical terms).

Published in: Technology, Education

Semantic decomposition of ontologies for creation of flexible biomedical concept recognisers

  1. Semantic decomposition of ontological resources for the creation of flexible, high-performance biomedical concept recognisers
     26 June 2012
     Phil Gooch
     Centre for Health Informatics
  2. Overview
     ● Why identify biomedical concepts in free text?
     ● How ontologies can help
     ● Problems with using ontologies for concept identification
     ● Potential solutions
     ● Application of method to two ontologies: Foundational Model of Anatomy and Disease Ontology
     ● Evaluation against a small corpus of 163 clinical discharge summaries, surgical, pathology and radiology reports
  3. Why identify biomedical concepts in free text?
     ● Indexing MedLine abstracts for semantic search
       – Identifying 'hypertension' as being of semantic type 'disease', and more specifically a cardiovascular disease
     ● Literature-based knowledge discovery
       – Disease D associated with increase in physiological function F
       – Substance S inhibits F
       – => S might be a treatment for D
     ● Decision support
       – What treatment recommendations do clinical guideline documents provide for hypertension in pregnancy?
       – What were the findings of the pathology report?
       – 50% of clinically important information resides in the free text of the patient record, rather than in structured fields (Sittig 2007)
  4. Ontologies
     ● Define the concepts of a given domain, their properties and their relationships
       – Provide canonical names for terms
       – Classification hierarchy, whole-part relations and synonyms
     ● Can function as a dictionary: a lookup list of terms for concept identification via string matching
     ● Or defined properties can be used to infer concepts
       – A Company issues Shares
       – 'shares in Abc fell' => 'Abc' is a Company
  5. Problems with biomedical ontologies for concept identification
     ● Often very large
       – Foundational Model of Anatomy > 200MB, 150K+ terms
       – Even when expressed in a compact data structure (e.g. Trie), potentially large RAM overhead when used to match strings
     ● May not be complete: how to identify potentially new terms, classes
     ● May not contain all synonyms or other ways of expressing terms, e.g. abbreviations
       – Separate lists of word variations often compiled (e.g. NLM SPECIALIST lexical variant generation tools)
  6. Some solutions
     ● Hearst patterns (Hearst 1992)
       – Identify hypernymic (class-member) relations
       – 'Bruises, cuts, and other injuries'
       – 'Diseases such as atherosclerosis'
       – High precision, but low recall
     ● Bootstrapping
       – 'scaphoid, lunate, triquetral and pisiform'
       – If we know that the scaphoid and lunate are bones of the wrist, we can infer that the others in this list are also
       – Improves recall, but reduces precision (Maynard 2009)
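The two Hearst-style templates quoted on this slide can be approximated with ordinary regular expressions. The sketch below is only an illustration, not the author's implementation: the names `ITEM`, `AND_OTHER`, `SUCH_AS` and `hearst_pairs` are hypothetical, and each list item is assumed to be a single word.

```python
import re

# Assumption: every list item is one word; real terms are often multi-word.
ITEM = r"[A-Za-z-]+"

# "X, Y, and other Z"  ->  X and Y are members of class Z
AND_OTHER = re.compile(
    rf"(?P<members>{ITEM}(?:, {ITEM})*),? (?:and|or) other (?P<cls>{ITEM})"
)
# "Z such as X, Y and W"  ->  X, Y and W are members of class Z
SUCH_AS = re.compile(
    rf"(?P<cls>{ITEM}) such as (?P<members>{ITEM}(?:(?:, | and | or ){ITEM})*)"
)

def hearst_pairs(text):
    """Return (member, class) pairs found by the two templates."""
    pairs = []
    for pattern in (AND_OTHER, SUCH_AS):
        for match in pattern.finditer(text):
            members = re.split(r", and |, or |, | and | or ", match.group("members"))
            pairs.extend((m, match.group("cls")) for m in members if m)
    return pairs

print(hearst_pairs("Bruises, cuts, and other injuries"))
print(hearst_pairs("Diseases such as atherosclerosis"))
```

Because the templates only fire on these exact surface forms, most class-member mentions in running text are missed, which is consistent with the high-precision, low-recall behaviour noted on the slide.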
  7. Some solutions
     ● Domain-specific linguistic features
       – Neoclassical combining forms
       – Biomedical and clinical terms often composed of or contain well-defined Latin and Greek roots, suffixes and prefixes
       – -osis, -itis, -opathy => disease
       – cardi-, ileo- => anatomy
       – High precision, but low recall (Gooch & Roudsari 2011)
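A minimal sketch of how the combining forms named on this slide could be used to guess a semantic type. Only the five affixes listed on the slide are included; the table `COMBINING_FORMS` and the function `guess_types` are hypothetical names, and a real recogniser would need a far larger affix inventory.

```python
import re

# Affixes taken from the slide; the mapping structure itself is an assumption.
COMBINING_FORMS = [
    (re.compile(r"\w+(?:osis|itis|opathy)\b", re.IGNORECASE), "disease"),
    (re.compile(r"\b(?:cardi|ileo)\w+", re.IGNORECASE), "anatomy"),
]

def guess_types(word):
    """Return the semantic types suggested by the word's combining forms."""
    return [label for pattern, label in COMBINING_FORMS if pattern.search(word)]

for word in ("atherosclerosis", "cardiomyopathy", "ileocecal", "mandible"):
    print(word, guess_types(word))
```

A word such as 'mandible' carries none of the listed affixes and is therefore not recognised, which illustrates why affix rules alone give low recall.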
  8. Some solutions
     ● NLM MetaMap (Aronson 2010): uses neoclassical combining forms + lexical variant generation + ontologies
       – Comprehensive, but heavyweight (4GB+ RAM, 10GB+ install)
     ● mGrep (Meng 2009): radix trie-based lookup over ontologies
       – Fast, higher precision but lower recall than MetaMap (Shah 2009)
       – Still requires the complete source ontologies
       – Requires substantial preprocessing of input text via the NCBO web service (NCBO Support 2011)
  9. Semantic decomposition of ontologies
     ● Provide a systematic method of reducing the size of large ontologies to make their use for concept identification feasible
     ● Reproducible method so that concept recognisers for new ontologies can be quickly developed
     ● Has spin-off benefits for ontology quality assurance
       – E.g. identification of spelling errors and lexical inconsistencies in biomedical ontologies (Gooch 2011)
  10. Semantic decomposition of ontologies
      ● Little published work in this area
      ● Tong et al (2008) decomposed the Gene Ontology into individual tokens (words) and calculated the positional entropy of each token via the probability of token t appearing at position p in a given ontology term
      ● Could be applied to identifying potential ontology terms in free text, but wasn't evaluated
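The slide does not give Tong et al's exact formula, so the following is one plausible reading rather than their method: for each token, estimate the probability of it appearing at each word position across all terms, then take the Shannon entropy of that distribution. The function name and the sample terms are hypothetical.

```python
import math
from collections import Counter, defaultdict

def positional_entropy(terms):
    """Map each token to the Shannon entropy (in bits) of its position distribution."""
    counts = defaultdict(Counter)
    for term in terms:
        for position, token in enumerate(term.lower().split()):
            counts[token][position] += 1
    entropy = {}
    for token, by_position in counts.items():
        total = sum(by_position.values())
        # Sum of p * log2(1/p) over the positions at which the token was seen.
        entropy[token] = sum(
            (n / total) * math.log2(total / n) for n in by_position.values()
        )
    return entropy

# Hypothetical sample terms, not taken from any ontology release.
scores = positional_entropy(["left lung", "right lung", "lung parenchyma"])
print(scores)
```

Under this reading, a token with entropy 0 always occupies the same position (here 'left' is always first), while a higher value means the token moves around within terms.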
  11. Semantic decomposition of ontologies
      ● Initial focus on Foundational Model of Anatomy (FMA) (Rosse 2003) as anatomical terms are central to the identification of
        – location of disease, morbidity
        – location of symptoms
        – location of procedures – surgery, pathology and radiology reports
        – administration route of medication
      ● Apply the method to the Disease Ontology (Osborne et al 2009) to see how well it generalises
  12. Semantic decomposition of ontologies
      ● Extend Tong et al's idea but classify each token according to its part of speech (noun, adjective etc) and its semantic type
      ● Reduce the set of tokens further by identifying words (free morphemes) sharing common roots and suffixes (bound morphemes)
      ● Morpheme – smallest linguistic unit that has meaning (cephalon, -derm, -ium, -rrhea)
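The slides do not say how shared roots and suffixes were found, so the sketch below swaps in a deliberately naive approach: strip the longest matching suffix from a small fixed list and group words by the remaining root. The suffix list, the `min_root` threshold and all function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical suffix inventory, for illustration only.
SUFFIXES = ("um", "us", "a", "i")

def distinct_words(terms):
    """Tokenise ontology terms into a sorted list of distinct lowercase words."""
    return sorted({w for term in terms for w in re.findall(r"[a-z]+", term.lower())})

def group_by_root(words, suffixes=SUFFIXES, min_root=4):
    """Map each root to the set of suffixes seen with it ('' = nothing stripped)."""
    groups = defaultdict(set)
    for word in words:
        for suffix in sorted(suffixes, key=len, reverse=True):
            root = word[: -len(suffix)]
            if word.endswith(suffix) and len(root) >= min_root:
                groups[root].add(suffix)
                break
        else:
            groups[word].add("")  # no suffix matched: keep the word whole
    return groups

def compact_pattern(groups):
    """Render the groups as one regex alternation, e.g. manubri(a|um)."""
    parts = []
    for root in sorted(groups):
        suffixes = sorted(groups[root])
        if len(suffixes) == 1:
            parts.append(root + suffixes[0])
        else:
            parts.append(f"{root}({'|'.join(suffixes)})")
    return "|".join(parts)

words = distinct_words(["Manubrium", "Manubria", "Mandible", "Sternum"])
print(compact_pattern(group_by_root(words)))
```

Grouping 'manubrium' and 'manubria' under one root is what allows a single compact alternative to stand in for both word forms, shrinking the token set that the recogniser has to carry.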
  13. Regular expressions
      ● Used to match sequences of characters against some input
      ● Written in a formal language that describes the patterns in the input that we wish to match
      ● For this task, we precompile sets of regular expressions (regex) generated from the set of morphemes extracted from the ontology
      ● We write recombination rules over the regexes which include stop-words (determiners, prepositions) to identify candidate noun phrases and prepositional phrases that look like ontology terms
  14. Regular expression and pattern generation
      ● Create regexes from the union of entries (with morphological variants) in each set
        – nounPattern = … macula | malleus | mandible | manubri(um|a) | manus ...
      ● Top and tail with word boundaries, with optional plurality
        – noun = "\b(" + nounPattern + ")s?\b"
        – adjective = "\b(" + adjPattern + ")\b"
      ● Combine regex output with patterns
        – NP = adjective{0,5} (noun | properNoun){1,5}
        – PP = NP “of|on” NP
        – Term = NP | PP
      ● Test by running the patterns against the complete ontology – all terms should be matched
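The construction on this slide can be sketched end to end in a few lines. This is an approximation under stated assumptions, not the author's code: it uses Python's `re` module rather than `java.util.regex`, the two word lists are tiny hypothetical stand-ins for the morpheme sets extracted from an ontology, `properNoun` is omitted, and the optional determiner `the` inside `PP` is an added assumption.

```python
import re

# Hypothetical stand-ins for the morpheme sets extracted from the ontology.
NOUN_PATTERN = r"macula|malleus|mandible|manubri(?:um|a)|manus|head|neck"
ADJ_PATTERN = r"left|right|anterior|posterior"

# Top and tail with word boundaries; nouns may take an optional plural "s".
NOUN = rf"\b(?:{NOUN_PATTERN})s?\b"
ADJECTIVE = rf"\b(?:{ADJ_PATTERN})\b"

# NP = adjective{0,5} noun{1,5};  PP = NP (of|on) NP;  Term = NP | PP
NP = rf"(?:{ADJECTIVE}\s+){{0,5}}{NOUN}(?:\s+{NOUN}){{0,4}}"
PP = rf"{NP}\s+(?:of|on)\s+(?:the\s+)?{NP}"
# PP is listed first so that the longer alternative is tried before a bare NP.
TERM = re.compile(rf"{PP}|{NP}", re.IGNORECASE)

def find_terms(text):
    """Return every candidate term found in the text, left to right."""
    return [m.group(0) for m in TERM.finditer(text)]

print(find_terms("Fracture of the left mandible and neck of malleus."))

# Sanity check in the spirit of the slide: every source term should match in full.
assert all(TERM.fullmatch(t) for t in ["left mandible", "head of the mandible"])
```

Because terms are recombined from parts, the pattern can also match phrases that are not themselves discrete ontology entries, as long as they are built from known morphemes in a permitted order.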
  15. Evaluation
      ● Corpus of discharge summaries, progress notes, and surgical, radiology and pathology reports (Savova et al 2011)
      ● Manually annotated for mentions of anatomical and disease concepts
      ● Compare manually identified terms against system-generated terms via semantic decomposition/recombination pattern approach vs direct ontology lookup vs MetaMap
      ● Calculate precision (tp / (tp + fp)), recall (tp / (tp + fn)), and F-measure (2 * P * R / (P + R)), and Mann-Whitney U between approaches
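The three measures on this slide can be computed directly from true positive, false positive and false negative counts. The helper below is a minimal sketch; the function name is hypothetical and the counts in the usage line are made up, not taken from the evaluation corpus.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure

# Hypothetical counts: 90 correct matches, 10 spurious, 30 missed.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

The slides do not state which samples the Mann-Whitney U test was run over (for example per-document scores), so that step is left out of the sketch.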
  16. Results – Anatomical terms

      | Method        | P           | R    | F           | Time  |
      |---------------|-------------|------|-------------|-------|
      | Semantic      | 0.36 (0.89) | 0.91 | 0.51 (0.90) | 19s   |
      | Direct lookup | 0.22 (0.54) | 0.73 | 0.34 (0.62) | 10s   |
      | MetaMap       | 0.30 (0.75) | 0.86 | 0.44 (0.80) | 2239s |

      Figures in parentheses denote results after corpus correction
      Semantic vs direct lookup: significant increase in P and R (p < 0.01)
      Semantic vs MetaMap: increase in P and R, but not significant (p > 0.05)
  17. Error analysis – Anatomical terms
      ● Many false positives (87.9%) were in fact correct terms – missing from the manually annotated corpus
      ● Adding these missing annotations increased precision from 0.36 to 0.89
      ● Remaining FPs were partial matches, e.g. 'nonspecific bowel', 'a haploidentical bone marrow', 'normal sinus', and non-specific anatomical areas, e.g. 'multifocal areas', 'particular organ site', 'pruritic areas'.
      ● Phrases not in the ontology as discrete terms picked up by semantic method, e.g. 'angiolymphatic space', 'dentate line'
  18. Results – Disease terms

      | Method        | P    | R    | F    | Time  |
      |---------------|------|------|------|-------|
      | Semantic      | 0.58 | 0.68 | 0.62 | 12s   |
      | Direct lookup | 0.69 | 0.27 | 0.37 | 9s    |
      | MetaMap       | 0.46 | 0.83 | 0.59 | 1748s |

      Semantic vs direct lookup: significant increase in R (p << 0.01), significant decrease in P (p < 0.01), overall significant increase in F (p < 0.01)
      Semantic vs MetaMap: significant increase in P (p << 0.01), but significant decrease in R (p < 0.01), overall increase in F but not significant (p > 0.05)
  19. Error analysis – Disease terms
      ● Factors affecting recall:
        – Abbreviations (e.g. COPD)
        – Definite descriptors ('the disease', 'her infirmity')
        – Symptoms annotated as disease ('mood changes', 'double vision')
      ● Factors affecting precision
        – Terms manually annotated as Symptoms being marked as Disease e.g. 'difficulty walking'
        – Some inconsistent manual annotation of negated terms, family history etc
  20. Conclusion
      ● Semantic decomposition and regex/pattern-based recombination of ontology terms is slightly slower than directly looking up terms and synonyms extracted from the ontology, but leads to significantly increased accuracy that balances precision and recall
      ● Against MetaMap, the improvements are measurable but not statistically significant for anatomical terms, while precision is significantly improved for disease terms. Processing is roughly two orders of magnitude faster (19s vs 2239s for anatomical terms, 12s vs 1748s for disease terms).
      ● Our findings are comparable to Shah et al (2009) for mGrep vs MetaMap, but we now have a systematic method for creating new concept recognisers from scratch
  21. Further work
      ● Calculate positional entropy of each morpheme and use these to help generate patterns (e.g. some morphemes are more likely to occur at the start or end of a pattern)
      ● Improve lookup performance by using a radix trie (better for morpheme sets that share long prefixes and suffixes) rather than standard java.util.regex
      ● Apply method to other biomedical ontologies
      ● Evaluate against other corpora, e.g. annotated MedLine abstracts
