Automated Extraction of Domain-specific Clinical Ontologies
Segmenting, merging, and surveying modules
Chimezie Ogbuji
cut@case.edu
Need for Ontology Bootstrapping
• There is a critical need for formal, reproducible methods for recognizing and filling gaps in medical terminologies (Cimino 1998)
• Clinical terminology systems need to extend smoothly and quickly in response to the needs of users (Rector 1999)
• A fixed, enumerated list of concepts can never be complete and results in a combinatorial explosion of terms (exhaustive pre-coordination)
• A general best practice is to re-use ontologies, especially those that have been standardized
• However, there is a proliferation of (domain-specific) clinical ontologies
  • This flies in the face of that best practice
• As more projects leverage the full value of reference medical ontologies, there will be an increased need for automated management
  • Not there yet: most projects still rely on coding systems
The Goal
Want to (automatically):
• Customize a large source ontology such as SNOMED-CT in a tractable way
• Generate normalized anatomy and clinical terminology modules that are manageable in size and preserve the meaning of common terms
• Provide a framework for bootstrapping the creation of clinical terminology for a specific domain
Prior Work
• Noy and Musen (2000)
  • Discuss how to either automate the merging and alignment or guide the user, suggesting conflicts and actions to take
  • Rely on lexical matching of term names
• Bontas and Tolksdorf (2005)
  • Similar goal to Noy and Musen
  • User provides a list of term matches between source and target
  • Follow semantic connections from these terms
Modularization: Ontology Engineering
• Seidenberg and Rector (2006) describe an ontology segmentation heuristic that starts with a set of terms and creates an extract from an ontology around those terms
• Traverses the ontology structure and is limited by a user-specified recursion depth
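The depth-limited traversal idea can be illustrated with a minimal sketch. This is not the Seidenberg and Rector algorithm itself, only a plain breadth-first walk that stops at a given depth; the function name `extract_segment`, the `links` dictionary, and the toy terms are all hypothetical.

```python
from collections import deque


def extract_segment(links, seed_terms, max_depth):
    """Collect every term reachable from the seeds in at most max_depth hops.

    `links` maps a term to the terms it refers to (superclasses or fillers
    of its definitions). This structure is an assumption made for the sketch.
    """
    depth = {term: 0 for term in seed_terms}
    queue = deque(seed_terms)
    while queue:
        term = queue.popleft()
        if depth[term] == max_depth:
            continue  # boundary of the segment: do not follow links further
        for neighbor in links.get(term, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[term] + 1
                queue.append(neighbor)
    return set(depth)


# Hypothetical toy ontology, not taken from SNOMED-CT or the FMA.
toy_links = {
    "AtrialFibrillation": ["Arrhythmia", "AtriumStructure"],
    "Arrhythmia": ["HeartDisease"],
    "AtriumStructure": ["HeartStructure"],
    "HeartDisease": ["Disease"],
    "HeartStructure": ["BodyStructure"],
}

segment = extract_segment(toy_links, ["AtrialFibrillation"], max_depth=2)
print(sorted(segment))
```

A smaller `max_depth` gives a smaller extract at the cost of dropping more of the context that defines the seed terms.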
Seidenberg and Rector (2006)
• Grau et al. (2008): when developing ontology P, we want to re-use a set of symbols from (another) ontology Q without changing their meaning
  • P + Q is a conservative extension of Q
• When answering a query involving terms in O (its signature or vocabulary), importing O'1 should give the same answers as if O' had been imported instead (both are subsets, but O'1 is more manageable)
  • Then we say O'1 is a module for O in O'
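The informal statement can be written more precisely. The notation below follows the usual formulation of deductive conservative extensions; the symbols $\mathcal{L}$ (the ontology language), $\mathrm{Sig}(\cdot)$ (the signature of an axiom or ontology), and $\alpha$ are introduced here for illustration and do not appear on the slide.

```latex
\textbf{Deductive conservative extension.}
Let $\mathcal{O}_1 \subseteq \mathcal{O}$ be ontologies and $\mathbf{S}$ a signature.
$\mathcal{O}$ is a deductive $\mathbf{S}$-conservative extension of $\mathcal{O}_1$
with respect to $\mathcal{L}$ if, for every axiom $\alpha$ over $\mathcal{L}$ with
$\mathrm{Sig}(\alpha) \subseteq \mathbf{S}$,
\[
  \mathcal{O} \models \alpha \iff \mathcal{O}_1 \models \alpha .
\]

\textbf{Module.}
Let $\mathcal{O}'_1 \subseteq \mathcal{O}'$. Then $\mathcal{O}'_1$ is a module for
$\mathcal{O}$ in $\mathcal{O}'$ if $\mathcal{O} \cup \mathcal{O}'$ is a deductive
$\mathbf{S}$-conservative extension of $\mathcal{O} \cup \mathcal{O}'_1$ for
$\mathbf{S} = \mathrm{Sig}(\mathcal{O})$.
```

Read this way, importing the smaller $\mathcal{O}'_1$ in place of $\mathcal{O}'$ changes no entailment that can be expressed in the vocabulary of $\mathcal{O}$.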
Segments vs. Modules
• The segmentation heuristic used contrasts with (and predates) the methods of Grau et al. (2008), which produce modules with 100% semantic fidelity
  • It sacrifices semantic fidelity for an expedient extraction process
• The (tractable) calculation of deductive, conservative extensions for EL is an open research problem
Materials
• SNOMED-CT
• Foundational Model of Anatomy (FMA)
• Common anatomy signature
Reference Clinical Ontologies
• There is a reasonable consensus around two reference ontologies that cover a substantial portion of clinical medicine
  • SNOMED-CT and the FMA
• Both leverage an underlying formal knowledge representation
SNOMED-CT
• A comprehensive terminological framework for clinical documentation and reporting
• Comprises about half a million concepts:
  • Clinical findings, procedures, body structures, organisms, substances, pharmaceutical products, specimens, quantitative measures, and clinical situations
• Has an underlying description logic (EL family)
  • The EL family has been shown to be suitable for medical terminology
  • And subsequently ELHR+, the performance target of many modern classifiers
• Technical challenges:
  • Its size discourages the use of logical inference systems to manage and process it (due to performance issues)
  • Most description logic systems run into memory exhaustion when classifying it in its entirety (there have been recent advances here)
  • In some cases, its definitions are inconsistent or incomplete (more on this later)
• Policy pressures (opportunity):
  • Participants in the meaningful use program must capture EHR problem lists based on ICD-9 or SNOMED-CT
Using Modularization for Quality Assurance
• Plenty of (recent) work on quality assurance of SNOMED-CT
  • Using Semantic Web technologies (and lattice theory) for quality assurance of large biomedical ontologies (Zhang et al. 2010)
  • Identifying incorrect or clinically misleading inferences that arise from use of SNOMED-CT (Rector et al. 2011)
• More recent QA of SNOMED-CT (Rector 2011) leverages extraction of manageable modules and discusses the value to domain experts of browsing SNOMED-CT via a module built from a set of terms relevant to a domain or application
Foundational Model of Anatomy
• Goal is to conceptualize the physical objects and spaces that constitute the human body
• Leverages a frame-based knowledge representation to formulate over 75,000 concepts, including:
  • Macroscopic, microscopic, and sub-cellular canonical anatomy
• Anatomy is fundamental to biomedical domains
• Concepts are connected by several mereological relations
  • Primarily concerned with part_of and has_part
• Adheres to a strict Aristotelian modeling paradigm
  • Ensures definitions are consistent and state the essence of anatomical entities in terms of their characteristics
• Using the July 24th 2008 ALPHA version of the FMA 2.0 in the OBO Foundry
Common Anatomy Signature
• There is a significant overlap between anatomy terms in SNOMED-CT and the FMA
• Bodenreider and Zhang (2006) analyzed this overlap
  • Leveraged lexical and structural analysis
  • Identified ~7500 common concepts
• Referred to here as S_anatomy
Small Detail: SEP Triplets
• SNOMED-CT uses SEP triplets to model anatomy concepts and their relationships to each other
• For every proper SNOMED-CT anatomy concept (an Entire class), there are two auxiliary classes:
  • A Structure class
  • A Part class
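A small sketch may help show why the auxiliary classes exist. It assumes the arrangement commonly described for SEP modeling: the Structure class subsumes both the Entire and the Part class, and when X is part of Y, the Structure class of X is placed under the Part class of Y. The names `SEPTriplet`, `make_triplet`, `build_isa`, and `is_part_of`, as well as the class labels, are hypothetical and only mimic SNOMED-CT naming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SEPTriplet:
    structure: str  # "X structure": X or any part of X
    entire: str     # "Entire X": the whole anatomical entity
    part: str       # "X part": any proper part of X


def make_triplet(name):
    # Labels imitate SNOMED-CT naming; they are illustrative only.
    return SEPTriplet(f"{name} structure", f"Entire {name.lower()}", f"{name} part")


def build_isa(triplets, part_of):
    """Build is-a edges: E -> S and P -> S per triplet, S(part) -> P(whole)."""
    isa = {}
    for t in triplets.values():
        isa.setdefault(t.entire, set()).add(t.structure)
        isa.setdefault(t.part, set()).add(t.structure)
    for part_name, whole_name in part_of:
        isa.setdefault(triplets[part_name].structure, set()).add(triplets[whole_name].part)
    return isa


def is_subsumed_by(isa, child, ancestor):
    """Depth-first search up the is-a edges."""
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(isa.get(node, ()))
    return False


def is_part_of(triplets, isa, part_name, whole_name):
    # Part-whole is answered purely by subsumption: Entire(x) under Part(y).
    return is_subsumed_by(isa, triplets[part_name].entire, triplets[whole_name].part)


names = ["Heart", "Left atrium", "Left atrial appendage"]
triplets = {n: make_triplet(n) for n in names}
isa = build_isa(triplets, [("Left atrium", "Heart"),
                           ("Left atrial appendage", "Left atrium")])
print(is_part_of(triplets, isa, "Left atrial appendage", "Heart"))
```

In this sketch, transitivity of parthood comes for free from the transitivity of is-a, so no dedicated part-whole reasoning is needed.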
Example
• Main motivation is to rely on subsumption to reason about part-whole relationships
• SNOMED-CT is moving away from this, but for the purpose of using it in concert with the FMA, this is still an issue
• Previous work (Suntisrivaraporn 2007) demonstrated how an expressive description logic can be used to more directly represent mereological relations
• Build on this, but re-use terms (a transliteration) from a reference ontology of anatomy rather than re-using SNOMED-CT terms
  • To preserve the meaning of anatomy terms, increase the (latent) knowledge about them, and provide a terminology path to additional terms of interest
Reifying SEP triplets
• Need to replace SNOMED-CT anatomy terms in a way that preserves the intent of the SEP anatomy scheme
  • Transcribe them into a more expressive description logic
  • Define a set of rules to determine how axioms involving mapped SNOMED-CT terms are replaced
• Schulz et al. (1998) describe how to logically identify the components of an SEP triplet
Method
• Start with a list of user-specified SNOMED-CT concepts
  • Determines the domain
• 3-step process resulting in:
  • A SNOMED-CT module: O'snct-fma
  • Transliteration of SEP triplets
  • FMA segment: O'fma-snct
• Directly merge results into a single ontology
Segmenting and Merging Domain-specific Ontology Modules for Clinical Informatics (Ogbuji 2010)
Collecting the domain of discourse
• Sahoo et al. (2011): automatically extract a minimal common set of terms (upper-domain ontology) from an existing domain ontology
• Can be used to survey the generation of anatomy and clinical terminology modules:
  • “For a given domain, what are the most general categories of (clinical) terminology that can be automatically extracted from specific distributions of SNOMED-CT and the FMA?”
Demonstration
• Implementation (Python): http://code.google.com/p/python-dlp/wiki/ClinicalOntologyModules
• Example: Atrial Fibrillation (disorder)
