Reference Domain Ontologies and Large
Medical Language Models
Chimezie Ogbuji
Chief Medical Informatics Officer / Amara Home Care
Owner / Metacognition
Semantic Web
● Semantic Web
○ Goal: build a framework for intelligent machine understanding on top of the standards,
ubiquity, and connectedness of the World Wide Web (WWW)
○ 2006 to 2010: Most exciting time. Peak of inflated expectations
○ Driven by standardization efforts at the WWW Consortium (W3C)
○ Equal parts hype and robust infrastructure for modern applications
A cautionary tale for Large Language Models (LLM)
The Semantic Web: Where is it now? Rashif Ray Rahman / Oct 3, 2018
https://medium.com/@schivmeister/the-semantic-web-where-is-it-now-f4773f3097e3
Semantic Web Layers
How are things identified and
retrieved?
Rules, Logic, & Ontologies: How
can machines reason about the data?
How can machines ask
questions of the data?
How can data be exchanged in a
knowledge graph format?
Gartner's Hype Cycle
Gartner's 2016 Hype Cycle for Emerging Technologies
Natural Language (NL) question
answering was in the trough of
disillusionment in 2016, a year before
the transformer, a key technology
underlying today’s language models,
was introduced
A Cautionary Tale?
● SemanticDB work at the Cleveland Clinic Foundation
○ A reconceived implementation of an existing, 30-year-old registry of heart
surgery and cardiovascular intervention cases
■ 500 variables, ~ 200K patient records, 100 heart and vascular
research publications per year
○ Addressing shortcomings of conventional data warehouse functionality
■ Domain-specific criteria conceived by researchers who work with DB
administrators
○ Internally-funded project in conjunction with CCF Innovations Department to partner
with Cycorp, Inc
■ Cyc: a powerful reasoning system and knowledge base with built-in capability
for natural language.
The Semantic Research
Assistant (SRA). Query for
patients who had
a coronary artery bypass graft
(CABG) between 2008 and
2010 (inclusive) and after a
percutaneous coronary
intervention (PCI)
Pierce, C. D., Booth, D., Ogbuji, C., Deaton, C., Blackstone, E., & Lenat, D. (2012). SemanticDB: A semantic web
infrastructure for clinical research and quality reporting. Current Bioinformatics, 7(3), 267-277.
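The SRA query above can be sketched in plain Python to show the logic it has to resolve. This is only an illustration: the `Procedure` record, the patient ids, and the sample dates are hypothetical and are not the SemanticDB schema, and the query is read here as "a CABG dated 2008 through 2010 that follows at least one PCI for the same patient".

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Procedure:
    patient_id: str
    kind: str        # "CABG" or "PCI" (hypothetical codes)
    performed: date


def cabg_after_pci(procedures, start=date(2008, 1, 1), end=date(2010, 12, 31)):
    """Return sorted ids of patients with a CABG in [start, end] that follows a PCI."""
    # Record the earliest PCI per patient; any later CABG is "after a PCI".
    earliest_pci = {}
    for p in procedures:
        if p.kind == "PCI":
            previous = earliest_pci.get(p.patient_id)
            if previous is None or p.performed < previous:
                earliest_pci[p.patient_id] = p.performed

    matches = set()
    for p in procedures:
        if p.kind != "CABG" or not (start <= p.performed <= end):
            continue
        pci_date = earliest_pci.get(p.patient_id)
        if pci_date is not None and pci_date < p.performed:
            matches.add(p.patient_id)
    return sorted(matches)


records = [
    Procedure("p1", "PCI", date(2007, 5, 1)),
    Procedure("p1", "CABG", date(2009, 3, 15)),
    Procedure("p2", "CABG", date(2009, 6, 1)),    # no prior PCI
    Procedure("p3", "PCI", date(2010, 2, 1)),
    Procedure("p3", "CABG", date(2011, 1, 10)),   # outside the date window
]
print(cabg_after_pci(records))
```

Only `p1` satisfies both the date window and the ordering constraint in this sample.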
● Primary challenges at the time
○ Developing representational models (ontologies) that can cover the domain
in 200K+ patient dataset to facilitate machine reasoning
○ Resolving natural language query fragments to concepts in these models
(the purpose of the SRA and challenge of Natural Language Processing at
the time)
○ Dispatching Semantic Web queries (SPARQL) to the RDF patient
registry
○ Evaluating the queries efficiently
Parrot-like software that uses sophisticated analysis of the patterns and relationships
underlying language to simulate intelligent, natural language
“How GPT3 Works - Visualizations and Animations” - Jay Alammar
What are Large Language Models?
https://jalammar.github.io/how-gpt3-works-visualizations-animations/
Peak Hype?
Semantic Web
vs. LLM
● Semantic Web
○ The basis for the value proposition was well-understood
○ Driven mainly by the development of industry
standards
○ Adoption significantly lagged behind the research
○ Its applicability was not well defined
● Large Language Models
○ The basis for the value proposition is not fully
understood (the mechanism is a mystery to us)
○ No standardization (driven by community use)
○ Lightspeed community use keeping pace with
lightspeed research
○ Its applicability is well defined
● Artificial Neural Networks (ANN): machine learning models inspired by biological neural
networks in animal brains; deep learning is the branch that uses ANNs with many layers
● Natural language processing (NLP): interdisciplinary subfield of computer science and
linguistics concerned with the ability of computers to support and manipulate human language
● LLMs: probabilistic models of natural language using ANN and trained on large textual data
● Fine tuning: A subsequent, task-specific training performed on a model to refine it for a
specific use case
● Instruction tuning: fine-tuning that improves a model's ability to follow instructions
● Transfer learning: a technique where knowledge learned from a task is re-used to boost
performance on a related task
● Unsupervised Learning: learning patterns without being told what’s right/wrong
Terminology
Sindhu et al. "An empirical science research on bioinformatics in machine learning." 2020
What is the state of the
art of the use of LLMs in
the domain of medicine?
● MedAlpaca (4/2023)
○ Trained on Q/A pairs from online forums (52K), medical curriculum flashcards (34K),
Q/A pairs from WikiDoc (68K), and data from open NLP datasets and benchmarks.
Evaluated on United States Medical Licensing Examination (USMLE) self-assessment
datasets (119)
● Med-PaLM 2 (5/2023)
○ Trained on a multiple-choice question dataset for solving medical problems, collected
from professional examinations (183K), training datasets from several standard
multiple-choice benchmarks that were also used for evaluation (10K), and common consumer
questions (60). Evaluated on standard multiple-choice datasets
Recent Medical LLMs
● MEDITRON (11/2023)
○ Trained on a dataset of clinical practice guidelines (46K), PubMed Papers (5M) &
abstracts (16M), and standard benchmark training datasets (10,178). Evaluated on
standard multiple-choice datasets and Q/A based on PubMed abstracts
● MedPrompt (11/2023)
○ A study of how prompting alone, without additional training, can unlock GPT-4's
capabilities on medical challenge problems
● BioMistral (2/2024)
○ Trained on PMC Open Access Subset of medical research papers (~1.47M documents).
Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts
Training &
Evaluation
● Mix of training on open Q/As, multiple
choice questions, and raw text (domain
expertise or research publication)
● Most were evaluated on medical
reasoning benchmarks that provided
training data for the models before
evaluating them
● Share the broader, still-open
problem of how to evaluate LLMs
objectively
What are Ontologies and
Description Logic (DL)?
● Rigorously-specified
conceptualizations of a domain
as mathematical logic
● Usually expressed as
hierarchies of classes,
restrictions on relationships,
etc.
● Meant for automated processing
by logical reasoning tools
Mondal, Sutapa, Vijaya Raghava Mutharaju, and Sumit Bhatia. Embeddings for
the EL++ description logic. Diss. IIIT-Delhi, 2020.
Angioedema ⊑ Edema
Angioedema ⊑ ∃ morphology . angioedema
A_ACE ⊑ ∃ morphology . (Angioedema ⊓ ∃ caused_by kallidin i )
Essential hypertension ⊑ Hypertensive disorder, systemic arterial
Essential hypertension ⊑ (∃ located-in . (systemic circulatory system structure))
Internationally standardized medical terminology system with over
360K medical concepts, 1.25M relationships between them, and 9.6K
textual definitions created by domain-experts. Uses a DL that facilitates
automation and machine reasoning.
● Released in US English, UK English, UK Australian, Spanish, Danish, Dutch,
Lithuanian, Swedish, and Canadian French
What is SNOMED-CT?
Nested matryoshka dolls are a good analogy for
visualizing DL concept inclusion (⊑)
All instances of
Essential hypertension
are within the set of all
things that stand in a
located-in relation
with a systemic
circulatory system
structure
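The set-based reading above can be made concrete with a toy interpretation. This is a minimal sketch: the individuals (`case1`, `aorta`, and so on) and the extensions assigned to each concept are invented for illustration and are not SNOMED-CT content. Checking one interpretation only shows that it satisfies the axiom; it does not prove the axiom in general.

```python
# Hypothetical extensions: which individuals belong to each concept.
concepts = {
    "Essential hypertension": {"case1", "case2"},
    "Systemic circulatory system structure": {"aorta", "femoral_artery"},
}

# Hypothetical extension of the located-in role as (subject, object) pairs.
located_in = {
    ("case1", "aorta"),
    ("case2", "femoral_artery"),
    ("case3", "aorta"),
    ("case4", "lung"),
}


def exists(role, filler):
    """Extension of (∃ role . filler): things related by `role` to a member of `filler`."""
    return {subject for (subject, obj) in role if obj in filler}


def subsumed(sub, sup):
    """C ⊑ D holds in this interpretation when ext(C) is a subset of ext(D)."""
    return sub <= sup


located_in_systemic = exists(
    located_in, concepts["Systemic circulatory system structure"]
)
print(sorted(located_in_systemic))
print(subsumed(concepts["Essential hypertension"], located_in_systemic))
```

The inner doll is the set of Essential hypertension cases; the outer doll is everything located in some systemic circulatory system structure, which here also contains `case3`.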
∃ - existential role restrictions
Angioedema caused by angiotensin-converting-enzyme inhibitor
(A_ACE)
⊓ - intersection of concepts
∃ - existential role restriction
named concept
DLs are designed for computer processing and not easily read by non-
mathematicians
What are Controlled Natural
Languages (CNL)?
Since CNLs are based on natural languages, their grammars use the same
syntactic structures: sentences, noun phrases, verb phrases, and relative clauses
CNLs were originally designed for use by domain experts to encode knowledge
without working directly in DL
Kuhn, Tobias. "The understandability of OWL statements in controlled
English." (2013): 101-115
“Every A_ACE morphology an Angioedema caused_by Kallidin i”
● Adopt a CNL for use
with SNOMED-CT
○ Using phraseology
appropriate for the domain
(pathophysiology)
● The CNL phrases generated can
be used as training data for
LLMs
“Every A_ACE is characterized in form by an Angioedema caused by Kallidin i”
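One way to picture how a DL axiom becomes a domain-appropriate CNL phrase is a small recursive renderer. The sketch below is an assumption-laden illustration, not the tooling used in this work: the classes `Named`, `Exists`, and `And`, the `ROLE_PHRASES` table, and the `takes_article` flag are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Named:
    name: str
    takes_article: bool = True  # substances such as "Kallidin i" are rendered bare


@dataclass(frozen=True)
class Exists:
    role: str
    filler: "Concept"


@dataclass(frozen=True)
class And:
    head: Named
    restriction: Exists


Concept = Union[Named, Exists, And]

# Hypothetical mapping from role names to pathophysiology phraseology.
ROLE_PHRASES = {
    "morphology": "characterized in form by",
    "caused_by": "caused by",
}


def noun_phrase(concept: Concept) -> str:
    """Render a named concept or an intersection as a noun phrase."""
    if isinstance(concept, Named):
        if not concept.takes_article:
            return concept.name
        article = "an" if concept.name[0].lower() in "aeiou" else "a"
        return f"{article} {concept.name}"
    if isinstance(concept, And):
        return f"{noun_phrase(concept.head)} {modifier(concept.restriction)}"
    raise TypeError(f"cannot render {concept!r} as a noun phrase")


def modifier(restriction: Exists) -> str:
    """Render an existential restriction as a participial phrase."""
    return f"{ROLE_PHRASES[restriction.role]} {noun_phrase(restriction.filler)}"


def sentence(sub: Named, sup: Concept) -> str:
    """Render the inclusion sub ⊑ sup as an 'Every ...' sentence."""
    if isinstance(sup, Exists):
        return f"Every {sub.name} is {modifier(sup)}"
    return f"Every {sub.name} is {noun_phrase(sup)}"


kallidin = Named("Kallidin i", takes_article=False)
a_ace_definition = Exists(
    "morphology", And(Named("Angioedema"), Exists("caused_by", kallidin))
)
print(sentence(Named("A_ACE"), a_ace_definition))
print(sentence(Named("Angioedema"), Named("Edema")))
```

The first call reproduces the phrase shown above for A_ACE; the second renders the simple inclusion Angioedema ⊑ Edema in the same style.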
● SNOMED-CT includes text definitions
○ “[..] applied to some SNOMED CT concepts that provides additional information about the
intended meaning or usage of the concept.”
● These can be used in addition to SNOMED-CT CNL phrases to train LLMs
Text Definitions
Angioedema ⊑ Non-allergic hypersensitivity reaction
Non-allergic hypersensitivity reaction ⊑ Non-allergic hypersensitivity process
“Every Angioedema is a Non-allergic hypersensitivity reaction”
“Every Non-allergic hypersensitivity reaction is a Non-allergic hypersensitivity process”
Non-allergic hypersensitivity process (SNOMED-CT’s Text definition)
“A pathological nonimmune process generally directed towards a foreign substance, which
results in tissue injury, which is usually transient. It is the realization of the pseudoallergic
disposition. A variety of mechanisms such as direct histamine release, complement
activation, cyclooxygenase activation and bradykinin generation may be involved.”
Combining CNL and Text
Definitions
● Create a dataset of instructions to define SNOMED-CT terms
○ Create full definition instructions and instructions on the component parts of each
definition (granular instructions)
● Train an existing model on the text definitions
● Perform instruction fine tuning on the resulting model
○ Use transfer learning to boost learning of SNOMED-CT definitions from the training
on textual definitions
○ Definition Rendering Instruction Fine Tuning (DRIFT)
Training an LLM on SNOMED-CT CNL phrases and Text Definitions
Putting into Practice
What is Angiotensin converting enzyme inhibitor-aggravated angioedema?
Angiotensin converting enzyme inhibitor-aggravated angioedema is defined in SNOMED-CT
as a disorder (a clinical diagnosis). Angiotensin converting enzyme inhibitor-aggravated
angioedema is an angioedema caused by angiotensin-converting-enzyme inhibitor, a drug-
aggravated angioedema-urticaria, [..] It is a hypersensitivity process characterized in form
by an angioedema caused by a substance with angiotensin-converting enzyme inhibitor
mechanism of action and is characterized in form by an angioedema caused by kallidin i.
What causes an Angiotensin converting enzyme inhibitor-aggravated angioedema?
It is caused by kallidin i
What is the morphology of an Angiotensin converting enzyme inhibitor-aggravated
angioedema?
It is characterized in form by an angioedema
Full / Granular Definition
Instructions
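The full and granular instructions above could be assembled from a concept's definition with simple templates. This is a sketch under stated assumptions: the record layout (`instruction`, `response`, `kind`), the template tables, and the function name are hypothetical, and the full definition is kept truncated exactly as it is in the example.

```python
# Hypothetical question/answer templates keyed by role name.
QUESTION_TEMPLATES = {
    "caused_by": "What causes {article} {term}?",
    "morphology": "What is the morphology of {article} {term}?",
}
ANSWER_TEMPLATES = {
    "caused_by": "It is caused by {value}",
    "morphology": "It is characterized in form by {value}",
}


def indefinite_article(phrase):
    """Pick 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if phrase[0].lower() in "aeiou" else "a"


def build_instructions(term, full_definition, restrictions):
    """Return one full-definition record plus one granular record per restriction."""
    records = [
        {"instruction": f"What is {term}?", "response": full_definition, "kind": "full"}
    ]
    for role, value in restrictions:
        question = QUESTION_TEMPLATES[role].format(
            article=indefinite_article(term), term=term
        )
        answer = ANSWER_TEMPLATES[role].format(value=value)
        records.append({"instruction": question, "response": answer, "kind": "granular"})
    return records


term = "Angiotensin converting enzyme inhibitor-aggravated angioedema"
records = build_instructions(
    term,
    f"{term} is defined in SNOMED-CT as a disorder (a clinical diagnosis). [..]",
    [("caused_by", "kallidin i"), ("morphology", "an angioedema")],
)
for record in records:
    print(record["kind"], "|", record["instruction"], "|", record["response"])
```

Keeping the `kind` field on every record makes it straightforward to use all full definitions for training while holding out part of the granular ones.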
● SNOMED CT concepts are organised into 19
distinct hierarchies, covering different aspects
of healthcare
● Generate definitions from subset of hierarchies
dealing with medical problems
○ Clinical finding (includes findings and disorders)
○ Subset of Body structure: Morphological abnormality which
physically characterize disorders
○ Situation with explicit context (situation)
Subset of SNOMED-CT
Experiment 626
● Began with OpenHermes-2.5-Mistral-7B model
● Performed unsupervised training on (7,694) SNOMED-CT text definitions
○ Using September 23rd 2023 release of SNOMED CT United States
Edition
● Performed DRIFT on medical problem hierarchies
○ Used full instruction definitions (130K)
○ Added 80% of granular instructions from each category (204K)
○ Validated training using remaining 20% (102K)
○ Used QLoRA fine tuning on Apple Mac Studio M1 Ultra with 128GB
RAM (OoriData servers)
● Runtime of 1 day 17 hours
● Software used: mlx, mlx-tuning-fork, Ogbuji-PT, and django-snomed-ct
https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5
Training
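The per-category 80/20 split of granular instructions can be sketched as follows. What counts as a "category" is an assumption here (the sample uses the role name), and the `category` field, function name, and fixed seed are hypothetical choices for illustration rather than the setup used in the experiment.

```python
import random
from collections import defaultdict


def split_by_category(records, train_fraction=0.8, seed=0):
    """Shuffle within each category, then split each one into train/validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    train, validation = [], []
    for category in sorted(by_category):
        items = list(by_category[category])
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        validation.extend(items[cut:])
    return train, validation


# Hypothetical granular instruction records, ten per category.
granular = [
    {"category": category, "id": index}
    for category in ("caused_by", "morphology")
    for index in range(10)
]
train, validation = split_by_category(granular)
print(len(train), len(validation))
```

Splitting inside each category, instead of over the pooled records, keeps every category represented in both the training and validation sets.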
Conclusion (future considerations)
● Investigate how the logical reasoning enabled by DL, together with term synonyms, can be
leveraged to generate further gold-standard text for medical language model
training
○ Transitivity of relations, ACE vs. “Angiotensin converting enzyme”
● Use other LLMs (Medical LLMs, larger LLMs, etc.)
● Train on other (or all) SNOMED-CT categories
● Try other foundational biomedical ontologies (widely adopted and include
textual definitions):
○ Foundational Model of Anatomy (FMA): 120K classes and > 2.1M relationships
○ Gene Ontology (GO): 42K classes
● Evaluate against standard medical reasoning benchmarks (with and without
training against their data)
● Investigate prompting strategies in depth (Chain of biological thought, etc.)
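As an illustration of the first consideration (transitivity), inferred concept inclusions can be turned into extra CNL sentences. This is a minimal sketch, not an implemented part of this work: the function name is hypothetical, it handles only named-concept inclusions, and the article "a" is hard-coded for brevity.

```python
def inferred_subsumptions(asserted):
    """Return (sub, sup) pairs entailed by transitivity of ⊑ but not asserted."""
    parents = {}
    for sub, sup in asserted:
        parents.setdefault(sub, set()).add(sup)

    def ancestors(concept, seen=None):
        # Depth-first walk up the hierarchy; `seen` guards against cycles.
        seen = set() if seen is None else seen
        for parent in parents.get(concept, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    asserted_set = set(asserted)
    return sorted(
        (sub, sup)
        for sub in parents
        for sup in ancestors(sub)
        if (sub, sup) not in asserted_set
    )


asserted = [
    ("Angioedema", "Non-allergic hypersensitivity reaction"),
    ("Non-allergic hypersensitivity reaction", "Non-allergic hypersensitivity process"),
]
inferred = inferred_subsumptions(asserted)
for sub, sup in inferred:
    print(f"Every {sub} is a {sup}")
```

Here the two asserted inclusions from the Text Definitions example yield one additional sentence about Angioedema that was never stated directly.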
Questions?
https://linkr.bio/chimezie
https://www.researchgate.net/profile/Chimezie-Ogbuji
https://www.linkedin.com/in/chimezie/
https://chimezie.medium.com/
https://huggingface.co/cogbuji
https://github.com/chimezie
https://github.com/OoriData

Gen AI in Business - Global Trends Report 2024.pdf
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 

Reference Domain Ontologies and Large Medical Language Models.pptx

  • 1. Reference Domain Ontologies and Large Medical Language Models. Chimezie Ogbuji, Chief Medical Informatics Officer / Amara Home Care; Owner / Metacognition
  • 2. Semantic Web ● Semantic Web ○ Goal: build a framework for intelligent machine understanding on the standards, ubiquity, and connectedness of the World Wide Web (WWW) ○ 2006 to 2010: Most exciting time. Peak of inflated expectations ○ Driven by standardization efforts at the WWW Consortium (W3C) ○ Equal parts hype and robust infrastructure for modern applications. A cautionary tale for Large Language Models (LLMs). The Semantic Web: Where is it now? Rashif Ray Rahman / Oct 3, 2018 https://medium.com/@schivmeister/the-semantic-web-where-is-it-now-f4773f3097e3
  • 3. Semantic Web Layers How are things identified and retrieved? Rules, Logic, & Ontologies: How can machines reason about the data? How can machines ask questions of the data? How can data be exchanged in a knowledge graph format?
  • 5. Gartner's 2016 Hype Cycle for Emerging Technologies. Natural Language (NL) question answering was in the trough of disillusionment in 2016, a year before the introduction of transformers, a key technology underlying today’s language models. A Cautionary Tale?
  • 6. ● SemanticDB work at the Cleveland Clinic Foundation ○ A reconceived implementation of an existing, 30-year-old registry of heart surgery and cardiovascular intervention cases ■ 500 variables, ~ 200K patient records, 100 heart and vascular research publications per year ○ Addressing shortcomings of conventional data warehouse functionality ■ Domain-specific criteria conceived by researchers who work with DB administrators ○ Internally funded project in conjunction with CCF Innovations Department to partner with Cycorp, Inc. ■ Cyc: a powerful reasoning system and knowledge base with built-in capability for natural language.
  • 7. The Semantic Research Assistant (SRA). Example query: patients who had a coronary artery bypass graft (CABG) between 2008 and 2010 (inclusive) and after a percutaneous coronary intervention (PCI). Pierce, C. D., Booth, D., Ogbuji, C., Deaton, C., Blackstone, E., & Lenat, D. (2012). SemanticDB: A semantic web infrastructure for clinical research and quality reporting. Current Bioinformatics, 7(3), 267-277.
  • 8. ● Primary challenges at the time ○ Developing representational models (ontologies) that can cover the domain of a 200K+ patient dataset to facilitate machine reasoning ○ Resolving natural language query fragments to concepts in these models (the purpose of the SRA and a challenge for Natural Language Processing at the time) ○ Dispatching Semantic Web queries (SPARQL) to the RDF patient registry ○ Evaluating the queries efficiently
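As a rough illustration of the second and third challenges above, the sketch below resolves query fragments to concepts with a lookup table and assembles a SPARQL query for the CABG-after-PCI example. The `ex:` vocabulary, the property names, and the `FRAGMENT_TO_CONCEPT` table are hypothetical; none of them are taken from SemanticDB or the SRA.

```python
# Hypothetical fragment-to-concept table; a real system would resolve
# fragments against the registry's ontologies rather than a dictionary.
FRAGMENT_TO_CONCEPT = {
    "coronary artery bypass graft": "ex:CABG",
    "cabg": "ex:CABG",
    "percutaneous coronary intervention": "ex:PCI",
    "pci": "ex:PCI",
}


def resolve(fragment: str) -> str:
    """Map a natural language fragment to a (hypothetical) concept name."""
    return FRAGMENT_TO_CONCEPT[fragment.strip().lower()]


def build_query(target: str, prior: str, first_year: int, last_year: int) -> str:
    """Build a SPARQL query for patients whose `target` procedure fell in
    [first_year, last_year] and happened after a `prior` procedure."""
    return f"""PREFIX ex: <http://example.org/registry#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?patient WHERE {{
  ?patient ex:hasProcedure ?target, ?prior .
  ?target a {resolve(target)} ; ex:date ?targetDate .
  ?prior a {resolve(prior)} ; ex:date ?priorDate .
  FILTER (?targetDate >= "{first_year}-01-01"^^xsd:date &&
          ?targetDate <= "{last_year}-12-31"^^xsd:date &&
          ?targetDate > ?priorDate)
}}"""


query = build_query("CABG", "PCI", 2008, 2010)
print(query)
```

The resulting string could be sent to any SPARQL endpoint; evaluating it efficiently over a large registry is the fourth challenge and is not addressed here.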
  • 9. What are Large Language Models? Parrot-like software that uses sophisticated analysis of patterns and relationships underlying language to simulate intelligent, natural language. “How GPT3 Works - Visualizations and Animations” - Jay Alammar https://jalammar.github.io/how-gpt3-works-visualizations-animations/
  • 11. Semantic Web vs. LLM ● Semantic Web ○ The basis for the value proposition was well-understood ○ Driven mainly by the development of industry standards ○ Adoption significantly lagged behind the research ○ Its applicability was not well defined ● Large Language Models ○ The basis for the value proposition is not fully understood (the mechanism is a mystery to us) ○ No standardization (driven by community use) ○ Lightspeed community use keeping pace with lightspeed research ○ Its applicability is well defined
  • 12. Terminology ● Artificial Neural Networks (ANN): a branch of machine learning inspired by biological neural networks in animal brains ● Natural language processing (NLP): interdisciplinary subfield of computer science and linguistics concerned with the ability of computers to support and manipulate human language ● LLMs: probabilistic models of natural language using ANN and trained on large textual data ● Fine tuning: a subsequent, task-specific training performed on a model to refine it for a specific use case ● Instruction tuning: fine-tuning that improves a model's ability to follow instructions ● Transfer learning: a technique where knowledge learned from a task is re-used to boost performance on a related task ● Unsupervised learning: learning patterns without being told what’s right/wrong. Sindhu, et al. "An empirical science research on bioinformatics in machine learning." 2020
  • 13. What is the state of the art in the use of LLMs in the domain of medicine?
  • 14. Recent Medical LLMs ● MedAlpaca (4/2023) ○ Trained on Q/A pairs from online forums (52K), medical curriculum flashcards (34K), Q/A pairs from WikiDoc (68K), and data from open NLP datasets and benchmarks. Evaluated on United States Medical Licensing Examination (USMLE) self-assessment datasets (119) ● Med-PaLM 2 (5/2023) ○ Trained on a multiple-choice question dataset for solving medical problems, collected from professional examinations (183K), several standard multiple-choice benchmark training datasets also used for evaluation (10K), and common consumer questions (60). Evaluated on standard multiple-choice datasets
  • 15. ● MEDITRON (11/2023) ○ Trained on a dataset of clinical practice guidelines (46K), PubMed papers (5M) & abstracts (16M), and standard benchmark training datasets (10,178). Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts ● MedPrompt (11/2023) ○ A study of how prompting alone can unlock GPT-4's capabilities on medical challenge problems, without additional training ● BioMistral (2/2024) ○ Trained on PMC Open Access Subset of medical research papers (~1.47M documents). Evaluated on standard multiple-choice datasets and Q/A based on PubMed abstracts
  • 16. Training & Evaluation ● Mix of training on open Q/As, multiple-choice questions, and raw text (domain expertise or research publications) ● Most were evaluated on medical reasoning benchmarks whose training data the models had already been trained on ● All suffer from the more general, still-open problem of how to objectively evaluate LLMs
  • 17. What are Ontologies and Description Logic (DL)?
  • 18. ● Rigorously-specified conceptualizations of a domain as mathematical logic ● Usually expressed as hierarchies of classes, restrictions on relationships, etc. ● Meant for automated processing by logical reasoning tools
  • 19. Mondal, Sutapa, Vijaya Raghava Mutharaju, and Sumit Bhatia. Embeddings for the EL++ description logic. Diss. IIIT-Delhi, 2020.
  • 20. What is SNOMED-CT? Internationally standardized medical terminology system with over 360K medical concepts, 1.25M relationships between them, and 9.6K textual definitions created by domain experts. Uses a DL that facilitates automation and machine reasoning. ● Released in US English, UK English, UK Australian, Spanish, Danish, Dutch, Lithuanian, Swedish, and Canadian French. Example axioms: Angioedema ⊑ Edema; Angioedema ⊑ ∃ morphology . angioedema; A_ACE ⊑ ∃ morphology . (Angioedema ⊓ ∃ caused_by kallidin i); Essential hypertension ⊑ Hypertensive disorder, systemic arterial; Essential hypertension ⊑ (∃ located-in . (systemic circulatory system structure))
  • 21. Nested matryoshka dolls are a good analogy for visualizing DL concept inclusion (⊑): all instances of Essential hypertension are within the set of all things that stand in a located-in relation with a systemic circulatory system structure. ∃ - existential role restrictions
  • 22. Angioedema caused by angiotensin-converting-enzyme inhibitor (A_ACE). Legend: ⊓ - intersection of concepts; ∃ - existential role restriction; named concept
  • 23. What are Controlled Natural Languages (CNL)? DLs are designed for computer processing and are not easily read by non-mathematicians. Since CNLs are based on natural languages, their grammars use the same syntactic structures: sentences, noun phrases, verb phrases, and relative clauses. CNLs were originally designed for use by domain experts to encode knowledge without working directly in DL
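The three constructs named on this slide (named concept, ⊓, ∃) can be modeled as a small recursive data structure. The following is a minimal sketch, not the representation used by SNOMED-CT tooling; the class names `Named`, `Exists`, and `And` and the `to_dl` helper are made up for illustration, and the rendering places a dot after every role name, which the slide's notation does not always do.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Named:
    """A named concept such as Angioedema."""
    name: str


@dataclass(frozen=True)
class Exists:
    """An existential role restriction: ∃ role . filler."""
    role: str
    filler: "Concept"


@dataclass(frozen=True)
class And:
    """An intersection of two concepts: left ⊓ right."""
    left: "Concept"
    right: "Concept"


Concept = Union[Named, Exists, And]


def to_dl(concept: Concept) -> str:
    """Render a concept expression in DL notation."""
    if isinstance(concept, Named):
        return concept.name
    if isinstance(concept, Exists):
        filler = to_dl(concept.filler)
        # Parenthesize compound fillers so the scope of ∃ stays unambiguous.
        if not isinstance(concept.filler, Named):
            filler = f"({filler})"
        return f"∃ {concept.role} . {filler}"
    return f"{to_dl(concept.left)} ⊓ {to_dl(concept.right)}"


# Right-hand side of the A_ACE axiom from the slides.
a_ace_definition = Exists(
    "morphology",
    And(Named("Angioedema"), Exists("caused_by", Named("kallidin i"))),
)
axiom = f"A_ACE ⊑ {to_dl(a_ace_definition)}"
print(axiom)
```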
  • 24. Kuhn, Tobias. "The understandability of OWL statements in controlled English." (2013): 101-115
  • 25. “Every A_ACE morphology an Angioedema caused_by Kallidin i”
  • 26. ● Adopt a CNL for use with SNOMED-CT ○ Using phraseology appropriate for the domain (pathophysiology) ● The CNL phrases generated can be used as training data for LLMs “Every A_ACE is characterized in form by an Angioedema caused by Kallidin i”
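A minimal sketch of how such domain-appropriate CNL phrases might be generated from axioms. The `ROLE_PHRASES` table and both function names are assumptions for illustration; only the two phrasings that appear in the slides ("is characterized in form by", "caused by") are included, and the article rule is a simple first-letter heuristic.

```python
# Hypothetical mapping from DL role names to domain phraseology.
ROLE_PHRASES = {
    "morphology": "is characterized in form by",
    "caused_by": "caused by",
}


def article(noun: str) -> str:
    """Pick 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def render_subsumption(sub: str, sup: str) -> str:
    """Render 'sub ⊑ sup' for two named concepts."""
    return f"Every {sub} is {article(sup)} {sup}"


def render_restriction(sub, role, filler, qualifier=None):
    """Render 'sub ⊑ ∃ role . filler', optionally with one nested
    (role, filler) pair qualifying the filler."""
    phrase = f"Every {sub} {ROLE_PHRASES[role]} {article(filler)} {filler}"
    if qualifier is not None:
        nested_role, nested_filler = qualifier
        phrase += f" {ROLE_PHRASES[nested_role]} {nested_filler}"
    return phrase


sentence = render_restriction(
    "A_ACE", "morphology", "Angioedema", qualifier=("caused_by", "Kallidin i")
)
print(sentence)
```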
  • 27. Text Definitions ● SNOMED-CT includes text definitions ○ “[..] applied to some SNOMED CT concepts that provides additional information about the intended meaning or usage of the concept.” ● These can be used in addition to SNOMED-CT CNL phrases to train LLMs
  • 28. Combining CNL and Text Definitions. Axioms: Angioedema ⊑ Non-allergic hypersensitivity reaction; Non-allergic hypersensitivity reaction ⊑ Non-allergic hypersensitivity process. CNL: “Every Angioedema is a Non-allergic hypersensitivity reaction”; “Every Non-allergic hypersensitivity reaction is a Non-allergic hypersensitivity process”. Non-allergic hypersensitivity process (SNOMED-CT’s text definition): “A pathological nonimmune process generally directed towards a foreign substance, which results in tissue injury, which is usually transient. It is the realization of the pseudoallergic disposition. A variety of mechanisms such as direct histamine release, complement activation, cyclooxygenase activation and bradykinin generation may be involved.”
  • 29. Putting into Practice: Training an LLM on SNOMED-CT CNL phrases and Text Definitions ● Create a dataset of instructions to define SNOMED-CT terms ○ Create full definition instructions and instructions on the component parts of each definition (granular instructions) ● Train an existing model on the text definitions ● Perform instruction fine tuning on the resulting model ○ Use transfer learning to boost learning of SNOMED-CT definitions from the training on textual definitions ○ Definition Rendering Instruction Fine Tuning (DRIFT)
  • 30. Full / Granular Definition Instructions. Q: What is Angiotensin converting enzyme inhibitor-aggravated angioedema? A: Angiotensin converting enzyme inhibitor-aggravated angioedema is defined in SNOMED-CT as a disorder (a clinical diagnosis). Angiotensin converting enzyme inhibitor-aggravated angioedema is an angioedema caused by angiotensin-converting-enzyme inhibitor, a drug-aggravated angioedema-urticaria, [..] It is a hypersensitivity process characterized in form by an angioedema caused by a substance with angiotensin-converting enzyme inhibitor mechanism of action and is characterized in form by an angioedema caused by kallidin i. Q: What causes an Angiotensin converting enzyme inhibitor-aggravated angioedema? A: It is caused by kallidin i. Q: What is the morphology of an Angiotensin converting enzyme inhibitor-aggravated angioedema? A: It is characterized in form by an angioedema.
  • 31. Subset of SNOMED-CT ● SNOMED CT concepts are organised into 19 distinct hierarchies, covering different aspects of healthcare ● Generate definitions from the subset of hierarchies dealing with medical problems ○ Clinical finding (includes findings and disorders) ○ Subset of Body structure: Morphological abnormality, which physically characterizes disorders ○ Situation with explicit context (situation)
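One way to picture the combination: walk the subsumption chain upward from a concept and collect any text definitions attached to its ancestors, so that a concept without its own definition can still be paired with one. This is a sketch under that assumption; the dictionaries and function names are illustrative, and the definition string is abbreviated to the first sentence of the one quoted above.

```python
from collections import deque

# The two subsumption axioms from this slide, as child -> parents.
SUBSUMED_BY = {
    "Angioedema": ["Non-allergic hypersensitivity reaction"],
    "Non-allergic hypersensitivity reaction": [
        "Non-allergic hypersensitivity process"
    ],
}

# Abbreviated to the first sentence of the definition quoted on the slide.
TEXT_DEFINITIONS = {
    "Non-allergic hypersensitivity process": (
        "A pathological nonimmune process generally directed towards a "
        "foreign substance, which results in tissue injury, which is "
        "usually transient."
    ),
}


def ancestors(concept):
    """Return all subsumers of `concept`, nearest first (breadth-first)."""
    seen, ordered = set(), []
    queue = deque(SUBSUMED_BY.get(concept, []))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue
        seen.add(parent)
        ordered.append(parent)
        queue.extend(SUBSUMED_BY.get(parent, []))
    return ordered


def inherited_definitions(concept):
    """Map each ancestor that has a text definition to that definition."""
    return {
        ancestor: TEXT_DEFINITIONS[ancestor]
        for ancestor in ancestors(concept)
        if ancestor in TEXT_DEFINITIONS
    }


print(ancestors("Angioedema"))
```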
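The full and granular instructions on this slide follow regular templates, so they can be sketched as simple string formatting. The `prompt`/`response` field names and the function signatures below are assumptions for illustration, not the dataset format actually used for DRIFT.

```python
def article(noun: str) -> str:
    """Pick 'a' or 'an' from the first letter (a rough heuristic)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def full_instruction(term: str, definition: str) -> dict:
    """One instruction asking for the complete rendered definition."""
    return {"prompt": f"What is {term}?", "response": definition}


def granular_instructions(term: str, cause: str, morphology: str) -> list:
    """One instruction per component of the definition."""
    subject = f"{article(term)} {term}"
    return [
        {
            "prompt": f"What causes {subject}?",
            "response": f"It is caused by {cause}",
        },
        {
            "prompt": f"What is the morphology of {subject}?",
            "response": (
                f"It is characterized in form by {article(morphology)} {morphology}"
            ),
        },
    ]


term = "Angiotensin converting enzyme inhibitor-aggravated angioedema"
pairs = granular_instructions(term, cause="kallidin i", morphology="angioedema")
for pair in pairs:
    print(pair["prompt"], "->", pair["response"])
```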
  • 33. ● Began with the OpenHermes-2.5-Mistral-7B model ● Performed unsupervised training on 7,694 SNOMED-CT text definitions ○ Using September 23rd 2023 release of SNOMED CT United States Edition ● Performed DRIFT on medical problem hierarchies ○ Used full instruction definitions (130K) ○ Added 80% of granular instructions from each category (204K) ○ Validated training using remaining 20% (102K) ○ Used QLoRA fine tuning on Apple Mac Studio M1 Ultra with 128GB RAM (OoriData servers) ● Runtime of 1 day 17 hours ● Software used: mlx, mlx-tuning-fork, Ogbuji-PT, and django-snomed-ct https://huggingface.co/cogbuji/Mr-Grammatology-clinical-problems-Mistral-7B-0.5
  • 35. Conclusion (future considerations) ● Investigate how logical reasoning enabled by DL and term synonyms can be leveraged to further generate gold-standard text for medical language model training ○ Transitivity of relations, ACE vs. “Angiotensin converting enzyme” ● Use other LLMs (medical LLMs, larger LLMs, etc.) ● Train on other (or all) SNOMED-CT categories ● Try other foundational biomedical ontologies (widely adopted and including textual definitions): ○ Foundational Model of Anatomy (FMA): 120K classes and > 2.1M relationships ○ Gene Ontology (GO): 42K classes ● Evaluate against standard medical reasoning benchmarks (with and without training against their data) ● Investigate prompting strategies in depth (Chain of biological thought, etc.)
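The per-category 80/20 division of granular instructions could look like the sketch below. It is an assumption about how such a split might be done, not the procedure used to train the model above; the function name, the seed, and the toy category names are illustrative.

```python
import random


def split_by_category(instructions_by_category, train_fraction=0.8, seed=0):
    """Shuffle each category independently, then send the first
    `train_fraction` of it to training and the rest to validation."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for category in sorted(instructions_by_category):
        items = list(instructions_by_category[category])
        rng.shuffle(items)
        cut = int(len(items) * train_fraction)
        train.extend(items[:cut])
        validation.extend(items[cut:])
    return train, validation


# Toy data standing in for granular instructions grouped by category.
toy = {
    "cause": [f"cause-{i}" for i in range(10)],
    "morphology": [f"morphology-{i}" for i in range(5)],
}
train, validation = split_by_category(toy)
print(len(train), len(validation))
```

Splitting within each category, rather than over the pooled set, keeps every kind of granular instruction represented in both training and validation.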