Large Language Models
& Applications in Healthcare
Asma Ben Abacha
Guest Lecture, University of Illinois,
Urbana-Champaign,
February 29, 2024
Abstract
The time that medical doctors spend on Electronic Health Record systems was shown to
contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate
exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day
writing clinical notes from their conversations with the patients. Recent studies on clinical
note generation have shown that doctors can save a significant amount of time with
automatic note generation systems. Progress in LLMs can play a key role in enabling
further such systems and improving their performance. However, this requires high-
quality datasets and benchmarks and relevant evaluation metrics to assess which model
would best serve clinicians in their daily practice. In this lecture, I’ll present LLM-based
solutions for the task of clinical note generation from doctor-patient conversations. I’ll also
present insights from different evaluation studies and shared tasks that we organized on
this topic.
Another important aspect of supporting healthcare providers with documentation and
clinical decisions is to detect medical errors in clinical notes and to suggest corrections
(e.g., diagnosis, treatment, medication). Such errors require medical expertise and
knowledge to be both identified and corrected. Recent LLMs have shown promise when
applied to unseen tasks, with competitive performance. The second part of the lecture will cover
a new research endeavor on medical error detection and correction and present LLM-
based solutions for the task.
Plan
I. Large Language Models (LLMs) in
Healthcare
II. Clinical Note Generation from
Doctor-Patient Conversations
III. Medical Error Detection &
Correction
[Figure: timeline of Large Language Models (LLMs) and Medical LLMs, 2018-2023]
Large Language Models (LLMs): GPT-1, BERT, GPT-2, ERNIE, XLNet, RoBERTa, GPT-3, T5-xxl, XLM-R, PaLM, BLOOM, ChatGPT, BARD, GPT-4, Falcon, Claude 2, LLaMA 2, Mistral 7B
Medical LLMs: BioBERT, SciBERT, ClinicalBERT, Clinical-T5, BioGPT, Med-PaLM, Med-PaLM 2, PMC-LLaMA
LLMs in Healthcare
Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of
clinical practice and enhance patients' experiences:
1 INFORMATION EXTRACTION
• Extracting useful information and insights from Electronic Health Records.
• Assisting doctors with literature review and easier access to up-to-date and trustworthy information.
2 PREDICTION & RECOMMENDATION
• Predicting disease progression and identifying high-risk patients.
• Recommending personalized treatment plans, interventions, clinical trials, and improving clinical workflows.
3 GENERATION
• Generating clinical notes and reducing doctors' documentation time.
• Automating administrative tasks such as medical coding and billing.
• Generating differential diagnoses and improving diagnosis accuracy.
Clinical Note Generation
from Doctor-Patient Conversations
INTRODUCTION
• Medical doctors spend on average 52 to 102 minutes per day
writing clinical notes from their conversations with the
patients (Hripcsak et al., 2011).
• The time spent with Electronic Health Record systems
contributes to high rates of attrition, with a burnout rate
already exceeding 50% (Arndt et al., 2017).
• Generating clinical notes from doctor-patient conversations
can reduce doctors' workload by letting them edit and validate
the generated notes instead of writing the full notes during
the consultations.
• However, the lack of publicly available doctor-patient
dialogue datasets hinders research efforts from the NLP
community on this summarization/generation task.
New Datasets
• We created and released two new datasets free of Protected Health Information (PHI) that can be used for benchmarking and research on clinical note generation and that can accelerate research efforts on this understudied NLP task.
MTS-Dialog Dataset [1]
• To avoid privacy infringement risks, we created simulated conversations from publicly available de-identified clinical notes from the MTSamples collection.
• Eight trained annotators used the note sections to write simulated conversations according to guidelines derived from a study of a large private collection of real conversations and notes.
ACI-BENCH: Ambient Clinical Intelligence Benchmark [2]
• Includes full clinical notes and associated conversations from simulated encounters.
• Four annotators with medical backgrounds validated the dataset.
[1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
[2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
Example from the MTS-Dialog Dataset
• The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note sections).
• The clinical notes cover the six most frequent note types and specialties in the collection, including: General
Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
Example from the ACI-BENCH Dataset
• The ACI-BENCH dataset consists of 207 pairs of full doctor-patient conversations and associated clinical notes.
Summarization of Short Doctor-Patient Conversations
o We fine-tuned and evaluated summarization models on the MTS-Dialog dataset.
o Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum reached a ROUGE-1 score of 40.15% vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights the importance of using relevant pre-finetuning datasets.
o Guided summarization and data augmentation helped improve ROUGE-1 from 40% to 42%.
o We selected the best-performing model from each category for further studies using fact-based metrics, BERTScore, and BLEURT.
o Guided Summarization (GS) led to a consistent improvement across all automated metrics except for ROUGE-2 and fact-based metrics.
o Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
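The ROUGE-1 scores reported for these models measure unigram overlap between a generated summary and its reference. The snippet below is a minimal sketch of a ROUGE-1 F1 computation, assuming plain whitespace tokenization and no stemming; the function name `rouge1_f1` and the example sentences are illustrative, and official ROUGE implementations may tokenize and normalize differently.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: lowercased whitespace tokens, no stemming or stopword removal.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) note sentences, not taken from MTS-Dialog.
reference = "patient reports chest pain for two days"
candidate = "patient reports chest pain since yesterday"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.4f}")
```

Four of the six candidate tokens and four of the seven reference tokens match here, so precision and recall differ and the F1 balances the two.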
Summarization of Long Doctor-Patient Conversations
• We fine-tuned and evaluated different summarization models at the full note level on the ACI-BENCH dataset.
• Simple retrieval-based methods provided strong baselines with better out-of-the-box performance than Longformer-Encoder-Decoder (LED) models and full-note BART models.
• Division-based generation worked better for BART and LED fine-tuned models, with 53.46 ROUGE-1 for BART+FTSAMSum(Division).
• OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting.
• GPT-4 achieved the highest MEDCON UMLS-based evaluation score (57.78) while ranking second to third in ROUGE scores (51.76 ROUGE-1, 22.58 ROUGE-2, and 45.97 ROUGE-L).
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation.
Nature Scientific Data 2023.
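To illustrate what a simple retrieval-based baseline can look like, here is a minimal sketch that returns the note paired with the most similar training conversation. The Jaccard token similarity, the function names, and the toy training pairs are assumptions made for illustration; the retrieval methods evaluated on ACI-BENCH may use different similarity models.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (an assumed similarity)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve_note(conversation: str, train_pairs: list[tuple[str, str]]) -> str:
    """Return the note of the training conversation closest to the input."""
    _, best_note = max(train_pairs, key=lambda pair: jaccard(conversation, pair[0]))
    return best_note


# Hypothetical (conversation, note) training pairs, made up for this sketch.
train_pairs = [
    ("doctor: any chest pain? patient: yes, since monday",
     "CHIEF COMPLAINT: chest pain."),
    ("doctor: how is the knee? patient: still swollen",
     "CHIEF COMPLAINT: knee swelling."),
]
note = retrieve_note("patient: my knee is swollen again", train_pairs)
print(note)
```

No model is trained or generated from here: the output is always an existing note, which is why such a baseline works out of the box.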
Evaluation Methods
• Assessing which note generation model would best serve clinicians in their daily practice is an important
and challenging task; however, the assessment can be strongly biased by the adopted evaluation metrics.
• For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be
considered, e.g.:
o Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or
detected with high accuracy.
o Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential
factors in adopting a summarization model.
Evaluation Metrics
We studied the correlation between evaluation metrics and manual scores provided by medical annotators.
[Figure: Average Correlation Scores across Seven Clinical Datasets]
• The evaluation metrics behaved substantially differently on different types of clinical note datasets.
• The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with
human judgments on different evaluation criteria.
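A correlation study of this kind compares, note by note, an automatic metric score with a human score. The sketch below computes a Pearson correlation in plain Python. The choice of Pearson, the function name, and all score values are assumptions for illustration only; the slides do not state which correlation coefficient was used.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant scores; 0.0 is a convention here
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five generated notes (illustrative values, not from the study).
metric_scores = [0.42, 0.55, 0.31, 0.60, 0.48]  # e.g. an automatic metric
human_scores = [3, 4, 2, 5, 3]                  # e.g. annotator ratings
r = pearson(metric_scores, human_scores)
print(f"Pearson r = {r:.3f}")
```

Repeating this per metric and per dataset, then averaging, gives the kind of summary view that a correlation table reports.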
Insights from Shared Task Evaluations
• The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from doctor-patient conversations and the generation of synthetic medical conversations for data augmentation.
• Shared tasks can gather and evaluate diverse ideas from the research community (especially in the age of advanced LLMs) and empirically identify the best approaches.
• 29 teams participated in the MEDIQA-Chat & MEDIQA-Sum competitions organized @ ACL-ClinicalNLP & CLEF 2023.
MEDIQA-Chat 2023
Tasks
Task A - Short Dialogue2Note Summarization
This task focuses on summarizing short doctor-patient conversations to generate a summary for one section of a clinical note, including a section header.
Task B - Full Dialogue2Note Summarization
The goal of Task B is to generate a complete note for each doctor-patient encounter. The note must include all relevant sections.
Task C - Note2Dialogue Generation
The third task addresses data augmentation through the generation of synthetic doctor-patient conversations from full clinical notes.
ChatGPT & GPT-4 Models
• ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens & GPT-4 allows 32k tokens).
• Different temperatures: 1 for Tasks A/B, and 0 or 1 for Task C (deterministic vs. creative outputs).
• Prompts:
o Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT
ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications,
Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC
HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the
conversation in a clinical note style. The conversation is: "
o Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS,
PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: "
o Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the
medical information provided in this note:"
MEDIQA-Chat: Task B, Long Dialogue Summarization (ACI-BENCH Dataset)
MEDIQA-Chat 2023
Approaches & Insights
• Dynamic prompting with in-context learning was the winning solution for clinical note generation.
• Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading solutions with GPT-4.
• Generation of full notes worked better than generating specific sections one at a time.
• Data augmentation was beneficial to fine-tuning methods but was not applied or tested with GPT-4 solutions.
Empirical validation of the best methods on other datasets showed that:
• LLMs are highly sensitive to the in-context examples and to the similarity models used to select them for this task.
• LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions
to retrieve in-context examples and tackle potential retrieval overfitting.
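Dynamic prompting can be pictured as choosing the in-context examples per input rather than fixing them in advance. The sketch below ranks candidate examples by a bag-of-words cosine similarity and places the top k before the new conversation. The similarity function, prompt layout, helper names, and toy examples are all assumptions; the slides do not describe the winning system at this level of detail.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (an assumed similarity model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def build_few_shot_prompt(conversation: str, examples: list[tuple[str, str]], k: int = 1) -> str:
    """Prepend the k most similar (conversation, note) examples to the new conversation."""
    ranked = sorted(examples, key=lambda ex: cosine(conversation, ex[0]), reverse=True)
    parts = [f"Conversation: {dialogue}\nNote: {note}\n" for dialogue, note in ranked[:k]]
    parts.append(f"Conversation: {conversation}\nNote:")
    return "\n".join(parts)


# Hypothetical example pool, made up for this sketch.
examples = [
    ("patient has a cough and fever", "HPI: cough and fever."),
    ("patient injured the left ankle", "HPI: left ankle injury."),
]
prompt = build_few_shot_prompt("patient twisted the ankle yesterday", examples, k=1)
print(prompt)
```

Swapping `cosine` for another similarity model changes which examples are retrieved, which is one place where the sensitivity noted above can show up.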
MEDIQA-Sum @ CLEF 2023
Methods & Results
• The ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended
and overlapping dataset.
• One difference between the participants in MEDIQA-Sum and MEDIQA-Chat was that there were no GPT-4 submissions
in MEDIQA-Sum.
• For short dialogue summarization, scores in the two 2023 editions were comparable: MEDIQA-Chat aggregate scores
ranged from 0.37 to 0.58, and MEDIQA-Sum aggregate scores from 0.28 to 0.57.
• For full-encounter (long) generation, MEDIQA-Chat scores ranged from 0.28 to 0.61 for ROUGE-1 and from 0.21 to 0.65
for aggregate scoring. In MEDIQA-Sum, the ranges were slightly lower, with 0.28-0.50 ROUGE-1 and 0.25-0.46 aggregate
scoring.
• The results suggest that many open-source-based methods are still very competitive for classification and shorter
generation tasks, whereas longer generation may require more powerful LLMs.
References for the full details
DATASETS & METHODS
• Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023
• Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023
SHARED TASKS
• Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. ClinicalNLP@ACL 2023
• Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient Conversations. CLEF 2023
EVALUATION METRICS
• Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation Metrics for Automated Medical Note Generation. ACL (Findings) 2023
Medical Error
Detection & Correction
Detection and Correction
of Medical Errors
• Medical errors can be costly to both patients and healthcare
providers. Detection and correction of these errors is crucial
for improving health care outcomes.
• LLMs can enable faster and more accurate solutions to detect
medical errors such as:
o Misdiagnosis: LLMs can be leveraged to identify inconsistencies
between the patient's symptoms and the diagnosis mentioned in the
clinical note.
o Medication errors: LLMs can be leveraged to detect errors in
medication dosage, frequency, duration, potential drug interactions or
contraindications.
• LLMs can also assist with suggesting corrections and recommending
differential diagnoses, personalized treatments, and interventions.
MEDIQA-CORR 2024
New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction:
• The MS dataset consists of 3,359 clinical notes.
• The UW dataset consists of 488 de-identified notes (requires a DUA).
• Each note is either correct or contains one error (e.g., diagnosis, treatment,
management) and its correction.
Organizers:
Asma Ben Abacha, Microsoft
Wen-wai Yim, Microsoft
Meliha Yetisgen, University of Washington
Fei Xia, University of Washington
Examples from the MEDIQA-CORR-MS Dataset
A 58-year old man comes to his physician because of a 1-month history of
increased thirst and nocturia. He is drinking a lot of water to compensate for any
dehydration. His brother has type 2 diabetes mellitus. Physical examination
shows dry mucous membranes. Laboratory studies show a serum sodium of 151
mEq/L and glucose of 121 mg/dL. A water deprivation test shows:
                                                          | Serum osmolality (mOsmol/kg H2O) | Urine osmolality (mOsmol/kg H2O)
Initial presentation                                      | 295                              | 285
After 3 hours without fluids                              | 305                              | 310
After administration of antidiuretic hormone (ADH) analog | 280                              | 355
Patient was diagnosed with primary polydipsia.
Correction: Patient was diagnosed with partial central diabetes insipidus.
Examples from the MEDIQA-CORR-MS Dataset
A 75-year-old woman comes to the physician because of generalized
weakness for 6 months. During this period, she has also had a 4-kg
(8.8-lb) weight loss and frequent headaches. She has been avoiding
eating solids because of severe jaw pain. She has hypertension and
osteoporosis. She underwent a total left-sided knee arthroplasty 2 years
ago because of osteoarthritis. The patient does not smoke or drink
alcohol. Her current medications include enalapril, metoprolol, low-dose
aspirin, and a multivitamin. She appears pale. Her temperature is 37.5 C
(99.5 F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical
examination shows no abnormalities. Intravenous
methylprednisolone and a temporal artery biopsy is
recommended after labs were reviewed.
Laboratory studies showed:
Hemoglobin: 10 g/dL
Mean corpuscular volume: 87 μm3
Leukocyte count: 8,500/mm3
Platelet count: 450,000/mm3
Erythrocyte sedimentation rate: 90 mm/h
Correction: Oral prednisone and a temporal artery biopsy is
recommended after labs were reviewed.
Tasks & GPT Baselines
(the challenge is still open; the run submission deadline is March 28)
Tasks:
1. Predicting the error flag (i.e., does the text contain an error or not?),
2. Extracting the error sentence ID (or -1 for texts without errors),
3. Generating a correct sentence (or NA for texts without errors).
ChatGPT & GPT-4 results using the same prompt to generate the responses:
Baseline | Error Flag Prediction (Accuracy) | Error Sentence Detection (Accuracy) | Sentence Correction: ROUGE-1 | BERTScore | BLEURT | Aggregate Score
ChatGPT  | 46.18                            | 45.64                               | 46.93                        | 41.61     | 48.84  | 45.79
GPT-4    | 61.31                            | 48.91                               | 52.76                        | 56.97     | 57.15  | 55.63
Aggregate Score = mean of ROUGE-1-F, BERTScore, and BLEURT-20.
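The aggregate score for sentence correction is defined as the mean of ROUGE-1-F, BERTScore, and BLEURT-20. As a quick check, the sketch below recomputes it from the three reported baseline scores; the function and variable names are illustrative.

```python
def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Mean of the three sentence-correction metrics."""
    return (rouge1_f + bertscore + bleurt) / 3


# (ROUGE-1, BERTScore, BLEURT) as reported for the two baselines.
baselines = {
    "ChatGPT": (46.93, 41.61, 48.84),
    "GPT-4": (52.76, 56.97, 57.15),
}
aggregates = {name: round(aggregate_score(*scores), 2) for name, scores in baselines.items()}
print(aggregates)  # {'ChatGPT': 45.79, 'GPT-4': 55.63}
```

Both values match the reported aggregate scores, which confirms that the column is an unweighted mean.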
Challenges & Opportunities
The potential benefits of LLMs in healthcare
and medicine are substantial; however, their use
also raises challenges such as bias, hallucinations that can
impact medical outcomes, important omissions,
critical medical errors, and privacy concerns.
Continued research and innovation in this area are
crucial to address these challenges and to allow
doctors and health workers to use high-performing
LLM-based solutions with the necessary
safety guardrails.
abenabacha@microsoft.com
@AsmaBenAbacha

The Role of the FAIR Guiding Principles for an effective Learning Health System
 

Large Language Models and Applications in Healthcare

  • 1. Large Language Models & Applications in Healthcare. Asma Ben Abacha. Guest Lecture, University of Illinois, Urbana-Champaign, February 29, 2024.
  • 2. Abstract: The time that medical doctors spend on Electronic Health Record systems was shown to contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with the patients. Recent studies on clinical note generation have shown that doctors can save a significant amount of time with automatic note generation systems. Progress in LLMs can play a key role in enabling further such systems and improving their performance. However, this requires high-quality datasets and benchmarks and relevant evaluation metrics to assess which model would best serve clinicians in their daily practice. In this lecture, I’ll present LLM-based solutions for the task of clinical note generation from doctor-patient conversations. I’ll also present insights from different evaluation studies and shared tasks that we organized on this topic. Another important aspect of supporting healthcare providers with documentation and clinical decisions is to detect medical errors in clinical notes and to suggest corrections (e.g., diagnosis, treatment, medication). Such errors require medical expertise and knowledge to be both identified and corrected. Recent LLMs showed promise in being applied on unseen tasks with competitive ability. The second part of the lecture will cover a new research endeavor on medical error detection and correction and present LLM-based solutions for the task.
  • 3. Plan: I. Large Language Models (LLMs) in Healthcare; II. Clinical Note Generation from Doctor-Patient Conversations; III. Medical Error Detection & Correction
  • 4. [Figure: timeline of LLMs, 2018 to 2023] Large Language Models (LLMs): GPT-1, BERT, GPT-2, ERNIE, XLNet, RoBERTa, GPT-3, T5-xxl, XLM-R, PaLM, BLOOM, ChatGPT, BARD, GPT-4, Falcon, Claude 2, LLaMA 2, Mistral 7B. Medical LLMs: BioBERT, SciBERT, ClinicalBER, BioGPT, Clinical-T5, Med-PaLM, Med-PaLM 2, PMC-LLaMA.
  • 5.
  • 6. LLMs in Healthcare. Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of clinical practice and enhance patients' experiences:
      1. INFORMATION EXTRACTION
         ◦ Extracting useful information and insights from Electronic Health Records.
         ◦ Assisting doctors with literature review and easier access to up-to-date and trustworthy information.
      2. PREDICTION & RECOMMENDATION
         ◦ Predicting disease progression and identifying high-risk patients.
         ◦ Recommending personalized treatment plans, interventions, clinical trials, and improving clinical workflows.
      3. GENERATION
         ◦ Generating clinical notes and reducing doctors’ documentation time.
         ◦ Automating administrative tasks such as medical coding and billing.
         ◦ Generating differential diagnoses and improving diagnosis accuracy.
  • 7. Clinical Note Generation from Doctor-Patient Conversations
  • 8. INTRODUCTION
      ◦ Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with the patients (Hripcsak et al., 2011).
      ◦ The time spent with Electronic Health Record systems contributes to high rates of attrition, with a burnout rate already exceeding 50% (Arndt et al., 2017).
      ◦ Generating clinical notes from doctor-patient conversations can reduce doctors’ workload: they edit and validate the generated notes instead of writing the full notes during the consultations.
      ◦ However, the lack of publicly available doctor-patient dialogue datasets hinders research efforts from the NLP community on this summarization/generation task.
  • 9. New Datasets. We created and released two new datasets free of PHI that can be used for benchmarking and research on clinical note generation and accelerate research efforts on this understudied NLP task.
      MTS-Dialog Dataset [1]
         ◦ To avoid privacy infringement risks, we created simulated conversations from publicly available de-identified clinical notes from the Mtsamples collection.
         ◦ Eight trained annotators used the notes sections to write simulated conversations according to guidelines derived from a study of a large private collection of real conversations and notes.
      ACI-BENCH: Ambient Clinical Intelligence Benchmark [2]
         ◦ Includes full clinical notes and associated conversations from simulated encounters.
         ◦ Four annotators with medical backgrounds validated the dataset.
      [1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
      [2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
  • 10. Example from the MTS-Dialog Dataset. The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note sections). The clinical notes cover the six most frequent note types and specialties in the collection, including: General Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
  • 11. Example from the ACI-BENCH Dataset. The ACI-BENCH dataset consists of 207 pairs of full doctor-patient conversations and associated clinical notes.
  • 12. Summarization of Short Doctor-Patient Conversations
      ◦ We fine-tuned and evaluated summarization models on the MTS-Dialog dataset.
      ◦ Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum reached a ROUGE-1 score of 40.15%, vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights the importance of using relevant pre-finetuning datasets.
      ◦ Guided summarization and data augmentation improved ROUGE-1 from 40% to 42%.
      Source: Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
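The ROUGE-1 figures on this slide measure unigram overlap between a generated summary and its reference. The following is only a minimal sketch of that idea, assuming lowercase whitespace tokenization; official ROUGE implementations apply additional normalization such as stemming, and the function name `rouge1_f1` is hypothetical rather than taken from the lecture.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy strings for illustration only, not taken from MTS-Dialog.
    print(rouge1_f1("the patient has a fever", "the patient reports fever"))
```

Scores reported on the slides are percentages, so a value returned by this sketch would be multiplied by 100 for comparison.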
  • 13. Summarization of Short Doctor-Patient Conversations. We selected the best performing model from each category for further studies using Fact-based metrics, BERTScore, and BLEURT.
      ◦ Guided Summarization (GS) led to a consistent improvement across all automated metrics except for ROUGE-2 and Fact-based metrics.
      ◦ Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
  • 14. Summarization of Long Doctor-Patient Conversations
      ◦ We fine-tuned and evaluated different summarization models at the full note level on the ACI-BENCH dataset.
      ◦ Simple retrieval-based methods provided strong baselines with better out-of-the-box performances than Longformer-Encoder-Decoder (LED) models and full-note BART models.
      ◦ Division-based generation worked better for BART and LED fine-tuned models with 53.46 ROUGE-1 for BART+FTSAMSum(Division).
      ◦ OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting.
      ◦ GPT-4 demonstrated the highest MEDCON UMLS-based evaluation score (57.78) while achieving the second to third-best performance in ROUGE scores (51.76 ROUGE-1, 22.58 ROUGE-2 and 45.97 ROUGE-L).
      Source: Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
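One possible reading of division-based generation is that each part of the note is generated separately from the conversation and the parts are then concatenated. The sketch below illustrates that reading only. It assumes the four sections named in the Task B prompt later in this lecture serve as the divisions, and the per-division generators are placeholder callables, not the fine-tuned BART or LED models from the paper. All names are hypothetical.

```python
from typing import Callable, Dict

# Assumption: divisions follow the four sections listed in the Task B prompt.
DIVISIONS = [
    "HISTORY OF PRESENT ILLNESS",
    "PHYSICAL EXAM",
    "RESULTS",
    "ASSESSMENT AND PLAN",
]


def generate_note_by_division(
    conversation: str,
    generators: Dict[str, Callable[[str], str]],
) -> str:
    """Generate each division independently, then join them into one note."""
    sections = []
    for division in DIVISIONS:
        # Each generator sees the full conversation and returns one division.
        text = generators[division](conversation).strip()
        sections.append(f"{division}\n{text}")
    return "\n\n".join(sections)


if __name__ == "__main__":
    # Stub generators stand in for real models in this illustration.
    stubs = {d: (lambda conv, d=d: f"[{d.lower()} text]") for d in DIVISIONS}
    print(generate_note_by_division("Doctor: ... Patient: ...", stubs))
```

Keeping the generators pluggable means the same wrapper could sit in front of a fine-tuned model or a prompt-based one.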
  • 15. Evaluation Methods
      ◦ Assessing which note generation model would best serve clinicians in their daily practice is an important and challenging task; however, it can be (strongly) biased by the adopted evaluation metrics.
      ◦ For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be considered, e.g.:
         ◦ Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or detected with a high accuracy.
         ◦ Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential factors in adopting a summarization model.
  • 16. Evaluation Metrics. We studied the correlation between evaluation metrics and manual scores provided by medical annotators. [Figure: Average Correlation Scores across Seven Clinical Datasets]
      ◦ The evaluation metrics had substantially different behaviors on different types of clinical notes datasets.
      ◦ The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with human judgments on different evaluation criteria.
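To make the metric-versus-human comparison concrete, the sketch below computes a correlation coefficient between automatic metric scores and annotator scores for the same set of notes. The slide does not say which coefficient was used; Pearson correlation is an assumption here, and the function name and the numbers in the usage example are made up for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: one automatic metric vs. one human criterion.
    metric_scores = [0.41, 0.55, 0.38, 0.62]
    human_scores = [3.0, 4.0, 2.5, 4.5]
    print(pearson(metric_scores, human_scores))
```

Averaging such coefficients over several datasets and human criteria would give one summary number per metric, which is the kind of comparison the slide title describes.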
  • 17. Insights from Shared Task Evaluations
      ◦ The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from doctor-patient conversations and the generation of synthetic medical conversations for data augmentation.
      ◦ The shared tasks can gather and evaluate diverse ideas from the research community (especially in the age of advanced LLMs) and empirically identify the best approaches.
      ◦ 29 participating teams at the MEDIQA-Chat & MEDIQA-Sum competitions organized @ ACL-ClinicalNLP & CLEF 2023.
  • 18. MEDIQA-Chat 2023 Tasks
      ◦ Task A - Short Dialogue2Note Summarization: This task focuses on summarizing short doctor-patient conversations to generate a summary for one section of a clinical note, including a section header.
      ◦ Task B - Full Dialogue2Note Summarization: The goal of task B is to generate a complete note for each doctor-patient encounter. The note must include all relevant sections.
      ◦ Task C - Note2Dialogue Generation: The third task addresses data augmentation through the generation of synthetic doctor-patient conversations from full clinical notes.
  • 19. ChatGPT & GPT-4 Models
      ◦ ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens & GPT-4 allows 32k tokens).
      ◦ Different temperatures for Tasks A/B (1) and Task C (0/1) (deterministic/creative outputs).
      ◦ Prompts:
         ◦ Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications, Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the conversation in a clinical note style. The conversation is: "
         ◦ Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: "
         ◦ Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the medical information provided in this note:"
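The Task A prompt asks the model to answer with the selected class, then `#`, then the summary. The sketch below shows how such a prompt could be assembled and how a response in that format could be split back into a section header and a summary. It makes no API call; the helper names are hypothetical, and the case-insensitive matching of the header is an assumption rather than something the slide specifies.

```python
from typing import Tuple

# The 20 classes listed in the Task A prompt on the slide.
SECTION_CLASSES = [
    "FAMILY HISTORY/SOCIAL HISTORY", "HISTORY of PRESENT ILLNESS",
    "PAST MEDICAL HISTORY", "CHIEF COMPLAINT", "PAST SURGICAL HISTORY",
    "Allergy", "REVIEW OF SYSTEMS", "Medications", "Assessment", "Exam",
    "Diagnosis", "Disposition", "Plan", "EMERGENCY DEPARTMENT COURSE",
    "Immunizations", "Imaging", "GYNECOLOGIC HISTORY", "Procedures",
    "Other history", "Labs",
]

TASK_A_INSTRUCTION = (
    "Classify the conversation into one of these 20 classes: "
    + ", ".join(SECTION_CLASSES)
    + ". The response should start with the selected class, followed by # "
    "then the summary of the conversation in a clinical note style. "
    "The conversation is: "
)


def build_task_a_prompt(conversation: str) -> str:
    """Append the conversation to the fixed Task A instruction."""
    return TASK_A_INSTRUCTION + conversation


def parse_task_a_response(response: str) -> Tuple[str, str]:
    """Split '<class> # <summary>' into (class, summary)."""
    header, sep, summary = response.partition("#")
    if not sep:
        raise ValueError("expected '<class> # <summary>'")
    # Assumption: accept any letter case for the returned class name.
    known = {name.upper(): name for name in SECTION_CLASSES}
    key = header.strip().upper()
    if key not in known:
        raise ValueError(f"unknown section class: {header.strip()!r}")
    return known[key], summary.strip()


if __name__ == "__main__":
    print(parse_task_a_response("medications # Takes enalapril daily."))
```

Validating the header against the fixed class list catches responses that drift from the requested format before they reach evaluation.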
  • 20. MEDIQA-Chat: Task B Long Dialogue Summarization (ACI-BENCH Dataset)
  • 21. MEDIQA-Chat 2023 Approaches & Insights
      ◦ Dynamic prompting & in-context learning was the winning solution for clinical note generation.
      ◦ Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading solutions with GPT-4.
      ◦ Generation of full notes worked better than generating specific sections at a time.
      ◦ Data augmentation was beneficial to fine-tuning methods but was not applied/tested with GPT-4 solutions.
      Empirical validation of the best methods on other/different datasets showed that:
      ◦ LLMs are highly sensitive to in-context examples and similarity models for this task.
      ◦ LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions to retrieve in-context examples and tackle potential retrieval overfitting.
  • 22. MEDIQA-Sum @ CLEF 2023 Methods & Results
      ◦ ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended and overlapping dataset.
      ◦ A difference between the participants in MEDIQA-Sum and MEDIQA-Chat was that there were no GPT-4 submissions in MEDIQA-Sum.
      ◦ For short dialogue summarization, scores in the two 2023 editions were comparable: MEDIQA-Chat aggregate scores ranged from 0.37 to 0.58 and MEDIQA-Sum aggregate scores from 0.28 to 0.57.
      ◦ For full-encounter (long) generation in MEDIQA-Chat, ROUGE-1 ranged from 0.28 to 0.61 and the aggregate score from 0.21 to 0.65. In MEDIQA-Sum, the ranges were slightly lower, with 0.28-0.50 ROUGE-1 and 0.25-0.46 aggregate score.
      ◦ The results suggest that many open source-based methods are still very competitive for classification and shorter generation tasks whereas longer generation may require more powerful LLMs.
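Dynamic prompting with in-context learning can be pictured as retrieving the training examples most similar to the new conversation and placing them in the prompt as demonstrations. The sketch below is a simplified illustration of that idea, not the winning team's system: it uses Jaccard overlap on word sets as a stand-in similarity model, and the function names and prompt layout are hypothetical.

```python
import re
from typing import List, Tuple

Example = Tuple[str, str]  # (conversation, clinical note)


def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a stand-in for a learned similarity model."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_examples(query: str, pool: List[Example], k: int = 2) -> List[Example]:
    """Return the k pool examples whose conversations best match the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    return ranked[:k]


def build_few_shot_prompt(query: str, pool: List[Example], k: int = 2) -> str:
    """Place retrieved (conversation, note) pairs before the new conversation."""
    parts = [
        f"Conversation:\n{conv}\nClinical note:\n{note}\n"
        for conv, note in select_examples(query, pool, k)
    ]
    parts.append(f"Conversation:\n{query}\nClinical note:\n")
    return "\n".join(parts)


if __name__ == "__main__":
    # Toy pool for illustration only.
    pool = [
        ("Doctor: any chest pain? Patient: yes at night", "Chest pain at night."),
        ("Doctor: how is the knee? Patient: knee hurts", "Knee pain."),
    ]
    print(build_few_shot_prompt("Doctor: knee still sore? Patient: yes", pool, k=1))
```

Because the demonstrations depend on the similarity function, swapping it changes which examples the model sees, which is consistent with the sensitivity to similarity models noted on the slide.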
  • 23. References for the full details
      DATASETS & METHODS
         ◦ Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023
         ◦ Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023
      SHARED TASKS
         ◦ Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. ClinicalNLP@ACL 2023
         ◦ Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient Conversations. CLEF 2023
      EVALUATION METRICS
         ◦ Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation Metrics for Automated Medical Note Generation. ACL (Findings) 2023
  • 25. Detection and Correction of Medical Errors
      ◦ Medical errors can be costly to both patients and healthcare providers. Detection and correction of these errors is crucial for improving health care outcomes.
      ◦ LLMs can enable faster and more accurate solutions to detect medical errors such as:
         ◦ Misdiagnosis: LLMs can be leveraged to identify inconsistencies between the patient's symptoms and the diagnosis mentioned in the clinical note.
         ◦ Medication errors: LLMs can be leveraged to detect errors in medication dosage, frequency, duration, potential drug interactions or contraindications.
      ◦ LLMs can also assist with suggesting corrections and recommending differential diagnoses, personalized treatments, and interventions.
  • 26. New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction: MEDIQA-CORR 2024
      Organizers: Asma Ben Abacha, Microsoft; Wen-wai Yim, Microsoft; Meliha Yetisgen, University of Washington; Fei Xia, University of Washington
      ◦ The MS dataset consists of 3,359 clinical notes.
      ◦ The UW dataset consists of 488 de-identified notes (requires a DUA).
      ◦ Each note is either correct or contains one error (e.g., diagnosis, treatment, management) and its correction.
  • 27. Examples from the MEDIQA-CORR-MS Dataset
      A 58-year old man comes to his physician because of a 1-month history of increased thirst and nocturia. He is drinking a lot of water to compensate for any dehydration. His brother has type 2 diabetes mellitus. Physical examination shows dry mucous membranes. Laboratory studies show a serum sodium of 151 mEq/L and glucose of 121 mg/dL. A water deprivation test shows:
                                                                    Serum osmolality    Urine osmolality
                                                                    (mOsmol/kg H2O)     (mOsmol/kg H2O)
         Initial presentation                                       295                 285
         After 3 hours without fluids                               305                 310
         After administration of antidiuretic hormone (ADH) analog  280                 355
      Sentence containing the error: Patient was diagnosed with primary polydipsia.
      Corrected sentence: Patient was diagnosed with partial central diabetes insipidus.
  • 28. Examples from the MEDIQA-CORR-MS Dataset
      A 75-year-old woman comes to the physician because of generalized weakness for 6 months. During this period, she has also had a 4-kg (8.8-lb) weight loss and frequent headaches. She has been avoiding eating solids because of severe jaw pain. She has hypertension and osteoporosis. She underwent a total left-sided knee arthroplasty 2 years ago because of osteoarthritis. The patient does not smoke or drink alcohol. Her current medications include enalapril, metoprolol, low-dose aspirin, and a multivitamin. She appears pale. Her temperature is 37.5 C (99.5 F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical examination shows no abnormalities. Intravenous methylprednisolone and a temporal artery biopsy is recommended after labs were reviewed. Laboratory studies showed:
         Hemoglobin                      10 g/dL
         Mean corpuscular volume         87 μm3
         Leukocyte count                 8,500/mm3
         Platelet count                  450,000/mm3
         Erythrocyte sedimentation rate  90 mm/h
      Sentence containing the error: Intravenous methylprednisolone and a temporal artery biopsy is recommended after labs were reviewed.
      Corrected sentence: Oral prednisone and a temporal artery biopsy is recommended after labs were reviewed.
  • 29. Tasks & GPT Baselines (the challenge is still open, run submission deadline on March 28)
      Tasks:
         1. Predicting the error flag (i.e., does the text contain an error or not?),
         2. Extracting the error sentence ID (or -1 for texts without errors),
         3. Generating a correct sentence (or NA for texts without errors).
      ChatGPT & GPT-4 results using the same prompt to generate the responses:
         Baseline   Error Flag Prediction   Error Sentence Detection   Sentence Correction
                    (Accuracy)              (Accuracy)                 ROUGE-1   BERTScore   BLEURT   Aggregate Score
         ChatGPT    46.18                   45.64                      46.93     41.61       48.84    45.79
         GPT-4      61.31                   48.91                      52.76     56.97       57.15    55.63
      Aggregate Score = mean of ROUGE-1-F, BERTScore, and BLEURT-20.
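The aggregate score on this slide is defined as the mean of ROUGE-1-F, BERTScore, and BLEURT-20, and the two accuracy columns are fractions of correct predictions. The sketch below reproduces the aggregate arithmetic from the three reported component scores and adds a plain accuracy helper; the function names are hypothetical, and the metric values themselves are taken as given rather than recomputed.

```python
from typing import Sequence


def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Mean of the three sentence-correction metrics, as defined on the slide."""
    return (rouge1_f + bertscore + bleurt) / 3


def accuracy(predicted: Sequence, gold: Sequence) -> float:
    """Share of items where the prediction equals the gold label.

    Usable for error flags (0/1) and for error sentence IDs (-1 if no error).
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    # Component scores from the baseline table.
    print(round(aggregate_score(46.93, 41.61, 48.84), 2))  # ChatGPT row
    print(round(aggregate_score(52.76, 56.97, 57.15), 2))  # GPT-4 row
```

Rounded to two decimals, the means of the reported components match the Aggregate Score column for both baselines (45.79 and 55.63).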
  • 30. Challenges & Opportunities. The potential benefits of LLMs in healthcare and medicine are substantial. However, there are also challenges associated with the use of LLMs in healthcare, such as bias, hallucinations that can impact medical outcomes, important omissions, critical medical errors, and privacy concerns. Continued research and innovation in this area is crucial to address these challenges and to allow doctors and health workers to use highly-performing LLM-based solutions with the necessary safety guardrails.