1. Large Language Models
& Applications in Healthcare
Asma Ben Abacha
Guest Lecture, University of Illinois,
Urbana-Champaign,
February 29, 2024
2. Abstract
The time that medical doctors spend on Electronic Health Record systems has been shown to contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with patients. Recent studies on clinical note generation have shown that doctors can save a significant amount of time with automatic note generation systems. Progress in LLMs can play a key role in enabling such systems and improving their performance. However, this requires high-quality datasets, benchmarks, and relevant evaluation metrics to assess which model would best serve clinicians in their daily practice. In this lecture, I'll present LLM-based solutions for the task of clinical note generation from doctor-patient conversations. I'll also present insights from different evaluation studies and shared tasks that we organized on this topic.
Another important aspect of supporting healthcare providers with documentation and clinical decisions is detecting medical errors in clinical notes and suggesting corrections (e.g., to a diagnosis, treatment, or medication). Such errors require medical expertise and knowledge to be both identified and corrected. Recent LLMs have shown promise when applied to unseen tasks, with competitive performance. The second part of the lecture will cover a new research endeavor on medical error detection and correction and present LLM-based solutions for the task.
3. Plan
I. Large Language Models (LLMs) in Healthcare
II. Clinical Note Generation from Doctor-Patient Conversations
III. Medical Error Detection & Correction
6. LLMs in Healthcare
Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of clinical practice and enhance patients' experiences:
1. INFORMATION EXTRACTION
• Extracting useful information and insights from Electronic Health Records.
• Assisting doctors with literature review and easier access to up-to-date and trustworthy information.
2. PREDICTION & RECOMMENDATION
• Predicting disease progression and identifying high-risk patients.
• Recommending personalized treatment plans, interventions, and clinical trials, and improving clinical workflows.
3. GENERATION
• Generating clinical notes and reducing doctors' documentation time.
• Automating administrative tasks such as medical coding and billing.
• Generating differential diagnoses and improving diagnostic accuracy.
8. INTRODUCTION
• Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with patients (Hripcsak et al., 2011).
• The time spent with Electronic Health Record systems contributes to high rates of attrition, with a burnout rate already exceeding 50% (Arndt et al., 2017).
• Generating clinical notes from doctor-patient conversations can reduce doctors' workload by letting them edit/validate the generated notes instead of writing the full notes during consultations.
• However, the lack of publicly available doctor-patient dialogue datasets hinders research efforts from the NLP community on this summarization/generation task.
9. New Datasets
We created and released two new datasets free of PHI that can be used to benchmark and accelerate research on clinical note generation, an understudied NLP task.
MTS-Dialog Dataset [1]
§ To avoid privacy infringement risks, we created simulated conversations from publicly available de-identified clinical notes from the MTSamples collection.
§ Eight trained annotators used the note sections to write simulated conversations according to guidelines derived from a study of a large private collection of real conversations and notes.
ACI-BENCH: Ambient Clinical Intelligence Benchmark [2]
§ Includes full clinical notes and associated conversations from simulated encounters.
§ Four annotators with medical backgrounds validated the dataset.
[1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023.
[2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-BENCH: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023.
10. Example from the MTS-Dialog Dataset
The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note sections).
The clinical notes cover the six most frequent note types and specialties in the collection, including: General Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
12. Summarization of Short Doctor-Patient Conversations
o We fine-tuned and evaluated summarization models on the MTS-Dialog dataset.
o Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum achieved a ROUGE-1 score of 40.15% vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights the importance of using relevant pre-finetuning datasets.
o Guided summarization and data augmentation helped improve the ROUGE-1 score from 40% to 42%.
Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023.
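As a reference point, ROUGE-1 measures unigram overlap between a generated summary and the reference note section. Below is a minimal sketch of the ROUGE-1 F1 computation, ignoring the stemming and tokenization details of the official ROUGE implementation:

```python
from collections import Counter

def rouge1_f(reference: str, prediction: str) -> float:
    """Unigram ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each predicted unigram counts at most as often as in the reference.
    overlap = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Example: 4 of the 5 unigrams overlap in both directions -> F1 = 0.8
print(rouge1_f("the patient reports chest pain", "patient reports mild chest pain"))
```

Reported ROUGE scores in this deck come from the standard ROUGE toolkit; this sketch only illustrates what the metric rewards (surface n-gram overlap).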
13. Summarization of Short Doctor-Patient Conversations
We selected the best-performing model from each category for further studies using fact-based metrics, BERTScore, and BLEURT.
o Guided Summarization (GS) led to a consistent improvement across all automated metrics except ROUGE-2 and the fact-based metrics.
o Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
14. Summarization of Long Doctor-Patient Conversations
We fine-tuned and evaluated different summarization models at the full-note level on the ACI-BENCH dataset.
o Simple retrieval-based methods provided strong baselines, with better out-of-the-box performance than Longformer-Encoder-Decoder (LED) models and full-note BART models.
o Division-based generation worked better for fine-tuned BART and LED models, with 53.46 ROUGE-1 for BART+FTSAMSum(Division).
o OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting.
o GPT-4 achieved the highest MEDCON UMLS-based evaluation score (57.78) while achieving the second- to third-best performance in ROUGE scores (51.76 ROUGE-1, 22.58 ROUGE-2, and 45.97 ROUGE-L).
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-BENCH: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023.
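Division-based generation splits a long conversation into pieces, generates each part of the note separately, and concatenates the results. A minimal sketch of that chunk-then-summarize loop, where `summarize` is a hypothetical stand-in for a fine-tuned BART/LED model:

```python
from typing import Callable

def chunk_turns(turns: list[str], max_turns: int) -> list[list[str]]:
    """Split a dialogue (list of speaker turns) into consecutive fixed-size chunks."""
    return [turns[i:i + max_turns] for i in range(0, len(turns), max_turns)]

def divide_and_generate(turns: list[str],
                        summarize: Callable[[str], str],
                        max_turns: int = 3) -> str:
    """Summarize each chunk independently and concatenate the partial notes."""
    parts = [summarize("\n".join(chunk)) for chunk in chunk_turns(turns, max_turns)]
    return "\n\n".join(parts)
```

In the actual systems, divisions align with note sections (e.g., subjective content vs. exam findings) rather than fixed-size turn windows; the fixed-window split here is only illustrative.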
15. Evaluation Methods
• Assessing which note generation model would best serve clinicians in their daily practice is an important and challenging task; however, it can be (strongly) biased by the adopted evaluation metrics.
• For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be considered, e.g.:
  • Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or detected with high accuracy.
  • Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential factors in adopting a summarization model.
16. Evaluation Metrics
Average Correlation Scores across Seven Clinical Datasets
We studied the correlation between evaluation metrics and manual scores provided by medical annotators.
The evaluation metrics had substantially different behaviors on different types of clinical note datasets.
The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with human judgments on different evaluation criteria.
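A metric-to-human correlation study of this kind typically computes a rank correlation between a metric's scores and the annotators' scores over the same set of notes. A minimal Spearman correlation sketch, assuming no tied values (a real study would use a library routine such as scipy.stats.spearmanr, which handles ties):

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation without ties: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    def ranks(values: list[float]) -> list[int]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A metric that rises monotonically with human scores correlates perfectly (1.0).
print(spearman([0.31, 0.42, 0.55, 0.61], [2, 3, 4, 5]))
```

High correlation with human judgment is exactly the property that made the aggregate-score subset of metrics stand out in the study.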
17. Insights from Shared Task Evaluations
The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from doctor-patient conversations and the generation of synthetic medical conversations for data augmentation.
The shared tasks can gather and evaluate diverse ideas from the research community (especially in the age of advanced LLMs) and empirically identify the best approaches.
29 teams participated in the MEDIQA-Chat & MEDIQA-Sum competitions organized @ ACL-ClinicalNLP & CLEF 2023.
18. MEDIQA-Chat 2023 Tasks
Task A - Short Dialogue2Note Summarization: This task focuses on summarizing short doctor-patient conversations to generate a summary for one section of a clinical note, including a section header.
Task B - Full Dialogue2Note Summarization: The goal of Task B is to generate a complete note for each doctor-patient encounter. The note must include all relevant sections.
Task C - Note2Dialogue Generation: The third task addresses data augmentation through the generation of synthetic doctor-patient conversations from full clinical notes.
19. ChatGPT & GPT-4 Models
• ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens; GPT-4 allows 32k tokens).
• Different temperatures for Tasks A/B (1) and Task C (0/1) (deterministic/creative outputs).
• Prompts:
⚬ Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications, Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the conversation in a clinical note style. The conversation is: "
⚬ Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: "
⚬ Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the medical information provided in this note: "
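These baselines amount to a single zero-shot prompt per task. A sketch of how the Task B baseline could be assembled and sent; the request itself is commented out, and the OpenAI client usage and model names are assumptions based on the slide (an API key would also be required):

```python
TASK_B_PROMPT = (
    "Summarize the conversation to generate a clinical note with four sections: "
    "HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. "
    "The conversation is: "
)

def build_messages(conversation: str) -> list[dict]:
    """Wrap the Task B prompt and the dialogue into a chat-format message list."""
    return [{"role": "user", "content": TASK_B_PROMPT + conversation}]

# Hypothetical request (requires the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4",       # or "gpt-3.5-turbo" for the ChatGPT baseline
#     temperature=1,       # Tasks A/B used temperature 1 per the slide
#     messages=build_messages(dialogue),
# )
# note = response.choices[0].message.content
```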
21. MEDIQA-Chat 2023 Approaches & Insights
• Dynamic prompting & in-context learning was the winning solution for clinical note generation.
• Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading GPT-4 solutions.
• Generating full notes worked better than generating specific sections one at a time.
• Data augmentation was beneficial to fine-tuning methods but was not applied/tested with GPT-4 solutions.
Empirical validation of the best methods on other/different datasets showed that:
• LLMs are highly sensitive to in-context examples and similarity models for this task.
• LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions to retrieve in-context examples and tackle potential retrieval overfitting.
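The sensitivity to similarity models comes from how dynamic prompting picks its in-context examples: the training conversation most similar to the test input is retrieved and placed in the prompt. A minimal retrieval sketch using token-overlap (Jaccard) similarity as a stand-in for the similarity models actually used by participants:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two conversations."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_in_context_example(test_dialogue: str, train_dialogues: list[str]) -> str:
    """Return the training conversation most similar to the test input."""
    return max(train_dialogues, key=lambda d: jaccard(test_dialogue, d))
```

Swapping this similarity function for a different one can change which example lands in the prompt, which is exactly why these solutions can overfit to a particular retrieval setup and need adaptation on new data.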
22. MEDIQA-Sum @ CLEF 2023 Methods & Results
The ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended and overlapping dataset.
One difference between the participants in the two tasks was that there were no GPT-4 submissions in MEDIQA-Sum.
For short dialogue summarization, scores in the two 2023 editions were comparable: MEDIQA-Chat scores ranged from 0.37 to 0.58 aggregate score, and MEDIQA-Sum scores were at 0.28-0.57 aggregate score.
For full-encounter (long) note generation, MEDIQA-Chat ROUGE-1 was at 0.28-0.61 and aggregate scoring at 0.21-0.65. In MEDIQA-Sum, the ranges were slightly lower, with 0.28-0.50 ROUGE-1 and 0.25-0.46 aggregate scoring.
The results suggest that many open-source methods are still very competitive for classification and shorter generation tasks, whereas longer generation may require more powerful LLMs.
23. References for the full details
DATASETS & METHODS
• Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023
• Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023
SHARED TASKS
• Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. ClinicalNLP@ACL 2023
• Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient Conversations. CLEF 2023
EVALUATION METRICS
• Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation Metrics for Automated Medical Note Generation. ACL (Findings) 2023
25. Detection and Correction of Medical Errors
Medical errors can be costly to both patients and healthcare providers. Detecting and correcting these errors is crucial for improving healthcare outcomes.
LLMs can enable faster and more accurate solutions to detect medical errors such as:
o Misdiagnosis: LLMs can be leveraged to identify inconsistencies between the patient's symptoms and the diagnosis mentioned in the clinical note.
o Medication errors: LLMs can be leveraged to detect errors in medication dosage, frequency, or duration, and potential drug interactions or contraindications.
LLMs can also assist with suggesting corrections and recommending differential diagnoses, personalized treatments, and interventions.
26. MEDIQA-CORR 2024
Organizers:
Asma Ben Abacha, Microsoft
Wen-wai Yim, Microsoft
Meliha Yetisgen, University of Washington
Fei Xia, University of Washington
New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction:
• The MS dataset consists of 3,359 clinical notes.
• The UW dataset consists of 488 de-identified notes (requires a DUA).
• Each note is either correct or contains one error (e.g., in the diagnosis, treatment, or management) and its correction.
27. Examples from the MEDIQA-CORR-MS Dataset
A 58-year-old man comes to his physician because of a 1-month history of increased thirst and nocturia. He is drinking a lot of water to compensate for any dehydration. His brother has type 2 diabetes mellitus. Physical examination shows dry mucous membranes. Laboratory studies show a serum sodium of 151 mEq/L and glucose of 121 mg/dL. A water deprivation test shows (serum osmolality / urine osmolality, in mOsmol/kg H2O):
• Initial presentation: 295 / 285
• After 3 hours without fluids: 305 / 310
• After administration of antidiuretic hormone (ADH) analog: 280 / 355
Error sentence: Patient was diagnosed with primary polydipsia.
Corrected sentence: Patient was diagnosed with partial central diabetes insipidus.
28. Examples from the MEDIQA-CORR-MS Dataset
A 75-year-old woman comes to the physician because of generalized weakness for 6 months. During this period, she has also had a 4-kg (8.8-lb) weight loss and frequent headaches. She has been avoiding eating solids because of severe jaw pain. She has hypertension and osteoporosis. She underwent a total left-sided knee arthroplasty 2 years ago because of osteoarthritis. The patient does not smoke or drink alcohol. Her current medications include enalapril, metoprolol, low-dose aspirin, and a multivitamin. She appears pale. Her temperature is 37.5 C (99.5 F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical examination shows no abnormalities.
Laboratory studies showed:
• Hemoglobin: 10 g/dL
• Mean corpuscular volume: 87 μm3
• Leukocyte count: 8,500/mm3
• Platelet count: 450,000/mm3
• Erythrocyte sedimentation rate: 90 mm/h
Error sentence: Intravenous methylprednisolone and a temporal artery biopsy is recommended after labs were reviewed.
Corrected sentence: Oral prednisone and a temporal artery biopsy is recommended after labs were reviewed.
29. Tasks & GPT Baselines
(the challenge is still open; run submission deadline on March 28)
Tasks:
1. Predicting the error flag (i.e., does the text contain an error or not?),
2. Extracting the error sentence ID (or -1 for texts without errors),
3. Generating a corrected sentence (or NA for texts without errors).
ChatGPT & GPT-4 results using the same prompt to generate the responses:
Baselines | Error Flag Prediction (Accuracy) | Error Sentence Detection (Accuracy) | Sentence Correction: ROUGE-1 | BERTScore | BLEURT | Aggregate Score (mean of ROUGE-1-F, BERTScore, and BLEURT-20)
ChatGPT | 46.18 | 45.64 | 46.93 | 41.61 | 48.84 | 45.79
GPT-4 | 61.31 | 48.91 | 52.76 | 56.97 | 57.15 | 55.63
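The aggregate score for sentence correction is simply the plain mean of the three generation metrics; a quick check reproduces the reported baseline values:

```python
def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Aggregate score = mean of ROUGE-1-F, BERTScore, and BLEURT-20."""
    return (rouge1_f + bertscore + bleurt) / 3

print(round(aggregate_score(46.93, 41.61, 48.84), 2))  # ChatGPT row -> 45.79
print(round(aggregate_score(52.76, 56.97, 57.15), 2))  # GPT-4 row -> 55.63
```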
30. Challenges & Opportunities
The potential benefits of LLMs in healthcare and medicine are substantial; however, there are also challenges associated with their use in healthcare, such as bias, hallucinations that can impact medical outcomes, important omissions, critical medical errors, and privacy concerns.
Continued research and innovation in this area are crucial to address these challenges and to allow doctors and health workers to use high-performing LLM-based solutions with the necessary safety guardrails.