Large Language Models
& Applications in Healthcare
Asma Ben Abacha
Guest Lecture, University of Illinois,
Urbana-Champaign,
February 29, 2024
Abstract
The time that medical doctors spend on Electronic Health Record systems was shown to
contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate
exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day
writing clinical notes from their conversations with the patients. Recent studies on clinical
note generation have shown that doctors can save a significant amount of time with
automatic note generation systems. Progress in LLMs can play a key role in enabling
further such systems and improving their performance. However, this requires high-
quality datasets and benchmarks and relevant evaluation metrics to assess which model
would best serve clinicians in their daily practice. In this lecture, I’ll present LLM-based
solutions for the task of clinical note generation from doctor-patient conversations. I’ll also
present insights from different evaluation studies and shared tasks that we organized on
this topic.
Another important aspect of supporting healthcare providers with documentation and
clinical decisions is to detect medical errors in clinical notes and to suggest corrections
(e.g., diagnosis, treatment, medication). Such errors require medical expertise and
knowledge to be both identified and corrected. Recent LLMs have shown promise when
applied to unseen tasks, with competitive performance. The second part of the lecture will cover
a new research effort on medical error detection and correction and present LLM-based
solutions for the task.
Plan
I. Large Language Models (LLMs) in
Healthcare
II. Clinical Note Generation from
Doctor-Patient Conversations
III. Medical Error Detection &
Correction
Large Language Models (LLMs)
[Figure: timeline of general-domain and medical LLMs, 2018 to 2023]
General-domain LLMs: GPT-1, BERT, GPT-2, ERNIE, XLNet, RoBERTa, XLM-R, T5-xxl, GPT-3, PaLM, BLOOM, ChatGPT, BARD, GPT-4, Falcon, Claude 2, LLaMA 2, Mistral 7B
Medical LLMs: BioBERT, SciBERT, ClinicalBERT, BioGPT, Clinical-T5, Med-PaLM, Med-PaLM 2, PMC-LLaMA
LLMs in Healthcare
Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of
clinical practice and enhance patients' experiences:
1 INFORMATION EXTRACTION
• Extracting useful information and insights from Electronic Health Records.
• Assisting doctors with literature review and easier access to up-to-date and trustworthy information.
2 PREDICTION & RECOMMENDATION
• Predicting disease progression and identifying high-risk patients.
• Recommending personalized treatment plans, interventions, clinical trials, and improving clinical workflows.
3 GENERATION
• Generating clinical notes and reducing doctors’ documentation time.
• Automating administrative tasks such as medical coding and billing.
• Generating differential diagnoses and improving diagnosis accuracy.
Clinical Note Generation
from Doctor-Patient Conversations
INTRODUCTION
• Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with patients (Hripcsak et al., 2011).
• The time spent with Electronic Health Record systems contributes to high rates of attrition, with a burnout rate already exceeding 50% (Arndt et al., 2017).
• Generating clinical notes from doctor-patient conversations can reduce doctors’ workload: they edit and validate the generated notes instead of writing the full notes during the consultations.
• However, the lack of publicly available doctor-patient dialogue datasets hinders research efforts from the NLP community on this summarization/generation task.
New Datasets
• We created and released two new datasets free of PHI that can be used for benchmarking and research on clinical note generation and can accelerate research efforts on this understudied NLP task.
MTS-Dialog Dataset [1]
• To avoid privacy infringement risks, we created simulated conversations from publicly available de-identified clinical notes from the MTSamples collection.
• Eight trained annotators used the note sections to write simulated conversations according to guidelines derived from a study of a large private collection of real conversations and notes.
ACI-BENCH: Ambient Clinical Intelligence Benchmark [2]
• Includes full clinical notes and associated conversations from simulated encounters.
• Four annotators with medical backgrounds validated the dataset.
[1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
[2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
Example from the MTS-Dialog Dataset
• The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note sections).
• The clinical notes cover the six most frequent note types and specialties in the collection, including: General
Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
Example from the ACI-BENCH Dataset
• The ACI-BENCH dataset consists of 207 pairs of full doctor-patient conversations and associated clinical notes.
Summarization of Short Doctor-Patient Conversations
o We fine-tuned and evaluated summarization models on the MTS-Dialog dataset.
o Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum reached a ROUGE-1 score of 40.15%, vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights the importance of using relevant pre-finetuning datasets.
o Guided summarization and data augmentation helped improve ROUGE-1 from 40% to 42%.
o We selected the best-performing model from each category for further studies using fact-based metrics, BERTScore, and BLEURT.
o Guided Summarization (GS) led to a consistent improvement across all automated metrics except for ROUGE-2 and fact-based metrics.
o Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
Source: Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
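The ROUGE-1 figures quoted for these models measure unigram overlap between a generated summary and its reference. The sketch below shows the idea under simplifying assumptions: whitespace tokenization, lowercasing, and no stemming, so its numbers will not match an official ROUGE implementation exactly. The function name and the two example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection keeps, per token, the smaller of the two counts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up example: 4 shared unigrams, 7 reference tokens, 6 candidate tokens.
reference = "patient reports chest pain for two days"
candidate = "patient reports chest pain since yesterday"
score = rouge1_f1(reference, candidate)  # 2 * 4 / (7 + 6) = 8/13
```

Because the score only counts shared words, a summary can score well while omitting or inverting a critical medical fact, which is one reason the lecture pairs ROUGE with fact-based and embedding-based metrics.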
Summarization of Long Doctor-Patient Conversations
• We fine-tuned and evaluated different summarization models at the full-note level on the ACI-BENCH dataset.
• Simple retrieval-based methods provided strong baselines, with better out-of-the-box performance than Longformer-Encoder-Decoder (LED) models and full-note BART models.
• Division-based generation worked better for BART and LED fine-tuned models, with 53.46 ROUGE-1 for BART+FTSAMSum(Division).
• OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting.
• GPT-4 demonstrated the highest MEDCON UMLS-based evaluation score (57.78) while achieving the second to third-best performance in ROUGE scores (51.76 ROUGE-1, 22.58 ROUGE-2, and 45.97 ROUGE-L).
Source: Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
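Division-based generation can be pictured as generating each part of the note separately and concatenating the parts. The sketch below is only an illustration of that control flow. It assumes the note is divided into the four sections named in the Task B prompt later in this lecture, and `generate_section` and `toy_generator` are hypothetical stand-ins for a fine-tuned model, not the setup used in the paper.

```python
from typing import Callable, List

# Assumption: section names borrowed from the Task B prompt wording.
SECTIONS: List[str] = [
    "HISTORY OF PRESENT ILLNESS",
    "PHYSICAL EXAM",
    "RESULTS",
    "ASSESSMENT AND PLAN",
]


def generate_note_by_division(
    conversation: str,
    generate_section: Callable[[str, str], str],
) -> str:
    """Generate each section independently, then concatenate them in order."""
    parts = []
    for section in SECTIONS:
        text = generate_section(conversation, section).strip()
        parts.append(f"{section}\n{text}")
    return "\n\n".join(parts)


def toy_generator(conversation: str, section: str) -> str:
    # Hypothetical placeholder: a real system would call a summarization model here.
    return f"(summary of the {section.lower()} content)"


note = generate_note_by_division("Doctor: ... Patient: ...", toy_generator)
```

Each call only has to produce a short section, which may be one reason the approach helped models with limited output length; that explanation is a conjecture and is not stated on the slide.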
Evaluation Methods
• Assessing which note generation model would best serve clinicians in their daily practice is an important and challenging task. However, the assessment can be (strongly) biased by the adopted evaluation metrics.
• For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be considered, e.g.:
o Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or detected with high accuracy.
o Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential factors in adopting a summarization model.
Evaluation Metrics
We studied the correlation between evaluation metrics and manual scores provided by medical annotators.
[Figure: Average Correlation Scores across Seven Clinical Datasets]
• The evaluation metrics had substantially different behaviors on different types of clinical note datasets.
• The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with human judgments on different evaluation criteria.
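A correlation study of this kind compares, note by note, an automatic metric's score with a human rating. The sketch below computes a Pearson correlation in plain Python. The choice of Pearson is an assumption (the slide does not say which coefficient was used), and both score lists are made-up illustration values, not results from the study.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Made-up values: one automatic metric score and one human rating per note.
metric_scores = [0.42, 0.35, 0.51, 0.28, 0.47]
human_scores = [4.0, 3.0, 5.0, 2.0, 4.0]
r = pearson(metric_scores, human_scores)
```

Repeating this per metric, per evaluation criterion, and per dataset gives a table of coefficients like the one the figure summarizes.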
Insights from Shared Task Evaluations
• The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from doctor-patient conversations and the generation of synthetic medical conversations for data augmentation.
• The shared tasks can gather and evaluate diverse ideas from the research community (especially in the age of advanced LLMs) and empirically identify the best approaches.
• 29 teams participated in the MEDIQA-Chat & MEDIQA-Sum competitions, organized at ACL-ClinicalNLP & CLEF 2023.
MEDIQA-Chat 2023
Tasks
Task A - Short Dialogue2Note Summarization
This task focuses on summarizing short doctor-patient conversations to generate a summary for one section of a clinical note, including a section header.
Task B - Full Dialogue2Note Summarization
The goal of Task B is to generate a complete note for each doctor-patient encounter. The note must include all relevant sections.
Task C - Note2Dialogue Generation
The third task addresses data augmentation through the generation of synthetic doctor-patient conversations from full clinical notes.
ChatGPT & GPT-4 Models
• ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens & GPT-4 allows 32k tokens).
• Different temperatures for Tasks A/B (1) and Task C (0/1) (deterministic/creative outputs).
• Prompts:
⚬ Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT
ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications,
Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC
HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the
conversation in a clinical note style. The conversation is: ”
⚬ Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS,
PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: ”
⚬ Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the
medical information provided in this note:"
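The Task A prompt asks the model to answer with the section class, then `#`, then the summary. A response in that shape has to be split back into the two parts before scoring. The sketch below shows one way to do it. The function name is hypothetical, the class list is copied from the prompt, and case-insensitive matching of the class is an assumption about how lenient the parsing should be.

```python
from typing import Optional, Tuple

# The 20 section classes, as written in the Task A prompt.
CLASSES = [
    "FAMILY HISTORY/SOCIAL HISTORY", "HISTORY of PRESENT ILLNESS",
    "PAST MEDICAL HISTORY", "CHIEF COMPLAINT", "PAST SURGICAL HISTORY",
    "Allergy", "REVIEW OF SYSTEMS", "Medications", "Assessment", "Exam",
    "Diagnosis", "Disposition", "Plan", "EMERGENCY DEPARTMENT COURSE",
    "Immunizations", "Imaging", "GYNECOLOGIC HISTORY", "Procedures",
    "Other history", "Labs",
]
_CANONICAL = {name.upper(): name for name in CLASSES}


def parse_task_a_response(response: str) -> Tuple[Optional[str], str]:
    """Split 'CLASS # summary' into (class, summary).

    Returns (None, text) when the separator is missing, and a None class
    when the text before '#' is not one of the 20 known classes.
    """
    header, separator, summary = response.partition("#")
    if not separator:
        return None, response.strip()
    label = _CANONICAL.get(header.strip().upper())
    return label, summary.strip()


label, summary = parse_task_a_response("CHIEF COMPLAINT # Chest pain for two days.")
```

Splitting on the first `#` only keeps any later `#` characters inside the summary text.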
MEDIQA-Chat: Task B
Long Dialogue Summarization (ACI-BENCH
Dataset)
MEDIQA-Chat 2023
Approaches & Insights
• Dynamic prompting & in-context learning was the winning solution for clinical note generation.
• Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading solutions with GPT-4.
• Generation of full notes worked better than generating specific sections at a time.
• Data augmentation was beneficial to fine-tuning methods but was not applied/tested with GPT-4 solutions.
Empirical validation of the best methods on other/different datasets showed that:
• LLMs are highly sensitive to in-context examples and similarity models for this task.
• LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions to retrieve in-context examples and tackle potential retrieval overfitting.
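Dynamic prompting, as described here, means choosing the in-context examples per input, by similarity to the new conversation. The sketch below illustrates the mechanism with a bag-of-words cosine similarity. This similarity function is an assumption chosen to keep the example self-contained (the participating systems' similarity models are not specified on the slide), and the example pool, function names, and prompt layout are all made up.

```python
from collections import Counter
from math import sqrt
from typing import List, Tuple

Example = Tuple[str, str]  # (conversation, reference note)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(query: str, pool: List[Example], k: int = 2) -> List[Example]:
    """Return the k pool examples whose conversations are most similar to the query."""
    query_vec = Counter(query.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: cosine(query_vec, Counter(ex[0].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, pool: List[Example], k: int = 2) -> str:
    """Assemble a few-shot prompt from the selected examples plus the new input."""
    blocks = [f"Conversation:\n{conv}\nNote:\n{note}" for conv, note in select_examples(query, pool, k)]
    blocks.append(f"Conversation:\n{query}\nNote:")
    return "\n\n".join(blocks)


# Made-up pool of training pairs.
pool = [
    ("patient has knee pain after running", "Knee pain."),
    ("patient reports cough and fever", "Cough, fever."),
    ("patient has chest pain on exertion", "Exertional chest pain."),
]
query = "patient has chest pain when climbing stairs"
prompt = build_prompt(query, pool, k=2)
```

Swapping `cosine` for another similarity model changes which examples reach the prompt, which is the sensitivity the bullets above point to.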
MEDIQA-Sum @ CLEF 2023
Methods & Results
• The ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended and overlapping dataset.
• One difference between the participants in MEDIQA-Sum and MEDIQA-Chat was that there were no GPT-4 submissions in MEDIQA-Sum.
• For short dialogue summarization, scores in the two 2023 editions were comparable: aggregate scores ranged from 0.37 to 0.58 in MEDIQA-Chat and from 0.28 to 0.57 in MEDIQA-Sum.
• For full-encounter (long) generation, MEDIQA-Chat scores ranged from 0.28 to 0.61 in ROUGE-1 and from 0.21 to 0.65 in aggregate score. In MEDIQA-Sum, the ranges were slightly lower: 0.28 to 0.50 in ROUGE-1 and 0.25 to 0.46 in aggregate score.
• The results suggest that many open-source methods are still very competitive for classification and shorter generation tasks, whereas longer generation may require more powerful LLMs.
References for the full details
DATASETS & METHODS
• Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023
• Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023
SHARED TASKS
• Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. ClinicalNLP@ACL 2023
• Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient Conversations. CLEF 2023
EVALUATION METRICS
• Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation Metrics for Automated Medical Note Generation. ACL (Findings) 2023
Medical Error
Detection & Correction
Detection and Correction of Medical Errors
• Medical errors can be costly to both patients and healthcare providers. Detection and correction of these errors is crucial for improving health care outcomes.
• LLMs can enable faster and more accurate solutions to detect medical errors such as:
o Misdiagnosis: LLMs can be leveraged to identify inconsistencies between the patient's symptoms and the diagnosis mentioned in the clinical note.
o Medication errors: LLMs can be leveraged to detect errors in medication dosage, frequency, duration, potential drug interactions or contraindications.
• LLMs can also assist with suggesting corrections and recommending differential diagnoses, personalized treatments, and interventions.
MEDIQA-CORR 2024
Organizers:
Asma Ben Abacha, Microsoft
Wen-wai Yim, Microsoft
Meliha Yetisgen, University of Washington
Fei Xia, University of Washington
New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction:
• The MS dataset consists of 3,359 clinical notes.
• The UW dataset consists of 488 de-identified notes (requires a data use agreement, DUA).
• Each note is either correct or contains one error (e.g., diagnosis, treatment, management) and its correction.
Examples from the MEDIQA-CORR-MS Dataset
A 58-year-old man comes to his physician because of a 1-month history of
increased thirst and nocturia. He is drinking a lot of water to compensate for any
dehydration. His brother has type 2 diabetes mellitus. Physical examination
shows dry mucous membranes. Laboratory studies show a serum sodium of 151
mEq/L and glucose of 121 mg/dL. A water deprivation test shows:
                                                            Serum osmolality    Urine osmolality
                                                            (mOsmol/kg H2O)     (mOsmol/kg H2O)
Initial presentation                                        295                 285
After 3 hours without fluids                                305                 310
After administration of antidiuretic hormone (ADH) analog   280                 355
Patient was diagnosed with primary polydipsia.
Corrected sentence: Patient was diagnosed with partial central diabetes insipidus.
Examples from the MEDIQA-CORR-MS Dataset
A 75-year-old woman comes to the physician because of generalized
weakness for 6 months. During this period, she has also had a 4-kg
(8.8-lb) weight loss and frequent headaches. She has been avoiding
eating solids because of severe jaw pain. She has hypertension and
osteoporosis. She underwent a total left-sided knee arthroplasty 2 years
ago because of osteoarthritis. The patient does not smoke or drink
alcohol. Her current medications include enalapril, metoprolol, low-dose
aspirin, and a multivitamin. She appears pale. Her temperature is 37.5 C
(99.5 F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical
examination shows no abnormalities. Intravenous
methylprednisolone and a temporal artery biopsy is
recommended after labs were reviewed.
Laboratory studies showed:
Hemoglobin                        10 g/dL
Mean corpuscular volume           87 μm3
Leukocyte count                   8,500/mm3
Platelet count                    450,000/mm3
Erythrocyte sedimentation rate    90 mm/h
Corrected sentence: Oral prednisone and a temporal artery biopsy is
recommended after labs were reviewed.
Tasks & GPT Baselines
(the challenge is still open; the run submission deadline is March 28)
Tasks:
1. Predicting the error flag (i.e., does the text contain an error or not?),
2. Extracting the error sentence ID (or -1 for texts without errors),
3. Generating a corrected sentence (or NA for texts without errors).
ChatGPT & GPT-4 results using the same prompt to generate the responses:
Baseline   Error Flag Prediction   Error Sentence Detection   Sentence Correction
           Accuracy                Accuracy                   ROUGE-1   BERTScore   BLEURT   Aggregate Score*
ChatGPT    46.18                   45.64                      46.93     41.61       48.84    45.79
GPT-4      61.31                   48.91                      52.76     56.97       57.15    55.63
* Aggregate Score: mean of ROUGE-1-F, BERTScore, and BLEURT-20.
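The aggregate score in the baseline results is defined as the mean of ROUGE-1-F, BERTScore, and BLEURT-20, and the two detection tasks are scored by accuracy. The sketch below recomputes the aggregate from the three reported component scores and adds a small accuracy helper. The function names are hypothetical, and the flag and sentence-ID lists are made-up illustration data, not dataset content.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple


def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Mean of the three sentence-correction metrics."""
    return mean([rouge1_f, bertscore, bleurt])


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Percentage of positions where the prediction equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


# Component scores reported for the baselines: (ROUGE-1, BERTScore, BLEURT).
baselines: Dict[str, Tuple[float, float, float]] = {
    "ChatGPT": (46.93, 41.61, 48.84),
    "GPT-4": (52.76, 56.97, 57.15),
}
aggregates = {name: round(aggregate_score(*scores), 2) for name, scores in baselines.items()}
# aggregates matches the reported aggregate scores: 45.79 and 55.63

# Made-up predictions: error flags (1 = error) and error sentence IDs (-1 = no error).
flag_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
sentence_id_accuracy = accuracy([3, -1, 5, -1], [3, -1, 4, -1])
```

Averaging a lexical metric with two learned metrics is meant to soften the weaknesses of any single one.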
Challenges & Opportunities
The potential benefits of LLMs in healthcare and medicine are substantial. However, there are
also challenges associated with their use, such as bias, hallucinations that can
impact medical outcomes, important omissions, critical medical errors, and privacy concerns.
Continued research and innovation in this area are crucial to address these challenges and to allow
doctors and health workers to use high-performing LLM-based solutions with the necessary
safety guardrails.
abenabacha@microsoft.com
@AsmaBenAbacha

More Related Content

What's hot

Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMsSylvainGugger
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingMinh Pham
 
A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3Ishan Jain
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...David Talby
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceJulien SIMON
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroNumenta
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfM Waleed Kadous
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfDavid Rostcheck
 
LLMs_talk_March23.pdf
LLMs_talk_March23.pdfLLMs_talk_March23.pdf
LLMs_talk_March23.pdfChaoYang81
 
Generative Models and ChatGPT
Generative Models and ChatGPTGenerative Models and ChatGPT
Generative Models and ChatGPTLoic Merckel
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersLiangqun Lu
 
Explainable AI - making ML and DL models more interpretable
Explainable AI - making ML and DL models more interpretableExplainable AI - making ML and DL models more interpretable
Explainable AI - making ML and DL models more interpretableAditya Bhattacharya
 
Intelligence Artificielle - Systèmes experts
Intelligence Artificielle - Systèmes expertsIntelligence Artificielle - Systèmes experts
Intelligence Artificielle - Systèmes expertsMohamed Heny SELMI
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression SanjanaRajeshKothari
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfPo-Chuan Chen
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with TransformersJulien SIMON
 

What's hot (20)

Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3A brief primer on OpenAI's GPT-3
A brief primer on OpenAI's GPT-3
 
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
Large Language Models, No-Code, and Responsible AI - Trends in Applied NLP in...
 
Llama-index
Llama-indexLlama-index
Llama-index
 
An introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging FaceAn introduction to computer vision with Hugging Face
An introduction to computer vision with Hugging Face
 
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve OmohundroOpenAI’s GPT 3 Language Model - guest Steve Omohundro
OpenAI’s GPT 3 Language Model - guest Steve Omohundro
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 
LLMs_talk_March23.pdf
LLMs_talk_March23.pdfLLMs_talk_March23.pdf
LLMs_talk_March23.pdf
 
Generative Models and ChatGPT
Generative Models and ChatGPTGenerative Models and ChatGPT
Generative Models and ChatGPT
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
 
Explainable AI - making ML and DL models more interpretable
Explainable AI - making ML and DL models more interpretableExplainable AI - making ML and DL models more interpretable
Explainable AI - making ML and DL models more interpretable
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
 
Intelligence Artificielle - Systèmes experts
Intelligence Artificielle - Systèmes expertsIntelligence Artificielle - Systèmes experts
Intelligence Artificielle - Systèmes experts
 
Paper presentation on LLM compression
Paper presentation on LLM compression Paper presentation on LLM compression
Paper presentation on LLM compression
 
Deep learning
Deep learningDeep learning
Deep learning
 
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdfRetrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 

Similar to Large Language Models and Applications in Healthcare

Automated Extraction Of Reported Statistical Analyses Towards A Logical Repr...
Automated Extraction Of Reported Statistical Analyses  Towards A Logical Repr...Automated Extraction Of Reported Statistical Analyses  Towards A Logical Repr...
Automated Extraction Of Reported Statistical Analyses Towards A Logical Repr...Nat Rice
 
NR 328 EBP Improving Diagnostic Safety Project.pdf
NR 328 EBP Improving Diagnostic Safety Project.pdfNR 328 EBP Improving Diagnostic Safety Project.pdf
NR 328 EBP Improving Diagnostic Safety Project.pdfbkbk37
 
Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Pubrica
 
Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...
Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...
Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...Kaela Johnson
 
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Health Informatics New Zealand
 
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Case Retrieval using Bhattacharya Coefficient with Particle Swarm OptimizationCase Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimizationrahulmonikasharma
 
Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Pubrica
 
Measuring the Effectiveness of eHealth Initiatives in Hospitals
Measuring the Effectiveness of eHealth Initiatives in HospitalsMeasuring the Effectiveness of eHealth Initiatives in Hospitals
Measuring the Effectiveness of eHealth Initiatives in HospitalsHealth Informatics New Zealand
 
Overview of ePRO
Overview of ePROOverview of ePRO
Overview of ePROchallPHT
 
College Writing II Synthesis Essay Assignment Summer Semester 2017.docx
College Writing II Synthesis Essay Assignment Summer Semester 2017.docxCollege Writing II Synthesis Essay Assignment Summer Semester 2017.docx
College Writing II Synthesis Essay Assignment Summer Semester 2017.docxclarebernice
 
NR 327 MDC EBP Diagnose Errors Question.pdf
NR 327 MDC EBP Diagnose Errors Question.pdfNR 327 MDC EBP Diagnose Errors Question.pdf
NR 327 MDC EBP Diagnose Errors Question.pdfbkbk37
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHcscpconf
 
Application Evaluation Project Part 1 Evaluation Plan FocusTec.docx
Application Evaluation Project Part 1 Evaluation Plan FocusTec.docxApplication Evaluation Project Part 1 Evaluation Plan FocusTec.docx
Application Evaluation Project Part 1 Evaluation Plan FocusTec.docxalfredai53p
 
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...IRJET Journal
 
MAST: a model for HTA-based assessment of telemedicine applications
MAST: a model for HTA-based assessment of telemedicine applicationsMAST: a model for HTA-based assessment of telemedicine applications
MAST: a model for HTA-based assessment of telemedicine applicationsHTAi Bilbao 2012
 
Recommendations on Evidence Needed to Support Measurement Equivalence between...
Recommendations on Evidence Needed to Support Measurement Equivalence between...Recommendations on Evidence Needed to Support Measurement Equivalence between...
Recommendations on Evidence Needed to Support Measurement Equivalence between...CRF Health
 
cognitive computing for electronic medical record
cognitive computing for electronic medical record cognitive computing for electronic medical record
cognitive computing for electronic medical record selamu shirtawi
 
The Role of the FAIR Guiding Principles for an effective Learning Health System
The Role of the FAIR Guiding Principles for an effective Learning Health SystemThe Role of the FAIR Guiding Principles for an effective Learning Health System
The Role of the FAIR Guiding Principles for an effective Learning Health SystemMichel Dumontier
 


Large Language Models and Applications in Healthcare

  • 1. Large Language Models & Applications in Healthcare Asma Ben Abacha Guest Lecture, University of Illinois, Urbana-Champaign, February 29, 2024
  • 2. Abstract The time that medical doctors spend on Electronic Health Record systems was shown to contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with the patients. Recent studies on clinical note generation have shown that doctors can save a significant amount of time with automatic note generation systems. Progress in LLMs can play a key role in enabling further such systems and improving their performance. However, this requires high-quality datasets and benchmarks and relevant evaluation metrics to assess which model would best serve clinicians in their daily practice. In this lecture, I’ll present LLM-based solutions for the task of clinical note generation from doctor-patient conversations. I’ll also present insights from different evaluation studies and shared tasks that we organized on this topic. Another important aspect of supporting healthcare providers with documentation and clinical decisions is to detect medical errors in clinical notes and to suggest corrections (e.g., diagnosis, treatment, medication). Such errors require medical expertise and knowledge to be both identified and corrected. Recent LLMs showed promise in being applied on unseen tasks with competitive ability. The second part of the lecture will cover a new research endeavor on medical error detection and correction and present LLM-based solutions for the task.
  • 3. Plan I. Large Language Models (LLMs) in Healthcare II. Clinical Note Generation from Doctor-Patient Conversations III. Medical Error Detection & Correction
  • 4. Large Language Models (LLMs), timeline 2018-2023 (figure). General LLMs: GPT-1, BERT, GPT-2, ERNIE, XLNet, RoBERTa, GPT-3, T5-xxl, XLM-R, PaLM, BLOOM, ChatGPT, BARD, GPT-4, Falcon, Claude 2, LLaMA 2, Mistral 7B. Medical LLMs: BioBERT, SciBERT, ClinicalBER, Clinical-T5, Med-PaLM, BioGPT, Med-PaLM 2, PMC-LLaMA.
  • 5.
  • 6. LLMs in Healthcare: Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of clinical practice and enhance patients' experiences. (1) Information extraction: extracting useful information and insights from Electronic Health Records; assisting doctors with literature review and easier access to up-to-date and trustworthy information. (2) Prediction & recommendation: predicting disease progression and identifying high-risk patients; recommending personalized treatment plans, interventions, clinical trials, and improving clinical workflows. (3) Generation: generating clinical notes and reducing doctors’ documentation time; automating administrative tasks such as medical coding and billing; generating differential diagnoses and improving diagnosis accuracy.
  • 7. Clinical Note Generation from Doctor-Patient Conversations
  • 8. Introduction • Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with the patients (Hripcsak et al., 2011). • The time spent with Electronic Health Record systems contributes to high rates of attrition, with a burnout rate already exceeding 50% (Arndt et al., 2017). • Generating clinical notes from doctor-patient conversations can reduce the doctors’ workload: they edit and validate the generated notes instead of writing the full notes during the consultations. • However, the lack of publicly available doctor-patient dialogue datasets hinders research efforts from the NLP community on this summarization/generation task.
  • 9. New Datasets: We created and released two new datasets free of PHI that can be used for benchmarking and research on clinical note generation and accelerate research efforts on this understudied NLP task. MTS-Dialog Dataset [1]: To avoid privacy infringement risks, we created simulated conversations from publicly available de-identified clinical notes from the Mtsamples collection. Eight trained annotators used the notes sections to write simulated conversations according to guidelines derived from a study of a large private collection of real conversations and notes. ACI-BENCH: Ambient Clinical Intelligence Benchmark [2]: Includes full clinical notes and associated conversations from simulated encounters. Four annotators with medical backgrounds validated the dataset. [1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023. [2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
  • 10. Example from the MTS-Dialog Dataset: The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note sections). The clinical notes cover the six most frequent note types and specialties in the collection, including: General Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
  • 11. Example from the ACI-BENCH Dataset The ACI-BENCH dataset consists of 207 pairs of full doctor-patient conversations and associated clinical notes.
  • 12. Summarization of Short Doctor-Patient Conversations: We fine-tuned and evaluated summarization models on the MTS-Dialog dataset. o Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum achieved a ROUGE-1 score of 40.15%, vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights the importance of using relevant pre-finetuning datasets. o Guided summarization and data augmentation helped improve ROUGE-1 from 40% to 42%. Source: Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
  • 13. Summarization of Short Doctor-Patient Conversations: We selected the best-performing model from each category for further studies using fact-based metrics, BERTScore, and BLEURT. o Guided Summarization (GS) led to a consistent improvement across all automated metrics except for ROUGE-2 and fact-based metrics. o Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
  • 14. Summarization of Long Doctor-Patient Conversations: We fine-tuned and evaluated different summarization models at the full note level on the ACI-BENCH dataset. o Simple retrieval-based methods provided strong baselines with better out-of-the-box performances than Longformer-Encoder-Decoder (LED) models and full-note BART models. o Division-based generation worked better for BART and LED fine-tuned models with 53.46 ROUGE-1 for BART+FTSAMSum(Division). o OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting. o GPT-4 demonstrated the highest MEDCON UMLS-based evaluation score (57.78) while achieving the second to third-best performance in ROUGE scores (51.76 ROUGE-1, 22.58 ROUGE-2 and 45.97 ROUGE-L). Source: Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
  • 15. Evaluation Methods • Assessing which note generation model would best serve clinicians in their daily practice is an important and challenging task; however, it can be (strongly) biased by the adopted evaluation metrics. • For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be considered, e.g.: • Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or detected with a high accuracy. • Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential factors in adopting a summarization model.
  • 16. Evaluation Metrics: We studied the correlation between evaluation metrics and manual scores provided by medical annotators. Average Correlation Scores across Seven Clinical Datasets: o The evaluation metrics had substantially different behaviors on different types of clinical notes datasets. o The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with human judgments on different evaluation criteria.
  • 17. Insights from Shared Task Evaluations: o The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from doctor-patient conversations and the generation of synthetic medical conversations for data augmentation. o The shared tasks can gather and evaluate diverse ideas from the research community (especially in the age of advanced LLMs) and empirically identify the best approaches. o 29 participating teams at the MEDIQA-Chat & MEDIQA-Sum competitions organized @ ACL-ClinicalNLP & CLEF 2023.
  • 18. MEDIQA-Chat 2023 Tasks: Task A - Short Dialogue2Note Summarization: This task focuses on summarizing short doctor-patient conversations to generate a summary for one section of a clinical note, including a section header. Task B - Full Dialogue2Note Summarization: The goal of task B is to generate a complete note for each doctor-patient encounter. The note must include all relevant sections. Task C - Note2Dialogue Generation: The third task addresses data augmentation through the generation of synthetic doctor-patient conversations from full clinical notes.
  • 19. ChatGPT & GPT-4 Models • ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens & GPT-4 allows 32k tokens). • Different temperatures for Tasks A/B (1) and Task C (0/1) (deterministic/creative outputs). • Prompts: ⚬ Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications, Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the conversation in a clinical note style. The conversation is: " ⚬ Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: " ⚬ Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the medical information provided in this note:"
  • 20. MEDIQA-Chat: Task B Long Dialogue Summarization (ACI-BENCH Dataset)
  • 21. MEDIQA-Chat 2023 Approaches & Insights • Dynamic prompting & in-context learning was the winning solution for clinical note generation. • Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading solutions with GPT-4. • Generation of full notes worked better than generating specific sections at a time. • Data augmentation was beneficial to fine-tuning methods but was not applied/tested with GPT-4 solutions. Empirical validation of the best methods on other/different datasets showed that: o LLMs are highly sensitive to in-context examples and similarity models for this task. o LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions to retrieve in-context examples and tackle potential retrieval overfitting.
  • 22. MEDIQA-Sum @ CLEF 2023 Methods & Results: o ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended and overlapping dataset. o A difference between the participants in MEDIQA-Sum and MEDIQA-Chat was that there were no GPT-4 submissions in MEDIQA-Sum. o For short dialogue summarization, scores in the two 2023 editions were comparable: MEDIQA-Chat aggregate scores ranged from 0.37 to 0.58 and MEDIQA-Sum aggregate scores from 0.28 to 0.57. o For full-encounter (long) generation, MEDIQA-Chat scores ranged from 0.28 to 0.61 ROUGE-1 and from 0.21 to 0.65 aggregate score. In MEDIQA-Sum, the ranges were slightly lower, with 0.28-0.50 ROUGE-1 and 0.25-0.46 aggregate score. o The results suggest that many open source-based methods are still very competitive for classification and shorter generation tasks whereas longer generation may require more powerful LLMs.
  • 23. References for the full details. DATASETS & METHODS • Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023 • Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023 SHARED TASKS • Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. ClinicalNLP@ACL 2023 • Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient Conversations. CLEF 2023 EVALUATION METRICS • Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation Metrics for Automated Medical Note Generation. ACL (Findings) 2023
  • 25. Detection and Correction of Medical Errors • Medical errors can be costly to both patients and healthcare providers. Detection and correction of these errors is crucial for improving health care outcomes. • LLMs can enable faster and more accurate solutions to detect medical errors such as: o Misdiagnosis: LLMs can be leveraged to identify inconsistencies between the patient's symptoms and the diagnosis mentioned in the clinical note. o Medication errors: LLMs can be leveraged to detect errors in medication dosage, frequency, duration, potential drug interactions or contraindications. • LLMs can also assist with suggesting corrections and recommending differential diagnoses, personalized treatments, and interventions.
  • 26. MEDIQA-CORR 2024: New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction. Organizers: Asma Ben Abacha (Microsoft), Wen-wai Yim (Microsoft), Meliha Yetisgen (University of Washington), Fei Xia (University of Washington). o The MS dataset consists of 3,359 clinical notes. o The UW dataset consists of 488 de-identified notes (requires a DUA). o Each note is either correct or contains one error (e.g., diagnosis, treatment, management) and its correction.
  • 27. Examples from the MEDIQA-CORR-MS Dataset: A 58-year-old man comes to his physician because of a 1-month history of increased thirst and nocturia. He is drinking a lot of water to compensate for any dehydration. His brother has type 2 diabetes mellitus. Physical examination shows dry mucous membranes. Laboratory studies show a serum sodium of 151 mEq/L and glucose of 121 mg/dL. A water deprivation test shows:
      Condition | Serum osmolality (mOsmol/kg H2O) | Urine osmolality (mOsmol/kg H2O)
      Initial presentation | 295 | 285
      After 3 hours without fluids | 305 | 310
      After administration of antidiuretic hormone (ADH) analog | 280 | 355
      Error sentence: Patient was diagnosed with primary polydipsia.
      Corrected sentence: Patient was diagnosed with partial central diabetes insipidus.
  • 28. Examples from the MEDIQA-CORR-MS Dataset: A 75-year-old woman comes to the physician because of generalized weakness for 6 months. During this period, she has also had a 4-kg (8.8-lb) weight loss and frequent headaches. She has been avoiding eating solids because of severe jaw pain. She has hypertension and osteoporosis. She underwent a total left-sided knee arthroplasty 2 years ago because of osteoarthritis. The patient does not smoke or drink alcohol. Her current medications include enalapril, metoprolol, low-dose aspirin, and a multivitamin. She appears pale. Her temperature is 37.5 C (99.5 F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical examination shows no abnormalities. Intravenous methylprednisolone and a temporal artery biopsy is recommended after labs were reviewed. Laboratory studies showed:
      Hemoglobin | 10 g/dL
      Mean corpuscular volume | 87 μm3
      Leukocyte count | 8,500/mm3
      Platelet count | 450,000/mm3
      Erythrocyte sedimentation rate | 90 mm/h
      Corrected sentence: Oral prednisone and a temporal artery biopsy is recommended after labs were reviewed.
  • 29. Tasks & GPT Baselines (the challenge is still open, run submission deadline on March 28). Tasks: 1. Predicting the error flag (i.e., does the text contain an error or not?), 2. Extracting the error sentence ID (or -1 for texts without errors), 3. Generating a correct sentence (or NA for texts without errors). ChatGPT & GPT-4 results using the same prompt to generate the responses:
      Baseline | Error Flag Prediction (Accuracy) | Error Sentence Detection (Accuracy) | Sentence Correction: ROUGE1 | BERTScore | BLEURT | Aggregate Score
      ChatGPT | 46.18 | 45.64 | 46.93 | 41.61 | 48.84 | 45.79
      GPT-4 | 61.31 | 48.91 | 52.76 | 56.97 | 57.15 | 55.63
      Aggregate Score: mean of ROUGE-1-F, BERTScore, and BLEURT-20.
  • 30. Challenges & Opportunities: The potential benefits of LLMs in healthcare and medicine are substantial. However, there are also challenges associated with the use of LLMs in healthcare, such as bias, hallucinations that can impact medical outcomes, important omissions, critical medical errors, and privacy concerns. Continued research and innovation in this area is crucial to address these challenges and to allow doctors and health workers to use high-performing LLM-based solutions with the necessary safety guardrails.
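
Two conventions from the slides lend themselves to a short illustration: the aggregate score from slide 29 (the mean of ROUGE-1-F, BERTScore, and BLEURT-20) and the Task A response format from slide 19 (the selected class, then #, then the summary). The Python sketch below is only an illustration under those stated assumptions, not the organizers' evaluation code. The function names `aggregate_score` and `parse_task_a_response` and the sample response string are hypothetical; the numeric inputs are the baseline scores reported on slide 29.

```python
from statistics import mean
from typing import Tuple


def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Arithmetic mean of the three sentence-correction metrics (slide 29)."""
    return mean([rouge1_f, bertscore, bleurt])


def parse_task_a_response(response: str) -> Tuple[str, str]:
    """Split a '<class> # <summary>' response at the first '#'.

    Assumes the model followed the format requested in the Task A prompt;
    raises ValueError when no '#' separator is present.
    """
    header, sep, summary = response.partition("#")
    if not sep:
        raise ValueError("expected '<class> # <summary>'")
    return header.strip(), summary.strip()


# Sentence-correction scores (ROUGE-1, BERTScore, BLEURT) from slide 29.
BASELINES = {
    "ChatGPT": (46.93, 41.61, 48.84),
    "GPT-4": (52.76, 56.97, 57.15),
}

if __name__ == "__main__":
    for name, scores in BASELINES.items():
        print(name, round(aggregate_score(*scores), 2))

    # Hypothetical model output, used only to exercise the parser.
    sample = "CHIEF COMPLAINT # Patient reports knee pain."
    print(parse_task_a_response(sample))
```

Rounded to two decimals, the computed means match the Aggregate Score column on slide 29 (45.79 for ChatGPT and 55.63 for GPT-4), which is consistent with the aggregate being a plain unweighted mean.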