Large Language Models
& Applications in Healthcare
Asma Ben Abacha
Guest Lecture, University of Illinois,
Urbana-Champaign,
February 29, 2024
Abstract
The time that medical doctors spend on Electronic Health Record systems was shown to
contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate
exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day
writing clinical notes from their conversations with the patients. Recent studies on clinical
note generation have shown that doctors can save a significant amount of time with
automatic note generation systems. Progress in LLMs can play a key role in enabling
further such systems and improving their performance. However, this requires high-
quality datasets and benchmarks and relevant evaluation metrics to assess which model
would best serve clinicians in their daily practice. In this lecture, I’ll present LLM-based
solutions for the task of clinical note generation from doctor-patient conversations. I’ll also
present insights from different evaluation studies and shared tasks that we organized on
this topic.
Another important aspect of supporting healthcare providers with documentation and
clinical decisions is to detect medical errors in clinical notes and to suggest corrections
(e.g., diagnosis, treatment, medication). Such errors require medical expertise and
knowledge to be both identified and corrected. Recent LLMs have shown promise when
applied to unseen tasks, with competitive performance. The second part of the lecture will cover
a new research endeavor on medical error detection and correction and present LLM-
based solutions for the task.
Plan
I. Large Language Models (LLMs) in
Healthcare
II. Clinical Note Generation from
Doctor-Patient Conversations
III. Medical Error Detection &
Correction
Large Language Models (LLMs) and Medical LLMs
[Figure: timeline of models from 2018 to 2023]
LLMs: GPT-1, BERT, GPT-2, ERNIE, XLNet, RoBERTa, GPT-3, T5-xxl, XLM-R, PaLM, BLOOM, ChatGPT, BARD, GPT-4, Falcon, Claude 2, LLaMA 2, Mistral 7B
Medical LLMs: BioBERT, SciBERT, ClinicalBERT, Clinical-T5, BioGPT, Med-PaLM, Med-PaLM 2, PMC-LLaMA
LLMs in Healthcare
Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of
clinical practice and enhance patients' experiences:
1 INFORMATION EXTRACTION
• Extracting useful information and insights from Electronic Health Records.
• Assisting doctors with literature review and easier access to up-to-date and trustworthy information.
2 PREDICTION & RECOMMENDATION
• Predicting disease progression and identifying high-risk patients.
• Recommending personalized treatment plans, interventions, clinical trials, and improving clinical workflows.
3 GENERATION
• Generating clinical notes and reducing doctors' documentation time.
• Automating administrative tasks such as medical coding and billing.
• Generating differential diagnoses and improving diagnosis accuracy.
Clinical Note Generation
from Doctor-Patient Conversations
INTRODUCTION
• Medical doctors spend on average 52 to 102 minutes per day
writing clinical notes from their conversations with
patients (Hripcsak et al., 2011).
• The time spent with Electronic Health Record systems
contributes to high rates of attrition, with a burnout rate
already exceeding 50% (Arndt et al., 2017).
• Generating clinical notes from doctor-patient conversations
can reduce doctors' workload: they edit and validate the
generated notes instead of writing the full notes during
the consultations.
• However, the lack of publicly available doctor-patient
dialogue datasets hinders research efforts from the NLP
community on this summarization/generation task.
New Datasets
• We created and released two new datasets free of PHI (protected health information) that can be used for
benchmarking and research on clinical note generation and can accelerate research efforts on this understudied NLP task.
MTS-Dialog Dataset [1]
• To avoid privacy infringement risks, we created simulated conversations from publicly available
de-identified clinical notes from the MTSamples collection.
• Eight trained annotators used the note sections to write simulated conversations according to
guidelines derived from a study of a large private collection of real conversations and notes.
ACI-BENCH: Ambient Clinical Intelligence Benchmark [2]
• Includes full clinical notes and associated conversations from simulated encounters.
• Four annotators with medical backgrounds validated the dataset.
[1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
[2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
Example from the MTS-Dialog Dataset
• The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note
sections).
• The clinical notes cover the six most frequent note types and specialties in the collection, including: General
Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
Example from the ACI-BENCH Dataset
• The ACI-BENCH dataset consists of 207 pairs of full doctor-patient conversations and associated clinical notes.
Summarization of Short Doctor-Patient Conversations
o We fine-tuned and evaluated summarization models on the MTS-Dialog dataset.
o Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum
reached a ROUGE-1 score of 40.15%, vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights
the importance of using relevant pre-finetuning datasets.
o Guided summarization and data augmentation helped improve ROUGE-1 from 40% to 42%.
We selected the best-performing model from each category for further studies using fact-based metrics, BERTScore, and BLEURT.
o Guided Summarization (GS) led to a consistent improvement across all automated metrics except for ROUGE-2 and fact-based
metrics.
o Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
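The ROUGE-1 figures reported for short-dialogue summarization measure unigram overlap between a generated summary and a reference note section. The following is a minimal sketch of that idea, assuming naive lowercase whitespace tokenization and no stemming; it is not the scoring script used in the study, and the example strings are hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # A unigram is credited at most as many times as it occurs in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical reference note section and system summary.
reference = "patient reports chest pain for two days"
candidate = "patient reports chest pain since yesterday"
scores = rouge1(reference, candidate)
print(scores)
```

Published ROUGE implementations usually add stemming and more careful tokenization, so their values can differ from this sketch.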
Summarization of Long Doctor-Patient Conversations
• We fine-tuned and evaluated different summarization models at the full note level on the ACI-BENCH dataset.
• Simple retrieval-based methods provided strong baselines with better out-of-the-box
performance than Longformer-Encoder-Decoder (LED) models and full-note BART models.
• Division-based generation worked better for BART and LED fine-tuned models, with 53.46
ROUGE-1 for BART+FTSAMSum(Division).
• OpenAI models with simple prompts were shown to give competitive outputs despite no
additional fine-tuning or dynamic prompting.
• GPT-4 demonstrated the highest MEDCON UMLS-based evaluation score (57.78) while
achieving the second- to third-best performance in ROUGE scores (51.76 ROUGE-1,
22.58 ROUGE-2 and 45.97 ROUGE-L).
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
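Division-based generation can be pictured as producing each part of the note separately and concatenating the parts. The sketch below is only an illustration under stated assumptions: the division names are borrowed from the four-section layout in the Task B prompt of this deck, and `dummy_summarizer` is a hypothetical placeholder for a model fine-tuned per division.

```python
from typing import Callable, List

# Assumed division names (four-section note layout; not confirmed as the paper's exact split).
DIVISIONS: List[str] = [
    "HISTORY OF PRESENT ILLNESS",
    "PHYSICAL EXAM",
    "RESULTS",
    "ASSESSMENT AND PLAN",
]


def generate_note_by_division(
    conversation: str, summarize: Callable[[str, str], str]
) -> str:
    """Generate every division independently, then join them into one note."""
    sections = []
    for division in DIVISIONS:
        text = summarize(conversation, division)
        sections.append(f"{division}\n{text}")
    return "\n\n".join(sections)


def dummy_summarizer(conversation: str, division: str) -> str:
    # Hypothetical stand-in for a per-division summarization model.
    return f"[summary of {len(conversation.split())} words for {division.lower()}]"


note = generate_note_by_division(
    "Doctor: How are you? Patient: My knee hurts.", dummy_summarizer
)
print(note)
```

One motivation for this design is that each call only has to produce a short output, which may suit encoder-decoder models with limited output length.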
Evaluation Methods
• Assessing which note generation model would best serve clinicians in their daily practice is an important
and challenging task. However, the assessment can be (strongly) biased by the adopted evaluation metrics.
• For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be
considered, e.g.:
o Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or
detected with a high accuracy.
o Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential
factors in adopting a summarization model.
Evaluation Metrics
We studied the correlation between evaluation metrics and manual scores provided by medical annotators.
[Figure: Average Correlation Scores across Seven Clinical Datasets]
• The evaluation metrics had substantially different behaviors on different types of clinical notes datasets.
• The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with
human judgments on different evaluation criteria.
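A correlation study of this kind compares, note by note, the score an automatic metric assigns with the score given by human annotators. The sketch below assumes Pearson correlation (the slide does not say which coefficient was used) and uses made-up scores and hypothetical metric names purely for illustration.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical human ratings for five generated notes.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical automatic metric scores for the same five notes.
metric_scores: Dict[str, Sequence[float]] = {
    "metric_a": [0.2, 0.4, 0.5, 0.7, 0.9],
    "metric_b": [0.6, 0.1, 0.8, 0.3, 0.5],
}

correlations = {name: pearson(vals, human) for name, vals in metric_scores.items()}
best = max(correlations, key=correlations.get)
print(correlations, best)
```

Ranking metrics by such a coefficient, per dataset and per evaluation criterion, is one way to see which metrics track human judgment most consistently.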
Insights from Shared Task Evaluations
• The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from
doctor-patient conversations and the generation of synthetic medical conversations for data
augmentation.
• The shared tasks can gather and evaluate diverse ideas from the research community (especially in
the age of advanced LLMs) and empirically identify the best approaches.
• 29 participating teams at the MEDIQA-Chat & MEDIQA-Sum competitions organized @ ACL-ClinicalNLP
& CLEF 2023.
MEDIQA-Chat 2023
Tasks
Task A - Short Dialogue2Note Summarization
This task focuses on summarizing short doctor-patient conversations to
generate a summary for one section of a clinical note, including a section
header.
Task B - Full Dialogue2Note Summarization
The goal of Task B is to generate a complete note for each doctor-patient
encounter. The note must include all relevant sections.
Task C - Note2Dialogue Generation
The third task addresses data augmentation through the generation of
synthetic doctor-patient conversations from full clinical notes.
ChatGPT & GPT-4 Models
• ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens & GPT-4 allows 32k tokens).
• Different temperatures for Tasks A/B (1) and Task C (0/1) (deterministic/creative outputs).
• Prompts:
⚬ Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT
ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications,
Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC
HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the
conversation in a clinical note style. The conversation is: "
⚬ Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS,
PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: "
⚬ Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the
medical information provided in this note:"
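The Task A prompt asks for a reply of the form "class # summary", so a baseline needs to assemble the prompt and split the reply. The following is a minimal sketch of those two steps with no API call; the prompt wording is copied from the slide, while the function names and the sample reply are hypothetical.

```python
from typing import List, Tuple

# The 20 section headers listed in the Task A prompt.
SECTION_HEADERS: List[str] = [
    "FAMILY HISTORY/SOCIAL HISTORY", "HISTORY of PRESENT ILLNESS",
    "PAST MEDICAL HISTORY", "CHIEF COMPLAINT", "PAST SURGICAL HISTORY",
    "Allergy", "REVIEW OF SYSTEMS", "Medications", "Assessment", "Exam",
    "Diagnosis", "Disposition", "Plan", "EMERGENCY DEPARTMENT COURSE",
    "Immunizations", "Imaging", "GYNECOLOGIC HISTORY", "Procedures",
    "Other history", "Labs",
]

TASK_A_PROMPT = (
    "Classify the conversation into one of these 20 classes: "
    + ", ".join(SECTION_HEADERS)
    + ". The response should start with the selected class, followed by # "
    "then the summary of the conversation in a clinical note style. "
    "The conversation is: "
)


def build_task_a_prompt(conversation: str) -> str:
    """Append the conversation to the fixed Task A instruction."""
    return TASK_A_PROMPT + conversation


def parse_task_a_response(response: str) -> Tuple[str, str]:
    """Split a 'CLASS # summary' reply into (section header, summary)."""
    header, separator, summary = response.partition("#")
    if not separator:
        # No '#' found: treat the whole reply as the summary.
        return "", response.strip()
    return header.strip(), summary.strip()


# Hypothetical model reply, used only to exercise the parser.
header, summary = parse_task_a_response("CHIEF COMPLAINT # Knee pain.")
print(header, "|", summary)
```

In practice the returned header would also be checked against `SECTION_HEADERS`, since a model may answer with a class outside the list.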
MEDIQA-Chat: Task B
Long Dialogue Summarization (ACI-BENCH Dataset)
MEDIQA-Chat 2023
Approaches & Insights
• Dynamic prompting & in-context learning was the winning solution for clinical note generation.
• Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading solutions with GPT-4.
• Generation of full notes worked better than generating specific sections at a time.
• Data augmentation was beneficial to fine-tuning methods but was not applied/tested with GPT-4 solutions.
Empirical validation of the best methods on other/different datasets showed that:
• LLMs are highly sensitive to in-context examples and similarity models for this task.
• LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions
to retrieve in-context examples and tackle potential retrieval overfitting.
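Dynamic prompting here means choosing the in-context examples per input, typically by retrieving the training conversations most similar to the new one. The sketch below uses token-level Jaccard overlap as a stand-in similarity model; this choice, the function names, and the tiny example pool are assumptions for illustration, not the winning team's method.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts (stand-in similarity model)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_examples(
    query: str, pool: List[Tuple[str, str]], k: int = 1
) -> List[Tuple[str, str]]:
    """Return the k (conversation, note) pairs most similar to the query."""
    ranked = sorted(pool, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Tuple[str, str]]) -> str:
    """Place the retrieved examples before the new conversation."""
    shots = "\n\n".join(f"Conversation: {c}\nNote: {n}" for c, n in examples)
    return f"{shots}\n\nConversation: {query}\nNote:"


# Hypothetical training pool of (conversation, note) pairs.
pool = [
    ("Doctor: Any allergies? Patient: I am allergic to penicillin.",
     "ALLERGY: Penicillin."),
    ("Doctor: Does your knee hurt? Patient: Yes, my knee hurts when walking.",
     "CHIEF COMPLAINT: Knee pain."),
]
query = "Doctor: Where does it hurt? Patient: My knee hurts a lot."
prompt = build_prompt(query, select_examples(query, pool, k=1))
print(prompt)
```

Swapping `jaccard` for another similarity function changes which examples are retrieved, which is one way to probe the sensitivity to similarity models noted above.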
MEDIQA-Sum @ CLEF 2023
Methods & Results
• The ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended
and overlapping dataset.
• One difference between the two participant pools was that there were no GPT-4 submissions
in MEDIQA-Sum.
• For short dialogue summarization, scores in the two 2023 editions were comparable: MEDIQA-Chat aggregate scores
ranged from 0.37 to 0.58 and MEDIQA-Sum aggregate scores from 0.28 to 0.57.
• For full-encounter (long) generation, MEDIQA-Chat scores ranged from 0.28 to 0.61 ROUGE-1 and from 0.21 to 0.65
aggregate score. In MEDIQA-Sum, the ranges were slightly lower, with 0.28-0.50 ROUGE-1 and 0.25-0.46 aggregate
score.
• The results suggest that many open-source methods are still very competitive for classification and shorter
generation tasks, whereas longer generation may require more powerful LLMs.
References for the full details
DATASETS & METHODS
• Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note
Generation from Doctor-Patient Encounters. EACL 2023
• Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH:
a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note
Generation. Nature Scientific Data 2023
SHARED TASKS
• Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the
MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient
Conversations. ClinicalNLP@ACL 2023
• Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the
MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient
Conversations. CLEF 2023
EVALUATION METRICS
• Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation
Metrics for Automated Medical Note Generation. ACL (Findings) 2023
Medical Error
Detection & Correction
Detection and Correction
of Medical Errors
• Medical errors can be costly to both patients and healthcare
providers. Detecting and correcting these errors is crucial
for improving health care outcomes.
• LLMs can enable faster and more accurate solutions to detect
medical errors such as:
o Misdiagnosis: LLMs can be leveraged to identify inconsistencies
between the patient's symptoms and the diagnosis mentioned in the
clinical note.
o Medication errors: LLMs can be leveraged to detect errors in
medication dosage, frequency, duration, potential drug interactions or
contraindications.
• LLMs can also assist with suggesting corrections and recommending
differential diagnoses, personalized treatments, and interventions.
MEDIQA-CORR 2024
Organizers:
Asma Ben Abacha, Microsoft
Wen-wai Yim, Microsoft
Meliha Yetisgen, University of Washington
Fei Xia, University of Washington
New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction:
• The MS dataset consists of 3,359 clinical notes.
• The UW dataset consists of 488 de-identified notes (requires a DUA).
• Each note is either correct or contains one error (e.g., diagnosis, treatment,
management) and its correction.
Examples from the MEDIQA-CORR-MS Dataset
A 58-year-old man comes to his physician because of a 1-month history of
increased thirst and nocturia. He is drinking a lot of water to compensate for any
dehydration. His brother has type 2 diabetes mellitus. Physical examination
shows dry mucous membranes. Laboratory studies show a serum sodium of 151
mEq/L and glucose of 121 mg/dL. A water deprivation test shows:
                                                            Serum osmolality   Urine osmolality
                                                            (mOsmol/kg H2O)    (mOsmol/kg H2O)
Initial presentation                                        295                285
After 3 hours without fluids                                305                310
After administration of antidiuretic hormone (ADH) analog   280                355
Error sentence: Patient was diagnosed with primary polydipsia.
Corrected sentence: Patient was diagnosed with partial central diabetes insipidus.
Examples from the MEDIQA-CORR-MS Dataset
A 75-year-old woman comes to the physician because of generalized
weakness for 6 months. During this period, she has also had a 4-kg
(8.8-lb) weight loss and frequent headaches. She has been avoiding
eating solids because of severe jaw pain. She has hypertension and
osteoporosis. She underwent a total left-sided knee arthroplasty 2 years
ago because of osteoarthritis. The patient does not smoke or drink
alcohol. Her current medications include enalapril, metoprolol, low-dose
aspirin, and a multivitamin. She appears pale. Her temperature is 37.5 C
(99.5 F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical
examination shows no abnormalities.
Error sentence: Intravenous methylprednisolone and a temporal artery biopsy is
recommended after labs were reviewed.
Laboratory studies showed:
Hemoglobin                       10 g/dL
Mean corpuscular volume          87 μm3
Leukocyte count                  8,500/mm3
Platelet count                   450,000/mm3
Erythrocyte sedimentation rate   90 mm/h
Corrected sentence: Oral prednisone and a temporal artery biopsy is
recommended after labs were reviewed.
Tasks & GPT Baselines
(the challenge is still open, run submission deadline on March 28)
Tasks:
1. Predicting the error flag (i.e., does the text contain an error or not?),
2. Extracting the error sentence ID (or -1 for texts without errors),
3. Generating a correct sentence (or NA for texts without errors).
ChatGPT & GPT-4 Results using the same prompt to generate the responses:
Baselines   Error Flag Prediction   Error Sentence Detection   Sentence Correction
            Accuracy                Accuracy                   ROUGE-1   BERTScore   BLEURT   Aggregate Score
ChatGPT     46.18                   45.64                      46.93     41.61       48.84    45.79
GPT-4       61.31                   48.91                      52.76     56.97       57.15    55.63
Aggregate Score: mean of ROUGE-1-F, BERTScore, and BLEURT-20.
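The two kinds of scores on this slide are simple to compute: accuracy for the error flag and error sentence ID, and an aggregate score defined as the mean of ROUGE-1-F, BERTScore, and BLEURT-20. The sketch below recomputes the aggregate from the three component values reported for each baseline; the function names and the small gold/predicted lists are hypothetical and this is not the official scoring script.

```python
from typing import List


def accuracy(gold: List[int], predicted: List[int]) -> float:
    """Percentage of items where the prediction equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Mean of the three sentence-correction metrics."""
    return (rouge1_f + bertscore + bleurt) / 3


# Component scores reported for the two baselines on this slide.
chatgpt_aggregate = round(aggregate_score(46.93, 41.61, 48.84), 2)
gpt4_aggregate = round(aggregate_score(52.76, 56.97, 57.15), 2)

# Hypothetical error flags (1 = note contains an error, 0 = correct note).
gold_flags = [1, 0, 1, 1]
predicted_flags = [1, 0, 0, 1]
flag_accuracy = accuracy(gold_flags, predicted_flags)

print(chatgpt_aggregate, gpt4_aggregate, flag_accuracy)
```

The same `accuracy` function applies to error sentence detection if the lists hold sentence IDs, with -1 for notes without errors.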
Challenges & Opportunities
The potential benefits of LLMs in healthcare
and medicine are substantial. However, there are
also challenges associated with the use of LLMs in
healthcare, such as bias, hallucinations that can
impact medical outcomes, important omissions,
critical medical errors, and privacy concerns.
Continued research and innovation in this area are
crucial to address these challenges and to allow
doctors and health workers to use high-performing
LLM-based solutions with the necessary
safety guardrails.
abenabacha@microsoft.com
@AsmaBenAbacha

More Related Content

What's hot

‘Big models’: the success and pitfalls of Transformer models in natural langu...
‘Big models’: the success and pitfalls of Transformer models in natural langu...‘Big models’: the success and pitfalls of Transformer models in natural langu...
‘Big models’: the success and pitfalls of Transformer models in natural langu...Leiden University
 
Large Language Models.pdf
Large Language Models.pdfLarge Language Models.pdf
Large Language Models.pdfBLINXAI
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAnant Corporation
 
Using the power of Generative AI at scale
Using the power of Generative AI at scaleUsing the power of Generative AI at scale
Using the power of Generative AI at scaleMaxim Salnikov
 
The Future is in Responsible Generative AI
The Future is in Responsible Generative AIThe Future is in Responsible Generative AI
The Future is in Responsible Generative AISaeed Al Dhaheri
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfDavid Rostcheck
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdfQualcomm Research
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxSaiPragnaKancheti
 
Prompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowaniaPrompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowaniaMichal Jaskolski
 
Using Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptxUsing Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptxJonathanDietz3
 
Understanding generative AI models A comprehensive overview.pdf
Understanding generative AI models A comprehensive overview.pdfUnderstanding generative AI models A comprehensive overview.pdf
Understanding generative AI models A comprehensive overview.pdfStephenAmell4
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGISynaptonIncorporated
 
ChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignement
ChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignementChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignement
ChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignementAlain Goudey
 
Generative AI and Student Writing.pptx
Generative AI and Student Writing.pptxGenerative AI and Student Writing.pptx
Generative AI and Student Writing.pptxMike Sharples
 
Implications of GPT-3
Implications of GPT-3Implications of GPT-3
Implications of GPT-3Raven Jiang
 
Cavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AICavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AICavalry Ventures
 
Exploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective TeachingExploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective TeachingAdicodeTechnologies
 

What's hot (20)

‘Big models’: the success and pitfalls of Transformer models in natural langu...
‘Big models’: the success and pitfalls of Transformer models in natural langu...‘Big models’: the success and pitfalls of Transformer models in natural langu...
‘Big models’: the success and pitfalls of Transformer models in natural langu...
 
Large Language Models.pdf
Large Language Models.pdfLarge Language Models.pdf
Large Language Models.pdf
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
 
Using the power of Generative AI at scale
Using the power of Generative AI at scaleUsing the power of Generative AI at scale
Using the power of Generative AI at scale
 
The Future is in Responsible Generative AI
The Future is in Responsible Generative AIThe Future is in Responsible Generative AI
The Future is in Responsible Generative AI
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdf
 
A Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptxA Comprehensive Review of Large Language Models for.pptx
A Comprehensive Review of Large Language Models for.pptx
 
Prompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowaniaPrompting is an art / Sztuka promptowania
Prompting is an art / Sztuka promptowania
 
CHATGPT.pptx
CHATGPT.pptxCHATGPT.pptx
CHATGPT.pptx
 
Using Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptxUsing Generative AI in the Classroom .pptx
Using Generative AI in the Classroom .pptx
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
 
Understanding generative AI models A comprehensive overview.pdf
Understanding generative AI models A comprehensive overview.pdfUnderstanding generative AI models A comprehensive overview.pdf
Understanding generative AI models A comprehensive overview.pdf
 
Transformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGITransformers, LLMs, and the Possibility of AGI
Transformers, LLMs, and the Possibility of AGI
 
ChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignement
ChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignementChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignement
ChatGPT, Midjourney, la déferlante des IA génératives dans l'enseignement
 
Generative AI and Student Writing.pptx
Generative AI and Student Writing.pptxGenerative AI and Student Writing.pptx
Generative AI and Student Writing.pptx
 
Implications of GPT-3
Implications of GPT-3Implications of GPT-3
Implications of GPT-3
 
Cavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AICavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AI
 
Exploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective TeachingExploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective Teaching
 
ChatGPT.pptx
ChatGPT.pptxChatGPT.pptx
ChatGPT.pptx
 

Similar to Large Language Models and Applications in Healthcare

Automated Extraction Of Reported Statistical Analyses Towards A Logical Repr...
Automated Extraction Of Reported Statistical Analyses  Towards A Logical Repr...Automated Extraction Of Reported Statistical Analyses  Towards A Logical Repr...
Automated Extraction Of Reported Statistical Analyses Towards A Logical Repr...Nat Rice
 
NR 328 EBP Improving Diagnostic Safety Project.pdf
NR 328 EBP Improving Diagnostic Safety Project.pdfNR 328 EBP Improving Diagnostic Safety Project.pdf
NR 328 EBP Improving Diagnostic Safety Project.pdfbkbk37
 
Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Pubrica
 
Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...
Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...
Automated Generation Of Synoptic Reports From Narrative Pathology Reports In ...Kaela Johnson
 
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Health Informatics New Zealand
 
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Case Retrieval using Bhattacharya Coefficient with Particle Swarm OptimizationCase Retrieval using Bhattacharya Coefficient with Particle Swarm Optimization
Case Retrieval using Bhattacharya Coefficient with Particle Swarm Optimizationrahulmonikasharma
 
Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Systematic review of quality standards for medical devices and practice measu...
Systematic review of quality standards for medical devices and practice measu...Pubrica
 
Measuring the Effectiveness of eHealth Initiatives in Hospitals
Measuring the Effectiveness of eHealth Initiatives in HospitalsMeasuring the Effectiveness of eHealth Initiatives in Hospitals
Measuring the Effectiveness of eHealth Initiatives in HospitalsHealth Informatics New Zealand
 
Overview of ePRO
Overview of ePROOverview of ePRO
Overview of ePROchallPHT
 
College Writing II Synthesis Essay Assignment Summer Semester 2017.docx
College Writing II Synthesis Essay Assignment Summer Semester 2017.docxCollege Writing II Synthesis Essay Assignment Summer Semester 2017.docx
College Writing II Synthesis Essay Assignment Summer Semester 2017.docxclarebernice
 
NR 327 MDC EBP Diagnose Errors Question.pdf
NR 327 MDC EBP Diagnose Errors Question.pdfNR 327 MDC EBP Diagnose Errors Question.pdf
NR 327 MDC EBP Diagnose Errors Question.pdfbkbk37
 
ICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining ApproachICU Patient Deterioration Prediction : A Data-Mining Approach
ICU Patient Deterioration Prediction : A Data-Mining Approachcsandit
 
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACH
ICU PATIENT DETERIORATION PREDICTION: A DATA-MINING APPROACHcscpconf
 
Application Evaluation Project Part 1 Evaluation Plan FocusTec.docx
Application Evaluation Project Part 1 Evaluation Plan FocusTec.docxApplication Evaluation Project Part 1 Evaluation Plan FocusTec.docx
Application Evaluation Project Part 1 Evaluation Plan FocusTec.docxalfredai53p
 
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...
System for Recommending Drugs Based on Machine Learning Sentiment Analysis of...IRJET Journal
 
MAST: a model for HTA-based assessment of telemedicine applications
MAST: a model for HTA-based assessment of telemedicine applicationsMAST: a model for HTA-based assessment of telemedicine applications
MAST: a model for HTA-based assessment of telemedicine applicationsHTAi Bilbao 2012
 
Recommendations on Evidence Needed to Support Measurement Equivalence between...
Recommendations on Evidence Needed to Support Measurement Equivalence between...Recommendations on Evidence Needed to Support Measurement Equivalence between...
Recommendations on Evidence Needed to Support Measurement Equivalence between...CRF Health
 
cognitive computing for electronic medical record
cognitive computing for electronic medical record cognitive computing for electronic medical record
cognitive computing for electronic medical record selamu shirtawi
 
The Role of the FAIR Guiding Principles for an effective Learning Health System
The Role of the FAIR Guiding Principles for an effective Learning Health SystemThe Role of the FAIR Guiding Principles for an effective Learning Health System
The Role of the FAIR Guiding Principles for an effective Learning Health SystemMichel Dumontier
 

Large Language Models and Applications in Healthcare

  • 1. Large Language Models & Applications in Healthcare Asma Ben Abacha Guest Lecture, University of Illinois, Urbana-Champaign, February 29, 2024
  • 2. Abstract The time that medical doctors spend on Electronic Health Record systems was shown to contribute to work-life imbalance, dissatisfaction, high rates of attrition, and a burnout rate exceeding 50%. In particular, doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with the patients. Recent studies on clinical note generation have shown that doctors can save a significant amount of time with automatic note generation systems. Progress in LLMs can play a key role in enabling further such systems and improving their performance. However, this requires high-quality datasets and benchmarks and relevant evaluation metrics to assess which model would best serve clinicians in their daily practice. In this lecture, I’ll present LLM-based solutions for the task of clinical note generation from doctor-patient conversations. I’ll also present insights from different evaluation studies and shared tasks that we organized on this topic. Another important aspect of supporting healthcare providers with documentation and clinical decisions is to detect medical errors in clinical notes and to suggest corrections (e.g., diagnosis, treatment, medication). Such errors require medical expertise and knowledge to be both identified and corrected. Recent LLMs showed promise in being applied to unseen tasks with competitive ability. The second part of the lecture will cover a new research endeavor on medical error detection and correction and present LLM-based solutions for the task.
  • 3. Plan I. Large Language Models (LLMs) in Healthcare II. Clinical Note Generation from Doctor-Patient Conversations III. Medical Error Detection & Correction
  • 4. Large Language Models (LLMs) and Medical LLMs, timeline 2018-2023. Models shown: GPT-1, BERT, GPT-2, ERNIE, XLNet, RoBERTa, GPT-3, T5-xxl, XLM-R, PaLM, BLOOM, ChatGPT, BARD, GPT-4, Falcon, Claude 2, LLaMA 2, Mistral 7B, BioBERT, SciBERT, ClinicalBERT, BioGPT, Clinical-T5, Med-PaLM, Med-PaLM 2, PMC-LLaMA.
  • 5.
  • 6. LLMs in Healthcare Recent progress in LLMs opens the door to more clinical applications to improve the efficiency of clinical practice and enhance patients' experiences: 1. INFORMATION EXTRACTION • Extracting useful information and insights from Electronic Health Records. • Assisting doctors with literature review and easier access to up-to-date and trustworthy information. 2. PREDICTION & RECOMMENDATION • Predicting disease progression and identifying high-risk patients. • Recommending personalized treatment plans, interventions, clinical trials, and improving clinical workflows. 3. GENERATION • Generating clinical notes and reducing doctors’ documentation time. • Automating administrative tasks such as medical coding and billing. • Generating differential diagnoses and improving diagnosis accuracy.
  • 7. Clinical Note Generation from Doctor-Patient Conversations
  • 8. INTRODUCTION • Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their conversations with the patients (Hripcsak et al., 2011). • The time spent with Electronic Health Record systems contributes to high rates of attrition, with a burnout rate already exceeding 50% (Arndt et al., 2017). • Generating clinical notes from doctor-patient conversations can reduce the doctors’ workload: doctors edit and validate the generated notes instead of writing the full notes during the consultations. • However, the lack of publicly available doctor-patient dialogue datasets hinders research efforts from the NLP community on this summarization/generation task.
  • 9. New Datasets • We created and released two new datasets free of PHI that can be used for benchmarking and research on clinical note generation and can accelerate research efforts on this understudied NLP task. MTS-Dialog Dataset [1] • To avoid privacy infringement risks, we created simulated conversations from publicly available de-identified clinical notes from the MTSamples collection. • Eight trained annotators used the notes sections to write simulated conversations according to guidelines derived from a study of a large private collection of real conversations and notes. ACI-BENCH: Ambient Clinical Intelligence Benchmark [2] • Includes full clinical notes and associated conversations from simulated encounters. • Four annotators with medical backgrounds validated the dataset. [1] Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023. [2] Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
  • 10. Example from the MTS-Dialog Dataset • The MTS-Dialog dataset consists of 1.7k pairs of conversations and associated summaries (clinical note sections). • The clinical notes cover the six most frequent note types and specialties in the collection, including: General Medicine, Neurology, Orthopedic, Dermatology, SOAP (Subjective, Objective, Assessment, Plan), and
  • 11. Example from the ACI-BENCH Dataset The ACI-BENCH dataset consists of 207 pairs of full doctor-patient conversations and associated clinical notes.
  • 12. Summarization of Short Doctor-Patient Conversations o We fine-tuned and evaluated summarization models on the MTS-Dialog dataset. o Pre-finetuning on relevant datasets improved the results: BART pre-finetuned on XSum and SAMSum achieved a ROUGE-1 score of 40.15%, vs. 32.01% when pre-finetuned on CNN/DailyMail, which highlights the importance of using relevant pre-finetuning datasets. o Guided summarization and data augmentation helped improve ROUGE-1 from 40% to 42%. Asma Ben Abacha, Wen-wai Yim, Yadan Fan, & Thomas Lin. An empirical study of clinical note generation from doctor-patient encounters. EACL 2023.
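The ROUGE-1 scores quoted above measure unigram overlap between a generated summary and a reference summary. The following is a minimal sketch of the F1 variant, assuming whitespace tokenization and lowercasing; published evaluations typically use a dedicated ROUGE package with stemming, so the numbers it produces are not directly comparable to those on the slide. The function name `rouge1_f1` and the toy sentences are illustrative, not taken from the lecture.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical note sentences).
reference = "patient reports chest pain for two days"
candidate = "patient has chest pain for three days"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {score:.3f}")
```

Five of the seven unigrams match in each direction here, so precision and recall are both 5/7.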
  • 13. Summarization of Short Doctor-Patient Conversations We selected the best performing model from each category for further studies using Fact-based metrics, BERTScore, and BLEURT. o Guided Summarization (GS) led to a consistent improvement across all automated metrics except for ROUGE-2 and Fact-based metrics. o Data Augmentation (DA) led to slight improvements across all metrics except ROUGE-2 and BERTScore-M1.
  • 14. Summarization of Long Doctor-Patient Conversations • We fine-tuned and evaluated different summarization models at the full note level on the ACI-BENCH dataset. • Simple retrieval-based methods provided strong baselines with better out-of-the-box performances than Longformer-Encoder-Decoder (LED) models and full-note BART models. • Division-based generation worked better for BART and LED fine-tuned models with 53.46 ROUGE-1 for BART+FTSAMSum(Division). • OpenAI models with simple prompts were shown to give competitive outputs despite no additional fine-tuning or dynamic prompting. • GPT-4 demonstrated the highest MEDCON UMLS-based evaluation score (57.78) while achieving the second to third-best performance in ROUGE scores (51.76 ROUGE-1, 22.58 ROUGE-2 and 45.97 ROUGE-L). Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, & Meliha Yetisgen. ACI-Bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Nature Scientific Data 2023.
  • 15. Evaluation Methods • Assessing which note generation model would best serve clinicians in their daily practice is an important and challenging task; however, it can be (strongly) biased by the adopted evaluation metrics. • For doctor-patient conversation summarization, an adapted perspective on generation errors needs to be considered, e.g.: • Hallucinations and factual inconsistencies are likely to impact the clinical outcome if they are not avoided or detected with high accuracy. • Omission of critical medical facts is likely to alter patient outcomes and should be one of the essential factors in adopting a summarization model.
  • 16. Evaluation Metrics We studied the correlation between evaluation metrics and manual scores provided by medical annotators. Average Correlation Scores across Seven Clinical Datasets • The evaluation metrics had substantially different behaviors on different types of clinical notes datasets. • The results highlight one stable subset of metrics, such as aggregate scores, as the most correlated with human judgments on different evaluation criteria.
  • 17. Insights from Shared Task Evaluations • The MEDIQA 2023 shared tasks focused on the automatic generation of clinical notes from doctor-patient conversations and the generation of synthetic medical conversations for data augmentation. • The shared tasks can gather and evaluate diverse ideas from the research community (especially in the age of advanced LLMs) and empirically identify the best approaches. • 29 teams participated in the MEDIQA-Chat & MEDIQA-Sum competitions organized at ACL-ClinicalNLP & CLEF 2023.
  • 18. MEDIQA-Chat 2023 Tasks Task A - Short Dialogue2Note Summarization: This task focuses on summarizing short doctor-patient conversations to generate a summary for one section of a clinical note, including a section header. Task B - Full Dialogue2Note Summarization: The goal of task B is to generate a complete note for each doctor-patient encounter. The note must include all relevant sections. Task C - Note2Dialogue Generation: The third task addresses data augmentation through the generation of synthetic doctor-patient conversations from full clinical notes.
  • 19. ChatGPT & GPT-4 Models • ChatGPT (gpt-3.5-turbo) & GPT-4 baselines (ChatGPT has a limit of 4k tokens & GPT-4 allows 32k tokens). • Different temperatures for Tasks A/B (1) and Task C (0/1) (deterministic/creative outputs). • Prompts: ⚬ Task A Prompt: "Classify the conversation into one of these 20 classes: FAMILY HISTORY/SOCIAL HISTORY, HISTORY of PRESENT ILLNESS, PAST MEDICAL HISTORY, CHIEF COMPLAINT, PAST SURGICAL HISTORY, Allergy, REVIEW OF SYSTEMS, Medications, Assessment, Exam, Diagnosis, Disposition, Plan, EMERGENCY DEPARTMENT COURSE, Immunizations, Imaging, GYNECOLOGIC HISTORY, Procedures, Other history, Labs. The response should start with the selected class, followed by # then the summary of the conversation in a clinical note style. The conversation is: " ⚬ Task B Prompt: "Summarize the conversation to generate a clinical note with four sections: HISTORY OF PRESENT ILLNESS, PHYSICAL EXAM, RESULTS, ASSESSMENT AND PLAN. The conversation is: " ⚬ Task C Prompt: "Write a full conversation between a doctor and a patient during a medical visit. The dialogue should cover all the medical information provided in this note:"
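The Task A prompt above asks the model to answer with the selected class, then `#`, then the summary. The sketch below shows one way to assemble that prompt and split such a response, under stated assumptions: the helper names `build_task_a_prompt` and `parse_task_a_response` are hypothetical, no model is called, and the handling of a response that lacks the `#` separator is a choice made here, not something the slides describe.

```python
# The 20 section classes listed in the Task A prompt.
TASK_A_CLASSES = [
    "FAMILY HISTORY/SOCIAL HISTORY", "HISTORY of PRESENT ILLNESS",
    "PAST MEDICAL HISTORY", "CHIEF COMPLAINT", "PAST SURGICAL HISTORY",
    "Allergy", "REVIEW OF SYSTEMS", "Medications", "Assessment", "Exam",
    "Diagnosis", "Disposition", "Plan", "EMERGENCY DEPARTMENT COURSE",
    "Immunizations", "Imaging", "GYNECOLOGIC HISTORY", "Procedures",
    "Other history", "Labs",
]


def build_task_a_prompt(conversation: str) -> str:
    """Assemble the Task A prompt text followed by the conversation."""
    return (
        f"Classify the conversation into one of these {len(TASK_A_CLASSES)} "
        f"classes: {', '.join(TASK_A_CLASSES)}. The response should start "
        "with the selected class, followed by # then the summary of the "
        "conversation in a clinical note style. The conversation is: "
        f"{conversation}"
    )


def parse_task_a_response(response: str):
    """Split a 'CLASS # summary' response into (section_header, summary).

    Returns (None, text) when the separator is missing (an assumption:
    the slides do not say how malformed responses were handled).
    """
    header, separator, summary = response.partition("#")
    if not separator:
        return None, response.strip()
    return header.strip(), summary.strip()


prompt = build_task_a_prompt("Doctor: Any allergies? Patient: Penicillin.")
header, summary = parse_task_a_response("Allergy # Allergic to penicillin.")
print(header, "->", summary)
```

Parsing on the first `#` only keeps any later `#` characters inside the summary text.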
  • 20. MEDIQA-Chat: Task B Long Dialogue Summarization (ACI-BENCH Dataset)
  • 21. MEDIQA-Chat 2023 Approaches & Insights • Dynamic prompting with in-context learning was the winning solution for clinical note generation. • Fine-tuning (BART/LED/T5) achieved competitive results compared to the leading solutions with GPT-4. • Generation of full notes worked better than generating specific sections at a time. • Data augmentation was beneficial to fine-tuning methods but was not applied/tested with GPT-4 solutions. Empirical validation of the best methods on other/different datasets showed that: • LLMs are highly sensitive to in-context examples and similarity models for this task. • LLM-based solutions may require adaptation when applied to new data, such as different similarity models/solutions to retrieve in-context examples and tackle potential retrieval overfitting.
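Dynamic prompting with in-context learning, as described above, selects the examples placed in the prompt per input, using a similarity model. The following is a rough sketch of that idea, not the winning system: it assumes a token-level Jaccard similarity as a stand-in for whatever similarity model a team used, and the names `jaccard`, `select_examples`, `build_dynamic_prompt`, and the toy example pool are all hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a simple stand-in similarity model)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_examples(query: str, pool, k: int = 2):
    """Return the k (conversation, note) pairs most similar to the query."""
    ranked = sorted(pool, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    return ranked[:k]


def build_dynamic_prompt(query: str, pool, k: int = 2) -> str:
    """Prepend the retrieved in-context examples to the new conversation."""
    parts = [
        f"Conversation:\n{conversation}\nNote:\n{note}\n"
        for conversation, note in select_examples(query, pool, k)
    ]
    parts.append(f"Conversation:\n{query}\nNote:")
    return "\n".join(parts)


# Toy training pool of (conversation, note) pairs.
pool = [
    ("patient has knee pain after a fall", "Knee pain following fall."),
    ("patient reports a skin rash on the arm", "Rash on arm."),
    ("follow up for high blood pressure", "Hypertension follow-up."),
]
query = "patient complains of knee pain"
best = select_examples(query, pool, k=1)
print(build_dynamic_prompt(query, pool, k=1))
```

Swapping `jaccard` for a different similarity function changes which examples are retrieved, which is one way to probe the sensitivity to similarity models noted above.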
  • 22. MEDIQA-Sum @ CLEF 2023 Methods & Results • The ACL-ClinicalNLP MEDIQA-Chat and ImageCLEF MEDIQA-Sum shared tasks hosted similar problems on an extended and overlapping dataset. • One difference between the two was that there were no GPT-4 submissions in MEDIQA-Sum. • For short dialogue summarization, scores in the two 2023 editions were comparable: MEDIQA-Chat aggregate scores ranged from 0.37 to 0.58 and MEDIQA-Sum aggregate scores from 0.28 to 0.57. • For full-encounter (long) generation, MEDIQA-Chat scores ranged from 0.28 to 0.61 ROUGE-1 and from 0.21 to 0.65 aggregate score. In MEDIQA-Sum, the ranges were slightly lower, with 0.28 to 0.50 ROUGE-1 and 0.25 to 0.46 aggregate score. • The results suggest that many open-source-based methods are still very competitive for classification and shorter generation tasks, whereas longer generation may require more powerful LLMs.
  • 23. References for the full details DATASETS & METHODS • Asma Ben Abacha, Wen-wai Yim, Yadan Fan, Thomas Lin: An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters. EACL 2023 • Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen: ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Nature Scientific Data 2023 SHARED TASKS • Asma Ben Abacha, Wen-wai Yim, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. ClinicalNLP@ACL 2023 • Wen-wai Yim, Asma Ben Abacha, Griffin Adams, Neal Snider, Meliha Yetisgen: Overview of the MEDIQA-Sum Task at ImageCLEF 2023: Summarization and Classification of Doctor-Patient Conversations. CLEF 2023 EVALUATION METRICS • Asma Ben Abacha, Wen-wai Yim, George Michalopoulos, Thomas Lin: An Investigation of Evaluation Metrics for Automated Medical Note Generation. ACL (Findings) 2023
  • 25. Detection and Correction of Medical Errors • Medical errors can be costly to both patients and healthcare providers. Detection and correction of these errors is crucial for improving health care outcomes. • LLMs can enable faster and more accurate solutions to detect medical errors such as: o Misdiagnosis: LLMs can be leveraged to identify inconsistencies between the patient's symptoms and the diagnosis mentioned in the clinical note. o Medication errors: LLMs can be leveraged to detect errors in medication dosage, frequency, duration, potential drug interactions or contraindications. • LLMs can also assist with suggesting corrections and recommending differential diagnoses, personalized treatments, and interventions.
  • 26. MEDIQA-CORR 2024 New Datasets (MS & UW) and Shared Task on Medical Error Detection & Correction: • The MS dataset consists of 3,359 clinical notes. • The UW dataset consists of 488 de-identified notes (requires a DUA). • Each note is either correct or contains one error (e.g., diagnosis, treatment, management) and its correction. Organizers: Asma Ben Abacha (Microsoft), Wen-wai Yim (Microsoft), Meliha Yetisgen (University of Washington), Fei Xia (University of Washington)
  • 27. Examples from the MEDIQA-CORR-MS Dataset A 58-year old man comes to his physician because of a 1-month history of increased thirst and nocturia. He is drinking a lot of water to compensate for any dehydration. His brother has type 2 diabetes mellitus. Physical examination shows dry mucous membranes. Laboratory studies show a serum sodium of 151 mEq/L and glucose of 121 mg/dL. A water deprivation test shows (serum osmolality / urine osmolality, in mOsmol/kg H2O): initial presentation 295 / 285; after 3 hours without fluids 305 / 310; after administration of antidiuretic hormone (ADH) analog 280 / 355. Error sentence: Patient was diagnosed with primary polydipsia. Corrected sentence: Patient was diagnosed with partial central diabetes insipidus.
  • 28. Examples from the MEDIQA-CORR-MS Dataset A 75-year-old woman comes to the physician because of generalized weakness for 6 months. During this period, she has also had a 4-kg (8.8-lb) weight loss and frequent headaches. She has been avoiding eating solids because of severe jaw pain. She has hypertension and osteoporosis. She underwent a total left-sided knee arthroplasty 2 years ago because of osteoarthritis. The patient does not smoke or drink alcohol. Her current medications include enalapril, metoprolol, low-dose aspirin, and a multivitamin. She appears pale. Her temperature is 37.5°C (99.5°F), pulse is 82/min, and blood pressure is 135/80 mm Hg. Physical examination shows no abnormalities. Intravenous methylprednisolone and a temporal artery biopsy is recommended after labs were reviewed. Laboratory studies showed: Hemoglobin 10 g/dL; Mean corpuscular volume 87 μm³; Leukocyte count 8,500/mm³; Platelet count 450,000/mm³; Erythrocyte sedimentation rate 90 mm/h. Corrected sentence: Oral prednisone and a temporal artery biopsy is recommended after labs were reviewed.
  • 29. Tasks & GPT Baselines (the challenge is still open; the run submission deadline is March 28) Tasks: 1. Predicting the error flag (i.e., does the text contain an error or not?), 2. Extracting the error sentence ID (or -1 for texts without errors), 3. Generating a correct sentence (or NA for texts without errors). ChatGPT & GPT-4 results using the same prompt to generate the responses: ChatGPT: Error Flag Prediction accuracy 46.18; Error Sentence Detection accuracy 45.64; Sentence Correction ROUGE-1 46.93, BERTScore 41.61, BLEURT 48.84, Aggregate Score 45.79. GPT-4: Error Flag Prediction accuracy 61.31; Error Sentence Detection accuracy 48.91; Sentence Correction ROUGE-1 52.76, BERTScore 56.97, BLEURT 57.15, Aggregate Score 55.63. The Aggregate Score is the mean of ROUGE-1-F, BERTScore, and BLEURT-20.
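The aggregate score on this slide is defined as the mean of ROUGE-1-F, BERTScore, and BLEURT-20, and the first two tasks are scored by accuracy. The sketch below recomputes the two reported aggregate scores from the per-metric values on the slide and shows accuracy over error sentence IDs; the function names and the toy gold/predicted ID lists are illustrative assumptions, not the official evaluation script.

```python
def aggregate_score(rouge1_f: float, bertscore: float, bleurt: float) -> float:
    """Mean of the three sentence-correction metrics."""
    return (rouge1_f + bertscore + bleurt) / 3


def accuracy(gold, predicted) -> float:
    """Fraction of items where the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Per-metric values reported on the slide.
chatgpt_aggregate = round(aggregate_score(46.93, 41.61, 48.84), 2)
gpt4_aggregate = round(aggregate_score(52.76, 56.97, 57.15), 2)
print(chatgpt_aggregate, gpt4_aggregate)  # 45.79 55.63

# Toy error sentence IDs (-1 means the text has no error).
gold_ids = [-1, 3, 5, -1]
predicted_ids = [-1, 3, 2, 0]
sentence_accuracy = accuracy(gold_ids, predicted_ids)
```

Both recomputed values match the Aggregate Score column reported for ChatGPT and GPT-4.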
  • 30. Challenges & Opportunities The potential benefits of LLMs in healthcare and medicine are substantial; however, their use also raises challenges such as bias, hallucinations that can impact medical outcomes, important omissions, critical medical errors, and privacy concerns. Continued research and innovation in this area is crucial to address these challenges and to allow doctors and health workers to use high-performing LLM-based solutions with the necessary safety guardrails.