This document provides an overview of Dr. Asma Ben Abacha's work on medical question answering. It discusses the challenges of consumer health question answering, including analyzing and understanding complex questions. It also discusses challenges in retrieving relevant and reliable answers. The document outlines Dr. Ben Abacha's research projects on recognizing question entailment to answer new questions, summarizing consumer health questions, answering medication questions using a new dataset, and visual question answering using medical images. It provides results of experiments on these tasks and discusses insights and needed solutions for advancing medical question answering.
Consumer Health Questions: Summarizing and Answering Complex Medical Queries
1. Medical Question Answering: Dealing with the complexity
and specificity of consumer health questions and visual questions
November 12, 2019 - Seattle, WA, USA
Allen Institute for Artificial Intelligence (AI2
Dr. Asma Ben Abacha
2. Disclaimer
The views and opinions expressed do not necessarily state or
reflect those of the U.S. Government, and they may not be used
for advertising or product endorsement purposes.
3. ● PhD in Computer Science, Paris-Sud
University, France.
● Master Degree, Paris 13 University, France.
● Software engineering degree, ENSI, Tunisia.
● Postdoctoral researcher, LIST, Luxembourg.
● Lecturer at the University of Lorraine, France.
● PhD Student at LIMSI-CNRS, France.
Topic: Medical question answering.
Short Bio
4. ● Currently, a staff scientist at the NIH/NLM.
U.S. National Institutes of
Health (NIH)
• www.nih.gov
• 27 institutes & centers
U.S. National Library of Medicine (NLM) www.nlm.nih.gov
5. NLM receives hundreds of requests per day including many
consumer health questions.
Manual answers returned in 4 days from consumer services.
Goal: Automatic system to assist users in their search.
Why Consumer Health Question Answering?
7. Example:
“obese daughter. i'm a mother of 5 children. My son
has leukemia who is the youngest. My daughter is 12yrs old and
is obese. i dont know where to turn but i need some professional help.
my daughter has always suffered with weight issues. while i deal
with my 11yr old son at md anderson im so scared of loosing my
daughter. Shes already been suicidal and her grades are so bad.
please i need help. “
7
1st Challenge: Automatic Question Analysis & Understanding
14. MedQuAD: Medical QA Dataset
Created from trusted resources.
Contains 47k question-answer pairs extracted from 12 NIH websites
(e.g. MedlinePlus, NCI, GARD, NIDDK).
Covers 16 question types about Diseases (e.g. Treatment, Susceptibility),
20 types about Drugs (e.g. Usage, Interaction), and an additional type
(Information) for all possible question foci (e.g. Procedure, Exam).
14
https://github.com/abachaa/MedQuAD
15. RQE Models
Observations:
Logistic Regression outperformed neural networks when trained
with traditional word embeddings such as Glove and Word2Vec.
Relying on recent language models such as Bert for pre-training led
to a better performance (e.g. results of MEDIQA-RQE 2019). 15
Ben Abacha & Demner-Fushman. BMC Bioinformatics 2019
16. 16
1) Recognizing Entailment/Inference in the Medical Domain
2) Entailment/Inference for Question Answering (QA)
3) New Shared Task at ACL-BioNLP:
Need of dedicated gold-standard datasets
and a common evaluation platform.
2006:
2016:
Methods for Using Textual Entailment in Open-Domain Question
Answering. Sanda Harabagiu & Andrew Hickl
Recognizing Question Entailment for Medical Question Answering.
Asma Ben Abacha & Dina Demner-Fushman
MEDIQA @ ACL-BioNLP 2019
17. 17
MEDIQA 2019: 3 tasks (NLI, RQE & QA)
“Overview of the MEDIQA 2019 Shared Task on Textual
Inference, Question Entailment and Question Answering”.
Ben Abacha, Shivade & Demner-Fushman.
ACL-BioNLP 2019.
26. Data Creation & Augmentation
26
Method Type Example
#1 MeQSum Dataset Consumer
Health
Question
I suffered a massive stroke on [DATE] with paralysis on my left side of
my body, I'm home and conduct searches on the internet to find help
with recovery, and always this product called neuroaid appears
claiming to restore function. to my knowledge it isn't approved by the
FDA, but it sounds so promising. do you know anything about it and id
there anything approved by our FDA, that does help?
Summary What are treatments for stroke paralysis, including neuroaid?
#2 Augmentation
with Clinical Data
Clinical
Question
55-year-old woman. This lady has epigastric pain and gallbladder
symptoms. How do you assess her gallbladder function when you
don't see stones on the ultrasound? Can a nonfunctioning gallbladder
cause symptoms or do you only get symptoms if you have stones?
Summary Can a nonfunctioning gallbladder cause symptoms or do you only get
symptoms if you have stones?
• #3 Augmentation
with Semantic
Selection
Medical
Question
Is it healthy to ingest 500 mg of vitamin c a day? Should I be taking
more or less?
Summary How much vitamin C should I take a day?
Ben Abacha & Demner-Fushman. ACL 2019
27. 27
Method Training
Set
ROUGE-1 ROUGE-2 ROUGE-L
Seq2seq
Attentional
Model
#1 24.80 13.84 24.27
#2 28.97 18.34 28.74
#3 27.62 15.70 27.11
Pointer
Generator
(PG)
#1 35.80 20.19 34.79
#2 42.77 25.00 40.97
#3 44.16 27.64 42.78
PG +
Coverage
#1 39.57 23.05 38.45
#2 40.00 24.13 38.56
#3 41.76 24.80 40.50
• #1 MeQSum Dataset
• #2 Augmentation with
Clinical Data
• #3 Augmentation with
Semantic Selection
Question Summarization Results
Ben Abacha & Demner-Fushman. ACL 2019
Data augmentation
from question datasets
improves the overall
performance.
28. Question Summarization Results
28
Question
Ben Abacha & Demner-Fushman. ACL 2019
The best performance of
44.16% is comparable to the
state-of-the-art results in open-
domain using 8K training
pairs (2.5% of the size of the
CNN-DailyMail dataset).
29. Summing up
Safe and efficient search alternatives are needed to protect non-expert users from potential
misinformation found online.
Medical QA systems should follow strict guidelines regarding the goal and background knowledge
and resources to protect the consumers from misleading or harmful information.
Our experiments showed that:
relying on the retrieval of entailed questions is a viable strategy to answer consumer health questions
limiting the number of answer sources could enhance the performance of QA systems
summarizing the questions leads to a substantial improvement
31. Plan
1. New gold standard corpus for
Question Answering about
medications.
1. Deep learning experiments on
question type identification (CNN),
focus recognition (Bi-LSTM-CRF),
and question answering (CHiQA).
2. Research insights from the dataset
creation process and experiments.
31
32. Dataset
◼ The gold standard contains 674 consumer questions submitted to
with associated answers and annotations.
43%
19%
38% DailyMed
MedlinePlus
Other Websites
Question
Type # Example
Information 112 what type of drug is amphetamine?
Dose 70 what is a daily amount of prednisolone
eye drops to take?
Usage 61 how to self inject enoxaparin sodium?
Side Effects 60 does benazepril aggravate hepatitis?
Indication 55 why is pyridostigmine prescribed?
Interaction 51 can i drink cataflam when i drink medrol?
32
33. Discussion (1/3)
Difficulty and ambiguity of medication questions:
◼ linguistic ambiguity due to misspellings, or wrong grammar leading to
unclear meaning or multiple interpretations.
◼ the use of too many words or general terms to express a medical term
(circumlocution).
◼ the requested medical information/knowledge was not formalized or written
online (knowledge barrier).
◼ underspecified questions lacking essential information.
33
34. Discussion (2/3)
Complexity of Manual Answer Retrieval:
◼ Conditional answers including answers that depend on the manufacturer,
on the disease, or on patient information.
◼ Distributed answers that can be formed only by combining different text
snippets from different answers and/or sources.
◼ Other answers could not be found without expert inference based on
background knowledge.
34
35. Discussion (3/3)
Required Solutions for the automation of Medication QA:
◼ Medical text translation, simplification, and inference
are often needed to find relevant answers and/or make the
answers readable for non-expert users.
◼ More data and resources are needed to cover information
about drug interactions and usage guidelines (e.g.
scientific literature, clinical sources).
◼ Conditional answers require different solutions such as
providing a list of answers or interacting with the user in a
dialogue-based approach.
35
36. 36
4) Visual Question Answering
• “VQA-Med: Overview of the Medical Visual
Question Answering Task at ImageCLEF
2019”. Ben Abacha et al. CLEF 2019.
Answering
Visual
Questions
37. 37
• Visual Question Answering (VQA) poses a challenging problem
involving Natural Language Processing and Computer Vision.
• Systems capable of understanding radiology images and answering
related questions could support clinical education, clinical decision
making, and patient education.
Visual Question Answering in the Medical Domain
38. 38
VQA-Med @ ImageCLEF 2019
Four categories of questions:
Modality, Plane, Abnormality & Organ system
• What is most alarming about this ct scan?
• Is this a t1 weighted, t2 weighted, or flair image?
• What is the organ principally shown in this x-ray?
• In which plane is the mri displayed?
Examples:
39. ◼ Training, validation, and test sets created automatically:
◼ The training set includes 3,200 radiology images and 12,792 question-answer pairs.
◼ The answers of the test set have been validated by two medical doctors.
VQA-Med 2019 Data
40. Results (Accuracy)
• The best team achieved 0.624
overall accuracy and 0.644 BLEU
score.
• Methods: transfer learning, multi-
task learning, ensemble methods,
and hybrid approaches combining
classification models and answer
generation methods.
44. ID Question (Q) / Answer (A)
1a (Q) “what does prednisone do to the body?”
1b (Q) “what would a normal dose be for valacyclovir?”
1c (Q) “how much gravol to kill you?”
1d (Q) “what time should take memantine?”
2a (Q) “what color is phenytoin?”
(A) [depends on the manufacturer]
PINK
WHITE (/Light Lavender)
ORANGE
2b (Q) “why would my urine test be negative for benzodiazepines when i take Ativan?”
(A) “Common limitations exist for screening benzodiazepines when using traditional immunoassay (IA) tests. IA testing
for benzodiazepines often targets nordiazepam and oxazepam (…)” AND “Some commonly prescribed drugs have
limited cross-reactivity (…)”
2c (Q) “is it alright touse fluticasone when using oxygen?”
(A) “Pharmacological therapy can influence morbidity and mortality in severe chronic obstructive pulmonary disease
(COPD). Long-term domiciliary oxygen therapy (LTOT) improves survival in COPD with chronic hypoxaemia. Oral
steroid medication has been associated with improved survival in men and increased mortality in women, while inhaled
steroid medication has been associated with a reduction in the exacerbation rate (…).”
2d (Q) “how long does vicodin stay in breast milk?”
(A) “Following a 10 mg oral dose of hydrocodone administered to five adult male subjects, the mean peak concentration
was 23.6 ± 5.2 ng/mL. Maximum serum levels were achieved at 1.3 ± 0.3 hours and the half-life was determined to be
3.8 ± 0.3 hours.”
44