Consumer Health Questions: Summarizing and Answering Complex Medical Queries

Medical Question Answering: Dealing with the complexity
and specificity of consumer health questions and visual questions
November 12, 2019 - Seattle, WA, USA
Allen Institute for Artificial Intelligence (AI2
Dr. Asma Ben Abacha

Disclaimer
The views and opinions expressed do not necessarily state or
reflect those of the U.S. Government, and they may not be used
for advertising or product endorsement purposes.

● PhD in Computer Science, Paris-Sud
University, France.
● Master Degree, Paris 13 University, France.
● Software engineering degree, ENSI, Tunisia.
● Postdoctoral researcher, LIST, Luxembourg.
● Lecturer at the University of Lorraine, France.
● PhD Student at LIMSI-CNRS, France.
Topic: Medical question answering.
Short Bio

● Currently, a staff scientist at the NIH/NLM.
U.S. National Institutes of
Health (NIH)
• www.nih.gov
• 27 institutes & centers
U.S. National Library of Medicine (NLM) www.nlm.nih.gov

 NLM receives hundreds of requests per day including many
consumer health questions.
 Manual answers returned in 4 days from consumer services.
 Goal: Automatic system to assist users in their search.
Why Consumer Health Question Answering?

Challenges
in Consumer Health
Question Answering

Example:
“obese daughter. i'm a mother of 5 children. My son
has leukemia who is the youngest. My daughter is 12yrs old and
is obese. i dont know where to turn but i need some professional help.
my daughter has always suffered with weight issues. while i deal
with my 11yr old son at md anderson im so scared of loosing my
daughter. Shes already been suicidal and her grades are so bad.
please i need help. “
7
1st Challenge: Automatic Question Analysis & Understanding

2nd Challenge: Retrieval of relevant & reliable answers

Short Query: “How to treat obesity in children”
2nd Challenge: Retrieval of relevant & reliable answers

A good answer in Page 3 using a short query (“How to treat obesity in children”)

11
1) Recognizing Question Entailment
• “A Question-Entailment Approach to
Question Answering”.
Ben Abacha & Demner-Fushman. BMC
Bioinformatics 2019.
Answering
Complex
Questions
2) Question Summarization
• “On the Summarization of Consumer
Health Questions”.
Ben Abacha & Demner-Fushman. ACL 2019.

◼ Consumer Health Question
(1’)
Relation?
How to define and recognize entailment between questions?

Recognizing Question Entailment (RQE)
13
 Answering new questions by retrieving entailed
questions with existing answers.

MedQuAD: Medical QA Dataset
 Created from trusted resources.
 Contains 47k question-answer pairs extracted from 12 NIH websites
(e.g. MedlinePlus, NCI, GARD, NIDDK).
 Covers 16 question types about Diseases (e.g. Treatment, Susceptibility),
20 types about Drugs (e.g. Usage, Interaction), and an additional type
(Information) for all possible question foci (e.g. Procedure, Exam).
14
https://github.com/abachaa/MedQuAD

RQE Models
Observations:
Logistic Regression outperformed neural networks when trained
with traditional word embeddings such as Glove and Word2Vec.
 Relying on recent language models such as Bert for pre-training led
to a better performance (e.g. results of MEDIQA-RQE 2019). 15
Ben Abacha & Demner-Fushman. BMC Bioinformatics 2019

16
1) Recognizing Entailment/Inference in the Medical Domain
2) Entailment/Inference for Question Answering (QA)
3) New Shared Task at ACL-BioNLP:
Need of dedicated gold-standard datasets
and a common evaluation platform.
 2006:
 2016:
Methods for Using Textual Entailment in Open-Domain Question
Answering. Sanda Harabagiu & Andrew Hickl
Recognizing Question Entailment for Medical Question Answering.
Asma Ben Abacha & Dina Demner-Fushman
MEDIQA @ ACL-BioNLP 2019

17
 MEDIQA 2019: 3 tasks (NLI, RQE & QA)
“Overview of the MEDIQA 2019 Shared Task on Textual
Inference, Question Entailment and Question Answering”.
Ben Abacha, Shivade & Demner-Fushman.
ACL-BioNLP 2019.

 MEDIQA 2019: 72 teams, 20 papers, and interesting approaches & results.

CHiQA
19
https://chiqa.nlm.nih.gov

CHiQA
20
Consumer Health Information and Question Answering: Helping consumers find answers to their
health-related information needs. Demner-Fushman, D., Mrabet, Y., and Ben Abacha, A. JAMIA 2019.

CHiQA
21
Metrics LiveQA-Med Dataset
IR RQE COMB LiveQA-
Med Best
Average Score 1.183 0.827 1.308 0.637
MAP@10 0.405 0.311 0.445 --
MRR@10 0.438 0.333 0.516 --
o Benchmark from the Medical Question
Answering Competition @ LiveQA 2017
(Ben Abacha et al., TREC 2017).

22
1) Recognizing Question Entailment
• “A Question-Entailment Approach to
Question Answering”.
Ben Abacha & Demner-Fushman. BMC
Bioinformatics 2019.
Answering
Complex
Questions
2) Question Summarization
• “On the Summarization of Consumer
Health Questions”.
Ben Abacha & Demner-Fushman. ACL 2019.

23
 Tackling the complexity of consumer questions by automatic summarization.
Question Summarization

24
“On the Role of Question Summarization and Information
Source Restriction in Consumer Health Question Answering”.
Ben Abacha & Demner-Fushman. AMIA 2019 Informatics Summit.
Measures Original
Questions
Question
Summaries
AvgScore (0-3) 0.711 1.125
Succ@2+ 0.442 0.663
Succ@3+ 0.192 0.317
Succ@4+ 0.077 0.144
Prec@2+ 0.46 0.663
Prec@3+ 0.2 0.317
Prec@4+ 0.08 0.144
QA performance w(o)/ Summarization
o Benchmark from the Medical Question
Answering Competition @ LiveQA 2017
(Ben Abacha et al., TREC 2017).

Automatic Summarization: Pointer-generator network
25
See et al., ACL 2017

Data Creation & Augmentation
26
Method Type Example
#1 MeQSum Dataset Consumer
Health
Question
I suffered a massive stroke on [DATE] with paralysis on my left side of
my body, I'm home and conduct searches on the internet to find help
with recovery, and always this product called neuroaid appears
claiming to restore function. to my knowledge it isn't approved by the
FDA, but it sounds so promising. do you know anything about it and id
there anything approved by our FDA, that does help?
Summary What are treatments for stroke paralysis, including neuroaid?
#2 Augmentation
with Clinical Data
Clinical
Question
55-year-old woman. This lady has epigastric pain and gallbladder
symptoms. How do you assess her gallbladder function when you
don't see stones on the ultrasound? Can a nonfunctioning gallbladder
cause symptoms or do you only get symptoms if you have stones?
Summary Can a nonfunctioning gallbladder cause symptoms or do you only get
symptoms if you have stones?
• #3 Augmentation
with Semantic
Selection
Medical
Question
Is it healthy to ingest 500 mg of vitamin c a day? Should I be taking
more or less?
Summary How much vitamin C should I take a day?
Ben Abacha & Demner-Fushman. ACL 2019

27
Method Training
Set
ROUGE-1 ROUGE-2 ROUGE-L
Seq2seq
Attentional
Model
#1 24.80 13.84 24.27
#2 28.97 18.34 28.74
#3 27.62 15.70 27.11
Pointer
Generator
(PG)
#1 35.80 20.19 34.79
#2 42.77 25.00 40.97
#3 44.16 27.64 42.78
PG +
Coverage
#1 39.57 23.05 38.45
#2 40.00 24.13 38.56
#3 41.76 24.80 40.50
• #1 MeQSum Dataset
• #2 Augmentation with
Clinical Data
• #3 Augmentation with
Semantic Selection
Question Summarization Results
Data augmentation
from question datasets
improves the overall
performance.

28
Question
The best performance of
44.16% is comparable to the
state-of-the-art results in open-
domain using 8K training
pairs (2.5% of the size of the
CNN-DailyMail dataset).

Summing up
 Safe and efficient search alternatives are needed to protect non-expert users from potential
misinformation found online.
 Medical QA systems should follow strict guidelines regarding the goal and background knowledge
and resources to protect the consumers from misleading or harmful information.
Our experiments showed that:
 relying on the retrieval of entailed questions is a viable strategy to answer consumer health questions
 limiting the number of answer sources could enhance the performance of QA systems
 summarizing the questions leads to a substantial improvement

30
Answering
Medication
Questions
3) Questions about Medications
• “Bridging the Gap between Consumers’
Medication Questions and Trusted Answers”.
Ben Abacha et al. MEDINFO 2019.

Plan
1. New gold standard corpus for
Question Answering about
medications.
1. Deep learning experiments on
question type identification (CNN),
focus recognition (Bi-LSTM-CRF),
and question answering (CHiQA).
2. Research insights from the dataset
creation process and experiments.
31

Dataset
◼ The gold standard contains 674 consumer questions submitted to
with associated answers and annotations.
43%
19%
38% DailyMed
MedlinePlus
Other Websites
Question
Type # Example
Information 112 what type of drug is amphetamine?
Dose 70 what is a daily amount of prednisolone
eye drops to take?
Usage 61 how to self inject enoxaparin sodium?
Side Effects 60 does benazepril aggravate hepatitis?
Indication 55 why is pyridostigmine prescribed?
Interaction 51 can i drink cataflam when i drink medrol?
32

Discussion (1/3)
Difficulty and ambiguity of medication questions:
◼ linguistic ambiguity due to misspellings, or wrong grammar leading to
unclear meaning or multiple interpretations.
◼ the use of too many words or general terms to express a medical term
(circumlocution).
◼ the requested medical information/knowledge was not formalized or written
online (knowledge barrier).
◼ underspecified questions lacking essential information.
33

Discussion (2/3)
Complexity of Manual Answer Retrieval:
◼ Conditional answers including answers that depend on the manufacturer,
on the disease, or on patient information.
◼ Distributed answers that can be formed only by combining different text
snippets from different answers and/or sources.
◼ Other answers could not be found without expert inference based on
background knowledge.
34

Discussion (3/3)
Required Solutions for the automation of Medication QA:
◼ Medical text translation, simplification, and inference
are often needed to find relevant answers and/or make the
answers readable for non-expert users.
◼ More data and resources are needed to cover information
about drug interactions and usage guidelines (e.g.
scientific literature, clinical sources).
◼ Conditional answers require different solutions such as
providing a list of answers or interacting with the user in a
dialogue-based approach.
35

36
4) Visual Question Answering
• “VQA-Med: Overview of the Medical Visual
Question Answering Task at ImageCLEF
2019”. Ben Abacha et al. CLEF 2019.
Answering
Visual
Questions

37
• Visual Question Answering (VQA) poses a challenging problem
involving Natural Language Processing and Computer Vision.
• Systems capable of understanding radiology images and answering
related questions could support clinical education, clinical decision
making, and patient education.
Visual Question Answering in the Medical Domain

38
VQA-Med @ ImageCLEF 2019
Four categories of questions:
Modality, Plane, Abnormality & Organ system
• What is most alarming about this ct scan?
• Is this a t1 weighted, t2 weighted, or flair image?
• What is the organ principally shown in this x-ray?
• In which plane is the mri displayed?
Examples:

◼ Training, validation, and test sets created automatically:
◼ The training set includes 3,200 radiology images and 12,792 question-answer pairs.
◼ The answers of the test set have been validated by two medical doctors.
VQA-Med 2019 Data

Results (Accuracy)
• The best team achieved 0.624
overall accuracy and 0.644 BLEU
score.
• Methods: transfer learning, multi-
task learning, ensemble methods,
and hybrid approaches combining
classification models and answer
generation methods.

Thank you!
42
 Contact: asma.benabacha@nih.gov
 Datasets: github.com/abachaa
 CHiQA: chiqa.nlm.nih.gov

Examples of consumer health questions

ID Question (Q) / Answer (A)
1a (Q) “what does prednisone do to the body?”
1b (Q) “what would a normal dose be for valacyclovir?”
1c (Q) “how much gravol to kill you?”
1d (Q) “what time should take memantine?”
2a (Q) “what color is phenytoin?”
(A) [depends on the manufacturer]
 PINK
 WHITE (/Light Lavender)
 ORANGE
2b (Q) “why would my urine test be negative for benzodiazepines when i take Ativan?”
(A) “Common limitations exist for screening benzodiazepines when using traditional immunoassay (IA) tests. IA testing
for benzodiazepines often targets nordiazepam and oxazepam (…)” AND “Some commonly prescribed drugs have
limited cross-reactivity (…)”
2c (Q) “is it alright touse fluticasone when using oxygen?”
(A) “Pharmacological therapy can influence morbidity and mortality in severe chronic obstructive pulmonary disease
(COPD). Long-term domiciliary oxygen therapy (LTOT) improves survival in COPD with chronic hypoxaemia. Oral
steroid medication has been associated with improved survival in men and increased mortality in women, while inhaled
steroid medication has been associated with a reduction in the exacerbation rate (…).”
2d (Q) “how long does vicodin stay in breast milk?”
(A) “Following a 10 mg oral dose of hydrocodone administered to five adult male subjects, the mean peak concentration
was 23.6 ± 5.2 ng/mL. Maximum serum levels were achieved at 1.3 ± 0.3 hours and the half-life was determined to be
3.8 ± 0.3 hours.”
44

45
Question Summarization

Consumer Health Questions: Summarizing and Answering Complex Medical Queries

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Consumer Health Questions: Summarizing and Answering Complex Medical Queries

Similar to Consumer Health Questions: Summarizing and Answering Complex Medical Queries (20)

Recently uploaded

Recently uploaded (20)

Consumer Health Questions: Summarizing and Answering Complex Medical Queries