This thesis presents FriendsQA, a challenging question answering dataset that contains 1,222 dialogues and 10,610 open-domain questions, to tackle machine comprehension on everyday conversations. Each dialogue, involving multiple speakers, is annotated with six types of questions (what, when, why, where, who, how) regarding the dialogue contexts, and the answers are annotated as contiguous spans in the dialogue. A series of crowdsourcing tasks is conducted to ensure good annotation quality, resulting in a high inter-annotator agreement of 81.82%. A comprehensive annotation analysis is provided for a deeper understanding of this dataset. Three state-of-the-art QA systems, R-Net, QANet, and BERT, are experimented with and evaluated on this dataset. BERT in particular shows promising results, with an accuracy of 74.2% for answer utterance selection and an F1 score of 64.2% for answer span selection, suggesting that the FriendsQA task is hard yet has great potential to elevate QA research on multiparty dialogue to another level.
3. Introduction
• What is Question Answering?
• A task that challenges a machine's ability to understand a document
• The learned knowledge is then applied to answer queries
• Completing a blank: cloze-style
• Selecting from a pool of answer candidates: multiple choice
• Selecting an answer span from the document: span-based
4. Introduction
• Motivation
• Remarkable results have been reported on numerous datasets, but…
• No multiparty dialogue!
• Wiki articles and News articles
• (non-) fictional stories
• Children’s books
• Multiparty dialogue is the most natural means of communication
5. Introduction
• FriendsQA: an open-domain Question Answering dataset
• Given a context, the task is to select the answer span, as in the example on the right
6. Background: Cloze-style Datasets
• CNN/Daily Mail
• Predict PERSON entities in the summaries of news articles
• Children’s Book Test
• Expands the prediction to all entities, using children’s books
• BookTest
• 60 times larger than CBT
• Who-did-what
• A description sentence and an evidence passage from the English Gigaword Corpus
7. Background: MC Datasets
• MCTest: comprising short fictional stories
• RACE: compiled from English assessments for students aged 12-18
• TQA: compiled from middle school science lessons and textbooks
• SciQ: passages from science exams collected via crowdsourcing
• DREAM: multiparty dialogue passages from English-as-a-foreign-language exams
8. Background: Span-based Datasets
• bAbI: infer event descriptions
• WikiQA and SQuAD: Wikipedia
• NewsQA: CNN articles
• MS MARCO: web documents (Bing)
• TriviaQA: from trivia enthusiasts
• CoQA: conversational flow between a questioner and an answerer
10. Background: Character Mining
• The first 4 seasons are annotated for character identification tasks
• Annotations are again extended to plural mentions
• The first 4 seasons are also annotated with fine-grained emotion detection
• All 10 seasons are processed for a cloze-style RC task
11. Background: FriendsQA vs. Other Dialogue QA
• FriendsQA vs. CoQA
• CoQA aims to answer questions in one-to-one conversations between a questioner and an answerer
• The evidence passages are still wiki articles
• FriendsQA vs. cloze-style RC task
• Cloze-style reasoning is less complex compared to span-based QA
• The predictions are limited to PERSON entities
• FriendsQA vs. DREAM
• Multiple-choice questions are not ideal for practical QA applications
12. The Corpus: FriendsQA Dataset
• 1,222 scenes (83 scenes are pruned for having fewer than 5 utterances)
• All utterances are concatenated to form an evidence passage
• The task is to find a contiguous answer span in the evidence passage (see the sketch below)
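The sketch below illustrates how a scene could be flattened into one evidence passage, with answers as contiguous spans of that passage; the utterance-ID format, speaker prefixes, and example scene are assumptions for illustration, not the authors' exact preprocessing.

```python
from typing import List, Tuple

def build_passage(utterances: List[Tuple[str, str]]) -> str:
    """Concatenate (speaker, text) pairs into one evidence passage.

    Each utterance is prefixed with a synthetic utterance ID and the speaker
    name, so speaker names and utterance IDs can also be selected as answers.
    """
    return "\n".join(f"u{i:03d} {speaker}: {text}"
                     for i, (speaker, text) in enumerate(utterances))

# Example with a made-up scene:
scene = [("Joey", "How you doin'?"), ("Rachel", "I'm fine, thanks.")]
passage = build_passage(scene)
# An answer is then a contiguous span of `passage`,
# e.g. the characters covering "fine".
```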
13. The Corpus: Challenges with entity resolution
• Utterances are spoken by several people, and context switching happens more frequently
• The ubiquitous and interchangeable use of pronouns
14. The Corpus: Challenges with metaphors
• Homophone confusion
• Humor that can be understood by human readers
• Requires outside knowledge; in this case, knowledge of the human body
15. The Corpus: Challenges with sarcasm
• The use of sarcasm is dominant in Friends to create humorous effects
• The meaning is the exact opposite if comprehended literally
16. The Corpus: Crowdsourcing
• All annotation tasks are conducted on Amazon Mechanical Turk.
• Left panel: the dialogue
• Right panel: text inputs for question generation
• Prior to the actual tasks: a quiz to ensure annotators’ understanding of the task and the web interface
19. The Corpus: Phase 1 → Question-Answer Generation
• Clear annotation guidelines
• Generate 4 questions out of the six types: {what, when, where, who, why, how}
• Questions must be answerable
• Multiple answers are allowed
• However, selected answers must be relevant to the question
• Speaker names and utterance IDs can also be selected as answers
20. The Corpus: Quality Assurance
• A task can only be submitted after passing all rules (a sketch of such checks follows this list)
• Are there at least 4 types of questions annotated?
• Does each question have at least one answer span associated with it?
• Does any question have too much string overlap with the original text in the dialogue?
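A minimal sketch of how these submission rules could be automated; the data layout, the overlap threshold, and the longest-common-substring heuristic are assumptions for illustration, not the authors' actual checks.

```python
from difflib import SequenceMatcher

def validate_submission(questions, dialogue_text, max_overlap=0.7):
    """questions: list of dicts such as {"type": "why", "text": ..., "answers": [...]}."""
    # Rule 1: at least 4 of the 6 question types must be covered.
    if len({q["type"] for q in questions}) < 4:
        return False, "fewer than 4 question types"
    for q in questions:
        # Rule 2: every question needs at least one answer span.
        if not q["answers"]:
            return False, f"no answer span for: {q['text']}"
        # Rule 3: reject questions that largely copy a stretch of the dialogue.
        sm = SequenceMatcher(None, q["text"].lower(), dialogue_text.lower())
        match = sm.find_longest_match(0, len(q["text"]), 0, len(dialogue_text))
        if match.size / max(len(q["text"]), 1) > max_overlap:
            return False, f"too much overlap with the dialogue: {q['text']}"
    return True, "ok"
```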
21. The Corpus: Phase 2 → Verification and Paraphrasing
• Questions generated in Phase 1 are published again without answers
• Annotators are asked to revise the questions if they are unanswerable or ambiguous
• Annotators are asked to answer the questions
• Annotators are asked to paraphrase the questions
• Additional check for quality assurance:
• Check whether the paraphrased question is an exact copy of the original
22. The Corpus: Four Rounds of Annotation
• Four rounds of annotation are conducted before the official annotation tasks
• The F1 score metric is adopted to evaluate inter-annotator agreement (ITA); a sketch follows
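Below is a minimal sketch of a token-level F1 between two annotators' answer spans, in the style commonly used for span-based QA evaluation; the slides do not spell out the exact ITA computation, so treat this as an assumption.

```python
from collections import Counter

def token_f1(span_a: str, span_b: str) -> float:
    """Token-level F1 between two answer spans (order-insensitive overlap)."""
    tokens_a, tokens_b = span_a.lower().split(), span_b.lower().split()
    common = Counter(tokens_a) & Counter(tokens_b)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(tokens_a), overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)

# e.g. token_f1("because Joey built this chair", "Joey built this chair") ≈ 0.89
```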
23. The Corpus: R1
• Ambiguous questions that led to bad answers are observed
• The guidelines are updated to make the questions as explicit as possible
24. The Corpus: R2
• A 6.27% improvement on ITA is observed
• More examples of questions and answer spans are added to the guidelines
25. The Corpus: R3
• Another 2.48% improvement on ITA
• No update is made to the guidelines
26. The Corpus: R4
• A marginal ITA improvement of 0.67% is observed
• Implies that the annotation guidelines have stabilized
27. The Corpus: Question / Answer Pruning
• If a question is revised dramatically, the original question is pruned (21.8% are revised)
• If the answers do not agree, the question and its answers are pruned (13.5% are pruned)
29. The Corpus: Question Types vs. Answer Categories
• 250 questions are randomly sampled
• Demonstrates the diversity of FriendsQA
30. Approach
• Three SOTA systems are selected to represent common approaches
• R-Net: recurrent neural networks with attention mechanisms
• QANet: convolutional neural networks with self-attention
• BERT: deep bidirectional Transformer encoders
33. Approach: BERT
• Pushed all current state-of-the-art scores to another level
• Based on Transformers (attention only); a span-selection sketch follows
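For illustration, a minimal sketch of span-based QA with BERT using the Hugging Face `transformers` library; this is not the authors' exact setup, and the checkpoint name, question, and passage below are assumptions.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

# After fine-tuning on the FriendsQA training set, the two output heads score
# every token as a possible answer-span start / end.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Why is Chandler against marriage?"
passage = "u001 Chandler: I just do not believe in marriage. ..."

inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
```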
34. Experiments: Model Development
• All dialogues are randomly shuffled and redistributed into training (80%), development (10%), and test (10%) sets
• Each training instance consists of a dialogue, questions, and a single answer to each question
• Utterance IDs are replaced with the actual utterances

Set          Dialogues  Questions  Answers
Training           977      8,535   17,074
Development        122      1,010    2,057
Test               123      1,065    2,131
35. Experiments: Model Development
• Recall that each question can have multiple answers
• Three strategies to generate training instances with a single answer (see the sketch below):
• Select the shortest answer and discard the rest
• Select the longest answer and discard the rest
• If a question Q1 has multiple answers A1 and A2, generate two training instances (Q1, A1) and (Q1, A2) and train on them independently
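A minimal sketch of the three strategies; the function name and data layout are assumptions for illustration.

```python
def make_instances(question, answers, strategy="multi"):
    """Turn a question with several gold answers into single-answer instances."""
    if strategy == "shortest":
        return [(question, min(answers, key=len))]
    if strategy == "longest":
        return [(question, max(answers, key=len))]
    # "multi": one independent training instance per annotated answer
    return [(question, a) for a in answers]

# e.g. make_instances("Q1", ["A1", "A2"]) -> [("Q1", "A1"), ("Q1", "A2")]
```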
39. Experiments: Utterance Match
• Given the nature of multiparty dialogue QA, an utterance match (UM) metric is introduced
• Models are considered powerful if they always look for answers in the correct utterance
• UM mainly checks whether the prediction resides within the same utterance as the gold answer span (see the sketch below)
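A minimal sketch of the UM check, assuming utterance boundaries are available as character offsets into the evidence passage; the representation is an assumption for illustration.

```python
from bisect import bisect_right

def utterance_index(char_offset, utterance_starts):
    """Map a character offset in the passage to the index of its utterance."""
    return bisect_right(utterance_starts, char_offset) - 1

def utterance_match(pred_start, gold_start, utterance_starts):
    """True if the predicted span starts in the same utterance as the gold span."""
    return (utterance_index(pred_start, utterance_starts)
            == utterance_index(gold_start, utterance_starts))

# With utterances starting at offsets [0, 40, 95]:
# utterance_match(50, 60, [0, 40, 95]) -> True (both fall in the second utterance)
```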
40. Experiments: Results
• All experiments are run three times
• Average scores with standard deviations are reported
• BERT and QANet perform better with the multiple-answer strategy
• R-Net performs better with the other strategies
41. Experiments: Results with entity replacement
• Takes advantage of the Character Mining project
• An entity mapping is kept, and all PERSON entities in both the dialogue and the questions are replaced (see the sketch below)
• Plural mentions are handled naively (e.g., "we" → ent0 ent1 ent2)
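A minimal sketch of such a replacement step; the entity IDs and the regular-expression approach are assumptions for illustration, not the authors' exact procedure.

```python
import re

def replace_entities(text, entity_map):
    """entity_map, e.g. {"Joey": "ent0"}, is kept consistent across a scene."""
    for name, ent_id in entity_map.items():
        text = re.sub(rf"\b{re.escape(name)}\b", ent_id, text)
    return text

entity_map = {"Joey": "ent0", "Rachel": "ent1", "Chandler": "ent2"}
replace_entities("Joey asked Rachel about Chandler.", entity_map)
# -> "ent0 asked ent1 about ent2."
# A plural mention like "we" would naively expand to "ent0 ent1 ent2".
```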
42. Experiments: Results based on Q-Type
• where and when questions are mostly factoid, which show the highest performance with UM
• why and how questions require cross-utterance reasoning, leading to worse performance
• who and what questions give a good mixture of proper and common nouns and show moderate performance

Type    Dist.    UM     SM     EM
What    19.70%   77.42  69.39  55.04
Where   18.28%   84.35  78.86  65.93
Who     17.17%   74.12  64.34  55.29
Why     15.76%   60.47  50.03  27.14
How     14.65%   65.52  52.04  32.64
When    14.44%   80.65  65.81  51.98
43. Experiments: Results of Start-of-Utterance prediction
• Predict only the start of the answer utterance (see the sketch below)
• Only 1 output layer is needed: simply report accuracy
• Demonstrates the power of neural networks

Run   SoU Acc.
1     57.23
2     57.62
3     55.25
Avg.  56.70
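A minimal sketch of a single start-position output layer on top of encoder token embeddings (in PyTorch), as opposed to the usual two heads for span start and end; the exact architecture details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StartOfUtteranceHead(nn.Module):
    """Single linear head that scores each token as the answer-utterance start."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.start_head = nn.Linear(hidden_size, 1)  # one score per token

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_size) from an encoder such as BERT
        return self.start_head(token_embeddings).squeeze(-1)  # (batch, seq_len)

# Accuracy: the fraction of questions whose argmax over seq_len matches the gold start.
```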
45. Error Analysis
• 100 completely mismatched questions are randomly sampled
• Through the analysis, 6 types of errors become evident
46. Error Analysis
• Entity Resolution (28%)
• Paraphrase and Partial Match (20%)
• Cross-Utterance Reasoning (18%)
• Question Bias (17%)
• Noise in Annotation (4%)
• Miscellaneous (13%)
47. Entity Resolution (28%)
Q: What is Chandler’s opinion regarding marriage?
A: Joey thinks… (wrong entity!)
48. Paraphrase and Partial Match (20%)
• Paraphrasing, abstraction, nicknames, etc., referred to somewhere else in the conversation
• Partially correct answers, especially for why and how questions, could be acceptable in practice
• This motivates the evaluation using Utterance Match
49. Cross-Utterance Reasoning (18%)
• This type reveals a universal challenge in understanding human-to-human conversation
• Models must reason across multiple utterances back and forth, especially if a story or an event unfolds gradually, is scattered in different places, and is told by different speakers
50. Question Bias (17%)
• This type occurs when the answer predictions overly rely on the question types.
Q: Why is Chandler against marriage?
A: …because Joey built this chair on his own
• A span starting with "because" is not necessarily the correct answer!
51. Noise in Annotation (4%)
• FriendsQA, although it achieves high inter-annotator agreement, still includes noise caused by wrong spans, ambiguous or unanswerable questions, or typos.
52. Miscellaneous (13%)
• Errors in this category have no apparent cause; it is unclear why the model predicts these answers
• They often seem irrelevant to the questions, so they need more investigation.
53. Conclusion: Contributions
• FriendsQA: an open-domain question answering dataset
• An extensive and comprehensive analysis: validity, difficulty, and diversity
• Three state-of-the-art models are run and compared, showing the dataset's potential
• The error analysis offers an insightful retrospective and makes suggestions for deeper future study
54. Conclusion: Future Work
• The Q-type and error analyses can serve as guidelines to further enhance QA model performance
• Why and how questions should be studied more attentively
• Speaker information could be encoded into the utterance
• Top-k answer prediction: another challenging but tangible task
• Answer existence prediction and an utterance-based model to select utterance candidates