This thesis presents FriendsQA, a challenging question answering dataset that contains 1,222 dialogues and 10,610 open-domain questions, to tackle machine comprehension on everyday conversations. Each dialogue, involving multiple speakers, is annotated with six types of questions (what, when, why, where, who, how) regarding the dialogue contexts, and the answers are annotated as contiguous spans in the dialogue. A series of crowdsourcing tasks is conducted to ensure good annotation quality, resulting in a high inter-annotator agreement of 81.82%. A comprehensive annotation analysis is provided for a deeper understanding of this dataset. Three state-of-the-art QA systems, R-Net, QANet, and BERT, are experimented with and evaluated on this dataset. BERT in particular shows promising results, with an accuracy of 74.2% for answer utterance selection and an F1 score of 64.2% for answer span selection, suggesting that the FriendsQA task is hard yet has great potential to elevate QA research on multiparty dialogue to another level.
3. Introduction
• What is Question Answering?
• A task that challenges a machine's ability to understand a document
• The learned knowledge is then applied to answer queries
• Completing a blank: cloze-style
• Selecting from a pool of answer candidates: multiple choice
• Selecting an answer span from the document: span-based
4. Introduction
• Motivation
• Remarkable results have been reported on numerous datasets, but…
• No multiparty dialogue!
• Wiki articles and News articles
• (non-) fictional stories
• Children’s books
• Multiparty dialogue is the most natural means of communication
5. Introduction
• FriendsQA: an open-domain Question Answering dataset
• Given a context, the task is to select the answer span, as in the example on the right
6. Background: Cloze-style Datasets
• CNN/Daily Mail
• Predict PERSON entities in the summaries of news articles
• Children’s Book Test
• Expands the prediction to all entities, using children’s books
• BookTest
• 60 times larger than CBT
• Who-did-what
• A description sentence and an evidence passage from the English Gigaword Corpus
7. Background: MC Datasets
• MCTest: comprising short fictional stories
• RACE: compiled from English assessments for students aged 12-18
• TQA: compiled from middle school science lessons and textbooks
• SciQ: passages from science exams collected via crowdsourcing
• DREAM: multiparty dialogue passages from English-as-a-foreign-language exams
8. Background: Span-based Datasets
• bAbI: infer event descriptions
• WikiQA and SQuAD: Wikipedia
• NewsQA: CNN articles
• MS MARCO: web documents (Bing)
• TriviaQA: from trivia enthusiasts
• CoQA: conversational flow between a questioner and an answerer
10. Background: Character Mining
• The first 4 seasons are annotated for character identification tasks
• Annotations are again extended to plural mentions
• The first 4 seasons are also annotated with fine-grained emotion detection
• All 10 seasons are processed for a cloze-style RC task
11. Background: FriendsQA vs. Other Dialogue QA
• FriendsQA vs. CoQA
• CoQA aims to answer questions in one-to-one conversations between a questioner and an answerer
• The evidence passages are still wiki articles
• FriendsQA vs. cloze-style RC task
• Cloze-style reasoning is less complex compared to span-based QA
• The predictions are limited to PERSON entities
• FriendsQA vs. DREAM
• Multiple-choice questions are not ideal for practical QA applications
12. The Corpus: FriendsQA Dataset
• 1,222 scenes (83 scenes are pruned for having fewer than 5 utterances)
• All utterances are concatenated to form an evidence passage
• The task is to find a contiguous answer span in the evidence passage (see the sketch below)
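The sketch below illustrates how a scene could be flattened into one evidence passage, with answers as contiguous spans of that passage; the utterance-ID format, speaker prefixes, and example scene are assumptions for illustration, not the authors' exact preprocessing.

```python
from typing import List, Tuple

def build_passage(utterances: List[Tuple[str, str]]) -> str:
    """Concatenate (speaker, text) pairs into one evidence passage.

    Each utterance is prefixed with a synthetic utterance ID and the speaker
    name, so speaker names and utterance IDs can also be selected as answers.
    """
    return "\n".join(f"u{i:03d} {speaker}: {text}"
                     for i, (speaker, text) in enumerate(utterances))

# Example with a made-up scene:
scene = [("Joey", "How you doin'?"), ("Rachel", "I'm fine, thanks.")]
passage = build_passage(scene)
# An answer is then a contiguous span of `passage`,
# e.g. the characters covering "fine".
```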
13. The Corpus: Challenges with entity resolution
• Utterances are spoken by several people, and context switching happens more frequently
• The ubiquitous and interchangeable use of pronouns
14. The Corpus: Challenges with metaphors
• Homophone confusion
• Humor that can be understood by human readers
• Requires outside knowledge; in this case, knowledge of the human body
15. The Corpus: Challenges with sarcasm
• The use of sarcasm is dominant in Friends to create humorous effects
• The meaning is the exact opposite if comprehended literally
16. The Corpus: Crowdsourcing
• All annotation tasks are conducted on Amazon Mechanical Turk.
• Left panel: the dialogue
• Right panel: text inputs for question generation
• Prior to the actual tasks: a quiz to ensure annotators’ understanding of the task and the web interface
19. The Corpus: Phase 1 → Question-Answer Generation
• Clear annotation guidelines
• Generate 4 questions out of the six types: {what, when, where, who, why, how}
• Questions must be answerable
• Multiple answers are allowed
• However, selected answers must be relevant to the question
• Speaker names and utterance IDs can also be selected as answers
20. The Corpus: Quality Assurance
• A task can only be submitted after passing all rules (a sketch of such checks follows this list)
• Are there at least 4 types of questions annotated?
• Does each question have at least one answer span associated with it?
• Does any question have too much string overlap with the original text in the dialogue?
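A minimal sketch of how these submission rules could be automated; the data layout, the overlap threshold, and the longest-common-substring heuristic are assumptions for illustration, not the authors' actual checks.

```python
from difflib import SequenceMatcher

def validate_submission(questions, dialogue_text, max_overlap=0.7):
    """questions: list of dicts such as {"type": "why", "text": ..., "answers": [...]}."""
    # Rule 1: at least 4 of the 6 question types must be covered.
    if len({q["type"] for q in questions}) < 4:
        return False, "fewer than 4 question types"
    for q in questions:
        # Rule 2: every question needs at least one answer span.
        if not q["answers"]:
            return False, f"no answer span for: {q['text']}"
        # Rule 3: reject questions that largely copy a stretch of the dialogue.
        sm = SequenceMatcher(None, q["text"].lower(), dialogue_text.lower())
        match = sm.find_longest_match(0, len(q["text"]), 0, len(dialogue_text))
        if match.size / max(len(q["text"]), 1) > max_overlap:
            return False, f"too much overlap with the dialogue: {q['text']}"
    return True, "ok"
```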
21. The Corpus: Phase 2 → Verification and Paraphrasing
• Questions generated in Phase 1 are published again without answers
• Annotators are asked to revise the questions if they are unanswerable or ambiguous
• Annotators are asked to answer the questions
• Annotators are asked to paraphrase the questions
• Additional check for quality assurance:
• Check whether the paraphrased question is an exact copy of the original
22. The Corpus: Four Rounds of Annotation
• Four rounds of annotation are conducted before the official annotation tasks
• The F1 score metric is adopted to evaluate inter-annotator agreement (ITA); a sketch follows
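Below is a minimal sketch of a token-level F1 between two annotators' answer spans, in the style commonly used for span-based QA evaluation; the slides do not spell out the exact ITA computation, so treat this as an assumption.

```python
from collections import Counter

def token_f1(span_a: str, span_b: str) -> float:
    """Token-level F1 between two answer spans (order-insensitive overlap)."""
    tokens_a, tokens_b = span_a.lower().split(), span_b.lower().split()
    common = Counter(tokens_a) & Counter(tokens_b)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(tokens_a), overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)

# e.g. token_f1("because Joey built this chair", "Joey built this chair") ≈ 0.89
```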
23. The Corpus: R1
• Ambiguous questions that led to bad answers are observed
• The guidelines are updated to make the questions as explicit as possible
24. The Corpus: R2
• A 6.27% improvement on ITA is observed
• More examples of questions and answer spans are added to the guidelines
25. The Corpus: R3
• Another 2.48% improvement on ITA
• No update is made to the guidelines
26. The Corpus: R4
• A marginal ITA improvement of 0.67% is observed
• Implies that the annotation guidelines have stabilized
27. The Corpus: Question / Answer Pruning
• If a question is revised dramatically, the original question is pruned (21.8% are revised)
• If the answers do not agree, the question and its answers are pruned (13.5% are pruned)
29. The Corpus: Question Types vs. Answer Categories
• 250 questions are randomly sampled
• Demonstrates the diversity of FriendsQA
30. Approach
• Three SOTA systems are selected to represent common approaches
• R-Net: recurrent neural networks with attention mechanisms
• QANet: convolutional neural networks with self-attention
• BERT: deep bidirectional Transformer encoders
33. Approach: BERT
• Pushed all current state-of-the-art scores to another level
• Based on Transformers (attention only); a span-selection sketch follows
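For illustration, a minimal sketch of span-based QA with BERT using the Hugging Face `transformers` library; this is not the authors' exact setup, and the checkpoint name, question, and passage below are assumptions.

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizerFast

# After fine-tuning on the FriendsQA training set, the two output heads score
# every token as a possible answer-span start / end.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Why is Chandler against marriage?"
passage = "u001 Chandler: I just do not believe in marriage. ..."

inputs = tokenizer(question, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
```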
34. Experiments: Model Development
• All dialogues are randomly shuffled and redistributed into training (80%), development (10%), and test (10%) sets
• Each training instance consists of a dialogue, questions, and a single answer to each question
• Utterance IDs are replaced with the actual utterances

Set          Dialogues  Questions  Answers
Training           977      8,535   17,074
Development        122      1,010    2,057
Test               123      1,065    2,131
35. Experiments: Model Development
• Recall that each question can have multiple answers
• Three strategies to generate training instances with a single answer (see the sketch below):
• Select the shortest answer and discard the rest
• Select the longest answer and discard the rest
• If a question Q1 has multiple answers A1 and A2, generate two training instances (Q1, A1) and (Q1, A2) and train on them independently
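A minimal sketch of the three strategies; the function name and data layout are assumptions for illustration.

```python
def make_instances(question, answers, strategy="multi"):
    """Turn a question with several gold answers into single-answer instances."""
    if strategy == "shortest":
        return [(question, min(answers, key=len))]
    if strategy == "longest":
        return [(question, max(answers, key=len))]
    # "multi": one independent training instance per annotated answer
    return [(question, a) for a in answers]

# e.g. make_instances("Q1", ["A1", "A2"]) -> [("Q1", "A1"), ("Q1", "A2")]
```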
39. Experiments: Utterance Match
• Given the nature of multiparty dialogue QA, an utterance match (UM) metric is introduced
• Models are considered powerful if they always look for answers in the correct utterance
• UM mainly checks whether the prediction resides within the same utterance as the gold answer span (see the sketch below)
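A minimal sketch of the UM check, assuming utterance boundaries are available as character offsets into the evidence passage; the representation is an assumption for illustration.

```python
from bisect import bisect_right

def utterance_index(char_offset, utterance_starts):
    """Map a character offset in the passage to the index of its utterance."""
    return bisect_right(utterance_starts, char_offset) - 1

def utterance_match(pred_start, gold_start, utterance_starts):
    """True if the predicted span starts in the same utterance as the gold span."""
    return (utterance_index(pred_start, utterance_starts)
            == utterance_index(gold_start, utterance_starts))

# With utterances starting at offsets [0, 40, 95]:
# utterance_match(50, 60, [0, 40, 95]) -> True (both fall in the second utterance)
```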
40. Experiments: Results
• All experiments are run three times
• Average scores with standard deviations are reported
• BERT and QANet perform better with the multiple-answer strategy
• R-Net performs better with the other strategies
41. Experiments: Results with entity replacement
• Takes advantage of the Character Mining project
• An entity mapping is kept, and all PERSON entities in both the dialogue and the questions are replaced (see the sketch below)
• Plural mentions are handled naively (e.g., "we" → ent0 ent1 ent2)
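A minimal sketch of such a replacement step; the entity IDs and the regular-expression approach are assumptions for illustration, not the authors' exact procedure.

```python
import re

def replace_entities(text, entity_map):
    """entity_map, e.g. {"Joey": "ent0"}, is kept consistent across a scene."""
    for name, ent_id in entity_map.items():
        text = re.sub(rf"\b{re.escape(name)}\b", ent_id, text)
    return text

entity_map = {"Joey": "ent0", "Rachel": "ent1", "Chandler": "ent2"}
replace_entities("Joey asked Rachel about Chandler.", entity_map)
# -> "ent0 asked ent1 about ent2."
# A plural mention like "we" would naively expand to "ent0 ent1 ent2".
```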
42. Experiments: Results based on Q-Type
• where and when questions are mostly factoid, which show the highest performance with UM
• why and how questions require cross-utterance reasoning, leading to worse performance
• who and what questions give a good mixture of proper and common nouns and show moderate performance

Type    Dist.    UM     SM     EM
What    19.70%   77.42  69.39  55.04
Where   18.28%   84.35  78.86  65.93
Who     17.17%   74.12  64.34  55.29
Why     15.76%   60.47  50.03  27.14
How     14.65%   65.52  52.04  32.64
When    14.44%   80.65  65.81  51.98
43. Experiments: Results of Start-of-Utterance prediction
• Predict only the start of the answer utterance (see the sketch below)
• Only 1 output layer is needed: simply report accuracy
• Demonstrates the power of neural networks

Run   SoU Acc.
1     57.23
2     57.62
3     55.25
Avg.  56.70
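A minimal sketch of a single start-position output layer on top of encoder token embeddings (in PyTorch), as opposed to the usual two heads for span start and end; the exact architecture details are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StartOfUtteranceHead(nn.Module):
    """Single linear head that scores each token as the answer-utterance start."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.start_head = nn.Linear(hidden_size, 1)  # one score per token

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_size) from an encoder such as BERT
        return self.start_head(token_embeddings).squeeze(-1)  # (batch, seq_len)

# Accuracy: the fraction of questions whose argmax over seq_len matches the gold start.
```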
45. Error Analysis
• 100 completely mismatched questions are randomly sampled
• Through the analysis, 6 types of errors become evident
46. Error Analysis
• Entity Resolution (28%)
• Paraphrase and Partial Match (20%)
• Cross-Utterance Reasoning (18%)
• Question Bias (17%)
• Noise in Annotation (4%)
• Miscellaneous (13%)
47. Entity Resolution (28%)
Q: What is Chandler’s opinion regarding marriage?
A: Joey thinks… (wrong entity!)
48. Paraphrase and Partial Match (20%)
• Paraphrasing, abstraction, nicknames, etc., referred to somewhere else in the conversation
• Partially correct answers, especially for why and how questions, could be acceptable in practice
• This motivates the evaluation using Utterance Match
49. Cross-Utterance Reasoning (18%)
• This type reveals a universal challenge in understanding human-to-human conversation
• Models must reason across multiple utterances back and forth, especially if a story or an event unfolds gradually, is scattered in different places, and is told by different speakers
50. Question Bias (17%)
• This type occurs when the answer predictions overly rely on the question types.
Q: Why is Chandler against marriage?
A: …because Joey built this chair on his own
• A span starting with "because" is not necessarily the correct answer!
51. Noise in Annotation (4%)
• FriendsQA, although it achieves high inter-annotator agreement, still includes noise caused by wrong spans, ambiguous or unanswerable questions, or typos.
52. Miscellaneous (13%)
• Errors in this category have no apparent cause; it is unclear why the model predicts these answers
• They often seem irrelevant to the questions, so they need more investigation.
53. Conclusion: Contributions
• FriendsQA: an open-domain question answering dataset
• An extensive and comprehensive analysis: validity, difficulty, and diversity
• Three state-of-the-art models are run and compared, showing the dataset's potential
• The error analysis offers an insightful retrospective and makes suggestions for deeper future study
54. Conclusion: Future Work
• The Q-type and error analyses can serve as guidelines to further enhance QA model performance
• Why and how questions should be studied more attentively
• Speaker information could be encoded into the utterance
• Top-k answer prediction: another challenging but tangible task
• Answer existence prediction and an utterance-based model to select utterance candidates