FriendsQA: Open-Domain Question
Answering on TV Show Transcripts
Zhengzhe Yang
Advisor: Dr. Jinho D. Choi
Emory University, Department of Computer Science
Contents Layout
Introduction Background The Corpus
Approach Experiments Conclusion
Introduction
• What is Question Answering?
• A task that challenges a machine’s ability to understand a document
• The learned knowledge is later applied to answer queries
• By completing a blank: Cloze-style
• Selecting from a pool of answer candidates: Multiple choice
• Select an answer span from the document: Span-based
Introduction
• Motivation
• Remarkable results have been reported on numerous datasets, but…
• No multiparty dialogue!
• Wiki articles and News articles
• (non-) fictional stories
• Children’s books
• Multiparty dialogue is the most natural means of communication
Introduction
• FriendsQA: an open-domain Question Answering dataset
• Given a context, the task is to select the answer span, as in the example on the right
Background: Cloze-style Datasets
• CNN/Daily Mail
• Predict PERSON entities in the summary of an article
• Children’s Book Test
• Expand to predict all entities using children’s books
• BookTest
• 60 times larger than CBT
• Who-did-what
• Description sentence and evidence passage from the English Gigaword Corpus
Background: MC Datasets
• MCTest: comprising short fictional stories
• RACE: compiled from English assessments for students aged 12-18
• TQA: compiled from middle school science lessons and textbooks
• SciQ: passages from science exams collected via crowdsourcing
• DREAM: multiparty dialogue passages from English-as-a-foreign-language
Background: Span-based Datasets
• bAbI: infer event descriptions
• WikiQA and SQuAD: Wikipedia
• NewsQA: CNN articles
• MS MARCO: web documents (Bing)
• TriviaQA: from trivia enthusiasts
• CoQA: conversational flow between questioner and answerer
Background: QA Systems
• R-Net
• ReasoNet
• Attention Over Attention Reader
• Reinforced Mnemonic Reader
• Transformer
• MEMEN
• FusionNet
• Stochastic Answer Network
• QANet
• ELMo
• BERT
Background: Character Mining
• The first 4 seasons are annotated for character identification tasks
• Annotations are again extended to plural mentions
• The first 4 seasons are also annotated with fine-grained emotion detection
• All 10 seasons are processed for a cloze-style RC task
Background: FriendsQA vs. Other Dialogue QA
• FriendsQA vs. CoQA
• CoQA aims to answer questions in a one-to-one conversation between a questioner and an answerer
• The evidence passages are still wiki articles
• FriendsQA vs. Cloze-style RC task
• Cloze-style reasoning is less complex compared to span-based QA
• The predictions are limited to PERSON entities
• FriendsQA vs. DREAM
• Multiple choice questions are not ideal for practical QA applications
The Corpus: FriendsQA Dataset
• 1,222 scenes (83 are pruned because of having fewer than 5 utterances)
• All utterances are concatenated together to form an evidence passage
• The task is to find a contiguous answer span from the evidence passage
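A minimal sketch of how a scene could be turned into an evidence passage with a contiguous answer span. The (speaker, text) tuple layout, whitespace tokenization, the helper name build_passage, and the choice to keep speaker names inside the passage are assumptions made for illustration, not the dataset's actual format.

```python
from typing import List, Tuple


def build_passage(utterances: List[Tuple[str, str]]) -> Tuple[List[str], List[int]]:
    """Concatenate all utterances of a scene into one token list.

    Returns the passage tokens and, for every token, the index of the
    utterance it came from (hypothetical layout, for illustration only).
    """
    tokens: List[str] = []
    utterance_ids: List[int] = []
    for uid, (speaker, text) in enumerate(utterances):
        # The speaker name is kept as the first token of its utterance.
        for tok in [speaker] + text.split():
            tokens.append(tok)
            utterance_ids.append(uid)
    return tokens, utterance_ids


scene = [
    ("Joey", "I built this chair on my own"),
    ("Chandler", "That is great"),
]
tokens, utterance_ids = build_passage(scene)

# An answer is a contiguous token span, given as inclusive (start, end) indices.
start, end = 3, 4
answer = " ".join(tokens[start:end + 1])
```

Keeping a token-to-utterance index alongside the passage makes it possible to check later which utterance a predicted span falls in.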
The Corpus: Challenges with entity resolution
• Utterances are spoken by several people and context switching happens more frequently
• The ubiquitous and interchangeable use of pronouns
The Corpus: Challenges with metaphors
• Homophone confusion
• Humor that could be understood by human readers
• Requires outside knowledge; in this case, knowledge of the human body
The Corpus: Challenges with sarcasm
• The use of sarcasm is dominant in Friends to create humorous effects
• The intended meaning is exactly the opposite of the literal reading
The Corpus: Crowdsourcing
• All annotation tasks are conducted on Amazon Mechanical Turk.
• Left panel: the dialogue
• Right panel: text inputs for question generation
• Prior to actual tasks: a quiz to ensure annotators’ understanding of the task and the web interface
The Corpus: Phase 1 –> Question-Answer Generation
• Clear annotation guidelines
• At least 4 of the six question types: {what, when, where, who, why, how}
• Answerable question
• Multiple answers
• However, selected answers must be relevant to the question
• Speaker names and utterance IDs can also be selected
The Corpus: Quality Assurance
• Task can only be submitted after passing all rules
• Are there at least 4 types of questions annotated?
• Does each question have at least one answer span associated with it?
• Does any question have too much string overlap with the original text in the dialogue?
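The three submission rules could be checked automatically along these lines. This is only a sketch: the dictionary layout, the helper names, the token-overlap measure, and the 0.8 threshold are all assumptions, since the slides do not say how overlap was measured.

```python
from typing import Dict, List

QUESTION_TYPES = ("what", "when", "where", "who", "why", "how")


def question_type(question: str) -> str:
    """Return the first wh-word found in the question, or '' if none."""
    for word in question.lower().split():
        word = word.strip("?,.!")
        if word in QUESTION_TYPES:
            return word
    return ""


def overlap_ratio(question: str, dialogue: str) -> float:
    """Fraction of question tokens that also occur in the dialogue."""
    q_tokens = question.lower().split()
    d_tokens = set(dialogue.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)


def passes_rules(qa_pairs: List[Dict], dialogue: str,
                 min_types: int = 4, max_overlap: float = 0.8) -> bool:
    """Apply the three checks; the thresholds are hypothetical."""
    # Rule 1: at least `min_types` distinct question types.
    types = {question_type(p["question"]) for p in qa_pairs} - {""}
    if len(types) < min_types:
        return False
    # Rule 2: every question has at least one answer span.
    if any(not p["answers"] for p in qa_pairs):
        return False
    # Rule 3: no question copies too much of the dialogue text.
    return all(overlap_ratio(p["question"], dialogue) <= max_overlap
               for p in qa_pairs)
```

Running such checks before submission gives annotators immediate feedback, instead of leaving bad questions to be filtered out afterwards.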
The Corpus: Phase 2 –> Verification and Paraphrasing
• Questions generated in Phase 1 are published again without answers
• Annotators are asked to revise the questions if unanswerable or ambiguous
• Annotators are asked to answer the questions
• Annotators are asked to paraphrase the questions
• Additional checking for quality assurance:
• Check if the paraphrased question is an exact copy of the original question
The Corpus: Four Rounds of Annotation
• Four rounds of annotation are conducted before the official annotation tasks
• The F1 score is adopted to evaluate Inter-annotator Agreement (ITA)
The Corpus: R1
• Observed ambiguous questions that led to bad answers
• Update the guidelines to make the questions as explicit as possible
The Corpus: R2
• 6.27% improvement observed on ITA
• Add more examples of questions and answer spans to the guidelines
The Corpus: R3
• Another 2.48% improvement on ITA
• No update is made to the guidelines
The Corpus: R4
• Marginal ITA improvement of 0.67% observed
• Implies that our annotation guidelines are stabilized
The Corpus: Question / Answer Pruning
• If a question is revised dramatically, prune the original question (21.8% are revised)
• If answers do not agree, prune the question and the answer (13.5% are pruned)
The Corpus: Inter-annotator Agreement
After pruning:
• 10,610 questions
• 21,262 answer spans
• ITA: 81.82% / 53.55%
The Corpus: Question Types vs. Answer Categories
• 250 questions are randomly sampled
• Diversity of FriendsQA
Approach
• Three SOTA systems selected to represent common approaches
• R-Net: Recurrent Neural Network with attention mechanisms
• QANet: Convolutional Neural Network with self-attention
• BERT: deep feed-forward neural networks with Transformers
Approach: R-Net
• Recurrent Neural Network based
• Self-matching mechanism
Approach: QANet
• Convolutional Neural Network based
• Dramatic speed-up: data augmentation
Approach: BERT
• Pushed all current state-of-the-art scores to another level
• Transformers (attention only) based
Experiments: Model Development
• All dialogues are randomly shuffled and redistributed into training (80%), development (10%), and test (10%) sets
• Each training instance consists of a dialogue, questions, and a single answer to each question
• Utterance IDs are replaced with the actual utterances
Set          Dialogues  Questions  Answers
Training           977      8,535   17,074
Development        122      1,010    2,057
Test               123      1,065    2,131
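A sketch of an 80/10/10 shuffle-and-split over 1,222 dialogues. The seed, the rounding rule (truncate, with the remainder going to the test set), and the function name are assumptions; this rounding happens to reproduce the dialogue counts in the table, but the slides do not state how the split was actually computed.

```python
import random
from typing import List, Sequence, Tuple


def split_dialogues(dialogues: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle dialogues and split them 80/10/10 (hypothetical procedure)."""
    items = list(dialogues)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.8)   # truncate to a whole number of dialogues
    n_dev = int(len(items) * 0.1)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]    # the remainder goes to the test set
    return train, dev, test


# Integers stand in for the 1,222 scenes.
train, dev, test = split_dialogues(range(1222))
print(len(train), len(dev), len(test))  # 977 122 123
```

Splitting at the dialogue level keeps all questions about one scene in the same set.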
Experiments: Model Development
• Recall that each question could have multiple answers
• Three strategies to generate training instances with a single answer
• Select the shortest answer and discard the rest
• Select the longest answer and discard the rest
• If a question Q1 has multiple answers A1 and A2, generate two training instances (Q1, A1) and (Q1, A2) and train on them independently
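The three strategies can be sketched as a single helper. The function name, the strategy labels, and measuring answer length in characters (rather than tokens) are assumptions made for illustration.

```python
from typing import List, Tuple


def make_instances(question: str, answers: List[str],
                   strategy: str) -> List[Tuple[str, str]]:
    """Turn one multi-answer question into single-answer training instances."""
    if strategy == "shortest":
        return [(question, min(answers, key=len))]
    if strategy == "longest":
        return [(question, max(answers, key=len))]
    if strategy == "multiple":
        # One independent instance per answer: (Q1, A1), (Q1, A2), ...
        return [(question, a) for a in answers]
    raise ValueError(f"unknown strategy: {strategy}")


answers = ["Joey", "Joey and Chandler"]
shortest = make_instances("Q1", answers, "shortest")
longest = make_instances("Q1", answers, "longest")
multiple = make_instances("Q1", answers, "multiple")
```

Only the third strategy keeps every annotated answer, at the cost of showing the model the same question with different targets.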
Experiments: Evaluation Metrics
• Span-based Match
• Exact Match
• Utterance Match
Experiments: Span-based Match
• Each answer is treated as bag-of-words
• Compute macro-average F1 score
• P: Precision
• R: Recall
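A sketch of a bag-of-words F1 in the spirit of the bullets above, with F1 = 2PR / (P + R). Lowercasing, whitespace tokenization, and taking the best score over multiple gold answers are assumptions; the slides only state that answers are treated as bags of words and that the F1 is macro-averaged.

```python
from collections import Counter
from typing import List


def span_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted span and one gold span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def span_match(predictions: List[str], golds: List[List[str]]) -> float:
    """Macro-average over questions, as a percentage."""
    scores = [max(span_f1(p, g) for g in gs)
              for p, gs in zip(predictions, golds)]
    return 100.0 * sum(scores) / len(scores)
```

For example, predicting "this chair" against the gold span "this old chair" gives P = 1, R = 2/3, and F1 = 0.8.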
Experiments: Exact Match
• Check if the prediction and gold answer are the same
• Score is either 1 or 0
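As a sketch, exact match reduces to a string comparison. Normalizing case and whitespace before comparing, and accepting a match against any of several gold answers, are assumptions not stated on the slide.

```python
from typing import List


def exact_match(prediction: str, gold_answers: List[str]) -> int:
    """Return 1 if the prediction equals any gold answer, else 0."""
    normalized = " ".join(prediction.lower().split())
    return int(any(normalized == " ".join(g.lower().split())
                   for g in gold_answers))
```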
Experiments: Utterance Match
• Given the nature of multiparty dialogue QA, utterance match is introduced
• A model is considered powerful if it always looks for answers in the correct utterance
• UM mainly checks if the prediction resides within the same utterance as the gold answer span
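One possible reading of utterance match, sketched under assumptions: spans are inclusive token index pairs, a per-token utterance index is available, and a prediction counts as a match when every utterance it touches is also touched by some gold span. The slides do not give the exact definition, so this is an interpretation rather than the authors' metric.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def utterance_match(pred: Span, golds: List[Span],
                    utterance_ids: List[int]) -> int:
    """Return 1 if the predicted span lies within a gold span's utterance(s)."""
    pred_utts = set(utterance_ids[pred[0]:pred[1] + 1])
    if not pred_utts:
        return 0
    for start, end in golds:
        gold_utts = set(utterance_ids[start:end + 1])
        if pred_utts <= gold_utts:
            return 1
    return 0


# Six tokens spread over three utterances (hypothetical example).
utterance_ids = [0, 0, 0, 1, 1, 2]
same_utterance = utterance_match((0, 1), [(2, 2)], utterance_ids)
other_utterance = utterance_match((3, 4), [(0, 2)], utterance_ids)
```

Under this reading a prediction can score on UM while scoring zero on exact match, which is what makes the metric more lenient.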
Experiments: Results
• All experiments are run three times
• Average score with standard deviation
• BERT and QANet perform better with the multiple-answer strategy
• R-Net performs better with the other strategies
Experiments: Results with replacement
• Take advantage of the Character Mining project
• Keep an entity mapping and replace all PERSON entities in both dialogue and questions
• Plural mentions handled naively (we ent0 ent1 ent2)
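A simplified sketch of the replacement idea. It matches names by string, whereas the actual experiments rely on Character Mining annotations to resolve mentions; the mapping contents and both helper names are hypothetical.

```python
import re
from typing import Dict, List


def replace_entities(text: str, mapping: Dict[str, str]) -> str:
    """Replace every known character name with its entity ID."""
    if not mapping:
        return text
    # Longer names first, so a full name is not split by a shorter one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


def expand_plural(mention: str, referent_ids: List[str]) -> str:
    """Naive plural handling: keep the mention and append all referent IDs."""
    return " ".join([mention] + referent_ids)


mapping = {"Joey": "ent0", "Chandler": "ent1"}
question = replace_entities("Why is Chandler mad at Joey?", mapping)
plural = expand_plural("we", ["ent0", "ent1", "ent2"])
```

The same mapping has to be applied to both the dialogue and the questions so that entity IDs stay consistent between them.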
Experiments: Results based on Q-Type
• where and when questions are mostly factoid and show the highest performance with UM
• why and how questions require cross-utterance reasoning, leading to worse performance
• who and what questions give a good mixture of proper and common nouns and show moderate performance
Type    Dist.     UM      SM      EM
What    19.70%    77.42   69.39   55.04
Where   18.28%    84.35   78.86   65.93
Who     17.17%    74.12   64.34   55.29
Why     15.76%    60.47   50.03   27.14
How     14.65%    65.52   52.04   32.64
When    14.44%    80.65   65.81   51.98
Experiments: Results of Start of utterance
• Predict the start of the utterance
• Only need 1 output layer: simply report accuracy
• Demonstrate the power of neural networks
Run     SoU Acc.
1       57.23
2       57.62
3       55.25
Avg.    56.70
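The reported average can be recomputed from the three runs as a quick check. Using the sample standard deviation for the spread is an assumption; the slides do not say whether the sample or population form was used.

```python
from statistics import mean, stdev

# Start-of-utterance accuracy for the three runs listed above.
sou_runs = [57.23, 57.62, 55.25]

average = mean(sou_runs)
spread = stdev(sou_runs)  # sample standard deviation (assumed variant)

print(f"{average:.2f}")  # 56.70
```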
Experiments: Results with top-k answers
[Figure: Utterance Match, Span Match, and Exact Match scores (y-axis, 45 to 95) plotted against Top-K Answers (x-axis, 1 to 20)]
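One way top-k answers could be scored is to credit a question with the best score among its k highest-ranked candidates. This is a sketch of that assumption, not the evaluation procedure confirmed by the slides; the function names and the example strings are hypothetical.

```python
from typing import Callable, List


def top_k_score(ranked_predictions: List[str], golds: List[str], k: int,
                metric: Callable[[str, str], float]) -> float:
    """Best metric value over the top-k candidates and all gold answers."""
    candidates = ranked_predictions[:k]
    return max((metric(p, g) for p in candidates for g in golds),
               default=0.0)


def exact(prediction: str, gold: str) -> float:
    """Case-insensitive exact match used as the example metric."""
    return float(prediction.lower() == gold.lower())


ranked = ["on the couch", "this chair"]
golds = ["this chair"]
at_1 = top_k_score(ranked, golds, 1, exact)
at_2 = top_k_score(ranked, golds, 2, exact)
```

Under this scheme scores can only stay level or rise as k grows, since a larger k never removes a candidate.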
Error Analysis
• 100 randomly sampled completely mismatched questions
• Through the analysis, 6 types of errors become evident
Error Analysis
• Entity Resolution
• Paraphrase and Partial Match
• Cross-Utterance Reasoning
• Question Bias
• Noise in Annotation
• Miscellaneous
[Pie chart: Entity Resolution 28%, Paraphrase and Partial Match 20%, Cross-Utterance Reasoning 18%, Question Bias 17%, Miscellaneous 13%, Noise in Annotation 4%]
Entity Resolution (28%)
Q: What is Chandler’s opinion regarding marriage?
A: Joey thinks… (wrong entity!)
Paraphrase and Partial Match (20%)
• Paraphrasing, abstraction, nicknames, etc. referred to somewhere else in the conversation.
• Partially correct, especially for why and how questions, which could be acceptable in practice.
• Motivates us to evaluate using Utterance Match.
Cross-Utterance Reasoning (18%)
• This type reveals a universal challenge in understanding human-to-human conversation.
• Requires reasoning across multiple utterances back and forth, especially if a story or an event unfolds gradually, is scattered across different places, and is told by different speakers
Question Bias (17%)
• This type occurs when the answer predictions overly rely on the question types.
Q: Why is Chandler against marriage?
A: …because Joey built this chair on his own
• A span starting with "because" is not necessarily the correct answer!
Noise in Annotation (4%)
• Although FriendsQA shows high inter-annotator agreement, it still includes noise caused by wrong spans, ambiguous or unanswerable questions, or typos.
Miscellaneous (13%)
• Errors in this category have no apparent cause that explains why the model predicts these answers
• The predictions often seem irrelevant to the questions and need more investigation.
Conclusion: Contributions
• FriendsQA: an open-domain question answering dataset
• An extensive and comprehensive analysis: validity, difficulty and diversity
• Three state-of-the-art models are run and compared, showing the dataset’s potential
• Error analysis offers an insightful retrospective and makes suggestions for deeper future study
Conclusion: Future Work
• Q-type and error analysis can serve as guidelines to further enhance QA model performance.
• Why and how questions should be studied more attentively
• Speaker information could be encoded into the utterance
• Top-k answer: another challenging but tangible task
• Answer existence prediction and an utterance-based model to select utterance candidates
Q & A
Thank you!

More Related Content

What's hot

Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problemJaeHo Jang
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language ProcessingSebastian Ruder
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech taggersadakpramodh
 
Question Answering System using machine learning approach
Question Answering System using machine learning approachQuestion Answering System using machine learning approach
Question Answering System using machine learning approachGarima Nanda
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSebastian Ruder
 
Hierarchical Transformer for Early Detection of Alzheimer’s Disease
Hierarchical Transformer for Early Detection of Alzheimer’s DiseaseHierarchical Transformer for Early Detection of Alzheimer’s Disease
Hierarchical Transformer for Early Detection of Alzheimer’s DiseaseJinho Choi
 
Seq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese LanguageSeq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese LanguageJinho Choi
 
Semi supervised approach for word sense disambiguation
Semi supervised approach for word sense disambiguationSemi supervised approach for word sense disambiguation
Semi supervised approach for word sense disambiguationkokanechandrakant
 
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Avkash Chauhan
 
Building a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result DiversificationBuilding a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result DiversificationKe Tao
 
Answer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic QuestionsAnswer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic QuestionsAhmed Magdy Ezzeldin, MSc.
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingSebastian Ruder
 
Natural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyNatural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyRimzim Thube
 
Memory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question AnsweringMemory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question AnsweringAkram El-Korashy
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsLifeng (Aaron) Han
 
Lessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scaleLessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scaleDomonkos Tikk
 

What's hot (20)

DNN Model Interpretability
DNN Model InterpretabilityDNN Model Interpretability
DNN Model Interpretability
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problem
 
Frontiers of Natural Language Processing
Frontiers of Natural Language ProcessingFrontiers of Natural Language Processing
Frontiers of Natural Language Processing
 
Parts of speech tagger
Parts of speech taggerParts of speech tagger
Parts of speech tagger
 
Question Answering System using machine learning approach
Question Answering System using machine learning approachQuestion Answering System using machine learning approach
Question Answering System using machine learning approach
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep Learning
 
Hierarchical Transformer for Early Detection of Alzheimer’s Disease
Hierarchical Transformer for Early Detection of Alzheimer’s DiseaseHierarchical Transformer for Early Detection of Alzheimer’s Disease
Hierarchical Transformer for Early Detection of Alzheimer’s Disease
 
Seq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese LanguageSeq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese Language
 
Semi supervised approach for word sense disambiguation
Semi supervised approach for word sense disambiguationSemi supervised approach for word sense disambiguation
Semi supervised approach for word sense disambiguation
 
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)Creating AnswerBot with Keras and TensorFlow (TensorBeat)
Creating AnswerBot with Keras and TensorFlow (TensorBeat)
 
Building a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result DiversificationBuilding a Microblog Corpus for Search Result Diversification
Building a Microblog Corpus for Search Result Diversification
 
Answer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic QuestionsAnswer Selection and Validation for Arabic Questions
Answer Selection and Validation for Arabic Questions
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
ICS1020 NLP 2020
ICS1020 NLP 2020ICS1020 NLP 2020
ICS1020 NLP 2020
 
Natural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A SurveyNatural Language Processing Advancements By Deep Learning: A Survey
Natural Language Processing Advancements By Deep Learning: A Survey
 
Memory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question AnsweringMemory Networks, Neural Turing Machines, and Question Answering
Memory Networks, Neural Turing Machines, and Question Answering
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methods
 
Arabic question answering ‫‬
Arabic question answering ‫‬Arabic question answering ‫‬
Arabic question answering ‫‬
 
Lessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scaleLessons learnt at building recommendation services at industry scale
Lessons learnt at building recommendation services at industry scale
 

Similar to FriendsQA: Open-domain Question Answering on TV Show Transcripts

Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
Asking Clarifying Questions in Open-Domain Information-Seeking ConversationsAsking Clarifying Questions in Open-Domain Information-Seeking Conversations
Asking Clarifying Questions in Open-Domain Information-Seeking ConversationsMohammad Aliannejadi
 
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...Lucidworks
 
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue
Transformers to Learn Hierarchical Contexts in Multiparty DialogueTransformers to Learn Hierarchical Contexts in Multiparty Dialogue
Transformers to Learn Hierarchical Contexts in Multiparty DialogueJinho Choi
 
Challenging Reading Comprehension on Daily Conversation: Passage Completion o...
Challenging Reading Comprehension on Daily Conversation: Passage Completion o...Challenging Reading Comprehension on Daily Conversation: Passage Completion o...
Challenging Reading Comprehension on Daily Conversation: Passage Completion o...Jinho Choi
 
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14SURF Events
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Classifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer PairsClassifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer PairsJinho Choi
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Hady Elsahar
 
Talk on Ebooks at the NSF BPC/CE21/STEM-C Community Meeting
Talk on Ebooks at the NSF BPC/CE21/STEM-C Community MeetingTalk on Ebooks at the NSF BPC/CE21/STEM-C Community Meeting
Talk on Ebooks at the NSF BPC/CE21/STEM-C Community MeetingMark Guzdial
 
Summarizing discussion threads
Summarizing discussion threadsSummarizing discussion threads
Summarizing discussion threadsLeiden University
 
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural AttentionAttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural AttentionKodaira Tomonori
 
Clean code presentation
Clean code presentationClean code presentation
Clean code presentationBhavin Gandhi
 
TESOL 2016 Integrating and Curating TED talks for EAPs
TESOL 2016 Integrating and Curating TED talks for EAPsTESOL 2016 Integrating and Curating TED talks for EAPs
TESOL 2016 Integrating and Curating TED talks for EAPsINTO Saint Louis University
 
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...Les Perelman
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categoriesWarNik Chow
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introductionananth
 
Addictive links, Keynote talk at WWW 2014 workshop
Addictive links, Keynote talk at WWW 2014 workshopAddictive links, Keynote talk at WWW 2014 workshop
Addictive links, Keynote talk at WWW 2014 workshopPeter Brusilovsky
 

Similar to FriendsQA: Open-domain Question Answering on TV Show Transcripts (20)

Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
Asking Clarifying Questions in Open-Domain Information-Seeking ConversationsAsking Clarifying Questions in Open-Domain Information-Seeking Conversations
Asking Clarifying Questions in Open-Domain Information-Seeking Conversations
 
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
Enriching Solr with Deep Learning for a Question Answering System - Sanket Sh...
 
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue
Transformers to Learn Hierarchical Contexts in Multiparty DialogueTransformers to Learn Hierarchical Contexts in Multiparty Dialogue
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue
 
Challenging Reading Comprehension on Daily Conversation: Passage Completion o...
Challenging Reading Comprehension on Daily Conversation: Passage Completion o...Challenging Reading Comprehension on Daily Conversation: Passage Completion o...
Challenging Reading Comprehension on Daily Conversation: Passage Completion o...
 
Deep learning for NLP
Deep learning for NLPDeep learning for NLP
Deep learning for NLP
 
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14
Keynote Sally Jordan - Computer-based assessment friend or foe? - OWD14
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Classifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer PairsClassifying Non-Referential It for Question Answer Pairs
Classifying Non-Referential It for Question Answer Pairs
 
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis Building Large Arabic Multi-Domain Resources for Sentiment Analysis
Building Large Arabic Multi-Domain Resources for Sentiment Analysis
 
Talk on Ebooks at the NSF BPC/CE21/STEM-C Community Meeting
Talk on Ebooks at the NSF BPC/CE21/STEM-C Community MeetingTalk on Ebooks at the NSF BPC/CE21/STEM-C Community Meeting
Talk on Ebooks at the NSF BPC/CE21/STEM-C Community Meeting
 
Summarizing discussion threads
Summarizing discussion threadsSummarizing discussion threads
Summarizing discussion threads
 
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural AttentionAttSum: Joint Learning of Focusing and Summarization with Neural Attention
AttSum: Joint Learning of Focusing and Summarization with Neural Attention
 
Clean code presentation
Clean code presentationClean code presentation
Clean code presentation
 
TESOL 2016 Integrating and Curating TED talks for EAPs
TESOL 2016 Integrating and Curating TED talks for EAPsTESOL 2016 Integrating and Curating TED talks for EAPs
TESOL 2016 Integrating and Curating TED talks for EAPs
 
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
Artificial Unintelligence:Why and How Automated Essay Scoring Doesn’t Work (m...
 
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
Making it stick
Making it stickMaking it stick
Making it stick
 
Addictive links, Keynote talk at WWW 2014 workshop
Addictive links, Keynote talk at WWW 2014 workshopAddictive links, Keynote talk at WWW 2014 workshop
Addictive links, Keynote talk at WWW 2014 workshop
 
Test specifications and designs session 4
Test specifications and designs  session 4Test specifications and designs  session 4
Test specifications and designs session 4
 

More from Jinho Choi

Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...Jinho Choi
 
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...Jinho Choi
 
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Jinho Choi
 
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...Jinho Choi
 
The Myth of Higher-Order Inference in Coreference Resolution
The Myth of Higher-Order Inference in Coreference ResolutionThe Myth of Higher-Order Inference in Coreference Resolution
The Myth of Higher-Order Inference in Coreference ResolutionJinho Choi
 
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...Jinho Choi
 




FriendsQA: Open-domain Question Answering on TV Show Transcripts

  • 1. FriendsQA: Open-Domain Question Answering on TV Show Transcripts Zhengzhe Yang Advisor: Dr. Jinho D. Choi Emory University, Department of Computer Science
  • 2. Contents Layout • Introduction • Background • The Corpus • Approach • Experiments • Conclusion
  • 3. Introduction • What is Question Answering? • A task that challenges a machine’s ability to understand a document • The learned knowledge is later applied to answer queries • By completing a blank: Cloze-style • By selecting from a pool of answer candidates: Multiple choice • By selecting an answer span from the document: Span-based
  • 4. Introduction • Motivation • Remarkable results have been reported on numerous datasets, but… • No multiparty dialogue! • Wiki articles and news articles • (Non-)fictional stories • Children’s books • Multiparty dialogue is the most natural means of communication
  • 5. Introduction • FriendsQA: an open-domain Question Answering dataset • Given a context, the task is to select the answer span like the example on the right
  • 6. Background: Cloze-style Datasets • CNN/Daily Mail • Predict PERSON entities in the summary of an article • Children’s Book Test • Expands to predicting all entities using children’s books • BookTest • 60 times larger than CBT • Who-did-what • Description sentence and evidence passage from the English Gigaword Corpus
  • 7. Background: MC Datasets • MCTest: comprising short fictional stories • RACE: compiled from English assessments for students aged 12-18 • TQA: compiled from middle school science lessons and textbooks • SciQ: passages from science exams collected via crowdsourcing • DREAM: multiparty dialogue passages from English-as-a-foreign-language exams
  • 8. Background: Span-based Datasets • bAbI: infer event descriptions • WikiQA and SQuAD: Wikipedia • NewsQA: CNN articles • MS MARCO: web documents (Bing) • TriviaQA: from trivia enthusiasts • CoQA: conversational flow between questioner and answerer
  • 9. Background: QA Systems • R-Net • ReasoNet • Attention Over Attention Reader • Reinforced Mnemonic Reader • Transformer • MEMEN • FusionNet • Stochastic Answer Network • QANet • ELMo • BERT
  • 10. Background: Character Mining • The first 4 seasons are annotated for character identification tasks • Annotations are again extended to plural mentions • The first 4 seasons are also annotated with fine-grained emotion detection • All 10 seasons are processed for a cloze-style RC task
  • 11. Background: FriendsQA vs. Other Dialogue QA • FriendsQA vs. CoQA • CoQA aims to answer questions in a one-to-one conversation between a questioner and an answerer • The evidence passages are still wiki articles • FriendsQA vs. Cloze-style RC task • Cloze-style reasoning is less complex than span-based QA • The predictions are limited to PERSON entities • FriendsQA vs. DREAM • Multiple choice questions are not ideal for practical QA applications
  • 12. The Corpus: FriendsQA Dataset • 1,222 scenes (83 are pruned because of having fewer than 5 utterances) • All utterances are concatenated together to form an evidence passage • The task is to find a contiguous answer span from the evidence passage
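The scene pruning and passage construction described on slide 12 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the single-space separator, and the placeholder utterances are assumptions.

```python
from typing import List, Optional

MIN_UTTERANCES = 5  # slide 12: scenes with fewer than 5 utterances are pruned


def build_evidence_passage(utterances: List[str]) -> Optional[str]:
    """Concatenate a scene's utterances into one evidence passage.

    Returns None when the scene is too short and would be pruned.
    """
    if len(utterances) < MIN_UTTERANCES:
        return None
    # Assumption: utterances are joined with a single space.
    return " ".join(u.strip() for u in utterances)


# Placeholder utterances, not taken from the corpus.
scene = ["A: hi", "B: hello", "A: how are you", "B: fine", "A: good"]
passage = build_evidence_passage(scene)
```

The answer span is then a contiguous substring of `passage`.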
  • 13. The Corpus: Challenges with entity resolution • Utterances are spoken by several people and context switching happens more frequently • The ubiquitous and interchangeable use of pronouns
  • 14. The Corpus: Challenges with metaphors • Homophone confusion • Humor that human readers can understand • Requires outside knowledge; in this case, knowledge of the human body
  • 15. The Corpus: Challenges with sarcasm • Sarcasm is used heavily in Friends to create humorous effects • The intended meaning is the exact opposite of the literal reading
  • 16. The Corpus: Crowdsourcing • All annotation tasks are conducted on Amazon Mechanical Turk. • Left panel: the dialogue • Right panel: text inputs for question generation • Prior to actual tasks: a quiz to ensure annotators’ understanding of this task and web interface
  • 19. The Corpus: Phase 1 –> Question-Answer Generation • Clear annotation guidelines • At least 4 of the six question types: {what, when, where, who, why, how} • Questions must be answerable • Multiple answers are allowed • However, selected answers must be relevant to the question • Speaker names and utterance IDs can also be selected
  • 20. The Corpus: Quality Assurance • A task can only be submitted after passing all rules • Are there at least 4 types of questions annotated? • Does each question have at least one answer span associated with it? • Does any question have too much string overlap with the original text in the dialogue?
  • 21. The Corpus: Phase 2 –> Verification and Paraphrasing • Questions generated in Phase 1 are published again without answers • Annotators are asked to revise the questions if unanswerable or ambiguous • Annotators are asked to answer the questions • Annotators are asked to paraphrase the questions • Additional checking for quality assurance: • Check if the paraphrased question is an exact copy of the original
  • 22. The Corpus: Four Rounds of Annotation • Four rounds of annotation are conducted before the official annotation tasks • The F1 score is adopted to evaluate Inter-annotator Agreement (ITA)
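The three submission rules on slide 20 could be checked along these lines. The slides do not say how question types are detected or how much overlap counts as "too much", so the first-question-word heuristic and the 0.8 threshold below are assumptions made for illustration only.

```python
from typing import Dict, List

QUESTION_TYPES = ("what", "when", "where", "who", "why", "how")
MIN_TYPES = 4      # slide 20: at least 4 types of questions
MAX_OVERLAP = 0.8  # hypothetical threshold; the slides give no number


def question_type(question: str) -> str:
    """Return the first question word found, or '' if there is none."""
    for token in question.lower().split():
        word = token.strip("?,.!'\"")
        if word in QUESTION_TYPES:
            return word
    return ""


def overlap_ratio(question: str, dialogue: str) -> float:
    """Fraction of question tokens that also occur in the dialogue."""
    q_tokens = question.lower().split()
    d_tokens = set(dialogue.lower().split())
    if not q_tokens:
        return 0.0
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)


def passes_rules(annotations: List[Dict], dialogue: str) -> bool:
    """Apply the three checks to one annotator's question/answer list."""
    types = {question_type(a["question"]) for a in annotations} - {""}
    if len(types) < MIN_TYPES:
        return False
    if any(not a["answers"] for a in annotations):
        return False
    return all(
        overlap_ratio(a["question"], dialogue) <= MAX_OVERLAP
        for a in annotations
    )
```

Each annotation is assumed to be a dict with a `question` string and an `answers` list; that layout is hypothetical.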
  • 23. The Corpus: R1 • Observed ambiguous questions that led to bad answers • Update the guidelines to make the questions as explicit as possible
  • 24. The Corpus: R2 • 6.27% improvement observed on ITA • Add more examples of questions and answer spans to the guidelines
  • 25. The Corpus: R3 • Another 2.48% improvement on ITA • No update is made to the guidelines.
  • 26. The Corpus: R4 • Marginal ITA improvement of 0.67% observed • Implies that our annotation guidelines are stabilized.
  • 27. The Corpus: Question / Answer Pruning • If a question is revised dramatically, prune the first question (21.8% are revised) • If answers do not agree, prune the question and the answer (13.5% are pruned)
  • 28. The Corpus: Inter-annotator Agreement • After pruning: • 10,610 questions • 21,262 answer spans • ITA: 81.82% / 53.55%
  • 29. The Corpus: Question Types vs. Answer Categories • 250 questions are randomly sampled • Diversity of FriendsQA
  • 30. Approach • Three SOTA systems selected to represent common approaches • R-Net: Recurrent Neural Network with attention mechanisms • QANet: Convolutional Neural Network with self-attention • BERT: deep feed-forward neural networks with Transformers
  • 31. Approach: R-Net • Recurrent Neural Network based • Self-matching Mechanism
  • 32. Approach: QANet • Convolutional Neural Network based • Dramatic speed-up: data augmentation
  • 33. Approach: BERT • Pushed all current state-of-the-art scores to another level • Transformers (Attention Only) based
  • 34. Experiments: Model Development • All dialogues are randomly shuffled and redistributed into the training (80%), development (10%), and test (10%) sets • Each training instance consists of a dialogue, questions, and a single answer to each question • Utterance IDs are replaced with the actual utterances
      Set          Dialogues  Questions  Answers
      Training           977      8,535   17,074
      Development        122      1,010    2,057
      Test               123      1,065    2,131
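A shuffled 80/10/10 split of the 1,222 dialogues can be sketched as below. The seed and the truncating rounding are assumptions; with this rounding the set sizes happen to match the dialogue counts reported on slide 34.

```python
import random
from typing import List, Sequence, Tuple


def split_dialogues(dialogues: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle dialogues and split them 80/10/10 into train/dev/test."""
    items = list(dialogues)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = int(0.8 * len(items))     # truncation is an assumption
    n_dev = int(0.1 * len(items))
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]      # remainder goes to the test set
    return train, dev, test


train, dev, test = split_dialogues(range(1222))
print(len(train), len(dev), len(test))  # prints "977 122 123"
```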
  • 35. Experiments: Model Development • Recall that each question could have multiple answers • Three strategies to generate training instances with a single answer • Select the shortest answer and discard the rest • Select the longest answer and discard the rest • If a question Q1 has multiple answers A1 and A2, generate two training instances (Q1, A1) and (Q1, A2) and train on them independently
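The three strategies on slide 35 amount to the small function below. This is a sketch: the strategy names are made up for illustration, and measuring answer length in characters (rather than tokens) is an assumption.

```python
from typing import List, Tuple


def make_instances(question: str, answers: List[str], strategy: str) -> List[Tuple[str, str]]:
    """Turn one multi-answer question into single-answer training instances."""
    if not answers:
        return []
    if strategy == "shortest":
        # Keep only the shortest answer (length in characters: an assumption).
        return [(question, min(answers, key=len))]
    if strategy == "longest":
        return [(question, max(answers, key=len))]
    if strategy == "multiple":
        # One independent instance per answer: (Q1, A1), (Q1, A2), ...
        return [(question, a) for a in answers]
    raise ValueError(f"unknown strategy: {strategy}")


instances = make_instances("Q1", ["A1", "A2 is longer"], "multiple")
# instances == [("Q1", "A1"), ("Q1", "A2 is longer")]
```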
  • 36. Experiments: Evaluation Metrics • Span-based Match • Exact Match • Utterance Match
  • 37. Experiments: Span-based Match • Each answer is treated as a bag of words • Compute the macro-average F1 score • P: Precision • R: Recall
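A bag-of-words F1 in the spirit of slide 37 can be sketched as follows, using F1 = 2PR / (P + R). Lowercasing, whitespace tokenization, and taking the best score over a question's gold answers are assumptions, not details given in the slides.

```python
from collections import Counter
from typing import List


def span_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def span_match(predictions: List[str], gold_answers: List[List[str]]) -> float:
    """Macro-average: mean over questions of the best F1 against any gold answer.

    Assumes every question has at least one gold answer.
    """
    scores = [
        max(span_f1(pred, gold) for gold in golds)
        for pred, golds in zip(predictions, gold_answers)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, the prediction "the red chair" against the gold span "red chair" has P = 2/3 and R = 1, giving F1 = 0.8.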
  • 38. Experiments: Exact Match • Check if the prediction and gold answer are the same • Score is either 1 or 0
  • 39. Experiments: Utterance Match • Given the nature of multiparty dialogue QA, utterance match is introduced • A model is considered powerful if it consistently looks for answers in the correct utterance • UM mainly checks whether the prediction resides within the same utterance as the gold answer span
  • 40. Experiments: Results • All experiments are run three times • Average score with standard deviation • BERT and QANet perform better with the multiple-answer strategy • R-Net performs better with the other strategies
  • 41. Experiments: Results with replacement • Take advantage of the Character Mining project • Keep an entity mapping and replace all PERSON entities in both dialogue and questions • Plural mentions handled naively (we ent0 ent1 ent2)
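Exact match and utterance match can be sketched over token offsets as below. The span representation, the `utterance_starts` list, and the precise UM rule (both ends of the prediction fall within the utterances covered by the gold span) are assumptions; the slides only say that UM checks whether the prediction resides in the same utterance as the gold answer.

```python
from bisect import bisect_right
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token offsets in the passage


def exact_match(pred: Span, gold: Span) -> int:
    """1 if the predicted span is identical to the gold span, else 0."""
    return int(pred == gold)


def utterance_index(token: int, utterance_starts: List[int]) -> int:
    """Index of the utterance containing a token, given sorted start offsets."""
    return bisect_right(utterance_starts, token) - 1


def utterance_match(pred: Span, gold: Span, utterance_starts: List[int]) -> int:
    """1 if the prediction lies within the utterance(s) of the gold span."""
    first = utterance_index(gold[0], utterance_starts)
    last = utterance_index(gold[1], utterance_starts)
    pred_first = utterance_index(pred[0], utterance_starts)
    pred_last = utterance_index(pred[1], utterance_starts)
    return int(first <= pred_first and pred_last <= last)


# Three utterances starting at tokens 0, 5 and 12 (hypothetical offsets).
starts = [0, 5, 12]
um = utterance_match((9, 11), (6, 8), starts)  # same utterance, different span
em = exact_match((9, 11), (6, 8))
```

Here `um` is 1 while `em` is 0, which is the partially correct case UM is meant to credit.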
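One way to read the replacement scheme on slide 41 is sketched below: each character gets a stable `entN` identifier, singular mentions are replaced by it, and a plural mention is kept and followed by the identifiers of everyone it refers to, as in "we ent0 ent1 ent2". The input layout (a token list plus a mention map) and the function names are hypothetical; the mention annotations themselves would come from the Character Mining project.

```python
from typing import Dict, List


def entity_id(name: str, mapping: Dict[str, str]) -> str:
    """Return the stable identifier for a character, creating it on first use."""
    if name not in mapping:
        mapping[name] = f"ent{len(mapping)}"
    return mapping[name]


def replace_entities(
    tokens: List[str],
    mentions: Dict[int, List[str]],
    mapping: Dict[str, str],
) -> List[str]:
    """Replace PERSON mentions with entity identifiers.

    mentions maps a token index to the character name(s) it refers to.
    """
    out: List[str] = []
    for i, token in enumerate(tokens):
        names = mentions.get(i)
        if not names:
            out.append(token)
        elif len(names) == 1:
            out.append(entity_id(names[0], mapping))
        else:
            # Naive plural handling: keep the mention, append every referent.
            out.append(token)
            out.extend(entity_id(n, mapping) for n in names)
    return out


mapping: Dict[str, str] = {}
replaced = replace_entities(
    ["we", "saw", "Ross"],
    {0: ["Monica", "Rachel"], 2: ["Ross"]},
    mapping,
)
# replaced == ["we", "ent0", "ent1", "saw", "ent2"]
```

Sharing one `mapping` across the dialogue and its questions keeps the identifiers consistent between the two.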
  • 42. Experiments: Results based on Q-Type • where and when questions are mostly factoid, which show the highest performance with UM • why and how require cross-utterance reasoning, leading to worse performance • who and what questions give a good mixture of proper and common nouns and show moderate performance
      Type    Dist.     UM     SM     EM
      What    19.70%  77.42  69.39  55.04
      Where   18.28%  84.35  78.86  65.93
      Who     17.17%  74.12  64.34  55.29
      Why     15.76%  60.47  50.03  27.14
      How     14.65%  65.52  52.04  32.64
      When    14.44%  80.65  65.81  51.98
  • 43. Experiments: Results of Start of utterance • Predict the start of the utterance • Only need 1 output layer: simply report accuracy • Demonstrate the power of NN
      SoU     Acc.
      1       57.23
      2       57.62
      3       55.25
      Avg.    56.70
  • 44. Experiments: Results with top-k answers • [Figure: Utterance Match, Span Match, and Exact Match scores plotted against Top-K Answers, K = 1 to 20]
  • 45. Error Analysis • 100 randomly sampled completely mismatched questions • Through the analysis, 6 types of errors become evident
  • 46. Error Analysis • Entity Resolution: 28% • Paraphrase and Partial Match: 20% • Cross-Utterance Reasoning: 18% • Question Bias: 17% • Miscellaneous: 13% • Noise in Annotation: 4%
  • 47. Entity Resolution (28%) Q: What is Chandler’s opinion regarding marriage? A: Joey thinks… (wrong entity!)
  • 48. Paraphrase and Partial Match (20%) • Paraphrasing, abstraction, nicknames, etc. referred to somewhere else in the conversation. • Partially correct, especially for why and how questions, which could be acceptable in practice. • Motivates us to evaluate using Utterance Match.
  • 49. Cross-Utterance Reasoning (18%) • This type reveals a universal challenge in understanding human-to-human conversation. • The model must reason back and forth across multiple utterances, especially if a story or an event unfolds gradually, is scattered across different places, and is told by different speakers
  • 50. Question Bias (17%) • This type occurs when the answer predictions overly rely on the question types. • Q: Why is Chandler against marriage? • A: …because Joey built this chair on his own • “Because” does not necessarily signal the correct answer!
  • 51. Noise in Annotation (4%) • Although FriendsQA has high inter-annotator agreement, it still includes noise caused by wrong spans, ambiguous or unanswerable questions, or typos.
  • 52. Miscellaneous (13%) • Errors in this category have no apparent cause that explains why the model predicts these answers • The predictions often seem irrelevant to the questions, so they need more investigation.
  • 53. Conclusion: Contributions • FriendsQA: an open-domain question answering dataset • An extensive and comprehensive analysis: validity, difficulty and diversity • Three state-of-the-art models are run and compared, showing the dataset’s potential • Error analysis offers an insightful retrospective and makes suggestions for deeper future study
  • 54. Conclusion: Future Work • Q-type and error analysis can serve as guidelines to further enhance the QA model performance. • Why and how questions should be studied more attentively • Speaker information could be encoded into the utterance • Top-k answer: another challenging but tangible task • Answer existence prediction and an utterance-based model to select utterance candidates
  • 55. Q & A Thank you!