Improving Question Answering by
Bridging Linguistic Structures with
Statistical Learning
Tomasz Jurczyk
Advisor: Jinho D. Choi
Emory University
11/02/2017
PhD Dissertation Defense
6
“Questions vs. Queries in Informational Search Tasks”, Ryen W. White et al., WWW 2015
http://www.internetlivestats.com/google-search-statistics/
Research
Goal
Improve various aspects of question
answering by combining linguistic
structures with statistical learning and
constructing abstract text representations,
and address the challenges of applying
them to cross-genre tasks.
11
Research Contributions
19
Sentence-based Factoid Question Answering (2016)
◎ A multi-stage annotation scheme: for sentence-based factoid question answering using a crowdsourcing technique
◎ Exploration of neural architectures for FQA: convolutional neural networks for sentence-based FQA
◎ A subtree matching mechanism: for measuring contextual similarity between two sentences
◎ Combining multiple QA corpora: improving the performance of QA systems by cross-using multiple sets
Non-factoid Question Answering (2015-2016)
◎ A semantics-based graph: an abstract representation applied to arithmetic question answering
◎ Multi-field structural decomposition: for event-based question answering
Applications to Cross-genre Tasks (2016-2017)
◎ Document retrieval for cross-genre texts: structure matching for conversational and formal writings
◎ A multi-gram attention CNN: for the passage completion task on conversational dialog texts
1.
Sentence-based
Factoid Question
Answering
20
Answering questions about
concise, well-known facts
What is Sentence-based Question
Answering?
Given a question and a list of sentences,
rank or classify them by how likely each
one answers, or supports the answer to,
the question.
21
Example question and its candidates
22
Tasks in sentence-based question answering

Answer Sentence Selection: a ranking problem. Rerank the sentences by how likely they support the question. Metrics: MAP (mean average precision) and MRR (mean reciprocal rank: the multiplicative inverse of the rank of the first correct answer).

Answer Triggering: a classification/ranking problem. Decide whether the answer is present among the sentence candidates at all. Metrics: precision and recall (F1 score).
23
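The two ranking metrics above can be computed directly from per-question relevance labels; a minimal sketch (illustrative helper names, not code from the dissertation):

```python
def mean_reciprocal_rank(rankings):
    """MRR: average of 1/rank of the first relevant candidate per question."""
    total = 0.0
    for labels in rankings:  # labels: 0/1 relevance in ranked order
        for rank, relevant in enumerate(labels, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

def mean_average_precision(rankings):
    """MAP: mean over questions of the average precision at each relevant hit."""
    total = 0.0
    for labels in rankings:
        hits, precisions = 0, []
        for rank, relevant in enumerate(labels, start=1):
            if relevant:
                hits += 1
                precisions.append(hits / rank)
        total += sum(precisions) / max(hits, 1)
    return total / len(rankings)
```

With a single relevant candidate per question, as in answer sentence selection, MAP and MRR coincide.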
24
How to build
a scalable,
diverse, and
challenging
dataset?
Building a sentence-based factoid question
answering corpus
25
Diverse and deliberately
challenging datasets
are needed to train
statistical models.
However, access to real
search-engine user
queries (Google, Bing,
etc.) is almost
impossible.
SelQA - a dataset built using a multi-stage
crowdsourcing annotation scheme
26
Crowdsourced
Crowdsourcing
techniques used to
build the dataset
Scalable size
Can be used on a big
scale
Low-cost
Cost-effective due to
quality control
Quality control
Poor-quality
annotations are
rejected
Diverse
Data sources come
from multiple
domains
Challenging
Semantically difficult
due to paraphrase
step
SelQA: A New Benchmark for Selection-Based Question Answering, Jurczyk
et al., ICTAI’2016
The annotation process
27
1. Sample a data collection (e.g., articles from Wikipedia).
2. Preprocess the collection (sentence segmentation, etc.).
3. Run four annotation tasks on MTurk plus one task using Elasticsearch.
More detailed look at the process
28
An example annotation
29
Annotation summary
30
| Task | Qs | Qm | Qs+m | Ωq | Ωa | Ωf | Time | Credit |
| Task 1 | 1,824 | 154 | 1,978 | 44.99 | 23.65 | 28.88 | 71 sec. | $0.10 |
| Task 2 | 1,828 | 148 | 1,976 | 44.64 | 23.20 | 28.62 | 64 sec. | $0.10 |
| Task 3 | 3,637 | 313 | 3,950 | 38.03 | 19.99 | 24.41 | 41 sec. | $0.08 |
| Task 4 | 682 | 55 | 737 | 31.09 | 19.41 | 21.88 | 54 sec. | $0.08 |
| SelQA | 7,289 | 615 | 7,904 | 40.54 | 21.51 | 26.18 | - | - |
| WikiQA | 1,068 | 174 | 1,242 | 39.31 | 9.82 | 15.03 | - | - |
7,904 questions annotated with their contexts
9% more overlapping words compared to WikiQA,
on average
15% drop in the ratio of overlapping words due to the
paraphrasing step
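The Ω columns in the summary table report word-overlap statistics between questions and their contexts; one plausible minimal sketch of such a measure (the exact definitions in the dissertation may differ):

```python
def overlap_ratio(question, context):
    """Share of question word types that also appear in the context."""
    q_words = set(question.lower().split())
    c_words = set(context.lower().split())
    return len(q_words & c_words) / len(q_words)

# A paraphrased context shares fewer words with the question,
# which is exactly what makes the dataset harder.
literal = overlap_ratio("who led the polish army",
                        "the polish army was led by General Czuma")
paraphrased = overlap_ratio("who led the polish army",
                            "General Czuma commanded the garrison")
```

Here `literal` is 0.8 while `paraphrased` drops to 0.2, mirroring the 15% drop after the paraphrasing step.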
31
32
Context
Matching
Using
Syntactic
Structures
How to
match
contexts?
Even advanced word
matching does not work
for challenging questions
and text collections
33
An example: one sentence supports the question
34
Question: who lead the polish army in the siege of warsaw?
Sentences:
...
1) Despite German radio broadcasts claiming to have captured
Warsaw, the initial enemy attack was repelled and soon
afterwards Warsaw was placed under siege.
2) The siege lasted until September 28, when the Polish garrison,
commanded under General Walerian Czuma, officially
capitulated.
3) The following day approximately 140,000 Polish soldiers and
troops left the city and were taken as prisoners of war.
...
Subtree matching for
contextual semantic
similarity
Dependency grammar
can be used to match
the syntax of two
sentences (a question
and a candidate
sentence) and to
calculate their
semantic similarity
35
Subtree matching example
36
Question: Who lead the polish army in the Siege of Warsaw?
Sentence: The siege lasted until September 28, when the Polish
garrison, commanded under General Walerian Czuma, officially
capitulated.
SelQA: A New Benchmark for Selection-Based Question Answering, Jurczyk
et al., ICTAI’2016
How to use the subtree
matching features?
The subtree
matching features
are combined with a
convolutional neural
network to
demonstrate their
effectiveness in capturing
semantic similarity
37
Implemented architecture
38
Answer Sentence Selection on SelQA
39
| Model | MAP (dev) | MRR (dev) | MAP (eval) | MRR (eval) |
| CNN0: baseline | 84.62 | 85.65 | 83.20 | 84.20 |
| CNN2: avg + emb | 85.70 | 86.67 | 84.66 | 85.68 |
| Santos et al. (2017) | - | - | 87.58 | 88.12 |
| Shen et al. (2017) | - | - | 89.14 | 89.93 |
Answer sentence selection on WikiQA
40
| Model | MAP (dev) | MRR (dev) | MAP (eval) | MRR (eval) |
| CNN0: baseline | 69.93 | 70.66 | 65.62 | 66.46 |
| CNN2: avg + emb (2016) | 69.22 | 70.18 | 68.78 | 70.82 |
| Yang et al. (2015) | - | - | 65.20 | 66.52 |
| Santos et al. (2016) | - | - | 68.86 | 69.57 |
| Miao et al. (2016) | - | - | 68.86 | 70.69 |
| Yin et al. (2016) | - | - | 69.21 | 71.08 |
| Wang et al. (2016) | - | - | 70.58 | 72.26 |
| Wang et al. (2017) | - | - | 73.41 | 74.18 |
Answer triggering on SelQA
41
| Model | P (dev) | R (dev) | F1 (dev) | P (eval) | R (eval) | F1 (eval) |
| CNN0: baseline | 50.63 | 40.60 | 45.07 | 52.10 | 40.34 | 45.47 |
| CNN2: max + emb | 49.32 | 48.99 | 49.16 | 53.69 | 48.38 | 50.89 |
Answer triggering on WikiQA
42
| Model | P (dev) | R (dev) | F1 (dev) | P (eval) | R (eval) | F1 (eval) |
| CNN0: baseline | 41.86 | 42.86 | 42.35 | 29.70 | 37.45 | 32.73 |
| CNN3: max + emb+ | 44.44 | 44.44 | 44.44 | 29.43 | 48.56 | 36.65 |
| Yang et al. (2015) | - | - | - | 27.96 | 37.86 | 32.17 |
Yang et al. (2015) - - - 27.96 37.86 32.17
6.5% improvement over the state of the art for the
WikiQA dataset in answer sentence selection
F1 = 36.65: new state of the art for answer triggering
on WikiQA
12% improvement over the state of the art for the
SelQA dataset in answer triggering
43
How to
combine
multiple
question
answering
corpora?
Taking advantage of
multiple QA corpora
Researchers have
independently released
several QA corpora.
The performance of QA
systems could be
improved by combining
them.
45
46
WikiQA SelQA SQuAD InfoboxQA
Source:
Bing
search
queries
Crowdsourced Crowdsourced Crowdsourced
Answer Sentence
Selection:
YES YES YES YES
Answer Triggering: YES YES NO NO
Questions: 1,242 7,904 98,202 15,271
Candidates: 12,153 95,250 496,167 271,038
Candidates/question: 9.79 12.05 5.05 17.75
Analysis of Wikipedia-based Corpora for Question Answering, Jurczyk et al.,
arXiv
How these datasets compare to each other
47
The results on cross-testing the corpora
48
| Trained on | WikiQA MAP/MRR/F1 | SelQA MAP/MRR/F1 | SQuAD MAP/MRR/F1 |
| WikiQA | 65.54 / 67.41 / 13.33 | 53.47 / 54.12 / 8.68 | 73.16 / 73.72 / 11.26 |
| SelQA | 49.05 / 49.64 / 24.30 | 82.72 / 83.70 / 48.66 | 77.22 / 78.04 / 44.70 |
| SQuAD | 58.17 / 58.53 / 19.35 | 81.15 / 82.27 / 42.88 | 88.84 / 89.69 / 44.93 |
| W+S+Q | 56.40 / 56.51 / - | 83.19 / 84.25 / - | 88.78 / 89.65 / - |
| ALL | 60.19 / 60.68 / - | 82.88 / 83.97 / - | 88.92 / 89.79 / - |
Higher accuracy of answer triggering on WikiQA
when trained on SelQA
Higher accuracy on SQuAD when trained on
combined datasets
Faster convergence for SQuAD when trained on
SelQA, with almost identical performance
49
2.
Non-factoid
Question
Answering
50
An umbrella term for questions
that are not factoid
What is non-factoid question
answering?
As an umbrella term, it covers
a wide spectrum of tasks
such as recommendation,
arithmetic, visual, and
community-based question
answering.
It often requires more
complex and customized
approaches.
51
Solving
Elementary-
School Level
Arithmetic
Questions
How do we solve the following problems?
53
Question
A restaurant served 9
pizzas during lunch and 6
during dinner today. How
many pizzas were served
today?
Sara has 31 red and 15
green balloons. Sandy has
24 red balloons. How many
red balloons do they have
in total?
… most likely, we construct an equation and
solve it
54
| Question | Equation | Answer |
| A restaurant served 9 pizzas during lunch and 6 during dinner today. How many pizzas were served today? | x = 9 + 6 | x = 15 |
| Sara has 31 red and 15 green balloons. Sandy has 24 red balloons. How many red balloons do they have in total? | x = 31 + 24 | x = 55 |
Application to arithmetic questions
55
Sequence classification
This task can be seen as a
sequence classification of verb
polarities
Three verb classes
Each verb can be either
positive (+), negative (-) or
neutral (0)
Linear equation formed
Once all polarities are
classified, the equation is
formed
Semantics-based graph
Used to extract syntactic
and semantic features for verb
classification
Natural language processing tasks
56
Semantics-based Graph Approach to Complex Question Answering,
Jurczyk et al., NAACL-SRW’2015
Natural language processing tasks are used to
build a graph
57
The flow in the system
58
… as an example
59
The results on the AllenAI dataset
60
| Model | Accuracy |
| This work (2015) | 71.75% |
| Roy et al. (2014) | 64.00% |
| Hosseini et al. (2015) | 77.70% |
| Roy et al. (2016) | 78.00% |
~6% lower than the previous approach
Successfully applied the semantics-based graph to
non-factoid question answering
But does not require extra annotation of verb
polarities
61
62
Structure
Decomposition
for Story-based
Question
Answering
How does event-based question answering look?
63
| Sentence ID | Text | Support |
| 1 | Fred picked up the football there. | |
| 2 | Fred gave the football to Jeff. | |
| 3 | What did Fred give to Jeff? | 2 |
| 4 | Bill went back to the bathroom. | |
| 5 | Jeff grabbed the milk there. | |
| 6 | Who gave the football to Jeff? | 2 |
Hybrid system for event-based question
answering
64
NLP/IR solution
A good mix of natural
language processing and
information retrieval
Three groups of fields
Lexical, syntactic, and
semantic representations of
text are extracted
Lucene-based engine
A Lucene-based search engine
is used to index the extracted
fields
Event-based QA eval.
The approach is evaluated on
a non-factoid question
answering task
Multi-Field Structural Decomposition for Question Answering, Jurczyk et al.
arXiv
… as the flow of execution
65
Example decomposition for incoming
document
66
Results on the bAbI dataset
67
| Type | Lexical (λ=1) MAP/MRR | Lexical (λ learned) MAP/MRR | +Syntax (λ=1) MAP/MRR | +Syntax (λ learned) MAP/MRR | +Semantics (λ=1) MAP/MRR | +Semantics (λ learned) MAP/MRR |
| task 1 | 39.62 / 61.73 | 39.62 / 61.73 | 29.90 / 48.05 | 40.50 / 61.47 | 72.60 / 85.07 | 100.0 / 100.0 |
| task 5 | 37.10 / 54.00 | 38.20 / 54.70 | 48.00 / 62.15 | 48.40 / 62.25 | 72.60 / 82.65 | 94.20 / 96.33 |
| task i | … | … | … | … | … | … |
| Avg. | 44.45 / 61.25 | 44.63 / 61.37 | 45.16 / 60.34 | 48.41 / 63.76 | 59.60 / 73.70 | 85.16 / 90.47 |
3.
Applications for
Cross-genre
Tasks
68
Document retrieval and passage
completion cross-genre tasks
69
Document
Retrieval for
Conversational
and Formal
Writings
What is the cross-genre
document retrieval task?
Given a description (query) and a list of
conversational scripts (documents),
retrieve the scripts that are relevant to
(support) this description
70
Documents: ‘Friends’ scripts
71
◎ 10 seasons of Friends
◎ 1 season = ~24 episodes
◎ 1 episode = ~14 scenes
◎ 1 scene = ~20 utterances
◎ 1 utterance = speaker + utterance text
A slice from a scene
Rachel How does going to a strip club help him better?
Ross Because there are naked ladies there.
Joey
Which helps him get to Phase Three, picturing yourself
with other women.
Ross There are naked ladies there too.
Joey Yeah.
72
Descriptions: episode summaries & plots
73
◎ Summary: a one-paragraph episode summary
◎ Plot: a more detailed episode description
◎ ~5,000 sentence descriptions in total
Description examples
74
Dialogue Summary + Plot
Joey
One woman? That’s like saying there’s
only one flavor of ice cream for you.
Lemme tell you something, Ross. There’s
lots of flavors out there.
Joey compares
women to ice cream
Ross
You know you probably didn’t know this,
but back in the high school, I had, a, um,
major crush on you.
Ross reveals his high
school crush on
Rachel
Rachel I knew.
Chandler
Alright, one of you give me your
underpants.
Chandler asks Joey
for his underwear,
but Joey can’t help
him out as he’s not
wearing any
Joey Can’t help you, I’m not wearing any.
Elasticsearch - first results
75
| k | R@k (dev) | MRR (dev) | R@k (eval) | MRR (eval) |
| 1 | 46.00 | 46.00 | 47.64 | 47.64 |
| 2 | 65.80 | 53.80 | 69.26 | 55.79 |
| 5 | 72.60 | 54.71 | 74.66 | 56.53 |
| 10 | 78.80 | 55.13 | 79.73 | 56.91 |
| 20 | 83.80 | 55.31 | 84.80 | 57.08 |
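R@k in the table above counts how often the relevant script appears among the top k retrieved; a minimal sketch (illustrative names, not the evaluation code itself):

```python
def recall_at_k(ranked_ids, gold_id, k):
    """R@k for one query: 1 if the relevant document is in the top k, else 0."""
    return int(gold_id in ranked_ids[:k])

def mean_recall_at_k(results, k):
    """Average R@k over (ranked_ids, gold_id) query results."""
    return sum(recall_at_k(ids, gold, k) for ids, gold in results) / len(results)

# Two toy queries: the first gold episode is ranked only at position 2
results = [(["e07", "e01", "e05"], "e01"),
           (["e02", "e09", "e04"], "e02")]
```

Averaged over all queries and multiplied by 100, this gives the percentages reported in the table.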
Structure extraction for conversational and
formal writings
76
“Chandler: Alright, one of you give me your underpants”
Cross-genre Document Retrieval: Matching between Conversational and
Formal Writings, Jurczyk et al.,
BLGNLP 2017 (during EMNLP)
How does the structure matching work
77
Description → Extract structures → Match against the indexed scripts’ structures → Retrieve
Experimental results
78
| Model | R@1 (dev) | MRR (dev) | R@1 (eval) | MRR (eval) |
| Elastic1 | 46.00 | 54.71 | 47.64 | 56.53 |
| Struct_w | 34.80 | 45.75 | 35.47 | 47.40 |
| Struct_l | 35.60 | 46.86 | 39.53 | 50.84 |
| Struct_m | 33.80 | 45.10 | 35.98 | 47.76 |
but...
The structure matching is
capable of locating ~15% of the
descriptions that Elasticsearch
cannot
79
Two-stage classification retrieval
80
Experimental results
81
| Model | R@1 (dev) | MRR (dev) | R@1 (eval) | MRR (eval) |
| Elastic10 | 46.00 | 54.71 | 47.64 | 56.53 |
| Struct_l | 35.60 | 46.86 | 39.53 | 50.84 |
| Rerank1 | 48.20 | 56.02 | 51.86 | 59.46 |
| Rerank_λ | 51.20 | 57.74 | 52.03 | 59.84 |
47.64% initial R@1 achieved by Elasticsearch
But structure matching is based on a single
utterance; this will be improved
9.2% improvement when the structure extraction
features were used
82
83
Passage
Completion
for
Cross-genre
Texts
Passage completion is reading
comprehension
84
Cross-genre
It is a cross-genre task on the
conversational data
Reading Comprehension
It benchmarks the ability to
read and comprehend
natural language
Entity-based
It is based on the entity
prediction given a query and a
passage
Towards QA
As a reading comprehension
task, it is a crucial step toward
future question answering
An existing PC task: the CNN/Daily Mail dataset
85
Passage completion in conversational dialogs
86
Approach: a multi-gram convolutional
neural network with attention
87
Experimental results
89
| Model | Accuracy |
| Baseline1 (entity majority) | 27.30 |
| Baseline2 (word-distance) | 27.26 |
| Linguistic approach (L2R, 2016) | 51.16 |
| Bi-LSTM + attention (2017) | 69.26 (test: 62.52%) |
| Multi-gram CNN | 57.43 |
| Multi-gram CNN + attn_dot | 63.58 |
Latest approach: stacked-multi-gram
90
Experimental results
91
| Model | Accuracy |
| Baseline1 (entity majority) | 27.52 |
| Baseline2 (word-distance) | 27.67 |
| Linguistic approach (L2R) | 47.36 |
| Bi-LSTM + attention | 69.26 |
| Multi-gram CNN | 57.43 |
| Multi-gram CNN + attn_dot | 63.58 |
| Utterance-based multi-gram CNN + attn | 66.59 |
66.59% best score so far
But more robust when CNN + Bi-LSTM is used
(accuracy stays more stable as the number of
utterances increases)
~5% lower than Bi-LSTM
92
Thanks to the
committee members
and audience!
93
Thanks to Jinho for
mentoring me and
believing in me!
94
Thanks to Emory NLP Lab!
95
Thanks to friends!
96
Thanks to friends!
97
Thanks to friends!
98
Thanks to my family!
99
The process of the scheme
101
1. ~500 articles are uniformly sampled from Wikipedia from
the following topics: Arts, Country, Food, Historical Events,
Movies, Music, Science, Sports, Travel, TV.
2. Sections that have more than 2 and fewer than 26 sentences
are selected.
3. These sections are used in the annotation scheme and are
sent to annotators.
4. Four tasks are performed on Mechanical Turk; the fifth task
is performed using Elasticsearch.
The process of subtree matching
102
1. For a question-sentence pair, extract the list of
overlapping words
2. For each overlapping pair, extract their tree slices
3. Perform the matching step on three levels: parents,
siblings, and children
4. Calculate their semantic similarity scores
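The four steps can be sketched on toy head-map dependency trees (a real system would use a dependency parser; the helper names and the exact scoring below are illustrative, not the dissertation's formula):

```python
def tree_slice(tree, word):
    """Parent, siblings, and children of `word` in a head-map tree
    (a dict mapping each word to its head; the root maps to None)."""
    head = tree.get(word)
    return {
        "parent": head,
        "siblings": {w for w, h in tree.items() if h == head and w != word},
        "children": {w for w, h in tree.items() if h == word},
    }

def subtree_match_score(q_tree, s_tree):
    """Average, over overlapping words, of how many of the three
    levels (parent, siblings, children) also match across trees.
    Two roots (parent None) count as a parent match."""
    overlap = set(q_tree) & set(s_tree)
    if not overlap:
        return 0.0
    total = 0.0
    for word in overlap:
        q, s = tree_slice(q_tree, word), tree_slice(s_tree, word)
        matched = (q["parent"] == s["parent"]) \
            + bool(q["siblings"] & s["siblings"]) \
            + bool(q["children"] & s["children"])
        total += matched / 3
    return total / len(overlap)
```

For instance, two sentences that share the words "siege" and "lasted" with partially matching slices would get a score between 0 and 1 rather than a binary overlap count.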
The experimentation setup
103
◎ One of the state-of-the-art convolutional neural
networks introduced by Yu et al. (2014) is used to
evaluate QA corpora.
◎ The model consists of a single convolutional layer, a
max pooling, and then the sigmoid function.
◎ 40 words of question and 40 words of answer are used
as the input to the model.
◎ Original splits provided by the creators are used.
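The described pipeline (one convolutional layer, max pooling, then a sigmoid) can be sketched in pure Python with toy dimensions and random weights; the real model of Yu et al. (2014) learns its filters and a final dense layer rather than summing the pooled values:

```python
import math
import random

random.seed(0)

def conv_maxpool_sigmoid(tokens, filters):
    """One 1-D convolutional layer over token embeddings,
    max pooling over window positions, then a sigmoid score."""
    pooled = []
    for f in filters:                      # f: `width` rows of emb_dim weights
        width = len(f)
        acts = [
            sum(w * x
                for f_row, t_row in zip(f, tokens[i:i + width])
                for w, x in zip(f_row, t_row))
            for i in range(len(tokens) - width + 1)
        ]
        pooled.append(max(acts))           # max pooling over positions
    logit = sum(pooled)                    # stand-in for the final dense layer
    return 1.0 / (1.0 + math.exp(-logit))  # relevance score in (0, 1)

# 40 question words + 40 answer words, toy 8-dimensional embeddings
tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(80)]
filters = [[[random.gauss(0, 0.1) for _ in range(8)] for _ in range(3)]
           for _ in range(4)]
score = conv_maxpool_sigmoid(tokens, filters)
```

The score is interpreted as the probability that the candidate sentence answers the question, which is what the MAP/MRR rankings and the triggering threshold operate on.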
Natural language processing tasks are used to
build a graph
104
| Term | Definition | Example |
| Document | a single document of text | a microblog note or a Wikipedia article |
| Entity | a set of instances referring to the same instance in a given context | “John went to Emory University. He majored in CompSci.” |
| Instance | the atomic-level object in the graph | “John went to Emory University with Jessica.” |
| Predicate-Argument | an argument that completes the meaning of another instance (the predicate) | “This car was sold to Michael two days ago.” |
| Attribute | a modifier of an instance | “Alicia has a black cat.” |
Document retrieval in the cross-genre texts
105
◎ Source text is Friends TV show (scripts), while target
text (queries) are episode descriptions.
◎ For a list of source texts (episodes) and a query
(description), retrieve the source that matches the
description.
◎ Structure extraction is presented that improves the
retrieval performance.
◎ This task is a preliminary step to perform a question
answering on conversational scripts.
Target texts - Show’s descriptions
106
◎ Episode summaries and plots have been crawled from
fan sites.
◎ Summaries are one-paragraph texts, usually of 5-6
sentences that provide a high overview of an episode.
◎ Plots are multi-paragraph texts, usually giving a more
detailed description of an episode.
◎ Each summary and plot was sentence-segmented,
tokenized, and represented as a single query.
◎ The set of over 5,000 queries is used in the
experimentation.
Improving R@1 given R@10
107
◎ Extracted relations are now used to improve R@1
when 10 (k=10) most relevant documents are given.
◎ Two-stage classification setup is presented that uses
the extracted relations.
◎ A feed-forward neural network is trained to decide
whether the top-ranked episode (k=1) should be
returned as the correct one.
◎ If not, the initial ranking from Elasticsearch is paired
with the structure matching scores, and a new
prediction is made.
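A minimal sketch of this two-stage decision, with a stand-in callable for the feed-forward network and a hypothetical interpolation weight `alpha`:

```python
def two_stage_rerank(candidates, accept_top1, alpha=0.5):
    """candidates: (episode_id, es_score, struct_score), in Elasticsearch order.

    Stage 1: `accept_top1` (standing in for the trained feed-forward
    network) decides whether the ES top hit is already correct.
    Stage 2: otherwise, rerank by interpolating ES and structure scores.
    """
    if accept_top1(candidates[0]):
        return candidates[0][0]
    rescored = [(episode, alpha * es + (1 - alpha) * struct)
                for episode, es, struct in candidates]
    return max(rescored, key=lambda pair: pair[1])[0]

candidates = [("ep_1", 0.9, 0.1), ("ep_2", 0.7, 0.9)]
kept = two_stage_rerank(candidates, accept_top1=lambda c: True)
reranked = two_stage_rerank(candidates, accept_top1=lambda c: False)
```

When the classifier rejects the top hit, the structure scores can promote a lower-ranked episode, which is how the Rerank rows improve on Elastic10 in the table.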
Passage completion for cross-genre texts
108
◎ It is very difficult to extract actual answers from the
dialogue data, where an answer might be contained
within a single or many utterances.
◎ The machine must first understand the logic of the
human dialogue; it cannot just “extract” the answer
from the text, it must infer it.
◎ As a proxy task, we first want to tackle a passage
completion task.
◎ A query that consists of one or more entities is tested
against a document text (a news article).
Tackling complexity of arithmetic questions
109
◎ It is relatively easy to develop a question answering
system for a single type of questions.
◎ Arithmetic questions seen on the previous slide
require reasoning on the abstract level.
◎ A semantics-based graph approach is presented that
builds an abstract representation of the text.
Application to arithmetic questions
110
◎ This task is represented as a sequence classification
of verb polarities.
◎ Each verb can be either positive (+), negative (-) or
neutral (0).
◎ Positive/negative verb yields an addition/subtraction
operation with its associated chain node. Neutral is
omitted from the linear equation.
◎ Prediction is made for all recognized verbs in a
sentence, then a linear equation is formed and solved.
◎ Presented graph is applied to extract abstract
features based on text.
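A minimal sketch of the final step, assuming the verbs and their associated quantities have already been classified (the classification itself is what the semantics-based graph features feed):

```python
def solve_arithmetic(classified):
    """Form and solve x = sum(polarity * quantity) from classified verbs.

    classified: (verb, polarity, quantity) triples, where polarity is
    '+', '-', or '0'; neutral verbs drop out of the linear equation.
    """
    sign = {"+": 1, "-": -1, "0": 0}
    return sum(sign[polarity] * qty for _, polarity, qty in classified)

# "A restaurant served 9 pizzas during lunch and 6 during dinner today."
pizzas = solve_arithmetic([("served", "+", 9), ("served", "+", 6)])   # x = 9 + 6

# "Sara has 31 red ... Sandy has 24 red balloons. How many red ... in total?"
balloons = solve_arithmetic([("has", "+", 31), ("has", "+", 24)])     # x = 31 + 24
```

Subtraction works the same way: a verb classified as negative contributes its quantity with a minus sign, and neutral verbs are omitted from the equation entirely.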
Hybrid system for event-based question
answering
111
◎ A solution that is a good mix of natural language
processing and information retrieval is presented.
◎ NLP structures are used to extract lexical, syntactic
and semantic representations of text.
◎ A Lucene-based search engine is used to index the
extracted fields and then to perform document
retrieval over these fields.
◎ The approach is evaluated on the publicly available
bAbI dataset, which consists of 20 tasks, each
representing a different kind of question answering
challenge.
Multi-Field Structural Decomposition for Question Answering, Jurczyk et al., arXiv
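The λ = 1 vs. λ-learned columns in the bAbI results can be read as an interpolation of per-field retrieval scores; a minimal sketch with hypothetical field names and weights:

```python
def interpolated_score(field_scores, lambdas=None):
    """score(d) = sum over fields f of lambda_f * score_f(d);
    lambda_f defaults to 1 (the λ = 1 setting in the results table)."""
    lambdas = lambdas or {}
    return sum(lambdas.get(field, 1.0) * s for field, s in field_scores.items())

# Hypothetical per-field scores for one document
doc = {"lexical": 0.4, "syntax": 0.3, "semantics": 0.8}
uniform = interpolated_score(doc)                              # λ = 1
learned = interpolated_score(doc, {"lexical": 0.2,
                                   "syntax": 0.3,
                                   "semantics": 1.5})          # λ learned
```

Learning the λ weights lets the system emphasize the field group that discriminates best for a given task, which is consistent with the large MAP/MRR gains in the λ-learned columns.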

Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Jinho Choi
 
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Jinho Choi
 
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Jinho Choi
 
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Jinho Choi
 
The Myth of Higher-Order Inference in Coreference Resolution
The Myth of Higher-Order Inference in Coreference ResolutionThe Myth of Higher-Order Inference in Coreference Resolution
The Myth of Higher-Order Inference in Coreference Resolution
Jinho Choi
 
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Jinho Choi
 
Abstract Meaning Representation
Abstract Meaning RepresentationAbstract Meaning Representation
Abstract Meaning Representation
Jinho Choi
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
Jinho Choi
 
CKY Parsing
CKY ParsingCKY Parsing
CKY Parsing
Jinho Choi
 
CS329 - WordNet Similarities
CS329 - WordNet SimilaritiesCS329 - WordNet Similarities
CS329 - WordNet Similarities
Jinho Choi
 
CS329 - Lexical Relations
CS329 - Lexical RelationsCS329 - Lexical Relations
CS329 - Lexical Relations
Jinho Choi
 
Automatic Knowledge Base Expansion for Dialogue Management
Automatic Knowledge Base Expansion for Dialogue ManagementAutomatic Knowledge Base Expansion for Dialogue Management
Automatic Knowledge Base Expansion for Dialogue Management
Jinho Choi
 
Attention is All You Need for AMR Parsing
Attention is All You Need for AMR ParsingAttention is All You Need for AMR Parsing
Attention is All You Need for AMR Parsing
Jinho Choi
 
Graph-to-Text Generation and its Applications to Dialogue
Graph-to-Text Generation and its Applications to DialogueGraph-to-Text Generation and its Applications to Dialogue
Graph-to-Text Generation and its Applications to Dialogue
Jinho Choi
 
Real-time Coreference Resolution for Dialogue Understanding
Real-time Coreference Resolution for Dialogue UnderstandingReal-time Coreference Resolution for Dialogue Understanding
Real-time Coreference Resolution for Dialogue Understanding
Jinho Choi
 
Topological Sort
Topological SortTopological Sort
Topological Sort
Jinho Choi
 
Tries - Put
Tries - PutTries - Put
Tries - Put
Jinho Choi
 
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
Multi-modal Embedding Learning for Early Detection of Alzheimer's DiseaseMulti-modal Embedding Learning for Early Detection of Alzheimer's Disease
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
Jinho Choi
 
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
Building Widely-Interpretable Semantic Networks for Dialogue ContextsBuilding Widely-Interpretable Semantic Networks for Dialogue Contexts
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
Jinho Choi
 
How to make Emora talk about Sports Intelligently
How to make Emora talk about Sports IntelligentlyHow to make Emora talk about Sports Intelligently
How to make Emora talk about Sports Intelligently
Jinho Choi
 

More from Jinho Choi (20)

Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
 
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
Analysis of Hierarchical Multi-Content Text Classification Model on B-SHARP D...
 
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...Competence-Level Prediction and Resume & Job Description Matching Using Conte...
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
 
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-b...
 
The Myth of Higher-Order Inference in Coreference Resolution
The Myth of Higher-Order Inference in Coreference ResolutionThe Myth of Higher-Order Inference in Coreference Resolution
The Myth of Higher-Order Inference in Coreference Resolution
 
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
Noise Pollution in Hospital Readmission Prediction: Long Document Classificat...
 
Abstract Meaning Representation
Abstract Meaning RepresentationAbstract Meaning Representation
Abstract Meaning Representation
 
Semantic Role Labeling
Semantic Role LabelingSemantic Role Labeling
Semantic Role Labeling
 
CKY Parsing
CKY ParsingCKY Parsing
CKY Parsing
 
CS329 - WordNet Similarities
CS329 - WordNet SimilaritiesCS329 - WordNet Similarities
CS329 - WordNet Similarities
 
CS329 - Lexical Relations
CS329 - Lexical RelationsCS329 - Lexical Relations
CS329 - Lexical Relations
 
Automatic Knowledge Base Expansion for Dialogue Management
Automatic Knowledge Base Expansion for Dialogue ManagementAutomatic Knowledge Base Expansion for Dialogue Management
Automatic Knowledge Base Expansion for Dialogue Management
 
Attention is All You Need for AMR Parsing
Attention is All You Need for AMR ParsingAttention is All You Need for AMR Parsing
Attention is All You Need for AMR Parsing
 
Graph-to-Text Generation and its Applications to Dialogue
Graph-to-Text Generation and its Applications to DialogueGraph-to-Text Generation and its Applications to Dialogue
Graph-to-Text Generation and its Applications to Dialogue
 
Real-time Coreference Resolution for Dialogue Understanding
Real-time Coreference Resolution for Dialogue UnderstandingReal-time Coreference Resolution for Dialogue Understanding
Real-time Coreference Resolution for Dialogue Understanding
 
Topological Sort
Topological SortTopological Sort
Topological Sort
 
Tries - Put
Tries - PutTries - Put
Tries - Put
 
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
Multi-modal Embedding Learning for Early Detection of Alzheimer's DiseaseMulti-modal Embedding Learning for Early Detection of Alzheimer's Disease
Multi-modal Embedding Learning for Early Detection of Alzheimer's Disease
 
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
Building Widely-Interpretable Semantic Networks for Dialogue ContextsBuilding Widely-Interpretable Semantic Networks for Dialogue Contexts
Building Widely-Interpretable Semantic Networks for Dialogue Contexts
 
How to make Emora talk about Sports Intelligently
How to make Emora talk about Sports IntelligentlyHow to make Emora talk about Sports Intelligently
How to make Emora talk about Sports Intelligently
 

Recently uploaded

Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 

Recently uploaded (20)

Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 

Improving Question Answering by Bridging Linguistic Structures with Statistical Learning

  • 1. Improving Question Answering by Bridging Linguistic Structures with Statistical Learning Tomasz Jurczyk Advisor: Jinho D. Choi Emory University 11/02/2017 PhD Dissertation Defense
  • 2. Want big impact? Use big image. 2 Image: https://www.psychologicalscience.org/news/minds-business/asking-questions-increases-likability.html
  • 3. Want big impact? Use big image. 3 Image: https://www.psychologicalscience.org/news/minds-business/asking-questions-increases-likability.html Image: http://www.kindynews.com/blog/kids-ask-how-many-questions-per-day
  • 4. Want big impact? Use big image. 4 Image: https://www.psychologicalscience.org/news/minds-business/asking-questions-increases-likability.html Image: https://www.shutterstock.com/video/clip-11021852-stock-footage-hispanic-woman-reading-laying-on-the-floor-of-the-library-k.html
  • 5. Want big impact? Use big image. 5 Image: https://www.psychologicalscience.org/news/minds-business/asking-questions-increases-likability.html Image: https://autoshopsolutions.com/top-10-questions-ask-web-design-internet-marketing-company/
  • 6. 6 “Questions vs. Queries in Informational Search Tasks”, Ryen W. White et al., WWW 2015 http://www.internetlivestats.com/google-search-statistics/
  • 7. 7
  • 8. 8
  • 9. 9
  • 10. 10
  • 11. Research Goal Improve various aspects of question answering by combining linguistic structures with statistical learning and by constructing abstract text representations. Address the challenges of applying these methods to cross-genre tasks. 11
  • 12. Research Contributions 12 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique
  • 13. Research Contributions 13 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA
  • 14. Research Contributions 14 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique A subtree matching mechanism For measuring contextual similarity between two sentences Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA
  • 15. Research Contributions 15 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique A subtree matching mechanism For measuring contextual similarity between two sentences Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA Combining multiple QA corpora Improving the performance of QA systems by cross-using multiple sets
  • 16. Research Contributions 16 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique A subtree matching mechanism For measuring contextual similarity between two sentences Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA Combining multiple QA corpora Improving the performance of QA systems by cross-using multiple sets A semantics-based graph Abstract representation applied on arithmetic question answering Sentence-based Factoid Question Answering (2016)
  • 17. Research Contributions 17 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique A subtree matching mechanism For measuring contextual similarity between two sentences Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA Combining multiple QA corpora Improving the performance of QA systems by cross-using multiple sets A semantics-based graph Abstract representation applied on arithmetic question answering Multi-field structural decomposition For event-based question answering Sentence-based Factoid Question Answering (2016)
  • 18. Research Contributions 18 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique A subtree matching mechanism For measuring contextual similarity between two sentences Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA Combining multiple QA corpora Improving the performance of QA systems by cross-using multiple sets A semantics-based graph Abstract representation applied on arithmetic question answering Multi-field structural decomposition For event-based question answering Document retrieval task for Cross-genre Structure Matching for conversation and formal writings Non-factoid Question Answering (2015-2016) Sentence-based Factoid Question Answering (2016)
  • 19. Research Contributions 19 A multi-stage annotation scheme For sentence-based factoid question answering using crowdsourcing technique A subtree matching mechanism For measuring contextual similarity between two sentences Exploration of neural architectures for FQA Convolutional neural networks for sentence-based FQA Combining multiple QA corpora Improving the performance of QA systems by cross-using multiple sets A semantics-based graph Abstract representation applied on arithmetic question answering Multi-field structural decomposition For event-based question answering A multi-gram attention CNN For passage completion task for conversational dialog texts Document retrieval task for Cross-genre Structure Matching for conversation and formal writings Non-factoid Question Answering Applications to Cross-genre Tasks (2016-2017) Sentence-based Factoid Question Answering (2016) (2015-2016)
  • 21. What is Sentence-based Question Answering? Given a question and a list of sentences, reorder or classify the sentences with respect to how likely each one answers, or supports the answer to, the question. 21
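The ranking formulation above can be illustrated with a toy baseline. This is a hypothetical bag-of-words overlap scorer, not the model used in this work; real systems replace the overlap count with a learned relevance model.

```python
def rank_candidates(question, sentences):
    """Rank candidate sentences by lexical overlap with the question.

    Deliberately naive: scores each candidate by the number of question
    words it shares, then sorts candidates by descending score.
    """
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(s.lower().split())), s) for s in sentences]
    # Sort by descending overlap; ties keep their original order.
    return [s for _, s in sorted(scored, key=lambda t: -t[0])]

ranked = rank_candidates(
    "who commanded the polish garrison",
    ["The siege lasted two weeks.",
     "The Polish garrison was commanded by General Czuma."])
# ranked[0] is the supporting sentence
```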
  • 22. Example question and its candidates 22
  • 23. Tasks in sentence-based question answering Answer Sentence Selection Answer Triggering A ranking problem A classification/ranking problem Rerank sentences with respect to how likely they support the question Decide whether the answer is among the sentence candidates MRR - Mean Reciprocal Rank (multiplicative inverse of the rank of the first correct answer) MAP - Mean Average Precision Precision and Recall (F1 score) 23
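The evaluation metrics named on this slide can be computed directly. A minimal sketch, where each question's candidates are represented as a score-sorted list of 0/1 relevance labels:

```python
def mean_reciprocal_rank(ranked_labels):
    """MRR: mean over questions of 1/rank of the first correct candidate."""
    total = 0.0
    for labels in ranked_labels:  # labels sorted by model score, 1 = correct
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

def mean_average_precision(ranked_labels):
    """MAP: mean over questions of precision averaged at each correct candidate."""
    total = 0.0
    for labels in ranked_labels:
        hits, precisions = 0, []
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
        if precisions:
            total += sum(precisions) / len(precisions)
    return total / len(ranked_labels)

# Two questions: first correct answers at ranks 2 and 1 -> (1/2 + 1) / 2
mrr = mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]])  # 0.75
```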
  • 24. 24 How to build scalable, diverse, and challenging datasets Image: https://www.meraevents.com/event/How-To-Write-and-Publish-a-Book-
  • 25. Building a sentence-based factoid question answering corpus 25 Diverse and artificially challenging datasets are needed to train statistical models. However, access to real search engine user queries (Google, Bing, etc.) is almost impossible.
  • 26. SelQA - a dataset built using a multi-stage crowdsourcing annotation scheme 26 Crowdsourced Crowdsourcing techniques used to build the dataset Scalable size Can be used at a large scale Low-cost Cost-effective due to quality control Quality control Poor-quality annotations are rejected Diverse Data sources come from multiple domains Challenging Semantically difficult due to the paraphrasing step SelQA: A New Benchmark for Selection-Based Question Answering, Jurczyk et al., ICTAI’2016
  • 27. The annotation process 27 Sample data collection (e.g., articles from Wikipedia) → Preprocess the collection (sentence segmentation, etc.) → 4 annotation tasks on MTurk + 1 task using Elasticsearch
  • 28. More detailed look at the process 28
  • 30. Annotation summary 30
          Qs     Qm   Qs+m   Ωq     Ωa     Ωf     Time     Credit
  Task 1  1,824  154  1,978  44.99  23.65  28.88  71 sec.  $0.10
  Task 2  1,828  148  1,976  44.64  23.20  28.62  64 sec.  $0.10
  Task 3  3,637  313  3,950  38.03  19.99  24.41  41 sec.  $0.08
  Task 4  682    55   737    31.09  19.41  21.88  54 sec.  $0.08
  SelQA   7,289  615  7,904  40.54  21.51  26.18  -        -
  WikiQA  1,068  174  1,242  39.31  9.82   15.03  -        -
  • 31. 7,904 questions annotated with their contexts. 9% more overlapping words compared to WikiQA, on average. 15% drop in the ratio of overlapping words due to the paraphrasing step. 31
  • 33. How to match contexts? Even advanced word matching does not work for complex questions and text collections. 33
  • 34. An example: one sentence supports the question example 34 Question: who lead the polish army in the siege of warsaw? Sentences: ... 1) Despite German radio broadcasts claiming to have captured Warsaw, the initial enemy attack was repelled and soon afterwards Warsaw was placed under siege. 2) The siege lasted until September 28, when the Polish garrison, commanded under General Walerian Czuma, officially capitulated. 3) The following day approximately 140,000 Polish soldiers and troops left the city and were taken as prisoners of war. ...
  • 35. Subtree matching for contextual semantic similarity The dependency grammar can be used to match the syntax of two sentences (a question and a sentence candidate) and to calculate their semantic similarity 35
  • 36. Subtree matching example 36 Question: Who lead the polish army in the Siege of Warsaw? Sentence: The siege lasted until September 28, when the Polish garrison, commanded under General Walerian Czuma, officially capitulated. SelQA: A New Benchmark for Selection-Based Question Answering, Jurczyk et al., ICTAI’2016
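The matching intuition can be sketched as follows. The edge triples and the scoring function below are hypothetical simplifications, shown only to make the idea concrete; the dissertation's actual subtree matching operates on full dependency subtrees.

```python
# Each dependency tree is represented as an edge list of
# (head_lemma, relation, dependent_lemma) triples.

def subtree_match_score(question_edges, sentence_edges):
    """Fraction of question edges whose (relation, dependent lemma)
    pair also occurs in the candidate sentence's dependency tree."""
    sent_pairs = {(rel, dep) for _, rel, dep in sentence_edges}
    if not question_edges:
        return 0.0
    hits = sum((rel, dep) in sent_pairs for _, rel, dep in question_edges)
    return hits / len(question_edges)

# Hand-written (illustrative) edges for the slide's example pair.
question = [("lead", "nsubj", "who"), ("lead", "dobj", "army"),
            ("army", "amod", "polish"), ("lead", "nmod", "siege")]
sentence = [("capitulate", "nsubj", "garrison"), ("garrison", "amod", "polish"),
            ("command", "nmod", "general"), ("last", "nsubj", "siege")]
score = subtree_match_score(question, sentence)  # only ("amod", "polish") matches
```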
  • 37. How to use the subtree matching features? The subtree matching features are combined with a convolutional neural network to demonstrate their effectiveness in capturing semantic similarity. 37
  • 39. Answer Sentence Selection on SelQA 39
  Model                 Dev MAP  Dev MRR  Eval MAP  Eval MRR
  CNN0: baseline        84.62    85.65    83.20     84.20
  CNN2: avg + emb       85.70    86.67    84.66     85.68
  Santos et al. (2017)  -        -        87.58     88.12
  Shen et al. (2017)    -        -        89.14     89.93
  • 40. Answer sentence selection on WikiQA 40
  Model                   Dev MAP  Dev MRR  Eval MAP  Eval MRR
  CNN0: baseline          69.93    70.66    65.62     66.46
  CNN2: avg + emb (2016)  69.22    70.18    68.78     70.82
  Yang et al. (2015)      -        -        65.20     66.52
  Santos et al. (2016)    -        -        68.86     69.57
  Miao et al. (2016)      -        -        68.86     70.69
  Yin et al. (2016)       -        -        69.21     71.08
  Wang et al. (2016)      -        -        70.58     72.26
  Wang et al. (2017)      -        -        73.41     74.18
  • 41. Answer triggering on SelQA 41
  Model            Dev P  Dev R  Dev F1  Eval P  Eval R  Eval F1
  CNN0: baseline   50.63  40.60  45.07   52.10   40.34   45.47
  CNN2: max + emb  49.32  48.99  49.16   53.69   48.38   50.89
  • 42. Answer triggering on WikiQA 42
  Model               Dev P  Dev R  Dev F1  Eval P  Eval R  Eval F1
  CNN0: baseline      41.86  42.86  42.35   29.70   37.45   32.73
  CNN3: max + emb+    44.44  44.44  44.44   29.43   48.56   36.65
  Yang et al. (2015)  -      -      -       27.96   37.86   32.17
  • 43. 6.5% improvement over the state of the art for the WikiQA dataset in answer sentence selection. F1@36.65: new state of the art for answer triggering on WikiQA. 12% improvement over the state of the art for the SelQA dataset in answer triggering. 43
  • 45. Taking advantage of multiple QA corpora Researchers have independently released several QA corpora. The performance of QA systems could be improved by combining them. 45
  • 46. 46
                             WikiQA               SelQA         SQuAD         InfoboxQA
  Source                     Bing search queries  Crowdsourced  Crowdsourced  Crowdsourced
  Answer Sentence Selection  YES                  YES           YES           YES
  Answer Triggering          YES                  YES           NO            NO
  Questions                  1,242                7,904         98,202        15,271
  Candidates                 12,153               95,250        496,167       271,038
  Candidates/question        9.79                 12.05         5.05          17.75
  Analysis of Wikipedia-based Corpora for Question Answering, Jurczyk et al., arXiv
  • 47. How do these datasets compare to each other?
  • 48. The results on cross-testing the corpora (MAP / MRR / F1)

  Trained on | Evaluated on WikiQA   | Evaluated on SelQA    | Evaluated on SQuAD
  WikiQA     | 65.54 / 67.41 / 13.33 | 53.47 / 54.12 / 8.68  | 73.16 / 73.72 / 11.26
  SelQA      | 49.05 / 49.64 / 24.30 | 82.72 / 83.70 / 48.66 | 77.22 / 78.04 / 44.70
  SQuAD      | 58.17 / 58.53 / 19.35 | 81.15 / 82.27 / 42.88 | 88.84 / 89.69 / 44.93
  W+S+Q      | 56.40 / 56.51 / -     | 83.19 / 84.25 / -     | 88.78 / 89.65 / -
  ALL        | 60.19 / 60.68 / -     | 82.88 / 83.97 / -     | 88.92 / 89.79 / -
  • 49.
  Higher: accuracy of answer triggering on WikiQA when trained on SelQA
  Higher: accuracy on SQuAD when trained on combined datasets
  Faster: convergence for SQuAD when trained on SelQA, with almost identical performance
  • 50. 2. Non-factoid Question Answering 50 An umbrella term for questions that are not factoid
  • 51. What is non-factoid question answering? As an umbrella term, it covers a wide spectrum of tasks such as recommendation, arithmetic, visual, and community-based question answering. It often requires more complex and customized approaches.
  • 53. How do we solve the following problems?

  Question: A restaurant served 9 pizzas during lunch and 6 during dinner today. How many pizzas were served today?
  Question: Sara has 31 red and 15 green balloons. Sandy has 24 red balloons. How many red balloons do they have in total?
  • 54. Most likely, we construct an equation and solve it.

  Question: A restaurant served 9 pizzas during lunch and 6 during dinner today. How many pizzas were served today?
  Equation: x = 9 + 6; Answer: x = 15

  Question: Sara has 31 red and 15 green balloons. Sandy has 24 red balloons. How many red balloons do they have in total?
  Equation: x = 31 + 24; Answer: x = 55
  • 55. Application to arithmetic questions

  Sequence classification: this task can be seen as a sequence classification of verb polarities.
  Three verb classes: each verb can be either positive (+), negative (-), or neutral (0).
  Linear equation formed: once all polarities are classified, the equation is formed.
  Semantics-based graph: used to extract syntactic and semantic features for verb classification.
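The pipeline above can be sketched in a few lines. The polarity labels here are hard-coded stand-ins for the classifier's output, and the equation builder only handles the simple single-unknown case shown on the previous slide:

```python
# Sketch of forming a linear equation from classified verb polarities.
# Quantities and polarities are assumed inputs; a real system would get
# the polarities from the trained sequence classifier.

def form_equation(quantities, polarities):
    """Combine quantities into x = q1 (+/-) q2 ... based on verb polarity."""
    total, terms = 0, []
    for qty, pol in zip(quantities, polarities):
        if pol == "+":
            total += qty
            terms.append(f"+ {qty}")
        elif pol == "-":
            total -= qty
            terms.append(f"- {qty}")
        # neutral ("0") verbs are omitted from the linear equation
    equation = "x = " + " ".join(terms).lstrip("+ ")
    return equation, total

# "A restaurant served 9 pizzas during lunch and 6 during dinner."
# Both occurrences of "served" are classified positive (+).
eq, x = form_equation([9, 6], ["+", "+"])
print(eq, "->", x)   # x = 9 + 6 -> 15
```

For the balloon question, the 15 green balloons receive a neutral label and drop out of the equation, giving x = 31 + 24 = 55 as on the previous slide.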
  • 56. Natural language processing tasks 56 Semantics-based Graph Approach to Complex Question Answering, Jurczyk et al., NAACL-SRW’2015
  • 57. Natural language processing tasks are used to build a graph 57
  • 58. The flow in the system 58
  • 59. … as an example 59
  • 60. The results on the AllenAI dataset

  Model                  | Accuracy
  This work (2015)       | 71.75%
  Roy et al. (2014)      | 64.00%
  Hosseini et al. (2015) | 77.70%
  Roy et al. (2016)      | 78.00%
  • 61.
  ~6%: lower than the previous approach
  Successfully: applied the semantics-based graph to non-factoid question answering
  But: does not require extra annotation of verb polarities
  • 63. How does event-based question answering look?

  ID | Text                                | Support
  1  | Fred picked up the football there.  |
  2  | Fred gave the football to Jeff.     |
  3  | What did Fred give to Jeff?         | 2
  4  | Bill went back to the bathroom.     |
  5  | Jeff grabbed the milk there.        |
  6  | Who gave the football to Jeff?      | 2
  • 64. Hybrid system for event-based question answering

  NLP/IR solution: a good mix of natural language processing and information retrieval.
  Three groups of fields: lexical, syntactic, and semantic representations of text are extracted.
  Lucene-based engine: a Lucene-based search engine is used to index the extracted fields.
  Event-based QA evaluation: the approach is evaluated on a non-factoid question answering task.

  Multi-Field Structural Decomposition for Question Answering, Jurczyk et al., arXiv
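The field-based scoring idea can be sketched as follows. The field contents, the token-overlap scorer, and the weights are illustrative stand-ins for the actual Lucene fields and ranking function; only the overall shape (one score per field, combined with per-field weights) reflects the slide:

```python
# Toy multi-field scorer: each document and query is decomposed into
# lexical/syntactic/semantic token sets, and the final score is a
# lambda-weighted sum of per-field overlaps (lambda = 1 matches the
# unweighted setting in the results table).

def field_score(query, doc, weights):
    """Weighted token overlap across the three field groups."""
    total = 0.0
    for field, lam in weights.items():
        q, d = query.get(field, set()), doc.get(field, set())
        if q:
            total += lam * len(q & d) / len(q)
    return total

# Hand-written stand-ins for extracted fields of one indexed sentence.
doc = {
    "lexical":   {"fred", "gave", "football", "jeff"},
    "syntactic": {"nsubj:fred", "dobj:football"},
    "semantic":  {"arg0:fred", "arg1:football", "arg2:jeff"},
}
query = {
    "lexical":   {"who", "gave", "football", "jeff"},
    "syntactic": {"dobj:football"},
    "semantic":  {"arg1:football", "arg2:jeff"},
}
weights = {"lexical": 1.0, "syntactic": 1.0, "semantic": 1.0}  # lambda = 1
print(round(field_score(query, doc, weights), 2))   # 2.75
```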
  • 65. … as the flow of execution 65
  • 66. Example decomposition for incoming document 66
  • 67. Results on the bAbI dataset (MAP / MRR)

  Type  | Lexical, λ=1  | Lexical, λ learned | +Syntax, λ=1  | +Syntax, λ learned | +Semantics, λ=1 | +Semantics, λ learned
  task1 | 39.62 / 61.73 | 39.62 / 61.73      | 29.90 / 48.05 | 40.50 / 61.47      | 72.60 / 85.07   | 100.0 / 100.0
  task5 | 37.10 / 54.00 | 38.20 / 54.70      | 48.00 / 62.15 | 48.40 / 62.25      | 72.60 / 82.65   | 94.20 / 96.33
  …
  Avg.  | 44.45 / 61.25 | 44.63 / 61.37      | 45.16 / 60.34 | 48.41 / 63.76      | 59.60 / 73.70   | 85.16 / 90.47
  • 68. 3. Applications for Cross-genre Tasks 68 Document retrieval and passage completion cross-genre tasks
  • 70. What is the cross-genre document retrieval task? Given a description (query) and a list of conversational scripts (documents), retrieve the scripts that are relevant to (support) this description.
  • 71. Documents: ‘Friends’ scripts

  10 seasons of Friends
  1 season = ~24 episodes
  1 episode = ~14 scenes
  1 scene = ~20 utterances
  1 utterance = speaker + utterance
  • 72. A slice from a scene

  Rachel: How does going to a strip club help him better?
  Ross: Because there are naked ladies there.
  Joey: Which helps him get to Phase Three, picturing yourself with other women.
  Ross: There are naked ladies there too.
  Joey: Yeah.
  • 73. Descriptions: episode summaries & plots

  Summary: one-paragraph episode summary
  Plot: more detailed episode description
  ~5,000 sentence descriptions
  • 74. Description examples

  Dialogue: Joey: “One woman? That’s like saying there’s only one flavor of ice cream for you. Lemme tell you something, Ross. There’s lots of flavors out there.”
  Summary + Plot: Joey compares women to ice cream

  Dialogue: Ross: “You know you probably didn’t know this, but back in the high school, I had, a, um, major crush on you.” Rachel: “I knew.”
  Summary + Plot: Ross reveals his high school crush on Rachel

  Dialogue: Chandler: “Alright, one of you give me your underpants.” Joey: “Can’t help you, I’m not wearing any.”
  Summary + Plot: Chandler asks Joey for his underwear, but Joey can’t help him out as he’s not wearing any
  • 75. Elasticsearch: first results

  k  | Dev R@k | Dev MRR | Eval R@k | Eval MRR
  1  | 46.00   | 46.00   | 47.64    | 47.64
  2  | 65.80   | 53.80   | 69.26    | 55.79
  5  | 72.60   | 54.71   | 74.66    | 56.53
  10 | 78.80   | 55.13   | 79.73    | 56.91
  20 | 83.80   | 55.31   | 84.80    | 57.08
  • 76. Structure extraction for conversational and formal writings 76 “Chandler: Alright, one of you give me your underpants” Cross-genre Document Retrieval: Matching between Conversational and Formal Writings, Jurczyk et al., BLGNLP 2017 (during EMNLP)
  • 77. How does the structure matching work? Structures are extracted from the description, matched against the indexed scripts’ structures, and the best-matching script is retrieved.
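A minimal sketch of the extract-match-retrieve loop, assuming structures are reduced to (subject, verb, object) triples. The triples and the episode index are hand-written stand-ins for parser output, and the overlap count stands in for the real matching score:

```python
# Toy structure-matching retrieval: scripts are pre-indexed as sets of
# (subject, verb, object) triples; a description is reduced to the same
# form and scored by triple overlap.

indexed = {
    "episode_1": [("chandler", "give", "underpants"), ("joey", "wear", "any")],
    "episode_2": [("ross", "reveal", "crush")],
}

def retrieve(description_triples, index):
    """Return the episode whose indexed structures best match the query."""
    scores = {ep: len(set(description_triples) & set(triples))
              for ep, triples in index.items()}
    return max(scores, key=scores.get)

# Structure extracted from "Chandler asks Joey for his underwear ..."
query = [("chandler", "give", "underpants")]
print(retrieve(query, indexed))   # episode_1
```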
  • 78. Experimental results

  Model    | Dev R@1 | Dev MRR | Eval R@1 | Eval MRR
  Elastic1 | 46.00   | 54.71   | 47.64    | 56.53
  Structw  | 34.80   | 45.75   | 35.47    | 47.40
  Structl  | 35.60   | 46.86   | 39.53    | 50.84
  Structm  | 33.80   | 45.10   | 35.98    | 47.76
  • 79. But... structure matching is capable of locating ~15% of the descriptions that Elasticsearch cannot.
  • 81. Experimental results

  Model     | Dev R@1 | Dev MRR | Eval R@1 | Eval MRR
  Elastic10 | 46.00   | 54.71   | 47.64    | 56.53
  Structl   | 35.60   | 46.86   | 39.53    | 50.84
  Rerank1   | 48.20   | 56.02   | 51.86    | 59.46
  Rerankλ   | 51.20   | 57.74   | 52.03    | 59.84
  • 82.
  47.64%: initial R@1 achieved by Elasticsearch
  But: structure matching is based on a single utterance; this will be improved
  9.2%: improvement when the structure extraction features were used
  • 84. Passage completion is reading comprehension

  Cross-genre: it is a cross-genre task on conversational data.
  Reading comprehension: it benchmarks the ability to read and comprehend natural language.
  Entity-based: it is based on entity prediction given a query and a passage.
  Towards QA: as reading comprehension, it will be crucial for future question answering.
  • 85. Already existing PC task: CNN/Daily News 85
  • 86. Passage completion in conversational dialogs 86
  • 87. Approach: a multi-gram convolutional neural network with attention 87
  • 88. Approach: a multi-gram convolutional neural network with attention 88
  • 89. Experimental results

  Model                           | Accuracy
  Baseline1 (entity majority)     | 27.30
  Baseline2 (word-distance)       | 27.26
  Linguistic approach (L2R, 2016) | 51.16
  Bi-LSTM + attention (2017)      | 69.26 (test: 62.52%)
  Multi-gram CNN                  | 57.43
  Multi-gram CNN + attn_dot       | 63.58
  • 91. Experimental results

  Model                                 | Accuracy
  Baseline1 (entity majority)           | 27.52
  Baseline2 (word-distance)             | 27.67
  Linguistic approach (L2R)             | 47.36
  Bi-LSTM + attention                   | 69.26
  Multi-gram CNN                        | 57.43
  Multi-gram CNN + attn_dot             | 63.58
  Utterance-based multi-gram CNN + attn | 66.59
  • 92.
  66.59%: best score so far
  But: more robust when CNN + Bi-LSTM is used (accuracy stays more stable as the number of utterances increases)
  ~5%: lower than Bi-LSTM
  • 93. Thanks to the committee members and audience! 93
  • 94. Thanks to Jinho for mentoring me and believing in me! 94
  • 95. Thanks to Emory NLP Lab! 95
  • 99. Thanks to my family! 99
  • 100. Research Contributions

  Sentence-based Factoid Question Answering (2016):
  - A multi-stage annotation scheme: for sentence-based factoid question answering using a crowdsourcing technique
  - Exploration of neural architectures for FQA: convolutional neural networks for sentence-based FQA
  - A subtree matching mechanism: for measuring contextual similarity between two sentences
  - Combining multiple QA corpora: improving the performance of QA systems by cross-using multiple sets

  Non-factoid Question Answering (2015-2016):
  - A semantics-based graph: abstract representation applied to arithmetic question answering
  - Multi-field structural decomposition: for event-based question answering

  Applications to Cross-genre Tasks (2016-2017):
  - Structure matching: cross-genre document retrieval between conversational and formal writings
  - A multi-gram attention CNN: for the passage completion task on conversational dialog texts
  • 101. The process of the scheme

  1. ~500 articles are uniformly sampled from Wikipedia from the following topics: Arts, Country, Food, Historical Events, Movies, Music, Science, Sports, Travel, TV.
  2. Sections that have more than 2 and fewer than 26 sentences are selected.
  3. These sections are used in the annotation scheme and are sent to annotators.
  4. Four tasks are performed on Mechanical Turk; the fifth task is performed using Elasticsearch.
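Step 2 of the scheme can be sketched as a simple length filter. The naive period-based splitter below is a stand-in for the real sentence segmenter:

```python
# Toy filter for step 2: keep only sections whose sentence count falls
# strictly between 2 and 26. Splitting on "." stands in for a proper
# sentence segmenter.

def eligible_sections(sections):
    keep = []
    for text in sections:
        n = len([s for s in text.split(".") if s.strip()])
        if 2 < n < 26:
            keep.append(text)
    return keep

sections = ["One. Two.", "One. Two. Three.", "x. " * 30]
print(len(eligible_sections(sections)))   # only the 3-sentence section passes
```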
  • 102. The process of subtree matching

  1. For a question-sentence pair, extract a list of overlapping words.
  2. For each overlapping pair, extract their tree slices.
  3. Perform the matching step on three levels: parents, siblings, and children.
  4. Calculate their semantic similarity scores.
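The four steps can be sketched as follows. The toy trees and the overlap-based score stand in for real dependency trees and the embedding-based similarity used in the dissertation:

```python
# Simplified subtree matching: for an overlapping word, take its tree
# slice (parent, siblings, children) from both the question tree and the
# sentence tree, then compare the slices on all three levels.

def tree_slice(tree, word):
    """Step 2: extract the parent, siblings, and children of a word."""
    node = tree[word]
    return {"parent": node["parent"],
            "siblings": node["siblings"],
            "children": node["children"]}

def match_score(slice_q, slice_s):
    """Steps 3-4: toy similarity = parent match + shared siblings/children."""
    score = 0.0
    score += slice_q["parent"] == slice_s["parent"]
    score += len(set(slice_q["siblings"]) & set(slice_s["siblings"]))
    score += len(set(slice_q["children"]) & set(slice_s["children"]))
    return score

# Overlapping word "gave" in a question tree and a candidate-sentence tree.
q_tree = {"gave": {"parent": None, "siblings": [], "children": ["fred", "football"]}}
s_tree = {"gave": {"parent": None, "siblings": [], "children": ["fred", "football", "jeff"]}}
print(match_score(tree_slice(q_tree, "gave"), tree_slice(s_tree, "gave")))  # 3.0
```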
  • 103. The experimentation setup 103 ◎ One of the state-of-the-art convolutional neural networks introduced by Yu et al. (2014) is used to evaluate QA corpora. ◎ The model consists of a single convolutional layer, a max pooling, and then the sigmoid function. ◎ 40 words of question and 40 words of answer are used as the input to the model. ◎ Original splits provided by the creators are used.
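For illustration, the layer stack described above (one convolutional layer, max pooling, a sigmoid output) can be sketched with numpy for a single 40-token input. All dimensions, the random weights, and the single-input simplification are made up; this is not the trained model of Yu et al. (2014):

```python
# Toy forward pass: convolution over token embeddings -> max pooling
# over positions -> sigmoid score. Weights are random, for shape only.
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, width = 40, 50, 100, 3   # illustrative sizes

def conv_max_sigmoid(x, filters, w_out, b_out):
    # x: (seq_len, emb_dim); filters: (n_filters, width, emb_dim)
    feats = []
    for f in filters:
        acts = [np.sum(x[i:i + width] * f) for i in range(seq_len - width + 1)]
        feats.append(max(acts))                      # max pooling over positions
    feats = np.array(feats)
    return 1.0 / (1.0 + np.exp(-(feats @ w_out + b_out)))  # sigmoid score

x = rng.standard_normal((seq_len, emb_dim))           # 40 padded tokens
filters = rng.standard_normal((n_filters, width, emb_dim)) * 0.01
score = conv_max_sigmoid(x, filters, rng.standard_normal(n_filters) * 0.01, 0.0)
print(0.0 < score < 1.0)   # True: a probability-like relevance score
```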
  • 104. Natural language processing tasks are used to build a graph

  Term               | Definition                                                                          | Example
  Document           | a single document of text                                                           | microblog note or Wikipedia article
  Entity             | a set of instances referring to the same instance in a given context                | “John went to Emory University. He majored in CompSci.”
  Instance           | the atomic-level object in the graph, usually represented by a modifier of an instance | “John went to Emory University with Jessica.”
  Predicate-Argument | an argument that completes the meaning of another instance (predicate)             | “This car was sold to Michael two days ago.”
  Attribute          | a modifier of an instance                                                           | “Alicia has a black cat.”
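A rough sketch of how these definitions translate into a graph, with instances as nodes and predicate-argument or attribute relations as labeled edges. The relation labels and the dict-based representation are illustrative, not the dissertation's data structure:

```python
# Toy semantics-based graph: instances are nodes, relations are labeled
# edges (head, relation, dependent).

graph = {"nodes": set(), "edges": []}

def add_edge(head, relation, dependent):
    graph["nodes"].update({head, dependent})
    graph["edges"].append((head, relation, dependent))

# "Alicia has a black cat." -> attribute edge on the instance "cat"
add_edge("cat", "attribute", "black")
# "This car was sold to Michael." -> predicate-argument edges on "sold"
add_edge("sold", "argument", "car")
add_edge("sold", "argument", "Michael")

print(sorted(graph["nodes"]))   # ['Michael', 'black', 'car', 'cat', 'sold']
```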
  • 105. Document retrieval in the cross-genre texts
  ◎ The source texts are Friends TV show scripts, while the target texts (queries) are episode descriptions.
  ◎ For a list of source texts (episodes) and a query (description), retrieve the source that matches the description.
  ◎ A structure extraction method is presented that improves retrieval performance.
  ◎ This task is a preliminary step towards question answering on conversational scripts.
  • 106. Target texts: the show’s descriptions
  ◎ Episode summaries and plots have been crawled from fan sites.
  ◎ Summaries are one-paragraph texts, usually of 5-6 sentences, that provide a high-level overview of an episode.
  ◎ Plots are multi-paragraph texts, usually giving a more detailed description of an episode.
  ◎ Each summary and plot was sentence-segmented, tokenized, and represented as a single query.
  ◎ The set of over 5,000 queries is used in the experimentation.
  • 107. Improving R@1 given R@10
  ◎ The extracted relations are now used to improve R@1 when the 10 (k=10) most relevant documents are given.
  ◎ A two-stage classification setup is presented that uses the extracted relations.
  ◎ A feed-forward neural network is trained to decide whether the top-ranked episode (k=1) should be returned as the correct one.
  ◎ If not, the initial ranking from Elasticsearch is paired with structure matching scores, and a new prediction is made.
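The two-stage decision can be sketched as follows. The `accept_top` stub stands in for the feed-forward classifier, and the candidate scores and interpolation weight are made up for illustration:

```python
# Toy two-stage re-ranking: stage 1 asks a classifier whether the
# Elasticsearch top hit should be kept; stage 2 interpolates the
# Elasticsearch and structure-matching scores and re-ranks.

def rerank(candidates, accept_top, lam=0.5):
    # candidates: list of (doc_id, es_score, struct_score), ES-ranked
    if accept_top(candidates[0]):
        return candidates[0][0]          # stage 1: keep the top-1 hit
    combined = [(doc, (1 - lam) * es + lam * st)
                for doc, es, st in candidates]
    return max(combined, key=lambda p: p[1])[0]   # stage 2: re-rank

cands = [("ep3", 0.90, 0.10), ("ep7", 0.85, 0.95), ("ep1", 0.40, 0.20)]
never_accept = lambda top: False   # stub for the feed-forward classifier
print(rerank(cands, never_accept))   # ep7 wins after combining scores
```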
  • 108. Passage completion for cross-genre texts
  ◎ It is very difficult to extract actual answers from dialogue data, where an answer might be contained within a single utterance or spread across many.
  ◎ The machine must first understand the logic of the human dialogue; it cannot just “extract” the answer from the text, it must infer it.
  ◎ As a proxy task, we first want to tackle passage completion.
  ◎ A query that consists of one or more entities is tested against a document text (news article).
  • 109. Tackling the complexity of arithmetic questions
  ◎ It is relatively easy to develop a question answering system for a single type of question.
  ◎ Arithmetic questions, as seen on the previous slide, require reasoning on an abstract level.
  ◎ A semantics-based graph approach is presented that builds an abstract representation of the text.
  • 110. Application to arithmetic questions
  ◎ This task is represented as a sequence classification of verb polarities.
  ◎ Each verb can be either positive (+), negative (-), or neutral (0).
  ◎ A positive/negative verb yields an addition/subtraction operation with its associated chain node; neutral verbs are omitted from the linear equation.
  ◎ A prediction is made for all recognized verbs in a sentence; then a linear equation is formed and solved.
  ◎ The presented graph is applied to extract abstract features from the text.
  • 111. Hybrid system for event-based question answering
  ◎ A solution that is a good mix of natural language processing and information retrieval is presented.
  ◎ NLP structures are used to extract lexical, syntactic, and semantic representations of text.
  ◎ A Lucene-based search engine is used to index the extracted fields and then to perform document retrieval on these fields.
  ◎ The approach is evaluated using the publicly available bAbI dataset, which consists of 20 tasks, where each task represents a different kind of question answering challenge. Multi-Field Structural Decomposition for Question Answering, Jurczyk et al., arXiv