Presenter: Sewon Min (undergraduate, Seoul National University)
Date: August 2017
Sewon Min (민세원) is a student at Seoul National University, majoring in computer science. She did her research at the University of Washington with Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. Her main interest is natural language understanding with a focus on question answering.
Abstract:
To achieve human-level understanding of natural language, it is crucial to analyze carefully what machines currently can and cannot do, and then to consider how to expand their abilities toward the human level. In this talk, I will first describe the current state of machine question answering by analyzing SQuAD, a recently well-studied dataset. Next, focusing on its limitations, I will introduce several promising directions for the next step. Lastly, I will present my work on transfer learning in question answering as one such approach.
2. Sewon Min
- Interested in Natural language understanding
with a focus on question answering
- Background
- Undergraduate in SNU (~2018)
- Research Experience in UW (2016~2017)
- Publication
- Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi, “Neural Speed Reading”.
2017. (Under review)
- Sewon Min, Minjoon Seo, Hannaneh Hajishirzi. “Question Answering through
Transfer Learning from Large Fine-grained Supervision Data”. ACL. 2017.
- Minjoon Seo, Sewon Min, Ali Farhadi, Hannaneh Hajishirzi. “Query-reduction
Networks”. ICLR. 2017.
10. SQuAD
Southern California, often abbreviated SoCal, is a geographic
and cultural region that generally comprises California's
southernmost 10 counties. (…)
What is Southern California often abbreviated as?
Stanford Question Answering Dataset (2016)
12. SQuAD
Stanford Question Answering Dataset (2016)
Models
Match-LSTM (SMU), BiDAF (UW+AI2), DCN (Salesforce),
R-Net (Microsoft), AoA Reader (HIT + iFLYTEK) and many others
Performance
- System: EM 55 F1 68 → EM 78 F1 85
- Human: EM 82 F1 91
More information: https://rajpurkar.github.io/SQuAD-explorer/
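The EM and F1 numbers above follow SQuAD's evaluation protocol: EM checks for an exact string match after light normalization, and F1 measures token overlap between the prediction and the gold answer. A minimal sketch of these two metrics (the normalization shown is an assumption modeled on the official script, not a copy of it):

```python
import re
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation, and drop articles before tokenizing."""
    s = s.lower()
    s = re.sub(r'[^\w\s]', ' ', s)
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return s.split()

def exact_match(pred, gold):
    """EM: 1.0 iff the normalized token sequences are identical."""
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """F1: harmonic mean of token-overlap precision and recall."""
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)   # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For instance, `exact_match("SoCal", "socal")` is 1.0, while `f1("Southern California", "California")` is about 0.67: partial credit that EM would not give.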
13. SQuAD
Why popular?
1. Domain: Context-based, Wikipedia, Real questions
2. Task: Span-based answer
- Closer to real QA than Cloze-style
- Easier to evaluate than Free-form
3. Proper difficulty
14. Who made airbus
WikiQA (context sentence classification):
Airbus SAS is an aircraft manufacturing subsidiary of EADS, a European aerospace
company. Airbus began as a union of aircraft companies …
CNN/Daily Mail (cloze style):
____________ says he understands why @entity0 won’t play at his tournament
... “@entity0 called me personally to let me know that he wouldn’t be playing here at
@entity23,” @entity3 said on his @entity21 events website...
MS Marco (free-form):
What energy is used in photosynthesis?
Photosynthesis is a process used by plants and other organisms to convert light
energy, normally from the Sun, into chemical energy (…)
[light energy] [energy of light] [solar energy] [Light energy is used in photosynthesis]
16. SQuAD
Southern California, often abbreviated SoCal, is a geographic
and cultural region that generally comprises California's
southernmost 10 counties. (…)
What is Southern California often abbreviated as?
What does SoCal stand for?
Demo (BiDAF Model): https://allenai.github.io/bi-att-flow/demo/
19. Contents
Current state in Question Answering
Expansion from current state
Question Answering through transfer learning
(my work)
20. How to expand the task?
1. Small-scale context → Large-scale context
2. Requiring lexical information → Requiring complex reasoning
3. Span-based answer → Free form answer
21. Large-scale context
Longer context: WikiReading, NewsQA
Multiple context: MSMarco, TriviaQA
Open-domain: SearchQA, DrQA
Why challenging?
Cost (Time & Memory)
more information != better performance
No effective and efficient model yet!
Models with hierarchical structure
22. Large-scale context → More data
We have Large amount of data (such as Web data)
Approaches
1. Combination of information retrieval & question answering
2. Unsupervised learning
3. Transfer learning
23. Complex Reasoning
James the Turtle was always getting in trouble. (…) One day, James thought
he would go into town and see what kind of trouble he could get into. He
went to the grocery store and pulled all the pudding off the shelves and ate
two jars. Then he walked to the fast food restaurant and ordered 15 bags of
fries. He didn't pay, and instead headed home. (…)
Where did James go after he went to the grocery store?
A) His deck
B) His freezer
C) A fast food restaurant
D) His room
MCTest
24. Complex Reasoning
MCTest (7 years old)
Science Questions Dataset (Elementary school)
RACE (Middle & High school)
Very difficult, not so popular
Deep learning models have limitations
25. Free-form Answer
MS Marco
1. Annotating a gold answer is difficult
What energy is used in photosynthesis?
Photosynthesis is a process used by plants and other organisms to convert light
energy, normally from the Sun, into chemical energy (…)
[light energy] [energy of light] [solar energy] [Light energy is used in photosynthesis]
27. Free-form Answer
2. Evaluation is difficult
- We want answers that need not appear verbatim in the context.
- We prefer a full sentence to a single word.
- However, such answers are hard to evaluate: current metrics are incomplete (bag-of-words based).
What is the capital city of South Korea?
Gold: The capital city of South Korea is Seoul.
“Seoul.” → 1/8
“The capital city of South Korea is Tokyo.” → 7/8
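The 1/8 and 7/8 scores above come from bag-of-words token overlap. A minimal sketch of such a metric (unigram recall against the gold answer, with naive tokenization) shows how a fluent but wrong answer can outscore a correct short one:

```python
def tokens(s):
    """Naive tokenization: lowercase, drop periods, split on whitespace."""
    return s.lower().replace('.', '').split()

def unigram_recall(pred, gold):
    """Fraction of gold-answer tokens that also appear in the prediction."""
    g = tokens(gold)
    p = set(tokens(pred))
    return sum(1 for t in g if t in p) / len(g)

gold = "The capital city of South Korea is Seoul."
short_correct = unigram_recall("Seoul.", gold)                                  # 1/8
long_wrong = unigram_recall("The capital city of South Korea is Tokyo.", gold)  # 7/8
```

The wrong-capital answer scores seven times higher than the correct one-word answer, which is exactly the incompleteness of bag-of-words metrics the slide points out.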
29. Free-form Answer
3. Designing a generation model is difficult
WikiReading: a property instead of a question
- instance of, gender, country, date of birth, given name, …
Best model’s performance (F1)
- Given name: 88.7
- Date of opening: 30.1
Example (property: country):
Folkart Towers are twin skyscrapers in the Bayrakli district of the Turkish city of Izmir.
Reaching a structural height of 200 m (656 ft) above ground level, (…)
30. Contents
Current state in Question Answering
Expansion from current state
Question Answering through transfer learning
(my work)
31. Transfer learning in QA
“Question Answering through Transfer Learning from Large Fine-
grained Supervision Data”
Background
- Transfer learning is not yet popular in NLP.
- Some previous works report that transfer learning does not help
when the target task differs from the source task.
Our contribution
- Coarser, sentence-level QA can benefit from transferring a model
trained on large, span-level QA data.
32. Transfer learning in QA
Source: SQuAD (span-level QA, Wikipedia domain)
Targets:
- WikiQA (sentence-level QA, Wikipedia domain)
- SemEval-2016 (sentence-level QA, community QA)
- SICK (RTE)
33. Transfer learning in QA
WikiQA
Q Who made airbus
C1 Airbus SAS is an aircraft manufacturing subsidiary of EADS, a European aerospace company.
C2 Airbus began as a union of aircraft companies.
C3 Aerospace companies allowed the establishment of a joint-stock company, owned by EADS.
A C1(Yes), C2(No), C3(No)
SemEval2016-task3A
Q I saw an ad, data entry jobs online. It required we give a fee and they promise fixed amount
every month. Is this a scam?
C1 well probably is so i be more careful if i were u. Why you looking for online jobs
C2 SCAM!!!!!!!!!!!!!!!!!!!!!!
C3 Bcoz i got a baby and iam nt intrested to sent him in a day care. thats y iam (...)
A C1(Good), C2(Good), C3(Bad)
34. Transfer learning in QA
BiDAF (source): Context + Query → Embedding layer → Attention layer →
Modelling layer → Output layer 1 (Start) / Output layer 2 (End)
BiDAF-T (target): Context + Query → Embedding layer → Attention layer →
Modelling layer → Pooling + classification → Class
BiDAF outputs the start and end positions of the answer span.
BiDAF-T outputs a classification result.
The layers below the output are transferred from BiDAF to BiDAF-T.
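A minimal NumPy sketch of this transfer setup (layer shapes and function names here are illustrative stand-ins, not the actual BiDAF implementation): the lower layers are shared, the span model scores start/end positions, and the classification model swaps those heads for pooling plus a sigmoid classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, h = 6, 4, 5  # context length, input dim, hidden dim (toy sizes)

def shared_encoder(x, W):
    """Stand-in for BiDAF's lower layers (embedding/attention/modelling), shared in transfer."""
    return np.tanh(x @ W)                      # (T, d) -> (T, h)

def span_heads(m, w_start, w_end):
    """BiDAF: two output layers score every position as answer start / end."""
    return m @ w_start, m @ w_end              # two (T,) logit vectors

def classification_head(m, w_cls):
    """BiDAF-T: max-pool the modelling output over time, then one sigmoid score."""
    pooled = m.max(axis=0)                     # (h,)
    return 1 / (1 + np.exp(-pooled @ w_cls))   # scalar in (0, 1)

# "Pretraining" on SQuAD learns W (and the span heads); transfer keeps W
# and trains only the fresh classification weights w_cls on the target task.
W = rng.normal(size=(d, h))
x = rng.normal(size=(T, d))
m = shared_encoder(x, W)
start_logits, end_logits = span_heads(m, rng.normal(size=h), rng.normal(size=h))
score = classification_head(m, rng.normal(size=h))
```

The design choice mirrored here is that only the output layer changes between tasks, so everything the lower layers learned from span supervision carries over.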
35. Transfer learning in QA
Our results and previous SOTA:

            WikiQA   SemEval2016-task3A
None         62.96    76.4
SQ-T         75.22    47.23
SQ           75.19    57.8
SQ-T (f)     76.44    76.3
SQ (f)       79.9     78.37
SQ* (f)      83.2     80.2
prev rank1   74.33    79.19
prev rank2   74.17    77.66

On WikiQA we achieve a new SOTA with a large gap.
36. Transfer learning in QA
SICK. Our results with and without additional SNLI pretraining:

            w/o SNLI   with SNLI
None         77.96      83.2
SQuAD-T      81.49      85
SQuAD        82.86      86.63
SQuAD*       84.38      88.22

Previous SOTA: rank1 86.2, rank2 84.57
37. Transfer learning in QA
Transfer learning should work better when the source is similar to the target. (??)
- span-level (SQuAD) → sentence-level (WikiQA etc.)
- sentence-level (SQuAD-T) → sentence-level (WikiQA etc.)
38. Transfer learning in QA
(Same WikiQA and SemEval2016-task3A results as slide 35.) Transfer from span-level SQuAD, SQ (f) and SQ* (f), outperforms transfer from sentence-level SQuAD-T, SQ-T (f), even though SQuAD-T is more similar to the target tasks.
40. Transfer learning in QA
- We achieve SOTA on well-studied QA datasets by simple
transfer learning
- Span-level supervision helps the model learn lexical information better
“Learned in Translation: Contextualized Word Vectors”
- Salesforce, 2017.08
- transfer learning from Translation to Sentiment analysis / classification / RTE /
QA
- SOTA in SST-5 & SNLI