1910 HCLT

Human Interface Laboratory
Keyphrase Extraction from Directive Utterances Using Discourse Components:
Korean Parallel Corpus Construction and Data Augmentation Methodology
2019. 10. 12 @HCLT 2019
조원익, 문영기, 김종인, 김남수

Contents
• Introduction
 What is a keyphrase? Keyphrase vs. summary
 What is a keyphrase for directives?
• Related work
 Keyphrase extraction, sentence generation, and paraphrasing
 SQL, bilingual pivoting (BP), and discourse component (DC)
• Corpus construction
• Dataset augmentation
• Summary
 Application
 Future work

Introduction
• What is a keyphrase?
 A keyphrase as a set of words that stands for a document
• e.g., Keywords (topic words) for an abstract
– Can be combined into phrases
» Discourse-component-based keyphrase extraction; a Korean parallel corpus for paraphrasing
• But remember: keyphrases are also ‘phrases’!
– Do they apply only to documents, or also to shorter units such as sentences?

Introduction
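• What is a keyphrase?
 A keyphrase as a phrase that summarizes a sentence
• e.g., Extractive summarization, sometimes accompanied by paraphrasing
– 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다.
(To tell you what many of you have been wondering about: this year, the Conference on Hangul and Korean Information Processing (HCLT) will be held at KAIST from October 12th to 13th.)
→ 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최
(HCLT held at KAIST from October 12th to 13th this year)
– 오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다.
(There will be a club after-party at Punggyeongsori near Seoul National University Station at eight o'clock this evening.)
→ 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정
(Club after-party planned at Punggyeongsori, Seoul National University Station, 20:00 today)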
• Remember paraphrasing is like monolingual translation (no exact answer!)
 Keyphrase candidates are expected to make up a smaller space than the
original sentences do!
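• 오늘 아침에 사고났대. (I heard there was an accident this morning.)
• 오늘 아침에 사고났다던데. (They say there was an accident this morning.)
• 그거 알아? 오늘 아침 사고난거. (You know what? There was an accident this morning.)
• 사고 났다더라구 오늘 아침에. (There was an accident, I heard, this morning.)
→ 오늘 아침 사고 발생 / 사고 남 (Accident this morning)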
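The many-to-one relation between surface sentences and a keyphrase can be sketched as a simple grouping. This is only an illustrative toy using the four example sentences from the slide; the `PAIRS` table and the `group_by_keyphrase` helper are hypothetical names, and the labels are hand-written, not produced by any extraction system.

```python
from collections import defaultdict

# Hand-labelled toy pairs (utterance, keyphrase) copied from the slide.
# Assumption: in a real corpus these labels would come from human annotation.
PAIRS = [
    ("오늘 아침에 사고났대.", "오늘 아침 사고 발생"),
    ("오늘 아침에 사고났다던데.", "오늘 아침 사고 발생"),
    ("그거 알아? 오늘 아침 사고난거.", "오늘 아침 사고 발생"),
    ("사고 났다더라구 오늘 아침에.", "오늘 아침 사고 발생"),
]


def group_by_keyphrase(pairs):
    """Collect all utterances that share the same keyphrase."""
    groups = defaultdict(list)
    for utterance, keyphrase in pairs:
        groups[keyphrase].append(utterance)
    return dict(groups)


groups = group_by_keyphrase(PAIRS)
n_utterances = len(PAIRS)    # size of the sentence space in this toy
n_keyphrases = len(groups)   # size of the keyphrase space in this toy
```

Here four differently styled utterances collapse onto a single keyphrase, which is the sense in which the keyphrase space is expected to be smaller.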

Introduction
• Keyphrase vs. Summary
 Summarization of a document can be either (conventionally):
• Extractive [Cheng and Lapata, 2016]
– Documents have several sentence candidates
• Abstractive [Rush et al., 2015]
– Documents without a representative sentence can be abstractively summarized
• Hybrid methodologies are in progress [Bae et al., 2019]
 In keyphrase extraction from the sentences:
• Both extractive and abstractive approaches can be utilized
– Extractive: for the keywords
– Abstractive: for the plausible expression (sentence style, word-level paraphrasing)
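오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다. (There will be a club after-party at Punggyeongsori near Seoul National University Station at eight o'clock this evening.)
→ 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정 (Club after-party planned at Punggyeongsori, Seoul National University Station, 20:00 today)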

Introduction
• Keyphrase for directives (question/command)?
 What should the keyphrases be?
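• For questions: something that the speaker asks for
– 내일 서울에 비 얼마나 올지 좀 검색해봐. (Look up how much it will rain in Seoul tomorrow.)
→ 질문: 내일 서울 강수량 (Question: tomorrow's precipitation in Seoul)
• For commands: something that the speaker requests
– 물이 끓으면 불을 제일 약한 걸로 돌려줘 (When the water boils, turn the heat down to the lowest setting)
→ 요구: 물이 끓으면 불을 제일 약한 것으로 하기 (Request: setting the heat to the lowest when the water boils)
• A simplified but representative nominalized version of the core content
• Sometimes keyphrases are longer than the original sentence
→ This is why the process differs from summarization
• Discourse component revisited!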
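One way to picture an entry of such a parallel corpus is as a small record holding the utterance, its speech act, and its keyphrase. The sketch below is an assumption for illustration only: the `DirectivePair` class, its field names, and the tab-separated layout are hypothetical and do not describe the released data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectivePair:
    """One (utterance, keyphrase) pair; field names are illustrative."""
    utterance: str
    act: str        # "question" or "command"
    keyphrase: str


# The two examples from the slide.
PAIRS = [
    DirectivePair("내일 서울에 비 얼마나 올지 좀 검색해봐.", "question", "내일 서울 강수량"),
    DirectivePair("물이 끓으면 불을 제일 약한 걸로 돌려줘", "command",
                  "물이 끓으면 불을 제일 약한 것으로 하기"),
]


def to_tsv(pair: DirectivePair) -> str:
    """Serialize a pair as one tab-separated line (hypothetical layout)."""
    return "\t".join((pair.act, pair.utterance, pair.keyphrase))


rows = [to_tsv(p) for p in PAIRS]
```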

Introduction
• Research questions
 How does the discourse component (DC) compare to structured query language
(SQL) and bilingual pivoting (BP) from the viewpoint of paraphrasing?
 How can we extract the keyphrase from a directive utterance in the form
of a DC?
 How can the DC be utilized in generating paraphrases of questions and
commands?

Related work
• Keyphrase extraction, sentence generation, and paraphrasing
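[Figure: two routes from an original sentence to a paraphrase.
Direct route: original sentence → paraphrase, via bilingual pivoting / word swapping / human paraphrase.
Indirect route: original sentence → core content (SQL or keyphrase), via Seq2SQL / keyphrase extraction; then core content → paraphrase, via rule-based / learning-based / human generation.]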

Related work
• 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다.
(To tell you what many of you have been wondering about: this year, HCLT will be held at KAIST from October 12th to 13th.)
– How can we obtain the core content for paraphrasing (possibly by humans)?
• Structured query language (SQL) [Zhong et al., 2017]
 {기간: 올해 시월 십이일부터 십삼일, 장소: 카이스트, 이벤트: 한글 및 한국어 정보처리 학술대회}
({Period: October 12th to 13th this year, Place: KAIST, Event: HCLT})
• A kind of semantic parsing
• Structured extraction of information is available
• Human-friendly data generation is not guaranteed
• Categorization can be limited
• Bilingual pivoting (BP) [Mallinson et al., 2017]
 “As many of you may have waited for, we hold HCLT conference at KAIST
from twelfth to thirteenth upcoming October.”
• Back-translation using other languages may give various expressions
• 1-1 correspondence doesn’t help extract the core content of the sentence

Related work
• 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다.
(To tell you what many of you have been wondering about: this year, HCLT will be held at KAIST from October 12th to 13th.)
– How can we obtain the core content for paraphrasing (possibly by humans)?
• Discourse component [Portner, 2004]
 This approach involves human generation, but can be efficient
• E.g., the following can be the discourse component for a declarative:
– 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 (Common Ground)
(HCLT held at KAIST from October 12th to 13th this year)
• Core content information in monolingual natural language format
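The difference between a slot-style record and a discourse component can be made concrete with a toy comparison. Everything below is an assumption for illustration: the English slot names, the `naive_render` helper, and the English rendering of the discourse component are hypothetical and are not taken from Seq2SQL or from the authors' annotation scheme.

```python
# SQL-like structured record (English glosses of the slide's slots).
SLOTS = {
    "period": "October 12th to 13th this year",
    "place": "KAIST",
    "event": "HCLT",
}

# Discourse component: the same content written directly in natural language.
DISCOURSE_COMPONENT = "HCLT held at KAIST from October 12th to 13th this year"


def naive_render(slots, order):
    """Flatten a slot record into key=value text; no template, no grammar."""
    return " / ".join(f"{key}={slots[key]}" for key in order)


rendered = naive_render(SLOTS, ["period", "place", "event"])
```

The slot record needs a predefined category set and still reads as key-value text, whereas the discourse component is already a phrase a human annotator can write and read without a schema.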

Corpus construction
• Annotating keyphrases on a Korean corpus regarding speech act
– How can it be utilized?
[Figure: 방카슈랑스란 무엇입니까 (“What is bancassurance?”) → intention identification → Question? → keyphrase extraction → 방카슈랑스의 의미 (“the meaning of bancassurance”)]
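The two-stage flow in the figure (intention identification, then keyphrase extraction for directives) can be sketched as a function that takes both stages as pluggable callables. This is a minimal sketch under stated assumptions: `run_pipeline`, `toy_intention`, `toy_keyphrase`, and the `TOY_LABELS` lookup table are hypothetical stand-ins, not an implementation of any actual classifier or extractor.

```python
from typing import Callable, Optional


def run_pipeline(
    utterance: str,
    identify_intention: Callable[[str], str],
    extract_keyphrase: Callable[[str, str], str],
) -> Optional[dict]:
    """Run intention identification, then extract a keyphrase for directives only."""
    intention = identify_intention(utterance)
    if intention not in ("question", "command"):
        return None  # non-directives get no keyphrase in this sketch
    return {
        "utterance": utterance,
        "intention": intention,
        "keyphrase": extract_keyphrase(utterance, intention),
    }


# Toy stand-ins: a lookup table instead of trained models (assumption).
TOY_LABELS = {"방카슈랑스란 무엇입니까": ("question", "방카슈랑스의 의미")}


def toy_intention(utterance: str) -> str:
    return TOY_LABELS.get(utterance, ("statement", ""))[0]


def toy_keyphrase(utterance: str, intention: str) -> str:
    return TOY_LABELS[utterance][1]


result = run_pipeline("방카슈랑스란 무엇입니까", toy_intention, toy_keyphrase)
```

Keeping the two stages as separate callables mirrors the figure and lets either stage be replaced independently.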

Corpus construction
 Corpus: Intention identification for Korean (3i4K) [Cho et al., 2018]
 Composition
• Question
• Command
• Rhetorical question
• Rhetorical command
• Statement
• Intonation-dependent utterances
• Fragments
Includes only utterances whose determination of
speech act was not affected by the sentence form
• Utterances are non-canonical and colloquial
• Includes various topics and situations

Corpus construction

Data augmentation
• Generating questions and commands from keyphrases
 Prototype model [Cho et al., 2018] lacks alternative questions, prohibitions, and
strong requirements
 These types are scarce within the corpus, but frequently used in real life
• Augmentation is required, but how?

Data augmentation
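 For a discourse component (keyphrase) of a statement, we can think of:
오늘 아침 사고 발생 / 사고 남 (Accident this morning)
• 오늘 아침에 사고났대. (I heard there was an accident this morning.)
• 오늘 아침에 사고났다던데. (They say there was an accident this morning.)
• 그거 알아? 오늘 아침 사고난거. (You know what? There was an accident this morning.)
• 사고 났다더라구 오늘 아침에. (There was an accident, I heard, this morning.)
 Similarly for questions and commands:
• Question set >> Question?
• To-do-list >> Command!
• Generating questions/commands differs from expressing a thought in
interrogative/imperative sentence form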

Data augmentation
 Question/command types in need:
• Alternative Q, prohibition, strong requirement (deficient in the corpus)
• Wh-question (more required for practical usage)
 Phrases that are prepared:
• Total phrase #: 2,000
– 400 for alternative Q
– 800 for wh-Q
– 400 for prohibition
– 400 for strong requirement
• Sentences to be generated per phrase: 10
• Topics:
– 1,000 phrases for free topic
– 250 phrases each for mail, house control, schedule, and weather
 Only the utterances agreed on by more than 3 native speakers are kept
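The phrase counts above can be checked with a few lines of arithmetic. The dictionary keys are labels chosen for this sketch; the numbers are those on the slide. The sentence total is only an upper bound, since the consensus filter removes some utterances and the slide does not state how many.

```python
# Phrase counts by directive type (from the slide).
PHRASES_BY_TYPE = {
    "alternative_q": 400,
    "wh_q": 800,
    "prohibition": 400,
    "strong_requirement": 400,
}

# Phrase counts by topic (from the slide): 1,000 free-topic, 250 for each domain.
PHRASES_BY_TOPIC = {
    "free": 1000,
    "mail": 250,
    "house_control": 250,
    "schedule": 250,
    "weather": 250,
}

SENTENCES_PER_PHRASE = 10

total_by_type = sum(PHRASES_BY_TYPE.values())
total_by_topic = sum(PHRASES_BY_TOPIC.values())

# Upper bound on generated sentences before the native-speaker consensus filter.
max_sentences = total_by_type * SENTENCES_PER_PHRASE
```

Both breakdowns sum to the stated 2,000 phrases, giving at most 20,000 generated sentences before filtering.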

Data augmentation
 Guideline for the participants
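• Write the ten sentences in styles as different from one another as possible. Here, style
covers politeness (use of honorifics), tone, and so on.
• You need not repeat the words in the keyphrase; other words, phrases, or predicates that
fit the situation may be used. Expressions should be suitable for spoken utterance.
• Pursuing diversity of sentence form through inversion is also encouraged.
• Wh-questions must contain a wh-word, and alternative questions may also contain one
depending on the case. Neither type needs to be written as an interrogative sentence.
• A prohibition must be a sentence that keeps the addressee from performing some action
they could perform, and must be more forceful than merely saying it is fine not to do it.
If prohibiting the action is practically equivalent to requesting another action,
substituting that expression is not a serious problem.
• Neither prohibitions nor strong requirements need to be imperative sentences, but they
must aim to block or compel the addressee's action. Strong suggestions are also acceptable.
• For keyphrases that include the speaker/addressee, use the corresponding pronoun
expressions. This yields both a corpus that includes speaker/addressee expressions and
one that does not.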

Data augmentation

Data augmentation
 Will be distributed via https://github.com/warnikchow/sae4k
 The baseline system for automatic extraction is yet to be developed!

Summary
• Application of the concept “keyphrase”
 Analysis of questions and commands in human-friendly conversation
• Classification of non-canonical directive utterances
• Pre-processing for the semantic parsing of non-canonical utterances
• Making up an answer that continues the dialog
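– e.g., 오늘 비 언제까지 온대냐? (Until when is it supposed to rain today?) >> 오늘 비 오는 시간대가 궁금하신가요? (Are you wondering about the hours it will rain today?)
– (If inferred correctly...)
 As the core content of an utterance
• For an efficient semantic web search (방카슈랑스? “bancassurance?”)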
• For an efficient human generation of paraphrase
– More human-friendly compared to SQL (non-NL terms) or back-translation (requires
multilingual ability)
• Future work
 Implementation of automatic keyphrase extraction system
 Extension to paraphrasing or sentence similarity task

References (in order of appearance)
• Cheng, J., & Lapata, M. (2016). Neural summarization by extracting sentences and words. arXiv
preprint arXiv:1603.07252.
• Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence
summarization. arXiv preprint arXiv:1509.00685.
• Bae, S., Kim, T., Kim, J., & Lee, S. G. (2019). Summary Level Training of Sentence Rewriting for
Abstractive Summarization. arXiv preprint arXiv:1909.08752.
• Zhong, V., Xiong, C., & Socher, R. (2017). Seq2sql: Generating structured queries from natural
language using reinforcement learning. arXiv preprint arXiv:1709.00103.
• Mallinson, J., Sennrich, R., & Lapata, M. (2017, April). Paraphrasing revisited with neural machine
translation. In Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics: Volume 1, Long Papers (pp. 881-893).
• Portner, P. (2004, September). The semantics of imperatives within a theory of clause types.
In Semantics and linguistic theory (Vol. 14, pp. 235-252).
• Cho, W. I., Lee, H. S., Yoon, J. W., Kim, S. M., & Kim, N. S. (2018). Speech Intention Understanding in a
Head-final Language: A Disambiguation Utilizing Intonation-dependency. arXiv preprint
arXiv:1811.04231.
• Cho, W. I., Moon, Y. K., Kang, W. H., & Kim, N. S. (2018). Extracting Arguments from Korean Question
and Command: An Annotated Corpus for Structured Paraphrasing. arXiv preprint arXiv:1810.04631.
