Human Interface Laboratory
Keyphrase Extraction from Directive Utterances Using Discourse Components:
Korean Parallel Corpus Construction and a Data Augmentation Methodology
2019. 10. 12 @HCLT 2019
조원익, 문영기, 김종인, 김남수
Contents
• Introduction
 What is a keyphrase? Keyphrase vs. summary
 What is a keyphrase for directives?
• Related work
 Keyphrase extraction, sentence generation, and paraphrasing
 SQL, bilingual pivoting (BP), and discourse component (DC)
• Corpus construction
• Dataset augmentation
• Summary
 Application
 Future work
1
Introduction
• What is a keyphrase?
 A keyphrase as a set of words that stands for a document
• e.g., keywords (topic words) for an abstract
– These can be combined into phrases
» Discourse-component-based keyphrase extraction; a Korean parallel corpus for paraphrasing
• But remember: keyphrases are also phrases!
– Do they apply only to documents, or also to shorter units (sentences)?
2
Introduction
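• What is a keyphrase?
 A keyphrase as a phrase that summarizes a sentence
• e.g., extractive summarization that is sometimes accompanied by paraphrasing
– 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To let you know what many of you have been wondering about: this year, the conference on Hangul and Korean information processing (HCLT) will be held at KAIST from October 12 to 13.")
→ 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 ("HCLT held at KAIST from October 12 to 13 this year")
– 오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다. ("The club after-party will take place at eight this evening at Punggyeongsori near Seoul National University Station.")
→ 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정 ("Club after-party scheduled today at 20:00 at Punggyeongsori, Seoul National University Station")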
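• Remember that paraphrasing is like monolingual translation (there is no single correct answer!)
 Keyphrase candidates are expected to occupy a smaller space than the
original sentences do!
• 오늘 아침에 사고났대. ("I heard there was an accident this morning.")
• 오늘 아침에 사고났다던데. ("They say there was an accident this morning.")
• 그거 알아? 오늘 아침 사고난거. ("Did you know? There was an accident this morning.")
• 사고 났다더라구 오늘 아침에. ("There was an accident, I hear, this morning.")
→ 오늘 아침 사고 발생 (사고 남) ("accident occurred this morning")
3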
Introduction
• Keyphrase vs. Summary
 Summarization of a document can be either (conventionally):
• Extractive [Cheng and Lapata, 2016]
– Documents have several sentence candidates
• Abstractive [Rush et al., 2015]
– Documents without a representative sentence can be abstractively summarized
• Hybrid methodologies are in progress [Bae et al., 2019]
 In keyphrase extraction from sentences:
• Both extractive and abstractive approaches can be used
– Extractive: for the keywords
– Abstractive: for a plausible expression (sentence style, word-level paraphrasing)
4
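오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다. ("The club after-party will take place at eight this evening at Punggyeongsori near Seoul National University Station.")
→ 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정 ("Club after-party scheduled today at 20:00 at Punggyeongsori, Seoul National University Station")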
Introduction
• Keyphrases for directives (questions/commands)?
 What should the keyphrases be?
• For questions: what the speaker asks about
– 내일 서울에 비 얼마나 올지 좀 검색해봐. ("Look up how much it will rain in Seoul tomorrow.")
→ 질문: 내일 서울 강수량 ("Question: tomorrow's rainfall in Seoul")
• For commands: what the speaker requests
– 물이 끓으면 불을 제일 약한 걸로 돌려줘 ("When the water boils, turn the heat down to the lowest setting")
→ 요구: 물이 끓으면 불을 제일 약한 것으로 하기 ("Request: setting the heat to the lowest when the water boils")
• A simplified but representative nominalized version of the core content
• Sometimes keyphrases are longer than the original sentence
→ this is why the process differs from summarization
• Discourse component revisited!
5
Introduction
• Research questions
 How does the discourse component (DC) compare with structured query language
(SQL) and bilingual pivoting (BP) as a basis for paraphrasing?
 How can we extract the keyphrase from a directive utterance in the form
of a DC?
 How can the DC be used to produce paraphrases of questions and
commands?
6
Related work
• Keyphrase extraction, sentence generation, and paraphrasing
7
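[Diagram: three connected nodes.
Original sentence → Core content (SQL or keyphrase): Seq2SQL / keyphrase extraction
Core content (SQL or keyphrase) → Paraphrase: rule-based / learning-based / human generation
Original sentence → Paraphrase: bilingual pivoting / word swapping / human paraphrase]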
Related work
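• 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To let you know what many of you have been wondering about: this year, HCLT will be held at KAIST from October 12 to 13.")
– How can we obtain the core content for paraphrasing (possibly by humans)?
• Structured query language (SQL) [Zhong et al., 2017]
 {기간 (period): 올해 시월 십이일부터 십삼일 (October 12 to 13 this year), 장소 (place): 카이스트 (KAIST), 이벤트 (event): 한글 및 한국어 정보처리 학술대회 (HCLT)}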
• A kind of semantic parsing
• Structured extraction of information is available
• Human-friendly data generation is not guaranteed
• Categorization can be limited
• Bilingual pivoting (BP) [Mallinson et al., 2017]
 “As many of you may have waited for, we hold HCLT conference at KAIST
from the twelfth to the thirteenth of this coming October.”
• Back-translation through other languages may yield a variety of expressions
• One-to-one correspondence does not help extract the core content of the sentence
8
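The contrast between slot-style core content and free-form paraphrase can be made concrete with a small sketch. The `SlotFrame` class, its three fields, and the template in `to_phrase` are hypothetical names chosen for illustration, mirroring the period/place/event example on the slide; they are not the Seq2SQL schema or any method from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotFrame:
    """SQL-style core content: a fixed set of named slots (hypothetical schema)."""
    period: str
    place: str
    event: str

    def to_phrase(self) -> str:
        # Rule-based generation: readable only because this template
        # was hand-written for exactly these three slots.
        return f"{self.event} at {self.place}, {self.period}"


frame = SlotFrame(
    period="October 12-13 this year",
    place="KAIST",
    event="HCLT",
)
print(frame.to_phrase())
```

The structure makes extraction easy, but anything the utterance says outside the three slots has nowhere to go, which is one way to read "categorization can be limited."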
Related work
• 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To let you know what many of you have been wondering about: this year, HCLT will be held at KAIST from October 12 to 13.")
– How can we obtain the core content for paraphrasing (possibly by humans)?
• Discourse component [Portner, 2004]
 This approach relies on human generation, but can be efficient
• E.g., the following can serve as the discourse component of the declarative above:
– 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 ("HCLT held at KAIST from October 12 to 13 this year") (Common Ground)
• Core content expressed in a monolingual natural-language format
9
Corpus construction
• Annotating keyphrases on a Korean corpus regarding speech act
– How can it be utilized?
10
[Diagram: 방카슈랑스란 무엇입니까 ("What is bancassurance?") → Intention identification → Question? → Keyphrase extraction → 방카슈랑스의 의미 ("the meaning of bancassurance")]
Corpus construction
• Annotating keyphrases on a Korean corpus regarding speech act
 Corpus: Intention identification for Korean (3i4K) [Cho et al., 2018]
 Composition
• Question
• Command
• Rhetorical question
• Rhetorical command
• Statement
• Intonation-dependent utterances
• Fragments
11
Includes only utterances whose determination of
speech act was not affected by the sentence form
• Utterances are non-canonical and colloquial
• Includes various topics and situations
Corpus construction
• Annotating keyphrases on a Korean corpus regarding speech act
12
Data augmentation
• Generating questions and commands from keyphrases
 The prototype model [Cho et al., 2018] lacks alternative questions, prohibitions, and
strong requirements
 These types are scarce in the corpus but frequently used in real life
• Augmentation is required! But how?
13
Data augmentation
• Generating questions and commands from keyphrases
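 For a discourse component (keyphrase) of a statement, we can think of several surface sentences:
오늘 아침 사고 발생 (사고 남) ("accident occurred this morning") →
• 오늘 아침에 사고났대. ("I heard there was an accident this morning.")
• 오늘 아침에 사고났다던데. ("They say there was an accident this morning.")
• 그거 알아? 오늘 아침 사고난거. ("Did you know? There was an accident this morning.")
• 사고 났다더라구 오늘 아침에. ("There was an accident, I hear, this morning.")
 Similarly for questions and commands:
• Question set >> Question?
• To-do list >> Command!
• Generating questions/commands differs from expressing a thought in
interrogative/imperative sentence form
14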
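The one-to-many relation between a keyphrase and its surface sentences is what makes the corpus parallel. The following is a minimal sketch of how such a mapping could be flattened into training pairs; the variable and function names are hypothetical, and only the example sentences come from the slide.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# One keyphrase (discourse component) mapped to several surface sentences,
# copied from the slide's example.
keyphrase_to_sentences: Dict[str, List[str]] = {
    "오늘 아침 사고 발생": [
        "오늘 아침에 사고났대.",
        "오늘 아침에 사고났다던데.",
        "그거 알아? 오늘 아침 사고난거.",
        "사고 났다더라구 오늘 아침에.",
    ],
}


def extraction_pairs(mapping: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """(sentence, keyphrase) pairs, usable as source/target for extraction."""
    return [(s, k) for k, sents in mapping.items() for s in sents]


def paraphrase_pairs(mapping: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Unordered sentence pairs that share a keyphrase, i.e. paraphrases."""
    return [p for sents in mapping.values() for p in combinations(sents, 2)]


ext = extraction_pairs(keyphrase_to_sentences)
par = paraphrase_pairs(keyphrase_to_sentences)
print(len(ext), len(par))
```

With n sentences per keyphrase, the same annotation yields n extraction pairs and n(n-1)/2 paraphrase pairs, so a single keyphrase does double duty.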
Data augmentation
• Generating questions and commands from keyphrases
 Question/command types in need:
• Alternative question, prohibition, strong requirement (deficient in the corpus)
• Wh-question (more are needed for practical use)
 Phrases that are prepared:
• Total phrase #: 2,000
– 400 for alternative Q
– 800 for wh-Q
– 400 for prohibition
– 400 for strong requirement
• Sentences to be generated per phrase: 10
• Topics:
– 1,000 phrases for free topic
– 250 phrases each for mail, house control, schedule, and weather
 Only utterances agreed on by more than 3 native speakers are retained
15
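The numbers on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "more than 3" means strictly more than three agreeing speakers; that reading, the `votes` structure, and all names are illustrative assumptions, not the authors' filtering code.

```python
from typing import Dict, List

PHRASES_PER_TYPE = {
    "alternative_q": 400,
    "wh_q": 800,
    "prohibition": 400,
    "strong_requirement": 400,
}
PHRASES_PER_TOPIC = {
    "free": 1000,
    "mail": 250,
    "house_control": 250,
    "schedule": 250,
    "weather": 250,
}
SENTENCES_PER_PHRASE = 10

total_phrases = sum(PHRASES_PER_TYPE.values())
# Upper bound on generated sentences, before the consensus filter.
max_sentences = total_phrases * SENTENCES_PER_PHRASE


def keep_by_consensus(votes: Dict[str, int], more_than: int = 3) -> List[str]:
    """Keep sentences accepted by strictly more than `more_than` speakers.

    `votes` maps a sentence to its number of agreeing native speakers
    (a hypothetical bookkeeping format).
    """
    return [sentence for sentence, n in votes.items() if n > more_than]


print(total_phrases, max_sentences)  # 2000 20000
```

Both breakdowns (by type and by topic) sum to the same 2,000 phrases, so the two lists describe one pool split two ways rather than two separate pools.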
Data augmentation
• Generating questions and commands from keyphrases
 Guideline for the participants
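• Write the ten sentences in styles as different from one another as possible. Here, style
covers politeness (honorific) level, tone, and so on.
• The wording of the keyphrase need not be repeated; other words, phrases, or predicates that
fit the situation may be used. Expressions should be suitable for spoken utterance.
• Varying the sentence form through inversion is also encouraged.
• A wh-question must contain a wh-word, and an alternative question may also contain one
depending on the case. Neither type has to be written as an interrogative sentence.
• A prohibition must be a sentence that stops the addressee from doing something they could
do, and must be more forceful than merely saying it is fine not to do it. If prohibiting the
action is practically equivalent to requiring another action, substituting that expression
is acceptable.
• Neither prohibitions nor strong requirements need to be imperatives, but they must aim to
block or compel the addressee's action. Strong suggestions are also acceptable.
• For keyphrases that include the speaker/addressee, use the corresponding pronoun
expressions. This yields both a corpus that contains speaker/addressee expressions and one
that does not.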
16
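Some of the guidelines above are mechanical enough to check automatically before human review. The sketch below is a crude heuristic under stated assumptions: the `WH_WORDS` list, the substring matching, and the `check_batch` helper are all hypothetical and were not part of the annotation process described here. Substring matching can produce false positives, so the result is a hint for a reviewer, not a verdict.

```python
from typing import List

# A small, non-exhaustive list of Korean wh-words (assumption for illustration).
WH_WORDS = ("누구", "누가", "무엇", "뭐", "뭘", "무슨", "언제", "어디",
            "어떻게", "어떤", "어느", "왜", "얼마", "몇")


def has_wh_word(sentence: str) -> bool:
    """True if any listed wh-word occurs as a substring of the sentence."""
    return any(w in sentence for w in WH_WORDS)


def check_batch(sentences: List[str], phrase_type: str,
                expected: int = 10) -> List[str]:
    """Return human-readable problems found in one participant's batch."""
    problems = []
    if len(sentences) != expected:
        problems.append(f"expected {expected} sentences, got {len(sentences)}")
    if len(set(sentences)) != len(sentences):
        problems.append("duplicate sentences")
    if phrase_type == "wh_q":
        # Only wh-questions require a wh-word; alternative questions may omit it.
        for s in sentences:
            if not has_wh_word(s):
                problems.append(f"no wh-word: {s}")
    return problems


batch = ["내일 서울에 비 얼마나 올지 좀 검색해봐.", "오늘 비 언제까지 온대냐?"]
print(check_batch(batch, "wh_q", expected=2))
```

The first sentence in `batch` is a command in form yet still passes the wh-word check, which matches the guideline that a wh-question need not be written as an interrogative.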
Data augmentation
• Generating questions and commands from keyphrases
17
Data augmentation
• Generating questions and commands from keyphrases
 Will be distributed via https://github.com/warnikchow/sae4k
 The baseline system for automatic extraction is yet to be developed!
18
Summary
• Application of the concept “keyphrase”
 Analysis of questions and commands in human-friendly conversation
• Classification of non-canonical directive utterances
• Pre-processing for the semantic parsing of non-canonical utterances
• Making up an answer that continues the dialog
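– e.g., 오늘 비 언제까지 온대냐? ("Until when is it supposed to rain today?") >> 오늘 비 오는 시간대가 궁금하신가요? ("Are you asking about the hours it will rain today?")
– (If inferred correctly...)
 As the core content of an utterance
• For an efficient semantic web search (방카슈랑스, "bancassurance"?)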
• For an efficient human generation of paraphrase
– More human-friendly compared to SQL (non-NL terms) or back-translation (requires
multilingual ability)
• Future work
 Implementation of an automatic keyphrase extraction system
 Extension to paraphrasing or sentence similarity tasks
19
Reference (order of appearance)
• Cheng, J., & Lapata, M. (2016). Neural summarization by extracting sentences and words. arXiv
preprint arXiv:1603.07252.
• Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence
summarization. arXiv preprint arXiv:1509.00685.
• Bae, S., Kim, T., Kim, J., & Lee, S. G. (2019). Summary Level Training of Sentence Rewriting for
Abstractive Summarization. arXiv preprint arXiv:1909.08752.
• Zhong, V., Xiong, C., & Socher, R. (2017). Seq2sql: Generating structured queries from natural
language using reinforcement learning. arXiv preprint arXiv:1709.00103.
• Mallinson, J., Sennrich, R., & Lapata, M. (2017, April). Paraphrasing revisited with neural machine
translation. In Proceedings of the 15th Conference of the European Chapter of the Association for
Computational Linguistics: Volume 1, Long Papers (pp. 881-893).
• Portner, P. (2004, September). The semantics of imperatives within a theory of clause types.
In Semantics and linguistic theory (Vol. 14, pp. 235-252).
• Cho, W. I., Lee, H. S., Yoon, J. W., Kim, S. M., & Kim, N. S. (2018). Speech Intention Understanding in a
Head-final Language: A Disambiguation Utilizing Intonation-dependency. arXiv preprint
arXiv:1811.04231.
• Cho, W. I., Moon, Y. K., Kang, W. H., & Kim, N. S. (2018). Extracting Arguments from Korean Question
and Command: An Annotated Corpus for Structured Paraphrasing. arXiv preprint arXiv:1810.04631.
20
Thank you!