Human Interface Laboratory
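Keyphrase Extraction from Directive Utterances Using Discourse Components:
Construction of a Korean Parallel Corpus and a Data Augmentation Methodology
(담화 성분을 활용한 지시 발화의 키 프레이즈 추출: 한국어 병렬 코퍼스 구축 및 데이터 증강 방법론)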
2019. 10. 12 @HCLT 2019
조원익, 문영기, 김종인, 김남수
• Introduction
 What is keyphrase? Keyphrase vs. Summary
 What is keyphrase for directives?
• Related work
 Keyphrase extraction, sentence generation, and paraphrasing
 SQL, bilingual pivoting (BP), and discourse component (DC)
• Corpus construction
• Dataset augmentation
• Summary
 Application
 Future work
• What is a keyphrase?
▪ Keyphrase as a set of words that stands for a document
• e.g., keywords (topic words) for an abstract
– Can be combined into phrases
» 담화성분 기반의 키프레이즈 추출, 패러프레이징을 위한 한국어 병렬 코퍼스 ("discourse-component-based keyphrase extraction", "Korean parallel corpus for paraphrasing")
• But remember: keyphrases are also "phrases"!
– Do they hold only for a document, or also for shorter units (sentences)?
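• What is a keyphrase?
▪ Keyphrase as a phrase that summarizes a sentence
• e.g., extractive summarization that is sometimes accompanied by paraphrasing
– 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To answer what many of you have been wondering: this year the HCLT conference will be held at KAIST from October 12 to 13.")
→ 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 ("HCLT conference held at KAIST from October 12 to 13 this year")
– 오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다. ("The club after-party is scheduled for eight this evening at Punggyeongsori near Seoul National University station.")
→ 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정 ("club after-party scheduled today at 20:00 at Punggyeongsori, Seoul National University station")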
• Remember paraphrasing is like monolingual translation (no exact answer!)
 Keyphrase candidates are expected to make up a smaller space than the
original sentences do!
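• 오늘 아침에 사고났대.
• 오늘 아침에 사고났다던데.
• 그거 알아? 오늘 아침 사고난거.
• 사고 났다더라구 오늘 아침에.
(All four say roughly "I heard there was an accident this morning", in different colloquial styles.)
→ 오늘 아침 사고 발생 (사고 남) ("accident occurred this morning")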
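The many-to-one relation in this example can be sketched in Python. This is only an illustration, not the authors' method: the mapping is written by hand from the four utterances on the slide, and the names `KEYPHRASE_OF` and `group_by_keyphrase` are hypothetical.

```python
from collections import defaultdict

# Hand-written pairs taken from the slide: four colloquial variants, one keyphrase.
# (Illustrative only; a real system would have to predict the keyphrase.)
KEYPHRASE_OF = {
    "오늘 아침에 사고났대.": "오늘 아침 사고 발생",
    "오늘 아침에 사고났다던데.": "오늘 아침 사고 발생",
    "그거 알아? 오늘 아침 사고난거.": "오늘 아침 사고 발생",
    "사고 났다더라구 오늘 아침에.": "오늘 아침 사고 발생",
}


def group_by_keyphrase(pairs):
    """Invert an utterance -> keyphrase map into keyphrase -> list of utterances."""
    groups = defaultdict(list)
    for utterance, keyphrase in pairs.items():
        groups[keyphrase].append(utterance)
    return dict(groups)


groups = group_by_keyphrase(KEYPHRASE_OF)
n_utterances = len(KEYPHRASE_OF)  # size of the utterance side
n_keyphrases = len(groups)        # size of the keyphrase side
print(n_utterances, n_keyphrases)
```

In this toy case the keyphrase side is smaller than the utterance side, which is the sense in which keyphrase candidates occupy a smaller space.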
• Keyphrase vs. Summary
 Summarization of a document can be either (conventionally):
• Extractive [Cheng and Lapata, 2016]
– Documents have several sentence candidates
• Abstractive [Rush et al., 2015]
– Documents without a representative sentence can be abstractively summarized
• Hybrid methodologies are in progress [Bae et al., 2019]
 In keyphrase extraction from the sentences:
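• Both extractive and abstractive approaches can be utilized
– Extractive: for the keywords
– Abstractive: for a plausible expression (sentence style, word-level paraphrasing)
– e.g., 오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다.
→ 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정
(the keywords are kept; "저녁 여덟 시", eight in the evening, is rewritten as "이십 시", 20:00, and the sentence ending is nominalized)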
• Keyphrase for directives (question/command)?
 What should the keyphrases be?
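• For questions: what the speaker asks for
– 내일 서울에 비 얼마나 올지 좀 검색해봐. ("Look up how much it is going to rain in Seoul tomorrow.")
→ 질문: 내일 서울 강수량 ("Question: tomorrow's precipitation in Seoul")
• For commands: what the speaker requests
– 물이 끓으면 불을 제일 약한 걸로 돌려줘 ("When the water boils, turn the heat down to the lowest setting.")
→ 요구: 물이 끓으면 불을 제일 약한 것으로 하기 ("Request: setting the heat to the lowest when the water boils")
• A simplified but representative nominalized version of the core content
• Sometimes the keyphrase is longer than the original sentence
→ which is why the process differs from summarization
• Discourse component revisited!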
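One way to hold such annotations is a small record type, sketched below. The class name `DirectiveKeyphrase`, the English act labels, and the `format_annotation` helper are assumptions made for this illustration; only the two example pairs and the 질문/요구 prefixes come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectiveKeyphrase:
    utterance: str  # original directive utterance
    act: str        # "question" or "command" (labels assumed for this sketch)
    keyphrase: str  # nominalized core content


# Prefixes used on the slide: 질문 (question), 요구 (request).
PREFIX = {"question": "질문", "command": "요구"}

EXAMPLES = [
    DirectiveKeyphrase("내일 서울에 비 얼마나 올지 좀 검색해봐.", "question", "내일 서울 강수량"),
    DirectiveKeyphrase("물이 끓으면 불을 제일 약한 걸로 돌려줘", "command", "물이 끓으면 불을 제일 약한 것으로 하기"),
]


def format_annotation(example):
    """Render an annotation in the 'prefix: keyphrase' style shown on the slide."""
    return f"{PREFIX[example.act]}: {example.keyphrase}"


for example in EXAMPLES:
    print(example.utterance, "→", format_annotation(example))
```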
• Research questions
▪ How does the discourse component (DC) compare to structured query language (SQL) and bilingual pivoting (BP) in view of paraphrasing?
▪ How can we extract the keyphrase from a directive utterance in the form of a DC?
▪ How can the DC be utilized in making up paraphrases of questions and commands?
Related work
• Keyphrase extraction, sentence generation, and paraphrasing
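[Figure: relations among a sentence, its paraphrase, and its core content (SQL or keyphrase). Paraphrasing: bilingual pivoting / word swapping / human paraphrase. Core content extraction: Seq2SQL / keyphrase extraction. Sentence generation: rule-based / learning-based / human generation.]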
Related work
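• 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To answer what many of you have been wondering: this year the HCLT conference will be held at KAIST from October 12 to 13.")
– How can we obtain the core content for paraphrasing (possibly by humans)?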
• Structured query language (SQL) [Zhong et al., 2017]
▪ {기간: 올해 시월 십이일부터 십삼일, 장소: 카이스트, 이벤트: 한글 및 한국어 정보처리 학술대회} ({period: October 12 to 13 this year, place: KAIST, event: HCLT conference})
• A kind of semantic parsing
• Structured extraction of information is available
• Human-friendly data generation is not guaranteed
• Categorization can be limited
• Bilingual pivoting (BP) [Mallinson et al., 2017]
▪ “As many of you may have waited for, we hold HCLT conference at KAIST from twelfth to thirteenth upcoming October.”
• Back-translation using other languages may give various expressions
• 1-1 correspondence doesn’t help extract the core content of the sentence
• Discourse component [Portner, 2004]
 This approach incorporates human generation, but can be efficient
• E.g., the following can be the discourse component of the declarative above:
– 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 (Common Ground) ("HCLT conference held at KAIST from October 12 to 13 this year")
• Core content information in a monolingual natural-language format
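To make the contrast concrete, the sketch below puts the slot-style representation from the SQL example next to the discourse-component phrase. It is a rough illustration under stated assumptions: the helper `slots_covered` is hypothetical and uses plain substring matching, which happens to work for this example but is not a general alignment method.

```python
# Slot-style core content, mirroring the {기간, 장소, 이벤트} example (SQL-like view).
slots = {
    "기간": "올해 시월 십이일부터 십삼일",         # period
    "장소": "카이스트",                            # place
    "이벤트": "한글 및 한국어 정보처리 학술대회",  # event
}

# Discourse-component view: a single natural-language phrase.
discourse_component = (
    "올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최"
)


def slots_covered(slot_values, phrase):
    """Return the slot names whose value appears verbatim in the phrase."""
    return [name for name, value in slot_values.items() if value in phrase]


covered = slots_covered(slots, discourse_component)
print(covered)
```

Here every slot value is recoverable from the phrase, while the phrase itself stays readable as ordinary Korean, which is the human-friendliness argument made for the DC.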
Corpus construction
• Annotating keyphrases on a Korean corpus regarding speech act
– How can it be utilized?
방카슈랑스란 무엇입니까 ("What is bancassurance?") → [keyphrase extraction] → 방카슈랑스의 의미 ("the meaning of bancassurance")
Corpus construction
• Annotating keyphrases on a Korean corpus regarding speech act
 Corpus: Intention identification for Korean (3i4K) [Cho et al., 2018]
 Composition
• Question
• Command
• Rhetorical question
• Rhetorical command
• Statement
• Intonation-dependent utterances
• Fragments
(Includes only utterances whose determination of speech act was not affected by the sentence form.)
• Utterances are non-canonical and colloquial
• Includes various topics and situations
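The class inventory above can be written down as an enumeration, with a filter that keeps only the directive classes. This is a sketch under an assumption: that keyphrases are annotated only for genuine questions and commands. The names `SpeechAct`, `DIRECTIVES`, and `select_directives` are hypothetical and are not taken from the 3i4K release.

```python
from enum import Enum


class SpeechAct(Enum):
    QUESTION = "question"
    COMMAND = "command"
    RHETORICAL_QUESTION = "rhetorical question"
    RHETORICAL_COMMAND = "rhetorical command"
    STATEMENT = "statement"
    INTONATION_DEPENDENT = "intonation-dependent utterance"
    FRAGMENT = "fragment"


# Assumption for this sketch: only genuine directives receive a keyphrase.
DIRECTIVES = {SpeechAct.QUESTION, SpeechAct.COMMAND}


def select_directives(labeled_utterances):
    """Keep the (text, act) pairs whose act is a question or a command."""
    return [(text, act) for text, act in labeled_utterances if act in DIRECTIVES]


sample = [
    ("방카슈랑스란 무엇입니까", SpeechAct.QUESTION),
    ("오늘 아침에 사고났대.", SpeechAct.STATEMENT),
    ("물이 끓으면 불을 제일 약한 걸로 돌려줘", SpeechAct.COMMAND),
]
selected = select_directives(sample)
print(len(selected))
```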
Data augmentation
• Generating questions and commands from keyphrases
▪ Prototype model [Cho et al., 2018] lacks alternative questions, prohibitions, and strong requirements
▪ These are scarce within the corpus, but frequently used in real life
• Augmentation is required! But how?
Data augmentation
• Generating questions and commands from keyphrases
▪ For a discourse component (keyphrase) of a statement, we can think of many surface realizations:
오늘 아침 사고 발생 (사고 남) →
• 오늘 아침에 사고났대.
• 오늘 아침에 사고났다던데.
• 그거 알아? 오늘 아침 사고난거.
• 사고 났다더라구 오늘 아침에.
▪ Similarly for questions and commands:
• Question set >> Question?
• To-do list >> Command!
• Generating questions/commands differs from expressing a thought in interrogative/imperative sentence form
Data augmentation
• Generating questions and commands from keyphrases
 Question/command types in need:
• Alternative Q, Prohibition, Strong requirement (deficit)
• Wh-question (more required for practical usage)
 Phrases that are prepared:
• Total phrase #: 2,000
– 400 for alternative Q
– 800 for wh-Q
– 400 for prohibition
– 400 for strong requirement
• Sentences to be generated per phrase: 10
• Topics:
– 1,000 phrases for free topic
– 250 phrases for mail, house control, schedule, and weather each
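The counts above can be checked with a few lines of arithmetic. The variable names are hypothetical. The last figure is only an upper bound on generated sentences before any filtering, obtained by multiplying the listed numbers; it is not a reported corpus size.

```python
# Phrase counts per type, as listed on the slide.
PHRASES_PER_TYPE = {
    "alternative question": 400,
    "wh-question": 800,
    "prohibition": 400,
    "strong requirement": 400,
}
SENTENCES_PER_PHRASE = 10

# Phrase counts per topic, as listed on the slide.
PHRASES_PER_TOPIC = {
    "free": 1000,
    "mail": 250,
    "house control": 250,
    "schedule": 250,
    "weather": 250,
}

total_by_type = sum(PHRASES_PER_TYPE.values())
total_by_topic = sum(PHRASES_PER_TOPIC.values())
max_sentences = total_by_type * SENTENCES_PER_PHRASE  # upper bound before filtering

print(total_by_type, total_by_topic, max_sentences)
```

Both breakdowns sum to the stated total of 2,000 phrases, so at most 20,000 sentences are generated.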
 Leaves only the utterances with the consensus of more that 3 natives
Data augmentation
• Generating questions and commands from keyphrases
 Guideline for the participants
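• Write the ten sentences in styles that differ from each other as much as possible. Style here covers politeness level (honorifics), tone, and so on.
• You need not repeat the words in the keyphrase; other words, phrases, or predicates that fit the situation may be used. The expressions should be suitable for spoken language.
• Pursuing variety in sentence form through inversion is also encouraged.
• A wh-question must contain a wh-word, and one may also be inserted in an alternative question depending on the case. Neither type has to be written as an interrogative sentence.
• A prohibition must be a sentence that makes the addressee refrain from some action they could take, and must be more forceful than merely saying that it is fine not to do it. If prohibiting the action is practically equivalent to requiring another action, substituting that expression is not a big problem.
• Neither prohibitions nor strong requirements need to be imperatives, but they must aim to stop or force the addressee's action. Strong recommendations are also acceptable.
• For keyphrases that include the speaker/addressee, use the corresponding pronoun expressions. In this way, both a corpus that includes speaker/addressee expressions and one that does not …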
Data augmentation
• Generating questions and commands from keyphrases
 Will be distributed via
 The baseline system for automatic extraction is yet to be developed!
• Application of the concept “keyphrase”
 Analysis of questions and commands in human-friendly conversation
• Classification of non-canonical directive utterances
• Pre-processing for the semantic parsing of non-canonical utterances
• Making up an answer that continues the dialog
– e.g., 오늘 비 언제까지 온대냐? ("How long is it supposed to rain today?") >> 오늘 비 오는 시간대가 궁금하신가요? ("Are you asking about the hours it will rain today?")
– (If inferred correctly...)
▪ As the core content of an utterance
• For an efficient semantic web search (e.g., 방카슈랑스, "bancassurance")
• For efficient human generation of paraphrases
– More human-friendly than SQL (non-natural-language terms) or back-translation (requires multilingual ability)
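The dialog-continuation example above (keyphrase 오늘 비 오는 시간대 turned into a confirmation question) can be sketched as a simple template. This is an assumption-laden illustration, not the authors' system: the template and function names are hypothetical, and the keyphrase is assumed to be given and to end in a precomposed (NFC) Hangul syllable. The subject particle 이/가 is chosen by checking whether the last syllable has a final consonant, using the standard Unicode layout of the Hangul Syllables block.

```python
def has_final_consonant(syllable):
    """True if a precomposed Hangul syllable ends in a consonant (jongseong)."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code <= 11171:  # outside the Hangul Syllables block (가..힣)
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    return code % 28 != 0       # 28 final-consonant slots; slot 0 means none


def confirmation_question(keyphrase):
    """Build a '<keyphrase>이/가 궁금하신가요?' reply (hypothetical template)."""
    particle = "이" if has_final_consonant(keyphrase[-1]) else "가"
    return f"{keyphrase}{particle} 궁금하신가요?"


print(confirmation_question("오늘 비 오는 시간대"))
print(confirmation_question("내일 서울 강수량"))
```

The first call reproduces the reply shown on the slide; the second applies the same template to the precipitation keyphrase from the introduction.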
• Future work
 Implementation of automatic keyphrase extraction system
 Extension to paraphrasing or sentence similarity task
Reference (order of appearance)
• Cheng, J., & Lapata, M. (2016). Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
• Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
• Bae, S., Kim, T., Kim, J., & Lee, S. G. (2019). Summary Level Training of Sentence Rewriting for Abstractive Summarization. arXiv preprint arXiv:1909.08752.
• Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
• Mallinson, J., Sennrich, R., & Lapata, M. (2017, April). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (pp. 881-893).
• Portner, P. (2004, September). The semantics of imperatives within a theory of clause types. In Semantics and linguistic theory (Vol. 14, pp. 235-252).
• Cho, W. I., Lee, H. S., Yoon, J. W., Kim, S. M., & Kim, N. S. (2018). Speech Intention Understanding in a Head-final Language: A Disambiguation Utilizing Intonation-dependency. arXiv preprint
• Cho, W. I., Moon, Y. K., Kang, W. H., & Kim, N. S. (2018). Extracting Arguments from Korean Question and Command: An Annotated Corpus for Structured Paraphrasing. arXiv preprint arXiv:1810.04631.
Thank you!

More Related Content

What's hot

한국어 띄어쓰기 프로그램 도전기
한국어 띄어쓰기 프로그램 도전기한국어 띄어쓰기 프로그램 도전기
한국어 띄어쓰기 프로그램 도전기
Ted Taekyoon Choi
Yoshio Hanawa
Context2Vec 기반 단어 의미 중의성 해소, Word Sense Disambiguation
Context2Vec 기반 단어 의미 중의성 해소, Word Sense DisambiguationContext2Vec 기반 단어 의미 중의성 해소, Word Sense Disambiguation
Context2Vec 기반 단어 의미 중의성 해소, Word Sense Disambiguation
찬희 이
[211] 네이버 검색과 데이터마이닝
[211] 네이버 검색과 데이터마이닝[211] 네이버 검색과 데이터마이닝
[211] 네이버 검색과 데이터마이닝
[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로
[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로
[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로
Oracle Korea
Deep contextualized word representations
Deep contextualized word representationsDeep contextualized word representations
Deep contextualized word representations
Junya Kamura
Elasticsearch development case
Elasticsearch development caseElasticsearch development case
Elasticsearch development case
일규 최
시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...
시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...
시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...
Jinwon Park
inter seminar インゼミ資料
inter seminar インゼミ資料inter seminar インゼミ資料
inter seminar インゼミ資料
Osaka-univ Yasuda Seminar 安田ゼミ
[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?
Supervised Machine Learning of Elastic Stack
Supervised Machine Learning of Elastic StackSupervised Machine Learning of Elastic Stack
Supervised Machine Learning of Elastic Stack
Hiroshi Yoshioka
Systems Engineering Management Plan (SEMP) for a standard fisher boat
Systems Engineering Management Plan (SEMP) for a standard fisher boatSystems Engineering Management Plan (SEMP) for a standard fisher boat
Systems Engineering Management Plan (SEMP) for a standard fisher boat
Jinwon Park
구문과 의미론(정적 의미론까지)
구문과 의미론(정적 의미론까지)구문과 의미론(정적 의미론까지)
구문과 의미론(정적 의미론까지)
Nam Hyeonuk
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
HeeJung Hwang
한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기
한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기
한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기
Hangil Kim
Python을 활용한 챗봇 서비스 개발 1일차
Python을 활용한 챗봇 서비스 개발 1일차Python을 활용한 챗봇 서비스 개발 1일차
Python을 활용한 챗봇 서비스 개발 1일차
Taekyung Han
Tomoyuki Kajiwara
의존 구조 분석기, Dependency parser
의존 구조 분석기, Dependency parser의존 구조 분석기, Dependency parser
의존 구조 분석기, Dependency parser
찬희 이

What's hot (20)

한국어 띄어쓰기 프로그램 도전기
한국어 띄어쓰기 프로그램 도전기한국어 띄어쓰기 프로그램 도전기
한국어 띄어쓰기 프로그램 도전기
Context2Vec 기반 단어 의미 중의성 해소, Word Sense Disambiguation
Context2Vec 기반 단어 의미 중의성 해소, Word Sense DisambiguationContext2Vec 기반 단어 의미 중의성 해소, Word Sense Disambiguation
Context2Vec 기반 단어 의미 중의성 해소, Word Sense Disambiguation
[211] 네이버 검색과 데이터마이닝
[211] 네이버 검색과 데이터마이닝[211] 네이버 검색과 데이터마이닝
[211] 네이버 검색과 데이터마이닝
[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로
[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로
[Main Session] 미래의 Java 미리보기 - 앰버와 발할라 프로젝트를 중심으로
Deep contextualized word representations
Deep contextualized word representationsDeep contextualized word representations
Deep contextualized word representations
Elasticsearch development case
Elasticsearch development caseElasticsearch development case
Elasticsearch development case
시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...
시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...
시스템공학 기본(Fundamental of systems engineering) - Day4 functional analysis and a...
inter seminar インゼミ資料
inter seminar インゼミ資料inter seminar インゼミ資料
inter seminar インゼミ資料
[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?[223]기계독해 QA: 검색인가, NLP인가?
[223]기계독해 QA: 검색인가, NLP인가?
Supervised Machine Learning of Elastic Stack
Supervised Machine Learning of Elastic StackSupervised Machine Learning of Elastic Stack
Supervised Machine Learning of Elastic Stack
Systems Engineering Management Plan (SEMP) for a standard fisher boat
Systems Engineering Management Plan (SEMP) for a standard fisher boatSystems Engineering Management Plan (SEMP) for a standard fisher boat
Systems Engineering Management Plan (SEMP) for a standard fisher boat
구문과 의미론(정적 의미론까지)
구문과 의미론(정적 의미론까지)구문과 의미론(정적 의미론까지)
구문과 의미론(정적 의미론까지)
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
차곡차곡 쉽게 알아가는 Elasticsearch와 Node.js
한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기
한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기
한국어 문서 추출요약 AI 경진대회- 좌충우돌 후기
Python을 활용한 챗봇 서비스 개발 1일차
Python을 활용한 챗봇 서비스 개발 1일차Python을 활용한 챗봇 서비스 개발 1일차
Python을 활용한 챗봇 서비스 개발 1일차
의존 구조 분석기, Dependency parser
의존 구조 분석기, Dependency parser의존 구조 분석기, Dependency parser
의존 구조 분석기, Dependency parser

Similar to 1910 HCLT

SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
Alp Öktem
Towards speech intention understanding in korean
Towards speech intention understanding in koreanTowards speech intention understanding in korean
Towards speech intention understanding in korean
NAVER Engineering
Using and learning phrases
Using and learning phrasesUsing and learning phrases
Using and learning phrases
Cassandra Jacobs
Warnikchow - Naver Tech Talk - 3i4k
Warnikchow - Naver Tech Talk - 3i4kWarnikchow - Naver Tech Talk - 3i4k
Warnikchow - Naver Tech Talk - 3i4k
WarNik Chow
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Dynamic Lexical Acquisition in Chinese Sentence AnalysisDynamic Lexical Acquisition in Chinese Sentence Analysis
Dynamic Lexical Acquisition in Chinese Sentence AnalysisAndi Wu
Warnikchow - SAIT - 0529
Warnikchow - SAIT - 0529Warnikchow - SAIT - 0529
Warnikchow - SAIT - 0529
WarNik Chow
1910 JK27
1910 JK271910 JK27
1910 JK27
WarNik Chow
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
Lidia Pivovarova
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
Qi He
Planning and writing assignments (business example) 2021.pptx
Planning and writing assignments (business example) 2021.pptxPlanning and writing assignments (business example) 2021.pptx
Planning and writing assignments (business example) 2021.pptx
Trevor Haugh
Shuhei Otani
Amharic WSD using WordNet
Amharic WSD using WordNetAmharic WSD using WordNet
Amharic WSD using WordNet
Seid Hassen
Planning and writing assignments (business example)
Planning and writing assignments (business example)Planning and writing assignments (business example)
Planning and writing assignments (business example)
Principles of instruction and feedback for erasmus
Principles of instruction and feedback for erasmusPrinciples of instruction and feedback for erasmus
Principles of instruction and feedback for erasmus
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Matthew Lease
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
Roee Aharoni - 2017 - Towards String-to-Tree Neural Machine Translation
Roee Aharoni - 2017 - Towards String-to-Tree Neural Machine TranslationRoee Aharoni - 2017 - Towards String-to-Tree Neural Machine Translation
Roee Aharoni - 2017 - Towards String-to-Tree Neural Machine Translation
Association for Computational Linguistics
A. InstructionsRemember the word argument” does not mean a fi.docx
A. InstructionsRemember the word argument” does not mean a fi.docxA. InstructionsRemember the word argument” does not mean a fi.docx
A. InstructionsRemember the word argument” does not mean a fi.docx

Similar to 1910 HCLT (20)

SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
SLSP 2017 presentation - Attentional Parallel RNNs for Generating Punctuation...
Towards speech intention understanding in korean
Towards speech intention understanding in koreanTowards speech intention understanding in korean
Towards speech intention understanding in korean
Using and learning phrases
Using and learning phrasesUsing and learning phrases
Using and learning phrases
Warnikchow - Naver Tech Talk - 3i4k
Warnikchow - Naver Tech Talk - 3i4kWarnikchow - Naver Tech Talk - 3i4k
Warnikchow - Naver Tech Talk - 3i4k
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Dynamic Lexical Acquisition in Chinese Sentence AnalysisDynamic Lexical Acquisition in Chinese Sentence Analysis
Dynamic Lexical Acquisition in Chinese Sentence Analysis
Warnikchow - SAIT - 0529
Warnikchow - SAIT - 0529Warnikchow - SAIT - 0529
Warnikchow - SAIT - 0529
1910 JK27
1910 JK271910 JK27
1910 JK27
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems[KDD 2018 tutorial] End to-end goal-oriented question answering systems
[KDD 2018 tutorial] End to-end goal-oriented question answering systems
Planning and writing assignments (business example) 2021.pptx
Planning and writing assignments (business example) 2021.pptxPlanning and writing assignments (business example) 2021.pptx
Planning and writing assignments (business example) 2021.pptx
Amharic WSD using WordNet
Amharic WSD using WordNetAmharic WSD using WordNet
Amharic WSD using WordNet
Planning and writing assignments (business example)
Planning and writing assignments (business example)Planning and writing assignments (business example)
Planning and writing assignments (business example)
Principles of instruction and feedback for erasmus
Principles of instruction and feedback for erasmusPrinciples of instruction and feedback for erasmus
Principles of instruction and feedback for erasmus
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & OpportunitiesDeep Learning for Information Retrieval: Models, Progress, & Opportunities
Deep Learning for Information Retrieval: Models, Progress, & Opportunities
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
Roee Aharoni - 2017 - Towards String-to-Tree Neural Machine Translation
Roee Aharoni - 2017 - Towards String-to-Tree Neural Machine TranslationRoee Aharoni - 2017 - Towards String-to-Tree Neural Machine Translation
Roee Aharoni - 2017 - Towards String-to-Tree Neural Machine Translation
A. InstructionsRemember the word argument” does not mean a fi.docx
A. InstructionsRemember the word argument” does not mean a fi.docxA. InstructionsRemember the word argument” does not mean a fi.docx
A. InstructionsRemember the word argument” does not mean a fi.docx

More from WarNik Chow

WarNik Chow
2311 EAAMO
2311 EAAMO2311 EAAMO
2311 EAAMO
WarNik Chow
2211 HCOMP
2211 HCOMP2211 HCOMP
2211 HCOMP
WarNik Chow
WarNik Chow
2211 AACL
2211 AACL2211 AACL
2211 AACL
WarNik Chow
2210 CODI
2210 CODI2210 CODI
2210 CODI
WarNik Chow
2206 FAccT_inperson
2206 FAccT_inperson2206 FAccT_inperson
2206 FAccT_inperson
WarNik Chow
2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset
WarNik Chow
2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e
WarNik Chow
WarNik Chow
2106 JWLLP
2106 JWLLP2106 JWLLP
2106 JWLLP
WarNik Chow
2106 ACM DIS
2106 ACM DIS2106 ACM DIS
2106 ACM DIS
WarNik Chow
2104 Talk @SSU
2104 Talk @SSU2104 Talk @SSU
2104 Talk @SSU
WarNik Chow
2103 ACM FAccT
2103 ACM FAccT2103 ACM FAccT
2103 ACM FAccT
WarNik Chow
2102 Redone seminar
2102 Redone seminar2102 Redone seminar
2102 Redone seminar
WarNik Chow
2011 NLP-OSS
2011 NLP-OSS2011 NLP-OSS
2011 NLP-OSS
WarNik Chow
WarNik Chow
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
WarNik Chow
2010 HCLT Hate Speech
2010 HCLT Hate Speech2010 HCLT Hate Speech
2010 HCLT Hate Speech
WarNik Chow
2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP
WarNik Chow

More from WarNik Chow (20)

2311 EAAMO
2311 EAAMO2311 EAAMO
2311 EAAMO
2211 HCOMP
2211 HCOMP2211 HCOMP
2211 HCOMP
2211 AACL
2211 AACL2211 AACL
2211 AACL
2210 CODI
2210 CODI2210 CODI
2210 CODI
2206 FAccT_inperson
2206 FAccT_inperson2206 FAccT_inperson
2206 FAccT_inperson
2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset2204 Kakao talk on Hate speech dataset
2204 Kakao talk on Hate speech dataset
2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e2108 [LangCon2021] kosp2e
2108 [LangCon2021] kosp2e
2106 JWLLP
2106 JWLLP2106 JWLLP
2106 JWLLP
2106 ACM DIS
2106 ACM DIS2106 ACM DIS
2106 ACM DIS
2104 Talk @SSU
2104 Talk @SSU2104 Talk @SSU
2104 Talk @SSU
2103 ACM FAccT
2103 ACM FAccT2103 ACM FAccT
2103 ACM FAccT
2102 Redone seminar
2102 Redone seminar2102 Redone seminar
2102 Redone seminar
2011 NLP-OSS
2011 NLP-OSS2011 NLP-OSS
2011 NLP-OSS
2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories2010 PACLIC - pay attention to categories
2010 PACLIC - pay attention to categories
2010 HCLT Hate Speech
2010 HCLT Hate Speech2010 HCLT Hate Speech
2010 HCLT Hate Speech
2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP2009 DevC Seongnam - NLP
2009 DevC Seongnam - NLP

Recently uploaded

ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
addressing modes in computer architecture
addressing modes  in computer architectureaddressing modes  in computer architecture
addressing modes in computer architecture
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya

Recently uploaded (20)

ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
addressing modes in computer architecture
addressing modes  in computer architectureaddressing modes  in computer architecture
addressing modes in computer architecture
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Architectural Portfolio Sean Lockwood
Architectural Portfolio Sean LockwoodArchitectural Portfolio Sean Lockwood
Architectural Portfolio Sean Lockwood
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf

1910 HCLT

  • 1. Human Interface Laboratory 담화 성분을 활용한 지시 발화의 키 프레이즈 추출: 한국어 병렬 코퍼스 구축 및 데이터 증강 방법론 2019. 10. 12 @HCLT 2019 조원익, 문영기, 김종인, 김남수
  • 2. Contents • Introduction  What is keyphrase? Keyphrase vs. Summary  What is keyphrase for directives? • Related work  Keyphrase extraction, sentence generation, and paraphrasing  SQL, bilingual pivoting (BP), and discourse component (DC) • Corpus construction • Dataset augmentation • Summary  Application  Future work 1
  • 3. Introduction
      • What is a keyphrase?
          ▪ A keyphrase as a set of words that stands for a document
              • e.g., keywords (topic words) for an abstract
                  – These can be combined into phrases
                      » "Discourse component-based keyphrase extraction", "Korean parallel corpus for paraphrasing"
              • But remember: keyphrases are also phrases!
                  – Do they hold only for a document, or also for short texts such as sentences?
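  • 4. Introduction
      • What is a keyphrase?
          ▪ A keyphrase as a phrase that summarizes a sentence
              • e.g., extractive summarization, sometimes accompanied by paraphrasing
                  – 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To answer what many of you have been asking: this year the Hangul and Korean information processing conference (HCLT) will be held at KAIST from October 12 to 13.")
                    → 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 ("HCLT held at KAIST from October 12 to 13 this year")
                  – 오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다. ("The club after-party is scheduled for eight o'clock this evening at Punggyeongsori near Seoul National University Station.")
                    → 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정 ("Club after-party scheduled for 20:00 today at Punggyeongsori, Seoul National University Station")
              • Remember that paraphrasing is like monolingual translation (there is no single exact answer!)
          ▪ Keyphrase candidates are expected to make up a smaller space than the original sentences do!
              • 오늘 아침에 사고났대. ("I heard there was an accident this morning.")
              • 오늘 아침에 사고났다던데. ("They say there was an accident this morning.")
              • 그거 알아? 오늘 아침 사고난거. ("Did you know? There was an accident this morning.")
              • 사고 났다더라구 오늘 아침에. ("There was an accident, I heard, this morning.")
              → 오늘 아침 사고 발생 (사고 남) ("Accident occurred this morning")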
  • 5. Introduction
      • Keyphrase vs. summary
          ▪ Conventionally, summarization of a document is either:
              • Extractive [Cheng and Lapata, 2016]
                  – Documents have several candidate sentences
              • Abstractive [Rush et al., 2015]
                  – Documents without a representative sentence can be summarized abstractively
              • Hybrid methodologies are in progress [Bae et al., 2019]
          ▪ In keyphrase extraction from sentences:
              • Both extractive and abstractive approaches can be used
                  – Extractive: for the keywords
                  – Abstractive: for a plausible expression (sentence style, word-level paraphrasing)
              • Example: 오늘 저녁 여덟 시에 서울대입구 풍경소리에서 동아리 뒷풀이가 있을 예정입니다. ("The club after-party is scheduled for eight o'clock this evening at Punggyeongsori near Seoul National University Station.")
                → 오늘 이십 시 서울대입구 풍경소리에서 동아리 뒷풀이 예정 ("Club after-party scheduled for 20:00 today at Punggyeongsori, Seoul National University Station")
  • 6. Introduction
      • Keyphrases for directives (questions/commands)?
          ▪ What should the keyphrases be?
              • For questions: what the speaker asks for
                  – 내일 서울에 비 얼마나 올지 좀 검색해봐. ("Look up how much it will rain in Seoul tomorrow.")
                    → 질문: 내일 서울 강수량 ("Question: tomorrow's precipitation in Seoul")
              • For commands: what the speaker requests
                  – 물이 끓으면 불을 제일 약한 걸로 돌려줘 ("When the water boils, turn the heat down to the lowest setting.")
                    → 요구: 물이 끓으면 불을 제일 약한 것으로 하기 ("Request: setting the heat to the lowest when the water boils")
              • A simplified but representative nominalized version of the core content
              • Keyphrases are sometimes longer than the original sentence, which is why the process differs from summarization
              • Discourse component revisited!
  • 7. Introduction
      • Research questions
          ▪ How does the discourse component (DC) compare with structured query language (SQL) and bilingual pivoting (BP) from the viewpoint of paraphrasing?
          ▪ How can we extract the keyphrase from a directive utterance in the form of a DC?
          ▪ How can DC be utilized in making up paraphrases of questions and commands?
  • 8. Related work
      • Keyphrase extraction, sentence generation, and paraphrasing
          ▪ [Diagram] Original sentence → Paraphrase: bilingual pivoting / word swapping / human paraphrase
          ▪ [Diagram] Original sentence → Core content (SQL or keyphrase): Seq2SQL / keyphrase extraction
          ▪ [Diagram] Core content (SQL or keyphrase) → Paraphrase: rule-based / learning-based / human generation
  • 9. Related work
      • 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To answer what many of you have been asking: this year the Hangul and Korean information processing conference (HCLT) will be held at KAIST from October 12 to 13.")
          – How can we obtain the core content for paraphrasing (possibly by a human)?
      • Structured query language (SQL) [Zhong et al., 2017]
          ▪ {기간: 올해 시월 십이일부터 십삼일, 장소: 카이스트, 이벤트: 한글 및 한국어 정보처리 학술대회} ({period: October 12 to 13 this year, place: KAIST, event: HCLT conference})
              • A kind of semantic parsing
              • Structured extraction of information is available
              • Human-friendly data generation is not guaranteed
              • Categorization can be limited
      • Bilingual pivoting (BP) [Mallinson et al., 2017]
          ▪ "As many of you may have waited for, we hold HCLT conference at KAIST from twelfth to thirteenth upcoming October."
              • Back-translation through other languages may yield various expressions
              • One-to-one correspondence does not help extract the core content of the sentence
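To make the contrast on this slide concrete, the sketch below holds the announcement as a slot-value record (in the spirit of the SQL-style extraction) and then flattens it into a natural-language keyphrase. It is only an illustration: the `EventSlots` class, its field names, and the `slots_to_keyphrase` template are hypothetical and come neither from the slides nor from Seq2SQL.

```python
from dataclasses import dataclass


@dataclass
class EventSlots:
    """Hypothetical slot-value record for the announcement example."""
    period: str
    place: str
    event: str


def slots_to_keyphrase(slots: EventSlots) -> str:
    """Flatten the slots into a nominalized, keyphrase-like string.

    A fixed template is an assumption made for illustration; the slides
    rely on human annotators to write keyphrases.
    """
    return f"{slots.event} held at {slots.place}, {slots.period}"


slots = EventSlots(
    period="October 12 to 13 this year",
    place="KAIST",
    event="HCLT conference",
)
keyphrase = slots_to_keyphrase(slots)
print(keyphrase)
```

The record is easy to query field by field, while the flattened string is closer to something a person would read or paraphrase from, which is the trade-off the slide points at.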
  • 10. Related work
      • 많이들 궁금해하셨던 내용을 알려드리면, 올해에는 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회가 개최됩니다. ("To answer what many of you have been asking: this year the Hangul and Korean information processing conference (HCLT) will be held at KAIST from October 12 to 13.")
          – How can we obtain the core content for paraphrasing (possibly by a human)?
      • Discourse component [Portner, 2004]
          ▪ This approach involves human generation, but can be efficient
              • e.g., the following can be the discourse component of the declarative above:
                  – 올해 시월 십이일부터 십삼일까지 카이스트에서 한글 및 한국어 정보처리 학술대회 개최 ("HCLT held at KAIST from October 12 to 13 this year") (Common Ground)
              • Core content information in a monolingual natural-language format
  • 11. Corpus construction
      • Annotating keyphrases on a Korean corpus regarding speech act
          – How can it be utilized?
          – [Diagram] 방카슈랑스란 무엇입니까 ("What is bancassurance?") → Intention identification → Question? → Keyphrase extraction → 방카슈랑스의 의미 ("the meaning of bancassurance")
  • 12. Corpus construction
      • Annotating keyphrases on a Korean corpus regarding speech act
          ▪ Corpus: Intention identification for Korean (3i4K) [Cho et al., 2018]
          ▪ Composition
              • Question
              • Command
              • Rhetorical question
              • Rhetorical command
              • Statement
              • Intonation-dependent utterances
              • Fragments
          ▪ Includes only utterances whose speech act determination was not affected by the sentence form
              • Utterances are non-canonical and colloquial
              • Includes various topics and situations
  • 13. Corpus construction
      • Annotating keyphrases on a Korean corpus regarding speech act
  • 14. Data augmentation
      • Generating questions and commands from keyphrases
          ▪ The prototype model [Cho et al., 2018] lacks alternative questions, prohibitions, and strong requirements
          ▪ These are scarce within the corpus, but frequently used in real life
              • Augmentation is required! But how?
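  • 15. Data augmentation
      • Generating questions and commands from keyphrases
          ▪ For the discourse component (keyphrase) of a statement, we can think of:
              오늘 아침 사고 발생 (사고 남) ("Accident occurred this morning")
              • 오늘 아침에 사고났대. ("I heard there was an accident this morning.")
              • 오늘 아침에 사고났다던데. ("They say there was an accident this morning.")
              • 그거 알아? 오늘 아침 사고난거. ("Did you know? There was an accident this morning.")
              • 사고 났다더라구 오늘 아침에. ("There was an accident, I heard, this morning.")
          ▪ Similarly for questions and commands:
              • Question set >> Question?
              • To-do list >> Command!
              • Generating questions/commands differs from expressing a thought in the interrogative/imperative sentence form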
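The one-keyphrase-to-many-utterances relation on this slide can be sketched as a small parallel corpus. The code below is a minimal illustration, assuming a plain dictionary layout; the names `parallel_corpus`, `extraction_pairs`, and `paraphrase_pairs` are hypothetical and do not describe the released corpus format. The Korean strings are the slide's own example.

```python
from itertools import combinations

# One keyphrase (discourse component) mapped to the utterances that share it.
parallel_corpus = {
    "오늘 아침 사고 발생": [
        "오늘 아침에 사고났대.",
        "오늘 아침에 사고났다던데.",
        "그거 알아? 오늘 아침 사고난거.",
        "사고 났다더라구 오늘 아침에.",
    ],
}


def extraction_pairs(corpus):
    """(utterance, keyphrase) pairs, usable for a keyphrase extraction task."""
    return [(utt, key) for key, utts in corpus.items() for utt in utts]


def paraphrase_pairs(corpus):
    """Unordered pairs of utterances that share the same keyphrase."""
    return [pair for utts in corpus.values() for pair in combinations(utts, 2)]


print(len(extraction_pairs(parallel_corpus)))  # 4 utterance-keyphrase pairs
print(len(paraphrase_pairs(parallel_corpus)))  # 6 utterance-utterance pairs
```

Under this layout, every new utterance written for a keyphrase adds one extraction pair and pairs up with all existing utterances of that keyphrase, which is one way the same data could serve both extraction and paraphrasing.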
  • 16. Data augmentation
      • Generating questions and commands from keyphrases
          ▪ Question/command types needed:
              • Alternative question, prohibition, strong requirement (deficient in the corpus)
              • Wh-question (more are required for practical usage)
          ▪ Phrases prepared:
              • Total number of phrases: 2,000
                  – 400 for alternative questions
                  – 800 for wh-questions
                  – 400 for prohibitions
                  – 400 for strong requirements
              • Sentences to be generated per phrase: 10
              • Topics:
                  – 1,000 phrases on free topics
                  – 250 phrases each for mail, house control, schedule, and weather
          ▪ Only the utterances with the consensus of more than 3 native speakers are kept
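The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above; the dictionary keys and the `passes_consensus` helper are hypothetical, and reading "more than 3" as strictly greater than 3 approvals is an assumption, since the slide does not say how votes were counted.

```python
# Phrase counts as stated on the slide.
PHRASES_BY_TYPE = {
    "alternative_question": 400,
    "wh_question": 800,
    "prohibition": 400,
    "strong_requirement": 400,
}
PHRASES_BY_TOPIC = {
    "free": 1000,
    "mail": 250,
    "house_control": 250,
    "schedule": 250,
    "weather": 250,
}
SENTENCES_PER_PHRASE = 10

total_phrases = sum(PHRASES_BY_TYPE.values())
topic_total = sum(PHRASES_BY_TOPIC.values())
# Upper bound on generated sentences, before the consensus filter.
max_sentences = total_phrases * SENTENCES_PER_PHRASE


def passes_consensus(approvals: int, threshold: int = 3) -> bool:
    """Assumption: 'more than 3 natives' is read literally as approvals > 3."""
    return approvals > threshold


print(total_phrases, topic_total, max_sentences)
```

Both breakdowns sum to the same 2,000 phrases, so at 10 sentences per phrase the collection yields at most 20,000 sentences before filtering.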
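  • 17. Data augmentation
      • Generating questions and commands from keyphrases
          ▪ Guideline for the participants
              • Write the ten sentences in styles as different from one another as possible. Style here covers politeness (use of honorifics), tone, and so on.
              • There is no need to repeat the words in the keyphrase; other words, phrases, or predicates that fit the situation may be used. Expressions should be suitable for spoken utterance.
              • Varying the sentence form through inversion is also encouraged.
              • Wh-questions must contain a wh-word, and alternative questions may also contain one in some cases. Neither type has to be written as an interrogative sentence.
              • A prohibition must keep the addressee from performing some action they could perform, and must be more forceful than merely saying that it is fine not to do it. If prohibiting the action is practically equivalent to requesting another action, substituting that expression is not a problem.
              • Neither prohibitions nor strong requirements have to be imperative sentences, but they must aim to prevent or compel the addressee's action. Strong suggestions are also allowed.
              • For keyphrases that include the speaker or addressee, use the corresponding pronoun expressions. This yields both a corpus that contains speaker/addressee expressions and one that does not.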
  • 18. Data augmentation
      • Generating questions and commands from keyphrases
  • 19. Data augmentation
      • Generating questions and commands from keyphrases
          ▪ Will be distributed via
          ▪ The baseline system for automatic extraction is yet to be developed!
  • 20. Summary
      • Application of the concept "keyphrase"
          ▪ Analysis of questions and commands in human-friendly conversation
              • Classification of non-canonical directive utterances
              • Pre-processing for the semantic parsing of non-canonical utterances
              • Making up an answer that continues the dialog
                  – e.g., 오늘 비 언제까지 온대냐? ("Until when is it supposed to rain today?") >> 오늘 비 오는 시간대가 궁금하신가요? ("Are you asking about the hours when it will rain today?")
                  – (If inferred correctly...)
          ▪ As the core content of an utterance
              • For efficient semantic web search (방카슈랑스? "bancassurance?")
              • For efficient human generation of paraphrases
                  – More human-friendly than SQL (non-natural-language terms) or back-translation (requires multilingual ability)
      • Future work
          ▪ Implementation of an automatic keyphrase extraction system
          ▪ Extension to paraphrasing or sentence similarity tasks
  • 21. References (in order of appearance)
      • Cheng, J., & Lapata, M. (2016). Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
      • Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
      • Bae, S., Kim, T., Kim, J., & Lee, S. G. (2019). Summary level training of sentence rewriting for abstractive summarization. arXiv preprint arXiv:1909.08752.
      • Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.
      • Mallinson, J., Sennrich, R., & Lapata, M. (2017, April). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (pp. 881-893).
      • Portner, P. (2004, September). The semantics of imperatives within a theory of clause types. In Semantics and Linguistic Theory (Vol. 14, pp. 235-252).
      • Cho, W. I., Lee, H. S., Yoon, J. W., Kim, S. M., & Kim, N. S. (2018). Speech intention understanding in a head-final language: A disambiguation utilizing intonation-dependency. arXiv preprint arXiv:1811.04231.
      • Cho, W. I., Moon, Y. K., Kang, W. H., & Kim, N. S. (2018). Extracting arguments from Korean question and command: An annotated corpus for structured paraphrasing. arXiv preprint arXiv:1810.04631.
