2006 kakao brain NLP colloquium

Human Interface Laboratory
Discourse component to sentence (DC2S):
An efficient human-aided construction of
paraphrase and sentence similarity dataset
2020. 06. 08 @Kakaobrain
Won Ik Cho (Warnik Chow)

Contents
• Introduction
• Previous studies
• Ongoing work
• Data augmentation
• Result, discussion, and afterward

Introduction
• Won Ik Cho (조원익)
▪ B.S. in EE/Mathematics (SNU, ’10~’14)
▪ Ph.D. student (SNU ECE, ’14 Autumn~)
• Academic background
▪ EE folk interested in mathematics
▪ Early years in a speech processing lab
• Source separation
• Voice activity & endpoint detection
• Automatic music composition
▪ Currently studying
• Computational linguistics
• Spoken language processing

Previous studies
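• Development of core technologies for free-running embedded natural-language conversational speech
recognition for robots / Industrial Core Technology Development Program, Ministry of Trade, Industry and Energy (2017. 4. 1 ~ 2020. 12. 31)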

Previous studies
• Development of free-running speech recognition technologies for
embedded robot system (funded by MOTIE)
▪ Full project title: development of core technologies for free-running embedded natural-language conversational speech recognition for robots
• In other words:
– A speech understanding system that does not rely on a wake-up word
» ...?
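[Figure: example utterances]
– "It dropped again today." (오늘 또 떨어졌네)
– "How many days in a row has it been blue now?" (이게 대체 며칠째 파란불이냐; blue marks a falling price)
– "How much is the loss right now?" (지금 손실이 얼마지)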

Previous studies
• Intention understanding and syntactic ambiguity resolution
▪ Related to many aspects of (speaker-dependent) speech recognition
• Speaker-dependency (in terms of a personal assistant)
• Noisy far-talk recognition and beamforming
• Speech intention understanding
– To which utterances should AI react?
▪ Speech intention understanding
• Defining what ‘intention’ is
– Discourse components
– Speech act
– Rhetoricalness
• Drawing up an annotation guideline
• Introducing phonetic features
– Intonation-dependency
– Sentence-final intonations
– Multimodal approaches

Previous studies
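[Figure: annotation decision flow for speech intention. Target: a single sentence without context or punctuation. Branches are labeled Yes / No / Otherwise.]
• Decision questions in the figure
– Is it a single sentence?
» Compound sentence: focus on the speech act with the stronger force (different sentences are also treated as one sentence when they share a topic)
– Does it contain a full clause?
– Is intonation information needed?
– Can it be decided with intonation information?
– Is there a question set, and is the hearer's answer required?
– Is an effective to-do list imposed on the hearer?
• Labels in the figure
– Fragments (FR)
– Context-dependent (CD)
– Intonation-dependent (ID)
– Questions (Questions / Embedded form)
– Rhetorical questions (RQ)
– Commands (Requirements / Prohibitions)
– Rhetorical commands (RC)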

Previous studies
▪ IAA: 0.85 (Fleiss’ Kappa) with three native annotators of Seoul Korean
• Manual tagging on spoken language corpus
– W. I. Cho, H. S. Lee, J. W. Yoon, S. M. Kim, and N. S. Kim, "Speech intention
understanding in a head-final language: A disambiguation utilizing
intonation-dependency," arXiv preprint arXiv:1811.04231, Nov. 2018.
» https://github.com/warnikchow/3i4k
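
For readers unfamiliar with the agreement measure quoted above, here is a minimal sketch of Fleiss' kappa in Python. The function name, the label strings, and the toy ratings are all assumptions made for illustration; this is not the annotation data behind the reported 0.85.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of annotators.

    ratings: list of lists; ratings[i] holds one category label per annotator.
    Undefined (division by zero) when every rating uses a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy ratings from three hypothetical annotators (illustrative labels only).
toy = [
    ["question", "question", "question"],
    ["command", "command", "command"],
    ["question", "question", "command"],
    ["statement", "statement", "statement"],
]
print(round(fleiss_kappa(toy), 4))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.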

Previous studies
▪ W. I. Cho*, J. Cho*, J. Kang, and N. S. Kim, "Prosody-semantics interface in
Seoul Korean: Corpus for a disambiguation of wh- intervention," in Proc.
ICPhS, Aug. 2019, pp. 3902-3906. (Poster)
• https://github.com/warnikchow/prosem

Previous studies
▪ W. I. Cho, J. Cho, W. H. Kang, and N. S. Kim, "Text matters but speech
influences: A computational analysis of syntactic ambiguity resolution," in
Proc. CogSci, 2020 (to appear).
• https://github.com/warnikchow/coaudiotext

Parallel work
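• Argument extraction from non-canonical utterances
▪ Once the intention is identified, what can we do with it?
• What should the keyphrases be?
– for questions: something that the speaker asks for
» "Look up how much it is going to rain in Seoul tomorrow." (내일 서울에 비 얼마나 올지 좀 검색해봐.)
→ Question: tomorrow's precipitation in Seoul (내일 서울 강수량)
– for commands: something that the speaker requests
» "When the water boils, turn the heat to the lowest setting." (물이 끓으면 불을 제일 약한 걸로 돌려줘)
→ Command: setting the heat to the lowest once the water boils (물이 끓으면 불을 제일 약한 것으로 하기)
– A simplified but representative nominalized version of the core content
– Sometimes keyphrases are longer than the original sentence
→ which is why the process differs from summarization
– Discourse component revisited!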

Parallel work
▪ Once the intention is identified, what can we do with it?
• W. I. Cho, Y. K. Moon, W. H. Kang, and N. S. Kim, "Extracting arguments from Korean
question and command: An annotated corpus for structured paraphrasing," arXiv preprint
arXiv:1810.04631, Oct. 2018.
– https://github.com/warnikchow/sae4k

Parallel work
▪ It would be good to have classification and annotation criteria that can be used generally
• Applicable to other languages as well? (to start with, ones with similar syntax ...)
– W. I. Cho, Y. K. Moon, and N. S. Kim, "Discourse component-based argument extraction of Seoul
Korean directives," in Proc. JK 27, Oct. 2019 (to appear). (Poster)

Parallel work
▪ What if we annotate the core content this way?
• W. I. Cho, Y. K. Moon*, S. Moon*, S. M. Kim, and N. S. Kim, "Machines getting with the
program: Understanding intent arguments of non-canonical directives," arXiv preprint
arXiv:1912.00342, 2019.
– A seq2seq approach can extract the intent arguments of non-canonical sentences!

Parallel work
▪ But what matters for structured intent argument extraction?
• Each sentence type – argument pair needs to be prepared in a reasonably balanced way
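
One way to make "reasonably balanced" concrete is to count the pairs per sentence type and flag the types that fall well below the average. The sketch below assumes a hypothetical list of (sentence type, intent argument) tuples and an arbitrary threshold of half the mean count; neither comes from the talk.

```python
from collections import Counter


def deficit_types(pairs, tolerance=0.5):
    """Return sentence types whose pair count is below tolerance x the mean count.

    pairs: iterable of (sentence_type, intent_argument) tuples.
    tolerance: illustrative cut-off, not a value taken from the talk.
    """
    counts = Counter(sentence_type for sentence_type, _ in pairs)
    mean_count = sum(counts.values()) / len(counts)
    return sorted(t for t, c in counts.items() if c < tolerance * mean_count)


# Hypothetical toy corpus: many wh-questions, few alternative questions and prohibitions.
toy_pairs = (
    [("wh-question", f"arg{i}") for i in range(6)]
    + [("requirement", f"arg{i}") for i in range(4)]
    + [("alternative question", "arg0")]
    + [("prohibition", "arg0")]
)
print(deficit_types(toy_pairs))
```

Types returned by such a check are the ones that would need additional utterances.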
How?

Data augmentation
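• How can we efficiently supplement the under-represented types of utterances?
▪ Generate 10 different utterances per intent argument
• An intent argument is a kind of core content! > a latent element in the process of building paraphrases
• "To share what many of you have been asking about: this year, the Hangul and Korean Language
Information Processing conference (HCLT) will be held at KAIST from October twelfth to thirteenth."
• How can we obtain a core content for paraphrasing (possibly by human)?
– Structured query language (SQL)
» {period: October twelfth to thirteenth this year, venue: KAIST, event: Hangul and Korean Language
Information Processing conference}
– Bilingual pivoting (BP) and back-translation (BT)
» “As many of you may have waited for, we hold HCLT conference at KAIST from
twelfth to thirteenth upcoming October.”
– We will provide it in natural language (NL) form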

Data augmentation
▪ Provide the intent argument in natural language form, and let participants freely generate
questions or commands from that argument!
• The creativity of human participants...?

Data augmentation
▪ W. I. Cho, J. I. Kim, Y. K. Moon, and N. S. Kim, "Discourse component to sentence
(DC2S): An efficient human-aided construction of paraphrase and sentence similarity
dataset," in Proc. LREC, May 2020, pp. 6819-6826. (Postponed due to COVID-19
outbreak)
• Four topics!
– Email, smart home, schedule, weather + free topic (not released)
• Four sentence types!
– Alternative Q, Prohibition, Strong requirement (deficit) / Wh- Q (necessary)
• 8 participants of various (?) backgrounds
– Psycholinguistics researcher
– Psychology researcher
– Architecture major
– Mathematics graduate student
– Life science major
– Language studies major
– Linguistics major
– Computer programmer

Data augmentation
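• Instructions to participants
– Write the ten sentences in styles as different from one another as possible. Style here covers honorific use, tone, and so on.
– You need not repeat the words in the keyphrase; other words, phrases, or predicates that fit the situation may be used. The
expressions should be suitable for spoken utterance.
– Seeking variety in sentence form through inversion is also encouraged.
– A wh-question must contain a wh-word, and an alternative question may also contain one depending on the case. Neither
sentence type needs to be written as an interrogative.
– A prohibition must be a sentence that keeps the hearer from doing some action the hearer could do, and it must be more forceful
than merely saying it is fine not to do it. If prohibiting the action is practically equivalent to requiring another action,
replacing it with that expression is not a big problem.
– Neither prohibitions nor strong requirements need to be imperatives, but they must aim to stop or compel the hearer's action.
Strong recommendations are also acceptable.
– For keyphrases that include the speaker/hearer, use the corresponding pronoun expressions. This builds both a corpus that
contains speaker/hearer expressions and one that does not.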

Data augmentation
• Participant feedback + each utterance checked by 3 native Koreans

Data augmentation
• Can we define yet another task from the utterances generated this way?
• Leaving out the free topic, when topic and intention are given as attributes, what relations hold between
the sentences?
– With 1,000 sets of 10 utterances, we can make 45,000 paraphrase pairs and about 500,000 sentence similarity
pairs
» 45,000 = C(10, 2) × 1,000 (two of ten), 499,500 = C(1,000, 2) (two of a thousand)
• Could this become a paper, too...?
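
The pair counts above can be checked with a short sketch. It assumes that paraphrase pairs are all unordered pairs inside one 10-utterance set, and that one sentence similarity pair is formed per unordered pair of sets; picking the representative sentence at random is an assumption of this sketch, since the slides do not say how it is chosen. The data is placeholder text, not the released corpus.

```python
import itertools
import math
import random


def expected_counts(n_sets, set_size):
    """Closed-form counts: within-set pairs, and unordered pairs of sets."""
    within = math.comb(set_size, 2) * n_sets
    across = math.comb(n_sets, 2)
    return within, across


def paraphrase_pairs(utterance_sets):
    """All unordered pairs of utterances that share one intent argument."""
    return [
        pair
        for utterances in utterance_sets
        for pair in itertools.combinations(utterances, 2)
    ]


def cross_set_pairs(utterance_sets, seed=0):
    """One pair per unordered pair of sets (random pick is an assumption)."""
    rng = random.Random(seed)
    return [
        (rng.choice(set_a), rng.choice(set_b))
        for set_a, set_b in itertools.combinations(utterance_sets, 2)
    ]


# Placeholder data: 1,000 sets of 10 dummy utterances.
toy_sets = [[f"set{i}-utt{j}" for j in range(10)] for i in range(1000)]
print(expected_counts(1000, 10))  # (45000, 499500)
print(len(paraphrase_pairs(toy_sets)), len(cross_set_pairs(toy_sets)))
```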

Result, discussion, and afterward
• A new task in the form of a parallel corpus!
• Sentence similarity and paraphrase detection task
– Seems valid with given topic and intention!
» Architecture shared from CogSci models
- Series: S1 + [SEP] + S2
- (1, 2) >> (a, b)
- Parallel: S1 // S2
- (3, 5) >> (c, d)
- Why no BERT-like ones?
- Computation (500K?)
- Less explainable
- Performs well already...
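
The two input layouts can be illustrated at the token level. This is only a sketch of how a sentence pair might be arranged before encoding, assuming whitespace tokenization and a literal "[SEP]" separator token; the function names and example sentences are hypothetical and the actual models are not reproduced here.

```python
SEP = "[SEP]"  # separator token assumed for the series layout


def tokenize(sentence):
    """Whitespace tokenization, used here only for illustration."""
    return sentence.split()


def series_input(s1, s2):
    """Series: one sequence, S1 followed by a separator and then S2."""
    return tokenize(s1) + [SEP] + tokenize(s2)


def parallel_input(s1, s2):
    """Parallel: two separate sequences, to be encoded side by side."""
    return tokenize(s1), tokenize(s2)


a = "turn the heat down"
b = "lower the heat please"
print(series_input(a, b))
print(parallel_input(a, b))
```

In the series layout the order of the two sentences changes the single input sequence, whereas the parallel layout keeps each sentence as its own sequence.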

• Created corpus works!
– In terms of both accuracy and F1
– With 9:1 train-test split
• Result
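– Attention is clearly effective
» Which parts should the model focus on to assess similarity well?
– When self-attentive embedding (SA x) is used, parallel modeling
is more advantageous than series modeling
» Parallel modeling is more robust than placing
one of the sentences first
– Using cross-attention is more effective than
simple parallel modeling
» This should give an effect similar to using SA in BERT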
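
The evaluation setup named above (a 9:1 train-test split, scored with accuracy and F1) can be sketched as follows. The shuffling seed and the choice of binary F1 on the positive class are assumptions of this sketch; the slides do not state which F1 variant was reported.

```python
import random


def train_test_split(pairs, test_ratio=0.1, seed=0):
    """Shuffle a copy of the data and hold out test_ratio of it for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(gold, pred):
    """Share of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_binary(gold, pred, positive=1):
    """F1 for the positive class (binary F1 is an assumption of this sketch)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = paraphrase / similar, 0 = not (illustrative values only).
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
train, test = train_test_split(range(100))
print(len(train), len(test), accuracy(gold, pred), f1_binary(gold, pred))
```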

• Data release and model service
– https://github.com/warnikchow/paraKQC
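• Weak points?
– Participants were assigned by topic
» One person's linguistic traits may be reflected > needs testing with the free topic
– Possibility of overfitting (scores are too high)
» The pairs are unique, but the same sentences recur across pairs > should we model so that pairs overlap
less with one another?
– Domain adaptation
» Will it also hold for the free topic?
– Different sentence types
» It holds for questions/commands,
but will it also hold for declaratives?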

▪ Direct applications
• Paraphrase detection
• Sentence similarity test
• Speech act classification
• Topic analysis
▪ Further applications
• Query matching (for spoken QA systems)
• Checking sentence relevance (for an open domain chatbot)
• Remaining challenges
▪ Validity of the sentence generation process
▪ Possibility of model overfitting
▪ Domain adaptation (w/ topic)
▪ Different sentence types (w/ intention)
To be covered afterward

• Takeaways
▪ Sentence similarity can be defined in several ways
▪ Topic and speech act can serve as attributes for judging sentence similarity
▪ If sentence similarity is defined well using natural-language intent arguments that carry the
notions of topic and speech act, sentences with the same or similar meaning can be generated
from more human-understandable elements
▪ This links the sentence similarity task with paraphrase detection, and an augmented
(puffed-up?) dataset can be generated from sets of sentences that share the same
intent argument
▪ A similarity model trained on these sentence pairs can be used for query choice, sentence
relevance tests, and so on!
Special thanks to –
Jong In Kim, Jio Chung, †Kyuwhan Lee,
Youngki Moon, Sangwhan Moon, Jeonghwa
Cho, Jeemin Kang, Dae Ho Kook, Haeun
Park, and all the participants and
HIL members
