SC-GPT: Few-shot Natural Language Generation for Task-Oriented Dialog
pengbaolin/SC-GPT
Few-shot Natural Language Generation for Task-Oriented Dialog
Contributors: 진명훈, 백지윤, 황소현
Date: 2021.06.06
https://www.aclweb.org/anthology/2020.findings-emnlp.17/
• New benchmark: FewShotWOZ
• New baseline: SC-GPT
• SOTA @ MultiWOZ 2.0
What is Task-Oriented Dialogue (TOD)?
https://arxiv.org/pdf/2005.05298.pdf
Natural Language Generation for TOD
Natural Language Generation (NLG)
• The response is the output the user actually sees,
• so the more fluent it is, the more appealing it is to the user,
• and it must faithfully carry out the given task.
Natural Language Generation for TOD: Typical methodology
1. Template-Based (see the sketch below)
• Requires domain knowledge
• Adequate for encoding the required semantic information
• Not fluent, so poor for UX
• In short: fluency ✗, semantically correct responses ✓
2. Statistical-Based
• e.g., neural networks
• Learns from a labeled corpus to generate fluent responses
• SC-LSTM: encodes dialog acts as a one-hot representation and uses it as an extra feature for sentence generation
• Requires domain-specific data
• Scalability issues arise as the number of dialog acts grows
• In short: fluency ✓, semantically correct responses are hard (large amounts of annotated data required)
https://www.aclweb.org/anthology/P98-1116.pdf
https://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP199.pdf
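To make the contrast concrete, below is a minimal sketch of a template-based generator; the templates, slot names, and function are hypothetical illustrations, not taken from the paper:

```python
# Minimal sketch of template-based NLG; the templates and slot names are hypothetical.
TEMPLATES = {
    "inform": "{name} is a restaurant where meals cost {price}.",
    "confirm": "Did you say you want a restaurant in the {area} area?",
}

def template_nlg(intent: str, slots: dict) -> str:
    """Fill the canned template for the given intent with slot values."""
    return TEMPLATES[intent].format(**slots)

print(template_nlg("inform", {"name": "assab eritrean restaurant",
                              "price": "between 7 and 11 euro"}))
# Semantically complete, but every "inform" response has exactly the same phrasing.
```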
Few-shot NLG for TOD
• To address these problems, the authors build FewShotWOZ, a dataset constructed in a few-shot manner
• A dataset for testing the few-shot learning setting in TOD
• Based on the MultiWOZ and Cambridge NLG datasets
• 50 labeled utterances per domain
• In addition, the SC-GPT model is developed as a benchmark for this dataset
• 3-Stage Training Strategy:
1. Pre-trained on plain text
2. Pre-trained on large amounts of dialog-act labeled utterance corpora to acquire the ability of controllable generation
3. Fine-tuned for a target domain using very limited amounts of domain labels
Typical TOD: pipeline architecture
NLU: extracts user intents and other key information
DST: maintains the current state of the dialog
POL: takes the DST output (formatted information) and produces a dialog act grounded in facts or entities retrieved from the DB/KB
NLG: generates the natural-language system response from the dialog act produced by POL
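To make the hand-off between these modules concrete, here is an illustrative interface sketch; the type, field, and function names are ours, not from the paper, and the module bodies are stubbed out:

```python
# Illustrative sketch of the data passed along the TOD pipeline; bodies are stubs.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DialogAct:
    intent: str                                          # e.g. "inform", "request", "confirm"
    slots: Dict[str, str] = field(default_factory=dict)  # e.g. {"name": "...", "price": "..."}

def nlu(user_utterance: str) -> DialogAct:
    """Extract the user intent and key information."""
    raise NotImplementedError

def dst(belief_state: dict, user_act: DialogAct) -> dict:
    """Update and return the current dialog state."""
    raise NotImplementedError

def pol(belief_state: dict) -> DialogAct:
    """Choose a system dialog act, grounded in facts/entities from the DB/KB."""
    raise NotImplementedError

def nlg(system_act: DialogAct) -> str:
    """Generate the natural-language system response (the module this paper targets)."""
    raise NotImplementedError
```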
Controllable Pre-Trained Model
• ELMo, BERT, XLNet, RoBERTa, CTRL, GPT-2, Soloist: pre-trained models
• The one weakness of GPT-2? → Controllability
• To overcome this, Salesforce proposed CTRL
• Trained with control codes such as text style, content description, and task-specific behavior
• Grover was designed to generate news articles conditioned on authors, dates, etc.
• However, the control codes in previous controllable-generation work are not trained in a way that suits dialog
Semantically Conditioned GPT
$\mathcal{A} = [\, I,\ (s_1 = v_1, \cdots, s_P = v_P)\,]$
Intent $I$: system actions, e.g., inform, request, confirm, select
Slot-value pairs $(s_i, v_i)$: the categories and content of the information used to express the utterance
Auto-regressive generation model:
$p_\theta(x \mid \mathcal{A}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, \mathcal{A})$
MLE objective over the training set $D$:
$\mathcal{L}_\theta = \sum_{n=1}^{|D|} \sum_{t=1}^{T_n} \log p_\theta(x_{t,n} \mid x_{<t,n}, \mathcal{A}_n)$
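In practice the structured act $\mathcal{A}$ is flattened into a plain-text prefix that conditions generation. A minimal sketch of that linearization (the separator format follows the example sequence shown later in the deck; the function name is ours):

```python
# Flatten A = [ I, (s1=v1, ..., sP=vP) ] into the text prefix the model is conditioned on.
from typing import List, Tuple

def linearize_dialog_act(intent: str, slot_values: List[Tuple[str, str]]) -> str:
    pairs = " ; ".join(f"{slot} = {value}" for slot, value in slot_values)
    return f"{intent} ( {pairs} )"

print(linearize_dialog_act(
    "inform",
    [("name", "assab eritrean restaurant"), ("price", "between 7 and 11 euro")],
))
# inform ( name = assab eritrean restaurant ; price = between 7 and 11 euro )
```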
Three-stage procedure as the training recipe.
STAGE 1> Massive Plain Language Pre-training
• Pre-trained on OpenWebText (inherited from GPT-2)
• Generates realistic sentences from a text prompt
STAGE 2> Dialog-Act Controlled Pre-training
• Trained on (dialog act, response) pairs
• Datasets: Schema-Guided Dialog corpus, MultiWOZ, Frames corpus, Facebook Multilingual Dialog
• Token size: 40k
• Converts each dialog act into a sequence of control codes
STAGE 3> Fine-tuning
• Trained in the same way as Stage 2, on the few-shot dataset
• Three advantages:
1. Flexibility: operates on sequences without delexicalization
2. Controllability: generates sentences with the right intent and slot-value information while maintaining fluency
3. Generalizability: generalizes much better than SC-LSTM (why? pre-training on large datasets)
Stage 2/3 objective (conditioned on the dialog act):
$\mathcal{L}_\theta = \sum_{n=1}^{|D|} \sum_{t=1}^{T_n} \log p_\theta(x_{t,n} \mid x_{<t,n}, \mathcal{A}_n)$
Stage 1 objective (plain language modeling):
$\mathcal{L}_\theta = \sum_{n=1}^{|D|} \sum_{t=1}^{T_n} \log p_\theta(x_{t,n} \mid x_{<t,n})$
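Concretely, each (dialog act, response) pair used in stages 2 and 3 can be joined into a single training sequence over which the conditioned objective above is computed; a small sketch, with the "&" separator taken from the example that follows:

```python
# Build one stage-2/3 training string from a (dialog act, response) pair.
# The "&" separator mirrors the tokenization example below.
def build_training_sequence(linearized_act: str, response: str) -> str:
    return f"{linearized_act} & {response}"

seq = build_training_sequence(
    "inform ( name = assab eritrean restaurant ; price = between 7 and 11 euro )",
    "assab eritrean restaurant meal -s cost between 7 and 11 euro",
)
print(seq)
```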
Example training sequence (dialog act & reference response):
inform ( name = assab eritrean restaurant ; price = between 7 and 11 euro ) & assab eritrean restaurant meal -s cost between 7 and 11 euro
Byte Pair Encoding:
['in', 'form', '_(', '_name', '_=', '_ass', 'ab', '_er', 'it', 're', 'an', '_restaurant', '_;', '_price', '_=', '_between', '_7', '_and', '_11', '_euro', '_)', '_&', '_ass', 'ab', '_er', 'it', 're', 'an', '_restaurant', '_meal', '_-', 's', '_cost', '_between', '_7', '_and', '_11', '_euro']
Encoded token ids:
[259, 687, 357, 1438, 796, 840, 397, 1931, 270, 260, 272, 7072, 2162, 2756, 796, 1022, 767, 290, 1367, 11063, 1267, 1222, 840, 397, 1931, 270, 260, 272, 7072, 9799, 532, 82, 1575, 1022, 767, 290, 1367, 11063]
Model flow: Embedding → (1, 38, 1024) → transformer blocks → (1, 38, 1024) → lm_head → (1, 38, 50257)
Cross-entropy loss applied to the logits:
$\mathrm{loss}(x, \mathrm{class}) = -\log\frac{\exp(x[\mathrm{class}])}{\sum_j \exp(x[j])} = -x[\mathrm{class}] + \log\sum_j \exp(x[j])$
Logits of shape (bsz, seq_len, |V|) and labels of shape (bsz, seq_len) are flattened to (bsz*seq_len, |V|) and (bsz*seq_len) before the loss is computed.
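A minimal sketch of the forward pass this example traces, using the Hugging Face transformers API the implementation builds on (exact token counts and preprocessing may differ from the released code):

```python
# BPE-encode the "act & response" string, run GPT-2 medium, and let the model
# compute the shifted cross-entropy loss over the vocabulary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

text = ("inform ( name = assab eritrean restaurant ; price = between 7 and 11 euro ) "
        "& assab eritrean restaurant meal -s cost between 7 and 11 euro")
input_ids = tokenizer(text, return_tensors="pt").input_ids   # (1, 38) token ids

with torch.no_grad():
    out = model(input_ids, labels=input_ids)
# Hidden states are (1, 38, 1024); lm_head maps them to logits of shape (1, 38, 50257).
# Passing labels makes the model flatten logits/labels and apply the cross-entropy above.
print(out.loss)
```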
https://github.com/pengbaolin/SC-GPT/tree/master/data
Revisiting NLG Benchmarks
• E2E NLG (Novikova et al., 2017)
• BAGEL (Mairesse et al., 2010)
• RNNLG (Wen et al., 2016a)
Revisiting NLG Benchmarks
Two issues
I. High labeling cost: all of these datasets contain a large number of labeled training samples for each domain
II. The overlap ratio of delexicalized dialog acts between training and testing is too high, which makes it difficult to evaluate the model's generalization ability for new domains
FewShotWOZ
Three advantages
I. More domains
II. Fewer training instances (for the few-shot setting)
III. Lower training/testing overlap
FewShotWOZ - statistics
How was it collected?
• By re-organizing data from RNNLG and MultiWOZ
• MultiWOZ: since Attraction, Taxi, and Train are multi-domain, there are cases where dialog acts overlap
• The same delexicalization processing is applied to obtain the same target utterances
• For the few-shot manner, 50 instances per domain (40 for Taxi)
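Delexicalization here means replacing concrete slot values in the target utterance with slot placeholders; a minimal sketch, where the placeholder format is our own choice, not the paper's:

```python
# Minimal delexicalization sketch: replace slot values with slot-name placeholders.
def delexicalize(utterance: str, slots: dict) -> str:
    for slot, value in slots.items():
        utterance = utterance.replace(value, f"[{slot}]")
    return utterance

print(delexicalize(
    "assab eritrean restaurant meals cost between 7 and 11 euro",
    {"name": "assab eritrean restaurant", "price": "between 7 and 11 euro"},
))
# [name] meals cost [price]
```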
Implementation details
• Built upon 🤗 Huggingface transformers
• GPT2-medium with 345M parameters as the initial checkpoint, BPE for the tokenizer
• Linear learning-rate schedule with an initial rate of 5e-5
• Adam with weight decay (AdamW)
• For pre-training: mini-batches of 8 on an 8× Nvidia V100 machine, up to 20 epochs with early stopping
• For fine-tuning on FewShotWOZ: 5 epochs
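A sketch of the optimizer and scheduler setup described by these bullets, using standard PyTorch/transformers utilities; the weight-decay value and step counts are placeholders, not numbers from the paper:

```python
# AdamW (Adam with weight decay) and a linear learning-rate schedule starting at 5e-5.
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

model = GPT2LMHeadModel.from_pretrained("gpt2-medium")   # 345M-parameter initial checkpoint
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)  # weight decay value assumed
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=10_000  # step counts are placeholders
)
```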
Automatic metrics
• Following SC-LSTM, BLEU score and slot error rate (ERR) are used
• BLEU: evaluates how natural the generated utterances are
• ERR: measures exact matching of the slot tokens in the candidate utterances
$\mathrm{ERR} = \frac{p + q}{M}$, where $M$ is the total number of slots in the dialog act, and $p$ and $q$ are the numbers of missing and redundant slots in the generated utterance.
• For each dialog act, 5 utterances are generated and the one with the lowest ERR is returned.
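A minimal sketch of the ERR computation and the generate-and-rank step described above; it uses exact slot-value matching and omits redundant-slot counting, which is a simplification of the paper's evaluation:

```python
# ERR = (p + q) / M: p missing and q redundant slots over M slots in the dialog act.
from typing import List

def slot_error_rate(utterance: str, slot_values: List[str]) -> float:
    p = sum(1 for value in slot_values if value not in utterance)  # missing slots (exact match)
    q = 0  # counting redundant slots needs a full slot inventory; omitted in this sketch
    return (p + q) / max(len(slot_values), 1)

def pick_best(candidates: List[str], slot_values: List[str]) -> str:
    """Generate-and-rank: of the 5 sampled utterances, keep the one with the lowest ERR."""
    return min(candidates, key=lambda u: slot_error_rate(u, slot_values))
```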
Human Evaluation
• Human evaluation of subjective quality is conducted on Amazon Mechanical Turk
• Master-level workers (with good prior approval rates) are hired
• Each utterance is scored for Informativeness and Naturalness from 1 (bad) to 3 (good)
• Informativeness: how much of the information specified in the dialog act the generated utterance contains
• Naturalness: whether the utterance is as natural as a human-written one
• To reduce judgement bias, each question is distributed to three different workers
• A total of 5,800 judgements are collected
Baselines
• SC-LSTM: a canonical model and strong baseline that uses an additional dialog-act vector and a reading gate to guide utterance generation
• GPT-2: fine-tuned directly, without pre-training on the large-scale corpus of (dialog act, response) pairs
• HDSA: SOTA on MultiWOZ; models the dialog-act structure so that it transfers across domains in the multi-domain setting; superior to SC-LSTM