Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Learning Contextual Representations
for Semantic Parsing with
Generation-Augmented Pre-Training
2021.01.24
딥러닝 논문읽기모임 자연어처리팀
팀원 : 황소현(발표자), 박희수, 신동진, 진명훈
https://arxiv.org/pdf/2012.10309v1.pdf

Introduction
Models
Experiments
- Motivation
- Pain Point
- Pre-training
- Pre-training Data Generation
- Tasks, Datasets and Baseline Systems
- Impact of Learning Objectives
- Analysis of Pre-Training Inputs
- Analysis of Pre-trained Model

Introduction - Motivation
Text-to-SQL
기존의 MLM을 text-to-SQL에 적용하는 것에는 크게 세가지 문제가 있다

Introduction - Pain Point1
Pain Point 1. Text 에서 컬럼의 이름이 발화된 것을 탐지하는 것에 실패하는 문제

Pain Point 2. 셀 값으로 부터 컬럼의 이름을 추론하는 것에 실패하는 문제
(e.g. republics 단어로 부터 GovernmentForm 컬럼 명을 추론하는데 실패함)

Pain Point 3. 복잡한 SQL 생성문 실패
(e.g. 위의 구문에서 semester_id는 괄호 안 SELECT에 포함되어야함)

이 논문에선 Alignment 뿐 아니라 컬럼 이름의 “Contextual Representation” 을 학습함으로써 더 성
능을 높이는 방법을 제안한다.
(1) Column prediction task (CPed)
각각의 컬럼에 Label index 를 붙이고 utterance 에서 사용이 되었는지 아닌지를 prediction 하는
task.
(2) Column recovery task (CRec)
Text 에서 컬럼 명을 컬럼의 value 로 대체하고 컬럼 명을 추론하는 task.
(3) SQL generation (GenSQL)
utterance 와 schema 를 주고 SQL 문을 생성하는 task. (이 때 synthetic data 를 통하여 많은 양
의 data 를 확보 후에 이를 바탕으로 학습시킴)

⚫ Main contributions
✓ 3가지의 기존 문제점 (3 pain point)
✓ multiple pre-training tasks and synthetic data 제안
✓ 3가지의 learning objectives that alleviate the three main issues spotted with
pre-trained LMs for semantic parsing.
✓ leveraging SQL-to-Text and Table-to-Text generative models to generate
synthetic data for learning joint representations of textual data and table
schema.

Models - Pre-training
schema S = tables T = {t1, t2, ..., t∣T∣}
columns C = {c1, c2, ..., c∣C∣}
utterance U = {x1, x2, ..., xn}

Models - Text-to-SQL semantic parser
Utterance (raw text)
Schema
Columns Tables
Transformer Encoder
Transformer Decoder
Contextual
Representation
1. Column Prediction (CPred)
2. Column Recovery (CRec)
3. SQL Generation (GenSQL)
4. Masked Language Model (MLM)
Utterance Schema SQL

Schema
Columns Tables
Transformer Encoder
Transformer Decoder
Contextual
Representation
1. Column Prediction (CPred)
MLP
Column in
Utterance?

Schema
Columns Tables
Transformer Encoder
Transformer Decoder
Contextual
Representation
2. Column Recovery (CRec)
Column
name
first_name last_name job
Cell value
(record)
Blake Smith manager
first_name last_name job
first_name last_name manager
replacement

Schema
Columns Tables
Transformer Encoder
Transformer Decoder
Contextual
Representation
3. SQL Generation (GenSQL)

Utterance (Masked)
Schema
Columns
(Masked)
Tables
Transformer Encoder
Transformer Decoder
Contextual
Representation
4. Masked Language Model (MLM)

Models - Pre-training Data Generation

1. Utterance - Schema – SQL (SQL–to–Text Generation)
⚫ SQL (crawl from Github)
→ BART tokenizer
→ SQL-to-Text (train from SPIDER)
→ Utterance
⚫ Result
→ Utterance
→ Schema (columns & tables)
→ SQL

2. Utterance – Table (Table–to–Text Generation)
⚫ Table (from English Wikipedia)
→ sample columns
→ BART generator (train from SPIDER)
→ SQL + control codes
⚫ Result
→ Utterance
→ Table

Experiments - Tasks, Datasets and Baseline Systems
⚫ SPIDER
SPIDER dataset : Yu et al. 2018,https://www.aclweb.org/anthology/D18-1425/
✓ Annotated 10,181개의 parallel utterance-database-SQL
triples
✓ Metric은 Exact Set Match Accuracy
✓ Baseline: RAT-SQL
8-layer relation-aware transformer를 활용, 테이블 및 발화간
의 연결을 모델링(Wang et al. 2019, https://www.aclweb.org/anthology/2020.acl-
main.677/)
✓ Ablation Study를 위해 IRNet도 추가로 사용
(Gue et al. 2019, https://arxiv.org/abs/1905.08205)
✓ 위 baseline의 original context encoder를 대체하여 GAP
Model의 Encoder를 기본 parser로 확장

⚫ Spider Results
Synchronized CFG grammar 사용
Large Scale data-to-text dataset 사용
더 많은 Parameter와 Crowd, Expert Annotated 작업 필요!

⚫ Criteria-to-SQL
✓ Electronic Health Record DB에서 임상 시험에 적합한 환자를 쉽게 검색할 수 있는 Dataset
Task는 적합 판정 text를 SQL Queries로 변환시키는 것
✓ Annotated 2003 examples
✓ Metric은 SQL Accuracy와 Execution Accurarcy
✓ Baseline은 Yu et al. 2020c https://www.aclweb.org/anthology/2020.lrec-1.714/에 나온 system 사용
✓ YXJ model이라 부를 것이며 sketch를 디자인하기 위해 사전 문법 지식을 활용하는 slot-filling 기반
모델이라고 함 (BERT 기반을 context encoder로 사용)
“” Any infection requiring parenteral antibiotic therapy or causing fever (i.e.,
temperature > 100.5f) <= 7 days prior to registration “””
SELECT id
FROM records
WHERE active_infection = 1
AND (parenteral_antibiotic_therapy = 1
OR causing_fever = 1 OR temperature > 100.5)
TEXT
SQL

⚫ Criteria-to-SQL Results

Experiments - Impact of Learning Objectives
⚫ Four different learning objectives
✓ Masked Language Model (MLM)
✓ Column Prediction (CPred)
✓ Column Recovery (CRec)
✓ SQL Generation (GenSQL)
⚫ Two different conditions
✓ GenSQL learning objective
and the other one is without

Experiments - Analysis of Pre-Training Inputs
⚫ Whether to use utterance in pre-training
✓ MLM → +0.2%
✓ MLM without utterance → -1.7%
✓ CRec without utterance → -1.7%
⚫ Whether to use schema in pre-training
✓ MLM without schema → -1.6%
☞ 작업에 공동 발화 및 스키마 표현을 학습해야 함.
⚫ Whether to use the generated text or the surrounding text of the table
✓ CRec → -0.8%

Experiments - Analysis of Pre-trained Model

Conclusion
⚫ SPIDER
⚫ CRITERIA-TO-SQL

Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Recommended

Recommended

More Related Content

Similar to Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Similar to Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training (20)

More from taeseon ryu

More from taeseon ryu (20)

Recently uploaded

Recently uploaded (20)

Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training