Get To The Point: Summarization with Pointer-Generator Networks (ACL'17, paper introduction)

2017.06.26
NAIST, Natural Language Processing Laboratory
D1 Masayoshi Kondo
Paper introduction: About Neural Summarization@2017
Get To The Point: Summarization with
Pointer-Generator Networks
ACL'17
Abigail See
Stanford University
Peter J. Liu
Google Brain
Christopher D. Manning
Stanford University

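00: Paper overview
【Summary】
•  A study of abstractive summarization with neural networks (in: source text → NN → out: summary).
•  It tackles multi-sentence summary generation; the highlight is the set of techniques that make summarizing long texts work.
•  The NN architecture is a Seq2Seq model (Enc: bi-directional RNN / Dec: RNN) extended with a pointer mechanism (attention mechanism) and a coverage mechanism.
•  The experimental data is a multi-sentence summarization dataset derived from the CNN/Daily Mail data. The evaluation metric is the ROUGE score.
•  Improves on prior methods by at least 2 points.
【Abstract】
Neural seq2seq models have become a viable new approach to abstractive summarization (meaning they are not limited to simply selecting passages of the article and rearranging them). However, these models have two shortcomings: they tend to reproduce factual details inaccurately, and they tend to repeat themselves (repetition). In this work we propose a new architecture that augments the seq2seq attention model in two independent ways. First, we use a hybrid pointer-generator network that can reuse words from the source article (src) by pointing while keeping the ability to produce suitable words by generation. The pointing mechanism helps reproduce information correctly. Second, to avoid repetition, we use a coverage mechanism that keeps track of what has already been summarized. We applied the proposed method to the CNN/Daily Mail summarization data and outperformed the previous best score by at least 2 ROUGE points.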

1.  Introduction
2.  Our Models
3.  Related Work
4.  Dataset
5.  Experiments
6.  Results
7.  Discussion
8.  Conclusion

00: Introduction
【 Text Summarization 】
The task of extracting the key information from a source text and restating it as a shorter text.
Two types of document summarization task:
Extractive Summarization :
　- The framework of most conventional document (automatic) summarization research.
•  Builds the summary by directly using (copying) passages of the source text.
•  Easy to implement.
•  Accuracy and grammatical structure reach a certain standard.
Abstractive Summarization :
　- In recent years, accuracy has improved dramatically through the use of NNs.
•  Builds the text generatively, including phrases and words that do not come from the source text.
•  May be able to produce sophisticated summaries involving paraphrase and common sense (world knowledge).
[Figure: Src (source text) → Trg (summary), one diagram for each of the two types]

00: Introduction
That said...
Abstractive summarization still has many open problems:
•  Undesirable behavior such as inaccurately reproducing factual details.
•  An inability to deal with out-of-vocabulary (OOV) words.
•  Repeating themselves.
Types of document summarization task (document type × summary length)
Summary length : Short Text (1 or 2 sentences) / Long Text (more than 3 sentences)
Single Document : Headline Generation (short text) / target of this work (long text)
Multi Documents : (Opinion Mining)
In this work (this paper),
•  long-text summarization is taken as the task, and
•  a new neural network model is proposed
•  that addresses the problems above.

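00: Introduction
【Proposed method】
•  Pointer-Generator Network
-  Combines the ability to generate new words with the ability to reuse (copy) words from the source text.
•  Coverage Mechanism
-  A mechanism for avoiding word repetition.
【Dataset】
CNN/Daily Mail Dataset
- News articles (summarizing the source text / English)
【Evaluation metric】
ROUGE score
(This is the important part.)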

00: Introduction
[Figure, built up over five slides: the pointer-generator architecture. An encoder (Bi-LSTM) reads the input sequence; attention over the encoder states gives an attention distribution and a context vector; the decoder (RNN) gives a predicted vocab distribution; p_gen weights the predicted vocab distribution and 1 - p_gen weights the attention distribution, and the two are combined into the final predicted vocab distribution.]
1 - p_gen : the inclination to reuse words from the input sequence (src).
p_gen : the inclination to produce new expressions.

00: Our Models
2.1 Sequence-to-Sequence attention model
[Figure: [Encoder] states ..., i, i+1, ... attended to by the [Decoder]]
$e_i^t = v^T \tanh(W_h h_i + W_s s_t + b_{attn})$
$a^t = \mathrm{softmax}(e^t)$
$h_t^* = \sum_i a_i^t h_i$
Encoder hidden state : $h_i$
Decoder hidden state : $s_t$
Context vector : $h_t^*$
For details, see:
Neural machine translation by jointly learning to align and translate
[Bahdanau et al., ICLR'15]
Abstractive text summarization using sequence-to-sequence RNNs and beyond
[R. Nallapati et al., CoNLL'16]
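The attention step on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`matvec`, `softmax`, `attention_step`), the toy 2-dimensional states and the identity weight matrices are all made up for the example, and the bias `b_attn` is the one that appears in the attention formula on the coverage slide (2.3). It is not the authors' implementation.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(enc_states, s_t, W_h, W_s, b_attn, v):
    """Return (attention distribution a^t, context vector h*_t)."""
    Ws_s = matvec(W_s, s_t)
    scores = []
    for h_i in enc_states:
        Wh_h = matvec(W_h, h_i)
        hidden = [math.tanh(a + b + c) for a, b, c in zip(Wh_h, Ws_s, b_attn)]
        scores.append(sum(vj * hj for vj, hj in zip(v, hidden)))  # e_i^t
    a_t = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(a * h[d] for a, h in zip(a_t, enc_states)) for d in range(dim)]
    return a_t, context

# Toy values (assumptions, chosen only so the example runs).
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [0.5, -0.5]
W_h = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
b_attn = [0.0, 0.0]
v = [1.0, 1.0]

a_t, context = attention_step(enc_states, s_t, W_h, W_s, b_attn, v)
print(a_t, context)
```

With these toy values the third encoder state gets the largest attention weight, and the weights sum to 1.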

00: Our Models
2.2 Pointer-generator network
[Figure: pointer-generator architecture; p_gen weights the predicted vocab distribution, 1 - p_gen weights the attention distribution]
Generation probability : $p_{gen}$
$p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})$
context vector: $h_t^*$ / decoder state: $s_t$ / decoder input: $x_t$
Vector parameters: $w_{h^*}, w_s, w_x$
Final probability distribution: $P(w)$
$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$
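As a rough sketch of the final distribution P(w), the mixture can be written with a dictionary that grows to include source words missing from the fixed vocabulary. The function name `final_distribution`, the tiny vocabulary and the example tokens are hypothetical; only the mixing rule itself comes from the formula on this slide.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of a_i over positions where w_i == w."""
    final = {w: p_gen * p for w, p in p_vocab.items()}
    for a_i, w_i in zip(attention, source_tokens):
        # A source word that is not in the vocabulary still receives copy probability.
        final[w_i] = final.get(w_i, 0.0) + (1.0 - p_gen) * a_i
    return final

# Toy inputs (assumptions for illustration).
p_vocab = {"the": 0.5, "team": 0.3, "[UNK]": 0.2}
source_tokens = ["munster", "beat", "munster"]
attention = [0.6, 0.1, 0.3]

P = final_distribution(0.4, p_vocab, attention, source_tokens)
print(P)
```

Here "munster" is out of vocabulary for the generator, yet it ends up with the highest probability because both of its source positions contribute attention mass.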

00: Our Models
2.3 Coverage mechanism
[Figure: the attention distributions of decoder timesteps 1, 2, 3, ..., t-1 are summed into the coverage vector $c^t$]
Coverage Vector : $c^t$
$c^t = \sum_{t'=0}^{t-1} a^{t'}$
The attention vectors of all previous decoder steps are summed.
$c^t$ is an (unnormalized) distribution over the source document words.
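The running sum that defines the coverage vector can be illustrated as follows. This is a minimal sketch: the function name `coverage_vectors` and the three toy attention distributions are assumptions, and the coverage at the first step is taken to be the zero vector (an empty sum).

```python
def coverage_vectors(attention_history):
    """Return [c^0, c^1, ...], where c^t is the sum of the attention vectors of steps before t."""
    c = [0.0] * len(attention_history[0])
    out = []
    for a_t in attention_history:
        out.append(list(c))                       # coverage seen *before* step t
        c = [ci + ai for ci, ai in zip(c, a_t)]   # accumulate step t for later steps
    return out

# Toy attention distributions over three source words (assumptions).
history = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
cov = coverage_vectors(history)
print(cov)
```

After two steps that both attend mostly to the first source word, its coverage already exceeds 1, which shows why the vector is unnormalized.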

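00: Our Models
2.3 Coverage mechanism
Attention computation :
$e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})$
A coverage vector term is added to the usual attention formula.
Coverage Loss :
$\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)$
$\mathrm{loss}_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)$
For the word at decoder step t, the attention value on encoder position i is compared with the coverage value (element i of the coverage vector), and the smaller of the two is added to the sum.
【Interpretation】: Suppose the i-th encoder-side word is used at every decoder step t. Then $c_i^t$ grows with t (it accumulates), and as t advances $a_i^t$ becomes less and less likely to exceed $c_i^t$ (once c exceeds 1, a is always the term added to covloss from then on). When the minimum is a, backpropagation strongly discourages the word at decoder step t from being drawn from the i-th encoder-side word. When the minimum is c, it instead discourages use of the i-th encoder-side word across all decoder steps t. Overall, repeated use of the same encoder-side word is suppressed, and the min(a) case also suppresses locally high-probability repetition of a word during decoding. → Repeated generation of the same word across decoder steps is suppressed.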
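To make the coverage loss concrete, the sketch below evaluates it for two toy decoder steps. The function names and the numbers are assumptions for illustration; `p_target` stands for the probability the model assigns to the correct word at that step.

```python
import math

def coverage_loss(a_t, c_t):
    """covloss_t = sum_i min(a_i^t, c_i^t)."""
    return sum(min(a, c) for a, c in zip(a_t, c_t))

def step_loss(p_target, a_t, c_t, lam=1.0):
    """loss_t = -log P(w_t^*) + lambda * covloss_t."""
    return -math.log(p_target) + lam * coverage_loss(a_t, c_t)

# Step A: attention goes back to the word that is already covered.
repeat_penalty = coverage_loss([0.6, 0.3, 0.1], [0.7, 0.2, 0.1])
# Step B: attention moves on to a word that has hardly been covered.
fresh_penalty = coverage_loss([0.1, 0.1, 0.8], [1.3, 0.5, 0.2])

total = step_loss(0.5, [0.1, 0.1, 0.8], [1.3, 0.5, 0.2], lam=1.0)
print(repeat_penalty, fresh_penalty, total)
```

Re-attending to the covered word costs 0.9, while attending to the fresh word costs 0.4, so the penalty pushes attention toward source words that have not been used yet.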

1.  Introduction
2.  Our Models
3.  Related Work
4.  Dataset
5.  Experiments
6.  Results
7.  Discussion
8.  Conclusion
(Related Work is skipped in these slides, since it has little bearing on the main content of the paper.)

00: Dataset
CNN/Daily Mail Dataset : Online news articles
                     avg. sentences   avg. words (tokens)   vocab size
Source (article)     -                781                   150k
Target (summary)     3.75             56                    60k
Settings
•  Used scripts by Nallapati et al. (2016) for pre-processing.
•  Used the original text (non-anonymized version of the data).
Dataset size
Train set   Validation set   Test set
287,226     13,368           11,490

00: Experiments
【 Model Details 】
•  Hidden layer : 256 dims
•  Word emb : 128 dims
•  Vocab : 2 types
           src    trg
  (large)  150k   60k
  (small)  50k    50k
【 Setting Details 】
Optimizer                    Adagrad
Initial learning rate        0.15
Initial accumulator value    0.1
Regularization terms         none
Max gradient-clipping norm   2
Early stopping               yes
Batch size                   16
Beam size (for test)         4
【 Environment & procedure 】
- Execution environment
Single GPU (Tesla K40m)
- Experimental procedure
•  No pre-training of word embeddings.
> Training :
•  Source truncated to 400 tokens
•  Target truncated to 100 tokens
> Test :
•  Source truncated to 400 tokens
•  Target truncated to 120 tokens
Evaluation metrics
-  ROUGE scores (F1)
-  METEOR scores
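The optimizer settings listed above (Adagrad, learning rate 0.15, initial accumulator 0.1, gradient clipping at 2) can be sketched in plain Python. This is a simplified illustration, not the training code used in the paper: the function names are made up, clipping is assumed to be by global L2 norm, and the small `eps` term and its placement are assumptions that differ between libraries.

```python
import math

def clip_by_global_norm(grads, max_norm=2.0):
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def adagrad_step(params, grads, accum, lr=0.15, eps=1e-10):
    """One Adagrad update; accum holds the running sum of squared gradients."""
    new_accum = [s + g * g for s, g in zip(accum, grads)]
    new_params = [p - lr * g / (math.sqrt(s) + eps)
                  for p, g, s in zip(params, grads, new_accum)]
    return new_params, new_accum

# Toy parameters and gradients (assumptions for illustration).
params = [1.0, -1.0]
accum = [0.1, 0.1]            # initial accumulator value from the settings table
raw_grads = [3.0, 4.0]        # L2 norm 5, so clipping applies

grads = clip_by_global_norm(raw_grads, max_norm=2.0)
params, accum = adagrad_step(params, grads, accum, lr=0.15)
print(grads, params, accum)
```

Starting the accumulator at 0.1 rather than 0 keeps the very first updates from being divided by a near-zero value.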

00: Experiments
【 Training time (Calculation cost) 】
Proposed Model
•  230,000 iters (12.8 epochs)
•  About 3 days + 4 hours
Baseline Model
•  600,000 iters (33 epochs)
•  50k vocab : 4 days + 14 hours
•  150k vocab : 8 days + 21 hours
- Other Settings -
•  Coverage Loss Weight : λ=1
•  The final model was tuned for a further 3,000 iterations (about 2 hours).
- Inspection -
•  λ=2 was also tried: the coverage loss went down, but the primary loss went up and the model was unusable.
•  The coverage model (the proposed model) was also trained without the coverage loss, in the hope that the attention mechanism would learn by itself to avoid repetition, but this did not work well.
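The epoch counts quoted for the two models follow from the iteration counts, the batch size of 16 and the 287,226 training pairs given on the earlier slides. A small check, with a helper name that is only illustrative:

```python
TRAIN_PAIRS = 287_226   # training set size from the Dataset slide
BATCH_SIZE = 16         # batch size from the settings table

def iterations_to_epochs(iters, batch_size=BATCH_SIZE, n_examples=TRAIN_PAIRS):
    """Number of passes over the training set made in `iters` mini-batch updates."""
    return iters * batch_size / n_examples

proposed_epochs = iterations_to_epochs(230_000)
baseline_epochs = iterations_to_epochs(600_000)
print(round(proposed_epochs, 1), round(baseline_epochs, 1))
```

The proposed model therefore needs well under half as many passes over the data as the baseline.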

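00: Results
•  The lead-3 method, which extracts the first three sentences of the source article, scores higher than the proposed method.
•  Nallapati et al.'s method uses anonymized data, whereas this work uses the original data, so the two cannot be compared directly; even so, the proposed method scores higher. For lead-3 as well, the score is higher on the original data.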
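The lead-3 baseline mentioned above is simple enough to sketch directly. The regular-expression sentence splitter here is a naive assumption made for illustration (it splits after ".", "!" or "?" followed by whitespace) and is not the tokenization used in the paper; the function name `lead_n` is also hypothetical.

```python
import re

def lead_n(article, n=3):
    """Return the first n sentences of the article as the summary."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", article.strip()) if s]
    return " ".join(sentences[:n])

# Toy article (an assumption for illustration).
article = "A won. B lost! Was C there? D came later."
summary = lead_n(article, n=3)
print(summary)
```

Because the baseline copies the opening of the article verbatim, it needs no training at all.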

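00: Results
•  The baseline model (seq2seq-attention) sometimes generates meaningless repeated sentences. The third sentence in Fig.1 is an example.
•  The baseline model also cannot express an OOV word by substituting another word (UNK is generated as-is).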

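00: Discussion
7.1 Comparison with extractive systems
•  Extractive summarization methods score higher on ROUGE than abstractive methods.
•  Two explanations seem possible.
【Explanation 1】
•  In news articles, the most important information tends to appear at the beginning. This partly explains the strength of the lead-3 baseline.
•  In fact, using only the first 400 tokens (about 20 sentences) of the article gave higher ROUGE scores than using the first 800 tokens.
【Explanation 2】
•  Given the nature of the task and of ROUGE, it is hard to beat extractive methods and lead-3.
•  Abstractive methods produce paraphrases and sentences that resemble the source article, but ROUGE gives these a score of zero and does not reward them.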

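00: Discussion
【Summary so far】: In summarization tasks evaluated with ROUGE,
lead-3 (leading sentences) > extractive methods > abstractive methods
ROUGE rewards simplistic strategies such as using the opening sentences of the source article or reusing its expressions.
This is why extractive methods score higher on ROUGE than abstractive methods, and why even extractive methods cannot beat the lead-3 baseline (first three sentences).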
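The point that ROUGE favors copied wording over paraphrase can be illustrated with a toy unigram-overlap score. This is a simplified ROUGE-1 F1 written for illustration only (whitespace tokens, no stemming), not the official ROUGE toolkit, and the example sentences are made up.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the home side won the match"
copied = "the home side won the match"
paraphrase = "hosts triumphed in their game"

copied_score = rouge1_f1(copied, reference)
paraphrase_score = rouge1_f1(paraphrase, reference)
print(copied_score, paraphrase_score)
```

The paraphrase says the same thing as the reference but shares no words with it, so it scores zero, while the verbatim copy scores 1.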

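00: Discussion
METEOR score
To address the issue above, the models were also evaluated with METEOR.
Besides exact word matches between the predicted and reference summaries, METEOR also rewards matches on stems, synonyms, and paraphrases (though this requires a dictionary prepared in advance).
•  The proposed method was better than the other abstractive models by more than 1 point.
•  On the other hand, it loses to lead-3. The format of news articles probably makes lead-3 very strong on these metrics.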

00: Discussion
We believe that investigating this issue further is an
important direction for future work.
7.2 How abstractive is our model ?
We have shown that our pointer mechanism makes
our abstractive system more reliable, copying factual
details correctly more often. But, does the ease of
copying make our system any less abstractive ?
•  Current evaluation metrics have limitations for abstractive summarization.
•  The pointer mechanism can copy factual details correctly, and it certainly improved the proposed method.
•  But does the ease of copying actually make the model less abstractive?

00: Discussion
For each n-gram size, the proportion of n-grams in the generated summaries that also appear in the source (src).
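The proportion described here can be computed as below. This is a sketch under simple assumptions: tokens are whitespace-separated, the function names are hypothetical, and the toy source and summary are made up.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def copied_ngram_ratio(summary_tokens, source_tokens, n):
    """Fraction of the summary's n-grams that also occur in the source."""
    summ = ngrams(summary_tokens, n)
    if not summ:
        return 0.0
    src = set(ngrams(source_tokens, n))
    return sum(g in src for g in summ) / len(summ)

# Toy data (assumptions for illustration).
source = "a b c d e".split()
summary = "a b d e".split()

ratios = {n: copied_ngram_ratio(summary, source, n) for n in (1, 2, 3)}
print(ratios)
```

Every word of the toy summary comes from the source, yet the ratio drops as n grows, because skipping a single word breaks the longer n-grams.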

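00: Discussion
Fig.7 ) Both articles in the figure are examples whose summaries come out as a formulaic sentence such as "X beat Y <score> on <day>".
Fig.5 ) Example summaries generated by the proposed method. Rather than formulaic summaries, they are written using new words.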

00: Discussion
p_gen is a measure of how abstractive the proposed method is.
•  Training : 0.30 → 0.53 (at the end of training)
•  Test : 0.17 on average
The model first copies mostly from the source, then learns to generate about half the time.

00:Conclusion
•  A pointer-generator network was proposed.
•  In experiments, the proposed method achieved state-of-the-art results on abstractive summarization with a long-text dataset.
-  It reduces repetition and inaccurate output.
