From Seq2seq with Attention to Abstractive
Text Summarization
Tho Phan
Vietnam Japan AI Community
December 01, 2019
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 1 / 64
Table of Contents
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 2 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 3 / 64
Text Summarization
Definition
Produce a shorter version of long texts while preserving the salient
information
Recap
Seq2seq with Attention
Natural Language Generation
ref: https://towardsdatascience.com/comparing-text-summarization-techniques-d1e2e465584e
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 4 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 5 / 64
Sequence to Sequence
Definition
A model that takes a sequence of items and outputs another
sequence of items.
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 6 / 64
Sequence to Sequence
A little bit deeper look
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 7 / 64
Sequence to Sequence
Example
Neural Machine Translation
How to represent input, context, output
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 8 / 64
Sequence to Sequence
Input (words) can be represented by a vector using word embedding
algorithms.
The context is a vector of floats.
How about output?
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 9 / 64
Why do we need attention?
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 10 / 64
Why do we need attention?
Problem: the context vector is a bottleneck for the model when dealing with
long sentences.
Solution: attention allows the model to focus on the relevant parts of
the input sequence as needed.
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 11 / 64
Attention in 2 stages
Stage 1
The encoder passes all the hidden states to the decoder
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 12 / 64
Attention in 2 stages
Stage 2
An attention decoder does an extra step before producing its output
How to score hidden state in step 2?
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 13 / 64
Attention in 2 stages
Bring all together
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 14 / 64
Attention in 2 stages
Pay attention to each decoding step
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 15 / 64
A Family of Attention Mechanisms
s_t: decoder hidden state, h_i: encoder hidden state
Wα, vα: weight matrices to be learned
ref: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 16 / 64
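To make the score functions above concrete, here is a minimal NumPy sketch of two members of this family: dot-product (Luong) scoring and additive (Bahdanau) scoring with learned W_α and v_α. The dimensions, random weights, and the helper softmax are illustrative assumptions, not the deck's exact setup.

```python
# A minimal sketch of two common attention score functions (assumed toy sizes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, T = 8, 5
h = np.random.randn(T, d)        # encoder hidden states h_1..h_T
s = np.random.randn(d)           # current decoder hidden state s_t

# Dot-product score: score(s_t, h_i) = s_t . h_i
dot_scores = h @ s

# Additive score: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
W_a = np.random.randn(d, 2 * d)
v_a = np.random.randn(d)
add_scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s, h_i])) for h_i in h])

# Attention weights and context vector (same recipe for either scoring rule).
alpha = softmax(dot_scores)
context = alpha @ h              # weighted sum of encoder hidden states
```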
Self-attention
“The animal didn’t cross the street because it was too tired.”
What does “it” refer to in this sentence?
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 17 / 64
Self-attention
Calculate self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 18 / 64
Self-attention
Calculate self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 19 / 64
Self-attention
Calculate self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 20 / 64
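Since the worked figures are not reproduced here, the following NumPy sketch shows the self-attention calculation the slides walk through: project the token embeddings to queries, keys, and values, score them with a scaled dot product, and take a softmax-weighted sum of the values. Dimensions and random weights are toy assumptions.

```python
# A minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, d_k = 4, 16, 8       # 4 tokens, model dim 16, head dim 8
X = np.random.randn(T, d_model)  # token embeddings

W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)  # how much each token attends to every other token
weights = softmax(scores, axis=-1)
Z = weights @ V                  # each row: attention output for one token
```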
Multi-head self-attention
Calculate multi-head self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 21 / 64
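A small extension of the previous sketch to multi-head self-attention: run several heads with their own projections, concatenate the head outputs, and project back to the model dimension. Head count and dimensions are again toy assumptions.

```python
# A minimal sketch of multi-head self-attention (toy sizes).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, n_heads = 4, 16, 4
d_k = d_model // n_heads
X = np.random.randn(T, d_model)

heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    heads.append(A @ V)                        # one output per head

W_O = np.random.randn(n_heads * d_k, d_model)
Z = np.concatenate(heads, axis=-1) @ W_O       # (T, d_model) combined output
```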
Transformer
Full Architecture
Attention is all you need, Vaswani et al., 2017, https://arxiv.org/abs/1706.03762
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 22 / 64
BERT
BERT is basically a multi-layer bidirectional Transformer Encoder
Feature                Transformer   Base   Large
Feedforward-networks   512           768    1024
Attention heads        8             12     16
Transformer blocks     6             12     24
https://jalammar.github.io/illustrated-bert
BERT, Devlin et al., 2018 https://arxiv.org/pdf/1810.04805.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 23 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 24 / 64
Natural Language Generation
Natural Language Generation refers to any setting in which we
generate (i.e. write) new text.
NLG is a subcomponent of:
Machine Translation
(Abstractive) Summarization
Dialogue (chit-chat and task-based)
Creative writing: storytelling, poetry-generation
Freeform Question Answering (i.e. answer is generated, not
extracted from text or knowledge base)
Image captioning
. . .
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 25 / 64
Language Model
Language Modeling
The task of predicting the next word, given the words so far
P(y_t | y_1, ..., y_{t−1})
Language Model
A system that produces the probability distribution above
Example
If that system is an RNN, it’s called an RNN-LM
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 26 / 64
Conditional Language Model
Conditional Language Modeling
The task of predicting the next word, given the words so far, and also
some other input x.
P(y_t | y_1, ..., y_{t−1}, x)
Examples
Machine Translation (x=source sentence, y=target sentence)
Summarization (x=input text, y=summarized text)
Dialogue (x=dialogue history, y=next utterance)
...
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 27 / 64
Conditional RNN Language Model
Example: Neural Machine Translation
Teacher Forcing: feed the gold target sentence into the decoder regardless
of what the decoder predicts during training
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 28 / 64
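A minimal PyTorch sketch of teacher forcing for a conditional RNN-LM: the decoder is fed the gold target tokens shifted by one, regardless of its own predictions. The GRU decoder, toy dimensions, and random data are assumptions for illustration, not the exact NMT model from the slide.

```python
# A minimal teacher-forcing training step (assumed toy dimensions).
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
proj = nn.Linear(hid_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# Gold target sentence (batch of 2, length 6) and an encoder context vector
# standing in as the decoder's initial hidden state.
gold = torch.randint(0, vocab_size, (2, 6))
context = torch.zeros(1, 2, hid_dim)           # (num_layers, batch, hid_dim)

# Teacher forcing: the decoder input at step t is the *gold* token y_{t-1},
# not the decoder's own previous prediction.
decoder_in = gold[:, :-1]                      # y_1 .. y_{T-1}
targets = gold[:, 1:]                          # y_2 .. y_T
hidden_states, _ = decoder(embed(decoder_in), context)
logits = proj(hidden_states)                   # (batch, T-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```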
Decoding Algorithm
Question: Once you’ve trained your (conditional) language model,
how do you use it to generate text?
Answer: A decoding algorithm is an algorithm you use to generate
text from your language model
Decoding algorithms
Greedy decoding
Beam search
Sampling methods
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 29 / 64
Greedy Search
A simple algorithm
On each step, take the most probable word (i.e. argmax)
Use that as the next word, and feed it as input on the next step
Keep going until you produce <END> or reach some max length
Due to lack of backtracking, output can be poor (e.g.
ungrammatical, unnatural, nonsensical)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 30 / 64
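A minimal sketch of the greedy decoding loop described above. `lm_step` is an assumed callable that maps the tokens generated so far to a probability distribution over the vocabulary; the toy uniform LM at the end exists only to make the snippet runnable.

```python
# A minimal greedy-decoding loop over an assumed language-model step function.
import numpy as np

def greedy_decode(lm_step, bos_id, eos_id, max_len=50):
    tokens = [bos_id]
    while len(tokens) < max_len:
        probs = lm_step(tokens)          # shape (vocab_size,)
        next_id = int(np.argmax(probs))  # take the most probable word (argmax)
        tokens.append(next_id)
        if next_id == eos_id:            # stop once <END> is produced
            break
    return tokens

# Toy usage with a uniform dummy LM (vocab of 10, <END> id = 1).
dummy_lm = lambda toks: np.full(10, 0.1)
print(greedy_decode(dummy_lm, bos_id=0, eos_id=1, max_len=5))
```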
Beam Search Decoding
A search algorithm which aims to find a high-probability
sequence (not necessarily the optimal sequence, though) by
tracking multiple possible sequences at once.
Core idea: On each step of decoder, keep track of the k most
probable (beam size) partial sequences (which we call
hypotheses)
After you reach some stopping criterion, choose the sequence
with the highest probability (factoring in some adjustment for
length)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 31 / 64
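A minimal sketch of beam search with beam size k: keep the k highest log-probability partial hypotheses at each step, move hypotheses that emit <END> to a finished list, and return the best sequence found. `lm_step` is again an assumed distribution function, and length normalization is omitted for brevity.

```python
# A minimal beam-search sketch over an assumed language-model step function.
import numpy as np

def beam_search(lm_step, bos_id, eos_id, k=3, max_len=20):
    beams = [([bos_id], 0.0)]                      # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = np.log(lm_step(tokens) + 1e-12)
            for w in np.argsort(log_probs)[-k:]:   # top-k continuations per beam
                candidates.append((tokens + [int(w)], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:       # keep the k best hypotheses
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:                              # every hypothesis has finished
            break
    return max(finished + beams, key=lambda c: c[1])
```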
Beam Search Decoding
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 32 / 64
Beam Search Decoding
What is the effect of changing beam size k?
Small k has similar problems to greedy decoding (k=1)
Ungrammatical, unnatural, nonsensical, incorrect
Larger k means you consider more hypotheses
Increasing k reduces some of the problems above
Larger k is more computationally expensive
But increasing k can introduce other problems
For NMT, increasing k too much decreases BLEU score
In open-ended tasks like chit-chat dialogue, large k can make
output more generic
Neural Machine Translation with Reconstruction, Tu et al, 2017 https://arxiv.org/pdf/1611.01874.pdf
Six Challenges for Neural Machine Translation, Koehn et al, 2017 https://arxiv.org/pdf/1706.03872.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 33 / 64
Sampling-based Decoding
Pure sampling
On each step t, randomly sample from the probability
distribution Pt to obtain your next word.
Like greedy decoding, but sample instead of argmax.
Top-n sampling
On each step t, randomly sample from Pt, restricted to just the
top-n most probable words
Like pure sampling, but truncate the probability distribution
n = 1 is greedy search, n = |V| is pure sampling
Increase n to get more diverse/risky output
Decrease n to get more generic/safe output
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 34 / 64
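A minimal sketch of pure sampling versus top-n sampling from a probability distribution P_t; the toy probability vector is an assumed input.

```python
# A minimal sketch of pure sampling vs. top-n (truncated) sampling.
import numpy as np

def pure_sample(p):
    return int(np.random.choice(len(p), p=p))

def top_n_sample(p, n):
    top = np.argsort(p)[-n:]          # keep only the n most probable words
    q = np.zeros_like(p)
    q[top] = p[top]
    q /= q.sum()                      # renormalize the truncated distribution
    return int(np.random.choice(len(p), p=q))

p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(pure_sample(p), top_n_sample(p, n=2))   # n=1 is greedy, n=|V| is pure sampling
```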
Softmax temperature
It’s a technique combined with decoding algorithms at test time.
On timestep t, the LM computes a probability distribution P_t by applying the
softmax function to a score vector s ∈ R^{|V|}:
P_t(w) = \frac{\exp(s_w)}{\sum_{w' \in V} \exp(s_{w'})}
You can apply a temperature hyperparameter τ to the softmax:
P_t(w) = \frac{\exp(s_w / \tau)}{\sum_{w' \in V} \exp(s_{w'} / \tau)}
Raise the temperature τ: P_t becomes more uniform
Thus more diverse output (probability mass is spread across the vocabulary)
Lower the temperature τ: P_t becomes more spiky
Thus less diverse output (probability mass is concentrated on the top words)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 35 / 64
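A minimal sketch of the temperature-scaled softmax above, using a toy score vector to show how τ spreads or concentrates the distribution.

```python
# A minimal softmax-with-temperature sketch (toy scores).
import numpy as np

def softmax_with_temperature(s, tau=1.0):
    z = (s - s.max()) / tau           # divide scores by the temperature tau
    e = np.exp(z)
    return e / e.sum()

s = np.array([3.0, 1.0, 0.2])
print(softmax_with_temperature(s, tau=0.5))   # spikier: probability concentrates
print(softmax_with_temperature(s, tau=2.0))   # more uniform: more diverse output
```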
Decoding algorithms
Summary
Greedy search is a simple method.
Beam search searches for high probability output.
Sampling methods are a way to get more diversity and
randomness
Softmax temperature is another way to control diversity
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 36 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 37 / 64
Text Summarization
Definition
Produce a shorter version of long texts while preserving the salient
information
ref: https://towardsdatascience.com/comparing-text-summarization-techniques-d1e2e465584e
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 38 / 64
Text Summarization
Text summarization methods
https://github.com/yandexdataschool/nlp_course
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 39 / 64
Abstractive vs. Extractive Text Summarization
Extractive
Select parts of the original text to
form a summary
Easier
Restrictive (no paraphrasing)
Abstractive
Generate new text using natural
language generation techniques.
More difficult
More flexible (more human)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 40 / 64
Abstractive vs. Extractive Text Summarization
Extractive
Score words/sentences and pick
Alice and Bob took the train
to visit the zoo. They saw a
baby giraffe, a lion, and a
flock of colorful tropical birds.
Alice and Bob visit the zoo.
saw a flock of birds.
Abstractive
Generate new texts
Alice and Bob took the train
to visit the zoo. They saw a
baby giraffe, a lion, and a
flock of colorful tropical birds.
Alice and Bob visited the zoo
and saw animals and birds.
https://github.com/yandexdataschool/nlp_course
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 41 / 64
Dataset
DUC-2004 contains 500 documents with on average 35.6 tokens
and summaries with 10.4 tokens.
Gigaword (2015) represents a sentence summarization/headline
generation task with very short input documents (31.4 tokens)
and summaries (8.3 tokens).
CNN/Daily Mail (2016) contains online news articles (781
tokens on average) paired with multi-sentence summaries (3.75
sentences or 56 tokens on average).
X-Sum (2018) is a summarization dataset which does not favor
extractive strategies and calls for an abstractive one. Data is
collected from the BBC.
and others (WikiHow, NYT, SQuAD, PubMed, etc).
https://github.com/sebastianruder/NLP-progress/blob/master/english/summarization.md
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 42 / 64
Evaluation
the most common evaluation metric
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE-N = \frac{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{gram_n \in S} Count(gram_n)}
ROUGE-1: unigram overlap
ROUGE-2: bigram overlap
ROUGE-N: n-gram overlap
ROUGE-L: longest common subsequence overlap
Other metrics: human evaluation, METEOR, BLEU, NIST, F-Measure, etc.
ROUGE: A Package for Automatic Evaluation of Summaries, Lin, 2004, https://www.aclweb.org/anthology/W04-1013/
https://github.com/sebastianruder/NLP-progress/blob/master/english/summarization.md
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 43 / 64
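A minimal sketch of the recall-style ROUGE-N formula above: matched reference n-grams (with clipped counts) divided by total reference n-grams. The whitespace tokenization and toy sentences are assumptions; real evaluations would use an established ROUGE package.

```python
# A minimal recall-oriented ROUGE-N computation.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    cand = ngrams(candidate, n)
    match = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())                       # reference n-grams
        match += sum(min(c, cand[g]) for g, c in ref_counts.items())  # clipped matches
    return match / total if total else 0.0

ref = "alice and bob visited the zoo".split()
cand = "alice and bob went to the zoo".split()
print(rouge_n(cand, [ref], n=1), rouge_n(cand, [ref], n=2))
```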
Evaluation
the most common evaluation metric
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
Limitations
ROUGE scores only assess content selection and do not account for other
quality aspects, such as fluency, grammaticality, coherence, etc.
They rely mostly on lexical overlap, while abstractive summarization
could express the same content as a reference without any
lexical overlap.
This raises the question of whether we should track progress based only on
these metrics.
ROUGE: A Package for Automatic Evaluation of Summaries, Lin, 2004, https://www.aclweb.org/anthology/W04-1013/
https://github.com/sebastianruder/NLP-progress/blob/master/english/summarization.md
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 44 / 64
Current Research
EMNLP 2015: Rush et al. published the first seq2seq
abstractive summarization paper using Neural Network LM.
NAACL 2016: Chopra et al. further extended the above model by
replacing the NNLM with an RNN decoder.
CoNLL 2016: Nallapati et al. introduced novel elements into the
RNN seq2seq model to handle OOV words and capture document structure.
ACL 2017: A. See et al. proposed a pointer-generator network.
2017: Paulus et al. used a deep reinforced model to directly
maximize ROUGE-L.
RL produced higher ROUGE scores but lower human judgment
scores.
A hybrid approach (with maximum likelihood) does best.
ACL 2018: Chen et al. used an RNN extractor + RL + reranking to
select salient sentences and rewrite them abstractively to
generate a summary.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 45 / 64
Current Research
CoNLL 2019: Zhang et al. applied pretraining-based NLG: they
used a Transformer-based decoder to generate a draft summary,
then masked the draft sequence and fed it to BERT for refinement.
EMNLP 2019: Liu and Lapata modified BERT and combined
extractive and abstractive methods for summarization.
NeurIPS 2019: Dong et al. (UNILM) employed a shared Transformer and
utilized self-attention masks to control what context the
prediction conditions on.
ACL 2019: Fabbri et al. introduced the Multi-News dataset and
used a Hierarchical MMR-Attention Pointer-generator model to
generate an abstractive summary from multiple documents.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 46 / 64
Neural Attention Model
2015: Rush et al. published the first seq2seq summarization paper
Figure: Attention-based summarization (ABS) system
A Neural Attention Model ..., Rush et al, 2015, https://www.aclweb.org/anthology/D15-1044.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 47 / 64
Copy mechanisms
Seq2seq+attention systems
good at writing fluent output
bad at copying over details (like rare words) correctly
Copy mechanisms use attention to enable a seq2seq system to
copy words and phrases from the input to the output more easily.
Allowing both copying and generating gives us a hybrid
extractive/abstractive approach.
There are several papers proposing copy mechanism variants.
Abstractive Text Summarization using Sequence-to-sequence
RNNs and Beyond, Nallapati et al, 2016
https://arxiv.org/pdf/1602.06023.pdf
others
NLP course, http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture15-nlg.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 48 / 64
Copy mechanisms
2017: See et al. used pointer-generator network + coverage
mechanism.
P(w) = p_{gen} P_{vocab}(w) + (1 − p_{gen}) \sum_{i : w_i = w} a_i^t
Get To The Point ..., See et al, 2017, https://www.aclweb.org/anthology/P17-1099.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 49 / 64
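A minimal sketch of the pointer-generator output distribution above: mix the decoder's vocabulary distribution with the attention distribution copied onto the source tokens. The toy vocabulary, attention weights, and p_gen value are assumptions.

```python
# A minimal pointer-generator mixture of generating and copying.
import numpy as np

vocab_size = 6
p_vocab = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])   # P_vocab(w) from the decoder
attention = np.array([0.7, 0.2, 0.1])                   # a^t over source positions
source_ids = [2, 5, 2]                                  # source token ids w_i
p_gen = 0.6                                             # generation probability

p_final = p_gen * p_vocab
for a_i, w_i in zip(attention, source_ids):
    p_final[w_i] += (1 - p_gen) * a_i                   # copy mass onto source words

print(p_final, p_final.sum())                           # still sums to 1
```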
Copy mechanisms
Big problem with copy mechanisms
They copy too much
mostly long phrases, sometimes even whole sentences
What should be an abstractive system collapses to a mostly
extractive system.
Another problem:
They’re bad at overall content selection, especially if the input
document is long
No overall strategy for selecting content
One solution: Bottom-up summarization in EMNLP 2018 by
Gehrmann et al.
NLP course, http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture15-nlg.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 50 / 64
Bottom-up summarization
Simple but accurate content selection model
better overall content selection strategy
less copying of long sequences
Bottom-Up Abstractive Summarization, Gehrmann et al., 2018, https://www.aclweb.org/anthology/D18-1443.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 51 / 64
Reinforcement Learning
Two steps
1 select salient sentences
2 rewrite them abstractively
Fast Abstractive Summarization ..., Chen et al., ACL 2018, https://www.aclweb.org/anthology/P18-1063.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 52 / 64
Transformer
Pretrained Encoder
Extractive
\hat{y}_i = \sigma(W_o h_i^L + b_o), where h_i^L is
the top-layer vector of sentence sent_i
Abstractive
Encoder: pretrained BERTSUM
Decoder: 6-layered transformer
Best model: the combination of extractive and abstractive models
Text Summarization ..., Yang et al., EMNLP 2019, https://www.aclweb.org/anthology/D19-1387
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 53 / 64
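A minimal sketch of the extractive scoring head above: a sigmoid over a linear projection of each top-layer sentence vector h_i^L. Dimensions and random vectors are assumptions.

```python
# A minimal sentence-scoring head: y_hat_i = sigmoid(W_o h_i^L + b_o).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

num_sents, d = 3, 8
H = np.random.randn(num_sents, d)      # h_i^L: top-layer vector of each sentence
W_o = np.random.randn(d)
b_o = 0.0

y_hat = sigmoid(H @ W_o + b_o)         # selection score per sentence
print(y_hat)
```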
Transformer
UNIfied pre-trained Language Model (UNILM) is fundamentally a
multi-layer Transformer network
Unified Language ..., Dong et al., NeurIPS 2019, https://arxiv.org/pdf/1905.03197.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 54 / 64
Transformer
Overview of UNILM
A_l = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V_l
M_{ij} = \begin{cases} 0, & \text{allow attention} \\ -\infty, & \text{prevent attention} \end{cases}
CNN/DM dataset’s leaderboard
Unified Language ..., Dong et al., NeurIPS 2019, https://arxiv.org/pdf/1905.03197.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 55 / 64
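A minimal sketch of the self-attention mask M above for UNILM's sequence-to-sequence setting: source tokens may attend to the whole source, while target tokens may attend to the source and to earlier target positions only. Segment lengths are toy assumptions.

```python
# A minimal UNILM-style seq-to-seq attention mask (0 = allow, -inf = prevent).
import numpy as np

src_len, tgt_len = 3, 4
n = src_len + tgt_len
M = np.full((n, n), -np.inf)           # start by preventing all attention

M[:src_len, :src_len] = 0.0            # source <-> source: allowed
M[src_len:, :src_len] = 0.0            # target -> source: allowed
for i in range(src_len, n):            # target -> current and earlier target tokens
    M[i, src_len:i + 1] = 0.0

# M is added to QK^T / sqrt(d_k) before the softmax, as in the formula above.
print(M)
```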
Multi-Document Summarization
Hi-MAP Model
a^t = \mathrm{softmax}(\mathrm{func}(h^{enc}, h^{dec}))
MMR_i = \lambda \, \mathrm{Sim}_1(h_i^s, s^{sum}) - (1 - \lambda) \max_{s_j \in D,\, j \neq i} \mathrm{Sim}_2(h_i^s, h_j^s)
a^t_i = a^t_i \cdot MMR_i \quad \text{(attention weights scaled by the MMR scores)}
Multi-News: a Large-Scale ..., Fabbri et al., ACL 2019, https://arxiv.org/abs/1906.01749
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 56 / 64
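A minimal sketch of the MMR-weighted attention idea in the Hi-MAP formulas above: each sentence gets an MMR score balancing similarity to the current summary state against redundancy with other sentences, and the attention weights are reweighted by it (shown here at the sentence level for simplicity). Cosine similarity, random vectors, and λ = 0.6 are assumptions.

```python
# A minimal MMR-reweighted attention sketch (toy sentence representations).
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

d, n_sents = 8, 3
h_s = np.random.randn(n_sents, d)        # sentence representations h_i^s
s_sum = np.random.randn(d)               # current summary representation
lam = 0.6                                # relevance vs. redundancy trade-off

mmr = np.array([
    lam * cos(h_s[i], s_sum)
    - (1 - lam) * max(cos(h_s[i], h_s[j]) for j in range(n_sents) if j != i)
    for i in range(n_sents)
])

a_t = np.random.dirichlet(np.ones(n_sents))   # stand-in attention weights a^t
a_t_new = a_t * mmr                           # attention scaled by MMR, per the slide
print(a_t_new)
```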
What’s next
The future
evaluation
How to provide a better metric to evaluate an abstractive
summarization output
multi-document
Currently, most research focuses on single documents; there is
room for work on multiple documents.
other domains
Go beyond news text such as clinical, medical, research texts,
etc.
constraints
reader’s age
background knowledge of the reader
summary length
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 57 / 64
References I
Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and
Chandan K. Reddy.
Neural abstractive text summarization with sequence-to-sequence
models, 2018.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
Neural machine translation by jointly learning to align and
translate, 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin.
Attention is all you need, 2017.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 58 / 64
References II
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar
Gülçehre, and Bing Xiang.
Abstractive text summarization using sequence-to-sequence
RNNs and beyond.
In Proceedings of The 20th SIGNLL Conference on
Computational Natural Language Learning, pages 280–290,
Berlin, Germany, August 2016. Association for Computational
Linguistics.
Alexander M. Rush, Sumit Chopra, and Jason Weston.
A neural attention model for abstractive sentence summarization.
Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, 2015.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 59 / 64
References III
Yichen Jiang and Mohit Bansal.
Closed-book training to improve summarization encoder memory.
In Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, pages 4067–4077, Brussels,
Belgium, October-November 2018. Association for
Computational Linguistics.
Haoyu Zhang, Jingjing Cai, Jianjun Xu, and Ji Wang.
Pretraining-based natural language generation for text
summarization.
Proceedings of the 23rd Conference on Computational Natural
Language Learning (CoNLL), 2019.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 60 / 64
References IV
Yang Liu and Mirella Lapata.
Text summarization with pretrained encoders.
Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP),
2019.
Sebastian Gehrmann, Yuntian Deng, and Alexander Rush.
Bottom-up abstractive summarization.
In Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, pages 4098–4109, Brussels,
Belgium, October-November 2018. Association for
Computational Linguistics.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 61 / 64
References V
Yang Liu and Mirella Lapata.
Text summarization with pretrained encoders.
In Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP),
pages 3721–3731, Hong Kong, China, November 2019.
Association for Computational Linguistics.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu,
Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon.
Unified language model pre-training for natural language
understanding and generation, 2019.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 62 / 64
References VI
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir
Radev.
Multi-news: A large-scale multi-document summarization dataset
and abstractive hierarchical model.
Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, 2019.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 63 / 64
Thank you!
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 64 / 64
