From Seq2seq with Attention to Abstractive
Text Summarization
Tho Phan
Vietnam Japan AI Community
December 01, 2019
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 1 / 64
Table of Contents
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 2 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 3 / 64
Text Summarization
Definition
Produce a shorter version of long texts while preserving the salient
information
Recap
Seq2seq with Attention
Natural Language Generation
ref: https://towardsdatascience.com/comparing-text-summarization-techniques-d1e2e465584e
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 4 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 5 / 64
Sequence to Sequence
Definition
A model that takes a sequence of items and outputs another
sequence of items.
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 6 / 64
Sequence to Sequence
A little bit deeper look
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 7 / 64
Sequence to Sequence
Example
Neural Machine Translation
How to represent input, context, output
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 8 / 64
Sequence to Sequence
Input (words) can be represented by a vector using word embedding
algorithms.
The context is a vector of floats.
How about output?
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 9 / 64
Why do we need attention?
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 10 / 64
Why do we need attention?
Problem: the context vector is a bottleneck for the model when dealing with
long sentences.
Solution: attention allows the model to focus on the relevant parts of
the input sequence as needed.
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 11 / 64
Attention in 2 stages
Stage 1
The encoder passes all the hidden states to the decoder
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 12 / 64
Attention in 2 stages
Stage 2
An attention decoder does an extra step before producing its output
How to score hidden state in step 2?
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 13 / 64
Attention in 2 stages
Bring all together
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 14 / 64
Attention in 2 stages
Pay attention to each decoding step
ref: https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 15 / 64
A Family of Attention Mechanisms
s_t: decoder hidden state, h_i: encoder hidden state
Wα, vα: weight matrices to be learned
ref: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 16 / 64
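To make the score functions above concrete, here is a minimal NumPy sketch of two members of this family: dot-product (Luong) scoring and additive (Bahdanau) scoring with learned W_α and v_α. The dimensions, random weights, and the helper softmax are illustrative assumptions, not the deck's exact setup.

```python
# A minimal sketch of two common attention score functions (assumed toy sizes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, T = 8, 5
h = np.random.randn(T, d)        # encoder hidden states h_1..h_T
s = np.random.randn(d)           # current decoder hidden state s_t

# Dot-product score: score(s_t, h_i) = s_t . h_i
dot_scores = h @ s

# Additive score: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
W_a = np.random.randn(d, 2 * d)
v_a = np.random.randn(d)
add_scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s, h_i])) for h_i in h])

# Attention weights and context vector (same recipe for either scoring rule).
alpha = softmax(dot_scores)
context = alpha @ h              # weighted sum of encoder hidden states
```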
Self-attention
“The animal didn’t cross the street because it was too tired.”
What does “it” refer to in this sentence?
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 17 / 64
Self-attention
Calculate self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 18 / 64
Self-attention
Calculate self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 19 / 64
Self-attention
Calculate self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 20 / 64
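Since the worked figures are not reproduced here, the following NumPy sketch shows the self-attention calculation the slides walk through: project the token embeddings to queries, keys, and values, score them with a scaled dot product, and take a softmax-weighted sum of the values. Dimensions and random weights are toy assumptions.

```python
# A minimal sketch of single-head scaled dot-product self-attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, d_k = 4, 16, 8       # 4 tokens, model dim 16, head dim 8
X = np.random.randn(T, d_model)  # token embeddings

W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)  # how much each token attends to every other token
weights = softmax(scores, axis=-1)
Z = weights @ V                  # each row: attention output for one token
```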
Multi-head self-attention
Calculate multi-head self-attention
ref: https://jalammar.github.io/illustrated-transformer
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 21 / 64
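A small extension of the previous sketch to multi-head self-attention: run several heads with their own projections, concatenate the head outputs, and project back to the model dimension. Head count and dimensions are again toy assumptions.

```python
# A minimal sketch of multi-head self-attention (toy sizes).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d_model, n_heads = 4, 16, 4
d_k = d_model // n_heads
X = np.random.randn(T, d_model)

heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    heads.append(A @ V)                        # one output per head

W_O = np.random.randn(n_heads * d_k, d_model)
Z = np.concatenate(heads, axis=-1) @ W_O       # (T, d_model) combined output
```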
Transformer
Full Architecture
Attention is all you need, Vaswani et al., 2017, https://arxiv.org/abs/1706.03762
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 22 / 64
BERT
BERT is basically a multi-layer bidirectional Transformer Encoder
Feature                Transformer   Base   Large
Feedforward-networks   512           768    1024
Attention heads        8             12     16
Transformer blocks     6             12     24
https://jalammar.github.io/illustrated-bert
BERT, Devlin et al., 2018 https://arxiv.org/pdf/1810.04805.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 23 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 24 / 64
Natural Language Generation
Natural Language Generation refers to any setting in which we
generate (i.e. write) new text.
NLG is a subcomponent of:
Machine Translation
(Abstractive) Summarization
Dialogue (chit-chat and task-based)
Creative writing: storytelling, poetry-generation
Freeform Question Answering (i.e. answer is generated, not
extracted from text or knowledge base)
Image captioning
. . .
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 25 / 64
Language Model
Language Modeling
The task of predicting the next word, given the words so far
P(y_t | y_1, ..., y_{t−1})
Language Model
A system that produces the probability distribution above
Example
If that system is an RNN, it’s called an RNN-LM
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 26 / 64
Conditional Language Model
Conditional Language Modeling
The task of predicting the next word, given the words so far, and also
some other input x.
P(y_t | y_1, ..., y_{t−1}, x)
Examples
Machine Translation (x=source sentence, y=target sentence)
Summarization (x=input text, y=summarized text)
Dialogue (x=dialogue history, y=next utterance)
...
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 27 / 64
Conditional RNN Language Model
Example: Neural Machine Translation
Teacher Forcing: feed the gold target sentence into the decoder regardless
of what the decoder predicts during training
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 28 / 64
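A minimal PyTorch sketch of teacher forcing for a conditional RNN-LM: the decoder is fed the gold target tokens shifted by one, regardless of its own predictions. The GRU decoder, toy dimensions, and random data are assumptions for illustration, not the exact NMT model from the slide.

```python
# A minimal teacher-forcing training step (assumed toy dimensions).
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
proj = nn.Linear(hid_dim, vocab_size)
loss_fn = nn.CrossEntropyLoss()

# Gold target sentence (batch of 2, length 6) and an encoder context vector
# standing in as the decoder's initial hidden state.
gold = torch.randint(0, vocab_size, (2, 6))
context = torch.zeros(1, 2, hid_dim)           # (num_layers, batch, hid_dim)

# Teacher forcing: the decoder input at step t is the *gold* token y_{t-1},
# not the decoder's own previous prediction.
decoder_in = gold[:, :-1]                      # y_1 .. y_{T-1}
targets = gold[:, 1:]                          # y_2 .. y_T
hidden_states, _ = decoder(embed(decoder_in), context)
logits = proj(hidden_states)                   # (batch, T-1, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```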
Decoding Algorithm
Question: Once you’ve trained your (conditional) language model,
how do you use it to generate text?
Answer: A decoding algorithm is an algorithm you use to generate
text from your language model
Decoding algorithms
Greedy decoding
Beam search
Sampling methods
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 29 / 64
Greedy Search
A simple algorithm
On each step, take the most probable word (i.e. argmax)
Use that as the next word, and feed it as input on the next step
Keep going until you produce <END> or reach some max length
Due to lack of backtracking, output can be poor (e.g.
ungrammatical, unnatural, nonsensical)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 30 / 64
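A minimal sketch of the greedy decoding loop described above. `lm_step` is an assumed callable that maps the tokens generated so far to a probability distribution over the vocabulary; the toy uniform LM at the end exists only to make the snippet runnable.

```python
# A minimal greedy-decoding loop over an assumed language-model step function.
import numpy as np

def greedy_decode(lm_step, bos_id, eos_id, max_len=50):
    tokens = [bos_id]
    while len(tokens) < max_len:
        probs = lm_step(tokens)          # shape (vocab_size,)
        next_id = int(np.argmax(probs))  # take the most probable word (argmax)
        tokens.append(next_id)
        if next_id == eos_id:            # stop once <END> is produced
            break
    return tokens

# Toy usage with a uniform dummy LM (vocab of 10, <END> id = 1).
dummy_lm = lambda toks: np.full(10, 0.1)
print(greedy_decode(dummy_lm, bos_id=0, eos_id=1, max_len=5))
```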
Beam Search Decoding
A search algorithm which aims to find a high-probability
sequence (not necessarily the optimal sequence, though) by
tracking multiple possible sequences at once.
Core idea: On each step of decoder, keep track of the k most
probable (beam size) partial sequences (which we call
hypotheses)
After you reach some stopping criterion, choose the sequence
with the highest probability (factoring in some adjustment for
length)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 31 / 64
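A minimal sketch of beam search with beam size k: keep the k highest log-probability partial hypotheses at each step, move hypotheses that emit <END> to a finished list, and return the best sequence found. `lm_step` is again an assumed distribution function, and length normalization is omitted for brevity.

```python
# A minimal beam-search sketch over an assumed language-model step function.
import numpy as np

def beam_search(lm_step, bos_id, eos_id, k=3, max_len=20):
    beams = [([bos_id], 0.0)]                      # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = np.log(lm_step(tokens) + 1e-12)
            for w in np.argsort(log_probs)[-k:]:   # top-k continuations per beam
                candidates.append((tokens + [int(w)], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:       # keep the k best hypotheses
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:                              # every hypothesis has finished
            break
    return max(finished + beams, key=lambda c: c[1])
```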
Beam Search Decoding
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 32 / 64
Beam Search Decoding
What is the effect of changing beam size k?
Small k has similar problems to greedy decoding (k=1)
Ungrammatical, unnatural, nonsensical, incorrect
Larger k means you consider more hypotheses
Increasing k reduces some of the problems above
Larger k is more computationally expensive
But increasing k can introduce other problems
For NMT, increasing k too much decreases BLEU score
In open-ended tasks like chit-chat dialogue, large k can make
output more generic
Neural Machine Translation with Reconstruction, Tu et al, 2017 https://arxiv.org/pdf/1611.01874.pdf
Six Challenges for Neural Machine Translation, Koehn et al, 2017 https://arxiv.org/pdf/1706.03872.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 33 / 64
Sampling-based Decoding
Pure sampling
On each step t, randomly sample from the probability
distribution Pt to obtain your next word.
Like greedy decoding, but sample instead of argmax.
Top-n sampling
On each step t, randomly sample from Pt, restricted to just the
top-n most probable words
Like pure sampling, but truncate the probability distribution
n = 1 is greedy search, n = |V| is pure sampling
Increase n to get more diverse/risky output
Decrease n to get more generic/safe output
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 34 / 64
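A minimal sketch of pure sampling versus top-n sampling from a probability distribution P_t; the toy probability vector is an assumed input.

```python
# A minimal sketch of pure sampling vs. top-n (truncated) sampling.
import numpy as np

def pure_sample(p):
    return int(np.random.choice(len(p), p=p))

def top_n_sample(p, n):
    top = np.argsort(p)[-n:]          # keep only the n most probable words
    q = np.zeros_like(p)
    q[top] = p[top]
    q /= q.sum()                      # renormalize the truncated distribution
    return int(np.random.choice(len(p), p=q))

p = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(pure_sample(p), top_n_sample(p, n=2))   # n=1 is greedy, n=|V| is pure sampling
```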
Softmax temperature
It’s a technique combined with decoding algorithms at test time.
On timestep t, the LM computes a probability distribution P_t by applying the
softmax function to a score vector s ∈ R^{|V|}:
P_t(w) = \frac{\exp(s_w)}{\sum_{w' \in V} \exp(s_{w'})}
You can apply a temperature hyperparameter τ to the softmax:
P_t(w) = \frac{\exp(s_w / \tau)}{\sum_{w' \in V} \exp(s_{w'} / \tau)}
Raise the temperature τ: P_t becomes more uniform
Thus more diverse output (probability mass is spread across the vocabulary)
Lower the temperature τ: P_t becomes more spiky
Thus less diverse output (probability mass is concentrated on the top words)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 35 / 64
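A minimal sketch of the temperature-scaled softmax above, using a toy score vector to show how τ spreads or concentrates the distribution.

```python
# A minimal softmax-with-temperature sketch (toy scores).
import numpy as np

def softmax_with_temperature(s, tau=1.0):
    z = (s - s.max()) / tau           # divide scores by the temperature tau
    e = np.exp(z)
    return e / e.sum()

s = np.array([3.0, 1.0, 0.2])
print(softmax_with_temperature(s, tau=0.5))   # spikier: probability concentrates
print(softmax_with_temperature(s, tau=2.0))   # more uniform: more diverse output
```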
Decoding algorithms
Summary
Greedy search is a simple method.
Beam search searches for high probability output.
Sampling methods are a way to get more diversity and
randomness
Softmax temperature is another way to control diversity
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 36 / 64
Next Content
1 Introduction
2 Seq2seq with Attention
3 Natural Language Generation
4 Abstractive Text Summarization
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 37 / 64
Text Summarization
Definition
Produce a shorter version of long texts while preserving the salient
information
ref: https://towardsdatascience.com/comparing-text-summarization-techniques-d1e2e465584e
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 38 / 64
Text Summarization
Text summarization methods
https://github.com/yandexdataschool/nlp_course
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 39 / 64
Abstractive vs. Extractive Text Summarization
Extractive
Select parts of the original text to
form a summary
Easier
Restrictive (no paraphrasing)
Abstractive
Generate new text using natural
language generation techniques.
More difficult
More flexible (more human)
NLP course, http://web.stanford.edu/class/cs224n
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 40 / 64
Abstractive vs. Extractive Text Summarization
Extractive
Score words/sentences and pick
Alice and Bob took the train
to visit the zoo. They saw a
baby giraffe, a lion, and a
flock of colorful tropical birds.
Alice and Bob visit the zoo.
saw a flock of birds.
Abstractive
Generate new texts
Alice and Bob took the train
to visit the zoo. They saw a
baby giraffe, a lion, and a
flock of colorful tropical birds.
Alice and Bob visited the zoo
and saw animals and birds.
https://github.com/yandexdataschool/nlp_course
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 41 / 64
Dataset
DUC-2004 contains 500 documents with on average 35.6 tokens
and summaries with 10.4 tokens.
Gigaword (2015) represents a sentence summarization/headline
generation task with very short input documents (31.4 tokens)
and summaries (8.3 tokens).
CNN/Daily Mail (2016) contains online news articles (781
tokens on average) paired with multi-sentence summaries (3.75
sentences or 56 tokens on average).
X-Sum (2018) is a summarization dataset which does not favor
extractive strategies and calls for an abstractive one. Data is
collected from the BBC.
and others (WikiHow, NYT, SQuAD, PubMed, etc).
https://github.com/sebastianruder/NLP-progress/blob/master/english/summarization.md
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 42 / 64
Evaluation
the most common evaluation metric
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE-N = \frac{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{\text{ReferenceSummaries}\}} \sum_{gram_n \in S} Count(gram_n)}
ROUGE-1: unigram overlap
ROUGE-2: bigram overlap
ROUGE-N: n-gram overlap
ROUGE-L: longest common subsequence overlap
Other metrics: human evaluation, METEOR, BLEU, NIST, F-Measure, etc.
ROUGE: A Package for Automatic Evaluation of Summaries, Lin, 2004, https://www.aclweb.org/anthology/W04-1013/
https://github.com/sebastianruder/NLP-progress/blob/master/english/summarization.md
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 43 / 64
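A minimal sketch of the recall-style ROUGE-N formula above: matched reference n-grams (with clipped counts) divided by total reference n-grams. The whitespace tokenization and toy sentences are assumptions; real evaluations would use an established ROUGE package.

```python
# A minimal recall-oriented ROUGE-N computation.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    cand = ngrams(candidate, n)
    match = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())                       # reference n-grams
        match += sum(min(c, cand[g]) for g, c in ref_counts.items())  # clipped matches
    return match / total if total else 0.0

ref = "alice and bob visited the zoo".split()
cand = "alice and bob went to the zoo".split()
print(rouge_n(cand, [ref], n=1), rouge_n(cand, [ref], n=2))
```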
Evaluation
the most common evaluation metric
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
Limitations
ROUGE scores only assess content selection and do not account for other
quality aspects, such as fluency, grammaticality, coherence, etc.
They rely mostly on lexical overlap, while abstractive summarization
could express the same content as a reference without any
lexical overlap.
This raises the question of whether we should track progress based only on
these metrics.
ROUGE: A Package for Automatic Evaluation of Summaries, Lin, 2004, https://www.aclweb.org/anthology/W04-1013/
https://github.com/sebastianruder/NLP-progress/blob/master/english/summarization.md
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 44 / 64
Current Research
EMNLP 2015: Rush et al. published the first seq2seq
abstractive summarization paper using Neural Network LM.
NAACL 2016: Chopra et al. further extended the above model by
replacing the NNLM with an RNN decoder.
CoNLL 2016: Nallapati et al. introduced novel elements into the
RNN seq2seq model to handle OOV words and capture document structure.
ACL 2017: A. See et al. proposed a pointer-generator network.
2017: Paulus et al. used a deep reinforced model to directly
maximize ROUGE-L.
RL produced higher ROUGE scores but lower human judgment
scores.
A hybrid approach (with maximum likelihood) does best.
ACL 2018: Chen et al. used an RNN extractor + RL + reranking to
select salient sentences and rewrite them abstractively to
generate a summary.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 45 / 64
Current Research
CoNLL 2019: Zhang et al. applied pretraining-based NLG: they
used a Transformer-based decoder to generate a draft summary,
then masked the draft sequence and fed it to BERT for refinement.
EMNLP 2019: Liu and Lapata modified BERT and combined
extractive and abstractive methods for summarization.
NeurIPS 2019: Dong et al. (UNILM) employed a shared Transformer and
utilized self-attention masks to control what context the
prediction conditions on.
ACL 2019: Fabbri et al. introduced the Multi-News dataset and
used a Hierarchical MMR-Attention Pointer-generator model to
generate an abstractive summary from multiple documents.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 46 / 64
Neural Attention Model
2015: Rush et al. published the first seq2seq summarization paper
Figure: Attention-based summarization (ABS) system
A Neural Attention Model ..., Rush et al, 2015, https://www.aclweb.org/anthology/D15-1044.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 47 / 64
Copy mechanisms
Seq2seq+attention systems
good at writing fluent output
bad at copying over details (like rare words) correctly
Copy mechanisms use attention to enable a seq2seq system to
copy words and phrases from the input to the output more easily.
Allowing both copying and generating gives us a hybrid
extractive/abstractive approach.
There are several papers proposing copy mechanism variants.
Abstractive Text Summarization using Sequence-to-sequence
RNNs and Beyond, Nallapati et al, 2016
https://arxiv.org/pdf/1602.06023.pdf
others
NLP course, http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture15-nlg.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 48 / 64
Copy mechanisms
2017: See et al. used pointer-generator network + coverage
mechanism.
P(w) = p_{gen} P_{vocab}(w) + (1 − p_{gen}) \sum_{i : w_i = w} a_i^t
Get To The Point ..., See et al, 2017, https://www.aclweb.org/anthology/P17-1099.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 49 / 64
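A minimal sketch of the pointer-generator output distribution above: mix the decoder's vocabulary distribution with the attention distribution copied onto the source tokens. The toy vocabulary, attention weights, and p_gen value are assumptions.

```python
# A minimal pointer-generator mixture of generating and copying.
import numpy as np

vocab_size = 6
p_vocab = np.array([0.4, 0.3, 0.1, 0.1, 0.05, 0.05])   # P_vocab(w) from the decoder
attention = np.array([0.7, 0.2, 0.1])                   # a^t over source positions
source_ids = [2, 5, 2]                                  # source token ids w_i
p_gen = 0.6                                             # generation probability

p_final = p_gen * p_vocab
for a_i, w_i in zip(attention, source_ids):
    p_final[w_i] += (1 - p_gen) * a_i                   # copy mass onto source words

print(p_final, p_final.sum())                           # still sums to 1
```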
Copy mechanisms
Big problem with copy mechanisms
They copy too much
mostly long phrases, sometimes even whole sentences
What should be an abstractive system collapses to a mostly
extractive system.
Another problem:
They’re bad at overall content selection, especially if the input
document is long
No overall strategy for selecting content
One solution: Bottom-up summarization in EMNLP 2018 by
Gehrmann et al.
NLP course, http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture15-nlg.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 50 / 64
Bottom-up summarization
Simple but accurate content selection model
better overall content selection strategy
less copying of long sequences
Bottom-Up Abstractive Summarization, Gehrmann et al., 2018, https://www.aclweb.org/anthology/D18-1443.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 51 / 64
Reinforcement Learning
Two steps
1 select salient sentences
2 rewrite them abstractively
Fast Abstractive Summarization ..., Chen et al., ACL 2018, https://www.aclweb.org/anthology/P18-1063.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 52 / 64
Transformer
Pretrained Encoder
Extractive
\hat{y}_i = \sigma(W_o h_i^L + b_o), where h_i^L is
the top-layer vector of sentence sent_i
Abstractive
Encoder: pretrained BERTSUM
Decoder: 6-layered transformer
Best model: the combination of extractive and abstractive models
Text Summarization ..., Yang et al., EMNLP 2019, https://www.aclweb.org/anthology/D19-1387
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 53 / 64
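A minimal sketch of the extractive scoring head above: a sigmoid over a linear projection of each top-layer sentence vector h_i^L. Dimensions and random vectors are assumptions.

```python
# A minimal sentence-scoring head: y_hat_i = sigmoid(W_o h_i^L + b_o).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

num_sents, d = 3, 8
H = np.random.randn(num_sents, d)      # h_i^L: top-layer vector of each sentence
W_o = np.random.randn(d)
b_o = 0.0

y_hat = sigmoid(H @ W_o + b_o)         # selection score per sentence
print(y_hat)
```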
Transformer
UNIfied pre-trained Language Model (UNILM) is fundamentally a
multi-layer Transformer network
Unified Language ..., Dong et al., NeurIPS 2019, https://arxiv.org/pdf/1905.03197.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 54 / 64
Transformer
Overview of UNILM
A_l = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V_l
M_{ij} = \begin{cases} 0, & \text{allow attention} \\ -\infty, & \text{prevent attention} \end{cases}
CNN/DM dataset’s leaderboard
Unified Language ..., Dong et al., NeurIPS 2019, https://arxiv.org/pdf/1905.03197.pdf
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 55 / 64
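A minimal sketch of the self-attention mask M above for UNILM's sequence-to-sequence setting: source tokens may attend to the whole source, while target tokens may attend to the source and to earlier target positions only. Segment lengths are toy assumptions.

```python
# A minimal UNILM-style seq-to-seq attention mask (0 = allow, -inf = prevent).
import numpy as np

src_len, tgt_len = 3, 4
n = src_len + tgt_len
M = np.full((n, n), -np.inf)           # start by preventing all attention

M[:src_len, :src_len] = 0.0            # source <-> source: allowed
M[src_len:, :src_len] = 0.0            # target -> source: allowed
for i in range(src_len, n):            # target -> current and earlier target tokens
    M[i, src_len:i + 1] = 0.0

# M is added to QK^T / sqrt(d_k) before the softmax, as in the formula above.
print(M)
```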
Multi-Document Summarization
Hi-MAP Model
a^t = \mathrm{softmax}(\mathrm{func}(h^{enc}, h^{dec}))
MMR_i = \lambda \, \mathrm{Sim}_1(h_i^s, s^{sum}) - (1 - \lambda) \max_{s_j \in D,\, j \neq i} \mathrm{Sim}_2(h_i^s, h_j^s)
a^t_i = a^t_i \cdot MMR_i \quad \text{(attention weights scaled by the MMR scores)}
Multi-News: a Large-Scale ..., Fabbri et al., ACL 2019, https://arxiv.org/abs/1906.01749
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 56 / 64
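A minimal sketch of the MMR-weighted attention idea in the Hi-MAP formulas above: each sentence gets an MMR score balancing similarity to the current summary state against redundancy with other sentences, and the attention weights are reweighted by it (shown here at the sentence level for simplicity). Cosine similarity, random vectors, and λ = 0.6 are assumptions.

```python
# A minimal MMR-reweighted attention sketch (toy sentence representations).
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

d, n_sents = 8, 3
h_s = np.random.randn(n_sents, d)        # sentence representations h_i^s
s_sum = np.random.randn(d)               # current summary representation
lam = 0.6                                # relevance vs. redundancy trade-off

mmr = np.array([
    lam * cos(h_s[i], s_sum)
    - (1 - lam) * max(cos(h_s[i], h_s[j]) for j in range(n_sents) if j != i)
    for i in range(n_sents)
])

a_t = np.random.dirichlet(np.ones(n_sents))   # stand-in attention weights a^t
a_t_new = a_t * mmr                           # attention scaled by MMR, per the slide
print(a_t_new)
```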
What’s next
The future
evaluation
How to provide a better metric to evaluate an abstractive
summarization output
multi-document
Currently, most research focuses on single documents; there is
room for work on multiple documents.
other domains
Go beyond news text such as clinical, medical, research texts,
etc.
constraints
reader’s age
background knowledge of the reader
summary length
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 57 / 64
References I
Tian Shi, Yaser Keneshloo, Naren Ramakrishnan, and
Chandan K. Reddy.
Neural abstractive text summarization with sequence-to-sequence
models, 2018.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.
Neural machine translation by jointly learning to align and
translate, 2014.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin.
Attention is all you need, 2017.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 58 / 64
References II
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar
Gülçehre, and Bing Xiang.
Abstractive text summarization using sequence-to-sequence
RNNs and beyond.
In Proceedings of The 20th SIGNLL Conference on
Computational Natural Language Learning, pages 280–290,
Berlin, Germany, August 2016. Association for Computational
Linguistics.
Alexander M. Rush, Sumit Chopra, and Jason Weston.
A neural attention model for abstractive sentence summarization.
Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, 2015.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 59 / 64
References III
Yichen Jiang and Mohit Bansal.
Closed-book training to improve summarization encoder memory.
In Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, pages 4067–4077, Brussels,
Belgium, October-November 2018. Association for
Computational Linguistics.
Haoyu Zhang, Jingjing Cai, Jianjun Xu, and Ji Wang.
Pretraining-based natural language generation for text
summarization.
Proceedings of the 23rd Conference on Computational Natural
Language Learning (CoNLL), 2019.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 60 / 64
References IV
Yang Liu and Mirella Lapata.
Text summarization with pretrained encoders.
Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP),
2019.
Sebastian Gehrmann, Yuntian Deng, and Alexander Rush.
Bottom-up abstractive summarization.
In Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing, pages 4098–4109, Brussels,
Belgium, October-November 2018. Association for
Computational Linguistics.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 61 / 64
References V
Yang Liu and Mirella Lapata.
Text summarization with pretrained encoders.
In Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP),
pages 3721–3731, Hong Kong, China, November 2019.
Association for Computational Linguistics.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu,
Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon.
Unified language model pre-training for natural language
understanding and generation, 2019.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 62 / 64
References VI
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir
Radev.
Multi-news: A large-scale multi-document summarization dataset
and abstractive hierarchical model.
Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, 2019.
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 63 / 64
Thank you!
Tho Phan (VJAI) Abstractive Text Summarization December 01, 2019 64 / 64
