2. Summary
• Methods
  • Unsupervised summarization
  • Self-supervised summarization
  • Using the Information Bottleneck principle for summarization
• Outcomes
  • Better results on automatic / human evaluation
  • Better results on a domain where sentence-summary pairs are not available
4. What is the task?
• Sentence summarization (compression)
  • Given a sentence, summarize it into a shorter one
  • Assumption: a good sentence summary contains the information related to the broader context while discarding less significant details
6. How to solve it?
• Unsupervised methods
  • Why unsupervised?
    • Sentence-summary pairs are not always available
• Current unsupervised methods use an autoencoder (AE) at their core
  • The source sentence should be accurately predicted from the summary
  • This goes against the fundamental goal of summarization
    • Summarization crucially needs to forget all but the "relevant" information
• Therefore, use the Information Bottleneck (IB) to discard irrelevant information
7. Example
• The next sentence is about control, so the summary should only contain information about control
• However, a summary produced by an autoencoder necessarily refers to the population in order to restore information about it
8. Methods
• BottleSum^Ex is based mainly on the principle of the Information Bottleneck (IB)
• BottleSum^Ex
  • Extractive unsupervised method
  • No training needed
  • Takes two consecutive sentences as input
• BottleSum^Self
  • Abstractive self-supervised method
  • Uses the output of BottleSum^Ex as training data
  • Takes a single sentence as input
9. What is the Information Bottleneck? (1/2)
• Given source S, external (relevance) variable Y, and summary S̃
• Learn a conditional distribution p(s̃|s) minimizing:
  I(S̃; S) − β·I(S̃; Y)
• β: coefficient balancing the two terms
• I: mutual information between two variables
  • I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
  • Given information about Y, how much does the ambiguity of X decrease?
  • See appendix
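As a concrete illustration of these definitions (my own toy example, not from the slides), the snippet below computes I(X; Y) for a small joint distribution over two binary variables and checks that it equals H(X) − H(X|Y):

# Toy illustration: mutual information of two binary variables.
import math

# joint distribution p(x, y) over X, Y in {0, 1}
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}
py = {y: sum(v for (_, yy), v in p.items() if yy == y) for y in (0, 1)}

# I(X; Y) directly from the joint and the marginals
I = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p.items())

# H(X) - H(X|Y): how much knowing Y reduces the ambiguity of X
H_x = -sum(v * math.log2(v) for v in px.values())
H_x_given_y = -sum(v * math.log2(v / py[y]) for (x, y), v in p.items())

print(I)                  # ~0.278 bits
print(H_x - H_x_given_y)  # same value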
10. What is the Information Bottleneck? (2/2)
• Minimizing (1) yields a good summary S̃ of the source S:
  I(S̃; S) − β·I(S̃; Y)    (1)
• First term: pruning term
  • If S̃ decreases the ambiguity of S, then I(S̃; S) increases
  • Ensures irrelevant information is discarded
• Second term: relevance term
  • If S̃ decreases the ambiguity of Y, then −β·I(S̃; Y) decreases
  • Ensures S̃ and Y share information
11. Why is IB better?
• Suppose S contains some information Z which is irrelevant to Y
• In IB:
  • Not including Z is better (see the sketch below)
  • If the summary includes Z, I(S̃; S) increases while I(S̃; Y) is unaffected
• In AE:
  • Including Z is better
  • A summary containing information about Z decreases the reconstruction loss
  • Reconstruction loss: reconstruct S′ from the summary S̃; the loss is the difference between S′ and S
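A minimal formal sketch of this argument (my own, under assumptions the slides leave implicit: Z is determined by S, and Z carries no information about Y or about the current summary S̃):

\begin{align*}
  I(\tilde{S}, Z;\, S) &= I(\tilde{S}; S) + I(Z; S \mid \tilde{S}) = I(\tilde{S}; S) + H(Z) \\
  I(\tilde{S}, Z;\, Y) &= I(\tilde{S}; Y) + I(Z; Y \mid \tilde{S}) = I(\tilde{S}; Y)
\end{align*}

Adding Z inflates the pruning term by H(Z) while leaving the relevance term unchanged, so the IB objective strictly prefers the summary without Z; an autoencoder's reconstruction loss rewards the opposite.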
13. Implementing IB
• Given sentence s, use the next sentence s_next as the relevance variable Y
• Define a deterministic function mapping sentence s to summary s̃
  • Therefore, p(s̃|s) = 1
• In this setting, minimizing (1) is equivalent to minimizing:
  I(s̃; s) − β·I(s̃; s_next) = −log p(s̃) − β₁ · p(s_next | s̃) · log p(s_next | s̃)
• Use a pretrained language model to estimate the distributions
  • GPT-2 was used
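A minimal sketch of how these two quantities can be scored with a pretrained GPT-2 (assuming the HuggingFace transformers library; an illustration, not the authors' implementation):

# Sketch: estimating -log p(s~) and p(s_next | s~) log p(s_next | s~) with GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text, condition=""):
    """Sum of token log-probabilities of `text`, optionally conditioned on a preceding `condition` string."""
    # GPT-2 has no separate BOS token in normal use; prepend <|endoftext|>
    # so that the first token of `text` is also scored.
    cond_ids = [tokenizer.bos_token_id] + tokenizer.encode(condition)
    text_ids = tokenizer.encode(text)
    ids = torch.tensor([cond_ids + text_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    total = 0.0
    for i in range(len(cond_ids), len(cond_ids) + len(text_ids)):
        # the token at position i is predicted by the logits at position i-1
        total += log_probs[0, i - 1, ids[0, i]].item()
    return total

def ib_objective(summary, s_next, beta1=1.0):
    """-log p(s~) - beta1 * p(s_next|s~) * log p(s_next|s~); lower is better."""
    pruning = -log_prob(summary)
    log_rel = log_prob(s_next, condition=summary)
    # exp() underflows for long sentences; the extractive search below only
    # ever compares log-probabilities, so it does not rely on this value.
    return pruning - beta1 * math.exp(log_rel) * log_rel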
14. Algorithm (1/8)
• Implements the extractive method
• Iteratively delete words or phrases from candidates, starting with the original sentence
• At each elimination step, only consider candidate deletions which decrease the value of the pruning term
• When expanding candidates, choose the few candidates with the highest relevance scores, to optimize the relevance term
15. Algorithm (2/8)
• Algorithm:
  • Input:
    • s: sentence
    • s_next: context (the next sentence)
  • Hyperparameters:
    • m: the number of words to delete at each step
    • k: the number of candidates to expand in the search
16. Algorithm (3/8)
• E.g.
  • s = Unsupervised methods use autoencoder as core of methods
  • m = 1
  • k = 3
17. Algorithm (4/8)
• s′ = Unsupervised methods use autoencoder as core of methods
• List the next candidates s″ by removing up to m words (l7~l9):
  methods use autoencoder as core of methods
  Unsupervised use autoencoder as core of methods
  Unsupervised methods autoencoder as core of methods
  …
  Unsupervised methods use autoencoder as core of
18. Algorithm (5/8)
• s′ = Unsupervised methods use autoencoder as core of methods
• List the next candidates s″ by removing up to m words (l7~l9):
  use autoencoder as core of methods
  Unsupervised autoencoder as core of methods
  Unsupervised methods as core of methods
  …
  Unsupervised methods use autoencoder as core
19. Algorithm (6/8)
• s′ = Unsupervised methods use autoencoder as core of methods
• List the next candidates s″ by removing up to m words (l7~l9):
  autoencoder as core of methods
  Unsupervised as core of methods
  Unsupervised methods core of methods
  …
  Unsupervised methods use autoencoder as
20. Algorithm (7/8)
• Discard bad candidates (l10~l11)
  • For every candidate, estimate p(s″)
  • If p(s′) < p(s″), then add s″ as a candidate
  • This procedure corresponds to decreasing the value of the pruning term
  methods use autoencoder as core of methods
  Unsupervised use autoencoder as core of methods
  Unsupervised methods use autoencoder as core methods
  Unsupervised autoencoder as core of methods
  Unsupervised methods as core of methods
  …
21. Algorithm (8/8)
• Choose the next s′ from the candidates (l4~l5)
  • Sort the candidates by p(s_next | s′) in descending order
  • Choose the top k candidates as the next s′
  • This procedure corresponds to decreasing the value of the relevance term (a code sketch of the full search follows below)
  Unsupervised methods use autoencoder as core methods
  Unsupervised use autoencoder as core of methods
  methods use autoencoder as core of methods
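The snippet below is a compact sketch of this search, reusing the log_prob helper from the GPT-2 sketch above. The candidate bookkeeping, the stopping condition, and the final selection rule are my simplifications (names such as bottlesum_ex, pool, and frontier are not from the paper):

def bottlesum_ex(sentence, s_next, m=1, k=3):
    """Iterative word deletion guided by p(s'') and p(s_next | s')."""
    words = sentence.split()
    frontier = [words]           # candidates chosen for expansion
    pool = {tuple(words)}        # every candidate kept so far
    while frontier:
        # relevance: expand only the top-k candidates by p(s_next | s') (l4~l5)
        frontier.sort(key=lambda w: log_prob(s_next, condition=" ".join(w)),
                      reverse=True)
        next_frontier = []
        for cand in frontier[:k]:
            base_lp = log_prob(" ".join(cand))
            # deletion: drop every contiguous span of up to m words (l7~l9)
            for length in range(1, m + 1):
                for start in range(len(cand) - length + 1):
                    shorter = cand[:start] + cand[start + length:]
                    key = tuple(shorter)
                    if not shorter or key in pool:
                        continue
                    # pruning: keep s'' only if p(s'') > p(s') (l10~l11)
                    if log_prob(" ".join(shorter)) > base_lp:
                        pool.add(key)
                        next_frontier.append(list(shorter))
        frontier = next_frontier
    # final choice (simplified): the candidate most predictive of the next sentence
    best = max(pool, key=lambda w: log_prob(s_next, condition=" ".join(w)))
    return " ".join(best)

With the running example sentence and m = 1, k = 3, this produces the same kind of candidate lists as shown on slides 16-21, though the exact candidates depend on the language model's probabilities.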
22. Note
• BottleSum^Ex does not train anything at all
• About β₁
  • The algorithm ensures that both the pruning term and the relevance term improve
  • Thus, the pruning term and the relevance term are never compared directly
  • Therefore, the choice of β₁ is less important
23. Abstractive: BottleSum^Self (1/2)
• Abstractive summarization method
• Trains a GPT-2 model for summarization
• Self-supervised learning
  • Uses BottleSum^Ex's output as training data
• Aims
  • Remove the restriction of extractiveness
  • Learn an explicit compression function that does not require a next sentence
24. Abstractive: BottleSum^Self (2/2)
• Fine-tune a GPT-2 model for summarization
• Train it as a language model
  • Input: [ sentence + "TL;DR:" + summary ]
  • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR: Hong Kong was once under British Rule.
• When generating a summary
  • Input: [ sentence + "TL;DR:" ]
  • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR:
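A minimal sketch of this input format and of generation at inference time (my own helper names, assuming the HuggingFace transformers API used above; the decoding settings are placeholders, not the paper's):

def make_training_example(sentence, extractive_summary):
    """One BottleSum^Self training string; the BottleSum^Ex output is the target."""
    return f"{sentence} TL;DR: {extractive_summary}"

def summarize(model, tokenizer, sentence, max_new_tokens=30):
    """Generate a summary by continuing the 'sentence TL;DR:' prompt."""
    ids = tokenizer.encode(f"{sentence} TL;DR:", return_tensors="pt")
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)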
26. Unsupervised Models
• BottleSum^Ex, BottleSum^Self
  • Use m = 1, k = 3
• Recon^Ex
  • Follows BottleSum^Ex, but replaces the next sentence with the source sentence itself
  • Probes the role of the next sentence
• SEQ³
  • Trained with an autoencoding objective paired with a topic loss and a language-model prior loss
  • Previously the highest unsupervised result on DUC
• PREFIX
  • The first 75 bytes of the source sentence
• INPUT
  • The full input sentence
27. Supervised Models
• ABS
  • Supervised SOTA result on the DUC-2003 dataset
• Result of Li et al.
  • Supervised SOTA result on the DUC-2004 dataset
28. Datasets & Evaluation method
• Evaluate models on three datasets
  • DUC-2003 and DUC-2004 datasets
    • Automatic ROUGE metrics
    • Sentence-summary pairs are available
  • CNN corpus
    • Summaries are not available
    • Human evaluation
      • Compare two summaries from different models over 3 attributes
      • Coherence, conciseness, and agreement with the input
29. DUC-2003, DUC-2004
• BottleSum^Ex achieves the highest R-1 and R-L scores among unsupervised methods on both datasets
• BottleSum^Self achieves the second-highest scores
• R-2 scores for BottleSum^Ex are lower than the baselines'
  • Possibly due to a lack of fluency
30. CNN corpus
• Use the CNN corpus
  • Only sentences are available
• SEQ³ is trained on the CNN corpus
• ABS is not retrained
  • Uses the model as originally trained on the Gigaword sentence dataset
• Attribute scores are the average of three annotators' scores
• Attribute scores are given on a scale of 1 (better), 0 (equal), and -1 (worse)
31. CNN corpus
• BottleSum^Ex and BottleSum^Self show stronger performance
• The BottleSum^Self score is better than BottleSum^Ex's
  • A combination of abstractiveness and learning a cohesive underlying model of summarization enables summaries that humans find more favorable
32. Model Comparison
• ABS requires learning on a large supervised training set
  • Poor out-of-domain performance
• SEQ³ is unsupervised, but still needs extensive training on a large corpus of in-domain text
• BottleSum^Ex requires neither
34. Conclusion
• Methods
  • Unsupervised summarization
  • Self-supervised summarization
  • Using the Information Bottleneck principle for summarization
• Outcomes
  • Better results on automatic / human evaluation
  • Better results on a domain where sentence-summary pairs are not available