2. Summary
• Methods
  • Unsupervised summarization
  • Self-supervised summarization
  • Using the Information Bottleneck principle for summarization
• Outcomes
  • Better results on automatic / human evaluation
  • Better results on a domain where sentence-summary pairs are not available
4. What is the task?
• Sentence summarization (compression)
  • Given a sentence, summarize it into a shorter one
  • Assumption: a good sentence summary contains the information related to the broader context while discarding less significant details
6. How to solve it?
• Unsupervised methods
  • Why unsupervised?
    • Sentence-summary pairs are not always available
• Current unsupervised methods use an autoencoder (AE) at their core
  • The source sentence should be accurately predicted from the summary
  • This goes against the fundamental goal of summarization
    • Summarization crucially needs to forget all but the "relevant" information
• Therefore, use the Information Bottleneck (IB) to discard irrelevant information
7. Example
• The next sentence is about control, so the summary should only contain information about control
• However, a summary produced by an autoencoder necessarily refers to the population in order to restore information about it
8. Methods
• BottleSum^Ex is based mainly on the principle of the Information Bottleneck (IB)
• BottleSum^Ex
  • Extractive unsupervised method
  • No training needed
  • Takes two consecutive sentences as input
• BottleSum^Self
  • Abstractive self-supervised method
  • Uses the output of BottleSum^Ex as training data
  • Takes a single sentence as input
9. What is the Information Bottleneck? (1/2)
• Given source S, external (relevance) variable Y, and summary S̃
• Learn a conditional distribution p(s̃|s) minimizing:
  I(S̃; S) − β·I(S̃; Y)
• β: coefficient balancing the two terms
• I: mutual information between two variables
  • I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
  • Given information about Y, how much does the ambiguity of X decrease?
  • See appendix
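As a concrete illustration of these definitions (my own toy example, not from the slides), the snippet below computes I(X; Y) for a small joint distribution over two binary variables and checks that it equals H(X) − H(X|Y):

# Toy illustration: mutual information of two binary variables.
import math

# joint distribution p(x, y) over X, Y in {0, 1}
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

px = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}
py = {y: sum(v for (_, yy), v in p.items() if yy == y) for y in (0, 1)}

# I(X; Y) directly from the joint and the marginals
I = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p.items())

# H(X) - H(X|Y): how much knowing Y reduces the ambiguity of X
H_x = -sum(v * math.log2(v) for v in px.values())
H_x_given_y = -sum(v * math.log2(v / py[y]) for (x, y), v in p.items())

print(I)                  # ~0.278 bits
print(H_x - H_x_given_y)  # same value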
10. What is the Information Bottleneck? (2/2)
• Minimizing (1) yields a good summary S̃ of the source S:
  I(S̃; S) − β·I(S̃; Y)    (1)
• First term: pruning term
  • If S̃ decreases the ambiguity of S, then I(S̃; S) increases
  • Ensures irrelevant information is discarded
• Second term: relevance term
  • If S̃ decreases the ambiguity of Y, then −β·I(S̃; Y) decreases
  • Ensures S̃ and Y share information
11. Why is IB better?
• Suppose S contains some information Z which is irrelevant to Y
• In IB:
  • Not including Z is better (see the sketch below)
  • If the summary includes Z, I(S̃; S) increases while I(S̃; Y) is unaffected
• In AE:
  • Including Z is better
  • A summary containing information about Z decreases the reconstruction loss
  • Reconstruction loss: reconstruct S′ from the summary S̃; the loss is the difference between S′ and S
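A minimal formal sketch of this argument (my own, under assumptions the slides leave implicit: Z is determined by S, and Z carries no information about Y or about the current summary S̃):

\begin{align*}
  I(\tilde{S}, Z;\, S) &= I(\tilde{S}; S) + I(Z; S \mid \tilde{S}) = I(\tilde{S}; S) + H(Z) \\
  I(\tilde{S}, Z;\, Y) &= I(\tilde{S}; Y) + I(Z; Y \mid \tilde{S}) = I(\tilde{S}; Y)
\end{align*}

Adding Z inflates the pruning term by H(Z) while leaving the relevance term unchanged, so the IB objective strictly prefers the summary without Z; an autoencoder's reconstruction loss rewards the opposite.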
13. Implementing IB
• Given sentence s, use the next sentence s_next as the relevance variable Y
• Define a deterministic function mapping sentence s to summary s̃
  • Therefore, p(s̃|s) = 1
• In this setting, minimizing (1) is equivalent to minimizing:
  I(s̃; s) − β·I(s̃; s_next) = −log p(s̃) − β₁ · p(s_next | s̃) · log p(s_next | s̃)
• Use a pretrained language model to estimate the distributions
  • GPT-2 was used
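A minimal sketch of how these two quantities can be scored with a pretrained GPT-2 (assuming the HuggingFace transformers library; an illustration, not the authors' implementation):

# Sketch: estimating -log p(s~) and p(s_next | s~) log p(s_next | s~) with GPT-2.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text, condition=""):
    """Sum of token log-probabilities of `text`, optionally conditioned on a preceding `condition` string."""
    # GPT-2 has no separate BOS token in normal use; prepend <|endoftext|>
    # so that the first token of `text` is also scored.
    cond_ids = [tokenizer.bos_token_id] + tokenizer.encode(condition)
    text_ids = tokenizer.encode(text)
    ids = torch.tensor([cond_ids + text_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    total = 0.0
    for i in range(len(cond_ids), len(cond_ids) + len(text_ids)):
        # the token at position i is predicted by the logits at position i-1
        total += log_probs[0, i - 1, ids[0, i]].item()
    return total

def ib_objective(summary, s_next, beta1=1.0):
    """-log p(s~) - beta1 * p(s_next|s~) * log p(s_next|s~); lower is better."""
    pruning = -log_prob(summary)
    log_rel = log_prob(s_next, condition=summary)
    # exp() underflows for long sentences; the extractive search below only
    # ever compares log-probabilities, so it does not rely on this value.
    return pruning - beta1 * math.exp(log_rel) * log_rel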
14. Algorithm (1/8)
• Implements the extractive method
• Iteratively delete words or phrases from candidates, starting with the original sentence
• At each elimination step, only consider candidate deletions which decrease the value of the pruning term
• When expanding candidates, choose the few candidates with the highest relevance scores, to optimize the relevance term
15. Algorithm (2/8)
• Algorithm:
  • Input:
    • s: sentence
    • s_next: context (the next sentence)
  • Hyperparameters:
    • m: the number of words to delete at each step
    • k: the number of candidates to expand in the search
16. Algorithm (3/8)
• E.g.
  • s = Unsupervised methods use autoencoder as core of methods
  • m = 1
  • k = 3
17. Algorithm (4/8)
• s′ = Unsupervised methods use autoencoder as core of methods
• List the next candidates s″ by removing up to m words (l7~l9):
  methods use autoencoder as core of methods
  Unsupervised use autoencoder as core of methods
  Unsupervised methods autoencoder as core of methods
  …
  Unsupervised methods use autoencoder as core of
18. Algorithm (5/8)
• s′ = Unsupervised methods use autoencoder as core of methods
• List the next candidates s″ by removing up to m words (l7~l9):
  use autoencoder as core of methods
  Unsupervised autoencoder as core of methods
  Unsupervised methods as core of methods
  …
  Unsupervised methods use autoencoder as core
19. Algorithm (6/8)
• s′ = Unsupervised methods use autoencoder as core of methods
• List the next candidates s″ by removing up to m words (l7~l9):
  autoencoder as core of methods
  Unsupervised as core of methods
  Unsupervised methods core of methods
  …
  Unsupervised methods use autoencoder as
20. Algorithm (7/8)
• Discard bad candidates (l10~l11)
  • For every candidate, estimate p(s″)
  • If p(s′) < p(s″), then add s″ as a candidate
  • This procedure corresponds to decreasing the value of the pruning term
  methods use autoencoder as core of methods
  Unsupervised use autoencoder as core of methods
  Unsupervised methods use autoencoder as core methods
  Unsupervised autoencoder as core of methods
  Unsupervised methods as core of methods
  …
21. Algorithm (8/8)
• Choose the next s′ from the candidates (l4~l5)
  • Sort the candidates by p(s_next | s′) in descending order
  • Choose the top k candidates as the next s′
  • This procedure corresponds to decreasing the value of the relevance term (a code sketch of the full search follows below)
  Unsupervised methods use autoencoder as core methods
  Unsupervised use autoencoder as core of methods
  methods use autoencoder as core of methods
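The snippet below is a compact sketch of this search, reusing the log_prob helper from the GPT-2 sketch above. The candidate bookkeeping, the stopping condition, and the final selection rule are my simplifications (names such as bottlesum_ex, pool, and frontier are not from the paper):

def bottlesum_ex(sentence, s_next, m=1, k=3):
    """Iterative word deletion guided by p(s'') and p(s_next | s')."""
    words = sentence.split()
    frontier = [words]           # candidates chosen for expansion
    pool = {tuple(words)}        # every candidate kept so far
    while frontier:
        # relevance: expand only the top-k candidates by p(s_next | s') (l4~l5)
        frontier.sort(key=lambda w: log_prob(s_next, condition=" ".join(w)),
                      reverse=True)
        next_frontier = []
        for cand in frontier[:k]:
            base_lp = log_prob(" ".join(cand))
            # deletion: drop every contiguous span of up to m words (l7~l9)
            for length in range(1, m + 1):
                for start in range(len(cand) - length + 1):
                    shorter = cand[:start] + cand[start + length:]
                    key = tuple(shorter)
                    if not shorter or key in pool:
                        continue
                    # pruning: keep s'' only if p(s'') > p(s') (l10~l11)
                    if log_prob(" ".join(shorter)) > base_lp:
                        pool.add(key)
                        next_frontier.append(list(shorter))
        frontier = next_frontier
    # final choice (simplified): the candidate most predictive of the next sentence
    best = max(pool, key=lambda w: log_prob(s_next, condition=" ".join(w)))
    return " ".join(best)

With the running example sentence and m = 1, k = 3, this produces the same kind of candidate lists as shown on slides 16-21, though the exact candidates depend on the language model's probabilities.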
22. Note
• BottleSum^Ex does not train anything at all
• About β₁
  • The algorithm ensures that both the pruning term and the relevance term improve
  • Thus, the pruning term and the relevance term are never compared directly
  • Therefore, the choice of β₁ is less important
23. Abstractive: BottleSum^Self (1/2)
• Abstractive summarization method
• Trains a GPT-2 model for summarization
• Self-supervised learning
  • Uses BottleSum^Ex's output as training data
• Aims
  • Remove the restriction of extractiveness
  • Learn an explicit compression function that does not require a next sentence
24. Abstractive: BottleSum^Self (2/2)
• Fine-tune a GPT-2 model for summarization
• Train it as a language model
  • Input: [ sentence + "TL;DR:" + summary ]
  • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR: Hong Kong was once under British Rule.
• When generating a summary
  • Input: [ sentence + "TL;DR:" ]
  • E.g. Hong Kong, a bustling metropolis with a population over 7 million, was once under British Rule. TL;DR:
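A minimal sketch of this input format and of generation at inference time (my own helper names, assuming the HuggingFace transformers API used above; the decoding settings are placeholders, not the paper's):

def make_training_example(sentence, extractive_summary):
    """One BottleSum^Self training string; the BottleSum^Ex output is the target."""
    return f"{sentence} TL;DR: {extractive_summary}"

def summarize(model, tokenizer, sentence, max_new_tokens=30):
    """Generate a summary by continuing the 'sentence TL;DR:' prompt."""
    ids = tokenizer.encode(f"{sentence} TL;DR:", return_tensors="pt")
    out = model.generate(ids, max_new_tokens=max_new_tokens,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)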
26. Unsupervised Models
• BottleSum^Ex, BottleSum^Self
  • Use m = 1, k = 3
• Recon^Ex
  • Follows BottleSum^Ex, but replaces the next sentence with the source sentence itself
  • Probes the role of the next sentence
• SEQ³
  • Trained with an autoencoding objective paired with a topic loss and a language-model prior loss
  • Previously the highest unsupervised result on DUC
• PREFIX
  • The first 75 bytes of the source sentence
• INPUT
  • The full input sentence
27. Supervised Models
• ABS
  • Supervised SOTA result on the DUC-2003 dataset
• Result of Li et al.
  • Supervised SOTA result on the DUC-2004 dataset
28. Datasets & Evaluation method
• Evaluate models on three datasets
  • DUC-2003 and DUC-2004 datasets
    • Automatic ROUGE metrics
    • Sentence-summary pairs are available
  • CNN corpus
    • Summaries are not available
    • Human evaluation
      • Compare two summaries from different models over 3 attributes
      • Coherence, conciseness, and agreement with the input
29. DUC-2003, DUC-2004
• BottleSum^Ex achieves the highest R-1 and R-L scores among unsupervised methods on both datasets
• BottleSum^Self achieves the second-highest scores
• R-2 scores for BottleSum^Ex are lower than the baselines'
  • Possibly due to a lack of fluency
30. CNN corpus
• Use the CNN corpus
  • Only sentences are available
• SEQ³ is trained on the CNN corpus
• ABS is not retrained
  • Uses the model as originally trained on the Gigaword sentence dataset
• Attribute scores are the average of three annotators' scores
• Attribute scores are given on a scale of 1 (better), 0 (equal), and -1 (worse)
31. CNN corpus
• BottleSum^Ex and BottleSum^Self show stronger performance
• The BottleSum^Self score is better than BottleSum^Ex's
  • A combination of abstractiveness and learning a cohesive underlying model of summarization enables summaries that humans find more favorable
32. Model Comparison
• ABS requires learning on a large supervised training set
  • Poor out-of-domain performance
• SEQ³ is unsupervised, but still needs extensive training on a large corpus of in-domain text
• BottleSum^Ex requires neither
34. Conclusion
• Methods
  • Unsupervised summarization
  • Self-supervised summarization
  • Using the Information Bottleneck principle for summarization
• Outcomes
  • Better results on automatic / human evaluation
  • Better results on a domain where sentence-summary pairs are not available