A short talk on NLG/TG. In recent years, NLG has been widely applied in many fields. We will start with a brief introduction to NLG methods, including pipeline approaches, autoregressive models (Seq2seq, Transformer, GAN-based), and non-autoregressive models. We will also give an overview of NLG tasks/scenarios and the issues they raise. As Bengio noted at the AAAI 2020 conference, combining neural and symbolic methods is a powerful paradigm for future AI, so we will also introduce how knowledge can guide NLG.
14. Language Modeling
• Language Modeling: P(y_t | y_1, y_2, …, y_{t−1})
• The task of predicting the next word, given the words so far
• A system that produces this probability distribution is called a Language Model
• RNNs for language modeling
15. Conditional Language Modeling
• Conditional Language Modeling: P(y_t | y_1, y_2, …, y_{t−1}, x)
• The task of predicting the next word, given the words so far, and also some other input x
• x: input/source
• y: output/target sequence
[Xie et al., 2017, Neural Text Generation: A Practical Guide]
16. RNNs for Language Modeling
[Diagram: an RNN language model unrolled over the words "never too late to", predicting the next word (e.g. "code", "read")]
• one-hot vectors: w_t ∈ ℝ^|V|
• word embeddings: x_t = E w_t
• hidden states: h_t = tanh(W_e x_t + W_h h_{t−1} + b_1)
• output distribution: ŷ_t = softmax(U h_t + b_2) ∈ ℝ^|V|
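To make these equations concrete, here is a minimal NumPy sketch of a single step of this RNN language model. The parameter names (E, W_e, W_h, U, b1, b2) mirror the notation above; the shapes and the function interface are assumptions for illustration.

```python
import numpy as np

def rnn_lm_step(w_t, h_prev, E, W_e, W_h, U, b1, b2):
    """One step of a vanilla RNN language model, following the slide's equations.
    w_t: one-hot word vector (|V|,); h_prev: previous hidden state (d_h,)."""
    x_t = E @ w_t                                  # word embedding: x_t = E w_t
    h_t = np.tanh(W_e @ x_t + W_h @ h_prev + b1)   # hidden state update
    logits = U @ h_t + b2
    y_hat = np.exp(logits - logits.max())          # numerically stable softmax
    y_hat /= y_hat.sum()                           # distribution over the vocabulary
    return h_t, y_hat
```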
19. Greedy Decoding
• Generate the target sentence by taking the argmax on each step of the decoder
• ŷ_t = argmax_{y_t} P(y_t | y_1, …, y_{t−1}, x)
• Due to the lack of backtracking, the output can be poor (e.g. ungrammatical, unnatural, nonsensical); a minimal sketch follows the diagram below
[Diagram: a decoder RNN (2Seq) unrolled from <START>, taking the argmax at each step to produce "I can get better performance <END>"]
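A minimal sketch of the greedy loop just described. `step_fn(prev_token, state)` is an assumed decoder interface returning the next-word distribution (a NumPy array) and the updated decoder state; `start_id` and `end_id` stand in for the <START> and <END> tokens.

```python
def greedy_decode(step_fn, init_state, start_id, end_id, max_len=50):
    """Greedy decoding: take the argmax token at every step, with no backtracking."""
    tokens, state, prev = [], init_state, start_id
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        prev = int(probs.argmax())        # argmax_{y_t} P(y_t | y_1, ..., y_{t-1}, x)
        if prev == end_id:
            break
        tokens.append(prev)
    return tokens
```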
20. Beam Search Decoding
• We want to find y that maximizes
• P(y | x) = P(y_1 | x) P(y_2 | y_1, x) … P(y_T | y_1, …, y_{T−1}, x)
• Find a high-probability sequence
• Beam search
• Core idea:
• On each step of decoder, keep track of the 𝑘 most probable partial
sequences
• After you reach some stopping criterion, choose the sequence with
the highest probability
• Not necessarily the optimal sequence
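A minimal sketch of beam search under the same assumed `step_fn` interface as the greedy sketch above. Real implementations add length normalization, batching, and per-hypothesis state handling; this version only illustrates the core idea of keeping the k most probable partial sequences.

```python
import math

def beam_search(step_fn, init_state, start_id, end_id, k=4, max_len=50):
    """Keep the k most probable partial sequences at each decoder step."""
    beams = [([start_id], 0.0, init_state, False)]   # (tokens, log P, state, finished)
    for _ in range(max_len):
        candidates = []
        for tokens, logp, state, finished in beams:
            if finished:
                candidates.append((tokens, logp, state, True))
                continue
            probs, new_state = step_fn(tokens[-1], state)
            for tok in probs.argsort()[-k:]:          # k best extensions of this hypothesis
                tok = int(tok)
                candidates.append((tokens + [tok],
                                   logp + math.log(probs[tok] + 1e-12),
                                   new_state, tok == end_id))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(finished for *_, finished in beams):   # stop once every beam has ended
            break
    return max(beams, key=lambda c: c[1])[0]          # highest-probability sequence found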
22. What’s the Effect of Changing Beam Size k?
• Small 𝑘 has similar problems to greedy decoding (𝑘=1)
• Ungrammatical, unnatural, nonsensical, incorrect
• Larger 𝑘 means you consider more hypotheses
• Increasing 𝑘 reduces some of the problems above
• Larger 𝑘 is more computationally expensive
• But increasing 𝑘 can introduce other problems
• For neural machine translation (NMT): Increasing 𝑘 too much
decreases BLEU score (Tu et al., Koehn et al.)
• In tasks like chit-chat dialogue: large 𝑘 can make the output more generic (see next slide)
[Tu et al., 2017, Neural Machine Translation with Reconstruction]
[Koehn et al., 2017, Six Challenges for Neural Machine Translation]
26. Decoding
• Sampling-based decoding
• Pure sampling
• On each step t, randomly sample from the probability distribution P_t to obtain your next word
• Top-n sampling
• On each step t, randomly sample from P_t, restricted to just the top-n most probable words (see the sketch after this list)
• n = 1 is greedy search, n = |V| is pure sampling
• Increase n to get more diverse/risky output
• Decrease n to get more generic/safe output
• Both of these are more efficient than Beam search
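A minimal sketch of top-n sampling over a next-word distribution `probs` (assumed to be a NumPy array over the vocabulary):

```python
import numpy as np

def top_n_sample(probs, n, rng=None):
    """Top-n sampling: sample the next word from only the n most probable words.
    n = 1 recovers greedy decoding; n = |V| recovers pure sampling."""
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[-n:]           # indices of the n highest-probability words
    p = probs[top] / probs[top].sum()      # renormalise over the truncated support
    return int(rng.choice(top, p=p))
```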
27. Decoding
• In summary
• Greedy decoding
• A simple method
• Gives low quality output
• Beam search
• Delivers better quality than greedy
• If the beam size is too high, it can return unsuitable output (e.g. generic, short)
• Sampling methods
• Get more diversity and randomness
• Good for open-ended / creative generation (poetry, stories)
• Top-n sampling allows you to control diversity
28. Seq2seq
[Diagram: an encoder RNN (Seq) reads the source sentence; its final hidden state initializes a decoder RNN (2Seq), which generates "I can get better performance <END>" token by token from <START>. Only neural nets are used.]
29. Seq2seq
• Pipeline methods have some problems
• It’s hard to propagate the supervision signal to each
component
• Once the task is changed, all components need to be
trained from scratch
• Sequence-to-sequence model
• P(y | x) = P(y_1 | x) P(y_2 | y_1, x) … P(y_T | y_1, …, y_{T−1}, x)
• It can be easily modeled by a single neural network and trained in an
end-to-end fashion (differentiable)
• The seq2seq model is an example of a conditional language
model
• Encoder RNN produces a representation of the source
sentence
• Decoder RNN is a language model that generates target
sentence conditioned on encoding
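As a rough illustration of this end-to-end, differentiable setup, here is a minimal PyTorch sketch of an encoder-decoder (GRU-based, no attention). The layer sizes and names are placeholders rather than anything specified on the slides.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """A minimal GRU-based encoder-decoder sketch (no attention)."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.emb(src))               # one encoding per source sentence
        dec_out, _ = self.decoder(self.emb(tgt_in), h)   # decoder conditioned on the encoding
        return self.out(dec_out)                          # logits for P(y_t | y_1..y_{t-1}, x)
```

Training simply applies a cross-entropy loss to these logits against the gold target tokens, so the encoder and decoder are optimized jointly (end-to-end).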
30. Attention
• Seq2seq: the bottleneck problem
• The single vector of source sentence encoding needs to
capture all information about the source sentence.
• The single vector limits the representation capacity of the
encoder, which is the information bottleneck.
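The usual remedy is to let the decoder attend over all encoder hidden states at every step instead of relying on one fixed vector. Below is a minimal dot-product attention sketch; the function name and shapes are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def dot_product_attention(dec_hidden, enc_hiddens):
    """dec_hidden: (d,) current decoder state; enc_hiddens: (src_len, d) encoder states.
    Re-weights the source at every decoding step, so no single fixed-size vector
    has to hold all information about the source sentence."""
    scores = enc_hiddens @ dec_hidden          # one attention score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # attention distribution (softmax)
    context = weights @ enc_hiddens            # weighted sum of encoder states
    return context, weights
```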
38. GAN-based
• Solution:
• Gumbel-softmax (see the sketch after this slide)
• Wasserstein GAN (WGAN)
• Replace the JS divergence of the original GAN objective with the Wasserstein distance
• WGAN-GP
• Add a gradient penalty (GP) to WGAN
[Gulrajani et al., 2017, Improved Training of Wasserstein GANs]
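For reference, a minimal NumPy illustration of the Gumbel-softmax relaxation mentioned above: it replaces a hard, non-differentiable token sample with a "soft" one-hot vector, which is what lets gradients flow from the discriminator back into the generator. In practice this runs inside an autodiff framework; this forward-pass sketch only shows the computation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Gumbel-softmax sample over a vocabulary given unnormalised logits.
    As tau -> 0 the output approaches a hard (one-hot) sample."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    z = (logits + g) / tau
    y = np.exp(z - z.max())                    # numerically stable softmax
    return y / y.sum()                         # soft one-hot vector
```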
39. GAN-based
• Problem:
• Discrete outputs (mentioned)
• The discriminative model can only assess a complete
sequence
• For a partially generated sequence, it is non-trivial to balance its current score against the score it will receive once the entire sequence has been generated
• Solution:
• SeqGAN
• RL+GAN
[Yu et al., 2016, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient]
45. Non-Autoregressive
• P(y | x) = P(m | x) ∏_{t=1}^{m} P(y_t | z, x)
• Decide the length m of the target sequence
• The encoder is the same as the autoregressive encoder
• Input z = f(x; θ_enc)
• Generate the target sequence in parallel
[Gu et al., 2018, Non-Autoregressive Neural Machine Translation]
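A schematic sketch of the factorization above: first choose a target length from P(m | x), then predict every position in one parallel pass. The `position_probs_fn` hook and the plain per-position argmax are assumptions for illustration, not the exact procedure of Gu et al.

```python
import numpy as np

def non_autoregressive_decode(length_probs, position_probs_fn, z, x):
    """Sketch of P(y | x) = P(m | x) * prod_t P(y_t | z, x).
    length_probs[i] is read as P(m = i | x); position_probs_fn(z, x, m) is an
    assumed hook returning an (m, |V|) array of per-position distributions
    computed in a single parallel pass."""
    m = int(np.argmax(length_probs))            # most probable target length
    probs = position_probs_fn(z, x, m)          # all positions predicted at once
    return probs.argmax(axis=-1)                # per-position argmax; no left-to-right dependency
```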
52. Summarization: Text-to-text
• Neural summarization
• Seq2seq+attention
• Seq2seq+attention systems are good at writing fluent output,
but bad at copying over details
• Copy mechanisms
• Copy mechanisms use attention to enable a seq2seq system to easily copy words and phrases from the input to the output (a minimal sketch follows the reference list below)
• There are several papers proposing copy mechanism variants:
• Language as a Latent Variable: Discrete Generative Models for
Sentence Compression, Miao et al, 2016
• Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, Nallapati et al., 2016
• Incorporating Copying Mechanism in Sequence-to-Sequence Learning,
Gu et al, 2016
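A minimal sketch of one common copy-mechanism formulation, in the spirit of the papers above (closest to a pointer-generator style mixture). The exact equations differ between the cited papers, so the names and the assumption that source tokens are in the vocabulary are illustrative only.

```python
import numpy as np

def copy_mixture(p_gen, vocab_probs, attn_weights, src_ids):
    """Final word distribution: P(w) = p_gen * P_vocab(w)
    + (1 - p_gen) * (attention mass on source positions that contain w)."""
    final = p_gen * np.asarray(vocab_probs, dtype=float)
    for pos, tok in enumerate(src_ids):
        final[tok] += (1.0 - p_gen) * attn_weights[pos]
    return final                                # still sums to 1 if the inputs do
```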
57. Storytelling
• Use the given data to generate text
• (Image-to-Text) Generate a story-like paragraph, given an image
• (Prompt-to-Text) Generate a story, given a brief writing prompt
• (Event-to-Text) Generate a story, given events, entities, state,
etc.
• (Long Document-to-Text) Generate the next sentence of a story, given the story so far (story continuation)
60. Storytelling: Prompt-to-Text
• The results are impressive!
• Related to prompt
• Diverse; non-generic
• Stylistically dramatic
• However:
• Mostly atmospheric/descriptive/scene-setting; less events/plot
• When generating longer outputs, it mostly stays on the same idea without moving forward to new ideas (coherence issues)
68. Poetry Generation: Text-to-Text
• Hafez: a poetry generation system
• User provides topic word
• Get a set of words related to topic
• Identify rhyming topical words.
• These will be the ends of each line
• Generate the poem using RNN-LM
constrained by FSA
• The RNN-LM runs backwards (right-to-left). This is necessary because the last word of each line is fixed.
[Ghazvininejad et al., 2016, Generating Topical Poetry]
81. Knowledge-based
• Encoder-Decoder framework
• Encoder
• BiLSTM
• Graph Transformer for knowledge
graph
• Attention
• Multi-head attention on title
and graph
• Decoder
• Copy vertex from the graph
(copy mechanism) or generate
from vocabulary
• Probability: p × α_copy + (1 − p) × α_vocab
[Koncel-Kedziorski et al., 2019, Text Generation from Knowledge Graphs]