A short talk on NLG/TG. In recent years, NLG has been widely applied in many fields. We will start with a brief introduction to NLG methods, including pipeline approaches, autoregressive models (Seq2seq, Transformer, GAN-based), and non-autoregressive models. We will also give an overview of NLG tasks/scenarios and the issues they raise. As Bengio noted at the AAAI 2020 conference, combining neural and symbolic methods is a powerful paradigm for future AI, so we will also introduce how knowledge can guide NLG.
14. Language Modeling
• Language Modeling: P(y_t | y_1, y_2, …, y_{t−1})
• The task of predicting the next word, given the words so far
• A system that produces this probability distribution is called a Language Model
• RNNs for language modeling
15. Conditional Language Modeling
• Conditional Language Modeling: P(y_t | y_1, y_2, …, y_{t−1}, x)
• The task of predicting the next word, given the words so far, and also some other input x
• x: input/source
• y: output/target sequence
[Xie et al., 2017, Neural Text Generation: A Practical Guide]
16. RNNs for Language Modeling
[Diagram: an RNN language model unrolled over the words "never too late to", predicting the next word (e.g. "code", "read")]
• one-hot vectors: w_t ∈ ℝ^|V|
• word embeddings: x_t = E w_t
• hidden states: h_t = tanh(W_e x_t + W_h h_{t−1} + b_1)
• output distribution: ŷ_t = softmax(U h_t + b_2) ∈ ℝ^|V|
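To make these equations concrete, here is a minimal NumPy sketch of a single step of this RNN language model. The parameter names (E, W_e, W_h, U, b1, b2) mirror the notation above; the shapes and the function interface are assumptions for illustration.

```python
import numpy as np

def rnn_lm_step(w_t, h_prev, E, W_e, W_h, U, b1, b2):
    """One step of a vanilla RNN language model, following the slide's equations.
    w_t: one-hot word vector (|V|,); h_prev: previous hidden state (d_h,)."""
    x_t = E @ w_t                                  # word embedding: x_t = E w_t
    h_t = np.tanh(W_e @ x_t + W_h @ h_prev + b1)   # hidden state update
    logits = U @ h_t + b2
    y_hat = np.exp(logits - logits.max())          # numerically stable softmax
    y_hat /= y_hat.sum()                           # distribution over the vocabulary
    return h_t, y_hat
```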
19. Greedy Decoding
• Generate the target sentence by taking the argmax on each step of the decoder
• ŷ_t = argmax_{y_t} P(y_t | y_1, …, y_{t−1}, x)
• Due to the lack of backtracking, the output can be poor (e.g. ungrammatical, unnatural, nonsensical); a minimal sketch follows the diagram below
[Diagram: a decoder RNN (2Seq) unrolled from <START>, taking the argmax at each step to produce "I can get better performance <END>"]
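A minimal sketch of the greedy loop just described. `step_fn(prev_token, state)` is an assumed decoder interface returning the next-word distribution (a NumPy array) and the updated decoder state; `start_id` and `end_id` stand in for the <START> and <END> tokens.

```python
def greedy_decode(step_fn, init_state, start_id, end_id, max_len=50):
    """Greedy decoding: take the argmax token at every step, with no backtracking."""
    tokens, state, prev = [], init_state, start_id
    for _ in range(max_len):
        probs, state = step_fn(prev, state)
        prev = int(probs.argmax())        # argmax_{y_t} P(y_t | y_1, ..., y_{t-1}, x)
        if prev == end_id:
            break
        tokens.append(prev)
    return tokens
```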
20. Beam Search Decoding
• We want to find y that maximizes
• P(y | x) = P(y_1 | x) P(y_2 | y_1, x) … P(y_T | y_1, …, y_{T−1}, x)
• Find a high-probability sequence
• Beam search
• Core idea:
• On each step of decoder, keep track of the 𝑘 most probable partial
sequences
• After you reach some stopping criterion, choose the sequence with
the highest probability
• Not necessarily the optimal sequence
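A minimal sketch of beam search under the same assumed `step_fn` interface as the greedy sketch above. Real implementations add length normalization, batching, and per-hypothesis state handling; this version only illustrates the core idea of keeping the k most probable partial sequences.

```python
import math

def beam_search(step_fn, init_state, start_id, end_id, k=4, max_len=50):
    """Keep the k most probable partial sequences at each decoder step."""
    beams = [([start_id], 0.0, init_state, False)]   # (tokens, log P, state, finished)
    for _ in range(max_len):
        candidates = []
        for tokens, logp, state, finished in beams:
            if finished:
                candidates.append((tokens, logp, state, True))
                continue
            probs, new_state = step_fn(tokens[-1], state)
            for tok in probs.argsort()[-k:]:          # k best extensions of this hypothesis
                tok = int(tok)
                candidates.append((tokens + [tok],
                                   logp + math.log(probs[tok] + 1e-12),
                                   new_state, tok == end_id))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(finished for *_, finished in beams):   # stop once every beam has ended
            break
    return max(beams, key=lambda c: c[1])[0]          # highest-probability sequence found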
22. What’s the Effect of Changing Beam Size k?
• Small 𝑘 has similar problems to greedy decoding (𝑘=1)
• Ungrammatical, unnatural, nonsensical, incorrect
• Larger 𝑘 means you consider more hypotheses
• Increasing 𝑘 reduces some of the problems above
• Larger 𝑘 is more computationally expensive
• But increasing 𝑘 can introduce other problems
• For neural machine translation (NMT): Increasing 𝑘 too much
decreases BLEU score (Tu et al., Koehn et al.)
• In tasks like chit-chat dialogue: large 𝑘 can make the output more generic (see next slide)
[Tu et al., 2017, Neural Machine Translation with Reconstruction]
[Koehn et al., 2017, Six Challenges for Neural Machine Translation]
26. Decoding
• Sampling-based decoding
• Pure sampling
• On each step t, randomly sample from the probability distribution P_t to obtain your next word
• Top-n sampling
• On each step t, randomly sample from P_t, restricted to just the top-n most probable words (see the sketch after this list)
• n = 1 is greedy search, n = |V| is pure sampling
• Increase n to get more diverse/risky output
• Decrease n to get more generic/safe output
• Both of these are more efficient than Beam search
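A minimal sketch of top-n sampling over a next-word distribution `probs` (assumed to be a NumPy array over the vocabulary):

```python
import numpy as np

def top_n_sample(probs, n, rng=None):
    """Top-n sampling: sample the next word from only the n most probable words.
    n = 1 recovers greedy decoding; n = |V| recovers pure sampling."""
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[-n:]           # indices of the n highest-probability words
    p = probs[top] / probs[top].sum()      # renormalise over the truncated support
    return int(rng.choice(top, p=p))
```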
27. Decoding
• In summary
• Greedy decoding
• A simple method
• Gives low quality output
• Beam search
• Delivers better quality than greedy
• If the beam size is too high, it can return unsuitable output (e.g. generic, short)
• Sampling methods
• Get more diversity and randomness
• Good for open-ended / creative generation (poetry, stories)
• Top-n sampling allows you to control diversity
28. Seq2seq
[Diagram: an encoder RNN (Seq) reads the source sentence; its final hidden state initializes a decoder RNN (2Seq), which generates "I can get better performance <END>" token by token from <START>. Only neural nets are used.]
29. Seq2seq
• Pipeline methods have some problems
• It’s hard to propagate the supervision signal to each
component
• Once the task is changed, all components need to be
trained from scratch
• Sequence-to-sequence model
• P(y | x) = P(y_1 | x) P(y_2 | y_1, x) … P(y_T | y_1, …, y_{T−1}, x)
• It can be easily modeled by a single neural network and trained in an
end-to-end fashion (differentiable)
• The seq2seq model is an example of a conditional language
model
• Encoder RNN produces a representation of the source
sentence
• Decoder RNN is a language model that generates target
sentence conditioned on encoding
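As a rough illustration of this end-to-end, differentiable setup, here is a minimal PyTorch sketch of an encoder-decoder (GRU-based, no attention). The layer sizes and names are placeholders rather than anything specified on the slides.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """A minimal GRU-based encoder-decoder sketch (no attention)."""
    def __init__(self, vocab_size=10000, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.emb(src))               # one encoding per source sentence
        dec_out, _ = self.decoder(self.emb(tgt_in), h)   # decoder conditioned on the encoding
        return self.out(dec_out)                          # logits for P(y_t | y_1..y_{t-1}, x)
```

Training simply applies a cross-entropy loss to these logits against the gold target tokens, so the encoder and decoder are optimized jointly (end-to-end).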
30. Attention
• Seq2seq: the bottleneck problem
• The single vector of source sentence encoding needs to
capture all information about the source sentence.
• The single vector limits the representation capacity of the
encoder, which is the information bottleneck.
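The usual remedy is to let the decoder attend over all encoder hidden states at every step instead of relying on one fixed vector. Below is a minimal dot-product attention sketch; the function name and shapes are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def dot_product_attention(dec_hidden, enc_hiddens):
    """dec_hidden: (d,) current decoder state; enc_hiddens: (src_len, d) encoder states.
    Re-weights the source at every decoding step, so no single fixed-size vector
    has to hold all information about the source sentence."""
    scores = enc_hiddens @ dec_hidden          # one attention score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # attention distribution (softmax)
    context = weights @ enc_hiddens            # weighted sum of encoder states
    return context, weights
```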
38. GAN-based
• Solution:
• Gumbel-softmax (see the sketch after this slide)
• Wasserstein GAN (WGAN)
• Replace the JS divergence of the original GAN objective with the Wasserstein distance
• WGAN-GP
• Add a gradient penalty (GP) to WGAN
[Gulrajani et al., 2017, Improved Training of Wasserstein GANs]
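For reference, a minimal NumPy illustration of the Gumbel-softmax relaxation mentioned above: it replaces a hard, non-differentiable token sample with a "soft" one-hot vector, which is what lets gradients flow from the discriminator back into the generator. In practice this runs inside an autodiff framework; this forward-pass sketch only shows the computation.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Gumbel-softmax sample over a vocabulary given unnormalised logits.
    As tau -> 0 the output approaches a hard (one-hot) sample."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise
    z = (logits + g) / tau
    y = np.exp(z - z.max())                    # numerically stable softmax
    return y / y.sum()                         # soft one-hot vector
```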
39. GAN-based
• Problem:
• Discrete outputs (mentioned)
• The discriminative model can only assess a complete
sequence
• For a partially generated sequence, it is non-trivial to balance its current score against the score it will receive once the entire sequence has been generated
• Solution:
• SeqGAN
• RL+GAN
[Yu et al., 2016, SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient]
45. Non-Autoregressive
• P(y | x) = P(m | x) ∏_{t=1}^{m} P(y_t | z, x)
• Decide the length m of the target sequence
• The encoder is the same as the autoregressive encoder
• Input z = f(x; θ_enc)
• Generate the target sequence in parallel
[Gu et al., 2018, Non-Autoregressive Neural Machine Translation]
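A schematic sketch of the factorization above: first choose a target length from P(m | x), then predict every position in one parallel pass. The `position_probs_fn` hook and the plain per-position argmax are assumptions for illustration, not the exact procedure of Gu et al.

```python
import numpy as np

def non_autoregressive_decode(length_probs, position_probs_fn, z, x):
    """Sketch of P(y | x) = P(m | x) * prod_t P(y_t | z, x).
    length_probs[i] is read as P(m = i | x); position_probs_fn(z, x, m) is an
    assumed hook returning an (m, |V|) array of per-position distributions
    computed in a single parallel pass."""
    m = int(np.argmax(length_probs))            # most probable target length
    probs = position_probs_fn(z, x, m)          # all positions predicted at once
    return probs.argmax(axis=-1)                # per-position argmax; no left-to-right dependency
```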
52. Summarization: Text-to-text
• Neural summarization
• Seq2seq+attention
• Seq2seq+attention systems are good at writing fluent output,
but bad at copying over details
• Copy mechanisms
• Copy mechanisms use attention to enable a seq2seq system to easily copy words and phrases from the input to the output (a minimal sketch follows the reference list below)
• There are several papers proposing copy mechanism variants:
• Language as a Latent Variable: Discrete Generative Models for
Sentence Compression, Miao et al, 2016
• Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, Nallapati et al., 2016
• Incorporating Copying Mechanism in Sequence-to-Sequence Learning,
Gu et al, 2016
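A minimal sketch of one common copy-mechanism formulation, in the spirit of the papers above (closest to a pointer-generator style mixture). The exact equations differ between the cited papers, so the names and the assumption that source tokens are in the vocabulary are illustrative only.

```python
import numpy as np

def copy_mixture(p_gen, vocab_probs, attn_weights, src_ids):
    """Final word distribution: P(w) = p_gen * P_vocab(w)
    + (1 - p_gen) * (attention mass on source positions that contain w)."""
    final = p_gen * np.asarray(vocab_probs, dtype=float)
    for pos, tok in enumerate(src_ids):
        final[tok] += (1.0 - p_gen) * attn_weights[pos]
    return final                                # still sums to 1 if the inputs do
```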
57. Storytelling
• Use the given data to generate text
• (Image-to-Text) Generate a story-like paragraph, given an image
• (Prompt-to-Text) Generate a story, given a brief writing prompt
• (Event-to-Text) Generate a story, given events, entities, state,
etc.
• (Long Document-to-Text) Generate the next sentence of a story, given the story so far (story continuation)
60. Storytelling: Prompt-to-Text
• The results are impressive!
• Related to prompt
• Diverse; non-generic
• Stylistically dramatic
• However:
• Mostly atmospheric/descriptive/scene-setting; less events/plot
• When generating longer outputs, it mostly stays on the same idea without moving forward to new ideas (coherence issues)
68. Poetry Generation: Text-to-Text
• Hafez: a poetry generation system
• User provides topic word
• Get a set of words related to topic
• Identify rhyming topical words.
• These will be the ends of each line
• Generate the poem using RNN-LM
constrained by FSA
• The RNN-LM runs backwards (right-to-left). This is necessary because the last word of each line is fixed.
[Ghazvininejad et al., 2016, Generating Topical Poetry]
81. Knowledge-based
• Encoder-Decoder framework
• Encoder
• BiLSTM
• Graph Transformer for knowledge
graph
• Attention
• Multi-head attention on title
and graph
• Decoder
• Copy vertex from the graph
(copy mechanism) or generate
from vocabulary
• Probability: p × α_copy + (1 − p) × α_vocab
[Koncel-Kedziorski et al., 2019, Text Generation from Knowledge Graphs]